Heterogeneous graphs versus multimodal content: modeling, mining, and analysis of social network data
(USC Thesis Other)
HETEROGENEOUS GRAPHS VERSUS MULTIMODAL CONTENT: MODELING, MINING, AND ANALYSIS OF SOCIAL NETWORK DATA

by Charalampos Chelmis

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2013

Copyright 2013 Charalampos Chelmis

Dedication

To my wife for her unconditional love. To my sister for believing in me. To my parents for their tremendous amount of support and constant encouragement throughout my life.

Acknowledgements

I would like to express my deep and sincere gratitude to my advisor, Dr. Viktor K. Prasanna, for his support and patience, for encouraging me to try harder, and for believing in me and my research ideas. Besides my advisor, I would like to thank the rest of my thesis committee: Dr. Paul Bogdan and Dr. Aiichiro Nakano.

During this work I have collaborated with multiple colleagues whom I wish to acknowledge. I would like to thank Dr. Amol Bakshi for his valuable feedback and help. Many thanks to Dr. Karthik Gomadam for our early discussions towards shaping my thesis proposal, and to Dr. Vikram Sorathia for his contributions and for bearing with my stubbornness. I would also like to thank Dr. Dennis McLeod and Dr. Kristina Lerman for their constructive feedback. Finally, I wish to extend my warmest thanks to Aimee Barnard, Juli Legat, Lizsl De Leon, and Janice Thompson for taking care of administrative work for me.

My deepest gratitude to my family for their love and affection, their continuing support and unlimited understanding. I wish to thank my parents, Sofia Ellinikaki and Marinos Chelmis, and my sister Niki Chelmi. Many thanks to my family friends for being so proud of me, and to my parents-in-law and my brother-in-law for entrusting me with Daphney. I would especially like to thank my wife Daphney-Stavroula Zois for being there for me in sickness and in health.
Last but not least, I would like to mention our cat Mushroom for the serenity that he has offered me in the most stressful times.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 Motivation
1.2 Motivating Scenarios
1.2.1 Information Overload and Targeted Microblogging
1.2.2 Complex Query Answering
1.3 State-of-the-art
1.3.1 Graph Theoretic Social Networking Analysis
1.3.2 Data Mining and Data Analytics in Social Networks
1.3.3 Semantic Social Networking Analysis
1.4 Thesis Statement
1.5 Research Contributions and Thesis Outline
Chapter 2: Preliminaries
2.1 Definitions
2.2 Structure of Tripartite Graphs
2.3 Real–World Datasets
2.3.1 Corporate Microblogging
2.3.2 Formal Organizational Hierarchy
2.3.3 Last.fm
Chapter 3: Modeling Social Networking Data
3.1 Formal Representation of Social Networking Data
3.1.1 Social Layer
3.1.2 Content Layer
3.1.3 Semantic Layer
3.2 Mathematical Formulation
3.2.1 Social Algebraic Operators
3.3 rESONAtE: Semantic Social Network Analysis for the Enterprise
3.4 Case Study on Real-World Informal Communication Data
3.4.1 Multidimensional Expert Identification
3.4.2 Contextual Ego-Network Analysis and Community Detection
3.4.3 Trends Macro-Analysis
3.4.4 Communication Dynamics
3.5 Summary
Chapter 4: Informal Communication at the Workplace
4.1 An empirical analysis of microblogging behavior in the enterprise
4.1.1 Analysis of Network Structure
4.1.2 Content Analysis
4.2 Communication Behavior Analysis
4.2.1 What Makes Messages “Tick”?
4.2.2 User Contribution
4.3 Latent Homophily: Local vs. Global
4.3.1 Latent Mixing Patterns
4.3.2 Topical Homophily in Enterprise Microblogging
4.4 Summary
Chapter 5: Informal Interactions in the Presence of Formal Structure
5.1 The Effect of Formal Organization Hierarchy
5.1.1 Influence Score
5.2 Effect of Peer Pressure
5.2.1 Independent Cascade Model
5.2.2 Exponential Growth Model
5.3 Computational Models of Technology Adoption at the Workplace
5.3.1 Complex Contagion Model
5.3.2 Complex Cascade Model
5.4 Empirical Evaluation
5.4.1 Baselines
5.4.2 Experimental Results
5.5 Summary
Chapter 6: Predicting Communication Intention
6.1 Communication Intention Prediction Framework
6.1.1 User Representation
6.1.2 Semantic Similarity of Textual Features
6.1.3 Date Similarity
6.1.4 Time Similarity
6.1.5 Feature Set Similarity
6.1.6 Content Proximity
6.1.7 Network Proximity
6.1.8 User Similarity
6.2 Empirical Evaluation
6.2.1 Methods Comparison
6.2.2 Weight Scheme Selection
6.3 Summary
Chapter 7: Social Link Prediction in Online Social Tagging Systems
7.1 Generative Models of Tripartite Graphs for Recommendation in Social Media
7.1.1 The User-Resource-Concept Model
7.1.2 The User-Resource Model
7.1.3 The User-Concept Model
7.1.4 Parameter Estimation
7.2 Social Link Prediction with Hidden Topics (SLIgHT)
7.2.1 Threshold Selection
7.3 Social Link Prediction Using Latent Semantics and Network Structure
7.3.1 Latent Topics & Common Neighbors Scheme
7.3.2 Latent Topics & Shortest Distance Scheme
7.3.3 Latent Topics Classification Scheme
7.3.4 Ensemble Classification Scheme
7.3.5 Complexity Analysis
7.4 Models Evaluation
7.4.1 Sample Topics
7.4.2 User Focus Analysis
7.4.3 Predictive Power
7.5 Network Reconstruction & User Homophily
7.5.1 Users’ Homophily
7.6 Prediction of Social Ties
7.6.1 Comparison with other methods
7.7 Summary
Chapter 8: Future Work and Conclusion
References

List of Tables

2.1 High–level statistics of the post–reply network
2.2 Last.fm Dataset High level statistics
4.1 Power–law coefficient (α) estimates and corresponding Kolmogorov–Smirnov goodness–of–fit (D) metrics. For reference, we also provide estimates for previously studied online social networks [154].
4.2 Averages and fluctuations of user activities
4.3 Pearson correlation coefficients
6.1 Weighting schemes.
6.2 Prediction precision achieved by different metrics.
6.3 Prediction recall and MRR achieved by different metrics.
7.1 Notation
7.2 Symbols used in Complexity Analysis
7.3 Area under the ROC curve comparison for 10%, 25%, 50%, and 75% of edges observed

List of Figures

1.1 Users’ Interactions.
1.2 Bob’s Network
1.3 Taxonomy of Social Networking Analysis Approaches.
1.4 Centrality Measures.
1.5 Semantic Social Networks
2.1 Tripartite graph model of a social network.
2.2 We map posting/replying activity into a directed post–reply graph. Threaded discussion (a), represented as bipartite graph in (b), is collapsed to a directed graph in (c).
2.3 Distribution of employees, who have used the microblogging service at least once, per (a) business unit, and (b) department.
3.1 Multi-layered Social Network Stack
3.2 Conceptualization of multi-dimensional user representation.
3.3 Contexts in informal interactions at the workplace.
3.4 Contextual Interactions Example
3.5 Content representation.
3.6 Toy example of social distance between users in a three-dimensional space consisting of movies, music and sports.
3.7 rESONAtE Ontology Coverage
3.8 rESONAtE Ontology
3.9 rESONAtE Workflow
3.10 Partial result set for (a) Query 3.1 and (b) Query 3.2.
3.11 Partial result set for (a) Query 3.3 and (b) Query 3.4.
4.1 Distribution of users’ in–degree (number of users one has received answers from) and out–degree (number of users one has answered). Both axes are in logarithmic scale.
4.2 (a) Scatter plot of the number of followers and the number of followees. (b) CDF of out–degree to in–degree ratio.
4.3 (a) Top: Histogram of connected components sizes. Bottom: Histogram of clustering coefficients. (b) Top: Histogram of clustering coefficients in LCC. Bottom: Average clustering coefficient as a function of degree in LCC.
4.4 From left to right and top to bottom, distribution of the number of messages $n_m$ per user, the number of replies $n_r$ per user, the number $n_g$ of distinct groups to which a user’s messages belong and the number g of group related messages per user, the number $n_t$ of distinct hashtags per user and the number t of hashtag assignments per user
4.5 (a) Distribution of messages per group. Y–axis is in logarithmic scale. (b) Distribution of number of words per message. Both axes are in logarithmic scale.
4.6 (a) Number of group messages g, as a function of the number of messages $n_m$ and replies $n_r$ of a user. (b) Number of hashtag assignments t, as a function of the number of messages $n_m$ and replies $n_r$ of a user (both axes are in logarithmic scale)
4.7 From left to right and top to bottom, average number of (a) messages $n_m$, (b) replies $n_r$, (c) distinct groups $n_g$ and (d) groups g, (e) distinct hashtags $n_t$ and (f) total hashtag assignments t of users having k neighbors in the post–reply network
4.8 (a) Average number of groups for the nearest neighbors of users belonging to $n_g$ groups. (b) Average number of distinct hashtags for the nearest neighbors of users with $n_t$ distinct hashtags
4.9 (a) Top: Average number of shared hashtags $n_{st}$, σ hashtags (u,v), and σ U hashtags (u,v) of two users as a function of their distance d in the network. Bottom: Probability distribution of the number of shared hashtags $n_{st}$ of two users being at distance d on the network, for d = 1, 2, 3. (b) Top: Average number of (i) shared groups $n_{sg}$, (ii) σ groups (u,v), and (iii) σ U groups (u,v) of two users as a function of their distance d in the network. Bottom: Probability distribution of the number of shared groups $n_{sg}$ of two users being at distance d on the network, for d = 1, 2, 3
4.10 Dynamics of threaded discussion at the workplace. Cumulative number of replies a message triggers over time
4.11 (a) Average maximum time required for a message to receive k replies. Time is measured in days. (b) Time required on average for a message to get k replies. Time is measured in days
4.12 Average number of replies (a) received in total and (b) sent by users that have (i) k incoming connections, (ii) k outgoing connections, (iii) distinct groups $n_g$, and (iv) distinct hashtags $n_t$
4.13 (a) Contribution Frequency of users in the post–reply network. (b) Average Contribution Index of users having $k_{in}$ incoming, or $k_{out}$ outgoing connections
4.14 (a) Average Contribution Index of users having $n_t$ distinct hashtags, or participating in $n_g$ groups. (b) Average number of replies sent by users with Contribution Index ci.
4.15 (a) Average Jensen–Shannon divergence between all combinations of users having $k = k_{in}$ neighbors in the post–reply network, for T topics (the model used in this case is AT G). (b) Average Jensen–Shannon divergence between all combinations of users having $k = k_{in}$ neighbors in the post–reply network, and their neighbors, for T topics (the model used in this case is AT G).
4.16 Average Jensen–Shannon similarity between users having received k replies, and replies’ authors (T indicates the number of topics)
4.17 (a) Link probability among user pairs whose similarity is below some threshold. (b) Link probability of user pairs as a function of similarity. (c) Link probability of hyperactive and less active user pairs as a function of similarity.
5.1 Technology adoption dynamics at the workplace. Dynamics on and of the formal network structure are strongly coupled. The bottom layer illustrates the formal organization hierarchy, where black arrows represent “reporting-to” relationships between employees. The directionality of edges goes from lower level employees up to the company CEO. The middle layer depicts the flow of influence between people in the same group (red arrows), top-down influence from supervisors to team members (dashed, dark red arrows) and vice versa, bottom-up team members’ influence on their supervisors (dashed purple arrows). The upper layer depicts observed adoption dynamics, i.e., a potential propagation tree.
5.2 Average number k of employees that joined the microblogging service after their manager, within the first n samples vs the total number K of employees that joined the microblogging service after their manager, and approximation.
5.3 (a) Average ι-score of managers with λ team members that have joined the microblogging service. (b) Average time-invariant influence of managers, who have themselves joined the microblogging service (similarly for those who have not joined), with λ team members.
5.4 Average influence score as a function of hierarchy level.
5.5 Probability an employee joins the microblogging service given that n employees have adopted the service before. Solid lines depict probability estimates calculated with the exponential growth model.
5.6 Complex Contagion Model simulation in NetLogo. (a) Initial setup with employees arranged in a tree following reporting-to relationships according to the formal organizational hierarchy. The red arrow indicates the infection seed. (b) Simulation result after 600 time steps.
5.7 Complex Cascade Model simulation in NetLogo. (a) Initial setup with employees arranged in a tree following reporting-to relationships according to the formal organizational hierarchy. The red arrow indicates the infection seed. (b) Simulation result after 600 time steps.
5.8 True and predicted cumulative number of employees who have adopted the microblogging service (i.e. infected users). Time is measured in days. Solid line curves represent the outcome of (a) the SI model for various probabilities of infection, (b) the ICM model for various probabilities of infection, (c) the DM model for various numbers of initial adopters, (d) ten runs of our complex contagion model (see Section 5.3.1), and (e) ten runs of our complex cascade model (see Section 5.3.2).
6.1 Augmented social graph.
6.2 Example of hashtag hierarchy.
6.3 (a) Precision@1 and (b) Recall@1 as a function of λ.
6.4 Impact of weighting scheme on accuracy (measured @ 5).
6.5 Average precision (measured @ 5) of users having k (a) posts or (b) neighbors in the @replies network.
6.6 MRR (measured @ 5) as a function of λ. Different plots impose structural or content availability restrictions. All measurements refer to weighting scheme SS Mix 1.
7.1 Dynamics on and of the network structure are strongly coupled in social media. The top layer illustrates the social network structure, where green arrows represent “follow” relationships. Information flows along arrows’ direction. Dashed black arrows mark newly created links, which are created on the premises of “common interests”, i.e., similar tastes in resources (artists in Last.fm) or common tag usage. The middle layer depicts a taxonomy, which may be provided/imposed by the social network or be collaboratively “curated” by the users of the social network. The bottom layer represents a network of resources. Relationships between resources depend on the network type and scope. In Last.fm for example, where resources are artists, connections between artists signify music genres.
7.2 Generative models of tripartite graphs. (a) User-Resource-Concept model, (b) User-Resource model, (c) User-Concept model.
7.3 Clouds of top tags and artists for 4 topics (out of 50) learned by the UC, UR and URC models. Size indicates higher probability.
7.4 Probability distribution of three most popular users’ latent interests over twenty topics.
7.5 (a) Average focus of users having k friends. (b) Average Jensen-Shannon divergence between all combinations of users having k friends and their friends.
7.6 UR, UC and URC perplexity for varying number of hidden topics.
7.7 Performance of SLIgHT. X-axis: number of hidden topics; Y-axis: Performance on test set.
7.8 Average similarity between latent topic vectors of Last.fm users as a function of (a) number of common neighbors and (b) distance d.
7.9 Precision and Recall of Latent Semantics Classification Schemes as a function of training data size. X-axis: Training set size as percentage of complete dataset; Y-axis: Precision/Recall.
7.10 F-measure (calculated for positive & negative classes separately) achieved by Latent Semantics Classification Schemes as a function of training data size. X-axis: Training set size as percentage of complete dataset; Y-axis: F-measure.
7.11 Area under the ROC curve lift achieved by schemes A and C with respect to UR, UC and URC models on the link recommendation task in the Last.fm data set. Lift is defined as % change over MIP baseline.

Abstract

Complex networks arise everywhere. Online social networks are a famous example of complex networks due to: (a) revolutionizing the way people interact on the Web, and (b) permitting in practice the study of interdisciplinary theories that arise from human activities, at both micro (i.e., individual) and macro (i.e., community) level. The vast scale (Big-Data) of online human interactions imposes certain challenges, such as scalable indexing and efficient retrieval of social data, which are by their nature intertwined in multiple dimensions.
Understanding the rich properties and dimensional interdependencies of topology and content in complex networks is necessary to uncover hidden structures and emergent knowledge.

To address these questions, we propose a formal model that abstracts the semantics of complex networks into an integrated, context-aware, time-sensitive, multi-dimensional space, enabling joint examination of their static and dynamic properties, facilitating unified mining and analysis of network structure and content, and their explicit and implicit interactions. Traditionally, network analysis methods either ignore content and focus on the network structure, or make implicit assumptions about the complex correlation of these two components. We show that accurately modeling multiple symmetric or asymmetric, explicit and hidden interaction channels between people, integrating auxiliary networks into a unified framework, leads to significant performance improvements in a variety of prediction and recommendation tasks. We empirically verify this insight using real-world datasets from online social networks and corporate microblogging data.

Our work makes several steps towards building models of complex networks, understanding their rich properties, hidden structures and dimensional interdependencies. We develop a novel model that integrates heterogeneous networks of networks, each with rich properties and hidden dynamics, facilitating multimodal analysis of time-varying, complex social networking data. We study informal communication behavior, information sharing, and influence at the workplace, where formal structures, such as the organizational hierarchy, provide hints of the underlying, implicit social or communication network. Particularly, we develop two simple yet accurate computational models of technology adoption at the workplace in the presence of influence, accurately reproducing the adoption process at the macroscopic level.
We also achieve accurate communication intention prediction based on auxiliary information. Last but not least, we study the structure of online social bookmarking systems, where we significantly improve social tie recommendation by exploiting the dynamics of collaborative annotation.

Chapter 1: Introduction

In this chapter, we first provide our motivation and we outline the problems that this thesis attempts to address. We then briefly describe our research contributions along with the outline of this manuscript.

1.1 Motivation

Online Social Networks mainly aim to promote human interaction on the Web, assist community creation, and facilitate the sharing of ideas, opinions, and content. However, Online Social Networks have also become the medium for a plethora of applications such as targeted advertising [4, 55] and recommendation services [7, 85], collaborative filtering [107, 176], behavior modeling and prediction [77, 203], analysis and identification of aggressive behavior (such as bullying and stalking) [61], epidemic studies [77, 88], crowd mood reading and tracking [15, 53], terrorist networks analysis [113], even political deliberation [196].

Major Online Social Networks like Facebook¹, Online Social Media such as Twitter² and Online Bookmarking Sites like Digg³ have been the focus of Social Network Analysis. However, the principles of Social Network Analysis have also been applied to email communications [197, 58], revealing hidden social structures [205], to web blogs and personal home pages [2], explicit friend-of-a-friend (FOAF) networks [164, 126], bookmarks and tags [151], co-occurrence of names [99, 106, 145, 152], co-authorship in scientific publications references [62, 152, 178, 202], and co-appearance in movies or music productions [220].

¹ http://www.facebook.com
² http://www.twitter.com
³ http://www.digg.com
Social Network Analysis has been mostly applied on top of a graph model representation [202, 172], in which nodes represent users and arcs represent explicit links between them. Such research has focused on understanding the structure and evolution of the network [116]. However, numerous popular Social Networks such as Facebook and Twitter have recently released different APIs, exposing more than the superficial structure of social connectedness and creating the so-called Social Graph, “a global mapping of everybody and how they are related”.

Two universal features that remain invariant across social networks are the concept of profile and the concept of links between profiles. A profile, whose dimensions are determined by the scope and the nature of the social network, represents a social networking user. Users can be modeled in terms of favorite music, movies, books, preferred activities, interests, etc. for a general purpose social network like Facebook. More targeted social networks like LinkedIn⁴, a social network catering to professionals, on the other hand largely center around a small number of specific topics (e.g., professional skills and work experience in LinkedIn).

⁴ http://www.linkedin.com/

[Figure 1.1: Users’ Interactions.]

The notion of link also varies between social networks. Elementary links are established when users add other users to their contact lists. General purpose social networks like Facebook allow only undirected links while others, like Twitter, permit directional links, established when a user chooses to follow another user. Better-defined links such as “Recommends” are also available in some cases (e.g., LinkedIn). Depending on the social network, other information may also be available. Group participation
and applications usage, co-attendance of events and users’ interactions, such as “like”, “share”, “comment” and “tag” (as shown in Figure 1.1), provide useful information and insights that the underlying social graph does not inherently support. For example, the “knows” link in Figure 1.1 might be missing. However, “indirect interactions” between the two users might still exist, for instance, in the context of group communication. Such indirect connections often offer clues on common interests and shared tastes in some particular context.

1.2 Motivating Scenarios

1.2.1 Information Overload and Targeted Microblogging

The streaming nature of social media imposes significant information overload on users who are struggling to keep up with incoming information, published by numerous sources, such as friends and family or coworkers, spammers or automated bots, and news sites. Consider Bob, who holds a Facebook account and has 4 friends: his boss at work, his sister and mother, and his friend Bill. His ego-centric social network is shown in Figure 1.2. Bob reviews his most recent news feed in Facebook every now and then and tries not to miss any “important” information posted by his friends. He finds it difficult to go through 100 posts on average every other day and sometimes he indeed misses some posts. Hence, Bob chooses to control his news feed in Facebook by hiding news from specific people. He decides to hide news from his boss and his sister. He is relieved to now receive 20 new posts on average per day but a few days later he discovers that he missed his sister’s posts, and a notification sent from his boss to all employees. Quickly he decides to re-enable the hidden accounts, fearing that he might lose other “important” news in the future.

[Figure 1.2: Bob’s Network. (a) Information Overload, (b) Targeted Microblogging.]
In order to manage the posts Bob himself publishes in Facebook, he has grouped his contacts in three different groups, namely: 1) work-group, 2) friends and 3) family. Work-group consists of his boss, family consists of Bob’s sister and mother, and the friends group consists of Bob’s friend Bill. Whenever Bob publishes something, Facebook broadcasts his post to all of his contacts. However, sometimes Bob prefers only specific groups of people to receive his updates. To do so, Bob chooses to send specific messages to subsets of his groups. One day, while at work, he posts the following message: “Anyone interested in going out for a walk? I feel tired and so bored. I do not feel like working at all”. Accidentally, he forgets to select specific groups to send the message to and his post is forwarded to all his connections including his boss, who later visits Bob personally in order to confront him about his behavior.

The issues highlighted in this scenario include:
• Due to the streaming nature of social networks, users are overwhelmed by content that is often not interesting to them.
• Currently, Social Networks provide some means of content organization, management and filtering but with limitations. Hiding news from specific users may reduce the volume of incoming news but significant information loss may occur as a byproduct of this filtering method.
• Hashtags enable management, organization and filtering to some extent but have limitations such as their lack of organization, their ambiguity and heterogeneity, and the fact that they have to be explicitly included in microposts.
• It would be useful for Social Networks to provide ranking and categorization mechanisms for their users to be able to access their latest news, ranked based on similarity to their interests or by category.
• Access control lists (ACLs) and group multicast of new posts provide some solution in terms of targeting and restricting the recipients of posts, but users have to explicitly choose the groups they would like their posts to be forwarded to. Further, targeted publishing does not alleviate the need for filtering at the receiver’s end.

1.2.2 Complex Query Answering

Social Networks have successfully managed to facilitate community creation, to promote collaboration and to assist the sharing of ideas and content. To do so, Social Networks rely on the fact that users who join Social Networks for the first time import their contacts from other Social Networks and/or address books from e-mail services like MSN, or simply search for people they might know. The OpenID⁵ initiative offers a solution to this problem by providing a common identity to users that spans their online social activity across the multiple online social networks that they use. This way users can rather seamlessly import their profiles and existing connections from one Social Network to another. However, people who are by nature social and are interested in expanding their social circle find themselves struggling to discover new “friends” that match their interests.

Social Networks do provide services for users to search for new friends. For instance, Facebook provides friend recommendations based on friend-of-a-friend (FOAF) relationships under the assumption that it is likely for someone to befriend another person if they happen to have a number of friends in common. On the other hand, users can search for others using keyword-based search services, such as the Twitter advanced search service⁶. Keyword-based search, however, lacks semantics. Thus, a search based on the keyword “ski” won’t necessarily return any of the users who have specified that they are interested in winter sports.
To illustrate such issues, we present the following scenario of a researcher, Alice, who is working towards her PhD in Computer Science at the Viterbi School of Engineering at the University of Southern California. Alice is an international first year PhD student and she has a strong social “presence”. She is interested in making new friends in Los Angeles and she quickly discovers that numerous groups exist in Facebook representing USC student organizations. Unfortunately, Alice is astonished by the fact that these groups are “kind of too many” and she realizes that she has to manually go through each one of them in order to discover the purpose of each group as well as what kind of people participate in it.

5 http://openid.net/
6 https://twitter.com/search-advanced

Alice discovers an intelligent Facebook application, which allows her to compose complex queries in order to identify a small set of people who are located in Los Angeles, study at the University of Southern California and have the same or similar interests to hers. Alice searches for students in Computer Science who have previously worked for IBM or Google, whose research interests include Semantic Web Technologies, who have been to Disneyland, and who like going to the movies. The smart application analyses the query and compares Alice’s interests against other users to come up with the resulting set of recommended users. The application also explains why Alice should choose each of the users to befriend, specifying what interests they have in common.

The issues highlighted in this scenario include:
• Currently, Social Networks provide friend suggestions based on FOAF relationships.
• Users often want to befriend other “interesting” people who might not share common connections/friends.
• Users are currently able to search for others using keywords, possibly lacking semantics.
• Currently Social Networks do not facilitate complex queries involving different types of information and extending to different levels of abstraction.

Figure 1.3: Taxonomy of Social Networking Analysis Approaches.

1.3 State-of-the-art

Figure 1.3 presents a taxonomy of state-of-the-art approaches in Social Networking Analysis [29]. Next, we provide a brief overview of such approaches [29].

1.3.1 Graph Theoretic Social Networking Analysis

Much research on Social Networking Analysis applies graph theory [178, 202] on graph representations so as to unravel certain features of the network, identify the most important actors in a social network and discover community structures. To this end, several centrality measures have been proposed. “Centrality measures the degree to which network structure contributes to the importance of a node in the network” [69]. Betweenness Centrality (see Figure 1.4(b)) measures the fraction of all shortest paths that pass through a given node and is often used to identify nodes that act as boundary spanners between different groups [21]. Studies of human [78] and animal [138] populations suggest that such nodes play a crucial role in the information flow and cohesiveness of the network. Degree Centrality (see Figure 1.4(a)) measures the number of edges that connect a node to others and is used to identify nodes that have the most connections in the network. However, the centrality of a node also depends on its neighbors’ centralities [17]. This measure is captured by the total number of paths linking a node to others in a network. The average length of such paths is measured by Closeness Centrality (see Figure 1.4(c)), which indicates the capacity of a node to be reached. The total number of paths from a node, exponentially attenuated by their length, is measured by α-centrality (see Figure 1.4(d)) [17, 16].
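As a rough illustration of the degree, closeness and betweenness measures described above, the following minimal sketch computes them on a toy undirected graph with breadth-first search. The graph, the function names and the normalization choices are assumptions made for this example only: closeness is taken here as the inverse of the average shortest-path length to reachable nodes, and betweenness is summed over ordered node pairs without normalization.

```python
from collections import deque


def bfs_counts(adj, source):
    """Return (hop distance, number of shortest paths) from source to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                sigma[nbr] = 0
                queue.append(nbr)
            if dist[nbr] == dist[node] + 1:
                sigma[nbr] += sigma[node]
    return dist, sigma


def degree_centrality(adj):
    """Number of edges attached to each node."""
    return {node: len(nbrs) for node, nbrs in adj.items()}


def closeness_centrality(adj):
    """Inverse of the average shortest-path length to reachable nodes (assumed convention)."""
    result = {}
    for node in adj:
        dist, _ = bfs_counts(adj, node)
        total = sum(dist.values())
        reachable = len(dist) - 1
        result[node] = reachable / total if total > 0 else 0.0
    return result


def betweenness_centrality(adj):
    """For each node, sum over ordered pairs (s, t) the fraction of shortest paths through it."""
    info = {s: bfs_counts(adj, s) for s in adj}
    score = {v: 0.0 for v in adj}
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in dist_s:
            if t == s:
                continue
            for v in adj:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path when the distances add up exactly.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score


# Hypothetical toy graph: the undirected path a - b - c - d.
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(degree_centrality(toy))
print(closeness_centrality(toy))
print(betweenness_centrality(toy))
```

The brute-force betweenness loop is cubic in the number of nodes, so it is only suitable for small examples; large networks would need a more efficient algorithm.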
“The attenuation parameter sets the length scale of interactions so as to distinguish between locally and globally connected nodes” [69]. Other centrality metrics include those based on random walks [163] and path-based metrics. Centrality measures are computationally expensive, with a minimum time complexity of O(n · m), where n is the number of vertices and m is the number of edges [58]. However, several approximating and parallel algorithms have been proposed for large networks [58].

Closely tied to the concept of nodal degree is density, which indicates the percentage of edges that are present in the graph over the total number of plausible edges [144]. The higher the density of a network is, the more nodes in the network are connected to each other. Clustering Coefficient measures the likelihood of two nodes connected to a given node being connected themselves. It indicates the degree to which nodes in a network tend to cluster together and it is therefore considered to be a good measure of whether a network demonstrates “small world” behavior [204]. Diameter, on the other hand, measures the distance of nodes in a network and is defined as the maximum geodesic distance between any pair of nodes [144]. Geodesic distance measures the shortest path between two nodes [144]. Diameter can only be calculated on connected graphs. If the graph is not connected, then diameter is undefined. To overcome this limitation, the mean geodesic value is calculated using only reachable pairs of nodes. Intuitively, the higher the diameter of a network is, the more dispersed the graph.

Figure 1.4: Centrality Measures.

To better understand the network structure and the mechanisms with which this structure affects information spreading, as well as to identify sociometric features that influence people’s behavior, several community detection algorithms have been proposed [178].
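The density, diameter and mean geodesic definitions given earlier in this section can be made concrete with a small sketch. Everything below is illustrative: the two toy graphs and the function names are hypothetical, and an undirected graph is assumed to be stored as a symmetric adjacency dictionary.

```python
from collections import deque


def density(adj):
    """Links present over links possible.

    With a symmetric adjacency dictionary each undirected edge is counted
    twice, which matches the n * (n - 1) ordered pairs in the denominator.
    """
    n = len(adj)
    if n < 2:
        return 0.0
    links = sum(len(nbrs) for nbrs in adj.values())
    return links / (n * (n - 1))


def geodesic_lengths(adj):
    """Shortest-path lengths for every ordered pair of distinct, reachable nodes."""
    lengths = []
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        lengths.extend(d for node, d in dist.items() if node != source)
    return lengths


def diameter(adj):
    """Maximum geodesic distance, or None when some pair is unreachable."""
    n = len(adj)
    lengths = geodesic_lengths(adj)
    if len(lengths) < n * (n - 1):
        return None  # disconnected graph: diameter is undefined
    return max(lengths, default=0)


def mean_geodesic(adj):
    """Average geodesic distance over reachable pairs only."""
    lengths = geodesic_lengths(adj)
    return sum(lengths) / len(lengths) if lengths else None


# Hypothetical toy graphs: a connected path, and a graph with two components.
path3 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
two_groups = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": {"e"}, "e": {"d"}}

print(diameter(path3), diameter(two_groups))
print(density(two_groups), mean_geodesic(two_groups))
```

On the disconnected example the diameter is reported as undefined, while the mean geodesic is still available because it averages over reachable pairs only.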
According to [202], a community is a set of actors among whom there are relatively strong, direct, and intense, frequent or positive ties. Some Social Networks like Facebook and Flickr allow or even encourage people to form and join groups. However, in cases where group formation is not supported, network interaction provides sufficient information to infer implicitly formed communities. Community detection algorithms have consistently facilitated informative visualization of social networks, and have assisted with the inference of missing properties [219].

Community detection criteria vary, but in general, community detection methods can be divided into four categories [190]. In node-centric methods, each node in a group must satisfy certain/different properties. Representative measures include cliques (complete subgraphs), k-clique, k-clan and k-club (reachability of members), k-plex and k-core (nodal degrees), and LS sets and Lambda sets (relative frequency of within-outside ties). In group-centric methods, each group as a whole has to satisfy certain properties without zooming into node level. In network-centric methods, the whole network is partitioned into several disjoint sets based on Node Similarity (nodes are structurally equivalent if they connect to the same set of nodes), Latent Space Model (transform nodes in a lower dimensional space such that the distance measure is kept in the Euclidean Space), Block Model Approximation (minimize the difference between an interaction matrix and a block structure), Cut Minimization (minimize the cut: the number of edges linking nodes that belong to different groups) and Modularity Maximization (measure group interactions compared to the expected random connections in the group). The limitation of network-centric methods is that the number of communities must be known a priori. In hierarchy-centric methods, a hierarchical structure of communities is constructed based on network topology.
Two strategies are used by hierarchical algorithms. Divisive hierarchical clustering partitions the nodes into several sets and each set is iteratively partitioned in smaller subsets [70]. Agglomerative hierarchical clustering initializes each node as a community and iteratively merges communities satisfying certain criteria into larger and larger communities [56]. Other algorithms, based on heuristics such as random walks or formula optimization, are noted in [58].

Due to their computational complexity most of these measures are computed over static networks, but their computation can often be accelerated due to specific patterns and laws governing social networks. According to the famous six degrees of separation [10], every node is on average approximately six steps away from any other node, while the node degree distribution follows a power law [162]. “According to the small world phenomenon [98] the order of the shortest path between any two nodes in a social network of size n is n · log n” [58]. Recently, research on temporal analysis of dynamic social networks has been conducted. According to [24], trends in this field include the following approaches: 1) the meta-matrix [50], 2) treating ties as probabilistic [54], and 3) combining social networks with cognitive science and multi-agent systems [156, 150]. Graph discretization [189] and Time-Aggregated Graph approaches [180] have also been considered.

Trust is also important since people tend to trust authorities/experts who have been accredited through their social activity as well as the number of connections they have and their global importance in the social network. [199, 137] exploit trust to perform collaborative filtering by forming bipartite [199] or tripartite [137] models.
[137] performs random walks to propagate trust values through the social network, while [74, 73] extend FOAF:Person to allow users to indicate trust levels for their connections on a scale of 1-9 (1 = Distrust Absolutely, 9 = Trust Absolutely) in general or for specific topics. Reputation [73, 137, 157] may be considered the other side of the same coin since it often serves as a measure of influence [147], used to identify and predict the most influential users in a network.

This work has mainly focused on binary friendship relations. However, since there is currently no way for users to strictly define friendship levels when they create links to other users, online social networks generally model heterogeneous relationships (e.g., acquaintances and best friends) all the same. In this case, the binary friendship indicator provides only a coarse representation of relationship information. [211] estimates relationship strength from interaction activity (e.g., communication, tagging) and user similarity.

1.3.2 Data Mining and Data Analytics in Social Networks

In order to understand the synergy between published text and social structure, graph analysis alone is not sufficient. Analysis of social networking content is also crucial. Content includes but is not limited to microblogging posts as well as social networking users’ profiles and web pages. Users’ profiles are often used to compute users’ similarity for recommendation purposes as well as to model users’ interests [131]. Content analysis may lead to information disclosure [80] and revelation of private information [219].

In order to understand the models that drive information dissemination in social networks, research has mainly focused on identifying factors that impact information diffusion [187, 125]. Such factors include the presence of hashtags, mentions and URLs, and the ratio between followers and followees. Hashtags (tags in general) are often used to organize and filter information [148, 92].
Tagging, however, lacks sentiment expression. Due to the relative importance of social media in advertising and information dissemination [121], however, sentiment analysis [96] and sarcasm detection have recently attracted much attention. Because of the large amount of content being shared in social networks, sentiment analysis is often unsupervised and completely automatic [196]. However, approaches based on distant supervision [72], where labels are implicitly stated with the use of emoticons (e.g., “:)” for positive and “:(” for negative), or completely supervised approaches [46] have also been proposed. “Consumers can use sentiment analysis to research products or services before making a purchase. Marketers can use this to research public opinion of their company and products, or to analyze customer satisfaction. Organizations can also use this to gather critical feedback about problems in newly released products” [72].

1.3.3 Semantic Social Networking Analysis

Graph representations and the analysis performed on top of them share a common limitation. They all poorly exploit complex relationship types and, most importantly, they all lack semantics. As an example, information filtering algorithms are either based on graph structure characteristics of social networks [183] or use tagging to organize and filter information, but under-exploit relation types, which could enable routing of different messages to different groups of people (e.g., family, friends, co-workers) based on their relationship to the author.

Tagging, which has recently become popular, allows users to tag web resources for organizational purposes (e.g., photos in Flickr, bookmarks in Delicious or tweets in Twitter). Twitter users adopted hashtags as an attempt to alleviate the significant information overload that the streaming nature of social media imposes on users interested in specific topics. [148, 92] exploit hashtags for content management, organization, and filtering.
“However, hashtags have several limitations such as their lack of organization [167], their ambiguity (e.g., #apple) and heterogeneity (e.g., #realtime, #rt)” [148] and have to be explicitly included in tweets. By aggregating the set of tags collaboratively used by users, emerging semantics are exploited to generate folksonomies and taxonomies [142, 194, 151], which are then linked to ontologies [166]. [75] analyses the structure of collaborative tagging systems, as well as their dynamical aspects, uncovers hidden patterns, and proposes a dynamical model of collaborative tagging.

Recently, Online Social Networks started to be modeled with rich structured data that incorporate semantics. Figure 1.5(a) demonstrates how various types of communication between users can be modeled, if split according to type and weighted by communication frequency. Further semantics may be imposed using ontologies such as FOAF⁷, SIOC⁸, DC⁹, MOAT¹⁰, and SKOS¹¹ to describe users, content and their relationships. FOAF is used for describing people, their relationships and their activity. SIOC specializes FOAF types in order to model interactions between social web applications and resources managed by such applications. Different types of relationships and trust levels may also be utilized to impose a finer grained description using vocabularies like RELATIONSHIP¹². RELATIONSHIP specializes the FOAF:knows property to specific relationships, as shown in Figure 1.5(b). A lighter way to add semantics to the representation of persons and web resources is to use microformats¹³.

Figure 1.5: Semantic Social Networks. (a) Modeling Interactions (b) Complex Relationships

Erétéo et al. [59, 58] proposed an architecture based on the Semantic Web stack to analyze online social networks.
Its purpose is to explore RDF¹⁴-based annotated profiles and users’ interactions in social networks using background knowledge (domain vocabulary), predefined ontologies and OntoSNA (also referred to as SemSNA), an ontology of Social Network Analysis, which provides a way to compute sociometric features using SPARQL¹⁵. This work extends classical graph theory algorithms with semantic features, such as types of resources (e.g., FOAF:Person) or properties (e.g., FOAF:knows or relationship:worksWith) to be considered in the analysis.

7 http://www.foaf-project.org/
8 http://sioc-project.org/
9 http://dublincore.org/
10 http://moat-project.org/
11 http://www.w3.org/2004/02/skos/
12 http://vocab.org/relationship/
13 http://microformats.org/
14 http://www.w3.org/RDF/

While much of the work on semantic microblogging thus far focuses on representing users, microblogs and microblog posts in the Semantic Web, the work described in [182] takes the complementary approach of harvesting semantic data embedded in the content of microblog posts, converting these metadata into RDF and publishing the harvested knowledge base as Linked Open Data. TwitLogic, an open-source semantic data aggregator, which implements the above ideas, provides scoring of microblog content based on recency (time-based significance) and proximity (location-based significance).

Semantic annotation transforms unstructured data into a structured representation that enables applications to better search, analyze, and aggregate information. [148] makes use of annotated microposts together with background knowledge obtained from Linked Open Data to offer advanced search and organizational capabilities. For example, thanks to semantic links between football and sports, all information mapped only to football can be retrieved in queries regarding sports.
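The football and sports example can be illustrated with a minimal sketch that contrasts plain keyword matching with a search that first expands the query concept along "narrower concept" links. The concept hierarchy, the annotated posts and all function names below are hypothetical and serve only to illustrate the idea; they are not taken from [148] or from any real vocabulary.

```python
# Hypothetical "narrower concept" links, in the spirit of a concept hierarchy.
NARROWER = {
    "sports": {"football", "winter sports"},
    "winter sports": {"ski"},
}

# Hypothetical annotated microposts: post id -> concepts the post was mapped to.
POSTS = {
    "p1": {"football"},
    "p2": {"ski"},
    "p3": {"politics"},
}


def expand(concept):
    """Return the concept together with all of its transitive narrower concepts."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(NARROWER.get(current, ()))
    return seen


def keyword_search(concept):
    """Match only posts annotated with exactly the queried concept."""
    return sorted(pid for pid, tags in POSTS.items() if concept in tags)


def semantic_search(concept):
    """Match posts annotated with the queried concept or any narrower concept."""
    wanted = expand(concept)
    return sorted(pid for pid, tags in POSTS.items() if tags & wanted)


print(keyword_search("sports"))   # no post is tagged with "sports" itself
print(semantic_search("sports"))  # posts about football and ski are found
```

In this toy setting the keyword query for "sports" finds nothing, while the expanded query retrieves the posts mapped only to football or ski.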
Multilayered models, which involve the network between people, the network between concepts they use, and links to ontologies modeling such concepts, have lately been used. [101] proposes the use of such a representation so as to extract relationships in one network from relationships in another. [23], on the other hand, proposes a multilayered semantic social network model that offers different views of common interests underlying a community of people. Starting from a number of ontology-based user profiles and taking into account their common preferences, the domain concept space is automatically clustered in order to identify similarities among individuals at multiple semantic preference layers and define emergent, layered social networks.

15 http://www.w3.org/TR/rdf-sparql-query/

1.4 Thesis Statement

Social networks can be seen as consisting of two independent yet strongly interconnected aspects: the network itself and the data that is being generated in the network. Interactions and information impact the evolution of the network, and vice versa, the network structure governs the dynamics of users’ activities. Our research focuses on characterizing temporal, spatial, contextual, informal interactions in complex networks. We study both static and dynamic, multimodal features of human online activities, and consider them in conjunction with hidden structures. Our goal is to build contextual models of heterogeneous networks, integrating semantically enriched multimodal content, so as to achieve a complete, unified, and complex social network analysis towards identification of influencers and experts, high performance recommendation algorithms, and effective prediction of future trends, activity and behavior.

1.5 Research Contributions and Thesis Outline

In Chapter 2, we provide general definitions of terms used throughout our work, and the description of the real-world datasets used in our empirical evaluations.
We also study the basic structure of online social networks, arguing that tripartite graphs offer a mechanism to describe and capture users’ behavior and interests in terms of their online activities [35]. In Chapter 3, we propose a formal generalized representation of social networks, which abstracts the semantics of multidimensional, informal, social interactions in the form of orthogonal dimensions [36, 28, 210, 37, 38]. Our model integrates structural information with dynamic collaboration modalities from multiple heterogeneous networks, and in conjunction with our social algebra, facilitates multi-dimensional, time-varying, contextual, semantic analysis of such networks. We utilize Semantic Web Technology for our conceptual modeling and representation, since this approach offers a generic, reusable, and machine understandable model for representing the concepts and properties required for describing user activities and measuring their behavior, and enables seamless integration with available domain Ontologies and Linked Open Data. We present a case study on a large scale, real-world dataset from a Fortune 500 company, demonstrating how our approach can improve collaborative data analytics for the enterprise, both at micro and macro level [210, 38].

Next, we move to studying informal interactions at the workplace, but also when a formal structure is imposed on top of the communication network. In Chapter 4, we provide an extensive, in-depth analysis of real-world enterprise microblogging data [30, 32], comparing the structure and properties of the post–reply network of threaded discussions to online social networks. We study correlations between user activities, assortative mixing, and topical alignment, all providing corroborating evidence of homophilous communication based on latent topical similarity. Investigating users’ communication behavior, we find four emerging “laws” that govern the dynamics of threaded discussion at the workplace.
We study informal interactions in the presence of formal structure in Chapter 5. Particularly, we emphasize the effect of formal organizational structure on the adoption mechanism of a microblogging service at the workplace, identifying the factors that govern the process of adoption at both microscopic and macroscopic levels, and proposing a prominent indicator of influence at various levels of granularity [34, 209]. We build two intuitive and simple computational models of technology adoption at the workplace, accurately reproducing the adoption process at the macroscopic level [34].

In Chapter 6, we delve deeper into informal communication in online social networks. We introduce the problem of communication intention prediction, demonstrating how it can be accurately performed [31]. The similarity of this problem to the traditional link prediction problem in online social media leads us to investigate social tie recommendation in online social bookmarking systems [33, 35]. We present three generative probabilistic models of online social tagging systems as a principled way of reducing the dimensionality of such data, capturing at the same time the dynamics of the collaborative annotation process. Modeling users’ interests in a latent space over resources and annotations, and combining them with structural network clues, yields high precision and recall solutions, effectively alleviating the high class imbalance problem that occurs in online social media.

Finally, we highlight the scope of future work and draw our conclusions in Chapter 8.

Chapter 2

Preliminaries

In this chapter, we state general definitions used in this thesis and describe the datasets used in this work. We also probe the basic structure of online social networks, arguing that tripartite graphs offer a mechanism to describe and capture users’ behavior and interests in terms of their online activities.
2.1 Definitions

A network is represented by a directed graph $G = (V, E)$, where
• $V = \{u_i \mid i = 1, \dots, N\}$ is the set of vertices,
• $E = \{e_{ij} \mid i, j = 1, \dots, N\}$ is the set of edges,
• $I : E \to V \times V$ is the set of adjacency relations of $G$.

When an edge $e_{uv}$ exists and points from node $u$ to node $v$, vertices $u$ and $v$ are said to be adjacent to one another. The definition of an edge depends on the network type. In a communication network, an edge $e_{uv}$ exists and points from node $u$ to node $v$ if user $u$ has sent a message to user $v$. In a friendship graph, an undirected edge $e_{uv}$ exists between nodes $u$ and $v$ if $u$ and $v$ are friends. In our work, all networks we consider are directed. We mention the specific edge semantics where appropriate.

Symmetric, Asymmetric and Reciprocal Relationships: The existence of a directed edge $e_{uv}$ between users $u$ and $v$ does not guarantee that edge $e_{vu}$ also exists. When both edges $e_{uv}$ and $e_{vu}$ exist between users $u$ and $v$, the relationship $e$ between $u$ and $v$ is symmetric, otherwise it is asymmetric. Asymmetric relationships, such as “$u$ knows $v$”, do not require both edges to be present. In contrast, symmetric relationships, such as “$u$ and $v$ are friends”, are always represented by both edges in a directed graph. When an asymmetric relationship is returned, it is said to be reciprocated.

Neighbors and Neighborhood: If there exists an edge from $u$ to $v$, $v$ is said to be a neighbor of $u$. The neighborhood of $u$ is defined as all the nodes of which $u$ is a neighbor.

Common Neighbors: If node $u$ is a neighbor of node $z$, and $z$ neighbors with node $v$, then we call node $z$ a common neighbor of nodes $u$ and $v$.

Degree, in-degree and out-degree: The degree $d_u$ of a node $u$ is the number of edges incident to $u$, regardless of their directionality. In-degree $d_u^{in}$ denotes the number of edges starting from nodes $x$ that are adjacent to $u$ and ending at node $u$, i.e., edges $e_{xu}$. Similarly, $d_u^{out}$ is the out-degree of node $u$.
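The degree and reciprocity definitions above can be sketched on a small directed edge set. The users and edges below are hypothetical, and the function names are chosen for this illustration only.

```python
def degree_table(edges):
    """Map each node to its in-degree, out-degree and total degree."""
    table = {}
    for u, v in edges:
        table.setdefault(u, {"in": 0, "out": 0})
        table.setdefault(v, {"in": 0, "out": 0})
        table[u]["out"] += 1  # edge e_uv leaves u
        table[v]["in"] += 1   # edge e_uv arrives at v
    for counts in table.values():
        # Total degree counts incident edges regardless of direction.
        counts["total"] = counts["in"] + counts["out"]
    return table


def reciprocated(edges):
    """Return the directed edges e_uv whose reverse edge e_vu also exists."""
    return {(u, v) for (u, v) in edges if (v, u) in edges}


# Hypothetical communication network: an edge (u, v) means u sent a message to v.
edges = {("bob", "alice"), ("alice", "bob"), ("bob", "carol")}

print(degree_table(edges)["bob"])
print(sorted(reciprocated(edges)))
```

Here the relationship between bob and alice is reciprocated, whereas the edge from bob to carol is asymmetric.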
Clustering and Clustering coefficient: Clustering quantifies how densely the neighborhood of a node is connected. Clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. The clustering coefficient of node $u$, with set $S_N$ of $N$ neighbors, is defined as the number of directed links that exist between nodes in $S_N$, divided by the number of all possible directed links $N(N-1)$ between the nodes in $S_N$.

Connected components, strongly connected component, weakly connected component, and largest connected component: The connected component of an undirected graph is a subgraph in which any two vertices are connected to each other by paths. A graph that is itself connected has exactly one connected component, consisting of the whole graph. A directed graph is called strongly connected if every vertex is reachable from every other following the directions of the edges. A directed graph is weakly connected if there is an undirected path from each vertex in the graph to every other vertex, i.e., its structurally equivalent undirected graph is connected. The largest connected component is the strongly connected component that encompasses the most nodes.

Distance and Diameter: The distance between two vertices in a directed graph is the number of edges in the shortest path connecting them. If there is no path connecting the two vertices, i.e., if they belong to different connected components, then conventionally their distance is infinite. The diameter $D_{max}$ of a directed graph is the greatest distance between any pair of vertices in the graph.

Homophily: Homophily is the principle that individuals tend to associate and bond with similar others [146]. Homophily is also known as assortativity or assortative mixing.

2.2 Structure of Tripartite Graphs

A social network is often represented as a sociogram [202], in which nodes represent users and arcs represent explicit relationships between them.
A sociogram is realized as a graph, an adjacency matrix or distributed adjacency lists (each node in the network maintains a local collection of its neighboring vertices). In order to exploit implicit relationships between users, tripartite graph models have also been proposed [84], as shown in Figure 2.1. Tripartite graphs offer a mechanism to describe and capture users’ behavior and interests in terms of their activities.

A tripartite graph is a graph whose vertices can be divided into three disjoint sets: 1) a set of actors (e.g., users) $A = \{a_1, \dots, a_A\}$, 2) a set of concepts (e.g., tags) $C = \{c_1, \dots, c_C\}$ and 3) a set of resources (e.g., photos) $R = \{r_1, \dots, r_R\}$. A resource $r \in R$ is annotated with a set of concepts $c_r \subseteq C$ of size $N_r$ (similarly created, used, bookmarked or shared), by a set of actors $a_r \subseteq A$. A collection of $R$ resources is then represented as a concatenation of individual concept vectors $c$, having $N = \sum_{r=1}^{R} N_r$ concepts in total.

Figure 2.1: Tripartite graph model of a social network.

It is possible to cluster vertices that belong to any of the three disjoint sets of a tripartite model so as to extract emergent semantics. Tripartite graphs can in this way be reduced into three bipartite graphs, which model associations between actors and concepts (bipartite graph AC), concepts and resources (bipartite graph CR), and actors and resources (bipartite graph AR). Bipartite graphs are easier to comprehend and work with, but the reduction process discards higher dimensional links between the three sets, which could otherwise be extremely useful in the analysis of the social network at hand. A bipartite graph can be further reduced to produce two simple, weighted graphs. For example, the bipartite graph of actors and concepts (AC) may be reduced into two graphs, one for actors (graph A) and one for concepts (graph C).
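The two reduction steps just described can be sketched as follows: annotation triples are first projected onto the actor and concept bipartite graph AC, which is then reduced to a weighted actor graph. The triples are hypothetical, and the weighting used here (the number of distinct concepts two actors share) is one possible choice assumed for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical annotations: (actor, concept, resource) triples of a tripartite graph.
ANNOTATIONS = [
    ("ann", "jazz", "track1"),
    ("bob", "jazz", "track2"),
    ("bob", "rock", "track3"),
    ("cat", "rock", "track3"),
    ("ann", "rock", "track3"),
]


def project_actor_concept(triples):
    """Drop the resource dimension to obtain the bipartite graph AC."""
    return {(actor, concept) for actor, concept, _ in triples}


def actor_graph(bipartite_ac):
    """Reduce AC to a weighted actor graph: weight = number of shared concepts."""
    actors_by_concept = {}
    for actor, concept in bipartite_ac:
        actors_by_concept.setdefault(concept, set()).add(actor)
    weights = Counter()
    for actors in actors_by_concept.values():
        # Every pair of actors using the same concept gains one unit of weight.
        for a, b in combinations(sorted(actors), 2):
            weights[(a, b)] += 1
    return dict(weights)


ac = project_actor_concept(ANNOTATIONS)
print(actor_graph(ac))
```

Note that the projection onto AC already forgets which resource each concept was attached to, which is the information loss that the reduction to bipartite graphs entails.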
In this case, the reduced graph A models relationships between actors, weighted by the number of times two actors have used the same concepts.

The creator of a resource is often considered to be its owner, but many actors may use, bookmark or share a resource, thus becoming “owners” themselves. Further, many actors may collectively annotate a resource, socially contributing to its set of concepts. Users annotate resources by choosing tags from an uncontrolled vocabulary according to their style and interests. Resources of the same nature (i.e., topic) may be tagged with different keywords, which may have similar meaning (e.g., synonyms), or with linguistic variations of the same keyword due to the uncontrolled vocabulary (e.g., “lac” as opposed to “laclippers”). Conversely, the same keyword can be used to annotate resources of different nature due to polysemy. For example, “apple” may be used to describe a story about a farmers’ market or about a new iPhone product.

2.3 Real–World Datasets

In this section we provide a detailed description of the datasets we used to illustrate and validate the effectiveness of our formal modeling of complex networks. We have extensively analyzed user activity logs from a microblogging service during the first years of its adoption by a non IT–focused Fortune 500 multinational company. Our informal communication dataset was further enriched with a snapshot of the true organizational hierarchy of the company. We have further investigated the structure and dynamics of the online social network Last.fm.

2.3.1 Corporate Microblogging

Enterprises have been mainly relying on e-mail traffic to share information among coworkers [20]. Analysis of enterprise communication networks [71, 51] has broadened our understanding of information flow in the enterprise. [67] argued that “information extracted from e-mails could prove useful in a knowledge management perspective”, as it would facilitate expert and community identification.
Media like SharePoint and Office Communicator are heavily utilized as part of question-answering and problem solving processes, while Active Directory provides a formal structure for employees to comprehend and navigate through the organizational hierarchy, accommodating their need to identify potential collaborators, research teams and business units around the globe, as well as to discover “interesting” projects that others are currently working on.

The wealth of information available in the context of enterprises, however, is not limited to formal interactions and silos containing structured data. As social media have become phenomenally popular, enterprises have adopted light-weight tools such as online forums and microblogging services for internal communication as a means of promoting and enabling collaboration and sharing among employees [216]. Employees have been using social network sites and microblogging services to stay in touch with close colleagues or to reach out to employees they do not know, to connect on a personal level or to establish strong professional relationships in order to advance their career within the company [208]. Others perceive the use of such services as an extra source of company news and events, or as a means to promote their ideas or to contribute to conversations revolving around company matters. Due to the richness and variability of systems and tools available in the enterprise information ecosystem, multiple communication channels between employees have thus become available.

User activity and behavioral data in this context contain valuable information. Users’ interests at the personal and professional level can be discovered, and interesting communication motifs can be mined, enhancing our understanding of employees’ communication patterns as well as of patterns of information propagation and browsing in enterprise networks.
Here, we primarily focus on mining interconnections between employees’ work-related activities and their social interactions on collaboration platforms used at the workplace. In practice, users’ activities are scattered across various collaboration tools used in the enterprise. In our work, we study informal communication, which is uncontrolled and unconstrained in online social media, as it is bounded by imposed formal structures (i.e., organizational hierarchy) at the workplace.

Table 2.1: High–level statistics of the post–reply network

Metric                       Value
Number of users              4,213
Number of messages           16,438
Number of threads            8,139
Number of thread starters    8,174
Number of replies            8,264
Number of hashtags           637
Number of groups             88

For our analysis we use a complete snapshot of a corporate microblogging service used by employees in a Fortune 500 multinational company, which operates outside the IT sector. The corporate microblogging service had 4,213 unique users, who had posted 16,438 messages by the end of August 2011, when we obtained the raw data. Our dataset represents 15% of the entire employee population, and reflects users’ activity during the first year of adoption of the microblogging service by the company. Table 2.1 summarizes the properties of our dataset.

Description

The functionality of the microblogging service resembles that of Twitter, whereas its interface is similar to that of Facebook. The corporate microblogging site does not impose any restrictions on the way people interact or whom they choose to follow, much like Twitter. Its main purpose is to promote and enable collaboration and sharing within the enterprise. As in Twitter, users author messages in the enterprise microblogging service, and form threaded discussions. A message may be available to the corporate–wide news stream, sent to a specific group of employees (public or private), or be a direct message to a single individual.
Each message may have been annotated with hashtags and may receive multiple replies by other employees.

As much as the corporate microblogging service that we are studying is similar to online microblogging services (e.g., Twitter), it also differs in several ways. First, only employees with a valid company email address can join the company’s network. This reduces spam (e.g., advertisements) and noisy text (e.g., personal status updates). The main purpose of the corporate microblogging service is to promote and enable collaboration and sharing (e.g., of information, knowledge and expertise) within the enterprise. Secondly, there is no 140-character limit on messages, and multiple files can be attached to a message. Large enterprises usually rely on multiple collaboration platforms, such as blogs, project wikis and discussion boards, to share information between employees. In fact, email is shown to be the primary communication mechanism in enterprises [20]. The corporate microblogging service we are studying in this work is compatible with email; messages can be posted to the service by email and received by email. Finally, the microblogging service offers both web–based access and integration for all major platforms, desktop and mobile.

Contrary to traditional collaboration platforms, such as blogs, project wikis and discussion boards, microblogging offers an informal setting for communication, search for information, data and experts, and sharing of ideas and news. Instead of being project or team specific, conversations are often broader and replies are often instantaneous. The ultimate goal of the corporate microblogging service is to become the primary platform for asynchronous collaboration and colleagues’ communication. The microblogging service includes a secure environment in which users can share tasks, learn about new topics of interest, ask questions and look for information.
Merging social network capabilities with discussion board features and the knowledge base paradigm leads to an integrated environment with major advantages. Clutter can be minimized by subscribing to selected feeds, or by joining specific groups. As a knowledge base, content is searchable and discoverable by colleagues, whereas in other media, such as email, content is accessible only by individuals.

Our dataset reflects users’ activity in a microblogging service during the first year of its adoption by a non IT–focused Fortune 500 multinational company. As such, we suspect that the bulk of knowledge and information is being exchanged using traditional collaboration media. As the use of the microblogging service increases, the company expects more value to be added by this service. Since we do not currently have access to other collaboration media in the enterprise, we can only speculate about the differences in traffic and information exchange between them. The interested reader may refer to [216, 128] for a thorough discussion on the value of social media in the workplace.

Post–Reply Network of Enterprise Microblogging

Online communities often have a discussion thread structure, based on which users share a status update or post a question, which other users comment on, effectively contributing to the discussion, answering the question posed in the original post, or posting their own (subsequent) questions [215]. Posting and commenting data illuminate the communication and information flow among commentators and post creators [91]. We can use such posting/replying activity to create a post–reply network, representing each participating user as a node and linking the user starting a thread to a replier [215]. In this sense, links indicate information sharing between nodes. The direction of the link indicates how discussion flows among users through the network. A node with many inbound links indicates a user who has received many comments.
A node with many outbound links, but no inbound links, indicates a user who has contributed to discussions on several occasions but received no replies in return.

Figure 2.2 demonstrates how we map posting/replying activity into a directed post–reply graph. A bipartite graph of users and the discussion threads they participate in can be created by linking post creators and repliers to threads, as shown on the left. For example, the message created by A received three comments, two from user C and one from user B. The bipartite graph is then transformed to a directed graph where an edge is drawn from the replier to the user who made the initial post.

Figure 2.2: We map posting/replying activity into a directed post–reply graph. Threaded discussion (a), represented as bipartite graph in (b), is collapsed to a directed graph in (c).

Formally, we represent the post–reply network as a directed graph $G = (V, E)$:

• vertices: $V = \{u_i \mid i = 1, \ldots, N\}$, where $N = |V| = 4{,}213$ is the total number of users,
• edges: $E = \{e_{ij} \mid i, j = 1, \ldots, N\}$, where $M = |E| = 4{,}489$ is the total number of edges,
• $I : E \to V \times V$, defined as follows: an edge $e_{ij}$ exists and points from node $i$ to node $j$ if user $i$ has sent at least one message to user $j$.

Research Scope

We have chosen this intuitive definition for edges in the post–reply network to capture the “information flow” from user $i$ to user $j$ when user $i$ sends user $j$ a message. An undirected edge $e_{ij}$, drawn between users $i$ and $j$ if either user sent a message to the other, would not capture the semantics of directed communication, which may or may not be reciprocal. We exclude multiple links between nodes, which would represent interaction across multiple discussion threads; instead, we use single links between nodes. We considered weighting the edges by the frequency of replies sent from user $i$ to user $j$, and also weighting each message–reply occurrence differently, based on how many replies there are in a discussion thread.
In our particular study, the addition of weights would have no effect on the structure and properties of the inferred graph. Although node ranking in terms of PageRank or similar metrics would be affected, this is not the focus of this research. It is straightforward, however, to incorporate edge weights for other applications or studies.

For our study, we consider the complete snapshot of a corporate microblogging service, without limiting our analysis to users that form the largest connected component. This includes users who may have contributed to one–to–many conversations, but who have never sent a personal reply. This definition does not include users who tried the service once and never used it, or found it useless. We do not seek to discover or test the perceived benefits and barriers to adoption of microblogging services in enterprise environments. We further do not attempt to examine information flow or temporal evolution of this network. While such aspects are important, they are beyond the scope of this thesis.

This post–reply network has some interesting characteristics. First, it is not a network focused on social relationships, as it is not intentionally built by its users. Instead, it reflects “information flow”, members’ shared interests, and, in case of questions being answered, “knowledge transfer”. [215] argued that “whether it is a community centered on questions and answers, social support, or discussion, the reason that a user replies to a topic is usually because of an interest in the content of the topic rather than who started the thread”, reflecting shared interests between the original poster and the repliers. “Furthermore, in a question and answer community, the direction of the links carries more information than just shared interest. A user replying to another user’s question usually indicates that the replier has superior expertise on the subject than the asker” [215].
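The mapping from threaded discussions to the directed post–reply graph $G = (V, E)$ described above can be sketched as follows. This is an illustrative sketch only: the thread log, the user names, and the helper `post_reply_edges` are hypothetical, and ignoring self-replies is an assumption that the text does not state. Reply counts are kept next to the unweighted edge set to show where edge weights could be incorporated.

```python
from collections import Counter

# Hypothetical thread log: (thread starter, repliers in posting order).
threads = [
    ("A", ["C", "B", "C"]),  # as in Figure 2.2: two replies from C, one from B
    ("B", ["A"]),
    ("D", []),               # a status update without replies adds no edge
]

def post_reply_edges(threads):
    """Count replies per ordered (replier, thread starter) pair."""
    counts = Counter()
    for starter, repliers in threads:
        for replier in repliers:
            if replier != starter:  # assumption: self-replies are ignored
                counts[(replier, starter)] += 1
    return counts

reply_counts = post_reply_edges(threads)

# V contains every user, including those who end up without any edge.
vertices = {s for s, _ in threads} | {r for _, rs in threads for r in rs}

# E keeps a single directed link per ordered pair (the unweighted model);
# reply_counts holds the frequencies that a weighted variant would use.
edges = set(reply_counts)

in_degree = Counter(target for _, target in edges)
out_degree = Counter(source for source, _ in edges)

print(sorted(edges))
print(in_degree["A"], out_degree["A"], reply_counts[("C", "A")])
```

Each edge points from the replier to the thread starter, so a high in-degree marks a user whose posts attracted replies from many distinct users.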
The examination of a network structure is dependent on the selection and time frame of the sample. During the time period covered by the snapshot of the corporate microblogging service (July 2010 to August 2011), the network grew rapidly, as more users joined the service and contributed to discussions. Our observations in this study remain valid throughout this time period, indicating that the basic network structure does not drastically change over time. Therefore, we can still gain valuable insights based on the analysis of our dataset.

Finally, in online social media, “retweets” act as a key means of information diffusion and propagation of topics [187]. Studying “retweets” in the context of a corporate microblogging service could provide interesting insights into the propagation of messages and topics in the enterprise, as well as into users’ topical homophily [158]. Even though this aspect of the post–reply network is important, our communication log dataset does not contain such information. For this reason, we leave the study of “retweet” behavior in corporate microblogging as future work. [173] studied information spread on Twitter, measuring the ways in which hashtags spread on a network defined by interactions among Twitter users. As in our study, they build a network on the users from the structure of interaction via @–messages: i.e., for users X and Y, if X includes “@Y” in at least t tweets, for some threshold t, they construct a directed edge from X to Y. Analysis of the process according to which the adoption dynamics interact with the network structure in the context of enterprise communication in collaborative platforms is an interesting research question that remains to be discussed in future work.
Figure 2.3: Distribution of employees, who have used the microblogging service at least once, per (a) business unit, and (b) department. [Panels plot the number of employees (log scale) against (a) business unit ID and (b) department ID.]

User Demographics

Our dataset contains employees’ interaction logs during the first year of adoption of the service by the enterprise. During this time period, the number of employees who joined the service increased dramatically. At the time we obtained the snapshot of this dataset, there were 4,213 users, who represent a broad spectrum of employees across 33 different business units and 228 departments worldwide, as shown in Figure 2.3. Figure 2.3 suggests that the microblogging service has not been equally adopted by different groups of corporate employees. However, we argue that our dataset contains a roughly representative sample of all employees, as it includes users with various job functions.

Whenever a user joins the service, a “join” message is automatically sent by the service to the company feed, announcing the event to the rest of the users. Activity is not homogeneous across users. After the “join” message, not all users continue sending messages. Some just stop using the service altogether, others assume a receiving role, simply reading others’ messages, and some contribute to the microblogging service with status updates, which in our modeling do not contribute any edges between users. From the remaining users, 1,210 (28.7%) have sent at least one message (excluding the “join” message). The distribution of the number of messages per user is broad and highly skewed. This highly skewed posting pattern is similar to what was found in Twitter [115] and is commonly observed among many online communities [154, 177].
This suggests that participation patterns are similar to those of online microblogging services, where a relatively small number of users produce the bulk of content and most users either contribute sparsely or just lurk. [63] proposes two methods for identification of leaders, lurkers, associates and spammers in Twitter. Unfortunately, our dataset does not include user login or reading logs, so we are not able to examine usages of the microblogging service other than posting–replying behavior.

2.3.2 Formal Organizational Hierarchy

In conjunction with the dataset comprising informal communication logs, which we described in Section 2.3.1, we have further obtained a snapshot of the organizational hierarchy of the Fortune 500 company. This includes over 12K employees and their reporting-to relationships. The dataset also contains employees’ join logs during the first two years of adoption of the microblogging service by the enterprise (July 2, 2010 to March 22, 2012), expanding the dataset which we described in Section 2.3.1 by almost one year’s worth of new observations. During this time period, the number of employees who joined the service increased dramatically. Even though not all employees had joined the microblogging service by the time we obtained the raw dataset, a broad spectrum of employees (9,421 users) had joined it (77.35% of the hierarchy dataset), sharing 19,371 status updates and exchanging 20,370 replies.

Table 2.2: Last.fm dataset high–level statistics

Metric                                      Value
Number of unique users                      1,892
Number of unique artists                    17,632
Number of unique tags                       11,946
Directed user-user relations                25,434
User-artist relations                       92,834
Annotations (user-artist-tag relations)     186,479

2.3.3 Last.fm

We consider hetrec2011-lastfm-2k, a real–world dataset of 2K users from the Last.fm online music system [22] (see Table 2.2), containing social networking, tagging, and music artist listening information.
Last.fm builds profiles of each user’s musical tastes by recording details of the songs users listen to. Further, Last.fm allows users to create social networks by listing friends (users who have similar musical tastes to them). The dataset includes 25,434 directed friend relationships (i.e., user–user) and 92,834 user–listened-to-artist relations (i.e., <user, artist, listening count> tuples). The dataset further includes 11,946 unique tags, used in 186,479 annotations (i.e., <user, artist, tag> tuples) of 17,632 artists.

Chapter 3

Modeling Social Networking Data

In this chapter, we propose a generalized framework to model informal communication within a complex network. Later, we use this framework to show how interactions affect the network structure and, vice versa, how the network structure affects activity. We consider a dynamic network of users who engage in multi-dimensional activities, developing time-varying, contextual interactions and interconnections. Our formal modeling abstracts the semantics of social networking data, and provides a social algebra that operates on a multi-dimensional space.

3.1 Formal Representation of Social Networking Data

We introduce a novel social graph representation, shown in Figure 3.1, which we call “enriched multi-layered social network”. This stack of interlinked layers not only contains unidimensional social links (e.g., “friend-of” links) between users, but also maintains integrated information regarding users’ dynamically changing interests and activities throughout collaboration platforms.

In our proposed representation, users may be perceived as multi-dimensional spheres lying in a multi-dimensional space, as shown in Figure 3.2, where each dimension describes a specific feature (like education, movies, or music) and may be multidimensional itself. For example, movies have title, director, cast and other information. Some dimensions may have higher importance than others, depending on the social network.
For instance, music is more important in Last.fm, while professional skills and work experience are more important in LinkedIn. Next, we discuss in detail the layers of our formal model.

Figure 3.1: Multi-layered Social Network Stack

Figure 3.2: Conceptualization of multi-dimensional user representation.

3.1.1 Social Layer

The social layer captures contextual and temporal interactions between users. Nodes represent users and arcs represent explicit relationships (links) between them. The social layer records user relationships from multiple collaboration platforms: email correspondence, chat networks, blogging activity, shared bookmarks and common ratings. An edge between users is defined by the context under which it is created, and has an associated timestamp. A user u may be connected to user v under multiple contexts (e.g., through e-mail correspondence, but also frequent meeting co-participation), at multiple time instances.

Human relationships, offline and online alike, are usually bounded by at least some temporal and spatial constraints. For instance, strong relationships may be achieved between employees at the workplace, but such connections might fade in the event that an employee retires or joins a rival company. We claim that contextual analysis can prove significant in providing insights on informal communication. Further, each kind of context can be complex, and thus decomposable into “sub-contexts”. To this end, it is imperative to establish a comprehensive model of contexts, and semantic links to structure them [38]. Figure 3.3 shows various contexts of informal interactions at the workplace. Various interpretations of captured scalar, hierarchical, nominal, temporal, or spatial data that differ with context (e.g., point of view) can provide different insights or views (i.e., dimensions).
Contextual Communication At the Workplace

As time progresses, background knowledge is acquired, the learned skill set is incrementally updated, and interests and expertise change. The current focus of a specific employee may be completely different than what is stated in an outdated personal webpage or CV. We introduce temporal context (TMC) to capture such temporal effects. Further, social interactions are on many occasions bounded by at least some temporal and localization constraints. This refers to spatial context (SC). For instance, face-to-face interactions may only occur when individuals are physically located at the same place at the same time. Extended interactions due to office adjacency or limited communication at a conference further introduce the concept of time-sensitive informal communication. Participation in meetings, talks, training, conferences, etc. constitutes event context (EC). Lengthy discussions on a daily basis indicate a stronger bond than periodic hourly meetings, which in turn indicate more significance than a sporadic discussion. The relationship between two individuals therefore becomes a function of time and can be explored only as such. Static social networks ignore such interactions, establishing an edge between two users if at least some type of communication has happened at least once. We argue that temporal correlations and causal effects between node features and social connectedness can only be manifested and magnified when considered as a function of time.

Figure 3.3: Contexts in informal interactions at the workplace.

Employee interests, skills and expertise can change depending on time, work orientation and responsibilities, project focus and overall team competence. From an employee’s perspective, interest in, expertise on, curiosity about, and familiarity with topics constitute participation context (PPC). Topics constitute topic context (TPC).
A given context (e.g., a group discussion versus a status update) may yield significantly different aspects of employees’ focus. Depending on the personal or professional nature of content, different interests can be mined and different expertise levels can be identified for disjoint sets of topics. Moreover, employees often assume multiple roles in multiple projects (e.g., an employee might act as manager in one project, while being a software developer in another). This can be captured as project context (PRC). Roles, positions, and the reporting hierarchy are captured in organizational context (OC).

One context can be closely related to one or more other contexts. For instance, employee interests, skills and expertise can differ at multiple points in time, and be different at the same point in time within the boundaries of correspondence with different individuals. Figure 3.4 depicts a scenario of enterprise contextual social interactions. Employee AABF participates in a project (project context). In performing his role, he comes across a problem and posts a question (activity stream context) at the enterprise social networking platform, where Employee AAAD and Employee AADC read the question (activity stream context). Employee AADC is interested in the problem (topic context) and starts following the question (activity stream context). Employee AAAD responds (participation context) with a SharePoint page reference, which was contributed by Employee AAGG (domain context).

Figure 3.4: Contextual Interactions Example

3.1.2 Content Layer

The content layer captures published content from all available sources, including but not limited to resources shared by users (e.g., photos or videos), bookmarked and/or tagged resources (e.g., URLs), users’ generated content (e.g., status updates in Facebook), e-mails, chat messages, and blog posts.
Depending on available computational resources or application need, this layer may maintain raw content, which is meant to be processed later on, or, at the other extreme, only contain the aggregated post-processing results of previous analysis. In the latter case, provenance metadata are to be maintained in unison with analysis results so as to describe, for instance, the procedure followed and the data sources used in the analysis.

Content includes a variety of both textual and non-textual features that are dependent on the type of the resource and the purpose of the social network. For example, photos may have a geographic location attached to them while regular documents may not. Some features are unique for each type of resource (e.g., creation date), while others may be missing or have multiple instances (e.g., hashtags). Some of the features are manually provided by users, while others are system generated. However, many social networks share a core set of features. “These features include: 1. author, with an identifier of the user who created the document; 2. title, with the “name” of the document; 3. description, with a short paragraph summarizing the document contents; 4. tags, with a set of keywords describing the document contents; 5. time/date, with the time and date when the document was published; 6. location, with the location associated with the document” [13].

We consider a representation of content by adapting the definition of event from [181], using each feature according to its type. Using this definition, content is completely described by a set of attributes (features), both textual and non-textual, that provide information regarding its “what”, “when” and “where” aspects. Content is linked back to users, as shown in Figure 3.5, with an appropriate context describing the connections’ semantics (e.g., differentiating between the creator of a photo and users that like or comment on it). Next, we list the key types of features we use in our modeling.
We note, however, that our model is extensible, allowing the deletion of unwanted features, or the addition of new attributes in the future.

• Content Type: This feature describes content type (e.g., document or e-mail) and its origin (e.g., social status update, e-mail body text or blog entry).
• Textual Features: These features include title and description (if available), raw textual content (bag-of-words), as well as user provided tags.
• Enriched Annotation Features: These features include resolved resources (i.e., URLs), named entities (elements belonging to a set of predefined categories, like persons and organizations), and topics (a general description of the topic(s) that content belongs to).
• Date Features: Date values regarding content creation. We represent date values as the number of minutes elapsed since the Unix epoch.
• Location Features: Location metadata associated with content (e.g., geographical coordinates).

Figure 3.5: Content representation.

3.1.3 Semantic Layer

The semantic layer contains meta-information about content, and can be broken into several constituent layers, each containing different metadata about content. This layer may include, but is not limited to, domain ontologies, vocabularies, folksonomies and taxonomies, external sources of formal knowledge, and linked open data. We use semantic information providers and annotation enablers, such as OpenCalais¹, AlchemyAPI², Evri³, WordNet⁴, and Freebase⁵, for text analysis and annotation, entity identification and topic discovery. Linked Open Data⁶ (LOD) can further be exploited to link to external knowledge repositories.

¹ http://www.opencalais.com/
² http://www.alchemyapi.com/
³ http://www.evri.com/
⁴ http://wordnet.princeton.edu/
⁵ http://www.freebase.com/
⁶ http://linkeddata.org/

3.2 Mathematical Formulation

In our formal modeling of social networking data, users can be visualised as “multi-dimensional spheres” on a multidimensional attribute space.
In order to strictly define the space we need the notion of basis, which we inherit from vector spaces. Typically, a basis is a vector representation of every dimension in a vector space, such that each dimension is a vector with zeros in every position except the current dimension. Taken together, such basis vectors form the identity matrix. However, since each dimension in our model may also be itself multidimensional, our basis becomes multidimensional. Consider for example the following sample space:

\[
\begin{aligned}
S &= (\mathit{aboutme}, \mathit{movies}, \mathit{music}) \\
d_1 &= \mathit{aboutme} = ((\mathit{name}, \mathit{occupation}), 0, 0) \\
d_2 &= \mathit{movies} = (0, (\mathit{title}, \mathit{director}, \mathit{genre}, \mathit{cast}), 0) \\
d_3 &= \mathit{music} = (0, 0, (\mathit{artist}, \mathit{album}, \mathit{genre}))
\end{aligned}
\]

In this example, the movies dimension contains a cast property, which can be further broken down as $\mathit{cast} = (\mathit{actor}_1, \mathit{actor}_2, \ldots)$, where each actor is in turn a vector: $\mathit{actor} = (\mathit{name}, \mathit{age}, \mathit{height}, \ldots)$. Hence, the basis ($e$) for the sample space above is defined as an $n \times n$ recursive matrix, where $n$ is the number of dimensions in the space, and each dimension is a matrix itself:

\[
e = \begin{pmatrix} d_1 & 0 & 0 \\ 0 & d_2 & 0 \\ 0 & 0 & d_3 \end{pmatrix}
\]

3.2.1 Social Algebraic Operators

Our formal model of social networking data facilitates the calculation of similarity strength between objects, on a [0, 1] range. With respect to users, different similarity calculations can be performed to convey different meanings about how similar two users are, given a context/dimension. For instance, to examine the amount of interests/knowledge shared between two users we can compute their similarity value with respect to their “what” dimension. We can further identify communities of users that exhibit collective knowledge by discovering linked groups of users with higher values of similarity among each other. Next, we define rigorous social operators, which we use to calculate similarity scores. Even though we focus on user similarity, any two objects can be provided as inputs to our operators.
Definition 1 (Social Dimensional Distance $\omega_d$) Given users $x$ and $y$, dimension $d$, and some similarity measure $\mathrm{sim}(x, y)$, we define $\omega_d$ as the $d$-dimensional similarity between users $x$ and $y$:

$$\omega_d(x, y) \doteq \frac{1}{|x_d|} \sum_{i=1}^{|x_d|} \mathrm{sim}(x_{d_i}, y_{d_i}), \tag{3.1}$$

where $|x_d|$ denotes the cardinality of $x_d$. $\omega_d$ is a variant of the Hausdorff point set distance measure (see footnote 7) used to compare sets, which we adapt for calculating user similarity. We normalize the similarity instead of using the min operator of the original Hausdorff distance metric, since we want to compute similarity between two users over all sub-dimensions. Like the original Hausdorff distance metric, Social Dimensional Distance is asymmetric with respect to users: $\omega_d(x, y) \neq \omega_d(y, x)$.

7 http://en.wikipedia.org/wiki/Hausdorff_distance

Definition 2 (Social Contextual Distance $\omega_c$) We define $\omega_c$ as the cumulative distance between two users under context $c$ (a set of dimensions). The contribution of each dimension $\omega_{d_i}$ is regularized by a weighting factor $\alpha$, where $0 \le \alpha \le 1$.

$$\omega_c(x, y) \doteq \frac{1}{|x_c|} \sum_{i=1}^{|x_c|} \alpha_i\, \omega_{d_i}(x_{d_i}, y_{d_i}), \quad c = \{d_1, d_2, \ldots, d_k\}. \tag{3.2}$$

Definition 3 (Social Distance $\Omega$) We define $\Omega$ as the cumulative distance between two users across all their dimensions. The result is the normalized sum of the two users' social dimensional distances. The contribution of each dimension is regularized by a weighting factor $\alpha$.

$$\Omega(x, y) \doteq \frac{1}{|x|} \sum_{i=1}^{|x|} \alpha_i\, \omega_{d_i}(x, y). \tag{3.3}$$

Definition 4 (Social Dimensional Neighborhood $\theta_d$) We define $\theta_d$ of user $x$ as the set of users whose distance from user $x$ along dimension $d$ is less than threshold $\gamma$.

$$\theta_d(x, d, \gamma) \doteq \{u \mid \omega_d(x, u) \le \gamma\}. \tag{3.4}$$

Definition 5 (Social Contextual Neighborhood $\theta_c$) We define $\theta_c$ as the set of users whose distance from user $x$ under context $c$ (a set of dimensions) is less than threshold $\gamma$. The contribution of each dimension is regularized by a weighting factor $\alpha$ ($0 \le \alpha \le 1$).

$$\theta_c(x, c, \gamma) \doteq \Big\{u \;\Big|\; \frac{1}{|x_c|} \sum_{i=1}^{|x_c|} \alpha_i\, \omega_{d_i}(x, u) \le \gamma \Big\}, \quad c = \{d_1, d_2, \ldots, d_k\}. \tag{3.5}$$

Definition 6 (Social Neighborhood $\Theta$) We define $\Theta$ as the set of users whose distance from user $x$ across all their dimensions is less than threshold $\gamma$. The contribution of each dimension is regularized by a weighting factor $\alpha$ ($0 \le \alpha \le 1$). Different thresholds may be provided for different dimensions.

$$\Theta(x, \gamma) \doteq \Big\{u \;\Big|\; \frac{1}{|x|} \sum_{i=1}^{|x|} \alpha_i\, \omega_{d_i}(x, u) \le \gamma_i \Big\}. \tag{3.6}$$

Figure 3.6 demonstrates a toy example, where three users $u_1$, $u_2$ and $u_3$ are modeled in a three-dimensional space, whose axes are movies, music, and sports. Figure 3.6(a) shows the distribution of the three users in the three-dimensional space. Figures 3.6(b)-(d) show projections of the three users in each of the three dimensions accordingly. Notably, the similarity value of users $u_1$ and $u_2$ is larger than the corresponding value for users $u_1$ and $u_3$, with respect to the movies and music dimensions. The same is not true, however, when the sports dimension is examined. In this case the similarity between users $u_1$ and $u_2$ is smaller than the corresponding value for users $u_1$ and $u_3$. This indicates greater alignment between the interests of users $u_1$ and $u_3$ with respect to sports, whereas $u_1$ and $u_2$ show greater similarity in the rest of the dimensions. The social dimensional similarity metric captures such differentiations across dimensions. Further, the use of a regularization factor in the computation of social distance can be interpreted as stretching (similarly collapsing) the space so as to make differences between users along certain dimensions more (similarly less) apparent.

Figure 3.6(d) further demonstrates the concept of social dimensional neighborhood for thresholds $\gamma_1$, $\gamma_2$, and $\gamma_3$, such that $\gamma_1 < \mathrm{sim}(u_1, u_3) < \gamma_2 < \mathrm{sim}(u_1, u_2) < \gamma_3$. Depending on the threshold value selected, three different social dimensional neighborhoods of varying size and consistency can be established for user $u_1$.
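As an illustration of Definitions 1, 3 and 4, the operators can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in this work: the nested dictionary layout for users, the Jaccard coefficient standing in for $\mathrm{sim}$, the function names, and the toy users are all hypothetical. The neighborhood applies the threshold exactly as written in Eq. (3.4).

```python
from typing import Callable, Dict, Optional, Set

# Assumed layout (illustration only): user -> dimension -> sub-dimension -> values.
Dimension = Dict[str, Set[str]]
User = Dict[str, Dimension]
Sim = Callable[[Set[str], Set[str]], float]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Stand-in for sim(x, y); any measure with values in [0, 1] would do."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def omega_d(x: User, y: User, d: str, sim: Sim = jaccard) -> float:
    """Eq. (3.1): mean of sim over the sub-dimensions that x has along d."""
    x_d = x.get(d, {})
    if not x_d:
        return 0.0
    y_d = y.get(d, {})
    return sum(sim(vals, y_d.get(k, set())) for k, vals in x_d.items()) / len(x_d)


def big_omega(x: User, y: User, alpha: Optional[Dict[str, float]] = None,
              sim: Sim = jaccard) -> float:
    """Eq. (3.3): weighted, normalized sum of omega_d over the dimensions of x."""
    if not x:
        return 0.0
    alpha = alpha or {}  # missing weights default to 1.0 (an assumption)
    return sum(alpha.get(d, 1.0) * omega_d(x, y, d, sim) for d in x) / len(x)


def theta_d(x_name: str, users: Dict[str, User], d: str, gamma: float,
            sim: Sim = jaccard) -> Set[str]:
    """Eq. (3.4), applied literally: users u with omega_d(x, u) <= gamma."""
    x = users[x_name]
    return {name for name, u in users.items()
            if name != x_name and omega_d(x, u, d, sim) <= gamma}


# Hypothetical toy users, invented for this sketch.
users = {
    "u1": {"movies": {"genre": {"drama", "comedy"}, "director": {"nolan"}},
           "sports": {"team": {"lakers"}}},
    "u2": {"movies": {"genre": {"drama"}}, "sports": {"team": {"celtics"}}},
    "u3": {"movies": {"genre": {"horror"}}, "sports": {"team": {"lakers"}}},
}

print(omega_d(users["u1"], users["u2"], "movies"))  # normalized by |u1's movies|
print(omega_d(users["u2"], users["u1"], "movies"))  # normalized by |u2's movies|
print(theta_d("u1", users, "movies", 0.1))
print(theta_d("u1", users, "movies", 0.3))
```

The first two calls return different values because the sum is normalized by the sub-dimensions of the first argument, which is the asymmetry noted after Definition 1. Raising the threshold in `theta_d` can only grow the returned set.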
Namely, for threshold $\gamma_1$, $u_1$'s social neighborhood is empty, whereas for threshold $\gamma_2$, the neighborhood consists of user $u_3$ only. Finally, the neighborhood includes both users $u_2$ and $u_3$ for threshold $\gamma_3$, as a result of the value of $\gamma_3$ being greater than the largest similarity value. Figure 3.6(e) shows how the contextual distance is computed when context is defined in terms of music and sports. Even though users $u_1$ and $u_3$ are far apart with respect to the music dimension, overall they are “close” when the sports dimension is considered as well. Overall, when all dimensions are treated equally, the computation of social distance will result in $u_1$ being closer to $u_2$ than to $u_3$, as expected. Assigning variable damping factors to different dimensions will result in different outcomes. In that sense both our social dimensional and contextual similarity operators can reveal (similarly hide) hidden correlations in specific dimensions, which would otherwise remain unnoticed when only examining a flat vector representation of users, that is, without differentiating the context under which the similarity metric is to be computed in each case.

Figure 3.6: Toy example of social distance between users in a three-dimensional space consisting of movies, music and sports.

3.3 rESONAtE: Semantic Social Network Analysis for the Enterprise

An enterprise's success is bounded by its capacity for quick adaptation and spontaneity in technological changes and paradigm shifts, rapid response to customer demands through anticipation of future opportunities, and ability to predict, detect and alleviate risk factors and threats. “The competitiveness of firms is related to the adequacy of their decisions, which depends heavily on the quality of available information and their ability to capitalize, enrich and distribute this relevant information to people who will make the right decisions at the right moment” [59].
In the modern enterprise, engineers typically spend 40%-60% of their time seeking information [45]. A system that enables quick expert identification and facilitates interdisciplinary cooperation that spans organizational charts, lessening time spent on searching for solutions, is pivotal for an enterprise's success. The end goal for an enterprise is not storing and managing lots of raw data, but instead getting to newer actionable business insights faster. We argue that in order for enterprises to get to such insights faster, there is an imperative need for a platform that enables quick, rich, and novel data exploration in multiple, intuitive ways, gleaning information from multiple communication mediums and leveraging it into knowledge.

However, due to the way data is generated in a modern enterprise, data management has become increasingly challenging. In practice, users' activities are scattered across various collaboration tools used in the enterprise, leaving behind structured, unstructured or semistructured information traces in multiple formats. A user might choose chat or microblogging services for casual Q&A sessions, email correspondence for document and idea sharing, and SharePoint for project tracking purposes. Furthermore, a user may adopt different tools for different projects, or utilize different tools for the same project, depending on current needs. In general, the existence of multiple communication channels between employees in the enterprise information ecosystem scatters information related to a specific employee, establishing the need for an integrated view of users' activities across platforms.

Integrating users' activities across multiple collaboration tools is not an easy task. First, content from collaboration activities may be very short (e.g., 140 characters in Twitter) and inherently noisy.
For instance, microblogging content does not adhere to any grammatical or syntactical rules, and contains slang terms, user-defined hashtags, and emoticons or other special characters, which denote emotions or user-defined notions, the semantics of which may be unknown or not previously modeled. Second, users' activities on various collaboration tools signal different kinds of relations, personal or professional [208], of unequal importance [201]. Third, information heterogeneity due to different formats, schemata or storage mechanisms imposes further restrictions. Fourth, capturing employees' interests and areas of expertise is a time-sensitive task. Existing methods model users' interests based on static profiles or by keeping track of users' collaborative activities. However, users' profiles may be completely unavailable or extremely sparse, since users often do not provide enough information to describe themselves. Profiles may also be obsolete if users do not constantly update them to match their most up-to-date interests.

In this work, we primarily focus on employees' interests and areas of expertise, as well as interconnections between employees' work-related activities and their social interactions on collaboration platforms at the workplace. User activity and behavioral data in this context contains valuable information. Users' interests at the personal and professional level can be discovered, and interesting communication motifs can be mined, enhancing our understanding of employees' communication patterns as well as patterns of information propagation and browsing in enterprise networks. Analysis of enterprise social interactions may lead to various insights, both at the atomic (micro) and the collective (macro) level.
Enterprises can utilize the results stemming from macro analysis of informal interactions to better understand how their employees work together to complete tasks or produce innovative ideas, reveal trends, and identify experts and influential individuals, so as to evaluate and adjust their management strategy, team building and resource allocation policies. Similarly, employees can benefit in multiple ways. Recommendation services can provide better results in terms of “interesting” people to connect to, as well as suggest “interesting” discussions for employees to contribute to or projects to get involved in. Information filtering algorithms can promote a relevant subset of news instead of directly delivering all sorts of irrelevant data to employees, alleviating information overload and enabling them to focus on information that does matter. Information acquisition, such as searching for people, data and answers to problems, can be significantly sped up, resulting in increased productivity through collaboration and problem deduplication.

We instantiate our model of informal communication data (see Section 3.1) in the form of an ontology. We call this ontology rESONAtE (Semantic Social Network Analysis for the Enterprise). The main reason we use an ontology is that it provides a generic, reusable, and machine-understandable model for representing the concepts and properties required for describing user activities and measuring their behavior. Figure 3.7 depicts the coverage of the rESONAtE ontology. Figure 3.8 shows an overview of classes in our ontology. Due to dynamism at the workplace, we regularly update our ontology for it to effectively track temporal changes, as shown in Figure 3.9. Instead of focusing on static properties such as personal information, we use our ontology to leverage dynamic enterprise data.
For instance, an employee may have multiple supervisors while being assigned to multiple projects, and at the same time maintain a list of email contacts. Hence, employees may have various types of temporal connections, in multiple contexts. Previous positions and projects provide hints on employees' specialization areas; however, peer validation in the form of informal interaction acts as supporting evidence of the level of expertise of individuals.

Figure 3.7: rESONAtE Ontology Coverage

Figure 3.8: rESONAtE Ontology

Figure 3.9: rESONAtE Workflow

Finally, we mine the content of informal interactions between employees to capture their expertise in a latent topic space. In our work, we adopt the Author-Recipient Topic model (ART) [174] to discover topics in employees' messages in internal social media. Particularly, the ART model conditions the distribution over topics on both the sender and the recipient of a message. Therefore, latent topics are discovered according to relationships between people, instead of individual messages or authors. Using ART we are therefore able to identify not only how often employee pairs interact, but also which topics are most prevalent in their discussions.

3.4 Case Study on Real-World Informal Communication Data

In this section, we present a case study on a large-scale dataset of informal communication logs from a Fortune 500 company (see Chapter 2, Section 2.3.1). The dataset contains users' activity (message exchanges) and interactions (e.g., comment/reply, tagging), but lacks explicit social relationships (e.g., friend-of or followee/follower) between users. We compensate for the lack of social relationships with a snapshot of the formal organizational hierarchy of users (see Chapter 2, Section 2.3.2). Each employee is represented with a 4-character id (e.g., “AECF”). The organization level of each user is denoted using a character from “A” to “M”.
Relationships are not symmetric: employee AECF may send a message to employee AAAF, but AAAF can choose not to reply. Similarly, employee AECF may be the supervisor of employee AAAF. Thus we have a directed labeled graph, with multiple relations between users.

For our analysis, we have built a web-accessible user interface using ASP.NET technology. The system relies on a back-end Virtuoso server (see footnote 8), which stores our ontology and populated instances. The system facilitates complex semantic search over the informal communication corpus. In fact, our system enhances collaborative data analysis in the enterprise, revealing latent topics, expertise, and interests, both at the micro (single user) and the macro (e.g., department or business unit) level. Next, we present some typical use case scenarios that address specific informational needs in an enterprise context, together with the corresponding SPARQL queries and sample results.

8 http://www.virtuoso.com/

3.4.1 Multidimensional Expert Identification

Employees' level of expertise may vary from topic to topic and from medium to medium. One might share innovative ideas and contribute to discussions through emails, but not in microblogging sites. We differentiate expertise according to communication channel, time and content type. A micro-level query that retrieves topics and levels of expertise for a given employee follows (see Query 3.1 and Figure 3.10(a)). Expertise levels are quantized, taking integer values between 1 (less expert) and 5 (authority).

    PREFIX resonate: <http://local.virt/rdf_sink/user.rdf#>
    SELECT ?Topic ?Name ?Level
    WHERE {
      resonate:Employee_AABG resonate:hasExpertise ?Expertise .
      ?Expertise resonate:hasinTopic ?Topic .
      ?Topic resonate:hasTopicName ?Name .
      ?Expertise resonate:hasExpertiseLevel ?Level .
    }

SPARQL Query 3.1: Given employee, retrieve her areas of expertise and expertise levels.
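To make the join performed by Query 3.1 concrete, the sketch below walks the same four triple patterns over a small in-memory list of triples. It is an illustration only: the triples, topic names and expertise levels are invented for this example and do not come from the dataset, the helper names are hypothetical, and the nested loops are a hand-rolled stand-in for a SPARQL engine such as the Virtuoso back end.

```python
# Hypothetical triples (subject, predicate, object), invented for illustration.
TRIPLES = [
    ("Employee_AABG", "hasExpertise", "Expertise_1"),
    ("Expertise_1", "hasinTopic", "Topic_7"),
    ("Expertise_1", "hasExpertiseLevel", 4),
    ("Topic_7", "hasTopicName", "scheduling"),
    ("Employee_AABG", "hasExpertise", "Expertise_2"),
    ("Expertise_2", "hasinTopic", "Topic_9"),
    ("Expertise_2", "hasExpertiseLevel", 2),
    ("Topic_9", "hasTopicName", "reporting"),
]


def objects(subject, predicate, triples=TRIPLES):
    """All objects o such that (subject, predicate, o) is in the triple list."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def expertise_of(employee, triples=TRIPLES):
    """Mirror the four triple patterns of Query 3.1 as nested lookups."""
    rows = []
    for expertise in objects(employee, "hasExpertise", triples):
        for topic in objects(expertise, "hasinTopic", triples):
            for name in objects(topic, "hasTopicName", triples):
                for level in objects(expertise, "hasExpertiseLevel", triples):
                    rows.append((topic, name, level))  # ?Topic ?Name ?Level
    return rows


for row in expertise_of("Employee_AABG"):
    print(row)
```

Each shared variable in the query (`?Expertise`, `?Topic`) becomes a loop variable here, which is why the expertise node has to be resolved before its topic and level can be looked up.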
3.4.2 Contextual Ego-Network Analysis and Community Detection

Analysis of employees' interactions reveals hidden dynamics at various granularities. At the micro level, the topic-specific ego-networks of different users may reveal patterns that vary across topics. Alternatively, it is possible to detect virtual topical communities (macro level), by clustering users with similar interests. A micro-level query (see Query 3.2) used to generate the topic-specific ego-network (Figure 3.10(b)) of an employee follows.

Figure 3.10: Partial result set for (a) Query 3.1 and (b) Query 3.2.

    PREFIX resonate: <http://local.virt/rdf_sink/user.rdf#>
    SELECT ?User ?Topic
    WHERE {
      resonate:Employee_AABF resonate:hasConection ?Connection .
      ?Connection resonate:hasContentContext ?Interest .
      ?Interest resonate:hasinTopic ?Topic_no .
      ?Topic_no resonate:hasTopicName ?Topic .
      ?Connection resonate:hasconnectedTo ?User .
    }

SPARQL Query 3.2: Given employee and topic, create topic-specific ego-network.

3.4.3 Trends Macro-Analysis

Discovery of trending topics is a typical application in social network analytics. Typically, trending topics are identified for a single communication channel of a single network. Such analysis may be insufficient for an enterprise. Instead, analysis at the macro level can prove to be more useful in discovering, for instance, trending topics and significant contributors in various collaboration platforms, different departments and organizational positions. Given a trending topic, Query 3.3 retrieves its most prominent keywords, in all contexts. Figure 3.11(a) shows the result.

    PREFIX resonate: <http://local.virt/rdf_sink/user.rdf#>
    SELECT ?Keyword ?Probability
    WHERE {
      resonate:Topic_97 resonate:hasKeyword ?Topic_Keyword_No .
      ?Topic_Keyword_No resonate:hasKeyword ?Keyword .
      ?Topic_Keyword_No resonate:hasProbability ?Probability .
    }

SPARQL Query 3.3: Get probability distribution of trending keywords in a topic.
3.4.4 Communication Dynamics

Organizational hierarchy is static and dated, whereas communication may create “shortcuts”, i.e., collaboration that spans hierarchical levels when seeking help, offering guidance, etc. Studying communication may reveal hidden organizational dynamics. Our system enables analysis of information propagation across the rigid structure of formal organizational charts, providing insights on the collaboration processes that employees devise in order to tackle everyday problems and find solutions.

Our intention is to better understand how information propagates across corporate borders, and identify potential shortcuts in the organizational chart, as well as better understand how employees collaborate to tackle everyday problems and find solutions. To this end, we compare employees' connections with respect to the formal organizational hierarchy (see Query 3.4) to online informal communication interactions. Visualizations such as the one shown in Figure 3.11(b) can provide hints on communication dynamics over the formally imposed structure.

Figure 3.11: Partial result set for (a) Query 3.3 and (b) Query 3.4.

    PREFIX resonate: <http://local.virt/rdf_sink/user.rdf#>
    SELECT ?Employee ?Position ?Organization_Level
    WHERE {
      resonate:Employee_AABF resonate:hasConection ?Connection .
      ?Connection resonate:hasconnectedTo ?Employee .
      ?Employee resonate:hasPosition ?Position .
    }

SPARQL Query 3.4: Given employee, find direct links in the organizational hierarchy.

3.5 Summary

Building models of complex networks and understanding their rich properties, hidden structures and dimensional interdependencies are not trivial tasks. This chapter provides a novel framework for capturing the multidimensional nature of complex, informal, social interactions in the form of orthogonal dimensions. Our model integrates structural information with dynamic collaboration modalities from multiple heterogeneous networks.
Our model is extensible, since it allows the definition of new metrics, and at the same time is generic, since it permits arbitrary similarity measures to be used for various dimensions. We utilize semantic web techniques for our conceptual modeling and representation, since this approach enables seamless integration of shared domain ontologies and Linked Open Data.

Our view of users' online presence depends not only on their connectivity, but also on how they interact, and under which context. We have explored the issue of capturing and preserving previously unmodeled knowledge, which is generated under diverse contextual informal interactions of multiple types, both explicit and implicit. We argued that information extraction from informal communications can prove beneficial to enterprises that have adopted various forms of collaboration services, like microblogging. Knowledge is discovered, captured and inferred based on such complex information. Our social algebraic operators facilitate dimensional and contextual analysis and mining of social data from multiple views and perspectives, enabling deep exploration and understanding of both explicit and implicit interactions and relations. Particularly, we discussed illustrative use case scenarios on a real-world dataset from a Fortune 500 company, which demonstrated how our approach can improve collaborative data analytics for the enterprise, both at the micro and the macro level. Our model is not limited to enterprise social data, but is extensible and applicable to other domains.

In this work we consider users' social activities, which offer emergent semantics and allow expert identification to be conducted in a casual, informal context. We believe that a richer and deeper understanding of such activities is necessary.
Our future work in this direction will focus on integrating and experimenting with more communication channels, as well as conducting experiments on informal corporate networks and online social networks to quantitatively evaluate the effectiveness of our approach. The following chapters build on the premises of our formal modeling of informal interactions.

Chapter 4

Informal Communication at the Workplace

Social networks have revolutionized the way people communicate and interact, while serving as a platform for information dissemination, content organization and search, expertise identification, and influence discovery. The popularity of online social networks like Facebook and Twitter has given researchers access to massive quantities of data for analysis. Such datasets provide an opportunity to study the characteristics of social networks in order to understand the dynamics of individual and group behavior, underlying structures, and local and global patterns that govern information flow.

Most of the analysis performed thus far has focused on publicly available online social networks [154]. However, microblogging capabilities have lately been adopted by enterprises as part of their day-to-day operations [216]. Contrary to online social networks, the main purpose of corporate microblogging services is to promote and enable colleagues' communication and collaboration, and professional sharing within the enterprise [208]. Organizations hosting internal social networking sites can benefit from mining employees' informal interaction logs, as well as from identifying reliable indicators of expertise. However, we have limited understanding of how employees make use of enterprise microblogging services. The topological characteristics of enterprise social networks have thus far not been studied, partially due to the lack of available datasets.
Within an enterprise, microblogging behavior is bounded by the main business and work culture, work practices, and everyday problems, with emphasis mostly on the business perspective. Being more focused and less noisy than online social networks, enterprise microblogging platforms offer unique opportunities for identifying experts, as well as studying and understanding the processes of knowledge generation and sharing. Connecting employees within an organization can result in multiple benefits both for employees and corporations. Employees can “get help or advice, reach opportunities beyond those available through existing ties, discover new routes for potential career development, learn about new projects and assets they can reuse and leverage, connect with subject matter experts and other influential people within the organization, cultivate their organizational social capital, and ultimately grow their reputation and influence within the organization” [83]. Enterprises, on the other hand, can directly benefit from increased productivity due to reduced time spent in team building and knowledge propagation, as well as expertise mining and preservation [36].

In this chapter, we study the dynamics of interactions within a complex network, inferred from informal threaded discussions between employees at the workplace. In particular, we provide an extensive analysis of enterprise microblogging data, collected from a large, multinational corporation over a one-year period (see Chapter 2, Section 2.3.1). We study the topological properties of the post–reply network, and the dynamics between its social and topical components. We study homophily, finding a substantial level of alignment with respect to the network structure and users' activity, as well as latent topical similarity and link probability.
Our analysis suggests that users with strong local topical alignment tend to participate in focused interactions, whereas users with dispersed interests contribute to multiple discussions, broadening the diversity of participants. The existence of a large, strongly connected core suggests that high–degree nodes located at its center exhibit characteristics of expertise, conceptualized by frequent message exchanges with other nodes. Such high–degree nodes are therefore critical for the connectivity and flow of information at the workplace.

4.1 An empirical analysis of microblogging behavior in the enterprise

The structure and evolution of online social networks has been investigated in detail [116, 154]. [3] analyzed Cyworld, MySpace and Orkut. [116] examined two online social networks and found that each contains a large strongly connected component. [70] observed that users in online social networks tend to form tightly knit groups. [5] and [159] examined the small–world properties (small diameter and high clustering) of different networks, whereas [110] proposed a model that captures such properties. The network we examine in this chapter exhibits small–world properties much like other general-purpose online social networks. [143] investigated patterns in Flickr user activity and examined vocabulary overlaps between user pairs. [177] focused on topical and lexical alignment among users who lie close to each other in the Flickr social network and exploited this alignment as an indicator of user connectivity.

On the other hand, most studies on microblogging networks have focused on Twitter. [115] presented a detailed characterization of Twitter, identified distinct classes of Twitter users and their behaviors, geographic growth patterns and the current size of the network. [97] explored Twitter's topological and geographical properties, analyzed user interactions at the community level and showed how users with similar interests connect to each other.
[217] explored the factors that influence people's tendency to share personal information on Twitter, and examined microblogging's potential impact on informal communication at work. Through a qualitative study, they concluded that microblogging at the workplace can assist in building stronger personal bonds between colleagues, rather than being used for professional benefits, even though they hinted that microblogging provides a complementary informal communication channel for coworkers to share and exchange information and ideas. In contrast, [42] studied the impact of social capital factors (social network, social trust, and shared goals) on a person's volition to share knowledge in collaborative tools at the workplace. [208] investigated workplace relationships built between coworkers using microblogging services and determined interaction patterns that signal personal versus professional closeness between colleagues. [216] provided a systematic examination of the adoption and usage of microblogging in a corporate environment, emphasizing the perceived benefits of corporate microblogging and barriers to adoption. Key factors influencing the adoption of microblogging systems in the workplace include: privacy concerns, communication benefits, perceptions regarding signal–to–noise ratio, codification effort, reputation, expected relationships, and collaborative norms [81]. [57] examined microblogging in the workplace with emphasis on content type (ratio of information sharing messages versus questions and status updates) and users' microblogging behavior as a function of their motivation. Their main focus was on providing qualitative results and insights on the reasons and ways people utilize microblogging for communication in a corporate environment.
[192] observed emergent question and answer (Q&A) behaviors in enterprise social networking systems, arguing that users employ such mediums for non–urgent information-seeking needs and perceive question asking as a way to elicit social support from their professional networks.

However, we have limited understanding of how employees make use of enterprise microblogging services. [128] discussed various challenges and solutions in conducting social network analysis in the enterprise. [91] explored the properties of a medium–scale professional social network, whereas [193] proposed a semantic model for the analysis of professional and institutional “skills networks”. Our study focuses on the social connectedness of an extracted corporate communication network, as well as the properties and characteristics of social content, tie formation and information flow, in the context of a corporate microblogging service. [213] showed that online content exhibits rich temporal dynamics. They studied temporal patterns of online content popularity and factors that impact the attention that content receives on the Web, where different pieces of content compete for attention. Such analyses, however, are lacking for microblogging data in the context of the workplace.

4.1.1 Analysis of Network Structure

Let us denote the average in–degree (number of users $j$ who have sent a message to user $i$) by $d_{in}$ and the average out–degree (number of users $j$ to whom user $i$ has sent a message) by $d_{out}$. Then, $d_{in} = 1.07$ and $d_{out} = 1.07$, while the average degree is $d = 2.14$. It has been shown that the degree distributions of many complex networks, including social networks, conform to power laws [154]. The post–reply network shows behavior consistent with a power–law network. Figure 4.1 shows the probability distributions of the number of neighbors $k$ (in–degree and out–degree respectively). For reference, Table 4.2 reports the mean and variance of in/out–degree.
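The degree statistics above can be derived from a directed edge list of (sender, recipient) pairs. The following is a minimal sketch rather than the code used for this analysis: the function names and the toy edge list are hypothetical, and dropping duplicate edges and self-loops is an assumption about how distinct neighbors are counted.

```python
from collections import Counter


def degree_stats(edges):
    """Per-node and average in/out-degree from (sender, recipient) pairs.

    Assumptions: repeated messages between the same pair count once,
    self-loops are dropped, and only users with at least one edge are counted.
    """
    unique = {(u, v) for u, v in edges if u != v}
    nodes = sorted({n for edge in unique for n in edge})
    if not nodes:
        return {"in": {}, "out": {}, "avg_in": 0.0, "avg_out": 0.0}
    in_deg = Counter(v for _, v in unique)
    out_deg = Counter(u for u, _ in unique)
    avg = len(unique) / len(nodes)  # every edge adds one in- and one out-degree
    return {
        "in": {u: in_deg[u] for u in nodes},
        "out": {u: out_deg[u] for u in nodes},
        "avg_in": avg,
        "avg_out": avg,
    }


def degree_distribution(degrees):
    """P(k): fraction of nodes whose degree equals k."""
    counts = Counter(degrees)
    n = len(degrees)
    return {k: c / n for k, c in sorted(counts.items())}


# Hypothetical toy edge list, for illustration only.
edges = [("A", "B"), ("A", "C"), ("B", "A"), ("C", "A"), ("D", "A"), ("A", "B")]
stats = degree_stats(edges)
print(stats["avg_in"], stats["avg_out"])
print(degree_distribution(list(stats["in"].values())))
```

Because every directed edge contributes exactly one unit of in–degree and one unit of out–degree, the two averages always coincide, which is consistent with $d_{in} = d_{out} = 1.07$ reported above.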
To test how well the in–degree and out–degree distributions are modeled by a power law, we calculated the best power–law fit using the maximum likelihood method [43], which minimizes the Kolmogorov–Smirnov statistic, $D$, between the cumulative distribution function of the data and the power law:

$$\hat{\alpha} = \arg\min_{\alpha} D, \qquad D = \max_{x} \left| P_{emp}(x) - P_{\alpha}(x) \right|,$$

where $P_{emp}(x)$ and $P_{\alpha}(x)$ denote the cdfs of the data and the power law with exponent $\alpha$, respectively. Table 4.1 shows the estimated power–law coefficient along with the Kolmogorov–Smirnov goodness–of–fit metric, suggesting that a power law approximates the distributions very well for our dataset. Table 4.1 also shows that the distribution of outgoing links is similar to that of incoming links, as in previously observed online social networks [154].

Figure 4.1: Distribution of users' in–degree (number of users one has received answers from) and out–degree (number of users one has answered). Both axes are in logarithmic scale.

Table 4.1: Power–law coefficient ($\alpha$) estimates and corresponding Kolmogorov–Smirnov goodness–of–fit ($D$) metrics. For reference, we also provide estimates for previously studied online social networks [154].

    Network                   in–degree           out–degree
                              α       D           α       D
    Corporate Microblogging   1.84    0.0237      2.04    0.0468
    Flickr                    1.78    0.0278      1.74    0.0575
    LiveJournal               1.65    0.1037      1.59    0.0783
    Orkut                     1.50    0.6203      1.50    0.6319
    YouTube                   1.99    0.0094      1.63    0.1314
    Web                       2.09    -           2.67    -

The existence of edge $e_{ij}$ does not guarantee that the reciprocal edge $e_{ji}$ also exists. Hence the relationship is not symmetric. If user A sends a message to user B, the edge $e_{AB}$ is created, but not vice versa. We call user B the “follower” of user A. If B also replies to A, then they are each other's “mutual followers”. Figure 4.2(a) shows the scatter plot of the number of followees versus the number of followers. The points are scattered around the diagonal, indicating equal numbers of followers and followees (possibly indicating reciprocal/mutual links). Figure 4.2(b) presents the cumulative distribution of the out–degree to in–degree ratio, exhibiting high correlation between in–degree and out–degree.

Figure 4.2: (a) Scatter plot of the number of followers and the number of followees, with linear fit $y = 0.94x + 0.059$. (b) CDF of out–degree to in–degree ratio.

High correlation of symmetric links due to users' tendency to reply when they receive a message from other users is expected. Our analysis of the level of symmetry in the directed post–reply network reveals that the degree of symmetry is not as significant as one would expect. Overall, the post–reply network exhibits a low level of reciprocity with only 21.49% symmetric links, whereas the percentage of symmetric links in the largest connected component is 23.18%. Our results align very well with those reported in [119] for reciprocity in Twitter. Following similar reasoning to [119], we conjecture that users tend to share information with all their colleagues, sending messages to the corporate–wide feed, rather than exchanging one–to–one messages, even if a conversation is initiated with a directed message between two users. Further validation is out of the scope of this research, and we leave it for future work.

We now examine clustering, which quantifies how densely the neighborhood of a node is connected. Not all nodes are connected in one cluster. There are $N_{cc} = 3{,}570$ strongly connected components.
The largest strongly connected component encompasses $N_{cc}^{max} = 582$ nodes (13.8% of the network). Figure 4.3(a) (top) shows the histogram of connected component sizes. The clustering coefficient of node $u$, with set $S_N$ of $N$ neighbors, is defined as the number of directed links that exist between nodes in $S_N$, divided by the number of all possible directed links $N(N-1)$ between the nodes in $S_N$. The clustering coefficient of the network, $c$, is 0.0335. Figure 4.3(a) (bottom) shows the histogram of individual clustering coefficients at each node. Due to the fact that we compute this metric over the complete network, the clustering coefficient of the graph is low. However, in a random network with the same number of nodes ($N$) and degree ($d$), $c = \frac{d}{N} = 0.0005$ [11].

Figure 4.3: (a) Top: Histogram of connected component sizes. Bottom: Histogram of clustering coefficients. (b) Top: Histogram of clustering coefficients in LCC. Bottom: Average clustering coefficient as a function of degree in LCC.

Figure 4.3(b) (top) shows the histogram of clustering coefficients for the largest strongly connected component, which contains 582 nodes, with an average node degree $d_{LCC} = 12.97$. The clustering coefficient of the largest strongly connected component is $c_{LCC} = 0.2311 \gg c_{random} = 0.0223$. Similarly, the clustering coefficient of the largest weakly connected component, which encompasses $N_{wcc}^{max} = 837$ nodes (19.87% of the network), is $c_{WCC} = 0.2311 \gg c_{random} = 0.0122$. Figure 4.3(b) (bottom) shows how the clustering coefficient of nodes varies as a function of node degree. The average clustering coefficient follows a decreasing trend with increasing node degree.
It is higher for nodes of low degree, suggesting significant clustering among low–degree nodes. This evidence of strong local clustering supports the intuition that people tend to be introduced to others via mutual contacts, thus increasing the probability that two neighbors $u$ and $v$ of user $z$ are connected themselves [154].

Next, we look at the properties of shortest paths between users in the largest weakly connected component. As only 21.49% of links are reciprocal, we expect the average path length between any two users to be longer than in other known networks. The average path length is $L_{av} = 3.5677$ and the diameter is $D_{max} = 11$. In practice, $L_{av}$ is the average minimum number of users one needs to cross to get a message from one user to another. To obtain $L_{av}$, we average the shortest paths between all user pairs [204]. Even though the graph is directed, these values are remarkably short and quite similar to the corresponding values for Flickr and Orkut [154]. In a random network with the same number of nodes ($N$) and degree ($d$), $L_{random} = \ln(N)/\ln(d) = 2.9015$ [11, 204]. The small average path length ($L_{av}$ is almost as small as $L_{random}$) and the strong local clustering of this network ($c \gg c_{random}$) qualify the post–reply network as a small–world network [204].

4.1.2 Content Analysis

In this section we take a close look at the content aspect of the post–reply network, focusing on numerous user activities. We further investigate the correlations between such activities. Figure 4.4 shows the probability distributions of the number of messages $n_m$ and the number of replies $n_r$ per user, the distribution of the number of groups $n_g$ to which a user belongs, and the probability of finding a user with a number $n_t$ of distinct hashtags in their vocabulary.
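As an illustration of the shortest-path statistics discussed above, the following Python sketch computes an average path length and a diameter with breadth-first search, together with the random-network estimate $L_{random} = \ln(N)/\ln(d)$. The adjacency-list representation, the function names, and the choice to average only over reachable ordered pairs are assumptions of this sketch; whether links are followed in one or both directions is decided by the adjacency list passed in.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def path_stats(adj):
    """Average shortest path length and diameter over reachable ordered pairs."""
    total, pairs, diameter = 0, 0, 0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    average = total / pairs if pairs else 0.0
    return average, diameter


def random_path_length(num_nodes, avg_degree):
    """Estimate L_random = ln(N) / ln(d) for a random network."""
    return math.log(num_nodes) / math.log(avg_degree)


# Hypothetical toy chain a - b - c - d, with links followed in both directions.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
l_av, d_max = path_stats(chain)
```

Running one breadth-first search per node costs O(N(N + E)) overall, which is affordable for a component with a few hundred nodes.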
Figure 4.4 further shows the total number $t$ of hashtag assignments per user (a hashtag used twice is counted twice) and the total number $g$ of group related messages per user (the number of messages sent to a group instead of binary group membership). More precisely, if $f_u(t)$ is the frequency of hashtag $t$ being used by user $u$, then the total number of hashtag assignments of user $u$ is given by $t_u = \sum_t f_u(t)$. Similarly, if $f_u(g)$ is the number of times user $u$ has sent a message to group $g$ (either privately to another group member or broadcast to the group), then the total number of group messages of user $u$ is given by $g_u = \sum_g f_u(g)$.

All activities show behavior consistent with power law networks; the majority of users show small activity patterns, with few nodes being significantly more active. All distributions are broad, indicating that the activity patterns of users are highly heterogeneous. For reference, Table 4.2 reports the mean and variance of different activities.

[Figure 4.4: From left to right and top to bottom, distribution of the number of messages $n_m$ per user, the number of replies $n_r$ per user, the number $n_g$ of distinct groups to which a user's messages belong and the number $g$ of group related messages per user, the number $n_t$ of distinct hashtags per user and the number $t$ of hashtag assignments per user.]

Some comments are in order. First, the average number of messages per thread is 2.02, while the ratio of the number of messages to the number of replies is ≈ 1.011. Even though these statistics indicate shallow conversations on average, we found that this is not the case overall. The mean is so small due to the heavy–tailed distribution of the number of messages per user.
Further, even though the average number of messages per user is ≈ 4, the average number of replies per user is considerably higher (≈ 7.3), indicating users' tendency towards directional communication instead of personal status updates or sharing of news. The study by [216] reports a 25% average of “conversation seeking” type of messages in an enterprise social network. We visually inspected randomly sampled messages and found that approximately 35% are “share news” type of messages, which solicit some sort of response. The combined average of ≈ 60% in the study by [216] aligns quite well with our findings here. We leave the classification of messages into private vs. professional, and the examination of correlations with various features, which may lead to different results, for future work.

Table 4.2: Averages and fluctuations of user activities

Activity x    E[x]      Var[x]
in–degree     1.0197    25.2624
out–degree    1.0197    23.6404
n_m           3.9195    321.6220
n_r           1.5196    84.7008
g             3.9067    321.3277
n_g           1.1721    0.6040
t             0.4024    39.7527
n_t           1.2601    9.1692

Intuitively, the number of replies cannot exceed the number of messages being exchanged. We explain the higher reply count as a result of our modeling of the posting–replying activity in the enterprise microblogging service. Consider for example the scenario depicted in Figure 2.2. Users A, B and C each send an original message, starting three discussion threads. Users B and C reply to A, whereas D and C reply to B's message. C does not receive any replies at all. Assuming all messages are public, the discussion starter messages of A, B and C are delivered to all users' news feeds. In our modeling we count such messages once, even though each has been delivered to all users, and we do not create corresponding edges (e.g. $e_{AB}$, $e_{AC}$, $e_{AD}$ for the message sent by user A). Instead, when in our example users B and C each reply personally to user A (C replies twice), we create directed edges $e_{BA}$, $e_{CA}$, and count 3 replies (1 from B to A and 2 from C to A).
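The counting convention in this example can be made concrete with a short Python sketch. It encodes the four-user scenario as a list of thread starters and a list of (replier, author) pairs, creates one directed edge per pair, and keeps the number of replies as an edge weight. The variable names and the list-of-tuples encoding are assumptions of the sketch rather than the data model actually used.

```python
from collections import Counter

# Authors of the three original (thread-starting) messages; each counts once.
posts = ["A", "B", "C"]

# One (replier, original author) pair per reply; C replies to A twice.
replies = [("B", "A"), ("C", "A"), ("C", "A"), ("D", "B"), ("C", "B")]

# Broadcast posts create no edges; only replies create directed edges.
edges = set(replies)
reply_weight = Counter(replies)

users = set(posts) | {user for pair in replies for user in pair}
replies_to_A = sum(w for (_, author), w in reply_weight.items() if author == "A")

avg_messages = len(posts) / len(users)    # 3 messages over 4 users
avg_replies = len(replies) / len(users)   # 5 replies over 4 users
```

Because a reply is counted every time it is sent while a broadcast post is counted once, the average number of replies per user can exceed the average number of messages per user, as it does in this toy scenario.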
If we were to create all edges among users based on posting activity, the resulting network would be fully connected, forming a clique encompassing all nodes and rendering any analysis meaningless. In this scenario, the average number of messages per user is $\frac{\text{number of messages}}{\text{number of users}} = \frac{3}{4} = 0.75$, whereas the average number of replies per user is $\frac{\text{number of replies}}{\text{number of users}} = \frac{5}{4} = 1.25$.

The number of group related messages ($g$) is small on average, but its variance is quite high. The use of hashtags ($t$) is relatively low, with the average being ≈ 0.4 hashtags per message, which is approximately similar to traditional Twitter datasets [214]. Tagging allows users to organize web resources (e.g. photos in Flickr, bookmarks in Delicious or tweets in Twitter). Twitter users adopted hashtags in an attempt to alleviate the significant information overload that the streaming nature of social media imposes on users interested in specific topics. [92] examined tagging strategies followed by Twitter users for content management and filtering.

In many online social networks, users with shared interests may create and join groups. In the corporate microblogging service, users are able to create and join groups to collaborate with smaller teams. Messages sent within group boundaries are broadcast to group members only, while private message exchanges among group members are also feasible. We found that the average number of messages per group is 24.6, indicating considerably high activity patterns across all groups (see Figure 4.5(a)). Given the relatively low average number of hashtags per message, and the distribution of messages per group, we conjecture that rather than using hashtags, users of the corporate microblogging service mostly rely on group membership for content organization.
Finally, the sparsity of text in microblogging social networks has traditionally been a hurdle for researchers interested in performing statistical analysis of microblogging content. We counted the average number of words per message in the post–reply network and found it to be ≈ 29.1, with a variance of 1360. Even though the distribution is heavily skewed (see Figure 4.5(b)), this number is quite high, indicating that many messages are adequately descriptive and could be safely used for statistical analysis (e.g. sentiment analysis) with techniques such as Bayesian inference.

[Figure 4.5: (a) Distribution of messages per group. Y–axis is in logarithmic scale. (b) Distribution of number of words per message. Both axes are in logarithmic scale.]

Correlations

[143] reported that some Flickr user activities are correlated (e.g. the number of photos uploaded by a user is strongly correlated with the number of hashtags from the same user). We wanted to test the validity of this hypothesis for our post–reply network. Figures 4.6(a) and 4.6(b) respectively show the number of group related messages and the number of hashtag assignments as a function of the number of messages $n_m$ and replies $n_r$ of a user. Clearly, $g$ exhibits strong correlation with the number of messages (similarly for replies). Frequent communication between group members is expected, since joining a group is likely driven by a business need. However, we cannot conclude the same for hashtag assignments. There seems to be a close relationship on a logarithmic scale, but this relationship is not perfectly linear. Even though there exist many users exhibiting high activity patterns with respect to the number of messages they send (similarly for replies), such users do not tend to tag their messages as often.
Further, users tend to mostly tag their own content, as in Flickr [177]. It seems reasonable to deduce that group co–membership is a natural and direct indicator of users' shared interests in a corporation. Shared hashtags can also be considered an indicator of shared interests, but with some caution.

[Figure 4.6: (a) Number of group messages $g$, as a function of the number of messages $n_m$ and replies $n_r$ of a user; the panel includes linear fits y = 1.9x + 1 and y = x − 0.0087. (b) Number of hashtag assignments $t$, as a function of the number of messages $n_m$ and replies $n_r$ of a user (both axes are in logarithmic scale).]

We now examine correlations between users' activities and the structure of the post–reply network. By doing this, we hope to get a clearer picture of the relationship between users' posting and tagging activity and the network structure. Specifically, we investigate whether there is a connection between the number of neighbors a user has and the activity patterns of that user (i.e. number of messages, number of replies, group participation and tagging).

We characterize the average activity of users with $k$ neighbors (we consider in–degree and out–degree separately), using the following quantities: (i) the average number of messages of users with $k$ neighbors, $n_m(k)$, (ii) the average number of replies of users with $k$ neighbors, $n_r(k)$, (iii) the average number of distinct groups (similarly for total number of group messages) of users with $k$ neighbors, $n_g(k)$, and (iv) the average number of distinct hashtags (similarly for total hashtag assignments) of users with $k$ neighbors, $n_t(k)$. For example, $n_m(k) = \frac{1}{|u : k_u = k|} \sum_{u : k_u = k} n_m(u)$.
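The degree-conditioned averages defined above amount to grouping users by their number of neighbors and averaging an activity within each group. A minimal Python sketch follows; the function name `average_by_degree`, the dictionary inputs, and the toy values are assumptions made for illustration.

```python
from collections import defaultdict


def average_by_degree(degree, activity):
    """Map each degree k to the mean activity of the users whose degree is k."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for user, k in degree.items():
        sums[k] += activity.get(user, 0)  # users without a record count as 0
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Hypothetical toy data: in-degree and number of messages for three users.
k_in = {"a": 1, "b": 1, "c": 3}
n_m = {"a": 2, "b": 4, "c": 10}
n_m_of_k = average_by_degree(k_in, n_m)
```

The same helper can be reused for each activity (messages, replies, groups, hashtags) by passing in–degree or out–degree as the `degree` argument.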
Table 4.3: Pearson correlation coefficients

Parameter y   PCC(k_in, y)   PCC(k_out, y)
n_m           0.9779         0.9397
n_r           0.9608         0.9767
n_g           0.7543         0.6967
g             0.9779         0.9397
n_t           0.7157         0.5407
t             0.6978         0.53

Figure 4.7 shows these quantities as a function of $k$. All activities have an increasing trend for increasing values of $k$ (both for in–degree and out–degree). Large fluctuations can be observed for large values of $k$ due to the small number of highly connected users over whom the averages are computed. Despite these important heterogeneities in the behavior of users with the same degree $k$, the data clearly indicate a strong correlation between the different types of activity metrics. Notably, the average number of messages and replies are very well correlated with the number of neighbors, as is the average number of distinct groups. The average number of (distinct) hashtag assignments exhibits more heterogeneity than the other measures, but the trend is still increasing with increasing values of $k$. Users with many contacts who use very few hashtags and send very few group messages can also be observed. For reference, Table 4.3 reports the Pearson correlation coefficients, measured for $k$ (both for in–degree and out–degree) and all user activities.

The results imply that more connected users tend to annotate their messages more frequently, and with more hashtags. This suggests that more connected users spend more time in adequately describing their content, assisting themselves (and others) in organizing and filtering content, as well as improving content search.
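For readers who want to reproduce coefficients of the kind listed in Table 4.3, the following Python sketch implements the standard Pearson correlation coefficient from its definition. Whether the reported values were computed over per-user pairs or over the degree-averaged curves is not stated in the text, so the choice of input sequences is left to the caller; the function name and toy values are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical toy values: degree and number of messages for four users.
degrees = [1, 2, 3, 4]
messages = [2, 4, 6, 8]
pcc = pearson(degrees, messages)  # perfectly linear, so close to 1
```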
More connected users are also more likely to send more messages and to receive more replies from others. Considering the fact that the frequency of interaction increases as the number of unique people one communicates with increases, these correlations make a lot of sense. While replying to many questions implies that one has high expertise, asking a lot of questions is usually an indicator that one lacks expertise on some topics [215]. We will discuss this issue in later sections.

[Figure 4.7: From left to right and top to bottom, average number of (a) messages $n_m$, (b) replies $n_r$, (c) distinct groups $n_g$, (d) group related messages $g$, (e) distinct hashtags $n_t$ and (f) total hashtag assignments $t$ of users having $k$ neighbors in the post–reply network.]

Assortativity

Assortativity measures the preference of nodes of a network to link to other nodes with similar properties. Intuitively, one would expect users of a social network to exhibit preferential attachment to individuals with similar characteristics. Various mixing patterns with respect to node properties [161, 154], as well as node activities [177], have been discovered. In our work, we consider mixing patterns with respect to users' activity in the post–reply network.
We define for each user $u$ the following quantities [177]: (i) the average number of distinct hashtags of its nearest neighbors, $n^u_{t,nn} = \frac{1}{n_t(u)} \sum_{v \in V(u)} n_t(v)$, and (ii) the average number of distinct groups to which its nearest neighbors have sent at least one message, $n^u_{g,nn} = \frac{1}{n_g(u)} \sum_{v \in V(u)} n_g(v)$, where $V(u)$ denotes the set of $u$'s nearest neighbors. To detect mixing patterns, if any, we compute: (i) the average number of distinct hashtags of the nearest neighbors of the set of users having $n_t$ distinct hashtags:

$$n_{t,nn}(n_t) = \frac{1}{|u : n_t(u) = n_t|} \sum_{u : n_t(u) = n_t} n^u_{t,nn}, \qquad (4.1)$$

and (ii) the average number of distinct groups of the nearest neighbors of the set of users participating in $n_g$ distinct groups:

$$n_{g,nn}(n_g) = \frac{1}{|u : n_g(u) = n_g|} \sum_{u : n_g(u) = n_g} n^u_{g,nn}. \qquad (4.2)$$

[Figure 4.8: (a) Average number of groups for the nearest neighbors of users belonging to $n_g$ groups. (b) Average number of distinct hashtags for the nearest neighbors of users with $n_t$ distinct hashtags. Each panel shows separate curves for incoming (k–in) and outgoing (k–out) neighbors.]

We have also examined assortativity with respect to other indicators of users' activity: in–degree and out–degree, total number of hashtags and total number of groups. For brevity, we discuss our most interesting findings, revolving around $n_{t,nn}$ and $n_{g,nn}$.

Figures 4.8(a) and 4.8(b) show the behavior of $n_{g,nn}$ and $n_{t,nn}$ for the post–reply network. In the case of $n_{g,nn}$, a clear assortative trend is visible: the average activity of the neighbors of a user (both in terms of incoming and outgoing edges) increases with the user's own activity. A small drop as well as fluctuations and outliers are observed due to the sparsity of our data.
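The two-step averaging used to detect mixing patterns can be sketched in Python as follows. The first function computes, for each user, the mean activity of that user's nearest neighbors; the second averages those per-user values over all users who share the same own activity level, in the spirit of Equations (4.1) and (4.2). The sketch divides by the number of neighbors $|V(u)|$ when forming the per-user mean, which is an assumption of this illustration, as are the function names and the toy network.

```python
from collections import defaultdict


def neighbor_average(neighbors, value):
    """Per-user mean of `value` over that user's nearest neighbors V(u)."""
    # Assumption: the mean is taken over |V(u)|; users without neighbors are skipped.
    return {
        u: sum(value.get(v, 0) for v in vs) / len(vs)
        for u, vs in neighbors.items()
        if vs
    }


def mixing_curve(neighbors, value):
    """Average the per-user neighbor means over users with the same own value."""
    per_user = neighbor_average(neighbors, value)
    buckets = defaultdict(list)
    for u, avg in per_user.items():
        buckets[value.get(u, 0)].append(avg)
    return {x: sum(avgs) / len(avgs) for x, avgs in buckets.items()}


# Hypothetical toy network: heavy tagger "a" talks to two light taggers.
toy_neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
toy_n_t = {"a": 4, "b": 1, "c": 1}
curve = mixing_curve(toy_neighbors, toy_n_t)
```

A curve that increases with the user's own value indicates assortative mixing, while a decreasing curve, as in this toy case, indicates disassortative mixing.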
The average number of distinct hashtags used by the nearest neighbors of users with $n_t$ hashtags, however, follows a clear disassortative mixing pattern, since users tend to communicate with individuals that use significantly fewer hashtags. In fact, the more distinct hashtags a user uses, the fewer are used by the user's neighbors. This result disagrees with that reported in [177] for Flickr. A possible explanation of this observation is twofold. First, rather than using hashtags for content organization, users of the corporate microblogging service mostly rely on group membership (see Sect. 4.1.2). Second, users mainly annotate their own content, often attaching hashtags to their own microposts only. Hence, the more hashtags an author has chosen to annotate a micropost with, the lower the probability becomes for another individual to provide prominent annotations of the same post.

Lexical & Topical Alignment

We now examine user similarity in terms of hashtag usage, with respect to their distance in the post–reply network. We argued earlier that users of the corporate microblogging service mostly tag their own content. This observation, along with the personal character of tagging, makes us conjecture that there will be no global hashtag vocabulary across users, or that, if such a vocabulary exists, it will be extremely incoherent. Hence, we do not anticipate an emergent globally accepted hashtag vocabulary of the kind commonly found in social bookmarking sites [143]. To test the existence of a globally shared vocabulary, we selected pairs of users at random and measured the number of their shared hashtags, which on average is ≈ 1.001.

Random pairs of users do not have common hashtags; adjacent users in social networks, however, tend to share common interests, a property known as homophily [146, 206, 48]. We measure user homophily with respect to hashtags as a function of the distance of users in the post–reply network.
We regard the hashtag assignments of user $u$ as a feature vector, whose elements correspond to hashtags and whose entries correspond to frequencies of hashtag usage for user $u$. Hence, the normalized similarity between two users $u$ and $v$ with respect to their hashtag vectors, $\sigma_{hashtags}(u,v)$, can be computed as follows:

$$\sigma_{hashtags}(u,v) = \frac{\sum_t f_u(t) f_v(t)}{\sqrt{\sum_t f_u(t)^2 \sum_t f_v(t)^2}}, \qquad (4.3)$$

where $f_u(t)$ denotes the number of times user $u$ has used hashtag $t$. $\sigma_{hashtags}(u,v)$ is equal to 0 if users $u$ and $v$ have no hashtags in common, and 1 if they have used exactly the same hashtags. We further define the normalized similarity between two users $u$ and $v$ with respect to their distinct hashtag usage, as:

$$\sigma^U_{hashtags}(u,v) = \frac{\sum_t \delta^t_u \delta^t_v}{\sqrt{n_t(u) n_t(v)}}, \qquad (4.4)$$

where $n_t(u)$ is the total number of distinct hashtags of user $u$, and $\delta^t_u = 1$ if user $u$ has used hashtag $t$ at least once, and 0 otherwise.

To compute averages of the aforementioned similarities we performed an exhaustive investigation of the post–reply network up to distance equal to the network diameter ($D_{max}$). Figure 4.9(a) demonstrates the dependency of user similarity on distance, by showing the average number of shared hashtags and the corresponding average cosine similarities of two users as a function of $d$. The average number of shared hashtags remains almost constant for $d \le 6$, after which point it drops rapidly. High lexical alignment is observed between neighbors at greater distances than in traditional online social networks [177], due to the fact that in a corporate environment users exhibit more focused interests aligned with their discipline, day–to–day responsibilities and ongoing projects. We examine user homophily with respect to groups as a function of their distance, following similar reasoning.
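Equations (4.3) and (4.4) are the cosine similarity of two frequency vectors and of their binary counterparts. A minimal Python sketch, assuming each user's hashtag usage is stored as a dictionary from hashtag to count (the function names and toy profiles are likewise assumptions), is:

```python
import math


def cosine_similarity(f_u, f_v):
    """Cosine similarity of two hashtag -> frequency dictionaries, as in (4.3)."""
    dot = sum(f_u[t] * f_v[t] for t in set(f_u) & set(f_v))
    norm = math.sqrt(
        sum(x * x for x in f_u.values()) * sum(x * x for x in f_v.values())
    )
    return dot / norm if norm else 0.0


def distinct_similarity(f_u, f_v):
    """Similarity over distinct hashtags only (usage as 0/1), as in (4.4)."""
    tags_u = {t for t, count in f_u.items() if count > 0}
    tags_v = {t for t, count in f_v.items() if count > 0}
    denom = math.sqrt(len(tags_u) * len(tags_v))
    return len(tags_u & tags_v) / denom if denom else 0.0


# Hypothetical toy profiles.
alice = {"#apache": 2, "#deploy": 1}
bob = {"#apache": 2, "#deploy": 1}
carol = {"#lunch": 3}
```

Returning 0 for a user without any hashtags is a convention chosen here to avoid division by zero. The same two functions apply unchanged to group profiles if the dictionaries map groups to message counts.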
In particular, we define the normalized similarity between two users $u$ and $v$ with respect to their group participation as:

$$\sigma^U_{groups}(u,v) = \frac{\sum_g \delta^g_u \delta^g_v}{\sqrt{n_g(u) n_g(v)}}, \qquad (4.5)$$

where $n_g(u)$ is the number of groups of which user $u$ is a member, and $\delta^g_u = 1$ if user $u$ belongs to group $g$, and 0 otherwise (a user belongs at most once to a group). We also examine user similarity in terms of messages sent to common groups. We calculate this similarity as follows:

$$\sigma_{groups}(u,v) = \frac{\sum_g f_u(g) f_v(g)}{\sqrt{\sum_g f_u(g)^2 \sum_g f_v(g)^2}}. \qquad (4.6)$$

Figure 4.9(b) demonstrates the dependency of user similarity on distance, allowing us to draw similar conclusions for shared groups as for shared hashtags.

[Figure 4.9: (a) Top: Average number of shared hashtags $n_{st}$, $\sigma_{hashtags}(u,v)$, and $\sigma^U_{hashtags}(u,v)$ of two users as a function of their distance $d$ in the network. Bottom: Probability distribution of the number of shared hashtags $n_{st}$ of two users being at distance $d$ on the network, for $d = 1, 2, 3$. (b) Top: Average number of (i) shared groups $n_{sg}$, (ii) $\sigma_{groups}(u,v)$, and (iii) $\sigma^U_{groups}(u,v)$ of two users as a function of their distance $d$ in the network. Bottom: Probability distribution of the number of shared groups $n_{sg}$ of two users being at distance $d$ on the network, for $d = 1, 2, 3$.]

4.2 Communication Behavior Analysis

Much like Twitter, the corporate microblogging service we investigate in this work acts like an information network, where users share microposts, which are then propagated to their followers, broadcast to specific groups, or made available in the company–wide stories stream. Intuitively, one would expect a personal message that is being sent from user $u$ to user $v$ to trigger a reply message being sent from user $v$ back to user $u$. In Sect. 4.1.1 we discussed that the existence of edge $e_{uv}$ does not guarantee the existence of the reciprocal edge $e_{vu}$. Instead, we argued that, overall, the post–reply network exhibits a low level of reciprocity, with only 21.49% symmetric links.

[Figure 4.10: Dynamics of threaded discussion at the workplace. Cumulative number of replies a message triggers over time (x–axis: time in days; y–axis: cumulative number of replies).]

Figure 4.10 shows the evolution of the number of replies a message receives over a period of about one year. Each curve corresponds to a threaded discussion, showing how much the number of replies increases daily for a single message, over a year. To make Figure 4.10 readable, we randomly sample a few discussion threads. The final number of replies varies widely among messages. A few messages become popular very fast, attracting numerous replies almost instantaneously, whereas others remain extremely “unpopular”. Overall, as messages age, the accumulation of new replies slows down, and after a few days, messages typically no longer receive new replies. For some messages, the slope abruptly increases, often periodically. Messages in the post–reply network can be classified in four broad categories:
• Extremely Popular: Messages that receive a lot of attention fast. The cumulative number of replies explodes over a small period of time after the submission of the message.
• Extremely Unpopular: Messages that receive little attention, in many cases just a single reply from the intended recipient of the message. Status updates of personal nature might fall into this category, with a few comments coming mainly from “friends”.
• Messages that accumulate a few replies fast, often immediately, and a few more replies after a considerable amount of time has elapsed since the original message was sent (in some cases even a year later).
• Periodic Popularity: Messages that trigger numerous responses periodically.

Figure 4.11(a) shows the average maximum time required for a message to receive $k$ replies. The data clearly indicate a positive correlation, as expected, even though large fluctuations are observed for large values of $k$ due to the small number of highly popular messages. Of course, not all messages will ever receive $k$ replies, regardless of time. Figure 4.11(b) captures this effect, showing the relation between the number of replies $k$ and the average time required for a message to receive $k$ replies. A cluster of extremely popular messages appears in the bottom right corner. Messages that receive low or moderate attention are placed at the bottom left part of the figure. Finally, messages that receive periodic attention are placed in the middle.

[Figure 4.11: (a) Average maximum time required for a message to receive $k$ replies. Time is measured in days. (b) Time required on average for a message to get $k$ replies. Time is measured in days.]

In light of the above observations, we examine factors that we suspect trigger variability in the total number of replies messages receive, as well as causal factors of patterns of messages' popularity over time. Corporate microblogging capabilities provide users the opportunity to share day–to–day operational knowledge and domain knowledge, and to discuss problem solving, relevant emerging techniques, applications and technologies, trends, etc.
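The time-to-$k$-replies curves discussed for Figure 4.11 can be approximated with a short Python sketch. It assumes that each message is represented by the list of delays, in days, between the original post and each of its replies, and it defines the time needed to reach $k$ replies as the delay of the $k$-th reply; this definition, the function name, and the toy threads are assumptions of the sketch and may differ from the exact procedure behind the figure.

```python
from collections import defaultdict


def average_time_to_k_replies(threads):
    """Map k to the mean delay (in days) of the k-th reply.

    Only messages that actually received at least k replies contribute to k.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for delays in threads:
        for k, delay in enumerate(sorted(delays), start=1):
            sums[k] += delay
            counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Hypothetical toy threads: reply delays in days for three messages.
toy_threads = [[0.5, 2.0], [1.5], []]
time_to_k = average_time_to_k_replies(toy_threads)
```

Because fewer and fewer messages reach large $k$, the averages for large $k$ rest on few samples, which is consistent with the fluctuations noted in the text.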
Enterprise microblogging services mostly emphasize the business perspective, and therefore their content revolves around their main business and work culture, work practices, and everyday problems (technical or otherwise related to business) [36]. This factor may partially explain messages that fall into the third category. Messages of the form “I do not know how to run Apache Server in my machine. Can anyone help me?” could trigger messages of the form “I will look into the matter and get back to you asap”, with the actual reply coming later on, often after a long time has passed, partially due to other, higher priority responsibilities of repliers. Periodic attention could be the result of periodic messages being sent to groups, stimulating conversation between group members. Finally, extremely popular messages can be attributed to polls, which prompt employees to position themselves with respect to a matter of interest (e.g. technology adoption).

For enterprises, the investment of money and time in maintaining microblogging services requires some sort of profit in return. Such profit may come from the sharing of ideas, leading to innovation, as well as from a reduction of the time spent searching for data, information and experts. A reduction in activity could be detrimental to the deployment of such services in the workplace, whereas abnormally high activity patterns may compete with employees' attention to work related issues. Further, understanding what catches the attention of users may lead to insights with respect to whom individuals choose to collaborate with and for what reasons. Here, we use the total number of replies a message yields as a measure of the “attention” it receives.

4.2.1 What Makes Messages “Tick”?

What are the factors that impact the number of replies a message receives? Is it possible to use such factors as accurate predictors of prominent discussion starters?
We approach our research question through an empirical study of the post–reply network.

Hypothesis: User activity and number of replies are correlated. Here we hypothesize that the more active a user is, the more attention their messages will receive. To test our hypothesis, we examine the average total number of replies received by users having: (a) $k$ incoming connections, (b) $k$ outgoing connections, (c) $n_g$ distinct groups, and (d) $n_t$ distinct hashtags. Figure 4.12(a) shows the results. Even though some increasing trend can be observed, mainly with respect to $k_{out}$ and $n_g$, the results are not definitive enough for us to draw a safe conclusion.

Reverse Hypothesis: Many replies come from highly active users. Here we hypothesize that highly active users contribute more than inactive users to discussions, authoring more replies on average. To test our hypothesis we examine the average number of total replies authored by users having (a) $k$ incoming connections, (b) $k$ outgoing connections, (c) $n_g$ distinct groups, and (d) $n_t$ distinct hashtags. Figure 4.12(b) shows the results. No positive correlation can be observed in this case (in fact, a small negative correlation can be observed), so the data do not support our reverse hypothesis either.

[Figure 4.12: Average number of replies (a) received in total and (b) sent by users that have (i) $k$ incoming connections, (ii) $k$ outgoing connections, (iii) $n_g$ distinct groups, and (iv) $n_t$ distinct hashtags.]

4.2.2 User Contribution

Based on our analysis of user activity with respect to the attention messages receive, it is not feasible to draw safe conclusions about the factors that make some messages more “interesting” than others.
To better explain communication patterns we further examine users' contribution frequency, and the extent to which their communication is balanced between sending and receiving messages. We measure users' contribution with the “contribution index” [71]:

$$CI(u) = \frac{n_m(u) - n_r(u)}{n_m(u) + n_r(u)}. \qquad (4.7)$$

$CI(u)$ is equal to -1 if user $u$ only receives messages, and 1 if $u$ only sends messages. A user who exhibits totally balanced communication behavior has $CI(u) = 0$. Figure 4.13(a) shows the contribution index of users against the total number of messages they have sent and received. The majority of users lie above the $CI = 0$ line, indicating that most users send substantially more messages than they receive. Numerous users have high values of $CI$; however, their total number of messages is relatively small. On the other hand, numerous users beneath the $CI = 0$ line can be observed. Their $CI$ values mostly fall in the $[-0.2, 0]$ range and they too have relatively few messages in total. A few users with exceptionally high activity can also be observed, consistently sending more messages than receiving replies, in variable ratios.

[Figure 4.13: (a) Contribution Frequency of users in the post–reply network. (b) Average Contribution Index of users having $k_{in}$ incoming, or $k_{out}$ outgoing connections.]

Our dataset does not contain information about the organizational hierarchy of individuals. It would be interesting to investigate how the contribution index varies between different types of roles in the company hierarchy, to better understand whether individuals at certain organizational levels can be classified as information generators, information mediators, or information receptors.
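Equation (4.7) translates directly into code. The Python sketch below takes the number of messages a user has sent and received; returning `None` for a user with no activity at all is a convention chosen for this sketch, since the index is undefined in that case, and the function name is likewise an assumption.

```python
def contribution_index(sent, received):
    """Contribution index (sent - received) / (sent + received), as in (4.7)."""
    total = sent + received
    if total == 0:
        return None  # assumption: undefined for users with no messages at all
    return (sent - received) / total


only_sends = contribution_index(5, 0)      # 1.0: information producer
only_receives = contribution_index(0, 3)   # -1.0: information receptor
balanced = contribution_index(4, 4)        # 0.0: balanced communication
mostly_sends = contribution_index(3, 1)    # 0.5
```

Since the index is a ratio, it carries no information about volume, which is why Figure 4.13(a) plots it against the total number of messages sent and received.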
Such observations could lead to key insights into information flow in a corporate context, assisting managers in better organizing human resources and building teams.

It seems natural to ask whether $CI$ is correlated with the structure of the post–reply network as well as with other user activities. Figure 4.13(b) shows the average $CI$ value of users with $k$ unique neighbors, either incoming or outgoing. In both cases, the majority of individuals can be characterized as sending substantially more messages, on average, than receiving. $CI$ is equal to 1 for $k_{in} = k_{out} = 0$, as expected, for users who have never received a reply to any of their messages ($k_{in} = 0$), and users who have initiated conversations but never replied to others ($k_{out} = 0$). Overall, there seems to be some balancing trend, such that the more $k$ increases, the more the $CI$ values tend to cluster around zero. Fluctuations are observed due to the small number of very active users.

[Figure 4.14: (a) Average Contribution Index of users having $n_t$ distinct hashtags, or participating in $n_g$ groups. (b) Average number of replies sent by users with Contribution Index $ci$, shown for $n_r \ge 1$ and $n_r \ge 0$.]

Figure 4.14(a) shows the correlation between users' average contribution index and the number $n_t$ of distinct hashtags they are using, and the number $n_g$ of groups they participate in. In both cases, a clear trend is observed. The average contribution index of users approximates a constant value $c$, such that $+0.1 \le c \le +0.3$. This indicates more balanced communication among users who are more active in annotating their content.
More importantly, balanced communication, even though slightly shifted towards “information producers”, is observed between users who participate in a large number of groups. A possible explanation for this “disassortative mixing” pattern is the following: as the number of groups a user participates in increases, it becomes increasingly difficult to contribute high volumes of messages while at the same time receiving increasingly many messages. Finally, Figure 4.14(b) shows the relationship between CI and the average number of replies that messages attract. We hypothesized earlier (see Sect. 4.2.1) that highly active users contribute more to discussions than inactive users, issuing more replies on average. Figure 4.14(b) supports this claim, suggesting that the bulk of replies comes from users who exhibit balanced communication behavior, with a slight shift towards sending. In fact, the majority of replies are provided on average by users with contribution index ci ∈ [0.2, 0.5]. The difference between the two distributions results from either taking into account messages that receive no replies at all ($n_r \geq 0$), or focusing only on messages that trigger at least one response ($n_r \geq 1$).

4.3 Latent Homophily: Local vs. Global

Probabilistic models have been used for automatic extraction of hidden topics from large document collections. Latent Dirichlet Allocation (LDA) is a completely unsupervised model that treats each document as a vector of word frequencies. Given a collection of documents, LDA learns the latent topics that best explain the words observed in the documents. Each document is represented as a probability distribution over some topics, while each topic is represented as a probability distribution over a number of words. The Author–Topic model (AT), an extension of LDA [174], associates each author in the document collection with a probability distribution over the same topics used to represent documents.
We use the AT model to learn the hidden topics of users of the corporate microblogging service. We view each user as a “document”, aggregating all training messages generated by the same user into a training profile for that user. Each user is then represented as a mixture of T topics, and topics as distributions over terms. We refer to this method as $LDA_W$ to differentiate it from $LDA_{W_{clean}}$, which we train after we remove stopwords and punctuation symbols from our document corpus. Researchers have used probabilistic models in the social web before; however, there is considerable debate about which “aggregation” strategy is better [90]. Further, the nature of available text in microblogging services is substantially different from traditional document collections used in information retrieval and web search. The content of messages is often sparse (e.g. “tweets” are limited to 140 characters), does not adhere to grammatical and syntactic rules, and usually contains slang, emoticons, self–defined hashtags, and URLs. Traditionally, LDA based models represent users, their “documents”, or both, as a mixture over words [171, 218]. To avoid the text related issues described above, we instead choose to represent users’ interests based on metadata rather than words pertaining to some vocabulary. We consider two novel probabilistic models: (a) $AT_G$ and (b) $AT_T$. Each user in $AT_G$ is represented as a probability distribution over some topics, while each topic is represented as a probability distribution over a number of groups (instead of words). Here, groups (i.e. $n_g$) become the vocabulary terms that users select to “compose their documents”. In contrast to $AT_G$, topics in the $AT_T$ model are represented as a probability distribution over a number of hashtags. To construct this model we aggregate the annotations each user assigns to their own messages and use these hashtags as vocabulary terms.
We train our models using the default hyperparameters (α = 50/T, β = 0.01) and a variable number of topics (T ∈ {5, 10, 20, 50, 100, 200, 500}). We use the trained models to infer the topic distribution of each user in our dataset. We evaluate our models in terms of perplexity (i.e. their predictive power). Better generalization performance is indicated by a lower perplexity over a held–out document. Under this metric, $LDA_W$ performs better overall than $LDA_{W_{clean}}$, while both achieve their best results for T = 200. We conjecture that cleaning microblogging data does not help in achieving better predictive power; on the contrary, the estimation of hidden topics deteriorates in this case. $AT_G$ achieves its best perplexity for T = 200 or T = 500, whereas $AT_T$ achieves its best perplexity for T = 100, after which point overfitting evidently affects the predictive power of the model. In addition to perplexity, we compare our four models with respect to time complexity (running time for training at T = 500 on a commodity PC). $AT_T$ is the fastest, requiring 93.77 seconds of training, while $AT_G$ requires 228.67 seconds. $LDA_{W_{clean}}$ requires ≈ 51 minutes, roughly half of the processing time required by $LDA_W$ (≈ 106 minutes).

4.3.1 Latent Mixing Patterns

Homophily theory [146] postulates that people tend to form social networks with individuals of similar characteristics. [206] examined homophily in Twitter, suggesting that a user u follows some user v because she is interested in the same topics that v is writing about. We extend this hypothesis and examine it in the context of an enterprise microblogging service. The presence of homophily would imply that users send directed messages (and, similarly, reply back) to others only if they have common topical interests. To address this problem we seek to answer two questions:
• Global Effect: Are users with more social links more aligned with respect to latent topics?
• Local Effect: Do users with more social links exhibit more topical alignment to their neighbors?

To answer these questions we need to: (a) know the topics that users of the corporate microblogging service are interested in, and (b) measure the topical similarity between pairs of users. Since topic interests are not explicitly defined by users, we use our four LDA models to capture their interests in a latent space. We calculate the similarity between two users by subtracting the Jensen–Shannon divergence of their learned topic distributions from 1, so that the similarity score ranges from 0 (most dissimilar) to 1 (most similar). The Jensen–Shannon divergence between two probability distributions P and Q is defined as:

\[ D_{JS}(P \,\|\, Q) = \frac{1}{2} D_{KL}(P \,\|\, M) + \frac{1}{2} D_{KL}(Q \,\|\, M), \tag{4.8} \]

where $D_{KL}(P \,\|\, Q) = \sum_i P(i) \ln \frac{P(i)}{Q(i)}$ is the Kullback–Leibler divergence, and $M = \frac{1}{2}(P + Q)$ is the average of the two probability distributions. We characterize the global divergence of users with k neighbors (we consider in–degree and out–degree separately) as follows:

\[ D^{G}_{JS}(k) = \frac{1}{|V_k|^2} \sum_{(u,v) \in V_k} D_{JS}(u_T, v_T), \tag{4.9} \]

where $u_T$ represents the topic distribution for user u, and $V_k$ refers to the set of users that have exactly k neighbors in the post–reply network. Figure 4.15(a) shows how divergence varies between users with $k = k_{in}$ neighbors. We also examined correlations between users’ average Jensen–Shannon divergence with respect to $LDA_W$, $LDA_{W_{clean}}$, and $AT_T$, and $k_{out}$, $n_g$, and $n_t$. Even though we do not report such results here, our observations remain valid in all cases. Figure 4.15(a) suggests that as k increases, so does the topical similarity (i.e. divergence decreases) between users with k neighbors in the post–reply network. This effect can be observed for all values of T, even though it becomes more obvious for T ≤ 100.
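The similarity measure and the global divergence of Equations 4.8 and 4.9 can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary-based inputs, and the reading of the sum in Equation 4.9 as running over all ordered pairs of users in $V_k$ (self-pairs included, each contributing zero) are not taken from the text. Note also that with the natural logarithm of Equation 4.8 the divergence is bounded by ln 2, so the sketch exposes the logarithm base as a parameter; base 2 bounds the divergence, and hence the similarity, within [0, 1].

```python
import math
from itertools import product


def kl_divergence(p, q, base=math.e):
    """Kullback-Leibler divergence sum_i P(i) log(P(i)/Q(i)); 0 log 0 is taken as 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q, base=math.e):
    """Jensen-Shannon divergence of Equation 4.8, with M the average of P and Q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)


def similarity(p, q, base=2):
    """Similarity as 1 - D_JS; base 2 is an assumption that keeps it in [0, 1]."""
    return 1.0 - js_divergence(p, q, base)


def global_divergence(topics, degree, k, base=math.e):
    """Equation 4.9: mean divergence over all ordered pairs of users with degree k.

    `topics` maps user -> topic distribution, `degree` maps user -> number of
    neighbors (hypothetical input format).
    """
    users = [u for u, d in degree.items() if d == k]
    if not users:
        raise ValueError(f"no user has exactly {k} neighbors")
    total = sum(js_divergence(topics[u], topics[v], base)
                for u, v in product(users, repeat=2))
    return total / len(users) ** 2


if __name__ == "__main__":
    # Toy topic distributions, for illustration only.
    topics = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
    degree = {"a": 2, "b": 2, "c": 1}
    print(similarity(topics["a"], topics["b"]))
    print(global_divergence(topics, degree, k=2, base=2))
```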
The variable number of topics allows us to understand how this global effect changes when examined from different perspectives, ranging from coarse grained (T = 5) to extremely fine grained (T = 500). The fact that divergence decreases on average, regardless of the number of topics, verifies the observed trend.

Figure 4.15: (a) Average Jensen–Shannon divergence between all combinations of users having $k = k_{in}$ neighbors in the post–reply network, for T topics (the model used in this case is $AT_G$). (b) Average Jensen–Shannon divergence between all combinations of users having $k = k_{in}$ neighbors in the post–reply network, and their neighbors, for T topics (the model used in this case is $AT_G$).

To better understand whether this result is an effect of local alignment between neighbors, we next examine the average divergence between users with k neighbors and their neighbors (local effect), as follows:

\[ D^{L}_{JS}(k) = \frac{1}{|\{u : k_u = k\}|} \sum_{u : k_u = k} D^{u}_{JS}, \tag{4.10} \]

where

\[ D^{u}_{JS} = \frac{1}{k_u} \sum_{v \in V(u)} D_{JS}(u_T, v_T). \tag{4.11} \]

Figure 4.15(b) shows the results. In this case, there is a strong alignment between neighbors for small values of k, but the topical similarity between neighbors dissolves fast with increasing values of k. This indicates that as users send more messages to new people, the topical intersection between them broadens fast. Hence, the more individuals a user has sent a message to, the more diverse the topics of interest become. Intuitively, a user tends to have focused conversations on specific issues with a small number of nodes, whereas broader discussions emerge when more and more people get involved. Figure 4.15(a) and Figure 4.15(b) raise an interesting question.
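The local effect of Equations 4.10 and 4.11 can be sketched in the same spirit. The function names and input format are hypothetical, and the sketch assumes that V(u) denotes the neighbors of u in the post–reply network, so that $k_u = |V(u)|$; the natural logarithm is used, as in Equation 4.8.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural logarithm), as in Equation 4.8."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # 0 log 0 is taken as 0, so zero-probability topics are skipped.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def user_divergence(u, topics, neighbors):
    """Equation 4.11: mean divergence between user u and each of u's neighbors."""
    nbrs = neighbors[u]
    return sum(js_divergence(topics[u], topics[v]) for v in nbrs) / len(nbrs)


def local_divergence(topics, neighbors, k):
    """Equation 4.10: mean of D^u_JS over all users with exactly k >= 1 neighbors."""
    if k < 1:
        raise ValueError("k must be at least 1")
    users = [u for u, nbrs in neighbors.items() if len(nbrs) == k]
    if not users:
        raise ValueError(f"no user has exactly {k} neighbors")
    return sum(user_divergence(u, topics, neighbors) for u in users) / len(users)


if __name__ == "__main__":
    # Toy data, for illustration only: `neighbors` plays the role of V(u).
    topics = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.0]}
    neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
    print(local_divergence(topics, neighbors, k=1))
    print(local_divergence(topics, neighbors, k=2))
```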
According to the local effect, people tend to show strong topical similarity with a few of their neighbors. However, according to the global effect, users with more neighbors tend to be topically close. Why does this phenomenon occur? A possible explanation is provided by the following reasoning: users with few neighbors perform very focused communication, which makes their topical distributions extremely narrow around t specific topics, such that t ≪ T. Assuming that $t_u$ and $t_v$ are the topic distributions for two such users, the probability of overlap is small. In contrast, users with a high number of neighbors exhibit much greater variance in their topic distributions. The divergence between such users is therefore smaller. Finally, we investigate the existence of a correlation between the number of replies users receive in response to sending messages and the topical similarity between users. We showed earlier that no apparent difference can be discovered between highly active and less active users with respect to their contribution in discussions (see Sect. 4.2.1). Here, we hypothesize that individuals who reply back have similar “interests” to the authors of the source message. To test our hypothesis we examined the average Jensen–Shannon similarity between users and the authors of replies. Figure 4.16 shows the results. In all cases, a decreasing trend can be observed between the number of replies and the average similarity between users. Therefore, the more replies a message attracts, the more diverse the set of users that provides such replies. Fluctuations appear due to the few instances over which averages are computed.
The differentiation between similarity values for different numbers of topics can be attributed to the granularity of each model. The overall decreasing trend in all cases, however, validates our observation.

Figure 4.16: Average Jensen–Shannon similarity between users having received k replies, and replies’ authors (T indicates the number of topics).

4.3.2 Topical Homophily in Enterprise Microblogging

So far we have tried to identify factors that impact the number of replies a message receives. [31] examined the factors that drive conversation initiation. Here we take a complementary approach by investigating whether topical homophily exists in the corporate microblogging service. In other words, are users who are more topically similar to each other more likely to have sent a message to each other at some point in time, and thus to be linked in the post–reply network? If two users write about similar topics, their topic distributions should be similar. If two users participate in groups on similar topics, their topic distributions should be similar. If two individuals make use of the same hashtags to annotate content on similar topics, their topic distributions should be similar. No matter which of the previous questions we might ask, the hypothesis is always the same. Assuming that such similarity exists, does it lead to social interaction? In other words, “are more similar users more likely to be linked” in the post–reply network [206, 103]? We study this question using latent models to measure topical similarity. We consider two users u and v to be linked if either u has sent at least one directed message to v or, vice versa, v has sent at least one directed message to u. We analyze the relationship between the likelihood of a link between pairs of users and their topical similarities.
Here, we report results for the $LDA_W$ model only, though we observed similar results for all our models. For each user in our data set who has sent at least one group related message, we compute pair–wise similarities with the remaining users and bin together the top 2,000 pairs whose similarity is below some threshold (0.2, 0.4, 0.6, 0.8, 1.0) [103]. We compute the likelihood of a link between pairs as the percentage of truly linked pairs in each bin. Figure 4.17(a) shows that the probability of a link increases with the similarity threshold. Therefore, topically similar users are more likely to be linked [103]. Figure 4.17(b) shows, on a more granular level, the link probability of user pairs that have been sorted by similarity in decreasing order and divided evenly between bins. The probability of a link increases as the topical similarity of user pairs increases.

Figure 4.17: (a) Link probability among user pairs whose similarity is below some threshold. (b) Link probability of user pairs as a function of similarity. (c) Link probability of hyperactive and less active user pairs as a function of similarity.

To verify that this effect is a result of homophily rather than assortative mixing, we divided users into two categories: hyperactive and less active. Hyperactive users have on average 16.3 neighbors and have sent on average 49.38 messages. Less active users have on average 2.6 neighbors and have sent on average 6.48 messages. We repeated
the analysis above, dividing pairs of users into bins based on similarity threshold and measuring the average probability of linking within each bin [103]. Figure 4.17(c) shows the results. Solid lines refer to pairs of hyperactive users, while dashed lines refer to pairs of less active users. In every case, the probability of a link between a pair of users is slightly higher for hyperactive user pairs than for less active user pairs. Even though we cannot at this time explain why this effect appears, the probability of a link increases for both classes as the topical similarity of user pairs increases. We explain this trend as a result of homophily [103].

4.4 Summary

In this chapter, we provided an extensive analysis of enterprise microblogging data. The extracted post–reply network of threaded discussions in the corporate microblogging service that we examined is a “smaller” world than online social networks, has a strongly connected core of high–degree nodes, and exhibits strong positive correlation to users’ degree (both in–degree and out–degree). We further showed that strong correlations exist between user activities. Assortative mixing patterns between neighbors can be found for all the examined features, with the exception of hashtag usage, which exhibits disassortative characteristics. We showed that users’ alignment in terms of their hashtag vocabulary and group co–membership is more pronounced than in online social networks, for greater distances. This outcome can be interpreted as a result of the definition of the directed network of users’ interaction flow. Instead of modeling user friendship or following relationships, our network model captures the dynamics of informal communication, linking user activity to alignment of topical interests. In particular, we showed a dependency between the topical alignment of pairs of users and the probability of a link between them.
We argued that our observation is a result of actual homophily, rather than an effect of pure statistical assortativity. The primary focus of corporate microblogging services is narrower than that of traditional social network sites like Twitter, Facebook, and Flickr. Enterprise microblogging services mainly emphasize the business perspective, and therefore their content revolves around the work culture, work practices, and everyday problems (technical or otherwise related to business). Discussions are often about problem solving, relevant emerging techniques, applications and technologies, trends, etc. Usually, some users lead the overall discussion by expressing their opinion on a matter, which then triggers replies. The existence of high–degree nodes that we observed in the post–reply network confirms this behavior and suggests that such high–degree nodes are critical for the connectivity and flow of information in this context. On the other hand, information searches that exploit the social structure rapidly reach the core. The design of algorithms for information search or expertise identification in this context should consider this observation. Through a detailed, in–depth analysis of the attention that messages receive over time, we showed that messages can be classified into four general categories, exhibiting emerging patterns that remain to be fully understood in future work, in conjunction with qualitative studies. Unlike in online social networks, we did not find supporting evidence regarding the predictability of message popularity, even though we identified a correlation with users’ contribution index. The present findings suggest a number of possible future directions. First, the results obtained for homophily suggest that accurate link recommendation based on users’ message history may be feasible. Second, the causal relationship between homophily and network evolution deserves to be explored in detail.
In contrast to general purpose online social networks, trust is a minor issue in corporate microblogging services, where malicious users cannot penetrate the core due to restricted access. Trust is often used as an indicator of expertise; hence newly hired employees may pursue highly ranked positions in the post–reply network by providing unconvincing and/or unhelpful responses (i.e. a form of “spamming”), contributing to discussions nonetheless. We conjecture that a “new” user should be highly trusted not only if multiple short disjoint paths to the user can be discovered [154], but also if the overall impact and positive sentimental response that her replies trigger are sufficiently large. One possible criticism of our study is that it does not account for network evolution. Our dataset spans the period between July 2010 and August 2011. During this time frame, the network grows rapidly. However, our observations remain valid throughout this period, indicating that the basic network structure does not change drastically over time. In conclusion, we have presented our quantitative study of enterprise microblogging data at scale, where we examined 1) the network structural characteristics and 2) users’ alignment with respect to content. We concluded by discussing the implications of our findings. However, examination of multiple other corporate social networks is required to further confirm our findings.

Chapter 5
Informal Interactions in the Presence of Formal Structure

Researchers have extensively studied the importance of social networks for information dissemination [212, 76, 18, 9]. Traditionally, diffusion and cascading behavior have been formalized as the transmission of infectious diseases in a population, where each individual is either infected or susceptible, and infected nodes spread the contagion along the edges of the network. There are, however, differences between information flow and the spread of viruses.
While virus transmission is an indiscriminate process, information transmission is a selective process. Information is passed by its host only to individuals the host thinks would be interested in it. Diffusion models rely heavily on the premise that contagion propagates over an implicit network, the structure of which is assumed to be sufficient to explain the observed behavior. However, the structure of the underlying network has to be learned [76] from a plethora of historical evidence, i.e. cascades. Although diffusion theory brings up the importance of friendship relations, adoption behavior is instead examined on the basis of the behavior of the entire population [18]. In online social networks in particular, where individuals tend to organize into groups based on their common activities and interests, it has been hypothesized that the network structure (friendship or interaction) affects the way information spreads, and that adoption quickens as the number of adopting friends increases [8]. However, a node’s activation is often not just a function of the social network but also depends on many other factors, such as imitation [212]. This has led to the development of epidemiology models [88] and computational approaches that are based on threshold models [77], deterministic or stochastic [186]. Each agent has a threshold that, when exceeded, leads the agent to adopt an activity. When the threshold is applied within a local neighborhood [198, 184], local models emerge [109]. Instead, global diffusion models apply thresholding to the whole population [18]. Unlike online social networks where users create links to others who are similar to them (a phenomenon known as homophily [146]), or whose contributions they find interesting [206], in a corporate environment, employees form “bonds” not because of similar “tastes” but due to tasks at hand (i.e.
a function to be completed or an organizational need) or because of reporting-to relationships (i.e. team members reporting to their supervisor). In this sense, there is no explicit “social network”; however, formal structures such as the organizational hierarchy may provide hints of influence at the workplace. As illustrated in Figure 5.1, the formal organization structure may constrain influence patterns, but informal communication outside the boundaries and restrictions of this formal “backbone” may also affect how users behave and ultimately how the diffusion network changes and grows. The dynamics of information diffusion in a corporate environment are as yet unknown and may be entirely different from those of online social networks. The interplay between formal structure and information propagation in the enterprise has been recently examined [200]. The authors found that social and organizational structure significantly impact the spreading process of emails, while at the same time indicating context independence. In our study, on the contrary, we do not know the chain of infections, i.e. we do not observe who influences whom (i.e., the middle layer in Figure 5.1). Instead, we empirically quantify the role of reporting-to relationships and local behavior (teammates), as well as the effect of global influence (overall popularity), in the spread of technology adoption at the workplace. Prior work that quantifies influential users within a social network includes [68]. Influence models typically do not take the topology of the network into account, and when they do, they make assumptions about the details of the underlying dynamic process taking place on the network. In our empirical study, we characterize individual dynamics and influence, and examine the spread of adoption through the formal organizational hierarchy.
In contrast to online social networks, microblogging services for enterprises are primarily designed to improve intra-firm transparency and knowledge sharing. However, the adoption of such collaborative environments presents certain challenges to enterprises [81]. [216] provided a case study on the perceived benefits of corporate microblogging and barriers to adoption. Key factors influencing the adoption of microblogging systems in the workplace include: privacy concerns, communication benefits, perceptions regarding signal-to-noise ratio, codification effort, reputation, expected relationships, and collaborative norms [81]. The work closest to ours [200] examined email threads and the formal network (e.g. hierarchical structure) imposed by a large technology firm. The authors argued that the spreading process (to whom and how fast people forward information) can be well captured by a simple stochastic branching model. In our study, on the contrary, we do not know the chain of infections (i.e. we do not observe who influences whom). Instead, we use the outcome of our empirical study to quantify influence as a result of individual pressure from supervisors towards their team members, as well as an effect of global popularity. To characterize the adoption mechanism of new technologies at the workplace, we propose two simple and intuitive agent-based computational models with the fewest possible parameters. We emphasize accurately modeling the cumulative number of adoptions over time, rather than trying to predict which node in the network will infect which other nodes. In this sense, we not only model the influence each node has on the diffusion (microscopic modeling), permitting user behavior to vary according to the behavior of the general crowd, but we also provide a simple mechanism by which the adoption rate rises and decays over time (macroscopic dynamics).
For our study, we have acquired adoption logs of the internal microblogging service, which resembles Twitter, during the first two years of adoption of the service in the enterprise (see Chapter 2, Section 2.3.1). In addition, we gathered the organizational hierarchy of a Fortune 500 multinational company (see Chapter 2, Section 2.3.2). This dataset allows us to empirically characterize individual dynamics and influence, and examine the spread of adoption through the hierarchy. The company did not officially initiate usage of the microblogging service. Rather, it was independently initiated by an employee at the beginning of July 2010. It was not promoted or even mentioned in any formal corporate communications. Our dataset does not contain information with respect to growth and invitations. We can only speculate that growth was achieved through email and word of mouth invitations.

5.1 The Effect of Formal Organization Hierarchy

The underlying process of influencing employees towards adopting the microblogging service is unknown and non-trivial. Here, we assume that when an employee chooses to join the corporate microblogging service, she then has some influence on the employees who directly report to her, according to the formal organizational chart (as shown in Figure 5.1). Some of these employees will choose to join, which will in turn influence some of their team members into joining themselves, and so on. Therefore, we assume that an employee’s decision to join depends on: 1) direct influence by her manager, and 2) social influence resulting from the overall popularity of the microblogging service in the enterprise. In this section we seek supporting evidence on the influence inflicted

Figure 5.1: Technology adoption dynamics at the workplace. Dynamics on and of the formal network structure are strongly coupled. The bottom layer illustrates the formal organization hierarchy, where black arrows represent “reporting-to” relationships between employees.
Edges are directed from lower level employees up to the company CEO. The middle layer depicts the flow of influence between people in the same group (red arrows), top-down influence from supervisors to team members (dashed, dark red arrows) and, vice versa, bottom-up influence of team members on their supervisors (dashed purple arrows). The upper layer depicts observed adoption dynamics, i.e., a potential propagation tree.

by managers to employees reporting directly to them. Here, we assume that employees are not susceptible to peer influence by their teammates (i.e., we assume independence between teammates’ choices). We revisit this hypothesis in Section 5.2.1. Assume that manager u urges her team members to join the microblogging service. A directed link $e_{ju}$ exists if employee j directly reports to u according to the formal organizational hierarchy. If j joins the microblogging service after u, we call her join an “influenced join”. We counted the number of employees who joined the microblogging service after their manager and found that there are three classes of employees: (i) employees who did not join the microblogging service even if their manager did (10.94%), (ii) employees who joined the microblogging service before their manager (36.04%), and (iii) employees who joined the microblogging service after their manager (53.01%). Let N be the total number of employees directly reporting to manager u. Let K be the number of employees in N that joined the microblogging service after their manager u, and k be the total number of employees in N that joined the microblogging service after their manager u within the first n draws. The stochastic process according to which employees directly reporting to u choose to join the microblogging service is described by the “urn model” [68], in which n balls are drawn without replacement from an urn containing N balls in total, of which K are white.
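Drawing n balls without replacement from an urn with N balls, K of them white, is the setting of the hypergeometric distribution, so the chance probability of observing exactly k white balls can be sketched as below. The function name and the example values of N and K are hypothetical and serve only as an illustration; they are not figures from the dataset.

```python
from math import comb


def hypergeom_pmf(k: int, K: int, N: int, n: int) -> float:
    """Probability that exactly k of n balls drawn without replacement are white,
    for an urn with N balls of which K are white."""
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("require 0 <= K <= N and 0 <= n <= N")
    # Outcomes that cannot occur have probability zero.
    if k < 0 or k > K or k > n or n - k > N - K:
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


if __name__ == "__main__":
    # Hypothetical manager: N = 20 direct reports, K = 10 joined after her,
    # and we look at the first n = 8 draws.
    N, K, n = 20, 10, 8
    for k in range(n + 1):
        print(k, round(hypergeom_pmf(k, K, N, n), 4))
```

The probabilities over all feasible k sum to one, which gives a quick sanity check of the implementation.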
The probability P(X = k | K, N, n) that k of the first n employees reporting to manager u joined the microblogging service after their manager purely by chance is equivalent to the probability that k of the n balls drawn from the urn are white. We set n = 8, calculating the number of employees that joined the microblogging service after their manager within the first 8 draws. This probability is given by the hypergeometric distribution:

\[ P(X = k \mid K, N, n) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}. \tag{5.1} \]

We plot the average number of employees that joined the microblogging service after their manager during the first n samples as a function of the number of employees that joined the microblogging service after their manager. Figure 5.2 shows the result. The scatter plot is approximated [68] by the Weibull cumulative distribution $\hat{k} = 24\,(1 - e^{-(0.02K)^{0.84}})$. We use this expression to estimate the expected number $\hat{k}$ of employees to join the microblogging service after their manager within the first n joins, for a manager with K employees reporting to her that joined the microblogging service after her.

Figure 5.2: Average number k of employees that joined the microblogging service after their manager within the first n samples vs. the total number K of employees that joined the microblogging service after their manager, and the approximation $\hat{k} = 24\,(1 - e^{-(0.02K)^{0.84}})$.

Using Equation 5.1, we calculate the probability that $\hat{k}$ employees joined after their manager purely by chance. We found that for K > 3, this probability is exceedingly small. Since it is highly unlikely for employees to adopt the microblogging service after their manager purely by chance, we conclude that the number of employees who joined after their manager u is a prominent indicator of u’s influence.

5.1.1 Influence Score

Let $N_j$ denote the number of employees who directly report to u and have joined the microblogging service.
Let $\alpha \le N_j$ be the number of employees that report to u and have joined the microblogging service after u, and let $q \le N_j$ be the number of employees that report to u and have joined the microblogging service before u. While a high number of employees reporting to u that have joined the microblogging service after u implies that u has high influence, a high q value is an indicator that u lacks influence. We propose an adaptation of the z-score [215] as a measure that combines the number of employees that have joined before and after their supervisor. The influence score (“ι-score”) measures how different this behavior is from that of a user with “random” influence, i.e. a manager whose direct reports join after her with probability p = 0.5 and before her with probability 1 − p = 0.5. We would expect such a random influencer to have $N_j \cdot p = N_j/2$ team members who joined after their supervisor, with a standard deviation of $\sqrt{N_j \cdot p \cdot (1-p)} = \sqrt{N_j}/2$ [215]. The ι-score measures how many standard deviations above or below the expected “random” value a manager u lies:

$$\iota(u) = \frac{\alpha - N_j/2}{\sqrt{N_j}/2} = \frac{\alpha - q}{\sqrt{\alpha + q}}. \tag{5.2}$$

The second equality uses $N_j = \alpha + q$. If the employees reporting to manager u have joined the microblogging service after u about half of the time, u’s ι-score will be close to 0. If they join after u more often than not, u’s ι-score will be positive; otherwise, it will be negative. We also calculate the time-independent ι-score of employees using Equation 5.2, with the difference that $\alpha \le N$ is the number of employees that have joined the microblogging service (regardless of when) and $q \le N$ is the number of employees that have not joined the microblogging service. Above, we measured influence at the level of individual employees, assuming that influence scores are fixed in time, but that they differ from employee to employee. A more sophisticated model of influence might include some small increase (similarly for decrease) in influence score as a function of time.
We keep the simpler model because our fundamental result is not sensitive to such details.

Next, we examine the correlation between the ι-score of managers and the number of employees reporting to them (team size), hoping to get a clearer picture of the relationship between the two quantities. We characterize the average ι-score of managers with λ employees reporting to them as

$$\iota(\lambda) = \frac{1}{|\{u : \lambda_u = \lambda\}|} \sum_{u : \lambda_u = \lambda} \iota(u).$$

Figure 5.3(a) shows the average ι-score of managers with λ employees reporting to them that have joined the microblogging service. Here, we focus on managers that have themselves joined the microblogging service, so that a comparison of joining times is meaningful. A clear increasing trend is evident, providing supporting evidence of top-down influence flow through the formal organizational hierarchy. Figure 5.3(b) shows the average time-independent ι-score of managers with λ employees reporting to them. Figure 5.3(b) further shows separate plots of the average time-independent ι-score of managers depending on whether they have joined the microblogging service themselves or not. The average time-independent ι-score of managers that have not joined the microblogging service exhibits more fluctuations due to greater data sparsity. In every case, the average time-independent ι-score of managers that have joined the microblogging service is slightly higher than for managers that have not joined the service. Even though we cannot at this time explain why this effect appears, the average time-independent ι-score increases for both classes as the team size λ increases, indicating a strong correlation between the two quantities. We interpret this trend as a prominent indicator of influence imposed by managers on employees reporting directly to them. We now turn our attention to the impact of organizational levels.
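Before moving on, the ι-score of Equation 5.2 and its average per team size can be sketched in a few lines. This is a minimal illustration only: the function names, the `(team_size, alpha, q)` tuple layout, and the example numbers are assumptions and do not come from the dataset.

```python
from collections import defaultdict
from math import sqrt

def iota_score(alpha: int, q: int) -> float:
    """Equation 5.2: (alpha - q) / sqrt(alpha + q).
    alpha = direct reports who joined after the manager, q = before."""
    if alpha + q == 0:
        raise ValueError("iota-score is undefined when no direct report has joined")
    return (alpha - q) / sqrt(alpha + q)

def average_iota_by_team_size(managers):
    """managers: iterable of (team_size, alpha, q) tuples (hypothetical layout).
    Returns {team_size: mean iota-score of managers with that team size}."""
    buckets = defaultdict(list)
    for team_size, alpha, q in managers:
        buckets[team_size].append(iota_score(alpha, q))
    return {size: sum(scores) / len(scores) for size, scores in buckets.items()}

# Made-up example data, not taken from the thesis.
example = [(4, 3, 1), (4, 2, 2), (9, 8, 1)]
averages = average_iota_by_team_size(example)
```

A manager whose reports split evenly between joining before and after scores 0, which matches the “random influencer” reading of the score.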
Here, we assume that influence scores are characteristic of a particular level of the organization hierarchy tree, are fixed in time, and are the same for all employees at that particular level. To compute the average influence score for hierarchy level l, we first find the employees m that belong to level l (the set $E_l$). We then compute the total number of employees N that directly report to managers in $E_l$. Quantities α and q are defined as before, with the difference that they now operate on the total number of employees N that directly report to managers in $E_l$. Finally, we use Equation 5.2 to calculate the influence score for each level.

Figure 5.3: (a) Average ι-score of managers with λ team members that have joined the microblogging service. (b) Average time-invariant influence of managers who have themselves joined the microblogging service (similarly for those who have not joined), with λ team members. [Log-log axes. Panel (a): number k of team members who joined vs. average ι-score. Panel (b): number of team members λ vs. average time-independent ι-score; series: Overall, Joined, Not Joined.]

Levels are numbered in ascending order from the CEO (level 1) to lower levels. Level 13, which represents bottom-level employees in our dataset, contains employees with no team members reporting to them. Figure 5.4 shows the results. Level 13 has no influence score, thus it does not appear in Figure 5.4. Most levels exhibit positive influence scores, with the exception of higher levels, which are closest to the CEO. In particular, level 3 exhibits negative influence on average. Here, we measured influence at the granularity of hierarchical levels, assuming, as before, that influence scores are fixed in time, but that they differ from level to level.
A more sophisticated model of influence might include some small increase (similarly for decrease) in influence score as a function of time, and also introduce a balancing factor based on the total number of employees at a level and the total number of employees reporting to them. While it is intuitive to assume that higher levels in the organization would have higher impact due to the report-to relationships involved, our study suggests that middle levels are more successful in influencing employees lying lower in the hierarchy. Even though we do not have supporting evidence from other use cases, we conjecture that middle-level managers are the most influential with respect to “convincing” others to adopt new technologies (in this case the new microblogging service).

Figure 5.4: Average influence score as a function of hierarchy level. [Axes: hierarchy level vs. average ι-score.]

5.2 Effect of Peer Pressure

We study the problem of progressive diffusion, where the employees who adopt the microblogging service become infected and do not become healthy again (i.e. employees do not unsubscribe from the service once they join). Classic models of social and biological contagion (e.g. [77, 160]) and observational studies of online contagion [6, 8, 26, 124] predict that the likelihood of infection increases with the number of infected contacts. However, recent studies suggest that this correlation can have multiple causes that might be unrelated to social influence processes [9]. In our observational study of microblogging service adoption at the workplace, this assumption suggests two alternative modeling scenarios. According to the first scenario, an employee is more likely to adopt the microblogging service if more of her teammates join the service (Section 5.2.1). According to the second scenario, an employee is more likely to adopt the microblogging service as its popularity increases (Section 5.2.2).
Our goal in this section is to find models that provide a good fit with respect to the probability of adoption for each user given the actions of their teammates (local neighborhood) or overall popularity (global influence).

5.2.1 Independent Cascade Model

Influence of friends is generally modeled to be additive. For instance, the independent cascade model (ICM) [109] states that a node has n independent chances to become infected, where n is the number of infected “friends”. In our case, every node can be infected only once, and once infected, it stays infected. Because of the structure of the organizational hierarchy, employee u’s “friends” may include either (i) her teammates alone, or (ii) her teammates and her direct supervisor. Starting with a single employee who has joined the microblogging service, employees susceptible to infection decide to join the microblogging service with some probability that depends on the number of their infected “friends”. We model the influence employees receive from their “friends” as multiple exposures to an infection according to ICM [109] as $p_{ICM} = 1 - (1-\lambda)^n$, where λ here denotes the infection probability of a single exposure (not the team size of Section 5.1.1). We measured this quantity on our dataset by separating employees into two classes: a) those who had exactly n “friends” joining the microblogging service and did not join, and b) those who had exactly n “friends” joining the microblogging service before they themselves joined. We found that the likelihood of adoption when no “friends” have joined is remarkably high (0.7581 when considering teammates only and 0.6807 when the supervisor is also considered). In both cases, the likelihood of adoption becomes 1 when at least one “friend” has joined the service. We conclude that the relationship between the number of “friends” that have joined and the likelihood of joining most probably reflects heterogeneous popularity of the microblogging service across teams [9].
Therefore, the naive conditional probability does not directly give the probability increase due to influence via multiple joining “friends” [9].

5.2.2 Exponential Growth Model

We studied earlier the effect of multiple teammates and neighbors of an employee u on the probability that u joins the microblogging service. Even though we discovered a positive correlation, we argued that this correlation might be an effect of multiple causes. We hypothesized that the more popular the microblogging service is for a team, the more likely it is for multiple team members to adopt it. Further, as employees observe others adopting the microblogging service, they may not only be more likely to adopt the service, but the rate at which they do so may quicken as the popularity of the service increases. In this section, we explore this hypothesis.

Figure 5.5 shows the probability that an employee will join the microblogging service as a function of the service popularity. Intuitively, as more people adopt the microblogging service, a certain “buzz” around the service begins to unfold, increasing the probability of others joining the service as well. Interestingly, Figure 5.5 reveals that employees join the microblogging service following two very different, clearly distinct patterns. According to the “optimistic” pattern (red line), the probability of adoption increases more sharply as the overall number of people who join increases. In contrast, the “pessimistic” pattern (green line) yields a probability of adoption that increases marginally as the total number of people who join the service increases over time.
Even though we cannot at this time explain this effect, these two distinct classes of people remain to be fully understood in future work, in conjunction with surveys and targeted interviews.

Figure 5.5: Probability that an employee joins the microblogging service given that n employees have adopted the service before. Solid lines depict probability estimates calculated with the exponential growth model. [Axes: number of employees (n) who have joined vs. P(join|n). Legend: $e^{(0.000147\,n_i^{t-1} - 9.51502)}$, $e^{(0.000135\,n_i^{t-1} - 8.792)}$, Empirical.]

5.3 Computational Models of Technology Adoption at the Workplace

What is the underlying hidden process that drives adoption of new technologies at the workplace? Our goal here is to find a generative model that reproduces the observed adoption process of the new microblogging service at the enterprise we are studying, given the organizational hierarchy. Such a model should exhibit the properties we observed in Sections 5.1 and 5.2 and reproduce the true cumulative number of adoptions. We aim for simple and intuitive modeling with the least possible number of parameters.

Prior work on modeling complex networks in social, biological and technological domains has focused on replicating one or more aggregate characteristics of real world networks [162]. Here, we take a different approach. Instead of having a target network to generate, we let individual influence and peer pressure dynamics determine the diffusion process of adoption of the new microblogging service over the formal organization hierarchy. We propose two models that account for influence effects imposed by the formal organizational structure. We build our models in NetLogo [207], and compare our results to the true epidemic, showing that the estimates produced by our models are consistent with the real observations.
5.3.1 Complex Contagion Model

From the empirical analysis presented in the previous sections, we incorporate the following dynamics into our model:

• Employees are influenced by their managers to join the microblogging service.
• Employees have multiple chances to get infected (join). Once an employee is infected, she cannot recover, i.e. an employee does not unsubscribe from the service.
• As employees observe others adopting the microblogging service, they are not only more likely to adopt the service, but the rate at which they do so quickens as the popularity of the service increases.

We begin by selecting a single node from the organization hierarchy to start the infection. We choose the seed node to be the exact employee that first registered to the microblogging service according to our dataset. Figure 5.6(a) shows this initial setup. At each time step, the virus can be spread as follows. Each node that was infected at time t − 1 has n chances to infect the n employees that directly report to her, each with probability p, at time t. Once a node is infected, it cannot be infected again. According to our modeling, an infected employee is not allowed to infect her direct supervisor. Hence, following this strategy, the virus can only propagate towards the leaves of the hierarchy tree. Once all infected nodes are examined, healthy nodes have the chance to be “randomly” infected by observing the general popularity of the microblogging service up to time t − 1. For $n_i^{t-1}$ total infected nodes at time t − 1, the probability of “random” infection is computed using the pessimistic exponential growth pattern fit, $p_{EG} = e^{(0.000147\,n_i^{t-1} - 9.51502)}$, from Section 5.2.2. Note that the selection of the pessimistic exponential growth pattern is a conservative choice in that it does not unfairly help our model in predicting the cumulative number of adoptions over time. Figure 5.6(b) shows a simulation result of the aforementioned infection process.
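The infection process described above can be sketched as a short simulation loop. This is a minimal Python sketch rather than the NetLogo implementation used in this work. It assumes that every infected manager retries on each healthy direct report at every step (the description could also be read as a single attempt per newly infected manager), and the function names, the `reports` dictionary layout, and the five-person toy hierarchy are hypothetical. The value p = 0.3 is the one reported in Section 5.4.

```python
import math
import random

def p_random(n_infected: int) -> float:
    """Pessimistic exponential growth fit from Section 5.2.2."""
    return math.exp(0.000147 * n_infected - 9.51502)

def simulate_contagion(reports, seed, p, steps, rng):
    """reports: dict mapping every employee to the list of her direct reports.
    Returns the cumulative number of infected employees after each step.

    Assumption: every infected manager retries on each healthy direct
    report at every step."""
    infected = {seed}
    history = [len(infected)]
    for _ in range(steps):
        n_prev = len(infected)          # popularity at time t-1
        newly = set()
        # Top-down spread: managers infect direct reports, never supervisors.
        for manager in list(infected):
            for employee in reports[manager]:
                if employee not in infected and rng.random() < p:
                    newly.add(employee)
        # "Random" infection of still-healthy nodes, driven by popularity.
        p_eg = p_random(n_prev)
        for employee in reports:
            if employee not in infected and employee not in newly:
                if rng.random() < p_eg:
                    newly.add(employee)
        infected |= newly
        history.append(len(infected))
        if len(infected) == len(reports):
            break
    return history

# Toy hierarchy (hypothetical): 0 manages 1 and 2; 1 manages 3 and 4.
toy = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
curve = simulate_contagion(toy, seed=0, p=0.3, steps=600, rng=random.Random(42))
```

Passing the random generator in explicitly keeps runs reproducible, which is convenient when averaging several runs as in Section 5.4.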
Red nodes indicate users who, in the simulation, are influenced by their supervisors, whereas blue nodes indicate “random” infection attributed to the general popularity of the service.

Figure 5.6: Complex Contagion Model simulation in NetLogo. (a) Initial setup with employees arranged in a tree following reporting-to relationships according to the formal organizational hierarchy. The red arrow indicates the infection seed. (b) Simulation result after 600 time steps.

5.3.2 Complex Cascade Model

The model we described above spreads the adoption of the microblogging service over the formal organization hierarchy as a virus, which leaves a trail whenever employees are infected by their supervisors. To model this we used parameter p, which measures how infectious supervisors are, and parameter $p_{EG}$, which controls the effect of the overall growing popularity of the microblogging service over time. Here we take an alternate approach in which nodes choose to become infected after examining their immediate neighborhood (which includes both the supervisor and employees directly reporting to them) or after examining the overall growing popularity of the microblogging service over time.

We start with the organization hierarchy, and two colors. Let red represent employees who have joined the microblogging service and blue those that have not. We choose a single node to be the seed user, i.e. have color red. All other users are painted blue. Figure 5.7(a) shows this initial setup. As before, we chose the seed node to be the exact employee that first registered to the microblogging service according to our dataset.
At each time step, nodes painted blue (not infected) calculate the payoff of picking the color red over blue, and decide their color f(color) as follows:

$$f(\text{color}) = \begin{cases} \text{red}, & \alpha \dfrac{n_{red}}{n} > \beta \dfrac{n_{blue}}{n} \\ \text{blue}, & \text{otherwise} \end{cases} \tag{5.3}$$

where $n_{blue}$ denotes the number of blue neighbors, $n_{red}$ denotes the number of red neighbors and $n = n_{blue} + n_{red}$ is the total number of neighbors. Parameters α and β = 1 − α denote the rewards for choosing red and blue, respectively. Once a node is painted red, it cannot change color again. Finally, nodes have the chance to be “randomly” infected by observing the general popularity of the microblogging service up to time t − 1. As in our contagion model, for $n_i^{t-1}$ infected nodes, the probability of “random” infection is computed using the pessimistic exponential growth model fit, $p_{EG} = e^{(0.000147\,n_i^{t-1} - 9.51502)}$, from Section 5.2.2. Figure 5.7(b) shows a simulation run of the infection process.

Figure 5.7: Complex Cascade Model simulation in NetLogo. (a) Initial setup with employees arranged in a tree following reporting-to relationships according to the formal organizational hierarchy. The red arrow indicates the infection seed. (b) Simulation result after 600 time steps.

5.4 Empirical Evaluation

In this section, we validate our models by extensive numerical simulations. We begin with the organization hierarchy of 12,170 employees, and infect the true initiator of the epidemic (the employee who first joined the microblogging service). Each time step represents a day. We let our models run for 600 steps, or until all employees are infected. We compare the obtained epidemics against the real cumulative number of adoptions extracted from our dataset. We experimented with various values of infection probability for our contagion model and parameters α and β for our cascade model. In the end, we decided to use p = 0.3 for our contagion model, and α = 0.82 and β = 0.18 in our cascade model.
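With these parameter values, the decision rule of Equation 5.3 can be illustrated with a small sketch. Because β = 1 − α, the condition reduces to $n_{red}/n > \beta$, so with α = 0.82 a blue node turns red once more than 18% of its neighbors are red. The function name `choose_color` and the handling of isolated nodes are assumptions made for illustration.

```python
def choose_color(n_red: int, n_blue: int, alpha: float = 0.82) -> str:
    """Equation 5.3 for a currently blue node, with beta = 1 - alpha."""
    n = n_red + n_blue
    if n == 0:
        return "blue"  # assumption: a node without neighbors stays blue
    beta = 1.0 - alpha
    return "red" if alpha * n_red / n > beta * n_blue / n else "blue"

# One red neighbor out of five (20%) is enough; one out of six (about 17%) is not.
decision_five = choose_color(n_red=1, n_blue=4)
decision_six = choose_color(n_red=1, n_blue=5)
```

In the hierarchy tree, an employee with no direct reports has her supervisor as her only neighbor, so under this rule she turns red as soon as the supervisor does.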
We simulated our models 10 times and report our findings. We compare three properties of the simulated epidemics against the true number of adoptions over time: (i) overall number of infections, (ii) cumulative number of infections over time, and (iii) total time required to infect N employees. We find that our models’ estimates are consistent with the real observations.

5.4.1 Baselines

We compare our proposed models’ ability to approximate the true cumulative distribution of infected users with three models, which have shown superior performance in the task of information and innovation diffusion in social networks. Particularly, we consider the Susceptible-Infected Model [95], the Independent Cascade Model [109], and the Diffusion Model proposed by [1, 12, 41].

• Susceptible-Infected Model (SI) [95]: According to the SI model, each node can infect her neighbors, each with probability $p_{SI}$. We considered the Susceptible-Infected-Susceptible (SIS) and Susceptible-Infected-Resistant (SIR) models [88], as well as the Susceptible-Infected-Dead (SID) model [102] as alternatives to model social contagion, as these models are widely used in prior work. These models, however, do not appropriately capture the semantics of adoption, according to which an employee that joins the microblogging service neither unsubscribes (returning to the susceptible state) nor becomes resistant. Further, our analysis did not provide any supporting evidence for the hypothesis that infected employees do not infect others, so modeling them as “dead” is not appropriate in this case.
• Independent Cascade Model [109] (see Section 5.2.1).
• Diffusion Model (DM) [1, 12, 41]: Each individual’s willingness to adopt the microblogging service at time t, $U_u^t$, is modeled by three main elements: the service’s stand-alone benefit, network effects, and the idiosyncratic reservation utility.
Formally, $U_u^t = Q_u + \gamma N_u^{(t-1)} - R_u$, where $Q_u$ represents the service’s intrinsic value perceived by employee u, which is not affected by whether other people adopt it or not. $N_u^{(t-1)}$ represents the proportion of adopters in u’s neighborhood at time t − 1, and γ denotes the relative importance of network effects against stand-alone benefits. $R_u$ indicates u’s inherent reluctance or reservation about adopting the new service.

5.4.2 Experimental Results

First, we study simulation results produced by the baselines, i.e. the SI, ICM and DM models. Figure 5.8(a) shows the true user adoption curve, compared to simulation results produced by the SI model, for varying infection probability values. We notice that the simulated curves do not fit the real cumulative number of adoptions over time. High infection probability values result in sudden outbreaks, whereas very small probability values result in smooth cumulative distributions that do not exhibit the statistical properties of the true cumulative number of infected users. The total number of infections and the time required to infect the whole body of employees are also inconsistent with the observed adoption curve.

Figure 5.8(b) shows simulation results produced by the ICM model, for varying infection probability values. Clearly, the simulation results do not fit the real cumulative number of adoptions over time. In fact, this model results in sudden epidemics, which also fail to cover the entire population, and eventually come to a halt. No new infections are achieved due to the fact that each exposure has a single chance of success.
If the result of an exposure is no infection, that connection is not examined again. Hence if the root of a subtree is not infected, the infection cannot proceed further down the subtree. The simulation results corroborate our conjecture that the naive conditional probability does not directly give the probability increase due to influence via multiple joining “friends” [9] (see Section 5.2.1).

Figure 5.8: True and predicted cumulative number of employees who have adopted the microblogging service (i.e. infected users). Time is measured in days. Solid line curves represent the outcome of (a) the SI model for various probabilities of infection, (b) the ICM model for various probabilities of infection, (c) the DM model for various numbers of initial adopters, (d) ten runs of our complex contagion model (see Section 5.3.1), and (e) ten runs of our complex cascade model (see Section 5.3.2). [Each panel plots infected users against time. (a) SI Model (Baseline): p = 0.02, 0.03, 0.05 and the true curve. (b) ICM (Baseline): p = 0.3, 0.4, 0.5, 0.7, 0.8, 0.9. (c) DM (Baseline): 1, 5 and 7 initial adopters and the true curve. (d) Complex Contagion Model: true curve and average. (e) Complex Cascade Model: true curve and average.]

Figure 5.8(c) shows simulation results produced by the DM model, for varying numbers of initial adopters. When the first true adopter is selected to start the infection, the epidemic progresses slowly. Instead, when five true adopters are used, the epidemic is substantially sped up.
When the seed set contains seven of the true adopters, the simulation result adequately fits the observed adoption curve, without however exhibiting the statistical properties of the true cumulative number of infected users. Overall, this model too fails to capture the hidden dynamics of technology adoption at the workplace.

Next, we show the outcome of ten runs of our complex contagion model (see Section 5.3.1) in Figure 5.8(d). The figure also shows the average of the ten runs. Notice the very good alignment between reality and the simulated epidemics in all cases. Not all runs reach the total number of true infections by the time threshold. Further, a few runs overestimate the cumulative number of infections, resulting in rapid epidemics. Unlike the baselines, our complex contagion model fits the true cumulative number of infected users more naturally in all cases. In particular, the simulation results closely follow the speedups and slowdowns of adoption over time, exhibiting the same non-linear characteristics as the true adoption curve. Some runs diverge from the true curve after about 400 days. However, running the model numerous times and averaging the results seems to adequately approximate the statistical properties of the true cumulative number of infected users. We conclude that this is a direct result of the asymmetric contagion due to the hierarchical influence on adoption and the integration of peer pressure due to the growing popularity of the microblogging service at the enterprise.

Finally, we present the outcome of ten runs of our complex cascade model (see Section 5.3.2), and their average, in Figure 5.8(e). In this case too, simulated epidemics match reality very well. Similarly to epidemics produced by our contagion model, not all runs reach the total number of true infections by the time threshold. Further, smooth regimes of adoption, as well as speedups and slowdowns in the acceptance of the microblogging service by employees, are apparent.
Unlike our contagion model, this model slightly overestimates the cumulative number of infections. In all cases, however, we find that this model too fits rather closely to the true cumulative number of infected users, replicating the statistical properties of the empirical epidemic.

5.5 Summary

In this chapter, we studied the effect of the formal organizational structure on the adoption mechanism of a microblogging service at the enterprise. We addressed the factors that govern the process of adoption at both microscopic and macroscopic levels. We found, microscopically, that employees’ tendency towards adopting or not adopting the new microblogging service is influenced by their direct supervisors (dependency on the network structure). We proposed the ι-score as a prominent indicator of influence imposed by managers on their teams, and we offered evidence that middle-level managers are on average more successful in promoting the adoption of the new service. Further, we empirically measured employees’ likelihood of adopting the new microblogging service with respect to the behavior of the general crowd. We revealed two distinct patterns that capture the increase in adoption likelihood as a function of the overall service popularity among the employee population. We incorporated our findings into two intuitive and simple adoption mechanisms, which capture both local and global influence, accurately reproducing the adoption process at the macroscopic level. Simulation results show that our models fit the observed adoption curve substantially better than commonly used diffusion models. Our findings have important implications for enterprises’ understanding of the mechanisms driving adoption of new technologies, and could assist in designing better strategies for rapid and efficient technology adoption and information dissemination within the corporation.
A limitation of our study is that we estimate causal effects only within the formal organizational chart, due to the fact that we are unable to observe the actual adoption “cascade” (i.e. who really influences whom). We are planning to further evaluate our results with extended surveys and targeted interviews, as well as incorporate more datasets in future work. We also plan to extend our models to allow for influence scores to vary over time, as well as incorporate the different roles individuals assume in the adoption process, accounting for influence variations as a function of employees’ level in the organization hierarchy. We would also like to investigate the effect of network evolution (e.g. layoffs, or new hires) on influence, since one’s influence may intuitively increase with seniority in the company. Finally, it would be interesting to study adoption dynamics in the presence of competing technologies.

Chapter 6

Predicting Communication Intention

In social networks, where users send messages to each other, the issue of what triggers communication between unrelated users arises: does communication between previously unrelated users depend on friend-of-a-friend type of relationships, common interests, or other factors? In this chapter, we study the problem of predicting directed communication initiation in online social media, between users who were not previously structurally connected.

Link prediction is similar to communication intention prediction in that it uses network structure for prediction. However, these two problems exhibit fundamental differences that originate from their focus. Instead of trying to give an answer to the classification problem of whether an edge between users u and v exists or not, we are trying to understand the factors behind the intention of user u to send a direct message to user v.
An edge represents a conversation between users rather than friendship and its directionality matters; it depends on the user who starts the conversation and is asymmetric ($e_1 = (u,v) \neq e_2 = (v,u)$). Further, each edge is created under a specific context (e.g. a message sent under a specific topic in group $g_1$, using a set of hashtags $S_{t_1}$).

Link prediction refers to the problem of predicting the existence of a link between two entities in an entity relationship graph, by calculating entity similarity based on entity attributes [177] and the graph structure [139, 127, 112]. Social link prediction in particular has gained a lot of attention, since it can assist users in discovering and making new friends, improving their overall user experience. On the other hand, using link prediction, companies exploit social networking sites for monetization, by selectively expanding their targeted user base and by triggering targeted advertising campaigns.

Link prediction in social networks is a challenging problem, as social networking data is inherently noisy and heterogeneous. Overall, social networking users provide scarce information about their interests in their profiles, which are often incomplete and obsolete. Further, user-generated published content mostly consists of free, unstructured text, which often does not adhere to grammatical and syntactical rules, contains slang terms and abbreviations, and is often of restricted length (e.g. 140 characters in Twitter). Hence, social networking content includes useful information for social link prediction, but this information is not well structured, and is often misleading or ambiguous. For these reasons, most social link prediction approaches calculate graph-based proximity scores [139, 127], asserting that the “closer” two nodes are in the social graph, the more likely they are to become linked in the future. Liben-Nowell et al.
[127] explored several such metrics for social link prediction, and showed that the Adamic/Adar metric performs best in scientific co-authorship networks. Link prediction approaches based on random walks include [40, 7, 220]. Such techniques mainly focus on the social network structure and are computationally expensive. [65] addressed the computational cost, using efficient topological features. Probabilistic inference approaches include [104, 118], but also exhibit high computational cost. Machine learning approaches include [87, 170, 64].

However, information spread in social networks depends not only on the underlying network structure, but also on the information itself, and the nature of the process by which nodes interact. In social media, users broadcast information to all their neighbors (i.e. one-to-many interactions) or specific groups (e.g. groups of friends in Facebook or circles in Google+), whereas public status updates in social networks may be distributed to every single user (i.e. broadcast). One-to-one interactions exist in the form of directed messages between users (e.g. @reply messages in Twitter). Prior link prediction approaches do not exploit the full extent of information available in a social network. The probabilistic model proposed in [175] exploits network structure, content, and user location to infer social ties. However, it relies on a dense representation of all possible, symmetric friendships, and requires training. Instead, we are focusing on asymmetric, directed communication. Our approach does not require training to “learn” probability distributions for every node pair, but can dynamically keep track of changes and recompute pairwise similarities incrementally, when necessary.

While social networking data present challenges in social link prediction, they also offer a wealth of information that can be used for that purpose.
Social networking data often have some sort of “context” associated with them, including user-provided annotations (e.g. description, hashtags) and system information (e.g. upload time). “Individual features might be noisy or unreliable but collectively they provide revealing information” [13] about users. Semantic similarity measures based on folksonomies are systematically analyzed in [141]. The study in [177] showed that strong correlations exist between annotations and social proximity, and used semantic similarity between user annotations as a statistical predictor of friendship links in Flickr. We extend this hypothesis to other aspects of users' activity, semantically enriching them, and considering them in conjunction with local network structure. A method to recommend “interesting” people using prior evidence of user interests in terms of following and tagging was presented in [94]. Recommendation of new friends, based either on social or content similarity [39], required at least one prior interaction between users and employed simple keyword matching schemes. In contrast, we do not assume prior relationships between users. Further, we are trying to predict communication, which does not entail friendship (and vice versa), and to understand the factors that trigger such communication. Factors that motivate users to reply to other users in Twitter were investigated in [185]. Perhaps the closest work to ours is [49]. Their focus, however, is on information diffusion around topics for a given time slice, given past communication evidence.

To address the problem of predicting directed communication initiation in online social media, we propose a novel technique that employs topological evidence (i.e., structural properties of the social network) in conjunction with transactional information (user interactions, both explicit and implicit), which have to date been considered independently.
We performed an empirical study to evaluate our method using an extracted network of directed @-messages sent between users of a corporate microblogging service, which resembles Twitter. We find that our method outperforms state of the art techniques for link prediction. Our findings have implications for a wide range of social web applications, such as contextual expert recommendation for Q&A, creation of new friendship relationships, and targeted content delivery.

6.1 Communication Intention Prediction Framework

Twitter is a social networking site which allows users to share their status updates as well as interact with others by sending short text messages. Users can follow other people and have followers themselves. The following relationship in this case may not be reciprocal. Users can “retweet” posts, make use of #hashtags, and directly address a message to another user (‘@’ followed by a username). Such messages are referred to as @replies. In this setting, users interact either directly or indirectly. Users are explicitly connected when there is a “following” relationship (social link) between them. Users are implicitly connected through indirect activities, such as common use of #hashtags, retweets, and @replies. We can infer a directed network from @reply messages, representing users’ interaction flow [185].

Here, we aim to understand what motivates communication between users. In particular, we seek to answer the following question: “What makes people send @reply messages to strangers?”. Our hypothesis is that information on the @reply network can help predict users’ intention to communicate in microblogging services. To this end, we consider a directed social graph $G = (V,E)$, where each node $u \in V$ represents a user, and an edge $e = (u,v) \in E$ exists if and only if user $u$ has sent at least one @reply message to user $v$. Each edge may have a weight $w_{uv}$ attached to it, such that $w_{uv}$ equals the frequency of replies sent from user $u$ to user $v$.
We have chosen this intuitive definition for edges so as to represent the “transfer” of content from user $u$ to user $v$ when user $u$ sends user $v$ a @reply message. An undirected edge $e_{uv}$, created between users $u$ and $v$ if either user sent a message to the other, would not capture the semantics of directed communication, which may or may not be reciprocal. We formulate the communication intention prediction problem as follows:

Communication Intention Prediction Problem Definition. Given $G' = (V',E')$, a subgraph of $G$ consisting of all nodes in $G$ ($V' \equiv V$) and a subset of edges in $G$ ($E' \subset E$), output a ranked list $L$ of edges (links), not present in $G'$, that are predicted to appear in $G$, such that $E' \cup L \equiv E$.

6.1.1 User Representation

A user can be modeled as a union of her connections and her content (i.e., microblogs). Particularly, we characterize each microblog by a set of attributes (features), using our content representation from Chapter 3, Section 3.1.

Textual Features: These features include raw textual content (bag-of-words), as well as user provided hashtags and group participation. We formally represent raw textual content using tf.idf weight vectors [117] and then utilize the cosine similarity metric to compute similarity between such vectors. We clean the text by performing stemming and basic stop-word elimination. Cosine similarity over such vectors lacks semantics: it ignores semantic associations between terms with similar meaning but poor lexical similarity. Hashtags are meant to be a selective set of highly descriptive keywords of the content of microblogs. Groups are indicative of communities of interest. Stemming and/or stop-word removal, and cosine similarity, do not seem appropriate for them, hence we calculate semantic similarity for these facets.

Figure 6.1: Augmented social graph.

Temporal Features: These features regard the date and time a post was made. We represent date values as the number of minutes elapsed since the Unix epoch.
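As an illustration of the textual representation described above, the following is a minimal sketch of tf.idf weighting and cosine similarity over already tokenized posts. The text does not state which tf.idf variant is used, so the length-normalized term frequency and the logarithmic inverse document frequency below are assumptions, as are the function names and the toy posts. Stemming and stop-word removal are assumed to have been applied beforehand.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized post to a sparse {term: weight} tf.idf vector.

    Assumed variant: tf = count / post length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Number of posts in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(x, y):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zero."""
    dot = sum(weight * y.get(term, 0.0) for term, weight in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical, pre-cleaned posts (illustrative only).
posts = [
    ["server", "outage", "fix"],
    ["server", "outage", "report"],
    ["lunch", "menu"],
]
vectors = tfidf_vectors(posts)
print(cosine_similarity(vectors[0], vectors[1]))  # posts sharing two terms
print(cosine_similarity(vectors[0], vectors[2]))  # posts sharing no terms
```

Posts with no terms in common score zero under this measure regardless of their meaning, which is the limitation that motivates the semantic treatment of hashtags and groups.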
Using this user model, we form an augmented, directed social graph, presented in Figure 6.1. This model is a direct outcome of our formal modeling of social networking data, which we described in detail in Chapter 3, Section 3.1. In the rest of this section we describe in detail our approach, which consists of calculating users' proximity through aggregation over the similarity of their microblogs and their similarity with respect to their network neighborhood.

6.1.2 Semantic Similarity of Textual Features

To compute semantic similarity between hashtags (similarly for groups), we utilize WordNet, a lexical database for English [153]. The WordNet toolkit permits search of relevant concepts in terms of conceptual, semantic and lexical relations: a) Synonyms: terms that denote the same concept (e.g. “car” - “automobile”); b) Hypernyms: more general concepts (e.g. “furniture” is a hypernym of “bed”); and c) Hyponyms: more specific concepts (e.g. “bed” is a hyponym of “furniture”).

Semantic Similarity of Concepts: Given two concepts $a$ and $b$, let $S_a$ denote a set of terms (specified below) that describe $a$, and $S_b$ a set of terms that describe $b$. The similarity $s(a,b)$ between $a$ and $b$ is then defined as the Jaccard index:

$$s(a,b) = s(S_a,S_b) = \frac{|S_a \cap S_b|}{|S_a \cup S_b|}, \tag{6.1}$$

where $|\cdot|$ is set cardinality, $\cup$ is set union, and $\cap$ denotes set intersection. It holds that $0 \leq s(a,b) \leq 1$, with $s(a,b) = 1$ if $S_a$ and $S_b$ are identical, and $s(a,b) = 0$ if they do not share any terms at all. We use this similarity measure, which has been found to be a good trade-off between simplicity and performance [120], to calculate similarity between textual concepts $a$ and $b$. The system returns all synonym concepts of $a$, denoted with $S_a$, as well as all synonym concepts of $b$, denoted as $S_b$. We define the synonym-based similarity between concepts $a$ and $b$ as $s_s(a,b) = s(S_a,S_b)$.
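The Jaccard index of Equation 6.1 can be sketched in a few lines. The function name, the convention of returning zero for two empty sets (Equation 6.1 is undefined in that case), and the example synonym sets are assumptions made for illustration; in practice the sets would be retrieved from WordNet.

```python
def jaccard(terms_a, terms_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two term collections (Equation 6.1)."""
    set_a, set_b = set(terms_a), set(terms_b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty descriptions share nothing.
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical synonym sets for two hashtags (illustrative only).
synonyms_a = {"car", "auto", "automobile", "machine", "motorcar"}
synonyms_b = {"automobile", "car", "motorcar"}
print(jaccard(synonyms_a, synonyms_b))  # 3 shared terms out of 5 distinct -> 0.6
```

The same function applies unchanged to any other pair of term sets describing two concepts.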
Similarly, we define hypernym-based similarity as $s_h(a,b) = s(H_a,H_b)$ and hyponym-based similarity as $s_{hp}(a,b) = s(Hp_a,Hp_b)$, where $H_a$ and $Hp_a$ denote the sets of hypernyms and hyponyms of $a$, respectively.

The semantic similarity between two hashtags (same for groups) $a$ and $b$ can then be computed as the weighted sum of the measures described above. This metric, however, does not discriminate between cases where hashtags belong to the same subtree, as shown in Figure 6.2 (e.g. $t_2$ is a hypernym of $t_1$). To resolve this issue, we compute similarity between the union of annotations for each hashtag. To account for lexical similarity between hashtags $a$ and $b$, we consider their Levenshtein similarity. We use the max operator to select the highest similarity, either semantic or lexical. Formally, we define the semantic similarity between two hashtags $a$ and $b$ ($s_g$ for groups) as follows:

$$s_{tg}(a,b) = \max\{\mathrm{LevenshteinSimilarity}(a,b),\; w_s s_s(a,b) + w_h s_h(a,b) + w_{hp} s_{hp}(a,b),\; s(S_a \cup H_a \cup Hp_a,\, S_b \cup H_b \cup Hp_b)\}, \tag{6.2}$$

where $w_s = w_h = w_{hp} = 1/3$. We chose symmetric weights, since we did not find any particular reason to weight differently the similarity contribution of synonyms, hypernyms, and hyponyms.

Figure 6.2: Example of hashtag hierarchy.

Textual Similarity: We compute textual similarity $s_{tx}(x,y)$ between two bags-of-words $x$ and $y$, represented as tf.idf weight vectors, using cosine similarity [117].

6.1.3 Date Similarity

We compute similarity between dates $d_1$ and $d_2$ as follows:

$$s_d(d_1,d_2) = \begin{cases} 0 & \text{if } |d_1 - d_2| \geq T_d \\ 1 - \dfrac{|d_1 - d_2|}{T_d} & \text{otherwise} \end{cases}, \tag{6.3}$$

where $T_d = 365$. In other words, if $d_1$ and $d_2$ are more than one year apart, we define their similarity as zero. Otherwise, we define their similarity as one minus their difference in days, divided by $T_d$.

6.1.4 Time Similarity

We compute similarity between time instances $t_1$ and $t_2$ as follows:

$$s_t(t_1,t_2) = \begin{cases} 0 & \text{if } |t_1 - t_2| \geq T_t \\ 1 - \dfrac{|t_1 - t_2|}{T_t} & \text{otherwise} \end{cases}, \tag{6.4}$$

where $T_t = 86400$. In other words, if $t_1$ and $t_2$ are more than one day apart, we define their similarity as zero.
Otherwise, we define their similarity as one minus their difference in seconds, divided by $T_t$.

Overall, we compute similarity between timestamps $x$ and $y$ as $s_{df}(x,y) = w_d s_d(x_d,y_d) + w_t s_t(x_t,y_t)$, where $s_d(\cdot,\cdot)$ is calculated using Equation 6.3, $s_t(\cdot,\cdot)$ is calculated using Equation 6.4, and $w_d + w_t = 1$. Different $T_d$ and $T_t$ values may yield optimal results for different datasets. We leave users the ability to set these thresholds according to their respective needs.

6.1.5 Feature Set Similarity

We use a variation of the Hausdorff point set distance measure [93] to calculate similarity between two sets of features $A = \{a_1,a_2,\ldots,a_m\}$ (e.g. X’s hashtags) and $B = \{b_1,b_2,\ldots,b_n\}$ (e.g. set of hashtags associated with post Y), as follows:

$$S_H(A,B) = \frac{1}{|A|} \sum_{k=1}^{|A|} \max_i \{\mathrm{sim}(a_k,b_i)\}, \tag{6.5}$$

where $\mathrm{sim}(a_k,b_i)$ is any similarity measure on any two set elements $a_k$ and $b_i$. This is the average of the maximum similarity of features in set $A$ with respect to features in set $B$ [181]. Like the original Hausdorff distance metric, this similarity measure is asymmetric with respect to the sets: in general, $S_H(A,B) \neq S_H(B,A)$.

6.1.6 Content Proximity

To compute similarity between two posts $p_1$ and $p_2$ we compute the similarity between each of their attributes, respectively. Combining all similarity measures described above in a weighted sum, we get the similarity between two posts $p_1$ and $p_2$ as follows:

$$S(p_1,p_2) = w_g s_g(p_{1g},p_{2g}) + w_{tg} S_H(p_{1tg},p_{2tg}) + w_{tx} s_{tx}(p_1,p_2) + w_{df} s_{df}(p_1,p_2), \tag{6.6}$$

where $w_{df} + w_g + w_{tg} + w_{tx} = 1$. In our experiments we consider numerous weighting schemes and report our observations.

The similarity measure between two users $u$ and $v$ with respect to their microblogs can then be computed using the modified Hausdorff distance as follows:

$$S_C(u,v) = \frac{1}{|u_p|} \sum_{k=1}^{|u_p|} \max_i \{S(u_{p_k},v_{p_i})\}, \tag{6.7}$$

where $u_p$ denotes the set of user $u$’s microblogs.
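The thresholded similarity of Equations 6.3 and 6.4 and the Hausdorff-style aggregation of Equations 6.5 and 6.7 can be sketched as below. This is a minimal illustration rather than the implementation used in the experiments: the function names, the choice to return zero when either set is empty, and the toy inputs are all assumptions.

```python
def thresholded_similarity(x, y, threshold):
    """Equations 6.3 / 6.4: 0 beyond the threshold, else 1 - |x - y| / threshold."""
    diff = abs(x - y)
    return 0.0 if diff >= threshold else 1.0 - diff / threshold


def hausdorff_similarity(items_a, items_b, sim):
    """Equation 6.5: average, over items of A, of their best match in B.

    `sim` is any element-level similarity. Passing whole posts as items and a
    post-level similarity (Equation 6.6) as `sim` yields Equation 6.7.
    """
    items_a, items_b = list(items_a), list(items_b)
    if not items_a or not items_b:
        # Assumed convention: nothing to compare means zero similarity.
        return 0.0
    return sum(max(sim(a, b) for b in items_b) for a in items_a) / len(items_a)


def exact_match(a, b):
    """Stand-in element similarity; Equation 6.2 would be used for hashtags."""
    return 1.0 if a == b else 0.0


# Hypothetical hashtag sets (illustrative only).
tags_x = ["python", "ml"]
tags_y = ["ml", "cloud", "devops"]
print(hausdorff_similarity(tags_x, tags_y, exact_match))  # 1 of 2 matched -> 0.5
print(hausdorff_similarity(tags_y, tags_x, exact_match))  # 1 of 3 matched

# Two times of day six hours apart, with T_t = 86400 seconds.
print(thresholded_similarity(3600, 25200, 86400))  # 1 - 21600/86400 -> 0.75
```

Swapping the arguments changes the normalizing set size, which is why the two hashtag calls above return different values and the measure is asymmetric.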
Our content proximity metric is easily extensible to other types of resources, such as documents, videos etc., that have contextual features attached to them.

6.1.7 Network Proximity

To compute similarity between two users with respect to their network proximity we considered numerous proximity methods proposed in the literature. For simplicity and reduced complexity, we chose to use a modification of the common neighbors metric. We define network proximity between users $u$ and $v$ as:

$$S_N(u,v) = s(\Gamma_u,\Gamma_v) = \frac{|\Gamma_u \cap \Gamma_v|}{|\Gamma_u|}, \tag{6.8}$$

where $\Gamma_u$ denotes the set of $u$’s neighbors. Because we normalize by $|\Gamma_u|$, $S_N$ is asymmetric with respect to users. This way closeness is calculated on the basis of the percentage of common neighbors instead of their absolute number, with a higher percentage indicating a greater overlap of common interests.

6.1.8 User Similarity

We calculate user similarity as a weighted function of content and network proximity. We define similarity between two users $u$ and $v$ as:

$$S(u,v) = \lambda S_C(u,v) + (1-\lambda) S_N(u,v), \tag{6.9}$$

where $\lambda$ controls the tradeoff between content and network proximity.

For our prediction problem, we first construct the augmented social graph $G(V,E)$. Given a user $u$, we compute user similarity in a top down fashion for all facets, for all of $u$’s posts with respect to all other users in the network that do not belong to the set of user $u$’s direct contacts, using Equations 6.7-6.9. To reduce complexity, we can restrict similarity calculation to users within distance $d$ of user $u$, instead of considering the complete user corpus.

6.2 Empirical Evaluation

We demonstrate the effectiveness of our framework on the enterprise microblogging dataset described in Chapter 2, Section 2.3.1. For our evaluation, we remove some of the edges in the post-reply network and recommend the links based on the pruned graph.
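For reference, the network proximity of Equation 6.8 and the user similarity of Equation 6.9, on which the evaluation relies, amount to the following minimal sketch. The adjacency-dictionary representation, the function names, the zero score for a user without neighbors, and the example values are assumptions; in particular, the text does not specify whether a user's neighbors in the directed graph are taken along outgoing edges, incoming edges, or both, so the sketch simply takes neighbor sets as given.

```python
def network_proximity(neighbors, u, v):
    """Equation 6.8: fraction of u's neighbors that are also neighbors of v."""
    gamma_u = neighbors.get(u, set())
    gamma_v = neighbors.get(v, set())
    if not gamma_u:
        # Assumed convention: a user with no neighbors has zero proximity.
        return 0.0
    return len(gamma_u & gamma_v) / len(gamma_u)


def user_similarity(content_proximity, net_proximity, lam):
    """Equation 6.9: lam weights content proximity, (1 - lam) network proximity."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * content_proximity + (1.0 - lam) * net_proximity


# Hypothetical neighbor sets (illustrative only).
neighbors = {
    "alice": {"bob", "carol", "dave", "erin"},
    "frank": {"bob", "carol"},
}
print(network_proximity(neighbors, "alice", "frank"))  # 2 of 4 -> 0.5
print(network_proximity(neighbors, "frank", "alice"))  # 2 of 2 -> 1.0

# Combine an assumed content proximity of 0.9 with the network score.
score = user_similarity(0.9, network_proximity(neighbors, "alice", "frank"), 0.8)
print(score)
```

The two proximity calls show the asymmetry introduced by normalizing with the first user's neighbor count.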
We use four-fold cross validation by randomly dividing the set of edges into four partitions, use one partition for prediction, and retain the links in the other partitions. We randomly sample 100 users and recommend the top-$k$ links for each user. We use precision, recall and mean reciprocal rank (MRR) for reporting accuracy. We measure precision at $k$ as:

$$P@k = \frac{1}{|S|} \sum_{p \in S} \frac{N_k(p)}{k},$$

where $S$ is the set of sampled users and $N_k(p)$ is the number of truly linked persons in the top-$k$ list of user $p$. Similarly, we measure recall at $k$ as

$$R@k = \frac{1}{|S|} \sum_{p \in S} \frac{|F_p \cap R_p|}{|F_p|},$$

where $F_p$ denotes the truly linked user set of person $p$ and $R_p$ denotes the set of recommended users of person $p$ ($|R_p| = k$). Finally, we measure MRR at $k$ as

$$MRR@k = \frac{1}{|S|} \sum_{p \in S} \frac{1}{rank_p},$$

where $rank_p$ denotes the rank of the first correctly recommended link of user $p$.

Table 6.1: Weighting schemes.

Metric       w_tx   w_tg   w_g    w_df   w_d   w_t
SS Uniform   0.25   0.25   0.25   0.25   0.5   0.5
SS Tags      0.0    1.0    0.0    0.0    0.5   0.5
SS Groups    0.0    0.0    1.0    0.0    0.5   0.5
SS Text      1.0    0.0    0.0    0.0    0.5   0.5
SS Time 1    0.0    0.0    0.0    1.0    0.8   0.2
SS Time 2    0.0    0.0    0.0    1.0    0.5   0.5
SS Time 3    0.0    0.0    0.0    1.0    0.2   0.8
SS Mix 1     0.2    0.3    0.4    0.1    0.8   0.2
SS Mix 2     0.2    0.5    0.2    0.1    0.8   0.2
SS Mix 3     0.3    0.45   0.1    0.15   0.8   0.2

We compare our approach against four baseline approaches described below:

• Random Selection: Randomly select a pair of users and create a link between them.
• Shortest Distance: Create a link between user $u$ and the user $v$ closest to him (length of shortest path).
• Common Neighbors: $\sigma(u,v) = |N(u) \cap N(v)|$, where $N(u)$ is the set of neighbors of user $u$ in the social network.
• Shared Vocabulary: Following [25] and our analysis on user homophily as a function of network proximity, we regard the vocabulary of a user as a feature vector whose elements correspond to hashtags and whose entries are the hashtag frequencies for that specific user’s vocabulary.
To compare the hashtag feature vectors of two users, we use the standard cosine similarity defined in Equation 4.4.

We use SS Uniform to denote our method using a uniform weighting scheme. We experiment with multiple weighting schemes, resulting in numerous variations of our approach, shown in Table 6.1.

6.2.1 Methods Comparison

Here we compare the accuracy of our conversation initiation prediction scheme (SS Uniform) against the baselines. Tables 6.2 and 6.3 list average precision, recall and MRR as calculated over our four-fold cross validation experiment for 100 randomly chosen users. For each column, percentage lift is the % improvement that our method achieves over the best performing baseline.

Table 6.2: Prediction precision achieved by different metrics.

Metric                 P@1      P@5      P@10     P@20     P@50
Random                 0.0070   0.0036   0.0057   0.0059   0.0048
Shortest Distance      0.0716   0.0716   0.0643   0.0492   0.0303
Common Neighbors       0.1050   0.0768   0.0637   0.0487   0.0303
Shared Vocabulary      0.0327   0.0318   0.0247   0.0193   0.0141
SS Uniform, λ = 0.8    0.162    0.109    0.089    0.066    0.039
Precision Lift %       54.29    41.93    38.41    34.14    28.71

Random selection performs the worst as expected, since the more users and edges there are in the graph, the harder it becomes to recommend correct links by random selection. Shortest Distance also performs poorly, while Common Neighbors performs slightly better. Even though we do not report results for the Adamic/Adar and Katz metrics, they perform as poorly as Common Neighbors. This indicates that graph structure has some predictive power but is insufficient by itself to perform well. Intuitively, close proximity due to a short @replies path does not necessarily entail that a direct @reply message will be sent. Shared Vocabulary is comparable in accuracy to network-based metrics, even though it performs worse overall, most probably due to the high hashtag alignment which is observed between neighbors for distance $d \leq 6$ in this dataset.
Hence, vocabulary commonality alone is not indicative of communication intention either, but could potentially prove complementary to structural features. Nonetheless, all approaches exhibit a drop in accuracy as a function of $k$, which can be explained by the average degree of nodes in our dataset. It is impossible to get more correct results in the top-$k$ list once the maximum value of correct neighbors is reached.

Our approach outperforms the baselines with respect to all accuracy metrics, often by a considerable margin, by combining local structural characteristics with content and rich semantics about content’s metadata. Although average precision achieved by our approach appears low, ranging from 16.2% for $k = 1$ to 3.9% for $k = 50$, it is 28.71% to 54.29% higher than the precision of the best performing baseline (Table 6.2). Similarly, our approach performs better in terms of recall and MRR, where it achieves its largest improvement over the best performing baseline. Note that for $k = 50$, the recall of our approach is 50%. Precision values follow a heavy-tailed distribution, indicating a strong difficulty in making accurate predictions for some users, while achieving very high precision (100% or close to 100%) for others.

Table 6.3: Prediction recall and MRR achieved by different metrics.

Metric                 R@1      R@10      MRR@1    MRR@10
Random                 0.0020   0.0021    0.0067   0.0016
Shortest Distance      0.0269   0.0204    0.0716   0.0156
Common Neighbors       0.0321   0.0198    0.1050   0.0177
Shared Vocabulary      0.0062   0.0069    0.0327   0.0068
SS Uniform, λ = 0.8    0.162    0.283     0.162    0.25
Lift %                 404.67   1287.25   54.28    1312.43

6.2.2 Weight Scheme Selection

There are two types of parameters in our approach: $\lambda$, and six weighting factors ($w_{tx}$, $w_{tg}$, $w_g$, $w_{df}$, $w_d$, $w_t$), each controlling the significance of a different facet in the proximity calculation. Different datasets may lead to different optimal values for these parameters.
We obtain the best values for our dataset by performing a grid search over ranges of values for these parameters and measuring accuracy on the validation set for each of the configuration settings. Table 6.1 lists some of the weighting schemes we experimented with.

Figure 6.3: (a) Precision@1 and (b) Recall@1 as a function of λ. [Plots: horizontal axis λ from 0 to 1; vertical axes P@1 and R@1; one curve per weighting scheme of Table 6.1, plus the baseline.]

Figure 6.4: Impact of weighting scheme on accuracy (measured @5). [Panels: (a) Precision, (b) Recall, (c) MRR. Horizontal axis: accuracy@5 of SS Uniform, λ = 0.5; vertical axis: accuracy@5 of SS Mix 1, λ = 0.9 and of SS Uniform, λ = 0.9.]

Effect of Parameter λ

Parameter $\lambda$ controls the tradeoff between structural proximity and content similarity. The higher its value, the more significance is given to content similarity. A value of 0 only considers network proximity, whereas a value of 1 only considers content similarity. Figure 6.3 demonstrates the effect of parameter $\lambda$ on Precision@1 and Recall@1, achieved by the weighting schemes presented in Table 6.1.

Schemes which consider only one type of content facet (i.e. SS Uniform, SS Tags, SS Groups, SS Text, and the three SS Time schemes) perform better than the baseline, since they still combine network and content proximity scores to make a good prediction.
Interestingly, time schemes perform better than schemes considering hashtags or text alone, but perform worse than SS Groups. This indicates that timing between replies is essential in this dataset, an outcome which can be explained as a result of the corporate environment, which requires prompt answers.

Figure 6.5: Average precision (measured @5) of users having k (a) posts or (b) neighbors in the @replies network. [Panels: (a) precision as a function of content, horizontal axis number of posts; (b) precision as a function of network structure, horizontal axis number of neighbors. Vertical axis: average P@5; logarithmic axes; curves for SS Mix 1, λ = 0.9, SS Uniform, λ = 0.5 and SS Uniform, λ = 0.9.]

Among the mix schemes, SS Mix 3 performs worst, probably due to the discounted weighting of the group facet. SS Mix 2, which gives more emphasis to the hashtags and treats the textual and group facets equally, is the best performing weighting scheme, except when $\lambda = 1$. In this case, the uniform scheme outperforms the rest. Nonetheless, all weighting schemes considerably outperform the baseline. The $\lambda$ value that provides the best tradeoff between content and network appears to be $\lambda = 0.8$ (when also considering $k \in \{5,10,20,50\}$).

Effect of Weighting Scheme

Figure 6.3 hints at how weighting schemes affect accuracy overall. Figure 6.4 demonstrates the effect of weighting schemes on accuracy per user. Here, we compare accuracy@5 of three schemes; however, our observations hold for all of our schemes and all top-$k$ results. In most cases, both SS Mix 1, $\lambda = 0.9$ and SS Uniform, $\lambda = 0.9$ significantly outperform SS Uniform, $\lambda = 0.5$ with respect to precision, recall and MRR. However, in a few cases SS Uniform, $\lambda = 0.5$ performs better than the other two weighting schemes. Moreover, different weighting schemes perform better for different users (e.g. SS Mix 1, $\lambda = 0.9$ and SS Uniform, $\lambda = 0.9$ achieve different accuracy values for the same users). This can be explained as a result of different criteria being important for different users (i.e. content versus time, or network proximity). Hence, personalization is imperative in order to achieve better accuracy overall.

Content Availability and Structural Proximity

Figure 6.5(a) shows average precision as a function of the number of available posts. Figure 6.5(b) shows how average precision depends on network structure. To test the effect of these two factors we performed an experiment where we imposed either structural or content restrictions. Structural restrictions implicitly impose some content restrictions, since an edge represents at least one @reply message. This is not a 1-to-1 mapping, since many @replies between the same pair of users refer to the same edge. The number of messages does not necessarily reflect the number of @replies, since many posts may be broadcast instead of direct messages.

Figure 6.6 shows MRR as a function of $\lambda$ for different restrictions. Intuitively, the greater the number of posts (or the number of neighbors) available for a user, the greater the statistical evidence, resulting in more accurate predictions. In fact, by restricting users to have ≥ 50 posts, we achieve on average (over all $k \in \{1,5,10,20,50\}$) 58.7%-68% MRR (32.7%-37.2% precision and 34.1%-38% recall). Considering users with ≥ 100 posts, we achieve on average 67.5%-73.5% MRR (38.9%-45.2% precision and 38.9%-45.3% recall).

Figure 6.6: MRR (measured @5) as a function of λ. Different plots impose structural or content availability restrictions (≥ 10 links, ≥ 20 posts, ≥ 50 posts, ≥ 100 posts). All measurements refer to weighting scheme SS Mix 1.

6.3 Summary

In this chapter, we introduced the problem of communication intention prediction in social networks.
We addressed this problem using a novel framework that exploits both local structural characteristics and semantically enriched, user generated content. We tested the effectiveness of our approach on a post-reply network, inferred from a corporate microblogging service, demonstrating that communication intention prediction can be accurately performed. Particularly, we showed that the more statistical evidence is available per user, the better the accuracy we can achieve. Based on our findings, our methodology shows great potential to help users identify “interesting” people to initiate conversations with, collaboratively solve problems, or simply make new friends. Although we did not explore temporal effects on our prediction problem, we found evidence that personalized weighting schemes can greatly improve overall accuracy. We leave this as future work, along with experimentation on larger datasets from Twitter and Facebook.

Chapter 7
Social Link Prediction in Online Social Tagging Systems

The rich activities of users of online social media sites reveal crucial information about their behavior and interests. Users’ interaction with online content can be effectively captured with tripartite graphs [84]. The models we present in this chapter mine users’ latent interests from their interactions with online content, instead of relying on user-generated profiles, which may be incomplete or obsolete. Such modeling of users’ latent interests lets us answer a broad range of important queries, such as which users have similar interests (i.e., community detection) and which other users a user would be most interested in (i.e., social link prediction). In this chapter, we propose latent topic models as a principled way of reducing the dimensionality of online social tagging systems, and of capturing the dynamics of the collaborative annotation process.
We propose three generative processes to model latent user tastes with respect to resources they annotate with metadata. Our motivation is to accurately model the structure of tripartite graphs that emerges in online social tagging systems (see Chapter 2, Section 2.2). As illustrated in Figure 7.1, the social network structure constrains information flow within the boundaries of “follow” relationships, but “common interest” propagation through the network also affects how users behave and ultimately how the network changes and grows.

Figure 7.1: Dynamics on and of the network structure are strongly coupled in social media. The top layer illustrates the social network structure, where green arrows represent “follow” relationships. Information flows along the arrows’ direction. Dashed black arrows mark newly created links, which are created on the basis of “common interests”, i.e., similar tastes in resources (artists in Last.fm) or common tag usage. The middle layer depicts a taxonomy, which may be provided/imposed by the social network or be collaboratively “curated” by the users of the social network. The bottom layer represents a network of resources. Relationships between resources depend on the network type and scope. In Last.fm for example, where resources are artists, connections between artists signify music genres.

In this chapter, we study the role of “interests diffusion” in shaping the evolution of the network structure, particularly the creation of social links. We show that latent user interests combined with social clues from the immediate neighborhood of users can significantly improve social link prediction in the online music social media site Last.fm. Most link prediction methods suffer from the high class imbalance problem, resulting in low precision and/or recall.
In contrast, our proposed classification schemes for social link recommendation achieve high precision and recall with respect to not only the dominant class (non-existence of a link), but also with respect to sparse positive instances, which are the most vital in social tie prediction.

Link prediction in social networks is a challenging problem, as social networking data are inherently noisy and heterogeneous (see also Chapter 6). One key assumption in sociology is the theory of homophily [146], which postulates that people who have similar characteristics tend to form ties. Moreover, it is likely that the stronger the tie, the higher the similarity [78]. The problem of link prediction for social networks has been well studied in numerous domains and contexts [139]. Link prediction models have been proposed that estimate tie strength from users’ attributes and graph structure [139], or from interaction activity and user profile similarity [211]. Vocabulary overlap between users has also been used as an indicator of user connectivity in Flickr [177]. While we adopt this hypothesis, we instead propose a generative process to model content annotation by users. Moreover, we consider this process in conjunction with local network structure. The Markov network framework proposed in [191] defined a joint probabilistic model over the entire graph (entity attributes and links), assuming a Markov dependency (the label of one node depends on its neighbors’ labels). In contrast to our work, their discriminative model only explains social ties conditioned on the observed variables. Prediction and recommendation of links in social networks using random walks [7] depend on knowing almost all links along with a set of source and candidate nodes, and only need to predict a few new links. Sadilek et al. [175] predicted friendship links in Twitter based on one input feature, assuming mutual independence between the observed and hidden variables.
Their model exhibits high complexity due to the vast number of hidden nodes it includes (one for each possible link). Both approaches only consider undirected graphs. Instead, we test the effectiveness of our approach in a directed social network, which captures more realistically the asymmetric relationships between users.

Recently, the problem of link prediction in heterogeneous networks has been studied. PathSim [188] measures the similarity among same-type objects in heterogeneous networks based on symmetric meta paths (sequences of relations between different object types). However, in online social networks many valuable paths are asymmetric and the relatedness of different-typed objects is also meaningful. Hence, PathSim is not suitable in this context. Further, there is no standard procedure to follow in order to explicitly specify path combinations to define meta paths. Typical users do not necessarily possess the amount of domain knowledge required to define meta paths. Choosing the best path by experimentation or learning it from training examples leads to a state space explosion, rendering this approach impractical. A neighborhood multi-relational link prediction approach based on triad census, trivially extended to heterogeneous networks, was proposed in [47]. Its weighted version is equivalent to weighted common neighbors. Prediction scores are calculated individually for each link type of interest, ignoring latent influence due to meta paths. Instead, our approach combines structural features with latent user interests, while at the same time providing a generative model of the heterogeneous network formation and evolution.

Latent feature based models [89, 149] consider link prediction as a matrix completion problem and employ latent matrix factorization to learn latent factors for each object and make predictions. However, such models disregard the local network structure.
Our model, as an adaptation of the author-topic model, is closely related to methods based on matrix factorization [174]. For applications where models with $n$-ary relations with $n > 3$ need to be considered, tensor factorization techniques are required. Kolda et al. [111] provide a recent overview of leading approaches. Unfortunately, the straightforward application of higher-order tensor models becomes problematic due to computational requirements and data sparsity. A generative model that learns shared tastes of users from network structure and user playlists was proposed in [52]. Even though our approach is similar in spirit, our user modeling radically differs.

Perhaps the work closest to ours is [165]. Their hierarchical system exploits latent user interests based on user profiles, treating users as documents. In this sense, our work is a generalization of their approach, while at the same time requiring significantly less training data to achieve high precision and recall. Further, in their small-scale experiments, they only considered ROC-AUC analysis. We, in contrast, address scalability issues, considering thousands of users who may be arbitrarily connected, resulting in millions of potential friendships. Last but not least, we show that our approach effectively addresses the high class imbalance problem due to data sparsity.

7.1 Generative Models of Tripartite Graphs for Recommendation in Social Media

Social networking sites have offered Internet users a novel way to organize their online digital content and share content with other users. In general, users of social media sites contribute content which is not restricted to one media type (e.g., documents, photos, URLs). Depending on the social media site, users can annotate content using descriptive text (e.g., title and description of photos in Flickr¹) or with metadata (i.e., tags).
User-generated content mostly consists of free, unstructured text, which often does not adhere to grammatical and syntactical rules, contains slang terms and abbreviations, and is often of restricted length (e.g., 140 characters in Twitter²). To improve online content organization, categorization, search and filtering, users have adopted tags (or hashtags). The unrestricted vocabulary users choose from to annotate content has led to the creation of personalized taxonomies, offering greater malleability and adaptability in information organization than formal classification systems, which restrict users to annotating content with predefined keywords. Mitigation of individual taxonomies, despite the inconsistent tag usage between various users, which leads to polysemous and synonymous annotations [75], generates emergent hierarchies of mediated community knowledge [122].

¹ http://www.flickr.com/
² https://twitter.com/

In current social tagging systems, organization, classification and search tend to be rather simplistic in nature, often relying on keyword-based retrieval algorithms or aggregated results stemming from collaborative filtering techniques [86]. Probabilistic models have been successfully used in discovering the set of hidden topics that were responsible for generating a collection of documents (e.g., [14]). In this work, we describe three unsupervised models of online social tagging systems as a principled mechanism to address issues of synonymy, polysemy and tag sparseness. Our probabilistic models capture the generative primitives behind online content annotation, while at the same time extracting information about users’ latent interests and hidden topics from online, large-scale social tagging systems. In particular, we consider artists in Last.fm³ as resources, which users annotate with tags. Tags become concepts in our modeling.
More complex hierarchical Bayesian models can be designed if more types of resources and concepts are considered. The models we describe below can be naturally extended to accommodate other resources and annotation types, such as annotations of Flickr photos, or descriptive text of YouTube videos.

Social tagging systems have been well studied, leading to a vast literature around this area. Gupta et al. summarized different techniques employed to study various aspects of tagging [82]. Halpin et al. studied the basic dynamics behind tagging in the social bookmarking site del.icio.us⁴ and proposed a collaborative tagging model based on preferential attachment and informational value [84]. We instead take a probabilistic, generative approach in an attempt to accurately model collaborative annotation in online social media.

³ http://www.last.fm/
⁴ https://delicious.com/

Probabilistic models have been successfully used in discovering the hidden topics that were responsible for generating a collection of documents [14]. Our model is an adaptation of the author-topic model proposed by Rosen-Zvi et al. [174]. The objective of their work was to provide a generative process for document creation, capable of recovering hidden topics in a document corpus. We extend their model to resources of any type (not just documents) and annotations (instead of words). The social process of annotation generation is unknown, so it is not obvious that such a framework would perform as well in this context.

The model proposed in [19] does not correctly simulate the real social annotation process because users are modeled as creators of content words instead of tags. To overcome the limitations of previous models, all related entities (users, documents, words and tags) and latent variables (topics, user perspectives) were represented in a unified model [136].
Their model exhibits high complexity due to the numerous variables that have to be estimated and, unlike ours, does not sufficiently capture users’ interests. Hidden topic models were proposed to improve social bookmark search results. A context-aware music recommendation system, which leverages the most frequent tags for songs from social tagging Web sites and uses Latent Dirichlet Allocation to determine a set of latent topics for each song, was proposed in [85]. Our models can be effectively used to recommend not only resources, but also tags and users at the same time.

Tag growth and the dynamics of users’ activities in social media were explored using a model that resembles ours [129]. That approach differs from ours in that it modeled posts that contain resources and tags, whereas we model direct annotation of resources. The framework in [132] combined the tasks of user preference discovery and document topic mining through modeling of user-document interactions. In both works, there is only one “tagger” per document, whereas our model captures the social aspect of tagging, allowing a mixture of users to collaboratively contribute to the annotation process. Therefore, our approach is more general and the problem we study more difficult.

A general model to find hidden structures (local clusters and global community structures) in a k-partite graph was proposed in [135]. By introducing hidden nodes into the original k-partite graph, the authors construct a relation summary network to approximate the original k-partite graph under a broad range of distortion measures. We instead focus on the actual generative process that drives the creation of the original tripartite graph. The relational topic model was introduced to model links between documents as binary random variables conditioned on their contents [27].
The topic-link model [133] performed topic modeling and author community discovery in a unified framework but did not provide reasonable results in the task of link prediction. LDA approaches for social link recommendation [168] or tag recommendation [114] are similar to our approach. However, we utilize resources and annotations as descriptors of user interests and we propose three generative models that capture the essence of tripartite graph formation in social networks. We further demonstrate that our approach yields high precision and recall in the social link recommendation task.

7.1.1 The User-Resource-Concept Model

We introduce the User-Resource-Concept model (URC), a probabilistic author-topic model [174], to model users’ interests based on their resource usage and annotation behavior. Topics are hidden variables representing categories that naturally split the corpus into clusters of closely related resources. In Last.fm, topics are equivalent to music genres.

The process of resource annotation can be described as a stochastic process. A group of users $\mathbf{a}_r$, which for the purposes of estimation we assume is observed, collectively annotate resource $r$. For each resource annotation, a user $a$ is chosen uniformly at random. Based on user $a$’s interests and the nature of the resource, a set of topics is selected. Concept $c_{ri}$ (e.g., a tag) is generated based on the selected set of topics.

This generative process is described in graphical form in Figure 7.2a. $x$ indicates the user, chosen from $\mathbf{a}_r$, responsible for a given annotation. Each user is associated with a distribution over latent topics $\theta$, chosen from a symmetric Dirichlet($\alpha$) prior. Assuming there are $T$ latent topics, the multinomial distributions over topics for all users can be represented as a matrix $\Theta$ of size $T \times A$. Its elements $\theta_{ta}$ stand for the probability of assigning topic $t$ to a concept generated by actor $a$. We use $\theta_a$ to denote the $a$-th column of the matrix.
The mixture weights for the chosen user are used to select topic $z$, and a concept is generated according to the distribution $\phi$ corresponding to that topic, drawn from a symmetric Dirichlet($\beta$) prior. Matrix $\Phi$ of size $C \times T$ denotes the multinomial distributions over concepts associated with each topic. $\phi_t$ represents the probabilities of generating concepts from topic $t$. Table 7.1 summarizes this notation. To summarize, we have the following data generation process for URC:

For each actor $a \in \mathcal{A}$ choose $\theta_a \mid \alpha \sim \mathrm{Dirichlet}(\alpha)$.
For each topic $t \in T$ choose $\phi_t \mid \beta \sim \mathrm{Dirichlet}(\beta)$.
For each resource $r \in \mathcal{R}$, given actors vector $\mathbf{a}_r$,
    For each concept $i \in N_r$
        Choose actor $x_{ri} \mid \mathbf{a}_r \sim \mathrm{Uniform}(\mathbf{a}_r)$
        Choose topic $z_{ri} \mid \theta_{x_{ri}}, x_{ri} \sim \mathrm{Multi}(\theta_{x_{ri}})$
        Choose concept $c_{ri} \mid z_{ri}, \beta \sim \mathrm{Multi}(\phi_{z_{ri}})$.

Table 7.1: Notation

Description                                               Symbol           Type
Set of actors                                             $\mathcal{A}$    Set
Number of unique actors                                   $A$              Scalar
Set of concepts                                           $\mathcal{C}$    Set
Number of unique concepts                                 $C$              Scalar
Total number of concepts                                  $N$              Scalar
Set of resources                                          $\mathcal{R}$    Set
Number of unique resources                                $R$              Scalar
Number of topics                                          $T$              Scalar
Dirichlet prior                                           $\alpha$         Scalar
Dirichlet prior                                           $\beta$          Scalar
Probabilities of concepts given topics                    $\Phi$           $C \times T$ matrix
Probabilities of concepts given topic $t$                 $\phi_t$         $C$-dimensional vector
Probabilities of topics given actors                      $\Theta$         $T \times A$ matrix
Probabilities of topics given actor $a$                   $\theta_a$       $T$-dimensional vector
Number of actors associated with the $r$-th resource      $A_r$            Scalar
Actors related to the $r$-th resource                     $\mathbf{a}_r$   $A_r$-dimensional vector
Number of concepts associated with the $r$-th resource    $N_r$            Scalar
Concepts related to the $r$-th resource                   $\mathbf{c}_r$   $N_r$-dimensional vector
$i$-th concept in the $r$-th resource                     $c_{ri}$         $i$-th component of $\mathbf{c}_r$
Concepts related to all resources                         $\mathbf{c}$     $N$-dimensional vector
Actor assignments                                         $\mathbf{x}$     $N$-dimensional vector
Actor assignment for concept $c_{ri}$                     $x_{ri}$         $i$-th component of $\mathbf{x}_r$
Topic assignments                                         $\mathbf{z}$     $N$-dimensional vector
Topic assignment for concept $c_{ri}$                     $z_{ri}$         $i$-th component of $\mathbf{z}_r$

The joint distribution of observed and hidden variables is:

$$P(\mathbf{c}, \mathbf{z}, \mathbf{x}, \Phi, \Theta \mid \alpha, \beta, A) = \prod_{t=1}^{T} P(\phi_t \mid \beta) \prod_{a=1}^{A} P(\theta_a \mid \alpha) \prod_{r=1}^{R} \prod_{i=1}^{N_r} P(x_{ri} \mid \mathbf{a}_r)\, P(z_{ri} \mid \theta_a, x_{ri})\, P(c_{ri} \mid \phi_t, z_{ri}). \tag{7.1}$$

7.1.2 The User-Resource Model

The User-Resource model (UR), shown in Figure 7.2b, is structurally equivalent to the LDA model [14]. We begin by reducing the tripartite graph of users, resources and concepts into a bipartite graph of users and resources. In this modeling, users’ interests are expressed in terms of activity involving similar resources (e.g., users A and B have similar tastes if user A creates a resource R, which user B comments on). Hence, each user owns one “document”, and resources become vocabulary terms that users select to “compose” their documents.

Figure 7.2: Generative models of tripartite graphs. (a) User-Resource-Concept model, (b) User-Resource model, (c) User-Concept model.

In this model, $\phi$ denotes the matrix of topic distributions, with a multinomial distribution over $R$ resources for each of $T$ topics being drawn independently from a symmetric Dirichlet($\beta$) prior. The matrix of user-specific mixture weights for these $T$ topics, $\theta$, is drawn independently from a symmetric Dirichlet($\alpha$) prior. Each resource $r$ is drawn from the topic distribution $\phi$ corresponding to $z$, the topic responsible for generating that resource, which is in turn drawn from the $\theta$ distribution for that user. To summarize, the UR model assumes the following generative process for each actor $a \in \mathcal{A}$:

Choose $\theta \mid \alpha \sim \mathrm{Dirichlet}(\alpha)$.
For each topic $t \in T$ choose $\phi_t \mid \beta \sim \mathrm{Dirichlet}(\beta)$.
For each resource $r_i \in R_a$,
    Choose topic $z_i \mid a \sim \mathrm{Discrete}(\theta)$
    Choose resource $r_i \mid z_i, \beta \sim \mathrm{Discrete}(\phi_{z_i})$.

The joint distribution of observed and hidden variables in this case is:

$$P(\mathbf{r}, \mathbf{z}, \Phi, \Theta \mid \alpha, \beta) = P(\theta \mid \alpha) \prod_{t=1}^{T} P(\phi_t \mid \beta) \prod_{i=1}^{R_a} P(z_i \mid \theta)\, P(r_i \mid \phi_t, z_i). \tag{7.2}$$

7.1.3 The User-Concept Model

The User-Concept (UC) model is shown in Figure 7.2c. Similarly to the UR model, UC is an adaptation of the LDA model [14], with the difference that users are modeled based on their tag usage.
In order to construct this model, we aggregate annotations assigned by users to resources they “own” and use these tags as vocabulary terms. The motivation for this reduction stems from our analysis of the structure of tripartite graphs in Section 2.2. There we argued that bipartite graphs are easier to work with, even though they discard information that could otherwise be used to enhance the modeling of users’ online activities. We use this model as a simpler and more scalable solution to our problem, and compare its effectiveness against URC.

In this model, $\phi$ denotes the matrix of topic distributions, with a multinomial distribution over $N$ concepts for each of $T$ topics being drawn independently from a symmetric Dirichlet($\beta$) prior. $\theta$ is the matrix of user-specific mixture weights for these $T$ topics, drawn independently from a symmetric Dirichlet($\alpha$) prior. For each annotation, $z$ denotes the topic responsible for generating that concept, drawn from the $\theta$ distribution for that user, and $c$ is the concept, drawn from the topic distribution $\phi$ corresponding to $z$. To summarize, the UC model assumes the following generative process for each actor $a \in \mathcal{A}$:

Choose $\theta \mid \alpha \sim \mathrm{Dirichlet}(\alpha)$.
For each topic $t \in T$ choose $\phi_t \mid \beta \sim \mathrm{Dirichlet}(\beta)$.
For each concept $c_i \in N_a$,
    Choose topic $z_i \mid a \sim \mathrm{Discrete}(\theta)$
    Choose concept $c_i \mid z_i, \beta \sim \mathrm{Discrete}(\phi_{z_i})$.

The joint distribution of observed and hidden variables in this case is:

$$P(\mathbf{c}, \mathbf{z}, \Phi, \Theta \mid \alpha, \beta) = P(\theta \mid \alpha) \prod_{t=1}^{T} P(\phi_t \mid \beta) \prod_{i=1}^{N_a} P(z_i \mid \theta)\, P(c_i \mid \phi_t, z_i). \tag{7.3}$$

7.1.4 Parameter Estimation

Given any one of the three models we described above, we can obtain information about which topics users are mostly interested in, as well as a representation of annotations with respect to these topics, by estimating parameters $\Phi$ (probabilities of concepts given topics, as in Table 7.1) and $\Theta$ (probability distribution over topics for each user, given concepts).
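To make the roles of $\Phi$ and $\Theta$ concrete, the UC generative process of Section 7.1.3 can be forward-simulated. The following is a minimal sketch in Python with NumPy, not the implementation used in this work (which was written in Matlab). The function name `simulate_uc`, its arguments, and the hyperparameter defaults are illustrative assumptions, and for simplicity every user is assumed to contribute the same number of annotations.

```python
import numpy as np


def simulate_uc(num_users, num_topics, num_concepts, annotations_per_user,
                alpha=0.5, beta=0.5, seed=0):
    """Forward-simulate the User-Concept (UC) generative process.

    All names and default values here are hypothetical; they only mirror
    the symbols A, T, C, alpha and beta used in the text.
    """
    rng = np.random.default_rng(seed)

    # Phi is C x T: column t is the distribution over concepts for topic t,
    # drawn from a symmetric Dirichlet(beta) prior.
    phi = rng.dirichlet(np.full(num_concepts, beta), size=num_topics).T

    # Theta is T x A: column a is the distribution over topics for user a,
    # drawn from a symmetric Dirichlet(alpha) prior.
    theta = rng.dirichlet(np.full(num_topics, alpha), size=num_users).T

    corpus = []
    for a in range(num_users):
        # Choose a topic for every annotation of user a: z_i ~ Discrete(theta_a).
        z = rng.choice(num_topics, size=annotations_per_user, p=theta[:, a])
        # Choose a concept given each topic: c_i ~ Discrete(phi_{z_i}).
        c = np.array([rng.choice(num_concepts, p=phi[:, t]) for t in z])
        corpus.append((z, c))
    return phi, theta, corpus


phi, theta, corpus = simulate_uc(num_users=4, num_topics=3,
                                 num_concepts=12, annotations_per_user=6)
print(phi.shape, theta.shape, len(corpus))
```

Parameter estimation runs in the opposite direction: only the concepts in `corpus` are observed, and $\Phi$, $\Theta$ and the topic assignments have to be recovered from them.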
The hidden structure of topics is captured by the posterior distribution of the hidden variable $\mathbf{z}$ (probability of topic mixtures of concepts). We adopt collapsed Gibbs sampling [79] to compute the posterior distribution on $\mathbf{z}$. We use the result to infer matrices $\Phi$ and $\Theta$.

7.2 Social Link Prediction with Hidden Topics (SLIgHT)

One natural application of our modeling is social link prediction given some snapshot of a tripartite graph. Makrehchi [140] constructed a semi-bipartite graph of extracted hidden topics from user profiles and then applied topological metrics such as Katz [105] and short path scores to rank and recommend users. Makrehchi [140] showed that this method outperforms approaches that rely on similarity measures of feature vectors (i.e., Bag of Words) and low rank approximation (i.e., Latent Semantic Indexing (LSI)). We extend this approach by considering resources and metadata to represent users’ interests instead of documents consisting of words. Our goal is to use this approach as a baseline in comparison to our novel techniques for social link prediction (see Section 7.3) in a generic social network that is not as focused as academic networks extracted from technical paper co-authorships.

Gibbs sampling of the posterior distribution on $\mathbf{z}$ results in matrices $\Phi$, $\Theta$ and $C$ [174]. Topic-actor matrix $\Theta$ in particular represents a bipartite graph linking topics to actors. Using matrix $\Theta$, we can build a semi-bipartite graph $G$ [140] of size $(A+T) \times (A+T)$:

$$G = \begin{bmatrix} S & \Theta^{\top} \\ \Theta & \Theta \times \Theta^{\top} \end{bmatrix}. \tag{7.4}$$

$S$ represents relationships between users and is unknown. Makrehchi [140] used the Katz score [105] to predict the missing values of the unknown block $S$, such that $S = \mathrm{Katz}(G)$, in an academic social network. The Katz score, a generalization of degree centrality, measures the degree of influence of an actor in a social network [105]. Typical centrality measures only consider the geodesic distance between a pair of actors.
Instead, Katz takes into account the total number of walks between a pair of actors, penalizing long paths by an attenuation factor $\delta \in (0,1)$ (typically chosen below the reciprocal of the spectral norm of matrix $G$, which guarantees that the series converges), raised to the power of the path length. The Katz score for any two entries $g_i, g_j$ of matrix $G$ can be computed as follows:

$$\mathrm{Katz}(g_i, g_j) = \sum_{l=1}^{\infty} \delta^{l}\, \lvert \mathrm{path}_l(g_i, g_j) \rvert, \tag{7.5}$$

where $\mathrm{path}_l(g_i, g_j)$ is the set of all paths of length $l$ between $g_i$ and $g_j$.

7.2.1 Threshold Selection

Because link prediction between two nodes is a binary classification problem, similarity matrix $S$ has to be converted into a binary adjacency matrix. The process consists of examining the similarity between each pair of users and checking whether its value exceeds a threshold. As the similarity threshold decreases, more links are added, leading to more true positives but also to more false positives. We determine the best threshold value automatically based on the probability of the existence of a link in a social network [140].

In a sparse, directed social network, the number of existing links is considerably smaller than the number of all possible links. The density of a directed social network can be calculated as $\Delta = \frac{L}{n(n-1)}$, where $L$ is the number of true links in the network and $n$ is the number of nodes. The higher the density of the graph, the higher the probability of a link, hence the higher the probability that nodes are connected to others. Conversely, low density indicates a sparse graph with few connected nodes, isolated communities and unreachable nodes.

Given link probability $p = \Delta$, we determine the optimal threshold value $\tau$ by minimizing the squared error between the empirical density of the graph that results from link prediction (obtained by converting all similarity values that exceed $\tau$ into links) and the true density of the graph $p$. Formally, we define the optimal threshold as follows:

$$\hat{\tau} \doteq \arg\min_{\tau} \left( \left[ \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{1}\{S(n_i, n_j)\}}{n(n-1)} \right] - p \right)^{2}, \tag{7.6}$$

where

$$\mathbb{1}\{S(n_i, n_j)\} = \begin{cases} 0, & S(n_i, n_j) \le \tau \\ 1, & S(n_i, n_j) > \tau. \end{cases} \tag{7.7}$$

7.3 Social Link Prediction Using Latent Semantics and Network Structure

Social networking users engage in rich activities that reveal crucial information about their interests and tastes. Explicit user profiles, typically consisting of personal information like hobbies, favorite movies and music, etc., can be mined in order to identify user interests, based on which friendship predictions can be made. However, information in user profiles tends to be scarce or obsolete. Instead of mining explicit user profiles, we gather valuable information about users’ interests from metadata that describe their social network activities. In Last.fm we capture music genre preferences by mining listening frequencies to artists as well as by recording the tags with which users annotate the artists they listen to most.

In our work, we exploit the latent description of users’ interests (matrix $\Theta$), which we learn using Gibbs sampling. Our intention is to explore whether the activation of a social link induces a local alignment of interests or if, conversely, a similarity in interests triggers the creation of a social link. We test this hypothesis on our Last.fm dataset, which provides the annotation metadata needed to construct our generative models, as well as a ground truth social network to evaluate the accuracy of our recommendations.

We describe user interests in a latent space, organizing them in topics that emerge from user activity and the annotation process. To do so, we use the generative models we presented in Section 7.1, treating users as authors and annotations as the vocabulary authors use to describe resources. Since we do not know a priori the optimal number of topics, we vary the number of topics, describing user interests at variable granularities, from more abstract to extremely specific. We treat link recommendation as a binary classification problem, where 1 indicates a link and 0 indicates the absence of a link.

In the rest of this section we propose four classification schemes that utilize matrix $\Theta$ to learn how to recommend appropriate links. All classifiers are generated as support vector machines (SVM) with Gaussian radial basis function kernels [44]. Finally, the last classification scheme exploits all previous classifiers, building a hierarchical system.

7.3.1 Latent Topics & Common Neighbors Scheme

Many social link prediction approaches calculate graph-based proximity scores [139], asserting that the “closer” two nodes are in the social graph, the more likely they are to become linked in the future. Intuitively, network proximity measures the likelihood of an interaction between two users $u$ and $v$, regardless of the existence of a path between them. Proximity metrics used in prior work include neighborhood based methods and methods based on the ensemble of all paths [139].

Neighborhood based methods include the number of common neighbors, the Jaccard coefficient, which computes the probability of two users sharing neighbors, and Adamic/Adar, which refines the simple counting of common neighbors by giving more weight to neighbors that are not shared with many other users. Such methods only exploit local network features. For simplicity and computational efficiency, we use the number of common neighbors between two users as a prominent indicator of social link creation. The number of common neighbors between users $u$ and $v$ measures their corresponding neighborhood overlap. It is defined as $CN(u,v) = \lvert \Gamma(u) \cap \Gamma(v) \rvert$, where $\Gamma(u)$ is the set of neighbors of user $u$ in the network and $\lvert \cdot \rvert$ denotes set cardinality.
To account for user homophily with respect to latent topics, we consider column $\Theta(:,u)$ as a feature vector for user $u$ and use the standard cosine similarity to compare the feature vectors of two users $u$ and $v$:

$$\sigma(u,v) = \frac{\sum_{t} \Theta(t,u)\, \Theta(t,v)}{\sqrt{\sum_{t} \Theta(t,u)^{2}}\, \sqrt{\sum_{t} \Theta(t,v)^{2}}}. \tag{7.8}$$

This quantity is 0 if $u$ and $v$ share no latent topics and 1 if they have exactly the same interests. The feature vector for a user pair $(u,v)$ is therefore constructed as:

$$F(u,v) = [\sigma(u,v),\, CN(u,v)]. \tag{7.9}$$

We found that this feature set yields a non-separable training sample, because similarity values between pairs exhibit great variance for both positive and negative samples. This in effect produces very inefficient classifiers that perform poorly in the recommendation task. To avoid this situation, as well as to reduce the number of training samples provided to the classifier (effectively achieving scalability), we average similarity values over the number of common neighbors. We characterize the average latent similarity of user pairs with $k$ common neighbors in the social network as follows:

$$\mathrm{avg}_{\sigma}(k) = \frac{1}{\lvert \{p : k_p = k\} \rvert} \sum_{p : k_p = k} \sigma(p), \tag{7.10}$$

where $p$ denotes a user pair $(u,v)$ and $k_p$ denotes the number of common neighbors for user pair $p$.

7.3.2 Latent Topics & Shortest Distance Scheme

Instead of using the number of common neighbors, here we use the shortest distance to capture graph based similarity between users $u$ and $v$, denoted as $SD(u,v)$. The feature vector for a user pair $(u,v)$ is therefore constructed as:

$$F(u,v) = [\sigma(u,v),\, SD(u,v)]. \tag{7.11}$$

Because of the great variance of similarity values, we train this classifier using the average latent similarity of user pairs with shortest distance $k$ in the social network, using Equation (7.10), with the difference that in this case $k_p$ denotes the shortest distance value for user pair $p$.

7.3.3 Latent Topics Classification Scheme

Here we focus solely on similarity of users’ interests, ignoring network effects. With this scheme we are able to test the hypothesis that social links form on the basis of user homophily, or conversely whether the social network also plays some role in link formation. Again, we consider column $\Theta(:,u)$ as the feature vector for user $u$ and we compute the pointwise squared distance between the feature vectors of users $u$ and $v$. The feature vector for a user pair $(u,v)$ is therefore constructed as:

$$F(u,v) = \left[ (\Theta(1,u) - \Theta(1,v))^{2}, \ldots, (\Theta(T,u) - \Theta(T,v))^{2} \right]. \tag{7.12}$$

$F(u,v)$ is zero when users $u$ and $v$ are completely aligned with respect to their interests in the latent space, whereas larger values indicate less common interests. Note that the optimization objective of this classifier is to minimize the distance between users $u$ and $v$ between whom a tie exists. In contrast, the two previous schemes assume maximum similarity values between such users.

7.3.4 Ensemble Classification Scheme

The first step in an ensemble approach is data partitioning. Each partitioning technique should have a unique view of the data or use a different underlying model to generate the data partitions. In our approach, we select classifiers that partition the data using the different sets of features and the appropriate similarity metrics discussed in the previous subsections. In particular, we train each of the above three classifiers individually using the same set of training data. This results in classifiers $Cl_1$, $Cl_2$, and $Cl_3$ respectively.

We combine the predictions of each classifier using a consensus mechanism, according to which each classifier is treated as an expert casting a vote for or against the existence of a link between a pair of users. We set the ensemble weights of $Cl_1$, $Cl_2$ and $Cl_3$ to equal values and we normalize them such that $\sum_{i=1}^{3} \lambda_{Cl_i} = 1$. The consensus function we use is a weighted binary vote.
For a pair of users $p = (u,v)$ and classifier $Cl_i$ we define a prediction function $\xi_{Cl_i}(p)$ such that:

$$\xi_{Cl_i}(p) = \begin{cases} 1, & \exists\, e(u,v) \\ 0, & \text{otherwise}, \end{cases} \tag{7.13}$$

where $e(u,v)$ denotes a directed edge between users $u$ and $v$. We compute the consensus score for $p$ as $\sum_{i=1}^{3} \lambda_{Cl_i}\, \xi_{Cl_i}(p)$. We could have learned different weights for each classifier, indicating our confidence in its predictions. However, this procedure imposes another round of supervised training, which would unnecessarily increase the complexity of our approach. In our evaluation section we show that, despite its simplicity, the majority voting scheme is quite effective in producing high quality recommendations.

7.3.5 Complexity Analysis

We performed our experiments on a 2.4 GHz Intel Core 2 Duo, with 2 GB of memory, running Windows 7. For our evaluation, we used a real-world dataset (see Chapter 2, Section 2.3.3) of 2K users from the Last.fm online music system [22]. All algorithms were implemented in Matlab. We now discuss in detail the computational complexity of our approach and examine its ability to scale to large datasets. Table 7.2 summarizes the symbols used in our analysis.

Table 7.2: Symbols used in Complexity Analysis

Symbol                 Description
$G$                    Social network
$\Lambda$              Adjacency matrix of $G$
$E$                    Number of edges in $G$
$A$                    Number of users
$V$                    Vocabulary size
$U_{max}$              Maximum number of users that can be associated with a resource
$A_{\mathrm{Train}}$   Training set size

Complexity of Inferencing Latent Models

The worst-case time complexity of each iteration of the Gibbs sampler is $O(V U_{max} A)$. As complexity is linear in $V$, Gibbs sampling can be efficiently carried out on large data sets [174]. Considerable speedup gains can be achieved by optimizing Gibbs sampling and by successfully incorporating recent advances in high performance computing [134].

Complexity of Structural Features Calculation

Next, we discuss the computational complexities of graph-based similarity metrics.

Common Neighbors. Naively, $\Lambda^{2}$ computes $CN$ for all user pairs.
Intuitively, $\Lambda^{2}(u,v)$ denotes the number of different paths of length 2 that connect users $(u,v)$. Multiplication of extremely sparse matrices (i.e., the adjacency matrix) is inefficient and can become very expensive for large datasets. Instead of using matrix multiplication to calculate $CN$ for each user $u$ and all of $u$’s neighbors, we first search all of $u$’s neighbors and then lay out the neighbors of each of $u$’s neighbors respectively. The time complexity of traversing the neighborhood of a node with $k$ neighbors in a sparse network is $k \ll A$, hence the time complexity for calculating $CN$ is $O(Ak^{2})$.

Shortest Distance. We find the shortest path $SD$ between any two users using Johnson’s algorithm [100], resulting in a time complexity of $O(A \log A + AE)$. A faster implementation based on a min-priority queue (i.e., Fibonacci heap) can further reduce running time to $O(A \log A + E)$.

Complexity of Averaging Strategy

To reduce the number of training samples provided to our SVM classifiers, we first average similarity values over the number of common neighbors (similarly for shortest distance) as shown in Equation (7.10). This requires finding all user pairs with $k$ common neighbors, for each value of $k$, and then averaging over all their similarity values. We begin by sorting $CN$ by rows and columns in $O(A \log A)$ time. This step can be significantly sped up using better sorting strategies. Searching for user pairs with $k$ common neighbors requires at most $O(A + A) = O(A)$ steps, resulting in $O(KA \lvert S^{CN}_{k} \rvert)$, where $K$ is the number of unique values of $k$, and $\lvert S^{CN}_{k} \rvert$ denotes the maximum cardinality of the set $S$ of user pairs with $k$ common neighbors.

Complexity of SVM Classification

Support vector machines (SVMs), though accurate, are not preferred in applications requiring great classification speed due to the number of support vectors being large.
Standard SVM training requires the solution of a very large quadratic programming (QP) optimization problem, which directly involves inverting the kernel matrix, resulting in $O(A_{\mathrm{Train}}^{3})$ time and $O(A_{\mathrm{Train}}^{2})$ space complexities [108]. Due to our averaging strategy, $A_{\mathrm{Train}}$ is already sufficiently small (i.e., $A_{\mathrm{Train}} \ll A$). Moreover, one hardly ever needs to estimate the optimal solution, and the training time for a linear SVM to reach a certain level of generalization error actually decreases as training set size increases [179]. Tsang et al. proposed an approximation algorithm that obtains an approximately optimal solution [195], while at the same time having a time complexity that is linear in $A_{\mathrm{Train}}$ and a space complexity that is independent of $A_{\mathrm{Train}}$ for nonlinear kernels.

In our work, we use Sequential Minimal Optimization (SMO) to train our SVM classifiers [169]. SMO divides the quadratic programming optimization problem into smaller problems that can be solved analytically. Further, as SMO memory requirements grow linearly with the training set size, SMO can handle very large training sets [169].

At testing time, we need to pass a user pair instance to an SVM model to find the hypothesis with the highest confidence (i.e., existence of a link or not). The time complexity for this step is approximately $O(1)$ (linear in the number of support vectors and linear in the number of features).

7.4 Models Evaluation

To examine the effectiveness of our models and classification schemes, we use a dataset containing social networking, tagging and music artist listening information from a set of 2K users from the Last.fm online music system (see Chapter 2, Section 2.3.3). This leads to a vocabulary size of $R = 17{,}632$ in our User-Resource model and $C = 11{,}946$ unique words in our User-Concept and User-Resource-Concept models for this dataset. We split our dataset into two disjoint sets, such that we retain 10%, 25%, 50%, and 75% of the data for training, and the rest for testing.
7.4.1 Sample Topics

Figure 7.3 shows 4 topics (out of 50) learned by the three models. The topics are extracted from a single sample at the 2000th iteration of the Gibbs sampler. Each topic is illustrated with its 10 most probable tags (or artists in the case of UR). Learned topics capture Last.fm’s music taxonomy from user-generated annotations. Particularly for the UR model, the top 10 most likely artists in each topic are well-known artists in terms of popularity and fame, and representative samples of the music genres they belong to.

Figure 7.3: Clouds of top tags and artists for 4 topics (out of 50) learned by the UC, UR and URC models. Larger size indicates higher probability. Notably, URC topics on the right match the rightmost UC topics surprisingly well. Finally, while most of the topics in our models semantically capture music genres, some topics illustrate other types of discovered themes. For instance, the left-topmost UC topic captures users’ preferences in the form of explicitly stated feelings and/or opinions with respect to specific artists.

Topics learned by the URC model offer a qualitative representation of music genres in Last.fm, “generating” a music taxonomy based on user-specific tags. The top 10 most likely artists in each topic learned by URC are well-known in terms of popularity and fame. Solo artists and music bands are categorized in corresponding music categories in this case. Finally, even though most of the topics in our models semantically capture music genres, some topics illustrate other types of discovered themes. For instance, topic 5 in UC captures users’ preferences in the form of explicitly stated feelings and opinions with respect to specific artists. Notably, URC topics 44 and 47 match UC topics 47 and 45, respectively, surprisingly well.

7.4.2 User Focus Analysis

Our generative models capture latent users’ interests in different contexts.
A latent interest profile can be built for each user for each of our models, facilitating quantitative measurement of user “focus”. We measure the “focus” of a user $u$ to characterize the dispersion of the user's latent interests across multiple topics. To measure $u$’s focus, we first sort $u$’s topic probability vector in descending order and then sum the differences between consecutive topic pairs. Formally, we define user “focus” as:

$$f(u) \doteq \sum_{t=1}^{T-1} \left( u_{p_t} - u_{p_{t+1}} \right), \tag{7.14}$$

where $u_{p_t}$ denotes the probability of topic $t$ for user $u$. Intuitively, a perfectly “focused” user has a focus value of one, whereas the focus of a completely “diverse” user is equal to zero.

Figure 7.4 shows the probability distribution of the three most popular users’ latent interests over twenty topics, for each of our models. Users’ focus values for each model are provided in parentheses in the legend. User $U_3$ exhibits more focused interests than the other two users in all three models, whereas $U_1$ demonstrates clear focus only under the UR model. This indicates that our models indeed capture users’ interests from different perspectives; here with respect to emergent (latent) music genres and annotation taxonomy.

Figure 7.4: Probability distribution of three most popular users’ latent interests over twenty topics. Panels: UR Model, UC Model, URC Model; x-axis: Topic ID; y-axis: Probability. Legend: U1 (0.5863, 0.1818, 0.2143), U2 (0.4812, 0.26, 0.3478), U3 (0.7498, 0.469, 0.5506).

Figure 7.5(a) shows that a (small) disassortative mixing pattern exists between user popularity and focus for all our models. Users’ latent tastes tend to disperse slightly as the number of their friends increases. We used Jensen-Shannon divergence (JS) to analyze the similarity between popular users (i.e., users with many social ties) and their neighbors.
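The two quantities used in this subsection can be illustrated with a minimal sketch. It assumes each user is given as a plain list of topic probabilities; the function names `focus` and `js_divergence` are hypothetical, and the base-2 logarithm in the divergence is an assumption, since the text does not state which base was used. Note that the sum in Equation (7.14) telescopes to the difference between the largest and the smallest topic probability.

```python
import math

def focus(topic_probs):
    """User focus following Equation (7.14): sort the topic probabilities
    in descending order and sum the differences of consecutive entries."""
    p = sorted(topic_probs, reverse=True)
    return sum(p[t] - p[t + 1] for t in range(len(p) - 1))

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.
    Base-2 logarithms are an assumption (values then lie in [0, 1])."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy profiles (illustrative values, not taken from the Last.fm dataset).
focused_user = [1.0, 0.0, 0.0, 0.0]
diverse_user = [0.25, 0.25, 0.25, 0.25]
print(focus(focused_user), focus(diverse_user))
print(js_divergence(focused_user, diverse_user))
```

On these toy profiles the fully concentrated user has focus one and the uniform user has focus zero, matching the intuition stated for Equation (7.14).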
We found that as users’ popularity increases, so does topical divergence from their ties. Figure 7.5(b) summarizes the results. The effect can be observed for all models, even though large fluctuations are apparent due to the small number of user pairs over which averaging is performed. This suggests that increasingly diverse friendships are created with increasing user popularity. This phenomenon discloses the cognitive process of a user’s friending behavior.

Figure 7.5: (a) Average focus of users having $k$ friends. (b) Average Jensen-Shannon divergence between all combinations of users having $k$ friends and their friends. Axes (logarithmic): (a) Number of Friends vs. Average Focus; (b) Number of Friends vs. Average Jensen-Shannon Divergence. Curves: UR, UC, URC.

7.4.3 Predictive Power

To demonstrate the effectiveness of our generative models in uncovering hidden topics, we compute their perplexity [174] (i.e., their ability to predict tags or artists for new users). We divide our dataset into two disjoint sets, such that we retain 90% of the data for training and the rest for testing.

Figure 7.6 shows the three models’ perplexity scores for a varying number of hidden topics. URC yields lower perplexity overall than the other two models on the Last.fm dataset. UC slightly outperforms URC for 100 topics. The UR and UC models can be seen as extensions of the classic LDA model, whereas URC is an extension of the Author-Topic model. Intuitively, URC captures more of the hidden structure of users’ annotation activity in Last.fm. UC also captures the essence of tagging behavior through statistical categorization of tags in latent topics.
In contrast, classification of artists based solely on users’ annotations seems to be of inferior quality, probably due to noisy human-provided metadata, which are by nature unrestricted, uncontrolled, and highly susceptible to personal taste. We conjecture that annotation metadata can be extremely useful in capturing collective knowledge about a domain, such as music genres in Last.fm.

Figure 7.6: UR, UC and URC perplexity for varying number of hidden topics. X-axis: Number of Hidden Topics; Y-axis: Perplexity.

7.5 Network Reconstruction & User Homophily

Link prediction is important in social networks for understanding the mechanisms by which social networks form and evolve [66]. Most approaches thus far assume that a snapshot of the social network, with some links missing, is available. A two-phase supervised method [66] addressed the problem of predicting the structure of a social network when only a small subgraph of the social network is known and multiple heterogeneous sources are available. A recent study [123] has discussed the link prediction problem when the network is not fully observed. The complementary question, “can we predict topical similarity from the social network?”, was explored in [155].

In our work, we evaluate the effectiveness of our approach with respect to the task of extracting the structure of the social network, i.e., all links at the same time. In this scenario, no prior friendship links are provided to our method. Since no friendship links are available, it is impossible to exploit the topological structure of the social network. Here, we focus on latent-based network reconstruction, where our objective is to reveal all links between pairs of users through their pair-wise similarity.
The novelty of our approach comes from the fact that we combine topological structure with inferred latent user profiles, which are described as distributions over resources and their associated metadata, instead of actual content [130, 140].

We examine the performance of SLIgHT (see Section 7.2) for each of our three tripartite graph generative models. We compute the Accuracy, Precision, Recall and F-measure of this approach while varying the number of hidden topics ($T = \{1, 10, 20, 50, 100\}$). The optimal threshold is selected using Equation (7.6). Figure 7.7 shows the results. All three models yield very high accuracy; however, their precision and recall are too low for practical purposes. This result seems to contradict the hypothesis of user homophily in social networks [146], since users’ interests in terms of hidden topics are not accurate predictors of link formation. In fact, high accuracy values are due to correct classification of true negatives (absence of links).

Our results contradict those of Makrehchi [140] for academic co-authorship networks. We explain this as a result of the very well defined structure and focused nature of co-authorship networks. Instead, online social networks encompass diverse user communities, which may or may not be related to each other. Katz score is used in this approach to calculate users’ proximity in the latent space defined by extracted hidden topics. Few heavily weighted paths in academic networks guarantee better results than many long (weak) paths in diverse, online social networks. Latent similarity with respect to artists yields better results than latent similarity with respect to tags.

Figure 7.7: Performance of SLIgHT. X-axis: number of hidden topics; Y-axis: performance on test set. Panels: Accuracy, F-measure, Precision, Recall; curves: UR, UC, URC.

Similar musical
preferences between users yield better predictive power with respect to link prediction in Last.fm as a result. Our URC model exhibits inferior performance to UR due to its attempt to capture similarity in terms of annotations as well. URC and UC are therefore comparable in performance with respect to network reconstruction.

In the following sections, we focus our analysis on the UR, UC, and URC models for $T_{UR} = 20$, $T_{UC} = 20$, and $T_{URC} = 50$ hidden topics, respectively. This selection is based on optimal values achieved by the three models with respect to F-measure (see Figure 7.7) in the network reconstruction task. Different datasets and different settings (e.g., number of hidden topics) may lead to results different from those we report here.

Figure 7.8: Average similarity between latent topic vectors of Last.fm users as a function of (a) number of common neighbors and (b) distance $d$. Y-axis: Average Similarity $\sigma(u,v)$; curves: UR, UC, URC.

7.5.1 Users’ Homophily

Next, we analyze in detail the similarity of users’ topic distributions in relation to their number of common friends and their distance $d$ along the social network. Intuitively, the presence of a social tie indicates some degree of shared context between connected users, who are likely to have some interests in common [177]. Likewise, the existence of numerous common friends suggests sharing of common experiences through indirect interactions. Regardless of the mechanism driving this potential alignment, we measure this effect as a function of local structural properties. Figures 7.8a and 7.8b demonstrate these correlations, respectively. The similarity score is calculated as cosine similarity between topic vectors from matrix $\Theta$, using Equation (7.8).
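As a rough illustration of how average similarity can be related to the number of common friends, the sketch below groups user pairs by their number of common neighbors and averages the cosine similarity of their topic vectors within each group. It is a brute-force loop over all pairs on toy data, not the implementation used in this work. The names `friends`, `theta`, and `avg_similarity_by_common_neighbors` are hypothetical, and the standard cosine formula is assumed to correspond to Equation (7.8), which is not reproduced here.

```python
import math
from collections import defaultdict
from itertools import combinations

def cosine(a, b):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def avg_similarity_by_common_neighbors(friends, theta):
    """friends: user -> set of friends; theta: user -> topic vector.
    Returns {k: average similarity of user pairs with k common neighbors}."""
    sums, counts = defaultdict(float), defaultdict(int)
    for u, v in combinations(sorted(friends), 2):
        k = len(friends[u] & friends[v])  # number of common neighbors
        sums[k] += cosine(theta[u], theta[v])
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}

# Toy network: a and b are both friends with c (illustrative data only).
friends = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}}
theta = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
print(avg_similarity_by_common_neighbors(friends, theta))
```

The same grouping applies to shortest distance by replacing the common-neighbor count with the distance between the two users.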
To compute averages for these quantities and exclude biases due to sampling, we performed an exhaustive investigation of the social network up to a distance equal to the network diameter.

Figure 7.8a indicates strong alignment between users sharing numerous friends. Specifically, average similarity is large for large numbers of common friends; however, it drops as the number of common friends decreases. Large fluctuations are visible for large numbers of common friends due to the small number of users over whom averages are computed. Average similarity under the URC model is relatively constant when the number of common neighbors is between 1 and 25, even though a small increasing trend is visible. Hence, the existence of many common friends indicates commonality of interests, which however may be distributed across different topics for different subsets of common friends.

Similarly, Figure 7.8b suggests that a certain degree of alignment between neighbors in the social network does in fact exist. While average similarity is quite large for neighbors ($d = 1$), it drops rapidly as $d$ increases and is close to zero for $d \geq 3$. Our observation corroborates the results presented in [177], suggesting that the alignment of users’ interests must be a local effect. Average similarity under the UC model is relatively constant for $d \in [3, 8]$, indicating common tag usage by many Last.fm users who have not established friendship relationships with each other. This fact explains why UC and URC perform worse than UR in the network reconstruction task (see Figure 7.7).

7.6 Prediction of Social Ties

In this section, we test the effectiveness of our four classification schemes. We refer to “Latent Topics & Common Neighbors Scheme” as Scheme A, to “Latent Topics & Shortest Distance Scheme” as Scheme B, and to “Latent Topics Classification Scheme” as Scheme C. Finally, we refer to “Ensemble Classification Scheme” as Scheme D.
Figure 7.9 shows the performance achieved by our classification schemes under our three models with respect to Precision and Recall. We found Scheme B to be the least effective, hence we refrain from discussing its performance any further, even though Scheme B is included in the “Ensemble Classification Scheme” (Scheme D), influencing its performance. Scheme B aggregates users’ latent similarity with respect to shortest distance, which in effect results in aggregating all training similarity values for true links (i.e., existing social ties) in a single training point in the distance-similarity space. To this extent, the aggregation methodology is non-linear to the preprocessing of true positive and true negative samples, resulting in information loss in exchange for scalability gains.

Figure 7.9: Precision and Recall of Latent Semantics Classification Schemes as a function of training data size. X-axis: Training set size as percentage of complete dataset; Y-axis: Precision/Recall. Panels: UR Model, UC Model, URC Model; curves: Scheme A, Scheme C, Scheme D.

The ensemble achieves the best precision (up to 89.8% under the UR model) due to its ability to alleviate bad choices made by some of the “expert” classifiers. Even though Scheme D’s recall is not as high as that of the other schemes, it is comparable (up to 86.83% under the UC model) when the training dataset size is small (10%), which would be the case in a real-life social network with millions of users. Overall, precision seems to increase or stay constant for dataset sizes up to 50%, after which point over-fitting causes degradation in performance.
On the other hand, recall drops as a function of dataset size, indicating that small but discriminatory training samples can lead to good performance overall. Ultimately, the trade-off between precision and recall (F-measure) has to be considered for the optimal choice of model, scheme and training dataset size. Of course, different datasets may yield best results for different combinations. The nature and focus of the social network as well as the type of user-generated content in this context have to be considered when making this selection.

Support Vector Machines have to achieve a trade-off between maximizing the margin and minimizing the empirical error, which leads to classifying every sample to the dominant class (negative in our case) under high class imbalance or when data are non-separable, if the misclassification penalty is adequately small. This results in no (or minuscule) classification errors on the negative instances, but high errors on the positive instances, which, even though quite sparse, are also the most vital in social tie prediction. A classifier that classifies everything as negative may be extremely accurate, but it will not have any practical use as it will never identify the positive instances correctly [60]. Due to the sparsity of social networks, we expect most test links to belong to the negative class (absence of link).

Figure 7.10: F-measure (calculated for positive & negative classes separately) achieved by Latent Semantics Classification Schemes as a function of training data size. X-axis: Training set size as percentage of complete dataset; Y-axis: F-measure. Panels: User-Resource Model, User-Concept Model, User-Resource-Concept Model; curves: Schemes A, C, and D, each for the positive and the negative class.

We address this problem here by examining the Precision and
Recall that our various schemes achieve when calculated separately for the positive and negative classes.

Figure 7.10 shows the results. Intuitively, true negatives are easier to classify correctly under most models, in most cases. Overall, we observe a degradation in performance with respect to true positives (which are harder to predict) due to over-fitting and noisy observations as the training dataset size increases. Nevertheless, all of our schemes yield reasonable results for practical purposes, for reasonably small training datasets (less than 20% of the complete dataset in all cases). Based on the analysis presented above, we observe that hidden topic proximity alone is not sufficient to accurately predict social ties. However, our work demonstrates that the addition of local network features to latent semantics greatly improves performance, often by a considerable margin.

7.6.1 Comparison with other methods

We compare our schemes with two tag-based similarity metrics, which have shown superior performance in the content-based network reconstruction task [177]:

1. Cosine Similarity (CS). The normalized cosine similarity between two users $u$ and $v$ can be calculated as follows:

$$CS(u,v) = \frac{\sum_t f_u(t)\, f_v(t)}{\sqrt{\sum_t f_u(t)^2 \, \sum_t f_v(t)^2}},$$

where $f_u(t)$ denotes the number of times user $u$ has used tag $t$.

2. Maximal Information Path (MIP). A similarity metric that computes semantic relatedness of terms in a non-hierarchical triple representation [177].

We randomly split our dataset into two disjoint sets, such that we retain 10%, 25%, 50%, and 75% of the data for training and the rest for testing. The evaluation consists of selecting pairs of users, computing their similarity, and adding links between users in decreasing order of their topical similarity. The pairs of users with the highest similarity are those we predict to be most likely tied. In particular, we randomly sample 12,716 pairs of users, out of which 50% are true links and 50% are negative samples.
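The tag-based cosine similarity baseline (CS) defined above can be sketched in a few lines. This is a minimal illustration, assuming each user is represented by the list of tags they applied, with repetitions; the function name `tag_cosine_similarity` and the toy tag lists are hypothetical.

```python
import math
from collections import Counter

def tag_cosine_similarity(tags_u, tags_v):
    """CS(u, v) computed from the tag usage counts f_u(t) and f_v(t)."""
    f_u, f_v = Counter(tags_u), Counter(tags_v)
    # Only tags used by both users contribute to the numerator.
    dot = sum(f_u[t] * f_v[t] for t in f_u.keys() & f_v.keys())
    norm = math.sqrt(sum(c * c for c in f_u.values()) *
                     sum(c * c for c in f_v.values()))
    return dot / norm if norm else 0.0

# Toy example: u used "rock" twice and "indie" once; v used "rock" and "jazz".
u_tags = ["rock", "rock", "indie"]
v_tags = ["rock", "jazz"]
print(tag_cosine_similarity(u_tags, v_tags))  # 2 / sqrt(5 * 2)
```

Ranking candidate user pairs by this score in decreasing order gives the ordering in which links would be added under this baseline.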
For each predicted social link, we check the actual social network to see if the prediction is correct. The choice to select pairs of users randomly stems from the size of our dataset, which makes an exhaustive comparative evaluation infeasible.

We present results in the form of the area under the receiver operating characteristic curve (AUC). AUC quantifies prediction accuracy and tests how much better a classifier is than pure chance, while at the same time measuring its overall ability to rank all missing connections (positive class), which are the hardest to predict, over nonexistent ones (negative class). AUC evaluates classification performance across the entire range of decision thresholds, providing a good performance overview when the operating condition for the classifier is unknown or the classifier is expected to be used in situations with significantly different class distributions.

Table 7.3: Area under the ROC curve comparison for 10%, 25%, 50%, and 75% of edges observed

Model      Scheme      10%      25%      50%      75%
UR         Scheme A    0.5624   0.6569   0.8663   0.8949
UR         Scheme C    0.5454   0.6005   0.6418   0.7342
UC         Scheme A    0.5514   0.7129   0.7993   0.8511
UC         Scheme C    0.5500   0.6225   0.6429   0.7417
URC        Scheme A    0.6515   0.7007   0.7967   0.8540
URC        Scheme C    0.6491   0.5485   0.6357   0.7654
Baselines  MIP         0.6256
Baselines  CS          0.6087

The “Ensemble Classification Scheme” (Scheme D) produces only class labels, without assigning score values to them, hence we exclude it from our comparison. We have already discussed Scheme B’s performance. This leaves us with two schemes, A and C. For the calculation of AUC values for the two baselines, we use the complete dataset instead of splitting it into disjoint training and testing sets. This strategy may bias the evaluation in favor of the baselines, which have a complete view of the dataset for their similarity calculations.
Therefore, our evaluation is a conservative choice in that it does not unfairly help our proposed schemes (in fact it might be biased against them). For consistency across our experiments, we focus our analysis on the UR, UC and URC models for $T_{UR} = 20$, $T_{UC} = 20$, and $T_{URC} = 50$ hidden topics, respectively. This setup limits our observations to three settings only. However, it would be tedious and difficult to compare all our models for all settings. Further, different datasets may result in different optimal models.

Table 7.3 shows the results. The observation of AUC values further validates that our classification schemes act as proper ranking functions for all three models. Figure 7.11 reports the performance lift of Schemes A and C on the link recommendation task for varying training set size. Lift is defined as % change over the best performing baseline, MIP. Positive % change signifies improvement, whereas negative % change indicates superiority of the baseline.

Figure 7.11: Area under the ROC curve lift achieved by schemes A and C with respect to UR, UC and URC models on the link recommendation task in the Last.fm data set. Lift is defined as % change over MIP baseline. Y-axis: AUC Lift (%); panels: Scheme A, Scheme C; groups: UR, UC, URC; bars: 10%, 25%, 50%, 75% of data used for training.

The baselines CS and MIP attain AUC values of 0.6087 and 0.6256, respectively. Not all schemes can beat the baseline: our classifiers under the UR and UC models fail to beat the performance of MIP (which, however, has access to the complete dataset) when 10% of the data are available for training. In this case the AUC lost from the considerable limitation of training data is minimal, i.e., on the order of 10% or less. The most lift, i.e., % improvement over baseline, is consistently attained by Scheme A, which integrates latent topics with local structural information, under the URC model in all four cases.
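As a worked example of the lift definition (% change over the MIP baseline), the following sketch recomputes lift for Scheme A under the UR model from the AUC values reported in Table 7.3. The function name `auc_lift` is hypothetical; the numbers are copied from the table.

```python
def auc_lift(auc, baseline_auc):
    """Lift as percentage change of `auc` over `baseline_auc`."""
    return 100.0 * (auc - baseline_auc) / baseline_auc

MIP_AUC = 0.6256  # best performing baseline in Table 7.3

# AUC of Scheme A under the UR model, keyed by % of data used for training.
ur_scheme_a = {10: 0.5624, 25: 0.6569, 50: 0.8663, 75: 0.8949}

lifts = {pct: round(auc_lift(auc, MIP_AUC), 2) for pct, auc in ur_scheme_a.items()}
for pct, lift in lifts.items():
    print(f"{pct}% training data: lift = {lift:+.2f}%")
```

For this row of the table, lift goes from roughly -10% with 10% of the data used for training to roughly +43% with 75%.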
When 25% or more of the complete dataset is used for training, our schemes outperform the baselines, often by a considerable factor.

7.7 Summary

In this chapter, we presented three generative probabilistic models of online social tagging systems as a principled way of reducing the dimensionality of such data, while capturing the dynamics of the collaborative annotation process. Our models represent users’ interests in a latent space over resources and rich metadata describing them. Even though our probabilistic models ignore several aspects of the real-world annotation process (such as topic correlation and user interaction), they nonetheless provide a principled and efficient way of understanding user-resource-tag dynamics in very large, online social tagging systems.

We showed that our generative probabilistic models can be used to learn users’ tastes and to effectively reconstruct the network of ties or predict future social links when some prior evidence is provided. In particular, we showed how to exploit latent user interests in conjunction with structural features to significantly improve social link prediction in the online music social media site Last.fm. We showed that similarity of interests alone does not trigger the creation of a social link. Instead, we showed how to achieve high prediction performance using four classifiers, which jointly exploit users’ interest similarity and their local network proximity. We plan to further validate our results by examining dynamic social networks. Taking temporality into account, we will be able to better understand whether the combination of taste and local network similarity indeed drives tie formation or whether, conversely, tie formation results in taste alignment and local network densification.
While most link prediction methods suffer from the high class imbalance problem, resulting in low precision and/or recall solutions, our proposed methods achieve high precision and recall for highly imbalanced classes.

In addition to tags, news stories and music artists, there exist other types of resources, metadata and user activities that can be used to further improve the quality of predictions. In our future work, we plan to address the challenge of combining multiple heterogeneous sources of information within a unified approach. We also plan to establish a mechanism which will automatically identify the most discriminative latent topics and will discard uninformative resources and metadata. Our results have important implications for the design of social media sites. Besides link recommendation and prediction, our methods can be easily adapted to facilitate analysis of trending topics and users’ latent interests, resource and tag recommendations and categorization, classification and filtering of online information.

Chapter 8

Future Work and Conclusion

In this thesis we have studied complex networks, with emphasis on informal communication at the workplace, and collaborative bookmarking in online social media. We argued that there are many kinds of interactions taking place on online social networks. For instance, other than being a medium for friendship creation and conversations, social media are often used for information dissemination, political deliberation and campaigns, behavior modeling and prediction, revelation of terrorist networks, and epidemic studies, to name a few. Both explicit and implicit interactions play a significant role in understanding the dynamics of complex networks. We believe that richer and deeper understanding of such activities is necessary. Building models of complex networks, understanding their rich properties, hidden structures and dimensional interdependencies are not trivial tasks.
We proposed a novel formal generalized representation of social networks, which abstracts the semantics of multidimensional, informal, social interactions in the form of orthogonal dimensions. In conjunction with our social algebraic operators, our model facilitates multi-dimensional, time-varying, contextual, semantic analysis of complex networks. Our model is extensible, since it allows the definition of new metrics, and at the same time generic, since it permits arbitrary similarity measures to be used for various dimensions.

Building on the premises of our formal modeling of informal interactions, we ventured to uncover hidden structures and emergent knowledge. In particular, we extensively studied the dynamics of interactions within a complex network, inferred from informal threaded discussions between employees at the workplace. In an enterprise, understanding how information flows within and between organizational levels and business units is of great importance. Despite numerous studies of information diffusion in online social networks, little is known about information sharing, influence and expertise at the workplace. Our empirical analysis of employees’ communication behavioral patterns, dynamics and characteristics, statistical properties and complex correlations between social and topical structures, resulted in significant findings with respect to structural and topical homophily, and information sharing. Particularly, we found four emerging, distinct communication behaviors that govern the dynamics of threaded discussion at the workplace. Our analysis suggests that users with strong local topical alignment tend to participate in focused interactions, whereas users with dispersed interests contribute to multiple discussions, broadening the diversity of participants.

Within an enterprise, microblogging behavior is bounded by main business and work culture, work practices, and everyday problems.
Thus, content is formal (although exchanged in an informal setting), less noisy than in online social networks, and emphasizes the business perspective. The existence of a large, strongly connected core suggests that high-degree nodes located at its center exhibit characteristics of expertise, conceptualized by frequent message exchanges with other nodes. We argued that such high-degree nodes are therefore critical for the connectivity and flow of information at the workplace. Nonetheless, much work remains to be done in the area of accurate expert identification. Even though communication frequency may be an indicator of status, it does not necessarily reflect expertise. Instead, one should consider topical expertise, where expertise levels are assigned to specific topic(s). In future work we also wish to examine how users’ agreement as well as explicitly expressed and implicitly conveyed sentiment impacts expertise identification algorithms.

In cases where a single expert is not enough, a team of experts must be assembled to address a very specific problem. Team formation algorithms often propose an ensemble of individual experts, without considering their familiarity with each other, or the communication costs between them. Community detection algorithms, on the other hand, usually operate under the assumption that community members develop relatively strong ties among themselves. Community detection often results in groups of people who think alike, which makes such groups insufficient for this particular problem. In future work we would like to explore team formation on the premises of our modeling and social algebraic operators. Different “communities” can be identified this way using either dimensional operators, such as structural information, communication frequency, common interests and topical alignment, or multidimensional (i.e., contextual) criteria, such as a combination of temporal and topical filters.
Our study of microblogging behavior at the workplace created even more questions than it answered. What triggers communication between unrelated users? Does communication between previously unrelated users depend on friend-of-a-friend type of relationships, common interests, or other factors? What impact does an externally imposed formal structure have on the social network?

In this thesis, we addressed such questions by studying informal interactions in the presence of formal structure at the workplace. We found that employees’ tendency towards adopting a new microblogging service is influenced by their direct supervisors (dependency on the network structure). In fact, we found that middle level managers are on average more successful in promoting the adoption of the new service. At a macroscopic level, we revealed two distinct patterns regarding employees’ likelihood of adopting the new microblogging service with respect to the behavior of the general crowd. We proposed two simple yet highly accurate computational models of technology adoption that is influenced by organizational hierarchy. Our findings have important implications for enterprises’ understanding of the mechanisms driving adoption of new technologies, and could assist in designing better strategies for rapid and efficient technology adoption and information dissemination within the corporation. A limitation of our study is that we estimate causal effects only within the formal organizational chart, due to the fact that we are unable to observe the actual adoption. Future work would extend our models to allow for influence scores to vary over time, as well as incorporate the different roles individuals assume in the adoption process, accounting for influence variations as a function of employees’ level in the organization hierarchy. We would also like to investigate the effect of network evolution (e.g.
layoffs, or new hires) on influence, since one’s influence may intuitively increase with seniority in the company. Finally, it would be interesting to study adoption dynamics in the presence of competing technologies. One of the prime functionalities of online social networks is information propaga- tion and social link creating, two dynamic activities we studied in great detail. We introduced the problem of communication intention prediction and addressed this prob- lem using our novel framework. We showed that network or content, when considered separately, are not sufficient predictors. Instead, integrated local structural information and user-generated content, result in accurate predictors which outperform state of the art techniques for link prediction. This is a direct result of exploiting direct and indirect interactions, social and contextual, that have up to date been considered independently. We found that the more statistical evidence available per user, the better accuracy can achieved, whereas personalized predictors are required to further boost performance. 187 We wish to explore such issues in the future. Our findings have implications for a wide range of social web applications, such as contextual expert recommendation for Q&A, new friendship relationships creation, and targeted content delivery. Our investigation of communication intention has revealed a lot of potentials in the field. We think that the impact of indirect, “implicit” flow (or influence) of informa- tion and interests needs deeper investigation. Our study of social content annotation resulted in three generative models in an effort to capture the dynamics of collaborative annotation process in online social media. Our models represent users’ interests in a latent space over resources and rich metadata describing them. 
Even though our models ignore several aspects of real-world annotation process (such as topic correlation and user interaction), they nonetheless provide a principled and efficient way of understand- ing user-resource-tag dynamics in very large, online social tagging systems. We would like to investigate such aspects in the future. Particularly, in addition to tags and music artists, there exist other types of resources, metadata and user activities that can be used to further improve the quality of predictions. Future work would attempt to answer some open questions, such as combining multiple heterogeneous information sources, identify the most discriminative latent topics, discarding uninformative resources and metadata. Data availability in online social networks as well as the business world has not been an issue. Vast amounts of data, i.e., Big-Data, are being generated by social networking users in the form of informal interactions. What has been an issue, is the transformation of data into useful information, that in time and with appropriate processing becomes knowledge. In this thesis we have shown that there is a need for multidimensional, con- textual analysis of complex networks. Different views of the network result in different interpretations, which may be combined to gain deep insights in dynamic processes over the network. Explicit relationships that are captured by networks are just the tip of the 188 iceberg when it comes to social network analysis. Implicit channels of influence, tacit flow of interests, hidden structure and interconnections can lead to revelations about social behavior, effective recommendation and personalization services and knowledge preservation techniques. We claim that the complex interplay of individual and group dynamics in complex networks influences the structure and evolution of the network, and vise versa, the network influences individual and group dynamics. 
The problem is that there is not a single network to examine. Rather, multiple networks exist, entangled in a multidimensional space. Our formal model can be applied at the macroscopic level to reveal network-wide patterns, but also at the microscopic level to help understand individual behavior and contextual peer influence, despite the heterogeneity of human dynamics. Our work just scratches the surface of multi-layered networks. Further exploration is required and is the scope of future work.
References
[1] E. Abrahamson and L. Rosenkopf. Social network effects on the extent of innovation diffusion: A computer simulation. Organization Science, 8(3):289–309, 1997.
[2] L. Adamic and A. Eytan. Friends and neighbors on the web. Social Networks, 25:211–230, 2001.
[3] Y.-Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. Analysis of topological characteristics of huge online social networking services. In Proceedings of the 16th international conference on World Wide Web, WWW ’07, pages 835–844, New York, NY, USA, 2007. ACM.
[4] F. Alkemade and C. Castaldi. Strategies for the diffusion of innovations on social networks. Computational Economics, 25(1-2):3–23, 2005.
[5] L. A. N. Amaral, A. Scala, M. Barthélémy, and H. E. Stanley. Classes of small-world networks. Proceedings of the National Academy of Sciences, 97(21):11149–11152, October 2000.
[6] A. Anagnostopoulos, R. Kumar, and M. Mahdian. Influence and correlation in social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 7–15, New York, NY, USA, 2008. ACM.
[7] L. Backstrom and J. Leskovec. Supervised random walks: predicting and recommending links in social networks. In Proceedings of the fourth ACM international conference on Web search and data mining, WSDM ’11, pages 635–644, New York, NY, USA, 2011. ACM.
[8] E. Bakshy, B. Karrer, and L. A. Adamic. Social influence and the diffusion of user-created content.
In Proceedings of the 10th ACM conference on Electronic commerce, pages 325–334, New York, NY, USA, 2009. ACM.
[9] E. Bakshy, I. Rosenn, C. Marlow, and L. Adamic. The role of social networks in information diffusion. In Proceedings of the 21st international conference on World Wide Web, pages 519–528, New York, NY, USA, 2012. ACM.
[10] A. L. Barabási. Linked: How Everything Is Connected to Everything Else and What It Means. Plume, 2003.
[11] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[12] F. M. Bass. A new product growth for model consumer durables. Manage. Sci., 50(12 Supplement):1825–1832, December 2004.
[13] H. Becker, M. Naaman, and L. Gravano. Learning similarity metrics for event identification in social media. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM ’10, pages 291–300, New York, NY, USA, 2010. ACM.
[14] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, Mar. 2003.
[15] J. Bollen, H. Mao, and A. Pepe. Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena. 2011.
[16] P. Bonacich. Power and centrality: A family of measures. The American Journal of Sociology, 92(5):1170–1182, 1987.
[17] P. Bonacich and P. Lloyd. Eigenvector-like measures of centrality for asymmetric relations. Social Networks, 23(3):191–201, 2001.
[18] C. Budak, D. Agrawal, and A. El Abbadi. Diffusion of information in social networks: Is it all local? In 2012 IEEE 12th International Conference on Data Mining (ICDM), pages 121–130, 2012.
[19] M. Bundschus, S. Yu, V. Tresp, A. Rettinger, M. Dejori, and H.-P. Kriegel. Hierarchical bayesian models for collaborative tagging systems. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM ’09, pages 728–733, Washington, DC, USA, 2009. IEEE Computer Society.
[20] T. Burkhart, D. Werth, and P. Loos.
Context-sensitive business process support based on emails. In Proceedings of the 21st international conference companion on World Wide Web, WWW ’12 Companion, pages 851–856, New York, NY, USA, 2012. ACM.
[21] R. S. Burt. Structural holes and good ideas. The American Journal of Sociology, 110(2):349–399, 2004.
[22] I. Cantador, P. Brusilovsky, and T. Kuflik. 2nd workshop on information heterogeneity and fusion in recommender systems (hetrec 2011). In Proceedings of the 5th ACM conference on Recommender systems, RecSys 2011, New York, NY, USA, 2011. ACM.
[23] I. Cantador and P. Castells. Multilayered semantic social network modelling by ontology-based user profiles clustering: Application to collaborative filtering. In Proceedings of the 15th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2006), Podebrady, Czech Republic. Springer Verlag Lectures Notes in Artificial Intelligence, pages 334–349. Springer, 2006.
[24] K. M. Carley. Dynamic network analysis. Workshop on Dynamic Social Network Modeling and Analysis: Workshop Summary and Papers, 2003.
[25] C. Cattuto, D. Benz, A. Hotho, and S. Gerd. Semantic grounding of tag relatedness in social bookmarking systems. In Proceedings of the 7th International Conference on The Semantic Web, ISWC ’08, pages 615–631, Berlin, Heidelberg, 2008. Springer-Verlag.
[26] M. Cha, A. Mislove, and K. P. Gummadi. A measurement-driven analysis of information propagation in the flickr social network. In Proceedings of the 18th international conference on World wide web, pages 721–730, New York, NY, USA, 2009. ACM.
[27] J. Chang and D. Blei. Relational topic models for document networks. In Proceedings of Conference on AI and Statistics (AISTATS), 2009.
[28] C. Chelmis. Complex modeling and analysis of workplace collaboration data. In International Conference on Collaboration Technologies and Systems (CTS), 2013.
[29] C. Chelmis and V. K. Prasanna. Social networking analysis: A state of the art and the effect of semantics.
In Social Computing (SocialCom), 2011 IEEE Third International Conference on, MIT, Boston, USA, October 9-11 2011. IEEE.
[30] C. Chelmis and V. K. Prasanna. Microblogging in the enterprise: A few comments are in order. In 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 62–70, Istanbul, Turkey, August 26-29 2012. IEEE Computer Society.
[31] C. Chelmis and V. K. Prasanna. Predicting communication intention in social networks. In Social Computing (SocialCom), 2012 ASE/IEEE Fourth International Conference on, pages 184–194, Amsterdam, The Netherlands, September 3-5 2012. IEEE.
[32] C. Chelmis and V. K. Prasanna. An empirical analysis of microblogging behavior in the enterprise. Social Network Analysis and Mining, pages 1–23, 2013.
[33] C. Chelmis and V. K. Prasanna. Exploring generative models of tripartite graphs for recommendation in social media. In Proceedings of the 4th International Workshop on Modeling Social Media, MSM ’13, pages 2:1–2:8, New York, NY, USA, 2013. ACM.
[34] C. Chelmis and V. K. Prasanna. The role of organization hierarchy in technology adoption at the workplace. In The 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Niagara Falls, Canada, August 25-28 2013.
[35] C. Chelmis and V. K. Prasanna. Social link prediction in online social tagging systems. ACM Trans. Inf. Syst., 2013.
[36] C. Chelmis, V. Sorathia, and V. K. Prasanna. Enterprise wisdom captured socially. In 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 1228–1235, Istanbul, Turkey, August 26-29 2012. IEEE Computer Society.
[37] C. Chelmis, V. Sorathia, and V. K. Prasanna. Collaborative Processes and Decision Making in Organizations, chapter Enterprise Knowledge Preservation and Management. IGI Global, 2013.
[38] C. Chelmis, H. Wu, V. Sorathia, and V. K. Prasanna. Semantic social network analysis for the enterprise.
Journal of Computing and Informatics - Special Issue on Computational Intelligence for Business Collaboration, 2014.
[39] J. Chen, W. Geyer, C. Dugan, M. Muller, and I. Guy. Make new friends, but keep the old: recommending people on social networking sites. In Proceedings of the 27th international conference on Human factors in computing systems, CHI ’09, pages 201–210, New York, NY, USA, 2009. ACM.
[40] M. Chen, J. Liu, and X. Tang. Clustering via random walk hitting time on directed graphs. In Proceedings of the 23rd national conference on Artificial intelligence - Volume 2, pages 616–621. AAAI Press, 2008.
[41] H. Choi, S.-H. Kim, and J. Lee. Role of network structure and network effects in diffusion of innovations. Industrial Marketing Management, 39(1):170–177, 2010.
[42] W. S. Chow and L. S. Chan. Social network, social trust and shared goals in organizational knowledge sharing. Information and Management, 45(7):458–465, Nov. 2008.
[43] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. SIAM Review, 51(4):661–703, 2009.
[44] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2010.
[45] R. Cross, A. Parker, and S. P. Borgatti. A bird’s-eye view: Using social network analysis to improve knowledge creation and sharing. Knowledge Directions, 2(1):48–61, 2000.
[46] D. Davidov, O. Tsur, and A. Rappoport. Enhanced sentiment learning using twitter hashtags and smileys. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING ’10, pages 241–249. Association for Computational Linguistics, 2010.
[47] D. Davis, R. Lichtenwalter, and N. V. Chawla. Multi-relational link prediction in heterogeneous information networks. In Proceedings of the 2011 International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’11, pages 281–288, Washington, DC, USA, 2011. IEEE Computer Society.
[48] M. De Choudhury. Tie formation on twitter: Homophily and structure of egocentric networks. In Privacy, security, risk and trust (passat), 2011 ieee third international conference on and 2011 ieee third international conference on social computing (socialcom), pages 465–470, October 2011.
[49] M. De Choudhury, H. Sundaram, A. John, and D. D. Seligmann. Contextual prediction of communication flow in social networks. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, WI ’07, pages 57–65, Washington, DC, USA, 2007. IEEE Computer Society.
[50] J. Diesner and K. Carley. Revealing social structure from texts: Meta-matrix text analysis as a novel method for network text analysis. In V. K. Narayanan & D. J. Armstrong (Eds.), Causal Mapping for Information Systems and Technology Research: Approaches, Advances, and Illustrations, pages 81–108. Idea Group Publishing, 2005.
[51] J. Diesner, T. Frantz, and K. Carley. Communication networks from the enron email corpus it’s always about the people. enron is no different? Computational & Mathematical Organization Theory, 11:201–228, 2005.
[52] L. Dietz. Modeling shared tastes in online communities. In NIPS Workshop on Applications for Topic Models: Text and Beyond, 2009.
[53] P. S. Dodds, K. D. Harris, I. M. Kloumann, C. A. Bliss, and C. M. Danforth. Temporal patterns of happiness and information in a global social network: Hedonometrics and twitter. PloS one, 6(12):e26752, 2011.
[54] M. Dombroski, P. Fischbeck, and K. Carley. Estimating the shape of covert networks. In Proceedings of the 8th International Command and Control Research and Technology Symposium, 2003.
[55] P. Domingos and M. Richardson. Mining the network value of customers. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’01, pages 57–66. ACM, 2001.
[56] L. Donetti and M. A. Munoz. Detecting network communities: a new systematic and efficient algorithm.
Journal of Statistical Mechanics, page P10012, 2004.
[57] K. Ehrlich and N. Shami. Microblogging inside and outside the workplace. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media, ICWSM ’10. AAAI, 2010.
[58] G. Erétéo, M. Buffa, F. Gandon, P. Grohan, M. Leitzelman, and P. Sander. A state of the art on social network analysis and its applications on a semantic web. In Proc. SDoW2008 (Social Data on the Web), Workshop held with the 7th International Semantic Web Conference, 2008.
[59] G. Erétéo, F. Limpens, F. Gandon, O. Corby, M. Buffa, M. Leitzelman, and P. Sander. Semantic Social Network Analysis: A Concrete Case, pages 122–156. Handbook of Research on Methods and Techniques for Studying Virtual Communities: Paradigms and Phenomena 2 (Vols.). IGI Global, 2011.
[60] S. Ertekin, J. Huang, L. Bottou, and L. Giles. Learning on the border: active learning in imbalanced data classification. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, CIKM ’07, pages 127–136, New York, NY, USA, 2007. ACM.
[61] D. B. Estell, T. W. Farmer, and B. D. Cairns. Bullies and victims in rural african american youth: Behavioral characteristics and social network placement. Aggressive Behavior, 33(2):145–159, 2007.
[62] J. A. Evans. Electronic publication and the narrowing of science and scholarship. Science, 321(5887):395–399, 2008.
[63] M. Fazeen, R. Dantu, and P. Guturu. Identification of leaders, lurkers, associates and spammers in a social network: context-dependent and context-independent approaches. Social Network Analysis and Mining, 1:241–254, 2011.
[64] T. Feyessa, M. Bikdash, and G. Lebby. Node-pair feature extraction for link prediction. In Proceedings of the IEEE Third International Conference on Social Computing (SocialCom), October 2011.
[65] M. Fire, L. Tenenboim, O. Lesser, R. Puzis, L. Rokach, and Y. Elovici.
Link prediction in social networks using computationally efficient topological features. In Proceedings of the IEEE Third International Conference on Social Computing (SocialCom), October 2011.
[66] L. Ge and A. Zhang. Pseudo cold start link prediction with multiple sources in social networks. In Proceedings of the Twelfth SIAM International Conference on Data Mining, pages 768–779. SIAM / Omnipress, 2012.
[67] A. L. Gentile, V. Lanfranchi, S. Mazumdar, and F. Ciravegna. Extracting semantic user networks from informal communication exchanges. In Proceedings of the 10th international conference on The semantic web - Volume Part I, ISWC’11, pages 209–224, Berlin, Heidelberg, 2011. Springer-Verlag.
[68] R. Ghosh and K. Lerman. Predicting influential users in online social networks. In Proceedings of KDD Workshop on Social Network Analysis, 2010.
[69] R. Ghosh and K. Lerman. A parameterized centrality metric for network analysis. Physical Review E, 83(6), 2011.
[70] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America, 99(12):7821–7826, 2002.
[71] P. A. Gloor, R. Laubacher, S. B. C. Dynes, and Y. Zhao. Visualization of communication patterns in collaborative innovation networks - analysis of some w3c working groups. In Proceedings of the twelfth international conference on Information and knowledge management, CIKM ’03, pages 56–60, New York, NY, USA, 2003. ACM.
[72] A. Go, R. Bhayani, and L. Huang. Twitter Sentiment Classification using Distant Supervision, pages 1–6. 2009.
[73] J. Golbeck. Inferring reputation on the semantic web. In Proceedings of the 13th International World Wide Web Conference, 2004.
[74] J. Golbeck, B. Parsia, and J. Hendler. Trust networks on the semantic web. In Proceedings of Cooperative Intelligent Agents, pages 238–249, 2003.
[75] S. Golder and B. A. Huberman. The structure of collaborative tagging systems. Journal of Information Science, 32(2):198–208, April 2006.
[76] M. Gomez Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1019–1028, New York, NY, USA, 2010. ACM.
[77] M. Granovetter. Threshold models of collective behavior. American Journal of Sociology, 83(6):1420–1443, 1978.
[78] M. Granovetter. The Strength of Weak Ties: A Network Theory Revisited. Sociological Theory, 1:201–233, 1983.
[79] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228–5235, 2004.
[80] R. Gross and A. Acquisti. Information revelation and privacy in online social networks. In Proceedings of the 2005 ACM workshop on Privacy in the electronic society, WPES ’05, pages 71–80. ACM, 2005.
[81] O. Günther, H. Krasnova, D. Riehle, and V. Schöndienst. Modeling microblogging adoption in the enterprise. Proceedings of the 15th Americas Conference on Information Systems, 2009.
[82] M. Gupta, R. Li, Z. Yin, and J. Han. Survey on social tagging techniques. SIGKDD Explor. Newsl., 12(1):58–72, Nov. 2010.
[83] I. Guy, S. Ur, I. Ronen, A. Perer, and M. Jacovi. Do you want to know?: recommending strangers in the enterprise. In Proceedings of the ACM 2011 conference on Computer supported cooperative work, CSCW ’11, pages 285–294, New York, NY, USA, 2011. ACM.
[84] H. Halpin, V. Robu, and H. Shepherd. The complex dynamics of collaborative tagging. In Proceedings of the 16th international conference on World Wide Web, WWW ’07, pages 211–220, New York, NY, USA, 2007. ACM.
[85] N. Hariri, B. Mobasher, and R. Burke. Context-aware music recommendation based on latent topic sequential patterns. In Proceedings of the sixth ACM conference on Recommender systems, RecSys ’12, pages 131–138, New York, NY, USA, 2012. ACM.
[86] M. Harvey, I. Ruthven, and M. J. Carman. Improving social bookmark search using personalised latent variable language models.
In Proceedings of the fourth ACM international conference on Web search and data mining, WSDM ’11, pages 485–494, New York, NY, USA, 2011. ACM.
[87] M. A. Hasan, V. Chaoji, S. Salem, and M. Zaki. Link prediction using supervised learning. In Proceedings of SDM 06 workshop on Link Analysis, Counterterrorism and Security, 2006.
[88] H. W. Hethcote. The mathematics of infectious diseases. SIAM Review, 42(4):599–653, 2000.
[89] P. D. Hoff. Multiplicative latent factor models for description and prediction of social networks. Comput. Math. Organ. Theory, 15(4):261–272, Dec. 2009.
[90] L. Hong and B. D. Davison. Empirical study of topic modeling in twitter. In Proceedings of the First Workshop on Social Media Analytics, SOMA ’10, pages 80–88, New York, NY, USA, 2010. ACM.
[91] G. Hua and D. Haughton. A network analysis of an online expertise sharing community. Social Network Analysis and Mining, 2:291–303, 2012.
[92] J. Huang, K. M. Thornton, and E. N. Efthimiadis. Conversational tagging in twitter. In Proceedings of the 21st ACM conference on Hypertext and hypermedia, HT ’10, pages 173–178. ACM, 2010.
[93] D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge. Comparing images using the hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):850–863, sep 1993.
[94] M. Jacovi, I. Guy, I. Ronen, A. Perer, E. Uziel, and M. Maslenko. Digital traces of interest: Deriving interest relationships from social media interactions. In ECSCW’11, pages 21–40, 2011.
[95] J. A. Jacquez and C. P. Simon. The stochastic si model with recruitment and deaths i. comparison with the closed sis model. Mathematical Biosciences, 117(1-2):77–125, 1993.
[96] B. J. Jansen, M. Zhang, K. Sobel, and A. Chowdury. Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology, 60(11):2169–2188, 2009.
[97] A. Java, X. Song, T. Finin, and B. Tseng. Why we twitter: understanding microblogging usage and communities.
In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, WebKDD/SNA-KDD ’07, pages 56–65, New York, NY, USA, 2007. ACM.
[98] T. Jeffrey and M. Stanley. An experimental study of the small world problem. Sociometry, 32:425–443, 1969.
[99] Y. Jin, Y. Matsuo, and M. Ishizuka. Extracting a social network among entities by web mining. In ISWC 06 Workshop on Web Content Mining with Human Language Technologies, 2006.
[100] D. B. Johnson. Efficient algorithms for shortest paths in sparse networks. J. ACM, 24(1):1–13, Jan. 1977.
[101] J. J. Jung and J. Euzenat. Towards semantic social networks. In Proceedings of the 4th European conference on The Semantic Web: Research and Applications, ESWC ’07, pages 267–280. Springer-Verlag, 2007.
[102] C. Kamp. Untangling the interplay between epidemic spread and transmission network dynamics. PLoS Computational Biology, 6(11):e1000984, 11 2010.
[103] J.-H. Kang and K. Lerman. Using lists to measure homophily on twitter. In AAAI workshop on Intelligent Techniques for Web Personalization and Recommendation, July 2012.
[104] H. Kashima and N. Abe. A parameterized probabilistic model of network evolution for supervised link prediction. In Proceedings of the Sixth International Conference on Data Mining, ICDM ’06, pages 340–349, Washington, DC, USA, 2006. IEEE Computer Society.
[105] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18:39–43, 1953. 10.1007/BF02289026.
[106] H. Kautz, B. Selman, and M. Shah. The hidden web. AI Magazine, 18:27–36, 1997.
[107] H. Kautz, B. Selman, and M. Shah. Referral web: combining social networks and collaborative filtering. Commun. ACM, 40(3):63–65, March 1997.
[108] S. S. Keerthi, O. Chapelle, and D. DeCoste. Building support vector machines with reduced classifier complexity. J. Mach. Learn. Res., 7:1493–1515, Dec. 2006.
[109] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network.
In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146, New York, NY, USA, 2003. ACM.
[110] J. Kleinberg. The small-world phenomenon: an algorithm perspective. In Proceedings of the thirty-second annual ACM symposium on Theory of computing, STOC ’00, pages 163–170, New York, NY, USA, 2000. ACM.
[111] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Rev., 51(3):455–500, Aug. 2009.
[112] Y. Koren, S. C. North, and C. Volinsky. Measuring and extracting proximity graphs in networks. ACM Trans. Knowl. Discov. Data, 1, December 2007.
[113] V. Krebs. Mapping networks of terrorist cells. Connections, 24(3):43–52, 2002.
[114] R. Krestel, P. Fankhauser, and W. Nejdl. Latent dirichlet allocation for tag recommendation. In Proceedings of the third ACM conference on Recommender systems, RecSys ’09, pages 61–68, New York, NY, USA, 2009. ACM.
[115] B. Krishnamurthy, P. Gill, and M. Arlitt. A few chirps about twitter. In Proceedings of the first workshop on Online social networks, WOSN ’08, pages 19–24, New York, NY, USA, 2008. ACM.
[116] R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’06, pages 611–617, New York, NY, USA, 2006. ACM.
[117] G. Kumaran and J. Allan. Text classification and named entities for new event detection. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’04, pages 297–304, New York, NY, USA, 2004. ACM.
[118] J. Kunegis and A. Lommatzsch. Learning spectral graph transformations for link prediction. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 561–568, New York, NY, USA, 2009. ACM.
[119] H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media?
In Proceedings of the 19th international conference on World wide web, WWW ’10, pages 591–600, New York, NY, USA, 2010. ACM.
[120] M. D. Lee, B. Pincombe, and M. Welsh. An Empirical Evaluation of Models of Text Document Similarity, pages 1254–1259. Erlbaum, Mahwah, NJ, 2005.
[121] K. Lerman and R. Ghosh. Information contagion: An empirical study of the spread of news on digg and twitter social networks. In Proceedings of the Fourth International Conference on Weblogs and Social Media, ICWSM ’10. The AAAI Press, May 2010.
[122] K. Lerman and A. Plangprasopchok. Handbook of Research on Web 2.0, 3.0, and X.0: Technologies, Business, and Social Applications, chapter Leveraging User-specified Metadata to Personalize Image Search. IGI Global, 2009.
[123] V. Leroy, B. B. Cambazoglu, and F. Bonchi. Cold start link prediction. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’10, pages 393–402, New York, NY, USA, 2010. ACM.
[124] J. Leskovec, L. A. Adamic, and B. A. Huberman. The dynamics of viral marketing. In Proceedings of the 7th ACM conference on Electronic commerce, pages 228–237, New York, NY, USA, 2006. ACM.
[125] J. Letierce, A. Passant, J. Breslin, and S. Decker. Understanding how twitter is used to widely spread scientific messages. In Proceedings of the Web Science Conference: Extending the Frontiers of Society On-Line, WebSci ’10, March 2010.
[126] D. Li, Z. Lina, T. Finin, and A. Joshi. How the semantic web is being used: An analysis of foaf documents. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS), page 113c, January 2005.
[127] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Proceedings of the Twelfth International Conference on Information and Knowledge Management, CIKM ’03, pages 556–559, New York, NY, USA, 2003. ACM.
[128] C.-Y. Lin, L. Wu, Z. Wen, H. Tong, V. Griffiths-Fisher, L. Shi, and D. Lubensky.
Social network analysis in enterprise. Proceedings of the IEEE, 100(9):2759–2776, sept. 2012.
[129] N. Lin, D. Li, Y. Ding, B. He, Z. Qin, J. Tang, J. Li, and T. Dong. The dynamic features of delicious, flickr, and youtube. J. Am. Soc. Inf. Sci. Technol., 63(1):139–162, Jan. 2012.
[130] M. Lipczak, B. Sigurbjornsson, and A. Jaimes. Understanding and leveraging tag-based relations in on-line social networks. In Proceedings of the 23rd ACM conference on Hypertext and social media, HT ’12, pages 229–238, New York, NY, USA, 2012. ACM.
[131] H. Liu, P. Maes, and G. Davenpor. Unraveling the taste fabric of social networks. International Journal on Semantic Web and Information Systems, 2(1):42–71, 2006.
[132] L. Liu, F. Zhu, L. Zhang, and S. Yang. A probabilistic graphical model for topic and preference discovery on social media. Neurocomputing, 95:78–88, Oct. 2012.
[133] Y. Liu, A. Niculescu-Mizil, and W. Gryc. Topic-link lda: joint models of topic and author community. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 665–672, New York, NY, USA, 2009. ACM.
[134] Z. Liu, Y. Zhang, E. Y. Chang, and M. Sun. Plda+: Parallel latent dirichlet allocation with data placement and pipeline processing. ACM Trans. Intell. Syst. Technol., 2(3):26:1–26:18, May 2011.
[135] B. Long, X. Wu, Z. M. Zhang, and P. S. Yu. Unsupervised learning on k-partite graphs. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’06, pages 317–326, New York, NY, USA, 2006. ACM.
[136] C. Lu, X. Hu, X. Chen, J.-R. Park, T. He, and Z. Li. The topic-perspective model for social tagging systems. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’10, pages 683–692, New York, NY, USA, 2010. ACM.
[137] X. Luo and J. Shinaver. Multirank: Reputation ranking for generic semantic social networks.
In Proceedings of the WWW 2009 Workshop on Web Incentives, WEBCENTIVES ’09, 2009.
[138] D. Lusseau and M. E. J. Newman. Identifying the role that animals play in their social networks. Proceedings of the Royal Society of London. Series B: Biological Sciences, 271(Suppl 6):S477–S481, 2004.
[139] L. Lü and T. Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150–1170, 2011.
[140] M. Makrehchi. Social link recommendation by learning hidden topics. In Proceedings of the fifth ACM conference on Recommender systems, RecSys ’11, pages 189–196, New York, NY, USA, 2011. ACM.
[141] B. Markines, C. Cattuto, F. Menczer, D. Benz, A. Hotho, and S. Gerd. Evaluating similarity measures for emergent semantics of social tagging. In Proceedings of the 18th international conference on World wide web, WWW ’09, pages 641–650, New York, NY, USA, 2009. ACM.
[142] C. Marlow, M. Naaman, D. Boyd, and M. Davis. Ht06, tagging paper, taxonomy, flickr, academic article, to read. In Proceedings of the seventeenth conference on Hypertext and hypermedia, HYPERTEXT ’06, pages 31–40, New York, NY, USA, 2006. ACM.
[143] C. Marlow, M. Naaman, D. Boyd, and M. Davis. Ht06, tagging paper, taxonomy, flickr, academic article, to read. In Proceedings of the seventeenth conference on Hypertext and hypermedia, HYPERTEXT ’06, pages 31–40, New York, NY, USA, 2006. ACM.
[144] F. Martino and A. Spoto. Social network analysis: A brief theoretical review and further perspectives in the study of information technology. PsychNology, 4(1):53–86, 2006.
[145] Y. Matsuo, J. Mori, M. Hamasaki, K. Ishida, T. Nishimura, H. Takeda, K. Hasida, and M. Ishizuka. Polyphonet: an advanced social network extraction system from the web. In Proceedings of the 15th international conference on World Wide Web, WWW ’06, pages 397–406, 2006.
[146] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27(1):415–444, 2001.
[147] C. Meeyoung, H. Hamed, B. Fabrício, and P. G. Krishna. Measuring user influence in twitter: The million follower fallacy. In Proceedings of international AAAI Conference on Weblogs and Social, ICWSM 10, 2010.
[148] P. N. Mendes, A. Passan, P. Kapanipathi, and A. P. Shet. Linked open social signals. In Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, volume 01 of WI-IAT ’10, pages 224–231. IEEE Computer Society, 2010.
[149] A. K. Menon and C. Elkan. Link prediction via matrix factorization. In Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II, ECML PKDD’11, pages 437–452, Berlin, Heidelberg, 2011. Springer-Verlag.
[150] M. Mesbahi and M. Egerstedt. Graph theoretic methods in multiagent networks. Princeton University Press, 2010.
[151] P. Mika. Ontologies are us: A unified model of social networks and semantics. Semantic Web, 5:5–15, March 2007.
[152] P. Mika. Social Networks and the Semantic Web (Semantic Web and Beyond), volume 5 of Semantic Web And Beyond Computing for Human Experience. Springer-Verlag New York, Inc., 2007.
[153] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. Introduction to wordnet: An on-line lexical database. International Journal of Lexicography, 3(4):235–244, dec 1990.
[154] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, IMC ’07, pages 29–42, New York, NY, USA, 2007. ACM.
[155] A. Mislove, B. Viswanath, K. P. Gummadi, and P. Druschel. You are who you know: inferring user profiles in online social networks. In Proceedings of the third ACM international conference on Web search and data mining, WSDM ’10, pages 251–260, New York, NY, USA, 2010. ACM.
[156] I.-C. Moon and K. M. Carley. Evolving multi-agent network structure with organizational learning.
InProceedingsofthe2007springsimulationmulticonference -Volume2, SpringSim ’07, pages 127–134, San Diego, CA, USA, 2007. Society for Computer Simulation International. 203 [157] L. Mui, M. Mohtashemi, and A. Halberstadt. A computational model of trust and reputation. In Proceedings of the 35th Annual Hawaii International Conference onSystemSciences, HICSS ’02, pages 2431–2439, January 2002. [158] E. Mustafaraj and P. Metaxas. From obscurity to prominence in minutes: Political speech and real-time search. In Proceedings of the WebSci10: Extending the FrontiersofSocietyOn-Line, April 2010. [159] M. E. J. Newman. The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences of the United States of America, 98(2):pp. 404–409, 2001. [160] M. E. J. Newman. Spread of epidemic disease on networks. Physical Review E, 66(1):016128, Jul 2002. [161] M. E. J. Newman. Mixing patterns in networks. Phys. Rev. E, 67:026126, Feb 2003. [162] M. E. J. Newman. The structure and function of complex networks.SIAMReview, 45:167–256, 2003. [163] M. E. J. Newman. A measure of betweenness centrality based on random walks. SocialNetworks, 27(1):39–54, 2005. [164] J. C. Paolillo, S. Mercure, and E. Wright. The social semantics of livejournal foaf: Structure and change from 2004 to 2005. InProceedingsofthe1stWorkshopon SemanticNetworkAnalysisattheISWC2005Conference, pages 69–80, 2005. [165] R. Parimi and D. Caragea. Predicting friendship links in social networks using a topic modeling approach. In Proceedings of the 15th Pacific-Asia conference on Advancesinknowledgediscoveryanddatamining-VolumePartII, PAKDD’11, pages 75–86, Berlin, Heidelberg, 2011. Springer-Verlag. [166] A. Passant and P. Laublet. Meaning of a tag: A collaborative approach to bridge the gap between tagging and linked data. In Proceedings of the Linked Data on theWeb(LDOW2008)workshopatWWW2008, 2008. [167] A. Passant, P. Laublet, J. G. Breslin, and S. Decker. 
A uri is worth a thousand tags: From tagging to linked data with moat. International Journal on Semantic WebandInformationSystems, 5(3):71–94, 2009. [168] M. Pennacchiotti and S. Gurumurthy. Investigating topic models for social media user recommendation. In Proceedings of the 20th international conference com- panion on World wide web, WWW ’11, pages 101–102, New York, NY , USA, 2011. ACM. 204 [169] J. C. Platt. Advances in kernel methods. chapter Fast training of support vec- tor machines using sequential minimal optimization, pages 185–208. MIT Press, Cambridge, MA, USA, 1999. [170] A. Popescul and L. H. Ungar. Statistical relational learning for link prediction. In IJCAI03WorkshoponLearningStatisticalModelsfromRelationalData, 2003. [171] D. Ramage, S. Dumais, and D. Liebling. Characterizing microblogs with topic models. InProceedingsoftheFourthInternationalAAAIConferenceonWeblogs andSocialMedia. AAAI, 2010. [172] A. Rapoport and W. J. Horvath. A study of a large sociogram.BehavioralScience, 6(4):279–291, 1961. [173] D. M. Romero, B. Meeder, and J. Kleinberg. Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex con- tagion on twitter. In Proceedings of the 20th international conference on World wideweb, WWW ’11, pages 695–704, New York, NY , USA, 2011. ACM. [174] M. Rosen-Zvi, C. Chemudugunta, T. Griffiths, P. Smyth, and M. Steyvers. Learn- ing author-topic models from text corpora. ACMTrans.Inf.Syst., 28(1):4:1–4:38, Jan. 2010. [175] A. Sadilek, H. Kautz, and J. P. Bigham. Finding your friends and following them to where you are. In Proceedings of the fifth ACM international conference on Websearchanddatamining, WSDM ’12, pages 723–732, New York, NY , USA, 2012. ACM. [176] J. Schafer, D. Frankowski, J. Herlocker, and S. Sen. Collaborative filtering rec- ommender systems. In P. Brusilovsky, A. Kobsa, and W. Nejdl, editors, The Adaptive Web, volume 4321 of Lecture Notes in Computer Science, pages 291– 324. 
Springer Berlin Heidelberg, 2007. [177] R. Schifanella, A. Barrat, C. Cattuto, B. Markines, and F. Menczer. Folks in folksonomies: social link prediction from shared metadata. InProceedingsofthe Third ACM International Conference on Web Search and Data Mining, WSDM ’10, pages 271–280. ACM, 2010. [178] J. Scott. Socialnetworkanalysis: Ahandbook. Sage, 2000. [179] S. Shalev-Shwartz and N. Srebro. Svm optimization: inverse dependence on training set size. InProceedingsofthe25thinternationalconferenceonMachine learning, ICML ’08, pages 928–935, New York, NY , USA, 2008. ACM. 205 [180] S. Shashi and O. Dev. Computational modeling of spatio-temporal social net- works: A time-aggregated graph approach. Specialist Meeting-Spatio-Temporal Constraints on Social Networks, 2010. [181] B. Shevade, H. Sundaram, and L. Xie. Modeling personal and social network context for event annotation in images. InProceedingsofthe7thACM/IEEE-CS joint conference on Digital libraries, JCDL ’07, pages 127–134, New York, NY , USA, 2007. ACM. [182] J. Shinavier. Real-time #semanticweb in <= 140 chars. In Proceedings of the LinkedDataontheWebWorkshop(LDOW2010), Raleigh, North Carolina, USA, April 2010. [183] M. A. Sicilia and B. E. Garc´ ıa. Filtering information with imprecise social cri- teria: A foaf-based backlink model. In Proceedings of the Fourth Conference of theEuropeanSocietyforFuzzyLogicandTechnology, 2005. [184] S. Solomon, G. Weisbuch, L. d. Arcangelis, N. Jan, and D. Stauffer. Social percolation models. Physica A: Statistical Mechanics and its Applications, 277(12):239 – 247, 2000. [185] D. Sousa, L. Sarmento, and E. Mendes Rodrigues. Characterization of the twitter @replies network: are user ties social or topical? In Proceedings of the 2nd international workshop on Search and mining user-generated contents, SMUC ’10, pages 63–70, New York, NY , USA, 2010. ACM. [186] D. Strang and M. W. Macy. In search of excellence: Fads, success stories, and adaptive emulation. 
AmericanJournalofSociology, 107:147–182, 2001. [187] B. Suh, L. Hong, P. Pirolli, and E. H. Chi. Want to be retweeted? large scale analytics on factors impacting retweet in twitter network. In Proceedings of the 2010 IEEE Second International Conference on Social Computing, SocialCom ’10, pages 177–184, Washington, DC, USA, 2010. IEEE Computer Society. [188] Y . Sun, J. Han, X. Yan, P. S. Yu, and T. Wu. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. InVLDB11, 2011. [189] J. Tang, M. Musolesi, C. Mascolo, and V . Latora. Characterising temporal dis- tance and reachability in mobile and online social networks. SIGCOMM Com- puterCommunicationReview, 40:118–124, January 2010. [190] L. Tang and H. Liu. Community detection and mining in social media. Synthesis LecturesonDataMiningandKnowledgeDiscovery, 2(1):1–137, 2010. [191] B. Taskar, M. fai Wong, P. Abbeel, and D. Koller. Link prediction in relational data. IninNeuralInformationProcessingSystems, 2003. 206 [192] J. Thom-Santelli, S. Yuen, T. Matthews, E. M. Daly, and D. R. Millen. What are you working on? status message q&a in an enterprise sns. In Proceedings of the12thEuropeanConferenceonComputerSupportedCooperativeWork,24-28 September2011,AarhusDenmark, pages 313–332. Springer London, 2011. [193] C. Thovex and F. Trichet. Semantic social networks analysis - towards a socio- physical knowledge analysis. Social Network Analysis and Mining, 3(1):35–49, 2013. [194] J. Trant. Studying social tagging and folksonomy: A review and framework. JournalofDigitalInformation, 10(1), 2009. [195] I. W. Tsang, J. T. Kwok, and P.-M. Cheung. Core vector machines: Fast svm training on very large data sets. J.Mach.Learn.Res., 6:363–392, Dec. 2005. [196] A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe. Predicting elections with twitter: What 140 characters reveal about political sentiment. In ICWSM, 2010. [197] J. R. Tyler, D. M. Wilkinson, and B. A. Huberman. 
Email as spectroscopy: Auto- mated discovery of community structure within organizations. InProceedingsof C&T, pages 81–96. Kluwer, 2003. [198] T. W. Valente. Social network thresholds in the diffusion of innovations. Social Networks, 18(1):69–89, 1996. [199] F. E. Walter, S. Battiston, and F. Schweitzer. A model of a trust-based recommen- dation system on a social network. AutonomousAgentsandMulti-AgentSystems, 16:57–74, February 2008. [200] D. Wang, Z. Wen, H. Tong, C.-Y . Lin, C. Song, and A.-L. Barab´ asi. Information spreading in context. In Proceedings of the 20th international conference on Worldwideweb, pages 735–744, New York, NY , USA, 2011. ACM. [201] Q. Wang, H. Jin, and Y . Liu. Collaboration analytics: mining work patterns from collaboration activities. InProceedingsofthe19thACMinternationalconference onInformationandknowledgemanagement, CIKM ’10, pages 1861–1864, New York, NY , USA, 2010. ACM. [202] S. Wasserman and K. Faust. Socialnetworkanalysis: Methodsandapplications. Cambridge Univ Press, 1994. [203] S. Wasserman and J. Galaskiewicz. Advances in social network analysis: Researchinthesocialandbehavioralsciences. Sage, 1994. 207 [204] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, June 1998. [205] B. Wellman. Computer networks as social networks. Science, 293(5537):2031– 2034, 2001. [206] J. Weng, E.-P. Lim, J. Jiang, and Q. He. Twitterrank: finding topic-sensitive influential twitterers. In Proceedings of the third ACM international conference on Web search and data mining, WSDM ’10, pages 261–270, New York, NY , USA, 2010. ACM. [207] U. Wilensky. Netlogo. Available: http://ccl.northwestern.edu/ netlogo/, 1999. [208] A. Wu, J. M. DiMicco, and D. R. Millen. Detecting professional versus personal closeness using an enterprise social network site. In Proceedings of the 28th internationalconferenceonHumanfactorsincomputingsystems, CHI ’10, pages 1955–1964, New York, NY , USA, 2010. ACM. 
[209] H. Wu, C. Chelmis, V . Sorathia, and V . K. Prasanna. Beyond twitter: The role of company hierarchy in the yammer enterprise social network. 2013. (Submitted). [210] H. Wu, C. Chelmis, Y . Zhang, V . Sorathia, O. P. Patri, and V . K. Prasanna. Enrich- ing employee ontology for enterprises with knowledge discovery from social net- works. In The 2013 IEEE/ACM International Conference on Advances in Social NetworksAnalysisandMining, Niagara Falls, Canada, August 25-28 2013. [211] R. Xiang, J. Neville, and M. Rogati. Modeling relationship strength in online social networks. In Proceedings of the 19th international conference on World wideweb, WWW ’10, pages 981–990, New York, NY , USA, 2010. ACM. [212] J. Yang and J. Leskovec. Modeling information diffusion in implicit networks. In Proceedings of the 2010 IEEE International Conference on Data Mining, pages 599–608, Washington, DC, USA, 2010. IEEE Computer Society. [213] J. Yang and J. Leskovec. Patterns of temporal variation in online media. In ProceedingsofthefourthACMinternationalconferenceonWebsearchanddata mining, WSDM ’11, pages 177–186, New York, NY , USA, 2011. ACM. [214] E. Zangerle, W. Gassler, and G. Specht. Using tag recommendations to homog- enize folksonomies in microblogging environments. In Proceedings of the Third international conference on Social informatics, SocInfo’11, pages 113–126, Berlin, Heidelberg, 2011. Springer-Verlag. 208 [215] J. Zhang, M. S. Ackerman, and L. Adamic. Expertise networks in online com- munities: structure and algorithms. InProceedingsofthe16thinternationalcon- ference on World Wide Web, WWW ’07, pages 221–230, New York, NY , USA, 2007. ACM. [216] J. Zhang, Y . Qu, J. Cody, and Y . Wu. A case study of micro-blogging in the enterprise: use, value, and related issues. InProceedingsofthe28thinternational conference on Human factors in computing systems, CHI ’10, pages 123–132, New York, NY , USA, 2010. ACM. [217] D. Zhao and M. B. Rosson. 
How and why people twitter: the role that micro- blogging plays in informal communication at work. In Proceedings of the ACM 2009 international conference on Supporting group work, GROUP ’09, pages 243–252, New York, NY , USA, 2009. ACM. [218] W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, and X. Li. Comparing twitter and traditional media using topic models. In Proceedings of the 33rd EuropeanconferenceonAdvancesininformationretrieval, ECIR’11, pages 338– 349, Berlin, Heidelberg, 2011. Springer-Verlag. [219] E. Zheleva and L. Getoor. To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles. InProceedingsofthe18th international conference on World wide web, WWW ’09, pages 531–540. ACM, 2009. [220] Y . Zhijun, M. Gupta, T. Weninger, and H. Jiawei. A unified framework for link recommendation using random walks. In 2010 International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages 152–159, August 2010. 209
Abstract
Complex networks arise everywhere. Online social networks are a prominent example of complex networks because they (a) have revolutionized the way people interact on the Web, and (b) permit in practice the study of interdisciplinary theories that arise from human activities, at both the micro (i.e., individual) and macro (i.e., community) level. The vast scale (Big Data) of online human interactions imposes certain challenges, such as scalable indexing and efficient retrieval of social data, which are by their nature intertwined in multiple dimensions. Understanding the rich properties and dimensional interdependencies of topology and content in complex networks is necessary to uncover hidden structures and emergent knowledge.

To address these challenges, we propose a formal model that abstracts the semantics of complex networks into an integrated, context-aware, time-sensitive, multidimensional space, enabling joint examination of their static and dynamic properties and facilitating unified mining and analysis of network structure and content, and of their explicit and implicit interactions. Traditionally, network analysis methods either ignore content and focus on network structure, or make implicit assumptions about the complex correlation between these two components. We show that accurately modeling multiple symmetric or asymmetric, explicit and hidden interaction channels between people, while integrating auxiliary networks into a unified framework, leads to significant performance improvements in a variety of prediction and recommendation tasks. We empirically verify this insight using real-world datasets from online social networks and corporate microblogging.

Our work takes several steps towards building models of complex networks and understanding their rich properties, hidden structures, and dimensional interdependencies. We develop a novel model that integrates heterogeneous networks of networks, each with rich properties and hidden dynamics, facilitating multimodal analysis of time-varying, complex social networking data. We study informal communication behavior, information sharing, and influence in the workplace, where formal structures, such as the organizational hierarchy, provide hints of the underlying, implicit social or communication network. In particular, we develop two simple yet accurate computational models of technology adoption in the workplace in the presence of influence, which reproduce the adoption process at the macroscopic level. We also achieve accurate prediction of communication intention based on auxiliary information. Last but not least, we study the structure of online social bookmarking systems, where we significantly improve social tie recommendation by exploiting the dynamics of collaborative annotation.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Theoretical and computational foundations for cyber-physical systems design
Disentangling the network: understanding the interplay of topology and dynamics in network analysis
Modeling social and cognitive aspects of user behavior in social media
Dynamic graph analytics for cyber systems security applications
Modeling and predicting with spatial-temporal social networks
Learning the semantics of structured data sources
Acceleration of deep reinforcement learning: efficient algorithms and hardware mapping
Scaling up temporal graph learning: powerful models, efficient algorithms, and optimized systems
Hardware-software codesign for accelerating graph neural networks on FPGA
Cyberinfrastructure management for dynamic data driven applications
Learning distributed representations of cells in tables
Scalable exact inference in probabilistic graphical models on multi-core platforms
Understanding dynamics of cyber-physical systems: mathematical models, control algorithms and hardware incarnations
Computational modeling of human behavior in negotiation and persuasion: the challenges of micro-level behavior annotations and multimodal modeling
Understanding diffusion process: inference and theory
Global consequences of local information biases in complex networks
A complex event processing framework for fast data management
Tag based search and recommendation in social media
Adaptive and resilient stream processing on cloud infrastructure
Exploiting variable task granularities for scalable and efficient parallel graph analytics
Asset Metadata
Creator
Chelmis, Charalampos (author)
Core Title
Heterogeneous graphs versus multimodal content: modeling, mining, and analysis of social network data
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
11/07/2013
Defense Date
10/08/2013
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
complex networks, implicit relationships, influence, OAI-PMH Harvest, recommendation, semantic integration, social networks
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Prasanna, Viktor K. (committee chair), Bogdan, Paul (committee member), Nakano, Aiichiro (committee member)
Creator Email
charalamposchelmis@gmail.com,chelmis@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-345886
Unique identifier
UC11295170
Identifier
etd-ChelmisCha-2140.pdf (filename),usctheses-c3-345886 (legacy record id)
Legacy Identifier
etd-ChelmisCha-2140.pdf
Dmrecord
345886
Document Type
Dissertation
Rights
Chelmis, Charalampos
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
complex networks
implicit relationships
recommendation
semantic integration
social networks