Computing cascades: how to spread rumors, win campaigns, stop violence and predict epidemics
UNIVERSITY OF SOUTHERN CALIFORNIA

DOCTORAL DISSERTATION

Computing Cascades: How to Spread Rumors, Win Campaigns, Stop Violence and Predict Epidemics

Author: Ajitesh SRIVASTAVA
Supervisor: Dr. Viktor K. PRASANNA

Dissertation submitted in fulfilment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science

Degree conferral by FACULTY OF THE USC GRADUATE SCHOOL
August 2018

“... most of the time I was convinced I’d lost it. But there were other times, I thought I was mainlining the secret truth of the universe.”
Detective Rust Cohle (True Detective)

Abstract

The study of information diffusion on social networks has gained significant importance with the rise of online social media. The applications include viral marketing, opinion cascades, finding influential players, and immunization. Since the true dynamics are hidden, various diffusion models have been proposed to explain the cascading behavior. Such models require extensive simulation for estimating the state of the diffusion process over time. Computing the diffusion over time analytically is #P-hard for many probabilistic models. Moreover, certain decision problems requiring selection of individuals to initiate/alter a diffusion process are NP-hard and also rely on estimation of probabilities. I provide approximate solutions to several diffusion computation and diffusion optimization problems using analytical solutions and/or intelligent sampling for a wide class of diffusion models.
The specific problems include the following: (i) Predicting the number of infections in different countries across which an infection is spreading (DARPA Chikungunya Challenge 2014). (ii) Minimization of violence among homeless youth by identifying optimal peers for intervention. (iii) Finding the probability of infection/activation of an individual given that we start with a set of infected individuals in a network. (iv) Finding the set of individuals to initiate a campaign to maximize the influence (Viral Marketing). (v) Finding the set of individuals to maximize one opinion over the other in presence of friend and foe relationships. (vi) Combating fake news by identifying most influential individuals in the network who are likely to receive the fake news and can be incentivized to propagate the corresponding real news. The approximate solutions to all the above problems have been demonstrated mathematically or empirically and have been tested on synthetic and real-world networks.

Acknowledgements

First, I would like to thank my brother, Dr. Animesh Srivastava, who paved the path for me to be in Computer Science by pursuing it himself. Otherwise, my parents would have had me study Civil Engineering. Not that there is anything wrong with that. Thanks to my PhD adviser Prof. Viktor Prasanna, who gave me enough freedom to explore and provided me with knowledgeable mentors to work with. Specifically, Dr. Charalampos Chelmis, who introduced me to the area of this thesis, and Dr. Rajgopal Kannan, with whom I have spent countless hours attempting to solve unsolvable problems. I also express my gratitude towards Dr. Axel Soto and Dr. Saptarshi Ghosh for introducing me to research and for writing high quality papers with me during my undergraduate years. Thanks to Kathryn Kassar for all the chocolate I have consumed from EEB 200. I should also thank that friend who told me that I was a terrible pianist.
Otherwise, I would have spent more time playing the piano, delaying my thesis. Finally, I must mention my cat Sir Isaac Newton, who I wished would help me in my research but has contributed nothing to this thesis. He did not live up to his name.

Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Thesis
2 Background
  2.1 Model Categorizations
      Macroscopic vs Microscopic
      Progressive vs Non-progressive
      Unsigned vs Signed
  2.2 Models for Rumor Diffusion and Epidemics
  2.3 Cascade Dynamics in a Network
    2.3.1 Independent Cascade model
    2.3.2 Threshold Models
    2.3.3 Complex Contagion Model
3 Macroscopic modeling - Predicting the Spread of Chikungunya Virus
  3.1 Model
  3.2 Data Sources
    3.2.1 Potential data sets
    3.2.2 Representativeness of data sets
  3.3 Model Robustness
  3.4 Model Applicability
  3.5 Presentation
  3.6 Computational Requirements
4 Progressive Diffusion – Minimizing Violence in Homeless Youth
  4.1 Data Collection
  4.2 Model
    4.2.1 Voter Model
    4.2.2 Uncertain Voter Model
      4.2.2.1 Random
      4.2.2.2 Katz-based
  4.3 Greedy Minimization
    4.3.1 Uncertainty in Time
    4.3.2 Probabilistic Intervention
      Case f_p(z_p) > f_p(z_p):
      Case f_p(z_p) < f_p(z_p):
  4.4 Experiments
      Synthetic Kronecker graphs:
      Real-world Homeless Youth Network:
    4.4.1 Synthetic Networks
    4.4.2 Homeless Youth Network
      Choosing individuals in practice.
  4.5 Future Work
5 Non-progressive Diffusion – Analytical Approximation for a Unified Model
  5.1 Unified Model of Influence
    5.1.1 Infection Probability Formula Under the Unified Model
    5.1.2 Complexity Analysis
  5.2 Reduction to Other Models
    5.2.1 Complex Contagion Model
    5.2.2 Independent Cascade Model
    5.2.3 Threshold Models
  5.3 Experiments
    5.3.1 Experiments Using Real-World Data
    5.3.2 Experiments on Erdős-Rényi Random Graphs
6 Seed Set Selection for Influence Maximization
  6.1 Online Seed-set Selection using Unified Model
  6.2 Experiments with Seed-set selection
7 Computing Competing Cascades on Signed Networks
  7.1 Related Work
  7.2 Unified Model of Competing Cascades in Signed Networks
  7.3 Influence Maximization in Signed Networks
  7.4 Seed-set Selection Heuristic
    7.4.1 OSSUM
    7.4.2 Complexity Analysis
  7.5 The case of Generalized Linear Threshold
  7.6 Experiments
    7.6.1 Accuracy of Unified Model
    7.6.2 Seed-set Selection
8 Combating Fake News – A Network Approach
  8.1 Related Work
  8.2 Problem Definition
    8.2.1 Submodularity Property
    8.2.2 Principle of Reversibility
  8.3 Proposed Algorithm for FActCheck
    8.3.1 Graph Reduction
    8.3.2 Realistic Extensions
      8.3.2.1 Choice of Diffusion Model
      8.3.2.2 Importance of Nodes
      8.3.2.3 Individual Expertise
      8.3.2.4 Willingness to share
    8.3.3 Complexity Analysis
    8.3.4 Random Source of Fake News
  8.4 Experiments
    8.4.1 Datasets
    8.4.2 Setup
    8.4.3 Quality of Reduction
    8.4.4 Baselines
    8.4.5 FActCheck with Known Source
    8.4.6 FActCheck with Unknown Source
  8.5 Fake News Immunization
    8.5.1 Fake News Immunization Experiments
9 Conclusions and Future Directions
Bibliography

List of Figures

1.1 An internet prank termed Rickrolling where a seemingly relevant link is provided in an online conversation, but it leads to Rick Astley’s song “Never Gonna Give You Up”.
2.1 Compartmental representation of popular fully mixed diffusion processes. Solid arrows represent transition of an individual from one state to the other. The dotted arrows show how an individual in one state affects the other. (a) Model for rumor spreading. States represent: ignorants i, spreaders s and stiflers r. (b) SIR epidemic model that resembles the rumor spreading model. The states are: susceptible S (similar to ignorant i), infected I (similar to spreader s) and recovered R (similar to stifler r). (c) SI epidemic model with susceptible and infected states. (d) SEIRS model, a generalization of SIR that introduces one more state ‘exposed’ (E). A susceptible node becomes ‘exposed’ when it comes in contact with an infected node. This model also incorporates birth rate B and death rate d of the individuals.
2.2 Toy examples of rumor spreading based on (a) Independent Cascade Model and (b) Linear Threshold Model. Yellow nodes denote infected nodes. Red arrows represent the links along which the infection takes place.
3.1 Prediction error as a function of hyperparameter k for fixed values of parameter m
3.2 Prediction error as a function of hyperparameter
3.3 Change in parameters learnt using varying size of training data
3.4 Error in the final 8 weeks of prediction using a varying size of training data
3.5 Interactive GUI for infection predictions.
3.6 Parts of the interface.
4.1 Visualization of the homeless youth network. The red nodes represent the violent nodes and the green ones represent non-violent ones. The black nodes have unknown state.
4.2 Comparison of the baseline against the greedy algorithm for varying intervention sizes under the Uncertain Voter Model on random Kronecker graphs.
4.3 Comparison of the baseline against the greedy algorithm for varying intervention sizes under the Uncertain Voter Model for deterministic intervention.
4.4 Comparison of the baseline against the greedy algorithm for varying intervention sizes under the Uncertain Voter Model for probabilistic intervention.
5.1 Agreement of simulation and theory for the three models for Digg1k dataset.
5.2 RMSE as a function of time steps on synthetic graphs. RMSE is averaged over graph sizes.
5.3 RMSE averaged over time, as a function of graph size on synthetic graphs.
5.4 Fractional error over time on synthetic graphs. Fractional error is averaged over graph sizes.
5.5 Fractional error averaged over time, as a function of graph size on synthetic graphs.
5.6 RMSE on synthetic graphs for varying density, averaged over time.
5.7 RMSE over time on the synthetic graphs of size 1,000, for varying density values.
6.1 Seed-set selection for Influence maximization.
7.1 Spread of ‘red’ and ‘blue’ infections with (a) S = (v_1, red) and (b) S = (v_1, red), (v_2, red). Solid line represents a positive link and dashed line represents a negative link. Green nodes represent ambiguous infection.
7.2 Errors of approximation with ICM on synthetic graphs.
7.3 Errors of approximation with GLT on synthetic graphs.
7.4 Fraction of negative outlinks from nodes with varying degree.
7.5 Influence spread achieved by the heuristics on graphs with varying fraction of negative links.
7.6 Fraction of nodes in the seed-set included with ‘red’ color by OSSUM. The fraction drops very quickly when majority of links in the network are negative.
7.7 Influence spread achieved by varying size of the seed-set by the heuristics in the datasets after flipping the signs of edges.
7.8 Smooth density estimate of degree distribution of seed set nodes selected by the heuristics.
8.1 Fundamental distinction between FActCheck and Competing Cascades. FActCheck enforces the constraint that the news must pass through the set I.
8.2 Taking into account different diffusion probabilities for sharing real and fake news.
8.3 Comparison of $\sigma(S, I)$ vs execution time for varying values of reduction parameter $\mu$.
8.4 Comparison of $\sigma(S, I)$ vs $|I|$ for different methods when the source is known.
8.5 Comparison of $\sigma(S, I)$ vs $|I|$ for different methods when the source is random.
8.6 Immunization results for AFC, RAFC, and the baselines, showing percentage of nodes that were saved from infection by removing 50 nodes.
9.1 Summary of contributions.

List of Tables

4.1 Top 10 seeds for various values of $\theta$ output by Greedy Minimization
4.2 Top 10 seeds for various values of t output by Greedy Minimization
5.1 Parameters used in the experimental validation on Digg follower graph
5.2 Parameters used in the experimental validation on synthetic graphs
7.1 Summary of the datasets
8.1 Reduction details

Dedicated to Humanity

Chapter 1 Introduction

Epidemiology has a rich literature of computational models for studying the spread of infectious disease in a population [1–4]. Generally, epidemic models are concerned with the number of people infected by a biological contagion, and the effect of parameters such as the rates of infection and recovery on the dynamics of the population. On similar lines, the advent of the social web has provided ample opportunities for studying the network of human interaction leading to information dissemination [5–7]. It is a diffusion process over a population, where a person observes a certain action of others and decides to perform the same action, leading to a ‘cascading’ behavior. The diffusion happens through word-of-mouth and depends on how one values the opinion of others. For instance, a study [8] on farming practices showed that people involved in farming are influenced more by their friends and neighbors than by salesmen in adoption of a new practice. In Economics, marketers are interested in understanding the cascade behavior among consumers, so as to promote their product.

Rumor spreading has also been studied in contexts other than social settings, such as replication and maintenance of databases [9] and network broadcast [10, 11]. In a social setting, tracking actual diffusion is a difficult task as it requires the complete record of the social network structure and the entire history of the diffusion process, which has been difficult to obtain, partly due to privacy policies. With the rise of online social networks, such information became more readily available. Today social media is a central mode of creation and diffusion of information. Along with social marketing and sharing news, it has also given rise to Internet Phenomena (memes) which include viral jokes, images, videos, themes, catchphrases, pranks (see Figure 1.1), etc.
The extreme scale, diversity and dynamics of online social networks have led to the development of simulation models of information dissemination, rather than studying the real diffusion directly. Considering the complexity of the cognitive, social and structural processes involved in diffusion dynamics, it is essential to build models that can accommodate “what-if” scenarios, such as viral outbreaks, to support decision making in (near) real-time. Rumor spreading/diffusion can be described as a stochastic process that unfolds over the network based on some probabilistic setting. Due to their probabilistic nature, such models require a large number of simulation runs. In some cases, when the computational model is available, it is in the form of a large number of coupled partial differential equations [12] which are computationally expensive to solve. The task becomes more difficult as the size of the network grows.

FIGURE 1.1: An internet prank termed Rickrolling where a seemingly relevant link is provided in an online conversation, but it leads to Rick Astley’s song “Never Gonna Give You Up”.

1.1 Thesis

Given the challenges and utilities of diffusion related problems, I propose the following thesis statement:

Approximate solutions of several diffusion computation and diffusion optimization problems can be obtained using analytical solutions and/or intelligent sampling for a wide class of diffusion models.

To support the thesis statement, I have conducted research by finding approximate solutions (demonstrated mathematically or empirically) for the following problems:

- Predicting the number of infections in different countries across which an infection is spreading (DARPA Chikungunya Challenge 2014).
- Minimization of violence among homeless youth by identifying optimal peers for intervention.
- Finding the probability of infection/activation of an individual given that we start with a set of infected individuals in a network.
- Finding the set of individuals to initiate a campaign to maximize the influence (Viral Marketing).
- Finding the set of individuals to maximize one opinion over the other in presence of friend and foe relationships.
- Combating fake news by identifying most influential individuals in the network who are likely to receive the fake news and can be incentivized to propagate the corresponding real news.

The approximate solutions to all the above problems have been tested on synthetic and real-world networks and are described in the following chapters.

Chapter 2 Background

A number of diffusion models exist in the literature. This chapter presents a summary of modeling techniques that will be used in the later chapters.

2.1 Model Categorizations

In diffusion modeling, we assume that there is a fixed number of states, say, $S_1, S_2, \ldots, S_c$, in which an individual can exist. The model then describes how the process of diffusion unfolds, i.e., how one transitions from one state to another. Depending on this, diffusion models can be categorized in several ways as presented below.

Macroscopic vs Microscopic. In macroscopic modeling, we model the states of a class/compartment of individuals/nodes. In such modeling, we are typically only interested in the collective dynamics, for example, the number of individuals in a certain state. We do not take into account how an individual entity interacts with others. On the other hand, microscopic modeling studies the dynamics of individuals, for instance, the probability of a particular individual getting a piece of news in a social network.

Progressive vs Non-progressive. A progressive diffusion process dictates that once a certain state is reached, it is never left. For example, once an individual receives a news/rumor, they cannot go back to the state of not knowing the news.
A non-progressive process allows for transitioning out of the state; for example, someone who is initially susceptible can contract a disease to become infected, and then recover and become susceptible again.

Unsigned vs Signed. This represents the nature of the network over which two competing cascades diffuse. A signed network, unlike an unsigned network, can have a positive or a negative link between two nodes. Over an unsigned network, a node influences its neighbors with the “opinion” it possesses. However, over a signed network, the opinion may flip when going through a negative link. This represents the phenomenon that we tend to have similar opinions to our friends and opposite opinions to our enemies.

Next, we dive deeper into some of the diffusion models.

2.2 Models for Rumor Diffusion and Epidemics

Many mathematical models developed for studying diffusion of rumor/information rely on partitioning the population into non-overlapping compartments [13, 14], which are then used to study the dynamics of the infection based on the transitions between these compartments. Such modeling is known as the fully mixed technique, which assumes that every individual has an equal chance of “meeting” any other individual. Although simplistic, such models are used to reveal macroscopic statistics. For instance, in [14] the population is partitioned into three compartments: those who have not been exposed to the rumor (ignorants), those who are actively spreading it (spreaders), and those who are aware of the rumor and are no longer spreading it (stiflers). An ignorant may become a spreader with probability $\beta$ every time it comes in contact with a spreader. A spreader may become a stifler with probability $\gamma$ when it encounters another spreader or a stifler. At any time t, the densities of ignorants i(t), spreaders s(t) and stiflers r(t) sum to 1:
\[ i(t) + s(t) + r(t) = 1. \tag{2.1} \]

If $\langle k \rangle$ is the average degree in the original network, the dynamics of rumor spreading, assuming a homogeneous network with degree $\langle k \rangle$, is given by

\[
\begin{aligned}
\frac{di(t)}{dt} &= -\beta \langle k \rangle\, i(t)\, s(t), \\
\frac{ds(t)}{dt} &= \beta \langle k \rangle\, i(t)\, s(t) - \gamma \langle k \rangle\, s(t)\,[s(t) + r(t)], \\
\frac{dr(t)}{dt} &= \gamma \langle k \rangle\, s(t)\,[s(t) + r(t)],
\end{aligned}
\tag{2.2}
\]

where $\beta$ is the probability of an ignorant becoming a spreader, and $\gamma$ is the probability of a spreader becoming a stifler. Although Equation 2.2 assumes a homogeneous network, the idea can be extended to include scale-free networks [14]. Multiple simulations are performed to get the solution to this system of equations describing the densities $i(t)$, $s(t)$ and $r(t)$ over time.

(a) Model for rumor spreading (b) SIR model (a similar model in epidemic) (c) SI Model (a simplified model) (d) SEIRS Model (a general model)
FIGURE 2.1: Compartmental representation of popular fully mixed diffusion processes. Solid arrows represent transition of an individual from one state to the other. The dotted arrows show how an individual in one state affects the other. (a) Model for rumor spreading. States represent: ignorants i, spreaders s and stiflers r. (b) SIR epidemic model that resembles the rumor spreading model. The states are: susceptible S (similar to ignorant i), infected I (similar to spreader s) and recovered R (similar to stifler r). (c) SI epidemic model with susceptible and infected states. (d) SEIRS model, a generalization of SIR that introduces one more state ‘exposed’ (E). A susceptible node becomes ‘exposed’ when it comes in contact with an infected node. This model also incorporates birth rate B and death rate d of the individuals.

Models for rumor spreading draw parallels from computational models for infectious diseases in epidemiology. One such popular model, which has inspired the rumor model described above, is the SIR model [15], where individuals in the population can exist in three possible states: those who can get infected if they come in contact with the pathogen (susceptible), those who have been infected and can pass on the pathogen (infected), and those who were infected in the past but have recovered now and can no longer pass on the infection (recovered). The dynamics is governed by the following system of equations:

\[
\begin{aligned}
\frac{dS}{dt} &= -\beta I \frac{S}{N}, & S(0) &= S_0 \ge 0, \\
\frac{dI}{dt} &= \beta I \frac{S}{N} - \gamma I, & I(0) &= I_0 \ge 0, \\
\frac{dR}{dt} &= \gamma I, & R(0) &= R_0 \ge 0,
\end{aligned}
\tag{2.3}
\]

where S is the number of susceptibles, I is the number of infected, and R is the number of recovered individuals. $\beta$ is the contact rate, and $1/\gamma$ is the average period of infection. $\beta I / N = \lambda$ is referred to as the force of infection. Also $S + I + R = N$, a constant, which is equivalent to the criterion in Equation 2.1.

Note that the system of equations 2.3 is similar to 2.2. A small difference arises from the fact that in the SIR model of epidemics, a node gets ‘Recovered’ from being ‘Infected’ at a certain rate irrespective of its neighbors. In the model of rumor spreading, a node becomes a stifler with a certain probability if it encounters a spreader or a stifler.

Note that these equations disregard the network structure. It can be shown that these equations [12] apply only to the Erdős-Rényi model [16], because they assume that the probability of interaction between any two individuals is the same. This, however, is not true, as most real life networks do not conform to a homogeneous structure. To take into account the network structure, the SIR model can be formulated at node level [12], i.e., for each node i,

\[
\begin{aligned}
\frac{dS_i}{dt} &= -\lambda_i S_i, \\
\frac{dI_i}{dt} &= \lambda_i S_i - \gamma I_i, \\
\frac{dR_i}{dt} &= \gamma I_i, \\
\lambda_i &= \sum_j \tau\, G_{i,j}\, \frac{I_j}{N_j},
\end{aligned}
\tag{2.4}
\]

where $S_i/N_i$, $I_i/N_i$ and $R_i/N_i$ are the probabilities of node i being susceptible, infected and recovered, respectively.
$\tau$ is a constant that governs the rate of infection, and $G$ is the adjacency matrix of the undirected network, such that $G_{i,j} = 1$ if there is an edge joining node i and node j, and 0 otherwise.

For an SI model (which consists of only susceptible and infected individuals), for each node, one can write a differential equation as [17]:

\[ \frac{ds_i}{dt} = \tau \sum_j G_{i,j}\, P(i \text{ is susceptible},\ j \text{ is infected}) \tag{2.5} \]

where $s_i$ is the probability of infection of node i. Assuming the states of a pair of nodes are independent, $P(i \text{ is susceptible},\ j \text{ is infected}) = (1 - s_i)\, s_j$. Therefore, Equation 2.5 becomes

\[ \frac{ds_i}{dt} = \tau \sum_j G_{i,j}\, (1 - s_i)\, s_j = \tau\, (1 - s_i) \sum_j G_{i,j}\, s_j \tag{2.6} \]

This system of differential equations becomes computationally more difficult to solve as the size of the graph increases. Further approximations could be made, for example assuming all nodes to be equivalent and to have the same number of neighbors (homogeneous well-mixed population [18]). This, however, results in loss of information regarding the structure of the network.

Most of the models described as systems of differential equations are difficult to solve analytically. Therefore, one needs to either make a considerable number of assumptions, or simulate the process. Note that to simulate such processes a discretization of time has to be assumed. Therefore, many models have been developed that describe diffusion as a discrete time process, in contrast to the continuous time process described above. In the following sections we discuss such discrete time processes.

2.3 Cascade Dynamics in a Network

Epidemic models that involve partitioning of the population into compartments often ignore the microscopic structure of the network. In a social network, however, information dissemination is highly dependent on the neighborhood of each node due to social reinforcement.
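To make the node-level SI formulation of Equation 2.6 in Section 2.2 concrete, and to show how a node's infection probability depends on its neighborhood, the following minimal Python sketch integrates that system numerically. The function name, the forward-Euler scheme, the step size `dt` and the four-node path graph are all illustrative assumptions rather than part of the original formulation.

```python
def si_infection_probabilities(adj, seeds, tau, dt, steps):
    """Forward-Euler integration of ds_i/dt = tau * (1 - s_i) * sum_j G[i][j] * s_j.

    adj   : adjacency matrix as a list of lists (G in Equation 2.6)
    seeds : set of node indices that are infected with probability 1 at t = 0
    tau   : infection rate constant
    dt    : Euler step size (hypothetical choice; smaller is more accurate)
    steps : number of Euler steps to take
    Returns a list of probability vectors, one per time step (including t = 0).
    """
    n = len(adj)
    s = [1.0 if i in seeds else 0.0 for i in range(n)]
    history = [list(s)]
    for _ in range(steps):
        # Neighborhood term sum_j G[i][j] * s_j for every node i.
        pressure = [sum(adj[i][j] * s[j] for j in range(n)) for i in range(n)]
        # Euler update; the min() only guards against overshoot for large dt.
        s = [min(1.0, s[i] + dt * tau * (1.0 - s[i]) * pressure[i]) for i in range(n)]
        history.append(list(s))
    return history


if __name__ == "__main__":
    # Undirected path graph 0 - 1 - 2 - 3 with node 0 initially infected.
    path = [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    trajectory = si_infection_probabilities(path, {0}, tau=0.5, dt=0.1, steps=50)
    print([round(p, 3) for p in trajectory[-1]])
```

With a dense adjacency matrix each step costs on the order of n² operations, which gives a rough sense of why the system becomes more expensive to solve as the graph grows.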
Existing models of spreading processes in networks attempt to model diffusion as a result of social influence, i.e., the more influential a user is, the wider the spread [19]. Diffusion is modeled using a network structure with static or dynamic edge probabilities [20, 21], which are estimated from past observational data [22]. According to such models, each node independently infects its neighbors with some probability, and each infected node then propagates the infection in the network. Even though this process captures individual influence (i.e., node-to-node), it ignores social influence effects which appear as a result of neighborhood or global pressure [23]. The authors of [19] proposed to mitigate this problem by incorporating the notion of social capital to characterize the network effect in the influence process, whereas [23, 24] presented agent-based computational models, which quantified pairwise influence and global dynamics in the spread of technology adoption at the workplace.

(a) ICM (b) LTM
FIGURE 2.2: Toy examples of rumor spreading based on (a) Independent Cascade Model and (b) Linear Threshold Model. Yellow nodes denote infected nodes. Red arrows represent the links along which the infection takes place.

Two of the most widely used diffusion models are the Linear Threshold Model (LTM) [25] and the Independent Cascade Model (ICM) [20] (see Figure 2.2). LTM assumes that a node gets infected when the number of its infected neighbors exceeds a threshold. According to ICM, each node has n independent chances to become infected, n being the number of its infected neighbors. ICM is closely related to the Susceptible-Infected-Susceptible (SIS) and Susceptible-Infected-Removed (SIR) models [15, 26, 27]. Furthermore, it was recently shown that ICM and LTM are special cases of the Genetic Algorithm Diffusion Model (GADM) [28].
GADM emulates social interactions through a tail-swap cross-over interaction [29], assuming that social interactions are always pairwise.

Here we consider the scenario where a node (an individual) can exist in one of two states depending on whether it has been exposed to the rumor or not. Due to the similarity of the models of diffusion of infectious diseases and rumors, in this chapter we use the terms ‘rumor’ and ‘infection’ interchangeably. A node that has been exposed to the rumor is referred to as ‘infected’; otherwise it is called ‘susceptible’ or ‘healthy’. The probability of infection of a node at time $t$ can be found by simulating the cascade process. Algorithm 1 shows a generic method to simulate a cascade process. The algorithm takes as input the weighted adjacency matrix $G$ of the graph, with $G_{v,u}$ representing the weight of the link from $v$ to $u$, on which the cascade model may depend. It also takes as input the seed set $S$, which is the set of nodes infected initially (at $t = 0$). The other two inputs are the maximum number of time steps and the number of simulations to perform.

Algorithm 1 A generic simulation of rumor spreading over a network.
 1: function SIMULATERUMOR(G, S, t_max, NumSim)
 2:     A ← 0                        ▷ A_{u,t} counts the number of times node u was
 3:                                  ▷ infected at time t over all simulations
 4:     for s = 1 → NumSim do
 5:         Infec ← −1               ▷ A vector to store the time of infection of nodes
 6:         for all u ∈ S do
 7:             Infec(u) ← 0
 8:             A_{u,0} ← A_{u,0} + 1
 9:         end for
10:         for t = 1 → t_max do
11:             R ← RumorCascade(G, Infec, t)
12:             for all u ∈ R do
13:                 Infec(u) ← t
14:                 A_{u,t} ← A_{u,t} + 1
15:             end for
16:         end for
17:     end for
18:
19:     for all u do
20:         for t = 1 → t_max do
21:             B_{u,t} = Σ_{i ≤ t} A_{u,i} / NumSim
22:         end for
23:     end for
24:     return B
25: end function

The algorithm makes use of a function RumorCascade(G, Infec, t) (line 11), the implementation of which depends on the model being used.
The function takes as input the adjacency matrix $G$, the time $t$, and a list (Infec) containing the time of infection of all the nodes. According to the logic used in this algorithm, at simulation number $s$ and time $t$, Infec(u) = $t'$ if node $u$ was infected at time $t' \le t$, and Infec(u) = −1 if the node has not been infected yet. RumorCascade returns a list of nodes that were infected by the process in the current time step. Algorithm 1 returns a matrix $B$, whose elements $B_{u,t}$ represent the probability of infection of node $u$ by time $t$. Note that if the objective is only to find the expected number of infections after steady state (infinite time), then a different simulation method may be more efficient [30]. However, to get a better insight into the model dynamics, we focus on the problem of finding the probability of infection at a given time.

Next we provide a description of some of the models which run at discrete time steps, and describe the corresponding RumorCascade function for a better understanding of the implementation of the process.

2.3.1 Independent Cascade model

In the Independent Cascade Model (Figure 2.2(a)) [20], a seed set of infected nodes is provided at $t = 0$. At each time step $t$, each node is either infected or susceptible, and every node $v$ that was infected at time $t - 1$ has a single chance to infect each of its neighbors $u$. The infection succeeds with probability $p_{v,u}$ [20]. Algorithm 2 implements the Cascade function for ICM. First it finds all the nodes that were infected at time $t - 1$, and hence are active at the current time step $t$. Let $v$ be such a node. Then $v$ tries to infect each of its outgoing neighbors, succeeding with probability $p_{v,u}$ given by the element $G_{v,u}$ of the adjacency matrix. The success of the event is simulated by generating a random number random() between 0 and 1, and checking if its value is less than the probability of success ($G_{v,u}$).
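As a rough illustration of how the generic simulation loop and the ICM cascade step described above fit together, the following Python sketch estimates the probability of infection of each node by each time step. It is a minimal sketch under stated assumptions, not the implementation used in this work: the graph is assumed to be stored as a list of dictionaries (`G[v][u]` is the weight of the link from `v` to `u`), uninfected nodes carry the sentinel `NOT_INFECTED`, and the names `simulate_rumor` and `cascade_icm` are hypothetical.

```python
import random

NOT_INFECTED = -1  # assumed sentinel for "not infected yet"


def cascade_icm(G, infec, t, rng):
    """One ICM step: every node infected at t-1 gets a single chance
    to infect each of its susceptible out-neighbours."""
    newly_infected = set()
    for v, t_v in enumerate(infec):
        if t_v != t - 1:
            continue  # only nodes infected in the previous step are active
        for u, weight in G[v].items():
            if infec[u] == NOT_INFECTED and rng.random() < weight:
                newly_infected.add(u)
    return newly_infected


def simulate_rumor(G, seeds, t_max, num_sim, cascade=cascade_icm, seed=0):
    """Monte Carlo estimate of B[u][t], the probability that node u
    is infected by time t (t = 0 .. t_max)."""
    rng = random.Random(seed)
    n = len(G)
    # A[u][t] counts how often u became infected exactly at time t.
    A = [[0] * (t_max + 1) for _ in range(n)]
    for _ in range(num_sim):
        infec = [NOT_INFECTED] * n
        for u in seeds:
            infec[u] = 0
            A[u][0] += 1
        for t in range(1, t_max + 1):
            for u in cascade(G, infec, t, rng):
                infec[u] = t
                A[u][t] += 1
    # Cumulative counts divided by the number of runs give probabilities.
    B = [[0.0] * (t_max + 1) for _ in range(n)]
    for u in range(n):
        running = 0
        for t in range(t_max + 1):
            running += A[u][t]
            B[u][t] = running / num_sim
    return B
```

For a three-node chain 0 → 1 → 2 with all link weights equal to 1 and seed set {0}, this sketch reports that node 1 is infected by t = 1 and node 2 by t = 2 with probability 1. Passing a different `cascade` function with the same signature would swap in another diffusion model without changing the outer loop.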
In the most basic version of ICM, we assume $p_{v,u} = p \;\; \forall (v,u)$.

Algorithm 2 Cascade function for ICM
function CASCADEICM(G, Infec, t)
    R ← ∅
    for all v ∈ {v | Infec(v) = t − 1} do
        for all u ∈ {u | Infec(u) = −1, G_{v,u} > 0} do
            if random() < G_{v,u} then
                R ← R ∪ {u}
            end if
        end for
    end for
    return R
end function

2.3.2 Threshold Models

In threshold models (Figure 2.2(b)) the probability of infection of a node depends on the popularity of the contagion in its incoming neighborhood. Several threshold models exist in the literature, including the Linear Threshold Model [25] and the Linear Friendship Model [22, 31]. The Generalized Threshold Model (GLT) [20] dictates that a node $u$ is infected based on a monotone function of the set of its infected neighbors, $f(In(u,t)) \in [0,1]$, and a threshold $\theta_u \in [0,1]$. In particular, $u$ is infected at time $t$ if $f(In(u,t)) \ge \theta_u$. Note that the threshold $\theta_u$ can be randomly selected at each time $t$ [32], leading to non-determinism of the infection process. Since these thresholds are selected uniformly at random, this is equivalent to saying that the probability of infection of a healthy node $u$ at time $t$ is $f(In(u,t))$.¹

¹ This is different from the linear threshold model proposed in [20], where the threshold remains constant over time during a single simulation run.

Algorithm 3 shows the implementation of RumorCascade for GLT. For every node $u$ that has not been infected yet, the probability of infection is given by the function $f(In(u,t))$ based on the infection state of its neighbors and the weight of their links in $G$. Again, the infection is successful if a random number random() is less than $f(In(u,t))$.

Algorithm 3 Cascade function for GLT
function CASCADEGLT(G, Infec, t)
    R ← ∅
    for all u ∈ {u | Infec(u) = −1} do
        if random() < f(In(u,t)) then
            R ← R ∪ {u}
        end if
    end for
    return R
end function

2.3.3 Complex Contagion Model

Both ICM and GLT have one mode of infection at each time step.
In [23], it has been shown that in some situations one infection process may not be sufficient to model the dynamics of a diffusion process, and the Complex Contagion Model (CCM) is proposed. According to CCM, infection can be achieved at time $t$ in two ways. First, every infected node attempts to infect each of its outgoing neighbors with probability $p$. Once a node is infected, it cannot be infected again. Once all infected nodes are examined, healthy nodes have a chance of random infection based on the popularity of the contagion at time $t - 1$. In particular, for $n_i^{t-1}$ infected nodes by time $t - 1$, the probability of random infection at time $t$ is given by an exponential growth law: $r(t) = \exp(\alpha n_i^{t-1} - \beta)$, where $\alpha$ and $\beta$ are constants [23].

The corresponding RumorCascade function is shown in Algorithm 4. Observe that the attempt of infection is made twice (line 5 and line 12) during the same time step. First, infected nodes attempt to infect their outgoing neighbors, and succeed with probability $G_{v,u}$ (line 7). We set all non-zero weights $G_{v,u} = p$. Then a second round of infection takes place, where every node, irrespective of the structure of the graph, may randomly get infected (line 13) with probability given by the exponential growth function.

Algorithm 4 Cascade function for CCM
 1: function CASCADECCM(G, Infec, t)
 2:     R ← ∅
 3:     n_i ← |{u | Infec(u) > −1}|
 4:     r ← exp(α n_i − β)
 5:     for all v ∈ {v | Infec(v) > −1} do
 6:         for all u ∈ {u | Infec(u) = −1, G_{v,u} > 0} do
 7:             if random() < G_{v,u} then
 8:                 R ← R ∪ {u}
 9:             end if
10:         end for
11:     end for
12:     for all u ∈ {u | Infec(u) = −1} do
13:         if random() < r then
14:             R ← R ∪ {u}
15:         end if
16:     end for
17:     return R
18: end function

Chapter 3
Macroscopic modeling - Predicting the Spread of Chikungunya Virus

In 2014, the Chikungunya virus started spreading rapidly in the Western Hemisphere. Chikungunya is rarely fatal but can cause debilitating joint and muscle pain, fever, nausea, fatigue and rash.
By May 2015, there were 1.4 million suspected cases in the Americas, making it a serious national risk. Therefore, DARPA announced a challenge¹, inviting data scientists to fill the gap in the then existing infectious disease modeling and prediction. I developed a computational epidemic model to forecast outbreaks of Chikungunya (CHIKV) and assess its international spreading risk throughout the Americas, with the intent of applying these capabilities to the mitigation of CHIKV outbreaks. I was one of the winners of the DARPA Challenge.

3.1 Model

Our model is an extension of the popular SI (susceptible-infected) model; it integrates disease dynamics with human mobility patterns. Our model is advantageous in that it allows for a fine-grained analysis of epidemic patterns both within a population and at an international scale, while at the same time being computationally scalable and requiring only a small amount of data to be initialized. This allows for both the detection of local outbreaks and the identification of the timeline of the arrival of the epidemic in each country (i.e., imported cases) as a direct result of the human mobility network. Instead of studying the dynamics of CHIKV within the boundaries of each country in isolation, our extended SI model explicitly couples the propagation of the disease to human mobility patterns. Specifically, $F(q,p)$ people traveling from country $q$ can increase the number of infections in country $p$. Within each country, an individual can exist in one of two states, much like in the traditional SI model: susceptible and infected.

¹ https://www.darpa.mil/news-events/2015-05-27
Unlike the original SI model, however, our model supports heterogeneous infection rates; a susceptible individual is infected when in contact with an infected person at a rate depending on the time the latter individual was infected, i.e., the rate of infection is $\beta_1$ for an individual infected at $t-1$, $\beta_2$ for an individual infected at $t-2$, and so on. We propose this diversity of infection rates to explicitly model how actively people propagate the infection when infected. In our modeling, we assume that after being infected for a certain time, individuals no longer spread the infection, i.e., there is a parameter $k$ such that $\beta_i = 0 \;\; \forall i > k$. Our model can be represented by the following system of equations:

\[ \Delta S^p_t = -\frac{S^p_{t-1}}{N^p} \sum_{i=1}^{k} \beta^p_i\, \Delta I^p_{t-i} \tag{3.1} \]

\[ \Delta I^p_t = \frac{S^p_{t-1}}{N^p} \left( \sum_{i=1}^{k} \beta^p_i\, \Delta I^p_{t-i} + \delta \sum_{q} \frac{F(q,p)}{N^q} \sum_{i=1}^{k} \beta^q_i\, \Delta I^q_{t-i} \right) \tag{3.2} \]

where $S^p_t$ and $I^p_t$ represent the number of susceptible individuals and infected individuals, respectively, in country $p$ at time $t$, and $N^p$ is the population of country $p$. Terms $\Delta S^p_t$ and $\Delta I^p_t$ represent changes induced by the disease dynamics and can be written as $S^p_t - S^p_{t-1}$ and $I^p_t - I^p_{t-1}$, respectively. Parameter $\gamma$ is used to capture the spatial dynamics of the disease by modeling the rate of infection due to neighboring countries. Parameter $\delta$ captures the rate of infection due to mobility of the infected population from one region to another.

3.2 Data Sources

In our model, the number of infections in a country in prior weeks is the chief determinant of the trend of the number of infections. We obtain this information from the number of reported cases published weekly by PAHO [8]. The spread of infection in our model is dependent on the movement of infected persons from one country to another. Since this information is not directly available, the model can accept other travel-related information. In the current work, only the total number of passengers moving between all major airports in the countries is used.
For an estimate of the number of passengers that move between countries, we use the output of an open-access model of passenger flow that was recently developed by Huang et al. [1]. We selected an analytical model instead of actual passenger counts because passenger count information is not readily available from public sources for all the relevant countries. Moreover, an analytical model can be adapted to changes in the granularity of our predictive model (for instance, modeling state or county-level infection spread instead of at the country level).

The inputs to the analytical passenger flow model include air network characteristics, city population, local area, and GDP as of 2010, compiled from public datasets [1]. We consider travel only between airports with a host city population of more than 100,000 and within two air transfers. For inclusion in our Chikungunya forecast model, we aggregate the number of passengers to and from all airports within a country.

3.2.1 Potential data sets

Our model can accept estimates of movement of people by different modalities. The only requirement is that these estimates be available for pairs of locations at the same granularity as the prediction model (i.e., if the predictive model represents infection at the state level, then the travel estimates should be between pairs of states). Below, we list some potential datasets.

Adjacency data: The information whether a country is adjacent to another, i.e., shares a land border, is expected to correlate in general with the relative number of people who move between the two countries, especially via land routes. The source data for country-level adjacency is available from [2]. This data is based on information collected by the Correlates of War Project [3]. Two countries are represented as being adjacent if they share a land or river border or are separated by up to 24 miles of water.
This dataset uses country neighborhood information current as of December 2006. The adjacency matrix from this dataset does not connect the Caribbean islands with each other or with other countries. However, in our initial analysis incorporating this dataset (and not airline travel), we represented all Caribbean islands as being adjacent to each other.

International airline traffic: Instead of using an analytical model-based estimate of airline passengers, it is also possible to use actual passenger counts where available. Such information is available from [4].

Migration rates: Seasonal migration is expected to account for a relatively large number of the people who move between countries in the region of interest for Chikungunya. Estimates of international migration are available from [5]. Migrant worker travel patterns within the United States are available from [6]. International migrant worker patterns are available from [7].

3.2.2 Representativeness of data sets

The number of infected cases published by PAHO [8] is not equally current for all countries. Thus, the prediction accuracy for countries which do not have current information is lower relative to countries with current reports. The number of infections includes the number of suspected cases. Our model does not account separately for suspected and true numbers of infections. The reliability of the numbers of suspected cases varies between countries, and this affects the prediction ability of our model. Air travel data does not account for all modes of travel between countries. In particular, air travel is biased towards people with higher economic resources. However, air travel is expected to account for most travel to and from several island countries in the Caribbean.
The airline travel data source that we use also does not account for seasonal changes in passenger rates (the estimates are of total passengers per year) or for annual changes (the air travel analytical model is not re-run for every year). Exclusion of other travel modalities will thus make our Chikungunya prediction model insensitive to land- and sea-based movement of infected persons. Exclusion of seasonal changes in travel patterns makes our model insensitive to expected changes in the number of people travelling in different weeks of the year. However, since our prediction model is re-trained every week with all available infection data, the model can account for such factors with a lag (as these factors affect the number of reported infections, which are used for training our model parameters).

3.3 Model Robustness

We use weighted least squares as a means of learning parameters $\{\beta^p_1, \beta^p_2, \ldots, \beta^p_k\}$ and $\delta$ from available infection data at the country level. The best fit in the least-squares sense minimizes the sum of squared residuals, i.e., the differences between observed data and predicted values provided by our learned model. We incorporate a forgetting factor $\alpha \le 1$ in our minimization to give more weight to the recent infection trend when learning the model. We compute the minimum of the sum of squares of weighted errors as follows:

\[ \mathrm{LSE} = \sum_{t=1}^{T} \sum_{p} \alpha^{T-t} \left( I^p_t - \hat{I}^p_t \right)^2, \]

where $\hat{I}^p_t$ is the actual cumulative number of the infected subpopulation. We trained our model every month for the duration of the Challenge. We split our data into training and validation sets; the most recent data (about 10-15% of the most recently updated data) is used for validation and the rest for learning the parameters of our model. We experimented with different values of $k$ to minimize the error:

\[ E = \frac{1}{55 N_w} \sum_{p,w} \frac{\left( I^p_w - \hat{I}^p_w \right)^2}{2 \max\{1, \hat{I}^p_w\}}, \]

where $N_w$ is the number of weeks in the validation set.
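To make the role of the forgetting factor concrete, the following Python sketch evaluates a weighted least-squares objective of the kind described above. It is only an illustrative sketch: it assumes the weight α^(T−t) multiplies each squared residual, that the infection series are plain nested lists indexed as `series[p][t-1]`, and the function name `weighted_lse` is hypothetical rather than taken from the released software.

```python
def weighted_lse(actual, predicted, alpha):
    """Sum over countries p and weeks t = 1..T of
    alpha**(T - t) * (actual - predicted)**2.

    actual[p][t-1] and predicted[p][t-1] hold cumulative infections
    for country p in week t (an assumed data layout).
    """
    total = 0.0
    for actual_row, predicted_row in zip(actual, predicted):
        T = len(actual_row)
        for idx, (a, p) in enumerate(zip(actual_row, predicted_row)):
            t = idx + 1
            # The most recent week (t = T) always receives weight 1;
            # older weeks are discounted when alpha < 1.
            total += alpha ** (T - t) * (a - p) ** 2
    return total


# One country, three weeks: residuals are 1, 0 and -2.
observed = [[1.0, 2.0, 3.0]]
fitted = [[0.0, 2.0, 5.0]]
loss_equal_weights = weighted_lse(observed, fitted, alpha=1.0)
loss_recent_weighted = weighted_lse(observed, fitted, alpha=0.5)
```

In this toy series, lowering α from 1 to 0.5 leaves the latest residual untouched and shrinks the contribution of the oldest one by a factor of four, which is the sense in which a smaller α emphasizes recent trends.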
For the prediction of the first week in the validation set, instead of using $\Delta I^p_T = I^p_T - I^p_{T-1}$, we used $\Delta I^p_T = \frac{I^p_T - I^p_{T-mk}}{mk}$, where hyperparameter $m$ is learnt from the validation set. We found this normalization effective in reducing the effect of noise in the data.

Robustness Testing: We performed experiments to evaluate the robustness of our proposed model. We report our findings next. In short, we found our proposed model to be fast and effective in predicting the number of infections within each country and overall, with small variability in the overall error as a function of the parameter values learned by training.

Our model contains three hyperparameters: $m$, $k$ and $\alpha$. Figure 3.1 shows the error for fixed $k$ and varying $m$. Specifically, we first train our model with hyperparameter $k$ fixed and test it with different values of $m$. We chose $m = 2$ and $k = 2$ for our final predictions, as we found these values to result in good predictive accuracy while at the same time offering a computationally inexpensive model. The combination of parameters $m = 1$, $k = 4$ results in a model with a slightly lower error. However, more parameters need to be learnt in this case, which may lead to overfitting. For the above experiment the value of $\alpha$ was set to 1. Next, we evaluate the sensitivity of our proposed model to the value of $\alpha$. Figure 3.2 summarizes the results. As the error increases only slightly with the value of $\alpha$ (i.e., the error increment is less than 2.5% for $\alpha$ increasing from 0.2 to 1), we conclude that the prediction error is effectively independent of the value of $\alpha$. Therefore, a model which places more weight on recent data (i.e., uses a low value for parameter $\alpha$) is similar in prediction accuracy to a model which weighs all data equally (i.e., $\alpha = 1$).
We attribute this result to the fact that by the end of 50 weeks, enough data are available for our computational epidemic model to learn the Chikungunya infection dynamics and the role of human mobility in shaping the epidemic at the international level. However, if seasonality or drastic changes are observed in the cumulative number of infections, hyperparameter $\alpha$ should be set to a lower value in order for the model to rely mostly on recent trends.

FIGURE 3.1: Prediction error as a function of hyperparameter k for fixed values of parameter m

FIGURE 3.2: Prediction error as a function of hyperparameter α

Assumptions: Due to the lack of availability of a large amount of data, our core philosophy is that the problem should be tackled with a simple model with few parameters per country. Our model makes the following assumptions.

• The interactions within a region are homogeneous, i.e., all individuals who are infected during a given week have similar behavior in passing on the infection. This “mean-field” assumption is made for simplicity of the model.
• Although several factors, including socio-economic conditions and mosquito density, can affect the infection pattern, they implicitly contribute to the infection rate. Therefore, we do not model these variables and assume that the infection rates $\beta^q_i$ are able to capture these effects.
• The number of individuals traveling into a country every week is very small compared to the population of the country.
• The population of each country remains constant throughout the weeks concerned. Birth and death rates are ignored.

Model Sensitivity: To test the effect of input data on our model, we trained our model using a varying number of weeks. Specifically, we split our training dataset in increments of 10 weeks so that our first training dataset for this experiment contains data for 10 weeks, our second dataset contains data for 20 weeks, and so on.
We use the deviation between the values learnt for the parameters using data for $i < 50$ weeks and those learnt using data for 50 weeks to measure the sensitivity of our model. Formally, the 2-norm of the vector, $\|\Delta\mathrm{params}\| = \|\mathrm{params}_i - \mathrm{params}_{50}\|$, forms the basis of our evaluation. Figure 3.3 shows the effect of training set size on the learnt values of the model’s parameters. The y-axis represents the 2-norm described above. Even though the deviation is almost monotonically decreasing with increasing training set size, the extent of the variation is small. Consequently, the prediction error does not deviate significantly despite the increasing size of data available for training. Figure 3.4 summarizes the results. In summary, our proposed model is stable and only requires a small amount of data to learn the Chikungunya infection dynamics and the role of human mobility in shaping the epidemic at the international level.

FIGURE 3.3: Change in parameters learnt using varying size of training data

FIGURE 3.4: Error in the final 8 weeks of prediction using a varying size of training data

3.4 Model Applicability

Generality: The range of applicability of our model is not restricted to predicting the spread of Chikungunya. Our model does not explicitly use anything specifically tied to the Chikungunya disease. Instead, our model is generalizable and can be readily applied to other vector-borne diseases. Some examples could include dengue fever and the West Nile virus. The integration of social contagion due to human mobility, which is not necessarily of biological origin, allows a flexible representation of diverse processes that can be used to study a wide range of diseases at multiple scales ranging from small regional units to worldwide.

Mitigation Strategy: Preventing contagion in networks is an important problem in public health.
Immunizing nodes based on their network interactions has been shown to be far more effective at containing infection spread than immunizing random subjects. In our approach, nodes represent geographic locations, which at a coarse granularity can enclose entire countries. Depending on data availability, geographical regions can be split at the province or state level or even at the city level. Our approach facilitates the analysis of infection transmission between nodes based on population movement (i.e., travel data). Thus, we can identify which nodes are “critical” in spreading the virus (nodes with the highest impact on other nodes). Assuming a fixed amount of resources, the infection rate can be minimized by distributing these resources to such “critical” nodes with the highest rates of infection. Targeting resources to a small subset of countries on the basis of their position in the human mobility network can be much more efficient than distributing infection control resources uniformly. In fact, concentrating resources at a small subset of nodes is both cost- and time-critical with respect to the spread of the disease. Specifically, in our model, the effect of the spread of infection in a region (e.g., country) on another is given by the following equation:

\[ \frac{\partial \Delta I^p_t}{\partial \Delta I^q_{t-i}} \propto \frac{F(q,p)}{N^q}\, \beta^q_i . \]

If state/province level data is available, our model can be used to control infection spread between states by recommending inter-state travel routes to restrict.

For infections such as Chikungunya, for which there is no effective vaccination or treatment, only fractional immunity can be achieved for some nodes by allocating infection-prevention resources to them. Restricting population movement between regions is the most effective strategy according to our approach. Nevertheless, our model can be extended to accommodate immunization due to vaccination or other mitigation/prevention strategies.
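The sensitivity expression above suggests a simple way to rank source regions by their influence on a target region. The Python sketch below computes the quantity F(q,p)/N^q · β^q_i for every source q and returns the largest ones. It is a hedged illustration only: the nested-list layout (`F[q][p]`, `N[q]`, `beta[q][i-1]`) and the function names `top_affecting` and `outgoing_impact` are assumptions made for this example, and the numbers are invented toy values, not data from this work.

```python
def top_affecting(F, N, beta, p, lag=1, top=5):
    """Rank source regions q by F[q][p] / N[q] * beta[q][lag-1],
    i.e. by how strongly new infections in q feed into region p."""
    scores = {
        q: F[q][p] / N[q] * beta[q][lag - 1]
        for q in range(len(N))
        if q != p
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]


def outgoing_impact(F, N, beta, q, lag=1):
    """Total influence of region q on all other regions (one possible
    way to flag a 'critical' node; an assumption of this sketch)."""
    return sum(
        F[q][p] / N[q] * beta[q][lag - 1]
        for p in range(len(N))
        if p != q
    )


# Toy inputs: three regions, passenger flows F[q][p] from q to p.
F = [[0, 100, 10], [50, 0, 0], [200, 20, 0]]
N = [1000, 500, 2000]
beta = [[0.5, 0.1], [0.4, 0.2], [1.0, 0.3]]
ranking_for_region_0 = top_affecting(F, N, beta, p=0)
```

With these toy values, region 2 ranks above region 1 as a source of infections for region 0, because its larger outgoing flow and infection rate outweigh its larger population.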
3.5 Presentation

All the datasets used are publicly available as described earlier. We have developed an interactive user interface that can be used to visualize the results and, if necessary, add more data and retrain the model for better predictions. We have made the MATLAB files available online anonymously, which can be executed on a machine running MATLAB. For those machines that do not have MATLAB, we are also providing a standalone installer package for Windows that will install all files necessary for executing the software. These files can be downloaded by following this link. Interacting with the GUI does not require any knowledge of MATLAB or the epidemiology of the disease. The following shows a screenshot of the interface.

FIGURE 3.5: Interactive GUI for infection predictions.

The “Country List” lets the user select a country. Clicking “See infection Trends” plots the infection trend (cumulative) in Plot 1 for the currently selected country (Figure 3.6(a)). Checking “Add prediction for ... weeks” adds the prediction from our model for the number of weeks mentioned in the text box, which can be edited. The predicted part of the trend is shown with dotted lines. To compare multiple countries, the “Add Countries” checkbox can be checked. Doing so will keep the current plot in Plot 1 and add the trend of the country that the user selects next from the list box. Clicking on “Affected By” shows the top five countries that affect the infection in the currently selected country. The “Affecting Score” is calculated as $\frac{F(q,p)}{N^q}\, \beta^q_i$. The infection pattern in these five countries is shown in Plot 2 (Figure 3.6(b)).

FIGURE 3.6: Parts of the interface. (a) Infection prediction (b) Affected by countries (c) Advanced options (d) File menu

The “Advanced Options” frame (Figure 3.6(c)) lets the user change the hyperparameters of the model, namely $k$, $\alpha$ and $m$.
The default values are set based on our validation experiments. To apply these changes, i.e., train the model with new hyperparameters, the user needs to click on “Retrain Model”. The training may take some time (a few seconds) depending on the machine running the software. All the predictions made by the GUI are based on the history of infection gathered until the last week of January 2015.

The GUI also allows loading data through the “File” menu (Figure 3.6(d)). “Load Infection History” is used to add historical infection data for all the countries. This file should contain the cumulative number of infections separated by tabs or commas. Every row should represent a country, and consecutive columns should represent consecutive weeks. “Load Passenger Flow” loads a file that should contain a weighted adjacency matrix with weights showing the number of passengers going from one country (region) to the other. “Load Country Names” loads a file containing the list of countries, which should be in the same order as in the file containing the infection history. “Load population Data” can be used to load the populations of the corresponding countries. Sample files have been included in the package to provide an idea of the format of each of the files. Loading these files will make the model use the existing parameters on the new data. However, to modify the model (learn parameters) again based on the new data, the user should click on the “Retrain Model” button. After training, the learnt model can be saved using the “Save Current Model” menu item. At any time, to roll back to the initial model supplied with the software, the user can select “Load Defaults”.

3.6 Computational Requirements

Training Requirements: The proposed model requires training based on historical data to derive the model parameters. Training involves learning the values of $kN_c + 1$ parameters. The small complexity associated with computing infections enables fast learning of such parameters.
Specifically, we measured the time required for training on MATLAB R2013a running on a 12-core 2.2 GHz processor with 32 GB RAM for $k = 2$, $N_c = 55$ and $T = 50$, and found it to be less than 13 seconds. Predicting the infection spread for 10 weeks after learning the parameters requires less than 1 millisecond. The same training takes approximately 210 seconds on a machine with 3 GB RAM and a 2.1 GHz dual-core processor running MATLAB 2009a. Note that even on the weaker machine the time requirement is far less than the computational timeframe of 2 hours that is expected.

In order for the best performing model to be selected, training has to be performed for a variety of combinations of hyperparameters $k$ and $\alpha$. Hyperparameter $m$ is not required during training and is used only while predicting the number of anticipated infections for the first week following training. We found in practice that a reasonably good fit can be obtained within ten trials of training, which took about 200 seconds.

Computing Requirements of Running the Submitted Model: Our model is described by an order-$k$ recurrence relation. The first term of the right hand side of Equation 3.2 can be computed in $O(kN_c)$ and the second part in $O(k|E|)$, resulting in a complexity of $O(k(N_c + |E|))$ for each time step, where $|E|$ is the number of inflow population routes between countries. Therefore, for $T$ time steps, the complexity of computing the infection trend for all countries is $O(k(N_c + |E|)T)$. The fast execution of the model can be verified by running the software package that we provide with this submission.

Model Scalability to diverse sets of data: In terms of time complexity, we have shown that our model is highly scalable and grows linearly with time (number of weeks). The computation is also linear in the number of countries and the interactions among them. Therefore, the addition of more countries does not have a drastic effect on the runtime.
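The per-step cost discussed above can be seen in a direct implementation of the recurrence. The Python sketch below performs one time step for all countries, assuming a particular reading of Equations 3.1 and 3.2 (local and imported infection pressure both scaled by the susceptible fraction of the destination country) and a sparse list of travel routes. The function name `model_step`, the argument layout and the toy numbers are hypothetical; this is not the MATLAB code distributed with the model.

```python
def model_step(S_prev, dI_hist, N, beta, flows, delta):
    """One step of the order-k recurrence for all countries.

    S_prev[p]     : susceptibles in country p at time t-1
    dI_hist[i][p] : new infections in p at time t-(i+1), i = 0..k-1
    N[p]          : population of country p
    beta[p][i]    : infection rate of p for lag i+1
    flows         : list of (q, p, F_qp) travel routes from q to p
    delta         : mobility coupling parameter
    Returns (dS, dI) as lists over countries.
    """
    n, k = len(N), len(dI_hist)
    # Infection pressure generated inside each country: O(k * N_c).
    force = [
        sum(beta[p][i] * dI_hist[i][p] for i in range(k))
        for p in range(n)
    ]
    # Pressure imported along each route; one pass over the edges.
    imported = [0.0] * n
    for q, p, f_qp in flows:
        imported[p] += f_qp / N[q] * force[q]
    dS = [-S_prev[p] / N[p] * force[p] for p in range(n)]
    dI = [
        S_prev[p] / N[p] * (force[p] + delta * imported[p])
        for p in range(n)
    ]
    return dS, dI


# Toy example: two countries, k = 1, one route from country 0 to 1.
dS, dI = model_step(
    S_prev=[90.0, 200.0],
    dI_hist=[[10.0, 0.0]],
    N=[100.0, 200.0],
    beta=[[0.5], [0.5]],
    flows=[(0, 1, 20.0)],
    delta=1.0,
)
```

Because the local pressure of each country is computed once and reused for every outgoing route, the edge loop in this sketch costs O(|E|), which stays within the O(k(N_c + |E|)) bound per step stated above.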
We emphasize again that our model can be easily extended to other granularities, i.e., using states/regions instead of countries, if relevant data is available. Due to the low running time of our model, the identification of critical “nodes” will also be tractable.

Chapter 4
Progressive Diffusion – Minimizing Violence in Homeless Youth

There are an estimated 1.6–2.8 million youths experiencing homelessness in the United States [33]. Although violence in the United States has steadily decreased during the past decade, homeless youth remain disproportionately susceptible to violent victimization and perpetration [34]. These youths experience all types of violence at higher rates than their housed counterparts [35–37]. Violence perpetuates violence and diffuses through a network like a contagious disease [38]. The Chicago Cure Violence program¹ is based on a similar idea of treating violence as a contagious disease, and has shown significant reduction in violence. Motivated by this contagious nature, a diffusion model is ideal for modeling the spread of violence. Doing so can lead to optimal intervention strategies under certain assumptions. To the best of our knowledge, intervention strategies to reduce violence using diffusion models have received very little attention in the literature [39, 40]. Violence is modeled based on susceptibility and infectiousness in [39]. In [40] the idea of opposing forces, “provocation” and “repression”, is used to model violence as two diffusing processes. This is more accurate as it captures the non-progressive nature of violence, where an individual may switch between the states of “violence” and “non-violence”. However, it is a macroscopic approach, which disregards the network structure.
While many diffusion models exist that are variations of the Independent Cascade Model, the Linear Threshold Model, and the Susceptible-Infected model, they are “progressive” models, i.e., they assume that once activated (or infected), individuals remain activated. In the context of violence, this would mean that a violent person can never become non-violent, which is absurd. A popular model that captures non-progressive diffusion of competing behaviors on social networks is the voter model [41]. In the voter model, individuals are influenced by a randomly selected neighbor². However, applying the voter model to real-life scenarios such as the diffusion of violence has the following drawbacks. (a) There is some uncertainty in the network structure, in the sense that individuals may forget to mention someone as their peer, and yet be influenced by them [42]. (b) The number of discrete time steps over which the diffusion process unfolds (a parameter required by the voter model) is often unknown in practice.

¹ http://cureviolence.org/

To deal with these uncertainties, we propose the Uncertain Voter Model (UVM) as an extension of the voter model. UVM allows for some uncertainty in the knowledge of the neighborhood, which may arise from an individual being influenced by someone they did not explicitly state as one of their “friends” during the survey used to create the network. Our model also incorporates uncertainty in the number of time steps of the diffusion process. Under UVM, we find the optimal intervention strategies to minimize violence. The task is to perform interventions on individuals with constrained “resources” so that they change their state from violent to non-violent, resulting in others adopting the non-violent state, eventually minimizing violence.
We consider two types of interventions: (i) deterministic, where selecting an individual turns them non-violent, with the constraint being the number of individuals to select; (ii) probabilistic, where an individual’s probability of being non-violent is increased based on the number of “units” (hours, sessions, etc.) of intervention, with the constraint being the total number of units available. Specifically, our contributions are as follows:

• We propose the Uncertain Voter Model for violence, which captures its non-progressive nature and takes into account the uncertainty in the neighborhood as well as the uncertainty in the time period over which the diffusion of violence unfolds.
• We formally define the Violence Minimization problem, where the task is to perform intervention with finite resources, i.e., change the state of some violent individuals so that the total expected number of violent individuals is minimized.
• We show that the Uncertain Voter Model can be reduced to the classic voter model, and thus a greedy algorithm forms the optimal solution to Violence Minimization.
• We extend our solution to “probabilistic” intervention, where the intervention reduces the probability of violence of a selected individual according to a concave non-decreasing function.
• We perform experiments on synthetic networks and a real-life network of homeless youths, find the nodes to be selected for intervention, and demonstrate that baselines that do not take the diffusion model into account perform significantly worse.

² We use the terms “neighbor” and “neighborhood” to refer to the links of a given individual in the network and not their physical neighborhood.

4.1 Data Collection

A sample of 481 homeless youth aged 18 to 25 years accessing services from two day-service drop-in centers for homeless youth in Hollywood and Santa Monica, CA, were approached for study inclusion in October 2011 and February 2012.
The research team approached all youths who entered the service agencies during the data collection period and invited them to participate in the study. The final sample consisted of 366 individuals who agreed to participate. The study consisted of a social network interview, where each participant was asked to name anyone they interacted with in person, on the phone, or through the internet in the previous month, prompted by interviewers stating, “These might be friends; family; people you hang out with/chill with/kick it with/have conversations with; people you party with–use drugs or alcohol; boyfriend/girlfriend; people you are having sex with; baby mama/baby daddy; case worker; people from school; people from work; old friends from home; people you talk to (on the phone, by email); people from where you are staying (squatting with); people you see at this agency; other people you know from the street.”

The variable of interest is violent behavior. Violent behavior was assessed by recent participation in a physical fight. Participants were asked: “During the past 12 months, how many times were you in a physical fight?” Eight ordinal responses ranged from “zero times” to “over 12 times.” The responses were dichotomized, similar to previous literature on youth violence [43, 44], to distinguish between participants who had been in no physical fights and participants who had been in at least one physical fight during the previous year. This question was adopted from the Youth Risk Behavior Survey, Centers for Disease Control and Prevention [45], and did not distinguish between victims and perpetrators of violence.

4.2 Model

To model the spread of violence, we model the network of homeless youth as a graph $G(V, E)$ where every individual is a node which can exist in one of two states: ‘violent’ or ‘non-violent’.
We chose to model violence as a non-progressive diffusion process, i.e., a node may switch its state, unlike progressive diffusion, where once a node is violent it cannot become non-violent again. Next, we provide background on the voter model [41], on which our model is based.

4.2.1 Voter Model

In the voter model [41], at every time step a node $u$ picks an incoming neighbor $v$ at random with probability $p(v,u)$. The incoming probabilities are normalized such that $\sum_v p(v,u) = 1$. Let $x_{u,t}$ represent the probability of node $u$ being violent at time $t$. According to the model,

$$x_{u,t} = \sum_v p_{v,u}\, x_{v,t-1}. \tag{4.1}$$

Let $x_t$ represent the state of all the nodes at time $t$, with the $i$-th element representing the probability that $v_i \in V$ is violent at time $t$. Suppose matrix $M$ represents the transpose of the adjacency matrix of the weighted network, i.e., $M_{u,v} = p(v,u)$. Then

$$x_t = M x_{t-1}. \tag{4.2}$$

It follows that

$$x_t = M^t x_0. \tag{4.3}$$

Here $x_0$ is the initial state of the nodes, which is assumed to be known. Now we wish to select $k$ nodes out of those that are violent at $t = 0$ and turn them non-violent so that the expected number of nodes that are violent at time $t$ is minimized. Define $I_X$ for $X \subseteq V$ as the vector in which the $i$-th element is 1 if $v_i \in X$ and 0 otherwise. Then the expected number of violent nodes at time $t$ is given by

$$\sum_i P(v_i \text{ is violent at time } t) = \sum_i x_{i,t} = I_V^T x_t. \tag{4.4}$$

4.2.2 Uncertain Voter Model

A network formed through a survey may have missing edges due to the uncertainty in a person’s ability to recall all the “friends” they might be influenced by [42]. To capture this aspect, we propose the Uncertain Voter Model, where we assume that a node which is not directly connected to the node of interest may also influence it.
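As a reference point for the extension that follows, the classic voter model update of Equations 4.1 to 4.4 can be sketched in a few lines. This is only an illustrative sketch: the function names, the dense list-of-lists representation, and the 3-node example network are assumptions made here and are not part of the original study.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def voter_step(M: Matrix, x: Vector) -> Vector:
    """One step of Equation 4.2: x_t = M x_{t-1}, with M[u][v] = p(v, u)."""
    return [sum(m_uv * x_v for m_uv, x_v in zip(row, x)) for row in M]


def expected_violent(M: Matrix, x0: Vector, t: int) -> float:
    """Expected number of violent nodes after t steps (Equations 4.3 and 4.4)."""
    x = list(x0)
    for _ in range(t):
        x = voter_step(M, x)
    return sum(x)


# Hypothetical 3-node network: row u lists p(v, u) for every v and sums to 1.
M_example = [
    [0.0, 0.5, 0.5],  # node 0 copies node 1 or node 2 with equal probability
    [1.0, 0.0, 0.0],  # node 1 always copies node 0
    [0.0, 1.0, 0.0],  # node 2 always copies node 1
]
x0_example = [1.0, 0.0, 0.0]  # only node 0 is violent at t = 0

if __name__ == "__main__":
    for steps in range(4):
        print(steps, expected_violent(M_example, x0_example, steps))
```

Repeated matrix-vector products are used instead of forming $M^t$ explicitly, which is sufficient when only the trajectory from a single initial state is needed.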
In this model, two mutually exclusive events happen: (i) with probability $\theta$ a node randomly selects one incoming neighbor and adopts its state; (ii) with probability $(1-\theta)$ it selects a node that is not its neighbor in the network and adopts its state. We propose two ways of selecting the node from outside the neighborhood: (i) random and (ii) Katz-based.

4.2.2.1 Random

In this case every node which is not a neighbor is equally likely to be selected. Mathematically,

$$x_{u,t} = \theta \sum_{\{v \mid p(v,u) > 0\}} p_{v,u}\, x_{v,t-1} + (1-\theta) \sum_{\{v \mid p(v,u) = 0\}} \frac{1}{|\{v \mid p_{v,u} = 0\}|}\, x_{v,t-1}. \tag{4.5}$$

If $n$ is the total number of nodes and $d_u$ is the number of incoming neighbors of $u$, then $|\{v \mid p_{v,u} = 0\}| = n - d_u$. Suppose we define

$$q_\theta(v,u) = \begin{cases} \theta\, p_{v,u} & \text{if } p_{v,u} > 0 \\ \dfrac{1-\theta}{n - d_u} & \text{if } p_{v,u} = 0. \end{cases} \tag{4.6}$$

4.2.2.2 Katz-based

We treat the influence from outside the neighborhood as the problem of finding missing edges [46]. A popular method for missing edge detection uses Katz similarity [], which is based on the exponentially weighted number of paths between two nodes:

$$K(u,v) = \sum_i \alpha^i\, |\{\text{paths of length } i \text{ to } u \text{ from } v\}|. \tag{4.7}$$

Since we are only interested in nodes that are not directly in the neighborhood, we take the above summation for $i \geq 2$. The entire similarity matrix is given by

$$K = \sum_{i \geq 2} \alpha^i M^i = \alpha^2 M^2 (I - \alpha M)^{-1}. \tag{4.8}$$

We choose a small value of $\alpha = 0.0005$, which has been shown to perform well for missing edge detection [].
We normalize the scores for each node $u$ over all nodes $v$ which are not in its neighborhood, so that the probability of selecting node $v$ is proportional to $K(u,v)$, i.e.,

$$K'(u,v) = \frac{K(u,v)}{\sum_w K(u,w)}. \tag{4.9}$$

Now, the Katz-based Uncertain Voter Model is given by

$$x_{u,t} = \theta \sum_{\{v \mid p(v,u) > 0\}} p_{v,u}\, x_{v,t-1} + (1-\theta) \sum_{\{v \mid p(v,u) = 0\}} K'(u,v)\, x_{v,t-1}. \tag{4.10}$$

Again, we can define

$$q_\theta(v,u) = \begin{cases} \theta\, p_{v,u} & \text{if } p_{v,u} > 0 \\ (1-\theta)\, K'(u,v) & \text{if } p_{v,u} = 0. \end{cases} \tag{4.11}$$

From Equations 4.6 and 4.11, both the random and the Katz-based Uncertain Voter Model reduce Equations 4.5 and 4.10 to

$$x_{u,t} = \sum_v q_\theta(v,u)\, x_{v,t-1} \quad \text{or} \quad x_t = Q_\theta\, x_{t-1}, \tag{4.12}$$

where $[Q_\theta]_{u,v} = q_\theta(v,u)$. This reduces to the Voter Model (Equation 4.1) on a graph whose transposed adjacency matrix is $Q_\theta$. Now, we define the problem of Violence Minimization as follows.

Problem Definition 1 (Violence Minimization). Given a weighted graph $G(V, E)$, an initial set of violent nodes $S$, a time frame $t$, and an integer $k$, find $T \subseteq S$ with $|T| = k$ such that turning the nodes in $T$ non-violent minimizes the expected number of violent nodes after time $t$, i.e., $I_V^T x_t$, under the Uncertain Voter Model.

4.3 Greedy Minimization

Let $x'_0$ be the vector formed by turning some $k$ nodes non-violent, resulting in the vector of probabilities $x'_t$ at time $t$. Now, minimizing $I_V^T x'_t$ is equivalent to maximizing $I_V^T (x_t - x'_t) = I_V^T Q_\theta^t (x_0 - x'_0)$, i.e., the problem reduces to maximizing

$$I_V^T \Delta x_t = I_V^T Q_\theta^t\, \Delta x_0 = \sum_{\{u \mid \Delta x_0(u) = 1\}} I_V^T Q_\theta^t I_u, \tag{4.13}$$

which can be optimized using a greedy strategy [41], as presented in Algorithm 5.

Algorithm 5 Greedy algorithm to minimize violence
    function MINVIOLENCE($G, S, \theta, k, t$)
        Compute $Q_\theta^t$ for $G$
        For all $u \in S$, compute $\sigma(u) = I_V^T Q_\theta^t I_u$
        Sort $\{\sigma(u)\}$ in descending order and return the top $k$
    end function

The most expensive step of the algorithm is the computation of $Q_\theta^t$, which can be computed in $O(|V|^{2.4} \log t)$.
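The reduction above can be made concrete with a small sketch of Algorithm 5 for the random variant (Equation 4.6). The names `uvm_random_matrix`, `influence_scores` and `min_violence`, the dense list-of-lists layout, and the example network are illustrative assumptions, not the implementation used in the study. One deliberate difference is labeled in the code: rather than forming $Q_\theta^t$, the sketch propagates the row vector $I_V^T$ through $Q_\theta$ for $t$ steps, which yields all scores $\sigma(u)$ at once.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def uvm_random_matrix(P: Matrix, theta: float) -> Matrix:
    """Build Q_theta for the random variant (Equation 4.6).

    P[u][v] holds p(v, u); every row is assumed to sum to 1.
    As in Equation 4.6, every v with p(v, u) = 0 counts as a non-neighbor.
    """
    n = len(P)
    Q: Matrix = []
    for u in range(n):
        d_u = sum(1 for w in P[u] if w > 0)
        outside = (1.0 - theta) / (n - d_u) if n > d_u else 0.0
        Q.append([theta * w if w > 0 else outside for w in P[u]])
    return Q


def influence_scores(Q: Matrix, t: int) -> List[float]:
    """Return sigma(u) = I_V^T Q^t I_u for every u.

    Assumption of this sketch: t vector-matrix products replace the explicit
    matrix power used in the complexity analysis of the text.
    """
    n = len(Q)
    w = [1.0] * n
    for _ in range(t):
        w = [sum(w[r] * Q[r][c] for r in range(n)) for c in range(n)]
    return w


def min_violence(P: Matrix, violent: Sequence[int], theta: float,
                 k: int, t: int) -> List[int]:
    """Pick the k initially violent nodes with the largest scores."""
    sigma = influence_scores(uvm_random_matrix(P, theta), t)
    return sorted(violent, key=lambda u: sigma[u], reverse=True)[:k]


# Hypothetical 3-node network; row u lists p(v, u) for every v.
P_example = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

if __name__ == "__main__":
    print(min_violence(P_example, violent=[0, 1, 2], theta=0.8, k=2, t=3))
```

Each propagation step costs $O(|V|^2)$ on a dense matrix, so this route takes $O(t|V|^2)$ in total; whether it beats repeated squaring depends on how large $t$ is relative to $|V|$.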
4.3.1 Uncertainty in Time

The Uncertain Voter Model requires $t$ as a parameter, which is unknown in real life. While we may have a certain time period (days or weeks) over which we want the intervention to work, finding a relation between that time period and the parameter $t$ is non-trivial, as it depends on how often the individuals interact. To capture this uncertainty, we assume that time $t$ takes a value $\tau$ with probability $P(t = \tau)$. Now, we wish to minimize $\mathbb{E}(I_V^T x_t)$, where the expectation is taken over $t$. Therefore,

$$\mathbb{E}(I_V^T x'_t) = \sum_\tau P(t = \tau)\, I_V^T Q_\theta^\tau x'_0 = I_V^T \left( \sum_\tau P(t = \tau)\, Q_\theta^\tau \right) x'_0. \tag{4.14}$$

Notice from Equation (4.14) that a greedy solution like Algorithm 5 still applies.

FIGURE 4.1: Visualization of the homeless youth network. The red nodes represent the violent nodes and the green ones represent non-violent ones. The black nodes have unknown state.

4.3.2 Probabilistic Intervention

In the previous section, we assumed that performing intervention on a “violent” node turns it “non-violent”, i.e., an intervention is always successful. However, in real life this may not be true, and some nodes may require more “units” (hours, sessions, etc.) of intervention than others. Let $s_u(z_u)$ be the probability of success after applying $z_u$ units of intervention to node $u$. These functions can be different for different nodes, as different individuals may respond differently to interventions. We assume that these functions $\{s_i\}$ are non-decreasing, i.e., adding more units of intervention cannot decrease the probability of success. We also assume that these functions are concave, i.e., the marginal increase in probability reduces with an increasing number of interventions. Such assumptions are similar to those made in the immunization literature [47]. Mathematically, if $z' \geq z$, then $s_i(z') \geq s_i(z)$ and $s_i(z'+1) - s_i(z') \leq s_i(z+1) - s_i(z)$, $\forall i$.
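The two assumptions on the success functions are easy to check numerically for a candidate response curve. The sketch below is illustrative only: the helper names are hypothetical, and the geometric form $s(z) = 1 - r^z$ is taken from the simulated responses used in the experiments of this chapter rather than being a required choice.

```python
from typing import Callable


def is_nondecreasing_concave(s: Callable[[int], float], z_max: int,
                             tol: float = 1e-12) -> bool:
    """Check both assumptions on the integer grid 0..z_max+1.

    Non-decreasing: every marginal gain s(z+1) - s(z) is >= 0.
    Concave: the marginal gains never increase with z.
    """
    gains = [s(z + 1) - s(z) for z in range(z_max + 1)]
    nondecreasing = all(g >= -tol for g in gains)
    diminishing = all(later <= earlier + tol
                      for earlier, later in zip(gains, gains[1:]))
    return nondecreasing and diminishing


def geometric_success(r: float) -> Callable[[int], float]:
    """Success probability 1 - r**z after z units, for a response rate r in [0, 1]."""
    return lambda z: 1.0 - r ** z


if __name__ == "__main__":
    print(is_nondecreasing_concave(geometric_success(0.5), 20))     # geometric curve
    print(is_nondecreasing_concave(lambda z: float(z * z), 20))     # convex counterexample
```

A check of this kind only covers the tested range of $z$, so it can rule a candidate function out but cannot prove the assumptions in general.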
Rewriting Equation 4.13 for probabilistic intervention, the utility (reduction in violence) obtained by an allocation $\{z_1, z_2, \ldots, z_n\}$, $z_i \in \mathbb{N} \cup \{0\}$, is

$$I_V^T Q_\theta^t\, \Delta x_0 = \sum_u I_V^T Q_\theta^t I_u\, s_u(z_u). \tag{4.15}$$

Let $f_u(z_u) = I_V^T Q_\theta^t I_u\, s_u(z_u)$. This leads to the probabilistic intervention version of the Violence Minimization problem, which is equivalent to maximizing $\sum_u f_u(z_u)$ such that $\sum_u z_u = k$. Note that $I_V^T Q_\theta^t I_u$ is a non-negative constant and $s_u(z_u)$ is a non-decreasing concave function, and so $f_u(z_u)$ is also non-decreasing and concave. Formally, we define this as follows.

Problem Definition 2 (Units Assignment Problem). Given $k \in \mathbb{Z}$ resources and $n$ concave non-decreasing utility functions $f_i : \mathbb{Z} \to \mathbb{R}$, where $f_i(z_i)$ represents the utility of assigning $z_i$ units to function $f_i$, maximize the total utility $F = \sum_i f_i(z_i)$ subject to $\sum_i z_i = k$.

Algorithm 6 Greedy Maximization using Marginal Returns
    function GREEDYMAX($(f_1, f_2, \ldots, f_n), k$)
        for $i \leftarrow 1 : n$ do
            $z_i \leftarrow 0$
        end for
        for $j \leftarrow 1 : k$ do
            $idx \leftarrow \arg\max_i \left( f_i(z_i + 1) - f_i(z_i) \right)$
            $z_{idx} \leftarrow z_{idx} + 1$
        end for
        return $(z_1, z_2, \ldots, z_n)$
    end function

Lemma 4.1. For a non-decreasing concave function $f : \mathbb{Z} \to \mathbb{R}$, and $h \geq 1$,

$$f(x+h) - f(x) \leq h\,(f(x) - f(x-1)) \tag{4.16}$$

and

$$f(x) - f(x-h) \geq h\,(f(x) - f(x-1)). \tag{4.17}$$

We prove the following.

Theorem 4.2. Algorithm 6 produces the optimal assignment for the Units Assignment Problem.

Proof. Suppose the greedy assignment results in an assignment of $z_i$ to the function $f_i$. Without loss of generality, we assume that the functions are ordered as follows: if $i < j$ then $f_i(z_i) - f_i(z_i - 1) \geq f_j(z_j) - f_j(z_j - 1)$, $\forall i, j$ such that $z_i, z_j \geq 1$.

Assume that the optimal assignment $(z^*_1, z^*_2, \ldots, z^*_n)$ is different from the greedy assignment and produces a greater $F^* = \sum_i f_i(z^*_i)$. Choose the smallest index $p$ such that $f_p(z_p) \neq f_p(z^*_p)$.
Case $f_p(z_p) > f_p(z^*_p)$: Since $\sum_i z_i = \sum_i z^*_i = k$, $\exists\, p < i_1 < i_2 < \cdots < i_M$, for some $M > 0$, such that $z^*_{i_r} > z_{i_r}$ and $\sum_r (z^*_{i_r} - z_{i_r}) \geq z_p - z^*_p$. Therefore, it is possible to pick $h_{i_r} \leq (z^*_{i_r} - z_{i_r})$ such that $\sum_r h_{i_r} = z_p - z^*_p$. Suppose we take $h_{i_r}$ units out of the optimal assignment $z^*_{i_r}$, $\forall r$, and assign them to the function $f_p$; then we should not expect any gain ($\Delta F \leq 0$), as the assignment we started with was optimal. We note that

$$\begin{aligned}
\sum_r \left( f_{i_r}(z^*_{i_r}) - f_{i_r}(z^*_{i_r} - h_{i_r}) \right)
&\leq \sum_r h_{i_r} \left( f_{i_r}(z^*_{i_r} - h_{i_r}) - f_{i_r}(z^*_{i_r} - h_{i_r} - 1) \right) && \text{[Using Lemma 4.1]} \\
&\leq \sum_r h_{i_r} \left( f_{i_r}(z_{i_r}) - f_{i_r}(z_{i_r} - 1) \right) && \text{[Due to concavity]} \\
&\leq \sum_r h_{i_r} \left( f_p(z_p) - f_p(z_p - 1) \right) && \text{[Due to ordering of functions]} \\
&= (z_p - z^*_p) \left( f_p(z_p) - f_p(z_p - 1) \right) && \text{[Since } \textstyle\sum_r h_{i_r} = z_p - z^*_p \text{]} \\
&\leq f_p(z_p) - f_p(z^*_p) && \text{[Using Lemma 4.1]}
\end{aligned}$$

Therefore, the gain obtained in this case is $\Delta F = (f_p(z_p) - f_p(z^*_p)) + \sum_r \left( f_{i_r}(z^*_{i_r} - h_{i_r}) - f_{i_r}(z^*_{i_r}) \right) \geq 0$.

Case $f_p(z_p) < f_p(z^*_p)$: Since $\sum_i z_i = \sum_i z^*_i = k$, $\exists\, p < i_1 < i_2 < \cdots < i_M$, for some $M > 0$, such that $z^*_{i_r} < z_{i_r}$ and $\sum_r (z_{i_r} - z^*_{i_r}) \geq z^*_p - z_p$. Therefore, it is possible to pick $h_{i_r} \leq (z_{i_r} - z^*_{i_r})$ such that $\sum_r h_{i_r} = z^*_p - z_p$. Now, we take $z^*_p - z_p$ resources out of the optimal assignment on $f_p$ and distribute them such that $f_{i_r}$ gets $z^*_{i_r} + h_{i_r}$. Since we have obtained the sequence $z_1, z_2, \ldots, z_n$ using the greedy assignment, $f_i(z_i + 1) - f_i(z_i) \leq f_j(z_j) - f_j(z_j - 1)$, $\forall i \neq j$; otherwise the last unit that went to $f_j$ would have gone to $f_i$ instead.
We have

$$\begin{aligned}
f_p(z^*_p) - f_p(z_p)
&\leq (z^*_p - z_p) \left( f_p(z_p + 1) - f_p(z_p) \right) && \text{[Using Lemma 4.1]} \\
&= \sum_{i_r} h_{i_r} \left( f_p(z_p + 1) - f_p(z_p) \right) && \text{[Since } \textstyle\sum_r h_{i_r} = z^*_p - z_p \text{]} \\
&\leq \sum_{i_r} h_{i_r} \left( f_{i_r}(z_{i_r}) - f_{i_r}(z_{i_r} - 1) \right) && \text{[By construction of Algorithm 6]} \\
&\leq \sum_{i_r} h_{i_r} \left( f_{i_r}(z^*_{i_r} + h_{i_r}) - f_{i_r}(z^*_{i_r} + h_{i_r} - 1) \right) \\
&\leq \sum_{i_r} \left( f_{i_r}(z^*_{i_r} + h_{i_r}) - f_{i_r}(z^*_{i_r}) \right) && \text{[Using Lemma 4.1]} \\
&\leq \sum_{i_r} \left( f_{i_r}(z_{i_r}) - f_{i_r}(z^*_{i_r}) \right) && \text{[} f_{i_r} \text{ is non-decreasing]}
\end{aligned}$$

Therefore, $\Delta F = \sum_{i_r} \left( f_{i_r}(z^*_{i_r} + h_{i_r}) - f_{i_r}(z^*_{i_r}) \right) + (f_p(z_p) - f_p(z^*_p)) \geq 0$.

But $\Delta F \leq 0$, and so $\Delta F$ must be zero, i.e., for any optimal assignment that differs from the greedy assignment first at index $p$, we can perform a reassignment that retains optimality so that they no longer differ at index $p$. Proceeding thus we get $z_i = z^*_i$, $\forall i$. Hence, the greedy assignment is optimal.

Suppose the exact response to intervention for individuals is hard to predict, and instead we have some estimate of the response. In other words, if the exact functions $f_i$ are not known but we have an approximation $g_i$ of $f_i$, the following can be shown.

Theorem 4.3. If concave non-decreasing functions $\{g_i\}$ estimate $\{f_i\}$ such that $(1-\epsilon) f_i(z) \leq g_i(z) \leq (1+\epsilon) f_i(z)$, $\forall i, z$, for some $\epsilon \geq 0$, then Algorithm 6 applied to the functions $\{g_i\}$ produces a $(1-\epsilon)$-approximation for the Units Assignment Problem.

4.4 Experiments

We have shown that the greedy algorithms described in Algorithms 5 and 6 are optimal under the Uncertain Voter Model for deterministic and probabilistic interventions, respectively. However, to study how prominent the difference is from other choices of intervention strategies, we compare them against the following baselines:

• Degree: We define the degree of a node based on the weighted graph as $d_v = \sum_u p_{v,u}$. Then we select the top $k$ nodes.
• Betweenness Centrality: The top $k$ nodes are selected based on the betweenness centrality in the graph.
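Before turning to the experimental setup, Algorithm 6, which these baselines are compared against in the probabilistic setting, can be sketched compactly. The code below is a hedged illustration rather than the study's implementation: the function name `greedy_max` and the example utilities are assumptions, and the max-heap of marginal gains is a choice made for this sketch, whereas Algorithm 6 as stated recomputes the argmax in every round.

```python
import heapq
from typing import Callable, List, Sequence


def greedy_max(fs: Sequence[Callable[[int], float]], k: int) -> List[int]:
    """Assign k units one at a time to the function with the largest marginal gain.

    fs[i](z) is the utility of giving z units to item i; each is assumed to be
    concave and non-decreasing, as required by the Units Assignment Problem.
    """
    z = [0] * len(fs)
    if not fs:
        return z
    # heapq is a min-heap, so gains are stored negated to pop the largest first.
    heap = [(-(f(1) - f(0)), i) for i, f in enumerate(fs)]
    heapq.heapify(heap)
    for _ in range(k):
        _, i = heapq.heappop(heap)
        z[i] += 1
        next_gain = fs[i](z[i] + 1) - fs[i](z[i])
        heapq.heappush(heap, (-next_gain, i))
    return z


if __name__ == "__main__":
    # Hypothetical utilities of the form c * (1 - 0.5**z): a non-negative
    # constant times a geometric success curve.
    utilities = [lambda z: 3.0 * (1 - 0.5 ** z), lambda z: 2.0 * (1 - 0.5 ** z)]
    print(greedy_max(utilities, 3))
```

With the heap, each of the $k$ rounds costs $O(\log n)$ instead of $O(n)$; the assignment produced is the same as that of the plain argmax loop up to tie-breaking.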
We have performed two sets of experiments:

• Synthetic Kronecker graphs: We generated random Kronecker graphs [48] with roughly the same number of nodes and edges as the real Homeless Youth network, described next.
• Real-world Homeless Youth Network: We constructed the network from the survey data; it consists of 369 nodes and 558 directed edges. Since the edge weights are not known, we assume that all incoming links of a node are equally weighted.

4.4.1 Synthetic Networks

FIGURE 4.2: Comparison of the baseline against the greedy algorithm for varying intervention sizes under the Uncertain Voter Model on random Kronecker graphs. [Twelve panels plotting the expected number of violent individuals against intervention size $k$ (0 to 20) for Greedy (VM), Greedy (UVM$_m$), Greedy (UVM$_K$), Degree and Betweenness: (a) deterministic, $\theta = 0.9$; (b) probabilistic, $\theta = 0.9$; (c) deterministic, $\theta = 0.8$; (d) probabilistic, $\theta = 0.8$; (e) deterministic, $\theta = 0.7$; (f) probabilistic, $\theta = 0.7$; (g) deterministic, $\theta = 0.6$; (h) probabilistic, $\theta = 0.6$; (i) deterministic, $\theta = 0.8$; (j) probabilistic, $\theta = 0.8$; (k) deterministic, $\theta = 0.6$; (l) probabilistic, $\theta = 0.6$.]

To simulate the fact that individuals often forget to mention some individuals they might be influenced by [42], for every Kronecker graph $G$ we randomly removed a certain fraction $f$ of edges to form graph $G'$. We applied our greedy algorithm to obtain the optimal set of nodes for intervention under UVM for both the random and the Katz-based variation, assuming $\theta = 1 - f$ on $G'$. We chose $f$ from $[0, 0.5]$. We consider this to be a sensible range for $f$, as $f > 0.5$ (i.e., $\theta < 0.5$) would represent very low confidence in the collected data: it would mean that a node is more likely to be influenced by one of the nodes it is not connected to. The parameter $t$ was assumed to be uniformly distributed between 1 and 5. We also applied the greedy algorithm assuming $\theta = 1$, which would be optimal for the voter model.
Intervention was performed by selecting these sets, but on the original graph $G$. Figure 4.2 shows the number of violent nodes that result from different intervention strategies, while varying the intervention size $k$. Figures 4.2(a), 4.2(c), 4.2(e), 4.2(g), 4.2(i) and 4.2(k) are for the deterministic intervention scenario, where selected nodes become non-violent. Figures 4.2(b), 4.2(d), 4.2(f), 4.2(h), 4.2(j) and 4.2(l) are for the probabilistic intervention scenario, where a selected node $u$ becomes non-violent with probability $1 - r_u^{z_u}$, where $z_u$ is the number of units of intervention applied to $u$, and $r_u \in [0, 1]$ is randomly selected to simulate how well $u$ responds to the intervention. We obtained the plots for many synthetic Kronecker graphs, but only report a few, as they all showed the same trends.

UVM$_m$ and UVM$_K$ represent UVM with random and Katz-based selection of out-of-neighborhood nodes, respectively. The greedy algorithm on Katz-based UVM significantly outperforms the baselines. The greedy algorithm on the Voter Model ($\theta = 1$) and on UVM based on random selection also performs worse, suggesting that taking the uncertainty of edges into account by predicting links produces a better intervention strategy. Most of these graphs were generated with approximately 50% initial violent nodes to match the real-world network. For percentages outside this range the difference between Greedy (UVM) and Greedy (VM) was negligible. For instance, Figures 4.2(k) and 4.2(l) are generated with 100% initial violent nodes.

4.4.2 Homeless Youth Network

For the real Homeless Youth Network, we performed selection and simulated intervention on the same graph, as the network that includes the “forgotten” links is not available. Out of the 366 nodes, 55.01% were “violent” ($x_{u,0} = 1$) and 42.55% were “non-violent” ($x_{u,0} = 0$). Data on the remaining 2.44% are missing, and these nodes are assumed to be equally likely to be in either state ($x_{u,0} = 0.5$).
Based on this “initial state” we run Greedy Minimization for the Uncertain Voter Model. Figure 4.3 shows the comparison for the expected number of nodes that are violent after $t = 5$ and $t = 10$. Figure 4.4 shows the comparison for probabilistic intervention. The value of $\theta$ was set to 1 to generate these plots. Other values of the parameters $t$ and $\theta$ show similar trends and hence have been omitted. We observe that the greedy algorithm significantly outperforms both baselines. The upper limit on the number of time steps ($t$) was chosen to be a small number in our experiments, keeping in mind that homeless youth networks are dynamic, and so in practice the intervention should be performed in the short term.

FIGURE 4.3: Comparison of the baseline against the greedy algorithm for varying intervention sizes under the Uncertain Voter Model for deterministic intervention. [Two panels, (a) $t = 5$ and (b) $t = 10$, plotting the expected number of violent nodes against intervention size $k$ (0 to 20) for Greedy, Degree and Betweenness.]

FIGURE 4.4: Comparison of the baseline against the greedy algorithm for varying intervention sizes under the Uncertain Voter Model for probabilistic intervention. [Two panels, (a) $t = 5$ and (b) $t = 10$, plotting the expected number of violent nodes against intervention size $k$ (0 to 20) for Greedy, Degree and Betweenness.]

Choosing individuals in practice. So far we have presented the comparison of our greedy method against the baseline centrality measures in terms of reduction in violence. Now, we proceed to examine the individuals chosen for intervention by our method. We experimented with different values of the parameter $\theta = 1, 0.9, 0.8, 0.7, 0.6$ and $0.5$, i.e., increasing edge uncertainty. Table 4.1 presents the top 10 nodes (in terms of the PID assigned in the survey) chosen for (deterministic) intervention. We also varied the value of $t = 2, 4, 6, 8, 10$ and $12$ with $\theta = 0.75$ (Table 4.2). Note that there are many nodes, such as PIDs 47, 4, 2086, 2156, and 51, that consistently appear in the top 10, suggesting that the set of chosen individuals is not highly sensitive to the choice of parameters within a sensible range. However, the significant deviation from betweenness and degree centralities (Figure 4.3) suggests that finding this set is non-trivial. These individuals were selected based on deterministic intervention, which should be applied when knowledge of personal traits is not available. However, when personal traits sufficient to model how an individual may respond to intervention ($s_u(z_u)$) are available, probabilistic intervention should be used.

TABLE 4.1: Top 10 seeds for various values of $\theta$ output by Greedy Minimization

θ     Selected Seeds                                        E(I_V^T x'_t)
1     47  4  2156  51    13    2086  169   2115  2099  2056   179.43
0.9   47  4  2156  2086  51    13    169   2115  2056  2099   183.327
0.8   47  4  2086  2156  51    13    169   2115  2056  89     185.86
0.7   47  4  2086  2156  51    2115  13    169   2056  2125   187.54
0.6   47  4  2086  2115  2156  51    169   13    2056  2125   188.66
0.5   47  4  2086  2115  2156  51    169   13    2056  2125   189.43

TABLE 4.2: Top 10 seeds for various values of $t$ output by Greedy Minimization

t     Selected Seeds                                        E(I_V^T x'_t)
2     47  2086  4     2115  51    2156  169   13   2056  2125   189.92
4     47  4     2086  2115  51    2156  169   13   2056  2125   188.66
6     47  4     2086  51    2156  2115  169   13   2056  2125   187.81
8     47  4     2086  51    2156  2115  13    169  2056  2125   187.22
10    47  4     2086  2156  51    13    2115  169  2056  2125   186.79
12    47  4     2086  2156  51    13    2115  169  2056  2125   186.45

4.5 Future Work

We have only taken into account the network structure and the state of violence of the individuals.
A more complex model can be learned that models the dynamics of violence diffusion more accurately by accounting for personal characteristics. Here, we discuss one such extension of our model. In our collected dataset, one feature of particular interest is Difficulty in Emotion Regulation (DERS). Intuitively, an individual with high DERS is likely to have a higher propensity for violence. Suppose $a(u)$ is the probability that a node $u$, given its DERS, would prefer to be violent. Mathematically,

$$\begin{aligned}
x_{u,t} &= \frac{a(u) \sum_v q_\theta(v,u)\, x_{v,t-1}}{a(u) \sum_v q_\theta(v,u)\, x_{v,t-1} + (1 - a(u)) \sum_v q_\theta(v,u)\,(1 - x_{v,t-1})} \\
&= \frac{a(u) \sum_v q_\theta(v,u)\, x_{v,t-1}}{a(u) \sum_v q_\theta(v,u)\, x_{v,t-1} + (1 - a(u)) \left( 1 - \sum_v q_\theta(v,u)\, x_{v,t-1} \right)}.
\end{aligned} \tag{4.18}$$

Note that this model can be represented as

$$x_t = C(Q_\theta x_{t-1}) = C(Q_\theta\, C(\ldots C(Q_\theta x_0))), \tag{4.19}$$

where $C(Q x_t)$ is a vector of functions with the $u$-th element given by $C_u\!\left( \sum_v q_\theta(v,u)\, x_{v,t} \right)$. Each $C_u$ is a non-decreasing concave function, and since a linear combination of concave functions and a composition of concave functions is also concave, the RHS of Equation 4.19 is also concave. Let that function be $C^t(x_0)$. Therefore, the effect of intervention is given by

$$I_V^T \Delta x_t = I_V^T x_0 - I_V^T C^t\!\left( [\,x_{1,0}(1 - s_1(z_1)) \;\ldots\; x_{n,0}(1 - s_n(z_n))\,] \right). \tag{4.20}$$

The utility of intervention can be represented as a function over multisets, $U(T) = I_V^T \Delta x_t$, where $T = \{(u, z_u) \mid z_u \text{ units assigned to node } u\}$. The following can be shown.

Theorem 4.4. $U(T)$ is submodular and non-decreasing.

Due to Theorem 4.4, the greedy algorithm maximizing marginal returns admits a $(1 - 1/e)$-approximation [49].
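To make the proposed extension concrete, one update step of Equation 4.18 can be sketched as follows. This is an illustration under stated assumptions, not a fitted model: the function name `ders_step`, the list-of-lists layout with `Q[u][v]` holding $q_\theta(v,u)$, and the restriction of $a(u)$ to the open interval (0, 1) (which keeps the denominator positive) are choices made for this sketch.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def ders_step(Q: Matrix, a: Vector, x: Vector) -> Vector:
    """One step of Equation 4.18.

    Q[u][v] holds q_theta(v, u), with every row summing to 1.
    a[u] is the DERS-based preference for violence, assumed to lie in (0, 1).
    x[v] is the probability that node v is violent at time t - 1.
    """
    if any(not 0.0 < a_u < 1.0 for a_u in a):
        raise ValueError("this sketch assumes 0 < a(u) < 1 for every node")
    result = []
    for row, a_u in zip(Q, a):
        y = sum(q * x_v for q, x_v in zip(row, x))  # voter-style average
        result.append(a_u * y / (a_u * y + (1.0 - a_u) * (1.0 - y)))
    return result


if __name__ == "__main__":
    # Hypothetical 2-node example: both nodes average the two states equally.
    Q_example = [[0.5, 0.5], [0.5, 0.5]]
    print(ders_step(Q_example, a=[0.5, 0.8], x=[1.0, 0.0]))
```

With $a(u) = 0.5$ the update reduces to the plain Uncertain Voter Model step, which gives a quick sanity check when experimenting with other values.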
Chapter 5
Non-progressive Diffusion – Analytical Approximation for a Unified Model

This chapter provides a generalized, analytical solution to the diffusion mechanism that comprises two processes unfolding over the network simultaneously: (a) pairwise influence, and (b) pressure from collective dynamics, which can be a result of local social pressure, global influence, external forces (i.e., factors exogenous to the network), or a combination of the above. The methodology is vertex-centric, i.e., it models each user separately, offering great flexibility in terms of modeling personalized influence functions, which can be time-dependent. Note that in this work we are not concerned with learning the parameters that drive the spread of infection from observational data. While this aspect is important, it is outside the scope of this chapter. My formula explicitly and formally unites a rich class of popular diffusion processes in social networks [20–23, 50] as special cases.

5.1 Unified Model of Influence

Under the Unified Model (UM), the infection process starts with a seed set $S \subseteq V$ of infected nodes at time $t = 0$ and proceeds in discrete time steps, in which two types of influence unfold over the network. According to the first process, each infected node $v$ attempts to infect its neighbors (individual influence). The probability of infection $p_{v,u}(t)$ is pairwise and may change over time. Note that we assume independence between infection attempts from multiple neighbors. The second source of influence we consider is collective influence. According to this process, each susceptible node $u$ can be infected with probability $r_u(t)$, independent of individual influence. This may include external factors [51–53], external sources of exposure [54], or the status of the incoming neighborhood of $u$ [23]. The function $r_u(t)$ is node-specific and may be time-dependent. Also, there may be an arbitrary number of collective influence attempts on a node, as we assume $r_u(t)$ is not conditioned upon whether the node has already undergone a collective influence attempt. The process repeats until a pre-specified stopping criterion is satisfied (e.g., a number of time steps has elapsed, or the fraction of infected nodes has exceeded some number).

5.1.1 Infection Probability Formula Under the Unified Model

Let $B_{u,t}$ represent the probability of infection of node $u$ by time $t$. The initial values $\{B_{u,0}\}$ are either 0 or 1 depending on the membership of $u$ in the seed set. Let $E_{v,t}$ denote the indicator variable which is 1 if node $v$ is infected by time $t$, and 0 otherwise. To find the probability of a node $u$ being infected at time $t$, we consider an arbitrary ordering of its incoming neighbor set $N_i(u)$: $\langle v_1, v_2, \ldots, v_n \rangle$. Based on this, we define the zero state probability at time $t-1$: $P^0_{s_n, s_{n-1}, \ldots, s_1}$, where the superscript 0 denotes $E_{u,t-1} = 0$. The subscript is a vector whose elements $s_i$ denote the value of $E_{v_i,t-1}$ and can take values in $\{0, 1, *\}$. $s_i = 0$ represents $E_{v_i,t-1} = 0$, $s_i = 1$ denotes $E_{v_i,t-1} = 1$, and $s_i = *$ indicates marginalization over the state of $v_i$, i.e., ‘$E_{v_i,t-1} = 0$ or 1’. For instance, for a node $u$ with four neighbors, $P^0_{0,1,*,1}$ denotes the probability $P(E_{u,t-1} = 0, E_{v_4,t-1} = 0, E_{v_3,t-1} = 1, E_{v_1,t-1} = 1)$.

We begin by calculating $B_{u,t}$ in the special case of $G$ being a tree, i.e., each node has at most one incoming neighbor.

Corollary 5.1. The infection probability of node $u$ with parent $v$ in a tree is given by:

$$B_{u,t} = 1 - (1 - r_u(t)) \left[ (1 - p_{v,u}(t))(1 - B_{u,t-1}) + p_{v,u}(t)\,(1 - B_{v,t-1}) \prod_{k=1}^{t-1} (1 - r_u(k)) \right]. \tag{5.1}$$

Proof. The probability of node $u$ not being infected by time $t$ is $P(E_{u,t} = 0) = 1 - B_{u,t}$. One of two things must have happened for $u$ not to be infected by time $t$. First, state $E_{u,t} = 0$ was reached from state $(E_{v,t-1} = 0, E_{u,t-1} = 0)$ if and only if collective influence $r_u(t)$ failed to infect $u$ at time $t$.
Intuitively, when the parent of $u$ was not infected at time $t-1$, the only chance for $u$ to be infected at time $t$ is through collective influence $r_u(t)$, which fails with probability $1 - r_u(t)$. Second, state $E_{u,t} = 0$ was reached from state $(E_{v,t-1} = 1, E_{u,t-1} = 0)$, i.e., when the parent of $u$ was infected at time $t-1$, if and only if collective influence was unsuccessful and, furthermore, $v$ failed to infect $u$. As the two processes are independent, this happens with probability $(1 - r_u(t))(1 - p_{v,u}(t))$. It follows that

$$\begin{aligned}
1 - B_{u,t} &= P^0_1 (1 - r_u(t))(1 - p_{v,u}(t)) + P^0_0 (1 - r_u(t)) \\
&= (1 - r_u(t)) \left( P^0_1 (1 - p_{v,u}(t)) + P^0_0 \right) \\
&= (1 - r_u(t)) \left( (P^0_* - P^0_0)(1 - p_{v,u}(t)) + P^0_0 \right) \\
&= (1 - r_u(t)) \left( P^0_* (1 - p_{v,u}(t)) + P^0_0 \, p_{v,u}(t) \right),
\end{aligned} \tag{5.2}$$

where $P^0_* = P(E_{u,t-1} = 0) = 1 - B_{u,t-1}$ and $P^0_0 = P(E_{u,t-1} = 0, E_{v,t-1} = 0)$. The latter is the probability that $v$ and $u$ were both susceptible at time $t-1$. If $v$ was also not infected by time $t-1$, $u$ can only be susceptible because all collective influence up to that time failed, i.e., $P^0_0 = (1 - B_{v,t-1}) \prod_{k=0}^{t-1} (1 - r_u(k))$. We set $r_u(0) = 1$ if $u \in S$, and 0 otherwise. Substituting the values of $P^0_*$ and $P^0_0$ in Equation 5.2 results in Equation 5.1.

Next, we extend Equation 5.1 to a graph of any type. Without loss of generality, we focus on directed graphs, as undirected graphs can be converted into their directed equivalent.

Lemma 5.2. The probability of a node $u$ not being infected by time $t$ is related to the zero state probabilities as follows:

$$1 - B_{u,t} = (1 - r_u(t)) \sum_{s_i \in \{0,*\}} P^0_{s_n, s_{n-1}, \ldots, s_1} \left( \prod_{i=1}^{n} (1 - p_{v_i,u}(t))^{\delta_{s_i,*}} \prod_{i=1}^{n} p_{v_i,u}(t)^{\delta_{s_i,0}} \right), \tag{5.3}$$

where $\delta_{a,b}$, which is 1 if $a = b$ and 0 otherwise, is the Kronecker delta function.

Proof. When the number of incoming neighbors is one, Lemma 5.2 follows from Equation 5.2. Now, suppose the statement is true for $k \ge 1$ parents. Consider a sequence $x_k = \langle s_k, s_{k-1}, \ldots, s_1 \rangle$. We look at the new terms that are added due to the inclusion of $v_{k+1}$.
For ease of notation, let $D(x_k) = (1 - r_u(t)) \prod_{i=1}^{k} (1 - p_{v_i,u}(t))^{\delta_{s_i,*}} \prod_{i=1}^{k} p_{v_i,u}(t)^{\delta_{s_i,0}}$. Equation 5.3 can be rewritten as

$$1 - B_{u,t} = \sum_{x_n} P^0_{x_n} D(x_n). \tag{5.4}$$

We have assumed that this is true for $n = k$. The addition of $v_{k+1}$ affects $P(E_{u,t})$ in ways similar to those discussed in Corollary 5.1, i.e., if $E_{v_{k+1},t-1} = 1$, then this new node fails to infect $u$ with probability $(1 - p_{v_{k+1},u}(t))$. On the other hand, if $E_{v_{k+1},t-1} = 0$, node $v_{k+1}$ does not have the ability to infect. Formally, the new terms added are:

$$\begin{aligned}
& P^0_{1,x_k} (1 - p_{v_{k+1},u}(t)) D(x_k) + P^0_{0,x_k} D(x_k) \\
&= (P^0_{*,x_k} - P^0_{0,x_k})(1 - p_{v_{k+1},u}(t)) D(x_k) + P^0_{0,x_k} D(x_k) \\
&= P^0_{*,x_k} (1 - p_{v_{k+1},u}(t)) D(x_k) + P^0_{0,x_k} \, p_{v_{k+1},u}(t) D(x_k) \\
&= P^0_{*,x_k} D(*, x_k) + P^0_{0,x_k} D(0, x_k),
\end{aligned}$$

which generates the required terms in the right-hand side of Equation 5.4 when $n = k + 1$. This indicates that the statement is true for $k + 1$ incoming neighbors. By induction, Lemma 5.2 is true for all $n$.

Theorem 5.3. An approximate probability of infection is given by the recurrence relation:

$$B_{u,t} = 1 - \left[ (1 - B_{u,t-1}) \prod_{v \in N_i(u)} (1 - p_{v,u}(t) B_{v,t-1}) + \prod_{v \in N_i(u)} p_{v,u}(t)(1 - B_{v,t-1}) \left( \prod_{k=1}^{t-1} (1 - r_u(k)) - 1 + B_{u,t-1} \right) \right] (1 - r_u(t)). \tag{5.5}$$

The approximation comes from assuming that the states of infection of incoming neighbors of a given node $u$ are independent, i.e., for two incoming neighbors $v_i$ and $v_j$, events $E_{v_i,t-1} = 0$ and $E_{v_j,t-1} = 0$ are independent. Next, we proceed with proving the Theorem.

Proof. We attempt to find the zero state probabilities for sequence $x_n$. When $x_n = \langle 0, 0, \ldots, 0 \rangle$, $u$ and all nodes in $N_i(u)$ are susceptible, which means that collective influence up to time $t-1$ was unsuccessful. Further, at $k = 0$, $r_u(k) = 1$ for $u \in S$.
In this case,

$$P^0_{0,0,\ldots,0} = \prod_{k=0}^{t-1} (1 - r_u(k)) \prod_{v \in N_i(u)} (1 - B_{v,t-1}). \tag{5.6}$$

Any other sequence $x_n$ contains at least one $*$, say in the $i$-th position, and represents the state in which $u$ is not infected regardless of the state of its $i$-th neighbor. Given the state of $u$'s neighbors, the conditional probability of $u$ not being infected is $1 - B_{u,t-1}$. The zero state probability is then computed as follows:

$$P^0_{x_n} = (1 - B_{u,t-1}) \prod_{s_i = 0} (1 - B_{v_i,t-1}). \tag{5.7}$$

Combining Equations 5.4, 5.6 and 5.7 results in the following:

$$\frac{1 - B_{u,t}}{1 - r_u(t)} = (1 - B_{u,t-1}) \left( \prod_{v_j \in N_i(u)} (1 - p_{v_j,u}(t)) \right) \sum_{x_n} \prod_{s_j = 0} \frac{(1 - B_{v_j,t-1}) \, p_{v_j,u}(t)}{1 - p_{v_j,u}(t)} + \left( B_{u,t-1} - 1 + \prod_{k} (1 - r_u(k)) \right) \left( \prod_{v_j} (1 - B_{v_j,t-1}) \, p_{v_j,u}(t) \right).$$

After simplification, the above equation reduces to Equation 5.5. This step completes the proof.

5.1.2 Complexity Analysis

The recurrence relation in Equation 5.5 requires inspection of all incoming links to a node $u$, $|N_i(u)|$, at every time step. Therefore, in order to evaluate Equation 5.5 for all nodes for $t$ time steps, the number of operations required is $t \sum_u O(|N_i(u)|) = O(|E| t)$. In contrast, a discrete-time influence model that relies on Monte-Carlo simulations to obtain the expected infection probability at time $t$ would require $O(R |E| t)$ operations, where $R$ is the number of simulations.

5.2 Reduction to Other Models

Our analytical formula of influence in social networks offers great flexibility in modeling a variety of diffusion processes. Specifically, popular diffusion models can be reduced to special cases of the Unified Model by carefully defining the individual influence probabilities and collective influence functions. We next describe a few such reductions.

5.2.1 Complex Contagion Model

According to the Complex Contagion Model [23], infection can be achieved at time $t$ in two ways.
First, each node that was infected at time $t-1$ attempts to infect each of its outgoing neighbors with probability $p$. Once a node is infected, it cannot be infected again. Once all infected nodes are examined, healthy nodes have a chance of random infection based on the popularity of the contagion at time $t-1$. Particularly, for $n_i^{t-1}$ infected nodes by time $t-1$, the probability of random infection at time $t$ is given by an exponential growth law: $r(t) = \exp(\alpha n_i^{t-1} - \beta)$, where $\alpha$ and $\beta$ are constants [23].

Proposition 5.4. The Complex Contagion Model [23] can be treated as a special case of the Unified Model (Section 5.1), and hence it can be approximated by Equation 5.5, when pairwise individual influence is constant and time independent, and collective influence is equivalent to random infection.

Reduction. We begin with Equation 5.5. We model individual influence as

$$p_{v,u}(t) = p, \quad \forall v, u, t. \tag{5.8}$$

Substituting collective influence with the random infection factor results in

$$r_u(t) = r(t) = \exp\Big(\alpha \sum_u B_{u,t-1} - \beta\Big), \tag{5.9}$$

since the number of infections by time $t-1$, $n_i^{t-1}$, is computed as follows:

$$n_i^{t-1} = \mathbb{E}\Big(\sum_u E_{u,t-1}\Big) = \sum_u \mathbb{E}(E_{u,t-1}) = \sum_u B_{u,t-1}. \tag{5.10}$$

Equations 5.8 and 5.9 form the reduction of the Unified Model to the Complex Contagion Model.

5.2.2 Independent Cascade Model

In the Independent Cascade Model [20], a seed set of infected nodes is provided. At each time step $t$, each node is either infected or susceptible, and every node $v$ that was infected at time $t-1$ has a single chance to infect each of its neighbors $u$. The infection succeeds with probability $p_{v,u} = p$.

Proposition 5.5. The Independent Cascade Model can be treated as a special case of the Unified Model (Section 5.1), when collective influence is a function of the state of infection of nodes in the local neighborhood.

Reduction. We begin with Equation 5.5.
At any time $t$, a susceptible node $u$ has a single chance to be infected by its neighbors that were infected at $t-1$. If at least one of them succeeds, $u$ gets infected. The probability of node $u$ getting infected is then given by

$$r_u(t) = P(\text{at least one infected neighbor succeeds}) = 1 - P(\text{no infected neighbor succeeds}) = 1 - \prod_{v \in N_i(u)} (1 - p A_{v,t-1}), \tag{5.11}$$

where $A_{v,t-1}$ is the probability of $v$ becoming infected at time $t-1$. Since $B_{v,t} = \sum_{\tau=0}^{t} A_{v,\tau}$, it follows that:

$$A_{v,t-1} = \begin{cases} B_{v,t-1} & \text{if } t = 1 \\ B_{v,t-1} - B_{v,t-2} & \text{if } t > 1 \end{cases}$$

This step concludes the reduction.

5.2.3 Threshold Models

In threshold models, the probability of infection of a node depends on the popularity of the contagion in its incoming neighborhood. Several threshold models exist in the literature, including the Linear Threshold Model [25] and the Linear Friendship Model [22, 31]. The Generalized Threshold Model [20] dictates that a node $u$ is infected based on a threshold $\theta_u \in [0,1]$ and a monotone function of the set of its infected neighbors $f(In(u,t)) \in [0,1]$. Particularly, $u$ is infected at time $t$ if $f(In(u,t)) \ge \theta_u$. Note that the threshold $\theta_u$ can be randomly selected at each time $t$ [32], leading to non-determinism of the infection process. Since these thresholds are selected uniformly at random, this is equivalent to saying that the probability of infection of a healthy node $u$ at time $t$ is $f(In(u,t))$.¹

Proposition 5.6. The Generalized Threshold Model can be treated as a special case of the Unified Model (Section 5.1), when pairwise individual influence is zero, and collective influence is a function of weighted influence from the local neighborhood of nodes.

Reduction. We begin with Equation 5.5.
At any time $t$, the probability of node $u$ getting infected is given by a function of the status of $u$'s incoming neighborhood as follows:

$$P(u \text{ infected at time } t) = f(In(u,t); \vec{b}_u), \tag{5.12}$$

where $In(u,t) = \{v \mid v \in N_i(u), E_{v,t-1} = 1\}$ and $\vec{b}_u = \{b_{v,u} \mid v \in N_i(u)\}$ is a vector of pairwise weights $b_{v,u}$ associated with $u$'s incoming neighbors. Substituting $r_u(t)$ in Equation 5.5 with Equation 5.12, and setting all individual influence probabilities to zero, concludes the reduction.

Linear Friendship Model: The Linear Friendship Model (LFM) [22, 31] models the additive effect on the probability of infection at time $t$ as a linear function of the neighbors infected by time $t-1$, and applies a logistic function to map the linear function to a probability value. The Linear Friendship Model [22, 31] can be treated as a special case of the Unified Model (Section 5.1). The reduction follows reasoning similar to that for the Generalized Threshold Model. The difference lies in the function used to model the effect of the local neighborhood on a node's probability of infection in Equation 5.12. Here, the probability of node $u$ getting infected at time $t$ is given by

$$P(u \text{ infected at time } t) = \frac{\exp(\alpha |In(u,t)| + \beta)}{1 + \exp(\alpha |In(u,t)| + \beta)}, \tag{5.13}$$

where $|In(u,t)| = \sum_{v \in N_i(u)} B_{v,t-1}$, i.e., the number of infections by time $t-1$ is calculated similarly to the Complex Contagion Model [23] using Equation 5.10, with the difference that the population is restricted to the local neighborhood of node $u$. Including pairwise weights $b_{v,u}$ in the formulation results in $|In(u,t; \vec{b}_u)| = \sum_{v \in N_i(u)} b_{v,u} B_{v,t-1}$, which concludes the reduction.

¹ Note that this is different from the linear threshold model in [20] in that the threshold may change at every time step.
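The recurrence of Equation 5.5, together with the Complex Contagion reduction of Equations 5.8 and 5.9, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names `unified_step` and `run_ccm` are hypothetical, the pairwise probability is a constant `p`, the product over collective-influence failures is taken from $k = 0$ with $r_u(0) = 1$ for seeds (the convention used in the proof of Corollary 5.1), and $r(t)$ is clipped to 1 so that it remains a probability.

```python
import math


def unified_step(B_prev, surv, in_nbrs, p, r_t):
    """Apply Equation 5.5 once and return B_{u,t} for every node.

    B_prev[u]  : B_{u,t-1}
    surv[u]    : prod_{k=0}^{t-1} (1 - r_u(k)), with r_u(0) = 1 for seeds (assumption)
    in_nbrs[u] : incoming neighbors N_i(u); missing keys mean no in-neighbors
    p          : constant pairwise influence probability (Equation 5.8)
    r_t[u]     : collective influence r_u(t)
    """
    B_next = {}
    for u in B_prev:
        no_pairwise = 1.0   # prod_v (1 - p * B_{v,t-1})
        all_healthy = 1.0   # prod_v p * (1 - B_{v,t-1})
        for v in in_nbrs.get(u, ()):
            no_pairwise *= 1.0 - p * B_prev[v]
            all_healthy *= p * (1.0 - B_prev[v])
        bracket = ((1.0 - B_prev[u]) * no_pairwise
                   + all_healthy * (surv[u] - 1.0 + B_prev[u]))
        B_next[u] = 1.0 - bracket * (1.0 - r_t[u])
    return B_next


def run_ccm(nodes, in_nbrs, seeds, p, alpha, beta, steps):
    """Iterate Equation 5.5 with r(t) = exp(alpha * sum_u B_{u,t-1} - beta) (Equation 5.9)."""
    B = {u: 1.0 if u in seeds else 0.0 for u in nodes}
    surv = {u: 0.0 if u in seeds else 1.0 for u in nodes}   # 1 - r_u(0)
    history = [B]
    for _ in range(steps):
        # Clipping to 1 is an assumption of this sketch, not part of Equation 5.9.
        r = min(1.0, math.exp(alpha * sum(B.values()) - beta))
        r_t = {u: r for u in nodes}
        B = unified_step(B, surv, in_nbrs, p, r_t)
        surv = {u: surv[u] * (1.0 - r) for u in nodes}
        history.append(B)
    return history


if __name__ == "__main__":
    # Toy directed chain 0 -> 1 -> 2 plus an isolated node 3, seeded at node 0.
    hist = run_ccm([0, 1, 2, 3], {1: [0], 2: [1]}, {0}, p=0.1, alpha=0.002, beta=6.0, steps=5)
    for t, B in enumerate(hist):
        print(t, round(sum(B.values()), 4))
```

Each call to `unified_step` touches every incoming link once, which is the $O(|E|)$ cost per time step discussed in Section 5.1.2. Other reductions only change how `r_t` is computed and whether `p` is set to zero.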
5.3 Experiments

With our model well-defined, we now apply it to a real-life dataset from the popular social news aggregator Digg², and to a series of synthetic datasets. First, to better illustrate the ability of our Unified Model (Theorem 5.3 in Section 5.1) to capture real-life behavior, we examine a specific real-world case study where we estimate information diffusion in a dynamic social network. We compare the results of our analytical framework with those produced by several popular diffusion models (CCM, LT, LFM, and ICM in Section 5.2). Particularly, we verify that the expected epidemics calculated using Theorem 5.3 closely match the average outcome of multiple simulation runs of these models. Subsequently, we run a series of large-scale experiments on synthetic data to show that the approximation error is small and insensitive both to graph properties and to the models' parameters. Note that in this work we are not concerned with learning the spread of infection from observed data. While this aspect is important, it is outside the scope of this work. Our findings imply that Equation 5.5 is able to accurately predict the expected epidemics forecasted by the rest of the models without extensive numerical simulations.

TABLE 5.1: Parameters used in the experimental validation on Digg follower graph

Parameter set 1
  CCM: $p = 0.1$, $r(t) = \exp(0.002\, n_i^{t-1} - 6)$
  GLT: $f(In(u,t); \vec{b}_u) = \sum_{v \in In(u,t)} b_{v,u}$
  ICM: $p = 0.1$
Parameter set 2
  CCM: $p = 0.01$, $r(t) = \exp(0.002\, n_i^{t-1} - 6)$
  GLT: $f(In(u,t); \vec{b}_u) = \exp(|In(u,t; \vec{b}_u)|) / (1 + \exp(|In(u,t; \vec{b}_u)|))$
  ICM: $p = 0.7$

5.3.1 Experiments Using Real-World Data

We used a subset of Digg's³ follower graph [55]. Digg is a popular social news aggregator that allows users to collectively curate a list of news stories they find online by submitting them to Digg and voting for them.
In addition, Digg allows users to form social networks by designating as friends users whose activities they would like to track. Our dataset consists of 1,244 nodes and 28,343 directed links. A link from $v$ to $u$ exists if $v$ influences $u$, i.e., when $u$ follows $v$. Table 5.1 summarizes the set of parameters used in our experiments. For each model, we start with a seed set of two infected nodes. Figure 5.1 shows the spread of infection over time, using the average of 1000 simulations for each model, and the corresponding predicted values obtained by using the analytical solution from Theorem 5.3. The prediction closely matches the average of the simulations, providing an empirical, quantitative confirmation that Equation 5.5 produces a good fit to the expected outcome, which is otherwise obtained by computationally expensive simulation runs.

FIGURE 5.1: Agreement of simulation and theory for the three models for Digg1k dataset. Panels: (a) Complex Contagion Model, (b) Threshold Models (LFM and LT), (c) Independent Cascade Model. Each panel plots the number of infections against time for theory and simulation.

² http://digg.com/
³ The dataset can be found online at http://www-scf.usc.edu/~ajiteshs/datasets/digg_ASONAM2014.txt (last accessed on Oct 19, 2015)

5.3.2 Experiments on Erdős-Rényi Random Graphs

For an extensive analysis of the approximation error, we run a series of experiments on simulated data.
To study the effect of graph size on the approximation quality, we generated random sparse directed graphs of sizes 20, 40, 80, 160, 320, 640, 1,280, 2,560, 5,120 and 10,240 following the Erdős-Rényi model [16], with the number of edges approximately five times the number of nodes. For each size and a given set of parameters we generated a random graph, the nodes of which were uniformly partitioned into five subsets of roughly equal cardinality. We started by infecting all nodes in one of these subsets, and ran 1000 simulations. Thus, we have five initial conditions for each size and set of parameters for a given model, and for each of the initial conditions we ran 1000 simulations. To examine the effect of graph density⁴ on the approximation error, we fixed the size of the graph and generated random directed graphs with varying density values of 0.002, 0.004, 0.008, ..., 0.512. We then repeated our experiments on these fixed-size graphs of varying density. The parameters used for each model are summarized in Table 5.2.

We report approximation error using two measures: (a) root mean squared error (RMSE) at time $t$, and (b) fractional error in prediction of the total number of infections at time $t$.

⁴ Density refers to the ratio of the number of links present in the graph to the total number of possible links.

We measure the
error in approximating the probability of infection at time $t$ in terms of RMSE as follows:

$$e^{rms}_t = \sqrt{\frac{\sum_u (\hat{B}_{u,t} - B_{u,t})^2}{n}}, \tag{5.14}$$

where $B_{u,t}$ is the probability of infection of node $u$ by time $t$ obtained by simulations, and $\hat{B}_{u,t}$ is the value predicted by Theorem 5.3. We further report the fractional error in prediction of the total number of infections at time $t$:

$$e^{f}_t = \frac{|\hat{s}_t - s_t|}{s_t}, \tag{5.15}$$

where $s_t = \sum_u B_{u,t}$ is obtained by simulations and $\hat{s}_t = \sum_u \hat{B}_{u,t}$ is the predicted value obtained by Theorem 5.3.

TABLE 5.2: Parameters used in the experimental validation on synthetic graphs

Methods  Parameter Sets
  CCM: $\{p = 0.05, r_t = \exp(0.002\, n_i^{t-1} - 6)\}$, $\{p = 0.20, r_t = \exp(0.002\, n_i^{t-1} - 6)\}$, $\{p = 0.80, r_t = \exp(0.002\, n_i^{t-1} - 6)\}$, $\{p = 0.05, r_t = \exp(0.0002\, n_i^{t-1} - 6)\}$, $\{p = 0.20, r_t = \exp(0.0002\, n_i^{t-1} - 6)\}$, $\{p = 0.80, r_t = \exp(0.0002\, n_i^{t-1} - 6)\}$
  ICM: $p = 0.025$, $p = 0.050$, $p = 0.100$, $p = 0.200$, $p = 0.400$, $p = 0.800$
  LFM: $\{\alpha = 0.05, \beta = -2\}$, $\{\alpha = 0.20, \beta = -2\}$, $\{\alpha = 0.80, \beta = -2\}$

Figures 5.2(a), 5.2(b) and 5.2(c) show how RMSE, averaged over graph sizes, varies with time. For ICM, the error stabilizes quickly, while for CCM and LFM the error decreases with time. The decrease is more prominent for LFM, where the error is insensitive to the parameters. Figures 5.3(a), 5.3(b) and 5.3(c) report average RMSE over time as a function of graph size. In all three cases, the error is very small.

Figures 5.4 and 5.5 show the variation of fractional error $e^f_t$ with time and graph size, respectively. We note that the trend is similar to that observed for RMSE. In fact, it can be shown that the
fractional error $e^f_t$ is bounded by RMSE $e^{rms}_t$ according to the following formula:

$$\frac{\sqrt{|V|}}{s_t} e^{rms}_t \le e^f_t \le \frac{|V|}{s_t} e^{rms}_t. \tag{5.16}$$

FIGURE 5.2: RMSE as a function of time steps on synthetic graphs. RMSE is averaged over graph sizes. Panels: (a) CCM, (b) ICM, (c) LFM, one curve per parameter set of Table 5.2. Each panel plots RMS error (log scale) against time.

Figure 5.6 shows how RMSE varies with density. For CCM the error decreases with increasing density, whereas for ICM the error increases up to some density and then falls rapidly. No clear trend is prominent in LFM; nonetheless the error is contained in a very small window (0.0291 to 0.0298). RMSE curves with respect to time for different densities are shown in Figure 5.7. For brevity, we report results only for one parameter set for each model in this case, as we found that other parameter sets produce similar trends. In all cases, the error decreases with time. The decrease becomes more apparent in high density graphs for CCM. Low density graphs require higher values of $t$ (not shown in the figure) to reveal a similar pattern of decreasing error. For ICM, the error decreases initially, but then stabilizes around a small constant value. Finally, RMSE rises initially in the case of LFM, but then falls exponentially, and is much less sensitive to the density.

Overall, these experiments demonstrate the robustness of our model. To summarize, we find that the error remains small for different graph sizes and densities.
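As an illustration of how the two error measures of Equations 5.14 and 5.15 are evaluated, the following Python sketch computes them from per-node infection probabilities. The function names and the toy values are hypothetical; the simulated values would in practice be averages over many simulation runs, and the predicted values would come from Equation 5.5.

```python
import math


def rmse(B_sim, B_pred):
    """Root mean squared error over nodes (Equation 5.14)."""
    n = len(B_sim)
    return math.sqrt(sum((B_pred[u] - B_sim[u]) ** 2 for u in B_sim) / n)


def fractional_error(B_sim, B_pred):
    """Fractional error in the total number of infections (Equation 5.15)."""
    s_sim = sum(B_sim.values())
    s_pred = sum(B_pred.values())
    return abs(s_pred - s_sim) / s_sim


if __name__ == "__main__":
    # Toy values for three nodes (made up for illustration only).
    B_sim = {0: 1.0, 1: 0.5, 2: 0.25}
    B_pred = {0: 1.0, 1: 0.6, 2: 0.35}
    e_rms = rmse(B_sim, B_pred)
    e_f = fractional_error(B_sim, B_pred)
    # Upper bound of Equation 5.16: e_f <= |V| / s_t * e_rms.
    print(e_rms, e_f, len(B_sim) / sum(B_sim.values()) * e_rms)
```

The upper bound in Equation 5.16 follows from the Cauchy-Schwarz inequality applied to the per-node differences, which is why the two measures show similar trends.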
The error is also unaffected by the various models' parameter values. This fact empirically verifies our claim that under the Unified Model of Influence, Equation 5.5 is a good approximation to various models with minimal computational requirements.

FIGURE 5.3: RMSE averaged over time, as a function of graph size on synthetic graphs. Panels: (a) CCM, (b) ICM, (c) LFM, one curve per parameter set of Table 5.2. Each panel plots RMS error (log scale) against size of graph (log scale).

FIGURE 5.4: Fractional error over time on synthetic graphs. Fractional error is averaged over graph sizes. Panels: (a) CCM, (b) ICM, (c) LFM, one curve per parameter set of Table 5.2. Each panel plots fractional error (log scale) against time.
FIGURE 5.5: Fractional error averaged over time, as a function of graph size on synthetic graphs. Panels: (a) CCM, (b) ICM, (c) LFM, one curve per parameter set of Table 5.2. Each panel plots fractional error (log scale) against size of graph (log scale).

FIGURE 5.6: RMSE on synthetic graphs for varying density, averaged over time. Panels: (a) CCM, (b) ICM, (c) LFM. Each panel plots RMS error (log scale) against density (log scale).

FIGURE 5.7: RMSE over time on the synthetic graphs of size 1,000, for varying density values (0.001 to 0.512). Panels: (a) CCM with ($p = 0.05$, $r_t = \exp(0.002\, n_i^{t-1} - 6)$), (b) ICM with $p = 0.05$, (c) LFM with $\alpha = 0.05$, $\beta = -2$. Each panel plots RMS error (log scale) against time.

Chapter 6
Seed Set Selection for Influence Maximization

One of the fundamental problems in the diffusion literature is the influence maximization problem [20].
Let $G(V,E,W)$ be a weighted graph with vertices in $V$ modeling the individuals and edges in $E$ modeling relationships between them, with weights $W$ representing the influence of one individual over another. The influence maximization problem is defined as follows: Given a directed graph $G(V,E,W)$, a propagation model $M$ and a positive integer $k \le |V|$, find a seed-set $S \subseteq V$, $|S| = k$, to initiate the influence propagation, such that the expected number of influenced nodes at steady state is maximized. We claim that it is more advantageous to know how infection of a node affects the number of infections in the immediate future or in a given time frame rather than in infinite time. Therefore, we state a generalization of the problem:

Problem Definition 1 (Generalized Influence Maximization). Given graph $G(V,E,W)$, infection model $M$, time $t$ and positive integer $k$, select $S \subseteq V \times \mathbb{N}$ with $|S| = k$ such that $\sigma^M_t(S)$ is maximum.

Here, $\sigma^M_t(S)$ denotes the expected number of infections achieved by the model $M$ at time $t$ with seed set $S$, i.e., for all $(u,\tau) \in S$, node $u$ is selected to initiate the infection propagation at time $\tau$.

Algorithm 7 Online Seed-set Selection using Unified Model
1: function OSSUM($G$, $M$, $k$)
2:     $S \leftarrow \emptyset$
3:     for $t = 1 \to k$ do
4:         $l \leftarrow \arg\max_l \; \sigma^M_t(S \cup (l, t-1)) - \sigma^M_t(S)$    ▷ Computed using Equation 5.5
5:         $S \leftarrow S \cup \{(l, t-1)\}$
6:     end for
7:     return $S$
8: end function

6.1 Online Seed-set Selection using Unified Model

We propose Online Seed-set Selection using Unified Model (OSSUM), a greedy method based on the formula for the Unified Model (Equation 5.5). At each time step $t \le k$ we select the node whose infection would produce the maximum increase in total infection at the next time step. Algorithm 7 details the selection process.
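The greedy loop of Algorithm 7 can be sketched as follows. This is a brute-force illustration under stated assumptions, not the optimized procedure analyzed in this section: it assumes the ICM reduction of Section 5.2.2, where pairwise influence is folded into the collective influence of Equation 5.11, so that Equation 5.5 with zero pairwise influence reads $B_{u,t} = 1 - (1 - r_u(t))(1 - B_{u,t-1})$. The helper names (`icm_step`, `infect`, `ossum`) are hypothetical, and treating a manual infection of $l$ as moving all of $l$'s remaining susceptible probability into $A_{l,t-1}$ is an assumption of the sketch.

```python
def icm_step(B_prev, A_prev, in_nbrs, p):
    """Advance the ICM reduction by one time step.

    r_u(t)  = 1 - prod_{v in N_i(u)} (1 - p * A_{v,t-1})    (Equation 5.11)
    B_{u,t} = 1 - (1 - r_u(t)) * (1 - B_{u,t-1})            (Eq. 5.5, zero pairwise influence)
    Returns (B, A), where A[u] is the probability that u became infected exactly at t.
    """
    B, A = {}, {}
    for u in B_prev:
        fail = 1.0  # probability that no newly infected in-neighbor succeeds
        for v in in_nbrs.get(u, ()):
            fail *= 1.0 - p * A_prev[v]
        B[u] = 1.0 - fail * (1.0 - B_prev[u])
        A[u] = B[u] - B_prev[u]
    return B, A


def infect(B, A, l):
    """Return copies of (B, A) with node l manually infected at the current time."""
    B_l, A_l = dict(B), dict(A)
    A_l[l] = A[l] + (1.0 - B[l])  # assumption: remaining susceptible mass becomes "new"
    B_l[l] = 1.0
    return B_l, A_l


def ossum(nodes, in_nbrs, p, k):
    """Greedy seed selection: one node per time step, chosen by next-step gain."""
    B = {u: 0.0 for u in nodes}  # B_{u,t-1}
    A = {u: 0.0 for u in nodes}  # A_{u,t-1}
    seeds = []
    for _ in range(k):
        base = sum(icm_step(B, A, in_nbrs, p)[0].values())   # sigma_t(S)
        best, best_gain = None, float("-inf")
        for l in nodes:
            if l in seeds:
                continue
            B_l, A_l = infect(B, A, l)
            gain = sum(icm_step(B_l, A_l, in_nbrs, p)[0].values()) - base
            if gain > best_gain:
                best, best_gain = l, gain
        seeds.append(best)
        B, A = infect(B, A, best)          # infect the chosen node at time t-1
        B, A = icm_step(B, A, in_nbrs, p)  # move on to time t
    return seeds


if __name__ == "__main__":
    # Toy graph: node 0 points to 1..4, node 5 points to 6.
    in_nbrs = {1: [0], 2: [0], 3: [0], 4: [0], 6: [5]}
    print(ossum(list(range(7)), in_nbrs, p=0.5, k=2))
```

Re-evaluating the full recurrence for every candidate costs $O(|V|(|V| + |E|))$ per selected seed, so this sketch is only meant to make the loop concrete; the closed-form objective derived in this section avoids the repeated evaluation.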
As shown in Section 5.2, for ICM and GLT the infection process can be captured by the collective influence alone, and Equation 5.5 becomes

$$B_{u,t} = 1 - (1 - r_u(t))(1 - B_{u,t-1}). \tag{6.1}$$

Let $r^l_u(t)$ denote the collective influence on node $u$ if node $l$ was manually infected at time $t-1$, and let $B^l_{u,t}$ denote the resultant infection probability of node $u$ after the manual infection of node $l$. Then,

$$B^l_{u,t} = \begin{cases} 1 - (1 - r^l_u(t))(1 - B_{u,t-1}) & \text{if } u \ne l \\ 1 & \text{if } u = l \end{cases} \tag{6.2}$$

Now, note that for both ICM and GLT, the objective function (line 4 in Algorithm 7) is

$$F(l, t-1) = \sigma^M_t(S \cup (l, t-1)) - \sigma^M_t(S) = \sum_u B^l_{u,t} - \sum_u B_{u,t} = (1 - B_{l,t-1}) + \sum_{u \ne l} (r^l_u(t) - r_u(t))(1 - B_{u,t-1}), \tag{6.3}$$

where the values of $B_{u,t}$ and $B^l_{u,t}$ are substituted from Equations 6.1 and 6.2. Using Equations 5.11 and 6.3, and after some algebraic manipulation, the objective function becomes

$$F(l, t-1) = (1 - B_{l,t-1}) \left( 1 + \sum_{u \in N_o(l)} \frac{p (1 - r_u(t))(1 - B_{u,t-1})}{1 - p A_{l,t-1}} \right). \tag{6.4}$$

Computation of $F(l, t-1)$ requires $O(1 + \deg_o(l))$ operations, where $\deg_o(l)$ is the out-degree of $l$. To find the node $l$ that maximizes $F(l, t-1)$, one has to find $F(l, t-1)$ for all $l \in V$, which requires $O(\sum_l (1 + \deg_o(l))) = O(|V| + |E|)$ operations. Hence, the time complexity of finding $k$ nodes for seed-set $S$ for ICM using OSSUM is $O(k(|V| + |E|))$.

Similarly for LFM, the objective function can be shown to be

$$F(l, t-1) = (1 - B_{l,t-1}) + \sum_{u \in N_o(l)} (1 - B_{u,t-1}) \left( \sigma\Big( \alpha \Big( \sum_{j \in N_i(u)} b_{j,u} B_{j,t-1} + b_{l,u} (1 - B_{l,t-1}) \Big) + \beta \Big) - \sigma\Big( \alpha \sum_{j \in N_i(u)} b_{j,u} B_{j,t-1} + \beta \Big) \right), \tag{6.5}$$

where $\sigma(x) = \exp(x) / (1 + \exp(x)) = 1 / (1 + \exp(-x))$. For LFM, finding $F(l, t-1)$ requires $O(1 + \sum_{u \in N_o(l)} \deg_i(u))$ operations, where $\deg_i(u)$ denotes the in-degree of node $u$. Finding $F(l, t-1)$ for all $l \in V$ requires $O(\sum_l (1 + \sum_{u \in N_o(l)} \deg_i(u)))$. Since $\deg_i(u)$ is counted in this expression once for every node that has $u$ as an outgoing neighbor, i.e., $\deg_i(u)$ times, this expression can be rewritten as

$$\sum_l \Big( 1 + \sum_{u \in N_o(l)} \deg_i(u) \Big) = |V| + \sum_u \deg_i(u)^2, \tag{6.6}$$

which is bounded by $O(|V| + |E|^2 / |V|)$ [56].
Therefore, the time complexity of finding $k$ nodes for seed-set $S$ for LFM using OSSUM is $O(k(|V| + |E|^2 / |V|))$.

Note that the Influence Maximization problem in [20] is a special case of Generalized Influence Maximization where $t \to \infty$ and $S \subseteq V \times \{0\}$, i.e., all the nodes in the seed-set $S$ are infected at $t = 0$. We propose to use OSSUM, which generates $S = \{(u_1, 0), (u_2, 1), \ldots, (u_k, k-1)\}$, and to use the set $\{u_1, u_2, \ldots, u_k\}$, disregarding the time dimension. The incremental approach of OSSUM allows us to calculate the immediate effect of the chosen vertex, so that the next vertex included in the seed-set has the least overlap with the previous ones with respect to the nodes that they influence.

6.2 Experiments with Seed-set Selection

We evaluate OSSUM's ability to identify good seed-sets in a real-world network. We are interested in studying its behavior in practice and comparing its performance against state-of-the-art methods for Influence Maximization. We compared OSSUM against the following widely used methods for seed-set selection as baselines:

Degree: The nodes with the top $k$ highest degrees are selected.
Single Discount: First the highest degree node is selected and removed from the graph. Next, the highest degree node from the remaining graph is selected. The process is continued until $k$ nodes have been selected.
Degree Discount [30]: A heuristic designed for ICM, which performs a form of weighted discount based on the parameter $p$.
CELF++ [57]: A further optimization of CELF that exploits sub-modularity of the spreading process. Here, we use it as a baseline for ICM.
LDAG [58]: A scalable algorithm specifically designed for the Linear Threshold Model. It utilizes local DAG structures to estimate influence spread.
SPS-CELF++ [59]: An algorithm for influence maximization on the Linear Threshold Model which performs several optimizations, including CELF++.
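The two purely structural baselines in the list above, Degree and Single Discount, can be sketched directly from their descriptions. The snippet below is an illustrative sketch, not the code used in the experiments: the function names are hypothetical, the graph is assumed to be undirected and stored as a dictionary mapping each node to the set of its neighbors, and ties are broken by smallest node identifier (an arbitrary choice made here for determinism).

```python
def degree_seeds(adj, k):
    """Degree baseline: the k nodes with the highest degree."""
    return sorted(adj, key=lambda u: (-len(adj[u]), u))[:k]


def single_discount_seeds(adj, k):
    """Single Discount baseline: repeatedly pick the highest-degree node
    and remove it (and its edges) from the remaining graph."""
    remaining = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    seeds = []
    for _ in range(min(k, len(remaining))):
        best = min(remaining, key=lambda u: (-len(remaining[u]), u))
        seeds.append(best)
        for v in remaining.pop(best):
            remaining[v].discard(best)  # neighbors lose one degree
    return seeds


if __name__ == "__main__":
    # Toy undirected graph: a hub (0), one of its neighbors (1), and a separate star (5).
    adj = {
        0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1}, 4: {0},
        5: {6, 7, 8}, 6: {5}, 7: {5}, 8: {5},
    }
    print(degree_seeds(adj, 2))
    print(single_discount_seeds(adj, 2))
```

On this toy graph the two baselines disagree on the second seed: Degree picks a neighbor of the hub, whereas Single Discount discounts that neighbor's degree after the hub is removed and moves to the separate star instead.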
We used HEPT [30], a real-world co-authorship network, where nodes are authors and a link between two nodes represents a paper co-authored by the two authors. The network consists of 15,233 nodes and 58,891 undirected edges. The baselines and our algorithm were applied to the HEPT graph for ICM and LFM. Figure 6.1 shows the results of these experiments. For ICM (Figure 6.1(a)), we set $p = 0.01$ and plot $\sigma_\infty$, i.e., the number of infections once a steady state is reached for a given seed set. OSSUM performs almost the same as Degree Discount. This could be attributed to the fact that the Degree Discount heuristic for ICM is based on an assumption similar to that of our solution for the Unified Model, i.e., it assumes that every node and its neighborhood form a star-like network, which is equivalent to the statement that the infection states of two neighbors are independent. The performance of CELF++ is also similar to that of Degree Discount, which is in agreement with the results demonstrated in [30].

A considerable deviation between OSSUM and the baselines appears in the case of LFM (Figure 6.1(b)). In LFM, the probability of infection of a node always remains bounded below by $\sigma(\beta)$ (the infection probability of $u$ at time $t$ is $\sigma(\alpha |In(u,t)| + \beta)$), and so at steady state ($t \to \infty$) every node becomes infected. Therefore, to study the difference between the seed-set selection methods, we plot the number of infections achieved until $t = 60$, i.e., $\sigma_{60}$. It can be observed that OSSUM significantly outperforms the baselines. The difference primarily arises from the fact that the infection probability of a node has a non-linear dependency on its neighborhood, which is difficult for the baselines to capture. LDAG and SPS-CELF++ perform better than the heuristics but are clearly outperformed by OSSUM. This is due to the fact that these methods are specifically designed for Linear Threshold.
However, OSSUM allows us to compute the infection probabilities as we select the nodes for manual infection, providing a better selection algorithm, as long as the model can be fitted into the Unified Model. Our experiment has two prominent outcomes. First, better seed-set selection can be achieved by considering the dynamics of influence in a network rather than relying solely on structural properties. Second, OSSUM performs as well as or better than algorithms tailored to specific influence models.

FIGURE 6.1: Seed-set selection for Influence maximization. Panels: (a) ICM at steady state (Degree, Single Discount, Degree Discount, OSSUM, CELF++), (b) LFM at $t = 60$ (Degree, Single Discount, Degree Discount, OSSUM, LDAG, SPS-CELF++). Each panel plots the number of infections against the size of the seed-set ($k$).

Chapter 7
Computing Competing Cascades on Signed Networks

In deciding whether to adopt an innovation, a political idea, or a product, people are frequently influenced, either explicitly or implicitly, by their social contacts, aggregate social behavior, external factors, or a combination of the above [60, 61]. Several influence diffusion models have been proposed in the literature to formulate the underlying influence propagation process [24, 60–63]. Even though these models of influence spread enable the study of complex and realistic scenarios, most of them assume that the spreading process takes place on an unsigned network. In reality, however, the polarity of relationships might not always be positive [64–66]. For instance, in online social networks such as Slashdot and Epinions, relationships might have a positive (e.g., representing “friends” or “trust”) or negative (e.g., modeling “foe”, “spite” or “distrust” relationships) connotation.
Intuitively, positive relationships carry influence in a positive way, whereas negative edges carry influence in the reverse direction (i.e., one is more likely to follow a friend’s choice, yet do the opposite of a foe). Social influence can be further complicated when multiple competing processes unfold over the network (e.g., multiple companies with similar products vie for sales using competing viral marketing campaigns) [67–69]. How should a few initial nodes be chosen to start the process so that the expected total influence in a given signed network is maximized under a model of influence spread $\mathcal{M}$ in the presence of competing cascades? One central aspect of this problem is the estimation of the expected influence spread $\sigma(S)$ given the seed set $S$, which is typically computed using numerous simulations. Even for a single cascade diffusing under the Independent Cascade Model, exact computation of $\sigma(S)$ has been shown to be #P-hard [58]. Estimating the influence spread analytically can reduce computation time by avoiding expensive simulations.

To fill the gap of influence computation and maximization in signed networks with competing diffusion processes, we propose a novel signed network influence maximization (SiNiMax) problem. The purpose of SiNiMax is to find a small set of seeds with maximum influence (for either of the competing processes, denoted by the colors ‘red’ and ‘blue’) in a signed social network. Unlike the few recent studies on influence maximization on signed networks [65, 66, 70], SiNiMax enables seeds to be either ‘red’ or ‘blue’ for maximization of either competing opinion; this facilitates diversification of the seed-set portfolio, taking advantage of both positive and negative relationships at the same time. Our framework enables the study of the spreading dynamics of two concurrent yet interdependent contagion processes over a signed network.
Specifically, we extend the unified model of influence [61] to signed networks for competing cascades and study the dynamics of influence diffusion for two opposite opinions, which are modeled as positive and negative and spread over the positive and negative edges of a signed network. We first characterize analytically the contagion phenomena of two competing cascades and compute the temporal evolution of influence in a signed network. We show how our closed-form expression can be used to efficiently study the unfolding dynamics of opposite opinions in a signed network without requiring extensive simulations. We then apply our model to solve the influence maximization problem and develop efficient algorithms to select initial seeds of either opinion that maximize influence coverage. We use both synthetic and real-world large-scale networks, such as Epinions and Slashdot, to confirm our theoretical analysis of competing influence diffusion dynamics over signed networks, and demonstrate that our influence maximization algorithm outperforms other heuristics.

7.1 Related Work

In prior work [71] we provided an analytical solution to the problem of estimating the probability of infection of a node at any given time under the Independent Cascade Model on signed networks. We proved that Influence Maximization in such a setting is NP-hard. We also showed that the influence spread as a function of the seed-set is not monotonic, making it difficult to guarantee approximations. We provided an algorithm, OSSUM, to select a seed-set that attempts to achieve maximum spread of one of the competing infections. Our experiments demonstrated that OSSUM outperformed other heuristics. In this work, we extend the analytical solution to Generalized Linear Threshold. We verify the quality of approximation of the solution by comparing it with the outcome of simulations on several synthetic graphs of varying sizes, densities and fractions of negative links.
Information diffusion has been thoroughly studied on unsigned networks [30, 58, 59, 62]. Among the models that have been proposed, the Independent Cascade Model and the Linear Threshold Model [62] have been studied extensively. Computing the exact expectation of influence spread with ICM has been shown to be #P-hard [58]. Typically, thousands of Monte Carlo simulations are run to estimate the influence spread. An approximate analytical solution was proposed in [61] that covers computation of expected influence for several models, including ICM and Generalized Linear Threshold (GLT). The solution is specific to unsigned graphs with a single infection. We extend the analytical model to signed networks with competing cascades.

Diffusion of multiple cascades has been the focus of [69, 72, 73]. [74] studied the diffusion of multiple cascades and their interactions. Instead, we study the spread of competing infections, where a node can be infected by only one of the infections prevalent in the network. Competing cascades have been studied in [67] from a game-theoretic perspective for maximizing the expected diffusion of an opinion against a competing one. [65] proposed influence maximization on a voter model on a social network with positive and negative links. They find the optimal seed-set for influence maximization on signed networks where opinions are flipped when flowing through a negative link. We assume a similar modeling of flipping infections over negative links; however, we show that it is NP-hard to find the optimal seed set in our model. A similar model on unsigned networks was proposed in [70], where opinions propagate according to ICM and positive opinions get flipped randomly with a certain probability. Unlike in their model, the expected influence spread in ours is not monotonic, making influence maximization more difficult.
Our model for ICM is the same as IC-P [66]; however, our influence maximization of one infection allows the inclusion of the opposite infection in the seed-set. We have demonstrated in our experiments that inclusion of the opposite infection is important when the majority of links in the network are negative.

7.2 Unified Model of Competing Cascades in Signed Networks

We consider a weighted, directed, and signed graph $G = (V, E, W)$, where $V$ is the set of nodes and $E$ is the set of directed edges. Edges represent influence: edge $(u, v) \in E$ if node $u$ influences $v$. $W$ is a matrix whose element $w_{uv}$ denotes the signed weight of edge $(u, v)$ in the graph. Entries in $W$ are non-negative when the network is unsigned, but $W$ may contain both positive and negative entries when $G$ is signed. In particular, a positive entry $w_{uv}$ may represent friendship or trust, whereas a negative value would be indicative of a foe or distrust relationship (i.e., node $u$ distrusts node $v$). The absolute value $|w_{uv}|$ denotes the strength of influence.

To extend the present understanding of multiple contagions as they simultaneously spread through a signed network, we consider the case of two influence diffusion processes that spread in discrete time steps according to some propagation model $\mathcal{M}$. For simplicity, we describe the diffusion process of competing cascades in a signed network for the standard Independent Cascade Model (ICM) [62], with the note that our results can be easily extended to other influence models.

In our modeling, two cascades spread over the network according to $\mathcal{M}$. We use two colors, red and blue, to differentiate between them. A node can therefore be either susceptible, red or blue. Initially, all nodes are susceptible, i.e., have not been exposed to either of the two cascades.
We study the problem of progressive diffusion, according to which nodes that become colored (infected by either red or blue) cannot become susceptible again. Additionally, we assume that once a node is colored it cannot change color in the future. At every time step $t$, each node $v$ that was infected at $t-1$ attempts, with probability $p_{v,u}$, to infect each of its outgoing neighbors $u$ with its own color. A susceptible node on which multiple influence attempts were made randomly selects one of these attempts and changes from susceptible to colored.

We first describe the spreading dynamics of two concurrent yet interdependent contagion processes over an unsigned network. We extend the Unified Model (UM) [61], a generalized analytical model of influence in networks that incorporates both pairwise and collective influence dynamics into the diffusion mechanism for accurate calculation of the probability of infection at any time $t$. According to UM, the probability $B_{u,t}$ of node $u$ being infected under ICM at or before time $t$ is given by
\[ B_{u,t} = 1 - (1 - r_{u,t})(1 - B_{u,t-1}). \]
After some algebraic manipulation, the previous equation can be written as
\[ B_{u,t} = B_{u,t-1} + r_{u,t}(1 - B_{u,t-1}), \]
where $r_{u,t}$ denotes collective influence. According to [61], collective influence can be used to model local neighborhood effects, aggregate social behavior, external factors, or a combination of the above. However, as shown in [61], it is possible (in the case of ICM) to aggregate the effect of multiple pairwise infection attempts into the collective influence term. We use $B^+_{u,t}$ and $B^-_{u,t}$ to denote the probability of infection of node $u$ with red and blue, respectively, at or before time $t$. Similarly, $r^+_{u,t}$ and $r^-_{u,t}$ represent the collective influence due to red and blue, respectively.
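For reference, the stochastic process described above can also be simulated directly. The following is a minimal Python sketch, not the implementation used in this work. It assumes an unsigned graph stored as an adjacency dictionary and a single probability `p` shared by every edge; the names `simulate_competing_icm` and `estimate_red_probability` are hypothetical and chosen only for illustration. Averaging many runs gives an empirical estimate of the probability that a node is red by a given time.

```python
import random
from collections import defaultdict

SUSCEPTIBLE, RED, BLUE = 0, 1, 2


def simulate_competing_icm(out_edges, seeds, p, max_steps, rng):
    """Run one realization of the two-color progressive ICM.

    out_edges: dict mapping node -> list of out-neighbors (unsigned graph).
    seeds: dict mapping node -> RED or BLUE (nodes colored at t = 0).
    p: one infection probability for every edge (a simplifying assumption).
    Returns a dict node -> color for every node colored within max_steps.
    """
    color = dict(seeds)
    frontier = list(seeds)  # nodes colored in the previous time step
    for _ in range(max_steps):
        attempts = defaultdict(list)  # susceptible node -> colors that reached it
        for v in frontier:
            for u in out_edges.get(v, []):
                if u not in color and rng.random() < p:
                    attempts[u].append(color[v])
        if not attempts:
            break
        for u, colors in attempts.items():
            # A node reached by several attempts picks one of them at random.
            color[u] = rng.choice(colors)
        frontier = list(attempts)
    return color


def estimate_red_probability(out_edges, seeds, p, max_steps, runs, seed=0):
    """Fraction of runs in which each node is red after max_steps steps."""
    rng = random.Random(seed)
    counts = defaultdict(int)
    for _ in range(runs):
        final = simulate_competing_icm(out_edges, seeds, p, max_steps, rng)
        for u, c in final.items():
            if c == RED:
                counts[u] += 1
    return {u: n / runs for u, n in counts.items()}
```

Nodes colored during a step are only allowed to spread in the next step, which matches the discrete-time description; nodes that are never red simply do not appear in the returned estimate.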
From the perspective of an initially susceptible node, the probability of being colored red at time $t$ can be formalized as
\[ P(u \text{ colored red at time } t) = P(u \text{ susceptible before } t,\ \text{collective red influence succeeds and blue fails}). \tag{7.1} \]
The probability of node $u$ being colored red at time $t$ can also be calculated as $A^+_{u,t} = B^+_{u,t} - B^+_{u,t-1}$. Therefore, Equation 7.1 becomes
\[ A^+_{u,t} = r^+_{u,t}(1 - r^-_{u,t})(1 - B^+_{u,t-1} - B^-_{u,t-1}), \tag{7.2} \]
where
\[ B^+_{u,t} = B^+_{u,t-1} + r^+_{u,t}(1 - r^-_{u,t})(1 - B^+_{u,t-1} - B^-_{u,t-1}), \tag{7.3} \]
\[ B^-_{u,t} = B^-_{u,t-1} + r^-_{u,t}(1 - r^+_{u,t})(1 - B^+_{u,t-1} - B^-_{u,t-1}). \tag{7.4} \]
Note that in Equation 7.1 we have ignored the case in which both red and blue infection attempts succeed and the random selection between red and blue by $u$ results in a red color. Considering this case, Equation 7.1 becomes
\[ A^+_{u,t} = \left[ r^+_{u,t}(1 - r^-_{u,t}) + r^+_{u,t}\, r^-_{u,t}\, \frac{r^+_{u,t}}{r^+_{u,t} + r^-_{u,t}} \right] (1 - B^+_{u,t-1} - B^-_{u,t-1}). \tag{7.5} \]
Since $r^+_{u,t}$ and $r^-_{u,t}$ are typically small, for simplicity we ignore this rare event and proceed with Equation 7.2, although our methods are not affected if one wishes to proceed with Equation 7.5 for better accuracy.

The collective influence probabilities for each competing cascade can be computed separately due to the independence between the two influence processes in a single time step (i.e., the event of an attempt of a red infection on a node is independent of an attempt of a blue infection). Therefore, extending the idea of collective influence of ICM from [61] to multiple cascades,
\[ r^+_{u,t} = 1 - \prod_{v \to u} (1 - p^+_{v,u} A^+_{v,t-1}) \tag{7.6} \]
and
\[ r^-_{u,t} = 1 - \prod_{v \to u} (1 - p^-_{v,u} A^-_{v,t-1}), \tag{7.7} \]
where $p^+_{v,u}$ and $p^-_{v,u}$ represent the probabilities of node $v$ exerting influence on $u$ when $v$ is colored red or blue, respectively. Equations 7.2, 7.3, 7.4, 7.6, and 7.7 constitute our novel analytical solution for computing the infection probability of any node at any time $t$ for competing cascades that propagate according to ICM on an unsigned network.
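The recursion above can be iterated numerically. The following Python sketch is one possible reading of Equations 7.2 to 7.4, 7.6 and 7.7, offered as an illustration rather than as the author's code. It assumes an unsigned graph given by in-neighbor lists and, as a simplification, one probability `p` for all edges and both colors; the function name `competing_icm_probabilities` and its arguments are hypothetical.

```python
def competing_icm_probabilities(in_edges, nodes, red_seeds, blue_seeds, p, steps):
    """Iterate Equations 7.3, 7.4, 7.6 and 7.7 on an unsigned graph.

    in_edges: dict node u -> list of in-neighbors v (edges v -> u).
    p: one probability for all edges and both colors (an assumption).
    Returns (b_red, b_blue): probabilities of being red / blue by time `steps`.
    """
    b_red = {u: 1.0 if u in red_seeds else 0.0 for u in nodes}
    b_blue = {u: 1.0 if u in blue_seeds else 0.0 for u in nodes}
    # a_*[u]: probability that u was newly colored in the latest step
    # (the seeds are the nodes newly colored at t = 0).
    a_red, a_blue = dict(b_red), dict(b_blue)
    for _ in range(steps):
        new_red, new_blue = {}, {}
        for u in nodes:
            no_red = no_blue = 1.0
            for v in in_edges.get(u, []):
                no_red *= 1.0 - p * a_red[v]
                no_blue *= 1.0 - p * a_blue[v]
            r_red, r_blue = 1.0 - no_red, 1.0 - no_blue   # Eq. 7.6 and 7.7
            susceptible = 1.0 - b_red[u] - b_blue[u]
            new_red[u] = r_red * (1.0 - r_blue) * susceptible   # Eq. 7.2
            new_blue[u] = r_blue * (1.0 - r_red) * susceptible  # blue analogue
        for u in nodes:
            b_red[u] += new_red[u]     # Eq. 7.3
            b_blue[u] += new_blue[u]   # Eq. 7.4
        a_red, a_blue = new_red, new_blue
    return b_red, b_blue
```

Each step touches every edge once, so the cost per time step is linear in the number of nodes and edges. Because the sketch follows Equation 7.2, the event in which both colors succeed on the same node is dropped, exactly as in the simplified formula.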
We naturally extend the influence propagation model $\mathcal{M}$ (in this case ICM) to signed networks based on the social principles “the friend of my enemy is my enemy” and “the enemy of my enemy is my friend” [5]. Specifically, the influence is flipped when traversing a negative edge $(u, v)$ between nodes $u$ and $v$. Intuitively, if node $u$ is colored red and is successful in infecting $v$ over such an edge, then $v$ will become infected with blue. From the perspective of $v$, however, $u$’s attempt to influence $v$ with red through a negative edge is equivalent to $u$ trying to pass along the blue infection to $v$ through an unsigned edge. Therefore, flipping the infection (from red to blue or vice versa) of each incoming neighbor that has a negative link and removing the sign from the link yields a process equivalent to the diffusion process on the signed network. Thus, competing cascades in a signed network can be reduced to an equivalent problem of competing cascades in an unsigned network. In particular, Equations 7.3 and 7.4 can be used to calculate the infection probabilities in a signed network. For the calculation to be valid, the formulas for collective influence need to be modified. Formally,
\[ r^+_{u,t} = 1 - \prod_{v \xrightarrow{+} u} (1 - p^+_{v,u} A^+_{v,t-1}) \prod_{v \xrightarrow{-} u} (1 - p^-_{v,u} A^-_{v,t-1}) \tag{7.8} \]
and
\[ r^-_{u,t} = 1 - \prod_{v \xrightarrow{+} u} (1 - p^-_{v,u} A^-_{v,t-1}) \prod_{v \xrightarrow{-} u} (1 - p^+_{v,u} A^+_{v,t-1}), \tag{7.9} \]
where $v \xrightarrow{+} u$ and $v \xrightarrow{-} u$ range over the in-neighbors $v$ of $u$ connected to $u$ by a positive and a negative link, respectively.

7.3 Influence Maximization in Signed Networks

We redefine the problem of Influence Maximization for signed networks as follows:

Definition 7.1 (Signed Network Influence Maximization). Given a diffusion model $\mathcal{M}$ of competing cascades $C = \{\text{red}, \text{blue}\}$ on a signed, possibly weighted graph $G(V, E, W)$ and an integer $m$, find $S \subseteq V \times C$ such that $|S| = m$ and infecting $u$ with $c$ at $t = 0$ for all $(u, c) \in S$ maximizes the expected spread of the red infection, denoted by $\sigma_{\mathcal{M},+}(S)$.

Influence Maximization of the ‘red’ infection is equivalent to the problem of maximizing the influence of the ‘blue’ infection.
Therefore, an algorithm to maximize $\sigma_{\mathcal{M},+}(S)$ would also be able to maximize $\sigma_{\mathcal{M},-}(S)$. The most prominent difference between this problem and the traditional Influence Maximization problem [62] is that, in the seed set $S$, the infection states (red or blue) of the nodes must be decided along with the choice of nodes. Due to the signs on the links, it is possible that initializing a node with blue can lead to more red infections and vice versa, i.e., including a red node in the seed set may lead to more blue nodes. Next, we prove that the problem of influence maximization in signed networks is NP-hard.

Theorem 7.2. The Signed Network Influence Maximization (SiNiMax) problem is NP-hard under the Independent Cascade Model.

Proof. Consider an instance of the traditional Influence Maximization problem, in which we need to find a seed-set of nodes whose infection creates the maximum expected number of infections at steady state under the Independent Cascade Model with parameter $p$. Suppose that in SiNiMax we set all the links in the graph to positive and $p^-_{v,u} = 0$, $p^+_{v,u} = p$ for all $(v, u) \in E$. Now we proceed to solve SiNiMax to maximize $\sigma_{ICM,+}(S)$. Clearly, $c = \text{red}$ for all $(u, c) \in S$, because ‘blue’ cannot propagate, and if $(u, \text{blue}) \in S$, then replacing that element with $(u, \text{red})$ can only increase $\sigma_{ICM,+}(S)$. This instance of SiNiMax is equivalent to the traditional ICM, as only one type of infection propagates in the network. Therefore, a solution to SiNiMax would provide a solution to the traditional Influence Maximization problem, which is known to be NP-hard. Hence, SiNiMax is NP-hard.

SiNiMax is similar to PRIM [66], with the only difference that we allow the addition of a node with an initial coloring of ‘blue’ for maximization of $\sigma_{ICM,+}(S)$. We point out one major difference of ICM with competing cascades over a signed network, compared to other formulations of ICM, in the following theorem.
FIGURE 7.1: Spread of ‘red’ and ‘blue’ infections with (a) $S = \{(v_1, \text{red})\}$ and (b) $S = \{(v_1, \text{red}), (v_2, \text{red})\}$. Solid line represents a positive link and dashed line represents a negative link. Green nodes represent ambiguous infection.

Theorem 7.3. The function $\sigma_{ICM,+}(S)$ is not monotonic.

Proof. We disprove the monotonicity of $\sigma_{ICM,+}(S)$ by a counter-example. Consider the graph in Figure 7.1, where a solid line represents a positive link and a dashed line represents a negative link. Suppose $p_{v,u} = 1$ for all links. Starting with seed-set $\{(v_1, \text{red})\}$ (Figure 7.1(a)), we find that 8 nodes end up being colored ‘red’. Two nodes have ambiguous infection, as their coloring, ‘red’ or ‘blue’, depends on the color node $v_2$ adopts. In any case, these two nodes must have opposite infections due to the negative link between them ($v_2$ is the only neighbor of the other node, and so its infection must come through $v_2$), so exactly one of them is ‘red’. Therefore, $\sigma_{ICM,+}(S) = 9$. Inclusion of $v_2$ in the seed-set with ‘red’ color (Figure 7.1(b)) creates 7 ‘red’ and 3 ‘blue’ infections. Therefore $\sigma_{ICM,+}(\{(v_1, \text{red}), (v_2, \text{red})\}) < \sigma_{ICM,+}(\{(v_1, \text{red})\})$. Infecting $v_2$ with ‘blue’ in the seed-set creates at most 6 ‘red’ infections. Thus, $\sigma_{ICM,+}(\{(v_1, \text{red}), (v_2, \text{blue})\}) < \sigma_{ICM,+}(\{(v_1, \text{red})\})$, disproving the monotonicity.

7.4 Seed-set Selection Heuristic

Based on the analytical solution obtained in Section 7.2, we develop a novel heuristic, OSSUM, for selecting the seed-set $S$ that maximizes $\sigma_{\mathcal{M},+}(S)$. Our approach is incremental: we start with an empty set and include $(u, c)$ at time $t-1$ if infecting $u$ with $c$ at $t-1$ would create the largest expected number of new ‘red’ infections at time $t$. A greedy approach, on the other hand, would be to choose the $(u, c)$ that maximizes $\sigma_{\mathcal{M},+}(S \cup \{(u, c)\}) - \sigma_{\mathcal{M},+}(S)$.
We refrain from using such a greedy approach because it would require calculation of the total spread (instead of the immediate spread, as in our heuristic), adding computational cost. Also, due to the lack of monotonicity, the $(1 - 1/e)$-approximation [62] is no longer guaranteed. Owing to the low value of $p$, the effect of the infection of a node decays quickly with distance. Therefore, the node creating the largest number of new infections is expected to have a high contribution to $\sigma_{\mathcal{M},+}(S)$.

7.4.1 OSSUM

Let $B^+_{u,t}(k,+)$ represent the probability of node $u$ being colored ‘red’ at time $t$ if node $k$ is infected with ‘red’ at time $t-1$, provided it was not already infected. This is done by setting $B^+_{k,t-1}$ to $1 - B^-_{k,t-1}$ and calculating its impact on node $u$. Similarly, $B^+_{u,t}(k,-)$ represents the probability of ‘red’ infection of node $u$ at time $t$ if node $k$ is infected with ‘blue’ at time $t-1$.
\[ B^+_{u,t}(k,+) = \begin{cases} B^+_{u,t-1} + r^+_{u,t}(k,+)\,(1 - r^-_{u,t}(k,+))\,(1 - B^+_{u,t-1} - B^-_{u,t-1}) & \text{if } u \neq k \\ 1 - B^-_{u,t-1} & \text{if } u = k \end{cases} \tag{7.10} \]
Here $r^+_{u,t}(k,c)$ and $r^-_{u,t}(k,c)$ denote the effective red and blue collective influence on $u$ when $k$ is infected with color $c$, where $c = +$ stands for ‘red’ and $c = -$ for ‘blue’. Our heuristic, termed Online Seed-set Selection using Unified Model on Signed Networks (OSSUM), is based on selecting a node $k$ and infecting it with $c \in \{\text{red}, \text{blue}\}$ such that
\[ (k, c) = \arg\max_{(k,c)} \sum_u \left( B^+_{u,t}(k,c) - B^+_{u,t} \right). \tag{7.11} \]
Now,
\[ B^+_{u,t}(k,c) - B^+_{u,t} = \left[ r^+_{u,t}(k,c)\,(1 - r^-_{u,t}(k,c)) - r^+_{u,t}\,(1 - r^-_{u,t}) \right] (1 - B^+_{u,t-1} - B^-_{u,t-1}). \tag{7.12} \]
To evaluate $B^+_{u,t}(k,c) - B^+_{u,t}$ for $u \neq k$, we need to consider four cases:

1. Node $k$ is infected with ‘red’ and the link from $k$ to $u$ is positive: Since $k$ is infected with ‘red’ and $k \xrightarrow{+} u$, it affects only the ‘red’ collective influence.
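Therefore
\[ r^-_{u,t}(k,+) = r^-_{u,t}. \tag{7.13} \]
And
\[ \begin{aligned} r^+_{u,t}(k,+) &= 1 - \left(1 - p_{k,u}(1 - B^+_{k,t-2} - B^-_{k,t-1})\right) \prod_{j \xrightarrow{+} u,\ j \neq k} (1 - p_{j,u} A^+_{j,t-1}) \prod_{j \xrightarrow{-} u} (1 - p_{j,u} A^-_{j,t-1}) \\ &= 1 - \frac{1 - p_{k,u}(1 - B^+_{k,t-2} - B^-_{k,t-1})}{1 - p_{k,u} A^+_{k,t-1}} \prod_{j \xrightarrow{+} u} (1 - p_{j,u} A^+_{j,t-1}) \prod_{j \xrightarrow{-} u} (1 - p_{j,u} A^-_{j,t-1}) \\ &= 1 - \frac{1 - p_{k,u}(1 - B^+_{k,t-2} - B^-_{k,t-1})}{1 - p_{k,u} A^+_{k,t-1}} (1 - r^+_{u,t}). \end{aligned} \tag{7.14} \]
Using Equations 7.12, 7.13 and 7.14, for this case
\[ \sum_{k \xrightarrow{+} u} \left( B^+_{u,t}(k,+) - B^+_{u,t} \right) = \sum_{k \xrightarrow{+} u} \frac{p_{k,u}(1 - B^+_{k,t-1} - B^-_{k,t-1})}{1 - p_{k,u} A^+_{k,t-1}} (1 - B^+_{u,t-1} - B^-_{u,t-1})(1 - r^+_{u,t})(1 - r^-_{u,t}). \tag{7.15} \]

2. Node $k$ is infected with ‘red’ and the link from $k$ to $u$ is negative: In this case, since this action does not affect the collective influence for ‘red’ infection on $u$,
\[ r^+_{u,t}(k,+) = r^+_{u,t}. \tag{7.16} \]
And
\[ \begin{aligned} r^-_{u,t}(k,+) &= 1 - \left(1 - p_{k,u}(1 - B^+_{k,t-2} - B^-_{k,t-1})\right) \prod_{j \xrightarrow{+} u} (1 - p_{j,u} A^-_{j,t-1}) \prod_{j \xrightarrow{-} u,\ j \neq k} (1 - p_{j,u} A^+_{j,t-1}) \\ &= 1 - \frac{1 - p_{k,u}(1 - B^+_{k,t-2} - B^-_{k,t-1})}{1 - p_{k,u} A^+_{k,t-1}} \prod_{j \xrightarrow{+} u} (1 - p_{j,u} A^-_{j,t-1}) \prod_{j \xrightarrow{-} u} (1 - p_{j,u} A^+_{j,t-1}) \\ &= 1 - \frac{1 - p_{k,u}(1 - B^+_{k,t-2} - B^-_{k,t-1})}{1 - p_{k,u} A^+_{k,t-1}} (1 - r^-_{u,t}). \end{aligned} \tag{7.17} \]
Using Equations 7.12, 7.16 and 7.17, for this case
\[ \sum_{k \xrightarrow{-} u} \left( B^+_{u,t}(k,+) - B^+_{u,t} \right) = -\sum_{k \xrightarrow{-} u} \frac{p_{k,u}(1 - B^+_{k,t-1} - B^-_{k,t-1})}{1 - p_{k,u} A^+_{k,t-1}} (1 - B^+_{u,t-1} - B^-_{u,t-1})\, r^+_{u,t}(1 - r^-_{u,t}). \tag{7.18} \]

3. Node $k$ is infected with ‘blue’ and the link from $k$ to $u$ is positive: Similar to Case 2, it can be shown that
\[ \sum_{k \xrightarrow{+} u} \left( B^+_{u,t}(k,-) - B^+_{u,t} \right) = -\sum_{k \xrightarrow{+} u} \frac{p_{k,u}(1 - B^+_{k,t-1} - B^-_{k,t-1})}{1 - p_{k,u} A^-_{k,t-1}} (1 - B^+_{u,t-1} - B^-_{u,t-1})\, r^+_{u,t}(1 - r^-_{u,t}). \tag{7.19} \]

4. Node $k$ is infected with ‘blue’ and the link from $k$ to $u$ is negative: Similar to Case 1, it can be shown that
\[ \sum_{k \xrightarrow{-} u} \left( B^+_{u,t}(k,-) - B^+_{u,t} \right) = \sum_{k \xrightarrow{-} u} \frac{p_{k,u}(1 - B^+_{k,t-1} - B^-_{k,t-1})}{1 - p_{k,u} A^-_{k,t-1}} (1 - B^+_{u,t-1} - B^-_{u,t-1})(1 - r^+_{u,t})(1 - r^-_{u,t}). \tag{7.20} \]

Finally, a node $k$ is added to the seed-set with the color $c \in \{\text{red}, \text{blue}\}$ that maximizes the following:
\[ \max_k \max\left\{ \sum_u \left( B^+_{u,t}(k,+) - B^+_{u,t} \right),\ \sum_u \left( B^+_{u,t}(k,-) - B^+_{u,t} \right) \right\}, \tag{7.21} \]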
where the objective function is computed by combining Equations 7.15, 7.18, 7.19 and 7.20:
\[ \begin{aligned} \sum_u \left( B^+_{u,t}(k,+) - B^+_{u,t} \right) = {} & \sum_{k \xrightarrow{+} u} \frac{p_{k,u}(1 - B^+_{k,t-1} - B^-_{k,t-1})}{1 - p_{k,u} A^+_{k,t-1}} (1 - B^+_{u,t-1} - B^-_{u,t-1})(1 - r^+_{u,t})(1 - r^-_{u,t}) \\ & - \sum_{k \xrightarrow{-} u} \frac{p_{k,u}(1 - B^+_{k,t-1} - B^-_{k,t-1})}{1 - p_{k,u} A^+_{k,t-1}} (1 - B^+_{u,t-1} - B^-_{u,t-1})\, r^+_{u,t}(1 - r^-_{u,t}) \end{aligned} \tag{7.22} \]
and
\[ \begin{aligned} \sum_u \left( B^+_{u,t}(k,-) - B^+_{u,t} \right) = {} & \sum_{k \xrightarrow{-} u} \frac{p_{k,u}(1 - B^+_{k,t-1} - B^-_{k,t-1})}{1 - p_{k,u} A^-_{k,t-1}} (1 - B^+_{u,t-1} - B^-_{u,t-1})(1 - r^+_{u,t})(1 - r^-_{u,t}) \\ & - \sum_{k \xrightarrow{+} u} \frac{p_{k,u}(1 - B^+_{k,t-1} - B^-_{k,t-1})}{1 - p_{k,u} A^-_{k,t-1}} (1 - B^+_{u,t-1} - B^-_{u,t-1})\, r^+_{u,t}(1 - r^-_{u,t}). \end{aligned} \tag{7.23} \]
The OSSUM heuristic is summarized in Algorithm 8.

Algorithm 8 Online Seed-set Selection using Unified Model on Signed Network (OSSUM)
    function OSSUM($G$, $m$)
        $S \leftarrow \emptyset$
        for $t = 1 \to m$ do
            $(k, c) = \arg\max_{(k,c)} \sum_u \left( B^+_{u,t}(k,c) - B^+_{u,t} \right)$    ▷ Computed using Equations 7.22 and 7.23
            $S \leftarrow S \cup (k, c)$
        end for
        return $S$
    end function

7.4.2 Complexity Analysis

Computing $r^+_{u,t}$ and $r^-_{u,t}$ requires $O(\text{indegree}(u))$ computations. Doing this for all nodes $u$ requires $O(\sum_u (1 + \text{indeg}(u))) = O(|V| + |E|)$ operations. Once these values are calculated, Equations 7.22 and 7.23 are evaluated for each node $k$, which takes $O(\sum_k (1 + \text{outdegree}(k))) = O(|E| + |V|)$ computations. This is repeated for the selection of each seed. Therefore, the time complexity of finding $m$ nodes in the seed-set $S$ for ICM using OSSUM is $O(m(|V| + |E|))$.

7.5 The case of Generalized Linear Threshold

So far we have discussed influence maximization with the Independent Cascade Model. However, other diffusion models can be approached in a similar manner. In an earlier work [61], we have shown how a version of generalized linear threshold can be computed using our Unified Model.
In the context of signed networks we define the model as follows: at every time step $t$, the probability of a susceptible node becoming infected with red or blue depends on functions of the state of its incoming neighborhood, $f^+(In(u))$ and $f^-(In(u))$, respectively. If both red and blue infections are successful, node $u$ selects one of the infections at random. Here, we choose these functions to be normalized linear functions of the neighborhood:
\[ P(\text{red attempt on } u \text{ at } t) = f^+(In(u)) = \sum_{v \xrightarrow{+} u} p_{v,u} E^+_{v,t-1} + \sum_{v \xrightarrow{-} u} p_{v,u} E^-_{v,t-1} \]
and
\[ P(\text{blue attempt on } u \text{ at } t) = f^-(In(u)) = \sum_{v \xrightarrow{+} u} p_{v,u} E^-_{v,t-1} + \sum_{v \xrightarrow{-} u} p_{v,u} E^+_{v,t-1}, \]
where $E^+_{v,t-1}$ and $E^-_{v,t-1}$ are indicator variables that are 1 only if $v$ was infected with ‘red’ and ‘blue’, respectively, by time $t-1$, and $p_{v,u} = w_{v,u} / \sum_{v \to j} w_{v,j}$. For an unweighted graph, $p_{v,u} = 1/\text{outdeg}(v)$. The collective influence for the case of GLT can be calculated as
\[ r^+_{u,t} = \mathbb{E}(f^+(In(u))) = \sum_{v \xrightarrow{+} u} p_{v,u}\, \mathbb{E}(E^+_{v,t-1}) + \sum_{v \xrightarrow{-} u} p_{v,u}\, \mathbb{E}(E^-_{v,t-1}) = \sum_{v \xrightarrow{+} u} p_{v,u} B^+_{v,t-1} + \sum_{v \xrightarrow{-} u} p_{v,u} B^-_{v,t-1}. \tag{7.24} \]
And similarly,
\[ r^-_{u,t} = \mathbb{E}(f^-(In(u))) = \sum_{v \xrightarrow{+} u} p_{v,u} B^-_{v,t-1} + \sum_{v \xrightarrow{-} u} p_{v,u} B^+_{v,t-1}. \tag{7.25} \]
To use OSSUM for GLT, we use Equation 7.12 along with Equations 7.24 and 7.25 to get
\[ \sum_u \left( B^+_{u,t}(k,+) - B^+_{u,t} \right) = (1 - B^+_{k,t-1} - B^-_{k,t-1}) \left[ 1 + \sum_{k \xrightarrow{+} u} p_{k,u}(1 - r^-_{u,t})(1 - B^+_{u,t-1} - B^-_{u,t-1}) - \sum_{k \xrightarrow{-} u} p_{k,u}\, r^+_{u,t}(1 - B^+_{u,t-1} - B^-_{u,t-1}) \right] \tag{7.26} \]
and
\[ \sum_u \left( B^+_{u,t}(k,-) - B^+_{u,t} \right) = (1 - B^+_{k,t-1} - B^-_{k,t-1}) \left[ \sum_{k \xrightarrow{-} u} p_{k,u}(1 - r^-_{u,t})(1 - B^+_{u,t-1} - B^-_{u,t-1}) - \sum_{k \xrightarrow{+} u} p_{k,u}\, r^+_{u,t}(1 - B^+_{u,t-1} - B^-_{u,t-1}) \right]. \tag{7.27} \]
Finally, we use Equations 7.26 and 7.27 in Algorithm 8 to determine $k$ and $c$. Again, the complexity of the algorithm is $O(m(|V| + |E|))$, as explained in Section 7.4.2.

7.6 Experiments

7.6.1 Accuracy of Unified Model

Since the formula for computing the spread of infection is an approximation, we conduct a series of experiments to demonstrate the accuracy of our formula.
We assume that the infection trends obtained using 1000 simulations are the “ground truth”. We conducted these experiments on several Kronecker graphs [48] of varying sizes, densities, and fractions of negative links, as described below.

Varying graph size: $|V| \in \{2^{10}, 2^{11}, 2^{12}, 2^{13}, 2^{14}, 2^{15}\}$, density $(|E|/|V|) = 10$, fraction of negative links $= 0.7$.
Varying density: $|V| = 2^{12}$, fraction of positive links $= 0.75$, density $(|E|/|V|) \in \{2, 4, 8, 16, 32\}$.
Varying fraction of negative edges: $|V| = 2^{12}$, density $(|E|/|V|) = 10$, fraction of negative links $\in \{0, 0.25, 0.50, 0.75, 1\}$.

FIGURE 7.2: Errors of approximation with ICM on synthetic graphs. Panels: (a) RMS error for varying size; (b) Fractional error for varying size; (c) RMS error for varying density; (d) Fractional error for varying density; (e) RMS error for varying fraction of negative links; (f) Fractional error for varying fraction of negative links. Each panel plots MSE or Fractional Error against Size of Graph, Density (|E|/|V|), or Fraction of Negative Links, with one curve for each p in {0.025, 0.050, 0.100, 0.200, 0.400}.

For each combination of parameters we generated 5 Kronecker graphs.
We used the formula to find $B^+_{u,t}$, the probability of a node $u$ being infected with ‘red’ by time $t$, and compared it against the ground truth $\hat{B}^+_{u,t}$ obtained by running 1000 simulations. The errors from the 5 graphs were averaged to get the error for that configuration. For ICM, each experiment was done with varying parameter $p \in \{0.025, 0.05, 0.1, 0.2, 0.4\}$. The seed-set for each experiment was the top 50 nodes with the highest Effective Degree (negative outgoing links subtracted from positive outgoing links).

FIGURE 7.3: Errors of approximation with GLT on synthetic graphs. Panels: (a) RMS error for varying size; (b) Fractional error for varying size; (c) RMS error for varying density; (d) Fractional error for varying density; (e) RMS error for varying fraction of negative links; (f) Fractional error for varying fraction of negative links. Each panel plots MSE or Fractional Error against Size of Graph, Density (|E|/|V|), or Fraction of nodes with −ve links.

We compute the error of approximation using two measures: mean squared error (MSE) and fractional error. MSE at a given time is calculated as
\[ e^{mse}_t = \frac{\sum_u \left( B^+_{u,t} - \hat{B}^+_{u,t} \right)^2}{|V|}, \tag{7.28} \]
where $\hat{B}^+_{u,t}$ is the “ground truth” obtained after 1000 simulations. Fractional error at time $t$ is given by
\[ e^{f}_t = \frac{|\sigma_t - \hat{\sigma}_t|}{\hat{\sigma}_t}, \tag{7.29} \]
where $\sigma_t = \sum_u B_{u,t}$ and $\hat{\sigma}_t = \sum_u \hat{B}_{u,t}$.
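The two error measures can be computed in a few lines. The sketch below is illustrative only: it assumes the model output and the simulated ground truth are both available as dictionaries from node to probability at a fixed time $t$, and the function names `mse_error` and `fractional_error` are hypothetical. Normalizing the fractional error by the simulated spread is an assumption made here.

```python
def mse_error(b_model, b_truth):
    """Mean squared error over nodes between model and simulated probabilities."""
    nodes = list(b_truth)
    return sum((b_model[u] - b_truth[u]) ** 2 for u in nodes) / len(nodes)


def fractional_error(b_model, b_truth):
    """Relative difference between predicted and simulated total spread.

    The simulated spread is used as the denominator (an assumption).
    """
    spread_model = sum(b_model.values())
    spread_truth = sum(b_truth.values())
    return abs(spread_model - spread_truth) / spread_truth
```

Both functions take the values for a single time step, so evaluating a whole trend means calling them once per $t$.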
Figure 7.2 shows the results obtained on the Kronecker graphs for varying size, density and fraction of negative links. All the errors are small. MSE increases with $p$, which is expected, as higher $p$ implies higher probabilities of infection and hence more room for approximation error. Overall, errors are higher for large $p$ ($= 0.4$). However, $p = 0.4$ is a significantly larger diffusion probability than what is considered a “high” probability of influence in the literature [75]. Figure 7.3 shows the experimental results for GLT. Again, we observe that all the errors are small.

7.6.2 Seed-set Selection

To test the performance of OSSUM, we conducted experiments on two real-world datasets: Epinions and Slashdot [64]. Both consist of trust relations among users of the corresponding social media. The link from $u$ to $v$ is positive if $u$ trusts $v$ ($u$ considers $v$ a “friend”) and negative if $u$ distrusts $v$ ($u$ considers $v$ a “foe”). The influence graph is constructed by flipping the direction of these edges: if $u$ trusts $v$, then there is a positive link from $v$ to $u$, and if $u$ distrusts $v$, then there is a negative link from $v$ to $u$. The datasets are summarized in Table 7.1. For brevity, we present the experiments only for ICM.

TABLE 7.1: Summary of the datasets

                      Epinions    Slashdot
  # nodes             131828      82144
  # edges             841372      549202
  # positive edges    717667      425072
  # negative edges    123705      124130

We compared our heuristic with the following heuristics as baselines to solve the SiNiMax problem:

Positive Degree: The $m$ nodes with maximum positive outdegree are colored ‘red’.
Positive Degree Discount [30]: A heuristic designed for ICM, which performs a form of weighted discount based on the parameter $p$. This heuristic is applied after removing all negative edges, and the selected nodes are colored ‘red’.
Effective Degree: We define Effective Degree as the number of negative outlinks subtracted from the number of positive outlinks.
We arrange the nodes in decreasing order of the absolute value of Effective Degree and pick the top $m$ nodes. If two nodes have the same Effective Degree, they are sorted in an arbitrary order. Among these $m$ nodes, those with positive Effective Degree are colored ‘red’, and those with negative Effective Degree are colored ‘blue’.

FIGURE 7.4: Fraction of negative outlinks from nodes with varying degree. Panels: (a) Epinions; (b) Slashdot. Axes: Degree (Positive + Negative) versus Fraction of Outlinks that are Negative.

FIGURE 7.5: Influence spread achieved by the heuristics on graphs with varying fraction of negative links. Panels: (a) Epinions; (b) Slashdot. Axes: Fraction of Negative Links versus Influence Spread of Red, with curves for Positive Degree, Positive Degree Discount, Effective Degree and OSSUM±.

To study the effect of negative links on the influence spread achieved by the heuristics, we first ignored the actual signs of the links and assigned signs randomly. To decide on the right strategy for assigning the signs, we studied the density of negative links in the real datasets. Figure 7.4 shows the fraction of negative outlinks as a function of the outdegree (positive + negative outlinks) of the nodes. While the plot for Epinions seems to be slightly decreasing, that for Slashdot is uniform. Therefore, to randomly assign the signs on the links, we decided to select the signs from a uniform distribution.
The selection was done with the probability of a negative link ranging from 0 to 1, resulting in a range of network instances with fraction of negative links 0, 0.1, 0.2, ..., 1, respectively. Figure 7.5 shows the result. Observe that for both datasets, the spread achieved by all the heuristics is almost equal when the fraction of negative links is low. However, when the graph has more negative links, OSSUM significantly outperforms the other heuristics. The performance of Effective Degree is comparable to that of OSSUM, but it drops at 0.5. This is because when almost half of the links are negative, many nodes are likely to have zero Effective Degree, making it difficult to choose seeds based on this heuristic. The spread achieved by OSSUM is least when the fraction of negative links is around 0.5, which suggests that achieving high influence spread is more difficult when there are almost equal numbers of positive and negative links than when all the links are negative. Owing to the low complexity of Effective Degree and its close performance to OSSUM, if time is a constraint, Effective Degree can be used for a network which is dominated by either positive or negative links. On the other hand, for a network where the fraction of negative links is close to half, OSSUM should be used.

When more links are negative in a network, i.e., there are more distrust relations between individuals, it becomes important to include ‘blue’ colored nodes in the seed set for maximization of ‘red’ infection. This is demonstrated in Figure 7.6, which shows the fraction of nodes in the seed set included with ‘red’ color by OSSUM. The fraction drops very quickly when almost half of the links have negative signs. When the fraction of negative links reaches 0.7, almost all the nodes in the seed set are ‘blue’.

We also studied the influence spread achieved by the heuristics with varying size of seed set.
No significant difference between them was observed on the original graphs. This is due to the fact that the number of negative links in both graphs is small compared to the number of positive links (14.70% and 22.60% for Epinions and Slashdot, respectively). It follows from the results of our previous experiments (Figure 7.5) that the heuristics do not differ significantly when the fraction of negative links is less than 0.5. Therefore, we flipped the sign of all the edges in the original graphs so that there are 14.70% and 22.60% positive links in Epinions and Slashdot, respectively. The influence spreads achieved in these ‘flipped’ datasets are shown in Figure 7.7. Note that OSSUM outperforms the other heuristics in both cases. Again, this is consistent with our earlier claim that a significant advantage is observed with OSSUM compared to Positive Degree and Positive Degree Discount when the negative links outnumber the positive links.

FIGURE 7.6: Fraction of nodes in the seed-set included with ‘red’ color by OSSUM. The fraction drops very quickly when majority of links in the network are negative. Axes: Fraction of Negative Links vs. Fraction of Red Seeds. Curves: Epinions, Slashdot.

To get an insight into the characteristics of nodes selected by the heuristics, we evaluated a density estimate vs. Effective Degree of the node (see Figure 7.8). The density estimate (see footnote 1) fits a smooth normal kernel to the number of seed nodes having the given Effective Degree. While the Positive Degree and Positive Degree Discount heuristics select nodes with more positive outgoing links than negative ones, the Effective Degree heuristic tends to select nodes with a majority of negative links (due to the high density of negative links in the graph). On the other hand, the seed nodes selected by OSSUM are more distributed over the spectrum of Effective Degree, because it is able to make a better decision about which nodes to include.
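The Effective Degree baseline of Section 7.6.2 is simple enough to state in a few lines. The following is a minimal Python sketch of that rule, not the implementation used in the experiments. The edge-list format `(u, v, sign)`, the function name `effective_degree_seeds`, and the decision to skip nodes whose Effective Degree is zero are assumptions made for illustration.

```python
from collections import defaultdict


def effective_degree_seeds(edges, m):
    """Pick m seeds by |Effective Degree| and color them.

    edges: iterable of (u, v, sign), a directed link u -> v with sign +1 or -1
           (hypothetical input format).
    Returns a list of (node, color) pairs, color being 'red' or 'blue'.
    """
    eff = defaultdict(int)
    for u, _v, sign in edges:
        # Effective Degree = (# positive outlinks) - (# negative outlinks).
        eff[u] += 1 if sign > 0 else -1

    # Assumption: nodes with zero Effective Degree carry no signal and are skipped.
    candidates = [node for node in eff if eff[node] != 0]
    # Decreasing |Effective Degree|; ties keep insertion order (arbitrary, as in the text).
    ranked = sorted(candidates, key=lambda node: abs(eff[node]), reverse=True)[:m]
    return [(node, "red" if eff[node] > 0 else "blue") for node in ranked]


example_edges = [
    ("a", "b", +1), ("a", "c", +1), ("a", "d", -1),
    ("b", "a", -1), ("b", "c", -1), ("b", "d", -1),
    ("c", "a", +1),
]
seeds = effective_degree_seeds(example_edges, 2)
```

In this toy graph, node b has three negative outlinks and is therefore ranked first and colored ‘blue’, which mirrors the observation that Effective Degree favors nodes dominated by negative links when such links are dense.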
FIGURE 7.7: Influence spread achieved by varying size of the seed-set by the heuristics in the datasets after flipping the signs of edges. Panels: (a) Epinions, (b) Slashdot. Axes: Seed-set Size vs. Influence Spread of Red. Curves: Positive Degree, Positive Degree Discount, Effective Degree, OSSUM±.

FIGURE 7.8: Smooth density estimate of degree distribution of seed set nodes selected by the heuristics. Panels: (a) Epinions, (b) Slashdot. Axes: Effective Degree vs. Density Estimate. Curves: Positive Degree, Positive Degree Discount, Effective Degree, OSSUM±.

Footnote 1: ksdensity function in MATLAB

Chapter 8 Combating Fake News – A Network Approach

Online Social Networks have become prime sources for sharing news. Oftentimes “fake news”, a made-up story, propagates into the network and becomes accepted as “news”. The spread of misinformation poses a major challenge to society as it may influence people’s opinions and cause panic. For example, the fake news “Two explosions in White House and Obama is injured” that spread all around the Internet on April 23, 2013 led to 10 billion USD losses [76]. Similarly, misinformation about an impending major earthquake led thousands of people in Ghazni province to leave their houses in panic in the middle of the night [77]. A possible remedy is to detect fake news by analyzing content and automatically labeling it as fake [78, 79]. While this approach is effective in identifying fake news, it does not account for containment of its dissemination.
Therefore, an alternate mechanism pursued in the literature is to propagate the corresponding real news in the network [69, 80]. The motivation for doing so is based on the idea that “perceived realism of fake news is stronger among individuals with high exposure to fake news and low exposure to hard news than among those with high exposure to both fake and hard news” [81]. Several works based on this principle [69, 80] take the competing cascades approach. They attempt to minimize the reach of one opinion originating from a known set of individuals S by independently selecting another set of individuals I to propagate an alternate opinion (see Figure 8.1(a)). The goal is to ensure that an individual who receives an opinion from S is also likely to receive the alternate opinion from I. This optimal choice of I implicitly assumes that nodes in I are already aware of the alternate opinion to be propagated. This may be applicable to a scenario of propagation of popular false beliefs such as the Flat-Earth theory, where a set of people perpetuate the belief. However, note that in the case of containment of fake news, the alternate opinion (real news) cannot be determined unless the individuals in I are also aware of what the fake news to be countered is. For instance, suppose person A is propagating the news saying “NASA predicts an asteroid hitting Earth in 48 hours”. Only when person B comes to know about this misinformation can she counter it with the news that NASA has released no such statement, or perhaps that in reality NASA has instead stated that “a comet will be visible from Earth in 48 hours”. The competing cascades approach is thus not very effective in this problem setting, since I cannot be selected independently of S (I must receive news from S first!).
To address this limitation, we propose a fundamental departure from the competing cascades approach by posing the following problem: Given a set S of fake news initiators (which may be known or selected at random) in a network, we wish to find a set of nodes I such that a) the fake news from S is likely to reach I, and b) many other nodes are reachable from I (see Figure 8.1(b)). We refer to this problem of checking the activation of fake news as FActCheck.

In most cases, fake news originates from the same set of sources (see footnote 1). There can also be cases where the source of fake news may change, and any node can be a part of the initiator set S with a certain probability. Thus, the set I needs to be source-oblivious and has to be picked without knowledge of the exact initiators S, depending only on the underlying probability distribution. We show that our method can be modified to fit this case.

The set I should be assigned the task of propagating real news as it is well connected to the source of fake news as well as to the rest of the network. Nodes in I may be encouraged, through some form of reward, to research the credibility of all the news they receive, and share the corresponding real news. Ideally, if everyone were to fact-check before sharing news, there would not be any fake news. However, not everyone has the motivation and the expertise to do so. Therefore, a social media company can use its resources (which are finite) to incentivize k individuals. Different individuals may have different expertise in verification of the news. Moreover, the propagation of fake news may follow a different model compared to the propagation of real news. Both of these factors can be taken into account in our modeling, as described in Section 8.3.2.

(a) Competing Cascades: Two sets attempting to influence a node. (b) FActCheck: A subset of those who are aware of the fake news can influence a node.
FIGURE 8.1: Fundamental distinction between FActCheck and Competing Cascades.
FActCheck enforces the constraint that the news must pass through the set I.

Footnote 1: http://fortune.com/2016/11/28/map-fake-news/

Our work addresses the challenging issue of minimizing the effect of fake news in online social networks in a probabilistic setting of information diffusion, where a news item propagates through a link with some probability. Specifically, the contributions of this work are as follows.

• We define the Fake news Activation Checking (FActCheck) problem and prove its NP-Hardness along with the monotonicity and submodularity of its objective function (Section 8.2), which leads to a polynomial time algorithm called Approximate FActCheck (AFC) with a $(1 - 1/e - \varepsilon)$-approximation guarantee (Section 8.3).

• We propose a heuristic called Reduced Approximate FActCheck (RAFC) that provides a quality similar to AFC while reducing the runtime significantly (Section 8.3).

• We show that our methods are applicable to several other models, and can accommodate differences among individuals in terms of expertise and willingness to share real and fake news (Section 8.3.2).

• We show the effectiveness of AFC and RAFC in the problem of minimizing the effect of fake news propagation in social media using real-world datasets (Section 8.4).

• We demonstrate that AFC and RAFC also perform well for Fake News Immunization, where fake news dominates real news, i.e., a node prefers the adoption of fake news even if it has been exposed to the real news along with it. The task is to immunize k nodes so that the spread of fake news starting from a given set can be minimized (Section 8.5).

8.1 Related Work

Strategies used for constraining misinformation in online social networks have mainly focused on identifying a set of nodes in the network in order to disrupt the diffusion process when the source of misinformation is known [69, 76, 82].
[83] formulated the problem as an optimization problem of identifying a subset of individuals that need to be initially convinced to adopt the competing campaign so as to counteract the effect of misinformation. [80] studied the containment of rumors originating from a community and proposed a method to obtain the minimum number of needed protectors. [69, 84] proposed a method to limit viral propagation of misinformation in the context of two competing campaigns spreading in an online social network. [84] showed that when the set of nodes that initiated the spread of misinformation is unknown, the constraining problem is NP-hard and cannot be approximated in polynomial time. Even the greedy algorithm proposed in [84] has an “extremely slow execution due to the expensive task of estimating the marginal influence when a node is added to the current solution” [84]. To address this challenge, [77, 84] proposed community-based heuristics which rely on the property of structural trapping. Structural trapping, however, has been shown to be unable to effectively capture the true dynamics of memes in social media [85].

Additionally, misinformation containment in online social networks has been studied through (i) efficient approaches to detect fake news in the first place [86, 87], and (ii) inspecting every message traversing every node to stop those suspected of carrying misinformation [78, 79]. Our work differs from prior work on fake news detection, as we focus on containing the effect of misinformation propagation based on a diffusion model, rather than on detection using analysis of content. We study the problem of identifying a small set of nodes I tasked to fact-check and diffuse the real news against the fake news originating from a set of nodes S.
This differs from competing cascades [69, 88] approaches, as the set I obtained in our setting is important in both (i) detecting that fake news is spreading (i.e., by passing through nodes in I), and (ii) maximizing the number of nodes that are reached from S through I, so that the maximum number of nodes have exposure to the real news that counters the fake news.

8.2 Problem Definition

Let $X \to Y$ represent the event of a news item propagating from a node in $X$ to a node in $Y$ under some diffusion model, where $X, Y \subseteq V$ of graph $G(V, E)$. Suppose $X \xrightarrow{Z} Y$ is the event of a news item flowing from $X$ to $Y$ through a node in $Z$. Let $S$ be the set of fake news initiators. We wish to find a set of nodes $I$ that are most “central” in passing news to other nodes. The news from $S$ is likely to reach $I$, and many other nodes are likely to be reached if the news reaches $I$. Formally, we wish to maximize

$$\sigma(S, I) = E\Big(\sum_{u \in V \setminus S} \mathbb{I}\big(S \xrightarrow{I} u\big)\Big), \tag{8.1}$$

where $\mathbb{I}$ is the indicator function. Each set $I$ defines a set of Bernoulli variables, one per node in the vertex set $V \setminus S$, with success probability defined by reachability through at least one node in $I$. For the case when the set $S$ spreading fake news is known, we are interested in finding the set $I$ of size $k$ that maximizes the expected number of successes. We explore the extension of the problem when the set $S$ is unknown in Section 8.3.4. We formally define this problem as Fake news Activation Checking (FActCheck).

Definition 8.1. (FActCheck) Given a graph $G(V, E)$, a seed-set $S$, a model of diffusion $\mathcal{M}$ and an integer $k$, find $I \subseteq V \setminus S$, $|I| = k$, that maximizes $\sigma(S, I)$.

Henceforth, we proceed with the Independent Cascade Model (ICM) as the model of diffusion due to its popularity since its introduction [62]. We assume that both fake and real news spread under the same diffusion model. We also assume that once a node is exposed to the real news, it will no longer be willing to perceive fake news as real. We have the following theorem.

Theorem 8.2.
FActCheck is NP-Hard under the Independent Cascade Model.

Proof. We show that Set Cover can be reduced to FActCheck. Consider an instance of the Set Cover problem, where we are given a set $U$ and a collection of its subsets $X = \{X_1, X_2, \ldots, X_m\}$. The task is to decide if there exists a collection of $k \le m$ subsets $X' \subseteq X$ whose union is $U$. We construct a bipartite graph $G_b = (\mathcal{X} \cup \mathcal{U}, E_b)$, where a node $x_i \in \mathcal{X}$ represents the subset $X_i$, and a node $u_j \in \mathcal{U}$ represents an element of $U$. A directed edge $(x_i, u_j)$ with probability 1 exists if the corresponding element of $U$ belongs to $X_i$. We add a source node $s$ and add directed edges $(s, x_i)$, $\forall x_i \in \mathcal{X}$, with probability 1. The Set Cover problem is equivalent to deciding if there exists a set $I \subseteq \mathcal{X} \cup \mathcal{U}$ of size $k$ such that $\sigma(s, I) = k + |U|$.

Now, if a set cover of size $k$ exists, then selecting the nodes corresponding to those sets into $I$ leads to $\sigma(s, I) = k + |U|$, which is the maximum value that can be achieved. Conversely, suppose optimal FActCheck returns $|U| + k$. First, we observe that exactly $k$ out of the $|U| + k$ nodes should be in $\mathcal{X}$. This is because if fewer than $k$ nodes are from $\mathcal{X}$, then there are not enough nodes to make a total of $|U| + k$. More than $k$ nodes cannot come from $\mathcal{X}$, as there does not exist any path connecting two nodes in that set. Therefore, the remaining $|U|$ nodes must belong to $\mathcal{U}$, resulting in a set cover.

Suppose a node is selected uniformly from $V \setminus S$ (without knowledge of $I$ or the edges of $G$) and is connected to an additional node $t$ with edge probability 1. This construction helps define another objective function for FActCheck.

Theorem 8.3. $\arg\max_I P(S \xrightarrow{I} t) = \arg\max_I \sigma(S, I)$.

Proof. Let $X_u$ be the indicator variable which is 1 only if node $u$ is infected after removal of set $I$. Then,

$$\sigma(S, I) = E\Big(\sum_{u \in V \setminus S} \mathbb{I}(S \xrightarrow{I} u)\Big) = \sum_{u \in V \setminus S} E\big(\mathbb{I}(S \xrightarrow{I} u)\big) = \sum_{u \in V \setminus S} \Big(1 \cdot P(S \xrightarrow{I} u) + 0 \cdot \big(1 - P(S \xrightarrow{I} u)\big)\Big) = \sum_{u \in V \setminus S} P(S \xrightarrow{I} u). \tag{8.2}$$

Now,

$$P(S \xrightarrow{I} t) = \sum_{u \in V \setminus S} P\big((S \xrightarrow{I} u) \cap (u \to t)\big) = \sum_{u \in V \setminus S} P(S \xrightarrow{I} u)\, P(u \to t) = \sum_{u \in V \setminus S} P(S \xrightarrow{I} u)\, \frac{1}{|V \setminus S|} = \frac{1}{|V \setminus S|}\, \sigma(S, I). \tag{8.3}$$

Since $|V \setminus S|$ is a constant, it follows from Equations 8.2 and 8.3 that maximizing one leads to maximization of the other.

In terms of the live-edge graph obtained by sampling edges based on edge probabilities [62], FActCheck translates to finding a set $I$ with cardinality $k$ that maximizes the probability of existence of a path from $S$ to $t$ containing a node in $I$.

8.2.1 Submodularity Property

To develop a polynomial time algorithm for FActCheck, we introduce the following theorem.

Theorem 8.4. The function $f_S(I) = P(S \xrightarrow{I} t)$ is monotonic and submodular for the Independent Cascade Model.

Proof. We proceed by considering an instance of a live-edge graph $\mathcal{G}$ obtained by sampling edges from $G$. Let $f_S^{\mathcal{G}}(I)$ be the indicator function of the event $S \xrightarrow{I} t$ in the sample $\mathcal{G}$. Then, $f_S^{\mathcal{G}}(I) = 1$ if there is a path from $S$ to $t$ in $\mathcal{G}$ containing a node in $I$, and 0 otherwise. Clearly, if there is a path from $S$ to $t$ going through $I$, then for any choice of $v$, the path still exists going through $I \cup \{v\}$ to $t$. Addition of a node to set $I$ can only create a path to $t$, and hence, $f_S^{\mathcal{G}}(I)$ is monotonic. Now, consider another set $I'$ such that $I \subseteq I'$. If $f_S^{\mathcal{G}}(I' \cup \{v\}) - f_S^{\mathcal{G}}(I')$ is 1, then the path to $t$ goes through $I' \cup \{v\}$, and not through $I'$. So the path must go through $v$. In that case, as $I \subseteq I'$, there cannot exist a path through $I$ either. Hence, $f_S^{\mathcal{G}}(I \cup \{v\}) - f_S^{\mathcal{G}}(I)$ must also be 1. Therefore,

$$f_S^{\mathcal{G}}(I \cup \{v\}) - f_S^{\mathcal{G}}(I) \ge f_S^{\mathcal{G}}(I' \cup \{v\}) - f_S^{\mathcal{G}}(I'). \tag{8.4}$$

The probability of a news item reaching from $S$ to $t$ through $I$ is the expectation of the event that there is a path from $S$ to $t$ in $\mathcal{G}$ that contains a node in $I$. Therefore,

$$f_S(I) = \sum_{\mathcal{G}} P(\mathcal{G})\, f_S^{\mathcal{G}}(I). \tag{8.5}$$

Since $f_S(I)$ is a non-negative linear combination of monotonic and submodular functions, $f_S(I)$ must also be monotonic and submodular.

Theorem 8.4 implies that a greedy algorithm that incrementally selects $k$ nodes into $I$, one at a time, maximizing the marginal increase in $P(S \xrightarrow{I} t)$, guarantees a $(1 - 1/e)$-approximation [89] for FActCheck.

8.2.2 Principle of Reversibility

We use the notation $P(u \to v \mid G)$ to represent the probability of a news item reaching $v$ from $u$ in graph $G$.

Theorem 8.5 (Principle of Reversibility). Let $G^R$ be the graph obtained by reversing all the edges in graph $G$. Then, under the Independent Cascade Model, for any two nodes $u$ and $v$, $P(u \to v \mid G) = P(v \to u \mid G^R)$.

Although this has been used in the literature [90], we state Theorem 8.5 and present a proof here for the sake of completeness.

Proof. For every instance of a live-edge graph $\mathcal{G}$ of $G$, reversing the edges produces an instance of a live-edge graph $\mathcal{G}^R$ of $G^R$. Since the edge probabilities for ICM are the same in both $G$ and $G^R$, $P(\mathcal{G}^R) = P(\mathcal{G})$. Also, if there is a path from $u$ to $v$ in $\mathcal{G}$ (denoted by the indicator variable $\mathbb{I}(u \to v \mid \mathcal{G})$), then there is a path from $v$ to $u$ in $\mathcal{G}^R$, i.e., $\mathbb{I}(u \to v \mid \mathcal{G}) = \mathbb{I}(v \to u \mid \mathcal{G}^R)$. Therefore,

$$P(u \to v \mid G) = \sum_{\mathcal{G}} P(\mathcal{G})\, \mathbb{I}(u \to v \mid \mathcal{G}) = \sum_{\mathcal{G}^R} P(\mathcal{G}^R)\, \mathbb{I}(v \to u \mid \mathcal{G}^R) = P(v \to u \mid G^R). \tag{8.6}$$

8.3 Proposed Algorithm for FActCheck

We proceed to build an efficient algorithm for FActCheck. The approach is a greedy solution that starts with an empty set $I_0 = \emptyset$ and, in every iteration, adds to the current set $I_0$ a node $u \in V \setminus (S \cup I_0)$ that maximizes $P(S \xrightarrow{I_0 \cup \{u\}} t)$. We observe that

$$P(S \xrightarrow{I} t) = P\Big(\bigcup_{v \in I} (S \xrightarrow{v} t)\Big) = P\Big(\bigcup_{v \in I} (t \xrightarrow{v} S) \,\Big|\, G^R\Big). \tag{8.7}$$

The last equality follows from the Principle of Reversibility (Theorem 8.5). Therefore,

$$P(S \xrightarrow{I} t) = \sum_{\mathcal{G}} P(\mathcal{G})\, \mathbb{I}\Big(\bigcup_{v \in I} (t \xrightarrow{v} S) \,\Big|\, \mathcal{G}^R\Big). \tag{8.8}$$

Algorithm 9 Pruned RR Generation
    function GENERATEPRUNEDRR(G, S)
        𝒢 ← ∅, Q ← ∅, Q1 ← ∅
        Pick a random node v and push it into queue Q
        visit[v] = true
        while Q is not empty do                          ▷ Backward BFS
            i ← Pop from Q
            for each (j, i) ∈ G do
                if p(j, i) > random() and visit[j] == false then
                    Add (j, i) to 𝒢
                    Push j into Q, visit[j] = true
                    if j ∈ S then
                        Push j into queue Q1, visit1[j] = true
                    end if
                end if
            end for
        end while
        A ← ∅
        while Q1 is not empty do                         ▷ Forward BFS
            i ← Pop from Q1
            for each (i, j) ∈ 𝒢 do
                if visit1[j] == false then
                    Push j into Q1, visit1[j] = true
                    A ← A ∪ {j}
                end if
            end for
        end while
        return A
    end function

To simulate this, we perform a traversal of $G^R$ (“backward BFS”) starting from a randomly selected node, generating a live-edge graph $\mathcal{G}$ as we proceed, i.e., keeping an edge $(i, j)$ with probability $p(i, j)$. Then we perform a traversal on $\mathcal{G}$ (“forward BFS”) starting from the nodes in $S$ that are present in $\mathcal{G}$. The nodes visited in the process constitute a “Pruned Reverse Reachable” (Pruned RR) set. Algorithm 9 presents the details of generating a Pruned RR. By construction, if there exists $v \in I$ which is present in the Pruned RR, then $\mathbb{I}(S \xrightarrow{I} t) = 1$. Therefore, if $A$ is a randomly generated Pruned RR, then

$$P(S \xrightarrow{I} t) = E\big(\mathbb{I}(S \xrightarrow{I} t)\big) = E\big(\mathbb{I}(|A \cap I| > 0)\big). \tag{8.9}$$

This probability can be estimated by generating $\theta$ Pruned RRs, where $\theta$ is “very large”. Once $\theta$ Pruned RRs have been generated, we can apply the greedy selection of nodes (Algorithm 10) to construct the desired set $I_k$.

Algorithm 10 Greedy Selection of Nodes
    function GREEDYSELECT(R, k)
        I_k ← ∅
        for all A ∈ R do
            for all v ∈ A do
                Count[v] ← Count[v] + 1
            end for
        end for
        for i = 1 → k do
            v ← arg max_{v ∈ V∖I_k} Count[v]
            I_k ← I_k ∪ {v}
            for all A ∈ R such that v ∈ A do
                for all w ∈ A, Count[w] ← Count[w] − 1
                R ← R ∖ {A}                              ▷ a covered set is discounted only once
            end for
        end for
        return I_k
    end function

The number of Pruned RRs $\theta$ that need to be generated is the same as the number of times set $A$ needs to be sampled so that greedy selection of $I$ guarantees a $(1 - 1/e - \varepsilon)$-approximation for the optimal value of $E(\mathbb{I}(|A \cap I| > 0))$. This is given by [91]

$$\theta = \frac{n\,\big(2 + 2\sqrt{2}\,\varepsilon/3\big)\Big(\log\binom{n}{k} + \log n + \log\log_2 n\Big)}{2\,\varepsilon^2\, OPT}, \tag{8.10}$$

where $n = |V \setminus S|$ and $OPT$ is the optimal value of $n P(S \xrightarrow{I} t)$. We refer to this method of generating $\theta$ Pruned RR sets and applying Algorithm 10 to solve FActCheck as Approximate FActCheck (AFC).

8.3.1 Graph Reduction

One drawback of using AFC for FActCheck is that, due to the small size of $OPT$ and the large size of the graph, the number of iterations required may be very high, as $\theta \propto n/OPT$. To address this, we propose a reduction step that first reduces the size of the graph before applying AFC, to obtain a reasonable solution in less time.

Theorem 8.6. Let $V_S$ be the set of nodes that are reachable from $S$ in at least one instance of a live-edge graph in $R = \log n/\mu$ randomly generated instances. Then w.h.p., $P\big(\bigcup_{v \in V \setminus V_S} (S \to v)\big) < \mu$.

Proof. Let $p_0 = P\big(\bigcup_{v \in V \setminus V_S} (S \to v)\big)$. Suppose $p_0 \ge \mu$. Then the probability of the event $\bigcup_{v \in V \setminus V_S} (S \to v)$ not happening in $R$ independent trials is

$$(1 - p_0)^R \le (1 - \mu)^R = (1 - \mu)^{\log n/\mu} = \exp\Big(\log n \cdot \frac{\log(1 - \mu)}{\mu}\Big).$$

Since $\log(1 - \mu) < -\mu$, we have $\frac{\log(1 - \mu)}{\mu} < -1$. Therefore,

$$(1 - p_0)^R \le \exp\Big(\log n \cdot \frac{\log(1 - \mu)}{\mu}\Big) \le \exp(-\log n) = \frac{1}{n}.$$

In fact, by setting $R = \alpha \log n/\mu$ with a proper choice of $\alpha$, we can remove nodes that are reached at most $r$ times and guarantee that w.h.p., $p_i = P(S \to v_i) < \mu$, $\forall v_i \in V \setminus V_S$. This follows in a similar way as the proof of Theorem 8.6. The probability of $v_i$, for which $p_i > \mu$, being reachable in at most $r$ live-edge graphs is

$$\sum_{j=0}^{r} \binom{R}{j} \mu^j (1 - \mu)^{R - j} \le \sum_{j=0}^{r} \frac{R^j}{j!}\, \mu^j (1 - \mu)^{-j} (1 - \mu)^R = \sum_{j=0}^{r} \frac{(\alpha \log n)^j}{j!\,(1 - \mu)^j}\, (1 - \mu)^{\alpha \log n/\mu} \le \sum_{j=0}^{r} \frac{(\alpha \log n)^j}{j!\,(1 - \mu)^j}\, n^{-\alpha}.$$

By the union bound, the probability of existence of at least one $v_i$ such that $p_i > \mu$ is upper bounded by

$$n \sum_{j=0}^{r} \frac{(\alpha \log n)^j}{j!\,(1 - \mu)^j}\, n^{-\alpha} = n^{1 - \alpha} \sum_{j=0}^{r} \frac{(\alpha \log n)^j}{j!\,(1 - \mu)^j}. \tag{8.11}$$

To simplify, if we assume $r \le \log n$, then the last term of the sum is the greatest and we can replace all the terms with the last term, i.e., $\frac{(\alpha \log n)^r}{r!\,(1 - \mu)^r}$. We obtain the following looser upper bound:

$$\frac{\alpha^r}{(r - 1)!\,(1 - \mu)^r} \cdot \frac{(\log n)^r}{n^{\alpha - 1}}. \tag{8.12}$$

For a given $r$, we can select $\alpha$ so that this bound is below a desired threshold. We simulate the diffusion process starting with the source $S$, $\alpha \log n/\mu$ times. All nodes that are reached in at least $r$ simulations are added to $V_S$. Then, we run AFC on $G_S$, the graph induced by $V_S$ (where $|V_S \setminus S| = n_S \le n$), which is smaller than $G$. Let $I_S$ be the optimal solution for FActCheck on $G_S$, leading to a value $OPT_S$.

Algorithm 11 Reduced Approximate FActCheck (RAFC)
    function RAFC(G, k, μ, r)
        if r == 1 then
            α ← 1
        else
            Pick α such that the bound given by (8.12) is small
        end if
        V_S ← ∅
        for i ∈ 1 → α log n/μ do
            Do BFS on G from S constructing a live-edge graph
            ∀ v visited, count[v] ← count[v] + 1
        end for
        ∀ v such that count[v] ≥ r, V_S ← V_S ∪ {v}
        G_S ← Graph induced by V_S on G
        I_k ← AFC(G_S, k, ε)
        return I_k
    end function

$$OPT_S = \sigma(S, I_S \mid G) \ge \sigma(S, I_S \mid G_S) \ge \sigma(S, I^* \mid G_S) = \sigma(S, I^* \mid G) - \sum_{u \in V \setminus V_S} P(S \xrightarrow{I^*} u) \ge \sigma(S, I^* \mid G) - \sum_{u \in V \setminus V_S} P(S \to u) \ge OPT - (n - n_S)\,\mu \quad \text{[w.h.p.]}$$

Here, $I^*$ is the optimal solution of FActCheck on $G$. The last inequality follows from Theorem 8.6, since $P(S \to u) \le P\big(\bigcup_{v \in V \setminus V_S} (S \to v)\big) < \mu$ for every $u \in V \setminus V_S$. Therefore, the difference between the optimal value obtained while considering only $G_S$ and the original optimal value is bounded by $(n - n_S)\mu$, which can be made arbitrarily close to 0.
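Algorithms 9 and 10 can be read together as a sampling step followed by a maximum-coverage step. The following is a minimal Python sketch of that reading, not the thesis implementation. The adjacency format `in_edges`, the function names, the choice to draw the root only from non-seed nodes, and the `covered` bookkeeping (so that a sampled set is discounted once) are assumptions made for illustration; the number of samples is fixed by hand rather than taken from Equation 8.10.

```python
import random
from collections import defaultdict, deque


def pruned_rr(in_edges, seeds, candidates, rng):
    """Sample one Pruned RR set in the spirit of Algorithm 9.

    in_edges:   {i: [(j, p), ...]} for every edge j -> i with probability p.
    seeds:      set S of fake-news initiators.
    candidates: list of non-seed nodes from which the root is drawn (assumption).
    """
    root = rng.choice(candidates)
    visited = {root}
    live_out = defaultdict(list)  # live edges, stored in forward direction j -> i
    seed_queue = deque()
    queue = deque([root])
    while queue:  # backward BFS, flipping each edge coin on the fly
        i = queue.popleft()
        for j, p in in_edges.get(i, ()):
            if j not in visited and rng.random() < p:
                visited.add(j)
                live_out[j].append(i)
                queue.append(j)
                if j in seeds:
                    seed_queue.append(j)
    seen = set(seed_queue)
    members = set()
    while seed_queue:  # forward BFS from the seeds inside the live-edge graph
        i = seed_queue.popleft()
        for j in live_out[i]:
            if j not in seen:
                seen.add(j)
                members.add(j)
                seed_queue.append(j)
    return members


def greedy_select(rr_sets, k):
    """Greedy maximum coverage over the sampled sets (cf. Algorithm 10)."""
    count = defaultdict(int)
    containing = defaultdict(list)
    for idx, rr in enumerate(rr_sets):
        for v in rr:
            count[v] += 1
            containing[v].append(idx)
    covered = [False] * len(rr_sets)
    chosen = []
    for _ in range(k):
        remaining = [v for v in count if v not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda v: count[v])
        chosen.append(best)
        for idx in containing[best]:
            if not covered[idx]:  # each sampled set is discounted only once
                covered[idx] = True
                for w in rr_sets[idx]:
                    count[w] -= 1
    return chosen


# Toy graph with certain edges: s -> a, a -> b, a -> c, s -> d.
in_edges = {"a": [("s", 1.0)], "b": [("a", 1.0)], "c": [("a", 1.0)], "d": [("s", 1.0)]}
rng = random.Random(0)
samples = [pruned_rr(in_edges, {"s"}, ["a", "b", "c", "d"], rng) for _ in range(200)]
picked = greedy_select(samples, 1)
```

In the toy graph every path from s to a, b or c passes through a, so a appears in roughly three quarters of the sampled sets and is the natural single fact-checker; the fraction of sampled sets hit by a chosen I is the estimate of $P(S \xrightarrow{I} t)$ from Equation 8.9.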
Under the assumption that the optimal value for FActCheck is large compared to 1, $\mu = 1/n$ ensures that the new optimal value differs from the true value by a quantity less than 1. In that case, the number of simulations is $\log n/\mu = n \log n$. In each of the simulations, the worst case requires visiting all the edges, where $m = |E|$. Therefore, the worst-case complexity is $O(mn \log n)$. However, in practice, not much improvement is observed with decreasing the value of $\mu$. In fact, we experimentally demonstrate that $\mu = 0.1$ is sufficient to produce good quality results, thus making the complexity $O(m \log n)$.

8.3.2 Realistic Extensions

Here we present some realistic extensions to our problem setting to accommodate (i) different diffusion models, (ii) importance of nodes, (iii) expertise of individuals to fact-check, and (iv) differences in willingness to share real and fake news. These are straightforward mathematical extensions, and therefore, we do not present experiments for these variations.

8.3.2.1 Choice of Diffusion Model

Although we have presented the results for ICM, our algorithms are applicable to any model that has a live-edge graph representation, such as the Linear Threshold Model [62]. A popular model of diffusion in epidemiology is the Susceptible-Infected-Recovered (SIR) model [3]. Given that $u$ has been infected, let $\alpha$ be the probability of node $u$ infecting a neighbor $v$, and $\beta$ be the probability of its recovery. Then, an infection flows through the edge $(u, v)$ if $u$ was infected for $j - 1$ time units and recovers in the $j$-th time step, for some $j \ge 1$:

$$p(u, v) = \sum_{j=1}^{\infty} \big(\alpha(1 - \beta)\big)^{j-1} \beta = \frac{\beta}{1 - \alpha(1 - \beta)}. \tag{8.13}$$

Therefore, our algorithms are also applicable to the SIR model.

8.3.2.2 Importance of Nodes

We have assumed that every node has equal utility, i.e., we are attempting to maximize the number of nodes reached through $I$. Suppose reaching a node $v$ has a utility of $U_v$, and now we wish to maximize the expected utility, i.e., $\sigma_U(S, I) = \sum_{v} U_v\, P(S \xrightarrow{I} v)$.
This is solved simply by selecting $v$ to connect to the sink node $t$ with probability $U_v / \sum_{w} U_w$. Proceeding as in the proof of Theorem 8.3, $\arg\max_I P(S \xrightarrow{I} t) = \arg\max_I \sigma_U(S, I)$.

8.3.2.3 Individual Expertise

We have assumed that all individuals have equal expertise in checking the fake news and propagating the corresponding real news. While this is not always true, individuals’ expertise/willingness to check fake news can be easily incorporated into the model as follows: Suppose a node $u$ can successfully check fake news with probability $f$; then from every Pruned RR that contains $u$, we can randomly remove $u$ with probability $1 - f$. This parameter $f \in [0, 1]$ models the degree of domain expertise. For an expert, $f = 1$.

8.3.2.4 Willingness to share

We have assumed that the probability with which a news item is shared ($p(u, v)$) is independent of whether the news is fake or real, which might not hold true in reality. On one hand, one might be more willing to share fake news compared to real news because it tends to be alluring. On the other hand, the content of the fake news might make the person question its credibility and make it less likely to be shared. Also, that person might be more willing to share the real news refuting the fake news as a “Good Samaritan”. Such a setting, which models the spread of fake and real news with different probabilities, can be easily addressed by our formulation of FActCheck as follows. Suppose $F(V_f, E_f)$ represents the graph corresponding to the fake news sharing probability $p_f(u_f, v_f)$, and $H(V_h, E_h)$ represents the graph corresponding to the real news sharing probability $p_h(u_h, v_h)$. There are nodes corresponding to the same person in both graphs, i.e., $u_f$ and $u_h$ refer to the same individual. We construct graph $G(V, E)$, where $V = V_f \cup V_h$ and $E = E_f \cup E_h \cup E_{fh}$.
Here, $E_{fh}$ is constructed by adding directed edges $(u_f, u_h)$ and $(u_h, u_f)$ between all pairs of nodes that correspond to the same individual (see Figure 8.2). The set of initiators is $S \subseteq V_f$, and we would maximize $P(S \xrightarrow{I} t)$, $I \subseteq V_h$, where $t$ is connected from a randomly selected node $v_h \in V_h$, which can be solved using AFC.

FIGURE 8.2: Taking into account different diffusion probabilities for sharing real and fake news.

8.3.3 Complexity Analysis

Let $\theta$ denote the number of Pruned RR sets generated. The time complexity of greedy selection (Algorithm 10) is $O(k \sum_{A \in R} |A|)$, which is approximately $O(\theta\, E(|A|))$. Note from Algorithm 9 that if the number of edges explored in one simulation is $w$, then $E(w) \ge E(|A|)$. Therefore, the time complexity of AFC is dominated by $O(\theta\, E(w))$. Some algebraic manipulation results in this value being $O\big(\frac{nk \log n}{\varepsilon^2} \cdot \frac{E(w)}{OPT}\big)$. The dependence on $OPT$ makes the complexity hard to analyze, but we can obtain some loose upper bounds. If $\nu$ is a node chosen at random with probability proportional to its number of incoming edges, then it can be shown that $E(w) \le m\, E(\sigma(\nu))/n$ [91]. To simplify the analysis, we assume that $S$ has a higher expected reach, i.e., $\sigma(S) \ge E(\sigma(\nu))$, so that we have

$$E(w) \le \frac{m}{n}\, \sigma(S). \tag{8.14}$$

Let $d$ be the number of nodes connected with an edge to one of the nodes in $S$. If we were to find a set of size $d$ for FActCheck, then the optimal value is $\sigma(S, I_d) = \sigma(S)$, where $I_d$ consists of all the neighbors of nodes in $S$. Since AFC selects $k$ nodes greedily, by the submodularity property [89],

$$\sigma(S, I_k) \ge \big(1 - e^{-k/d}\big)\, \sigma(S) \approx \frac{k}{d}\, \sigma(S). \tag{8.15}$$

Combining Equations 8.14 and 8.15 with the fact that $OPT \ge \sigma(S, I_k)$, we get

$$\frac{E(w)}{OPT} \le \frac{(m/n)\, \sigma(S)}{(k/d)\, \sigma(S)} = \frac{md}{nk}. \tag{8.16}$$

Therefore, under the assumption that $\sigma(S) > E(\sigma(\nu))$, the complexity of AFC is $O(md \log n/\varepsilon^2)$. RAFC reduces the size of the graph, thus reducing $m$ and $n$, which in turn reduces the runtime. It is difficult to estimate the extent of reduction theoretically as it depends on the given graph.
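Since the extent of the reduction depends on the graph, it is easiest to measure it directly. The following is a minimal Python sketch of the reduction step of RAFC (the loop of Algorithm 11 before AFC is called), not the thesis implementation. The adjacency format `out_edges`, the function name `reduce_graph`, the use of the natural logarithm, and the floor of 2 on n are assumptions made for illustration.

```python
import math
import random
from collections import defaultdict, deque


def reduce_graph(out_edges, seeds, mu, r=1, alpha=1.0, rng=None):
    """Keep nodes reached from the seeds in at least r of alpha*log(n)/mu simulations.

    out_edges: {u: [(v, p), ...]} for every edge u -> v with probability p.
    Returns (kept_nodes, induced_out_edges).
    """
    rng = rng or random.Random()
    nodes = set(out_edges) | {v for nbrs in out_edges.values() for v, _p in nbrs}
    n = max(len(nodes - set(seeds)), 2)  # n = |V \ S|; floor of 2 keeps log(n) > 0
    runs = math.ceil(alpha * math.log(n) / mu)  # natural log: an assumption

    count = defaultdict(int)
    for _ in range(runs):
        visited = set(seeds)
        queue = deque(seeds)
        while queue:  # forward BFS on a lazily sampled live-edge graph
            u = queue.popleft()
            for v, p in out_edges.get(u, ()):
                if v not in visited and rng.random() < p:
                    visited.add(v)
                    queue.append(v)
        for v in visited:
            count[v] += 1

    kept = {v for v, c in count.items() if c >= r}
    induced = {
        u: [(v, p) for v, p in nbrs if v in kept]
        for u, nbrs in out_edges.items()
        if u in kept
    }
    return kept, induced


# Toy graph: e is never reached (probability 0), c and d are disconnected from s.
toy = {"s": [("a", 1.0), ("e", 0.0)], "a": [("b", 1.0)], "c": [("d", 1.0)]}
kept, induced = reduce_graph(toy, ["s"], mu=0.1, rng=random.Random(1))
```

The ratio `len(kept)` to the number of nodes is the quantity one would compare across graphs and values of mu before deciding whether the reduction is worth running.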
The complexity has been derived under the assumption that $\sigma(S) \ge \mathbb{E}(\sigma(\nu))$. This is also the reason for choosing to perform “backward BFS” first in Algorithm 9: we assume that the number of edges explored this way is smaller than the number explored if we started with “forward BFS”. The average number of edges explored in both cases can be estimated by running a few simulations, and we can proceed with the sequence that has the lower execution time. However, in theory, it is possible to construct a graph where the complexity is extremely high irrespective of the choice. For instance, take a complete graph on $n - 1$ vertices, with edge weights arbitrarily close to 1. Now add a node $s$, and connect it to each node $v_i$ with edge probability $p(s, v_i) = \delta_i$. Suppose $s$ is the seed node. Now, both “forward BFS” and “backward BFS” are expected to explore $\Theta(n)$ edges. On the other hand, the $\delta_i$ can be chosen to make $OPT$ arbitrarily close to 0, making the complexity arbitrarily high. But such situations do not arise in real-life graphs, and experimentally obtained execution times for AFC and RAFC suggest that our methods have low runtime.

8.3.4 Random Source of Fake News

Next, we consider the case when the source $S$ is not determined and the fake news can originate from any set of nodes, i.e., node $v$ shares the fake news with probability $q(v)$. This problem setting can be easily reduced to FActCheck by adding a node $s$ and an edge $(s, v)$ with edge probability $p(s, v) = q(v)$, $\forall v$. Using this simple construction, which adds $O(1)$ nodes and $O(|E|)$ edges to the original graph $G$, we can find the set $I$ that maximizes $\sigma(s, I)$ in the modified graph using AFC or RAFC (Algorithm 11). A different bound on the complexity of AFC can be obtained for this problem setting as follows. We make the simplifying assumption that $q(v) = q$, $\forall v$. Let $o$ be the node that maximizes $\sigma(o)$. Then $\sigma(o) \ge \mathbb{E}(\sigma(\nu))$, and therefore $\mathbb{E}(w) = m\, \mathbb{E}(\sigma(\nu)) / n \le m\, \sigma(o) / n$. Also, $OPT \ge \sigma(s, o) = q\, \sigma(o)$.
Therefore the complexity $O\left(\frac{nk \log n}{\epsilon^2} \cdot \frac{\mathbb{E}(w)}{OPT}\right)$ is bounded by $O\left(\frac{mk \log n}{q \epsilon^2}\right)$.

TABLE 8.1: Reduction details

Dataset                      HEPT      Twitter      LJ
# of nodes                   15,233    3,919,215    4,847,571
# of edges                   62,774    5,399,949    68,475,391
|V_S|/|V| for μ = 0.1        0.4319    0.0200       0.4436
|V_S|/|V| for μ = 0.01       0.4830    0.0559       -
|V_S|/|V| for μ = 0.001      0.5061    0.1075       -
|V_S|/|V| for μ = 0.0001     0.5188    0.1489       -

8.4 Experiments

In this section, we present the experimental results of our AFC and RAFC (Algorithm 11) on three real-world networks. Our objective is twofold: (i) to study the effect of our reduction step on the execution time and quality of results, and (ii) to compare the performance of our proposed methods, AFC and RAFC, in FActCheck with baselines when the source of fake news is (a) known and (b) unknown.

8.4.1 Datasets

We conducted the experiments on the following real-world graphs:

HEPT: Collaboration network [30] from the ‘High Energy Physics - Theory’ section of arXiv, considering papers from 1991 to 2003. The graph is undirected with 15,233 nodes and 62,744 edges, and has weighted edges.
Twitter: A directed Twitter mention network (available at http://trec.nist.gov/data/tweets/) where a link represents one user mentioning another in a tweet. The network consists of approximately 3.9 million nodes and 5.4 million edges.
LJ: A directed network of LiveJournal (available at https://snap.stanford.edu/data/soc-LiveJournal1.html) where a link indicates that one user is friends with the other. The network consists of approximately 4.8 million nodes and 68.5 million edges.

8.4.2 Setup

The probability of influence for each edge $(u, v)$ was set to $p(u, v) = 1/d_v$, where $d_v = \sum_j \mathrm{weight}(v, j)$ is the sum of the outgoing edge weights. This scheme is often called the Weighted Cascade Model [62], a special case of ICM. The seed set $S$ was set to the 50 nodes with the highest out-degree ($\sum_v p(u, v)$). All the experiments were performed on a 2x16 core Intel(R) Xeon(R) CPU running at 2.60GHz with 128GB RAM.
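The edge-probability rule and the seed selection described in the setup can be illustrated with a minimal Python sketch. It is not the code used in the experiments: the function names, the input format (a list of `(u, v, weight)` triples), the skipping of edges into nodes with zero outgoing weight, and the cap at 1 are all assumptions made for this example.

```python
from collections import defaultdict


def weighted_cascade_probabilities(weighted_edges):
    """Map each directed edge (u, v) to p(u, v) = 1 / d_v.

    weighted_edges: list of (u, v, weight) triples (hypothetical input format).
    d_v is the sum of the weights on edges leaving v, as in the setup text.
    """
    edges = list(weighted_edges)
    d = defaultdict(float)
    for u, _v, w in edges:
        d[u] += w
    prob = {}
    for u, v, _w in edges:
        if d[v] > 0:  # assumption: skip edges into nodes with no outgoing weight
            prob[(u, v)] = min(1.0, 1.0 / d[v])  # cap so p stays a probability
    return prob


def top_out_degree_seeds(prob, k):
    """Return the k nodes with the largest sum of outgoing probabilities."""
    score = defaultdict(float)
    for (u, _v), p in prob.items():
        score[u] += p
    # Ties are broken by node name only to make the result deterministic.
    return sorted(score, key=lambda u: (-score[u], str(u)))[:k]


# Toy undirected star a-b, a-c, stored as two directed edges per link.
edges = [("a", "b", 1), ("b", "a", 1), ("a", "c", 1), ("c", "a", 1)]
p = weighted_cascade_probabilities(edges)
seeds = top_out_degree_seeds(p, k=1)
```

In the toy graph the hub `a` has $d_a = 2$, so each edge into `a` gets probability 0.5, while edges into the leaves get probability 1; `a` is therefore the top seed by out-degree.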
The C++ code required for our experiments has been made publicly available (at https://goo.gl/ynSFuu).

FIGURE 8.3: Comparison of $\sigma(S, I)$ vs execution time for varying values of reduction parameter $\mu$. [Plot panels: (a) HEPT, (b) Twitter, (c) LJ; x-axis: execution time, y-axis: $\sigma(S, I)$; one point per value of $\mu$ plus ‘No Reduction’.]

8.4.3 Quality of Reduction

We have proposed two algorithms: AFC, and RAFC, which is obtained by reducing the graph as described in Section 8.3.1. We performed a series of experiments to measure the effect of this reduction on the execution time and the quality of the results obtained, i.e., $\sigma(S, I)$. We used $\epsilon = 0.1$ in Algorithm 9, thus getting an approximation guarantee of $(1 - 1/e - 0.1)$. For RAFC, we used $\mu = 0.1, 0.01, 0.001, \ldots$ to reduce the graph size, and then we ran Algorithm 9 on the reduced graph. We decreased the value of $\mu$ exponentially until the reduced graph was of approximately the same size as the original. In the case of HEPT we stopped when the execution time of RAFC was significantly larger than that of AFC. Table 8.1 shows the reduction in graph size obtained for each dataset with varying $\mu$.

FIGURE 8.4: Comparison of $\sigma(S, I)$ vs $|I|$ for different methods when the source is known. [Plot panels: (a) HEPT, (b) Twitter, (c) LJ; x-axis: $|I|$, y-axis: $\sigma(S, I)$; methods: OutDegree, PageRank, PPR, FSBC, SmartDegree, RAFC for each value of $\mu$, and AFC.]
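The quantity $\sigma(S, I)$ used throughout these experiments is estimated by simulation. A minimal Python sketch of such an estimator is given below, under a stated assumption: following the path notation $P(S \xrightarrow{I} t)$, it counts the nodes that lie on a live-edge path starting in $S$ and passing through at least one node of $I$. The function names and the dictionary-based graph format are hypothetical, and the exact definition used in the experiments may differ in details such as whether checkers count themselves.

```python
import random
from collections import defaultdict, deque


def reachable(sources, adj):
    """Breadth-first search: all nodes reachable from `sources` in `adj`."""
    seen = set(sources)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def estimate_sigma(prob, seeds, checkers, runs=10000, rng=None):
    """Monte Carlo estimate of sigma(S, I) under the Independent Cascade Model.

    prob: dict mapping directed edge (u, v) to its activation probability.
    Assumption: a node counts if some live-edge path from `seeds` reaches it
    through at least one node in `checkers`.
    """
    rng = rng or random.Random(0)
    checkers = set(checkers)
    total = 0
    for _ in range(runs):
        # Sample one live-edge graph by keeping each edge with probability p.
        live = defaultdict(list)
        for (u, v), p in prob.items():
            if rng.random() < p:
                live[u].append(v)
        # Checkers that the fake news actually reaches in this sample.
        hit = reachable(seeds, live) & checkers
        # Everything downstream of those checkers (the checkers included).
        total += len(reachable(hit, live))
    return total / runs


# Deterministic toy graph: s -> a -> b and s -> c, all edges certain.
toy = {("s", "a"): 1.0, ("a", "b"): 1.0, ("s", "c"): 1.0}
sigma_a = estimate_sigma(toy, {"s"}, {"a"}, runs=50)
sigma_c = estimate_sigma(toy, {"s"}, {"c"}, runs=50)
```

On the toy graph, choosing `a` covers `a` and `b`, while choosing `c` covers only `c`, which is the kind of difference the greedy selection exploits.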
FIGURE 8.5: Comparison of $\sigma(S, I)$ vs $|I|$ for different methods when the source is random. [Plot panels: (a) HEPT, (b) Twitter, (c) LJ; x-axis: $|I|$, y-axis: $\sigma(s, I)$; methods: OutDegree, PageRank, TIM, AFC, RAFC (0.1).]

After obtaining $I$ using AFC and RAFC, we ran 10,000 simulations to compute $\sigma(S, I)$. Figure 8.3 shows the comparison of $\sigma(S, I)$ vs execution time for different values of $\mu$. The point labeled ‘No Reduction’ corresponds to AFC, i.e., applying Algorithm 9 to the entire graph. Theoretically, $\mu$ should be arbitrarily close to zero. However, we observe that $\sigma(S, I)$ obtained with $\mu = 0.1$ is close to the one obtained without any reduction while requiring significantly smaller runtime. The speed-ups obtained for HEPT, Twitter, and LJ are 1.8x, 102.3x, and 2.4x, respectively. For LJ, $\sigma(S, I)$ is almost the same as the one obtained without any reduction. The lowest quality was observed for Twitter, with a drop of approximately 3.8% in $\sigma(S, I)$, but this is compensated by the 102.3x speed-up.

8.4.4 Baselines

To assess the quality of results obtained using AFC and RAFC, we compare them against the following baselines:

OutDegree: Sum of the probabilities of edges going out from a node, i.e., $\forall v \in V \setminus S$, $\mathrm{OutDegree}(v) = \sum_{j \in V \setminus S} p(v, j)$.
PageRank: Google’s webpage ranking algorithm [92].
PPR: Personalized PageRank as described in [92]. This was also implemented using power iterations. The random walks are assumed to start from one of the nodes in $S$ with probability $1/|S|$.
FSBC: A modified form of betweenness centrality that we refer to as Fixed Source Betweenness Centrality. We added a source node $s$ and connected it to each node $u \in S$ with probability 1.
Then, each edge weight $p(i, j)$ was replaced with $-\log(p(i, j))$, so that the shortest path represents the most probable path from node $i$ to node $j$. Finally, we counted the number of shortest paths going through a node as its FSBC value.
SmartDegree: We developed this heuristic to easily capture which nodes are well connected from the source $S$ and also highly connected to other nodes. For $v \in V \setminus S$, we compute $\mathrm{SmartDegree}(v) = \left(\sum_{u \in S} p(u, v)\right) \left(\sum_{j \in V \setminus S} p(v, j)\right)$.

We did not compare our methods against the works of [69, 88] as they were designed for a different problem setting. Instead, we use the above-mentioned baselines, which have been used for comparison in several information diffusion related works [62, 69, 76, 83].

8.4.5 FActCheck with Known Source

We computed $\sigma(S, I)$ obtained using these methods along with our methods by running 10,000 simulations. The size of $I$ was varied from 1 to 50. Figure 8.4 shows $\sigma(S, I)$ achieved by all the methods on the three datasets. The performance of all versions of RAFC is close to that of AFC. These methods outperform the baselines by significant margins. OutDegree and PageRank do not take the source into consideration, which could be the reason for their poor performance. PPR uses the knowledge of the source; however, it finds where a random walk would land starting from the source. These nodes may be well connected from the source but may not lead to many other nodes. Besides, the notion of connectivity through random walks is different from the diffusion process. PPR performs better on HEPT than on Twitter and LJ, which could be due to the fact that HEPT is an undirected network. Therefore, if two paths (or random walks) $P_1$ and $P_2$ end up at the same node $v$, then a news item that goes through $P_1$ can follow $P_2$ in reverse order, thus infecting nodes on $P_2$ after going through $v$. FSBC performs poorly in all datasets.
This could be due to the fact that it only counts how many shortest paths go through a node. These paths may have extremely small probabilities and hence may not lead to a high expected reach. A weighted count of betweenness may improve the result, but how to perform such a weighted count, and whether it can provide any approximation guarantee, is beyond the scope of this work. SmartDegree takes into account direct connections from the source node and direct connections to the rest of the graph, but fails to capture the diffusion process.

8.4.6 FActCheck with Unknown Source

We also conducted experiments where the seed set $S$ is not fixed, and each node initiates the diffusion with probability $q = 0.01$. For RAFC we set $a = 2$ and chose $r$ such that the value in (8.12) was $< 0.0001$. We ran RAFC only for $\mu = 0.1$ because for smaller values, $|V_S|/|V| \approx 1$. The runtimes of AFC for HEPT, Twitter, and LJ were 14.1s, 1166.8s, and 550s, respectively. RAFC produced no speed-up for HEPT and LJ, while for Twitter its speed-up was 1.6x. Along with OutDegree and PageRank as baselines, we also used TIM [91], which finds a set of $k$ nodes that maximizes the spread within a factor of $(1 - 1/e - \epsilon)$ of the optimal. Figure 8.5 shows the comparison of these methods with varying values of $k$. Although RAFC and AFC outperform the baselines, the difference between the performance of the algorithms is reduced compared to the case when the source is fixed (Figure 8.4). This suggests that if we can identify potential candidates who are likely to initiate fake news, we can obtain a better solution for FActCheck.

8.5 Fake News Immunization

FActCheck asks for a set that is reachable from the source $S$ and is also well connected to the rest of the network. We explore its application in another setting - Fake News Immunization, where fake news dominates real news.
We consider a setting where a node that is exposed to fake news picks up the fake news even if it is also exposed to the real news. We wish to find $k$ nodes to disrupt the flow of fake news. Note that this bears some similarity to the Immunization problem [93], where the task is to immunize nodes to contain an epidemic. Unlike FActCheck, where the selected individuals propagate real news, here the immunized individuals are removed from the graph (note that the spread of real news is irrelevant as fake news dominates real news). We consider a version of the problem where a given set of nodes $S$ propagates the fake news. This is different from the typical setting for Immunization, where the source is assumed to be random [47]. Formally, the problem is defined as follows.

Definition 8.7. (Fake-News Immunization Problem) Given a graph $G = (V, E)$, a set $S$, and a positive integer $k$, find $I \subseteq V \setminus S$, such that $|I| = k$, which minimizes the expected spread of fake news in the graph induced by $V \setminus I$.

Equivalently, the problem asks to minimize the probability of reaching a randomly selected node $t$ in the graph induced by $V \setminus I$, i.e., $P(S \to t \mid G \setminus I)$. In a given instance, fake news either flows through the set $I$ or it does not. Correspondingly, in a live-edge graph, there either exists a path that goes through a node in $I$ or no such path exists. Since these events are mutually exclusive,

    $P(S \xrightarrow{\neg I} t) = P(S \to t) - P(S \xrightarrow{I} t)$,    (8.17)

where $P(S \xrightarrow{\neg I} t)$ represents the probability of a path from $S$ to $t$ that does not go through any node in $I$. Since $P(S \to t)$ is a constant for fixed $S$, maximizing $P(S \xrightarrow{I} t)$ is equivalent to minimizing $P(S \xrightarrow{\neg I} t)$. In a live-edge graph where $\mathbb{I}(S \to t) = 1$, for a given $I \subseteq V$ and $v \in V \setminus I$, three mutually exclusive and exhaustive cases arise:

$\mathbb{I}(S \xrightarrow{I} t) = 1$ and $\mathbb{I}(S \xrightarrow{v} t) = 0 \implies \mathbb{I}(S \xrightarrow{\neg I} t) = 0$, $\mathbb{I}(S \to t \mid G \setminus I) = 0$.
$\mathbb{I}(S \xrightarrow{I} t) = 1$ and $\mathbb{I}(S \xrightarrow{v} t) = 1 \implies \mathbb{I}(S \xrightarrow{\neg I} t) = 0$, $\mathbb{I}(S \to t \mid G \setminus I) = 1$.
$\mathbb{I}(S \xrightarrow{I} t) = 0$ and $\mathbb{I}(S \xrightarrow{v} t) = 1 \implies \mathbb{I}(S \xrightarrow{\neg I} t) = 1$, $\mathbb{I}(S \to t \mid G \setminus I) = 1$.

From these cases, it can be concluded that

    $P(S \to t \mid G \setminus I) \ge P(S \xrightarrow{\neg I} t)$,    (8.18)

i.e., $P(S \xrightarrow{\neg I} t)$ forms a lower bound on the objective for Fake News Immunization. Although this is not mathematically optimal, intuitively, if infection of a set of nodes leads to infection of many nodes in the network, that set may be considered a good candidate to immunize. Therefore, we use the result of FActCheck as a heuristic for Fake News Immunization.

FIGURE 8.6: Immunization results for AFC, RAFC, and the baselines, showing percentage of nodes that were saved from infection by removing 50 nodes. [Bar chart over HEPT, Twitter, and LJ; y-axis: percentage of nodes saved; methods: OutDegree, PageRank, PPR, FSBC, SmartDegree, RAFC (0.1), AFC.]

8.5.1 Fake News Immunization Experiments

Figure 8.6 shows the results of applying the baselines along with the proposed methods RAFC and AFC. The comparison is based on the percentage of nodes saved from adoption of fake news due to immunization of 50 nodes, which can be measured as

    $100 \times \frac{\sigma(S \mid G) - \sigma(S \mid G \setminus I)}{\sigma(S \mid G)}$.

The seed set $S$ was again based on out-degree, as in our previous experiments. Again, we observe that our methods AFC and RAFC (with $\mu = 0.1$) significantly outperform the baselines. SmartDegree performs the best among the baselines, yet on LJ it saves less than half as many nodes as our methods.

Chapter 9

Conclusions and Future Directions

Influence analytics and diffusion prediction in online social networks have been important for many domains, from marketing to public health. With the tremendous increase in the volume of data, network sizes reach millions of nodes, restricting the applicability of existing agent-based models or algorithmic solutions for diffusion prediction. My work touches several aspects of diffusion processes with different applications. Figure 9.1 summarizes the results.

FIGURE 9.1: Summary of contributions.
To address the challenge of accurately forecasting disease outbreaks and assessing international spreading risk across multiple regions, with the intent of applying these capabilities to the mitigation of outbreaks, we developed a computational epidemic model that integrates disease dynamics with human mobility patterns. Our model is advantageous in that it allows for a fine-grained analysis of epidemic patterns both within a population and at an international scale, while being computationally scalable (linear in the number of countries and the number of edges along which passengers travel) and requiring only a small amount of data to be initialized. This allows for both detection of local outbreaks and identification of the timeline of the arrival of the epidemic in each country (i.e., imported cases) as a direct result of the human mobility network. This led to one of the winning solutions of the DARPA CHIKV challenge (2014), where the model accurately predicted the spread of Chikungunya virus over 55 countries in the western hemisphere. The applicability of our model is not limited to predicting the spread of Chikungunya and is generalizable to other vector-borne diseases.

As a microscopic, non-progressive diffusion process, we considered the problem of performing optimal intervention on an at-risk population to minimize violence. We proposed the Uncertain Voter Model (UVM) to capture the non-progressive diffusion of violence. Under UVM, a node selects one of its neighbors with probability $\theta$ or one of the remaining nodes with probability $1 - \theta$, and adopts its state. The parameter $\theta$ captures the certainty of being influenced by the neighbors. The model also captures uncertainty in the time over which the diffusion of violence takes place. We have shown that a greedy algorithm is the optimal intervention strategy to minimize violence under this model.
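The UVM update rule summarized above can be illustrated with a short Python sketch. It is a simplified illustration rather than the model as implemented in this work: the synchronous update of all nodes, the uniform choice within each group, the fallback when a group is empty, and the function name `uvm_step` are assumptions made for the example.

```python
import random


def uvm_step(states, neighbors, theta, rng):
    """One synchronous round of a simplified Uncertain Voter Model.

    states: dict node -> state (e.g. 0 = non-violent, 1 = violent).
    neighbors: dict node -> list of neighboring nodes.
    theta: probability of copying a neighbor instead of a non-neighbor.
    """
    nodes = list(states)
    new_states = {}
    for u in nodes:
        nbrs = list(neighbors.get(u, ()))
        nbr_set = set(nbrs)
        others = [v for v in nodes if v != u and v not in nbr_set]
        if nbrs and (not others or rng.random() < theta):
            pick = rng.choice(nbrs)    # influenced by a neighbor
        elif others:
            pick = rng.choice(others)  # influenced by one of the remaining nodes
        else:
            pick = u                   # isolated single node: keep own state
        new_states[u] = states[pick]
    return new_states


# Toy network: a - b - c, stored as an adjacency dictionary.
states = {"a": 0, "b": 1, "c": 1}
neighbors = {"a": ["b"], "b": ["c"], "c": ["b"]}
certain = uvm_step(states, neighbors, theta=1.0, rng=random.Random(1))
uncertain = uvm_step(states, neighbors, theta=0.0, rng=random.Random(1))
```

With $\theta = 1$ the sketch reduces to the classic voter model on the given edges, while $\theta = 0$ makes every node copy a non-neighbor, which is how the model can account for edges that may have been missed during data collection.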
We have also extended the deterministic intervention by considering a scenario where the intervention succeeds only with a certain probability that is a function of the number of resources (units of intervention) allocated to the individual. We have also shown that the greedy algorithm maximizing marginal returns forms the optimal intervention strategy. Experiments on synthetic Kronecker graphs suggest that UVM is a better choice than the classic voter model when edges may have been omitted during data collection. Experiments on a real-world Homeless Youth network have demonstrated that our intervention strategy significantly outperforms interventions based on popular centrality-based measures. We show in our experiments that for sensible choices of parameters the top individuals selected for intervention remain roughly the same. USC is running the pilot study to conduct the intervention this summer. As a future direction, personal traits (Difficulty in Emotion Regulation Score) can be incorporated to more accurately model the diffusion process and the response to the intervention. The modeling and intervention scheme can also be applied to other behaviors that are contagious and non-progressive in nature, such as drug abuse.

To deal with the problems associated with microscopic, progressive diffusion processes, we proposed a novel, general analytical framework for influence calculation in social networks, which does not require expensive Monte-Carlo simulations. In this framework, each node has its own function of collective influence and peer influence. Both functions can vary with time, thus making our framework directly applicable to a plethora of real-world scenarios. We have shown how various popular models of diffusion constitute special cases of our model, suggesting that our formula is applicable for approximating the expected outcome of these models.
Particularly, we have shown that our formula can substitute expensive simulation runs to calculate the expected probabilities of infection. We have further demonstrated that significant computation gains can be achieved using our formula instead of such models. We have validated our results using real-world social networks and a number of Erdős-Rényi random graphs.

We applied our analytical model for influence to the task of influence maximization - the problem of identifying a set of target nodes whose influence will maximize the overall cascade in the social network. We have shown that our unified model is beneficial to seed set selection. Specifically, we have proposed OSSUM, a greedy solution to the problem of Influence Maximization based on our Unified Model. We have also extended this approach to signed networks with positive and negative links, where two competing opinions propagate and may flip over negative edges. We have empirically demonstrated its superiority over state-of-the-art approaches under different influence models. Two important benefits of our approach are that (a) it enables exploration and evaluation of different scenarios for large graphs under different influence models, and (b) it considers the timing budget of influence propagation, adding a time-critical condition to the influence models. As a future direction, the analytical solution can be used to define influence-based centrality measures that take the diffusion process into account to find application-specific “central” nodes.

To address the challenge posed by fake news propagation in online social networks, we have proposed the Fake news Activation Checking (FActCheck) problem. Under the Independent Cascade Model, we have shown that the problem is NP-Hard, but the objective is monotone and submodular, motivating a polynomial-time greedy algorithm with a $(1 - 1/e - \epsilon)$-approximation guarantee.
Since the runtime of AFC increases with the size of the graph, we have developed a heuristic (RAFC) that reduces the size of the graph by removing nodes that are likely to have low probability of activation, before applying AFC. While the Independent Cascade Model was chosen as the model of diffusion, our algorithms are applicable to other models including Linear Threshold and Susceptible-Infected-Recovered. Experiments have demonstrated that RAFC produces similar quality to AFC, while providing significant speed-up in runtime. Our methods were compared against popular centrality measures from the social network literature. Both AFC and RAFC outperform the baselines by a large margin on several real-life networks. The difference in performance was less prominent when the sources of fake news are random, suggesting that identifying popular sources of misinformation can have a significant impact in reducing the reach of fake news. We have also conducted experiments to validate the effectiveness of our methods for Fake News Immunization, where fake news dominates real news. This represents the setting where an individual always picks up the fake news even when exposed to the real news along with it. While our algorithms were not explicitly designed to address this problem, a solution of FActCheck, intuitively, forms a candidate set to immunize. Experiments have demonstrated that our algorithms indeed save more nodes from the spread of fake news compared to the baselines.

Bibliography

[1] William O Kermack and Anderson G McKendrick. Contributions to the mathematical theory of epidemics. ii. the problem of endemicity. Proceedings of the Royal society of London. Series A, 138(834):55–83, 1932.
[2] Roy M Anderson, Robert M May, et al. Population biology of infectious diseases. In [Report of the Dahlem Workshop, Berlin, 14th-19th March 1982]. Springer-Verlag, 1982.
[3] Matt J Keeling and Pejman Rohani. Modeling infectious diseases in humans and animals.
Princeton University Press, 2008.
[4] Norman TJ Bailey et al. The mathematical theory of infectious diseases and its applications. Charles Griffin & Company Ltd, 5a Crendon Street, High Wycombe, Bucks HP13 6LE., 1975.
[5] David Easley and Jon Kleinberg. Networks, crowds, and markets. Cambridge Univ Press, 6(1):6–1, 2010.
[6] Sushil Bikhchandani, David Hirshleifer, and Ivo Welch. A theory of fads, fashion, custom, and cultural change as informational cascades. Journal of political Economy, pages 992–1026, 1992.
[7] David Hirshleifer. The blind leading the blind: social influence, fads and informational cascades. 1995.
[8] George M Beal and Joe M Bohlen. The diffusion process. Agricultural Experiment Station, Iowa State College, 1957.
[9] Alan Demers, Dan Greene, Carl Hauser, Wes Irish, John Larson, Scott Shenker, Howard Sturgis, Dan Swinehart, and Doug Terry. Epidemic algorithms for replicated database maintenance. In Proceedings of the sixth annual ACM Symposium on Principles of distributed computing, pages 1–12. ACM, 1987.
[10] Richard Karp, Christian Schindelhauer, Scott Shenker, and Berthold Vöcking. Randomized rumor spreading. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 565–574. IEEE, 2000.
[11] Benjamin Doerr, Tobias Friedrich, and Thomas Sauerwald. Quasirandom rumor spreading. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pages 773–781. Society for Industrial and Applied Mathematics, 2008.
[12] Kat Rock, Sam Brand, Jo Moir, and Matt J Keeling. Dynamics of infectious diseases. Reports on Progress in Physics, 77(2):026602, 2014.
[13] David J Bartholomew and David J Bartholomew. Stochastic models for social processes. Wiley New York, 1967.
[14] Yamir Moreno, Maziar Nekovee, and Amalio F Pacheco. Dynamics of rumor spreading in complex networks. Physical Review E, 69(6):066130, 2004.
[15] Herbert W. Hethcote. The mathematics of infectious diseases.
SIAM Review, 42(4):599–653, 2000.
[16] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 5:17–61, 1960.
[17] Mason A Porter and James P Gleeson. Dynamical systems on networks: A tutorial. arXiv preprint arXiv:1403.7663, 2014.
[18] Fred Brauer and Carlos Castillo-Chavez. Mathematical models in population biology and epidemiology. Springer, 2011.
[19] Karthik Subbian, Dhruv Sharma, Zhen Wen, and Jaideep Srivastava. Finding influencers in networks using social capital. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’13, pages 592–599, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2240-9. doi: 10.1145/2492517.2492552.
[20] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146, New York, NY, USA, 2003. ACM. ISBN 1-58113-737-0.
[21] Jon Kleinberg. Cascading Behavior in Networks: Algorithmic and Economic Issues. Cambridge University Press, 2007.
[22] Ceren Budak, Divyakant Agrawal, and Amr El Abbadi. Diffusion of information in social networks: Is it all local? In 2012 IEEE 12th International Conference on Data Mining (ICDM), pages 121–130, 2012. doi: 10.1109/ICDM.2012.74.
[23] Charalampos Chelmis and Viktor K. Prasanna. The role of organization hierarchy in technology adoption at the workplace. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’13, pages 8–15, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2240-9.
[24] Charalampos Chelmis, Ajitesh Srivastava, and Viktor K Prasanna. Computational models of technology adoption at the workplace. Social Network Analysis and Mining, 4(1):1–18, 2014.
[25] Mark Granovetter.
Threshold models of collective behavior. American Journal of Sociology, 83(6):1420–1443, 1978. ISSN 00029602.
[26] John A. Jacquez and Carl P. Simon. The stochastic si model with recruitment and deaths i. comparison with the closed sis model. Mathematical Biosciences, 117(1-2):77–125, 1993. ISSN 0025-5564.
[27] Christel Kamp. Untangling the interplay between epidemic spread and transmission network dynamics. PLoS Computational Biology, 6(11):e1000984, 11 2010.
[28] Mayank Lahiri and Manuel Cebrian. The genetic algorithm as a general diffusion model for social networks. 2010.
[29] Alireza Hajibagheri, Ali Hamzeh, and Gita Sukthankar. Modeling information diffusion and community membership using stochastic optimization. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’13, pages 175–182, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2240-9. doi: 10.1145/2492517.2492545.
[30] Wei Chen, Yajun Wang, and Siyu Yang. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 199–208. ACM, 2009.
[31] Aris Anagnostopoulos, Ravi Kumar, and Mohammad Mahdian. Influence and correlation in social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 7–15, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-193-4.
[32] Zaixin Lu, Wei Zhang, Weili Wu, Joonmo Kim, and Bin Fu. The complexity of influence maximization problem in the deterministic linear threshold model. Journal of combinatorial optimization, 24(3):374–378, 2012.
[33] Marisa J Terry, Gurpreet Bedi, and Neil D Patel. Healthcare needs of homeless youth in the united states. Journal of Pediatric Sciences, 2(1):e17–e28, 2010.
[34] Department of Justice. Office of victims of crime. 2013 National Crime Victims’ Rights Week resource guide: Section 6. Statistical Overviews., 2013.
[35] Robin Petering, Eric Rice, Harmony Rhoades, and Hailey Winetrobe. The social networks of homeless youth experiencing intimate partner violence. Journal of interpersonal violence, 29(12):2172–2191, 2014.
[36] Jessica A Heerde, Sheryl A Hemphill, and Kirsty E Scholes-Balog. ‘fighting’ for survival: A systematic review of physically violent behavior perpetrated and experienced by homeless young people. Aggression and violent behavior, 19(1):50–66, 2014.
[37] Danice K Eaton, Laura Kann, Steve Kinchen, Shari Shanklin, Katherine H Flint, Joseph Hawkins, William A Harris, Richard Lowry, Tim McManus, David Chyen, et al. Youth risk behavior surveillance-united states, 2011. Morbidity and mortality weekly report. Surveillance summaries (Washington, DC: 2002), 61(4):1–162, 2012.
[38] Jeffrey Fagan, Deanna L Wilkinson, and Garth Davies. Social contagion of violence. 2007.
[39] Daniel J Myers. The diffusion of collective violence: Infectiousness, susceptibility, and mass media networks 1. American Journal of Sociology, 106(1):173–208, 2000.
[40] Daniel J Myers and Pamela E Oliver. The opposing forces diffusion model: the initiation and repression of collective violence. Dynamics of Asymmetric Conflict, 1(2):164–189, 2008.
[41] Eyal Even-Dar and Asaf Shapira. A note on maximizing the spread of influence in social networks. In International Workshop on Web and Internet Economics, pages 281–286. Springer, 2007.
[42] Eric Rice, Ian W Holloway, Anamika Barman-Adhikari, Dahlia Fuentes, C Hendricks Brown, and Lawrence A Palinkas. A mixed methods approach to network data collection. Field methods, 26(3):252–268, 2014.
[43] Jeffrey Duong and Catherine Bradshaw. Associations between bullying and engaging in aggressive and suicidal behaviors among sexual minority youth: The moderating role of connectedness. Journal of school health, 84(10):636–645, 2014.
[44] Lisa A Eaton, Seth C Kalichman, Kathleen J Sikkema, Donald Skinner, Melissa H Watt, Desiree Pieterse, and Eileen V Pitpitan. Pregnancy, alcohol intake, and intimate partner violence among men and women attending drinking establishments in a cape town, south africa township. Journal of community health, 37(1):208–216, 2012.
[45] Laura Kann, Steve Kinchen, Shari L Shanklin, Katherine H Flint, Joseph Hawkins, William A Harris, Richard Lowry, Emily O’Malley Olsen, Tim McManus, David Chyen, et al. Youth risk behavior surveillance-united states, 2013. 2014.
[46] Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A: statistical mechanics and its applications, 390(6):1150–1170, 2011.
[47] B Aditya Prakash, Lada Adamic, Theodore Iwashyna, Hanghang Tong, and Christos Faloutsos. Fractional immunization in networks. In Proceedings of the 2013 SIAM International Conference on Data Mining, pages 659–667. SIAM, 2013.
[48] Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin Ghahramani. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research, 11(Feb):985–1042, 2010.
[49] Tasuku Soma and Yuichi Yoshida. Maximizing monotone submodular functions over the integer lattice. In International Conference on Integer Programming and Combinatorial Optimization, pages 325–336. Springer, 2016.
[50] Thomas W. Valente. Social network thresholds in the diffusion of innovations. Social Networks, 18(1):69–89, 1996. ISSN 0378-8733.
[51] Eric Abrahamson and Lori Rosenkopf. Social network effects on the extent of innovation diffusion: A computer simulation. Organization Science, 8(3):289–309, 1997. ISSN 10477039.
[52] Frank M. Bass. A new product growth for model consumer durables. Manage. Sci., 50(12 Supplement):1825–1832, December 2004. ISSN 0025-1909.
[53] Hanool Choi, Sang-Hoon Kim, and Jeho Lee. Role of network structure and network effects in diffusion of innovations.
Industrial Marketing Management, 39(1):170 – 177, 2010. ISSN 0019-8501. [54] Seth A. Myers, Chenguang Zhu, and Jure Leskovec. Information diffusion and external influence in networks. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pages 33–41, New York, NY , USA, 2012. ACM. ISBN 978-1-4503-1462-6. doi: 10.1145/2339530.2339540. [55] Yu-Ru Lin, Jimeng Sun, Paul Castro, Ravi Konuru, Hari Sundaram, and Aisling Kelliher. Metafac: Community discovery via relational hypergraph factorization. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 527–536, New York, NY , USA, 2009. ACM. ISBN 978-1-60558- 495-9. doi: 10.1145/1557019.1557080. [56] Kinkar Ch Das. Sharp bounds for the sum of the squares of the degrees of a graph. Kragu- jevac journal of Mathematics, 25(25):19–41, 2003. [57] Amit Goyal, Wei Lu, and Laks VS Lakshmanan. Celf++: optimizing the greedy algorithm for influence maximization in social networks. In Proceedings of the 20th international conference companion on World wide web, pages 47–48. ACM, 2011. Bibliography 109 [58] Wei Chen, Chi Wang, and Yajun Wang. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In Proceedings of the 16th ACM SIGKDD inter- national conference on Knowledge discovery and data mining, pages 1029–1038. ACM, 2010. [59] Amit Goyal, Wei Lu, and Laks VS Lakshmanan. Simpath: An efficient algorithm for influence maximization under the linear threshold model. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 211–220. IEEE, 2011. [60] Charalampos Chelmis and Viktor K Prasanna. The role of organization hierarchy in tech- nology adoption at the workplace. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 8–15. ACM, 2013. 
[61] Ajitesh Srivastava, Charalampos Chelmis, and Viktor K Prasanna. Influence in social networks: A unified model? In Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on, pages 451–454. IEEE, 2014. [62] David Kempe, Jon Kleinberg, and ´ Eva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 137–146. ACM, 2003. [63] Ceren Budak, Divyakant Agrawal, and Amr El Abbadi. Diffusion of information in social networks: Is it all local? In ICDM, pages 121–130, 2012. [64] Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. Signed networks in social media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1361–1370. ACM, 2010. [65] Yanhua Li, Wei Chen, Yajun Wang, and Zhi-Li Zhang. Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In Proceed- ings of the sixth ACM international conference on Web search and data mining, pages 657–666. ACM, 2013. [66] Dong Li, Zhi-Ming Xu, Nilanjan Chakraborty, Anika Gupta, Katia Sycara, and Sheng Li. Polarity related influence maximization in signed social networks. PloS one, 9(7):e102199, 2014. [67] Shishir Bharathi, David Kempe, and Mahyar Salek. Competitive influence maximization in social networks. In Internet and Network Economics, pages 306–311. Springer, 2007. [68] Allan Borodin, Yuval Filmus, and Joel Oren. Threshold models for competitive influence in social networks. In Internet and network economics, pages 539–550. Springer, 2010. Bibliography 110 [69] Xinran He, Guojie Song, Wei Chen, and Qingye Jiang. Influence blocking maximization in social networks under the competitive linear threshold model. In SDM, pages 463–474. SIAM, 2012. [70] Wei Chen, Alex Collins, Rachel Cummings, Te Ke, Zhenming Liu, David Rincon, Xiaorui Sun, Yajun Wang, Wei Wei, and Yifei Yuan. 
Influence maximization in social networks when negative opinions may emerge and propagate. In SDM, volume 11, pages 379–390. SIAM, 2011. [71] Ajitesh Srivastava, Charalampos Chelmis, and Viktor K Prasanna. The unified model of social influence and its application in influence maximization. Social Network Analysis and Mining, 5(1):1–15, 2015. [72] Nishith Pathak, Arindam Banerjee, and Jaideep Srivastava. A generalized linear threshold model for multiple cascades. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 965–970. IEEE, 2010. [73] Yasuko Matsubara, Yasushi Sakurai, and Christos Faloutsos. The web as a jungle: Non- linear dynamical systems for co-evolving online activities. In Proceedings of the 24th International Conference on World Wide Web, pages 721–731. ACM, 2015. [74] Seth A Myers and Jure Leskovec. Clash of the contagions: Cooperation and competition in information diffusion. In ICDM, volume 12, pages 539–548. Citeseer, 2012. [75] Chi Wang, Wei Chen, and Yajun Wang. Scalable influence maximization for independent cascade model in large-scale social networks. Data Mining and Knowledge Discovery, 25 (3):545–576, 2012. [76] Sheng Wen, Jiaojiao Jiang, Yang Xiang, Shui Yu, Wanlei Zhou, and Weijia Jia. To shut them up or to clarify: Restraining the spread of rumors in online social networks. IEEE Transactions on Parallel and Distributed Systems, 25(12):3306–3316, 2014. [77] Lidan Fan, Zaixin Lu, Weili Wu, Bhavani Thuraisingham, Huan Ma, and Yuanjun Bi. Least cost rumor blocking in social networks. In Distributed Computing Systems (ICDCS), 2013 IEEE 33rd International Conference on, pages 540–549. IEEE, 2013. [78] Zhe Zhao, Paul Resnick, and Qiaozhu Mei. Enquiring minds: Early detection of rumors in social media from enquiry posts. In Proceedings of the 24th International Conference on World Wide Web, pages 1395–1405. ACM, 2015. [79] Vahed Qazvinian, Emily Rosengren, Dragomir R Radev, and Qiaozhu Mei. 
Rumor has it: Identifying misinformation in microblogs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1589–1599. Association for Computa- tional Linguistics, 2011. Bibliography 111 [80] Nam P Nguyen, Guanhua Yan, My T Thai, and Stephan Eidenbenz. Containment of mis- information spread in online social networks. In Proceedings of the 4th Annual ACM Web Science Conference, pages 213–222. ACM, 2012. [81] Meital Balmas. When fake news becomes real: Combined exposure to multiple news sources and political attitudes of inefficacy, alienation, and cynicism. Communication Research, 41(3):430–454, 2014. [82] Masahiro Kimura, Kazumi Saito, and Hiroshi Motoda. Blocking links to minimize con- tamination spread in a social network. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(2):9, 2009. [83] Ceren Budak, Divyakant Agrawal, and Amr El Abbadi. Limiting the spread of misinfor- mation in social networks. In Proceedings of the 20th international conference on World wide web, pages 665–674. ACM, 2011. [84] Nam P Nguyen, Guanhua Yan, and My T Thai. Analysis of misinformation containment in online social networks. Computer Networks, 57(10):2133–2146, 2013. [85] L Weng, F Menczer, and YY Ahn. Virality prediction and community structure in social networks. Scientific reports, 3:2522–2522, 2012. [86] Boris Galitsky. Detecting rumor and disinformation by web mining. In 2015 AAAI Spring Symposium Series, 2015. [87] Naeemul Hassan, Gensheng Zhang, Fatma Arslan, Josue Caraballo, Damian Jimenez, Sid- dhant Gawsane, Shohedul Hasan, Minumol Joseph, Aaditya Kulkarni, Anil Kumar Nayak, et al. Claimbuster: the first-ever end-to-end fact-checking system. Proceedings of the VLDB Endowment, 10(12):1945–1948, 2017. [88] Shahrzad Shirazipourazad, Brian Bogard, Harsh Vachhani, Arunabha Sen, and Paul Horn. Influence propagation in adversarial setting: how to defeat competition with least amount of investment. 
In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 585–594. ACM, 2012. [89] Andreas Krause and Daniel Golovin. Submodular function maximization. Tractability: Practical Approaches to Hard Problems, 3(19):8, 2012. [90] Christian Borgs, Michael Brautbar, Jennifer Chayes, and Brendan Lucier. Maximizing social influence in nearly optimal time. In Proceedings of the Twenty-Fifth Annual ACM- SIAM Symposium on Discrete Algorithms, pages 946–957. SIAM, 2014. [91] Youze Tang, Yanchen Shi, and Xiaokui Xiao. Influence maximization in near-linear time: A martingale approach. In Proceedings of the 2015 ACM SIGMOD International Confer- ence on Management of Data, pages 1539–1554. ACM, 2015. Bibliography 112 [92] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: bringing order to the web. 1999. [93] Chen Chen, Hanghang Tong, B Aditya Prakash, Charalampos E Tsourakakis, Tina Eliassi- Rad, Christos Faloutsos, and Duen Horng Chau. Node immunization on large graphs: Theory and algorithms. IEEE Transactions on Knowledge and Data Engineering, 28(1): 113–126, 2016.
Abstract
The study of information diffusion on social networks has gained significant importance with the rise of online social media. Its applications include viral marketing, opinion cascades, finding influential players, and immunization. Since the true dynamics are hidden, various diffusion models have been proposed to explain the cascading behavior. Such models require extensive simulation to estimate the state of the diffusion process over time. Computing the diffusion over time analytically is #P-hard for many probabilistic models. Moreover, certain decision problems that require selecting individuals to initiate or alter a diffusion process are NP-hard and also rely on the estimation of probabilities. I provide approximate solutions to several diffusion computation and diffusion optimization problems using analytical solutions and/or intelligent sampling for a wide class of diffusion models. The specific problems include the following: (i) Predicting the number of infections in different countries across which an infection is spreading (DARPA Chikungunya Challenge 2014). (ii) Minimizing violence among homeless youth by identifying optimal peers for intervention. (iii) Finding the probability of infection/activation of an individual given that we start with a set of infected individuals in a network. (iv) Finding the set of individuals to initiate a campaign to maximize the influence (viral marketing). (v) Finding the set of individuals to maximize one opinion over the other in the presence of friend and foe relationships. (vi) Combating fake news by identifying the most influential individuals in the network who are likely to receive the fake news and can be incentivized to propagate the corresponding real news. The approximate solutions to all the above problems have been demonstrated mathematically or empirically and have been tested on synthetic and real-world networks.
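To illustrate the simulation-based estimation that the abstract describes for problem (iii), the following is a minimal sketch and not the dissertation's method. It assumes an independent cascade model with a single uniform edge probability `p`; the abstract does not name a specific model, and the function names, the toy graph, and the parameter values are all hypothetical.

```python
import random


def simulate_cascade(graph, seeds, p, rng):
    """Run one independent-cascade simulation and return the activated nodes.

    graph: dict mapping each node to a list of its out-neighbours.
    seeds: nodes that are active at time zero.
    p:     assumed uniform probability that an active node activates a neighbour.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, ()):
                # Each newly active node gets exactly one attempt per out-edge.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def estimate_activation_probability(graph, seeds, p, runs=10000, seed=0):
    """Monte Carlo estimate of each node's activation probability."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(runs):
        for node in simulate_cascade(graph, seeds, p, rng):
            counts[node] = counts.get(node, 0) + 1
    return {node: c / runs for node, c in counts.items()}


# Hypothetical toy network: a -> b, a -> c, b -> d, c -> d.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
probs = estimate_activation_probability(toy_graph, seeds=["a"], p=0.5)
print(probs)
```

On this toy graph the exact activation probability of `d` is 1 - (1 - 0.25)^2 = 0.4375, so the estimate can be checked by hand. On large networks no such closed form is practical, which is the setting in which the abstract's approximate analytical and sampling-based solutions apply.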
Asset Metadata
Creator
Srivastava, Ajitesh (author)
Core Title
Computing cascades: how to spread rumors, win campaigns, stop violence and predict epidemics
School
Andrew and Erna Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
08/05/2018
Defense Date
06/12/2018
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
epidemics, fake news, information diffusion, OAI-PMH Harvest, social networks, Violence
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Prasanna, Viktor K. (committee chair), Deshmukh, Jyotirmoy V. (committee member), Rice, Eric (committee member)
Creator Email
ajitesh47@gmail.com, ajiteshs@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-50665
Unique identifier
UC11669035
Identifier
etd-Srivastava-6635.pdf (filename), usctheses-c89-50665 (legacy record id)
Legacy Identifier
etd-Srivastava-6635.pdf
Dmrecord
50665
Document Type
Dissertation
Rights
Srivastava, Ajitesh
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA