UNDERSTANDING DIFFUSION PROCESS: INFERENCE AND THEORY

by

Xinran He

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2017

Copyright 2017 Xinran He

Acknowledgments

First, I would like to thank my advisor Prof. Yan Liu and Prof. David Kempe for their great guidance throughout my Ph.D. studies and contributions to this thesis.

I am also thankful to many colleagues and collaborators: Yi Chang, Zhengping Che, Dehua Cheng, Marjan Ghazvininejad, Michael Hankin, David Kale, Guangyu Li, Whenzhe Li, Daniel Moyer, Tanachat Nilanon, Sanjay Purushotham, Natali Ruchansky, Sungyong Seo and Qi Yu from the USC Melady group, and Yu Cheng, Ho Yee Cheung, Ehsan Emamjomeh-Zadeh, Li Han, Ruixin Qiang and Haifeng Xu from the USC theory group. I not only shared many thoughts but also had great moments with them all.

Last but not least, I would like to thank my family, who constantly encouraged me during my Ph.D. studies.

Contents

Acknowledgments
Contents
List of Figures
List of Tables
Abstract

1 Introduction
  1.1 Summary of Thesis Work
  1.2 Thesis Outline
  1.3 Related Publications

2 Preliminaries
  2.1 Basics
  2.2 Set and Set Functions
    2.2.1 Submodular Function Maximization
  2.3 Distributions
  2.4 Basics of Diffusion Process
    2.4.1 Diffusion Network
    2.4.2 Cascades
  2.5 Diffusion Models
    2.5.1 Cascade Models
    2.5.2 Point Process Models
  2.6 Network Inference
    2.6.1 Network Inference under the DIC model
    2.6.2 Network Inference under the CIC model
    2.6.3 Network Inference under the Hawkes Process model
  2.7 Influence Function Learning
    2.7.1 Learning Objective and Formal Definition
    2.7.2 Existing Algorithms
  2.8 Influence Maximization

3 Datasets and Evaluation Metrics
  3.1 Synthetic Network Models
    3.1.1 Preferential Attachment Model
    3.1.2 Watts-Strogatz Small-World Model
    3.1.3 Forest-Fire Model
    3.1.4 Kronecker Graph Model
  3.2 Real-world Datasets
    3.2.1 Facebook Network
    3.2.2 STOCFOCS Network
    3.2.3 Twitter Dataset
    3.2.4 MemeTracker Dataset
  3.3 Evaluation Metrics for Network Inference

4 Influence Function Learning from Incomplete Observations
  4.1 Models of Incomplete Observations
    4.1.1 Randomly Missing Activities
    4.1.2 Adversarially Missing Nodes
  4.2 Related Work
  4.3 Impact of Incomplete Observations
    4.3.1 Experiment settings
    4.3.2 Randomly Missing Activities
    4.3.3 Adversarially Missing Nodes
    4.3.4 Limitations
  4.4 Learning Influence Functions from Incomplete Observations
    4.4.1 Problem Definition
  4.5 Proper PAC Learning under Incomplete Observations
    4.5.1 Proof of Theorems 4.1 and 4.2
    4.5.2 Extensions of Proper PAC Learning
  4.6 Efficient Improper Learning Algorithm
  4.7 Experiments
    4.7.1 Synthetic cascades
    4.7.2 Influence Estimation on real cascades

5 Joint Inference of Multiple Diffusion Networks
  5.1 Related Work
  5.2 Problem Formulation
  5.3 MultiCascades Model
    5.3.1 MAP Inference in the E-step
    5.3.2 Parameter Update in the M-step
    5.3.3 Running time
  5.4 Empirical Evaluation
    5.4.1 Synthetic data
    5.4.2 Real-world data

6 Network Inference with Content Information
  6.1 Related Work
  6.2 HawkesTopic Model
    6.2.1 Modeling the posting time
    6.2.2 Modeling the documents
    6.2.3 Summary and Discussion
  6.3 Inference
    6.3.1 Update of variational distribution
    6.3.2 Update of model parameters
  6.4 Experiments
    6.4.1 Synthetic Data
    6.4.2 Real Data

7 Stability Analysis and Robust Algorithms for Influence Maximization
  7.1 Introduction
  7.2 Related Work
  7.3 Modeling Uncertainty in Influence Maximization
    7.3.1 Perturbation Interval model
    7.3.2 Stochastic vs. Adversarial Models
  7.4 Stability of Influence Maximization
    7.4.1 Can Instability Occur?
    7.4.2 Influence Difference Maximization
    7.4.3 Experiments
  7.5 Robust Influence Maximization
    7.5.1 Problem Definition
    7.5.2 Algorithm and Hardness
    7.5.3 Experiments

8 Conclusion and Future Work
  8.1 Contributions and Limitations
  8.2 Future Work

Reference List

List of Figures

4.1 Result of the MultiTree algorithm on a Hierarchical Community network. (a) The ground truth. (b) The network inferred with complete observations. (c) The network inferred with 10% random loss of activities. On the left is the graph embedding on a circle and on the right is the adjacency matrix.

4.2 Precision and Recall of the NetInf algorithm under randomly missing activities with different loss rates ($1 - r$). The x-axis corresponds to different loss rates. Different data series correspond to different networks.

4.3 Result of the NetInf algorithm on a Core-peripheral network. (a) The ground truth. (b) The network inferred with complete observations. (c) The network inferred with the loss of the 2 maximum-degree nodes.
On the left is the graph embedding on a circle and on the right is the adjacency matrix.

4.4 Precision and Recall of the MultiTree algorithm under adversarially missing nodes with different numbers of nodes lost. The x-axis is the number of adversarially lost nodes in the incomplete observations. Different series correspond to different networks.

4.5 Illustration of the graph transformation under the DIC model. The light green node is the seed, the dark green nodes are the activated and observed nodes, while the yellow node is activated but lost due to incomplete observations.

4.6 Illustration of the graph transformation under the DLT model. The light green node is the seed, the dark green nodes are the activated and observed nodes, while the yellow node is activated but lost due to incomplete observations.

4.7 MAE of estimated influence as a function of the retention rate on synthetic datasets for (a) the CIC model, (b) the DIC model, (c) the DLT model. The error bars show one standard deviation.

4.8 Relative error in MAE when the true retention rates are drawn from the truncated Gaussian distribution. The x-axis shows the standard deviation $\sigma$ of the retention rates from the mean, and the y-axis is the relative difference of the MAE compared to the case where all retention rates are the same and known.

4.9 Relative error in MAE when the true retention rates are drawn from the Uniform distribution. The x-axis shows the interval size $\sigma$ of the retention rates from the mean, and the y-axis is the relative difference of the MAE compared to the case where all retention rates are the same and known.

4.10 Relative error in MAE under retention rate misspecification. x-axis: retention rate $r$ used by the algorithm; y-axis: relative difference of MAE compared to using the true retention rate 0.8.

4.11 MAE of influence estimation on seven sets of real-world cascades with 20% of activations missing.

5.1 Graphical representation of the MultiCascades model. The diffusion networks are first generated from the same parametric graph generation model, and the cascades are generated depending on the structure of the specific diffusion network.

5.2 Visualization of retweet networks of different topics.

5.3 Graphical representation of our MultiCascades model, showing the parameter for the network generation model; $G_i$ is the hidden diffusion network for cascades in graph $i$; $C_{i,j}$ represents the $j$-th observed cascade in network $i$.

5.4 Visualization of the SBM prior. The darker the background color, the higher the value in the $B$ matrix. The blue dots correspond to the edges of one inferred diffusion network.

5.5 Visualization of cascade growth under the Latent Space Models, where $T$ is the length of the observation window. The blue circles correspond to inactive nodes, while the red triangles are the activated nodes.

5.6 Precision-Recall curve of network inference in the Twitter100 dataset. The solid squares correspond to points with maximum F1 score on the PR curve.

5.7 Inferred diffusion networks among the top 10 users from the Twitter dataset. The bold black line segments at the ends of edges correspond to the direction of the influence.

6.1 Graphical representation of the HawkesTopic model.
6.2 Results on synthetic datasets.

6.3 Inferred diffusion network from the EventRegistry datasets. The colors of nodes represent the out-degree of the source. (The darker the color, the higher the out-degree.) The edge width represents the strength of influence.

6.4 Inferred diffusion network from the Arxiv datasets. The colors of nodes represent the out-degree of the source. (The darker the color, the higher the out-degree.) The edge width represents the strength of influence.

7.1 Comparison between Influence Difference Maximization and Influence Maximization results for four different networks. (The result for the STOCFOCS network is obtained with CELF optimization.)

7.2 Ratio between the computed values of Influence Difference Maximization and Influence Maximization under random regular graphs with different degrees.

7.3 Performance of the algorithms on the four topical/temporal datasets. The x-axis is the number of seeds selected, and the y-axis the resulting robust influence (compared to seed sets of size $k$).

7.4 Saturate Greedy vs. Iran graph seed nodes. Green/pentagon nodes are selected in both; orange/triangle nodes are selected by Saturate Greedy only; purple/square nodes for Iran only.

7.5 Saturate Greedy vs. Climate graph seed nodes. Green/pentagon nodes are selected in both; orange/triangle nodes are selected by Saturate Greedy only; purple/square nodes for Climate only.

7.6 Performance of the algorithms (a) under different delay distributions following the CIC model, and (b) under different classes of diffusion models. The x-axis shows the number of seeds selected, and $k = 10$.

7.7 Performance of the algorithms under networks sampled from the Perturbation Interval model: (a) MemeTracker-DIC network; (b) STOCFOCS network. (The x-axis shows the (relative) size of the perturbation interval $I_e$.)

7.8 Running times on Kronecker graph networks with different structures. The x-axis represents the number of nodes, and the y-axis is the running time in seconds, both plotted on a log scale.

7.9 Running times on Kronecker graph networks with different structures. The x-axis represents the number of diffusion settings, and the y-axis is the running time in seconds, both plotted on a log scale.

List of Tables

3.1 Kronecker Graph Parameters
3.2 Examples of hashtags in each category
5.1 Comparison between the Greedy algorithm and the Random Greedy Algorithm
5.2 Network inference accuracy on synthetic datasets with 200 cascades
5.3 Relative improvement in AUC with different $|C|$
5.4 The AUC of the MCM-NetInf-LSM algorithm and relative improvement over Sep-NetInf with different $M$
5.5 Network inference accuracy on the Twitter dataset using NetInf in the E-step
5.6 Network inference accuracy on the Twitter dataset using MultiTree in the E-step
5.7 Network inference accuracy on the MemeTracker dataset using NetInf in the E-step
5.8 Network inference accuracy on the MemeTracker dataset using MultiTree in the E-step
6.1 Notation used in this chapter
6.2 EventRegistry: text modeling result (log document completion likelihood)
6.3 EventRegistry: network inference result (AUC)
6.4 arxiv: text modeling result (log document completion likelihood)
6.5 arxiv: network inference result (AUC)
6.6 arxiv with different observation time lengths
6.7 Inferred topics for authors Andrei Linde and Arkady Tseytlin under LDA, CTM and HTM
7.1 Instability for the clique $K_{200}$
7.2 Diffusion model settings

Abstract

Nowadays, online social networks have become a ubiquitous tool for people's social communications. Analyzing these social networks offers great potential to shed light on human social structure and to create better channels to enable social communication and collaboration. Most social network analysis tasks begin with extracting or learning the social network and the associated parameters, which remains a very challenging task due to the amorphous nature of social ties, along with the noise and incompleteness in the observations. As a result, the inferred social network is likely to be of low accuracy and to suffer from a high level of noise, which impacts the performance of analyses and applications depending on the inferred parameters.

In this thesis, we study the following important questions, with a special focus on analyzing diffusion behaviors in social networks to achieve real practicality: (1) How can special properties of social networks be utilized to improve the accuracy of the extracted network under noisy and missing data? (2) How can the impact of noise in the inferred network be characterized, so that robust analysis and optimization can be carried out?

To address the first question, we tackle the challenge of mitigating the impact of incomplete observations, which are very common in social data collections. Focusing on learning influence functions under incomplete observations, we design methods for both proper and improper learning under two types of incomplete cascades, with theoretical guarantees on the sample complexity bound. To address the challenge of data scarcity in inferring diffusion networks, we propose a hierarchical graphical model, referred to as the MultiCascades model, to jointly infer multiple diffusion networks accurately. By incorporating a shared network generation prior, the MultiCascades model infers multiple diffusion networks accurately from a limited number of observations. To utilize the additional rich content information in cascades, we propose the HawkesTopic model to analyze text-based cascades by combining temporal and content information. We provide a joint variational inference algorithm for HTM to simultaneously infer the diffusion network and discover the thematic topics of the documents.

In the second part of this thesis, we focus on the second question: designing robust Influence Maximization algorithms under the noise and uncertainties of the inferred network. We first propose a framework to measure the stability of Influence Maximization, with the Perturbation Interval model to characterize the noise in the inferred diffusion network. We then design an efficient algorithm for Robust Influence Maximization to find influential users robust across multiple diffusion settings, with theoretical analysis on the hardness of the Robust Influence Maximization problem and proofs of approximation guarantees for our algorithm.
Chapter 1

Introduction

Online social networks have become a ubiquitous tool for people's social communications. Analyzing these social networks offers great potential to shed light on human social structure, and to create better channels to enable social communication and collaboration. The recent increased availability of social network and social media data (through sites such as Facebook, Twitter, and LinkedIn) has raised the prospect of applying social network analysis at a large scale to positive effect. Many social network analysis tasks have already been studied with fruitful results. A few examples follow, while many more applications exist.

• Social link analysis provides evidence to prove or falsify classic sociology theories, for example, social status theory or balance theory.

• Identifying the roles of individuals in social networks helps us understand the underlying dynamics and achieve personalized recommendations.

• Exploiting the social influence structure helps us target the right individuals to effect changes in behavior or attitudes for a socially or financially desirable outcome.

• Link prediction contributes to more accurate friendship recommendations in social network sites, while at the same time bringing new insight into how friendships are formed.

• Identifying community structure helps us understand and visualize network structure and provide users with more targeted and useful information.

• Analyzing the evolution of social networks results in new theory on how groups and communities form, evolve, and dissolve.

Most social network analysis applications begin with extracting an abstract graph from real-world social interactions. The nodes in the graph represent the users, while the edges represent different types of social interactions depending on the application at hand, such as friendship or professional interactions for community detection, or social influence for finding opinion leaders. Social network analysis problems then reduce to different graph mining problems with the extracted graph as input. For example, the community detection problem can be formulated as discovering sets of nodes such that each set is more densely connected internally than externally; opinion leader mining can be solved by finding the individuals with the largest degree centrality or betweenness centrality in the graph.

However, contrary to many other domains, the first step of extracting or learning the social network and the associated parameters is highly non-trivial. First, social networks cannot be directly observed. At best, social network data will be self-reported (e.g., using questionnaires or online social networks); it will then suffer from the very amorphous nature of the meaning and strength of a tie, with standards differing greatly among people. More typically, networks are inferred from observed behaviors, such as adoptions of a new product, exchanges of e-mails or phone calls, or observed time together. Since noise exists and the observations may be incomplete, inferring the social network is a very challenging problem that often leads to low accuracy and high noise in the resultant network.

In this thesis, we focus on studying diffusion processes in social networks. Diffusion processes have long been studied by social scientists, starting from the seminal work of Everett Rogers [118] to recent studies [17, 48, 130]. Recently, the increasing availability of large-scale diffusion data collection has drawn interest in analyzing diffusion phenomena from a computational perspective.
Similar to other social network analysis tasks, diffusion analysis starts with inferring the diffusion network from observed user behaviors. With the inferred diffusion network, we focus on viral marketing via word-of-mouth effects, which is one of the most popular applications of diffusion processes. The goal of this application is to target influential users to effect changes in their behaviors for socially or financially desirable outcomes. As with the other social network analysis tasks discussed above, there are several challenges in inferring the diffusion network accurately and utilizing the diffusion process towards successful applications.

The first challenge comes from the prevalence of incomplete observations. Existing approaches for Network Inference rely on the common assumption that the observations used to train the models are complete, whereas missing observations are commonplace in practice. For example, social media data are often collected through crawlers in the blogosphere or acquired with public APIs provided by social media platforms, such as Twitter or Facebook. Due to technical challenges and established restrictions on the APIs, it is often impossible to obtain a complete set of observations even for a short period of time. Naively applying existing inference algorithms to incomplete datasets usually results in inaccurate network structure, due to the loss of true edges and the inclusion of spurious edges. Thus, we need to design algorithms which explicitly take incomplete observations into consideration for Network Inference.

Another major challenge lies in the scarcity of the observational data used to infer the network. Though a huge number of cascades occur within an online social network site via the exchange of millions of messages, we face the difficulty of gathering sufficient data for accurate inference when focused on a specific aspect of the cascades. This is exactly the small data versus big data problem. For example, we may be interested in inferring the influence pattern of a subset of users who communicate only infrequently. As another example, if we intend to understand a dynamic diffusion network, the number of cascades within each time period may not be enough. Additionally, in many cases, multiple topic-specific networks need to be inferred to model the diffusion of different topics or particular events. Though the total number of cascades, for example in Twitter, is large, the number of cascades on a specific topic can be very limited. Thus, there is a need for data-efficient algorithms that infer networks accurately from a limited number of observations.

Moreover, cascades also have complex structures and contain rich information besides the temporal information considered by existing approaches. For example, when a news story propagates across social media sites, the actual content of the news article provides valuable information to infer its source. Similarly, when a Twitter user retweets a post, the actual words used in the post can benefit the Network Inference task. Likewise, consider the propagation of scientific ideas among researchers: in addition to the time when a researcher publishes a paper, we have access to the content of the paper. Therefore, we need to design algorithms that incorporate this additional rich information into our models to achieve better inference accuracy.
Noise in social network data is not an exception, but the norm. There are many sources of noise and uncertainty in inferring diffusion networks and modeling the diffusion process, ranging from the fundamental to the practical:

• At a fundamental level, it is not even clear what a "social tie" is. Different individuals or researchers operationalize the intuition behind "friendship", "acquaintance", "regular advice seeking", etc. in different ways (see, e.g., [23]). Based on different definitions, the same real-world individuals and behavior may give rise to different mathematical models of the same "social network."

• Mathematical models of processes on social networks (such as opinion adoption or tie formation) are at best approximations of reality, and frequently mere guesses or mathematically convenient inventions. Furthermore, the models are rarely validated against real-world data, in large part due to some of the following concerns.

• Human behavior is typically influenced by many environmental variables, many of which are hard or impossible to measure. Even with the rapid growth of available social data, it is unlikely that data sets will become sufficiently rich to disentangle the dependence of human behavior on the myriad variables that may shape it.

• Inferring diffusion networks relies on a choice of model and hyperparameters, many of which are difficult to make. Furthermore, while parameter inference is computationally efficient for many models, this is not universally the case.

Both the models for social network processes and their inferred parameters must be treated with caution under noise and uncertainty. Such precautions must be taken for the sake of drawing scientific insight, and when using the inferred models to make computational social engineering decisions. Indeed, the correctness guarantees for algorithms are predicated on the assumption of the correctness of the model and the inferred parameters. When this assumption fails — which is inevitable — the utility of the algorithms' output is compromised. Thus, there is an urgent need to first design analytic methods to decide whether an algorithm is vulnerable to a given level of noise in the inferred parameters, and second to propose algorithms that are robust to the existence of noise and uncertainty.

1.1 Summary of Thesis Work

Our goal in this thesis is to provide a solution that enables applications of diffusion processes to positive effect by explicitly handling the noise and uncertainty in modeling the diffusion process. More specifically, we focus on making the application of Influence Maximization more practical by first designing more accurate algorithms to infer the diffusion networks as well as their parameters, and second designing robust optimization algorithms to target influential users under noise and uncertainty in diffusion models and model parameters. We achieve our goal by addressing all of the above challenges as follows.

To address the challenge of incomplete observations, we build upon recent advances in influence function learning [35, 110] to learn influence functions under two types of incomplete observations: randomly missing activities and adversarially missing nodes. The former assumes random loss of node activations: namely, for each activated node independently, the node's activation is observed only with probability $r$, the retention rate, and fails to be observed with probability $1 - r$. The latter assumes that all activities of a set of nodes chosen adversarially are unobserved in all cascades.
By interpreting the incomplete observations as complete observations in a transformed graph, we establish proper PAC learnability results with good sample complexity bounds for two popular diffusion models, the Discrete-Time Independent Cascade (DIC) and Discrete-Time Linear Threshold (DLT) models, under incomplete cascades with randomly missing activities. Towards designing more practical algorithms and obtaining learnability under a broader class of diffusion models, we propose an improper learning algorithm based on the parameterization of Du et al. [35] in terms of reachability basis functions, with a theoretical guarantee on sample complexity. Moreover, we show that both our proper and improper learning algorithms can be generalized to handle incomplete cascades with adversarially missing nodes.

To address the challenge of data scarcity in inferring multiple topic-specific diffusion networks or snapshots of dynamic diffusion networks, we propose a hierarchical graphical model, referred to as the MultiCascades model (MCM), that exploits the commonality between multiple related diffusion networks. We construct our MCM model based on the observation that although the number of cascades associated with each network is very limited, we are able to collect cascades on many different topics or over a long time period, and the associated influence networks (either topic-specific or time-specific) are highly correlated. In MCM, all the diffusion networks share the same network prior, e.g., the popular Stochastic Blockmodel or Latent Space Models. The parameters of the network priors can be effectively learned by gleaning evidence from a large number of inferred networks. In return, each individual network can be inferred more accurately thanks to the prior information. Furthermore, we develop efficient inference and learning algorithms so that MCM is scalable for practical applications.

To address the challenge of incorporating rich information in cascades, we develop the HawkesTopic model (HTM) for analyzing text-based cascades, such as "retweeting a post" or "publishing a follow-up blog post", with content information. HTM combines Hawkes processes [99] and topic modeling [14, 15] to simultaneously reason about the information diffusion pathways and the topics characterizing the observed textual information. HTM captures the temporal aspect of the cascades via the mutually exciting nature of the Hawkes process, i.e., an event can trigger future events. The content information is modeled as the marks associated with the events in HTM, namely the topics of the documents. We propose an efficient joint variational inference algorithm based on a mean-field approximation to discover both the pathways of information propagation and the evolution of the content.

To address the challenge of the inevitable noise contained in the parameters of the inferred diffusion networks, we first focus on analyzing the stability of the well-studied Influence Maximization problem when the input diffusion networks are noisy. The goal of Influence Maximization is to identify a set of $k$ nodes in a social network whose joint influence on the network is maximized. We exhibit simple inputs on which even very small estimation errors may mislead every algorithm into highly suboptimal solutions, motivating the necessity of stability analysis. By proposing the Perturbation Interval model to quantify the misestimation of the parameters, we formalize the susceptibility of Influence Maximization instances to estimation errors as a clean algorithmic question, referred to as the Influence Difference Maximization problem.
We establish strong hardness of the Influence Difference Maximization problem and empirically study the instabilities of Influence Maximization across a variety of real-world networks and diffusion settings.

To address the challenge of designing robust algorithms that handle noise in the inferred network, we propose the Robust Influence Maximization framework, wherein an algorithm is presented with a set of influence functions, typically derived from different influence models or parameter settings for the same model. The different parameter settings could be derived from observed cascades on different topics, under different conditions, or at different times. The algorithm's goal is to identify a set of $k$ nodes which are simultaneously influential for all influence functions, compared to the (function-specific) optimum solutions. We establish strong approximation hardness results for this problem and propose an efficient bicriteria approximation algorithm, based on robust submodular optimization [82], to approximate the optimum robust influence to within a factor of $1 - 1/e$ when enough extra seeds may be selected. Furthermore, we prove the optimality of our algorithm in terms of approximation guarantees, up to a logarithmic factor.

By addressing all five key challenges, we show that (1) Network Inference accuracy can be improved under noisy and incomplete observations, and (2) robust algorithms can be designed for social network analysis and optimization under noisy parameters. Specifically, we make the following claims:

1. Influence functions can be learned accurately and efficiently, with theoretical guarantees, despite random and adversarial loss of activations in incomplete cascades.

2. Hierarchical graphical models can be built for inferring diffusion networks from a limited number of observations by exploiting the commonality between multiple related diffusion networks.

3. Incorporating additional content information, besides the temporal aspect of cascades, contributes to better Network Inference algorithms via joint inference.

4. Noise contained in the parameters of the inferred diffusion networks can mislead analyses into wrong conclusions and optimization algorithms into highly suboptimal solutions.

5. Robust influence maximization algorithms with approximation guarantees can be designed to find nodes that are simultaneously influential for multiple influence functions.

1.2 Thesis Outline

In Chapter 2, we provide the preliminaries, including notations and concepts used throughout the thesis. Moreover, we overview the basics of diffusion processes in social networks by introducing widely used diffusion models. We then review the existing techniques for the three major tasks that are central to this work: Network Inference, Influence Function Learning, and Influence Maximization.

In Chapter 3, we describe the datasets used in this thesis, including the models used to generate synthetic networks, real-world social networks, and real-world cascade datasets. We also introduce the metrics we use to evaluate the accuracy of the inferred diffusion networks.

In Chapter 4, we first demonstrate the impact of incomplete observations on existing Network Inference algorithms via experiments. To mitigate the effects of incompleteness, we describe our algorithms for both proper and improper learning of influence functions under incomplete cascades with random and adversarial loss of activations. We also provide theoretical analysis to establish PAC learnability with sample complexity bounds.

In Chapter 5, we describe our MultiCascades model for inferring multiple diffusion networks from a limited number of observations.
We demonstrate how MCM mitigates data scarcity with a shared prior, and how MCM incorporates prior knowledge about network structure via the choice of the network generation prior. In addition, we provide an efficient joint inference algorithm for our MultiCascades model.

In Chapter 6, we describe our HawkesTopic model for analyzing text-based cascades by combining temporal and content information. We provide a joint variational inference algorithm for HTM to simultaneously infer the diffusion network and discover the thematic topics of the documents.

In Chapter 7, we first propose a framework to measure the stability of Influence Maximization, with the Perturbation Interval model to quantify the noise in the inferred diffusion network. We then design an efficient algorithm for Robust Influence Maximization to find influential users robust across multiple diffusion settings. We also provide theoretical analysis of the hardness of the Robust Influence Maximization problem and prove approximation guarantees for our algorithm.

In Chapter 8, we summarize our contributions and discuss their advantages and potential drawbacks. We also discuss potential future work to extend the contributions of this thesis.

1.3 Related Publications

Parts of this thesis have been published in machine learning and data mining conferences. The list includes:

• Related to Chapter 4: Xinran He, Ke Xu, David Kempe and Yan Liu, "Learning Influence Functions from Incomplete Observations", in Proc. 28th Advances in Neural Information Processing Systems, 2016.

• Related to Chapter 5: Xinran He and Yan Liu, "Not Enough Data? Joint Inferring Multiple Diffusion Networks via Network Generation Priors", in Proc. 10th ACM Intl. Conf. on Web Search and Data Mining, 2017.

• Related to Chapter 6: Xinran He, Theodoros Rekatsinas, James Foulds, Lise Getoor and Yan Liu, "HawkesTopic: A Joint Model for Network Inference and Topic Modeling from Text-Based Cascades", in Proc. 32nd Intl. Conf. on Machine Learning, 2015.

• Related to Chapter 7:
  – Xinran He and David Kempe, "Stability of Influence Maximization", arXiv preprint arXiv:1501.04579, 2015.
  – Xinran He and David Kempe, "Robust Influence Maximization", in Proc. 22nd Intl. Conf. on Knowledge Discovery and Data Mining, 2016.

Chapter 2

Preliminaries

In this chapter, we review and define concepts used throughout the rest of this thesis. We also review several widely used diffusion models and introduce the three main tasks of this thesis: Network Inference, Influence Function Learning and Influence Maximization. Popular approaches to each task under different diffusion models are reviewed.

2.1 Basics

In this work, we use bold-face symbols $\mathbf{a} = (a_1, \ldots, a_n)$ to denote vectors. All vectors are assumed to be column vectors unless specified otherwise. We denote the vector of all zeros by $\mathbf{0}$ and use $\mathbf{e}_i$ to denote the unit vector with 1 in the $i$-th coordinate and 0 in all others. For a variable $z$ taking values from a discrete set $\{1, 2, \ldots, k\}$, the one-of-$k$ representation is denoted as $\mathbf{z}$: $\mathbf{z} = \mathbf{e}_i$ if and only if $z = i$. We use $\mathbb{I}$ to denote the indicator function: $\mathbb{I}[\text{condition}] = 1$ if and only if the condition is true.

2.2 Set and Set Functions

In this work, we use capital italic letters $A, S, T$ to denote sets. The power set of a set $S$, i.e., the set of all subsets of $S$, is denoted as $2^S$. We use $|S|$ to denote the cardinality of set $S$. We use $\chi_S(\cdot)$ to denote the characteristic function of set $S$: $\chi_S(v) = 1$ if and only if $v \in S$. For a finite set $S \subseteq \{1, 2, \ldots, n\}$, we use $\boldsymbol{\chi}_S$ as the characteristic vector of $S$:
$\chi_{S,i} = 1$ if and only if $i \in S$.

For a finite set $V$, a set function $f: 2^V \to \mathbb{R}$ maps a subset $S \in 2^V$ to a real number. The following set function properties are used extensively in this work:

1. $f$ is non-negative if $f(S) \ge 0$ for all $S \subseteq V$.

2. $f$ is normalized if $f(\emptyset) = 0$.

3. $f$ is monotone non-decreasing if $f(S) \le f(T)$ whenever $S \subseteq T \subseteq V$. Similarly, $f$ is monotone non-increasing if $f(S) \ge f(T)$ whenever $S \subseteq T \subseteq V$.

4. $f$ is submodular if $f(S \cup \{v\}) - f(S) \ge f(T \cup \{v\}) - f(T)$ whenever $S \subseteq T \subseteq V$ and $v \notin T$. That is, $f$ has the "diminishing returns property".

2.2.1 Submodular Function Maximization

Because many algorithms and results in this work use submodular function maximization, we briefly review algorithms to maximize both monotone and non-monotone submodular functions. Though the submodular maximization problem has been studied under many different types of constraints, such as cardinality constraints, knapsack constraints and matroid constraints [81], we only review maximizing submodular functions under cardinality constraints, due to their relevance to this thesis. Let $f$ be a submodular function. We are interested in the following optimization problem:

$$\max_{S \subseteq V:\ |S| \le k} f(S).$$

While Feige has shown that solving the above optimization problem exactly is NP-hard [46], constant-factor approximations are possible for monotone submodular functions with the simple hill-climbing greedy algorithm shown in Algorithm 2.1. Starting from an empty set, the algorithm performs $k$ iterations, in each one adding the element with the maximum marginal gain. Nemhauser, Wolsey and Fisher have shown in the following theorem that this simple greedy algorithm provides a $1 - 1/e$ constant approximation guarantee [113].

Algorithm 2.1: Hill-Climbing Greedy Algorithm
1: Initialize: $S_0 \leftarrow \emptyset$
2: for $i = 1, \ldots, k$ do
3:   Let $u$ be the element maximizing the marginal gain $f(S_{i-1} \cup \{u\}) - f(S_{i-1})$.
4:   Let $S_i \leftarrow S_{i-1} \cup \{u\}$.
5: end for
6: Return $S_k$

Theorem 2.1 (Nemhauser, Wolsey and Fisher [113]). Let $f(\cdot)$ be a monotone submodular function. Then the greedy algorithm returns a $k$-element set $S^g$ such that $f(S^g) \ge (1 - 1/e) \cdot f(S^*)$, where $S^* \in \operatorname{argmax}_{|S| \le k} f(S)$.

However, this simple algorithm fails to maximize non-monotone submodular functions: Buchbinder et al. [22] have shown that the greedy algorithm can lead to an arbitrarily bad solution when applied to non-monotone functions. Approximation algorithms for maximizing non-monotone submodular functions subject to a cardinality constraint have recently received significant attention [22, 134].

In this work, we use the Random Greedy algorithm of [22], given as Algorithm 2.2 below. It is a natural generalization of the simple greedy algorithm of Nemhauser et al.: instead of picking the best single element to add in each iteration, it first finds the set of the $k$ individually best single elements (i.e., the elements which, when added to the current set, give the largest, second-largest, third-largest, ..., $k$-th largest gain). Then, it picks one of these $k$ elements uniformly at random and continues to the next iteration. The approximation guarantee for Random Greedy is captured by Theorem 2.2.

Algorithm 2.2: Random Greedy Algorithm
1: Initialize: $S_0 \leftarrow \emptyset$
2: for $i = 1, \ldots, k$ do
3:   Let $M_i \subseteq V \setminus S_{i-1}$ be the subset of size $k$ maximizing $\sum_{u \in M_i} f(S_{i-1} \cup \{u\}) - f(S_{i-1})$.
4:   Draw $u_i$ uniformly at random from $M_i$.
5:   Let $S_i \leftarrow S_{i-1} \cup \{u_i\}$.
6: end for
7: Return $S_k$
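To make the two procedures concrete, here is a short Python sketch of Algorithms 2.1 and 2.2 for a set function supplied as a callable. It is an illustrative sketch rather than code from the thesis, and the Random Greedy version omits the dummy-element padding used in the formal analysis of [22]; the coverage function at the bottom is a standard monotone submodular example.

```python
import random

def greedy_maximize(f, ground_set, k):
    """Algorithm 2.1: repeatedly add the element with the largest marginal gain."""
    S = set()
    for _ in range(k):
        u = max((x for x in ground_set if x not in S),
                key=lambda x: f(S | {x}) - f(S))
        S.add(u)
    return S

def random_greedy_maximize(f, ground_set, k, rng=random.Random(0)):
    """Algorithm 2.2: add a uniformly random element drawn from the k
    elements with the largest marginal gains."""
    S = set()
    for _ in range(k):
        top_k = sorted((x for x in ground_set if x not in S),
                       key=lambda x: f(S | {x}) - f(S), reverse=True)[:k]
        S.add(rng.choice(top_k))
    return S

# Toy usage with a coverage function (monotone submodular).
sets = {'a': {1, 2, 3}, 'b': {3, 4}, 'c': {4, 5, 6}, 'd': {1, 6}}
cover = lambda S: len(set().union(*(sets[x] for x in S))) if S else 0
print(greedy_maximize(cover, sets.keys(), 2))  # e.g., {'a', 'c'}
```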
Theorem 2.2 (Buchbinder et al. [22]). Let $f: 2^V \to \mathbb{R}$ be a non-negative submodular (but not necessarily monotone) function, where $|V| = n$. Consider the problem of maximizing $f(S)$ subject to the constraint that $|S| = k$, and let $S^*_k$ be the optimum set of size $k$.

1. The set $S^g_k$ returned by the Random Greedy Algorithm guarantees
$$\mathbb{E}[f(S^g_k)] \ge \max\left(0.266,\ \tfrac{1}{e} \cdot \left(1 - \tfrac{k}{en}\right)\right) \cdot f(S^*_k),$$
where the expectation is taken over the random choices of the algorithm.

2. By taking the better of the outputs of the Random Greedy Algorithm and the "Continuous Double Greedy Algorithm," the resulting set $\hat{S}$ guarantees that
$$\mathbb{E}[f(\hat{S})] \ge 0.356 \cdot f(S^*_k).$$

The current best approximation guarantee — a factor 0.356 (slightly less than $1/e$) — is achieved by running Random Greedy and another algorithm called "Continuous Double Greedy" and keeping the better of the two solutions. While taking the better of the two algorithms gives a better approximation guarantee, in this work, we prefer to use just the Random Greedy algorithm due to its simplicity and efficiency. Notice that when $k \ll n$, the approximation guarantee of Random Greedy is very close to $1/e$.

In addition to constant-factor approximation algorithms for set function optimization, we consider bicriteria approximation algorithms in this work. In the context of set function maximization under a cardinality constraint, a bicriteria algorithm gets to pick more elements than the optimal solution, but is judged against the optimum solution with the original bound $k$ on the size of the returned solution; that is, the algorithm is allowed to relax the constraint on the size of the returned solution.

2.3 Distributions

In this section, we review several probability distributions used in this work. In general, we use $f(\cdot)$ to denote the probability density functions of continuous distributions.

1. Discrete distribution: Given a set of $K$ distinct elements $\{1, \ldots, K\}$ and a parameter vector $\boldsymbol{\theta} \in \mathbb{R}^K_{\ge 0}$ satisfying $\mathbf{1}^T \boldsymbol{\theta} = 1$, a random variable $x$ follows the discrete distribution, i.e., $x \sim \mathrm{Discrete}(\boldsymbol{\theta})$, if $\mathrm{Prob}[x = i] = \theta_i$.

2. Poisson distribution: We use $x \sim \mathrm{Poisson}(\theta)$, $\theta > 0$, to denote that a random variable $x$ follows the Poisson distribution with parameter $\theta$. We have $\mathrm{Prob}[x = k] = \frac{\theta^k e^{-\theta}}{k!}$.

3. Normal (Gaussian) distribution: We use $x \sim N(\mu, \sigma^2)$ to denote that a random variable $x$ follows a univariate Normal distribution with mean $\mu$ and variance $\sigma^2$. Similarly, we use $\mathbf{x} \sim N(\boldsymbol{\mu}, \Sigma)$ to denote that a random variable $\mathbf{x} \in \mathbb{R}^K$ follows the multivariate Normal distribution with mean vector $\boldsymbol{\mu}$
The Exponential distribution, Power Law distribution and Rayleigh distribution are usedmostlyasthedelaydistributionsinthediusionprocessesstudiedinthisthesis. Under this context, we use f (·) to denote their probability density functions. 20 2.4 Basics of Diusion Process Broadly speaking, entities such as information, ideas, and influence propagate through links between individuals in the social network. We refer to this type of propagation phenomena as diusion in a network. 2.4.1 Diusion Network The links between individuals form the diusion network in which the entity propagates in the diusion process. In this thesis, the diusion network is modeled as a directed graph G=(V,E) where n = |V| and m = |E| are the number of nodes and edges in the graph respectively. Each node represents a domain-specific entity in the social network. For example, each node could represent a user in Face- book or Twitter, a researcher in a collaboration network or an online news media site when considering the propagation of online news stories. The edges represent the social relation between the nodes in the network such as friendship relation- ship in Facebook, follower-followee relationship in Twitter or citation relationship in collaboration networks. We use N(v) to denote the in-neighbors of node v, i.e., N(v)={uœ V|(u,v)œ E} and d v =|N(v)| the in-degree of node v. Each edge e =(u,v) is associated with a parameter ◊ u,v representing the strength of influence user u has on v. Depending on the diusion model, there are dierent ways to represent the strength of influence between individuals. We use ✓ to denote the set of all edge parameters as a vector. 21 2.4.2 Cascades As an item propagates in a diusion process, we refer to the trace of the propagation as a cascade. A cascade consists of activities carried out by the nodes when they “adopt” the item under propagation. We denote each activity as a tuple a i =(t i ,v i ,X i ). This means that node v i carries out the activity at time t i . X i are additional features associated with the activity. The “adoption” activity could mean dierent behaviors in dierent contexts of diusion processes. For example, an activity could be a user buying a new product or adopting a new technology under the study of the “word of mouth eect”. Another example of an activity is a user publishing a post about the information under propagation on her timeline in Facebook or Twitter. The activity could also represent a researcher publishing a scientific paper or an online news media site publishing a story on a certain event. A cascade c is a sequence of activities, namely c ={a i |i=1,2,...,|c|}. If the diusion process reaches a node v, namely v carries out an activity in the cascade, we say that node v is activated in the cascade; otherwise, we say that node v remains inactive during the process. We denote the activation time of node v as t v . If node v is not activated in the cascade, we write t v = Œ . In this work, we consider dierent types of cascades. The most widely studied cascades only contain temporal information; this is we have X i =ÿ . Examples include the propagation of the adoption of a new product for which the time when the node is activated is the only information collected in the cascades. We study this type of cascade in Chapters5and7. Beyondonlytemporalinformation, acascadecouldcontainricher information in X i . For example, if the activity corresponds to the publication of a scientific article, X i could contain the text content of the published paper. 
We term this type of cascade text-based cascades and study them in detail in Chapter 6.In 22 this work, we also consider activation-only cascades studied in [5, 35, 110]; that is we only observe which set of nodes have carried out the adoption activities, but not when these activities occur. We can represent an activation-only cascade as a pair of sets, i.e., c=(S,A), where S is the set of nodes initiating the cascades and A is the set of nodes activated by the end of the diusion process. This type of cascade is observed when it is dicult or impossible to observe the temporal information in the diusion process. For example, for the propagation of diseases, we may only observewhowasinfectedbutnotwhenweretheyinfected. Itshouldbenoticedthat other work refer activation-only cascades as partial/incomplete observations (e.g., Narasimhan et al. in [110]); we change the terminology to avoid confusion with “incomplete observations” with missing activities or nodes studied in Chapter 4. 2.5 Diusion Models Diusionmodelsdescribehowinformationpropagatesinthediusionnetwork. Inmostmodels,thediusionprocessstartswithasetofactivenodes S™ V referred to as seeds. The diusion process then unfolds in either discrete or continuous time according to the dynamics specified in the diusion model. Generally speaking, the diusion models can be categorized into two types: progressive models and non- progressive models. In progressive diusion models, each user carries out at most one activity during each cascade. Progressive models capture scenarios such as the propagation of the adoption of a new product where each user buys at most one product. Another example is the propagation of the “like” of stories in Facebook where each user can only like the story once. On the other hand, in non-progressive diusion models, each user can carry out multiple activities in each cascade. It 23 models situations such as people having discussions on Twitter where each user can post multiple tweets on the same topic. Other examples could be the propagation of scientific ideas where one researcher publishes multiple papers or the propagation of an online news story where multiple articles can be published by the same online news media site. In this section, we review in detail Cascade models and Point Process models as two mostly widely used classes of progressive and non-progressive diusion models. The analysis and algorithms in this thesis often assume these two types of models. 2.5.1 Cascade Models Under cascade models, the diusion process begins with a set of seed nodes (initial adopters) S, who start active. It then proceeds in discrete or continuous time; with a probabilistic process, additional nodes may become active based on the influence from their neighbors. The process terminates when no new nodes become active anymore. By definition, we have A 0 =S. 2.5.1.1 Discrete-time Cascade Models The Discrete-time Linear Threshold (DLT) model [77] and the Discrete-time Independent Cascade (DIC) model [77] are two of the most widely used discrete-time cascade models. Motivated by the the models studied by sociologists with node-specific thresh- olds [62, 131], Kempe et al. propose the DLT model to study the diusion process. Under the DLT model, each node v has a threshold ÷ v drawn independently and uniformly from the interval [0,1]. 
The diusion under the DLT model unfolds in 24 discrete time steps: A node v becomes active at step t if the total incoming weight from its active neighbors exceeds its threshold: q uœ N(v)fl A t≠ 1 ◊ u,v Ø ÷ v where the weight ◊ u,v is the influence strength parameter associated with edge e=(u,v). The DIC model is a discrete-time model following the study of marketing by Goldenberg, Libai and Muller [49, 50]. The influence strength parameter associated withedgee=(u,v)satisfying ◊ u,v œ [0,1]representsanactivationprobabilityinthe DIC model instead of the edge weight in the DLT model. When a node u becomes active in stept, it attempts to activate all currently inactive neighbors in stept+1. For each neighbor v, it succeeds with probability ◊ u,v . If it succeeds, v becomes active; otherwise, v remains inactive. Once u has made all these attempts, it does not get to make further activation attempts at later times. It has been shown by Kempe et al. that both the DIC and DLT models are special cases of a more general model, the Triggering Model [77]. Under this model, each node v independently chooses a random “triggering set” T v ™ N(v) according to some distribution over subsets of its neighbors. Under this model, an inactive node v becomes activated at step t if it has a neighbor in its chosen triggering set that is active at time t≠ 1, namely T v fl A t≠ 1 ”=ÿ . For example, the TriggeringModelsubsumestheDICmodelasuserv canselecttriggeringsetT v with probability r uœ N(v)\Tv (1≠ ◊ u,v ) r uœ Tv ◊ u,v ; that is user v decides whether to include each of its neighbors into the triggering set independently with probability ◊ u,v . Moreover,asKempeetal.showin[77],thereisaevenbroaderdiusionmodel, the General Threshold (GT) model that generalizes the Triggering model [78]. Instead of assuming a linear relation among the incoming weights, the GT model assumes that there is a monotone threshold function f v that maps subsets of v’s neighbor set to real numbers in [0,1]. The threshold function f v is assumed to 25 be normalized, satisfying that f v (ÿ )=0. Each node v has a threshold ÷ v drawn independentlyanduniformlyfromtheinterval [0,1]. Usually, thethresholdfunction f v is also assumed to be submodular. Asmostreal-worlddiusionprocessesoccurincontinuoustime,onehastodis- cretize time into steps to apply the above discrete-time cascade models. It turns out that it remains a challenging question to decide the proper scale for discretization, and the artificial discrete time steps make the above models inaccurate for studying real-world cascades [51, 127]. To address these disadvantages, several extensions have been proposed to generalize the cascades model to handle continuous-time cascades directly. 2.5.1.2 Continuous-time Cascade Models Both DIC and DLT models have been generalized to capture diusion process incontinuoustime. Continuous-timeIndependentCascade(CIC)modelwas firstindependentlyproposedbyBharathietal.[13]andGomez-Rodriguezetal.[52]. Under the CIC model, there are two parameters associated with each edge. Besides the activation probability denoted as p u,v , there is also a delay distribution P u,v associated with each edge with parameter – u,v ; that is we have ◊ u,v =(p u,v ,– u,v ). Similar to the discrete-time models, the diusion process starts with a set of initial adopts as seeds active at time 0. 
Consider the model in a generative sense when the cascade c reaches node u at time t u and the model need to decide whether its inactive neighbor v will be activated and when if it is activated. Similar to the DIC model, withprobabilityp u,v ,usucceedsinactivatingnodev; otherwise, theattempt fails and u does not have any further chance to activate node v. If the activation attempt succeeds, a delay time d u,v is drawn from the delay distribution P u,v . d u,v 26 is the duration it takes u to activate v. If multiple nodes succeed in activating node v simultaneously, the activation time t v is taken as the earliest one; that is if both node u and w succeed in activating v with delay time d u,v and d w,v , the activation time of v is min{t u +d u,v ,t w +d w,v }. There is one simplification of the CIC model which is widely assumed in many works [36, 37, 38, 51, 56, 108]. The variant of the CIC model assumes the activation probability p u,v =1 for all edges; that is all attempts of activation are guaranteed to be successful. In this case, we use ◊ u,v to denote the parameter of the delay distribution – u,v . Moreover, there is a specified observation window [0,· ]. Nodes are considered activated in the process if they are activated within the observation window [0,· ]. We term this variant of the CIC model as the CIC-Delay model. The DLT model has also been extended to continuous time by Saito et al. in [122, 123] as theAsynchronous Linear Threshold (AsLT) model. Under the AsLT model, besides the threshold ÷ v , each node is associated with a time-delay parameter r v Ø 0. The diusion process unfolds in continuous time starting from a giveninitial activesetS as follows: suppose that the totalweightfrom activeparent nodes of v becomes larger than ÷ v at time t for the first time. A delay time d v is drawn from the Exponential distribution with parameter r v and node v becomes active at time t +d v . The other diusion mechanisms are the same as the DLT model. 2.5.1.3 Alternative Reachability Characterization In most cascades models, the activation of nodes can be characterized alter- natively via reachability from a seed set using live-edge graphs. A live-edge graph H has the same set of nodes V as in the diusion network G where the existence of 27 edges represents successful activation attempts in the diusion process. The diu- sion process defines a distribution over live-edge graphs H assigning probability “ H to each live-edge graph H. The distribution for the DIC, DLT and CIC models are as follows: 1. DICmodel: Foreachedgee=(u,v)inthediusionnetwork,weindependently insert it into the live-edge graph H with probability ◊ u,v . 2. DLT model: For each node vœ V in the diusion network, we independently pick at most one of its incoming edges at random. We insert the edge into the live-edge graph with probability ◊ u,v and insert no edge with probability 1≠ q uœ N(v) ◊ u,v . 3. CICmodel: Theedgesinthelive-edgegraphundertheCICmodelaredecided exactly as in the DIC model. Moreover, for each edge in the live-edge graph, we independently sample a delay time d u,v ≥P u,v . As both the DIC and DLT models are special cases of the Triggering model, the over live-edge graphs H is the same as sampling the triggering set T v for each node v independently and inserting the edges from every node uœ T v to node v. Theresultsof[36,77]haveshownthattheprobabilitythatanodevisactivated by seed set S equals the probability of sampling a live-edge graph such that node v is reachable from the seed set S. 
Lemma 2.3 (Kempe et al. [77]). Let the distribution over live-edge graphs H be the one defined above under the DIC and the DLT model, respectively. For a diffusion process with seed set S, we have

$$\text{Prob}[v \text{ is activated}] = \sum_{H:\ \text{at least one node in } S \text{ has a path to } v \text{ in } H} \gamma_H.$$

Similar results have been established for the CIC model by Du et al. [36].

Lemma 2.4 (Du et al. [36]). Define $d_H(u,v)$ as the distance between nodes u and v in live-edge graph H, with the delay times $d_{u,v}$ as edge lengths, and let the distribution over live-edge graphs H be the one defined above under the CIC model. For a diffusion process with seed set S and observation window $\tau$, we have

$$\text{Prob}[v \text{ is activated}] = \sum_{H:\ \min_{u \in S} d_H(u,v) \le \tau} \gamma_H.$$

To reduce the representation complexity, notice that from the perspective of activating v, two different live-edge graphs H, H' are "equivalent" if v is reachable from exactly the same nodes in H and H'. Therefore, for any node set T, let $\beta^*_T := \sum_{H:\ \text{exactly the nodes in } T \text{ have paths to } v \text{ in } H} \gamma_H$ under the DIC and DLT models, and $\beta^*_T := \sum_{H:\ \text{exactly the nodes in } T \text{ have paths of length at most } \tau \text{ to } v \text{ in } H} \gamma_H$ under the CIC model. We term the characteristic vectors of the sets T feature vectors $r_T = \chi_T$, where we interpret the entry for node u as indicating that u has a path to v in a live-edge graph. More precisely, let $\phi(x) = \min\{x, 1\}$. We define $\phi_T(S) = \phi(\chi_S^\top r_T)$ as the reachability basis function. $\phi_T(S) = 1$ if and only if v is reachable from S, and we can write

$$\text{Prob}[v \text{ is activated}] = \sum_T \beta^*_T \cdot \phi_T(S) = \sum_T \beta^*_T \cdot \phi(\chi_S^\top r_T).$$
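These basis functions are straightforward to evaluate in vector form. Below is a minimal sketch, with sets encoded as 0/1 numpy vectors; the encoding and function names are our own illustrative choices, not fixed notation from the literature.

```python
import numpy as np

def phi(x):
    """phi(x) = min(x, 1)."""
    return min(float(x), 1.0)

def reachability_basis(chi_S, r_T):
    """phi_T(S) = phi(chi_S . r_T): equals 1 exactly when some node of the
    seed set S (encoded by chi_S) can reach v in the live-edge graph whose
    reachable node set is encoded by the feature vector r_T."""
    return phi(chi_S @ r_T)

def activation_probability(chi_S, betas, feature_vectors):
    """Prob[v is activated] = sum_T beta_T * phi_T(S), approximated over a
    finite collection of sampled live-edge graphs."""
    return sum(b * reachability_basis(chi_S, r)
               for b, r in zip(betas, feature_vectors))

# Toy example with 3 nodes: v is reachable from nodes 0 and 2.
chi_S = np.array([1, 0, 0])  # seed set S = {0}
r_T = np.array([1, 0, 1])
print(reachability_basis(chi_S, r_T))  # 1.0
```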
2.5.2 Point Process Models

Cascade models are appropriate for capturing behaviors like the adoption of new products, where each user carries out at most one activity per cascade. Such progressive models fail to capture behaviors like the posting of tweets on Twitter or the publishing of multiple papers by one researcher. The most commonly used diffusion model for non-progressive diffusion processes is the Multivariate Point Process (MPP) model. Mathematically, a point process is a collection of abstract points randomly located on some underlying space, such as time in the modeling of diffusion processes [99, 127]. Each point represents an activity that occurred during the cascade. More specifically, in the MPP model, each node v is associated with a counting process $N_v(t)$, where $N_v(t)$ is the number of posting activities of node v in the time interval [0, t] (assuming the process starts at time 0). Following the traditional notation of point processes, each activity $a_i = (t_i, v_i, X_i)$ in the cascade c is referred to as an event $(t_i, v_i)$ of the process associated with node $v_i$, ignoring the additional features. Let $H_t = \{a \mid t_a < t\}$ be the history of events up to, but not including, t. A point process can be captured conveniently via the intensity process $\lambda_v(t \mid H_t)$, defined as follows [99]:

$$\lambda_v(t \mid H_t) = \lim_{\Delta t \to 0} \frac{E[N_v(t + \Delta t) - N_v(t) \mid H_t]}{\Delta t}.$$

Intuitively, the intensity process provides the expected instantaneous rate of future events at time t; that is, the expected number of new activities carried out by user v in the short time period $[t, t + \Delta t]$ is $\lambda_v(t \mid H_t)\,\Delta t$. Simple examples of point processes include Poisson processes, where the intensity is a constant function. However, the constant intensity of a Poisson process implies that future events are independent of past occurrences. In a diffusion process, by contrast, the occurrence of an event can lead to a chain of future events. For example, a seminal paper may start a new field of study, generating a large amount of follow-up work. To capture this mutually exciting nature, the Multivariate Hawkes Process (MHP) model has been used to model diffusion processes [73, 98, 99, 127, 146, 147]. To simplify notation, we drop the dependence on the history $H_t$ and write $\lambda_v(t) = \lambda_v(t \mid H_t)$. For the MHP, the intensity process $\lambda_v(t)$ takes the form

$$\lambda_v(t) = \mu_v + \sum_{a:\ t_a < t} \kappa_a(t, v),$$

where $\mu_v$ is the base intensity of the process, while each previous event a adds a nonnegative impulse response $\kappa_a(t, v)$ to the intensity, increasing the likelihood of future events. We decompose the impulse response $\kappa_a(t, v)$ into two factors that capture the influence between nodes and the temporal aspect of the diffusion, respectively:

$$\kappa_a(t, v) = \theta_{v_a, v} \cdot f(t - t_a).$$

The influence parameter $\theta_{u,v}$ associated with edge $e=(u,v)$ represents the expected number of events that a single event at node u triggers in the process of node v under the MHP model. The larger $\theta_{u,v}$ is, the stronger the influence node u has on node v. We write $\Theta = \{\theta_{u,v}\}$ for the non-negative matrix collecting all edge parameters, where only the nonzero entries $\theta_{u,v}$ correspond to edges (u,v) in the diffusion network. $f(\cdot)$ is the probability density function of the delay distribution; it captures how long it takes for an event at one node to influence other nodes (i.e., trigger events at other nodes) and how long this influence lasts.

2.6 Network Inference

An essential component of the diffusion model is the diffusion network G and the associated edge parameters $\theta_{u,v}$. Usually, the first step towards understanding the dynamics of information diffusion is to extract or learn the diffusion network. For many social network sites such as Facebook or Twitter, the diffusion network can be extracted from observable friendship networks or follower-followee networks. However, this approach suffers from the following three problems:

1. Different social network sites or researchers operationalize the intuition behind "friendship", "acquaintance", "regular" advice seeking, etc. in different ways (see, e.g., [23]). Based on different definitions, the same real-world individuals and behavior may give rise to different mathematical models of the same "social network."

2. Though the edges can be directly observed, the parameters representing the strength of influence between users are usually hidden from direct observation. Though we can observe that u and v are friends on Facebook, the probability that u's post influences v to publish a follow-up post remains to be estimated or learned.

3. Though the network and the parameters can be extracted by traditional approaches like user surveys, as discussed in Chapter 1, these methods are hard to scale to large networks with hundreds of thousands or even millions of nodes.

The increasing availability of large-scale social network data and the development of machine learning algorithms provide an alternative solution: infer the diffusion network from observed cascades among the users in the network.

The typical setting of Network Inference is that one has available cascade data of individuals adopting several behaviors or products. Assuming a probabilistic diffusion model as described in the previous section, and that cascades occur on the network according to the model, Network Inference focuses on inferring the structure of the diffusion network and the associated edge parameters $\theta_{u,v}$ representing the strength of influence.
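To make the input of Network Inference concrete, the following minimal sketch simulates activation cascades under the DIC model of Section 2.5.1.1. The adjacency-dictionary representation and the function name are our own choices for illustration.

```python
import random

def simulate_dic_cascade(graph, seeds):
    """Simulate one cascade under the DIC model.

    graph: dict mapping node u to a dict {v: theta_uv} of out-neighbors,
           where theta_uv is the activation probability of edge (u, v).
    seeds: iterable of seed nodes, active at step 0.
    Returns a dict mapping each activated node to its activation step."""
    activation_step = {v: 0 for v in seeds}
    frontier = list(activation_step)  # nodes activated in the previous step
    step = 0
    while frontier:
        step += 1
        newly_active = []
        for u in frontier:
            for v, theta_uv in graph.get(u, {}).items():
                # u gets exactly one chance to activate each inactive neighbor
                if v not in activation_step and random.random() < theta_uv:
                    activation_step[v] = step
                    newly_active.append(v)
        frontier = newly_active
    return activation_step

# Example: one observed cascade on a toy network
graph = {"a": {"b": 0.5, "c": 0.3}, "b": {"c": 0.4}, "c": {}}
print(simulate_dic_cascade(graph, seeds=["a"]))  # e.g., {'a': 0, 'b': 1}
```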
First formulated by Rodriguez et al. [52], the Network Inference problem has since been studied in numerous works under different diffusion models [1, 20, 37, 44, 51, 52, 53, 54, 55, 56, 58, 73, 84, 97, 98, 108, 109, 115, 121, 123, 124, 127, 138, 139, 146, 147]. While some algorithms focus only on inferring the structure of the diffusion network, e.g., NetInf [52] and MultiTree [55] under the CIC model, other algorithms infer both the structure and the strength of influence between individuals, e.g., ConNIe [108] under the CIC model and LowRankSparse [147] under the Hawkes process model.

Most existing algorithms for Network Inference take a Maximum Likelihood Estimation (MLE) approach, e.g., [115, 124] under the DIC model, [52, 55] under the CIC model, and [146, 147] under the Hawkes process model. Let $\Theta = \{\theta_{u,v}\}$ be the matrix collecting all edge parameters and $C = \{c_1, \ldots, c_{|C|}\}$ be the set of observed cascades. The Network Inference problem is usually formulated as the following optimization problem:

$$(G^*, \Theta^*) = \arg\max_{G, \Theta} P(C \mid G, \Theta) = \arg\max_{G, \Theta} \prod_{c \in C} P(c \mid G, \Theta).$$

Here, $P(c \mid G, \Theta)$ is the likelihood of observing cascade c under the specified diffusion model with diffusion network G and edge parameters $\Theta$. Depending on the exact form of the likelihood function, which is determined by the diffusion model, different techniques, including convex optimization [108] and submodular optimization [52, 55], are applied to solve the optimization problem. Next, we review several widely used Network Inference algorithms according to the assumed diffusion models, including the DIC, CIC and Point Process models.

2.6.1 Network Inference under the DIC model

Under the DIC model, the log cascade likelihood takes the following form. Let $T_c = \max\{t_i \mid a_i \in c\}$ be the time step of the last activation. Then

$$\log P(c \mid G, \Theta) = \sum_{t=1}^{T_c} \Big[ \sum_{v \in A_t} \log P^t_v + \sum_{v \notin A_t}\ \sum_{u \in A_{t-1} \cap N(v)} \log(1 - \theta_{u,v}) \Big].$$

Here, $P^t_v = 1 - \prod_{u \in A_{t-1} \cap N(v)} (1 - \theta_{u,v})$ is the probability that node v is activated exactly at step t. Saito et al. proposed an EM algorithm to solve the maximum likelihood estimation in order to infer the diffusion network [124]. In their algorithm, they treat the activation of node v as a latent variable following a Bernoulli distribution with parameter $P^t_v$ and iterate between estimating the posterior distribution and updating the edge parameters $\theta_{u,v}$. Netrapalli and Sanghavi improve the efficiency of the inference by showing that the log likelihood is a convex function of the transformed edge parameters $\hat\theta_{u,v} = -\log(1 - \theta_{u,v})$ [115]. Moreover, they determine the sample complexity of the inference problem, i.e., the number of cascades needed for accurate parameter recovery, under a correlation decay assumption, and they prove an information-theoretic lower bound on the number of cascades required for accurate inference. Besides the MLE-based algorithms for Network Inference under the DIC model, Goyal et al. proposed several heuristics to infer activation probabilities [58]. Moreover, the K-Lifts algorithm was proposed by Amin et al. as another heuristic for activation probability estimation [5]. Different from previous work, Amin et al. focused on inferring the network from activation-only cascades.
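The log-likelihood above is simple to evaluate directly. The sketch below transcribes it under the cascade representation used in the simulator of the previous section (a dict from node to activation step); it assumes the cascade was generated by the model, so that every activation has at least one active in-neighbor.

```python
import math

def dic_log_likelihood(cascade, in_neighbors):
    """Log-likelihood of one cascade under the DIC model.

    cascade:      dict mapping each activated node v to its step t_v.
    in_neighbors: dict mapping v to a dict {u: theta_uv} of in-neighbors.
    Nodes absent from `cascade` are treated as never activated."""
    T = max(cascade.values())
    ll = 0.0
    for v, params in in_neighbors.items():
        t_v = cascade.get(v)  # None if v stays inactive
        for t in range(1, T + 1):
            if t_v is not None and t > t_v:
                break  # v is already active; no further terms for v
            # in-neighbors of v that were activated at step t - 1
            frontier = [u for u in params if cascade.get(u) == t - 1]
            fail_all = math.prod(1.0 - params[u] for u in frontier)
            if t_v == t:
                ll += math.log(1.0 - fail_all)  # activated exactly at step t
            else:
                ll += math.log(fail_all)        # every attempt on v failed
    return ll
```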
2.6.2 Network Inference under the CIC model

Due to the advantages of continuous-time models, numerous Network Inference algorithms have been proposed under the CIC model [1, 51, 52, 55, 56, 108]. These algorithms infer the network under slightly different variants of the CIC model.

Myers and Leskovec [108] considered the scenario where the delay distribution is the same for all edges and is an input to the algorithm. The Network Inference problem then focuses on inferring the activation probabilities as the edge parameters. Let $f(\cdot)$ be the probability density function of the known delay distribution. They show that the log cascade likelihood takes the following form:

$$\log P(c \mid G, \Theta) = \sum_{v:\ t_v < \infty} \log\Big(1 - \prod_{u:\ t_u < t_v} \big(1 - f(t_v - t_u)\,\theta_{u,v}\big)\Big) + \sum_{v:\ t_v = \infty}\ \sum_{u:\ t_u < \infty} \log(1 - \theta_{u,v}).$$

They also show that the MLE optimization problem is convex after a variable transformation and can be solved efficiently with the ConNIe algorithm proposed in [108].

Gomez-Rodriguez et al. also utilize convex optimization techniques to solve the Network Inference problem, but under the CIC-Delay model [51]. They assume that the delay times on different edges follow the same family of distributions, but with different parameters; the Network Inference problem becomes inferring the parameters of the delay distributions. Let $f(\cdot \mid \theta_{u,v})$ be the probability density function of the delay distribution for edge (u,v), let $F(\cdot \mid \theta_{u,v})$ be the corresponding cumulative distribution function, and let $S(\Delta t \mid \theta_{u,v}) = 1 - F(\Delta t \mid \theta_{u,v})$ and $H(\Delta t \mid \theta_{u,v}) = f(\Delta t \mid \theta_{u,v}) / S(\Delta t \mid \theta_{u,v})$ be the survival function and the hazard function of the delay distribution, respectively. They show that the log cascade likelihood takes the following form:

$$\log P(c \mid G, \Theta) = \sum_{v:\ t_v \le \tau} \Big[ \sum_{u:\ t_u > \tau} \log S(\tau - t_v \mid \theta_{v,u}) + \sum_{u:\ t_u < t_v} \log S(t_v - t_u \mid \theta_{u,v}) + \log \sum_{u:\ t_u < t_v} H(t_v - t_u \mid \theta_{u,v}) \Big].$$

They show that the log likelihood of the cascades is a convex function of the edge parameters $\theta_{u,v}$ for several widely used delay distributions, including the Exponential, Power-law and Rayleigh distributions. Based on this observation, the NetRate algorithm infers the diffusion network via convex optimization. Recently, Daneshmand et al. improved the NetRate algorithm by incorporating an $L_1$ penalty to obtain sparse diffusion networks [56]. With the $L_1$ regularization, they provide a theoretical guarantee on the sample complexity of the Network Inference problem.
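For intuition, consider the exponential delay distribution $f(\Delta t \mid \theta) = \theta e^{-\theta \Delta t}$, for which $S(\Delta t \mid \theta) = e^{-\theta \Delta t}$ and $H(\Delta t \mid \theta) = \theta$. The sketch below evaluates the resulting log-likelihood of a single cascade; it is a toy transcription of the objective, not the NetRate solver, and the representation and names are our own assumptions.

```python
import math

def cic_delay_ll_exponential(cascade, theta, nodes, tau):
    """Log-likelihood of one cascade under the CIC-Delay model with
    exponential delays: log S(dt|th) = -th*dt and H(dt|th) = th.

    cascade: dict mapping each activated node v to its time t_v <= tau.
    theta:   dict mapping edge (u, v) to its rate parameter (0 if absent).
    nodes:   iterable of all nodes in the network."""
    ll = 0.0
    for v, t_v in cascade.items():
        # Edges from v to nodes that stayed inactive survived until tau.
        for u in nodes:
            if u not in cascade:
                ll += -theta.get((v, u), 0.0) * (tau - t_v)
        # Every potential parent's delay survived up to t_v ...
        hazard_sum = 0.0
        for u, t_u in cascade.items():
            if t_u < t_v:
                th = theta.get((u, v), 0.0)
                ll += -th * (t_v - t_u)  # log S(t_v - t_u | theta_uv)
                hazard_sum += th         # constant hazard for exponentials
        # ... but at least one parent fired at exactly t_v (skip for seeds).
        if hazard_sum > 0:
            ll += math.log(hazard_sum)
    return ll
```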
Instead of solving Network Inference as a continuous convex optimization problem, it can also be cast as a discrete optimization problem and solved via submodular function maximization [52, 55]. Rather than inferring the exact values of the edge parameters, the NetInf [52] and MultiTree [55] algorithms focus only on inferring the structure of the diffusion network, i.e., the set of its edges. They assume that the activation probabilities and the delay distribution are the same for all edges and known (inputs to the algorithms). Both algorithms infer the diffusion network by maximizing the likelihood function, leading to the following combinatorial optimization problem:

Maximize $\log P(C \mid G) = \sum_{c \in C} \log P(c \mid G)$
subject to $G = (V, E),\ |E| \le k$.

As the problem is intractable due to its exponential search space, an approximation to the likelihood function is proposed by assuming that every cascade propagates along a tree:

$$P(c \mid G) \approx \sum_{T \in \mathcal{T}(G)} P(c \mid T)\, P(T \mid G) \propto \sum_{T \in \mathcal{T}(G)}\ \prod_{(u,v) \in T} f(t_v - t_u),$$

where $\mathcal{T}(G)$ is the set of all directed spanning trees of G. The approximate likelihood can be further approximated via a maximum spanning tree (NetInf) or calculated exactly with the matrix tree theorem (MultiTree). Then, both algorithms define the improvement of log-likelihood for cascade c, $F_c(G)$, of graph G over the empty graph $\bar K$. The empty graph $\bar K$ has one additional node besides V and contains only the n edges from this additional node to each $v \in V$, with activation probability $\epsilon$. For the NetInf algorithm,

$$F_c(G) = \max_{T \in \mathcal{T}(G)} \log P(c \mid T) - \max_{T \in \mathcal{T}(\bar K)} \log P(c \mid T).$$

For the MultiTree algorithm,

$$F_c(G) = \log \sum_{T \in \mathcal{T}(G)} P(c \mid T)\, P(T \mid G) - \log \sum_{T \in \mathcal{T}(\bar K)} P(c \mid T)\, P(T \mid \bar K).$$

It has been shown that the improvement of log-likelihood $F_c(G)$ for cascade c is monotone and submodular for both NetInf [52] and MultiTree [55]. Both algorithms maximize the approximate likelihood $F_C(G) = \sum_{c \in C} F_c(G)$ with the hill-climbing greedy algorithm introduced in Section 2.2.1. Since the objective function $F_C(G)$ is monotone and submodular, this procedure yields a $1 - 1/e$ approximation guarantee [52, 55]. It should be noted that the guarantee applies only to maximizing $F_C(G)$: since there is no guarantee on how well $F_C(G)$ approximates the true log likelihood of the observed cascades, the approximation guarantee does not carry over to the quality of the inferred diffusion network.

Moreover, several extensions of the previous work have been proposed to improve the accuracy of the inference in different scenarios. Myers et al. incorporated external influence to account for activations outside the diffusion process [109]. Gomez-Rodriguez et al. utilized theories from survival analysis to improve the inference under the CIC model [53], and considered inferring a dynamic diffusion network where the strength of influence changes over time [54]. Moreover, researchers have exploited the heterogeneity of cascades to improve Network Inference accuracy: the KernelCascade algorithm in [38] generalizes the NetRate algorithm to allow different non-parametric delay distributions on the edges, and the TopicCascade algorithm in [37] allows the delay parameters to differ across the topics propagating in the network.

There also exists work that infers the network without using the MLE principle. The First-Edge algorithm was proposed by Abrahao et al. in [1]. The basic idea of the algorithm is to include, for each cascade, the edge between the seed and the first activated node into the diffusion network. While the algorithm only works for cascades with a single seed, Abrahao et al. provide a theoretical analysis showing that its sample complexity is near-optimal under mild conditions.

2.6.3 Network Inference under the Hawkes Process model

Several algorithms have been proposed to solve the Network Inference problem under the Hawkes Process model as well [73, 98, 127, 146, 147]. Following the formulation in [127], the log cascade likelihood takes the following form:

$$\log P(c \mid G, \Theta) = -Z + \sum_{a_i \in c} \log\Big( \mu_{v_i} + \sum_{a_j \in c:\ t_j < t_i} \theta_{v_j, v_i}\, f(t_i - t_j) \Big),$$

$$Z = \sum_{v \in V} \mu_v \tau + \sum_{a_i \in c}\ \sum_{v \in V} \theta_{v_i, v} \int_{t_i}^{\tau} f(t - t_i)\, dt.$$

It has been shown that directly optimizing the log likelihood is inefficient and usually leads to subpar accuracy [127, 133]. Instead, the MLE problem can be solved via an EM algorithm [127]. In this approach, each activity is associated with a latent variable indicating which previous activity triggered its occurrence, referred to as the parent of the activity. In the E-step, we estimate the posterior distribution over the parent of each activity. In the M-step, the parameters $\theta_{v_j, v_i}$ of the diffusion network are updated while fixing the parent distributions of the activities.
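The E-step has a simple closed form: the responsibility of a candidate parent $a_j$ for activity $a_i$ is proportional to its contribution to the intensity at $t_i$, with the base rate $\mu_{v_i}$ acting as a virtual parent. A minimal sketch under an exponential delay density follows; the representation and names are our own assumptions.

```python
import math

def hawkes_parent_posterior(events, mu, theta, rate=1.0):
    """E-step of EM for the multivariate Hawkes process: for each event
    a_i = (t_i, v_i), compute the posterior over its parent among all
    earlier events, treating the base intensity as a virtual parent.

    events: list of (t, v) pairs sorted by time.
    mu:     dict mapping node v to its base intensity mu_v.
    theta:  dict mapping (u, v) to the influence parameter theta_uv.
    rate:   parameter of the delay density f(dt) = rate * exp(-rate * dt)."""
    posteriors = []
    for i, (t_i, v_i) in enumerate(events):
        # Unnormalized responsibilities: base rate plus each earlier event.
        resp = {"base": mu[v_i]}
        for j in range(i):
            t_j, v_j = events[j]
            resp[j] = (theta.get((v_j, v_i), 0.0)
                       * rate * math.exp(-rate * (t_i - t_j)))
        total = sum(resp.values())
        posteriors.append({k: r / total for k, r in resp.items()})
    return posteriors
```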
Several extensions of the classic EM algorithm have been proposed to improve the inference accuracy. Zhou et al. proposed the LowRankSparse algorithm, assuming that the influence matrix $\Theta$ takes the form of a low-rank plus a sparse matrix [147]. The low-rank structure captures communities in the network and improves the inference by reducing the number of parameters to be inferred. Moreover, Linderman and Adams proposed the NetHawkes algorithm, which incorporates a network-generation prior on the diffusion network; instead of using the EM algorithm for MLE, they use a sampling approach to approximate the posterior distribution of the diffusion network [98].

Moreover, several papers extend the Hawkes Process model to a mixture of Hawkes Process models in order to capture the simultaneous propagation of multiple items [73, 146]. Besides inferring the structure of the diffusion network, these two models also separate the activities into individual cascades.

2.7 Influence Function Learning

The choice of diffusion model and its associated parameters fully defines the dynamics of the diffusion process. One essential concept is the influence function $F: 2^V \to [0,1]^n$, which maps a seed set to a vector of marginal activation probabilities: $F(S) = [F_1(S), \ldots, F_n(S)]$. Though the marginal probabilities do not capture the full information about the diffusion process (they do not capture co-activation patterns), they are sufficient for many applications, such as Influence Maximization [77] and influence estimation [36].

Though the influence function can be obtained by estimating the diffusion network under the assumed diffusion model, this approach often suffers from the following shortcomings when applied to real-world problems [35]:

• Real-world information diffusion is complicated, and it is hard to determine the most suitable diffusion model in practice.

• The high-dimensional Network Inference problem requires large amounts of cascades, which are difficult to gather in practice.

• The influence function is only implicitly defined by the diffusion model and its parameters. Computing the value of the influence function may involve time-consuming Monte Carlo simulations [77] or difficult graphical model inference problems [36].

Therefore, researchers have recently focused on learning influence functions directly, without the intermediate step of Network Inference [35, 110].

More formally, consider fixing one of the diffusion models defined above, together with its parameters. For each seed set S, let $\mathcal{D}_S$ be the distribution of the finally activated nodes when the seed set is S. (In the case of the DIC and DLT models, this is the set of active nodes when no new activations occur; for the CIC model, it is the set of nodes active at time $\tau$.) For any node v, let $F_v(S) = \text{Prob}_{A \sim \mathcal{D}_S}[v \in A]$ be its marginal activation probability. $F(S) = [F_1(S), \ldots, F_n(S)]$ is then the influence function. The Influence Function Learning problem is to learn the influence function from a set of observed cascades $C = \{c_1, \ldots, c_{|C|}\}$. The observations could be activation-only cascades [35] or cascades with complete information about each node's activation time [110]. In this thesis, we mainly focus on learning influence functions from activation-only cascades.
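Under a known model, the influence function can be approximated by simulation — the expensive baseline that direct learning seeks to avoid. A minimal Monte Carlo sketch, reusing the hypothetical simulate_dic_cascade helper from Section 2.6:

```python
from collections import Counter

def estimate_influence(graph, seeds, num_samples=1000):
    """Monte Carlo estimate of the influence function F(S) under the DIC
    model: the marginal activation probability F_v(S) of every node,
    averaged over repeated simulations. Nodes that are never activated in
    any sample are omitted (their estimate is 0)."""
    counts = Counter()
    for _ in range(num_samples):
        counts.update(simulate_dic_cascade(graph, seeds).keys())
    return {v: k / num_samples for v, k in counts.items()}
```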
Next, we formally define the learning objective for influence functions learned from activation-only cascades.

2.7.1 Learning Objective and Formal Definition

To measure the estimation error of learned influence functions, most existing works use a quadratic loss function [35, 110]. For two n-dimensional vectors x, y, the quadratic loss is defined as $\ell_{sq}(x, y) = \frac{1}{n} \cdot \|x - y\|_2^2$. We also use this notation when one or both arguments are sets: when an argument is a set S, we use the characteristic vector $\chi_S$ of the set. In particular, for an activated set A, we write $\ell_{sq}(A, F(S)) = \frac{1}{n}\,\|\chi_A - F(S)\|_2^2$.

With the loss function defined as above, the formal definition of the influence function learning problem from activation-only cascades is as follows. Let $\mathcal{P}$ be a distribution over seed sets (i.e., a distribution over $2^V$), and fix a diffusion model M and parameters, together giving rise to a distribution $\mathcal{D}_S$ for each seed set S. The algorithm is given a set of M activation-only cascades $C = \{(S_1, A_1), \ldots, (S_M, A_M)\}$, where each $S_i$ is drawn independently from $\mathcal{P}$, and $A_i$ is the (random) activated set $A_i \sim \mathcal{D}_{S_i}$. The goal is to learn an influence function F that accurately captures the diffusion process. Accuracy of the learned influence function is measured in terms of the squared error with respect to the true model: $\text{err}_{sq}[F] = E_{S \sim \mathcal{P},\, A \sim \mathcal{D}_S}[\ell_{sq}(A, F(S))]$; the expectation is over both the seed set and the randomness of the diffusion process.

PAC Learnability of Influence Functions. The learnability of influence functions has been characterized using the Probably Approximately Correct (PAC) learning framework [132] in both [110] and [35]. Let $\mathcal{F}_M$ be the class of influence functions under the diffusion model M, and $\mathcal{F}_L$ the class of influence functions the learning algorithm is allowed to choose from. We say that $\mathcal{F}_M$ is PAC learnable if there exists an algorithm $\mathcal{A}$ with the following property: for all $\epsilon, \delta \in (0,1)$, all parametrizations of the diffusion model, and all distributions $\mathcal{P}$ over seed sets S, when given activation-only training cascades $\tilde{C} = \{(S_1, A_1), \ldots, (S_M, A_M)\}$ with $M \ge \text{poly}(n, m, 1/\epsilon, 1/\delta)$, $\mathcal{A}$ outputs an influence function $F \in \mathcal{F}_L$ satisfying

$$\text{Prob}_{\tilde{C}}\big[\text{err}_{sq}[F] - \text{err}_{sq}[F^*] \ge \epsilon\big] \le \delta.$$

Here, $F^* \in \mathcal{F}_M$ is the ground-truth influence function. The probability is over the training cascades, including the seed set generation and the stochastic diffusion process. We say that an influence function learning algorithm $\mathcal{A}$ is proper if $\mathcal{F}_L \subseteq \mathcal{F}_M$, i.e., the learned influence function is guaranteed to be an instance of the true diffusion model; otherwise, we say that $\mathcal{A}$ is an improper learning algorithm.

2.7.2 Existing Algorithms

Du et al. [35] leveraged the insight that the influence functions of many diffusion models are coverage functions, and proposed a novel parameterization of such functions as a convex combination of the random basis functions introduced in Section 2.5.1.3. They designed an efficient improper learning algorithm for several widely used diffusion models, including the DIC, DLT and CIC models, with theoretical guarantees. Narasimhan et al. focused on the theoretical side of the influence function learning problem [110]. They studied the problem under several widely used diffusion models, including the DIC model and the DLT model with fixed thresholds. Proper PAC learning results, together with sample complexity bounds, are established by utilizing standard uniform convergence results from learning theory.

2.8 Influence Maximization

Among the many applications of the diffusion process, perhaps the most popular one is Influence Maximization [77]. It is based on the observation that behavioral change in individuals is frequently effected by the influence of their social contacts. Thus, by identifying a small set of "seed nodes," one may influence a large fraction of the social network.
The desired behavior may be of social value, such as refraining from smoking or drug use, using superior crops, or following hygienic practices. Alternatively, the behavior may provide financial value, as in the case of viral marketing, where a company wants to rely on word-of-mouth recommendations to increase the sales of its products.

A specific instance of Influence Maximization is described by the class of its influence model (such as DIC, CIC, DLT, or others not discussed here in detail) and the settings of the model's parameters. Together, they completely specify the dynamic process, and thus a mapping $\sigma$ from the initial seed set S to the expected number $\sigma(S)$ of nodes active at the end of the process. (The model and virtually all results in the literature extend straightforwardly when individual nodes are assigned non-negative importance scores.) In terms of the influence function, we have $\sigma(S) = \sum_{v \in V} F_v(S)$. We can now formalize the Influence Maximization problem as follows: maximize the objective $\sigma(S)$ subject to the constraint $|S| \le k$.

The Influence Maximization problem was first studied as an algorithmic question by Kempe et al. in [77] under the DIC and DLT models. They show that $\sigma(S)$ is a monotone and submodular function. These properties imply that the simple greedy algorithm in Algorithm 2.1 guarantees a $1 - 1/e$ approximation [113].

Ever since the problem of Influence Maximization was first proposed, a large number of follow-up works have been published; a detailed survey is provided in a recent monograph by Chen et al. [24]. A large number of subsequent papers focus on two directions: solving the Influence Maximization problem under new diffusion models, and designing more efficient Influence Maximization algorithms under existing models.

In the first category, Mossel and Roch resolved a conjecture in [77] by proving that the Influence Maximization objective under the GT model is submodular, thus implying a $1 - 1/e$ approximation guarantee. Du et al. considered Influence Maximization under the CIC model [36]; they proved that the objective is submodular and proposed an efficient algorithm based on an innovative scalable influence estimation method. The Influence Maximization problem has also been considered under other diffusion models, including the voter model [41], the Hawkes Process model [43] and the Ising model [106]. Researchers have also considered the Influence Maximization problem under competitive diffusion processes, where multiple items compete and propagate simultaneously in the network, for example in [61, 68, 96].

Due to the inefficiency of the original greedy algorithm [77], a large number of algorithms have been proposed to solve Influence Maximization efficiently. These algorithms broadly fall into two categories: heuristic algorithms without the $1 - 1/e$ approximation guarantee, and algorithms maintaining the guarantee with better running time. Most algorithms in the first category improve the running time by accelerating influence estimation: Chen et al. proposed simple heuristics for the DIC model when the values of the edge parameters are all the same [26]. Chen et al. also considered building local structures, like trees in the DIC model [137] and DAGs in the DLT model [28], to accelerate influence computation in Influence Maximization. Jung et al. utilized the idea of PageRank to approximate influence under the DIC model [76]. Other examples include using simulated annealing for Influence Maximization [75] and utilizing the community structure of social networks [140].
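For reference, the baseline that all of these methods accelerate is the greedy scheme itself. The sketch below spells it out using the Monte Carlo estimator from Section 2.7; it is deliberately naive — each marginal-gain evaluation triggers a fresh batch of simulations — and is not a scalable implementation.

```python
def greedy_influence_maximization(graph, nodes, k, num_samples=1000):
    """Greedy (1 - 1/e)-approximation for Influence Maximization: repeatedly
    add the node with the largest marginal gain in the expected spread
    sigma(S) = sum_v F_v(S), estimated by Monte Carlo simulation."""
    def sigma(seeds):
        if not seeds:
            return 0.0
        return sum(estimate_influence(graph, seeds, num_samples).values())

    seeds = []
    for _ in range(k):
        base = sigma(seeds)
        gains = {v: sigma(seeds + [v]) - base
                 for v in nodes if v not in seeds}
        seeds.append(max(gains, key=gains.get))
    return seeds
```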
In the second category, Leskovec et al. proposed the CELF heuristic, based on a lazy evaluation of the objective function [93]. The method was further improved by Goyal et al. to CELF++ with an additional heuristic optimization [60]. A recent breakthrough in Influence Maximization was published by Borgs et al. [18]. They proposed a preprocessing step that builds a random hypergraph based on reachability probabilities in the reversed graph; the Influence Maximization problem then becomes a maximum coverage problem on the constructed hypergraph. The algorithm preserves the $1 - 1/e - \epsilon$ approximation guarantee with a running time nearly linear in the size of the graph, $O((n+m)\epsilon^{-3}\log n)$. While the work by Borgs et al. focuses mainly on theory, Tang et al. provided a more practical implementation of the algorithm [128, 129]. With this practical implementation, the authors manage to solve Influence Maximization on graphs with 1.4 billion edges. A thorough empirical evaluation of several Influence Maximization algorithms was carried out in a recent paper by Arora et al. [7]. They claimed to "debunk a series of myths" about the efficiency of several state-of-the-art algorithms, including the IMM algorithm [128]. However, Lu et al. have pointed out several flaws in their methodology, including buggy experiments and a failure to recognize the trade-off between running time and solution quality [105].

Chapter 3
Datasets and Evaluation Metrics

In this chapter, we describe the datasets and evaluation metrics used extensively throughout the thesis. We start the chapter with the network generation models used to generate synthetic networks. We then introduce the real-world networks and cascade datasets used in the experiments. Finally, we describe the evaluation metrics for Network Inference.

3.1 Synthetic Network Models

3.1.1 Preferential Attachment Model

The first network generation model we use is the Preferential Attachment model proposed by Barabasi and Albert [9]. The generative process starts with a graph containing a single node. At each time t, while the current number of nodes is less than the required number, a new node arrives and generates k outgoing edges; the node is labeled by its arrival time t. Each outgoing edge links to an existing node with probability proportional to its degree. It has been shown by Barabasi and Albert that the generated graph has a power-law degree distribution, with probability density function $f(d) \propto (\frac{d}{k} + 1)^{-3}$, similar to real-world complex networks [9, 10].

3.1.2 Watts-Strogatz Small-World Model

The second network generation model we consider is the Small-World model proposed by Watts and Strogatz [143]. The model starts with a circle graph of n nodes where each node creates k edges to its closest nodes on each side, so that 2k edges are created per node. Then each edge is rewired to a uniformly random node with probability p. It has been shown that the Watts-Strogatz Small-World model generates graphs with small diameter, a property of many real-world complex networks [143].
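A toy generator for the Preferential Attachment model described above (plain Python; a sketch rather than the SNAP implementation used later in this thesis):

```python
import random

def preferential_attachment(n, k):
    """Barabasi-Albert preferential attachment: each arriving node t links
    to (up to) k existing nodes, chosen with probability proportional to
    degree. `endpoints` stores one copy of a node per unit of degree, so
    uniform sampling from it realizes degree-proportional selection."""
    edges = []
    endpoints = [0]  # bootstrap so node 0 can be sampled initially
    for t in range(1, n):
        targets = set()
        while len(targets) < min(k, t):
            targets.add(random.choice(endpoints))
        for v in targets:
            edges.append((t, v))
            endpoints.extend((t, v))
    return edges
```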
3.1.3 Forest-Fire Model

Though the Preferential Attachment model and the Small-World model capture some aspects of real-world complex networks, they are still dissimilar to real-world networks in many ways. To generate synthetic networks sharing more similarities with real-world networks, Leskovec, Kleinberg and Faloutsos proposed the Forest-Fire model [91, 92]. Similar to the Preferential Attachment model, it starts with an empty graph, and nodes arrive one at a time until reaching the required number of nodes n. When a node u arrives at time t, the following procedure decides which edges exist from the new node to existing nodes. It is controlled by two parameters, the forward burning probability p and the backward burning ratio r:

1. u first chooses an ambassador node v uniformly at random and creates an edge (u,v).

2. A random number x is generated according to the geometric distribution with success probability 1−p. Node u forms x outgoing edges to nodes incident to v, including both in-neighbors and out-neighbors of v. The nodes incident to v are chosen uniformly with replacement; however, the in-neighbors are selected with probability r times smaller than the out-neighbors. Let $w_1, \ldots, w_x$ be the endpoints of the created edges.

3. Step 2 is applied recursively to each of $w_1, \ldots, w_x$. As the process continues, each node can only be visited once.

It has been shown that the Forest-Fire model exhibits many properties of real-world networks, including a heavy-tailed degree distribution, community structure, the densification power law and shrinking diameter [91, 92].

3.1.4 Kronecker Graph Model

The Kronecker Graph model is another model which exhibits many desired properties of real-world networks, including a power-law degree distribution, small diameter and community structure [90]. The name of the model comes from the Kronecker product of matrices. Let U and V be two matrices of size $n \times m$ and $n' \times m'$. The Kronecker product matrix $S = U \otimes V$, of dimension $(n \cdot n') \times (m \cdot m')$, is

$$S = U \otimes V = \begin{bmatrix} u_{1,1} V & u_{1,2} V & \cdots & u_{1,m} V \\ u_{2,1} V & u_{2,2} V & \cdots & u_{2,m} V \\ \vdots & \vdots & \ddots & \vdots \\ u_{n,1} V & u_{n,2} V & \cdots & u_{n,m} V \end{bmatrix}.$$

The k-th Kronecker power of a square matrix U is defined as $U^{[k]} = U^{[k-1]} \otimes U$. The Kronecker Graph model starts with a small seed matrix U, usually of size $2 \times 2$, and generates the probabilistic adjacency matrix $P = U^{[k]}$. Each edge (u,v) is then included independently with probability P(u,v).

It has been shown that networks with different structures can be generated from the Kronecker Graph model by choosing different seed matrices [90]. In this thesis, we use the three different 2-by-2 seed matrices listed in Table 3.1. These seed matrices generate Erdős-Rényi networks, core-peripheral networks and networks with hierarchical community structure, respectively [90]. It should be noted that when the seed matrix is uniformly 0.5, the generated graph is the Erdős-Rényi graph with edge probability 1/n.

Table 3.1: Kronecker Graph Parameters

Name                     Seed matrix
Erdős-Rényi              [0.5, 0.5; 0.5, 0.5]
Core-Peripheral          [0.962, 0.107; 0.107, 0.962]
Hierarchical Community   [0.962, 0.535; 0.535, 0.107]
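Sampling a Kronecker graph is a two-step recipe — Kronecker powers followed by independent coin flips — as in this sketch (the function name is our own):

```python
import numpy as np

def kronecker_graph(seed, k, rng=None):
    """Sample a Kronecker graph: form the k-th Kronecker power of the seed
    matrix as the probabilistic adjacency matrix P = seed^[k], then include
    each potential edge (u, v) independently with probability P[u, v]."""
    rng = rng or np.random.default_rng()
    P = seed
    for _ in range(k - 1):
        P = np.kron(P, seed)
    return rng.random(P.shape) < P  # boolean adjacency matrix

# Hierarchical-community seed matrix from Table 3.1; 2^8 = 256 nodes.
seed = np.array([[0.962, 0.535], [0.535, 0.107]])
adjacency = kronecker_graph(seed, k=8)
```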
3.2 Real-world Datasets

We use both real-world social networks and real-world cascade datasets in the experiments in this thesis.

3.2.1 Facebook Network

The Facebook network is extracted from "friends lists" from Facebook. The dataset was collected from survey participants using a Facebook app [94] and is publicly available from the SNAP Library at https://snap.stanford.edu/data/egonets-Facebook.html. The extracted network consists of 4039 nodes and 88234 edges. We sampled a subgraph from the Facebook dataset via breadth-first search starting from the node with id 0. The resulting subgraph has 256 nodes and 1012 edges. We only use this subgraph in our experiments and call it the Facebook network.

3.2.2 STOCFOCS Network

The co-authorship network STOCFOCS is a multigraph extracted from papers published in the conferences STOC and FOCS from 1964–2001. Each node in the network is a researcher with at least one publication in one of the conferences. For each multi-author paper, we add a complete undirected graph among the authors. Parallel edges are compressed into a single edge with a corresponding weight. The resulting graph has 1768 nodes and 10024 edges.

3.2.3 Twitter Dataset

The first real-world cascade dataset we consider is the Twitter dataset. It consists of a complete collection of tweets between Oct. 2009 and Jan. 2010. Instead of using the raw data collection, several datasets are preprocessed from the complete collection of tweets for different experimental purposes.

3.2.3.1 Haiti Network

The Haiti network is extracted from the tweets of 274 users on the topic of the Haiti earthquake. We chose this topic because it was one of the hot topics during that time period, and many tweets were generated around the event. The network is extracted via the following procedure. First, all tweets with the keyword "Haiti" are collected from Jan. 12, 2010 for 17 days. Then the top 1000 users who published the most tweets during the period are selected. For accurate modeling, users that are highly correlated with each other are removed; most of these accounts are operated by the same persons and tweet exactly the same contents. Robot-like user accounts that tweet at very regular intervals are also removed. Finally, the set of all users with at least one interaction with another user is selected, which results in a subset of 274 users as the nodes of the Twitter network. A directed multigraph is constructed using the retweeting information: whenever user B retweets a post of user A, we insert an edge from A to B into the graph. After contracting parallel edges, the Twitter network has 383 weighted edges. It should be noted that we only apply the above data cleaning procedure to extract the Twitter network; for the other Twitter datasets, we simply choose the most active users.

3.2.3.2 Topic-specific Networks and Cascades

We extract a set of topic-specific networks and cascades from the complete collection of tweets. We treat each hashtag as a separate cascade, and extract the top 100/250 users with the most tweets containing these hashtags into two datasets (Twitter100 and Twitter250). The hashtags are manually grouped into five categories of about 70–80 hashtags each, corresponding to major events/topics during the data collection period. The five groups are: Haiti earthquake (Haiti), Iran election (Iran), Technology, US politics, and the Copenhagen climate change summit (Climate). Examples of hashtags in each group are shown in Table 3.2. Whenever user B retweets a post of user A with a hashtag belonging to category i, we insert an edge from A to B in graph i. The extracted networks in Twitter100/250 have 100 and 250 nodes, respectively. It should be noted that the extracted networks in the Twitter250 dataset are different from the Twitter network in the previous section: the former consists of five networks, all with 250 nodes, one for each topic, while the latter is the diffusion network of 274 nodes on the topic of the Haiti earthquake only.

Our decision to treat each hashtag as a separate cascade is meant to capture that most hashtags "spread" across Twitter when one user sees another use it and starts posting with it himself.
The grouping of similar hashtags captures that a user who may influence another to use the hashtag, say, #teaparty, would likely also influence the other user to a similar extent to use, say, #liberty. The pruning of the datasets was necessary because most users showed very limited activity.

Table 3.2: Examples of hashtags in each category

Category       Hashtags
Iran           #iranelection, #iran, #16azar, #tehran
Haiti          #haiti, #haitiquake, #supphaiti, #cchaiti
Technology     #iphone, #mac, #microsoft, #tech
US politics    #obama, #conservative, #teaparty, #liberty
Climate        #copenhagen, #cop15, #climatechange

3.2.4 MemeTracker Dataset

The second real-world cascade dataset we consider is the MemeTracker dataset. The MemeTracker dataset [89] contains memes extracted from the blogosphere and mainstream media sites between Aug. 2008 and Feb. 2009. It tracks the quotes and phrases that appear most frequently over time across this entire online news spectrum. Overall, MemeTracker tracks more than 17 million different phrases; about 54% of the total phrase/quote mentions appear on blogs and 46% in news media. The dataset is available for download at the SNAP Library at https://snap.stanford.edu/data/memetracker9.html.

3.2.4.1 MemeTracker Network

We construct a diffusion network from the MemeTracker dataset. We restrict ourselves to the data from August 2008 only and to the 500 most active users. We infer a diffusion network using the ConNIe algorithm [108] among the active users from the cascades during this period. The resulting network, together with its parameters (activation probabilities), is referred to as the MemeTracker-DIC network.

3.2.4.2 Topic-specific Networks and Cascades

Similar to the Twitter dataset, we extract a set of topic-specific networks and cascades. We manually pick five major events/topics of the year 2008 over a wide range of domains: Barack Obama, the Olympic Games, the Financial Crisis, the Iraq War and Microsoft news. We then extract all related memes as cascades of the five topics. As for the Twitter dataset, we extract the topic-specific diffusion networks between the 250 most active sites using hyperlinks from articles that contain the corresponding topics. Due to noise in the hyperlinks between blog posts, for example links to advertisement pages or hyperlinks in irrelevant frames, we further exclude the hyperlinks that never appear in the collection of cascades. That is, we only include an edge (u,v) if u is at least once activated before v in some cascade. This dataset is referred to as the MemeTrackerTopic dataset.

3.2.4.3 Time-specific Networks and Cascades

Besides topic-specific networks, we extract a set of time-specific networks and cascades according to the time of the activities. We extract the 128/256/2000/5000 sites with the most posting activities across the six-month period we study (MemeTracker128/256/2000/5000). We group the cascades by their publishing month, leading to six groups of cascades. We construct six diffusion networks, one for each month, from the hyperlinks between the blog posts during the same period. That is, if site A publishes an article in August with links to site B, we add an edge to the diffusion network of August.

3.2.4.4 MemeTracker for Influence Function Learning

We also use another MemeTracker dataset, preprocessed by Du et al. [35] specifically to evaluate the performance of influence function learning. The dataset is available at http://www.cc.gatech.edu/~ndu8/InfluLearner.html.
It contains seven groups of activation-only cascades corresponding to the propagation of memes with certain keywords, namely "apple and jobs", "tsunami earthquake", "william kate marriage", "occupy wall-street", "airstrikes", "egypt" and "elections". Each cascade group consists of 1000 nodes, with the number of cascades varying from 1000 to 44000. We refer to this dataset as MemeTracker-Du.

3.3 Evaluation Metrics for Network Inference

The network inference problem can be treated as a binary classification problem in which the inference algorithm needs to decide whether each edge is present or not. Several metrics for binary classification can be used to evaluate the accuracy of inferred diffusion networks against the ground truth networks, including both precision-recall and receiver operating characteristic (ROC) curves.

Let $\hat G$ be the inferred network returned by the inference algorithm and $G^*$ be the ground truth network, with edge sets $\hat E$ and $E^*$. The precision and recall of the inferred graph $\hat G$ are defined as

$$\text{precision} = \frac{|\hat E \cap E^*|}{|\hat E|}, \qquad \text{recall} = \frac{|\hat E \cap E^*|}{|E^*|}.$$

The F1 score (F-measure) is defined based on precision and recall as

$$F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$

The F1 score can be interpreted as a weighted average of precision and recall; it reaches its best value at 1 and its worst at 0.

In many cases, a network inference algorithm does not return a single network. For example, both the NetInf and MultiTree algorithms add edges to $\hat G$ starting from an empty graph, so different diffusion networks can be obtained by choosing different numbers of edges to include. Other algorithms, like ConNIe and NetRate, return a weighted adjacency matrix in which the value of entry (u,v) indicates the likelihood or strength of the corresponding edge in the diffusion network. Usually, the weighted adjacency matrix is thresholded to obtain the diffusion network $\hat G$, and different thresholds lead to different inferred networks. To evaluate these algorithms more fairly, instead of choosing an arbitrary number of edges or threshold, we plot the precision-recall curve while varying the number of edges or the threshold. More specifically, for algorithms like NetInf and MultiTree, we compute the precision and recall each time after adding a new edge to $\hat G$, starting from the empty graph. For algorithms like NetRate, we first sort all edges by weight and then add them one by one, starting from the edge with the largest weight, as for the NetInf and MultiTree algorithms. The performance of the algorithms is evaluated either as the area under the plotted precision-recall curve (AUC) or as the maximum F1 score (max-F1) among all points on the precision-recall curve.

Another widely used metric is the receiver operating characteristic (ROC) curve. Instead of plotting precision against recall, the ROC curve plots the true positive rate on the y-axis against the false positive rate on the x-axis, defined as

$$\text{true positive rate} = \frac{|\hat E \cap E^*|}{|E^*|}, \qquad \text{false positive rate} = \frac{|\hat E \setminus E^*|}{n(n-1) - |E^*|}.$$

Similar to the precision-recall curve, the ROC curve is plotted by varying the number of edges to include or the threshold for the weighted inferred network. The area under the ROC curve is used as the evaluation metric for the algorithms.
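These quantities are mechanical to compute from edge sets; a small helper (assuming directed edges represented as (u, v) tuples) is sketched below. Tracing the precision-recall or ROC curve then amounts to calling it on successively larger prefixes of the ranked edge list.

```python
def edge_metrics(inferred_edges, true_edges, n):
    """Precision, recall (= true positive rate), F1 and false positive rate
    of an inferred edge set against the ground truth, for a directed graph
    on n nodes (so there are n*(n-1) potential edges)."""
    inferred, truth = set(inferred_edges), set(true_edges)
    tp = len(inferred & truth)
    precision = tp / len(inferred) if inferred else 0.0
    recall = tp / len(truth)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    fpr = len(inferred - truth) / (n * (n - 1) - len(truth))
    return precision, recall, f1, fpr
```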
In most cases, we choose to evaluate the performance of Network Inference using the precision-recall curve instead of the ROC curve: as most diffusion networks are very sparse, the precision-recall curve has been shown to provide a more informative picture of algorithm performance when dealing with highly skewed datasets [30].

Chapter 4
Influence Function Learning from Incomplete Observations

The content of this chapter is based on the paper: Xinran He, Ke Xu, David Kempe and Yan Liu, "Learning Influence Functions from Incomplete Observations", in Proc. 28th Advances in Neural Information Processing Systems, 2016.

Most previous research on inferring diffusion networks assumes the availability of complete diffusion observations. However, the restrictions of the public APIs offered by social network sites and the limited observation windows can lead to many types of incomplete observations. In this chapter, we consider two types of incomplete observations: randomly missing activities and adversarially missing nodes in cascades. We begin our discussion by formally introducing the models of incomplete observations and reviewing existing work on Network Inference under incomplete observations. We first demonstrate via experiments that incomplete observations have a significant impact on the performance of existing diffusion network inference models. To mitigate the effects of incompleteness, the second part of this chapter focuses on designing algorithms that learn influence functions accurately under incompleteness.

4.1 Models of Incomplete Observations

Recall that a cascade $c = \{a_1, \ldots, a_{|c|}\}$ is the set of all activities carried out by the users during the diffusion process. The incompleteness of diffusion datasets refers to the loss of activities in the cascades. Assume that C is a complete diffusion dataset. For each $c_i \in C$, we observe an incomplete copy $\tilde c_i \subseteq c_i$. We say that $\tilde C = \{\tilde c_1, \ldots, \tilde c_{|C|}\}$ is an incomplete observation of C. Missing data is a complicated phenomenon, but to address it meaningfully and rigorously, one must make at least some assumptions about the process resulting in the loss of data. Missing data mechanisms are commonly classified into three types [100, 119]: Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). In the context of diffusion analysis, cascades are MCAR if whether an activity is lost depends only on independent coin flips. Cascades are MAR if the probability of missing an activity depends only on other observed variables. Cascades that are neither MCAR nor MAR are termed MNAR. In this chapter, we mainly focus on two types of incomplete observations: randomly missing activities, where the activities are MCAR, and adversarially missing nodes, where the activities are MNAR.

4.1.1 Randomly Missing Activities

Definition: Assume that C is a complete diffusion dataset. For each cascade $c = \{a_1, \ldots, a_{|c|}\}$ in C, we toss a biased coin |c| times, once for each activity $a_i$ in c. If the coin lands on heads (with probability r), we include the activity $a_i$ in $\tilde c$; otherwise we exclude it. The diffusion dataset $\tilde C = \{\tilde c_1, \ldots, \tilde c_{|\tilde C|}\}$ is an incomplete dataset of C with randomly missing activities. The probability r is called the retention rate. Datasets with randomly missing activities can be viewed as independent subsamples of the complete dataset with probability r.

Practical Scenarios: In real-world data collection processes, this type of incompleteness can occur in many situations. One common case occurs when a public API is used to collect user-generated content on a specific topic from social media such as Facebook and Twitter.
The posts related to the topic are the activities in the cascades — for example, a post on Facebook or a tweet on Twitter. In most cases, the contents are collected through the public API offered by the social media site. However, the APIs usually place limits on the number of calls that can be made or the amount of data that can be acquired during a period of time. For example, the API provided by Twitter to collect recent tweets returns a fixed number of uniformly sampled tweets, and there is a limit on how many times the API can be called per hour per IP address. In this case, the collected diffusion dataset suffers from randomly missing activities. The retention rate r is the ratio between the content collected on the topic and all content on the topic available in the social media.

4.1.2 Adversarially Missing Nodes

Definition: Assume that C is a complete diffusion dataset. There is a node set $T \subseteq V$ chosen by an adversary. Then, for each cascade $c \in C$, we define $\tilde c = \{a_i \mid a_i \in c \wedge v_i \notin T\}$. Dataset $\tilde C = \{\tilde c_1, \ldots, \tilde c_{|\tilde C|}\}$ is a dataset with adversarially missing nodes, where all activities of the nodes in T are lost across all cascades in the dataset. Different datasets with adversarially missing nodes can be built through different choices of T by the adversary. For example, the adversary can choose the node set T by certain topological properties, such as the nodes with the highest out-degree, centrality or PageRank value.

Practical Scenarios: This type of incomplete observation is also common in diffusion data collection. One particular case occurs when a certain subpopulation is more concerned about privacy and more reluctant to share their activities on online social networks. Because of their privacy settings, crawlers or public APIs provided by the social network sites cannot access the activities of these groups of users. Another example occurs when we focus on analyzing the diffusion phenomena of only a subset of users. The users may be influenced by their friends outside the range of our study; thus, the nodes representing their friends are not observed in any cascade.
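Both observation models are straightforward to simulate. The sketch below encodes each activity as a (time, node) pair; the encoding and function names are our own illustrative choices.

```python
import random

def randomly_missing(cascades, retention_rate):
    """MCAR incompleteness: keep each activity independently with
    probability r (the retention rate), dropping it otherwise."""
    return [[a for a in c if random.random() < retention_rate]
            for c in cascades]

def adversarially_missing(cascades, lost_nodes):
    """MNAR incompleteness: remove every activity of the adversarially
    chosen node set T from all cascades."""
    lost = set(lost_nodes)
    return [[(t, v) for (t, v) in c if v not in lost] for c in cascades]
```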
By assuming that every cascade forms a tree structure, Chierichetti et al. recover the complete cascade from incomplete observations in the form of reverse paths from observed 63 nodes to the root [29]. Similarly, by assuming that every cascade forms a k-tree, Sadikovetal.estimatethesizeofacompletecascadebyfittingtheparametersofthe k-tree from the observed incomplete cascade [120]. Myers et al. [109] mainly aim to model unobserved external influence in diusion. Besides the endogenous influence among users in the social network, Myers et al. assume that there is an external influence source like mass media which can impact all users simultaneously. They focus on inferring the diusion network under this unobserved external influence. Duongetal.[116]examinelearningdiusionmodelswithmissinglinksfrom complete observations, while we learn influence functions (introduced in detail in Section 2.7) from incomplete cascades with missing activations. Most related to our work is a paper by Wu et al. [144] and simultaneous work by Lokhov [103]. Both study the problem of network inference under incomplete observations. Lokhov proposes a dynamic message passing approach to marginalize all the missing activations, in order to infer diusion model parameters using maximum likelihood estimation, whileWuetal.developanEMalgorithm. Noticethatthegoaloflearningthemodel parameters diers from our goal of learning the influence functions (introduced in detail in Section 2.7) directly. Learning influence functions usually requires less number of cascades compared to learning the model parameters [36]. Both [103] and [144] provide empirical evaluation, but do not provide theoretical guarantees. 4.3 Impact of Incomplete Observations Before designing inference algorithms under incomplete observations, we first demonstratetheimpactofmissingactivitiestoexistingNetworkInferencealgorithm through experiments. 64 4.3.1 Experiment settings Diusionmodelsandinferencealgorithms: Wecarryoutempiricalevaluations under the CIC-Delay model introduced in Section 2.5.1.2. The delay distribution of alledgesistheexponentialdistributionwithdelayparameter 1. Wesetthelengthof the observation window to · =10. We use the NetInf algorithm [52] and MultiTree algorithm [55] to carry out Network Inference. Datapreparation: Weusebothsyntheticsocialnetworksandrealsocialnetworks in our experiments. Four synthetic social networks are generated according to two dierent graph generation models. Three of them are generated using Kronecker Graph models introduced in Section 3.1: Erds-Rényi networks, Core-Peripheral networks and Hierarchical Community networks. We generate two sets of synthetic networks. Thefirstsetcontainslargenetworksforquantitativestudy. Eachnetwork has 256nodesandapproximately 512edges. Thesecondsetcontainssmallnetworks for visualization of the inferred network structures, Each network has only 32 nodes. TheothersyntheticnetworkisgeneratedwiththeForest-Firemodelintroducedalso in Section 3.1. All the synthetic networks are generated with code provided by the SNAPlibrary. Weusetworealsocialnetworks: theHaitinetworkandtheFacebook network described in detail in Section 3.2. Thecascadesindierentnetworksaregeneratedsyntheticallyaccordingtothe CIC-Delay model. We generate diusion datasets with 500 cascades for large social networks. The datasets used for visualization analysis on small networks consists of 100 cascades. 
4.3.2 Randomly Missing Activities

Summary of findings: Incomplete observations with randomly missing activities have a moderate but non-linear impact on inference accuracy. Even a 1% random loss of activities in the diffusion data can lead to more than a 10% decrease in both precision and recall. Despite the loss in inference accuracy, the basic topological structure of the inferred network remains intact.

Case study on small networks: We examine how randomly missing activities impact the inferred network structure by visualizing the result with both a graph embedding on a circle and the adjacency matrix. The results for the MultiTree algorithm on a Hierarchical Community network with randomly missing activities are presented in Figure 4.1. We choose the Hierarchical Community network because of its characteristic community structure. The results for the same network under the other inference algorithms, and the results on the other synthetic networks, are similar and thus omitted.

From Figure 4.1, we can see that the MultiTree algorithm recovers the underlying diffusion network nearly perfectly, as shown in the comparison between Figure 4.1(a) and Figure 4.1(b). However, the inference algorithm suffers moderately from the 10% missing activities in incomplete observations and makes mistakes on a few edges, as shown in Figure 4.1(c). Nevertheless, the community structure remains intact, shown as the four blocks along the diagonal of the adjacency matrix in Figure 4.1(c). Though the structure of the last two blocks deviates from the ground truth by multiple edges, the four communities of the original graph are successfully reconstructed by the algorithm even under a moderate level of randomly missing activities.

Figure 4.1: Result of the MultiTree algorithm on a Hierarchical Community network. (a) The ground truth. (b) The network inferred with complete observations. (c) The network inferred with 10% random loss of activities. On the left is the graph embedding on a circle and on the right is the adjacency matrix.

Accuracy on large networks: We now present the impact of randomly missing activities on large networks. As metrics of accuracy, we use the precision and recall achieved when selecting exactly the number of edges in the ground truth network for NetInf and MultiTree. For each configuration (diffusion model and network used), ten different cascade datasets are generated, and the reported result is the average over the 10 runs.

Figure 4.2: Precision and Recall of the NetInf algorithm under randomly missing activities with different loss rates (1 − r). Panels: (a) Recall, (b) Precision. The x-axis corresponds to different loss rates; different data series correspond to different networks.

We present the experimental results of NetInf under randomly missing activities in Figure 4.2. As shown in the figure, both precision and recall suffer a moderate impact from the randomly missing activities. The three networks generated with the Kronecker model suffer more than the Forest-Fire, Haiti and Facebook networks: even a 1% loss of activities on these three classes of graphs leads to more than a 10% drop in both precision and recall on average. This difference can be attributed to the perfect performance of NetInf with complete observations on the Kronecker graphs, compared with its relatively low accuracy on the other three graphs even with complete observations.
4.3.3 Adversarially Missing Nodes

Summary of findings: In comparison to the randomly missing case, adversarially missing nodes have a stronger impact on both the accuracy and the inferred network structure. Losing important nodes is devastating: even the loss of only 10 important nodes in a network with 256 nodes makes the result of the inference algorithm as bad as random guessing under the cascade model.

Case study on small networks: We present the result of the NetInf algorithm on a Core-Peripheral network under observations with adversarially missing nodes in Figure 4.3. The two nodes numbered 0 and 16 in the lost node set $T$ are the two with the highest out-degree. Edges associated with them are denoted by red blocks in the adjacency matrix in Figure 4.3(c).

Figure 4.3: Result of the NetInf algorithm on a Core-Peripheral network. (a) The ground truth. (b) The network inferred with complete observations. (c) The network inferred with the loss of the 2 maximum-degree nodes. On the left is the graph embedding on a circle and on the right is the adjacency matrix.

Similar to the results from Section 4.3.2, the NetInf algorithm recovers the network structure nearly perfectly from the complete diffusion data. But even the loss of just the two maximum-degree nodes leads to tremendous changes in the inferred network structure, as shown in the comparison between Figure 4.3(a) and Figure 4.3(c). In the ground truth, nodes 0 and 16 are two hubs of the network, while node 1 receives much influence from other users in the network; all other nodes have moderate degrees. However, in the network inferred from cascades with adversarially missing nodes, nodes 1 and 2 are identified as pseudo-hubs with high out-degree. Meanwhile, the high in-degree of node 1 is lost in the inferred network as well. The reason is that when the true hubs of the network are lost, the algorithm finds other pseudo-hubs to compensate for the loss of the real hubs and uses them to explain the activations of the other nodes in the network. This significantly alters the structure of the inferred network. The phenomenon occurs only when important nodes are lost; the inferred network structure is far less affected when unimportant nodes are lost. Such cases occur either when we randomly lose nodes (even 5 of them) in Kronecker graphs with Core-Peripheral or Hierarchical Community structure, or when we lose high-degree (or high-centrality, or high-PageRank) nodes in Erdős-Rényi graphs, since those graphs contain no nodes of outstanding importance.

Accuracy on large networks: Figure 4.4 shows the impact of adversarially missing nodes using the MultiTree algorithm for both synthetic and real-world diffusion networks. Several choices of the set $T$ are considered, including selecting nodes with maximum out-degree, highest centrality, largest PageRank value, or uniformly at random; a small sketch of this selection and removal process is given below. The metrics used in this part are similar to those of the previous experiments, except that we exclude the edges with either endpoint lost when computing precision, recall and AUC.

Figure 4.4: Precision and Recall of the MultiTree algorithm under adversarially missing nodes with different numbers of nodes lost. The x-axis is the number of adversarially lost nodes in the incomplete observations; different series correspond to different networks.

The impact of losing important nodes is much more significant than in the randomly missing activities case, which is in line with our case study on small networks. When the activation information of important nodes is lost, the inferred structure of the network is essentially ruined: even the loss of only 10 important nodes in a 256-node network makes the inference result as bad as random guessing. On the other hand, the inference algorithm is much more robust against the loss of unimportant nodes, e.g., when the adversary chooses random nodes in a Core-Peripheral network or maximum-degree nodes in an Erdős-Rényi network.
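A minimal sketch of the adversarial loss process, using out-degree as the importance score (centrality or PageRank would be swapped in analogously); the data layout and names are ours:

```python
def adversarial_node_set(graph, k):
    """Choose the k nodes to lose: here, the k highest out-degree nodes."""
    out_degree = {u: len(nbrs) for u, nbrs in graph.items()}
    return set(sorted(out_degree, key=out_degree.get, reverse=True)[:k])

def remove_nodes(cascade, lost):
    """Erase every activation of a lost node from an observed cascade."""
    return {v: t for v, t in cascade.items() if v not in lost}
```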
4.3.4 Limitations

Though we provide an analysis of the impact of incomplete observations on the Network Inference problem, the evaluations still suffer from limitations. First, the evaluation of the impact is model-based: we test the impact of incomplete observations on the inferred network based on the exact diffusion model assumed by the inference algorithm. Therefore, the empirical results can be considered a sensitivity analysis of the inference algorithms. Unfortunately, a model-independent evaluation of the information loss in cascades due to incomplete observations is very hard. Moreover, when carrying out the inference of the diffusion network, we must use one of the existing inference algorithms, all of which rely on assumptions about the diffusion model. Thus, our model-dependent evaluation at least suggests that the results of network inference algorithms may not be trustworthy under incomplete observations. Second, we use synthetic cascade data generated from the diffusion models instead of real cascade datasets. We take this approach since (1) it is very hard to collect a complete large-scale cascade dataset together with the true underlying diffusion network, and (2) the cascade models used in this thesis are widely used.

4.4 Learning Influence Functions from Incomplete Observations

In the previous section, we have shown that naively applying existing Network Inference algorithms results in unsatisfactory inference accuracy. It remains a very challenging task to infer diffusion networks from incomplete observations. As most Network Inference algorithms rely on the MLE principle, inference under incomplete observations involves marginalizing a large number of hidden variables: for each node observed to be inactive, we cannot distinguish between the node having been activated but lost, and the node never having been activated in the diffusion process. A binary hidden variable therefore has to be introduced for each inactive node in each cascade, indicating whether the node was activated or not. Computing the likelihood of incomplete cascades requires marginalizing over all of these binary hidden variables, which involves a summation over exponentially many terms. Recently, Lokhov et al. proposed dynamic message passing for efficient marginalization under random loss of activities [102]. However, as their empirical evaluation shows, the method takes several minutes on networks with fewer than 100 nodes and cannot scale to networks with 1000 nodes. Their approach also lacks theoretical guarantees.

Given the difficulty of inferring networks, in this chapter we instead focus on learning the influence functions. For many applications, such as influence maximization [77] and influence estimation [36], it is sufficient to learn the influence function instead of every parameter of the diffusion network. We consider learning influence functions under both types of incomplete observations. More specifically, we establish both proper and improper PAC learnability of influence functions under incomplete observations for two diffusion models: the DIC model and the DLT model introduced in Section 2.5.1.1. The PAC learning framework and proper/improper learning are described in detail in Section 2.7.
The proper learning result is proved by interpreting the incomplete observations as complete observations on a transformed graph. We show that randomly missing observations do not significantly increase the required sample complexity. Moreover, we demonstrate how the proper learning results can be generalized to handle adversarially missing nodes in the cascades, as well as the combination of the two types of incompleteness.

The PAC learnability result implies good sample complexity bounds for the DIC and DLT models. However, even without missing observations, proper PAC learnability of the Continuous-Time Independent Cascade (CIC) and other models appears to be more challenging. Furthermore, the PAC learnability result does not lead to an efficient algorithm, as it involves marginalizing a large number of hidden variables (one for each node not observed to be active).

Towards designing more practical algorithms and obtaining learnability under a broader class of diffusion models, we pursue improper learning approaches. Concretely, we use the parameterization of Du et al. [35] in terms of the reachability basis functions introduced in Section 2.5.1.3, and optimize a modified loss function suggested by Natarajan et al. [111] to address incomplete observations. We prove that the algorithm ensures improper PAC learning for the DIC, DLT and CIC models. Experimental results on synthetic cascades generated from these diffusion models and real-world cascades in the MemeTracker dataset demonstrate the effectiveness of our approach. Our algorithm achieves a nearly 20% reduction in estimation error compared to the best baseline methods on the MemeTracker dataset, by compensating for incomplete observations.

4.4.1 Problem Definition

We focus on the problem of learning influence functions from cascades. Similar to Narasimhan et al. [110], we focus on the activation-only observations introduced in Section 2.4.2. In this section, we consider three popular diffusion models introduced in Section 2.5.1: the DIC model, the DLT model and the CIC-Delay model.

We slightly modify our models of incomplete observations for activation-only cascades. For incomplete datasets with randomly missing activities, we assume that for seed nodes $v \in S$, the activation is never lost. Formally, define $\tilde{A}$ as follows: each $v \in S$ is deterministically in $\tilde{A}$, and each $v \in A \setminus S$ is in $\tilde{A}$ independently with probability $r$. The incomplete cascade is then denoted by $\tilde{c} = (S, \tilde{A})$. For incomplete datasets with adversarially missing nodes, we further assume that the lost nodes are never seeds in any cascade. Formally, the adversary can only pick $T$ satisfying $T \cap S = \emptyset$, and the incomplete cascade is denoted by $\tilde{c} = (S, \tilde{A})$ where $\tilde{A} = A \setminus T$. A small sketch of both incompleteness processes is given below.
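The following sketch makes the two observation-loss processes concrete; cascades are represented simply as seed and activation sets, and the names are ours:

```python
import random

def incomplete_cascade(seeds, activated, retention=1.0, lost_nodes=()):
    """Turn a complete activation-only cascade (S, A) into (S, A~).

    Randomly missing activities: every seed is always retained, and each
    other activated node is kept independently with probability `retention`.
    Adversarially missing nodes: every activation of a node in `lost_nodes`
    is erased; the model requires lost_nodes to be disjoint from the seeds.
    """
    seeds, lost_nodes = set(seeds), set(lost_nodes)
    assert seeds.isdisjoint(lost_nodes)
    observed = set(seeds)
    for v in set(activated) - seeds:
        if v not in lost_nodes and random.random() < retention:
            observed.add(v)
    return seeds, observed
```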
4.4.1.1 Objective Functions and Learning Goals

Recall that the influence function $F: 2^V \to [0,1]^n$ maps seed sets to the vector of marginal activation probabilities: $F(S) = [F_1(S), \ldots, F_n(S)]$. We characterize learnability under incomplete observations using the Probably Approximately Correct (PAC) learning framework with respect to the quadratic loss introduced in Section 2.7.

For learning influence functions from incomplete cascades with randomly missing activities, we say that a class of influence functions $\mathcal{F}_M$ is PAC learnable if there exists an algorithm $\mathcal{A}$ with the following property: for all $\epsilon, \delta \in (0,1)$, all parametrizations of the diffusion model, and all distributions $\mathcal{P}$ over seed sets $S$: when given activation-only training cascades with randomly missing activities $\tilde{C} = \{(S_1, \tilde{A}_1), \ldots, (S_M, \tilde{A}_M)\}$ with $M \geq \mathrm{poly}(n, m, 1/\epsilon, 1/\delta)$, $\mathcal{A}$ outputs an influence function $F \in \mathcal{F}_L$ from its hypothesis class (for proper learning, $\mathcal{F}_L = \mathcal{F}_M$) satisfying

$\mathrm{Prob}_{\tilde{C}}[\mathrm{err}_{sq}[F] - \mathrm{err}_{sq}[F^*] \geq \epsilon] \leq \delta$.

The probability is over the seed set generation, the stochastic diffusion process, and the random missing data process.

For learning influence functions from incomplete cascades with adversarially missing nodes, we change the quadratic loss function to measure the difference only on the observed nodes:

$\ell_{sq}(A, F(S)) = \frac{1}{|V \setminus T|} \sum_{v \in V \setminus T} (\chi_A(v) - F_v(S))^2$.

The rest of the PAC learnability definition is the same as for learning influence functions from complete activation-only cascades. When the observed cascades suffer from both randomly missing activities and adversarially missing nodes, we use the above quadratic loss over only the observed nodes, and the probability is taken over the training cascades, including the seed set generation, the stochastic diffusion process, and the missing data process.

4.5 Proper PAC Learning under Incomplete Observations

In this section, we establish proper PAC learnability of influence functions under the DIC and DLT models. For both diffusion models, $\mathcal{F}_M$ can be parameterized by an edge parameter vector $\boldsymbol{\theta}$, whose entries $\theta_e$ are the activation probabilities (DIC model) or edge weights (DLT model). Our goal is to find an influence function $F^{\boldsymbol{\theta}} \in \mathcal{F}_M$ that outputs accurate marginal activation probabilities. While our goal is proper learning, meaning that the learned function must be from $\mathcal{F}_M$, we do not require that the inferred parameters match the true edge parameters $\boldsymbol{\theta}$. Our main theoretical results under incomplete observations with randomly missing activities are summarized in Theorems 4.1 and 4.2.

Theorem 4.1. The class of influence functions under the DIC model is PAC learnable under incomplete observations with randomly missing activities, where $r \in (0,1)$ is the retention rate. The sample complexity is $\tilde{O}(\frac{m}{\epsilon^2 r^8})$. (The result can be generalized to handle $r = 1$ by manually introducing a small amount of loss. The $\tilde{O}/\tilde{\Omega}$ notation suppresses all logarithmic terms, including $\log\frac{1}{\delta}$ and $\log\frac{1}{\epsilon}$, here and throughout the rest of the chapter unless mentioned otherwise.)

Figure 4.5: Illustration of the graph transformation under the DIC model. The light green node is the seed, the dark green nodes are activated and observed nodes, while the yellow node is activated but lost due to incomplete observations.

Theorem 4.2. The class of influence functions under the DLT model is PAC learnable under incomplete observations with randomly missing activities, where $r \in (0,1)$ is the retention rate (again, $r = 1$ can be handled by manually introducing a small amount of loss). The sample complexity is $\tilde{O}(\frac{m}{\epsilon^2 r^8})$.

The key idea of the proof of both theorems is that a set of incomplete cascades $\tilde{C}$ on $G$ under the two models can be considered as essentially complete cascades on a transformed graph $\hat{G} = (\hat{V}, \hat{E})$. The influence functions of nodes in $\hat{G}$ can be learned using a modification of the result of Narasimhan et al. [110].
Subsequently, the influence functions for $G$ are directly obtained from the influence functions for $\hat{G}$, by exploiting the fact that influence functions only concern the marginal activation probabilities.

For the DIC model, the transformed graph $\hat{G}$ is built by adding a layer of $n$ nodes to the original graph $G$. For each node $v \in V$ of the original graph, we add a new node $v' \in V'$ and a directed edge $(v, v')$ with known and fixed edge parameter $\hat{\theta}_{v,v'} = r$; the parameter value serves as an activation probability under the DIC model. Moreover, we introduce another node $\pi$ into the transformed graph $\hat{G}$. It has outgoing edges to all the new nodes, with fixed edge parameter $\hat{\theta}_{\pi,v'} = 1 - r$. The new nodes $V'$ have no other incident edges, and we retain all edges $e = (u,v) \in E$; inferring their parameters is the learning task. An example of the transformation on a simple graph consisting of four nodes is shown in Figure 4.5.

For the DLT model, the transformed graph $\hat{G}$ is built by adding two layers of $n$ nodes each to the original graph $G$. For each node $v \in V$ of the original graph, we add a new node $v'' \in V''$ and a directed edge $(v, v'')$ with known and fixed edge parameter $\hat{\theta}_{v,v''} = r$. We further add another layer of nodes $v' \in V'$ on top of the nodes $v''$; node $v''$ is connected to node $v'$ by a directed edge $(v'', v')$ with known and uniform fixed edge parameter $\hat{\theta}_{v'',v'} = r$. Similarly, we introduce another node $\pi$ into the transformed graph $\hat{G}$, with outgoing edges to all the nodes $v'$ and fixed edge parameter $\hat{\theta}_{\pi,v'} = 1 - r$. The new nodes $V'$ have no other incident edges, and we retain all edges $e = (u,v) \in E$. An example of the transformation on the same simple graph of four nodes is shown in Figure 4.6.

Figure 4.6: Illustration of the graph transformation under the DLT model. The light green node is the seed, the dark green nodes are activated and observed nodes, while the yellow node is activated but lost due to incomplete observations.

Under both models, for each observed (incomplete) cascade $(S_i, \tilde{A}_i)$ on $G$ (with $S_i \subseteq \tilde{A}_i$), we produce an observed cascade with activation set $A'_i$ of nodes only in the top layer $V'$, using the following procedure (a minimal sketch of this transformation is given after the list):

1. We add the node $\pi$ to the seed set of all the training cascades, namely $S' = S \cup \{\pi\}$.
2. For each node observed to be activated but which is not a seed, i.e., $v \in \tilde{A}_i \setminus S_i$, we include $v' \in A'_i$ deterministically.
3. For each seed $v \in S_i$ independently, we include $v' \in A'_i$ with probability $r$.
4. For each still inactive node $v'$, i.e., $v' \in V' \setminus A'_i$, independently, we include $v' \in A'_i$ with probability $1 - r$.

This procedure defines the training cascades $\hat{C} = \{(S'_i, A'_i)\}$.
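A sketch of steps 1-4, with $\pi$ represented by a sentinel string and all names ours; it maps one incomplete cascade on $G$ to a complete observation of the top layer $V'$ (node $v$ stands in for its copy $v'$):

```python
import random

PI = "__pi__"  # the auxiliary always-active seed node

def transform_cascade(seeds, observed, all_nodes, r):
    """Map an incomplete cascade (S, A~) on G to (S', A') on the layer V'."""
    seeds_prime = set(seeds) | {PI}                 # step 1
    active_prime = set()
    for v in all_nodes:
        if v in observed and v not in seeds:        # step 2: observed non-seed
            active_prime.add(v)
        elif v in seeds and random.random() < r:    # step 3: seed, w.p. r
            active_prime.add(v)
    for v in all_nodes:                             # step 4: still inactive,
        if v not in active_prime and random.random() < 1 - r:  # w.p. 1 - r
            active_prime.add(v)
    return seeds_prime, active_prime
```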
Now consider any edge parameters $\boldsymbol{\theta}$, applied both to $G$ and to the first layer of $\hat{G}$. Let $F(S)$ denote the influence function on $G$, and $\hat{F}(S') = [\hat{F}_{1'}(S'), \ldots, \hat{F}_{n'}(S')] = [\hat{F}_{1'}(S \cup \{\pi\}), \ldots, \hat{F}_{n'}(S \cup \{\pi\})]$ the influence function of the nodes in the added layer $V'$ of $\hat{G}$, with the additional node $\pi$ as a seed.

Under the DIC model, node $v'$ is not activated if and only if both node $\pi$ and node $v$ fail to activate it. Therefore, $\hat{F}_{v'}(S')$, the activation probability of node $v'$, satisfies

$\hat{F}_{v'}(S') = 1 - (1 - r F_v(S)) \cdot (1 - (1-r)) = 1 - r + r^2 \cdot F_v(S)$.

Under the DLT model, node $v'$ is activated when the total influence it receives from nodes $\pi$ and $v''$ exceeds its random threshold. Since node $\pi$ is always a seed and the activation probability of node $v''$ is $r F_v(S)$, we have

$\hat{F}_{v'}(S') = 1 - r + r \cdot r F_v(S) = 1 - r + r^2 \cdot F_v(S)$.

By our construction of the transformed graph, under both the DIC model and the DLT model we thus get that

$\hat{F}_{v'}(S') = 1 - r + r^2 \cdot F_v(S)$   (4.1)

for all nodes $v \in V$.

Now consider the probability that a node $v'$ is included in $A'_i$ by the transformation of the incomplete cascade. By the definition of the observation loss process, a node $v'$ is not included in $A'_i$ if and only if it is included neither in step 2 nor in step 3, and it also fails to be included in step 4. A node is included in step 2 or 3 with probability $r F_v(S)$, and each still inactive node $v'$ is further included in step 4 with probability $1-r$. Therefore,

$\mathrm{Prob}[v' \in A'_i] = 1 - (1 - r F_v(S)) \cdot (1 - (1-r)) = 1 - r + r^2 \cdot F_v(S) = \hat{F}_{v'}(S')$.

While the cascades $\hat{C}$ do not provide complete activation information on all nodes of $\hat{G}$, they do provide complete activation information on the nodes in $V'$. In the detailed proof which follows, we show that Theorem 2 of Narasimhan et al. [110] can be adapted to provide guarantees for learning $\hat{F}(S')$ from the modified observed cascades $\hat{C}$. For the DIC model, this is a straightforward modification of the proof from [110]. For the DLT model, [110] had not established PAC learnability ([110] shows that the DLT model with fixed thresholds is PAC learnable from complete cascades; we study the DLT model in which the thresholds are uniformly distributed random variables), so we provide a separate proof.

Because the results of [110] and our generalizations ensure proper learning, they provide edge weights $\boldsymbol{\theta}$ between the nodes of $V$. We use these exact same edge weights to define the learned influence functions on $G$. Equation (4.1) then implies that under the DIC and DLT models, the learned influence functions on $V$ satisfy

$F_v(S) = \frac{\hat{F}_{v'}(S') - (1-r)}{r^2}$.   (4.2)

It should be noticed that our results improve the sample complexity of influence function learning under complete observations in [110], by manually introducing randomly missing activities with a high retention rate. The improvement comes from the fact that the introduced noise smoothes the loss function, reducing both its Lipschitz constant and its absolute value.

Next, we flesh out the proof sketch for both the DIC and the DLT model. The detailed proof shows that the learning error only scales by a multiplicative factor $\frac{1}{r^4}$ under both models.

4.5.1 Proof of Theorems 4.1 and 4.2

To prove Theorems 4.1 and 4.2, we first establish the following lemma, showing the boundedness and the Lipschitz continuity of the log-likelihood function. Different from [110], we do not need the assumption $\lambda \leq \theta_{u,v} \leq 1-\lambda$ on the parameters of each edge; instead, the function values are bounded away from 0 and 1 via the construction of the layered graph and the additional seed $\pi$.

Lemma 4.3. Let $g_v(S', y) = y \log \hat{F}_{v'}(S') + (1-y)\log(1 - \hat{F}_{v'}(S'))$, where $S' = S \cup \{\pi\}$. Fix the parameters $\boldsymbol{\theta}$ of the first-layer nodes. The influence function of every top-layer node $v' \in V'$ under both the DIC and DLT models satisfies:

1. $1 - r \leq \hat{F}^{\boldsymbol{\theta}}_{v'}(S \cup \{\pi\}) \leq 1 - r + r^2$, for all seed sets $S$.
2. $|g_v(S', y)| \leq \ln\frac{1}{r(1-r)}$.
3. $g_v(S', y)$ is $\frac{1}{r(1-r)}$-Lipschitz in $\boldsymbol{\theta}$ with respect to the $L_1$ norm.

In order to prove the Lipschitzness of the log-likelihood function, we first need to establish Lipschitzness of the influence functions under both the DIC and DLT models.
As the function $\hat{F}$ is learned from the DIC model, Lemma 3 in [110] carries through to establish the Lipschitz continuity of DIC influence functions.

Lemma 4.4 (Lipschitz continuity of DIC influence functions). Fix $S \subseteq V$ and $v' \in V'$. For any $\boldsymbol{\theta}, \boldsymbol{\theta}' \in \mathbb{R}^m$ with $\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|_1 \leq \epsilon$, we have $|\hat{F}^{\boldsymbol{\theta}}_{v'}(S) - \hat{F}^{\boldsymbol{\theta}'}_{v'}(S)| \leq \epsilon$.

Narasimhan et al. consider a threshold model with fixed thresholds, while the DLT model assumes random and independent thresholds. We prove the Lipschitz continuity of the DLT influence functions with the following lemma.

Lemma 4.5 (Lipschitz continuity of DLT influence functions). Fix $S \subseteq V$ and $u \in V$. For any $\boldsymbol{\theta}, \boldsymbol{\theta}' \in \mathbb{R}^m$ with $\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|_1 \leq \epsilon$, we have $|F^{\boldsymbol{\theta}}_u(S) - F^{\boldsymbol{\theta}'}_u(S)| \leq \epsilon$.

Proof. As described in Section 2.5.1.3, the influence functions under the DLT model can also be characterized via reachability under a distribution over live-edge graphs. Specifically, the distribution is as follows [77, Theorem 2.5]: each node $v$ picks at most one of its incoming edges at random, selecting the edge from $z \in N(v)$ with probability $\theta_{z,v}$ and selecting no incoming edge with probability $1 - \sum_{z \in N(v)} \theta_{z,v}$. For each node $v$, let the random variable $X_v$ be the incoming neighbor chosen for $v$, with $X_v = \perp$ if $v$ has no incoming edge; for simplicity of notation, we define $\theta_{\perp,v} = 1 - \sum_{z \in N(v)} \theta_{z,v}$. Define $X = (X_v)_{v \in V}$, and write $\mathcal{X}$ for the set of all such vectors $X$. For any node $v$, we write $\mathcal{X}_{-v}$ for the set of all vectors with edges (or $\perp$) for all nodes except $v$; and for a vector $X \in \mathcal{X}_{-v}$, we write $X[v \mapsto z]$ for the vector whose entries agree with those of $X$, except for the entry for $v$, which is now $z$. Let $R_X(v, S)$ be the indicator function of whether node $v$ is reachable from the seed set $S$ in the graph $(V, X)$, where we interpret $X$ as the set of all edges $(X_v, v)$ with $X_v \neq \perp$. Claim 2.6 of [77] implies that

$F^{\boldsymbol{\theta}}_u(S) = \sum_{X} \prod_{v \in V} \theta_{X_v,v} \cdot R_X(u, S)$.

We fix an edge $(y, y')$ and take the partial derivative of $F^{\boldsymbol{\theta}}_u(S)$ with respect to $\theta_{y,y'}$; note that it is node $y'$ whose edge choice involves $\theta_{y,y'}$, and that $\theta_{\perp,y'}$ decreases correspondingly:

$\left|\frac{\partial F^{\boldsymbol{\theta}}_u(S)}{\partial \theta_{y,y'}}\right| = \left|\frac{\partial}{\partial \theta_{y,y'}}\left(\sum_{z \in N(y') \cup \{\perp\}} \theta_{z,y'} \sum_{X \in \mathcal{X}_{-y'}} R_{X[y' \mapsto z]}(u,S) \prod_{v \in V \setminus \{y'\}} \theta_{X_v,v}\right)\right|$
$= \left|\sum_{X \in \mathcal{X}_{-y'}} R_{X[y' \mapsto y]}(u,S) \prod_{v \in V \setminus \{y'\}} \theta_{X_v,v} - \sum_{X \in \mathcal{X}_{-y'}} R_{X[y' \mapsto \perp]}(u,S) \prod_{v \in V \setminus \{y'\}} \theta_{X_v,v}\right|$
$\leq \sum_{X \in \mathcal{X}_{-y'}} \prod_{v \in V \setminus \{y'\}} \theta_{X_v,v} = 1$.

Therefore, $\|\nabla_{\boldsymbol{\theta}} F^{\boldsymbol{\theta}}_u(S)\|_\infty \leq 1$, implying Lipschitz continuity with respect to the $L_1$ norm. (A small simulation sketch of this live-edge characterization is given below.)
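The live-edge distribution and the reachability indicator from the proof above are easy to simulate; averaging the indicator over many samples yields a Monte Carlo estimate of $F^{\boldsymbol{\theta}}_v(S)$. A minimal sketch, with the data layout and names being ours:

```python
import random
from collections import deque

def sample_live_edges(in_neighbors, theta):
    """Each node keeps at most one incoming edge: z with prob. theta[(z, v)],
    and none with the remaining probability 1 - sum_z theta[(z, v)]."""
    chosen = {}
    for v, nbrs in in_neighbors.items():
        u, acc = None, random.random()
        for z in nbrs:
            acc -= theta[(z, v)]
            if acc < 0:
                u = z
                break
        chosen[v] = u  # None represents the empty choice (no incoming edge)
    return chosen

def reachable(chosen, seeds):
    """R_X(., S): the set of nodes reachable from S along sampled live edges."""
    out = {}
    for v, u in chosen.items():
        if u is not None:
            out.setdefault(u, []).append(v)
    seen, queue = set(seeds), deque(seeds)
    while queue:
        u = queue.popleft()
        for v in out.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen
```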
We now prove Lemma 4.3 using Lemmas 4.4 and 4.5.

Proof.
1. Since node $\pi$ is always a seed and has outgoing edges (of weight $1-r$) to all the nodes $v'$, we have $\hat{F}_{v'}(S') \geq 1-r$. Moreover, due to the layered structure of $\hat{G}$, we have $\hat{F}_{v'}(S') \leq 1-r+r^2$.
2. Since $r - r^2 = r(1-r) \leq 1-r \leq \hat{F}_{v'}(S') \leq 1-(r-r^2)$, we get $|y\log\hat{F}_{v'}(S') + (1-y)\log(1-\hat{F}_{v'}(S'))| \leq y|\log\hat{F}_{v'}(S')| + (1-y)|\log(1-\hat{F}_{v'}(S'))| \leq \log\frac{1}{r(1-r)}$.
3. To show Lipschitzness with respect to the $L_1$ norm, we bound the $L_\infty$ norm of the gradient: $\nabla_{\boldsymbol{\theta}} g_v(S',y) = \left[\frac{y}{\hat{F}_{v'}(S')} - \frac{1-y}{1-\hat{F}_{v'}(S')}\right]\nabla_{\boldsymbol{\theta}}\hat{F}_{v'}(S')$. Since $r-r^2 \leq \hat{F}_{v'}(S') \leq 1-(r-r^2)$, we have $\left|\frac{y}{\hat{F}_{v'}(S')} - \frac{1-y}{1-\hat{F}_{v'}(S')}\right| \leq \frac{1}{r(1-r)}$. In addition, by the Lipschitz properties of the DIC and DLT influence functions in Lemmas 4.4 and 4.5, the gradient of $\hat{F}_{v'}$ is bounded by 1, so $\|\nabla_{\boldsymbol{\theta}} g_v(S',y)\|_\infty \leq \frac{1}{r(1-r)}\|\nabla_{\boldsymbol{\theta}}\hat{F}_{v'}(S')\|_\infty \leq \frac{1}{r(1-r)}$. Hence $g_v(S',y)$ is $\frac{1}{r(1-r)}$-Lipschitz in $\boldsymbol{\theta}$ with respect to the $L_1$ norm.

Comparing our result in Lemma 4.3 to Lemma 9 in [110], we prove improved bounds for both the boundedness and the Lipschitzness of the log-likelihood function. Moreover, our result does not require the constraint $\lambda \leq \theta_{u,v} \leq 1-\lambda$. The improvement comes from the fact that we introduce the additional seed $\pi$ to ensure that $\hat{F}_{v'}(S')$ is bounded away from both 0 and 1 by a constant, rather than by a quantity depending on the number of nodes in the network as in [110]. The improved boundedness and Lipschitzness directly lead to the improved sample complexity in Theorems 4.1 and 4.2.

With all the supporting lemmas in place, the proofs of Theorems 4.1 and 4.2 are essentially the same, as follows.

Proof. For the transformed graph $\hat{G}$, we consider only the influence functions of the $n$ nodes in the added layer $V'$. Recall that we write $\hat{F}(S') = [\hat{F}_{1'}(S'), \ldots, \hat{F}_{n'}(S')]$ for the influence function of those nodes, where $S' = S \cup \{\pi\}$ with the new node $\pi$ added to the seed set $S$ of each training cascade. Let $\hat{F}^*$ be the ground truth influence function for the same nodes, and $F^*$ the ground truth influence function for $G$. Let $\mathcal{M}(G)$ and $\mathcal{M}(\hat{G})$ be the classes of influence functions of $G$ and $\hat{G}$. For functions $\hat{F}$, we write $\widehat{\mathrm{err}}_{sq}[\hat{F}] = \mathbb{E}_{S,A}\left[\frac{1}{n}\sum_{v' \in V'}(\chi_A(v') - \hat{F}_{v'}(S'))^2\right]$. Notice that the ground truth functions minimize the expected squared error, i.e., $\hat{F}^* \in \arg\min_{\hat{F} \in \mathcal{M}(\hat{G})} \widehat{\mathrm{err}}_{sq}[\hat{F}]$ and $F^* \in \arg\min_{F \in \mathcal{M}(G)} \mathrm{err}_{sq}[F]$. We will show that $\mathrm{err}_{sq}[F] - \mathrm{err}_{sq}[F^*]$ can be made arbitrarily small.

We first prove a variation of Theorem 2 from [110] for learning $\hat{F}$, by verifying that all the supporting lemmas still apply. The modified Theorem 2 from [110] is the following:

Theorem 4.6. Assume that the learning algorithm observes $M = \tilde{\Omega}(\hat{\epsilon}^{-2} m)$ training cascades $\hat{C} = \{(S'_i, A'_i)\}$ under either the DIC or the DLT model. Then, with probability at least $1-\delta$, we have

$\widehat{\mathrm{err}}_{sq}[\hat{F}] - \widehat{\mathrm{err}}_{sq}[\hat{F}^*] \leq \hat{\epsilon}$.   (4.3)

Proof. While the cascades in $\hat{C}$ are incomplete on $V$, they are complete on $V'$. We use this completeness of the cascades as follows. Consider the restricted class of the DIC model on the transformed graph $\hat{G}$ in which only the $m$ activation probabilities $\boldsymbol{\theta}$ between nodes in $V$ are learnable, while the edges $(v,v')$ have a fixed weight of $r$. Define the log-likelihood of a cascade $(S', A')$ as

$L(S', A' \mid \boldsymbol{\theta}) = \sum_{v' \in V'} \chi_{A'}(v')\log(\hat{F}^{\boldsymbol{\theta}}_{v'}(S')) + (1 - \chi_{A'}(v'))\log(1 - \hat{F}^{\boldsymbol{\theta}}_{v'}(S'))$.

The algorithm outputs an influence function $\hat{F}$ based on the solution of the following optimization problem:

$\boldsymbol{\theta}^* \in \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^M L(S'_i, A'_i \mid \boldsymbol{\theta})$.

As the function $\hat{F}$ is learned either from the DIC model or from the DLT model, Lemmas 4.4 and 4.5 establish the Lipschitz continuity of the DIC and DLT influence functions, respectively. Moreover, such instances (on the larger transformed graph) still have only $m$ learnable parameters, and the $L_\infty$ covering number bound in Lemma 8 from [110] applies without any changes.

Lemma 4.7 (Covering number of DIC influence functions). The $L_\infty$ covering number for radius $\epsilon$ of the restricted class of DIC influence functions on the transformed graph is $O((m/\epsilon)^m)$.

Establishing the sample complexity bound on the log-likelihood objective (Lemma 4 in [110]) requires that all function values be bounded away from 0 and 1. We show that this requirement is satisfied under both the DIC and the DLT model via Lemma 4.3. Therefore, Lemma 4 in [110] carries through with the improved complexity of $\tilde{O}(\hat{\epsilon}^{-2} m)$:

Lemma 4.8 (Sample complexity guarantee on the log-likelihood objective). Fix $\hat{\epsilon}, \delta \in (0,1)$ and $M = \tilde{\Omega}(\hat{\epsilon}^{-2} m)$.
With probability at least $1-\delta$ (over the draws of the training cascades),

$\max_{\boldsymbol{\theta}} \mathbb{E}_{S,A'}\left[\frac{1}{n}L(S', A' \mid \boldsymbol{\theta})\right] - \mathbb{E}_{S,A'}\left[\frac{1}{n}L(S', A' \mid \boldsymbol{\theta}^*)\right] \leq \hat{\epsilon}$.

As all the lemmas used in the proof of Theorem 2 from [110] remain true, we have proved our Theorem 4.6, with the guarantee that $\widehat{\mathrm{err}}_{sq}[\hat{F}] - \widehat{\mathrm{err}}_{sq}[\hat{F}^*] \leq \hat{\epsilon}$.

Finally, we recall that according to Equation (4.2), $F_v(S) = \frac{\hat{F}_{v'}(S') - (1-r)}{r^2}$ and $F^*_v(S) = \frac{\hat{F}^*_{v'}(S') - (1-r)}{r^2}$, giving us that

$\mathrm{err}_{sq}[F] - \mathrm{err}_{sq}[F^*] = \frac{1}{n}\sum_{v \in V}\mathbb{E}_S\left[(F_v(S) - F^*_v(S))^2\right]$
$= \frac{1}{n}\sum_{v' \in V'}\mathbb{E}_S\left[\left(\frac{1}{r^2}\hat{F}_{v'}(S') - \frac{1}{r^2}\hat{F}^*_{v'}(S')\right)^2\right]$ (by Equation (4.2))
$= \frac{\widehat{\mathrm{err}}_{sq}[\hat{F}] - \widehat{\mathrm{err}}_{sq}[\hat{F}^*]}{r^4} \leq \frac{\hat{\epsilon}}{r^4}$ (by Inequality (4.3)).

The first and last equalities are applications of Equation (4) from [110], which expresses the excess squared error as $\frac{1}{n}\sum_v \mathbb{E}_S[(F_v(S) - F^*_v(S))^2]$, and likewise for the hatted functions. Now, by taking $\hat{\epsilon} = \epsilon \cdot r^4$, with $\tilde{O}(\frac{m}{\epsilon^2 r^8})$ incomplete cascades we obtain $\mathrm{err}_{sq}[F] - \mathrm{err}_{sq}[F^*] \leq \epsilon$.

4.5.2 Extensions of Proper PAC Learning

We have established PAC learnability under incomplete observations with randomly missing activities, under the assumption that the retention rate is given and the same for all nodes. In this section, we show that our proper PAC learnability result generalizes to the case of different retention rates and to the case of uncertainty about the retention rate. Moreover, we demonstrate that our approach can handle the other type of incomplete observations, with adversarially missing nodes, and even the combination of the two.

4.5.2.1 Different retention rates

So far, we have assumed that the retention rate is the same for all nodes; however, our approach easily generalizes to the case in which each individual node has a different (but known) retention rate. The following theorem generalizes Theorems 4.1 and 4.2.

Theorem 4.9. Let $r_v$ be $v$'s retention rate, satisfying $0 < r_v < 1$. Write $\rho = \frac{1}{n}\sum_{i=1}^n \frac{1}{r_{v_i}^4}$. The class of influence functions under both the DIC model and the DLT model is PAC learnable with sample complexity $\tilde{O}(\frac{\rho^2 m}{\epsilon^2})$.

Proof. Let $r_i$ be the retention rate of node $i$ and $\hat{\epsilon}_i$ the desired error guarantee for learning the influence function $F_{v_i}$. It is immediate from the proofs of Theorems 4.1 and 4.2 that $M = \max_i\{\tilde{O}(\frac{m}{\hat{\epsilon}_i^2 r_i^8})\}$ incomplete cascades under both the DIC and the DLT model suffice to guarantee that with probability at least $1-\delta$, for each $i$, we obtain $\mathbb{E}_S[(F_{v_i}(S) - F^*_{v_i}(S))^2] \leq \hat{\epsilon}_i$. The estimation error for the overall influence is the average $\epsilon = \frac{1}{n}\sum_i \hat{\epsilon}_i$. Given non-uniform retention rates $r_i$, we can choose non-uniform $\hat{\epsilon}_i$ yielding the desired $\epsilon$, so as to minimize the sample complexity. Under both models, the corresponding optimization problem is the following:

Minimize $\max_i \frac{1}{\hat{\epsilon}_i^2 r_i^8}$ subject to $\frac{1}{n}\sum_i \hat{\epsilon}_i = \epsilon$ and $\hat{\epsilon}_i > 0$ for all $i$.

The minimum is achieved when all $\frac{1}{\hat{\epsilon}_i^2 r_i^8}$ are equal to some constant $C$, meaning that $\hat{\epsilon}_i = \frac{1}{\sqrt{C} \cdot r_i^4}$. The constant $C$ can be obtained from the constraint $\frac{1}{n}\sum_i \hat{\epsilon}_i = \epsilon$, yielding $C = \frac{\rho^2}{\epsilon^2}$, where $\rho = \frac{1}{n}\sum_i \frac{1}{r_i^4}$. This completes the proof of the theorem.

The empirical evaluation later in this chapter shows that the performance does not change significantly if the true retention rate of each node is independently perturbed around the estimated mean loss rate. A small numeric sketch of the error allocation from the proof above follows.
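The allocation $\hat{\epsilon}_i = 1/(\sqrt{C}\,r_i^4)$ with $\sqrt{C} = \rho/\epsilon$ is easy to compute; a minimal sketch (names ours):

```python
def allocate_error_targets(retention_rates, eps):
    """Per-node error targets from the proof of Theorem 4.9.

    Returns eps_i = 1 / (sqrt(C) * r_i**4) with sqrt(C) = rho / eps,
    so that the eps_i average exactly to eps.
    """
    n = len(retention_rates)
    rho = sum(1.0 / r**4 for r in retention_rates) / n
    sqrt_c = rho / eps
    return [1.0 / (sqrt_c * r**4) for r in retention_rates]

# Example: rates clustered around 0.8 and a target average error of 0.05.
targets = allocate_error_targets([0.7, 0.8, 0.9], eps=0.05)
print(targets, sum(targets) / len(targets))  # the mean equals 0.05
```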
4.5.2.2 Uncertainty in retention rate

A second limitation of our approach is that we assume the retention rate $r$ to be known to the algorithm. In real-world applications, it remains a difficult task to estimate the retention rate accurately. In fact, estimating $r$ presents a "chicken and egg" problem: an accurate estimate of the retention rate requires access to complete cascade observations, and the lack of complete cascades is exactly the motivation of our study. However, we believe that even a somewhat accurate estimate of $r$ (perhaps based on past data for which ground truth can be obtained at much higher cost) will still be a significant improvement over the status quo, namely pretending that no data are missing. Moreover, even approximate information about $r$ leads to positive results on proper PAC learnability.

We show that the PAC learnability result can be extended to the case where we only know that the true retention rate lies in a given interval. Instead of knowing the exact value $r$, we only know that $r$ lies in an interval specified by the relative error $\eta$, namely $r \in I = [\bar{r}(1-\eta), \bar{r}(1+\eta)]$. Within that interval, the retention rate is adversarially chosen. We can then generalize Theorems 4.1 and 4.2 as follows.

Theorem 4.10. Assume that the ground truth retention rate $r$ is adversarially chosen in $I = [\bar{r}(1-\eta), \bar{r}(1+\eta)]$. For all $\epsilon, \delta \in (0,1)$, all parametrizations of the diffusion model, and all distributions $\mathcal{P}$ over seed sets $S$: for both the DIC model and the DLT model, there exists a proper learning algorithm $\mathcal{A}$ which, when given activation-only and incomplete training cascades $\tilde{C}$, outputs an influence function $F \in \mathcal{F}_M$ satisfying

$\mathrm{Prob}_{\tilde{C}}\left[\mathrm{err}_{sq}[F] - \mathrm{err}_{sq}[F^*] \geq \epsilon + \frac{4\eta^2}{r^2(1-\eta)^2}\right] \leq \delta$.

The required number of observations is $M = \tilde{O}(\frac{m}{\epsilon^4 r^8 (1-\eta)^4})$.

Proof. The proof of Theorem 4.10 is similar to that of Theorems 4.1 and 4.2. We again treat the incomplete cascades as complete cascades on a transformed graph $\hat{G}$. Because we no longer know the true retention rate $r$, we cannot set the probability on the edge $(v,v')$ to $\hat{\theta}_{v,v'} = r$ under the DIC model, or on the edge $(v,v'')$ to $\hat{\theta}_{v,v''} = r$ under the DLT model. Instead, we treat $\hat{\theta}_{v,v'}$ and $\hat{\theta}_{v,v''}$ as parameters to infer, under the constraint that $\hat{\theta}_{v,v'}, \hat{\theta}_{v,v''} \in I = [\bar{r}(1-\eta), \bar{r}(1+\eta)]$. Let $\lambda = 1 - (1+\eta)\bar{r}$. We set the edge weight from the added node $\pi$ to all new nodes $v'$ to $\hat{\theta}_{\pi,v'} = \lambda$ under both models. For the DLT model, we set the weight of the additional edge from $v''$ to $v'$ to $\hat{\theta}_{v'',v'} = 1-\lambda$.

We spell out the details of the proof for the DIC model; the proof for the DLT model is essentially the same. As in Theorem 4.1, we consider only the influence functions of the $n$ nodes in the added layer $V'$. Following the proof of Theorem 4.1, with probability at least $1-\delta$, using $M = \tilde{O}(\frac{m}{\hat{\epsilon}^2})$ cascades, for all $v' \in V'$,

$\mathbb{E}_S\left[(\hat{F}_{v'}(S') - \hat{F}^*_{v'}(S'))^2\right] \leq \hat{\epsilon}$.   (4.4)

In the proof of Theorem 4.1, the fact that $\hat{\theta}_{v,v'} = r$ allowed us to obtain the influence function at $v$ via $F_v(S) = \frac{\hat{F}_{v'}(S') - \lambda}{r(1-\lambda)}$. Since the edge probabilities $\hat{\theta}_{v,v'}$ are now inferred, we instead use the inferred probabilities to obtain the activation functions for the nodes $v$. On the other hand, the ground truth influence functions for $v$ and $v'$ are still related via the correct value $r$. Writing $\hat{r}_v = \hat{\theta}_{v,v'}$, this gives us the following:

$F_v(S) = \frac{\hat{F}_{v'}(S') - \lambda}{\hat{r}_v(1-\lambda)}$   (4.5)
$F^*_v(S) = \frac{\hat{F}^*_{v'}(S') - \lambda}{r(1-\lambda)}$   (4.6)

Consider $\mathbb{E}_S[(F_v(S) - F^*_v(S))^2]$ for any node $v$.
The expected squared estimation error for node $v$ can now be written as follows:

$\mathbb{E}_S\left[(F_v(S) - F^*_v(S))^2\right] = \frac{1}{(1-\lambda)^2}\mathbb{E}_S\left[\left(\frac{\hat{F}_{v'}(S') - \lambda}{\hat{r}_v} - \frac{\hat{F}^*_{v'}(S') - \lambda}{r}\right)^2\right]$
$= \frac{1}{(1-\lambda)^2}\mathbb{E}_S\left[\left(\frac{\hat{F}_{v'}(S') - \hat{F}^*_{v'}(S')}{r} + (\hat{F}_{v'}(S') - \lambda)\frac{r - \hat{r}_v}{r \cdot \hat{r}_v}\right)^2\right]$
$\leq \frac{1}{r^2(1-\lambda)^2}\cdot\mathbb{E}_S\left[(\hat{F}_{v'}(S') - \hat{F}^*_{v'}(S'))^2\right]$   (4.7)
$+ \frac{2}{r(1-\lambda)^2}\cdot\mathbb{E}_S\left[\frac{|\hat{F}_{v'}(S') - \lambda|}{\hat{r}_v}\cdot\frac{|r - \hat{r}_v|}{r}\cdot|\hat{F}_{v'}(S') - \hat{F}^*_{v'}(S')|\right]$   (4.8)
$+ \frac{1}{(1-\lambda)^2}\mathbb{E}_S\left[\frac{(\hat{F}_{v'}(S') - \lambda)^2}{\hat{r}_v^2}\cdot\frac{(r - \hat{r}_v)^2}{r^2}\right]$.   (4.9)

We bound the term (4.7) using Inequality (4.4). In order to bound the terms (4.8) and (4.9), observe the following:

• For all seed sets $S$ and nodes $v' \in V'$, we have $\hat{F}_{v'}(S') - \lambda \leq (1-\lambda)\hat{r}_v \leq \hat{r}_v$ by the structure of the transformed graph, since $\hat{F}_{v'}(S') \leq 1 - (1-\lambda)(1-\hat{r}_v) = \lambda + (1-\lambda)\hat{r}_v$.
• As $\lambda = 1 - (1+\eta)\bar{r}$, we have $1-\lambda = (1+\eta)\bar{r} \geq r$. Therefore $\frac{1}{(1-\lambda)^2} \leq \frac{1}{r^2}$.
• $\frac{|r - \hat{r}_v|}{r} \leq \frac{2\eta}{1-\eta}$ follows from the assumption that $\hat{r}_v, r \in [\bar{r}(1-\eta), \bar{r}(1+\eta)]$.
• By Jensen's inequality and Inequality (4.4), $\mathbb{E}_S[|\hat{F}_{v'}(S') - \hat{F}^*_{v'}(S')|] \leq \sqrt{\hat{\epsilon}}$.

Using the preceding four inequalities, we can bound the term (4.8) by $\frac{4\eta\sqrt{\hat{\epsilon}}}{r^2(1-\eta)}$ and the term (4.9) by $\frac{4\eta^2}{r^2(1-\eta)^2}$. When $\hat{\epsilon} \leq \frac{\epsilon r^4}{2}$, Inequality (4.4) upper-bounds the term (4.7) by $\frac{\hat{\epsilon}}{r^2(1-\lambda)^2} \leq \frac{\hat{\epsilon}}{r^4} \leq \frac{\epsilon}{2}$. Similarly, when $\hat{\epsilon} \leq \frac{\epsilon^2 r^4 (1-\eta)^2}{64\eta^2}$, the term (4.8) is upper-bounded by $\frac{4\eta\sqrt{\hat{\epsilon}}}{r^2(1-\eta)} \leq \frac{\epsilon}{2}$. Thus, taking $\hat{\epsilon} = \min\{\frac{\epsilon r^4}{2}, \frac{\epsilon^2 r^4(1-\eta)^2}{64\eta^2}\}$, the first two terms combined are bounded by $\epsilon$. Hence, using $M = \tilde{O}(\frac{m}{\epsilon^4 r^8(1-\eta)^4})$ cascades, with probability at least $1-\delta$, for each node $v$,

$\mathbb{E}_S\left[(F_v(S) - F^*_v(S))^2\right] \leq \epsilon + \frac{4\eta^2}{r^2(1-\eta)^2}$.

Taking the average of both sides over all nodes $v \in V$ concludes the proof of Theorem 4.10. The proof under the DLT model is essentially the same.

Notice that the result of Theorem 4.10 is not technically a PAC learnability result, due to the additive error that depends on the interval size. However, the theorem provides useful approximation guarantees when the interval size is small. A dependence of the guarantee on the interval size is inevitable: when nothing is known about the retention rate (for $\eta$ large enough), all information about the marginal activation probabilities is lost in the incomplete data. For instance, if no nodes are ever observed active, we cannot distinguish the case $r = 0$ from the case in which no nodes ever become activated. The experiments in later sections confirm that for moderate uncertainty about the retention rate, the performance of our approach is not very sensitive to the misestimation of $r$.

4.5.2.3 Incomplete observations with adversarially missing nodes

In the previous discussion, we focused on learning influence functions under randomly missing activities. The proper learning results naturally generalize to incomplete observations with both randomly missing activities and adversarially missing nodes, provided the lost nodes never serve as seeds initiating a cascade. The results hold even if the parts of the diffusion network skeleton associated with the lost nodes are not given, as we can add edges with unknown parameters from all other observed nodes to the lost ones.

Corollary 4.11. Let $\tilde{C} = \{(S, \tilde{A})\}$ be a set of incomplete observations with both randomly missing activities of retention rate $r$ and adversarially missing nodes $T$, such that $T \cap S = \emptyset$ for all seed sets $S$.
Then the influence functions on the observed nodes $V \setminus T$ are properly PAC learnable, with sample complexity $\tilde{O}(\frac{m + n|T|}{\epsilon^2 r^8})$ under both the DIC model and the DLT model.

The corollary follows directly from the proofs of Theorems 4.1 and 4.2. The intuition is that when losing some nodes, we lose correlations between activations; but by replacing the lost nodes with direct edges that skip them, we still get the marginals right.

Proof. We prove the corollary by applying the results of Theorems 4.1 and 4.2 to the graph skeleton augmented with all incoming edges to the lost nodes and all outgoing edges from the lost nodes. Let $G = (V, E)$ be the graph on the observed nodes. We augment the graph to include the unobserved nodes $T$, as a new graph $\bar{G} = (\bar{V}, \bar{E})$ with $\bar{V} = V \cup T$ and $\bar{E} = E \cup \{(u,v) \mid u \in V, v \in T\} \cup \{(v,u) \mid u \in V, v \in T\}$. We construct the layered graph under the DIC model and the DLT model as in Theorems 4.1 and 4.2, but add the additional nodes $v', v''$ only for the nodes $v \in V$. Since the augmented graph has $O(m + n|T|)$ edges, the results of Theorems 4.1 and 4.2 show that the sample complexity is $\tilde{O}(\frac{m + n|T|}{\epsilon^2 r^8})$.

To apply the above corollary, the number of lost nodes $|T|$ is the only information required besides the observed cascades. For incomplete observations with only adversarially missing nodes, we can apply the same trick of manually injecting a small amount of missing activities to achieve a similar sample complexity for both models. However, the above results suffer from a major limitation: the lost nodes cannot serve as seeds that initiate cascades. Removing this restriction is an interesting and important direction for future work.

4.5.2.4 Limitations of proper PAC learning

The PAC learnability results show that there is no information-theoretic obstacle to influence function learning under incomplete observations. However, they do not imply an efficient algorithm. The reason is that a hidden variable would be associated with each node not observed to be active, and computing the objective function for empirical risk minimization would require marginalizing over all of the hidden variables. The proper PAC learnability result also does not readily generalize to the CIC model and other diffusion models, even under complete observations, due to the lack of a succinct characterization of influence functions such as the one available for the DIC and DLT models. Therefore, in the next section, we explore improper learning approaches, with the goal of designing practical algorithms and establishing learnability under a broader class of diffusion models.

4.6 Efficient Improper Learning Algorithm

In this section, we develop improper learning algorithms for efficient influence function learning. Instead of parameterizing the influence functions by the edge parameters, we adopt the model-free influence function learning framework InfluLearner, proposed by Du et al. [35], to represent the influence function as a sum of weighted basis functions. From now on, we focus on the influence function $F_v(S)$ of a single fixed node $v$; we provide guarantees for learning the functions of all nodes by applying a union bound over the individually learned $F_v(S)$.

Influence Function Parameterization. As discussed in detail in Section 2.5.1.3, for all three diffusion models (CIC, DIC and DLT), the diffusion process can be characterized equivalently using live-edge graphs. As a result, the influence functions can alternatively be characterized using the reachability basis functions introduced in Section 2.5.1.3:

$F^*_v(S) = \sum_T \beta^*_T \cdot \phi(\chi_S^\top r_T)$.
This representation still has exponentially many features (one for each node set $T$). In order to make the learning problem tractable, we sample a smaller set $\mathcal{T}$ of $K$ features from a suitably chosen distribution, implicitly setting the weights $\beta_T$ of all other features to 0. Thus, we parametrize the learned influence function as

$F^{\boldsymbol{\beta}}_v(S) = \sum_{T \in \mathcal{T}} \beta_T \cdot \phi(\chi_S^\top r_T)$.

The goal is then to learn the weights $\beta_T$ for the sampled features. (They will form a distribution, i.e., $\|\boldsymbol{\beta}\|_1 = 1$ and $\boldsymbol{\beta} \geq 0$.) The crux of the analysis is to show that a sufficiently small number $K$ of features (i.e., sampled sets) suffices for a good approximation, and that the weights can be learned efficiently from a limited number of observed incomplete cascades. Specifically, we consider the log-likelihood function $\ell(t,y) = y\log t + (1-y)\log(1-t)$, and learn the parameter vector $\boldsymbol{\beta}$ via the following maximum likelihood estimation problem:

Maximize $\sum_{i=1}^M \ell(F^{\boldsymbol{\beta}}_v(S_i), \chi_{A_i}(v))$ subject to $\|\boldsymbol{\beta}\|_1 = 1$, $\boldsymbol{\beta} \geq 0$.

Handling Incomplete Observations. The maximum likelihood estimation cannot be applied directly to incomplete cascades, as we do not have access to $A_i$ (only to the incomplete version $\tilde{A}_i$). To address this issue, notice that the MLE problem is in fact a binary classification problem with log loss and labels $y_i = \chi_{A_i}(v)$. From this perspective, incompleteness is simply class-conditional noise on the labels. Let $\tilde{y}_i = \chi_{\tilde{A}_i}(v)$ be our observation of whether $v$ was activated or not in the $i$-th incomplete cascade. Then

$\mathrm{Prob}[\tilde{y}_i = 1 \mid y_i = 1] = r$ and $\mathrm{Prob}[\tilde{y}_i = 1 \mid y_i = 0] = 0$.

In words, the incomplete observation $\tilde{y}_i$ suffers from one-sided error compared to the complete observation $y_i$. Known techniques can be used to address this issue: by results of Natarajan et al. [111], we can construct an unbiased estimator of $\ell(t,y)$ using only the incomplete observations $\tilde{y}$, as in the following lemma.

Lemma 4.12 (Corollary of Lemma 1 of [111]). Let $y$ be the true activation status of node $v$ and $\tilde{y}$ the incomplete observation. Then, defining

$\tilde{\ell}(t,y) := \frac{1}{r}\cdot y\log t + \left(1 - \frac{y}{r}\right)\log(1-t)$,

for any $t$ we have $\mathbb{E}_{\tilde{y}}[\tilde{\ell}(t,\tilde{y})] = \ell(t,y)$.

Based on this lemma, we solve the maximum likelihood estimation problem with the adjusted likelihood function $\tilde{\ell}(t,y)$:

Maximize $\sum_{i=1}^M \tilde{\ell}(F^{\boldsymbol{\beta}}_v(S_i), \chi_{\tilde{A}_i}(v))$   (4.10)
subject to $\|\boldsymbol{\beta}\|_1 = 1$, $\boldsymbol{\beta} \geq 0$.

We analyze conditions under which the solution to (4.10) provides improper PAC learnability under incomplete observations; these conditions apply for all three diffusion models. They are similar to those of Lemma 1 in the work of Du et al. [35], and concern the approximability of the reachability distribution $\beta^*_T$. Specifically, let $q$ be a distribution over node sets $T$ such that $q(T) \leq C\beta^*_T$ for all node sets $T$; here, $C$ is a (possibly very large) constant that enters the bounds below. Let $T_1, \ldots, T_K$ be $K$ i.i.d. samples drawn from the distribution $q$. The features are then $r_k = \chi_{T_k}$ for $k = 1, \ldots, K$; a small sketch of the feature construction and the adjusted loss is given below.
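A minimal sketch of the sampled features and the corrected loss, under the simplifying assumption that $q$ is a product of per-node marginal activation probabilities (the practical choice discussed later in this section); all names are ours:

```python
import math
import random

def sample_feature_sets(marginals, K, rng=random.Random(0)):
    """Draw K node sets T_1..T_K, including node u in a set w.p. marginals[u]."""
    return [{u for u, p in marginals.items() if rng.random() < p}
            for _ in range(K)]

def feature_vector(seed_set, feature_sets):
    """phi(chi_S^T r_T) = min(1, |S intersect T|): 1 iff S intersects T."""
    return [1.0 if seed_set & T else 0.0 for T in feature_sets]

def adjusted_loss(t, y, r):
    """The corrected log-likelihood of Lemma 4.12, unbiased under the
    one-sided noise Prob[y~=1 | y=1] = r, Prob[y~=1 | y=0] = 0."""
    return (y / r) * math.log(t) + (1.0 - y / r) * math.log(1.0 - t)
```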
We use the truncated version $F^{\boldsymbol{\beta},\lambda}_v$ of the learned function, with parameter $\lambda$, as in [35]:

$F^{\boldsymbol{\beta},\lambda}_v(S) = (1 - 2\lambda)F^{\boldsymbol{\beta}}_v(S) + \lambda$.

The proof of Theorem 4.13 will show how to choose $\lambda$. Let $\mathcal{M}_\lambda$ be the class of all such truncated influence functions, and $F^{\tilde{\boldsymbol{\beta}},\lambda}_v \in \mathcal{M}_\lambda$ the influence function obtained from the optimization problem (4.10). The following theorem establishes the accuracy of the learned functions.

Theorem 4.13. Assume that the learning algorithm uses $K = \tilde{\Omega}(\frac{C^2}{\epsilon^2})$ features in the influence function it constructs, and observes $M = \tilde{\Omega}(\frac{\log C}{\epsilon^4 r^2})$ incomplete cascades with retention rate $r$. (Here, the $\tilde{\Omega}$ notation suppresses all logarithmic terms except $\log C$, as $C$ could be exponential or worse in the number of nodes.) Let $\mathcal{P}$ be a seed distribution. Then, with probability at least $1-\delta$, the learned influence functions $F^{\tilde{\boldsymbol{\beta}},\lambda}_v$ satisfy, for each node $v$,

$\mathbb{E}_{S \sim \mathcal{P}}\left[(F^{\tilde{\boldsymbol{\beta}},\lambda}_v(S) - F^*_v(S))^2\right] \leq \epsilon$.

Proof. Let $M = \tilde{\Omega}(\frac{\log C}{\epsilon^4 r^2})$, and let $F^{\tilde{\boldsymbol{\beta}},\lambda}_v(S)$ be the influence functions obtained in Theorem 4.13. We will show that for any single node $v$, with probability at least $1 - \delta/n$,

$\mathbb{E}_S\left[(F^{\tilde{\boldsymbol{\beta}},\lambda}_v(S) - F^*_v(S))^2\right] \leq \epsilon$.

The theorem then follows by taking a union bound over all $n$ nodes. Recall that $\mathcal{M}_\lambda$ is the function class of all truncated influence functions. We write

$\mathcal{R}_M(\mathcal{M}_\lambda) := \mathbb{E}_{S_i \sim \mathcal{P},\,(\epsilon_i)_i \sim \mathrm{Uniform}(\{-1,1\}^M)}\left[\sup_{F \in \mathcal{M}_\lambda}\frac{1}{M}\sum_{i=1}^M \epsilon_i \cdot F_v(S_i)\right]$

for its Rademacher complexity, where the $\epsilon_i$ are i.i.d. Rademacher (symmetric Bernoulli) random variables. Du et al. [35] established the following bound on the number of features necessary to approximate the true influence function.

Lemma 4.14 (Lemma 12 in [35]). There exists a truncated influence function $F^{\hat{\boldsymbol{\beta}},\lambda}_v \in \mathcal{M}_\lambda$ with $K = O(\frac{C^2}{\epsilon^2}\log\frac{Cn}{\epsilon\delta})$ features such that, with probability at least $1 - \frac{\delta}{2n}$,

$\mathbb{E}_{S \sim \mathcal{P}}\left[(F^{\hat{\boldsymbol{\beta}},\lambda}_v(S) - F^*_v(S))^2\right] \leq 2\epsilon^2 + 2\lambda^2$.

Using the log-likelihood function $\ell(t,y) = y\log t + (1-y)\log(1-t)$ as defined in Section 4.6, we write the log loss of the influence function $F_v$ as $\mathrm{err}_{\log}[F_v] = \mathbb{E}_{S,A}[-\ell(F_v(S), \chi_A(v))]$. Natarajan et al. proved the following theorem [111].

Theorem 4.15 (Theorem 3 in [111]). Let $L_\rho$ be the Lipschitz constant of $\tilde{\ell}$: $L_\rho = 2L/(1 - \rho_{+1} - \rho_{-1})$, where $L$ is the Lipschitz constant of $\ell(\cdot)$, and $\rho_{+1}, \rho_{-1}$ are the flipping probabilities of positive and negative labels, respectively. With probability at least $1-\delta$,

$\mathrm{err}_{\log}[F^{\tilde{\boldsymbol{\beta}},\lambda}_v] \leq \min_{f \in \mathcal{M}_\lambda}\mathrm{err}_{\log}[f] + 4L_\rho\mathcal{R}_M(\mathcal{M}_\lambda) + \sqrt{\frac{\log(1/\delta)}{2M}}$.

Applying the above theorem with failure probability $\frac{\delta}{2n}$ and Lipschitz constant $L_\rho = \frac{2}{\lambda r}$ (here $L = 1/\lambda$, $\rho_{+1} = 1-r$, $\rho_{-1} = 0$), we obtain that with probability at least $1 - \frac{\delta}{2n}$,

$\mathrm{err}_{\log}[F^{\tilde{\boldsymbol{\beta}},\lambda}_v] \leq \min_{f \in \mathcal{M}_\lambda}\mathrm{err}_{\log}[f] + \frac{8}{\lambda r}\mathcal{R}_M(\mathcal{M}_\lambda) + \sqrt{\frac{\log(2n/\delta)}{2M}}$.

Because $F^{\hat{\boldsymbol{\beta}},\lambda}_v \in \mathcal{M}_\lambda$, we can bound $\min_{f \in \mathcal{M}_\lambda}\mathrm{err}_{\log}[f] \leq \mathrm{err}_{\log}[F^{\hat{\boldsymbol{\beta}},\lambda}_v]$ on the right-hand side. Subtracting $\mathrm{err}_{\log}[F^*_v]$ from both sides, we obtain

$\mathrm{err}_{\log}[F^{\tilde{\boldsymbol{\beta}},\lambda}_v] - \mathrm{err}_{\log}[F^*_v] \leq \mathrm{err}_{\log}[F^{\hat{\boldsymbol{\beta}},\lambda}_v] - \mathrm{err}_{\log}[F^*_v] + \frac{8}{\lambda r}\mathcal{R}_M(\mathcal{M}_\lambda) + \sqrt{\frac{\log(2n/\delta)}{2M}}$.   (4.11)

The squared and log errors can be related to each other as in the proof of Theorem 2 in [110]:

$\mathbb{E}_S\left[(F^{\tilde{\boldsymbol{\beta}},\lambda}_v(S) - F^*_v(S))^2\right] \leq \frac{1}{2}\left(\mathrm{err}_{\log}[F^{\tilde{\boldsymbol{\beta}},\lambda}_v] - \mathrm{err}_{\log}[F^*_v]\right)$.

Hence, in order to bound $\mathbb{E}_S[(F^{\tilde{\boldsymbol{\beta}},\lambda}_v(S) - F^*_v(S))^2]$, it suffices to upper-bound the right-hand side of (4.11). The term $\mathrm{err}_{\log}[F^{\hat{\boldsymbol{\beta}},\lambda}_v] - \mathrm{err}_{\log}[F^*_v]$ can be bounded as in the proof of Lemma 2 in [35], using Lemmas 11 and 16 from [35]: assume that $F^{\hat{\boldsymbol{\beta}},\lambda}_v$ uses $K = \Omega(\frac{C^2}{\hat{\epsilon}^2}\log\frac{Cn}{\hat{\epsilon}\hat{\delta}})$ features. Then, with probability at least $1-\hat{\delta}$, we have

$\mathrm{err}_{\log}[F^{\hat{\boldsymbol{\beta}},\lambda}_v] - \mathrm{err}_{\log}[F^*_v] \leq \frac{\hat{\epsilon}^2 + \lambda^2}{\lambda}\left(1 + \log\frac{1}{\lambda}\right)$.   (4.12)

Next, we bound the Rademacher complexity of the function class $\mathcal{M}_\lambda$.

Lemma 4.16. The Rademacher complexity $\mathcal{R}_M(\mathcal{M}_\lambda)$ of the function class $\mathcal{M}_\lambda$ with at most $K$ features is at most $\sqrt{\frac{2\log(1+K)}{M}}$.

Proof. Recall that we use the basis functions $\phi_i(S) := \min\{1, \chi_S^\top r_{T_i}\}$. Let $W = \{\phi_i \mid i = 1, \ldots, K\} \cup \{\mathbf{1}\}$, where $\mathbf{1}$ is the constant function with value 1. By definition, we have $\mathcal{M}_\lambda \subseteq \mathrm{conv}(W)$, where $\mathrm{conv}(W)$ denotes the convex hull. Therefore, $\mathcal{R}_M(\mathcal{M}_\lambda) \leq \mathcal{R}_M(\mathrm{conv}(W)) = \mathcal{R}_M(W)$.
Since $|\phi_i(S)| \leq 1$, Massart's finite class lemma (Lemma 4.17) yields $\mathcal{R}_M(W) \leq \sqrt{\frac{2\log(1+K)}{M}}$, completing the proof.

Lemma 4.17 (Massart's Finite Class Lemma). Let $\mathcal{F}$ be a finite set of functions such that $\sup_{f \in \mathcal{F}}\frac{1}{n}\sum_{i=1}^n f(X_i)^2 \leq C^2$ for any variable values $X_1, \ldots, X_n$. Then the Rademacher complexity of $\mathcal{F}$ is upper-bounded by $\mathcal{R}_n(\mathcal{F}) \leq \sqrt{\frac{2C^2\log|\mathcal{F}|}{n}}$.

To finish the proof of Theorem 4.13, let $\epsilon$ be the desired accuracy. Define $\hat{\delta} = \frac{\delta}{2n}$ and $\hat{\epsilon} = \lambda = \frac{\epsilon}{c'\log(1/\epsilon)}$, where $c'$ is a sufficiently large constant. Then the right-hand side of (4.12) is upper-bounded by $2\hat{\epsilon}\cdot(1 + \log\frac{1}{\hat{\epsilon}}) \leq \frac{\epsilon}{2}$. With $M = \Omega(\frac{\log K}{\epsilon^4 r^2})$, we have $\frac{8}{\lambda r}\mathcal{R}_M(\mathcal{M}_\lambda) \leq \frac{\epsilon}{4}$. Whenever $M = \Omega(\frac{\log(n/\delta)}{\epsilon^2})$, we also get $\sqrt{\frac{\log(2n/\delta)}{2M}} \leq \frac{\epsilon}{4}$. Taking $M$ as the maximum of the above requirements, which is satisfied when $M = \tilde{\Omega}(\frac{\log C}{\epsilon^4 r^2})$, we can substitute all of the bounds into the right-hand side of (4.11) and obtain $\mathbb{E}_S[(F^{\tilde{\boldsymbol{\beta}},\lambda}_v(S) - F^*_v(S))^2] \leq \epsilon$ with probability at least $1 - \frac{\delta}{n}$. Taking a union bound over all nodes $v$ concludes the proof.

The theorem implies that with enough incomplete cascades, an algorithm can approximate the ground truth influence function to arbitrary accuracy; therefore, all three diffusion models are improperly PAC learnable under incomplete observations. The final sample complexity does not contain the graph size; the graph size is, however, implicitly captured by $C$, which depends on the graph and on how well the distribution $\beta^*_T$ can be approximated. Notice that when the retention rate is 1, our Theorem 4.13 significantly improves the sample complexity bound of Du et al. [35]: the sample complexity in [35] is $\tilde{O}(\frac{C^2}{\epsilon^3})$, while our theorem implies a sample complexity of $\tilde{O}(\frac{\log C}{\epsilon^4})$ under complete observations. The improvement is derived from bounding the Rademacher complexity of the function class $\mathcal{M}_\lambda$ instead of its $L_{2,\infty}$ dimension. The Rademacher bound leads to a logarithmic dependence of the sample complexity on the number of features $K$, whereas the $L_{2,\infty}$ bound results in a polynomial dependence.

Moreover, as the influence functions of all nodes are learned independently from each other, the algorithm can be applied directly under adversarial loss of nodes, by simply learning the influence functions of the observed nodes. When the cascades suffer from both types of incomplete observations, we can likewise apply the improper learning algorithm to the observed nodes only, and still learn their influence functions accurately.

Efficient Implementation. As mentioned above, the features $T$ cannot be sampled from the exact reachability distribution $\beta^*_T$, because it is inaccessible (and complex). In order to obtain useful guarantees from Theorem 4.13, we follow the approach of Du et al. [35] and approximate the distribution $\beta^*_T$ by the product of the marginal distributions, estimated from the observed cascades. The optimization problem (4.10) is convex and can therefore be solved in time polynomial in the number of features $K$. However, the guarantees in Theorem 4.13 require a possibly large number of features; in order to obtain an efficient algorithm for practical use and for our experiments, we sacrifice the guarantee and use a fixed number of features. Notice that the optimization problem (4.10) can be solved independently for each node $v$; the learned functions $F_v(S)$ can then be combined into $F(S) = [F_1(S), \ldots, F_n(S)]$. As the optimization problem factorizes over nodes, the method is trivially parallelizable and thus scales to large networks. A small optimization sketch is given below.
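As a concrete illustration of solving (4.10) for one node, the following sketch runs exponentiated-gradient ascent over the simplex constraint $\|\boldsymbol{\beta}\|_1 = 1$, $\boldsymbol{\beta} \geq 0$. It is a minimal stand-in for a generic convex solver; the names, the step size, and the fixed truncation level are our illustrative choices:

```python
import numpy as np

def fit_influence_weights(features, labels, r, lam=0.01, steps=500, lr=0.1):
    """Maximize the adjusted likelihood (4.10) for a single node v.

    features: (M, K) 0/1 array; features[i, k] = 1 iff seed set S_i
              intersects the sampled reachability set T_k.
    labels:   (M,) 0/1 array; labels[i] = 1 iff v was observed active
              in the i-th (incomplete) cascade.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels, dtype=float)
    M, K = features.shape
    beta = np.full(K, 1.0 / K)  # start at the uniform distribution
    for _ in range(steps):
        # truncated prediction F^{beta,lambda}_v(S_i), kept away from 0 and 1
        t = (1 - 2 * lam) * features.dot(beta) + lam
        # gradient of sum_i (y_i/r) log t_i + (1 - y_i/r) log(1 - t_i)
        coef = (labels / r) / t - (1 - labels / r) / (1 - t)
        grad = (1 - 2 * lam) * features.T.dot(coef) / M
        beta *= np.exp(lr * grad)   # multiplicative (exponentiated) update
        beta /= beta.sum()          # re-normalize onto the simplex
    return beta
```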
A further point regarding the implementation: in our theoretical analysis, we assumed that the retention rate $r$ is known to the learning algorithm. In practice, it can be estimated via cross-validation. As we show in the next section, the algorithm is not very sensitive to misspecification of the retention rate.

4.7 Experiments

In this section, we experimentally evaluate the algorithm from Section 4.6. Since no other methods explicitly account for incomplete observations, we compare it to several state-of-the-art methods for influence function learning with full information. The main goal of the comparison is hence to examine to what extent the impact of missing data can be mitigated by being aware of it. We compare our algorithm to the following approaches:

• CIC fits the parameters of a CIC model, using the NetRate algorithm [51] with exponential delay distributions.
• DIC fits the activation probabilities of a DIC model using the method in [115].
• InfluLearner is the model-free approach proposed by Du et al. in [35] and discussed in Section 4.6.
• Logistic uses logistic regression to learn the influence functions $F_u(S) = f(\chi_S^\top c_u + b)$ for each $u$ independently, where $c_u$ is a learnable weight vector and $f(x) = \frac{1}{1+e^{-x}}$ is the logistic function.
• Linear uses linear regression to learn the total influence $\sigma(S) = c^\top \chi_S + b$ of the set $S$.

Notice that the CIC and DIC methods have access to the activation time of each node in addition to the final activation status, giving them an inherent advantage.

4.7.1 Synthetic cascades

Data generation. We generate synthetic networks with core-peripheral structure following the Kronecker graph model described in detail in Section 3.1. Each generated network has 512 nodes and 1024 edges. We then generate synthetic cascades following the CIC, DIC and DLT models. For the CIC model, we use an exponential delay distribution on each edge, whose parameters are drawn independently and uniformly from $[0,1]$; the observation window length is $\tau = 1.0$. For the DIC model, the activation probability of each edge is chosen independently and uniformly from $[0, 0.4]$. For the DLT model, we follow [77] and set the edge weight $\theta_{u,v}$ to $1/d_v$, where $d_v$ is the in-degree of node $v$. For each model, we generate 8192 cascades as training data. The seed sets are sampled uniformly at random, with sizes drawn from a power law distribution with parameter 2.5 (a small sampling sketch is given below). The generated cascades have average sizes of 10.8, 12.8 and 13.0 under the CIC, DIC and DLT models, respectively. We then create incomplete cascades by varying the retention rate between 0.1 and 0.9. The test set contains 200 independently sampled seed sets, generated in the same way as the training data. To sidestep the computational cost of running Monte Carlo simulations, we estimate the ground truth influence of the test seed sets using the method proposed in [35], with the true model parameters.
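A minimal sketch of the seed set sampling as just described, using numpy's Zipf sampler as the power-law draw; the names are ours:

```python
import numpy as np

def sample_seed_set(nodes, alpha=2.5, rng=np.random.default_rng()):
    """Draw |S| from a power law with exponent alpha, then choose the
    seeds uniformly at random without replacement."""
    size = min(len(nodes), int(rng.zipf(alpha)))
    return set(rng.choice(list(nodes), size=size, replace=False))
```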
Algorithm settings. We apply all algorithms to cascades generated from all three models; that is, we also consider the results under model misspecification. Whenever applicable, we set the hyperparameters of the five comparison algorithms to the ground truth values. When applying the NetRate algorithm to discrete-time cascades, we set the observation window to 10.0. When applying the method in [115] to continuous-time cascades, we round activation times up to the nearest multiple of 0.1, resulting in 10 discrete time steps. For the model-free approaches (InfluLearner and our algorithm), we use $K = 200$ features.

Figure 4.7: MAE of estimated influence as a function of the retention rate on synthetic datasets for (a) the CIC model, (b) the DIC model, (c) the DLT model. The error bars show one standard deviation.

Results. Figure 4.7 shows the Mean Absolute Error (MAE) between the estimated total influence $\sigma(S)$ and the true influence value, averaged over all test seed sets. For each setting (diffusion model and retention rate), the reported MAE is averaged over five independent runs. The main insight is that accounting for missing observations indeed strongly mitigates their effect: notice that for retention rates as small as 0.5, our algorithm can almost completely compensate for the data loss, whereas both the model-free and the parameter-fitting approaches deteriorate significantly even for retention rates close to 1. For the parameter-fitting approaches, even such large retention rates can lead to missing and spurious edges in the inferred networks, and thus to significant estimation errors. These approaches fail because, when an activation is missing, the MLE-based parameter-fitting algorithms have to include spurious edges to explain the activations of the affected nodes. Additional observations include that fitting influence using (linear or logistic) regression does not perform well at all; the reason is that such linear models do not have enough expressive power to capture the highly non-linear relationships of the diffusion process. We also observe that the CIC inference approach appears more robust to model misspecification than the DIC approach.

Non-uniform retention rates. In practice, different nodes may have different retention rates, while we may only be able to estimate the mean retention rate $r$. For our experiments, we draw each node $v$'s retention rate $r_v$ independently from a distribution with mean $r$; specifically, we use uniform and Gaussian distributions. For the uniform distribution, we draw $r_v \sim \mathrm{Unif}[r-\sigma, r+\sigma]$; for the Gaussian distribution, we draw $r_v \sim \mathcal{N}(r, \sigma^2)$, truncating draws at 0 and 1. (The truncation could lead to a bias in the mean of $r$; however, empirical simulations show that the bias is negligible, only 0.01 when $\sigma = 0.2$.) In both cases, $\sigma$ measures the level of noise in the estimated retention rate. We set the mean retention rate $r$ to 0.8 and vary $\sigma$ in $\{0, 0.02, 0.05, 0.1, 0.2\}$; a small sketch of these draws appears below.

Figure 4.8: Relative error in MAE when the true retention rates are drawn from the truncated Gaussian distribution. The x-axis shows the standard deviation $\sigma$ of the retention rates around the mean; the y-axis is the relative difference in MAE compared to the case where all retention rates are the same and known.

Figure 4.9: Relative error in MAE when the true retention rates are drawn from the uniform distribution. The x-axis shows the interval half-width $\sigma$ of the retention rates around the mean; the y-axis is the relative difference in MAE compared to the case where all retention rates are the same and known.

Figures 4.8 and 4.9 show the results for the Gaussian and uniform distributions, respectively. The results show that our model is very robust to random and independent perturbations of the individual retention rates. The performance is better than the theoretical results in Theorem 4.10 would suggest; the difference lies in the model of the deviation of the retention rate. Here, we assume uniform or Gaussian noise centered at the assumed retention rate, while Theorem 4.10 instead considers an adversarial setting, only requiring that the retention rate lie in a given interval.
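A sketch of the per-node retention-rate draws used in this robustness study (numpy-based; the names are ours):

```python
import numpy as np

def draw_retention_rates(n, mean_r=0.8, sigma=0.1, dist="gaussian",
                         rng=np.random.default_rng()):
    """Independent per-node retention rates centered at mean_r."""
    if dist == "gaussian":
        rates = rng.normal(mean_r, sigma, size=n)
    else:  # uniform on [mean_r - sigma, mean_r + sigma]
        rates = rng.uniform(mean_r - sigma, mean_r + sigma, size=n)
    return np.clip(rates, 0.0, 1.0)  # truncate draws at 0 and 1
```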
Sensitivity of retention rate. We presented the algorithms as knowing r. Since r itself is inferred from noisy data, it may be somewhat misestimated. Figure 4.10 shows the impact of misestimating r. We generate synthetic cascades from all three diffusion models with a true retention rate of 0.8, and then apply our algorithm with (incorrect) retention rate r ∈ {0.6, 0.65, ..., 0.95, 1}. The results are averaged over five independent runs. While the performance decreases as the misestimation gets worse (after all, with r = 1, the algorithm is basically the same as InfluLearner), the degradation is graceful.

Figure 4.10: Relative error in MAE under retention rate misspecification. The x-axis shows the retention rate r used by the algorithm; the y-axis shows the relative difference in MAE compared to using the true retention rate 0.8.

Figure 4.11: MAE of influence estimation on seven sets of real-world cascades with 20% of activations missing.

4.7.2 Influence Estimation on real cascades

We further evaluate the performance of our method on the real-world MemeTracker-Du dataset described in detail in Section 3.2. We follow exactly the same evaluation method as Du et al. [35] on the held-out test cascades: we randomly select 10 nodes from the test cascades, which together represent one particular seed set S. For each node u ∈ S, let C(u) denote the set of cascades starting from u in the test data. For each u ∈ S, one cascade is sampled uniformly from C(u), and the union of all sampled cascades is taken as the set of nodes infected by source set S. We repeat this process 1,000 times and take the average number of infected nodes as the true influence of source set S. In total, we generate 100 source sets and report the MAE of each method. The training/test split of the dataset is 60%/40%.
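A minimal sketch of this ground-truth estimation protocol follows; the mapping from source nodes to their held-out cascades, represented here as node sets, is a hypothetical encoding for illustration.

    import random

    def estimate_influence(seed_set, cascades_by_source, n_rounds=1000, seed=0):
        """Estimate the influence of seed_set following Du et al. [35]:
        for each seed node u, sample one of its held-out cascades
        uniformly; the union of the sampled cascades is the infected
        set for this round.  Average the size over n_rounds rounds."""
        rng = random.Random(seed)
        total = 0.0
        for _ in range(n_rounds):
            infected = set()
            for u in seed_set:
                infected |= rng.choice(cascades_by_source[u])
            total += len(infected)
        return total / n_rounds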
To test the performance of influence function learning under incomplete observations, we randomly delete 20% of the occurrences, setting r = 0.8. Figure 4.11 shows the MAE of our method and the five baselines, averaged over 100 random draws of test seed sets, for all groups of memes. While some baselines perform very poorly, even compared to the best baseline (InfluLearner) our algorithm provides an 18% reduction in MAE (averaged over the seven groups), showing the potential of data-loss awareness to mitigate its effects.

Chapter 5

Joint Inference of Multiple Diffusion Networks

The content of this chapter is based on the paper: Xinran He and Yan Liu, "Not Enough Data? Joint Inferring Multiple Diffusion Networks via Network Generation Priors", In Proc. 10th ACM Intl. Conf. on Web Search and Data Mining, 2017.

Given the complexity of the Network Inference problem, most existing algorithms need a large number of cascades to achieve good performance. For example, for a network with 100 nodes, at least a thousand cascades are required for reasonable inference performance [52]. In practical applications, the amount of data is usually insufficient (relative to the large number of nodes in the diffusion networks). It is therefore not surprising that inference accuracy is significantly lower in real-world applications than on synthetic datasets [51, 52]. Several recent works address this data scarcity issue through sparsity regularization [51, 56, 146] or a low-rank structure constraint [146]. However, most of these models rely on strong assumptions about the network structure, which may easily be violated in real applications. In addition, the performance improvement from these models can be limited [56, 146], since one structural property may not be enough to compensate for the limited number of cascades, and it is extremely difficult to incorporate several network properties into these models.

In many applications, we have access to cascade observations on the same set of nodes, but because they concern different topics or occur at different times, the cascades do not share one diffusion network with the same parameters. Instead, they are generated from a large number of diffusion networks over different topics or a long time period. Inferring a single diffusion network from the combination of all cascades usually results in unsatisfactory accuracy. To address this problem, we observe that the associated influence networks (either topic-specific or time-specific) are highly correlated, even though the average number of observations associated with each network is limited. Therefore, we focus on investigating how this large-scale heterogeneous data collection can help effectively infer the latent diffusion networks.

In this chapter, we propose a novel generative model referred to as the MultiCascades model (MCM). Instead of assuming independence between the cascades from different networks and carrying out network inference separately for each network, MCM jointly infers the networks by exploiting the commonalities between them. Specifically, it builds a hierarchical graphical model in which all the diffusion networks share one network prior, and the cascades are generated from the corresponding diffusion networks.

MCM provides a systematic and flexible framework for exploiting many key findings in network analysis, such as heavy-tailed degree distributions, community structure, local clustering properties and the small-world phenomenon [74, 142], for effective network inference. We will use the Stochastic Blockmodel (SBM) [141] and Latent Space Models (LSM) [20, 71, 83] as examples to illustrate the power of our approach. MCM captures the commonalities among the different networks through the network prior, by jointly inferring the unknown but shared parameters of the network generation model from all the cascades. Conversely, the shared prior, encoding many important network properties, effectively mitigates the problem of data sparsity. To the best of our knowledge, this is one of the first models that can effectively utilize both observational data and network generation models for network inference. A graphical representation of the MCM model is shown in Figure 5.1.

Figure 5.1: Graphical representation of the MultiCascades model. The diffusion networks are first generated from the same parametric graph generation model, and the cascades are then generated depending on the structure of the specific diffusion network.

For effective inference and learning, we apply the EM algorithm to estimate the parameters of the graph generation model (i.e., the network prior) and the diffusion networks. Since it is intractable to carry out exact posterior inference in the E-step, we use MAP estimation instead of marginalization [112]. The MAP formulation allows us to reuse existing Network Inference algorithms as building blocks.
In the M-step, we update the parameters of the graph generation model based on the diffusion networks inferred in the E-step. We carry out empirical evaluation on both synthetic datasets and two real-world datasets, Twitter and MemeTracker. The experimental results show that our MCM model achieves significant improvements in accuracy compared to several state-of-the-art baseline algorithms by jointly inferring multiple diffusion networks.

5.1 Related Work

Our work builds on several pieces of prior work. Besides the Network Inference models we reviewed in Section 2.6, we summarize the other work most related to this chapter in three categories:

Network inference with prior knowledge: Motivated by the distinct properties exhibited by social networks, many algorithms have incorporated prior knowledge to improve the accuracy of diffusion network inference [51, 53, 56, 147]. For example, [51, 53, 56] introduce L1 regularization in the maximum likelihood estimation, motivated by the sparsity of social networks. Recently, Zhou et al. proposed the LowRankSparse algorithm [147], utilizing the fact that the adjacency matrix has a low-rank structure that captures the community structure of the diffusion network. However, many other useful properties, such as heavy-tailed degree distributions, local clustering properties and the small-world phenomenon [74, 142], remain unexplored in this line of work. MCM provides a general framework for incorporating prior knowledge with a wide choice of possible graph generation models.

Heterogeneous influence: Recently, a line of work has considered the heterogeneity of influence in social networks [37, 38, 139]. For example, the TopicCascades algorithm [38] focuses on inferring diffusion networks from text-based cascades, where the diffusion rate depends on the similarity of the contents. The KernelCascade algorithm [38] assumes that the delay distributions differ across the edges of the diffusion network. Moreover, Wang et al. propose the MMRate algorithm for multi-aspect, multi-pattern network inference [139]. In MMRate, the transmission rate depends on both the cascade patterns and the aspect of the diffusion networks; a mixture model is proposed in which a hidden variable determines under which aspect each cascade propagates. The focus of MMRate is to automatically infer the aspect of each cascade and then infer the aspect-specific diffusion networks independently. In contrast, we answer the question of how to jointly infer multiple diffusion networks from a limited number of observations. Moreover, the performance of MMRate is evaluated by combining all aspects, and it is noted that MMRate performs worse than baselines on diffusion networks with a smaller number of cascades [139]. In contrast, we show that MCM can infer multiple diffusion networks accurately even with a limited number of observations, suggesting the usefulness of adopting powerful network priors.

Multi-task learning: Multi-task learning aims at achieving better generalization by considering multiple related tasks and transferring information among them [11]. Many techniques have been developed to achieve information transfer among tasks; a few examples include regularized multi-task learning [42], subspace sharing [63] and imposing priors [70, 87]. In this work, we infer multiple diffusion networks by imposing a prior on all diffusion networks.
Though the idea of imposing priors has been widely applied to problems such as multi-task feature learning [70, 87], to the best of our knowledge we are the first to apply it to the study of diffusion networks and to solve multiple network inference from a multi-task learning perspective.

5.2 Problem Formulation

As introduced in Section 2.6, traditional network inference algorithms focus on inferring a single hidden diffusion network G from observed cascades C = {c_1, ..., c_|C|}. In this chapter, we assume that there are multiple diffusion networks G = {G_1, ..., G_M}. All diffusion networks share the same set of individuals V but differ in their edges and in the strength of influence given by the parameters of the diffusion networks. We use G_i = (V, E_i) to denote diffusion network i. The multiple diffusion networks can, for example, capture topic-specific influence strengths, or be multiple snapshots of one dynamic diffusion network. Let C_i = {c_{i,1}, ..., c_{i,|C_i|}} be the set of cascades generated from network i. We focus on simultaneously inferring the multiple diffusion networks G = {G_1, ..., G_M}, i.e., the edge set E_i of each network G_i, based on the collection of cascades generated from all the networks, C = {C_1, ..., C_M}. We assume that the information about which network each cascade is generated from is given.

Figure 5.2: Visualization of retweet networks for different topics: (a) Iran, (b) Haiti, (c) US politics, (d) Climate.

For ease of presentation, we use the CIC model described in Section 2.5.1.2 as the diffusion model. As we aim at inferring the edges of the diffusion networks, we assume that the delay distribution is the same for all edges, with known parameters; this assumption is also used in [51, 55]. Our method can easily be extended to other network inference models, such as the DIC model, the DLT model and the Hawkes Process models introduced in Section 2.5.2.

5.3 MultiCascades Model

Our MCM is mostly motivated by the following two observations. First, the associated diffusion networks are highly correlated with each other and share many commonalities. As an illustration, we visualize four Twitter retweet networks corresponding to the propagation of different topics in Figure 5.2. (A detailed description of the dataset is provided in Section 3.2.) The community in the center of the plot is shared by the Iran, Haiti and Climate networks, while the US politics network shares the left community with both the Iran and Haiti networks. In the MCM model, we capture this commonality by assuming that all associated networks are generated from the same network model with the same set of parameters. Moreover, we infer the parameters jointly from the ensemble of all cascades, mitigating the data scarcity issue by transferring information between cascades in different groups. Second, the diffusion networks exhibit distinct properties, such as community structure and local clustering [74, 142]. MCM can easily incorporate prior knowledge about the structure of the diffusion networks through the choice of graph generation model used as the prior. For example, we can use the Stochastic Blockmodel to capture the community structure of the diffusion networks [141] in Figure 5.2. Moreover, the incorporated prior serves as a regularizer that prevents overfitting when the number of observations is small.
Based on the intuition above, we construct MCM with the following generative process: first, we generate the diffusion networks from a parametric network generation model with parameters Φ, and then we generate the cascades C_i = {c_{i,j}} from the diffusion network G_i according to the CIC model. The graphical representation of MCM is shown in Figure 5.3.

Figure 5.3: Graphical representation of our MultiCascades model. Φ is the parameter of the network generation model; G_i is the hidden diffusion network for the cascades in group i; c_{i,j} represents the j-th observed cascade in network i.

The main inference task, estimating the parameters of the network generation model from the ensemble of all cascades, can be formulated as the following optimization problem:

    Φ* = argmax_Φ P(C | Φ) = argmax_Φ Σ_G P(C, G | Φ) .   (5.1)

Imposing the network generation model with the learned parameters Φ, we formulate the inference problem for each diffusion network G_i as a MAP estimation:

    G_i* = argmax_{G_i} P(G_i | C_i, Φ) .

Since it is intractable to marginalize out the diffusion networks G_i as hidden variables, we propose to solve the optimization problem in Equation 5.1 with the EM algorithm. The algorithm iterates between the E-step and the M-step until convergence: in the E-step, we compute the posterior distribution over the diffusion networks given the network prior parameters estimated in the previous iteration and the observed cascades; in the M-step, we update the prior parameters based on the new posterior distribution over the diffusion networks. However, exact posterior inference in the E-step is intractable for most choices of network priors and diffusion models, since we would need to marginalize over exponentially many possible graphs. We therefore use MAP estimation instead of marginalization:

    G_i^{(t)} = argmax_{G_i} P(G_i | C_i, Φ^{(t−1)}) ,

where Φ^{(t−1)} is the parameter estimate from the previous M-step. Though sampling methods such as MCMC could be used to approximate the posterior distribution, they are usually not scalable, due to the slow convergence of sampling from the exponential space of all possible graphs. Moreover, the MAP formulation allows us to reuse existing network inference algorithms with only small changes. As summarized in Algorithm 5.1, the EM algorithm with MAP estimation as the E-step is equivalent to maximizing the joint likelihood P(C, G | Φ) = ∏_{i=1}^M P(C_i | G_i) P(G_i | Φ) with a coordinate ascent method, where we alternately optimize over the diffusion networks G and the prior parameters Φ.

Algorithm 5.1: EM Algorithm for MCM
 1: Inputs: Observed cascades of different types C
 2: Outputs: Inferred networks G = {G_1, ..., G_M}, estimated prior parameters Φ
 3: Initialize Φ^{(0)}
 4: while the algorithm has not converged do
 5:   // E-step
 6:   for i = 1, ..., M do
 7:     G_i^{(t+1)} ← argmax_G P(G | C_i, Φ^{(t)})
 8:   end for
 9:   // M-step
10:   Φ^{(t+1)} ← argmax_Φ P(C, G_1^{(t+1)}, ..., G_M^{(t+1)} | Φ)
11: end while

The EM algorithm above provides a general framework for inferring multiple diffusion networks with a wide choice of possible network priors. The two requirements on a graph generation model to be used as the network prior are as follows: (1) there exists an efficient algorithm for model parameter estimation; (2) the model provides a probabilistic adjacency matrix that determines the existence of each edge independently. Let P be the probabilistic adjacency matrix, where P_{u,v}(Φ) is the probability that edge (u, v) exists in the graph. This requirement means that the probability P(G_i | Φ) takes the following form:

    P(G_i | Φ) = ∏_{(u,v)∈E_i} P_{u,v}(Φ) · ∏_{(u,v)∉E_i} (1 − P_{u,v}(Φ)) .
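In log form, this prior is a modular function of the selected edges, which is what the E-step below exploits. A minimal sketch of evaluating log P(G | Φ) from a probabilistic adjacency matrix (assuming directed graphs without self-loops; the names are ours):

    import numpy as np

    def log_graph_prior(adj, P, eps=1e-12):
        """log P(G | Phi) under an edge-independent prior: adj is a
        |V| x |V| 0/1 adjacency matrix of G, and P[u, v] is the edge
        probability P_{u,v}(Phi).  The diagonal (self-loops) is skipped."""
        P = np.clip(P, eps, 1.0 - eps)                 # guard against log(0)
        off_diag = ~np.eye(adj.shape[0], dtype=bool)
        ll = adj * np.log(P) + (1 - adj) * np.log(1.0 - P)
        return float(ll[off_diag].sum())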
Many widely used graph generation models naturally satisfy these requirements; examples include the Stochastic Blockmodel [141], Latent Space Models [71] and Kronecker graph models [90]. Moreover, our framework is flexible enough to use a mixture of graph generation models as the prior, although in this chapter we only use single graph generation models. There also exist widely used graph models that do not satisfy the requirements, such as the Preferential Attachment model introduced in Section 3.1 and the Exponential Random Graph Model [117]. Generalizing our model to handle a broader class of graph generation models remains an interesting direction for future work.

MCM focuses only on inferring the structure of the diffusion networks, i.e., the edge set E_i of each network G_i. In contrast, there are algorithms that infer both the edges and the strength of influence, such as the NetRate algorithm [51], the MMHP algorithm [146] and the LowRankSparse algorithm [147]. These algorithms aim at inferring a real-valued influence matrix Θ, where θ_{u,v} corresponds to the strength of influence from node u to node v. MCM can easily be generalized to this case: instead of providing the probability of the existence of each edge, we require that the model impose an independent prior distribution on the influence strength of each edge. The influence strength of an edge (e.g., the delay distribution parameter in the CIC-Delay model) is then sampled from a truncated normal distribution or a Laplace distribution with parameter θ_{u,v}.

Next, we show how to adapt the NetInf algorithm and the MultiTree algorithm for the MAP inference in the E-step. We then use the Stochastic Blockmodel and Latent Space Models as two concrete examples of network priors and show how we implement the parameter updates in the M-step.

5.3.1 MAP Inference in the E-step

In the E-step, we carry out the MAP inference of the diffusion network G_i given the cascade observations C_i in group i and the network prior with parameters Φ. We demonstrate how the NetInf and MultiTree algorithms introduced in Section 2.6 can be adapted for the MAP inference with minimal changes. To simplify notation, we omit the subscript i on both the diffusion network G_i and the cascades C_i below.

Recall that both the NetInf and MultiTree algorithms infer the diffusion network by maximizing an approximated submodular objective F_C(G). To adapt these algorithms for the MAP inference in the E-step of the MCM model, we combine the objective function with the likelihood corresponding to our network prior. The new objective function F′_C(G) takes the following form:

    F′_C(G) = F_C(G) + log P(G | Φ)
            = F_C(G) + Σ_{(u,v)∈E} log p_{uv}(Φ) + Σ_{(u,v)∉E} log(1 − p_{uv}(Φ)) .

Since log P(G | Φ) is a modular function of the selected edges, the new objective function F′_C(G) is still submodular, being the sum of a submodular function and a modular function. However, it is no longer guaranteed to be monotone; as a result, the greedy algorithm fails to provide an approximation guarantee. To address this issue, we instead use the Random Greedy algorithm introduced in Chapter 2 to solve this non-monotone submodular optimization problem. It has been shown in Theorem 2.2, by Buchbinder et al. [22], that the Random Greedy algorithm achieves a 1/e approximation guarantee.
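For reference, a minimal sketch of the Random Greedy procedure of Buchbinder et al. [22] under a cardinality constraint; gain is any marginal-gain oracle for the (possibly non-monotone) submodular objective, here F′_C, and the names are ours:

    import random

    def random_greedy(ground_set, gain, k, seed=0):
        """Each round ranks the remaining elements by marginal gain,
        keeps the k best (replacing those with non-positive gain by
        "dummy" no-ops), and adds one of the k choices uniformly at
        random.  gain(S, e) must return F(S + e) - F(S)."""
        rng = random.Random(seed)
        S = set()
        for _ in range(k):
            ranked = sorted(ground_set - S, key=lambda e: gain(S, e), reverse=True)
            top = [e for e in ranked[:k] if gain(S, e) > 0]
            choices = top + [None] * (k - len(top))    # dummies add nothing
            pick = rng.choice(choices)
            if pick is not None:
                S.add(pick)
        return S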
It should be noted that this guarantee only applies to maximizing F′_C(G). Since there is no guarantee on how well F_C(G) approximates the true log-likelihood of the observed cascades (as is already the case for both the NetInf and the MultiTree algorithms), the approximation guarantee does not carry over to the quality of the inferred diffusion network.

5.3.2 Parameter Update in the M-step

In this section, we use the Stochastic Blockmodel [141] and Latent Space Models [71, 83] as two concrete examples of network priors and demonstrate how to update the model parameters in the M-step.

5.3.2.1 Latent Space Models

The first example class of models we consider is the Latent Space Models ([71, 79, 125], and [19, 20, 83] specifically for diffusion analysis). Under this general class of models, the network is assumed to be embedded in a D-dimensional latent space: the closer two nodes are in the latent space, the more likely there is an edge between them in the diffusion networks. Under this model, the parameter Φ is simply a |V|-by-D node location matrix, whose row θ_u gives the position of node u in the D-dimensional latent space. In this work, we assume that each node position is drawn from a multivariate Gaussian distribution, θ_u ∼ N(0, σ²I). We use a Gaussian kernel, as in [20], to generate the probabilistic adjacency matrix from the node positions:

    p_{uv}(Φ) = exp( −(γ/2) ||θ_u − θ_v||²₂ ) .

In the M-step, we update the parameters Φ by maximizing the log-likelihood of the inferred diffusion networks G = {G_1, ..., G_M}:

    L = Σ_{i=1}^M [ Σ_{(u,v)∈E_i} log p_{uv}(Φ) + Σ_{(u,v)∉E_i} log(1 − p_{uv}(Φ)) ] − Σ_{u∈V} (1/(2σ²)) ||θ_u||²₂ .

The optimization is carried out via gradient ascent. Let ℓ_{u,v} = Σ_{i=1}^M (I_{G_i}[u → v] + I_{G_i}[v → u]) and ℓ̄_{u,v} = 2M − ℓ_{u,v}, where I_{G_i}[u → v] = 1 if there is an edge from u to v in diffusion network G_i. The gradient with respect to each node position θ_u is:

    ∇_{θ_u} L = Σ_{v∈V} [ γ (θ_u − θ_v) ( ℓ̄_{u,v} · p_{uv}(Φ) / (1 − p_{uv}(Φ)) − ℓ_{u,v} ) ] − (1/σ²) θ_u .

5.3.2.2 Stochastic Blockmodel

We use the Stochastic Blockmodel [141] as another example of the network prior. The Stochastic Blockmodel assumes that each node v in the network belongs to one of K communities. The probability of the existence of edge (u, v) depends on the community identities of the nodes and on the K-by-K Bernoulli rate matrix B, where B_{i,j} is the probability of an edge from a node in community i to a node in community j. We organize the community identities of all nodes into a |V|-by-K community indicator matrix Z = [z_1^⊤, ..., z_{|V|}^⊤]^⊤, where z_u is the community identity of node u in the 1-of-K representation introduced in Chapter 2. The model parameters thus consist of both the community indicator matrix and the Bernoulli rate matrix, Φ = (B, Z), and the relationship between the probabilistic adjacency matrix P and the parameters can be written concisely as:

    P = Z B Z^⊤ .

The optimization of the parameters Φ = (B, Z) is carried out by a stochastic EM algorithm. The Gibbs sampling update of node u's community identity in the E-step is:

    log P[z_{u,k} = 1 | Z_{−u}, G] ∝ Σ_{i=1}^M Σ_{v∈V} ( I_{G_i}[u → v] log B_{k,z_v} + I_{G_i}[u ↛ v] log(1 − B_{k,z_v}) )
                                   + Σ_{i=1}^M Σ_{v∈V} ( I_{G_i}[v → u] log B_{z_v,k} + I_{G_i}[v ↛ u] log(1 − B_{z_v,k}) ) .

Here, with a slight abuse of notation, we use z_v to denote the community that node v belongs to. The analytic solution for optimizing the matrix B in the M-step is:

    B_{s,t} = ( Σ_{i=1}^M Σ_{u∈V} Σ_{v∈V} I_{G_i}[u → v] I[z_{u,s} = 1, z_{v,t} = 1] ) / ( Σ_{i=1}^M Σ_{u∈V} Σ_{v∈V} I[z_{u,s} = 1, z_{v,t} = 1] ) .

In our implementation, we use a slightly different approach: we compute the posterior distribution P(Z | G, B) by Gibbs sampling. The probabilistic adjacency matrix P is then approximated as P̂ = E[Z] B E[Z]^⊤. We use this P̂ in the MAP inference, as it is more stable and robust: the community identities Z obtained via Gibbs sampling can change abruptly between iterations.
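A minimal sketch of this M-step update for B follows, assuming directed 0/1 adjacency matrices and a hard one-hot Z; self-pairs are not excluded, for brevity, and the names are ours.

    import numpy as np

    def update_block_matrix(graphs, Z, eps=1e-12):
        """M-step estimate of the SBM rate matrix: B[s, t] is the
        fraction of ordered node pairs (u, v) with z_u = s, z_v = t
        that are connected, pooled over all inferred networks G_i.
        graphs: list of |V| x |V| 0/1 arrays; Z: |V| x K one-hot."""
        ones = np.ones_like(graphs[0])
        edges = sum(Z.T @ G @ Z for G in graphs)       # per-block edge counts
        pairs = len(graphs) * (Z.T @ ones @ Z)         # per-block pair counts
        return edges / np.maximum(pairs, eps)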
5.3.3 Running time

The running time of the MCM algorithm depends on the time complexity of the components used in the EM algorithm. Empirically, we observe that MCM converges within five to ten iterations. As a result, MCM is only a constant factor slower than inferring all the networks separately. Moreover, the network inference for the multiple networks in the E-step is trivially parallelizable, allowing an even faster implementation.

5.4 Empirical Evaluation

The main question we seek to answer in the empirical evaluation is whether MCM can effectively infer multiple diffusion networks. First, we test our model on controlled synthetic datasets where the true diffusion networks are known; this set of experiments serves as a proof of concept for our inference algorithm and shows that incorporating network priors significantly improves inference accuracy. Second, we provide experimental results for MCM on two real-world datasets, the Twitter dataset and the MemeTracker dataset. We show that jointly inferring multiple networks leads to more accurate network discovery and a better understanding of the diffusion dynamics.

Algorithms: We include the following algorithms for comparison in our empirical evaluation:¹

• MCM-[NetInf|Multi]-[SBM|LSM]: The MCM algorithm with NetInf/MultiTree for the MAP inference in the E-step, using either an SBM or a Latent Space Model as the network prior.
• Sep-[NetInf|Multi]: Baseline algorithms that infer the multiple diffusion networks independently, using the NetInf or MultiTree algorithm.
• Single-[NetInf|Multi]: Baseline algorithms that infer a single diffusion network with the NetInf or MultiTree algorithm by combining all cascades.

Evaluation metrics: As all the algorithms greedily add edges one by one, we generate the precision-recall curve by evaluating at different numbers of edges. We then compute both the max F1 score and the area under the precision-recall curve (AUC) as metrics for the accuracy of the inferred networks. (The metrics are described in detail in Section 3.3.) As Sep-[NetInf|Multi] and the MCM algorithms infer multiple diffusion networks, we evaluate each inferred network against the corresponding ground-truth diffusion network and report the max F1 score and the area under the precision-recall curve averaged over all networks. For the Single-[NetInf|Multi] algorithms, we evaluate the inferred single network against all ground-truth diffusion networks and report the averaged results.

¹ We have also experimented with the NetRate algorithm with L1 regularization [56], which can be viewed as incorporating an independent Laplace prior. However, its performance is worse than or similar to NetInf and MultiTree, and is thus omitted.

Table 5.1: Comparison between the Greedy algorithm and the Random Greedy algorithm.
                       Greedy    Random-Greedy
  SBM-NetInf           0.461     0.459
  SBM-MultiTree        0.461     0.458
  Latent-NetInf        0.435     0.428
  Latent-MultiTree     0.436     0.428

5.4.1 Synthetic data

We first evaluate our method on synthetic data, with a focus on the effectiveness of the inference algorithm and the network priors in MCM.
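To make the two priors concrete, here is a minimal sketch of sampling a diffusion network from either prior, using the edge probabilities defined in Section 5.3.2; the function names and the seed handling are ours, for illustration only.

    import numpy as np

    def sample_sbm(Z, B, seed=0):
        """Directed SBM draw: edge (u, v) exists with probability
        (Z B Z^T)[u, v], where Z is |V| x K one-hot and B is K x K."""
        rng = np.random.default_rng(seed)
        P = Z @ B @ Z.T
        np.fill_diagonal(P, 0.0)                       # no self-loops
        return (rng.random(P.shape) < P).astype(int)

    def sample_lsm(positions, gamma=2.0, seed=0):
        """Latent Space Model draw: node u sits at positions[u] in R^D,
        and edge (u, v) exists w.p. exp(-gamma/2 * ||theta_u - theta_v||^2)."""
        rng = np.random.default_rng(seed)
        diff = positions[:, None, :] - positions[None, :, :]
        P = np.exp(-0.5 * gamma * (diff ** 2).sum(-1))
        np.fill_diagonal(P, 0.0)
        return (rng.random(P.shape) < P).astype(int)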
Data generation: We generate synthetic datasets using both the Stochastic Blockmodel and the Latent Space Models as graph generation priors (the two types of datasets are referred to below as SBM and Latent). For the Stochastic Blockmodel, we set the number of communities to 3, and the Bernoulli rate matrix B is set to 0.05 on the diagonal and 0.01 elsewhere. For the Latent Space Models, we set the dimension of the latent space to 3, with γ = 2.0 and σ = 1.0. For both models, we generate ten different diffusion networks, corresponding to ten sets of cascades; each diffusion network has 256 nodes. We use an exponential delay distribution with λ = 0.1 in the CIC model to generate the cascades, with an observation window of length 10. We vary the number of cascades per network from 100 to 500. For each setting, we randomly generate three datasets and report the mean of each evaluation metric.

Results: We first compare the simple Greedy algorithm and the Random Greedy method in the E-step of MCM, as shown in Table 5.1. The performance of the two methods is compared under both the NetInf and MultiTree algorithms, for both types of synthetic networks with the corresponding network prior. In all cases, the simple Greedy method performs slightly better than Random Greedy, even though only the Random Greedy algorithm provides an approximation guarantee. This can be explained by the observation that the objective function F′_C(G) is approximately monotone, as the value of the prior is usually smaller than the likelihood of the cascades. Moreover, as discussed earlier, the guarantee only applies to optimizing the approximate objective function F′_C(G) and does not necessarily carry over to the network inference accuracy. Given the superior performance of the simple Greedy method, we only present results for the simple Greedy method in the following experiments.

Table 5.2: Network inference accuracy on synthetic datasets with 200 cascades.
                          NetInf                              MultiTree
                  Sep     Single   MCM                Sep     Single   MCM
  SBM     maxF1   0.449   0.393    0.475 (5.8%)       0.429   0.387    0.477 (11.2%)
          AUC     0.300   0.254    0.377 (25.7%)      0.300   0.244    0.373 (25.7%)
  Latent  maxF1   0.487   0.135    0.510 (4.7%)       0.483   0.136    0.506 (4.8%)
          AUC     0.338   0.060    0.359 (6.2%)       0.330   0.061    0.368 (11.5%)

Table 5.2 shows the experimental results of all algorithms on the synthetic datasets with 200 cascades. The first observation is that Single-NetInf and Single-Multi perform significantly worse than the joint network inference algorithms, due to the large variation in the structures of the diffusion networks in the synthetic datasets. Second, MCM achieves consistent improvements in inference accuracy on both datasets, across various numbers of cascades, by incorporating the correct prior, i.e., the same model used for network generation. In particular, it achieves more than 15% improvement on average in the area under the PR curve compared to running inference separately.

Table 5.3: Relative improvement in AUC with different |C|.
  #Cascades   SBM             Latent
  100         17.56 ± 0.01%   8.98 ± 0.01%
  200         25.71 ± 0.01%   6.21 ± 0.01%
  300         7.62 ± 0.01%    7.51 ± 0.71%
  400         8.33 ± 0.01%    0.30 ± 0.01%
  500         13.32 ± 0.38%   4.18 ± 0.01%

Next, we vary the number of cascades, i.e., the degree of data scarcity, from 100 to 500. The relative improvement of MCM-NetInf over Sep-NetInf in terms of AUC is shown in Table 5.3, averaged over three independent runs.
From the results, we can see that incorporating the network prior improves the inference accuracy significantly, especially when there is only a limited number of cascades for each network. We hypothesize that the relative improvement is not monotone because the inferred network prior is less accurate, and thus not very helpful, when the number of cascades is too small; on the other hand, the network prior also provides little help when there is a sufficient number of observations. Another possible cause of the non-monotonicity is the small number of independent runs.

To obtain more insight into the MCM model, we carry out another set of experiments to study the effect of prior quality on inference accuracy. We control the quality of the inferred prior by varying the number of networks: the more networks we observe cascades from, the more accurately we can infer the parameters of the network generation model. We run the MCM-NetInf-LSM algorithm on the synthetic networks generated from the Latent Space Models, varying the number of networks M in {1, 10, 20} and fixing the number of cascades in each network to 200. In this experiment, we also run the MCM-NetInf-LSM algorithm with the ground-truth network generation prior (condition True Prior in Table 5.4) and the Sep-NetInf algorithm without any prior (condition Sep-NetInf in Table 5.4). It should be noted that when there is only one network to be inferred, there is no need for joint inference, and the algorithm simply iterates between inferring the diffusion network and optimizing the prior parameters; in this case, MCM-NetInf-LSM can also be viewed as running Sep-NetInf with the network generation prior.

The results of this experiment are shown in Table 5.4. The accuracy of the inferred diffusion networks goes up as the quality of the inferred network prior improves. Even inferring the prior from a single network and incorporating it in network inference leads to a significant improvement in accuracy, and MCM improves the accuracy further by learning the prior more accurately through joint inference over multiple networks. With 20 networks of 200 cascades each, MCM achieves an accuracy of 0.439, slightly higher than running MCM with 10 networks of 300 cascades each. However, collecting cascades from multiple networks is much less effective than collecting the same number of cascades from a single network: running Sep-NetInf with 2000 = 200 · 10 cascades yields an AUC of 0.742, much higher than the 0.376 obtained by MCM with 10 networks of 200 cascades each, and using 4000 = 200 · 20 cascades yields an AUC of 0.783, again much higher than the 0.439 obtained by MCM with 20 networks of 200 cascades each. On the other hand, the accuracy using the ground-truth prior with 200 cascades (an upper bound on the performance of MCM) is only 0.510. As a general guideline, when one can only collect a fixed number of cascades, it is usually much more effective to collect all the cascades from one network instead of a small number of cascades from each of multiple networks.

We further visualize the inferred networks to show that they capture the structure of the diffusion networks. We first plot the inferred SBM prior overlaid on one of the inferred diffusion networks, rearranging the nodes according to their community identity. The result is shown in Figure 5.4: MCM correctly infers the community identity of each node in the graph, leading to the block structure along the diagonal.

Table 5.4: The AUC of the MCM-NetInf-LSM algorithm and relative improvement over Sep-NetInf with different M.
  #Networks M   AUC     Relative Improvement
  Sep-NetInf    0.258   -
  1             0.349   35.27%
  10            0.376   45.74%
  20            0.439   70.16%
  True Prior    0.510   97.67%
Figure 5.4: Visualization of the SBM prior. The darker the background color, the higher the value in the B matrix. The blue dots correspond to the edges of one inferred diffusion network.

Next, we visualize the inferred node positions of the latent space prior together with the growth process of one cascade, as suggested in [83]. Figure 5.5 (a)-(d) shows the active nodes (marked as red triangles) within 0%, 33%, 66% and 100% of the observation window. The results demonstrate that the node locations inferred by MCM satisfy the influence preservation principle [83], i.e., the more likely node u is to influence node v, the closer the two nodes are in the visualization space.

Figure 5.5: Visualization of cascade growth under the Latent Space Models at (a) t = 0, (b) t = T/3, (c) t = 2T/3, (d) t = T, where T is the length of the observation window. The blue circles correspond to inactive nodes, while the red triangles are the activated nodes.

5.4.2 Real-world data

We evaluate our approach on two real-world datasets, Twitter and MemeTracker, both introduced in detail in Section 3.2. On the Twitter dataset, we aim at inferring multiple topic-specific diffusion networks; on the MemeTracker dataset, we aim at inferring both time-specific and topic-specific diffusion networks.

Table 5.5: Network inference accuracy on the Twitter dataset using NetInf in the E-step.
  NetInf                 Sep     Single   MCM-SBM   MCM-LSM
  Twitter100   maxF1     0.152   0.143    0.208     0.150
               AUC       0.074   0.062    0.112     0.088
  Twitter250   maxF1     0.089   0.095    0.172     0.094
               AUC       0.027   0.037    0.066     0.044

Table 5.6: Network inference accuracy on the Twitter dataset using MultiTree in the E-step.
  MultiTree              Sep     Single   MCM-SBM   MCM-LSM
  Twitter100   maxF1     0.164   0.148    0.180     0.169
               AUC       0.097   0.072    0.107     0.100
  Twitter250   maxF1     0.106   0.105    0.135     0.130
               AUC       0.0232  0.028    0.060     0.059

5.4.2.1 Twitter dataset

Since we focus on inferring topic-specific diffusion networks, we use the Twitter100 and Twitter250 datasets introduced in Section 3.2, where the cascades are grouped into five categories. We apply all the algorithms mentioned in the previous section and evaluate the inferred diffusion networks using the averaged max F1 and the area under the PR curve, as in the synthetic experiments. We set the delay distribution in the CIC model to the exponential distribution with λ = 0.1. The length of the observation window is set to 100; each time unit corresponds to a day.

Results: The results on the Twitter dataset are summarized in Tables 5.5 and 5.6. First, the results validate our claim that inferring multiple diffusion networks according to different topics leads to better accuracy: MCM, using either an SBM prior or an LSM prior, achieves improved accuracy on both evaluation metrics. (The parameters of the SBM and LSM are set as in the synthetic experiments.) We observe that incorporating a network prior is helpful for accurate inference of multiple networks. Inferring each diffusion network independently (Sep-NetInf/Sep-MultiTree) performs slightly better than inferring a single network on the smaller dataset with 100 nodes, but worse on the larger dataset with 250 nodes, due to an insufficient number of cascades.
MCM achieves consistent and significant improvements on both datasets. In particular, MCM-NetInf with the SBM prior achieves nearly 100% improvement on both Twitter100 and Twitter250 compared to Sep (i.e., inferring the separate diffusion networks independently).

To provide further insight into the performance of MCM, we plot the precision-recall curve, averaged over the five diffusion networks, in Figure 5.6. From the results, we can see that all the algorithms perform relatively well in choosing the first few edges, which have strong signals in the observed cascades. MCM performs significantly better than the baselines when including more edges. One possible explanation is that our algorithm is able to transfer information from the cascades of other topics (or time stamps) by incorporating the network prior.

Figure 5.6: Precision-recall curves of network inference on the Twitter100 dataset. The solid squares correspond to the points with maximum F1 score on the PR curves.

We visualize the diffusion networks among the top 10 users for all five topics in Figure 5.7. The plots further validate the usefulness of inferring multiple diffusion networks, as the influence structure varies significantly across the topic-specific networks. For example, both persia_max_news and tehranweekly are, as Iran news agencies, most influential on "Iran election", while persia_max_news has little influence on the propagation of other topics, such as "Haiti earthquake".

Figure 5.7: Inferred diffusion networks among the top 10 users from the Twitter dataset for the topics (a) Iran, (b) Haiti, (c) Technology, (d) US Politics and (e) Climate. The bold black line segments at the ends of edges indicate the direction of the influence.

Table 5.7: Network inference accuracy on the MemeTracker dataset using NetInf in the E-step.
  NetInf                          Sep     Single   MCM-SBM   MCM-LSM
  MemeTracker128       maxF1      0.303   0.299    0.308     0.315
                       AUC        0.182   0.178    0.193     0.194
  MemeTracker256       maxF1      0.228   0.229    0.232     0.239
                       AUC        0.122   0.123    0.129     0.131
  MemeTrackerTopic     maxF1      0.287   0.136    0.233     0.393
                       AUC        0.406   0.190    0.470     0.667

Table 5.8: Network inference accuracy on the MemeTracker dataset using MultiTree in the E-step.
  MultiTree                       Sep     Single   MCM-SBM   MCM-LSM
  MemeTracker128       maxF1      0.309   0.298    0.309     0.315
                       AUC        0.177   0.185    0.193     0.194
  MemeTracker256       maxF1      0.236   0.229    0.232     0.239
                       AUC        0.127   0.123    0.129     0.130
  MemeTrackerTopic     maxF1      0.300   0.135    0.349     0.339
                       AUC        0.293   0.190    0.612     0.643

5.4.2.2 MemeTracker dataset

As described in detail in Section 3.2, we preprocess the data into different datasets to test the performance of our approach on inferring both snapshots of dynamic diffusion networks and topic-specific networks.

Joint inference of dynamic networks: In the first experiment, we aim at inferring multiple snapshots of the dynamic diffusion network between blog sites and mainstream media, using the MemeTracker128 and MemeTracker256 datasets introduced in Section 3.2. We apply all the algorithms mentioned in the previous section and evaluate the inferred diffusion networks using the averaged max F1 and the area under the PR curve. We set the delay distribution in the CIC model to the exponential distribution with λ = 1; the length of the observation window is set to 30, and each time unit corresponds to a day. The choice of a larger λ compared to the Twitter dataset is due to the different time-scale normalization.

The results on inferring dynamic networks are summarized in Tables 5.7 and 5.8. From the results, we can see that MCM with either network prior consistently outperforms the baselines. MCM with the latent space prior achieves the best performance, with more than 5% improvement on all metrics of dynamic network inference.
Though the improvement is not very large, it is consistent on both the MemeTracker128 and MemeTracker256 datasets.

Joint inference of topic-specific networks: We also construct topic-specific cascades on the MemeTracker dataset, using the MemeTrackerTopic dataset introduced in Section 3.2. The results on inferring topic-specific networks are also shown in Tables 5.7 and 5.8. Compared to inferring dynamic networks, our method with an LSM prior achieves a much more significant improvement on the topic-specific network inference task (about 50% improvement) on both metrics we considered. As the variation across topic-specific networks is larger than in the dynamic networks, MCM benefits more from incorporating the prior to mitigate the data scarcity issue; this behavior is also observed on the Twitter dataset. Another interesting observation is that the LSM prior performs much better than the SBM prior, especially when the NetInf algorithm is used in the E-step. It is not clear why this occurs on the MemeTrackerTopic dataset, and finding out would be interesting future work.

Chapter 6

Network Inference with Content Information

The content of this chapter is based on the paper: Xinran He, Theodoros Rekatsinas, James Foulds, Lise Getoor and Yan Liu, "HawkesTopic: A Joint Model for Network Inference and Topic Modeling from Text-Based Cascades", In Proc. 32nd Intl. Conf. on Machine Learning, 2015.

In the previous chapter, we showed how to carry out joint inference of multiple diffusion networks by utilizing network similarities to combat data sparsity. However, that method does not apply when one is interested in inferring a single diffusion network accurately. In this chapter, we explore another direction for improving the accuracy of network inference: exploiting additional content information available in the observed cascades.

Cascades with content information, i.e., text-based cascades, are abundant in a variety of social platforms, ranging from well-established social networking websites such as Facebook, Google+, and Twitter, to increasingly popular social media websites such as Reddit, Pinterest, and Tumblr. Moreover, a growing number of platforms such as GDELT and EventRegistry¹ extract and analyze textual information from diverse news data sources, which often borrow content from each other or influence each other [34].

¹ See gdeltproject.org and eventregistry.org.

Text is, in many cases, the medium by which information is propagated, making it particularly salient for inferring information diffusion. Models based on observed timestamps have been shown to become more effective at discovering topic-dependent transmission rates or diffusion processes when combined with the textual information associated with the information propagation [37, 139]. Nevertheless, this line of work assumes either that the topics associated with the diffusion process are specified in advance, or that the influence paths are fully observed. Due to these assumptions, the aforementioned models are not applicable in many scenarios, such as discovering influence relationships between news data sources or users of social media and detecting information diffusion paths.

In this chapter, we focus on the problem of inferring the diffusion of information together with the topics characterizing the information, assuming that only the textual information and the timestamps of the posted information are known.
Contrary to previous approaches, we do not require prior knowledge of either the structure of the network or the topics. We introduce a novel framework that combines topic models and the Hawkes process [99] in a unified model referred to as the HawkesTopic model (HTM). HTM uses the Marked Multivariate Hawkes Process [99] to model the diffusion of information and simultaneously discovers the hidden topics in the textual information.

Specifically, our model captures the posting of information by the different nodes of the hidden network as events of a Hawkes process. The mutually exciting nature [146] of a Hawkes process, i.e., the fact that an event can trigger future events, is a natural fit for modeling the propagation of information in domains such as the ones mentioned above. To address the limitation that the thematic content of the available textual information is unknown, HTM builds upon the Correlated Topic Model (CTM) [14] and unifies it with a Hawkes process to discover the underlying influence across information postings, and thus the hidden influence network. We derive a joint variational inference algorithm based on the mean-field approximation to discover both the diffusion paths of information and its thematic content.

6.1 Related Work

Besides the existing Network Inference models reviewed in Section 2.6, we review additional prior work related to the techniques proposed in this chapter.

Text-content cascades: While most of the previous work utilizes only the timing information to infer the diffusion networks, a different line of work analyzes the available textual information and uses text-based cascades [31, 37, 47]. The work by Dietz et al. [31] and Foulds et al. [47] assumes that the influence paths are known, and Du et al. [37] assume that the topics characterizing the information are given in advance. Our proposed approach is different in that it does not assume knowledge of the underlying diffusion network or of the information topics, making it applicable in domains like news media, where the underlying influence network is unknown and the contents vary significantly over time.

Diffusion networks and text-content cascades: A recent line of work focuses on the joint modeling of diffusion networks and text-based cascades [101, 146]. Liu et al. extend the basic text-based cascade model of [31] so that the diffusion paths also need to be inferred; however, the proposed approach is agnostic to time. The model most relevant to HTM is the MMHP model proposed by Yang and Zha in [146]. HTM is fundamentally different in the following aspects: (1) MMHP utilizes the textual information to cluster activations into different cascades, while HTM leverages text to improve the prediction of a single cascade, and vice versa, by modeling the evolution of textual information and event times jointly. (2) MMHP uses a simple language model and assumes that documents are drawn independently, without considering the source of the influence. Instead, HTM models the evolution of textual information with the CTM through the cascade of topics, which is essential in text diffusion processes. To our knowledge, only two papers combined Hawkes processes and topic modeling before our HTM [64, 95]: Li et al. [95] use such a model to identify and label search tasks, while Guo et al. [64] focus on studying conversational influence. In contrast, we combine the two for modeling text-based cascades.
Since we proposed HTM [69], several papers have followed the idea of combining temporal and content information in modeling diffusion processes. Hosseini et al. generalize HTM with a nonparametric approach that adapts its temporal and topical complexity to the complexity of the data [72]. Ding et al. propose a semi-supervised model to detect and track trendy topics on Twitter by incorporating diffusion information [32]. Farajtabar et al. use a similar model to analyze text-based cascades to detect fake news stories [45].

6.2 HawkesTopic Model

We consider text-based cascades among a set of nodes connected via a hidden diffusion network G. We observe a sequence of posting activities c = {a_i | i = 1, 2, ..., |c|}, e.g., a series of news article or research paper publications, or a series of user postings on Facebook. Following the notation introduced in Section 2.4.2, each posting activity is a tuple a_i = (t_i, v_i, X_i). In this chapter, the additional features {X_1, ..., X_|c|} are the documents, represented as bags of words over a vocabulary of size W.

Given this input, our goal is to infer the hidden diffusion network G and the topics characterizing the observed textual information. We adopt a model that jointly reasons about the posting times and the contents of documents to (1) accurately infer the hidden diffusion network structure, and (2) track the thematic content of documents as they propagate through the diffusion network. Next, we discuss the two components of our HawkesTopic model in detail. The notation of our model is summarized in Table 6.1.

Table 6.1: Notation used in this chapter.
  Notation                              Definition
  V                                     Set of nodes
  c                                     Set of events
  t_a                                   Posting time of event a
  v_a                                   Node that carries out event a
  X_a                                   Document of event a
  ρ_a = (ρ_{a,0}, {ρ_{a,a′}}_{a′∈c})    Parent indicator of event a
  α_v                                   Topic parameters of node v's interests
  η_a                                   Topic parameters of document X_a
  β_{1:K}                               Collection of all topics
  λ_v(t)                                Intensity process of node v
  μ_v                                   Base intensity of node v
  κ_a(t, v)                             Impulse response of event a
  Θ = {θ_{u,w}}                         Node influence matrix

6.2.1 Modeling the posting time

The first component of our framework models the nodes' posting times via the Multivariate Hawkes Process (MHP) [99] introduced in Section 2.5.2. Following the traditional notation of point processes, we call each posting activity a_i = (t_i, v_i, X_i) an event. As we focus here on modeling the posting times, we ignore the documents attached to the events; document modeling is described in Section 6.2.2.

Figure 6.1 provides an example of the MHP model. In the example, the sources correspond to two users v_1 and v_2 who publish alternately on the same subject: a_{2,1} is a response to a_{1,1} published by v_1, and a_{1,2} is v_2's response to v_1's document a_{2,1}.

Besides being characterized by its intensity process as in Section 2.5.2, a Hawkes process can also be treated as a clustered Poisson process [127], in which each event a triggers a homogeneous Poisson process with intensity κ_a(t, v). The process N_v(t) of node v is the superposition of a homogeneous Poisson process P_v with intensity μ_v, serving as the base process, and the Poisson processes triggered by previous events. Viewing the Hawkes process from this perspective has two advantages. First, it provides a method to generate the events of the different processes in breadth-first order [127]: first generate all events of the base process of each node, referring to them as events at level zero, and then generate the events at level ℓ from the processes triggered by the events at level ℓ − 1, repeating until no events are triggered at some level L.
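A minimal sketch of this breadth-first (branching) simulation, assuming an exponential impulse response κ_a(t, v) = θ_{v_a,v} · ω e^{−ω(t − t_a)}; the kernel choice, the names and the seed handling are ours, and the process only stays finite in expectation when the spectral radius of Θ is below one.

    import numpy as np

    def simulate_hawkes(mu, theta, omega, T, seed=0):
        """Simulate a multivariate Hawkes process on [0, T] via its
        clustered Poisson representation.  Level-0 events come from each
        node's base Poisson process with rate mu[v]; an event at node u
        then spawns children at node v, with theta[u, v] expected
        children and exponential delays of rate omega.  Returns a
        time-sorted list of (time, node, level) triples."""
        rng = np.random.default_rng(seed)
        level = [(t, v, 0)                   # level 0: spontaneous events
                 for v, rate in enumerate(mu)
                 for t in rng.uniform(0, T, rng.poisson(rate * T))]
        events = []
        while level:
            events += level
            nxt = []
            for (t, u, l) in level:
                for v in range(len(mu)):
                    for dt in rng.exponential(1.0 / omega, rng.poisson(theta[u, v])):
                        if t + dt <= T:      # children beyond T are dropped
                            nxt.append((t + dt, v, l + 1))
            level = nxt
        return sorted(events)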
This construction provides an explicit parent relationship, denoted by an indicator vector ρ_a = (ρ_{a,0}, (ρ_{a,a′})_{a′∈c}) for each event a. If event a is generated from the process triggered by a previous event a′, we say that a′ is the parent of a and denote this by ρ_{a,a′} = 1; in short, event a′ triggers event a. Otherwise, if event a is generated from the base process, we say that it has no parent and denote this by ρ_{a,0} = 1; equivalently, we say that node v_a carries out event a spontaneously. The colored arrows in Figure 6.1 depict the parent relationship: in this example, events a_{1,1} and a_{2,2} have no parent, while event a_{1,1} is the parent of event a_{2,1}, which itself is the parent of event a_{1,2}. The parent relationship is essential for modeling the evolution of the content information, since our model should capture the intuition that the document associated with an event is similar to the document of its parent.

Figure 6.1: Graphical representation of the HawkesTopic model.

6.2.2 Modeling the documents

One approach to reasoning about the information contained in the documents is to generalize the MHP to the Marked Multivariate Hawkes Process (MMHP) [99], which treats the words of the documents associated with events as the marks of those events. In the MMHP model, events are extended to triples a = (t_a, v_a, X_a), where the mark value X_a is an additional label characterizing the content. However, this approach suffers from two major drawbacks: (1) using words as marks leads to noisy representations due to polysemy and synonymy; more importantly, (2) in the traditional MMHP model, the mark value depends only on the time of the event and is not affected by what triggers the event [99]. This assumption is acknowledged in [146]: the documents attached to the events are simply drawn independently from the same language model, without considering the source of the influence. In other words, if a user posts something influenced by a post of her friend, the content of the user's post is independent of the content of her friend's post. This assumption is clearly unrealistic in many real-world scenarios, including social and online news media, where posts of users who influence each other should exhibit dependencies.

Topics as marks: To overcome the first disadvantage, we propose using the topics of the event documents as marks in our HawkesTopic model. Topics, as an abstraction of the actual words, provide a less noisy and more succinct representation of the documents' content. Assuming a fixed set of topics β_{1:K} for all documents, we use a topic vector η_a to denote the topics of document X_a associated with event a. We assume that the actual words of the document are generated as in the Correlated Topic Model (CTM) [14]. Let π(·) be the softmax function, π_k(η_a) = exp(η_{a,k}) / Σ_j exp(η_{a,j}), and let β_{1:K} be the discrete distributions over words characterizing each of the K topics. The generative process for a document X_a with N_a words is as follows:

• For n = 1, ..., N_a:
  1. Draw topic assignment z_{a,n} ∼ Discrete(π(η_a)).
  2. Draw word x_{a,n} ∼ Discrete(β_{z_{a,n}}).
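A minimal sketch of this word-level generative step, where β is a K × W matrix of topic-word distributions; the names are ours.

    import numpy as np

    def generate_document(eta, beta, n_words, seed=0):
        """Generate one document: each word first draws a topic from
        softmax(eta), then a word from that topic's distribution.
        Returns the word indices and their topic assignments."""
        rng = np.random.default_rng(seed)
        pi = np.exp(eta - eta.max())
        pi /= pi.sum()                          # pi_k(eta), the softmax
        z = rng.choice(len(eta), size=n_words, p=pi)
        x = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])
        return x, z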
We choose the logistic normal construction for the topic proportions π(η_a), as it provides more flexibility than the Dirichlet-multinomial construction of the LDA model [15]. Since the logistic normal construction does not constrain the document-topic parameters η_a to the probability simplex, it allows us to model topic evolution flexibly using Gaussian dynamics, making it a better representation for modeling the dynamics of topic diffusion.

Diffusion of topics: To overcome the second disadvantage, i.e., to model the dependencies across the marks of events that influence each other, HTM explicitly reasons about the diffusion of topics across such events. We distinguish between two types of events: (i) those occurring spontaneously and (ii) those triggered by previous events. For example, in Figure 6.1 events a_{1,1} and a_{2,2} belong to the first type, while the remaining events belong to the second type. We assume that each node has a prior over its interests in the different topics; for example, one Facebook user may be interested in sports and politics while another is interested in music and movies. The contents of documents corresponding to spontaneous events are determined by the topic prior of the node, while the document of a triggered event should be similar to the document of the triggering event: a user's post, influenced by her friend's previous post, should have content similar to her friend's post. We use the parent relationship generated in the event time model to realize this intuition. Let α_v be the parameter describing the topic prior of node v. The topic parameters of spontaneously posted documents of node v are generated from a Gaussian distribution with mean α_v, i.e., η_a ∼ N(α_v, σ²I). The topic parameters η_a of triggered events are also generated from a Gaussian distribution, but with mean η_{parent[a]}, i.e., η_a ∼ N(η_{parent[a]}, σ²I), where parent[a] is the event triggering event a. For simplicity, our model uses an isotropic Gaussian distribution. In the example of Figure 6.1, the topic vector η_{1,1} is drawn from a Gaussian distribution with mean α_1, and the topic vector η_{2,2} is drawn from a Gaussian distribution with mean α_2. On the other hand, the topics associated with event a_{2,1} are influenced by event a_{1,1}, the event triggering it; accordingly, we draw η_{2,1} from a Gaussian distribution with mean η_{1,1}.

6.2.3 Summary and Discussion

We now summarize the generative process of our model and discuss how it compares to existing models. The generative process is:

1. Generate all the events and the event times via the Multivariate Hawkes Process, as in Section 6.2.1.
2. For each topic k: draw the topic-word distribution β_k from a Dirichlet prior.
3. For each event a of node v:
   (a) If a is a spontaneous event: η_a ∼ N(α_v, σ²I); otherwise η_a ∼ N(η_{parent[a]}, σ²I).
   (b) Generate the document length N_a from a Poisson distribution.
   (c) For each word n: z_{a,n} ∼ Discrete(π(η_a)), x_{a,n} ∼ Discrete(β_{z_{a,n}}).

Our model generalizes several existing models. If we ignore the event documents, our model is equivalent to the traditional MHP model. At the other extreme, if we only consider the documents associated with the spontaneous events of node v, our model reduces to the CTM with hyperparameters (α_v, σ²I). HTM further models the contents of triggered events through the diffusion of topics. We also considered other alternatives for modeling the diffusion of documents.
We also considered other alternatives for modeling the diffusion of documents. For example, the evolution of documents could be modeled as dynamics on the hyperparameters controlling the generation of the topics, as in the Dynamic Topic Model [16]. We chose our current approach because it is more robust in the presence of limited influence information: since each event may only trigger a small number of other events, there is not enough information to recover the influence paths if document diffusion is modeled via the topic hyperparameters. Alternatively, the influence could be modeled at the word level [31], where each word in a document individually influences the words of the triggered documents. Our approach yields a simpler model in which the documents are utilized to recover the influence pathways.

6.3 Inference

Exact inference for the HawkesTopic model is intractable due to the huge number of hidden variables. We therefore derive a joint variational inference algorithm based on the mean-field approximation. We apply the full mean-field approximation to the posterior distribution P(η, z, ρ | c, α, β, θ, μ):

  Q(\eta, z, \rho \mid \{\hat\eta_a\}, \{r_a\}, \{\phi_{a,n}\}) = \prod_{a \in c} \Big[ q(\eta_a \mid \hat\eta_a)\, q(\rho_a \mid r_a) \prod_{n=1}^{N_a} q(z_{a,n} \mid \phi_{a,n}) \Big].

The variational distribution for η_a is assumed to be Gaussian with its mean as the parameter to infer, η_a ∼ N(η̂_a, σ̂²I). The variational distributions for the remaining variables are z_{a,n} ∼ Discrete(φ_{a,n}) and ρ_a ∼ Discrete(r_a). Here, η̂_a is the variational parameter for the topic vector of document a; φ_{a,n} is the variational parameter for the topic assignment of word n in document a; and r_a is the variational parameter for the parent distribution of event a. The choice of the same simple covariance matrix limits the complexity of our model; it is a trade-off between model complexity and expressive power. When the number of training documents is limited, as in our application, a simpler model usually leads to better accuracy; if more training documents can be collected, the model can be extended to use a more complex covariance matrix. To keep the notation uncluttered, we drop the dependence on the variational parameters and simply write Q(η, z, ρ).

Since the Correlated Topic Model is a building block of our HawkesTopic model, the model is not conjugate. We adopt the Laplace variational inference method of [136] to handle the non-conjugate variable q(η). By standard variational inference theory, the inference task becomes minimizing the KL divergence between Q(η, z, ρ) and P(η, z, ρ | c, α, β, θ, μ), which is equivalent to maximizing a lower bound L(Q) on the log marginal likelihood. The joint likelihood of observing all the events c is:

  P(c, \eta, z, \rho \mid \alpha, \beta, \theta, \mu)
    = \underbrace{\prod_{a \in c} \Big\{ P(\eta_a \mid \alpha_{v_a})^{\rho_{a,0}} \cdot \prod_{a':\, t_{a'} < t_a} P(\eta_a \mid \eta_{a'})^{\rho_{a,a'}} \Big\}}_{(1)}
      \cdot \underbrace{\prod_{a \in c} \Big[ \prod_{n=1}^{N_a} P(z_{a,n} \mid \eta_a)\, P(x_{a,n} \mid z_{a,n}, \beta) \Big]}_{(2)}
      \cdot \underbrace{\prod_{v \in V} \frac{e^{-m_v} m_v^{n_v}}{n_v!} \cdot \prod_{a \in c} \Big[ \frac{e^{-G_a} G_a^{\ell_a}}{\ell_a!} \cdot \frac{\ell_a!\, \prod_{a' \in c} \kappa_{a',a}^{\rho_{a',a}}}{G_a^{\ell_a}} \Big]}_{(3)}

Recall that for each event a, the hidden variable ρ_a is the parent indicator introduced in Section 6.2.1: ρ_{a,a'} = 1 if event a' triggers event a, while ρ_{a,0} = 1 indicates that event a is spontaneous. κ_{a,a'} = κ_{a'}(t_a, v_a) is the impulse response of event a' on event a.
Furthermore, n_u = Σ_{a∈c: v_a=u} ρ_{a,0} is the number of spontaneous events of node u, while m_u = μ_u · τ is the expected number of spontaneous events of node u in the observation window [0, τ]. Similarly, ℓ_a = Σ_{a'∈c} ρ_{a',a} is the number of events triggered by a, while G_a = Σ_{u∈V} ∫ κ_a(t, u) dt = Σ_{u∈V} θ_{v_a,u} is the expected number of events triggered by a. The first part (1) of the likelihood corresponds to the diffusion of topics, while the second part (2) provides the likelihood of the content given the topics of the documents. The third part (3) is the complete likelihood of the MHP model with the parent relationship [133].

By Jensen's inequality, applying the full mean-field approximation to the posterior P(η, z, ρ | c, α, β, θ, μ) leads to the following lower bound L(Q) to be optimized. To keep the notation uncluttered, we omit the dependence on the parameters α, β, θ, μ:

  \log P(c) = \log \int P(\eta, z, \rho, c)\, d\eta\, dz\, d\rho
    \ge \mathbb{E}_{\eta,z,\rho \sim Q}[\log P(\eta, z, \rho, c)] - \mathbb{E}_{\eta,z,\rho \sim Q}[\log Q(\eta, z, \rho)] = L(Q).

The explicit form of the variational objective L(Q) is:

  L(Q) = \sum_{a \in c} \Big\{ -\frac{1}{2\hat\sigma^2} r_{a,0}\, \|\hat\eta_a - \alpha_{v_a}\|_2^2 - \frac{1}{2\hat\sigma^2} \sum_{a':\, t_{a'} < t_a} r_{a,a'}\, \|\hat\eta_a - \hat\eta_{a'}\|_2^2 \Big\}
    + \sum_{a \in c} \sum_{n=1}^{N_a} \sum_{k=1}^{K} \phi_{a,n,k}\, \mathbb{E}_{\eta_a \sim q(\eta_a)}[\log \pi_{a,k}]
    + \sum_{a \in c} \sum_{n=1}^{N_a} \sum_{k=1}^{K} \phi_{a,n,k} \log \beta_{k,x_{a,n}}
    + \sum_{v \in V} \mathbb{E}_{\rho \sim q(\rho)}[-m_v + n_v \log m_v + \log(n_v!)]
    + \sum_{a \in c} \mathbb{E}_{\rho \sim q(\rho)}[-G_a + \ell_a \log G_a + \log(\ell_a!)]
    + \sum_{a \in c} \sum_{a' \in c} r_{a,a'} \log \kappa_{a,a'}
    - \sum_{a \in c} \Big( r_{a,0} \log r_{a,0} + \sum_{a' \in c} r_{a,a'} \log r_{a,a'} \Big)
    - \sum_{a \in c} \sum_{n=1}^{N_a} \sum_{k=1}^{K} \phi_{a,n,k} \log \phi_{a,n,k} + \text{const}.

Here, π_{a,k} = π_k(η_a). The first two lines of the variational objective correspond to the likelihood of the document topics. Lines 3 and 4 compute the likelihood of the words in the documents. Lines 5 and 6 model the total number of observed events of each node. The remaining part of the objective corresponds to the entropy of the variational distribution.

6.3.1 Update of variational distribution

We first present the update rule for the variational distribution of the parent relationship q(ρ), which is unique to our model in that it depends on both the time and the content information. The derivations of the updates of the other variational distributions follow.

Update of parent relationship q(ρ): Let f_Δ(Δt) be the pdf of the delay distribution and f_N(x | μ, Σ) the pdf of the Gaussian distribution with mean μ and covariance matrix Σ. Taking the derivative of L(Q) with respect to r_a and setting it to zero gives the following update equations:

  r_{a,0} ∝ μ_{v_a} · f_N(η̂_a | α_{v_a}, σ̂²I),
  r_{a,a'} ∝ θ_{v_{a'},v_a} · f_N(η̂_a | η̂_{a'}, σ̂²I) · f_Δ(t_a − t_{a'}).

Intuitively, we combine three aspects of our joint HawkesTopic model to decide the parent relationship of each event: (i) θ_{v_{a'},v_a} captures the influence between nodes, (ii) f_N(η̂_a | η̂_{a'}, σ̂²I) measures the similarity between the event documents, and (iii) f_Δ(t_a − t_{a'}) models the proximity of the events in time. In contrast, the traditional MHP model uses only the time proximity and the node influences to determine an event's parent.
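The following sketch shows this q(ρ) update, combining the three signals above. Here delay_logpdf is an assumed callable computing log f_Δ, and the array names (times, nodes, eta_hat, alpha, mu, theta, sigma_hat) mirror the notation of the text; they are illustrative, not from a released implementation.

```python
import numpy as np

def gauss_logpdf(x, mean, sigma):
    """Log-density of the isotropic Gaussian N(mean, sigma^2 I) at x."""
    d = x - mean
    return -0.5 * (d @ d) / sigma**2 - x.size * np.log(sigma * np.sqrt(2 * np.pi))

def parent_responsibilities(a, times, nodes, eta_hat, alpha, mu, theta,
                            sigma_hat, delay_logpdf):
    """Posterior q(rho_a): key 0 is r_{a,0}; key ap+1 is r_{a,a'} for event ap."""
    scores = {0: np.log(mu[nodes[a]])
                 + gauss_logpdf(eta_hat[a], alpha[nodes[a]], sigma_hat)}
    for ap in range(len(times)):
        if times[ap] < times[a]:
            scores[ap + 1] = (np.log(theta[nodes[ap], nodes[a]])
                              + gauss_logpdf(eta_hat[a], eta_hat[ap], sigma_hat)
                              + delay_logpdf(times[a] - times[ap]))
    m = max(scores.values())                       # log-sum-exp normalization
    expd = {k: np.exp(v - m) for k, v in scores.items()}
    Z = sum(expd.values())
    return {k: v / Z for k, v in expd.items()}
```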
Update of q(η): We utilize the Laplace method for variational inference [136] to derive the update rule for the non-conjugate variables η. Since no closed form is available for the terms of the objective involving expectations over q(η_a), the basic idea is to instead optimize an alternative objective obtained via a Taylor expansion of the lower bound. The approximate objective f_a(η_a) is:

  f_a(\eta_a) = -\frac{1}{2\hat\sigma^2}\, r_{a,0}\, \|\eta_a - \alpha_{v_a}\|_2^2 - \frac{1}{2\hat\sigma^2} \sum_{a':\, t_{a'} < t_a} r_{a,a'}\, \|\eta_a - \hat\eta_{a'}\|_2^2 + \sum_{n=1}^{N_a} \sum_{k=1}^{K} \phi_{a,n,k} \log \pi_{a,k}.

Intuitively, the form of f_a(η_a) shows that the topics of each document are determined by two factors: (i) the topics should explain the actual words of the document, and (ii) the topics should be similar either to the preference of the node, if the document belongs to a spontaneous event, or to the document of the parent, if the event is triggered by another event. As the solution η̂_a = argmax_{η_a} f_a(η_a) depends on the topic parameters of the other documents, we propose a coordinate ascent method that iterates over maximizing each η̂_a. The optimization of each η̂_a is solved by the Newton conjugate gradient method, with first- and second-order derivatives:

  \nabla f_a(\eta_a) = -\frac{1}{\hat\sigma^2} \Big( \eta_a - r_{a,0}\, \alpha_{v_a} - \sum_{a':\, t_{a'} < t_a} r_{a,a'}\, \hat\eta_{a'} \Big) + \sum_{n=1}^{N_a} \phi_{a,n} - \pi(\eta_a) \cdot \Big[ \sum_{n=1}^{N_a} \sum_{k=1}^{K} \phi_{a,n,k} \Big],

  \frac{\partial^2 f_a(\eta_a)}{\partial \eta_{a,i}\, \partial \eta_{a,j}} = \big( -\pi_{a,i}\, \mathbf{1}[i=j] + \pi_{a,i}\, \pi_{a,j} \big) \cdot \Big[ \sum_{n=1}^{N_a} \sum_{k=1}^{K} \phi_{a,n,k} \Big] - \mathbf{1}[i=j] / \hat\sigma^2.

Update of q(z): As in the CTM, taking the derivative of L(Q) with respect to φ_{a,n,k} and setting it to zero leads to the update:

  \phi_{a,n,k} \propto \frac{\exp(\hat\eta_{a,k})}{\sum_j \exp(\hat\eta_{a,j})}\, \beta_{k, x_{a,n}}.

6.3.2 Update of model parameters

Update for β: As in the CTM, taking the derivative of L(Q) with respect to β_{k,w} and setting it to zero gives:

  \beta_{k,w} \propto \sum_{a \in c} \sum_{n:\, x_{a,n} = w} \phi_{a,n,k}.

Update for α: Taking the derivative of L(Q) with respect to α_u and setting it to zero leads to:

  \alpha_u = \frac{\sum_{a \in c:\, v_a = u} r_{a,0}\, \hat\eta_a}{\sum_{a \in c:\, v_a = u} r_{a,0}}.

Treating r_{a,0} as a soft count, we thus update the topic prior α_u as the weighted average of the topic vectors of the documents belonging to the spontaneous events of node u.

Update for θ, μ: Using the posterior distribution of the parent relationship as soft counts, the updates of the influence matrix θ and the base intensity μ mirror those of the traditional MHP model, obtained by taking the derivatives of L(Q) with respect to θ_{u,w} and μ_u and setting them to zero:

  \theta_{u,w} = \frac{\sum_{a \in c} \sum_{a':\, t_{a'} < t_a} r_{a,a'}\, \mathbf{1}[v_{a'} = u]\, \mathbf{1}[v_a = w]}{\sum_{a \in c} \mathbf{1}[v_a = u]}, \qquad
  \mu_u = \frac{\sum_{a \in c} r_{a,0}\, \mathbf{1}[v_a = u]}{\tau}.
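The closed-form parameter updates translate directly into code. Below is a minimal sketch, assuming the parent responsibilities r (a dict per event, as returned by parent_responsibilities above) and the word-topic responsibilities phi computed by the variational updates; all names are illustrative.

```python
import numpy as np

def update_beta(phi, words, K, W):
    """beta_{k,w}: normalized soft count of word w under topic k."""
    beta = np.zeros((K, W))
    for a, ws in enumerate(words):
        for n, w in enumerate(ws):
            beta[:, w] += phi[a][n]
    return beta / beta.sum(axis=1, keepdims=True)

def update_alpha(r, eta_hat, nodes, V, K):
    """alpha_u: r_{a,0}-weighted average of eta_hat over node u's events."""
    num, den = np.zeros((V, K)), np.zeros(V)
    for a, u in enumerate(nodes):
        num[u] += r[a][0] * eta_hat[a]
        den[u] += r[a][0]
    return num / np.maximum(den, 1e-12)[:, None]

def update_theta_mu(r, nodes, V, tau):
    """Soft-count versions of the MHP updates for theta and mu."""
    theta, counts, mu = np.zeros((V, V)), np.zeros(V), np.zeros(V)
    for a, w in enumerate(nodes):
        counts[w] += 1
        mu[w] += r[a][0]
        for key, r_val in r[a].items():
            if key != 0:                 # key - 1 is the index of the parent
                theta[nodes[key - 1], w] += r_val
    return theta / np.maximum(counts, 1)[:, None], mu / tau
```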
6.4 Experiments

We present an empirical evaluation of the HawkesTopic model (HTM). The main questions we seek to address are: (1) how effective is HTM at inferring diffusion networks, and (2) how well can HTM detect the topics associated with the event documents and their diffusion? We study these questions empirically using both real and synthetic datasets. First, we apply HTM to synthetically generated data; since the true diffusion network and topics are controlled, this set of experiments serves as a proof of concept for the variational inference algorithm introduced in Section 6.3. Then, we provide an extensive evaluation of HTM on diverse real-world datasets and compare its performance against several baselines, showing that HTM achieves superior performance on both tasks.

6.4.1 Synthetic Data

We first evaluate HTM on synthetic datasets, focusing mainly on our variational inference algorithm.

Data generation: We generate a collection of datasets following the generative model assumed by HTM, with a circular diffusion network G consisting of five nodes and influence matrix

  \theta = \begin{pmatrix} 0.3 & 0.15 & 0 & 0 & 0 \\ 0 & 0.3 & 0.15 & 0 & 0 \\ 0 & 0 & 0.3 & 0.15 & 0 \\ 0 & 0 & 0 & 0.3 & 0.15 \\ 0.15 & 0 & 0 & 0 & 0.3 \end{pmatrix}.

We set the number of topics K to five. The hyperparameters characterizing the node topic priors are set to α_1 = 2·e_1, ..., α_5 = 2·e_5, where e_i denotes the i-th standard basis vector. The base intensity μ of the processes of all nodes is set to 0.02. Regarding the parameters of the generative model, we set the dictionary size W to 500, λ = 200, ᾱ = 0.02, and σ = 2.0. We vary the observation window length from 1000 to 4000; for each length, we generate five different datasets and report the mean and standard deviation of the evaluation measures below.

Results: First, we compare the true values of the topic vectors η_a and node topic priors α_v with their inferred equivalents η̂_a and α̂_v, respectively. The total absolute error (TAE) for the two parameter vectors is computed as:

  \mathrm{TAE}(\pi(\eta)) = \sum_{a \in c} |\pi(\eta_a) - \pi(\hat\eta_a)|_1, \qquad
  \mathrm{TAE}(\pi(\alpha)) = \sum_{v \in V} |\pi(\alpha_v) - \pi(\hat\alpha_v)|_1.

Figure 6.2: Results on synthetic datasets. (a) Performance of text modeling. (b) Performance of network inference.

The corresponding errors are shown in Figure 6.2(a), together with the performance of the Correlated Topic Model (CTM) [14]. Our HTM exhibits improved performance, yielding an error reduction of up to 85% for α_v and up to 25% for η_a. The TAE for η increases for larger window lengths as the number of observed events grows, while the error for α is rather stable, as it is independent of the number of events.

Next, we evaluate HTM at inferring the structure of the underlying diffusion network. We measure the accuracy of the inferred network using two metrics: (i) the percentage of correctly identified parent relations of the observed events, and (ii) the sum of absolute differences between the true node influence matrix θ and the estimated matrix θ̂. The results are shown in Figure 6.2(b). We compare HTM against a Hawkes process model that does not consider the available textual information. HTM yields an increased accuracy of around 19% at identifying the event parent relationships and a decreased error of up to 28% at inferring the overall influence matrix. The decreasing error trend of the latter is due to the increased number of observed events for larger window lengths.
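For concreteness, the TAE metrics above can be computed as in the following small sketch, where softmax plays the role of π and the argument names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def tae(true_vecs, inferred_vecs):
    """Sum over items of the L1 distance between softmax-transformed vectors."""
    return sum(np.abs(softmax(t) - softmax(s)).sum()
               for t, s in zip(true_vecs, inferred_vecs))

# Usage: tae(eta_true, eta_hat) for the topic vectors,
#        tae(alpha_true, alpha_hat) for the node topic priors.
```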
6.4.2 Real Data

We further evaluate HTM on two diverse real-world datasets. The first dataset corresponds to articles from news media over a time window of four months, extracted from EventRegistry², an online aggregator of news articles; the second corresponds to papers published on arXiv over a period of 12 years. We apply HTM to these datasets seeking to: (i) identify the hidden topics associated with the documents in each dataset and (ii) infer the hidden diffusion network of opinions and ideas, respectively.

² eventregistry.org

EventRegistry Dataset: We collected news articles from EventRegistry by crawling all articles with the keyword "Ebola" from July 1st, 2014 to November 1st, 2014. The dataset contains 9180 articles from 330 distinct news media sites. News media sites are treated as nodes in the diffusion network, and published articles as events in our model. We preprocessed the articles to remove stop words and words that appear fewer than ten times. Since the true diffusion network is not available, we use the available copying information across news media, i.e., identical news articles published on multiple sites, to approximate the true diffusion network. More precisely, if an article appears on multiple media sites, we consider the site that published the article first as the true source and add an edge to all sites that publish the article at a later point in time. We use the three largest connected components of the induced graph as three separate datasets in our experiments. To extract the delay distribution for the Hawkes processes in our model, we fit an empirical delay distribution based on the delay times observed for duplicate articles; a code sketch of this construction follows the algorithm list below.

arXiv Dataset: We also use the arXiv high-energy physics theory citation network data from Stanford's SNAP³. The dataset includes all papers published in the arXiv high-energy physics theory section from 1992 to 2003. We treat each author as a node and each publication as an event. For each paper, we use its abstract instead of the full paper as the document associated with the event. We use the articles of the top 50/100/200 authors in terms of number of publications as our datasets. The citation network is used as ground truth to evaluate the quality of the inferred diffusion network. Similarly to EventRegistry, we fit an empirical delay distribution for the Hawkes processes based on the observed citation delays.

³ snap.stanford.edu/data/cit-HepTh.html

Algorithms: We compare the topic modeling and network inference capabilities of the following algorithms:

• HTM: Our HawkesTopic model. We set the number of topics K to 50, except that we use K = 100 for the arXiv dataset with 200 authors, as it contains more documents. We normalize the observation interval to a time window of length 5000 and fix the base intensity of all Hawkes processes to 0.02. The variance σ̂² for topic diffusion is set to 0.5.

• LDA: Latent Dirichlet allocation with collapsed Gibbs sampling. The hyperparameter for the document-topic distributions is set to 0.1, and the hyperparameter for the topic-word distributions is set to 0.02. The number of topics is set as in HTM. This is a baseline against the topic modeling component of HTM.

• CTM: Correlated topic model with variational inference. CTM serves as another baseline for the topic modeling of HTM.

• Hawkes: Network inference based on Hawkes processes considering only time, using the same empirical delay distribution as in HTM and setting the base intensity to be the same as in HTM. This is a baseline against the diffusion network inference component of HTM.

• Hawkes-LDA: A two-step approach that first infers the topics of each document with LDA and then uses those as marks for each event in the Hawkes processes. We compare this algorithm with HTM in terms of network inference accuracy.

• Hawkes-CTM: Similar to Hawkes-LDA, with CTM used instead of LDA.

Hawkes-LDA and Hawkes-CTM are the equivalent point process versions of the TopicCascade algorithm [37], a state-of-the-art baseline for inferring the diffusion network with textual information (Section 6.1).
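As promised above, here is a sketch of the approximate ground-truth construction for the EventRegistry data: the earliest publisher of a duplicated article becomes the source and receives an edge to every later publisher. Here records is an assumed list of (article_id, site, time) triples; the names are illustrative.

```python
from collections import defaultdict

def approximate_ground_truth(records):
    """Build (source, destination) edges from duplicate-article timestamps."""
    by_article = defaultdict(list)
    for article_id, site, time in records:
        by_article[article_id].append((time, site))
    edges = set()
    for posts in by_article.values():
        posts.sort()                     # earliest publication first
        src = posts[0][1]
        for _, dst in posts[1:]:
            if dst != src:
                edges.add((src, dst))
    # The (later - earliest) delays of the same articles can also be pooled
    # to fit the empirical delay distribution used by the Hawkes processes.
    return edges
```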
Evaluation Metrics: We compare HTM to the baseline methods with respect to both the quality of the discovered topics and the accuracy of the inferred diffusion network. To measure the quality of the discovered topics, we use the document completion likelihood [135], computed via sampling, for all algorithms. We use the area under the ROC curve (AUC; see Section 3.3 for the detailed definition) to evaluate the accuracy of network inference for all algorithms.

Results: The results for the three EventRegistry datasets with respect to topic modeling and network inference are shown in Table 6.2 and Table 6.3, respectively. Our HTM outperforms the baseline algorithms in both text modeling and network inference on all three components. While the improvement is marginal, our algorithm consistently performs better than the baselines. We conjecture that the low AUC, near 0.7 for all algorithms, is due to the noisy ground-truth network.

Table 6.2: EventRegistry: text modeling result (log document completion likelihood)
           LDA          CTM          HTM
  Cmp. 1   -42945.60    -42458.89    -42325.16
  Cmp. 2   -22558.75    -22181.76    -22164.05
  Cmp. 3   -17574.70    -17574.30    -17571.56

Table 6.3: EventRegistry: network inference result (AUC)
           Hawkes   Hawkes-LDA   Hawkes-CTM   HTM
  Cmp. 1   0.622    0.669        0.673        0.697
  Cmp. 2   0.670    0.704        0.716        0.730
  Cmp. 3   0.666    0.665        0.669        0.700

Moreover, as promptness is essential for news sites, time information plays a more important role in this diffusion scenario; this explains why HTM performs only marginally (about 10%) better than the pure Hawkes model.

Figure 6.3: Inferred diffusion network from the EventRegistry datasets. The colors of the nodes represent the out-degree of the source (the darker the color, the higher the out-degree); the edge width represents the strength of influence.

In Figure 6.3, we visualize the diffusion network of the third component. The graph suggests that some news sites, such as local newspapers or local editions of Reuters, are the early birds in reporting stories. For example, the three red nodes, "sunherald.com", "miamiherald.com" and "billingsgazette.com", correspond to local newspapers, while "in.reuters.com" corresponds to the Indian edition of Reuters. On the other hand, bigger news agencies, such as "reuters.com", are strongly influenced by other sites. This is to be expected, as it is common for news agencies to gather reports from local papers and redistribute them to other news portals⁴.

⁴ en.wikipedia.org/wiki/News_agency#Commercial_services

The results for the arXiv datasets are shown in Table 6.4 and Table 6.5. HTM consistently performs better than CTM and LDA in text modeling, indicating that HTM discovers better topics by utilizing the cascades of information, although the improvement is not very significant. In terms of network inference, HTM achieves more than 40% improvement in accuracy over the pure Hawkes process by incorporating the textual information associated with each event. This result validates our claim that in many domains, timing information alone is not sufficient to infer the diffusion network. Moreover, our method outperforms the strong baselines Hawkes-LDA and Hawkes-CTM, suggesting that joint modeling of topics and information cascades is useful: the diffusion pathways and the content information greatly reinforce each other. The performance drops as the number of authors increases: the larger datasets contain only a limited number of publications per author, and we believe this increasing sparsity makes the inference problem harder.

Table 6.4: arXiv: text modeling result (log document completion likelihood)
            LDA          CTM          HTM
  Top 50    -11074.36    -10769.11    -10708.96
  Top 100   -15711.53    -15477.24    -15252.47
  Top 200   -27757.71    -27629.87    -27443.41

Table 6.5: arXiv: network inference result (AUC)
            Hawkes   Hawkes-LDA   Hawkes-CTM   HTM
  Top 50    0.594    0.656        0.645        0.807
  Top 100   0.588    0.589        0.614        0.687
  Top 200   0.618    0.630        0.629        0.659

Additionally, we carry out experiments with different observation time lengths on the arXiv dataset with the top 50 authors.
Namely, we train our models using the papers published in the first three, six, and nine years, and using the complete dataset. The AUC of network inference is shown in Table 6.6.

Table 6.6: arXiv with different observation time lengths.
         1992-1995   1992-1998   1992-2001   1992-2004
  AUC    0.614       0.747       0.789       0.807

Figure 6.4: Inferred diffusion network from the arXiv dataset. The colors of the nodes represent the out-degree of the source (the darker the color, the higher the out-degree); the edge width represents the strength of influence.

Figure 6.4 shows the hidden network among the top 50 authors. From the figure, we can see that the diffusion network has a core-periphery structure. Influential authors such as Edward Witten, Michael R. Douglas, and Joseph Polchinski almost form a clique in the middle left of Figure 6.4. The common characteristic of these authors is that they do not publish the most; however, on average, each of their papers receives the largest number of citations. For example, Edward Witten has published 397 papers but has received more than 40000 citations. As another example, Joseph Polchinski has received nearly 9000 citations with only 190 publications. They serve as the core of the influence network, suggesting that they may be the innovators in their corresponding fields. Influenced by the core authors, researchers such as Christopher Pope and Arkady Tseytlin, with intermediate numbers of both incoming and outgoing edges, further pass the influence on to other authors. Overall, their works also receive many citations; however, they publish more papers than the core authors, with fewer average citations per paper. For example, Christopher Pope has received 6898 citations across 563 papers. This suggests that these authors can be considered the intercessors of the diffusion network. Most other authors, lying in the outer part of Figure 6.4, have few outgoing edges, suggesting that they act mostly as receivers of new scientific ideas.

Besides inferring the influence relationships between authors, our model is also able to discover the research topics of the authors accurately. We list the inferred top-three topics for two cherry-picked authors, together with the top-three words of each topic, in Table 6.7. For HTM, we simply select the topics with the largest values in α̂_v; for the LDA and CTM models, we average over the topics of the papers published by the author.

Table 6.7: Inferred topics for authors Andrei Linde and Arkady Tseytlin under LDA, CTM and HTM.

  Andrei Linde (linde@physics.stanford.edu)
    LDA: black, hole, holes | supersymmetry, supersymmetric, solutions | universe, cosmological, cosmology
    CTM: black, holes, entropy | supersymmetric, supersymmetry, superspace | metrics, holonomy, spaces
    HTM: black, holes, hole | universe, inflation, may | supersymmetry, supersymmetric, breaking

  Arkady Tseytlin (a.tseytlin@ic.ac.uk)
    LDA: magnetic, field, conformal | type, iib, theory | action, superstring, actions
    CTM: solutions, solution, x | action, effective, background | type, iib, iia
    HTM: string, theory, type | action, actions, duality | bound, configurations, states

We compare the learned topics to the research interests listed by the authors on their websites. One of Andrei Linde's major research areas is the study of inflation⁵. Only HTM discovers it among the author's top-three topics, although it also reports the irrelevant word "may" among the top words of the discovered topic; LDA and CTM fail to discover this topic at all. Arkady Tseytlin reports string theory as his main research area on his webpage⁶. HTM successfully lists the string theory topic first, while CTM and LDA both leave this topic out of the author's top three.
We hypothesize that our model detects the topics more accurately because it can distinguish between spontaneous and triggered events: it infers an author's preferences based only on spontaneous publications, while the baseline models infer them from all publications.

⁵ physics.stanford.edu/people/faculty/andrei-linde
⁶ www.imperial.ac.uk/people/a.tseytlin

Chapter 7

Stability Analysis and Robust Algorithms for Influence Maximization

The content of this chapter is based on the following unpublished manuscript and paper:

• Xinran He and David Kempe. "Stability of Influence Maximization", arXiv preprint arXiv:1501.04579, 2015.
• Xinran He and David Kempe. "Robust Influence Maximization", In Proc. 22nd Intl. Conf. on Knowledge Discovery and Data Mining, 2016.

7.1 Introduction

One of the most widely used applications of the inferred diffusion network is the Influence Maximization problem introduced in Section 2.8. The algorithmic question focuses on selecting a set S of seed nodes of pre-specified size k to maximize the total influence function σ(S). In the previous chapters, we focused on inferring σ(S) accurately, either by inferring the model parameters (Chapters 5 and 6) or by directly learning the influence function (Chapter 4).

Even with inference algorithms designed specifically to handle noise, incompleteness, and data scarcity, there is inevitably residual noise and uncertainty in the total influence function to be optimized. For example, if the retention rate of the observed incomplete cascades is low, the influence function learned by our method in Chapter 4 may suffer from high levels of noise. Or, if we have learned multiple topic-specific diffusion networks using our MCM in Chapter 5, it is not clear which topic-specific diffusion network should be used to obtain the total influence function for the propagation of a story spanning multiple topics. A similar scenario occurs when one has learned multiple snapshots of a dynamic diffusion network and needs to decide which snapshot to use to carry out a future marketing campaign.

As a result, the learned total influence function must be treated with caution due to the noise and uncertainty. The correctness guarantees of Influence Maximization algorithms are predicated on the assumption that the total influence function is correct. When this assumption fails, as is inevitable, the utility of the algorithms' output is compromised. Recently, Yadav et al. [145] carried out a field study testing the performance of Influence Maximization algorithms on real-world homeless-youth social networks; they showed that the performance of Influence Maximization algorithms differs significantly between mathematical diffusion models and complex real-world diffusion phenomena.

In this chapter, we study the Influence Maximization problem when the above assumption fails, i.e., when there are noise and uncertainty in the influence functions. More specifically, we are guided by the following three overarching questions:

1. How do noise and uncertainty impact the performance of Influence Maximization algorithms?
2. How can we diagnose the instability algorithmically?
3. How can we design Influence Maximization algorithms that are robust to noise and uncertainty?

We begin the chapter by reviewing existing work on Influence Maximization under uncertainty, and then formally introduce our model for uncertainty in the diffusion network parameters.
We then answer the first two questions in Section 7.4 and the third in Section 7.5.

7.2 Related Work

Several recent papers take first steps toward Influence Maximization under uncertainty. Goyal, Bonchi and Lakshmanan [59] and Adiga et al. [2] study random (rather than adversarial) noise models, in which either the edge activation probabilities θ_{u,v} are perturbed with random noise [59], or the presence/absence of edges is flipped with a known probability [2]. Neither of these models truly extends the underlying diffusion models, as the uncertainty can simply be absorbed into the probabilistic activation process; a detailed reduction is discussed in Section 7.3.2.

Another approach to dealing with uncertainty is to carry out multiple influence campaigns and to use the observations to obtain better estimates of the model parameters. Chen et al. [27] model the problem as a combinatorial multi-armed bandit problem and use the UCB1 algorithm with regret bounds. Lei et al. [88] instead incorporate beta distribution priors over the activation probabilities into the DIC model; they propose several strategies to update the posterior distributions and give heuristics for seed selection in each trial so as to balance exploration and exploitation. Our approach is complementary: the study of stability tries to answer the question of whether explorations are necessary to achieve good performance. Even in an exploration-based setting, there will always be residual uncertainty to be handled via robust algorithms, in particular when exploration budgets are limited.

Most closely related to the present work, Chen et al. [25] and Lowalekar et al. [104] have studied the Robust Influence Maximization problem under the Perturbation Interval model, which we introduce in the next section. The main result of Chen et al. [25] is an analysis of the heuristic of choosing the best solution among three candidates: make each edge's parameter as small as possible, as large as possible, or equal to the middle of its interval. They prove solution-dependent approximation guarantees for this heuristic. The objective of Lowalekar et al. [104] is to minimize the maximum regret, i.e., the difference between the inferred influence function and the true one, instead of maximizing the minimum ratio between the inferred influence function and the true one. They propose a heuristic based on constraint-generation ideas to solve the Robust Influence Maximization problem. The basic idea of the algorithm is to iterate between two steps: the first step finds the influence function with maximum regret and adds it to the set Σ of possible influence functions; the second step finds a set of seeds that maximizes influence robustly among all influence functions in Σ. Both steps are solved via mixed-integer programming, and the algorithm terminates when the maximum regret of the first step is smaller than a predefined threshold. The heuristic does not come with approximation guarantees; instead, [104] propose a solution-dependent measure of robustness of a given seed set. As part of their work, [104] prove a result similar to our Lemma 7.4, showing that the worst-case instances all have the largest or smallest possible values for all parameters.
7.3 Modeling Uncertainty in Influence Maximization

The concerns discussed in Section 7.1 combine to create significant uncertainty about the function σ: different models give rise to very different functional forms of σ, and missing observations or approximations in inference lead to uncertainty about the models' parameters.

To model this uncertainty, we assume that the algorithm is presented with a set Σ of influence functions, and is assured that one of these functions actually describes the influence process, but is not told which one. The set Σ could be finite or infinite. A finite Σ could result from a finite set of different information diffusion models under consideration, from a finite number of different contexts under which the individuals were observed (e.g., multiple topic-specific networks inferred by our MCM in Chapter 5 from cascades on different topics), or from a finite number of different inference algorithms or algorithm settings used to infer the model parameters from observations.

7.3.1 Perturbation Interval model

An infinite (even continuous) Σ arises if each model parameter is only known to lie within some given interval. We assume that for each edge (u,v), we are given an interval I_{u,v} = [ℓ_{u,v}, r_{u,v}] ⊆ [0,1] with θ_{u,v} ∈ I_{u,v}. For the DLT model, to ensure that the resulting influence functions are always submodular, we additionally require that Σ_{u∈N(v)} θ_{u,v} ≤ 1 for all nodes v. We write Θ = ∏_{(u,v)∈E} I_{u,v} (the Cartesian product of the intervals) for the set of all allowable parameter settings. It is guaranteed that the ground-truth parameter values satisfy θ₀ ∈ Θ; subject to this requirement, the ground-truth parameters can be chosen arbitrarily. We call this model the Perturbation Interval Model, or PIM for short.

For a fixed diffusion model such as DIC or DLT, the parameter values θ determine an instance of the Influence Maximization problem. We will usually be explicit about the dependence of the objective function on the parameter setting: we write σ_θ for the objective function obtained with parameter values θ, and omit the parameters only when they are clear from context. Under the PIM, we therefore have Σ = {σ_θ | θ ∈ Θ}, a set containing uncountably many influence functions. For a given setting of parameters θ, we denote by S*_{σ_θ} ∈ argmax_S σ_θ(S) a solution maximizing the expected influence.

7.3.2 Stochastic vs. Adversarial Models

The classic Influence Maximization problem assumes that Σ contains only the true influence function to be optimized. Under uncertainty, there is more than one candidate in the set Σ. In this chapter, we assume an adversarial model: the ground-truth function to be optimized is chosen by an adversary from the set Σ. The process of Influence Maximization under uncertainty can then be viewed as a two-player game: the first player observes the candidate set Σ and moves first, picking a seed set S so as to maximize the influence; the second player moves second and picks an influence function from the candidate set so as to minimize the algorithm's performance, i.e., the ratio between the influence of the selected seeds and the influence of the optimal seeds under the chosen function.

Given its prominent role in our model, the decision to treat the choice of influence function as adversarial rather than stochastic deserves some discussion. First, adversarial guarantees are stronger than stochastic guarantees, and will lead to more robust solutions in practice.
Perhaps more importantly, inferring a Bayesian prior over the influence functions in Σ would run into exactly the type of problem we are trying to address in the first place: data are sparse and noisy, and if we infer an incorrect prior, it may lead to very suboptimal results. Doing so would in turn require establishing robustness over the values of the hyperparameters of the Bayesian prior over functions.

Specifically for the Perturbation Interval model, one may be tempted to treat the parameters as drawn according to some distribution over their possible ranges. This approach was essentially taken in [2, 59]. Goyal et al. [59] assume that for each edge (u,v), the value of θ_{u,v} is perturbed with uniformly random noise from a known interval. Adiga et al. [2] assume that each edge (u,v) that was observed to be present is actually absent with some probability ε, while each edge that was not observed is actually present with probability ε; in other words, each edge's presence is independently flipped with probability ε.

The standard Independent Cascade model subsumes both models straightforwardly. Suppose that a decision is to be made about whether u activates v. In the model of Goyal et al., we can first draw the actual (perturbed) value θ'_{u,v} from its known distribution; subsequently, u activates v with probability θ'_{u,v}. In total, u activates v with probability E[θ'_{u,v}]. Thus, we obtain an instance of the IC model in which all edge probabilities θ_{u,v} are replaced by E[θ'_{u,v}]. In the special case when the noise has mean 0, this expectation is exactly equal to θ_{u,v}, which explains why Goyal et al. observed the noise to not affect the outcome at all.

In the model of Adiga et al., we first determine whether the edge is actually present: with probability 1−ε when it was observed present, and with probability ε otherwise. Subsequently, the activation succeeds with probability p ([2] assumed uniform activation probabilities). Thus, the model is an instance of the IC model in which the activation probabilities on all observed edges are p(1−ε), while those on unobserved edges are pε. This reduction explains the theoretical results obtained by Adiga et al.

More fundamentally, practically all "natural" random processes that independently affect edges of the graph can be absorbed into the activation probabilities themselves; as a result, such random noise does not play the role of actual noise at all. To model the type of issues one would expect to arise in real-world settings, the noise must at the very least be correlated between edges. For instance, certain subpopulations may be inherently harder to observe or may have sparser data to learn from; in the extreme case, certain subpopulations may not be observable at all due to their privacy settings, as in the incomplete observations with adversarially missing nodes considered in Chapter 4. However, correlated random noise would result in a more complex description of the noise model, making it harder to actually learn and verify the noise model, in particular given that, as discussed above, the noise model itself must be learned from noisy data.

7.4 Stability of Influence Maximization

In this section, we address the first two questions: assessing the impact of noise and diagnosing the stability of Influence Maximization instances algorithmically. We study the problem under the DIC and DLT models, where the perturbations of the edge parameters follow the Perturbation Interval model.

7.4.1 Can Instability Occur?
Suppose that we have inferred all parameters θ_{u,v}, but are concerned that they may be slightly off: in reality, the influence probabilities are θ'_{u,v} ≈ θ_{u,v}. Are there instances in which a seed set S that is very influential with respect to the θ_{u,v} is much less influential with respect to the θ'_{u,v}? It is natural to suspect that this cannot occur: when the objective function σ varies sufficiently smoothly with the input parameters (e.g., for linear objectives), small changes in the parameters only lead to small changes in the objective value; therefore, optimizing with respect to a perturbed input still yields a near-optimal solution. However, the objective σ of Influence Maximization does not depend on the parameters in a smooth way.

To illustrate the issues at play, consider the following instance of the DIC model. The social network consists of two disjoint bidirected cliques K_n, with θ_{u,v} = θ̂ for all u,v in the same clique; in other words, for each directed edge, the same activation probability θ̂ is observed. The algorithm gets to select exactly k = 1 node. Because all nodes look the same, any algorithm essentially chooses an arbitrary node, which may as well be from Clique 1. Let θ̂ = 1/n be the sharp threshold for the emergence of a giant component in the Erdős-Rényi random graph G(n,p). It is well known [12, 40] that the largest connected component of G(n,p) has size O(log n) for any p ≤ θ̂ − Ω(1/n), and size Ω(n) for any p ≥ θ̂ + Ω(1/n). Thus, if, unbeknownst to the algorithm, all true activation probabilities in Clique 1 are p ≤ θ̂ − Ω(1/n), while all true activation probabilities in Clique 2 are p' ≥ θ̂ + Ω(1/n), the algorithm only activates O(log n) nodes in expectation, while it could have reached Ω(n) nodes by choosing Clique 2. Hence, small adversarial perturbations of the input parameters can lead to highly suboptimal solutions from any algorithm.

The example reveals a close connection between the stability of a DIC instance and the question of whether a uniform activation probability p lies close to the edge percolation threshold of the underlying graph. Characterizing the percolation thresholds of families of graphs has been a notoriously hard problem; successful characterizations have only been obtained for very few specific classes (such as d-dimensional grids [80] and d-regular expander graphs [4]). Therefore, it is unlikely that a clean characterization of stable and unstable instances can be obtained. The connection to percolation also reveals that the instability is not an artifact of high node degrees: by the result of Alon et al. [4], the same behavior is obtained if both components are d-regular expander graphs, since such graphs also have a sharp percolation threshold.

7.4.2 Influence Difference Maximization

The example of the two cliques shows that there exist unstable instances, in which an optimal solution for the observed parameters is highly suboptimal when the observed parameters are slightly perturbed compared to the true parameters. Of course, not every instance of Influence Maximization is unstable: for instance, when the probability θ̂ in the two-clique instance is bounded away from the critical threshold of G(n,p), the objective function varies much more smoothly with θ̂. This motivates the following algorithmic question, which is the main focus of this section: Given an instance of Influence Maximization, can we diagnose efficiently whether it is stable or unstable?
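As a concrete illustration of the instability underlying this question, the following sketch simulates the two-clique example under the DIC model: for a single seed in a clique, the influence equals the size of the seed's connected component under independent edge percolation. The constants are illustrative (n = 200, true probabilities 50% below and above the threshold 1/n).

```python
import random
import networkx as nx

def dic_reach(graph, seed, p, trials=200):
    """Monte Carlo estimate of the expected DIC reach of a single seed."""
    total = 0
    for _ in range(trials):
        live = [e for e in graph.edges() if random.random() < p]
        sub = nx.Graph(live)
        sub.add_nodes_from(graph)   # keep isolated nodes, including the seed
        total += len(nx.node_connected_component(sub, seed))
    return total / trials

n = 200
clique = nx.complete_graph(n)
print(dic_reach(clique, 0, 0.5 / n))   # subcritical: reach stays tiny
print(dic_reach(clique, 0, 1.5 / n))   # supercritical: reach is a constant fraction of n
```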
Under the Perturbation Interval model, an instance (θ_{u,v}, I_{u,v})_{u,v} of Influence Maximization is the combination of the inferred parameters together with the intervals of perturbations. We say that the instance is stable if |σ_θ(S) − σ_{θ'}(S)| is small for all objective functions σ_{θ'} induced by legal probability settings¹ and for all seed sets S of size k. Here, "small" is defined relative to the objective function value σ_θ(S*_{σ_θ}) of the optimum set.

¹ For the DIC model, this only requires that θ_{u,v} ∈ [0,1]. For the DLT model, there is the additional constraint that Σ_{u∈N(v)} θ_{u,v} ≤ 1 for all v ∈ V.

When |σ_θ(S) − σ_{θ'}(S)| is actually small compared to σ_θ(S*_{σ_θ}) for all sets S, a user can have confidence that the optimization result will provide decent performance guarantees even if the input was perturbed. The converse is of course not necessarily true: even in unstable instances, a solution that is optimal for the observed input may still be very good for the true input parameters. Trying to determine whether there exist a function σ_{θ'} and a set S for which |σ_θ(S) − σ_{θ'}(S)| is large motivates the following optimization problem: maximize |σ_θ(S) − σ_{θ'}(S)| over all feasible functions σ_{θ'} and all sets S. That is, we are interested in the quantity

  max_S max_{θ'∈Θ} |σ_θ(S) − σ_{θ'}(S)|,   (7.1)

where θ denotes the observed parameter values. For two parameter settings θ, θ' with θ ≥ θ' coordinate-wise, a simple coupling argument shows that σ_θ(S) ≥ σ_{θ'}(S) for all S. Therefore, for any fixed set S, the maximum is attained either by making θ' as large as possible or as small as possible. Hence, solving the following problem is sufficient to solve (7.1).

Definition 7.1. Given an influence model and two parameter settings θ, θ' with θ ≥ θ' coordinate-wise, define

  δ_{θ,θ'}(S) = σ_θ(S) − σ_{θ'}(S).   (7.2)

Given the set size k, the Influence Difference Maximization (IDM) problem is defined as follows:

  Maximize δ_{θ,θ'}(S) subject to |S| = k.   (7.3)

In this generality, the Influence Difference Maximization problem subsumes the Influence Maximization problem, by setting θ'_{u,v} ≡ 0 (and thus also σ_{θ'} ≡ 0). However, while Influence Difference Maximization subsumes Influence Maximization, whose objective function is monotone and submodular, the objective function of Influence Difference Maximization is in general neither. To see non-monotonicity, notice that δ_{θ,θ'}(∅) = δ_{θ,θ'}(V) = 0, while generally δ_{θ,θ'}(S) > 0 for some sets S. The function is also not in general submodular. The following example shows non-submodularity for both the DIC and DLT models. The graph has four nodes V = {u,v,x,y} and three edges (u,v), (v,x), (x,y). The edges (v,x) and (x,y) are known to have an activation probability of 1, while the edge (u,v) has an adversarially chosen activation probability in the interval [0,1]. With S = {u} and T = {u,x}, we obtain that δ_{θ,θ'}(S+v) − δ_{θ,θ'}(S) = |∅| − |{v,x,y}| = −3, while δ_{θ,θ'}(T+v) − δ_{θ,θ'}(T) = |∅| − |{v}| = −1, which violates submodularity.
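This violation can be checked mechanically. Since all certain edges have probability 1, the two extreme influence functions reduce to plain directed reachability, and the following self-contained sketch reproduces the computation above:

```python
# delta(S) = sigma_{theta+}(S) - sigma_{theta-}(S), where theta+ (theta-)
# sets the uncertain edge (u, v) to probability 1 (respectively 0); all
# other edges have probability 1, so influence is exact reachability.

def reach(edges, seeds):
    """Number of nodes reachable from `seeds` via directed `edges`."""
    active, frontier = set(seeds), set(seeds)
    while frontier:
        frontier = {w for (x, w) in edges if x in frontier} - active
        active |= frontier
    return len(active)

e_plus = {("u", "v"), ("v", "x"), ("x", "y")}   # theta_{u,v} = 1
e_minus = {("v", "x"), ("x", "y")}              # theta_{u,v} = 0

def delta(seeds):
    return reach(e_plus, seeds) - reach(e_minus, seeds)

S, T = {"u"}, {"u", "x"}
print(delta(S | {"v"}) - delta(S))   # -3
print(delta(T | {"v"}) - delta(T))   # -1: larger marginal at the superset
```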
From this version, the hardness of the constrained problem is inferred easily as follows: if any better 179 approximationcouldbeobtainedfortheconstrainedproblem, onecouldsimplyenu- merate over all possible values of k from 1 to n, and retain the best solution, which would yield the same approximation guarantee for the unconstrained problem. We give an approximation-preserving reduction from the Maximum Inde- pendentSet problem to the Influence Dierence Maximization problem. It is well known that Maximum Independent Set cannot be approximated better than O(n 1≠ ‘ ) for any‘> 0 unless NP™ ZPP [67]. LetG=(V,E) be an instance of theMaximum Independent Set problem, with |V| = n. We construct from G a directed bipartite graph G Õ with vertex set V Õ fi V ÕÕ . For each node v i œ V, there are nodes v Õ i œ V Õ and v ÕÕ i œ V ÕÕ . The edge set is E Õ fi E ÕÕ , where E Õ ={(v Õ i ,v ÕÕ j )|(v i ,v j )œ E}, and E ÕÕ ={(v Õ i ,v ÕÕ i )|v i œ V}. All edges of E Õ are known to have an activation probability of 1, while all edges of E ÕÕ have an uncertain activation probability from the interval [0,1]. The dierence is maximized by making all probabilities as large as possible for one function (meaning that all edges inE Õ fi E ÕÕ are present deterministically), while making them as small as possible for the other (meaning that exactly the edges in E Õ are present). First, let S be an independent set in G. Consider the set S Õ = {v Õ i |v i œ S}. Each node v ÕÕ i with v i œ S is reachable from the corresponding v Õ i in G Õ , but not in (V Õ fi V ÕÕ ,E Õ ), because S is independent. Hence, the objective function value obtained in Influence Dierence Maximization is at least |S|. Conversely, consider an optimal solution S Õ to the Influence Dierence Max- imization problem. Without loss of generality, we may assume that S Õ ™ V Õ :any nodev ÕÕ j œ V ÕÕ can be removed fromS Õ without lowering the objective value. Assume that S :={v i œ V |v Õ i œ S Õ } is not independent, and that (v i ,v j )œ E for v i ,v j œ S. 180 Then,removingv Õ j fromS Õ cannotlowertheInfluenceDierenceMaximizationobjec- tive value of S Õ : all of v Õ j ’s neighbors in V ÕÕ contribute 0, as they are reachable using E Õ already; furthermore,v ÕÕ j also does not contribute, as it is reachable usingE Õ from v Õ i . Thus, any node with a neighbor in S can be removed from S Õ , meaning that S is without loss of generality independent in G. At this point, all the neighbors of S Õ contribute 0 to the Influence Dierence Maximizationobjectivefunction(becausethey arereachableunderE Õ already), and the objective value of S Õ is exactly |S Õ | =|S|. 7.4.3 Experiments While we saw in Section 7.4.1 that examples highly susceptible (with errors of magnitude( n)) to small perturbations exist, the goal of this section is to evaluate experimentally how widespread this behavior is for realistic social networks. 7.4.3.1 Experimental Setting We carry out experiments under the DIC model, for six classes of graphs — four synthetic and two real-world. In each case, the model/data give us a simple graph or multigraph. Multigraphs are converted to simple graphs by collapsing parallel edges to a single edge with weight c e equal to the number of parallel edges; for simple graphs, all weights are c e =1. The observed probabilities for edges are ◊ e =c e ·p; across experiments, we vary the base probability p to take on the values {0.01,0.02,0.05,0.1}. The resulting parameter vector is denoted by✓ . 
The uncertainty interval for edge e is I_e = [(1−Δ)·θ_e, (1+Δ)·θ_e]; here, Δ is an uncertainty parameter for the estimation, which takes on the values {1%, 5%, 10%, 20%, 50%} in our experiments. The parameter vectors θ⁺ and θ⁻ describe the settings in which all parameters are as large (respectively, as small) as possible.

7.4.3.2 Network Data

We run experiments on four synthetic networks and two real social networks. Synthetic networks provide a controlled environment in which to compare observed behavior to expectations, while real social networks may give us indications about the prevalence of vulnerability to perturbations in real networks that have been studied in the past.

Synthetic Networks. We generate synthetic networks according to four widely used network models introduced in Section 3.1. In all cases, we generate undirected networks with 400 nodes. The network models are: (1) the 2-dimensional grid; (2) random regular graphs; (3) the Watts-Strogatz Small-World (SW) model [143] on a ring, with each node initially connecting to the 5 closest nodes on each side, and a rewiring probability of 0.1; and (4) the Barabási-Albert Preferential Attachment (PA) model [9] with 5 outgoing edges per node. For all synthetic networks, we select k = 20 seed nodes.

Real Networks. We consider two real networks introduced in Section 3.2 to evaluate the susceptibility of practical networks: the co-authorship network STOCFOCS of theoretical CS papers, and the retweet network Haiti. In all experiments, we work with uniform edge weights p, since, apart from edge multiplicities, we have no evidence on the strength of connections. It is a promising direction for future in-depth experiments to use influence strengths inferred from real-world cascade datasets by network inference methods such as [51, 58, 108].

7.4.3.3 Algorithms

Our experiments necessitate the solution of two algorithmic problems: finding a set of size k of maximum influence, and finding a set of size k maximizing the influence difference. The former is a well-studied problem with a monotone submodular objective function; we simply use the widely known (1−1/e)-approximation algorithm introduced in Section 2.2.1, which is best possible unless P = NP. For the goal of Influence Difference Maximization, we established in Theorem 7.2 that the objective function is hard to approximate better than a factor of O(n^{1−ε}) for any ε > 0. For experimental purposes, we use the Random Greedy algorithm of Buchbinder et al. [22], introduced in detail in Section 2.2.1. The running time of the Random Greedy algorithm is O(kC|V|), where C is the time required to estimate g(S ∪ {u}) − g(S). In our case, the objective function is #P-hard to evaluate exactly [28, 137], but arbitrarily close approximations can be obtained via Monte Carlo simulation. Since each simulation takes time O(|V|), running M = 2000 iterations of the Monte Carlo simulation in each step gives an overall running time of O(kM|V|²).
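The following sketch shows the overall structure of this computation: Monte Carlo estimation of the objective, and the Random Greedy rule, which at each step samples uniformly among the k candidates of largest marginal gain. Here spread_plus and spread_minus are assumed one-sample DIC simulators under θ⁺ and θ⁻; all names are illustrative, not from a released implementation.

```python
import random

def mc_estimate(sample_fn, seeds, M=2000):
    """Monte Carlo estimate: average of M independent simulations."""
    return sum(sample_fn(seeds) for _ in range(M)) / M

def random_greedy(nodes, g, k):
    """Random Greedy of Buchbinder et al.: at each step, pick uniformly
    at random among the k candidates with the largest marginal gain."""
    S = set()
    for _ in range(k):
        base = g(S)
        gains = sorted(((g(S | {u}) - base, u) for u in nodes - S),
                       reverse=True)
        _, chosen = random.choice(gains[:k])
        S.add(chosen)
    return S

# Usage for Influence Difference Maximization:
# g = lambda S: (mc_estimate(spread_plus, S) - mc_estimate(spread_minus, S))
# best = random_greedy(all_nodes, g, k=20)
```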
A common technique for speeding up the greedy algorithm for maximizing a submodular function is the CELF heuristic of Leskovec et al. [93]. When the objective function is submodular, the standard greedy algorithm and CELF obtain the same result; when it is not, as is the case here, the results may differ. We run the Random Greedy algorithm both with and without the CELF heuristic, with a single exception for the largest input, the STOCFOCS network, on which the greedy algorithm without CELF did not finish in a reasonable amount of time. For all networks other than STOCFOCS, the results using CELF are not significantly different from the reported results without the CELF optimization; for STOCFOCS, we therefore only report the result obtained with the CELF heuristic.

7.4.3.4 Results

In all our experiments, the results for the Grid and Small-World networks are sufficiently similar that we omit the results for grids. As a first sanity check, we empirically computed max_{S:|S|=1} δ_{θ⁺,θ⁻}(S) for the complete graph on 200 nodes with I_e = [(1−Δ)/200, (1+Δ)/200] and k = 1. According to the analysis in Section 7.4.1, we would expect extremely high instability. The results, shown in Table 7.1, confirm this expectation.

Table 7.1: Instability for the clique K_200.
  Δ      σ_θ⁺      σ_θ⁻
  50%    66.529    1.955
  20%    23.961    4.253
  10%    15.071    6.204

Next, Figure 7.1 shows the (approximately) computed values of max_{S:|S|=k} δ_{θ⁺,θ⁻}(S) and, for calibration purposes, max_{S:|S|=k} σ_θ(S), for all networks and parameter settings. Notice that this result is obtained by running the Random Greedy algorithm, which comes without an approximation guarantee here. However, as the algorithm's output provides a lower bound on the maximum influence difference, a large value suggests that Influence Maximization could be unstable; on the other hand, small values do not guarantee that the instance is stable.

Figure 7.1: Comparison between Influence Difference Maximization and Influence Maximization results for four different networks: (a) Small-World, (b) PA, (c) STOCFOCS, (d) Haiti. (The result for the STOCFOCS network is obtained with the CELF optimization.)

While individual networks vary somewhat in their susceptibility, the overall trend is that larger estimates of the baseline probabilities θ make the instance more susceptible to noise, as do (obviously) larger uncertainty parameters Δ. In particular, for Δ ≥ 20%, the noise (after scaling) dominates the Influence Maximization objective function value, meaning that optimization results should be used with care.

Next, we evaluate the dependence of the noise tolerance on the degrees of the graph by experimenting with random d-regular graphs whose degrees vary from 5 to 25. It is known that such graphs are expanders with high probability, and hence have percolation threshold 1/d [4]. Accordingly, we set the base probability to (1+α)/d with α ∈ {−20%, 0, 20%}, and use the same uncertainty intervals as in the previous experiments. Figure 7.2 shows the ratio between the Influence Difference Maximization and Influence Maximization values, i.e., max_S δ_{θ⁺,θ⁻}(S) / max_S σ_θ(S), for α ∈ {−20%, 0, 20%}. It indicates that for random regular graphs, the degree does not appear to significantly affect stability, and that, again, noise around 20% begins to pose a significant challenge. Moreover, we observe that the ratio reaches its minimum when the edge activation probability is exactly at the percolation threshold 1/d. This result is in line with percolation theory and also with the analysis of Adiga et al. [2].

Figure 7.2: Ratio between the computed values of Influence Difference Maximization and Influence Maximization for random regular graphs of varying degree, for α = −20%, α = 0, and α = 20%.

As a general takeaway message: for larger amounts of noise (even just a relative error of 20%), which may well occur in practice, a lot of caution is advised in using the results of algorithmic Influence Maximization.
7.5 Robust Influence Maximization

The fact that our main theorem for diagnosing instability is negative (i.e., a strong approximation hardness result) is somewhat disappointing, in that it rules out reliably categorizing data sets as stable or unstable. This motivates our investigation of the third question: setting up a robust Influence Maximization framework wherein an algorithm is presented with a set of influence functions derived from different influence models or different parameter settings for the same model. The main motivation for our work is that often, σ is not precisely known to the algorithm trying to maximize influence. There may be a (possibly infinite, as under the PIM of Section 7.3.1) number of candidate functions σ, resulting from different diffusion models or parameter settings.
The main outcome of the experiments is that while the algorithm with robustness as a design goal typically (though not always) outperforms the heuristics, the margin is often quite small. Hence, heuristics may be viable in practice, when the influence functions are reasonably similar. A visual inspection of the nodes chosen by different algorithms reveals how the robust algorithm "hedges its bets" across models, while the non-robust heuristic tends to cluster selected nodes in one part of the network.

7.5.1 Problem Definition

For concreteness, we focus on two diffusion models: the DIC model and the CIC-Delay model introduced in Section 2.5.1.2. Our framework applies to most other diffusion models; in particular, most of the concrete results carry over to the discrete and continuous-time Linear Threshold models [77, 122]. A specific instance is described by the class of its influence model (such as DIC, CIC, or others not discussed here in detail) and the setting of the model's parameters; in the DIC and CIC models above, the parameters $\theta_{u,v}$ would be the activation probabilities and the parameters of the delay distributions, respectively. Together, they completely specify the dynamic process, and thus a mapping $\sigma$ from initially active sets $S$ to the expected number $\sigma(S)$ of nodes active at the end of the process. (The model and virtually all results in the literature extend straightforwardly when the individual nodes are assigned non-negative importance scores.)

We now formally define the Robust Influence Maximization problem.

Definition 7.3 (Robust Influence Maximization). Given a set $\Sigma$ of influence functions, maximize the objective

$\rho(S) = \min_{\sigma \in \Sigma} \frac{\sigma(S)}{\sigma(S^*_\sigma)}$,

subject to a cardinality constraint $|S| \leq k$. Here, $S^*_\sigma$ is a seed set with $|S^*_\sigma| \leq k$ maximizing $\sigma(S^*_\sigma)$.

A solution to the Robust Influence Maximization problem achieves a large fraction of the maximum possible influence (compared to the optimal seed set) under all diffusion settings simultaneously. Alternatively, the solution can be interpreted as solving the Influence Maximization problem when the function $\sigma$ is chosen from $\Sigma$ by an adversary.

While Definition 7.3 per se does not require the $\sigma \in \Sigma$ to be submodular and monotone, these properties are necessary to obtain positive results. Hence, we will assume here that all $\sigma \in \Sigma$ are monotone and submodular, as they are for standard diffusion models. Notice that even then, $\rho$ is a minimum of submodular functions, and as such not necessarily submodular itself [82].

A particularly natural and important special case of Definition 7.3 is the Perturbation Interval model we considered in Section 7.3.1. Here, the influence model is known (for concreteness, DIC), but there is uncertainty about its parameters. For each edge $e$, we have an interval $I_e = [\ell_e, r_e]$, and the algorithm only knows that the parameter (say, $\theta_e$) lies in $I_e$; the exact value is chosen by an adversary. Notice that $\Sigma$ is (uncountably) infinite under this model. While this may seem worrisome, the following lemma shows that we only need to consider finitely (though exponentially) many functions:

Lemma 7.4. Under the Perturbation Interval model for DIC, the worst case for the ratio in $\rho$ for any seed set $S$ is achieved by making each $\theta_e$ equal to $\ell_e$ or $r_e$. (The result carries over with a nearly identical proof to the Linear Threshold model. We currently do not know whether it also extends to the CIC model.)

Proof. Fix one edge $\hat{e}$, and consider an assignment (fixed for now) $\theta_e \in I_e$ of activation probabilities to all edges $e \neq \hat{e}$. Let $x \in I_{\hat{e}}$ denote the (variable) activation probability for edge $\hat{e}$.
First, fix any seed set $S$, and define $f_S(x)$ to be the expected number of nodes activated by $S$ when the activation probabilities of all edges $e \neq \hat{e}$ are $\theta_e$ and the activation probability of $\hat{e}$ is $x$.

We express $f_S(x)$ using the triggering set approach introduced in Chapter 2. Let $\mathcal{G}$ be the set of all possible directed graphs on the given node set $V$. For any graph $G$, let $R_G(S)$ be the number of nodes reachable from $S$ in $G$ via a directed path, and let $P(G)$ be the probability that graph $G$ is obtained when each edge $e$ is present in $G$ independently with probability $\theta_e$ (or $x$, if $e = \hat{e}$). By the triggering set technique [77, Proof of Theorem 4.5], we get that

$f_S(x) = \sum_{G \in \mathcal{G}} P(G) \cdot R_G(S)$.

The probabilities $P(G)$ for obtaining a graph $G$ are

$P(G) = (1-x) \cdot \prod_{e \in G} \theta_e \cdot \prod_{e \notin G, e \neq \hat{e}} (1-\theta_e)$ when $\hat{e} \notin G$;

$P(G) = x \cdot \prod_{e \in G, e \neq \hat{e}} \theta_e \cdot \prod_{e \notin G} (1-\theta_e)$ when $\hat{e} \in G$.

In either case, we obtain a linear function of $x$, so that $f_S(x)$, being a sum of linear functions, is also linear in $x$.

Therefore, the function $g(x) := \max_S f_S(x)$, being a maximum of linear functions of $x$, is convex and piecewise linear. Consider any fixed seed set $S$, and the ratio $h(x) := f_S(x)/g(x)$. Its $\alpha$-level set $\{x \mid h(x) \geq \alpha\}$ is equal to $\{x \mid g(x) - \frac{1}{\alpha} \cdot f_S(x) \leq 0\}$. Because $g(x) - \frac{1}{\alpha} \cdot f_S(x)$, a convex function minus a linear function, is convex, its 0-level set is convex. Hence, all $\alpha$-level sets of $h$ are convex, and $h$ is quasi-concave. Because $h$ is quasi-concave, it is unimodal, and thus minimized at one of the endpoints of the interval. Hence, we can minimize the ratio $h(x)$, and thus the performance of the seed set $S$, by making $x$ either as small or as large as possible. By repeating this argument for all edges $\hat{e}$ one by one, we arrive at an influence setting minimizing the performance of $S$, and in which all influence probabilities are equal to the left or right endpoint of the respective interval $I_{\hat{e}}$.

Lowalekar et al. [104] have proved a result similar to Lemma 7.4. The difference lies in the objective of Robust Influence Maximization. We model the problem as maximizing the minimum ratio $\rho(S)$, while Lowalekar et al. consider minimizing the maximum regret $\max_{\sigma \in \Sigma} \sigma(S^*_\sigma) - \sigma(S)$. They prove that the maximum regret is achieved by making each $\theta_e$ equal to $\ell_e$ or $r_e$, similarly by showing that the maximum regret is also quasi-concave.

7.5.2 Algorithm and Hardness

Even when $\Sigma$ contains just a single function $\sigma$, Robust Influence Maximization is exactly the traditional Influence Maximization problem, and is thus NP-hard. This issue also appears in a more subtle way: evaluating $\rho(S)$ (for a given $S$) involves taking the minimum of $\frac{\sigma(S)}{\sigma(S^*_\sigma)}$ over all $\sigma \in \Sigma$. It is not clear how to calculate the ratio $\frac{\sigma(S)}{\sigma(S^*_\sigma)}$ even for one of the $\sigma$, since the scaling constant $\sigma(S^*_\sigma)$ (which is independent of the chosen $S$) is exactly the solution to the original Influence Maximization problem, and thus NP-hard to compute.

This problem, however, is fairly easy to overcome: instead of using the true optimum solutions $S^*_\sigma$ for the scaling constants, we can compute $(1-1/e)$-approximations $S^g_\sigma$ using the greedy algorithm, because the $\sigma$ are monotone and submodular, as shown in Chapter 2. Then, because $(1-1/e) \cdot \sigma(S^*_\sigma) \leq \sigma(S^g_\sigma) \leq \sigma(S^*_\sigma)$ for all $\sigma \in \Sigma$, we obtain that the "greedy objective function"

$\rho_g(S) = \min_{\sigma \in \Sigma} \frac{\sigma(S)}{\sigma(S^g_\sigma)}$

satisfies the following property for all sets $S$:

$(1-1/e) \cdot \rho_g(S) \leq \rho(S) \leq \rho_g(S)$.   (7.4)
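As an aside, the greedy objective $\rho_g$ is simple to evaluate once the per-function greedy reference sets $S^g_\sigma$ have been computed. The following minimal Python sketch assumes each $\sigma \in \Sigma$ is available as an influence oracle mapping seed sets to (estimated) expected influence; greedy_im is the standard greedy algorithm from Chapter 2, written here without the CELF acceleration we use in practice, and the function names are our own illustrative choices.

def greedy_im(sigma, nodes, k):
    # Standard greedy (1 - 1/e)-approximation for a monotone submodular
    # influence oracle sigma; used only for the scaling constants.
    S = set()
    for _ in range(k):
        best = max((v for v in nodes if v not in S),
                   key=lambda v: sigma(S | {v}))
        S.add(best)
    return S

def make_rho_g(influence_fns, nodes, k):
    # Precompute sigma(S_g) for each sigma, then return rho_g as a closure.
    baselines = [(sigma, sigma(greedy_im(sigma, nodes, k)))
                 for sigma in influence_fns]
    def rho_g(S):
        return min(sigma(set(S)) / ref for sigma, ref in baselines)
    return rho_g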
Hence, optimizing $\rho_g(S)$ in place of $\rho(S)$ comes at a cost of only a factor $(1-1/e)$ in the approximation guarantee. We will therefore focus on solving the problem of (approximately) optimizing $\rho_g(S)$.

Because each $\sigma$ is monotone and submodular, and the $\sigma(S^g_\sigma)$, just like the $\sigma(S^*_\sigma)$, are just scaling constants, $\rho_g(S)$ is a minimum of monotone submodular functions. However, we show that even in the context of Influence Maximization, this minimum is impossible to approximate to within any polynomial factor. This holds even in a bicriteria sense, i.e., when the algorithm's solution is allowed to pick $(1-\delta)\ln|\Sigma| \cdot k$ nodes, but is compared only to solutions using $k$ nodes. The result also extends to the seemingly more restricted Perturbation Interval model, giving an almost equally strong bicriteria approximation hardness result there.

Theorem 7.5. Let $\delta, \epsilon > 0$ be any constants, and assume that P $\neq$ NP. There are no polynomial-time algorithms for the following problems:

1. Given $n$ nodes and a set $\Sigma$ of influence functions on these nodes (derived from the DIC or CIC models), as well as a target size $k$: find a set $S$ of $|S| \leq (1-\delta)\ln|\Sigma| \cdot k$ nodes, such that $\rho(S) \geq \rho(S^*) \cdot (1/n^{1-\epsilon})$, where $S^*$ is the optimum solution of size $k$.

2. Given a graph $G$ on $n$ nodes and intervals $I_e$ for edge activation probabilities under the DIC model (or intervals $I_e$ for edge delay parameters under the CIC model), as well as a target size $k$: find a set $S$ of cardinality $|S| \leq \epsilon \cdot c \cdot \ln n \cdot k$ (for a sufficiently small fixed constant $c$) such that $\rho(S) \geq \rho(S^*) \cdot (1/n^{1-\epsilon})$, where $S^*$ is the optimum solution of size $k$.

Proof. We prove the two parts of the theorem by (slightly different) reductions from the gap version of Set Cover. A Set Cover instance consists of a universe $U = \{a_1, \ldots, a_N\}$, a collection $\mathcal{T}$ of $M$ subsets of $U$, and an integer $k$. A set cover is a collection $\mathcal{C} \subseteq \mathcal{T}$ such that $\bigcup_{T \in \mathcal{C}} T = U$. Without loss of generality, we assume that each element is contained in at least one set; otherwise, there trivially is no set cover. Also without loss of generality, we assume that $k \leq \min(M,N)$, as otherwise one can trivially pick all sets or one designated set per element. The gap version of Set Cover then asks us to decide whether there is a set cover $\mathcal{C}$ of size $|\mathcal{C}| \leq k$ or whether each set cover has size at least $(1-\delta)\ln N \cdot k$. (The algorithm is promised that the minimum size will never lie between these two values.) Dinur and Steurer [33, Corollary 1.5] showed that the gap version of Set Cover is NP-hard.

Part 1. Based on the Set Cover instance, we construct the following instance of Robust Influence Maximization under the DIC model. Let $m := (\max(N,M))^{3/\epsilon}$. The instance consists of $N$ bipartite graphs on a shared vertex set $V = X \cup Y$. $X$ contains one node $x_T$ for each set $T \in \mathcal{T}$; $Y$ contains $m$ nodes $y_{a,1}, \ldots, y_{a,m}$ for each element $a \in U$. Hence, the number of nodes in the constructed graph is $n = M + mN = \Theta(mN) \leq \Theta(m^{1+\epsilon/3})$; in particular, it is polynomial, and the reduction takes polynomial time.

In the $i$-th influence function, all nodes $x_T$ with $T \ni a_i$ have a directed edge with activation probability 1 (or exponential delay distribution with delay parameter 1) to all of the $y_{a_i,j}$ (for all $j$); no other edges are present. Hence, $|\Sigma| = N$, and $\ln N = \ln|\Sigma|$. For the CIC model, the time window has size $T = NM$.

First, consider the case when there is a set cover $\mathcal{C}$ of size $k$. Choose the corresponding $x_T, T \in \mathcal{C}$, as seed nodes, and call the resulting seed set $S$.
Because $\mathcal{C}$ is a set cover, in the $i$-th instance, all of the $y_{a_i,j}$ are activated, for a total of at least $m+k$ nodes. (Under the CIC model, all of these $y_{a_i,j}$ are activated with high probability, not deterministically, within $T$ steps.) Because none of the nodes in $X$ and none of the $y_{a_{i'},j}$, $i' \neq i$, have incoming edges in the $i$-th instance, the optimum solution for that instance can activate at most all of the $m$ nodes $y_{a_i,j}$ and its $k$ selected nodes, for a total of $m+k$. Thus, the objective function value will be 1 (or arbitrarily close to 1 w.h.p. for the CIC model).

Now assume that there is no set cover of size $(1-\delta)\ln N \cdot k$, and consider any seed set $S$. Let $k' = |S \cap X| \leq (1-\delta)\ln N \cdot k$ be the number of nodes from $X$ selected as seeds. Because the collection $\mathcal{S} := \{T \in \mathcal{T} \mid x_T \in S\}$ cannot be a set cover by assumption, there must be some $a_i \notin \bigcup_{T \in \mathcal{S}} T$. Therefore, under the $i$-th influence function, none of the $y_{a_i,j}$ can ever be activated, except those selected directly in $S$. Hence, the number of nodes activated under the $i$-th influence function is at most $|S| \leq (1-\delta)\ln N \cdot k$. On the other hand, by selecting just one node $x_T$ corresponding to any set $T \ni a_i$, one could have activated all of the $y_{a_i,j}$ (with high probability under the CIC model), for a total of $m$. Thus, the objective function value is at most

$\rho(S) \leq \frac{(1-\delta)\ln N \cdot k}{m} \leq O(m^{2\epsilon/3 - 1}) \leq O(n^{\frac{2\epsilon-3}{3+\epsilon}}) = o(n^{-(1-\epsilon)})$,

where we crudely bounded both $\ln N$ and $k$ by $N \leq O(m^{\epsilon/3})$. Hence, a $((1-\delta)\ln N, O(n^{1-\epsilon}))$ bicriteria approximation algorithm could distinguish the two cases, and thus solve the gap version of Set Cover.

Part 2. For the second part, we just consider the gap version with a fixed $\delta$, say $\delta = \frac{1}{2}$. Then, in the hard instances, $M$ and $N$ are polynomially related, which we assume here; i.e., $M \leq N^q$ for some constant $q$ which is independent of $\epsilon$ or $N$.

Based on the Set Cover instance, we construct a different Robust Influence Maximization instance, consisting of a directed graph with three layers $V = X \cup Y \cup Z$. The first layer again contains one node $x_T$ for each set $T \in \mathcal{T}$; the second layer now contains just one node $y_a$ for each element $a \in U$. There is an edge (with known influence probability 1, or exponential delay distribution with parameter 1) from $x_T$ to $y_a$ if and only if $a \in T$. The third layer $Z$ contains $m = (\max(N,M))^{2/\epsilon}$ nodes. For each $a \in U$ and $z \in Z$, there is a directed edge $(y_a, z)$ with complete uncertainty about its parameter: under the DIC model, the probability is in the interval $I_{(y_a,z)} = [0,1]$, and under the CIC model, the edge delay is exponentially distributed with parameter in the interval $I_{(y_a,z)} = (0,1]$. In total, the graph has $n = N + M + m = \Theta(m)$ nodes (in particular, polynomially many), and the reduction takes polynomial time. Because $N$ is at most polynomially smaller than $M$, we have $N = \Omega(n^{\epsilon/2q})$, and thus $\ln N = \Omega(\frac{\epsilon}{q} \cdot \ln n)$. For the CIC model, we set the time horizon to $T = NM$.

First, consider the case when there is a set cover $\mathcal{C}$ of size $k$. Consider choosing the corresponding $x_T, T \in \mathcal{C}$, as seed nodes; call the resulting seed set $S$. $S$ will definitely activate all nodes in $Y$, for a total of $k+N$. Now, consider any assignment of probabilities $\theta_{y_a,z}$ or edge delays $d_{y_a,z}$ to the edges from $Y$ to $Z$, and an optimal seed set $S^*$ of size $k$. Let $Z^* = Z \cap S^*$ be the set of seed nodes chosen from $Z$, of size $k'$. Then, $S^*$ definitely activates all of $Z^*$, and at most all $N$ nodes from $Y$ as well as $k-k'$ nodes from $X$, for a total (so far) of $N+k$.
For any node $z \in Z \setminus Z^*$, the probability that it is activated by $S$ is at least as large as under $S^*$: for any values of the individual activation probabilities or delays between $Y$ and $Z$, the fact that $S$ activates all of $Y$ ensures that any node in $Z$ activated under $S^*$ is also activated under $S$ (by time $T$, in the case of the CIC model). Hence, the expected number of nodes activated from $Z \setminus Z^*$ is at least as large under $S$ as under $S^*$. Since this holds for all settings of the activation probabilities or edge delay parameters, we get that $\rho(S) \geq 1$.

Now assume that there is no set cover of size $\frac{1}{2}\ln N \cdot k$, and consider any seed set $S$. If $S$ contained any node $y_a$, we could replace it with any node $x_T$ such that $a \in T$ and activate at least as many nodes as before, so assume without loss of generality that $S \cap Y = \emptyset$. Because $|S \cap X| \leq |S| \leq \frac{1}{2}\ln N \cdot k$, the gap guarantee implies that there is at least one node $y_a \in Y$ that is never activated by $S$. Now consider the probability assignment $\theta_{y_a,z} = 1$ for all $z \in Z$, and $\theta_{y,z} = 0$ for all $y \neq y_a, z \in Z$. (Under the CIC model, set $d_{y_a,z} = 1$ for all $z \in Z$, and $d_{y,z} = 1/(NM)^2$ for all $y \neq y_a, z \in Z$.) Then, the seed set $S$ cannot activate any nodes in $Z$ (except those it may have selected), and will activate a total of at most $N + k = O(N) = O(n^{\epsilon/2})$ nodes. (Under the CIC model, this statement holds with high probability.) On the other hand, the seed set $\{y_a\}$ (just a single node) would have activated all of $Z$ (with high probability, under the CIC model), for a total of $m+1 = \Theta(n)$ nodes. Hence, the ratio is at most $O(n^{\epsilon/2}/n) = O(1/n^{1-\epsilon/2})$, implying that $\rho(S) \leq O(1/n^{1-\epsilon/2})$.

If there were an $(\epsilon \cdot c \cdot \ln(n), O(n^{1-\epsilon}))$ bicriteria approximation algorithm for a sufficiently small constant $c$, it could distinguish which of the two cases ($\rho(S) = 1$ or $\rho(S) \leq O(1/n^{1-\epsilon/2})$) applied, thus solving the gap version of Set Cover.

The hardness results naturally apply to any diffusion model that subsumes the DIC or CIC models. However, an extension to the DLT model is not immediate: the construction relies crucially on having many edges of probability 1 into a single node, which is not allowed under the DLT model.

7.5.2.1 Bicriteria Approximation Algorithm

Theorem 7.5 implies that to obtain any non-trivial approximation guarantee, one needs to allow the algorithm to exceed the seed set size by at least a factor of $\ln|\Sigma|$. In this section, we therefore focus on such bicriteria approximation results, by slightly modifying an algorithm of Krause et al. [82]. The difference is that we use the Greedy Mintss algorithm [57] (with pseudocode in Algorithm 7.2) to solve the submodular coverage subproblem.

The high-level idea of the algorithm is as follows. Fix a real value $c$, and define $h^{(c)}_\sigma(S) := \min(c, \frac{\sigma(S)}{\sigma(S^g_\sigma)})$ and $H^{(c)}(S) := \sum_{\sigma \in \Sigma} h^{(c)}_\sigma(S)$. Then, $\rho_g(S) \geq c$ if and only if $\frac{\sigma(S)}{\sigma(S^g_\sigma)} \geq c$ for all $\sigma \in \Sigma$. But because, by definition, $h^{(c)}_\sigma(S) \leq c$ for all $\sigma$, the latter is equivalent to $H^{(c)}(S) \geq |\Sigma| \cdot c$. (If any term in the sum is less than $c$, no other term can ever compensate for it, because they are capped at $c$.)

Because $H^{(c)}(S)$ is a non-negative linear combination of the monotone submodular functions $h^{(c)}_\sigma$, it is itself a monotone and submodular function.
This enables the use of a greedy $\ln|\Sigma|$-approximation algorithm to find an (approximately) smallest set $S$ with $H^{(c)}(S) \geq c|\Sigma|$. If $S$ has size at most $k\ln|\Sigma|$, this constitutes a satisfactory solution, and we move on to larger values of $c$. If $S$ has size more than $k\ln|\Sigma|$, then the greedy algorithm's approximation guarantee ensures that there is no satisfactory set $S$ of size at most $k$. Hence, we move on to smaller values of $c$.

Algorithm 7.1: Saturate Greedy ($\Sigma$, $k$, precision $\gamma$)
1: Initialize $c_{\min} \leftarrow 0$, $c_{\max} \leftarrow 1$.
2: while $(c_{\max} - c_{\min}) \geq \gamma$ do
3:   $c \leftarrow (c_{\max} + c_{\min})/2$.
4:   Define $H^{(c)}(S) \leftarrow \sum_{\sigma \in \Sigma} \min(c, \frac{\sigma(S)}{\sigma(S^g_\sigma)})$.
5:   $S \leftarrow$ Greedy Mintss($H^{(c)}$, $k$, $c \cdot |\Sigma|$, $c \cdot \gamma/3$).
6:   if $|S| > \beta \cdot k$ then
7:     $c_{\max} \leftarrow c$.
8:   else
9:     $c_{\min} \leftarrow c \cdot (1 - \gamma/3)$, $S^* \leftarrow S$.
10:  end if
11: end while
12: Return $S^*$.

Algorithm 7.2: Greedy Mintss ($f$, $k$, threshold $\eta$, error $\epsilon$)
1: Initialize $S \leftarrow \emptyset$.
2: while $f(S) < \eta - \epsilon$ do
3:   $u \leftarrow \operatorname{argmax}_{v \notin S} f(S \cup \{v\})$.
4:   $S \leftarrow S \cup \{u\}$.
5: end while
6: Return $S$.

For efficiency, the search for the right value of $c$ is done with binary search and a specified precision parameter. A slight subtlety in the greedy algorithm is that $H^{(c)}$ could take on fractional values. Thus, instead of trying to meet the bound $c|\Sigma|$ precisely, we aim for a value of $c|\Sigma| - \epsilon$. Then, the analysis of the Greedy Mintss algorithm of Goyal et al. [57] (of which our algorithm is an unweighted special case) applies. The resulting algorithm, Saturate Greedy, is given as Algorithm 7.1. The simple greedy subroutine, a special case of the Greedy Mintss algorithm, is given as Algorithm 7.2.

The slight difference between our algorithm and the algorithm of Krause et al. [82] lies in how the submodular coverage subproblem is solved. Both [82] and the Greedy Mintss algorithm [57] greedily add elements. However, the Greedy Mintss algorithm adds elements until the desired submodular objective is attained up to an additive $\epsilon$ term, while [82] requires exact coverage; that is, the while loop in line 2 of Algorithm 7.2 exits only once $f(S) \geq \eta$ in [82]. Moreover, directly considering real-valued submodular functions instead of going through fractional values leads to a more direct analysis of the Greedy Mintss algorithm [57].

By combining the discussion at the beginning of this section (about optimizing $\rho$ vs. $\rho_g$) with the analysis of Krause et al. [82] and Goyal et al. [57] (the following theorem), we obtain the following approximation guarantee.

Theorem 7.6 (Theorem 1 in [57]). Let $\epsilon > 0$ be any shortfall and let $S$ be the solution of Greedy Mintss with chosen threshold $\eta - \epsilon$. Then $|S| \leq |S^*| \cdot (1 + \ln\frac{\eta}{\epsilon})$, where $S^*$ is the optimal solution of the submodular set cover problem.

Theorem 7.7. Let $\gamma > 0$ be arbitrary and $\beta = 1 + \ln|\Sigma| + \ln\frac{3}{\gamma}$. Saturate Greedy finds a seed set $\hat{S}$ of size $|\hat{S}| \leq \beta k$ with

$\rho(\hat{S}) \geq (1 - 1/e) \cdot \rho(S^*) - \gamma$,

where $S^* \in \operatorname{argmax}_{S:|S| \leq k} \rho(S)$ is an optimal robust seed set of size $k$.

Proof. Algorithm 7.1 uses Algorithm 7.2 (Greedy Mintss) as a subroutine to find a solution $S$ such that $f(S) \geq \eta - \epsilon$ and $|S| \leq |S^*| \cdot (1 + \ln\frac{\eta}{\epsilon})$, where $S^*$ is a smallest solution guaranteeing $f(S^*) \geq \eta$. (Technically, the guarantees on Greedy Mintss depend on being able to evaluate $f$ precisely [57, Theorem 1]. However, Theorem 2 of [57] states that by obtaining $(1 \pm \delta)$-approximations to $f$, we can ensure that $|S| \leq (1 + \delta')|S^*| \cdot (1 + \ln\frac{\eta}{\epsilon})$, where $\delta' \to 0$ as $\delta \to 0$. For influence coverage functions, arbitrarily close approximations to $f$ can be obtained by Monte Carlo simulations. We therefore ignore the issue of sampling accuracy in this thesis, and perform the analysis as though $f$ could be evaluated precisely. Otherwise, the approximations carry through in a straightforward way, leading to multiplicative factors $(1 + \delta'')$.)
In light of the general outline and motivation for the Saturate Greedy algorithm given above, it mostly remains to verify how the guarantees for Greedy Mintss and the balancing of the parameters carry through. We will show that throughout the algorithm (or more precisely, the binary search), $c_{\min}$ always remains a lower bound on the solution for the problem with the relaxed cardinality constraint, while $c_{\max}$ remains an upper bound on the solution for the original problem. In other words, there is no set $S$ of cardinality at most $|S| \leq k$ with $\rho(S) > c_{\max}$, and there is a set $S$ of cardinality at most $|S| \leq \beta k$ with $\rho(S) \geq c_{\min}$.

To show this claim, consider the set $S$ returned by the Greedy Mintss algorithm. If $|S| > \beta k$, the guarantee for Greedy Mintss implies that $|S| \leq \beta |S^*|$, where $S^*$ is the optimal solution for the instance. Because $|S^*| \geq |S|/\beta > k$, the value $c$ is not feasible, and the algorithm is correct in setting $c_{\max}$ to $c$. Otherwise, $|S| \leq \beta k$, and the guarantee of Greedy Mintss in Theorem 7.6 implies that $H^{(c)}(S) \geq c \cdot |\Sigma| - c \cdot \gamma/3$. Because each $h^{(c)}_\sigma(S) \leq c$ by definition, we get for all $\sigma$:

$h^{(c)}_\sigma(S) \geq H^{(c)}(S) - (|\Sigma| - 1) \cdot c \geq c - c \cdot \gamma/3$,

and therefore $\rho(S) \geq c - c \cdot \gamma/3$. This confirms the correctness of assigning $c_{\min} = c \cdot (1 - \gamma/3)$.

Since we do not set $c_{\min} = c$, we need to briefly verify termination of the binary search. For any iteration in which we update $c_{\min}$, let $\Delta := c_{\max} - c_{\min} \geq \gamma$. When the new $c'_{\min}$ is set to $c \cdot (1 - \gamma/3) \geq c - \gamma/3$, we get that $c_{\max} - c'_{\min} = (c_{\max} - c) + (c - c'_{\min}) \leq \Delta/2 + \gamma/3 \leq 5\Delta/6$. Hence, the size of the interval keeps decreasing geometrically, and the binary search terminates in $O(\log(1/\gamma))$ iterations. At the time of termination, we obtain that $|c^* - c_{\min}| \leq \gamma$. Combining this bound with the factor of $(1-1/e)$ we lost due to approximating $\rho$ with $\rho_g$, we obtain the claim of the theorem.

Theorem 7.7 holds very broadly, so long as all influence functions are monotone and submodular. This includes the DIC, DLT, and CIC models, and allows mixing influence functions from different model classes. Notice the contrast between Theorems 7.7 and 7.5. By allowing the seed set size to be exceeded just a little more (a factor $\ln|\Sigma| + O(1)$ instead of $0.999\ln|\Sigma|$), we go from $\Omega(n^{1-\epsilon})$ approximation hardness to a $(1-1/e)$-approximation algorithm.

7.5.2.2 Simple Heuristics

In addition to the Saturate Greedy algorithm, our experiments use two natural baselines. The first is a simple greedy algorithm, Single Greedy, which adds $\beta k$ elements to $S$ one by one, always choosing the one maximizing $\rho_g(S \cup \{v\})$. While this heuristic has provable guarantees when the objective function is submodular, this is not the case for the minimum of submodular functions. The second heuristic is to run a greedy algorithm for each objective function $\sigma \in \Sigma$ separately, and choose the best of the resulting solutions. Those solutions are exactly the sets $S^g_\sigma$ defined earlier in this section. Thus, the algorithm consists of choosing $\operatorname{argmax}_{\sigma \in \Sigma} \rho_g(S^g_\sigma)$. We call the resulting algorithm All Greedy. A variant of this heuristic is also proposed by Chen et al. [25] as the LUGreedy algorithm under the PIM, where $\Sigma$ contains only two functions: one with all edge parameters set to the maximum, the other with all edge parameters set to the minimum.
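For concreteness, the following Python sketch mirrors Algorithms 7.1 and 7.2 together with the two baselines. It reuses the list of $(\sigma, \sigma(S^g_\sigma))$ pairs from the $\rho_g$ sketch above, and omits the CELF and fast influence estimation optimizations used in our actual experiments; the names are our own illustrative choices.

def greedy_mintss(f, nodes, threshold, err):
    # Algorithm 7.2: add elements greedily until f(S) >= threshold - err.
    # Assumes the threshold is attainable (true for H^(c) with c <= 1).
    S = set()
    while f(S) < threshold - err:
        u = max((v for v in nodes if v not in S), key=lambda v: f(S | {v}))
        S.add(u)
    return S

def saturate_greedy(baselines, nodes, k, beta, gamma=0.01):
    # Algorithm 7.1: binary search over the target ratio c.
    c_min, c_max, best = 0.0, 1.0, set()
    while c_max - c_min >= gamma:
        c = (c_max + c_min) / 2
        H = lambda S, c=c: sum(min(c, sig(S) / ref) for sig, ref in baselines)
        S = greedy_mintss(H, nodes, c * len(baselines), c * gamma / 3)
        if len(S) > beta * k:
            c_max = c
        else:
            c_min, best = c * (1 - gamma / 3), S
    return best

def single_greedy(baselines, nodes, k, beta):
    # Single Greedy: grow one set, greedily maximizing rho_g directly.
    rho_g = lambda S: min(sig(S) / ref for sig, ref in baselines)
    S = set()
    for _ in range(int(beta * k)):
        u = max((v for v in nodes if v not in S), key=lambda v: rho_g(S | {v}))
        S.add(u)
    return S

def all_greedy(baselines, greedy_sets):
    # All Greedy: keep the per-function greedy set that is best under rho_g.
    rho_g = lambda S: min(sig(S) / ref for sig, ref in baselines)
    return max(greedy_sets, key=rho_g)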
In the worst case, both Single Greedy and All Greedy can perform arbitrarily badly, as seen by the following class of examples with a given parameter $k$. The example consists of $k$ instances of the DIC model for the following graph with $3k+m$ nodes (where $m \gg k$). The graph comprises a directed complete bipartite graph $K_{k,m}$ with $k$ nodes $x_1, \ldots, x_k$ on one side and $m$ nodes $y_1, \ldots, y_m$ on the other side, as well as $k$ separate edges $(u_1,v_1), \ldots, (u_k,v_k)$ between $2k$ new nodes $\{u_i\}$ and $\{v_i\}$. The edges $(u_i,v_i)$ have activation probability 1 in all instances. In the bipartite graph, in the $i$-th scenario, only the edges leaving node $x_i$ have probability 1, while all others have activation probability 0.

The optimal solution for Robust Influence Maximization is to select all nodes $x_i$, since one of them will succeed in activating the $m$ nodes $y_j$. The resulting objective value will be close to 1. However, All Greedy only picks one node $x_i$ and the remaining $k-1$ nodes as $u_j$. Single Greedy instead picks all of the $u_j$. Thus, both All Greedy and Single Greedy will have robust influence close to 0 as $m$ grows large. Empirical experiments confirm this analysis. For example, for $k=2$ and $m=100$, Saturate Greedy achieves $\rho = 0.985$, while Single Greedy and All Greedy only achieve 0.038 and 0.029, respectively.

Implementation

The most time-consuming step in all of the algorithms is the estimation of influence coverage, given a seed set $S$. Naïve estimation by Monte Carlo simulation could lead to a very inefficient implementation. The problem is even more pronounced compared to traditional Influence Maximization, as we must estimate the influence in multiple diffusion settings. Instead, we use the ConTinEst algorithm of Du et al. [36] for fast influence estimation under the CIC model. For the DIC model, we generalize the approach of Du et al. To accelerate the Greedy Mintss algorithm, we also apply the CELF optimization [93] in all cases. Analytically, one can derive linear running time (in both $n$ and $|\Sigma|$) for all three algorithms, thanks to the fast influence estimation. This is borne out by detailed experiments in Section 7.5.3.4.

7.5.3 Experiments

We empirically evaluate the Saturate Greedy algorithm and the Single Greedy and All Greedy heuristics. Our goal is twofold: (1) evaluate how well Saturate Greedy and the heuristics perform on realistic instances; (2) qualitatively understand the difference between robustly and non-robustly optimized solutions.

Our experiments are all performed on real-world data sets. The exception is the scalability experiments in Section 7.5.3.4, which benefit from the controlled environment of synthetic networks. The data sets span the range of different causes for uncertainty, namely: (1) influences are learned from cascades on different topics; (2) influences are learned with different modeling assumptions; (3) influences are only inferred to lie within intervals $I_e$ (the Perturbation Interval model).

7.5.3.1 Different Networks

We first focus on the case in which the diffusion model is kept constant: we use the DIC model, with parameters specified below. Different objective functions are obtained from observing cascades (1) on different topics: we use Twitter retweet networks for different topics (Twitter100/250); and (2) at different times: we use MemeTracker diffusion network snapshots at different times (MemeTracker2000/5000). The datasets are introduced in detail in Section 3.2. The parameters of the DIC model used for this set of experiments are summarized in Table 7.2.
Data set | Edge Activation Probability | # Seeds
Twitter100 | 0.2 | 10
Twitter250 | 0.1 | 20
MemeTracker2000 | 0.05 | 50
MemeTracker5000 | 0.05 | 100

Table 7.2: Diffusion model settings.

Recalling that, in the worst case, a relaxation in the number of seeds is required to obtain robust seed sets, we allow all algorithms to select more seeds than the solution they are compared against. Specifically, we report results in which the algorithms may select $k$, $1.5 \cdot k$ and $2 \cdot k$ seeds, respectively. The reported results are averaged over three independent runs of each of the algorithms.

Results: Performance

The aggregate performance of the different algorithms on the four data sets is shown in Figure 7.3. The first main insight is that (in the instances we study), when getting to over-select seeds by 50%, all three algorithms achieve a robust influence of at least 1.0. In other words, 50% more seeds let the algorithms perform as though they knew exactly which of the (adversarially chosen) diffusion settings was the true one. This suggests that the networks in our data sets share a lot of similarities that make influential nodes in one network also (mostly) influential in the other networks.

Figure 7.3: Performance of the algorithms on the four topical/temporal datasets: (a) Twitter100 ($k=10$), (b) Twitter250 ($k=20$), (c) MemeTracker2000 ($k=50$), (d) MemeTracker5000 ($k=100$). The x-axis is the number of seeds selected, and the y-axis the resulting robust influence (compared to seed sets of size $k$).

This interpretation is consistent with the observation that the baseline heuristics perform similarly to (and in one case better than) the Saturate Greedy algorithm. Notice, however, that when selecting just $k$ seeds, Saturate Greedy does perform best (though only by a small margin) among the three algorithms. This suggests that keeping robustness in mind may be more crucial when the algorithm does not get to compensate with a larger number of seeds.

Results: Visualization

To further illustrate the tradeoffs between robust and non-robust optimization, we visualize the seeds selected by Saturate Greedy (robust seeds) compared to seeds selected non-robustly based on only one diffusion setting. For legibility, we focus only on the Twitter250 data set, and only plot 4 out of the 5 networks. (The fifth network is very sparse, and thus not particularly interesting.)

Figure 7.4 compares the seeds selected by Saturate Greedy with those (approximately) maximizing the influence for the Iran network. Notice that Saturate Greedy focuses mostly (though not exclusively) on the densely connected core of the network (at the center), while the Iran-specific optimization also exploits the dense regions on the left and at the bottom. These regions are much less densely connected in the US politics and Climate networks, while the core remains fairly densely connected, leading the Saturate Greedy solution to be somewhat more robust.

Figure 7.4: Saturate Greedy vs. Iran graph seed nodes, shown on the (a) Iran, (b) Haiti, (c) US politics, and (d) Climate networks. Green/pentagon nodes are selected in both; orange/triangle nodes are selected by Saturate Greedy only; purple/square nodes for Iran only.

Similarly, Figure 7.5 compares the Saturate Greedy seeds (which are the same as in Figure 7.4) with seeds for the Climate network.
The trend here is exactly the opposite. The seeds selected based only on the Climate network are exclusively in the core, because the other parts of the Climate network are barely connected. On the other hand, the robust solution picks a few seeds from the clusters at the bottom, left, and right, which are present in other networks. These seeds lead to extra influence in those networks, and thus more robustness.

Figure 7.5: Saturate Greedy vs. Climate graph seed nodes, shown on the (a) Iran, (b) Haiti, (c) US politics, and (d) Climate networks. Green/pentagon nodes are selected in both; orange/triangle nodes are selected by Saturate Greedy only; purple/square nodes for Climate only.

7.5.3.2 Different Diffusion Models

In choosing a diffusion model, there is little convincing empirical work guiding the choice of a model class (such as CIC, DIC, or threshold models) or of distributional assumptions for model parameters (such as edge delay). A possible solution is to optimize robustly with respect to these different possible choices. In this section, we evaluate such an approach. Specifically, we perform two experiments: (1) learning the CIC influence network under different parametric assumptions about the delay distribution, and (2) learning the influence network under different models of influence (CIC, DIC, DLT). We again use the MemeTracker dataset, restricting ourselves to the data from August 2008 and the 500 most active users.

We use the MultiTree algorithm of Gomez-Rodriguez et al. [55] to infer the diffusion network from the observed cascades. This algorithm requires a parametric assumption for the edge delay distribution. We infer ten different networks $G_i$, corresponding to the Exponential distribution with parameters 0.05, 0.1, 0.2, 0.5, 1.0, and to the Rayleigh distribution with parameters 0.5, 1, 2, 3, 4. The length of the observation window is set to 1.0.

We then use the three algorithms to perform robust influence maximization for $k=10$ seeds, again allowing the algorithms to exceed the target number of vertices. The influence model for each graph is the CIC model with the same parameters that were used to infer the graphs. The performance of the algorithms is shown in Figure 7.6(a). All methods achieve satisfactory results in the experiment; this is again due to high similarity between the different diffusion settings inferred with different parameters.

For the second experiment, we investigate the robustness across different classes of diffusion models. We construct three instances of the DIC, DLT and CIC models from the ground truth diffusion network between the 500 active users. For the DIC model, we set the activation probability uniformly to 0.1. For the DLT model, we follow [77] and set the edge weights to $1/d_v$, where $d_v$ is the in-degree of node $v$. For the CIC model, we use an exponential distribution with parameter 0.1 and an observation window of length 1.0. We perform robust influence maximization for $k=10$ seeds and again allow the algorithms to exceed the target number of seeds. The results are shown in Figure 7.6(b). Similarly to the case of different estimated parameters, all methods achieve satisfactory results in the experiment, due to the high similarity between the diffusion models. Our results raise the intriguing question of which types of networks would be prone to significant differences in algorithmic performance based on which model is used for network estimation.
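As an illustration of how the DLT instance above can be plugged into our framework, the following minimal sketch builds a Monte Carlo influence oracle for the DLT model with edge weights $1/d_v$. The graph representation and run count are illustrative assumptions; our actual experiments instead rely on the fast influence estimation discussed earlier.

import random

def simulate_dlt(in_nbrs, seeds):
    # Discrete Linear Threshold with weights 1/indegree(v): node v
    # activates once the active fraction of its in-neighbors reaches
    # its (uniformly random) threshold; iterate to a fixed point.
    theta = {v: random.random() for v in in_nbrs}
    active, changed = set(seeds), True
    while changed:
        changed = False
        for v, nbrs in in_nbrs.items():
            if v not in active and nbrs:
                weight = sum(1.0 / len(nbrs) for u in nbrs if u in active)
                if weight >= theta[v]:
                    active.add(v)
                    changed = True
    return len(active)

def dlt_oracle(in_nbrs, runs=1000):
    # Monte Carlo influence oracle sigma(S) for the DLT instance.
    return lambda S: sum(simulate_dlt(in_nbrs, S) for _ in range(runs)) / runs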
7.5.3.3 Networks Sampled from the Perturbation Interval Model

To investigate the performance when model parameters can only be placed inside "confidence intervals" (i.e., the Perturbation Interval model), we carry out experiments on two networks, MemeTracker-DIC and STOCFOCS, introduced in Section 3.2.

Following the approach in the previous section, for both networks, we assign "confidence intervals" $I_e = [(1-q)\theta_e, (1+q)\theta_e]$, where the $\theta_e$ are the inferred activation probabilities. If $(1+q)\theta_e > 1$, we truncate its value to 1. For experiments on the MemeTracker-DIC network, we set $q \in \{10\%, 20\%, 30\%, \ldots, 100\%\}$, while we use a coarser grid for the experiments on the large graph STOCFOCS, with $q \in \{5\%, 10\%, 20\%, 50\%, 100\%\}$.

While Lemma 7.4 guarantees that the worst-case instances have activation probabilities $(1-q)\theta_e$ or $(1+q)\theta_e$, this still leaves $2^{|E|}$ candidate functions, too many to include. We generate an instance for our experiments by sampling 10 of these functions uniformly, i.e., by independently making each edge's activation probability either $(1-q)\theta_e$ or $(1+q)\theta_e$. This collection is augmented by two more instances: one where all edge probabilities are $(1-q)\theta_e$, and one where all probabilities are $(1+q)\theta_e$. Notice that with the inclusion of these two instances, the All Greedy heuristic generalizes the LUGreedy algorithm of Chen et al. [25], but might provide strictly better solutions on the selected instances, because it explicitly considers those additional instances. The algorithms get to select 20 seed nodes; note that in these experiments, we are not considering a bicriteria approximation.

Figure 7.6: Performance of the algorithms (a) under different delay distributions following the CIC model, and (b) under different classes of diffusion models. The x-axis shows the number of seeds selected, and $k=10$.

The results are shown in Figures 7.7(a) and 7.7(b). Contrary to the previous results, when there is a lot of uncertainty about the edge parameters (relative interval size 100% in both networks), the Saturate Greedy algorithm more clearly outperforms the Single Greedy and All Greedy heuristics. Thus, robust optimization does appear to become necessary when there is a lot of uncertainty about the model's parameters.

Notice that the evaluation of the algorithms' seed sets is performed only with respect to the sampled influence functions, not with respect to all $2^{|E|}$ functions. Whether one can efficiently identify a worst-case parameter setting for a given seed set $S$ is an intriguing open question. Absent this ability, we cannot efficiently guarantee that the solutions are actually good with respect to all parameter settings.

Figure 7.7: Performance of the algorithms on networks sampled from the Perturbation Interval model: (a) MemeTracker-DIC network ($k = 20$); (b) STOCFOCS network ($k = 50$). (The x-axis shows the (relative) size of the perturbation interval $I_e$.)
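Generating such a sampled collection of candidate functions is straightforward; the following minimal sketch assumes the inferred probabilities $\theta_e$ are stored in a dictionary keyed by edge (the names and representation are illustrative only, not our actual pipeline).

import random

def sample_pim_instances(theta, q, n_samples=10, rng_seed=0):
    # theta: dict mapping edge -> inferred activation probability theta_e.
    # Returns n_samples random corner assignments (each theta_e moved to
    # (1-q)*theta_e or (1+q)*theta_e), plus the all-low and all-high corners.
    rng = random.Random(rng_seed)
    clip = lambda p: min(1.0, p)  # truncate probabilities above 1
    instances = [{e: clip(p * (1 + rng.choice((-q, q))))
                  for e, p in theta.items()} for _ in range(n_samples)]
    instances.append({e: clip(p * (1 - q)) for e, p in theta.items()})
    instances.append({e: clip(p * (1 + q)) for e, p in theta.items()})
    return instances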
We generate networks using the Kronecker graph model [90] to generate Erds-Rényi networks, core-peripheral net- works and networks with hierarchical community structure. For each type, we gen- erate a set of 5,10,15,20,25 networks of sizes 128,256,...,4096. We use the DIC model with activation probability set to 0.1, and select k=50 nodes. The running times of the three algorithms are shown in Figure 7.8 and Figure 7.9. In Figure 7.8, wefixthenumberofnetworkstofiveandvarythesizeofeachnetwork; inFigure7.9, wefixthesizeofthenetworksto 1024andvarythenumberofnetworks. Thegraphs showthattheheuristicsarefasterthantheSaturateGreedy algorithmbyabout a factor of ten, but all three algorithms scale linearly both in the size of the graph and the number of networks, due to the fast influence estimation method. 212 10 100 1000 10000 128 256 512 1024 2048 4096 Single/Greedy All/Greedy Saturate/Greedy (a) Erds-Rényi 10 100 1000 10000 128 256 512 1024 2048 4096 Single/Greedy All/Greedy Saturate/Greedy (b) Core-peripheral 10 100 1000 10000 128 256 512 1024 2048 4096 Single/Greedy All/Greedy Saturate/Greedy (c) Hierarchical community Figure 7.8: Running times on Kronecker graph networks with dierent structures. The x axis represents the number of nodes, and the y-axis is the running time in seconds, both plotted on a log scale. 10 100 1000 10000 100000 5 10 15 20 25 Single+Greedy All+Greedy Saturate+Greedy (a) Erds-Rényi 10 100 1000 10000 100000 5 10 15 20 25 Single+Greedy All+Greedy Saturate+Greedy (b) Core-peripheral 10 100 1000 10000 5 10 15 20 25 Single+Greedy All+Greedy Saturate+Greedy (c) Hierarchical community Figure 7.9: Running times on Kronecker graph networks with dierent structures. Thex axis represents the number of diusion settings, and the y-axis is the running time in seconds, both plotted on a log scale. 213 Chapter 8 Conclusion and Future Work 214 In this thesis, we studied diusion behaviors in social networks to achieve real practicality by answering the following two questions. (1) How to utilize special properties of social networks to improve the accuracy of extracted networks from noisy and missing data? (2) How to characterize the impact of noise in the inferred networksandcarryoutrobustanalysisandoptimization? Thefirstpartofthethesis focusedonimprovingtheaccuracyofNetworkInferenceundernoisy,incompleteand limited observations. Moreover, we explored the rich content information contained in cascades to further improve inference accuracy. In the second part, we focused on the problem of Influence Maximization when there is noise and uncertainty in the influence functions. We first proposed a framework to measure the stability of Influence Maximization instances in terms of the noise level in the inferred diusion networks. We then designed an ecient Robust Influence Maximization algorithm to find a set of influential users under uncertainty in the influence functions. We demonstrated the eciency and eectiveness of the proposed solutions in terms of the improved inference accuracy and better influence coverage. 8.1 Contributions and Limitations The major contributions of this thesis are four-fold: 1. First, we demonstrated that incomplete observations have a significant impact on the accuracy of existing Network Inference algorithms via experiments. To mitigate the eects of incompleteness, we designed algorithms for both properandimproperlearningofinfluencefunctionsunderincompletecascades. We provided theoretical analysis to establish PAC learnability with sample complexity bounds. 
2. Second, in order to accurately infer diffusion networks under data scarcity, we designed a hierarchical graphical model referred to as the MultiCascades model. Our model mitigates data scarcity with a shared prior and incorporates prior knowledge on the network structure via the choice of network generation priors. Moreover, we showed that joint inference under the MCM can be carried out efficiently via the EM algorithm.

3. Third, in order to utilize the rich content information in cascades to improve Network Inference accuracy, we proposed the HawkesTopic model to analyze text-based cascades by combining temporal and content information. We provided a joint variational inference algorithm under the HTM to simultaneously infer the diffusion network as well as discover the thematic topics of the documents.

4. Finally, we observed that the noise in inferred diffusion networks has a significant impact on the performance of existing Influence Maximization algorithms. To quantitatively measure the stability of Influence Maximization instances, we proposed a framework with the Perturbation Interval model to quantify the noise in the inferred diffusion networks. We then designed an efficient algorithm for Robust Influence Maximization to find influential users robust in multiple diffusion settings. We carried out theoretical analysis on the hardness of the Robust Influence Maximization problem and proved an approximation guarantee for our algorithm.

These contributions confirm that (1) Network Inference accuracy can be improved under noisy and incomplete observations, and (2) robust algorithms can be designed for social network analysis and optimization under noisy parameters.

However, there are a few limitations to the proposed solutions. At a fundamental level, most of the algorithms and analyses are based on the assumption that the cascades follow certain diffusion models. For example, the MCM in Chapter 5 assumes that the cascades follow the CIC-Delay model, and the HTM in Chapter 6 assumes that the cascades follow an extension of the Hawkes process model. The learnability of the influence functions and the sample complexity bounds in Chapter 4 rely on the assumption that input cascades follow the diffusion models precisely. As we have discussed in Chapter 1, in most cases the mathematical models of diffusion processes on social networks are at best approximations of reality, and frequently mere guesses or mathematically convenient inventions. When the real-world diffusion deviates from the assumed diffusion model, or even from all existing diffusion models, the performance of the inference and optimization algorithms depends on the degree to which our models capture reality. In such cases, there are no theoretical guarantees on the performance of the algorithms.

Some more practical limitations are as follows:

1. In Chapter 4, we assumed randomly and independently missing activities. A much more significant departure for future work would be the dependent loss of activities. Though our results generalize to adversarially missing nodes, we have the strong constraint that the lost nodes cannot be the seeds that initiate cascades, which does not hold in real-world applications.

2. In Chapter 7, the question whether one can efficiently find an (approximately) worst-case influence function in the Perturbation Interval model remains unresolved. As a result, the current empirical analysis only shows that there is instability when the level of noise is high, but does not guarantee the stability of an instance under a small level of noise.
An answer to this question would allow us to empirically evaluate the performance of natural heuristics for the Perturbation Interval model, such as randomly sampling a small number of influence functions.

3. While in this work we have focused on making our inference algorithms in Chapters 4, 5 and 6 and optimization algorithms in Chapter 7 scalable, given the growing size of social network datasets, we still need to find new ways to make our solutions further scalable and parallelizable in distributed computing environments.

8.2 Future Work

It is exciting to explore several other directions to extend the contributions of our work:

Diffusion Model Selection. Traditional influence analysis takes a two-step approach: first, a diffusion network is inferred according to a particular diffusion model, and then an analysis is carried out on the inferred network. The success of this approach depends on one important assumption: that the probabilistic influence model captures the real-world diffusion phenomena accurately. This suggests that a systematic method is needed to identify which influence model best captures the behavior of real-world cascades, and under what circumstances. It is quite likely that different models will perform differently depending on the type of cascade and many other factors, and in-depth evaluations of the models could give practitioners more guidance on which mathematical models to choose. While our Robust Influence Maximization model proposed in Chapter 7 allows us to combine instances of different models (e.g., DIC, CIC and DLT), this may come at a cost of decreased performance for each of the models individually. Thus, it remains an important task to identify the influence models that best fit real-world data.

The model selection problem has been extensively studied in statistics. Popular methods include model quality measures like the Akaike Information Criterion (AIC) [21] and the Bayesian Information Criterion (BIC) [126], cross-validation procedures [6], and Minimum Description Length [65]. However, the diffusion model selection problem has not been widely studied. As far as we know, the most relevant work is by Saito et al. [123]. They consider the problem of selecting between the AsLT model and the CIC-Delay model. The traditional model quality measures like AIC and BIC cannot be used, as different models provide completely different probability distributions over the cascades. Instead, they propose a method based on a hold-out strategy, which attempts to predict future activities. As a first step, the parameters of the diffusion networks under both models are inferred using Network Inference algorithms. Given a test cascade observed up to time step $t$, both models are used to predict the activation probabilities for nodes activated in the time interval $[t, t+\tau]$ with the inferred parameters. The KL-divergence is used to measure the accuracy of the predictions. The model with the lower KL-divergence, averaged over all test cascades and choices of $t$, is selected. The success of using the hold-out strategy for model selection depends on accurate inference of the diffusion model parameters. When there are only a few cascades, the above method may select the wrong model if the inferred model parameters are not accurate.

Instead of fitting the model parameters, it might be promising to use machine learning techniques to directly infer the diffusion models from cascades as a classification problem.
The training data of the model classifier would be multiple sets of synthetic cascades, and the labels would be the models used to generate the cascades. The main difficulties of the above approach are two-fold. First, the cascades generated from different diffusion models may vary significantly across different diffusion networks. It is very likely that two cascades under the same network from different models are much more similar to each other than two cascades following the same model under two different networks. It is not clear whether the machine learning model can disentangle the effect of diffusion models from different diffusion networks. Second, it is not clear how the input, i.e., the set of cascades, should be represented as features to be fed into the classifier.

We consider deep learning as one possible solution to handle both challenges. For the first challenge, deep neural networks have demonstrated their ability to capture very complex patterns in applications such as image classification and speech recognition [86]. Deep learning models might also serve as a solution to the second challenge via cascade embedding. Since a cascade can be considered as a time series, it can be embedded as a vector using time series embedding techniques (e.g., the LSTM model [107, 114]). Then the cascades in the set can be aggregated into a single feature vector using strategies similar to those used to obtain sentence or document embeddings from word vectors [85].

Besides diffusion model selection, we need to make other modeling choices. One important example is the selection of data loss models discussed in Chapter 4. It turns out that selecting the correct data loss model is even harder, as it poses a "chicken and egg" problem: in order to form models of data loss, we need some data for which we know what was lost. But the reason we need data loss models is that there is always data loss, so the problem is where to get data that will allow us to figure out how much data gets lost and by what mechanisms. One possible solution is to collect a small amount of high-quality and complete data and compare it to the incomplete observations to infer the data missing mechanism. However, even collecting a small amount of complete data may be very expensive, or even impossible due to data privacy issues. Therefore, it remains an interesting and important question to design effective and affordable methods for the selection of data loss models.

Model-free Influence Analysis. Another promising solution to the noise and uncertainty of the inferred diffusion networks is to directly solve influence analysis tasks such as Influence Maximization from raw observations. Model-free influence analysis skips the steps of assuming a certain diffusion model and inferring its parameters altogether, significantly reducing the uncertainty and noise on the way to the final solution. For the Influence Maximization problem, an end-to-end approach is proposed by [59] to select influential users directly from observed cascades without the assumption of any diffusion model. However, the results are purely empirical, without any guarantees on the performance of the algorithm. On the other hand, Du et al. propose a model-free influence function learning algorithm [35] which is free from the risk of model misspecification. However, it may require an exponential number of cascades in the worst case to reach an $\epsilon$-approximation, even if the true influence function is derived from the simple DIC model.
It remains an interesting question to investigate algorithms with only poly- nomial sample complexity and approximation guarantees. One potential solution is to explore the structure and properties of real-world cascades to limit the space of possible influence functions in order to achieve better learnability than general submodular functions [8]. 221 Reference List [1] Abrahao, B., Chierichetti, F., Kleinberg, R., and Panconesi, A. (2013). Trace complexityofnetworkinference. InProc. 19th Intl. Conf. on Knowledge Discovery and Data Mining, pages 491–499. [2] Adiga, A., Kuhlman, C. J., Mortveit, H. S., and Vullikanti, A. K. S. (2013). Sensitivity of diusion dynamics to network uncertainty. In Proc. 28th AAAI Conf. on Artificial Intelligence. [3] Allison, P. D. (2002). Missing data: Quantitative applications in the social sciences. British Journal of Mathematical and Statistical Psychology, 55(1):193– 196. [4] Alon, N., Benjamini, I., and Stacey, A. (2004). Percolation on finite graphs and isoperimetric inequalities. Annals of Probability, pages 1727–1745. [5] Amin,K.,Heidari,H.,andKearns,M.(2014). Learningfromcontagion(without timestamps). In Proc. 31st Intl. Conf. on Machine Learning, pages 1845–1853. [6] Arlot, S., Celisse, A., et al. (2010). A survey of cross-validation procedures for model selection. Statistics surveys, 4:40–79. [7] Arora, A., Galhotra, S., and Ranu, S. (2017). Debunking the myths of influence maximization: An in-depth benchmarking study. In Proc. 37th ACM SIGMOD Intl. Conference on Management of Data, pages 651–666. [8] Balcan, M.-F. and Harvey, N. J. (2010). Submodular functions: Learnability, structure, and optimization. arXiv preprint arXiv:1008.2159. [9] Barabási,A.-L.andAlbert,R.(1999). Emergenceofscalinginrandomnetworks. Science, 286:509–512. [10] Barabási, A.-L., Albert, R., and Jeong, H. (1999). Mean-field theory for scale- free random networks. Physica A: Statistical Mechanics and its Applications, 272(1):173–187. 222 [11] Baxter, J. (1997). A bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1):7–39. [12] Berger, N., Bollobás, B., Borgs, C., Chayes, J., and Riordan, O. (2003). Degree distribution of the fkp network model. In Proc. 30th Intl. Colloq. on Automata, Languages and Programming, pages 725–738. [13] Bharathi, S., Kempe, D., and Salek, M. (2007). Competitive influence maxi- mization in social networks. In Proc. 3rd Conf. on Web and Internet Economics, pages 306–311. [14] Blei, D. and Laerty, J. (2006a). Correlated topic models. In Proc. 18th Advances in Neural Information Processing Systems, pages 147–152. [15] Blei, D., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022. [16] Blei, D. M. and Laerty, J. D. (2006b). Dynamic topic models. In Proc. 23rd Intl. Conf. on Machine Learning, pages 113–120. [17] Borgatti, S. P., Mehra, A., Brass, D. J., and Labianca, G. (2009). Network analysis in the social sciences. science, 323(5916):892–895. [18] Borgs, C., Brautbar, M., Chayes, J. T., and Lucier, B. (2014). Maximizing social influence in nearly optimal time. In Proc. 25th ACM-SIAM Symp. on Discrete Algorithms, pages 946–957. [19] Bourigault, S., Lagnier, C., Lamprier, S., Denoyer, L., and Gallinari, P. (2014). Learning social network embeddings for predicting information diusion. In Proc. 7th ACM Intl. Conf. on Web Search and Data Mining, pages 393–402. [20] Bourigault, S., Lamprier, S., and Gallinari, P. (2016). 
Representation learning for information diffusion through social networks: An embedded cascade model. In Proc. 9th ACM Intl. Conf. on Web Search and Data Mining, pages 573–582.

[21] Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52(3):345–370.

[22] Buchbinder, N., Feldman, M., Naor, J. S., and Schwartz, R. (2014). Submodular maximization with cardinality constraints. In Proc. 25th ACM-SIAM Symp. on Discrete Algorithms, pages 1433–1452.

[23] Campbell, K. E. and Lee, B. A. (1991). Name generators in surveys of personal networks. Social Networks, 13(3):203–221.

[24] Chen, W., Lakshmanan, L. V. S., and Castillo, C. (2013). Information and Influence Propagation in Social Networks. Morgan & Claypool.

[25] Chen, W., Lin, T., Tan, Z., Zhao, M., and Zhou, X. (2016a). Robust influence maximization. In Proc. 22nd Intl. Conf. on Knowledge Discovery and Data Mining.

[26] Chen, W., Wang, Y., and Yang, S. (2009). Efficient influence maximization in social networks. In Proc. 15th Intl. Conf. on Knowledge Discovery and Data Mining, pages 199–208.

[27] Chen, W., Wang, Y., Yuan, Y., and Wang, Q. (2016b). Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. Journal of Machine Learning Research, 17(50):1–33.

[28] Chen, W., Yuan, Y., and Zhang, L. (2010). Scalable influence maximization in social networks under the linear threshold model. In Proc. 10th Intl. Conf. on Data Mining, pages 88–97.

[29] Chierichetti, F., Kleinberg, J. M., and Liben-Nowell, D. (2011). Reconstructing patterns of information diffusion from incomplete observations. In Proc. 23rd Advances in Neural Information Processing Systems, pages 792–800.

[30] Davis, J. and Goadrich, M. (2006). The relationship between precision-recall and ROC curves. In Proc. 23rd Intl. Conf. on Machine Learning, pages 233–240.

[31] Dietz, L., Bickel, S., and Scheffer, T. (2007). Unsupervised prediction of citation influences. In Proc. 24th Intl. Conf. on Machine Learning, pages 233–240.

[32] Ding, W., Zhang, Y., Chen, C., and Hu, X. (2016). Semi-supervised Dirichlet-Hawkes process with applications of topic detection and tracking in Twitter. In Big Data (Big Data), 2016 IEEE International Conference on, pages 869–874.

[33] Dinur, I. and Steurer, D. (2014). Analytical approach to parallel repetition. In Proc. 46th ACM Symp. on Theory of Computing, pages 624–633.

[34] Dong, X. L., Berti-Equille, L., Hu, Y., and Srivastava, D. (2010). Global detection of complex copying relationships between sources. Proc. VLDB Endow., 3(1-2):1358–1369.

[35] Du, N., Liang, Y., Balcan, M.-F., and Song, L. (2014). Influence function learning in information diffusion networks. In Proc. 31st Intl. Conf. on Machine Learning.

[36] Du, N., Song, L., Gomez-Rodriguez, M., and Zha, H. (2013a). Scalable influence estimation in continuous-time diffusion networks. In Proc. 25th Advances in Neural Information Processing Systems, pages 3147–3155.

[37] Du, N., Song, L., Woo, H., and Zha, H. (2013b). Uncover topic-sensitive information diffusion networks. In Proc. 16th Intl. Conf. on Artificial Intelligence and Statistics.

[38] Du, N., Song, L., Yuan, S., and Smola, A. J. (2012). Learning networks of heterogeneous influence. In Proc. 24th Advances in Neural Information Processing Systems, pages 2780–2788.

[39] Enders, C. K. (2010). Applied Missing Data Analysis. Guilford Press.

[40] Erdős, P. and Rényi, A. (1960). On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60.
[41] Even-Dar, E. and Shapira, A. (2007). A note on maximizing the spread of influence in social networks. In Proc. 3rd Conf. on Web and Internet Economics, pages 281–286.
[42] Evgeniou, T. and Pontil, M. (2004). Regularized multi-task learning. In Proc. 10th Intl. Conf. on Knowledge Discovery and Data Mining, pages 109–117.
[43] Farajtabar, M., Du, N., Gomez-Rodriguez, M., Valera, I., Zha, H., and Song, L. (2014). Shaping social activity by incentivizing users. In Proc. 26th Advances in Neural Information Processing Systems, pages 2474–2482.
[44] Farajtabar, M., Wang, Y., Rodriguez, M., Li, S., Zha, H., and Song, L. (2015). Coevolve: A joint point process model for information diffusion and network co-evolution. In Proc. 27th Advances in Neural Information Processing Systems, pages 1945–1953.
[45] Farajtabar, M., Yang, J., Ye, X., Xu, H., Trivedi, R., Khalil, E., Li, S., Song, L., and Zha, H. (2017). Fake news mitigation via point process based intervention. arXiv preprint arXiv:1703.07823.
[46] Feige, U. (1998). A threshold of ln n for approximating set cover. Journal of the ACM, 45(4):634–652.
[47] Foulds, J. and Smyth, P. (2013). Modeling scientific impact with topical influence regression. In Proc. 2013 Conf. on Empirical Methods in Natural Language Processing, pages 113–123.
[48] Friedkin, N. E. and Johnsen, E. C. (1990). Social influence and opinions. Journal of Mathematical Sociology, 15(3-4):193–206.
[49] Goldenberg, J., Libai, B., and Muller, E. (2001a). Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters, 12(3):211–223.
[50] Goldenberg, J., Libai, B., and Muller, E. (2001b). Using complex systems analysis to advance marketing theory development: Modeling heterogeneity effects on new product growth through stochastic cellular automata. Academy of Marketing Science Review, 2001:1.
[51] Gomez-Rodriguez, M., Balduzzi, D., and Schölkopf, B. (2011). Uncovering the temporal dynamics of diffusion networks. In Proc. 28th Intl. Conf. on Machine Learning, pages 561–568.
[52] Gomez-Rodriguez, M., Leskovec, J., and Krause, A. (2012). Inferring networks of diffusion and influence. ACM Transactions on Knowledge Discovery from Data, 5(4).
[53] Gomez-Rodriguez, M., Leskovec, J., and Schölkopf, B. (2013). Modeling information propagation with survival theory. In Proc. 30th Intl. Conf. on Machine Learning.
[54] Gomez-Rodriguez, M., Leskovec, J., and Schölkopf, B. (2013). Structure and dynamics of information pathways in online media. In Proc. 6th ACM Intl. Conf. on Web Search and Data Mining, pages 23–32.
[55] Gomez-Rodriguez, M. and Schölkopf, B. (2012). Submodular inference of diffusion networks from multiple trees. In Proc. 29th Intl. Conf. on Machine Learning.
[56] Gomez-Rodriguez, M., Song, L., Daneshmand, H., and Schölkopf, B. (2014). Estimating diffusion network structures: Recovery conditions, sample complexity & soft-thresholding algorithm. In Proc. 31st Intl. Conf. on Machine Learning.
[57] Goyal, A., Bonchi, F., Lakshmanan, L. V., and Venkatasubramanian, S. (2013). On minimizing budget and time in influence propagation over social networks. Social Network Analysis and Mining, 3(2):179–192.
[58] Goyal, A., Bonchi, F., and Lakshmanan, L. V. S. (2010). Learning influence probabilities in social networks. In Proc. 3rd ACM Intl. Conf. on Web Search and Data Mining, pages 241–250.
[59] Goyal, A., Bonchi, F., and Lakshmanan, L. V. S. (2011a). A data-based approach to social influence maximization. Proc. VLDB Endow., 5(1):73–84.
[60] Goyal, A., Lu, W., and Lakshmanan, L. V. S. (2011b). CELF++: Optimizing the greedy algorithm for influence maximization in social networks. In 20th Intl. World Wide Web Conference, pages 47–48.
[61] Goyal, S. and Kearns, M. (2012). Competitive contagion in networks. In Proc. 44th ACM Symp. on Theory of Computing, pages 759–774.
[62] Granovetter, M. (1978). Threshold models of collective behavior. American Journal of Sociology, 83:1420–1443.
[63] Gu, Q. and Zhou, J. (2009). Learning the shared subspace for multi-task clustering and transductive transfer classification. In Proc. 9th Intl. Conf. on Data Mining, pages 159–168.
[64] Guo, F., Blundell, C., Wallach, H. M., and Heller, K. A. (2015). The Bayesian echo chamber: Modeling social influence via linguistic accommodation. In Proc. 18th Intl. Conf. on Artificial Intelligence and Statistics.
[65] Hansen, M. H. and Yu, B. (2001). Model selection and the principle of minimum description length. Journal of the American Statistical Association, 96(454):746–774.
[66] Harel, O. (2003). Strategies for data analysis with two types of missing values. PhD thesis.
[67] Håstad, J. (1999). Clique is hard to approximate within n^{1−ε}. Acta Mathematica, 182:105–142.
[68] He, X. and Kempe, D. (2013). Price of anarchy for the n-player competitive cascade game with submodular activation functions. In Proc. 9th Conf. on Web and Internet Economics, pages 232–248.
[69] He, X., Rekatsinas, T., Foulds, J., Getoor, L., and Liu, Y. (2015). HawkesTopic: A joint model for network inference and topic modeling from text-based cascades. In Proc. 32nd Intl. Conf. on Machine Learning, pages 871–880.
[70] Hernández-Lobato, D. and Hernández-Lobato, J. M. (2013). Learning feature selection dependencies in multi-task learning. In Proc. 25th Advances in Neural Information Processing Systems, pages 746–754.
[71] Hoff, P. D., Raftery, A. E., and Handcock, M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098.
[72] Hosseini, S. A., Khodadadi, A., Arabzade, S., and Rabiee, H. R. (2016). HNP3: A hierarchical nonparametric point process for modeling content diffusion over social media. arXiv preprint arXiv:1610.00246.
[73] Iwata, T., Shah, A., and Ghahramani, Z. (2013). Discovering latent influence in online social activities via shared cascade Poisson processes. In Proc. 19th Intl. Conf. on Knowledge Discovery and Data Mining, pages 266–274.
[74] Jackson, M. O. (2010). Social and Economic Networks. Princeton University Press.
[75] Jiang, Q., Song, G., Cong, G., Wang, Y., Si, W., and Xie, K. (2011). Simulated annealing based influence maximization in social networks. In Proc. 26th AAAI Conf. on Artificial Intelligence.
[76] Jung, K., Heo, W., and Chen, W. (2012). IRIE: Scalable and robust influence maximization in social networks. In Proc. 12th Intl. Conf. on Data Mining, pages 918–923.
[77] Kempe, D., Kleinberg, J., and Tardos, E. (2003). Maximizing the spread of influence in a social network. In Proc. 9th Intl. Conf. on Knowledge Discovery and Data Mining, pages 137–146.
[78] Kempe, D., Kleinberg, J., and Tardos, E. (2005). Influential nodes in a diffusion model for social networks. In Proc. 32nd Intl. Colloq. on Automata, Languages and Programming, pages 1127–1138.
[79] Kermarrec, A.-M., Leroy, V., and Trédan, G. (2011). Distributed social graph embedding. In Proc. 20th ACM Intl. Conf. on Information and Knowledge Management, pages 1209–1214.
[80] Kesten, H. (1990). Asymptotics in high dimensions for percolation. Disorder in Physical Systems, pages 219–240.
[81] Krause, A. and Golovin, D. (2012). Submodular Function Maximization. Cambridge University Press.
[82] Krause, A., McMahan, H. B., Guestrin, C., and Gupta, A. (2008). Robust submodular observation selection. Journal of Machine Learning Research, 9:2761–2801.
[83] Kurashima, T., Iwata, T., Takaya, N., and Sawada, H. (2014). Probabilistic latent network visualization: Inferring and embedding diffusion networks. In Proc. 20th Intl. Conf. on Knowledge Discovery and Data Mining, pages 1236–1245.
[84] Kutzkov, K., Bifet, A., Bonchi, F., and Gionis, A. (2013). STRIP: Stream learning of influence probabilities. In Proc. 19th Intl. Conf. on Knowledge Discovery and Data Mining, pages 275–283.
[85] Le, Q. V. and Mikolov, T. (2014). Distributed representations of sentences and documents. In Proc. 31st Intl. Conf. on Machine Learning, volume 14, pages 1188–1196.
[86] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.
[87] Lee, S.-I., Chatalbashev, V., Vickrey, D., and Koller, D. (2007). Learning a meta-level prior for feature relevance from multiple related tasks. In Proc. 24th Intl. Conf. on Machine Learning, pages 489–496.
[88] Lei, S., Maniu, S., Mo, L., Cheng, R., and Senellart, P. (2015). Online influence maximization. In Proc. 21st Intl. Conf. on Knowledge Discovery and Data Mining, pages 645–654.
[89] Leskovec, J., Backstrom, L., and Kleinberg, J. (2009). Meme-tracking and the dynamics of the news cycle. In Proc. 15th Intl. Conf. on Knowledge Discovery and Data Mining, pages 497–506.
[90] Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C., and Ghahramani, Z. (2010). Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research, 11:985–1042.
[91] Leskovec, J. and Faloutsos, C. (2006). Sampling from large graphs. In Proc. 12th Intl. Conf. on Knowledge Discovery and Data Mining, pages 631–636.
[92] Leskovec, J., Kleinberg, J., and Faloutsos, C. (2005). Graphs over time: Densification laws, shrinking diameters and possible explanations. In Proc. 11th Intl. Conf. on Knowledge Discovery and Data Mining, pages 177–187.
[93] Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J., and Glance, N. S. (2007). Cost-effective outbreak detection in networks. In Proc. 13th Intl. Conf. on Knowledge Discovery and Data Mining, pages 420–429.
[94] Leskovec, J. and McAuley, J. J. (2012). Learning to discover social circles in ego networks. In Proc. 24th Advances in Neural Information Processing Systems, pages 539–547.
[95] Li, L., Deng, H., Dong, A., Chang, Y., and Zha, H. (2014). Identifying and labeling search tasks via query-based Hawkes processes. In Proc. 20th Intl. Conf. on Knowledge Discovery and Data Mining, pages 731–740.
[96] Li, Y., Chen, W., Wang, Y., and Zhang, Z.-L. (2013). Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In Proc. 6th ACM Intl. Conf. on Web Search and Data Mining, pages 657–666.
[97] Lin, S., Wang, F., Hu, Q., and Yu, P. S. (2013). Extracting social events for learning better information diffusion models. In Proc. 19th Intl. Conf. on Knowledge Discovery and Data Mining, pages 365–373.
[98] Linderman, S. W. and Adams, R. P. (2014). Discovering latent network structure in point process data. In Proc. 31st Intl. Conf. on Machine Learning.
[99] Liniger, T. J. (2009). Multivariate Hawkes Processes. PhD thesis, ETH Zurich.
[100] Little, R. J. and Rubin, D. B. (2014). Statistical Analysis with Missing Data. John Wiley & Sons.
[101] Liu, L., Tang, J., Han, J., Jiang, M., and Yang, S. (2010). Mining topic-level influence in heterogeneous networks. In Proc. 19th ACM Intl. Conf. on Information and Knowledge Management, pages 199–208.
[102] Lokhov, A. Y. (2016). Reconstructing parameters of spreading models from partial observations. arXiv preprint arXiv:1608.08698.
[103] Lokhov, A. Y. and Saad, D. (2016). Optimal deployment of resources for maximizing impact in spreading processes. arXiv preprint arXiv:1608.08278.
[104] Lowalekar, M., Varakantham, P., and Kumar, A. (2016). Robust influence maximization. In Proc. 15th Intl. Conf. on Autonomous Agents and Multiagent Systems, pages 1395–1396.
[105] Lu, W., Xiao, X., Goyal, A., Huang, K., and Lakshmanan, L. V. (2017). Refutations on "Debunking the myths of influence maximization: An in-depth benchmarking study". arXiv preprint arXiv:1705.05144.
[106] Lynn, C. and Lee, D. D. (2016). Maximizing influence in an Ising network: A mean-field optimal solution. In Proc. 28th Advances in Neural Information Processing Systems, pages 2487–2495.
[107] Mueller, J. and Thyagarajan, A. (2016). Siamese recurrent architectures for learning sentence similarity. In Proc. 31st AAAI Conf. on Artificial Intelligence, pages 2786–2792.
[108] Myers, S. A. and Leskovec, J. (2010). On the convexity of latent social network inference. In Proc. 22nd Advances in Neural Information Processing Systems, pages 1741–1749.
[109] Myers, S. A., Zhu, C., and Leskovec, J. (2012). Information diffusion and external influence in networks. In Proc. 18th Intl. Conf. on Knowledge Discovery and Data Mining, pages 33–41.
[110] Narasimhan, H., Parkes, D. C., and Singer, Y. (2015). Learnability of influence in networks. In Proc. 27th Advances in Neural Information Processing Systems, pages 3168–3176.
[111] Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. (2013). Learning with noisy labels. In Proc. 25th Advances in Neural Information Processing Systems, pages 1196–1204.
[112] Neal, R. M. and Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355–368.
[113] Nemhauser, G. L., Wolsey, L. A., and Fisher, M. L. (1978). An analysis of the approximations for maximizing submodular set functions. Mathematical Programming, 14:265–294.
[114] Pei, W., Tax, D. M., and van der Maaten, L. (2016). Modeling time series similarity with Siamese recurrent networks. arXiv preprint arXiv:1603.04713.
[115] Netrapalli, P. and Sanghavi, S. (2012). Learning the graph of epidemic cascades. In Proc. 12th ACM SIGMETRICS Intl. Conf. on Measurement and Modeling of Computer Systems, pages 211–222.
[116] Duong, Q., Wellman, M. P., and Singh, S. (2011). Modeling information diffusion in networks with unobserved links. In SocialCom, pages 362–369.
[117] Robins, G., Pattison, P., Kalish, Y., and Lusher, D. (2007). An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2):173–191.
[118] Rogers, E. M. (2010). Diffusion of Innovations. Simon and Schuster.
[119] Rubin, D. B. (1976). Inference and missing data. Biometrika, pages 581–592.
[120] Sadikov, E., Medina, M., Leskovec, J., and Garcia-Molina, H. (2011). Correcting for missing data in information cascades. In Proc. 4th ACM Intl. Conf. on Web Search and Data Mining, pages 55–64.
[121] Saito, K., Kimura, M., Ohara, K., and Motoda, H. (2009). Learning continuous-time information diffusion model for social behavioral data analysis. In Proc. 1st Asian Conference on Machine Learning, pages 322–337.
[122] Saito, K., Kimura, M., Ohara, K., and Motoda, H. (2010a). Generative models of information diffusion with asynchronous time delay. In ACML, pages 193–208.
[123] Saito, K., Kimura, M., Ohara, K., and Motoda, H. (2010b). Selecting information diffusion models over social networks for behavioral analysis. In Proc. 2010 European Conf. on Machine Learning and Knowledge Discovery in Databases, Part III (ECML/PKDD '10), pages 180–195.
[124] Saito, K., Nakano, R., and Kimura, M. (2008). Prediction of information diffusion probabilities for independent cascade model. In Proc. 12th Intl. Conf. on Knowledge-Based and Intelligent Information & Engineering Systems, pages 67–75.
[125] Sarkar, P. and Moore, A. W. (2006). Dynamic social network analysis using latent space models. In Proc. 18th Advances in Neural Information Processing Systems, page 1145.
[126] Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464.
[127] Simma, A. (2010). Modeling Events in Time Using Cascades of Poisson Processes. PhD thesis, EECS Department, University of California, Berkeley.
[128] Tang, Y., Shi, Y., and Xiao, X. (2015). Influence maximization in near-linear time: A martingale approach. In Proc. 35th ACM SIGMOD Intl. Conference on Management of Data, pages 1539–1554.
[129] Tang, Y., Xiao, X., and Shi, Y. (2014). Influence maximization: Near-optimal time complexity meets practical efficiency. In Proc. 34th ACM SIGMOD Intl. Conference on Management of Data, pages 75–86.
[130] Valente, T. W. (1995). Network Models of the Diffusion of Innovations.
[131] Valente, T. W. (1996). Social network thresholds in the diffusion of innovations. Social Networks, 18(1):69–89.
[132] Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11):1134–1142.
[133] Veen, A. and Schoenberg, F. P. (2008). Estimation of space-time branching process models in seismology using an EM-type algorithm. Journal of the American Statistical Association, 103(482):614–624.
[134] Vondrák, J. (2009). Symmetry and approximability of submodular maximization problems. In Proc. 50th IEEE Symp. on Foundations of Computer Science, pages 651–670.
[135] Wallach, H. M., Murray, I., Salakhutdinov, R., and Mimno, D. (2009). Evaluation methods for topic models. In Proc. 26th Intl. Conf. on Machine Learning, pages 1105–1112.
[136] Wang, C. and Blei, D. M. (2013). Variational inference in nonconjugate models. Journal of Machine Learning Research, 14(1).
[137] Wang, C., Chen, W., and Wang, Y. (2012a). Scalable influence maximization for independent cascade model in large-scale social networks. Data Mining and Knowledge Discovery, 25:545–576.
[138] Wang, L., Ermon, S., and Hopcroft, J. E. (2012b). Feature-enhanced probabilistic models for diffusion network inference. In Proc. 2012 European Conf. on Machine Learning and Knowledge Discovery in Databases, Part II, pages 499–514.
[139] Wang, S., Hu, X., Yu, P. S., and Li, Z. (2014). MMRate: Inferring multi-aspect diffusion networks with multi-pattern cascades. In Proc. 20th Intl. Conf. on Knowledge Discovery and Data Mining, pages 1246–1255.
[140] Wang, Y., Cong, G., Song, G., and Xie, K. (2010). Community-based greedy algorithm for mining top-k influential nodes in mobile social networks. In Proc. 16th Intl. Conf. on Knowledge Discovery and Data Mining, pages 1039–1048.
[141] Wang, Y. J. and Wong, G. Y. (1987). Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 82(397):8–19.
[142] Wasserman, S. and Faust, K. (1994). Social Network Analysis: Methods and Applications, volume 8. Cambridge University Press.
[143] Watts, D. J. and Strogatz, S. (1998). Collective dynamics of 'small-world' networks. Nature, 393:440–442.
[144] Wu, X., Kumar, A., Sheldon, D., and Zilberstein, S. (2013). Parameter learning for latent network diffusion. In Proc. 28th Intl. Joint Conf. on Artificial Intelligence, pages 2923–2930.
[145] Yadav, A., Wilder, B., Rice, E., Petering, R., Craddock, J., Yoshioka-Maxwell, A., Hemler, M., Onasch-Vera, L., Tambe, M., and Woo, D. (2017). Influence maximization in the field: The arduous journey from emerging to deployed application. In Proc. 16th Intl. Conf. on Autonomous Agents and Multiagent Systems, pages 150–158.
[146] Yang, S.-H. and Zha, H. (2013). Mixture of mutually exciting processes for viral diffusion. In Proc. 30th Intl. Conf. on Machine Learning.
[147] Zhou, K., Zha, H., and Song, L. (2013). Learning social infectivity in sparse low-rank network using multi-dimensional Hawkes processes. In Proc. 30th Intl. Conf. on Machine Learning.
Abstract
Online social networks have become a ubiquitous medium for social communication. Analyzing these networks offers great potential to shed light on human social structure and to create better channels for social communication and collaboration. Most social network analysis tasks begin with extracting or learning the social network and its associated parameters, which remains very challenging due to the amorphous nature of social ties and the noise and incompleteness of the observations. As a result, the inferred social network is likely to be inaccurate and noisy, which degrades the performance of any analysis or application that depends on the inferred parameters.

In this thesis, we study the following questions, with a special focus on analyzing diffusion behaviors in social networks: (1) How can special properties of social networks be utilized to improve the accuracy of the extracted network under noisy and missing data? (2) How can the impact of noise in the inferred network be characterized so that robust analysis and optimization can be carried out?

To address the first question, we tackle the challenge of mitigating the impact of incomplete observations, which are pervasive in social data collection. Focusing on learning influence functions under incomplete observations, we design methods for both proper and improper learning under two types of incomplete cascades, with theoretical guarantees on the sample complexity. To address the challenge of data scarcity in inferring diffusion networks, we propose a hierarchical graphical model, the MultiCascade model, which incorporates a shared network-generation prior to jointly and accurately infer multiple diffusion networks from a limited number of observations. To exploit the rich content information in cascades, we propose the HawkesTopic model (HTM), which analyzes text-based cascades by combining temporal and content information, together with a joint variational inference algorithm that simultaneously infers the diffusion network and discovers the thematic topics of the documents.

In the second part of this thesis, we turn to the second question and design robust Influence Maximization algorithms that account for the noise and uncertainty in the inferred network. We first propose a framework to measure the stability of Influence Maximization, using the Perturbation Interval Model to characterize the noise in the inferred diffusion network. We then design an efficient algorithm for Robust Influence Maximization that finds influential users robust across multiple diffusion settings, with a theoretical analysis of the hardness of the problem and approximation guarantees for our algorithm.
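For concreteness, one standard way to formalize the robust objective over a perturbation-interval uncertainty set (a sketch consistent with the robust influence maximization literature cited above; the precise objective in the thesis body may differ) is

\[
\max_{S:\,|S|\le k}\ \rho(S),
\qquad
\rho(S) \;=\; \min_{\theta\in\Theta}\ \frac{\sigma_\theta(S)}{\max_{|S^\ast|\le k}\ \sigma_\theta(S^\ast)},
\]

where \(\Theta\) is the set of edge-parameter vectors with each parameter lying in its perturbation interval, and \(\sigma_\theta(S)\) denotes the expected spread of seed set \(S\) under parameters \(\theta\). A seed set with large \(\rho(S)\) is simultaneously near-optimal under every diffusion setting consistent with the observed data.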
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Diffusion network inference and analysis for disinformation mitigation
Disentangling the network: understanding the interplay of topology and dynamics in network analysis
Artificial intelligence for low resource communities: Influence maximization in an uncertain world
Learning distributed representations from network data and human navigation
Deriving real-world social strength and spatial influence from spatiotemporal data
Computing cascades: how to spread rumors, win campaigns, stop violence and predict epidemics
Physics-aware graph networks for spatiotemporal physical systems
Sharpness analysis of neural networks for physics simulations
Elements of robustness and optimal control for infrastructure networks
Scaling up temporal graph learning: powerful models, efficient algorithms, and optimized systems
Modeling social and cognitive aspects of user behavior in social media
Deep generative models for time series counterfactual inference
Algorithms and frameworks for generating neural network models addressing energy-efficiency, robustness, and privacy
Scalable optimization for trustworthy AI: robust and fair machine learning
Statistical approaches for inferring category knowledge from social annotation
Learning and control for wireless networks via graph signal processing
Alleviating the noisy data problem using restricted Boltzmann machines
Enhancing collaboration on the edge: communication, scheduling and learning
Advanced machine learning techniques for video, social and biomedical data analytics
Modeling and predicting with spatial‐temporal social networks
Asset Metadata
Creator: He, Xinran (author)
Core Title: Understanding diffusion process: inference and theory
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 07/14/2017
Defense Date: 04/18/2017
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: influence function learning, influence maximization, information diffusion, network inference, OAI-PMH Harvest, robust optimization, social network
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Liu, Yan (committee chair), Kempe, David (committee member), Lerman, Kristina (committee member), Valente, Thomas (committee member)
Creator Email: xinranhe@usc.edu, xinranhe1990@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-399374
Unique Identifier: UC11264348
Identifier: etd-HeXinran-5512.pdf (filename), usctheses-c40-399374 (legacy record id)
Legacy Identifier: etd-HeXinran-5512.pdf
Dmrecord: 399374
Document Type: Dissertation
Rights: He, Xinran
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA