Dynamic graph analytics for cyber systems security applications
DYNAMIC GRAPH ANALYTICS FOR CYBER SYSTEMS SECURITY APPLICATIONS

by Charith Dhanushka Wickramaarachchi

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2018

Copyright 2018 Charith Dhanushka Wickramaarachchi

Dedication

To my parents for their sacrifices and support. To my wife for her understanding, love, and encouragement.

Acknowledgments

I would like to express my deep and sincere gratitude to my advisor, Prof. Viktor K. Prasanna for his continual support, patience, kindness, and encouragement. His continual guidance helped me a lot to stay on track, improve my skills, and understand my limitations. I am also grateful to Prof. Cauligi Raghavendra and Prof. Aiichiro Nakano for serving on my qualifier and thesis committee and for their guidance. Also, I am grateful to Prof. Rajgopal Kannan for serving on my thesis committee and his guidance in the last two years to understand my limitations and expand my skill set. I am grateful to Prof. Yogesh Simmhan for guiding me during the initial years of my Ph.D. and introducing me to the area of large-scale graph processing. I am thankful to Prof. Marc Frincu for his guidance and encouragement. I want to express my deep gratitude to Prof. Charalampos Chelmis for his patience in guiding me during the last three years. I am also grateful to be a team member of our brilliant research group at USC, especially: Alok Kumbhare, Ranjan Pal, Sanmukh Rao Kuppannagari and Ajitesh Srivastava for valuable discussions. Finally, I wish to thank Lizsl De Leon, Kathryn Kassar, and Janice Thompson for their help in administrative work.

I am grateful to my family for their love and encouragement, especially in hard times. I wish to thank my parents, my brother and my wife.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 Cyber Systems as Graphs
    1.1.1 Cyber Networks
    1.1.2 Online Social and Communication Networks
    1.1.3 Smart Power Grids
  1.2 The Dynamic Nature of the Graph Representations
  1.3 Cyber Attacks
  1.4 Need for Low Latency Dynamic Graph Analytics
  1.5 Thesis Statement
  1.6 Research Contributions
  1.7 Thesis Outline
2 Background and Related Work
  2.1 Dynamic Graphs
  2.2 Distributed Graph Processing Models
    2.2.1 Vertex Centric Model
    2.2.2 Sub-Graph Centric Model
  2.3 Incremental Graph Processing
  2.4 Subgraph Matching
    2.4.1 Exact Subgraph Matching
    2.4.2 Graph Simulation Matching
  2.5 Related Research
3 Structural Group Membership Monitoring in Dynamic Graphs
  3.1 Introduction
  3.2 Problem Formulation
  3.3 Proposed Solution
  3.4 Data Structures and Algorithms
    3.4.1 Handling Edge Removals
    3.4.2 Handling Edge Additions
  3.5 Correctness and Complexity
  3.6 Asynchronous Execution
  3.7 Evaluations
    3.7.1 Datasets
    3.7.2 Evaluation Metrics
    3.7.3 Results
  3.8 Applicability
  3.9 Summary
4 Exact Subgraph Matching in Dynamic Graphs
  4.1 Introduction
  4.2 Problem Formulation
  4.3 Incremental Subgraph Isomorphism Matching In Dynamic Graphs
  4.4 Distributed Graph Pruning
    4.4.1 An Illustrative Example
  4.5 Evaluations
    4.5.1 Datasets
    4.5.2 Evaluation Metrics
    4.5.3 Results
  4.6 Summary
5 Dynamic Variant Steiner Tree Heuristics
  5.1 Introduction
  5.2 Background
  5.3 Improved Protection Against Data Spoofing Attacks
  5.4 Adaptive Protection Schemes
    5.4.1 Minimum protection cost tree for local risk predictions (MPT-Local)
    5.4.2 Minimum protection cost trees for a time window of risk predictions (MPT-Window)
  5.5 Proposed Heuristics
    5.5.1 Heuristic for MPT-Local and MPT-Window
    5.5.2 Complexity Analysis
  5.6 Scaling for Large Transmission Networks
    5.6.1 Communication Optimizations
  5.7 Evaluations
    5.7.1 Setup
    5.7.2 Evaluations Metrics
    5.7.3 Results
  5.8 Other Applications
  5.9 Summary
6 Conclusions
  6.1 Broader Impact
  6.2 Future Work
Reference List

List of Tables

3.1 Definitions: cs, sc and ss edge updates
3.2 Symbols and their definitions
3.3 Changes in data structures for edge removal (2,4) based on SIM. $M_C = M_{Children}$, '-' denotes no change, v: vertex id and i: iteration.
3.4 Changes in data structures for edge removal (1,2) based on D-SIM. $M_C = M_{Children}$, $M_P = M_{Parents}$, '-' denotes no change, v: vertex id and i: iteration.
3.5 Datasets
4.1 Symbols and their definitions
4.2 Datasets
5.1 Sizes of different power system test cases.
5.2 $R_C$ for time windows 3 and 6 with random protection cost and criticality assignments. SD = standard deviation, SKW = skewness.
5.3 Approximation ratios for simulations on IEEE-9 bus test case with PMUs at 25%, 50% and 75% of the buses with random protection cost and criticality assignments. SD = standard deviation, SKW = skewness.

List of Figures

1.1 State of the art cyber systems.
1.2 A graph representation of a cyber network.
1.3 A graph representation of an online social network.
1.4 An example graph representation of IEEE 14 bus system.
1.5 A cyber attack pattern and group of nodes subject to the attack.
1.6 Distributed denial of service smurf attack [32].
1.7 Distributed denial of service DNS amplification attack [32].
2.1 A series of snapshots of an example dynamic graph.
2.2 Execution of vertex centric connected components algorithm on a small graph.
2.3 Subgraph centric model.
2.4 Execution of subgraph centric connected components algorithm on a small graph.
2.5 Comparison of subgraph isomorphism, graph simulation and dual simulation.
3.1 Illustrative examples of structural group memberships based on graph simulation and dual simulation
3.2 Execution model for structural group membership monitoring based on the vertex-centric bulk synchronous parallel.
3.3 State of vertex 7 in Figure 3.1 for SIM and D-SIM.
3.4 Query graph, data graph and initial states of data structures. $M_C = M_{Children}$, $M_P = M_{Parents}$
3.5 Number of messages and percentage of savings in the number of messages.
3.6 Number of vertex activations and the percentage of savings in the number of vertex activations.
3.7 Comparison of α and Θ(α) on a RMAT dataset with and without activating vertices with no incident edge updates in the first iteration.
3.8 The number of iterations and percentage of savings in the number of iterations.
3.9 Performance on various datasets (CTR, LJ and RMAT ($|V| = 2^{23}$ and $|E| = 2^{26}$)) for general query graphs.
3.10 Impact of number of edge updates.
3.11 Impact of query graph.
3.12 Impact of data graph density (β). (α in millions)
3.13 Preventing Cyber Attacks on Cyber Networks.
3.14 A graph processing framework for preventing threats in social networks
4.1 Five temporal snapshots of a dynamic graph.
4.2 Data flow of the proposed algorithm for exact subgraph isomorphism.
4.3 Initial state of data graph for graph pruning algorithm.
4.4 Vertex states at end of each super-step (ss) in the initialization stage of D-IDS
4.5 Vertex states at the end of each super-step (SS) after adding edge (5, 1).
4.6 Vertex states at end of each super-step after removing edge (3, 5).
4.7 Comparison of latency with (D-ISO + D-IDS) and without (D-ISO) graph pruning.
4.8 Average percentage reduction of graph size (number of vertices) D-IDS. d denotes the diameter of query graph.
4.9 Speedup and latency of D-ISI + D-IDS for query graphs with various diameters (d).
4.10 Comparison of latency of D-ISI + D-IDS for query graphs with various diameters.
4.11 Latency and throughput of D-ISI + D-IDS for different batch size.
5.1 High level overview of the state estimation process [40].
5.2 An example transmission network and its graph representation.
5.3 Changing the measurement protection with bus criticality changes.
5.4 Execution flow of the heuristics algorithms for the proposed protection schemes
5.5 Illustration of message combiner in Pregel model [78]
5.6 $R_C$ with increasing time window size.
5.7 $R_C$ for various bus criticality variations on EU-1494. Sin = sin wave, SQR = square wave, TRI = triangle wave, SW = sawtooth wave.
5.8 $R_C$ with increasing number of buses.
5.9 Execution time vs number of cores. RMAT graphs: $n \times \beta$ where $|V| = 2^n$.
5.10 Execution time vs graph size. RMAT graphs: $n \times \beta$ where $|V| = 2^n$

Abstract

State of the art cyber systems consist of interconnected networking infrastructures, information infrastructures, and other entities, such as the humans involved in the systems. Communication networks, online social networks, and smart grids are popular examples.
These systems are becoming an organic part of our day to day life with the advancement of Internet infrastructure, mobile technologies, and sensor networks. As a result, protecting cyber systems against attacks has become a task of vital importance. However, the highly complex nature of modern cyber systems makes the design of security solutions a challenging task. Cyber system security is a primary concern when designing and operating mission-critical systems, such as smart grids. Such systems should be continually monitored to prevent and detect attacks. The mission-critical nature of these systems demands low latency solutions that identify and prevent attacks. The majority of existing security solutions focus on the protection of individual components of the system. The tightly coupled nature of the modern cyber system makes such security solutions inadequate. More sophisticated security solutions are needed as a result.

Graphs are fundamental in representing complex interconnected systems and data. Thus, security solutions based on graph representations will play a crucial role in future cyber system security solutions. Such solutions should be able to handle the dynamic nature of cyber systems and comply with the low latency requirements of security applications. To address these demands, we propose a set of fundamental dynamic graph algorithms that can be used to develop cyber system security solutions.

First, we present distributed dynamic graph algorithms that can be used to prevent threats in cyber systems. We develop distributed algorithms to monitor the vertices in a dynamic network to detect whether they become a part of a given graph pattern. Cyber systems such as computer networks can be abstracted as dynamic networks, in which the vertices represent computing devices and the edges represent communication channels between the devices.
Similarly, in online social or communication network systems, the users can be represented by vertices and their interactions by edges. Given the knowledge of communication patterns of cyber attacks or suspicious communications, the proposed algorithms can be used to develop a distributed, proactive cyber threat prevention system. We show that the proposed algorithms are memory efficient and can be executed asynchronously. We analyze and report their communication, computation, and space complexity. Experimental evaluations on large real-world and synthetic datasets show that the proposed algorithms are highly scalable.

Further, to enable highly accurate detection of cyber attacks, we present a distributed algorithm for exact subgraph matching (i.e., subgraph isomorphism). To improve the latency and scalability of the solution, we propose a lossless graph pruning technique for dynamic graphs based on graph simulation. Experimental results on real-world and synthetic datasets show that our approach is highly effective on graphs with small diameters. We discuss the application of the proposed technique to detect attacks on cyber systems with high accuracy.

Last but not least, we address a security vulnerability in the smart grid state estimation process by providing a set of protection schemes based on the prize-collecting Steiner tree problem. Determining the complex bus voltages in the entire power system is a key operation in the smart grid state estimation process. The estimated states of the system are used to determine the operating condition of the system and take any emergency and restorative control actions if necessary. However, recently it has been shown that the state estimation process is vulnerable to data spoofing attacks. Specifically, when an attacker knows the topology of the transmission network, they can attack the state estimation process by spoofing a carefully selected set of sensors.
Such attacks are not detectable by existing protective measures. Existing techniques to protect the smart grid against data spoofing attacks focus on design-time protection of the system. These techniques fail to capture the dynamic nature of the criticality of various parts of the smart grid's power transmission network. Addressing this limitation, we propose dynamic prize-collecting Steiner tree based security solutions to provide optimal cost protection against data spoofing attacks. The proposed protection schemes consider the dynamic nature of the criticality of the buses in power transmission networks and the costs of protection to provide optimal cost protection recommendations. We develop scalable, highly accurate heuristic algorithms to obtain security recommendations with low latency. We discuss how the proposed heuristics can be generalized to be applied in security solutions beyond the smart grid.

Our distributed algorithms to prevent and detect cyber threats in cyber systems and novel adaptive protection schemes to protect smart grids against data spoofing attacks take a few steps towards the goal of securing complex cyber systems.

Chapter 1 Introduction

Cyber systems have become ubiquitous in modern day society with the advancement of internet technologies, mobile networks, and cyber physical systems. These systems are now starting to play vital roles in our day to day life. Online social networks, communication networks, and smart power grids are examples. A cyber system can be defined as a system that uses a cyberspace, namely, a collection of interconnected computerized networks, including services, computer systems, embedded processors, and controllers, as well as information in storage or in transit [90].

[Figure 1.1: State of the art cyber systems. Panels: cyber system infrastructures, cyber social systems, cyber physical systems. Examples shown: 1) social media, 2) e-commerce, 3) smart transportation, 4) smart homes, 5) smart grid.]
As illustrated in Figure 1.1, modern cyber systems consist of core cyber system infrastructures and cyber system applications that use them to provide a service. We can categorize the cyber system applications into two main categories:

1) Cyber Social Systems: Cyber social systems consist of user applications such as email services, online social networks, communications networks and e-commerce applications. Most of these systems are used by non-expert human users in their day to day life for their social interactions.

2) Cyber Physical Systems: Cyber physical systems consist of applications that include physical components governed by the laws of physics and cyber components that are integrated with the physical system to enable better control and monitoring. Smart power grids, smart homes, and smart cars are examples.

Many cyber social systems and cyber physical systems have become a crucial part of modern everyday life. As a result, the security of these systems is of vital importance. When protecting cyber systems, two main scenarios should be considered:

1. Attackers may use cyber systems to spread their propaganda using standard features of the system. This scenario is common in cyber social systems, such as online social networks. One such example is a group of terrorists using online social networking features to spread their propaganda by propagating false rumors. Another example is a group of automated bots spreading spam information using online social networks.

2. Attackers may attack cyber systems to gain unauthorized access to the system or impact the availability of the system. This scenario is common in cyber system infrastructure, cyber social systems, and cyber physical systems. Denial of service attacks [61] on a cyber system infrastructure and data spoofing attacks [73] on cyber physical systems such as smart grids are examples.

Modern cyber systems are highly complex in nature due to their massive scale, complex interdependencies, and dynamic nature. As a result, designing security solutions to monitor and protect these systems is a challenging task. Due to the complex interdependencies in the cyber systems, attackers could formulate complex attacks that may involve multiple components of the system [32, 73, 106]. Security solutions that provide localized protection for individual components are not enough to detect such attacks. Sophisticated solutions that take into account the interdependencies of various components of the cyber system and their interactions are needed. The mission-critical nature of many cyber systems demands low latency attack prevention and detection solutions. As an example, the unavailability of an e-commerce service for a few minutes may cause massive losses. Moreover, attacks that can cause blackouts in power grids may result in significant socio-economic impacts. Security solutions should cater to these low latency requirements when preventing and detecting attacks on cyber systems.

1.1 Cyber Systems as Graphs

Graph based representations are a natural fit to represent systems and data with complex relationships. As a result, graph based representation formats have been widely used when representing and analyzing cyber systems. Thus, graph representation based security solutions will play a crucial role in future cyber system security solutions. In the following subsections, we motivate our work by introducing three widely used graph representations of real world cyber systems.

1.1.1 Cyber Networks

Cyber networks can be naturally represented by graphs in which entities such as host devices, routers or switches can be modeled as vertices, while communication channels or connections between entities can be represented by edges. Various characteristics of devices and connections can be modeled as attributes of these vertices and edges. Figure 1.2 illustrates an example of a graph representation of a cyber network.

[Figure 1.2: A graph representation of a cyber network. Left: the cyber network (email, web, database and data center servers, mobile devices, routers and a switch). Right: its graph representation on vertices 1 to 10, with vertex attributes (Type, OS) and edge attributes (Protocols).]

As illustrated in Figure 1.2, the example cyber network consists of various types of host devices, including mobile devices and application servers such as web servers. Also, it includes connections to data centers and connecting devices such as routers. Communication between devices occurs over various communication protocols, such as Transmission Control Protocol (TCP) [85], User Datagram Protocol (UDP) [84], and Border Gateway Protocol (BGP) [91]. In the illustrated graph representation, the vertices represent entities such as the mobile devices and application servers. Edges in the graph represent the communications between these entities. The vertex attributes represent device attributes, such as the type of the device and its operating system. The edge attributes represent the communication protocols used to communicate between the incident entities.

Graph representations of cyber networks have many use cases, including in areas such as cyber network security, the design of routing algorithms, and the design of network simulation frameworks [32, 108, 101, 86]. As an example, discovering and modeling the topology and details of complex cyber networks is crucial to conducting simulation based studies to understand and improve the cyber network systems [101].
Thus, a significant amount of research work exists on discovering the graph structure of cyber networks [42, 17, 77].

1.1.2 Online Social and Communication Networks

Online social and communication networks have grown rapidly during the past decade with the democratization and development of internet technologies and mobile devices. Analytics on online social networks are mainly based on graph based models [107, 88]. As a result, graph based storage solutions and analytic techniques for large scale online social network analytics have been developed [26, 109, 54]. Similar analytic methods have been used to analyze communication networks, such as email and telephone communication networks [103, 27]. Online social and communication networks are represented by graphs in which entities such as the users and the posts are represented by vertices and their interactions or relationships by edges. Properties of the users and their interactions are represented by the attributes of the vertices and edges. Figure 1.3 illustrates a graph representation of an online social network like Facebook.

[Figure 1.3: A graph representation of an online social network. Vertices include users (Name: James, Name: John) and content of type status update, video, check-in and post.]

Online social networking systems such as Facebook use similar graph representations to perform analytics for applications such as targeted advertising, bot detection, and content recommendations [26].

1.1.3 Smart Power Grids

Applications of graph theory are not new to the power system domain [93]. In electrical networks, there are sets of buses and branches in which each branch has two terminal buses and these buses are shared by one or more other branches in the network. Graph representations of such networks are typically constructed by replacing the buses and branches by vertices and edges [15]. Figure 1.4 illustrates a graph representation of an IEEE-14 bus network based on this model.
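
The bus-branch model just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bus_branch_graph` and the five-bus branch list are hypothetical and are not taken from the IEEE 14 bus test case or from the thesis.

```python
from collections import defaultdict


def bus_branch_graph(branches):
    """Return an undirected adjacency map: one vertex per bus, one edge per branch."""
    adjacency = defaultdict(set)
    for bus_a, bus_b in branches:
        # Each branch has exactly two terminal buses, so it becomes one edge.
        adjacency[bus_a].add(bus_b)
        adjacency[bus_b].add(bus_a)
    return dict(adjacency)


# Hypothetical five-bus network, used purely for illustration.
toy_branches = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
toy_graph = bus_branch_graph(toy_branches)

for bus in sorted(toy_graph):
    print(bus, sorted(toy_graph[bus]))
```

A plain adjacency map is enough for the structural view; per-bus or per-branch measurements would be stored as additional vertex and edge attributes.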

[Figure 1.4: An example graph representation of IEEE 14 bus system. Left: the IEEE 14 bus system; right: its bus-branch graph representation on vertices 1 to 14.]

Variants of this graph model have been used in various applications in power grid transmission networks [15, 87, 22, 41, 110]. As an example, in the power transmission systems of smart grids, similar graph representation based techniques have been used in the network observability analysis of the transmission network's state estimation process [15]. Network observability is a fundamental concept when it comes to the operation of a power grid transmission network. A power transmission network is said to be observable if the voltage phasors at all system buses can be uniquely estimated using the available measurements from the system [15]. The above-mentioned graph representation is used for determining the observability, where the network is identified as observable if a spanning tree can be constructed on the graph representation created from the measurements connecting all the buses in the transmission network. Furthermore, graph-based techniques have been proposed to secure the state estimation of smart grid against data spoofing attacks [22, 41, 110].

1.2 The Dynamic Nature of the Graph Representations

Many modern cyber systems are inherently dynamic in nature. As a result, graph representations of these systems should also be represented as dynamic graphs to capture the dynamic nature of the cyber systems.

Consider the cyber network graph representation described in Section 1.1.1 in which the host devices, routers, or switches are modeled as vertices, while the communication channels or connections between the entities are represented by edges. In this example, dynamic features of the network, such as congestion, packet rate, and the criticality of each entity can be captured as attributes of the vertices.
Similarly, changes in communication policies, such as firewall configurations, may cause changes in the communication channels, which produce changes in the edges. Moreover, the addition, removal, or failure of an entity causes the addition or removal of vertices in the graph representation. Thus, such cyber network graph representations are dynamic: there can be changes in both their structure and their attributes.

Online social networks are becoming increasingly popular in modern society and are used as one of the vanguard media for social interactions. Due to the vast number of users on online social networks and their daily activities, these online social networks are highly dynamic in nature. As discussed before, online social networks are naturally represented as graphs, in which the users, user posts, and media are modeled as vertices while the user interactions with these entities are modeled as edges. The day to day activities in online social networks, such as communications between the users, publishing new content, as well as users joining or leaving, make the network dynamic. Changes such as a user joining or the publication of new content can be modeled as additions of new vertices, whereas changes in user interactions, such as messages between users or media content, can be modeled as changes in the edges in the graph representations. A large number of studies have been conducted on the dynamic nature of online social networks [21, 68, 69].

In power grids, the topology of the system, which consists of components such as substations, transmission lines and distribution lines, remains static compared to other systems such as cyber networks and online social networks. But power systems are dynamic in nature due to changes in factors such as power demand, power supply, and congestion in transmission networks [19]. Such changes should be captured in the graph representation.
As an example, congestion in lines and substations can be represented by dynamic edge and vertex attributes. Moreover, changes in demand and supply can be represented by changes in attributes of vertices that represent substations and generators. As a result, a power grid's graph representation is dynamic: vertex and edge attributes change over time.

1.3 Cyber Attacks

Various graph based models of cyber attacks on cyber systems have already been discussed in the research literature [32, 22, 73]. We briefly present some of this work to motivate the need for dynamic graph based security solutions for cyber system security.

As discussed in Section 1.1, a cyber network can be abstracted as a graph, in which the vertices represent computing devices (nodes) and the edges represent communication channels between these devices. Consider the communication pattern that occurs in a cyber attack depicted as a graph in Figure 1.5(a). In this example, the attackers (A) infect the web application servers (WA) and data servers (DS). The infected data servers leak user contact information to a web server running an infected application program that sends out spam emails to users via email servers (M). Figure 1.5(b) shows a cyber network system that may be subject to such attacks. Such communication patterns in cyber attacks have been discussed in the research community [32]. Two such patterns of known attacks are discussed below. In these attacks, the attackers can be software agents such as bots running in a set of infected computers in a cyber network.

[Figure 1.5: A cyber attack pattern and group of nodes subject to the attack. (a) Cyber attack pattern over node types A, DS, WA and M; (b) cyber network on nodes 1 to 8 labeled with the same node types.]

[Figure 1.6: Distributed denial of service smurf attack [32]. Nodes: attacker, router, hosts, victim; messages: ICMP Echo Request, ICMP Echo Reply.]
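
To make the idea of an attack pattern over a labeled graph concrete, the following minimal sketch applies plain graph simulation matching to a small example. It is not the distributed or incremental algorithm developed later in the thesis. The pattern edges are one assumed reading of the description of Figure 1.5(a) (A to WA, A to DS, DS to WA, WA to M), and the six-node data graph and the function name `graph_simulation` are hypothetical.

```python
def graph_simulation(pattern_edges, pattern_labels, data_edges, data_labels):
    """Return, for each pattern vertex, the data vertices that simulate it."""
    # Successor sets of the data graph.
    successors = {v: set() for v in data_labels}
    for src, dst in data_edges:
        successors[src].add(dst)

    # Start from all label-compatible candidates.
    sim = {
        u: {v for v, label in data_labels.items() if label == pattern_labels[u]}
        for u in pattern_labels
    }

    # Repeatedly drop candidates that lack a required successor, until stable.
    changed = True
    while changed:
        changed = False
        for u, u_child in pattern_edges:
            keep = {v for v in sim[u] if successors[v] & sim[u_child]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True
    return sim


# Assumed attack pattern: attacker, data server, web application server, mail server.
pattern_labels = {"a": "A", "ds": "DS", "wa": "WA", "m": "M"}
pattern_edges = [("a", "wa"), ("a", "ds"), ("ds", "wa"), ("wa", "m")]

# Hypothetical network: nodes 5 and 6 resemble the pattern only partially.
data_labels = {1: "A", 2: "DS", 3: "WA", 4: "M", 5: "WA", 6: "DS"}
data_edges = [(1, 2), (1, 3), (2, 3), (3, 4), (1, 5), (6, 5)]

matches = graph_simulation(pattern_edges, pattern_labels, data_edges, data_labels)
print(matches)
```

In this example node 5 is dropped because it never reaches a mail server, and node 6 is then dropped because its only web application server neighbor no longer matches, which shows how removals propagate through the pattern.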
Figure 1.7: Distributed denial of service DNS amplification attack [32].

Figure 1.6 illustrates a communication pattern of a distributed denial of service smurf attack. In this attack, the attacker sends Internet Control Message Protocol (ICMP) [35] echo request packets to a broadcast IP address, with the source address in the packets spoofed to be the IP address of the victim. The routers that receive these packets broadcast them to all the hosts in the network. Upon receiving these messages, the hosts send ICMP echo reply messages to the victim, congesting the network resources of the victim.

Figure 1.7 illustrates a communication pattern of a distributed denial of service DNS amplification attack. The attackers send DNS queries with spoofed source addresses to DNS servers. The spoofed source address is the address of the entity in the network that is the target of the attack. The DNS servers reply to the spoofed source address with a DNS response that is larger than the DNS request. This results in network congestion at the targeted entity.

Given the knowledge of the communication patterns of cyber attacks as depicted in Figures 1.5, 1.6 and 1.7, the network can be monitored continually to detect whether nodes become vulnerable to such attacks when they change their attributes or communication channels. Furthermore, communications between nodes in a cyber system can be monitored to detect whether they are under attack. Understanding which nodes are vulnerable or under attack and what roles they play in the given attack is important when mitigating the threats.

Similarly, given the knowledge of suspicious communication patterns of groups of people, communication networks or social networks can be continually monitored to detect and mitigate the threats from groups of people with malicious intent.
This can be done by monitoring these networks to identify occurrences of known communication patterns of suspicious groups. Thus, identifying graph patterns in dynamic graphs is a fundamental graph operation for cyber system security applications.

In the case of a smart grid power system, the operating state consists of the complex voltages at the buses [15]. The operational state of the system is continually monitored by SCADA (Supervisory Control and Data Acquisition) systems. State estimation is a mission-critical operation in a smart grid [15]. It is performed in an online manner in modern SCADA systems using real-time sensor measurements of the transmission network [15]. The sensors in a smart grid and its communication networks are vulnerable to complex attacks. Such attacks may be carried out by human terrorist agents or cyber viruses [96, 83]. Smart grid SCADA systems are equipped with a bad data detector (BDD) to detect random errors in sensor measurements which may affect the state estimation process. However, it has recently been shown that state estimation is vulnerable to data spoofing attacks (DSAs) [73]. Specifically, an attacker who knows the topology of a transmission network could formulate an attack by spoofing a carefully selected set of sensors. Such attacks are not detectable by the BDDs in SCADA systems. Invalid state estimates can cause a major socio-economic impact, which may include losses in power markets and large-scale blackouts.

Graph based protection schemes to secure the state estimation of a critical subset of buses have been proposed [22, 110]. All these protection schemes use variants of the graph Steiner tree problem as fundamental operations. We discuss the details of these Steiner tree based protection schemes in Chapter 5. Steiner tree based techniques to protect privacy in wireless sensor networks have also been proposed [62]. Thus, computing Steiner trees in the graph representations of power systems is a core operation for these cyber security applications.
1.4 Need for Low Latency Dynamic Graph Analytics

Prevention, detection, and mitigation are the three phases of securing any cyber system. The dynamic nature of modern cyber systems requires security solutions to continually monitor the cyber systems in order to find security vulnerabilities and detect ongoing attacks. We discuss the need for low latency dynamic graph analytics below.

Early prevention and detection of cyber attacks is critical in cyber system security applications to prevent losses due to the potential impact of these attacks. As an example, the WannaCry ransomware (footnote 1) that started to spread on May 12, 2017 affected more than 60 trusts (footnote 2) within the United Kingdom's National Health Service (NHS) and spread to more than 200,000 computer systems in 150 countries [33]. In the first day, the ransomware affected over 50,000 computing systems in the world, and it affected over 200,000 systems within a few days (see footnote 3). Many NHS facilities could not access patient records, which led to delays of non-urgent surgeries and canceled patient appointments. Some hospitals had to divert ambulances to other facilities [33]. The losses due to this attack were estimated at between 100 million and 4 billion dollars (see footnote 4). Such fast spreading attacks and their potential impact demand low latency analytics over dynamic graphs to prevent, detect and mitigate the threats as soon as possible.

Footnote 1: http://malware.wikia.com/wiki/WannaCry
Footnote 2: https://en.wikipedia.org/wiki/NHS_trust

Developing low latency dynamic graph analytics for cyber system security applications poses many challenges:

1. Massive scale: Modern cyber systems such as online social networks and cyber networks can be massive in size. This makes performing low latency analytics on their graph representations a challenging problem due to the scale of the problem size.

2. Distributed nature: Many cyber systems are distributed in nature.
As an example, cyber network systems can be geo-distributed, and online social networks may be partitioned across multiple data centers distributed across the country. Thus, distributed dynamic graph analytic solutions should be developed for such cyber system security solutions. This is a challenging problem due to the communication delays in distributed systems.

3. Computational complexity: Some key graph problems associated with cyber security can have high computational complexity, which makes it hard to perform low latency analytics. As an example, finding exact graph patterns in graphs (i.e., subgraph isomorphism) is an NP-complete problem. Alternative methods such as approximate algorithms should be developed for such scenarios in order to enable low latency subgraph isomorphism detection in dynamic graphs.

4. High velocity data: Cyber systems such as social and communication networks can evolve rapidly due to the high velocity of interactions and changes of entities (see footnote 5).

Footnote 3: https://www.theverge.com/2017/5/14/15637888/authorities-wannacry-ransomware-attack-spread-150-countries
Footnote 4: https://www.cbsnews.com/news/wannacry-ransomware-attacks-wannacry-virus-losses/

The dynamic graph algorithms proposed in this thesis address these challenges to enable low latency analytics on dynamic graphs.

1.5 Thesis Statement

Many complex cyber systems are commonly modeled as graphs in which entities and their relationships are represented by vertices and edges, and various characteristics of the entities and their relationships are represented as attributes of vertices and edges. The changing nature of cyber systems makes these graph representations dynamic. Our research focuses on enabling security solutions for complex cyber systems. Our goal is to develop fundamental dynamic graph analytics that can be used when designing security solutions for these cyber systems.
We take into account the changing nature of the cyber systems and the low latency requirements of security solutions when developing these key analytic techniques.

Footnote 5: http://www.internetlivestats.com/one-second/

1.6 Research Contributions

Motivated by the current findings in the cyber system security domain [32, 73, 22, 110], we have identified a key set of dynamic graph problems that can be used when developing cyber system security applications. In this thesis, we present a set of dynamic graph algorithms for this key set of problems. The proposed algorithms are scalable and provide low latency results. We evaluate these algorithms on various real-world and synthetic datasets to demonstrate their effectiveness. The main contributions of this thesis are summarized below.

1. We motivate the need for dynamic graph based solutions for cyber system security applications by providing a set of motivating applications. We identify the core requirements of such solutions and discuss the challenges in meeting these requirements. We identify a set of key dynamic graph problems for cyber system security applications. We discuss the applicability of these problems and our proposed solutions in cyber system security solutions.

2. We present distributed data structures and algorithms to monitor structural group memberships of vertices in a dynamic network. Given a query graph and a subgraph matching criterion, the structural group membership of a given vertex in a dynamic network is the set of query vertices that map to it based on the given subgraph matching criterion. We consider two widely used subgraph matching criteria: graph simulation and dual simulation. The proposed distributed algorithms update the distributed data structures maintained at each vertex based on their previous state, to continually monitor structural group memberships.
We show that our algorithms are memory-efficient and scalable; the amount of memory required to maintain the proposed data structures at each vertex is independent of the size of the dynamic network and is bounded by the size of the query graph. Moreover, the proposed algorithms can be executed asynchronously, making them portable to various network computing environments. We evaluate the performance of the proposed algorithms on a diverse set of large real-world and synthetic graph datasets to show the effectiveness of our approach compared with the state-of-the-art. We discuss the applicability of our proposed solution in cyber network security solutions.

3. To enable exact subgraph matching in dynamic networks, we present a distributed graph pruning algorithm to enable efficient subgraph isomorphism detection in dynamic graphs. The proposed algorithm continually maintains the maximum dual simulation match in a dynamic graph. We develop a distributed incremental algorithm for subgraph isomorphism that uses this pruning technique to enable low latency exact subgraph matching in dynamic networks. Our evaluation results show that the graph pruning technique is highly effective on graphs with small diameters. We discuss the applicability of our proposed solution in cyber system security solutions.

4. We identify a limitation in an existing protection scheme to protect a smart grid transmission network's state estimation process from data spoofing attacks. We then discuss the limitations of the existing protection scheme to show how an attacker can bypass this protection scheme to attack the state estimates. We identify a class of attack vectors under which this protection scheme fails to protect the state estimates. Addressing this limitation, we present a Steiner tree based improved protection scheme to provide complete protection against data spoofing attacks.
5. Finally, extending our above mentioned protection scheme, we propose novel optimal cost protection schemes to protect smart grid transmission networks. We present variants of dynamic prize collecting Steiner tree problems for these protection schemes. Given the intractable computational complexity of these problems, we develop fast heuristic algorithms that can approximate the optimal results with low latency. We evaluate the algorithms on real-world and synthetic datasets to demonstrate their effectiveness. We discuss how the proposed heuristics can be generalized to be applied in other security solutions besides the smart grids.

1.7 Thesis Outline

The rest of this thesis is organized as follows. In Chapter 2, we present relevant definitions, background and related research that are useful for understanding the material presented in the thesis. In Chapter 3, we present our first contribution: distributed dynamic graph algorithms and data structures for structural group membership monitoring. In Chapter 4, we present our distributed algorithm for exact subgraph matching in dynamic graphs. In Chapter 5, we present our variant dynamic prize-collecting Steiner tree problems for protecting smart grid state estimation from data spoofing attacks and our proposed low latency heuristic algorithms for the problems. We conclude the thesis in Chapter 6 with a discussion of the broad applicability of the proposed work and future directions for the community to take based on our results.

Chapter 2
Background and Related Work

In this chapter, we provide relevant background and related work for the following chapters. We discuss the concepts on which we built our proposed algorithms and present the definitions that are used throughout the thesis as background. We also discuss related research to differentiate our work from existing research.
2.1 Dynamic Graphs

Graph data can be found in many applications of computer science. Such application specific graph data include, but are not limited to:
• Social graphs (Facebook, Twitter, Google+, LinkedIn, etc.)
• Communication graphs (Skype, E-mail, Whatsapp, etc.)
• Endorsement graphs (web link graph, paper citation graph, etc.)
• Location graphs (road map, power grid, telephone network, etc.)
• Simulation graphs (biological network, astrophysics graph, etc.)

We can observe a dynamic nature in many of these graphs. This is a result of growth in the user base of the applications and/or of the interactions that occur between entities in the applications. Social and communication networks with user interactions are one such example. The dynamic nature of the graphs brings another dimension to the processing of large scale graphs. The data streams generated from modern day dynamic graphs are of high velocity in nature. As an example, Twitter reports 500 million tweets generated in a day on average. These data streams and user dynamics (joining/leaving) contribute to the dynamic nature of the graph.

A graph $G = (V, E, l)$ comprises a set $V$ of vertices together with a set $E$ of edges, where $E \subseteq V \times V$, and a label function $l : V \to L$ that associates a label with each vertex; $L$ is the set of labels. We define a graph update $\delta G_{ut}$ as an atomic change to a graph at a time $t$. It can be any one of the following:
• add a vertex
• remove a vertex
• add an edge
• remove an edge
• change the label of a vertex

A dynamic graph is a graph $G_T = \{\ldots, G_{t-2}, G_{t-1}, G_t, \ldots\}$, $T \subset \mathbb{Z}^+$, that changes over discrete time steps. At any time $t$, the directed graph $G_t$ denotes a snapshot of $G_T$. At time $t$, $G_t$ evolves from $G_{t-1}$ based on a set of graph updates $\Delta G_{ut}$. Figure 2.1 illustrates a series of snapshots of a dynamic graph.

2.2 Distributed Graph Processing Models

Over the past decade, various processing models for large scale graph processing have been introduced.
Boost graph library (BGL) [97] can be thought of as one of the early frameworks that provided support for large scale graph processing. It provided abstractions and data structures for users to develop graph algorithms that can run in a distributed memory environment. The distributed memory support of this library targets MPI environments.

Figure 2.1: A series of snapshots of an example dynamic graph.

The introduction of Google MapReduce [39] can be thought of as a turning point in large scale data processing on commodity clusters. A simple programming model and an open implementation like Apache Hadoop (footnote 1) made MapReduce the pre-eminent programming model for large scale data processing. Due to its wide adoption, there have been research efforts to implement graph algorithms on the MapReduce model [72]. But the MapReduce model is not suitable for large scale graph processing. It was mainly designed for the processing of data with minimal interdependencies. Graph algorithms have irregular memory access patterns, and transferring the graph structure between the map and the reduce phases introduces a high overhead. The MapReduce model also focuses on batch processing of data with high data processing throughput, but its processing latency is high.

Footnote 1: https://hadoop.apache.org/

2.2.1 Vertex Centric Model

The Google Pregel [78] model was introduced to address the issue of unnecessary data movement in the MapReduce model when used for graph data processing. Google Pregel provides a vertex centric programming model for large scale graph processing. It follows the Bulk Synchronous Parallel (BSP) model, in which each vertex is a processor [105]. In a vertex centric model, users compose the graph algorithm to be executed within each vertex. Each vertex receives a set of incoming messages that were sent from other vertices.
The user composed algorithm is executed using these messages. Each vertex can send messages to other vertices if and when required. Vertex centric algorithms execute iteratively; the iterations are called super steps. Each iteration is separated from the others by a barrier synchronization step, which ensures that all vertices finish their computations and communications before proceeding to the next iteration. Also, the barrier synchronization step ensures that messages sent in an iteration are only available for vertices to process in the next iteration. Algorithms terminate after a user defined number of iterations, after all the vertices indicate that they have completed computation (vote to halt), or when there is no communication between vertices.

In vertex centric graph processing platforms, the graph is distributed between multiple workers. Each worker node has multiple tasks, which are responsible for executing the user defined algorithm at each vertex. Users are given a compute function to implement these graph algorithms. The framework makes the underlying communication, barrier synchronization and fault tolerance aspects transparent to the user.

Algorithm 1 is an example of a graph algorithm based on the vertex centric model to label weakly connected components. In the first super step, the vertices send their ids to their neighbors. Each vertex maintains, as its state, the smallest vertex id it has seen so far. Upon receiving messages, each vertex updates its state with the smallest vertex id seen so far. If the state has changed during a super step, the vertex sends the updated value to its neighbors. The algorithm terminates when there are no more changes in the vertex states and the vertices have no incoming messages. At the termination of the algorithm, vertices in the same weakly connected component have the same value in their state. Figure 2.2 illustrates the execution steps of this algorithm on a small graph.
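The super step procedure described above can be simulated in a single process. The following is a minimal sketch rather than an implementation on any particular vertex centric platform: the function name `vertex_centric_cc`, the dictionary-based inbox, and the example graph are assumptions made for illustration. Edges are treated as undirected because the components are weakly connected.

```python
from collections import defaultdict


def vertex_centric_cc(vertices, edges):
    """Simulate BSP super steps; returns (labels, number_of_supersteps)."""
    # Weak connectivity ignores edge direction, so messages flow both ways.
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    # Super step 1: every vertex adopts its own id and sends it out.
    cc = {v: v for v in vertices}
    inbox = defaultdict(list)
    for v in vertices:
        for n in neighbours[v]:
            inbox[n].append(cc[v])
    supersteps = 1

    # Later super steps: only vertices whose state changed send messages.
    while inbox:
        outbox = defaultdict(list)
        for v, messages in inbox.items():
            smallest = min(messages)
            if smallest < cc[v]:
                cc[v] = smallest
                for n in neighbours[v]:
                    outbox[n].append(smallest)
        # Barrier: messages sent now become visible in the next super step.
        inbox = outbox
        supersteps += 1
    return cc, supersteps


# Hypothetical example: a path 1-2-3-4 plus an isolated vertex 5.
labels, steps = vertex_centric_cc([1, 2, 3, 4, 5], [(1, 2), (2, 3), (3, 4)])
print(labels, steps)
```

Swapping `inbox` for `outbox` at the end of each loop iteration plays the role of the barrier synchronization step, and the loop ends when no vertex has incoming messages.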
Algorithm 1 Vertex Centric Connected Components
procedure Compute(Messages M)
    if superstep = 1 then
        cc ← getId()
        Message msg
        msg.value ← cc
        sendtoNeighbours(msg)
    else
        changed ← false
        for m ∈ M do
            if m.value < cc then
                cc ← m.value
                changed ← true
            end if
        end for
        if changed then
            Message msg
            msg.value ← cc
            sendtoNeighbours(msg)
        end if
    end if
end procedure

2.2.2 Sub-Graph Centric Model

The vertex centric graph processing model has its advantages, such as ease of use and performance benefits compared to map-reduce like programming models. But most of the algorithms ended up performing minimal computational work per vertex [98]. This results in higher communication overhead and a large number of iterations for the algorithm to converge to the final result. A subgraph centric graph processing model has been proposed to address these limitations [98].

Figure 2.2: Execution of vertex centric connected components algorithm on a small graph.

Figure 2.3: Subgraph centric model.

In the subgraph centric model, a subgraph, instead of a vertex, acts as the unit of execution. As shown in Figure 2.3, the graph is initially partitioned into a set of graph partitions. Partitioning tools like Metis [67] are used for this process. Metis tries to minimize the number of cross partition edges between graph partitions while keeping the number of vertices equal across graph partitions. Weakly connected components are identified within each partition. Users will use each of these connected components (subgraphs) as the unit of execution. In a subgraph centric model, the users are provided with a programming abstraction to develop computations for each subgraph.
They will have access to all the vertices and edges within the subgraph. This will enable them to reuse traditional in-memory graph algorithms within the subgraph. This programming abstraction provides the capability to communicate with other subgraphs. The execution model is the same as the bulk synchronous parallel vertex centric model, but in this case, it is the subgraphs, instead of the vertices, that act as the units of execution.

Algorithm 2 is an example of a subgraph centric algorithm for labeling weakly connected components in a graph. As shown in Algorithm 2, initially each subgraph locally calculates the minimum vertex id within the subgraph. Then each subgraph exchanges messages with its neighbors, communicating its current known minimum vertex id. In every subsequent super step, each subgraph checks if any of the incoming values are less than the current value, and if so, the subgraph updates the current known value. In the case of a value change, a subgraph sends messages to its neighbors informing them of the update. The algorithm terminates when there are no incoming messages to the subgraphs.

Algorithm 2 Subgraph Centric Connected Components
procedure Compute(SubGraph SG, Messages M)
    if superstep = 1 then
        SG.value ← ∞
        for v ∈ SG.vertices do
            if v.id < SG.value then
                SG.value ← v.id
            end if
        end for
    end if
    changed ← (superstep = 1) ? true : false
    for m ∈ M do
        if m.value < SG.value then
            SG.value ← m.value
            changed ← true
        end if
    end for
    if changed then
        Message msg
        msg.value ← SG.value
        sendtoNeighbours(msg)
    end if
end procedure

Experimental results have shown that the subgraph centric model outperforms the vertex centric graph computing model [98]. This is mainly due to the reduced number of super steps and messages. Figure 2.4 illustrates the execution steps of Algorithm 2.
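To make the message flow of Algorithm 2 concrete, the sketch below simulates it in a single process. It is an illustrative assumption rather than the API of any subgraph centric framework: the function name `subgraph_centric_cc`, the representation of subgraphs as a dictionary of vertex id sets, and the list of remote edges between subgraph ids are all hypothetical.

```python
from collections import defaultdict


def subgraph_centric_cc(subgraphs, remote_edges):
    """subgraphs: {subgraph_id: set of vertex ids};
    remote_edges: pairs of subgraph ids joined by at least one remote edge.
    Returns ({subgraph_id: component label}, number_of_supersteps)."""
    neighbours = defaultdict(set)
    for a, b in remote_edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Super step 1: each subgraph computes its local minimum vertex id
    # in memory and announces it to its neighbouring subgraphs.
    value = {sg: min(vertex_ids) for sg, vertex_ids in subgraphs.items()}
    inbox = defaultdict(list)
    for sg in subgraphs:
        for n in neighbours[sg]:
            inbox[n].append(value[sg])
    supersteps = 1

    # Subsequent super steps: adopt any smaller incoming value and forward it.
    while inbox:
        outbox = defaultdict(list)
        for sg, messages in inbox.items():
            smallest = min(messages)
            if smallest < value[sg]:
                value[sg] = smallest
                for n in neighbours[sg]:
                    outbox[n].append(smallest)
        inbox = outbox  # barrier between super steps
        supersteps += 1
    return value, supersteps


# Hypothetical example: subgraphs A = {1, 2} and B = {3, 4} share a remote
# edge, while subgraph C = {5} is isolated.
labels, steps = subgraph_centric_cc(
    {"A": {1, 2}, "B": {3, 4}, "C": {5}}, [("A", "B")]
)
print(labels, steps)
```

Because each subgraph resolves its internal minimum locally in the first super step, only the cross-subgraph propagation needs further super steps, which is where the reduction in super steps and messages comes from.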
Figure 2.4: Execution of subgraph centric connected components algorithm on a small graph.

2.3 Incremental Graph Processing

While the existing distributed graph processing models for static graphs can be directly used for dynamic graphs by processing snapshots, this results in high overhead due to the recomputations. Thus, incremental algorithms have been introduced to perform analytics on dynamic graphs [79, 45, 51, 44]. Incremental graph algorithms continually perform analytics on a dynamic graph by reusing the results computed for the previous graph snapshots of the dynamic graph instead of recomputing the results from scratch. This enables low latency analytics.

Let $A$ be a graph algorithm, where $A(G)$ denotes the result of performing $A$ on the graph $G$. Thus, $A(G_{t-1})$ and $A(G_t)$ denote the results of performing algorithm $A$ on two consecutive graph snapshots in a dynamic graph. An incremental graph algorithm $A'$ takes as input $G_{t-1}$, $A(G_{t-1})$, and $\Delta G_{ut}$, to produce $A(G_t)$. This is denoted by

$$A'(G_{t-1} \oplus A(G_{t-1}) \oplus \Delta G_{ut}) \rightarrow A(G_t) \qquad (2.1)$$

There has been some initial research on dynamic graph processing on clusters. Kineograph from Microsoft Research [29] is one such system for large dynamic graph processing on clusters. Kineograph is based on a snapshot processing model, providing programming abstractions for users to implement incremental graph algorithms. Experimental results show that this system needs further improvement to enable near real time analytics on dynamic graphs. Cai et al. proposed a technique to perform incremental computation using the vertex centric model for deterministic graph algorithms, by reusing the states of previous graph computations [27]. The proposed system, GraphInc, assumes that in a vertex centric program, the vertex computation at any super step only depends on the input messages and the vertex state at that point in time.
Given these assumptions, GraphInc executes a static vertex centric algorithm provided by users in an incremental manner on a dynamic graph by pruning out repeated computations and communications when recomputing analytics. To avoid recomputing the analytics from scratch, GraphInc uses memoization by storing the incoming messages and state of each vertex in every super step. GraphInc uses these memoized states to skip re-computations when there are changes to the graph. But this memoization technique is memory-expensive for large scale graphs. Moreover, it will result in scalability issues for scale free graphs with vertices of high degree.

2.4 Subgraph Matching

In this thesis, we provide algorithms for distributed subgraph matching in dynamic graphs, which can be used for cyber system security applications. Below is the definition of the subgraph matching problem we study and the subgraph matching criteria we used in our work.

The subgraph matching problem takes as input a directed data graph $G = (V, E, l)$ and a directed query graph $Q = (V_q, E_q, l_q)$, where $G$ and $Q$ consist of a set of vertices $V$ and $V_q$, a set of edges $E$ and $E_q$, and a vertex label function $l$ and $l_q$, respectively. The label functions associate a label (from a label set $L$) with each vertex. The subgraph matching problem finds all subgraphs of $G$ that match $Q$, based on a given subgraph matching criterion.

2.4.1 Exact Subgraph Matching

The exact subgraph matching criterion (i.e., subgraph isomorphism) finds those subgraphs in the data graph that are isomorphic to the query graph. The subgraph isomorphism problem is well known to be NP-complete. It is formally defined by

• Subgraph Isomorphism (ISO) [75]: $Q$ matches data graph $G$ under subgraph isomorphism if and only if there exists a subgraph $G_s = (V_s, E_s, l_s) \subseteq G$ and a bijective function $f : V_q \to V_s$ such that for any two nodes $v_i, v_j \in V_q$: $(v_i, v_j) \in E_q \Rightarrow (f(v_i), f(v_j)) \in E_s$, $l_q(v_i) = l_s(f(v_i))$, and $l_q(v_j) = l_s(f(v_j))$.
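The ISO definition can be checked directly by enumeration on very small graphs. The sketch below is a brute-force illustration only, with exponential running time, and is not the pruning-based algorithm developed later in the thesis; the function name `subgraph_isomorphisms` and the example graphs are assumptions introduced here.

```python
from itertools import permutations


def subgraph_isomorphisms(q_vertices, q_edges, q_label,
                          g_vertices, g_edges, g_label):
    """Enumerate all mappings f from query vertices to distinct data
    vertices that preserve vertex labels and every query edge."""
    q_vertices = list(q_vertices)
    g_edge_set = set(g_edges)
    matches = []
    # Each permutation assigns distinct data vertices to the query vertices,
    # so f is a bijection between V_q and its image V_s.
    for image in permutations(g_vertices, len(q_vertices)):
        f = dict(zip(q_vertices, image))
        # Label condition: l_q(v) must equal the label of f(v).
        if any(q_label[u] != g_label[f[u]] for u in q_vertices):
            continue
        # Edge condition: every query edge must map to a data edge.
        if all((f[u], f[v]) in g_edge_set for u, v in q_edges):
            matches.append(f)
    return matches


# Hypothetical example: query edge x -> y with labels A -> B.
query_labels = {"x": "A", "y": "B"}
data_labels = {1: "A", 2: "B", 3: "B"}
found = subgraph_isomorphisms(
    ["x", "y"], [("x", "y")], query_labels,
    [1, 2, 3], [(1, 2), (1, 3), (2, 3)], data_labels,
)
print(found)
```

The number of candidate mappings grows factorially with the graph sizes, which is consistent with the NP-completeness noted above and motivates the polynomial-time simulation criteria.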
2.4.2 Graph Simulation Matching

It has been shown that subgraph isomorphism can be too restrictive for modern applications, such as social network analysis, due to its intractable computational complexity and the constraint of exactly matching the query graph [47]. Thus, simulation based matching criteria have been introduced to address the limitations of subgraph isomorphism [75, 53]. These matching criteria enable subgraph matching in large graphs in polynomial time while capturing the important aspects of the topology of query graphs [75, 53]. In this thesis we use two such graph simulation based matching criteria, defined below.

• Graph Simulation (SIM) [75]: $Q$ matches the data graph $G$ via graph simulation, denoted by $Q \trianglelefteq_{sim} G$, if there exists a binary relation $R_M \subseteq V_q \times V$ such that
1. $\forall u \in V_q, \exists u' \in V : (u, u') \in R_M$;
2. if $(u, u') \in R_M$ then $l_q(u) = l(u')$; and
3. $\forall (u, u') \in R_M$ and $\forall (u, v) \in E_q$, $\exists (u', v') \in E : (v, v') \in R_M$.

• Dual Simulation (D-SIM) [75]: $Q$ matches $G$ via dual simulation, denoted by $Q \trianglelefteq^{D}_{sim} G$, if
1. $Q$ matches $G$ via graph simulation under a match relation $R_D \subseteq V_q \times V$; and
2. $\forall (u, u') \in R_D\,[(w, u) \in E_q \Rightarrow \exists w' \in V : (w, w') \in R_D \wedge (w', u') \in E]$.

The biggest match relation $R_M \subseteq V_q \times V$ between $Q$ and $G$ with respect to $Q \trianglelefteq_{sim} G$ is the maximum simulation match set. The result match graph is the subgraph of $G$ that is created from the maximum simulation match set. Formally, $G_M(V_M, E_M, L_M)$ is a subgraph of $G$ that satisfies
1. $(u, u') \in R_M \Leftrightarrow u' \in V_M$; and
2. $\forall (u, u'), (v, v') \in R_M\,[(u, v) \in E_q \Leftrightarrow (u', v') \in E_M]$.

Similarly, the biggest match relation $R_D \subseteq V_q \times V$ between $Q$ and $G$ with respect to $Q \trianglelefteq^{D}_{sim} G$ is the maximum dual simulation match set. The result match graph for dual simulation is the subgraph of $G$ that is created from the maximum dual simulation match set. Formally, $G_D(V_D, E_D, L_D)$ is a subgraph of $G$ that satisfies
1. $(u, u') \in R_D \Leftrightarrow u' \in V_D$; and
2. $\forall (u, u'), (v, v') \in R_D\,[(u, v) \in E_q \Leftrightarrow (u', v') \in E_D]$.

Intuitively, graph simulation preserves the child relationships in the query graph within the result match graph $G_M$, whereas dual simulation preserves both child and parent relationships in the query graph within the result match graph $G_D$.

Graph simulation based matching has been shown to capture meaningful patterns in modern applications compared to strict matching criteria such as subgraph isomorphism [47]. As an example, in Figure 1.5, only vertices {2, 4, 5, 7, 8} match the given query graph (pattern) via subgraph isomorphism, but all vertices match the given attack pattern via graph or dual simulation. In this scenario, dual simulation allows data servers and web servers (e.g., 3 and 6) to be infected by different attackers (1 and 4) and still be part of the attack. While this may happen in a practical setting, subgraph isomorphism fails to capture such scenarios due to its restrictive nature.

Figure 2.5 depicts a comparison of the matching criteria described above. In Figure 2.5, vertices {5, 11, 12, 15, 16} in the data graph match the query graph via subgraph isomorphism. Vertices {5, 11, 12, 13, 14, 15, 16} match the query graph via dual simulation and vertices {1, 2, 3, 4, 5, 6, 11, 12, 13, 14, 15, 16} match via graph simulation.

Figure 2.5: Comparison of subgraph isomorphism, graph simulation and dual simulation.

2.5 Related Research

Graph simulation and dual simulation have been studied extensively in the past due to their wide range of modern applications. They have been extended and adopted in various directions, targeting diverse application domains [49, 75, 52, 99]. Fan et al. [49] were the first to propose incremental algorithms for SIM and D-SIM.
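Before turning to how such incremental algorithms differ from ours, it may help to see how the maximum match sets of Section 2.4.2 can be computed from scratch on a static graph. The sketch below uses a simple fixpoint refinement: it starts from all label-compatible pairs and repeatedly removes pairs that violate the child condition (and, for D-SIM, the parent condition). It is a naive illustration and neither the incremental algorithm of [49] nor the distributed algorithms of this thesis; the function name `max_simulation` and the example graphs are assumptions.

```python
def max_simulation(q_vertices, q_edges, q_label,
                   g_vertices, g_edges, g_label, dual=False):
    """Return {query vertex: set of matching data vertices} for the
    maximum (dual) simulation, or all-empty sets if Q does not match G."""
    g_edge_set = set(g_edges)
    # Start from every label-compatible pair (condition 2 of SIM).
    sim = {u: {x for x in g_vertices if g_label[x] == q_label[u]}
           for u in q_vertices}
    changed = True
    while changed:
        changed = False
        for u, v in q_edges:
            # Child condition: a match x of u needs an edge (x, y) in G
            # to some current match y of v.
            keep = {x for x in sim[u]
                    if any((x, y) in g_edge_set for y in sim[v])}
            if keep != sim[u]:
                sim[u], changed = keep, True
            if dual:
                # Parent condition: a match y of v needs an edge (x, y)
                # in G from some current match x of u.
                keep = {y for y in sim[v]
                        if any((x, y) in g_edge_set for x in sim[u])}
                if keep != sim[v]:
                    sim[v], changed = keep, True
    # Condition 1 of SIM: every query vertex must have at least one match.
    if any(not matches for matches in sim.values()):
        return {u: set() for u in q_vertices}
    return sim


# Hypothetical example: query a -> b with labels A -> B.
q_lab = {"a": "A", "b": "B"}
g_lab = {1: "A", 2: "B", 3: "A", 4: "B"}
print(max_simulation(["a", "b"], [("a", "b")], q_lab, [1, 2, 3, 4], [(1, 2)], g_lab))
print(max_simulation(["a", "b"], [("a", "b")], q_lab, [1, 2, 3, 4], [(1, 2)], g_lab, dual=True))
```

In this example, data vertex 4 carries label B but has no matching parent, so it survives under graph simulation and is removed under dual simulation, which mirrors the intuition that D-SIM preserves both child and parent relationships.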
The algorithms they proposed are sequential algorithms, whereas the algorithms we propose in Chapter 3 are distributed algorithms. In [49], Fan et al. use the notion of an affected area to analyze the time complexity of their proposed algo- rithms. In Chapter 3, we adopt this concept to analyze the computational and communicational complexity of our proposed algorithms for structural group mem- bership monitoring. The algorithms presented in [49] are centralized algorithms that which require access to the complete data graph, query graph, and set of edge updates, to compute the changes in the result match graph due to edge updates. In Chapter 3, we take a vertex-centric approach to the problem, in which each vertex maintains the part of the query graph it matches based on SIM or D-SIM. 33 In our proposed approach for structural group membership monitoring, vertices do not require access to either the complete data graph or the complete set of edge updates. We motivate the importance of our decentralized approach by discussing the applicability of the proposed algorithms for cyber system security applications (Section 3.8). Parallel and distributed algorithms have been proposed for SIM and D-SIM [52, 50]. In [52], Fard et al. presented distributed algorithms for a set of graph simulation based subgraph matching criteria. The algorithms presented in that paper are based on the vertex-centric bulk-synchronous parallel model. To the best of our knowledge, this is the only vertex-centric algorithm for SIM and D-SIM for static graphs. We use these algorithms as the baseline for comparison with our algorithms, which will be presented in Section 3.7. In [50], Fan et al. presented a theoretical analysis of the fundamental limitations of distributed graph simulation. However, their paper focuses on static graphs, whereas we target dynamic graphs. Interactive subgraph matching has been studied in the database research com- munity [43]. 
In interactive subgraph matching, the objective is to find matching subgraphs faster when users revise the input query graph [43]. Interactive subgraph matching is useful when users explore sizeable static datasets without knowing exactly what they are looking for. The work on interactive subgraph matching focuses on static graphs, whereas we focus on dynamic graphs. In our work, we assume that the input query graph remains static.

In Chapter 4, we present a distributed, MPI based, partition-centric algorithm for D-SIM. In that chapter, we use D-SIM as a graph pruning technique to improve the performance of exact subgraph isomorphism on dynamic graphs. Most approaches to subgraph isomorphism in graph data in the last few decades have focused on small graphs [104, 37, 36]. This has been mainly due to the NP-hard nature of the problem [59]. A detailed comparison of existing algorithms for subgraph isomorphism in static graphs can be found in [70]. Recently, an algorithm for subgraph matching in large-scale static graphs in a distributed shared memory environment has been proposed [102]. That paper assumes a distributed shared memory abstraction and uses the Trinity framework [94].

Furthermore, distributed algorithms for subgraph matching have been proposed [76, 53]. However, these algorithms focus on static graphs, whereas our work focuses on dynamic graphs [63]. In [58, 57], Gao et al. proposed a system and a set of optimizations for subgraph isomorphism on distributed dynamic graphs on a vertex-centric graph processing system [78]. However, they focused on approximate subgraph isomorphism, whereas the work presented in Chapter 4 focuses on exact subgraph isomorphism in distributed dynamic graphs. In 2009, Stotz et al. presented an incremental algorithm for exact subgraph isomorphism in dynamic graphs [100, 47]. This is a sequential algorithm, and the evaluations were conducted only for small graphs. Also, in [31], Choudhury et al.
proposed a query decomposition based method to perform incremental subgraph isomorphism in dynamic graphs. This work also presented a sequential, centralized algorithm. Fan et al. [48] presented a query preserving graph pruning algorithm using bi-simulation. They present a theoretical study of the application of bi-simulation based graph pruning to reachability queries and bounded simulation matching. Our work in Chapter 4 focuses on exact subgraph isomorphism, using dual simulation as a graph pruning technique.

In Chapter 5, we present dynamic graph algorithms for a variant of the minimum prize-collecting Steiner tree problem. The minimum prize-collecting Steiner tree problem is a well-known NP-hard problem that appears in the design of utility networks, such as fiber optic networks. Some work has been done on approximation algorithms for the minimum prize-collecting Steiner tree problem [20, 18]. Our optimal protection schemes presented in Chapter 5 are formulated as variants of the prize-collecting Steiner tree problem, and the existing heuristics for the classic prize-collecting Steiner tree (PCST) problem cannot be directly applied. None of the existing variants of PCST capture time-varying aspects in the objective. We have adopted the fast heuristic algorithm proposed for PCST [18] when developing heuristics for the proposed protection schemes.

Chapter 3
Structural Group Membership Monitoring in Dynamic Graphs

In Chapter 1, we motivated the need for dynamic graph based solutions for cyber security applications. We discussed how various cyber attacks can be modeled as graph patterns and how identifying the occurrences of these patterns in dynamic networks can be used to prevent and detect cyber threats. In this chapter, we present our first contribution: distributed data structures and algorithms to monitor the structural group memberships (SGM) of the vertices in a dynamic network.
Given a query graph and a subgraph matching criterion, the SGM of a given vertex in a dynamic network is the set of query vertices that map to it based on the subgraph matching criterion. We consider two widely used subgraph matching criteria: graph simulation and dual simulation. The proposed distributed algorithms update the distributed data structures maintained at each vertex based on their previous state to continually monitor SGMs. We show that our algorithms are memory-efficient and scalable: the amount of memory required to maintain the proposed data structures at each vertex is independent of the size of the dynamic network and is bounded by the size of the query graph. Moreover, the proposed algorithms can be executed asynchronously, making them portable to various network computing environments.

Efficient monitoring of the SGMs of the vertices in a dynamic network can be used to continually monitor computer networks in order to identify and mitigate security vulnerabilities and thereby prevent cyber threats. Moreover, it can be used to monitor social and communication networks to identify and address users that may pose threats. We discuss the applicability of our approach to cyber security applications in Section 3.8 of this chapter.

3.1 Introduction

Finding groups of vertices in a graph is a fundamental problem that has been studied extensively [55]. Groups of vertices have been defined based on connectedness [80], high modularity [38], and structure (e.g., cliques [14]). This chapter focuses on finding and monitoring changes in the structural group membership (SGM) of vertices based on a given query graph and a subgraph matching criterion. The SGM of a vertex in a graph is the set of query vertices that map to the vertex based on the subgraph matching criterion. As the graph changes through edge additions and removals, our objective is to monitor the changes in the SGM of each vertex.
In particular, we focus on monitoring the SGM in a distributed memory message passing environment based on the vertex-centric programming model [78]. Distributed monitoring of the SGMs of vertices in a dynamic network (dynamic graph) is an important problem with many applications.

As an example, consider the cyber attack patterns discussed in Chapter 1, which can be modeled as graph patterns. Given such patterns [32] and the attributes of the nodes that can play different roles in such attacks, the task is to identify whether a node in a cyber network may become part of an attack. As discussed in Chapter 1, when a new communication channel is added to the network, preventive measures can be taken if nodes can detect whether they may become part of an attack and identify their potential role(s) (i.e., SGM) in the attack.

Distributed monitoring of the SGM of the vertices is crucial due to the distributed nature of cyber systems. Additionally, being able to dynamically identify the changes in the SGM as the graph changes is crucial due to the dynamic nature of the graph representations. This is a challenging problem, especially in large dynamic graphs (dynamic networks), as recomputing the SGM after every network change is expensive in terms of communication and computation. To address this problem, we propose distributed data structures and algorithms to maintain the SGM of the vertices in a dynamic network with low overhead. The amount of memory required to maintain the proposed data structures at each vertex is independent of the size of the dynamic network and bounded by the size of the query graph. This property is important for scaling to large networks. Moreover, we show that the proposed algorithms can be executed asynchronously. Asynchronous execution is a pivotal feature in a distributed computing environment for performing low latency analytics.
We focus on graph simulation based matching criteria, which are considered the preferred methods for identifying patterns in modern applications because of their tractable time complexity and because, being less restrictive than subgraph isomorphism, they detect meaningful patterns that it misses [47]. This chapter covers the following aspects:

• We define the problem of monitoring the structural group membership of the vertices in a dynamic network.
• We propose scalable and memory efficient distributed data structures and algorithms to maintain the structural group membership of the vertices in a dynamic network based on graph simulation and dual simulation.
• We prove the correctness of our proposed algorithms and analyze their computational and communication complexity.
• We prove that the proposed algorithms can be executed asynchronously.
• We perform an extensive experimental evaluation of our proposed algorithms to show their effectiveness.
• We discuss the applicability of our proposed solutions to cyber system security applications.

[Figure 3.1: Illustrative examples of structural group memberships based on graph simulation and dual simulation. Panels: query graph (vertices 1 to 4) and data graph (vertices 1 to 8), with labels a, b and c.]

3.2 Problem Formulation

The structural group membership (SGM) of a vertex $v \in V$, denoted $\gamma(v)$, for a given subgraph matching criterion is the set of query vertices that map to $v$ in the matching relation $R$ ($R_M$ or $R_D$). Formally,

$$\gamma(v) = \{u \mid (u,v) \in R\} \qquad (3.1)$$

Consider the query graph and data graph in Figure 3.1, where all the vertices except vertex 8 match the query graph via SIM. Vertices 1, 3, 4, 5, 6 and 7 match the query graph via D-SIM. In this example, the SGM of vertex 6 in the data graph based on D-SIM is {3}, whereas the SGM of vertex 5 is {1, 3}.

In this chapter, we focus on graph updates consisting only of edge updates (edge additions and removals) and not label changes. The algorithms covered in this chapter can be easily extended to handle label changes.
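To make Equation (3.1) concrete, the following is a minimal sketch that computes a graph-simulation relation on a static graph by fixpoint pruning and then derives $\gamma(v)$ from it. It is an illustration only, not one of the algorithms of this chapter: the function names `graph_simulation` and `sgm` and the toy graphs in the usage note are hypothetical, and the sketch checks only the label and child conditions; any further requirement of the definition in Section 2.4 (for example, that every query vertex has at least one match) is not enforced.

```python
from collections import defaultdict


def graph_simulation(q_labels, q_edges, d_labels, d_edges):
    """Return the largest relation of (query vertex, data vertex) pairs that
    satisfies the label condition and the child condition (assumed reading)."""
    q_children = defaultdict(set)
    for u, u2 in q_edges:
        q_children[u].add(u2)
    d_children = defaultdict(set)
    for v, v2 in d_edges:
        d_children[v].add(v2)

    # Start from every label-compatible pair, then prune until nothing changes.
    rel = {(u, v) for u in q_labels for v in d_labels
           if q_labels[u] == d_labels[v]}
    changed = True
    while changed:
        changed = False
        for u, v in list(rel):
            # Child condition: each query child of u needs a matching child of v.
            ok = all(any((u2, v2) in rel for v2 in d_children[v])
                     for u2 in q_children[u])
            if not ok:
                rel.discard((u, v))
                changed = True
    return rel


def sgm(rel, d_vertices):
    """Equation (3.1): gamma(v) is the set of query vertices u with (u, v) in R."""
    gamma = {v: set() for v in d_vertices}
    for u, v in rel:
        gamma[v].add(u)
    return gamma
```

For instance, with a hypothetical query graph consisting of a single edge from an "a" vertex to a "b" vertex, a data vertex labeled "a" that has no child labeled "b" ends up with an empty SGM, while every data vertex labeled "b" has the query "b" vertex in its SGM.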
At time $t$, $G_t$ evolves from $G_{t-1}$ based on a set of edge updates $\Delta e_{u_t}$. We address the structural group membership monitoring problem in a dynamic network. The objective is to maintain the SGM $\gamma(v_t)$ of each vertex $v_t \in V_t$ at time $t$ as the changes $\Delta e_{u_t}$ are applied to the dynamic network.

Motivated by the distributed nature of the applications, as described in Chapter 1, we propose distributed data structures to maintain the SGM of the vertices in a dynamic graph and algorithms to update them with edge updates. In our proposed solution, each vertex maintains a set of data structures as part of its state. Given a set of edge updates $\Delta e_{u_t}$ in the dynamic graph, the proposed algorithms use the current state of the data structures in each vertex to update the SGMs instead of recomputing them from scratch.

3.3 Proposed Solution

The data structures and algorithms covered in this chapter assume a distributed memory message passing environment. Examples include a computing cluster in a data center, in which vertices are distributed across computing nodes, and a network computing environment, in which each vertex only has access to its own state. Each vertex can communicate with other vertices via messages, using the vertex id as the address. Also, each vertex knows the vertex ids of its parents and children. Many distributed vertex-centric graph processing frameworks support these features [78, 54].

We first present the proposed algorithms following the vertex-centric bulk-synchronous parallel (BSP) model [78] described in Section 2.2.1. The BSP model requires a central coordinator. In our solution, the coordinator signals the vertices to start the computation for a given iteration, and each vertex notifies the coordinator once its communication for the iteration is complete. The coordinator is also responsible for initiating the algorithms that update the data structures when the dynamic graph changes. Each vertex reports changes in its incident edges to the coordinator.
The coordinator signals the vertices to start updating the data structures based on the edge updates in the dynamic graph. After this signal, each vertex updates its child and parent vertex ids based on its edge updates. The coordinator decides when to signal the vertices to initiate the update process (e.g., after every x edge updates), so that edge updates are applied in batches after the computation for the previous edge updates has completed. Figure 3.2 illustrates this execution model. The steps executed after a set of edge updates $\Delta e_{u_t}$ are shown in Algorithm 3.

Algorithm 3 Updating group membership
1: Coordinator signals to start the update process.
2: Vertices update their parent/child vertex ids.
3: Vertices start the algorithm to handle edge removals (Section 3.4.1).
4: Coordinator waits until the algorithm terminates.
5: Coordinator signals to start the update process.
6: Vertices start the algorithm to handle edge additions (Section 3.4.2).
7: Coordinator waits until the algorithm terminates.

[Figure 3.2: Execution model for structural group membership monitoring based on the vertex-centric bulk synchronous parallel model. Along the timeline of edge updates, the coordinator starts the update, starts and ends each iteration of the algorithms that handle edge updates, and then starts processing the next batch of edge updates.]

In Section 3.6, we prove that it is possible to execute the algorithms that handle edge removals asynchronously. We also prove that the algorithms that handle edge additions for directed acyclic query graphs can be executed asynchronously. Asynchronous execution removes the need to synchronize with the coordinator at the end of each iteration. Synchronization is still required at the start and at the termination of the algorithms. The algorithms that update the data structures after edge updates based on SIM and D-SIM are presented in Section 3.4.
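The coordinator-driven execution of Algorithm 3 can be sketched as a small synchronous driver. This is a simplified, single-process illustration under stated assumptions, not the implementation evaluated in this chapter: the names `run_phase` and `process_batch` and the `compute(vertex, iteration, messages)` callback signature are hypothetical, and the coordinator here detects termination by observing that no vertex sent a message in an iteration.

```python
def run_phase(vertices, compute):
    """Run one BSP phase and return the number of iterations it took.

    compute(vertex, iteration, messages) must return a list of
    (target_vertex, message) pairs; this signature is an assumption.
    """
    inbox = {v: [] for v in vertices}
    iteration = 0
    while True:
        iteration += 1  # coordinator signals the start of an iteration
        outbox = {v: [] for v in vertices}
        for v in vertices:
            for target, message in compute(v, iteration, inbox[v]):
                outbox[target].append(message)
        # Barrier: every vertex has reported completion of its communication.
        if not any(outbox.values()):
            return iteration  # no messages in flight, the phase has terminated
        inbox = outbox


def process_batch(vertices, update_adjacency, removal_compute, addition_compute):
    """Algorithm 3 for one batch: removals are handled before additions."""
    update_adjacency()                                       # step 2
    removal_iters = run_phase(vertices, removal_compute)     # steps 3-4
    addition_iters = run_phase(vertices, addition_compute)   # steps 6-7
    return removal_iters, addition_iters
```

Running the removal phase to completion before the addition phase mirrors the barrier between steps 4 and 5 of Algorithm 3. In the asynchronous setting of Section 3.6, the per-iteration barrier inside `run_phase` would no longer be needed.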
3.4 Data Structures and Algorithms

Each vertex $v_t \in V_t$ maintains its SGM $\gamma(v_t)$ as a part of its state. The set that contains $\gamma(v_t)$ is denoted by $M$. Additionally, for SIM, each vertex maintains a map ($M_{Children}$) that contains the query vertices in the SGMs of its children together with the frequency of each such query vertex. Similarly, for dual simulation, each vertex maintains maps of its children's and its parents' SGMs and their frequencies, denoted by $M_{Children}$ and $M_{Parents}$ respectively. This design is based on the observation that, in order to evaluate the child condition for SIM (the parent and child conditions for D-SIM) at each vertex, it is only necessary to know the set of SGMs of the children (parents and children for D-SIM). Differentiating the SGMs of each individual child and parent is not necessary. Figure 3.3 illustrates the data structures maintained at each vertex for SIM and D-SIM.

[Figure 3.3: State of vertex 7 in Figure 3.1 for SIM and D-SIM. Panels: $M$ and $M_{Children}$ for SIM; $M$, $M_{Children}$ and $M_{Parents}$ for D-SIM; each map has the columns $V_q$ and Count.]

We assume that each vertex has access to the query graph. If vertices are partitioned among a set of worker nodes in a cluster, the query graph can be stored in each worker, allowing shared read-only access to all vertices within that worker. If a vertex only has access to its local state, it should store the query vertices that have its label and the children of those query vertices for SIM (the parents and children for D-SIM).

We consider four types of edge updates, denoted by cc, cs, sc and ss, that cause changes in the SGM when processing query graphs [49]. Table 3.1 summarizes the definitions of these edge updates. We discuss the algorithms that update the data structures after edge removals and additions in Sections 3.4.1 and 3.4.2. Table 3.2 summarizes the symbols used in the proposed algorithms.

Table 3.1: Definitions of cc, cs, sc and ss edge updates

  cc   edges $(v,v')$ s.t. $l(v) = l_q(u)$ and $l(v') = l_q(u')$ for $(u,u') \in E_q$
  cs   edges $(v,v')$ s.t. $l(v) = l_q(u)$ and $u' \in \gamma(v')$ for $(u,u') \in E_q$
  sc   edges $(v,v')$ s.t. $l(v') = l_q(u')$ and $u \in \gamma(v)$ for $(u,u') \in E_q$
  ss   edges $(v,v')$ s.t. $u \in \gamma(v)$ and $u' \in \gamma(v')$ for $(u,u') \in E_q$

Table 3.2: Symbols and their definitions

  Symbol               Definition
  $M$                  A set containing the SGM of each vertex
  $M_{Children}$       A map of the SGMs of children and their frequencies
  $M_{Parents}$        A map of the SGMs of parents and their frequencies
  $\Delta e^-_{u_t}$   A set of edge removals
  $\Delta e^+_{u_t}$   A set of edge additions
  $M^-$                A map of SGM removals (query vertex, frequency)
  $M^+$                A map of SGM additions (query vertex, frequency)
  $\Delta M$           A map containing $M^-$ and $M^+$
  this                 The current vertex
  $in(V)$              Sum of the in-degrees of the vertices in $V$
  $out(V)$             Sum of the out-degrees of the vertices in $V$

3.4.1 Handling Edge Removals

Edge removals only cause SGM removals, and only ss edge removals do so. The reason is that edge removals only remove entries from $R_M$ and $R_D$, and only ss edge removals result in changes in the match relations [49].

Algorithm 4 presents the steps executed at each vertex to handle edge removals based on SIM. During the first iteration, each target vertex of a removed edge that has a non-empty membership ($|M| > 0$) sends its $M$ to the source vertex of that edge. In the next iteration and beyond, upon receiving the SGMs removed from its children, each vertex updates its $M_{Children}$ (lines 13-17 in Algorithm 4). Then, it evaluates its current $M$ against the updated $M_{Children}$ based on the SIM conditions to find the set of query vertices removed from $M$ (if any) due to the updates (Algorithm 5). The removals are sent to all parent vertices to be processed in the next iteration. The algorithm terminates when there are no removals from $M$ in any vertex.
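To make the categories of Table 3.1 concrete, and in particular the ss test that decides whether a removed edge can affect any SGM, the following sketch classifies a data edge against every query edge. It rests on assumptions that are not stated in the text: the function name `classify_edge` is hypothetical, and because an edge that satisfies the ss condition also satisfies the weaker cs, sc and cc conditions, the sketch reports only the most specific category per query edge.

```python
def classify_edge(v, v2, d_labels, gamma, q_labels, q_edges):
    """Return the set of Table 3.1 categories that data edge (v, v2) falls into.

    gamma maps each data vertex to its current SGM (a set of query vertices).
    Precedence ss > cs > sc > cc per query edge is an assumption of this sketch.
    """
    kinds = set()
    for u, u2 in q_edges:
        src_match = u in gamma[v]                 # u is in gamma(v)
        dst_match = u2 in gamma[v2]               # u' is in gamma(v')
        src_cand = d_labels[v] == q_labels[u]     # l(v) = l_q(u)
        dst_cand = d_labels[v2] == q_labels[u2]   # l(v') = l_q(u')
        if src_match and dst_match:
            kinds.add("ss")
        elif src_cand and dst_match:
            kinds.add("cs")
        elif src_match and dst_cand:
            kinds.add("sc")
        elif src_cand and dst_cand:
            kinds.add("cc")
    return kinds
```

An empty result means that the edge is irrelevant to the query graph, so under the observation above its removal can be ignored.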
Algorithm 4 Edge Removals - SIM (SIM-)
 1: procedure RM-SIM(M, M_Children, Δe^-_ut, Messages)
 2:   iteration = 1
 3:   if ¬isEmpty(M) then
 4:     r ← {(u,v) ∈ Δe^-_ut : this = v}
 5:     for each v_q ∈ M do
 6:       M^-[v_q] ← 1
 7:     end for
 8:     for each e = (u,v) ∈ r do
 9:       sendMessageTo(u, M^-)
10:     end for
11:   end if
12:   iteration ≥ 2
13:   for each M^- ∈ Messages do
14:     for each v_q ∈ M^- do
15:       M_Children[v_q] ← M_Children[v_q] − M^-[v_q]
16:     end for
17:   end for
18:   M_Removals ← EvalSIM-RM(M, M_Children)
19:   for each v_q ∈ M_Removals do
20:     M^-[v_q] ← 1
21:   end for
22:   sendMessageToAllParents(M^-)
23: end procedure

Algorithm 5 Find removals from M
 1: procedure EvalSIM-RM(M, M_Children)
 2:   for each v_q ∈ M do
 3:     for each u_q ∈ {u_q : (v_q, u_q) ∈ E_q} do
 4:       if M_Children[u_q] < 1 then
 5:         M_Removals ← M_Removals ∪ {v_q}
 6:         M ← M \ {v_q}
 7:       end if
 8:     end for
 9:   end for
10:   return M_Removals
11: end procedure

Similarly, the steps executed at each vertex to maintain the SGM based on D-SIM are shown in Algorithm 6. The execution of Algorithm 6 closely follows that of Algorithm 4. The main difference is that, in Algorithm 6, updates to the SGMs are sent to both the parents and the children of the vertices. The updated $M$, $M_{Children}$ and $M_{Parents}$ are evaluated based on the D-SIM conditions to find the set of query vertices removed from $M$ (if any) due to the updates. The algorithm terminates when there are no removals from $M$ in any vertex.
Algorithm 6 Edge Removals - D-SIM (D-SIM-)
 1: procedure RM-DSIM(M, M_Children, M_Parents, Δe^-_ut, Messages)
 2:   iteration = 1
 3:   if ¬isEmpty(M) then
 4:     r_p ← {(u,v) ∈ Δe^-_ut : this = v}
 5:     r_c ← {(u,v) ∈ Δe^-_ut : this = u}
 6:     for each v_q ∈ M do
 7:       M^-[v_q] ← 1
 8:     end for
 9:     for each e = (u,v) ∈ r_p do
10:       sendMessageTo(u, M^-)
11:     end for
12:     for each e = (u,v) ∈ r_c do
13:       sendMessageTo(v, M^-)
14:     end for
15:   end if
16:   iteration ≥ 2
17:   for each M^- ∈ Messages do
18:     for each v_q ∈ M^- do
19:       if isFromChild(M^-) then
20:         M_Children[v_q] ← M_Children[v_q] − M^-[v_q]
21:       else
22:         M_Parents[v_q] ← M_Parents[v_q] − M^-[v_q]
23:       end if
24:     end for
25:   end for
26:   M_Removals ← EvalDSIM-RM(M, M_Children, M_Parents)
27:   for each v_q ∈ M_Removals do
28:     M^-[v_q] ← 1
29:   end for
30:   sendMessageToAllParents(M^-)
31:   sendMessageToAllChildren(M^-)
32: end procedure

[Figure 3.4: Query graph, data graph and initial states of the data structures. $M_C = M_{Children}$, $M_P = M_{Parents}$. The query graph has vertices 1, 2 and 3; the data graph has vertices 1 to 5. Initial states:
  M[1] = {1}   M_C[1] = {(2,2), (3,2)}   M_P[1] = {}
  M[2] = {2}   M_C[2] = {(3,1)}          M_P[2] = {(1,1)}
  M[3] = {2}   M_C[3] = {(3,1)}          M_P[3] = {(1,1)}
  M[4] = {3}   M_C[4] = {}               M_P[4] = {(1,1), (2,1)}
  M[5] = {3}   M_C[5] = {}               M_P[5] = {(1,1), (2,1)}]

Figure 3.4 depicts an illustrative example comprising a query graph, a data graph and the initial state of the proposed data structures for SIM and D-SIM. For the removal of the edge (2,4), the execution of Algorithm 4 proceeds as follows. In the first iteration, vertex 4 sends its SGM to vertex 2. Upon receiving this message in iteration 2, vertex 2 updates its $M_{Children}$. This results in the removal of query vertex 2 from the SGM of vertex 2. Vertex 2 then sends this removal to its parents. Upon receiving the removal message from vertex 2 in iteration 3, vertex 1 updates its data structures. However, this does not cause any changes in its SGM. The algorithm terminates at iteration 3 as there are no more changes in vertex SGMs.
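The walk-through above can be replayed with a small single-process simulation of Algorithms 4 and 5. This is a sketch under explicit assumptions rather than the distributed implementation: the query edges (1→2, 1→3, 2→3) and the data edges are not listed in the text and are inferred here from the initial states shown in Figure 3.4, the helper names are hypothetical, and a vertex sends a message only when its $M$ actually loses a query vertex.

```python
from collections import Counter

# Inferred from the initial states in Figure 3.4 (an assumption of this sketch).
Q_CHILDREN = {1: {2, 3}, 2: {3}, 3: set()}
D_EDGES = {(1, 2), (1, 3), (1, 4), (1, 5), (2, 4), (3, 5)}


def initial_state():
    """Initial M and M_Children of every data vertex, as listed in Figure 3.4."""
    m = {1: {1}, 2: {2}, 3: {2}, 4: {3}, 5: {3}}
    m_children = {1: Counter({2: 2, 3: 2}), 2: Counter({3: 1}),
                  3: Counter({3: 1}), 4: Counter(), 5: Counter()}
    return m, m_children


def eval_sim_rm(m_v, m_children_v):
    """Algorithm 5: drop query vertices whose child condition no longer holds."""
    removed = {vq for vq in m_v
               if any(m_children_v[uq] < 1 for uq in Q_CHILDREN[vq])}
    m_v -= removed  # in-place update of the vertex's M
    return removed


def remove_edges_sim(removed_edges):
    """Algorithm 4 for a batch of removals; returns (iterations, M, M_Children)."""
    m, m_children = initial_state()
    edges = D_EDGES - set(removed_edges)
    parents = {v: {a for a, b in edges if b == v} for v in m}

    # Iteration 1: targets of removed edges report their SGM to the sources.
    inbox = {v: [] for v in m}
    for u, v in removed_edges:
        if m[v]:
            inbox[u].append(Counter(dict.fromkeys(m[v], 1)))

    iteration = 1
    while any(inbox.values()):
        iteration += 1
        outbox = {v: [] for v in m}
        for v, messages in inbox.items():
            if not messages:
                continue
            for removal in messages:
                m_children[v].subtract(removal)  # lines 13-17 of Algorithm 4
            removed = eval_sim_rm(m[v], m_children[v])
            if removed:
                for p in parents[v]:
                    outbox[p].append(Counter(dict.fromkeys(removed, 1)))
        inbox = outbox
    return iteration, m, m_children
```

Calling `remove_edges_sim([(2, 4)])` on this inferred graph finishes in three iterations, empties the SGM of vertex 2, and leaves vertex 1 with $M_C$ = {(2,1), (3,2)}, in line with the description above.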
Table 3.3 presents the changes in the data structures of each vertex at the end of each iteration.

Table 3.3: Changes in data structures for the removal of edge (2,4) based on SIM. $M_C = M_{Children}$, '-' denotes no change, v: vertex id and i: iteration.

  v   State   i = 1             i = 2      i = 3
  1   M       {1}               -          -
      M_C     {(2,2), (3,2)}    -          {(2,1), (3,2)}
  2   M       {2}               {}         -
      M_C     {(3,1)}           {(3,0)}    -
  3   M       {2}               -          -
      M_C     {(3,1)}           -          -
  4   M       {3}               -          -
      M_C     {}                -          -
  5   M       {3}               -          -
      M_C     {}                -          -

Similarly, Table 3.4 shows the state of each vertex at the end of each iteration of Algorithm 6 when the edge (1,2) is removed from the data graph in Figure 3.4. In the first iteration, vertex 1 sends its SGM as a removal to vertex 2, and vertex 2 sends its SGM as a removal to vertex 1. Upon receiving the removal messages, vertices 1 and 2 update their data structures. This causes vertex 2 to lose query vertex 2 from its SGM, because it no longer has a parent that matches query vertex 1. Vertex 2 thus sends this removal to vertex 4. In the next iteration, upon receiving the message from vertex 2, vertex 4 updates its data structures, causing it to lose query vertex 3 from its SGM. Vertex 4 sends this removal to vertex 2 and vertex 1. Vertex 2 and vertex 1 update their data structures based on the incoming messages from vertex 4 in the next iteration. However, this does not cause any change in the SGM of any vertex. The algorithm terminates at iteration 4 as there are no more changes to vertex SGMs.

3.4.2 Handling Edge Additions

In this section, we focus on the algorithms that handle edge additions. It has been shown that edge additions to the data graph only add matches to $R_M$ and $R_D$ [49]. Moreover, for directed acyclic (DAG) query graphs, only cs edge additions result in changes in $R_M$ based on SIM [49]. Hence, for DAG query graphs, only cs edge additions add SGMs based on SIM.
Table 3.4: Changes in data structures for the removal of edge (1,2) based on D-SIM. $M_C = M_{Children}$, $M_P = M_{Parents}$, '-' denotes no change, v: vertex id and i: iteration.

  v   State   i = 1             i = 2             i = 3             i = 4
  1   M       {1}               -                 -                 {1}
      M_C     {(2,2), (3,2)}    {(2,1), (3,2)}    -                 {(2,1), (3,1)}
      M_P     {}                -                 -                 -
  2   M       {2}               {}                -                 -
      M_C     {(3,1)}           -                 -                 {(3,0)}
      M_P     {(1,1)}           {(1,0)}           -                 -
  3   M       {2}               -                 -                 -
      M_C     {(3,1)}           -                 -                 -
      M_P     {(1,1)}           -                 -                 -
  4   M       {3}               -                 {}                -
      M_C     {}                -                 -                 -
      M_P     {(1,1), (2,1)}    -                 {(1,1), (2,0)}    -
  5   M       {3}               -                 -                 -
      M_C     {}                -                 -                 -
      M_P     {(1,1), (2,1)}    -                 -                 -

Algorithm 7 shows the steps executed at each vertex to update the SGM based on SIM for DAG query graphs when a set of edges ($\Delta e^+_{u_t}$) is added. In the first iteration, each target vertex of an added edge that has a non-empty $M$ sends its $M$ to the source vertex of that edge. In the next iteration and beyond, upon receiving the SGMs added to its children, each vertex updates its $M_{Children}$. The vertex then evaluates its current $M$ against the updated $M_{Children}$ based on the SIM matching conditions to find all additions to $M$ (see Algorithm 8). The query vertices added to $M$ are sent to all parents to be processed in the next iteration. The algorithm terminates when there are no more additions to $M$ in any vertex in the data graph.
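The evaluation step that Algorithm 7 delegates to Algorithm 8 can be sketched as a pure function of the local vertex state. This is a minimal illustration under assumptions: the name `eval_sim_add`, the argument layout, and the representation of the query graph as a label map plus a child map are all hypothetical, and the function returns the new $M$ instead of updating it in place.

```python
from collections import Counter


def eval_sim_add(m, m_children, vertex_label, q_labels, q_children):
    """Return (new M, additions to M) for one vertex under SIM.

    m            current SGM of the vertex (set of query vertices)
    m_children   Counter of query vertices matched by the vertex's children
    q_labels     query vertex -> label
    q_children   query vertex -> set of child query vertices
    """
    # Candidates: query vertices that carry the same label as this vertex.
    candidates = {uq for uq, label in q_labels.items() if label == vertex_label}
    # Keep a candidate only if every one of its query children is matched
    # by at least one child of this vertex.
    matched = {vq for vq in candidates
               if all(m_children[uq] >= 1 for uq in q_children[vq])}
    return matched, matched - m
```

For a vertex labeled b whose only matched child membership is query vertex 3, and a query graph in which the b vertex has a single child (query vertex 3), the function reports the b vertex as a new addition; with a child count of zero it reports nothing.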
Algorithm 7 CS Edge Additions - SIM (SIM+)
 1: procedure AddCS-SIM(M, M_Children, Δe^+_ut, Messages)
 2:   iteration = 1
 3:   if ¬isEmpty(M) then
 4:     r ← {(u,v) ∈ Δe^+_ut : this = v}
 5:     for each v_q ∈ M do
 6:       M^+[v_q] ← 1
 7:     end for
 8:     for each e = (u,v) ∈ r do
 9:       sendMessageTo(u, M^+)
10:     end for
11:   end if
12:   iteration ≥ 2
13:   for each M^+ ∈ Messages do
14:     for each v_q ∈ M^+ do
15:       M_Children[v_q] ← M_Children[v_q] + M^+[v_q]
16:     end for
17:   end for
18:   M_Additions ← EvalSIM-ADD(M, M_Children, this)
19:   for each v_q ∈ M_Additions do
20:     M^+[v_q] ← 1
21:   end for
22:   sendMessageToAllParents(M^+)
23: end procedure

Algorithm 8 Find new additions to M
 1: procedure EvalSIM-ADD(M, M_Children, v)
 2:   M_C ← {u_q : u_q ∈ V_q ∧ L(u_q) = L(v)}
 3:   for each v_q ∈ M_C do
 4:     for each u_q ∈ {u_q : (v_q, u_q) ∈ E_q} do
 5:       if M_Children[u_q] < 1 then
 6:         M_C ← M_C \ {v_q}
 7:       end if
 8:     end for
 9:   end for
10:   M_Additions ← M_C \ M
11:   M ← M_C
12:   return M_Additions
13: end procedure

We put forward the following proposition on edge additions for D-SIM.

Proposition 3.1 For a DAG query graph, only the addition of a cs or sc edge to the data graph adds new matches to $R_D$.

Proof: The addition of an edge $e = (v,v')$ has the following cases.

1. Neither $v$ nor $v'$ is a match for any query vertex $u \in V_q$: Since $v$ is not a match of any $u \in V_q$, either $L(v) \neq L(u)$, or there is a child $u'$ of $u$ such that none of the children of $v$ is a match of $u'$, or there is a parent $u''$ of $u$ such that none of the parents of $v$ is a match of $u''$. In any of these cases, $v$ cannot match $u$ after the addition of edge $e = (v,v')$ for DAG query graphs.

2. Edge $e$ is an ss edge: Since there are $u \in \gamma(v)$ and $u' \in \gamma(v')$ for $(u,u') \in E_q$, adding the edge does not add an entry to $R_D$.

3. Edge $e$ is a cs edge: Since $L(v) = L(u)$ and $u' \in \gamma(v')$ for $(u,u') \in E_q$, if $v$ has a parent $v''$ that matches $u''$ such that $(u'',u) \in E_q$, the addition of the edge will result in $v$ becoming a member of $u$ via D-SIM.

4. Edge $e$ is an sc edge: Since $u \in \gamma(v)$ and $L(v') = L(u')$ for $(u,u') \in E_q$, if $v'$ has a child $v''$ that matches $u'''$ such that $(u',u''') \in E_q$, the addition of the edge will result in $v'$ becoming a member of $u'$ via D-SIM.

Thus, only the addition of a cs or sc edge adds new matches to $R_D$ based on D-SIM for DAG query graphs.

Algorithm 9 shows the steps executed at each vertex to update the SGM based on D-SIM when a set of edges ($\Delta e^+_{u_t}$) is added to the dynamic graph. Algorithm 9 executes similarly to Algorithm 7, except that new membership additions are communicated to both parents and children so that they can update their $M_{Children}$ and $M_{Parents}$.

Algorithm 9 Edge Additions - D-SIM (D-SIM+)
 1: procedure AddCS-DSIM(M, M_Children, M_Parents, Δe^+_ut, Messages)
 2:   iteration = 1
 3:   if ¬isEmpty(M) then
 4:     r_p ← {(u,v) ∈ Δe^+_ut : this = v}
 5:     r_c ← {(u,v) ∈ Δe^+_ut : this = u}
 6:     for each v_q ∈ M do
 7:       M^+[v_q] ← 1
 8:     end for
 9:     for each e = (u,v) ∈ r_p do
10:       sendMessageTo(u, M^+)
11:     end for
12:     for each e = (u,v) ∈ r_c do
13:       sendMessageTo(v, M^+)
14:     end for
15:   end if
16:   iteration ≥ 2
17:   for each M^+ ∈ Messages do
18:     for each v_q ∈ M^+ do
19:       if isFromChild(M^+) then
20:         M_Children[v_q] ← M_Children[v_q] + M^+[v_q]
21:       else
22:         M_Parents[v_q] ← M_Parents[v_q] + M^+[v_q]
23:       end if
24:     end for
25:   end for
26:   M_Additions ← EvalDSIM-ADD(M, M_Children, M_Parents)
27:   for each v_q ∈ M_Additions do
28:     M^+[v_q] ← 1
29:   end for
30:   sendMessageToAllParents(M^+)
31:   sendMessageToAllChildren(M^+)
32: end procedure

Handling Edge Additions for General Query Graphs

Fan et al. pointed out that both cc and cs edges add entries to the match relations for general query graphs [49]. Moreover, cc edge additions only add new entries to the match relations if the query graph contains strongly connected components (SCCs). Additionally, cc edge additions only add SGMs to vertices in the SCCs formed in the data graph that consist only of cc edges.
As a result, the query vertices added to $M$ can be found by recomputing the SGMs within these SCCs after cc edge additions. Algorithm 10 shows the steps executed in order to find the query vertices that may be added to $M$ due to cc edge additions based on SIM. Algorithm 10 is adapted from the distributed algorithm presented in [52] to compute SIM in static graphs. The query vertices added to $M$ based on D-SIM can be found in a similar manner.

Algorithm 10 requires each vertex to maintain a separate set of data structures, similar to those used for handling edge updates for DAG query graphs. They are denoted by $M'$ and $M'_{Children}$ respectively. Moreover, each vertex should maintain the set of vertices in the query graph that are part of an SCC in $Q$ and have the same label as the vertex. This set is denoted by $V^{SCC}_q$. Algorithm 10 recomputes the SGM of the vertices that may be part of SCCs with cc edges in the data graph. At the termination of Algorithm 10, the $M'$ of each vertex contains the query vertices that may be newly added due to cc edge additions. The new group membership additions for each vertex are found by comparing $M'$ and $M$ ($M' \setminus M$). The new additions are copied to $M$, and $M'$ and $M'_{Children}$ are cleared. The new additions to $M$ are propagated inductively to the parent vertices, as in iteration 2 of Algorithm 7.
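The set $V^{SCC}_q$ that each vertex keeps can be precomputed from the query graph alone. The sketch below uses Kosaraju's two-pass depth-first search, which is a standard technique chosen here for illustration and is not taken from the text. The function name `scc_query_vertices` is hypothetical, and the sketch assumes that a query vertex counts as part of an SCC only when its component has more than one vertex or it has a self-loop.

```python
def scc_query_vertices(q_labels, q_edges, vertex_label):
    """Query vertices that lie on a cycle of Q and carry vertex_label."""
    adj = {u: [] for u in q_labels}
    radj = {u: [] for u in q_labels}
    for u, w in q_edges:
        adj[u].append(w)
        radj[w].append(u)

    seen = set()

    def dfs(u, graph, out):
        # Recursive post-order DFS; adequate for small query graphs.
        seen.add(u)
        for w in graph[u]:
            if w not in seen:
                dfs(w, graph, out)
        out.append(u)

    # Pass 1: record the finishing order on the original query graph.
    order = []
    for u in q_labels:
        if u not in seen:
            dfs(u, adj, order)

    # Pass 2: explore the reversed graph in decreasing finishing order;
    # each DFS tree is one strongly connected component.
    seen.clear()
    on_cycle = set()
    for u in reversed(order):
        if u not in seen:
            component = []
            dfs(u, radj, component)
            if len(component) > 1 or (u, u) in q_edges:
                on_cycle.update(component)

    return {u for u in on_cycle if q_labels[u] == vertex_label}
```

Because the query graph is assumed to remain static, this set needs to be computed only once per distinct vertex label.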
Algorithm 10 CC Edge Additions - SIM
 1: procedure AddCC-SIM(M', M'_Children, Messages)
 2:   iteration = 1
 3:   if ¬isEmpty(M) then
 4:     r ← {(u,v) ∈ Δe^+_ut : this = v}
 5:     for each v_q ∈ M do
 6:       M^+[v_q] ← 1
 7:     end for
 8:     for each e = (u,v) ∈ r do
 9:       sendMessageTo(u, M^+)
10:     end for
11:   end if
12:   M' ← {u_q : u_q ∈ V^SCC_q}
13:   for each v_q ∈ M' do
14:     M^+[v_q] ← 1
15:   end for
16:   if ¬isEmpty(M') then
17:     sendMessageToAllParents(M^+)
18:   end if
19:   iteration = 2
20:   if ¬isEmpty(M') then
21:     for each M^+ ∈ Messages do
22:       for each v_q ∈ M^+ do
23:         M'_Children[v_q] ← M'_Children[v_q] + M^+[v_q]
24:       end for
25:     end for
26:     M_Removals ← EvalSIM-RM(M', M'_Children)
27:     for each v_q ∈ M_Removals do
28:       M^-[v_q] ← 1
29:     end for
30:     sendMessageToAllParents(M^-)
31:   end if
32:   iteration ≥ 3
33:   if ¬isEmpty(M') then
34:     for each M^- ∈ Messages do
35:       for each v_q ∈ M^- do
36:         M'_Children[v_q] ← M'_Children[v_q] − M^-[v_q]
37:       end for
38:     end for
39:     M_Removals ← EvalSIM-RM(M', M'_Children)
40:     for each v_q ∈ M_Removals do
41:       M^-[v_q] ← 1
42:     end for
43:     sendMessageToAllParents(M^-)
44:   end if
45: end procedure

Following the analysis in [52], one can observe that Algorithm 10 takes $O(|E_{SCC_t}|)$ iterations to terminate, where $|E_{SCC_t}|$ denotes the number of edges in the largest SCC in the data graph after the edge additions.

Given a set of edge additions, Algorithm 10 is executed first, followed by Algorithm 7, with the two separated by a barrier. In practice, the new additions to $M$ due to cc edge additions are propagated to the parents within the first iteration of Algorithm 7.

3.5 Correctness and Complexity

We say that $M$ for a given vertex $v_t \in V_t$ is correct if all the query vertices in $M$ are valid based on the conditions of the subgraph matching criterion (Section 2.4). We say that $M$ is complete if it contains all the query vertices that should be in $\gamma(v_t)$ based on the conditions of the subgraph matching criterion.
Algorithm 4 and Algorithm 6 terminate when there are no more removals from $M$ in any vertex in the data graph. In Algorithms 4 and 6, after iteration 1, only the vertices that receive removal messages from their children or parents do any processing. Moreover, a vertex only sends messages when there are removals from its $M$. Given that $\gamma(v_t)$ is a finite set for all vertices ($\forall v_t \in V_t, |\gamma(v_t)| \leq |Q|$), the algorithms terminate in finite time; in the worst case, they terminate when all the memberships have been removed. Extending this argument, we obtain:

Theorem 3.1 $M$ at each vertex is correct and complete at the termination of Algorithms 4 and 6.

We prove Theorem 3.1 for SIM. The proof for D-SIM is similar.

Proof: At the start of iteration 1, each vertex $v_t \in V_t$ maintains the correct and complete set $M$ ($\gamma(v_{t-1})$) from before the edge removals (initial condition). As discussed before, since edge removals only remove query vertices from $\gamma(v_{t-1})$, at the start of iteration 1, $M$ is a superset of the complete and correct $M$ after the edge removals. Therefore, we need to prove that our algorithm only filters out invalid query vertices from $M$. Further, we need to prove that the algorithm filters out all invalid query vertices from $M$.

The algorithm maintains the invariant that $M$ for each vertex at the end of each iteration is correct and complete based on its local $M_{Children}$. This invariant holds at the end of each iteration because the vertices that received any removal messages update their $M_{Children}$ and re-evaluate $M$ to remove all query vertices that are no longer valid for $M$ as a result of the updated $M_{Children}$ (in Algorithm 5). It can be clearly observed that Algorithm 5 only removes query vertices that are no longer valid for $M$ as a result of the updated $M_{Children}$. As discussed before, the algorithm terminates after a finite number of iterations, when there are no more removals from $M$ in any vertex. This occurs when $M$ in every vertex satisfies the child condition.
Therefore the invariant holds at the termination of the algorithm.

In the proposed data structures for SIM and D-SIM, the number of entries in M, M_Children and M_Parent is bounded by the number of query vertices. As a result, the upper bound on the memory requirement for the complete data structure is O(|V_q||V_t|), where each vertex maintains O(|V_q|) elements as a part of its state. Thus we obtain:

Theorem 3.2 The memory requirement for each vertex to maintain M, M_Children and M_Parent is independent of the size of the data graph, and is only bounded by the size of the query graph.

Therefore, for a fixed query graph, the additional memory requirement imposed by our proposed data structures is linear in the number of vertices in the data graph. This allows our approach to scale to large networks.

We adopt the notion of affected area (AFF) used in analyzing incremental algorithms [49] to analyze the complexity of the proposed algorithms. Let V_AFF denote the set of vertices in the data graph whose data structure state changed due to a set of edge updates. Additionally, let E_AFF denote the set of edges adjacent to vertices in V_AFF.

In Algorithm 4, a vertex does not send messages in iteration 2 and beyond unless there is a removal from its M. Also, the vertices that cannot match any vertex in Q do not send messages. Hence, for a given set of edge removals, changes are contained within the AFF of the data graph for those edge removals. As a result, Algorithm 4 takes O(|E_AFF|) iterations to terminate in the worst case for a set of edge removals. Similarly, Algorithm 6 also takes O(|E_AFF|) iterations.

In Algorithm 4 and Algorithm 6, only the vertices in V_AFF send messages in the second iteration and beyond. Additionally, in iteration 1 of Algorithm 4, target vertices of edge updates send messages to source vertices. Similarly, in Algorithm 6, source and target vertices of edge updates send messages to target and source vertices respectively.
Let μ_SIM− and μ_DSIM− denote the total number of messages sent by the vertices in Algorithm 4 and Algorithm 6 respectively. The following inequalities hold for SIM and D-SIM respectively:

\[ \mu_{SIM-} \le out(V_{AFF}) \cdot i + |\Delta e^{-}_{ut}| \tag{3.2} \]

\[ \mu_{DSIM-} \le (in(V_{AFF}) + out(V_{AFF})) \cdot i + 2 \cdot |\Delta e^{-}_{ut}| \tag{3.3} \]

where i denotes the number of iterations it takes for the algorithms to terminate.

Algorithm 7 and Algorithm 9 terminate when there are no more additions to M at any vertex in the data graph. In the proposed algorithms, after iteration 1, only the vertices that receive addition messages from their children or parents do any processing. Moreover, a vertex only sends messages when there are additions to its M. Given that γ(v_t) is a finite set for all vertices (∀v_t ∈ V_t, |γ(v_t)| ≤ |Q|), the algorithms terminate in finite time; in the worst case, this happens when all the vertex memberships reach their upper limit (|γ(v)| = |Q|). Extending this argument, we obtain:

Theorem 3.3 M at each vertex is correct and complete at the termination of Algorithm 7 and at the termination of Algorithm 9.

We prove Theorem 3.3 for SIM. The proof for D-SIM is similar.

Proof: At the start of iteration 1, each vertex v_t ∈ V_t maintains the correct and complete set M (γ(v_{t−1})) from before the edge additions (initial condition). As discussed before, since edge additions only add query vertices to M, M at the start of iteration 1 is a subset of the complete and correct M after the edge additions. This set is expanded in subsequent iterations by adding new query vertices. Therefore, we need to prove that the algorithm only adds valid members to M and that M contains all valid members at termination. The algorithm maintains the invariant that M for each vertex is correct and complete based on its local M_Children.
This invariant holds at the end of each iteration, as the vertices that receive addition messages update their M_Children and re-evaluate M to find all query vertices that may have been added to M as a result of the updated M_Children (in Algorithm 8). Also, Algorithm 8 does not add invalid query vertices to M. As discussed before, the algorithm terminates in finite time. This occurs when there are no more additions to M at any vertex, which means that M at every vertex satisfies the child condition of SIM. Therefore the invariant holds at the termination of the algorithm.

An upper bound on the number of iterations required for Algorithm 7 can be obtained similarly to Algorithm 4. Because vertices only send out new additions to their M, Algorithms 7 and 9 take O(|E_AFF|) iterations in the worst case to terminate for a set of edge additions.

In Algorithm 7 and Algorithm 9, only the vertices in V_AFF send messages in the second iteration and beyond. Additionally, in iteration 1 of Algorithm 7, sink vertices of edge updates send messages to source vertices. Similarly, in Algorithm 9, source and sink vertices of edge updates send messages to sink and source vertices respectively.

Let μ_SIM+ and μ_DSIM+ denote the number of messages sent by vertices in Algorithm 7 and Algorithm 9 respectively. The following inequalities hold for SIM and D-SIM respectively:

\[ \mu_{SIM+} \le out(V_{AFF}) \cdot i + |\Delta e^{+}_{ut}| \tag{3.4} \]

\[ \mu_{DSIM+} \le (in(V_{AFF}) + out(V_{AFF})) \cdot i + 2 \cdot |\Delta e^{+}_{ut}| \tag{3.5} \]

where i denotes the number of iterations it takes for Algorithm 7 and Algorithm 9 to terminate.

3.6 Asynchronous Execution

Asynchronous execution is an important feature to have in distributed algorithms. It not only enables low-latency results but also makes the algorithms portable to various network computing environments.

Algorithms 4 and 6, which handle edge removals, can be executed asynchronously without a barrier to separate the iterations.
When executed in an asynchronous manner, messages may cross the iteration boundaries and may arrive out of order. The steps in the second iteration are executed at each vertex as the messages arrive.

Proposition 3.2 M at each vertex is correct and complete at the termination of Algorithms 4 and 6 when executed in an asynchronous manner.

We prove Proposition 3.2 for SIM. The proof for D-SIM is similar.

Proof: Algorithm 4 eventually terminates because, for any vertex v_t ∈ V_t, γ(v_t) is a finite set. When a vertex receives SGM removals from its children, the algorithm updates M_Children and evaluates its current M based on the local M_Children to find all the query vertices that may be removed from M. Each vertex eventually sends its SGM removals to its parents. Because edge removals can only cause removals from M, and removals are always caused by local evaluations of the SIM conditions, M always satisfies the condition of SIM based on its local M_Children after executing lines 13-18 of Algorithm 4 (notice that EvalSIM-RM in Algorithm 4 is commutative). As the algorithm continues executing until there are no more removals, M at each vertex is correct and complete when the algorithm terminates.

Similarly, Algorithms 7 and 9 can be executed asynchronously.

Proposition 3.3 M at each vertex is correct and complete at the termination of Algorithm 7 and at the termination of Algorithm 9 when executed in an asynchronous manner.

The proof is similar to the proof of Proposition 3.2.

Extending Propositions 3.2 and 3.3, we argue that the algorithms that handle edge additions (Algorithm 7 and Algorithm 9) and removals (Algorithm 4 and Algorithm 6) can be combined and executed asynchronously. Algorithm 11 presents the combined algorithm to handle edge updates based on SIM for DAG query graphs.
Algorithm 11 Asynchronous Algorithm to Handle Edge Updates - SIM (ASIM)
 1: procedure RM-SIM(M, M_Children, Δe⁻_ut, Messages)
 2:   onStartUpdate
 3:   if ¬isEmpty(M) then
 4:     r⁻ ← {(u, v) ∈ Δe⁻_ut : this = v}
 5:     r⁺ ← {(u, v) ∈ Δe⁺_ut : this = v}
 6:     for each v_q ∈ M do
 7:       M⁻[v_q] ← 1
 8:     end for
 9:     for each e = (u, v) ∈ r⁻ do
10:       sendMessageTo(u, M⁻)
11:     end for
12:     for each v_q ∈ M do
13:       M⁺[v_q] ← 1
14:     end for
15:     for each e = (u, v) ∈ r⁺ do
16:       sendMessageTo(u, M⁺)
17:     end for
18:   end if
19:   onMessages
20:   for each M⁻ ∈ Messages do
21:     for each v_q ∈ M⁻ do
22:       M_Children[v_q] ← M_Children[v_q] − M⁻[v_q]
23:     end for
24:   end for
25:   for each M⁺ ∈ Messages do
26:     for each v_q ∈ M⁺ do
27:       M_Children[v_q] ← M_Children[v_q] + M⁺[v_q]
28:     end for
29:   end for
30:   ΔM ← EvalSIM(M, M_Children)
31:   sendMessageToAllParents(ΔM)
32: end procedure

Lines 3-18 of Algorithm 11 are executed initially at each vertex when the coordinator signals the start of the update process. Lines 20-31 are executed at vertices as and when messages arrive.

In Algorithm 11, as long as a vertex receives all the SGM additions and removals from its neighborhood without a failure, its SGMs eventually converge to the same set irrespective of the order in which the SGM additions and removals arrive. Also, each vertex correctly sends out the additions or removals (ΔM) of its M to its parents. Moreover, as long as SGM additions and removals do not propagate in a causal loop, the algorithm terminates, and such a causal loop cannot occur for DAG query graphs. Putting these arguments together, we obtain:

Theorem 3.4 Algorithm 11 eventually terminates and M at each vertex is correct and complete at the termination.

Similarly, the following can be obtained:

Theorem 3.5 When combined and executed in an asynchronous manner, Algorithm 6 and Algorithm 9 eventually terminate and M at each vertex is correct and complete at the termination.
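The order-independence argument behind Theorem 3.4 can be illustrated with a small Python sketch. It is not the thesis implementation: the query graph, the `eval_sim` helper, and the `run_async` driver are hypothetical names introduced here, and the sketch models only the child condition of SIM for a single data vertex whose label-compatible candidate set is given. The point it tries to show is that M_Children is a set of counters updated by commutative increments and decrements, so the final M does not depend on the arrival order of the messages.

```python
from itertools import permutations

# Hypothetical DAG query graph: query vertex -> its children in the query.
QUERY_CHILDREN = {"a": ["b", "c"], "b": [], "c": []}


def eval_sim(candidates, m_children):
    """Keep the candidate query vertices whose child condition holds, i.e.
    every query child is matched by at least one child of this data vertex."""
    return {
        q for q in candidates
        if all(m_children.get(c, 0) > 0 for c in QUERY_CHILDREN[q])
    }


def run_async(candidates, m_children, messages):
    """Apply (sign, query_vertices) messages one at a time, re-evaluating M
    after each message. Returns the final M and the deltas sent to parents."""
    counts = dict(m_children)
    m = eval_sim(candidates, counts)
    sent = []
    for sign, query_vertices in messages:
        for q in query_vertices:
            counts[q] = counts.get(q, 0) + sign
        new_m = eval_sim(candidates, counts)
        if new_m != m:
            sent.append((new_m - m, m - new_m))  # (additions, removals)
        m = new_m
    return m, sent


# A data vertex labelled "a" with one child matching "b" and none matching "c"
# (assumed starting state). Three messages arrive in every possible order.
messages = [(-1, ["b"]), (+1, ["b"]), (+1, ["c"])]
final_sets = {
    frozenset(run_async({"a"}, {"b": 1, "c": 0}, order)[0])
    for order in permutations(messages)
}
print(final_sets)
```

Every arrival order ends with the same match set, although the intermediate deltas sent to parents differ between orders. This is why the argument also requires that each delta is eventually delivered.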
3.7 Evaluations

We implemented our proposed algorithms on Apache Flink Gelly, utilizing its vertex-centric graph processing framework [54]. We deployed Flink on a dedicated commodity cluster of 12 nodes with 8 cores in each node.

The proposed data structures were stored as a part of the state in each vertex. Our data structure design enabled us to utilize message combiners [78] to reduce communication between cluster nodes. A message combiner at each worker takes the set of incoming messages for a vertex and reduces the number of messages by aggregating them with a user-defined function. The distributed algorithms for SIM and D-SIM on static graphs presented in [52] were used to initialize the proposed data structures.

3.7.1 Datasets

We used three types of datasets in our evaluations:
1 RMAT graphs.
2 Real-world planar graphs.
3 Real-world power-law graphs.

We used a graph generator based on the R-MAT model [28] to generate synthetic graphs by varying the number of vertices |V| and the graph density, denoted by β, such that |E| = β|V|. We randomly assigned labels to vertices from a label set L such that |L| = 50. All other initialization parameters were based on default Graph500 benchmark values [34].

We used United States road network datasets [1] as planar graphs. Road networks are undirected sparse graphs with fairly uniform degree distributions. We converted them to directed graphs by randomly removing 20% of bidirectional edges.

Directed social and web graph datasets were used as real-world graphs with a power-law degree distribution [25, 24]. Table 3.5 summarizes the details of the real-world datasets. The Amazon dataset is a symmetric graph describing similarity among books as reported by the Amazon store¹.
LiveJournal is a social network graph in which vertices are users and there is an edge from user u_1 to user u_2 if u_1 registered u_2 among his friends². The UK web crawl graph was obtained from a 2002 crawl of the .uk domain³.

Dataset                  |V|          |E|           Type
Western USA R/N          6,262,104    15,248,146    Planar
Central USA R/N (CTR)    14,081,816   34,292,496    Planar
Full USA R/N (USA)       23,947,347   58,333,344    Planar
Amazon (AZ)              735,323      5,158,388     Power-law
Live Journal (LJ)        5,363,260    79,023,142    Power-law
UK Web Crawl (UK)        18,520,486   298,113,762   Power-law
Table 3.5: Datasets

¹ http://law.di.unimi.it/webdata/amazon-2008/
² http://law.di.unimi.it/webdata/ljournal-2008/
³ http://law.di.unimi.it/webdata/arabic-2005/

Query graphs were randomly extracted from the graph datasets. We extracted query graphs with various numbers of nodes. Vertex labels were randomly assigned from the label set L.

Edge updates were generated randomly from the data graphs for each query graph. All updates were cc edge updates, generated for each (query graph, data graph) pair. The objective was to increase the probability of edge updates affecting the vertex SGMs.

We generated five sets of query graphs and edge updates for each set of parameters used in the evaluations. We report median values.

3.7.2 Evaluation Metrics

We measured three parameters in order to evaluate the performance of our algorithms.

Number of messages (μ): The total number of messages exchanged during the execution of an algorithm.

Number of vertex activations (α): The total number of times the computation logic in vertices was executed.

Number of iterations (i): The number of iterations it took for an algorithm to terminate.

We used the state-of-the-art vertex-centric algorithms for SIM and D-SIM on static graphs [52] as a baseline. The baseline algorithms compute SGMs from scratch after updating the graph. For comparison, we used the percentage of savings achieved by our algorithms for each parameter compared to the baseline.
The percentage of savings achieved for parameter x is denoted by Θ(x) and defined as:

\[ \Theta(x) = \left( \frac{x_{Static} - x_{Incremental}}{x_{Static}} \right) \times 100\% \tag{3.6} \]

where x_Static is the measured value of parameter x for the baseline and x_Incremental is the value when our algorithms were used.

3.7.3 Results

Performance on Various Datasets: We evaluated the performance of the proposed algorithms on various datasets for DAG and general query graphs.

Figure 3.5 shows the number of messages (μ) and the percentage of savings in the number of messages (Θ(μ)) on various datasets ((a) road networks, (b) power-law graphs, (c) RMAT graphs: n×β where |V| = 2^n) for DAG query graphs with |V_q| = 10, |Δe_ut| = 1024. As can be seen, compared to the baseline, the proposed algorithms saved over 99% of the messages across all datasets, both real and synthetic. This was due to the fact that, in our proposed algorithms, messages are contained within the AFF of the edge updates. We observed an increasing trend in Θ(μ) with the increasing size of the graph. Also, the number of messages exchanged on power-law graphs was significantly higher than on road network graphs. This was due to the sparse edge distribution in road network graphs compared to power-law graphs.

Figure 3.5: Number of messages and percentage of savings in the number of messages. (Three panels (a)-(c) over the dataset groups W/CTR/USA, 23x8/24x8/25x8 and AZ/LJ/UK; axes μ and Θ(μ); series for SIM+, DSIM+, SIM− and DSIM−.)
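As a quick worked example of Equation 3.6, the Python sketch below computes the percentage of savings Θ(x). The function name `savings_percent` and the input values are hypothetical and are not measurements from the evaluation.

```python
def savings_percent(x_static, x_incremental):
    """Theta(x) from Equation 3.6: relative saving of the incremental
    algorithm over the static baseline, as a percentage."""
    if x_static == 0:
        raise ValueError("baseline value must be non-zero")
    return (x_static - x_incremental) / x_static * 100.0


# Hypothetical values: a baseline of 7 iterations against 3 incremental ones.
print(round(savings_percent(7, 3), 2))
# A negative result means the incremental run cost more than the baseline.
print(savings_percent(4, 8))
```

Because the denominator is the baseline value, Θ(x) is at most 100% and has no lower bound when the incremental algorithm is more expensive than recomputing from scratch.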
In Figure 3.6, we show the number of vertex activations (α) and the percentage of savings in the number of vertex activations (Θ(α)) on various datasets ((a) road networks, (b) power-law graphs, (c) RMAT graphs: n×β where |V| = 2^n) for DAG query graphs with |V_q| = 10, |Δe_ut| = 1024. Even though a significant improvement compared to the baseline was observed, the savings were lower than for Θ(μ). This is because, in the first iteration, all the vertices are activated irrespective of edge updates. As a result, α increased with the size of the graph. This is a limitation of the traditional vertex-centric graph processing framework that we used for our implementation. As can be seen in Algorithms 4, 6, 7 and 9, vertices with no incident edge updates do not do any processing. Thus, vertices with no incident edge updates can be ignored in the first iteration of the proposed algorithms. Figure 3.7 shows a comparison of α and Θ(α) on an RMAT dataset with and without ignoring vertices with no incident edge updates in the first iteration, with |V| = 2^23, β = 8, |V_q| = 10, |Δe_ut| = 1024. Figure 3.7 shows that savings in vertex activations similar to Θ(μ) can be obtained by ignoring the vertices with no incident edge updates in the first iteration of Algorithms 4, 6, 7 and 9.

Figure 3.6: Number of vertex activations and the percentage of savings in the number of vertex activations. (Three panels (a)-(c) over the dataset groups W/CTR/USA, 23x8/24x8/25x8 and AZ/LJ/UK; axes α and Θ(α); series for SIM+, DSIM+, SIM− and DSIM−.)

Figure 3.7: Comparison of α and Θ(α) on a RMAT dataset with and without activating vertices with no incident edge updates in the first iteration. (Series: α-Normal, α-Optimized, Θ(α)-Normal, Θ(α)-Optimized for SIM+, DSIM+, SIM− and DSIM−.)
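The first-iteration optimization described above amounts to seeding the computation with only the vertices that have an incident edge update. The Python sketch below shows one way to derive that seed set. It assumes a framework that accepts an explicit initial active set, which is not how the Gelly-based implementation in this chapter was run. The function name and the `dual` flag are hypothetical. Following Section 3.5, only target vertices send messages in iteration 1 for SIM, while both endpoints do for D-SIM.

```python
def initially_active(edge_updates, dual=True):
    """Vertices that need to run in iteration 1 for a batch of edge updates.

    edge_updates: iterable of (source, target) pairs.
    dual: True for D-SIM (both endpoints send messages),
          False for SIM (only the target sends to the source).
    """
    active = set()
    for source, target in edge_updates:
        active.add(target)
        if dual:
            active.add(source)
    return active


updates = [(1, 2), (2, 3)]
print(initially_active(updates, dual=False))
print(initially_active(updates, dual=True))
```

The size of this set is bounded by twice the number of edge updates, so the first-iteration activations no longer grow with the size of the data graph.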
The number of iterations (i) and the percentage of savings in the number of iterations (Θ(i)) on various datasets ((a) road networks, (b) power-law graphs, (c) RMAT graphs: n×β where |V| = 2^n) for DAG query graphs with |V_q| = 10, |Δe_ut| = 1024 are shown in Figure 3.8. A reduction in the number of iterations of up to 57.14% compared to the baseline was observed in most cases.

Figure 3.8: The number of iterations and percentage of savings in the number of iterations. (Three panels (a)-(c) over the dataset groups W/CTR/USA, 23x8/24x8/25x8 and AZ/LJ/UK; axes i and Θ(i); series for SIM+, DSIM+, SIM− and DSIM−.)

Figure 3.9 shows the performance of our proposed algorithms for general query graphs ((a) μ and Θ(μ), (b) α (in millions) and Θ(α), (c) i and Θ(i); |V_q| = 10, |Δe_ut| = 1024). As can be seen, even though Θ(μ) is significant compared to the baseline for edge additions, it is low compared to the Θ(μ) for edge removals. Additionally, Θ(α) and Θ(i) for edge additions were poor compared to the baseline. This is due to the fact that handling edge additions for general graph queries requires an extra re-computation step in order to handle cc edge additions.

We observed that the performance of our proposed algorithms for handling edge additions on general query graphs was comparatively poor in terms of α and i. All other algorithms demonstrated significant improvements compared to the baseline. A detailed performance analysis of these algorithms in various settings is given below.

Impact of Number of Edge Updates: We evaluated the impact of |Δe_ut| by measuring the performance with varying |Δe_ut|. In Figure 3.10, we show Θ(μ), Θ(α), μ and α with an increasing number of edge updates.
We observed a gradual increase in μ with increasing |Δe_ut|. Consistent with this observation, α also increased with the number of edge updates. This results from the growth of the AFF with the increasing number of edge updates. Θ(μ) and Θ(α) gradually decreased as a result. The increase in the number of edge updates had little impact on the number of iterations.

Figure 3.9: Performance on various datasets (CTR, LJ and RMAT (|V| = 2^23 and |E| = 2^26)) for general query graphs. (Three panels (a)-(c); axes μ and Θ(μ), α and Θ(α), i and Θ(i); series for SIM+, DSIM+, SIM− and DSIM−.)

Figure 3.10: Impact of number of edge updates. (Panels (a) μ and Θ(μ), (b) α and Θ(α); x-axis values 128, 256, 512, 1024, 2048; series for SIM+, DSIM+, SIM− and DSIM−.)

Figure 3.11: Impact of query graph. (Panels (a) μ and Θ(μ), (b) α and Θ(α); x-axis values 5, 10, 20, 40; series for SIM+, DSIM+, SIM− and DSIM−.)

Impact of Query Graph: Next, we evaluated the performance with increasing query graph size (|V_q|) with |Δe_ut| = 1024 and an RMAT data graph with |V| = 2^23, β = 8. As shown in Figure 3.11 (a), a slight increase in the number of messages was
observed with increasing query graph size for all the proposed algorithms for DAG query graphs. Consistent with this observation, a slight increase in the number of vertex activations was observed (Figure 3.11 (b)). Θ(μ) and Θ(α) increased compared to the baseline with increasing query graph size. Increasing query graph size had a minimal effect on i and Θ(i).

Impact of Data Graph Density: Figure 3.12 shows the impact of graph density (β) on performance for DAG query graphs with |V_q| = 10, |Δe_ut| = 1024 and an RMAT data graph with |V| = 2^23. We observed an increasing trend in μ and α with increasing graph density. Only a slightly increasing trend was observed for Θ(μ), whereas no clear correlation between Θ(α) and β was observed in our evaluations. No clear correlation between i or Θ(i) and β was observed.

Figure 3.12: Impact of data graph density (β). (α in millions) (Panels (a) and (b); x-axis values 23x4, 23x6, 23x8, 23x10, 23x12; series for SIM+, DSIM+, SIM− and DSIM−.)

3.8 Applicability

In this section, we discuss the applicability of the proposed data structures and algorithms in real-world cyber system security applications.

Preventing Cyber Attacks on Cyber Networks: Figure 3.13 illustrates an overview of a distributed cyber attack prevention system that can be implemented using the proposed algorithms.

Figure 3.13: Preventing Cyber Attacks on Cyber Networks. (Panels: (a) physical network, (b) abstracted dynamic network, (c) coordinator election.)

Consider the computer network illustrated in Figure 3.13 (a). In this network, computers may have different characteristics, such as firewalls, operating systems, etc. Applications running on the computers may communicate with each other, using the underlying network.
The proposed method for distributed cyber attack prevention involves installing a trusted software agent on each computer. Each software agent monitors the application-layer communication channels going in and out of its computer (Figure 3.13 (b)). It keeps track of new incoming and outgoing communication channels and of communications that may have terminated for various reasons (e.g., firewall policy changes, link failures, etc.). The software agents communicate with each other using the IP of each computer as the address.

In this use case, each computer is modeled as a vertex in the dynamic network and communication channels are modeled as edges. The different characteristics of the computers are modeled as vertex labels. The network is dynamic since the vertex labels can be updated over time and edges can be added or deleted based on communication between the computers. Query graphs representing known communication patterns in cyber attacks are distributed to each computer and stored as a part of its vertex state. The software agent installed on each computer is responsible for maintaining the proposed data structures in memory. One of the software agents is selected as the coordinator (Figure 3.13 (c)) using a distributed leader election algorithm [74].

Software agents report the changes in adjacent edges to the elected coordinator. The coordinator is responsible for orchestrating the distributed algorithms that update the data structures and, with them, the memberships. The software agent installed on each computer monitors changes in its SGMs and triggers alarms or takes automatic actions (e.g., changes in firewall policy, updates to the virus protection software) if its computer becomes a potential member of a known cyber attack.

Preventing Threats in Social Networks: As explained in Chapter 1, communication patterns of certain groups of people can be modeled as graphs.
Identifying people in these groups and their roles in a social/communication network is an important problem, which has many applications in the area of security. In large social networks, user data is stored in multiple data centers. Based on user activities, such as liked posts or information the user shares, a profile of each user can be generated. This profile can contain potential roles the user can play in a suspicious group.

In a social network, users are modeled as vertices and communications between users as edges. Potential roles in the user profile are modeled as vertex labels. Vertex-centric graph processing frameworks such as Apache Flink Gelly [54] can be used to maintain and update the SGMs of the users as the graph changes. In this case, each vertex maintains the proposed data structures as a part of its vertex state. As illustrated in Figure 3.14, edge updates are batched and applied to the graph in iterations. Vertex SGMs are updated on the graph using the proposed algorithms. After updating the SGMs of the vertices, each vertex is processed in a massively parallel manner to identify changes in its SGMs that may result in critical events, such as the detection of a user being a part of a suspicious communication group.

Figure 3.14: A graph processing framework for preventing threats in social networks. (Stages: batching edge updates, apply updates, iterative vertex-centric computation, orchestration, detect critical events.)

3.9 Summary

In this chapter, we presented distributed data structures and algorithms to monitor structural group membership in dynamic networks. We evaluated our proposed algorithms on various real-world and synthetic datasets, using an open source vertex-centric graph processing framework. The proposed algorithms demonstrated superior performance compared with a state-of-the-art vertex-centric baseline in terms of the number of computations and communications for directed acyclic query graphs.
The results of the evaluation showed that our approach achieves over 99% savings in computation and communication compared with the baseline. We showed the significance of our proposed algorithms by presenting two use cases in real-world scenarios.

Chapter 4
Exact Subgraph Matching in Dynamic Graphs

In Chapter 3, we covered the structural group membership monitoring problem for dynamic graphs. The structural group membership problem finds the set of query graph vertices that match a data graph vertex based on a subgraph matching criterion. We used two relaxed matching criteria, called graph simulation and dual simulation (Section 2.4). While graph simulation and dual simulation find most meaningful patterns, they may cause false positives. In this chapter, we cover a distributed algorithm for exact subgraph matching (subgraph isomorphism) in dynamic graphs. Subgraph isomorphism is a fundamental graph problem with many applications, including the applications in cyber system security discussed in Chapter 1. Due to its NP-hard nature, subgraph isomorphism in large dynamic graphs is considered a challenging problem. Addressing this challenge, we first cover a distributed graph pruning algorithm (D-IDS) for dynamic graphs to enable efficient subgraph isomorphism. D-IDS continually maintains the maximum dual simulation match in a dynamic graph. We then present D-ISI, a distributed incremental algorithm for subgraph isomorphism that uses D-IDS. We show the effectiveness of the proposed algorithms using evaluations on large real-world datasets.

4.1 Introduction

Subgraph isomorphism (ISO) is a fundamental graph problem with wide and varied applications beyond what was discussed in Chapter 1. These applications include substructure matching in chemical components, analysis of social network structures, cryptography and security [16, 56, 95, 89]. For the case of static graphs, several algorithms have been developed for ISO [104, 37, 36].
These fall into two main categories [102, 58]: 1) exploratory methods and 2) partial query match-join methods. Exploratory methods start from a single vertex in the data graph and explore the rest of the vertices, evaluating them against structural properties and matching constraints to determine whether they can be matched to the query. Partial query match-join methods find partial candidates for different query vertices and try to join them incrementally to create the complete pattern.

While there has been a lot of work on parallel and distributed incremental algorithms for dynamic graphs in general [29, 44, 58], not much effort has been devoted to the ISO problem. Due to its NP-hard nature, exact ISO in large dynamic graphs is considered challenging [58], and thus much of the existing work has focused on approximate algorithms [58]. However, we assert that exact ISO is an essential requirement in several mission-critical application domains, for example, fully automated cyber intrusion detection and prevention systems [115]. False positives that may occur if relaxed matching criteria, such as graph simulation and dual simulation, are used will cause issues in such applications. Thus the exact ISO problem is of independent interest.

Driven by the need to develop low-latency and exact solutions to the ISO problem in dynamic graphs, in this chapter we take a practical approach towards developing incremental distributed algorithms. This chapter covers the following aspects:

• We identify the limitations of incremental ISO based on neighborhood search for small diameter graphs. To address these limitations, we present a novel distributed graph pruning technique for dynamic graphs (D-IDS) that preserves subgraph isomorphism matches.

• We present a distributed incremental algorithm for exact subgraph isomorphism that uses the above-mentioned graph pruning technique.
• Via experimental evaluations, we demonstrate the effectiveness of our algorithms using real-world graph datasets on a commodity cluster environment in Amazon EC2.

4.2 Problem Formulation

We study the problem of exact subgraph isomorphism in dynamic graphs. We present a distributed incremental subgraph matching algorithm for dynamic graphs that uses subgraph isomorphism as the subgraph matching criterion.

For dynamic graphs, let M_t be the set of subgraphs in G_t that match a query graph Q via ISO. An incremental subgraph matching algorithm takes G_t, Δe_u and M_t as the input and produces M_{t+1} for G_{t+1} by computing the changes ΔM to the match set M_t.

Figure 4.1 illustrates five graph snapshots of a dynamic graph; each snapshot G_{t+1} is obtained by applying an edge update to the previous snapshot G_t. In Figure 4.1, vertices {1, 2, 3, 5} in G_4 match the given query graph via ISO. This match appears as the graph changes from G_3 to G_4 with the addition of an edge. Also, the matched pattern dissolves as the graph changes from G_4 to G_5 with an edge removal.

Figure 4.1: Five temporal snapshots of a dynamic graph. (Query graph and snapshots G_1 to G_5 over labeled vertices 1:a, 2:b, 3:c, 4:d, 5:a; edge updates +(1:a, 4:d), +(3:c, 5:a), +(1:a, 3:c), -(2:b, 3:c).)

A distributed incremental subgraph matching algorithm partitions the input among worker nodes and reuses already computed results to minimize unnecessary re-computations, thus enabling low-latency, high-throughput computation. The algorithms presented in this chapter assume a distributed memory computing environment with a set of worker nodes W. Each worker w ∈ W maintains its own vertex-disjoint partition G^w_t of the dynamic graph G_t at time t.

Table 4.1 provides a summary of symbols and definitions used in the rest of the chapter.
4.3 Incremental Subgraph Isomorphism Matching In Dynamic Graphs

The computation time of ISO in a dynamic graph is known to be unbounded relative to the size of edge updates Δe_u or the incremental match set ΔM [47]. But as we observe below, the set of subgraphs that can potentially be in ΔM as the result of an edge update lies within a neighborhood of the edge update, bounded by the query diameter.

Observation 4.1 Let e_u = (v_i, v_j) be an edge update to the data graph G_t which results in G_{t+1}. Then ∀ v_k ∈ V(ΔM): s(v_k, v_i) ≤ d_Q and s(v_k, v_j) ≤ d_Q.

Observation 4.1 can be exploited to reduce the computation space for incremental ISO in a dynamic graph and thereby decrease computation latency. Using Observation 4.1, we develop a distributed incremental algorithm (Algorithm 12) for ISO in dynamic graphs (D-ISI) via a simple framework that re-uses legacy subgraph isomorphism libraries developed for small static graphs.

Table 4.1: Symbols and their definitions

Symbol            Definition
d_G               Diameter (longest undirected shortest path) of the graph G
s(v_i, v_j)       Length of the shortest path between two vertices v_i and v_j
V(ΔM)             Set of vertices in ΔM
l(v)              Label associated with vertex v
G^w_t             Partition of G_t in worker w ∈ W
G_{t,DSim}        Maximum dual simulation match of G_t
G^w_{t,DSim}      Partition of G_{t,DSim} in worker w ∈ W
MS[v]             Match set of vertex v
P[v]              Parent set of vertex v
C[v]              Child set of vertex v
MS^w_t[v]         Match set of each vertex v in G^w_t
P^w_t[v]          Match set of parents of vertex v in G^w_t
P^w_t[v][u]       Match set of parent u of vertex v in G^w_t
C^w_t[v]          Match set of children of vertex v in G^w_t
C^w_t[v][u]       Match set of child u of vertex v in G^w_t
Δe^w_u            Set of edge updates assigned to worker w ∈ W
e^w_u+            Set of edge additions assigned to worker w ∈ W
e^w_u−            Set of edge removals assigned to worker w ∈ W
match[v]          Match status (true/false) of vertex v
L[v]              Labels of parents and children of vertex v
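Observation 4.1 can be illustrated with a small sequential sketch. The code below is not the thesis implementation; it assumes an undirected view of the data graph given as an edge list, and the function names `within_hops` and `affected_region` are hypothetical. It collects the vertices that lie within d_Q hops of both endpoints of an edge update, which is the only region where vertices of ΔM can occur according to the observation. Distances are measured in whichever graph snapshot is passed in.

```python
from collections import deque


def within_hops(adj, start, limit):
    """Return the set of vertices at most `limit` undirected hops from `start`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if dist[v] == limit:
            continue  # do not expand beyond the hop limit
        for n in adj.get(v, ()):
            if n not in dist:
                dist[n] = dist[v] + 1
                queue.append(n)
    return set(dist)


def affected_region(edges, update, d_q):
    """Vertices within d_q hops of BOTH endpoints of the edge update."""
    adj = {}
    for a, b in edges:  # build an undirected adjacency map
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    v_i, v_j = update
    return within_hops(adj, v_i, d_q) & within_hops(adj, v_j, d_q)


# A path graph 1-2-...-8 with the update on edge (4, 5) and query diameter 2.
path = [(i, i + 1) for i in range(1, 8)]
print(sorted(affected_region(path, (4, 5), 2)))
```

On a long path the region stays small regardless of the graph size, which is the large-diameter situation the algorithm targets.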
The proposed algorithm is designed specifically to deliver low-latency analytics on dynamic graphs whose diameter is much larger than the diameter of the query graph (d_{G_t} >> d_Q). For small diameter dynamic graphs, where this is not the case, we develop a graph pruning scheme described in the next section.

Road networks and social networks are example graphs that demonstrate these behaviors. Road networks are examples of large diameter graphs, whereas in social networks the effective diameter of the graph is very low (< 10) [71]. As a result, for a given query graph Q, an edge update in a social network can potentially affect subgraph matches in a much larger portion of the graph than an edge update in a road network.

Algorithm 12 takes a stream of edge updates (edge stream) as input and processes edge updates in batches. Each batch of edge updates collected at time t starts a three-stage process. First, edge updates are assigned to different partitions of the dynamic graph using an edge partitioning algorithm running on a single processing node (master), where each partition (G^w_t) is processed by a single worker (w). We use a simple hash-based partitioning strategy which assigns the edge updates to different partitions by hashing the source vertex of each edge update. While there has been some work on graph partitioning strategies [64], evaluating those strategies on this algorithm is beyond the scope of this chapter. Second, edge updates are applied to the respective graph partitions, which results in G_{t+1}. Third, each worker starts processing its graph partition to find the portion of ΔM caused by the edge updates assigned to the worker. Figure 4.2 illustrates the data flow of Algorithm 12.

Figure 4.2: Data flow of the proposed algorithm for exact subgraph isomorphism. (Stages shown: 1. Partitioner, 2. Workers, 3. Reducers.)
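The hash-based assignment of edge updates in the first stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis code: updates are represented as hypothetical `(op, source, target)` tuples, vertex identifiers are small non-negative integers (for which Python's built-in `hash` returns the integer itself), and `partition_updates` is an illustrative name.

```python
def partition_updates(updates, num_workers):
    """Assign each edge update to a worker by hashing its source vertex."""
    parts = [[] for _ in range(num_workers)]
    for op, src, dst in updates:
        worker = hash(src) % num_workers  # same source -> same worker
        parts[worker].append((op, src, dst))
    return parts


# Sample batch; the edges are borrowed from Figure 4.1 purely as test data.
batch = [("+", 1, 4), ("+", 3, 5), ("+", 1, 3), ("-", 2, 3)]
for worker, assigned in enumerate(partition_updates(batch, 2)):
    print(worker, assigned)
```

Hashing on the source vertex keeps all updates that share a source on one worker, at the cost of possible load imbalance when a few vertices receive most of the updates.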
Algorithm 12 Distributed Incremental Subgraph Isomorphism Matching In A Dynamic Graph (D-ISI)
 1: procedure D-ISI(Δe_u)
 2:   if master then partition(Δe_u)
 3:     for each w ∈ WORKERS do
 4:       sendupdates(Δe^w_u)
 5:     end for
 6:   else
 7:     Δe^w_u ← receiveUpdates(.)
 8:     ▷ Apply edge updates to G^w_t
 9:     G^w_{t+1} ← UpdateGraph(G^w_t, Δe^w_u)
10:     ΔM ← process(G^w_{t+1}, Δe^w_u)
11:     Barrier(WORKERS)
12:     M_{t+1} ← M_t ⊕ ΔM
13:   end if
14: end procedure

Algorithm 13 presents the core of the distributed algorithm executed at each worker for computing ΔM. Each worker loops through the edge updates from Δe_u assigned to it and finds the subgraphs affected by these updates in its partition using a distributed breadth-first search limited to depth d_Q. Utilizing Observation 4.1, we note that ΔM for this worker can be found by restricting the search for matches to only these subgraphs. Existing subgraph isomorphism libraries can be used to find these matches. The algorithm exploits parallelism at multiple levels. Edge updates are first processed in parallel by distributing them among the workers, followed by distributed subgraph construction for each edge update. Conflicting edge updates (additions/removals of the same edge) in a batch are removed before the partitioning step, and duplicate matches due to overlaps in the constructed subgraphs are removed in line 8 of Algorithm 13.

Algorithm 13 Computing ΔM for Δe^w_u
 1: procedure process(G^w_{t+1}, Δe^w_u)
 2:   F ← matches found for each subgraph
 3:   for each e ∈ Δe^w_u do  s ← e.source;  t ← e.target
 4:     SG_e ← DDLBFS(s, t, G^w_{t+1}, Q, d_Q)
 5:     F[SG_e] ← MatchIso(SG_e, Q)
 6:     Barrier(WORKERS)
 7:   end for
 8:   return FindDiff(F, M_t)
 9: end procedure

Note that in the UpdateGraph subroutine of our Algorithm 12 (line 9), G^w_{t+1} refers to the new data graph created after worker w applies all edge updates from edge stream Δe^w_u.
Although temporary matches may exist at internal points of the edge stream, the algorithm only finds and outputs final matches that exist in G^w_{t+1}. If desired, the algorithm can output matches at a more fine-grained scale by reducing the size of Δe_u.

Each worker starts a distributed depth limited breadth-first search (DDLBFS) (Algorithm 14) independently to construct the subgraph for each of its edge updates. Execution follows the bulk synchronous parallel (BSP) execution model [105]. Each worker maintains a map that tracks the vertices visited by each independent search. This map is used at the end of DDLBFS to send the portions of subgraphs to the respective workers (lines 24-26 in Algorithm 14).

Algorithm 14 Distributed Depth Limited Breadth First Search
 1: procedure DDLBFS(source, target, G^w_{t+1}, Q, d_Q)
 2:   M ← vertices visited by each processor
 3:   for ss ← 1 to d_Q do
 4:     if ss = 1 then
 5:       M[w].add(source)
 6:       M[w].add(target)
 7:       N ← Neighbors(source, target)
 8:       root ← w
 9:       SendMessages(N, root)
10:     else
11:       MSGS ← Messages sent to this worker
12:       for each m ∈ MSGS do
13:         v ← m.vertex
14:         root ← m.root
15:         if NotVisited(v, root) then
16:           M[root].add(v)
17:           N ← Neighbors(v)
18:           SendMessages(N, root)
19:         end if
20:       end for
21:     end if
22:     Barrier(WORKERS)
23:   end for
24:   for each root worker w ∈ M do
25:     SendSubGraph(M[w], G^w_{t+1})
26:   end for
27:   SG ← receiveSubGraph(.)
28:   return SG
29: end procedure

4.4 Distributed Graph Pruning

As described previously, Algorithm 12 is likely to work well only for large diameter data graphs with comparatively small query graphs. Our experimental results show that the performance of the algorithm degrades rapidly with increasing query graph diameters, especially in small diameter graphs. This is because the size of the subgraph constructed in Algorithm 13 grows rapidly with increasing d_Q on small diameter graphs.
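The super-step structure of Algorithm 14, and the growth of the constructed subgraph with the search depth, can be explored with a single-process simulation. The sketch below is an assumption-laden illustration rather than the MPI implementation: workers are simulated as list slots, vertices are owned by `vertex % num_workers`, messages carry only the vertex (one search, so the root is implicit), and the target is recorded at its owning worker. Following the listing literally, a run with k super-steps marks vertices up to k - 1 hops from either endpoint; the number of super-steps is therefore left as an explicit parameter.

```python
def ddlbfs(adj, num_workers, source, target, supersteps):
    """Simulate a BSP depth-limited BFS and return all visited vertices."""
    def owner(v):
        return v % num_workers  # hypothetical vertex-to-worker mapping

    local = [set() for _ in range(num_workers)]  # per-worker visited sets
    inbox = [set() for _ in range(num_workers)]  # messages for this step
    for ss in range(1, supersteps + 1):
        outbox = [set() for _ in range(num_workers)]
        if ss == 1:
            frontier = [(owner(v), v) for v in (source, target)]
        else:
            frontier = [(w, v) for w in range(num_workers) for v in inbox[w]]
        for w, v in frontier:
            if v in local[w]:
                continue  # already visited by this search
            local[w].add(v)
            for n in adj.get(v, ()):
                outbox[owner(n)].add(n)  # message to the neighbor's owner
        inbox = outbox  # barrier: messages are delivered in the next step
    return set().union(*local)


# Undirected path 1-2-...-7, search started from the updated edge (3, 4).
adj = {v: {u for u in (v - 1, v + 1) if 1 <= u <= 7} for v in range(1, 8)}
for steps in (1, 2, 3):
    print(steps, sorted(ddlbfs(adj, 2, 3, 4, steps)))
```

On a path the visited set grows by two vertices per super-step; on a graph with small diameter and skewed degrees the same loop would reach a large share of the graph within a few steps, which is the effect that motivates pruning.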
In this section, we present a distributed graph pruning algorithm that can significantly improve the performance of exact ISO on small diameter graphs.

Our proposed graph pruning algorithm maintains a pruned graph of the underlying dynamic graph G_t based on dual simulation, where we maintain the maximum dual simulation match of G_t for Q as the pruned graph at each time point t. The following Proposition 4.1 can be easily verified from the definitions of dual simulation and subgraph isomorphism (Section 2.4) [75].

Proposition 4.1 In a data graph, the maximum dual simulation match for a query graph contains all subgraphs that match the query graph via subgraph isomorphism.

We refer readers to Fig. 2.5, which illustrates this proposition. We develop a distributed algorithm to maintain the maximum dual simulation match in a dynamic graph. Our algorithm follows the BSP model and can take O(|E_SCC|) super-steps to complete in the worst case, where |E_SCC| denotes the number of edges in the largest strongly connected component of the data graph. However, we observed that the number of super-steps required in practice is significantly smaller.

Algorithm 15 shows the integration of D-IDS (distributed incremental dual simulation) with D-ISI (distributed incremental subgraph isomorphism). Symbols used in the algorithms presented in this section are summarized with their definitions in Table 4.1. As shown in lines 9-11 of Algorithm 15, we perform distributed graph pruning on the updated dynamic graph before processing the edge updates to find ΔM. As a result, the rest of the processing occurs on the pruned graph, which can be very small compared to the original graph. Given the NP-hard nature of the problem, this reduction in size can significantly improve the performance of ISO in dynamic graphs.
While the proposed pruning technique does not bring down the computational complexity of ISO, it helps to decrease computation time by reducing the size of the data graph to be searched for matches along with the communication overhead of subgraph construction (line 4 in Algorithm 13).

Algorithm 15 D-ISI with Distributed Graph Pruning
 1: procedure D-ISI(Δe_u)
 2:   if master then
 3:     partition(Δe_u)
 4:     for each w ∈ WORKERS do
 5:       sendupdates(e^w_u)
 6:     end for
 7:   else
 8:     e^w_u ← receiveUpdates(.)
 9:     G^w_{t+1} ← UpdateGraph(G^w_t, e^w_u)
10:     G^w_{t+1,DSim} ← MaximumDualSim(G^w_{t+1}, e^w_u)
11:     ΔM ← process(G^w_{t+1,DSim}, e^w_u)
12:     Barrier(WORKERS)
13:     M_{t+1} ← M_t ⊕ ΔM
14:   end if
15: end procedure

Associated with each vertex v, D-IDS maintains a set of states: 1) a match state (0/1) indicating whether the vertex is in the maximum dual simulation match (1 if the vertex is in the match); 2) the set of query vertices that v matches via dual simulation (the match set of v, MS[v]); 3) the match sets of each parent and child of v (P[v], C[v]).

Given an initial graph G_0, D-IDS is initialized using a distributed dual simulation matching algorithm for static graphs, for which Algorithm 16 shows the high-level steps associated with each vertex. Algorithm 16 executes in iterations which we call super-steps (ss) following the BSP execution model.
Algorithm 16 BSP Algorithm to Initialize The Graph For D-IDS
 1: procedure Initialize(G^w_0)
 2:   match ← initialized to 0 for each vertex in G^w_0 in ss 0
 3:   if ss = 0 then
 4:     for each v ∈ V_0 do
 5:       if ∀ v_q ∈ V_Q : l_q(v_q) = l_0(v) then
 6:         MS^w_0[v] ← MS^w_0[v] ∪ v_q
 7:       end if
 8:       if MS^w_0[v].size > 0 then
 9:         match[v] ← 1
10:       end if
11:       SendToParentsAndChildren(v, l_0(v))
12:     end for
13:   else
14:     if ss = 1 then
15:       for each v ∈ V_0 ∧ match[v] = 1 do
16:         L[v] ← labels from parents and children of v
17:         {P^w_0[v], C^w_0[v]} ← CreateMatchSets(L[v])
18:         ▷ R[v]: match removals from vertex v
19:         ▷ EvalDSim: evaluate matching rules
20:         R[v] ← EvalDSim(MS^w_0[v], P^w_0[v], C^w_0[v])
21:         if MS^w_0[v].size = 0 then
22:           match[v] ← 0
23:         end if
24:         SendToParentsAndChildren(R[v], v)
25:       end for
26:     else
27:       R_p ← GetParentRemovals
28:       R_c ← GetChildRemovals
29:       for each r ∈ R_c do
30:         v ← r.vertex  ▷ target vertex
31:         RemoveFromChildMatch(r, v)
32:         R[v] ← EvalDSim(MS^w_0[v], P^w_0[v], C^w_0[v])
33:         SendToParentsAndChildren(R[v], v)
34:       end for
35:       for each r ∈ R_p do
36:         v ← r.vertex
37:         RemoveFromParentMatch(r, v)
38:         R[v] ← EvalDSim(MS^w_0[v], P^w_0[v], C^w_0[v])
39:         if MS^w_0[v].size = 0 then match[v] ← 0
40:         end if
41:         SendToParentsAndChildren(R[v], v)
42:       end for
43:     end if
44:   end if
45: end procedure

In the first super-step, the match set MS[v] of each vertex v is initialized by adding all query vertices that can match v based on its label (line 5 of Algorithm 16). The match status of every vertex whose match set is non-empty is set to 1. Labels of matching vertices are sent to their parents and children to be received in the next super-step. In the next super-step, each currently matched vertex v creates its parent and child match sets (P[v], C[v]) from the labels received. These are then evaluated together with the match set MS[v] to see whether they satisfy the dual simulation rules (see Section 2.4). Query vertices that violate these rules are removed from the match set.
The match status is set to 0 for each vertex with an empty match set. Removals from the match set of each vertex are sent to its parents and children as messages to be received in the next super-step. In subsequent super-steps, vertices that receive removal notifications update their parent and child match sets based on these removals. Their match set is then evaluated against the updated parent and child match sets to see whether it still satisfies the dual simulation rules. Query vertices that violate these rules are again removed from the match set, and the removals are notified to the parents and children of these vertices as before. The algorithm stops when there are no more removal notifications.

Algorithm 17 provides an overview of our distributed graph pruning algorithm for dynamic graphs. Given a batch of edge updates Δe_u, we first filter out the safe edge updates. Safe edge updates are edge updates that have no impact on the maximum dual simulation match (i.e., they are safe to add or remove), and this can be verified using only local information associated with the source and sink vertex of the edge update.

Algorithm 17 Distributed Incremental Dual Simulation Matching in a Dynamic Graph (D-IDS)
 1: procedure MaximumDualSim(G^w_{t+1}, Δe^w_u)
 2:   {Δe^w_u+, Δe^w_u−} ← PruneSafeEdges(Δe^w_u)
 3:   G^w_{t+1,DSim} ← ProcessRemovals(G^w_{t+1}, Δe^w_u−)
 4:   G^w_{t+1,DSim} ← ProcessAdditions(G^w_{t+1,DSim}, Δe^w_u+)
 5:   return G^w_{t+1,DSim}
 6: end procedure

We use Propositions 4.2 and 4.3 to determine safe edge updates.

Proposition 4.2 Edge removal e = (v, u) is safe if ∄ (v_q, u_q) ∈ E_Q : l_q(v_q) = l(v) AND l_q(u_q) = l(u), or if match[v] = 0 OR match[u] = 0 (either the source or the sink vertex is not in the match set).

Proposition 4.3 Edge addition e = (v, u) is safe if ∄ v_q ∈ V_Q : l_q(v_q) = l(v) OR l_q(v_q) = l(u).

The algorithm to handle edge removals closely follows the steps in Algorithm 16.
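The local tests of Propositions 4.2 and 4.3 can be written down directly. The sketch below is illustrative only: the function names, the dictionary-based representation of labels and match status, and the sample labels (A, B, C) are assumptions made for the example, not details taken from the thesis.

```python
def removal_is_safe(edge, label, match, query_edges, query_label):
    """Proposition 4.2: removal is safe if an endpoint is unmatched or no
    query edge carries the same (source label, sink label) pair."""
    v, u = edge
    if not match.get(v, 0) or not match.get(u, 0):
        return True
    return not any(
        query_label[a] == label[v] and query_label[b] == label[u]
        for a, b in query_edges
    )


def addition_is_safe(edge, label, query_label):
    """Proposition 4.3: addition is safe if neither endpoint label occurs
    on any query vertex."""
    v, u = edge
    q_labels = set(query_label.values())
    return label[v] not in q_labels and label[u] not in q_labels


# Hypothetical sample data: a two-vertex query with edges in both directions.
query_label = {"q1": "A", "q2": "B"}
query_edges = [("q1", "q2"), ("q2", "q1")]
label = {1: "A", 2: "B", 3: "A", 4: "C", 5: "B", 6: "C"}
match = {1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 0}

print(removal_is_safe((3, 5), label, match, query_edges, query_label))
print(removal_is_safe((3, 4), label, match, query_edges, query_label))
print(addition_is_safe((5, 1), label, query_label))
print(addition_is_safe((4, 6), label, query_label))
```

Both tests read only the two endpoints and the query graph, so a worker can discard safe updates without any communication.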
In the first super-step (ss = 0), the algorithm finds the matches removed at the incident vertices of the removed edges. These removals are notified to the parents and children of the respective vertices as match removal messages. In subsequent super-steps, the vertices that receive the messages apply those removals to their parent and child match sets and evaluate the matching conditions. Removals are notified to parents and children as before. The algorithm terminates when there are no more removal notifications.

Processing edge additions can also be done incrementally with small modifications to Algorithm 16. In the first super-step (ss = 0), the algorithm is executed only on the incident vertices of edge additions, which in turn send their labels to their parents and children. Vertices that receive the labels execute super-step 1 of Algorithm 16 where, at the start of the super-step, the match set of each of these vertices v is re-initialized by adding all the query vertices that match the label of v. The rest of the steps are the same as in Algorithm 16.

The distributed graph pruning algorithm D-IDS closely follows the sequential algorithms presented in [49] for incremental graph simulation matching. The correctness of the algorithms can be proved following the same lemmas presented in [49], summarized below.

Edge Removal: 1) Edge removal only removes vertices from the match set that are no longer valid matches. 2) The algorithm terminates when all invalid matches are removed.

Edge Addition: 1) Edge addition only adds vertices to the match set if they are candidates (vertices whose label matches a label of a query vertex). 2) The algorithm terminates when all the invalid matches are removed.

4.4.1 An Illustrative Example

Figure 4.3 provides an illustrative example of a small query and the data graph. We provide a detailed walk-through of D-IDS in the following paragraphs using Figure 4.3.
In Figure 4.3, the initial data graph (G_0) contains the edges {(1, 2), (2, 3), (3, 2), (3, 5), (3, 4), (4, 5)}. The edge (5, 1) is added at time t = 1 and the edge (3, 5) is removed at time t = 2. Initial states (G_0) associated with vertices 2 and 3 are presented on the right of the figure.

Figure 4.3: Initial state of data graph for graph pruning algorithm. (States shown for vertex 3: MS[3] = {q1}, P[3][2] = {q2}, C[3][2] = {q2}, C[3][4] = ∅, C[3][5] = ∅. States shown for vertex 2: MS[2] = {q2}, P[2][1] = ∅, P[2][3] = {q1}, C[2][3] = {q1}.)

Initialization: Figure 4.4 illustrates the states associated with each vertex at the end of each super-step of Algorithm 16. At the end of the first super-step (ss = 0), vertices {1, 2, 3, 5} send their labels to their parents and children. As a result, in the next super-step (ss = 1), each vertex v receiving labels from parents and children can construct P[v] and C[v]. Vertices 1 and 5 remove all matching query vertices from MS[1] and MS[5] after evaluating the dual simulation matching conditions using their match sets together with the newly created parent and child match sets. These removals are notified to the parents and children of vertices 1 and 5. After receiving the removal notifications in super-step 2, vertices 2 and 3 update their parent/child match sets. But this does not cause any removals in MS[2] or MS[3].

Figure 4.4: Vertex states at end of each super-step (ss) in the initialization stage of D-IDS

SS 0
  Vertex 1: MS[1] = {q1}
  Vertex 2: MS[2] = {q2}
  Vertex 3: MS[3] = {q1}
  Vertex 4: MS[4] = ∅
  Vertex 5: MS[5] = {q2}
SS 1
  Vertex 1: MS[1] = ∅
  Vertex 2: MS[2] = {q2}, P[2][1] = {q1}, P[2][3] = {q1}, C[2][3] = {q1}
  Vertex 3: MS[3] = {q1}, P[3][2] = {q2}, C[3][2] = {q2}, C[3][5] = {q2}
  Vertex 4: MS[4] = ∅
  Vertex 5: MS[5] = ∅
SS 2
  Vertex 1: MS[1] = ∅
  Vertex 2: MS[2] = {q2}, P[2][3] = {q1}, C[2][3] = {q1}
  Vertex 3: MS[3] = {q1}, P[3][2] = {q2}, C[3][2] = {q2}
  Vertex 4: MS[4] = ∅
  Vertex 5: MS[5] = ∅
Figure 4.5: Vertex states at the end of each super-step (SS) after adding edge (5, 1).

SS 0
  Vertex 1: MS[1] = {q1}, C[1][2] = {q2}
  Vertex 2: X
  Vertex 3: X
  Vertex 4: X
  Vertex 5: MS[5] = {q2}, P[5][3] = {q1}
SS 1
  Vertex 1: MS[1] = {q1}, P[1][5] = {q2}, C[1][2] = {q2}
  Vertex 2: MS[2] = {q2}, P[2][1] = {q1}, P[2][3] = {q1}, C[2][3] = {q1}
  Vertex 3: MS[3] = {q1}, P[3][2] = {q2}, C[3][2] = {q2}, C[3][5] = {q2}
  Vertex 4: MS[4] = ∅
  Vertex 5: MS[5] = {q2}, P[5][3] = {q1}, C[5][1] = {q1}

Edge Addition: Figure 4.5 illustrates the states associated with each vertex at the end of each super-step after adding the edge (5, 1). X indicates that the vertex was not involved in any processing in that super-step. In the first super-step (ss = 0), vertices 1 and 5 construct their match sets MS[1] and MS[5] using their labels, and they are added to the dual simulation match (match[5] = 1 and match[1] = 1). Labels of the vertices are sent to the parents and children of vertices 1 and 5. Upon receiving labels, vertices 1, 2, 3 and 4 reset their match sets and the match sets of their parents and children based on the received labels. After the dual simulation matching conditions are evaluated, using each match set together with the newly created parent and child match sets, the algorithm terminates since there are no changes to the match sets.

Edge Removal: Figure 4.6 illustrates the states of each vertex at the end of each super-step after removing the edge (3, 5). In the first super-step, vertices 3 and 5 update their parent/child match sets. Vertex 3 removes the match set of vertex 5 from C[3] and vertex 5 removes the match set of vertex 3 from P[5]. As a result, vertex 5 loses query vertex q2 from its match set, making its match status 0. These removals are notified to vertices 1, 2 and 4 in the next super-step. In the next super-step, the removal of vertex 5 causes vertex 1 to lose query vertex q1 from its match set. Notification of this removal in the next super-step to vertex 2 does not cause any removals.
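The outcome of this walk-through can be cross-checked with a sequential fixpoint computation of the maximum dual simulation match, recomputed from scratch on each snapshot. This sketch is not D-IDS itself (it is neither distributed nor incremental). The query edges (q1 to q2 and q2 to q1) and the vertex labels are assumptions inferred from the match sets shown in the figures, and `dual_simulation` is a hypothetical name.

```python
def dual_simulation(edges, label, query_edges, query_label):
    """Return {vertex: set of matching query vertices} at the fixpoint."""
    children, parents = {}, {}
    for v, u in edges:
        children.setdefault(v, set()).add(u)
        parents.setdefault(u, set()).add(v)
    q_children, q_parents = {}, {}
    for a, b in query_edges:
        q_children.setdefault(a, set()).add(b)
        q_parents.setdefault(b, set()).add(a)
    # Start from label-based candidates, then remove violations repeatedly.
    ms = {v: {q for q, l in query_label.items() if l == label[v]} for v in label}
    changed = True
    while changed:
        changed = False
        for v in label:
            for q in list(ms[v]):
                child_ok = all(
                    any(qc in ms[c] for c in children.get(v, ()))
                    for qc in q_children.get(q, ())
                )
                parent_ok = all(
                    any(qp in ms[p] for p in parents.get(v, ()))
                    for qp in q_parents.get(q, ())
                )
                if not (child_ok and parent_ok):
                    ms[v].discard(q)
                    changed = True
    return ms


query_label = {"q1": "A", "q2": "B"}          # assumed labels
query_edges = [("q1", "q2"), ("q2", "q1")]    # inferred from P[.] and C[.]
label = {1: "A", 2: "B", 3: "A", 4: "C", 5: "B"}
g0 = [(1, 2), (2, 3), (3, 2), (3, 5), (3, 4), (4, 5)]
g1 = g0 + [(5, 1)]                            # edge addition at t = 1
g2 = [e for e in g1 if e != (3, 5)]           # edge removal at t = 2

for name, g in (("G0", g0), ("G1", g1), ("G2", g2)):
    print(name, dual_simulation(g, label, query_edges, query_label))
```

Under these assumptions, vertices 2 and 3 stay matched in all three snapshots, while vertices 1 and 5 are matched only after the edge (5, 1) is added and before the edge (3, 5) is removed.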
Figure 4.6: Vertex states at end of each super-step after removing edge (3, 5).

SS 0
  Vertex 1: X
  Vertex 2: X
  Vertex 3: MS[3] = {q1}, P[3][2] = {q2}, C[3][2] = {q2}
  Vertex 4: X
  Vertex 5: MS[5] = ∅, C[5][1] = {q1}
SS 1
  Vertex 1: MS[1] = ∅, C[1][2] = {q2}
  Vertex 2: MS[2] = {q2}, P[2][1] = {q1}, P[2][3] = {q1}, C[2][3] = {q1}
  Vertex 3: MS[3] = {q1}, P[3][2] = {q2}, C[3][2] = {q2}
  Vertex 4: MS[4] = ∅
  Vertex 5: MS[5] = ∅, C[5][1] = {q1}
SS 2
  Vertex 1: MS[1] = ∅, C[1][2] = {q2}
  Vertex 2: MS[2] = {q2}, P[2][3] = {q1}, C[2][3] = {q1}
  Vertex 3: MS[3] = {q1}, P[3][2] = {q2}, C[3][2] = {q2}
  Vertex 4: MS[4] = ∅
  Vertex 5: MS[5] = ∅, C[5][1] = {q1}

4.5 Evaluations

We implemented D-ISI and D-IDS in C++ using MPI (MPICH2 [6]). An open source implementation of the VF2 [36] algorithm was used for subgraph matching in D-ISI (line 5 in Algorithm 13) [13]. The Lemon graph library [4] was used to store the graph partitions locally within each worker.

As explained in Section 4.4, during the execution of D-IDS, each vertex maintains the labels and the match sets of its parents and children in memory. We reduced the memory footprint by using memory references (pointers). Instead of each vertex maintaining a copy of the labels for each of its parents and children, each worker maintained a pool of labels in memory. This pool contained all the unique vertex labels. Also, adjacent vertices with bidirectional edges that reside in the same worker shared the parent/child match sets instead of keeping local copies.

All the experiments were conducted in Amazon EC2 on a cluster consisting of 5 c3.2xlarge [2] instances (data center limit). This represents a commodity cluster available to the general computer science community rather than to HPC users. Each c3.2xlarge instance consists of eight virtual CPUs and 15 GB RAM.

4.5.1 Datasets

We performed experiments on six large-scale datasets (see Table 4.2) that are publicly available from [71, 1, 81]. This collection includes three large diameter graphs of increasing size (road networks) and three small diameter graphs.
Road networks (CA, CTR, USA) are sparse planar graphs with relatively uniform degree distribution and large diameter. Small diameter (DBLP, YT, LJ) graphs are comparatively dense graphs with power-law degree distribution and a smaller diameter.

Edge updates are generated from randomly extracted edges of these static graphs. Edge additions always added new edges randomly connecting existing vertices, whereas edge removals removed existing edges. Vertex labels were generated randomly from a fixed size dictionary. Subgraphs with various diameters were extracted/injected from/to these graph datasets and used as query graphs. All the evaluations were done on query graphs with diameter (undirected) 1 (|V| = 5), 2 (|V| = 12) and 3 (|V| = 17).

4.5.2 Evaluation Metrics

We used latency as the main metric for evaluating the performance. Latency is defined as the difference between the time at which the first edge update of a batch Δe_u entered the system and the time at which the algorithm finished finding matches for all the edges in Δe_u.

Table 4.2: Datasets

Dataset                    |V|           |E|           Type
California R/N (CA)        1,965,206     2,766,607     Large diameter
Central USA R/N (CTR)      14,081,816    34,292,496    Large diameter
Full USA R/N (USA)         23,947,347    58,333,344    Large diameter
DBLP network (DBLP)        317,080       1,049,866     Small-diameter
YouTube (YT)               4,945,382     49,445,382    Small-diameter
Live Journal (LJ)          5,284,457     77,402,652    Small-diameter

We used the sequential exact ISO algorithm presented in [49] as a baseline for comparison. This is a widely cited representative sequential baseline for D-ISI which performs a neighborhood-based search for exact ISO.

In order to report statistically significant results, we ran our experiments several times on Amazon EC2. During our initial evaluations, we found that there is a variation in performance compared to the results in an isolated cluster.
But we observed that this variation in performance did not change the overall performance behavior (how the algorithm scales with an increasing number of resources and graph size). To reduce the effect of variation in performance, the final experiments for each plot were conducted together. The latency values reported below are mean values averaged over a large number of edge updates.

4.5.3 Results

We compared the latency of D-ISI with and without the proposed graph pruning algorithm (D-IDS) on the datasets listed in Table 4.2. As shown in Figure 4.7, combining D-ISI with D-IDS significantly improves its latency on small diameter graphs. But we observed that using D-IDS has no impact on the performance on road network datasets. Further analysis showed that, even though the graph pruning algorithm significantly reduces the size of the graphs in all the datasets (see Figure 4.8), the impact of graph pruning is minimal on road networks due to their small, uniform degree distribution and large diameter. In small diameter networks, since edge updates affect a large portion of the graph, the impact of graph pruning was comparatively high. Note that D-ISI failed to execute without graph pruning on the YT dataset due to the large size of the subgraphs created in the subgraph construction step (line 4 of Algorithm 13).

Figure 4.7: Comparison of latency with (D-ISO + D-IDS) and without (D-ISO) graph pruning. (Two panels: latency in ms, log scale, for CA, CTR and USA; latency in sec, log scale, for DBLP, YT and LJ.)

Figure 4.8 shows the average percentage reduction (relative to the original graph) of the graph size (number of vertices) when D-IDS is used. As shown in Fig. 4.8, using D-IDS we were able to reduce the graph size by over 60% on average for all the graphs.

Figure 4.8: Average percentage reduction of graph size (number of vertices) using D-IDS.
d denotes the diameter of the query graph.

Figure 4.9 presents the speedup and latency of D-ISI with D-IDS on a road network and a small diameter graph with an increasing number of workers. As expected, on the road network dataset the D-ISI algorithm with graph pruning produced very low latency results compared to small diameter graphs. But the improvement in speedup tapers off when scaling up the number of workers on road network datasets. We were able to achieve significant improvements in speedup and reduction of latency when scaling up the number of workers on small diameter graph datasets. This is a result of graph pruning, where our algorithm (D-ISI + D-IDS) works on a much smaller subgraph compared to the baseline (the intractability of exact ISO should also be taken into account).

Figure 4.9: Speedup and latency of D-ISI + D-IDS for query graphs with various diameters (d). (a) CTR dataset (b) DBLP dataset. (Both panels plot speedup and latency against the number of workers: 2, 4, 8, 16, 32, 39.)

We observed that the communication network in our cluster was saturated when scaling up the number of workers on small diameter datasets, increasing the overall communication latency of the system. We believe that further improvements can be achieved on a cluster with a high-performance communication network. Due to scalability limitations in the baseline algorithm, we were unable to compute speedups on the YT and LJ datasets. A similar behavior of latency was observed in these datasets.

Figure 4.10: Comparison of latency of D-ISI + D-IDS for query graphs with various diameters. (Latency in sec, log scale, for all six datasets with W = 8, 16 and 32 workers and d = 1, 2, 3.)
Comparisons of latencies on all datasets are summarized in Fig. 4.10.

We also observed a higher improvement in speedup with increasing d_Q. The reason for this speedup is twofold. First, with the increasing diameter of the query graph, edge updates can affect a larger portion of the graph, so a larger subgraph is constructed in the subgraph construction step. Since the baseline algorithm does not perform any graph pruning, the reduction in the constructed subgraph due to graph pruning, compared to the baseline, becomes larger with the increasing diameter. Second, an increasing diameter in the query graph enables the DDLBFS to explore deeper, increasing the parallelism.

Evaluations with increasing |Δe_u| (see Figure 4.11) showed that D-ISI with D-IDS achieves higher throughput with increasing |Δe_u| due to the higher parallelism achieved with larger batches. But this resulted in an increase in latency.

Figure 4.11: Latency and throughput of D-ISI + D-IDS for different batch size. (a) Latency (b) Throughput. (Panels plot latency in sec and throughput in e_u/sec against batch sizes 16 to 256 for W = 16, 32 and 39.)

4.6 Summary

In this chapter, we presented a query preserving distributed graph pruning technique (D-IDS) to enable exact subgraph matching in small diameter dynamic graphs. We evaluated our proposed algorithms on a diverse set of real-world datasets. The evaluation results show that our graph pruning technique reduced the graph size significantly on real-world graphs, where it achieved over 60% reduction in graph size.

A simple distributed incremental exact subgraph isomorphism algorithm (D-ISI) which can use the above-mentioned graph pruning technique was presented. Our results show a significant improvement in performance on small diameter graphs when D-ISI was used with D-IDS. But no improvement in performance was observed on large diameter graphs.
Chapter 5 Dynamic Variant Steiner Tree Heuristics

In Chapters 3 and 4, we presented graph based techniques that can be used to protect cyber systems against attacks. In those chapters, subgraph matching was used as the underlying fundamental graph theoretic method. In the present chapter, we cover a set of variant Steiner tree based problems that have applications to cyber system security. The problem formulations covered in this chapter are mainly focused on a specific cybersecurity application in a smart grid: protecting the state estimates of buses from data spoofing attacks. We also discuss the applicability of the proposed problem formulations and techniques beyond the applications to smart grid security.

Determining the voltage phase angles of buses in a smart grid is a critical operation in the power system state estimation process. Invalid state estimates of strategic buses can cause a severe socioeconomic impact. In this chapter, we cover an optimal protection scheme to protect the voltage phase angle estimation of strategic buses in a smart grid against data spoofing attacks. We discuss the limitations of an existing protection scheme by identifying a class of attack vectors which cannot be defended against by using the protection scheme. We then provide an improved Steiner tree based protection scheme to find the minimal set of measurements to protect in order to secure the set of strategic buses against any data spoofing attack.

Further, capturing the changes in the criticality of buses over time, we propose variant Steiner tree based protection schemes to provide adaptive protection. Our proposed optimal protection schemes capture the changes in the criticality of buses and protection cost. We note that such optimal schemes are computationally intractable. For this reason, we propose heuristics with polynomial time complexity.
Further, we develop parallel algorithms to make the proposed protection schemes scalable for massive transmission networks. Finally, we discuss the applications of the proposed methods beyond protection against smart-grid data spoofing attacks.

5.1 Introduction

A smart grid consists of a power system and a communication system. The objective of the power system is to deliver the power generated from the power generation units to the consumers. The power system consists of generation, transmission, and distribution subsystems. The operating conditions of the power system are continually monitored by SCADA/EMS (Supervisory Control and Data Acquisition/Energy Management System) systems and managed using the information coming through the communication system. Various sensor devices are deployed across the power system to achieve this task. These sensors include 1) power injection meters that measure the power flow at the generators; 2) power flow meters and phasor measurement units (PMUs) that measure the power flow and voltage phase angles at transmission lines and buses in the transmission subsystem; 3) smart meters that measure the power consumption at the consumer sites.

Power system state estimation is a critical process in the power transmission subsystem [15]. State estimation is conducted by EMS/SCADA systems in an online manner using the measurement information coming from different parts of the transmission network [15]. The computed state estimates are then used to manage the power flow of the system to ensure that the system is maintained in a nominal state [15]. Protective measures are taken in emergency situations using the state estimate results. Invalid state estimates can create false demands in power markets [65, 112], increase the operational cost [114], and can even cause blackouts [113]. Hence protecting the integrity of the state estimate is of vital importance.
It has been shown in the literature [23] that the number of sensor measurements that need to be protected in order to secure the state estimate of all the buses is the same as the number of buses in the transmission network. Since this can be a costly task, protection schemes to protect the state estimate of strategic subsets of buses in the transmission network have been proposed [41, 22]. In [41], Deka et al. presented a polynomial time algorithm using min-cuts to find the minimum set of measurements an adversary needs to attack in order to perform a data spoofing attack that compromises a given set of state estimates. They extend this method to propose a polynomial time algorithm to find the minimum set of measurements to protect in order to secure a given set of state estimates from data spoofing attacks.

In real-world smart grids, the load patterns at each bus fluctuate significantly. This, in turn, causes the congestion in the transmission lines to fluctuate. Congestion in different parts of the transmission network can change dramatically and frequently by time of day, day of the week, and season, with the changes in the loads at the buses [7]. Attacking buses that have a high impact on transmission network congestion can have a greater impact on power markets [46]. Thus, the criticality of buses is dynamic in nature and changes over time. Protection schemes for smart grid data spoofing attacks should therefore be adaptive to capture these changes.

Existing methods to protect the sensor measurements in a transmission network include placing guards and surveillance instruments at sensor locations [22]. With the changes in the criticality of buses, these resources will have to be relocated, activated, or deactivated to protect new sensor measurements. We propose various Steiner tree based protection schemes to provide adaptive protection against data spoofing attacks.
Our proposed optimal protection schemes capture the changes in the criticality of buses and protection cost.

This chapter covers the following aspects:

• We identify the limitations of the protection scheme presented by Deka et al. in [41] by providing a counterexample. We identify a class of attack vectors that cannot be defended against by using the protection scheme of [41].
• We provide an improved protection scheme to find the minimal set of measurements to protect in order to secure a set of buses against any data spoofing attack.
• We introduce the problem of adaptive protection against smart grid data spoofing attacks.
• We formulate a set of adaptive protection schemes to provide optimal protection against smart grid data spoofing attacks.
• We propose heuristic algorithms with polynomial time complexity for the proposed protection schemes.
• Using parallelization, we further improve the performance and scalability of the proposed heuristic algorithms.
• We demonstrate the effectiveness of the proposed solutions using evaluations on real-world and synthetic datasets.

5.2 Background

The power grid transmission network consists of buses and transmission lines. The states of the power system are represented by state variables (state estimates) consisting of voltage magnitudes and phase angles at buses. Voltage magnitudes can be directly measured using sensors deployed at buses, while phase angles are estimated using the active power flow measurements collected from sensors [60]. In this chapter, two types of measurements are considered: 1) power flow measurements of transmission lines; 2) voltage phase angles directly measured at buses. Power flow measurements are taken by power flow meters, and voltage phase angles can be measured directly by phasor measurement units (PMUs).
In the DC power flow model, the relationship between measurements and state estimates is governed by:

\[ z = H\theta + e \tag{5.1} \]

Here $z \in \mathbb{R}^m$ is a vector of measurements and $\theta \in \mathbb{R}^n$ is a vector of state variables ($m > n$; the state variables are written $x$ in the worked example of Section 5.3). $H$ is the measurement matrix and $e \in \mathbb{R}^m$ represents measurement errors (noise). Assuming a Gaussian error distribution with covariance $R$, the state estimate $\hat{\theta}$ can be obtained by:

\[ \hat{\theta} = (H^T R^{-1} H)^{-1} H^T R^{-1} z = Pz \tag{5.2} \]

Figure 5.1 provides a high-level overview of the state estimation process. As illustrated in Figure 5.1, state estimation uses the estimated topology, power flow measurements of lines, and phasor measurement unit measurements from buses to compute complex voltage phase angles at buses. Estimated states are processed through a bad data detector, which can detect invalid estimates due to random measurement errors. If an error is detected, invalid measurements are removed and states are recalculated. However, in [73] Liu et al. showed that existing bad data detectors are only capable of detecting unstructured random errors in the measurements.

[Figure 5.1: High-level overview of the state estimation process [40]. Measurements (topology measurements from breaker status, power flow measurements, PMU measurements) feed topology estimation and the network observability check, followed by state estimation and bad data detection (error detection and correction).]

The attack model is as follows: an attacker formulates an attack by adding an attack vector $a$ to the vector of measurements $z$, so that $\bar{z} = z + a$. It is shown in [73] that some structured attacks with $a = Hc$ can bypass the bad data detector, introducing an error $c \in \mathbb{R}^n$ to the state estimates. Attackers need access to the topology of the network and to the measurements to conduct such an attack. Since the power system communication network is an isolated system and it is very hard to manipulate data at the SCADA due to physical security, attackers will have to physically access the sensors to inject bad data.
Generally, these sensors are protected by guards or surveillance equipment. But due to the large number of sensors in transmission networks and limited budgets, it is hard to protect all the sensors. As shown in [23], the number of measurements that need to be protected in order to secure the state estimate of all buses is the same as the number of buses in the transmission network.

In [41], Deka et al. proposed a method to find the minimum cardinality attack vector to attack a given set of state variables $S_{atck}$ such that the attack produces a non-zero change in the state estimate of all the state variables in the set $S_{atck}$. First, a graph $G$ is constructed from the transmission network using the following rules:

1. Each state variable associated with a bus is represented by a vertex.
2. Each flow meter is represented by an edge $e = (u, v)$, where $u$ and $v$ are the buses incident to the measured transmission line.
3. A new vertex, the reference vertex, is introduced, and each bus with a phase measurement is connected to the reference vertex by a new edge.
4. A new vertex, the attack vertex, is introduced, and each bus in $S_{atck}$ is connected to the attack vertex by a new edge.
5. Each edge that represents a secured measurement has infinite weight and each unsecured edge has unit weight.

Figure 5.2 illustrates an example transmission network and its graph representation.

[Figure 5.2: An example transmission network (a) and its graph representation (b). Labels: buses 1 to 6, flow meters, PMUs (phase measurements), reference vertex, attack vertex.]

Deka et al. [41] stated a method to find the attack vector with minimum cardinality in Lemma 5.1:

Lemma 5.1 [41]: The attack vector of minimum cardinality is given by the minimum cut of the undirected graph $G$ which separates the vertices in $S_{atck}$ from the reference vertex.

Building on Lemma 5.1, Deka et al. proposed a method to find the minimum set of measurements that need to be protected in order to protect a given set $S_{atck}$ from such attacks.
They argue that protection is possible if and only if the weight of the minimum cut between the reference vertex and the attack vertex becomes unbounded or infinite. Hence they argue that the minimum set of measurements to protect in order to secure a given $S_{atck}$ can be found in polynomial time [41]:

Theorem 5.1 [41]: The minimum measurements that need to be protected to secure the set $S_{atck}$ against any hidden false data spoofing attack is given by the unprotected edges in the minimum cost path from the attack vertex to the reference vertex in $G$, where the cost of an edge is given by the reciprocal of its edge weight.

5.3 Improved Protection Against Data Spoofing Attacks

Consider the transmission network of Figure 5.2. Assume all transmission lines have flow measurements and buses 1, 2, 3 have phase measurements. Let $S_{atck}$ be {1, 2, 3, 4, 5, 6}. Assuming unit susceptance magnitude at each transmission line and using equation (5.1), the relationship between measurements and state variables for this transmission network is as stated in equation (5.3). Here $z_{(i,j)}$ represents the flow measurement in line $(i, j)$, $z_i$ the phase measurement at bus $i$, and $x_i$ the state variable of bus $i$.

\[
\begin{bmatrix}
z_{(1,2)} \\ z_{(1,6)} \\ z_{(2,3)} \\ z_{(2,5)} \\ z_{(3,4)} \\ z_{(4,5)} \\ z_{(5,6)} \\ z_1 \\ z_2 \\ z_3
\end{bmatrix}
=
\begin{bmatrix}
-1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & -1 \\
0 & -1 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & -1 & 0 \\
0 & 0 & -1 & 1 & 0 & 0 \\
0 & 0 & 0 & -1 & 1 & 0 \\
0 & 0 & 0 & 0 & -1 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix}
\times
\begin{bmatrix}
x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6
\end{bmatrix}
\tag{5.3}
\]

Now according to Theorem 5.1 [41], protecting any of the phase measurements at buses 1, 2 or 3 should protect $S_{atck}$ from any hidden data spoofing attack. Consider a scenario where we choose to protect measurement 1 ($z_1$).
One can show that state variables 4, 5, 6 will be affected by a hidden data spoofing attack if an attacker uses the attack vector $a$ below:

\[ a = \begin{bmatrix} 0 & -1 & 0 & -1 & 1 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}^T \tag{5.4} \]

Indeed, $a = Hc$ for $c = [0\ 0\ 0\ 1\ 1\ 1]^T$ with $H$ the matrix of equation (5.3), so the attack alters only the flow measurements $z_{(1,6)}$, $z_{(2,5)}$ and $z_{(3,4)}$ and leaves the protected measurement $z_1$ untouched.

The counterexample above can be generalized to arbitrary transmission network topologies by using the following assumption from [8]: for a successful data spoofing attack, a non-zero change should be observed in each and every state variable of the set $S_{atck}$. Let $P_{\{i\}}$ denote the submatrix of $P$ consisting only of the rows indexed by a given index set $\{i\}$. Then we have,

Proposition 5.1: There exist a structured attack vector $a$ and a set $S_{atck}$ such that $c = P_{\{i\}} a$ with $\{i\} = S_{atck}$, where $c$ is an $|S_{atck}| \times 1$ vector and $1 < \|c\|_0$, which cannot be defended against by the protection scheme defined by Theorem 5.1.

Proof: (by contradiction) Assume that the protection scheme defined by Theorem 5.1 can defend against any structured attack defined in Proposition 5.1. Equivalently, there cannot be any attack vector $a$ for any set $S_{atck}$ for which $1 < \|c\|_0$. Let the minimum number of measurements protected as determined by Theorem 5.1 be $l$. Let $|S_{atck}| = k$ with $k > l$ (we will show later why this assumption is valid). Let the attack vector $a$ have non-zero values for each meter measurement except for the ones protected using the protection scheme defined by Theorem 5.1. Hence, only the protected meter measurements can be relied upon for state estimation. Consider the equation,

\[ z = H_{\{T\},\{S_{atck}\}} \theta + e \tag{5.5} \]

$H_{\{T\},\{S_{atck}\}}$ denotes the submatrix of $H$ with rows corresponding to all the protected meters $T$ and columns corresponding to the state variables $S_{atck}$. Now, $H_{\{T\},\{S_{atck}\}}$ is an $l \times k$ matrix with $l < k$, so its rank is at most $l$. This implies that there exist at least $k - l$ state variables in the set $S_{atck}$ which are linearly dependent upon $l$ independent state variables. If the $l$ independent state variables do not change, it is not possible to detect any changes in the remaining $k - l$ variables.
Hence, there exists $c$ such that $\|c\|_0 = k - l$. Choosing $k, l$ such that $k - l > 1$ contradicts the assumption that the protection scheme can defend against any structured attack defined in Proposition 5.1. Now we show why choosing $k, l$ such that $k - l > 1$ is a valid assumption. As per the protection scheme of Theorem 5.1, each node in $S_{atck}$ is joined to the attack node with an edge. So the minimum number of meters $l$ to be protected will be $\min\{dist(ref, s), s \in S_{atck}\} + 1$, where $dist(u, v)$ gives the minimum distance between nodes $u, v$. So, for any $S_{atck}$ where $|S_{atck}| = k > 2 + \min\{dist(ref, s), s \in S_{atck}\}$, the assumption that $k - l > 1$ is valid. □

Theorem 5.1 [41] defines a protected path in graph $G$. Proposition 5.1 implies that the protection scheme described in [41] is limited, as shown below.

Theorem 5.2: The protection scheme proposed in [41] will only protect the state variables represented by the vertices in the protected path from the attack vertex to the reference vertex against any data spoofing attack.

Proof: The protection scheme provided by Theorem 5.1 only secures the measurements in the shortest path from the reference vertex to the set of vertices in $S_{atck}$. Let $u \in S_{atck}$ denote the vertex at the end point of this shortest path. Every cut that separates any vertex in the shortest path from the reference vertex $v$ to $u$ will thus have a protected edge with infinite weight. However, it can be clearly seen that there can be cuts not containing any protected edges that separate other vertices in the set $S_{atck} \setminus u$ from the reference vertex $v$, thereby making them vulnerable to a data spoofing attack. □

Thus the protection scheme proposed in [41] (Theorem 5.1) only makes it impossible for an attacker to attack all the state variables in $S_{atck}$ at once. However, it might still be possible to attack a subset of the state variables associated with $S_{atck}$, which is a fundamental limitation of the protection scheme.
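The counterexample of equations (5.3) and (5.4) can be checked numerically. The following is a minimal sketch in plain Python, not part of the original analysis. The state offset `c` (a unit shift of buses 4, 5 and 6) is our own choice, found by inspection so that `H c` reproduces the attack vector of equation (5.4); the helper `matvec` and the label list are hypothetical names introduced for illustration.

```python
# Measurement matrix H copied from equation (5.3): seven flow meters,
# then the phase measurements at buses 1, 2 and 3.
H = [
    [-1, 1, 0, 0, 0, 0],   # z(1,2)
    [1, 0, 0, 0, 0, -1],   # z(1,6)
    [0, -1, 1, 0, 0, 0],   # z(2,3)
    [0, 1, 0, 0, -1, 0],   # z(2,5)
    [0, 0, -1, 1, 0, 0],   # z(3,4)
    [0, 0, 0, -1, 1, 0],   # z(4,5)
    [0, 0, 0, 0, -1, 1],   # z(5,6)
    [1, 0, 0, 0, 0, 0],    # z1
    [0, 1, 0, 0, 0, 0],    # z2
    [0, 0, 1, 0, 0, 0],    # z3
]
LABELS = ["z(1,2)", "z(1,6)", "z(2,3)", "z(2,5)", "z(3,4)",
          "z(4,5)", "z(5,6)", "z1", "z2", "z3"]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(h * x for h, x in zip(row, vector)) for row in matrix]


# Assumed state offset (our choice): shift buses 4, 5, 6 by one unit.
c = [0, 0, 0, 1, 1, 1]
a = matvec(H, c)

# Attack vector as printed in equation (5.4).
attack_from_text = [0, -1, 0, -1, 1, 0, 0, 0, 0, 0]

# Measurements the attacker has to modify.
touched = [LABELS[i] for i, value in enumerate(a) if value != 0]

print(a == attack_from_text)
print(touched)
```

Because `a` has the structured form `H c`, it shifts the estimates of buses 4, 5 and 6 while modifying only three flow measurements, none of which is the protected phase measurement at bus 1.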
We now provide a scheme for optimal protection against any data spoofing attack, one that protects all the state variables associated with $S_{atck}$. The proposed method to find the protection scheme is given by Theorem 5.3 below.

Theorem 5.3: The minimum measurements that need to be protected to secure the set $S_{atck}$ against any hidden false data injection attack is given by the unprotected edges in a minimum Steiner tree that connects the reference vertex in $G$ to all the vertices that represent $S_{atck}$.

Proof: According to Deka et al. [41], the attack vector to attack a given set $S_{atck}$ is given by the cut that separates the vertices associated with $S_{atck}$ from the reference vertex. If we need to protect each state variable associated with $S_{atck}$, all the cuts in $G$ that separate any vertex in $S_{atck}$ from the reference vertex should have an edge with infinite weight. It is clear that to achieve this condition there should be a protected path from each vertex in $S_{atck}$ to the reference vertex. In other words, the edges in a subgraph that connects each vertex in $S_{atck}$ to the reference vertex represent a set of measurements to protect in order to protect $S_{atck}$ from a data spoofing attack. The minimum set of measurements is therefore given by the unprotected edges in a minimum Steiner tree that connects the reference vertex in $G$ to all the vertices that represent $S_{atck}$. □

5.4 Adaptive Protection Schemes

We observe that the Steiner tree-based protection scheme described above can be generalized to incorporate the cost of protecting each measurement and the criticality of the buses using the rooted prize-collecting Steiner tree problem. As mentioned in [46], an attacker can gain financial benefits (a loss for the utility) by attacking buses. The potential loss that can occur if a bus is attacked denotes its criticality. The criticality of a bus is represented as the vertex "prize" in $G$. Costs of protecting measurements are represented as edge weights of $G$.
The objective of an optimal protection scheme is to minimize the potential loss that can occur due to unprotected buses and the protection cost. Observation 5.1 summarizes the resulting optimal protection scheme.

Observation 5.1: The set of minimum cost measurements that should be protected to minimize the potential loss that can occur due to unprotected buses is given by the edges in a rooted prize-collecting Steiner tree (PCST) of $G$, in which:

1. Each vertex representing a bus has a prize representing its criticality;
2. Each edge has a weight representing the cost of protecting the associated measurement;
3. The reference vertex $v_r$ is the root vertex.

The power-grid operator can choose one of several available measures to determine the criticality of each bus. The criticality can be statically determined by the presence of critical infrastructure such as hospitals, military installations, etc. in the bus' supply area. It can also be dynamically determined based on system operation variables such as loads, generation at various buses, power flow, or congestion in transmission lines.

In this work, we focus on developing a protection scheme for dynamically determined criticality values of the buses. Given that the loads of buses and congestion can change over time, the criticality of buses also changes with time. To capture this, we next propose adaptive protection schemes extending the above-mentioned protection scheme.

In order to capture the changes in the criticality of buses, we denote it as a time-varying value: $\pi_v^t$ denotes the criticality of the bus represented by vertex $v \in V$ at time $t$. As a result, the potential loss that can occur due to unprotected buses at time $t$ is $\sum_{v \in V} \pi_v^t (1 - y_v^t)$, where the decision variable $y_v^t = 1$ if the bus represented by vertex $v$ is protected at time $t$. The protection cost for the measurement represented by the edge $e$ is denoted by $c_e$.
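To make the trade-off in Observation 5.1 concrete, the sketch below evaluates the protection cost and the potential loss for a candidate set of protected measurements. It is an illustrative sketch only: the function name `protection_objective`, the toy graph and all numeric values are assumptions introduced here, and a bus is treated as protected when protected edges connect it to the reference vertex.

```python
from collections import deque


def protection_objective(edge_costs, prizes, protected_edges, root):
    """Return (protection cost, potential loss) for a set of protected edges.

    edge_costs: {(u, v): cost of protecting that measurement}
    prizes: {bus vertex: criticality}
    protected_edges: edges chosen for protection (keys of edge_costs)
    root: the reference vertex
    """
    neighbours = {}
    for u, v in protected_edges:
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)

    # A bus counts as protected only if protected edges connect it to the root.
    protected = {root}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in neighbours.get(u, []):
            if v not in protected:
                protected.add(v)
                queue.append(v)

    cost = sum(edge_costs[e] for e in protected_edges)
    loss = sum(p for v, p in prizes.items() if v not in protected)
    return cost, loss


# Hypothetical toy instance: a path ref - 1 - 2 - 3.
edge_costs = {("ref", 1): 1, (1, 2): 2, (2, 3): 5}
prizes = {1: 4, 2: 3, 3: 1}

partial = protection_objective(edge_costs, prizes, [("ref", 1), (1, 2)], "ref")
full = protection_objective(edge_costs, prizes, list(edge_costs), "ref")
print(partial, sum(partial))
print(full, sum(full))
```

In this toy instance, leaving bus 3 unprotected gives a lower combined value than protecting every measurement, which is the kind of decision the prize-collecting formulation is meant to capture.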
Due to the change in the criticality of buses, the set of measurements that should be protected ($M$) may have to be changed. The current techniques to protect measurements include placing guards or surveillance equipment at measurement locations [22]. With changes in $M$, the guards or surveillance equipment have to be relocated, activated or deactivated to protect new measurements. These changes incur a cost. Let $M_t$, $M_{t-1}$ denote the protection strategies at times $t$, $t-1$ respectively, and $E_{M_t}$, $E_{M_{t-1}}$ the edges that represent them in $G$. In order to capture the cost of relocating measurement protection, we add a relocation cost $r_e$ for each edge $e \in (E_{M_t} \setminus E_{M_{t-1}}) \cup (E_{M_{t-1}} \setminus E_{M_t})$. As a result, the relocation cost for a measurement represented by edge $e$ at time $t$ is $r_e \Delta_t(e)$, where $\Delta_t(e) = |x_e^t - x_e^{t-1}|$. The decision variable $x_e^t = 1$ if the measurement represented by $e$ is protected at time $t$.

An optimal adaptive protection scheme takes a set of criticality predictions for buses as input to provide protection strategies that minimize the measurement protection cost ($K$), the potential loss that can occur due to unprotected buses ($L$) and the relocation cost ($R$). We formalize two such optimal adaptive protection schemes below. $K$, $L$, $R$ are defined separately for each protection scheme.

[Figure 5.3: Changing the measurement protection with bus criticality changes. Labels: load and congestion information, criticality values, adaptive protection schemes, SCADA, inform changes in protection strategy, guard stations, movements to/from measurement locations.]

Transmission networks generally span a large geographic area. As a result, the task of protecting measurements at measurement locations can be outsourced to local security agencies in these regions. As the power-grid operator determines or predicts the criticality values of buses, the protection recommendations are calculated using the proposed protection schemes.
The local security agencies are notified if a nearby measurement needs to be protected or if protection is no longer needed. The proposed protection schemes capture the cost of placing guards at measurement locations ($K$) and the cost of relocating them ($R$) to provide cost-optimal protection recommendations.

We define optimal adaptive protection schemes as variant prize-collecting Steiner tree problems. $\delta(S)$ denotes the cut set for the set $S \subset V$. The recommended protection strategy is given by the set of measurements represented by the edges in the resulting Steiner trees. Buses represented by vertices in these trees will be protected as a result.

5.4.1 Minimum protection cost tree for local risk predictions (MPT-Local)

The MPT-Local scheme provides a protection recommendation ($M_t$) based on criticality predictions for each bus for the time interval $(t-1, t)$. As a result, we get:

\[ K = \sum_{e \in E} c_e x_e^t \tag{5.6} \]

\[ L = \sum_{i \in V} \pi_i^t (1 - y_i^t) \tag{5.7} \]

\[ R = \sum_{e \in E} r_e \Delta_t(e) \tag{5.8} \]

The ILP formulation for the variant PCST problem that minimizes the overall cost is as follows:

\[
\begin{aligned}
\text{Minimize: } & K + L + R \\
\text{Subject to: } & \sum_{e \in \delta(S)} x_e^t \ge y_i^t, \quad \forall S \subset V \setminus \{v_r\},\ S \neq \emptyset,\ \forall i \in S \\
& x_e^t \in \{0, 1\}, \quad \forall e \in E \\
& y_i^t \in \{0, 1\}, \quad \forall i \in V
\end{aligned}
\tag{5.9}
\]

5.4.2 Minimum protection cost trees for a time window of risk predictions (MPT-Window)

The MPT-Window scheme provides protection recommendations for a time window. A time window consists of multiple time intervals. The MPT-Window scheme takes criticality predictions for the buses over the time window as the input. The protection strategy for each time interval in the time window is selected based on the criticality predictions.
$K$, $L$, $R$ for a time window with $n$ time intervals are given by:

\[ K = \sum_{j \in [0,n),\, e \in E} c_e x_e^{t+j} \tag{5.10} \]

\[ L = \sum_{j \in [0,n),\, i \in V} \pi_i^{t+j} (1 - y_i^{t+j}) \tag{5.11} \]

\[ R = \sum_{j \in [0,n),\, e \in E} r_e \Delta_{t+j}(e) \tag{5.12} \]

The ILP formulation that minimizes the overall cost for the time window is as follows:

\[
\begin{aligned}
\text{Minimize: } & K + L + R \\
\text{Subject to: } & \forall j \in [0, n): \\
& \sum_{e \in \delta(S)} x_e^{t+j} \ge y_i^{t+j}, \quad \forall S \subset V \setminus \{v_r\},\ S \neq \emptyset,\ \forall i \in S \\
& x_e^{t+j} \in \{0, 1\}, \quad \forall e \in E \\
& y_i^{t+j} \in \{0, 1\}, \quad \forall i \in V
\end{aligned}
\tag{5.13}
\]

All of the objective functions stated above can be expressed as linear objective functions, since a minimized sum of absolute values can be linearized.

We note that MPT-Local and MPT-Window are NP-Hard by considering a particular case in which there is no relocation cost. With no relocation cost, MPT-Local and MPT-Window become instances of the standard PCST problem, making them at least as hard as PCST. PCST is a well known NP-Hard problem [66].

5.5 Proposed Heuristics

Due to their intractable computational complexity, the proposed optimal protection schemes do not scale to real-world transmission networks. It takes a long time to compute optimal results even for small or medium size transmission networks. Attackers can exploit the long computation times to attack unprotected critical buses. Addressing this limitation, we propose heuristic algorithms with polynomial time complexity that can scale to real-scale transmission networks.

The proposed heuristics for MPT-Local and MPT-Window use a heuristic algorithm for the rooted prize-collecting Steiner tree problem as their base algorithm (Algorithm 18). Algorithm 18 has two phases. In the first phase, it creates a spanning tree in a greedy manner starting from the root vertex. Algorithm 18 maintains a set $T_V$ to store the set of vertices spanned by the spanning tree. $T_V$ is expanded by connecting new vertices through the edges in the cut set $\delta(T_V)$ that give the maximum profit (or least loss) (lines 7-17). After the construction of the spanning tree, the algorithm prunes the leaf vertices from the tree in an iterative manner if pruning improves the overall profit (lines 18-38).
5.5.1 Heuristic for MPT-Local and MPT-Window

Figure 5.4 illustrates the execution flow of our proposed heuristics for MPT-Local and MPT-Window. The heuristics take as inputs:

1. Bus-branch graph of the transmission network ($G$);
2. PCST computed for the previous time interval (i.e. the protection strategy for the previous time interval);
3. Criticality predictions for buses.

[Figure 5.4: Execution flow of the heuristic algorithms for the proposed protection schemes. Criticality predictions, G and the previous PCST are used to compute new edge weights and vertex prizes to create G′; the PCST of G′ gives the protection recommendations.]

The proposed heuristics use these inputs to produce an augmented graph $G'$ by assigning new edge weights and new vertex prizes. The new edge weights are calculated based on the edges in the PCST for the previous time interval. The new vertex prizes are calculated using bus criticality predictions. Next, the PCST of $G'$ is calculated using Algorithm 18.

Algorithm 18 Heuristic Algorithm for Prize Collecting Steiner Tree (H-PCST)
 1: procedure H-PCST(G)
 2:   T_V ← {v_r}, T_E ← {}
 3:   cost[v_r] ← 0
 4:   for all v ∈ V \ {v_r} do
 5:     cost[v] ← −∞
 6:   end for
 7:   while T_V does not contain all v ∈ V do
 8:     find e ← argmax_{e ∈ E} {π_j − c_e | e = (i, j) ∈ δ(T_V)}
 9:     T_V ← T_V ∪ {j}
10:     T_E ← T_E ∪ {e}
11:     for all e = (j, k) ∈ E do
12:       if cost[k] < π_k − c_e then
13:         cost[k] ← π_k − c_e
14:         ▷ set predecessor of k as j
15:       end if
16:     end for
17:   end while
18:   Q ← CreateMinHeap(T_V \ {v_r})
19:   ▷ out-degree as the key
20:   while Q not empty do
21:     j ← getMin(Q)
22:     val ← GetKey(j)
23:     if val = INF then
24:       break
25:     end if
26:     if isLeaf(j) then
27:       e = (i, j)  ▷ i is the parent of j
28:       if π_j < c_e then
29:         T_V ← T_V \ {j}
30:         T_E ← T_E \ {e}
31:         remove(Q, j)
32:       else
33:         IncreaseKey(Q, j, INF)
34:       end if
35:     else
36:       break
37:     end if
38:   end while
39:   return (T_V, T_E)
40: end procedure

Edges in the resulting PCST provide the protection strategy for the next time interval.
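The two phases of Algorithm 18 (greedy growth from the root, then iterative leaf pruning) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation evaluated in this chapter: the graph is assumed to be an undirected adjacency dictionary, the function name `h_pcst` and the toy instance are hypothetical, and the pruning phase uses a plain work list of leaves instead of the out-degree keyed min-heap of the pseudocode.

```python
import heapq
import itertools


def h_pcst(adjacency, prizes, root):
    """Greedy growth followed by leaf pruning.

    adjacency: {u: {v: edge cost}} for an undirected graph
    prizes: {vertex: prize}; missing vertices (e.g. the root) count as 0
    Returns (tree vertices, tree edges as (parent, child) pairs).
    """
    parent = {root: None}
    heap = []
    tie = itertools.count()  # tie-breaker so vertex ids are never compared

    def push_cut_edges(u):
        for v, cost in adjacency[u].items():
            if v not in parent:
                profit = prizes.get(v, 0) - cost
                heapq.heappush(heap, (-profit, next(tie), u, v))

    # Phase 1: repeatedly add the cut edge with the largest prize minus cost.
    push_cut_edges(root)
    while heap:
        _, _, u, v = heapq.heappop(heap)
        if v in parent:
            continue  # stale entry, v was already connected
        parent[v] = u
        push_cut_edges(v)

    # Phase 2: drop leaves whose prize does not pay for their tree edge.
    child_count = {v: 0 for v in parent}
    for v, p in parent.items():
        if p is not None:
            child_count[p] += 1
    leaves = [v for v in parent if v != root and child_count[v] == 0]
    while leaves:
        v = leaves.pop()
        p = parent[v]
        if prizes.get(v, 0) < adjacency[p][v]:
            del parent[v]
            child_count[p] -= 1
            if p != root and child_count[p] == 0:
                leaves.append(p)  # the parent became a leaf, re-examine it

    edges = {(p, v) for v, p in parent.items() if p is not None}
    return set(parent), edges


# Hypothetical toy instance; all costs and prizes are made up.
adjacency = {
    "ref": {1: 1},
    1: {"ref": 1, 2: 2, 3: 6},
    2: {1: 2, 3: 5},
    3: {1: 6, 2: 5},
}
prizes = {1: 4, 2: 3, 3: 1}
tree_vertices, tree_edges = h_pcst(adjacency, prizes, "ref")
print(tree_vertices, tree_edges)
```

On this toy instance, bus 3 is first attached through the edge from bus 2 and then pruned, because its prize is smaller than the cost of that edge.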
In the heuristic for MPT-Local, $G'$ is created from $G$ by recomputing the edge weights. The edge weights are recomputed by increasing the weights of the edges that represent measurements not protected during the previous time interval. The objective is to discourage changes in the protection strategy between consecutive time intervals in order to reduce the cost of relocating measurement protection resources. The new edge weights are calculated based on equations 5.14 and 5.15.

\[
c'_e =
\begin{cases}
c_e, & \text{if } e \in E_{M_{t-1}} \\
r_e + m_e + c_e, & \text{otherwise}
\end{cases}
\tag{5.14}
\]

\[ m_e = \max\{ r_{e'} \mid \forall e' \in E_{M_{t-1}} \setminus \{e\} \} \tag{5.15} \]

where $c'_e$ denotes the new weight of edge $e$ in $G'$.

The heuristic for MPT-Window computes Steiner trees for each time interval in the time window $(t, t+n)$. $G'$ is created from $G$ for each time interval in the window by recomputing edge weights and vertex criticality values. The edge weights are computed based on equations 5.14 and 5.15, as in the heuristic for MPT-Local. The vertex prizes are calculated using equation 5.16 for each time interval in the time window.

\[
\pi'^{\,t+i}_v = \frac{\sum_{j=i}^{n} \dfrac{\pi_v^{t+j}}{j - i + 1}}{\sum_{j=1}^{n-i+1} j^{-1}}, \quad \forall v \in V \setminus \{v_r\}
\tag{5.16}
\]

The vertex prize for a vertex representing a given bus is calculated by taking a weighted average of the predicted criticality values. The criticality prediction closest to the current interval has the highest weight. The contributions from criticality predictions decay over the time window.

5.5.2 Complexity Analysis

Assuming an adjacency list is used to store the graph and a heap is used to find the edge in $\delta(T_V)$ that gives the maximum profit (line 8 of Algorithm 18), we can observe that lines 11-16 of the algorithm will be executed $O(|E|)$ times. In each step, updating the heap with the new edge costs requires $O(\log|V|)$ operations. Therefore the total number of operations required to create a spanning tree from lines 11-16 of the algorithm is $O(|E| \log|V|)$.
Also, the heap remove operation required to extract the vertex that gives the maximum profit (line 8 of Algorithm 18) requires $O(\log|V|)$ operations. Therefore the computational complexity of the tree construction phase of Algorithm 18 is $O((|V| + |E|) \log|V|)$, which is $O(|E| \log|V|)$. Next, line 18 of the algorithm requires $O(|V| \log|V|)$ operations to build a heap. Since $|T_E| = |V| - 1$, lines 20-38 require $O(|V| \log|V|)$ operations for the leaf pruning process. Therefore, the total number of operations Algorithm 18 requires is $O(|E| \log|V|)$.

Algorithm 18 is executed for each time interval in a time window. In each time interval, edge weights and vertex criticality values are recalculated using equations 5.14 and 5.16 respectively. Assuming $|TW|$ is the number of time intervals in the time window, these calculations require $O(|E||V||TW|)$ and $O(|V||TW|^2)$ operations based on 5.14 and 5.16 respectively. As a result, the heuristics for MPT-Local and MPT-Window require $O(|E||V||TW| + |TW||E| \log|V|)$ and $O(|V||TW|^2 + |E||V||TW| + |TW||E| \log|V|)$ operations respectively for a time window with $|TW|$ intervals.

5.6 Scaling for Large Transmission Networks

With the wide scale adoption of residential solar PVs, the customers in the power distribution networks, who traditionally were just passive consumers, are turning into electricity producers for certain intervals of the day. To enable this, techniques used for state estimation and electricity market price determination are being adapted to the distribution networks [111]. The distribution networks are very large, containing millions of nodes. Hence, the protection schemes should be able to scale to large distribution networks.

We propose parallel algorithms for the heuristics for MPT-Local and MPT-Window to meet the above-mentioned requirements. The proposed algorithms target distributed computing environments such as clouds or commodity clusters.
It can be observed that the vertex prize and edge weight reassignment steps in the heuristics for MPT-Local and MPT-Window can be executed in an embarrassingly parallel manner for each vertex and edge. We propose parallel algorithms for Algorithm 18 based on the vertex-centric bulk synchronous parallel (BSP) model [78].

The vertex-centric BSP model offers a simple abstraction for composing graph algorithms for cluster computing environments. It makes the complexities of the underlying communication, fault tolerance, and state management tasks transparent to users. Over the last decade, with the wide adoption of commodity clusters and clouds, the vertex-centric BSP model has become one of the most prominent large scale graph processing models.

In the vertex-centric BSP model, the vertices of the graph are partitioned across the machines in a cluster. Users implement a function Compute, which gives them access to a single vertex in the graph, its values, and its outgoing edges. Users can send messages to other vertices in the graph within the Compute function using the vertex id as the address. Vertex-centric graph algorithms are executed iteratively based on the bulk synchronous parallel model [105]. The Compute function of each vertex in the graph is executed independently. Messages sent by vertices in a given iteration are only available to the receiving vertices in the next iteration. A barrier synchronization at the end of an iteration ensures that computation and communication for all the vertices are complete before proceeding to the next iteration. Execution stops when no vertex in the graph sends out any messages.

The proposed distributed algorithms for Algorithm 18 based on the vertex-centric BSP model have two phases. In the first phase, a spanning tree is created starting from the root vertex. In the next phase, the leaf vertices are pruned from the tree in an iterative manner if pruning improves the overall profit.
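The execution model described above (independent Compute calls, messages delivered in the next iteration, a barrier between iterations, termination when no messages are sent) can be illustrated with a small single-process simulation. This is a toy sketch for intuition only: `run_bsp` and `propagate_max` are hypothetical names, it does not use any real graph processing framework, and, as a simplification, it calls the compute function on every vertex in every superstep.

```python
def run_bsp(values, out_edges, compute, max_supersteps=50):
    """Run a vertex-centric computation in synchronous supersteps.

    values: {vertex id: vertex value}
    out_edges: {vertex id: [target vertex ids]}
    compute(vertex, value, messages, superstep, targets)
        -> (new value, [(target vertex, message), ...])
    """
    inbox = {v: [] for v in values}
    supersteps = 0
    for superstep in range(1, max_supersteps + 1):
        supersteps = superstep
        outbox = {v: [] for v in values}
        for v in values:
            values[v], outgoing = compute(
                v, values[v], inbox[v], superstep, out_edges.get(v, []))
            for target, message in outgoing:
                outbox[target].append(message)
        # Barrier: messages become visible only in the next superstep.
        inbox = outbox
        if not any(inbox.values()):
            break  # no vertex sent a message, so execution stops
    return values, supersteps


def propagate_max(vertex, value, messages, superstep, targets):
    """Example compute function: every vertex learns the largest value."""
    best = max([value] + messages)
    if superstep == 1 or best > value:
        return best, [(t, best) for t in targets]
    return value, []


# Hypothetical path graph a - b - c with made-up vertex values.
values = {"a": 3, "b": 6, "c": 1}
out_edges = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result, supersteps = run_bsp(values, out_edges, propagate_max)
print(result, supersteps)
```

The same loop structure underlies the spanning tree and pruning phases: only the per-vertex compute function and the message contents change.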
Algorithm 19 presents the algorithm to create a spanning tree starting from the root vertex. Each vertex maintains the effective prize it contributes to the tree, obtained by subtracting the cost of the edge that currently connects it to the tree from its prize (line 16). Vertices choose the edge that gives the highest effective prize to connect to the spanning tree based on incoming messages (lines 15-22). Each vertex keeps its parent vertex in the spanning tree as a part of the vertex value in order to construct the spanning tree.

Algorithm 19 Vertex Centric Algorithm to Create Spanning Tree for H-PCST
 1: procedure Compute(Messages)
 2:   if iteration = 1 then
 3:     if isRoot() then
 4:       for Edge e ∈ getOutEdges() do
 5:         Message msg;
 6:         msg.edgeCost ← c_e
 7:         sendMessageTo(e.target, msg)
 8:       end for
 9:     end if
10:   else
11:     if isRoot() then
12:       return
13:     end if
14:     changed ← false
15:     for each msg ∈ Messages do
16:       effective ← getPrize() − msg.edgeCost
17:       if effective > getEffective() then
18:         changed ← true
19:         setEffective(effective)
20:         setTreeParent(msg.from)
21:       end if
22:     end for
23:     if changed then
24:       for Edge e ∈ getOutEdges() do
25:         Message msg;
26:         msg.edgeCost ← c_e
27:         sendMessageTo(e.target, msg)
28:       end for
29:     end if
30:   end if
31: end procedure

Next, Algorithm 20 is executed to prune the leaf vertices from the spanning tree in an iterative manner if pruning improves the overall profit. First, each vertex in the graph sends its vertex id to its parent vertex in the spanning tree (lines 2-5). Upon receiving messages from its children, each vertex updates its children list (lines 8-11). The leaf vertices (vertices with no children) with a negative effective prize are marked for removal from the spanning tree (lines 12-15). The vertex removals are notified to the parent vertices. In the next iteration and beyond, upon receiving child removal messages, each vertex updates its children list. A leaf vertex is removed if it has a negative effective prize.
Each removed vertex notifies its parent about the removal. The algorithm terminates when there are no messages to be processed in the next iteration.

Algorithm 20 Vertex Centric Algorithm to Prune the Spanning Tree for H-PCST
 1: procedure Compute(Messages)
 2:   if iteration = 1 then
 3:     if !isRoot() then
 4:       sendMessageTo(getTreeParent())
 5:     end if
 6:   else
 7:     if iteration = 2 then
 8:       for each msg ∈ Messages do
 9:         child ← msg.from
10:         addTreeChild(child)
11:       end for
12:       if !isRoot() AND isLeaf() AND getEffective() < 0 then
13:         setRemoved(true)
14:         sendMessageTo(getTreeParent())
15:       end if
16:     else
17:       for each msg ∈ Messages do
18:         child ← msg.from
19:         removeTreeChild(child)
20:       end for
21:       if isRoot() then
22:         return
23:       end if
24:       if isLeaf() AND getEffective() < 0 then
25:         setRemoved(true)
26:         sendMessageTo(getTreeParent())
27:       end if
28:     end if
29:   end if
30: end procedure

5.6.1 Communication Optimizations

We utilize the message combiners available in the Pregel model to reduce the number of messages exchanged during the execution of the proposed algorithms. As illustrated in Figure 5.5, a message combiner at each worker machine in the cluster takes the set of outgoing messages destined for a vertex and reduces the number of messages by aggregating them with a user-defined function. Message combiner functions for Algorithms 19 and 20 are presented in Algorithms 21 and 22, respectively.
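The pruning phase and the idea of combining removal notifications per parent can be sketched sequentially as follows. This is a minimal illustration under stated assumptions, not the Flink Gelly implementation used in this work: the spanning tree is assumed to be already available as a parent map, the loop only imitates successive iterations, and the names `prune_leaves`, `parent`, `effective`, and the example tree are hypothetical.

```python
def prune_leaves(parent, effective, root):
    """Iteratively remove leaves with a negative effective prize (assumed inputs).

    parent:    dict vertex -> parent vertex in the spanning tree (None for root)
    effective: dict vertex -> effective prize (prize minus connecting edge cost)
    """
    # Iterations 1-2 of the vertex program: children register with their parents.
    children = {v: set() for v in parent}
    for v, p in parent.items():
        if v != root:
            children[p].add(v)

    removed = set()
    frontier = [v for v in parent
                if v != root and not children[v] and effective[v] < 0]
    while frontier:
        removed.update(frontier)
        # Combiner step: one message per parent carrying the set of removed children.
        combined = {}
        for v in frontier:
            combined.setdefault(parent[v], set()).add(v)
        frontier = []
        for p, gone in combined.items():
            children[p] -= gone
            # The root is never removed; other vertices are re-checked
            # once they have become leaves.
            if p != root and not children[p] and effective[p] < 0:
                frontier.append(p)
    return removed


# Hypothetical tree rooted at "r" with made-up effective prizes.
parent = {"r": None, "a": "r", "b": "a", "c": "b", "d": "r", "f": "r", "g": "f"}
effective = {"r": 0.0, "a": 4.0, "b": -1.0, "c": -2.0, "d": -3.0, "f": -5.0, "g": 1.0}
print(sorted(prune_leaves(parent, effective, "r")))
```

In this example `f` survives despite its negative effective prize because its child `g` is kept, which reflects the leaf-only removal rule of the pruning listing.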
Algorithm 21 Message Combiner for Algorithm 19
1: procedure Combine(Messages msgs)
2:   Message mc
3:   for each m ∈ msgs do
4:     if mc = NULL OR mc.edgeCost > m.edgeCost then
5:       mc ← m
6:     end if
7:   end for
8:   SendCombinedMessage(mc)
9: end procedure

Algorithm 22 Message Combiner for Algorithm 20
1: procedure Combine(Messages msgs)
2:   Message mc
3:   for each m ∈ msgs do
4:     mc.from ← mc.from ∪ m.from
5:   end for
6:   SendCombinedMessage(mc)
7: end procedure

Figure 5.5: Illustration of message combiner in Pregel model [78]. (Diagram: vertices V1 to V6 on Worker 1, Worker 2, and Worker 3, with one combiner per worker.)

5.7 Evaluations

5.7.1 Setup

We implemented the ILP formulations for the proposed optimal protection schemes using the Gurobi solver [3]. The proposed heuristic algorithms were implemented using Java 1.8. Evaluations of the sequential algorithms and ILP formulations were conducted on an Intel Core i5 3.2 GHz machine with 16 GB of memory. We implemented our proposed distributed algorithms on Apache Flink Gelly, utilizing its vertex-centric graph processing framework [54]. We deployed Flink on a dedicated commodity cluster of 11 nodes, each with an 8-core Intel Xeon CPU and 16 GB RAM, connected by Gigabit Ethernet.

We used a set of publicly available IEEE power system [5] and European transmission network [11] datasets to evaluate the proposed protection schemes. Details of the transmission network datasets used in our simulations are summarized in Table 5.1. We generated bus branch graphs from the transmission line datasets based on the steps described in Sections 5.2 and 5.4. We considered three configurations for each transmission network, in which 25%, 50%, and 75% of the buses have direct phase measurements (PMUs), when generating bus branch graphs. We conducted the experiments on all possible configurations.

Dataset    # of Buses   # of Lines
IEEE 9     9            9
IEEE 14    14           20
IEEE 57    57           80
IEEE 118   118          186
IEEE 300   300          409
EU 1494    1494         2322

Table 5.1: Sizes of different power system test cases.
A few representative results are presented below.

To evaluate the scalability of the proposed distributed algorithms, we generated large synthetic graphs. We used a graph generator based on the R-MAT model to generate synthetic graphs with a varying number of vertices [28]. The graph density $\beta$ was 8 in all graphs, such that $|E| = \beta |V|$.

5.7.2 Evaluation Metrics

To compare the performance of the optimal protection schemes with the proposed heuristics, we compute approximation ratios. The approximation ratio $\rho$ is the ratio between the protection cost given by the heuristics and the optimal value:

$$\rho = \frac{\Gamma_{H}}{\Gamma_{Opt}} \qquad (5.17)$$

$\Gamma_{Opt}$ and $\Gamma_{H}$ are the protection costs given by the optimal protection schemes and the heuristic algorithms, respectively.

To compare the performance of MPT-Local and MPT-Window, we use the relative increase in protection cost when MPT-Local is used instead of MPT-Window ($R_C$) for different time windows. $R_C$ is given by:

$$R_C = \frac{\Gamma_{MPT\text{-}Local} - \Gamma_{MPT\text{-}Window}}{\Gamma_{MPT\text{-}Window}} \times 100 \qquad (5.18)$$

$\Gamma_{MPT\text{-}Local}$ and $\Gamma_{MPT\text{-}Window}$ are the overall protection costs for a given time window for MPT-Local and MPT-Window, respectively. $\Gamma_{MPT\text{-}Local}$ for a time window is the sum of the protection costs given by MPT-Local for each time interval in the time window.

5.7.3 Results

Comparison of Optimal Protection Schemes: We compared the performance of the optimal protection schemes defined in formulations 5.9 and 5.13 based on the protection cost for different time windows. To report results with statistical significance, we conducted a large number of simulations on the IEEE 9 bus test case. We generated 1000 different test cases by assigning different bus criticality values and measurement protection costs. Values were generated from a uniform random distribution between 0 and 100. We report the distribution of $R_C$ for time windows with varying numbers of time intervals in Table 5.2.
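As a small worked illustration of Equations 5.17 and 5.18, the two metrics can be computed as below. The function names and all cost values are hypothetical and chosen only for the arithmetic; they are not results from the experiments.

```python
def approximation_ratio(heuristic_cost, optimal_cost):
    """rho = Gamma_H / Gamma_Opt (Equation 5.17)."""
    return heuristic_cost / optimal_cost


def relative_cost_increase(local_costs, window_cost):
    """R_C in percent (Equation 5.18).

    local_costs: MPT-Local protection cost for each time interval in the window;
                 their sum is Gamma_MPT-Local for the window.
    window_cost: Gamma_MPT-Window for the same window.
    """
    total_local = sum(local_costs)
    return 100.0 * (total_local - window_cost) / window_cost


# Made-up costs for a time window with three intervals.
rho = approximation_ratio(109.0, 100.0)
r_c = relative_cost_increase([40.0, 35.0, 30.0], 100.0)
print(rho, r_c)
```

With these made-up numbers the heuristic cost is 9% above the optimum, and protecting each interval separately costs 5% more than planning for the whole window.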
We consistently observed a rapid increase in protection cost for MPT-Local compared with MPT-Window as the number of time intervals in the time window increased. For every PMU configuration in Table 5.2, the mean $R_C$ grew by a factor of roughly 1.75 to 1.92 when the window grew from 3 to 6 intervals.

We had to limit these evaluations to smaller test cases due to the intractable complexity of the optimal formulations. It took 3 days to compute an optimal protection strategy based on MPT-Window for a time window with 9 intervals on a single IEEE 9 bus test case, whereas the proposed heuristics computed the results in under a minute on all transmission network datasets (and in less than 1 millisecond for the 9-bus test case, for both MPT-Local and MPT-Window). For the EU 1494 test case with window size 1152, the heuristic algorithms took 47 and 54 seconds for MPT-Local and MPT-Window, respectively.

%PMU          25%            50%            75%
Window size   3      6       3      6       3      6
Mean          5.72   10.73   7.65   13.41   8.86   17.04
SD            5.63   8.64    7.33   10.58   6.88   11.24
SKW           1.15   0.93    1.28   1.17    0.90   0.91

Table 5.2: $R_C$ for time windows 3 and 6 with random protection cost and criticality assignments. SD = standard deviation, SKW = skewness.

Performance of Heuristics Compared with the Optimal Results: To evaluate the performance of the proposed heuristics against the optimal results, we used 1000 test cases generated from the IEEE 9 bus network. The distributions of approximation ratios for the different test cases are reported in Table 5.3. The mean approximation ratios ranged from 1.03 to 1.24 across all configurations, which indicates that the proposed heuristics closely approximate the optimal results in many cases.

Method       %PMU          25%           50%           75%
             Window size   3      6      3      6      3      6
MPT-Local    Mean          1.09   1.08   1.03   1.05   1.15   1.14
             SD            0.11   0.11   0.08   0.09   0.13   0.15
             SKW           1.14   0.90   0.94   0.62   0.80   1.11
MPT-Window   Mean          1.11   1.15   1.09   1.17   1.18   1.24
             SD            0.09   0.12   0.07   0.12   0.11   0.14
             SKW           1.23   1.39   1.33   1.25   0.83   1.10

Table 5.3: Approximation ratios for simulations on IEEE-9 bus test case with PMUs at 25%, 50% and 75% of the buses with random protection cost and criticality assignments.
SD = standard deviation, SKW = skewness.

Performance of Heuristics on Real Scale Transmission Networks: We compared the performance of the heuristics for MPT-Local and MPT-Window on large, real-scale transmission networks. Each experiment with the same configuration (number of PMUs, transmission network, and time window size) was conducted 50 times on different test cases by assigning different bus criticality values and measurement protection costs. Each reported data point represents the mean value of the results.

To study the impact of the size of the time window on the performance of the heuristics, we compared their performance on increasing time window sizes. In each time interval, the criticality of buses was assigned from a uniform random distribution between 0 and 100. As shown in Figure 5.6, the heuristic for MPT-Window performed better than the heuristic for MPT-Local as the time window size increased.

Figure 5.6: $R_C$ with increasing time window size. (Plot: $R_C$ versus time window size from 3 to 1152 for IEEE-118, IEEE-300, and EU-1494.)

To further understand how changes in the criticality of buses affect this behavior, we conducted experiments in which the criticality of buses oscillates between 0 and 100 based on different wave functions. We used sine [9], square [10], triangle [12], and sawtooth [8] wave functions for this evaluation. We set the wavelength (λ) of the waves to be the same as the time window size. Each bus was assigned an initial value randomly between 0 and 100 (initial phase shift). Figure 5.7 shows the results for the EU-1494 transmission network dataset with PMUs at 25% of the buses. While the heuristic for MPT-Window outperformed MPT-Local in all cases, we observed that the performance of MPT-Local improved with increasing time window size when the criticality of buses changed based on sine, triangle, and sawtooth wave functions.
This is because, with increasing λ, the changes in criticality were more gradual for these waves, which made MPT-Local perform better than in the cases with sharp criticality changes.

Figure 5.7: $R_C$ for various bus criticality variations on EU-1494. Sin = sine wave, SQR = square wave, TRI = triangle wave, SW = sawtooth wave. (Plot: $R_C$ versus time window from 6 to 576 for Random, Sin, TRI, SQR, and SW.)

Next, to understand the impact of transmission network size on the performance of the heuristics, we evaluated the heuristics on transmission networks of increasing size. As shown in Figure 5.8, $R_C$ was not impacted significantly by the transmission network size.

Figure 5.8: $R_C$ with increasing number of buses. (Plot: $R_C$ versus number of buses, 14 to 1494, for 25%, 50%, and 75% PMU.)

Performance of Parallel Algorithms: We evaluated the proposed parallel algorithms (Algorithms 19 and 20) on large-scale synthetic graph datasets generated using an RMAT generator. As can be seen in Figure 5.9, execution time decreases with increasing parallelism. The performance improvement gained by increasing the number of cores degrades rapidly for smaller graphs due to the increase in framework overhead. The results demonstrate that the performance improvement gained by increasing parallelism is significantly larger for the larger graph datasets than for the smaller ones. Figure 5.10 shows the change in execution time with the size of the input graph. As can be seen, execution time is nearly linear in the input graph size.

Time (s) by number of cores:
Cores   8       16      24      32      40      48      56      64      72      80
20x8    21.73   18.44   17.31   17.77   16.37   17.77   18.58   18.81   19.14   19.75
21x8    41.95   34.91   35.68   33.4    32.2    31.95   33.98   33.53   34.6    34.56
22x8    81.19   65.28   63.56   60.67   60.14   61.05   61.81   60.25   61.44   59.08

Figure 5.9: Execution time vs number of cores.
RMAT graphs: n×β, where $|V| = 2^n$.

Figure 5.10: Execution time vs graph size. RMAT graphs: n×β, where $|V| = 2^n$. (Plot: time in seconds for graphs 19x8 to 23x8 on 32, 40, and 48 cores.)

5.8 Other Applications

The proposed variants of the prize-collecting Steiner tree formulation have applications beyond the Smart Grid protection schemes discussed above. We discuss a general application: securing distribution networks.

Consider distribution networks such as water or oil distribution networks and optical fiber networks. In water distribution networks there are water sources, pumps, water tanks, water distribution lines, valves, and consumers [82]. Similarly, in optical fiber networks, there are optical lines, switches, a main service provider, and consumers. Such systems can be monitored continuously using SCADA systems for better control and situational awareness.

These networks can be modeled as graphs in which entities such as water tanks, consumers, and switches are represented by vertices, and water distribution lines, optical lines, and valves are represented by edges. High availability of the service is critical in the above-mentioned applications. Depending on the situation, the criticality of entities such as consumers may change with time. For example, in a crisis situation, water supply to an area with critical infrastructure such as hospitals will be more important than supply to households. Similarly, connectivity for high-paying, high-demand consumers such as universities will be critical in the normal operation of an optical network, whereas on a special occasion connectivity for a few selected consumers can become critical for a short period of time. Critical components such as valves in water distribution lines or repeaters in optical network lines should be protected in some scenarios in order to provide uninterrupted services.
The above-mentioned variants of the prize-collecting Steiner tree formulation can be directly applied in these scenarios by representing the criticalities as vertex prizes and the component protection costs as edge costs.

5.9 Summary

In this chapter, we identified a limitation in an existing protection scheme to protect a set of buses in a power grid against hidden data spoofing attacks. We presented an improved protection scheme that addressed this limitation. Furthermore, we proposed two optimal adaptive protection schemes against smart grid data spoofing attacks, MPT-Local and MPT-Window. These protection schemes provide protection recommendations while minimizing the protection cost and the potential loss that can occur due to smart grid data spoofing attacks.

Due to the intractable computational complexity of the proposed optimal protection schemes, we proposed heuristic algorithms with polynomial time complexity. Moreover, to make the proposed heuristics scalable for massive transmission networks in the future, we proposed parallel algorithms for the proposed heuristic algorithms.

Our evaluations showed that, given a set of predictions for a time window on the changes in the criticality of the buses, MPT-Window always provided the most cost-effective protection strategies in terms of overall protection cost for the time window. We observed that MPT-Local performs best when the criticality of buses changes gradually over time. Our heuristics scaled to large transmission networks and closely approximated the optimal solutions. Finally, we discussed how the proposed variants of the prize-collecting Steiner tree formulation can be applied beyond the smart grid protection schemes.

Chapter 6 Conclusions

In this dissertation, we studied a fundamental set of dynamic graph problems that have applications in cyber systems security. We developed low latency dynamic graph algorithms for these problems.
We demonstrated the scalability of the algorithms and their capability of providing low latency results, both empirically and analytically. We used both real-world and synthetic datasets in our experimental evaluations to show the effectiveness of the proposed solutions on a wide range of graph types.

6.1 Broader Impact

The proposed algorithms for solving certain dynamic graph problems, and dynamic graph analytics in general, have applications beyond cyber system security. We envision applications in areas including network system monitoring and social network analytics.

The proposed methods for monitoring structural group membership and finding subgraph isomorphisms in dynamic graphs can be used to monitor network systems and online social networks. Especially with the democratization of sensor networks through Internet of Things (IoT) applications and the emergence of network function virtualization, monitoring the state of individual devices in terms of patterns of structural connectivity will become a significant task. The proposed algorithms for structural group membership monitoring will play a major role in such scenarios. As an example, in network function virtualization, a complex network service (network service chain) composed of multiple network functions can be represented by a graph. The following example from [92] explains this representation. Assume that a user requests a service including a network function for parental control. The functionality of this network function can be decomposed into 1) traffic classifier, 2) Web proxy, and 3) firewall network functions. Each of these network functions can be decomposed into more refined network functions. These network functions should be traversed in a given order, and the logical connectivities between them are as follows: Traffic Classifier → Web Proxy → Firewall. This connectivity can be represented by a graph, which is referred to as a Network Function Graph (NFG) [92].
Once deployed, these functional components should be monitored to check whether the required connectivity invariant is satisfied and maintained. This problem can be abstracted as a structural group membership monitoring problem in dynamic graphs, where the connectivity invariant is given by a query graph.

Also, the proposed heuristics for the prize-collecting Steiner tree can be applied and extended in many scenarios other than the applications described in Section 5.8 [30]. The techniques we used when designing our dynamic prize-collecting Steiner tree heuristics will be useful when designing algorithms for problems that must address the connectivity between a dynamic set of entities. As an example, consider the kind of self-organizing networks that are being proposed to provide internet coverage to rural areas of the world. The Google Loon project (footnote 1) and the Facebook drone (footnote 2) are examples. The objective of such networks is to provide an internet backbone to rural areas using a set of network devices flying in the air. These devices should maintain connectivity to the main internet backbone at all times to provide continual internet connectivity to the users. The need for internet access in various areas, and the demand, may change with time. In these scenarios, the drones or balloons should be moved to areas of higher demand. The proposed heuristics for the dynamic prize-collecting Steiner tree can be directly applied to such scenarios.

Footnote 1: https://x.company/loon/

6.2 Future Work

This dissertation introduced a set of dynamic graph algorithms for cyber security applications. In Chapter 3 and Chapter 4, we discussed different variants of distributed algorithms for subgraph matching. The following are a few future research directions we envision based on the experience gathered during this work.

While there are a large number of frameworks for processing large static graphs, there is a lack of research towards publicly available dynamic graph processing frameworks.
The incremental computation technique based on memoization that we used in our algorithms for structural group membership monitoring can be generalized for general-purpose dynamic graph processing frameworks to reduce communication and computation. One can also note that there are no generalized computation models or programming abstractions for processing dynamic graphs. Researchers could use the presented computation model for structural group membership monitoring as a starting point to design a general-purpose computation model and programming abstractions.

Footnote 2: https://www.theguardian.com/technology/2017/jul/02/facebook-drone-aquila-internet-test-flight-arizona

Advances in mobile devices have resulted in a significant increase in the computation power of state-of-the-art personal mobile devices, such as mobile phones. These devices now carry more computation power and communication capabilities than the high-end general-purpose computers that existed a decade ago. This has made it possible to perform complex distributed computation among mobile devices. The dynamic nature of the environments makes these mobile networks dynamic. As a result, we believe there is room for dynamic graph algorithms to play a role in these environments. In all the presented distributed algorithms, we assumed a reliable computation and communication environment in which there are no device or communication failures. In distributed mobile networks, these assumptions need not always hold. Thus, algorithmic and runtime innovations are required to adapt the proposed algorithms to these environments.

The dynamic graph problems and solutions presented in this thesis cover only a small set of the cyber system security applications that exist today. As a result, there is considerable room to explore and identify more dynamic graph analytic kernels for cyber system security applications, and to design algorithms for these kernels.
A few such problems include the prediction of worm spread in computer networks and the prediction and detection of the spread of rumors in online social networks. Algorithmic techniques such as incremental computation using memoization and graph pruning can be used and extended when developing solutions for new dynamic graph problems.

The proposed methods for dealing with dynamic graph problems play only a small role in any end-to-end cyber security solution. Integrating these proposals to develop end-to-end cyber security solutions is a significant research and development task. Research and engineering challenges in areas including data extraction and transformation, complex event detection, and data integration have to be addressed when developing such end-to-end cyber system security solutions.

Reference List

[1] 9th dimacs implementation challenge. http://www.dis.uniroma1.it/challenge9/download.shtml. Accessed: 2016-02-01.
[2] Amazon ec2 instance types. https://aws.amazon.com/ec2/instance-types/. Accessed: 2016-04-01.
[3] Gurobi solver. http://www.gurobi.com/. Accessed: 2016-04-01.
[4] Lemon: Library for efficient modeling and optimization in networks. https://lemon.cs.elte.hu/trac/lemon. Accessed: 2016-04-01.
[5] Liines smart power grid test case repository. http://amfarid.scripts.mit.edu/Datasets/SPG-Data/index.php. Accessed: 2016-11-01.
[6] Mpich, a high performance and widely portable implementation of the message passing interface (mpi) standard. https://www.mpich.org. Accessed: 2016-04-01.
[7] The national electric transmission congestion study. https://www.energy.gov/oe/downloads/2015-national-electric-transmission-congestion-study. Accessed: 2016-11-01.
[8] Sawtooth wave function. http://mathworld.wolfram.com/SawtoothWave.html. Accessed: 2016-11-01.
[9] Sin wave function. http://mathworld.wolfram.com/Sine.html. Accessed: 2016-11-01.
[10] Square wave function. http://mathworld.wolfram.com/SquareWave.html. Accessed: 2016-11-01.
[11] Transmission network datasets. https://wiki.openmod-initiative.org/wiki/Transmission_network_datasets. Accessed: 2016-11-01.
[12] Triangle wave function. http://mathworld.wolfram.com/TriangleWave.html. Accessed: 2016-11-01.
[13] Vflib: graph matching library. http://www3.cs.stonybrook.edu/~algorith/implement/vflib/implement.shtml. Accessed: 2016-04-01.
[14] J. Abello, M. G. Resende, and S. Sudarsky. Massive quasi-clique detection. In Latin American Symposium on Theoretical Informatics, pages 598–612. Springer, 2002.
[15] A. Abur and A. G. Exposito. Power system state estimation: theory and implementation. CRC press, 2004.
[16] C. C. Aggarwal, H. Wang, et al. Managing and mining graph data, volume 40. Springer, 2010.
[17] K. Ahmat. Ethernet topology discovery: A survey. arXiv preprint arXiv:0907.3095, 2009.
[18] M. Akhmedov, I. Kwee, and R. Montemanni. A fast heuristic for the prize-collecting steiner tree problem. Lecture Notes in Management Science, 6:207–216, 2014.
[19] M. Anghel, K. A. Werley, and A. E. Motter. Stochastic model for power grid dynamics. In System Sciences, 2007. HICSS 2007. 40th Annual Hawaii International Conference on, pages 113–113. IEEE, 2007.
[20] A. Archer, M. Bateni, M. Hajiaghayi, and H. Karloff. Improved approximation algorithms for prize-collecting steiner tree and tsp. SIAM Journal on Computing, 40(2):309–332, 2011.
[21] A.-L. Barabási, H. Jeong, Z. Néda, E. Ravasz, A. Schubert, and T. Vicsek. Evolution of the social network of scientific collaborations. Physica A: Statistical mechanics and its applications, 311(3):590–614, 2002.
[22] S. Bi and Y. J. Zhang. Graphical methods for defense against false-data injection attacks on power system state estimation. Smart Grid, IEEE Transactions on, 5(3):1216–1227, 2014.
[23] R. B. Bobba, K. M. Rogers, Q. Wang, H. Khurana, K. Nahrstedt, and T. J. Overbye. Detecting false data injection attacks on dc state estimation.
In Preprints of the First Workshop on Secure Control Systems, CPSWEEK, volume 2010, 2010.
[24] P. Boldi, M. Rosa, M. Santini, and S. Vigna. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks. In S. Srinivasan, K. Ramamritham, A. Kumar, M. P. Ravindra, E. Bertino, and R. Kumar, editors, Proceedings of the 20th international conference on World Wide Web, pages 587–596. ACM Press, 2011.
[25] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In Proc. of the Thirteenth International World Wide Web Conference (WWW 2004), pages 595–601, Manhattan, USA, 2004. ACM Press.
[26] N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Dimov, H. Ding, J. Ferris, A. Giardullo, S. Kulkarni, H. C. Li, et al. Tao: Facebook's distributed data store for the social graph. In USENIX Annual Technical Conference, pages 49–60, 2013.
[27] Z. Cai, D. Logothetis, and G. Siganos. Facilitating real-time graph mining. In Proceedings of the fourth international workshop on Cloud data management, pages 1–8. ACM, 2012.
[28] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. In Proceedings of the 2004 SIAM International Conference on Data Mining, pages 442–446. SIAM, 2004.
[29] R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, and E. Chen. Kineograph: taking the pulse of a fast-changing and connected world. In Proceedings of the 7th ACM european conference on Computer Systems, pages 85–98. ACM, 2012.
[30] X. Cheng, Y. Li, D.-Z. Du, and H. Q. Ngo. Steiner trees in industry. In Handbook of combinatorial optimization, pages 193–216. Springer, 2004.
[31] S. Choudhury, L. Holder, G. Chin, K. Agarwal, and J. Feo. A selectivity based approach to continuous pattern detection in streaming graphs. arXiv preprint arXiv:1503.00849, 2015.
[32] S. Choudhury, L. Holder, G. Chin, A. Ray, S. Beus, and J. Feo. Streamworks: a system for dynamic graph search.
In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 1101–1104. ACM, 2013.
[33] R. Collier. Nhs ransomware attack spreads worldwide. CMAJ, 189(22), 2013.
[34] G. . S. Committee. Graph 500 benchmark 1. http://www.graph500.org/specifications. Accessed: 2017-03-01.
[35] A. Conta and M. Gupta. Internet control message protocol (icmpv6) for the internet protocol version 6 (ipv6) specification. 2006.
[36] L. Cordella, P. Foggia, C. Sansone, and M. Vento. A (sub)graph isomorphism algorithm for matching large graphs. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(10), 2004.
[37] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. Performance evaluation of the vf graph matching algorithm. In International Conference on Image Analysis Proceedings, 1999.
[38] P. De Meo, E. Ferrara, G. Fiumara, and A. Provetti. Generalized louvain method for community detection in large networks. In Intelligent Systems Design and Applications (ISDA), 2011 11th International Conference on, pages 88–93. IEEE, 2011.
[39] J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[40] D. Deka. Analysis of the power grid: structure and secure operations. PhD thesis, The University of Texas at Austin, 2015.
[41] D. Deka, R. Baldick, and S. Vishwanath. Data attack on strategic buses in the power grid: Design and protection. In PES General Meeting | Conference & Exposition, 2014 IEEE, pages 1–5. IEEE, 2014.
[42] B. Donnet and T. Friedman. Internet topology discovery: a survey. IEEE Communications Surveys & Tutorials, 9(4):56–69, 2007.
[43] B. Du, S. Zhang, N. Cao, and H. Tong. First: Fast interactive attributed subgraph matching. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1447–1456. ACM, 2017.
[44] D. Ediger, R. McColl, J. Riedy, and D. A. Bader.
Stinger: High performance data structure for streaming graphs. In High Performance Extreme Computing (HPEC), 2012 IEEE Conference on, pages 1–5. IEEE, 2012.
[45] D. Ediger, J. Riedy, D. A. Bader, and H. Meyerhenke. Tracking structure of streaming social networks. In Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on, pages 1691–1699. IEEE, 2011.
[46] M. Esmalifalak, Z. Han, and L. Song. Effect of stealthy bad data injection on network congestion in market based power system. In 2012 IEEE Wireless Communications and Networking Conference (WCNC), pages 2468–2472. IEEE, 2012.
[47] W. Fan, J. Li, S. Ma, N. Tang, Y. Wu, and Y. Wu. Graph pattern matching: From intractable to polynomial time. Proc. VLDB Endow., 3(1-2), Sept. 2010.
[48] W. Fan, J. Li, X. Wang, and Y. Wu. Query preserving graph compression. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 157–168. ACM, 2012.
[49] W. Fan, X. Wang, and Y. Wu. Incremental graph pattern matching. ACM Transactions on Database Systems (TODS), 38(3):18, 2013.
[50] W. Fan, X. Wang, Y. Wu, and D. Deng. Distributed graph simulation: Impossibility and possibility. Proceedings of the VLDB Endowment, 7(12):1083–1094, 2014.
[51] A. Fard, A. Abdolrashidi, L. Ramaswamy, and J. A. Miller. Towards efficient query processing on massive time-evolving graphs. In Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), 2012 8th International Conference on, pages 567–574. IEEE, 2012.
[52] A. Fard, M. U. Nisar, J. A. Miller, and L. Ramaswamy. Distributed and scalable graph pattern matching: Models and algorithms. International Journal of Big Data (IJBD), 1(1):1–14, 2014.
[53] A. Fard, M. U. Nisar, L. Ramaswamy, J. A. Miller, and M. Saltz. A distributed vertex-centric approach for pattern matching in massive graphs. In IEEE International Conference on Big Data, 2013.
[54] A. Flink.
Introducing gelly: graph processing with apache flink. https://flink.apache.org/news/2015/08/24/introducing-flink-gelly.html. Accessed: 2016-11-01.
[55] S. Fortunato. Community detection in graphs. Physics reports, 486(3):75–174, 2010.
[56] B. Gallagher. Matching structure and semantics: A survey on graph-based pattern matching. 2006.
[57] J. Gao, C. Zhou, and J. X. Yu. Toward continuous pattern detection over evolving large graph with snapshot isolation. The VLDB Journal, 25(2):269–290, 2016.
[58] J. Gao, C. Zhou, J. Zhou, and J. X. Yu. Continuous pattern detection over billion-edge graph using distributed framework. In Data Engineering (ICDE), 2014 IEEE 30th International Conference on.
[59] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. 1979.
[60] J. J. Grainger and W. D. Stevenson. Power system analysis. McGraw-Hill, 1994.
[61] Q. Gu and P. Liu. Denial of service attacks. Handbook of Computer Networks: Distributed Networks, Network Planning, Control, Management, and New Trends and Applications, 3:454–468, 2007.
[62] H. Haiping, F. Juan, W. Ruchuan, and Q. XiaoLin. An exact top-k query algorithm with privacy protection in wireless sensor networks. International Journal of Distributed Sensor Networks, 10(2):749049, 2014.
[63] M. R. Henzinger, T. A. Henzinger, and P. W. Kopke. Computing simulations on finite and infinite graphs. In 36th Annual Symposium on Foundations of Computer Science, pages 453–462. IEEE, 1995.
[64] J. Huang and D. J. Abadi. Leopard: Lightweight edge-oriented partitioning and replication for dynamic graphs. Proceedings of the VLDB Endowment, 9(7).
[65] L. Jia, R. J. Thomas, and L. Tong. Impacts of malicious data on real-time price of electricity market operations. In System Science (HICSS), 2012 45th Hawaii International Conference on, pages 1907–1914. IEEE, 2012.
[66] D. S. Johnson, M. Minkoff, and S. Phillips. The prize collecting steiner tree problem: theory and practice.
In SODA, volume 1, page 4, 2000. [67] G.KarypisandV.Kumar. Metis–unstructuredgraphpartitioningandsparse matrix ordering system, version 2.0. 1995. [68] G. Kossinets and D. J. Watts. Empirical analysis of an evolving social net- work. science, 311(5757):88–90, 2006. [69] R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In Link mining: models, algorithms, and applications, pages 337–357. Springer, 2010. [70] J. Lee, W.-S. Han, R. Kasperovics, and J.-H. Lee. An in-depth comparison of subgraph isomorphism algorithms in graph databases. In Proceedings of the VLDB Endowment, volume 6, pages 133–144. VLDB Endowment, 2012. 156 [71] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014. [72] J. Lin and M. Schatz. Design patterns for efficient graph algorithms in mapreduce. In Proceedings of the Eighth Workshop on Mining and Learning with Graphs, pages 78–85. ACM, 2010. [73] Y. Liu, P. Ning, and M. K. Reiter. False data injection attacks against state estimation in electric power grids. ACM Transactions on Information and System Security (TISSEC), 14(1):13, 2011. [74] N. A. Lynch. Distributed algorithms, chapter 15, pages 475–496. Morgan Kaufmann, 1996. [75] S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo. Capturing topology in graph pattern matching. Proceedings of the VLDB Endowment, 5(4). [76] S. Ma, Y. Cao, J. Huai, and T. Wo. Distributed graph pattern matching. In Proceedings of the 21st international conference on World Wide Web, pages 949–958. ACM, 2012. [77] H. V. Madhyastha, E. Katz-Bassett, T. E. Anderson, A. Krishnamurthy, and A. Venkataramani. iplane nano: Path prediction for peer-to-peer applica- tions. In NSDI, volume 9, pages 137–152, 2009. [78] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. 
In Proceed- ings of the 2010 ACM SIGMOD International Conference on Management of data, pages 135–146. ACM, 2010. [79] R. McColl, O. Green, and D. A. Bader. A new parallel algorithm for con- nected components in dynamic graphs. In High Performance Computing (HiPC), 2013 20th International Conference on, pages 246–255. IEEE, 2013. [80] W. Mclendon Iii, B. Hendrickson, S. J. Plimpton, and L. Rauchwerger. Find- ing strongly connected components in distributed graphs. Journal of Parallel and Distributed Computing, 65(8):901–910, 2005. [81] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and Analysis of Online Social Networks. In Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC’07), San Diego, CA, October 2007. [82] A. Ostfeld. Water Distribution Networks, pages 101–124. Springer Berlin Heidelberg, Berlin, Heidelberg, 2015. 157 [83] N. Perlroth. Russian hackers targeting oil and gas com- panies. https://www.nytimes.com/2014/07/01/technology/ energy-sector-faces-attacks-from-hackers-in-russia.html. Accessed: 2017-10-01. [84] J. Postel. User datagram protocol. Technical report, 1980. [85] J. Postel. Transmission control protocol. 1981. [86] N. L. Prasanna, K. SRAVANTHI, and N. SUDHAKAR. Applications of graph labeling in communication networks. Oriental Journal of Computer Science and Technology, 7, 2014. [87] V. Quintana, A. Simoes-Costa, and A. Mandel. Power system topological observability using a direct graph-theoretic approach. IEEE Transactions on Power Apparatus and Systems, (3):617–626, 1982. [88] A. Rapoport and W. J. Horvath. A study of a large sociogram. Systems Research and Behavioral Science, 6(4):279–291, 1961. [89] J. W. Raymond and P. Willett. Maximum common subgraph isomorphism algorithms for the matching of chemical structures. Journal of computer- aided molecular design, 16(7). [90] A. Refsdal, B. Solhaug, and K. Stølen. Cyber-risk management. In Cyber- Risk Management, pages 25–27. 
Springer, 2015. [91] Y. Rekhter, T. Li, and S. Hares. A border gateway protocol 4 (bgp-4). Technical report, 2005. [92] S. Sahhaf, W. Tavernier, M. Rost, S. Schmid, D. Colle, M. Pickavet, and P. Demeester. Network service chaining with optimized network function embedding supporting service decompositions. Computer Networks, 93:492– 505, 2015. [93] N.Sayeekumar, S.Ahmed, S.P.Karthikeyan, S.K.Sahoo, andI.J.Raglend. Graph theory and its applications in power systems-a review. In Control, Instrumentation, Communication and Computational Technologies (ICCI- CCT), 2015 International Conference on, pages 154–157. IEEE, 2015. [94] B. Shao, H. Wang, and Y. Li. Trinity: A distributed graph engine on a memory cloud. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, New York, USA, 2013. 158 [95] D. Shasha, J. T. Wang, and R. Giugno. Algorithmics and applications of tree and graph searching. In Proceedings of the twenty-first ACM SIGMOD- SIGACT-SIGART symposium on Principles of database systems, pages 39– 52. ACM, 2002. [96] D. P. Shepard, T. E. Humphreys, and A. A. Fansler. Evaluation of the vul- nerabilityofphasormeasurementunitstogpsspoofingattacks. International Journal of Critical Infrastructure Protection, 5(3):146–153, 2012. [97] J. Siek, L.-Q. Lee, and A. Lumsdaine. Boost random number library. http://www.boost.org/libs/graph/. Accessed: 2015-01-01. [98] Y. Simmhan, A. Kumbhare, C. Wickramaarachchi, S. Nagarkar, S. Ravi, C. Raghavendra, and V. Prasanna. Goffish: A sub-graph centric framework for large-scale graph analytics. In European Conference on Parallel Process- ing, pages 451–462. Springer, 2014. [99] C. Song, T. Ge, C. Chen, and J. Wang. Event pattern matching over graph streams. Proceedings of the VLDB Endowment, 8(4):413–424, 2014. [100] A. Stotz, R. Nagi, and M. Sudit. Incremental graph matching for situation awareness. In 12th International Conference on Information Fusion, 2009. [101] D. Stutzbach, R. 
Rejaie, and S. Sen. Characterizing unstructured overlay topologies in modern p2p file-sharing systems. IEEE/ACM Transactions on Networking, 16(2):267–280, 2008. [102] Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li. Efficient subgraph matching on billion node graphs. Proceedings of the VLDB Endowment, 5(9). [103] J. Thom, S. Y. Helsley, T. L. Matthews, E. M. Daly, and D. R. Millen. What are you working on? status message q&a in an enterprise sns. In ECSCW 2011: Proceedings of the 12th European Conference on Computer Supported Cooperative Work, 24-28 September 2011, Aarhus Denmark, pages 313–332. Springer, 2011. [104] J. R. Ullmann. An algorithm for subgraph isomorphism. Journal of the ACM (JACM), 23(1):31–42, 1976. [105] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990. [106] W.WangandZ.Lu. Cybersecurityinthesmartgrid: Surveyandchallenges. Computer Networks, 57(5):1344–1371, 2013. 159 [107] S. Wasserman and K. Faust. Social network analysis: Methods and applica- tions, volume 8. Cambridge university press, 1994. [108] J.Webb,F.Docemmilli,andM.Bonin. Graphtheoryapplicationsinnetwork security. arXiv preprint arXiv:1511.04785, 2015. [109] J. Webber. A programmatic introduction to neo4j. In Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity, pages 217–218. ACM, 2012. [110] C. Wickramaarachchi, S. R. Kuppannagari, R. Kannan, and V. K. Prasanna. Improved protection scheme for data attack on strategic buses in the smart grid. In 4th IEEE Conference on Technologies for Sustainability (SusTech), pages 96–101. IEEE, 2016. [111] L. Xiaoping, D. Ming, H. Jianghong, H. Pingping, and P. Yali. Dynamic economic dispatch for microgrids including battery energy storage. In Power Electronics for Distributed Generation Systems (PEDG), 2010 2nd IEEE International Symposium on, pages 914–917. IEEE, 2010. [112] L. Xie, Y. Mo, and B. Sinopoli. 
Integrity data attacks in power market operations. IEEE Transactions on Smart Grid, 2(4):659–666, 2011. [113] Y. Yuan, Z. Li, and K. Ren. Modeling load redistribution attacks in power systems. IEEE Transactions on Smart Grid, 2(2):382–390, 2011. [114] Y. Yuan, Z. Li, and K. Ren. Quantitative analysis of load redistribution attacks in power systems. IEEE Transactions on Parallel and Distributed Systems, 23(9):1731–1738, 2012. [115] A. Zeeshan, A. M. Masood, Z. M. Faisal, A. Kalim, and N. Farzana. Prism: Automatic detection and prevention from cyber attacks. In Wireless Net- works, Information Processing and Systems. 160
Asset Metadata
Creator
Wickramaarachchi, Charith Dhanushka (author)
Core Title
Dynamic graph analytics for cyber systems security applications
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Publication Date
03/12/2018
Defense Date
11/10/2017
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
cyber systems security, dynamic graph analytics, incremental graph processing, OAI-PMH Harvest
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Prasanna, Viktor (committee chair), Kannan, Rajgopal (committee member), Nakano, Aiichiro (committee member), Raghavendra, Cauligi (committee member)
Creator Email
charith.dhanushka@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-484965
Unique identifier
UC11268345
Identifier
etd-Wickramaar-6102.pdf (filename), usctheses-c40-484965 (legacy record id)
Legacy Identifier
etd-Wickramaar-6102.pdf
Dmrecord
484965
Document Type
Dissertation
Rights
Wickramaarachchi, Charith Dhanushka
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA