EXPLOITING VARIABLE TASK GRANULARITIES FOR SCALABLE AND EFFICIENT PARALLEL GRAPH ANALYTICS

by Kartik Lakhotia

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER ENGINEERING)

December 2021

Copyright 2021 Kartik Lakhotia

Dedication

To my wife and my parents.

Acknowledgements

First, I would like to sincerely thank my advisor, Prof. Viktor K. Prasanna, for being a wonderful advisor and for his continued support during the course of my stay at USC. He has patiently mentored me through several challenges, and his immense technical knowledge in the field of parallel computing has been crucial to the research presented in this thesis. He always emphasized the importance of simple and fundamental ideas along with the skills to clearly present them. I am extremely grateful for the invaluable opportunities that he has provided for my research and career, which extend well beyond the Ph.D. program.

I also express my gratitude to Prof. Rajgopal Kannan for extensive collaboration on several projects and brainstorming sessions that I always looked forward to. I would also like to thank committee members Prof. Stephen Crago, Prof. Aiichiro Nakano and Prof. Xuehai Qian for their valuable feedback and assistance in strengthening this thesis.

The P-group at USC introduced me to amazing colleagues that I frequently interacted with. Weekly group meetings involving Shijie Zhou, Sanmukh Kuppannagari, Ajitesh Srivastava, Hanqing Zeng and Shreyas Singapura have been a source of new ideas and early feedback on my work. Sasindu Wijeratne, Yang Yang, Tian Ye, Bingyi Zhang, Hongkuan Zhou, Pengmiao Zhang, Yuan Meng and other new members of the group have infused a great deal of enthusiasm and have taught me a lot about hardware acceleration using FPGAs.

While working on the DARPA HIVE project, I also had the pleasure of collaborating with some esteemed researchers in the field of parallel computing. I especially thank Fabrizio Petrini, Cesar A. F. De Rose, and Oded Green, who have been a tremendous source of knowledge on parallel computing. Fabrizio supervised my internship at Intel, where he taught me about network simulators, collective computations, and various nitty-gritties of shared-memory parallelism. Technical discussions with Prof. Cesar and Oded were the origin of some of the novel techniques presented in this thesis.

Finally, I would like to thank my family, especially my wife Nayana Nair, my brother Rahul Lakhotia, and my parents Rajkumar Lakhotia and Manju Lakhotia, for their immense support, without which this thesis would not be possible. Nayana has been a constant source of inspiration, encouragement and a driving factor that has helped me come through tough times during the Ph.D. Last but not least, I am indebted to my dogs Spotty and Tuy, who have provided me unconditional love and joy, even while being away from me.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Motivation
  1.2 Parallel Computing Platforms
  1.3 Graph Abstraction
    1.3.1 Graph Topology
    1.3.2 Graph Representation
      1.3.2.1 Adjacency List
      1.3.2.2 Adjacency Matrix
  1.4 Graph Algorithms in this Thesis
    1.4.1 Simple Neighborhood Propagation Algorithms
    1.4.2 Complex Graph Analytics
  1.5 Thesis Statement
  1.6 Contributions
  1.7 Thesis Organization

Chapter 2: Cache and Memory-efficient Graph Processing
  2.1 Background
  2.2 Graph Processing Over Partitions Framework Overview
  2.3 Communication Optimizations
    2.3.1 Graph Partitioning
    2.3.2 Message Generation and Consumption
    2.3.3 Message Communication
    2.3.4 2-level Active List
    2.3.5 Interleaving Scatter and Gather Phases
  2.4 Programming Interface
  2.5 Target Applications
  2.6 Experimental Evaluation
    2.6.1 Setup
    2.6.2 Results
      2.6.2.1 Comparison with Baselines
      2.6.2.2 Scalability
      2.6.2.3 Impact of GPOP Optimizations
  2.7 Related Work
  2.8 Chapter Summary

Chapter 3: Planting Trees for Scalable and Efficient Hub Labeling
  3.1 Motivation
  3.2 Background
    3.2.1 Hub Labeling
    3.2.2 Pruned Landmark Labeling
    3.2.3 Challenges
  3.3 Shared-memory Parallel Hub Labeling
    3.3.1 Label Construction and Cleaning
    3.3.2 Global Local Labeling
  3.4 Distributed-memory Parallel Hub Labeling
    3.4.1 Distributed Global Local Labeling
    3.4.2 PLaNT: Prune Labels and (do) Not (prune) Trees
    3.4.3 Hybrid Algorithm
      3.4.3.1 Enabling Efficient Multi-node Pruning
  3.5 Querying Distributed Labels
  3.6 Experimental Evaluation
    3.6.1 Setup
    3.6.2 Results
      3.6.2.1 Shared-memory Algorithms
      3.6.2.2 Distributed-memory Algorithms
      3.6.2.3 Evaluating Query Modes
  3.7 Related Work
  3.8 Chapter Summary

Chapter 4: Parallel Bipartite Graph Decomposition
  4.1 Motivation
  4.2 Background
    4.2.1 Butterfly counting
    4.2.2 Bipartite Graph Decomposition
    4.2.3 Bottom-Up Peeling (BUP)
    4.2.4 Bloom-Edge-Index
    4.2.5 Challenges
  4.3 Parallel Bipartite Network Peeling framework
    4.3.1 Two-phased Peeling
      4.3.1.1 Coarse-grained Decomposition
      4.3.1.2 Fine-grained Decomposition
      4.3.1.3 Range Partitioning
      4.3.1.4 Partition scheduling in Fine-grained Decomposition
    4.3.2 Tip Decomposition
      4.3.2.1 Coarse-grained Decomposition
      4.3.2.2 Fine-grained Decomposition
    4.3.3 Wing Decomposition
      4.3.3.1 Coarse-grained Decomposition
      4.3.3.2 Fine-grained Decomposition
  4.4 Analysis
    4.4.1 Correctness of Decomposition Output
    4.4.2 Computation and Space Complexity
  4.5 Optimizations
    4.5.1 Batch Processing
    4.5.2 Dynamic Graph Updates
  4.6 Experimental Evaluation
    4.6.1 Setup
    4.6.2 Results: Wing Decomposition
      4.6.2.1 Comparison with Baselines
      4.6.2.2 Effect of Optimizations
      4.6.2.3 Comparison of Different Phases
      4.6.2.4 Scalability
    4.6.3 Results: Tip Decomposition
      4.6.3.1 Comparison with Baselines
      4.6.3.2 Optimizations
      4.6.3.3 Comparison of Phases
      4.6.3.4 Scalability
  4.7 Related Work
  4.8 Chapter Summary

Chapter 5: Conclusion
  5.1 Summary of Contributions
  5.2 Future Work
    5.2.1 Graph Processing on Custom Architectures
    5.2.2 Advanced Memory Technologies
    5.2.3 Scaling Hub Labeling to Massive Networks
    5.2.4 Distributed-memory Parallel Graph Decomposition

Bibliography

List of Tables

2.1 Neighborhood Propagation Algorithms: datasets for evaluation.
2.2 Execution time of GPOP and baselines.
2.3 L2 cache misses incurred by GPOP and baselines.
3.1 Hierarchical Hub Labeling: notations.
3.2 Intra-tree parallelism using GPOP exhibits poor parallel scalability.
3.3 Hierarchical Hub Labeling: datasets for evaluation.
3.4 Performance comparison of GLL, LCC and baselines. Time = ∞ implies that execution did not finish in 4 hours.
3.5 Querying throughput, latency and memory consumption on 16 nodes.
4.1 Bipartite Graph Decomposition: notations.
4.2 Parallel implementation of bottom-up peeling exhibits poor parallel scalability.
4.3 Bipartite Graph Decomposition: datasets for evaluation with the number of butterflies (in billions), maximum tip numbers (for peeling U(G) and for peeling V(G)), and maximum wing number.
4.4 Comparing execution time, number of support updates and synchronization rounds of wing decomposition algorithms. Missing entries denote that execution did not finish in 2 days.
4.5 Comparing execution time, number of wedges traversed and synchronization rounds of tip decomposition algorithms. Missing entries denote that execution did not finish in 2 days or ran out of memory.

List of Figures

1.1 Conceptual illustration of parallel computing platforms used in this thesis. A shared-memory system may have additional levels of private or shared caches.
1.2 An undirected graph of 5 vertices and 6 edges with labels denoting edge weights.
1.3 Graphs with different topological characteristics.
1.4 Bipartite graph showing purchase history of consumers on an e-commerce website.
1.5 Sparse storage formats for adjacency matrix representation of the graph in figure 1.2.
1.6 Graphical illustration of iterative communication and computation patterns of Neighborhood Propagation algorithms.
1.7 Dataflow graph of existing Hierarchical Hub Labeling algorithms typifies complex analytics studied in this thesis. Edges from the output of a tree (denoted Op_i) to subsequent trees indicate sequential dependencies.
2.1 DRAM traffic generated by random accesses to vertex data in PageRank.
2.2 Read-writes to vertex attributes or states generate random memory access patterns.
2.3 Conceptual demonstration of two phases in GPOP.
2.4 Graphical illustration of index-based vertex partitioning in an example graph.
2.5 Messages scattered from partition P_2 to P_0 for the graph shown in figure 2.4. GPOP avoids redundant propagation of the same update value to reduce DRAM communication.
2.6 Partition-Vertex Graph layout for the example graph shown in figure 2.4.
2.7 Strong scaling results for GPOP.
2.8 Weak scaling results for GPOP. Labels on the curves indicate number of edges in millions.
2.9 Graph density vs execution time for GPOP.
2.10 Impact of GPOP optimizations on execution time.
3.1 Shortest distance computation between vertices v_4 and v_5 using their hub label sets L_4 and L_5, respectively.
3.2 Label generation by PLL with hub v_2. SPT_v2 does not visit v_4 due to pruning.
3.3 Distribution of labels generated by individual SPTs.
3.4 Label generation by PLaNT with hub v_2 for the graph in figure 3.2a. PLaNT generates the same labels as PLL but does not require other labels for pruning (labels with hub v_1 in this case).
3.5 Value for SPTs generated by PLaNT.
3.6 Number of labels generated as a function of the number of (highest-ranked) hubs used for pruning. X-axis = 0 means rank queries only.
3.7 Execution time of GLL vs synchronization threshold.
3.8 Execution time of Hybrid algorithm (on 16 nodes) vs switching threshold.
3.9 Time taken for label construction and cleaning in LCC and GLL (normalized by the total execution time of GLL).
3.10 Execution time of DparaPLL, DGLL, PLaNT and Hybrid algorithms.
3.11 Size of labeling generated by DparaPLL and the Hybrid/PLaNT/DGLL algorithms.
4.1 (a) Bipartite graph G(U, V, E) which is also a 1-wing. (b) Wing decomposition hierarchy of G(U, V, E) – edges colored blue, red, green and black have wing numbers of 1, 2, 3 and 4, respectively.
4.2 (a) Bipartite graph G'(W = (U, V), E). (b) BE-Index of G' with two maximal priority blooms.
4.3 Graphical illustration of PBNG's two-phased peeling for wing decomposition of the graph G from figure 4.1. The coarse-grained decomposition divides E(G) into P = 2 partitions using parallel peeling iterations. The fine-grained decomposition peels each partition using a single thread but concurrently processes multiple partitions.
4.4 Benefits of Workload-aware Scheduling (WaS) in a 3-thread (T_1, T_2 and T_3) system. Top row shows entity partitions with estimated time to peel them in PBNG FD. Dynamic allocation without WaS finishes in 28 units of time compared to 20 units with WaS.
4.5 Execution time of tip decomposition vs number of partitions P in PBNG.
4.6 Execution time of wing decomposition vs number of partitions P in PBNG.
4.7 Effect of optimizations (section 4.5) on wing decomposition in PBNG. Execution time normalized with respective measurements for PBNG with all optimizations enabled. With batch processing disabled, Gtr, Tr and De-ut did not finish within 2 days.
4.8 Contribution of different steps to the overall support updates and the execution time of wing decomposition in PBNG.
4.9 Strong scaling of wing decomposition in PBNG.
4.10 Effect of optimizations (section 4.5) on tip decomposition in PBNG. All quantities are normalized with the respective measurements for PBNG with all optimizations enabled.
4.11 Contribution of different steps to the overall wedge traversal and the execution time of tip decomposition in PBNG.
4.12 Strong scaling of tip decomposition in PBNG.

Abstract

Graphs are a preferred choice of data representation in various domains including social networks, transportation, bioinformatics, e-commerce and others.
With the rapid growth of data produced in these domains, parallel computing has become crucial for scaling the capabilities of graph analytics. Conventional approaches to parallelizing graph analytics typically create fine-granularity tasks to maximally exploit the parallelizable workload available in existing sequential algorithms. However, the resulting parallel algorithms pose fundamental issues in terms of random memory accesses, communication and synchronization. In this dissertation, we make a case for exploiting concurrent tasks at a level that inherently targets the bottlenecks in application performance, instead of mining parallelism from sequential algorithms. We identify bottleneck-alleviating tasks for several analytics with different computation-communication characteristics and develop novel parallel algorithms that significantly outperform existing approaches.

The first half of this thesis focuses on a broad category, which we refer to as Neighborhood Propagation analytics, characterized by iterative communication between adjacent vertices and local computation on vertices, until convergence. This includes fundamental kernels such as Pagerank, Single Source Shortest Path and others, which exhibit a high communication to computation ratio and are bottlenecked by memory system performance. However, the fine-grained vertex-centric parallelism typically used for such algorithms generates highly irregular memory access patterns that fail to effectively utilize the capabilities of the memory hierarchy. For these analytics, we propose a novel framework that utilizes coarse-granular tasks in the form of cacheable vertex subsets to achieve cache and bandwidth friendly access patterns along with asynchronous execution for fast convergence. On a 36-core Intel server, our framework achieves a 5.5× and 3.6× reduction in cache misses and execution time, respectively, over the state-of-the-art.

In the second half of this thesis, we study Complex analytics that are computationally intensive and execute numerous instances of a custom neighborhood propagation algorithm. Specifically, we look at analytics that exhibit sequential dependencies among these instances. As a result, existing approaches restrict themselves to parallelism within each instance and incur a large amount of synchronization and communication. Consequently, they exhibit poor parallel scalability, which becomes the primary performance bottleneck. For such analytics, we develop the first set of scalable algorithms that exploit the massive coarse-granularity parallelism across the neighborhood propagation instances. Our algorithms appreciably enhance the state of current practice for these analytics, with more than two orders of magnitude speedup over existing algorithms and the capability to process large datasets in a few minutes which existing algorithms cannot process even in several days.

Through these innovations, this thesis emphasizes the importance of exploiting bottleneck-specific task partitioning for parallel graph processing. We further discuss how our approach can benefit graph processing on custom architectures designed for sparse data analysis and on emerging memory technologies. The algorithms developed in this thesis are also of independent interest and can inspire new directions for future research in the respective areas.

Chapter 1: Introduction

1.1 Motivation

Graphs provide a powerful representation for storage and analysis of pairwise relationship data.
Many real-world systems such as social networks, transportation routes, the world wide web, biological systems, consumer purchase networks and financial transactions can be effectively modeled as a graph [21, 91, 64, 55]. The rapid growth of data generated in these systems has garnered tremendous interest in large-scale graph processing on parallel computing platforms.

Graph analytics employ computational algorithms on the graph representation to mine useful information from the underlying real-world system. They are at the heart of several applications, some of which are listed below:

1. Identification of highly influential or most important nodes in a network.
2. Shortest path routing in a computer or a transportation network.
3. Spam reviewer detection in rating networks of review websites.
4. Fraud detection in financial networks.
5. Link prediction in social networks and e-commerce applications.
6. Context-aware search on knowledge bases.

Pertaining to high performance parallel graph analytics, there is a large body of research on distributed-memory clusters that provide in-memory processing capabilities for massive datasets [57, 81, 49, 79, 83, 144, 120]. In recent years, shared-memory systems with relatively lower communication costs have become popular for graph processing [110, 36, 70, 83, 92, 120, 139, 140]. Further, programs such as the DARPA Hierarchical Identify Verify Exploit (HIVE) program focused on vertically integrated parallel solutions (algorithms, runtimes, architectures) tailored for graph processing [123]. Yet, developing scalable and efficient parallel graph algorithms remains challenging due to the irregular nature of computation in graphs [79]. This thesis proposes novel approaches to address the following issues:

1. Irregular Memory Accesses: Graph traversal results in highly irregular accesses with poor locality. We develop novel methodologies to induce memory access locality in graph processing for efficient utilization of the memory hierarchy.
2. Communication: Graph algorithms are typically bottlenecked by network or DRAM communication. We develop novel data layouts and algorithms to reduce communication for graph analytics.
3. Synchronization: Sequential dependencies in complex graph applications require frequent synchronization between concurrent workers. This decreases effective parallelism and deteriorates scalability. We develop novel algorithms that reduce synchronization and improve scalability.

Conventional approaches to parallelizing graph processing try to exploit the concurrency available in pre-existing sequential algorithms. Further optimizations are built on top of the parallel algorithm thus derived to improve its performance. However, the final implementation faces performance challenges as the underlying sequential algorithm is not designed for parallel performance.

In this thesis, we first identify the task granularity that explicitly targets performance bottlenecks specific to the problem and the underlying computing platform. Parallel algorithms and optimizations are then built to exploit these tasks and hence inherently avoid the bottlenecks. To better place our contributions, we briefly discuss the parallel computing platforms and the different categories of graph analytics targeted in this thesis.

1.2 Parallel Computing Platforms

The rapid growth of data generated in graph-structured systems such as social networks or the world wide web has generated tremendous interest in large-scale graph processing on parallel platforms.
The proposed algorithms in this thesis are designed for the following platforms:

• Shared-memory multicore systems consist of several CPU cores, all of which have direct access to a shared main memory (DRAM) via an on-chip interconnection network, as shown in figure 1.1. Consequently, they exhibit lower communication cost than distributed-memory clusters, which use an external network to connect processors. This is particularly beneficial for graph computations, which feature a low computation to communication ratio. These systems provide large caches and have seen significant growth in memory capacity, enabling big data processing. However, the available compute and memory resources on multicore servers are typically restricted to a few tens of cores and a few hundred gigabytes, respectively. Moreover, efficient utilization of caches and memory bandwidth is crucial for performance, but is notoriously hard to achieve for graph processing.

• Distributed-memory clusters comprise multiple (shared-memory processor) nodes, each with its own local main memory, that are connected via an external network as shown in figure 1.1. They provide expandable compute and memory resources that can be scaled by adding nodes to the cluster. Hence, they are a good fit for complex analytics that have high computational and space requirements. However, they also present starkly different challenges than shared-memory systems. In particular, inter-node communication in distributed-memory systems is expensive and explicit. Therefore, information generated or stored on a node is not readily accessible to other nodes.

Figure 1.1: Conceptual illustration of parallel computing platforms used in this thesis. A shared-memory system may have additional levels of private or shared caches.

Research on graph processing is ongoing on several other platforms including GPUs, FPGAs and ASICs [141, 114, 133, 56, 51]. However, FPGAs and ASICs pose considerable implementation challenges and are not suitable for complex individual analytics such as the ones we target in chapters 3 and 4. GPUs provide immense computation capability, but have limited memory, which restricts the size of graphs that can be loaded on a single GPU. Moreover, many of these platforms are conceptually based on shared-memory multiprocessor systems. Hence, we believe that our innovations with respect to access locality, communication and synchronization will also benefit implementations on these platforms.

1.3 Graph Abstraction

A graph models a system of objects and pairwise connections between them. Formally, a graph G(V, E) is a mathematical structure of a vertex set V and edges E, such that each edge e = (u, v) ∈ E represents a pair of connected vertices. The neighborhood of a vertex v is denoted by N(v), and the number of incident edges of a vertex, known as its degree, is denoted as d_v = |N(v)|. For example, in figure 1.2, the degree of v_2 and v_4 is 3 and that of all other vertices is 2.

Figure 1.2: An undirected graph of 5 vertices and 6 edges with labels denoting edge weights.

The edges may be bidirectional or unidirectional depending on whether the graph is undirected or directed, respectively. For example, the friendship network of Facebook is undirected whereas the follower network of Twitter is directed.
Additionally, weighted graphs have numeric weights associated with the edges (figure 1.2) to represent tie strength or distance between the connected pair of vertices.

1.3.1 Graph Topology

The structure of a graph workload can significantly impact the performance of graph algorithms. We broadly categorize graphs into spatial and scale-free networks.

Spatial graphs are typically derived from geometric objects with vertices and edges embedded in a physical space, such as polygon meshes for 3D graphic rendering and transportation maps. Since edges exist among spatially close vertices, these networks are typically low-dimensional structures with a near uniform degree distribution. For example, the urban road network shown in figure 1.3 is completely embedded in a 2D space. Consequently, these graphs can be easily partitioned into subgraphs along the few dimensions in which they are laid out.

Figure 1.3: Graphs with different topological characteristics: (a) road networks exemplify a spatial graph; (b) social networks exemplify a scale-free graph.

On the contrary, scale-free graphs are complex high-dimensional structures with a power law degree distribution, i.e., the fraction of vertices with degree k is P(k) ∝ k^(-γ) for some parameter γ, typically in the range 2 < γ < 3. These graphs commonly arise in social networks, web crawls and bioinformatics domains, which produce the biggest real-life datasets with several billions of edges. They typically exhibit small diameter and are difficult to partition because of large cut sets. Moreover, the neighbors of a vertex are scattered in a high-dimensional space, leading to very irregular memory access patterns during information propagation along a vertex's edges. The power law degree distribution is highly skewed, with a large fraction of edges concentrated on few vertices, which further introduces load imbalance during parallel processing. For example, in the social network of figure 1.3, the large vertices have very high degrees whereas the tiny vertices on the fringe have very few incident edges. This thesis will primarily focus on scale-free graphs as they are of great research interest due to their massive size and the challenges that are not present in spatial networks.

Independent of the above classification, there exists a special category called bipartite graphs, whose vertices can be divided into two disjoint sets such that any edge connects a vertex from one set to a vertex in the other. Mathematically, we denote a bipartite graph as G(W = (U, V), E) with vertex sets U and V such that an edge e = (u, v) ∈ E only if u ∈ U and v ∈ V. These graphs commonly occur in the form of consumer-product purchase networks as shown in figure 1.4, author-article networks in research databases, user-rating networks of review aggregators, group memberships on discussion forums, etc. The bipartite nature of a graph may dictate the choice of analytics. For instance, bipartite graphs do not have triangles and hence, quadrangle/butterfly/2,2-biclique mining is used instead to find dense regions of interest.

Figure 1.4: Bipartite graph showing purchase history of consumers on an e-commerce website.

1.3.2 Graph Representation

A graph G(V, E) can be represented using different formats based on space constraints and algorithm requirements. It is worth mentioning that almost all of the large graphs encountered in real-world systems are sparse, i.e., the number of edges is much smaller than the number of vertex pairs.
For example, a Facebook user only has a few thousand friends out of the potential billions of users that they can connect with. Hence, this thesis restricts the discussion to storage formats relevant for sparse graphs.

1.3.2.1 Adjacency List

An adjacency list is an array of linked lists such that the i-th list contains the neighbors of vertex v_i ∈ V. It consumes O(|V| + |E|) space, which is theoretically optimal, and allows quick access to a vertex's neighborhood with O(1) time per neighbor. The insert and delete functionality of linked lists facilitates dynamic updates to the adjacencies of vertices. However, traversing the neighborhood of a vertex requires pointer jumping through its list. The resulting access patterns fail to utilize the caches and DRAM bandwidth and are inefficient in practice.

1.3.2.2 Adjacency Matrix

An adjacency matrix A ∈ R^(|V|×|V|) contains a non-zero entry A_ij if and only if there is an edge from vertex v_i to vertex v_j. In other words, the row and column indices and the value of a non-zero matrix element A_ij represent the source vertex v_i, destination vertex v_j and weight of the edge (v_i, v_j), respectively. Thus, non-zero elements of row i and column i indicate out-edges and in-edges of vertex v_i, respectively. Note that A is symmetric for an undirected graph. Clearly, the number of non-zero elements in the matrix is |E|, and most of the elements will be zero for sparse graphs. Further, storing all matrix entries requires O(|V|^2) space, which is infeasible for large graphs with hundreds of millions of vertices. Hence, sparse matrix formats are used to store the adjacency matrix of a graph.

Coordinate Format (COO): This format stores an array of tuples {v_i, v_j, A_ij} corresponding to all non-zero matrix entries, as shown in figure 1.5. This allows fine-granularity parallelism at the level of individual edges. Moreover, tuples for the entire graph can be traversed in locality-preserving orders to optimize cache data reuse [137, 83]. However, accessing the neighborhood of a particular vertex using the COO format may require traversal of the entire graph in the worst case. Hence, it is not suitable for algorithms such as breadth-first search, where every iteration typically explores only a subset of vertices.

Compressed Sparse Row (CSR): This format stores the adjacency matrix in three arrays (figure 1.5) - row offsets (RA[]), column indices (EA[]) and the non-zero matrix elements (EW[]). All non-zero entries in the matrix are sorted by row index. The column indices and values of these sorted entries are stored in the EA[] and EW[] arrays, respectively. The RA[] array is of length |V| + 1 and stores offsets pointing to where a row starts in the EA[] and EW[] arrays. For directed graphs, the Compressed Sparse Column (CSC) format similarly stores the adjacency matrix entries sorted on column indices and provides direct access to the in-edges of each vertex.

Figure 1.5: Sparse storage formats for adjacency matrix representation of the graph in figure 1.2.

A major corollary of having an offset array is that all out-edges of a vertex v_i can be traversed by reading only locations {RA[i], RA[i]+1, ..., RA[i+1]-1} from the EA[] and EW[] arrays. This allows efficient execution of algorithms such as BFS that access the neighborhood of only a subset of vertices at a given time. Moreover, unlike adjacency lists, the edges of a vertex are stored at consecutive array locations and can be read with performance-friendly sequential memory accesses. For directed graphs, CSR is also space-efficient as row indices (source vertex indices) are not repeated in all non-zero elements (edges) of a row, unlike the COO format. Therefore, CSR (and CSC) is the most popular format used in graph processing frameworks [110, 92, 51, 70, 140, 81, 57] and is the format of choice in this thesis as well.
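As a concrete illustration of the CSR layout described above, the following minimal C++ sketch builds the RA[], EA[] and EW[] arrays from a weighted edge list; the Edge/CSR types and the build_csr function are hypothetical helpers for this example rather than code from any of the frameworks cited above.

#include <cstdint>
#include <vector>

struct Edge { uint32_t src, dst; float w; };

// CSR arrays as described above: RA holds row offsets, EA holds column
// indices (neighbor IDs) and EW holds the corresponding edge weights.
struct CSR {
    std::vector<uint64_t> RA;   // length |V| + 1
    std::vector<uint32_t> EA;   // length |E|
    std::vector<float>    EW;   // length |E|
};

CSR build_csr(uint32_t num_vertices, const std::vector<Edge>& edges) {
    CSR g;
    g.RA.assign(num_vertices + 1, 0);
    g.EA.resize(edges.size());
    g.EW.resize(edges.size());

    // 1. Count the out-degree of every vertex.
    for (const Edge& e : edges) g.RA[e.src + 1]++;

    // 2. Prefix sum turns the degree counts into row offsets.
    for (uint32_t v = 0; v < num_vertices; ++v) g.RA[v + 1] += g.RA[v];

    // 3. Scatter each edge into its row; cursor tracks the next free slot per row.
    std::vector<uint64_t> cursor(g.RA.begin(), g.RA.end() - 1);
    for (const Edge& e : edges) {
        uint64_t pos = cursor[e.src]++;
        g.EA[pos] = e.dst;
        g.EW[pos] = e.w;
    }
    return g;
}

// Traversing N(v_i) then touches only the contiguous range RA[i] .. RA[i+1]-1:
//   for (uint64_t j = g.RA[i]; j < g.RA[i + 1]; ++j) { /* use g.EA[j], g.EW[j] */ }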
CSR can be visualized as an adjacency list representation with all lists packed in a dense array. One of the drawbacks of such packing is that dynamic updates to the graph are challenging to accommodate. However, recent advancements in graph data structures provide support for high throughput updates while maintaining a CSR-like interface for efficient application execution [51, 24, 35]. A detailed discussion of dynamic graph data structures is out of the scope of this thesis.

1.4 Graph Algorithms in this Thesis

1.4.1 Simple Neighborhood Propagation Algorithms

In a real-world network, the most basic function is the flow of data or traversal between connected objects, for example, sharing a post with social media friends, web browsing via hyperlinks, or traffic movement on roads. Likewise, the most basic operation in graph algorithms is information propagation along the edges of a graph. The first category of algorithms studied in this thesis, which we refer to as Neighborhood Propagation algorithms, is based on this fundamental operation. They are characterized by iterative information exchange between vertices and their neighbors, and local (typically lightweight) computation on the information received by vertices, as shown in figure 1.6. The information consists of the vertex state being computed (for example, distance from the root vertex in shortest path computation) and the iterations continue until convergence.

Figure 1.6: Graphical illustration of iterative communication and computation patterns of Neighborhood Propagation algorithms.

Neighborhood Propagation algorithms are fundamental kernels and building blocks in graph analysis. Breadth-First Search and Single Source Shortest Path (SSSP) are used for distance computations and are a part of the famous Graph500 benchmark [87]; Pagerank is a popular node ranking algorithm; Label Propagation is used for various applications such as connectivity and community detection [142]. Apart from their intrinsic utility, these algorithms also represent the computation and communication patterns incurred during graph processing in general, and thus act as standard benchmark kernels [87, 16].

Research on these algorithms has closely reflected their popularity, with several frameworks in the literature supporting them on various platforms [81, 57, 22, 110, 120, 92, 139, 133]. A large fraction of existing research is focused on shared-memory platforms because of their low communication overhead compared to distributed systems [83]. However, thread-level parallelism for graph algorithms utilizes synchronization primitives such as atomics and locks that can decrease scalability. Furthermore, graph algorithms are characterized by a small computation to communication ratio that makes it challenging to efficiently utilize the resources even on a single machine [79].

The conventional and most intuitive Vertex-centric programming model executes the communication and computation from the point of view of each vertex, i.e.,
each vertex concurrently probes its neighbors and updates their states or its own state. This has emerged as the dominant approach for graph processing [110, 139, 36, 92]. However, vertex-centric programming incurs random memory accesses which negatively impact cache and bandwidth utilization and, in turn, bottleneck the overall performance. Some frameworks [101, 143] have adopted optimized Edge-centric programming models to improve access locality. However, these programming models sacrifice work-efficiency and can significantly underperform an efficient serial algorithm if only a small fraction of the graph is traversed in each iteration.

1.4.2 Complex Graph Analytics

The second half of this thesis proposes novel parallel algorithms for complex analytics that exhibit high computational complexity but mine a lot of information from a given graph. For example, Hierarchical Hub Labeling (discussed later in chapter 3) creates an index for extremely fast all-pairs shortest distance querying. While sequential algorithms for such analytics have been well studied, the growth in graph size is now driving the research on parallel algorithms. The DARPA HIVE program also involves several complex workloads and inspires the problems studied in this thesis [123].

We particularly focus on analytics characterized by several instances of a custom neighborhood propagation algorithm with dependencies between the instances, which renders parallelization challenging. For example, existing hub labeling algorithms construct custom shortest path trees from all vertices such that every tree depends on the partial indices obtained from previous trees [9], as shown in figure 1.7. The two such case studies considered in this thesis are Hierarchical Hub Labeling and Bipartite Graph Decomposition, which are discussed in detail in chapters 3 and 4, respectively.

Figure 1.7: Dataflow graph of existing Hierarchical Hub Labeling algorithms typifies the complex analytics studied in this thesis. Edges from the output of a tree (denoted Op_i) to subsequent trees indicate sequential dependencies.

The conventional approach to accelerate such analytics is to exploit the limited parallelism within each instance (for example, shortest path tree computation for hub labeling in figure 1.7) using an off-the-shelf graph processing framework. However, this approach requires heavy synchronization and communication for exchanging dependency information, which bottlenecks the parallel performance.

1.5 Thesis Statement

In this thesis, we highlight the importance of exploiting bottleneck-alleviating levels of parallelism to achieve high performance graph algorithms. The hypothesis of this thesis is as follows:

Application-specific granularity for task partitioning can improve the scalability and efficiency of parallel graph analytics.

We test this hypothesis by developing parallel solutions for the different categories of graph analytics described in section 1.4. Our approach is to first understand the bottlenecks specific to the application and the target platform. Next, we identify a suitable level of parallelism in the application and a corresponding task partitioning that would alleviate the bottleneck. Finally, we design novel algorithms and optimizations to exploit parallelism across these tasks at the identified level. The algorithms thus developed inherently avoid the bottlenecks that are otherwise present in conventional approaches.
1.6 Contributions

The specific contributions of this thesis can be summarized as follows:

• Cache and memory bandwidth efficient graph processing framework: We develop the Graph Processing Over Partitions (GPOP) framework for shared-memory multicore platforms that improves cache performance and achieves high DRAM bandwidth for graph analytics [70]. GPOP provides an easy-to-program set of APIs and enables work-efficient parallel implementations of several algorithms including Pagerank, Breadth First Search, Connected Components, Single Source Shortest Path and Local Clustering.

GPOP utilizes the partition-centric model [71] that abstracts neighborhood propagation as communication between disjoint cacheable vertex subsets called partitions. This abstraction drastically increases the spatio-temporal locality in the memory access patterns generated by graph algorithms. GPOP further employs novel data layouts and optimizations to reduce DRAM communication and accelerate convergence of neighborhood propagation algorithms.

Empirical evaluation on several large real-world and synthetic datasets shows that, compared to state-of-the-art graph processing frameworks, GPOP reduces execution time and cache misses by up to 3.6× and 5.5×, respectively. The proposed optimizations for efficient and asynchronous communication enable up to 7× speedup.

• Scalable and Efficient Hub Labeling algorithms: In this thesis, we develop the first scalable shared-memory and distributed-memory parallel algorithms that output the minimal hub labeling for a given weighted graph and vertex ranking order. Moreover, our distributed-memory approach is the first to use the memory of multiple cluster nodes in a collaborative fashion to enable completely in-memory processing of large graphs whose labeling does not fit in the main memory of a single node. For instance, we can process the Livejournal graph with > 100 GB labeling size on a cluster with only 64 GB DRAM per node.

Our major contribution comes in the form of PLaNT - a new embarrassingly parallel algorithm for hub labeling that eschews all label dependencies to provide extreme scalability [69]. Particularly for distributed-memory systems, PLaNT completely avoids inter-node communication and synchronization at the cost of extra computation. We further extend it to a Hybrid algorithm which achieves the scalability of PLaNT and the computational efficiency of existing pruning-based algorithms.

We thoroughly evaluate the proposed algorithms on a 64-node cluster and several real-world graphs. The Hybrid algorithm running on this cluster exhibits up to 9.5× speedup over single-node shared-memory parallel execution, and more than an order of magnitude speedup over the state-of-the-art distributed-memory algorithm. We also show that it can label some large graphs in a few minutes that the sequential algorithms cannot process even in multiple days.

• Parallel algorithms for Bipartite Graph Decomposition: Decomposition analytics iteratively peel vertices or edges to mine a hierarchy of dense regions in a bipartite graph. In this thesis, we propose a generic two-phased peeling approach for bipartite graph decomposition which first divides all levels of the hierarchy into a few partitions, and then concurrently constructs the individual levels in each partition [72]. Thus, it parallelizes the workload in multiple levels to drastically reduce the amount of thread synchronization and, in turn, improve parallel scalability.
To the best of our knowledge, this is the first approach to exploit parallelism across the levels of a graph decomposition hierarchy. We further develop novel optimizations that are highly effective under the two-phased peeling approach and dramatically reduce the computational workload. These optimizations are crucial as they make it feasible to decompose graphs with several trillion butterflies/2,2-bicliques and hundreds of trillions of wedges on a single multi-core server.

We thoroughly assess the proposed approach on 12 large real-world bipartite graphs and against several baselines. We show that it significantly expands the limits of current practice by decomposing some of the largest publicly available datasets in a few minutes, which existing algorithms cannot decompose in multiple days. Compared to the state-of-the-art, it reduces synchronization and execution time by up to four and two orders of magnitude, respectively.

1.7 Thesis Organization

The rest of the thesis is organized as follows:

• Chapter 2 describes the design and implementation of our graph processing framework called GPOP. It shows the implementation of five neighborhood propagation algorithms using GPOP's high-level APIs and evaluates their performance.
• Chapter 3 delineates our novel shared-memory and distributed-memory parallel algorithms for Hierarchical Hub Labeling. It also motivates and describes optimizations for label construction and label querying on parallel platforms.
• Chapter 4 illustrates our novel two-phased methodology for shared-memory parallel peeling of arbitrary entities in bipartite graphs. It further describes efficient adaptations of this approach for vertex and edge peeling, and optimizations to improve computational efficiency.
• Lastly, chapter 5 concludes the thesis and lists some important directions for future research in the context of the work done in this thesis.

Chapter 2: Cache and Memory-efficient Graph Processing

2.1 Background

As described in section 1.4.1, neighborhood propagation algorithms iteratively propagate information between vertices and their neighborhoods until a halting criterion is met. In several algorithms such as BFS, each iteration only requires a subset of vertices V_a, called active vertices or the frontier, to participate in communication with their neighbors. Vertices in the frontier communicate updates along their edges, also called active edges, performing O(Σ_{v ∈ V_a} d_v) work. A parallel implementation is considered theoretically work-efficient if the amount of work done in an iteration is proportional to the number of active edges.

A Vertex-Centric (VC) programming model treats each vertex as a unit of computation and information dissemination (or gathering). An iteration of neighborhood propagation in the VC model can use either of the following directions:

Push VC – Here, vertices in set V_a are allocated to worker threads, which propagate updates from all v ∈ V_a and modify the attributes of their respective neighbors N_v. Synchronization primitives such as atomics are used to prevent write conflicts.

Pull VC – Here, worker threads iterate over the entire set V or the set of neighbors of all vertices in V_a. For each vertex v, the assigned thread reads updates from its neighbors and modifies the private state of v, requiring no synchronization. However, since edges from active and inactive vertices are interleaved, all edges to the vertices in the neighborhood of the frontier are traversed.

Algorithm 1: Vertex-centric push and pull methods
  G(V, E): input graph
  val[v]: attribute of vertex v (e.g., distance in SSSP, parent in BFS)
  1: function push(G(V, E), V_a)
  2:     for each v ∈ V_a do in parallel
  3:         for each u ∈ N_v
  4:             val[u] = atomic_update_func(val[v], val[u])
  5: function pull(G(V, E))
  6:     for each v ∈ V do in parallel
  7:         for each u ∈ N(v)
  8:             val[v] = update_func(val[v], val[u])
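For concreteness, the sketch below is a minimal C++/OpenMP rendering of the two directions in Algorithm 1 over the CSR arrays of section 1.3.2, instantiated for an SSSP-style distance-minimization update; the function names, the OpenMP pragmas and the atomic-min loop are illustrative assumptions, not code from the frameworks cited above.

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <vector>

// Push direction: active vertices update their neighbors' attributes, so
// concurrent writes are resolved with an atomic min (compare-exchange loop).
void push_step(const std::vector<uint64_t>& RA, const std::vector<uint32_t>& EA,
               const std::vector<float>& EW, const std::vector<uint32_t>& frontier,
               std::vector<std::atomic<float>>& dist) {
    #pragma omp parallel for schedule(dynamic, 64)
    for (size_t i = 0; i < frontier.size(); ++i) {
        uint32_t v = frontier[i];
        float dv = dist[v].load(std::memory_order_relaxed);
        for (uint64_t e = RA[v]; e < RA[v + 1]; ++e) {
            uint32_t u = EA[e];
            float candidate = dv + EW[e];
            float current = dist[u].load(std::memory_order_relaxed);
            // Retry until the stored value is no larger, or the CAS succeeds.
            while (candidate < current &&
                   !dist[u].compare_exchange_weak(current, candidate,
                                                  std::memory_order_relaxed)) {}
        }
    }
}

// Pull direction: every vertex reads its neighbors' previous values and writes
// only its own attribute, so no synchronization is needed, but all edges are read.
void pull_step(const std::vector<uint64_t>& RA, const std::vector<uint32_t>& EA,
               const std::vector<float>& EW, const std::vector<float>& dist_prev,
               std::vector<float>& dist_next) {
    uint32_t n = (uint32_t)(RA.size() - 1);
    #pragma omp parallel for schedule(dynamic, 64)
    for (uint32_t v = 0; v < n; ++v) {
        float best = dist_prev[v];
        for (uint64_t e = RA[v]; e < RA[v + 1]; ++e)
            best = std::min(best, dist_prev[EA[e]] + EW[e]);
        dist_next[v] = best;
    }
}

The pull variant avoids atomics by writing only the thread's own vertex, but it touches every edge regardless of the frontier size, which is precisely the work-inefficiency noted above when the frontier is small.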
Apart from the graph topology, these algorithms use an array of length |V| to store the vertex states or attributes being computed. The sparse graph formats CSR and CSC (section 1.3.2) allow efficient retrieval of the edges of a given vertex with sequential accesses [113]. However, acquiring and updating the values of neighboring vertices requires fine-grained random accesses to the attribute array, because neighbors' indices (stored in the EA[] array of the CSR format) can be scattered in the range 0 to |V| - 1, as shown in figure 2.2. For large graphs, such accesses increase cache misses and DRAM traffic, becoming the bottleneck in graph processing. As shown in figure 2.1, we observed that > 75% of the total DRAM traffic in a PageRank iteration (hand-optimized code) is generated by the random vertex value accesses.

Figure 2.1: DRAM traffic generated by random accesses to vertex data in PageRank.

Figure 2.2: Read-writes to vertex attributes or states generate random memory access patterns.

To improve the locality of memory accesses, Edge-Centric (EC) programming and blocking techniques [141, 101, 23, 143] are also used. In EC, vertex values are written on edges in a semi-sorted manner (analogous to bucket sort) to increase spatial locality and cache-line utilization. Blocking restricts the range of vertices accessed by each thread and thus improves cache data reuse. However, these methods still incur coarse-grained random accesses as cache lines are dumped to the memory at non-consecutive locations, as determined by the order of edge traversal. This limits the bandwidth of CPU-DRAM communication. Further, the Gather-Apply-Scatter (GAS) methodology used in blocking techniques traverses the graph multiple times within an iteration, rendering them inherently sub-optimal. Lastly, the improved locality in these methods comes at the cost of fully synchronous vertex value propagation, which reduces the convergence rate for certain graph algorithms, such as SSSP.

2.2 Graph Processing Over Partitions Framework Overview

To alleviate the issues associated with conventional programming models, we propose a novel Graph Processing over Partitions (GPOP) framework that enhances cache and memory performance, while guaranteeing theoretically efficient execution of a variety of graph algorithms. GPOP utilizes a novel Partition-centric paradigm that considers a cacheable subset of vertices called a partition (instead of a single vertex or edge) as the basic unit of computation. The graph is perceived as a set of interconnections between these partitions, and each iteration is implemented in two phases computed over all partitions:

1. Inter-partition communication (Scatter): In this phase, the edges of active vertices in a partition are streamed in and vertex data is communicated to other partitions containing the neighbors, in the form of messages (figure 2.3). For directed graphs, messages are addressed to out-neighbors of active vertices.
2. Intra-partition updates (Gather): In this phase, messages received by a partition are streamed in and the vertex attributes are updated as per a user-defined function.
Figure 2.3: Conceptual demonstration of the two phases in GPOP.

Both Scatter and Gather phases also allow user-controlled manipulation of the frontier for the next iteration and of the attributes of vertices in the current frontier. The scatter phase allows selective continuation of some vertices from the current frontier into the next iteration's frontier. The gather phase activates an additional set of vertices based on the updates to their attributes. This provides greater flexibility for implementing graph algorithms compared to existing frameworks that create a frontier only based on the vertex data updates [110, 120, 139]. Algorithm 2 describes the typical execution flow for an iteration of neighborhood propagation in GPOP.

Algorithm 2: An iteration of a graph algorithm in GPOP
  N^P_v: set of partitions containing neighbors of v
  V_p: set of vertices in partition p
  val[v]: attribute of vertex v (e.g., distance in SSSP, parent in BFS)
  1:  for each partition p containing an active vertex do in parallel        ▷ Scatter
  2:      for each active vertex v ∈ V_p ∩ V_a
  3:          for each partition q ∈ N^P_v
  4:              insert message {val[v], V_q ∩ N_v} in bin(p, q)
  5:      for each active vertex v ∈ V_p ∩ V_a                               ▷ frontier continuation
  6:          val[v] = update_user_func_s(val[v])
  7:          if active_user_func_s(val[v]) = false then V_a = V_a \ v
  8:  for each partition p that received a message do in parallel            ▷ Gather
  9:      for each msg ∈ bin(:, p)
  10:         V_msg ← list of destination vertices in msg
  11:         val_msg ← update value in msg
  12:         for all v ∈ V_msg do
  13:             val[v] = user_apply_func(val_msg, val[v])
  14:             V_a = V_a ∪ v                                              ▷ activate vertices
  15:     for all v ∈ V_a ∩ V_p do                                           ▷ filter frontier
  16:         val[v] = update_user_func_g(val[v])
  17:         if active_user_func_g(val[v]) = false then V_a = V_a \ v

Since the basic unit of parallelism is a partition, a thread exclusively owns the partitions assigned to it in either phase. Thus, updates to vertex attributes in the gather phase can be performed without the use of synchronization primitives such as atomics. During the scatter phase, messages between partitions are written at pre-designated memory spaces called bins. Each pair of source and destination partitions (p, q) is assigned a disjoint space in memory, denoted by bin(p, q), for message transmission, allowing lock-free communication.

Further, each thread in GPOP processes one partition at a time. The vertex data of the partition is cached and reused, while edges and messages are only accessed once from the DRAM in a streaming fashion. This ensures high temporal locality, since a partition's vertex data is designed to fit in the private cache of a core. At the same time, DRAM accesses enjoy high spatial locality as messages are streamed to consecutive locations in the bins. GPOP utilizes novel data layouts to unlock the desired cache and bandwidth friendly access patterns. Additionally, it incorporates several optimizations to reduce DRAM communication, ensure work-efficient execution and boost the convergence of graph algorithms. In the subsequent sections, we provide a detailed discussion of GPOP's data layouts and optimizations.
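To make the control flow of Algorithm 2 concrete, the following deliberately simplified C++/OpenMP sketch performs one GPOP-style iteration, with per-(p, q) message vectors standing in for GPOP's bins and a min-update standing in for user_apply_func; frontier generation and the data_bin/id_bin layout are omitted, and all names here are assumptions made for the example rather than GPOP's actual interface.

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// One message: a single update value plus the destination vertices it applies to.
struct Msg { float val; std::vector<uint32_t> dst; };

// bins[p][q] holds all messages scattered from partition p to partition q, so each
// (p, q) pair is written by exactly one thread and read by exactly one thread.
using Bins = std::vector<std::vector<std::vector<Msg>>>;

void gpop_iteration(uint32_t k,                                          // number of partitions
                    const std::vector<std::vector<uint32_t>>& active,    // active vertices per partition
                    const std::vector<uint64_t>& RA, const std::vector<uint32_t>& EA,
                    std::vector<float>& val, Bins& bins) {
    uint32_t part_size = (uint32_t)((RA.size() - 1 + k - 1) / k);        // index-based partitioning

    // Scatter: stream the edges of active vertices and group neighbors by partition.
    #pragma omp parallel for schedule(dynamic)
    for (uint32_t p = 0; p < k; ++p) {
        for (uint32_t v : active[p]) {
            std::vector<std::vector<uint32_t>> by_part(k);
            for (uint64_t e = RA[v]; e < RA[v + 1]; ++e)
                by_part[EA[e] / part_size].push_back(EA[e]);
            for (uint32_t q = 0; q < k; ++q)
                if (!by_part[q].empty())
                    bins[p][q].push_back({val[v], std::move(by_part[q])});
        }
    }

    // Gather: the thread owning destination partition q applies all updates
    // addressed to it, so no atomics are needed on val[].
    #pragma omp parallel for schedule(dynamic)
    for (uint32_t q = 0; q < k; ++q) {
        for (uint32_t p = 0; p < k; ++p) {
            for (const Msg& m : bins[p][q])
                for (uint32_t u : m.dst)
                    val[u] = std::min(val[u], m.val);   // example user_apply_func
            bins[p][q].clear();
        }
    }
}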
The number of partitionsk is chosen to be the smallest number satisfying both of the following metrics – (a) Cache-eciency: To make partitions cacheable, the data forq = V k vertices should t in the private cache of a core (L2 cache in multi-core servers). (b) Load balance: Number of partitionsk should be suciently large to assist dynamic load balancing. For this purpose, we setk 4t, wheret is the number of threads. 6 5 4 3 2 1 0 7 8 0 1 2 3 4 5 6 7 8 Partitions P 0 P 1 P 2 Figure 2.4: Graphical illustration of index-based vertex partitioning in an example graph. 22 2.3.2 MessageGenerationandConsumption In the conventional vertex-centric programming model, a vertex generates individual messages for each of its neighbors. Consequently, the vertex attribute is replicated across many messages. The partition-centric abstraction of GPOP naturally reduces such replication, since a vertex generates a single message for all the neighbors in a given partition, as shown in gure 2.5. Thus, an active vertexv generates a total of N P v messages (instead ofd v =jN v j messages as in vertex-centric programming), whereN P v is the set of partitions containing neighbors ofv, which signicantly reduces the communication volume. We dene aggregation factorr p = P v2V pdv P v2V pjN P v j to quantify the eect of such message aggregation in GPOP. 6 7 8 Data Bin PR[6] PR[7] MSB 1 1 0 0 Non-redundant updates only 2 ( 2 , 0 ) ID Bin 2 0 1 2 { Propagate Updates to all neighbors Data Bin PR[6] PR[7] PR[7] PR[7] ID Bin 2 0 1 2 ( 2 , 0 ) 6 7 (a) Scatter in Vertex-centric programming (b) Scatter in GPOP 2 Figure 2.5: Messages scattered from partitionP 2 toP 0 for the graph shown in gure 2.4. GPOP avoids redundant propagation of the same update value to reduce DRAM communication. Each message in GPOP consists of a single update value from source vertex and the entire list of the source vertex’ neighbors in the destination partition. All such messages from partitionp to partitionq are stored in a disjoint space represented bybin(p;q) which comprises of two arrays - data_bin and id_bin - for storing vertex data and list of neighbors, respectively. In case of directed graphs, out-neighbors of a vertex are stored in id_bin as shown in gure 2.5. To demarcate multiple neighbor lists within an id_bin array, the most signicant bit (MSB) of the rst vertex ID in each list is set to 1. This avoids the need to communicate additional metadata and contributes to the eectiveness of message aggregation in GPOP. 23 The messages are read sequentially from bins in the Gather phase and each update is applied to the respective list of neighbors, as per a user-dened function. Update application exhibits high temporal locality and cache data reuse as it modies vertex data of a single partition which is cached. While message reading generates sequential memory access patterns, the pointer todata_bin is conditionally updated based on the MSB of the neighbor ID fetched from id_bin. The conditional updates are typically implemented using data-dependent branches which have been shown to signicantly impact the performance of graph algorithms [52]. BranchAvoidance: To avoid data-dependent branches in gather phase, instead of using a condition check over the MSB, we simply use it as the increment value for the pointer to data_bin. Thus, whenever a new neighbor list starts, the pointer to data_bin is incremented by one and the next update value is fetched. 
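A minimal sketch of this branch-avoidance technique, assuming 32-bit vertex IDs whose most significant bit marks the first entry of each neighbor list; the array names id_bin and data_bin follow the text, and the PageRank-style accumulation stands in for the user-defined update:

#include <cstddef>
#include <cstdint>

// Apply streamed updates without data-dependent branches: the MSB of each
// neighbor ID is used directly as the increment for the data_bin pointer.
void gatherBin(const uint32_t* id_bin, std::size_t numIds,
               const float* data_bin, float* val) {
  std::size_t dataPtr = static_cast<std::size_t>(-1);  // first MSB=1 advances it to 0
  for (std::size_t i = 0; i < numIds; ++i) {
    uint32_t id = id_bin[i];
    dataPtr    += id >> 31;                  // +1 only at the start of a new neighbor list
    uint32_t v  = id & 0x7FFFFFFFu;          // clear the MSB to recover the vertex ID
    val[v]     += data_bin[dataPtr];         // user-defined update
  }
}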
This allows the gather phase to eciently utilize the sequential access patterns and achieve high DRAM bandwidth. 2.3.3 MessageCommunication Each partitionp uses one of the two communication modes to propagate messages in Scatter phase: Source-centric(SC)mode: The number of active edges of a partitionp is given byE p a = P v2V p \Va d v i.e. the sum of degrees of active vertices inV p . WhenE p a is small, the thread assigned for scatteringp reads and processes only active vertices inp. In this mode, all messages from a given active vertex to other partitions are written into the corresponding bins before processing the next vertex. Note that successive messages to any partitionq from active vertices inp are written to contiguous ad- dresses inbin(p;q), labeled as insertion points of bins. The small number of partitions and hence, insertion points, ensures good spatial locality and ecient cache line utilization. SC mode is also work-optimal and generatesr p E p a messages in expectation. However, a thread will switch partitions (bins) being written into which can negatively impact the sustained memory bandwidth and overall performance. Moreover, when 24 the frontier is large, the entire graph may be traversed once each in the Scatter and Gather phases, resulting in high DRAM communication volume. This motivates an alternate communication mode for GPOP. Algorithm3 Scattering partitionp in Partition-Centric mode N p (q) vertices in partitionp that have neighbors in partitionq 1: forallq2P do 2: forallv2N p (q)do 3: insertval[v] inbin(p;q) Partition-centric(PC)mode: This is a highly performant mode of communication that scatters messages from all vertices inV p . To ensure high DRAM bandwidth, messages fromV p are scattered in a partition- centric order where a thread generates and writes all messages tobin(p;q) before moving tobin(p;q + 1), as shown in algorithm 3. This ensures completely sequential DRAM accesses and high sustained memory bandwidth. Additionally, the PC mode drastically reduces DRAM communication by allowing redundant messages to be generated. The desired message generation order and communication reduction is achieved by a novel data layout and optimizations described below: • Adjacency Data Reuse! PC mode utilizes the fact that it scatters messages from all verticesV p in a partitionp, to avoid repeated writing of adjacency information across multiple iterations. The key idea here is that if we use a static graph data structure, the order in which messages are generated always stays the same. Therefore, the adjacency information written once in id_bin can be reused in all iterations where PC mode is used. Thus, messages generated in PC mode only contain vertex data updates as shown in algorithm 3, whereas adjacency information is only written during pre- processing. This dramatically reduces the amount of data written by GPOP. However, this does not avoid traversal of entire graph topology in Scatter phase which we will address with our next optimization. 25 • Partition-Vertex Graph (PVG) layout! We exploit the fact that once adjacencies are written in the bins, the only information for PC mode scatter is the connectivity between vertices and the partitions containing their neighbors. Therefore, edges going from a vertex to all of its neighbors in a given partition can be compressed into a single edge whose new destination is the corresponding partition number. 
This gives rise to a bipartite graphG 0 (W = (V;P );E 0 ), with disjoint vertex setsV and P , whereP =p 0 ;:::;p k1 represents the set of partitions. Such a layout prevents redundant edge traversal in Scatter phase, which drastically reduces the DRAM communication. Note that the edges in PVG are directed from vertices inV towards partitions inP . However, a vertex-centric traversal of bipartite graphG 0 will not generate messages in destination partition order, which is required for high bandwidth sequential accesses. To facilitate the desired order of message generation for each partitionp, we compute a bipartite subgraph induced by the vertices V p V of each partitionp, as shown in gure 2.6. The collection of these subgraphs is termed the Partition-Vertex Graph (PVG) layout. P 2 P 2 P 1 P 0 P 1 2 3 6 7 Graph between and vertices in 0 Graph between and vertices in 1 Graph between and vertices in 2 Figure 2.6: Partition-Vertex Graph layout for the example graph shown in gure 2.4. Now, each of these subgraphs can be stored in CSC format (section 1.3.2) allowing sequential packing to all edges destined to a given partitionq, from vertices within a given partitionp. Note that this approach is feasible because of the small number of partitions. This is because the size of CSC format for bipartite subgraph of a partitionp toO k + P v2V pd v , and hence, the entire PVG layout 26 isO k 2 +jEj . Typicallyk 2 E, and hence, reading column osets in PVG layout imposes a negligible communication overhead. The drawback of PC mode is that it scatters messages from all the vertices in a partition. This could be highly inecient if only few vertices are active and have any useful information to send to their neighbors. Eectively, the SC and PC modes represent a trade o between work eciency and highly optimized communication with high DRAM bandwidth. To achieve the best of both worlds, GPOP chooses the lower cost mode individually for each partition in every iteration. PerformanceModel: GPOP employs an analytical cost model to choose the optimal communication mode for each partition. Lets i denote the size of a vertex index ands v =sizeof(val(v)) denote the size of a vertex attribute. For a given partitionp,V p a andE p a denote the set of active vertices and edges, respectively, andr p denotes the aggregation factor as described in section 2.3.2. The expected communication volume generated by messages from partp in SC mode is given by: (V p a +E p a )s i + 2(r p E p a s v +E p a s i ) In the above equation, the rst term represents Scatter phase trac for readingV p a osets (CSR) andE p a edges. These generate messages containing an expectedr p E p a data values andE p a neighbors’ IDs. These messages are written once in Scatter phase and read once in the succeeding Gather phase, resulting in a factor of 2. The PC mode readsr p E p edges andk osets from the PNG layout and writesr p E p messages containing only vertex attributes. The Gather phase, however, reads both the vertex data and the neighbors’ IDs from these messages. Therefore, total DRAM communication for messages scattered by PC mode is given by: r p E p s i +ks i + 2rE p s v +E p s i =E p ((r p + 1)s i + 2rs v ) +ks i 27 Since graph algorithms are mostly communication bound, we compute the execution time as a ratio of communication volume and bandwidth achieved by these modes (BW SC orBW PC ). 
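These per-partition traffic estimates can be computed at runtime with a few arithmetic operations. The sketch below (hypothetical names; sizes in bytes) mirrors the two expressions above and applies the selection rule that is stated formally in equation 2.1 next; it is an illustration of the cost model, not GPOP's implementation.

#include <cstdint>

struct PartitionStats {
  double activeVerts;   // |V_a^p|
  double activeEdges;   // E_a^p: sum of degrees of active vertices
  double totalEdges;    // E^p
  double aggFactor;     // r_p (section 2.3.2)
};

// Expected DRAM traffic (bytes) of SC-mode scatter + gather for one partition
double scTraffic(const PartitionStats& s, double si, double sv) {
  return (s.activeVerts + s.activeEdges) * si +
         2.0 * (s.aggFactor * s.activeEdges * sv + s.activeEdges * si);
}

// Expected DRAM traffic (bytes) of PC-mode scatter + gather for one partition
double pcTraffic(const PartitionStats& s, double si, double sv, double k) {
  return s.totalEdges * ((s.aggFactor + 1.0) * si + 2.0 * s.aggFactor * sv) + k * si;
}

// PC mode is chosen when its estimated time (traffic / bandwidth) is no larger.
// bwRatio = BW_PC / BW_SC is a tunable parameter.
bool choosePC(const PartitionStats& s, double si, double sv, double k, double bwRatio) {
  return pcTraffic(s, si, sv, k) <= bwRatio * scTraffic(s, si, sv);
}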
GPOP chooses PC mode for scattering a part only if: E p ((r p + 1)s i + 2rs v ) +ks i BW PC 2r p E p a s v + 3E p a s i +V p a s i BW SC (2.1) The ratio BW PC BW SC is a user congurable parameter which is set to 2 by default. Note that although PC mode generates more messages, it is selected only if it does work within a constant factor of optimal. This ensures that implementations in GPOP are work ecient and simultaneously achieve high DRAM bandwidth as well. Note that for communication mode decision, onlyV p a andE p a need to be computed at runtime as the other terms remain constant across all iterations. This requires parallel accumulation of the degrees of active vertices and poses negligible overhead. 2.3.4 2-levelActiveList If Gather is unaware of the bins that are empty (not written into at all in the preceding Scatter phase), it will probe all the bins performing at least(k 2 ) work in the process. Theoretically, this makes the implementation inecient since k = (V ) (cache size is constant). Practically, even if k is small (in thousands for very large graphs),k 2 can be large (in millions). This can deteriorate performance, especially when the useful work done per iteration could be very small, for instance, in the Nibble algorithm [111, 117]. We propose a hierarchical list data structure for fast retrieval of partitions and bins with useful work. At the top level, we implement a global list, that enumerates all partitions which have received at least one incoming message in the current iteration. Similarly, we create a list of partitions with at least one active vertex for use during scatter phase. At the second level, we implement a local list for each partitionp, that enumerates all the non-empty binsbin(:;p) to be gathered for a given partition. 28 2.3.5 InterleavingScatterandGatherPhases When scatter and gather phases are completely separated, vertex updates propagate synchronously with the iterations. In other words, any update to a vertex’ attribute is made visible to its neighbors only in the next iteration. This is a direct side eect of the two-phased computation which negatively aects the convergence rate for algorithms such as SSSP. To mitigate this eect, GPOP employs a unique Interleaved Scatter-Gather (ISG) feature that propagates vertex data updates asynchronously within each iteration. With ISG enabled, the Scatter phase proceeds as follows: when a thread is allocated a partitionp to scatter, before generating its outgoing messages, the thread rst gathers all the incoming messages top that it has already received from its neighbors. These messages are then processed to update the vertices in p and a new frontier is generated as previously inactive vertices can be activated. Thus, while scattering, updated values of active vertices inp are packed into the messages, instead of the stale values present at the end of the previous iteration. This interleaving of Gather operations within the Scatter phase is made possible by 1. Partition-level granularity of tasks in GPOP – a single thread scatters all the active vertices in a partition p at a time. Therefore, it can also process the messages received byp just before scattering it, as both the tasks operate on the vertex data of the same partition which is loaded into the cache once. Thus, using ISG feature does not aect the cache eciency of GPOP. 2. 
Shared-memory model of target Multicore platforms - ensures that messages written in a bin are immediately visible to the destination partition without requiring the threads to synchronize and explicitly send/receive data. Thus, within the same scatter phase, these messages can be accessed and processed. This is unlike certain distributed-memory models where processes need to synchronize (like threads at the end of Scatter phase) and explicitly exchange messages. 29 Note that such interleaving of gather operations is not possible in state-of-the-art blocking methods [23, 12], where gather phase works at a partition level granularity but scatter phase works at vertex-level granularity and dierent threads may scatter vertices of the same partition at dierent times. 2.4 ProgrammingInterface We carefully assess the program structure in GPOP to construct the right set of APIs that (a) hide all the details of parallel programming model from the user, and (b) does not restrict the functionality available to programmer and provide enough expressive power to implement a variety of applications. In each iteration, the GPOP calls four main user-dened functions that steer application development: 1. scatter_func(vertex): Called for all (active) vertices during (SC mode) Scatter; returns an algorithm- specic value to propagate to neighbors of node. 2. user_func_s(vertex): Applied to all active vertices at the end of Scatter (algorithm 2, lines 5-7). If this function returnstrue, the vertex remains active for the next iteration, thus enabling conditional continuity of frontier. For example, in parallel Nibble, a vertex stays active if its degree-normalized PageRank is greater than threshold, even if it is not updated in Gather phase. The programmer can also use this API to update vertex data before Gather phase begins. Existing frameworks put the onus of implementing such selective frontier continuity on the program- mer. By intrinsically providing such functionality, GPOP appreciably reduces both the length and the complexity of the code. 3. gather_func(val,vertex): called individually for all destination vertices in each message read from the bins during Gather phase with message value as one of the input arguments. The user can accordingly update the vertex data without worrying about atomics or locks for parallel correctness. The vertex is marked active if the function returnstrue. 30 4. user_func_g(vertex): This function applies to all vertices activated in gather phase and can be used to lter out (deactivate) some vertices. It can also be used to update the nal aggregated vertex values computed bygather_func(). For instance, in PageRank, we use this function to apply the damping factor to each vertex as shown in algorithm 5. For weighted graphs, the user can dene how the weight should be applied to the source vertex’ value, using theapply_weight() function. 2.5 TargetApplications We evaluate the performance of GPOP on ve key graph algorithms. In this section, we list ve key graph algorithms used to evaluate GPOP and also provide pseudocode for Nibble, Pagerank and BFS algorithms (algorithm 4-6) to illustrate the ease of implementation using GPOP’s interface described in section 2.4. Program initialization and iterative calls to Scatter-Gather execution are similar for all applications and therefore only shown for algorithm 4. 
Parallel Nibble algorithm (algorithm 4) computes the probability distribution of a seeded random walk, which provides an approximation to the personalized pagerank when a single seed vertex is used. Nibble is often used for strongly local clustering [117, 111], where work-eciency is a must. It is also used in graph neural networks [67, 17]. The Nibble algorithm starts by assigning a unit probability to the seed vertex and a zero probability to all other vertices (line 13). Thereafter, this probability is iteratively distributed to remaining vertices until either no active vertex remains or a desired number of iterations are reached (line 15). In each iteration, vertices in the frontier preserve half of the current probability (line 4) and distribute the remaining half uniformly among all the neighbors (line 2). Each vertex accumulates the received probability from its 31 neighbors (line 7) and is activated if its nal probability is greater than a threshold (lines 5, 10). With GPOP’s APIs, all of these operations can be represented in few lines of code as shown in algorithm 4. Algorithm4 Parallel Nibble implementation in GPOP PR probability vector for starting vertexv s MAX_ITER maximum iteration limit for Nibble algorithm activation threshold in Nibble algorithm 1: procedurescatter_func(vertexv) 2: returnPR[v]=(2d v ) 3: procedureupdate_func_s(vertexv) 4: PR[v] PR[v] 2 5: return (PR[v]d v ) 6: proceduregather_func(vertexv, value) 7: PR[v] PR[v] + 8: returntrue 9: procedureupdate_func_g(vertexv) 10: return (PR[v]d v ) 11: functionnibble(G;v s ) 12: frontier fv s g . initialize frontier 13: PR[:] 0; PR[v s ] 1 14: cnt 0 15: whilefrontier6=andcnt< MAX_ITERdo 16: frontier scatter_and_gather(G;frontier) 17: cnt cnt + 1 PageRank(algorithm5) is a vertex ranking algorithm that computes the probability of a random walker landing on a particular vertex. It is also a representative of the crucial SpMV multiplication kernel, which is used in many scientic and engineering applications [11, 126]. Similar to Nibble algorithm, in each iteration of pagerank, vertices uniformly distribute their visit probability (line 2) to their neighbors and update their own probability based on the updates received from the neighbors (line 7). Additionally a damping factor is applied to represent the random walker restarting at an arbitrary vertex in the graph (line 10). Note that we implement the topological PageRank in which all vertices are always active. Hence, update_func_s andupdate_func_g always returntrue and are used only to reinitialize PageRank values and apply damping factor, respectively, as shown in algorithm 5. 32 Algorithm5 PageRank implementation in GPOP D damping factor PR[:] 1=jVj . initialize probability vector 1: procedurescatter_func(vertexv) 2: returnPR[v]=d v 3: procedureupdate_func_s(vertexv) 4: PR[v] 0 5: returntrue 6: proceduregather_func(vertexv, value) 7: PR[v] PR[v] + 8: returntrue 9: procedureupdate_func_g(node) 10: PR[v] = (1D)=jVj +DPR[v] 11: returntrue BreadthFirstSearch(BFS,algorithm6) is used for rooted graph traversal or search, and is one of the kernels in the famoous Graph500 benchmark ([87]. Algorithm 6 nds the parent of every reachable node in the BFS tree rooted at a given vertex. Hence, each visited vertex sends its own identity to its neighbors (line 2). The Gather phase updates the parent of a vertex if it was not visited before the current iteration (lines 6-7), and activates it for propagation in next iteration. 
Since the neighborhood of active vertices of an iteration is explored in the current iteration, none of the vertices in frontier continue being active in the next iteration (line 4). Algorithm6 Breadth First Search implementation in GPOP v s root vertex parent[:] 1; parent[v s ] v s 1: procedurescatter_func(vertexv) 2: returnv 3: procedureupdate_func_s(vertexv) 4: returnfalse 5: proceduregather_func(vertexv, value) 6: ifparent[v]< 0then 7: parent[v] 8: returntrue 9: returnfalse 10: procedureupdate_func_g(vertexv) 11: returntrue 33 SingleSourceShortestPath(SSSP) nds the shortest distance to all vertices in a weighted graph from a given source vertex and is a kernel in the famous Graph500 benchmark [87]. We use Bellman Ford to compute shortest path distances of reachable vertices from a given root vertex. Implementation of Bellman Ford is similar to BFS with a few exceptions. Firstly, SSSP computes distance from the root vertex instead of a parent in the search tree. Consequently, vertices propagate their current known distance from the root aggregated with the respective edge weights, to their neighbors. Secondly, in the Gather phase, each vertex selects the minimum of received distance values, and is activated when the computed value reduces its current known distance. As a result, a vertex can be activated in the frontier of multiple iterations unlike BFS. ConnectedComponents(CC) is used to nd the set of vertices that are linked to each other by paths within the network. We use Label Propagation to compute connected components. All vertices are initialized with their own indices as the labels. In an iteration, every active vertex scatters its label to its neighbors and selects the minimum of its current label and the updates in incoming messages as its future label. If the label of a vertex changes, it becomes active in the next iteration. The procedure is repeated until all vertex labels converge and the set of vertices with common labels are declared to form a connected component. Program structure of this algorithm is similar to Bellman Ford albeit with no edge weights. 2.6 ExperimentalEvaluation 2.6.1 Setup We conduct experiments on a dual-socket server with two 18-core Intel Xeon E5-2695 v4 processors running Ubuntu 16.04. Our machine has a 256 KB private L2 cache per core, 45 MB L3 cache per socket and 1TB DRAM. The peak bandwidth of DRAM measured by Copy STREAM benchmark is 55 GBps. 34 ImplementationDetails: We set the partition size in GPOP to 256 KB, same as L2 cache size. We enable the Interleaved Scatter-Gather (ISG) feature (section 2.3.5) for CC and SSSP; remaining applications are run in a bulk-synchronous manner. For PageRank, we record our observations over 10 iterations of the algorithm. For the Nibble algorithm, we set the threshold to 10 9 . Unless explicitly stated otherwise, the results depict performance using 36 threads. Datasets: We use large real world and synthetic graph datasets to evaluate the performance of GPOP (table 2.1). Synthetic graphs are generated by RMAT graph generator with default settings (scale-free graphs). The social networks are small-world graphs with very low diameter (< 40); the hyperlink graphs have relatively much larger diameter (> 125 for uk). The uk graph is heavily pre-processed by Webgraph [19] and LLP [18] tools, and has very high locality due to optimized vertex ordering. 
In order to understand the eects of vertex ordering induced locality on performance of dierent frameworks, we also create a uk_rnd graph by randomly shuing the vertex IDs in uk. For SSSP, edges are randomly assigned weights with a uniform distribution in the range [1; logjVj]. Table 2.1: Neighborhood Propagation Algorithms: datasets for evaluation. Dataset Description #Vertices #Edges soclj [68] LiveJournal (social network) 4.84 M 68.99 M gplus [48] Google+ (social network) 28.94 M 462.99 M pld [84] Pay-Level-Domain (hyperlink graph) 42.89 M 623.06 M twitter [68] follower network (social network) 61.58 M 1468.36 M sd1 [84] Subdomain graph (hyperlink graph) 94.95 M 1937.49 M friend [68] Friendster (social network) 68.35 M 2568.14 M uk-2007 [19] UK Domain graph (hyperlink graph) 105.9 M 3738.7 M rmat<n> [25] synthetic graph 2 n M 16 2 n M Baselines: We compare the performance of GPOP against 3 most popular state-of-the-art frameworks - Ligra [110], GraphMat [120] and Galois [92]. Galois provides dierent algorithms for SSSP (topological and delta-stepping) and PageRank (topological and residual) computation. GraphMat only supports graphs 35 with< 2 31 edges and cannot process thefriend anduk graphs. For the parallel Nibble algorithm, only Ligra implementation is available [111]. Both Ligra and Galois use direction optimization for early termination of vertex neighborhood searchers in BFS [15]. This dramatically reduces the computation and traversal, resulting in huge gains in execution time even if the underlying implementations are inecient in terms of the rate of traversal. Therefore, we also report the time for Ligra without direction optimization (denoted Ligra_Push) for eective comparison of BFS. 2.6.2 Results 2.6.2.1 ComparisonwithBaselines ExecutionTime: Table 2.2 compares the execution time of dierent frameworks. Note that Galois oers implementations of multiple algorithms for some applications such as Pagerank or SSSP. For all applications other than BFS, GPOP consistently outperforms the baselines for most datasets with relatively larger speedups for large small-world networks (table 2.2). GPOP provides the fastest PageRank implementation for all datasets except uk. On large datasets twitter, sd1 and friend, it achieves up to 19, 2:3 and 3:6 speedup over Ligra, Graphmat and Galois, respectively. This is because the cache and memory bandwidth optimizations of GPOP become extremely crucial as the graphs become larger. Further, PageRank uses the high performance PC mode throughout the course of the algorithm. For BFS execution, Ligra is the fastest for most datasets due to the direction optimization. GPOP does not employ any direction optimization but shows comparable performance for large graphs sd1, friend and uk. For SSSP, GPOP is up to 2 faster than Galois even though Galois utilizes the more ecient Delta stepping algorithm instead of the Bellman Ford. Compared to GraphMat and Ligra, GPOP is up to 6:5 and 5 faster, respectively. For Connected Components also, GPOP outperforms other frameworks with 36 signicantly large speedups for twitter, sd1 and friend. GPOP typically takes more iterations than Galois or Ligra but makes up for it by enhanced performance from memory hierarchy. Note that even with ISG enabled, GPOP typically takes more iterations to converge than Galois or Ligra, because of the coarser task granularity. However, the memory access eciency results in each iteration of GPOP being dramatically faster than the baselines. 
In parallel Nibble algorithm, very few vertices near the seed vertex are explored. Consequently, execution time is not heavily aected by the graph size. Since all threads are updating a small set of vertices, GPOP benets by the atomic free updates. Furthermore, the Ligra implementation [111] uses sparse hash table data structure that incurs collisions. Instead, our implementation uses an array to store probabilities but still achieves theoretical eciency by amortizing the array initialization costs across multiple runs of Nibble algorithm. Theuk dataset possesses an optimized vertex ordering that introduces high locality during neighborhood access. Ligra and Galois benet heavily from this locality resulting in Galois being the best performing framework for uk. However, the pre-processing required to generate such a vertex ordering is extremely time consuming and is not always practical. VertexReordering: The impact of reordering can be seen by comparing the execution times for uk and uk_rnd. Using an optimized ordering can reduce the number of messages exchanged between partitions, as neighbors of a vertex are likely to be packed in very few partitions. For example, the number of inter- partition edges in uk is 179 million as opposed to 3:4 billion for uk_rnd. Hence, we see a 3:2 dierence in BFS performance. However, the performance dierence between uk and uk_rnd for Galois and Ligra is much higher than GPOP, as vertex ordering alleviates the numerous cache misses incurred by these frameworks. 37 Table 2.2: Execution time of GPOP and baselines. ExecutiontimeonDatasets(inseconds) Application Framework soclj gplus pld twitter sd1 friend uk uk_rnd GPOP 0.26 1.95 2.5 4.33 7.17 8.05 4.69 13.63 Ligra 1.55 24.2 10.7 29.9 33.1 155 9.59 80 GraphMat 0.31 4.51 5.23 10.1 15.7 - - - Galois(Topological) 0.26 3.2 4.5 8.2 12.4 33 3.42 21.1 PageRank Galois(Residual) 0.25 2.5 3.66 7.4 10.7 28.7 3.9 19.9 GPOP 0.04 0.37 0.51 0.8 2 1.31 1.89 6.08 Ligra 0.04 0.27 0.32 0.49 1.9 1.25 1.56 6.33 Ligra_Push 0.07 0.53 1.07 2.81 4.1 4.02 2.1 3.21 GraphMat 0.11 1.35 2.44 2.71 12.2 - - - BFS Galois 0.04 0.32 0.49 1.05 1.01 3.13 0.63 1.16 GPOP 0.16 1.21 1.01 1.48 4.5 3.54 12.17 18.28 Ligra 0.8 5.6 6.15 9.6 12.9 14.8 16 26 GraphMat 0.18 1.9 5.22 4.3 22.3 - - - Galois (Delta-step) 0.19 1.28 1.3 3 4.04 5.1 7.2 12.16 SSSP Galois(Topological) 0.8 5.1 6.75 7.57 13.6 18.9 16.7 30.4 GPOP 0.09 0.45 0.52 0.55 1.8 2.16 6.03 11.3 Ligra 0.22 3.3 1.7 4.8 6.9 21.7 2.91 40.1 GraphMat 0.15 2.94 3.91 5.73 16.7 - - - Connected Components Galois 0.04 0.12 1.12 1.75 3.45 3.92 2.3 9.97 GPOP 0.33 0.19 0.33 1.09 0.47 0.4 0.0067 0.051 Parallel Nibble Ligra 1.6 0.5 1.2 5.3 1.18 0.54 0.0082 0.022 38 Table 2.3: L2 Cache Misses incurred by GPOP and baselines. 
L2Cachemisses(inBillions) Application Framework soclj gplus pld twitter sd1 friend uk uk_rnd GPOP 0.18 1.44 1.99 4.1 4.95 7.71 0.53 17.28 Ligra 1.06 9 15.18 31.09 47.91 35.18 5.44 135.9 GraphMat 0.33 3.78 6.28 14.92 17.77 - - - PageRank Galois 0.39 3.53 5.71 13.25 18.97 42.16 0.9 42.4 GPOP 0.02 0.13 0.2 0.36 1.01 0.51 0.29 2.84 Ligra 0.03 0.21 0.23 0.34 2.14 1.2 0.3 13.7 GraphMat 0.03 0.46 0.65 1.35 2.52 - - - BFS Galois 0.05 0.39 0.63 1.54 1.69 4.27 0.29 1.18 GPOP 0.14 0.81 0.87 1.32 3.49 3.58 2.29 19.58 Ligra 0.13 0.97 1.48 2.64 4.79 8.13 - - GraphMat 0.14 1.93 3.34 4.57 10.37 - - - SSSP Galois 0.18 1.26 1.47 3.4 5.44 6.93 4.37 11.42 GPOP 0.08 0.3 0.54 0.49 1.53 1.89 0.98 8.68 Ligra 0.18 1.71 3.01 5.11 11.65 19.45 0.54 55.9 GraphMat 0.12 2.17 3.07 4.68 10.42 - - - Connected Components Galois 0.05 0.5 0.88 1.68 3.16 5.25 0.62 16.64 GPOP 0.2 0.112 0.225 0.705 0.425 0.289 0.001 0.025 Parallel Nibble Ligra 0.453 0.288 0.331 4.215 0.386 1.139 0.004 0.012 CachePerformance: Table 2.3 shows the number of L2 cache misses encountered by dierent frame- works for each of the target algorithms. For Ligra and Galois, we report the cache misses of the best performing variant in terms of execution time. Because of the aggressive cache optimizations (section 2.3), GPOP incurs dramatically lesser L2 cache misses than Ligra, GraphMat and Galois for almost all datasets and applications (table 2.3). Even for BFS where Ligra performs much less work than GPOP due to direction optimization, cache misses for GPOP are signicantly smaller. Overall, GPOP achieves up to 9, 3:6 and 5:5 lesser cache misses than Ligra, GraphMat and Galois, respectively. The impact of vertex ordering on cache misses is similar to that on execution time. For certain applications on uk and uk_rnd datasets, Galois converges much faster, thereby reducing the overall memory accesses and the cache misses. The same eect is also seen in Delta stepping based SSSP 39 implementation of Galois, especially for graphs with relatively large diameter such as sd1. Cache misses for Parallel Nibble are much smaller as the algorithm only explores a small region around the seed vertex. 2.6.2.2 Scalability We evaluate the scalability of GPOP on the target applications using synthetic rmat graphs for which parameters, such as graph size and average degree can be controlled. Strongscaling(gure2.7): To analyze strong scaling behavior, we measure the execution time vs number of threads for several graphs from rmat22 jVj = 4M;jEj = 64M to rmat27 jVj = 128M;jEj = 2048M . GPOP demonstrates very good scalability for BFS, APPR and SSSP, achieving up to 17:6, 15 and 21:2 parallel speedup using 36 threads, respectively. In case of PageRank, GPOP achieves up to 11:6 speedup with 36 threads and scales poorly after 16 threads. This is because PageRank always uses PC mode scatter and nearly saturates the bandwidth with 20 threads. We also observe that scaling improves as the graph size increases. This is because bigger graphs provide more workload per partition, thereby reducing the relative overheads of parallelization. Weakscaling(gure2.8) : To observe the weak scaling, we vary the threads from 1 to 32 along with a proportional change in the size of the input graph. For all of the applications, GPOP exhibits less than 4 increase in execution time for a 32 increase in the input graph size. The number of partitions increases linearly with the graph size as partition size is constant. 
Since we employ a simple index-based partitioning, an increase in the number of partitions amplifies the proportion of inter-partition edges and consequently, the amount of DRAM communication exhibits a super-linear increase. For PageRank and Connected Components, we see a sharp increase in execution time when going from 16 to 32 threads, relative to the scaling from 1 to 16 threads. This can be attributed to the memory bandwidth bottleneck and NUMA effects as explained previously.

Figure 2.7: Strong scaling results for GPOP.

Figure 2.8: Weak scaling results for GPOP. Labels on the curves indicate number of edges in millions.

The behavior of Parallel Nibble is more complex than the other applications. Initially, the time increases with the graph size due to an increase in the size of the explored subgraph. However, it suddenly drops when going from rmat24 to rmat25. This is because the degree of the seed vertex increases with graph size, thereby diluting the contribution of the seed to the visiting probability of its neighbors. After a certain point, this contribution becomes too small and is able to activate only a few vertices for the future iterations. Consequently, the execution time decreases sharply.

Execution Time vs Graph Density (figure 2.9): To understand the effects of density on GPOP's performance, we take the rmat26 (|V| = 64 million) dataset and vary its average degree from 1 (|E| = 64 million) to 128 (|E| = 8 billion). We observe that the performance scales very well with density. For BFS, SSSP and Connected Components, the execution time increases by less than 12× for a 128× increase in the average degree. As the graph becomes denser, GPOP prefers the highly efficient PC mode more often, which explains the relatively small increase in execution time. Further, the aggregation factor r_p (section 2.3.2) of partitions increases with degree, making PC mode even more efficient. For Parallel Nibble, the workload is not determined by the total number of edges in the graph but only by the local region around the seed vertex which gets activated. Hence, we see that the Parallel Nibble execution time does not vary significantly for degrees greater than 16.

2.6.2.3 Impact of GPOP Optimizations

To analyze the individual effect of GPOP optimizations on the overall performance, we report the execution time for three different implementations (figure 2.10) – (a) GPOP with only Source-centric communication and ISG disabled (GPOP_SC), (b) GPOP with dual communication mode and ISG disabled (GPOP_DC), and (c) GPOP with dual communication mode and ISG enabled (GPOP_DI). Recall that the ISG feature is enabled for SSSP and Connected Components only.
Figure 2.9: Graph density vs execution time for GPOP.

Figure 2.10: Impact of GPOP optimizations on execution time.

Figure 2.10 shows the relative execution time of GPOP with different optimizations. Clearly, both optimizations contribute significantly to performance enhancement. The communication mode optimization provides 2-3× speedup for PageRank and Connected Components, and 1.5-2× speedup for BFS and SSSP. The ISG feature enables faster convergence, resulting in a further speedup of approximately 3× and 1.5× for Connected Components and SSSP, respectively. Note that due to limited graph exploration in Parallel Nibble, GPOP_DC is unable to utilize the Partition-centric communication and hence does not provide any significant speedup over GPOP_SC.

2.7 Related Work

Pregel [81] was one of the earliest (distributed-memory) graph processing frameworks that proposed the Vertex-centric (VC) programming model with the Gather-Apply-Scatter (GAS) abstraction, inspired by the Bulk Synchronous Parallel model. However, this model requires synchronization and explicit message exchange between compute nodes, thus introducing scalability challenges. Some frameworks also support asynchronous execution on distributed-memory systems [49, 78], although it sacrifices the efficiency of bulk communication. In contrast, GPOP is designed for shared-memory systems where messages generated by a subgraph (partition) are immediately visible to other partitions, allowing the use of interleaved scatter and gather. The shared-memory model also allows simple dynamic runtime task allocation in GPOP, where threads are not tied to specific partitions at the beginning.

Many shared-memory graph analytic frameworks [110, 139] also adopt the VC model with push or pull direction execution (algorithm 1). The direction-optimizing BFS [15] performs a frontier-dependent switch between the two directions. Ligra [110] is a framework that generalizes this idea to other graph algorithms. Galois [92] utilizes speculative parallelization along with fine-grained work-stealing based load balancing. Its priority-driven task scheduler boosts convergence significantly. Galois was arguably the fastest baseline in our experiments (section 2.6). However, both Galois and Ligra require atomic instructions and are not cache and memory efficient due to low-locality fine-grained random memory accesses. Graph reordering [134, 19] can improve memory access locality. However, these methods can be prohibitively expensive for large graphs.
Several works [101, 120, 12, 23] borrow the 2-phased GAS programming (also used by GPOP) and avoid the use of synchronization primitives on shared-memory platforms. GraphMat [120] maps graph algorithms to efficient Sparse Matrix (SpM) kernels, but its bit-vector frontier representation requires O(V) work in each iteration. X-Stream uses Edge-centric programming and performs O(E) traversal in each iteration [101]. Blocking techniques [23] have also been utilized to improve locality and cache performance. However, these works [23, 101] are not work-efficient for frontier-based algorithms like BFS. The PartitionedVC framework [82], developed independently of GPOP, provides efficient out-of-core graph processing. It loads only active vertex adjacencies from the SSD and supports asynchronous processing by updating vertex data before scheduling its adjacency list for processing.

The benefits of partitioning to improve locality have also been explored previously. However, the novelty of the GPOP framework lies in its guaranteed cache performance without optimized partitioning, and its asynchronous computation by interleaving scatter-gather phases. The 2-phased SpMSpV algorithm of [12] partitions the rows of the matrix and buckets the non-zero entries of the SpV corresponding to each partition individually. It keeps a constant number of partitions to ensure work efficiency. However, for large graphs, the number of vertices in a partition can outgrow the cache capacity, resulting in a high cache miss ratio. Unlike [12], GPOP caps the partition size to ensure cacheability of vertices and still achieves work efficiency using the 2-level Active List. Moreover, as explained in section 2.3.5, it is not possible to mix bucketing with the aggregation phase in [12]. Zhou et al. [141] also use a hybrid of Edge-centric and Vertex-centric computation to increase the rate of reading edges on a CPU-FPGA integrated system. However, GPOP incurs less communication than these methods and supports asynchronous computation.

Domain Specific Languages (DSLs) are another line of research in high performance graph analytics. Green-Marl [61] is a popular DSL that allows users to write graph algorithms intuitively and relies on its compiler to extract parallelism and performance. GraphIt [140] allows users to explore the trade-off space of various optimizations (push vs pull, dense vs sparse frontier, NUMA partitioning). Perhaps the optimizations imbibed in GPOP can also become a part of such languages in the future.

2.8 Chapter Summary

In this chapter, we described the GPOP framework for graph processing that enables cache- and memory-efficient implementation of graph algorithms. GPOP adopts a new Partition-centric abstraction that creates tasks at the granularity of cacheable vertex subsets – partitions. This puts caches at the center of computation and enables several performance optimizations that are not possible with the conventional vertex-centric abstraction. The coarser partition-level granularity in both Scatter and Gather phases also makes it feasible to propagate updates asynchronously, which boosts convergence in GPOP. We extensively evaluated the performance of GPOP for a variety of graph algorithms, using several large datasets. We observed that GPOP incurs up to 9×, 6.8× and 5.5× fewer L2 cache misses compared to Ligra, GraphMat and Galois, respectively. In terms of execution time, GPOP is up to 19×, 9.3× and 3.6× faster than Ligra, GraphMat and Galois, respectively.
Chapter 3

Planting Trees for Scalable and Efficient Hub Labeling

3.1 Motivation

In this chapter, we discuss one of the complex graph applications – Hierarchical Hub Labeling (HHL). It is used to answer Point-to-Point Shortest Distance (PPSD) queries, one of the most important primitives encountered in graph databases. PPSD computations are used in several applications such as phylogenetics, context-aware search on knowledge graphs, route navigation on roads, etc. [97, 136, 138, 90]. While graph traversal algorithms (such as Bellman Ford, discussed in section 2.5) can be used to answer PPSD queries, even state-of-the-art traversal methods [58, 14, 110, 92, 36, 70] take hundreds of milliseconds to a few seconds to answer a single query on large graphs. Consequently, this approach is not suitable for applications that generate a large number of queries or for online applications with low latency requirements. On the other hand, pre-computing all pairs shortest paths enables constant time query response, but is impractical due to quadratic storage and time complexity.

Hierarchical Hub Labeling offers a practical solution to this problem. By pre-computing, for each vertex, the distance to a 'small' subset of vertices known as hubs, it enables extremely fast response to PPSD queries without incurring quadratic time and storage costs [28]. The set of (hub, distance)-tuples for a vertex v, denoted by L_v, is known as the hub label set of v. As shown in figure 3.1, the shortest distance between a pair of vertices can be computed via one of the common hubs in their label sets. For a given vertex ranking R, HHL only selects the highest ranked vertex as a hub for any given shortest path, thus enabling (1) reasonable labeling size, and (2) fast query response for exact shortest distance computation between any pair of vertices.

Figure 3.1: Shortest distance computation between vertices v_4 and v_5 using their hub label sets L_4 and L_5, respectively.

Pruned Landmark Labeling (PLL) [9] is arguably the most efficient sequential algorithm for HHL. It iteratively selects root vertices in a given rank order R and constructs Shortest Path Trees (SPTs). For all vertices explored in these SPTs, the root along with the distance is added as a new hub label. Remarkably, PLL prunes exploration in every SPT by using the labels created in previous SPTs. This makes PLL computationally efficient and the output labeling minimal in size (for a given R), which is also known as the Canonical Hub Labeling (CHL) [4]. However, the pruning procedure adds a sequential root order label dependency between the SPTs, wherein pruning in an SPT depends on the knowledge of labels generated from previous SPTs rooted at higher ranked vertices.

Despite aggressive pruning, PLL is still computationally very demanding. We observed that large weighted graphs such as Pokec (1.5M nodes, 30M edges) and LiveJournal (4.8M nodes, 69M edges) can take several hours to days to process. We also note that the size of the CHL for weighted graphs can be several orders of magnitude larger than the graph itself, stressing the available main memory on a single machine. While disk-based labeling can extend system capabilities beyond DRAM, disk accesses and the resulting algorithms [65, 45] are substantially slower than in-memory PLL. Conventionally, such applications are parallelized by constructing every SPT using off-the-shelf parallel frameworks like GPOP [37]. This approach respects the sequential dependencies but incurs a large amount of synchronization and is not scalable.
Furthermore, the computational and space requirements of HHL motivate the use of massively parallel systems, such as distributed-memory clusters, for scaling. The conventional approach with large amount of synchronization and inter-node communication is highly unsuitable for such platforms. Therefore, we explore the parallelism available in HHL at a coarser granularity i.e. concurrentconstruction of multiple SPTs. This granularity exposes much larger parallelism than what is available within individual SPTs. However, this approach also comes with challenges of its own. It breaks the root order label dependency of PLL, violating the network hierarchy and resulting in redundant labels that would otherwise be pruned in sequential PLL. Distributed-memory systems exacerbate these problems, as the labels generated on a node are not immediately visible to other nodes for pruning. 3.2 Background Table 3.1: Hierarchical Hub Labeling: notations. G(V;E;W ) a weighted undirected graph with vertex setV and edgesE n;m number of vertices and edges;n =jVj;m =jEj N v neighboring vertices ofv w u;v weight of edgee = (u;v)2E SP u;v (vertices in) shortest path(s) betweenu,v (inclusive) SPT v shortest path tree rooted at vertexv d(u;v) shortest path distance between verticesu andv (h;d(v;h)) a hub label for vertexv withh as the hub L v set of hub labels for vertexv q number of nodes in the cluster 51 Table 3.1 lists some frequently used notations in this chapter. For clarity of description, we consider G(V;E;W ) to be weighted and undirected. However, the techniques described here can be easily extended to directed graphs by using forward and backward labels for each vertex [4]. 3.2.1 HubLabeling 2-hopHubLabeling[28]: A 2-hop hub labeling directly connects each vertex to its respective hubs such that the shortest distance between verticesu andv can be computed by hopping fromu to a common hubh and fromh tov. Such a labeling can correctly answer any PPSD query if and only if it satises the following cover property: Property 1. Every connected pair of verticesu;v is covered by a hub vertexh that lies in a shortest path betweenu andv i.e. there exists anh2SP u;v such that (h;d(u;h))2L u and (h;d(v;h))2L v . CanonicalHubLabeling(CHL)[4]: Querying on a 2-hop labeling requires intersecting the labels of source and destination vertices as shown in gure 3.1. Clearly, the query response time is proportional to the average label size. However, nding the optimum labeling with minimum label size is NP-hard [28]. Therefore, Abraham et al.[4] conceptualize Canonical Hub Labeling (CHL) which achieves minimality for a given vertex ranking. LetR :V !N denote a ranking function that denes a total order on all vertices, also known as network hierarchy. In CHL, only the highest-ranked hub inSP u;v is added to the labels ofu andv. Note that CHL satises the cover property. It is minimal for a givenR in the sense that removing any label from CHL leaves at least one shortest path uncovered, resulting in a violation of the cover property (property 1). Intuitively, a good ranking functionR will prioritize highly central vertices that cover a large number of shortest paths in the graph. Such vertices are good candidates for being hubs as they can cover a large fraction of shortest paths with a few labels. Therefore, a labeling which is minimal 52 for a good rankingR will also be quite ecient overall. 3.2.2 PrunedLandmarkLabeling Distance between a hubh and other vertices in the graph can be found by constructing an SPT rooted ath. 
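For concreteness, the PPSD query on such a labeling (section 3.2.1, figure 3.1) reduces to a merge-style intersection of two label sets. The following is a minimal sketch, assuming each label set is stored sorted by hub ID; the Label type and function name are illustrative, not the thesis's actual data layout.

#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct Label { uint32_t hub; uint32_t dist; };   // (hub, d(v, hub))

// Shortest distance between u and v via their common hubs.
// Both label sets are assumed sorted by hub ID, so a single merge pass suffices.
uint32_t query(const std::vector<Label>& Lu, const std::vector<Label>& Lv) {
  uint32_t best = std::numeric_limits<uint32_t>::max();   // "infinity" if no common hub
  std::size_t i = 0, j = 0;
  while (i < Lu.size() && j < Lv.size()) {
    if      (Lu[i].hub < Lv[j].hub) ++i;
    else if (Lu[i].hub > Lv[j].hub) ++j;
    else {                                                // common hub
      uint32_t d = Lu[i].dist + Lv[j].dist;
      if (d < best) best = d;
      ++i; ++j;
    }
  }
  return best;
}

By the cover property (property 1), the minimum over the common hubs is the exact shortest distance; the rank-priority variant used later for pruning and cleaning returns the highest-ranked qualifying hub instead of only the distance.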
However, withinSPT h , further exploration from a vertexv is futile ifSP h;v is covered by a higher-ranked hub. This motivates the Pruned Landmark Labeling (PLL) algorithm [9] which constructs SPTs in decreasing rank order of root vertices, pruning them on-the-y by querying on previously discovered labels. 1 > ( 2 ) > ( 3 ) > ( 4 ) > ( 5 ) 1 2 3 5 4 5 3 10 2 4 14 ℎ 1 0 ℎ 1 3 ℎ 1 11 ℎ 1 9 ℎ 1 5 (a) A graph with labels fromSPTv 1 (in tables) and vertex rankingR. Root vertices are selected in decreasing order of their ranks. 1 2 3 5 4 5 3 10 2 4 14 1 = 3 2 = 0 3 = 10 5 = 14 4 = ∞ (i) pop common hub = 1 distance via hub = 6 > 2 Insert Label ( 2 , 0 ) in 2 Relax edges of 2 1 2 3 5 4 5 3 10 2 4 14 1 = 3 2 = 0 3 =10 5 = 14 4 = ∞ (ii) pop common hub = 1 distance via hub = 3 ≤ 1 Prune 1 2 3 5 4 5 3 10 2 4 14 1 = 3 2 = 0 3 =10 5 = 12 4 = ∞ (iii) pop common hub = 1 distance via hub = 14 > 3 Insert label ( 2 , 10 ) in 3 Relax edges of 3 1 2 3 5 4 5 3 10 2 4 14 1 = 3 2 = 0 3 =10 5 = 12 4 = ∞ (iv) pop common hub = 1 distance via hub = 12 ≤ 5 Prune Finish (b)SPTv 2 construction using pruned Dijkstra in PLL (di denotes distance from root tovi). Figure 3.2: Label generation by PLL with hubv 2 .SPT v 2 does not visitv 4 due to pruning. 53 Figure 3.2 provides a graphical illustration of SPT construction and pruning in PLL withv 2 as root. For every visited vertexv, PLL conrms if there exists a higher-ranked hubh (v 1 in this case) in bothL v 2 and L v such thatd(h;v 2 ) +d(h;v)d v , whered v is the distance tov inSPT v 2 . In gure 3.2b, this condition holds true forv 1 andv 5 , because of which their adjacent edges are not explored andv 4 is never visited. We denote this as distance query pruning. Such pruning not only restricts label size, but also limits graph traversal resulting in a computationally ecient labeling algorithm. 3.2.3 Challenges Assuming that a weighted graphG and a ranking functionR are given, we study the following problems: 1. Eciently utilize available parallelism in shared and distributed-memory systems to accelerate CHL construction. This requires scalable algorithms that can guarantee minimal minimal labeling size (by denition of CHL). 2. Scale in-memory processing to large graphs whose labeling does not t on the main memory of a single machine. This requires partitioning the labels and distributing the partitions to dierent compute nodes. 3. Given a labelingL =[ v2V L v , eciently answer PPSD queries using shared and distributed-memory parallelism. The eciency of query response can be measured in terms of either latency or throughput, and may require dierent layouts for either of these objectives. To achieve the aforementioned objectives, we need to address two main challenges described below. Synchronization: We try to parallelize HHL on shared-memory systems by constructing each SPT using the parallel GPOP framework (chapter 2). The two-phased GAS model used in GPOP is widely used in most distributed graph processing as well. Therefore, performance of a GPOP-based implementation is indicative of a similar implementation on distributed-memory systems. 54 Since PLL constructs one tree at a time, threads synchronize after every SPT. Within an SPT also, threads synchronize multiple times corresponding to the number of iterations required for termination. This results in large amount of synchronization which impacts parallel scalability. Further, due to heavy pruning, the workload available between successive synchronizations is insucient for ecient parallelism. 
Consequently, exploiting only intra-tree parallelism provides limited benets as shown in table 3.2. Table 3.2: Intra-tree parallelism using GPOP exhibits poor parallel scalability. Dataset WND AUT YTB FLA Time(1thread) 11 s 244 s 1394 s 377 s Time(36threads) 108 s 168 s 487 s 1115 s Sequential Dependencies: Pruning a treeSPT v in PLL is dependent on the labels generated by all the trees rooted at vertices with rank higher than that of v. Constructing multiple trees in parallel violates these dependencies and can negatively impact pruning eciency. This can result in increased label size, computational overhead and perhaps, an incorrect output labeling that does not satisfy cover property (property 1). LabelSetPartitioning: To scale the capabilities of HHL beyond the main memory capacity of a single node, we intend to utilize multi-node clusters. The labeling can be partitioned across hubs or across the vertices and partitions can be distributed on the nodes to enable in-memory processing. However, in distributed-memory clusters, labeling stored or generated on a node is not readily available for pruning on another node. As a result, each node can only utilize the labels in its local memory for pruning, which is only a fraction of the labels collectively generated by all nodes. This results in increased label size and poor computational eciency. Moreover, the partitions get sparser with increase in the number of nodes, further exacerbating this issue. 55 3.3 Shared-memoryParallelHubLabeling 3.3.1 LabelConstructionandCleaning Owing to scalablilty challenges with parallelization within SPTs, we develop an optimistic parallel algorithm called Label Construction and Cleaning (LCC), for CHL construction. LCC exploits parallelism across the SPTs without sacricing computational eciency or labeling size. We rst dene some properties of a labeling that will be useful in the development of LCC. Denition1. A hub label (h;d(v;h))2L v is said to be redundant if it can be removed fromL v without violating the cover property (property 1). Denition2. A labelingL is minimal if it has no redundant labels. Denition3. Consideranarbitrarypairofconnectedverticesuandvwithh m beingthehighest-rankedvertex inSP u;v . A labeling is said to respect the rank orderR, if (h m ;d(u;h m ))2L u and (h m ;d(v;h m ))2L v , for all such pairs (u;v). By denition 2, we need to remove any redundant labels in order to ensure minimality of a labeling. The following lemmas dene the usefulness of a hub label and a mechanism to identify the redundant labels. Lemma1. A hub label (h;d(v;h))2L v in a labeling that respectsR is redundant ifh is not the highest ranked vertex inSP v;h . Proof. Assume thath is not the highest ranked vertex inSP v;h . Letw6=h be the highest ranked vertex inSP v;h . LetS v h denote the set of verticesv 0 such thath2 SP v 0 ;v . This implies thatw2 SP v;v 0. Since the labeling respectsR, for anyv 0 , we must have (w 0 ;d(v;w 0 ))2L v and also (w 0 ;d(v 0 ;w 0 ))2L v 0, where w 0 = arg max u2SP v;v 0 fR(u)g. Clearly,R(w 0 )R(w)>R(h). This implies that for everyv 0 2S v h , there 56 exists a hubw 0 6=h that covers the shortest path betweenv andv 0 . Therefore, (h;d(v;h)) can be removed without aecting the cover property. Lemma 2. Given a rankingR and a labeling that respectsR, a redundant label (h;d(v;h))2 L v can be detected by a PPSD query between the vertexv and the hubh. Proof. Letw 0 = arg max u2SP v;h fR(u)g. If the label (h;d(v;h)) is redundant,w 0 6=h andR(w 0 )>R(h) by lemma 1. 
Since the labeling respectsR,w 0 must be a hub for bothv andh. Thus, a PPSD query on the labels ofv andh with rank priority will returnw 0 with distanced(v;w 0 ) +d(h;w 0 )d(v;h). Moreover, if (h;d(v;h)) is not redundant, thenh =w 0 = arg max u2SP v;h fR(u)g. Therefore, if a PPSD query on the labels ofv andh returns a hubw 0 withR(w 0 )>R(h), the label (h;d(v;h)) inL v is redundant. Lemmas 1 and 2 show that redundant labels (if any) in a labeling can be detected if the labeling respects R. We utilize this insight in the development of parallel Label Construction and Cleaning (LCC) algorithm and show that it outputs the minimal sized labeling. ParallelLabelConstruction(LCC-I): To parallelize PLL, LCC creates a task queue of root vertices in the order dened byR, as shown in algorithm 7. Every idle thread pops the highest-ranked vertex still in the queue and constructs a pruned SPT rooted at that vertex (algorithm 7, line 5). This process is repeated until the queue becomes empty. This parallelization strategy ensures that SPTs are assigned in the same order as R, enabling signicant pruning even though SPT construction may be out of order. It also dynamically balances the loads across all threads and ensures that all threads are working until the last SPT is assigned. Moreover, fetching unique tasks from the queue can be achieved by atomically incrementing the queue pointer. Thus, label construction avoids any global synchronizations that can impact the parallel scalability. Such parallelization can be viewed as an optimistic parallelization of PLL as an SPT may start before all labels from the prior SPTs are generated. Thus, it may allow some ‘mistakes’ (redundant labels not in CHL) in the hub labeling. To ensure minimality, we develop a method to restrict redundant labels and remove 57 them from the nal labeling, called Label Cleaning (algorithm 7, lines 8-15). First, we discuss the pruning process in LCC which underlines the key idea behind Label Cleaning. Algorithm7 LCC: Label Construction and Cleaning Input:G(V;E;W ),R;Output: LabelingL =[ v2V L v p # parallel threads Q queue containing vertices ordered by rank 1: L v 8v2V . initialization 2: forallt id = 1; 2:::pdoinparallel . LCC-I: Label Construction 3: whileQ6= emptydo 4: atomically pop highest ranked vertexh fromQ 5: pruneDijRQ(G;R;h;L) 6: forallv2V doinparallel 7: sort labels inL v using hub rank 8: forallv2V doinparallel . LCC-II: Label Cleaning 9: foreach (h;d(v;h))2L v 10: DQ_Clean(v;h;d(v;h);L h ;L v ;R) 11: functionDQ_Clean(v;h;;L h ;L v ;R) . Cleaning Query 12: W set of common hubs inL h andL v 13: u highest-ranked vertex inW for whichd(u;v) +d(u;h) 14: if such au existsandR(u)>R(h)then 15: delete (h;) fromL v RankQuerying: In addition to the distance query pruning, LCC also incorporates Rank Query pruning in its modied Dijkstra algorithm shown in algorithm 8. Specically, during the construction ofSPT h , if a vertexv withR(v)>R(h) is visited, we 1) pruneSPT h atv and 2) do not inserth as a hub intoL v , even if the corresponding distance query is unable to prune. Thus the labels generated bySPT h will not be able to prune the SPTs rooted at higher ranked vertices. Eectively, this guarantees that for any pair of connected vertices (h;v) withR(v)>R(h), eitherv is labeled a hub ofh or they both share a higher ranked hub which is common to their label sets. Claim 1. The labeling generated by LCC’s label construction step (LCC-I) satises the cover property and respectsR. 
58 Algorithm8 Pruned Dijkstra in LCC (pruneDijRQ) Input:G(V;E;W ),R, rooth, current labelsL =[ v2V L v ;Output: hub labels with hubh v ! distance tov 1: initialize an empty vertex priority queueQ 2: LR = hash(L h ), h = 0; v =18v2Vnfhg . initialize 3: Insert (h; 0) inQ 4: whileQ is not emptydo 5: pop (v; v ) fromQ 6: ifR(v)>R(h)then continue . Rank-Query 7: if DQ(v;h; v ;LR;L v )then continue . Dist. Query 8: L v =L v [f(h; v )g 9: foreachu2N v 10: if v +w v;u < u then 11: u = v +w v;u ; updateQ 12: functionDQ(v;h;;LR;L v ) 13: foreach (h 0 ;d(v;h 0 ))2L v 14: if (h 0 ;d(h;h 0 ))2LRthen 15: ifd(v;h 0 ) +d(h;h 0 ) then returntrue 16: returnfalse Proof. LetH P v andH S v denote the set of hub vertices of a vertexv after LCC-I and sequential PLL, respec- tively. First, we will show thatH S v H P v for anyv2V . Consider a vertexh62H P v . This could happen under either of the following scenarios: Case1: h62H P v because a Rank-Query prunedSPT h atv in LCC-I. For that to happen, we must have R(v)>R(h). Since sequential PLL is also the CHL,h62H S v also. Case 2: h62 H P v because a Distance-Query prunedSPT h atv in LCC-I. This can only happen if LCC found a shorter distanced(h;v) through a hub vertexh 0 2SP h;v (algorithm 8, lines 14-15). Since LCC with Rank-Querying allowsh 0 to be a hub for bothh andv, we must haveR(h 0 )>R(h)>R(v). This implies thath = 2H S v . Case 3: h62 H P v becausev was not discovered bySPT h due to pruning. Similar to Case 2 above, this implies that there exists anh 0 2SP v;h withR(h 0 )>R(h), and thereforeh62H S v . Combining these cases, we can say thatH S v H P v . Since sequential PLL generates the CHL, the claim follows. 59 Combining lemma 2 and claim 1 implies that we can identify and remove redundant labels from the output of LCC-I. Note that Rank Querying is crucial for this purpose as without it, the subsequent redundant label removal procedure may result in a labeling that violates the cover property. LabelCleaning(LCC-II): The extra labels created due to optimistic parallelization in LCC are redundant, since there exists a canonical subset of those labels that satises the cover property (claim 1). LCC eliminates redundant labels using theDQ_Clean function as shown in algorithm 7 (lines 11-15) ∗ . For a vertexv, a label (h;d(v;h)) is removed if a distance query on (v;h) nds a common hubu such thatR(u)>R(h) and the distance betweenv andh viau is no greater thand(v;h). Claim2. The nal labeling generated by LCC after the Label Cleaning step (LCC-II) is the CHL. Lemma3. LCC is work-ecient. It performsO(wm log 2 n +w 2 n log 2 n) work, generatesO(wn logn) hub labels and answers each query inO(w logn) time, wherew is the tree-width ofG. Proof. Consider the centroid decomposition (;T (V T ;E T )) of minimum-width tree decomposition of the input graphG, where =fX t V8t2 V T g maps vertices inT (bags) to subset of vertices inG [9]. LetR(v) be determined by the minimum depth bagfb v 2 V T jv2 X bv g i.e. vertices in root bag are ranked highest followed by vertices in children of root and so on. Since we prune using Rank-Query,SPT v will never visit vertices beyond the parent ofb v . A bag is mapped to at mostw vertices and the depth of T isO(logn). Since the only labels inserted at a vertex are its ancestors in the centroid tree, there are O(w logn) labels per vertex. Every time a label is inserted at a vertex, we evaluate all its neighbors in the distance queue. Thus the to- tal number of distance queue operations isO(wm logn). 
Further, distance queries are performed on vertices that cannot be pruned by rank queries. This results inO(nw lognw logn) =O(w 2 n log 2 n) work. Lastly, the Label Cleaning step sorts the label sets and executes PPSD queries performingO(nw logn logw log logn+ ∗ Actual implementation of DQ_Clean stops at the rst common hub (also the highest ranked) in sortedL h andLv which satises the condition in line 14 of algorithm 7. 60 w 2 n log 2 n) =O(w 2 n log 2 n) work. Thus, overall work complexity of LCC isO(wm log 2 n +w 2 n log 2 n) which is the same as the sequential PLL algorithm [95]. Therefore, LCC is work-ecient. Although LCC is theoretically ecient, in practice, the Label cleaning step adds non-trivial overhead to the execution time. Next, we describe an algorithm that reduces the overhead of cleaning. 3.3.2 GlobalLocalLabeling A natural way to accelerate label cleaning is to severely restrict the size of label sets used for label cleaning queries. We observe that some of the labels used in cleaning queries (algorithm 7,DQ_Clean) were already consulted during label construction and it is futile to iterate over them again during cleaning. To this purpose, we propose a Global Local Labeling (GLL) algorithm that uses a novel hierarchical label table data structure and an interleaved cleaning strategy. Unlike LCC, GLL utilizes multiple synchronizations where the threads switch between label construction and cleaning. We denote the combination of a label construction and the subsequent label cleaning step as a superstep. During label construction, the newly generated labels are pushed to a Local Label Table and their volume is tracked. Once the number of labels in the local table becomes greater thann, where> 1 is the synchronization threshold, the threads synchronize, sort, clean the labels in local table and commit them to a Global Label Table. In any superstep, it is known that all existing labels in the global table are consulted during the label construction. Therefore, label cleaning only needs to query on the local table for redundancy check, thus dramatically reducing the number of repeated computations. After a label construction step, the local table holdsn labels. AssumingO() expected labels per vertex, each cleaning step on average performsO(n 2 ) work. The number of cleaning steps isO wnlogn n and thus the expected work complexity of cleaning in GLL isO(nw logn), as opposed toO(nw 2 log 2 n) in LCC. Ifw logn, cleaning in GLL is more ecient than LCC. 61 3.4 Distributed-memoryParallelHubLabeling Distributed-memory clusters can provide large amount of memory and computational resources for appli- cation scaling. However, they also present strikingly dierent challenges than shared-memory systems, in general as well as in the specic context of hub labeling. Therefore, a trivial extension of GLL algo- rithm (section 3.3.2) is unsuitable for such systems. Particularly, to harness the collective memory capability of multiple nodes, labels must be partitioned and distributed across multiple nodes. This severely restricts the pruning eciency on any given node, resulting in large number of redundant labels and huge communication volume that bottlenecks the per- formance. In this section, we present novel algorithms and optimizations that systematically conquer these challenges. We begin by describing a distributed extension of GLL (section 3.3.2) that highlights the basic data distribution and parallelization approach. 
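For reference, the superstep structure of GLL described in section 3.3.2, which the distributed extension builds on, can be sketched as follows. This is a minimal C++/OpenMP illustration under stated assumptions, not the actual implementation: prunedDijkstraRQ and isRedundant are hypothetical stand-ins for the pruned Dijkstra of algorithm 8 and the cleaning query, and a real implementation must additionally guard concurrent appends to the same vertex's local label set.

// Sketch of GLL supersteps: construct labels into a per-vertex local table
// until roughly alpha*n labels have been produced, then synchronize, clean
// against the local table only, and commit survivors to the global table.
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <vector>
#include <omp.h>

struct Label { uint32_t hub; uint32_t dist; };
using LabelTable = std::vector<std::vector<Label>>;   // one label set per vertex

// Assumed helpers (not shown): prunedDijkstraRQ appends new labels of the tree
// rooted at 'root' to 'local' (with per-vertex synchronization) and increments
// 'produced'; isRedundant answers a cleaning query using the local table only.
extern void prunedDijkstraRQ(uint32_t root, const LabelTable& global,
                             LabelTable& local, std::atomic<uint64_t>& produced);
extern bool isRedundant(uint32_t v, const Label& l, const LabelTable& local);

void gllSupersteps(uint32_t n, const std::vector<uint32_t>& rankOrder,  // roots in decreasing rank order
                   double alpha, LabelTable& globalTable) {
  if (globalTable.size() < n) globalTable.resize(n);
  LabelTable localTable(n);
  std::atomic<uint64_t> nextTask{0}, produced{0};
  const uint64_t threshold = std::max<uint64_t>(1, static_cast<uint64_t>(alpha * n));

  while (nextTask.load() < rankOrder.size()) {         // one iteration = one superstep
    #pragma omp parallel                               // label construction phase
    {
      while (produced.load(std::memory_order_relaxed) < threshold) {
        uint64_t t = nextTask.fetch_add(1);            // dynamic, rank-ordered task assignment
        if (t >= rankOrder.size()) break;
        prunedDijkstraRQ(rankOrder[t], globalTable, localTable, produced);
      }
    }                                                  // implicit barrier: threads synchronize here
    #pragma omp parallel for schedule(dynamic, 64)     // label cleaning phase
    for (int64_t v = 0; v < static_cast<int64_t>(n); ++v) {
      for (const Label& l : localTable[v])
        if (!isRedundant(static_cast<uint32_t>(v), l, localTable))
          globalTable[v].push_back(l);                 // commit non-redundant labels
      localTable[v].clear();
    }
    produced.store(0);
  }
}

The implicit barrier at the end of the construction region plays the role of the synchronization point between the two phases of a superstep.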
3.4.1 Distributed Global Local Labeling

DGLL divides the task queue for SPT creation uniformly among q nodes in a rank-circular manner. Thus, the set of root vertices assigned to node i is given by TQ_i = {v ∈ V | R(v) mod q = i}. Every node loads the graph and the ranking R (for rank queries) and executes GLL on its allotted task queue.

1. Label Set Partitioning: Each node only stores the labels generated locally, i.e. all labels at node i are of the form (h, d(v,h)), where h ∈ TQ_i. Equivalently, the labels of a vertex v are disjoint and distributed across nodes, i.e. L_v = ∪_i L_{i,v}. Thus, all nodes collaborate to provide main memory space to store the labels, and the effective memory scales in proportion to the number of nodes.

2. Synchronization and Label Cleaning: Interleaved cleaning requires nodes to synchronize after constructing a certain number of SPTs. For every superstep in DGLL, we decide the synchronization point a priori in terms of the number of SPTs to be created. The synchronization point computation is motivated by the label generation behavior of the algorithm. Figure 3.3 shows that the number of labels generated by the initial SPTs rooted at high-ranked vertices is very large, and it drops exponentially as the rank decreases. To maintain cleaning efficiency with few synchronizations, we increase the number of SPTs constructed in each superstep by a constant multiplicative factor, i.e. if superstep i constructs x SPTs, superstep i+1 constructs that factor times x SPTs.

After synchronization, the labels generated on a node are broadcast to all nodes for a redundancy check. Each node creates a bitvector containing the responses of all cleaning queries. The bitvectors are then combined using an all-reduce operation to obtain the final redundancy information. Note that DGLL uses both the global and local tables to answer cleaning queries. Yet, interleaved cleaning is beneficial, as it removes redundant labels, thereby reducing query response time for subsequent cleaning steps. For some datasets, we empirically observe > 90% redundancy in the labels generated in some supersteps. The presence of such a large number of redundant labels can radically slow down ensuing pruning queries.

Figure 3.3: Distribution of labels generated by individual SPTs (number of labels per SPT vs. SPT ID; panels show the California Road Network and the Skitter AS Links network).
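The redundancy exchange performed at each DGLL synchronization point, as described above, can be sketched with MPI as follows. The bit convention (a set bit marks a redundant label) and all identifiers are illustrative assumptions, not the actual implementation; the broadcast of the superstep's labels, which precedes this step, is not shown.

// Combine each node's cleaning-query responses for the labels of the current
// superstep. Bit i is set locally if this node's cleaning queries found label i
// to be redundant; a bitwise OR across nodes marks a label redundant if any
// node can prove it so.
#include <cstdint>
#include <vector>
#include <mpi.h>

std::vector<uint64_t> combineRedundancyBits(const std::vector<bool>& locallyRedundant) {
  const size_t numLabels = locallyRedundant.size();
  const size_t numWords  = (numLabels + 63) / 64;
  std::vector<uint64_t> bits(numWords, 0);

  for (size_t i = 0; i < numLabels; ++i)               // pack local query responses
    if (locallyRedundant[i]) bits[i / 64] |= (uint64_t{1} << (i % 64));

  // All-reduce the bitvectors: after this call, bit i == 1 means label i is
  // redundant according to at least one node and can be deleted everywhere.
  MPI_Allreduce(MPI_IN_PLACE, bits.data(), static_cast<int>(numWords),
                MPI_UINT64_T, MPI_BOR, MPI_COMM_WORLD);
  return bits;
}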
3.4.2 PLaNT: Prune Labels and (do) Not (prune) Trees

The redundancy check in DGLL generates huge label broadcast traffic, comprising both redundant and non-redundant labels. This restricts the scalability of DGLL and motivates the need for an algorithm that can avoid redundancy without communicating labels to other nodes. To this end, we propose the Prune Labels and (do) Not (prune) Trees (PLaNT) algorithm. PLaNT outputs completely non-redundant labels using additional traversal, but without any label cleaning or query-based pruning. Effectively, PLaNT accepts some loss in pruning efficiency to achieve a dramatic decrease in communication across cluster nodes.

The key idea underlying PLaNT is that the redundancy of a label (h, d(v,h)) ∈ L_v is determined only by whether or not h is the highest-ranked vertex in SP_{v,h}. Thus, when constructing SPT_h, if we track information about the highest-ranked vertices on all shortest paths, we intrinsically have the requisite information to detect the redundancy of h as a hub. Algorithm 9 (PLaNTDijkstra) depicts the construction of a shortest path tree using PLaNT, which we call PLaNTing trees.

PLaNTDijkstra tracks the highest-ranked ancestor a_v encountered on the path from h to v by propagating ancestor information along with the distance during path exploration. When v is popped from the distance queue, a label is added to L_v if neither v nor a_v is ranked higher than the root. Thus, for any shortest path SP_{u,v}, only h_m = arg max_{w ∈ SP_{u,v}} {R(w)} succeeds in adding itself to the labels of u and v, guaranteeing a minimal labeling with complete cover without consulting any labels generated by other SPTs. Figure 3.4 provides a detailed graphical illustration of label generation using PLaNT and shows that it generates the same labeling (CHL) as PLL (figure 3.2b). Note that even if a label is not inserted in the label set of a vertex v, its neighborhood may still be explored. For example, in figure 3.4, the neighborhoods of v_1 and v_5 are explored, resulting in a visit to v_4, which was not visited in PLL (figure 3.2b).

If there are multiple shortest paths from h to v, the path with the highest-ranked ancestor is selected. This is achieved in the following manner: when a vertex v is popped from the Dijkstra queue and its edges are relaxed, the ancestor of a neighbor u ∈ N_v is allowed to update even if the newly calculated tentative distance to u is equal to the currently assigned distance to u (algorithm 9, line 12). For example, in figure 3.4, the shortest paths to v_5, P_1 = {v_2, v_1, v_4, v_5} and P_2 = {v_2, v_3, v_5}, have the same length. Ancestor selection for v_5 uses P_1 by setting a_{v_5} = v_1, because R(v_1) > R(v_2).

Figure 3.4: Label generation by PLaNT with hub v_2 for the graph in figure 3.2a. PLaNT generates the same labels as PLL but does not require other labels for pruning (labels with hub v_1 in this case). (Panels (i)-(v) trace the pops, ancestor comparisons against R(v_2), label insertions and edge relaxations of PLaNTDijkstra.)

Note that PLaNT not only avoids dependency on the labels on remote nodes, it also rids SPT construction of any dependency on the output of other SPTs, effectively providing an embarrassingly parallel solution for CHL construction with O(m + n log n) depth (the complexity of a single Dijkstra instance) and O(mn + n^2 log n) work. Due to its embarrassingly parallel nature, PLaNT does not require SPTs to be constructed in any specific order either. However, to enable optimizations discussed later, we follow the same rank order of SPT construction as used in DGLL (section 3.4.1).

Algorithm 9: PLaNTDijkstra algorithm to plant SPTs
Input: G(V,E,W), R, root h
Notations: δ_v → distance to v, a_v → ancestor of v, Q → vertex priority queue, cnt → number of vertices v with a_v = h
1: Initialize δ_h = 0, a_h = h, and a_v = v, δ_v = ∞ for all v ∈ V \ {h}
2: Insert (h, 0) in Q, cnt ← 1
3: while Q is not empty do
4:   if cnt = 0 then exit            ▷ Early Termination
5:   pop (v, δ_v) from Q
6:   if a_v = h then cnt ← cnt − 1
7:   a_v ← arg max_{x ∈ {v, a_v}} R(x)            ▷ update ancestor
8:   if a_v = h then L_v ← L_v ∪ {(h, δ_v)}
9:   for each u ∈ N_v
10:    previous ancestor a'_u ← a_u
11:    if δ_v + w_{v,u} < δ_u then a_u ← arg max_{x ∈ {a_v, u}} R(x)            ▷ update neighbor's ancestor
12:    else if δ_v + w_{v,u} = δ_u then a_u ← arg max_{x ∈ {a_v, a_u}} R(x)
13:    if a_u = h and a'_u ≠ h then cnt ← cnt + 1
14:    else if a_u ≠ h and a'_u = h then cnt ← cnt − 1
15:    δ_u ← min(δ_u, δ_v + w_{v,u})            ▷ update distance
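A compact C++ rendering of this procedure is sketched below to highlight the ancestor propagation; the early-termination counter cnt is omitted for brevity, and the adjacency layout, rank convention (larger value means higher rank) and all identifiers are assumptions for illustration rather than the actual implementation.

// PLaNTing one tree: propagate the highest-ranked ancestor along with the
// tentative distance, and insert a label (root, d) at v only when the root
// outranks every vertex on the chosen shortest path to v.
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using Adj = std::vector<std::vector<std::pair<uint32_t, uint64_t>>>;  // per vertex: (neighbor, weight)
struct Label { uint32_t hub; uint64_t dist; };

void plantTree(const Adj& adj, const std::vector<uint32_t>& rank,     // rank[v]: larger = higher rank
               uint32_t root, std::vector<std::vector<Label>>& labels) {  // labels sized |V| by caller
  const uint64_t INF = std::numeric_limits<uint64_t>::max();
  const size_t n = adj.size();
  std::vector<uint64_t> dist(n, INF);
  std::vector<uint32_t> anc(n);                        // highest-ranked ancestor seen so far
  for (uint32_t v = 0; v < n; ++v) anc[v] = v;
  auto higher = [&](uint32_t a, uint32_t b) { return rank[a] > rank[b] ? a : b; };

  using QEntry = std::pair<uint64_t, uint32_t>;        // (distance, vertex), min-heap
  std::priority_queue<QEntry, std::vector<QEntry>, std::greater<QEntry>> pq;
  dist[root] = 0; anc[root] = root;
  pq.push({0, root});

  while (!pq.empty()) {
    auto [d, v] = pq.top(); pq.pop();
    if (d != dist[v]) continue;                        // stale entry (lazy deletion)
    anc[v] = higher(anc[v], v);                        // max-rank vertex on the path, including v
    if (anc[v] == root)                                // root outranks the whole path,
      labels[v].push_back({root, d});                  // so (root, d) is a CHL label of v
    for (auto [u, w] : adj[v]) {
      uint64_t nd = d + w;
      if (nd < dist[u]) {                              // strictly shorter path through v
        dist[u] = nd;
        anc[u] = higher(anc[v], u);
        pq.push({nd, u});
      } else if (nd == dist[u]) {                      // equal length: keep the path whose
        anc[u] = higher(anc[v], anc[u]);               // ancestor has the highest rank (line 12)
      }
    }
  }
}

Because no labels from other trees are consulted, plantTree calls for different roots share nothing but the read-only graph, so they can run on different threads or cluster nodes independently.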
Early Termination: To improve the computational efficiency of PLaNT, we propose the following early termination strategy: stop further exploration when, for all vertices in Dijkstra's distance queue, the rank of either the ancestor or the vertex itself is higher than the root. Further exploration from such vertices only creates shortest paths with at least one vertex ranked higher than the root. Early termination can dramatically cut down traversal in SPTs with low-ranked roots.

Despite early termination, PLaNTed trees can possibly explore a large part of the graph, which PLL would have avoided by pruning, as shown in figure 3.5. For each SPT, let Ψ denote the average number of vertices explored per label generated. This ratio quantifies the pruning efficiency of an algorithm and the workload per label. Figure 3.5 shows that in PLaNT, Ψ for many SPTs can be even higher than 10,000.

Figure 3.5: Ψ value for SPTs generated by PLaNT (Ψ vs. shortest path tree ID; panels show the California Road Network and the Skitter AS Links network).

3.4.3 Hybrid Algorithm

An important virtue of PLaNT is its compatibility with DGLL. Since PLaNT also constructs SPTs in rank order and generates labels with the root as the hub, we can seamlessly transition between PLaNT and DGLL to enjoy the best of both worlds. Exploiting this feature, we propose a Hybrid algorithm that initially uses PLaNT and later switches to DGLL. The initial SPTs rooted at high-ranked vertices generate most of the labels and exhibit low Ψ values, as shown in figure 3.5. By PLaNTing these SPTs, we efficiently parallelize the bulk of the computation and avoid communicating a large fraction of the labels. In the later half of execution, when Ψ becomes high and few labels are generated per SPT, the Hybrid algorithm switches to DGLL to exploit the heavy pruning and thus avoids the inefficiencies associated with PLaNT.

The Hybrid algorithm is a natural fit for scale-free networks. It makes use of the core-fringe structure in such networks, where removing a small dense core leaves a fringe-like structure with very low tree-width [10]. Typical degree- and betweenness-based hierarchies also prioritize vertices in the dense core. In such graphs, the Hybrid algorithm PLaNTs the SPTs rooted at core vertices, which generate a large number of labels. SPTs rooted at fringe vertices generate few labels and are constructed using DGLL, which exploits heavy pruning to limit computation.

For graphs with a core-fringe structure, a relaxed tree decomposition (χ, T(V_T, E_T)) parameterized by an integer c can be computed such that |X_{t_r}| = w_m and |X_t| ≤ c for all t ∈ V_T \ {t_r}, where t_r is the root of T and χ = {X_t ⊆ V | t ∈ V_T} maps vertices in T (bags) to subsets of vertices in G [10]. In other words, except for the root bag, |X_t| is bounded by a parameter c with c ≤ w ≤ w_m.

Lemma 4. The Hybrid algorithm performs O(m(w_m + c log^2 n) + nc log n (w_m + c log n)) work, broadcasts only O(cn log n) data, generates O(n(w_m + c log n)) hub labels and answers each query in O(w_m + c log n) time.

Proof. Consider the relaxed tree decomposition (χ, T(V_T, E_T)) with root t_r and perform centroid decomposition on all subtrees rooted at the children of t_r to obtain tree T'. The height of any tree in the forest generated by removing t_r from T' is O(log n). Hence, the height of T' is O(log n + 1) = O(log n). Consider a ranking R where R(v) is determined by the minimum-depth bag {b ∈ V_{T'} | v ∈ X_b}.
For GLL, the number of labels generated by SPTs from vertices in root bag isO(w m n). Combining this with lemma 3, we can say that total labels generated by GLL isO(n (w m +c logn)) and query complexity is O(w m +c logn). The same also holds for the Hybrid algorithm since it outputs the same CHL as GLL. If Hybrid algorithm constructsw m SPTs using PLaNT and rest using DGLL, the overall work-complexity isO(w m (m+n logn))+O(mc log 2 n+nc logn (w m +c logn)) =O((m (w m +c log 2 n)+nc logn (w m +c logn))). The Hybrid algorithm only communicates the labels generated after switching to DGLL, resulting inO(cn logn) data broadcast. In comparison, doing only DGLL for the same ordering will broadcast O(w m n +cn logn) data. 68 Note that in practice, instead of mining the core from the graph, we use the ratio as a heuristic, dynamically switching from PLaNT to DGLL when becomes greater than a threshold . 3.4.3.1 EnablingEcientMulti-nodePruning We propose an optimization that simultaneously solves the following two problems: 1. Pruning exploration in PLaNT! PLaNT cannot use distance query pruning on partial local labels on a node, as it can result in mistakes in ancestor information propagation, which can in turn lead to redundant labels and defeat the purpose of PLaNT. In general, if a node prunes usingH u , it must havefH v 8v2 VjR(v) R(u)g to guarantee non-redundant labels. In this situation, a vertex is either visited through the shortest path with correct ancestor or is pruned. 2. Reduce redundant labels in DGLL! SPT pruning in DGLL only employs partial labels available locally on a node, and is inecient. Labels from some of the highest-ranking hubs are missing on each node. We observe that label count decreases dramatically even if labels from only a few highest-ranked hubs are available for pruning (gure 3.6). Moreover, the marginal utility of labels from each hub decreases as more and more hubs are used for pruning. This motivates the following solution for improving pruning eciency: for a small constant integer, if we storeall labels from highest-ranked hubs on every compute node, we can • use distance queries to prune PLaNTed trees, and • drastically increase pruning eciency of DGLL. For =O(1), using common labels incurs additionalO(n) broadcast trac, and consumesO(n) more memory per node. Thus, it does not alter the theoretical bounds on work-complexity, communication 69 # Labels # Hubs for Pruning Query California Road Network # Labels # Hubs for Pruning Query Skitter AS Links Figure 3.6: Number of labels generated as a function of the number of (highest-ranked) hubs used for pruning. X-axis= 0 means rank queries only. volume and space requirements of the Hybrid algorithm. In our implementation, we store labels from = 16 highest ranked hubs in the Common Label Table. 3.5 QueryingDistributedLabels: We propose three dierent data layouts for ecient distributed storage of labeling and querying over it: • Querying with Labels on Single Node (QLSN)! In this mode, ll labels are stored on every node and a query response is computed only by the node where the query emerges. It has lowest latency. • Querying with Fully Distributed Labels (QFDL)! In this mode, the label set of every vertex is partitioned between all nodes. Queries are broadcasted to all nodes and individual responses of the nodes are reduced using MPI_MIN to obtain the shortest distance. It has highest space-eciency. • QueryingwithDistributedOverlappingLabels(QDOL)! In this mode, we divide the vertex setV into partitions. 
For every possible partition pair, a node is assigned to store entire label set of vertices in that pair. Thus, a given query is answered only by a single node but not by every node. Queries are 70 P2P communicated to respective nodes and who return the responses. It provides high throughput for batched queries. For a cluster ofq nodes, can be computed as follows: 2 =q =) = 1 + p 1 + 8q 2 3.6 ExperimentalEvaluation 3.6.1 Setup We conduct shared-memory experiments on a 36 core, 2-way hyperthreaded, dual-socket linux server with two Intel Xeon E5-2695 v4 processors@ 2.1GHz and 1TB DRAM; all 72 threads are used for labeling. For distributed-memory experiments, we use a 64-node cluster. Each node has an 8 core, 2-way hyperthreaded, Intel Xeon E5-2665@ 2.4GHz processor and 64GB DRAM; all 16 threads on each node are used for labeling. We use OpenMP v4.5 for multithreading within a node and OpenMPI v3.1.2 for parallelization across multiple nodes. Baselines: We use sequential PLL (seqPLL [9]) and a recently proposed paraPLL framework [95] for labeling time comparison. ParaPLL simply constructs multiple SPTs in PLL concurrently, with dynamic task allocation and rank order scheduling similar to LCC (section 3.3.1). It provides both shared-memory (Spara- PLL) and distributed-memory (DparaPLL) implementations. Similar to DGLL (section 3.4.1), DparaPLL circularly divides tasks amongst nodes compute nodes and utilizes shared-memory parallelism on each node. We also report the performance of DGLL for eective comparison. Both DGLL and DparaPLL synchronize log 8 n times to exchange labels. However, unlike DGLL, number of trees between two synchronizations is uniform in DparaPLL. 71 Datasets: We evaluate our algorithms on 12 real-world graphs with varied topologies (high-dimensional (complex) graphs vs low-dimensional road networks, uniform-degree vs power-law social networks), as listed in table 3.3 † . The rankingR is determined by degree for scale-free networks [9] and betweenness for road networks [76] ‡ . As opposed to complex graphs, road networks typically exhibit smaller label size and ecient early termination in PLaNT due to the high betweenness of highways that cover most of the shortest paths. Table 3.3: Hierarchical Hub Labeling: datasets for evaluation. Dataset n m Description CAL[33] 1,890,815 4,657,742 California Road network EAS[33] 3,598,623 8,778,114 East USA Road network CTR[33] 14,081,816 34,292,496 Center USA Road network USA[33] 23,947,347 58,333,344 Full USA Road network SKIT[121] 192,244 636,643 Skitter Autonomous Systems WND[68] 325,729 1,497,134 Univ. Notre Dame webpages AUT[68] 227,320 814,134 Citeseer Collaboration YTB[68] 1,134,890 2,987,624 Youtube Social network ACT[68] 382,219 33,115,812 Actor Collaboration Network BDU[68] 2,141,300 17,794,839 Baidu HyperLink Network POK[68] 1,632,803 30,622,564 Social network Pokec LIJ[68] 4,847,571 68,993,773 LiveJournal Social network Implementation Details: We vary the synchronization threshold in GLL and switching threshold th in the Hybrid algorithm to empirically assess their impact on the performance. Figure 3.7 shows the impact of on GLL’s performance. Execution time of GLL is robust to signicant variations in within a range of 2 to 32. Intuitively, a small value of reduces cleaning time (section 3.3.2) but making it too small can lead to frequent synchronizations that hurt parallel performance. Based on our observations, we set = 4. Figure 3.8 shows the eect of th on the hybrid algorithm. 
Intuitively, a very large value of th increases computational overhead (seen in scale-free networks), because even low-ranked SPTs that generate few † Scale-free networks did not have edge weights from the download sources. For each edge, we choose an integral weight value between[1; p n) uniformly at random. ‡ Most vertices in road networks have a small degree and their importance cannot be identied by their degree. 72 1 4 16 64 256 1024 1 4 16 64 256 Execution Time (seconds) GLL Execution Time vs α CTR BDU CAL SKIT ACT YTB EAS AUT Figure 3.7: Execution time of GLL vs synchronization threhsold. labels are PLaNTed. On the other hand, keeping th too small results in poor scalability (seen in road networks) as the execution switches to DGLL quickly and PLaNT algorithm remains underutilized. Based on our observations, we set th = 100 for scale-free networks and th = 500 for road networks. Ψ ℎ 4 16 64 256 1024 4096 16 128 1024 8192 Execution Time (s) Road Networks CTR CAL EAS 1 4 16 64 256 16 256 4096 Execution Time (s) Scale-free Networks BDU SKIT ACT YTB AUT Ψ ℎ Figure 3.8: Execution time of Hybrid algorithm (on 16 nodes) vs switching threshold th . 3.6.2 Results 3.6.2.1 Shared-memoryAlgorithms Table 3.4 compares the performance of GLL, LCC, LCC-I (Label Construction in LCC), SparaPLL which is equivalent to LCC-I without rank querying, and seqPLL. It also shows the Average Label Size per vertex 73 Table 3.4: Performance comparison of GLL, LCC and baselines. Time=1 implies that execution did not nish in 4 hours. SparaPLL CHL ALS seqPLL Time(s) LCCTime(s) GLL Time(s) Dataset ALS Time(s) LCC-I Total CAL 108:3 51:2 83:4 215 26 41:4 35:4 EAS 138:1 116:3 116:8 680:6 73:8 108:7 88 CTR 178:7 424:2 160:9 5045 415 664:1 567:7 USA 185:6 816:9 166:1 1 715 1149 834 SKIT 88:3 2:5 85:1 95:8 3 4:85 3:9 WND 39:6 2:4 23:5 21:98 1:9 2:94 2:1 AUT 240:2 10:4 229:6 670 10 18:4 14:6 YTB 208:9 69:6 207:5 2693 73:8 126:7 104:6 ACT 376:1 112:4 366:3 2173 114:9 151:3 141:9 BDU 100:1 103:1 90:7 4736 108 133:9 99:9 POK 2243 4159 2231 1 4213 8748 3917 LIJ 1 1223 1 1 1 1 (ALS) for CHL (output of GLL, LCC and seqPLL), and the labeling generated by SparaPLL. Query response time is directly proportional to labeling size and hence, it is a crucial metric for evaluating any hub labeling algorithm § . To understand the individual benets of parallelism and rank queries, we compare the execution times of seqPLL, SparaPLL (which is essentially parallel PLL without rank queries) and Label Construction in LCC (LCC-I). The advantage of pure parallelism is evident in the substantial speedup achieved by SparaPLL over seqPLL. However, labeling size of SparaPLL is noticeably larger than CHL (see CAL, EAS, WND). Rank querying improves pruning during parallel execution as can be seen by comparing the execution times of LCC-I and SparaPLL for these datasets. Note that the (potential) performance improvement is a side-eect of rank queries. The primary motivation for rank querying was to ensure that the labeling respectsR, so that redundant labels can be removed later. We observe that on average, GLL generates 17% less labels than SparaPLL. Its execution time is comparable to SparaPLL, even though it re-veries every label using the cleaning queries. For some graphs such as CAL, GLL is even 1:3 faster than SparaPLL. This is primarily because rank queries and § For LIJ, the labeling size of CHL was obtained from distributed algorithms since none of the shared-memory algorithms nished execution. 
74 interleaved cleaning that limit graph exploration overhead and label size, respectively, resulting in faster distance-pruning queries. GLL also exhibits good parallel scalability with up to 47 speedup over sequential PLL. 0 0.5 1 1.5 2 CAL EAS CTR USA SKIT WND AUT YTB ACT BDU POK LCC and GLL Execution Time Breakup GLL Construct LCC Construct GLL Clean LCC Clean Figure 3.9: Time taken for label construction and cleaning in LCC and GLL (normalized by the total execution time of GLL). Figure 3.9 shows execution time breakup for LCC and GLL. GLL cleaning is signicantly faster than LCC because of the reduced cleaning complexity (section 3.3.2). Overall, GLL is 1:3 faster than LCC on average. However, for some graphs such as CAL, fraction of cleaning time is> 30% even for GLL. When parallelized overp threads, the initialp SPTs in the rst superstep of GLL generate a large number of labels as there are no prior labels available for distance query pruning in these SPTs. Many of these labels are redundant and require signicant time to clean. 3.6.2.2 Distributed-memoryAlgorithms To assess the scalability of distributed hub labeling algorithms, we vary the number of nodesq from 1 to 64 (number of compute cores from 8 to 512). Figure 3.10 shows the strong scaling of dierent algorithms in terms of labeling construction time. We note that both DparaPLL and DGLL do not scale well withq. DparaPLL often runs out-of-memory whenq is large. This is because in the rst superstep itself, a large number of hub labels are generated 75 10 100 1000 10000 8 32 128 512 CTR 10 100 1000 8 32 128 512 Time(s) USA 100 1000 10000 100000 8 32 128 512 # Compute Cores POK 100 1000 10000 8 32 128 512 # Compute Cores LIJ 1 10 100 1000 8 32 128 512 Time (s) CAL 1 10 100 1000 10000 8 32 128 512 EAS 1 10 100 1000 8 32 128 512 SKIT 1 10 100 1000 8 32 128 512 WND 10 100 1000 8 32 128 512 Time (s) YTB 1 10 100 1000 8 32 128 512 AUT 10 100 1000 10000 100000 8 64 512 ACT 10 100 1000 10000 8 64 512 Time (s) # Compute Cores BDU Figure 3.10: Execution time of DparaPLL, DGLL, PLaNT and Hybrid algorithms. 76 that when broadcasted, overwhelm the memory of individual nodes. DGLL, on the other hand, limits the amount of labels per superstep by synchronizing relatively frequently in the earlier stages of execution. Moreover, all algorithms proposed in this chapter partition the labels and distribute them across the nodes. Thus, they require signicantly less memory than DparaPLL. 10 100 1000 8 32 128 512 # Labels per vertex CAL 100 1000 8 32 128 512 EAS DparaPLL ALS Hybrid ALS 10 100 1000 10000 8 32 128 512 # Labels per vertex SKIT 10 100 1000 10000 8 32 128 512 WND 100 1000 10000 8 32 128 512 AUT 100 1000 8 32 128 512 # Labels per vertex # Compute Cores YTB 100 1000 10000 8 32 128 512 # Compute Cores ACT 100 1000 10000 8 32 128 512 # Compute Cores BDU Figure 3.11: Size of labeling generated by DparaPLL and the Hybrid/PLaNT/DGLL algorithms. Moreover, label size of DparaPLL explodes as number of nodesq increases (gure 3.11), partly due to the absence of rank queries. In contrast, our algorithms (DGLL, PLaNT and Hybrid) generate the same labeling - CHL and have the same label size, irrespective ofq. Ecacy of distance query based pruning in DparaPLL suers because on every node, labels from several high-ranked hubs are missing in between the synchronization points. Increase in label size further slows down the distance querying and the labeling process. 
77 Rank queries in DGLL allow pruning even at those hubs whose SPTs were not created on the querying node. Further, it periodically cleans redundant labels to retain the performance of distance pruning queries. Yet, DGLL incurs signicant communication and does not scale well. Neither DparaPLL nor DGLL can process the large CTR, USA, POK and LIJ datasets, either running out-of-memory or exceeding the time limit. PLaNT on the other hand, paints a completely dierent picture. Owing to its embarrassingly parallel nature, PLaNT exhibits excellent near-linear speedup up toq = 64 for almost all datasets. On 64 nodes, PLaNT achieves an average 42 speedup over single node execution. However, for scale-free graphs, PLaNT is computationally inecient. It cannot process LIJ and takes> 1 hr to process POK even on 64 nodes. The Hybrid algorithm combines the scalability of PLaNT with the pruning eciency of DGLL (powered by common Labels). It scales well up toq = 64 and for most datasets, achieves> 10 speedup over single node execution. Furthermore, it outperforms shared-memory parallel execution (GLL) with just 2 to 4 nodes. For scale-free datasets ACT, BDU and POK, it is able to construct CHL 7:8, 26:2 and 5:4 faster than PLaNT, respectively, on 64 nodes. Compared to DparaPLL, the Hybrid algorithm is 3:8 faster on average when run on 2 compute nodes. For SKIT and WND, the Hybrid algorithm is 47 and 96:8 faster, respectively, than DparaPLL on 16 nodes. When processing scale-free datasets on small number of nodes, Hybrid beats PLaNT by more than an order of magnitude dierence in execution time. On a single node, GLL is faster than Hybrid and DGLL algorithms, as it does not incur the overheads associated with distributed computation, extra graph traversal in PLaNTed Trees and global table search for cleaning queries. However, Hybrid algorithm (1) outperforms GLL for all graphs after just 4 nodes, and (2) can process large graphs, such as LIJ, using multiple nodes. We also observe superlinear speedup in some cases (for eg. Hybrid algorithm on CAL and EAS – 1 node vs 4 nodes). This is because running on few nodes puts stress on the memory manager due to frequent 78 memory (re-)allocations for per-vertex dynamic label arrays. In such cases, increasing number of nodes releases the pressure on memory manager, resulting in a super linear speedup. Graph Topologies: We observe that PLaNT alone not only scales well but is also extremely ecient for road networks. On the other hand, in scale-free networks, PLaNT although scalable is not ecient as it incurs large overhead of additional exploration. This is consistent with our observations in gure 3.5 where the maximum value of for SKIT was> 10 that of maximum in CAL dataset. The Hybrid algorithm cleverly manages the trade-o between additional exploration and communication avoidance, and is signicantly faster than PLaNT for most scale-free networks. However, it does not scale equally well for small datasets. This is because even few synchronizations of large number of nodes completely dominate their small labeling time. 3.6.2.3 EvaluatingQueryModes Table 3.5: Querying Throughput, Latency and Memory Consumption on 16 nodes. 
Dataset Throughput(millionqueries/s) Latency(sperquery) MemoryUsage(GB) QLSN QFDL QDOL QLSN QFDL QDOL QLSN QFDL QDOL CAL 10.1 12.1 29.6 2.8 22.3 8.4 43.8 2.4 13.7 EAS 7.1 8.9 14.6 3.6 24 11.4 125.4 7.4 39.2 CTR - 6.5 9 - 26.6 14.7 - 45 242.1 USA - 5.4 10 - 29.5 20 - 80 413.3 SKIT 15.8 18.5 29.8 1 20.7 7.9 4.5 0.3 1.4 WND 37.5 19.6 42.7 0.3 22.7 7.1 0.6 0.1 0.6 AUT 4.9 9.9 27.5 3.7 21.7 12.9 16.6 1 5.2 YTB 10.4 23.3 30.3 2.2 23.9 13.6 74.9 4.6 23.4 ACT 3.2 10.4 21.3 4.8 22.8 18.1 46.1 2.8 14.4 BDU 13.2 16.4 21.5 1.5 22.1 11.1 54.7 3.2 17.1 POK - 5.1 7.5 - 32 34.5 - 77.6 388.9 LIJ - 6 - - 31.6 - - 125.8 - In this section, we assess the dierent query modes on the basis of their memory consumption, query response latency and query processing throughput. 79 Table 3.5 shows the memory consumed by label storage under dierent modes. QLSN requires all labels to be stored on every node and is the most memory hungry mode. Both QDOL and QFDL distribute the labels across multiple nodes enabling queries on large graphs where QLSN fails. Our experiments also conrm the theoretical insights into the memory usage of QFDL and QDOL presented in section 3.5. On average, QDOL requires 5:3 more main memory for label storage than QFDL. This is because the label size per partition in QDOL scales withO 1= p q and every compute node has to further store label set of 2 such partitions. To evaluate the latency of various query modes, we generate 1 million random PPSD queries and compute their response one at a time. In QFDL (QDOL) mode, one query is transmitted per MPI_Broadcast (MPI_Send, respectively) and inter-node communication latency becomes a major contributor to the overall query response latency. This is evident from the results (table 3.5) where latency of QFDL shows little variation across dierent datasets. In contrast, QLSN does not incur inter-node communication and compared to QDOL and QFDL, has signicantly lower latency although it increases proportionally with labeling size. For most datasets, QDOL latency is< 2 compared to QFDL, because of the cheaper point-to- point communication as opposed to more expensive broadcasts (section 3.5). An exception is POK, where average label size is huge (table 3.4) and QFDL takes advantage of multi-node parallelism to reduce latency. To evaluate the query throughput, we create a batch of 100 million random PPSD queries and compute their responses in parallel. For most datasets, the added multi-node parallelism of QFDL and QDOL ¶ overcomes the query communication overhead and results in higher throughput than QLSN. QDOL is further 1:8 faster than QFDL because of reduced communication overhead ∥ . ¶ Results for QDOL mode also include the time taken to sort the queries and reorder the responses. ∥ QDOL has better memory access locality as every node scans all labels of vertices in the assigned queries. In contrast, each node in QFDL scans a part of labels for all queries, frequently jumping from one vertex’s labels to another. 80 3.7 RelatedWork Hub labeling for shortest distance computation has been heavily studied in the last decade. Since its inception (credited to Cohen et al.[28]), several labelings and corresponding algorithms have been developed [28, 4, 5, 9, 10, 8, 45, 65, 3, 75]. In particular, Canonical Hub Labeling (CHL) [4] is a popular labeling in which the labels cover all shortest paths in a graph (complete cover), allowing arbitrary query responses within microseconds. 
Network hierarchies based on betweenness [4] and degree [9] extend the applicability of CHL to variety of graphs. With the use of network hierarchies based on betweenness [3, 4, 13] and degree [9, 65], the applicability of CHL has been extended to road networks as well as complex graphs such as social networks. Delling et al.[31] propose an optimized greedy hub labeling algorithm which produces smaller labels than CHL, but is limited to small graphs due to high complexity of the algorithm. Abraham et al. develop a series of CHL construction algorithms primarily intended for road networks [3, 5, 2, 4, 32]. Their algorithms are based on the notion of Contraction Hierarchies built by shortcutting vertices in rank order. A vertex’s hub labels are generated by merging the label sets of neighbors [4] (or from reachable vertices in a CH search [5]) in the augmented graph. However, Akiba et al. show that this approach can be prohibitively expensive due to the cost of building (and searching) augmented graphs and merging neighbors’ labels therein [9, 76]. They propose the state-of-the-art sequential algorithm PLL [9], that pushes a vertex as a hub into the labels of vertices reachable in a pruned search. Yet, even with PLL, labeling large weighted graphs using a single thread is often infeasible due to the high execution time [37, 95]. Moreover, the label size of a complete cover such as CHL, can be much larger than the graph size. To enable better scalability in terms of graph size, several partial labeling approaches have been developed [10, 45]. IS-label [45] is a well-known labeling technique that builds an increasing hierarchy of independent sets and augmented graphs obtained by removing those sets. Hub labels are constructed in top-down order of the hierarchy (of augmented graphs). The total label size can be controlled by stopping 81 the hierarchy at any desired level. PPSD computation requires querying on these labels along with traversal on the highest augmented graph in the hierarchy. Similarly, [10] also generates labels only from the dense core structure of a graph, to assist graph exploration. Querying on partial labels requires additional graph traversal and hence, their response time can be orders of magnitude larger than a complete labeling [65, 76]. Moreover, IS-label does not guarantee small label sizes and is not applicable to road networks [76]. As shown in [65], IS-label execution time can be higher than PLL or the disk-based 2-hop doubling algorithm of [65]. ParallelHubLabeling: Parallelization oers great avenues to scale the performance and input size of a computationally intensive graph application like HHL. Specically, distributed-memory parallelism allows modular expansion of the DRAM capacity and the computing resources in a system, by simply adding more nodes in the cluster. Most of the existing techniques [43, 37, 95] parallelize PLL [9] using multiple threads on a single node. Ferizovic et al.[43] and Qiu et al.[95] dynamically allocate root vertices to threads for concurrent SPT con- struction. While this approach benets from the dynamic rank order task scheduling, it may not respectR as parallel Dijkstra instances can nish at varying times, allowing low-ranked SPTs to generate labels that prune high-ranked SPTs. The label size can also be signicantly larger than CHL if the number of threads is large. Dong et al.[37] proposed a hybrid scheme for weighted graphs, utilizing parallel Bellman ford for the large SPTs, and concurrent dijkstra instances for small SPTs. 
The parallel performance of this approach is dependent on graph topology, and could be even worse than sequential PLL for high diameter networks where Bellman Ford takes a large number of iterations to converge. Li et al.[75] propose Parallel Shortest distance Labeling (PSL) algorithm which replaces the root order label dependency of PLL with a distance label dependency. PSL is ecient on unweighted small-world networks with small diameter. In contrast, we target a generalized problem of labeling weighted graphs with arbitrary diameters, where these approaches 82 are either not applicable or not eective. Furthermore, all of these approaches use label-based pruning and are not scalable for distributed-memory computation. paraPLL also provides a distributed-memory implementation (DparaPLL) that statically divides the tasks (root IDs) to multiple nodes in a circular manner. Each node then runs the multi-threaded version on the assigned vertices, periodically synchronizing with other nodes to exchange generated labels for future pruning. DparaPLL generates large amount of label trac and requires storingall hub labels on every node. Hence, it does not scale well and its capability is limited by the memory of a single node instead of the cluster. Further, DparaPLL generates substantial amount of redundant labels due to (1) massive parallelism in clusters, and (2) partial labels available for pruning on each node in-between successive synchronization points. [9, 75] propose label size reduction methods. The bit-parallel labels in [9] store distances from multilpe hubs within a single word. The Local Minimum Set Elimination [75] does not generate labels for vertices with minimal rank in local neighborhood. Our approach is complimentary and can be amalgamated with such techniques. 3.8 ChapterSummary In this chapter, we proposed novel hub labeling algorithms that eectively utilize the parallelism in shared and distributed memory systems to scale the performance of labeling. We observed that the conventional approach of parallelizing each shortest path tree construction scales poorly. Therefore, we exploited the parallelism available at a coarser task granularity – across the trees with each tree as an individual task. We developed novel techniques to mitigate the eects of sequential dependencies between SPTs on the performance of a inter-tree parallel algorithm. Our embarrassingly parallel algorithm PLaNT is the rst distributed-memory algorithm that enables collaborative memory usage by partitioning the labels across multiple nodes, while simultaneously ensuring high parallel scalability. On a 64 node cluster, we observed 83 that our algorithms outperform the state-of-the-art by more than an order of magnitude improvement in execution time and graph size that can be labeled feasibly. Compared to the best shared-memory algorithm running on a single node, our distributed-memory solution based on PLaNT achieved up to 9:5 speedup on 64 nodes. 84 Chapter4 ParallelBipartiteGraphDecomposition 4.1 Motivation In this chapter, we discuss another complex graph application – Bipartite Graph Decomposition, which is used to mine dense bipartite subgraphs. Similar to HHL (chapter 3), there are sequential dependencies in decomposition algorithms that make it challenging to parallelize. Recall that a bipartite graphG(W = (U;V );E) is a special graph whose vertices can be partitioned into two disjoint setsU(G) andV (G) such that any edgee2E(G) connects a vertex from setU(G) with a vertex from setV (G) (section 1.3.1). 
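As an illustration of this definition, a bipartite graph can be stored with one CSR adjacency structure per vertex side. The sketch below is a hypothetical layout for reference only, not the representation used by the algorithms in this chapter.

// A minimal CSR-style container for a bipartite graph G(W=(U,V),E): the two
// sides keep separate adjacency arrays, and every edge (u,v) appears once in
// each side's list.
#include <cstdint>
#include <vector>

struct BipartiteCSR {
  uint32_t nU = 0, nV = 0;                 // |U(G)| and |V(G)|
  std::vector<uint64_t> offU, offV;        // CSR row offsets, sizes nU+1 and nV+1
  std::vector<uint32_t> adjU, adjV;        // adjU: neighbors in V of each u; adjV: neighbors in U of each v

  uint64_t degU(uint32_t u) const { return offU[u + 1] - offU[u]; }  // d_u
  uint64_t degV(uint32_t v) const { return offV[v + 1] - offV[v]; }  // d_v
};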
Several real-world systems naturally exhibit bipartite relationships, such as consumer-product purchase network of an e-commerce website [63], user-ratings data in a recom- mendation system [59, 77], group memberships in a social network [85] etc. Due to the rapid growth of data produced in these applications, ecient mining of dense structures in bipartite graphs has become a popular research topic [132, 130, 145, 108, 109, 72]. Truss decomposition is commonly used to mine hierarchical dense subgraphs where minimum trian- gle (3-clique) participation of an edge in a subgraph determines its level in the hierarchy [106, 46, 103, 127, 107, 20, 135]. However, it is not directly applicable on bipartite graphs as they do not have triangles. 85 While it can be applied on the unipartite projection of a bipartite graph, this approach suers from (a) in- formation loss which can impact quality of results, and (b) explosion in dataset size which can restrict its scalability [108]. A buttery (2; 2biclique/quadrangle) is the smallest non-trivial cohesive motif in bipartite graphs. Butteries can be used to directly analyze bipartite graphs and have drawn signicant research interest in the recent years [128, 104, 105, 109, 132, 60, 108, 129]. Sariyuce and Pinar [108] use butteries as a density indicator to dene the notion of kwings and ktips, as maximal bipartite subgraphs where each edge and vertex, respectively, is involved in at leastk butteries. For example, the graph shown in gure 4.1a is a 1wing since each edge participates in at least one buttery. Analogous tok trusses [29], kwings (ktips) represent hierarchical dense structures in the sense that a (k + 1)wing ((k + 1)tip) is a subgraph of akwing (ktip). In this chapter, we explore parallel algorithms forwing ∗ andtipdecomposition analytics, that construct the entire hierarchy ofkwings andktips in a bipartite graph, respectively. For space-ecient representation of the hierarchy, these analytics output wing number of each edgee or tip number of each vertexu, which represent the densest level of hierarchy that containse oru, respectively. Wing and tip decomposition have several real-world applications such as: • Link prediction in recommendation systems or e-commerce websites that contain communities of users with common preferences or purchase history [60, 74, 89, 38]. • Mining nested communities in social networks or discussion forums, where users aliate with broad groups and more specic sub-groups based on their interests. [60]. • Detecting spam reviewers that collectively rate selected items in rating networks [86, 42, 77]. • Document clustering by mining co-occurring keywords and groups of documents containing them [34]. ∗ kwing and wing decomposition are also known askbitruss and bitruss decomposition, respectively. 86 • Finding nested groups of researchers from author-paper networks [108] with varying degree of collabo- ration. e 10 e 9 e 6 u 1 u 2 u 3 u 4 u 0 v 1 v 2 v 3 v 4 v 0 e 3 e 4 e 5 e 11 e 13 e 12 e 14 e 15 e 8 e 2 e 1 e 0 e 7 u 1 u 2 u 3 u 4 v 1 v 2 v 3 v 4 u 2 u 3 u 4 v 1 v 2 v 3 v 4 u 2 u 3 u 4 v 2 v 3 v 4 Edge ID U V (a) 2-wing 3-wing 4-wing (b) Figure 4.1: (a) Bipartite graphG(U;V;E) which is also a 1-wing. (b) Wing decomposition hierarchy of G(U;V;E) – edges colored blue, red, green and black have wing numbers of 1, 2, 3 and 4, respectively. 
Existing algorithms for decomposing bipartite graphs typically employ an iterative bottom-up peeling approach [108, 109], wherein entities (edges and vertices for wing and tip decomposition, respectively) with the minimum support (buttery count) are peeled in each iteration. Peeling an entityl involves deleting it from the graph and updating the support of other entities that share butteries withl. However, the huge number of butteries in bipartite graphs makes bottom-up peeling computationally demanding and renders large graphs intractable for decomposing by sequential algorithms. For example, trackers – a bipartite 87 network of internet domains and the trackers contained in them, has 140 million edges but more than 20 trillion butteries. Parallel computing is widely used to scale such high complexity analytics to large datasets [94, 115, 69]. However, bottom-up peeling severely restricts parallelism by peeling entities in a strictly increasing order of their entity numbers (wing or tip numbers). As a result, the existing parallel frameworks [109] can only exploit parallelism within each peeling iteration, which peels a subset of entities in a single level of the hierarchy. Since it takes a very large number of iterations to peel an entire graph (for example it takes > 31 million iterations to peel all edges of the trackers dataset), this approach suers from heavy thread synchronization and poor parallel scalability. In this chapter, we develop algorithms to exploit parallelism at a coarser granularity – across the levels of decomposition hierarchy. Algorithms that can concurrently process entities in multiple levels can potentially reduce synchronization and enjoy much higher workload for ecient parallelization. However, in bottom up peeling, the choice of entities to peel in each iteration is sequentially dependent on support updates from previous iterations. In this chapter, we develop novel algorithms to address this challenge. 4.2 Background In this section, we formally dene the problem statement and review existing methods for buttery counting and bipartite graph decomposition. Note that counting is used to initialize support (running count of butteries) of each vertex or edge before peeling, and also inspires some optimizations to improve eciency of decomposition. Table 4.1 lists some notations. For description of a general approach, we use the term entity to denote a vertex (for tip decomposition) or an edge (for wing decomposition), and entity number to denote tip or wing number (section 4.2.3), respectively. Correspondingly, notations./ l and l denote the support and entity number of entityl. 88 Table 4.1: Bipartite Graph Decomposition: notations G(W = (U;V );E) bipartite graphG with disjoint vertex setsU(G) andV (G), and edgesE(G) n=m no. of vertices inG i.e.n =jWj / no. of edges inG i.e.m =jEj arboricity ofG [27] N u =d u neighbors of vertexu / degree of vertexu ./ G u /./ G e no. of butteries inG that contain vertexu / edgee ./ u /./ e support (runing count of butteries) of vertexu / edgee ./ U /./ E support vector of all vertices in setU(G) / all edges inE(G) u / e tip number of vertexu / wing number of edgee max U / max E maximum tip number of vertices inU(G) / maximum wing number of edges inE(G) P number of vertex/edge partitions created by PBNG T number of threads 4.2.1 Butterycounting A buttery (2,2-bicliques/quadrangle) can be viewed as a combination of two wedges with common endpoints. 
For example, in gure 4.1a, both wedges (e 0 ;e 1 ) and (e 2 ;e 3 ) have end pointsv 0 andv 1 , and form a buttery. A simple way to count butteries is to explore all wedges and combine the ones with common end points. However, this is computationally inecient with complexityO P u2U P v2Nu d v = O P v2V d 2 v (if we use vertices inU as end points). Chiba and Nishizeki [27] developed an ecient vertex-priority quadrangle counting algorithm in which starting from each vertexu, only those wedges are expanded whereu has the highest degree. It has a theoretical complexity ofO P (u;v)2E min (d u ;d v ) =O (m), which is state-of-the-art for buttery counting. Wang et al.[132] further propose a cache-ecient version of this algorithm that traverses wedges such that the degree of the last vertex is greater than the that of the start and middle vertices (algorithm 10, line 10). Thus, wedge explorations frequently end at a small set of high degree vertices that can be cached. The vertex-priority algorithm can be easily parallelized by concurrently processing multiple start vertices [109, 132]. In PBNG, we use the per-vertex and per-edge counting variants of the parallel al- gorithm [109, 132], as shown in algorithm 10. To avoid conicts, each thread is provided an individual n-elementwedge_count array (algorithm 10, line 5) for wedge aggregation, and buttery counts of entities are incremented using atomic operations. 89 Algorithm10 Counting per-vertex and per-edge butteries (pveBcnt) Input: Bipartite GraphG(W = (U;V );E) Output: Buttery counts./ u for eachu2W (G), and./ e for eache2E(G) 1: ./ u 0 for eachu2W (G) 2: ./ e 0 for eache2E(G) . Initialization 3: Relabel verticesW (G) in decreasing order of degree . Priority assignment 4: foreachu2W (G)doinparallel 5: SortN u in increasing order of new labels 6: foreach vertexstart2W (G)doinparallel 7: Initialize hashmapwedge_count to all zeros 8: Initialize an empty wedge setnzw 9: foreach vertexmid2N start 10: foreach vertexlast2N mid 11: if (lastmid) or (laststart)then break 12: wedge_count[last] wedge_count[last] + 1 13: nzw nzw[ (mid;last) 14: foreachlast such thatwedge_count[last]> 1 .Per-vertexcounting 15: bcnt wedge_count[last] 2 16: ./ start ./ start +bcnt 17: ./ last ./ last +bcnt 18: foreach (mid;last)2nzw 19: ./ mid ./ mid + (wedge_count[last] 1) 20: foreach (mid;last)2nzw wherewedge_count[last]> 1 .Per-edgecounting 21: Lete 1 ande 2 denote edges (start;mid) and (mid;last), respectively 22: ./ e 1 ./ e 1 + (wedge_count[last] 1) 23: ./ e 2 ./ e 2 + (wedge_count[last] 1) 90 4.2.2 BipartiteGraphDecomposition Sariyuce et al.[108] introducedktips andkwings as a buttery dense vertex and edge-induced subgraphs, respectively. They are formally dened as follows: Denition4. A bipartite subgraphHG, induced on edgesE(H)E(G), is ak-wing i • each edgee2E(H) is contained in at least k butteries, • any two edges (e 1 ;e 2 )2E 0 is connected by a series of butteries, • H is maximal i.e. no otherkwing inG subsumesH. Denition5. A bipartite subgraphHG, induced on vertex setsU(H)U(G) andV (H) =V (G), is a k-tip i • each vertexu2U(H) is contained in at least k butteries, • any two verticesfu;u 0 g2U(H) are connected by a series of butteries, • H is maximal i.e. no otherktip inG subsumesH. Bothkwings andktips are hierarchical as akwing/ktip completely overlaps with ak 0 wing/k 0 tip for allk 0 k. Therefore, instead of storing allkwings, a wing number e of an edgee is dened as the maximumk for whiche is present in akwing. 
Similarly, tip number u of a vertexu is the maximumk for whichu is present in aktip. Wing and tip numbers act as a space-ecient indexing from which any level of thekwing andktip hierarchy, respectively, can be quickly retrieved [108]. In this chapter, we study the problem of nding wing and tip numbers, also known as wing and tip decomposition, respectively. 4.2.3 Bottom-UpPeeling(BUP) Bottom-Up Peeling (BUP) is a commonly employed technique to compute wing and tip decomposition (al- gorithm 11). For wing decomposition, it initializes the support of each edge using per-edge buttery 91 counting (algorithm 11, line 1), and then iteratively peels the edges with minimum support until no edge remains. When an edgee2E is peeled, its support in that iteration is recorded as its wing number (al- gorithm 11, line 4). Further, for every edgee 0 that shares butteries withe, the support./ e 0 is decreased corresponding to the removal of those butteries. Thus, edges are peeled in a non-decreasing order of wing numbers. Bottom-up peeling for tip decomposition utilizes a similar procedure for peeling vertices, as shown in algorithm 12. Acrucialdistinction here is that in tip decomposition, vertices in only one of the setsU(G) or V (G) are peeled, as aktip consists of all vertices from the other set (defn.5). For clarity of description, we assume thatU(G) is the vertex set to peel. As we will see later in section 4.3.2, this distinction renders our two-phased peeling approach easily adaptable for tip decomposition. Execution time of bottom-up peeling is dominated by wedge traversal required to nd butteries incident on the entities being peeled (algorithm 11, lines 7-9). The overall complexity for wing decomposition is O P (u;v)2E(G) P v2Nu d v =O P u2U(G) P v2Nu d u d v . Tip decomposition has a relatively lower complexity ofO P u2U(G) P v2Nu d v =O P v2V(G) d 2 v , which is still quadratic in vertex degrees and very high in absolute terms. Algorithm11 Wing decomposition using bottom-up peeling (BUP) Input: Bipartite graphG(W = (U;V );E) Output: Wing numbers e 8e2E(G) 1: ./ E pveBcnt(G) . Counting for support initialization (algorithm 10) 2: whileE(G)6= fgdo . Peeling 3: e arg min e2E(G) f./ e g 4: e ./ e ; E(G) E(G)nfeg 5: update(e; e ;./ E ;G) 6: functionupdate(e = (u;v); e ;./ E ;G) 7: foreachv 0 2N u nfvg . Find butteries 8: Lete 1 denote edge (u;v 0 ) 9: foreachu 0 2N v 0nfug such that (u 0 ;v)2E(G) 10: Lete 2 ande 3 denote edges (u 0 ;v) and (u 0 ;v 0 ), respectively 11: ./ e 1 max e ; ./ e 1 1 . Update support 12: ./ e 2 max e ; ./ e 2 1 13: ./ e 3 max e ; ./ e 3 1 92 Algorithm12 Tip decomposition using bottom-up peeling (BUP) Input: Bipartite graphG(W = (U;V );E) Output: Tip numbers u 8u2U(G) 1: f./ U ;./ V g pveBcnt(G) . Counting for support initialization (algorithm 10) 2: whileU(G)6= fgdo . Peel 3: letu2U(G) be the vertex with minimum support./ u 4: u ./ u ; U(G) U(G)nfug 5: update(u; u ;./ U ;G) 6: functionupdate(u; u ;./ U ;G) 7: Initialize hashmapwedge_arr to all zeros 8: foreachv2N u 9: foreachu 0 2N v nfug 10: wedge_arr[u 0 ] wedge_arr[u 0 ] + 1 11: foreachu 0 2wedge_arr . Update Support 12: ./ u;u 0 wedge_arr[u 0 ] 2 13: ./ u 0 maxf u ; ./ u 0./ u;u 0g 4.2.4 Bloom-Edge-Index Chiba and Nishizeki [27] proposed that storing wedges derived from the computational patterns of their buttery counting algorithm, is a space-ecient representation of all butteries. 
Wang et al.[130] used a similar representation termed Bloom-Edge-Index (BE-Index) for quick retrieval of butteries containing peeled edges during wing decomposition. We extensively utilize BE-Index not just for computational eciency, but also for enabling parallelism in wing decomposition. In this subsection, we give a brief overview of some key concepts in this regard. The buttery counting algorithm assigns priorities (labels) to all vertices in a decreasing order of their degree (algorithm 10, line 2). Based on these priorities, a structure called maximal prioritykbloom, which is the basic building block of BE-Index, is dened as follows [130]: Denition6. A maximal prioritykbloomB(W = (U;V );E) is a (2;k)biclique (eitherU(B) orV (B) has exactly two vertices, each connected to allk vertices inV (B) orU(B), respectively) that satises the following conditions: 93 1. The highestpriority vertex inW (B) belongs to the set (U(B) orV (B)) which has exactly two vertices, and 2. B ismaximal i.e. there exists no (2;k)bicliqueB 0 such thatBB 0 G andB 0 satises condition 1. MaximalPriorityBloomNotations: The vertex set (U(B) orV (B)) containing the highest priority vertex is called the dominant set ofB. Note that each vertex in the non-dominant set has exactly two incident edges inE(B), that are said to be twins of each other in bloomB. For example, in the graph G 0 shown in gure 4.2, the subgraph induced onfv 0 ;v 1 ;u 0 ;u 1 g is a maximal priority 2bloom with v 1 2V (B) as the highest priority vertex and twin edge pairsfe 0 ;e 1 g andfe 2 ;e 3 g. The twin of an edgee in bloomB is denoted bytwin(e;B). The cardinality of the non-dominant vertex set of bloomB is called thebloomnumber ofB. Wang et al.[130] further prove the following properties of maximal priority blooms: Property 2. A kbloom B(W = (U;V );E) consists of exactly k 2 = k(k1) 2 butteries. Each edge e2 E(B) is contained in exactlyk 1 butteries inB. Further, edgee shares allk 1 butteries with twin(e;B), and one buttery each with all other edgese 0 2E(B)nftwin(e;B)g. Property3. A buttery inG must be contained in exactly one maximal prioritykbloom. Note that the butteries containing an edgee, and the other edges in those butteries, can be obtained by exploring all blooms that containe. For quick access to blooms of an edge and vice-versa, BE-Index is dened as follows: Denition7. BE-Index of a graphG(W = (U;V );E) is a bipartite graphI(W = (U;V );E) that links all maximal priority blooms inG to the respective edges within the blooms. • W(I)– Vertices inU(I) andV (I) uniquely represent all maximal priority blooms inG and edges in E(G), respectively. Each vertexB2U(I) also stores the bloom numberk B (I) of the corresponding bloom. 94 • E(I)– There exists an edge (e;B)2E(I) if and only if the corresponding bloomBG contains the edgee2E(G). Each edge (e;B)2E(I) is labeled withtwin(e;B). BE-IndexNotations: For ease of explanation, we refer to a maximal priority bloom as simply bloom. We use the same notationB (ore) to denote both a bloom (or edge) and its representative vertex in BE-Index. Neighborhood of a vertexB2 U(I) ande2 V (I) is denoted byN B (I) andN e (I), respectively. The bloom number ofB in BE-IndexI is denoted byk B (I). Note thatk B (I) = jN B (I)j 2 . e 10 e 9 e 6 u 1 u 2 u 3 u 0 v 1 v 2 v 0 e 3 e 4 e 5 e 2 e 1 e 0 e 0 e 2 e 4 e 6 e 1 e 3 e 5 e 10 e 9 B 0 B 1 e 0 e 1 e 3 e 2 e 4 e 3 e 6 e 5 e 10 e 9 (a) (b) Figure 4.2: (a) Bipartite graphG 0 (W = (U;V );E) (b) BE-Index ofG 0 with two maximal priority blooms. 
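Definition 7 maps naturally onto a pair of adjacency structures. A minimal C++ sketch of one possible in-memory layout is given below; the type and field names are assumptions for illustration, not the thesis implementation.

#include <vector>

// One possible layout for the BE-Index of Definition 7 (names assumed).
// Bloom vertices in U(I) keep their bloom number and their incident
// bloom-edge links, each labeled with the twin edge; edge vertices in V(I)
// keep the blooms that contain them.
struct BloomEdgeLink {
    int edge;   // e in E(G)
    int twin;   // twin(e, B) inside this bloom
};

struct BEIndex {
    std::vector<int> bloomNumber;                         // k_B(I) = |N_B(I)| / 2
    std::vector<std::vector<BloomEdgeLink> > bloomEdges;  // N_B(I) for every bloom B
    std::vector<std::vector<int> > edgeBlooms;            // N_e(I) for every edge e
};

Keeping twin(e,B) on the link itself is what lets a peeling step charge the k_B(I) - 1 butterflies shared with the twin in constant time (property 2), as algorithm 13 does.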
Figure 4.2 depicts a graphG 0 (subgraph ofG from gure 4.1) and its BE-Index. G 0 consists of two maximal priority blooms: (a)B 0 with dominant setV (B 0 ) =fv 0 ;v 1 g andk B 0 (I) = 2, and (b)B 1 with dominant vertex setV (B 1 ) =fv 1 ;v 2 g andk B 0 (I) = 3. As an example, edgee 3 is a part of 1 buttery in B 0 shared with twine 2 , and 2 butteries inB 1 shared with twine 4 . With all other edges inB 0 andB 1 , it shares one buttery each. ConstructionofBE-Index: Index construction can be easily embedded within the counting proce- dure (algorithm 10). Each pair of endpoint verticesfstart;lastg of wedges explored during counting, represents the dominating set of a bloom (withlast as the highest priority vertex) containing the edges fstart;midg andfmid;lastg for all midpointsmid. Lastly, for a given vertexmid, edgesfstart;midg andfmid;lastg are twins of each other. Thus, the space and computational complexities of BE-Index construction are bounded by the the wedges explored during counting which isO (m). 95 WingDecompositionwithBE-Index: Algorithm 13 depicts the procedure to peel an edgee using BE-IndexI. Instead of traversing wedges inG to nd butteries ofe, edges that share butteries withe are found by exploring 2-hop neighborhood ofe inI (algorithm 13, line 7). Number of butteries shared with these edges in each bloom is also obtained analytically using property 2 (algorithm 13, lines 4 and 8). Remarkably, peeling an edgee using algorithm 13 requires at most./ G e traversal in BE-Index [130]. Thus, it reduces the computational complexity of wing decomposition toO P e2E(G) ./ G e . However, it is still proportional to the number of butteries which can be enormous for large graphs. Algorithm13 Support update during edge peeling, using BE-Index 1: functionupdate(e; e ;./ E ; BE-IndexI(W = (U;V );E)) 2: foreach bloomB2N e (I) 3: e t twin(e;B), k B (I) bloom number ofB inI 4: ./ et max e ; ./ et (k B (I) 1) 5: E(I) E(I)nf(e;B); (e t ;B)g 6: k B (I) k B (I) 1 . Update bloom number 7: foreache 0 2N B (I) 8: ./ e 0 maxf e ; ./ e 01g . Update support 4.2.5 Challenges Bipartite graph decomposition is computationally very expensive and parallel computing is widely used to accelerate such workloads. However, state-of-the-art parallel framework ParButterfly [109, 36] is based on bottom-up peeling and only utilizes parallelism within each peeling iteration. This restricted parallelism is due to the following sequential dependency between iterations – support updates in an iteration guide the choice of entities to peel in the subsequent iterations. The scalability of bottom-up peeling based parallel approach is limited because: 1. It incurs large number of iterations and low parallel workload per iteration. Due to the resulting synchronization and load imbalance, intra-iteration parallelism is unable to provide substantial acceleration as shown in table 4.2. 96 Objective1 is therefore, to design a parallelism aware peeling methodology for bipartite graphs that reduces synchronization and exposes large amount of parallel workload. 2. It traverses an enormous amount of wedges (or bloom-edge links in BE-Index) to retrieve butteries removed by peeling. This is computationally expensive and can be infeasible on large datasets, even for a parallel algorithm. Objective2 is therefore, to reduce the amount of traversal in practice. Table 4.2: Parallel implementation of bottom-up peeling exhibits poor parallel scalability. 
ItU ItV DeV OrV LjV TrV Execution Time(s) SequentialBUP 3849 8.4 428 2297 200 5711 ParallelBUP 3677 8.1 377.7 1510 132.5 3524 #Iterations ParallelBUP 377904 10054 127328 334064 83423 342672 4.3 ParallelBipartiteNetworkPeelingframework In this section, we describe a generic parallelism friendly two-phased peeling approach for bipartite graph decomposition (targeting objective 1, section 4.2.5). We further demonstrate how this approach is adopted individually for tip and wing decomposition in our Parallel Bipartite Network peelinG (PBNG) framework. 4.3.1 Two-phasedPeeling The fundamental observation underlining our approach is that entity number l for an entityl only depends on the number of butteries shared between l and other entities with entity numbers no less than l . Therefore, given a graphG(W = (U;V );E) and per-entity buttery counts inG (obtained from counting), only the cumulative eect of peeling all entities with entity number strictly smaller than l , is relevant for computing l level ( l tip or l wing) in the decomposition hierarchy. Due to commutativity of addition, the order of peeling these entities has no impact on l level. 97 Graphaasa e 10 e 9 e 6 u 1 u 2 u 3 u 4 u 0 v 1 v 2 v 3 v 4 v 0 e 3 e 4 e 5 e 11 e 13 e 12 e 14 e 15 e 8 e 2 e 1 e 0 e 7 ( ( , ), ) GW U V E = G 1 u 1 u 2 u 3 u 0 v 1 v 2 v 0 G 2 u 2 u 3 u 4 v 1 v 2 v 3 v 4 1 {0,1,2} R = 1 0 1 4 { , ..., } E e e e = 2 {3,4,5} R = 1 5 6 15 { , ..., } E e e e = Edge Edge Coarse-grained Decomposition Coarse-grained Decomposition Representative Graphs Fine-grained Decomposition Edge Edge Figure 4.3: Graphical illustration of PBNG’s two-phased peeling for wing decomposition of the graphG from gure 4.1. The coarse-grained decomposition dividesE(G) intoP = 2 partitions using parallel peeling iterations. The ne-grained decomposition peels each partition using a single thread but concurrently processes multiple partitions. 98 This insight allows us to eliminate the constraint of deleting only minimum support entities in each iteration, which bottlenecks the available parallelism. To nd l level, all entities with entity number less than l can be peeled concurrently, providing sucient parallel workload. However, for every possible l , peeling all entities with smaller entity number will be computationally very inecient. To avoid this ineciency, we develop a novel two-phased approach. 4.3.1.1 Coarse-grainedDecomposition The rst phase divides the spectrum of all possible entity numbers [0; max ] intoP smaller non-overlapping rangesR 1 ;R 2 :::R P , where max is the maximum entity number inG, andP is a user-specied parameter. A rangeR i represents a set of entity numbers [(i);(i + 1)), such thatR i \R j =fg for allj6=i. These ranges are computed using a heuristic described in section 4.3.1.3. Corresponding to each rangeR i , PBNG also computes the partitionL i comprising all entities whose entity numbers lie inR i . Thus, instead of nding the exact entity number l of an entityl, the rst phase of PBNG computesbounds on l . Therefore, we refer to this phase asCoarse-grainedDecomposition(PBNGCD). The absence of overlap between the ranges allows each subset to be peeled independently of others in the second phase, for exact entity number computation. Entity partitions are computed by iteratively peeling entities whose support lie in the minimum range (al- gorithm 14,lines 5-13). For each partition, the rst peeling iteration in PBNG CD scans all entities to nd the peeling set, denoted asactiveSet (algorithm 14, line 9). 
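The overall control flow of this phase — snapshot the supports, pick an upper bound, scan for the initial activeSet, then keep peeling until the range is exhausted — can be sketched as follows. This is a deliberately simplified, sequential-looking C++ skeleton: the range bound is a fixed width rather than the workload-aware heuristic of section 4.3.1.3, the support updates are abstracted behind a caller-supplied applyUpdates callback, and all names are ours rather than PBNG's code.

#include <functional>
#include <vector>

// Simplified skeleton of PBNG CD. applyUpdates stands in for the BE-Index /
// wedge based support updates of algorithm 14; it may drop more supports into
// the current range, which is why the inner loop repeats until activeSet is
// empty.
void coarseGrainedPeel(
    std::vector<long long>& support, std::vector<char>& peeled,
    int P, long long rangeWidth,
    std::vector<std::vector<int> >& partitions,      // L_1 ... L_P
    std::vector<long long>& initSupport,             // support initialization vector
    const std::function<void(const std::vector<int>&, long long)>& applyUpdates) {
    partitions.assign(P, std::vector<int>());
    initSupport.assign(support.size(), 0);
    auto collectActive = [&](long long lo, long long hi) {
        std::vector<int> act;
        for (size_t e = 0; e < support.size(); ++e)
            if (!peeled[e] && support[e] >= lo && support[e] < hi) act.push_back((int)e);
        return act;
    };
    long long lower = 0;
    for (int i = 0; i < P; ++i) {
        for (size_t e = 0; e < support.size(); ++e)   // snapshot for remaining entities
            if (!peeled[e]) initSupport[e] = support[e];
        long long upper = lower + rangeWidth;         // upper bound of range R_i
        std::vector<int> active = collectActive(lower, upper);
        while (!active.empty()) {                     // peel until R_i is exhausted
            for (int e : active) { partitions[i].push_back(e); peeled[e] = 1; }
            applyUpdates(active, lower);
            active = collectActive(lower, upper);
        }
        lower = upper;
    }
}

In PBNG CD proper, the scans and the updates inside applyUpdates run in parallel, and the upper bound comes from the find_range heuristic rather than a fixed width.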
In subsequent iterations,activeSet is computed jointly with support updates. Thus, unlike bottom-up peeling, PBNG CD does not require a priority queue data structure which makes support updates relatively cheaper. PBNG CD can be visualized as a generalization of bottom-up peeling (algorithm 11 and algorithm 12). In each iteration, the latter peels entities with minimum support (jR i j = 1 for alli), whereas PBNG CD peels entities with support in a broad custom range (jR i j 1). For example, in gure 4.3, PBNG CD divides edgesE(G) into two partitions corresponding to rangesR 1 = [0; 2] andR 2 = [3; 5], whereas 99 bottom-up peeling will create 4 partitions corresponding to every individual level in the decomposition hierarchy ( =f1; 2; 3; 4g). SettingP max ensures a large number of entities peeled per iteration (suf- cient parallel workload) and signicantly fewer iterations (dramatically less synchronization) compared to bottom-up peeling. In addition to the ranges and partitions, PBNG CD also computes a support initialization vector./ init . For an entityl2L i ,./ init l is the number of butteries thatl shares only with entities in partitionsL j such thatj i. In other words, it represents the aggregate eect of peeling entities with entity number in ranges lower thanR i . During iteative peeling in PBNG CD, this number is inherently generated after the last peeling iteration ofR i1 and copied into./ init (l) (algorithm 14, lines 6-7). For example, in gure 4.3, support ofe 5 after peelingE 1 =fe 0 ;e 1 ;e 2 ;e 3 ;e 4 g is 3, which is recorded in./ init e 5 . 4.3.1.2 Fine-grainedDecomposition The second phase computes exact entity numbers and is called Fine-grained Decomposition (PBNG FD). The key idea behind PBNG FD is that if we have the knowledge of all butteries that each entityl2L i shares only with entities in partitionsL j such thatji,L i can be peeled independently of all other partitions. The./ init vector computed in PBNG CD precisely indicates the count of such butteries (section 4.3.1.1) and hence, is used toinitializesupportvalues in PBNG FD. PBNG FD exploits the resulting independence among partitions to concurrently process multiple partitions using sequential bottom up peeling. SettingPT ensures that PBNG FD can be eciently parallelized across partitions onT threads. Overall, both phases in PBNG circumvent strict sequential dependencies between dierent levels of decomposition hierarchy to eciently parallelize the peeling process. The two-phased approach can potentially double the computation required for peeling. However, we note that since partitions are peeled independently in PBNG FD, support updates are not communicated across the partitions. Therefore, to improve computational eciency, PBNG FD operates on a smaller 100 representative subgraphG i G for each partitionsL i . Specically,G i preserves a buttery./ i it satises both of the following conditions: 1. ./ contains multiple entities withinL i . 2. ./ only contains entities from partitionsL j such thatji. If./ contains an entity from lower ranged partitions, then it does not exist in(i)-level of decomposition hierarchy (lowest entity number in L i ). Moreover, the impact of removing./ on the support of entities inL i , is already accounted for in ./ init (section 4.3.1.1). For example, in gure 4.3,G 1 contains the buttery./ = (u 1 ;v 1 ;u 2 ;v 2 ) because (a) it contains multiple edgesfe 3 ;e 4 g2L 1 and satises condition 1, and (b) all edges in./ are fromL 1 orL 2 and hence, it satises condition 2. 
However, G_2 does not contain this butterfly because two of its edges are in L_1 and hence it does not satisfy condition 2 for G_2.

4.3.1.3 Range Partitioning

In PBNG CD, the first step for computing a partition L_i is to find the range R_i = [σ(i), σ(i+1)), where σ(i) and σ(i+1) denote its lower and upper bounds (algorithm 14, line 8). For load balancing, σ(i+1) should be computed† such that all partitions L_i pose uniform workload in PBNG FD. However, the representative subgraphs and the corresponding workloads are not known prior to actual partitioning. Furthermore, exact entity numbers are not known either and hence we cannot determine beforehand exactly which entities will lie in L_i for different values of σ(i+1). Considering these challenges, PBNG uses two proxies for range determination:

1. Proxy 1 → the current support ⋈_l of an entity l is used as a proxy for its entity number.

2. Proxy 2 → the complexity of peeling individual entities in G is used as a proxy to estimate the peeling workload in the representative subgraphs.

† σ(i) is directly obtained from the upper bound of the previous range R_{i-1}.

Now, the problem is to compute σ(i+1) such that the estimated workload of L_i as per the proxies is close to the average workload per partition, denoted as tgt. To this purpose, PBNG CD creates a bin for each support value and computes the aggregate workload of the entities in that bin. For a given σ(i+1), the estimated workload of peeling L_i is the sum of the workloads of all bins corresponding to support less than‡ σ(i+1). Thus, the workload of L_i as a function of σ(i+1) can be computed by a prefix scan of the individual bin workloads (algorithm 14, lines 17-18). Using this function, the upper bound is chosen such that the estimated workload of L_i is close to, but no less than, tgt (algorithm 14, line 19).

Adaptive Range Computation: Since range determination uses current support as a proxy for entity numbers, the target workload for each partition L_i is covered by the entities added to L_i in its very first peeling iteration in PBNG CD. After the support updates in this iteration, more entities may be added to L_i and the final workload estimate of L_i may significantly exceed tgt. This can result in significant load imbalance among the partitions and, potentially, PBNG CD could finish in much fewer than P partitions. To avoid this scenario, we implement the following two-way adaptive range determination:

1. Instead of statically computing an average target, we dynamically update tgt for every partition based on the remaining workload and the number of partitions left to create. If a partition gets too much workload, the target for subsequent partitions is automatically reduced, thereby preventing a situation where all entities get peeled in fewer than P partitions.

2. A partition L_i likely covers many more entities than the initial estimate based on proxy 1. The second adaptation scales down the dynamic target for L_i in an attempt to bring the actual workload close to the intended value. It assumes predictive local behavior, i.e., L_i will overshoot the target similarly to L_{i-1}. Therefore, the scaling factor is computed as the ratio of the initial workload estimate of L_{i-1} during the σ(i) computation to the final estimate based on all entities in L_{i-1}.

‡ All entities with entity numbers less than σ(i) are already peeled before PBNG CD computes R_i.
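A sketch of this bin-and-prefix-scan selection of the upper bound (the find_range routine of algorithms 14 and 15) is shown below. The workload array is whichever proxy is in use (support for wing decomposition, incident wedge count for tip decomposition), and the function and variable names are ours, not the thesis implementation.

#include <cstdint>
#include <map>
#include <vector>

// Range selection: bin remaining entities by their current support, accumulate
// the per-bin workload estimates, and prefix-scan the bins in increasing
// support order. The smallest candidate upper bound whose cumulative workload
// reaches the target is returned; the range is the half-open interval
// [lower, returned value).
long long findRangeUpperBound(const std::vector<long long>& support,
                              const std::vector<char>& peeled,
                              const std::vector<uint64_t>& workload,
                              uint64_t target) {
    std::map<long long, uint64_t> binWork;   // support value -> aggregate workload
    for (size_t e = 0; e < support.size(); ++e)
        if (!peeled[e]) binWork[support[e]] += workload[e];

    uint64_t prefix = 0;                     // running prefix scan of bin workloads
    long long upper = 0;
    for (const auto& kv : binWork) {
        prefix += kv.second;
        upper = kv.first;                    // include this bin in the range
        if (prefix >= target) break;         // smallest bound meeting the target
    }
    return upper + 1;
}

An ordered map keeps the bins sorted by support, which the prefix scan needs; for wing decomposition the supports are small enough that a flat array suffices (see the proof of Theorem 3), whereas tip numbers can be far larger and need the sorted-bin treatment (Theorem 4).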
102 Algorithm14 PBNG Coarse-grained phase for Wing Decomposition (PBNG CD) Input: Bipartite graphG(W = (U;V );E), number of partitionsP Output: Rangesf(1);(2):::(P + 1)g, Edge PartitionsfE 1 ;E 2 :::E P g, Support initialization vector./ init E 1: E i fg8i2f1; 2:::Pg, Initial supportf./ E g pveBcnt(G) 2: I(W = (U;V );E) BE-Index ofG 3: (1) 0, i 1, tgt target butteries (workload) per partition 4: while E(G)6= fg and (iP )do 5: foreache2E(G)doinparallel . Support Initialization Vector 6: ./ init e ./ e 7: (i + 1) find_range(E(G); ./ E ; tgt) . Upper Bound 8: activeSet all edgese2E(G) such that (i)./ e <(i + 1) 9: whileactiveSet6=fgdo . Peel edges 10: E i E i [activeSet, E(G) E(G)nactiveSet 11: parallel_update(activeSet; (i); ./ E ; I) 12: activeSet all edgese2E(G) such that (i)./ e <(i + 1) 13: i i + 1 14: functionfind_range(E(G); ./ E ; tgt) 15: Initialize hashmapwork to all zeros 16: foreach 2./ E 17: work[] P e2E(G) ./ e 1(./ e ) 18: return ub arg min () such thatwork[]tgt 19: functionparallel_update(activeSet; e ; ./ E ; BE-IndexI(W = (U;V );E)) 20: Initialize hashmapcount to all zeros 21: foreach edgee2activeSetdoinparallel 22: foreach bloomB2N e (I) . Update support atomically 23: Lete t denote edgetwin(e;B) 24: Letk B (I) denote bloom number ofB inI 25: if e t = 2activeSet or edgeID(e t )<edgeID(e) then 26: ./ et max e ; ./ et (k B (I) 1) 27: E(I) E(I)nf(e;B); (e t ;B)g 28: count[B] count[B] + 1 29: foreache 0 2N B (I) such that twin(e 0 ;B) = 2activeSet 30: ./ e 0 maxf e ; ./ e 01g 31: foreachB2U(I) such thatcount[B]> 0doinparallel 32: k B (I) k B (I)count[B] . Update bloom number 103 4.3.1.4 PartitionschedulinginFine-grainedDecomposition While adaptive range determination (section 4.3.1.3) tries to create partitions with uniform estimated workload, the actual workload per partition in PBNG FD depends on the the representative subgraphs G i and can still have signicant variance. Therefore, to improve load balance across threads, we use scheduling strategies inspired from Longest Processing Time (LPT) scheduling rule which is a well known 4 3 -approximation algorithm [50]. We use the workload ofL i as an indicator of its execution time in the following runtime scheduling mechanism: • Dynamic task allocation! All partition IDs are inserted in a task queue. When a thread becomes idle, it pops a unique ID from the queue and processes the corresponding partition. Thus, all threads are busy until every partition is scheduled. • Workload-aware Scheduling! Partition IDs in the task queue are sorted in a decreasing order of their workload. Thus, partitions with highest workload get scheduled rst and the threads processing them naturally receive fewer tasks in the future. Figure 4.4 shows how workload-aware scheduling can improve the eciency of dynamic allocation. 4.3.2 TipDecomposition In this section, we give show how PBNG’s two-phased peeling (section 4.3.1) can be adopted for tip decomposition [72]. 4.3.2.1 Coarse-grainedDecomposition For tip decomposition, PBNG CD divides the vertex setU(G) intoP partitions –fU 1 ;U 2 ;:::;U P g as shown in algorithm 15. Peeling a vertexu2U(G) requires traversal of all wedges withu as one of the endpoints. Therefore, range determination in PBNG CD uses wedge count of vertices inG, as a proxy to estimate the 104 = = = = = = = 13 = 8 = FINISH = 20 = 13 = 15 FINISH = ORIGINAL ORDER WORKLOAD AWARE = 10 = 11 = 10 = 18 = 13 = 16 Figure 4.4: Benets of Workload-aware Scheduling (WaS) in a 3-thread (T 1 ;T 2 andT 3 ) system. 
Top row shows entity partitions with estimated time to peel them in PBNG FD. Dynamic allocation without WaS nishes in 28 units of time compared to 20 units with WaS. workload of peeling each partitionU i . Moreover, since only one of the vertex sets ofG is peeled, at most two vertices of a buttery can be a part of the peeling set (activeSet). Hence, support updates to a vertexu 0 from dierent vertices inactiveSet correspond to disjoint butteries inG and the same function used in bottom- up peeling (algorithm 12) can be used to compute these individual updates. The net update to support./ u 0 in an iteration can be simply computed by atomically applying updates from individual vertices inactiveSet. 4.3.2.2 Fine-grainedDecomposition PBNG FD also utilizes the fact that any buttery./ contains at most two verticesfu;u 0 g in the vertex set U(G) being peeled (section 4.2.2). Ifu2U i andu 0 2U j , the two conditions for preserving./ in either representative graphsG i orG j are satised only wheni =j (section 4.3.1.2). Based on this insight, we constructG i as the subgraph induced on verticesW i = (U i ;V (G)). Clearly,G i preserves every buttery u;v;u 0 ;v 0 wherefu;u 0 g2U i . For task scheduling in PBNG FD (section 4.3.1.4), we use the total wedges inG i with endpoints inU i as an indicator of the workload of peelingU i . 105 Algorithm15 PBNG Coarse-grained phase for Tip Decomposition (PBNG CD) Input: Bipartite graphG(W = (U;V );E), number of partitionsP Output: Rangesf(1);(2):::(P + 1)g, Vertex PartitionsfU 1 ;U 2 :::U P g, Support initialization vector./ init U Letw[u] denote wedges incident on vertexu 1: U i fg8i2f1; 2:::Pg 2: Initial supportf./ U ;./ V g pveBcnt(G) 3: (1) 0, i 1 4: tgt P u2U w[u] P . work per partition 5: while U(G)6= fg and (iP )do 6: foreachu2U(G)doinparallel . Support Initialization Vector 7: ./ init u ./ u 8: (i + 1) find_range(U(G);./ U ;w [ ];tgt) . Upper Bound 9: activeSet all verticesu2Usuch that(i)./ u <(i + 1)g 10: whileactiveSet6=fgdo . Peel Vertices 11: U i U i [activeSet, U(G) U(G)nactiveSet 12: foreachu2activeSetdoinparallel 13: update(u;(i);./ U ;G) . algorithm 12, lines 6-13 14: activeSet all verticesu2Usuch that(i)./ u <(i + 1)g 15: i i + 1 16: functionfind_range(U(G);./ U ;w[:];tgt) 17: Initialize hashmapwork to all zeros 18: foreach 2./ U 19: work[] P u2U(G) w[u]1(./ u ) 20: ub smallest 2./ U such thatwork[]tgt 21: return ub Algorithm16 Fine-grained phase for Tip Decomposition (PBNG FD) Input: Bipartite graphG(W = (U;V );E), vertex partitionsfU 1 ;U 2 :::U P g, Support initialization vector./ init U Output: Tip number u for eachu2U(G) 1: ./ u ./ init u for allu2U(G) . Initialize Support 2: Insert indicesi2f1; 2:::Pg in a queueQ 3: work[i] number of wedges in partitionU i 4: Decreasing sort indices inQ by work . LPT Schedule 5: forallthread_id = 1; 2:::T doinparallel 6: whileQ is not emptydo 7: Atomically pop indexi fromQ . Dynamic Task Allocation 8: Create subgraphG i induced onW i = (U i ;V (G)) 9: whileU i 6=fgdo . Peel partition 10: u arg min u2U i f./ u g 11: u ./ u , U i U i nfug 12: update(u; u ;./ U i ;G i ) . algorithm 12, lines 6-13 106 Given the bipartite nature of graphG, any edge (u;v)2E(G) exists in exactly one of the subgraphs G i and thus, the collective space requirement of all induced subgraphs is bounded byO(m). Moreover, by the design of representative (induced) subgraphs, PBNG FD for tip decomposition traverses only those wedges for which both the endpoints are in the same partition. 
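Algorithm 16 combines this per-partition peeling with the scheduling strategy of section 4.3.1.4. A short OpenMP sketch of that workload-aware dynamic scheduling follows; peelPartition stands in for the sequential bottom-up peeling of one partition, and it and the other names are placeholders rather than the thesis code.

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// LPT-inspired scheduling: sort partition ids by decreasing estimated work,
// then let idle threads repeatedly grab the next unprocessed partition from a
// shared cursor (dynamic task allocation).
void scheduleFineGrained(const std::vector<uint64_t>& partitionWork,
                         const std::function<void(int)>& peelPartition) {
    const int P = (int)partitionWork.size();
    std::vector<int> order(P);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),            // heaviest partitions first
              [&](int a, int b) { return partitionWork[a] > partitionWork[b]; });

    std::atomic<int> next(0);                        // shared cursor into the task queue
    #pragma omp parallel
    {
        while (true) {
            int slot = next.fetch_add(1);            // dynamic task allocation
            if (slot >= P) break;
            peelPartition(order[slot]);              // one thread peels one partition
        }
    }
}

Because the heaviest partitions are issued first, the threads that receive them naturally pick up fewer tasks later, which is the behavior illustrated in figure 4.4.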
Restricting PBNG FD to wedges whose endpoints lie in the same partition dramatically reduces the amount of work it does compared to bottom-up peeling and PBNG CD. Note that we do not use the BE-Index for tip decomposition, for the following reasons:

• The number of butterflies between two vertices is quadratic in the number of wedges between them, and wedge traversal (not the butterfly count) determines the work done in tip decomposition. Since the BE-Index facilitates per-edge butterfly retrieval, peeling a vertex using the BE-Index requires processing each of its edges individually and can result in increased computation if Σ_{v∈V(G)} d_v^2 ≪ Σ_{e∈E(G)} ⋈_e^G (section 4.2.4).

• The BE-Index has a high space complexity of O(Σ_{(u,v)∈E(G)} min(d_u, d_v)) compared to just the O(m) space needed to store G and all induced subgraphs G_i. This can make BE-Index based peeling infeasible even on machines with a large amount of main memory. For example, the BE-Index of the user-group dataset Orkut (327 million edges) has 26 billion blooms and 150 billion bloom-edge links, and consumes 2.6 TB of memory.

4.3.3 Wing Decomposition

Each butterfly consists of 4 edges in E(G), which is the entity set to decompose in wing decomposition. This is unlike tip decomposition, where each butterfly has only 2 vertices from the decomposition set U(G), and it results in the following issues:

1. When a butterfly ⋈ is removed due to peeling, the support of each unpeeled edge in ⋈ should be reduced by exactly 1 corresponding to this removal. However, when multiple (but not all) edges in ⋈ are concurrently peeled in the same iteration of PBNG CD, multiple updates with an aggregate value > 1 may be generated for the unpeeled edges in ⋈.

2. It is possible that a butterfly ⋈ contains multiple but not all edges from a partition. Thus, ⋈ may need to be preserved in the representative subgraph of a partition, but it will not be present in the partition's edge-induced subgraph.

Due to these reasons, a trivial extension of the tip decomposition algorithm (section 4.3.2) is not suitable for wing decomposition. In this section, we explore novel BE-Index based strategies to enable two-phased peeling for wing decomposition.

4.3.3.1 Coarse-grained Decomposition

This phase divides the edges E(G) into P partitions – {E_1, E_2, ..., E_P}, as shown in algorithm 14. Not only do we utilize the BE-Index for computationally efficient support update computation in PBNG CD, we also utilize it to avoid conflicts in the parallel peeling iterations of PBNG CD. Since a butterfly is contained in exactly one maximal priority bloom (section 4.2.4, property 3), correctness of support updates within each bloom B implies overall correctness of support updates in an iteration. To avoid conflicts, we therefore employ the following resolution mechanism for each bloom B:

1. If an edge e and its twin e' = twin(e,B) are both peeled in the same iteration, then only the edge with the highest index among e and e' updates (a) the support of other edges in B, and (b) the bloom number k_B(I) (algorithm 14, lines 26-31). This is because all butterflies in B that contain e also contain e' (section 4.2.4, property 2).

2. If, in an iteration, an edge e ∈ activeSet but e' = twin(e,B) ∉ activeSet, then the support ⋈_{e'} is decreased by exactly k_B(I) − 1 when e is peeled. Other edges in activeSet do not propagate any updates to ⋈_{e'} via bloom B (algorithm 14, lines 26-30). This is because e' is contained in exactly k_B(I) − 1 butterflies in B, all of which are removed when e is peeled.

To ensure that k_B(I) correctly represents the butterflies shared between twin edges, support updates from all peeled edges are computed prior to updating k_B(I).
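The two rules above can be realized with simple atomics. The following sketch processes one parallel peeling iteration of PBNG CD over a BE-Index-like structure; the data layout and names are assumptions (see the earlier BE-Index sketch), and link deletion is omitted for brevity.

#include <algorithm>
#include <atomic>
#include <vector>

// One parallel peeling iteration for wing decomposition. bloomEdges[B] holds
// the links N_B(I) labeled with twins, bloomNum[B] holds k_B(I), edgeBlooms[e]
// holds N_e(I), support[e] holds the current support, and sigma is the lower
// bound of the range being peeled (supports are clamped at sigma).
struct Link { int edge; int twin; };

void peelIteration(const std::vector<int>& activeSet,
                   const std::vector<char>& inActive,          // activeSet membership by edge id
                   const std::vector<std::vector<int> >& edgeBlooms,
                   const std::vector<std::vector<Link> >& bloomEdges,
                   std::vector<int>& bloomNum,
                   std::vector<std::atomic<long long> >& support,
                   long long sigma) {
    std::vector<std::atomic<int> > removedPairs(bloomNum.size());
    for (size_t B = 0; B < removedPairs.size(); ++B) removedPairs[B].store(0);

    auto decrease = [&](int e, long long delta) {               // clamped atomic decrement
        long long cur = support[e].load();
        while (!support[e].compare_exchange_weak(cur, std::max(sigma, cur - delta))) {}
    };

    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < (int)activeSet.size(); ++i) {
        int e = activeSet[i];
        for (int B : edgeBlooms[e]) {
            int et = -1;                                        // find twin(e, B)
            for (const Link& lk : bloomEdges[B])
                if (lk.edge == e) { et = lk.twin; break; }
            if (et < 0) continue;
            if (inActive[et] && et > e) continue;               // rule 1: if both peeled, only the higher id acts
            decrease(et, bloomNum[B] - 1);                      // rule 2: the twin loses k_B(I) - 1 butterflies
            removedPairs[B].fetch_add(1);
            for (const Link& lk : bloomEdges[B]) {
                if (lk.edge == e || lk.edge == et) continue;
                if (!inActive[lk.twin]) decrease(lk.edge, 1);   // edges whose twin survives lose 1 per removed pair
            }
        }
    }
    #pragma omp parallel for
    for (int B = 0; B < (int)bloomNum.size(); ++B)              // bloom numbers updated last
        bloomNum[B] -= removedPairs[B].load();
}

Deferring the bloom-number update until after all support updates of the iteration, exactly as algorithm 14 does, keeps every concurrent reader of k_B(I) consistent within the iteration.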
108 Peeling an edgee2E(G) requiresO (./ e ) traversal in the BE-Index. Therefore, range determination in PBNG CD uses edge support as a proxy to estimate the workload of peeling each partitionE i . Algorithm17 PBNG Fine-grained Decomposition (PBNG FD) for wing decomposition Input: GraphG(W = (U;V );E), BE-IndexI(W = (U;V );E), edge partitionsfE 1 ;:::;E P g, Support initialization vector./ init E Output: Wing number e for alle2E(G) 1: ./ e ./ init e for alle2E(G) . Initialize Support 2: fI 1 ;I 2 ;:::;I P g partition_BE_Index(G;I) . BE-Indices for all partitions 3: Insert indicesf1; 2;:::;Pg in queueQ 4: work[i] P e2E i ./ init e for alli2Q 5: Decreasing sort indices inQ bywork . LPT Scheduling 6: forallthread_id2f1; 2:::Tgdoinparallel 7: whileQ is not emptydo 8: Atomically pop indexi fromQ . Dynamic Task Allocation 9: whileE i 6=fgdo . Peel partition 10: e arg min e2E i f./ e g 11: e ./ e , E i E i nfeg 12: update(e; e ;./ E i ;I i ) . Ref: algorithm 13 13: functionpartition_BE_Index(G(W = (U;V );E);I(W = (U;V );E)) 14: Letp(e) denote the partition index of an edgee i.e.e2E p(e) 15: V (I i ) E i ; U(I i ) fg; E(I i ) fg8i2f1; 2:::Pg 16: foreachB2U(I)doinparallel 17: Initialize hash mapwedge_count to all zeros 18: foreache2N B (I) 19: e t twin(e;B) 20: ifp(e t )p(e)then 21: U(I i ) U(I i )[fBg; E(I i ) E(I i )[ (e;B) . Bloom-edge links 22: if p(e t ))>p(e) or edgeID(e t )<edgeID(e) then 23: wedge_count(I i ) wedge_count(I i ) + 1 24: foreach partition indexi such thatB2U(I i ) 25: k B (I i ) P ji wedge_count(I j ) . Bloom numbers (prex scan) 26: returnfI 1 ;I 2 ;:::;I P g 4.3.3.2 Fine-grainedDecomposition The rst step for peeling a partitionE i in PBNG FD, is to construct the corresponding BE-IndexI i for its representative subgraphG i . One way to do so is to computeG i and then use the index construction algorithm (section 4.2.4) to constructI i . However, this approach has the following drawbacks: 109 • ComputingG i requires mining all edges inE(G) that share butteries with edgese2E i , which can be computationally expensive. Additionally, the overhead of index construction even for a few hundred partitions can be signicantly large. • Any edgee2E i can potentially exist in all subgraphsG j such thatji. Therefore, creating and storing all subgraphsG i requiresO(mP ) memory space. To avoid these drawbacks, we directly computeI i for eachE i by partitioning the BE-IndexI of the original graph G (algorithm 17, lines 12-25). Our partitioning mechanism ensures that all butteries satisfying the two preservation conditions (section 4.3.1.2) for a partitionE i , are represented in its BE-Index I i . Firstly, for an edgee2 E i , its link (e;B) with a bloomB is preserved inI i if and only if the twin e t =twin(e;B)2E j such thatji (algorithm 17, lines 19-20). Since all butteries inB that contain e also containe t (section 4.2.4, property 2), none of them need to be preserved inI i ifj b>c (for unique- ness of representation). Ife a ,e b ore c are peeled tillt, corresponding to the removal of./ e;ea;e b ;ec , only one of them (the rst edge to be peeled) reduces the support./ e by a unit. SinceBUP peels edges in a non-decreasing order of their wing numbers, e 0 < e for alle 0 2S. Hence, current support (algorithm 11, line 11) ofe is given by./ e =./ G e P ./e;ea;e b ;ec 1(e a 2S ore b 2S ore c 2S). The rst term of RHS is constant for a givenG, and by commutativity of addition, the second term is independent of the order in which contribution of individual butteries are added. 
Therefore,./ e is independent of the order of peeling inS. § This is unlike the BE-IndexI of graphG where bloom numb erkB(I)= jN B (I)j 2 (section 4.2.4). 111 Lemma6. Given a setS of edges peeled in an iteration, the parallel peeling in PBNG CD (algorithm 14, lines 21-33) correctly updates the support of remaining edges inE(G). Proof. Parallel updates are correct if for an edgee not yet assigned to any partition,./ e decreases by exactly 1 for each buttery ofe deleted in a peeling iteration. By property 3 (section 4.2.4), parallel updates are correct if this holds true for butteries deleted within each bloom. Consider a bloomB(W = (U;V );E) and let./ e;et;e 0 ;e 0 t B be a buttery containing edgesfe;e t ;e 0 ;e 0 t g, wheree t =twin(e;B) ande 0 t =twin(e 0 ;B) (property 2, section 4.2.4), such that./ e;et;e 0 ;e 0 t is deleted in thej th peeling iteration of PBNG CD. Clearly, neither ofe,e t ,e 0 ore 0 t have been peeled beforej th iteration. We analyze updates to support./ e corresponding to the deletion of buttery./ e;et;e 0 ;e 0 t . IfS denotes the set of edges peeled inj th iteration,./ will be deleted if and only if either of the following is true: 1. e2 S! In this case,e is already assigned to a partition and updates to./ e have no impact on the output of PBNG. 2. e = 2S bute t 2S!./ e is reduced by exactly (k B 1) via bloomB, which amounts to unit reduction per each buttery inB that containse (property 2, section 4.2.4). No updates from other edges are propagated to./ e viaB (algorithm 14, line 30). 3.fe;e t g = 2S bute 0 2S ore 0 t 2S!./ e is decreased by exactly 1 whene 0 =e 0 t is peeled. If bothfe 0 ;e 0 t g2S, then./ e is decreased by exactly 1 when the highest indexed edge amonge 0 ande 0 t is peeled (algorithm 14, line 26). Both the links (e 0 ;B) and (e 0 t ;B) are deleted and bloom numberk B is decreased by 1. Hence, subsequent peeling of any edges will not generate support updates corresponding to./ e;et;e 0 ;e 0 t . Thus, the removal of./ during peeling decreases the support ofe by exactly 1 ife is not scheduled for peeling yet. Lemmas 5 and 6 together show that whether we peel a setSE(G) in its original order inBUP, or in parallel in any order in PBNG CD, the support of edges with wing numbers higher than all edges inS 112 would be the same after peelingS. Next, we show that PBNG CD correctly computes the edge partitions corresponding to wing number ranges and nally prove that PBNG FD outputs correct wing numbers for all edges. For ease of explanation, we use./ e (j) to denote the support of a edgee afterj th peeling iteration in PBNG CD. Lemma7. There cannot exist an edgee such thate2E i and e (i + 1). Proof. Letj be the rst iteration in PBNG CD that wrongly peels a set of edgesS w , and assigns them toE i even though e (i + 1)8e2S w . LetS hi S w be the set of all edges with wing numbers(i + 1) andS lo be the set of edges peeled till iterationj 1. Since all edges tillj 1 iterations have been correctly peeled, e <(i + 1)8e2S lo . Hence,S hi E(G)nS lo . Consider an edgee2 S w . Sincee is peeled inj th iteration, ./ e (j 1) < (i + 1). From lemma 6, ./ e (j 1) is equal to the number of butteries containing e and edges only from EnS lo . Since S hi E(G)nS lo , there are at most./ e (j 1) butteries that containe and other edges only fromS hi . By denition of wing number (section 4.2.3), e ./ e (j 1)<(i + 1), which is a contradiction. Thus, no suche exists,S w =fg, and all edges inE i have tip numbers less than(i + 1). Lemma8. 
There cannot exist an edgee such thate2E i and e <(i). Proof. LetR i be the smallest range for which there exists an edgee such that(i) e <(i + 1), but e2 E i 0 for somei 0 > i. Letj be the last peeling iteration in PBNG CD that assigns edges toE i , and S hi =[ i 0 >i E i 0 denote the set of edges not peeled till this iteration. Clearly,e2 S hi and for each edge e 0 2S hi ,./ e 0 (j)(i + 1), otherwisee 0 would be peeled in or before thej th iteration. From lemma 6, all edges inS hi (includinge) participate in at least(i + 1) butteries that contain edges only fromS hi . Therefore,e is a part of(i + 1)wing (def.4) and by the denition of wing number, e (i + 1), which is a contradiction. 113 Theorem1. PBNG CD (algorithm 14) correctly computes the edge partitions corresponding to every wing number range. Proof. Follows directly from lemmas 7 and 8. Theorem2. PBNG correctly computes the wing numbers for alle2E(G). Proof. From theorem 1,(i) e <(i + 1) for each edgee2E i . Consider an edge partitionE i and Let S lo =E 1 [E 2 :::E i1 denote the set of edges peeled beforeE i in PBNG CD. From theorem 1, e (i) for all edgese2E i , and e <(i) for all edgese2S lo . Hence,S lo will be peeled beforeE i inBUP as well. Similarly, any edge inS hi =E i+1 [E i+2 [E P will be peeled afterE i inBUP. Hence, support updates to any edgee2S hi have no impact on wing numbers computed for edges inE i . For each edgee2E i , PBNG FD initializes./ e using./ init e , which is the support ofe in PBNG CD just afterS lo is peeled. From lemma 6, this is equal to the support ofe inBUP just afterS lo is peeled. Note that both PBNG FD andBUP employ same algorithm (sequential bottom-up peeling) to peel edges inE i . Hence, to prove the correctness of wing numbers generated by PBNG FD, it suces to show that when an edge inE i is peeled, support updates propagated to other edges inE i via each bloomB, are the same in PBNG FD andBUP. 1. Firstly, every bloom-edge link (e;B) wheree2 E i andtwin(e;B)2 E i [S hi , is preserved in the partition’s BE-IndexI i . Thus, when an edgee2E i is peeled, the set of aected edges inE i are correctly retrieved fromI i (same as the set retrieved from BE-IndexI inBUP). 2. Secondly, by construction (algorithm 17, lines 21-22 and 23-24), the initial bloom number ofB inI i is equal to the number of those twin edge pairs inB, for which both edges are inE i or higher ranged partitions (not necessarily in same partition). Thus,k B (I i ) = P e2E(B) 1(e(i) and twin(e;B) i) 2 , which in turn is the bloom numberk B (I) inBUP just beforeE i is peeled. 114 Since both PBNG FD andBUP have identical support values just before peelingE i , the updates computed for edges inE i and the order of edges peeled inE i will be the same as well. Hence, the nal wing numbers computed by PBNG FD will be equal to those computed byBUP. The correctness of tip decomposition in PBNG can be proven in a similar fashion. A detailed derivation for the same is given in theorem 2 of [72], 4.4.2 ComputationandSpaceComplexity To feasibly decompose large datasets, it is important for an algorithm to be ecient in terms of computation and memory requirements. The following theorems show that for a reasonable upper bound on partitions P , PBNG is at least as computationally ecient as the best sequential decomposition algorithmBUP. Theorem 3. For P =O P e2E(G) ./ G e m partitions, wing decomposition in PBNG is work-ecient with computational complexity ofO m + P e2E(G) ./ G e , where./ G e is the number of butteries inG that containe. Proof. 
PBNG CD initializes the edge support using buttery counting algorithm withO (m) complexity. Since max E m, binning for range computation of each partition can be done using anO(m)-element array, such thati th element in the array corresponds to workload of edges with supporti (algorithm 14, lines 16-19). A prex scan of the array gives the range workload as a function of the upper bound. Parallel implementations of binning and prex scan performO(m) computations per partition, amounting to O (Pm) computations in entire PBNG CD. ConstructingactiveSet for rst peeling iteration of each partition requires anO(m) complexity parallel lter on remaining edges. Subsequent iterations construct activeSet by tracking support updates doingO(1) work per update. Further, peeling an edgee generates O(./ G e ) updates, each of which can be applied in constant time using BE-Index (section 4.2.4). Therefore, total complexity of PBNG CD isO P e2E(G) ./ G e + (P +)m . 115 PBNG FD partitions BE-IndexI to create individual BE-Indices for edge partitions. The partition- ing (algorithm 17, lines 12-25) requires constant number of traversals of entire setE(I) and hence, has a complexity ofO(m). Each buttery inG is represented in at most one partition’s BE-Index. Therefore, PBNG FD will also generateO P e2E(G) ./ G e support updates. Hence, the work complexity of PBNG FD isO P e2E(G) ./ G e +m (section 4.2.4). The total work done by PBNG isO P e2E(G) ./ G e + (P +)m =O P e2E(G) ./ G e +m if P =O P e2E(G) ./ G e m , which is the best-known time complexity of BUP. Hence, PBNG’s wing decomposi- tion is work-ecient. Theorem 4. ForP =O P u2U P v2Nu dv nlogn partitions, tip decomposition in PBNG is work-ecient with computational complexity ofO P u2U P v2Nu d v . Proof. The proof is simliar to that of theorem 3. The key dierence from wing decomposition is that maximum tip number can be cubic in the size of the vertex set. Therefore, range determination uses a hashmap with support values as the keys (algorithm 15). The aggregate workloads of the bins need to be sorted on keys before computing the prex scan. Hence, total complexity of range determination for tip decomposition in PBNG CD isO (Pn logn). A detailed derivation is given in theorem 3 of [72]. Next, we prove that PBNG’s space consumption is almost similar to the best known sequential algo- rithms. Theorem5. Wing decomposition in PBNG parallelized overT threads consumesO (m +nT ) memory space. Proof. For buttery counting, each thread uses a privateO(n) array to accumulate wedge counts, resulting toO(nT ) space consumption [132]. For peeling: 1. PBNG CD uses the BE-IndexI(W = (U;V );E) whose space complexity isO (m). 116 2. PBNG FD uses individual BE-Indices for each partition. Any bloom-edge link (e;B)2E(I) exists in at most one partition’s BE-Index. Therefore, cumulative space required to store all partitions’ BE-Indices isO (m). Thus, overall space complexity of PBNG’s wing decomposition isO (m +nT ). Theorem6. Tip decomposition in PBNG parallelized overT threads consumesO (m +nT ) memory space. Proof. For buttery counting and peeling in PBNG, each thread uses a privateO(n) array to accumulate wedge counts, resulting inO(nT ) space consumption. In PBNG FD, any edgee2E(G) exists in the induced subgraph of at most one partition (because partitions ofU(G) are disjoint). Hence,O (n +m) space is required to storeG and all induced subgraphs, resulting in overallO (m +nT ) space complexity. 
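For quick reference, the work bounds established in this section can be collected in one place. The LaTeX below restates Theorems 3 and 4 in the notation of Table 4.1 (α denotes the arboricity of G and ⋈ the butterfly count), with P limited as stated in those theorems; it is a summary, not a new result.

\begin{align*}
\text{Wing decomposition:}\quad & O\Big(m\alpha + \sum_{e \in E(G)} \bowtie^{G}_{e}\Big)
  \quad \text{for } P = O\Big(\sum_{e \in E(G)} \bowtie^{G}_{e} \big/ m\Big),\\
\text{Tip decomposition:}\quad & O\Big(\sum_{u \in U(G)} \sum_{v \in N_u} d_v\Big)
  \quad \text{for } P = O\Big(\sum_{u \in U(G)} \sum_{v \in N_u} d_v \big/ (n \log n)\Big).
\end{align*}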
4.5 Optimizations Despite the use of parallel computing resources, PBNG may consume a lot of time to decompose large graphs such as the trackers dataset, that contain several trillion wedges (for tip decomposition) or butteries (for wedge decomposition). In this section, we propose novel optimization techniques based on the two-phased peeling of PBNG, that dramatically improve computational eciency and make it feasible to decompose datasets like trackers in few minutes. 4.5.1 BatchProcessing Due to the broad range of entity numbers peeled in each iteration of PBNG CD (section 4.3.1.1), some iterations may peel a large number of entities. Peeling individual entities in such iterations requires a large amount of traversal inG or BE-IndexI. However, visualizing such iterations as peeling a single large set of entities can enable batch optimizations that drastically reduce the required computation. 117 TipDecomposition Here, we exploit the fact that (per-vertex) buttery counting is computationally ecient and parallelizble (section 4.2.1). Given a vertex setactiveSet, the number of wedges traversed for peelingactiveSet is given by^(activeSet) = P u2activeSet P v2Nu d v . However, number of wedges tra- versed for re-counting butteries for remaining vertices is upper bounded by^ cnt = P (u;v)2E min (d u ;d v ), which is constant for a givenG (section 4.2.1). If^(activeSet)>^ cnt , we re-compute butteries for all remaining vertices inU instead of peelingactiveSet. Thus, computational complexity of a peeling iteration in PBNG tip decomposition isO (m). Algorithm18 Batch computation of support updates from peeling a set of edgesactiveSet 1: functionupdate(activeSet; e ;./ E ; BE-IndexI(W = (U;V );E)) 2: Initialize hashmapcount to all zeros 3: foreach edgee2activeSetdoinparallel 4: foreach bloomB2N e (I) 5: e t twin(e;B); k B (I) bloom number ofB inI 6: if e t = 2activeSet or edgeID(e t )<edgeID(e) then 7: ./ et max e ; ./ et k B (I) 8: count[B] count[B] + 1 . Aggregate updates at blooms atomically 9: E(I) E(I)nf(e;B); (e t ;B)g 10: foreach bloomB2U(I) such thatcount[B]> 0doinparallel 11: k B (I) k B (I)count[B] . Update bloom number 12: foreach edgee 0 2N B (I) 13: ./ e 0 maxf e ; ./ e 0count[B]g . Update support atomically WingDecomposition For peeling a large set of edgesactiveSet, we use the batch processing proposed in [130]. The key idea is that when an edgee is peeled, the aected edges are discovered by exploring the neighborhood of blooms inN e (I) (algorithm 13). Therefore, the support updates from all edges in activeSet can be aggregated at the blooms (algorithm 18, line 8), and then applied via a single traversal of their neighborhoods (algorithm 18, lines 10-13). Thus, computational complexity of a peeling iteration in PBNG wing decomposition is bounded by the size of BE-Index which isO (m). While batch processing using BE-Index was proposed in [130], we note that it is signicantly more benecial for PBNG CD compared to bottom-up peeling, due to the large number of edges peeled per iteration. 118 4.5.2 DynamicGraphUpdates After a vertexu (or an edgee) is peeled in PBNG, it is excluded from future computation in the respective phase. However, due to the undirected nature of the graph, the adjacency list data structure forG (or BE-Index I) still contains edges of u (or bloom-edge links of e) that are interleaved with other edges. Consequently, wedges incident onu (or bloom-edge links ofe), though not used in computation, are still explored even afteru (ore) is peeled. 
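One way to realize the fix described next is to compact an adjacency list in place while it is being traversed for a support update, so that entries pointing to already-peeled entities are never touched again in later iterations. The sketch below is an assumed implementation with illustrative names (adjOfV, peeled, visit), not the thesis code; the same idea applies to bloom-edge links in the BE-Index.

#include <vector>

// Traverse the live neighbors of one vertex and simultaneously delete edges to
// peeled vertices by packing the surviving entries to the front of the list.
// The deletion cost piggybacks on the traversal that peeling already performs.
template <typename Visit>
void traverseAndCompact(std::vector<int>& adjOfV,
                        const std::vector<char>& peeled,
                        Visit visit) {                      // callback for live neighbors
    size_t keep = 0;
    for (size_t i = 0; i < adjOfV.size(); ++i) {
        int u = adjOfV[i];
        if (peeled[u]) continue;                            // drop edge to a peeled vertex
        visit(u);                                           // normal peeling work
        adjOfV[keep++] = u;                                 // keep live entries packed in front
    }
    adjOfV.resize(keep);
}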
To prevent such wasteful exploration, we update the data structures to remove edges incident on peeled vertices (or bloom-edge links of peeled edges). These updates can be performed jointly with the traversal required for peeling. In tip decomposition, updating vertex support requires traversing adjacencies of the neighbors of peeled vertices. Edges to peeled vertices can be removed while traversing neighbors’ adjacency lists. In wing decomposition, updating edge support requires iterating over the aected blooms (algorithm 18, lines 10-13) and their neighborhoods N B (I). The bloom-edge links incident on peeled edges and their twins can be removed fromN B (I) during such traversal. 4.6 ExperimentalEvaluation In this section, we present detailed experimental results of PBNG for both tip and wing decomposition. In section 4.6.1, we list the datasets and describe the baselines used for comparison. Secondly, in section 4.6.2, we provide a thorough evaluation of wing decomposition in PBNG. We (a) compare PBNG against the baselines on several metrics, (b) report empirical benets of optimizations proposed in section 4.5, (c) compare workload and execution time of coarse and ne decomposition phases, and (d) evaluate parallel scalability of PBNG. Lastly, in section 4.6.3, we report a similar evaluation of tip decomposition. 119 4.6.1 Setup We conduct the experiments on a 36 core dual-socket linux server with two Intel Xeon E5-2695 v4 proces- sors@ 2.1GHz and 1TB DRAM. All algorithms are implemented in C++-14 and are compiled using G++ 9.1.0 with the -O3 optimization ag, and OpenMP v4.5 for multithreading. Datasets : We use twelve unweighted bipartite graphs obtained from the KOBLENZ collection [68] and Network Repository [100], whose characteristics are shown in table 4.3. To the best of our knowledge, these are some of the largest publicly available real-world bipartite datasets. Table 4.3: Bipartite Graph Decomposition: datasets for evaluation with number of butteries (./ G in billions), maximum tip numbers max U (for peelingU(G)) and max V (for peelingV (G)), and maximum wing number max E . Dataset Description jUj jVj jEj ./ G (in B) max U (in M) max V (in M) max E Di-af Artists and labels aliation from Discogs 1,754,824 270,772 5,302,276 3.3 120 87 15,498 De-ti URLs and tags from www.delicious.com 4,006,817 577,524 14,976,403 22.9 0.57 40.1 26,895 Fr Pages and editors from French Wikipedia 62,890 94,307 2,494,939 34.1 1.6 645 54,743 Di-st Artists and release styles from Discogs 1,617,944 384 5,740,842 77.4 0.74 1,828 52,015 It Pages and editors from Italian Wikipedia 2,255,875 137,693 12,644,802 298 1.6 5,328 166,785 Digg Users and stories from Digg 872,623 12,472 22,624,727 1,580.5 47.6 3,726 166,826 En Pages and editors from English Wikipedia 21,504,191 3,819,691 122,075,170 2,036 37.2 96.2 438,728 Lj Users’ group memberships in Livejournal 3,201,203 7,489,073 112,307,385 3,297 4.7 82,785 456,791 Gtr Documents and words from Gottron-trec 556,078 1,173,226 83,629,405 19,438 205 38,283 563,244 Tr Internet domains and trackers in them 27,665,730 12,756,244 140,613,762 20,068 18,668 3,030,765 2,462,017 Or Users’ group memberships in Orkut 2,783,196 8,730,857 327,037,487 22,131 88.8 29,285 - De-ut Users and tags from www.delicious.com 4,512,099 833,081 82,989,133 26,683 936 91,968 1,290,680 Baselines: We compare the performance of PBNG against the following baselines: • BUP! sequential bottom-up peeling (algorithm 11 and 12) that does not use BE-Index. 
• ParB!ParButterfly framework ¶ with the best performing BatchS aggregation method [109]. It parallelizes each iteration of BUP using a parallel bucketing structure [36]. • BE_Batch (for wing decomposition only)! BE-Index assisted peeling with batch processing optimiza- tion [130], and dynamic deletion of bloom-edge links (section 4.5). ¶ https://github.com/jeshi96/parbuttery 120 • BE_PC (for wing decomposition only)! BE-Index assisted progressive compression peeling approach, proposed in [130]. It generates candidate subgraphs top-down in the hierarchy to avoid support updates from peeling edges in lower subgraphs (small e ) to edges in higher subgraphs (high e ). Scaling parameter for support threshold of candidate subgraphs is set to = 0:02, as specied in [130]. Furthermore, to evaluate the eect of optimizations (section 4.5), we create two variants of PBNG: • PBNG-! PBNG without dynamic graph updates (section 4.5.2). • PBNG--! PBNG without dynamic graph updates (section 4.5.2) and batch processing optimization (sec- tion 4.5.1). 16 64 256 1024 4096 16384 50 150 250 350 450 550 Execution Time (s) # Partitions Execution time vs P TrU OrU EnU LjU DeU ItU Figure 4.5: Execution time of tip decomposition vs number of partitionsP in PBNG. ImplementationDetails: The only user-specied parameter in PBNG is the number of partitionsP . PBNG as a function ofP as shown in gure 4.6 and 4.5. Performance of PBNG CD improves with a decrease inP because of reduced peeling iterations and larger peeling set (batch size) per iteration. However, for PBNG FD, a small value ofP reduces parallelism and increases the workload. Thus,P represents a trade-o between the two phases of PBNG. Based on our our observations, 121 16 32 64 128 0 200 400 600 800 1000 Execution Time (s) No. of Partitions Graphs with <100M edges It De-ti Fr Di-st 512 1024 2048 4096 8192 16384 0 400 800 1200 1600 Execution Time (s) No. of Partitions Graphs with ≥100M edges En Lj Tr Figure 4.6: Execution time of wing decomposition vs number of partitionsP in PBNG. 1. for tip decomposition, we setP = 150, and 2. for wing decomposition, we setP = 400 for graphs with< 100 M edges andP = 1000 for graphs with 100 M edges. We also note that the performance of PBNG is robust (within 2 of the optimal) in a wide range ofP for both small and large datasets. 4.6.2 Results: WingDecomposition 4.6.2.1 ComparisonwithBaselines Table 4.4 shows a detailed comparison of PBNG and baseline wing decomposition algorithms. To compare the workload of dierent algorithms, we measure the number of support updates applied in each algorithm [130]. This may under-represent the workload of BUP and ParB, as they cannot retrieve aected edges during peeling in constant time. However, it is a useful metric to compare BE-Index based approaches [130], as support updates represent bulk of the computation performed by during decomposition. Note thatParB will generate same number of support updates asBUP, and parallel variants of all baselines will have same amount of synchronization () asParB. 122 Amongst the baseline algorithms, BE_PC demonstrates state-of-the-art execution time and lowest computational workload, represented by the number of support updates, due to its top-down subgraph construction approach. However, with the two-phased peeling and batch optimizations, support updates in PBNG are at par or even lower thanBE_PC in some cases. 
Moreover, most updates in PBNG are applied to a simple array and are relatively cheaper compared to updates applied to priority queue data structure in all baselines (includingBE_PC). Furthermore, by utilizing parallel computational resources, PBNG achieves up to 38:5 speedup overBE_PC, with especially high speedup on large datasets. Compared to the parallel framework ParB, PBNG is two orders of magnitude or more (up to 295) faster. This is becauseParB does not use BE-Index for ecient peeling, does not utilize batch optimizations to reduce computations, and requires large amount of parallel peeling iterations (). The number of threads synchronizations is directly proportional to ∥ , and PBNG achieves up to 15260 reduction in compared toParB. This is primarily because PBNG CD peels vertices with a broad range of support in every iteration, and PBNG FD does not require a global thread synchronization at alll. This drastic reduction in is the primary contributor to PBNG’s parallel eciency. Quite remarkably, PBNG is the only algorithm to successfully wing decompose Gtr and De-ut datasets in few hours, whereas all of the baselines fail to decompose them in two days ∗∗ . Overall, PBNG achieves dramatic reduction in workload, execution time and synchronization compared to all previously existing algorithms. ∥ ForParB, can be determined by counting peeling iterations in PBNG FD, even ifParB itself is unable to decompose the graph. ∗∗ None of the algorithms could wing decompose Or dataset –BE_Batch,BE_PC and PBNG due to large memory requirement of BE-Index, andBUP andParB due to infeasible amount of execution time 123 Table 4.4: Comparing execution time (t), number of support updates and synchronization rounds () of wing decomposition algorithms. Missing entries denote that execution did not nish in 2 days. Di-af De-ti Fr Di-st It Digg En Lj Gtr Tr De-ut BUP 911 6,622 2,565 9,972 38,579 - - - - - - ParB 324 3,105 1,434 2,976 14,087 - - - - - - BE_Batch 87 568 678 961 9,940 - 78,847 - - 54,865 - BE_PC 56 312 237 314 1756 37,006 25,905 50,901 - 28,418 - t(s) PBNG 7.1 22.4 20.4 26.5 47.7 960 748 1,293 13,253 858 6,661 BUP 4.7 37.3 39.4 122 424 - - - - - - BE_Batch 2.9 16.6 24.8 33.9 119.0 1,413 1,116 - - 679 - BE_PC 1.1 5.7 7.6 9.9 33.6 592 391 785 - 390 - Updates (billions) PBNG 1.9 9 11 11.6 26.9 794 402 678 5,765 164 5,530 ParB 179,177 868,527 181,114 1,118,178 781,955 1,077,389 8,456,797 4,338,205 3,655,765 31,043,711 4,844,812 PBNG 3,666 5,902 4,062 4,442 4,037 14,111 16,324 19,824 18,371 2,034 15,136 4.6.2.2 EectofOptimizations Since dynamic BE-Index updates do not aect the updates generated during peeling, PBNG and PBNG- exhibit the same number of support updates. Hence, to highlight the benets of BE-Index updates (sec- tion 4.5.2), we also measure the number of bloom-edge links traversed in PBNG with and without the optimizations. Figure 4.7 shows the eect of optimizations on the performance of PBNG. Normalized performance of PBNG (gure 4.7) shows that deletion of bloom-edge links (corresponding to peeled edges) from BE-Index reduces traversal by an average of 1:4, and execution time by average 1:11. However, traversal is relatively inexpensive compared to support updates as the latter involve atomic computations on array elements (PBNG CD) or on a priority queue (PBNG FD). Figure 4.7 also clearly shows a direct correlation between the execution time of PBNG-- and the number of support updates. 
Figure 4.7: Effect of optimizations (section 4.5) on wing decomposition in PBNG (left panel: PBNG-, i.e., PBNG without dynamic BE-Index updates; right panel: PBNG--, i.e., PBNG- without batch processing). Execution time is normalized with the respective measurements for PBNG with all optimizations enabled. With batch processing disabled, Gtr, Tr and De-ut did not finish within 2 days.

4.6.2.3 Comparison of Different Phases

Figure 4.8: Contribution of different steps to the overall support updates and the execution time of wing decomposition in PBNG.

Figure 4.8 shows a breakdown of the support updates and execution time of PBNG across different steps, namely initial butterfly counting and BE-Index construction, peeling in PBNG CD, BE-Index partitioning††, and peeling in PBNG FD. For most datasets, PBNG CD dominates the overall workload, contributing more than 60% of the support updates. In some datasets such as Tr and Gtr, the batch optimizations drastically reduce the workload of PBNG CD, rendering PBNG FD the dominant phase. The trends in execution time are largely similar to those of support updates. However, due to differences in the parallel scalability of different steps, the contribution of PBNG FD to the execution time of several datasets is slightly higher than its corresponding contribution to support updates. We also observe that peeling in PBNG CD and PBNG FD is much more expensive compared to BE-Index construction and partitioning.

†† BE-Index partitioning does not update the support of edges.

4.6.2.4 Scalability

Figure 4.9: Strong scaling of wing decomposition in PBNG.

Figure 4.9 demonstrates the parallel speedup of PBNG over sequential execution‡‡. Overall, PBNG provides an average 8.7× parallel speedup with 36 threads, which is significantly better than ParB. Furthermore, the speedup is generally higher for large datasets (up to 11.8× for Lj), which are highly time consuming. In contrast, ParB achieves an average 2.6× speedup over sequential BUP, due to the large amount of synchronization.

‡‡ We also compared a serial implementation of PBNG (for both wing and tip decomposition) with no synchronization primitives (atomics) and sequential implementations of kernels such as prefix scan. However, the observed performance difference between such an implementation and single-threaded execution of parallel PBNG was negligible.

We also observe that PBNG consistently accelerates decomposition up to 18 threads (single socket), providing an average 7.2× parallel speedup. However, scaling to two sockets (increasing threads from 18 to 36) only fetches 1.2× speedup on average. This could be due to NUMA effects, which increase the cost of memory accesses and atomics. This can significantly impact the performance, as PBNG's workload is dominated by traversal of the large BE-Index and atomic support updates. Further, edges contained in a large number of butterflies may receive numerous support updates, which increases coherency traffic and reduces scalability.
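The atomic support updates mentioned above are where the cross-socket coherency traffic originates: many threads decrement shared counters for edges that participate in many butterflies. A minimal OpenMP sketch of this access pattern is shown below; the arrays and the affected-edge lists are hypothetical placeholders rather than PBNG's actual data structures.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>
#include <omp.h>

// Illustrative kernel: threads share the peeled edges of one batch and
// atomically decrement the support of affected edges in a flat array.
// Counters of edges contained in many butterflies are decremented by many
// threads, so their cache lines bounce between sockets.
void applySupportUpdates(
    const std::vector<uint32_t>& peeledBatch,
    const std::vector<std::vector<uint32_t>>& affected,   // indexed by edge id
    std::vector<std::atomic<uint32_t>>& support) {
  #pragma omp parallel for schedule(dynamic, 64)
  for (long i = 0; i < static_cast<long>(peeledBatch.size()); ++i) {
    for (uint32_t f : affected[peeledBatch[i]]) {
      support[f].fetch_sub(1, std::memory_order_relaxed);  // one support update
    }
  }
}
```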
4.6.3 Results: Tip Decomposition

To evaluate tip decomposition in PBNG, we select 6 of the largest datasets from table 4.3 and individually decompose both vertex sets in them. Without loss of generality, we label the vertex set with higher peeling complexity as U(G) and the other as V(G). Corresponding to the set being decomposed, we suffix the dataset name with U(G) or V(G).

4.6.3.1 Comparison with Baselines

Table 4.5: Comparing execution time (t), number of wedges traversed and synchronization rounds (ρ) of tip decomposition algorithms. Missing entries denote that execution did not finish in 2 days or ran out of memory.

         | EnU | EnV | LjU | LjV | GtrU | GtrV | TrU | TrV | OrU | OrV | De-utU | De-utV
t (s)
BUP      | 111,777 | 281 | 67,588 | 200 | 12,036 | 221 | - | 5,711 | 39,079 | 2,297 | 12,260 | 428
ParB     | - | 198 | - | 132.5 | - | 163.9 | - | 3,524 | - | 1,510 | - | 377.7
PBNG     | 1,383 | 31.1 | 911.1 | 23.7 | 163.9 | 26.5 | 2,784 | 530.6 | 1,865 | 136 | 402.4 | 32.4
Wedges (billions)
BUP      | 12,583 | 29.6 | 5,403 | 14.3 | 3,183 | 26.6 | 211,156 | 1,740 | 4,975 | 231.4 | 2,861 | 70.1
PBNG     | 2,414 | 22.2 | 1,003 | 11.7 | 1,526 | 28.2 | 3,298 | 658.1 | 2,728 | 170.4 | 1,503 | 51.3
Synchronization rounds (ρ)
ParB     | 1,512,922 | 83,800 | 1,479,495 | 83,423 | 491,192 | 73,323 | 1,476,015 | 342,672 | 1,136,129 | 334,064 | 670,189 | 127,328
PBNG     | 1,724 | 453 | 1,477 | 456 | 2,062 | 362 | 1,335 | 1,381 | 1,160 | 639 | 1,113 | 406

Table 4.5 shows a detailed comparison of various tip decomposition algorithms. To compare the workload of tip decomposition algorithms, we measure the number of wedges traversed in G. Wedge traversal is required to compute butterflies between vertex pairs during counting/peeling, and represents the bulk of the computation performed in tip decomposition. Note that ParB traverses the same number of wedges as BUP.

With up to 80.8× and 64.7× speedup over BUP and ParB, respectively, PBNG is dramatically faster than the baselines for all datasets. In contrast, ParB achieves a maximum 1.6× speedup compared to sequential BUP, for the TrV dataset. The speedups are typically higher for large datasets that offer a large amount of computation to parallelize and benefit more from the batch optimization (section 4.5.1).

The optimization benefits are also evident in the wedge traversal of PBNG§§. For all datasets, PBNG traverses fewer wedges than the baselines, achieving up to 64× reduction in wedges traversed. Furthermore, PBNG achieves up to 1105× reduction in synchronization (ρ) over ParB, due to its two-phased peeling approach. The resulting increase in parallel efficiency and the workload optimizations enable PBNG to decompose large datasets like EnU in a few minutes, unlike the baselines that take a few days for the same. Quite remarkably, PBNG is the only algorithm to successfully tip decompose the TrU dataset within an hour, whereas the baselines fail to decompose it in two days. We also note that PBNG can decompose both vertex sets of the Or dataset in approximately half an hour, even though none of the algorithms could feasibly wing decompose the Or dataset (section 4.6.2). Thus, tip decomposition is advantageous over wing decomposition in terms of efficiency and feasibility.

§§ Wedge traversal by BUP can be computed without executing algorithm 12, by simply aggregating the 2-hop neighborhood size of vertices in U(G) or V(G).
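As a reference point for the wedge-traversal metric used above, the following is a small sketch of the wedge enumeration that underlies per-vertex butterfly counting, in the spirit of the vertex-priority algorithm of Chiba and Nishizeki [27]. It is deliberately simplified: it is sequential, omits the priority ordering, and uses hypothetical CSR field names rather than PBNG's actual data structures.

```cpp
#include <cstdint>
#include <vector>

// Bipartite graph in CSR form (hypothetical layout): uOffs/uNbrs give the
// V(G) neighbors of each vertex of U(G), and vOffs/vNbrs the reverse.
struct BipartiteCSR {
  std::vector<uint32_t> uOffs, uNbrs;
  std::vector<uint32_t> vOffs, vNbrs;
};

// Per-vertex butterfly counts for U(G) via wedge enumeration. Every wedge
// (u, v, w) with u, w in U(G) is traversed once from u; these traversals are
// the workload reported in table 4.5.
std::vector<uint64_t> countButterflies(const BipartiteCSR& g, uint32_t nu) {
  std::vector<uint64_t> bcnt(nu, 0);
  std::vector<uint64_t> wedgeCnt(nu, 0);   // wedges from the current u to each w
  for (uint32_t u = 0; u < nu; ++u) {
    std::vector<uint32_t> endpoints;       // endpoints w reached from u
    for (uint32_t i = g.uOffs[u]; i < g.uOffs[u + 1]; ++i) {
      uint32_t v = g.uNbrs[i];
      for (uint32_t j = g.vOffs[v]; j < g.vOffs[v + 1]; ++j) {
        uint32_t w = g.vNbrs[j];
        if (w == u) continue;              // wedge endpoints must be distinct
        if (wedgeCnt[w]++ == 0) endpoints.push_back(w);
      }
    }
    for (uint32_t w : endpoints) {
      uint64_t c = wedgeCnt[w];
      bcnt[u] += c * (c - 1) / 2;          // c wedges to w close c-choose-2 butterflies
      wedgeCnt[w] = 0;                     // reset for the next u
    }
  }
  return bcnt;
}
```

Peeling a vertex repeats essentially the same traversal to discover which other vertices lose butterflies, which is why dynamically deleting peeled vertices from the adjacency lists (section 4.5.2) directly shrinks this workload.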
4.6.3.2 Optimizations

Figure 4.10: Effect of optimizations (section 4.5) on tip decomposition in PBNG (left panel: PBNG-, i.e., PBNG without dynamic graph updates; right panel: PBNG--, i.e., PBNG- without batch processing). All quantities are normalized with the respective measurements for PBNG with all optimizations enabled.

Figure 4.10 shows the effect of the workload optimizations on tip decomposition in PBNG. Clearly, the execution time closely follows the variations in the number of wedges traversed. Dynamic deletion of edges (corresponding to peeled vertices) from the adjacency lists can potentially halve the wedge workload, since each wedge has two endpoints in the peeling set. The normalized performance of PBNG- (figure 4.10) shows that this optimization achieves a 1.41× and 1.29× average reduction in wedges and execution time, respectively.

Similar to wing decomposition, the batch optimization provides a dramatic improvement in workload and execution time. This is especially true for datasets with a large ratio of total wedges with endpoints in the peeling set to the wedges traversed during counting (for example, for LjU, EnU and TrU, this ratio is > 1000). For instance, in TrU, both optimizations cumulatively enable 68.8× and 47.7× reduction in wedge traversal and execution time, respectively. In contrast, in datasets with a small value of this ratio, such as DeV, OrV, LjV and EnV, none of the peeling iterations in PBNG CD utilize re-counting. Consequently, the performance of PBNG- and PBNG-- is similar for these datasets.

4.6.3.3 Comparison of Phases

Figure 4.11: Contribution of different steps to the overall wedge traversal and the execution time of tip decomposition in PBNG.

Figure 4.11 shows a breakdown of the wedge traversal and execution time of PBNG across different steps, namely initial butterfly counting, peeling in PBNG CD, and peeling in PBNG FD. As expected, PBNG FD contributes less than 15% of the total wedge traversal in tip decomposition. This is because it operates on small subgraphs that preserve very few wedges of G. When peeling the large workload U(G) vertex set, more than 80% of the wedge traversal and 70% of the execution time is spent in PBNG CD.

4.6.3.4 Scalability

Figure 4.12: Strong scaling of tip decomposition in PBNG.

Figure 4.12 demonstrates the parallel speedup of PBNG over sequential execution. When peeling the large workload vertex set U(G), PBNG achieves almost linear scalability with 14.4× average parallel speedup on 36 threads, and up to 19.7× speedup for the GtrU dataset. In contrast, ParB achieves an average 1.54× speedup over sequential BUP, and up to 2.3× speedup for the TrV dataset. Typically, datasets with a small number of wedges (LjV, EnV) exhibit lower speedup, because they provide lower workload per synchronization round on average. For example, LjV traverses 86× fewer wedges than LjU but incurs only 3.2× fewer synchronizations. This increases the relative overheads of parallelization and restricts the parallel scalability of PBNG CD, which is the highest workload step in PBNG (figure 4.11).

Similar to wing decomposition, NUMA effects on scalability to multiple sockets (36 threads) can be seen in the tip decomposition of some datasets such as OrU and TrU. However, we still observe 1.4× average speedup on large datasets when the number of threads is increased from 18 to 36. This is possibly because most of the workload in tip decomposition is comprised of wedge traversal, which only reads the graph data structure.
It incurs far fewer support updates, and in turn atomic writes, compared to wing decomposition.

4.7 Related Work

Discovering dense subgraphs and communities in networks is a key operation in several applications [119, 46, 38, 40, 102, 26, 88, 118, 99]. Motif-based techniques are widely used to reveal dense regions in graphs [107, 41, 47, 127, 73, 66, 129, 124, 108]. Motifs like triangles represent a quantum of cohesion in graphs, and the number of motifs containing an entity (a vertex or an edge) acts as an indicator of its local density. Consequently, several recent works have focused on efficiently finding such motifs in graphs [54, 112, 7, 109, 132, 62, 44, 80].

Nucleus decomposition is a clique-based technique for discovering hierarchical dense regions in unipartite graphs. Instead of the per-entity clique count in the entire graph, it considers the minimum clique count of a subgraph's entities as an indicator of that subgraph's density [6]. This allows mining denser subgraphs compared to counting alone [6, 106]. Truss decomposition is a special and one of the most popular cases of nucleus decomposition that uses the triangle clique. It is a part of the GraphChallenge [103] initiative, which has resulted in highly scalable parallel decomposition algorithms [30, 125, 115, 53]. However, nucleus decomposition is not applicable to bipartite graphs as they do not have cliques.

The simplest non-trivial motif in a bipartite graph is a butterfly (2,2-biclique). Several algorithms for butterfly counting have been developed: in-memory or external memory [132, 128], exact or approximate counting [104, 105], and parallel counting on various platforms [109, 132, 128]. The most efficient approaches are based on Chiba and Nishizeki's [27] vertex-priority butterfly counting algorithm. Wang et al. [132] propose a cache-optimized variant of this algorithm and use shared-memory parallelism for acceleration. Independently, Shi et al. [109] develop provably efficient shared-memory parallel implementations of this algorithm. Notably, their algorithms are able to extract parallelism at the granularity of wedges explored. Such fine-grained parallelism can also be explored for improving the parallel scalability of PBNG.

Inspired by k-truss, Sariyuce et al. [108] defined k-tips and k-wings as subgraphs with a minimum of k butterflies incident on every vertex and edge, respectively. Similar to nucleus decomposition algorithms, they designed bottom-up peeling algorithms to find hierarchies of k-tips and k-wings. Independently, Zou [145] defined the notion of bitruss, similar to k-wing. Shi et al. [109] propose the ParButterfly framework that parallelizes individual peeling iterations. Chiba and Nishizeki [27] proposed that the wedges traversed in the butterfly counting algorithm can act as a space-efficient representation of all butterflies in the graph. Wang et al. [130] propose a butterfly representation called BE-Index (also derived from the counting algorithm) for efficient support updates during edge peeling. Based on the BE-Index, they develop several peeling algorithms for wing decomposition that achieve state-of-the-art computational efficiency and are used as baselines in this thesis.

Very recently, Wang et al. [131] also proposed parallel versions of BE-Index based wing decomposition algorithms. Although the code is not publicly available, the authors report performance on a few graphs. Based on the reported results, PBNG significantly outperforms their parallel algorithms as well.
Further, these algorithms parallelize individual peeling iterations and will incur heavy synchronization similar to ParButterfly (table 4.4). Lastly, they are only designed for wing decomposition, whereas PBNG comprehensively targets both wing and tip decomposition. This is important because tip decomposition in PBNG (a) is typically faster than wing decomposition, and (b) can process large datasets like Orkut in a few minutes, which none of the existing tip or wing decomposition algorithms can process even in several days.

4.8 Chapter Summary

In this chapter, we studied the problem of bipartite graph decomposition, which is a computationally intensive graph application. We observed that existing decomposition algorithms only utilize a fraction of the parallelism within each level of the decomposition hierarchy. Consequently, they exhibit poor parallel scalability due to the large number of levels and the resulting synchronization.

To reduce synchronization, we proposed a novel two-phased peeling framework called PBNG, which is able to exploit additional parallelism at a coarser granularity – both within and across the levels of the decomposition hierarchy. The proposed approach further enabled novel optimizations that drastically reduce the computational workload, allowing bipartite decomposition to scale beyond the limits of current practice. On a 36-core shared-memory parallel server, our framework could process some of the largest publicly available bipartite datasets two orders of magnitude faster than the state-of-the-art, with more than three orders of magnitude reduction in thread synchronization. In terms of scalability, we achieved up to 19.7× self-relative parallel speedup, which is significantly better than the existing parallel algorithms.

Chapter 5

Conclusion

5.1 Summary of Contributions

In this thesis, we made a case for exploiting bottleneck-specific task granularities to design scalable and efficient graph algorithms. The conventional approach to parallelizing graph algorithms mines fine-grained tasks from existing sequential algorithms to maximally exploit the parallelism available in them. We showed that for several analytics, this approach can result in irregular memory accesses that decrease cache and DRAM performance, and excessive synchronization that restricts parallel scalability.

We addressed these issues by first identifying a level of parallelism, and concurrent tasks at the corresponding granularity, that would alleviate the performance bottlenecks for the given target application(s) and parallel computing platform. Building upon these tasks, we designed novel parallel algorithms and optimizations that implicitly mitigate major bottlenecks. The proposed algorithms were comprehensively evaluated on several large real-world datasets and showed drastic improvement over the state-of-the-art on various metrics including execution time, cache performance, communication volume and synchronization.

We demonstrated the applicability of our methodology on different categories of graph analytics. For Neighborhood Propagation algorithms, which have been the mainstay of graph processing research, we developed a highly cache-efficient framework by constructing coarse-granularity tasks in the form of cacheable vertex subsets called partitions. This task granularity further enabled novel data layouts and optimizations to boost convergence and enable streaming, high-bandwidth DRAM accesses.
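To make the partition-centric task granularity concrete, the following is a heavily simplified sketch of one scatter-gather iteration over cacheable partitions, in the style of the framework described in chapter 2. It is illustrative only: the partition layout, the bin structure and the PageRank-like update rule are hypothetical stand-ins, not the actual GPOP implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>
#include <omp.h>

// A partition owns a contiguous, cache-sized range of vertices and a local
// CSR of their outgoing edges.
struct Partition {
  uint32_t begin, end;                 // owned vertex range [begin, end)
  std::vector<uint32_t> offs, dsts;    // local CSR over the owned vertices
};

// One scatter-gather iteration of a PageRank-like kernel. Updates are
// streamed into per-(source partition, destination partition) bins so that
// the gather phase performs random accesses only inside a cacheable range.
void scatterGatherIteration(const std::vector<Partition>& parts,
                            const std::vector<uint32_t>& owner,   // vertex -> partition id
                            const std::vector<uint32_t>& outDeg,
                            std::vector<double>& rank) {
  const size_t P = parts.size();
  std::vector<std::vector<std::vector<std::pair<uint32_t, double>>>> bins(
      P, std::vector<std::vector<std::pair<uint32_t, double>>>(P));

  // Scatter: each thread streams the edges of its partitions exactly once.
  #pragma omp parallel for schedule(dynamic)
  for (long p = 0; p < static_cast<long>(P); ++p) {
    const Partition& part = parts[p];
    for (uint32_t u = part.begin; u < part.end; ++u) {
      double contrib = rank[u] / std::max<uint32_t>(outDeg[u], 1u);
      uint32_t lu = u - part.begin;                       // local index
      for (uint32_t i = part.offs[lu]; i < part.offs[lu + 1]; ++i) {
        uint32_t v = part.dsts[i];
        bins[p][owner[v]].push_back({v, contrib});
      }
    }
  }

  // Gather: each thread applies updates only to vertices it owns, so the
  // randomly accessed slice of rank[] fits in cache and no atomics are needed.
  #pragma omp parallel for schedule(dynamic)
  for (long q = 0; q < static_cast<long>(P); ++q) {
    for (uint32_t u = parts[q].begin; u < parts[q].end; ++u) rank[u] = 0.15;
    for (size_t p = 0; p < P; ++p)
      for (const auto& [v, c] : bins[p][q]) rank[v] += 0.85 * c;
  }
}
```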
Neighborhood Propagation algorithms such as PageRank, Connected Components and graph traversal are not only widely used analytics, but are also representative of computational patterns arising in graph processing. Thus, our contributions in this context have significance even beyond the algorithms considered in chapter 2.

In addition to the Neighborhood Propagation algorithms, we also investigated two complex analytics that have recently gained significant research interest – namely Hierarchical Hub Labeling and Bipartite Graph Decomposition. For these analytics, we developed the first scalable parallel solutions under the umbrella of our bottleneck-specific parallelism approach. Our algorithms significantly advance the current state of practice for these analytics, with up to two orders of magnitude speedup over the state-of-the-art and scalability to large datasets that the existing algorithms could not process. Besides supporting our hypothesis regarding graph analytics parallelization, these algorithms are also of independent interest for numerous applications that require shortest path computations and dense subgraph mining.

5.2 Future Work

Parallel graph processing is a rapidly evolving area of research that has witnessed advancements on several fronts. This includes both target architectures and analytics, which have been the key focus of our methodology. We believe that this thesis is well positioned in this area, with several directions for future work that can be built on top of our findings. Additionally, the novel algorithms presented in this thesis can also pave the way for research on massive scaling and parallelization of the respective analytics. In this section, we discuss some prominent areas for further research in the context of our work.

5.2.1 Graph Processing on Custom Architectures

Existing multiprocessor systems are designed for applications with regular and predictable computational patterns. The out-of-order processors, branch predictors, memory prefetchers and wide cache-line-sized DRAM transfers make use of the highly correlated access patterns in dense computing applications. However, the same features become a challenge for irregular applications like graph processing, as described in chapter 2. To bridge the gap between algorithmic and architectural development, researchers have proposed custom architectures tailored for processing irregular applications [39, 123, 56, 93].

Of particular interest is the Intel PIUMA [1] system, which features a distributed shared-memory architecture designed for graph analytics. Individual nodes in Intel PIUMA forego extensive caching and wide cache lines, and instead provide support for fine-grained memory accesses with massive multi-threading to hide access latency. A multi-node PIUMA system presents a global address space for high performance remote memory accesses without target synchronization. It also provides hardware support for Scatter-Gather operations in the form of DMA engine offloading, which are crucial primitives in GPOP.

Such an architecture can benefit heavily from a hierarchical programming model which can exploit the vast amount of parallelism in Neighborhood Propagation algorithms and efficiently utilize the PIUMA interconnection network. At the topmost level (inter-node parallelism), graph data can be partitioned similar to GPOP for high bandwidth streaming remote memory accesses. Remote data reads incur high latency and can stall the simple in-order pipelines in PIUMA. However, batched remote reads, as seen with coarse-grained partitions, can be offloaded to DMA engines prior to the actual consumption of data in order to avoid stalling. Moreover, the global address space allows for asynchronous execution similar to the scatter-gather interleaving in GPOP (section 2.3.5) for multi-core servers. Given the hardware support for random accesses, within each node the programming model must exploit the finest granularity of parallelism to maximize scalability. Potentially, the workload within a partition can be parallelized over individual edges to utilize the massive number of provisioned threads and the hardware support for work-stealing based load balancing.
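In software, the batched-read idea above can be emulated with double buffering: the data for the next partition is fetched asynchronously while the current partition is processed, so the compute pipeline is not stalled on high-latency reads. The sketch below is illustrative only; fetchPartition and processPartition are hypothetical callbacks, and std::async merely stands in for the kind of DMA offload that PIUMA would provide in hardware.

```cpp
#include <cstdint>
#include <functional>
#include <future>
#include <vector>

// Double-buffered partition processing: overlap the (potentially remote)
// fetch of partition p+1 with the computation on partition p.
void processAllPartitions(
    uint32_t numPartitions,
    const std::function<std::vector<uint32_t>(uint32_t)>& fetchPartition,
    const std::function<void(const std::vector<uint32_t>&)>& processPartition) {
  if (numPartitions == 0) return;
  auto pending = std::async(std::launch::async, fetchPartition, 0u);
  for (uint32_t p = 0; p < numPartitions; ++p) {
    std::vector<uint32_t> current = pending.get();        // wait for the prefetch
    if (p + 1 < numPartitions)
      pending = std::async(std::launch::async, fetchPartition, p + 1);  // start next fetch
    processPartition(current);                            // overlaps with the fetch
  }
}
```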
5.2.2 Advanced Memory Technologies

Advancements in memory technologies provide potential avenues for accelerating communication-intensive applications like graph processing. Several newer architectures are equipped with High Bandwidth Memory (HBM), which features much wider cache lines than DRAM [98, 114, 116]. However, HBM's capacity is limited and it is not suitable for storing the entirety of large real-world graphs. To efficiently utilize the high bandwidth of HBM, a hierarchical task partitioning can create HBM-sized tasks and reuse the partition data loaded on HBM to boost performance. This can be achieved by streaming partition data multiple times for multi-level update propagation within each partition. In the current implementation of GPOP, we only execute a single iteration of intra-partition update propagation. However, this is not a conceptual limitation but a matter of choice, because GPOP assumes a shared DRAM model where propagating updates within and across partitions has a similar cost. With a partition loaded on HBM, the former is significantly cheaper and can be executed multiple times to boost convergence.

Upcoming graph processing architectures [1] and FPGAs have also featured a large scratchpad in place of the conventional shared cache commonly present in multiprocessor platforms. Effectively, this provides a user-controlled, low-latency, random-access friendly memory in the hierarchy. The programming models studied in this thesis (chapter 2) explicitly determine the data segments that are randomly accessed and reused, and can efficiently make use of the deterministic caching capabilities of scratchpad memories.

5.2.3 Scaling Hub Labeling to Massive Networks

Hub Labeling enables extremely low-latency shortest path querying in large networks, which is crucial for several applications. However, despite the minimality of canonical labeling (chapter 3), the labeling size still restricts the applicability of Hub Labeling to billion-edge networks. Some recent works have proposed labeling size reduction methods for unweighted small-world networks. Akiba et al. [9] use bit-parallel labeling in which they store distances to multiple hubs adjacent to a common vertex within the same word. Li et al. [75] propose Local Minimum Set Elimination (LMSE), which drops labels for locally minimal ranked vertices. For such vertices, distance queries are answered by combining the labels of all of their neighbors. Despite these advancements, labeling massive networks with billions of edges remains challenging, with several hours to days of execution time on parallel processing platforms [75]. Distributed-memory parallelism is key to scaling such high complexity analytics, and the algorithms proposed in chapter 3 of this thesis could be valuable for this task.
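For context, the query side that motivates keeping labels small is a linear merge over two sorted label sets. The following is a minimal sketch of the standard 2-hop distance query; the types and field names are hypothetical and do not reflect the exact structures used in chapter 3.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// One label entry: (hub id, distance to that hub). Each vertex stores its
// entries sorted by hub id, which turns the query into a linear merge.
struct LabelEntry {
  uint32_t hub;
  uint32_t dist;
};

// Standard 2-hop query: dist(u, v) = min over common hubs h of d(u,h) + d(h,v).
// Runs in O(|L(u)| + |L(v)|) time, hence the emphasis on small label sizes.
uint64_t queryDistance(const std::vector<LabelEntry>& lu,
                       const std::vector<LabelEntry>& lv) {
  uint64_t best = std::numeric_limits<uint64_t>::max();   // "unreachable"
  size_t i = 0, j = 0;
  while (i < lu.size() && j < lv.size()) {
    if (lu[i].hub == lv[j].hub) {
      best = std::min<uint64_t>(best, uint64_t(lu[i].dist) + lv[j].dist);
      ++i; ++j;
    } else if (lu[i].hub < lv[j].hub) {
      ++i;
    } else {
      ++j;
    }
  }
  return best;
}
```

Compression schemes such as bit-parallel labels or LMSE reduce the number of entries this merge has to scan, which is what makes them attractive for billion-edge networks.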
Specifically, all the label compression techniques previously proposed can be easily incorporated with our PLaNT algorithm (section 3.4), because it does not use label-based pruning during indexing. It can unlock the benefits of label compression techniques on distributed systems without incurring the severe penalty of multi-node querying that other algorithms would require to implement such techniques.

For unweighted graphs, Multi-Source BFS (MS-BFS) can be investigated for the simultaneous construction of multiple SPTs within the same BFS instance [122]. MS-BFS exploits the fact that BFS only needs to propagate the "visited" status of vertices, which can be encoded in a single bit. The ancestor information required in PLaNT can also be compressed into a single bit, as the algorithm is only interested in knowing whether the ancestor is higher-ranked than the root, and not in the exact ancestor index. While this thesis focused on labeling generic networks with arbitrary weights, PLaNT can potentially enable significantly faster labeling of massive unweighted networks compared to existing algorithms.

5.2.4 Distributed-memory Parallel Graph Decomposition

Bipartite Graph Decomposition is computationally intensive, as observed in chapter 4 of this thesis. The use of the BE-Index for computational efficiency (section 4.2.4) further gives it a large memory footprint. Both of these characteristics motivate the use of distributed-memory solutions that can utilize the scalable memory and compute resources of multi-node clusters.

Multi-phased decomposition based on the approach proposed in this thesis (chapter 4) could underlie efficient algorithms for distributed bipartite graph decomposition. Such algorithms can employ recursive partitioning of the decomposition hierarchy to utilize the MPI+OpenMP programming model [96]. Recursive partitioning can also provide opportunities for better load balancing with adaptive determination of the number of partitions in each phase. Lastly, the computational and parallelization challenges observed with butterfly-based decomposition have also been observed for decomposition based on higher-order motifs such as 4-cliques or 5-cliques. We believe that our two-phased decomposition approach can be employed for efficient parallelization of higher-order clique decomposition.

Bibliography

[1] Sriram Aananthakrishnan, Nesreen K. Ahmed, Vincent Cavé, Marcelo Cintra, Yigit Demir, Kristof Du Bois, Stijn Eyerman, Joshua B. Fryman, Ivan Ganev, Wim Heirman, Hans-Christian Hoppe, Jason Howard, Ibrahim Hur, Midhunchandra Kodiyath, Samkit Jain, Daniel S. Klowden, Marek M. Landowski, Laurent Montigny, Ankit More, Przemyslaw Ossowski, Robert Pawlowski, Nick Pepperling, Fabrizio Petrini, Mariusz Sikora, Balasubramanian Seshasayee, Shaden Smith, Sebastian Szkoda, Sanjaya Tayal, Jesmin Jahan Tithi, Yves Vandriessche, and Izajasz P. Wrosz. "PIUMA: Programmable Integrated Unified Memory Architecture". In: CoRR abs/2010.06277 (2020). arXiv: 2010.06277.
[2] Ittai Abraham, Daniel Delling, Amos Fiat, Andrew V Goldberg, and Renato F Werneck. "VC-dimension and shortest path algorithms". In: International Colloquium on Automata, Languages, and Programming. Springer. 2011, pp. 690–699.
[3] Ittai Abraham, Daniel Delling, Andrew V Goldberg, and Renato F Werneck. "A hub-based labeling algorithm for shortest paths in road networks". In: International Symposium on Experimental Algorithms. Springer. 2011, pp. 230–241.
[4] Ittai Abraham, Daniel Delling, Andrew V Goldberg, and Renato F Werneck.
“Hierarchical hub labelings for shortest paths”. In: European Symposium on Algorithms. Springer. 2012, pp. 24–35. [5] Ittai Abraham, Amos Fiat, Andrew V Goldberg, and Renato F Werneck. “Highway dimension, shortest paths, and provably ecient algorithms”. In: Proceedings of the twenty-rst annual ACM-SIAM symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics. 2010, pp. 782–793. [6] Nesreen K. Ahmed, Jennifer Neville, Ryan A. Rossi, Nick G. Dueld, and Theodore L. Willke. “Graphlet Decomposition: Framework, Algorithms, and Applications”. In: Knowledge and Information Systems 50.3 (Mar. 2017), pp. 689–722.issn: 0219-1377. [7] Nesreen K Ahmed, Jennifer Neville, Ryan A Rossi, and Nick Dueld. “Ecient graphlet counting for large networks”. In: 2015 IEEE International Conference on Data Mining. IEEE. 2015, pp. 1–10. [8] Takuya Akiba, Yoichi Iwata, Ken-ichi Kawarabayashi, and Yuki Kawata. “Fast shortest-path distance queries on road networks by pruned highway labeling”. In: 2014 Proceedings of the Sixteenth Workshop on Algorithm Engineering and Experiments (ALENEX). SIAM. 2014, pp. 147–154. 140 [9] Takuya Akiba, Yoichi Iwata, and Yuichi Yoshida. “Fast exact shortest-path distance queries on large networks by pruned landmark labeling”. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM. 2013, pp. 349–360. [10] Takuya Akiba, Christian Sommer, and Ken-ichi Kawarabayashi. “Shortest-path queries for complex networks: exploiting low tree-width outside the core”. In: Proceedings of the 15th International Conference on Extending Database Technology. ACM. 2012, pp. 144–155. [11] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. The landscape of parallel computing research: A view from berkeley. Tech. rep. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006. [12] Ariful Azad and Aydin Buluç. “A work-ecient parallel sparse matrix-sparse vector multiplication algorithm”. In: Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE. 2017, pp. 688–697. [13] Maxim Babenko, Andrew V Goldberg, Haim Kaplan, Ruslan Savchenko, and Mathias Weller. “On the complexity of hub labeling”. In: International Symposium on Mathematical Foundations of Computer Science. Springer. 2015, pp. 62–74. [14] Reinhard Bauer, Daniel Delling, Peter Sanders, Dennis Schieferdecker, Dominik Schultes, and Dorothea Wagner. “Combining hierarchical and goal-directed speed-up techniques for dijkstra’s algorithm”. In: ACM Journal of Experimental Algorithmics 15.2.3 (2010), pp. 2.1–2.31. [15] Scott Beamer, Krste Asanović, and David Patterson. “Direction-optimizing breadth-rst search”. In: Scientic Programming 21.3-4 (2013), pp. 137–148. [16] Scott Beamer, Krste Asanović, and David Patterson. “The GAP Benchmark Suite”. In: CoRR abs/1508.03619 (2015). arXiv: 1508.03619. [17] Aleksandar Bojchevski, Johannes Klicpera, Bryan Perozzi, Amol Kapoor, Martin Blais, Benedek Rózemberczki, Michal Lukasik, and Stephan Günnemann. “Scaling graph neural networks with approximate pagerank”. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020, pp. 2464–2473. [18] Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. “Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks”. 
In: Proceedings of the 20th international conference on World Wide Web. Ed. by Sadagopan Srinivasan, Krithi Ramamritham, Arun Kumar, M. P. Ravindra, Elisa Bertino, and Ravi Kumar. ACM Press, 2011, pp. 587–596. [19] Paolo Boldi and Sebastiano Vigna. “The WebGraph Framework I: Compression Techniques”. In: Proc. of the Thirteenth International World Wide Web Conference (WWW 2004). Manhattan, USA: ACM Press, 2004, pp. 595–601. [20] Francesco Bonchi, Arijit Khan, and Lorenzo Severini. “Distance-generalized core decomposition”. In: Proceedings of the 2019 International Conference on Management of Data. 2019, pp. 1006–1023. 141 [21] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. “Graph structure in the web”. In: Computer networks 33.1-6 (2000), pp. 309–320. [22] Aydın Buluç and John R Gilbert. “The Combinatorial BLAS: Design, implementation, and applications”. In: The International Journal of High Performance Computing Applications 25.4 (2011), pp. 496–509. [23] Daniele Buono, Fabrizio Petrini, Fabio Checconi, Xing Liu, Xinyu Que, Chris Long, and Tai-Ching Tuan. “Optimizing sparse matrix-vector multiplication for large-scale data analytics”. In: Proceedings of the 2016 International Conference on Supercomputing. 2016, pp. 1–12. [24] Federico Busato, Oded Green, Nicola Bombieri, and David A Bader. “Hornet: An ecient data structure for dynamic sparse graphs and matrices on gpus”. In: 2018 IEEE High Performance extreme Computing Conference (HPEC). IEEE. 2018, pp. 1–7. [25] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. “R-MAT: A recursive model for graph mining”. In: Proceedings of the 2004 SIAM International Conference on Data Mining. SIAM. 2004, pp. 442–446. [26] Jie Chen and Yousef Saad. “Dense subgraph extraction with application to community detection”. In: IEEE Transactions on knowledge and data engineering 24.7 (2010), pp. 1216–1230. [27] Norishige Chiba and Takao Nishizeki. “Arboricity and subgraph listing algorithms”. In: SIAM Journal on computing 14.1 (1985), pp. 210–223. [28] Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick. “Reachability and distance queries via 2-hop labels”. In: SIAM Journal on Computing 32.5 (2003), pp. 1338–1355. [29] Jonathan Cohen. “Trusses: Cohesive subgraphs for social network analysis”. In: National security agency technical report 16.3.1 (2008). [30] Ketan Date, Keven Feng, Rakesh Nagi, Jinjun Xiong, Nam Sung Kim, and Wen-Mei Hwu. “Collaborative (cpu+ gpu) algorithms for triangle counting and truss decomposition on the minsky architecture: Static graph challenge: Subgraph isomorphism”. In: 2017 IEEE High Performance Extreme Computing Conference (HPEC). IEEE. 2017, pp. 1–7. [31] Daniel Delling, Andrew V Goldberg, Ruslan Savchenko, and Renato F Werneck. “Hub labels: Theory and practice”. In: International Symposium on Experimental Algorithms. Springer. 2014, pp. 259–270. [32] Daniel Delling, Andrew V Goldberg, and Renato F Werneck. “Hub label compression”. In: International Symposium on Experimental Algorithms. Springer. 2013, pp. 18–29. [33] Camil Demetrescu, Andrew Goldberg, and David Johnson. “9th DIMACS implementation challenge–Shortest Paths”. In: American Mathematical Society (2006). [34] Inderjit S Dhillon. “Co-clustering documents and words using bipartite spectral graph partitioning”. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. 2001, pp. 269–274. 
142 [35] Laxman Dhulipala, Guy E Blelloch, and Julian Shun. “Low-latency graph streaming using compressed purely-functional trees”. In: Proceedings of the 40th ACM SIGPLAN conference on programming language design and implementation. 2019, pp. 918–934. [36] Laxman Dhulipala, Guy Blelloch, and Julian Shun. “Julienne: A framework for parallel graph algorithms using work-ecient bucketing”. In: Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures. ACM. 2017, pp. 293–304. [37] Qing Dong, Kartik Lakhotia, Hanqing Zeng, Rajgopal Karman, Viktor Prasanna, and Guna Seetharaman. “A Fast and Ecient Parallel Algorithm for Pruned Landmark Labeling”. In: 2018 IEEE High Performance extreme Computing Conference (HPEC). IEEE. 2018, pp. 1–7. [38] Alessandro Epasto, Silvio Lattanzi, Vahab Mirrokni, Ismail Oner Sebe, Ahmed Taei, and Sunita Verma. “Ego-net community mining applied to friend suggestion”. In: Proceedings of the VLDB Endowment 9.4 (2015), pp. 324–335. [39] Stijn Eyerman, Wim Heirman, Kristof Du Bois, Joshua B Fryman, and Ibrahim Hur. “Many-core graph workload analysis”. In: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE. 2018, pp. 282–292. [40] Yixiang Fang, Yixing Yang, Wenjie Zhang, Xuemin Lin, and Xin Cao. “Eective and ecient community search over large heterogeneous information networks”. In: Proceedings of the VLDB Endowment 13.6 (2020), pp. 854–867. [41] Yixiang Fang, Kaiqiang Yu, Reynold Cheng, Laks VS Lakshmanan, and Xuemin Lin. “Ecient algorithms for densest subgraph discovery”. In: Proceedings of the VLDB Endowment 12.11 (2019), pp. 1719–1732. [42] Geli Fei, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. “Exploiting burstiness in reviews for review spammer detection”. In: Seventh international AAAI conference on weblogs and social media. 2013, pp. 175–184. [43] Damir Ferizovic. “Parallel Pruned Landmark Labeling for Shortest Path Queries on Unit-Weight Networks”. Abschlussarbeit - Bachelor. 2015. 36 pp. [44] James Fox, Oded Green, Kasimir Gabert, Xiaojing An, and David A Bader. “Fast and adaptive list intersections on the gpu”. In: 2018 IEEE High Performance extreme Computing Conference (HPEC). IEEE. 2018, pp. 1–7. [45] Ada Wai-Chee Fu, Huanhuan Wu, James Cheng, and Raymond Chi-Wing Wong. “Is-label: an independent-set based labeling scheme for point-to-point distance querying”. In: Proceedings of the VLDB Endowment 6.6 (2013), pp. 457–468. [46] David Gibson, Ravi Kumar, and Andrew Tomkins. “Discovering large dense subgraphs in massive graphs”. In: Proceedings of the 31st international conference on Very large data bases. 2005, pp. 721–732. 143 [47] David Gibson, Ravi Kumar, and Andrew Tomkins. “Discovering large dense subgraphs in massive graphs”. In: Proceedings of the 31st international conference on Very large data bases. 2005, pp. 721–732. [48] Neil Zhenqiang Gong, Wenchang Xu, Ling Huang, Prateek Mittal, Emil Stefanov, Vyas Sekar, and Dawn Song. “Evolution of social-attribute networks: measurements, modeling, and implications using google+”. In: Proceedings of the 2012 Internet Measurement Conference. ACM. 2012, pp. 131–144. [49] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. “PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs”. In: Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). USENIX, 2012, pp. 17–30. [50] Ronald L. Graham. 
“Bounds on multiprocessing timing anomalies”. In: SIAM journal on Applied Mathematics 17.2 (1969), pp. 416–429. [51] Oded Green and David A Bader. “cuSTINGER: Supporting dynamic graph algorithms for GPUs”. In: 2016 IEEE High Performance Extreme Computing Conference (HPEC). IEEE. 2016, pp. 1–6. [52] Oded Green, Marat Dukhan, and Richard Vuduc. “Branch-avoiding graph algorithms”. In: Proceedings of the 27th ACM symposium on Parallelism in Algorithms and Architectures. 2015, pp. 212–223. [53] Oded Green, James Fox, Euna Kim, Federico Busato, Nicola Bombieri, Kartik Lakhotia, Shijie Zhou, Shreyas Singapura, Hanqing Zeng, Rajgopal Kannan, Viktor Prasanna, and David Bader. “Quickly nding a truss in a haystack”. In: 2017 IEEE High Performance Extreme Computing Conference (HPEC). IEEE. 2017, pp. 1–7. [54] Oded Green, James Fox, Alex Watkins, Alok Tripathy, Kasimir Gabert, Euna Kim, Xiaojing An, Kumar Aatish, and David A Bader. “Logarithmic radix binning and vectorized triangle counting”. In: 2018 IEEE High Performance extreme Computing Conference (HPEC). IEEE. 2018, pp. 1–7. [55] Mordechai Haklay and Patrick Weber. “Openstreetmap: User-generated street maps”. In: IEEE Pervasive Computing 7.4 (2008), pp. 12–18. [56] Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret Martonosi. “Graphicionado: A high-performance and energy-ecient accelerator for graph analytics”. In: 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE. 2016, pp. 1–13. [57] Minyang Han and Khuzaima Daudjee. “Giraph unchained: barrierless asynchronous parallel execution in pregel-like graph processing systems”. In: Proceedings of the VLDB Endowment 8.9 (2015), pp. 950–961. [58] Peter E Hart, Nils J Nilsson, and Bertram Raphael. “A formal basis for the heuristic determination of minimum cost paths”. In: IEEE transactions on Systems Science and Cybernetics 4.2 (1968), pp. 100–107. 144 [59] Ruining He and Julian McAuley. “Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative ltering”. In: Proceedings of the 25th International Conference on World Wide Web. ACM. 2016, pp. 507–517. [60] Yizhang He, Kai Wang, Wenjie Zhang, Xuemin Lin, and Ying Zhang. “Exploring cohesive subgraphs with vertex engagement and tie strength in bipartite graphs”. In: Information Sciences 572 (2021), pp. 277–296. [61] Sungpack Hong, Hassan Cha, Edic Sedlar, and Kunle Olukotun. “Green-Marl: a DSL for easy and ecient graph analysis”. In: ACM SIGARCH Computer Architecture News. ACM. 2012, pp. 349–362. [62] Yang Hu, Hang Liu, and H Howie Huang. “Tricore: Parallel triangle counting on gpus”. In: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE. 2018, pp. 171–182. [63] Zan Huang, Daniel D Zeng, and Hsinchun Chen. “Analyzing consumer-product graphs: Empirical ndings and applications in recommender systems”. In: Management science 53.7 (2007), pp. 1146–1164. [64] Wolfgang Huber, Vincent J Carey, Li Long, Seth Falcon, and Robert Gentleman. “Graphs in molecular biology”. In: BMC bioinformatics 8.6 (2007), pp. 1–14. [65] Minhao Jiang, Ada Wai-Chee Fu, Raymond Chi-Wing Wong, and Yanyan Xu. “Hop doubling label indexing for point-to-point distance querying on scale-free networks”. In: Proceedings of the VLDB Endowment 7.12 (2014), pp. 1203–1214. [66] Wissam Khaouid, Marina Barsky, Venkatesh Srinivasan, and Alex Thomo. “K-core decomposition of large networks on a single PC”. In: Proceedings of the VLDB Endowment 9.1 (2015), pp. 13–23. 
[67] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. “Predict then propagate: Graph neural networks meet personalized pagerank”. In: arXiv preprint arXiv:1810.05997 (2018). [68] Jérôme Kunegis. “Konect: the koblenz network collection”. In: Proceedings of the 22nd International Conference on World Wide Web. ACM. 2013, pp. 1343–1350. [69] Kartik Lakhotia, Rajgopal Kannan, Qing Dong, and Viktor Prasanna. “Planting trees for scalable and ecient canonical hub labeling”. In: vol. 13. 4. VLDB Endowment, 2019, pp. 492–505. [70] Kartik Lakhotia, Rajgopal Kannan, Sourav Pati, and Viktor Prasanna. “GPOP: a cache and memory-ecient framework for graph processing over partitions”. In: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. 2019, pp. 393–394. [71] Kartik Lakhotia, Rajgopal Kannan, and Viktor Prasanna. “Accelerating pagerank using partition-centric processing”. In: 2018fUSENIXg Annual Technical Conference (fUSENIXgfATCg 18). 2018, pp. 427–440. [72] Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna, and Cesar AF De Rose. “Receipt: rene coarse-grained independent tasks for parallel tip decomposition of bipartite graphs”. In: Proceedings of the VLDB Endowment 14.3 (2020), pp. 404–417. 145 [73] Victor E Lee, Ning Ruan, Ruoming Jin, and Charu Aggarwal. “A survey of algorithms for dense subgraph discovery”. In: Managing and Mining Graph Data. Springer, 2010, pp. 303–336. [74] Elizabeth A Leicht, Petter Holme, and Mark EJ Newman. “Vertex similarity in networks”. In: Physical Review E 73.2 (2006), p. 026120. [75] Wentao Li, Miao Qiao, Lu Qin, Ying Zhang, Lijun Chang, and Xuemin Lin. “Scaling Distance Labeling on Small-World Networks”. In: Proceedings of the 2019 International Conference on Management of Data. SIGMOD ’19. New York, NY, USA: ACM, 2019, pp. 1060–1077. [76] Ye Li, Leong Hou U, Man Lung Yiu, and Ngai Meng Kou. “An experimental study on hub labeling based shortest path algorithms”. In: Proceedings of the VLDB Endowment 11.4 (2017), pp. 445–457. [77] Ee-Peng Lim, Viet-An Nguyen, Nitin Jindal, Bing Liu, and Hady Wirawan Lauw. “Detecting product review spammers using rating behaviors”. In: Proceedings of the 19th ACM international conference on Information and knowledge management. 2010, pp. 939–948. [78] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M Hellerstein. “Distributed GraphLab: a framework for machine learning and data mining in the cloud”. In: Proceedings of the VLDB Endowment 5.8 (2012), pp. 716–727. [79] Andrew Lumsdaine, Douglas Gregor, Bruce Hendrickson, and Jonathan Berry. “Challenges in parallel graph processing”. In: Parallel Processing Letters 17.01 (2007), pp. 5–20. [80] Chenhao Ma, Reynold Cheng, Laks VS Lakshmanan, Tobias Grubenmann, Yixiang Fang, and Xiaodong Li. “LINC: a motif counting algorithm for uncertain graphs”. In: Proceedings of the VLDB Endowment 13.2 (2019), pp. 155–168. [81] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. “Pregel: a system for large-scale graph processing”. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM. 2010, pp. 135–146. [82] Kiran Kumar Matam, Hanieh Hashemi, and Murali Annavaram. “PartitionedVC: Partitioned External Memory Graph Analytics Framework for SSDs”. In: arXiv preprint arXiv:1905.04264 (2019). [83] Frank McSherry, Michael Isard, and Derek G. Murray. “Scalability! 
But at What Cost?” In: Proceedings of the 15th USENIX Conference on Hot Topics in Operating Systems. HOTOS’15. USENIX Association, 2015, pp. 1–6. [84] Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, and Christian Bizer. “The graph structure in the web: Analyzed on dierent aggregation levels”. In: The Journal of Web Science 1.1 (2015), pp. 33–47. [85] Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. “Measurement and Analysis of Online Social Networks”. In: Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC’07). ACM, 2007, pp. 29–42. [86] Arjun Mukherjee, Bing Liu, and Natalie Glance. “Spotting fake reviewer groups in consumer reviews”. In: Proceedings of the 21st international conference on World Wide Web. ACM. 2012, pp. 191–200. 146 [87] Richard C Murphy, Kyle B Wheeler, Brian W Barrett, and James A Ang. “Introducing the graph 500”. In: Cray Users Group (CUG) 19 (2010), pp. 45–74. [88] Eisha Nathan, Anita Zakrzewska, Jason Riedy, and David A Bader. “Local community detection in dynamic graphs using personalized centrality”. In: Algorithms 10.3 (2017). [89] Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. “Graph summarization with bounded error”. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data. 2008, pp. 419–432. [90] Mark EJ Newman. “Scientic collaboration networks. II. Shortest paths, weighted networks, and centrality”. In: Physical review E 64.1 (2001), p. 016132. [91] Mark EJ Newman, Duncan J Watts, and Steven H Strogatz. “Random graph models of social networks”. In: Proceedings of the national academy of sciences 99.suppl 1 (2002), pp. 2566–2572. [92] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. “A lightweight infrastructure for graph analytics”. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM. 2013, pp. 456–471. [93] Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, Andrey Ayupov, John Greth, Steven Burns, and Ozcan Ozturk. “Energy ecient architecture for graph analytics accelerators”. In: ACM SIGARCH Computer Architecture News 44.3 (2016), pp. 166–177. [94] Ha-Myung Park, Sung-Hyon Myaeng, and U. Kang. “PTE: Enumerating Trillion Triangles On Distributed Systems”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16. ACM Press, 2016, pp. 1–30. [95] Kun Qiu, Yuanyang Zhu, Jing Yuan, Jin Zhao, Xin Wang, and Tilman Wolf. “ParaPLL: Fast Parallel Shortest-path Distance Query on Large-scale Weighted Graphs”. In: Proceedings of the 47th International Conference on Parallel Processing. ACM. 2018, pp. 1–10. [96] Rolf Rabenseifner, Georg Hager, and Gabriele Jost. “Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes”. In: 2009 17th Euromicro international conference on parallel, distributed and network-based processing. IEEE. 2009, pp. 427–436. [97] Syed Asad Rahman, P Advani, R Schunk, Rainer Schrader, and Dietmar Schomburg. “Metabolic pathway analysis web service (Pathway Hunter Tool at CUBIC)”. In: Bioinformatics 21.7 (2004), pp. 1189–1193. [98] Suresh Ramalingam. “HBM package integration: Technology trends, challenges and applications”. In: 2016 IEEE Hot Chips 28 Symposium (HCS). IEEE. 2016, pp. 1–17. [99] E Jason Riedy, Henning Meyerhenke, David Ediger, and David A Bader. “Parallel community detection for massive graphs”. In: International Conference on Parallel Processing and Applied Mathematics. Springer. 2011, pp. 286–296. 
147 [100] Ryan Rossi and Nesreen Ahmed. “The network data repository with interactive graph analytics and visualization”. In: Proceedings of the AAAI Conference on Articial Intelligence. Vol. 29. 1. 2015, pp. 4292–4293. [101] Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. “X-stream: Edge-centric graph processing using streaming partitions”. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM. 2013, pp. 472–488. [102] Polina Rozenshtein, Aris Anagnostopoulos, Aristides Gionis, and Nikolaj Tatti. “Event detection in activity networks”. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014, pp. 1176–1185. [103] Siddharth Samsi, Vijay Gadepally, Michael Hurley, Michael Jones, Edward Kao, Sanjeev Mohindra, Paul Monticciolo, Albert Reuther, Steven Smith, William Song, Diane Staheli, and Jeremy Kepner. “Static graph challenge: Subgraph isomorphism”. In: 2017 IEEE High Performance Extreme Computing Conference (HPEC). IEEE. 2017, pp. 1–6. [104] Seyed-Vahid Sanei-Mehri, Ahmet Erdem Sariyuce, and Srikanta Tirthapura. “Buttery counting in bipartite networks”. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018, pp. 2150–2159. [105] Seyed-Vahid Sanei-Mehri, Yu Zhang, Ahmet Erdem Sariyüce, and Srikanta Tirthapura. “FLEET: Buttery Estimation from a Bipartite Graph Stream”. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2019, pp. 1201–1210. [106] Ahmet Erdem Sariyuce, C Seshadhri, Ali Pinar, and Umit V Catalyurek. “Finding the hierarchy of dense subgraphs using nucleus decompositions”. In: Proceedings of the 24th International Conference on World Wide Web. ACM. 2015, pp. 927–937. [107] Ahmet Erdem Sarıyüce and Ali Pinar. “Fast Hierarchy Construction for Dense Subgraphs”. In: Proceedings of the VLDB Endowment 10.3 (2016). [108] Ahmet Erdem Sarıyüce and Ali Pinar. “Peeling bipartite networks for dense subgraph discovery”. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 2018, pp. 504–512. [109] Jessica Shi and Julian Shun. “Parallel Algorithms for Buttery Computations”. In: arXiv preprint arXiv:1907.08607 (2019). [110] Julian Shun and Guy E Blelloch. “Ligra: a lightweight graph processing framework for shared memory”. In: ACM Sigplan Notices. Vol. 48. ACM. 2013, pp. 135–146. [111] Julian Shun, Farbod Roosta-Khorasani, Kimon Fountoulakis, and Michael W Mahoney. “Parallel local graph clustering”. In: Proceedings of the VLDB Endowment 9.12 (2016), pp. 1041–1052. [112] Julian Shun and Kanat Tangwongsan. “Multicore triangle computations without tuning”. In: 2015 IEEE 31st International Conference on Data Engineering. IEEE. 2015, pp. 149–160. 148 [113] Jeremy G Siek, Lie-Quan Lee, and Andrew Lumsdaine. The Boost Graph Library: User Guide and Reference Manual, Portable Documents. Pearson Education, 2001. [114] Gaurav Singh and S Ahmad. “Xilinx 16nm datacenter device family with in-package hbm and ccix interconnect”. In: IEEE Hot Chips Symposium. 2017. [115] Shaden Smith, Xing Liu, Nesreen K Ahmed, Ancy Sarah Tom, Fabrizio Petrini, and George Karypis. “Truss decomposition on shared-memory parallel systems”. In: 2017 IEEE High Performance Extreme Computing Conference (HPEC). IEEE. 2017, pp. 1–6. [116] Avinash Sodani, Roger Gramunt, Jesus Corbal, Ho-Seop Kim, Krishna Vinod, Sundaram Chinthamani, Steven Hutsell, Rajat Agarwal, and Yen-Chen Liu. 
“Knights landing: Second-generation intel xeon phi product”. In: Ieee micro 36.2 (2016), pp. 34–46. [117] Daniel A Spielman and Shang-Hua Teng. “A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning”. In: SIAM Journal on Computing 42.1 (2013), pp. 1–26. [118] Christian L Staudt and Henning Meyerhenke. “Engineering parallel algorithms for community detection in massive networks”. In:IEEETransactionsonParallelandDistributedSystems 27.1 (2015), pp. 171–184. [119] Jimeng Sun, Huiming Qu, Deepayan Chakrabarti, and Christos Faloutsos. “Neighborhood formation and anomaly detection in bipartite graphs”. In: Fifth IEEE International Conference on Data Mining (ICDM’05). IEEE. 2005, 8–pp. [120] Narayanan Sundaram, Nadathur Satish, Md Mostofa Ali Patwary, Subramanya R Dulloor, Michael J Anderson, Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey. “Graphmat: High performance graph analytics made productive”. In: Proceedings of the VLDB Endowment 8.11 (2015), pp. 1214–1225. [121] The Skitter AS Links Dataset. [Online; accessed 8-April-2019]. 2019.url: http://www.caida.org/data/active/skitter_aslinks_dataset.xml. [122] Manuel Then, Moritz Kaufmann, Fernando Chirigati, Tuan-Anh Hoang-Vu, Kien Pham, Alfons Kemper, Thomas Neumann, and Huy T Vo. “The more the merrier: Ecient multi-source graph traversal”. In: Proceedings of the VLDB Endowment 8.4 (2014), pp. 449–460. [123] Trung Tran. “Hierarchical Identify Verify Exploit (HIVE)”. In: DARPA/MTO. Proposers Day Brief DARPA-BAA-16-52 13 (2016). [124] Charalampos E Tsourakakis. “A novel approach to nding near-cliques: The triangle-densest subgraph problem”. In: arXiv preprint arXiv:1405.1477 (2014). [125] Chad Voegele, Yi-Shan Lu, Sreepathi Pai, and Keshav Pingali. “Parallel triangle counting and k-truss identication using graph-centric methods”. In: 2017 IEEE High Performance Extreme Computing Conference (HPEC). IEEE. 2017, pp. 1–7. 149 [126] Richard Vuduc, James W Demmel, and Katherine A Yelick. “OSKI: A library of automatically tuned sparse matrix kernels”. In: Journal of Physics: Conference Series. IOP Publishing. 2005, p. 521. [127] Jia Wang and James Cheng. “Truss Decomposition in Massive Networks”. In: Proceedings of the VLDB Endowment 5.9 (2012), pp. 812–823. [128] Jia Wang, Ada Wai-Chee Fu, and James Cheng. “Rectangle counting in large bipartite graphs”. In: 2014 IEEE International Congress on Big Data. IEEE. 2014, pp. 17–24. [129] Kai Wang, Xin Cao, Xuemin Lin, Wenjie Zhang, and Lu Qin. “Ecient computing of radius-bounded k-cores”. In: 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE. 2018, pp. 233–244. [130] Kai Wang, Xuemin Lin, Lu Qin, Wenjie Zhang, and Ying Zhang. “Ecient bitruss decomposition for large-scale bipartite graphs”. In: arXiv preprint arXiv:2001.06111 (2020). [131] Kai Wang, Xuemin Lin, Lu Qin, Wenjie Zhang, and Ying Zhang. “Towards ecient solutions of bitruss decomposition for large-scale bipartite graphs”. In: The VLDB Journal (2021), pp. 1–24. [132] Kai Wang, Xuemin Lin, Lu Qin, Wenjie Zhang, and Ying Zhang. “Vertex priority based buttery counting for large-scale bipartite networks”. In: Proceedings of the VLDB Endowment 12.10 (2019), pp. 1139–1152. [133] Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riel, and John D Owens. “Gunrock: A high-performance graph processing library on the GPU”. In: Proceedings of the 21st ACM SIGPLAN symposium on principles and practice of parallel programming. 2016, pp. 1–12. 
[134] Hao Wei, Jerey Xu Yu, Can Lu, and Xuemin Lin. “Speedup graph processing by graph ordering”. In: Proceedings of the 2016 International Conference on Management of Data. ACM. 2016, pp. 1813–1828. [135] Dong Wen, Lu Qin, Ying Zhang, Xuemin Lin, and Jerey Xu Yu. “I/o ecient core graph decomposition: application to degeneracy ordering”. In: IEEE Transactions on Knowledge and Data Engineering 31.1 (2018), pp. 75–90. [136] Jianzhen Xu and Yongjin Li. “Discovering disease-genes by topological features in human protein–protein interaction network”. In: Bioinformatics 22.22 (2006), pp. 2800–2805. [137] Albert-Jan N Yzelman and Rob H Bisseling. “A cache-oblivious sparse matrix–vector multiplication scheme based on the Hilbert curve”. In: Progress in Industrial Mathematics at ECMI 2010. Springer, 2012, pp. 627–633. [138] Detian Zhang, Chi-Yin Chow, An Liu, Xiangliang Zhang, Qingzhu Ding, and Qing Li. “Ecient evaluation of shortest travel-time path queries through spatial mashups”. In: GeoInformatica 22.1 (2018), pp. 3–28. [139] Kaiyuan Zhang, Rong Chen, and Haibo Chen. “NUMA-aware graph-structured analytics”. In: ACM SIGPLAN Notices 50.8 (2015), pp. 183–193. 150 [140] Yunming Zhang, Mengjiao Yang, Riyadh Baghdadi, Shoaib Kamil, Julian Shun, and Saman Amarasinghe. “GraphIt-A High-Performance DSL for Graph Analytics”. In: arXiv preprint arXiv:1805.00923 (2018). [141] Shijie Zhou and Viktor K Prasanna. “Accelerating Graph Analytics on CPU-FPGA Heterogeneous Platform”. In: 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE. 2017, pp. 137–144. [142] Xiaojin Zhu and Zoubin Ghahramani. “Learning from labeled and unlabeled data with label propagation”. In: 119 (2005), pp. 1036–1043. [143] Xiaowei Zhu, Wentao Han, and Wenguang Chen. “GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning”. In: 2015 USENIX Annual Technical Conference (USENIX ATC 15). USENIX Association, 2015, pp. 375–386. [144] Youwei Zhuo, Jingji Chen, Qinyi Luo, Yanzhi Wang, Hailong Yang, Depei Qian, and Xuehai Qian. “SympleGraph: distributed graph processing with precise loop-carried dependency guarantee”. In: Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation. 2020, pp. 592–607. [145] Zhaonian Zou. “Bitruss decomposition of bipartite graphs”. In: International Conference on Database Systems for Advanced Applications. Springer. 2016, pp. 218–233. 151
Abstract
Graphs are a preferred choice of data representation in many domains, including social networks, transportation, bioinformatics, and e-commerce. With the rapid growth of data produced in these domains, parallel computing has become crucial for scaling the capabilities of graph analytics. Conventional approaches to parallelizing graph analytics typically create fine-granularity tasks to maximally exploit the parallelizable workload available in existing sequential algorithms. However, the resulting parallel algorithms suffer fundamental drawbacks in terms of random memory accesses, communication, and synchronization. These drawbacks arise primarily from (1) the irregular nature of graph data, and (2) dependencies in the underlying sequential algorithms, which were not designed for parallel execution. In this dissertation, we make a case for exploiting concurrent tasks at a level that inherently targets the bottlenecks in application performance, instead of mining parallelism from sequential algorithms. We identify bottleneck-alleviating tasks for several analytics with different computation-communication characteristics and develop novel parallel algorithms that significantly outperform existing approaches.
In the first half of this thesis, we present a cache-efficient framework for neighborhood traversal algorithms such as BFS, PageRank, and shortest path computation. Our framework uses concurrent tasks at the granularity of cacheable vertex subsets to simultaneously enable high-performance memory access patterns and asynchronous execution. On large graphs with billions of edges, it achieves up to 3.6× and 5.5× reductions in execution time and cache misses, respectively, compared to state-of-the-art frameworks. In the second half of this thesis, we study complex analytics that execute several sequentially dependent instances of graph traversal. For such analytics, we develop the first set of scalable algorithms that exploit coarse-granularity parallelism across the traversal instances. Our algorithms appreciably extend the state of current practice, delivering more than two orders of magnitude speedup over existing algorithms and processing in a few minutes large datasets that baselines cannot process even in several days.
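As a concrete illustration of the cacheable-subset idea summarized above, the following minimal C++ sketch shows one way to organize a propagation step around partition-granularity tasks: each partition scatters its outgoing updates into per-partition bins, and each partition then gathers only its own bin, so every task touches a cache-resident slice of the value arrays. This is only a toy sketch under assumed names (Partition, scatter, gather, and the toy graph in main); it is not the dissertation's framework, and it omits the parallel scheduling that would run the per-partition calls as concurrent tasks.

#include <cstddef>
#include <utility>
#include <vector>

// One cacheable vertex subset: its vertex IDs and its outgoing edges,
// bucketed by the partition that owns each destination vertex.
struct Edge { int src, dst; };
struct Partition {
    std::vector<int> vertices;
    std::vector<std::vector<Edge>> out_edges; // out_edges[q] = edges into partition q
};

// Scatter: stream this partition's vertex values once and deposit
// (destination, value) messages into the bin owned by the target partition.
void scatter(const Partition& p, const std::vector<double>& value,
             std::vector<std::vector<std::pair<int, double>>>& bins) {
    for (std::size_t q = 0; q < p.out_edges.size(); ++q)
        for (const Edge& e : p.out_edges[q])
            bins[q].push_back({e.dst, value[e.src]});
}

// Gather: apply only the bin addressed to partition q, so the writes stay
// within that partition's cache-resident range of next_value.
void gather(std::size_t q,
            const std::vector<std::vector<std::pair<int, double>>>& bins,
            std::vector<double>& next_value) {
    for (const auto& msg : bins[q])
        next_value[msg.first] += msg.second;
}

int main() {
    // Toy graph: 4 vertices, partition 0 = {0, 1}, partition 1 = {2, 3}.
    Partition p0;
    p0.vertices = {0, 1};
    p0.out_edges = { { {0, 1} },            // edge 0 -> 1 stays inside partition 0
                     { {1, 2}, {1, 3} } };  // edges 1 -> 2 and 1 -> 3 go to partition 1

    std::vector<double> value = {1.0, 1.0, 1.0, 1.0};
    std::vector<double> next_value(4, 0.0);
    std::vector<std::vector<std::pair<int, double>>> bins(2);

    scatter(p0, value, bins);    // in the full framework, one concurrent task per partition
    gather(0, bins, next_value);
    gather(1, bins, next_value);
    return 0;
}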
Conceptually similar
Architecture design and algorithmic optimizations for accelerating graph analytics on FPGA
Hardware and software techniques for irregular parallelism
Scalable exact inference in probabilistic graphical models on multi-core platforms
Scaling up deep graph learning: efficient algorithms, expressive models and fast acceleration
Scaling up temporal graph learning: powerful models, efficient algorithms, and optimized systems
Dynamic graph analytics for cyber systems security applications
Hardware-software codesign for accelerating graph neural networks on FPGA
Efficient graph learning: theory and performance evaluation
Algorithm and system co-optimization of graph and machine learning systems
Towards the efficient and flexible leveraging of distributed memories
Efficient memory coherence and consistency support for enabling data sharing in GPUs
Acceleration of deep reinforcement learning: efficient algorithms and hardware mapping
Multi-softcore architectures and algorithms for a class of sparse computations
Fast and label-efficient graph representation learning
Coded computing: a transformative framework for resilient, secure, private, and communication efficient large scale distributed computing
Autotuning, code generation and optimizing compiler technology for GPUs
Data-driven methods for increasing real-time observability in smart distribution grids
Performance optimization of scientific applications on emerging architectures
Provenance management for dynamic, distributed and dataflow environments
Accelerating reinforcement learning using heterogeneous platforms: co-designing hardware, algorithm, and system solutions
Asset Metadata
Creator
Lakhotia, Kartik (author)
Core Title
Exploiting variable task granularities for scalable and efficient parallel graph analytics
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Engineering
Degree Conferral Date
2021-12
Publication Date
09/22/2021
Defense Date
08/31/2021
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
bipartite graphs, cache efficiency, dense graph mining, distributed-memory parallelism, graph algorithms, irregular algorithms, OAI-PMH Harvest, parallel graph analytics, shared-memory parallelism, shortest path computation, social networks
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Prasanna, Viktor (committee chair), Crago, Stephen (committee member), Kannan, Rajgopal (committee member), Nakano, Aiichiro (committee member), Qian, Xuehai (committee member)
Creator Email
klakhoti@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC15925681
Unique identifier
UC15925681
Legacy Identifier
etd-LakhotiaKa-10089
Document Type
Dissertation
Rights
Lakhotia, Kartik
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu