HARDWARE AND SOFTWARE TECHNIQUES FOR IRREGULAR PARALLELISM

by

Youwei Zhuo

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2022

Copyright 2022 Youwei Zhuo

Dedication

To everyone.

Acknowledgements

First and foremost, I want to thank my advisor, Xuehai Qian, for his advice and support. When I first entered USC, I had almost no knowledge of how to run a simulator, how to develop a research plan, how to teach a class, or how to write an academic paper. He taught me everything from the technical details of computer architecture to the high-level ideas of what good research looks like. He is always passionate about new discoveries. When I became a senior PhD student, he encouraged me to explore more things outside computer architecture research and to publish in top-tier conferences. He also encouraged me to pursue a career in academia and introduced me to other professors during conferences. It was definitely a great experience to learn from him and do research with him.

I am grateful to other faculty members at USC. Thanks to Chao Wang, Ramesh Govindan, Viktor Prasanna, and Murali Annavaram for serving on my dissertation defense committee. They provided invaluable feedback that made my research more organized. I am also grateful to Wenguang Chen and Jidong Zhai, my undergraduate advisors at Tsinghua University, and to Yuan Xie, for introducing me to parallel computing research. Without them, I would not be where I am today.

I have learned a lot from my fellow students at USC. I would like to thank all the members of the Alchem research group: Chao Wang, You Wu, Qinyi Luo, Jinglei Chen, Jingji Chen, and Gengyu Rao. They have a wide range of research interests, which broadened my view and inspired my research with inter-disciplinary intuitions. It was great fun to spend time with them during my PhD life. I would like to thank my coauthors, Mingxing Zhang and Mingyu Gao. They introduced me to Hybrid Memory Cubes, which became the first topic of my PhD studies.

I want to thank all my friends, both in the U.S. and in China. I met many other PhD students when I attended conferences and did my summer internship. They set a good example for me and pushed me to work harder.

Finally, I want to thank my family for their support and encouragement. With a 16-hour time difference, I can only talk with my parents over the phone at night once a week and meet with them once a year. They cheered me up when I met with difficulties, and reminded me when I became complacent with my achievements.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 The Challenges of Irregular Applications
  1.2 Sparsity
  1.3 Data dependency
  1.4 Contributions
  1.5 Thesis Organization
Chapter 2: Background
  2.1 Embarrassingly Sequential Algorithms and Hardware Accelerators
    2.1.1 Sequential FSM
    2.1.2 Parallel FSM
  2.2 Graph Processing and Novel Memory Architectures
    2.2.1 Basics of Graph Processing
    2.2.2 Processing-In-Memory
    2.2.3 Processing-In-Memory Graph Processing
  2.3 Graph Processing and Distributed Systems
Chapter 3: CSE: Parallel Finite State Machines with Convergence Set Enumeration
  3.1 Introduction
  3.2 Background
    3.2.1 FSM Basics
    3.2.2 Enumerative FSM
    3.2.3 Lookback Enumeration
    3.2.4 Parallel Automata Processor
    3.2.5 Drawbacks
  3.3 set(N)→set(M) Computation Primitive
    3.3.1 Motivation
    3.3.2 Application to Enumerative FSM
  3.4 CSE Approach
    3.4.1 Insights
    3.4.2 Convergence Set Prediction
    3.4.3 Profiling
    3.4.4 Convergence Partition Merge
  3.5 Correctness Guarantee with Re-Execution
  3.6 Evaluation Methodology
    3.6.1 FSM Benchmark
    3.6.2 FSM Input
    3.6.3 Environment Setup
  3.7 Experimental Results
    3.7.1 Performance
    3.7.2 Initial Flow Number R_0
    3.7.3 Last Flow Number R_T
    3.7.4 LBE and Lookback Length
    3.7.5 CSE and Convergence Set Generation
    3.7.6 The Predictive Power and Re-Execution
  3.8 Summary
Chapter 4: GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition
  4.1 Introduction
  4.2 Background and Motivation
    4.2.1 Interconnection
    4.2.2 Bottleneck
    4.2.3 PIM-Based Accelerator
  4.3 GraphP Architecture
    4.3.1 Source-Cut Partitioning
    4.3.2 Two-Phase Vertex Program
    4.3.3 Hierarchical Communication
    4.3.4 Overlapped Execution
  4.4 Evaluation Methodology
  4.5 Evaluation
    4.5.1 Performance
    4.5.2 Cross Cube Communication
    4.5.3 Bandwidth Utilization
    4.5.4 Scalability
    4.5.5 Energy/Power Consumption Analysis
    4.5.6 Memory Overhead
  4.6 Related Work
  4.7 Summary
Chapter 5: GraphQ: Scalable PIM-Based Graph Processing
  5.1 Background and Motivation
    5.1.1 Tesseract
    5.1.2 Lessons Learned and Design Principles
  5.2 GraphQ Architecture
    5.2.1 Predictable Inter-Cube Communication
    5.2.2 Decoupled Intra-Cube Data Movements
    5.2.3 Tolerating Inter-Node Latency
    5.2.4 Novelty and Discussion
  5.3 GraphQ Implementation
    5.3.1 Inter-Cube: Batched Communication
    5.3.2 Intra-Cube: Specialized Message Passing
    5.3.3 Parameter Consideration
  5.4 GraphQ Runtime System
    5.4.1 Inter-Cube Communication
    5.4.2 Intra-Cube Message Passing
  5.5 Evaluation
    5.5.1 Evaluation Methodology
    5.5.2 Comparing with Tesseract
    5.5.3 Comparing with GraphP
    5.5.4 Energy Consumption
    5.5.5 Effect of PU/AU Ratio
    5.5.6 Multi-Node Performance
    5.5.7 Larger Graphs
  5.6 Related Work
  5.7 Summary
Chapter 6: SympleGraph: Distributed Graph Processing
  6.1 Introduction and Motivation
  6.2 Background and Problem Formalization
    6.2.1 Graph and Graph Algorithm
    6.2.2 Distributed Graph Processing Frameworks
  6.3 Inefficiencies with Existing Frameworks
  6.4 SympleGraph Overview
  6.5 SympleGraph Analysis
    6.5.1 SympleGraph Primitives
    6.5.2 SympleGraph Analysis
    6.5.3 Discussion
  6.6 SympleGraph System
    6.6.1 Enforcing Dependency: Circulant Scheduling
    6.6.2 Differentiated Dependency Propagation
    6.6.3 Hiding Latency with Double Buffering
  6.7 Evaluation
  6.8 Evaluation Methodology
    6.8.1 Performance
    6.8.2 Computation and Communication Reduction
    6.8.3 Scalability
  6.9 Analysis of SympleGraph Optimizations
  6.10 Summary
Chapter 7: Conclusions
Bibliography

List of Tables

2.1 Vertex Programming APIs
3.1 Benchmarks
3.2 Parallel Enumerative FSMs Evaluated
4.1 Comparison of Cross-Cube Communication
4.2 Graph Dataset
5.1 Preprocessing overhead for large graph datasets
5.2 Graph Datasets
6.1 Graph datasets. |V'| is the number of high-degree vertices
6.2 K-core runtime (in seconds)
6.3 Execution time (in seconds) on large graphs
6.4 Execution Time (in seconds)
6.5 Number of traversed edges (normalized to total number of edges in the graph)
6.6 SympleGraph communication breakdown (normalized to total communication volume in Gemini)
6.7 Execution time (in seconds) of MIS using the best-performing number of nodes (in parentheses) on Stampede2

List of Figures

1.1 Characteristics of Irregular Applications
1.2 Challenges of Irregular Applications
2.1 (Sequential) FSM Example
2.2 Vertex Programming Model
2.3 Processing-In-Memory Architecture
2.4 An Example Implementation of HMC
3.1 Enumerative FSM
3.2 Lookback Execution
3.3 Parallel Automata Processor Optimizations
3.4 Automata Processor
3.5 New Computation Primitive: State Set Transition
3.6 CSE Approach
3.7 Maximum Frequency Partition (MFP)
3.8 Merge Example
3.9 CSE Hardware Implementation
3.10 Speedup
3.11 R_0
3.12 R_T
3.13 LBE Speedup w.r.t. Lookback Length
3.14 CSE R_0 w.r.t. Merge Strategy
3.15 CSE Speedup w.r.t. Merge Strategy
3.16 CSE Re-execution Rate w.r.t. Merge Strategy
4.1 Examples of HMCs' Interconnection
4.2 Pseudocode of PageRank in Tesseract
4.3 Graph Partitioning for Vertex Program (a) and Source-Cut (b)(c)
4.4 PageRank in Two-Phase Vertex Program
4.5 Two-Phase Vertex Program
4.6 Expressiveness of Two-Phase Vertex Program
4.7 Hierarchical Communication
4.8 Averaged Communication: Inner Group Links vs. Cross Group Links
4.9 Performance of GraphP: 2DMesh
4.10 Performance of GraphP: Dragonfly
4.11 Cross-Cube Communication: 2DMesh
4.12 Cross-Cube Communication: Dragonfly
4.13 Bandwidth Utilization: 2DMesh
4.14 Bandwidth Utilization: Dragonfly
4.15 Scalability of GraphP: 2DMesh
4.16 Scalability of GraphP: Dragonfly
4.17 Energy Consumption of GraphP: 2DMesh
4.18 Energy Consumption of GraphP: Dragonfly
5.1 Tesseract Communication and Access Pattern
5.2 Batched Communication in GraphQ
5.3 Overlapped Computation and Communication
5.4 Example of GraphQ
5.5 GraphQ Intra-Cube Architecture
5.6 Intra-Cube Architecture Comparison
5.7 Hybrid Execution Model
5.8 Ordered Batch Inter-Cube Communication Code
5.9 Process Unit Code
5.10 Apply Unit Code
5.11 GraphQ Intra-Cube Execution
5.12 Performance
5.13 Execution Time Breakdown
5.14 Communication
5.15 Energy Consumption
5.16 Performance w.r.t. Different PU/AU Ratios
5.17 Multi-Node Performance
6.1 Bottom-up BFS Algorithm
6.2 Examples of Algorithms with Loop-Carried Dependency
6.3 Bottom-up BFS Execution
6.4 Signal-Slot in Pull Mode
6.5 SympleGraph Instrumented Bottom-up BFS UDFs
6.6 Circulant Scheduling
6.7 Circulant Scheduling Example
6.8 Differentiated Dependency Propagation
6.9 Double Buffering
6.10 Scalability (MIS/s27)
6.11 Analysis of Optimizations (baseline is SympleGraph with only circulant scheduling)
Abstract

Computer architecture is at a critical juncture. With the end of Moore's law and Dennard scaling, it is impossible to keep increasing the number of transistors exponentially in a single-core architecture. It becomes more efficient to build multi-core processors to scale performance. As a result, modern computers, from personal desktop computers to high-performance computing clusters, are using exponentially more cores. To leverage the power of multi-core, we need to modify applications and enable thread parallelism. Many applications are easy to modify. However, it is difficult for irregular applications to scale efficiently on multi-core architectures.

We divide applications into two categories: regular applications and irregular applications. Irregular applications exhibit two main characteristics: sparsity and data dependency. Sparsity means that the number of non-zero data elements is much smaller than the possible maximum. Data dependency describes the fact that computations on different parts of the data depend on one another. Typical irregular applications include sparse matrix multiplication and graph processing.

At a high level, these two characteristics of irregular applications limit multi-core scalability. First, sparsity is bad for resource utilization. Hardware design is optimized for dense applications. When there are zero elements, the hardware resources corresponding to these elements are idle. Sparse applications can rarely reach peak theoretical performance. Second, data dependency prevents us from separating the problem into a number of parallel tasks. With data dependency, we must enforce it for correctness and serialize the computation. It is also possible to put in extra effort to break data dependency, at the expense of redundant computation and communication. The synchronization overhead will then become the performance bottleneck of irregular applications.

The focus of this thesis is to develop new techniques to scale irregular applications efficiently on parallel architectures. We believe that either software or hardware techniques alone will not meet this goal.
To this end, we present hardware and software techniques that address both the sparsity and the data dependency challenges posed by irregular applications.

For hardware accelerators, we design CSE, a new parallel algorithm that parallelizes finite-state machines. Existing solutions focus on software optimizations and ignore the hardware features. CSE can partition the state machine into convergence sets and parallelize them with constant parallelization overhead. For the novel memory architecture Processing-In-Memory (PIM), we propose two techniques that together exploit the full memory bandwidth benefits of Hybrid Memory Cubes (HMC). GraphP is an HMC graph accelerator with data partition as the first-order design principle. We use the software method of partitioning to reduce communication. GraphQ is another HMC graph accelerator, which focuses on new hardware designs at different levels of the memory hierarchy. We change the programming model of graph processing and rearrange the memory access patterns in memory cubes. We also develop a runtime system that schedules the communication messages in batches to achieve peak bandwidth utilization. For distributed systems, our contributions include a compiler-based technique that removes redundant computation and communication. SympleGraph is a state-of-the-art distributed graph processing system. It is the first distributed graph system that considers data dependency in its programming model and system implementation. We use circulant scheduling, double buffering, and differentiated communication to optimize data dependency in a distributed setting.

Chapter 1
Introduction

1.1 The Challenges of Irregular Applications

Parallel computing is at a critical juncture. For many years, performance grew exponentially with the number of transistors according to Moore's law. Unfortunately, with the end of Moore's law, it becomes impossible to continue improving performance in a single-core architecture. The industry has shifted to multi-core architectures and parallel computers. Parallel computing is pervasive. From personal desktop computers to mobile phones, from cloud servers to high-performance computing clusters, all current computers are parallel. Many applications have enough parallelism and thus can be parallelized without much effort. However, we find that a group of other applications are difficult to parallelize, or suffer severe performance loss on a multi-core architecture. In this dissertation, we refer to these applications as irregular applications. To have a clear understanding of the challenges of irregular applications, we observe two characteristics: sparsity and data dependency.

Sparse applications have many zero data elements. When we calculate the algorithm complexity, we can omit these zero elements because they require no work. However, when we execute the algorithm on hardware, the corresponding hardware resources are wasted. For example, when we iterate over a sparse vector on a GPU, the threads corresponding to the zero elements are idle, although the GPU device seems to be fully occupied.
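The following C++ sketch (written for this discussion; the vector contents, names, and representation are illustrative assumptions rather than code from this dissertation) contrasts the dense form, where work is wasted on zeros, with an index-based form that skips zeros but gathers randomly:

```cpp
#include <cstddef>
#include <vector>

// Dense form: every element is visited, so the work spent on the many zero
// entries of sparse_x is wasted even though it contributes nothing.
double dot_dense(const std::vector<double>& sparse_x, const std::vector<double>& dense_y) {
    double sum = 0.0;
    for (std::size_t i = 0; i < sparse_x.size(); ++i)
        sum += sparse_x[i] * dense_y[i];          // most products are 0 * y[i]
    return sum;
}

// Index-based form: only the non-zeros of x are stored (values + indices), so
// the zero work disappears, but the accesses into dense_y become a random gather.
double dot_indexed(const std::vector<double>& x_vals, const std::vector<std::size_t>& x_idx,
                   const std::vector<double>& dense_y) {
    double sum = 0.0;
    for (std::size_t k = 0; k < x_vals.size(); ++k)
        sum += x_vals[k] * dense_y[x_idx[k]];     // irregular memory access pattern
    return sum;
}
```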
[Figure 1.1: Characteristics of Irregular Applications]

Data dependency is another reason for the low efficiency of irregular applications. Some applications are inherently sequential. According to Amdahl's law, the maximum speedup is limited by the sequential part of the application. Some applications have sequential parts that can be parallelized only at a great cost. The communication and synchronization overhead sometimes offsets the benefits of parallel computing.

Irregular applications play an important role in the real world. In scientific computing, sparse matrix multiplication is irregular, and it is the core computation of many other applications. In big data, graph processing is a typical irregular application because most real graphs are sparse. In machine learning, new algorithms related to sparse matrices and graphs, such as graph convolutional networks, are being invented and are becoming the state-of-the-art solutions. With the rapid development of artificial intelligence, more irregular applications will be developed and applied.

There are existing solutions to irregular applications, from both the hardware side and the software side. If we examine the sparse vector example again, we get a simple solution: we can change the representation of the sparse vector. We store the vector together with an index vector so that we can skip the zero data after reading the index. This avoids the idle thread problem mentioned above. However, it creates a new problem in accessing the dense vector: the memory access pattern is random. The efficiency problem is not resolved. When the application becomes more complex, current solutions are not sufficient to scale irregular applications.

[Figure 1.2: Challenges of Irregular Applications]

1.2 Sparsity

As discussed in Section 1.1, sparsity is one of the challenges of irregular applications, especially on novel memory architectures. Novel memory architectures like Processing-In-Memory (PIM) are promising for future accelerators. They provide high memory bandwidth and low energy consumption. Current solutions are not designed for novel architectures and thus cannot fully leverage their power due to sparsity. The state-of-the-art accelerator achieves a 10x speedup over a conventional memory architecture. However, most of the performance gains come from the hardware. In this example, the PIM accelerator has potentially 60 to 100 times more memory bandwidth. The power of the new hardware is not fully unleashed.

1.3 Data Dependency

As discussed in Section 1.1, data dependency is a severe threat to resource utilization in parallel computing. There are two possible solutions. First, we can reduce the cost of parallelization. The software solutions are usually well studied. However, there is more design space if we consider both hardware and software: we can design a new algorithm that reduces the cost statically. Second, we can see it as an optimization opportunity. For some graph algorithms, the data dependency contains break control flow. We can leverage it to skip computation and communication in distributed systems.

1.4 Contributions

The focus of this dissertation is to enable efficient and scalable irregular applications on parallel computers. To this end, our contributions address the sparsity and data dependency challenges of irregular applications over a wide range of software applications and hardware architectures.

Embarrassingly sequential algorithms and classic hardware accelerators. Finite-State Machine (FSM) is one example of graph-based computations: if the states are nodes and state transitions are edges, FSM executions are essentially walks on a graph. FSM is the fundamental representation in encoding, parsing, and pattern matching. Typical input data to be processed by FSMs include web pages, network packets, and system logs, which are of extremely large volumes. Therefore, the performance of FSM is crucial. FSM is difficult to parallelize and is known as "embarrassingly sequential" because of its data dependency.
Previous works break the data dependency by enumerating all states. The overhead of parallelization is proportional to the total number of states of an FSM. Although there are several techniques to reduce the overhead gradually at runtime [83], the algorithm is input dependent and thus the performance is not consistent across different benchmarks. We develop CSE, a Convergence Set based Enumerative FSM. Unlike prior approaches, CSE is based on a new computation primitive, set(N)→set(M), which maps N states (i.e., a state set) to M states without giving the specific state→state mappings (which state is mapped to which). CSE uses set(N)→set(M) as an essential building block to construct the enumerative FSM. Essentially, CSE reformulates the enumeration paths as set-based rather than singleton-based. We evaluate CSE with 13 benchmarks. It achieves on average 2.0x/2.4x and maximum 8.6x/2.7x speedup compared to Lookback Enumeration (LBE) and Parallel Automata Processor (PAP), respectively.

Graph processing and emerging memory architecture. Recent developments in memory technology, including Hybrid Memory Cube (HMC), High Bandwidth Memory (HBM), and metal-oxide resistive random access memory (ReRAM), have emerged to enable new architecture ideas such as Processing-In-Memory (PIM) (or Near-Data-Processing, NDP). PIM-based accelerators place logic near memory and hence remove the data movement overhead and provide extremely high memory bandwidth. PIM is believed to be an ideal architecture platform for graph processing because graph processing is a well-known memory-bound problem. However, there are still important challenges in fully utilizing the bandwidth. Graph processing has a random memory access pattern, and remote memory bandwidth will become the performance bottleneck.

We develop GraphP, a novel HMC-based software/hardware co-designed graph processing system that drastically reduces communication and energy consumption compared to Tesseract. GraphP features three key techniques. 1) "Source-cut" partitioning, which fundamentally changes the cross-cube communication from one remote put per cross-cube edge to one update per replica. 2) "Two-phase Vertex Program", a programming model designed for the "source-cut" partitioning with two operations: GenUpdate and ApplyUpdate. 3) Hierarchical communication and overlapping, which further improves performance with unique opportunities offered by the proposed partitioning and programming model. We evaluate GraphP using a cycle-accurate simulator with 5 real-world graphs and 4 algorithms. The results show that it provides on average 1.7x speedup and 89% energy saving compared to Tesseract.

We develop GraphQ, a novel PIM-based graph processing architecture that eliminates irregular data movements. The key idea is to generate static and structured communication with runtime system and architecture co-design. Using a zSim-based simulator with five real-world graphs and four algorithms, the results show that GraphQ achieves on average 3.3x and maximum 13.9x speedup and 81% energy saving compared to Tesseract. Compared to GraphP, GraphQ achieves larger speedups over Tesseract. In addition, the 4-node GraphQ achieves 98.34x speedup compared to a single node with the same memory size using a conventional memory hierarchy.

Distributed graph processing systems and programming abstractions. In the age of big data, graphs have grown to billions of edges and will not fit into the memory of a single machine.
To process large-scale graphs efficiently, a number of distributed graph processing frameworks have been proposed, e.g., PowerGraph [39] and Gemini [140]. In order to support various graph analytics algorithms, these systems provide programming interfaces and allow algorithm programmers to write user-defined functions. Unfortunately, all previous distributed programming abstractions make implicit assumptions. For example, current systems will not enforce loop-carried dependencies in user-defined functions. Many important algorithms, like BFS and MIS, can be implemented more efficiently with an early-exit optimization, but it is not available in existing systems and programming interfaces.

We develop SympleGraph, a novel framework for distributed graph processing that precisely enforces loop-carried dependency, i.e., when a condition is satisfied by a neighbor, all following neighbors can be skipped. SympleGraph analyzes the user-defined functions and identifies the loop-carried dependency. The distributed framework enforces the precise semantics by performing dependency propagation dynamically. To achieve high performance, we apply circulant scheduling in the framework to allow different machines to process disjoint sets of edges and vertices in parallel while satisfying the sequential requirement. To further improve communication efficiency, SympleGraph differentiates dependency communication and applies double buffering. In a 16-node setting, SympleGraph outperforms Gemini and D-Galois on average by 1.42x and 3.30x, and up to 2.30x and 7.76x, respectively. The communication reduction compared to Gemini is 40.95% on average, and up to 67.48%.

1.5 Thesis Organization

The rest of the dissertation is organized as follows. Chapter 2 provides relevant background and motivation. Chapter 3 describes CSE, a parallel finite-state machine accelerator with constant overhead. Chapter 4 describes GraphP, an HMC accelerator with data partition as the first-order design principle. Chapter 5 describes GraphQ, an HMC accelerator with new hardware designs at different levels of the memory hierarchy. In Chapter 6, we describe SympleGraph, a state-of-the-art distributed graph processing system that considers data dependency. Finally, Chapter 7 concludes this dissertation.

Chapter 2
Background

2.1 Embarrassingly Sequential Algorithms and Hardware Accelerators

2.1.1 Sequential FSM

Finite-State Machine (FSM) is an important mathematical concept in automata theory. FSMs are widely used as a computation model in several important applications such as data analytics and data mining [122, 20, 44, 91, 90], bioinformatics [19, 27, 102, 127], network security [131, 31], computational finance [36], and software engineering [5, 21, 93, 99]. These applications use FSMs as the essential computational model to process tens to thousands of patterns on a large amount of input data. Therefore, the performance of FSM is crucial.

There are two types of FSM: Deterministic Finite-state Automata (DFA) and Nondeterministic Finite-state Automata (NFA). In our work, we consider DFA (every NFA can be converted to an equivalent DFA). Figure 2.1 (a) shows an FSM with five states and input symbols {0, 1}. The state transition rules can be represented as the state transition table, as shown in Figure 2.1 (b). During execution, the FSM processes the input symbols sequentially due to the state transition dependency (Figure 2.1 (c)). Specifically, each time the FSM reads one symbol from the input string, it looks up the transition table based on the current state and the read symbol to find the transition to the next state (Figure 2.1 (d)).
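The table-driven loop below is a minimal C++ sketch of this sequential execution (the 256-entries-per-symbol table layout and the names are illustrative assumptions, not the hardware design evaluated later):

```cpp
#include <string>
#include <vector>

// T[symbol][state] gives the next state, mirroring the lookup in Figure 2.1 (d):
// state = T[in][state].
using TransitionTable = std::vector<std::vector<int>>;  // 256 x |Q| for 8-bit symbols

int run_dfa(const TransitionTable& T, int start_state, const std::string& input) {
    int state = start_state;
    for (unsigned char symbol : input) {
        // Each step depends on the previous one, so the loop cannot be split
        // across threads without knowing the intermediate states.
        state = T[symbol][state];
    }
    return state;  // final state; accepting if it belongs to F
}
```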
Besides the sequential bottleneck, transition table lookups also cause irregular memory accesses, which motivates the recent works on hardware FSM accelerators [117, 38].

[Figure 2.1: (Sequential) FSM Example — (a) FSM, (b) Transition Table, (c) Dependences, (d) Sequential FSM Algorithm: state = q0; foreach (input symbol in) state = T[in][state]]

2.1.2 Parallel FSM

To parallelize the sequential computation, enumerative FSM [96, 83, 97, 137, 139, 53, 116, 138] is proposed, in which the input is divided into segments that can be processed in parallel. All segments (except the first) have an unknown starting state; therefore, the computation has to calculate the state transitions for all states, which are called enumeration paths. The number of enumeration paths is equal to the number of states in the FSM, which poses a key challenge due to the high computation cost. Fortunately, FSM has the convergence property, which states that the number of enumeration paths to compute is non-increasing due to state convergence — after two states transition to the same state, the following state paths become the same. This property is the foundation of various hardware and software implementations of enumerative FSM, which are reviewed in detail in Section 3.2. Essentially, they all compute state→state transitions for the non-increasing enumeration paths.

2.2 Graph Processing and Novel Memory Architectures

2.2.1 Basics of Graph Processing

Table 2.1: Vertex Programming APIs
  Function      Input                      Output
  processEdge   source vertex value        partial update
  reduce        reduced/partial update     reduced update
  apply         reduced update/old value   new value

1  for (v : Graph.vertices) {
2    for (e : outEdges(v)) {
3      res = processEdge(e, v.value, ...)
4      u = comp[e.dest]
5      u.temp = reduce(u.temp, res)
6    }
7  }
8  for (v : Graph.vertices) {
9    v.value, v.active = apply(comp[v].temp, v.value)
10 }
Figure 2.2: Vertex Programming Model

A graph G is defined as an ordered pair (V, E), where V is a set of vertices connected by E, a set of edges. To ease the development of graph algorithms, several domain-specific programming models based on the "think like a vertex" principle have been proposed, such as the vertex program [75], the gather-apply-scatter program [39], the amorphous data-parallel program [94], and some other frameworks [109]. Among them, the vertex program is supported by many software and hardware accelerated graph processing frameworks, including Tesseract [1], GraphLab [74], and Graphicionado [47]. Table 2.1 lists the semantics of the three APIs of the vertex program. Figure 2.2 shows a general graph application expressed with these primitives.

During processing, each vertex in the vertex array is visited and all its outgoing edges in the edge array are processed, involving three steps. 1) process: for each outgoing edge e of vertex v, the function processEdge computes the contribution of source vertex v through edge e to the destination vertex u (accessed in line 4). 2) reduce: from the perspective of u, a new update returned by processEdge (res) is combined using the reduce function with the existing value of u in the compute array, i.e., u.temp, incurring a random access. 3) apply: after the whole graph is processed in an iteration, the new value of each vertex in the compute array is applied to the vertex array with the apply function.
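As a concrete illustration, the three APIs could be instantiated for a PageRank-style algorithm roughly as in the following sketch; the damping factor, field names, and convergence test are hypothetical simplifications rather than the exact interface of Tesseract, GraphLab, or Graphicionado:

```cpp
#include <cmath>
#include <utility>

constexpr double kDamping = 0.85;   // assumed damping factor

struct Vertex { double value; int out_degree; bool active; };

// processEdge: contribution of the source vertex pushed along one outgoing edge.
double processEdge(const Vertex& src) {
    return src.value / src.out_degree;
}

// reduce: combine a new contribution with the destination's partial update.
double reduce(double partial, double contribution) {
    return partial + contribution;
}

// apply: produce the new vertex value from the reduced update at the end of an
// iteration and decide whether the vertex remains active for the next one.
std::pair<double, bool> apply(double reduced, double old_value) {
    double new_value = (1.0 - kDamping) + kDamping * reduced;
    bool active = std::fabs(new_value - old_value) > 1e-6;  // assumed tolerance
    return {new_value, active};
}
```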
In an iterative graph algorithm, the procedure repeats for multiple iterations until a certain convergence condition has been reached. We can summarize two characteristics of graph processing: random accesses to the compute array (line 4), and a high ratio of memory accesses to computation (processEdge is typically simple). As a result, graph algorithms incur random accesses and require high memory bandwidth.

2.2.2 Processing-In-Memory

[Figure 2.3: Processing-In-Memory Architecture — cubes of DRAM layers stacked over a logic layer with Through Silicon Vias (TSVs); intra-cube bandwidth 360 GB/s, inter-cube links 120 GB/s each, inter-node bandwidth 6 GB/s]

Processing-In-Memory (PIM) architecture reduces data movements by performing computations close to where the data are stored. 3D memory technologies (e.g., Hybrid Memory Cubes (HMC) [23] and High Bandwidth Memory (HBM) [67]) make PIM feasible by integrating memory dies and compute logic in the same package, achieving high memory bandwidth and low latency. Similar to Tesseract and GraphP, we consider a general PIM architecture (shown in Figure 2.3) that captures the key features of specific PIM implementations. The architecture is composed of multiple cubes. An HMC device (i.e., a cube) is a single chip stack that consists of several memory dies/layers and a single logic die/layer.

[Figure 2.4: An Example Implementation of HMC — memory partitions (vaults) with per-vault logic, an internal switch, and serialized packet links for requests and responses]

Two kinds of bandwidth are defined: 1) internal bandwidth, which caps the maximum data transfer speed between the memory dies and the logic die of the same cube; and 2) external bandwidth, which is provided by a cube to external devices (e.g., other cubes and the host processor). Within each cube, multiple DRAM dies are stacked with Through Silicon Vias (TSVs) and provide high internal memory bandwidth of up to 320 GB/s. At the bottom of the dies, computational logic (e.g., simple cores) can be embedded. In Tesseract [1], a small single-issue in-order core is placed at the logic die of each vault. This is feasible because the area of 32 ARM Cortex-A5 processors including an FPU (0.68 mm^2 for each core [6]) corresponds to only 9.6% of the area of an 8 Gb DRAM die (e.g., 226 mm^2 [107]). With 16 cubes, the whole system delivers 5 TB/s memory bandwidth, considerably larger than conventional memory systems. Moreover, the memory bandwidth grows proportionally with capacity in a scalable manner.

2.2.3 Processing-In-Memory Graph Processing

Processing-In-Memory (PIM) is an effective technique that reduces data movements by integrating processing units within memory. While conceptually appealing, early works [61, 62] only achieved limited success due to both technology restrictions and the lack of appropriate applications. However, with the recent advance of "big data" and 3D stacking technology, both problems seem to become solvable. On the application side, modern big data applications operate on massive datasets with significant data movements, posing great challenges to conventional computer architecture.
Among them, graph analytics [76, 40] in particular has received intensive research interest, because graphs naturally capture relationships between data items and allow data analysts to draw valuable insights from the patterns in the data for a wide range of applications. However, graph processing poses great challenges to memory systems. It is well known for its poor locality, because of the random accesses in traversing the neighborhood vertices, and for its high memory bandwidth requirement, because the computations on the data accessed from memory are typically simple.

On the technology side, 3D integration [13] enables stacking logic and memory chips together through TSV-based interconnection, which provides high bandwidth with scalability and energy efficiency. One of the most prominent 3D-stacked memory technologies is Micron's Hybrid Memory Cube (HMC) [51], which consists of a logic die stacked with several DRAM dies. With this technology, it is possible to build a system that consists of multiple HMCs, which can provide 1) high main memory capacity that is large enough for in-memory big data processing; and, more importantly, 2) memory-capacity-proportional bandwidth, which is essential for applications with poor locality and high memory bandwidth requirements.

As a result, due to the advances in both applications and technology, the research community and industry have again become increasingly interested in applying PIM to various applications like machine learning [129], natural language processing [4, 42, 133], social influence analysis [16, 118, 126], and many others [34, 35]. Among these kinds of applications, PIM (e.g., HMC) is especially suitable for building efficient architectures for graph processing frameworks.

Tesseract [1] is a PIM-enabled parallel graph processing architecture. It implements a Pregel-like vertex-centric programming model [76] on top of the HMC architecture, so that users can develop programs in a familiar interface while taking advantage of PIM. The results show that Tesseract can be orders of magnitude faster than DRAM-based in-memory graph processing systems.

Despite the promising results, Tesseract generates excessive cross-cube communication through SerDes links, whose bandwidth is much less than the aggregated local bandwidth of HMCs. Such cross-cube communication delays the executions in memory cubes, and eventually affects HMC's internal bandwidth utilization. In fact, the results in [1] confirm this observation: the bandwidth utilization of Tesseract is usually less than 40%. Moreover, Tesseract adopts the Dragonfly topology to connect HMCs [51], which provides higher connectivity and a shorter diameter than simpler topologies like mesh. However, Dragonfly is still not fully symmetric, which means that certain critical cross-cube links may have to sustain much higher throughput than the others, becoming bottlenecks that further hamper Tesseract's performance.

2.3 Graph Processing and Distributed Systems

Graphs capture relationships between entities. Graph analytics has emerged as an important way to understand the relationships between heterogeneous types of data, allowing data analysts to draw valuable insights from the patterns for a wide range of real-world applications, including machine learning tasks [130], natural language processing [4, 42, 133], anomaly detection [98, 115], clustering [110, 106], recommendation [32, 46, 81], social influence analysis [22, 118, 126], and bioinformatics [3, 30, 64].
In the age of big data, graphs have grown to billions of edges and will not fit into the memory of a single machine. Even if they could, the performance would be limited by the number of cores. Single-machine processing is not a truly scalable solution. To process large-scale graphs efficiently, a number of distributed graph processing frameworks have been proposed, e.g., Pregel [75], GraphLab [74], PowerGraph [39], D-Galois [26], and Gemini [140]. These frameworks partition the graph across distributed memory, so the neighbors of a vertex are assigned to different machines. To hide the details and complexity of distributed data partition and computation, these frameworks abstract computation as vertex-centric User-Defined Functions (UDFs) P(v), which are executed for each vertex v. In each P(v), programmers can access the neighbors of v as if they were local.

The framework is responsible for distributing the function to different machines, scheduling the computation and communication, performing synchronization, and ensuring that the distributed execution outputs the correct results. To achieve good performance, both communication and computation need to be efficient. The communication problem, which is closely related to graph partition and replication, has traditionally been a key consideration of distributed frameworks. Prior works have proposed 1D [75, 74], 2D [39, 140], and 3D [134] partitions, and have investigated the design space extensively [26].

Chapter 3
CSE: Parallel Finite State Machines with Convergence Set Enumeration

As discussed in Section 2.1.1, FSM is an embarrassingly sequential application due to its data dependency. As discussed in Section 2.1.2, there are existing software solutions to parallelize FSMs. Unfortunately, these techniques fail to capture the hardware features, and the parallelization overhead is high even after the optimizations. We will rely on the hardware features and design a new algorithm to achieve near-optimal speedup.

3.1 Introduction

The hardware implementations of enumerative FSM and NFA (e.g., [116]) typically use one-hot encoding, where N states are represented by an N-bit vector (i.e., the active mask), and a bit is set when the state is active. Based on one-hot encoding, the combinational logic (i.e., the state transition matrix) selects the next states based on the input symbol and the matched and activated states (more details in Section 3.3.1).

In the normal state→state computation, only one bit in the active mask is set at any time. Lifting this restriction requires no change in combinational logic and hardware cost but enables a novel computation primitive, set(N)→set(M), which maps a state set with N states to another state set with M states without giving the specific state→state mappings (which state is mapped to which). The key property of set(N)→set(M) is that, when M = 1, it computes the enumeration paths for all N states with the cost of computing one path in state→state. This is because all N states are mapped to the same state. Importantly, the convergence property of enumerative FSM ensures M ≤ N. While it is unlikely that all N states of an FSM converge to a single state, it is highly possible that a subset of the N states converges to a single state. Our solution in this chapter is based on this insight.

We propose CSE, a Convergence Set based Enumerative FSM. We use set(N)→set(M) as the building block to compute multiple enumeration paths of a state set in parallel, speculating that the states will converge to a single state. When the speculation is correct, CSE achieves significant speedup; otherwise, one or more stages may be re-executed based on the concrete output from the previous stage.
When the speculation is correct, CSE achieves signicant speedup, otherwise, one of more stages may be re-executed based on the concrete output from previous stage. To increase the likelihood of correct speculation, the whole state set (S) needs to be partitioned inton disjoint convergence setsfCS i ji2 [1;n]g such thatS = S n i=1 CS i andCS i T CS j =?. The quality of the partition is a major factor determining the performance. CSE is supported by two techniques. First, convergence set prediction generatesfCS i ji2 [1;n]g with random input based proling. Each proling input will produce one convergence set partition. By proling a large number of inputs, we can use the partition with the maximum frequency (MFP) as the predicted partition for a given FSM. To improve the prediction accuracy, we apply a partition renement algorithm to merge distinct partitions. The rened partition covers all or most (99% to 100%) partitions generated in proling input, which leads to high prediction accuracy and low re-execution ratio for real input. Second, we propose a global re-execution algorithm and its hardware implementation. It minimally triggers the re-execution of certain stages to ensure the correctness. In essence, CSE reformulates the enumeration paths as set-based rather than singleton-based. To evaluate CSE, we use an automata simulator VASim [125] and 13 benchmarks in Regex [11] and ANMLZoo [125]. CSE is compared to lookback enumeration and parallel automata processor [116] with all optimizations implemented. We directly compare dierent designs based on symbol/sec. We observe that CSE achieved on average 2.0x/2.4x and maximum 8.6x/2.7x speedup compared to lookback enumeration and parallel automata processor, respectively. 17 The remainder of this chapter is organized as follows: Section 3.2 reviews the FSM basics and the cur- rent works on enumerative FSMs; Section 3.3 introduces set(N)!set(M) and analyzes its unique properties; Section 3.4 proposes CSE, focusing on convergence set generation and re-execution algorithm; Section 3.6 discusses evaluation methodology and Section 3.7 shows evaluation results; Section 3.8 summarizes the chapter. 3.2 Background This section rst explains the concepts of basic and enumerative FSMs. Then we review two recent ap- proaches to improve the eciency of enumerative FSMs. In the end, we outline the drawback of existing works and motivate our approach. 3.2.1 FSMBasics There are two forms of FSM: Deterministic Finite-state Automata (DFA) and Nondeterministic Finite-state Automata (NFA). As every NFA can be converted to an equivalent DFA, we focuse on DFA. In the remainder of the chapter, we will use FSM and DFA interchangeably. A DFA is a 5-tuple (Q; ;;q 0 ;F ), in whichQ is a nite set of states, is a nite set of input symbols called the alphabet, is a transition function that maps a state and an input symbol to another state,q 0 2Q is the initial state andF is a set of reporting/accepting states. With a sequence of input symbolsW = w 1 ;w 2 ;:::w T (w t 2 ), the DFA goes over a sequence of statesq 0 ;q 1 ;q 2 ;:::q m , whereq t+1 = (q t ;w t ). The computation is sequential, because the next state depends on the current one. 18 ➛ Figure 3.1: Enumerative FSM 19 3.2.2 EnumerativeFSM Sequential bottleneck is the major challenge for FSM computation. Enumerative FSM is an eective tech- nique to parallellize sequential FSM by breaking data dependence. In this method, input symbols are partitioned into a sequence of segments that are processed in parallel. 
Figure 3.1 (a) illustrates an example of enumerative FSM with four segments. Figure 3.1 (b) shows a conceptual view of a segment. With an unknown starting state, it needs to compute the state transitions for all N states. Each curve is an enumeration path, which is a sequence of state transitions. Given the t input symbols (shown at the bottom), each state follows a different path, finally reaching the eventual states in this stage. Each enumeration path computes the state→state mapping starting from one state. A flow is a logical concept that computes an enumeration path. Let the number of flows needed to perform an enumerative execution be R. Initially R is equal to N, the total number of states in the FSM. Figure 3.1 (b) also indicates R_0 and R_T, the values of R at the start and end of a stage. To compute multiple enumeration paths, the component implementing the state→state transition is time-multiplexed among all flows.

In enumerative FSM, the convergence property implies that R is non-increasing. Intuitively, this is because once two states transition to the same state, the paths for the two become the same afterwards. The concept is illustrated in Figure 3.1 (c). At the output of a segment, there are some infeasible states (marked as the empty dots). Another concept is the dead states, which are deemed not to lead to a pattern match. An enumeration path terminates after reaching a dead state, as shown in Figure 3.1 (d).

Data Parallel FSM (DPFSM) [83] is the first to study enumerative FSM. The implementation uses SIMD instructions on the CPU to enable parallel computation of multiple flows. It performs convergence checks during execution, thereby reducing R dynamically. This approach is illustrated in Figure 3.1 (e), where each compute unit is marked as a red square. The compute resources (e.g., SIMD lanes or SMs in a GPGPU) are devoted to the computations corresponding to the shaded region.

In principle, smaller R_0 and R_T lead to faster execution. To accelerate processing, the general guidance is to reduce R before the computation (to get a smaller R_0) and during enumeration (to get a smaller R_T). Besides the convergence check, dead state elimination (i.e., deactivation) is another dynamic optimization to reduce R_T. Recent works also perform different static optimizations to reduce R_0. Next, we discuss their key ideas based on the terminology just introduced.

3.2.3 Lookback Enumeration

[Figure 3.2: Lookback Execution]

Lookback Enumeration (LBE) [139, 137, 96, 97] is illustrated in Figure 3.2. The key idea is to "look back": reduce the initial N states to a smaller R_0 using the suffix symbols of the previous input segment. LBE of each segment is divided into two phases. The first step is the lookback, which performs enumerative execution starting from N flows over the L symbols in the suffix of the previous segment. The result is a smaller set of possible starting states (R_0 ≤ N) for the current segment. The second step is to compute the enumeration paths of one or more predicted states in R_0. If the predicted starting state(s) contain the actual final state of the previous segment, the speculation is successful; otherwise, this segment is re-executed sequentially with the concrete state. A sketch of the two steps is shown below.
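This C++ sketch of the two LBE steps is illustrative only; the prediction policy (how a start state is chosen from the candidate set), which is the subject of the prior work discussed next, is left out, and the names are hypothetical:

```cpp
#include <set>
#include <string>
#include <vector>

using TransitionTable = std::vector<std::vector<int>>;  // T[symbol][state]

int run_segment(const TransitionTable& T, int start, const std::string& seg) {
    int state = start;
    for (unsigned char symbol : seg) state = T[symbol][state];
    return state;
}

// Step 1 (lookback): run all N states over the last L symbols of the previous
// segment; the set of distinct end states is the candidate set of size R0.
std::set<int> lookback(const TransitionTable& T, int num_states,
                       const std::string& prev, std::size_t L) {
    std::string suffix = prev.substr(prev.size() >= L ? prev.size() - L : 0);
    std::set<int> candidates;
    for (int s = 0; s < num_states; ++s)
        candidates.insert(run_segment(T, s, suffix));
    return candidates;  // |candidates| = R0 <= N
}

// Step 2 (speculation): run the segment from a predicted start state; if the
// prediction does not match the previous segment's actual final state, the
// segment is re-executed sequentially from the correct state.
int lbe_segment(const TransitionTable& T, int predicted_start,
                int actual_prev_final, const std::string& seg) {
    int speculative = run_segment(T, predicted_start, seg);
    if (predicted_start == actual_prev_final) return speculative;  // correct
    return run_segment(T, actual_prev_final, seg);                 // re-execute
}
```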
If the predicted starting state(s) contain the actual final state of the previous segment, the speculation is successful; otherwise the segment is re-executed sequentially with the concrete state. Without prediction, LBE is essentially the same as the basic enumerative FSM but with a longer input (the L suffix symbols are added). Recent works [139, 137] use probabilistic analysis and profiling to study how to select starting states from R_0 to reduce re-execution. [96, 97] considered choosing multiple states from R_0. It is worth noting that the existing probabilistic methods are designed for software implementations. For example, in [137], the “stochastic speculation scheme” requires tracking the feasibility (probability) of each state at runtime. Such computation can be extremely inefficient in hardware because, at the very least, floating-point values need to be updated.
3.2.4 Parallel Automata Processor
Parallel Automata Processor (PAP) [116] supports NFA enumeration using the Automata Processor (AP) [29]. In PAP, the R_0 flows in a segment are computed by a hardware component for state→state transitions with time multiplexing. Targeting NFA, PAP needs to compute state→set mappings, so one-hot encoding is also used to represent multiple active states. The key difference between NFA and DFA is that R is not monotonically decreasing. Based on the results in [116], over a long symbol sequence, R still decreases. An important effort of PAP is to reduce R_0. PAP proposed four static optimizations, illustrated in Figure 3.3.
Figure 3.3: Parallel Automata Processor Optimizations
The first optimization is range-guided input partition, by which R_0 can be reduced by cutting the input at frequent symbols with a small feasible range (Figure 3.3 (a)). The second optimization is connected component analysis, in which all feasible states for a stage are partitioned into connected components (CCs). Two flows from two CCs can be merged because both their current and destination states are disjoint. This allows one flow to compute several state→state transitions in parallel. In Figure 3.3 (b), the total number of states is 1000, which is partitioned into four CCs with sizes 400, 300, 200, and 100. Thus, only 400 flows are needed, a reduction of 600 flows. Figure 3.3 (c) shows the active state group optimization. Since NFAs usually have several states that are always active due to self-loops on all possible symbols, they can be assigned to one flow. Figure 3.3 (d) shows the common parent optimization, by which the number of flows is reduced to the number of parents. In the example, the intuition is that, if the boundary of the two segments had been one symbol earlier, only 2 flows, instead of 5, would be needed.
3.2.5 Drawbacks
While the recent works achieve certain speedups, they share a common assumption: each enumeration path is computed through state→state transitions. For PAP, although several state→state transitions can be computed in parallel in certain cases (see Section 3.2.4), it still stores the state→state mapping in memory so that it can be used when the computation for the corresponding states is resumed in time-multiplexing. The large amount of state→state computation for the enumeration paths poses a great challenge. Although PAP can statically parallelize and reduce flows, the optimizations must still correctly compute the state→state mappings and therefore achieve limited improvements. We explore a new enumerative FSM design, which is inspired by DPFSM and PAP but uses the set(N)→set(M) primitive, a unique opportunity enabled by the one-hot encoded architecture.
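To summarize the background concretely, the following sketch (our own illustration reusing the table-driven DFA layout from the earlier sketch; it is not the implementation of DPFSM, LBE, or PAP) shows the per-segment state→state enumeration and how segments are chained afterwards:

#include <cstdint>
#include <vector>

// Each segment computes, for every possible starting state, the state it ends
// in after consuming the segment (R_0 = N enumeration paths, one per flow).
std::vector<int> run_segment(const std::vector<std::vector<int>>& delta,
                             const std::vector<uint8_t>& segment, int num_states) {
    std::vector<int> mapping(num_states);
    for (int s = 0; s < num_states; ++s) {          // one flow per starting state
        int q = s;
        for (uint8_t w : segment) q = delta[q][w];  // one enumeration path
        mapping[s] = q;                             // state -> state result
    }
    return mapping;
}

// After all segments finish (in parallel), chain the mappings: the concrete
// output state of segment i selects the result of segment i + 1.
int chain_segments(const std::vector<std::vector<int>>& mappings, int q0) {
    int q = q0;
    for (const auto& m : mappings) q = m[q];
    return q;
}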
3.3 The set(N)→set(M) Computation Primitive
This section discusses the insight, definition, and application of set(N)→set(M), the novel and key computation primitive that leads to an efficient implementation of enumerative FSM.
3.3.1 Motivation
The memory-centric Automata Processor (AP) [29] accelerates finite state automata processing by implementing NFA states and state transitions in memory. Figure 3.4 shows the AP architecture, which accepts the input symbols one at a time and performs state transitions. The current states are represented using one-hot encoding since multiple states can be active in an NFA. The computation is divided into two phases: 1) state matching, which reads the row selected by the current input symbol; a bit set to one means that the corresponding state matches the symbol; and 2) state transition, which generates the next states in the active mask. A state transition is only performed when a state both matches the input symbol and is active. The active mask is initially set to the start states. With one-hot encoding, the active mask bits for all states can be independently set in a given cycle.
Figure 3.4: Automata Processor
The general architecture can implement both NFA and DFA. For an NFA, multiple bits in the active mask can be set and multiple state transitions are performed in parallel. For a DFA, only one bit in the active mask is set and one state transition is performed. We consider the scenario in which multiple bits in the active mask are set. Clearly, the hardware implementation does not require any change. However, it can no longer compute the state→state transitions for individual states of a DFA. For example, suppose we have two states S_0 and S_1, and after an input symbol ‘a’, they transition to two different new states, S_0 → S_2 and S_1 → S_3. If the bits for S_0 and S_1 in the active mask are both set, then the bits for S_2 and S_3 are set in the updated active mask. From this information, we know that the state set {S_0, S_1} transitions to {S_2, S_3}, but we do not know the specific state mapping (which state transitions to which).
Figure 3.5: New Computation Primitive: State Set Transition
We define this new computation primitive as set(N)→set(M), which transitions a state set with N states to another state set with M states. The concept is shown in Figure 3.5. The curve in Figure 3.5 (a) is an enumeration path, a sequence of state→state transitions. The dotted curves in Figure 3.5 (b) are the enumeration paths for the N starting states. However, set(N)→set(M) cannot provide the state sequences of each path; it can only give the M states that the N starting states are mapped to. As shown in Figure 3.5 (c), two active states {S_1, S_2} are transitioned to {S_3, S_4, S_5}, but the information that S_1 is mapped to {S_3, S_4} is lost. Next, we show that in a special case, set(N)→set(M) is very useful.
3.3.2 Application to Enumerative FSM
The most natural application is to compute the lookback in LBE. Referring to Figure 3.2, existing LBE uses state→state transitions to compute the enumeration paths on the suffix (of length L) of the previous segment. We only need to know the state set after looking back over the suffix (R_0), but there is no need to know how these states are reached. In this scenario, set(N)→set(M) perfectly matches the goal: we can directly reduce the state set of size N at the beginning of the suffix to R_0, with the overhead of computing just one enumeration path of length L.
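To make the primitive concrete, the following sketch (our own illustration; std::bitset stands in for the one-hot active mask, and the 256-row match table is an assumed layout) advances a whole state set by one symbol without retaining which source state led to which destination:

#include <array>
#include <bitset>
#include <cstdint>
#include <vector>

constexpr int kStates = 1024;                  // illustrative FSM size
using StateSet = std::bitset<kStates>;

struct SetFsm {
    std::array<StateSet, 256> match;           // match[w]: states matching symbol w
    std::array<StateSet, kStates> next_of;     // next_of[s]: successors of state s
};

// set(N)->set(M): one step over an entire active set. The result records only
// which states are reachable, not the per-state mapping -- exactly the
// information loss described in the text.
StateSet step(const SetFsm& fsm, const StateSet& active, uint8_t w) {
    StateSet matched = active & fsm.match[w];  // phase 1: state matching
    StateSet next;
    for (int s = 0; s < kStates; ++s)          // phase 2: state transition
        if (matched[s]) next |= fsm.next_of[s];
    return next;
}

StateSet run_set_segment(const SetFsm& fsm, StateSet active,
                         const std::vector<uint8_t>& segment) {
    for (uint8_t w : segment) active = step(fsm, active, w);
    return active;    // if active.count() == 1, the whole set has converged
}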
Most importantly, set(N)→set(M) can compute enumeration paths in a special case. If M = 1, all N states converge to the same state, so all N enumeration paths are computed in parallel. Moreover, this is achieved with the same hardware cost. The convergence property of enumerative FSM ensures M ≤ N. While it is unlikely that all N states of an FSM converge to a single state, it is highly possible that a subset of the N states converges to a single state. Next, we describe our solution based on this insight.
3.4 CSE Approach
This section presents CSE, Convergence Set Enumeration. We first explain the insights and then focus on convergence set generation and the re-execution algorithm.
Figure 3.6: CSE Approach
3.4.1 Insights
CSE uses set(N)→set(M) as the building block to construct an enumerative FSM. To increase the probability of converging to one state, we partition all states (S) into n convergence sets {CS(i) | i ∈ [1, n]} such that S = ∪_{i=1}^{n} CS(i). We speculate that each CS(i) converges to one state; if this is true, set(N)→set(M) computes |CS(i)| enumeration paths in parallel. Otherwise, one or more stages need to be re-executed.
Figure 3.6 illustrates the CSE approach. In Figure 3.6 (a), we partition the S states into three convergence sets, CS(1), CS(2), and CS(3), so R_0 is directly reduced to 3. At the end of the segment, each CS(i) successfully converges to one state, so R_T is also 3. We mentioned earlier that smaller R_0 and R_T lead to faster execution, and in this case, R_0 is directly reduced to the lower bound, the same as R_T. This is difficult to achieve with the optimizations in PAP. Figure 3.6 (b) shows a more conservative partition. Instead of partitioning into three CSs, we partition the S states into four CSs (R_0 = 4). In the end, each CS(i) converges successfully into one state, and CS(2) and CS(3) converge to the same state S_2, so R_T is equal to 3. In this example, R_0 is not reduced to the lower bound, but is still significantly less than N. Figure 3.6 (c) shows a different scenario. Here, CS(2) does not converge to a single state, so we do not successfully compute the state→state mapping for all states in CS(2). This example illustrates an incorrect partition and suggests a possible re-execution, which will execute state→state transitions from the concrete state of the previous stage. In certain cases, re-execution might be avoided: if the concrete output state from the previous stage belongs to CS(1) or CS(3), then the re-execution is unnecessary. The details of the re-execution algorithm are described in Section 3.4.4.
The above discussion covers executing one stage in CSE. After finishing the stages, we need to compose the flows of all stages, i.e., distinguish the true/false state of the enumerative execution in each stage. PAP simply selects the flow starting from the concrete result state of the previous stage as the true one, because in state→state computation, results starting from all states are available. CSE does this similarly, at convergence set granularity: we select the result of the CS that contains the concrete state. If the CS converges, all states (including the true concrete output state) produce the same report state. If the CS diverges, we perform re-execution with the concrete state.∗ Therefore, CSE produces exactly the same result state and output report as sequential execution. However, in some cases, we need not only the terminal state but the whole state transition path.
∗ If at any point during the execution a CS arrives at more than one reporting/accepting state, this CS is treated as diverged at the end of this segment and we also need re-execution to distinguish the report states.
Because set(N)→set(M) does not keep the path of each individual state, CSE cannot distinguish the state transition path within a convergence set. We can still recover such path information with another sequential execution. In real-world services (e.g., FSMs in network intrusion detection), computing the terminal state is latency-sensitive while the state transition path is not. Thus, CSE can still accelerate FSM processing by speeding up the latency-critical tasks.
3.4.2 Convergence Set Prediction
The quality of the generated convergence sets is a key factor determining performance. With fewer convergence sets (smaller R_0), a successful execution is faster because the number of flows that need to be enumerated by set(N)→set(M) is smaller. However, it is more likely to trigger re-execution, which incurs extra latency. If we are more conservative and generate more convergence sets (larger R_0), we may end up performing certain redundant computations, similar to enumerative FSMs based on state→state transitions. Figure 3.6 (b) is an example of this case.
We propose a profiling-based prediction method and a partition refinement algorithm to merge partitions. By carefully selecting the merge strategy, our convergence set prediction can produce both a concise (small number of convergence sets) and accurate (low re-execution rate) state partition for real-world FSM applications.
3.4.3 Profiling
The idea of profiling is intuitive: we use synthetic input sequences to discover the convergence behavior. The profiling input is randomly generated based on two characteristics of real input strings: string length and symbol range. 1) String length: while input strings may have varying sizes, real-world applications can always split a string into independent strings of similar length. Section 3.6.2 describes the input length of each benchmark used in the evaluation. 2) Symbol range: the symbol range can be obtained from the FSM specification. For example, some FSM benchmarks only accept visible ASCII codes, so the profiling symbols are sampled from a subset of ASCII.
After generating the profiling input, we emulate the execution. One input produces one convergence partition. We count the frequency of appearances of distinct partitions. The more frequently a partition appears, the more likely real-world inputs will converge with the convergence sets in this partition. In the evaluation, we profile with 1k input strings. The profiling time is less than 5 minutes per FSM benchmark on one PC. We also tried profiling with 10k input strings and the profiling result does not change, because the frequency distribution shows no noticeable change across all benchmarks. Note that although the profiling strings are randomly generated, the profiling result is consistent across all benchmarks. Since only one convergence partition has to be selected for a given FSM, we can simply choose the maximum frequency partition (MFP) to minimize the probability of re-execution.
Figure 3.7: Maximum Frequency Partition (MFP)
3.4.4 Convergence Partition Merge
Among the many partitions generated from profiling, we find that even the MFP does not achieve sufficiently high frequency. Figure 3.7 shows the frequency of the MFP after profiling. For example, the frequency of the MFP in Clamav (one of the benchmarks we used) is only 61%.
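The profiling and MFP selection described in Section 3.4.3 can be sketched as follows (a software-only illustration under our own encoding of a partition as per-state group labels; it is not the exact profiling tool used in the evaluation, and it assumes at least one profiling input):

#include <cstdint>
#include <map>
#include <vector>

// One profiling run: states whose enumeration paths end in the same final
// state after a random input are placed in the same convergence set.
// Group ids are assigned by first appearance, so identical groupings
// compare equal across runs.
std::vector<int> profile_once(const std::vector<std::vector<int>>& delta,
                              const std::vector<uint8_t>& random_input,
                              int num_states) {
    std::vector<int> final_state(num_states);
    for (int s = 0; s < num_states; ++s) {
        int q = s;
        for (uint8_t w : random_input) q = delta[q][w];
        final_state[s] = q;
    }
    std::map<int, int> group_of;                 // final state -> group id
    std::vector<int> partition(num_states);
    for (int s = 0; s < num_states; ++s) {
        int next_id = (int)group_of.size();
        auto it = group_of.emplace(final_state[s], next_id).first;
        partition[s] = it->second;
    }
    return partition;
}

// Profile many random inputs and return the maximum frequency partition (MFP).
std::vector<int> predict_mfp(const std::vector<std::vector<int>>& delta,
                             const std::vector<std::vector<uint8_t>>& inputs,
                             int num_states) {
    std::map<std::vector<int>, int> freq;
    for (const auto& in : inputs) ++freq[profile_once(delta, in, num_states)];
    auto best = freq.begin();
    for (auto it = freq.begin(); it != freq.end(); ++it)
        if (it->second > best->second) best = it;
    return best->first;
}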
These MFPs are far from “accurate”: if we choose them, the generated convergence sets will diverge in around 39% of executions, leading to frequent re-execution and performance degradation.
Figure 3.8: Merge Example
Instead of simply choosing an MFP with low accuracy, we propose to refine partitions: merging multiple partitions into a new partition with refined subsets. This is based on a classic algorithm known as partition refinement [89] (Algorithm 1). It takes two partitions P1 and P2 as input and outputs a merged partition. Note that the refinement is a commutative operation, so the order of selecting elements from P1 or P2 does not affect the result. The output P has the following property: any pair of states that belong to the same convergence set in P1 and the same convergence set in P2 will fall into the same convergence set in P. This implies that if an input string can converge under P1 or P2, it will converge under P. Thus, the frequency of P is the sum of the frequencies of P1 and P2. In this case, we say that partition P covers P1 and P2. In essence, we trade an increased number of convergence sets for the benefit of increased frequency. Figure 3.8 is an example with 4 partitions after profiling. To merge A, B, and C, a new partition of 4 subsets is created. If we only want to merge A and C, the result is still C.

Algorithm 1 Partition Refinement Algorithm
Require: P1 contains subsets {S1_1, ..., S1_m}
Require: P2 contains subsets {S2_1, ..., S2_n}
for S2 in P2 do
    for S1 in P1 do
        Intersection I ← S1 ∩ S2
        Difference D ← S1 \ S2
        if I ≠ ∅ then
            Split S1 into I and D
        end if
    end for
end for
return P1

Based on the above algorithm that merges two partitions, we can merge all partitions that appear in profiling. Our objective is to find a refined partition with higher frequency without increasing the number of subsets significantly. For simplicity, we propose an effective heuristic merge strategy:
• Merge all partitions where one covers another. The reason to do this compatibility check first is that such merges do not increase the number of subsets.
• Merge starting from the partitions with higher frequency.
• Stop when the merged partition reaches some cut-off frequency.
We explored several cut-off frequencies, from 90% to 100% (i.e., merging all partitions that appear in profiling). For most benchmarks, even merging to 100% does not increase the number of subsets drastically. However, for Protomata the 100% MFP generates a partition with 61 subsets, a prohibitive number. The cut-off frequency used in evaluating performance is shown in the MFP column of Table 3.1, and a detailed discussion on selecting the best merge strategy is in Section 3.7.5.
3.5 Correctness Guarantee with Re-Execution
Formalization. To uniformly specify the transitions in each segment, we define a new type called State Type (ST), which can be either a state (st) or a convergence set (CS). Assume we have m segments and n convergence sets. After all segments are executed in parallel based on the corresponding input symbols, we have m transition functions T: ST → ST defined as follows:

T_s(CS(i)) = st(j)                  if CS(i) converges
T_s(CS(i)) = ∪_{j=1}^{k} CS(j)      otherwise

Each segment has such a transition function for each convergence set, so i = 1, 2, ..., n. In addition, the transition function of each segment is different and depends on its input, so s = 1, 2, ..., m. If a CS does not converge, the output states can be included in one or more convergence sets. Our definition includes both cases: k = 1 when converging to one CS, and k > 1 otherwise. In the first segment, the starting state is concrete.
We can trivially modify the function to give special treatment to this segment: T_1(st(i)) = st(j), which means that it always performs the normal state→state transition.
After all segments finish execution, we essentially need to compute the composition function T_1 ∘ T_2 ∘ ... ∘ T_m. If the outcome is a single state, then no re-execution is required; otherwise, some segments need to re-execute. Before presenting how to identify the re-execution segments and organize the procedure, we first discuss function composition. We consider the rules to compose two functions T_i and T_{i+1}. Note that we only need to compose the transition functions of consecutive segments since an FSM has a linear structure. T_{i+1} can have several possible input types. (1) The input of T_{i+1} (the output of T_i) can be ∪_{j=1}^{k} CS(j) based on our definition. In this case, we have T_i ∘ T_{i+1} = T_{i+1}(∪_{j=1}^{k} CS(j)) = ∪_{j=1}^{k} T_{i+1}(CS(j)). The output is the union of the results of each convergence set's transition function. (2) Based on the result in (1), the input of T_{i+1} (the output of T_i) can be one state, multiple states, or a mix of states and convergence sets. Let us consider the case where the input is k states. We first convert the states to convergence sets, take the union, and then apply (1). Specifically, for states st(j), where j = 1, 2, ..., k, we find the convergence set CS(j) that each state st(j) belongs to. Then, we have T_i ∘ T_{i+1} = T_{i+1}(∪_{j=1}^{k} st(j)) = T_{i+1}(∪_{j=1}^{k} CS(j)) = ∪_{j=1}^{k} T_{i+1}(CS(j)). If states are mixed with convergence sets, the procedure is the same because the states are already treated as convergence sets.
Re-Execution Algorithm. As discussed earlier, if the output of T_m (the last segment) is a single state, re-execution is not required. It is easy to see that this condition can be satisfied even if some previous segments do not converge. Our formalization naturally captures this behavior through the transition function of each segment. However, this condition does not tell us how to identify the segments for re-execution, which is discussed below.
(1) Basic approach. Re-executing a segment means that a concrete output state is generated when the concrete input state is known. Obviously, this requires that the output of the previous segment is a single state. For example, we can always re-execute segment 2 because the first segment always produces a concrete output state. Thus, a simple approach is that, when re-execution is required, segment 2 to segment m sequentially perform the re-execution. This guarantees that the output of segment m is a concrete state.
(2) Last-concrete optimization. We can improve the basic approach by realizing that, even if the input of a segment is not a single concrete state, its output can be. The next segment can then perform re-execution based on that concrete output. In the segment chain from 2 to m, there can be multiple such segments; we call them concrete points. They depend on the input symbols of a segment, so they can only be determined dynamically. Re-execution is always possible after these points. To ensure that the output of the last segment is a concrete state, we can perform a backward check through the segment chain (segment m → (m−1) → ...) and find the first concrete point. Let that point be segment r (r < m); the re-execution can then be performed sequentially from segment r to segment m. We call this the “last-concrete” optimization.
(3) Opportunistic transition function re-evaluation.
After re-executing segment r, instead of directly re-executing all successor segments sequentially, we can “re-evaluate” the transition functions T_{r+1}, T_{r+2}, ..., T_m in sequence with the changed input of T_{r+1}. Since these functions have already been computed in parallel in all segments, re-evaluating the functions is faster than an actual re-execution, whose cost depends on the input length. There are two benefits: a) with function re-evaluation, it is possible that T_m produces a concrete output and no further re-execution is needed; this situation is reached faster than re-executing multiple segments. b) Even if T_m still does not converge after function re-evaluation, we may identify a different last-concrete point (segment r') “later” than segment r, where r' > r, using the same method. Essentially, we can skip the re-execution of the segments between r and r'. The hardware implementation discussed next is based on this most advanced design.
Figure 3.9: CSE Hardware Implementation
Hardware Implementation. Figure 3.9 shows the hardware implementation based on the re-execution algorithm with opportunistic transition function re-evaluation. Here, there are five segments. The input buffer of each segment stores the input symbols until no re-execution is needed; at that point, the input buffers of all segments are cleared together. The Control & Re-Execution module determines the segments required to re-execute and controls the timing. Note that even in re-execution, the number of cycles needed is predictable, so the cycles at which different operations are performed can be easily determined; therefore, the control logic is not complex. The interface between each segment and the Control & Re-Execution module is the Segment Transition Table, which essentially specifies the transition function for each convergence set. The table is composed of n CS Transition Vectors, each corresponding to one convergence set. The length of a vector equals the number of states in the FSM; the vector is simply copied from the active mask of each convergence set after the last input symbol of a segment is processed (the active mask for each convergence set needs to be saved and restored during context switches, similar to PAP). To keep the logic design clean, we choose not to directly encode the vectors in terms of CSs; the simple operation of generating CSs from states is implemented in the Logic module after the segment is finished. The Logic module of a segment generates the output, which is connected to the Logic module of the next segment. It also outputs a signal (conv) indicating whether this segment converges. Therefore, by chaining the Logic modules of all segments together, we can get the final output state (if the output of the last segment is a single state) and a global result-ready signal if it is true. The Control & Re-Execution module takes all conv signals and finds the last-concrete point, which is a simple backward search for the first conv signal set to 1. Re-execution is warranted when the conv of the last segment is not set, and the re-execution signal (an M-bit vector) is sent to all segments.
Only one bit among the M bits is set, indicating the segment selected for re-execution (found by the backward search). The re-evaluation can also be performed by the Logic modules of the re-executed segment (r) and its successors, up to segment m. The logic is the same as chaining all segments, except that the segment sequence is shorter: from segment r → m instead of from segment 1 → m. Further re-execution may be necessary and can be determined by the same logic.
3.6 Evaluation Methodology
In this section, we evaluate different enumerative FSM designs across a wide range of FSM benchmarks, including the ANMLZoo and Regex suites. These benchmarks contain multiple real-world FSMs and input sequences. We first describe the environment setup in detail and how we validate our results against PAP.
3.6.1 FSM Benchmark
The Regex [11] benchmark consists of both real-world and synthetic regular expressions. ExactMatch represents the simplest patterns that may appear in a rule set. Dotstar contains a set of regular expressions matching the wildcard “.*”. The Ranges rule sets (Ranges05, Ranges1) contain character ranges (randomly selected among the different range groups), with an average frequency of 0.5 and 1 per regular expression, respectively. The TCP regular expressions filter network packet headers before deep packet inspection.

Table 3.1: Benchmarks
Benchmark    #FSM   #State    #Half-Cores per Segment / #Segments    L    MFP
Dotstar03    300    19038     1 / 16                                 30   100%
Dotstar06    300    24821     1 / 16                                 30   100%
Dotstar09    299    29442     1 / 16                                 30   99%
Ranges05     300    13391     1 / 16                                 20   100%
Ranges1      299    13324     1 / 16                                 10   100%
ExactMatch   300    13212     1 / 16                                 10   100%
TCP          733    38107     1 / 16                                 30   100%
PowerEN      2860   73140     1 / 16                                 20   100%
Dotstar      3000   119468    2 / 8                                  20   100%
Protomata    2340   1949014   2 / 8                                  20   99%
Snort        3379   152616    3 / 5                                  10   99%
Clamav       515    185181    3 / 5                                  40   99%
Brill        2050   256349    3 / 5                                  50   100%

ANMLZoo [125] is a much more diverse benchmark of automata-processing engines. Brill is short for the Brill rule tag updates in the Brill part-of-speech tagging application. ClamAV is an open-source repository of virus signatures intended for email scanning on mail gateways; the benchmark is a subset of the full signature database. Dotstar is a set of synthetic automata generated from regular expression rule sets with 5%, 10%, and 20% “.*” probability. PowerEN is a synthetic regular expression benchmark suite developed by IBM. Snort is an open-source network intrusion detection system (NIDS); it monitors network packets and analyzes them against a rule set defined by the community and the user. Protomata converts the 1307 rules based on protein patterns in PROSITE into regular expression patterns. Regex and ANMLZoo provide two representations of the same FSMs: NFA and regular expression. While PAP focuses on NFA, we use the regular expressions and convert them to DFAs with RE2 [100], an open-source regular expression library. State blow-up is possible when compiling to DFA; however, this is not the case for the Regex and ANMLZoo benchmarks (in the ANMLZoo paper, both DFA and NFA representations are evaluated). Table 3.1 shows the characteristics of the converted FSMs.

Table 3.2: Parallel Enumerative FSMs Evaluated
FSM        Basic FSM           Static Optimization                   Dynamic Optimization
Baseline   state FSM           NA                                    NA
LBE        state and set FSM   NA                                    lookback
PAP        state FSM           Four optimizations in Section 3.2.4   convergence check and deactivation check
CSE        set FSM             convergence set prediction

3.6.2 FSM Input
For the Regex benchmark, we generate the input using the trace generators provided by Becchi [11]. We set p_m to 0.75, the probability that a state matches and activates subsequent states.
For the ANMLZoo benchmark, input traces are provided for each application. PAP takes one input file as one input string. However, for the applications in ANMLZooo, the input can easily be split by delimiter symbols into smaller inputs with provably no dependencies. Here are two examples: in Brill, no matches can occur across sentence boundaries, so we can split the input file at periods; in Snort, one input file contains many packets, and processing them should be independent and done in parallel. In our evaluation, we use the same input files as PAP but split them according to real-world practice. The performance number is averaged over all input strings. Note that when profiling convergence sets, we do not use any of the datasets mentioned above and only use random inputs.
3.6.3 Environment Setup
We use VASim [125], the same automata simulator utilized in PAP, and we implement the PAP optimizations described in Section 3.2.4. VASim is a widely used open-source library with fast NFA emulation capability. It also supports multi-threading and is therefore able to partition the input stream and process the segments simultaneously. We divide the input stream into segments and execute each flow using context switches. We apply the different optimizations mentioned in PAP, such as range-guided input partition, deactivation, and dynamic convergence. We also take into consideration the overhead of the path-decoding part, which varies with the benchmark. We implemented a baseline sequential FSM and three enumerative FSMs for comparison. Table 3.2 shows the name of each, together with the hardware building block and the static and dynamic optimizations applied. Note that for LBE, we use the set FSM to perform the lookback, so it is better than the basic LBE. Also, we use LBE without prediction since the probabilistic methods are not suitable for hardware implementation.
We assume that the enumerative FSM designs run on the Micron Automata Processor (AP) [29]. In AP, a half-core is the smallest unit of parallelization. The current generation of AP has 4 ranks, and each rank contains 16 half-cores. We choose to run on 1 rank (16 half-cores). For most benchmarks, one segment is assigned to one half-core. According to PAP, some benchmarks are so densely connected that the AP compiler places these FSMs on multiple AP half-cores. We put the same resource constraint on LBE and CSE. Table 3.1 lists the number of half-cores assigned to each segment and the total number of segments. We estimate that a sequential FSM on AP processes 1 symbol per cycle (7.5 ns). Context switching between flows requires 3 cycles, and we assume that it takes 1 cycle to perform the dynamic convergence check for every two flows. In summary, we adopt the same evaluation methodology as PAP, except that we split the input differently according to real-world practice. To further verify that we implemented the static and dynamic optimizations correctly, we make sure that our R_0 and R_T numbers are the same as or lower than those reported in the PAP paper.
3.7 Experimental Results
In this section, we first present the speedup of the different enumerative FSM designs. We then analyze the sources of speedup by looking at the enumeration overheads R_0 and R_T, and explain how we explore the parameter space of LBE and CSE to achieve the best performance. Finally, we investigate the predictive power of our profiling-based convergence set prediction method by comparing the re-execution rate for different cut-off partition coverage frequencies.
3.7.1 Performance
Figure 3.10 compares the performance of our proposed CSE against the baseline, LBE, PAP, and the ideal speedup. The baseline throughput is 1 symbol per cycle. The ideal throughput is 1 symbol per cycle in each segment, so the ideal speedup over the baseline equals the total number of segments. For LBE and CSE, parameters such as the lookback length and merge strategy are shown in Table 3.1. Our proposed CSE is better than LBE and PAP in all applications. On average, CSE is 2.0x faster than LBE and 2.4x faster than PAP. CSE achieves near-ideal speedup in all applications except PowerEN. PAP also reaches near 16x speedup in ExactMatch, Ranges1, and Ranges05; note that CSE still outperforms PAP by a small margin (4.7% on average). However, the speedup of PAP is not consistent. For Snort and ClamAV, the PAP speedup is 1.14x and 1.37x, respectively. Considering that the ideal speedup is 5x, PAP is only a little faster than the baseline. Our design offers consistent speedups of 4.9x and 4.4x, respectively.
Figure 3.10: Speedup
3.7.2 Initial Flow Number R_0
Figure 3.11 shows R_0 for each application evaluated. Recall that R_0 is an intuitive indication of the initial enumeration overhead. For CSE, all applications have a small R_0. In fact, except for PowerEN, R_0 is reduced to 1 dynamically within less than 10 symbols (20 cycles). When R_0 becomes 1, it follows that R_T is 1 as well. During the whole execution, there is only one flow, no enumeration overhead, and no time multiplexing; such applications naturally match the ideal performance. As for PowerEN, it takes 565 symbols for R_T to become stable, and the large enumeration overhead makes CSE on PowerEN much lower than the ideal speedup.
As mentioned, PAP also performs well in 3 applications. The static optimizations of PAP reduce R_0 to 1 for these applications; similarly, PAP has no enumeration overhead when running them. Here is one intricate question: why does CSE still outperform PAP in these 3 applications when all of them have a small R_0? The problem is in the “range guided input partition” optimization of PAP (Figure 3.3 (a) and Section 3.2.4). The input string is divided at a certain symbol to ensure the smallest feasible state range. This is a clever technique that greatly reduces R_0 from tens of thousands to hundreds. However, such a special symbol boundary may appear at any position, and one cannot expect the symbol to divide the input string evenly. The longest segment takes more cycles to execute and determines the total execution time. CSE does not rely on this technique to reduce R_0 and always divides the input into equal segments. This accounts for the marginal 4.7% performance benefit.
Unfortunately, for the other applications, the R_0 of PAP is larger than that of LBE and CSE. For example, Snort and ClamAV under PAP have 103 and 404 initial flows respectively, compared to 2.9 and 2.9 for LBE and 3.9 and 4.8 for CSE. In PAP, the authors argue that a large R_0 is not a problem: they rely on dynamic optimization to reduce the number of active flows quickly. There are two problems with this claim. First, their observation is based on inputs of 1 million or 10 million symbols. However, as discussed in the evaluation methodology (Section 3.6.2), the input file should be split into independent pieces and processed in parallel. In practice, dependent input sequence lengths rarely exceed ten thousand.
The time spent on initial enumeration then becomes a major overhead, impacting the overall performance. Second, the capability of PAP's dynamic convergence check is limited by some of the static optimizations performed, as discussed in the following section.
Figure 3.11: R_0
3.7.3 Last Flow Number R_T
Figure 3.12 shows R_T. Due to the dynamic optimizations (convergence check and deactivation check), R_T is always no greater than R_0. The value of R_T tells us the number of enumeration flows at the end of the computation; R_T and R_0 together give a good hint about FSM performance. For CSE, R_T becomes around 1 for all applications: there is almost no enumeration overhead at the end of the computation.
Figure 3.12: R_T
For LBE, R_T is 1.9 on average, which means that there are still about 2 active flows running in LBE and it takes 2 cycles to process a symbol. The R_T of LBE confirms that LBE performance is lower than half of the ideal throughput. The reason for the smaller R_T in CSE compared to LBE is the weaker convergence condition (Section 3.5). In PAP, however, some applications have an R_T much larger than 2. While R_T for TCP, Dotstar, and Protomata is 3, ClamAV and Snort have as many as 21 and 30 active flows at the end. R_T clearly answers the question of why PAP is extremely slow in these two cases: the dynamic optimization is not working effectively and there is a large enumeration overhead.
From the experimental results, we can reveal the real cause behind the inefficient dynamic optimization: the “connected component analysis” optimization in PAP. This optimization merges states from different connected components into the same flow. It is intuitively appealing because it reduces R_0 (see Section 3.2.4 and Figure 3.3 (b) for details). However, after merging, it becomes more difficult to merge these flows dynamically. Here is a simple example. There are two connected components with two states each: states 1 and 2 in component A, and states 3 and 4 in component B. PAP will merge the 4 states into 2 flows: (1, 3) and (2, 4). As a static optimization, R_0 decreases from 4 to 2. But when we consider dynamic convergence, the two flows can be merged only when states 1 and 2 converge and states 3 and 4 converge. In general, if there are H connected components, flows will be merged only if all H pairs of states converge.
Figure 3.13: LBE Speedup w.r.t. Lookback Length
3.7.4 LBE and Lookback Length
In the previous section, we learned that the overall LBE performance is limited by R_T to around half of ideal. However, we can still tune the lookback length parameter to find the best performance for LBE. We explored 10 lookback lengths from 10 to 100. For simplicity, Figure 3.13 shows the speedup of 4 configurations. For applications like Brill, lookback is extremely beneficial, with a 5.2x speedup. For ClamAV, a lookback of 10 symbols performs worse than the sequential baseline, because R_0 is still large enough to introduce enumeration overhead. For all benchmarks, a lookback longer than 100 symbols brings diminishing benefits or even slow-down, because R_0 cannot be reduced further and the lookback cycles become a non-negligible overhead.
3.7.5 CSE and Convergence Set Generation
As discussed in Section 3.4.2, convergence set generation is essential to CSE performance. We have seen that profiling alone does not select an MFP with a large frequency, so partition merging is necessary. We evaluate the effectiveness of different merge strategies in this section.
Figure 3.14 shows the number of convergence sets in the MFP. Note that this number is equal to R_0 in CSE, which determines the initial enumeration overhead. For almost all benchmarks except Protomata, merging to 100% does not increase R_0 significantly. For Protomata and ClamAV, merging all partitions that appear in profiling increases R_0 to 61 and 13, which can incur considerable overhead. Before partition merging, the average R_0 is 2.2. Merging to 99% and 100% increases R_0 to 3.9 and 9.9, respectively. We choose different merge strategies for different applications, shown in the last column of Table 3.1, with an average R_0 of 5.2. The overhead introduced by merging can be handled by the dynamic optimization checks in a few cycles. Meanwhile, the partition prediction becomes much more reliable and avoids many re-execution situations. Figure 3.15 shows the speedup with respect to the merge strategy. For Protomata, ClamAV, and Brill, merging to 99% is more desirable. The partition merge technique proves to be effective across all benchmarks, boosting performance by 1.4x on average over the unmerged baseline.
Figure 3.14: CSE R_0 w.r.t. Merge Strategy
Figure 3.15: CSE Speedup w.r.t. Merge Strategy
3.7.6 The Predictive Power and Re-Execution
Apart from R_0 and R_T, the re-execution rate is another important factor determining CSE performance. It is determined by the quality of the predicted convergence sets. Since we merge the MFP to a large frequency such as 99%, the partition is expected to trigger re-execution with less than 1% probability on our random test inputs. The key question is how well the convergence sets predict the convergence behavior of real inputs. This can be measured using the re-execution rate as a proxy. Figure 3.16 illustrates the re-execution rate for each application, evaluated on representative real-world inputs.
The MFP generated by basic profiling suffers from a high re-execution rate: for TCP, Protomata, and Brill, it is 23.2%, 19.0%, and 26.3%, respectively. With the merge optimization, re-execution drops significantly. The re-execution rate does not deviate from our expectation, staying below 0.5%. On average, the re-execution rate is only 0.2% and has an unnoticeable effect on the final CSE performance. It also shows that the convergence sets generated by profiling on random inputs indeed predict the convergence behavior of real inputs very well.
Figure 3.16: CSE Re-execution rate w.r.t. Merge Strategy
3.8 Summary
This chapter proposes CSE, a Convergence Set based Enumerative FSM. Unlike prior approaches, CSE is based on a new computation primitive, set(N)→set(M), which maps N states (i.e., a state set) to M states without giving the specific state→state mappings (which state is mapped to which). CSE uses set(N)→set(M) as the essential building block to construct an enumerative FSM. Essentially, CSE reformulates the enumeration paths as set-based rather than singleton-based. We evaluate CSE with 13 benchmarks. It achieves on average 2.0x/2.4x and at maximum 8.6x/2.7x speedup compared to Lookback Enumeration (LBE) and Parallel Automata Processor (PAP), respectively.
Chapter 4
GraphP: Reducing Communication for PIM-based Graph Processing with Efficient Data Partition
4.1 Introduction
As discussed in Section 2.2.3, cross-cube communication becomes the performance bottleneck of the current PIM accelerator Tesseract.
Our investigation shows that this problem is due to a missing consideration, data organization, and the suboptimal order in which different aspects of the system are considered. To develop an efficient graph processing system, a careful co-design of both the software and hardware components is needed. Typically, we need to consider the following four issues: 1) the programming model, which affects user programmability and algorithm expressiveness; 2) the runtime system, which maps programs to the architecture; 3) data organization, which determines the communication pattern; and 4) the architecture, which determines the efficiency of execution. In Tesseract, the data organization aspect is not treated as a primary concern and is subsequently determined by the presumed programming model.
Specifically, Tesseract follows the “vertex program” programming model first proposed by Pregel [76], where a vertex function is defined for all vertices. This vertex program takes the vertex's value as a parameter and updates the outgoing neighbors, i.e., the destinations of all outgoing edges (potentially in different ways). If a vertex and all its outgoing neighbors are in the same cube, the vertex function is executed locally. Otherwise, cross-cube messages are incurred to remotely perform the reduce function. Let the vertex be v and its k outgoing neighbors be {u_1, u_2, ..., u_k}; in Tesseract, for any outgoing neighbor u_i that is in a different cube than v, a put message is sent from v's cube to u_i's cube, containing a reduce function and the value to apply to u_i as the parameter. This message asks u_i's cube to perform the reduce as a remote function. We see that, determined by the vertex program model, each cross-cube edge incurs a cross-cube message, and hence the amount of cross-cube communication is proportional to the number of cross-cube edges. To reduce this number, Junwhan et al. [1] tried to use METIS [56] to obtain a better partitioning for Tesseract, but the result is not promising: only very small performance improvements are achieved for 3 out of the 5 benchmarks tested, and the METIS-generated partitioning even leads to worse performance for one of the remaining two benchmarks. Moreover, the complexity of METIS prohibits its application to real-world large graphs.
To resolve the issue in the conventional design flow, we argue that a PIM-based graph processing system should take data organization as a first-order design consideration. This principle is important because: 1) data organization affects cross-cube communication, workload balance, and synchronization overhead, which directly translate into energy consumption; 2) if the programming model is decided first, this fixed programming model may prohibit users from using the optimal data partitioning method; 3) co-designing the data organization and interconnection structure can enable extra opportunities and benefits such as broadcasting and overlapping. Therefore, we propose a different order of design consideration: one should first choose the proper data organization with less communication, then design the programming model based on it, and finally apply architecture and runtime optimizations to further improve performance.
Following the above design principle, we propose GraphP, a novel HMC-based software/hardware co-designed graph processing system that drastically reduces communication and energy consumption compared to Tesseract. GraphP features three key techniques.
“Source-cut” Partitioning.
This algorithm ensures that a vertex and all its incoming edges are assigned to the same cube. As a result, if an edge (u, v) is assigned to cube i, all the incoming edges of vertex v will also be assigned to cube i. At the same time, the source vertex u of this edge may be assigned to another cube. In this case, for an edge whose source vertex is in a remote cube, the local cube maintains a replica of the source, which is synchronized with the master in the remote cube in each iteration. This mechanism fundamentally changes the cross-cube communication from one remote put per cross-cube edge to one update per replica. We show that it generates strictly less communication than Tesseract. Moreover, source-cut is a heuristic-based algorithm in which the assignment of each edge can be processed independently. As a result, the partitioning overhead is much less than that of METIS [56].
“Two-phase Vertex Program”, a programming model designed for the “source-cut” partitioning, with two operations: GenUpdate, which generates the vertex value update based on all (local) incoming edges; and ApplyUpdate, which applies the update to each vertex. The replica synchronization is handled transparently by the software framework. This model slightly trades off expressiveness for less communication. However, real-world applications (e.g., PageRank) typically do not need the flexibility provided by the general vertex program. We believe this model is sufficiently expressive; in the worst case, it can be augmented to express more general vertex functions (see Section 4.3.2).
Hierarchical Communication and Overlapping. The replica synchronization requires that the updates from a master to its replicas are the same. This property enables hierarchical communication, which avoids sending the same messages when possible and thus reduces the communication amount on certain bottleneck links between cubes. Moreover, the two-phase vertex program model naturally leads to an overlapping mechanism, which can further hide the latency of cross-cube communication. According to our evaluation results, GraphP effectively reduces the communication amount by 35%–98%, reaches 1.7x average and 3.9x maximum speedup, and reduces energy cost by 89% on average and 96% at maximum compared to Tesseract.
4.2 Background and Motivation
4.2.1 Interconnection
Figure 4.1: Examples of HMCs' Interconnection.
The key benefit that HMC provides is memory-capacity-proportional bandwidth, which is achieved by using multiple HMCs. Typically, a system that contains N HMCs can provide N × 8 GB of memory space and N × 320 GB/s of aggregate internal bandwidth. However, this aggregated bandwidth depends on the interconnection network that connects these HMCs and the host processors. The straightforward design choice is a “processor-centric network”, which simply reuses the current NUMA architecture and replaces traditional DIMMs with HMCs. Figure 4.1 (a) presents a typical system that has four processor sockets. In this case, Intel QuickPath Interconnect (QPI) technology is used to build a fully-connected interconnection network among the processors, and each HMC is exclusively attached to a particular processor (i.e., there is no direct connection between HMCs).
Although this network organization is simple and compatible with the current architecture, Kim et al. [57] conclude that this processor-centric organization does not fully utilize the additional opportunities offered by multiple HMCs. Since routing/switching capability can be supported by the HMC's logic die, it is possible to use more sophisticated topologies and connectivities that were infeasible with traditional DIMM-based DRAM modules. To exploit this opportunity, Kim et al. [57] propose the “memory-centric network”, in which HMCs can directly connect to other HMCs and there is no direct connection between processors (i.e., all processor channels are connected to HMCs and not to any other processors). According to their evaluation, the throughput of a memory-centric network can exceed the throughput of a processor-centric network by up to 2.8×. Moreover, Kim et al. [57] also evaluated various topologies for interconnecting HMCs. Two of the most prevalently used examples are presented in Figure 4.1 (b) and Figure 4.1 (c). Among the different topologies, Dragonfly [59] is suggested as the favorable choice, because it 1) has higher connectivity and a shorter diameter than a simple topology like Mesh; 2) achieves performance similar to the best interconnection topology in their evaluation of 16 HMCs, the flattened butterfly [60]; and 3) does not face the same scalability problem as the flattened butterfly. In this work, we use the memory-centric network and Dragonfly topology as suggested and used by previous work [57]. However, the proposed techniques are not tightly coupled with this particular architecture.
4.2.2 Bottleneck
Based on the HMC implementation discussed in Section 2.2.2, the maximum external bandwidth of a cube (480 GB/s) is actually larger than its internal bandwidth (320 GB/s). However, due to the limitation on the number of pins, this external bandwidth does not scale with the number of HMCs. Thus, the aggregate internal bandwidth of a real system largely surpasses the available external bandwidth. Take the Dragonfly topology shown in Figure 4.1 (c) as an example; it presents a typical HMC-based PIM system that contains 16 HMCs. Since at most 4 off-chip links are provided by a cube, it is impossible to achieve a full connection between the cubes. To be realizable, Dragonfly splits the 16 HMCs into 4 groups and only achieves full connection within each group. In contrast, only one link is provided for each pair of groups. As a result, the bandwidth available for cross-group communication is bounded by the bandwidth of a single link, which is only 120 GB/s. As a comparison, the aggregate internal bandwidth of the entire PIM system is 16 × 320 GB/s = 5.12 TB/s. This is why data organization is extremely important for an HMC-based PIM system and should be taken as the first-order consideration. In Tesseract, the simple partitioning strategy leads to excessive cross-cube communication, which prohibits applications from fully utilizing the aggregate internal bandwidth of the HMCs.
It is also notable that the loads on different external links are not equal. For example, if we assume that the amount of communication is equal for each pair of HMCs, each cross-group link in the Dragonfly topology needs to serve 4 × 4 = 16 pairs of HMCs' communication (4 HMCs in each group).
As a comparison, the link between HMC C0 and HMC C1 only serves the communication between (1) HMC C0 and C1 (1 pair) and (2) HMC C0 and the top-right HMC group (C2, C3, C6, C7) (4 pairs), which is less than 1/3 of the 16 pairs calculated above. This implies that the links across groups can easily become the bottleneck and should be particularly optimized. Essentially, this bottleneck is rooted in the fact that only a limited number of external links is provided by each HMC, which means it cannot simply be avoided by using other topologies. As an illustration, Figure 4.1 (b) presents the Mesh topology. In this case, there are only four links between the HMC group (C0–C7) and the HMC group (C8–C15), so each of them needs to serve 8 × 8 / 4 = 16 pairs of HMCs' communication, which is the same as the bottleneck of the Dragonfly topology. Even worse, the number of these bottleneck links is 8 in Mesh and only 6 in Dragonfly.
4.2.3 PIM-Based Accelerator
The current 3D-stacking based PIM technologies offer great opportunities for graph analytics because: 1) 3D-stacking provides high density, which opens up the possibility of in-memory graph processing; 2) the memory-capacity-proportional bandwidth is ideal for graph processing applications that lack temporal locality but require high memory bandwidth; and 3) various programming abstractions have been proposed for graph processing to improve programmability, and for PIM-based accelerators they can be naturally used to hide architectural details.
1  count = 0;
2  do {
3    ...
4    list_for (v: graph.vertices) {
5      value = 0.85 * v.pagerank / v.out_degree;
6      list_for (w: v.successors) {
7        arg = (w, value);
8        put(w.id, function(w, value) {
9          w.next_pagerank += value;
10       }, &arg, sizeof(arg), &w.next_pagerank);
11     }
12   }
13   barrier();
14   ...
15 } while (diff > e && ++count < max_iteration);
Figure 4.2: Pseudocode of PageRank in Tesseract.
Tesseract is a 16-HMC system using the Dragonfly interconnection in Figure 4.1 (c). It provides users with low-level APIs that can conveniently be composed into a programming model similar to Pregel's vertex program. Figure 4.2 shows the PageRank computation using Tesseract's programming interface, where the main procedure is a simple two-level nested loop (i.e., lines 5 to 13). The outer loop iterates over all vertices in the graph. For each vertex, the program iterates over all its outgoing edges/neighbors in the inner loop and executes a put function for each of them. The signature of this put function is put(id, void* func, void* arg, size_t arg_size, void* prefetch_addr). It executes a remote function call func with argument arg on the id-th HMC. Specifically, for every vertex, the program first calculates the proper PageRank contribution based on the vertex's PageRank and its out-degree; the result is stored in value (line 6). Then, a user-defined vertex function is called for every outgoing edge to add value to the corresponding destination vertex's PageRank for the next iteration (w.next_pagerank) (line 10). This function is executed asynchronously, and cross-cube communication is incurred when the outgoing neighbor is in a different cube. Finally, a barrier is applied to ensure that all the updates performed by vertex functions in the current iteration have completed. It is easy to see that this API is equivalent to Pregel's [76] vertex program, which assures the programmability of Tesseract. For the cross-cube remote function calls, blocking would lead to unacceptable latency; therefore, Tesseract implements them in a non-blocking manner.
A cube can also combine several remote functions together to reduce the performance impact of interrupts on the receiver cores. Nevertheless, the optimization techniques in Tesseract are only used to hide cross-cube communication latency; none of them reduces the amount of cross-cube communication. Essentially, this is due to the inefficiency of Tesseract's simple graph partitioning, which is constrained by the vertex program model. Specifically, only edge-cut (i.e., the graph is partitioned at vertex granularity and a vertex can only be assigned to one cube) can be used. The results show that even the sophisticated METIS partitioner [56] cannot improve performance much (in one case, it even makes it worse). As another consequence, the bandwidth utilization of Tesseract is usually less than 40%.
4.3 GraphP Architecture
Figure 4.3: Graph Partitioning for Vertex Program (a) and Source-Cut (b)(c). Panels: (a) graph partition for vertex program; (b) source-cut in matrix view; (c) source-cut in graph view.
In this section, we describe GraphP, a software/hardware co-designed HMC-based architecture for graph processing. First, we propose a new graph partitioning algorithm that drastically reduces cross-cube communication. Then, a programming model is designed to match the partitioning method. Finally, we discuss the optimization opportunities offered by our approach, optimized broadcast and overlapping, to further improve performance.
4.3.1 Source-Cut Partitioning
Let us start with a detailed understanding of the graph partition in Tesseract through a matrix view. Consider Figure 4.3 (a). A graph can be viewed as a matrix, where the rows and columns correspond to the source and destination vertices. In Tesseract, a graph is partitioned among cubes: each cube is assigned a set of vertices (i.e., vertex-centric partition), corresponding to a set of rows. The edges are the non-zero elements in the matrix, denoted as black dots. With the graph partitioned, the matrix can be cut into grids, each of which contains the edges from vertices in cube i to cube j, similar to the concept in GridGraph [141]. With N cubes, the whole matrix is divided into N² grids. The grids on the diagonal contain the local edges, whose source and destination vertices are in the same cube. As discussed earlier, each non-local edge incurs a cross-cube communication in Tesseract; these are essentially the edges in the grey grids. Assuming edges are distributed uniformly in the graph, the amount of cross-cube communication in one iteration is O(N(N−1) · |E|/N²) = O(((N−1)/N) · |E|), which is roughly the number of edges in the graph.
Next, we propose source-cut, in which a graph is partitioned such that, when a vertex (e.g., v_j) is assigned to a cube (e.g., cube 1), all the incoming edges of v_j are also assigned to the same cube. The idea is shown in Figure 4.3 (b). Different from Tesseract, the matrix is cut vertically: each cube is assigned a set of columns, not rows.
To perform the essential operation in a graph algorithm, propagating the value of the source vertex through an edge to the destination, a replica (denoted in red) is generated if a cube only holds the edge and its destination vertex. The masters (denoted in black) are the vertices in a cube that serve as the destination. With this data organization, the column of v_j corresponds to all of v_j's incoming edges and neighbors; therefore, v_j's update can be computed locally. The sources of edges in a column can be masters (black) or replicas (red). Similar to the earlier discussion, after the matrix is divided into grids, the ones on the diagonal represent the edges in a cube where both the source and destination vertices are masters.

The communication in source-cut is caused by replica synchronization, in which the value of a master vertex is used to update its replicas in all other cubes. In the matrix view, it means that each master vertex in the diagonal grids updates its replicas in other cubes in the same row. In Figure 4.3 (b), consider the master vertex v_i in cube 0. In replica synchronization, cube 0 needs to send v_i's value to both cube 1 and cube 3, but not cube 2, because cube 2 does not have any edge from v_i. Note that only one message is sent from cube 0 to cube 1, even if there are three edges from v_i to different vertices in cube 1. This is the key property explaining why source-cut generates strictly less communication than vertex-centric partition: in the same case, the latter would incur three messages from cube 1 to cube 2 (refer to Figure 4.3 (a)). This property informally proves that, with the same master-to-cube assignment, source-cut always generates a less than or equal amount of communication compared to vertex-centric partition. In essence, source-cut generates one update per replica, while the graph partition for vertex program incurs one put per cross-cube edge. This is illustrated in Figure 4.3 (c) in a graph view. Then, we can calculate the communication amount of source-cut. We define the replication factor γ, which counts both the master and its replicas. The communication amount due to replica synchronization is O(N(γ−1) · |V|/N) = O((γ−1)|V|). This is an estimation, as it assumes that each cube contains a similar number of vertices. The maximum value of (γ−1) is (N−1); therefore, the maximum communication cost is O((N−1)|V|). Comparing it with the earlier vertex-centric partition estimation O(N(N−1) · |E|/N²) = O((N−1)/N · |E|), we see that, from the equations, the communication amount of source-cut is not strictly less than that of vertex-centric partition. We show that there is no contradiction as follows. For source-cut to reach the maximum communication O((N−1)|V|), at least N²(N−1) edges are needed; in particular, they are all in non-diagonal grids. For example, assume N = 4, |V| = 16, and each cube contains 4 vertices; we need at least 48 edges in the white grids in Figure 4.3 (b). Specifically, each grid contains 4 edges, one from each of the 4 source vertices (rows) of the grid to some master vertex in its columns. In this way, 3 replica synchronizations are needed for each row; with 16 rows, in total 48 cross-cube communications are incurred. It is easy to see that, in vertex-centric partition, the same amount of communication is incurred as well, because the source and destination of each of the 48 edges are not in the same cube. However, if we put |E| = 48 and N = 4 into O((N−1)/N · |E|), we would get only 36 communications.
It is because the equation assumes that the edges are uniformly distributed among all cubes, which is not true in this case. Overall, this is an example showing that, for a given graph, source-cut incurs at most the same amount of communication as vertex-centric partition.

The implementation of source-cut is much simpler than the complex algorithm used in METIS [56]. One simple implementation of source-cut is to define a hash function hash(v) and assign an edge (u,v) to cube hash(v)%N, where N is the number of HMCs (i.e., 16 in our system). Note that, although source-cut ensures that all the incoming edges of a vertex are assigned to the same HMC, it does not provide any guarantee on the outgoing edges. As a result, if an edge (u,v) is assigned to HMC i, all the edges of the form (*,v) will also be assigned to HMC i, but some or even all of the edges of the form (u,*) may not be assigned to HMC i. In that case, we need to set up a replica of vertex u in HMC i to store the newest value of vertex u.

4.3.2 Two-Phase Vertex Program

Based on source-cut, we propose a new programming model named "Two-Phase Vertex Program", in which the unit of data processing is the incoming edges under source-cut. As discussed in Section 4.1, the programming model and data organization interact with each other; therefore, a co-design is required. Our "Two-Phase Vertex Program" splits a vertex program into two phases: 1) the Generate phase, where all the incoming edges of a vertex and their corresponding sources are read and used to generate an update for their shared destination vertex; and 2) the Apply phase, where the update is applied to every replica of the corresponding vertex. Our new model is designed for source-cut. First, since each vertex and all its incoming edges are in the same cube, the Generate phase can be performed locally. Second, communication only happens before the Apply phase, which provides "one update per replica", instead of the "one put per cross-cube edge" of Tesseract's vertex-centric partition. Figure 4.4 shows a PageRank implementation that 1) uses the same set of APIs as Tesseract; 2) is equivalent to the implementation described in Figure 4.2; but 3) is programmed in the "Two-Phase Vertex Program" model. As we can see, the first loop iterates over all the replicas to calculate the proper share given to each edge (by dividing the new pagerank by the outgoing degree of the corresponding vertex). In the next two-level nested loop, the outer loop iterates over every vertex. Then, for every vertex v, the program first iterates over all its incoming edges to calculate the new pagerank and then broadcasts this new value to all its replicas. Due to the source-cut partition, all the computations during incoming-edge iteration occur locally and hence do not incur any communication.

list_for (r: graph.all_replicas) {
  r.pagerank = r.next_pagerank;
  r.value = 0.85 * r.pagerank / r.out_degree;
}
list_for (v: graph.vertices) {
  update = 0;
  list_for (e: v.incoming_edges) {
    update += e.source.value;
  }
  list_for (r: v.replicas) {
    put(r.id, function(r, arg) {
      r.next_pagerank = arg;
    }, &update, sizeof(update), &r.next_pagerank);
  }
}
barrier();

Figure 4.4: PageRank in Two-Phase Vertex Program.

While it is possible to express the operations of the "Two-Phase Vertex Program" with Tesseract's API, it is tedious, and the new model requires a number of internal data structures that Tesseract does not provide (e.g., the replica list). Therefore, we propose our own APIs as a higher-level abstraction to enhance programmability.
As shown in Figure 4.5, users of GraphP only need to write two functions, GenUpdate and ApplyUpdate; all the other chores, e.g., replica synchronization, are handled by our system. Specifically, the input of the GenUpdate function is the incoming edges of a specific vertex, and the output is the corresponding update. In contrast, the input of the ApplyUpdate function is the vertex property and the update generated in this iteration; it does not produce output. In each iteration, the GenUpdate function is executed once on every vertex and ApplyUpdate is executed once on every replica. One should note that both GenUpdate and ApplyUpdate can be executed locally. The replica synchronization (i.e., the broadcast of updates to replicas) is transparently handled by our software framework. In other words, the communication pattern of our system is fixed. As we will see later in Section 4.3.3 and Section 4.3.4, this higher-level abstraction not only ensures programmability but also provides the flexibility to apply additional optimizations. Due to the fixed communication pattern, it is possible to further optimize the architecture to reduce cross-cube communication on the bottleneck links.

GenUpdate(incoming_edges) {
  update = 0;
  list_for (e: incoming_edges) {
    update += e.source.value;
  }
  return update;
}

ApplyUpdate(v, update) {
  v.pagerank = update;
  v.value = 0.85 * v.pagerank / v.out_degree;
}

Figure 4.5: Two-Phase Vertex Program.

Table 4.1: Comparison of Cross-Cube Communication.

              Random                  METIS                   Source-cut
              Average     Maximum     Average    Maximum      Average     Maximum
Orkut         457,754     470,959     187,532    843,270      107,206     109,706
LiveJournal   269,506     289,519      79,224    352,341      107,789     115,594
Twitter       5,735,801   6,374,486   failed     failed       1,079,390   1,105,802

To illustrate the effectiveness of source-cut and the "Two-Phase Vertex Program" model, Table 4.1 compares the amount of cross-cube communication on three real-world graphs. For every graph, we have tried three partitionings: 1) Random, which randomly assigns a vertex to an HMC; 2) METIS, which takes advantage of the advanced partitioning application METIS [56]; 3) Source-cut, which randomly assigns a vertex and all its incoming edges to a cube. The first two are vertex-centric partitions, which can be used in Tesseract. We report both the average and the maximum amount of cross-cube communication for every case. We see that, when Random is used, the skewness among all the 16 × 16 = 256 pairs of cross-cube communication is not large. In contrast, although the advanced partitioner METIS can largely reduce the average amount of cross-cube communication, it usually leads to excessive skewness (i.e., a large difference between maximum and average communication). As a result, the maximum amount of cross-cube communication produced by METIS is sometimes much higher than Random. This observation explains why, in Tesseract's evaluation, METIS does not improve the performance as expected. Moreover, the cost of using METIS is huge: it not only takes a long time but also consumes a large amount of memory. As noted, the results of partitioning Twitter with METIS are not given in the table, because the METIS program failed with out-of-memory errors even when we used a machine with 1 TB of memory. For Source-cut, we assume that the argument size needed for the remote function call is the same as the data size of the update generated by GenUpdate. From the results, we see that Source-cut incurs only 18.8% to 39.9% of the communication compared with Random.
Compared with METIS, Source-cut incurs 55.9% communication on Orkut, but it increases the communication on Livejournal graph by 54.4%. Note that it meanssource-cut must have a dierent vertex-to-cube assignment thanMETIS, because otherwise source- cut can be proven to generate less cross-cube communication (see Section 4.3.1). However, Source-cut has much smaller maximum cross-cube communication: 68.4% and 92.6% reduction compared to Random and METIS on average, respectively. This leads to more balanced execution. More importantly, the partitioning cost of METIS is much higher than Source-cut. Expressiveness of Two-Phase Vertex Program. Before proposing further architecture optimiza- tions, we compare the expressiveness of the general vertex program and the proposed Two-Phase Vertex Program. In Figure 4.6, consider three vertices:fv 1 ;v 2 g2 HMC 0,fv 3 g2 HMC 1; and two edges: (v 1 ;v 3 ) and (v 2 ;v 3 ). In Two-Phase Vertex Program, there are replicas of v 1 andv 2 in HMC 1, v 3 ’s GenUpdate can generate the update based on allv 3 ’s incoming edges/neighbors. The restriction in Two-phase Vertex Program model is that theGenUpdate has to perform the same operation (e.g., dened asf(v 1 ;v 2 ;:::) for 62 all incoming edges/neighbors. In contrast, the general vertex program semantically allows performing dif- ferent operations for each edge. For example,f1(v 1 ) andf2(v 2 ) and thenv 3 could reduce the two results and apply. However, real-world applications (e.g., pagerank) do not need such exibility. In fact, the extra exibility may do more harm than good, — it may lead to many duplications (e.g., same remote function is sent for all outgoing neighbors) that is hard to be automatically removed. In contrast, Two-Phase Vertex Program inherently avoids these duplications. We believe our model is suciently expressive. Moreover, it is possible to express the general vertex program with certain changes to the proposed model. Specically, the GenUpdate function can concate- nate the list of incoming edges/neighbors, then in ApplyUpdate function, dierent functions could be applied to dierent incoming edges/neighbors. This change could perform the same computations of gen- eral vertex program in Figure 4.6. The cost is more complex function parameters and more memory space. Overall, we believe our Two-Phase Vertex Program provides an ecient mechanism to remove the redun- dant information that is not required in most applications. We argue that our approach is a sweet spot that balance the trade-o between generalizability and communication/performance. Two-Phase Vertex Program v1 v1 vertex replica cross-cube communication Intra-cube read v3 HMC 0 HMC 1 put (f1) put (f2) v1 v2 HMC 0 v1 v2 v3 HMC 1 v1 v2 GenUpdate(f) General Vertex Program Figure 4.6: Expressiveness of Two-Phase Vertex Program. As discussed in Section 4.2.1, since each HMC provide only 4 links, it is impossible to achieve a full connection between 16 HMCs. Due to the interconnection topology, certain links cross groups serve much 63 more cross-cube communication and could become the bottlenecks. Specically, the bottleneck links of Dragony are the links between every pair of HMC groups. Each of them needs to serve 16 pairs of HMCs communication. As an illustration, Figure 4.8 presents: 1) the average load of inner-group links (e.g., link between C0 and C1); and 2) the average load of cross-group links (e.g., link between C5 and C10) for dierent graphs in GraphP. 
We see that, although source-cut can significantly reduce the communication load, it does not change the fact that cross-group links serve much more communication than inner-group links. As a result, these cross-group links usually become the bottleneck and may potentially hinder performance and energy consumption.

4.3.3 Hierarchical Communication

Figure 4.7: Hierarchical Communication (broadcasting tree between HMC groups; cubes C00 to C15 are organized into four groups, e.g., Group 0 and Group 3).

In order to mitigate the imbalance, we develop the hierarchical communication mechanism to remove the redundancies in simple point-to-point communication. Please note that this is an additional opportunity enabled by source-cut and the Two-Phase Vertex Program, where it is guaranteed that the update sent to replicas of the same vertex must be the same. As a result, it is possible to send only one copy of the update to every HMC group, instead of one copy per HMC. Specifically, for every pair of HMC groups in the Dragonfly topology, we build a broadcasting tree as illustrated by Figure 4.7. For a group, one HMC is selected as the broker (e.g., HMC C5 and HMC C10 for group 0 and group 3 in the example), which is responsible for 1) gathering the needed updates from its local group; 2) sending them to the broker of the other group; and 3) scattering the updates inside a group as needed. As a simple example, suppose hash(v_1)%16 = 0, the master of v_1 is at HMC C0, and C10, C11, and C14 contain v_1's replicas. With conventional point-to-point communication, the link between C5 and C10 needs to transfer the duplicated update three times. With hierarchical communication, only one copy of the update value is transferred along our broadcasting tree.

4.3.4 Overlapped Execution

Figure 4.8: Averaged Communication Amount: Inner-Group Links vs. Cross-Group Links (bfs, wcc, pagerank, and sssp on TT, LJ, AZ, SD, and WV).

In an architecture with multiple HMCs, it is important to hide the remote access latency of cross-cube communication. To support this, Tesseract uses non-blocking communication, so that a local core can continue its execution without waiting for the results. Moreover, since the execution of non-blocking remote calls can be delayed, it is possible to batch several such functions together so that the core in the receiver cube is only interrupted once. Otherwise, performance would be significantly impacted by frequent context switches. In GraphP, one option is to perform replica synchronization at the end of each iteration, but it is also possible to overlap the communication due to replica updates with the current execution. The insight is the same as in Tesseract: when a vertex has finished the execution of GenUpdate, the cube containing the master can start broadcasting the update immediately. The replica updates can be stored in the receiver cube's message queues and processed later in ApplyUpdate. When GenUpdate of all master vertices has finished, a cube can process the earlier received replica updates in batch. Therefore, only the executions due to updates sent toward the end of an iteration may appear on the critical path of execution.
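As a rough illustration of this overlapping scheme, the C++ sketch below broadcasts an update as soon as GenUpdate finishes for a master vertex, while receiver cubes buffer incoming replica updates for a batched ApplyUpdate at the end of the iteration. The queue-based structure and all names are illustrative assumptions, not GraphP's actual implementation.

#include <cstdint>
#include <mutex>
#include <vector>

struct ReplicaUpdate { uint32_t vertex_id; double value; };

// Per-cube buffer for replica updates received from other cubes.
struct UpdateInbox {
  std::mutex lock;
  std::vector<ReplicaUpdate> pending;
  void push(ReplicaUpdate u) {
    std::lock_guard<std::mutex> g(lock);
    pending.push_back(u);
  }
};

// Hypothetical send primitive: as soon as a master's update is complete,
// deliver it (non-blocking from the sender's view) to the inbox of every cube
// holding a replica, so communication overlaps with the remaining GenUpdate work.
void broadcast_update(uint32_t vertex_id, double value,
                      const std::vector<UpdateInbox*>& replica_cubes) {
  for (UpdateInbox* inbox : replica_cubes) inbox->push({vertex_id, value});
}

// The receiver later walks its `pending` list once and runs ApplyUpdate in batch.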
While conceptually simple, one caveat needs to be considered to support the overlapped execution in GraphP. Referring to Figure 4.3 (b), each cube has to follow the "column-major order" so that complete updates can be generated continuously, because the whole column has to be accessed to generate the update for a master vertex. If the "row-major order" is followed, we only generate "partial updates" for all master vertices before all edges are processed. It means that the cube cannot send any updates until the end of the iteration, and thus the overlapped execution does not apply. On the other hand, the choice of order may affect performance due to locality. The "column-major order" optimizes write locality, since all edges in the same column incur writes to the same vertex; however, it may incur some non-sequential reads of the source vertices. The "row-major order" optimizes read locality, since the edges from the same vertex are processed together. It is interesting to investigate the interactions between locality and execution overlapping. Our evaluation results in Section 4.5.1 provide more insights into this trade-off. To further improve the effectiveness of overlapping, sorting the vertices by their incoming degree may be beneficial: the GenUpdate of the vertex with the smallest incoming degree is executed first. This simple optimization can make the best use of the bandwidth. Due to space limits, we do not go into detail on the communication/execution overlapping technique and leave it as an interesting alternative that can be explored in the design space of GraphP.

4.4 Evaluation Methodology

Simulation Configuration We simulate the HMC architecture by building an HMC simulator called hmc-zsim. Specifically, we integrate an HMC interconnect component into zSim [104], a fast and scalable simulator designed for x86-64 multicores. We configure the simulator to have 16 HMC cubes, with each cube containing 32 in-order cores. Each core is configured with a 64 KB, 4-way set-associative L1-D cache and a 32 KB, 4-way set-associative L1-I cache. We use a 64-byte cache line and 1000 MHz as the simulation frequency. The results are validated against [33].

Datasets Table 4.2 shows the graph datasets we use to evaluate GraphP. For each graph, we also show the replication factor γ. These datasets are retrieved from the Stanford Large Network Dataset Collection (SNAP) [71] and are representative of modern medium to large graphs, including social networks from Twitter, Slashdot, and LiveJournal, as well as networks collected from other types of user activities, like voting and co-purchasing.

Applications We implement four widely used applications based on the two-phase vertex programming model: BFS (Breadth-First Search), PageRank (an algorithm to rank website pages), SSSP (Single Source Shortest Path), and WCC (Weakly Connected Component).

Table 4.2: Graph Dataset

Graphs                  #Vertices   #Edges   γ      Note
Wiki-Vote (WV)          7.1K        104K     2.96   Wikipedia who-votes-on-whom network [69]
ego-Twitter (TT)        81K         2.4M     3.79   Social circles from Twitter; edges from all egonets combined [79]
Soc-Slashdot0902 (SD)   82K         0.95M    4.40   Slashdot social network from February 2009 [72]
Amazon0302 (AZ)         262K        1.2M     2.68   Amazon product co-purchasing network from March 2 2003 [68]
LiveJournal (LJ)        4.8M        69M      4.18   LiveJournal online social network [72]

Performance Evaluation Methodology In order to evaluate the performance of GraphP, we run all the applications on the graph datasets on top of hmc-zsim.
For each run, we obtain the cycle count of each core, and the largest cycle count among all cores is deemed the cycle count of the entire run.

Energy Evaluation Methodology The energy consumption of the HMC interconnect is estimated in two parts: a) the dynamic consumption, which is proportional to the number of flit transfer events that happen between each pair of routers; and b) the static consumption, which corresponds to the energy cost when the interconnect is powered on but idle (i.e., no transfer event happens). We use hmc-zsim to count the number of transfer events and turn to the router power/area modeling tool ORION 3.0 [54] to estimate the unit energy cost, i.e., the power, of the routers and associated links in the interconnect. We also refer to McPAT [73] and [121] to double-check the power results to ensure their correctness and reliability.

4.5 Evaluation

4.5.1 Performance

We first present the overall performance of our system and compare it with the Tesseract-like system. Figure 4.9 and Figure 4.10 show our results, which include the speedup of the source-cut algorithm, with and without the inner-group broadcasting (i.e., hierarchical communication) and overlapping optimizations, compared to the baseline.

Figure 4.9: Performance of GraphP: 2DMesh (speedup of bfs, wcc, pagerank, and sssp on TT, LJ, AZ, SD, and WV for Tesseract, Source-cut, Source-cut w/ Broadcast, Source-cut w/ Overlapping, and Source-cut w/ Broadcast & Overlapping).

Figure 4.10: Performance of GraphP: Dragonfly (same configurations as Figure 4.9).

Our results show that GraphP (source-cut only) outperforms the Tesseract-like system by 1.7x on average and 3.9x at maximum across all applications in our experiments. We note that the inner-group broadcast optimization contributes only a marginal performance improvement compared with source-cut only. This is because communication is not the bottleneck for GraphP, so the reduced communication amount does not affect performance. The results of overlapping are somewhat disappointing but interesting: in only one case (i.e., pagerank on WV) does it achieve the best performance; in a few cases (e.g., bfs/wcc/sssp on WV), it achieves similar (but slightly worse) speedups compared to source-cut only (or with broadcast). Apparently, this is due to the trade-off we discussed in Section 4.3.4: the different order of vertex/edge traversal leads to worse locality, which eventually hurts performance. We also see that, when used with overlapping, broadcast always improves performance, but still not as much as without overlapping.

4.5.2 Cross-Cube Communication

Figure 4.11 and Figure 4.12 show the cross-cube communication amount of GraphP normalized to the Tesseract baseline. As we can see, source-cut alone reduces cross-cube communication by 35% to 98% across all applications. With the broadcast optimization, cross-cube communication is further reduced by roughly 5/6 compared to the source-cut-only setting, which amounts to more than a 90% reduction compared to the Tesseract baseline. We can also see that inter-group and inner-group communication each accounts for approximately half of the total cross-cube communication.
However, given that there are far fewer inter-group links than inner-group links, the inter-group links are still overloaded and remain the potential bottleneck, as reflected in Figure 4.8.

Figure 4.11: Cross-Cube Communication: 2DMesh (normalized inter-group and intra-group communication of Tesseract, Source-cut, and Source-cut w/ Broadcast for bfs, wcc, pagerank, and sssp on TT, LJ, AZ, SD, and WV).

Figure 4.12: Cross-Cube Communication: Dragonfly (same configurations as Figure 4.11).

4.5.3 Bandwidth Utilization

As we have mentioned in Section 4.3.3, the main reason for our speedup is the reduction of cross-cube communication, especially the reduction on bottleneck links. To further validate this argument, in this section we present the measured bandwidth utilization. Specifically, we evaluate the utilization of 1) the aggregated internal bandwidth; 2) normal inner-group cross-cube links; and 3) cross-group bottleneck links. The first item is measured by dividing the total amount of local HMC reads/writes by the total execution time. In contrast, the bandwidth on links is calculated as the amount of communication divided by time. Figure 4.13 and Figure 4.14 show the results of our evaluation normalized to the Tesseract baseline. As we can see from these figures, the aggregated internal bandwidth of each HMC in GraphP increases by 1.1x to 46x. In particular, for graph inputs TT, LJ, and AZ, the internal bandwidth utilization increases by at least 5x. The inner-group bandwidth and cross-group bandwidth are both as expected: the source-cut algorithm enables an average 81% and a maximum 98% reduction in inner-group link bandwidth. As for cross-group link bandwidth, the source-cut algorithm also reaches an 80% average and 98% maximum utilization reduction. Inner-group broadcasting further reduces the inner-group and inter-group bandwidth utilization to a marginal amount compared to the baseline.

4.5.4 Scalability

Figure 4.15 and Figure 4.16 evaluate the scalability of GraphP by measuring the performance of 1/4/16-HMC systems (i.e., 32/128/512 cores), normalized to the performance of the 1-HMC GraphP system. For all the applications, graph inputs TT, LJ, and AZ exhibit 2x-6x improvement for 4 HMCs and 2.5x-30x improvement for 16 HMCs. However, SD and WV do not scale out as well as the other three graph inputs. The actual reason for such behavior is still unclear. We conjecture that it may be due to these graphs' inherent structures, which hinder the computation from scaling out.
This behavior suggests that graphs may need to be classified properly so that each class of graphs can fully utilize the HMC resources. We leave the detailed investigation of this issue as future work.

Figure 4.13: Bandwidth utilization: 2DMesh (local internal bandwidth, inner-group bandwidth, and cross-group bandwidth of Tesseract, Source-cut, and Source-cut w/ Broadcast for bfs, wcc, pagerank, and sssp on TT, LJ, AZ, SD, and WV).

Figure 4.14: Bandwidth utilization: Dragonfly (same configurations as Figure 4.13).

Figure 4.15: Scalability of GraphP - 2DMesh (speedup of 1-cube, 4-cube, and 16-cube systems).

Figure 4.16: Scalability of GraphP - Dragonfly (same configurations as Figure 4.15).

Figure 4.17: Energy Consumption of GraphP: 2DMesh (normalized energy cost of Tesseract, Source-cut, and Source-cut w/ Broadcast).

4.5.5 Energy/Power Consumption Analysis

Figure 4.17 and Figure 4.18 show the energy consumption of GraphP. Across the four applications, the source-cut-only setting of GraphP reduces energy cost by 30% to 95%, while source-cut with inner-group broadcasting further reduces energy cost for all the applications. For some graph inputs, like TT, LJ, and AZ, the broadcasting optimization accounts for more than 50% additional energy reduction compared with the plain source-cut algorithm.

4.5.6 Memory Overhead

Since GraphP replicates only vertices, not edges, it only increases the consumed memory from |V| × (vertex size) + |E| × (edge size) to γ|V| × (vertex size) + |E| × (edge size). In a typical case where each vertex takes 4 bytes and each edge takes 8 bytes, our technique only incurs (γ − 1) × 4|V| / (4|V| + 8|E|) additional memory. With this formula, the calculated overheads for datasets [WV, TT, SD, AZ, LJ] are only [6.47%, 4.63%, 14.1%, 16.5%, 10.7%]. The average overhead is only 10.4% and the largest is also smaller than 20%.

Figure 4.18: Energy Consumption of GraphP: Dragonfly (same configurations as Figure 4.17).

4.6 Related Work

Graph Processing Accelerators Mustafa et al. [87] proposed an accelerator for asynchronous graph processing, which features efficient hardware scheduling and dependence tracking. To use the system, programmers have to understand its architecture and modify existing code. Graphicionado [47] is a customized graph accelerator designed for high performance and energy efficiency, based on off-chip DRAM and on-chip eDRAM instead of PIM.
Graphicionado uses specialized memory subsystem for higher band- width. GraphPIM [84] demonstrates the performance benets for graph applications by using PIM for the atomic operations. However, it does not focus on the inter-cube communications. Large-Scale Graph Processing system There are many distributed graph processing systems pro- posed by researchers. Pregel [75] is the rst distributed graph processing system and proposes a vertex- centric programming model, which is later inherited by many other graph processing systems including Tesseract [39, 74, 101, 141, 111, 112, 65]. However, due to the problem of vertex-centric programming model (e.g., the enforcing of 1D partitioning), many new kinds of partitioning algorithm and the cor- responding programming models (e.g., GAS proposed by PowerGraph [39] and hybrid-cut proposed by 77 PowerLyra [18]). Certain parts of GraphP are inspired by these works, such as selective scheduling, re- moving the short sight of a vertex [120], but we adapt them into a PIM architecture and propose many more architecture-specic optimizations (e.g., broadcasting, overlapping). Besides distributed graph processing, there are also many out-of-core graph processing systems that use disks. The key principle of such systems is to keep only a small portion of active graph data in memory and spill the remainder to disks. Although it is reported that these works can sometimes comparable with distributed systems that have hundreds of cores, it is also a well-known fact that all these works are bounded by the bandwidth of disks. As a result, all these works focus on how to enlarge the locality of disk I/O. In contrast, 3D stacking technologies provide high density, which opens up the possibility of in-memory big-data processing. Thus, the most signicant problem is changed from increasing locality to reducing cross-cube communication in our work. 4.7 Summary This chapter proposes GraphP, a novel HMC-based software/hardware co-designed graph processing sys- tem that drastically reduces communication and energy consumption compared to Tesseract. GraphP features three key techniques. 1) “Source-cut” partitioning, which fundamentally changes the cross-cube communication from one remote put per cross-cube edge to one update per replica. 2) “Two-phase Vertex Program”, a programming model designed for the “source-cut” partitioning with two operations: GenUp- date and ApplyUpdate. 3) Hierarchical communication and overlapping, which further improves perfor- mance with unique opportunities oered by the proposed partitioning and programming model. We eval- uate GraphP using a cycle accurate simulator with 5 real-world graphs and 4 algorithms. The results show that it provides on average 1.7 speedup and 89% energy saving compared to Tesseract. 78 Chapter5 GraphQ:ScalablePIM-BasedGraphProcessing In Chapter 4, we reveal that the performance bottleneck of Tesseract is inter-cube communication. We propose a new data partition and programming model to resolve it. In this work, we take a closer look at GraphP and Tesseract. We nd that although the inter-cube problem is alleviated, there is still room for performance improvement at intra-cube and inter-node level. To integrate these optimizations at dierent memory hierarchy levels, we also modify the inter-cube communication with a new programming model. 5.1 BackgroundandMotivation 5.1.1 Tesseract Tesseract [1] is a PIM-based graph processing accelerator with 16 cubes. 
Tesseract provides low-level primitives to support the vertex program model. For each vertex, the program iterates over all its edges/neighbors and executes a put function for each of them. The signature of this put function is put(id, void* func, void* arg, size_t arg_size, void* prefetch_addr). It executes a function call func with argument arg on the id-th cube; therefore, it can be either 1) a remote call, if the destination vertex resides on a different cube from the source vertex, or otherwise 2) a local function call. In the end, a barrier ensures that all operations in one iteration are performed before the next iteration.

Figure 5.1: Tesseract Communication and Access Pattern. ((a) Inter-cube communication: small irregular messages (e.g., from cube 0 to cube 2) with process, reduce, apply, and batching; (b) Intra-cube accesses: sequential source vertex accesses (1), sequential edge accesses (2), and random destination vertex accesses (3).)

Figure 5.1 (a) shows the inter-cube communication in an adjacency matrix view, where the rows and columns correspond to the source and destination vertices of edges, and each dot is an edge. All vertices are partitioned among four cubes; each cube is assigned a set of rows. The circled dot represents an edge from a vertex in cube 0 to a vertex in cube 2, (v_i → v_j), which corresponds to an inter-cube message from cube 0 to cube 2. Thus, each edge across cubes incurs such a message, and the destination cube is determined by the graph structure (e.g., the destination of the edge). These small and irregular inter-cube messages are generated during execution, to unpredictable destination cubes, at any time. On the receiver side, the core is interrupted to execute the remote function, incurring overhead due to the context switch. Tesseract uses batching to mitigate the interrupt overhead by buffering the received remote function calls in a queue and executing multiple functions together at a certain later point. This can be seen as the square in Figure 5.1 (a): the functions corresponding to the edges inside the square are executed in batch by a core in the remote cube. Due to the large number of inter-cube messages, the generated batches are too small to offset the performance impact of the interrupt overhead. Moreover, irregular communication between cubes may incur imbalanced load and hardware utilization. Due to the graph-dependent communication pattern, when messages are sent to the same cube from different senders, its message queues may become full and put backpressure on the network to prevent senders from generating more messages. In this case, cores in the receiver cube will be overwhelmed by handling remote function call requests without making progress in processing their local data. Finally, the dynamic communication pattern leads to excessive energy consumption on inter-cube links. To save energy, each inter-cube link can be set to a low-power state (e.g., the Power-Down mode in HMC [51]). However, this optimization is not applicable when a message can be sent at any time. Figure 5.1 (b) shows the problem with intra-cube data movement. If the destination of an edge is in the local cube, a local apply is performed, which incurs random accesses and causes locality interference. Specifically, accesses to the vertex array (1) and the edge array (2) are sequential reads. However, the accesses to the compute array for the destination vertices are random (3).
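A minimal CSR-style sketch of this access pattern, assuming illustrative array names rather than Tesseract's actual data structures: the vertex and edge arrays are read sequentially, while the per-destination compute array is updated at random offsets, which is exactly the mix that interferes in a shared cache.

#include <cstdint>
#include <vector>

struct CSRGraph {
  std::vector<uint32_t> row_ptr;   // vertex array: offsets into col_idx, size |V|+1
  std::vector<uint32_t> col_idx;   // edge array: destination vertex IDs
};

void local_apply(const CSRGraph& g,
                 const std::vector<double>& value,    // per-source contribution
                 std::vector<double>& compute) {      // per-destination accumulator
  for (uint32_t v = 0; v + 1 < g.row_ptr.size(); ++v) {          // (1) sequential
    for (uint32_t e = g.row_ptr[v]; e < g.row_ptr[v + 1]; ++e) { // (2) sequential
      compute[g.col_idx[e]] += value[v];                          // (3) random write
    }
  }
}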
Besides, remote function calls also incur random accesses. GraphH [24] shares some similarity with GraphQ in reducing the irregularity of inter-cube communication; we defer the comparison to Section 5.2.4, after a thorough explanation of the ideas of GraphQ.

5.1.2 Lessons Learned and Design Principles

We believe that an efficient PIM-based architecture should ideally satisfy three requirements. First, inter-cube communication should be predictable, i.e., each cube should know exactly when a message will arrive and from which source cube. This would largely eliminate the interrupt overhead in the current designs. There will still be overhead on the receiving cube to execute the remote function, but these operations will not interfere with the current processing on that cube, since they happen at known times. Second, intra-cube data movement should be handled by heterogeneous cores in a decoupled manner due to the different access patterns. It is critical to reduce interference, given that data from different arrays share the same cache. Third, the multi-node PIM-based graph processing architecture must efficiently handle the large discrepancy of inter-node, inter-cube, and intra-cube bandwidth. The bottom line is that the design should achieve speedup over the conventional memory hierarchy with the same memory size when the graph data are distributed into the PIMs of all nodes.

5.2 GraphQ Architecture

We propose GraphQ, the first multi-node PIM-based graph processing architecture, built on the recent work Tesseract. Our solution is inspired by techniques in distributed graph processing and irregular applications, but they have never been applied and investigated in the context of PIM. To mitigate the interrupt handling overhead, GraphQ uses predictable inter-cube communication, which is supported by simple reordering of the edge processing order according to the graph partition among cubes. To enable efficient intra-cube data movement, we divide the cores in a cube into two heterogeneous groups. To hide long inter-node communication latency, we propose a hybrid execution model that performs additional local computation during inter-node communication.

5.2.1 Predictable Inter-Cube Communication

Batched Communication We propose an execution model that supports predictable and batched communication, which is enabled by two key ideas. First, as shown in Figure 5.2, the reduce step is performed in the source cube. For each edge, instead of sending the function and parameters to a remote cube, the source cube locally reduces the values for each destination vertex. In the matrix view, the reduced value for the edges in the same column is generated in the cube holding their source vertices. Second, we generate all messages for the same remote cube together, so that they can be naturally batched. We partition the whole matrix into blocks, each of which contains the edges that contribute to the batched message between a pair of cubes. For example, the third block in the first row will generate a batched message from cube 0 to cube 2.
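As a rough illustration of these two ideas, the sketch below reduces edge contributions locally at the source cube and accumulates them into one buffer per destination cube, so each pair of cubes exchanges a single batched message per round. The data layout, the modulo-based owner function, and the use of a plain sum as the reduce operation are assumptions for illustration only.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct Edge { uint32_t src, dst; };

// One batched message per destination cube: destination vertex -> reduced value.
using BatchedMessage = std::unordered_map<uint32_t, double>;

std::vector<BatchedMessage> build_batches(const std::vector<Edge>& local_edges,
                                          const std::vector<double>& src_value,
                                          int num_cubes) {
  std::vector<BatchedMessage> batch(num_cubes);
  for (const Edge& e : local_edges) {
    int dst_cube = e.dst % num_cubes;             // owner of the destination vertex
    batch[dst_cube][e.dst] += src_value[e.src];   // reduce at the source cube
  }
  return batch;                                    // one message per remote cube
}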
Figure 5.2: Batched Communication in GraphQ (matrix view with blocks; e.g., the batched message from cube 0 to cube 2 combines process and reduce at the source cube with apply at the destination cube).

Figure 5.3: Overlapped Computation and Communication (each cube processes blocks (cube_id, block_id) over rounds 0 to 3; the reduce of a received batch overlaps with the process/communication of the next round).

In GraphQ, the apply step in each cube is performed by reducing (N − 1) batched messages from other cubes, where N is the number of cubes. In our example, N = 4: cube 2 will reduce the three batched messages shown in different colors and then update its local vertices with the new values.

Rounded Execution The batched communication enables a new optimization to support the overlap of communication and computation with balanced execution. The insight is illustrated in Figure 5.3. We use the (cube_id, block_id) pair to indicate the source and destination of the batched messages. The order of batched messages is determined by the order of blocks in each cube (from left to right in the figure). For example, cube 1 should first process block (1,2), which will generate a batched message from cube 1 to cube 2. We call this execution model rounded execution, where each iteration is divided into N rounds, N being the number of cubes. The rounded execution is synchronous, which means that all cubes have to finish one round before entering the next. With four cubes, there are in total four rounds. The key insight is that, after one round, each cube will generate only one batched message for one remote cube. Following this principle, the destination cubes should be "interleaved": in the first round, cube 0 generates a message to cube 1, cube 1 generates a message to cube 2, and so on. With the starting rounds in all cubes determined, it is easy to derive the others: each cube only needs to process the subsequent rounds after the first one. For the first (N − 1) rounds, the destinations of batched messages are organized in a circulant manner. The last round is the same for all cubes: they process the block that generates only local updates and no inter-cube messages. At the end of each round, a barrier ensures that all cubes have received the batched messages. When all batched messages are received in all cubes, they perform reduce, which accumulates the updates from the source cubes. After that, the batched message buffer can be reclaimed and is available to be used by the next round. Therefore, only one receive message buffer is needed for each cube. Since rounds are executed synchronously, the load imbalance among different cubes in the same round may increase execution time. We study this effect by comparing the sum of the maximum cube computation in each round (for rounded execution) with the maximum of the sum of cube computation over all rounds (for no rounded execution). For the graphs used in our evaluation, we report the results in Section 5. In summary, rounded execution enables balanced execution with two properties: a) the batched messages from the previous round can be overlapped with the execution of the current round, so the messages can be sent in a non-blocking manner; and b) each cube only receives one batched message in one round, so only one receive buffer is needed.
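A minimal sketch of the circulant round schedule described above, using a hypothetical stand-alone program rather than GraphQ's runtime primitives: in round r, cube i processes block (i, (i + r + 1) mod N), so the destinations rotate each round and the last round touches only the local block.

#include <cstdio>

// Prints which block each cube processes in each round of one iteration.
// dst == cube id means the block is local and produces no inter-cube message.
int main() {
  const int N = 4;                        // number of cubes
  for (int round = 0; round < N; ++round) {
    for (int cube = 0; cube < N; ++cube) {
      int dst = (cube + round + 1) % N;   // circulant destination schedule
      std::printf("round %d: cube %d processes block (%d,%d)%s\n",
                  round, cube, cube, dst,
                  dst == cube ? " [local, no message]" : "");
    }
    // barrier(): all cubes finish the round before the next one starts
  }
  return 0;
}

With N = 4 this reproduces the block order of Figure 5.3: cube 0 processes (0,1), (0,2), (0,3), and finally the local block (0,0).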
Note that the irregular inter-cube communication cannot achieve full overlapping, it is the fundamental dierence between GraphQ and GraphP/Tesseract. In fact, GraphP tried the idea of overlapping but the results showed that benets are little. Therefore, GraphP and GraphQ are orthogonal. 84 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 0 1 2 3 4 5 Example Graph Round 1: Round 2: Round 3: {0➔2,1➔3} {1➔5} {0➔1} {2➔5,3➔4} {2➔0,2➔1} {} {4➔0} {4➔2,5➔3} {5➔4} Cube 0 Cube 1 Cube 2 Figure 5.4: Example of GraphQ Figure 5.4 shows a concrete example based on a small graph. Suppose we have three cubes, the vertices partitioned into each cube are indicated with dierent colors. On the right, we show the vertices and edges assigned to each cube. The edges processed in dierent rounds are shown in the color of the destination cube. The edge sets processed in each cube/round are also shown below. Preprocessing Preprocessing is common for all graph processing frameworks, including both Tesseract and GraphQ. There are two common preprocessing steps: (1) convert graph in text format, e.g., SNAP (Stanford Network Analysis Project) format [71] or Matrix Market format [114], to binary graph data structure, e.g., CSR (Compressed Sparse Row); (2) partition the graph among cubes. GraphQ requires an additional step to group edge block for each round together to enable batched communication. This incurs small overhead because it does not require an extra iteration over the input. Specically, during graph partition, we can maintain an edge list for each remote cube, each edge is placed into the corresponding edge list based on its destination vertex. In the end, the edge lists are concatenated together to generate the new combined edge list for each cube. For graphs used in the simulation, the total preprocessing time is within 3 seconds in a single-thread im- plementation. Table 5.1 compares preprocessing time of several large graphs in seconds between GraphQ and Tesseract, we can see that the overhead for both schemes are low, considering it is a an one-time overhead that can be amortized over executions. The dierence between the two is the additional over- head due to edge reordering in GraphQ, e.g., for R-MAT-Scale27, GraphQ requires about additional 10s to 85 preprocess the graph. The increases in all three large graphs are less than 10%. In general, the one-time preprocessing time and execution time of graph processing systems are reported separately [39, 18, 140]. Table 5.1: Preprocessing overhead for large graph datasets Graphs jVj jEj Tesseract GraphQ Twitter-2010 [63] (TW) 42M 1.5B 36.7 39.7 Friendster [70] (FR) 66M 1.8B 53.5 58.2 R-MAT-Scale27 [17] (R27) 134M 4.3B 130.2 140.2 5.2.2 DecoupledIntra-CubeDataMovements We put the intra-cube architecture of GraphQ in the same context with the conventional multicore and Tesseract. Figure 5.6 (a) shows a conventional multicore architecture, where a last-level shared cache is placed below private caches. While shared cache facilitates the inter-core communication, it causes a series of issues, including data races which require atomic operations/locks, and coherence protocol. Given the poor locality of graph processing applications, conventional cache hierarchy is not eective. In comparison, Tesseract eliminates the shared cache and only use a small private cache for each core and a simple prefetcher, as shown in Figure 5.6 (b). Without the shared cache, Tesseract uses message passing for intra-cube communication, using the same mechanism as inter-cube communication. 
Specif- ically, each core has a message queue, a global router of each cube inspects the local message queues and sends messages to any core (local or remote) in the system. Without a shared cache, write accesses directly update memory. Since atomic operations and locks are much slower in memory, Tesseract avoids using them by assigning a disjoint set of vertices for each cube to update. The intra-cube architecture of GraphQ is shown in Figure 5.6 (c), which has two key dierences com- pared with Tesseract. First, inter-cube and intra-cube data movements are handled separately. For inter- cube communication, batch messages are generated in batch message buer in memory and sent by routers in our runtime system (see Section 5.4). Intra-cube messages are handled by message queues and local 86 Figure 5.5: GraphQ Intra-Cube Architecture routers, because the message source and destination are within the same cube. Second, sequential and random accesses are processed separately. We divide the cores in the same cube into two groups: ProcessUnits(PUs), which executeprocessEdge function and generate update messages, involving sequential vertex and edge reads in vertex/edge array; andApplyUnits(AUs), which directly receive messages from PUs and performreduce andapply function, involving random accesses to vertices in compute array. This organization eliminates locality interference. Moreover, we replace the private cache attached to each AU with scratchpad memory (SPM), serving as a buer of vertex values with contiguous ID. Each AU randomly accesses on-chip SPM, which provides shorter latency and higher bandwidth than L1 cache. At the end of a round, the batched message is ready in SPM and can be written sequentially to batched message buer in memory. While the functionalities of PUs and AUs are dierent, they are actually a subset of a more general core in Tesseract. For larger graphs when all destination vertices can not t into SPMs, we will have sub-partitions (more details in Section 5.4.2). The ratio between PUs and AUs are determined empirically and we show the performance of dierent ratios in Section 5.5.5. 87 Figure 5.6: Intra-Cube Architecture Comparison 5.2.3 ToleratingInter-NodeLatency The key challenge of multi-node PIM system is the long inter-node communication latency. Due to the large gap shown in Figure 2.3, the conventional communication and computation overlapping cannot fully hide such long latency. Figure 5.7 (a) shows the scenario when the idea of batched and overlapped com- munication in Section 5.2.1 is applied to inter-node communication. Unfortunately, the execution of next iteration nishes long before receiving the batched message from a remote node, during this time, each node is idle. According to our experiments, inter-node communication takes 82% to 91% of total execution time, which implies signicant time wasted waiting for remote node messages. Note that this issue may get worse in Tesseract because a cube could send and receive both inter-cube and inter-node message at any time, the large latency dierence will result in more imbalanced execution. To solve this problem, we propose a simple but eective bandwidth-aware hybrid execution model, that performs potentially useful computation during idle time. The idea is shown in Figure 5.7 (a). When each node nishes the execution of an iteration and has to wait for the remote batched message, they can run more iterations based on local subgraph. 
In this way, the PIM of each node can make use of the idle wait cycles to perform local computation. Specically, we call the normal iteration as global iteration, 88 1 2 3 0 8 7 6 4 5 1 3 10 2 1 2 1 2 3 1 2 3 5 1 2 3 4 5 6 7 8 ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ distance[i] init ∞ ∞ 3 10 ∞ ∞ ∞ ∞ ∞ ∞ 3 5 ∞ ∞ ∞ ∞ ∞ ∞ 3 5 10 ∞ ∞ ∞ ∞ 1 3 5 10 ∞ ∞ ∞ 3 1 3 5 10 ∞ ∞ 3 3 1 2 5 10 ∞ 5 3 3 1 2 4 10 ∞ 4 3 3 1 2 4 9 ∞ 4 3 3 1 2 4 9 5 4 3 local local local global1 local global2 local local global3 1 2 3 4 5 6 7 8 ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ ∞ 1 3 10 ∞ ∞ ∞ ∞ 3 1 2 5 15 ∞ 5 3 3 1 2 4 10 6 4 3 3 1 2 4 9 5 4 3 distance[i] init global1 global2 global3 global4 Hybrid execution Unified iterations (a) Hybrid Execution (b) Running Example Figure 5.7: Hybrid Execution Model which is composed of multiple local iterations. In a global iteration, the rst local iteration is performed after the most recent remote updates are received, the other local iterations are performed based on local subgraph only. In other words, each node runs several local iterations “asynchronously” within the cube before a global synchronization among node at the end of the global iteration. The mechanism proposed in Section 5.2.1 is still applicable to the local iteration, which is composed of several rounds. During inter-node communication, the loads for local iterations in dierent nodes can be dierent, but since it is opportunistic and all overlapped with much longer inter-node communication, such imbalance is not an issue. This model matches the hierarchical bandwidth in multi-node PIM by overlapping the longer inter-node communication with more local computation. Essentially, this design presents a mid-point between synchronous and asynchronous execution model—inside a global iteration, the nodes execute asynchronously. 89 In general, the hybrid execution model is applicable for asynchronous iterative algorithms as dened in ASPIRE [124] that can tolerate stale values. The computations that are left behind can be viewed to generate stale values. As long as the system can ensureboundedstaleness, the results are correct, and there is no need to recovery re-computation. ASPIRE ensures such property by maintaining the staleness infor- mation with each vertex. In our hybrid model, since the global synchronization will eventually happen, the staleness is also bounded. All the four algorithms we evaluated (BFS, WCC, PageRank and SSSP) belong to this algorithm category. We experimentally conrmed that they all produce the correct results. Note that the property is proved in [124]. The intuition of the correctness is that, the remote intermediate results (e.g., shortest path in other subgraphs) will eventually be propagated to the local node and “correct” the potentially staled local results. Similar idea was applied in out-of-core system [2], where the subgraph loaded into memory (the other parts are in disk) are processed with multiple iterations. The purpose is to tolerate longer disk access latency. We uniquely apply the idea in multi-node PIM architecture. Figure 5.7 (b) shows a running example of Single Source Shortest Path (SSSP) with Bellman-Ford al- gorithm [12]. We compare hybrid execution model with unied iterations. The graph is partitioned into two nodes, indicated in dierent colors. On the bottom left, we show the change of distance vector using hybrid execution. We can see that the whole execution incurs three global iterations, they are marked by dierent colors. 
The edges in the graph that result in inter-node communication are also marked with the corresponding colors. We can see that during the rst global iteration, three local iterations are executed. During the second one, only one local iterations is executed—this case is the same as unied iteration. Dur- ing the third global iteration, two local iterations are executed. On the bottom right, we show the change of distance vector with unied iterations. We can see that in total four global iterations are needed—one iteration more than the hybrid execution. Note that it is a small example showing the insights, the iteration reduction is only one. In real graphs, as we show in the evaluation, the benet is signicant. 90 The hybrid execution model is general and widely applicable, because many algorithms executed in the graph parallel execution model are asynchronous and iterative. The parallel version of a certain algorithm is typically dierent from the optimal sequential version. For example, Dijkstra algorithm for SSSP [28] is sequential and dicult to parallelize. For this reason, the graph processing framework normally uses the non-optimal (i.e., may lead to redundant work) but more relaxed iterative algorithms (e.g., Bellman- Ford algorithm [12] for SSSP) which are more amendable for parallel execution. The hybrid execution is applicable to all such relaxed iterative algorithms used in parallel graph processing. It is possible to support sequential optimal algorithms (e.g., Dijkstra) in a type of architecture that can execute logically sequential tasks in parallel with speculative execution, e.g., Ordered Parallelism [52, 82]. Since PIM does not provide the architectural supports required for speculative tasks and most parallel graph processing frameworks have not yet considered that option, we leave it as future work. 5.2.4 NoveltyandDiscussion GraphQ is inspired by the ideas in distributed graph processing and irregular applications. It does not weaken our contributions because: 1) we are the rst to investigate the benets in the context of PIM with detailed architecture model; and 2) some subtle but important dierences exist. The communication and computation overlapping are well-known optimizations in high performance computing [25, 105]. Our architectural model captures key aspects such as the reduction of communication energy and each core’s interrupt overhead. They are not considered in distributed graph processing (e.g., Gemini [140]). The idea of local reduction near sources is explored in PowerGraph [39]. However, it is only one technique (among the three) that enables the predictable and batched communication in GraphQ. The more subtle distinction is batching. Grappa [85] and ASPIRE [124] are two closely related recent work. They are both latency-tolerant distributed shared memory (DSM) system and use batching to reduce 91 communication overhead. The key dierence between GraphQ ismessageaggregation. In Grappa and AS- PIRE, the message aggregation is dynamic and unstructured, which means that if the sender accumulates several messages for the same destination, a batched message can be formed and sent to the receiver. It is dynamic because the aggregation is determined by the runtime processing order of vertices and the graph. Due to the unstructured batching, the communication is still irregular. In comparison, the message aggre- gation in GraphQ isstaticandstructured, which means that the messages to the same destination are forced to be generated and sent. 
As a result, the communication becomes regular. In fact, the batching mechanism in Tesseract is exactly dynamic and unstructured: during one iteration, a cube can send multiple batched messages to the same remote cube. In GraphQ, a cube can only send one batched message to the same remote cube. Tesseract is more similar to Grappa and ASPIRE. Moreover, due to regular communication, GraphQ enables link power down. Both Grappa and ASPIRE require an outstanding thread or core to manage communication. In partic- ular, ASPIRE needs to maintain the per-vertex information on staleness. While it is possible to have such support in a DSM with a x86 core working on a full-edged NIC, it is infeasible to implement in PIM. Next, we discuss two closely related schemes. GraphH [24] also uses the idea of rounded execution and batching. It is implemented by reconguring interconnection network and relies on the host processor to control the switch status of all connections. Instead, GraphQ only requires lightweight hardware primi- tives. We believe that our approach is more exible and incurs much less hardware modications. Never- theless, besides rounded execution that enhance the eciency of inter-cube communication. GraphP [136] uses replicas to reduce inter-cube communication, converting one message per inter-cube edge to one mes- sage per replica synchronization. However, such replica synchronization still incurs irregular communi- cation, making it suer from the similar drawbacks as Tesseract. The irregular communication in GraphP also limits its capability to overlap communication with computation, which explains the reported minor improvement (Figure 10 and 11 in [136]). 92 In summary, the key takeaway in GraphQ is that, the static and structured communication pattern is critical to achieve good throughput and performance on PIM-based architecture. 5.3 GraphQImplementation GraphQ is implemented by runtime and architecture co-design: the architecture provides the communi- cation and synchronization primitives; and runtime system orchestrates the execution of the user-dened functions with batched inter-cube communication and decoupled intra-cube data movements using archi- tectural primitives. The proposed techniques are transparently implemented, thus require no code modi- cation. Similar to other graph processing frameworks, GraphQ is fully compatible with the widely-used vertex programming model. Programmers only need to dene three functions: processEdge, reduce, apply. 5.3.1 Inter-Cube: BatchedCommunication CommunicationPrimitives GraphQ architecture provides three inter-cube primitives: initBatch,sendBatch,recvBatch. To initiate communication, each cube registers its send and receive buer through initBatch. The local router allocates two shadow send and receive buer in memory, and initializes status ags for each buer. These ags keep track of buer availability. The sendBatch is, in essence, buered non-blocking asynchronous send. This primitive ooads the send operation to router after the send buer in memory is copied to its corresponding shadow buer, so that the computation can proceed and send buer can be reused. Before starting writing to remote buer, the router checks the ag of remote shadow receive buer to make sure that it can be overwritten. TherecvBatch is blocking synchronous and does not require any parameters. In our ordered batched communication, the source cube id of messages can be inferred from round ID. The status of receive buer 93 indicates whether new messages have arrived. 
If so, the messages are copied from shadow receive buer to receive buer in memory and ags are reset. Thereduce function can be executed only after receiving updates from remote cubes, except the last round. Inter-cubeLink In GraphQ, with respect to one intercube routing algorithm, we know at static time that some links are idle in the entire round. Moreover, at runtime, when the communication time is overlapped by computation, we can set the links to idle or sleep mode when the communication has completed. In HMC, this is can be achieved by setting the “Power-Down” signal according to HMC specication 2.1 [23]. We also take into consideration the link state transition time (150 µs). The benet of this optimization is more prominent when the graph size is larger, because the transition time will become negligible. 5.3.2 Intra-Cube: SpecializedMessagePassing Compute Units GraphQ uses single-issue in-order cores in the logic dies to meet the thermal and area requirements. We leverage heterogeneous cores to fully utilize the high memory bandwidth. ProcessUnit(PU) is responsible forsequentially reading vertices and edges from memory and performing operations (processEdge). A simple stride prefetcher is employed to match the high memory bandwidth and hide latency. The output of PUs are update messages to be sent to AUs through the on-chip network. ApplyUnit(AU) receives update messages from PUs, performs reduce, and writes (apply) destination vertices with random accesses. In essence, AUs prepare the batched messages to be sent to a remote cube at the end of a round. Instead of using a prefetcher, we replace the private cache of each core with a programmer controlled scratchpad memory (SPM) as data cache. On one hand, SPM is faster than L1 cache. On the other hand, SPM allows the software to explicitly allocate space in SPM, and vertex data will not be evicted due to cache replacements. The PUs and AUs in a cube form a data movementpipeline: PUs continuously perform data processing and send update messages to AUs through on-chip interconnect; AUs randomly ll SPM with the reduced 94 updates. In the end of a round, the prepared batched message is written sequentially from SPM to send buer in memory. On-chipInterconnect In order to support the pipelined data movements from PU to AU, GraphQ provides primitives: Send,Recv,Sync; and architecture supports: local router, message queues, interconnection be- tween cores for intra-cube communication. The separation of inter-cube and intra-cube communication leads to simpler interconnect and router design and lower pressure to hardware resources. In our imple- mentation, each PU has a send queue and each AU has a receive queue. The local router is responsible for moving data from send queues of PUs to receive queues of AUs. Send is executed asynchronously. For a PU, sending a message simply means moving data from register to its send queue and continuing the execution without waiting for any return value. The message transfer will be handled by the local router. Recv fetches one message from queue. When a new message arrives, AU usesRecv to move it from receive queue to register. Sync is a signal emitted by PUs to notify all AUs in the same cube, indicating that it has reached the end of a round. Although Send is asynchronous, it can block a PU when its send queue is full. This is a common problem for both Tesseract and GraphQ. 
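As a software analogy for the Send/Recv/Sync semantics just described, the sketch below connects one PU thread to one AU thread through a bounded queue whose send blocks when the queue is full. The names and the use of OS threads are assumptions made for the illustration; in GraphQ these primitives are instruction-set extensions served by the local router, and the queues are hardware structures.

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A bounded message queue: send blocks when the queue is full, mirroring how
// a PU stalls when its hardware send queue has no free entry.
struct MsgQueue {
  std::queue<uint64_t> q;
  std::mutex m;
  std::condition_variable not_full, not_empty;
  const size_t capacity = 16;   // queue size used in GraphQ's configuration

  void send(uint64_t msg) {
    std::unique_lock<std::mutex> lk(m);
    not_full.wait(lk, [&] { return q.size() < capacity; });
    q.push(msg);
    not_empty.notify_one();
  }
  uint64_t recv() {
    std::unique_lock<std::mutex> lk(m);
    not_empty.wait(lk, [&] { return !q.empty(); });
    uint64_t msg = q.front(); q.pop();
    not_full.notify_one();
    return msg;
  }
};

constexpr uint64_t SYNC = ~0ull;        // sentinel playing the role of Sync

int main() {
  MsgQueue queue;
  std::vector<uint64_t> spm(1024, 0);   // stand-in for the AU scratchpad

  std::thread pu([&] {                  // PU: stream edges, emit updates
    for (uint64_t dst = 0; dst < 1024; ++dst) queue.send(dst);
    queue.send(SYNC);                   // end of this round
  });
  std::thread au([&] {                  // AU: reduce updates into the SPM
    for (;;) {
      uint64_t msg = queue.recv();
      if (msg == SYNC) break;
      spm[msg] += 1;                    // stand-in for the reduce function
    }
  });
  pu.join(); au.join();
  return 0;
}

A small capacity (16 entries here, matching the configuration above) keeps the hardware cost low, and as long as the AU drains messages roughly as fast as the PU produces them, the pipeline rarely stalls.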
However, because GraphQ’s on-chip network only handles intra- cube communication, in practice we can choose proper queues capacity (less than that in Tesseract) to buer the incoming send messages. To avoid the overhead of context switch, we implement the primitives by extending the instruction set of PU and AU. In addition, reading from or writing to message queue takes one cycle, without stalling processor pipeline. It is reasonable for the small intra-cube messages of graph applications, in which the essential operation is updating vertex values. Typically, the vertex value has basic data type less than 64 bits, such asint orfloat. It is true for all applications evaluated. With 64-bits vertex ID, the message size is only 128 bit. 95 5.3.3 ParameterConsideration In the implementation, we need to decide certain key parameters. The rst is the number of PUs and AUs. The total number of cores is limited by the size of logic die per cube. The number of PUs should fully exploit memory bandwidth. For example, in HMC, the available internal bandwidth is 320 GB/s. Suppose the PUs run at 1 GHz and a prefetcher of 64B works perfectly with sequential pattern, 6 PUs are enough. In contrast, AUs should be able to consume the stream of messages from PUs and clear the receive queue in time. We tune these parameters so that the execution is rarely blocked by hardware resources limitation. In GraphQ, the number of PUs and AUs are both 8, and the queue size is 16. The second issue is the size of scratchpad memory. In Tesseract, the compute unit is ARM Cortex A5, with congurable L1 data cache from 4KB up to 64KB. We expect our program-controlled SPM has capacity of the same order and thus use 64KB SPM per AU. In total, we have 8*64 KB SPMs per cube, which can hold 128K vertex values of 4 Bytes. Since in one round, only 1/16 of total vertices can be destination, graphs with less than 2M (16*128K) vertices can be completely held in SPMs. If the input graph size increases, we can further divide the vertices into more blocks as we will see later in Section 5.4.2. 5.4 GraphQRuntimeSystem GraphQ allows programmers to specify graph applications using vertex programming interface; and the mechanisms for regular communication and data movements are supported transparently by runtime sys- tem. Due to space limit, we do not include inter-node supports, but it is similar to inter-cube. The number of local iterations in a global iteration can be specied as a parameter. For intra-cube execution, we will rst explain the case when the memory usage of destination vertices does not exceed the SPM capacity, and then discuss the solution to scale to larger graphs at the end of Section 5.4.2. 96 1 sendBuf = local Array[DataType] 2 recvBuf = local Array[DataType] 3 tempBuf = local Array[DataType] //partially reduced values 4 InitBatch(sendBuf, recvBuf) 5 for (roundId = 0; roundId < cubeNum; roundId++) { 6 toId = (myId + roundId + 1) % cubeNum 7 for (v <- GraphBlock.vertices) { 8 for (u <- outNbrs(v)) { //overlapped with comm 9 res = processEdge(u.value, v.value, ...) 
10 sendBuf(u) = reduce(sendBuf(u), res) 11 } 12 } 13 if (roundId != (cubeNum-1)) { //end of each round 14 SendBatch(toId) //except last round 15 } else { //end of last round 16 for (v <- Partition.vertices) 17 tempBuf(v) = reduce(tempBuf(v), sendBuf(v)) //local reduce 18 } 19 if (roundId != 0) { 20 RecvBatch() //each round except first 21 for (v <- Partition.vertices) 22 tempBuf(v) = reduce(tempBuf(v), recvBuf(v)) //per-round reduce 23 } 24 } 25 for (v <- Partition.vertices) //final apply 26 v.value, v.active = apply(tempBuf(v), v.value) Figure 5.8: Ordered Batch Inter-cube Communication Code 5.4.1 Inter-CubeCommunication Figure 5.8 shows the pseudocode for one iteration concurrently executed in each cube based on vertex programming API and batched communication primitives. The current cube ID is stored inmyId, ranging from 0 to (cubeNum-1), cubeNum is the total number of cubes. The source and destination of cube ID of inter-cube communication in a round isfromId andtoId, respectively. To enable batched communication, three local buers are used for each cube. They are sendBuffer, which buers the updates to be sent to remote cubes in each round;recvBuffer, which buers the updates received from other cubes by the end of each round; andtempBuffer, which stores the partially reduced values. In each round, graph blocks corresponding to destinations in cube toId are streamed in (line 7 and 8), update messages are generated (line 9) and reduced in sendBuffer (line 10). At the end of the round, i.e., computation has nished, each cube performs inter-cube communication (line 13 and 14). During the period of rounds, updates from remote cube in the previous round are received with recvBuf with 97 1 for (v <- Partition.vertices) { 2 for (e <- GraphBlock.outEdges(v)) { 3 res = processEdge(e, v.value, ...) 4 Send(res) 5 } 6 } 7 Sync() Figure 5.9: Process Unit Code recvBatch primitive (line 20) and reduced in tempBuffer (line 22). Updates in sendBuffer are trans- ferred in batch usingsendBatch primitive (line 14). In the rst round, we will not call recvBatch (if-statement at line 19), because no messages are ex- pected to arrive in round 0: the rst batched message will be sent in the end of round 0, which will be received by the end of round 1, so the overlapped computation and communication starts from round 1. In the last round,sendBatch is omitted (if-statement at line 13), because in round (cubeNum-1) the generated updates should be reduced locally. Note that, while sendBuf (line 10) is nonblocking, the whole batched send communication might be blocked. It is because the sender needs to wait until the remote shadow receive buer becomes available. As discussed in Section 5.3.1, this is ensured by the semantic of sendBatch primitive: the router will automatically check whether the shadow buer on remote cube is available, if not, the data is kept in shadow send buer and wait. Finally, after nishing the rounded execution (in line 17) updates from local cube (i.e., insendBuf) and remote cubes in this iteration have already been reduced (i.e., in tempBuf). The nally reduced updates are applied to vertex states (line 26), which will be used as new vertex states in the next iteration. 5.4.2 Intra-CubeMessagePassing In Figure 5.8, computation in each round (line 10 to 16 and line 18 to 23) is modied to leverage the pipelined heterogeneous compute units. The runtime operations for PUs and AUs are shown in Figure 5.9 and 5.10, respectively. 
98 1 buffer = Arary[DataType] 2 countSync = 0 3 while (countSync < numPU) { 4 msg = Recv() 5 msg match { 6 case Sync() => countSync++ 7 case Send(uData, uAddr) => 8 buffer(uAddr) = reduce(buffer(uAddr), uData) 9 } 10 } 11 buffer.flush() Figure 5.10: Apply Unit Code The code in PU resembles the original program in Figure 5.8, except that the reduce function is re- placed with send. The graph blocks in each PU is further divided and organized as a contiguous region in memory. Hence, all memory accesses in PU are sequential, which can benet from the high intra-cube memory bandwidth. When all edges in its block have been processed, PU broadcastssync message to all AUs in the same cube. As shown in Figure 5.10, an AU allocates a buer in SPM, as a copy of sendBuffer in Figure 5.8. AU uses a countercountSync to keep track of the number of PUs synced. The main body of AU implementa- tion is a while-loop, which keeps peeking message queues with the Recv primitive. Once a new message arrives, the data is reduced in buer residing in SPM. If the message is a Sync, countSync is increased. The loop exits when sync messages from all PUs have been received. Finally, each AU will ush the buer in SPM to memory with sequential writes. When the graph is larger, the destination vertices can not t into the SPMs. GraphQ will divide the destination vertices to smaller sub-partitions and run the intra-cube execution for each sub-partition. Fig- ure 5.11 shows the intra-cube execution of one round for one cube in matrix view of a graph. In the gure, we have 4 PUs, 4 AUs and 2 sub-partitions. Each sub-partition contains edges with half destination ver- tices subset. In the rst run, only 3 edges will be processed by PUs, and they all have the same destination vertex assigned to AU3. After all the edges are applied, we synchronize and start the next run. In this way, we can run arbitrary large graphs while all the apply operations fall in SPMs. 99 Figure 5.11: GraphQ Intra-Cube Execution 5.5 Evaluation 5.5.1 EvaluationMethodology We evaluate GraphQ based on zSim [104], a scalable x86-64 multicore simulator. We modied zSim ac- cording to HMC’s memory and interconnection model, heterogeneous compute units, on-chip network and other hardware features. While zSim does not natively support HMC interconnection simulation, we insert a NOC layer between LLC and memory to simulate dierent intra-cube and inter-cube mem- ory bandwidth. The results are validated against NDP [33]. For compute units, we use 256 single-issue in-order cores in Tesseract and GraphP. Each core has 32KB L1 instruction cache and 64K L1 data cache. Cache line size is 64B and simulation frequency is 1000 MHz. In GraphQ, we also use the same number of cores-256 in total, in which 128 are PUs and 128 are AUs. PU has a 64B prefetcher with 4KB buer and AU has a 64KB scratchpad memory. Each core has a 16-entry message queue and 32KB L1 instruction cache, with no L2 or shared cache. For memory conguration, we use 16 cubes (8 GB capacity, 512 banks). The cubes are connected with the Dragony topology [58]. The maximal internal data bandwidth of each cube is 320GB/s. We run four widely-used application benchmarks: Breadth-First Search (BFS), Weakly Connected Component (WCC), PageRank(PR) [88], Single Source Shortest (SSSP). ∗ Figure 5.2 shows the ∗ For BFS, we do not use direction-switch optimization. For SSSP, We run Bellman-Ford algorithm for xed number of itera- tions. 
Table 5.2: Graph Datasets
Graphs | #Vertices | #Edges
ego-Twitter (TT) [79] | 81K | 2.4M
Soc-Slashdot0902 (SD) [72] | 82K | 0.95M
Amazon0302 (AZ) [68] | 262K | 1.2M
Wiki (WK) [15] | 4.2M | 101M
LiveJournal (LJ) [72] | 4.8M | 69M

Table 5.2 lists the graph datasets that we use in our experiments. The datasets are similar to those in GraphP, except that we replace WikiVote [69] with the larger Wiki graph. (We do not use road graphs because those datasets are small enough to fit in one memory cube.)

The energy consumption of the inter-cube interconnect is estimated as two components: a) the dynamic consumption, which is proportional to the number of flit transfer events that happen among each pair of cubes; and b) the static consumption, which corresponds to the energy cost when the interconnect is powered on but idle (i.e., no transfer event happens). We use zSim to count the number of transfer events and use ORION 3.0 [55] to model the dynamic and static power of each router, from which we calculate the dynamic energy of the flit transfers and the leakage energy of the whole interconnect. We also validated Table 1 in [121] with McPAT [73].

5.5.2 Comparing with Tesseract

From Figure 5.12, GraphQ achieves 1.1x - 14x speedup across all four benchmarks with different graph inputs. Specifically, for WCC/PageRank with batching, the maximum speedups reach 6x/4x. When the intra-cube optimization is enabled, the speedups further increase to 16x/6x. BFS and SSSP achieve smaller maximum speedups (2x-3x). The reason is that WCC and PageRank are "all-active" benchmarks, i.e., all vertices in the graph are active in each iteration, while BFS and SSSP only enable part of the vertices and edges. Thus WCC and PageRank benefit more from batched communication. While there are other algorithms for WCC and PageRank that do not require all-active execution, we use the all-active versions (as in many other publications [141, 140, 136, 47]) to demonstrate the performance characteristics in various application settings.

Figure 5.12: Performance

Figure 5.13 shows the time breakdown. In Tesseract, computation and communication cannot fully overlap; the interrupt handling adds extra overhead of about 10%, and synchronization takes 10% to 50% of the time. In GraphQ, two parts are almost invisible: communication is overlapped by computation, and intra-cube synchronization (among compute units in the same cube) is low. The inter-cube synchronization percentage is high; in some cases it reaches 60% and is higher than in Tesseract. However, considering that GraphQ's total execution time is lower, the absolute time spent in synchronization is still less in GraphQ. The conclusion is that overall GraphQ's regular batched communication is better than Tesseract's irregular peer-to-peer communication. We admit that the current design has much room for improvement to reduce synchronization and achieve better load balance.

Figure 5.14 illustrates the total bytes transferred by inter-cube routers, with results normalized to Tesseract. Compared with Tesseract, GraphQ reduces the communication amount by at least 70% in all experiments.
Figure 5.13: Execution time breakdown

First, Tesseract's inter-cube global routers are responsible for handling both inter-cube and intra-cube communication, while intra-cube messages in GraphQ are sent through the on-die network and are not counted (Section 5.2.2). Second, the batch optimization in inter-cube communication combines some messages at the sender side (Section 5.2.1). Note that when running BFS and SSSP on workload AZ in GraphQ, the inter-cube transfer amount is negligible. This is because AZ has good locality and most updates are applied in the same cube without generating inter-cube messages.

5.5.3 Comparing with GraphP

We also implement GraphP and quantitatively compare it with GraphQ. Among the four common graph datasets, the overall performance of GraphQ is consistently better: GraphQ's speedup is 3.2x on average and 13.9x at maximum, while GraphP's is only 1.6x and 3.9x. In GraphP, overlapping is not effective: for several datasets, applying the technique leads to slower results due to its dynamic execution. GraphQ enables efficient overlapping with structured batching. Moreover, GraphQ benefits from the intra-cube optimization, which is not considered in GraphP. The pipelined architecture leverages the high local memory bandwidth and accelerates the computation further by 56%.

Figure 5.14: Communication
Figure 5.15: Energy Consumption

5.5.4 Energy Consumption

Figure 5.15 shows the interconnect energy consumption of GraphQ compared with Tesseract. As we can see, batching contributes from 10% to 85% reduction, and batching with the PU/AU optimization contributes up to 90% reduction in energy cost in most of our experiment runs. The energy cost of the interconnect consists of both the static consumption and the dynamic consumption, which are determined by the execution time (performance) and the communication amount, respectively. Figure 5.15 also shows the energy cost with the cubes' low-power option enabled. We see that GraphQ's capability of setting the low-power option further saves around 50% - 80% of energy across all benchmarks and thus drastically reduces energy cost by 81% on average and 98% at maximum.

Figure 5.16: Performance w.r.t. Different PU/AU Ratios

5.5.5 Effect of PU/AU Ratio

Figure 5.16 shows the performance under different PU/AU configurations when running PageRank on the enwiki graph. If there are only 2 to 4 PUs, the memory bandwidth is far from saturated, and the execution spends 87% more cycles compared with the optimal case. If we replace most AUs with PUs, the system is bottlenecked by the AU operations, i.e., writing to the SPM. In this setting, the total cycles increase by 93%.
With a poor configuration, the streamlined processing can even yield worse performance.

5.5.6 Multi-Node Performance

We evaluate a 4-node system where each machine has the same HMC memory setting as in the single-node performance evaluation. The inter-node, inter-cube and intra-cube bandwidths are 6GB/s, 120GB/s and 360GB/s, respectively. Each global iteration contains four local iterations. The results are shown in Figure 5.17, where speedup is normalized to single-node GraphQ PIM performance. For many test cases, the multi-node Tesseract speedup is less than 1, because inter-node communication becomes the new bottleneck. Moreover, the problem of irregular communication in Tesseract becomes worse in a low-bandwidth multi-node setting and significantly limits its scalability. Take PageRank as an example: single-node Tesseract is 3x to 14x slower than GraphQ, while multi-node Tesseract can be 61x slower than GraphQ. The speedup of GraphQ is also lower in the multi-node setting, but due to regular batched communication and the fast single-node design, multi-node GraphQ is consistently better than Tesseract. GraphQ's hybrid execution alleviates this problem and keeps PIM-based graph processing efficient across different machines. In a 4-node PIM, the speedup is 2.98x compared with single-node GraphQ, which translates into a 98.34x speedup compared with a single node with a conventional memory hierarchy of the same memory size. This is because the 4-node system has more computing resources (more cores embedded in the cubes) and more memory bandwidth. Specifically, the hybrid execution model leads to an average 39.3% (at most 2.57x) speedup over the executions without the optimization.

Figure 5.17: Multi-Node Performance

5.5.7 Larger Graphs

We cannot run larger graphs due to simulation constraints: while both [82] and GraphQ use zSim, the number of cores simulated in our study is much larger, so we cannot run the same large graphs. However, for larger graphs, load imbalance among different cubes in the same round might lead to more inter-node synchronization and a significant increase in total time.

To evaluate this issue for larger graphs, we use a methodology similar to that in Section 5.2.1 to estimate load imbalance. If there is no inter-cube synchronization (no rounded execution in GraphQ), the total time will be determined by the slowest cube, i.e., the maximum over cubes of the sum of per-round times. If there is the additional synchronization in GraphQ, the time spent in each round will be determined by the slowest cube in that round, i.e., the maximum per-cube time in each round, and the total time is the sum over all rounds. In graph processing, the execution time can be approximated by the number of traversed edges. To get the number of edges, we run three very large graphs (listed in Table 5.1) in a real distributed system [140] with 16 nodes. The results show that the load imbalance introduced by GraphQ synchronization is 33%, 27%, and 0.55% of total computation, respectively. The average synchronization overhead is even smaller than for the smaller graphs we use in Table 5.2. Based on the above discussion, we believe that for larger graphs, the benefit of regular communication outweighs the additional synchronization. Therefore, GraphQ is still better than Tesseract.
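The two completion-time models used in this estimate can be written down directly. The sketch below is a stand-alone illustration with made-up numbers, not measured data: without per-round synchronization the iteration time is the maximum over cubes of their summed per-round work, whereas with GraphQ's rounded execution it is the sum over rounds of the slowest cube in each round.

#include <algorithm>
#include <cstdio>
#include <vector>

// work[c][r]: work (e.g., traversed edges) of cube c in round r.
using Matrix = std::vector<std::vector<double>>;

// No per-round synchronization: each cube runs its rounds back to back,
// so the iteration finishes when the busiest cube finishes.
double no_sync_time(const Matrix& work) {
  double worst = 0.0;
  for (const auto& cube : work) {
    double total = 0.0;
    for (double w : cube) total += w;
    worst = std::max(worst, total);
  }
  return worst;
}

// Rounded execution: every round waits for the slowest cube in that round.
double rounded_time(const Matrix& work) {
  size_t rounds = work[0].size();
  double total = 0.0;
  for (size_t r = 0; r < rounds; ++r) {
    double slowest = 0.0;
    for (const auto& cube : work) slowest = std::max(slowest, cube[r]);
    total += slowest;
  }
  return total;
}

int main() {
  // Toy example with 3 cubes and 3 rounds (made-up numbers).
  Matrix work = {{4, 1, 2}, {1, 5, 1}, {2, 2, 3}};
  double base = no_sync_time(work), sync = rounded_time(work);
  std::printf("overhead of rounded execution: %.1f%%\n",
              100.0 * (sync - base) / base);
  return 0;
}

Applying the same two formulas to per-round traversed-edge counts yields the imbalance percentages reported above.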
5.6 Related Work

Tesseract [1] is the first PIM-based graph processing accelerator and is our baseline. Ozdal et al. [87] proposed an accelerator for asynchronous graph processing, which features efficient hardware scheduling and dependence tracking. Both GraphQ and Tesseract are designed for synchronous processing. GraphPIM [84] demonstrates the performance benefits for graph applications of adding atomic operations to PIM. Graphicionado [47] is a high-performance customized graph accelerator based on a specialized memory subsystem instead of PIM. GraphP [136] proposes a graph partitioning method that reduces inter-cube communication, but it is based on a single node and does not enable regular data movements. [7] characterized the memory system performance of graph processing workloads and proposed a physically decoupled prefetcher that improves the performance of these workloads. Overlapping communication and computation [105, 77, 123], graph partitioning [66, 135], graph load balancing [111], and graph load characterization [112, 113] are studied extensively in the distributed computing setting. However, they pose new problems in PIM with multiple cubes and nodes.

5.7 Summary

This chapter proposes GraphQ, a novel PIM-based graph processing architecture that eliminates irregular data movements. The key idea is to generate static and structured communication with runtime system and architecture co-design. Using a zSim-based simulator with five real-world graphs and four algorithms, the results show that GraphQ achieves on average 3.3x and at maximum 13.9x speedup, with 81% energy saving, compared to Tesseract. Compared to GraphP, GraphQ achieves higher speedups over Tesseract. In addition, the 4-node GraphQ achieves a 98.34x speedup compared to a single node with the same memory size using a conventional memory hierarchy.

Chapter 6

SympleGraph: Distributed Graph Processing

6.1 Introduction and Motivation

As discussed in Section 2.3, this work makes the first attempt to improve the efficiency of the two factors at the same time by reducing redundant computation and communication, leveraging the dependency in UDFs. Loop-carried dependency is a common code pattern used in UDFs: when traversing the neighbors of a vertex in a loop, a UDF decides whether to break or continue based on the state of processing previous neighbors. Specifically, consider two neighbors u1 and u2 of vertex v. If u1 satisfies an algorithm-specific condition, u2 will not be processed due to the dependency.

The pattern appears in several important algorithms. Consider the bottom-up breadth-first search (BFS) [9] with pseudocode in Figure 6.1 (a). In each iteration, the algorithm visits the neighbors of "unvisited" vertices. If any of the neighbors of the current unvisited vertex is in the "frontier", it will no longer traverse other neighbors and marks the vertex as "visited". In distributed frameworks [75, 39, 48, 50, 18, 128, 103, 37, 140, 26], programmers can write a control flow with a break statement in the UDF to indicate the control dependency. Figure 6.1 (b) shows the signal-slot implementation of bottom-up BFS in Gemini [140]. The signal and slot UDFs specify the computation to process each neighbor of a vertex and the vertex update, respectively. We see that the bottom-up BFS UDF has control dependency. The signal function iterates over the neighbors of vertex v, and breaks out of the loop when it finds a neighbor in the frontier (Line 5). This control dependency expresses the semantics of skipping the following edges and avoids unnecessary edge traversals.
However, if u 1 and u 2 are distributed in dierent machines,u 1 andu 2 can be processed in parallel andu 2 does not know the state after processing u 1 . Therefore, the loop-carried dependency specied in UDF is not precisely enforced in the execution, thereby only an “illusion”. The consequence of such imprecise execution behavior isunnecessarycomputationandcommunication. As shown in Figure 6.3, vertex 9 has eight neighbors, two of them (vertex 7 and 8) are allocated in machine 3, the same as the master copy of vertex 9. The others are allocated in machine 1 and 2. More background details on graph partition will be discussed in Section 6.2.2. To perform the signal UDF in remote ma- chines, mirrors of vertex 9 are created. Update communication is incurred when mirrors (machine 1 and 2) transfer partial results of signal to the master of vertex 9 (machine 3). Unnecessary computation is incurred when a mirror performs computations on vertex 9’s neighbors while the condition has already been satised. Unnecessary update communication is incurred when the mirror sends partial results to the master. To address this problem, we propose SympleGraph ∗ , a novel framework for distributed graph pro- cessing that enforces the loop-carried dependency in UDF.SympleGraph analyzes the UDFs of unmodied codes, identies, and instruments UDF to express the loop-carried dependency. The distributed framework enforces the dependency semantics by performing dynamic dependency propagation. Specically, a new type of dependency communication propagates dependency among mirrors and back to master. Existing frameworks only support update communication, which aggregates updates from mirrors to master. ∗ The name SympleGraph does not imply symbolic execution. Instead, it refers to the key insight of scheduling the symbol execution order and making all evaluation concrete. 110 Enforcing loop-carried dependency requires that all neighbors of a vertex are processed sequentially. To enable sucient parallelism while satisfying the sequential requirement, we proposecirculantschedul- ing and divide the execution of each iteration into steps, during which dierent machines process disjoint sets of edges and vertices. If one machine determines that the execution should break in a step, the break information is passed to the following machines so that the remaining neighbors are not processed. In practice, the computation and update communication of each step can be largely overlapped (see details in Section 6.6.3); thus the ne-grained steps do not introduce much extra overhead. SympleGraph not only eliminates unnecessary computation but potentially reduces the total amount of communication. On the one side, small dependency messages are organized as a bit map (one bit per vertex) circulating around all mirrors and master, do not exist in current frameworks and thus incur extra communication. On the other side, precisely enforcing loop-carried dependency can eliminate unnecessary computation and communication. Our results show that the total amount of communication is indeed re- duced in most cases (Section 6.8.2, Table 6.6). To further reduce dependency communication, SympleGraph dierentiates dependency communication for high-degree and low-degree vertices, and only performs de- pendency propagation for high-degree vertices. We apply double buering to enable computation and dependency communication overlapping and alleviate load imbalance. 
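To illustrate how compact the dependency state is, the following sketch keeps one "break" bit per vertex in a packed bitmap: a machine skips the neighbor loop of any vertex whose bit is already set, sets the bit when its own loop breaks, and forwards the bitmap to the next machine in the schedule. The helper names are invented for the example and are not SympleGraph's internal API.

#include <cstdint>
#include <vector>

// Packed one-bit-per-vertex dependency state (the "break" bits).
struct DepBitmap {
  std::vector<uint64_t> words;
  explicit DepBitmap(size_t n) : words((n + 63) / 64, 0) {}
  bool test(size_t v) const { return (words[v >> 6] >> (v & 63)) & 1; }
  void set(size_t v) { words[v >> 6] |= uint64_t(1) << (v & 63); }
};

// One machine's share of the bottom-up BFS signal phase over its locally
// stored neighbor lists. 'dep' arrives from the previous machine in the
// schedule and is forwarded (with the newly set bits) to the next one.
void signal_step(const std::vector<std::vector<int>>& local_nbrs,  // per vertex
                 const std::vector<bool>& frontier,
                 DepBitmap& dep,
                 std::vector<int>& parent) {
  for (size_t v = 0; v < local_nbrs.size(); ++v) {
    if (dep.test(v)) continue;          // an earlier machine already broke
    for (int u : local_nbrs[v]) {
      if (frontier[u]) {
        parent[v] = u;                  // partial update destined for the master
        dep.set(v);                     // remaining machines skip v entirely
        break;
      }
    }
  }
  // 'dep.words' is the (|V|/8)-byte message forwarded to the next machine.
}

At one bit per vertex, the forwarded message is far smaller than the per-edge updates it helps eliminate, which is why dependency propagation only needs to be dropped for low-degree vertices.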
To evaluate SympleGraph, we conduct the experiments on three clusters using ve algorithms and four real-world datasets and three synthesized scale-free graphs with R-MAT generator [17]. We compare Sym- pleGraph with two state-of-the-art distributed graph processing systems, Gemini [140] and D-Galois [26]. The results show that SympleGraph signicantly advances the state-of-the-art, outperforming Gemini and D-Galois on average by 1.42 and 3.30, and up to 2.30 and 7.76, respectively. The communication reduction compared to Gemini is 40.95% on average, and up to 67.48%. 111 6.2 BackgroundandProblemFormalization 6.2.1 GraphandGraphAlgorithm Graph. A graph G is dened as (V, E) where V is the set of vertices, and E is the set of edges (u, v) (u and v belong to V). The neighbors of a vertex v are vertices that each has an edge connected to v. The degree of a vertex is the number of neighbors. In the following, we explain ve important iterative graph algorithms whose implementations based on vertex functions will incur loop-carried dependency in UDF. Figure 6.2 shows the pseudocode of one iteration of each algorithm in sequential implementation. 1 def bfs(Array[Vertex] nbr) { 2 for v in V { 3 for u in nbr { 4 if (not visited[v] && 5 frontier[u]) { 6 parent[v] = u; 7 visited[v] = true; 8 frontier[v] = true; 9 break; 10 } 11 } // end for u 12 } // end for v 13 } // end bottom_up_bfs (a) Bottom-up BFS 1 def signal(Vertex v, Array[Vertex] nbr ) { 2 for u in nbrs { 3 if (frontier[u]) { 4 emit(v, u); 5 break; 6 } 7 } // end for u 8 } // end signal 9 def slot(Vertex v, Vertex upt) { 10 if (not visited[v]) { 11 parent[v] = upt; 12 visited[v] = true; 13 frontier[v] = true; 14 } 15 } // end slot (b) Bottom-up BFS in Gemini Figure 6.1: Bottom-up BFS Algorithm Breadth-FirstSearch(BFS). BFS is an iterative graph traversal algorithm that nds the shortest path in an unweighted graph. The conventional BFS algorithm follows the top-down approach: BFS rst visits a root vertex, then in each iteration, the newly “visited” vertices become the “frontier” and BFS visits all the neighbors of the “frontier”. Bottom-up BFS [9] changes the direction of traversal. In each iteration, it visits the neighbors of “un- visited” vertices, if one of them is in the “frontier”, the traversal of other neighbors will be skipped, and the current vertex is added to the frontier and marked as “visited”. 
Compared to the top-down approach, 112 1 def mis(Array[Vertex] nbr) { 2 for v in V { 3 flag = true; 4 for u in nbr { 5 if (active[u] && 6 color[u] < color[v]) { 7 flag = false; 8 break; 9 } 10 if (flag) 11 is_mis[v] = true; 12 } // end for u 13 } // end for v 14 } // end mis (a) MIS 1 def kcore(Array[Vertex] nbr) { 2 for v in V { 3 cnt = 0; 4 for u in nbr { 5 if (active[u]) { 6 cnt += 1; 7 if (cnt >= k) { 8 break; 9 } 10 } 11 } // end for u 12 } // end for v 13 } // end kcore (b) K-core 1 def kmeans(Array[Vertex] nbr) { 2 // generate C random centers 3 for v in V { 4 for u in nbr { 5 if (assigned_to[u]) { 6 cluster[v] = cluster[u]; 7 break; 8 } 9 } // end for u 10 } // end for v 11 } // end kmeans (c) K-means 1 def sample(Vertex v, Array[Vertex] nbr ) { 2 // generate C random number 3 r = rand() 4 // set prefix-sum 5 weight = 0 6 for u in nbr { 7 weight += weight[u] 8 if (weight >= r) { 9 select[u] = true; 10 break; 11 } 12 } // end for u 13 } // end kmeans (d) Graph Sampling Figure 6.2: Examples of algorithms with loop-carried dependency bottom-up BFS avoids the ineciency due to multiple visits of one new vertex in the frontier amd signi- cantly reduces the number of edges traversed. Maximal Independent Set (MIS). An independent set is a set of vertices in a graph, in which any two vertices are non-adjacent. A Maximal Independent Set (MIS) is an independent set that is not a subset of any other independent set. A heuristic MIS algorithm (Figure 6.2 (a)) is based on graph coloring. First, each vertex is assigned distinct values (colors) and marked as active. In each iteration, we nd a new MIS composed of active vertices with the smallest color value among their active neighbors’ colors. The new MIS vertices will be removed from further execution (marked as inactive). K-core. A K-core of a graph G is a maximal subgraph of G in which all vertices have a degree at least k. The standard K-core algorithm [108] (Figure 6.2 (b)) † removes the vertices that have a degree less than † There are other K-core algorithms with linear time complexity [78]. We choose this algorithm to demonstrate the basic code pattern. We also compare with the algorithm in evaluation. 113 K. Since removing vertices will decrease the degree of its neighbors, the operation is performed iteratively until no more removal is needed. When counting the number of neighbors for each vertex, if the count reaches K, we can exit the loop and mark this vertex as “no remove”. K-means. K-means is a popular clustering algorithm in data mining. Graph-based K-means [103] is one of its variants where the distance between two vertices is dened as the length of the shortest path between them (assuming that the length of every edge is one). The algorithm shown in Figure 6.2 (c) consists of four steps: (1) Randomly generate a set of cluster centers; (2) Assign every vertex to the nearest cluster center; (3) Calculate the sum of distance from every vertex to its belonging cluster center; (4) If the clustering is good enough or the number of iterations exceed some pre-specied threshold, terminate the algorithm, else, goto (1) and repeat the algorithm. GraphSampling. Graph sampling is an algorithm that picks a subset of vertices or edges of the original graph. We show an example of neighbor vertex sampling in Figure 6.2 (d), which is the core component of graph machine learning algorithms, such as DeepWalk [92], node2vec [45], and Graph Convolutional Networks [8]. 
In order to sample from the neighbor of the vertex based on weights, we need to generate a uniform random number and nd its position in the prex-sum array of the weights, i.e., the index in the array that the rst prex-sum element is larger than or equal to our random number. ‡ 6.2.2 DistributedGraphProcessingFrameworks There are two design aspects of distributed graph framework: programming abstraction, and graph par- tition/replication. Programming abstraction deals with how to express algorithms with a vertex function. Graph partition determines how vertices and edges are distributed, replicated, and synchronized in dier- ent machines. ‡ There are other sampling algorithms, such as the alias method. It builds alias table step to exhibit a similar pattern that searches prex-sum array. We choose this algorithm since it reects our basic code pattern. 114 3 1 ... 2 6 4 ... 5 8 9 Machine 1 Machine 2 Machine 3 signal slot communication master mirror 7 Figure 6.3: Bottom-up BFS Execution Master-mirror. To describe vertex replications, current frameworks [39, 18, 140, 26] adopt the master- mirror notion: each vertex is owned by one machine, which keeps themaster copy, its replications on other machines are mirrors. The distribution of masters and mirrors is determined by graph partition. There are three types of graph partition techniques based on the denition in [26]. Incoming edge-cut: Incoming edges of one vertex are assigned only to one machine, while its outgoing edges may be partitioned; Outgoing edge-cut: Outgoing edges of each vertex are assigned only to one machine, while its incoming edges are partitioned. It is used in several systems, including Pregel [75], GraphLab [74], Gemini [140]. Vertex-cut: Both the outgoing and incoming edges of a vertex can be assigned to dierent machines. It is used in PowerGraph [39] and GraphX [41]. Recent work [134] also proposed 3D graph partition that divides the vector data of vertices into layers. This dimension is orthogonal to the edge and vertex dimensions considered in other partitioning methods. We build SympleGraph based on Gemini, the state-of-the-art distributed graph processing framework using outgoing edge-cut partition. However, our ideas also apply to vertex-cut and other distributed frameworks. It is not applicable to incoming edge-cut, which will be discussed in Section 6.3. 115 1 // mirror signal 2 for m in machines { 3 for mirror in m.mirrors(v) { 4 signal(v, nbrs(v)); 5 } 6 } 7 // master slot 8 for (v, update) in signals { 9 slot(v, update); 10 } Figure 6.4: Signal-Slot in pull mode In outgoing edge-cut, a mirror vertex is generated if its incoming edges are partitioned among multiple machines. Figure 6.3 shows an example of a graph distributed in three machines. Circles with solid lines are masters, and circles with dashed lines are mirrors. Here, vertex 9 has 8 incoming edges, i.e., sources vertex 1 to 8. Machine 1 contains the master of vertex 1 to 3, and machine 2 contains the master of vertex 4 to 6. The master of vertex 9 resides on machine 3 but its incoming edges are partitioned across all three machines, so mirrors of v are created on machine 1 and 2. Signal-slot. Ligra [109] discusses the two modes of signal-slot: push and pull. Push mode traverses and updates the outgoing neighbors of vertices, while pull mode traverses the incoming neighbors. The ve graph algorithms discussed earlier are more ecient in pull mode in most iterations, and SympleGraph optimization focuses on pull mode. 
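Before turning to the pull-mode pseudocode, it helps to make one of these dependency patterns concrete. The sketch below is a self-contained, single-machine version of the weighted neighbor sampling in Figure 6.2 (d); it is written purely for illustration and assumes a non-empty neighbor list.

#include <random>
#include <vector>

// Sample one neighbor of a vertex with probability proportional to its
// weight. The running prefix sum carried across loop iterations is the
// data dependency; the early return is the control dependency.
int sample_neighbor(const std::vector<int>& nbrs,
                    const std::vector<double>& weight,  // weight of each neighbor
                    std::mt19937& rng) {
  double total = 0.0;
  for (double w : weight) total += w;
  std::uniform_real_distribution<double> dist(0.0, total);
  double r = dist(rng);

  double prefix = 0.0;
  for (size_t i = 0; i < nbrs.size(); ++i) {
    prefix += weight[i];
    if (prefix >= r) return nbrs[i];   // break: later neighbors are skipped
  }
  return nbrs.back();                  // guard against floating-point rounding
}

When the neighbor list is split across machines, a later machine needs both the accumulated prefix sum and whether a selection has already happened; this is precisely the dependency state that SympleGraph propagates.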
Figure 6.4 shows the pseudocode of pull mode. The signal function is first executed on mirrors in parallel. The mirrors then send update messages to the master machine. On receiving an update message, the master machine applies the slot function to aggregate the update, and eventually updates the master vertex after receiving all updates. Figure 6.3 also illustrates how the signal-slot functions are applied for vertex 9. The blue edges (in machine 1 and 2) refer to signals, and the yellow edges (in machine 3) refer to slots. Green edges across machines indicate communication. We can formalize the signal-slot abstraction by borrowing the notions of distributed functions in [132].

Definition 1. We use $u$ to denote a sequence of neighbors of vertex $v$, and use $u_1 u_2$ to denote the concatenation of $u_1$ and $u_2$. A function $H$ is associative-decomposable if there exist two functions $I$ and $C$ satisfying the following conditions:
1. $H$ is the composition of $I$ and $C$: $\forall u,\ H(u) = C(I(u))$;
2. $C$ is commutative: $\forall u_1, u_2,\ C(u_1 u_2) = C(u_2 u_1)$;
3. $C$ is associative: $\forall u_1, u_2, u_3,\ C(C(u_1 u_2)\, u_3) = C(u_1\, C(u_2 u_3))$.

Generally, all graph algorithms can be represented by associative-decomposable vertex functions in Definition 1. Intuitively, $I$ and $C$ correspond to the signal and slot functions. Note that the abstraction specification is also a system implementation specification. If $C$ is commutative and associative, a system can perform $C$ efficiently: the execution can be out-of-order with partial aggregation.

However, this essentially means that existing distributed systems require the graph algorithms to satisfy a stronger condition.

Definition 2. A function $H$ is parallelized associative-decomposable if there exist two functions $I$ and $C$ satisfying the conditions of Definition 1, and $I$ preserves concatenation in $H$: $\forall u_1, u_2,\ H(u_1 u_2) = C(I(u_1 u_2)) = C(I(u_1)\, I(u_2))$.

Gemini and other existing frameworks require the graph algorithms to satisfy Definition 2, which offers parallelism and ensures correctness. On the one hand, Gemini can distribute the execution of neighbors to different machines, and perform $I$ independently and in parallel. On the other hand, the output of $H$ is the same as if executing $I$ sequentially.

6.3 Inefficiencies with Existing Frameworks

Existing frameworks are designed for algorithms without loop-carried dependency. We first define loop-carried dependency and dependent execution. After that, we can rewrite Definition 1 as Definition 4.

Definition 3. We use $I(u_2 \mid u_1)$ to denote $I(u_2)$ given the state that $I(u_1)$ has finished, such that $\forall u_1, u_2,\ I(u_1 u_2) = I(u_1)\, I(u_2 \mid u_1)$. A function $I$ has no loop-carried dependency if $\forall u_1, u_2,\ I(u_2 \mid u_1) = I(u_2)$.

Definition 4. A function $H$ is associative-decomposable if there exist two functions $I$ and $C$ satisfying the conditions of Definition 1. $H$ has the property: $\forall u_1, u_2,\ H(u_1 u_2) = C(I(u_1 u_2)) = C(I(u_1)\, I(u_2 \mid u_1))$.

By Definition 3, algorithms without loop-carried dependency always satisfy both Definition 4 and Definition 2. Otherwise, if a graph algorithm only satisfies Definition 4, but not Definition 2, existing frameworks will not output the correct results. Fortunately, many graph algorithms with loop-carried dependency (including the five algorithms in this chapter) satisfy Definition 2, so correctness is not an issue for existing frameworks. However, the intermediate output of $I$ can be different. By Definition 2, we will execute $I(u_1)$ and $I(u_2)$. By Definition 4, if we enforce dependency, we will execute $I(u_1)$ and $I(u_2 \mid u_1)$. The difference comes down to $I(u_2)$ versus $I(u_2 \mid u_1)$.
If we use cost() to denote the computation cost of a function or the communication amount for the output of a function, a functionI has redundancy without enforcing dependency if8u 1 ;u 2 ; cost(I(u 2 ))cost(I(u 2 ju 1 ))) and9u 1 ;u 2 ; cost(I(u 2 ))>cost(I(u 2 ju 1 )). We can dene functions with break semantics: 9u 1 ;u 2 ; I(u 2 ju 1 ) =I(?) =?: The computation cost forI(?) is 0, and the communication cost for? is 0. It is evident that these functions suer from the redundancy problem. We can use bottom-up BFS and Figure 6.3 as an example to calculate the cost. The computation cost is the number of edges traversed and the communication cost is the update 118 message to the master. For now, we ignore the overhead of enforcing dependency. The circles with colors are incoming neighbors that will trigger the break branch. On machine 1, the signal function breaks traversing after vertex 1, so vertex 2 and vertex 3 are skipped. On machine 2, it iterates all 3 vertices if machine 2 is not aware of the dependency in machine 1. The computation cost is 4 edges traversed (the sum of machine 1 and machine 2), and the communication is 2 messages (1 message from each machine). However, if we enforce the dependency, all vertices in machine 2 should not have been processed. The computation cost is 1 edge traversed (only on machine 1) and the communication is 1 message (only from machine 1). In summary, a graph algorithm with loop-carried dependency can be correct in existing frameworks, if it satises Denition 2. However, it can be inecient with both redundant computation, and communi- cation when loop-carried dependency is not faithfully enforced in a distributed environment. Applicability The problem exists for all graph partitions except the incoming edge-cut, i.e., all of the incoming edges of one vertex are on the same machine, and the execution of UDFs is not distributed to remote machines. To our knowledge, none of distributed systems [75, 39, 48, 50, 18, 128, 103, 37, 140, 26] precisely enforce loop-carried dependency semantics. While the incoming edge-cut is an exception, the partition is inecient and rarely used due to load imbalance issues. According to D-Galois (Gluon), they used the vertex-cut partition by default “since it performs well at scale” [26]. The problem exists for many algorithms with loop-carried dependency. For the other four graph al- gorithms discussed in Section 6.2.1: MIS has control dependency. If one vertex already nds itself not the smallest one, it will not be marked as a new MIS in this iteration and thus break out of the neighbor traver- sal. K-core hasdataandcontroldependency. If the vertex has more than K neighbors, it will not be marked as removed in this iteration, and further computation can be skipped. K-means has control dependency: when one of the neighbors is assigned to the nearest cluster center, the vertex can be assigned with the same center. Graphsampling has data and control dependency. The sample is dependent on the random 119 number and all the preceding neighbors’ weight sum. It exits once one neighbour is selected. Note that we use these algorithms as typical examples to demonstrate the eectiveness of our idea. They all share the basic code pattern, which can be used as the building blocks of other more complicated algorithms. 6.4 SympleGraphOverview SympleGraph is a new distributed graph processing framework that precisely enforces loop-carried de- pendency semantics in UDFs. SympleGraph workow consists of two components. 
The rst one is UDF analysis, which 1) determines whether the UDF contains loop-carried dependency; 2) if so, identies the dependency state that need to be propagated during the execution; and 3) instruments codes of UDF to insert dependency communication codes executed by the framework to enforce the dependency across distributed machines. The second component is system support for loop-carried dependency on the analyzed UDF codes and communication optimization. The key technique is dependency communication, which propagates depen- dency among mirrors and back to master. To enforce dependency correctly, for a given vertex, execution of UDF related to its neighbors assigned to dierent machines must be performed sequentially. The key challenge is how to enforce the sequential semantics while still enabling enough parallelism? We solve this problem by circulant scheduling and other communication optimizations (Section 6.6.1). 6.5 SympleGraphAnalysis 6.5.1 SympleGraphPrimitives SympleGraph provides dependency communication primitives, which are used internally inside the frame- work and transparent to programmers. Dependent message has a data type DepMessage with two types 120 of data members: a bit for control dependency, and data values for data dependency. To enforce loop- carried dependency, the relevant UDFs need to be executed sequentially. Two functionsemit_dep<T> and receive_dep<T> send and receive the dependency state of a vertex, where the type of T is DepMessage. We rst describe how SympleGraph uses these primitives in the instrumented codes. Shortly, we will describe the details of SympleGraph analyzer to generate the instrumented codes. Figure 6.5 shows the analyzed UDFs of bottom-up BFS with dependency information and primitive. When processing a vertex u, the framework rst executes emit_dep to get whether the following com- putation related to this vertex should be skipped (Line 5 7). After the vertex u is added to the current frontier,emit_dep is inserted to notify the next machine which executes the function. Note thatemit_dep andemit_dep do not specify the sender and receiver of the dependency message, it is intentional as such information is pre-determined by the framework to support circulant scheduling. 6.5.2 SympleGraphAnalysis To implement the dependent computation of functionI in Denition 4, we instrumentI to include depen- dency communication and leaveC unchanged. We develop SympleGraph analyzer , a prototype tool based on Gemini’s signal-slot programming abstraction. To simplify the analyzer design, we make the following assumptions on the UDFs. • The UDFs store dependency data in capture variables of lambda expressions. Copy statements of these variables are not allowed so that we can locate the UDFs and variables. • The UDFs traverse neighbor vertices in a loop. Based on the assumptions, we design SympleGraph analyzer as two passes in clang LibTooling at clang- AST level. 121 1. In the rst pass, our analyzer locates the UDFs and analyzes the function body to determine whether loop-carried dependency exists. (a) Use clang-lib to compile the source code and obtain the corresponding Clang-AST. 
(b) Traverse the AST to: (1) locate the UDF; (2) locate all process-edges (sparse-signal, sparse-slot, dense-signal, dense-slot) calls and look for the denitions of all dense-signal functions; (3) search for all for-loops that traverse neighbors in dense-signal functions and check whether loop-dependency patterns exist (there is at least one break statement related to the for-loop); (4) store all AST nodes of interests; 2. In the second pass, if the dependency exits, it identies the dependency state for communication and performs a source-to-source transformation. (a) Insert dependency communication initialization code. (b) Before the loop in UDF, insert a new control ow that checks dependency in preceding loops with receive_dep. (c) Inside the loop in UDF, insert emit_dep before the corresponding break statement to propagate the dependency message. Based on the codes in Figure 6.1 (b), SympleGraph analyzer will generate the source codes in Figure 6.5. 6.5.3 Discussion In this section, we discuss the alternative approaches that can be used to enforce loop-carried dependency. New Graph DSL. Besides the analysis, SympleGraph provides a new DSL and asks the programmer to express loop dependency and state. We support a new functional interface fold_while to replace the for- loop. It species a state machine and takes three parameters: initial dependency data, a function that 122 1 struct DepBFS : DepMessage { // datatype 2 bit skip?; 3 }; 4 def signal(Vertex v, Array[Vertex] nbrs) { 5 DepBFS d = receive_dep(v); // new code 6 if (d.skip?) { 7 return; 8 } 9 for u in nbrs(v) { 10 if (frontier[u]) { 11 emit(v, u); 12 emit_dep(v, d); // new code 13 break; 14 } 15 } 16 } 17 def slot(Vertex v, Vertex upt) { 18 ... // no changes 19 } Figure 6.5: SympleGraph instrumented bottom-up BFS UDFs composes dependency state and current neighbor, a condition that exits the loop. The compiler can easily determine the dependency state and generate the corresponding optimized code. Manual analysis and instrumentation. Some will argue that if graph algorithms UDFs are simple enough, the programmers can manually analyze and optimize the code. SympleGraph also exposes com- munication primitives to the programmers so that they can still leverage the optimizations when the code is not amendable to static analysis. Manual analysis may even provide more performance benets because some optimizations are dicult for static analysis to reason about. One example is the communication buer. In bottom-up BFS, users can choose to repurpose “visited” array as the break dependency state. The “visited” is a bit vector and can be implemented as a bitmap. When we record the dependency for a vertex, the “visited” has already been set, so we can reduce computation by avoiding the bit set operation in the dependency bitmap. When we send the dependency, we can actually send “visited” and avoid the memory allocation for dependency communication. However, writing such optimizations manually is not recommended for two reasons. First, the opti- mizations in memory footprint and computation are not the bottleneck to the overall performance. The memory reduction is one bit per vertex, while in every graph algorithm, the data eld of each vertex takes 123 at least four bytes. As for the computation reduction, setting a bit sequentially in a bitmap is also negligi- ble compared with the random edge traversals. In our evaluation, the performance benet is not notice- able (within 1% in execution time). 
Second, manual optimizations affect the readability of the source code and increase the burden on the user, hurting programmability. It contradicts the original purpose of domain-specific systems. The programmer needs to have a solid understanding of both the algorithm and the system. In the same example, there is another bitmap, "frontier", in the algorithm. However, it is incorrect to repurpose "frontier" as the dependency data.

6.6 SympleGraph System

In this section, we discuss how SympleGraph schedules dependency communication to enforce dependent execution, and several system optimizations.

6.6.1 Enforcing Dependency: Circulant Scheduling

By expanding the signal expressions in Figure 6.4 for all vertices, we have Figure 6.6, a nested loop. Our goals are to 1) parallelize the outer loop, and 2) enforce the dependency order of the inner loop. However, if each vertex starts from the same machine, the other machines are idle and parallelism is limited. To preserve parallelism and enforce dependency simultaneously, we have to schedule each vertex to start with mirrors from different machines. We formalize the idea as circulant scheduling, which divides the iteration into p steps for p machines and executes I according to a circulant permutation. In fact, any cyclic permutation will work, and we choose one circulant for simplicity.

Definition 5 (Circulant scheduling) A circulant permutation is defined as σ(i) = (i + p − 1) % p, and initially σ(i) = i, i = 0, ..., p − 1. The vertices in a graph are divided into p disjoint sets according to the master vertices. Let u^(i) denote the sequence of neighbors of master vertices on machine i. In step j (j = 0, 1, ..., p − 1), circulant scheduling executes I(u^(i)) on machine σ_j(i).

     1  for v in V {
     2    // mirror signal
     3    for m in machines {
     4      for mirror in m.mirrors(v) {
     5        ...
     6        emit(v, upt)      // update
     7        emit_dep(v, dep)  // dependency
     8      }
     9    }
    10  }

Figure 6.6: Circulant Scheduling

Figure 6.7: Circulant Scheduling Example

Circulant scheduling achieves the two goals, and the correctness can be inferred from the properties of the permutation. For any specific vertex set, its execution follows the order of I(u_{j−1} | u_0 u_1 ... u_{j−2}), starting from step 0. For any specific step j, the scheduling specifies different machines, because σ_j is a permutation. For example, the permutation of step 0 based on (0, 1, 2, 3) is σ_0 = (3, 0, 1, 2). In step 0 (the first step), I(u^(0)) (the sequence of neighbors of master vertices on machine 0) is processed on machine 3 (σ_0(0) = 3). In step 1 (the second step), σ_1 = (2, 3, 0, 1), and I(u^(0)) is processed on machine 2 (σ_1(0) = 2).

Figure 6.7 shows an example with four machines. Figure 6.7 (a) shows the matrix view of the graph. An element (i, j) in the matrix represents an edge (v_i, v_j). Similarly, we use the notation [i, j] to represent a subgraph with edges from machine i to machine j. Based on circulant scheduling, machine 0 first processes all edges in [0, 1] and then [0, 2], [0, 3], [0, 0]. [0, 1] contains the edges between master vertices on machine 1 and their neighbors on machine 0. The other machines are similar. In the same step, each machine i processes edges in a different subgraph [i, j] in parallel. For example, in step 0, the subgraphs processed by machines 0, 1, 2, 3 are [0, 1], [1, 2], [2, 3], [3, 0], respectively. After all steps, the edges in [j, i], j ∈ {0, 1, 2, 3}, are processed sequentially.
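To make the schedule concrete, the following stand-alone sketch (our illustration, not SympleGraph's code; the helper names target and sigma are ours) prints, for p = 4 machines, which subgraph [i, j] each machine processes in each step, and cross-checks the permutation values σ_0(0) = 3 and σ_1(0) = 2 used above.

    // Sketch of the circulant schedule in Definition 5 (illustration only).
    // In step j, machine m processes the subgraph [m, target(m, j)], i.e., the
    // mirrors on m of the master vertices owned by machine target(m, j).
    #include <cstdio>

    int target(int m, int j, int p) { return (m + j + 1) % p; }

    // Machine that executes I(u^(i)) (masters owned by machine i) in step j;
    // this is the permutation written as sigma_j(i) in the text.
    int sigma(int i, int j, int p) { return (i + p - 1 - j) % p; }

    int main() {
      const int p = 4;  // four machines, as in Figure 6.7
      for (int j = 0; j < p; ++j) {
        std::printf("step %d:", j);
        for (int m = 0; m < p; ++m)
          std::printf("  machine %d -> [%d,%d]", m, m, target(m, j, p));
        std::printf("\n");
      }
      std::printf("sigma_0(0)=%d, sigma_1(0)=%d\n", sigma(0, 0, p), sigma(0, 1, p));
      return 0;
    }

Running the sketch reproduces the schedule described above: in step 0 machines 0, 1, 2, 3 process [0,1], [1,2], [2,3], [3,0], and machine 0 visits [0,1], [0,2], [0,3], [0,0] across the four steps.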
The dependency communication pattern is the same for all steps: each machine only communicates with the machine on its left. Note that circulant scheduling enables more parallelism because each machine processes disjoint sets of edges in parallel. It is still more restrictive than arbitrary execution. Without circulant scheduling, a machine has the freedom to process all edges with sources allocated to this machine (a range of rows in Figure 6.7 (a)); with circulant scheduling, during a given step (a part of an iteration), the machine can only process edges in the corresponding subgraph. In other words, the machine loses the freedom to process edges of other steps during this period. The evaluation results in Section 6.7 will show that the eliminated redundant computation and communication can fully offset the effects of reduced parallelism.

Figure 6.7 also shows the key difference between dependency and update communication. The dependency communication happens between two steps, because the next step needs to receive it before execution to enforce loop-carried dependency. For update communication, each machine receives from all remote machines by the end of the current iteration, when local reduction and update are performed. Circulant scheduling will not incur much additional synchronization overhead by transferring dependency communication between steps, because dependency communication is much smaller than update communication. Moreover, before starting a new step, if a machine does not wait to receive the full dependency communication from the previous step, the correctness is not compromised. With incomplete information, the framework will just miss some opportunities to eliminate unnecessary computation and communication. In fact, Gemini can be considered a special case without dependency communication.

6.6.2 Differentiated Dependency Propagation

This section discusses an optimization to further reduce communication. In circulant scheduling, by default, every vertex has dependency communication. Vertices with a lower degree, however, have no mirrors on some machines, and thus dependency communication for them is unnecessary.

Figure 6.8: Differentiated Dependency Propagation

Figure 6.8 shows the execution of two vertices, L and H, in basic circulant scheduling. The system has five machines. Both vertices have their masters on machine 1. For simplicity, the figure omits the edges for signal functions. The green and red edges are update and dependency messages. For vertex H, every other machine has its mirror. Therefore, the dependency message is propagated across all mirrors and potentially reduces computation and update communication in some mirrors. For vertex L, only machine 2 has its mirror. However, we still propagate its dependency message from machine 1 to machine 5.

One naive solution to avoid unnecessary communication for vertex L is to store the mirror information in each mirror. Before sending the dependency communication of a vertex, we first check the machine number of the next mirror. However, this solution is infeasible for three reasons. First, the memory overhead for storing the information is prohibitive: the space complexity is the same as the total number of mirrors, O(|E|). Second, dependency communication becomes complicated in circulant scheduling.
Consider a vertex with mirrors on machine 2 and machine 4: even when there is no mirror of the vertex on machine 3, we still need to send a message from machine 2 to machine 3, because we cannot discard any message in circulant communication. Third, it does not allow batch communication, since the communication pattern for contiguous vertices is not the same.

To reduce the dependency communication that brings smaller benefits, we propose to differentiate the dependency communication for high-degree and low-degree vertices. The degree threshold is an empirical constant. The intuition is that dependency communication is the same for high-degree and low-degree vertices, but the high-degree vertices can save more update communication. Therefore, SympleGraph only propagates dependency for high-degree vertices. For low-degree vertices, we fall back to the original schedule: each mirror directly sends the update messages to the machine with the master vertex.

Differentiated dependency propagation is a trade-off. Falling back for low-degree vertices may reduce the benefit of reducing the number of edges traversed. However, since the low-degree vertices have fewer neighbors, the redundant computation due to loop-carried dependency is also insignificant, because fewer neighbors can be skipped. PowerLyra [18] proposed a differentiated graph partitioning optimization that reduces update communication for low-degree vertices. In SympleGraph, the differentiation concerns dependency communication, and it is orthogonal to graph partitioning.

6.6.3 Hiding Latency with Double Buffering

In circulant scheduling, although disjoint sets of vertices can be executed in parallel within one step, and the computation and update communication can be overlapped, the dependency communication appears in the critical path of execution between steps. Before each step, every machine waits for the dependency message from the predecessor machine. It is not a global synchronization for all machines: the synchronization between machine 1 and machine 3 is independent of that between machine 1 and machine 2. However, it still impairs performance. Besides the extra latency due to the dependency message, it also incurs load imbalance within the step. However, all existing load balancing techniques focus on an entire iteration and cannot solve our problem. As a result, the overall performance is affected by the slowest step.

Figure 6.9: Double buffering

We propose the double buffering optimization, which enables computation and dependency communication to overlap and alleviates load imbalance. Figure 6.9 demonstrates the key idea with an example. We consider two machines and the first two steps. Specifically, the figure shows the dependency communication from machine 1 to machine 3 in step 1 in red. We also add back the blue signal edges to represent the computation on the mirrors. In circulant scheduling, the dependency communication starts after all computation is finished for the mirrors of partition 2 on machine 1.

With double buffering, we divide the mirror vertices in each step into two groups, A and B. First, each machine processes group A and generates its dependency information, which is sent before the processing of vertices in group B. Therefore, the computation on group B is overlapped with the dependency communication for group A, and the latter can be done in the background.
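As a minimal stand-alone sketch of this overlap (our illustration, not SympleGraph's actual code; process and send_dep are hypothetical stand-ins for mirror processing and for the dependency message to the next machine):

    // Sketch of double buffering within one step (illustration only).
    #include <cstdio>
    #include <functional>
    #include <future>
    #include <vector>

    using Vertex = int;

    void process(const std::vector<Vertex>& group) {
      for (Vertex v : group) std::printf("process mirror %d\n", v);
    }

    void send_dep(const std::vector<Vertex>& group) {
      std::printf("send dependency state of %zu vertices\n", group.size());
    }

    void run_step(const std::vector<Vertex>& groupA,
                  const std::vector<Vertex>& groupB) {
      process(groupA);                              // finish group A first
      auto depA = std::async(std::launch::async,    // ship A's dependency state
                             send_dep, std::cref(groupA));
      process(groupB);                              // overlapped with the send of A
      depA.wait();                                  // A's message is on its way
      send_dep(groupB);                             // B's dependency ends the step
    }

    int main() {
      run_step({0, 1, 2}, {3, 4, 5});
      return 0;
    }

In the actual system the send would presumably be a non-blocking MPI operation rather than a background thread, but the structure of the overlap is the same.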
In the example, machine 3 will receive the dependency message of group 2A earlier, so that the processing of vertices in group 2A on machine 3 does not need to wait until machine 1 finishes processing all vertices in both groups 2A and 2B. After the second group is processed, its dependency message is sent, and the current step completes. Before starting the next step, machine 3 only needs to wait for the dependency message for group A, which was initiated earlier, before the end of the step.

The double buffering optimization addresses two performance issues. First, at the sender side, group A communication is overlapped with group B computation, while group B communication can be overlapped with group A computation in the next step. Second, the synchronization wait time is reduced due to reduced load imbalance. Consider the potential scenario of load imbalance in Figure 6.9: machine 3 (the receiver) has much less load in step 1 and proceeds to the next step before machine 1; it only waits for the dependency message of group A. Since that message is sent first, it is likely to have already arrived. Without double buffering, machine 3 has to wait for the full completion of machine 1 in step 1.

Importantly, the double buffering optimization can be perfectly combined with the differentiated optimization. We can consider the high-degree and low-degree vertices as two groups. Since processing low-degree vertices does not need synchronization, we can overlap it with dependency communication. In the example, if the dependency from machine 1 has not arrived, we can start low-degree vertices in step 2 without waiting.

6.7 Evaluation

We evaluate SympleGraph, Gemini [140], and D-Galois [26]. D-Galois is a recent state-of-the-art distributed graph processing framework with better performance than Gemini when using 128 to 256 machines. In the following, we describe the evaluation methodology. After that, we show the results of several important aspects: 1) comparison of overall performance among the three frameworks; 2) reduction in communication volume and computation cost; 3) scalability; and 4) piecewise contribution of each optimization.

6.8 Evaluation Methodology

System configuration. We use three clusters in the evaluation: (1) Cluster-A is a private cluster with 16 nodes. Each node has 2 Intel Xeon E5-2630 CPUs (8 cores/CPU) and 64 GB DRAM. The operating system is CentOS 7.4. The MPI library is OpenMPI 3.0.1. The network is Mellanox InfiniBand FDR (56Gb/s). The following evaluation results are obtained on Cluster-A unless otherwise stated. (2) Cluster-B is Stampede2 Skylake (SKX) at the Texas Advanced Computing Center [119]. Each node has 2 Intel Xeon Platinum 8160 CPUs (24 cores/CPU) and 192 GB DRAM with a 100Gb/s interconnect. It is used to reproduce the D-Galois results, which require 128 machines and do not fit in Cluster-A. (3) Cluster-C consists of 10 nodes. Each node is equipped with two Intel Xeon E5-2680v4 CPUs (14 cores/CPU) and 256GB memory. The network is InfiniBand FDR (56Gb/s). It is used to run the two large real-world graphs (Clueweb-12 and Gsh-2015), which require larger memory and do not fit in Cluster-A.

Graph Dataset. The datasets are shown in Table 6.1. There are four real-world datasets and three synthesized scale-free graphs generated with the R-MAT generator [17]. We use the same generator parameters as the Graph500 benchmark [43].

Table 6.1: Graph datasets. |V'| is the number of high-degree vertices
Graph                Abbrev.   |V|     |E|     |V'|/|V|
Twitter-2010 [63]    tw        42M     1.5B    0.13
Friendster [70]      fr        66M     1.8B    0.31
R-MAT-Scale27-E32    s27       134M    4.3B    0.12
R-MAT-Scale28-E16    s28       268M    4.3B    0.09
R-MAT-Scale29-E8     s29       537M    4.3B    0.04
Clueweb-12 [15, 95]  cl        978M    43B     0.12
Gsh-2015 [14]        gsh       988M    34B     0.28

For experiments in Cluster-A, we generate the three largest synthesized graphs that fit in its memory. Any larger graph causes an out-of-memory error. The scales (logarithm of the number of vertices) are 27, 28, and 29, and the edge factors (average degree of a vertex) are 32, 16, and 8, respectively. To run undirected algorithms on directed graphs, we consider every directed edge as its undirected counterpart. To run directed algorithms on undirected graphs, we convert the undirected datasets to directed graphs by adding reverse edges.

Graph Algorithms. We evaluate the five algorithms discussed before. We use the reference implementations when they are available in Gemini and D-Galois. While SympleGraph only benefits bottom-up BFS, we use adaptive direction-switch BFS [109], which chooses between the top-down and bottom-up algorithms in each iteration. § ¶ We follow the optimization instructions of D-Galois by running all the provided partition strategies and reporting the best one as the baseline. ∥ For BFS, we average the experimental results of 64 randomly generated non-isolated roots. For each root, we run the algorithm 5 times. For K-core, 2-core is a subroutine widely used in strongly connected component algorithms [49]. We also evaluate other values of K. For K-means, we choose the number of clusters as √|V| and run the algorithm for 20 iterations. For algorithms other than BFS, we run the application 20 times and average the results.

6.8.1 Performance

Table 6.4 shows the execution time of all systems. SympleGraph outperforms both Gemini and D-Galois with a geometric-mean speedup of 1.46 (up to 3.05) over the better of the two. For the three synthesized graphs with the same number of edges but different edge factors (s27, s28, and s29), graphs with a larger edge factor have slightly higher speedup in SympleGraph. For K-core, the numbers in parentheses use the optimal algorithm with linear complexity in the number of nodes and no loop dependency [78]. It is slower than SympleGraph for large synthesized graphs, but significantly faster for Twitter-2010 and Friendster. The reason is that the algorithm is suitable for graphs with large diameters. Although real-world social graphs have relatively small diameters, they usually have a long link structure attached to the small-diameter core component.

K-core. Table 6.2 shows the execution time (using 8 Cluster-A nodes) for different values of K. SympleGraph has consistent speedup over Gemini regardless of K.

§ The adaptive switch is not available in D-Galois. For a fair comparison, we implement the same switch in D-Galois.
¶ The graph sampling implementation is not available in D-Galois.
∥ We exclude Jagged Cyclic Vertex-Cut and Jagged Blocked Vertex-Cut (in all algorithms) and Over-decomposed by 2/4 Cartesian Vertex-Cut (in K-core), because the reference implementations either crashed or produced incorrect results.

Table 6.2: K-core runtime (in seconds)

Graph   K    Gemini    SympleG.   Speedup
tw      4    1.9663    1.3009     1.51
tw      8    2.9752    2.0595     1.44
tw      16   4.9062    3.2957     1.49
tw      32   5.8374    3.7916     1.54
tw      64   7.5694    5.1717     1.46
fr      4    14.7322   10.3543    1.42
fr      8    10.1319   6.7909     1.49
fr      16   11.5904   7.4135     1.56
fr      32   21.8317   13.4914    1.62
fr      64   17.9096   11.4387    1.57

Large Graphs.
We run Gemini and SympleGraph with the two large real-world graphs (Gsh-2015 and Clueweb-12) on Cluster-C. SympleGraph has no improvement for BFS and K-means in Clueweb-12. The reason is that bottom-up algorithm eciency depends on graph property. In cl, it is slower than top-down BFS for most iterations, so they are not chosen by the adaptive switch. In other test cases, SympleGraph is noticeably better than Gemini. Table 6.3: Execution time(in seconds) on large graphs Graph App. Gemini SympleG. Speedup gsh BFS 4.5843 4.6031 1.00 MIS 7.3186 4.1530 1.76 K-core 24.1753 13.4465 1.80 K-means 84.7207 75.7227 1.12 Sampling 4.6578 3.4686 1.34 cl BFS 16.8839 17.9272 1.00 MIS 11.9406 6.8330 1.75 K-core 171.8570 97.7020 1.76 K-means 128.5634 142.6216 1.00 Sampling 4.5093 3.6143 1.25 6.8.2 ComputationandCommunicationReduction The source of performance speedup in SympleGraph is mainly due to eliminating unnecessary compu- tation and communication with precisely enforcing loop-carried dependency. In graph processing, the number of edges traversed is the most signicant part of computation. Table 6.5 shows the number of 133 Table 6.4: Execution Time (in seconds) Graph Gemini D-Galois SymG. Speedup BFS tw 0.608 2.053 0.264 2.30 fr 1.212 4.993 0.706 1.72 s27 1.054 2.681 0.733 1.44 s28 1.325 3.682 0.976 1.36 s29 1.760 5.356 1.372 1.28 K-core tw 3.021(0.184) 4.125 2.190 1.38 fr 11.258(0.580) 17.213 7.390 1.52 s27 2.754(1.885) 3.512 1.640 1.68 s28 4.432(4.779) 6.056 2.663 1.66 s29 5.413(10.330) 8.534 3.806 1.42 MIS tw 2.081 4.056 1.421 1.46 fr 2.363 5.045 1.754 1.35 s27 2.720 5.329 1.861 1.46 s28 3.031 7.110 2.408 1.26 s29 3.600 8.620 2.835 1.27 K-means tw 17.590 56.748 12.688 1.39 fr 19.212 78.526 13.143 1.46 s27 27.626 61.598 19.279 1.43 s28 34.393 86.632 26.919 1.28 s29 52.087 116.307 41.760 1.25 Sampling tw 0.786 N/A 0.867 0.91 fr 1.180 0.977 1.21 s27 1.388 1.090 1.27 s28 2.051 1.331 1.54 s29 2.932 1.869 1.57 edges traversed in Gemini and SympleGraph. The rst two columns are edge traversed in Gemini and SympleGraph. The last column is their ratio. We see that SympleGraph reduces edge traversal across all graph datasets and all algorithms with a 66.91% reduction on average. For communication, Gemini and other existing frameworks only have update communication, while SympleGraph reduces updates but introduces dependency communication. Table 6.6 shows the breakdown of communication in SympleGraph. Communication size is counted by message size in bytes and all the numbers are normalized to the total communication in Gemini. The rst (SympleGraph.upt) and second (SympleGraph.dep) column show update and dependency communication, respectively. The last column is the total communication of SympleGraph. 134 There are two important observations. First, s27, s28, and s29 have the same total number of edges, while s27 traverses consistently less edges than s28 and s29 in all algorithms. On average, SympleGraph on s27 traverses 24.8% edges compared with Gemini, while on s29 traverses 32.8%. When the graph structure is similar (R-MAT), the number of traversed edges is less in graphs with a larger average degree. A large average degree means more high-degree vertices that SympleGraph optimizes in dierentiated computa- tion. Therefore, s27 has more potential edges when considering reducing computation. Second, in terms of total communication size, SympleGraph is less than Gemini in all algorithms except graph vertex sampling. 
For these algorithms, control dependency communication is one bit per vertex because the dependency information indicates whether the vertex in the previous step has skipped the loop. For graph sampling, data dependency communication is the current prex sum. It is one oating-point number for one vertex; thus total communication might increase. 1.00 2.00 4.00 8.00 16.00 #nodes 1.00 2.00 4.00 Normalized Runtime Gemini SympleG D-Galois Figure 6.10: Scalability (MIS/s27) 6.8.3 Scalability We rst compare the scalability results of SympleGraph with Gemini and D-Galois, running MIS on graph s27 (Figure 6.10). The execution time is normalized to SympleGraph with 16 machines. The data points for 135 Table 6.5: Number of traversed edges (Normalized to total number of edges in the graph) Graph Gemini SympG. SympG./Gemini BFS tw 0.4383 0.2214 0.5051 fr 0.8537 0.3435 0.4024 s27 0.3089 0.0870 0.2815 s28 0.3586 0.1348 0.3760 s29 0.4716 0.1879 0.3985 K-core tw 2.6421 1.1986 0.4537 fr 11.3283 3.1951 0.2820 s27 1.1188 0.3498 0.3126 s28 1.8717 0.6165 0.3294 s29 2.4237 1.0513 0.4338 MIS tw 3.9014 1.9750 0.5062 fr 5.4431 2.0479 0.3762 s27 3.1328 0.8717 0.2782 s28 3.4390 1.0174 0.2958 s29 3.7762 1.1970 0.3170 K-means tw 13.3972 5.5608 0.4151 fr 2.5798 1.8989 0.7361 s27 5.6167 1.7196 0.3062 s28 8.8354 2.7847 0.3152 s29 13.6472 5.3375 0.3911 Sampling tw 1.0313 0.2143 0.2078 fr 1.2097 0.1290 0.1066 s27 1.1096 0.0709 0.0639 s28 1.1498 0.0966 0.0840 s29 1.1912 0.1172 0.0984 Gemini and SympleGraph with 1 machine are missing because the system is out of memory. Both Gemini and SympleGraph achieves the best performance with 8 machines. D-Galois scales to 16 machines, but its bestperformancerequires128to256machines according to [26]. In summary, SympleGraph is consistently better than Gemini and D-Galois with 16 machines. From 8 to 16 machines, SympleGraph has a smaller slowdown compared with Gemini, thanks to the reduction in communication and computation. Thus, SympleGraph scales better than Gemini. COST. The COST metric [80] is an important measure of scalability for distributed systems. It is the number of cores a distributed system need to outperform the fastest single-thread implementation. We use the MIS algorithm in Galois [86] and s27 graph as the single-thread baseline. The COST of Gemini and 136 Table 6.6: SympleGraph communication breakdown (normalized to total communication volume in Gem- ini) Graph SymG.upt SymG.dep SymG BFS tw 0.7553 0.0446 0.7999 fr 0.4657 0.0429 0.5085 s27 0.4151 0.0175 0.4326 s28 0.4855 0.0193 0.5047 s29 0.5993 0.0154 0.6147 K-core tw 0.5377 0.0074 0.5450 fr 0.3646 0.0074 0.3719 s27 0.3705 0.0051 0.3755 s28 0.3987 0.0051 0.4038 s29 0.5028 0.0039 0.5067 MIS tw 0.4721 0.0313 0.5034 fr 0.3639 0.0259 0.3898 s27 0.3053 0.0199 0.3252 s28 0.3336 0.0208 0.3544 s29 0.4127 0.0160 0.4287 K-means tw 0.6854 0.0250 0.7103 fr 0.7044 0.0393 0.7437 s27 0.3306 0.0100 0.3406 s28 0.3797 0.0118 0.3915 s29 0.5188 0.0106 0.5294 Sampling tw 0.1877 1.1578 1.3455 fr 0.1637 0.7238 0.8875 s27 0.1706 0.6558 0.8264 s28 0.2106 0.7050 0.9157 s29 0.2565 0.7504 1.0069 SympleGraph is 4, while the COST of D-Galois is 64. We also use the BFS algorithm in GAPBS [10] and tw graph as another baseline. GAPBS nishes in 2.29 seconds while SympleGraph takes 2.66 and 1.83 seconds for 2 and 3 threads, respectively. The cost of SympleGraph is 3. D-Galois. To evaluate the best performance of D-Galois, We reproduce the results with Cluster-B. The results are shown in Table 6.7. 
As the SKX nodes have more powerful CPUs and network, SympleGraph requires fewer nodes (2 or 4) for its best performance. D-Galois achieves similar or worse performance with 128 nodes. While D-Galois scales better with a large number of nodes, running increasingly common graph analytics applications on a supercomputer is not convenient. In fact, for these experiments, the jobs waited for days to be executed. Based on the results, SympleGraph on a local cluster with 4 nodes can fulfill the work of D-Galois with 128 nodes. We believe using SympleGraph on a small-scale distributed cluster is the most convenient and practical solution.

Table 6.7: Execution time (in seconds) of MIS using the best-performing number of nodes (in parentheses) on Stampede2

Graph   D-Galois      SympleGraph
tw      1.321 (128)   1.113 (2)
fr      1.355 (128)   0.823 (4)
s27     1.258 (128)   0.911 (4)
s28     1.380 (128)   1.159 (4)
s29     1.565 (128)   1.420 (4)

6.9 Analysis of SympleGraph Optimizations

In this section, we analyze the piecewise contribution of the proposed optimizations over circulant scheduling, i.e., differentiated dependency propagation and double buffering. We run all applications on four versions of SympleGraph with different optimizations enabled. Due to the space limit, Figure 6.11 only shows the geometric-average results of all algorithms. For each graph dataset, we normalize the runtime to the version with basic circulant scheduling. Note that here the baseline is not Gemini.

Double buffering effectively reduces the execution time in all cases. It successfully hides the latency of dependency communication and reduces synchronization overhead. The differentiated propagation optimization alone has little performance impact, because synchronization is still the bottleneck without double buffering. When combined with double buffering, differentiated propagation has a noticeable effect. This shows that our trade-off consideration in update and dependency communication is effective. Overall, when all optimizations are applied, the performance is always better than with either individual optimization.

Figure 6.11: Analysis of optimizations (baseline is SympleGraph with only circulant scheduling); average normalized runtime on tw, fr, s27, s28, and s29 for Double Buffering (DB), Differentiated Propagation (DP), and SympleGraph (DB+DP)

6.10 Summary

This chapter proposes SympleGraph, a novel framework for distributed graph processing that precisely enforces loop-carried dependency, i.e., when a condition is satisfied by a neighbor, all following neighbors can be skipped. SympleGraph analyzes user-defined functions and identifies the loop-carried dependency. The distributed framework enforces the precise semantics by performing dependency propagation dynamically. To achieve high performance, we apply circulant scheduling in the framework to allow different machines to process disjoint sets of edges and vertices in parallel while satisfying the sequential requirement. To further improve communication efficiency, SympleGraph differentiates dependency communication and applies double buffering. In a 16-node setting, SympleGraph outperforms Gemini and D-Galois on average by 1.42x and 3.30x, and up to 2.30x and 7.76x, respectively. The communication reduction compared to Gemini is 40.95% on average, and up to 67.48%.

Chapter 7

Conclusions

This dissertation has presented hardware and software techniques for irregular applications on parallel computer architecture.
We show that existing solutions, hardware or software solutions alone are not sucient to address the sparsity and data dependency challenges of irregular parallelism. In particular, we have made the following contributions: • For classical hardware accelerators, we propose an ideal solution to parallel FSM. First, we exam- ine the current software optimizations that reduces parallelization overhead. Second, we design the rst hardware-software solution CSE (Chapter 3) with constant parallelization overhead. The embarrassingly sequential application can scale perfectly on hardware now. • For novel memory architecture, we have designed two accelerators that fully leverages the power of Processing-In-Memory. First, GraphP (Chapter 4) identies the performance bottleneck of the state- of-the-art accelerator and improves inter-cube communication. GraphP oers a new data partition method and changes the programming model of graph processing. GraphQ (Chapter 5) takes a closer look at the memory hierarchies and adopt dierent optimizations at dierent hierarchy levels. GraphQ is the rst accelerator that fully takes advantage of the high memory bandwidth and the rst design that scales to multiple nodes. 140 • For distributed systems, we develop a compiler-based technique for graph processing. Our system, SympleGraph, is the rst one to dene a rigorous specication of distributed graph processing sys- tems. We use the compiler to nd the optimization opportunity from the data dependency of certain graph algorithms. We believe that these contributions open several interesting avenues for future research. • Our work in parallel FSM opens a door to the parallelization of other embarrassingly sequential applications by hardware. These applications were believed to be impossible to parallelize. • Our work on novel memory architecture starts a series of research on applying hardware-software solutions for PIM. We hope that new applications, especially irregular applications, can be optimized for PIM. Additionally, several of our contributions provide new insight into the logic design of tra- ditional graph accelerators. For example, the intra-cube optimizations of GraphQ can be applied to other non-PIM accelerators. • Our work in distributed systems put an emphasis on a formal specication of system and algorithm. The formalization has been long missing in the system community. The impressive performance speedup will motivate all distributed system designers to rethink their system specication. We leave all these endeavors to future work. 141 Bibliography [1] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. “A scalable processing-in-memory accelerator for parallel graph processing”. In: Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on. IEEE. 2015, pp. 105–117. [2] Zhiyuan Ai, Mingxing Zhang, Yongwei Wu, Xuehai Qian, Kang Chen, and Weimin Zheng. “Squeezing out all the value of loaded data: An out-of-core graph processing system with reduced disk i/o”. In: 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA. 2017, pp. 125–137. [3] Tero Aittokallio and Benno Schwikowski. “Graph-based methods for analysing networks in cell biology”. In: Briengs in bioinformatics 7.3 (2006), pp. 243–255. [4] Andrei Alexandrescu and Katrin Kirchho. “Data-Driven Graph Construction for Semi-Supervised Graph-Based Learning in NLP.” In: HLT-NAACL. 2007, pp. 204–211. [5] Rajeev Alur and Mihalis Yannakakis. 
“Model checking of hierarchical state machines”. In: ACM SIGSOFT Software Engineering Notes. Vol. 23. ACM. 1998, pp. 175–188. [6] ARM. ARM Cortex-A5 Processor. 2009. [7] Abanti Basak, Shuangchen Li, Xing Hu, Sang Min Oh, Xinfeng Xie, Li Zhao, Xiaowei Jiang, and Yuan Xie. “Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads”. In: 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE. 2019, pp. 373–386. [8] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. “Relational inductive biases, deep learning, and graph networks”. In: arXiv preprint arXiv:1806.01261 (2018). [9] Scott Beamer, Krste Asanović, and David Patterson. “Direction-optimizing Breadth-rst Search”. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. SC ’12. Salt Lake City, Utah: IEEE Computer Society Press, 2012, 12:1–12:10. isbn: 978-1-4673-0804-5.url: http://dl.acm.org/citation.cfm?id=2388996.2389013. 142 [10] Scott Beamer, Krste Asanović, and David Patterson. The GAP Benchmark Suite. 2015. arXiv: 1508.03619[cs.DC]. [11] Michela Becchi, Mark Franklin, and Patrick Crowley. “A workload for evaluating deep packet inspection architectures”. In: Workload Characterization, 2008. IISWC 2008. IEEE International Symposium on. IEEE. 2008, pp. 79–89. [12] Richard Bellman. “On a routing problem”. In: Quarterly of applied mathematics 16.1 (1958), pp. 87–90. [13] Bryan Black, Murali Annavaram, Ned Brekelbaum, John DeVale, Lei Jiang, Gabriel H Loh, Don McCaule, Pat Morrow, Donald W Nelson, Daniel Pantuso, et al. “Die stacking (3D) microarchitecture”. In: 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’06). IEEE. 2006, pp. 469–479. [14] Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. “Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks”. In: Proceedings of the 20th international conference on World wide web. ACM. 2011, pp. 587–596. [15] Paolo Boldi and Sebastiano Vigna. “The webgraph framework I: compression techniques”. In: Proceedings of the 13th international conference on World Wide Web. ACM. 2004, pp. 595–602. [16] William M Campbell, Charlie K Dagli, and Cliord J Weinstein. “Social network analysis with content and graphs”. In: Lincoln Laboratory Journal 20.1 (2013), pp. 61–81. [17] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. “R-MAT: A recursive model for graph mining”. In: Proceedings of the 2004 SIAM International Conference on Data Mining. SIAM. 2004, pp. 442–446. [18] Rong Chen, Jiaxin Shi, Yanzhe Chen, and Haibo Chen. “PowerLyra: Dierentiated Graph Computation and Partitioning on Skewed Graphs”. In: Proceedings of the Tenth European Conference on Computer Systems. EuroSys ’15. Bordeaux, France: ACM, 2015, 1:1–1:15.isbn: 978-1-4503-3238-5.doi: 10.1145/2741948.2741970. [19] Yangjun Chen, Duren Che, and Karl Aberer. “On the ecient evaluation of relaxed queries in biological databases”. In: Proceedings of the eleventh international conference on Information and knowledge management. ACM. 2002, pp. 227–236. [20] Cristiana Chitic and Daniela Rosu. “On validation of XML streams using nite state machines”. In: Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004. ACM. 2004, pp. 85–90. 
[21] Alessandro Cimatti, Edmund Clarke, Enrico Giunchiglia, Fausto Giunchiglia, Marco Pistore, Marco Roveri, Roberto Sebastiani, and Armando Tacchella. “Nusmv 2: An opensource tool for symbolic model checking”. In: International Conference on Computer Aided Verication. Springer. 2002, pp. 359–364. 143 [22] Thayne Coman, Seth Greenblatt, and Sherry Marcus. “Graph-Based Technologies for Intelligence Analysis”. In: Commun. ACM 47.3 (Mar. 2004), pp. 45–47.issn: 0001-0782.doi: 10.1145/971617.971643. [23] Hybrid Memory Cube Consortium et al. Hybrid memory cube specication version 2.1. Tech. rep. Nov. 2015. [24] Guohao Dai, Tianhao Huang, Yuze Chi, Jishen Zhao, Guangyu Sun, Yongpan Liu, Yu Wang, Yuan Xie, and Huazhong Yang. “Graphh: A processing-in-memory architecture for large-scale graph processing”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2018). [25] Anthony Danalis, Ki-Yong Kim, Lori Pollock, and Martin Swany. “Transformations to parallel codes for communication-computation overlap”. In: Proceedings of the 2005 ACM/IEEE conference on Supercomputing. IEEE Computer Society. 2005, p. 58. [26] Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden, Marc Snir, and Keshav Pingali. “Gluon: A Communication-optimizing Substrate for Distributed Heterogeneous Graph Analytics”. In: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation. PLDI 2018. Philadelphia, PA, USA: ACM, 2018, pp. 752–768.isbn: 978-1-4503-5698-5.doi: 10.1145/3192366.3192404. [27] Sutapa Datta and Subhasis Mukhopadhyay. “A grammar inference approach for predicting kinase specic phosphorylation sites”. In: PloS one 10.4 (2015), e0122294. [28] Edsger W Dijkstra. “A note on two problems in connexion with graphs”. In: Numerische mathematik 1.1 (1959), pp. 269–271. [29] Paul Dlugosch, Dave Brown, Paul Glendenning, Michael Leventhal, and Harold Noyes. “An ecient and scalable semiconductor architecture for parallel automata processing”. In: IEEE Transactions on Parallel and Distributed Systems 25.12 (2014), pp. 3088–3098. [30] Anton J Enright and Christos A Ouzounis. “BioLayout—an automatic graph layout algorithm for similarity visualization”. In: Bioinformatics 17.9 (2001), pp. 853–854. [31] Domenico Ficara, Stefano Giordano, Gregorio Procissi, Fabio Vitucci, Gianni Antichi, and Andrea Di Pietro. “An improved DFA for fast regular expression matching”. In: ACM SIGCOMM Computer Communication Review 38.5 (2008), pp. 29–40. [32] Francois Fouss, Alain Pirotte, Jean-Michel Renders, and Marco Saerens. “Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation”. In: IEEE Transactions on knowledge and data engineering 19.3 (2007), pp. 355–369. [33] Mingyu Gao, Grant Ayers, and Christos Kozyrakis. “Practical near-data processing for in-memory analytics frameworks”. In: 2015 International Conference on Parallel Architecture and Compilation (PACT). IEEE. 2015, pp. 113–124. 144 [34] Mingyu Gao and Christos Kozyrakis. “HRL: Ecient and Flexible Recongurable Logic for Near-Data Processing”. In: Proceeding of the 22nd IEEE Symposium on High Performance Computer Architecture (HPCA). Mar. 2016. [35] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. “TETRIS: Scalable and Ecient Neural Network Acceleration with 3D Memory”. In: Proceeding of the 22nd ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Apr. 2017. 
[36] C Lee Giles, Steve Lawrence, and Ah Chung Tsoi. “Rule inference for nancial prediction using recurrent neural networks”. In: Computational Intelligence for Financial Engineering (CIFEr), 1997., Proceedings of the IEEE/IAFE 1997. IEEE. 1997, pp. 253–259. [37] Gurbinder Gill, Roshan Dathathri, Loc Hoang, Andrew Lenharth, and Keshav Pingali. “Abelian: A Compiler for Graph Analytics on Distributed, Heterogeneous Platforms”. In: European Conference on Parallel Processing. Springer. 2018, pp. 249–264. [38] Vaibhav Gogte, Aasheesh Kolli, Michael J Cafarella, Loris D’Antoni, and Thomas F Wenisch. “HARE: Hardware accelerator for regular expressions”. In: Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE. Oct. 2016, pp. 1–12. [39] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. “PowerGraph: Distributed Graph-parallel Computation on Natural Graphs”. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation. OSDI’12. Hollywood, CA, USA: USENIX Association, 2012, pp. 17–30.isbn: 978-1-931971-96-6.url: http://dl.acm.org/citation.cfm?id=2387880.2387883. [40] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. “PowerGraph: Distributed Graph-parallel Computation on Natural Graphs”. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation. OSDI’12. Hollywood, CA, USA: USENIX Association, 2012, pp. 17–30.isbn: 978-1-931971-96-6.url: http://dl.acm.org/citation.cfm?id=2387880.2387883. [41] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. “GraphX: Graph Processing in a Distributed Dataow Framework”. In: Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation. OSDI’14. Broomeld, CO: USENIX Association, 2014, pp. 599–613.isbn: 978-1-931971-16-4.url: http://dl.acm.org/citation.cfm?id=2685048.2685096. [42] Amit Goyal, Hal Daumé III, and Raul Guerra. “Fast large-scale approximate graph construction for nlp”. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics. 2012, pp. 1069–1080. [43] Graph500. Graph 500 Benchmarks. http://www.graph500.org. 2010. [44] Todd J Green, Gerome Miklau, Makoto Onizuka, and Dan Suciu. “Processing XML streams with deterministic automata”. In: International Conference on Database Theory. Springer. 2003, pp. 173–189. 145 [45] Aditya Grover and Jure Leskovec. “node2vec: Scalable feature learning for networks”. In: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM. 2016, pp. 855–864. [46] Ziyu Guan, Jiajun Bu, Qiaozhu Mei, Chun Chen, and Can Wang. “Personalized tag recommendation using graph-based ranking on multi-type interrelated objects”. In: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. ACM. 2009, pp. 540–547. [47] Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret Martonosi. “Graphicionado: A high-performance and energy-ecient accelerator for graph analytics”. In: Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE. 2016, pp. 1–13. [48] Sungpack Hong, Siegfried Depner, Thomas Manhardt, Jan Van Der Lugt, Merijn Verstraaten, and Hassan Cha. “PGX.D: A Fast Distributed Graph Processing Engine”. 
In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’15. Austin, Texas: ACM, 2015, 58:1–58:12.isbn: 978-1-4503-3723-6.doi: 10.1145/2807591.2807620. [49] Sungpack Hong, Nicole C. Rodia, and Kunle Olukotun. “On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-world Graphs”. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. SC ’13. Denver, Colorado: ACM, 2013, 92:1–92:11.isbn: 978-1-4503-2378-9.doi: 10.1145/2503210.2503246. [50] Imranul Hoque and Indranil Gupta. “LFGraph: Simple and Fast Distributed Graph Analytics”. In: Proceedings of the First ACM SIGOPS Conference on Timely Results in Operating Systems. TRIOS ’13. Farmington, Pennsylvania: ACM, 2013, 9:1–9:17.isbn: 978-1-4503-2463-2.doi: 10.1145/2524211.2524218. [51] Hybrid Memory Cube Consortium. Hybrid Memory Cube Specication 2.1. 2015. [52] Mark C. Jerey, Suvinay Subramanian, Cong Yan, Joel Emer, and Daniel Sanchez. “A Scalable Architecture for Ordered Parallelism”. In: Proceedings of the 48th International Symposium on Microarchitecture. MICRO-48. Waikiki, Hawaii: ACM, 2015, pp. 228–241.isbn: 978-1-4503-4034-2. doi: 10.1145/2830772.2830777. [53] Lin Jiang and Zhijia Zhao. “Grammar-aware Parallelization for Scalable XPath Querying”. In: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM. 2017, pp. 371–383. [54] Andrew B Kahng, Bin Li, Li-Shiuan Peh, and Kambiz Samadi. “Orion 2.0: A power-area simulator for interconnection networks”. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems 20.1 (2012), pp. 191–196. [55] Andrew B. Kahng, Bin Li, Li-Shiuan Peh, and Kambiz Samadi. “ORION 2.0: A Power-Area Simulator for Interconnection Networks”. In: IEEE Trans. Very Large Scale Integr. Syst. 20.1 (Jan. 2012), pp. 191–196.issn: 1063-8210.doi: 10.1109/TVLSI.2010.2091686. 146 [56] George Karypis and Vipin Kumar. “A fast and high quality multilevel scheme for partitioning irregular graphs”. In: SIAM Journal on scientic Computing 20.1 (1998), pp. 359–392. [57] Gwangsun Kim, John Kim, Jung Ho Ahn, and Jaeha Kim. “Memory-centric System Interconnect Design with Hybrid Memory Cubes”. In: Proceedings of the 22Nd International Conference on Parallel Architectures and Compilation Techniques. PACT ’13. Edinburgh, Scotland, UK: IEEE Press, 2013, pp. 145–156.isbn: 978-1-4799-1021-2.url: http://dl.acm.org/citation.cfm?id=2523721.2523744. [58] Gwangsun Kim, John Kim, Jung Ho Ahn, and Jaeha Kim. “Memory-centric system interconnect design with hybrid memory cubes”. In: Proceedings of the 22nd international conference on Parallel architectures and compilation techniques. IEEE Press. 2013, pp. 145–156. [59] John Kim, William Dally, Steve Scott, and Dennis Abts. “Cost-ecient dragony topology for large-scale systems”. In: IEEE micro 29.1 (2009), pp. 33–40. [60] John Kim, William J Dally, and Dennis Abts. “Flattened buttery: a cost-ecient topology for high-radix networks”. In: ACM SIGARCH Computer Architecture News. Vol. 35. 2. ACM. 2007, pp. 126–137. [61] P. M. Kogge, S. C. Bass, J. B. Brockman, D. Z. Chen, and E. Sha. “Pursuing a petaop: point designs for 100 TF computers using PIM technologies”. In: Frontiers of Massively Parallel Computing, 1996. Proceedings Frontiers ’96., Sixth Symposium on the. Oct. 1996, pp. 88–97.doi: 10.1109/FMPC.1996.558065. [62] Peter M. Kogge. “EXECUBE-A New Architecture for Scaleable MPPs”. 
In: Proceedings of the 1994 International Conference on Parallel Processing - Volume 01. ICPP ’94. Washington, DC, USA: IEEE Computer Society, 1994, pp. 77–84.isbn: 0-8493-2493-9.doi: 10.1109/ICPP.1994.108. [63] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. “What is Twitter, a Social Network or a News Media?” In: Proceedings of the 19th International Conference on World Wide Web. WWW ’10. Raleigh, North Carolina, USA: ACM, 2010, pp. 591–600.isbn: 978-1-60558-799-8. doi: 10.1145/1772690.1772751. [64] Nicolas Le Novere, Michael Hucka, Huaiyu Mi, Stuart Moodie, Falk Schreiber, Anatoly Sorokin, Emek Demir, Katja Wegner, Mirit I Aladjem, Sarala M Wimalaratne, et al. “The systems biology graphical notation”. In: Nature biotechnology 27.8 (2009), pp. 735–741. [65] M. LeBeane, S. Song, R. Panda, J. H. Ryoo, and L. K. John. “Data partitioning strategies for graph workloads on heterogeneous clusters”. In: SC15: International Conference for High Performance Computing, Networking, Storage and Analysis. Nov. 2015, pp. 1–12.doi: 10.1145/2807591.2807632. [66] Michael LeBeane, Shuang Song, Reena Panda, Jee Ho Ryoo, and Lizy K. John. “Data Partitioning Strategies for Graph Workloads on Heterogeneous Clusters”. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’15. Austin, Texas: ACM, 2015, 56:1–56:12.isbn: 978-1-4503-3723-6.doi: 10.1145/2807591.2807632. 147 [67] Dong Uk Lee, Kyung Whan Kim, Kwan Weon Kim, Hongjung Kim, Ju Young Kim, Young Jun Park, Jae Hwan Kim, Dae Suk Kim, Heat Bit Park, Jin Wook Shin, et al. “25.2 A 1.2 V 8Gb 8-channel 128GB/s high-bandwidth memory (HBM) stacked DRAM with eective microbump I/O test methods using 29nm process and TSV”. In: Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International. IEEE. 2014, pp. 432–433. [68] Jure Leskovec, Lada A Adamic, and Bernardo A Huberman. “The dynamics of viral marketing”. In: ACM Transactions on the Web (TWEB) 1.1 (2007), p. 5. [69] Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. “Signed Networks in Social Media”. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI ’10. Atlanta, Georgia, USA: ACM, 2010, pp. 1361–1370.isbn: 978-1-60558-929-9.doi: 10.1145/1753326.1753532. [70] Jure Leskovec and Andrej Krevl. friendster. https://snap.stanford.edu/data/com-Friendster.html. 2014.url: https://snap.stanford.edu/data/com-Friendster.html. [71] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data. June 2014. [72] Jure Leskovec, Kevin J Lang, Anirban Dasgupta, and Michael W Mahoney. “Community structure in large networks: Natural cluster sizes and the absence of large well-dened clusters”. In: Internet Mathematics 6.1 (2009), pp. 29–123. [73] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. “McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures”. In: MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture. 2009, pp. 469–480. [74] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. “Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud”. In: Proc. VLDB Endow. 5.8 (Apr. 2012), pp. 716–727.issn: 2150-8097.doi: 10.14778/2212351.2212354. [75] Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. 
Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. “Pregel: A System for Large-scale Graph Processing”. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. SIGMOD ’10. Indianapolis, Indiana, USA: ACM, 2010, pp. 135–146.isbn: 978-1-4503-0032-2.doi: 10.1145/1807167.1807184. [76] Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. “Pregel: A System for Large-scale Graph Processing”. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. SIGMOD ’10. Indianapolis, Indiana, USA: ACM, 2010, pp. 135–146.isbn: 978-1-4503-0032-2.doi: 10.1145/1807167.1807184. [77] Vladimir Marjanović, Jesús Labarta, Eduard Ayguadé, and Mateo Valero. “Overlapping Communication and Computation by Using a Hybrid MPI/SMPSs Approach”. In: Proceedings of the 24th ACM International Conference on Supercomputing. ICS ’10. Tsukuba, Ibaraki, Japan: ACM, 2010, pp. 5–16.isbn: 978-1-4503-0018-6.doi: 10.1145/1810085.1810091. 148 [78] David W Matula and Leland L Beck. “Smallest-last ordering and clustering and graph coloring algorithms”. In: Journal of the ACM (JACM) 30.3 (1983), pp. 417–427. [79] Julian McAuley and Jure Leskovec. “Learning to Discover Social Circles in Ego Networks”. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. NIPS’12. Lake Tahoe, Nevada: Curran Associates Inc., 2012, pp. 539–547.url: http://dl.acm.org/citation.cfm?id=2999134.2999195. [80] Frank McSherry, Michael Isard, and Derek G Murray. “Scalability! But at whatfCOSTg?” In: 15th Workshop on Hot Topics in Operating Systems (HotOSfXVg). 2015. [81] Batul J Mirza, Benjamin J Keller, and Naren Ramakrishnan. “Studying recommendation algorithms by graph analysis”. In: Journal of Intelligent Information Systems 20.2 (2003), pp. 131–160. [82] Anurag Mukkara, Nathan Beckmann, Maleen Abeydeera, Xiaosong Ma, and Daniel Sanchez. “Exploiting Locality in Graph Analytics through Hardware-Accelerated Traversal Scheduling”. In: Proceedings of the 51st annual IEEE/ACM international symposium on Microarchitecture (MICRO-51). Oct. 2018. [83] Todd Mytkowicz, Madanlal Musuvathi, and Wolfram Schulte. “Data-parallel nite-state machines”. In: ACM SIGARCH Computer Architecture News. Vol. 42. ACM. 2014, pp. 529–542. [84] Lifeng Nai, Ramyad Hadidi, Jaewoong Sim, Hyojong Kim, Pranith Kumar, and Hyesoon Kim. “GraphPIM: Enabling Instruction-Level PIM Ooading in Graph Computing Frameworks”. In: High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on. IEEE. 2017, pp. 457–468. [85] Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. “Grappa: A latency-tolerant runtime for large-scale irregular applications”. In: International Workshop on Rack-Scale Computing (WRSC w/EuroSys). 2014. [86] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. “A Lightweight Infrastructure for Graph Analytics”. In: Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. SOSP ’13. Farminton, Pennsylvania: ACM, 2013, pp. 456–471.isbn: 978-1-4503-2388-8.doi: 10.1145/2517349.2522739. [87] Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, Andrey Ayupov, John Greth, Steven Burns, and Ozcan Ozturk. “Energy ecient architecture for graph analytics accelerators”. In: Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on. IEEE. 2016, pp. 166–177. [88] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 
Abstract
Computer architecture is at a critical juncture. With the end of Moore's law and Dennard scaling, it is no longer possible to increase the number of transistors on a single core exponentially.

It has instead become more efficient to build multi-core processors to scale performance. As a result, modern computers, from personal desktops to high-performance computing clusters, use exponentially more cores. To leverage multi-core hardware, we must modify applications to expose thread-level parallelism. Many applications are easy to modify; irregular applications, however, are difficult to scale efficiently on multi-core architectures.

We divide applications into two categories: regular and irregular. Irregular applications exhibit two main characteristics: sparsity and data dependency. Sparsity means that the number of non-zero data elements is much smaller than the possible maximum. Data dependency means that computations on different parts of the data depend on one another. Typical irregular applications include sparse matrix multiplication and graph processing.

At a high level, these two characteristics limit multi-core scalability. First, sparsity hurts resource utilization. Hardware is optimized for dense computation; when elements are zero, the hardware resources assigned to them sit idle, so sparse applications rarely reach peak theoretical performance. Second, data dependency prevents us from decomposing the problem into independent parallel tasks. We must either enforce the dependency for correctness, which serializes the computation, or spend extra effort to break the dependency at the cost of redundant computation and communication. The resulting synchronization overhead becomes the performance bottleneck of irregular applications.

The focus of this thesis is to develop new techniques that scale irregular applications efficiently on parallel architectures. We believe that neither software nor hardware techniques alone will meet this goal. To this end, we present hardware and software techniques that address both the sparsity and the data dependency challenges posed by irregular applications.

For hardware accelerators, we design CSE, a new parallel algorithm for finite-state machines. Existing solutions focus on software optimizations and ignore hardware features. CSE partitions the state machine into convergence sets and parallelizes them with constant parallelization overhead.

For the emerging Processing-In-Memory (PIM) architecture, we propose two techniques that together exploit the full memory bandwidth of Hybrid Memory Cubes (HMC). GraphP is an HMC-based graph accelerator with data partitioning as a first-order design principle; it uses software partitioning to reduce communication. GraphQ is another HMC-based graph accelerator that focuses on new hardware designs at different levels of the memory hierarchy: it changes the programming model of graph processing, rearranges the memory access patterns inside the memory cubes, and adds a runtime system that batches communication messages to achieve peak bandwidth utilization.

For distributed systems, our contributions include a compiler-based technique that removes redundant computation and communication. SympleGraph is a state-of-the-art distributed graph processing system and the first to consider data dependency in both its programming model and its system implementation. We use circulant scheduling, double buffering, and differential communication to enforce data dependency efficiently in a distributed setting.
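To make the FSM parallelization idea above concrete, the short Python sketch below illustrates the enumerative baseline that CSE improves on. It is not the dissertation's CSE implementation: the three-state machine, alphabet, and input are hypothetical. Each worker runs its input chunk from every possible start state, and the per-chunk state mappings are then composed sequentially; per the abstract, CSE reduces this per-chunk enumeration by partitioning the states into convergence sets.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical three-state FSM over the alphabet {'a', 'b'}; purely illustrative.
TRANSITIONS = {
    0: {'a': 1, 'b': 0},
    1: {'a': 2, 'b': 0},
    2: {'a': 2, 'b': 2},
}
STATES = list(TRANSITIONS)

def run_chunk(chunk):
    """Run one input chunk from every possible start state.

    Returns a mapping {start_state: end_state}, which summarizes the chunk
    no matter which state the previous chunk happens to end in.
    """
    mapping = {}
    for start in STATES:
        state = start
        for symbol in chunk:
            state = TRANSITIONS[state][symbol]
        mapping[start] = state
    return mapping

def parallel_fsm(inp, start_state=0, workers=4):
    """Enumerative FSM parallelization: chunk, map in parallel, then compose."""
    size = max(1, len(inp) // workers)
    chunks = [inp[i:i + size] for i in range(0, len(inp), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        chunk_maps = list(pool.map(run_chunk, chunks))
    # Sequentially compose the per-chunk mappings to recover the final state.
    state = start_state
    for mapping in chunk_maps:
        state = mapping[state]
    return state

if __name__ == '__main__':
    data = 'ab' * 1000 + 'aa'
    # The parallel result must match a plain sequential run from state 0.
    assert parallel_fsm(data) == run_chunk(data)[0]
    print('final state:', parallel_fsm(data))
```

In this baseline, every chunk pays an overhead proportional to the number of states it must enumerate; the constant parallelization overhead claimed for CSE corresponds to shrinking that enumeration to a small, bounded set of start states per chunk.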