Algorithm and system co-optimization of graph and machine learning systems by Gengyu Rao A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER ENGINEERING) August 2024 Copyright 2024 Gengyu Rao Table of Contents List of Tables ........................................................................................ v List of Figures ....................................................................................... vi Abstract .............................................................................................. viii Chapter 1: Introduction ............................................................................ 1 Chapter 2: GraphS: Eliminating Redundant Computation and Communication in PIM-Based Graph Processing with Dependence Scheduling............................................... 4 2.1 Abstract.................................................................................... 4 2.2 Introduction ............................................................................... 5 2.3 Background and Motivation .............................................................. 9 2.3.1 PIM-Based Graph Processing ................................................... 9 2.3.2 Algorithms with Loop-carried Dependency .................................... 11 2.3.3 Why Did We Waste Computation and Communication?....................... 13 2.4 GraphS Overview ......................................................................... 15 2.5 User-Defined-Function Analysis ......................................................... 16 2.5.1 GraphS Primitives................................................................ 16 2.5.2 GraphS Code Analyzer .......................................................... 17 2.6 Circulant Scheduling: Enforcing Dependency with Parallelism........................ 18 2.6.1 Problems and Insights ........................................................... 18 2.6.2 Circulant Scheduling............................................................. 19 2.6.3 Selective Dependency Propagation.............................................. 22 2.7 Runtime Implementation and Optimization ............................................. 23 2.7.1 GraphS Runtime System......................................................... 23 2.7.2 GraphSR Runtime System ....................................................... 24 2.7.3 Static Analysis of Graph ......................................................... 29 2.8 Evaluation ................................................................................. 29 2.8.1 Evaluation Methodology......................................................... 29 2.8.2 Performance Improvements ..................................................... 31 2.8.3 Breakdown and Load Imbalance ................................................ 32 2.8.4 Energy Reduction ................................................................ 33 2.8.5 Computation and Communication Reduction .................................. 34 2.9 Related Work .............................................................................. 34 ii 2.10 Conclusion ................................................................................ 35 Chapter 3: SparseCore: stream ISA and processor specialization for sparse computation ..... 36 3.1 Introduction ............................................................................... 
36 3.2 Background ............................................................................... 40 3.2.1 Sparse Tensor Computation ..................................................... 40 3.2.2 GPM Methods and Optimizations............................................... 40 3.2.3 Existing Architectures on GPM ................................................. 42 3.3 Stream Instruction Set Extension ........................................................ 44 3.3.1 Stream Definition ................................................................ 44 3.3.2 Register Extension ............................................................... 44 3.3.3 Instruction Set Specification ..................................................... 45 3.3.4 Code Examples................................................................... 48 3.4 SparseCore Architecture.................................................................. 49 3.4.1 Stream ID Mapping .............................................................. 49 3.4.2 Stream Unit and Stream Reuse .................................................. 50 3.4.3 Stream Cache..................................................................... 51 3.4.4 Stream Data Dependency ........................................................ 52 3.4.5 Sparse Computation on Values .................................................. 53 3.4.6 Nested Intersection............................................................... 54 3.5 Implementation and Software ............................................................ 54 3.5.1 Implementation Considerations ................................................. 54 3.5.2 Hardware Cost ................................................................... 55 3.5.3 Compiler.......................................................................... 55 3.6 Evaluation ................................................................................. 56 3.6.1 Simulator and Configuration .................................................... 56 3.6.2 Graph Mining Algorithms and Data Sets ....................................... 57 3.6.3 Overall Performance of GPM ................................................... 59 3.6.3.1 Comparison with Flexminer, TireJax, and GRAMER .............. 59 3.6.3.2 Comparison with CPU................................................ 60 3.6.4 Cycle Breakdown Analysis ...................................................... 61 3.6.5 Comparing to GPU............................................................... 61 3.6.6 The Distribution of Stream Lengths............................................. 62 3.6.7 Varying the Number of Stream Unit ............................................ 63 3.6.8 Analysis on Bandwidth .......................................................... 63 3.6.9 Tensor Computation Performance............................................... 63 3.6.9.1 Comparison with CPU................................................ 63 3.6.9.2 Comparison with OuterSPACE, ExTensor, and Gamma ............ 64 Chapter 4: HybridServing: token-level hybrid serving for large language model .............. 70 4.1 Abstract.................................................................................... 70 4.2 Introduction ............................................................................... 71 4.3 Motivation and Background .............................................................. 73 4.3.1 Machine Learning Scheduling................................................... 
74 4.3.2 Request Preemption and KV-Cache ............................................. 76 iii 4.3.3 Token-level Latency analysis .................................................... 77 4.3.3.1 Prefilling and Decoding .............................................. 78 4.3.4 Latency and accuracy trade-off .................................................. 79 4.4 Implementation ........................................................................... 79 4.4.1 Hybrid execution ................................................................. 79 4.4.1.1 Layer selection ........................................................ 80 4.4.2 Scheduling........................................................................ 80 4.4.2.1 Scheduling algorithm ................................................. 80 4.4.2.2 Latency prediction .................................................... 82 4.4.2.3 Customizable policy and Task Group................................ 82 4.5 HybridServing: System design ........................................................... 83 4.6 Evaluation ................................................................................. 84 4.6.1 Experiment setup................................................................. 84 4.6.2 Layer Skipping Model Accuracy ................................................ 84 4.6.3 Latency prediction ............................................................... 85 4.6.4 Synthetic Traces.................................................................. 86 4.6.5 Comparing with alternative implementations................................... 88 4.6.5.1 Switching model ...................................................... 88 4.6.5.2 Separated Server ...................................................... 89 4.7 Conclusion ................................................................................ 89 References ........................................................................................... 91 iv List of Tables 2.1 Datasets.................................................................................... 31 2.2 Real machine test result, all normalized to Gemini ..................................... 31 3.1 Stream ISA Extension. R0-R4 are general-purpose registers, F0,F1 are FP registers, IMM is an immediate value. ........................................................ 45 3.2 Architecture Conf. ........................................................................ 56 3.3 GPM Apps ................................................................................ 57 3.4 Graph Datasets ............................................................................ 58 3.5 Matrix and tensor Datasets ............................................................... 58 4.1 User groups................................................................................ 83 4.2 Performance and accuracy of model with different number of layers skipped ........ 85 v List of Figures 2.1 Bottom-up BFS Algorithm ............................................................... 7 2.2 GRAPHP ................................................................................... 10 2.3 Examples of Algorithms with Loop-carried Dependency. .............................. 11 2.4 GRAPHQ .................................................................................. 12 2.5 GRAPHQ: Redundant Computation and Communication .............................. 14 2.6 GRAPHS workflow ....................................................................... 
15 2.7 Bottom-up BFS in GRAPHS.............................................................. 16 2.8 Circulant Scheduling of GRAPHS........................................................ 19 2.9 GRAPHS Runtime Implementation ...................................................... 24 2.10 GraphS Runtime .......................................................................... 25 2.11 GRAPHSR Insights ....................................................................... 26 2.12 GRAPHSR Runtime Implementation .................................................... 27 2.13 Percentage of unsatisfied high-degree vertex (Kcore4 & MIS) ......................... 29 2.15 GRAPHS Runtime ........................................................................ 32 2.16 GRAPHS Energy .......................................................................... 32 2.17 GRAPHS Actual Compute ................................................................ 32 2.18 GRAPHS Communication ................................................................ 32 2.19 GRAPHS Breakdown ..................................................................... 33 3.1 Pattern Enumeration ...................................................................... 40 3.2 Tailed-triangle mining .................................................................... 41 3.3 Pattern enumeration with Stream ISA ................................................... 48 vi 3.14 Length Distribution ...................................................................... 62 3.15 Tensor Computation Speedup ............................................................ 64 3.16 Gmean speedup of OuterSPACE, ExTensor, Gamma, and SparseCore with outerproduct, and Gustavson’s algorithm over SparseCore with inner-product ............. 65 3.4 Different SPMSPM Dataflows with Stream ISA ......................................... 67 3.5 SparseCore Architecture.................................................................. 68 3.6 Parallel Comparison ...................................................................... 68 3.7 Speedup of SparseCore over Flexminer and TireJax ................................... 69 3.8 Speedups over CPU ....................................................................... 69 3.9 CPU execution breakdown ............................................................... 69 3.10 SparseCore execution breakdown ........................................................ 69 3.11 SparseCore compared to GPU implementations (log scale) ........................... 69 3.12 Varying the Number of SUs .............................................................. 69 3.13 Varying S-Cache Bandwidth ............................................................. 69 4.1 Super Serve architecture .................................................................. 75 4.2 Hybrid execution .......................................................................... 80 4.3 HybridServing system .................................................................... 84 4.4 Linear model for latency prediction...................................................... 85 4.5 Token-level latency violations break down at different request rate (with and without latency prediction) .................................................................... 86 4.6 GPU memory usage and waiting queue length during the execution................... 88 4.7 Performance Metrics for requests in 4 groups ........................................... 
88 4.8 GPU memory usage, waiting and running queue length during the execution......... 89 4.9 Throughput with different implementation (HS as HybridServing, SS as separated server ...................................................................................... 90 vii Abstract Machine learning has been used in many domains. In recent years, transformer-based large language models have shown great performance in many real-world use cases. However, due to their significant size, they require substantial computational power for efficient serving. To address this challenge, in this thesis, we developed algorithm and system co-optimization for large language model inference. We designed a scheduling algorithm to improve the Quality of Service (QoS) for requests of different priorities. Our goal is to develop a system that can better serve large language models while maximizing system utilization across different scenarios and requirements. Another widely explored area is graph, which has been used to model many real-world systems. People have explored the connection between graph and machine learning by applying machine learning algorithm to graph data. In our previous research, we explored various algorithm and system optimizations for graph mining. In particular, we introduced the Stream ISA targeting sparse data computation. Our experiments show significant speedup over baseline CPU and accelerators. For graph pattern mining(GPM), our system outperforms baseline CPU system by on average 13.5× and up to 64.4×. It also outperforms FlexMiner and TrieJax by on average 2.7×and 3651.2×, up to 14.8×and 43912.3×respectively. For Stream ISA in machine learning tasks, our results demonstrated its potential application in sparse tensor computation. We achieved 6.9×, 1.88×, 2.78× , 4.49×, and 2.44× speedup for inner-product, outer-product, Gustavson’s algorithm, tensor times vector, and tensor times matrix respectively. viii Chapter 1 Introduction Machine learning has been widely used in many domains, such as speech recognition[1], content generation[2], image and video processing[3], etc. They have demonstrated their value with human-comparable performance[4, 5]. In recent years, with the invention of transformer architecture, large language models(LLMs) have demonstrated great value in a wide range of tasks[6– 10]. At the same time, the size of these large-scale models has grown exponentially, e.g., from several billion parameters[11] to over one hundred billion parameters[12]. Such dramatic increase in model size has introduced significant challenges in the computation of such machine learning models. Despite the ever-challenging computation, thanks to continuous optimizations in both algorithm and system, the quality and performance of machine learning models have continued to increase. Among those optimizations, a significant portion were made possible by combining the genius in algorithm and system. For instance, recent work [13] utilized constant block size 4×1 sparsity pattern algorithm to better exploit new CPU SIMD instructions. Similarly, other work[14] combined the GPU hardware support for sparsity with fine grained structured sparsity to further enhance performance. Additionally, under the distributed scenario, very recent works [15, 16] designed the scheduling algorithm to fit with the limits of commodity hardwares. Even though numerous researches have been done in this field, we found that some interesting issues remain little explored. 
This thesis proposes novel approaches to explore the following issues: 1 • Request priority: Requests may have different priorities. Recent work [17] only tried to prioritize real-time tasks over best-effort tasks via preemption. We will schedule requests based on their priorities to improve the Quality of Service. • Hybrid computing: Most machine learning systems highly rely on the computation power of GPU. In those systems, CPUs are mainly used to control the GPU. Previous hybrid systems [18] are tailored to support only specific machine learning algorithms. We will utilize both GPU and CPU resources for computation. • Heterogeneous and lower-end GPU devices: Most systems are optimized for cutting-edge data center GPU, characterized by high bandwidth, large memory capacity, and strong computation power. However, such devices could be significantly more expensive than their lower-end counterparts. Furthermore, real-world clusters may consist of nodes equipped with various GPU types. Previous system [15] was only optimized for large language model training on commodity GPU. We will introduce optimization targeting various GPU devices, aiming to boost the performance of lower-end and even heterogeneous GPUs to be comparable with high-end ones, thus increasing the performance-per-dollar or performanceper-investment. As mentioned above, previous works have tried to solve these issues individually and partially. We believe these issues can be better addressed when considered as a whole. We are planning to develop a machine learning inference system to address the above issues via algorithm and system co-optimization. Another widely explored research area is graph. Graph has been used to model many real-world systems such as social networks[19], road maps[20], biochemical structures[21–23], etc. Due to its common usage, people have been applying machine learning and data mining algorithms to graph data[24]. In our previous research, we explored various algorithm and system optimizations for graph mining. In particular, we introduced the Stream ISA targeting sparse data computation. Our experiments show significant speedup over baseline CPU and accelerators for graph mining 2 algorithms. For machine learning tasks, our results also demonstrated its potential application in sparse tensor computation. We are planning to utilize stream ISA in our machine learning inference system. 3 Chapter 2 GraphS: Eliminating Redundant Computation and Communication in PIM-Based Graph Processing with Dependence Scheduling 2.1 Abstract Processing-In-Memory (PIM) architectures based on recent technology advances, e.g., Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), demonstrate great potential for graph processing. Recent works made great progress in advancing PIM-based graph processing architectures based on the seminal architecture Tesseract: GRAPHP reduces inter-cube communication by better graph partition, and GRAPHQ and GRAPHH enforce regular communication. However, existing solutions did not address a less intuitive problem — redundant computation and communication. This paper proposes GRAPHS, a novel compiler, runtime system, and architecture co-designed PIM-based solution that completely eliminates redundant computation and inter-cube communication. GRAPHS makes the key and previously unknown observation that the distributed execution of graph algorithms in memory cubes does not precisely match algorithm semantics expressed in vertex-oriented User-Defined Functions (UDFs). 
Specifically, for each vertex, dependencies exist among the processing of neighbors —– when certain condition is satisfied the following neighbors 4 do not need to be processed. We call the unique code pattern in graph algorithms as loop-carried dependency and discover that it exists in several important algorithms, including BFS, MIS, K-core, and K-means. All current PIM-based architectures do not honor this semantics precisely and lead to redundant computation and communication. GRAPHS addresses the problem with two components: 1) a code analyzer to analyze the unmodified UDFs to identify loop-carried dependency; 2) a PIM-based runtime and architecture that faithfully enforces loop-carried dependency. We present the details of GRAPHS runtime design and also propose a variant GRAPHSR that further improves the performance for certain algorithms by accumulating partial results. The evaluation based on both simulation and real cluster shows that the proposed ideas lead to significant speedups and communication reduction compared to state-of-the-art architectures (GRAPHQ and Gemini). 2.2 Introduction Graphs capture relationships between data items, such as interactions or dependencies. Graph analytics have emerged as an important way to understand the relationships between heterogeneous types of data, allowing data analysts to draw valuable insights from patterns in the data for a wide range of applications, including machine learning tasks [25], natural language processing [26– 28], anomaly detection [29–31], clustering [32, 33], recommendation [34–37], social influence analysis [38–40], bioinformatics [41–43]. To improve programmability, graph-oriented programming mode (API) [44–46] can easily express graph algorithms by allowing programmers to “think as the vertex”. Programmers can express algorithms in vertex or edge function based on neighbor vertices. Real-world graphs are typically sparse and stored in compressed representation, which poses challenges for conventional memory hierarchy. Graph algorithms have poor locality due to random accesses in updating the neighbors, and high memory bandwidth requirement due to the small amount of computation between random accesses. These two challenges can be well tackled by Processing-In-Memory (PIM) architecture, which reduces data movement between memory and computation by placing 5 computing logic inside memory dies. Though once believed to be impractical, PIM recently became a practical architecture due to the emerging 3D stacked memory technology, such as Hybrid Memory Cube (HMC) [47] and High Bandwidth Memory (HBM) [48]. In general, the architecture is composed of multiple memory cubes connected by external links (e.g., SerDes links in HMC with 120GB/s per link). Within each cube, multiple DRAM dies are stacked with Through Silicon Via (TSV) and provide higher internal memory bandwidth up to 320GB/s. At the bottom of the dies, computation logic (e.g., simple cores) can be embedded. Performing computation at in-memory compute logic can reduce data movements in memory hierarchy. More importantly, PIM provides “memory-capacity-proportional” bandwidth and scalability. Recently, several PIM-based graph processing architectures have been proposed, greatly advancing the field. Tesseract [49] is the seminal PIM architecture that supports vertex programming model with architectural primitives to enable inter-cube communication. GRAPHP [50] codesigns graph partition, programming model, and architecture to reduce inter-cube communication. 
GRAPHQ [51] and GRAPHH [52] enforce regular inter-cube communication and optimize bandwidth usage. All solutions rely on and show the great potential of vertically integrating [53] the design of hardware, runtime, and architecture. However, none of them pays enough attention to the highest layer, the algorithm, which is mapped to the architecture by the vertex programming model and runtime system. This paper demonstrates the gap between algorithms and existing runtime systems, which leads to redundant computation and communication.

In a vertex-oriented API, abstract computation is specified as vertex-centric User-Defined Functions (UDFs) P(v), which are executed in parallel on each vertex v. In each P(v), programmers can access the neighbors of v as if they were local. For example, GRAPHP [50] provides two APIs: GenUpdate, in which a vertex generates updates based on all its neighbors; and ApplyUpdate, which updates the property of a vertex. The computation in these UDFs can be "plugged in" to the global execution in multiple cubes by the runtime, which orchestrates low-level execution in the architecture [49–52]. Note that the APIs and runtime system are decoupled; the above APIs of GRAPHP can also be implemented by the GRAPHQ [51] runtime. More details will be discussed in Section 2.3.1.

    1  def bfs(Array[Vertex] nbr) {
    2    foreach v in V {
    3      foreach u in nbr(v) {
    4        if (not visited[v] &&
    5            frontier[u]) {
    6          visited[v] = true;
    7          frontier[v] = true;
    8          break;
    9        }
    10     } // end foreach u
    11   } // end foreach v
    12 } // end bottomUpBfs

    (a) bottom-up BFS

    1  def GenUpdate(u){
    2    update = default_value;
    3    foreach v in nbr(u){
    4      if(not visited[u] && frontier[v]){
    5        visited[u] = true;
    6        frontier[u] = true;
    7        update = depth[v]+1;
    8        break;
    9      }
    10   }
    11   return update;
    12 }
    13 def ApplyUpdate(v, update){
    14   depth[v] = min(depth[v],update);
    15   if(update!=default_value)
    16     visited[v] = true;
    17 }

    (b) Implementation of BFS in GraphQ

Figure 2.1: Bottom-up BFS Algorithm

We focus on a common code pattern in graph algorithms and UDFs — loop-carried dependency, which implements the following semantics: when the UDF (e.g., GenUpdate) traverses all neighbors of a vertex in a loop, it performs some operation depending on previous neighbors. Consider two neighbors n1 and n2 of vertex v; according to the dependence, if n1 satisfies a certain condition, n2 will not be processed. However, if the processing of n1 and n2 is performed in different cubes, the cube that processes n2 does not know the outcome of processing n1.

This pattern exists in several important algorithms. Consider the bottom-up breadth-first search (BFS) [54] with pseudocode in Figure 2.1 (a). In each iteration, the algorithm visits the neighbors of "unvisited" vertices. If one of the neighbors of the current unvisited vertex is in the "frontier", it will no longer traverse other neighbors and marks the vertex as "visited". Compared to the conventional BFS, bottom-up BFS avoids the inefficiency due to multiple visits of one new vertex in the frontier. According to [54], it significantly reduces the number of edges traversed.

Figure 2.1 (b) shows the implementation of bottom-up BFS in the API of GRAPHP.¹ The GenUpdate and ApplyUpdate functions specify the computation to process each neighbor of a vertex and the vertex property update, respectively. We see that GenUpdate for bottom-up BFS has control dependency: it iterates over the neighbors of the local vertex u and breaks out of the loop when it finds a neighbor that belongs to the frontier (Line 4). This control dependency expresses the semantics of skipping the following edges and avoids unnecessary edge traversals.

¹ The three APIs of GRAPHQ — processEdge, reduce, apply — are not suitable for expressing the semantics since the processing of incoming edges is performed in different processEdge functions. Nevertheless, GRAPHP's APIs can be supported by GRAPHQ; see Section 2.3.1 for details.

In current PIM-based graph processing architectures, although programmers can write such break statements in UDFs, they are not precisely enforced by the runtime during distributed execution in different cubes, leading to redundant computation and inter-cube communication. This is because current runtime systems treat UDFs as "black boxes", which leads to sub-optimal integration of the semantics inside UDFs into the global execution. Nevertheless, this divergence in execution semantics typically does not affect correctness, but rather the efficiency of the algorithm implementation. This is perhaps the reason the problem was neglected in all prior works — not only in PIM-based graph processing but also in state-of-the-art distributed graph processing frameworks, e.g., Gemini [55]. In Section 2.3.2, we show that the pattern exists in several other important algorithms, including MIS, K-core, and K-means.

The paper proposes GRAPHS, a new compiler, runtime system, and architecture co-designed PIM-based solution that completely eliminates redundant computation and inter-cube communication due to loop-carried dependency. GRAPHS solves the problem with two components: 1) UDF analysis to identify and augment loop-carried dependency in unmodified UDFs; 2) a PIM-based runtime and architecture to faithfully enforce loop-carried dependency semantics. The key requirement is that the computation related to a given vertex in different cubes is performed sequentially. GRAPHS ensures it by circulant scheduling, which distributes execution of UDFs across cubes while still preserving enough parallelism. We present the details of the GRAPHS runtime design and also propose a variant, GRAPHSR, that further improves the performance of certain algorithms by accumulating partial results.

We evaluate GRAPHS with a zSim-based simulator using five real-world graphs as large as Friendster and four algorithms, i.e., bottom-up BFS, MIS, K-core, and K-means, that are not commonly used in recent works on graph processing architecture. We compare GRAPHS to GRAPHQ, the state-of-the-art PIM-based graph processing architecture. The results show that GRAPHS achieves on average 2.2× (maximum 4.37×) speedup and on average 32.8% (maximum 50.5%) inter-cube communication reduction. They lead to 51.6% energy saving on average. With partial result propagation, GRAPHSR achieves on average 12.5× (maximum 19.91×) speedup and on average 90.23% (maximum 97.79%) inter-cube communication reduction. They lead to 91.4% energy saving on average. To demonstrate the applicability of GRAPHS's techniques in distributed graph processing frameworks, we implement the ideas in Gemini [55], the state-of-the-art framework with the highest performance. We conduct experiments on a real cluster (not a simulator) with 16 nodes using a real-world graph, Twitter-2010 [56], with 1.5 billion edges on the four algorithms. The results show that the proposed ideas lead to significant speedups (up to 2.3×) and communication reduction (up to 63%).
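Before moving to the background, here is a concrete single-machine illustration of the break in Figure 2.1 (a). This is a minimal C++ sketch of our own (the graph, function, and variable names are illustrative, not GRAPHS or GraphQ code); it runs one bottom-up BFS step twice, once honoring the break and once ignoring it, and counts how many edges are inspected.

    // Toy sketch: one bottom-up BFS step with and without the loop-carried
    // "break", counting inspected edges. Names are illustrative only.
    #include <cstdio>
    #include <vector>

    static int bottom_up_step(const std::vector<std::vector<int>>& nbr,
                              std::vector<bool>& visited,
                              std::vector<bool>& frontier,
                              bool honor_break) {
        int edges_inspected = 0;
        std::vector<bool> next_frontier(nbr.size(), false);
        for (std::size_t v = 0; v < nbr.size(); ++v) {
            if (visited[v]) continue;            // only unvisited vertices scan neighbors
            for (int u : nbr[v]) {
                ++edges_inspected;
                if (frontier[u]) {
                    visited[v] = true;           // v is reached in this iteration
                    next_frontier[v] = true;
                    if (honor_break) break;      // skip the remaining neighbors of v
                }
            }
        }
        frontier = next_frontier;
        return edges_inspected;
    }

    int main() {
        // Tiny undirected graph; vertex 0 is the BFS root and the current frontier.
        std::vector<std::vector<int>> nbr = {
            {1, 2, 3, 4, 5}, {0, 2, 3}, {0, 1, 4}, {0, 1, 5}, {0, 2, 5}, {0, 3, 4}};
        const bool options[2] = {false, true};
        for (bool honor : options) {
            std::vector<bool> visited(6, false), frontier(6, false);
            visited[0] = true;
            frontier[0] = true;
            std::printf("honor_break=%d: %d edges inspected\n",
                        (int)honor, bottom_up_step(nbr, visited, frontier, honor));
        }
        return 0;
    }

On this toy graph the break cuts the inspected edges from 15 to 5. When the neighbors of a vertex are spread over multiple cubes, the cubes after the first hit cannot apply the break on their own, which is exactly the redundancy GRAPHS targets.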
2.3 Background and Motivation 2.3.1 PIM-Based Graph Processing Processing-In-Memory (PIM) architecture reduces data movements by performing computations close to where the data are stored. 3D memory technologies (e.g., Hybrid Memory Cubes (HMC) [47] and High Bandwidth Memory (HBM) [48]) make PIM feasible by integrating memory dies and compute logic in the same package, achieving high memory bandwidth and low latency. Similar to recent works [49–51], this paper considers a general PIM architecture that captures key features of specific PIM implementations. The architecture is composed of multiple cubes connected by external links (e.g., SerDes links in HMC with 120GB/s per link). Within each cube, multiple DRAM dies are stacked with Through Silicon Via (TSV) and provide higher internal memory bandwidth up to 320GB/s. At the bottom of the dies, computational logics (e.g., simple cores) could be embedded. In Tesseract [49], a small single-issue in-order core is placed at the logic die of each vault. It is feasible because the area of 32 ARM Cortex-A5 processors including an FPU (0.68 mm2 for each core [57]) corresponds to only 9.6% of the area of an 8 Gb DRAM die area (e.g., 226 mm2 [58]). GRAPHS assumes the same setting. With 16 cubes, the whole system delivers 9 5 TB/s memory bandwidth, considerably larger than conventional memory systems. Moreover, the memory bandwidth grows proportionally with capacity in a scalable manner. Figure 2.2: GRAPHP All PIM-based graph processing architectures partition the graph into subgraphs that can fit into each cube. The connectivity of the subgraphs leads to inter-cube communication. GRAPHP [50] shows that communication is affected by the graph partition strategy and proposes a method to reduce the communication from one inter-cube message per crosscube edge in Tesseract [49] to one inter-cube message per replica synchronization. This is enabled by assigning disjoint edge sets to different cubes and generate replicas when a vertex is connected to two edges in different cubes. Results show that it can reduce inter-cube communication by at least 90% in all experiments with broadcast optimization,. Figure 2.2 illustrates the insights of GRAPHP. ✈and ⊚ represent source and destination master vertices, respectively. ® represents the replica of a vertex. Each vertex can have one master and multiple replicas in other cubes depending on graph partition. In the example, v0 has 10 neighbors, all edges connected to v0 are assigned to cube 0, only one v1 is allocated in the same cube. Since the master of the others v2 ∼ v10 are allocated in remote cube 1 ∼ cube 3, their replicas are created in cube 1. When the masters of v2 ∼ v10 are updated, the runtime system will generate inter-cube messages to synchronize the replicas, indicated by the arrows between cubes. It also shows how the two APIs are executed: GenUpdate generates the update for a vertex based on all its neighbors; ApplyUpdate updates the property of a vertex, at the same time the runtime system updates all its replicas. GRAPHQ [51] enables regular and batched communication by generating inter-cube messages between a specific cube pair together. 
Since APIs and runtime system are decoupled, GRAPHQ 10 1 def mis(Array[Vertex] nbr) { 2 foreach v in V { 3 if(not removed[v]){ 4 foreach u in nbr(v) { 5 if(color[u] < color[v]){ 6 result[v] = miscnt 7 removed[v] = true 8 } 9 } // end foreach u 10 } 11 } // end foreach v 12 }//end MIS (a) MIS 1st phase 1 def GenUpdate(u){ 2 color_t = default_value; 3 foreach v in nbr(u){ 4 if(not removed[v]){ 5 color_t = min(color_t,color[v]) 6 } 7 } 8 return color_t; 9 } 10 11 def ApplyUpdate(v, update){ 12 color_min[v] = min(color_min[v], update); 13 if(color_min[v]>color[v]) 14 removed[v] = true; 15 } (b) MIS 1st phase (GraphQ) 1 def kcore(Array[Vertex] nbr) { 2 foreach v in V { 3 if(not removed[v]){ 4 foreach u in nbr(v) { 5 if(not removed[u]) 6 count++ 7 } 8 } 9 if (count<K_NUM) 10 removed[v] = true 11 } // end foreach v 12 }//end MIS (c) K-core 1 def GenUpdate(u){ 2 count = default_value; 3 foreach v in nbr(u){ 4 if(not removed[v]){ 5 count++; 6 if(count >= K_NUM) 7 break; 8 } 9 } 10 return count; 11 } 12 13 def ApplyUpdate(v, update){ 14 count[v] = sum(count[v], update); 15 if(count[v]<K_NUM) 16 removed[v] = true; 17 } (d) K-core (GraphQ) 1 def Kmeans(Array[Vertex] nbr) { 2 foreach v in V { 3 if(visited[v]){ 4 foreach u in nbr(v) { 5 if(not visited[u]){ 6 cluster[u]=cluster[v] 7 visited[u] = true 8 } 9 } 10 visited[v] = false 11 } // end foreach u 12 } // end foreach v 13 } // end Kmeans (e) Kmeans 1 def GenUpdate(u){ 2 update = default_value; 3 foreach v in nbr(u){ 4 if(not visited[u] && frontier[v]){ 5 visited[u] = true; 6 update = cluster[v]; 7 break; 8 } 9 } 10 return update; 11 } 12 13 def ApplyUpdate(v, update){ 14 value[v] = min(value[v], update); 15 if(update!=default_value) 16 visited[v] = true; 17 } (f) Kmeans (GraphQ) Figure 2.3: Examples of Algorithms with Loop-carried Dependency. can also support GenUpdate and ApplyUpdate. Since all communication from cube i and cube j will happen together in batch, GRAPHQ naturally performs partial GenUpdate in the sender before they are transferred to the receiver cube. Thus we have replicas of the destination (v0) in cube 1 ∼ cube 3. As shown in Figure 2.4, when partially reduced values are received in cube 0, they can be reduced to a single value and apply to the master of v0 by ApplyUpdate. Thus, GenUpdate function is executed in both local and remote cubes. Based on the above explanation, GRAPHQ supports GRAPHP’s APIs (which can express loop-carried dependency) but enables more efficient execution than GRAPHP, we use GRAPHQ with GenUpdate and ApplyUpdate as the baseline. 2.3.2 Algorithms with Loop-carried Dependency In this section, we explain four important iterative graph algorithms with semantics of loop-carried dependency. Figure 2.3 shows the pseudocode of one iteration of each algorithm (except bottom-up BFS, which was already shown in Figure 2.1) in sequential implementation and the implementation using vertex-oriented API. Breadth-First Search (BFS). BFS is an iterative graph traversal algorithm that finds the shortest path in an unweighted graph. The conventional BFS algorithm follows the top-down approach: 11 Figure 2.4: GRAPHQ BFS first visits a root vertex. In each iteration, the newly “visited” vertices become the “frontier” and BFS visits all the neighbors of the “frontier”. The bottom-up BFS [54] discussed earlier changes the direction of traversal. In each iteration, bottom-up BFS visits the neighbors of “unvisited” vertices. 
If one of the neighbors of the current unvisited vertex is in the “frontier”, it will skip the traversal of other neighbors and add the vertex to “visited”. Bottom-up BFS can significantly reduce the number of edges traversed. Maximal Independent Set (MIS). An independent set is a set of vertices in a graph, in which any two vertices are non-adjacent. A Maximal Independent Set (MIS) is an independent set that is not a subset of any other independent set. A heuristic MIS algorithm (Figure 2.3 (a)) is based on graph coloring. First, each vertex is assigned a unique value (color) and marked as active. In each iteration, we find a new MIS composed of active vertices with the smallest color value among their active neighbors’ colors. The new MIS vertices will be removed from further execution (marked as inactive). Then we will remove neighbors of that has been marked as new MIS vertices. K-core. A K-core of a graph G is a maximal subgraph of G in which all vertices have a degree at least k. The standard K-core algorithm (Figure 2.3 (c)) removes the vertices with degree less than K. Since removing vertices will decrease the degree of its neighbors, it will do it iteratively until 12 no more removal is needed. When counting the number of neighbors for each vertex, if the count reaches K, we can exit the loop and mark this vertex as “no remove” in the iteration. K-means. K-means is a popular clustering algorithm in data mining. Graph-based K-means [59] is one of its variants where the distance between two vertices is defined as the length of the shortest path between them (assuming that the length of every edge is 1). The algorithm shown in Figure 2.3 (e) consists of four steps: (1) Randomly generate a set of cluster centers; (2) Assign every vertices to the nearest cluster center; (3) Calculate the sum of distance from every vertex to its belonging cluster center; (4) If the clustering is good enough or the number of iterations exceeds some prespecified threshold, terminate the algorithm; Else, goto (1) and repeat the algorithm. 2.3.3 Why Did We Waste Computation and Communication? The essential reason for the wasted computation and communication is that the pattern of loopcarried dependency, although can be expressed in UDFs, is invisible to the runtime system. Figure 2.5 illustrates the execution of bottom-up BFS in GRAPHQ. Suppose the condition is satisfied after processing v3 in cube 1, thus v4 will not be processed. According to Figure 2.1 (a), all following vertices (v5 ∼ v10) should also be skipped, but since cube 2 and cube 3 are not aware of the execution in cube 1, they will just process all vertices as usual. Fundamentally, the problem is that the runtime system lacks the knowledge of the execution status in other cubes. This leads to the redundant computation because otherwise, we do not execute GenUpdate function of v5 ∼ v10, and more importantly, redundant communication because after executing GenUpdate in cube 2 and cube 3, the results will be sent in batch to cube 0. Wasted communication is indicated in Figure 2.5. Note that the problem does not exist in GRAPHP, which performs reduce in the same cube as the destination master vertex. However, GRAPHQ is far superior to GRAPHP because it generates less and batched communication, together with other optimizations. Results in [51] show that GRAPHQ can outperform GRAPHP by 2× on average. 13 Figure 2.5: GRAPHQ: Redundant Computation and Communication The problem exists for all algorithms with loop dependency. 
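To make the K-core dependency concrete before the per-algorithm summary that follows, here is a minimal single-machine C++ sketch of our own (names are illustrative, not GRAPHS code): a vertex stops inspecting neighbors as soon as K live neighbors are counted, which is exactly the work a remote cube cannot know to skip.

    // Toy sketch of the K-core loop-carried dependency: counting stops as soon
    // as K live neighbors are found. Names are illustrative only.
    #include <cstdio>
    #include <vector>

    int main() {
        const int K = 2;
        std::vector<std::vector<int>> nbr = {
            {1, 2, 3, 4}, {0, 2}, {0, 1, 3}, {0, 2}, {0}};
        std::vector<bool> removed(nbr.size(), false);
        removed[4] = true;  // pretend vertex 4 was removed in an earlier iteration

        for (std::size_t v = 0; v < nbr.size(); ++v) {
            if (removed[v]) continue;
            int count = 0, inspected = 0;
            for (int u : nbr[v]) {
                ++inspected;
                if (!removed[u]) {
                    ++count;
                    if (count >= K) break;   // early exit: enough live neighbors found
                }
            }
            if (count < K) removed[v] = true;
            std::printf("vertex %zu: inspected %d of %zu neighbors, removed=%d\n",
                        v, inspected, nbr[v].size(), removed[v] ? 1 : 0);
        }
        return 0;
    }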
We already explained bottom-up BFS in Section 2.2. For the other three graph algorithms discussed earlier: MIS has control dependency: if one vertex already finds that it does not have the smallest color, it will not be marked as a new MIS vertex in this iteration and thus breaks out of the neighbor traversal. K-core has both data and control dependency: if the vertex has at least K neighbors, it will not be marked as removed in this iteration and further computation can be skipped. K-means has control dependency: when one of the neighbors is assigned to the nearest cluster center, the vertex can be assigned the same center.

Note that we use these algorithms as clean and typical examples to demonstrate the effectiveness of our idea. These kernels can be used as the building blocks of other, more complicated algorithms. Note also that these algorithms are very different from the familiar examples (e.g., BFS, SSSP, CC, PageRank) that are commonly used in prior works on graph processing architecture; thus our paper makes an attempt to consider less-studied but equally important algorithms. Indeed, our new insights cannot be found in the four traditional graph algorithms.

Figure 2.6: GRAPHS workflow

2.4 GraphS Overview

To eliminate redundant computation and communication, we propose GRAPHS, a new PIM-based graph processing architecture and runtime system that precisely enforces loop-carried dependency semantics in the UDFs of an algorithm implementation. The overall workflow is shown in Figure 2.6. GRAPHS consists of two components. The first one is UDF analysis, which determines whether the UDFs contain loop-carried dependency. If so, the analysis identifies the dependency data that need to be propagated during the execution, then inserts dependency communication operations so that loop-carried dependency can be precisely enforced in different cubes. We will describe the analysis in Section 2.5. The second component is the runtime support to enforce loop-carried dependency on the analyzed UDF code. The key mechanism is to introduce dependency communication: for each vertex, it propagates dependency information across replica cubes and eventually back to the master cube. To enforce the precise semantics, for a given vertex v, the execution of GenUpdate in the master and replica cubes (sender cubes {1,2,3} in Figure 2.4) needs to be performed sequentially. The key challenge is how to enforce the sequential semantics while still enabling enough parallelism. We solve this problem by circulant scheduling and other communication optimizations (Section 2.6).

    1  bool Bitmap[vertex_num]
    2  def get_dep(v) {return Bitmap[v];}
    3  def emit_dep(v) {Bitmap[v]=true;}
    4
    5  def GenUpdate(u){
    6    if(get_dep(u)==1) //added code
    7      return default_value;
    8    update = default_value;
    9    foreach v in nbr(u){
    10     if(not visited[u] && frontier[v]){
    11       emit_dep(u); //added code
    12       visited[u] = true;
    13       frontier[u] = true;
    14       update = depth[v]+1;
    15       break;
    16     }
    17   }
    18   return update;
    19 }

Figure 2.7: Bottom-up BFS in GRAPHS

2.5 User-Defined-Function Analysis

2.5.1 GraphS Primitives

In GRAPHS, we add primitives to keep track of dependency, which are intended to be inserted by the UDF analysis component and executed by the runtime system. They are transparent to programmers so that GRAPHS can execute unmodified codes. The additional internal data structure is a Bitmap over all vertices — one bit for each vertex. The Bitmap maintains the runtime information on whether computation still needs to be performed for a vertex at a cube.
Two primitives are used to update dependency information in the Bitmap: emit dep sets the corresponding bit in Bitmap for a vertex; and get dep returns the currently received dependency information when executing GenUpdate for a vertex. Figure 2.7 shows how the dependency information and primitive should appear in the analyzed UDFs using bottom-up BFS as an example. When processing a vertex u, the runtime system first executes get dep, if the returned value is true, the following computation related to this vertex will 16 be skipped (Line 6 and 7). After the vertex u is added to the current frontier, emit dep is executed to set the corresponding bit for vertex u in Bitmap (Line 11). In the next section, we will describe the details of GRAPHS analyzer to generate the instrumented codes. The code analyzer can assume that all cubes have an abstract and shared view of Bitmap: GenUpdate can read it using get dep, and update it using emit dep. A natural question is whether different GenUpdate function will modify the Bitmap at the same time, causing race condition? The answer is no, since to ensure precise loop-carried dependence, the execution of GenUpdate of a given vertex must be executed sequentially in different cubes. More details of the requirement and its implications will be discussed in Section 2.6.2. 2.5.2 GraphS Code Analyzer We make certain assumptions on graph processing UDFs to simplify the code analyzer design. These assumptions are inherent in graph algorithms. Therefore, the ideas of GRAPHS code analyzer are independent of graph processing runtimes and APIs. Specifically, we make the following assumptions. 1) We assume that the graph algorithms are expressed in vertex-oriented UDFs. Our analyzer only inspects these UDFs. 2) All UDFs are defined using lambda expressions that capture variables. Copy statement about those related variables is not allowed, so we can locate the UDFs and variables. 3) There is a loop to traverse neighbor vertices. It is true for almost all graph algorithms; the details may differ in minor details in different frameworks. 4) If there is a nested loop with loop dependency in each level, the outermost loop performs neighbor traversal, and the analyzer will only instrument the outermost loop. Based on these assumptions, we design GRAPHS analyzer as two passes in clang LibTooling at clang-AST level. In the first pass, the analyzer locates the UDFs and analyzes the function body to determine whether loop-carried dependency exists. We use clang-lib to compile the source code and obtain the corresponding Clang-AST. Then, the analyzer traverses AST to: 1) locate the UDF; 2) identify all functions that process edges; 3) search for all for-loops that traverses neighbors and check whether loop-dependency patterns exist — there exists at least one break statement related to the for-loop; 4) store all AST nodes of interests. 17 If dependency is identified, the second pass of the analyzer identifies the dependency data that should be transferred and performs source-to-source transformation. Specifically, it first inserts dependency communication initialization code. Before the loop in UDF, it inserts new control flow that checks dependency in preceding loops with get dep. Inside the loop in UDF, it inserts emit dep before the corresponding break statement to propagate the dependency message. As a result, GRAPHS code analyzer can generate instrumented codes as in Figure 2.7. 
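For concreteness, the following is a minimal software sketch of the two primitives over a bit-packed bitmap. The class and method names are ours for illustration and are not the GRAPHS hardware or runtime interface; in GRAPHS the Bitmap is replicated per cube and written remotely through dependency communication, and the dirty_bytes helper only mirrors the byte-granularity transfer rule described later in Section 2.7.1.

    // Minimal sketch of the GRAPHS primitives over a bit-packed bitmap.
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    class DepBitmap {
    public:
        explicit DepBitmap(std::size_t num_vertices)
            : bits_((num_vertices + 7) / 8, 0) {}

        // get_dep: was the dependency condition already satisfied for v?
        bool get_dep(std::size_t v) const {
            return (bits_[v >> 3] >> (v & 7)) & 1u;
        }

        // emit_dep: the remaining work for v can be skipped from now on.
        void emit_dep(std::size_t v) { bits_[v >> 3] |= (1u << (v & 7)); }

        // Bytes that would actually be shipped under a "non-zero bytes only"
        // transfer rule (cf. Section 2.7.1).
        std::size_t dirty_bytes() const {
            std::size_t n = 0;
            for (std::uint8_t b : bits_) n += (b != 0);
            return n;
        }

    private:
        std::vector<std::uint8_t> bits_;
    };

    int main() {
        DepBitmap bm(1000);
        bm.emit_dep(42);   // e.g., vertex 42 found a frontier neighbor
        std::printf("get_dep(42)=%d get_dep(43)=%d dirty_bytes=%zu\n",
                    (int)bm.get_dep(42), (int)bm.get_dep(43), bm.dirty_bytes());
        return 0;
    }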
Note that the whole procedure does not require any programmer involvement for code modifications and is completely transparent.

2.6 Circulant Scheduling: Enforcing Dependency with Parallelism

2.6.1 Problems and Insights

With instrumented UDFs that dynamically maintain dependency information through the conceptually shared Bitmap, the runtime system has sufficient information to enforce loop-carried dependency precisely. However, we need to ensure high performance with an efficient hardware implementation.

To precisely enforce the loop-carried dependency, for a given vertex, GenUpdate should be executed sequentially on the different cubes that process disjoint sets of edges connected to a remote master vertex. This is because the execution on a cube may be unnecessary when the condition is already satisfied in a previous cube. The requirement makes the execution more restrictive than the original execution in GRAPHQ; to achieve high performance, we need to enable sufficient parallelism. The essential question is whether the elimination of wasted computation and communication can offset the effects of reduced parallelism.

We also need to store the Bitmap in the PIM architecture efficiently. Since it is frequently accessed in GenUpdate, the choice should minimize the performance overhead due to accessing the Bitmap. One option is to store a single shared copy in one of the cubes. Unfortunately, this causes intensive inter-cube communication, since every cube except the one holding the copy cannot access the Bitmap locally. It is tempting to store the bit for a vertex v in the master cube of v. However, as shown in Figure 2.4, the Bitmap is used not only in the master cube but also in the execution of GenUpdate in other cubes. Thus, it does not solve the problem of remote access. We decide to replicate the Bitmap across all cubes. This ensures that get_dep can always access the Bitmap locally. We emphasize that the storage overhead of the Bitmap is quite small. Consider LiveJournal [60] with 4.8M vertices and 69M edges: the overhead of one copy of the Bitmap is 4.8M / (69M × 32) = 0.217%, and the total overhead of replicating it in 16 cubes is 3.48%. With the replicated Bitmap, the runtime system is responsible for maintaining the content in all cubes.

Figure 2.8: Circulant Scheduling of GRAPHS. (a) Processing steps; (b) dependency communication; (c) update communication.

2.6.2 Circulant Scheduling

To clarify our ideas, Figure 2.8 (a) shows the matrix view of a graph distributed over four cubes. An element (i, j) in the matrix represents an edge v_i → v_j. Similarly, we use the notation [i, j] to represent a grid — the set of edges from cube i to cube j. The sequential requirement means that all edges in the same column need to be processed sequentially across all cubes. For iterative algorithms, one iteration processes all edges in the graph. Let us divide an iteration into multiple steps. In each step, cube i processes all edges in [i, j], j ∈ {0,1,2,3}. Figure 2.8 (a) shows one possible scheduling of steps that enables sufficient parallelism while still preserving the sequential execution requirement. We call the execution in Figure 2.8 (a) circulant scheduling. The work for each cube is indicated in the range of rows corresponding to that cube. The cube will process grids according to the step order.
For example, cube 0 first processes all edges in [0,1] and then [0,2], [0,3], [0,0]. The other cubes are similar. We also require that the step execution is synchronized — all cubes should be in the same step. The key property of circulant scheduling is that the same step presents diagonally in the matrix view. As a result, during the same step, each cube i processes edges in different grid [i, j] in parallel. For example, in step 0, the grids processed by cube 0,1,2,3 are [0,1],[1,2],[2,3],[3,0], respectively. Across all steps, edges belonging to all grids [ j,i], j ∈ {0,1,2,3}, are processed sequentially. This is the essential insight that circulant scheduling can enable sufficient parallelism while still preserving the sequential execution requirement. Now the only missing piece is how to transfer the dynamic dependency information from one cube to another. Since each cube has the replication of Bitmap, a cube can directly write the remote Bitmap when executing emit dep in Figure 2.7. This operation is performed by runtime system transparently by dependency communication — the new communication we introduce in GRAPHS to propagate and enforce precise loop-carried dependency. Figure 2.8 (b) shows a clear view of synchronized step execution according to Figure 2.8 (a) with dependency communication marked. We see that the communication pattern is the same between all steps. Note that with a given number of cubes, multiple possible circulant schedulings exist, the pattern of dependency communication is determined by the scheduling. Even with circulant scheduling, the execution is still more restrictive than arbitrary execution, but it elegantly enables more parallelism by letting each cube process disjoint sets of edges in parallel. The significant improvements in Section 2.8 indicate that the eliminated redundant computation and communication can indeed fully offset the effects of reduced parallelism with circulant scheduling. At this point, readers who are familiar with the recent PIM-based graph processing architecture or graph processing system will find circulant scheduling already used in prior works. Before explaining more details, we first clarify the differences between GRAPHS and GRAPHQ. We also show that even if certain ideas in graph processing systems can be applied to PIM-based graph processing architecture, the key ideas of GRAPHS does not even exist in distributed graph processing. 20 Relation to GRAPHQ [51]. GRAPHQ also uses circulant scheduling but for a different purpose — enabling batched and regular inter-cube communication used to update remote vertex values. We call it as update communication, which also exists in GRAPHS but different from dependency communication. Figure 2.8 (c) shows the update communication in GRAPHQ with circulant scheduling. We see that between different steps (“round” by the terminology in [51]), the pattern of update communication is different. Also, update communication can be overlapped with the computation in each step, as long as it is finished by the end of the next step. GRAPHS can inherit this advantage, but dependency communication cannot be overlapped with computation due to sequential execution requirement. This is why in Figure 2.8 (b) they are indicated between two steps. Nevertheless, the computation will be still correct if the destination cube does not wait for the completion of dependency communication. This will only cause some redundant computation that would have been avoided. 
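To make the step-to-grid mapping and the direction of dependency communication concrete, the sketch below (our own helper names; the index arithmetic matches the upNext and depNext formulas that appear in Figure 2.9) prints the circulant schedule for four cubes and checks that every column is handled by exactly one cube per step, in consecutive steps.

    // Sketch of the circulant schedule for n cubes (helper names are ours; the
    // index arithmetic follows upNext/depNext in Figure 2.9).
    #include <cstdio>

    int update_column(int cube, int step, int n) {   // grid column processed by `cube` at `step`
        return (cube + step + 1) % n;
    }

    int dependency_destination(int cube, int n) {    // cube that receives this cube's Bitmap
        return (cube - 1 + n) % n;
    }

    int main() {
        const int n = 4;
        for (int step = 0; step < n; ++step) {
            std::printf("step %d:", step);
            for (int cube = 0; cube < n; ++cube)
                std::printf("  C%d -> [%d,%d]", cube, cube, update_column(cube, step, n));
            std::printf("\n");
        }
        // For a fixed column j, the cube that touches it at step s+1 is exactly
        // the dependency destination of the cube that touched it at step s, so
        // the column's edges are processed strictly sequentially.
        for (int j = 0; j < n; ++j) {
            for (int step = 0; step + 1 < n; ++step) {
                int c = (j - step - 1 + 2 * n) % n;
                int c_next = (j - step - 2 + 2 * n) % n;
                if (c_next != dependency_destination(c, n))
                    std::printf("violation at column %d, step %d\n", j, step);
            }
        }
        std::printf("each column is handled by one cube per step, in consecutive steps\n");
        return 0;
    }

With n = 4 this reproduces the order in Figure 2.8 (a), e.g., cube 0 processes [0,1], [0,2], [0,3], [0,0].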
In essence, GRAPHQ can be considered as a special case without dependency communication. Relation to Gemini [55]. Gemini uses circulant scheduling to mostly overlap computation and communication between nodes, which is a more significant problem in distributed environment. The initial purpose is slightly different from GRAPHQ is to achieve regular communication. Similarly, Gemini does not precisely enforce loop-carried dependency and does not have dependency communication. To validate the point, we implemented our idea in Gemini and tested the performance using a real-world graph Twitter-2010 [56] with 1.5 billion edges on the four algorithms. The results show that GRAPHS ideas lead to significant speedups (up to 2.3×) and communication reduction (up to 63%). This shows that the idea can apply to distributed graph processing but we leave it as future work. Relation to GridGraph [61]. This out-of-core graph processing system has the notion of “grid” which also represents edges from a region of vertices to another region. It loads subgraphs corresponding to grids to memory each time to achieve better locality. Thus, despite the similarity of the grid notion, the context and goal of the system is quite different. 21 Summary. Although the circulant scheduling technique has been used in prior system and architectures, no prior work is able to precisely enforce loop-carried dependency. GRAPHS uses circulant scheduling as an important technique to achieve sufficient parallelism while eliminating redundant computation and communication. 2.6.3 Selective Dependency Propagation In GRAPHS, although dependency communication is a new type that did not exist in prior architectures, it does not necessarily increase the total communication amount. The reason is that reduced update communication may be more than the extra dependency communication. To reduce the overall total communication, we use a simple policy to reduce the amount of dependency communication. The observation is that, for vertices that do not have replicas in every cube, the increased dependency communication has less benefits: for a given amount of increased dependency communication, the reduced update communication is less. To ensure the benefits of dependency communication, we propose to differentiate the dependency communication for high-degree and low-degree vertices. We define high-degree as vertices with mirror replications in every cube, and low-degree as all vertices that are not high-degree. The additional dependency communication is the same for the high-degree and low-degree vertices, but the high-degree vertices can save more update communication. Based on this insight, GRAPHS only propagates dependency for high-degree vertices. For low-degree vertices, GRAPHS falls back to the original schedule: the replica cube directly sends the update messages to the master cube without updating Bitmap. Essentially, we do not enable the optimization for low-degree vertices. It may weaken the benefits of reducing the number of edges traversed. However, since the low-degree vertices have less neighbors, the redundant computation due to loop-carried dependency is not significant — not able to skip many neighbors during execution. 22 2.7 Runtime Implementation and Optimization This section discusses the detailed GRAPHS runtime system implementation. 
Since both GRAPHQ and GRAPHS use circulant scheduling with update communication, we build the runtime of GRAPHS based on GRAPHQ with two modifications: 1) implementing the GenUpdate and ApplyUpdate APIs instead of processEdge, reduce, and apply (as explained before, loop-carried dependency cannot be naturally expressed by GRAPHQ's API); and 2) adding dependency communication operations to send and receive Bitmap. Specifically, SendBitMap sends BitMap to the next cube in circulant scheduling with the cube ID as the parameter. RecvBitMap receives BitMap from the previous cube in circulant scheduling.

In Section 2.7.1, we first describe the basic GRAPHS runtime, in which the master cube of a vertex collects the partial results of GenUpdate from all remote replica cubes and generates the final update. It follows the procedure in Figure 2.4. Then we propose an optimized design, GRAPHSR, in which the partial results of GenUpdate are passed around cubes, since accumulating partial results can lead to further computation and communication reduction. This optimization is not applicable to all algorithms; we discuss the observation, applicability, and detailed design in Section 2.7.2.

2.7.1 GraphS Runtime System

Figure 2.9 shows the implementation of the runtime system of GRAPHS. Each cube has 1) sendBuf and recvBuf for batched inter-cube update communication; and 2) Bitmap to keep the dependency information dynamically. Batched communication is initialized in all cubes before an iteration starts. An iteration is divided into N steps, where N is the number of cubes. As shown in Figure 2.8, in every step except the first, each cube has to receive the dependency communication from the previous cube in circulant scheduling. This is performed by RecvBitMap (Line 7). The Bitmap of each cube is always written by a remote cube. We calculate the destinations of update and dependency communication (Line 9 ∼ 12) according to the patterns in Figure 2.8 (c) and (b), respectively.

1  sendBuf = local Array[DataType];
2  recvBuf = local Array[DataType];
3  Bitmap = local Array[bool];
4  InitBatch();
5  for (stepId = 0; stepId < cubeNum; stepId++) {
6    if (stepId != 0)
7      RecvBitMap(); //receive dep. information
8
9    //Next cube for update communication
10   upNext = (myId + stepId + 1 + cubeNum) % cubeNum;
11   //Next cube for dependency communication
12   depNext = (myId - 1 + cubeNum) % cubeNum;
13
14   for (u <- GraphGrid[myId,upNext].vertices) {
15     sendBuf[u] = GenUpdate(outNbrs(u));
16   }
17
18   if (stepId != (cubeNum-1)) {
19     //End of each step except the last
20     SendBatch(upNext);
21     SendBitMap(depNext);
22   } else { //End of last step, local updates
23     for (v <- Partition.vertices)
24       ApplyUpdate(v, sendBuf(v));
25   }
26
27   //At the end of each step (except the first)
28   //Master cube handles updates from replicas
29   if (stepId != 0) {
30     //Batched update from previous step
31     RecvBatch();
32     for (v <- Partition.vertices)
33       ApplyUpdate(v, recvBuf(v));
34   }
35 }
Figure 2.9: GRAPHS Runtime Implementation

Each cube locally computes the partial results of GenUpdate for vertices in the current grid determined by [myId,upNext] (Line 14 ∼ 16). These results will be used in calculating the final updates for the destinations of all edges in the current grid, and are stored in sendBuf. Some bits in the local Bitmap in the range of vertices to be updated may be set (emit dep in Figure 2.7). Next, the partial results and a range of bits in Bitmap are sent to the calculated destinations as update and dependency communication, respectively (Lines 20 and 21).
The sender cube knows the Bitmap address in the destination cube, and only non-zero bytes are transferred: if any bit in a byte is non-zero, we transfer the whole byte. At the end of each step, each cube has to receive the batched update communication and perform ApplyUpdate based on the received partial results (recvBuf). This is similar to GRAPHQ, and the update communication of the previous step can be overlapped with the computation of the current step. At the end of the last step, all cubes perform a final, local ApplyUpdate. The execution of the GRAPHS runtime with four cubes is illustrated in Figure 2.10.

Figure 2.10: GraphS Runtime

2.7.2 GraphSR Runtime System

The optimization insight is that for certain algorithms, it is beneficial to accumulate partial results to expedite the termination of correct computation and further reduce redundant work. Consider the k-core algorithm, which finds all vertices with degree ≥ k. When k=4, if each cube has fewer than 4 vertices connected to the master vertex in a remote cube, no computation and communication can be saved. This is shown in Figure 2.11 (a) with 16 cubes. However, it is still different from the exact algorithm semantics, which would stop traversing after 4 neighbors have been identified. Essentially, the GRAPHS runtime only improves over GRAPHQ when 4 or more neighbors are identified in a single cube. To truly enforce the precise semantics, partial results have to be accumulated and passed around cubes. Consider a vertex v in the master cube, which first performs a local GenUpdate (not shown), since if the result for vertex v (in k-core, the result for a vertex is a boolean indicating whether it has ≥ k neighbors) can be generated locally, there is no need to check the vertex further in other cubes. If the result is not generated, the master cube sends the current partial results to the next cube in circulant scheduling, which executes GenUpdate using both its local data and the received partial results. If the "positive" final result (degree ≥ 4) is obtained in a cube, that cube directly sends it to the master cube, which updates the master copy of the vertex. Otherwise, the partial result is sent to the next cube in the fixed circulant scheduling that has vertices connected to the master vertex v. This policy avoids unnecessary communication between cubes, which would only pass around irrelevant partial results. The ideas are shown in Figure 2.11 (b). Here, cube 1 generates partial results (no neighbor is found in cube 0), which are passed to cube 3, since cube 2 does not have vertices connected to v in cube 0.

Figure 2.11: GRAPHSR Insights. (a) GRAPHSR insight: accumulating results; (b) GRAPHSR reduced computation and communication by combining dependency and update communication; (c) step schedule; (d) combined dependency and update communication in steps.

Based on these insights, we can implement an optimized runtime system, GRAPHSR, which further eliminates redundant computation and communication beyond the GRAPHS runtime. Before describing the details, we would like to discuss the applicability of the idea. Besides the k-core algorithm, it also improves MIS, where the key operation is to compare a vertex v's randomly assigned "color" to all its neighbors' colors so that all vertices can make consistent decisions on identifying independent sets.
We need to check all neighbors of v in different cubes and perform the comparison. The partial result transferred is whether vertex v's random number is larger (or smaller) than those of all vertices that have been checked in previous cubes. There are two reasons why GRAPHSR can also benefit MIS: 1) GRAPHSR computes at the local cube in the first step, while GRAPHS computes locally at the last step; the result is affected by graph structure and partitioning, and certain datasets are more likely to satisfy the condition at the local cube; 2) GRAPHS only sends dependency messages for high-degree vertices (see Section 2.6.3), while GRAPHSR propagates dependence for all vertices with combined update and dependency communication.

1  vertexBuf = local Array[DataType];
2  vertexMapping = local Array[DataType];
3  InitBatch();
4  for (stepId = 0; stepId < cubeNum; stepId++) {
5    curMasterID = (myId + stepId + cubeNum) % cubeNum;
6    recvBuf = &vertexBuf[curMasterID.first_vertexId];
7    sendBuf = recvBuf;
8    for (u <- GraphGrid[myId,curMasterID].vertices) {
9      if(recvBuf[u]==SKIP)
10       break;
11     result,sendBuf[u] = GenUpdateR(recvBuf[u],outNbrs(u));
12     nextCube = vertexMapping[u].get_next();
13     if(result||(nextCube == INT_MAX))
14       addToResultBatch(u,result,nextCube);
15     else
16       addToBatch(u,result,nextCube);
17   }
18   //Sending partial results forward
19   for(i = 1; i < (cubeNum - stepId); i++ ){
20     destId = (myId + i + cubeNum) % cubeNum;
21     SendBatch(destId); //data inserted in Line 16
22   }
23   //Sending final results back to master cube
24   if(stepId!=0){
25     SendResult(curMasterID);
26     //Each step can only receive from one cube
27     recvId = (myId + stepId + cubeNum) % cubeNum;
28     RecvResult(recvId);
29   }
30   //Receive partial results
31   for(i = 1; i < (cubeNum - stepId); i++ ){
32     recvId = (myId + stepId + i + cubeNum) % cubeNum;
33     RecvBatch(i);
34   }
35 }
Figure 2.12: GRAPHSR Runtime Implementation

GRAPHSR cannot benefit algorithms that do not have the above property. For example, in bottom-up BFS, the condition is that any neighbor belongs to the current frontier. At the beginning and end of the execution, there are few vertices in the frontier, so many checks turn out to be false. In GRAPHS, these cases do not cause update communication to the master cube, while GRAPHSR still needs to circulate the partial results across cubes. Kmeans is similar since it can be considered as BFS with multiple roots.

Next we consider the implementation details of the GRAPHSR runtime. The data movements between cubes in GRAPHSR are essentially a combination of update (for partial results) and dependency communication. Intuitively, communication for a vertex happens when "the condition needs to be checked in the receiver cube with the provided partial result". To compute the first partial result, each cube first executes locally and then follows the circulant scheduling. Thus, the step scheduling is a circular left shift by one position from Figure 2.8 (a), shown in Figure 2.11 (c). The corresponding execution is shown in Figure 2.11 (d). GRAPHSR no longer has two types of communication, so we use the same color. The key feature of GRAPHSR is that a cube can send multiple batched communications to different remote cubes. This is due to the policy mentioned earlier: for a vertex v, a cube only sends the message to the next cube that has v's neighbors, so certain cubes can be skipped, e.g., C2 in Figure 2.11 (b).
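As a hedged illustration of this skip policy, the sketch below shows one way the per-vertex next-cube table consulted by vertexMapping[u].get_next() in Figure 2.12 could be computed at graph-partitioning time; hasNbr is an assumed per-partition lookup, not an actual GRAPHSR structure.

// For a master vertex v currently handled at cube c in the circulant ring of N
// cubes, return the next cube that owns at least one neighbor of v, so that
// cubes without relevant edges are skipped. INT_MAX is the "no remaining cube"
// sentinel tested at Line 13 of Figure 2.12.
#include <climits>
#include <vector>

int nextCubeWithNeighbors(int v, int c, int N,
                          const std::vector<std::vector<bool>>& hasNbr /* [vertex][cube] */) {
  for (int step = 1; step < N; step++) {
    int cand = (c + step) % N;
    if (hasNbr[v][cand]) return cand;
  }
  return INT_MAX;   // nothing left to check: the result goes back to the master cube
}

In the runtime, this per-vertex information is precomputed for each cube and packed into vertexMapping (log2 N bits per vertex), as described below.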
The message between two cubes (Ci and Cj) between two consecutive steps includes the batched partial results of vertices that are obtained in Ci and need to be further checked in Cj. Thus, the messages between different cube pairs between two given steps contain partial results for disjoint sets of vertices. As shown in Figure 2.11 (d), with four cubes, the first local cube may send to all three remote cubes (between S0 and S1). After the first remote cube, each cube may still send to all three others, since the result may be found and sent back to the master cube (between S1 and S2). However, between S2 and S3, each cube may only send to two other cubes, because the partial results only propagate forward. Eventually, after the last step in the scheduling, each cube can only send one message back to the master cube. In general, with N cubes, the maximum number of possible messages between Si and Si+1 is N-1 for i=0,1 and N-i for i=2,...,N-1.

For a vertex v, the next cube to send the partial result to can be determined statically — it is a function of the graph partition. We can obtain that information in pre-processing and store it in a table vertexMapping in each cube; each vertex needs log2 N bits. The overhead of generating this information is negligible since it can be directly obtained when assigning each edge to a cube at graph partitioning time.

Based on the above discussion, the detailed GRAPHSR runtime implementation is shown in Figure 2.12. At every step, each cube computes curMasterID (Line 5), the master cube of the incoming partial results. That is, if the condition is satisfied in the current cube or the cube is the last in the ring, the result needs to be sent back to cube curMasterID. Each entry in recvBuf has the default value SKIP, and it is only changed if a partial result is received, which means that the cube needs to compute the update using the received partial result and local vertices. This is done by GenUpdateR (Line 11), which can be generated easily by adding the partial result as an input. If the result is computed (should terminate) or the cube is the last one, the result is placed in the result batch by addToResultBatch (Line 14), which is sent to cube curMasterID (Line 25). If the result is not computed (needs to be forwarded), the partial result is placed in the normal batch by addToBatch (Line 16), which is sent to the corresponding cube (Line 21). The cubes in the circle receive multiple normal batches at Line 33, and each master cube receives the result batch at Line 28. After computing partial results, if a cube does not have anything to send to another cube expecting an incoming forward message, it needs to send a small "dummy" message to avoid deadlock. This is not shown in Figure 2.10.

2.7.3 Static Analysis of Graph

Figure 2.13: Percentage of unsatisfied high-degree vertices (Kcore4 & MIS) on graphs TT and WK, for GraphS and GraphSR.

We run a static analysis of GRAPHS to simulate the first iteration of the K-core (K=4) and MIS algorithms. Figure 2.13 shows the percentage of high-degree vertices that are not finished at the end of the current step, with graphs TT and WK as examples. We see that this percentage is always lower for GRAPHSR, which leads to less communication. The cube number starts from zero because the condition could already be satisfied locally, in which case no further computation is needed.
The number of unsatisfied vertices at the first step differs because GRAPHSR computes at the local cube in the first step, while GRAPHS computes locally at the last step.

2.8 Evaluation

2.8.1 Evaluation Methodology

PIM System configuration. We evaluate GRAPHS based on zSim [62], a scalable x86-64 multicore simulator. We modified zSim according to HMC's memory and interconnection model, heterogeneous compute units, on-chip network, and other hardware features. While zSim does not natively support HMC interconnection simulation, we insert a NOC layer between the LLC and memory to simulate different intra-cube and inter-cube memory bandwidths. The results are validated against NDP [63]. For compute units, we use 256 single-issue in-order cores in all test cases. Each core has a 32KB L1 instruction cache, a 64KB L1 data cache, and a 16-entry message queue; there is no L2 or shared cache. The cache line size is 64B and the simulation frequency is 1000 MHz. For the memory configuration, we use 16 cubes (8 GB capacity, 512 banks). The cubes are connected with the Dragonfly topology [64]. The maximal internal data bandwidth of each cube is 320GB/s. For meaningful comparison, the configurations are the same as GRAPHQ [51].

The energy consumption of the inter-cube interconnect is estimated as two components: a) dynamic energy, which is proportional to the number of flit transfer events that happen between each pair of cubes; and b) static energy, which corresponds to the energy cost when the interconnect is powered but idle (i.e., no transfer event happens). We use zSim to count the number of transfer events and use ORION 3.0 [65] to model the dynamic and static power of each router. We calculate the dynamic energy from the flit transfers and the leakage energy of the whole interconnect. We also validated Table 1 in [66] with McPAT [67].

Graph algorithms. We run the four algorithms: bottom-up BFS, K-core, K-means, and MIS. For BFS, we do not use the direction-switch optimization. For Kcore, we run it with different K values. For Kmeans, we choose the number of clusters as √|V| and run it like BFS. For MIS, there are two phases: the first obtains the minimum color value among a vertex's neighbors and, if this minimum is greater than the vertex's own value, marks the vertex; the second phase removes the neighbors of the vertices marked in the first phase.

Graph Dataset. We test with TT, WK, LJ, TW, and FR in Table 2.1. To run the algorithms on directed graphs, we convert the undirected datasets to directed graphs by adding reverse edges. For the large graphs (TW, FR), due to the slow simulation speed, we show results of the first several iterations, and their speedups are not included when calculating average speedups. The complete executions of TW and FR are tested in a real distributed cluster setting.

Distributed cluster implementation and configuration.
We implement the ideas of GRAPHS with dependency messages in Gemini [55]; the comparison with Gemini can directly demonstrate the benefits. Due to the slow network and memory compared to computation, the idea of GRAPHSR does not work in this setting. We use the implementation to run the two large graphs (TW, FR). We conduct the experiments on a cluster with 16 nodes. Each node has two Intel Xeon E5-2630 v3 CPUs (8 cores each) with a maximum memory bandwidth of 59 GB/s (the maximum rate at which data can be read from or stored into a semiconductor memory by the processor; see https://ark.intel.com/content/www/us/en/ark/products/83356/intel-xeon-processor-e5-2630-v3-20m-cache-2-40-ghz.html), and the network is Mellanox InfiniBand FDR with a bandwidth of 7 GB/s. For BFS, we average the experimental results of 64 randomly generated non-isolated roots. For each root, we run the algorithm 5 times. For K-core, we choose K=2, commonly used in strongly connected components [70]. In the benchmark of K-means, we choose the number of clusters as √|V| and run for at most 20 iterations. For algorithms other than BFS, we run the application 20 times and report the average.

Graph                         #Vertices   #Edges
ego-Twitter (TT) [68]         81K         2.4M
Enwiki2013 (WK) [69]          4.2M        101M
LiveJournal (LJ) [60]         4.8M        69M
Twitter2010 (TW) [56]         42M         1.5B
Friendster (FR) [friendster]  66M         1.8B
Table 2.1: Datasets

        Graph  speedup  traversed edges  upt msg  dep msg  total comm
BFS     TW     2.30     0.3563           0.7553   0.0446   0.7999
        FR     1.72     0.3650           0.4657   0.0429   0.5085
Kcore   TW     1.38     0.4537           0.5377   0.0074   0.5450
        FR     1.52     0.2820           0.3646   0.0074   0.3719
MIS     TW     1.46     0.5062           0.4721   0.0313   0.5034
        FR     1.35     0.3762           0.3639   0.0259   0.3898
Kmeans  TW     1.39     0.4151           0.6854   0.0250   0.7103
        FR     1.46     0.7361           0.7044   0.0393   0.7437
Table 2.2: Real machine test results, all normalized to Gemini

2.8.2 Performance Improvements

Figure 2.15 shows the execution cycles of GRAPHS and GRAPHSR normalized to GRAPHQ [51]. For BFS and Kmeans, the speedup is on average 1.57× and 1.86× and at maximum 1.70× and 2.43×, respectively.

Figure 2.15: GRAPHS Runtime (cycles normalized, per dataset and algorithm)
Figure 2.16: GRAPHS Energy (normalized)
Figure 2.17: GRAPHS Actual Compute (normalized)
Figure 2.18: GRAPHS Communication (transfer normalized)

For kcore-{2,3,4}, GRAPHS's speedup is on average 2.96×, 2.93×, and 2.84× and at maximum 4.37×, 4.09×, and 3.86×, and GRAPHSR's speedup is on average 13.98×, 13.92×, and 12.52× and at maximum 19.91×, 17.43×, and 15.83×. As K becomes larger, the speedup gradually decreases. For MIS, since the second phase is not optimized, the speedup for GRAPHS is on average 1.60× and at maximum 1.89×, and for GRAPHSR on average 9.67× and at maximum 14.31× (slightly less than that of kcore). Table 2.2 shows the results of the two large graphs on the distributed cluster. GRAPHS outperforms the baseline with a speedup of up to 2.30× (1.55× on average).

2.8.3 Breakdown and Load Imbalance

Figure 2.19 shows the breakdown of runtime into communication, computation, and synchronization. The load imbalance of GRAPHS with dependency messages is slightly better than GRAPHQ in most cases.
This is because vertices with higher degrees are more likely to satisfy the dependency, so removing unnecessary computation with dependency messages helps to reduce the imbalance. GRAPHSR cuts down more than 80% of the total runtime, and leads to a rather imbalanced result. This imbalance is due to the graph partition and the nature of this method. Given the significant speedups of GRAPHSR, we do not believe it is a critical issue.

Figure 2.19: GRAPHS Breakdown (runtime split into synchronization, compute, and send for GraphQ, GraphS, and GraphSR)

2.8.4 Energy Reduction

Figure 2.16 shows the interconnect energy consumption of GRAPHS compared with GRAPHQ. The energy cost of the interconnect consists of both the static consumption and the dynamic consumption, which are determined by the execution time (performance) and the communication amount, respectively. GRAPHS reduces the energy cost by 51.6% on average and 76.9% at maximum. GRAPHSR reduces the energy cost by 91.41% on average and 94.99% at maximum.

2.8.5 Computation and Communication Reduction

We sum the number of edges computed and the number of apply operations to estimate the computation cost, as shown in Figure 2.17. GRAPHSR reduces the actual computation to only 4.91% on average, and to as little as 2.18% in the best case. GRAPHS reduces it to 62.6% on average, and to as little as 14.9% in the best case. Figure 2.18 shows the reduction of communication; GRAPHSR eliminates a huge proportion of the communication. For example, in the K-core algorithm (K = 2), for vertices with 2 or more edges at the first cube, no communication is needed; for vertices with 1 edge at the first cube, only one message is needed: if the vertex only has that one edge, only a result is sent to the master, and if it has more edges at the next cube, only the partial result is sent to the next cube, where the condition is satisfied. For GRAPHS, in datasets like LiveJournal [60], as K grows larger there is little communication saving. GRAPHS reduces communication to 67.2% on average and to as little as 49.5%, and GRAPHSR reduces it to 9.77% on average and to as little as 2.21%.

On the distributed cluster, we calculate the number of edges traversed to show the computation reduction. As shown in Table 2.2, GRAPHS reduces the edges traversed to 43.6% on average and to as little as 28.2%. For communication, the upt msg and dep msg columns in Table 2.2 indicate the amount of update and dependency communication, and the total comm column gives the total communication, all normalized to the total communication in Gemini. We can see that GRAPHS indeed incurs much less total communication than Gemini.

2.9 Related Work

Graph Processing Accelerators. [49] is the first PIM-based graph processing accelerator. GraphP [50] observes that data partition is the first-order design principle in PIM. GraphQ [51] reduces communication by static scheduling and specialized hardware. [71] is another accelerator that supports dynamic scheduling, and it is designed for asynchronous graph processing. GRAPHS is a synchronous graph processing accelerator. It eliminates both communication and computation in PIM.

Distributed Graph Processing Systems. [72–74] are high performance computing systems for BFS. They eliminate redundant computation in BFS with static scheduling. The technique in GRAPHS applies to BFS and other graph algorithms. [75–78] analyze the user-defined functions and avoid unnecessary computation.
The distinction is that these systems leverage relaxed dependency for more parallelism, while GRAPHS enforces dependency.

2.10 Conclusion

This paper proposes GRAPHS, a novel compiler, runtime system, and architecture co-designed PIM-based solution that completely eliminates the redundant computation and inter-cube communication incurred by not precisely enforcing loop-carried dependency. GRAPHS addresses the problem with two components: 1) a code analyzer that analyzes the unmodified UDFs to identify loop-carried dependency; and 2) a PIM-based runtime and architecture that faithfully enforces loop-carried dependency. We present the details of the GRAPHS runtime design and also propose a variant, GRAPHSR, that further improves the performance for certain algorithms by accumulating partial results. The evaluation based on both simulation and a real cluster shows that the proposed ideas lead to significant speedups and communication reduction compared to state-of-the-art architectures (GRAPHQ and Gemini).

Chapter 3
SparseCore: stream ISA and processor specialization for sparse computation

3.1 Introduction

Sparse computation on data where a large fraction of values are zeros is important in scientific computing [79, 80], machine learning [81, 82], graph analytics [83], simulation [84], and other domains. Sparse tensor algebra and graph pattern mining (GPM) are two major applications that perform frequent sparse computation. The essential computation is the set operations, i.e., intersection, subtraction, and merge (union), on two sparse vectors. Currently, accelerators have been developed for specific application domains with a fixed algorithm or dataflow. For sparse tensors, these operations can be naturally extracted from algorithms and effectively accelerated. Sparse matrix-sparse matrix multiplication (SPMSPM) has rich algorithmic diversity [85]. Prior works have developed accelerators based on inner-product [86], outer-product [87], and Gustavson's algorithm [85]. The different choices of dataflow determine the computation schedules with different tradeoffs. Some dataflows have asymptotically worse performance on certain inputs. With the dataflow embedded in the accelerator architecture, adopting one accelerator prevents the usage of other dataflows.

Graph Pattern Mining (GPM) [88, 89] has only recently received research interest. GPM takes a graph G and a pattern specification p as inputs, and returns all subgraphs of G that are isomorphic to p. Two graphs G0 = (V0,E0) and G1 = (V1,E1) are isomorphic if and only if there exists a one-to-one mapping f : V0 → V1 such that (u, v) ∈ E0 ⇐⇒ (f(u), f(v)) ∈ E1. A subgraph of G isomorphic to p is called an embedding. GPM enables various important applications such as functional module discovery [90, 91], biochemical structure mining [21–23], anomaly detection [92–97], and many others [98–104]. The key challenge of GPM is the need to enumerate a large number of subgraphs; e.g., in WikiVote, a small graph with merely 7k vertices, the number of vertex-induced 5-chain embeddings can reach 71 billion. An efficient method for finding the embeddings of p in G is pattern enumeration, which constructs the embeddings that are isomorphic to p. Graphs are typically stored in a sparse representation, e.g., compressed sparse row (CSR), which keeps the neighbor list of a vertex as a sparse array. The key operation of pattern enumeration is the intersection between neighbor lists—the essential primitive to construct embeddings based on the pattern.
For example, for two connected vertices v1 and v2, performing the intersection of their neighbor lists could identify triangles (v1, v2, v), where v is the neighbor of both v1 and v2. Pattern enumeration algorithms can be expressed as nested loops, in which each level extends the current subgraphs with a new vertex. Different from sparse tensor and graph computation, for GPM, the intersections on edge lists are deeply embedded in the nested loops, making it difficult to reuse existing accelerators for other domains to accelerate GPM algorithms. This difficulty gives rise to the recent interests of designing specific GPM accelerators. TrieJax [105] and GRAMER [106] are two early GPM accelerators. Their performance is limited since they did not accelerate the state-of-the-art algorithms. We provide detailed analysis in Section 3.2.3. FlexMiner [107] is the state-of-the-art pattern-aware software/hardware co-designed GPM accelerator. While achieving impressive speedups, the design is not flexible, i.e., the skeleton of its embedding exploration algorithm is implemented in hardware and fixed, making it hard to apply new algorithmic optimizations without hardware modification. SISA [108] is a set-centric Processing-In-Memory (PIM) based GPM accelerator. The set operations can be expressed in the programming model and off-loaded to PIM for execution. In this paper, rather than designing a new accelerator for a particular domain based on specific algorithms/dataflows, we aim to enhance the flexibility of sparse computation acceleration 37 architecture. We define a sparse vector as a stream, which can be a key or (key,value) stream. We propose SparseCore, which extends the instruction set architecture (ISA) to make stream first-class citizens, and develop efficient architectural components to support the stream ISA. The novel ISA extension intrinsically operates on streams, realizing both efficient data movement and computation. It can be considered as a natural extension to the traditional instructions for ordinary scalar values. Our approach accelerates all kinds of sparse computation with a unified architecture, which can adopt to different and evolving algorithms with software modifications, instead of developing specialized hardware every time. Good flexibility can make the architecture easily adapt to complex and fast evolving algorithms and optimizations, which is particularly the case for GPM. The general-purpose processor can execute any code patterns, while the stream ISA extension can accelerate the critical intersection operations in a seamless manner. In this sense, our approach is similar to the successful SIMD ISA extension to multimedia applications decades ago. In comparison, FlexMiner [107] replaces some intersection with cmap, which is used to perform connectivity checking, but forces the implementation to use the less general notion that may not benefit from new optimizations. For example, FlexMiner is unable to support a new optimization based on Inclusion-Exclusion Principle that can accelerate pattern counting by up to 1110× in GraphPi [109], while SparseCore can easily benefit from it by implementing the optimization in software. Moreover, extending ISA—the interface between software and hardware—allows compiler to analyze existing source codes and generate instructions in stream ISA to accelerate sparse computations. 
For GPM, the additional compiler pass can be smoothly integrated with compilation-based systems, such as AutoMine [110], GraphZero [111], and GraphPi [109], preserving the simple user interface. Specifically, users provide the input graph and pattern specifications, and the compiler synthesizes algorithm implementations automatically. SparseCore architecture considers both computation and memory efficiency. We introduce Stream Mapping Table (SMT) that records the mapping between stream ID and stream register, and tracks the dependency between streams. For computation efficiency, we introduce stream unit 38 (SU) to efficiently perform set operations including intersection, subtraction, and merge. The computations on sparse values determined by the indices in two streams are performed by Stream Value Processing Unit (SVPU). To accelerate a unique GPM algorithm pattern, we introduce nested instructions, which are implemented inside the processor by a sequence of micro-ops. To enable stream data reuse, a scratchpad is associated with SU. To ensure efficient data movements, we propose a Stream Cache (S-Cache) that can effectively prefetch streams based on the known sequential stream access pattern. We develop a GPM compiler to generate codes with stream ISA. The new instructions are transparent to programmers and the compiler takes unmodified GPM codes. The main challenge for code generation is stream management (similar to register allocation in traditional compilers). For tensor computation, we modified TACO [112], the state-of-the-art tensor algebra compiler, to generate the baseline and optimized implementations with stream instructions. We implement SparseCore ISA and its architectural components on zSim [62]. We evaluate the architecture with (1) GPM applications, including seven patterns (triangle/three-chain/tailedtraingle counting, 3-motif mining, 4/5-clique counting, and FSM) on ten real graphs, and (2) sparse tensor computation, including matrix multiplication with three algorithms, tensor times vector(TTV), and tensor times matrix(TTM) on 11 matrices and two tensors. For GPM, SparseCore significantly outperforms InHouseAutomine on CPU by on average 13.5× and up to 64.4×. SparseCore also outperforms FlexMiner and TrieJax by on average 2.7×and 3651.2×, up to 14.8×and 43912.3×respectively. For tensor computation, our experiments show that SparseCore archives 6.9×, 1.88×, 2.78× , 4.49×, and 2.44× speedup for inner-product, outer-product, Gustavson’s algorithm, TTM, and TTV respectively. 39 3.2 Background 3.2.1 Sparse Tensor Computation While the dense tensor computation can be efficiently accelerated by GPU and SIMD instructions, the acceleration of sparse tensor computation is an open problem. Motivated by various applications, it has received intensive research interests [85–87, 113–115]. An important kernel is sparse matrix-sparse matrix multiplication (SPMSPM) Amk ∗Bkn = Cmn, which can be commonly implemented via three algorithms: inner-product, outer-product, and Gustavson’s algorithm. The difference among the three is the order of loops that iterate three indices: inner-product loops in the order of m, n, and k; outer-product loops in the order of k, m, and Gustavson’s algorithm loops in the order of m, k, and n. The inner-product can be seen as multiplication of vectors, thus it can be implemented by intersection. For Outer-product and Gustavson’s algorithm, the computation results are merged into the result matrix C, so they can be implemented by merge. 
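As an illustration of how these dataflows reduce to set operations, the sketch below (plain C++, not taken from the dissertation and independent of any accelerator) expresses Gustavson's algorithm as a sequence of scaled merges; the inner-product dataflow would instead intersect a row of A with a column of B and multiply-accumulate the matching values.

#include <utility>
#include <vector>
using SpVec = std::vector<std::pair<int, double>>;   // (index, value), sorted by index

// Merge two sorted sparse vectors, scaling b by beta and summing matching indices.
SpVec mergeScaled(const SpVec& a, const SpVec& b, double beta) {
  SpVec out;
  size_t i = 0, j = 0;
  while (i < a.size() || j < b.size()) {
    if (j == b.size() || (i < a.size() && a[i].first < b[j].first)) { out.push_back(a[i]); i++; }
    else if (i == a.size() || b[j].first < a[i].first) { out.push_back({b[j].first, beta * b[j].second}); j++; }
    else { out.push_back({a[i].first, a[i].second + beta * b[j].second}); i++; j++; }
  }
  return out;
}

// Gustavson's algorithm, one output row at a time: C[m] = sum over k of A[m][k] * B[k].
SpVec gustavsonRow(const SpVec& rowA, const std::vector<SpVec>& B) {
  SpVec acc;                                          // running merged result for row C[m]
  for (const auto& [k, aval] : rowA) acc = mergeScaled(acc, B[k], aval);
  return acc;
}

This is only a functional reference; the point of the dataflow discussion above is that hardware accelerators bake one such loop order into the architecture.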
Recent accelerators are developed based on different algorithms with different dataflows.

3.2.2 GPM Methods and Optimizations

Figure 3.1 shows the GPM problem, which finds the pattern (triangle) in (a) in the input graph in (b). Figure 3.1 (c) shows the memory access and computation of pattern enumeration. From two connected vertices, their edge lists are accessed, followed by the intersection between them. In this example, each common neighbor forms a triangle embedding matching the pattern.

Figure 3.1: Pattern Enumeration. (a) pattern graph; (b) input graph; (c) pattern enumeration: access v2's and v3's edge lists; the intersection is v7, constructing the triangle {v2, v3, v7}.

During execution, edge list accesses are followed by the computation (intersection), which is much more complex than graph computation and cannot be efficiently performed in current processors. Specifically, starting from the first vertex ID of the two edge lists, if they mismatch, the processor advances the pointer of one of the lists, checks the boundary, fetches the next vertex ID, and compares again. This code pattern contains branches and data dependencies in a tight loop, making it difficult to predict the branches and exploit instruction level parallelism.

Figure 3.2: Tailed-triangle mining. (a) symmetry breaking: intersect N(v0) and N(v1) to get v2, discard embeddings with v2 >= v0, then subtract N(v0), N(v2), and {v0,v2} from N(v1) to get v3; (b) symmetry breaking + intersection early termination: intersect N(v0) and N(v1) to get v2 with upper bound v0, then subtract N(v0), N(v2), and {v0,v2} from N(v1) to get v3.

Symmetry breaking in pattern enumeration avoids counting the same embedding multiple times due to symmetry by enforcing a set of restrictions among vertices during embedding construction. A tailed-triangle mining example is shown in Figure 3.2. We denote the first/second/third/fourth matched vertex of an embedding as v0-v3. Symmetry breaking requires v2 < v0 so that a unique embedding is enumerated only once, i.e., (v0, v1, v2, v3) = (2,1,0,4) is the same as (0,1,2,4). As shown in (a), symmetry breaking first obtains all v2 that are common neighbors of v0 and v1 by intersecting N(v0) and N(v1), where N(v) is the neighbor vertex set of v stored in the edge list. Then, it discards all v2 that are no less than v0 to satisfy the restriction (line 5-6 of the algorithm in (a)). This can be further improved by early termination of intersections, since only the elements smaller than v0 in N(v0)∩N(v1) are needed, indicated as BoundedIntersect() in (b). This optimization not only reduces computation and accessed data, but also eliminates branches in the next loop level. Due to the richness of the algorithms, the optimizations are fast evolving.

3.2.3 Existing Architectures on GPM

Pattern enumeration relies on various optimizations to achieve good performance, which make the code very complex; such code cannot be efficiently executed on accelerators designed for sparse tensors [85–87, 113–115] and machine learning [116–118].
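For reference, a hedged scalar sketch of the bounded intersection used for symmetry breaking (BoundedIntersect in Figure 3.2 (b)) is shown below; it also exhibits the data-dependent branches in a tight loop noted in Section 3.2.2, exactly the pattern that is hard for general-purpose cores and ill-suited to the accelerators just mentioned.

// Return the common elements of two sorted neighbor lists that are strictly
// smaller than `bound` (the symmetry-breaking upper bound, v0 in Figure 3.2).
// The scan stops as soon as either list reaches the bound.
#include <vector>

std::vector<int> boundedIntersect(const std::vector<int>& a,
                                  const std::vector<int>& b, int bound) {
  std::vector<int> out;
  size_t i = 0, j = 0;
  while (i < a.size() && j < b.size() && a[i] < bound && b[j] < bound) {
    if (a[i] == b[j])     { out.push_back(a[i]); i++; j++; }
    else if (a[i] < b[j]) i++;
    else                  j++;
  }
  return out;
}

This is the kind of loop that the stream ISA described in Section 3.3 off-loads as a single bounded intersection instruction.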
On the other side, it is also difficult to map the complex GPM algorithms on the more “general” specialized architectures such as Sparse Processing Unit (SPU) [119–121] which offers specialized supports for stream-join (similar to intersection) based on a systolic decomposable granularity reconfigurable architecture (DGRA). SPU requires manually rewriting C codes and describing data flow graph (DFG) with the language extensions for DGRA. The computations are mapped to the systolic DGRA by analyzing the DFGs. Due to complex data and control flow dependencies, porting the same state-of-the-art GPM algorithms that execute on SparseCore to SPU leads to large DFGs. For example, four-motif needs up to 112 nodes in the DFG (48 computation nodes and 64 memory nodes), however, each SPU core can only support 20 computation nodes. Thus, it is infeasible to directly execute such large DFGs on an SPU core. To overcome the challenge, one solution is to partition the DFG into smaller partitions that an SPU core can accommodate, and distribute them to multiple cores for collaborative execution. Unfortunately, SPU has very limited mechanisms for cross-core collaboration, especially, poor synchronization support: the only way to synchronize SPU cores is via the “host” RISC-V core. Another possible solution is to execute GPM applications with multiple phases, each of which executes a DFG partition. Between different phases, the SPUs have to be reconfigured. To achieve good performance, the costly reconfiguration should not be performed frequently. Unfortunately, GPM applications switch between different parts of the DFG frequently, leading to extremely high overhead. 42 In summary, SPU’s inefficiency for GPM applications is essentially due to its inflexibility to handle the complicated GPM codes with large DFGs. In comparison, SparseCore, as a generalpurpose processor extension, is not restricted by the DFG-based programming model. It can execute the GPM codes in the similar way as the commodity processors do, except that the set operations are executed with ISA extension and special functional unit, i.e., Stream Unit (SU). In some sense, this approach resembles how SIMD instructions accelerate data-parallel computations. Pipette [122] is a recent architecture that supports pipeline parallelism with ISA extension. Merge-intersect, as a stage in the pipeline, performs intersection between two streams. It is possible to port GPM applications on Pipetee, which requires less manual effort than SPU. However, programmers still need to express GPM algorithms in pipeline parallelism, leading to more significant code modifications than SparseCore. In comparison, SparseCore provides higher efficiency on unmodified codes with nested intersection and seamless integration of specialized acceleration components with CPUs. TrieJax [105] is based on a variant of the Worst Case Optimal Join algorithm. Its performance is limited due to the lack of symmetry breaking support and blindly processing graph data as a database table. GRAMER [106] is based on a much slower pattern-oblivious algorithm with expensive isomorphic check. Its execution time after speedup is even longer than directly executing pattern enumeration on commodity machines. Based on more advanced algorithms, the software/hardware co-designed FlexMiner [107] achieves much better performance. 
Its software component analyzes the user-specified patterns, generates intermediate representations (IR) that contain the necessary information for embedding exploration (e.g., symmetry breaking restrictions), and passes them to a hardware embedding exploration engine. The downside is that, the implementation based on FlexMiner has to use the less general notion that may not benefit from new optimizations. SISA [108] is the first set-centric ProcessingIn-Memory (PIM) based GPM accelerator. It takes a similar approach by extending ISA but the execution of the set operations is performed on PIM. While SISA can effectively boost the performance of GPM and reduce data movements, it does not support sparse tensor computations, which 43 not only require set operations on sparse vectors, but also perform floating point computations on values based on the outcome indices of set operations. 3.3 Stream Instruction Set Extension 3.3.1 Stream Definition In general, we define a sparse vector as a stream, which can be: (1) a key stream—a list of keys, such as the edge list in graph representation; or (2) a (key,value) stream—a list of (key,value), such as the pair of indices of non-zero elements and their values in a sparse tensor representation. We propose a novel instruction set extension that intrinsically operates on streams, supporting both data movement and computation. 3.3.2 Register Extension The stream ISA extension represents stream as the first-class data type. The processor uses N stream registersto maintain stream information, where N is the maximum number of active streams supported. A stream is active between its initialization and free—each can be performed by an instruction. A stream register stores the stream ID, the stream length, the start key address, the start value address, stream priority, and a valid bit. The stream registers cannot be accessed by any instruction and are setup up when the corresponding stream is initialized. The program can reference a stream by the stream ID, the mapping between a stream ID and its stream register is managed internally in the processor with the Stream Mapping Table (SMT). The key and value address of a stream register are only used by the processor to refer to the keys and values when the corresponding stream ID is referenced. SparseCore also includes three auxiliary graph format registers (GFR0, GFR1, GFR2) to support various graph representations. The content of these registers are interpreted based on the specific format. For simplicity, this paper considers compressed sparse row (CSR) [123], which has two arrays: (1) vertex array stores vertices sequentially with each entry pointing to the start of 44 Table 3.1: Stream ISA Extension. R0-R4 are general-purpose registers, F0,F1 are FP registers, IMM is an immediate value. 
Instruction Description Operands S READ R0, R1, R2, R3 Initialize a key stream R0:start key address, R1:stream length, R2:stream ID, R3: priority S VREAD R0, R1, R2, R3, R4 Initialize a (key,value) stream R0:start key address, R1:stream length, R2:stream ID, R3:start value address, R4: priority S FREE R0 De-allocate a stream R0:stream ID S FETCH R0, R1, R2 Return one element of a key stream R0:stream ID, R1:element offset, R2: returned element S SUB R0, R1, R2, R3 Subtraction of two streams, use stream of id R0 to subtract stream of id R1 R0,R1: input stream IDs, R2:output stream ID, R3: upper-bound of the subtracted result S SUB.C R0, R1, R2, R3 Return # of elements in subtraction of two streams, use stream of id R0 to subtract stream of id R1 R0,R1: input stream IDs, R2:returned result, R3: upper-bound of the subtracted result S INTER R0, R1, R2, R3 Intersection of two streams R0,R1: input stream IDs, R2:output stream ID, R3: upper-bound of the intersected result S INTER.C R0, R1, R2, R3 Return # of elements in intersection of two streams R0,R1: input stream IDs, R2: returned result, R3: upper-bound of the intersected result S VINTER R0, R1, R2, IMM Sparse computation using the values of two (key,value) streams R0,R1: input stream IDs, R2:returned result, IMM: specify user-defined op S MERGE R0, R1, R2 Merge of two streams R0,R1: input stream IDs, R2:output stream ID S MERGE.C R0, R1, R2 Return # of elements in merge of two streams R0,R1: input stream IDs, R2:output stream ID S VMERGE F0, F1,R0, R1, R2 Sparse computation with two (key,value) streams F0,F1: multiplication scale, R0,R1: input stream IDs, R2:output stream ID S LD GFR R0, R1, R2 Initialize GFRs based on graph representation R0, R1, R2: content to be loaded into GFR0, GFR1, and GFR2 S NESTINTER R0, R1 Nested intersection R0: stream ID, R1: returned result the vertex’s edge list; and (2) edge array stores the edges of each vertex sequentially. In this case, GFR0, GFR1 and GFR2 hold the CSR index, CSR edge list, and CSR offset, which can be loaded to GFRs by an instruction. CSR index and CSR edge list store the address of vertex array and edge array, respectively. The CSR offset stores the offset of the the smallest element larger than the vertex itself in the neighbor list. It is used to support the nested intersection, and the symmetric breaking optimization. 3.3.3 Instruction Set Specification Table 3.1 lists the instruction set extension for streams. The instructions can be classified into three categories: (1) stream initialization and free; (2) stream computation; and (3) stream element access. The input operands for all instructions are general purpose registers containing stream ID. S READ and S VREAD are the instructions to initialize a key stream and (key,value) stream, respectively. The operands are general purpose registers containing start key address (also start value address for S VREAD), stream length, stream priority, and stream ID. After they are executed, if the stream ID is not active, an unused stream register (valid bit is 0) will be allocated to the stream and the new mapping entry is created and inserted into SMT. If the stream ID is already active, the previous mapping is overwritten with the current stream information. After creating the mapping to a stream register, both instruction will also trigger the fetching of key stream to the stream cache (see details in Section 3.4.3). Thus, if the current stream overwrites the previous one, the content in the stream cache will also be updated. 
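Returning to the graph format registers described above, the following hedged sketch shows how the CSR arrays might be bound to GFR0-GFR2 and a vertex's neighbor list initialized as a key stream; s_ld_gfr and s_read are illustrative intrinsic names standing in for S LD GFR and S READ, and the CSR layout (an index array with |V|+1 entries) is an assumption.

// Illustrative intrinsics (assumed signatures, not a real compiler interface).
void s_ld_gfr(const int* csrIndex, const int* csrEdges, const int* csrOffset);
void s_read(const int* startKeyAddr, int length, int streamId, int priority);

struct CSR {
  const int* index;    // per-vertex start positions into edges (|V|+1 entries assumed)
  const int* edges;    // concatenated neighbor lists
  const int* offset;   // per-vertex offset of the first neighbor larger than the vertex itself
};

void initNeighborStream(const CSR& g, int v, int streamId, int priority) {
  // In practice the GFRs would be loaded once per graph, not once per stream.
  s_ld_gfr(g.index, g.edges, g.offset);
  const int* start = g.edges + g.index[v];        // start key address of N(v)
  int len = g.index[v + 1] - g.index[v];          // stream length
  s_read(start, len, streamId, priority);         // initialize the key stream
}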
45 S VREAD does not load the values, which will be triggered when the computation instruction for (key,value) stream (V VINTER) is executed. The values are accessed and fetched through the ordinary memory hierarchy rather than the stream cache. S FREE is used to free a stream. When it is executed, the processor finds the SMT entry for the stream ID indicated in the operand and set the valid bit to 0. If such entry is not found, an exception is raised. The stream ISA contains nine instructions for stream computation. S INTER, S INTER.C, S SUB, S SUB.C, S MERGE, S MERGE.C perform simple computation on key stream—intersection and subtraction. The suffix “.C” indicates the variants of the corresponding instructions that do not output the result stream but just the count of non-zeros in the result stream. If the output is a stream, the stream ID of an initialized stream should be given in one of the input registers. The stream ID is then added into SMT. These instructions take a upper-bound operand R3 to support early intersection/subtraction termination. Once all output stream elements smaller than R3 have been produced, the instruction terminates the computation early. For unbounded operations, R3 is set to -1. S VINTER performs user-defined intersected value computations. These instructions compute the intersection of the keys of the two input (key,value) streams, and then performs the computation on the values corresponding keys. For example, the key intersection of two (key,value) streams [(1,45),(3,21),(7,13)] and [(2,14),(5,36),(7,2)] is 7. The instruction performs the computation on the corresponding values: assuming the computation is multiply-accumulation (MAC) specified in IMM, the result is 13 × 2 = 26 in R2. The other types of computation can be specified in IMM, such as MAX (choose the maximum and accumulate), MIN (choose the minimum and accumulate), or any reduction operation. S VMERGE performs merged value computations and outputs a key value stream. It computes the merged keys, multiplies the corresponding value with a given scale, and adds multiplication results of the same key. For example, given two inputs streams [(1,4),(3,21)] and [(1,1),(5,36)], the result keys would be [1,3,5]. Assume scales are 2 and 3, each value would be [(4 ∗ 2 + 1 ∗ 3),(21 ∗ 2),(36 ∗ 3)]. Thus the result stream would be[(1,11),(3,42),(5,108)]. 46 In SparseCore, computation on values is performed by a dedicated functional unit, which can be easily extended to perform new operations. The two instructions are useful in sparse tensor computation, where the keys indicate the positions of the non-zeros and the actual computations are performed on these values. If any input stream ID is not a (key,value) stream, an exception is raised. S NESTINTER performs the nested intersection. It is an instruction specialized for GPM. Let the input stream (an edge list) be S = [s0,s1,...,sk ], where each si corresponds to a vertex. Let us denote the edge list of each si as S(si), and the result of the instruction as C. This instruction performs the following computation: C = ∑ i=k i=0 count((S∩S(si))), where ∩ is the intersection between two key streams, and count returns the length of a stream. The intersections are bounded by the value of si . Thus, this instruction implements dependent stream intersection. Given a stream S, the other streams to be intersected with it are determined by the keys (vertices) of S. 
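A hedged software reference model of the S NESTINTER semantics just described is given below; adj is an assumed array of sorted neighbor lists, and a strict upper bound (elements smaller than si) is assumed for the per-neighbor intersections, following the upper-bound convention of S INTER.

// C = sum over si in S of the number of elements of S ∩ S(si) below si,
// i.e., the nested, bounded intersections performed by one S_NESTINTER.
#include <vector>

long nestedIntersectRef(const std::vector<int>& S,
                        const std::vector<std::vector<int>>& adj) {
  long C = 0;
  for (int si : S) {
    const std::vector<int>& Si = adj[si];   // S(si): neighbor list of vertex si
    size_t i = 0, j = 0;
    while (i < S.size() && j < Si.size() && S[i] < si && Si[j] < si) {
      if (S[i] == Si[j])     { C++; i++; j++; }
      else if (S[i] < Si[j]) i++;
      else                   j++;
    }
  }
  return C;
}

For triangle counting with S = N(v0), each term counts the common neighbors of v0 and si restricted below si; whether the hardware bound is strict or inclusive is an assumption of this sketch.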
The generation of the dependent streams corresponding to each si is performed by the processor using the information in the three GFR registers, which are loaded once using S LD GFR before processing a graph. The S FETCH instruction performs the stream element access—returning the element with a specific offset in a stream, which can be either the output stream of an intersection operation or an initialized stream loaded from memory. Typically, the offset is incremented to traverse all elements in a stream. When it reaches the end of the stream, S FETCH will return a special “End Of Stream (EOS)” value. In Table 3.1, we included all parameters needed by an instruction as operands for clarity. For some instructions, the total number of operands might be too large to be encoded in the instruction format of a given processor. It does not pose fundamental issue. In an implementation, we can include shared registers to hold the priority (R3 of S READ and S VREAD) and scales (F0 and F0 of S VMERGE), and remove them from the operands. We can introduce simple instructions to set these registers, which can be used before these instructions. This solution is feasible and correct since such information is obtained when the instructions are decoded, which happens in order. We do not include such details in the instruction specification to avoid diluting the essential ideas. 47 for (Vertex v0: graph) { Set n0 = v0.neighbours(); // equivalent to // cnt += NestedIntersect(n0); for (Vertex v1: n0) { Set n1 = v1.neighbours(); cnt += Intersect(n0, n1).size(); } } … // for (v0: graph) // R1-R4: start_addr, len, id, priority of n0 S_READ R1, R2, R3, R4 S_NESTINTER R3, R5 S_FREE R3 ADD R6, R5, R6 // cnt += NestedIntersect(n0); for (Vertex v0: graph) { Set n0 = v0.neighbours(); for (Vertex v1: n0) { Set n1 = v1.neighbours(); Set t0 = BoundedIntersect(n0, n1, v0); for (Vertex v2: t0) {… // calculate N(v1)-N(v0)-N(v2)-{v0, v2} } } } … // for (v0: graph) for (v1: n0) // R1-R4: start_addr, len, id, priority of n0 // R5-R8: start_addr, len, id, priority of n1 // R9: id of t0, R10: v0 S_READ R1, R2, R3, R4 // create the input streams S_READ R5, R6, R7, R8 S_INTER R3, R7, R9, R10 // R10=v0 is the upper bound S_FREE R3; S_FREE R7 // free the input streams … // for (v2: t0) … (a) Nested Intersection (triangle) (b) Bounded Intersection (tailed-triangle) Figure 3.3: Pattern enumeration with Stream ISA 3.3.4 Code Examples Figure 3.3 (a) shows triangle counting implementation using stream ISA. The v1 for-loop is essentially a nested intersection operation on n0, which can be implemented by S NESTINTER. Figure 3.3 (b) shows the implementation of tailed-triangle mining (shown in Figure 3.2 (b)) with intersection early termination. We use S INTER to intersect the two edge lists with an upper-bound v0 (stored in R10) so that the intersection can terminate early. Figure 3.4 (a)(b) show the inner product implementation with our stream ISA extension. Line 7 performs the multiply-accumulation on the values of the intersected keys. Figure 3.4 (c)(d) show the Gustavson implementation. Line 10 performs the merge computation. With SparseCore, the same architecture and ISA can flexibly implement different algorithms. The ISA allows different loop iterations to use the same stream IDs, similar to the same variable names. 
The processor keeps track of the active streams in both front-end (after instruction 48 decoding) and back-end (at instruction commit time), and will recognize the same stream IDs in different iterations as different streams. 3.4 SparseCore Architecture The SparseCore architecture is composed of specialized structures built on conventional processor architecture and memory hierarchy that implement the stream ISA extensions. Figure 3.5 shows a detailed overview with stream related components highlighted in gray color. All instructions in Table 3.1 except S NESTINTER occupy one entry in the Reorder Buffer (ROB). 3.4.1 Stream ID Mapping In SparseCore, each stream ID (Sid) specified in an instruction is mapped to an internal stream register (Sreg). This mapping is performed at the front-end after instruction decoding and the mapping relation is kept in SMT. Besides the stream ID and its mapped stream register, each SMT entry contains: (1) two valid bits: VD, indicating the define point of the stream, and VA, indicating whether the stream is active; (2) the start (s) and produced (p) bit, which indicate whether SCache contains the keys from the start of the stream and whether the data for the whole stream is produced (so that it can be used by the dependent streams); and (3) the pred0 and pred1: the IDs of the streams that the current stream depends on. Initially, both VD and VA are 0 and SMT is empty. Both VD and VA are set after decoding a S READ or a S VREAD instruction and the SMT entry indicates that the Sid i in the last operand of the instruction is mapped to Sreg j. BothVD andVA are set to one, they indicate that the instruction defines Sidi and it is active. Later, when S FREE Sid i is decoded, the SMT is examined and an entry for Sid i should be found (otherwise an exception is raised), and its VD is reset, while VA is unchanged. This means that Sid i is no longer defined—the instructions after S FREE Sid i should not be able to reference Sid i—but the stream is still active since S FREE Sid i has not been retired. When S FREE Sid i is retired, VA is reset and the entry becomes free. When a new 49 stream is mapped, the processor checks SMT and finds an entry with VA = 0, which implies VD = 0. Note that is not true vise versa—VD = 0 does not imply VA = 0. Our design expects the codes to call S FREE after a stream is no longer used, so that its SMT entry can be released. It can be easily ensured by the compiler. When all stream registers are occupied (VA = 1), the instruction that initializes a new stream will be stalled. The current design with 16 stream registers is enough for all our applications. The larger (or even unlimited) number of stream IDs can be supported by virtualization: When a thread attempts to allocate too many streams, newer entries will be saved to a special memory region to release SMT space. The deadlock may happen when an intersection instruction is blocking the ROB and the related streams are swapped out. To avoid this scenario, we prioritize the streams used by the first intersection instruction in ROB and swap in its operand streams. The design can naturally support the stream operations in loop iterations of GPM. Typically, inside an iteration, some streams are initialized and computations on them are performed before S FREEs at the end of the iteration (refer to Figure 3.3 (c) for an example). Different iterations can use the same stream IDs, which are mapped to different SMT entries. 
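The SMT entry lifecycle described above can be summarized by the hedged software model below; it only models the VD/VA bits (the start/produced bits and the pred0/pred1 dependency fields are omitted), and the structure is illustrative rather than a description of the actual hardware tables.

#include <array>
#include <stdexcept>

struct SmtEntry { int sid = -1; int sreg = -1; bool VD = false, VA = false; };

struct SMT {
  std::array<SmtEntry, 16> e;                       // 16 stream registers in the current design

  int onDecodeRead(int sid) {                       // S_READ / S_VREAD decoded
    for (auto& x : e) if (x.VA && x.sid == sid) { x.VD = true; return x.sreg; }  // remap an already-active ID
    for (int r = 0; r < 16; r++) if (!e[r].VA) { e[r] = {sid, r, true, true}; return r; }
    return -1;                                      // all stream registers busy: stall the instruction
  }
  void onDecodeFree(int sid) {                      // S_FREE decoded: ID no longer referenceable
    for (auto& x : e) if (x.VA && x.sid == sid && x.VD) { x.VD = false; return; }
    throw std::runtime_error("S_FREE on unmapped stream ID");  // exception per Section 3.3.3
  }
  void onRetireFree(int sid) {                      // S_FREE retired: entry becomes reusable
    for (auto& x : e) if (x.VA && x.sid == sid && !x.VD) { x.VA = false; return; }
  }
};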
The SMT does not increase the latency of the CPU pipeline and can be implemented in a pipelined manner similar to the register rename stage of a CPU: the mapping from architectural registers to physical registers is analogous to the mapping from Sids to Sregs, together with the tracking of the "readiness" of stream IDs.

3.4.2 Stream Unit and Stream Reuse

The Stream Unit (SU) performs stream operations; we applied extensive optimizations to achieve high performance. Figure 3.6 illustrates the parallel comparison inside an SU. For simplicity, we only show 4 elements, which have all been loaded into the internal buffer of the SU. We set the buffer size to 16 and use double buffering to avoid stalls while one buffer is occupied by moving data into the SU. For intersection, at Cycle 1, the first element in each stream is loaded and compared with all the elements in the other stream. For stream A, its 3rd element is found to be equal to the first element of B; thus at the next cycle, the 3rd element of A will be loaded for parallel comparison. For stream B, the first element of A is less than any of its elements, so no action is needed. At Cycle 2, the 3rd element of A and the first element of B are found equal, so element 3 is put into the result buffer. Finally, at Cycle 3, each stream uses its next element for parallel comparison. The same optimization also applies to subtraction and merge. For subtraction and merge, the parallel comparison may generate multiple elements in one cycle, while only zero or one element can be produced per cycle for intersection. For example, at Cycle 1, the first two elements in stream A are less than the first element of B, thus they will be put into the result buffer for subtraction and merge. We use a buffer to keep the output stream and write it back when a full S-Cache line of elements has been generated. In both GPM and tensor computation, a stream can be reused many times or immediately used by the following computation, either because of the algorithm or because such a stream keeps intermediate results. To avoid unnecessary data movement in and out of the S-Cache, a scratchpad shared among all SUs is used to store streams with high stream priority. The stream priority (R3 of S_READ and S_VREAD) can be assigned by the compiler after program analysis.

3.4.3 Stream Cache

In SparseCore, the keys of each active stream are loaded into a special stream cache (S-Cache), which lies on top of the L2 cache together with the L1. The values of a (key, value) stream are fetched through the normal memory hierarchy. When the stream keys are accessed using the stream instructions, the data does not pollute the L1. Since the keys of a stream are accessed sequentially, the data can be effectively prefetched to the S-Cache without a complex prefetcher, thanks to the known access pattern. Each stream register has a slot that holds a fixed number of keys of the stream. We use a 64-key slot, which leads to a 256-byte slot size. With 16 stream registers, the total size of the S-Cache is 4KB. When an S_READ is executed, the first 64 keys are fetched to the S-Cache, and the start bit in the SMT for the stream is set. Unless the length of the stream is no more than 64, at this point the S-Cache only contains the first portion of the stream. The start bit indicates that the instructions that depend on the stream can use the data in the S-Cache slot. Referring to Table 3.1, our ISA does not contain any instruction that explicitly stores to a stream: only S_INTER and S_SUB produce results in the output stream.
When these instructions are executed, the result keys are written to the S-Cache slot in groups of 64. If the result stream contains more than 64 keys, the slot will contain the most recently produced 64 keys, while the previous slot is written back to L2 and the start bit is cleared. When the whole result stream has been generated by the computation instruction, the produced bit is set, which is used to trigger the dependent instructions. The typical code pattern is that two streams are initialized by S_READ before the intersection operation is performed. In this case, data fetching from L2 to the S-Cache and the transfer to the SUs for computation can be pipelined. To support that, we use the idea of double buffering and divide each slot into two sub-slots. When a sub-slot is being fetched from L2, the keys in the other sub-slot can be prefetched to the SU simultaneously and the intersection computation can be overlapped. The stream cache can send two cache lines of data to two SUs each cycle. With multiple SUs, the parallel execution time of multiple intersections can be better overlapped with the data fetching time of these streams. When multiple SUs (4 in our design) need data to perform computations, the S-Cache has to schedule the data transfers to the different SUs. We use a simple round-robin policy: at each cycle, the S-Cache schedules the transfer to a different SU that is waiting for data. Each SU is able to perform the intersection on the partial key streams received.

3.4.4 Stream Data Dependency

Two streams may have a dependency due to: (1) stream ID, where an instruction uses the output stream of a previous computation (S_INTER or S_SUB) as an input stream; or (2) overlapped memory regions of two streams. It is easy to handle the first scenario: since the stream IDs are available after decoding, the dependency can be handled in a similar manner to data dependencies on general registers. When a dependency is identified, the consumer instruction can only execute after the producer instruction. This is enforced by filling pred0 and pred1 in the SMT entry of the consumer instruction. When the producer instruction finishes, its SMT entry's produced bit is set. Each cycle, the processor checks the status of the producer instruction(s) and triggers the consumer instruction when the produced bits of all operands are set. If the produced key stream has fewer than 64 keys, the whole stream is in the S-Cache with the start bit set and the consumer instruction reads directly from the S-Cache; otherwise, the slot is refilled from L2. For the second scenario, we can check the potential dependency conservatively by leveraging the fact that the length of the output stream is at most the sum of the lengths of the two input streams. Thus, we can conservatively bound the maximum length of the output stream. Possibly overlapping stream memory regions can then be detected using the start key addresses and stream lengths of the different streams. The dependent stream instructions need to be executed sequentially, which is enforced using the same mechanism as in the first scenario.

3.4.5 Sparse Computation on Values

Sparse computation on values is supported by the coordination between the SU, the value buffer (vBuf), the load queue, and the Stream Value Processing Unit (SVPU). When S_VINTER is executed, an SU starts with the key intersection calculation and the output keys are given to the Value Address Generator (VA gen) associated with the SU (refer to Figure 3.5). VA gen generates the value addresses for each key in the intersection.
These addresses are sent to the load queue to request the values through the normal memory hierarchy, rather than the S-Cache. Each value request is also allocated an entry in the vBuf, which will collect the two values returned from the load queue (val0 and val1). Each entry has a ready bit (r) for each value, which is set when the load queue receives the value. For S_VINTER, we assume that the operation is commutative (e.g., multiply-accumulate), thus the computation using val0 and val1 can be performed by the SVPU as soon as both ready bits are set. We do not need to enforce any order on the accumulation. The acc reg is used to keep the accumulated partial results. While performing a substantial amount of computation, this instruction only takes one entry in the ROB. After the final result is produced in the acc reg of the corresponding SU, it is copied to the destination register, and the instruction retires from the processor when it reaches the head of the ROB. The execution flow for S_VMERGE is similar: VA gen and vBuf are also used for value requests. However, the SU performs a merge rather than an intersection, and instead of accumulating results, S_VMERGE outputs each value.

3.4.6 Nested Intersection

For S_NESTINTER, we use the Nested Instruction Translator to generate a sequence of other stream ISA instructions that implements it. Based on the input key stream, the translator first generates the stream information for each key element. The memory addresses of the stream information are calculated based on the GFR registers, and the memory requests are then sent through the load queue. For each stream, an entry is allocated in the translation buffer; its ready bit (rdy) is set when the stream information is returned by the load queue. Similar to the pointer to the vBuf entry, each load queue entry also keeps a pointer to the translation buffer entry. For each nested stream, three instructions are generated: S_READ, S_INTER.C, and S_FREE. An addition instruction is also generated to accumulate the counts. Each instruction takes an entry in the translation buffer. The start address and stream length fields are only used by S_READ. When the stream information is ready, the three instructions are inserted into the ROB. The translation is stalled when the translation buffer is full, which can be due to either the ROB being full or waiting for the stream information. In either case, the space will eventually be released because the instructions in the ROB will retire and the requested data will be refilled. These events do not wait for the translation procedure and cause no deadlock.

3.5 Implementation and Software

3.5.1 Implementation Considerations

S_NESTINTER is translated into a variable-length instruction sequence by the Nested Instruction Translator and takes multiple ROB entries. To ensure precise exceptions, the processor takes a checkpoint of the registers before the instruction. If an exception is raised during the execution, the processor rolls back to the checkpoint and raises the exception handler. This is similar to the mechanisms for atomic block execution [124, 125]. Besides information such as general registers, the checkpoint includes the content of the SMT, the stream registers, and the GFR registers. In the SparseCore architecture, the stream cache does not participate in the coherence protocol. For the applications that SparseCore targets, the data (such as graphs or sparse tensors) are read-only, thus there is no correctness issue.
The S-Cache itself, however, is not read-only: many of our applications write intermediate data to the stream cache. For data synchronization, normal CPU instructions should not access stream data. An implementation raises an exception if this is violated, ensuring that the S-Cache can only be accessed via S_FETCH.

3.5.2 Hardware Cost

We implemented the key components of SparseCore, including the S-Cache, SU, SMT, and stream registers, using Chisel [126]. We used Synopsys Design Compiler with the Open-Cell 15nm design library [127] to synthesize the Verilog code generated by Chisel. These components can achieve a frequency of 4.35GHz, indicating that our architectural extensions will not affect the latency of the baseline processor. We estimate the area of the SRAMs in our scratchpad using CACTI [128]. We use the 22nm technology, which is the closest to 15nm available in CACTI. The total area of our S-Cache with 12 slots, 4 SUs, the SMT, the scratchpad, and the Sregs is 0.73mm2, while the area of an Intel Skylake server core (14nm) is close to 15mm2 [105].

3.5.3 Compiler

For GPM, we developed a compiler to generate GPM implementations with the stream ISA. The compiler takes the user-specified patterns as input, synthesizes the corresponding intersection-based GPM algorithms (e.g., those in Figure 3.2), and translates them to C++ implementations embedded with stream ISA assembly instructions. For tensor computation, we modified the tensor algebra compiler TACO [112], which takes a user-specified math expression as input and generates C++ implementations embedded with stream ISA instructions. One major challenge is stream management during code generation (similar to register allocation in traditional compilers). To implement an intersection, the compiler may generate instructions that introduce up to three active streams—two input streams loaded by S_READ and one output stream produced by S_INTER. We release these created streams eagerly, since the resources used to maintain active streams (e.g., the S-Cache and stream registers) are limited. The streams created by S_READ are released by S_FREE after the intersection operation, and the compiler inserts S_FREE instructions to free the stream produced by S_INTER once it is no longer needed. If the number of active streams reaches its limit (i.e., the number of stream registers), the compiler simply falls back to generating scalar-ISA-based intersection code and prints a warning message. In practice, we notice that such a "fall-back" scenario is rare (it did not happen for any of the applications evaluated) thanks to our aggressive stream freeing strategy.

3.6 Evaluation

3.6.1 Simulator and Configuration

We simulate SparseCore on zSim [62]. We implement all SparseCore architectural components in the simulator. Our configuration is listed in Table 3.2.

Table 3.2: Architecture Configuration
Number of cores: 6
ROB size: 128
Load queue size: 32
Cache line size: 64B
L1d cache: 32KB, 8-way
L2: 256KB, 8-way
L3: 12MB, 16-way
S-Cache slot size: 256B
Scratchpad size: 16KB

For GPM, we compare SparseCore with recent accelerators. For TrieJax [105], we implemented the partial join results (PJR) cache and simulated their cache hierarchy. The access patterns are analyzed and simulated according to the operational flow description. For Flexminer [107], we implemented the cmap and simulated their access patterns.

Table 3.3: GPM Apps
Triangle counting (T)
Three chain counting (TC)
Tailed triangle counting (TT)
3-Motif (TM)
4-Clique (4C)
5-Clique (5C)
Frequent subgraph mining (FSM)
For GRAMER [106], we implemented its specialized memory hierarchy and simulated the access patterns. For TrieJax, Flexminer, and GRAMER, we assume full overlapping of any non-dependent data accesses.

3.6.2 Graph Mining Algorithms and Data Sets

We execute our InHouseAutomine to mine different patterns, because the code of AutoMine [110] is not publicly available. For GPM, we choose several popular applications, listed in Table 3.3, to evaluate SparseCore. They can be divided into four categories. (1) Pattern counting applications, which include triangle (T), three-chain (TC), and tailed-triangle (TT) counting. We use T, 4C, and 5C to denote the implementations with nested intersection, while TS, 4CS, and 5CS refer to the corresponding implementations without this optimization. (2) k-Motif mining, which counts the embeddings of all connected patterns with a given size k. (3) k-Clique mining, which discovers all size-k complete subgraphs of the input graph. (4) Frequent subgraph mining (FSM), which aims to discover all vertex-labeled frequent patterns. A pattern is considered frequent if and only if its support is no less than a user-specified threshold. Similar to Peregrine [129], we choose the minimum image-based (MINI) support metric [130] and only discover frequent patterns with no more than three edges. It is important to note that InHouseAutomine and our compiler for SparseCore implement the same algorithm. The only pass of the SparseCore compiler in addition to the InHouseAutomine compiler is to emit code using the stream ISA extension.

Table 3.4: Graph Datasets (name, #V, #E, avg D, max D)
citeseer (C) [19, 131, 132]: 3.3K, 4.5K, 1.39, 99
email-eu-core (E) [133, 134]: 1.0K, 16.1K, 25.4, 345
soc-sign-bitcoinalpha (B) [135–137]: 3.8K, 24K, 6.4, 511
p2p-Gnutella08 (G) [138, 139]: 6K, 21K, 3.3, 97
socfb-Haverford76 (F) [19]: 1.4K, 60K, 41.3, 375
wiki-vote (W) [140, 141]: 7K, 104K, 14.6, 1065
mico (M) [142]: 96.6K, 1.1M, 11.2, 1359
com-youtube (Y) [143]: 1.1M, 3.0M, 2.6, 28754
patent (P) [144]: 3.8M, 16.5M, 8.8, 793
livejournal (L) [60, 145]: 4.8M, 42.9M, 17.7, 20333

Table 3.4 lists the real-world graphs we used, which come from various domains ranging from social network analysis to bioinformatics.

Table 3.5: Matrix and Tensor Datasets (name, dimensions, nonzeros, density)
Circuit204 (C) [146]: 1020×1020, 5883, 0.57%
Email-Eu-core (E) [133, 134]: 1005×1005, 25571, 2.5%
Fpga dcop 26 (F) [146]: 1220×1220, 5892, 0.40%
Piston (P) [146]: 2025×2025, 100015, 2.4%
Laser (L) [146]: 3002×3002, 5000, 0.055%
Grid2 (G) [146]: 3296×3296, 6432, 0.059%
Hydr1c (H) [146]: 5308×5308, 23752, 0.084%
California (CA) [147, 148]: 9664×9664, 16150, 0.017%
ex19 (EX) [146]: 12005×12005, 259577, 0.18%
gridgena (GR) [146]: 48962×48962, 512084, 0.021%
TSOPF (T) [146]: 18696×18696, 4396289, 1.26%
Chicago Crime (Ch) [149]: 6.2K×24×2.4K, 5.3M, 1.46%
Uber Pickups (U) [149]: 4.3K×1.1K×1.7K, 3.3M, 0.0385%

For sparse tensor computation, we implemented the three algorithms for SPMSPM, tensor-times-vector (TTV, Z_ij = Σ_k A_ijk · B_k), and tensor-times-matrix (TTM, Z_ijk = Σ_l A_ijl · B_kl). These applications make use of SparseCore's sparse value computation ability. We use the state-of-the-art tensor algebra compiler TACO [112] to generate the tensor kernels. We use the matrices and tensors listed in Table 3.5. We conduct more comprehensive evaluations of GPM applications than of tensor computation, since the general-purpose-processor-based design is motivated by the complex GPM code patterns.
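As a semantic reference for the TTV and TTM kernels defined above, the following dense NumPy sketch spells out the two formulas; it is only illustrative, since the evaluated implementations operate on the sparse formats generated by TACO.

import numpy as np

def ttv(A, b):
    # Tensor-times-vector: Z[i,j] = sum_k A[i,j,k] * b[k] (dense reference)
    I, J, K = A.shape
    Z = np.zeros((I, J))
    for i in range(I):
        for j in range(J):
            for k in range(K):
                Z[i, j] += A[i, j, k] * b[k]
    return Z

def ttm(A, B):
    # Tensor-times-matrix: Z[i,j,k] = sum_l A[i,j,l] * B[k,l] (dense reference)
    I, J, L = A.shape
    K = B.shape[0]
    Z = np.zeros((I, J, K))
    for i in range(I):
        for j in range(J):
            for k in range(K):
                for l in range(L):
                    Z[i, j, k] += A[i, j, l] * B[k, l]
    return Z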
3.6.3 Overall Performance of GPM

3.6.3.1 Comparison with Flexminer, TrieJax, and GRAMER

To make a fair comparison, we only enable one computation unit in each accelerator and one SU in SparseCore. For Flexminer, the area of a PE is 0.18mm2, excluding the shared cache of 4MB. In comparison, the average area for each SU in SparseCore, including the scratchpad, S-Cache, and all other added components, is 0.183mm2. For TrieJax, the total area of the architecture is 5.31mm2 for 32 internal threads; on average, each thread is 0.166mm2—very similar to the area per SU of SparseCore. It is important to note that Flexminer and SparseCore implement the same algorithm. We compare the performance of SparseCore, Flexminer, and TrieJax in Figure 3.7. Results related to TrieJax are shown in log scale. TrieJax cannot support the three-chain, 3-Motif, and tailed-triangle patterns, which are vertex-induced [107]. Intuitively, an edge-induced subgraph of G is formed by taking a subset of G's edges, while a vertex-induced subgraph of G is formed by taking a subset of G's vertices and all edges among them. A vertex/edge-induced pattern requires that the embeddings from the input graph be vertex/edge-induced subgraphs. Since TrieJax only supports the join primitive, it can only support edge-induced patterns. We only evaluate clique counting, for which the edge-induced and vertex-induced clique patterns happen to be the same. On average, SparseCore outperforms TrieJax by 3651.2×. The performance gap is due to TrieJax's inferior algorithm design and its lack of graph structure support. TrieJax processes the graph data as a database table, which leads to unnecessary binary searches and significant redundancy in GPM. For example, TrieJax lacks support for symmetry breaking, which means that for triangle and 4/5-Clique counting it counts the same embedding 6, 24, and 120 times, respectively. Moreover, consider triangle counting: when extending from v1 to v2, TrieJax tries to find the edge list of v2 via its LUB unit, which performs a binary search on the table based on Worst Case Optimal Join. This operation may take up to O(logN) time, whereas for the CSR-graph-based approach, the same operation only takes O(1) time. Even though TrieJax has a partial join results (PJR) cache, we believe this design is inefficient and fails to exploit the access pattern of graph mining. The PJR cache deallocates large entries that exceed 1KB, which corresponds to only 256 vertices. However, for GPM applications, vertices with high degrees are more likely to be accessed [106], and these usually cannot be placed in the PJR. Hence, the PJR cache fails to cache the most frequently accessed data. For example, the largest degree of email-eu-core, a small graph with merely 1K vertices, is 345, which exceeds the capacity of a PJR entry. On average, SparseCore outperforms Flexminer by 2.7×. This speedup comes from the parallel comparison design inside the SU, which provides an advantage in the basic stream computation. GRAMER is based on a pattern-oblivious algorithm with much more redundant computation; it is even slower than our baseline CPU benchmark. Based on our results, SparseCore outperforms GRAMER by 40.1× on average and up to 181.8×.

3.6.3.2 Comparison with CPU

Further performance comparisons among SparseCore (with/without nested intersection) and the CPU baseline are shown in Figure 3.8. TS, 4CS, and 5CS refer to the triangle counting, 4-Clique, and 5-Clique implementations without nested intersection.
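For reference, the nested intersection used by T, 4C, and 5C is functionally equivalent to the following Python sketch; the helper names (sorted_intersect_count, adj) are illustrative, and in hardware the Nested Instruction Translator emits the per-element S_READ / S_INTER.C / ADD / S_FREE sequence described in Section 3.4.6 rather than running this loop.

def sorted_intersect_count(a, b):
    # count common keys of two sorted key streams (what S_INTER.C computes)
    i = j = cnt = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            cnt += 1; i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return cnt

def nested_intersect_count(n0, adj):
    # software equivalent of S_NESTINTER on stream n0: for every vertex v1
    # in n0, intersect n0 with its neighbor list N(v1) and accumulate the
    # match count (adj maps a vertex ID to its sorted neighbor list)
    cnt = 0
    for v1 in n0:
        cnt += sorted_intersect_count(n0, adj[v1])
    return cnt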
On average, enabling nested intersection speeds up these applications by 1.65×. This is because, with nested intersection instructions, the normal instructions used to explicitly manage the corresponding loops, graph structure accesses, and embedding counting are eliminated. Nested intersection instructions also allow more intersections to be executed on-the-fly simultaneously, thanks to the reduction of normal instructions that would otherwise occupy more ROB entries. Besides, note that SparseCore achieves a smaller speedup for FSM. This is because the support calculation in FSM is costly, and thus the intersection/subtraction operations that our architecture accelerates take a smaller portion of the execution time. Comparing across different datasets, SparseCore achieves higher speedups on graphs with higher average degree. This can be explained by Amdahl's law: on graphs with higher degrees, the operand lengths of intersection/subtraction operations are generally longer. As a result, these operations are more computation-intensive and take up a larger portion of the execution time. Recall that SparseCore only speeds up intersection/subtraction operations, and thus it achieves a higher performance improvement on denser graphs. Also, a higher degree means a stream can be reused more often, leading to better utilization of the scratchpad and a significant reduction in accesses to the normal cache hierarchy.

3.6.4 Cycle Breakdown Analysis

We analyze the source of SparseCore's performance gain by examining the breakdown of execution cycles for the CPU and SparseCore, as shown in Figure 3.9 and Figure 3.10. We can see from Figure 3.9 that branch misprediction cost takes a significant portion of the total cycles for the CPU. This is due to the code pattern of intersections, which contains branches in a tight loop, making the branches difficult to predict. As shown in Figure 3.10, the branch misprediction cost is significantly reduced for SparseCore, due to our specialized instructions. The computation category is counted as the sum of the cycles when any functional unit of the CPU is busy. This category can be further divided into two parts: Other computation and Intersection. Intersection represents the cycles when the CPU or Stream Unit is executing an intersection or subtraction operation. The Other computation category includes cycles for all other computations. It is worth noting that SparseCore can overlap Other computation with Intersection, since SparseCore is based on an out-of-order CPU core. For SparseCore, the Other computation category takes a higher proportion of the reduced total cycles.

3.6.5 Comparing to GPU

We also compare SparseCore with a GPU (Nvidia Tesla K40m). We assume the clock frequency of SparseCore to be 1GHz. We compare the performance of SparseCore (with symmetry breaking) with two GPU implementations, with and without the symmetry breaking optimization. The optimization in general adds more branches, and we want to study whether, with massive parallelism, the redundant enumeration with less branch divergence can overshadow less computation with more branches. Figure 3.11 shows the results. We can see that: 1) SparseCore outperforms the GPU implementations significantly; thus, even with a more powerful GPU, the conclusion should stay the same; and 2) symmetry breaking is also effective on GPU, and the massive parallelism on more computation cannot outweigh less computation with more branches.
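The symmetry-breaking optimization referenced above can be made concrete with a small Python sketch (illustrative only; the evaluated GPU and SparseCore kernels are generated separately): each triangle is counted exactly once by restricting enumeration to v0 < v1 < v2, whereas without the constraint the same triangle is enumerated 6 times.

def count_triangles_sym_break(adj):
    # adj: undirected graph as vertex -> list of neighbor IDs (both directions stored)
    cnt = 0
    for v0, n0 in adj.items():
        for v1 in n0:
            if v1 <= v0:
                continue                               # enforce v0 < v1
            common = set(n0) & set(adj[v1])            # N(v0) intersect N(v1)
            cnt += sum(1 for v2 in common if v2 > v1)  # enforce v1 < v2
    return cnt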
Using Nvidia profiling tools, we find that the reason for the low performance of pattern enumeration on GPU is two-fold: 1) low warp utilization (about 4.4%) due to the branches and the different loop sizes (edge list lengths) for different threads; and 2) low global memory bandwidth utilization (about 13%), since threads access edge lists at different memory locations. Based on our results, it is no surprise that all existing pattern-enumeration-based graph mining systems are based on CPUs.

3.6.6 The Distribution of Stream Lengths

Figure 3.14: Length Distribution (x-axis: stream length; y-axis: cumulative percentage)

We further analyze the length distribution of the involved streams in different GPM algorithms. Figure 3.14 (left) shows the cumulative distribution function (CDF) of stream lengths of different graph mining algorithms on the email-eu-core graph. Even on the same graph dataset, different applications can lead to different stream length distributions. We notice that clique applications (i.e., 4-Clique/5-Clique counting) in general introduce shorter stream lengths. The reason is that in clique applications, the input operands of intersection operations are usually the intersection results of other streams, and these operands tend to have shorter stream lengths. We also fix the graph mining application to triangle counting and analyze the stream length distribution on various datasets. The results are reported in Figure 3.14 (right). For this figure, we cut off the counting for streams longer than 500. The observation is intuitive: the longest streams on datasets with larger maximum degrees (e.g., LiveJournal, YouTube) are longer. Besides, there are more long streams on denser datasets like E (email-eu-core) and F (socfb-Haverford76).

3.6.7 Varying the Number of Stream Units

We characterize the performance of SparseCore by varying the number of SUs. Figure 3.12 shows the results with 1 to 16 stream units. When the number of SUs is no more than 4, increasing it generally improves SparseCore's performance. However, with more than 4 stream units, adding SUs introduces significantly less benefit.

3.6.8 Analysis on Bandwidth

We characterize SparseCore's performance with different bandwidths. Figure 3.13 shows the performance of SparseCore with the aggregated S-Cache and scratchpad bandwidth varying from 2 elements per cycle to 64 elements per cycle. Increasing the aggregated bandwidth can improve SparseCore's performance. However, there is a point of diminishing returns. For example, for the TC (three-chain counting) application, increasing the bandwidth from 32 to 64 elements/cycle introduces almost no benefit. This is because there are not enough concurrent active stream intersection/subtraction operations to saturate the bandwidth. The number of concurrent active stream operations is determined by the application and implementation. Triangle counting (T) and 4/5-Clique counting (4/5C) use the nested intersection instruction to trigger intersection operations in a bursty manner, leading to more simultaneously on-the-fly intersections. Hence, T/4C/5C benefit more from the bandwidth increase than the algorithms/implementations without the nested instruction (e.g., 4CS, 5CS). Moreover, each algorithm has a unique stream operation pattern, which leads to a different number of simultaneously on-the-fly stream operations. Therefore, each algorithm benefits from the bandwidth increase differently.
3.6.9 Tensor Computation Performance

3.6.9.1 Comparison with CPU

Figure 3.15: Tensor Computation Speedup ((a) sparse matrices: inner, outer, Gustavson; (b) tensors: TTV, TTM)

SparseCore's speedups against the CPU baseline are shown in Figure 3.15. For sparse matrix multiplication, it achieves on average 6.9×, 1.88×, and 2.78× speedup for the inner-product, outer-product, and Gustavson's algorithm, respectively. Comparing across algorithms, Gustavson executes faster than the other two algorithms on the CPU, e.g., 93.0× and 2.13× faster than inner- and outer-product on ex19. However, SparseCore achieves the highest speedups for inner-product, because inner-product's data access pattern can be better accelerated by our data reuse. SparseCore achieves higher speedups for Gustavson's algorithm than for outer-product for a similar reason. After SparseCore's acceleration, Gustavson's algorithm still has the highest performance, but the gap to inner-product becomes smaller, e.g., 24.9× and 2.13× faster than inner- and outer-product on ex19. Comparing across different datasets, the speedup on TSOPF with inner-product and Gustavson's algorithm is much higher than on the other matrices. This is because TSOPF has more non-zero elements per column, which leads to longer streams and more efficient data reuse. This is similar to our observations in Section 3.6.3.2, where SparseCore achieves higher speedups on graphs with a higher average degree. For tensor computation, SparseCore achieves on average 4.49× and 2.44× speedup for TTM and TTV, respectively. Similar to the matrices, for tensors with higher density, SparseCore achieves a higher speedup.

3.6.9.2 Comparison with OuterSPACE, ExTensor, and Gamma

Figure 3.16: Gmean speedup of OuterSPACE, ExTensor, Gamma, and SparseCore with outer-product and Gustavson's algorithm, over SparseCore with inner-product

Similar to the comparison with the GPM accelerators, we only enable one computation unit in each accelerator and one SU in SparseCore. For OuterSPACE, as stated in their paper, the latency of allocation is typically hidden, and they use a scratchpad to hide the latency of grabbing new elements from main memory. Thus, we mainly model their cache/scratchpad, their PE, and the HMC transfer. For a fair comparison, we configured the latency of their cache/scratchpad to be the same as the latency of SparseCore's L1d cache. For ExTensor, we model their PE and the transfer cost from DRAM to the LLB and from the Partial Output Buffer to DRAM with the same configuration as in their paper. We model their PE with the same number of parallel comparators as SparseCore for a fair comparison. For Gamma, as stated in the paper, the FiberCache uses fetch to hide the memory access latency of a cache miss. For simplification, we model the FiberCache as "always hit". For the PE, we model it with one-element-per-cycle throughput. We compare the performance of SparseCore, OuterSPACE, ExTensor, and Gamma in Figure 3.16.
We observe that: 1) SparseCore with a better algorithm is faster than the accelerators with worse algorithms, i.e., SparseCore with Gustavson's algorithm is faster than the accelerators with outer-product and inner-product; and 2) for each algorithm, the specialized accelerators are faster than SparseCore, by 5.2× for inner-product, 3.1× for outer-product, and 2.4× for Gustavson's algorithm. This demonstrates the key trade-off between flexibility and performance: SparseCore can easily adapt to various algorithms with reasonable speedups—not significantly slower than the fully specialized accelerators with fixed dataflows. In particular, even though SparseCore's Gustavson implementation is slower than Gamma, it is faster than OuterSPACE and ExTensor, thanks to the algorithmic advantages.

while (true) {
  cmp = stream1[i1] - stream2[i2];
  if (cmp == 0) {
    output += value1[i1]*value2[i2];
    i1++; i2++;
  } else if (cmp < 0) {
    i1++;
  } else {
    i2++;
  }
  if (i1>boundary1 || i2>boundary2) break;
}

(a) Inner Product in C

...
//mov index addr, length, id, value addr, priority of stream 1 to R8-R12
S_VREAD R8, R9, R10, R11, R12
//mov index addr, length, id, value addr, priority of stream 2 to R8-R12
S_VREAD R8, R9, R10, R11, R12
//mov id of stream 1, stream 2 to R8-R9
S_VINTER R8, R9, R10, MAC
//mov id of stream 1 to R8
S_FREE R8
//mov id of stream 2 to R8
S_FREE R8
...

(b) Inner Product in Our ISA

while (true) {
  cmp = stream1[i1] - stream2[i2];
  if (cmp == 0) {
    out_v[idx] = v1[i1]*V1 + v2[i2]*V2;
    out[idx] = stream1[i1]; i1++; i2++;
  } else if (cmp < 0) {
    out[idx] = stream1[i1];
    out_v[idx] = v1[i1]*V1;
    i1++;
  } else {
    out[idx] = stream2[i2];
    out_v[idx] = v2[i2]*V2;
    i2++;
  }
  idx++;
  if (i1>bound1 || i2>bound2) { //tail
    stream = (i1>bound1) ? stream2 : stream1;
    v = (i1>bound1) ? v2 : v1;
    i = (i1>bound1) ? i2 : i1;
    copy(&out[idx], &stream[i], left);
    copy(&out_v[idx], &v[i], left);
    break;
  }
}

(c) Gustavson in C

...
//mov index addr, length, id, value addr, priority of stream 1 to R8-R12
S_VREAD R8, R9, R10, R11, R12
//mov index addr, length, id, value addr, priority of stream 2 to R8-R12
S_VREAD R8, R9, R10, R11, R12
//mov index addr, length, id, value addr, priority of stream 2 to R8-R12
S_VREAD R8, R9, R10, R11, R12
//mov id of stream 1, stream 2, result stream to R8-R10
//mov scale factor to F1 and F2
S_VMERGE F1, F2, R8, R9, R10
//mov id of stream 1 to R8
S_FREE R8
//mov id of stream 2 to R8
S_FREE R8
...

(d) Gustavson in Our ISA
Figure 3.4: Different SPMSPM Dataflows with Stream ISA

Figure 3.5: SparseCore Architecture

Figure 3.6: Parallel Comparison

Figure 3.7: Speedup of SparseCore over Flexminer and TrieJax

Figure 3.8: Speedups over CPU

Figure 3.9: CPU execution breakdown

Figure 3.10: SparseCore execution breakdown

Figure 3.11: SparseCore compared to GPU implementations (log scale)

Figure 3.12: Varying the Number of SUs

Figure 3.13: Varying S-Cache Bandwidth

Chapter 4: HybridServing: token-level hybrid serving for large language models

4.1 Abstract

Large language models have been widely used across a wide range of industries, from education and customer service to content generation. More and more people have begun to use large language model services as part of their daily lives. At the same time, larger and larger models are deployed for better accuracy and user satisfaction. The increased model size and spiking user demands worsen the tension between model accuracy and system performance, such as latency and throughput. In recent years, much research has explored the balance between model accuracy and system performance, such as Super Serve and CompOFA. These works explore dynamic and static balancing points between performance and accuracy, but none of them studies the execution of different balancing points on the same device. These methods therefore incur high overhead when switching between balancing points of LLM models.
In this work, we implement concurrent serving of an LLM model at different balancing points on the same device, which enables a seamless transition between balancing points with zero overhead. On the other hand, Super Serve and all existing systems only consider user demands with latency requirements. In real cases, however, some tasks prioritize better accuracy over latency. In our work, we implement a scheduling algorithm that considers both latency and accuracy requirements, which increases hardware utilization and also reduces latency violations. In addition, we significantly reduce output-token latency violations by predicting request latency and making scheduling decisions based on the prediction. Our experiments show that HybridServing achieves a 47% shorter waiting queue length and recovers from bursty requests 2.7× faster than a fixed-accuracy policy (which serves accuracy-tolerant requests at high accuracy).

4.2 Introduction

In recent years, the breakthrough of large language models has enabled their wide deployment in many areas, including teaching assistants [150, 151], text summarization [152, 153], text-based and voice-based customer service [154–156], image and video generation [157, 158], and personal assistants [159, 160]. For many users, LLMs are becoming a part of daily life. According to a data analysis report [161], the ChatGPT service's monthly visitors grew from 152.7 million to 1.7 billion within one year. Market research predicts [162] that the large language model industry will grow at an annual rate of over 40% for the next 9 years. At the same time, larger and larger language models are being trained and deployed, as companies compete for LLMs with higher and higher accuracy. From 2018 to 2022, the cutting-edge LLM model size grew from 340 million [163] to 530 billion [164] parameters, an over 1000× increase over 4 years. With the increasing demands for LLM services and increasing model sizes, the balance between accuracy and performance has become an important topic. Researchers and commercial companies have proposed many software and hardware innovations to ease this tension, such as quantization [165], better computing algorithms [166–168], as well as higher hardware capacity and performance [169]. In this work, we further explore this balance space. On the user side, we categorize user requirements into the following two aspects:

Performance (Latency/Throughput): Users might have different latency requirements. Some services have real-time requirements, such as online customer chatbots and call centers [154–156], while other services, such as text summarization [152, 153], do not have strict time constraints.

Accuracy: Some users might be willing to sacrifice accuracy for better performance or lower cost, while others may prioritize accuracy over performance and cost. For example, with the subscription model of OpenAI [170], users can get different model accuracy at different prices, and the best model accuracy is only for paid users.

Another important requirement for an LLM inference system is the following:

Resource Utilization: On the deployment side, commercial companies are deploying machine learning services at a large scale. Meta aims to grow its AI infrastructure to 350,000 NVIDIA H100 GPUs by the end of 2024 [171]. Even though no commercial company has released its daily LLM request rate or the specific clusters used, with more and more users using LLMs to solve daily tasks, it is reasonable to expect that LLM services will be used and deployed on a large scale.
In addition to the increasing demands, due to the model size and computation requirements, LLM inference systems rely heavily on expensive computing resources, like high-end GPUs [171] or other accelerators like TPUs [172]. Thus, a good LLM inference serving system should utilize the computing resources as much as possible. Early Machine Learning (ML) inference serving systems [173–175] tried to strike a balance between the above three requirements by choosing a static point before launching the inference service. The drawback is that it is hard to adjust the system when the latency or accuracy requirement changes. More recent systems [176] allow users to choose a set of models with different latency-accuracy balancing points and allow the system to choose which model to use at runtime. However, those systems are designed for traditional machine learning requests, and some of their insights may not apply to LLM services. Super Serve [176] allows switching between balancing points only after all existing requests have finished. Those requests are traditional ML tasks like image processing, so the wasted time is at most the latency of processing one image. But for large language model serving, some requests may execute for a much longer time. One example is a chatbot, where a user may chat with the LLM for several minutes. If Super Serve were directly applied to this case, the GPU worker would block new requests for several minutes until the existing ones finish, and only then switch to another accuracy-performance balancing point. In this work, we explore the execution of models with different accuracies on the same GPU at the same time, which naturally resolves this overhead. Additionally, Super Serve and existing serving systems only consider latency constraints: users can only submit queries with latency requirements. However, in real scenarios, some workloads may prioritize accuracy over latency. In this work, we implement a scheduling algorithm that considers both accuracy and latency requirements, which increases overall user satisfaction. Our work makes the following contributions:

• Hybrid Execution: We implement hybrid execution of a model at different performance-accuracy balancing points, which enables a seamless transition from one balancing point to another. HybridServing is the first ML serving system that implements such a function.

• Scheduling with latency and accuracy constraints: HybridServing is the first LLM serving system that considers both constraints. HybridServing achieves a 47% shorter waiting queue length and recovers from bursty requests 2.7× faster than a fixed-accuracy policy (which serves accuracy-tolerant requests at high accuracy). When the system is under high load, HybridServing reduces token-level latency violations by 75% to 82%.

4.3 Motivation and Background

In 4.3.1, we provide a brief overview of the scheduling algorithms of ML serving systems and analyze them against the characteristics of LLM tasks. In 4.3.3, we analyze the breakdown of token-level latency violations and the unique challenges of LLM task scheduling. We also discuss how hybrid scheduling can resolve these issues and increase hardware utilization. In 4.3.4, we provide an overview of the implementation of different performance-accuracy balancing points, which inspires our implementation of hybrid execution.

4.3.1 Machine Learning Scheduling

In recent years, numerous machine learning serving frameworks have been implemented for different optimization targets.
For traditional machine learning model serving, Clipper [173] was designed to be deployed between end-user applications and machine learning frameworks. It implements adaptive batching, caching, and model selection techniques to meet latency, accuracy, and throughput requirements. Super Serve [176] utilizes a weight-shared SuperNetwork [177] to meet Service-Level Objective (SLO) requirements. It achieves better SLO attainment than previous systems, including Clipper. Proteus [178] is an inference serving system more similar to Clipper. Compared with Super Serve, Proteus lacks support for a weight-shared SuperNetwork, which can significantly reduce GPU memory consumption. For LLM scheduling, Andes [179] studies scheduling algorithms to optimize user experience; it assumes all requests have the same requirements and only serves one model. Other LLM serving systems [180–182] focus on server-side performance and simply use the First Come First Serve (FCFS) policy. The most comparable framework is Super Serve, which explores different accuracy-performance balancing points of the same SuperNetwork. Figure 4.1 shows the architecture of Super Serve: users' requests are sent to the request queue, and the scheduler batches requests based on their requirements and the system status, assigning them to balancing points of the model that can satisfy the requirements. The dispatcher dispatches each batch to a GPU worker and executes the model at the assigned balancing point. For the scheduler, Super Serve introduces a scheduling algorithm called SlackFit. SlackFit first categorizes the SuperNetwork model at different balancing points into different buckets, ranging from lower accuracy and higher performance to higher accuracy and lower performance. At runtime, the SlackFit algorithm chooses the model with the highest accuracy that can satisfy the latency requirement. Super Serve schedules requests with different requirements to different devices, since it can only execute one type of model on each computing device. The device remains occupied until the scheduled requests are finished, and the switching of balancing points can only happen after the worker finishes all pending requests.

Figure 4.1: Super Serve architecture

Super Serve was designed for traditional ML tasks. We analyze the Super Serve scheduling algorithm against the characteristics of LLM tasks and show its drawbacks below.

Task run time: The Super Serve method may be sufficient for traditional ML tasks like image processing, which have very short runtimes. However, for large language model services, requests can execute for a long time. For example, for models like GPT-4o [183] and Llama 3 [184], the maximum number of output tokens is 4096, which means the model needs to be executed for up to 4096 iterations to finish one request. Additionally, for requests like chatbots, users may spend minutes chatting with the LLM interactively. Directly applying the Super Serve method would lead to significant waste when switching between balancing points.

Latency vs. Accuracy: Super Serve prioritizes latency over accuracy. Latency is the only constraint that a user can submit to Super Serve, and Super Serve tries to schedule the request with the highest accuracy that can satisfy the latency. However, some LLM service users may be willing to sacrifice latency for higher accuracy, e.g., for tasks like book summarization and other offline tasks. These tasks may have no real-time latency constraint but require high accuracy.
Other users, in contrast, may be willing to sacrifice accuracy for better latency. Scheduling algorithms that do not consider accuracy requirements may lead to significant resource waste: requests that tolerate long latency but need high accuracy might be scheduled with compromised accuracy, and requests that tolerate accuracy loss but need strict latency may get unnecessarily high accuracy. Even if we added accuracy to Super Serve's scheduling algorithm, it would still suffer from load imbalance and poor hardware utilization, because every GPU can only execute one fixed accuracy-performance balancing point. Based on the above analysis, we believe a good LLM serving system should have the following characteristics.

• Token-level scheduling: This resolves the issues caused by the long run time of LLM tasks. For many existing LLM serving systems, the scheduling algorithms are designed and implemented at the token level [180].

• Seamless switching between balancing points: We want to minimize the switching overhead. We also want users to experience no delay during switching.

• Load balancing and resource utilization: ML serving systems like Clipper, Proteus, and Super Serve all rely on deploying models at different balancing points on different GPUs. If one model gets more requests than the others, this may lead to load imbalance issues.

• Consider both accuracy and performance: This allows better utilization of computing devices and better suits users' needs.

4.3.2 Request Preemption and KV-Cache

Task preemption allows high-priority requests to take over the resources occupied by lower-priority ones. For traditional machine learning tasks like DNNs, REEF [17] implements it via a reset-based preemption scheme that can proactively kill and restore GPU kernels at microsecond scale. For LLM tasks, however, preemption is more complicated due to the intermediate data, namely the KV-Cache. vLLM implements task preemption via recomputing the intermediate data or swapping the intermediate data from GPU memory to CPU memory. Another unique challenge for scheduling is caused by the unpredictable size of the KV-Cache. The size of the KV-Cache is determined by the total input and output length of the request, which is unpredictable and varies a lot. For example, for the Llama 2 7B model in 16-bit precision, a request with a total length of 4096 needs about 2 GB of KV-Cache, while a request with a total length of 4 needs only several MB. Since we do not know how long the output sequence will be, we do not know how much memory will eventually be needed. If we schedule too many tasks, the device may run out of memory; if we schedule too few tasks, the device will not be fully utilized. vLLM resolves this issue by introducing PagedAttention and its preemption. vLLM manages GPU memory in pages, similar to an operating system. Requests get more pages allocated as they generate more intermediate data. When the GPU runs out of memory, some requests are stalled and their memory is released and used by other requests. However, this method introduces another issue: the computation of the stalled request is delayed until there is enough GPU memory available, which may violate the latency requirement. As we have previously analyzed, some LLM tasks can tolerate high latency.
If we schedule latency-tolerant and latency-sensitive requests on the same device, and preempt the latency-tolerant requests when we run out of memory, the system can achieve both higher utilization and fewer latency violations. We demonstrate this in our experiments in 4.6.4.

4.3.3 Token-Level Latency Analysis

A commonly referenced output-token latency target is the human reading speed: to satisfy the user, LLM inference systems need to generate words faster than humans can read. The average silent reading rate for adults in English is 238 words per minute [185]. The token-to-word translation rate can be estimated based on [186]; in this paper, we use a rate of 3/4 words per token to estimate the word generation latency. We profiled the per-token latency of vLLM [180] with the dataset in [187]. We configured vLLM with chunked-prefill [181] enabled. Figure 4.6.3 shows the breakdown of the reasons for latency violations. The left plot corresponds to the breakdown of vLLM without the optimizations in HybridServing. A significant percentage of latency violations are due to chunked-prefill, which prolongs the iteration. Other sources of latency violations include the first token, measured as the time from scheduling the request to generating the first token, or Time-to-First-Token (TTFT), request preemption, and iterations that exceed the latency requirement.

4.3.3.1 Prefilling and Decoding

The computation of LLM tasks can be divided into two phases, prefilling and decoding. In the prefilling phase, the input tokens are processed. In the decoding phase, output tokens are generated in every iteration. The prefilling phase is compute-intensive, which may occupy too much computing time and lead to delays for other requests in the decoding phase. Much research has tried to reduce the interference between the prefilling and decoding phases. For example, Sarathi [181] breaks prefilling into chunks and piggybacks them on decoding. Figure 4.4.1 shows the latency violations measured with vLLM with chunked-prefill enabled; still, many latency violations are due to chunked prefill. Splitwise [182] uses separate GPU devices to process the prefill and decoding phases. However, this method needs to transfer a huge amount of data between GPU devices, which may become a bottleneck when the hardware bandwidth is limited. We further analyze this issue and propose our solution in 4.6.3.

4.3.4 Latency and Accuracy Trade-off

A lot of research has been done to explore the trade-off space between system performance and model accuracy. For DNN models, users can use Neural Architecture Search (NAS) [188–190] to trim the model to a specific latency target. However, to switch between balancing points, this approach needs models to be loaded into memory at runtime or stored in memory during the whole execution, both of which lead to resource waste. A more recent supernet-based approach [176] trains a large network and extracts sub-networks to suit different latency and accuracy requirements. Compared with the previous NAS approach, this approach requires less memory space, because models of different accuracies can share the weight data. For large language models, most trade-off research has focused on training models of different sizes. The closest one to a supernet is Self-Speculative Decoding [191, 192], which selects layers of the full model to create a smaller model with lower accuracy and better latency.
Self-Speculative Decoding uses the smaller model to predict the output of the full model and uses the full model to verify the prediction of the smaller model.

4.4 Implementation

4.4.1 Hybrid Execution

Inspired by Self-Speculative Decoding, we implement hybrid execution of models with different accuracies via layer skipping. As illustrated in Figure 4.2, Input 0 and Input 1 are executed at high accuracy and go through each layer normally. Input 2 is executed at lower accuracy: it bypasses the execution of multilayer perceptron layer 1 (MLP 1) and then multilayer perceptron layer 2 (MLP 2).

Figure 4.2: Hybrid execution

4.4.1.1 Layer Selection

Layer skipping can accelerate inference, but it also leads to accuracy degradation. The accuracy of the model depends on which layers are chosen. Even with the same total number of layers skipped, the accuracy varies significantly. We implemented a layer-searching algorithm to find the combination of layers with the best accuracy. Assuming the model has n layers in total, the search space contains 2^n combinations. Our search algorithm is based on the Bayesian optimization used in Self-Speculative Decoding, which performs exploration and exploitation of the search space: the Bayesian optimization iteratively selects different combinations of layers to skip for evaluation. In general, with more layers skipped, the performance becomes better and the accuracy becomes worse. We run the layer-searching algorithm with different numbers of layers to skip, which explores different accuracy-performance balancing points.

4.4.2 Scheduling

4.4.2.1 Scheduling Algorithm

The scheduling algorithm pseudocode is shown in Listing 4.1.

Listing 4.1: HybridServing scheduling pseudocode
1  # unfinished requests in previous iteration
2  policy.sort.running(running_queue)
3  running, scheduled = [], []
4  while running_queue not empty:
5      request = running_queue.popleft()
6      while not request.can_allocate():
7          victim = policy.victim(running_queue)
8          preempt(victim)
9          waiting_queue.append(victim)
10     else:
11         scheduled.append(request)
12         running.append(request)
13 running_queue = running
14
15 # latency-sensitive requests preempt insensitive ones
16 if policy.enable_eager_preemption:
17     if waiting_queue.has_lat_sensitive() and
18        running.has_lat_insensitive():
19         mem = waiting_queue.lat_sensitive.size()
20         free_space = sys.free_space
21         for victim in running.lat_insensitive():
22             if free_space > mem:
23                 break
24             preempt(victim)
25             free_space += victim.size()
26             running_queue.remove(victim)
27             waiting_queue.append(victim)
28
29 policy.sort.waiting(waiting_queue)
30 while waiting_queue not empty:
31     request = waiting_queue.pop()
32     chunk_len = request.length()
33     while budget.predict_latency() > SLO:
34         chunk_len.reduce()
35     if can_allocate(chunk_len):
36         scheduled.append(request)
37         running_queue.append(request)
38     if chunk_len <= 0:
39         break
40
41 # set accuracy
42 if GPU_Mem_usage > policy.high_mem:
43     increase_accuracy = False
44 if GPU_Mem_usage < policy.low_mem:
45     increase_accuracy = True
46
47 for req in scheduled:
48     if policy.accuracy_tolerate(req):
49         if increase_accuracy and req.high_accuracy:
50             req.increase_accuracy()
51         else:
52             req.high_accuracy = False
53 return scheduled
4.4.2 Scheduling

4.4.2.1 Scheduling algorithm

The scheduling algorithm pseudocode is shown in Listing 4.1.

Listing 4.1: HybridServing scheduling pseudocode
1  # unfinished requests from the previous iteration
2  policy.sort.running(running_queue)
3  running, scheduled = [], []
4  while running_queue not empty:
5      request = running_queue.popleft()
6      while not request.can_allocate():
7          victim = policy.victim(running_queue)
8          preempt(victim)
9          waiting_queue.append(victim)
10     else:
11         scheduled.append(request)
12         running.append(request)
13 running_queue = running
14
15 # latency-sensitive requests preempt insensitive ones
16 if policy.enable_eager_preemption:
17     if waiting_queue.has_lat_sensitive() and
18        running.has_lat_insensitive():
19         mem = waiting_queue.lat_sensitive.size()
20         free_space = sys.free_space
21         for victim in running.lat_insensitive():
22             if free_space > mem:
23                 break
24             preempt(victim)
25             free_space += victim.size()
26             running_queue.remove(victim)
27             waiting_queue.append(victim)
28
29 policy.sort.waiting(waiting_queue)
30 while waiting_queue not empty:
31     request = waiting_queue.pop()
32     chunk_len = request.length()
33     while budget.predict_latency() > SLO:
34         chunk_len.reduce()
35     if can_allocate(chunk_len):
36         scheduled.append(request)
37         running_queue.append(request)
38     if chunk_len <= 0:
39         break
40
41 # set accuracy
42 if GPU_Mem_usage > policy.high_mem:
43     increase_accuracy = False
44 if GPU_Mem_usage < policy.low_mem:
45     increase_accuracy = True
46
47 for req in scheduled:
48     if policy.accuracy_tolerate(req):
49         if increase_accuracy and req.high_accuracy:
50             req.increase_accuracy()
51         else:
52             req.high_accuracy = False
53 return scheduled

In lines 1-13, the initial running queue contains the requests that were executed in the previous iteration and are still unfinished. The queue is sorted using a configurable policy, and new KV-Cache memory space is allocated to each of the requests (lines 4-13); if there is not enough memory, a victim request is selected and preempted based on the configurable policy. Lines 15-27 contain the code for eager preemption: if eager preemption is enabled and the waiting queue contains latency-sensitive requests while the running queue contains latency-insensitive requests, we preempt the insensitive requests and release their memory so that the sensitive requests can be scheduled. Lines 29-39 contain the code for scheduling waiting requests. The configurable policy determines which requests are scheduled. HybridServing predicts the latency of the next iteration and adjusts the chunked-prefill size so that the SLO constraint is not violated. Lines 41-52 contain the code that sets the accuracy for each request. Users can specify the policy that determines whether a request tolerates accuracy loss.

4.4.2.2 Latency prediction

HybridServing profiles the model offline with different chunked-prefill sizes. The system uses the profiling results to train a linear model that predicts the latency from the chunk size. After comparing different input features, we found that the sum of chunk lengths best predicts the latency. We only profile and train the model in the latency range faster than the human reading speed mentioned in Section 4.3.3. This is based on two guarantees:
• Monotonicity: the latency increases when we schedule more or longer chunks.
• Latency requirement: HybridServing tries to generate tokens faster than humans can read.
The scheduling algorithm only needs to predict whether the latency requirement will be satisfied; once the latency is violated, we do not need to know precisely by how much it is exceeded. The model therefore mainly needs to be precise in the range that satisfies the latency requirement. We show and analyze the prediction results in Section 4.6.3.
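A minimal sketch of such a predictor is shown below. It fits a degree-1 polynomial with NumPy to offline profiling samples; the sample values and the 0.19 s threshold (from the reading-speed analysis in Section 4.3.3) are placeholders for illustration, not the exact implementation.

    import numpy as np

    # Offline profiling samples: (sum of scheduled prefill-chunk lengths,
    # measured iteration latency in seconds). Values are illustrative.
    profile = [(256, 0.031), (512, 0.055), (1024, 0.102), (2048, 0.195)]
    x = np.array([s for s, _ in profile], dtype=float)
    y = np.array([t for _, t in profile], dtype=float)

    # Fit latency ~ a * chunk_sum + b (a degree-1 polynomial).
    a, b = np.polyfit(x, y, deg=1)

    def predict_latency(chunk_sum):
        """Predicted iteration latency for a given total prefill-chunk length."""
        return a * chunk_sum + b

    def fits_slo(chunk_sum, slo=0.19):
        """The scheduler only needs a yes/no answer against the per-token budget."""
        return predict_latency(chunk_sum) <= slo

Because only the yes/no decision matters, prediction errors in the clearly-violating range are harmless, which is why the model is trained only on the range that can satisfy the latency requirement.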
4.4.2.3 Customizable policy and Task Groups

As shown in the scheduling pseudocode, HybridServing allows users to configure the serving-system policy, such as which requests to preempt and which new requests to schedule. Listing 4.2 shows one scheduling policy example for HybridServing. We implement our scheduling policy as Task Groups. As shown in Table 4.1, we categorize requests into four groups based on their tolerance of accuracy loss and latency. When scheduling new requests, we prioritize tasks in the latency-sensitive groups; tasks in the same group are scheduled First-Come-First-Serve (FCFS). The scheduling algorithm prioritizes Task Group 1 over 2 and 3 over 4. For request preemption, the algorithm chooses the request with the highest task-group number and the latest arrival time. For the accuracy configuration, the example policy stops increasing the accuracy for requests in Groups 2 and 3 when the GPU memory usage reaches 50% and restores the accuracy when the GPU memory usage falls back to 20%. Because the system is configurable, it is easy to augment this policy to implement more user tiers or other scheduling algorithms.

Listing 4.2: HybridServing scheduling policy example (pseudocode)
class MyPolicy:
    enable_eager_preemption = True
    high_mem = 0.5
    low_mem = 0.2

    def sort.running(running):
        return sorted(running, key=req -> (req.group_num, req.arrival_time))

    def sort.waiting(waiting):
        return sorted(waiting, key=req -> (req.group_num, req.arrival_time))

    def victim(running):
        # pop the last element: highest task-group number, latest arrival time
        return sorted(running, key=req -> (req.group_num, req.arrival_time)).pop()

    def accuracy_tolerate(req):
        return req.group_num in {2, 3}

Table 4.1: User groups
                    | Latency sensitive | Latency tolerant
Accuracy sensitive  | Task Group 1      | Task Group 4
Accuracy tolerant   | Task Group 2      | Task Group 3

4.5 HybridServing: System design

Figure 4.3 shows the system design of HybridServing. User requests are first added to the Central Scheduler, which performs load balancing across all the GPU workers. Each GPU worker has its own local scheduler, which schedules requests to the corresponding GPU and chooses requests to preempt when needed. The details of the local scheduling algorithm are described in Section 4.4.2. On every GPU worker, requests can be executed at different accuracies concurrently.

Figure 4.3: HybridServing system

4.6 Evaluation

4.6.1 Experiment setup

Our server is equipped with an AMD EPYC 7302 16-core processor and Nvidia RTX A5000 GPUs; each GPU has 25.7 GB of device memory. The number of GPUs used in each experiment is specified in the corresponding section. For the LLM, we chose the LLaMA 2 model with 7B parameters. The performance and accuracy of layer skipping are listed in Section 4.6.2. For the dataset, we used the ShareGPT dataset [187].

4.6.2 Layer Skipping Model Accuracy

We searched for skipping combinations for the LLaMA-2 7B model and tested their perplexity on the WikiText-2 benchmark [193]. Perplexity indicates the accuracy of the model; it is defined as the exponentiated average negative log-likelihood of a sequence, PPL = exp(-(1/N) Σ_i log p(x_i | x_<i)), the lower the better. The results are shown in Table 4.2. The accuracy becomes worse as the number of skipped layers increases, while the throughput increases as more layers are skipped. For the rest of our evaluation, we skip 15 layers for requests executed at lower accuracy.

Table 4.2: Performance and accuracy of the model with different numbers of layers skipped
Skipped layers | Perplexity | Throughput increase
0              | 6.23       | 0%
5              | 7.08       | 4.8%
10             | 9.38       | 7.7%
15             | 12.27      | 13.8%
20             | 29.5       | 20.5%
25             | 47.44      | 22.1%

4.6.3 Latency prediction

We show the prediction results of our linear model in Figure 4.4. The predicted and actual latencies are shown as red and blue dots, respectively. The x-axis shows the sum of the prefill chunk lengths, which we use as the input to the linear model. As the results show, the actual latency falls close to the prediction of our linear model.

Figure 4.4: Linear model for latency prediction

We further test the latency prediction by altering the scheduling algorithm. Figure 4.5 shows per-token latency violations with and without latency prediction enabled. The results are normalized as a percentage of the total generated tokens. We generate the requests using a Poisson distribution with arrival rates from 1 to 5 requests/s. For the tests with an arrival rate of 1 request/s, the latency violation is about 1.75%, which is very small compared with the other rates. This is because the request rate is much lower than the system capacity: most requests are processed immediately, with little chance of a latency violation due to chunked prefill. This also makes our prediction less useful at this rate.
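The request arrivals in these experiments, as well as the synthetic trace in Section 4.6.4, follow Poisson processes. The sketch below shows one way such arrival timestamps can be generated from exponential inter-arrival times; it is illustrative only (the durations are placeholders), not the exact benchmarking harness.

    import random

    def poisson_arrivals(rate_per_s, duration_s, start_s=0.0, seed=None):
        """Arrival timestamps of a Poisson process with the given rate (requests/s)."""
        rng = random.Random(seed)
        t, arrivals = start_s, []
        while True:
            t += rng.expovariate(rate_per_s)   # exponential inter-arrival gap
            if t >= start_s + duration_s:
                return arrivals
            arrivals.append(t)

    # Rate sweep as in this section: 1 to 5 requests/s (duration is illustrative).
    sweep = {rate: poisson_arrivals(rate, duration_s=300, seed=rate)
             for rate in range(1, 6)}

    # Piecewise-rate trace as in Section 4.6.4: 1 req/s, then 3 req/s, then 1 req/s.
    trace = (poisson_arrivals(1, 100, start_s=0)
             + poisson_arrivals(3, 100, start_s=100)
             + poisson_arrivals(1, 100, start_s=200))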
For arrival rates between 2 and 5 requests/s, with latency prediction disabled, a significant percentage of the latency violations are due to the chunked prefill. With prediction enabled, the violations due to chunked prefill are reduced significantly, because the scheduling algorithm reduces the chunk size based on the prediction result. The reduction in latency violations ranges from 75% to 82%. Another source of latency violations is the first token, measured as the time from scheduling the request to generating the first token, or Time-to-First-Token (TTFT); it is common for the TTFT to be large. In comparison, the other two sources of violations, request preemption and an iteration exceeding the latency SLO, are negligible: both of them contribute less than 0.1% of the total tokens. We also looked into the reasons for the failed predictions. We collected the violating inputs and re-ran them offline, and found that most of them have latencies close to our prediction. We believe the failed predictions may be caused by the lower-level implementation of the CUDA runtime or the machine learning framework; we leave this investigation as future work.

Figure 4.5: Token-level latency violations broken down by cause (first token, preemption, iteration latency, prefill chunk) at different request rates, with and without latency prediction
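The token-level violation percentage and the Time-per-Output-Token (TPOT) reported throughout this evaluation can be computed from per-token completion timestamps. The sketch below is illustrative only: the timestamp log format and the use of the roughly 0.19 s per-token budget from Section 4.3.3 as the violation threshold are assumptions, not the exact measurement code.

    # token_times maps a request id to the timestamps (seconds) at which each of
    # its output tokens completed; the format is assumed for illustration.
    TOKEN_BUDGET_S = 0.19   # per-token budget from the reading-speed analysis

    def token_metrics(token_times):
        """Return (violation percentage, mean TPOT) over all inter-token gaps."""
        gaps = []
        for times in token_times.values():
            # Inter-token gaps; the first token (TTFT) would be counted separately.
            gaps.extend(b - a for a, b in zip(times, times[1:]))
        if not gaps:
            return 0.0, 0.0
        violation_pct = 100.0 * sum(1 for g in gaps if g > TOKEN_BUDGET_S) / len(gaps)
        mean_tpot = sum(gaps) / len(gaps)
        return violation_pct, mean_tpot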
4.6.4 Synthetic Traces

We generated a synthetic trace with the following configuration: for 0-100 s, requests are generated using a Poisson distribution with an arrival rate of 1 request/s; for 100-200 s, the arrival rate increases to 3 requests/s; for 200-300 s, the arrival rate is reduced back to 1 request/s. We assign the requests to Groups 1-4 in round-robin order. For model accuracy, we use the combination with 15 skipped layers. We performed the test with three kinds of configurations: adaptive accuracy (with and without preemption) and fixed accuracy (accuracy-tolerant requests executed at either fixed low accuracy or fixed high accuracy).

Figure 4.6 shows the GPU memory usage, the accuracy ("Acc improve", the number of accuracy-tolerant requests being executed at high accuracy), and the waiting queue length during the execution. For GPU memory usage and waiting queue length, the adaptive-accuracy policy without preemption and the fixed-accuracy policy (accuracy-tolerant requests at fixed low accuracy) are very similar. This shows that, with adaptive accuracy, HybridServing can adapt the accuracy to the workload, so the system achieves better accuracy with similar performance compared with the fixed-accuracy policy. Comparing adaptive accuracy with and without preemption, preemption leads to a longer waiting queue and slower recovery to the normal system workload, because preemption incurs the overhead of recomputing the KV-Cache.

For the accuracy ("Acc improve"), as mentioned in Section 4.4.2.3, the adaptive-accuracy policy schedules accuracy-tolerant requests at low accuracy once the GPU memory usage reaches a user-specified threshold, here 50%. As shown in the figure, once the GPU memory usage exceeds 50%, "Acc improve", the number of accuracy-tolerant requests executed at high accuracy, drops to zero.

Figure 4.7 shows the latency violations and the Time-per-Output-Token (TPOT) for requests in the four groups. As expected from the previous analysis, the adaptive-accuracy policy without preemption and the fixed-accuracy policy (accuracy-tolerant requests at fixed low accuracy) are very similar. Comparing adaptive accuracy with and without preemption, the TPOT with preemption is much longer, due to the preemption overhead. Overall, with the adaptive-accuracy policy, HybridServing achieves a 47% shorter waiting queue and recovers from the bursty requests 2.7× faster than the fixed-accuracy policy (accuracy-tolerant requests at high accuracy).

Figure 4.6: GPU memory usage, accuracy improvement ("Acc improve"), and waiting queue length during the execution

Figure 4.7: Performance metrics (token-level latency violation and mean TPOT) for requests in the four groups

4.6.5 Comparing with alternative implementations

We further compare HybridServing with two alternative implementations.

4.6.5.1 Switching model

As discussed in Section 4.3.1, a possible implementation for switching between different accuracy-performance balancing points is to stop the server, wait until all in-flight requests are finished, and then switch to the new model. We implemented such a policy and compared it with HybridServing. Figure 4.8 shows the GPU memory usage and the waiting and running queue lengths during the execution. During the model switch, the system stops accepting requests, and the waiting queue quickly grows very long. After the switch completes, the system resumes processing requests and the waiting queue starts to shrink from its peak, as shown in the figure at 147 s. The whole process for the waiting queue to grow and return to normal took 41 s. This result shows that switching models causes significant overhead.

Figure 4.8: GPU memory usage, waiting and running queue length during the execution

4.6.5.2 Separated Server

Another possible implementation discussed in Section 4.3.1 is to serve the different models on different GPUs. We compare this implementation with HybridServing by benchmarking their throughput under different ratios of accuracy-tolerant and accuracy-sensitive requests, corresponding to Task Group 1 and Task Group 2 in Section 4.4.2.3 (G1 and G2 in Figure 4.9). We tested with two GPUs, assigned to the high- and low-accuracy points. As shown in Figure 4.9, as the ratio of low-accuracy requests increases, the throughput of the separated-server implementation first increases and then decreases. This is due to load imbalance.
When the ratio is too low or too high, one of the two GPUs becomes the bottleneck. The separated-server implementation achieves 14% higher throughput than HybridServing when it reaches the load-balancing point. The results show that the separated-server implementation only outperforms HybridServing around that load-balancing point. Considering that the ratio of incoming requests may change over the course of serving, and that switching models causes significant overhead, we believe HybridServing yields similar or better performance than the alternative implementations.

Figure 4.9: Throughput with different implementations (HS: HybridServing, SS: separated server)

4.7 Conclusion

HybridServing is the first ML serving system that implements hybrid execution of LLM models at different performance-accuracy balancing points, enabling a seamless transition from one balancing point to another. HybridServing achieves a 47% shorter waiting queue and recovers from the bursty requests 2.7× faster than the fixed-accuracy policy (accuracy-tolerant requests at high accuracy). When the system is under high load, HybridServing reduces token-level latency violations by 75% to 82%.

References
1. Deng, L. & Platt, J. Ensemble deep learning for speech recognition in Proc. Interspeech (2014).
2. Gregor, K., Danihelka, I., Graves, A., Rezende, D. & Wierstra, D. DRAW: A recurrent neural network for image generation in International Conference on Machine Learning (2015), 1462–1471.
3. Razzak, M. I., Naz, S. & Zaib, A. Deep learning for medical image processing: Overview, challenges and the future. Classification in BioApps: Automation of Decision Making, 323–350 (2018).
4. Liu, X., Faes, L., Kale, A. U., Wagner, S. K., Fu, D. J., Bruynseels, A., et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. The Lancet Digital Health 1, e271–e297 (2019).
5. Dodge, S. & Karam, L. A study and comparison of human and deep learning recognition performance under visual distortions in 2017 26th International Conference on Computer Communication and Networks (ICCCN) (2017), 1–7.
6. Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103, 102274 (2023).
7. Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022).
8. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
9. Brants, T., Popat, A. C., Xu, P., Och, F. J. & Dean, J. Large language models in machine translation (2007).
10. Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., et al. Large language models encode clinical knowledge. Nature, 1–9 (2023).
11. Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J. & Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 (2019).
12. Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., et al.
Opt: Open pretrained transformer language models. arXiv preprint arXiv:2205.01068 (2022). 13. Shen, H., Meng, H., Dong, B., Wang, Z., Zafrir, O., Ding, Y., et al. An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs. arXiv preprint arXiv:2306.16601 (2023). 14. Pool, J. Accelerating sparsity in the nvidia ampere architecture. GTC 2020 (2020). 15. Feng, Y., Xie, M., Tian, Z., Wang, S., Lu, Y. & Shu, J. Mobius: Fine tuning large-scale models on commodity gpu servers in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (2023), 489–501. 16. Borzunov, A., Baranchuk, D., Dettmers, T., Ryabinin, M., Belkada, Y., Chumachenko, A., et al. Distributed Inference and Fine-tuning of Large Language Models Over The Internet (2022). 17. Han, M., Zhang, H., Chen, R. & Chen, H. Microsecond-scale preemption for concurrent {GPU-accelerated}{DNN} inferences in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) (2022), 539–558. 18. Zheng, D., Song, X., Yang, C., LaSalle, D. & Karypis, G. Distributed hybrid cpu and gpu training for graph neural networks on billion-scale graphs. arXiv preprint arXiv:2112.15345 (2021). 19. Rossi, R. A. & Ahmed, N. K. The Network Data Repository with Interactive Graph Analytics and Visualization in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (2015). 20. Belli, D. & Kipf, T. Image-Conditioned Graph Generation for Road Network Extraction. NeurIPS 2019 workshop on Graph Representation Learning (2019). 21. Ma’ayan, A. Insights into the organization of biochemical regulatory networks using graph theory analyses. Journal of Biological Chemistry 284, 5451–5455 (2009). 22. Cho, Y.-R. & Zhang, A. Predicting protein function by frequent functional association pattern mining in protein interaction networks. IEEE Transactions on information technology in biomedicine 14, 30–36 (2009). 23. Milenkovic, T. & Pr ´ zulj, N. Uncovering biological network function via graphlet degree ˇ signatures. Cancer informatics 6, CIN–S680 (2008). 24. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C. & Philip, S. Y. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems 32, 4–24 (2020). 92 25. Xiao, W., Xue, J., Miao, Y., Li, Z., Chen, C., Wu, M., et al. Tux2 : Distributed Graph Computation for Machine Learning in The 14th USENIX Symposium on Networked Systems Design and Implementation (2017). 26. Alexandrescu, A. & Kirchhoff, K. Data-driven graph construction for semi-supervised graph-based learning in nlp in Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference (2007), 204–211. 27. Goyal, A., Daume III, H. & Guerra, R. ´ Fast large-scale approximate graph construction for nlp in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (2012), 1069–1080. 28. Zesch, T. & Gurevych, I. Analysis of the Wikipedia category graph for NLP applications in Proceedings of the TextGraphs-2 Workshop (NAACL-HLT 2007) (2007), 1–8. 29. Qiu, M., Zhang, L., Ming, Z., Chen, Z., Qin, X. & Yang, L. T. Security-aware optimization for ubiquitous computing systems with SEAT graph approach. Journal of Computer and System Sciences 79, 518–529 (2013). 30. Stankovic, A. & Calovic, M. 
Graph oriented algorithm for the steady-state security enhancement in distribution networks. IEEE Transactions on Power Delivery 4, 539–544 (1989). 31. Wang, Y.-J., Xian, M., Liu, J. & Wang, G.-y. Study of network security evaluation based on attack graph model. JOURNAL-CHINA INSTITUTE OF COMMUNICATIONS 28, 29 (2007). 32. Shun, J., Roosta-Khorasani, F., Fountoulakis, K. & Mahoney, M. W. Parallel Local Graph Clustering. Proc. VLDB Endow. 9, 1041–1052. doi:10.14778/2994509.2994522 (2016). 33. Schaeffer, S. E. Survey: Graph Clustering. Comput. Sci. Rev. 1, 27–64. doi:10.1016/j. cosrev.2007.05.001 (2007). 34. Fouss, F., Pirotte, A., Renders, J.-M. & Saerens, M. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on knowledge and data engineering 19, 355–369 (2007). 35. Guan, Z., Bu, J., Mei, Q., Chen, C. & Wang, C. Personalized tag recommendation using graph-based ranking on multi-type interrelated objects in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (2009), 540–547. 36. Lo, S. & Lin, C. WMR–A Graph-Based Algorithm for Friend Recommendation in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (2006), 121–128. 37. Mirza, B. J., Keller, B. J. & Ramakrishnan, N. Studying recommendation algorithms by graph analysis. Journal of Intelligent Information Systems 20, 131–160 (2003). 93 38. Campbell, W. M., Dagli, C. K. & Weinstein, C. J. Social network analysis with content and graphs. Lincoln Laboratory Journal 20, 61–81 (2013). 39. Tang, L. & Liu, H. in Managing and Mining Graph Data 487–513 (Springer, 2010). 40. Wang, T., Chen, Y., Zhang, Z., Xu, T., Jin, L., Hui, P., et al. Understanding graph sampling algorithms for social network analysis in 2011 31st International Conference on Distributed Computing Systems Workshops (2011), 123–128. 41. Aittokallio, T. & Schwikowski, B. Graph-based methods for analysing networks in cell biology. Briefings in bioinformatics 7, 243–255 (2006). 42. Enright, A. J. & Ouzounis, C. A. BioLayout—an automatic graph layout algorithm for similarity visualization. Bioinformatics 17, 853–854 (2001). 43. Le Novere, N., Hucka, M., Mi, H., Moodie, S., Schreiber, F., Sorokin, A., et al. The systems biology graphical notation. Nature biotechnology 27, 735–741 (2009). 44. Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., et al. Pregel: a system for large-scale graph processing in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (2010), 135–146. 45. Kalavri, V., Vlassov, V. & Haridi, S. High-Level Programming Abstractions for Distributed Graph Processing. CoRR abs/1607.02646 (2016). 46. McCune, R. R., Weninger, T. & Madey, G. Thinking Like a Vertex: A Survey of VertexCentric Frameworks for Large-Scale Distributed Graph Processing. ACM Comput. Surv. 48, 25:1–25:39. doi:10.1145/2818185 (2015). 47. Consortium, H. M. C. et al. Hybrid memory cube specification version 2.1 tech. rep. (2015). 48. Lee, D. U., Kim, K. W., Kim, K. W., Kim, H., Kim, J. Y., Park, Y. J., et al. 25.2 A 1.2 V 8Gb 8-channel 128GB/s high-bandwidth memory (HBM) stacked DRAM with effective microbump I/O test methods using 29nm process and TSV in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International (2014), 432–433. 49. Ahn, J., Hong, S., Yoo, S., Mutlu, O. & Choi, K. 
A scalable processing-in-memory accelerator for parallel graph processing in Proceedings of the 42nd Annual International Symposium on Computer Architecture (2015), 105–117. 50. Zhang, M., Zhuo, Y., Wang, C., Gao, M., Wu, Y., Chen, K., et al. GraphP: Reducing communication for PIM-based graph processing with efficient data partition in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) (2018), 544– 557. 51. Zhuo, Y., Wang, C., Zhang, M., Wang, R., Niu, D., Wang, Y., et al. GraphQ: Scalable PIM-Based Graph Processing in Proceedings of the 52Nd Annual IEEE/ACM International Symposium on Microarchitecture (ACM, Columbus, OH, USA, 2019), 712–725. doi:10. 1145/3352460.3358256. 94 52. Dai, G., Huang, T., Chi, Y., Zhao, J., Sun, G., Liu, Y., et al. Graphh: A processing-inmemory architecture for large-scale graph processing. IEEE Transactions on ComputerAided Design of Integrated Circuits and Systems (2018). 53. Hennessy, J. L. & Patterson, D. A. A New Golden Age for Computer Architecture. Commun. ACM 62, 48–60. doi:10.1145/3282307 (2019). 54. Beamer, S., Asanovic, K. & Patterson, D. ´ Direction-optimizing Breadth-first Search in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (IEEE Computer Society Press, Salt Lake City, Utah, 2012), 12:1– 12:10. 55. Zhu, X., Chen, W., Zheng, W. & Ma, X. Gemini: A computation-centric distributed graph processing system in 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16) (2016), 301–316. 56. Kwak, H., Lee, C., Park, H. & Moon, S. What is Twitter, a Social Network or a News Media? in Proceedings of the 19th International Conference on World Wide Web (ACM, Raleigh, North Carolina, USA, 2010), 591–600. doi:10.1145/1772690.1772751. 57. ARM. ARM Cortex-A5 Processor 58. Shevgoor, M., Kim, J.-S., Chatterjee, N., Balasubramonian, R., Davis, A. & Udipi, A. N. Quantifying the relationship between the power delivery network and architectural policies in a 3D-stacked memory device in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (2013), 198–209. 59. Salihoglu, S. & Widom, J. GPS: A Graph Processing System in Proceedings of the 25th International Conference on Scientific and Statistical Database Management (ACM, Baltimore, Maryland, USA, 2013), 22:1–22:12. doi:10.1145/2484838.2484843. 60. Leskovec, J., Lang, K. J., Dasgupta, A. & Mahoney, M. W. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics 6, 29–123 (2009). 61. Zhu, X., Han, W. & Chen, W. GridGraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning in 2015 USENIX Annual Technical Conference (USENIX ATC 15) (2015), 375–386. 62. Sanchez, D. & Kozyrakis, C. ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-core Systems in Proceedings of the 40th Annual International Symposium on Computer Architecture (ACM, Tel-Aviv, Israel, 2013), 475–486. doi:10.1145/2485922. 2485963. 63. Gao, M., Ayers, G. & Kozyrakis, C. Practical near-data processing for in-memory analytics frameworks in 2015 International Conference on Parallel Architecture and Compilation (PACT) (2015), 113–124. 95 64. Kim, G., Kim, J., Ahn, J. H. & Kim, J. Memory-centric system interconnect design with hybrid memory cubes in Proceedings of the 22nd international conference on Parallel architectures and compilation techniques (2013), 145–156. 65. Kahng, A. 
B., Li, B., Peh, L.-S. & Samadi, K. ORION 2.0: A Power-Area Simulator for Interconnection Networks. IEEE Trans. Very Large Scale Integr. Syst. 20, 191–196. doi:10. 1109/TVLSI.2010.2091686 (2012). 66. Tsai, P.-A., Beckmann, N. & Sanchez, D. Jenga: Sotware-Defined Cache Hierarchies in Proceedings of the 44th Annual International Symposium on Computer Architecture (2017), 652–665. 67. Li, S., Ahn, J. H., Strong, R. D., Brockman, J. B., Tullsen, D. M. & Jouppi, N. P. McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures in MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (2009), 469–480. 68. McAuley, J. & Leskovec, J. Learning to Discover Social Circles in Ego Networks in Proceedings of the 25th International Conference on Neural Information Processing Systems (Curran Associates Inc., Lake Tahoe, Nevada, 2012), 539–547. 69. Wikipedia, E. enwiki-2013 http://law.di.unimi.it/webdata/enwiki-2013/. 2013. 70. Hong, S., Rodia, N. C. & Olukotun, K. On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-world Graphs in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (ACM, Denver, Colorado, 2013), 92:1–92:11. doi:10.1145/2503210.2503246. 71. Ozdal, M. M., Yesil, S., Kim, T., Ayupov, A., Greth, J., Burns, S., et al. Energy efficient architecture for graph analytics acceleratorsin Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on (2016), 166–177. 72. Buluc¸, A. & Madduri, K. Parallel Breadth-first Search on Distributed Memory Systems in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (ACM, Seattle, Washington, 2011), 65:1–65:12. doi:10.1145/ 2063384.2063471. 73. Beamer, S., Buluc, A., Asanovic, K. & Patterson, D. Distributed memory breadth-first search revisited: Enabling bottom-up search in 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (2013), 1618–1627. 74. Buluc, A., Beamer, S., Madduri, K., Asanovic, K. & Patterson, D. Distributed-memory breadth-first search on massive graphs. arXiv preprint arXiv:1705.04590 (2017). 75. Vora, K., Koduru, S. C. & Gupta, R. ASPIRE: Exploiting Asynchronous Parallelism in Iterative Algorithms Using a Relaxed Consistency Based DSM in Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications (ACM, Portland, Oregon, USA, 2014), 861–878. doi:10.1145/2660193.2660227. 96 76. Vora, K., Gupta, R. & Xu, G. Kickstarter: Fast and accurate computations on streaming graphs via trimmed approximations in (2017). 77. Mariappan, M. & Vora, K. GraphBolt: Dependency-Driven Synchronous Processing of Streaming Graphs in Proceedings of the Fourteenth EuroSys Conference 2019 (ACM, Dresden, Germany, 2019), 25:1–25:16. doi:10.1145/3302424.3303974. 78. Vora, K. {LUMOS}: Dependency-Driven Disk-based Graph Processing in 2019 {USENIX} Annual Technical Conference (USENIX ATC19) (USENIX Association, Berkeley, CA, USA, 2019), 429–442. 79. Gilbert, J. R., Reinhardt, S. & Shah, V. B. High-performance graph algorithms from parallel sparse matrices in International Workshop on Applied Parallel Computing (2006), 260–269. 80. Greathouse, J. L. & Daga, M. 
Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format in SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2014), 769–780. 81. Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., et al. SCNN: An accelerator for compressed-sparse convolutional neural networks. 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 27–40 (2017). 82. Kepner, J., Alford, S., Gadepally, V., Jones, M., Milechin, L., Robinett, R., et al. Sparse deep neural network graph challenge in 2019 IEEE High Performance Extreme Computing Conference (HPEC) (2019), 1–7. 83. Davis, T. A. Algorithm 1000: SuiteSparse: GraphBLAS: Graph algorithms in the language of sparse linear algebra. ACM Transactions on Mathematical Software (TOMS) 45, 1–25 (2019). 84. Canning, A., Galli, G., Mauri, F., De Vita, A. & Car, R. O (n) tight-binding molecular dynamics on massively parallel computers: an orbital decomposition approach. Computer Physics Communications 94, 89–102 (1996). 85. Zhang, G., Attaluri, N., Emer, J. S. & Sanchez, D. Gamma: leveraging Gustavson’s algorithm to accelerate sparse matrix multiplication in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (2021), 687–701. 86. Hegde, K., Asghari-Moghaddam, H., Pellauer, M., Crago, N., Jaleel, A., Solomonik, E., et al. Extensor: An accelerator for sparse tensor algebra in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (2019), 319–333. 87. Pal, S., Beaumont, J., Park, D.-H., Amarnath, A., Feng, S., Chakrabarti, C., et al. Outerspace: An outer product based sparse matrix multiplication accelerator in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) (2018), 724– 736. 97 88. Ribeiro, P., Paredes, P., Silva, M. E., Aparicio, D. & Silva, F. A Survey on Subgraph Counting: Concepts, Algorithms and Applications to Network Motifs and Graphlets. arXiv preprint arXiv:1910.13011 (2019). 89. Duma, A. & Topirceanu, A. A network motif based approach for classifying online social networks in 2014 IEEE 9th IEEE International Symposium on Applied Computational Intelligence and Informatics (SACI) (2014), 311–315. 90. Schmidt, M. C., Rocha, A. M., Padmanabhan, K., Chen, Z., Scott, K., Mihelcic, J. R., et al. Efficient α, β-motif finder for identification of phenotype-related functional modules. BMC bioinformatics 12, 1–15 (2011). 91. Parthasarathy, S., Tatikonda, S. & Ucar, D. in Managing and mining graph data 547–580 (Springer, 2010). 92. Becchetti, L., Boldi, P., Castillo, C. & Gionis, A. Efficient semi-streaming algorithms for local triangle counting in massive graphs in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (2008), 16–24. 93. Nandi, A., Mandal, A., Atreja, S., Dasgupta, G. B. & Bhattacharya, S. Anomaly detection using program control flow graph mining from execution logs in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), 215–224. 94. Kang, U., Akoglu, L. & Chau, D. H. Big graph mining for the web and social media: algorithms, anomaly detection, and applications in Proceedings of the 7th ACM international conference on Web search and data mining (2014), 677–678. 95. Parveen, P., Evans, J., Thuraisingham, B., Hamlen, K. W. & Khan, L. 
Insider threat detection using stream mining and graph mining in 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third International Conference on Social Computing (2011), 1102–1110. 96. Wang, Y.-N., Wang, J., Fan, X. & Song, Y. Network Traffic Anomaly Detection Algorithm Based on Intuitionistic Fuzzy Time Series Graph Mining. IEEE Access 8, 63381–63389 (2020). 97. Kang, U., Akoglu, L. & Chau, D. H. P. Big graph mining: Algorithms, anomaly detection, and applications. Proceedings of the ACM ASONAM 13, 25–28 (2013). 98. Uddin, S., Hossain, L., et al. Dyad and triad census analysis of crisis communication network. Social Networking 2, 32 (2013). 99. Shaw, D. R. The methods behind the madness: Presidential electoral college strategies, 1988-1996. The Journal of Politics 61, 893–913 (1999). 100. Ralaivola, L., Swamidass, S. J., Saigo, H. & Baldi, P. Graph kernels for chemical informatics. Neural networks 18, 1093–1110 (2005). 98 101. Kashima, H., Saigo, H., Hattori, M. & Tsuda, K. in Chemoinformatics and advanced machine learning perspectives: complex computational methods and collaborative techniques 1–15 (IGI Global, 2011). 102. Deshpande, M., Kuramochi, M., Wale, N. & Karypis, G. Frequent substructure-based approaches for classifying chemical compounds. IEEE Transactions on Knowledge and Data Engineering 17, 1036–1050 (2005). 103. Yan, X., Yu, P. S. & Han, J. Graph indexing: A frequent structure-based approach in Proceedings of the 2004 ACM SIGMOD international conference on Management of data (2004), 335–346. 104. Zhang, L., Han, Y., Yang, Y., Song, M., Yan, S. & Tian, Q. Discovering discriminative graphlets for aerial image categories recognition. IEEE Transactions on Image Processing 22, 5071–5084 (2013). 105. Kalinsky, O., Kimelfeld, B. & Etsion, Y. The TrieJax Architecture: Accelerating Graph Operations Through Relational Joins in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (2020), 1217–1231. 106. Yao, P., Zheng, L., Zeng, Z., Huang, Y., Gui, C., Liao, X., et al. A Locality-Aware EnergyEfficient Accelerator for Graph Mining Applications in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (2020), 895–907. 107. Xuhao, C., Tianhao, H., Shuotao, X., Thomas, B., Chanwoo, C. & Arvind. FlexMiner: A Pattern-Aware Accelerator for Graph Pattern Mining in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA) (2021). 108. Besta, M., Kanakagiri, R., Kwasniewski, G., Ausavarungnirun, R., Beranek, J., Kanel- ´ lopoulos, K., et al. SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-Memory Systems. arXiv preprint arXiv:2104.07582 (2021). 109. Shi, T., Zhai, M., Xu, Y. & Zhai, J. GraphPi: High Performance Graph Pattern Matching through Effective Redundancy Elimination in 2020 SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2020), 1418–1431. 110. Mawhirter, D. & Wu, B. AutoMine: harmonizing high-level abstraction and high performance for graph mining in Proceedings of the 27th ACM Symposium on Operating Systems Principles (2019), 509–523. 111. Mawhirter, D., Reinehr, S., Holmes, C., Liu, T. & Wu, B. GraphZero: A High-Performance Subgraph Matching System. ACM SIGOPS Operating Systems Review 55, 21–37 (2021). 112. Kjolstad, F., Chou, S., Lugato, D., Kamil, S. & Amarasinghe, S. 
taco: A Tool to Generate Tensor Algebra Kernels in 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE) (2017), 943–948. doi:10.1109/ASE.2017.8115709. 99 113. Hegde, K., Yu, J., Agrawal, R., Yan, M., Pellauer, M. & Fletcher, C. Ucnn: Exploiting computational reuse in deep neural networks via weight repetition in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA) (2018), 674–687. 114. Qin, E., Samajdar, A., Kwon, H., Nadella, V., Srinivasan, S., Das, D., et al. Sigma: A sparse and irregular gemm accelerator with flexible interconnects for dnn training in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (2020), 58– 70. 115. Zhang, Z., Wang, H., Han, S. & Dally, W. J. Sparch: Efficient architecture for sparse matrix multiplication in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (2020), 261–274. 116. Deng, C., Sui, Y., Liao, S., Qian, X. & Yuan, B. GoSPA: An Energy-efficient High-performance Globally Optimized SParse Convolutional Neural Network Accelerator in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA) (2021), 1110– 1123. 117. Parashar, A., Rhu, M., Mukkara, A., Puglielli, A., Venkatesan, R., Khailany, B., et al. Scnn: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Computer Architecture News 45, 27–40 (2017). 118. Gondimalla, A., Chesnut, N., Thottethodi, M. & Vijaykumar, T. Sparten: A sparse tensor accelerator for convolutional neural networks in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (2019), 151–165. 119. Dadu, V., Weng, J., Liu, S. & Nowatzki, T. Towards general purpose acceleration by exploiting common data-dependence forms in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (2019), 924–939. 120. Wang, Z. & Nowatzki, T. Stream-based memory access specialization for general purpose processors in 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA) (2019), 736–749. 121. Sankaralingam, K., Nowatzki, A., Gangadhar, V., Shah, P. & Ardalani, N. Systems and methods for stream-dataflow acceleration US Patent App. 16/384,819. 2019. 122. Nguyen, Q. M. & Sanchez, D. Pipette: Improving Core Utilization on Irregular Applications through Intra-Core Pipeline Parallelism in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (2020), 596–608. 123. Buluc¸, A., Fineman, J. T., Frigo, M., Gilbert, J. R. & Leiserson, C. E. Parallel sparse matrixvector and matrix-transpose-vector multiplication using compressed sparse blocks in Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures (2009), 233–244. 124. Ceze, L., Tuck, J., Torrellas, J. & Cascaval, C. Bulk disambiguation of speculative threads in multiprocessors. ACM SIGARCH Computer Architecture News 34, 227–238 (2006). 100 125. Qian, X., Sahelices, B. & Torrellas, J. OmniOrder: Directory-based conflict serialization of transactions. ACM SIGARCH Computer Architecture News 42, 421–432 (2014). 126. Bachrach, J., Vo, H., Richards, B., Lee, Y., Waterman, A., Avizienis, R., ˇ et al. Chisel: constructing hardware in a scala embedded language in DAC Design Automation Conference 2012 (2012), 1212–1221. 127. Martins, M., Matos, J. M., Ribas, R. P., Reis, A., Schlinker, G., Rech, L., et al. 
Open cell library in 15nm FreePDK technology in Proceedings of the 2015 Symposium on International Symposium on Physical Design (2015), 171–178. 128. Balasubramonian, R., Kahng, A. B., Muralimanohar, N., Shafiee, A. & Srinivas, V. CACTI 7: New tools for interconnect exploration in innovative off-chip memories. ACM Transactions on Architecture and Code Optimization (TACO) 14, 1–25 (2017). 129. Jamshidi, K., Mahadasa, R. & Vora, K. Peregrine: a pattern-aware graph mining system in Proceedings of the Fifteenth European Conference on Computer Systems (2020), 1–16. 130. Bringmann, B. & Nijssen, S. What is frequent in a single graph? in Pacific-Asia Conference on Knowledge Discovery and Data Mining (2008), 858–863. 131. Bader, D. A., Meyerhenke, H., Sanders, P. & Wagner, D. Graph partitioning and graph clustering (American Mathematical Society Providence, RI, 2013). 132. Geisberger, R., Sanders, P. & Schultes, D. Better approximation of betweenness centrality in 2008 Proceedings of the Tenth Workshop on Algorithm Engineering and Experiments (ALENEX) (2008), 90–100. 133. Leskovec, J., Kleinberg, J. & Faloutsos, C. Graph evolution: Densification and shrinking diameters. ACM transactions on Knowledge Discovery from Data (TKDD) 1, 2–es (2007). 134. Yin, H., Benson, A. R., Leskovec, J. & Gleich, D. F. Local higher-order graph clustering in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017), 555–564. 135. Bitcoin Alpha network dataset – KONECT 2018. 136. Kumar, S., Spezzano, F., Subrahmanian, V. S. & Faloutsos, C. Edge Weight Prediction in Weighted Signed Networks in Proc. Int. Conf. Data Min. (2016), 221–230. 137. Kunegis, J. KONECT: The Koblenz Network Collection in Proceedings of the 22nd International Conference on World Wide Web (Association for Computing Machinery, Rio de Janeiro, Brazil, 2013), 1343–1350. doi:10.1145/2487788.2488173. 138. Leskovec, J., Kleinberg, J. & Faloutsos, C. Graph evolution: Densification and shrinking diameters. ACM transactions on Knowledge Discovery from Data (TKDD) 1, 2–es (2007). 101 139. Ripeanu, M., Foster, I. & Iamnitchi, A. Mapping the gnutella network: Properties of largescale peer-to-peer systems and implications for system design. arXiv preprint cs/0209028 (2002). 140. Leskovec, J., Huttenlocher, D. & Kleinberg, J. Signed networks in social media in Proceedings of the SIGCHI conference on human factors in computing systems (2010), 1361–1370. 141. Leskovec, J., Huttenlocher, D. & Kleinberg, J. Predicting positive and negative links in online social networks in Proceedings of the 19th international conference on World wide web (2010), 641–650. 142. Elseidy, M., Abdelhamid, E., Skiadopoulos, S. & Kalnis, P. Grami: Frequent subgraph and pattern mining in a single large graph. Proceedings of the VLDB Endowment 7, 517–528 (2014). 143. Yang, J. & Leskovec, J. Defining and Evaluating Network Communities based on Groundtruth. arXiv preprint arXiv:1205.6233 (2012). 144. Leskovec, J., Kleinberg, J. & Faloutsos, C. Graphs over time: densification laws, shrinking diameters and possible explanations in Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (2005), 177–187. 145. Backstrom, L., Huttenlocher, D., Kleinberg, J. & Lan, X. Group formation in large social networks: membership, growth, and evolution in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (2006), 44–54. 146. Davis, T. A. & Hu, Y. 
The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38, 1–25 (2011). 147. Kleinberg, J. M. Authoritative sources in a hyperlinked environment in Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms (1998), 668–677. 148. Langville, A. N. & Meyer, C. D. A reordering for the PageRank problem. SIAM Journal on Scientific Computing 27, 2112–2120 (2006). 149. Smith, S., Choi, J. W., Li, J., Vuduc, R., Park, J., Liu, X., et al. FROSTT: The Formidable Repository of Open Sparse Tensors and Tools http://frostt.io/. 150. Hu, B., Zheng, L., Zhu, J., Ding, L., Wang, Y. & Gu, X. Teaching Plan Generation and Evaluation With GPT-4: Unleashing the Potential of LLM in Instructional Design. IEEE Transactions on Learning Technologies (2024). 151. Sawicki, B. Teaching Maxwell Equations with LLM Assistance. 152. Jin, H., Zhang, Y., Meng, D., Wang, J. & Tan, J. A comprehensive survey on processoriented automatic text summarization with exploration of llm-based methods. arXiv preprint arXiv:2403.02901 (2024). 102 153. Sahu, G. & Laradji, I. H. MixSumm: Topic-based Data Augmentation using LLMs for Lowresource Extractive Text Summarization. arXiv preprint arXiv:2407.07341 (2024). 154. Shi, J., Li, J., Ma, Q., Yang, Z., Ma, H. & Li, L. CHOPS: CHat with custOmer Profile Systems for Customer Service with LLMs. arXiv preprint arXiv:2404.01343 (2024). 155. Pandya, K. & Holia, M. Automating Customer Service using LangChain: Building custom open-source GPT Chatbot for organizations. arXiv preprint arXiv:2310.05421 (2023). 156. Kolasani, S. Optimizing natural language processing, large language models (LLMs) for efficient customer service, and hyper-personalization to enable sustainable growth and revenue. Transactions on Latest Trends in Artificial Intelligence 4 (2023). 157. Qin, J., Wu, J., Chen, W., Ren, Y., Li, H., Wu, H., et al. Diffusiongpt: LLM-driven text-toimage generation system. arXiv preprint arXiv:2401.10061 (2024). 158. Qu, L., Wu, S., Fei, H., Nie, L. & Chua, T.-S. Layoutllm-t2i: Eliciting layout guidance from llm for text-to-image generation in Proceedings of the 31st ACM International Conference on Multimedia (2023), 643–654. 159. Dong, X. L., Moon, S., Xu, Y. E., Malik, K. & Yu, Z. Towards next-generation intelligent assistants leveraging llm techniques in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2023), 5792–5793. 160. Zhang, K., Kang, Y., Zhao, F. & Liu, X. LLM-based Medical Assistant Personalization with Short-and Long-Term Memory Coordination in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (2024), 2386–2398. 161. Duarte, F. Number of ChatGPT Users (Jul 2024) https://explodingtopics.com/blog/ chatgpt-users. [Online; accessed 20-July-2024]. 2024. 162. Research, D. M. Explosive Growth Predicted: Large Language Model Market Set to Reach USD 6.5 Billion by 2024 To USD 140.8 Billion by 2033- Dimension Market Research https : / / finance . yahoo . com / news / explosive - growth - predicted - large - language-184300698.html. [Online; accessed 20-July-2024]. 2024. 163. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). 164. Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., et al. 
Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022). 165. Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., et al. AWQ: Activationaware Weight Quantization for On-Device LLM Compression and Acceleration. Proceedings of Machine Learning and Systems 6, 87–100 (2024). 103 166. Katharopoulos, A., Vyas, A., Pappas, N. & Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention in International conference on machine learning (2020), 5156–5165. 167. Vyas, A., Katharopoulos, A. & Fleuret, F. Fast transformers with clustered attention. Advances in Neural Information Processing Systems 33, 21665–21674 (2020). 168. Park, C., Jeong, Y., Cho, M. & Park, J. Fast point transformer in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2022), 16949–16958. 169. Choquette, J. Nvidia hopper h100 gpu: Scaling performance. IEEE Micro 43, 9–17 (2023). 170. ChatGPT. ChatGPT Pricing https :/ /openai . com/ chatgpt/ pricing/. [Online; accessed 20-July-2024]. 2024. 171. Kevin Lee Adi Gangidi, M. O. Building Meta’s GenAI Infrastructure https://engineering. fb . com / 2024 / 03 / 12 / data - center - engineering / building - metas - genai - infrastructure/. [Online; accessed 20-July-2024]. 2024. 172. Jouppi, N., Kurian, G., Li, S., Ma, P., Nagarajan, R., Nai, L., et al. Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings in Proceedings of the 50th Annual International Symposium on Computer Architecture (2023), 1–14. 173. Crankshaw, D., Wang, X., Zhou, G., Franklin, M. J., Gonzalez, J. E. & Stoica, I. Clipper: A {Low-Latency} online prediction serving system in 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17) (2017), 613–627. 174. Crankshaw, D., Sela, G.-E., Zumar, C., Mo, X., Gonzalez, J. E., Stoica, I., et al. Inferline: Ml inference pipeline composition framework. arXiv preprint arXiv:1812.01776 (2018). 175. Gujarati, A., Karimi, R., Alzayat, S., Hao, W., Kaufmann, A., Vigfusson, Y., et al. Serving {DNNs} like clockwork: Performance predictability from the bottom up in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20) (2020), 443–462. 176. Khare, A., Garg, D., Kalra, S., Grandhi, S., Stoica, I. & Tumanov, A. SuperServe: FineGrained Inference Serving for Unpredictable Workloads. arXiv preprint arXiv:2312.16733 (2023). 177. Cai, H., Gan, C., Wang, T., Zhang, Z. & Han, S. Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791 (2019). 178. Ahmad, S., Guan, H., Friedman, B. D., Williams, T., Sitaraman, R. K. & Woo, T. Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (2024), 318–334. 104 179. Liu, J., Wu, Z., Chung, J.-W., Lai, F., Lee, M. & Chowdhury, M. Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services. arXiv preprint arXiv:2404.16283 (2024). 180. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., et al. Efficient memory management for large language model serving with pagedattention in Proceedings of the 29th Symposium on Operating Systems Principles (2023), 611–626. 181. Agrawal, A., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B. S. & Ramjee, R. 
Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369 (2023). 182. Patel, P., Choukse, E., Zhang, C., Shah, A., Goiri, ´I., Maleki, S., et al. Splitwise: Efficient generative llm inference using phase splitting. Power 400, 1–75 (2023). 183. Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023). 184. team, L. The Llama 3 Herd of Models https://ai.meta.com/research/publications/ the-llama-3-herd-of-models/. [Online; accessed 20-July-2024]. 2024. 185. Brysbaert, M. How many words do we read per minute? A review and meta-analysis of reading rate. Journal of memory and language 109, 104047 (2019). 186. OpenAI. What are tokens and how to count them? https : / / help . openai . com / en / articles/4936856-what-are-tokens-and-how-to-count-them. [Online; accessed 20-July-2024]. 2024. 187. ShareGPT. ShareGPT https://sharegpt.com/. [Online; accessed 20-July-2024]. 2024. 188. Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., et al. Mnasnet: Platform-aware neural architecture search for mobile in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2019), 2820–2828. 189. Baker, B., Gupta, O., Naik, N. & Raskar, R. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167 (2016). 190. Yu, J., Jin, P., Liu, H., Bender, G., Kindermans, P.-J., Tan, M., et al. Bignas: Scaling up neural architecture search with big single-stage models in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16 (2020), 702–717. 191. Zhang, J., Wang, J., Li, H., Shou, L., Chen, K., Chen, G., et al. Draft & verify: Lossless large language model acceleration via self-speculative decoding. arXiv preprint arXiv:2309.08168 (2023). 192. Elhoushi, M., Shrivastava, A., Liskovich, D., Hosmer, B., Wasti, B., Lai, L., et al. Layer skip: Enabling early exit inference and self-speculative decoding. arXiv preprint arXiv:2404.16710 (2024). 105 193. Merity, S., Xiong, C., Bradbury, J. & Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843 (2016). 106