ARCHITECTURE DESIGN AND ALGORITHMIC OPTIMIZATIONS FOR ACCELERATING GRAPH ANALYTICS ON FPGA

by Shijie Zhou

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2018

Copyright 2018 Shijie Zhou

Dedication

To my wife and my parents

Acknowledgments

First, I would like to express my deep and sincere gratitude to my advisor, Professor Viktor K. Prasanna, for his unwavering guidance and support. During my Ph.D. program, he has patiently trained me in various research activities, kindly supported my research work, and wisely guided me through challenges and obstacles. I am also grateful to Professor Rajgopal Kannan and Professor Charalampos Chelmis for their guidance and collaboration on several research projects and publications. Likewise, I am indebted to Professor Xuehai Qian and Professor Aiichiro Nakano for serving on my dissertation committee. Their careful reviews and valuable feedback have guided me in improving this thesis.

Second, I would like to thank all my colleagues at USC, in particular Yun Rock Qu, Da Tong, Sanmukh Rao Kuppannagari, Ren Chen, Andrea Sanny, Shreyas Girish Singapura, Charith Dhanushka Wickramaarachchi, Hanqing Zeng, and Kartik Lakhotia. My discussions with them have inspired most of the accomplishments in this thesis. I also thank Diane Demetras, Kathryn Kassar, and Janice Thompson for their help in administrative work.

Finally, I am most grateful to my wife, Yifan Zhu, and my parents, Shuguang Zhou and Hongmei Liu, for their moral support and encouragement. Without their unconditional love, I could not have completed this Ph.D. thesis.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Graphs
  1.2 Graph Analytics
    1.2.1 Graph Analytics Applications
    1.2.2 Types of Graph Analytics
    1.2.3 Graph Analytics Algorithms
  1.3 Motivation
  1.4 Thesis Contributions
  1.5 Thesis Organization

2 Background and Related Work
  2.1 Graph Representations
    2.1.1 Adjacency List
    2.1.2 Adjacency Matrix
    2.1.3 Compressed Sparse Row
    2.1.4 Coordinate List
  2.2 Graph Processing Paradigms
    2.2.1 Vertex-centric Paradigm
    2.2.2 Edge-centric Paradigm
  2.3 Platforms
    2.3.1 FPGA
    2.3.2 DRAM
    2.3.3 CPU-FPGA Heterogeneous Architectures
    2.3.4 Comparison between FPGA, CPU, and GPU
  2.4 Related Work
    2.4.1 Multi-core-based Graph Processing Frameworks
    2.4.2 GPU-based Graph Processing Frameworks
    2.4.3 ASIC-based Graph Processing Frameworks
    2.4.4 FPGA-based Graph Processing Accelerators
      2.4.4.1 Algorithm-specific Accelerators
      2.4.4.2 Graph Processing Frameworks
  2.5 Graph Algorithms in This Thesis

3 High-throughput Graph Processing Framework on FPGA
  3.1 Framework Overview
  3.2 Data Structures
  3.3 Optimizations
    3.3.1 Graph Partitioning and Vertex Buffering
    3.3.2 Partition Skipping
    3.3.3 Parallelizing Edge-centric Graph Processing
      3.3.3.1 Inter-partition Parallelism
      3.3.3.2 Intra-partition Parallelism
    3.3.4 Data Layout Optimization
    3.3.5 Data Communication Reduction
      3.3.5.1 Update Combining
      3.3.5.2 Update Filtering
  3.4 Accelerator Design
    3.4.1 Overall Architecture
    3.4.2 Processing Engine
  3.5 Design Automation Tool
  3.6 Performance Evaluation
    3.6.1 Experimental Setup
    3.6.2 Performance Metrics
    3.6.3 Resource Utilization and Power Consumption
    3.6.4 Execution Time and Throughput
    3.6.5 Impact of the Optimizations
      3.6.5.1 Impact of Partition Skipping
      3.6.5.2 Impact of Update Combining and Filtering
      3.6.5.3 Impact of Data Layout Optimization
  3.7 Comparison with State-of-the-Art
    3.7.1 Comparison with Multi-core Designs
    3.7.2 Comparison with GPU Designs
    3.7.3 Comparison with State-of-the-Art FPGA Designs

4 Accelerating Graph Processing on CPU-FPGA Heterogeneous Platform
  4.1 Motivation
  4.2 Hybrid Algorithm for Graph Processing
    4.2.1 Hybrid Data Structure
    4.2.2 Paradigm Selection
    4.2.3 Hybrid Algorithm
    4.2.4 Mapping the Hybrid Algorithm onto a Heterogeneous Platform
      4.2.4.1 Scatter Phase
      4.2.4.2 Gather Phase
  4.3 Implementation
    4.3.1 Overall Architecture
    4.3.2 Accelerator Function Unit Design
  4.4 Performance Evaluation
    4.4.1 Experimental Setup
    4.4.2 Performance Metrics
    4.4.3 Resource Utilization
    4.4.4 Vertex-centric vs. Edge-centric on the CPU
    4.4.5 Hybrid Algorithm on the CPU
    4.4.6 Hybrid Algorithm on the CPU-FPGA Heterogeneous Platform
  4.5 Comparison with State-of-the-Art
    4.5.1 Comparison with Multi-core Implementations
    4.5.2 Comparison with State-of-the-Art FPGA-based Accelerators

5 Accelerating Matrix Factorization for Machine Learning Applications
  5.1 Motivation
  5.2 Problem Definition
  5.3 Challenges in Acceleration
    5.3.1 Limited On-chip Memory Resources
    5.3.2 Limited Parallelism because of Data Dependencies
    5.3.3 Concurrent Accesses to Dual-port On-chip RAMs
  5.4 Optimizations
    5.4.1 Graph Partitioning and Communication Hiding
    5.4.2 Parallelism Extraction
    5.4.3 Edge Scheduling
  5.5 Accelerator Design
    5.5.1 Overall Architecture
    5.5.2 Processing Engine
    5.5.3 Feature Vector Buffer
    5.5.4 Hazard Detection Unit
  5.6 Performance Evaluation
    5.6.1 Experimental Setup
    5.6.2 Performance Metrics
    5.6.3 Resource Utilization and Power Consumption
    5.6.4 Pre-processing Time and Training Time
    5.6.5 Throughput vs. Parallelism
    5.6.6 Impact of the Optimizations
      5.6.6.1 Bank Conflict Reduction
      5.6.6.2 Data Dependency Reduction
      5.6.6.3 Communication Cost Reduction
  5.7 Comparison with State-of-the-Art
    5.7.1 Comparison with Multi-core Implementations
    5.7.2 Comparison with GPU Implementations

6 Conclusion
  6.1 Summary of Contributions
  6.2 Future Work
    6.2.1 Emerging Memory Technologies
    6.2.2 Graph Stream Algorithms
    6.2.3 Evolving Graph Algorithms

Reference List

List of Tables

2.1 Comparison between FPGA, CPU, and GPU
3.1 Mapping of graph algorithms to the ECP
3.2 Graph datasets used in the experiments
3.3 Resource utilization, clock rate, and power consumption
3.4 Execution time (ms) for various datasets
3.5 Throughput (MTEPS) for various datasets
3.6 Comparison with highly optimized multi-core implementations
3.7 Comparison with GPU-based implementations
3.8 Comparison with GraphOps
3.9 Comparison with ForeGraph
4.1 Synthetic graph datasets used in the experiments
4.2 Resource utilization
4.3 Performance of the hybrid algorithm on the CPU-FPGA heterogeneous platform
4.4 Comparison with state-of-the-art multi-core design
4.5 Comparison with state-of-the-art FPGA-based accelerators
5.1 Bipartite graph notations for matrix factorization
5.2 Large real-life sparse matrices used for the experiments
5.3 Resource utilization, clock rate, and power consumption
5.4 Pre-processing time
5.5 Training time
5.6 Bank conflict reduction because of Optimization 3
5.7 Pipeline stall reduction because of Optimization 2
5.8 Comparison between two partitioning approaches
5.9 Communication cost reduction because of Optimization 1
5.10 Comparison with state-of-the-art multi-core implementation
5.11 Comparison with GPU implementations

List of Figures

1.1 A graph G with vertices V and edges E
2.1 An example graph used to illustrate various graph representations
2.2 Adjacency list representation of the example graph from Figure 2.1
2.3 Adjacency matrix representation of the example graph from Figure 2.1
2.4 CSR representation of the example graph from Figure 2.1
2.5 COO representation of the example graph from Figure 2.1
2.6 Internal organization of FPGA
2.7 DRAM organization
2.8 Shared-memory CPU-FPGA heterogeneous architecture
3.1 The target system architecture of our framework
3.2 Example graph and its associated data structures, with the assumption that the value of each update is the product of the edge weight and the attribute of the source vertex of the edge
3.3 Data layout after the graph in Figure 3.2 is partitioned
3.4 Non-sequential external memory accesses because of writing updates into the external memory in the scatter phase
3.5 Overall architecture of the accelerator
3.6 Architecture of the processing engine
3.7 Update combining network for q = 4
3.8 Data forwarding circuits
3.9 Workflow of the design automation tool
3.10 Reduction of edge traversals because of partition skipping
3.11 Reduction factor of produced updates because of update combining and filtering
3.12 Reduction factor of non-sequential DRAM accesses because of data layout optimization
4.1 Variation of active vertex ratio during the execution of BFS for a web graph
4.2 Hybrid data structure of an example graph, with the assumption that v0 is the only active vertex in the current iteration
4.3 Coordination between the CPU and FPGA by the runtime system
4.4 Concurrent write accesses to the same bin
4.5 Overall architecture
4.6 Architecture of the accelerator function unit (AFU)
4.7 Comparison between the VCP and ECP on the CPU
4.8 Execution time comparison between the VCP, ECP, and hybrid algorithm on the CPU
4.9 Execution time breakdown of the hybrid algorithm on the CPU
4.10 Accelerating the hybrid algorithm by CPU-FPGA co-processing
5.1 Pipelined processing of ISs
5.2 Overall architecture
5.3 Architecture of the processing unit
5.4 Multiport FVB based on banking
5.5 Architecture of the bank conflict resolver for K = 4
5.6 Throughput for various K

Abstract

Graph analytics has drawn much research interest because of its broad applicability, from machine learning to social science. However, obtaining high performance for large-scale graph analytics is very challenging because of the large memory footprint of real-world graphs and the irregular access patterns of graph analytics algorithms. As general-purpose processors (GPPs) have several architectural inefficiencies in processing large-scale graph data, dedicated hardware accelerators can significantly improve performance with respect to execution time, throughput, and energy efficiency. In this thesis, we focus on designing hardware architectures based on state-of-the-art field-programmable gate array (FPGA) technologies to accelerate iterative graph analytics algorithms. We also propose novel algorithmic optimizations to optimize memory performance and maximize parallelism in order to achieve a significant speedup.

In the first part of our research, we propose a high-throughput FPGA framework to accelerate general graph algorithms based on the edge-centric paradigm (ECP). To optimize the performance of our framework, we propose various novel algorithmic optimizations, including a graph partitioning approach to enable efficient data buffering, an optimized data layout to improve memory performance, and an update merging and filtering scheme to reduce data communication. We also develop a design automation tool to facilitate the generation of accelerators using our framework. Four representative graph algorithms, namely sparse matrix vector multiplication (SpMV), PageRank (PR), single-source shortest path (SSSP), and weakly connected component (WCC), are accelerated to evaluate the performance of our framework.

In the second part of our research, we explore CPU-FPGA heterogeneous architectures for graph analytics acceleration.
We analyze the tradeoffs between the widely used vertex-centric paradigm (VCP) and the ECP and propose a hybrid algorithm that dynamically selects between them during execution. We develop a hybrid data structure that concurrently supports the VCP and ECP and enables efficient parallel computation on heterogeneous platforms. Furthermore, we map our hybrid algorithm onto a state-of-the-art heterogeneous platform that integrates a multi-core CPU and an FPGA accelerator through cache-coherent interconnect. We evaluate our CPU-FPGA co-design by accelerating breadth-first search (BFS) and SSSP.

In the third part of our research, we design an FPGA architecture to accelerate the training process of a popular machine learning algorithm that performs matrix factorization (MF) using stochastic gradient descent (SGD). We transform the algorithm into a bipartite graph processing problem and propose a novel three-level hierarchical graph partitioning approach to overcome the acceleration challenges. This approach enables conflict-minimizing scheduling and processing of edges to achieve a significant speedup.

We implement our designs using state-of-the-art FPGAs and demonstrate their superior performance over state-of-the-art graph analytics accelerators in terms of throughput, execution time, and energy efficiency. The broader impacts of this thesis include the productive use of FPGAs for accelerating graph analytics and machine learning algorithms on very large graphs.

Chapter 1
Introduction

1.1 Graphs

In computer science and mathematics, graphs are abstract data structures that model structural relationships among objects. As shown in Figure 1.1, a graph G = (V, E) consists of a set of vertices V that are connected by a set of edges E, where E ⊆ V × V. A graph is undirected if all the edges are bidirectional (i.e., ∀ u, v ∈ V, (u, v) ∈ E ⟺ (v, u) ∈ E); otherwise, the graph is directed. A graph is weighted if each edge of G is assigned a numerical value called its weight; otherwise, the graph is unweighted.

Figure 1.1: A graph G with vertices V and edges E

Graphs have been widely used to represent real-world networked data in many application domains [1]. For example, in social networks, graphs are used to represent people and the interactions (e.g., friendship) among them. In transportation networks, graphs are used to model cities and road connections. In the World Wide Web, graphs are used to model web pages and the hyperlinks among them. These real-world graphs have some common characteristics. The first characteristic is their large data volume. For example, as of 2018, the social network of Facebook has over 2 billion monthly active users [2]; major web search engines (e.g., Google) need to find information from over a trillion web pages. The second characteristic is their heterogeneous node degree distribution: many real-world graphs have a scale-free structure whose node degree distribution follows a power law. The third characteristic is their sparsity: the number of edges is within a constant multiple of the number of vertices.

1.2 Graph Analytics

1.2.1 Graph Analytics Applications

Graph analytics leverages graph structures to understand, codify, and visualize relationships between objects in a graph. It is built on the mathematics of graph theory and aims to derive hidden information and produce insightful outcomes by analyzing large-scale graphs.
There are many applications based on graph analytics, such as the following:

• Detecting financial crimes
• Applying influencer analysis in social network communities
• Performing grid and network quality analysis
• Optimizing routes in airlines
• Providing personalized recommendations to customers
• Conducting research in bioinformatics
• Managing investment

1.2.2 Types of Graph Analytics

The following are the four fundamental types of graph analytics [3]:

• Path analysis: This type of analysis aims to determine the shortest distance between two vertices in a graph. An example use case is route optimization for logistics, supply, and distribution chains.
• Connectivity analysis: This type of analysis aims to determine weaknesses in networks and compare connectivity across distinct networks.
• Community analysis: This type of analysis is used to find groups of interacting people in a social network.
• Centrality analysis: This type of analysis enables identifying relevancy to find the most influential people in a social network or the most highly accessed web pages.

1.2.3 Graph Analytics Algorithms

Graph algorithms are harnessed to perform graph analytics. There are many widely used graph analytics algorithms, including the following:

• Clustering: to group vertices based on their characteristics such that there is high intra-cluster similarity and low inter-cluster similarity
• Cutting: to find the cut with the fewest number of crossing edges
• Graph search and traversal: breadth-first search (BFS) and depth-first search
• Shortest path: to find the shortest path between two vertices of interest in a weighted graph
• Widest path: to find a path between two designated vertices in a weighted graph such that the weight of the minimum-weight edge in the path is maximized
• Betweenness centrality: to count the number of times a vertex appears in the shortest path between two other vertices
• Connected component: to find a subgraph such that every pair of vertices in the subgraph can be reached from each other
• PageRank: to rank the popularity of vertices
• Coloring: to color the vertices/edges of a graph such that no two adjacent vertices/edges share the same color
• Transitive closure: to determine whether a vertex j is reachable from another vertex i for all vertex pairs (i, j) in a directed graph
• Triangle counting: to count the number of edge triangles incident to each vertex

Unfortunately, these graph analytics algorithms are notoriously challenging to accelerate. They exhibit the following characteristics that pose great challenges in acceleration:

• Iterative computations
• Irregular data access patterns
• Poor locality of reference, both spatial and temporal
• Limited independent parallelism
• Memory-bound performance
• Low computation-to-communication ratio

1.3 Motivation

Because of the broad applicability of graph analytics, many graph analytics engines have been developed based on general-purpose processors (GPPs), such as multi-core processors and general-purpose graphics processing units. However, GPPs are not the ideal platforms to perform graph analytics [4, 5].
They have several inefficiencies, including (1) wasted external memory bandwidth because of inefficient memory access granularity (i.e., loading and storing entire cache lines while operating on only a small portion of the data), (2) ineffective on-chip memory usage because of the poor spatial and temporal locality of graph algorithms, (3) mismatch in execution granularity (i.e., computation based on x86 instructions instead of domain-specific data types for graph analytics), and (4) expensive atomic operations (e.g., memory locks) to prevent race conditions caused by concurrent updates from distinct threads. To address these inefficiencies, dedicated hardware accelerators for graph analytics have drawn much interest and demonstrated great success [4, 5, 6, 7, 8, 9, 10].

In 1965, Gordon Moore predicted that the number of transistors per unit area would double every year; in 1975, he updated his original prediction to every two years [11]. This prediction is known as Moore's law. Until around 2005, Moore's law correlated well with processor performance because of Dennard scaling [12], which states that scaling down transistor size significantly improves not only transistor performance but also energy efficiency, so that the power density of the chip stays constant. However, after 2005, Dennard scaling appeared to have broken down and microprocessors hit the power wall [13]. Since then, energy efficiency has been a primary concern in all stages of computer system design. Because of the increasing interest in energy-efficient computing, the field-programmable gate array (FPGA) has become a very attractive platform for developing accelerators for many applications [14, 15, 16, 17, 18, 19, 20].

FPGA lies between GPP and application-specific integrated circuit (ASIC). Compared with GPP, FPGA can deliver higher performance at a lower cost and power consumption, especially for data-intensive applications [21]. Compared with ASIC, FPGA can be reconfigured, both statically and dynamically; this reconfigurability is quite useful in design cases where an application requires software-like data-dependent processing with ASIC-level high performance [22, 23, 24]. FPGAs have been introduced into data centers to provide customized acceleration of computation-intensive tasks [25]. Amazon Web Services has recently launched FPGA-based cloud instances to allow customers to develop FPGA accelerators for complex applications [26]. Intel and IBM have incorporated FPGAs into their next-generation heterogeneous systems [27, 28] to improve performance and energy efficiency.

FPGA is a very attractive platform for accelerating graph analytics. State-of-the-art FPGA devices provide dense logic elements (up to 5.5 million [29]), which can be used to implement deep pipelines that exploit fine-grained parallelism. In addition, FPGAs support various emerging memory technologies, such as Hybrid Memory Cube [6] and HBM2 [28], which offer high-bandwidth, low-latency accesses with low power consumption. More importantly, FPGAs provide abundant on-chip memory resources (up to 500 Mb [29]) that are fully user controllable, which helps overcome the irregular memory access patterns of graph analytics algorithms and achieve efficient data reuse.

1.4 Thesis Contributions

In this thesis, we focus on accelerating iterative graph analytics algorithms by exploiting state-of-the-art FPGA technologies.
Our proposed techniques include both (software-based) algorithmic optimizations and (hardware-based) architecture design. Specifically, we make the following contributions:

• Graph Processing Framework on FPGA: We propose an FPGA framework to accelerate graph algorithms based on the ECP. The framework is flexible in supporting generic graph algorithms with various vertex attributes and attribute update functions. We develop a design automation tool to allow users to easily and rapidly construct accelerators for various graph analytics using our framework. Furthermore, we propose novel algorithmic optimizations to improve performance, including a graph partitioning approach to enable efficient data reuse, an optimized data layout to improve memory performance, and an update merging and filtering scheme to reduce data communication. Based on our design methodology, we accelerate four fundamental graph algorithms: sparse matrix vector multiplication (SpMV), PageRank (PR), single-source shortest path (SSSP), and weakly connected components (WCC). Experimental results show that for a variety of real-world and synthetic large graphs, the framework sustains an average throughput of 2,076 million traversed edges per second (MTEPS) for SpMV, 2,225 MTEPS for PR, 2,916 MTEPS for SSSP, and 3,493 MTEPS for WCC. Compared with highly optimized multi-core implementations, our framework achieves up to 37.9× speedup. Compared with state-of-the-art FPGA frameworks, our framework achieves up to 50.7× throughput improvement.

• Software-Hardware Co-design for Accelerating Graph Analytics: We explore state-of-the-art CPU-FPGA heterogeneous architectures to accelerate non-stationary graph algorithms. By analyzing the tradeoffs between the VCP and ECP, we propose a hybrid algorithm that dynamically selects between them during execution. We introduce the notion of active vertex ratio, based on which we develop a simple but efficient paradigm selection approach. We develop a hybrid data structure to concurrently support both the VCP and ECP. Based on the hybrid data structure, we propose a graph partitioning scheme to increase parallelism and enable efficient parallel computation on heterogeneous platforms. In each iteration, we use our paradigm selection approach to select the appropriate paradigm for each partition. Furthermore, we map our hybrid algorithm onto a state-of-the-art heterogeneous platform that integrates a multi-core CPU and an FPGA accelerator through cache-coherent interconnect. We use our design methodology to accelerate BFS and SSSP. Experimental results show that our CPU-FPGA co-design achieves up to 1.5× (1.9×) speedup for BFS (SSSP) compared with highly optimized baseline designs. Compared with state-of-the-art algorithm-specific FPGA accelerators for BFS, our proposed design achieves up to 4.0× throughput improvement.

• Accelerating the Training Process of Matrix Factorization: Matrix factorization (MF) using SGD is a popular machine learning technique to derive latent information from a collection of observations. We design a highly parallel FPGA architecture to accelerate the training process of SGD-based MF. We identify the challenges for acceleration and overcome them by transforming the SGD-based MF algorithm into a bipartite graph processing problem. We propose a novel three-level hierarchical partitioning scheme that enables conflict-minimizing scheduling and processing of edges to significantly accelerate the processing of this bipartite graph.
First, we develop a fast heuristic to partition the input bipartite graph into induced subgraphs; this enables our accelerator to efficiently buffer vertex data in the on-chip memory for data reuse and to completely hide communication overhead. Second, we partition all edges of each subgraph into matchings to extract the maximum parallelism. Third, we schedule the execution of the edges inside each matching to reduce concurrent memory access conflicts to the dual-port on-chip RAMs. Compared with non-optimized baseline designs, these optimizations result in up to 60× data dependency reduction, 4.2× bank conflict reduction, and 15.4× speedup. Experimental results show that our FPGA accelerator sustains a high computing throughput of up to 217 billion floating-point operations per second (GFLOPS) for training large real-life sparse matrices. Compared with state-of-the-art multi-core and GPU implementations, our FPGA accelerator demonstrates up to 13.3× and 12.7× speedup, respectively.

1.5 Thesis Organization

The rest of this thesis is organized as follows:

• Chapter 2 presents the background and related work.
• Chapter 3 details the design of our FPGA-based graph processing framework. Four fundamental graph analytics algorithms are accelerated to evaluate performance.
• Chapter 4 introduces our software-hardware co-design for accelerating graph analytics on CPU-FPGA heterogeneous platforms.
• Chapter 5 presents our design and optimizations of an FPGA accelerator for MF based on a bipartite graph representation.
• Chapter 6 concludes the thesis and presents three future research directions.

Chapter 2
Background and Related Work

In this chapter, we first introduce several widely used graph representations and graph processing paradigms. Then, we cover FPGA technologies and related work. Lastly, we discuss the graph algorithms studied in this thesis.

2.1 Graph Representations

In this section, we briefly introduce several commonly used graph representations. We use the example graph shown in Figure 2.1 for illustration, in which the number in each vertex represents the index of the vertex, and the number next to each edge represents the weight of the edge. We use |V| to denote the total number of vertices, |E| to denote the total number of edges, and v_i to denote the vertex with index i (0 ≤ i < |V|).

Figure 2.1: An example graph used to illustrate various graph representations

2.1.1 Adjacency List

The adjacency list representation uses an array of linked lists to store a graph. The size of the array is equal to |V|. The i-th linked list contains an element storing v_j if there exists an edge from v_i to v_j (0 ≤ i, j < |V|); the element also records the weight of the corresponding edge. Figure 2.2 depicts the adjacency list representation of the example graph.

Figure 2.2: Adjacency list representation of the example graph from Figure 2.1

  v0: (v3, 5)
  v1: (v0, 4), (v2, 1), (v4, 6)
  v2: (v4, 1)
  v3: (v1, 3)
  v4: (v3, 2)

2.1.2 Adjacency Matrix

The adjacency matrix representation stores a graph using a |V| × |V| matrix. Let M(i, j) denote the entry in the i-th row and j-th column of the matrix (0 ≤ i, j < |V|). If there is an edge from v_i to v_j, M(i, j) is equal to the weight of the edge; otherwise, M(i, j) = 0. The adjacency matrix representation of the example graph is depicted in Figure 2.3.
Figure 2.3: Adjacency matrix representation of the example graph from Figure 2.1

       v0  v1  v2  v3  v4
  v0    0   0   0   5   0
  v1    4   0   1   0   6
  v2    0   0   0   0   1
  v3    0   3   0   0   0
  v4    0   0   0   2   0

2.1.3 Compressed Sparse Row

The compressed sparse row (CSR) representation compresses the adjacency matrix representation by storing the matrix as three arrays, namely the value array, the column-index array, and the row-offset array. The value array is of size |E|; it stores the values of the nonzero elements of the adjacency matrix as they are traversed in a row-major fashion. The column-index array is of size |E|; it stores the column indexes of the elements in the value array. The row-offset array is of size |V| + 1; its i-th element (0 ≤ i < |V|) stores the location in the value array that corresponds to the first nonzero element of the i-th row in the adjacency matrix. The last element of the row-offset array stores |E|. The CSR representation of the example graph is depicted in Figure 2.4.

Figure 2.4: CSR representation of the example graph from Figure 2.1

  value array:        5 4 1 6 1 3 2
  column-index array: 3 0 2 4 4 1 3
  row-offset array:   0 1 4 5 6 7

2.1.4 Coordinate List

The coordinate list (COO) representation stores a graph as an array of edges. Each edge is represented as a <src, dest, weight> tuple that specifies the source vertex, the destination vertex, and the weight of the edge. The COO representation of the example graph is depicted in Figure 2.5.

Figure 2.5: COO representation of the example graph from Figure 2.1

  (src, dest, weight): (v0, v3, 5), (v1, v0, 4), (v1, v2, 1), (v1, v4, 6), (v3, v1, 3), (v4, v3, 2), (v2, v4, 1)

2.2 Graph Processing Paradigms

In this section, we introduce two widely used paradigms for designing graph processing engines and accelerators, namely the VCP (Section 2.2.1) and the ECP (Section 2.2.2). Both paradigms are flexible in supporting a variety of graph algorithms with various vertex attributes and graph update functions.

2.2.1 Vertex-centric Paradigm

The VCP expresses graph algorithms by "thinking like a vertex" [30]. The computation is expressed at the level of a single vertex. The processing is structured in a number of iterations and terminates either after a specified number of iterations or after the attributes of all vertices have converged. Algorithm 1 shows the general computation template of the VCP. Each iteration comprises a scatter phase followed by a gather phase. In the scatter phase, the vertices whose attributes were updated in the previous iteration send updates to their neighbors through outgoing edges. These vertices are called the active vertices of the current iteration. In the gather phase, the vertices that receive update(s) apply the update(s) to their attributes and become the active vertices of the next iteration. The scatter-gather execution model is synchronous, so all updates from the scatter phase become visible to the vertices only after the scatter phase has completed.

The VCP stores the input graph using the adjacency list or CSR representation. Therefore, random memory accesses through pointers or row-offsets are required to traverse the outgoing edges of the active vertices. Unfortunately, such random memory accesses are highly irregular, and conventional prefetching and caching strategies cannot handle them efficiently. In this scenario, numerous pipeline stalls occur because of the long memory access latency, resulting in significant performance deterioration [31, 32].
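To make this access pattern concrete, the following C++ sketch (ours, not from the thesis; the names CSR, row_offset, col_index, attr, and update are illustrative) renders one vertex-centric scatter pass over a CSR-stored graph, complementing the template in Algorithm 1 below. The destination of each produced update is data dependent, so the write to update[dst] is exactly the kind of irregular access that prefetching and caching handle poorly.

    #include <cstdint>
    #include <vector>

    struct CSR {
        std::vector<uint32_t> row_offset;  // size |V| + 1
        std::vector<uint32_t> col_index;   // size |E|
        std::vector<float>    weight;      // size |E|
    };

    // One scatter pass of the vertex-centric paradigm (cf. Algorithm 1).
    void vcp_scatter(const CSR& g,
                     const std::vector<bool>& active,
                     const std::vector<float>& attr,
                     std::vector<float>& update) {
        for (uint32_t v = 0; v < active.size(); ++v) {
            if (!active[v]) continue;  // only vertices updated in the last iteration scatter
            for (uint32_t e = g.row_offset[v]; e < g.row_offset[v + 1]; ++e) {
                uint32_t dst = g.col_index[e];
                // The destination index is data dependent, so this write lands at an
                // essentially random location in 'update' -- the irregular access
                // pattern that defeats caching and prefetching on GPPs.
                update[dst] += attr[v] * g.weight[e];  // SpMV-style update as an example
            }
        }
    }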
Algorithm 1 Vertex-centric paradigm
 1: while not done do
 2:   Scatter Phase:
 3:   for each vertex v do
 4:     if v has an updated attribute then
 5:       for each outgoing edge e of v do
 6:         Produce an update u based on the weight of e and the attribute of v
 7:         Send u to vertex v_e.dest
 8:       end for
 9:     end if
10:   end for
11:   Gather Phase:
12:   for each vertex v do
13:     if v receives update(s) from neighbor(s) then
14:       Update its attribute based on the update(s)
15:     end if
16:   end for
17: end while

2.2.2 Edge-centric Paradigm

The ECP expresses the computation of graph algorithms at the level of each edge and update [33]. The processing is also iterative, with each iteration consisting of a scatter phase followed by a gather phase. Algorithm 2 shows the general computation template of the ECP. In the scatter phase, all edges are sequentially traversed, and an update is produced for each edge. In the gather phase, all the produced updates are sequentially processed. Each update is examined to determine whether the update condition for its destination vertex is met; if so, the attribute of the destination vertex is updated based on the update.

Algorithm 2 Edge-centric paradigm
 1: while not done do
 2:   Scatter Phase:
 3:   for each edge e do
 4:     Produce an update u based on the weight of e and the attribute of vertex v_e.src
 5:     u.dest = e.dest
 6:   end for
 7:   Gather Phase:
 8:   for each update u do
 9:     if the update condition for vertex v_u.dest is met then
10:       Update the attribute of v_u.dest based on u
11:     end if
12:   end for
13: end while

The ECP stores the input graph using the COO representation. The advantage is that all edges are accessed from the external memory in a streaming fashion without any random accesses. In addition, the COO representation does not require the edges to be sorted, thus enabling data layout optimization. However, the downside of the ECP is that it requires traversing all edges in each iteration, resulting in redundant edge traversals for graph algorithms in which not all vertices are active in each iteration. Such algorithms are defined as non-stationary graph algorithms [34]. When there are very few active vertices in an iteration, the ECP can lead to numerous redundant edge traversals.

2.3 Platforms

2.3.1 FPGA

FPGAs are semiconductor devices that can be reprogrammed by a designer to any desired application after manufacturing. The high-level architecture of an FPGA device is depicted in Figure 2.6; it consists of configurable logic blocks (CLBs), programmable interconnects, digital signal processing (DSP) slices, on-chip block RAMs (BRAMs), and input/output (I/O) blocks. A CLB is constructed from logic slices, each of which contains a set of lookup tables (LUTs), flip-flops, and multiplexers. A LUT is a collection of logic gates that are hard-wired on the FPGA. Distinct logic blocks are connected through the routing interconnect of the FPGA (e.g., wires and programmable switches). DSP slices are used to implement signal processing functions such as the multiply-accumulate operation. Two types of on-chip RAM are available on an FPGA device: distributed RAM and BRAM. A distributed RAM is constructed from LUTs and distributed across the FPGA device. It is fast, localized, and ideal for small data buffers or register files. A BRAM is a dedicated dual-port memory that is concentrated into columns on the FPGA device. It serves as a relatively larger memory structure than the distributed RAMs.
Both distributed RAM and BRAM provide high-bandwidth, low-latency accesses with a configurable word width to the logic blocks. If more memory resources are required, the I/O blocks can be used to build high-speed interfaces to external memory such as dynamic random-access memory (DRAM).

FPGAs started out as a prototyping platform, allowing the convenient and cost-effective development of glue logic connecting discrete ASIC components. As the gate density of FPGAs increased, FPGA applications shifted from glue logic to a wide variety of high-performance and data-intensive problems, where FPGA devices are deployed in the field as final but still flexible solutions. Compared with an ASIC, which is custom manufactured for specific design tasks, an FPGA can be reconfigured, both statically and dynamically, by simply changing the state of the memory bits in the on-chip RAM. Such reconfigurability is quite useful in design cases where software-like data-dependent processing with ASIC-level high performance is required [22, 23, 24]. As a result, FPGAs are an ideal fit for many different applications, such as aerospace, data centers, video and image processing, wired communications, and wireless communications.

Figure 2.6: Internal organization of FPGA (logic blocks, interconnect, DSP slices, block RAMs, and I/O blocks)

2.3.2 DRAM

DRAM is widely used as the external memory for FPGA-based platforms [35]. State-of-the-art DRAM (e.g., DDR4 SDRAM) provides high peak bandwidth, but the performance highly depends on the access pattern [36]. For graph analytics, the sustained bandwidth is usually much lower than the peak bandwidth, making DRAM performance the main bottleneck [7]. As depicted in Figure 2.7, a DRAM chip is organized into banks. Each bank consists of a 2D matrix of locations. At each addressable location ([bank, row, column]), a fixed number of data bits are stored. DRAM accesses are pipelined and scheduled subject to the availability of shared resources such as sense amplifiers and data buses. To improve DRAM performance, the DRAM controller may reorder the execution of memory accesses [37].

Figure 2.7: DRAM organization (banks, rows, columns, row buffer, row/column decoders, and address/data buses)

A row of a bank must first be activated in order to enable accesses. Once a row is activated, a row buffer holds the entire row for subsequent accesses to the same row. An access to an activated row is defined as a row-hit. When there is an access to a different row (i.e., other than the activated row), the activated row must be closed by a row pre-charge command and the target row (i.e., the row in which the data reside) must be activated. This results in extra access latency and power consumption. Such an access is defined as a row-conflict. There are two common DRAM access patterns: sequential and random. For the sequential access pattern, consecutive memory accesses map to the same row of DRAM, resulting in high sustained bandwidth. For the random access pattern, consecutive memory accesses are highly likely to map to different rows of DRAM, resulting in considerable row-conflicts; in this scenario, the sustained memory bandwidth deteriorates significantly.

2.3.3 CPU-FPGA Heterogeneous Architectures

Heterogeneous architectures integrating a general-purpose CPU and an FPGA accelerator are appealing for high-performance computing at low cost [25, 32, 38, 39].
Because processing units optimized for fast sequential processing (i.e., the CPU) and processing units optimized for massive parallelism (i.e., the FPGA) coexist, such architectures can efficiently cope with workloads that require variable amounts of parallelism across the execution. For example, the workloads that are computation intensive and parallelizable are accelerated by the FPGA, while the rest of the workload is handled by the CPU.

FPGA vendors have introduced system-on-chip (SoC) products, which integrate ARM cores and FPGA fabric on the same die [40, 41]. The SoC products target the embedded market in terms of the cores, fabric capacity, and memory interfaces. Convey HC-1 [42] is an early server product that integrates a multi-core CPU and an FPGA; the data movement between the CPU and FPGA is realized through peripheral component interconnect express (PCIe) [43]. More recently, Intel and IBM have respectively announced initial server products that integrate the CPU and FPGA using cache-coherent interconnect technologies [44, 45, 46]. This makes the FPGA a peer to the CPU from a memory access standpoint, leading to more efficient CPU-FPGA co-processing.

Figure 2.8 depicts a shared-memory CPU-FPGA heterogeneous architecture, in which the accelerator function unit on the FPGA can directly read data from and write data to the last level cache of the CPU through a cache-coherent interconnect (e.g., Intel QuickPath Interconnect [44]). Compared with conventional interconnect technologies (e.g., PCIe [43]), the cache-coherent interconnect eliminates the need to move data back and forth between the CPU and FPGA. This enables the offloading of specific workloads to the FPGA for acceleration in a fine-grained fashion.

Figure 2.8: Shared-memory CPU-FPGA heterogeneous architecture

2.3.4 Comparison between FPGA, CPU, and GPU

FPGA, multi-core CPU, and GPU are the three major parallel computing platforms. Table 2.1 provides a high-level comparison between the three. FPGA exploits fine-grained parallelism by building deep pipelines, whereas CPU and GPU exploit coarse-grained thread/core-level parallelism. In addition, the threads running on a GPU are executed in a single-instruction-multiple-data fashion. From the frequency perspective, FPGA operates at a lower frequency than CPU and GPU; however, the pipelines built on FPGA can achieve a higher number of operations per cycle because of a larger pipeline depth [47]. Modern FPGAs have a large amount of on-chip RAM that is fully user controllable (i.e., users decide which data are stored in or evicted from the on-chip RAMs), whereas CPU and GPU do not offer such fine-grained cache control. Moreover, for algorithms that exhibit irregular memory access patterns, it is suggested to disable the cache of the GPU because of its small capacity [48]. Lastly, FPGA is more power efficient than CPU and GPU; it can achieve higher performance (i.e., FLOPS) with a lower power consumption [49, 50].

Table 2.1: Comparison between FPGA, CPU, and GPU

  Platform           | FPGA (Intel Stratix 10 GX 5500) | Multi-core CPU (Intel Core i9-7980XE)  | GPU (Nvidia GeForce GTX 1080)
  Parallelism        | Fine grained, pipelined         | Core level, thread level               | Core level, thread level
  # of cores         | Configurable                    | 18                                     | 2560
  Frequency          | 100-800 MHz                     | 2.6-4.2 GHz                            | 1.6-1.7 GHz
  On-chip memory     | 20.75 MB block RAMs             | L1: 1.125 MB, L2: 18 MB, L3: 24.75 MB  | L1: 1 MB, L2: 8 MB
  External memory    | DDR4: 20 GB/s, HBM2: 256 GB/s   | DDR4: 20-80 GB/s                       | GDDR5X: 320 GB/s
  Technology         | 14 nm                           | 14 nm                                  | 16 nm
  # of transistors   | 17 billion                      | 7 billion                              | 7.2 billion
  Power              | 10-40 Watt                      | 165 Watt                               | 180 Watt
  Peak floating-point performance | 10 TFLOPS         | 1 TFLOPS                               | 8.9 TFLOPS

2.4 Related Work

2.4.1 Multi-core-based Graph Processing Frameworks

Pregel [51] is the first graph processing framework. It is built based on the VCP and uses the MapReduce programming model to run graph algorithms on a cluster with hundreds of multi-core processors. However, it is very challenging to deal with load imbalance, synchronization, and fault tolerance in a distributed computing environment for graph processing [34, 52]. As a result, developing a graph processing framework on a single multi-core processor has become increasingly popular [33, 53, 54, 55, 56, 57, 58].

GraphChi is the first graph processing framework developed based on a single multi-core processor [53]. It stores all the graph data in a solid-state drive (SSD) and develops a parallel sliding window method to reduce the amount of random accesses to the SSD. GraphChi achieves comparable performance to distributed systems (i.e., GraphLab [59] and PowerGraph [60]) with a small fraction of the computing resources. X-Stream [33] is designed based on the ECP. It proposes a streaming partition approach to maximize sequential accesses to the graph data stored on disk. GraphMat [54] maps vertex-centric computations to high-performance sparse matrix operations. GridGraph [55] develops a fine-grained partitioning approach to break graphs into 1D-partitioned vertex chunks and 2D-partitioned edge blocks; the objective is to enable on-the-fly vertex updates and thus reduce the I/O amount. CLIP [56] exploits beyond-neighborhood accesses, which allow an edge to update a vertex that does not belong to the neighborhood of the edge, to reduce the number of iterations for algorithm convergence.

2.4.2 GPU-based Graph Processing Frameworks

GPU platforms have been widely explored to accelerate graph analytics. Representative GPU-based graph processing frameworks include CuSha [61], nvGraph [62], Medusa [63], Gunrock [64], and Graphie [65]. CuSha [61] focuses on addressing the limitations of uncoalesced global memory accesses for GPU-based graph processing. nvGraph [62] is a graph analytics library developed by NVIDIA based on the Compute Unified Device Architecture (CUDA). Medusa [63] provides six programming APIs, which allow developers to define their own data structures and graph kernels. Gunrock [64] proposes a data-centric processing abstraction that leverages the GPU to accelerate the frontier computations (i.e., the computations of active vertices). Graphie [65] implements the asynchronous graph traversal model on the GPU to reduce data communication.

2.4.3 ASIC-based Graph Processing Frameworks

Graphicionado [4] is a customized graph analytics accelerator designed for high throughput and energy efficiency. It uses a specialized memory system that consists of off-chip DRAM and on-chip embedded DRAM. The architecture in [5] features hardware scheduling and dependence tracking to implement the asynchronous execution model [66]. GraphP [67], GraphPIM [68], and Tesseract [69] are processing-in-memory-based ASIC designs, which reduce data communication by integrating processing units within the memory.
2.4.4 FPGA-based Graph Processing Accelerators

2.4.4.1 Algorithm-specific Accelerators

Using FPGAs to accelerate graph analytics has achieved great success. However, many existing FPGA accelerators are algorithm specific and thus not applicable to general graph algorithms. In [70], an FPGA-based architecture for accelerating the PageRank algorithm is presented. The accelerator is implemented on a Virtex-5 FPGA and achieves 2.5× speedup compared with a multi-core implementation running on a dual-core Xeon processor. In [71, 72, 73], BFS is accelerated based on FPGA-HMC platforms. The designs achieve a high throughput of up to 45.8 billion traversed edges per second (GTEPS) and a power efficiency of up to 1.85 GTEPS/Watt for scale-free graphs. In [74], G. Lei et al. accelerate the Dijkstra algorithm for SSSP using a Virtex-7 FPGA. Compared with a CPU implementation running on the AMD Opteron 6376 processor, the FPGA accelerator achieves up to 5× speedup. In [75], an FPGA accelerator for SpMV is proposed based on a specialized compressed interleaved sparse row encoding approach. The design achieves one third of the throughput of a GTX 580 GPU implementation while consuming 7× less energy. In [76], B. Betkaoui et al. accelerate the all-pairs shortest paths algorithm using a CPU-FPGA heterogeneous platform (i.e., Convey HC-1); the design achieves 10× speedup over a quad-core CPU implementation and 5× speedup over an AMD Cypress GPU implementation.

2.4.4.2 Graph Processing Frameworks

GraphStep [10] is the first FPGA-based graph processing framework. It targets small graphs whose data (i.e., vertices and edges) can fit in the on-chip memory of one or multiple FPGAs. In [31], B. Betkaoui et al. develop a reconfigurable hardware architecture for large-scale graph problems on the Convey HC-1 platform. They observe that the performance is memory bound because of the large amount of random accesses to the memory. GraphGen [8] is an FPGA framework based on the VCP. To improve memory performance, it partitions the input graph into subgraphs such that the vertices and edges of each subgraph can fit in the on-chip memory of the FPGA. The subgraphs are sequentially processed by the FPGA one at a time. GraphGen also provides a compiler for automatic HDL code generation. GraphOps [7] is a dataflow library for vertex-centric graph processing. It provides several commonly used building blocks, such as reading the attributes of all the neighbors of a vertex. The target platform of GraphOps is a CPU-FPGA heterogeneous architecture, in which the FPGA is used to accelerate edge traversals and the CPU is responsible for updating vertex attributes. ForeGraph [9] is a multi-FPGA-based graph processing framework. It uses the 2-D graph partitioning technique in [57] to partition the graph into edge blocks, and it utilizes multiple FPGAs to process distinct edge blocks in parallel.

2.5 Graph Algorithms in This Thesis

In this thesis, we accelerate a non-exhaustive list of representative graph algorithms that have different memory and computation requirements, including Breadth-First Search (BFS), Single-Source Shortest Path (SSSP), Weakly Connected Components (WCC), Sparse Matrix-Vector Multiplication (SpMV), PageRank (PR), and Matrix Factorization (MF).

• Breadth-First Search: BFS is a classic graph traversal algorithm. The objective is to build a BFS tree for an unweighted graph given a root vertex. Here, each vertex maintains an attribute to record its level in the BFS tree.
The algorithm starts at the root vertex and explores all the neighbors of the root vertex to construct the first level of the BFS tree. Then, all the unexplored vertices that have an edge to any vertex at the first level are explored and added to the second level. The algorithm continues exploring vertices in this way until no additional vertices can be explored.

• Single-Source Shortest Path: SSSP aims to find the shortest paths from a single source vertex to all the other vertices in a weighted graph. In this algorithm, each vertex maintains an attribute to record the weight of the shortest path from the source vertex to itself. In each iteration, all the active vertices send their updated attributes to their neighbors through outgoing edges; then, each vertex that receives update(s) from neighbor(s) updates its attribute if a shorter path is found. The algorithm terminates when no vertex updates its attribute in an iteration.

• Weakly Connected Components: A connected component (CC) of a graph is a subgraph such that (1) there is a path connecting each pair of vertices in the subgraph and (2) no additional vertices can be reached from the vertices in the subgraph. WCC aims to find all the CCs in an undirected graph. In this algorithm, each vertex maintains an attribute called the CC-identifier to record the CC that it belongs to. The CC-identifier of a CC is the smallest vertex index in the CC. In each iteration, all the active vertices send their CC-identifiers to their neighbors; if a vertex receives a smaller CC-identifier than its current CC-identifier, it updates its attribute. When the algorithm terminates, the vertices that have the same attribute (i.e., CC-identifier) form a CC.

• Sparse Matrix-Vector Multiplication: SpMV iteratively multiplies the sparse adjacency matrix of a directed graph with a vector of values, one per vertex. In this algorithm, each vertex maintains a numerical value as its attribute. In each iteration, vertex v_i updates its attribute based on Equation (2.1), in which $\oplus$ and $\otimes$ are algorithm-specific operators (e.g., the standard addition and multiplication operators), $w_{ij}$ denotes the weight of the edge from v_i to v_j ($w_{ij} = 0$ if there is no such edge), and |V| is the total number of vertices in the graph.

  $\mathrm{attr}(v_i) = \bigoplus_{j=0}^{|V|-1} \left( \mathrm{attr}(v_j) \otimes w_{ij} \right)$    (2.1)

• PageRank: PR is used to rank the importance of vertices in a directed graph. In this algorithm, each vertex maintains an attribute called PageRank, which indicates the likelihood that the vertex will be reached. In each iteration, each vertex v updates its own PageRank based on Equation (2.2), in which d is a constant called the damping factor, |V| is the total number of vertices in the graph, $v_{nbr}$ represents a neighbor of v such that v has an incoming edge from $v_{nbr}$, and $L_{nbr}$ is the number of outgoing edges of $v_{nbr}$.

  $\mathrm{PageRank}(v) = \frac{1-d}{|V|} + d \times \sum_{v_{nbr}} \frac{\mathrm{PageRank}(v_{nbr})}{L_{nbr}}$    (2.2)

• Matrix Factorization: MF is an advanced graph analytics algorithm for machine learning applications. It is used to predict unknown data (e.g., a customer's rating of a product) by analyzing a collection of observations represented as a weighted bipartite graph. Here, the attribute of each vertex is a vector of elements (i.e., a feature vector) rather than a single value. As a result, unlike the other graph analytics algorithms discussed earlier in this section, MF has a high computation-to-communication ratio and is compute-bound.
This algorithm first randomly initializes the feature vectors of all the vertices and then iteratively updates the feature vectors using the SGD technique. Taking an observation represented as an edge, the algorithm computes the prediction error and updates the feature vectors of the two endpoints of the edge in the direction opposite to the gradient. The training process terminates when the overall squared prediction error converges, and the feature vectors of all the vertices are then returned.

Chapter 3
High-throughput Graph Processing Framework on FPGA

In this chapter, we present our graph processing framework to accelerate general graph algorithms following the ECP. We propose novel optimizations to improve the performance. We also develop a design automation tool that can automatically generate the register-transfer level (RTL) design based on user inputs.

3.1 Framework Overview

Our framework is based on the system architecture depicted in Figure 3.1, which consists of the external memory (i.e., DRAM) and the FPGA. The external memory stores all the graph data, including vertices, edges, and updates. On the FPGA, there are p (p ≥ 1) processing engines (PEs) working in parallel to sustain a high processing throughput. Each PE is customized based on the target graph algorithm and has a multi-pipelined architecture. The memory controller handles the external memory accesses performed by the PEs. In the scatter phase, the PEs read edges from the external memory and write updates into the external memory; in the gather phase, the PEs read updates from the external memory and write updated vertices into the external memory.

[Figure 3.1: The target system architecture of our framework]

Our framework accelerates graph algorithms based on the ECP. Table 3.1 shows the mapping of some example graph algorithms to the ECP. Given an edge-centric graph algorithm, our framework maps it to the target architecture and generates the Verilog code of the FPGA accelerator. The development flow of our framework is as follows: First, the user provides the inputs that define the algorithm parameters of the edge-centric graph algorithm (e.g., the data width of each edge and vertex) and specify the hardware resource constraints of the FPGA design (e.g., the number of BRAMs that can be used). Second, based on the inputs, the framework performs design space exploration to determine the optimal architecture parameters (e.g., the number of PEs) to maximize the processing throughput. Third, a design automation tool outputs the Verilog code of the FPGA accelerator.

Table 3.1: Mapping of graph algorithms to the ECP

Produce update u based on edge e and vertex v_{e.src} (u ← Process_edge(e, v_{e.src}); Input: e, v_{e.src}; Output: u):
  SpMV: u.dest ← e.dest; u.value ← e.weight × attr(v_{e.src})
  PR:   u.dest ← e.dest; u.value ← d × attr(v_{e.src}) / Num_of_outgoing_edges(v_{e.src})
  SSSP: u.dest ← e.dest; u.value ← attr(v_{e.src}) + e.weight
  WCC:  u.dest ← e.dest; u.value ← attr(v_{e.src})

Update vertex v_{u.dest} based on update u (v_{u.dest} ← Apply_update(u, v_{u.dest}); Input: u, v_{u.dest}; Output: v_{u.dest}):
  SpMV: attr(v_{u.dest}) ← attr(v_{u.dest}) + u.value
  PR:   attr(v_{u.dest}) ← attr(v_{u.dest}) + u.value
  SSSP: attr(v_{u.dest}) ← min(attr(v_{u.dest}), u.value)
  WCC:  attr(v_{u.dest}) ← min(attr(v_{u.dest}), u.value)

3.2 Data Structures

Our framework uses the COO representation (Section 2.1.4) to store the input graph.
In addition to the edge array, all the vertices are stored in a vertex array, with each vertex maintaining an algorithm-specific attribute. Each update pro- duced in the scatter phase is represented as a <dest, value> pair, in which dest denotes the destination vertex of the update and value denotes the value associ- ated with the update. Figure 3.2 shows the data structures of an example graph. 3.3 Optimizations In order to improve the performance of the generated accelerators by our frame- work, we propose several optimizations to (1) efficiently use the on-chip memory resources, (2) fully take advantage of the massive parallelism provided by the FPGA, (3) optimize the performance of the external memory, and (4) reduce the data communication between the FPGA and external memory. 33 𝑣 𝑖𝑑 attr 𝑣 0 0.7 𝑣 1 1.0 𝑣 2 3.0 𝑣 3 4.5 𝑣 4 1.0 𝑣 5 2.0 Vertex array src dest weight 𝑣 0 𝑣 1 2.0 𝑣 1 𝑣 2 3.0 𝑣 3 𝑣 2 1.0 𝑣 3 𝑣 4 0.2 𝑣 4 𝑣 5 0.4 𝑣 5 𝑣 2 3.0 Edge array Updates dest value 𝑣 1 1.4 𝑣 2 3.0 𝑣 2 4.5 𝑣 4 0.9 𝑣 5 0.4 𝑣 2 6.0 𝑣 0 𝑣 4 𝑣 1 𝑣 2 𝑣 3 𝑣 5 Figure 3.2: Example graph and its associated data structures, with the assumption that the value of each update is the product of the edge weight and the attribute of the source vertex of the edge 3.3.1 Graph Partitioning and Vertex Buffering As the attributes of vertices are repeatedly accessed and updated in each iteration, we propose to buffer them in the on-chip RAMs of the FPGA, which can offer fine- grained low-latency random accesses to the PEs. For large graphs whose entire vertex array does not fit in the on-chip RAMs, we partition the graph to guarantee that the vertex data of each partition fit in the on-chip RAMs. Assuming the data (i.e., attribute) of m vertices can be stored in the on-chip RAMs of the FPGA, we partition the input graph into k =d |V| m e non-overlapping 34 partitions. We first partition the vertex array into k vertex sub-arrays, such that the i th vertex sub-array contains m vertices whose vertex indexes are consecutive and betweeni×m and (i+1)×m−1 (0≤i<k). We define each vertex sub-array as an interval. After the vertex array is partitioned, the edge array is partitioned intok edge sub-arrays, each of which is defined as a shard; thei th shard contains all the edges whose source vertices belong to the i th interval (i.e., ∀ edge e ∈ Shard i , v e.src ∈ Interval i ). The i th shard and the i th interval constitute the i th partition. Each partition also maintains an array called bin to store the updates whose destination vertices belong to the interval of partition (i.e.,∀ update u∈ Bin i ,v u.dest ∈ Interval i ). During the processing, the data of each shard (i.e., edges) remain fixed 1 ; the data of each bin (i.e., updates) are recomputed in each scatter phase; the data of each interval (i.e., vertices) are updated in each gather phase. Figure 3.3 shows the data layout after the graph in Figure 3.2 is partitioned into two partitions (i.e., k = 2) with each partition having three vertices (i.e., m = 3). Note that the size of each shard depends on the number of edges whose source vertices are in the corresponding interval; the size of each bin depends on the number of edges whose destination vertices are in the corresponding interval. Algorithm 3 illustrates the computation of the ECP after the input graph is partitioned. All the intervals, shards, and bins are stored in the external memory. Before a partition is processed, all the data of its interval are pre-fetched and buffered into the on-chip RAMs (Lines 3 and 10). 
Then, edges (updates) are sequentially read from the external memory during the scatter (gather) phase (Lines 4 and 11). Because of the vertex buffering, when processing edges and updates (Lines 5 and 12), the PEs on the FPGA can access the vertex data directly from the on-chip RAMs, rather than from the external memory. 1 We assume the edges of the input graph do not alter. 35 𝑣 𝑖𝑑 attr 𝑣 0 0.7 𝑣 1 1.0 𝑣 2 3.0 Interval 0 𝑣 𝑖𝑑 attr 𝑣 3 4.5 𝑣 4 1.0 𝑣 5 2.0 Interval 1 src dest weight 𝑣 0 𝑣 1 2.0 𝑣 1 𝑣 2 3.0 src dest weight 𝑣 3 𝑣 2 1.0 𝑣 3 𝑣 4 0.2 𝑣 4 𝑣 5 1.0 𝑣 5 𝑣 2 0.2 Shard 0 Shard 1 dest value 𝑣 1 1.4 𝑣 2 3.0 𝑣 2 4.5 𝑣 2 6.0 dest value 𝑣 4 0.9 𝑣 5 0.4 Bin 0 Bin 1 Partition 0 Partition 1 Figure 3.3: Data layout after the graph in Figure 3.2 is partitioned Algorithm 3 Edge-centric graph processing based on graph partitioning 1: while not done do 2: for i from 0 to k− 1 do 3: Store Interval i in on-chip RAMs //vertex buffering 4: for each edge e∈ Shard i do 5: u← Process_edge(e, v e.src ) 6: Write u into Bin bu.dest/mc 7: end for 8: end for 9: for i from 0 to k− 1 do 10: Store Interval i in on-chip RAMs //vertex buffering 11: for each update u∈ Bin i do 12: v u.dest ← Apply_update(u, v u.dest ) 13: end for 14: Write Interval i into external memory 15: end for 16: end while 36 3.3.2 Partition Skipping One key issue of the ECP is that it requires traversing all the edges of the graph in each scatter phase. This results in numerous redundant edge traversals for non- stationary graph algorithms, in which traversing the edges of non-active vertices in an iteration is unnecessary. In order to address this issue, we propose a partition skipping scheme in order to reduce redundant edge traversals for non-stationary graph algorithms. We define an active partition (in an iteration) as a partition that has at least one active vertex in its interval. For each partition, we maintain a 1-bit status flag to indicate if the partition is active or not. In the scatter phase, we check the status flag of each partition. If a partition is active, the edges in its shard are traversed; otherwise, this partition is directly skipped. In the gather phase, when the attribute of a vertex is updated, the corresponding partition that this vertex belongs to will be marked as active for the next iteration. 3.3.3 Parallelizing Edge-centric Graph Processing To fully utilize the massive parallelism offered by the FPGA, we parallelize the execution of Algorithm 3 using two levels of parallelism: inter-partition and intra- partition parallelism. 3.3.3.1 Inter-partition Parallelism As our framework uses p (p≥ 1) PEs on the FPGA, up to p partitions can be processed by the PEs in parallel. We define the parallelism to process distinct partitions by distinct PEs in parallel as inter-partition parallelism and denote it as p. We usethecentralizedloadbalancing scheme in[77]to allocatethecomputation tasks of partitions to the PEs. When a PE completes the processing of a partition, it is automatically assigned another partition to process. 37 3.3.3.2 Intra-partition Parallelism Inside each PE, we useq (q≥ 1) parallel processing pipelines (Chapter 3.4). In the scatter (gather) phase, theseq processing pipelines concurrently processq distinct edges (updates) of the same shard (bin) in a pipelined fashion. We define the parallelism to concurrently process distinct edges or updates inside each PE as intra-partition parallelism and denote it as q. 
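To make the partitioned edge-centric flow of Algorithm 3 concrete, the following Python sketch emulates the processing at the software level: it splits the vertex array into intervals, groups edges into shards by source interval, and runs the scatter and gather phases with the partition-skipping flags described above. The process-edge and apply-update steps follow the SSSP mapping of Table 3.1; the function and variable names are illustrative and not part of the actual RTL framework.

```python
import math

def partition(num_vertices, edges, m):
    """Split vertices into intervals of size m and edges into shards by source interval."""
    k = math.ceil(num_vertices / m)
    shards = [[] for _ in range(k)]
    for src, dest, weight in edges:
        shards[src // m].append((src, dest, weight))
    return k, shards

def sssp_edge_centric(num_vertices, edges, root, m=4):
    INF = float("inf")
    attr = [INF] * num_vertices          # shortest-path estimate per vertex
    attr[root] = 0
    k, shards = partition(num_vertices, edges, m)
    active = [False] * num_vertices
    active[root] = True
    partition_active = [i == root // m for i in range(k)]

    while any(partition_active):
        bins = [[] for _ in range(k)]    # one bin of updates per partition
        # Scatter phase: traverse shards of active partitions only (partition skipping).
        for i in range(k):
            if not partition_active[i]:
                continue
            for src, dest, weight in shards[i]:
                if active[src]:          # only active vertices produce useful updates
                    # Process_edge for SSSP (Table 3.1): value = attr(src) + weight
                    bins[dest // m].append((dest, attr[src] + weight))
        # Gather phase: apply updates, mark partitions with updated vertices as active.
        active = [False] * num_vertices
        partition_active = [False] * k
        for i in range(k):
            for dest, value in bins[i]:
                # Apply_update for SSSP (Table 3.1): min-combine
                if value < attr[dest]:
                    attr[dest] = value
                    active[dest] = True
                    partition_active[i] = True
    return attr

# Example: a small weighted digraph given as (src, dest, weight) triples.
edges = [(0, 1, 2.0), (1, 2, 3.0), (3, 2, 1.0), (3, 4, 0.2), (4, 5, 0.4), (5, 2, 3.0)]
print(sssp_edge_centric(6, edges, root=0))
```

In the hardware design, the per-partition vertex arrays are the intervals buffered in on-chip RAM, while the shards and bins remain in DRAM and are streamed sequentially, exactly as in the loop structure above.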
3.3.4 Data Layout Optimization Let r 0 ,r 1 ,··· ,r h−1 ,r h ,r h+1 ,··· , denote a sequence of external memory accesses. Wedefineamemoryaccessr h asasequential memory access ifthememorylocation accessed by r h is contiguous to the memory location accessed by r h−1 ; otherwise, r h is defined as a non-sequential memory access. Note that non-sequential memory accesses can result in additional access latency as well as additional power con- sumption [33, 36, 78]. Therefore, it is desirable to optimize data layout to reduce the number of non-sequential memory accesses. In Algorithm 3, reading vertices (Lines 3 and 10), edges (Line 4), and updates (Line 11) from the external memory and writing vertices (Line 14) into the exter- nal memory result in sequential memory accesses. However, writing updates into the external memory (Line 6) results in non-sequential memory accesses. This is because the produced updates need to be written into the bins based on their des- tination vertices. It is likely that the destination vertices of consecutively produced updates are in distinct intervals. In this scenario, these updates are written into distinct bins stored in discontinuous external memory locations, thus resulting in non-sequential memory accesses. In the worst case, writing each update can result in a non-sequential external memory access. Figure 3.4 shows the scenario in which writing each update results in a non-sequential external memory access. Therefore, 38 processing all the edges results inO(|E|) non-sequential external memory accesses in the scatter phase. Traversing order Non-sequential memory access Bin 0 (for 𝑣 0 ~𝑣 99 ) Write updates based on dest Bin 1 (for 𝑣 100 ~𝑣 199 ) Bin 2 (for 𝑣 200 ~𝑣 299 ) Shard 0 src dest 0 10 0 101 1 201 1 105 98 20 99 201 99 105 External memory Figure 3.4: Non-sequential external memory accesses because of writing updates into the external memory in the scatter phase To minimize the number of non-sequential external memory accesses because of writing updates in the scatter phase, we propose an optimized data layout by sorting the edges in each shard based on their destination vertices. Theorem 1. In the scatter phase, based on our optimized data layout, the number of non-sequential external memory accesses due to writing updates isO(k 2 ), where k is the total number of partitions. Proof. The destination vertices of the updates are the same as those of the tra- versed edges. As we have sorted each shard based on the destination vertices, the updates are also produced in a sorted order. Therefore, the updates whose destination vertices belong to the same interval are produced consecutively and are written into the same bin. Non-sequential memory access only occurs when an update belonging to a different bin (i.e., other than the bin that the previous update is written into) is produced. Therefore, writing the updates produced by 39 traversing one shard results inO(k) non-sequential memory accesses. As the scat- ter phase traverses up to k shards, the total number of non-sequential external memory accesses is O(k 2 ). 3.3.5 Data Communication Reduction As traversing each edge will produce an update, the total number of updates pro- duced in the scatter phase is equal to the number of edges,|E|. Therefore,|E| updates are written into the external memory in the scatter phase and read from the external memory in the following gather phase. This results in|E| updates transferred back and forth between the FPGA and external memory in each iter- ation. 
To reduce the data communication, we propose two optimizations: update combining and update filtering. 3.3.5.1 Update Combining In the scatter phase, we propose combining together the updates that have the same destination vertex. For example, for PR, combining multiple updates can be performed by summing them up. Note that the update combining scheme is enabled by our data layout optimization. As the proposed data layout sorts each shard based on the destination vertices, in the scatter phase, the updates that have thesamedestinationvertexareproducedconsecutively. Thus, consecutiveupdates that have the same destination vertex can be easily combined as one update and thenwrittenintotheexternalmemory. Notethatasthenumberofupdateswritten into the external memory in the scatter phase is reduced, the number of updates to be processed in the following gather phase is reduced as well. 40 3.3.5.2 Update Filtering Wefurtherproposeanupdatefilteringschemefornon-stationarygraphalgorithms. Here, each vertex maintains an additional active_tag to indicate whether the vertex is active in the current iteration. In the scatter phase, for each produced update, we check the active_tag of the source vertex that the update produced is basedon. Ifanupdateisproducedbasedonanactivevertex, itismarkedasavalid update; otherwise, it is invalid. All the invalid updates are discarded and will not be written into the external memory, thus reducing the data communication. Note that update filtering optimization is not applicable to stationary graph algorithms in which all the vertices are active in each iteration. 3.4 Accelerator Design 3.4.1 Overall Architecture We show the overall architecture of our accelerator in Figure 3.5. The DRAM connected to the FPGA is the external memory to store all the intervals, shards, and bins. There are p PEs on the FPGA, and these PEs process p distinct parti- tions in parallel. Each PE has an individual interval buffer and multiple processing pipelines. The interval buffer is constructed by on-chip UltraRAMs and used to buffer the interval data of the partition being processed by the PE. The processing pipelines of each PE concurrently process distinct edges (updates) of the same shard (bin) in a pipelined fashion in the scatter (gather) phase. The memory controller handles the external memory accesses made by the PEs. In the scatter phase, the PEs read edges from the DRAM and write updates into the DRAM. 41 In the gather phase, the PEs read updates from the DRAM and write updated vertices into the DRAM. Partition 𝑘 −1 … … FPGA … … Vertex Edge Update Memory Controller Partition 0 Interval Buffer Pipeline Pipeline PE 0 … Interval Buffer Pipeline Pipeline PE 𝑝 −1 … Control Logic DRAM Figure 3.5: Overall architecture of the accelerator 3.4.2 Processing Engine Figure3.6depictsthearchitectureofeachPE.Asshown, eachPEusesq processing pipelines (q≥ 1); therefore, it can concurrently process q input data in each clock cycle. In the scatter phase, the input data represent edges. In each clock cycle, each processing pipeline takes one edge as input. Then, the vertex-read module reads the attribute of the source vertex of the edge from the interval buffer. The process- edge module produces an update based on the edge weight and the attribute of the source vertex. 
Each produced update is assigned a validity flag to indicate whether 42 𝑣 𝑖𝑑 locked 𝑣 0 1 𝑣 1 0 … … Input Data … Interval Buffer Vertex Mutex Table Output Updates … Processing Pipeline 𝑞 −1 Processing Pipeline 0 Processing Engine Input Data Vertex-read Edge? Process-edge N Y Vertex Edge Update Unlock Signal Apply-update … 𝑣 𝑖𝑑 attr 𝑣 0 0.7 … … Interval Buffer Update Combining Network Vertex-write Figure 3.6: Architecture of the processing engine it is produced based on an active vertex. Note that the vertex-write module and vertex mutex table (VMT) are not used during the scatter phase; this is because there are only read accesses to the vertices in the scatter phase. All the updates produced by theq processing pipelines are fed into the update combining network, which uses parallel sort-and-combine (SaC) units to combine the input updates based on their destination vertices. Each SaC unit takes two updates as input and compares their destination vertices. If both updates are valid and have the same 43 destination vertex, they are combined and output as one valid update; otherwise, they are sorted based on their destination vertices and output to the next pipeline stage. The update combining network arranges the SaC units in a bitonic sorter fashion [79], requiring (1+logq)· logq·q/4 SaC units in total when q is a power of 2. Figure 3.7 depicts the architecture of the update combining network for q = 4. Note that the invalid updates output by the update combining network are discarded and will not be written into the external memory. SaC SaC SaC SaC SaC Update Combining Network Sort-and-Combine Unit SaC Invalid Same dest? Valid updates? Sort Combine Input updates Output updates Figure 3.7: Update combining network for q = 4 In the gather phase, the input data represent updates. In each clock cycle, each processing pipeline takes one update as input. Then, the vertex-read module reads 44 the attribute of the destination vertex of the update from the interval buffer. The apply-update module computes the updated attribute of the destination vertex. At last, the vertex-write module writes the updated attribute of the destination vertex into the interval buffer. As there are both read and write accesses to the vertex attributes in the gather phase, read-after-write (RAW) data hazard (i.e., the vertex-read module reads the attribute of a vertex that is being computed) may occur. To handle the possible RAW data hazard, we develop a VMT based on a fine-grained locking mechanism. The VMT uses BRAMs to maintain a 1- bit lock for each vertex of the partition being processed. A lock with value 1 means that the attribute of the corresponding vertex is being computed by one of the processing pipelines, and thus cannot be read at this time. For each input update, the VMT checks the lock status of its destination vertex: if the lock value is 0 (i.e., unlocked), the update is fed into the processing pipeline and the lock value is set to 1 (i.e., locked); otherwise, the pipeline stalls until the lock value becomes 0. Note that when any processing pipeline writes an updated vertex attribute into the interval buffer, it also generates an unlock signal to the VMT to unlock thecorresponding vertex. Therefore, deadlock will notoccur. Forthe graph algorithms whose Apply_update function can be performed within a single clock cycle (e.g., SSSP and WCC), we propose replacing the VMT with data forwarding circuits in order to avoid the pipeline stalls resulting from data hazards. 
As shown in Figure 3.8, the data forwarding circuits forward the vertex attribute output by each processing pipeline to all the processing pipelines. Each apply-update module checks whether the attribute of the destination vertex of its input update is among the forwarded data; if it is, the apply-update module uses the forwarded data rather than the data read from the interval buffer. 45 … … Vertex-read Apply-update Vertex-write … … Vertex-write Apply-update Data forwarding circuits Vertex-read Figure 3.8: Data forwarding circuits 3.5 Design Automation Tool We have built a design automation tool to allow users to rapidly generate the FPGA accelerators based on our design methodology. Figure 3.9 illustrates the workflow of our design automation tool. Users need to provide the edge-centric algorithm specification (e.g., the data width of each vertex attribute) and hard- ware resource constraints to the tool. The hardware resource constraints spec- ify the available on-chip RAMs, logic resources, DSP resources, and the external memory bandwidth for implementing the target accelerator. Our tool uses these constraint inputs to determine the design parameters of the accelerator, including inter-partition parallelism (p), intra-partition parallelism (q), and the capacity of each interval buffer in terms of vertices (m). The selection of these parameters is through design space exploration as shown in Algorithm 4 with the assump- tion that the resulting accelerator operates at 200 MHz. Users can also manually choose these design parameters. Based on the selected design parameters, the tool generates all the design modules (i.e., VMT, update combining network, PEs, 46 etc.) and automatically connects them to produce the Verilog code of the FPGA accelerator. Algorithm Specification Resource Constraints Architecture Parameters Design Modules RTL Verilog Code Design Space Exploration Figure 3.9: Workflow of the design automation tool 3.6 Performance Evaluation 3.6.1 Experimental Setup We conduct experiments using the Xilinx Virtex UltraScale+ xcvu5pflva2104 FPGA. The target FPGA device has 600,577 slice LUTs, 1,201,154 slice regis- ters, 3,474 DSPs, 832 I/O pins, 36 Mb of BRAMs, and 132 Mb of UltraRAMs. We verify our designs and evaluate the performance through post-place-and-route sim- ulations using Xilinx Vivado Design Suite 2018.1 [80]. We use four Micron 8GB DDR3-1600 chips as the external memory. Each DRAM chip runs at 800 MHz and has a peak data transfer rate of 15 GB/s. Using the proposed framework, we accelerateSpMV,PR,SSSP,andWCC.Abroadrangeofgraphdatasets, including real-life and synthetic graphs, are used in the experiments. Table 3.2 summarizes 47 Algorithm 4 Design space exploration to select design parameters 1: Inputs: hardware resource constraints 2: p←p max s.t. p max saturates DRAM bandwidth 3: q← 1 4: while true do 5: if LUT, Register, or DSP is insufficient then 6: q←q/2 7: Break 8: else 9: q←q× 2 10: end if 11: end while 12: m←m max s.t. m max satisfies on-chip RAM constraint 13: Outputs: p,q,m the key characteristics of these datasets. The real-life graphs are obtained from the Stanford network dataset repository [81], and the synthetic graphs are generated using the Graph500 graph generator [1]. 
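The design space exploration of Algorithm 4 is simple enough to prototype in software. The sketch below is a minimal Python rendering of that selection loop; the resource-estimation callbacks (saturates_dram_bandwidth, fits_logic, max_buffered_vertices) and the toy constraint values are assumptions supplied for illustration, not part of the actual tool.

```python
def explore_design_space(saturates_dram_bandwidth, fits_logic, max_buffered_vertices):
    """Select (p, q, m) following the structure of Algorithm 4.

    saturates_dram_bandwidth(p): True once p PEs can consume the DRAM bandwidth.
    fits_logic(p, q): True if p PEs with q pipelines each fit the LUT/register/DSP budget.
    max_buffered_vertices(p): largest interval size m allowed by the on-chip RAM budget.
    """
    # Step 1: choose the smallest p that saturates the external memory bandwidth.
    p = 1
    while not saturates_dram_bandwidth(p):
        p += 1
    # Step 2: double q while the logic budget still allows it.
    q = 1
    while fits_logic(p, q * 2):
        q *= 2
    # Step 3: pick the largest interval size the on-chip RAM constraint allows.
    m = max_buffered_vertices(p)
    return p, q, m

# Toy example with made-up constraints: 4 PEs saturate the bandwidth, up to 32 total
# pipelines fit the logic budget, and 1M vertices can be buffered on chip in total.
p, q, m = explore_design_space(
    saturates_dram_bandwidth=lambda p: p >= 4,
    fits_logic=lambda p, q: p * q <= 32,
    max_buffered_vertices=lambda p: (1 << 20) // p,
)
print(p, q, m)  # 4 8 262144
```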
Table 3.2: Graph datasets used in the experiments Dataset |V| |E| Diameter Graph type BKstan 0.7 M 7.6 M 514 Web graph WKtalk 2.4 M 5.0 M 9 Communication CAroad 2.0 M 5.5 M 849 Road network LJounal 4.8 M 69.0 M 16 Social network Twitter 41.6 M 1468.4 M 15 Social network RMat21 2.1 M 182.1 M 6 Synthetic graph RMat24 16.8 M 263.0 M 6 Synthetic graph 3.6.2 Performance Metrics We use the following performance metrics for the evaluation: • Clock rate: the clock rate sustained by the FPGA accelerator 48 • Resourceutilization: theutilizationofFPGAresources, includinglogicslices, registers, on-chip RAMs, and DSPs • Power consumption: the power consumption of the FPGA accelerator • Execution time: for stationary algorithms (i.e., SpMV and PR), the execu- tiontimereferstotheaverageexecutiontimeperiteration; fornon-stationary algorithms (i.e., SSSP and WCC), it refers to the total execution time of the algorithm. • Throughput: the number of traversed edges per second (TEPS), computed as the total number of traversed edges divided by the execution time 3.6.3 Resource Utilization and Power Consumption We generate FPGA designs with various architecture parameters using our design automation tool. Table 3.3 shows the resource utilization, clock rate, and power consumption of the generated FPGA accelerators. In these designs, the interval buffer of each PE is able to store the data of 256K vertices (i.e., m = 256K). We observe that all the generated FPGA accelerators sustain a high clock rate of at least 200 MHz with a low power consumption of less than 20 Watt. Basing on the available resources of the target FPGA device and Algorithm 4, we set the number of PEs to 4 (p = 4), the number of pipelines in each PE to 8 (q = 8), and the capacity of each interval buffer to 256K vertices (m = 256K). 3.6.4 Execution Time and Throughput We show the execution time and throughput performance for various graph datasets in Table 3.4 and Table 3.5, respectively. On average, our FPGA accel- erators achieve a high throughput of 2,076 MTEPS for SpMV, 2,225 MTEPS for 49 Table 3.3: Resource utilization, clock rate, and power consumption Algorithm p q LUT Reg DSP On-chip RAM (%) Clock rate Power (%) (%) (%) Block RAM UltraRAM (MHz) (Watt) SpMV 2 1 17.2 9.1 0.1 3.4 27.2 212 6.3 2 18.9 9.4 0.2 3.7 27.2 212 6.5 4 21.8 10.3 0.5 4.2 27.2 208 7.7 8 31.7 13.2 0.9 5.3 27.2 206 9.5 4 1 34.5 18.1 0.2 6.8 54.5 207 10.0 2 37.8 18.7 0.5 7.4 54.5 207 10.2 4 43.7 20.6 0.9 8.4 54.5 207 13.1 8 63.5 26.3 1.8 10.6 54.5 201 17.5 PR 2 1 17.5 9.1 0.1 3.3 27.2 209 4.5 2 19.5 9.3 0.2 3.5 27.2 209 4.6 4 22.8 10.3 0.5 3.9 27.2 208 5.0 8 34.0 13.1 0.9 4.6 27.2 204 5.8 4 1 34.6 18.1 0.2 6.6 54.5 208 6.9 2 39.0 18.6 0.5 7.0 54.5 208 7.1 4 45.6 20.5 0.9 7.8 54.5 202 7.6 8 68.1 26.1 1.8 9.2 54.5 200 10.7 SSSP 2 1 10.0 5.3 0 0.2 27.2 220 3.1 2 10.3 5.4 0 0.4 27.2 212 3.1 4 12.1 6.1 0 0.8 27.2 207 3.7 8 16.1 7.0 0 1.5 27.2 206 5.0 4 1 18.2 10.5 0 0.4 54.5 210 5.0 2 20.2 10.7 0 0.8 54.5 207 5.2 4 24.3 12.1 0 1.6 54.5 205 6.2 8 32.3 13.9 0 2.9 54.5 200 8.2 WCC 2 1 11.4 5.6 0 0.2 27.2 216 3.5 2 11.6 5.9 0 0.4 27.2 212 3.6 4 13.5 6.3 0 0.8 27.2 208 3.9 8 19.3 7.6 0 1.5 27.2 205 4.4 4 1 23.6 11.6 0 0.4 54.5 210 5.0 2 23.9 11.8 0 0.8 54.5 207 5.4 4 27.0 12.6 0 1.6 54.5 203 5.9 8 34.5 15.1 0 2.9 54.5 200 7.5 50 PR, 2,916 MTEPS for SSSP, and 3,493 MTEPS for WCC, respectively. We also observe that the achieved throughput for the dataset WKtalk is much less than the average for all the four graph algorithms. 
This is because over 90% of the edges are grouped into the same shard after the graph is partitioned. As a result, the distribution of the computation load among the PEs is extremely unbalanced in the scatter phase (i.e., one PE traverses 90% of the edges, whereas all the other PEs traverse 10% of the edges). We do not observe this issue for the other datasets, for which the computation load is equally distributed among the PEs. Table 3.4: Execution time (ms) for various datasets Algorithm Dataset BKstan WKtalk CAroad LJounal Twitter RMat21 RMat24 SpMV 3.2 5.0 2.8 36.2 652.5 56.7 143.5 PR 3.0 4.5 2.7 32.7 590.4 53.4 140.3 SSSP 782.4 25.5 1113.3 592.1 5576.8 967.1 921.3 WCC 1769.0 46.2 1480.1 412.9 6617.1 450.3 1107.9 Table 3.5: Throughput (MTEPS) for various datasets Algo. Dataset BKstan WKtalk CAroad LJounal Twitter RMat21 RMat24 Average SpMV 2361 1004 1964 1906 2250 3217 1832 2076 PR 2533 1116 2037 2110 2487 3410 1875 2225 SSSP 3109 2156 2441 3111 2869 4304 2419 2916 WCC 4949 1665 3652 3322 3395 4852 2619 3493 3.6.5 Impact of the Optimizations 3.6.5.1 Impact of Partition Skipping We first study the impact of the partition skipping optimization for SSSP and WCC. The baseline design does not have this optimization and thus traverses 51 the edges of both active partitions and non-active partitions in each iteration. Figure 3.10 shows the reduction of edge traversals because of the partition skipping optimization. On average, this optimization reduces the number of edge traversals by 1.4× for SSSP and 1.3× for WCC, respectively. We also observe that the partition skipping optimization is very effective when the ratio of active vertices (i.e., the number of active vertices over the total number of vertices) in an iteration is very low (e.g., the first iteration of SSSP and the last iteration of WCC); in such iterations, many partitions do not have any active vertices and thus can be skipped. However, when active vertices ratio in an iteration is very high (e.g., the first iteration of WCC), it is highly likely that all the partitions are active; in this scenario, none of the partitions can be skipped. 0 0.5 1 1.5 2 2.5 3 BKstan WKtalk CAroad LJounal Twitter RMat21 RMat24 SSSP WCC Reduction of edge traversals 3 2 1 0 × × × × Figure 3.10: Reduction of edge traversals because of partition skipping 3.6.5.2 Impact of Update Combining and Filtering We further explore the impact of update combining and filtering to reduce the data communication. For comparison purpose, we implement a baseline design that has the partition skipping optimization and uses the optimized data layout, 52 but does not have the update combining and filtering optimization. Figure 3.11 illustrates the effectiveness of the optimization. We observe that the number of produced updates is reduced by 2.3× to 12.5× for SpMV, 2.7× to 14.7× for PR, 10.6× to 548.2× for SSSP, and 6.9× to 1253.1× for WCC, respectively. It can also be observed that this optimization has higher impact on non-stationary graph algorithms (i.e., SSSP and WCC). This is because non-stationary graph algorithms use both the update combining and update filtering schemes, whereas stationary graph algorithms (i.e, SpMV and PR) only employ the update combining scheme. On average, this optimization reduces the number of produced updates by 6.5× for SpMV, 7.5× for PR, 104.1× for SSSP, and 218.9× for WCC. 
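The interplay between the destination-sorted data layout (Section 3.3.4) and the update combining and filtering optimizations (Section 3.3.5) can be illustrated with a short software model. The sketch below is only a behavioral analogue of the hardware combining network, assuming a sum-combine operator as in SpMV/PR; the function and variable names are illustrative.

```python
def scatter_with_combining(shard, attr, active, combine=lambda a, b: a + b):
    """Produce the updates of one shard with update filtering and combining.

    shard:  list of (src, dest, weight) edges; they are sorted by dest (Section 3.3.4),
            so updates sharing a destination are produced consecutively.
    attr:   per-vertex attributes; active: per-vertex flags used for update filtering.
    """
    updates = []
    for src, dest, weight in sorted(shard, key=lambda e: e[1]):
        if not active[src]:
            continue                      # filtered: produced by a non-active vertex
        value = weight * attr[src]        # Process_edge for SpMV/PR-style algorithms
        if updates and updates[-1][0] == dest:
            # Consecutive updates share a destination: combine instead of appending.
            updates[-1] = (dest, combine(updates[-1][1], value))
        else:
            updates.append((dest, value))
    return updates

# Tiny example: two surviving edges into v2 collapse into a single combined update.
shard = [(0, 2, 1.0), (1, 2, 2.0), (3, 2, 0.5), (1, 4, 1.0)]
attr = [1.0, 2.0, 0.0, 3.0, 0.0]
active = [True, True, True, False, True]
print(scatter_with_combining(shard, attr, active))
# [(2, 5.0), (4, 2.0)]  -- v3's edge is filtered out; v2's updates are combined
```

Because fewer updates are written to DRAM and the surviving ones are emitted in destination order, the same mechanism also explains the reductions reported in Figures 3.11 and 3.12.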
1 10 100 1000 10000 BKstan WKtalk CAroad LJounal Twitter RMat21 RMat24 Average SpMV PR SSSP WCC 10 3 10 2 10 1 10 0 Reduction factor of updates (log scale) 0 5 10 15 20 BKstan WKtalk CAroad LJounal Twitter RMat21 RMat24 Average SpMV PR SSSP WCC Reduction factor of non- sequential DRAM accesses Figure 3.11: Reduction factor of produced updates because of update combining and filtering 3.6.5.3 Impact of Data Layout Optimization Lastly, we study the impact of our data layout optimization. The baseline design for the comparison uses the partition skipping and communication reduction opti- mizations, but it utilizes the default COO representation that does not have 53 our data layout optimization. Figure 3.12 shows the reduction factor of non- sequential DRAM accesses, which is computed as the number of non-sequential DRAM accesses performed by the baseline design, divided by the number of non- sequential DRAM accesses performed by the optimized design. We observe that this optimization reduces the number of non-sequential DRAM accesses by 2.1× to 12.2× for SpMV, 2.4× to 15.3× for PR, 2.4× to 8.2× for SSSP, and 2.2× to 12.2× for WCC, respectively. As a result, the optimized designs can sustain a high DRAM bandwidth of 46.8 GB/s for SpMV, 46.5 GB/s for PR, 36.7 GB/s for SSSP, and 43.0 GB/s for WCC. However, the baseline designs can sustain only 25.0 GB/s for SpMV, 14.5 GB/s for PR, 10.3 GB/s for SSSP, and 28.5 GB/s for WCC. 1 10 100 1000 10000 BKstan WKtalk CAroad LJounal Twitter RMat21 RMat24 Average SpMV PR SSSP WCC 10 3 10 2 10 1 10 0 Reduction factor of updates (log scale) 0 5 10 15 20 BKstan WKtalk CAroad LJounal Twitter RMat21 RMat24 Average SpMV PR SSSP WCC Reduction factor of non- sequential DRAM accesses Figure 3.12: Reduction factor of non-sequential DRAM accesses because of data layout optimization 54 3.7 Comparison with State-of-the-Art 3.7.1 Comparison with Multi-core Designs We first compare the performance of our design with several highly optimized multi-core implementations, including X-Stream [33], NXgraph [57], GraphX [82], and GraphMat [54]. X-Stream [33] runs on a 32-core AMD Opteron 6272 pro- cessor with 25 GB/s DRAM bandwidth. NXgraph [57] runs on a hexa-core Intel i7 processor with 160 GB/s DRAM bandwidth. GraphX [82] runs on a cluster consisting of 16 computing nodes; each node has 8 cores. GraphMat [54] runs on a 24-core Intel Xeon E5-2697 v2 processor with 80 GB/s DRAM bandwidth. As these works do not report throughput performance, we conduct the compari- son based on the execution time performance. Table 3.6 summarizes the results of the comparison using the same datasets. It can be observed that our FPGA designs achieve up to 20.5×, 35.5×, 5.0× and 37.9× speedup for SpMV, PR, SSSP, and WCC, respectively. In addition, the power consumption of our FPGA designs (<20 Watt) is much lower than that of multi-core platforms (typically>80 Watt). Therefore, from energy-efficiency perspective, our framework achieves an even larger improvement. 3.7.2 Comparison with GPU Designs We further compare our FPGA framework with three state-of-the-art GPU-based graph processing frameworks, including nvGRAPH [62], CuSha [61], and Gunrock [64]. The results of the comparison are shown in Table 3.7.2. It can be observed that our FPGA-based designs achieve comparable performance with the GPU- based designs. Note that the external memory bandwidth of the GPU platforms (288 GB/s) is 4.8× higher than that of our target platform (60 GB/s). 
If we scale 55 Table 3.6: Comparison with highly optimized multi-core implementations Algorithm Dataset Approach Execution time Speedup (ms) SpMV LJournal [33] 740 20.5× Our framework 36 PR LJournal [33] 580 1.0× [57] 100 5.8× [54] 45 12.9× Our framework 33 17.6× Twitter [82] 20950 1.0× [57] 2050 10.2× [54] 1800 11.6× Our framework 590 35.5× SSSP CAroad [54] 5500 5.0× Our framework 1113 RMat24 [54] 1900 2.1× Our framework 921 WCC LJournal [33] 7220 17.5× Our framework 413 Twitter [82] 251000 37.9× Our framework 6617 these GPU results by assuming a bandwidth of only 60 GB/s, our framework will outperform the GPU designs by 2.2× to 7.2×. 56 Table 3.7: Comparison with GPU-based implementations Algor. Dataset Approach Platform Bandwidth # of cores/ Frequency Power Exec. Time (GB/s) pipelines (MHz) (Watt) (ms) PR RMat21 [64] NIVIDIA Tesla K40c 288 2880 745 245.0 80.4 Our framework Xilinx UltraScale+ 60 32 200 15.2 53.4 Twitter [62] NIVIDIA Tesla M40 288 3072 1140 250.0 850.0 Our framework Xilinx UltraScale+ 60 32 200 15.2 590.1 SSSP LJournal [61] NIVIDIA GeForce GTX780 288 2304 863 250.0 346.0 Our framework Xilinx UltraScale+ 60 32 200 7.0 592.1 WCC RMat21 [64] NIVIDIA Tesla K40c 288 2880 745 245.0 428.9 Our framework Xilinx UltraScale+ 60 32 200 6.7 450.3 LJournal [61] NIVIDIA GeForce GTX780 288 2304 863 250.0 190.0 Our framework Xilinx UltraScale+ 60 32 200 6.7 412.9 57 3.7.3 Comparison with State-of-the-Art FPGA Designs Lastly,wecompareourproposedframeworkwithtwostate-of-the-artFPGAframe- works for accelerating general graph algorithms, GraphOps [7] and ForeGraph [9]. GraphOps [7] is a hardware library for constructing FPGA-based accelerators for graph analytics. Its target platform is a CPU-FPGA heterogeneous platform con- sisting of a 12-core Intel Xeon X5650 processor and a Xilinx Virtex-6 FPGA. The peak DRAM bandwidth is 38.4 GB/s. Table 3.8 summarizes the results of the comparison with GraphOps, showing that our framework improves throughput performance by up to 27.6× and 50.7× for SpMV and PR, respectively. Table 3.8: Comparison with GraphOps Algorithm Dataset Approach Throughput Improvement (MTEPS) SpMV BKstan [7] 162 14.7× Our framework 2361 WKtalk [7] 37 27.6× Our framework 1004 RMat24 [7] 165 11.1× Our framework 1832 PR BKstan [7] 190 13.3× Our framework 2533 WKtalk [7] 37 29.8× Our framework 1116 RMat24 [7] 37 50.7× Our framework 1875 ForeGraph [9] is implemented based on a multi-FPGA platform that inter- connects four Virtex UltraScale FPGAs in the Microsoft Catapult fashion [25]. Table 3.9 shows the results of the comparison with ForeGraph. Our framework 58 achieves 1.3× and 2.0× higher throughput for PR and WCC, respectively. Note that ForeGraph uses four FPGAs, whereas our framework uses a single FPGA. Table 3.9: Comparison with ForeGraph Algorithm Dataset Approach Throughput Improvement (MTEPS) PR Twitter [9] 1856 1.3× Our framework 2487 WCC Twitter [9] 1727 2.0× Our framework 3395 59 Chapter 4 Accelerating Graph Processing on CPU-FPGA Heterogeneous Platform In this chapter, we present our hybrid algorithm for accelerating non-stationary graphalgorithmsbasedonashared-memoryCPU-FPGAheterogeneousplatforms. We use our design methodology to accelerate BFS and SSSP and demonstrate performance improvement by efficient CPU-FPGA co-processing. 4.1 Motivation As introduced in Chapter 2, the VCP and ECP have been widely used to design graph analytics accelerators. 
However, both of them have notable drawbacks: the VCP requires random memory accesses to traverse edges, whereas the ECP results in redundant edge traversals. To avoid the disadvantages of the VCP and ECP, we propose a hybrid algorithm that dynamically selects the appropriate paradigm at runtime.

Our hybrid algorithm targets non-stationary graph algorithms, in which only a subset of the vertices is active in each iteration. We define the active vertex ratio (in an iteration) as the number of active vertices in the iteration over the total number of vertices. Our hybrid algorithm is motivated by the fact that for non-stationary graph algorithms, the active vertex ratio varies significantly over the iterations, especially when the input graphs are low-diameter and scale-free [83, 84, 85]. For example, Figure 4.1 shows the number of active vertices and the active vertex ratio in each iteration when running BFS on a web graph. It can be observed that in the fourth and fifth iterations, a large number of vertices are active, whereas in the rest of the iterations, only a small number of vertices are active.

[Figure 4.1: Variation of active vertex ratio during the execution of BFS for a web graph (number of active vertices and active vertex ratio per iteration)]

The key idea of our hybrid algorithm is that (1) when the active vertex ratio is low, we use the VCP to traverse the edges (a small amount of random memory accesses is favored over a large amount of redundant edge traversals); (2) when the active vertex ratio is high, we use the ECP to traverse the edges (a small amount of redundant edge traversals is favored over a large amount of random memory accesses).

4.2 Hybrid Algorithm for Graph Processing

4.2.1 Hybrid Data Structure

We assume that the graph is initially stored using the COO graph representation and that the edge array has been sorted based on the source vertices. The COO representation supports the ECP, but it cannot support the VCP. The reason is that, given a vertex, the memory location of its outgoing edges is unknown; therefore, the vertex cannot directly access its edges. To resolve this issue, we keep a pointer for each vertex to record the index of its first outgoing edge in the edge array, through which the vertex can quickly locate its edges. Therefore, the hybrid data structure concurrently supports the VCP and ECP. To indicate whether a vertex is active in an iteration, we assign an active_tag to each vertex; the active_tag records the most recent iteration in which the attribute of the vertex was updated. For example, a vertex with an active_tag value of i becomes an active vertex in the (i+1)-th iteration. To enable vertex buffering using on-chip memory, we use the graph partitioning introduced in Section 3.3.1 to partition the graph. Then, each partition has an interval (an array of vertices with contiguous indexes), a shard (an array of edges whose source vertices belong to the interval), and a bin (an array to store the updates whose destination vertices belong to the interval).
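As a concrete software-level illustration of this hybrid data structure, the sketch below builds the per-vertex pointer array and active_tag fields on top of a source-sorted COO edge array, and shows how the same arrays serve both vertex-centric and edge-centric access. The class and method names are illustrative only, not taken from the implementation described later.

```python
class HybridGraph:
    """COO edge array sorted by source, plus per-vertex pointers and active_tags."""

    def __init__(self, num_vertices, edges, m):
        self.m = m                                   # vertices per interval
        self.edges = sorted(edges)                   # (src, dest, weight), sorted by src
        self.attr = [float("inf")] * num_vertices
        self.active_tag = [-1] * num_vertices        # last iteration the vertex was updated
        # pointer[v] = index of v's first outgoing edge; this is what enables the VCP.
        self.pointer = [len(self.edges)] * (num_vertices + 1)
        for idx in range(len(self.edges) - 1, -1, -1):
            self.pointer[self.edges[idx][0]] = idx
        for v in range(num_vertices - 1, -1, -1):    # fill vertices with no outgoing edges
            self.pointer[v] = min(self.pointer[v], self.pointer[v + 1])

    def outgoing_edges(self, v):
        """Vertex-centric access: jump straight to v's edges via the pointer."""
        return self.edges[self.pointer[v]:self.pointer[v + 1]]

    def shard(self, i):
        """Edge-centric access: stream all edges whose source lies in interval i."""
        lo, hi = i * self.m, min((i + 1) * self.m, len(self.attr))
        return self.edges[self.pointer[lo]:self.pointer[hi]]

g = HybridGraph(4, [(0, 1, 2), (0, 3, 3), (1, 3, 1), (2, 0, 1), (2, 1, 1), (3, 2, 1)], m=2)
print(g.outgoing_edges(2))   # [(2, 0, 1), (2, 1, 1)]
print(g.shard(1))            # edges whose source is v2 or v3
```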
Partitioning the graph also increases the available parallelism because the computations of distinct partitions can be performed in parallel. In Figure 4.2, we show the hybrid data structure for an example graph that has been partitioned into two partitions, with each partition having two vertices.

[Figure 4.2: Hybrid data structure of an example graph, with the assumption that v_0 is the only active vertex in the current iteration]

4.2.2 Paradigm Selection

For both the VCP and ECP, the scatter phase includes three types of operations, namely reading edges from the external memory, computing updates, and writing updates into the external memory. As the operations to compute updates and write updates are similar for the VCP and ECP, reading edges results in the most significant performance difference. In each iteration, we select the appropriate paradigm for each partition based on the active vertex ratio of the partition, which is computed as the number of active vertices in the partition divided by the total number of vertices in the partition. Let m denote the number of vertices in the interval of each partition, $BW_{VCP}$ ($BW_{ECP}$) denote the sustained memory bandwidth for reading edges based on the VCP (ECP), and r denote the active vertex ratio of a partition.

Theorem 2. In the scatter phase, if $r > \frac{BW_{VCP}}{BW_{ECP}}$, the ECP results in a lower execution time for this partition; otherwise, the VCP results in a lower execution time.

Proof. Let $|S|$ denote the total number of edges in the shard of the partition, and $D_e$ denote the number of bytes required to represent an edge. The execution time for reading all the edges based on the ECP can be estimated as:

\[ T_{ECP} = \frac{|S| \times D_e}{BW_{ECP}} \tag{4.1} \]

Let $\xi$ denote the average degree of the vertices in the interval (i.e., $\xi = \frac{|S|}{m}$). The execution time for reading all the edges of the active vertices based on the VCP can be estimated as:

\[ T_{VCP} = \frac{r \times m \times \xi \times D_e}{BW_{VCP}} \tag{4.2} \]

By comparing Equations (4.1) and (4.2), we obtain that when $r > \frac{BW_{VCP}}{BW_{ECP}}$, $T_{ECP}$ is smaller; otherwise, $T_{VCP}$ is smaller.

Let $r_{thold}$ denote the threshold for determining whether to select the VCP or ECP (i.e., $r_{thold} = \frac{BW_{VCP}}{BW_{ECP}}$). For example, if the ECP sustains 12 GB/s for sequentially streaming edges while the VCP sustains only 3 GB/s due to random accesses, then $r_{thold} = 0.25$, and a partition in which more than 25% of the vertices are active is processed using the ECP. We use the first two iterations to estimate $BW_{VCP}$ and $BW_{ECP}$ in order to determine $r_{thold}$: in the first iteration, we enforce selecting the ECP to estimate $BW_{ECP}$; in the second iteration, we enforce selecting the VCP to estimate $BW_{VCP}$. We estimate the sustained memory bandwidth for reading edges based on Equation (4.3), in which $D_e$ denotes the number of bytes required to represent an edge and $T_{scatter}$ denotes the execution time of the scatter phase in the corresponding iteration.

\[ BW = \frac{(\#\ \text{of edges accessed from external memory}) \times D_e}{T_{scatter}} \tag{4.3} \]

After $r_{thold}$ is determined, we select the appropriate paradigm for each partition based on Algorithm 5. We maintain two queues, VCP_queue and ECP_queue. We insert the index of a partition into the corresponding queue based on the paradigm selected for the partition. Note that if a partition does not have any active vertices, the partition is skipped in the scatter phase (Line 7 of Algorithm 5).
Algorithm 5 Paradigm selection for each partition Let k denote the total number of partitions Let Par i denote the i th partition (0≤i<k) Paradigm_selection(Par 0 ,··· ,Par k−1 ) 1: for i from 0 to k− 1 do 2: if Par i .no_of_active_vertices>m×r thold then 3: ECP_queue.enqueue(Par i ) 4: else if Par i .no_of_active_vertices> 0 then 5: VCP_queue.enqueue(Par i ) 6: else 7: Par i can be skipped 8: end if 9: end for 10: Return ECP_queue and VCP_queue 4.2.3 Hybrid Algorithm Algorithm 6 illustrates our hybrid algorithm. In each iteration, before the scatter phase starts, we perform the paradigm selection algorithm (i.e., Algorithm 5) to select the appropriate paradigm for each partition. The scatter phase of the par- titions whose indexes are in VCP_queue (ECP_queue) will be performed based on the VCP (ECP). 65 Algorithm 6 Hybrid algorithm 1: Initialization: 2: no_of_iterations = 0 3: for i from 0 to k− 1 do 4: if Par i contains root vertex then 5: Par i .no_of_active_vertices = 1 6: else 7: Par i .no_of_active_vertices = 0 8: end if 9: end for 10: while any partition has an active vertex do 11: Paradigm Selection: 12: {ECP_queue, VCP_queue} = Paradigm_selection(Par 0 ,··· ,Par k−1 ) 13: Scatter Phase: 14: while VCP_queue is not empty do 15: idx = VCP_queue.dequeue() 16: VCP_scatter(Par idx ) //see Algorithm 7 17: Par idx .no_of_active_vertices = 0 18: end while 19: while ECP_queue is not empty do 20: idx =edge_centric_queue.dequeue() 21: ECP_scatter(Par idx ) //see Algorithm 8 22: Par idx .no_of_active_vertices = 0 23: end while 24: Gather Phase: 25: for i from 0 to k− 1 do 26: if bin of Par i is not empty then 27: for each update u in bin of Par i do 28: if update condition for vertex u.dest is met then 29: update vertex v u.dest 30: if (v u.dest ).active_tag6=no_of_iterations then 31: (v u.dest ).active_tag =no_of_iterations 32: Par i .no_of_active_vertices++ 33: end if 34: end if 35: end for 36: end if 37: end for 38: no_of_iterations++ 39: end while 66 4.2.4 Mapping the Hybrid Algorithm onto a Heteroge- neous Platform 4.2.4.1 Scatter Phase We accelerate the scatter phase of Algorithm 6 by CPU-FPGA co-processing. To coordinate the execution between the CPU and FPGA, we develop a runtime system. As shown in Figure 4.3, (1) when both VCP_queue and ECP_queue are not empty, the CPU and FPGA concurrently execute the scatter phase of the intervals in VCP_queue and ECP_queue, respectively; (2) when VCP_queue is empty but there are still remaining intervals in ECP_queue, the runtime system uses a work- stealing strategy to achieve load balancing between the CPU and FPGA; in this scenario, theCPUstealsanintervalfromECP_queueandexecutesitsscatterphase. Algorithms 7 and 8 illustrate the scatter phase performed based on the VCP and ECP,respectively. WeusetheFPGAtoacceleratethescatterphaseoftheintervals in ECP_queue because (1) we observe that the total execution time is dominated by executing the scatter phase of the intervals with high active vertex ratio and (2) the streaming nature of the ECP makes the FPGA suitable for acceleration [33, 78]. 
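The queue-based selection of Algorithm 5 can be prototyped directly in software. The sketch below only models the selection logic; r_thold is assumed to have been estimated from the first two iterations using Equation (4.3), and the partition objects and field names are illustrative rather than part of the runtime system described next.

```python
from collections import deque
from types import SimpleNamespace

def select_paradigms(partitions, m, r_thold):
    """Algorithm 5: route each partition to the ECP or VCP queue, or skip it."""
    ecp_queue, vcp_queue = deque(), deque()
    for par in partitions:
        if par.num_active_vertices > m * r_thold:
            ecp_queue.append(par)      # high active ratio: stream the whole shard (ECP)
        elif par.num_active_vertices > 0:
            vcp_queue.append(par)      # low active ratio: visit only active vertices (VCP)
        # partitions with no active vertices are skipped in the scatter phase
    return ecp_queue, vcp_queue

# Toy example: three partitions of m = 100 vertices with r_thold = 0.25.
partitions = [SimpleNamespace(idx=0, num_active_vertices=60),
              SimpleNamespace(idx=1, num_active_vertices=3),
              SimpleNamespace(idx=2, num_active_vertices=0)]
ecp_q, vcp_q = select_paradigms(partitions, m=100, r_thold=0.25)
print([p.idx for p in ecp_q], [p.idx for p in vcp_q])   # [0] [1]; partition 2 is skipped
```

In the hybrid iteration of Algorithm 6, the FPGA then drains ECP_queue while the CPU drains VCP_queue, with work stealing once VCP_queue becomes empty.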
Algorithm 7 VCP-based scatter function 1: function VCP_scatter(Par idx ): 2: for each vertex v in Par idx CPU do 3: w if v is active then 4: wwl for each outgoing edge e of v do 5: wwlwl produce update u and write it into bin of Par be.dest×k/|V|c 6: wwl end for 7: w end if 8: end for 67 VC_queue EC_queue CPU FPGA Par 𝑖 1 … Par 0 Par 𝑘 −1 … Paradiam Selection … Par 𝑖 2 … Figure 4.3: Coordination between the CPU and FPGA by the runtime system Algorithm 8 ECP-based scatter function 1: function ECP_scatter(Par idx ): 2: if FPGA_is_busy6= true then 3: FPGA_is_busy = true 4: load interval of Par idx from external memory into on-chip memory 5: for each edge e in shard of Par idx FPGA do 6: if vertex v e.src is active then 7: produce update u and write it into bin of Par be.dest×k/|V|c 8: end if 9: end for 10: FPGA_is_busy = false 11: else 12: for each edge e in shard of Par idx CPU do 13: w if vertex v e.src is active then 14: wwl produce update u and write it into bin of Par be.dest×k/|V|c 15: w end if 16: end for 17: end if 68 When the produced updates are written into the bins stored in the external memory, concurrent write accesses to the same bin may occur. Figure 4.4 depicts a scenario in which two processing units (i.e., CPU cores and FPGA) concurrently write distinct updates into the same bin, resulting in a race condition. Therefore, atomic operations (e.g., exclusive access to shared data) to prevent concurrent writes to the same memory location are required. Algorithm 9 illustrates the algorithm performed by each processing unit when it writes update(s) into the external memory. We maintain one shared counter for each bin to keep track of the number of updates that have been stored in the bin. When a processing unit attempts to write Δ updates into a bin, it first reserves Δ empty slots in the bin and increases the shared counter of the bin in an atomic fashion before it starts the writing operations. Bin Write an update unit A Processing unit B Processing Occupied slot Empty slot Figure 4.4: Concurrent write accesses to the same bin 69 Algorithm 9 Writing update(s) into a bin stored in the external memory Let Δ denote the number of updates to be written Let Bin.size denote the number of updates stored in Bin 1: function Write_update(Δ updates, Bin): 2: Atomically do{ 3: start_position = Bin.size 4: end_position = start_position + Δ 5: Bin.size = Bin.size + Δ 6: } 7: Write the Δ updates into Bin[start_position],···, Bin[end_position] 4.2.4.2 Gather Phase In the gather phase (Algorithm 6, Line 25-37), we concurrently execute the gather phase of distinct intervals by using distinct CPU cores. Note that the gather phase of distinct intervals can be independently executed in parallel. This is because the updates stored in each bin will be applied to only the vertices in the corresponding interval. In Algorithm 6, when applying updates to vertices, we keep track of the number of active vertices of the partition (Line 32), which can be used to compute the active vertex ratio of the partition in the next iteration. The gather phase is entirely executed by the CPU without FPGA acceleration. This is because the execution time for the gather phase constitutes only a small portion of the total execution time (< 18%, see Chapter 4.4.5); therefore, based on the Amdahl’s law [77], accelerating the gather phase using the FPGA will not result in much speedup. 
Another reason is that using the FPGA to execute the gather phase requires pre-fetching the interval data into the on-chip memory and writing them back into the external memory; however, the updates maybe few; in thisscenario, writingtheintervaldatabackandforthbetweentheexternalmemory and FPGA is very inefficient and can adversely impact the execution time. 70 4.3 Implementation Our approach targets a heterogeneous platform with coherent shared-memory between the CPU and FPGA. Examples include the Intel-Altera Heterogeneous Architecture Research Platform [46]. The platform integrates an Intel Xeon multi- core processor with an Altera FPGA through cache-coherent QuickPath Inter- connect (QPI) technology. On the FPGA, a control unit for receiving control signals from the CPU and dedicated registers to store the status of the FPGA (e.g., whether the FPGA is idle or not) are provided. Users can implement a customized accelerator function unit (AFU) on the FPGA, which can coherently access the CPU’s last-level cache (i.e., L3 cache) and the DRAM attached to the CPU through QPI. 4.3.1 Overall Architecture Figure 4.5 depicts the overall architecture of our design. The intervals, shards, and bins are stored in the DRAM, which is shared by the CPU and FPGA. On the CPU, a master thread is created to schedule the execution and coordinate with the FPGA. The master thread creates a group of worker threads to execute the computations of distinct partitions in parallel. Each thread maintains a local bin in the cache. In the scatter phase, the produced updates are first written into the local bins; when a local bin becomes full, the corresponding thread will write the updates into the DRAM. The purpose of having the local bins is in order to avoid frequent expensive atomic operations (Algorithm 9) for writing the updates into the bins stored in the DRAM. The master thread controls the FPGA by sending control signals to the con- trol unit on the FPGA. Based on the control signals, FPGA obtains the memory 71 addresses of the data to be processed and starts processing. During the process- ing, the FPGA sets its device status to “busy”; when the FPGA completes the processing of a partition, it sets its device status to “free” to indicate that it is ready to process another partition. In the scatter phase, the updates produced by the FPGA are first written into the local bin of the master thread; the master thread is responsible for writing them into the DRAM. Interval 0 Shard 0 Bin 𝑘 −1 … 𝐃𝐑𝐀𝐌 𝐂𝐏𝐔 Coherent memory interconnect AFU Interval buffer Pipelines 𝐅𝐏𝐆𝐀 Worker thread Cache hierarchy Core Local bin Bin 0 Master thread Core Local bin Control unit Device status AFU I/O Vertex array Edge array … Interval 𝑘 −1 … Shard 𝑘 −1 … … Figure 4.5: Overall architecture 4.3.2 Accelerator Function Unit Design The FPGA accesses the shared-memory in blocks of cache lines. The cache line size (e.g., 64 bytes) can be much larger than the data size of an edge (e.g., 8 bytes). In order to fully utilize the data in a cache line, we design a multi-pipelined 72 architecture for the AFU. With the assumption that the cache line size is γ bytes and each edge is represented usingD e bytes, the AFU hasq = γ De pipelines working inparallel. Figure4.6depictstheAFUarchitecture. Allthepipelinesconnecttoan interval buffer that is composed of on-chip BRAMs. 
When the FPGA executes the scatter phase for a partition, the data of the interval are pre-fetched and buffered in the interval buffer; as a result, the pipelines can access the vertex data from the interval buffer rather than from the DRAM. Each pipeline consists of a vertex-read module and a process-edge module. The vertex-read module reads the vertex data from the interval buffer when edges are streamed in. The process-edge module is responsible for computing the update based on the attribute of the vertex and the edge weight. The update filter is used to filter out invalid updates that are produced based on non-active vertices. An invalid update can be identified by checking the active_tag of the source vertex that the update produced is based on. Valid updates are first written into an output buffer on the FPGA whose size is equal to the cache line size. When the output buffer becomes full, the FPGA issues a memory write request to write the buffered updates into the local bin of the master thread. 4.4 Performance Evaluation 4.4.1 Experimental Setup We implement our designs on an Intel-Altera Heterogeneous Architecture Research Platform. The target platform integrates a 14-core Intel Xeon E5-2680 processor with an Altera Arria 10 GX1150 FPGA through QPI technology. Each CPU core operates at 2.4 GHz and has a 32 KB L1 cache and a 256 KB L2 cache. All the 14 cores share a 35 MB L3 cache. The FPGA has 1,150,720 adaptive logic modules 73 Interval buffer Vertex-read Process -edge Vertex -read Process-edge Edge 𝐀𝐅𝐔 Edge Valid updates … Pipeline Pipeline Update filter Vertex … … Figure 4.6: Architecture of the accelerator function unit (AFU) and up to 6.62 MB of on-chip BRAMs. The heterogeneous platform is equipped with 64 GB DDR3-1600 main memory. The CPU can assess the main memory with a peak bandwidth of 30 GB/s. The FPGA can assess the main memory with a peak bandwidth of 12.8 GB/s. The cache line size is 64 bytes. The results in this work were generated using pre-production hardware and software from Intel, and may not reflect the performance of production or future systems. We generate synthetic scale-free graphs using the Graph 500 graph generator [1]. Table 4.1 summarizes the key characteristics of the graph datasets. The pre- processing overhead to generate our hybrid data structure is also included in Table 4.1. The pre-processing is performed by the CPU of the target platform. We assume that the input graphs do not change during the execution. Table 4.1: Synthetic graph datasets used in the experiments Notation # Vertices (|V|) # Edges (|E|) T pre−processing G 1 10 M 140 M 0.005 s G 2 10 M 180 M 0.012 s G 3 10 M 160 M 0.025 s 74 4.4.2 Performance Metrics We use execution time and throughput as the performance metrics for the evalu- ation. • Execution time: the total execution time of the BFS/SSSP algorithm • Throughput: the number of traversed edges per second (TEPS); we use a conservative approach introduced by Graph 500 [1] to compute the through- put performance: Throughput = |E| T exec (4.4) 4.4.3 Resource Utilization We report the resource utilization of our FPGA accelerators in Table 4.2, and they are evaluated through post-place-and-route simulations using Quartus design software v17.0 [86]. For BFS (SSSP), each interval has 512K (128K) vertices and each vertex has an 8-bit (32-bit) attribute. We did not further increase the interval size because the accelerators have consumed up to 62.6% of the BRAMs in the FPGA device. 
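Before turning to the evaluation, a purely functional sketch of the AFU scatter path described in Section 4.3.2 may help: one cache line of edges is consumed per step, updates produced by non-active sources are filtered, and the surviving updates are batched into cache-line-sized writes. The sizes and the SSSP-style process-edge step are assumptions for illustration only; this is not the RTL design.

```python
CACHE_LINE_BYTES = 64
EDGE_BYTES = 8
UPDATE_BYTES = 8
Q = CACHE_LINE_BYTES // EDGE_BYTES        # number of parallel pipelines (8)

def afu_scatter(shard, attr, active_tag, iteration):
    """Behavioral model of the AFU scatter phase for one partition."""
    out_buffer, writes = [], []
    for base in range(0, len(shard), Q):                  # one cache line of edges per step
        for src, dest, weight in shard[base:base + Q]:    # Q pipelines work in parallel
            if active_tag[src] != iteration - 1:
                continue                                  # update filter: source not active
            out_buffer.append((dest, attr[src] + weight)) # process-edge (SSSP-style)
            if len(out_buffer) * UPDATE_BYTES == CACHE_LINE_BYTES:
                writes.append(out_buffer)                 # issue one full-cache-line write
                out_buffer = []
    if out_buffer:
        writes.append(out_buffer)                         # flush the partially filled line
    return writes

# Toy run: 10 edges out of vertex 0, which was updated in the previous iteration.
shard = [(0, d, 1.0) for d in range(1, 11)]
attr = [0.0] + [float("inf")] * 10
writes = afu_scatter(shard, attr, active_tag=[0] + [-1] * 10, iteration=1)
print([len(w) for w in writes])   # [8, 2]: one full 64-byte line, then a flush
```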
For both BFS and SSSP, the FPGA accelerator runs at 200 MHz and the AFU contains eight parallel pipelines to saturate the memory bandwidth available to the FPGA.

Table 4.2: Resource utilization
Algorithm | Logic | Register | BRAM
BFS       | 7.6%  | 84892    | 62.6%
SSSP      | 7.6%  | 87395    | 62.6%

4.4.4 Vertex-centric vs. Edge-centric on the CPU

To explore the tradeoffs between the VCP and ECP, we first compare their performance on the CPU of the target platform. All the software implementations are parallelized using OpenMP [87] with 16 threads. Figure 4.7 shows the execution time comparison of each iteration. We have the following observations:

• For both BFS and SSSP, the active vertex ratio varies over the iterations.
• The active vertex ratio in an iteration has significant impact on the execution time of each iteration for both the VCP and ECP.
• When the active vertex ratio in an iteration increases, the execution time of the iteration also increases for both the VCP and ECP.
• When the active vertex ratio in an iteration is high (e.g., > 20%), the ECP results in a lower execution time than the VCP.
• When the active vertex ratio in an iteration is low (e.g., < 5%), the VCP results in a lower execution time than the ECP.

[Figure 4.7: Comparison between the VCP and ECP on the CPU — per-iteration execution time and active vertex ratio for BFS and SSSP on G1, G2, and G3.]

4.4.5 Hybrid Algorithm on the CPU

We further compare our hybrid algorithm (Algorithm 6) with the VCP and ECP. All the designs are executed on the CPU of the target platform. Figure 4.8 shows the execution time comparison. We have the following observations:

• Compared with the VCP for BFS, our hybrid algorithm achieves 1.2× to 1.4× speedup.
• Compared with the ECP for BFS, our hybrid algorithm achieves 1.3× to 1.5× speedup.
• Compared with the VCP for SSSP, our hybrid algorithm achieves 1.1× speedup for all the three datasets.
• Compared with the ECP for SSSP, our hybrid algorithm achieves 1.2× to 1.3× speedup.

[Figure 4.8: Execution time comparison between the VCP, ECP, and hybrid algorithm on the CPU — normalized execution time for BFS and SSSP on G1, G2, and G3.]

Figure 4.9 shows the execution time breakdown of the hybrid algorithm running on the CPU of the target platform. It can be observed that the execution time is dominated by the scatter phase. For both BFS and SSSP, the execution time for the scatter phase occupies at least 82% of the total execution time.

[Figure 4.9: Execution time breakdown of the hybrid algorithm on the CPU — scatter vs. gather phase for BFS and SSSP on G1, G2, and G3.]

4.4.6 Hybrid Algorithm on the CPU-FPGA Heterogeneous Platform

We accelerate the hybrid algorithm by using CPU-FPGA co-processing (Section 4.2.4). Table 4.3 reports the execution time and throughput performance. Our design achieves a high throughput of up to 670 MTEPS and 75 MTEPS for BFS and SSSP, respectively.
Table 4.3: Performance of the hybrid algorithm on the CPU-FPGA heterogeneous platform Algorithm Dataset Execution time Throughput (sec) (MTEPS) BFS G 1 0.12 333 G 2 0.17 470 G 3 0.24 670 SSSP G 1 0.63 63 G 2 1.10 73 G 3 2.13 75 We also explore the speedup because of the FPGA acceleration. Here, the baselineforthecomparisonexecutesourhybridalgorithmontheCPUofthetarget platform, without the FPGA acceleration. Figure 4.10 summarizes the comparison results. We observe that our CPU-FPGA co-processing designs achieves 1.1× to 1.5× speedup for BFS, and 1.6× to 1.9× speedup for SSSP, respectively. The achieved speedup is mainly constrained by the available memory bandwidth for the FPGA accelerator (12.8 GB/s) because bandwidth is important for streaming algorithms. We believe that as coherent memory interconnect technology evolves, the FPGA accelerator will have a higher bandwidth to access the shared memory. In this scenario, our design will achieve an even higher speedup. 80 0 1 2 1 2 3 Hybrid algorithm (CPU-only) Hybrid algorithm (CPU-FPGA) 0 1 2 1 2 3 Hybrid algorithm (CPU-only) Hybrid algorithm (CPU-FPGA) G 1 G 2 G 3 G 1 G 2 G 3 𝐁𝐅𝐒 𝐒𝐒𝐒𝐏 Normalized execution time Normalized execution time Figure 4.10: Accelerating the hybrid algorithm by CPU-FPGA co-processing 81 4.5 Comparison with State-of-the-Art 4.5.1 Comparison with Multi-core Implementations We first compare our design with a state-of-the-art multi-core design, Graph- Mat [54]. GraphMat is a highly optimized graph processing framework that has demonstrated the best performance among existing software-based graph process- ing frameworks. Table 4.4 shows the execution time comparison with GraphMat. For BFS, our design achieves 1.4× to 1.5× speedup; for SSSP, our design achieves 1.5× to 1.8× speedup. Table 4.4: Comparison with state-of-the-art multi-core design Algorithm Dataset Approach Execution time Speedup BFS G 1 [54] 0.17 s 1.4× Our design 0.12 s G 2 [54] 0.25 s 1.5× Our design 0.17 s G 3 [54] 0.36 s 1.5× Our design 0.24 s SSSP G 1 [54] 0.97 s 1.5× Our design 0.63 s G 2 [54] 1.81 s 1.6× Our design 1.10 s G 3 [54] 3.87 s 1.8× Our design 2.13 s 82 4.5.2 Comparison with State-of-the-Art FPGA-based Accelerators There are several algorithm-specific FPGA accelerators for BFS [71, 88]. These designs are highly optimized implementations with optimizations that are only applicable to BFS. We compare our design with these designs based on throughput performance. Table 4.5 summarizes the comparison results. Our design achieves up to 4.0× throughput improvement. Table 4.5: Comparison with state-of-the-art FPGA-based accelerators Algorithm Approach Platform Memory BW Throughput (GB/s) (MTEPS) 1CPU1 FPGA BFS [88] 12-core CPU + 32 20 550 Virtex 5 FPGA [71] 4-core CPU 17 60 166 Kintex UltraScale FPGA Our design 14-core CPU + 30 12.8 670 Arria 10 FPGA 83 Chapter 5 Accelerating Matrix Factorization for Machine Learning Applications 5.1 Motivation MF is a technique to factor a sparse matrix into the product of two low-rank dense matrices. It has been widely used in many machine learning applications, such as collaborative filtering [89], topic modeling [90], and text mining [91], which need to predict unknown data based on a collection of existing observations (e.g., customer ratings on products). MF is powerful because it can discover latent features from observations. 
An example latent feature of a customer might be his/her income level, which has an impact on his/her interest in products but cannot be directly derived from the observed data (e.g., ratings on products). Hence, the output model of MF is called the latent factor model, which consists of two dense matrices called (latent) feature matrices.

SGD is a gradient descent optimization technique used to minimize certain objective functions [92]. It is a popular learning algorithm to train the MF model. However, the training process of the SGD-based MF algorithm is very computation intensive. This is because the MF model needs to be iteratively updated based on the training data for thousands of iterations. When the volume of the training data is huge, the training time can become excessively long. Therefore, it is essential to design hardware accelerators to accelerate the training process.

5.2 Problem Definition

Without loss of generality, we define the problem based on the context of collaborative filtering for recommender systems [89]. Let U and V denote a set of users and items, and |U| and |V| denote the number of users and items, respectively. The input training dataset is a partially observed rating matrix R = {r_ij}_{|U|×|V|}, in which r_ij represents the rating of item v_j given by user u_i (0 ≤ i < |U|, 0 ≤ j < |V|). The output of the training process contains two low-rank matrices, P (a |U|×H matrix) and Q (a |V|×H matrix), which are referred to as the user feature matrix and item feature matrix, respectively. A typical value of the rank of P and Q (i.e., H) is 32 [4, 54]. The i-th row of P (denoted as p_i) constitutes a feature vector of user u_i, and the j-th row of Q (denoted as q_j) constitutes a feature vector of item v_j. The prediction of the rating of item v_j given by user u_i is the dot product of p_i and q_j:

\hat{r}_{ij} = p_i \cdot q_j = \sum_{h=0}^{H-1} p_{ih} \cdot q_{jh}    (5.1)

Given an observed rating r_ij, the prediction error is computed as err_ij = r_ij − \hat{r}_{ij}. The objective of the training process is to obtain such P and Q that minimize the overall regularized squared error based on all the observed ratings:

\min_{P,Q} \sum_{u_i \in U, v_j \in V} err_{ij}^2 + \lambda \cdot (\|p_i\|^2 + \|q_j\|^2)    (5.2)

In the objective function, λ is a constant used to introduce regularization in order to prevent overfitting. In order to minimize the objective function, SGD is used to update the feature vectors [89, 93]. SGD randomly initializes all the feature vectors and then updates them by iteratively traversing all the observed ratings until the overall squared error (i.e., \sum err_{ij}^2) converges. Given an observed rating r_ij, p_i and q_j are updated by a magnitude proportional to a constant α (i.e., the learning rate) in the opposite direction of the gradient, yielding the following updating equations:

p_i^{new} = β · p_i^{old} + err_{ij} · α · q_j^{old}    (5.3)

q_j^{new} = β · q_j^{old} + err_{ij} · α · p_i^{old}    (5.4)

In Equations (5.3) and (5.4), β is a constant whose value is equal to (1 − αλ). The algorithm requires incrementally updating the feature vectors once per rating. As a result, the ratings of the same item or given by the same user cannot be concurrently processed, because they would result in concurrent updates to the same p_i or q_j.

This algorithm can be transformed into a bipartite graph processing problem. The input matrix is converted into a weighted bipartite graph G, whose vertices can be divided into two disjoint sets, U (user vertices) and V (item vertices).
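To make the per-rating update concrete before continuing with the graph formulation, the routine below applies Equations (5.1)–(5.4) to a single observed rating. It is only a scalar reference sketch, not the accelerator datapath; the constants use the values adopted later in the experiments (H = 32, α = 0.0001, λ = 0.02), and the function name and its return value are choices made for this example.

```cpp
#include <array>

constexpr int H = 32;               // feature vector length used in the experiments
using FeatureVec = std::array<float, H>;

constexpr float kAlpha  = 0.0001f;  // learning rate (alpha)
constexpr float kLambda = 0.02f;    // regularization parameter (lambda)
constexpr float kBeta   = 1.0f - kAlpha * kLambda;   // beta = 1 - alpha*lambda

// Process one observed rating r_ij: compute the prediction (Eq. 5.1) and the
// error, then update both feature vectors (Eqs. 5.3 and 5.4). The squared error
// is returned so the caller can accumulate it for the convergence test.
float sgd_update(FeatureVec& p_i, FeatureVec& q_j, float r_ij) {
    float prediction = 0.0f;
    for (int h = 0; h < H; ++h)
        prediction += p_i[h] * q_j[h];               // Eq. (5.1): dot product
    float err = r_ij - prediction;                   // err_ij = r_ij - predicted rating

    for (int h = 0; h < H; ++h) {
        float p_old = p_i[h];
        float q_old = q_j[h];
        p_i[h] = kBeta * p_old + err * kAlpha * q_old;   // Eq. (5.3)
        q_j[h] = kBeta * q_old + err * kAlpha * p_old;   // Eq. (5.4)
    }
    return err * err;
}
```

One training iteration of Algorithm 10 (below) simply applies this routine to every edge of the bipartite graph and sums the returned squared errors to decide convergence.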
Each observed rating in R is represented as an edge connecting a user vertex and an item vertex in G. G is represented using the COO graph representation. Each edge (i.e., edge ij ) is represented as a <u i ,v j ,r ij > tuple, in which u i and v j refer to the user and item vertices connected by the edge, and r ij (i.e., edge weight) corresponds to the rating value of v j given by u i ; all the edges are stored in an edge list E; each user/item vertex maintains a feature vector whose length is H. Table 5.1 summarizes the frequently used graph notations throughout this chapter. 86 Algorithm10illustratestheSGD-basedMFalgorithmbasedonthebipartitegraph representation. Alltheedgesareiterativelyprocessedtoupdatethefeaturevectors of vertices until the overall squared error converges. When the training process terminates, the feature vectors of all the user vertices and item vertices constitute the output feature matrices P and Q, respectively. Table 5.1: Bipartite graph notations for matrix factorization Notation w w w w w w wDescription u i the user vertex with index i (0≤i<|U|) v j the item vertex with index j (0≤j <|V|) p i the feature vector of u i q j the feature vector of v j H the length of each feature vector edge ij the edge connecting u i and v j r ij the weight of edge ij Algorithm 10 SGD-based MF using bipartite graph representation MF_Train (G(U,V,E)) 1: for each user/item vertex do 2: Randomly initialize its feature vector 3: end for 4: while convergence_condition = false do 5: for each edge ij ∈E do 6: Read feature vectors p i and q j 7: Compute ˆ r ij based on Equation (5.1) 8: Compute err ij based on r ij and ˆ r ij 9: Update p i and q j based on Equations (5.3) and (5.4) 10: end for 11: end while 12: Return all the feature vectors 87 5.3 Challenges in Acceleration To achieve efficient acceleration of the SGD-based MF algorithm using FPGA, three challenges need to be carefully addressed. 5.3.1 Limited On-chip Memory Resources As the feature vectors of users and items are repeatedly accessed and updated during the processing, it is desirable to store them in the on-chip memory of the FPGA for data reuse. However, for a large training dataset that involves a large number of users and items, the feature vectors cannot fit into the on- chip memory. In this scenario, external memory (e.g., DRAM) is required to store them. However, accessing feature vectors from the external memory can incur long access latencies, which result in massive accelerator pipeline stalls and significant performance deterioration [31, 94]. Therefore, the first challenge is how to use the limited on-chip memory resources in order to achieve efficient data reuse. 5.3.2 Limited Parallelism because of Data Dependencies SGD is inherently a serial algorithm because it requires incrementally updating the training model once per training data. As a result, the ratings of the same item or given by the same user have data dependencies and cannot be processed in parallel. ThisisbecausetheconcurrentprocessingofsuchratingsleadstotheRAW data hazard. We define such data dependency among ratings as feature vector dependency. Therefore, the second challenge is how to reduce the feature vector dependencies so that the massive parallelism offered by FPGA can be efficiently exploited. 88 5.3.3 Concurrent Accesses to Dual-port On-chip RAMs FPGA accelerators commonly use parallel processing units to increase processing throughput [31, 78]. 
However, the native on-chip RAMs (e.g., BRAM) support only dual-port accesses (one read port and/or one write port) [95, 96, 97, 98]. When multiple processing units concurrently access the same RAM based on dis- tinct memory addresses, these memory accesses have to be serially served. This leads to additional latency to resolve the access conflicts as well as performance deterioration. Therefore, the third challenge is how to schedule the execution of edges to reduce such access conflicts. 5.4 Optimizations We propose three novel algorithmic optimizations to overcome the three challenges described in Section 5.3. 5.4.1 Graph Partitioning and Communication Hiding In order to address the challenge described in Section 5.3.1, we partition G into induced subgraphs (ISs) in order to achieve two goals: (1) the feature vectors of the vertices in each induced subgraph can fit in the on-chip memory of the FPGA, and (2) the computation for processing each IS can completely hide the communication cost. LetL (N) denote the on-chip storage capacity in terms of the number of feature vectors for user (item) vertices. We partition U into l disjoint vertex subsets {U 0 ,...,U l−1 }, each of sizeL at most, wherel =d |U| L e. Similarly,V is partitioned into{V 0 ,...,V n−1 }, each of size N at most, where n =d |V| N e. We will introduce our proposed algorithm to perform the partitioning of U and V in details later. 89 LetE xy denote a subset ofE that consists of all the edges connecting the vertices belonging to U x and V y in G (0≤x<l, 0≤y <n). U x ,V y , and E xy form an IS of G [99]. The necessary condition for the on-chip buffering of all the feature vectors of each IS is as follows: |U x |≤L,∀x∈ [0,l) & |V y |≤N,∀y∈ [0,n) (5.5) As we ensure that each user (item) vertex subset has no more than L (N) vertices during the partitioning, the necessary condition for the on-chip buffering of the feature vectors can be easily satisfied. Because there are l user vertex subsets and n item vertex subsets, the total number of ISs after the partitioning is l×n. In each iteration of the training process, these ISs are sequentially processed by our FPGA accelerator based on Algorithm 11. Note that during the processing of the edges in E xy , all the feature vectors of the vertices in U x andV y have been prefetched and buffered into an on- chip buffer of FPGA; therefore, the processing units of the accelerator can directly access the feature vectors from the on-chip buffer, rather than from the external memory. Algorithm 11 Scheduling of induced subgraph processing 1: while Convergence_condition = false do 2: for x from 0 to l− 1 do 3: Load feature vectors of U x into on-chip buffer 4: for y from 0 to n− 1 do 5: Load feature vectors of V y into on-chip buffer 6: Process all the edges∈E xy 7: Write feature vectors of V y into external memory 8: end for 9: Write feature vectors of U x into external memory 10: end for 11: end while 90 Double buffering is a widely used technique by FPGA architectures to hide the communication cost for data transfer between the FPGA and external memory [8, 19]. Usingdoublebuffering, wecanpipelinetheprocessingofISswhileoverlapping communication and computation of each IS with its predecessor/successor. 
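As a host-side analogue of this overlap (on the FPGA the same effect is obtained with two sets of on-chip buffers), the sketch below prefetches the next IS and writes back the previous one while the current IS is being computed. ISData and the three helper routines are placeholders standing in for the real data movement and edge processing; only the ordering of operations matters here.

```cpp
#include <future>

// Placeholder for the data of one induced subgraph (IS):
// the feature vectors of U_x and V_y plus the edge list E_xy.
struct ISData {};

ISData load_is(int tau)         { /* read feature vectors and edges of IS_tau (T_rd) */ return ISData{}; }
void   store_is(ISData data)    { /* write updated feature vectors of IS_tau back (T_wr) */ }
void   compute_is(ISData& data) { /* process all edges of IS_tau, i.e., P(IS_tau) */ }

// Software analogue of the schedule in Figure 5.1: while IS_tau is being computed,
// IS_{tau+1} is prefetched and the results of IS_{tau-1} are written back.
void process_all_subgraphs(int num_is) {
    if (num_is == 0) return;
    ISData current = load_is(0);
    std::future<void> writeback;                       // pending write-back of the previous IS
    for (int tau = 0; tau < num_is; ++tau) {
        std::future<ISData> prefetch;
        if (tau + 1 < num_is)
            prefetch = std::async(std::launch::async, load_is, tau + 1);  // T_rd of IS_{tau+1}
        compute_is(current);                           // P(IS_tau), overlapping both transfers
        if (writeback.valid()) writeback.get();        // ensure the previous write-back finished
        writeback = std::async(std::launch::async, store_is, current);    // T_wr of IS_tau
        if (tau + 1 < num_is) current = prefetch.get();
    }
    if (writeback.valid()) writeback.get();            // drain the last write-back
}
```

The analysis that follows makes precise when the computation time P(IS_τ) is long enough for both transfers to be completely hidden.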
To illustrate our idea, we define the following terms: •|IS τ |: the number of edges in an induced subgraph IS τ ,τ∈ [0,l×n) • Ω: the number of edges that can be processed per unit of time • P (IS τ ): the computation time to process all the edges ofIS τ (i.e.,P (IS τ ) = |ISτ| Ω ) • T rd τ : the communication time resulting from reading the feature vectors of IS τ from the external memory • T wr τ : the communication time resulting from writing the feature vectors of IS τ into the external memory As shown in Figure 5.1, we pipeline the processing of ISs by overlapping the computation time P (IS τ ) of IS τ with the writing of feature vectors from IS τ−1 and the reading of feature vectors fromIS τ+1 . We derive thesufficient condition for a complete overlap of communication and computation as follows: P (IS τ )≥T wr τ−1 +T rd τ+1 ,∀τ∈ [0,l×n) (5.6) By replacing the P (IS τ ) with |ISτ| Ω , we obtain: |IS τ |≥ Ω× (T wr τ−1 +T rd τ+1 ),∀τ∈ [0,l×n) (5.7) 91 Step 1 Step 3 Step 1 Step 2 Step 2 Step 3 Step 1 Step 2 Step 3 Time 𝑆ℎ𝑎𝑟𝑑 𝑥𝑦 𝑆ℎ𝑎𝑟𝑑 𝑥 ′ 𝑦 ′ 𝑆ℎ𝑎𝑟𝑑 𝑥 ′′ 𝑦 ′′ … … … 𝑃 (𝐼 𝑆 𝑧 −1 ) 𝑃 (𝐼 𝑆 𝑧 ) 𝑃 (𝐼 𝑆 𝑧 +1 ) Time 𝐼𝑆 𝑧 −1 𝐼 𝑆 𝑧 𝐼 𝑆 𝑧 +1 … … … 𝑇 𝑧 −1 𝑟𝑑 𝑇 𝑧 −1 𝑤𝑟 𝑇 𝑧 𝑟𝑑 𝑇 𝑧 𝑤𝑟 𝑇 𝑧 +1 𝑟𝑑 𝑇 𝑧 +1 𝑤𝑟 Figure 5.1: Pipelined processing of ISs Therefore, besides satisfying the necessary condition for the on-chip buffering of feature vectors, a desirable graph partitioning approach should also ensure that each obtained IS has a sufficient amount of edges for the computation to com- pletely hide the communication. Graph partitioning is a classic problem and manysophisticatedgraphpartitioningalgorithmshavebeendeveloped[100]. How- ever, sophisticated approaches usually introduce significant pre-processing over- head. Our intuition is that it is not necessary to invest significantly in develop- ing complex partitioning algorithms (in terms of pre-processing time); rather any reasonably fast algorithm that satisfies Inequalities (5.5) and (5.7) is acceptable. The vertex-index-based partitioning approach introduced in Section 3.3.1 can be applied here to satisfy the necessary condition for on-chip vertex buffering. How- ever, this approach can lead to significant data imbalance such that someISs may have very few edges; in this scenario, the communication cost cannot be com- pletely hidden by the computation. Therefore, we propose a new heuristic graph partitioning approach that is not only simple and fast, but also leads to a bal- anced partitioning effect such that each obtainedIS has sufficient edges to satisfy Inequality (5.7). 92 We define the subset degree of a vertex subset as the total number of edges connecting to the vertices in the subset. When we partition U and V into vertex subsets, we attempt to pack vertices into each disjoint vertex subset such that the subset degrees are close to each other. However, the most important criterion is to ensure that the number of vertices in each vertex subset is bound by L and N. Algorithm 12 illustrates our approach to partition U into U 0 ,··· ,U l−1 ; V is partitioned based on the same methodology. We first identify the vertex degree of each vertex (i.e., the number of edges connected to the vertex) and sort all the vertices based on the vertex degree in descending order. Then, we greedily assign each vertex into the vertex subset that has the minimum subset degree, until all the vertices are assigned, subject to the subset size condition. 
When each vertex is assigned to a vertex subset, we assign a new vertex index to it (Algorithm 12, Line 18), which indicates the vertex subset that it belongs to and its index in the vertex subset. After U and V are partitioned, we reorder the vertices based on the new indexes such that the feature vectors of the vertices belongingtothesamevertexsubsetarestoredcontiguouslyintheexternalmemory. Since user and item vertices are reordered, we also re-index the user and item indexes of each edge and partition the edges into induced subgraphs based on the new indexes. 5.4.2 Parallelism Extraction To address the challenge described in Section 5.3.2, we further partition the edges in each IS into a list of non-overlapping matchings, such that each matching consists of a set of independent edges without any common vertices. Therefore, the edges in the same matching do not have any feature vector dependencies and can be independently processed in parallel. 93 Algorithm 12 Partition U into l subsets U 0 ,··· ,U l−1 Let u i . degree denote the number of edges connected to u i (0≤i<|U|) Let U x . size denote the number of vertices in U x (0≤x<l) Let U x . degree denote the subset degree of U x (i.e., U x . degree = P u i . degree ,∀u i ∈U x ) Partition (U, L, l) 1: for x from 0 to l− 1 do 2: U x ←? 3: U x . degree ← 0 4: U x . size ← 0 5: end for 6: Sort U based on vertex degree in descending order 7: for each u i ∈U do 8: subset_id←−1 9: min_degree←|E| 10: for x from 0 to l− 1 do 11: if min_degree>U x . degree and U x . size <L then 12: subset_id←x 13: min_degree←U x . degree 14: end if 15: end for 16: U subset_id ←U subset_id ∪u i 17: U subset_id . degree ←U subset_id . degree +u i . degree 18: u i . new_user_id ←subset_id×L +U subset_id . size 19: U subset_id . size ←U subset_id . size + 1 20: end for 21: Return U 0 ,··· ,U l−1 We partition the edges in eachIS into matchings based on the graph theory of edge-coloring[99], whichassigns“colors”totheedgesofabipartitegraphsuchthat any two adjacent edges do not have the same color. After all the edges have been colored, the edges having the same color form a matching. A classic edge-coloring algorithm has been introduced in [99]. However, this algorithm can result in small matchings, in which there are very few edges (e.g., only 1 edge). When such small matchings are processed, the parallelism provided by the hardware accelerator (i.e., parallel processing units) cannot be fully utilized. Therefore, we propose a new edge-coloring algorithm that avoids small matchings. As shown in Algorithm 94 13, we maintain all the matchings using a linked list (i.e., M_List). During the edge-coloring process, the linked list keeps track of the size of each matching (i.e., the number of edges in the matching), and arranges the matchings based on their sizes in a non-descending order. When coloring an edge, we traverse the linked list and assign the edge to the first matching whose color is appropriate. Therefore, when an edge has multiple color options, the color of the matching that has the minimum number of edges is selected. 
Algorithm 13 Partition edges of an IS into matchings Let M_List denote a linked list of matchings Let M denote the matching being examined Let M.size denote the number of edges in the matching M Let M.color denote the color of the matching M Let M next denote the next matching in M_List linked by M Partition (IS) 1: for each edge e ij ∈IS do 2: M←M_List.head 3: while M6=M_List.tail do 4: if u i or v j has an edge colored by M.color then 5: M←M next 6: else 7: Color e ij using M.color 8: M←M∪e ij 9: M.size←M.size + 1 10: while M.size>M next .size do 11: Swap M and M next in M_List 12: end while 13: Go to Line 1 14: end if 15: end while 16: Create an new matching M←? 17: M←M∪e ij 18: M.size← 1 19: M_List.addFirst(M) 20: end for 21: Return M_List 95 5.4.3 Edge Scheduling The architecture of our accelerator has K parallel processing units sharing an on- chip buffer, which is organized in 2K ∗ (K ∗ ≥ K) memory banks with separate banks for users and items (see Section 5.5); therefore, a batch of K edges is fed into the processing units and processed at a time. However, because of the dual- port nature of on-chip RAMs [95, 98], each bank can serve only one read access and one write access per clock cycle. If there is a bank conflict between two or more accesses within a batch, the memory requests to process the edges have to be serially served. Hence, the latency (in terms of clock cycles) to resolve the bank conflict(s) within a batch is equal to the maximum number of accesses to the same bank within the batch. In order to reduce the bank conflicts, we develop a batching algorithm in order to schedule the processing of edges. As shown in Algorithm 14, the algorithm aims to partition the edges of a matching into batches, with each batch having K edges. We define a threshold value, Δ, to restrict the upper bound of bank conflicts allowed within a batch. Δ is initially set to 0. Then we sequentially traverse the edges and assign an edge into a batch if its addition does not violate the threshold condition or the batch size condition. If there are still unassigned edges after all the edges have been traversed, we increase Δ by 1 and traverse the unassigned edges again. The same procedures are repeated until all the edges have been signed into a batch. 96 Algorithm 14 Partition edges of a matching into batches Let K denote the maximum number of edges that can be assigned in a batch Let Count_BC(B,e) denote a function to count the number of edges in batch B that have bank conflict with edge e Partition (M) 1: Create b =d M.size K e empty batches, B 0 ,··· ,B b−1 2: Δ← 0 3: while M6=? do 4: for each edge e∈M do 5: for i from 0 to b− 1 do 6: if B i .size<K and Count_BC(B i ,e)≤ Δ then 7: B i ←B i ∪e 8: M←M\e 9: Break 10: end if 11: end for 12: end for 13: Δ← Δ + 1 14: end while 15: Return B 0 ,··· ,B b−1 5.5 Accelerator Design 5.5.1 Overall Architecture The overall architecture of our FPGA accelerator is depicted in Figure 5.2. The external memory connected to the FPGA accelerator stores all the edges and the feature vectors of all the user and item vertices. Before an IS is processed, the featurevectorsofalltheverticesbelongingtotheIS havebeenstoredinthefeature vector buffer (FVB), which is organized as memory banks of UltraRAMs (see Section5.5.3). WhenanIS isprocessed, FPGAfetchestheedgesfromtheexternal memory and stores them into a first-in-first-out edge queue (EQ). Whenever the EQ is not full, the FPGA pre-fetches edges from the external memory. 
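The batching procedure above relies on a helper, Count_BC(B, e), that is named in Algorithm 14 but not spelled out. Assuming the modular bank mapping described in Section 5.5.3 (p_i resides in user bank i mod K*, q_j in item bank j mod K*), one plausible software reading of that helper is the following; the struct layout and the bank count are assumptions of this sketch.

```cpp
#include <cstdint>
#include <vector>

// An edge of the bipartite graph, reduced to the vertex indices that
// determine which FVB banks its feature vectors live in.
struct EdgeRef {
    uint32_t user;   // index i of u_i
    uint32_t item;   // index j of v_j
};

constexpr uint32_t kNumBanksPerSide = 8;   // K*, an assumed bank count

// Feature vectors are interleaved across banks in a modular fashion:
// p_i lives in user bank i % K*, q_j lives in item bank j % K*.
inline uint32_t user_bank(uint32_t i) { return i % kNumBanksPerSide; }
inline uint32_t item_bank(uint32_t j) { return j % kNumBanksPerSide; }

// Count_BC(B, e): the number of edges already placed in batch B whose accesses
// would hit the same user bank or the same item bank as edge e while touching
// a different vertex, i.e., a genuine bank conflict.
int count_bank_conflicts(const std::vector<EdgeRef>& batch, const EdgeRef& e) {
    int conflicts = 0;
    for (const EdgeRef& other : batch) {
        bool user_conflict = other.user != e.user && user_bank(other.user) == user_bank(e.user);
        bool item_conflict = other.item != e.item && item_bank(other.item) == item_bank(e.item);
        if (user_conflict || item_conflict) ++conflicts;
    }
    return conflicts;
}
```

Any conflicts that this offline scheduling cannot remove are handled at run time by the bank conflict resolver described next.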
A batch of edges are fed into the bank conflict resolver (BCR) at a time and output in one 97 or multiple clock cycles, such that the edges output in the same clock cycle do not result in any bank conflict accesses to the FVB. Then, the edges output by the BCR are checked by the hazard detection unit (HDU) to determine whether they are free from data hazard to be processed. If an edge has no feature vector dependency with any edge being processed in the processing engine (PE), it is sent into the PE; otherwise, accelerator stalls occur until the dependency is resolved. ThePEconsistsofK processingunitsthatprocessdistinctedgesinparallel. These processing units access the feature vectors of user and item vertices from the FVB. Feature Vector Buffer 𝐅𝐏𝐆𝐀 Memory Controller Edges User Feature Vectors Item Feature Vectors Processing Engine Bank Conflict Resolver Hazard Detection Unit 𝐄𝐱𝐭𝐞𝐫𝐧𝐚𝐥 Memory Processing Unit 𝐾 −1 Edges Edge Queue … Processing Unit 0 … Figure 5.2: Overall architecture 98 5.5.2 Processing Engine The PE consists of K parallel processing units that concurrently process distinct edges. We show the architecture of each processing unit in Figure 5.3. Each input edge is processed based on the follow steps: 1. Based on the user and item vertex indices of the edge, the processing unit reads the feature vectors (i.e., p i and q j ) from the FVB. 2. The prediction ˆ r ij is computed based on p i and q j ; meanwhile, p i and q j are multiplied with the constants (i.e.,α andβ) to obtainαp i ,αq j ,βp i , andβq j . 3. Oncethepredictionerrorerr ij isobtained,p new i andq new j arecomputedbased on Equations (5.3) and (5.4). 4. The updated feature vectors (i.e., p new i and q new j ) are written into the FVB. The dot product of p i and q j is computed in a binary-reduction-tree fashion, requiringH (i.e., the length of each feature vector) multipliers and (H− 1) adders in total. Therefore, each processing unit contains 7H multipliers, (3H− 1) adders, 1 subtractor, 1 squarer, and 1 accumulator. This results in a peak computing throughput of (10H + 2) floating point operations per clock cycle. The processing unit is fully pipelined in order to process one edge per clock cycle. 5.5.3 Feature Vector Buffer As the K processing units of the PE need to concurrently access the FVB to read and write distinct feature vectors, there can be up to 2K read requests and 2K write requests in each clock cycle. However, the native on-chip RAMs of FPGA,suchasBRAMsandUltraRAMs, provideonlytwoportsforreadingand/or writing [95, 98]. There are three major approaches to build multiport memory 99 𝑞 𝑗0 𝑝 𝑖 ∙𝑞 𝑗 𝑟 𝑖𝑗 − 𝑟𝑖𝑗 𝑝 𝑖 𝑞 𝑗 ×𝛼 ×𝛼 ×𝛽 𝑒𝑟𝑟 𝑖𝑗 + + 𝛽𝑝 𝑖 𝛽𝑞 𝑗 𝑒𝑟𝑟 𝑖𝑗 ∙𝛼𝑞 𝑗 𝑒𝑟𝑟 𝑖𝑗 ∙𝛼𝑝 𝑖 × × 𝑒𝑟𝑟 2 ×𝛽 𝑞 𝑗 𝑛𝑒𝑤 𝑛𝑒𝑤 𝑝 𝑖 𝑛𝑒𝑤 𝑛𝑒𝑤 Feature Vector Buffer 𝑒𝑟𝑟 2 × 𝑝 𝑖0 × 𝑝 𝑖1 𝑞 𝑗1 + … + … … … 𝑟 𝑖𝑗 Figure 5.3: Architecture of the processing unit using dual-port on-chip RAMs, including multi-pumping [101], replication [95], andbanking[102]. Multi-pumpinggainsadditionalportsbyrunningtheprocessing unitswithaK×lowerfrequencythanthememory. Consequently, thissignificantly deteriorates the clock rate of the processing units for a large K (e.g., K = 8) [95, 96]. Replication-based approaches, such as LVT [96] and XOR [95], create replicas of all the stored data to provide additional ports and keep track of which replica has the most recently updated value for each data element. 
However, the amount of RAMs needed in implementing this approach grows quadratically with 100 the number of ports, such that K×K replicas are required to support K read ports andK write ports. Additionally, the clock rate can degrade below 100 MHz when the width and depth of the memory are large (e.g., 1Kbit× 16K) [97, 103]. To support a large buffer capacity and sustain a high clock rate, we use the banking approach [102] to build the multiport FVB. This approach divides the memory into equal-sized banks and interleaves these banks to provide higher access bandwidth (i.e., more read and write ports). As shown in Figure 5.4, the FVB contains two parts of equal size, one for storing user feature vectors and the other for storing item feature vectors. Each part is divided into K ∗ banks (K ∗ ≥ K), and each bank is implemented using a dual-port UltraRAM [98]. Therefore, the FVB provides 2K ∗ read ports and 2K ∗ write ports in total. Feature vectors of vertices are stored into the FVB in a modular fashion based on the vertex indices, such that p i is stored in the (i%K ∗ ) th user bank and q j is stored in the (j%K ∗ ) th item bank. Hence, the feature vector of any user (item) can be accessed from the FVB based on the user (item) vertex index without complex index-to-address translation. However, the banked FVB cannot handle concurrent accesses to the same bank for distinct feature vectors. Such memory accesses are defined as bank conflict accesses, and they may occur for both user and item feature vectors. To address this issue, we develop a BCR in order to avoid any bank conflict accesses. As illustrated in Figure 5.5, the BCR fetches a batch of K edges from the EQ at a time and uses parallel detectors to detect the potential bank conflicts among the edges. Then, it outputs the edges to the HDU in one or multiple clock cycles, such that the edges output in the same clock cycle have the feature vectors stored in distinct banks of the FVB. Therefore, concurrent accesses to the same bank 101 Feature Vector Buffer 𝑝 0 𝑝 𝐾 ∗ … User Bank 0 𝑝 𝐾 ∗ −1 𝑝 2𝐾 ∗ −1 … User Bank 𝐾 ∗ −1 … … … … … 𝑝 1 𝑝 𝐾 ∗ +1 … … … … … R/W Requests R/W Responses … … User Banks … Item Banks … Figure 5.4: Multiport FVB based on banking will not occur. However, this also leads to additional clock cycles to resolve the bank conflicts within a batch; in the worst case, when all the edges in a batch conflict with each other, the BCR takes K clock cycles to output all the K edges in the batch. 5.5.4 Hazard Detection Unit Astheedgesofdistinctmatchingscanhavecommonvertices, featurevectordepen- dencies exist among the edges of distinct matchings. Therefore, when the edges output from the BCR and the edges being processed in the PE belong to distinct 102 Output the edge and clear the buffer Detect bank conflict Edge buffers Buff 0 Buff 1 Buff 2 Buff 3 Bank Conflict Resolver Output edges Input edges Figure 5.5: Architecture of the bank conflict resolver for K = 4 matchings, RAW data hazards may occur. The HDU is responsible for detecting feature vector dependencies and preventing RAW data hazards. We design the HDU using BRAMs based on a fine-grained locking mechanism. The HDU main- tains a 1-bit lock for each vertex of the IS being processed. A lock with value 1 means that the feature vector of the corresponding vertex is being computed by the PE, and thus cannot be accessed at this time. 
For each input edge, the HDU checks the locks of both the user and the item vertices; if both the locks are 0 (i.e., unlocked), the edge is fed into the PE and the locks are set to 1 (i.e., locked); otherwise, the HDU generates a pipeline stall signal to stall the accelerator until both the locks become 0. Note that when the PE writes an updated feature vec- tor into the FVB, it also sends unlock signals to the HDU to set the lock of the corresponding vertex back to 0. Therefore, deadlock will not occur. 103 5.6 Performance Evaluation 5.6.1 Experimental Setup We conduct experiments based on a state-of-the-art Virtex UltraScale+ xcvu9pflgb2104 FPGA . This FPGA device has 1,182,240 slice LUTs, 2,364,480 slice registers, 6,840 DSPs, and up to 43.3 MB of on-chip RAMs. Two DDR4 chips are connected to the FPGA as the external memory. Each DRAM has 16 GB capacity and a peak access bandwidth of 19.2 GB/s. Our FPGA designs are implemented in RTL using Verilog. We use large real-life sparse matrices to evaluate our designs. Table 5.2 sum- marizes the characteristics of the datasets. In all the experiments, the length of each feature vector is 32 (i.e., H = 32) with each element represented using IEEE 754 single precision format. We adopt standard learning rate α = 0.0001 and regularization parameter λ = 0.02. Table 5.2: Large real-life sparse matrices used for the experiments Dataset # users (|U|) # items (|V|) # ratings (|E|) Density ( |E| |U|×|V| ) Libim [104] 135 K 168 K 17,359 K 7.6×10 −4 Netflix [89] 480 K 17 K 100,480 K 1.2×10 −2 Yahoo [105] 1,200 K 137 K 460,380 K 2.8×10 −3 5.6.2 Performance Metrics We evaluate the performance of our designs based on the following metrics: • Resource utilization: the percentages of basic FPGA resources utilized by the accelerator, including the slice LUT, register, BRAM, UltraRAM, and DSP 104 • Power: the power consumption of the FPGA accelerator • Execution time: the elapsed time to complete one iteration of the SGD-based MF algorithm • Throughput: the number of floating point operations performed per second (GFLOPS) 5.6.3 Resource Utilization and Power Consumption Table 5.3 shows the resource utilization, clock rate, and power consumption of our FPGA accelerator for various number of processing units. The reported results are obtained through post-place-and-route simulations using Xilinx Vivado Design Suite 2018.1 [80]. The capacity of the FVB is set to 64K feature vectors. Table 5.3: Resource utilization, clock rate and power consumption K LUT Register DSP On-chip RAM (%) Clock rate Power (%) (%) (%) Block RAM UltraRAM (MHz) (Watt) 1 7.0 4.6 3.8 1.2 37.5 171 5.8 2 14.3 8.2 7.5 1.2 37.5 167 7.3 4 30.7 17.2 15.1 1.2 37.5 161 12.1 8 63.9 33.4 30.2 1.2 37.5 150 20.1 5.6.4 Pre-processing Time and Training Time Table 5.4 and Table 5.5 report the pre-processing time and training time, respec- tively. The pre-processing includes the three proposed optimizations in Section 5.4, and it is performed by an Intel Xeon E5-2650 processor. The training is per- formed by our FPGA accelerator. Note that the pre-processing is performed only once, whereas the training is an iterative process. Therefore, the pre-processing 105 time can be amortized and is negligible compared with the training time. In Table 5.5, we also report the total number of iterations and the execution time for each iteration. Table 5.4: Pre-processing time Dataset Opt. 1 Opt. 2 Opt. 
3 Total (Section 5.4.1) (Section 5.4.2) (Section 5.4.3) Libim 0.4 sec 4.4 sec 1.6 sec 6.4 sec Netflix 1.0 sec 10.7 sec 5.4 sec 17.1 sec Yahoo 5.5 sec 42.3 sec 18.9 sec 66.7 sec Table 5.5: Training time Dataset Total training time # iterations to converge T exec per iteration Libim 360.8 sec 11,568 0.03 sec Netflix 876.4 sec 5,766 0.15 sec Yahoo 2536.5 sec 3,714 0.68 sec 5.6.5 Throughput vs. Parallelism Inthissection, wevarythenumberofprocessingunits(K)from1to8toexploreits impact on throughput performance. Figure 5.6 shows the throughput performance for various K. We have the following observations: • The throughput performance significantly improves asK increases for all the three datasets. • For K = 8, our FPGA accelerator sustains a high throughput of 165 GFLOPS for Libim, 213 GFLOPS for Netflix, and 217 GFLOPS for Yahoo, respectively. 106 • ThethroughputperformanceofLibimisworsethanthatofNetflixandYahoo for K = 4 and 8. This is because the Libim dataset is much sparser than the other two datasets, thus resulting in more small matchings that cannot fill up the processing units. 0 100 200 300 1 2 4 8 Libim Netflix Yahoo Throughput (GFLOPS) Number of processing units (𝐾 ) Figure 5.6: Throughput for various K 5.6.6 Impact of the Optimizations To show the effectiveness of our three proposed optimizations, we compare our optimized design with several non-optimized FPGA-based baseline designs. All the comparisons are based on K = 8. 5.6.6.1 Bank Conflict Reduction We first explore the effectiveness of Optimization 3 (Section 5.4.3) in reducing the number of bank conflicts. Here, the baseline design used for the comparison only performs Optimization 1 and Optimization 2 during the pre-processing. Table 5.6 107 summarizes the results of the comparison. We observe that Optimization 3 reduces the number of bank conflicts by 2.4× to 4.2×. As a result, the execution time per iteration is improved by 1.3× to 1.5×. Table 5.6: Bank conflict reduction because of Optimization 3 Dataset # bank conflicts per edge Reduction T exec per iteration (sec) Speedup Optimized Baseline Optimized Baseline Libim 0.067 0.161 2.4× 0.03 0.04 1.3× Netflix 0.039 0.166 4.2× 0.15 0.23 1.5× Yahoo 0.042 0.164 3.9× 0.68 1.03 1.5× 5.6.6.2 Data Dependency Reduction We further explore the impact of Optimization 2 (Section 5.4.2) which aims to reduce the number of pipeline stalls because of feature vector dependencies. For comparison purpose, the baseline design performs Optimization 1 and Optimiza- tion3only. Table5.7summarizestheeffectivenessofthisoptimization. Weobserve that the optimized design dramatically reduces the number of bank conflicts by 28.8× to 60.3× and thus achieves 13.3× to 15.4× speedup. Table 5.7: Pipeline stall reduction because of Optimization 2 Dataset # pipeline stalls per edge Reduction T exec per iteration (sec) Speedup Optimized Baseline Optimized Baseline Libim 0.115 3.313 28.8× 0.03 0.40 13.3× Netflix 0.061 3.133 51.3× 0.15 2.19 14.6× Yahoo 0.054 3.259 60.3× 0.68 10.45 15.4× 108 5.6.6.3 Communication Cost Reduction Lastly, westudytheimpactofOptimization1(Section5.4.1)toreducethecommu- nication cost. We define communication cost as the data transfer time between the FPGA and external memory. The baseline design for the comparison also performs Optimization 2 and Optimization 3; however, when partitioning the input graph intoISs, thebaselinedesignusesthevertex-index-basedpartitioningapproach(see Section 3.3.1) rather than our proposed heuristic partitioning approach (Algorithm 12). 
We first compare the partitioning effect of these two approaches. Table 5.8 lists the maximum size, minimum size, and average size of ISs with respect to the number of edges after the input graph is partitioned. It can be observed that theISs obtained by our heuristic approach have similar number of edges, whereas theISs obtained by the vertex-index-based partitioning approach can vary signif- icantly in size. Table 5.8: Comparison between two partitioning approaches Dataset Approach |IS| max |IS| min |IS| avg Libim Algorithm 12 591 K 524 K 579 K Vertex-index-based approach 703 K 175 K Netflix Algorithm 12 6,699 K 6,699 K 6,699 K Vertex-index-based approach 6,929 K 4,501 K Yahoo Algorithm 12 2,487 K 2,288 K 2,465 K Vertex-index-based approach 3,086 K 288 K Table 5.9 summarizes the results of the comparison with respect to communi- cation cost. For all the three datasets, the optimized design is able to completely hide the communication cost, whereas the baseline design cannot completely do so for the Libim and Yahoo datasets. This is because the baseline design has small ISs that do not have enough edges to completely hide the communication cost. 109 Table 5.9: Communication cost reduction because of Optimization 1 Dataset Unhidden communication T exec per cost per iteration (sec) iteration (sec) Optimized Baseline Optimized Baseline Libim 0 0.005 0.031 0.036 Netflix 0 0 0.15 0.15 Yahoo 0 0.04 0.68 0.72 5.7 Comparison with State-of-the-Art 5.7.1 Comparison with Multi-core Implementations There are several multi-core-based graph-processing frameworks that support MF. Representative examples include GraphMat [54] and GraphLab [59]. However, most of these frameworks implement gradient-descend-based MF, as it can be eas- ily expressed as a vertex-centric program. GD-based MF accumulates the interme- diate updates for each feature vector and performs the update after all the ratings have been traversed in an iteration. Therefore, it updates each feature vector only once per iteration and thus requires more iterations to converge and more training time than SGD-based MF [106] (e.g., 40× more iterations to train Netflix [107]). Native [107] is the-state-of-the-art multi-core implementation for SGD-based MF. It partitions the input training matrix into submatrices, and exploits submatrix- level parallelism to concurrently process the submatrices that do not have feature vector dependencies by using distinct CPU cores. However, the submatrices can vary significantly in size, thus resulting in load imbalance among the CPU cores and increasing the synchronization overhead. Table 5.10 shows the comparison results between Native [107] and our FPGA accelerator based on the same dataset (i.e., Netflix). Our FPGA accelerator achieves 13.3× speedup. 110 Table 5.10: Comparison with state-of-the-art multi-core implementation Approach Platform T exec per iteration (sec) Speedup [107] 24-core Intel E5-2697 2.00 13.3× Our design Virtex UltraScale+ 0.15 5.7.2 Comparison with GPU Implementations In this section, we compare the performance of our FPGA accelerator with two state-of-the-art GPU implementations [108, 109]. In [108], several scheduling schemes for parallel thread execution on the GPU are developed and compared. However, lock-free static scheduling schemes cannot efficiently exploit the thou- sands of cores on the GPU, and the dynamic scheduling schemes require memory locks to handle featurevector dependencies and thusresult in a significant synchro- nization overhead. 
In [109], the GPU design focuses on optimizing the memory performance of the GPU by exploiting warp shuffling, memory coalescing, and half-precision (i.e., using 16 bits to represent a floating point number) techniques. Table 5.11 summarizes the comparison results of our FPGA design with [108, 109] for training the same dataset (i.e., Netflix). Our design achieves 12.7× and 2.5× speedup compared with [108] and [109], respectively. Note that the performance improvement is achieved with fewer cores (i.e., processing units), a lower clock frequency, lower memory bandwidth, and lower power consumption. 111 Table 5.11: Comparison with GPU implementations Approach Platform T exec per Speedup Throughput Improvement # cores Frequency Mem. BW Power iteration (GFLOPS) (MHz) (GB/s) (Watt) (sec) [108] 2880 745 288 235 1.90 1.0× 8.6 1.0× [109] 3072 1000 360 250 0.38 5.0× 171.4 20.0× Our design 8 150 38 20 0.15 12.7× 213.3 24.8× 112 Chapter 6 Conclusion In this thesis, we focused on designing FPGA-based accelerators for graph ana- lytics. We demonstrated the applicability of our proposed designs by accelerating variousgraphanalyticsalgorithms. Weusedbothreal-worldandsyntheticdatasets in our experimental evaluations to show the effectiveness of our designs. Compared withstate-of-the-artacceleratorsforgraphanalytics,theproposeddesignsachieved better performance with respect to execution time, throughput, and energy effi- ciency. The innovation of this thesis is twofold: • Algorithm-oriented architecture design. We use pipelining and multi- processing techniques to design highly parallel FPGA architectures for sus- taining a high throughput for graph algorithms. • Architecture-aware algorithmic optimization. We propose various novel optimizations, such as data layout optimization, parallelism extrac- tion, graph partitioning, vertex buffering, and edge scheduling, to achieve efficient acceleration by the proposed accelerators. 6.1 Summary of Contributions With respect to designing an FPGA framework for accelerating graph analytics, our contributions included the following: • We demonstrated the applicability of the framework by accelerating SpMV, PR, SSSP, and WCC. 113 • We developed a design automation tool that obtained users’ input and auto- matically generated a synthesizable RTL design. • We proposed a graph partitioning approach to enable efficient vertex buffer- ing using on-chip memory. • We proposed an optimized data layout to improve memory performance and reduce data communication. • Our framework sustained a very high throughput for the four studied algo- rithms. Compared with highly optimized multi-core implementations, our framework achieved up to 38× speedup. Compared with the state-of-the-art FPGA frameworks, our design achieved up to 50× higher throughput. From the perspective of exploration on the CPU-FPGA heterogeneous architecture for accelerating non-stationary graph algorithms, our contributions included the following: • We conducted a detailed comparison between the VCP and ECP. Based on their key characteristics, we proposed a hybrid algorithm to dynamically select between them during execution. • We proposed a hybrid data structure to concurrently support both the VCP and ECP. • We proposed a graph partitioning scheme to enable efficient concurrent exe- cution on heterogeneous platforms. • We developed an FPGA accelerator to accelerate our hybrid algorithm. We accelerated BFS and SSSP by using our design methodology. 
114 • Compared with highly optimized algorithm-specific FPGA-based designs, our design achieved up to 4× throughput improvement. With respect to accelerating SGD-based MF using FPGA, we made the following contributions: • We developed a highly parallel accelerator that consisted of parallel process- ing units with a shared on-chip buffer. • We transformed MF into a bipartite graph processing problem and leveraged graph theory to improve the performance. • We developed a simple and fast partitioning heuristic to partition the bipar- tite graph into ISs. This enabled the on-chip buffering of vertices and com- pletely hid communication with computation. • We developed scheduling algorithms to schedule the execution of edges in order to reduce data dependencies and bank conflicts. • The experimental results showed that the proposed FPGA accelerator sus- tained a high throughput of up to 217 GFLOPS for training large real-life sparse datasets. • Compared with state-of-the-art GPU implementations, our FPGA accelera- tor achieved up to 12.7× speedup and 24.8× throughput improvement. 115 6.2 Future Work Accelerating graph analytics using FPGA is still a young and promising research field. In this section, we illustrate some advanced topics, including emerging tech- nologies and unexplored areas for graph analytics acceleration, and we highlight some potential future research directions. 6.2.1 Emerging Memory Technologies Emerging memory technologies that enable near-memory and in-memory process- ing offer significant potential toward implementing latency-bounded computations. Altera Stratix 10 DRAM system-in-package technology [110] integrates 3D-stacked high-density DRAM closely to the state-of-the-art 14nm FPGA fabric within the same package. This memory technology can provide up to 10× higher access bandwidth than DDR4 SDRAM. As more bandwidth is available to the FPGA accelerator, our designs can exploit greater parallelism by building more pipelines to saturate the memory bandwidth. In this scenario, our proposed accelerators will further improve the throughput performance and achieve an even higher speedup. Another trend in memory technologies is to develop both high-speed and high- capacity data storage for big data applications. For example, Intel-Micron 3D XPoint transistor-less cross point architecture [111] creates a new class of non- volatile memory that has up to 1000× lower latency and 1000× greater endurance than NAND flash memory, and is 10× denser than DRAM. It is promising to integrate our designs with the 3D XPoint technology for accelerating graph ana- lytics. Forexample, wecanbuildamemoryhierarchy, whichconsistsoftheon-chip memory (e.g., BRAM), off-chip memory (e.g., 3D-stacked high-density DRAM), and non-volatile memory (e.g., 3D XPoint), and exploit the parallelism offered by 116 the FPGA to maximize the access bandwidth throughout the memory hierarchy in order to achieve a significant speedup for large-scale graph analytics. 6.2.2 Graph Stream Algorithms There has been considerable interest in designing graph stream algorithms [112] to process massive graphs with a small amount of memory. These algorithms combine data stream techniques with the ideas from approximation algorithms and graph theory. The objective is to find an arbitrarily good approximation (e.g., estimat- ing connectivity properties, approximating graph distances, finding approximate matching, counting the frequency of sub-graphs) based on summary data struc- tures for graphs (e.g., spanners and sparsifiers). 
Therefore, the computation does not follow the VCP or ECP. Instead, the graph is processed using a data stream model, in which the input is a stream of sampled edges and the edges must be processed in the order they arrive. The challenges in implementation include con- structing data summaries with a small memory footprint and updating the data summaries at high speed. One future research direction is to design FPGA acceler- ators for graph stream algorithms. We can store the data summaries in the on-chip RAMs and build customized pipelines to process the data stream and update the data summaries in order to sustain a high throughput. 6.2.3 Evolving Graph Algorithms Some applications (e.g., cyber networks) need to process evolving graphs whose graph structure can change over time by adding, deleting, and modifying ver- tices/edges. One solution for evolving graph analytics is to re-run the graph algo- rithm when there is an update to the graph structure. However, this solution is computationally expensive and problematic for large graphs, especially when the 117 update rate is high. As a result, evolving graph algorithms are developed to ana- lyze the graphs that are constantly changing [113]. These algorithms can typically avoidafullrecomputationofthegraphanalyticsbydevelopingdynamicdatastruc- tures. The challenges in accelerating evolving graph algorithms include storing the dynamicdatastructuresinthememoryoftheacceleratorforefficientcomputation, as well as maintaining the consistency of the graph structure at a high update rate. Currently, there has not been any FPGA-based design to support evolving graphs. The shared-memory CPU-FPGA heterogeneous architecture is a very appealing platform to accelerate evolving graph algorithms. This is because the CPU can be used to handle updates to the graph structure in order to ensure the consistency, while the FPGA can be used to accelerate the normal graph computations in order to sustain a high throughput. 118 Reference List [1] Graph500, “Graph 500,” https://graph500.org/. [2] T.S.Portal, “Numberofdailyactivefacebookusersworldwideasof1stquar- ter 2018 (in millions),” https://www.xilinx.com/products/technology/ pci-express.html. [3] M.Ferguson, “Whatisgraphanalytics?” http://www.ibmbigdatahub.com/ blog/what-graph-analytics. [4] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi, “Graphi- cionado: A high-performance and energy-efficient accelerator for graph ana- lytics,” in Proceedings of IEEE/ACM International Symposium on Microar- chitecture, pp. 56:1–56:13, 2016. [5] M. M. Ozdal, S. Yesil, T. Kim, A. Ayupov, J. Greth, S. M. Burns, and Ö. Özturk, “Energy efficient architecture for graph analytics accelerators,” in Proceedings of ACM/IEEE International Symposium on Computer Archi- tecture, pp. 166–177, 2016. [6] J. Zhang, S. Khoram, and J. Li, “Boosting the performance of fpga- based graph processor using hybrid memory cube: A case for breadth first search,” in Proceedings of ACM/SIGDA International Symposium on Field- Programmable Gate Arrays, pp. 207–216, 2017. [7] T. Oguntebi and K. Olukotun, “Graphops: A dataflow library for graph analytics acceleration,” in Proceedings of ACM/SIGDA International Sym- posium on Field-Programmable Gate Arrays, pp. 111–117, 2016. [8] E. Nurvitadhi, G. Weisz, Y. Wang, S. Hurkat, M. Nguyen, J. C. Hoe, J. F. Martínez, and C. 
Reference List
[1] Graph500, “Graph 500,” https://graph500.org/.
[2] T. S. Portal, “Number of daily active facebook users worldwide as of 1st quarter 2018 (in millions),” https://www.statista.com/statistics/346167/facebook-global-dau/.
[3] M. Ferguson, “What is graph analytics?” http://www.ibmbigdatahub.com/blog/what-graph-analytics.
[4] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi, “Graphicionado: A high-performance and energy-efficient accelerator for graph analytics,” in Proceedings of IEEE/ACM International Symposium on Microarchitecture, pp. 56:1–56:13, 2016.
[5] M. M. Ozdal, S. Yesil, T. Kim, A. Ayupov, J. Greth, S. M. Burns, and Ö. Özturk, “Energy efficient architecture for graph analytics accelerators,” in Proceedings of ACM/IEEE International Symposium on Computer Architecture, pp. 166–177, 2016.
[6] J. Zhang, S. Khoram, and J. Li, “Boosting the performance of fpga-based graph processor using hybrid memory cube: A case for breadth first search,” in Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 207–216, 2017.
[7] T. Oguntebi and K. Olukotun, “Graphops: A dataflow library for graph analytics acceleration,” in Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 111–117, 2016.
[8] E. Nurvitadhi, G. Weisz, Y. Wang, S. Hurkat, M. Nguyen, J. C. Hoe, J. F. Martínez, and C. Guestrin, “Graphgen: An FPGA framework for vertex-centric graph computation,” in Proceedings of IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 25–28, 2014.
[9] G. Dai, T. Huang, Y. Chi, N. Xu, Y. Wang, and H. Yang, “Foregraph: Exploring large-scale graph processing on multi-fpga architecture,” in Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 217–226, 2017.
[10] M. DeLorimier, N. Kapre, N. Mehta, D. Rizzo, I. Eslick, R. Rubin, T. E. Uribe, T. F. K. Jr., and A. DeHon, “Graphstep: A system architecture for sparse-graph algorithms,” in Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 143–151, 2006.
[11] G. E. Moore, “Progress in digital integrated electronics,” in Proceedings of IEEE Solid-State Circuits Society Newsletter, pp. 36–37, 2006.
[12] R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V. L. Rideout, E. Bassous, and A. R. LeBlanc, “Design of ion-implanted mosfets with very small physical dimensions,” IEEE Journal of Solid-State Circuits, p. 256, 1974.
[13] S. Kaxiras and M. Martonosi, “Computer architecture techniques for power-efficiency,” Morgan and Claypool, 2008.
[14] Y. R. Qu, S. Zhou, and V. K. Prasanna, “High-performance architecture for dynamically updatable packet classification on FPGA,” in Proceedings of ACM/IEEE Symposium on Architecture for Networking and Communications Systems, pp. 125–136, 2013.
[15] D. Tong and V. K. Prasanna, “Sketch acceleration on FPGA and its applications in network anomaly detection,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 4, pp. 929–942, 2018.
[16] S. Zhou, Y. R. Qu, and V. K. Prasanna, “Large-scale packet classification on FPGA,” in Proceedings of IEEE International Conference on Application-specific Systems, Architectures and Processors, pp. 226–233, 2015.
[17] S. R. Kuppannagari, R. Chen, A. Sanny, S. G. Singapura, G. P. C. Tran, S. Zhou, Y. Hu, S. P. Crago, and V. K. Prasanna, “Energy performance of fpgas on PERFECT suite kernels,” in Proceedings of IEEE High Performance Extreme Computing Conference, pp. 1–6, 2014.
[18] R. Chen and V. K. Prasanna, “Optimizing interconnection complexity for realizing fixed permutation in data and signal processing algorithms,” in Proceedings of IEEE International Conference on Field Programmable Logic and Applications, pp. 1–9, 2016.
[19] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing fpga-based accelerator design for deep convolutional neural networks,” in Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 161–170, 2015.
[20] H. Zeng, R. Chen, C. Zhang, and V. Prasanna, “A framework for generating high throughput cnn implementations on fpgas,” in Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 117–126, 2018.
[21] E. Nurvitadhi, G. Venkatesh, J. Sim, D. Marr, R. Huang, J. Ong Gee Hock, Y. T. Liew, K. Srivatsan, D. Moss, S. Subhaschandra, and G. Boudoukh, “Can fpgas beat gpus in accelerating next-generation deep neural networks?,” in Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 5–14, 2017.
[22] M. Ramirez-Martinez, F. Sanchez-Fernandez, P. Brunet, S. M. Senouci, and E. Bourennane, “Dynamic management of a partial reconfigurable hardware architecture for pedestrian detection in regions of interest,” in Proceedings of IEEE International Conference on ReConFigurable Computing and FPGAs, pp. 1–7, 2017.
[23] H. Giesen, B. Gojman, R. Rubin, J. Kim, and A. DeHon, “Continuous online self-monitoring introspection circuitry for timing repair by incremental partial-reconfiguration (COSMIC TRIP),” ACM Transactions on Reconfigurable Technology and Systems, vol. 11, no. 1, pp. 3:1–3:23, 2018.
[24] B. Li, Z. Ruan, W. Xiao, Y. Lu, Y. Xiong, A. Putnam, E. Chen, and L. Zhang, “Kv-direct: High-performance in-memory key-value store with programmable nic,” in Proceedings of ACM Symposium on Operating Systems Principles, pp. 137–152, 2017.
[25] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger, “A reconfigurable fabric for accelerating large-scale datacenter services,” in Proceedings of International Symposium on Computer Architecture, pp. 13–24, 2014.
[26] Amazon, “Amazon ec2 f1 instances,” https://aws.amazon.com/ec2/instance-types/f1/.
[27] IBM, “Field programmable gate arrays for the cloud,” https://www.zurich.ibm.com/cci/cloudFPGA/.
[28] Intel, “Intel unveils industry’s first fpga integrated with high bandwidth memory built for acceleration,” https://newsroom.intel.com/news/.
[29] Xilinx, “Virtex ultrascale+,” https://www.xilinx.com/products/silicon-devices/fpga/virtex-ultrascale-plus.html.
[30] R. R. McCune, T. Weninger, and G. Madey, “Thinking like a vertex: A survey of vertex-centric frameworks for large-scale distributed graph processing,” ACM Computing Surveys, vol. 48, no. 2, pp. 25:1–25:39, 2015.
[31] B. Betkaoui, Y. Wang, D. B. Thomas, and W. Luk, “A reconfigurable computing approach for efficient and scalable parallel graph exploration,” in Proceedings of IEEE International Conference on Application-Specific Systems, Architectures and Processors, pp. 8–15, 2012.
[32] G. Weisz, J. Melber, Y. Wang, K. Fleming, E. Nurvitadhi, and J. C. Hoe, “A study of pointer-chasing performance on shared-memory processor-fpga systems,” in Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 264–273, 2016.
[33] A. Roy, I. Mihailovic, and W. Zwaenepoel, “X-stream: Edge-centric graph processing using streaming partitions,” in Proceedings of ACM Symposium on Operating Systems Principles, pp. 472–488, 2013.
[34] Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis, “Mizan: A system for dynamic load balancing in large-scale graph processing,” in Proceedings of ACM European Conference on Computer Systems, pp. 169–182, 2013.
[35] S. Neuendorffer and K. A. Vissers, “Streaming systems in fpgas,” in Proceedings of International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, pp. 147–156, 2008.
[36] B. Jacob, S. Ng, and D. Wang, “Memory systems: Cache, dram, disk,” Morgan Kaufmann Publishers Inc., 2007.
[37] S. Rixner, W. J. Dally, U. J. Kapasi, P. R. Mattson, and J. D. Owens, “Memory access scheduling,” in Proceedings of International Symposium on Computer Architecture, pp. 128–138, 2000.
[38] D. J. Moss, S. Krishnan, E. Nurvitadhi, P. Ratuszniak, C. Johnson, J. Sim, A. Mishra, D. Marr, S. Subhaschandra, and P. H. Leong, “A customizable matrix multiplication framework for the intel harpv2 xeon+fpga platform: A deep learning case study,” in Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 107–116, 2018.
[39] G. Stitt, A. Gupta, M. N. Emas, D. Wilson, and A. Baylis, “Scalable window generation for the intel broadwell+arria 10 and high-bandwidth fpga systems,” in Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 173–182, 2018.
[40] Xilinx, “Xilinx soc, mpsoc and rfsoc,” https://www.xilinx.com/products/silicon-devices/soc.html.
[41] Intel, “Intel socs: When architecture matters,” https://www.altera.com/products/soc/overview.html.
[42] T. M. Brewer, “Instruction set innovations for the convey HC-1 computer,” IEEE Micro, vol. 30, no. 2, pp. 70–79, 2010.
[43] Xilinx, “Pci express (pcie),” https://www.xilinx.com/products/technology/pci-express.html.
[44] HPCwire, “First xeon-fpga integration launched by Intel,” https://www.hpcwire.com/2018/05/22/first-xeon-fpga-integration-launched-by-intel/.
[45] IBM, “Coherent accelerator processor interface (CAPI),” https://developer.ibm.com/linuxonpower/capi/.
[46] P. Gupta, “Xeon+fpga platform for the data center,” https://www.archive.ece.cmu.edu/~calcm/carl/lib/exe/fetch.php?media=carl15-gupta.pdf.
[47] J. Cong, Z. Fang, M. Lo, H. Wang, J. Xu, and S. Zhang, “Understanding performance differences of fpgas and gpus,” in Proceedings of IEEE International Symposium on Field-Programmable Custom Computing Machines, 2018.
[48] C. Li, S. L. Song, H. Dai, A. Sidelnik, S. K. S. Hari, and H. Zhou, “Locality-driven dynamic gpu cache bypassing,” in Proceedings of ACM International Conference on Supercomputing, pp. 67–77, 2015.
[49] BERTEN, “GPU vs FPGA performance comparison,” http://www.bertendsp.com/pdf/whitepaper/BWP001_GPU_vs_FPGA_Performance_Comparison_v1.0.pdf.
[50] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. Chung, “Accelerating deep convolutional neural networks using specialized hardware,” Microsoft Research, 2015.
[51] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: A system for large-scale graph processing,” in Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 135–146, 2010.
[52] R. Chen, Y. Yao, P. Wang, K. Zhang, Z. Wang, H. Guan, B. Zang, and H. Chen, “Replication-based fault-tolerance for large-scale graph processing,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 7, pp. 1621–1635, 2018.
[53] A. Kyrola, G. E. Blelloch, and C. Guestrin, “Graphchi: Large-scale graph computation on just a PC,” in Proceedings of USENIX Symposium on Operating Systems Design and Implementation, pp. 31–46, 2012.
[54] N. Sundaram, N. Satish, M. M. A. Patwary, S. R. Dulloor, M. J. Anderson, S. G. Vadlamudi, D. Das, and P. Dubey, “Graphmat: High performance graph analytics made productive,” VLDB Endowment, vol. 8, pp. 1214–1225, July 2015.
[55] X. Zhu, W. Han, and W. Chen, “Gridgraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning,” in Proceedings of USENIX Annual Technical Conference, pp. 375–386, 2015.
[56] Z. Ai, M. Zhang, Y. Wu, X. Qian, K. Chen, and W. Zheng, “Squeezing out all the value of loaded data: An out-of-core graph processing system with reduced disk I/O,” in Proceedings of USENIX Annual Technical Conference, pp. 125–137, 2017.
[57] Y. Chi, G. Dai, Y. Wang, G. Sun, G. Li, and H. Yang, “Nxgraph: An efficient graph processing system on a single machine,” in Proceedings of IEEE International Conference on Data Engineering, pp. 409–420, 2016.
[58] W.-S. Han, S. Lee, K. Park, J.-H. Lee, M.-S. Kim, J. Kim, and H. Yu, “Turbograph: A fast parallel graph engine handling billion-scale graphs in a single pc,” in Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 77–85, 2013.
[59] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, “Distributed graphlab: A framework for machine learning and data mining in the cloud,” VLDB Endowment, vol. 5, no. 8, pp. 716–727, 2012.
[60] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: Distributed graph-parallel computation on natural graphs,” in Proceedings of USENIX Symposium on Operating Systems Design and Implementation, pp. 17–30, 2012.
[61] F. Khorasani, K. Vora, R. Gupta, and L. N. Bhuyan, “Cusha: Vertex-centric graph processing on gpus,” in Proceedings of ACM International Symposium on High-performance Parallel and Distributed Computing, pp. 239–252, 2014.
[62] NVIDIA, “nvgraph,” https://developer.nvidia.com/nvgraph.
[63] J. Zhong and B. He, “Medusa: Simplified graph processing on gpus,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, pp. 1543–1552, June 2014.
[64] Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens, “Gunrock: A high-performance graph processing library on the gpu,” in Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 11:1–11:12, 2016.
[65] W. Han, D. Mawhirter, B. Wu, and M. Buland, “Graphie: Large-scale asynchronous graph traversals on just a GPU,” in Proceedings of International Conference on Parallel Architectures and Compilation Techniques, pp. 233–245, 2017.
[66] G. M. Baudet, “Asynchronous iterative methods for multiprocessors,” Journal of the ACM, vol. 25, pp. 226–244, Apr. 1978.
[67] M. Zhang and Y. Zhuo, “Graphp: Reducing communication for pim-based graph processing with efficient data partition,” in Proceedings of IEEE International Symposium on High Performance Computer Architecture, pp. 457–468, 2018.
[68] L. Nai, R. Hadidi, J. Sim, H. Kim, P. Kumar, and H. Kim, “Graphpim: Enabling instruction-level PIM offloading in graph computing frameworks,” in Proceedings of IEEE International Symposium on High Performance Computer Architecture, pp. 457–468, 2017.
[69] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” in Proceedings of International Symposium on Computer Architecture, pp. 105–117, 2015.
[70] S. McGettrick, D. Geraghty, and C. McElroy, “An FPGA architecture for the pagerank eigenvector problem,” in Proceedings of IEEE International Conference on Field Programmable Logic and Applications, pp. 523–526, 2008.
[71] J. Zhang, S. Khoram, and J. Li, “Boosting the performance of fpga-based graph processor using hybrid memory cube: A case for breadth first search,” in Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 207–216, 2017.
[72] S. Khoram, J. Zhang, M. Strange, and J. Li, “Accelerating graph analytics by co-optimizing storage and access on an fpga-hmc platform,” in Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 239–248, 2018.
[73] J. Zhang and J. Li, “Degree-aware hybrid graph traversal on fpga-hmc platform,” in Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 229–238, 2018.
[74] G. Lei, Y. Dou, R. Li, and F. Xia, “An FPGA implementation for solving the large single-source-shortest-path problem,” IEEE Transactions on Circuits and Systems, vol. 63-II, no. 5, pp. 473–477, 2016.
[75] J. Fowers, K. Ovtcharov, K. Strauss, E. S. Chung, and G. Stitt, “A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication,” in Proceedings of IEEE International Symposium on Field-Programmable Custom Computing Machines, pp. 36–43, 2014.
[76] B. Betkaoui, Y. Wang, D. B. Thomas, and W. Luk, “Parallel fpga-based all pairs shortest paths for sparse networks: A human brain connectome case study,” in Proceedings of IEEE International Conference on Field Programmable Logic and Applications, pp. 99–104, 2012.
[77] V. Kumar, A. Grama, A. Gupta, and G. Karypis, “Introduction to parallel computing: Design and analysis of algorithms,” Benjamin-Cummings Publishing Co., Inc., 1994.
[78] S. Zhou, C. Chelmis, and V. K. Prasanna, “High-throughput and energy-efficient graph processing on FPGA,” in Proceedings of IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 103–110, 2016.
[79] K. E. Batcher, “Sorting networks and their applications,” in Proceedings of Spring Joint Computer Conference, pp. 307–314, 1968.
[80] Xilinx, “Vivado design suite,” https://www.xilinx.com/products/design-tools/vivado.html.
[81] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data.
[82] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica, “Graphx: Graph processing in a distributed dataflow framework,” in Proceedings of USENIX Symposium on Operating Systems Design and Implementation, pp. 599–613, 2014.
[83] S. Hong, T. Oguntebi, and K. Olukotun, “Efficient parallel graph exploration on multi-core cpu and gpu,” in Proceedings of International Conference on Parallel Architectures and Compilation Techniques, pp. 78–88, 2011.
[84] Y. Umuroglu, D. Morrison, and M. Jahre, “Hybrid breadth-first search on a single-chip FPGA-CPU heterogeneous platform,” in Proceedings of IEEE International Conference on Field Programmable Logic and Applications, pp. 1–8, 2015.
[85] S. Beamer, K. Asanović, and D. Patterson, “Direction-optimizing breadth-first search,” in Proceedings of International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 12:1–12:10, 2012.
[86] Intel, “Intel Quartus prime design software,” https://www.altera.com/products/design-software/fpga-design/quartus-prime/overview.html.
[87] B. Barney, “OpenMP,” https://computing.llnl.gov/tutorials/openMP/.
[88] O. G. Attia, T. Johnson, K. Townsend, P. H. Jones, and J. Zambreno, “Cygraph: A reconfigurable architecture for parallel breadth-first search,” in Proceedings of IEEE International Parallel & Distributed Processing Symposium Workshops, pp. 228–235, 2014.
[89] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, pp. 30–37, Aug. 2009.
[90] H.-F. Yu, C.-J. Hsieh, H. Yun, S. Vishwanathan, and I. S. Dhillon, “A scalable asynchronous distributed algorithm for topic modeling,” in Proceedings of International Conference on World Wide Web, pp. 1340–1350, 2015.
[91] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in Proceedings of Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543, 2014.
[92] C. De Sa, M. Feldman, C. Ré, and K. Olukotun, “Understanding and optimizing asynchronous low-precision stochastic gradient descent,” in Proceedings of International Symposium on Computer Architecture, pp. 561–574, 2017.
[93] S. Funk, “Netflix update: Try this at home,” http://sifter.org/~simon/journal/20061211.html.
[94] S. Zhou, C. Chelmis, and V. K. Prasanna, “Optimizing memory performance for FPGA implementation of pagerank,” in Proceedings of IEEE International Conference on ReConFigurable Computing and FPGAs, pp. 1–6, 2015.
[95] C. E. Laforest, M. G. Liu, E. R. Rapati, and J. G. Steffan, “Multi-ported memories for fpgas via xor,” in Proceedings of ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 209–218, 2012.
[96] C. E. LaForest and J. G. Steffan, “Efficient multi-ported memories for fpgas,” in Proceedings of ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 41–50, 2010.
[97] S. N. Shahrouzi and D. G. Perera, “An efficient embedded multi-ported memory architecture for next-generation fpgas,” in Proceedings of IEEE International Conference on Application-specific Systems, Architectures and Processors, pp. 83–90, 2017.
[98] Xilinx, “Ultraram: Breakthrough embedded memory integration on ultrascale+ devices,” https://www.xilinx.com/support/documentation/white_papers/wp477-ultraram.pdf.
[99] R. J. Wilson, “Introduction to graph theory,” John Wiley & Sons, Inc., 1986.
[100] R. Chen, J. Shi, B. Zang, and H. Guan, “Bipartite-oriented distributed graph partitioning for big learning,” in Proceedings of Asia-Pacific Workshop on Systems, pp. 14:1–14:7, 2014.
[101] H. E. Yantir, S. Bayar, and A. Yurdakul, “Efficient implementations of multi-pumped multi-port register files in fpgas,” in Proceedings of Euromicro Conference on Digital System Design, pp. 185–192, 2013.
[102] J. Wawrzynek, K. Asanovic, J. Lazzaro, and Y. Lee, “Banked multiport memory,” https://inst.eecs.berkeley.edu/~cs250/fa10/lectures/lec08.pdf.
[103] S. Zhou, R. Kannan, and V. K. Prasanna, “Accelerating low rank matrix completion on FPGA,” in Proceedings of IEEE International Conference on ReConFigurable Computing and FPGAs, pp. 1–7, 2017.
[104] L. Brozovsky and V. Petricek, “Recommender system for online dating service,” https://pdfs.semanticscholar.org/1a42/f06f368cf9b2ba8565e81d8e048caa5c2c9e.pdf.
[105] Yahoo, “Ratings and classification data,” https://webscope.sandbox.yahoo.com/catalog.php?datatype=r.
[106] L. Bottou, “Stochastic gradient descent tricks,” in Neural Networks: Tricks of the Trade - Second Edition, pp. 421–436, 2012.
[107] N. Satish, N. Sundaram, M. M. A. Patwary, J. Seo, J. Park, M. A. Hassaan, S. Sengupta, Z. Yin, and P. Dubey, “Navigating the maze of graph analytics frameworks using massive graph datasets,” in Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 979–990, 2014.
[108] R. Kaleem, S. Pai, and K. Pingali, “Stochastic gradient descent on gpus,” in Proceedings of the 8th Workshop on General Purpose Processing Using GPUs, pp. 81–89, 2015.
[109] X. Xie, W. Tan, L. L. Fong, and Y. Liang, “Cumf_sgd: Parallelized stochastic gradient descent for matrix factorization on gpus,” in Proceedings of ACM International Symposium on High-Performance Parallel and Distributed Computing, pp. 79–92, 2017.
[110] Intel, “Stratix 10 mx,” https://www.altera.com/products/sip/memory/stratix-10-mx/overview.html.
[111] Micron, “Breakthrough nonvolatile memory technology,” https://www.micron.com/products/advanced-solutions/3d-xpoint-technology.
[112] A. McGregor, “Graph stream algorithms: A survey,” SIGMOD Record, vol. 43, pp. 9–20, May 2014.
[113] O. Green and D. A. Bader, “custinger: Supporting dynamic graph algorithms for gpus,” in Proceedings of IEEE High Performance Extreme Computing Conference, pp. 1–6, 2016.
Abstract
Graph analytics has drawn much research interest because of its broad applicability from machine learning to social science. However, obtaining high-performance for large-scale graph analytics is very challenging because of the large memory footprint of real-world graphs and the irregular access patterns of graph analytics algorithms. As general-purpose processors (GPPs) have several architectural inefficiencies in processing large-scale graph data, dedicated hardware accelerators can significantly improve performance with respect to execution time, throughput, and energy efficiency. In this thesis, we focus on designing hardware architectures based on state-of-the-art field-programmable gate array (FPGA) technologies to accelerate iterative graph analytics algorithms. We also propose novel algorithmic optimizations to optimize memory performance and maximize parallelism in order to achieve a significant speedup. ❧ In the first part of our research, we propose a high-throughput FPGA framework to accelerate general graph algorithms based on the edge-centric paradigm (ECP). To optimize the performance of our framework, we propose various novel algorithmic optimizations, including a graph partitioning approach to enable efficient data buffering, an optimized data layout to improve memory performance, and an update merging and filtering scheme to reduce data communication. We also develop a design automation tool to facilitate the generation of accelerators using our framework. Four representative graph algorithms—namely, sparse matrix vector multiplication (SpMV), PageRank (PR), single-source shortest path (SSSP), and weakly connected component (WCC)—are accelerated to evaluate the performance of our framework. ❧ In the second part of our research, we explore CPU-FPGA heterogeneous architectures for graph analytics acceleration. We analyze the tradeoffs between the widely used vertex-centric paradigm (VCP) and ECP and propose a hybrid algorithm that dynamically selects between them during execution. We develop a hybrid data structure that concurrently supports the VCP and ECP, as well as enables efficient parallel computation on heterogeneous platforms. Furthermore, we map our hybrid algorithm onto a state-of-the-art heterogeneous platform that integrates a multi-core CPU and an FPGA accelerator through cache coherent interconnect. We evaluate our CPU-FPGA co-design by accelerating breadth-first search (BFS) and SSSP. ❧ In the third part of our research, we design an FPGA architecture to accelerate the training process of a popular machine learning algorithm that performs matrix factorization (MF) using stochastic gradient descent (SGD). We transform the algorithm into a bipartite graph processing problem and propose a novel three-level hierarchical graph partitioning approach to overcome acceleration challenges. This approach enables conflict-minimizing scheduling and processing of edges to achieve a significant speedup. ❧ We implement our designs by using state-of-the-art FPGAs and demonstrate their superior performance over the state-of-the-art graph analytics accelerators in terms of throughput, execution time, and energy efficiency. The broader impacts of this thesis include the productive use of FPGAs for accelerating graph analytics and machine learning algorithms on very large graphs.
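For readers unfamiliar with the edge-centric paradigm named in the abstract, the following is a minimal software sketch (illustrative only, not the dissertation's FPGA implementation) of one ECP iteration for single-source shortest path: a scatter phase streams every edge and emits updates, and a gather phase applies the updates to the vertex data. The Edge and Update types and all function names are assumptions made for this sketch.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Illustrative edge-centric SSSP iteration (software sketch).
struct Edge   { uint32_t src, dst; uint32_t weight; };
struct Update { uint32_t dst; uint64_t dist; };

// Scatter: stream all edges sequentially and produce updates for destination vertices.
std::vector<Update> scatter(const std::vector<Edge>& edges,
                            const std::vector<uint64_t>& dist) {
    std::vector<Update> updates;
    const uint64_t INF = std::numeric_limits<uint64_t>::max();
    for (const Edge& e : edges) {
        if (dist[e.src] != INF) {
            updates.push_back({e.dst, dist[e.src] + e.weight});
        }
    }
    return updates;
}

// Gather: apply updates; returns true if any distance changed (another iteration is needed).
bool gather(const std::vector<Update>& updates, std::vector<uint64_t>& dist) {
    bool changed = false;
    for (const Update& u : updates) {
        if (u.dist < dist[u.dst]) {
            dist[u.dst] = u.dist;
            changed = true;
        }
    }
    return changed;
}
```

The appeal of this model for hardware is that both phases access the edge list and the update list sequentially, so external memory bandwidth is used efficiently; the partitioning, data layout, and update merging/filtering optimizations summarized in the abstract target the remaining random accesses to the vertex data (dist in this sketch) and the volume of updates.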
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Exploiting variable task granularities for scalable and efficient parallel graph analytics
Acceleration of deep reinforcement learning: efficient algorithms and hardware mapping
Accelerating reinforcement learning using heterogeneous platforms: co-designing hardware, algorithm, and system solutions
Optimal designs for high throughput stream processing using universal RAM-based permutation network
Scaling up deep graph learning: efficient algorithms, expressive models and fast acceleration
Hardware and software techniques for irregular parallelism
Hardware-software codesign for accelerating graph neural networks on FPGA
Dynamic graph analytics for cyber systems security applications
Algorithm and system co-optimization of graph and machine learning systems
Scaling up temporal graph learning: powerful models, efficient algorithms, and optimized systems
An FPGA-friendly, mixed-computation inference accelerator for deep neural networks
Towards the efficient and flexible leveraging of distributed memories
Data and computation redundancy in stream processing applications for improved fault resiliency and real-time performance
Scalable exact inference in probabilistic graphical models on multi-core platforms
Efficient graph processing with graph semantics aware intelligent storage
Efficient graph learning: theory and performance evaluation
Novel graph representation of program algorithmic foundations for heterogeneous computing architectures
Efficient processing of streaming data in multi-user and multi-abstraction workflows
Adaptive and resilient stream processing on cloud infrastructure
Estimation of graph Laplacian and covariance matrices
Asset Metadata
Creator
Zhou, Shijie
(author)
Core Title
Architecture design and algorithmic optimizations for accelerating graph analytics on FPGA
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
10/16/2018
Defense Date
08/31/2018
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
FPGA acceleration, graph analytics, OAI-PMH Harvest, parallel architecture
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Prasanna, Viktor (committee chair), Nakano, Aiichiro (committee member), Qian, Xuehai (committee member)
Creator Email
shijieinusc@gmail.com,shijiezh@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-80672
Unique identifier
UC11671479
Identifier
etd-ZhouShijie-6855.pdf (filename),usctheses-c89-80672 (legacy record id)
Legacy Identifier
etd-ZhouShijie-6855.pdf
Dmrecord
80672
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Zhou, Shijie
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Tags
FPGA acceleration
graph analytics
parallel architecture