MULTI-SOFTCORE ARCHITECTURES AND ALGORITHMS FOR A CLASS OF SPARSE COMPUTATIONS

by

Qingbo Wang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2010

Copyright 2010 Qingbo Wang

Dedication

To my wife

Acknowledgments

First, I must thank my advisor, Prof. Viktor K. Prasanna. His guidance and support have been instrumental in my success. This work would not have been possible without him. I would also like to express my gratitude to my committee members, Prof. Aiichiro Nakano and Prof. Monte Ung. They have provided many useful insights during this process. Additionally, Prof. Kai Hwang, Prof. Jeff Draper, Dr. Amol Bakshi, and Prof. Bosco Tjan provided helpful feedback at the time of my qualifying exam.

I have had the benefit of working with a wonderful research group. I would especially like to acknowledge Ling Zhuo, Weirong Jiang, Yinglong Xia, Edward Yang, Hoang Le, Cong Zhang, Animesh Pathak, and Sudhir Vinjamuri. I was also privileged to have worked with Prof. Judith A. Hirsch to learn what it means to be a scientist and to have enlightening discussions with Dr. Jianwei Chen. I also thank Janice Thompson and Aimee Barnard for their administrative assistance.

My parents, Huaqi Wang and Jiurong Wang, deserve special recognition for their spiritual support of me. Last but not the least, I thank my wife, Rong Fan, and my son, Benjamin Benjie Wang, wholeheartedly for bringing me strength with their love and constant positive energy.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Background
    1.1.1 Reconfigurable Hardware
    1.1.2 Multi-Core Architectures on FPGA
      1.1.2.1 Multi-core Processors
      1.1.2.2 Processors on FPGA
    1.1.3 Sparse Computations
    1.1.4 Challenges
  1.2 Contributions
  1.3 Organization
Chapter 2: Sparse Computations
  2.1 Computation Spareness
    2.1.1 Performance Optimization Strategy for Sparse Computations
  2.2 Sparse Computations on Reconfigurable Hardware
    2.2.1 Large Dictionary String Matching
      2.2.1.1 Large Dictionary String Matching
      2.2.1.2 Previous Work on String Matching
      2.2.1.3 Aho-Corasick Algorithm
    2.2.2 Breadth-First Search on a Large Graph
      2.2.2.1 BFS Algorithms
      2.2.2.2 Previous Work on Graph Algorithms
Chapter 3: Multi-Core Architecture on FPGA
  3.1 Parallel Computing and FPGA
    3.1.1 High Level Language Programming for FPGA
    3.1.2 FPGA-Augmented High Performance Computing
      3.1.2.1 High-Performance Reconfigurable Computing Systems
      3.1.2.2 Architectural Model and Performance Parameters for HPRCs
  3.2 Microprocessor-Based Computing on FPGA
    3.2.1 Embedded Processors
    3.2.2 Soft Processors
      3.2.2.1 Soft Processor Architecture Design
      3.2.2.2 Parallel Programming Model on FPGA
    3.2.3 Usage of Soft Processors
  3.3 Multiple Application Specific Softcore Architecture on FPGA
    3.3.1 MapsCore Multi-Softcore Architecture
    3.3.2 High Performance MapsCore for Sparse Computations
    3.3.3 MapsCore-Inspired General Purpose Processors
Chapter 4: Large Dictionary String Matching on FPGA
  4.1 Characteristics of AC-opt
  4.2 Multi-Core Architecture for Large Dictionary String Matching
    4.2.1 Architecture Overview
    4.2.2 On-Chip Buffer for Hot States
    4.2.3 Structure of a Core
    4.2.4 Reconfigurability of the Multi-Core Architecture
  4.3 Design Optimizations
    4.3.1 DFA Re-mapping
    4.3.2 Simplified Thread Synchronization by Input Interleaving
    4.3.3 Shared and Pipelined Buffer Access Module
    4.3.4 Interface Between Cores and DRAM Controller
  4.4 Performance Analysis and Experimental Results
    4.4.1 DRAM Access Module
    4.4.2 Performance Analysis For a One-Core System
    4.4.3 Resource Constraints
      4.4.3.1 I/O
      4.4.3.2 Local Memories
    4.4.4 Implementation on FPGA Platforms
    4.4.5 Performance Evaluation
    4.4.6 Performance Comparison
Chapter 5: Breadth-First Search on FPGA Platform
  5.1 Parallel Breadth-First Search
  5.2 Architecture for BFS on FPGA
    5.2.1 Architecture Overview
    5.2.2 Message Passing Parallel Processing
    5.2.3 Structure of a Core
    5.2.4 Operation of BFS and Realization of Barriers
  5.3 Design and Optimization
    5.3.1 Bitmap to Reduce Interconnect Traffic
    5.3.2 Dual Processing Unit in One Core
    5.3.3 Port Polling Schedule
  5.4 Implementation and Performance
    5.4.1 Implementation on FPGA Platforms
    5.4.2 Simulation of the Multi-Softcore Architecture
      5.4.2.1 Features of Sample Input Graphs
      5.4.2.2 SystemC Modeling and Verification
    5.4.3 Experimental Results and Discussion
      5.4.3.1 Throughput Measurements
      5.4.3.2 Resource Profiling for Storage Components
      5.4.3.3 Performance Discussion
Chapter 6: Conclusion
  6.1 Summary of Contributions
    6.1.1 Multi-Softcore Architecture on FPGA
    6.1.2 Algorithm Design for Large Dictionary String Matching and Breadth-first Search
  6.2 Future Work
    6.2.1 Further Study of Multi-Softcore Multi-DRAM Architecture for String Matching
    6.2.2 Further Study of Breadth-First Search on FPGA
References

List of Tables

2.1 DFA state transition table
2.2 A string search example
4.1 A few states in levels 0 and 1 are responsible for the vast majority of the hits during string matching
4.2 DDR SDRAM key electrical timing specifications. All units are ns except for the banks. Clock period is when working with FPGA. tRTP: Read-to-Precharge delay. tRC: Active-to-Active delay in the same bank. tRRD: Active-to-Active delay between different banks. tRAS: Active-to-Precharge delay. tRP: Precharge latency. tRCD: Activate latency. tCL: Read latency.
4.3 Performance of string matching on a general purpose multi-core system
5.1 Memory consumption for round-robin port polling design
5.2 Resource consumption and timing performance of designs on FPGA
5.3 Performance comparison with state-of-the-art BFS implementations

List of Figures

1.1 Generic FPGA and its components
1.2 Evolution of FPGA devices
2.1 AC-failure example
2.2 Breadth-first search
2.3 Algorithm 1 - Sequential BFS
2.4 Algorithm 2 - Parallel BFS on one processing element
3.1 Architectural model of high-performance reconfigurable computing systems
3.2 PowerPC 405 configuration on FPGA
3.3 Architecture of Xilinx MicroBlaze
3.4 MicroBlaze configuration on Xilinx FPGA
3.5 A shared memory multi-processor architecture on FPGA (reproduced from project CerberO [89])
3.6 RAMP BLUE
3.7 Multi application-specific core architecture
3.8 Core and interconnect design flow
4.1 Memory accesses to hot states
4.2 Architecture overview
4.3 Structure of a core
4.4 DFA re-mapping
4.5 Interleaving input to pipelines
4.6 BRAM buffer access module
4.7 Impact of the number of pipeline stages on design clock rate
4.8 Throughput for a 2-core system
4.9 Throughput performance of multi-core architecture for full-text search
4.10 Throughput performance of multi-core architecture for network content monitor
5.1 A parallel BFS
5.2 BFS architecture overview
5.3 Single BFS core architecture
5.4 Operation of BFS on multi-softcore architecture
5.5 A design of multi-softcore architecture for BFS
5.6 Single core design
5.7 Redundancy in a core's graph partition
5.8 Topologies of sample input graphs
5.9 Throughput measurements for graphs with Neighbor and Bipart topologies
5.10 Throughput measurements for graphs with Rand topology
5.11 Channel buffer usage of designs for Rand graphs
5.12 Buffer size in a core's ready list
5.13 Neighbor list depth for a system of eight cores on a graph of 8K vertices
5.14 Resource consumption of the central queue in DRAM interface
5.15 DRAM channel buffer size for Rand graph of 64K vertices

Abstract

Field-programmable gate array (FPGA) is a representative reconfigurable computing platform. It has been used in many applications to execute computationally intensive workloads. In this work, we study architectures and algorithms on FPGA for sparse computations. These computations have unique features: 1) the ratio of input and output operations to computation is high, and 2) most memory accesses are random with little or no data locality, which leads to low memory bandwidth utilization.

We propose the Multiple Application Specific Softcore architecture to overcome the performance hurdles that are inherent to sparse computations. We identify the critical issues, demonstrate our solutions, and validate the proposed architecture using two case studies: large dictionary string matching and breadth-first search on a graph. Our architecture utilizes multiple application-specific processing units (softcores) to exploit the potential thread-level parallelism in these computations. To alleviate the impact of long external-memory access latency on system performance, a specialized memory architecture and a scheduling mechanism are devised to reduce the number of accesses to external memory and to hide the effects of the remaining accesses. By utilizing customized interconnects which are adaptive to communication demand, flexible and efficient inter-softcore data exchange and synchronization mechanisms are well supported.

The two kernels in our study are among the most common sparse computation algorithms and are of practical significance on their own. String matching searches for all occurrences of a set of patterns (the dictionary) in a string of input data. It is the core function of search engines, intrusion detection systems (IDS), virus scanners, and spam and content filters.
In our study on large dictionary string matching, our design achieved a throughput comparable to implementations on state-of-the-art multi-core computing systems. Breadth-first search is a fundamental building block for many graph algorithms, with applications in network analysis, image processing, and database query. It is, however, a difficult kernel to parallelize on cache-based multi-core systems due to its fine-grained random data access and the synchronization required between threads. We demonstrate that, by using a message passing multi-core architecture with a distributed barrier design, high throughput performance can be obtained using a modest amount of logic resources on FPGA.

Chapter 1

Introduction

General purpose computing has recently progressed in the direction of parallel processing, while researchers in reconfigurable computing have taken advantage of parallelism for decades. The use of reconfigurable hardware, mostly in the form of field-programmable gate arrays (FPGAs), has provided performance superiority in areas such as cryptography, networking, and telecommunications [51, 37]. Recently, as the computing density of an FPGA improved to over a million cells, FPGAs have been used in many applications that demand high throughput and high computation capacity, such as communications and floating point scientific computations [58, 107, 54].

However, advanced VLSI algorithms and design techniques on FPGA have not yet provided high-performance solutions to a class of applications called sparse computations. Examples of sparse computations include string matching with a large pattern dictionary, exact inference on probabilistic graphical models, and breadth-first search (BFS) in a large graph [52, 99, 24]. These computations are conducted on a large data set and have certain unique features: 1) the ratio of I/O to computation is high, and 2) most memory accesses are random with little or no data reuse, which leads to low external memory bandwidth utilization.

Sparse computations have become increasingly prevalent in current applications. For example, large dictionary string matching has been used in network monitoring, intrusion detection systems (IDS), virus scanners, and spam/content filters [81, 15, 11]. Solutions based on graph modeling are also critical functions of network analysis [16], image processing [97], and database query [1]. Efforts to solve these problems were made by using general purpose chip-multiprocessors [78, 76] and application-specific integrated circuits (ASICs) [66]. However, as these applications deal with larger and larger data sets, solutions which are more flexible, scalable, and efficient should be investigated.

Our research examines the properties of sparse computations and the computation capability of FPGAs and proposes a multi-softcore architecture on FPGA to improve the throughput performance of sparse computation kernels. A core can be either a customized function unit for specialized data processing or a soft processor for general purpose computing. Soft processors are a class of microprocessor IP cores that can be implemented on the logic fabric of FPGA. All cores in the architecture have identical or similar logic designs and resource consumptions. The cores process the data concurrently and share and exchange information through customized interconnects. These customized interconnects minimize the inter-core communication overhead and support fast synchronizations.
Other resources available on FPGA, including Block RAM (BRAM) and logic fabric, allow diverse implementation of buffer and cache for data re-uses and can reduce memory access latency. Architectural and design opti- mizations, such as multithreading and pipelining, can be used to hide long memory access latency and enable high clock rate hardware designs on FPGA. This multi-softcore architecture allows high performance algorithm implementa- tion for sparse computations. Case studies show that it can successfully address multi- ple challenges by sparse computations to achieve throughput performance equal to, if 2 not higher than, known solutions on other commercially available chip multiprocessor (CMP) systems. The demand for resource and architectural support can be properly met by reconfigurable computing platforms. To the best of our knowledge, no such research has yet been systematically performed. The rich set of architectural elements and resource constraints on FPGA, such as configurable I/O pins and available BRAMs, present a host of opportunities and chal- lenges for designs targeting sparse computations. The objective of our research is to explore these opportunities and demonstrate the effectiveness of the proposed multi- softcore architecture on FPGA for high-performance sparse computations. In the fol- lowing sections, this thesis provides background materials about reconfigurable hard- ware, sparse computations, and challenges in using reconfigurable hardware for sparse computations. Finally, a summary of our contributions is presented. 1.1 Background 1.1.1 Reconfigurable Hardware General Purpose Processors (GPPs) and Application Specific Integrated Circuits (ASICs) are the two most popular platforms for conducting computations. Software written to run on GPPs can solve virtually all kinds of computation problems, but the solutions may or may not meet performance requirements. ASICs are designed by hardware engineers to obtain the best possible performance for a specific algorithm. Both are accepted because of their cost-effectiveness due to mass production. Reconfigurable hardware serves as a bridge between the flexible but sometimes performance-constrained GPPs and the inflexible but very high-performance ASICs. Reconfigurable hardware obtains functions through configuration after fabrication. 3 Hence, an application on top of it can combine the flexibility of software with the high performance of customized hardware design [88]. By being programmed specifically for the problem to be solved instead of for general purpose computing, reconfigurable hardware can achieve higher performance and greater efficiency than GPPs. This ef- fectiveness is especially true in applications with a regular structure and/or a great deal of parallelism. Additionally, the ability of reconfigurable hardware to be programmed and reprogrammed many times allows for its design development to be less expensive and its use to be much more flexible than ASICs. Field Programmable Gate Arrays (FPGAs) are the dominant form of reconfigurable hardware devices [38, 22]. Typical FPGAs are implemented as a matrix of configurable logic blocks (CLBs) and programmable interconnects between and within the blocks. In this thesis, we only consider SRAM-based FPGAs. SRAM stands for static random access memory, which uses bi-stable latching circuitry to store each bit. An FPGA is configured by writing a configuration bitstream to its SRAM-based configuration memory. 
The functions of CLBs and the routing of interconnects are determined by the configuration bitstream. Since the device is controlled by the state of the SRAM bits, the functionality of an FPGA chip can be changed by alternating the memory state and being customized for a particular application. The device is composed of many thousands of basic logic blocks. Based on device variety, it can also include fast ASIC multipliers, Ethernet MACs, local RAMs, and clock managers. The programmable logic blocks of FPGAs, such as those from Xilinx or Altera, are based on lookup tables (LUTs), flip-flops, multiplexers, and carry chains. The in- terconnects are composed of programmable switch blocks and wires of various length. SRAM banks, serving as a configuration memory, control all of the functionality of these logic elements and interconnects and the signaling standards of the I/O pins. The 4 Figure 1.1: Generic FPGA and its components values in the lookup tables can produce any combinational logic functionality neces- sary, the flip-flops provide integrated state elements, and the SRAM-controlled routing directs logic values into the appropriate paths to produce the desired architecture. Figure 1.1 shows a generic FPGA architecture and its components. The large shaded boxes represent the configurable logic blocks. The small shaded boxes rep- resent “switch boxes” that are used to program the routing. While there are several routing architectures [17, 100], no particular one is depicted in the figure. Rather, we simply emphasize that each of the configurable logic blocks connects to the intercon- nection network and that this network is configured so that the desired functionality is implemented on the FPGA. An l-input LUT can be configured to perform any logic function ofl inputs. The LUT acts like a memory in which the inputs are address lines, and the logic function that is implemented is determined by what is stored in the mem- ory. The bits in a LUT memory are written as part of configuration. As an example, a 3-input LUT is shown on the right of Figure 1.1 (A,B,C are the inputs). The logic function implemented is a 3-input OR: The LUT outputs a0 ifA,B, andC are each0, 5 Figure 1.2: Evolution of FPGA devices and it outputs a 1 otherwise. In practice, the most common type of LUT in FPGAs is the 4 or 6-input LUT, i.e.l = 4 or6. The density of FPGAs has been steadily increasing, which allows designs that are increasingly complex to be implemented on a single chip [40, 90, 86]. Figure 1.2 shows the recent development of FPGAs using the Xilinx Virtex family as an exam- ple. In addition to increased logic density, FPGA vendors are adding more embedded features into FPGAs. These features are dedicated (non-configurable) hardware blocks embedded into FPGA fabric that enhance the capability of the FPGAs by performing common functions without using programmable logic resources. By being dedicated, embedded hardware modules take less space and can achieve higher clock rates than comparable features made out of programmable logic. One embedded feature that is common in modern FPGA is embedded RAM memory that provides local storage on FPGAs [7, 102]. Large FPGAs have, on average, 5 MB of embedded memory. Em- bedded arithmetic units, such as multipliers and multiply-accumulate units, are also 6 common. Even microprocessors have been embedded into FPGAs. 
For example, Xil- inx started to add up to two embedded PowerPC 405 hardware cores into its Virtex-II Pro series and upgraded the cores to PowerPC 440 in its newly released Virtex-6. Logic resources on the recently released FPGAs also increase enormously [102]. For example, a Virtex-6 SX475T has 476K logic cells, 38,304 Kbits BRAM, and 2016 DSP48E1 slices. 1.1.2 Multi-Core Architectures on FPGA 1.1.2.1 Multi-core Processors In recent years, many multi-core processors have been designed, manufactured, and made commercially available [26, 44]. Some of these processors are traditional general- purpose multi-core processors containing a handful of cache-coherent and powerful cores attached to either a switching fabric or a shared bus. Examples include the quad core Opteron from AMD and the quad core Xeon from Intel [3, 48]. Other types of multi-core processors include highly parallel processors such as IBM Cell BE [46] and nVidia Telsa [69]. In these processors, a number of accelerators are controlled explicitly by a specific core or software. Additionally, various special pur- pose network/content processors exist [48]. Multi-core processor architectures with different features require different optimization techniques to implement an algorithm efficiently. Modern clusters also utilize multi-core processors as computing nodes. For exam- ple, in the November 2008 TOP500 list of the most powerful supercomputers, the IBM Roadrunner ranks at the top [87]. Each computing node of the Roadrunner consists of an AMD Opteron and Cell BE processors. Many multi-core processors, such as Cell BE, Opteron, and Xeon, are used in clusters, and some have vector operation units. 7 This way, the modern clusters actually provide parallelism at multiple levels, such as through virtualization, thread, data, and even bit levels. Efficiently utilizing parallel computing capabilities at multiple levels becomes a challenge in high performance computing research and applications. The local memories, or caches, in a multi-core processor are relatively small, and the problems stemming from this issue are likely to intensify in the future. While advances in chip integration promise tens and perhaps hundreds of cores in a silicon die, on-chip memory size will probably not follow the same trend. For this reason, we believe that application programmers should beware of application working sets and should strategically arrange data orchestration between the cores when developing high performance algorithms on multi-core processors. The same philosophy will be applied and the efficacy will be examined in our study of FPGA-based multi-core ar- chitectures. The performance obtained on our architecture for the sparse computations will be compared with solutions on some of these above-mentioned systems. 1.1.2.2 Processors on FPGA FPGAs began as prototyping devices that allowed for convenient development of glue logic type applications for connecting ASIC components. Then, the gate density of FPGA devices increased, and application-specific hardware blocks were added. As a result, utilizations of FPGA shifted from glue logic to a wide variety of applications, such as for signal processing and network problems. Designs using FPGA have been deployed in the field while they are still flexible to change. Enabling system-on-chip (SoC) designs, general purpose processors have made their way into FPGA. For example, Xilinx included two PowerPCs in some of its Virtex II Pro FPGAs in 2001. 
The PowerPC 405 core has a five-stage pipeline, separate 16 KB instruction and data L1 caches, a CoreConnect bus, an Auxiliary Processing Unit 8 (APU) interface for expandability, and support for clock rates exceeding 400 MHz. Starting from Virtex-5, PowerPC 440 is embedded on the FPGA chip as an upgrade to PowerPC 405. Shortcomings of these embedded processors, when being used for SoC designs on FPGA, include lack of computing power and constrained bandwidth when connected to other on-chip components. These problems limit their role to interfacing and coordi- nating between other high-performance IP cores in designs for most applications. The ARM family has dominated embedded system design and has made many of its mem- bers, such as ARM7, ARM9, and ARM Cortex-M1, available for FPGA usage [12]. ARM Cortex-M1 is developed specifically for FPGA and supports Actel [2], Altera, and Xilinx devices. Xilinx and ARM have recently announced development collabo- ration to put Cortex-M1 in the next generation of FPGAs. This inclusion signifies the increase of processing power of hardware processors and that the processors are able to take on more serious computing tasks in FPGA-based SoC Designs. The collabora- tion between ARM and major FPGA vendors allows the synergy of high-performance general purpose computing and application-specific designs enabled by optimized re- configurable hardware IP cores. In the meantime, soft processors emerged as a parallel initiative to the embed- ded hardware processor cores on FPGA. The major vendors provide a series of soft processors resembling their hardware counterparts [101, 8]. Other open source soft processors targeting FPGA platforms are also abundant [4]. Initially designed for soft- ware programmers to program high performance hardware computing platforms, soft processors have found uses in the design of embedded systems and System on Pro- grammable Chip (SoPC). Both hardware embedded and software reconfigurable processor cores can be con- nected through proprietary or custom-designed interconnect, such as CoreConnect or 9 Fast Simplex Link (FSL), to form a multi-processor system on chip. Existing designs have demonstrated the potential of such systems [74]. However, most research is sim- ply focused on building the systems. More research should be performed on how to assemble a multi-processor system on FPGA to support one type of application organ- ically. Answering this question is one of the primary goals of our research. 1.1.3 Sparse Computations Sparse computations are applications that require high throughput on large data sets, which demands high-volumed communications from both external and internal I/Os. They sometimes may involve intensive computation during each processing step [39]. A class of sparse computations, including string matching and graph traversal prob- lems, are the computational kernels that request frequent access to data in memory. These data, especially when large, are usually stored on a large external memory, which brings long latency for random access. These kernels have the following key features: (i) the ratio of I/O to computation is high; and (ii) most memory accesses are random with little or no data re-use, which leads to low bandwidth utilization. Many of these computations have some forms of concurrency and are amenable to parallel execution. Therefore, the performance of these kernels depends largely on the I/O bandwidth, memory latency, and computational parallelism. 
When the computa- tion takes place in multiple processes, the demand for I/O bandwidth also comes from inter-process communication needed for the data exchange and control signaling be- tween each process. Our thesis aims to fully utilize the parallelism available in sparse computations to achieve high-performance algorithm designs. A new type of emerging applications on the Internet, called data intensive appli- cations, share some similar traits with our target applications. Most data intensive 10 applications have much larger data sets and are usually supported by clusters and data centers. The workload that they support is massively parallel and mostly independent, so they are relatively easy to parallelize. On the contrary, the data accesses in sparse computations exhibit multi-grained dependency, which requires synchronization. The performance optimization for data intensive applications focuses more on the latency and energy issues. Instead, our study on sparse computations concerns itself more with throughput performance. We believe that FPGA-based hardware designs can provide a suitable solution for sparse computations to achieve high performance. Since the hardware designs have more control over their local and external memory, special data layout and data remapping optimizations can be employed to reduce the overall memory latency. Spa- tial parallelism provided by multiple processing units enables memory accesses to be overlapped with computations, which maximizes memory bandwidth utilization. The configurability of FPGA also allows communication channels to be built in order to improve communication performance among different processing units. These processing units, called softcores, are implemented using reconfigurable logic fabric. Compared to the soft processors that were introduced before, the soft- cores are customized with specialized designs that only conduct the desired compu- tation and I/O functionality. Required by an application, the memory subsystem and interconnect for inter-core communication are also adaptable to the specific demand for the problem at hand. 11 1.1.4 Challenges Reconfigurable hardware has benefited many applications in various fields and is now possible to use in sparse computations. However, challenges remain. Some of the major challenges include: Long latency when accessing external memories: Due to the characteristics of sparse computations, long latencies are inevitable when the data set is large and ac- cess to external memories is necessary. For many existing solutions, overcoming this performance hurdle becomes their main task and is a major performance delimiter. FPGA, with large and directly controllable on-chip memory, should provide an effi- cient scheme for solving this problem. Design for computing and communication capabilities: Thanks to the flexible na- ture of FPGA, a core’s processing power, I/O capacity, and the interconnect to facilitate inter-core data communication and memory access are all configurable. When navi- gating such a large design space, designers should take extra care when provisioning resources and optimizing designs for each subsystem. For example, solely increasing a core’s processing power may lead to congestion on interconnect, which would result in a poor overall performance for the system. A balanced design for computing and communication capacities can be hard to create. 
Scheduling complexity: As the amount of the data going into a core increases, multi-threaded execution may become necessary inside a softcore. Variation in com- putation time and I/O latency can lead to out-of-order execution among the abstracted threads of data processing. Consequently, the bookkeeping, management, and schedul- ing of such threads within a core’s address space can become a daunting task. With 12 resource constraints and performance demand in sight, a strategy is needed to address these problems. Synchronization cost reduction: Due to inevitable data interchange among threads and cores, the synchronization overhead will become insurmountable as the number of threads increases. To effectively reduce the cost spent on the synchronization, we can take advantage of the flexibility in the FPGA hardware design to either have dedi- cated hardware synchronization mechanisms or software-based schemes when circum- stances vary. Low clock frequency: The penalty for the hardware flexibility provided by FPGAs is reduced clock frequency. Primarily because of the programmable interconnects, state-of-the-art FPGAs operate at clock frequencies that are about one-tenth those of state-of-the-art GPPs. FPGA designs make up the gap in clock rate by fully cus- tomizing their architectures to applications. As will be discussed more in Section 3.1, the main ways that FPGAs achieve high performance is through the exploitation of pipelining and parallelism. 1.2 Contributions The main contributions of this thesis, as summarized below, are in addressing the challenges of using reconfigurable hardware to accelerate sparse computations. We conducted application-driven architecture study using large dictionary string matching and breadth-first search on large graphs as two case studies. Based on the proposed multi-softcore architecture, our solutions achieved superior or similar throughput per- formance when compared with the others on both reconfigurable hardware and GPP platforms. We realize that the emerging general purpose CMPs and the results of 13 decades of algorithm research on such problems also open windows for further im- proving performance of such applications. Additionally, the use of reconfigurable hardware is still in its infancy when compared to the use of GPPs and other types of CMPs. Nonetheless, we identify and address several issues in the use of reconfig- urable hardware to accelerate sparse computations, and we demonstrate that the use of reconfigurable hardware is a viable and promising acceleration alternative, which is only beginning to be realized by the community. Multiple application-specific softcore architecture on FPGA: We propose a multi- softcore architecture on FPGA (MapsCore) for high performance implementation of sparse computation kernels. The softcores and the interconnects between them are both application-specific and fully adaptable. With the help of electronic system level languages (ESLs), construction of these cores and interconnects is as easy as pro- gramming a GPP. Message passing is used as the programming model, which fits the communication-heavy nature of sparse computations. The synchronization is also achieved using messages. Due to architectural benefits of FPGA, a variety of design optimization techniques can be developed. Their effectiveness is demonstrated through the study of the two representative sparse computation kernels. 
Advanced design solution for large dictionary string matching: We developed and implemented the large dictionary string matching algorithm based on our multi- softcore architecture. Along with proving the performance superiority of such an architecture, we also demonstrated effective schemes to gain cache performance by transforming input data set to create locality. We further presented a simple thread management scheme that hides long memory latency and can completely eliminate the long latency effect of accessing external DRAMs for certain application scenarios. 14 Breadth-first search on reconfigurable hardware: With each core in charge of one partition of the graph, cores exchange information and synchronization commands through the interconnect. Thus, parallel processing is conducted in this manner. Our message-passing multi-softcore system can achieve high throughput that is compara- ble to those of state-of-the-art commercially available CMPs. We utilize a distributed “floating” barrier mechanism to reduce the overhead of synchronization barriers, and we define a message consistency model to guarantee the correct execution of the al- gorithm. Multiple optimization techniques enabled by FPGA platform are applied to improve the overall throughput performance of our BFS system. 1.3 Organization To present our contributions, we have organized the remainder of the thesis as follows. Chapter 2 discusses sparse computations and describes two of its most common ap- plications. Chapter 3 surveys general purpose computing efforts on FPGA, examines the usage of processors, and presents the multi-softcore architecture suitable for our study. Chapter 4 details our solutions to large dictionary string matching based on the proposed multi-softcore architecture. Chapter 5 explains our scheme for overcoming communication costs and synchronization overhead in solving breadth-first search on large graph problems. Finally, Chapter 6, summarizes the work and presents areas for future research. 15 Chapter 2 Sparse Computations 2.1 Computation Spareness Computation sparseness refers to a property present in many applications in which the ratio of I/O to computation is high. Computation sparseness may occur in different forms in computing. For example, a class of large-scale data-intensive applications, such as high performance key-value storage systems, emerges to support demand from some major Internet services, including Amazon (Dynamo [27]), LinkedIn (V olde- mort [93]), and Facebook (memcached [59]). The workloads of these systems share several characteristics with the sparse computations in our study: they are I/O intensive and require random access over large datasets. However, the target applications in our study, such as network security (like string matching), dynamic programming (used in computational biology), graph problems, and sparse matrix computations, require synchronization in their data processing. Their I/O operations deal with smaller data elements that are usually stored in memory. On the contrary, most of the data-intensive applications have thousands of concurrent, mostly-independent operations that are conducive to parallel execution and can be well supported by clusters and data centers. Also, the size of objects in these applications 16 can range from 100s of bytes for wall posts to 1KB values for thumbnail images, which usually reside in secondary storages [10]. 
While the data-intensive applications devote optimization efforts to achieve short latency and energy efficiency, the sparse computation research is focused on improving throughput. Two sparse computations investigated in this thesis are large dictionary string match- ing and breadth-first search. Their memory accesses are random with little or no re-use. When the data is stored in external memory, such as DRAM, the long access latency limits the memory bandwidth utilization. Study of these two kernels also shows that the computation function conducted by the processing elements could be simpler than compute-intensive applications. For example, in string matching, the processing el- ements only need to perform one shift and one add operations for each step. As for BFS, the processing element does little computation. Certainly, real applications using BFS as an exploring kernel will have accompanying computation tasks on the graph vertices or along the edges. 2.1.1 Performance Optimization Strategy for Sparse Computations While the performance of compute-intensive applications is usually measured using million instructions per second or floating-point operations per second, the perfor- mance metrics for sparse computations is primarily throughput (the amount of I/Os that it conducts per unit time). Therefore, the performance of a sparse computation kernel largely depends on the computational parallelism, memory latency, and I/O bandwidth. A suitable platform to support this kind of computation needs to provide sufficient I/O bandwidth, direct control on its local memory, and the ability to accom- modate substantial spatial parallelism. Memory accesses need to be overlapped with computations, which will improve the bandwidth utilization. 17 The changes in computation workload should lead the architecture designers to emphasize thread-level parallelism more than traditional tactics to improve microar- chitecture performance, such as instruction-level parallelism (ILP) and data path opti- mization. In addition, as the computation functionality in the processing unit becomes simpler, the architecture does not need to support a full-fledged ISA (instruction set architecture) and branching prediction and speculation. Performance of sparse computation kernels can benefit from optimizations in mem- ory system design. Data layout affects memory hierarchy performance. As shown by Park, et al. [71], proper data layout can avoid a significant amount of cache conflicts and misses to improve I/O bandwidth utilization. However, care should be taken to weigh the benefits of data remapping and the drawbacks of remapping overhead. Ob- viously, as the local memory of a processing unit get larger , the memory system performs better. However, the size of local memory is constrained by the specific architecture design and resource availability on the computing platform where the ar- chitecture is implemented. Besides reducing the number of accesses to external memory, I/O complexity opti- mization can also come from reducing synchronization overhead. When parallelizing sparse computation kernels, timing cost arises from synchronization between each exe- cution thread. This performance overhead is likely to deteriorate in a nonlinear fashion as the utilization of parallelism increases. An appropriate mechanism for synchro- nization should be utilized to avoid the long latency from traditional synchronization designs. 
Other cache design techniques, such as run time caching and prefetching tech- niques, are also options for improving performance of sparse computations. After all, better understanding of a sparse computation enables designers to identify the opportu- nities for sparse computations performance optimizations. With a suitable computing 18 platform, a high performance design can be achieved by exploiting those opportuni- ties. This rationale is how we come to the conclusion that reconfigurable computing systems, especially FPGA, can be a good choice for our study in sparse computations. 2.2 Sparse Computations on Reconfigurable Hardware Increasing resources allocated to I/O and memory subsystem is an emerging trend in processor design. Many GPPs now provide more than one SDRAM interfaces. For ex- ample, Intel Nehalem has three channels of DDR3, and AMD Barcelona and Shanghai both have two channels of DDR2. Some designs even have embedded SDRAM con- trollers on the CPU chip, such as Larrabee with two integrated DRAM controllers. Many GPPs incorporate sophisticated on-chip networks, such as dual-ring or switch bar to enable multi-pair communication among the cores [47]. Reconfigurable hard- ware, such as FPGA, is always able to utilize multiple memory interfaces. As a result of FPGA’s ability to adapt an architecture to cater to a specific application, much pre- vious research has used it to study sparse computations. We present the two target applications below as our case studies. Their implementations on FPGA are intro- duced, and related studies on other GPP-based solutions are included for comparison. 2.2.1 Large Dictionary String Matching 2.2.1.1 Large Dictionary String Matching String matching looks for all occurrences of a pattern dictionary in a stream of in- put data. It is the key operation in search engines and is a core function of network monitoring, intrusion detection systems (IDS), virus scanners, and spam/content filters 19 [81, 15, 11]. For example, the open-source IDS Snort [81] has thousands of content- based rules, many of which require string matching against entire network packets (deep packet inspection). To support heavy network traffic, high-performance algo- rithms are required to prevent an IDS from becoming a network bottleneck. When a pattern dictionary is small, the whole string dictionary can be stored in some kind of on-chip memory. This way, the system performance is solely depen- dent on the clock rate and the speed of the processors. GPPs can achieve very high performance in this scenario. However, as applications evolve, the size of pattern dic- tionaries has increased dramatically and is fast approaching hundreds of thousands of entries in one dictionary. This growth demands hundreds of Megabytes in memory space and inevitably requires the use of external memories. While GPPs generally need hundreds of clock cycles for a random DRAM access and a sophisticated DMA scheme is needed to support sustained high-rate access to external DRAM, FPGA can provide effective solutions. 2.2.1.2 Previous Work on String Matching Due to their high I/O bandwidth and computational parallelism, FPGA-based designs for string matching algorithms with application specific optimizations have been pro- posed [103, 15, 11]. These designs typically use a small dictionary, on the order of a few thousand patterns, that can be stored in the on-chip memory or logic fabric of the FPGAs. Dharmapurikar et al. [30] introduced a novel implementation using a Bloom fil- ter. 
The hash-table lookup uses only a moderate amount of logic and memory, and it searches thousands of strings for matches in one pass. Also, a change in the rule set (the pattern dictionary) does not require FPGA reconfiguration. However, the tradeoff between the false positive rate and the number of rules stored in the memory leads to performance degradation for large dictionary string matching.

Baker et al. [15] presented a search engine using the Knuth-Morris-Pratt (KMP) algorithm [52] on FPGA. The authors adopt a systolic array architecture for multiple pattern matching, in which each unit is responsible for one pattern. The unit architecture uses a modified KMP algorithm with two comparators and a buffered input to guarantee that one character can be accepted into the unit at every clock cycle. The pattern and its pre-computed jump table are stored in BRAMs. This design results in highly efficient area consumption on FPGAs but is limited by the available on-chip BRAM blocks.

Recently, soft processors on FPGAs have gained a lot of interest in the research community. Using soft processors on FPGA, the engineering time can be reduced, and software engineers can program the high-performance hardware platform. Ravindran et al. [74] implemented a simplified IPv4 packet forwarding engine on an FPGA using an array of MicroBlazes [101]. The softcore architecture exploited both spatial and temporal parallelism to reach a performance that was comparable to designs on an application-specific multi-core processor, such as the Intel IXP2400. These studies motivate us to explore a multi-core architecture on FPGA to achieve high performance for large dictionary string matching.

Chip multiprocessors (CMPs) also present new opportunities for fast string matching with their unprecedented computing power. Yu et al. [106] proposed new regular expression rewriting techniques to reduce memory usage on general purpose CMPs and used grouping of regular expressions to enable processing on multiple threads or multiple hardware cores. Scarpazza et al. studied the optimization of the Aho-Corasick algorithm on the Cell BE for both small and large pattern dictionary string matching [78]. When the patterns number in the range of a few hundred, one Synergistic Processing Element (SPE) using its local store can obtain 5 Gbps throughput. However, when the dictionary includes hundreds of thousands of patterns, they must be stored in the external XDR DRAM, and the throughput can only reach 3.15 Gbps for two processors with 16 SPEs.

2.2.1.3 Aho-Corasick Algorithm

A class of algorithms using automata has become more attractive [54] for string matching. From the classic Aho-Corasick algorithm [6] and its many variants, such as Knuth-Morris-Pratt, Boyer-Moore [19], and Commentz-Walter [21], we selected one of the Aho-Corasick algorithms, AC-opt, for our design [95]. AC-opt can achieve high performance independent of the keyword set and input content, and it can match all occurrences, even those overlapping with one another. Due to its high storage requirements, AC-opt is an appropriate choice for the investigation of our architecture on achieving high-performance sparse computation designs.

The Aho-Corasick algorithm and its variants perform efficient string matching of dictionary patterns on an input stream $S$. They find instances of the pattern keywords $P = (P_1, P_2, \ldots, P_n)$ in $S$, even when keywords may overlap with one another.
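At search time, every algorithm in this family reduces matching to a walk over a state table: one table lookup per input character, with matches reported whenever a final state is reached. As a point of reference only, the C++ sketch below runs such a search loop in software over a hand-built next-move table for the single keyword "ab"; it is not the thesis's FPGA design, and the table for a real dictionary would be produced by the AC-opt construction described next.

    #include <array>
    #include <cstdio>
    #include <string>
    #include <vector>

    // One row per state: the next state for each of the 256 possible input bytes.
    using NextMove = std::vector<std::array<int, 256>>;

    // Scan the text once, making exactly one table lookup per character and
    // reporting every position at which a final (accepting) state is reached.
    void search(const NextMove& next, const std::vector<bool>& is_final,
                const std::string& text) {
        int state = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            state = next[state][static_cast<unsigned char>(text[i])];
            if (is_final[state])
                std::printf("match ending at position %zu\n", i);
        }
    }

    int main() {
        // Hand-built next-move table for the single keyword "ab":
        // state 0 = start, state 1 = seen "a", state 2 = seen "ab" (final).
        NextMove next(3);
        for (auto& row : next) row.fill(0);
        next[0]['a'] = 1;
        next[1]['a'] = 1;  next[1]['b'] = 2;
        next[2]['a'] = 1;  // after a match, keep scanning so overlaps are found
        std::vector<bool> is_final = {false, false, true};

        search(next, is_final, "xabab");  // reports matches ending at 2 and 4
        return 0;
    }

The point of the sketch is only that the per-character work is a single indexed load; the cost of a large dictionary shows up entirely in the size and placement of the table, which is what the architecture in Chapter 4 addresses.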
All variants of Aho-Corasick function by constructing a finite state transition table (STT) and processing the input text character by character in a single pass. Once a state transition is made, based on the current character, that character of the input text no longer needs to be considered. The construction of the STT needs to take place only once, and the STT can be reused as long as the pattern dictionary does not change. Each state also contains an output function. If the output function is defined on a given state, then that state is considered to be a final state, and the output function gives the keyword or keywords that have been found.

The most basic variant of Aho-Corasick, known as AC-fail, requires the construction and use of two functions in addition to the standard output function: goto and failure. The goto function contains the basic STT that was just discussed. A state transition is made using the goto function, based on the current state and the current input character. The failure function supplements the goto function. If no transition can be made with the current state and input character, then the failure function is consulted so that an alternate state can be chosen and text processing may continue. The example in Figure 2.1 illustrates the use of the AC-fail variant of Aho-Corasick. It shows the goto, failure, and output functions for a dictionary of the following keywords: cat, bat, at, and car.

[Figure 2.1: AC-failure example. (a) Goto function: the keyword trie for cat, bat, at, and car over states 0-9 (1 = c, 2 = ca, 3 = cat, 9 = car, 4 = b, 5 = ba, 6 = bat, 7 = a, 8 = at), with characters outside {a, b, c} returning to state 0 at the root. (b) Failure function: f(1..9) = 0, 7, 8, 0, 7, 8, 0, 0, 0. (c) Output function: output(3) = {cat, at}, output(6) = {bat, at}, output(8) = {at}, output(9) = {car}.]

A more optimized version of Aho-Corasick is also presented by Aho and Corasick [6], known as AC-opt. We use this algorithm in our study. AC-opt eliminates the failure function by combining it with the goto function to obtain a next-move function. The result is a true deterministic finite automaton (DFA), which is capable of string matching by making only one state transition per input character. Therefore, searching is simplified and more efficient. However, since the construction of the next-move function requires both the goto and failure functions, it is less efficient than AC-fail when processing the dictionary initially.

Table 2.1 illustrates an AC-opt DFA. The DFA is constructed from a dictionary that consists of the keywords that were previously used. Table 2.2 shows a search on the input string "caricature."

Table 2.1: DFA state transition table (blank entries, and all characters not shown, lead back to state 0)

    State        a  b  c  r  t
    0            7  4  1
    1            2  4  1
    2            7  4  1  9  3
    4            5  4  1
    5            7  4  1     6
    7            7  4  1     8
    3, 6, 8, 9   7  4  1

Table 2.2: A string search example

    input          c  a  r  i  c  a  t  u  r  e
    state    0     1  2  9  0  1  2  3  0  0  0

Using this application kernel, we explore the design space presented by the FPGA architecture and arrive at a systematic design approach to circumvent the memory access bottleneck. The final projected performance can reach close to 10 Gbps, surpassing other similar studies on commercially available CMPs.

2.2.2 Breadth-First Search on a Large Graph

In many applications, data structures can be represented as graphs. For example, the World Wide Web forms a directed graph with millions of nodes [33]. Citations in a field can also be represented as a graph [68]. Similar approaches can be found for solving problems in travel, biology, computer chip design, and many other fields. Therefore, the development of algorithms to handle graphs is a major interest in computer science.
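Graphs of this scale are typically sparse, so in software they are usually held as adjacency lists rather than as a full adjacency matrix. The C++ sketch below shows one common compressed sparse row (CSR) layout; the field names are illustrative assumptions, and this is not necessarily the storage format used by the FPGA designs in later chapters.

    #include <cstddef>
    #include <vector>

    // Compressed sparse row (CSR) view of a directed graph: the neighbors of
    // vertex v occupy adj[row_start[v] .. row_start[v + 1]).  For |V| vertices
    // there are |V| + 1 offsets; adj holds one entry per edge.
    struct CsrGraph {
        std::vector<std::size_t> row_start;  // size |V| + 1
        std::vector<int>         adj;        // size |E|

        std::size_t num_vertices() const { return row_start.size() - 1; }
        std::size_t arity(int v) const {   // |E_v|, the out-degree of v
            return row_start[v + 1] - row_start[v];
        }
    };

    // Example: the 4-vertex graph 0->{1,2}, 1->{2}, 2->{}, 3->{0}.
    inline CsrGraph example_graph() {
        return CsrGraph{{0, 2, 3, 3, 4}, {1, 2, 2, 0}};
    }

The notation used throughout the rest of this section maps directly onto such a layout: the neighbor set of a vertex is one contiguous slice of the edge array, and its arity is the length of that slice.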
A graph $G = (V, E)$ is composed of a set of vertices $V$ and a set of edges $E$. We define the size of a graph as the number of vertices $|V|$. Given a vertex $v \in V$, we indicate with $E_v$ the set of neighboring vertices of $v$ (or neighbors for short), such that $E_v = \{w \in V : (v, w) \in E\}$. For a vertex $v$, $a_v$ denotes its arity, the number of elements $|E_v|$; $a$ denotes the average arity of the vertices in the graph, $a = \sum_{v \in V} |E_v| \,/\, |V|$.

[Figure 2.2: Breadth-first search. The diagram distinguishes the root, explored vertices, frontier vertices, and unexplored vertices; the frontier vertices are marked using yellow or red dots to distinguish their levels.]

Solving a graph problem usually involves traversing the graph. Graph traversal is critical to many areas of science and engineering that demand techniques to explore large data sets represented by graphs. In these areas, search algorithms are the computational engines used to discover vertices, paths, and groups of vertices with desired properties. Among graph search algorithms, Breadth-First Search (BFS) is probably the most common one, and it is a building block for a wide range of graph analysis applications.

As shown in Figure 2.2, BFS begins at the start node (root), and it explores all the neighboring nodes. Then, for each nearest node, it explores the unexplored neighbor nodes until it finds the goal. During this process, a list of "frontier nodes" is maintained for immediate visit. Speeding up BFS has a major impact on many applications and has been a topic for almost every existing computing platform. We believe that an FPGA-based multi-softcore architecture can improve throughput performance for BFS.

2.2.2.1 BFS Algorithms

In this section, we present the classic BFS algorithm and its bulk-synchronous parallelized version. They are the basis for our study of BFS on the FPGA platform.

We can formalize the algorithm as follows: given a graph $G(V, E)$ and a root vertex $r \in V$, the BFS algorithm explores the edges of $G$ to discover all the vertices reachable from $r$ in a "breadth-first" order, and it produces a breadth-first tree rooted at $r$. Vertices are visited in levels: when a vertex is visited at level $l$, it is also said to be at distance $l$ from the root. This procedure is shown in Algorithm 1.

Figure 2.3: Algorithm 1 - Sequential BFS

    input: Graph G(V, E), root r
    // Variables: level, the exploration level; Q, vertices to be explored in the
    // current level; Qnext, vertices to be explored in the next level; marked, a
    // boolean array marked_i, ∀i ∈ [1..|V|]
    1   ∀i ∈ [1..|V|] : marked_i ← false
    2   marked_r ← true
    3   level ← 0
    4   Q ← {r}
    5   repeat
    6       Qnext ← ∅
    7       for all v ∈ Q do
    8           for all n ∈ E_v do
    9               if marked_n = false then
    10                  marked_n ← true
    11                  Qnext ← Qnext ∪ {n}
    12              end if
    13          end for
    14      end for
    15      Q ← Qnext
    16      level ← level + 1
    17  until Q = ∅

For a sequential organization, two queues need to be maintained during the exploration. As shown in Algorithm 1, Q is the set of vertices that must be visited in the current level. Q is initialized with the root r (see line 4) at level 0. At level 1, Q will contain the neighbors of r. At level 2, Q will contain these neighbors' neighbors (except for those visited in levels 0 and 1), and so on. The algorithm scans the content of Q and, for each vertex $v \in Q$, adds its corresponding neighbors to Qnext. Qnext is the set of vertices to visit in the next level. At the end of the exploration for a level, the content of Qnext is transferred to Q. The algorithm ends when Q is empty.
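As a software point of reference, the short C++ sketch below implements the level-synchronous structure of Algorithm 1 over a simple adjacency-list graph and returns each vertex's level (distance from the root). It is only meant to make the Q/Qnext swap concrete; it does not model the FPGA architecture or the external-memory behavior discussed later.

    #include <cstdio>
    #include <utility>
    #include <vector>

    // Level-synchronous BFS in the style of Algorithm 1: Q holds the current
    // frontier, Qnext collects the next level's frontier, and the two are
    // swapped once the current level has been fully explored.
    std::vector<int> bfs_levels(const std::vector<std::vector<int>>& neighbors,
                                int root) {
        const int n = static_cast<int>(neighbors.size());
        std::vector<bool> marked(n, false);
        std::vector<int> level_of(n, -1);          // -1 means unreachable
        std::vector<int> Q{root}, Qnext;

        marked[root] = true;
        int level = 0;
        while (!Q.empty()) {
            Qnext.clear();
            for (int v : Q) {
                level_of[v] = level;
                for (int nb : neighbors[v]) {
                    if (!marked[nb]) {
                        marked[nb] = true;
                        Qnext.push_back(nb);
                    }
                }
            }
            std::swap(Q, Qnext);
            ++level;
        }
        return level_of;
    }

    int main() {
        // 0 -- 1 -- 3 and 0 -- 2, stored as symmetric adjacency lists.
        std::vector<std::vector<int>> g = {{1, 2}, {0, 3}, {0}, {1}};
        const std::vector<int> levels = bfs_levels(g, 0);
        for (int v = 0; v < 4; ++v)
            std::printf("vertex %d: level %d\n", v, levels[v]);
        return 0;
    }

In practice the marked array is often kept as a bitmap, a choice the thesis also exploits in its FPGA design (Section 5.3.1) to reduce both storage and interconnect traffic.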
A vertex is visited only once during the process. When BFS is implemented on GPPs, the queues are usually maintained on chip, and the graph structure typically resides in external memory. However, on-chip resources for the queues may run out when the size or topology of the graph demands a large amount of storage. Then, access to external memory becomes necessary for both the global graph data and the intermediate data structures. The problem of searching large graphs alone poses difficult challenges, mainly due to the exorbitant search space imposed by the vast amount of data. Combined with the lack of spatial and temporal locality in the data access pattern, the access latency to external memory becomes a main performance limiter for GPP-based system implementations.

Algorithm 2 shows a parallel BFS on a multi-core machine with shared global storage and private local memories. Villa et al. implemented this algorithm on the Cell BE [92]. This study considered the limited size of local memory on each SPE and avoided overflows by explicitly managing the memories at the software level. With machine-specific optimizations, the authors achieved an impressive performance in the range of 100 to 800 ME/s. "ME/s" stands for million edges per second, a commonly used throughput metric for BFS that measures the number of edges traversed per unit time.

Figure 2.4: Algorithm 2 - Parallel BFS on one processing element
input: Graph G(V, E), root r, available processing elements (PEs) N
input: partition V_i such that the union of V_1 ... V_N equals V, and V_i and V_j are disjoint if i is not j
// Variable definitions
 1  Q_i, vertices to be explored in the current level
 2  Qnext_i, vertices to be explored in the next level
 3  level, the exploration level
 4  marked, boolean array marked_v, for all v in [1 ... |V|]
 5  Qout_{i,p}, outgoing queues, for all p in [1 ... N]
 6  Qin_{i,p}, incoming queues, for all p in [1 ... N]
// For processing element i
 7  level <- 0
 8  Q_i <- {}
 9  for all v in V_i: marked_v <- false
10  if r in V_i then
11      Q_i <- {r}
12      marked_r <- true
13  end
14  repeat in lockstep across the processing elements
15      Qnext_i <- {}
16      Qout_{i,p} <- {}, for all p in [1 ... N]
        // Gather and dispatch
17      for all (p, v) in [1 ... N] x Q_i do
18          Qout_{i,p} <- Qout_{i,p} U {(v, E_v intersect V_p)}
19      end
        // All-to-all
20      Qin_{p,i} <- Qout_{i,p}, for all p in [1 ... N]
        // Bitmap
21      for all n in E_v where (v, E_v) is in the union of Qin_{i,1} ... Qin_{i,N} do
22          if marked_n = false then
23              marked_n <- true
24              Qnext_i <- Qnext_i U {n}
25          end
26      end
27      Q_i <- Qnext_i
28      level <- level + 1
29  until Q_p = {} for all p in [1 ... N]

In this algorithm, V is partitioned into disjoint sets V_i, each owned by a PE. Each PE i explores and marks only its own vertices and passes any other vertices to their respective owners. All steps in the repeat loop are globally synchronized across the processing elements; synchronization points between memory accesses and the queue exchange are required. The steps of Algorithm 2 are executed in parallel by all the available PEs. PE i accesses its own private Q_i, Qnext_i, and its partition of the marked array, which includes only the variables associated with the vertices V_i. The outgoing and incoming queues are private to each PE i, denoted Qout_{i,1}, Qout_{i,2}, ..., Qout_{i,N} and Qin_{i,1}, Qin_{i,2}, ..., Qin_{i,N}, respectively. Through these queues, the PEs forward vertices to their respective owners. At initialization, the root vertex r is assigned to its owner's Q_i. During the exploration, each PE i examines the vertices v in Q_i and collects the vertices in E_v that belong to PE p into Qout_{i,p}. The dispatch operation then transfers the contents of each Qout_{i,p} to Qin_{p,i}. Afterwards, the maintenance of Q_i and Qnext_i is the same as in Algorithm 1.

When implemented on a general purpose CMP, the performance of the parallel BFS suffers both from the problems present in the sequential algorithm and from communication and synchronization overhead [98]. The latter arises from the queue exchanging and the level maintenance across the cores in the system. We think FPGA-based solutions are able to overcome the hurdles posed by access to external memory and by the communication needed for data exchange and thread synchronization among the cores.
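To make the per-PE steps of Algorithm 2 concrete, the sketch below shows one lockstep iteration from the point of view of processing element i. It is our own illustrative code, not the Cell BE or FPGA implementation; the block partition assumed in owner(), the omitted all-to-all exchange, and all names are assumptions made for the example.

#include <vector>

using Graph = std::vector<std::vector<int>>;

// owner(v) = index of the PE that owns vertex v; a simple block partition is assumed here.
inline int owner(int v, int verticesPerPE) { return v / verticesPerPE; }

void bfsSuperstep(int i, int N, int verticesPerPE,
                  const Graph& adj,
                  std::vector<int>& Qi,                       // current-level queue of PE i
                  std::vector<std::vector<int>>& Qout,        // Qout[p]: vertices destined for PE p
                  const std::vector<std::vector<int>>& Qin,   // Qin[p]: vertices received from PE p
                  std::vector<bool>& marked)                  // marked flags for the vertices owned by PE i
{
    // Gather and dispatch: route every neighbor to the PE that owns it.
    for (int p = 0; p < N; ++p) Qout[p].clear();
    for (int v : Qi)
        for (int n : adj[v])
            Qout[owner(n, verticesPerPE)].push_back(n);
    // (An all-to-all exchange then moves each Qout[p] of PE i into Qin[i] of PE p; omitted here.)

    // Bitmap step: mark newly discovered local vertices and build the next-level queue.
    std::vector<int> Qnext;
    for (int p = 0; p < N; ++p)
        for (int n : Qin[p]) {
            int local = n - i * verticesPerPE;                // offset within this PE's partition
            if (!marked[local]) {
                marked[local] = true;
                Qnext.push_back(n);
            }
        }
    Qi.swap(Qnext);                                           // Q_i <- Qnext_i
}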
2.2.2.2 Previous Work on Graph Algorithms

Of the many studies of BFS, most use simple parallelization of Algorithm 1 and assign multiple processing elements to the vertices. Since Q and Qnext are maintained through global data structures, scaling to large graphs becomes difficult.

An early approach to implementing graph algorithms on FPGAs is part of the RAW project [13]. RAW's graphs are directed and are stored in the FPGA by building a circuit that resembles the graph. Vertices are represented by logic units, and edges are wires connecting the vertices. As a consequence, changing nodes and edges requires completely rerouting and reconfiguring the entire FPGA circuit, which is extremely time-consuming. The design was basically used as a showcase for the proposed dynamic computation structures (DCS), a compilation technique to automatically produce dynamic code for reconfigurable computing.

HArdware Graph ARray (HAGAR) [60] at Bell Labs maps the adjacency matrix representation of a graph onto reconfigurable hardware. Two algorithm implementations, for graph reachability and shortest path, have been presented to validate the design. By using RAMs on the FPGA to store and switch between multiple contexts of a regular architecture, large graphs can be processed. This scheme can waste large amounts of resources as the sparsity of a graph rises. Also, altering the dimension of the matrix is difficult.

Another work on graph algorithms is GraphStep [28], which gives a general high-level abstraction of an object-based graph computing model. The on-chip assumption is retained for the FPGA implementation to take advantage of BRAM's high bandwidth and low latency. The study focuses on sparse graphs rather than on FPGA architecture and performance optimization.

Work on general purpose computing platforms is also abundant. The parallel BFS, detailed in Algorithm 2, is used in research on BlueGene/L [105] and the Cell BE [92, 77]. These efforts focus on maximizing the performance of BFS on each platform with its particular architectural features. Another work studies the impact of a graph's topology on BFS performance on a general purpose CMP [98]. The resulting design adapts the number of active threads in the system to curtail the cost of synchronization barriers. This approach yields superior performance compared to the first two on general purpose CMPs, especially for low-degree graphs.

Chapter 3
Multi-Core Architecture on FPGA

This chapter begins by introducing the utilization of FPGAs in parallel computing and the efforts to simplify their design practice. Then, after a survey of the general purpose processors implemented on FPGA, we propose the multi-softcore architecture and design methodologies for our study.

3.1 Parallel Computing and FPGA

Parallel computing has been an inherent part of FPGA-based application design since FPGA's introduction in 1985 [102]. Parallelism has its spatial form and temporal form.
Bit-level and data-level spatial concurrency has always been a fixture in most high- performance FPGA applications. Pipelining, which utilizes temporal parallelism, adds another dimension to the design space of FPGA applications. Typically, FPGA de- vices, on which a realistic application is implemented, run an order of magnitude slower than their ASIC counterparts and the implementations on commercial GPPs, from the clock rate point of view. To be competitive in system performance, FPGA designers must utilize all of the tools of the trade that they have, such as exhausting the spatial and temporal parallelisms, when designing an application. This rule of thumb is 32 summarized as the principle of “PPP”, standing for “Parallelism and Pipeline leading to Performance.” [64] To achieve this “PPP” ideal, FPGA designers need to possess sophisticated skills and keen insights on the applications. This requirement becomes a main culprit for FPGA’s narrower adoption. Many efforts have been directed to make FPGA practically acceptable to a wider community. Two such efforts are introduced in the following sections. 3.1.1 High Level Language Programming for FPGA Known for its programmability while also allowing application-specific designs, FPGA is considerably more difficult to program than most GPPs. Many endeavors exist to enable programming FPGA using high level languages similar to C/C++. One ex- pects that, with the help of special data structure definitions, a design can be modeled abstractly, verified easily, and synthesized into circuits directly. These efforts indeed alleviate some FPGA design difficulty, but none has achieved notable and wide-spread acceptance yet. Examples of such projects include Handel-C [5], Impulse-C [72], Mitrion-C and a subset of SystemC [70]. These languages are known as Electronic System Level Languages (ESLs). Handel-C adopts special data constructs to explicitly represent parallelism and uses channels to send messages between processes. Impulse-C shares the abstraction of channels with Handel-C but implements them as an API rather than as a native part of the language. SystemC is a library of C++ classes and macros and provides an event-driven simulation kernel. These SystemC features enable a designer to simulate concurrent processes in which each is programmed using C++ syntax. In addition to standard data types in C++, SystemC offers data types that are specifically used for 33 digit logic description. Several types of thread invocations can be deployed to control the execution of each process, which defines functionality of a design block (the mod- ule or object). These languages share some commonalities of C/C++ programming and can generate circuit descriptions in hardware description languages (HDLs). Another language tool, called Mitrion-C [63], uses a different approach, in which a Mitrion Virtual Processor (MVP) is used as an abstraction layer in the middle of software and hardware. Software written in Mitrion C can be executed as instructions on top of MVP, while MVP is implemented on the FPGA as an array of processing elements. These processing elements represent the instruction execution stream, while the software controls data flow through this virtual machine. This scheme sacrifices some performance compared to directly designed hardware on FPGA, but it can make programming relatively easier. 3.1.2 FPGA-Augmented High Performance Computing FPGA also participates in high-end parallel computing systems as application-specific accelerators. 
Many such systems have been built, including SRC 6 and 7 [82], Cray XD1 and XT5h [25], and SGI RASC [80]. These systems contain multiple nodes that are connected through an interconnection network. Each node is based on either FPGAs, general-purpose processors, or both. Such systems can achieve higher per- formance than systems with only processors [79, 109]. Below, we document several representative systems and discuss the important performance parameters using an ar- chitecture model. The experience gained from designing and utilizing these systems sheds light on the formalization of our proposed architecture, which is discussed in Section 3.3. 34 3.1.2.1 High-Performance Reconfigurable Computing Systems In SRC-7 MAPstations [82], the basic architectural unit contains one Intel micropro- cessor and one reconfigurable computing resource called MAP processor. Each MAP processor consists of two FPGAs and one FPGA-based controller. Each FPGA has ac- cess to six banks of SRAM memory. The FPGA controller facilitates communication and memory sharing between the microprocessor and the FPGAs. Multiple MAPsta- tions can be connected by Ethernet to form a cluster. Cray XD1 has a similar architecture [25]. The basic architectural unit of XD1 is a compute blade, which contains two AMD 2.2 GHz processors and one Xilinx Virtex-II Pro XC2VP50. Each FPGA has access to four banks of QDR II SRAM, which totals to 16 MB of memory. Through Cray’s RapidArray processor, the FPGA can access the DRAM of the microprocessors. Six compute blades fit into one chassis. The recently introduced reconfigurable computing systems contain more hetero- geneity, with nodes containing various numbers of FPGAs and processors. For exam- ple, Cray XT5 h , supports reconfigurable computing by incorporating FPGA modules from DRC [32]. All of the blades in XT5 h are connected through Cray SeaStar inter- connect that can achieve a sustained bandwidth of 6 GB/s. XT5 h can have three kinds of computing blades: scalar processing (XT5 blade), vector processing (X2 blade), and a reconfigurable processing blade (XR1). A Cray XR1 blade has two nodes. Each node consists of a single AMD Opteron processor that is tightly coupled with two reconfigurable processing units (RPUs) from DRC. The RPUs are connected to the AMD processor through HyperTransport, which provides 2.5 GB/s bandwidth. Each RPU contains a Virtex-4 FPGA, SRAM memory of 256 MB, and 4 GB of DRAM memory, which provides 6.25 GB/s bandwidth. 35 SGI has also proposed Reconfigurable Application Specific Computing (RASC) technology, which provides hardware acceleration to SGI Altix servers [80]. An SGI RASC RC100 blade has two Xilinx Virtex-4 FPGAs. The FPGAs are connected to 80 GB of SRAM memory. Each blade is directly connected to the shared global memory in the system through the SGI NUMAlink4 interconnect, at a bandwidth of 6.4 GB/s. Recently, SGI developed the RC200 blade, which will be used to augment SGI Altix XE and SGI Altix CE clusters and blade servers. 3.1.2.2 Architectural Model and Performance Parameters for HPRCs We model the architecture of the high-performance reconfigurable computing systems (HPRCs) in Figure 3.1. Many HPRCs exhibit heterogeneity in their processing el- ements. In this model, the heterogeneous nodes are connected by a low latency, high bandwidth interconnection network. The nodes differ in computing capacity, memory hierarchy, and suitability for the applications. 
Three types of nodes are shown in the figure: the P(Processor)-node, the F(FPGA)-node, and the H(Hybrid)-node, which contains one FPGA and one general-purpose processor. Based on our work with matrix computations on an HPRC [110], we identify a set of key performance parameters for designs with F-nodes or H-nodes. We exclude the parameters related to GPPs and the latency of memory access, since the focus of our discussion is the FPGA, which must have data streaming in if a design is to be optimized.

Figure 3.1: Architectural model of high-performance reconfigurable computing systems (heterogeneous nodes combining FPGAs, microprocessors, SRAM, and DRAM, connected by an interconnection network with shared DRAM)

M_d: the size of the DRAM memory;
bw_d: the bandwidth between the FPGA and the DRAM memory;
bw_s: the bandwidth between the FPGA and the SRAM memory;
M_s: the size of the SRAM memory;
bw_n: the bandwidth of the interconnection network;
lat_n: the latency of the interconnection network.

In addition to the architecture-specific features, application parameters should also be considered. Suppose that the given application contains q tasks, denoted t_0, t_1, ..., t_{q-1}, and that we have p nodes, denoted N_0, N_1, ..., N_{p-1}.

o_j: the number of operations in task t_j;
d_j: the number of words needed by task t_j to perform its computations;
C_ij: the compute capacity of N_i for an implementation of t_j, measured in the number of operations performed in one second.

If N_i is an F-node, C_ij can be computed easily for the FPGA-based design. Suppose the design executes o operations during each clock cycle and runs at a clock speed of f; then C_ij equals o x f. On the other hand, if N_i is a P-node, then C_ij equals the sustained performance of the processor for task t_j and can only be obtained by executing the software implementation of t_j. The performance of an H-node depends on the workload partitioning and coordination between the FPGA and the processor [108].

3.2 Microprocessor-Based Computing on FPGA

A system-on-chip (SoC) can be built by adding microprocessors onto an FPGA and connecting them with I/O devices and accelerators made of the FPGA's logic fabric. A C/C++ program can then be run on these processors in order to ease design difficulties on the FPGA. Some FPGA vendors provide embedded hardware processors in the middle of the FPGA fabric, and most vendors allow soft processors to be built using the logic and memory resources on an FPGA chip.

3.2.1 Embedded Processors

An embedded processor is a hardware IP core with a specific interface designed to connect to IP cores made of FPGA fabric. When the FPGA IP cores include I/O controllers, such as a compact flash interface and an Ethernet MAC, they realize a system-on-chip (SoC) design. Some FPGAs have more than one embedded processor, and these can connect with one another; by interconnecting with other processor cores, such an SoC becomes a mini-supercomputer.

A representative of such embedded processors is the PowerPC 405 on Xilinx Virtex-II Pro and Virtex-4. The PowerPC 405 is a 32-bit RISC CPU core licensed from IBM [45]. The core integrates a scalar single-issue 5-stage pipeline with separate instruction and data caches, a JTAG port, trace FIFO, multiple timers, and a memory management unit (MMU). The PowerPC 405 core can be integrated with peripherals and application-specific macro cores using the CoreConnect(TM) bus architecture [45]. Since Virtex-5, Xilinx has replaced the PowerPC 405 with the more advanced PowerPC 440 [102].
Figure 3.2: PowerPC 405 configuration on FPGA Figure 3.2 shows an organization of the PowerPC 405 processor block on Xilinx FPGA [102]. Aside from the typical architectural components, some interface mod- ules are also presented, including OCM and PLB. OCM stands for on-chip memory controller and provides a direct channel between the processor block and Block RAMs on the FPGA chip. The memory accesses through OCM are not subject to cache. PLB stands for processor local bus and is mainly used to interface with I/O devices that are made of FPGA fabrics, such as Ethernet controllers, general purpose I/O interfaces, and customer-designed IP cores. PLB can also be used to access memory, includ- ing on-chip BRAM, off-chip SRAM, and DRAM. However, any such access request will be subject to bus arbitration; therefore, long latency may be induced. PowerPC 39 core connects to custom IP cores via PLB and accesses them as memory-mapped I/Os. Note that PowerPC 405 can also access custom IP cores through a mechanism called Auxiliary Processor Unit (APU). APU extends the native instruction set with custom instructions, which are backed up by custom IP cores that are configured as Fabric Co- processor Modules (FCMs). These instructions are intercepted by the APU controller and circumvent the pipeline execution to perform custom designed functionalities. Common, recognized problems when utilizing PowerPC 405 on Xilinx FPGAs are the APU’s unsuitability for processing large data set and PLB’s bandwidth bottleneck stemming from bus contention. Xilinx recognizes these problems and pledges to re- solve them in its incorporation of a new generation of embedded processors, which include the ARM Cortex M1 [12]. ARM brings a standard processor architecture to most known FPGA platforms and can deliver a very high performance per MHz. The work incorporating ARM into Xilinx FPGA focuses on enhancing the interconnecting capability, which facilitates core-to-core and core-to-peripheral communications. Currently, FPGA has been able to place, at most, two hardware processor cores on a chip. Designs on FPGA run at one order of magnitude slower than their ASIC and GPP counterparts. The smallest system configuration with a PowerPC cannot run more than 200 MHz. Combined with a simple microarchitecture in these embedded hardware processors, the IPC (Instruction Per Cycle) of applications cannot be larger than one. The bus contention caused by others design modules on FPGA, some of which are slow, exacerbates the interconnect bandwidth bottleneck. This bottleneck limits the number of optimized accelerators for data processing that can be linked to the processors. In the end, the exploitation of data and thread level parallelism is adversely affected, and the performance of the whole system suffers. 40 3.2.2 Soft Processors Soft processors are a class of microprocessor IP cores that can be implemented on logic fabric of reconfigurable computing devices such as FPGA. They are usually written in VHDL/Verilog and are available either as open source [4] or by purchase [8, 101]. Most soft processors can run a version of Linux as their operating system, and they have support from GNU development tool chains. The flexibility of soft processors allows for their simple configuration and leads to area efficiency, which enables a larger number of the soft processors to be placed on a single chip of FPGA. Next, we show a representative soft processor architecture and discuss research activities related to soft processors on FPGA. 
A soft processor example: The architecture of MicroBlaze is shown in Figure 3.3. We can see it is a Harvard architecture with separate instruction and data caches. Mi- croBlaze uses Local Memory Bus (LMB) and a slow On-Chip Peripheral Bus (OPB) instead of PowerPC’s OCM and PLB. Figure 3.3: Architecture of Xilinx MicroBlaze 41 MicroBlazes can be configured similarly to PowerPC in order to form an SoC de- sign. This configuration is shown in Figure 3.4. Soft processors are very flexible, be- cause they can be both uniquely configured and connected with different memory, pe- ripherals, and one another. A drawback of soft processors is their low clock rate, which ranges from 25 MHz to 200 MHz. This low clock rate excludes them from competing with their hard core counterparts and the commercially available CMPs. Therefore, their roles are generally limited to interfacing and coordinating other higher-speed units on FPGA. Figure 3.4: MicroBlaze configuration on Xilinx FPGA 42 Often, a soft processor consumes only 2% to 15% of the resources on a medium- sized FPGA. Lower end cores, such as PicoBlaze, consume even less. Thus, one can unleash the power of soft processors and of the host reconfigurable device by allocating as many soft processors as possible on a chip, where they may facilitate massively parallel computations of an application. Coupled with the abundant and configurable I/O pins on FPGA, this architecture has potential to compete with other microprocessors for high I/O intensive applications and enable FPGA to excel beyond being just an accelerator. However, many obstacles must be overcome to achieve this goal. 3.2.2.1 Soft Processor Architecture Design Soft processor design on FPGA is oriented at performance, just like any processor design. However, the design metrics differ from those of GPPs, mainly due to the difference between the architectural and design constraints on FPGA and on other platforms. Since computation density usually serves as metrics for hardware design on FPGA, area or area efficiency is also used for appropriate reasons for soft processor design. Area consumption on FPGA is roughly linear to the power consumption of the design. It also reversely correlates to the clock rate of a design, since wiring delay plays a large role in determining timing complexity of an FPGA design. A design with a smaller footprint usually has shorter wires and more compact combinational circuits between the registers. Area efficiency can be measured in multiple ways. First, the pure area consump- tion compared against the total number of available threads on a system can show insights into soft processor design. For example, a multithreaded soft processor design 43 with four threads can save a lot of area compared to four single-threaded soft proces- sors [34]. However, sometimes a “million instructions per area unit” combined with “instructions per cycle” can be more indicative of the design quality. Labrecque et al. [55] brought multistage pipeline architecture into multiple threads. In an ideal scenario where the number of pipelines is equal to the number of pipeline stages and the threads are assumed to be independent, data hazards are resolved by hiding the dependency in the pipeline. The resulting designs will not depend on data forward to increase IPCs. This scheme will reduce design complexity and will obtain a better area efficiency measured in IPC over resource usage, such as slices in Xilinx products or logic elements (LE) in Altera terminology. 
In a later study where cache and off-chip memory are considered in the system design [56], the result shows that multi- processors with single-threaded cores perform the best out of all other design options for a given total area. Note that this architecture does not consider the synchronization between the soft processors. Since applications demand different resources and exhibit distinct communication and synchronization patterns, a soft processor design that aims for all types of applica- tions usually results in being underutilized. Optimizations through application-specific customization become imperative, and its automation is necessary. Several academic projects have studied the automatic synthesis of soft processors on FPGA, such as SPREE, CUSTARD [104, 31, 23]. Some studies stray away from microarchitecture design of individual cores and look instead at the interconnects which oftentimes de- termine the performance of the entire multi-core system [29, 42, 75]. With these tools and research, a comprehensive evaluation of soft multi-processors regarding both the microarchitecture of the processors and the interconnects was made possible. A study of this kind focuses on a set of streaming applications and applies a slew of customization to all aspects of the multiprocessor system on FPGA chip, 44 including interconnect topology, pipeline depth, communication buffers, ISA subset- ting, and the combination of these with other factors. An important observation is that “larger soft multiprocessor systems benefit from point-to-point interconnection . . . microarchitectural optimizations, such as instruction subsetting, inter-processor buffer sizing and pipeline depth variation yield significant performance and area ben- efits.” [91] Microsoft Research’s extensible MIPS (eMIPS) project implements a MIPS RISC processor on a Xilinx FPGA [73]. The eMIPS takes advantage of FPGA’s dynamic reconfiguration feature to extend a core instruction set on the fly and add peripherals when needed in real time. The scheme is not much different from Xilinx’s APU so- lution, but it looks at broader applications of added function units. For example, they can serve as a debugging and testing co-processor or as a system intrusion detection mechanism on the hardware level. 3.2.2.2 Parallel Programming Model on FPGA Many projects reported building a multi-core system on top of FPGA using softcores. Many choose to connect as many softcores as possible using various types of intercon- nects, such as hierarchical bus, mesh, or systolic array [65, 36, 74] and to conduct ad hoc programming on them. Other important issues, such as cache coherence, multi- processor interrupt management, processor identification, and synchronization, are left to programmers. A high-level programming model with a supporting synchronization mechanism can alleviate many of the above-mentioned difficulties. A shared memory multiproces- sor was demonstrated in project CerberO [89]. As shown in Figure 3.5, the architecture uses an OPB bus to connect all the processors, shared memory, and other IP cores. A separate synchronization engine (SE) provides locks and barriers using point-to-point 45 Processor 0 Local Cache LMB Processor N Local Cache LMB SE CrossBar Other IP Cores Boot BRAM Multichannel Memory Controller DDR RAM (Shared data and instructions) I I Figure 3.5: A shared memory multi-processor architecture on FPGA (Reproduced from project CerberO [89]) links to each processor. 
Also, a crossbar connects all processor cores to facilitate short and critical message transmission, such as boot information and task scheduling. 3.2.3 Usage of Soft Processors Aside from being a computing engine, the flexibility and scalability of soft processors allow them to be an instrument in many computing system studies. Soft Processors as Research Utilities: FPGA and soft processors have been used as prototyping devices since their introduction. An FPGA-based many-core system was built using Xilinx Virtex II Pro to emulate future computers with massive parallelism in their processing units [20, 53]. The system, called RAMP Blue, consists of 768 to 1008 MicroBlaze cores in 64 to 84 Virtex II Pro 70 on 16 to 21 BEE2 boards, which surpasses the milestone of 1000 cores in a standard 42U rack. A software infrastructure 46 Figure 3.6: RAMP BLUE consisting of GCC, uClinux, and Unified Parallel C (UPC) allows the running of off- the-shelf applications and scientific benchmarks. Message passing is adopted as the programming model. Shown in Figure 3.6, the architecture is based on point-to-point channels and switches and uses a combination of custom and generic hardware to provide the func- tionality. The customized network provides an FSL interface with the high-performance LVCMOS and Ten-gigabit Attachment Unit Interface (XAUI) links between the FPGAs on or between the PCB boards. The user network serves as the primary communication mechanism in RAMP Blue [53]. Figure 3.6 also shows the design of the user network, which consists of two modules: buffer and switch. These modules can be intercon- nected in nearly arbitrary ways, allowing the user to implement arbitrary complex on-chip and off-chip network topologies. Each packet is prepended by the MicroB- laze core with a source-computed route prior to injection into the user network. Note that broadcast capabilities are not provided. The buffer is the more complex, though typically smaller, of the two modules and contains most of the control logic. In this design, the user network offers high scalability, flexibility, and reliability, but it is sig- nificantly complex and is a performance bottleneck [20]. Message passing based on 47 this design is not likely to achieve high performance. Moreover, FSL is not suitable for the transfer of large amounts of data. Another way to use FPGA-based soft processors is by adding them into a multi- processor system and monitoring other processors’ bus and memory activities. Suh et al. [84] used an FPGA-based system to replace one Intel Pentium-III in a dual proces- sor server system. This FPGA core listens on the front side bus for cache coherence traffic and coordinates with the other processor to enable its function. The FPGA core directly controls its local cache which is made of BRAMs. The collected measurement data is sent back to a host computer for analysis. Programming Model Realization: Many programming models proposed in the past have not been committed to silicon due to a variety of reasons. With FPGA, the cost to implement and test these models is reduced significantly. Despite its theoretical importance, the Parallel Random Access Machine (PRAM) model is thought to be unrealistic. A recently developed prototype used explicit multi- threading architecture (XMT) to support PRAM algorithms on a single FPGA chip [96]. Programs written in XMTC, an extension of C, can be run on the 75 MHz processor. 
Tests using some microbenches verify the functionality of the machine and exhibit promising performance compared to other commercially available CPUs. The biggest advantage of the PRAM machine is its programming ease. The reconfigurable mesh (R-MESH) is a parallel computing model for which many low timing complexity algorithms have been developed [62]. In this model, an array of processing elements (PEs) are connected using a segmented bus through a local switch element. The switches can be configured to form different global communication pat- terns on R-MESH. The programs on R-MESH execute cycles of bus configuration, communication, and constant-time computation on PEs in a lock-step fashion. This 48 model benefits from the massive parallelism inherent to many applications. FPGA is suitable for implementing a proof-of-concept design for such a model and for verify- ing the kernel algorithms, such as multiplication [50], transitive closure [94], nearest neighbor [67], and convex hull [43]. A reported implementation used PicoBlaze, an 8-bit RISC softcore, as a processing element and designed a combinational circuit as a switch element [36]. The case study uses another MicroBlaze to simulate a host PC sending data to and reading results back from a 4X4 mesh of PicoBlazes. A mesh sorting algorithm [85] is studied as a test case. 3.3 Multiple Application Specific Softcore Architecture on FPGA Efforts to use FPGA in the wide spectrum of general purpose computing, particularly in the form of processor-centered and cache-based systems, have not met performance expectations. FPGA-based designs are inevitably 10 to 100 times slower than a GPP, as shown by several proof-of-concept projects [96, 20]. FPGA is not likely to compete with GPPs by simply adopting an ISA and a memory hierarchy, which is the forte of general purpose processors. Customization of ISA simplifies the designs on FPGA but is not enough to overcome the speed gap between FPGA designs and highly optimized microprocessors. The attempts of Xilinx, Actel, and Altera to incorporate a more powerful ARM hardware core into their FPGA products may be an indication of the inadequacy of current embedded hardware processors. 49 FPGA excels in application-specific hardware designs that are created by skillful digital system designers. Based on the above convictions, we propose Multiple Ap- plication Specific Core Architecture (MapsCore) on FPGA and present techniques to obtain sustained high performance for sparse computations on this architecture. 3.3.1 MapsCore Multi-Softcore Architecture MapsCore architecture connects multiple application-specific cores with a customized on-chip network, where the cores are optimized hardware designs with prescribed net- work protocol to interface with the system-wide network on chip (NoC). The cores own local memory and register files and can be homogeneous or heterogeneous, depending on the task composite in the applications. The NoC is preferably a point-to-point con- nection, but other interconnects can be used when resource constraints are present. All interactions between cores, including data sharing and synchronization, are through message passing. The design of core needs to observe the “PPP” rule that enables FPGA to achieve high performance, such as taking advantage of data parallelism through vectorization and temporal parallelism through pipelining. Considering the behavioral nature of sparse computations, our architecture emphasizes task and thread level parallelism. 
Thus, synchronization is a major concern for designers. In the case of heterogeneous multi-softcore computing systems, workload balancing among the cores should be pursued to reduce synchronization overhead. Schemes to avoid the effects of insufficient I/O bandwidth and long latency should be weighted equally to, if not more than, the processing power of individual cores. An overview of the MapsCore architecture is shown in Figure 3.7. Note that the global memory and I/O devices are connected to the cores through a different interconnect than the one connecting the cores to one another.

Figure 3.7: Multi application-specific core architecture (cores with local memories connected by an interconnect for inter-core communications, and a separate interconnect for shared global memory and I/O devices)

Designing a core manually may be cumbersome at times. ESLs, combined with advanced EDA tools, can solve most of the design problems. As ESLs have matured, the way that hardware engineers describe their designs has also gradually changed. EDA tools that assimilate many advances in design automation research are able to produce near-optimal logic designs from HDL programs for many hardware platforms. The design flow for implementing softcores is outlined in Figure 3.8. Designers can use either HDL or one of the ESLs to describe the cores and the interconnects. The ESL programs are first translated into HDL programs, and the HDL programs are then fed to a synthesizer to generate circuits.

Figure 3.8: Core and interconnect design flow (an ESL design is compiled into an HDL design, which is then synthesized into circuits)

The interconnect is an equally important component in our architecture for achieving high performance on applications. As multi-core architectures emerge with the end of frequency scaling, and clock rates are actually decreasing, the performance model of computing systems is shifting from a latency mode to a throughput mode. Therefore, the performance of a multi-core system depends critically on how much and how fast data can be exchanged between cores and loaded into each core's local memory. The interconnect and synchronization scheme in our architecture is to be customized to the applications at hand.

3.3.2 High Performance MapsCore for Sparse Computations

Speedup of an application through parallel processing has been studied for more than 40 years. Amdahl's Law [9] states that a program can only run as fast as its sequential part, no matter how many processors are used to divide the parallelizable workload. Suppose that P is the fraction of a workload that can be parallelized and that n processors share that part of the computation. Then, the speedup is described by Equation 3.1.

Speedup = 1 / ((1 - P) + P/n)    (3.1)

It appears that the maximal speedup we can obtain is 1/(1 - P), at which point n becomes irrelevant. Opponents argue that parallel computing allows problems to scale to a much larger size than sequential computing does. Gustafson's Law considers the sequential portion of a problem fixed, while the parallelized portion would be n times larger if it were executed sequentially [41]. Suppose that P' is the parallelized part of the computation. When it is serialized, the relative execution time becomes (1 - P') + n*P', which is also the speedup of the parallel processing. Therefore, the speedup can be linear in the number of processors in a system.
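Before refining the model, a small numerical illustration contrasts the two laws (the figures here are ours, chosen only for the example): let the parallelizable fraction be 0.95 and let the machine have 16 processors.

Amdahl (Equation 3.1): Speedup = 1 / ((1 - 0.95) + 0.95/16) = 1 / 0.109375, which is about 9.1, and never more than 1/(1 - 0.95) = 20 even as n grows without bound.
Gustafson: Speedup = (1 - 0.95) + 16 x 0.95 = 15.25, nearly linear in n, because the problem size is assumed to grow with the machine.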
The above analysis is still relatively simplistic due to assumptions made based on the PRAM model. In the Bulk-Synchronous Parallel (BSP) model, communication and synchronization can add nonlinear overhead to the performance equation when the number of cores becomes large. Stone looks at runtime in terms of both computation and communication costs [83]. He defines a runtime R and a communication time C for each block of code that could be run in parallel. Given M tasks and N processors, and assuming for the sake of simplicity that all tasks need to communicate with one another over a bus connection, the speedup can be characterized as in Equation 3.2.

Speedup = R*M / (R*M/N + C*M^2/2 - C*M^2/(2N)) = N*(R/C) / (R/C + M(N - 1)/2)    (3.2)

Speedup <= N*(R/C) / (R/C + N*(D/C) + M(N - 1)/2)    (3.3)

If we include the synchronization cost D as a linear or superlinear function of M, the speedup is bounded as in Equation 3.3. When R/C is much larger than M*N, the speedup approaches N linearly. However, for sparse computations, R/C is usually small. When the synchronization cost is considered and N becomes larger, the term with D/C, the ratio of synchronization to communication, can possibly become larger than the term with R/C, the ratio of computation to communication. Then, the speedup becomes less than 1.

The above analysis, though counter-intuitive, cautions us about the appropriateness of the assumptions in the architecture model. Nonetheless, it provides useful insights for our investigation: to achieve higher speedup for sparse computations, the underlying architecture must be able to support substantial communication and confine synchronization overhead. This requirement can be met by providing customized and optimized designs for the interconnect and synchronization. Alternatively, we can enhance the processing capacity of the cores so that more data processing is resolved locally; consequently, this scheme alleviates the demand on communication capacity. MapsCore is able to use multiple DRAMs, on-chip dual-ported BRAMs, and a customized interconnect to expand the bandwidth capacity enormously. As a source of communication bottlenecks, synchronization overhead can also be reduced by employing tactically designed software or hardware mechanisms on MapsCore.

Message passing programming model: Message passing is adopted as MapsCore's programming model. All data exchange between the softcores goes through the interconnect. Reads and writes are serialized in the cores and onto the shared external DRAM(s); therefore, no data hazards arise, and no lock is needed to access the DRAMs. Since synchronization messages are sent via the interconnect alongside data, a message consistency policy is defined to enforce that the data sent from a core must be processed by the receiving core before a later corresponding synchronization message.

We investigate large dictionary string matching and breadth-first search on a large graph in the following chapters. Some effective techniques are summarized below as the design principles that enable our MapsCore architecture to achieve throughput performance for sparse computations that is higher than, or comparable to, state-of-the-art solutions on other computing platforms.

Generating locality out of data input: By definition, sparse computations lack locality in their memory access patterns, which is one of the reasons that cache-based systems underperform for such applications. However, we realize that every application presents some type of "locality" in the operations it conducts.
Because any actual work is done on a finite set of data (called a working set), one can arrange the work- ing set into cache or local memory that is close to processing elements. FPGA has a large amount of on-chip memory that offers short latency, direct control, and large bandwidth. They are made of either BRAM or distributed RAM and can be config- ured flexibly. For the large dictionary string matching case study, it is found that the memory visits are concentrated on a small fraction of the large dictionary for even an extended period of time. When making those visits to an on-chip memory, we can reduce the long latency accesses to external memories and improve the throughput of the memory and the entire system. Hiding memory access latency and simplified thread management: According to Equation 3.3, the communication cost needs to be kept down, especially for sparse computations. A proved scheme is to multiply the execution threads to hide the long latency of accessing external memory. However, when this technique is used on a cache-based system, costs associated with context switches and thread scheduling in- crease. On FPGA-based MapsCore architecture, we devise a round-robin virtual thread management to simplify the scheduling algorithm and save memory consumption by using an extensible shift register to manage context. The details can be found in Sec- tion 4.3.2. 55 Point-to-point inter-core communication: We showcase this strategy in our design for BFS on FPGA. The cores need to pass graph vertices to all other cores. Because of the capability of FPGA, we design a point-to-point network to connect all pairs of core, so that the latency for each transmission is minimal. This design is infeasible on most other multi-core architectures. When demands are made by resource constraints and other design stipulations, a hierarchical point-to-point network can be utilized to reduce the resource consumption. The cores form fully connected subgroups, which then connect with each other using the point-to-point network. The point-to-point network architecture enables easy compliance with message consistency policy. Reduce synchronization overhead using floating barriers: In our BFS architec- ture, we use message passing to send data and synchronization requests between the cores through the same communication channels. The correct execution of BFS algo- rithm is guaranteed by enforcing the message consistency policy. The synchronization barriers are decided by each core individually in a distributed way. This scheme gives birth to “floating barriers,” in which the participating cores reach the same barrier at different times. This phenomena occurs due to the behavior of BFS algorithm where the “frontier nodes” can come from two consecutive levels. This scheme reduces the barrier overhead and can be potentially generalized to application design on other mes- sage passing systems. 3.3.3 MapsCore-Inspired General Purpose Processors Traditionally, FPGAs act as accelerators to GPPs, like they do in high-performance re- configurable computing systems. However, our study of MapsCore for sparse compu- tations reveals FPGA’s potential as a contributer to the control path of general purpose 56 computing systems. We document two such schemes learned from our experience with large dictionary string matching and BFS on a graph. FPGA as a Thread Context Manager and Scheduler: General purpose processors typically utilize multi-threading to deal with long latency peripherals. 
However, context switches are costly and, for sparse computations, occur often. Recent architectures use hardware threading techniques, such as Intel's SMT, to alleviate such problems, but the number of threads that can be supported effectively this way is limited. The recent wide deployment of virtualization has also raised the demand for thread and context management. A software-only solution is flexible but slow, while hardware threading lacks scalability and flexibility. Our solution for large dictionary string matching demonstrates that an FPGA-based design can benefit from both worlds. The context storage can be allocated on demand on the FPGA, and a context manager is in charge of delivering a ready context to its CPU. More complex scheduling disciplines can be realized using soft or embedded processors on the FPGA. Dynamic partial reconfiguration techniques, which are promoted by some FPGA vendors but not well adopted in conventional FPGA applications, can be utilized to configure designs on the logic fabric that provide high-bandwidth context storage.

FPGA as a Programmable Interconnect: For a multi-core architecture, performance relies largely on the aggregate bandwidth between cores and memories. This bandwidth is often limited by the interconnect, which accommodates not only memory requests but also other types of communication between the cores. The dynamics of the cores' interaction can sometimes permit them to converse in multiple pairs simultaneously, but this is unlikely to be fully supported by the interconnect. Using an FPGA as a programmable interconnect allows for dynamic hardware-based switching that adapts to specific applications and provides high aggregate bandwidth. FPGA has the potential to outperform most widely used interconnects. Again, dynamic partial reconfiguration on FPGA enables 100% adaptation of the interconnect to applications.

Chapter 4
Large Dictionary String Matching on FPGA

String matching looks for all occurrences of a pattern dictionary in a stream of input data, including patterns that overlap. As described in Section 2.2.1, much of the early work on using reconfigurable hardware for string matching has focused on achieving superb performance through special logic design and smart utilization of on-chip memory to hold as many data representation entries as possible. However, the size of dictionaries has increased greatly. A dictionary can now have 10,000 patterns or more [81, 78], which results in a state-transition table (STT) that is tens of megabytes in size when the AC-opt algorithm is used for string matching (see Section 2.2.1.3). Such large tables can be stored only in external memory and incur long access latency. Schemes to effectively avoid the long latency penalties that result from accessing external memory have not yet been examined for such applications.

In this chapter, we develop a high performance string matching engine on FPGA with DRAM modules used as external memory in the system. This study demonstrates a design paradigm that illustrates how a host of softcores can form a framework that maximizes the potential of the DRAM system to perform as well as on-chip memory.

4.1 Characteristics of AC-opt

In our study, the STTs usually have more than 60K states, which constitute the rows of a 2-dimensional array. The number of columns is the same as the size of the alphabet. Thus, a large dictionary needs 60 MB of storage or more, which should be stored in DRAM.
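The arithmetic behind this estimate, assuming the 32-bit next-state entries described later in Section 4.2.2, is simply:

STT size (approx.) = 60,000 states x 256 next-state entries per state x 4 bytes per entry, which is about 61 MB.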
We believe that, even for a significantly large input stream, the states traversed during string matching are concentrated on a small part of the STT. The few states that are visited most by the string matching engine can cover the majority of state transitions during the matching process.

Scenario               % states in levels 0, 1    % hits
Full-text search              0.114%              46.79%
Network monitoring            0.114%              81.46%
Intrusion detection           0.506%              89.84%
Virus scanning                0.506%              88.39%

Table 4.1: A few states in levels 0 and 1 are responsible for the vast majority of the hits during string matching

The four representative application scenarios in our study are (1) a full-text search system, (2) a network content monitor, (3) a network intrusion detection system, and (4) an anti-virus scanner. These scenarios are the same as in Scarpazza et al. [78]. In scenario (1), a text file (the King James Bible) is searched against a dictionary containing the 20,000 most used words in the English language. In scenario (2), network traffic is captured at the transport control layer with Wireshark [35] while a user is browsing multiple popular news websites; this capture is searched against the same English dictionary as in scenario (1). In scenario (3), the same network capture is searched against a dictionary of about 10,000 randomly generated binary patterns, whose lengths are uniformly distributed between 4 and 10 characters. In scenario (4), a randomly generated binary file is searched against the randomly generated dictionary.

Scarpazza et al. [78] showed that the DFA states at levels 0 and 1, whose distances to the initial state are 0 and 1 respectively, attract the vast majority of the hits to memory, as shown in Table 4.1. We further recorded the number of hits to each state during the search process in the same four scenarios and ranked the states in descending order of the number of visits. From the results presented in Figure 4.1, which shows the number of top states and the percentage of memory accesses that go to these states, we can see that more than 85% of the accesses by the string matching engine go to the top 1,000 states. We define the hot states as the set of states that are visited most during a string matching process.

Figure 4.1: Memory accesses to hot states (fraction of accesses versus the number of hot states, for the full-text search, network content monitor, network IDS, and anti-virus scanner scenarios)

4.2 Multi-Core Architecture for Large Dictionary String Matching

4.2.1 Architecture Overview

Our proposed architecture is shown in Figure 4.2. There are p cores sharing one STT through an interface to the DRAM(s). Each core, such as C_i in the figure, is equipped with a copy of an on-chip buffer B_i, serves an input stream S_i, and produces an output O_i.

Figure 4.2: Architecture overview (cores C_0 ... C_{p-1} with on-chip buffers B_0 ... B_{p-1}, connected through an interface module and a DRAM controller to the DRAM holding the STT)

Utilizing the data usage feature identified in the last section, the buffers are employed to store the hot states on chip in order to reduce off-chip memory references and to take advantage of fast on-chip memory accesses. When an address to the STT arrives, the core logic decides whether to direct it to off-chip DRAM or to on-chip storage. If a large number of hot states are stored on chip, the buffer can resolve the majority of references to the STT, which improves the throughput performance of string matching.
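The sketch below, our own illustration rather than the dissertation's RTL, shows the decision each core makes for every STT reference; the constant and the names are hypothetical. It uses the row-major addressing described in Section 4.2.3 and the single comparison made possible by the re-mapping of Section 4.3.1.

#include <cstdint>

constexpr uint32_t kNumHotStates = 1000;   // "n", the number of hot states kept in BRAM (illustrative value)

struct SttRef {
    uint32_t index;   // row-major STT entry index: currentState * 256 + character
    bool     onChip;  // true -> on-chip hot-state buffer (BRAM), false -> external DRAM
};

// One step of a matching engine: combine the current state with the next input
// character and decide where the resulting STT reference should be served.
inline SttRef makeReference(uint32_t currentState, uint8_t ch) {
    SttRef ref;
    ref.index  = (currentState << 8) | ch;          // concatenation of the state ID and the 8-bit character
    ref.onChip = (currentState < kNumHotStates);    // after re-mapping, a single comparator suffices
    return ref;
}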
The cores are connected to external DRAM through an on-chip interface module and a DRAM controller. The DRAM controller can be off-chip, but it is usually im- plemented on FPGA. 4.2.2 On-Chip Buffer for Hot States To populate on-chip buffers with hot states, we first run a search against the STT using a training input trace, which exhibits statistical similarity to the incoming traffic. The number of visits to each state of the DFA is recorded, and the list is sorted in descend- ing order. The topn entries in this list are defined as hot states and will be stored in on-chip buffers. Asn gets larger, the hit rate to on-chip buffers by a string matching engine becomes higher. However, the selection ofn is also affected by factors, such as the type of buffer device and the available buffer size. A state corresponds to a row in the STT, with 256 next state entries for an 8-bit represented alphabet. When we use 32-bit data to store a next state entry, including the next state ID and output functions, 1 KB of storage is needed for each state. The on-chip storage can be implemented as a fully associative cache on FPGA, such as a CAM. However, a CAM of thousands of entries can lower the system clock rate and can bottleneck performance. Due to its fast access speed and large volume, Block RAM (BRAM) is an ideal choice for implementing on-chip buffers to hold a large number of hot states. 63 4.2.3 Structure of a Core A single core architecture is shown in Figure 4.3. An input character is received by the Address Generator and is combined with the current state to generate a new address for STT reference. The STT is organized as a 2-dimensional matrix, in which the row indices represent the state IDs and the column indices represent the 8-bit input characters. In the DRAM and in the BRAM, the data are stored physically as a one- dimensional array in row major order. Thus, the address generation can be achieved by simply concatenating the current state ID with the 8-bit input character. This address is then used to determine whether this memory reference should go to an on-chip buffer or off-chip DRAM. To exploit the available DRAM bandwidth, we add multiple virtual threads to a core to take advantage of temporal parallelism. Each of these virtual threads processes an input stream, which could be a segment from a long input stream or an indepen- dent stream. A virtual thread is a DFA engine that traverses the STT with its input characters. The thread manager, along with thread context storage, is added to provide scheduling and synchronization among the virtual threads. The thread context stores the status of a virtual thread, including core ID, virtual thread ID, the address to STT, and the returned current state after a reference to STT is resolved. It also keeps track of whether the reference to STT is on-chip or not. The thread manager chooses a ready thread and picks its input stream through the Mux for the address generator to process. The Output unit checks thread context registers to output the matches. 64 Buffer (Hot States) DRAM Interface Module Thread Contexts Address Generator Thread Manager Mux On-Chip? Current State Input Streams Core i New address Output Output Y To external Memory To DRAM Figure 4.3: Structure of a core 4.2.4 Reconfigurability of the Multi-Core Architecture Our architecture’s cores execute on their own, even though they may be affected by others through shared global devices, such as interconnect and DRAM. 
The p cores can be abstracted as p hardware threads, which add spatial parallelism to the architecture by multiplying the execution of a single core. As a result, this massive parallelism collectively exploits the DRAM bandwidth.

The multi-core architecture on FPGA can be described by parameters such as the number of cores, the number of virtual threads per core, the external DRAM's bandwidth and latency, and the on-chip buffer's size, latency, and access bandwidth. Constrained by the resources of the FPGA, these parameters can be chosen to achieve high performance for string matching.

4.3 Design Optimizations

4.3.1 DFA Re-mapping

Figure 4.4: DFA re-mapping (the hot states, ranked 0 to n-1, are re-mapped onto the first n state IDs; the remaining states stay in external memory)

The IDs of the identified hot states were initially assigned during the building of the DFA and are unlikely to be contiguous. Since the state IDs are used by the hardware logic to decide which memory to reference, this intermittent nature of the hot state IDs complicates the hardware design. We adopt an ID re-mapping scheme to ease this complication by shifting the IDs of the hot states to the beginning of the STT, making them the top states. A new ID is assigned to each state according to its rank in the sorted sequence. Thus, the re-mapping is a simple exchange of old and new IDs during a breadth-first search (BFS) on the original DFA that starts from the root node. This process is illustrated in Figure 4.4. The state space is divided into two domains after the re-mapping: a state with an ID lower than n (the number of selected hot states) goes to an on-chip hot state buffer, while the other states go to external DRAM. Hence, the hardware design for this decision-making becomes a comparator.

4.3.2 Simplified Thread Synchronization by Input Interleaving

For thread scheduling and synchronization, there are different ways to design the thread manager and thread context store shown in Figure 4.3. To reduce implementation complexity, we use an input interleaving scheme for thread scheduling and synchronization. In this scheme, every virtual thread within a single core, identified by a thread ID from 0 to m-1, is assigned a string segment, where m is the number of virtual threads per core. These threads are polled in a round-robin fashion by the thread manager; therefore, the input streams are essentially interleaved.

Figure 4.5: Interleaving input to pipelines (thread contexts circulate through a k-stage BRAM buffer followed by an (m-k)-deep shift register, with the input interleaved over the m threads)

As illustrated in Figure 4.5, a thread, even one referencing the STT in DRAM, sends its context into the BRAM pipeline. The thread context can include the core ID, the thread ID, generated addresses, results from the BRAM buffers, and other control information. When it is shifted out of the BRAM module of k stages, the context of the thread is pushed into a shift register with a depth of m-k. In this configuration, a thread context meets its own next input at the time of its exit from the shift register. Using this scheme, the thread contexts can be maintained by the shift register with help from the BRAM buffer design, which is introduced in the next section. A drawback of this design is that performance is lost if execution stalls because the head thread's reference to the STT, specifically to DRAM, has not yet returned. Our evaluation shows that, when m is reasonably large, our design does not suffer significantly from these stalls.
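The following C++ sketch is ours; the container-based "ring" merely stands in for the k-stage BRAM pipeline plus the (m-k)-deep shift register, and all names are illustrative. It shows why the round-robin interleaving needs no explicit scheduler state: a context simply re-enters the pipeline and meets its own next character m slots later.

#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

struct ThreadContext {
    int           threadId     = 0;
    uint32_t      currentState = 0;   // DFA state of this virtual thread
    std::size_t   pos          = 0;   // next character in this thread's input segment
    bool          done         = false;
};

// stt[state][ch] is the next-move table; segments[t] is the input assigned to virtual thread t.
void runInterleaved(const std::vector<std::string>& segments,
                    const std::vector<std::vector<uint32_t>>& stt) {
    const std::size_t m = segments.size();     // number of virtual threads in the core
    std::deque<ThreadContext> ring(m);         // models the pipeline plus shift register of total depth m
    for (std::size_t t = 0; t < m; ++t) ring[t].threadId = static_cast<int>(t);

    std::size_t active = m;
    while (active > 0) {
        ThreadContext ctx = ring.front();      // the head thread is polled (round robin)
        ring.pop_front();
        if (!ctx.done) {
            const std::string& s = segments[ctx.threadId];
            if (ctx.pos < s.size()) {
                uint8_t ch = static_cast<uint8_t>(s[ctx.pos++]);
                ctx.currentState = stt[ctx.currentState][ch];   // one STT reference per character
            } else {
                ctx.done = true;               // this thread's segment is exhausted
                --active;
            }
        }
        ring.push_back(ctx);                   // the context re-enters and meets its next input m slots later
    }
}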
Our evaluation shows that, whenm is reasonably large, our design does not suffer significantly from the stalls. When the interleaved input streams are the segments from one input trace, match- able patterns across the boundary of two consecutive segments can be missed. To avoid this hazard, an overlap, equal to the length of the longest pattern minus one, is preserved between the neighboring segments when partitioning [78]. This method slightly decreases the overall throughput, but it guarantees the correctness of the string matching engine. 4.3.3 Shared and Pipelined Buffer Access Module We utilized a shared and pipelined design for the BRAM buffer access module. BRAMs on FPGA can be naturally configured for dual-port access without loss of performance. So, two cores can share one buffer where a set of hot states is stored. To increase the clock frequency of buffer access, we adopt a pipeline architecture inside this module. As illustrated in Figure 4.6, the BRAMs in the buffer module are divided into k even partitions, denoted M 0 ;M 1 ;M k1 . The selected hot states are also divided evenly intok groups, and each group is stored on one of the BRAM partitions. The ac- cessing elements, denotedAE, are separated by the pipeline registers, denotedREG. An AE is responsible for accessing its local BRAM and relays data from stage to stage. As seen in Figure 4.6, a core sends a thread’s context with address information into the BRAM buffer module. For a thread that is accessing DRAM, its thread context is simply passed through stage registers without any processing. However, if a thread 68 M 0 From core 1 M 1 M k-1 BRAM Partitions From core 0 To core 1 To core 0 REG i AE i BRAM i en Control From REG i-1 From BRAM i-1 To next stage From the other pipeline Buffer Access Module AE 0 AE k-1 AE’ 0 AE’ 1 AE 1 AE’ k-1 Figure 4.6: BRAM buffer access module needs to access the BRAM partitions, then anAE, which is getting data from both the BRAM and the pipeline register of the previous stage, must do the following: If the arriving thread needs access to BRAM buffers and the STT address falls in the range of its local buffer, then it must send the address into the local BRAM. If the output of the previous BRAM partition is the resulting current state for this thread, then it must pass it to the next pipeline register. Otherwise, the data from the previous pipeline register is passed along to the next stage. The control signal is global to all stages in a pipeline and is used by the thread manager to hold the pipeline when a stall is necessary. Note that all BRAM partitions have output registers and do not need pipeline registers in between. 69 4.3.4 Interface Between Cores and DRAM Controller Our design uses a multiple-to-one FIFO to interface the cores and the DRAM con- troller. A simple time-slotted, round-robin scheduling is used to serve incoming re- quests from each core. By “time-slotted,” we mean that the addresses coming at a clock cycle go through the FIFO according to a pre-determined order, such as theIDs of the cores, before the requests from the next clock cycle. This scheme is consistent with the virtual thread management introduced in Section 4.3.2. It performs, because our input streams to the cores are independent of each other and they all have equal priorities. We adopt a design from Le et. al. [57] and adapt it to a 4-to-1 basic unit of synchronous FIFO with conversion. A higher ratio FIFO can be formed by using multiple basic units. 
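The time-slotted, round-robin service in front of the DRAM controller can be sketched as follows. The request fields and the one-request-per-core input slots are simplifying assumptions; the actual design composes 4-to-1 synchronous FIFO units as described above.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <vector>

// Behavioral sketch of the time-slotted, round-robin merge in front of the DRAM
// controller: requests arriving in one clock cycle are drained into the shared
// FIFO in a fixed order (core ID) before any request from the next cycle.
struct DramRequest { uint8_t core, thread; uint32_t stt_addr; };

void merge_cycle(std::vector<std::optional<DramRequest>>& per_core_inputs,  // one slot per core
                 std::deque<DramRequest>& shared_fifo) {
    for (auto& slot : per_core_inputs) {          // fixed order = core 0, 1, 2, ...
        if (slot) {
            shared_fifo.push_back(*slot);
            slot.reset();
        }
    }
}
```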
The FIFOs are implemented using registers and logic only, in order to save BRAM for the hot state buffers. A common implementation of a FIFO is a circular buffer with read and write addresses. The size of the FIFO sets the maximum number of entries that can be queued, and it is bounded by m·p in our design. This bound follows from the fact that a thread does not send a new request until its previous request has been served. Therefore, the maximum number of active threads, which equals the maximum number of outstanding STT references, is m·p.

4.4 Performance Analysis and Experimental Results

4.4.1 DRAM Access Module

To evaluate system performance for string matching with a large dictionary and a long input trace, one must study the behavior of the DRAM modules. DDR SDRAMs have gained popularity by increasing operating frequency and have become the standard for DRAM. In Table 4.2, we identify some critical timing specifications for DDR SDRAM from the published technical documentation of DRAM vendors [61].

Table 4.2: DDR SDRAM key electrical timing specifications. All values are in ns except the number of banks; the clock period is the one used when working with FPGA.

Parameter      DDR   DDR2   DDR3
tRTP           12    7.5    10
tRC            55    55     50
tRRD           10    10     10
tRAS           40    45     37.5
tRP            15    15     15
tRCD           15    15     15
tCL            15    15     15
Clock period   5     3      2.5
Banks          4     4/8    8

tRTP: Read-to-Precharge delay. tRC: Active-to-Active delay in the same bank. tRRD: Active-to-Active delay between different banks. tRAS: Active-to-Precharge delay. tRP: Precharge latency. tRCD: Activate latency. tCL: Read latency.

FPGA vendors support DDR3 at a clock rate of 400 MHz and DDR2 at 333 MHz on their development platforms [7, 102]. However, when data access is irregular and spatial locality cannot be exploited, DRAM delays and latencies become more important than the peak data rates. The parameters in Table 4.2 can be divided into two classes. The first class contains timing requirements such as tRTP, tRC, tRRD, and tRAS, which specify the delays that must be satisfied between consecutive operations on the DRAM. The second class contains latencies, including tRP, tRCD, and tCL, which are the times an operation needs to complete. These parameters are used in our simulation program to estimate the performance of our design. Note that the values of these parameters may vary slightly for different DRAM modules.

4.4.2 Performance Analysis for a One-Core System

Let B represent the number of bits per input character and T the clock period of the hardware logic. A DRAM access takes u cycles when it hits an open row and v cycles when a closed row is accessed. Let r denote the row-hit ratio of the DRAM accesses. Accesses to the on-chip buffers complete in b cycles, and the hit rate of the on-chip buffers is h.

Throughput = B / { h·b·T + (1 − h)·[ (1 − r)·v·T + r·u·T ] }        (4.1)

The lower bound on the throughput of our design is given by Equation 4.1. It assumes a very conservative scenario in which the DRAM access module sits at the end of a pipeline and the BRAM buffer has only one stage, so the pipeline stalls whenever a DRAM access is in progress. Based on Figure 4.1, our architecture has h > 0.9 when a large buffer is employed. According to Equation 4.1, the DRAM access then contributes less than 10% of its latency to the throughput, so a minor variation in DRAM latency does not affect the design performance significantly. Moreover, our pipeline is designed to hide a certain amount of DRAM access latency for some DRAM references.
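To make Equation 4.1 concrete, the short program below derives the open-row and closed-row cycle counts u and v from the DDR2 column of Table 4.2 and evaluates the bound. The buffer hit rate h and row-hit rate r used here are illustrative assumptions, not measured values.

```cpp
#include <cmath>
#include <cstdio>

// Illustrative evaluation of Equation 4.1. The DDR2 latencies are taken from
// Table 4.2; h, b, and r are assumed example values, not measurements.
int main() {
    const double T = 5e-9;                        // 200 MHz logic clock period (s)
    const double B = 8;                           // bits per input character

    // Closed-row access: precharge + activate + read; open-row access: read only.
    const double tRP = 15e-9, tRCD = 15e-9, tCL = 15e-9;
    double u = std::ceil(tCL / T);                // open-row access, in logic cycles
    double v = std::ceil((tRP + tRCD + tCL) / T); // closed-row access, in logic cycles

    const double h = 0.9, b = 1, r = 0.3;         // assumed buffer hit rate, buffer cycles, row-hit rate
    double denom = h * b * T + (1 - h) * ((1 - r) * v * T + r * u * T);
    std::printf("u=%g v=%g cycles -> lower bound %.2f Gbps per engine\n",
                u, v, B / denom / 1e9);
    return 0;
}
```

For these example numbers the single-engine bound is roughly 1 Gbps, which is why one DFA engine alone cannot exploit the available DRAM bandwidth and multiple virtual threads are employed.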
Particularly when multiple DRAM access requests are queued at the DRAM interface module, the latency of each request can be reduced thanks to the multi-bank architecture of commercially available DRAMs. When a request is waiting to access a bank other than the currently active bank, the controller can prepare the new bank in advance so that the tRTP delay is saved. Consequently, the actual throughput should be higher than a projection based on Equation 4.1. The DRAM latencies still have an impact on the throughput, but minor variations in access latency become negligible. Equation 4.1 also suggests a design guideline for our proposed architecture: to improve the throughput effectively, we can either reduce b·T or increase h.

4.4.3 Resource Constraints

4.4.3.1 I/O

In order to achieve superior performance on sparse computations, the system must provide sufficient I/O bandwidth and short memory access latency to the cores. Inter-core communication can use FIFOs and specialized interconnects, which adds complexity to the logic design. However, most sparse computations perform only simple arithmetic operations at each step, so the core logic usually does not consume a large amount of fabric logic on the FPGA. This observation is verified by our implementation results in a later section. Comparatively, the wise use of FIFO resources and the design of the interconnect become more important than the design of a processing core.

The I/O facility, dedicated to DRAM access in this design, also consumes logic resources on the FPGA. A commercial-class DRAM controller design utilizes 1760 slices and 2 BRAM blocks on a XC5VLX50T, as reported in Xilinx Application Note XAPP867. This accounts for 5-10% of the logic and 1-2% of the BRAMs on a middle- to high-end FPGA. Since we do not need a large number of DRAM interfaces, this resource constraint should be met easily. Moreover, for our applications, the DRAM controller can be simplified a great deal. Logic resource consumption does not constitute a concern for our architecture.

I/O pins on the FPGA are an important resource for a multi-core architecture on FPGA. On Virtex-5, the number of user I/Os ranges from 400 to 1200. A DRAM controller needs fewer than 100 pins. A core needs at least 8 wires for its streaming data input plus extra wires for control signals. These needs limit the number of cores to roughly 20 to 60.

High-speed serial transceivers, such as GTP and GTX from Xilinx, have recently appeared on FPGAs. These ports are currently used in PCI Express and Ethernet controllers. We can utilize them to interface either external DRAMs or input/output string streams.

4.4.3.2 Local Memories

Block RAMs (BRAMs) are paramount to the performance of our architecture. The size of the BRAM buffer directly determines the hit rate h, and it also indirectly affects the clock rate (the T introduced in Section 4.4.2). When more than 70% of the BRAMs on an FPGA are used for one functionality in a design, the clock rates obtained from place and route begin to deteriorate. The BRAM available on the FPGA sets a limit on how many hot states can be stored on chip. On Virtex-5 from Xilinx, BRAM blocks come in 18 Kb or 36 Kb sizes, and the total capacity ranges from 1 Mb to 18.5 Mb. So, one can keep 4-5K hot states in on-chip memory without advanced schemes to compress the states' data structure.

FPGA designs also frequently utilize on-board SRAM. Access to external SRAM can be slower than on-chip BRAMs when a design is running at peak clock rate.
How- ever, for a complex design implemented on FPGA, the system clock rate is usually between 100 MHz and 200 MHz. This rate fares well with external on-board SRAMs. SRAMs also have size limits, which are not to exceed 10 MB, due to practical and economical reasons. Hot states may be allocated to SRAMs if needed. FPGA offers another memory resource, distributed RAM, in a comparatively small amount. This type of RAM is built using logic fabric on FPGA and rivals the perfor- mance of BRAMs. 74 Because of the flexibility of FPGAs, it is easier to develop a memory system for the hardware design that meets both the frequency and size requirements. All of the above factors should be taken into account when evaluating the performance limits of our architecture. A later section evaluates a sample design and shows that the lower bound of our system performance can compete with highly sophisticated CMP solutions for large dictionary string matching. 4.4.4 Implementation on FPGA Platforms We implemented our architecture on a Virtex-5 XL155 for a 64K state STT with 2 cores. An STT of 64K states needs a 16-bit representation for each of its state IDs. According to the analysis in Section 4.1, the throughput can be better as the buffer size gets larger. We target a buffer of 1K states, which needs at least 4 Mb BRAM on FPGA. The Virtex-5 LX155 has 192 BRAM blocks of 36 Kb or 6912 Kb in total, which is sufficient to buffer 1K states in our design. 0 50 100 150 200 250 0 20 40 60 80 100 120 140 Number of BRAM pipeline stages Clock Rate (MHz) Figure 4.7: Impact of the number of pipeline stages on design clock rate 75 Shown in Figure 4.7 is the clock rate versus the number of pipeline stages from place and route results for a design implementation with 4Mb BRAM buffer and a varied number of pipeline stages. We can reasonably say that our implementation can run at over 200 MHz using the design optimizations introduced in Section 4.3. The usage of logic fabric is very small when not including the DRAM controller, with a maximum of 3% of OLOGICS, 4% of Slice LUTs, and 6% of Slice LUT-Flip Flop pairs for the 128-stage pipeline case. The implementation results prove that the logic resource on FPGA is not a limitation for our architecture. Note that we do not necessarily divide the BRAM into 128 partitions in our design. In practice, we can use a 8-stage BRAM pipeline with additional shift registers to fulfill the demand for the number of virtual threads to achieve better DRAM utilization. In that case, we only need 1-2% of logic for a 2-core system, while the BRAM usage is at 66% of the chip. We can place dozens of cores on a mid-sized FPGA. Our design works with a customized DRAM controller that is connected by a FIFO queue that can have different clock frequencies for write and read. The controller is based on a DDR2 controller generated by the Memory Interface Generator (MIG) tool in Xilinx ISE design suite 10.1. 4.4.5 Performance Evaluation Performance of string matching is measured by throughput inGbps (gigabit per sec- ond). We first study the performance of our design implementation for a 2-core sys- tem. The two cores share a BRAM buffer of 1K states and an external DRAM, Micron MT47H64M16 DDR2 SDRAM. The DRAM has a 16-bit data bus and 8 internal banks with a selected burst length of 4. The number of pipeline stages in the BRAM buffer 76 is set to 8. The number of virtual threads was varied to observe its impact on perfor- mance. 
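The storage and line-rate figures quoted for this setup can be verified with a few lines of arithmetic; the only assumption is the 16-bit next-state entry width implied by the 64K-state STT.

```cpp
#include <cstdio>

// Arithmetic behind the numbers quoted for the 2-core experiment.
int main() {
    // 1K hot states, 256 next-state entries each, 16-bit state IDs (64K-state STT):
    double buffer_bits = 1024.0 * 256 * 16;                  // = 4 Mb, as stated
    double lx155_bits  = 192.0 * 36 * 1024;                  // 192 blocks of 36 Kb = 6912 Kb
    std::printf("hot-state buffer: %.0f Mb of the %.0f Kb available\n",
                buffer_bits / (1024 * 1024), lx155_bits / 1024);

    // Two 8-bit input streams consumed at the 200 MHz core clock:
    std::printf("peak input rate : %.1f Gbps\n", 2 * 8 * 200e6 / 1e9);   // = 3.2 Gbps
    return 0;
}
```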
1.0 1.5 2.0 2.5 3.0 3.5 0 10203040506070 Number of Threads per Single Core Throughput (Gbps) full-text search network content monitor network IDS anti-virus scanner Figure 4.8: Throughput for a 2-core system Our simulation program set a 200 MHz clock rate for the cores, while allowing the DRAM module to run at 333 MHz. The DRAM refresh cycles were ignored due to their minimal effect on the performance. When we have 2-way 8-bit input at 200 MHz, the maximum throughput achievable by the design is 3.2 Gbps. The results for the four application scenarios from Section 4.1 are shown in Figure 4.8. As the number of virtual threads per core increases, the throughput generated by the two cores also grows to about 3 Gbps. For application scenarios such as the network content monitor and network IDS, the optimal throughput of 3.2 Gbps is approached when the number of threads is large. This result means that all access to DRAM is returned before a requesting thread is scheduled to run again, which eliminates the stalls of execution. We then fixed the total size of the BRAM buffer on chip and varied the number of cores to study the performance of our architecture under memory constraints. As the 77 number of cores increases, the buffer size of each core and the number of hot states decrease, which reduces the hit rate to the on-chip buffers. 1.0 1.5 2.0 2.5 3.0 3.5 02 46 8 10 12 14 16 18 Number of Cores Throughput (Gbps) 8 16 24 32 48 64 Number of Threads per Core Figure 4.9: Throughput performance of multi-core architecture for full-text search Figure 4.9 shows that, for a full-text search, our design with two cores can yield optimal performance for all cases with different numbers of threads per core. The throughput for all designs is then decreasing, even though the number of cores contin- ues to increase. For the network content monitor experiment, the performance peaks at about 5.5 Gbps for a four core design, as shown in Figure 4.10. Then, the throughput degrades in a way that is similar to the result for the full-text search. As the design changes to use more cores, the size of each core’s on-chip buffer also gets smaller. Therefore, the hit rate to on-chip buffer will drop to a level at which the references to external DRAM deplete the available bandwidth, and at which increasing the number of threads no longer provides benefits. At that time, adding more cores will only intensify the contention for DRAM and deteriorate the overall performance of the design. The reason for the different peaking points between the two experiments is that the hit rate curves shown in Figure 4.1 are different for the two. The full-text search’s hit rate to on-chip buffer is lower than the one for the network content monitor, given 78 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 0 2 4 6 8 10 12 141618 Number of Cores Throughput (Gbps 8 16 24 32 48 64 Number of Threads per Core Figure 4.10: Throughput performance of multi-core architecture for network content monitor the same number of hot states, so its performance should saturate earlier than that of the network content monitor. The difference in peak throughputs for the two experiments also shows that the performance of our proposed architecture is contingent upon input stream characteristics. 4.4.6 Performance Comparison While no published research on FPGA exists with which to compare our work, we studied large dictionary string matching on a Dell XPS 410 with an Intel Core 2 Quad Q6600 processor for the full-text search scenario. 
The Intel C/C++ 10.1 compiler applies system-specific optimizations, which can allow the cycles-per-instruction (CPI) of a program to approach 1/4. Similar to the scheme in our FPGA design, a "training" pre-fetch is used to boost performance. The training technique conducts a search on a training input that is statistically similar to the real input; the intention is to load the cache with the most-visited states. The results for four cores are presented in Table 4.3.

Table 4.3: Performance of string matching on a general-purpose multi-core system

                              Measured throughput (Gbps)
                              Best      Average
5,000 patterns                5.5       3.2
5,000 patterns, trained       10.0      5.8
50,000 patterns               3.3       2.3
50,000 patterns, trained      4.7       3.4

Another thorough study of large dictionary string matching, on the Cell BE, is presented by Scarpazza et al. [78]. In this study, the authors found that the XDR DRAM in the system performs best in random access when an SPE performs 16 concurrent transfers of 64-bit blocks. They proposed a 16-input interleaving scheme for one SPE. Assisted by other optimization techniques for the memory system and local stores, the design achieves a theoretical aggregate pattern search throughput of 3.15 Gbps on a 2-Cell-processor system with 16 SPEs. Our proposed architecture with only two to four cores exhibits performance that is competitive with these highly advanced CMPs, without resorting to a high-performance proprietary DRAM system as the Cell BE does.

Chapter 5

Breadth-First Search on FPGA Platform

Breadth-First Search (BFS) is a fundamental graph algorithm. Its implementation on various computing platforms has been widely studied [14, 76, 105]. Due to the irregular nature of the fine-grained memory accesses to graph representations, parallelization of BFS on cache-based systems can be a difficult task [98]. Many issues, such as the cache coherence policy and inter-process synchronization, come into play and complicate the situation. This predicament is commonly blamed on the lack of I/O capacity in the system.

As a dominant reconfigurable computing platform, the FPGA can provide a suitable solution to these problems. Thanks to its configurability, an FPGA design can allocate sufficient resources to the I/O subsystem while also catering to the computing demands of specific graph problems. This chapter proposes a multi-softcore architecture on FPGA for BFS to investigate the related issues.

5.1 Parallel Breadth-First Search

BFS is a graph traversal algorithm on an undirected graph. Let G = (V, E) denote the input undirected graph, where V = {0, 1, ..., N−1} is the vertex set and E is the edge set. We assume that G is connected. Given G and a root vertex r ∈ V, BFS begins at r and explores all the neighbor vertices. Then, for each of the neighbor vertices, BFS visits the unexplored neighbor vertices, and so on, until it traverses all the vertices. During this procedure, all vertices at the same distance to r (the same level) must be explored before their neighbor vertices in the next level can be explored.

Figure 5.1: A parallel BFS (each thread i starts with Q_i = {r} if r is in V_i and Q_i = ∅ otherwise; per iteration it gathers the neighbor set S_i of the vertices in Q_i, dispatches S_i into outgoing queues Q_{i,0}, ..., Q_{i,P−1}, waits at a barrier, collects R_i = Q_{0,i} ∪ Q_{1,i} ∪ ... ∪ Q_{P−1,i}, marks each unmarked v in R_i and places it into Q_i, and waits at a second barrier, until all nodes are marked)

Parallelizing the BFS algorithm has been the subject of several studies [92, 98]. These studies' solutions are based on chip-multiprocessors (CMPs), and each utilizes various machine-specific optimization techniques.
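The structure of Figure 5.1 can be summarized in a compact C++ sketch. The modulo-P ownership rule and the sequential emulation of the barriers are assumptions made for illustration; the studies cited above use their own partitioning and synchronization machinery.

```cpp
#include <cstdint>
#include <vector>

// Level-synchronized, partitioned BFS in the style of Figure 5.1, simulated
// sequentially: "thread" i owns the vertices with v % P == i (assumed rule),
// and the two loops per level play the roles of the two barriers.
void parallel_bfs(const std::vector<std::vector<uint32_t>>& adj, uint32_t root, uint32_t P) {
    const uint32_t N = adj.size();
    std::vector<bool> marked(N, false);
    std::vector<std::vector<uint32_t>> Q(P);                    // ready queue per thread
    marked[root] = true;
    Q[root % P].push_back(root);

    bool work_left = true;
    while (work_left) {
        // Phase 1 (before the barrier): each thread dispatches neighbors to outgoing queues.
        std::vector<std::vector<std::vector<uint32_t>>> out(P, std::vector<std::vector<uint32_t>>(P));
        for (uint32_t i = 0; i < P; ++i)
            for (uint32_t v : Q[i])
                for (uint32_t nb : adj[v])
                    out[i][nb % P].push_back(nb);               // send nb to its owner

        // Phase 2 (after the barrier): each thread collects R_i and marks new vertices.
        work_left = false;
        for (uint32_t j = 0; j < P; ++j) {
            Q[j].clear();
            for (uint32_t i = 0; i < P; ++i)
                for (uint32_t nb : out[i][j])
                    if (!marked[nb]) { marked[nb] = true; Q[j].push_back(nb); }
            if (!Q[j].empty()) work_left = true;
        }
    }
}
```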
The reported throughputs range from sub 100 up to 800 million edges per second (ME/s). One such parallel BFS algorithm implementation is presented in Figure 5.1. In this implementation, the input graph G is represented as an adjacency array, a widely used data structure for graphs [18]. Assuming that there areN vertices in G, the adjacency array consists of a vertex array and up to N neighbor arrays. The vertex array has N elements, and each stores the 82 basic information of a vertex in G, such as the ID of the vertex, the flag that shows if the vertex has been visited, the owner of the node, the pointer to its corresponding neighbor array, and the size of the neighbor array. The neighbor array of a vertex v consists of the IDs of vertices adjacent tov. As shown in Figure 5.1, the graph is statically partitioned in P disjoint subsets, where P is the number of cores. Each core i runs a thread i and is the owner of the subsetV i , wherei = 0;1;:::;P1. No special algorithm is applied during the partitioning to generate locality in the subsets’ data representation. When executing, thread i locally maintains one ready queue Q i and P outgoing queues Q i;j , where j = 0;1;:::;P1, denotes the destination cores of the outgoing queues. Threadi gathers all neighbors from vertices inQ i and dispatches them into appropriate outgoing queues. Here, a barrier needs to ensure that the dispatching is complete before all threads access the data through interconnect. After the barrier, the core j receives vertices from Q i;j . It then marks those which are unmarked and puts them into Q j for the next level of exploration. The second barrier in the figure enforces the control dependency (the level-by-level exploration). This diagram of parallelized BFS gives a glimpse of performance blockades when it is implemented on CMPs. While the Q i s and Q i;j s can be locally maintained by a corei, the graph partitionV i may go to external memory due to the size limitation of on-chip cache on CMPs. Moreover, the data collection stage from all outgoing queues demands high throughput from the interconnect. The barrier can also be a costly operation on general purpose processors, since it is usually implemented using global storage and is serially accessed. 83 5.2 Architecture for BFS on FPGA FPGA-based solutions can alleviate the hurdle imposed by the access to external mem- ory and help with communications needed by data exchange and thread synchroniza- tion among the cores. We design a multi-core architecture and adapt the algorithm design, shown in Figure 5.1, to fully utilize the features of FPGA platform for high- performance algorithm implementation of BFS. 5.2.1 Architecture Overview The basic organization of our architecture is a group of cores that are connected through an interconnect. A separate DRAM interface connects the cores to external memory, such as DRAM(s), where the entire graph is stored. As shown in Figure 5.2, cores, interconnect, and DRAM interface are on the FPGA chip. Core 0 DRAM DRAM Interface Interconnect FPGA Storing Entire Graph Core 1 Core P-1 Figure 5.2: BFS architecture overview A core in our architecture has local memory. For the BFS algorithm, the on-chip memory stores two particular types of information. One type is for the subset of the vertices in the core’s partition, which includes flags for whether the vertices are visited, 84 the base addresses for accessing the vertices’ neighbor lists in DRAM, and the lengths of these neighbor lists. 
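In terms of data layout, the adjacency array of Section 5.1 and the per-vertex record just listed might look like the following sketch; the field names and widths are illustrative assumptions.

```cpp
#include <cstdint>
#include <vector>

// Sketch of the adjacency-array representation and of the per-vertex record a
// core keeps in its local memory; field widths and names are assumptions.
struct VertexEntry {         // one element of the vertex array (per owned vertex)
    uint32_t visited : 1;    // has this vertex been explored?
    uint32_t owner   : 7;    // core that owns the vertex
    uint32_t length  : 24;   // size of its neighbor list
    uint32_t nbr_ptr;        // base address of the neighbor list in DRAM
};

struct AdjacencyArray {
    std::vector<VertexEntry> vertices;   // N entries, indexed by vertex ID
    std::vector<uint32_t>    neighbors;  // all neighbor lists, stored back to back in DRAM
};
```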
Another storage is a data buffer to temporarily hold both of the queues of the vertices that are ready to be explored and of the neighbors that are ready to be sent. 5.2.2 Message Passing Parallel Processing Our architecture uses message passing through the interconnect to exchange informa- tion between the autonomous softcores to parallelize the algorithm. Note that the cores share the external memory only to be able to read the graph data; therefore, no race condition occurs. The messages include two basic types: (1) vertex information that may consist of source core ID, destination core ID, and vertex ID and (2) synchro- nization information, such as barrier markers for the BFS algorithm. Other types of messages can be added based on specific designs. The number of messages transmitted may demand a high bandwidth interconnect. However, the aggregate traffic is linearly related to, or limited by, the bandwidth of external memory. This relation is because the vertices that are exchanged in intercon- nect are all read from DRAM(s), which are bounded byjEj. The number of barrier messages isO(logN) only, where usuallyN <jEj. Hence, the system performance depends on the utilization of DRAMs, provided that the interconnect’s bandwidth is adequate. 5.2.3 Structure of a Core As shown in Figure 5.3, a core receives messages from and sends messages to the interconnect. If a message has vertex information, then it goes to vertex array after it has been received to determine whether it has been visited. Otherwise, the message is 85 sent to the synchronization control. When a vertex has not been visited yet, the access control uses the information for neighbor pointer and length of neighbor list from the vertex array to access DRAMs. The returns from DRAM are the neighbors of this vertex. The neighbors are then buffered to wait for output. Vertex Array (Visited : Neighbor Pointer : Length of Neighbor List) Visited? DRAM Interface Neighbor Buffer N Message from Interconnect Message to Interconnect ID/Synch? ID Output Control Synchronization Control Synch Access Control Figure 5.3: Single BFS core architecture The output control is responsible for sending vertex messages to their respective owners and sending barrier messages to all cores in the system for synchronizations. The next section will elaborate on the synchronization mechanism when the operation of BFS is explained. 5.2.4 Operation of BFS and Realization of Barriers We illustrate the BFS operations, which use the same algorithm as in Section 5.1, on our architecture and explain how the synchronization works by using a 4-core system in Figure 5.4. Note that we do not show cycle-by-cycle operations of the architecture nor the memory access latency here. The lines with an arrow end shown in Figure 5.4 are not real connections. For the purpose of demonstration only, they indicate feasible 86 message transmission between the cores that is supported by the underlying intercon- nect. Suppose that a vertex is designated as the root for the BFS and that the core that owns this root is the root core, denoted R in the figure. In this process, m i and m r represent barrier markers from core i or R respectively, and v r;j is a vertex message sent from root coreR to corei. The beginning stage of the BFS is shown in Figure 5.4 (a). The root core starts by reading the neighbor list of the root and then sends messages carrying the neighbor vertices’ information to their respective owners via communication links. 
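A sketch of the two message types carried by the interconnect, and of the marker counting a core performs, is given below; the encodings and field names are assumptions rather than the actual wire format.

```cpp
#include <cstdint>

// Sketch of the two message types exchanged over the interconnect and of the
// distributed barrier check; encodings and field names are assumptions.
struct Message {
    enum Kind : uint8_t { VERTEX, BARRIER } kind;
    uint8_t  src_core;      // sending core
    uint8_t  dst_core;      // owner of the vertex (VERTEX) or broadcast target (BARRIER)
    uint32_t vertex_id;     // valid for VERTEX messages only
};

// A core reaches its (floating) barrier once it has received one BARRIER
// marker from every core in the system, itself included.
struct BarrierTracker {
    uint32_t markers_seen = 0;
    uint32_t num_cores;

    bool on_marker() {                       // call per received BARRIER message
        if (++markers_seen == num_cores) {   // barrier reached: advance the BFS level
            markers_seen = 0;
            return true;
        }
        return false;
    }
};
```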
To determine a vertex’s owner core, its ID is checked according to the partitioning method. For example, if we evenly distribute the ID number space to the cores, then we can check into which range the ID falls to find its owner core. During this time, the other cores that are not the root send out barrier markers. Note that, in a core, the same barrier marker is sent to all cores in the same cycle by the output control module. Vr3 mr mr mr mr Vr2 Vr1 Vrr mr mr mr mr Vrr Vr1 Vr2 Vr3 m2 m2 m2 m2 m1 m1 m1 m1 m3 m3 m3 m3 m1 m2 m3 m1 m2 m2 m3 m3 m1 m1 m2 m3 mr mr mr mr m1 m1 m1 m1 m2 m2 m2 m2 m3 m3 m3 m3 Figure 5.4: Operation of BFS on multi-softcore architecture As stated before, BFS performs on a level-by-level basis. So, the root vertex is at Level 0, and all of its neighbors are atLevel 1. The root core starts inLevel 1, while non-root cores advance toLevel 1 after sending out the first batch of markers. When messages arrive at their destination, the vertex messages are processed first. As shown 87 in Figure 5.4 (b), thev r;j s enter the cores, while the barrier markers hold in buffers of the communication links. The on-hold marker messages prevent the following vertex messages in the same channels, which are not shown in this figure, from entering their destination cores. Note that, at this moment, the barrier markers from the root core have not been sent. Later, in Figure 5.4 (c), the barrier markers from the root arrive at their destinations. A core counts received barrier marker messages to decide if a barrier is reached, which occurs when the markers from all cores in the system arrive at this core. The core then initiates the synchronization control to advance toLevel2, and it then prepares to send a new round of marker messages when the neighbors buffered forLevel1 are sent out. This process continues in a loop until the BFS is complete. For simplicity, we omit the operations where cores1,2, and3 send out the buffered neighbors. Distributed Synchronization: From the above description, we learn that a core reaches its barrier when it gets barrier markers from all other cores. Each core decides its own synchronization based on this condition, and a core can arrive at the barrier at a different time than other cores. This phenomenon is called a “floating barrier.” Message Consistency Policy: The correct operation of barriers is guaranteed if a vertex message is always processed before its ensuing barrier marker message from the same sending core. This constraint is called message consistency policy in our architecture. To ensure this occurrence, a core should send a barrier marker to another core only after it sends out all neighbor vertices in the current level that belong to that core. The core on the receiving end should process vertices in its current level only and then process the corresponding marker messages. 88 5.3 Design and Optimization We present a design of our architecture on FPGA and introduce several optimization techniques to achieve improved performance. In our design, all of the cores are connected with one another to form a complete graph, which is shown in Figure 5.5. The communication channels are built using FIFOs on FPGA so that a message can be buffered there before entering the core. This FIFO is necessary especially when a barrier marker is in the front of the queue and holds the incoming messages from that channel. When we have P cores in the system, P 2 FIFO-enabled channels for bi-directional communication between each pair of cores are needed. 
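Before detailing the port structure, the send-side ordering imposed by the message consistency policy can be sketched as follows; it reuses the illustrative Message type from the previous sketch, and the queue representation is an assumption.

```cpp
#include <cstdint>
#include <vector>

// Send-side discipline that enforces the message consistency policy: for the
// current BFS level, a core first emits every vertex destined for a given core
// and only then the barrier marker on that same channel.
void flush_level(std::vector<std::vector<Message>>& channels,             // one FIFO per destination core
                 const std::vector<std::vector<uint32_t>>& outgoing,      // per-destination vertex IDs
                 uint8_t my_core) {
    const uint8_t P = static_cast<uint8_t>(channels.size());
    for (uint8_t dst = 0; dst < P; ++dst) {
        for (uint32_t v : outgoing[dst])
            channels[dst].push_back({Message::VERTEX, my_core, dst, v});
        // The marker follows all of this level's vertices, so the receiver can
        // never process the marker before the data it guards.
        channels[dst].push_back({Message::BARRIER, my_core, dst, 0});
    }
}
```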
Therefore, a core has P input ports and P output ports, and each serves one neighboring core only. The core sends messages to itself through a FIFO, including both vertex and barrier marker messages. This scheme provides easier compliance with the message consistency policy, since the vertex messages can wait in FIFOs for a preceding barrier marker to be processed. 0 j 1 P-1 i k Figure 5.5: A design of multi-softcore ar- chitecture for BFS The core design illustrated in Fig- ure 5.6 assumes an approach similar to pipelining. The storage devices, such as Ready Lists and Neighbor Buffer, have registers on their output which serve as pipeline registers. The output ports write messages to communication links (FIFO queues). The pipeline stages are indicated using dashed lines in the figure. Note that the DRAM Access takes varying amounts 89 of time before it returns data from the external memory to the neighbor buffers. The working regime of a core is as follows. Poll 0 Vertex Array (Visited : Neighbor Pointer : Length of Neighbor List) Ready List 0 Poll 1 Neighbor Buffer Bitmap Output Control Ready List 1 Input ports Output ports DRAM Access Figure 5.6: Single core design Input ports are scanned for an incoming message. When a vertex message is found, it is checked in Vertex Array to see whether it has been visited. A visited vertex is discarded. An unvisited vertex updates its entry in Vertex Array, which is addressed using the vertex’s ID, and the output from Vertex Array is pushed into Ready List. If a barrier marker is found, then it is counted until the total number of barrier markers is equal to the number of cores in the system. At this time, a barrier is reached, and a marker message is created and entered into Ready Lists. Entries in Ready Lists, carrying such information as address pointer and length of neighbor list, are used to access DRAMs. The returned neighbors from 90 DRAMs are written into Neighbor Buffer and wait to be output. A barrier marker enters Neighbor Buffer automatically. The vertices from Neighbor Buffer are checked against Bitmap (see Section 5.3.1) for whether they have been sent previously from the core. If not, they are sent to an appropriate output port by the output controller. However, a barrier marker will be sent to every output port. 5.3.1 Bitmap to Reduce Interconnect Traffic A vertex may appear in a core’s neighbor lists multiple times, due to the fact that it may be neighbor to multiple vertices in the graph partition of the core. During the BFS, these vertices will be sent to output ports more times than is necessary. This redundancy can be seen in Figure 5.7. We divide a graph of one million vertices, with a degree of 16 and topology ofNeighbor,Bipart, andRand (see Section 5.4.2.1 for the definitions), onto multi-softcore systems with 4, 8, and 16 cores. Vertex redundancy is measured by the surplus appearances of a vertex in the cores’ neighbor lists. If there are more cores, then less the redundancy occurs. Bipart shows less redundancy than Neighbor, which in turn is in a better situation thanRand. In general, the redundancy is dependent upon graph degree, topologies, and the number of partitions. A redundant vertex will cost a processing cycle after it enters its destination core. Based on the above observation, we design a bitmap scheme to reduce the redundancy of the traffic in communication links. As shown in Figure 5.6, a bitmap is used to record whether a vertex has been sent out previously from the core. 
If it has not been sent, then the vertex is forwarded to an output port and its bit in the bitmap is set. This per-core bitmap covers the entire graph, so the record of sent vertices could be very resource-consuming for a large graph; keeping only a single bit per vertex curtails the memory consumption. As a further optimization to subdue the memory demand of a large graph, we may replace the bitmap with a cache or a TCAM-based buffer to obtain a dynamically managed record.

Figure 5.7: Redundancy in a core's graph partition (Rand, Neighbor, and Bipart topologies on 4-, 8-, and 16-core systems)

5.3.2 Dual Processing Unit in One Core

Figure 5.6 shows that the polling modules and Ready Lists are duplicated in a core. Utilizing the dual-ported BRAMs on FPGA, we design a dual-processing scheme for the BFS to increase the number of execution threads. The polling units, each scanning one half of the input ports, write results into their respective Ready Lists. Only one copy is needed of the other local memories, such as the Vertex Array, Neighbor Buffer, and Bitmap; these must be capable of supporting two accesses at the same time.

5.3.3 Port Polling Schedule

Having a "perfect" schedule for polling the input ports may reduce the backlog in the communication channels. It may also enable better utilization of the duplex processing units by balancing their workloads and keeping them equally busy most of the time. However, the performance of a scheduling algorithm strongly depends on the graph topology and on the location of the root vertex of each BFS. Thus, it is difficult to devise a scheduling discipline that suits all input scenarios. In our design, we elect to use a polling scheme that is similar to round-robin. Starting from one input port, a processing unit of a core checks its half of the input ports for the first one holding a vertex message. If this vertex has not been visited, then it enters the corresponding Ready List. If the vertex has been visited, then it is discarded and the processing unit moves on to the next cycle of polling. This process repeats until the BFS completes.

We design this polling module as a finder of the next "1" among the input ports: "0" is assigned to empty ports or ports holding marker messages, and "1" is assigned to ports holding valid vertex messages. Rather than building a complex combinational circuit of muxes, adders, and comparators, we propose a memory-lookup design suitable for polling a modest number of ports. We define the anchor position to be the port immediately after the one that was just accessed. Given an anchor position and a bit representation of the current port status, the next "1" port is uniquely determined. Suppose that there are p ports. The address of the lookup memory is the anchor position concatenated with the bit representation of the current input status, which has a length of ⌈log p⌉ + p bits. The memory size is then ⌈log p⌉ · 2^(⌈log p⌉+p) bits per polling unit; with two polling units per core, this gives the per-core figures estimated in Table 5.1.

Table 5.1: Memory consumption for the round-robin port polling design

Core number   4    6    8     10    12   14   16    18      20
Bits/core     16   128  256   1536  3K   6K   12K   64K     128K
Total bits    64   768  2048  15K   36K  84K  192K  1.125M  2.5M

The memory consumption grows super-exponentially with the number of cores in the system. Medium to large FPGAs usually have 15 to 30 Mbits of BRAMs and 5 to 8
Furthermore, considering the exponential growth of memory consumption, we can divide the ports into more segments and implement a multi-level polling scheme to save memory consumption. Of course, the latter approach is not a standard round-robin any more, but it is quite fair for scheduling. A separate module monitors all input ports for barrier markers and is responsible for informing the core to advance its BFS level. 5.4 Implementation and Performance 5.4.1 Implementation on FPGA Platforms We evaluated our architecture on a Xilinx Vertex-5 XC5VL330, which is an FPGA for general logic applications with a large amount of logic resources at 207360 LUT-FF pairs and mid-sized BRAMs at 10368 Kb. Based on Figure 5.6, the whole Bitmap and part of the Vertex Array require “write” operations in dual-port mode, so it is neces- sary to store them in BRAM. Other demands for storage can be met by either BRAMs or distributed RAMs. We evaluated our design with 4- and 8-core configurations that support BFS on a graph of 256K vertices with an arity of 16. The resource consump- tion and timing performance are reported in Table 5.2, excluding an implementation of a DRAM controller that can be either on or off the chip of FPGA. Table 5.2: Resource consumption and timing performance of designs on FPGA Number of Cores LUT-Flip Flop Pairs Block RAMs Supported Clock Rate 4 5286 2% 233 80% 4.728ns 211MHz 8 11599 5% 280 97% 7.766ns 129MHz Compared with the consumption of logic fabric, the utilization of BRAM is very high, which shows that our architecture relies on the availability and configuration of 94 memory resources to achieve high performance. If we constrain our design on the FPGA chip, then the magnitude of the design, such as the number of cores and the size of supported graphs, is to be decided by the amount of BRAMs on the FPGA. However, we can employ off-chip SRAMs to support more cores and larger graphs while retaining performance for the designs. For the purpose of evaluating the resource consumption and timing performance of a design, we only implemented designs that can fit on chip here. Our results show that a large design utilizing almost all on-chip BRAMs can work at clock rates over100MHz. 5.4.2 Simulation of the Multi-Softcore Architecture 5.4.2.1 Features of Sample Input Graphs To systematically study the performance of BFS on our multi-softcore architecture, three types of graphs with various topologies were used as inputs. Illustrated in Fig- ure 5.8, they are “neighboring” graphs, bipartite graphs, and random graphs. The “neighboring” graphs are denoted asNeighbor. The vertices were arranged as shown in (a). Each vertex, marked by the numbers in the squares, had edges connecting to its left and right neighbors. Bipartite graphs are denoted asBipart. The vertices were arranged in two sets as shown in (b), where each vertex had edges connecting to a given number of vertices with adjacent IDs in the other set. Random graphs are de- noted asRand. Each vertex had edges connecting to a random subset of the vertices. Figure 5.8 (c) illustrates one such graph. Neighbor andBipart have relatively good locality, since the neighbors of the vertices are close to each other. 95 0 1 2 3 4 5 4 5 6 7 0 1 2 3 0 1 2 3 4 5 6 Figure 5.8: Topologies of sample input graphs 5.4.2.2 SystemC Modeling and Verification We utilized SystemC, one of the Electronic System Level languages (ESLs), to model and verify the functionality and evaluate design options of our multi-softcore archi- tecture for BFS. 
SystemC is a class library built upon standard C++, and it can model large-scale digital systems with cycle accuracy [70]. Its event-driven simulation kernel can support thread-like execution of many functional definitions inside a design com- ponent. The executables run on top of OS and can be 10 to 100 times faster than other simulation techniques. C/C++ programming adds the ability to handle large amounts of data, while many designs in SystemC can be translated into HDL program directly. We simulated our architecture with the design described in Section 5.3 for systems of 4, 8, and 16 cores respectively. A DRAM interface, seen in Figure 5.2, connects every core through an individual hardware channel to DRAM(s). The DRAM interface polls the cores for their DRAM requests in a fashion similar to the one explained in Section 5.3.3. These requests are queued in their incoming order to access the external memory. As we study an FPGA-based design, we are able to incorporate multiple DRAM controllers like some recently developed microprocessors do [49, 3]. However, we assume one DRAM and its controller are employed only to obtain the lower bound 96 of the performance for our design and to facilitate the comparison with other works on different platforms. We utilize the DRAM simulator used in the study for large dictionary string match- ing in Section 4. In our evaluation of BFS architecture, we select the DRAM to be a DDR3, Micron MT41J128M8. This DRAM has 8 banks with a burst length of 8, and its major parameters, such astRTP ,tRC,tRRD,tRAS,tRP ,tRCD andtCL, are similar to the DDR2’s used in Table 4.2. The DRAM runs at 400 MHz. Our core is set up to run at 100 MHz, which can be easily achieved on FPGA, as shown in Section 5.4.1. Note that the DRAM in our simulation is a DDR SDRAM supporting 800Mbps data rate per wire at the clock rate of 400MHz. During the simulation, a softcore reads the neighbor list of a vertex on its ready list from the DRAM. By con- figuring the width of data bus equal to the representation of a neighbor’s information, the information for each neighbor is read during one data transmission, which means that two neighbor entries are read during one clock. 5.4.3 Experimental Results and Discussion The aforementioned sample graphs are input to our SystemC model for the architec- tural verification and design space exploration. We vary the size of the sample graphs from 1K to 8K to 64K vertices, and their arities also are chosen from 4;8, and 16. Thus, a test case in our experiments has four variables, which are the number of cores and the topology, size, and arity of the input graphs. We measure the throughput per- formance through the number of edges visited in a unit of time as Million Edges Per Second (ME/s). The resource consumption of the configurations on each of the impor- tant architectural components, such as depth of channel FIFO, size of ready list, and neighbor buffer is also recorded. We also look into the hardware resource consumption 97 of DRAM interface to provide a comprehensive understanding of resource demand on FPGA from our BFS architecture. 
5.4.3.1 Throughput Measurements Legend: 1024 Vertices 8192 Vertices 65536 Vertices 0 100 200 300 400 500 600 700 800 48 16 Core-4 on Neighbor Graph 0 100 200 300 400 500 600 700 800 48 16 Core-8 on Neighbor Graph 0 100 200 300 400 500 600 700 800 48 16 Core-16 on Neighbor Graph 0 100 200 300 400 500 600 700 800 48 16 Core-4 on Bipart Graph 0 100 200 300 400 500 600 700 800 48 16 Core-8 on Bipart Graph 0 100 200 300 400 500 600 700 800 48 16 Core-16 on Bipart Graph Figure 5.9: Throughput measurements for graphs withNeighbor andBipart topolo- gies We first examine the throughput performance of our design with regard to different experimental parameters. A throughput is limited by available DRAM bandwidth, because every edge traversed in BFS is delivered from DRAM. according to the setup of our simulation model, this limit is 800ME=s, which is equal to the DDR3’s data transmission rate. We present the results for bothNeighbor andBipart in Figure 5.9. The X-axis is the arity of the graphs, which is either 4, 8, or 16, and the Y-axis is the measured 98 Legend: 1024 Vertices 8192 Vertices 65536 Vertices 0 100 200 300 400 500 600 700 800 48 16 Core-4 on Rand Graph 0 100 200 300 400 500 600 700 800 48 16 Core-8 on Rand Graph 0 100 200 300 400 500 600 700 800 48 16 Core-16 on Rand Graph Figure 5.10: Throughput measurements for graphs withRand topology throughput. One can see that the throughput performance ranges from 160 to almost 800 ME=s, which shows that the impact from the topology and arity of the graphs is more significant than that from the number of cores and the size of the graphs. When the arity is larger than 8, the throughput is between500 and795ME=s, which compares favorably with other solutions for BFS [98, 76]. The low throughput for low- degreed graphs is expected because the burst length of the DRAM is 8. Thus, much of the DRAM bandwidth is wasted, which also limits the utilization of cores. The reason that the graph size did not matter for these results is due to the regular distribution of neighbors for each vertex in these two types of synthetic graphs. When a graph is large enough, the access pattern to DRAM is likely to repeat, which causes the size of a graph to be a non-factor in system performance. The throughput results forRand graphs, shown in Figure 5.10, share some similar- ity with the first two topologies, such as the correlation of throughput with the arity of the graphs. However, as the size of the graphs grows, the measured throughputs decay. This result is due to the random nature of this topology which leads to more waste of DRAM bandwidth. 99 Note that, for some test cases, the throughputs are approaching the 800 ME=s limit set by DRAM bandwidth. This result is reasonable, because the DRAM is used solely for the purpose of reading and the inter-core communication utilizes a separate interconnect. The 8-bank configuration of the DRAM also allows us to open rows concurrently to hide long latency from “pre-charging” and “activating” rows within the same bank. These schemes greatly improve the DRAM bandwidth utilization. 5.4.3.2 Resource Profiling for Storage Components A major component in our design is the channels that connect each pair of the cores. For bothNeighbor andBipart graphs, the regular distribution of vertices’ neighbors brings great load balancing to the cores. Also, the resulting BFS tree has large depth, which translates into fewer vertices for each level. As a result, the channel’s buffer consumption is low and evenly distributed among all cores. 
Our simulation shows that the buffer usage of each channel is between two to four for all test cases that use Neighbor andBipart graphs. 0 50 100 150 200 250 Core-4 1K Core-4 8K Core-4 64K Core-8 1K Core-8 8K Core-8 64K Core-16 1K Core-16 8K Core-16 64K Buffer Size Degree 4 Degree 8 Degree 16 Figure 5.11: Channel buffer usage of designs forRand graphs 100 However, in the case of Rand graphs, the imbalance of load on the cores exists during the BFS process, and the shorter BFS tree leads to more vertices per level. These factors cause the buffer size “explosion,” as shown in Figure 5.11. The X-axis indicates the number of cores in the systems and the size of input graphs. Here, the size of buffers varies from about 10 up to 250. Figure 5.11 also shows that the buffer usage grows as the graph size becomes larger. However, when the number of cores increases, the buffer size of individual channels decreases accordingly. Since the number of channels is quadratic of the number of cores, the total buffer consumption still rises. Generally, the buffer size also increases as the arity of aRand graph becomes higher. 0 20 40 60 80 100 120 140 160 Core-4 1K Core-4 8K Core-4 64K Core-8 1K Core-8 8K Core-8 64K Core-16 1K Core-16 8K Core-16 64K Degree 4 Degree 8 Degree 16 629 Figure 5.12: Buffer size in a core’s ready list Within a core, the ready list shares a similarity with the channels. In both, the buffer consumption for Neighbor and Bipart is modest but becomes very large for theRand graphs. Figure 5.12 also shows that the buffer usage in ready lists forRand graphs decreases as the arity of the graphs decreases and as the number of cores in the system increases. The test case, where BFS is conducted on a system of four cores for 101 theRand graph with 64K vertices and an arity of 16, stands out with the buffer size at 629, which is unproportionally large when compared to other scenarios. 0 2 4 6 8 10 12 14 16 18 20 Degree 4 Degree 8 Degree 16 Queue Size Neighbor Bipart RAND Figure 5.13: Neighbor list depth for a system of eight cores on a graph of 8K vertices Comparatively, the neighbor buffer in a core does not consume much storage. The simulation shows that this feature is insensitive to all parameters of the experiments, except that it grows roughly linearly to the arity of graphs. Figure 5.13 shows the results of a system with eight cores running BFS on a graph of 8K vertices with all three different topologies. All other test cases exhibit similar measurements. In our design, the DRAM interface connects to each core with a dedicated channel built on a FIFO. Inside the interface, a state machine polls the input ports in round- robin fashion to find valid requests from the cores. Then, these requests are queued to access the DRAM controller in their incoming order. The results from DRAM are returned to the appropriate cores once the requests succeed. The central queue for storing the requests is resource-consuming for large Rand graphs. For the purpose of comparison, the results forNeighbor andBipart graphs are shown in Figure 5.14 (a). The figure only shows the results for the system of eight cores that runs BFS on a graph of 8K vertices for both topologies. 
These results are presented like a function 102 0 5 10 15 20 25 30 35 Degree 4 Degree 8 Degree 16 (a) 8-Core System for Neighbor and Bipart Graphs of 8192 Vertices Queue Size Neighbor Bipart 1 10 100 1000 10000 100000 Degree 4 Degree 8 Degree 16 (b) 8-Core System for Rand Graph Queue Size 1024 Vertices 8192 Vertices 65536 Vertices 0 5 10 15 20 25 30 35 Degree 4 Degree 8 Degree 16 (a) 8-Core System for Neighbor and Bipart Graphs of 8192 Vertices Queue Size Neighbor Bipart Figure 5.14: Resource consumption of the central queue in DRAM interface of the arity of the graphs, because the queue sizes for these two topologies correlate to the arity of the graphs and not to all other system and graph parameters. For these test cases, the memory consumption is modest, even for very large graphs. To the contrary, the results forRand graphs differ significantly. We plot the results in Figure 5.14 (b) for the system with eight cores only, since the results for 4-core and 16-core systems are similar to the 8-core one for this study. Note that the figures (a) and (b) have different legends and that the Y-axis in (b) is logarithmic. When the size of a graph becomes larger, the size of the queue also grows. The demand for buffer storage is independent of the number of cores, and it is only slightly dependent on the arity of the input graphs. The last results are the resource consumptions for the channels connecting the DRAM interface to all cores. Again, for theNeighbor andBipart graphs, the buffer size of each channel is very small. However, forRand graphs, the buffer size increases along with the arity of a graph and reduces linearly to the number of cores. This inter- esting finding is illustrated in Figure 5.15, which means the total resource consumption 103 for all DRAM interface channels remains fairly constant for systems with different numbers of cores. 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 48 16 Number of Cores Buffer Size Arity=4 Arity=8 Arity=16 Figure 5.15: DRAM channel buffer size forRand graph of 64K vertices 5.4.3.3 Performance Discussion Table 5.3: Performance comparison with state-of-the-art BFS implementations Solution Platform DRAM Bandwidth Maximally Sustained Throughput MapsCore 800Mbps 790ME=s AMD Opteron 2350 [98] 1.33Gbps 311ME=s Cell BE [76] 3.2Gbps 787ME=s The experimental results show that our BFS design based on MapsCore architec- ture consumes small amount of logic resource, even when executing the algorithm on substantially large graphs. With 97% of BRAM utilization on an FPGA, the imple- mentation still can run at a clock rate of 100 MHz or more and can achieve superb throughput performance that ranges from 160 to 790ME=s. This result is favorably comparable to the throughput that were achieved on CMPs and other parallel com- puting platforms, including some of state-of-the-art processors, such as Cell BE. The achievable throughputs are compared between our solution and two others as shown in Table 5.3. 104 Chapter 6 Conclusion The focus of our work has been multi-softcore architectures and algorithm develop- ment for a class of sparse computations on reconfigurable hardware. We examined the diverse uses of soft processors on FPGA, the representative reconfigurable computing platform, and the effectiveness and drawbacks of using soft processors in many appli- cations. 
We demonstrated that, even though there are challenges involved with using reconfigurable hardware on sparse computations, these challenges can be overcome by using our proposed MapsCore architecture. Our contributions to this field allow this still-evolving technology to be applied in sparse computations and provide some of the immense processing power required to accommodate demands from I/O-intensive applications. Below, we also detail ideas for future work that can make reconfigurable hardware a practical platform for sparse computations. 6.1 Summary of Contributions 6.1.1 Multi-Softcore Architecture on FPGA We surveyed the existing ways of utilizing microprocessors on FPGA, which can be a main computing workhorse, an assistance to other advanced hardware accelerators, 105 or a tool in modeling and monitoring other types of multi-processor architecture. The microprocessors can be embedded hardware cores or softcores made of FPGA logic fabric and on-chip memory. Electronic system level languages (ESLs) are introduced as a way to promote FPGA to a wider community by enabling high level programming on the high-performance reconfigurable hardware device. This effort complements the initiative to incorporate microprocessors onto FPGA so that design development on FPGA is as easy as programming a GPP. Multiple application specific softcore architecture: Based on the experiences from the aforementioned techniques and an in-depth understanding of sparse computations’ behaviors and characteristics, we proposed Multiple Application Specific Softcore Ar- chitecture (MapsCore) for sparse computations. MapsCore adopts a flexible scheme that extends beyond traditional von Neumann architecture. The softcores in this archi- tecture have their functionality customized to specific applications and can control a large amount of local memory directly. The design of interconnect is also application- specific and aims to take full advantage of I/O and routing resources on FPGA. We proposed a message-passing programming model on this architecture that parallelizes computations by exchanging data and control information between the cores. Sparse computations provide a prime opportunity for such multi-softcore architecture to ex- cel and are thus the target applications in this thesis. Designers are recommended to model and verify the core using ESLs, which can then be translated to designs in HDL programs. The design principles that enable high performance sparse computation implementation on FPGA were summarized. The specific techniques are introduced when the algorithm designs for the application case studies are discussed. 106 6.1.2 Algorithm Design for Large Dictionary String Matching and Breadth-first Search We developed advanced algorithms on our architecture for large dictionary string matching and breadth-first search on a large graph. Parallelizing such applications is of practical importance in the field of parallel computing. Application-specific op- timization techniques that are enabled on FPGA platform are applied in these designs and the resulting throughput performance is promising. The effectiveness of the fol- lowing techniques are demonstrated through studies on the proposed multi-softcore architecture: Re-mapping data to create locality: Sparse computations are known for their ran- dom access to data storage, which lacks locality. However, any real world application has “locality,” which may be in a form that is different from commonly recognized spatial and temporal localities. 
Simplified thread management to hide I/O latency: Using multiple threads to hide I/O latency is a common technique. We demonstrated a simple round-robin scheduler with a shift-register context store to manage the context switches between threads. This scheme effectively reduces costly accesses to external memory and hides the long latency those accesses incur. The throughput achieved for large dictionary string matching approaches the potential of a system design that uses on-chip memory exclusively.

Low-cost barrier and message consistency policy on a message-passing system: Using message passing to implement barriers allows the cores in our architecture to make synchronization decisions individually and in a distributed way. This scheme saves cycles of synchronization activity and allows some cores to pass a barrier while the others are still approaching it. These features enable high-performance implementations of certain algorithms on the proposed MapsCore architecture. However, a message consistency policy, which requires the cores to process data messages before their corresponding synchronization (control) messages, must be put in place to guarantee correct execution of the algorithms.

6.2 Future Work

6.2.1 Further Study of Multi-Softcore Multi-DRAM Architecture for String Matching

We have evaluated our large dictionary string matching design and algorithm with one external DRAM module. Multiple DRAM interfaces have been appearing in high-end GPPs and graphics processing units (GPUs), and the design in our study can benefit greatly from this development. However, the addition of DRAM interfaces brings new challenges to designers, such as efficient utilization of multiple DRAM modules by either duplicating or splitting the DFA, and advanced interconnect design that allows effective sharing of the DRAMs by all cores.

Currently, the on-chip memory employed in our design is underutilized, even though our current scheme can exhaust the capacity of the DRAM. Strategies such as adding more cores that share the hot-state buffers should improve the utilization of on-chip memory; the complexity they incur, however, should be studied.

Data structure compression and dynamic management of on-chip storage: For much larger DFAs, the BRAM currently available on FPGA may not be large enough to hold the hot states that are necessary to obtain high hit rates to on-chip memory. To increase the on-chip reference rate, one can compress the data representation of the hot states. The current logic consumption of our design on FPGA is very low, which means that ample logic fabric is available to implement hash, content addressable memory (CAM), and other function modules for data compression and decompression. As the on-chip storage components become more diverse, more dynamic management mechanisms should be deployed. Furthermore, when the number of hot states is too large, the hot states can be partitioned and dynamically loaded onto the chip. For the same DFA, one can also produce multiple hot-state data sets based on different training traces and choose which set to load according to the run-time traffic characteristics.
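One possible, hypothetical realization of this last point, choosing among several pre-trained hot-state sets at run time, is sketched below: the set whose states best match the states actually referenced during a recent window of input is the one selected for loading into on-chip storage. The data structures, the scoring policy, and the function name pickHotSet are illustrative assumptions only.

```cpp
// Hypothetical sketch of run-time hot-set selection: several hot-state sets,
// each derived from a different training trace, are kept off-chip; the set
// whose states were referenced most often during the recent input window is
// the one (re)loaded into on-chip storage.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <unordered_set>
#include <vector>

struct HotSet {
    std::unordered_set<uint32_t> states;   // state ids trained as "hot" for one trace type
};

// Score each candidate against the states visited in the last window of input
// and return the index of the best-matching set.
std::size_t pickHotSet(const std::vector<HotSet>& candidates,
                       const std::vector<uint32_t>& recentStates) {
    std::size_t best = 0, bestHits = 0;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        std::size_t hits = 0;
        for (uint32_t s : recentStates)
            if (candidates[i].states.count(s)) ++hits;   // would have hit on-chip memory
        if (hits > bestHits) { bestHits = hits; best = i; }
    }
    return best;   // caller would then stream this set's transitions into BRAM
}

int main() {
    std::vector<HotSet> sets = {{{1, 2, 3}}, {{7, 8, 9}}};
    std::vector<uint32_t> recent = {7, 7, 2, 9, 8};
    std::cout << "selected hot set: " << pickHotSet(sets, recent) << "\n";
}
```

A hardware version would replace the scan with per-set counters updated on every transition and would trigger a reload only when the winning set changes and the expected gain outweighs the cost of refilling BRAM.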
Hot state learning: The hot-state mechanism is not as dynamic as the cache designs in general purpose processors. Hot state learning can adapt to input traffic whose characteristics may change over a longer period of time. An on-line learning technique, suitably supported by the FPGA, can be developed to update the hot states on chip.

6.2.2 Further Study of Breadth-First Search on FPGA

Our current study of BFS evaluated a design in which all cores are connected with one another. This complete-graph topology causes the interconnect complexity to grow quadratically, and the resource demand of the interconnect may be hard to meet when the number of cores is large. A hierarchical interconnection network can alleviate this problem and save the resources used for interconnectivity. In such a design, cores form subgroups that are connected with one another. Within a subgroup, the cores are fully connected, and a "router" administers message exchange between the members inside and outside of the subgroup. A transport-layer protocol needs to be put in place to observe the message consistency policy.

Utilization of graph partitioning: Adopting a hierarchical design provides opportunities for utilizing advanced graph partitioning. A strategically partitioned graph assignment for each core or each subgroup can balance the traffic within and between subgroups. This traffic balancing helps provision resources for the communication links and leads to better overall system performance. Algorithms for graph partitioning are usually costly, and one needs to justify their use for a BFS kernel. For example, emerging applications from social networking, as well as research efforts to understand the structure of the Internet, require extensive use of graph exploration tools. Such graphs have a stable core structure over a period of time; however, they constantly evolve as vertices dynamically join or leave. A tool equipped with partitioned graph algorithms may be of great advantage on such graphs, in that it can resolve a search faster than non-partitioned algorithms.

Kernels for graph exploration: FPGA vendors provide many IP cores for I/O interfaces, memory modules, and compute-intensive kernels such as signal and image processing, but none for sparse computations. Building IP cores for graph exploration kernels can fill this void and may spark the community's interest in applying FPGA in fields where it is not traditionally used.

Prefetching for BFS: Our architecture does not employ a caching mechanism. However, prefetching buffers can be implemented in the on-chip memories that are not occupied by the design components. These buffers can load the neighbor lists of the vertices that are still waiting in the ready list. More sophisticated algorithms can be devised on top of this prefetching scheme to predict future memory requests and load the data onto the chip beforehand. Especially when multiple DRAM interfaces and a high-bandwidth interconnect are incorporated in the system, prefetching can better utilize the capacity of the I/O system.
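As a purely illustrative software analogue of this prefetching idea, the sketch below stages the neighbor lists of the next few vertices in the ready list into a small buffer before they are processed. The look-ahead depth, the container choices, and all names are assumptions for the example rather than a description of an actual MapsCore prefetcher.

```cpp
// Hypothetical sketch of neighbor-list prefetching for BFS: while the core
// processes the vertex at the head of the ready list, the neighbor lists of
// the next few queued vertices are copied into a small staging buffer
// (standing in for otherwise unused on-chip memory), so that the DRAM
// accesses overlap with useful work.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <iostream>
#include <unordered_map>
#include <vector>

struct CSRGraph {
    std::vector<uint32_t> rowPtr;  // size |V|+1, offsets into col
    std::vector<uint32_t> col;     // concatenated neighbor lists (in "DRAM")
};

class NeighborPrefetcher {
public:
    NeighborPrefetcher(const CSRGraph& g, std::size_t lookAhead)
        : g_(g), lookAhead_(lookAhead) {}

    // Stage neighbor lists for the next few vertices waiting in the ready list.
    void prefetch(const std::deque<uint32_t>& readyList) {
        for (std::size_t i = 0; i < lookAhead_ && i < readyList.size(); ++i) {
            uint32_t v = readyList[i];
            if (buffer_.count(v)) continue;      // already staged
            buffer_[v].assign(g_.col.begin() + g_.rowPtr[v],
                              g_.col.begin() + g_.rowPtr[v + 1]);
        }
    }

    // Serve neighbors from the staging buffer when present, else from "DRAM".
    std::vector<uint32_t> neighbors(uint32_t v) {
        auto it = buffer_.find(v);
        if (it != buffer_.end()) {
            std::vector<uint32_t> n = std::move(it->second);
            buffer_.erase(it);
            return n;
        }
        return std::vector<uint32_t>(g_.col.begin() + g_.rowPtr[v],
                                     g_.col.begin() + g_.rowPtr[v + 1]);
    }

private:
    const CSRGraph& g_;
    std::size_t lookAhead_;
    std::unordered_map<uint32_t, std::vector<uint32_t>> buffer_;  // staging buffer
};

int main() {
    CSRGraph g{{0, 2, 4, 6, 8}, {1, 2, 0, 3, 0, 3, 1, 2}};       // edges 0-1, 0-2, 1-3, 2-3
    NeighborPrefetcher pf(g, 2);
    std::deque<uint32_t> ready = {0, 1, 2};
    pf.prefetch(ready);                                           // stages lists for 0 and 1
    for (uint32_t n : pf.neighbors(0)) std::cout << n << ' ';
    std::cout << "\n";
}
```

More elaborate policies, for example predicting which vertices will be dequeued next when multiple DRAM interfaces are available, would build on the same staging structure.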
References

[1] A graph query language and its query processing. In ICDE '99: Proceedings of the 15th International Conference on Data Engineering, page 572, Washington, DC, USA, 1999. IEEE Computer Society.

[2] Actel Corporation. http://www.actel.com/.

[3] Advanced Micro Devices, Inc. http://www.amd.com/us-en/Processors/ProductInformation/.

[4] Aeroflex Gaisler AB. http://www.gaisler.com/cms.

[5] Agility Design Solutions Inc. Handel-C language reference manual.

[6] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: an aid to bibliographic search. Commun. ACM, 18(6):333–340, 1975.

[7] Altera Corporation. http://www.altera.com.

[8] Altera Corporation. http://www.altera.com/products/ip/processors/nios2/ni2-index.html.

[9] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS '67 (Spring): Proceedings of the April 18-20, 1967, Spring Joint Computer Conference, pages 483–485, New York, NY, USA, 1967. ACM.

[10] David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan. FAWN: a fast array of wimpy nodes. In SOSP '09: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 1–14, New York, NY, USA, 2009. ACM.

[11] S. Antonatos, K. G. Anagnostakis, E. P. Markatos, and M. Polychronakis. Performance analysis of content matching intrusion detection systems. In Proc. of the International Symposium on Applications and the Internet, January 2004.

[12] ARM Limited. http://www.arm.com/fpga/.

[13] Jonathan W. Babb, Matthew Frank, and Anant Agarwal. Solving graph problems with dynamic computation structures. High-Speed Computing, Digital Signal Processing, and Filtering Using Reconfigurable Logic, 2914(1):225–236, 1996.

[14] David A. Bader and Kamesh Madduri. Designing multithreaded algorithms for breadth-first search and st-connectivity on the Cray MTA-2. In ICPP '06: Proceedings of the 2006 International Conference on Parallel Processing, pages 523–530, Washington, DC, USA, 2006. IEEE Computer Society.

[15] Z. Baker and V. Prasanna. A computationally efficient engine for flexible intrusion detection. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 13(10):1179–1189, October 2005.

[16] J. A. Barnes and Frank Harary. Graph theory in network analysis. Social Networks, 5(2):235–244, 1983.

[17] Vaughn Betz and Jonathan Rose. FPGA routing architecture: segmentation and buffering to optimize speed and density. In FPGA '99: Proceedings of the 1999 ACM/SIGDA Seventh International Symposium on Field Programmable Gate Arrays, pages 59–68, New York, NY, USA, 1999. ACM.

[18] Daniel K. Blandford, Guy E. Blelloch, and Ian A. Kash. An experimental analysis of a compact graph representation. In Proceedings of the Sixth Workshop on Algorithm Engineering and Experiments, pages 49–61, 2004.

[19] Robert S. Boyer and J. Strother Moore. A fast string searching algorithm. Commun. ACM, 20(10):762–772, 1977.

[20] D. Burke, J. Wawrzynek, K. Asanovic, A. Krasnov, A. Schultz, G. Gibeling, and P.-Y. Droz. RAMP Blue: Implementation of a manycore 1008 processor FPGA system. In Proceedings of the Fourth Annual Reconfigurable Systems Summer Institute (RSSI '08), October 2008.

[21] Beate Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the 6th Colloquium on Automata, Languages and Programming, pages 118–132, London, UK, 1979. Springer-Verlag.

[22] K. Compton and S. Hauck. Reconfigurable Computing: A Survey of Systems and Software. ACM Computing Surveys, 34(2):171–210, June 2002.

[23] Jason Cong, Guoling Han, and Wei Jiang. Synthesis of an application-specific soft multiprocessor system, February 2007.

[24] Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and Charles E. Leiserson. Introduction to Algorithms. McGraw-Hill Higher Education, 2001.

[25] Cray Inc. http://www.cray.com/.
[26] David E. Culler and Jaswinder P. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., USA, 1999.

[27] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. SIGOPS Oper. Syst. Rev., 41(6):205–220, 2007.

[28] Michael deLorimier, Nachiket Kapre, Nikil Mehta, Dominic Rizzo, Ian Eslick, Raphael Rubin, Tomas E. Uribe, Thomas F. Knight Jr., and Andre DeHon. GraphStep: A system architecture for sparse-graph algorithms. Field-Programmable Custom Computing Machines, Annual IEEE Symposium on, 0:143–151, 2006.

[29] J. P. Derutin, L. Damez, A. Desportes, and J. L. Lazaro Galilea. Design of a scalable network of communicating soft processors on FPGA. In ASAP '07: Proceedings of the 2007 IEEE International Conference on Application-Specific Systems, Architectures and Processors, pages 184–189, Washington, DC, USA, 2007. IEEE Computer Society.

[30] S. Dharmapurikar, P. Krishnamurthy, T. Sproull, and J. Lockwood. Deep packet inspection using parallel Bloom filters. In Hot Interconnects, pages 44–51, August 2003.

[31] Robert G. Dimond, Oskar Mencer, and Wayne Luk. Custard - a customisable threaded FPGA soft processor and tools. In Proceedings of the 2005 International Conference on Field Programmable Logic and Applications, 2005.

[32] DRC, The Coprocessor Company. http://www.drccomputer.com/.

[33] Christos Faloutsos, Kevin S. McCurley, and Andrew Tomkins. Fast discovery of connection subgraphs. In KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 118–127, New York, NY, USA, 2004. ACM.

[34] Blair Fort, Davor Capalija, Zvonko G. Vranesic, and Stephen D. Brown. A multithreaded soft processor for SoPC area reduction. Field-Programmable Custom Computing Machines, Annual IEEE Symposium on, 0:131–142, 2006.

[35] Gerald Combs. http://www.wireshark.org.

[36] Heiner Giefers and Marco Platzner. A many-core implementation based on the reconfigurable mesh model. In FPL, pages 41–46, 2007.

[37] M. Gokhale, D. Dubois, A. Dubois, M. Boorman, S. Poole, and V. Hogsett. Granidt: Towards Gigabit Rate Network Intrusion Detection Technology. In Proc. of 12th International Conference on Field-Programmable Logic and Applications, pages 404–413, Montpellier, France, September 2002.

[38] Maya B. Gokhale and Paul S. Graham. Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays. Springer, Dordrecht, The Netherlands, 2005.

[39] Joseph Gonzalez, Yucheng Low, and Carlos Guestrin. Residual splash for optimally parallelizing belief propagation. In Artificial Intelligence and Statistics (AISTATS), Clearwater Beach, Florida, April 2009.

[40] Zhi Guo, Walid Najjar, Frank Vahid, and Kees Vissers. A Quantitative Analysis of the Speedup Factors of FPGAs Over Processors. In Proc. of the 12th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 162–170, California, USA, February 2004.

[41] John L. Gustafson. Reevaluating Amdahl's law. Commun. ACM, 31(5):532–533, 1988.

[42] F. Kastensmidt, H. Freitas, D. Colombo, and P. Navaux. Evaluating network-on-chip for homogeneous embedded processors in FPGAs. In IEEE International Symposium on Circuits and Systems, pages 3376–3379, May 2007.
[43] Tatsuya Hayashi, Koji Nakano, and Stephan Olariu. An O((log log n)^2) time algorithm to compute the convex hull of sorted points on reconfigurable meshes. IEEE Transactions on Parallel and Distributed Systems, 9:1167–1179, 1998.

[44] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann, May 2002.

[45] IBM. http://www.ibm.com/.

[46] IBM. http://www.research.ibm.com/cell/.

[47] Intel. http://www.intel.com.

[48] Intel Corporation. http://www.intel.com/Products/server/processors/index.htm.

[49] Intel Corporation. http://www.intel.com/technology/architecture-silicon/next-gen/.

[50] Ju-wook Jang, Heonchul Park, and Viktor K. Prasanna. An optimal multiplication algorithm on reconfigurable mesh. IEEE Trans. Parallel Distrib. Syst., 8(5):521–532, 1997.

[51] Kimmo U. Jarvinen and Jorma O. Skytta. High-speed elliptic curve cryptography accelerator for Koblitz curves. Field-Programmable Custom Computing Machines, Annual IEEE Symposium on, 0:109–118, 2008.

[52] Donald E. Knuth, James H. Morris, and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323–350, 1977.

[53] Alex Krasnov, Andrew Schultz, John Wawrzynek, Greg Gibeling, and Pierre-Yves Droz. RAMP Blue: A message-passing manycore system in FPGAs. In International Conference on Field Programmable Logic and Applications, August 2007.

[54] Sailesh Kumar, Sarang Dharmapurikar, Fang Yu, Patrick Crowley, and Jonathan Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In SIGCOMM '06: Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 339–350, 2006.

[55] Martin Labrecque and Gregory Steffan. Improving pipelined soft processors with multithreading. In FPL, pages 210–215, 2007.

[56] Martin Labrecque, Peter Yiannacouras, and J. Gregory Steffan. Scaling soft processor systems. In Proceedings of FCCM, pages 99–110, 2008.

[57] Hoang Le, Weirong Jiang, and Viktor K. Prasanna. A SRAM-based architecture for trie-based IP lookup using FPGA. Field-Programmable Custom Computing Machines, Annual IEEE Symposium on, 0:33–42, 2008.

[58] K. Masselos, A. Pelkonen, M. Cupak, and S. Blionas. Realization of wireless multimedia communication systems on reconfigurable platforms. J. Syst. Archit., 49(4-6):155–175, 2003.

[59] Memcached. A distributed memory object caching system.

[60] Oskar Mencer, Zhining Huang, and Lorenz Huelsbergen. Hagar: Efficient multi-context graph processors. In FPL '02: Proceedings of the Reconfigurable Computing Is Going Mainstream, 12th International Conference on Field-Programmable Logic and Applications, pages 915–924, London, UK, 2002. Springer-Verlag.

[61] Micron Technology, Inc. http://www.micron.com.

[62] Russ Miller, V. K. Prasanna-Kumar, Dionisios I. Reisis, and Quentin F. Stout. Parallel computations on reconfigurable meshes. IEEE Trans. Comput., 42(6):678–692, 1993.

[63] Mitrionics AB. Mitrion users guide.

[64] Gerald R. Morris and Viktor K. Prasanna. Sparse matrix computations on reconfigurable hardware. Computer, 40:58–64, 2007.

[65] Georgios-Grigorios Mplemenos and Ioannis Papaefstathiou. Soft multicore system on FPGAs. In Proceedings of FCCM, pages 199–201, 2008.

[66] Vadali Srinivasa Murty, P. C. Reghu Raj, and S. Raman. Design of a high speed string matching co-processor for NLP. VLSI Design, International Conference on, 0:183, 2003.
[67] Koji Nakano and Stephan Olariu. An optimal algorithm for the angle-restricted all nearest neighbor problem on the reconfigurable mesh, with applications. IEEE Transactions on Parallel and Distributed Systems, 8:983–990, 1997.

[68] M. E. Newman. The structure of scientific collaboration networks. Proc Natl Acad Sci U S A, 98(2):404–409, January 2001.

[69] NVIDIA. http://www.nvidia.com/object/tesla_computing_solutions.html.

[70] Open SystemC Initiative (OSCI). http://www.systemc.org/home/.

[71] Neungsoo Park, Bo Hong, and Viktor K. Prasanna. Tiling, block data layout, and memory hierarchy performance. IEEE Trans. Parallel Distrib. Syst., 14(7):640–654, 2003.

[72] David Pellerin and Scott Thibault. Practical FPGA Programming in C. Prentice Hall Press, Upper Saddle River, NJ, USA, 2005.

[73] Richard Neil Pittman, Nathaniel Lee Lynch, and Alessandro Forin. eMIPS, a dynamically extensible processor. Technical report, Microsoft Research, 2006.

[74] Kaushik Ravindran, Nadathur Satish, Yujia Jin, and Kurt Keutzer. An FPGA-based soft multiprocessor system for IPv4 packet forwarding. In Proceedings of FPL, pages 487–492. IEEE, 2005.

[75] Manuel Saldaña, Lesley Shannon, Jia Shuo Yue, Sikang Bian, John Craig, and Paul Chow. Routability of network topologies in FPGAs. IEEE Trans. Very Large Scale Integr. Syst., 15(8):948–951, 2007.

[76] Daniele Paolo Scarpazza, Oreste Villa, and Fabrizio Petrini. Efficient breadth-first search on the Cell/BE processor. IEEE Transactions on Parallel and Distributed Systems, 19:1381–1395, 2007.

[77] Daniele Paolo Scarpazza, Oreste Villa, and Fabrizio Petrini. Efficient breadth-first search on the Cell/BE processor. IEEE Trans. Parallel Distrib. Syst., 19(10):1381–1395, 2008.

[78] Daniele Paolo Scarpazza, Oreste Villa, and Fabrizio Petrini. Exact multi-pattern string matching on the Cell/B.E. processor. In Conf. Computing Frontiers, pages 33–42, 2008.

[79] R. Scrofano, M. Gokhale, F. Trouw, and V. K. Prasanna. A Hardware/Software Approach to Molecular Dynamics on Reconfigurable Computers. In Proc. of the 14th IEEE Symposium on Field-Programmable Custom Computing Machines, California, USA, April 2006.

[80] Silicon Graphics, Inc. http://www.sgi.com/.

[81] SNORT: The Open Source Network Intrusion Prevention and Detection System. http://www.snort.org.

[82] SRC Computers, Inc. http://www.srccomp.com/.

[83] Harold S. Stone. High-Performance Computer Architecture. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1992.

[84] Taeweon Suh, Shih-Lien Lu, and Hsien-Hsin S. Lee. An FPGA approach to quantifying coherence traffic efficiency on multiprocessor systems. In FPL, pages 47–53, 2007.

[85] Maurice Tchuente. Parallel Computation on Regular Arrays. Halsted Press, New York, NY, USA, 1991.

[86] David B. Thomas and Wayne Luk. Efficient Hardware Generation of Random Variates with Arbitrary Distributions. In Proc. of 14th IEEE Symposium on Field-Programmable Custom Computing Machines, pages 57–66, California, USA, April 2006.

[87] TOP500.org. http://www.top500.org/.

[88] Nick Tredennick and Brion Shimamoto. Reconfigurable systems emerge. Proceedings of the 2004 International Conference on Field Programmable Logic and Its Applications, 3203:2–11, 2004.

[89] Antonino Tumeo, Matteo Monchiero, Gianluca Palermo, Fabrizio Ferrandi, and Donatella Sciuto. A design kit for a fully working shared memory multiprocessor on FPGA. In GLSVLSI '07: Proceedings of the 17th ACM Great Lakes Symposium on VLSI, pages 219–222, New York, NY, USA, 2007. ACM.
[90] K. D. Underwood and K. S. Hemmert. Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance. In Proc. of 2004 IEEE Symposium on Field-Programmable Custom Computing Machines, California, USA, April 2004.

[91] Deepak Unnikrishnan, Jia Zhao, and Russell Tessier. Application specific customization and scalability of soft multiprocessors. In FCCM '09: Proceedings of the 2009 17th IEEE Symposium on Field Programmable Custom Computing Machines, pages 123–130, Washington, DC, USA, 2009. IEEE Computer Society.

[92] Oreste Villa, Daniele Paolo Scarpazza, Fabrizio Petrini, and Juan Fernández Peinador. Challenges in mapping graph exploration algorithms on advanced multi-core processors. In IPDPS, pages 1–10, 2007.

[93] Project Voldemort. A distributed key-value storage system.

[94] B. F. Wang and G. H. Chen. Constant time algorithms for the transitive closure and some related graph problems on processor arrays with reconfigurable bus systems. IEEE Trans. Parallel Distrib. Syst., 1(4):500–507, 1990.

[95] B. W. Watson. The performance of single-keyword and multiple-keyword pattern matching algorithms. Technical Report, Eindhoven University of Technology, 19(10):1179–1189, October 1994.

[96] Xingzhi Wen and Uzi Vishkin. FPGA-based prototype of a PRAM-on-chip processor. In CF '08: Proceedings of the 5th Conference on Computing Frontiers, pages 55–66, New York, NY, USA, 2008. ACM.

[97] Z. Wu and R. Leahy. An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15:1101–1113, 1993.

[98] Yinglong Xia and Viktor Prasanna. Topologically Adaptive Parallel Breadth-First Search on Multicore Processors. In Proc. of Parallel and Distributed Computing and Systems (PDCS 2009), Cambridge, Massachusetts, USA, November 2009.

[99] Yinglong Xia and Viktor K. Prasanna. Parallel exact inference on the Cell Broadband Engine processor. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–12, Piscataway, NJ, USA, 2008. IEEE Press.

[100] Xilinx Incorporated. http://www.xilinx.com/itp/data/alliance/dsu/dsu1_2.htm.

[101] Xilinx Incorporated. http://www.xilinx.com/products/design_resources/proc_central/microblaze.htm.

[102] Xilinx Incorporated. http://www.xilinx.com.

[103] Sungwon Yi, Byoung-Koo Kim, Jintae Oh, Jongsoo Jang, George Kesidis, and Chita R. Das. Memory-efficient content filtering hardware for high-speed intrusion detection systems. In Proceedings of the 2007 ACM Symposium on Applied Computing, pages 264–269, 2007.

[104] Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose. Application-specific customization of soft processor microarchitecture. In FPGA '06: Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 201–210. ACM Press, 2006.

[105] Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, and Umit Catalyurek. A scalable distributed parallel breadth-first search algorithm on BlueGene/L. In SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, page 25, Washington, DC, USA, 2005. IEEE Computer Society.

[106] Fang Yu, Zhifeng Chen, Yanlei Diao, T. V. Lakshman, and Randy H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In ANCS '06: Proceedings of the 2006 ACM/IEEE Symposium on Architecture for Networking and Communications Systems, pages 93–102, New York, NY, USA, 2006. ACM.
[107] L. Zhuo and V. K. Prasanna. High Performance Linear Algebra Operations on Reconfigurable Systems. In Proc. of SuperComputing 2005, Washington, USA, November 2005.

[108] L. Zhuo and V. K. Prasanna. Scalable Hybrid Designs for Linear Algebra on Reconfigurable Computing Systems. In Proc. of the 12th International Conference on Parallel and Distributed Systems (ICPADS), Minnesota, USA, July 2006.

[109] L. Zhuo and V. K. Prasanna. Hardware/Software Co-Design on Reconfigurable Computing Systems. In Proc. of the 21st IEEE International Parallel & Distributed Processing Symposium, California, USA, March 2007.

[110] Ling Zhuo, Qingbo Wang, and Viktor K. Prasanna. Matrix computations on heterogeneous reconfigurable systems. Field-Programmable Custom Computing Machines, Annual IEEE Symposium on, 0:310–311, 2008.
Abstract
A field-programmable gate array (FPGA) is a representative reconfigurable computing platform. It has been used in many applications to execute computationally intensive workloads. In this work, we study architectures and algorithms on FPGA for sparse computations. These computations have unique features: 1) the ratio of input and output operations to computation is high, and 2) most memory accesses are random with little or no data locality, which leads to low memory bandwidth utilization.