HIGH-PERFORMANCE LINEAR ALGEBRA ON RECONFIGURABLE COMPUTING SYSTEMS

by

Ling Zhuo

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)

August 2007

Copyright 2007 Ling Zhuo

Dedication

To my late mother

Acknowledgments

First, I must thank my advisor, Prof. Viktor K. Prasanna. His guidance and support have been instrumental in my success. This work would not have been possible without him. I would also like to express my gratitude to my committee members, Prof. Michel Dubois, Prof. Mary Hall, and Prof. Manbir Singh. They have provided many useful insights during this process. Additionally, Prof. Michel Dubois, Prof. Pedro Diniz, Prof. Sandeep Gupta, and Prof. Priya Vashishta provided helpful feedback at the time of my qualifying exam.

I would also like to thank my master's advisors, Dr. Cho-Li Wang and Dr. Francis Law, for their inspiration and guidance. They were the first to teach me how to do research.

I have had the benefit of working with a wonderful research group. I would especially like to acknowledge Ronald Scrofano, Gerald Morris, Zachary Baker, Seonil Choi, Animesh Pathak, and Jingzhao Ou. I also thank Aimee Barnard, Rosine Sarafian, and Henryk Chrostek for their administrative assistance.

Last but not least, I must thank my husband, Cong Zhang, whose love and support have been crucial to everything I have ever accomplished. I also thank our daughter, Keyun, for bringing me sunshine and happiness.

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  Background
    Reconfigurable Hardware
    Reconfigurable Computing Systems
    Linear Algebra Operations
    Challenges
  Contributions
  Organization

Chapter 2: Scientific Computing on Reconfigurable Computing Systems
  Floating-Point Arithmetic on FPGAs
  Scientific Computing Kernels and Applications

Chapter 3: Design Model
  Reconfigurable Computing Systems
    Cray Reconfigurable Systems
    SRC Reconfigurable Computers
    SGI RASC
  High-Level Design Model
    Architectural Model
    System Parameters
  Design Flow
    Parameterized Design Approach
    Design Parameters

Chapter 4: Vector Operations on Reconfigurable Computing Systems
  Related Work
  Dot Product
    Operation Analysis
    Architecture
  Dense Matrix-Vector Multiplication
    Operation Analysis
    Architecture
  Sparse Matrix-Vector Multiplication
    CRS Format
    Operation Analysis
    Architecture
  Experimental Results
    Experimental Setup
    Design Flow on XD1
    Performance Analysis
    Implementations on XD1
    Performance Comparison

Chapter 5: FPGA-Based Reduction Circuit
  The Reduction Problem and Related Work
    The Reduction Problem: An Intuition
    The Reduction Problem: Formal Definition
    Previous Work
    Reduction Circuit Design Analysis
  Tree-Traversal Method
    Intuitive Idea
    Proposed Design: Fully Compacted Binary Tree
      Architecture and Schedule
      Proof of Correctness
      Snapshot
  Striding Method
    Intuitive Idea
    Proposed Design: Single Strided Adder
      Architecture and Schedule
      Proof of Correctness
      Snapshot
  Performance Analysis
    Summary of Proposed Designs
    Performance Metrics
    Design Implementations
    Performance Analysis

Chapter 6: Dense Matrix Operations on Reconfigurable Computing Systems
  Related Work
    Matrix Operations on Parallel Systems
    Matrix Operations on Reconfigurable Computing Systems
  Matrix Multiplication
    Operation Analysis
      Lower Bound
      Design Trade-off Analysis
      Design Issues
    Matrix Multiplication on Single FPGA
      Design 1
      Design 2
      Design 3
    Matrix Multiplication on Reconfigurable Computing Systems
  Matrix Factorization
    Operation Analysis
    Design Issues
    Proposed Design
      Block LU Decomposition
    Hierarchical Design
  Experimental Results
    Performance Analysis
    Performance Comparison
    Performance Projection
    Implementations on XD1

Chapter 7: Hybrid Designs on Reconfigurable Computing Systems
  Related Work
  Design Methodology
    Workload Partition
    Multiple Nodes
    Hardware/Software Coordination
    Performance Prediction
  Hybrid Designs
    Matrix Multiplication
      Workload Partition
      Hardware/Software Coordination
    Block LU Decomposition
      Task Identification
      Algorithm and Architecture
      Hardware/Software Coordination
      Load Balancing
    Floyd-Warshall Algorithm
      Task Identification
      Algorithm and Architecture
      Workload Partition
      Hardware/Software Coordination
  Experimental Results
    Implementation Details
    Performance Analysis
    Performance Comparison
    Performance Projection

Chapter 8: Conclusion
  Summary of Contributions
  Future Work
    Portable and Scalable Libraries
    Heterogeneous Computing Platform

References

List of Tables

3.1 Characteristics of memory for a single FPGA
4.1 64-bit floating-point cores and reduction circuit
4.2 Sparse matrices used in our experiments
5.1 Summary of reduction circuits
5.2 Characteristics of the reduction circuits (n = 128)
7.1 Workload partition for matrix multiplication on XD1
7.2 Routines and latencies for operations in block LU decomposition

List of Figures

1.1 Generic FPGA and its components
1.2 Number of system gates on Xilinx FPGAs
3.1 Hardware architecture of Cray XD1
3.2 Hardware architecture of SRC-7
3.3 Hardware architecture of SGI Altix 350/RC 100
3.4 Architectural model for reconfigurable computing systems
3.5 Parameterized design approach
4.1 Architecture for dot product (k = 4)
4.2 Architecture for MXV (k = 4)
4.3 Improved architecture for SpMXV
4.4 Block diagram of an FPGA-based design on XD1
4.5 Area and latency for dot product (n = 2048)
4.6 Performance analysis for MXV (n = 2048)
4.7 Total time/computation time/initialization time vs. block size for sparse matrix memplus
4.8 Speedup of our design for SpMXV (k = 4) over the Itanium 2 system
5.1 Buffer overrun
5.2 General architecture of reduction circuits
5.3 Partially compacted binary tree (PCBT) architecture
5.4 Fully compacted binary tree (FCBT) architecture
5.5 Labels for FCBT inputs (n = 5)
5.6 Scheduling algorithm for FCBT design
5.7 Snapshot of FCBT design
5.8 Reducing n inputs into α segments (α = 4)
5.9 Single strided adder (SSA) architecture
5.10 Scheduling algorithm for SSA design
5.11 Snapshot of SSA design
5.12 Area vs. lg(n)
5.13 Clock speed vs. lg(n)
5.14 Latency and area-latency product for reducing a single set of 128 inputs
5.15 Latency and area-latency product for reducing 128 sets of 128 inputs
5.16 Throughput and throughput-to-area ratio for reducing 128 sets of 128 inputs
6.1 Trade-offs for matrix multiplication
6.2 Design 1 for matrix multiplication: describes the cycle-specific behavior of each PE
6.3 Architecture of Design 1 for matrix multiplication
6.4 Computation illustrations for Design 1 (n = 4, s = 2)
6.5 Data flow for Design 1 (n = 4, s = 2, and the pipeline delays of both the adder and the multiplier are 2)
6.6 Design 2 for matrix multiplication, r = 2: describes the cycle-specific behavior of each PE
6.7 Architecture of Design 2 for matrix multiplication, r = 2
6.8 Computation illustrations for Design 2 (n = 4, s = 1, r = 2)
6.9 Design 3 for matrix multiplication: describes the cycle-specific behavior of each PE
6.10 Computation illustrations for Design 3 (n = 4, m_b = 8, k = 2)
6.11 Architecture for matrix multiplication using multiple levels of memory
6.12 Architecture for matrix multiplication on multiple FPGAs
6.13 Computational flow of Design 1 for LU decomposition
6.14 Architecture of Design 2 for LU decomposition
6.15 Computational flow of Design 2 for LU decomposition
6.16 Performance analysis for matrix multiplication (n = 2048)
6.17 Performance analysis for LU decomposition (n = 128)
6.18 Comparison of sustained performance of Design 2 and processor-based designs for LU decomposition
6.19 Performance projection
6.20 Performance of matrix multiplication on multiple FPGAs
7.1 Architecture for block LU decomposition at iteration 0
7.2 Illustration of operations of the nodes for block Floyd-Warshall (n_b = 8, p = 4, t = 2)
7.3 Latency of one b×b block matrix multiplication using hybrid design vs. b_f (b = 3000, p = 5)
7.4 Latency of the 0th iteration in block LU decomposition using hybrid design vs. l (n = 30000, p = 6)
7.5 Latency of one iteration in block Floyd-Warshall using our design vs. l_1 (b = 256, n = 18432, p = 6)
7.6 GFLOPS performance of matrix multiplication vs. p
7.7 GFLOPS of block LU decomposition vs. n_b (b = 3000)
7.8 GFLOPS performance of block Floyd-Warshall vs. p
7.9 Percentage of total performance of the system achieved by our designs
7.10 Projected performance of hybrid design for matrix multiplication (b_f = 2048, p = 6)

Abstract

Recently, high-end computing systems have been introduced that employ reconfigurable hardware as application-specific hardware accelerators for general-purpose processors. These systems provide new opportunities for high-performance implementations of scientific applications. However, they also pose new design challenges, including utilization of available hardware resources, exploitation of multiple levels of memory, and hardware/software co-design.

In this work, we investigate high-performance designs for floating-point based linear algebra on reconfigurable computing systems. The operations studied are fundamental kernels for scientific computing, including dense and sparse matrix-vector multiplication, matrix multiplication and matrix factorization. We first study the existing systems and propose a high-level design model. This model captures the architectural details of a system through various parameters at both the node level and the system level.

We next propose optimized designs on reconfigurable hardware using a parameterized design approach. Using the approach, we identify the design parameters, explore the design space and analyze the design trade-offs for each target operation. By tuning the parameters, the proposed designs can adapt to various hardware devices and achieve optimal performance under the available hardware resources. We also develop high-throughput and area-efficient designs for reduction, a fundamental primitive in performing linear algebra operations. Our designs are then incorporated into hybrid designs that utilize both the processors and the reconfigurable hardware in the system. A design methodology is proposed for hybrid designs to perform workload partitioning and hardware/software coordination.

Experimental results show that our designs for the vector operations achieve more than 90% of the peak performance under the available memory bandwidth. For the matrix operations, our designs achieve the optimal latency and minimize the required memory bandwidth with the available hardware resources. In addition, when newer and faster floating-point cores become available, the performance of our designs increases correspondingly. The proposed hybrid designs have been implemented on a state-of-the-art reconfigurable computing system. With 6 processors and 6 reconfigurable devices, our designs achieve more than 20 GFLOPS for matrix multiplication and matrix factorization. Furthermore, our designs achieve up to 90% of the total computing power of the system and more than 85% of the performance predicted using the high-level design model.

Chapter 1: Introduction

Reconfigurable hardware offers the design flexibility of general-purpose processors, but with time performance closer to application-specific integrated circuits [19]. Moreover, one reconfigurable device can hold a large number of functional units that perform computations in parallel. Therefore, reconfigurable hardware, particularly FPGAs (Field Programmable Gate Arrays), has provided significant performance increases in areas such as signal processing, cryptography, and network security [18, 7, 30].
However, due to its low computing density, early reconfigurable hardware was used mainly for fixed-point applications that are not computationally demanding.

With rapid advances in technology, current reconfigurable hardware can be used for implementations of more complex applications because it contains many more resources than its predecessors. In particular, a state-of-the-art reconfigurable device contains up to 300,000 configurable logic cells, several megabytes of on-chip memory, hundreds of I/O pins, as well as a large number of hardware primitives such as fixed-point multipliers [74]. With this large amount of hardware resources, multiple floating-point cores can now be configured on one reconfigurable device and perform computations in parallel. Moreover, tens of floating-point words can be exchanged between the device and the external memory in one clock cycle. Therefore, current reconfigurable hardware is able to provide both high computational parallelism and high I/O parallelism. It is hence now attractive for a broader range of applications, including computationally intensive kernels requiring floating-point arithmetic.

Recently, reconfigurable hardware has been employed to accelerate scientific applications. For both dense and sparse matrix computations, reconfigurable hardware has achieved higher performance than implementations on general-purpose processors [23, 24]. FPGA-based designs have also been proposed for more complex applications, such as molecular dynamics [61] and fluid dynamics [59]. Some researchers have even suggested that at the current rate of development, reconfigurable hardware will soon outperform general-purpose processors in terms of both peak and sustained floating-point performance [67].

As the computing power of reconfigurable hardware increases rapidly, high-end systems that employ such hardware as application-specific accelerators have become a new trend in high-performance computing [21, 64, 62]. These systems contain multiple nodes that are connected through an interconnect network. Each node contains both reconfigurable hardware and processors that share a memory system. These systems provide multiple levels of parallelism. Coarse-grain parallelism can be employed across multiple nodes, while fine-grain parallelism can be exploited on the reconfigurable hardware. Moreover, within each node, the processors and the reconfigurable hardware can collaborate to compute the tasks assigned to the node. Thus, such systems can achieve higher performance than systems with processors only.

However, to effectively utilize these reconfigurable computing systems, certain design challenges must be addressed. First, optimized designs on the reconfigurable hardware are required. Secondly, these designs need to be incorporated into the systems by utilizing multiple reconfigurable devices and multiple levels of memory. Efficient hardware (reconfigurable hardware)/software (programs executed on the processors) co-design is also crucial.

In our work, we address these challenges and propose high-performance designs for linear algebra on reconfigurable computing systems. Our work focuses on basic linear algebra operations, including dense/sparse matrix-vector multiplication, matrix multiplication and matrix factorization. These operations are fundamental to scientific computing.
In particular, they serve as basic building blocks for many numerical linear algebra applications, including the solution of linear systems of equations, linear least squares problems, eigenvalue problems and singular value problems [57]. Therefore, high-performance designs for these operations are crucial for the performance improvement of many scientific applications.

In the next section, we provide introductory material about reconfigurable hardware, reconfigurable computing systems, and the challenges in using them for high-performance computing. We then present a summary of our contributions and the organization of this thesis.

1.1 Background

1.1.1 Reconfigurable Hardware

Reconfigurable hardware provides a bridge between flexible but performance-constrained general-purpose processors and very high-performance but inflexible Application-Specific Integrated Circuits (ASICs). Because reconfigurable hardware is programmed specifically for the problem to be solved rather than to be able to solve all problems, it can achieve higher performance than processors, especially for applications with a regular structure and abundant parallelism. Also, as reconfigurable hardware can be re-programmed an unlimited number of times, it is much more flexible than ASICs.

A typical type of reconfigurable hardware is the FPGA [19]. An FPGA device consists of tens of thousands of logic blocks (clusters of slices) whose functionality is determined by programmable configuration bits. These logic blocks are connected using a set of routing resources that are also programmable. Thus, mapping a design to an FPGA consists of determining the functions to be computed by the logic blocks, and using the configurable routing resources to connect the blocks. The FPGAs considered in this work are SRAM-based. That is, they are configured by writing a configuration bitstream to their SRAM-based configuration memory.

Figure 1.1: Generic FPGA and its components

Figure 1.1 shows a generic FPGA architecture and its components. The large shaded boxes represent the configurable logic blocks. The small shaded boxes represent "switch boxes" that are used to program the routing. The configurable logic blocks in FPGAs are made up of configurable combinational logic, flip-flops (FFs), multiplexers (MUXs), and high-speed intra-block connections (e.g., fast carry logic). The configurable combinational logic is implemented as look-up tables (LUTs). An l-input LUT can be configured to perform any logic function of l inputs. As an example, a 3-input LUT is shown on the right of Figure 1.1 (A, B, C are the inputs). The logic function implemented is a 3-input OR: if A, B, and C are each 0, the LUT outputs a 0; otherwise, it outputs a 1. In practice, the most common type of LUT in FPGAs is the 4-input LUT.

Recently, the computing power of FPGAs has been increasing dramatically, as shown in Figure 1.2. In addition to the increased density, current FPGA fabrics contain many more hardware primitives, which are dedicated (non-programmable) hardware features embedded into the FPGA fabric. The primitives increase the capability of the FPGAs by performing common functions without using programmable logic resources. Common primitives include Block RAM (BRAM), fixed-point multipliers and multiply-accumulate cores [1, 74]. Even processors have been embedded into FPGAs. For example, the Xilinx Virtex-II Pro FPGAs have up to two PowerPC 405 cores embedded in the FPGA fabric.

Figure 1.2: Number of system gates on Xilinx FPGAs
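To make the LUT abstraction above concrete, here is a minimal software model of an l-input LUT as a truth table with 2^l entries, reproducing the 3-input OR configuration described for Figure 1.1. The function names are illustrative only and do not come from the thesis.

```python
# Minimal software model of an l-input LUT: the configuration is a truth
# table with 2**l entries, indexed by the inputs read as a binary number.

def make_lut(truth_table):
    """Return a function computing the logic function stored in the table."""
    size = len(truth_table)
    assert size & (size - 1) == 0, "table size must be a power of two"
    def lut(*inputs):
        index = 0
        for bit in inputs:          # inputs[0] is treated as the most significant bit
            index = (index << 1) | (bit & 1)
        return truth_table[index]
    return lut

# 3-input OR, as in Figure 1.1: only the entry for A = B = C = 0 is 0.
or3 = make_lut([0, 1, 1, 1, 1, 1, 1, 1])
assert or3(0, 0, 0) == 0
assert or3(1, 0, 1) == 1
```

Configuring a physical LUT amounts to loading such a truth table into its configuration memory; the programmable routing then connects LUT outputs to other logic blocks.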
1.1.2 Reconfigurable Computing Systems

A reconfigurable computing system consists of multiple nodes, and each node can be based on processors, FPGAs, or both. With their high frequencies and von Neumann architecture, processors are suitable for executing control-intensive tasks. FPGAs, on the other hand, with their temporal parallelism and spatial parallelism, can be used for computationally intensive tasks which contain a large amount of parallelism. A task can also be partitioned between the processors and the FPGAs.

The presence of a high-bandwidth, low-latency interconnect network distinguishes reconfigurable computing systems from other systems in which FPGAs are attached to a host processor over a standard bus, such as PCI or VME. Moreover, because of the multiple levels of memory available to the FPGAs, the design process for reconfigurable computing systems is much more complex than that in embedded systems with FPGAs. In Chapter 3, we will study currently available reconfigurable computing systems and use the information to develop relevant models.

1.1.3 Linear Algebra Operations

Most of the linear algebra operations studied in this thesis belong to the set of Basic Linear Algebra Subprograms, commonly referred to as BLAS [43]. In particular, dot product is in Level 1 BLAS; dense and sparse matrix-vector multiplication are in Level 2 BLAS; matrix multiplication is in Level 3 BLAS. BLAS has been used in a wide range of software and provides building block routines for performing basic vector and matrix operations. Optimizations for the BLAS library on general-purpose processors include loop unrolling, register blocking and cache blocking [72]. Since many of the optimizations are platform specific, ATLAS was proposed, which automatically generates and optimizes numerical software for processors with deep memory hierarchies and pipelined functional units.

Another important linear algebra library is LAPACK (Linear Algebra PACKage) [5]. This library provides routines for solving systems of simultaneous linear equations, least squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided. In our work, we consider LU decomposition.

The operations considered in this work have several common features that make them suitable for hardware acceleration. First, they are loop-oriented and can benefit greatly from the spatial parallelism provided by FPGAs. Secondly, they are data-oblivious, so the algorithms and architectures do not change with the input data. Thirdly, the control logic of these operations is simple and does not constrain the performance of the entire design.

These operations have been implemented on parallel systems and their parallel implementations are included in ScaLAPACK (Scalable LAPACK) [11].
ScaLAPACK is one of the most widely used libraries on distributed-memory MIMD systems. Its computational model consists of a one- or two-dimensional processor grid, where each process stores pieces of the matrices and vectors. In ScaLAPACK, the block-cyclic data layout is used to minimize the frequency of data movements between different levels of the memory hierarchy. A message-passing library designed for linear algebra, BLACS (Basic Linear Algebra Communication Subprograms), is used for inter-process communication.

ScaLAPACK utilizes the parallelism among multiple computers. In contrast, our FPGA-based designs exploit the spatial and temporal parallelism on one reconfigurable hardware device. Instead of using message passing, multiple functional units communicate through routing wires on the device. Moreover, the various hardware constraints on the FPGAs and the complex memory hierarchy in reconfigurable computing systems result in new design trade-offs. These trade-offs have not been analyzed in the existing work on parallel and distributed systems.

There is a large body of work on using systolic arrays for linear algebra operations [3, 13]. While some of this work can be re-targeted to FPGA-based systems, these systolic algorithms do not exploit the powerful features of reconfigurable computing systems. These features include large on-chip memory with huge on-chip communication bandwidth, availability of multiple banks of SRAM in close proximity to the FPGA device, a memory hierarchy, multiple FPGAs, and a high-bandwidth interconnect. Also, compared with the systolic model of computation, reconfigurable computing systems have their own set of constraints, such as hardware constraints on the FPGAs and the size constraint on the memory.

1.1.4 Challenges

With both processors and FPGAs in the system, reconfigurable computing systems are promising for high-performance computing. However, they also pose various new design challenges:

Optimized Designs on FPGAs: Designs on traditional processors and parallel systems are based on exploiting pipelining and the cache hierarchy. For reconfigurable computing systems, we need to exploit not only temporal parallelism (pipelining), but also spatial parallelism, by executing multiple Processing Elements (PEs) concurrently. All these hardware optimizations must be considered under various hardware resource constraints. Therefore, the design space is much larger and more design trade-offs must be analyzed. Moreover, the deep pipelining in FPGA-based floating-point cores may cause problems for some basic operations, such as the reduction of a series of inputs.

Exploitation of Memory Hierarchy: In reconfigurable computing systems, FPGAs have access to multiple levels of memory, including the on-chip memory, the on-board memory and the main memory of the processors. These memories form a memory hierarchy similar to that in general-purpose processor-based systems. However, an application executed on a processor usually has no control over the content of the cache. In contrast, FPGA-based designs in reconfigurable computing systems initiate read and write operations to the various levels of memory. Therefore, effective utilization of the memory hierarchy is crucial to achieve high performance and to minimize the requirements on the memory bandwidth.

Hardware/Software Co-Design: A node in a reconfigurable computing system may contain both processors and FPGAs.
To utilize fully their computing power, the workload of an application must be partitioned between the FPGAs and the processors. An effective partitioning must consider overheads such as data transfer time and network communication time when multiple nodes are employed. Moreover, we must consider the synchronization and communication between the processors and the FPGAs. As they share the memory space, their memory accesses must also be coordinated to avoid data hazards.

1.2 Contributions

The main contributions of this thesis, summarized below, are in addressing the challenges of using reconfigurable computing systems and proposing high-performance designs for linear algebra on these systems. Our work shows that reconfigurable hardware has become an attractive choice for floating-point based applications, and that reconfigurable computing systems are able to achieve higher performance than traditional parallel systems. Our proposed designs can be used as building blocks in designs for more complex applications.

Novel Design Model for Reconfigurable Computing Systems: We survey currently available reconfigurable computing systems and build a high-level design model based on their architectures. In this model, a reconfigurable computing system is abstracted as a distributed system with multiple nodes connected through a high-bandwidth, low-latency network. Each node in the system contains both the processors and the FPGAs. The model provides various parameters at the system level, such as the number of nodes and the network bandwidth between the nodes. Each node is also characterized by various parameters, including the floating-point computing power of both the processor and the FPGA, the size of the various levels of memory and the memory bandwidth. The proposed model captures the architectural details of reconfigurable computing systems and greatly facilitates design development on them. In our work, the model is used to propose high-performance FPGA-based designs and hybrid designs that utilize both the processors and the FPGAs.

Optimized Designs on FPGAs: Based on the design model, we propose optimized algorithms and architectures for dot product, matrix-vector multiplication, matrix multiplication and matrix factorization on FPGAs. For each operation, we analyze its inherent characteristics and identify various design parameters. These parameters include the number of PEs, the size of internal storage, the block size and the required memory bandwidth. By exploring the design space, we analyze the design trade-offs among the area, latency and storage size needed by a design. The proposed designs are generic and independent of the FPGA device. By tuning the design parameters, the designs can adapt to various devices and achieve optimal performance under the available hardware resources.

Our designs for dot product and matrix-vector multiplication employ a simple yet efficient tree-based architecture. For matrix multiplication and matrix factorization, a linear array architecture is used. The proposed designs fully utilize the available hardware resources on the FPGA and the available memory in a reconfigurable computing system. With the given resources, the designs minimize both the latency and the memory bandwidth requirement. In addition, the performance of the designs increases linearly with the available hardware resources.
High-Performance and Area-Efficient Reduction Circuits: We propose high-performance and area-efficient FPGA-based reduction circuits to reduce series of sequentially delivered values. Such reduction widely exists in scientific applications. However, it may cause data hazards in FPGA-based designs because of the deep pipelining in the floating-point arithmetic on FPGAs. In this work, using accumulation as an example, we analyze the design trade-offs among the number of floating-point adders, the buffer size and the latency of a design. The proposed designs achieve high throughput in that they are able to reduce multiple input sets of arbitrary sizes without stalling the pipeline. The designs are also area-efficient because the numbers of adders in the designs are fixed and their buffer sizes are moderate. In particular, one design uses only one adder and two buffers of size α² words, where α is the pipeline delay of the adder. (A software analogue of this idea is sketched at the end of this chapter.) In this thesis, one of the proposed reduction circuits is used in our designs for dot product and dense/sparse matrix-vector multiplication.

Design Methodology for Hybrid Designs: We propose a design methodology for hybrid designs on reconfigurable computing systems. The designs are hybrid in that they utilize both the processors and the FPGAs in the system for computations. There are several steps in this methodology. First, we identify the tasks within an application. Based on the inherent attributes of the tasks as well as the system parameters, hardware/software partitioning is then performed. The partitioning is further optimized by overlapping the computations with the data transfers and network communications. Within each node, the computations on the FPGAs and processors are coordinated so that data hazards and memory access conflicts are avoided. When multiple nodes are employed, load balancing is maintained.

Using the methodology, we propose hybrid designs for matrix multiplication, LU decomposition for matrix factorization and the Floyd-Warshall algorithm [20] for the all-pairs shortest-paths problem. Our designs achieve higher performance than designs that use only the processors or the FPGAs in the system. The methodology can also be used for performance prediction. It has been shown that the proposed designs achieve up to 90% of the predicted performance.

1.3 Organization

We have organized the remainder of the thesis as follows. In Chapter 2, we survey prior work that uses reconfigurable hardware and reconfigurable computing systems for scientific computing. In Chapter 3, we survey existing reconfigurable computing systems and present our design model. In Chapter 4, we discuss our proposed designs for vector operations. In Chapter 5, we present our FPGA-based reduction circuits. In Chapter 6, we present our designs for matrix multiplication and matrix factorization. In Chapter 7, we describe our design methodology for hybrid designs and the proposed designs. In Chapter 8, we summarize the work and present areas for future research.
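As a software analogue of the reduction problem mentioned in the contributions above, the sketch below accumulates a sequence using α interleaved partial sums: with an adder whose result is available only after α cycles, striding removes the dependence between consecutive additions at the cost of a final combining step. This is an illustration under simplified assumptions, not the FPGA reduction circuit of Chapter 5, which must also handle multiple input sets of arbitrary sizes without stalling the pipeline.

```python
# Software analogue of strided accumulation with a deeply pipelined adder:
# a naive running sum could accept only one input every alpha cycles, since
# each addition depends on the previous result. Keeping alpha independent
# partial sums hides the pipeline delay.

def strided_accumulate(values, alpha):
    """Reduce `values` as alpha interleaved partial sums, then combine them."""
    partial = [0.0] * alpha                 # one partial sum per pipeline slot
    for i, v in enumerate(values):
        partial[i % alpha] += v             # consecutive additions go to
                                            # different partials: no dependence
    total = 0.0
    for s in partial:                       # final combine of the alpha partials
        total += s
    return total

print(strided_accumulate(range(1, 101), alpha=8))   # 5050.0
```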
Chapter 2: Scientific Computing on Reconfigurable Computing Systems

Because of their computational intensity, scientific computing kernels and applications have long been targets for acceleration. Current FPGAs contain many more configurable logic blocks and on-chip memory than previous ones, and thus have become an attractive platform on which to perform such acceleration. In reconfigurable computing systems, FPGAs are used as application-specific hardware accelerators. Thanks to their reconfigurability, an FPGA accelerator can be used to accelerate multiple kernels and applications.

In this chapter, we survey prior work investigating the use of FPGAs and reconfigurable computing systems for high-performance implementations of scientific computing applications. We break our survey into two categories: (1) floating-point arithmetic on FPGAs and (2) computational kernels and scientific applications on reconfigurable computing systems. Note that work that is directly related to ours is discussed in individual chapters. For example, we leave the discussion of FPGA-based vector operations and matrix operations to Chapter 4 and Chapter 6, respectively.

2.1 Floating-Point Arithmetic on FPGAs

FPGA-based floating-point cores are required for implementations of scientific computing applications. In [46], the authors discuss the implementations and performance of single-precision floating-point adders and multipliers and the implications of this performance in the context of a matrix multiplication algorithm. Their work targets more primitive FPGAs than are available today. In particular, the area available for the floating-point cores is severely limited, and the clock frequencies are low. Thus, the authors deal mainly with variations in the arithmetic algorithms and their resource implications.

Reference [10] focuses on the implementation of floating-point cores for modern FPGAs. In this work, a parameterized floating-point library is presented. The library contains an adder and a multiplier, as well as cores for conversion between floating-point and fixed-point formats. These cores are parameterized by the bitwidth of the floating-point numbers. In [70], the library in [10] is extended to include a divider, a square rooter, and an accumulation core. Like the other cores, these are parameterized by bitwidth. The library presented in [70] supports general flexible formats, which can be parameterized to accept operands of any bitwidth, including the IEEE standard 754 formats [39].

In [45], a floating-point unit generator is presented which can create a large space of floating-point adders, subtractors, multipliers and dividers based on a variety of parameters. The authors explore three trade-off levels: the architectural level, the floating-point algorithm level, and the floating-point representation level. The floating-point unit generator was implemented as part of a C++ to FPGA automatic design tool. Users of the generator can choose cores that are optimized for area, throughput, or latency. Trade-offs in area and latency are also analyzed for bitwidths less than or equal to 32 bits.

Govindu et al. developed a library for double-precision floating-point cores, including an adder/subtractor, multiplier, divider and square rooter [33]. Their cores support all representable types of numbers, all types of exceptions that can be generated, all rounding modes, and the NaN diagnostics described in IEEE standard 754. In addition, these cores are parameterized by the degree of pipelining and by the features of IEEE standard 754 that are implemented. Performance data for three configurations of the cores are presented. The effects of supporting the standard are analyzed when these cores are used for Lennard-Jones force and potential calculations, part of molecular dynamics (MD) simulations.

Several vendors have also provided their own floating-point cores, such as Xilinx [74] and Nallatech [53].
In the SRC-7, which will be available soon, dedicated double-precision floating-point cores will be embedded into the FPGAs [64].

2.2 Scientific Computing Kernels and Applications

Iterative methods are widely used in scientific computing to solve large sparse linear systems. These solvers are computationally intensive and their parallel nature enables them to be accelerated by FPGAs. Morris et al. [52] present the design and implementation of a parameterized, deeply pipelined FPGA-based double-precision floating-point version of the Jacobi iterative method. Through performance estimates, the authors show that the dense Jacobi design has a 1.3X speedup over processor-based implementations for a single iteration. The multiple-iteration speedup is 2.7X. The sparse Jacobi design achieves up to a 36.8X speedup over processor-based implementations. In particular, sparse matrices having an irregular structure experience the biggest speedups.

Another iterative solver that has been implemented on FPGAs is the Conjugate Gradient (CG) method. In [50], the authors implement the complete CG method on a reconfigurable computing system. The processors in the system perform all the steps but the sparse matrix-vector multiplication, which is carried out on the FPGAs. Using an SRC 6 MAPstation, the design achieves a speedup of about 1.3X over a processor-only implementation. The authors then estimate that with a next-generation SRC 7 MAPstation, a speedup of 3.9X is attainable.

Mathematical transforms are used widely in scientific computing. For example, the Fast Fourier Transform (FFT) is included in the HPCChallenge benchmark suite [37]. Reference [35] provides detailed analysis of FPGA-based double-precision floating-point FFTs. It also presents three implementations: a parallel architecture, a pipelined architecture and a parallel-pipelined architecture. These implementations are compared in terms of sustained performance and memory requirements for various FFT sizes and FPGA sizes. The authors also compare the FFT performance of current FPGAs with that of current general-purpose processors, and the estimated FFT performance of future FPGAs with that of future general-purpose processors. The authors predict that within four years, FPGAs will dramatically outperform processors for large FFTs.

Besides computational kernels, more complex scientific applications can also benefit from hardware acceleration. For example, molecular dynamics (MD), a widely used technique to simulate the trajectory of an atomic system over time, has been implemented on reconfigurable computing systems [61, 60]. In [61], the most computationally intensive part of an MD simulation, the Lennard-Jones force and potential calculations, is accelerated using FPGAs. The design presented in [61] is a modular, very deeply pipelined architecture that exploits the fine-grained parallelism of the calculations. It achieves 3.9 GFLOPS using a Xilinx Virtex-II Pro FPGA.

In [60], a complete MD simulation is implemented on a reconfigurable computing system. The application is partitioned between software and hardware. In particular, most of the simulation runs on the processor in the system and the force calculations are executed on the FPGAs. The proposed implementation achieves a 2X speedup over the corresponding processor-only solution.

The authors of [63] use FPGAs to accelerate three kernels in computational fluid dynamics: the Euler, Viscous, and Smoothing algorithms.
The calculations are implemented on the FPGAs in a pipelined fashion. Each floating-point operation in the accelerated kernels has its own floating-point core on the FPGAs. Using single-precision floating-point numbers, the designs achieve 10.2, 23.2, and 7.0 GFLOPS, respectively, for the kernels listed above. The system on which the application runs has a processor connected over the PCI bus to several Xilinx FPGAs and SRAM memory. At the time the research was conducted, the PCI bus communication bandwidth was the main factor limiting the performance.

In [31], a Monte Carlo radiative heat application is explored. An FPGA-based design is developed for the inner loop of the application, which accounts for most of the computation time. The loop is unrolled into a pipeline in which each stage consists of one or more floating-point cores operating in parallel. This pipeline is replicated to take advantage of coarse-grained parallelism. The target FPGAs have enough resources to allow the pipeline to be replicated three times. When implemented on a Xilinx Virtex-II Pro FPGA, the design achieves a 10.4X speedup over a 3 GHz Pentium processor.

Reference [32] presents a survey of available reconfigurable computing systems and then assesses how they would be used in several floating-point intensive scientific computing applications. These applications are from the fields of climate modeling and computational chemistry and physics. Based on their profiles, the authors conclude that an approach to using reconfigurable computing systems in scientific computing that is based solely on the use of libraries for common computational tasks will not be fruitful for large-scale scientific computing applications. Instead, it is crucial to profile carefully and partition the application. In order to utilize fully the existing reconfigurable computing systems, it may be necessary to re-write existing application code.

All the work discussed in this chapter shows that FPGAs are competitive in performance with general-purpose processors for floating-point based applications. With both FPGAs and processors, reconfigurable computing systems promise to achieve high performance for scientific computing kernels and applications.

Chapter 3: Design Model

In this chapter, we survey existing reconfigurable computing systems, including the Cray XD1 [21], the SRC MAPstation from SRC Computers [64], and the SGI Altix 350 with embedded RC100 blade [62]. Based on the hardware architecture of these systems, we propose our design model, which captures their architectural details and memory hierarchy. The model characterizes a reconfigurable computing system using various parameters at both the system level and the node level.

We also discuss our design approach for linear algebra on FPGAs in this chapter. A parameterized and flexible design approach is adopted, which separates the algorithm and architecture from the actual device. Using this approach, we can generate an optimized FPGA-based design that is independent of the device used, and then adapt the design to a specific hardware device by tuning the design parameters. This approach is used in Chapter 4 and Chapter 6. The proposed designs can also be incorporated into the hybrid designs proposed in Chapter 7.

3.1 Reconfigurable Computing Systems

Many reconfigurable computing systems have become available. Several representative systems are described below.
3.1.1 Cray Reconfigurable Systems

The basic architectural unit of the Cray XD1 is a compute blade, which contains two AMD Opteron processors and one Xilinx Virtex-II Pro FPGA [21]. Each FPGA has access to four banks of QDR II SRAM. The total SRAM bandwidth available to each FPGA is 12.8 GB/s. Through Cray's RapidArray processors, the FPGA can access the DRAM of the processors at a bandwidth of 3.2 GB/s. Six compute blades fit into one chassis, and multiple chassis can be connected to form larger systems. The hardware architecture of the XD1 is shown in Figure 3.1. The "RT" between the RapidArray Processor and the FPGA refers to the RapidArray Transport links designed by Cray [21].

Figure 3.1: Hardware architecture of Cray XD1

The recently introduced Cray supercomputers, the XT3 and XT4, also support reconfigurable computing by incorporating FPGA modules from DRC [25]. Each DRC module contains a Virtex-4 FPGA and SRAM memory of up to 64 MB. The module has access to adjacent Opteron processors and DRAM memory at a bandwidth of up to 6.4 GB/s.

3.1.2 SRC Reconfigurable Computers

In SRC reconfigurable computers, the basic architectural unit, the MAPstation, contains one Intel microprocessor and one reconfigurable logic resource called the MAP processor [64]. In the SRC-7, each MAP processor consists of two Altera Stratix II FPGAs and one FPGA-based controller. The Altera FPGAs have access to eight banks of on-board (SRAM) memory of total size 64 MB. The SRAM memory bandwidth is 19.2 GB/s. The FPGA controller facilitates communication and memory sharing between the processor and the FPGAs. The bandwidth between the FPGAs and the processor is 14.4 GB/s. The MAP processor also contains two on-board 1 GB SDRAM banks, which can be accessed by the Altera FPGAs at 4.2 GB/s through the controller. Multiple MAPstations can be connected by Ethernet to form a cluster. The hardware architecture of the SRC-7 is shown in Figure 3.2 [64]. The MAPstations can also be connected by the SRC Hi-Bar switch and have access to common memory.

Figure 3.2: Hardware architecture of SRC-7

3.1.3 SGI RASC

SGI has also proposed Reconfigurable Application Specific Computing (RASC) technology, which provides hardware acceleration to SGI Altix servers [62]. In an SGI RASC RC100 blade, there are two Xilinx Virtex-4 FPGAs. The FPGAs are connected to 80 MB of SRAM memory at a bandwidth of 3.2 GB/s. Each blade is directly connected to the shared global memory in the system through the SGI NUMAlink4 interconnect, at a bandwidth of 6.4 GB/s.
The architecture of the Altix 350/RC 100 is shown in Figure 3.3.

Figure 3.3: Hardware architecture of SGI Altix 350/RC 100

3.2 High-Level Design Model

3.2.1 Architectural Model

Based on the hardware architecture of the existing reconfigurable computing systems, we build a high-level architectural model for them. In this model, a reconfigurable computing system is abstracted as a distributed system with multiple nodes, which are connected through an interconnect network. Each node can be based on either general-purpose processors, FPGAs, or both. In this thesis, we only consider systems where each node consists of both FPGAs and processors. Currently, in such systems, only the processors of the nodes are connected to the interconnect network. The architectural model is shown in Figure 3.4.

Figure 3.4: Architectural model for reconfigurable computing systems

Inside each node, the FPGA has access to multiple levels of memory that have various storage capacities and bandwidths. The first level is the on-chip memory of the FPGA, usually Block RAMs (BRAMs). BRAMs are embedded hardware primitives on the FPGA fabrics. In a state-of-the-art device, the aggregate memory bandwidth of BRAMs is over 100 GB/s, while the total size is usually less than 10 Mb.

The second level of memory is off-chip but on-board memory, which is usually SRAM. To access this memory, the FPGA-based designs must incorporate certain memory controllers. The storage capacity of SRAM memory is much larger than that of the BRAM memory, while its bandwidth to the FPGA is much lower. The FPGA also has access to the main memory of the general-purpose processor, which is usually DRAM memory and is in the range of gigabytes. The bandwidth between the DRAM and the FPGA is usually lower than that of the SRAM memory.

Table 3.1: Characteristics of memory for a single FPGA

                 SRC                     Cray                    SGI
  Level          Size     Bandwidth      Size     Bandwidth      Size     Bandwidth
  A. BRAM        N/A      N/A            522 KB   209 GB/s       756 KB   378 GB/s
  B. SRAM        64 MB    19.2 GB/s      32 MB    12.8 GB/s      80 MB    3.2 GB/s
  C. DRAM        10 GB    14.4 GB/s      8 GB     3.2 GB/s       4 GB     6.4 GB/s

These various levels of memory together form a memory hierarchy. We refer to the BRAM memory as Level A, the SRAM as Level B, and the DRAM as Level C. The typical values of memory size and memory bandwidth are shown in Table 3.1. Because SRC has not announced the model of the Altera FPGAs in the SRC-7, the characteristics of their on-chip memory are not available yet.

The memory hierarchy in reconfigurable computing systems is quite similar to that in general-purpose processor-based systems. However, the bandwidth of the first memory level in reconfigurable computing systems is much higher. Secondly, an application executed on a general-purpose processor usually has no control over the content in the cache. In contrast, FPGA-based designs in reconfigurable computing systems initiate read and write operations to all the levels of memory, and thus can utilize the memory more efficiently. The third difference is that in reconfigurable computing systems, the FPGAs can directly access Level C memory without going through Level B memory.
3.2.2 System Parameters

In our model, a reconfigurable computing system is characterized by various parameters, at both the system level and the node level:

• p: number of nodes in the system. The nodes are denoted as N_0, N_1, ..., N_{p-1};
• B_n: network bandwidth between any two nodes, measured as the number of bytes transferred per second;
• b_w: word width. In this work, we consider double-precision floating-point numbers; thus, b_w = 8 bytes.

The parameters used to describe a node in the system include:

• C_f: floating-point computing power of the FPGA;
• O_f: the number of floating-point operations performed in each clock cycle by the FPGA-based design;
• F_f: the clock speed of the FPGA-based design;
• C_p: floating-point computing power of the processor;
• B_d: DRAM memory bandwidth between the FPGA and the processor within a node, measured as the number of bytes transferred between the FPGA and the DRAM per second;
• M_b, M_s: size of the BRAM memory and the SRAM memory, respectively, measured in bytes.

The computing power is defined as the number of floating-point operations performed in a second. For both the processor and the FPGA, the computing power depends on the application. For an FPGA-based design, C_f is calculated as O_f × F_f. For the processor, C_p is the sustained performance achieved by the chosen program for a given application.

Note that our model does not consider the memory access latency. This is because, for the applications considered, data are streamed into and out of the FPGA. Therefore, the memory access latency is incurred only once. When a large amount of data is transferred between the FPGA and the DRAM memory, the memory access latency can be ignored. Similarly, as the communications among the nodes are restricted to long messages, we ignore the network latency.
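To make the node-level parameters concrete, the sketch below wraps them in a small data structure and computes C_f = O_f × F_f. The class name is hypothetical and the numeric values are illustrative placeholders loosely inspired by Table 3.1 and the XD1 description, not measurements reported in this thesis.

```python
# Illustrative container for the node-level parameters of Section 3.2.2.

from dataclasses import dataclass

@dataclass
class NodeModel:
    O_f: int      # floating-point operations per clock cycle in the FPGA design
    F_f: float    # clock speed of the FPGA design (Hz)
    C_p: float    # sustained FLOP/s of the processor for the application
    B_d: float    # DRAM bandwidth between FPGA and processor (bytes/s)
    M_b: int      # BRAM size (bytes)
    M_s: int      # SRAM size (bytes)
    b_w: int = 8  # word width in bytes (double precision)

    @property
    def C_f(self) -> float:
        """FPGA computing power, C_f = O_f * F_f."""
        return self.O_f * self.F_f

node = NodeModel(O_f=8, F_f=130e6, C_p=4e9, B_d=3.2e9,
                 M_b=522 * 1024, M_s=32 * 1024**2)
print(f"C_f = {node.C_f / 1e9:.2f} GFLOPS")   # 1.04 GFLOPS for these numbers
```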
In our approach, we separate the algorithm and architecture design from the actual implementation on a given device. We first examine the description of the operation. The description includes the number of floating-point operations required as well as the amount of data that needs to be exchanged with the DRAM memory. The design parameters are then identified. A list of potential design parameters is presented in Section 3.3.2. Some of them concern the floating-point cores used in the design, such as their pipeline delays; some apply to the entire design, such as the amount of memory required by the design and the number of floating-point cores used.

Based on the exploration of the design space formed by the design parameters, an optimized design is proposed. Such a design is independent of the device used. By tuning the values of the parameters, the design can adapt to various hardware devices. When a device is chosen, the hardware constraints are determined, including the available area, the I/O pin count, the size of available memory, and the memory bandwidth. We then select appropriate values for the design parameters and generate an optimized implementation for the operation. The design approach is illustrated in Figure 3.5.

[Figure 3.5: Parameterized design approach — the operation description and design parameters define a design space from which an architecture is derived; combining the architecture with the device resources and memory bandwidth yields the implementation.]

In our designs, the implementations of building blocks, such as the floating-point cores, have no effect on the algorithm or the architecture. Therefore, these blocks can be independently designed and then plugged into the design as modules, with no or few modifications to the interface between them and other parts of the architecture. Moreover, when smaller or faster floating-point cores become available, the performance of our designs increases accordingly.

3.3.2 Design Parameters

An FPGA-based design is described by the following design parameters:

• α, β, γ: pipeline delays of the floating-point adder/subtractor, the floating-point multiplier, and the floating-point divider, respectively;

• k, k′: number of floating-point multiplications and floating-point additions/subtractions performed in each clock cycle, respectively;

• bw_b, bw_s, bw_d: BRAM memory bandwidth, SRAM memory bandwidth, and DRAM memory bandwidth required by the design, respectively; they are measured as the number of bytes exchanged between the design and the memory in each clock cycle;

• m_b, m_s: total size of the BRAM memory and the SRAM memory (in words) needed by the design, respectively;

• b: block size when a blocking algorithm is employed.

In all our designs, the source data are stored in the DRAM memory initially and are streamed into the FPGAs. Therefore, unless stated otherwise, the "external memory" in this work refers to the DRAM memory. The internal storage in the FPGA-based designs is realized by either the BRAM memory or the SRAM memory. The latency of the design is denoted as T, which equals the number of clock cycles needed to complete the linear algebra operation. T_comp and T_I/O refer to the number of clock cycles needed for the floating-point computations and for I/O with the DRAM memory, respectively.

Chapter 4

Vector Operations on Reconfigurable Computing Systems

In this chapter, we present our proposed FPGA-based designs for several vector operations, including dot product, matrix-vector multiplication (MXV), and sparse matrix-vector multiplication (SpMXV).
These operations are used in many scientific applica- tions, such as the Jacobi method and Conjugate Gradient iterative solvers [9], which are used to solve large linear systems. These vector operations are I/O bound in that their performance is constrained by the available memory bandwidth. However, in cache-based systems, a large part of the memory bandwidth available to the opera- tions is wasted due to cache misses [47]. This problem is most obvious in SpMXV , whose irregular memory accesses cause large numbers of cache misses [66]. Although some optimizations have been proposed to improve the performance of SpMXV on cache-based systems, they require information on the sparsity of the input matrix. For matrices with very irregular structure, these optimizations provide few improvements [38, 56]. FPGAs have become an attractive option for implementing I/O-bound applica- tions. The large amounts of on-chip memory and large numbers of I/O pins on the 31 current FPGA fabrics provide high on-chip and off-chip memory bandwidth. There- fore, FPGA-based designs for these applications are able to avoid the long latency caused by cache misses. In this chapter, we propose simple yet efficient designs for the vector operations. We use the parameterized design approach discussed in Section 3.3.1. The proposed designs utilize a tree architecture and are characterized through various parameters, including the number of floating-point cores, the size of internal storage and the block size. The parameters are tuned according to the hardware resource constraints, in- cluding the number of I/O pins and the size of on-chip memory of the FPGA. Block algorithms are employed to reduce the requirements on the storage size. For SpMXV , our design conforms to an existing sparse matrix storage format and makes no assump- tion about the sparsity of the input matrix. For our designs, the area increases linearly with the number of floating-point cores used; the latency decreases proportionally with the number of floating-point cores, if adequate memory bandwidth is provided. On the other hand, the storage size and the memory bandwidth required by the design are dependent on both the number of floating-point cores and the block size. When implemented in XD1, our design for MXV achieves more than 90% of the peak performance under the available DRAM memory bandwidth. In addition, our design for SpMXV achieves up to 30X speedup over an Itanium 2 processor for matrices with very irregular structure. 4.1 Related Work In [67], Underwood et al examined the potential capacity of FPGAs in performing floating-point BLAS operations. They proposed a special multiply-accumulate core for FPGA-based dot product to accommodate the deep pipelining in the floating-point 32 adder. However, in their design, vectors must have at least 7500 elements because of the overheads of the final accumulation. In the design for MXV in [67], α (the number of pipeline stages in the adder) copies of the vector are stored on the FPGA so that the design can be used for matrices and vectors of all sizes. The authors predicted that with the increasing speed of memory bandwidth available to FPGAs, the performance of FPGA-based designs for vector operations will soon surpass that of designs on general-purpose processors. In [26], fixed-point SpMXV is implemented on FPGAs. The authors designed a constant-coefficient multiplier that can be reloaded at run-time and use it for SpMXV. The authors of [23] studied iterative floating-point SpMXV on FPGAs. 
Their design distributes the dot products in an SpMXV to multiple PEs connected in a loop. The elements of the vector are transferred inside the loop to reduce the requirement on the storage size. However, their design does not consider the constraint on I/O bandwidth. In addition, a balanced partition of the dot products among the PEs is complex due to the irregular sparsity of matrix A. In [50], a design for floating-point, FPGA-based SpMXV is presented as a module in the complete conjugate gradient method. The module is specially designed for the SRC 6 MAPstation and utilizes all the resources on the two FPGAs in the system. In contrast, our designs are independent of the device and the system. Also, our designs are parameterized so that they can adapt to various hardware resource constraints.

4.2 Dot Product

We first consider the dot product of two vectors, u and v. The operation is formulated as

u · v = \sum_{i=0}^{n-1} u_i v_i

4.2.1 Operation Analysis

For dot product, the minimum number of external I/O operations is 2n + 1. There are n floating-point multiplications and n floating-point additions. We can easily identify k and k′ as the design parameters, because they affect T_comp. k and k′ refer to the number of floating-point multiplications and additions performed in one clock cycle, as discussed in Section 3.3.2. Since there is no data reuse in this operation, no internal storage is needed by the design. To reduce the latency, we overlap T_comp and T_I/O. Furthermore, the floating-point multiplications and additions are overlapped. Therefore, the lower bound on the latency of the operation is derived as

T ≥ max(T_comp, T_I/O) ≥ max(max(n/k, n/k′), (2n+1) b_w / bw_d)

4.2.2 Architecture

To achieve the lower bound, k = k′ and k ≈ bw_d / (2 b_w). Thus we propose an architecture that consists of equal numbers of floating-point adders and multipliers. The floating-point adders and multipliers are pipelined so that additions, multiplications and external I/O operations are overlapped.

During a clock cycle, each multiplier reads one element from each of the vectors and multiplies these two floating-point numbers. An adder tree then sums up the outputs of the multipliers. When k < n, we need an additional adder to accumulate the outputs of the adder tree. Thus, we use k adders in total, including the k−1 adders in the adder tree and the additional adder. If we ignore the clock cycles used to fill the pipelines of the multipliers and the adders, the effective latency of the architecture is T = n/k.

[Figure 4.1: Architecture for dot product (k = 4) — four multipliers read the pairs (u_i, v_i), a tree of adders sums their outputs, and a reduction circuit accumulates the results of the tree.]

The adder tree in the architecture yields one output each clock cycle. Thus, the task of the additional adder is to reduce sets of sequentially delivered floating-point values. However, the pipelining in the floating-point adder can cause read-after-write hazards during the reduction. Therefore, we replace the additional adder outside the adder tree with a reduction circuit, discussed in Chapter 5. The architecture for dot product is shown in Figure 4.1.

Suppose T_red(s) is the time needed to reduce s inputs. For our proposed reduction circuits, T_red(s) = Θ(s). Therefore, the effective latency of the design is

T = n/k + T_red(n/k) = Θ(n/k)    (4.1)

The design performs 2k + 1 external I/O operations during each clock cycle, that is, bw_d = (2k+1) b_w. The number of I/O pins used is 8 b_w (2k+1) = 64(2k+1).
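The dataflow of Figure 4.1 can be mimicked in software: k products are formed per "cycle", a (k−1)-adder binary tree combines them into one partial sum, and a final accumulator plays the role of the reduction circuit. The sketch below is a functional model only (no pipelining is modeled), with function names of our choosing; it assumes n is a multiple of k, which zero padding would otherwise ensure.

```c
#include <stdio.h>

#define K 4   /* number of multipliers, k */

/* One "clock cycle" of the datapath: K multiplications followed by
 * the (K-1)-adder binary tree that yields a single partial sum.    */
static double tree_cycle(const double *u, const double *v) {
    double prod[K];
    for (int i = 0; i < K; i++) prod[i] = u[i] * v[i];   /* multipliers */
    for (int stride = 1; stride < K; stride *= 2)        /* adder tree  */
        for (int i = 0; i + stride < K; i += 2 * stride)
            prod[i] += prod[i + stride];
    return prod[0];
}

/* Functional model of the dot-product architecture.                */
double dot_product(const double *u, const double *v, int n) {
    double acc = 0.0;                  /* role of the reduction circuit */
    for (int i = 0; i < n; i += K)
        acc += tree_cycle(u + i, v + i);
    return acc;
}

int main(void) {
    double u[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    double v[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    printf("u.v = %f\n", dot_product(u, v, 8));   /* prints 36 */
    return 0;
}
```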
35 4.3 Dense Matrix-Vector Multiplication We next study MXV (Matrix-Vector Multiplication) between an n× n matrix A and a vector x. The result vector is y. The operation is formulated as y = Ax, y i = n−1 j=0 a ij x j (i=0, 1,...,n− 1) 4.3.1 Operation Analysis In MXV , the total number of floating-point operations is 2n 2 . As in the case of dot product, we identify k and k as the design parameters. In this operation, each element in x is reused n times. If we do not store any element of x in the architecture, the total number of external I/O operations is 2n 2 .To reduce partially the I/O costs, we use block MXV algorithm. Thus, the block size, b, becomes an important parameter. Without loss of generality, we assume n is a multiple of b. In the blocked version, A is partitioned into n b blocks of columns, with each block of size n× b. Similarly, x is partitioned into n b blocks of size b× 1. We denote the blocks using A g and x g , respectively (g =0, 1,..., n b − 1). For each block MXV A g × x g , A g and x g need to be read from and the results need to be written to the external memory. Therefore, the total number of external I/O operations with block size b equals n b × (nb + b + n)= n 2 + n 2 b + n. We set k = k to reduce T comp . The lower bound on the latency of the operation is: T ≥ max(T comp ,T I/O )≥ max( n 2 k , (n 2 + n 2 b + n)b w bw d ) The above equation shows a design trade-off between the block size and the I/O time. The smaller the block size, the more external I/O operations are required, and 36 vice versa. Another trade-off exists between the storage size and the data access time. If we store only one copy of x g in the architecture, reading k distinct values from the internal storage takes multiple clock cycles. On the other hand, if x g is duplicated at each multiplier, all multipliers can access the needed data in one clock cycle. In this case, the storage size is increased to kbb w . Adder 0 Adder 1 Adder 2 Multiplier2 Storage Multiplier3 Storage Multiplier0 Storage Multiplier1 Storage Reduction Circuit x A i0j0 j0 A i1j1 j1 A i2j2 j2 A i3j3 j3 y Figure 4.2: Architecture for MXV (k =4) 4.3.2 Architecture Based on the design trade-offs, we propose an architecture for MXV , as shown in Figure 4.2. This architecture is almost the same as the architecture for dot product, except that each multiplier is attached to a local storage. Each local storage contains a copy of x g . During the computation, the multiplier reads one element from A g in each clock cycle, then uses the column index to find the corresponding x g element in its local storage, and finally multiplies these two numbers. Initially, the storage is loaded with x 0 . To reduce the latency, the initialization of block x g is overlapped with the computations of A g−1 × x g−1 (g =1,..., n b − 1). In the architecture, k words are read out of the local storage in each clock cycle. As k can be large, the local storage is 37 realized using on-chip memory because the available BRAM bandwidth is much larger than the available SRAM bandwidth. Whenb>k, one row in A g is further partitioned into b k sub-rows, which are fed into the architecture in order. A reduction circuit is used in the architecture to accumu- late the sums of the sub-rows. To generate the final y, we also need to accumulate the results generated by A 0 × x 0 , A 1 × x 1 , ..., A n b −1 × x n b −1 . For each y element, such intermediate results are generated every n clock cycles. 
Since the typical value of α is less than 20 and is usually much smaller than n, an additional floating-point adder suffices for such accumulation. The effective latency of our design is: T=(b + k)+( n×b k + k)× ( n b − 1) + n×b k + T red ( b k ) = n 2 k + nk b + b + T red ( b k )=Θ( n 2 k ) (4.2) The design needs to perform k + k b external floating-point I/O operations during each clock cycle. Thus, bw d =(k + k b )b w . The total size of BRAM memory needed by the design is m b = kbb w , and the required bandwidth is bw b = kb w . We use b wj to denote the number of bits in a column index. Thus, the number of I/O pins used is k(8b w + b wj )+2b w : each multiplier reads one element of A and its column index in each clock cycle; the first multiplier reads one element of x per clock cycle for initialization. The required number of I/O pins can be reduced by using a counter to keep track of the column index. In the above architecture, A is stored in row-major order. If A is stored in column- major order, another architecture can be used. In the second architecture, there are k adders. Each adder is connected to a multiplier and is attached to a local storage, which stores the intermediate results of some elements of vector y. In particular, the 38 pth adder (p =0, 1,...,k − 1) stores the intermediate results of elements p, k + p,..., ( n k − 1)k + p of vector y. During each clock cycle, the k multipliers multiply k distinct elements of A with one element of x. The adders then accumulate the results of the multiplications with the intermediate results of y stored in the local storage. In this case, the intermediate result of y i (i =0, 1,...,n− 1) is updated only every n clock cycles. Therefore, as long as n is larger than α, data hazards do not occur. For block MXV , matrix A is partitioned into b× n blocks, and y is partitioned into b× 1 blocks. Each A block is read in column-major order and multiplied with vector x to generate the corresponding y block. The latency and the size of BRAM memory required by these two architectures are the same. However, the tree-based architecture can be extended for SpMXV easily, as discussed in the next section. 4.4 Sparse Matrix-Vector Multiplication In SpMXV , matrix A is a sparse matrix with n z nonzero elements. For each element a ij , the following computation is performed: y i = y i + a ij × x j , if a ij =0 4.4.1 CRS Format In our design, the sparse matrix is stored in Compressed Row Storage (CRS) format [6], one of the most commonly used storage formats for sparse matrices. In CRS format, the nonzeros of the matrix rows are stored in contiguous memory locations. There are three vectors: val for floating-point nonzero elements; col for the column 39 indices of the nonzeros; and ptr for storing the locations in the val vector that start a row [6]. As an example, consider sparse matrix A shown below: A = ⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 10 0 0 −2 30 9 0 07 0 0 00 8 4 ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ The CRS format for this matrix is specified by the vectors given below: val 10 -2 397 84 col 0 3 021 23 ptr 0 245 By subtracting consecutive entries in the ptr vector, a new vector len can be gen- erated. len vector stores the size of each row, that is, the number of nonzeros in each row. Note that the last entry in the len vector is calculated using n z − ptr(n− 1).For our example, the len vector is: len 2 212 4.4.2 Operation Analysis As each nonzero element of A needs two floating-point operations, the total number of floating-point operations to be performed is 2n z . 
When the computations are complete, n output operations are needed to write y back to the external memory. Thus the total 40 number of external I/O operations is n z +2n. The lower bound on the latency of this operation is: T ≥ max(T comp ,T I/O )≥ max( 2n z k , (n z +2n)b w bw d ) 4.4.3 Architecture The basic architecture for SpMXV is the same as that for MXV , which is shown in Figure 4.2. However, for SpMXV , it is possible that len(i) is not a multiple of k (i=0, 1,...,n− 1). In this case, the last sub-row of row i has less than k elements, and has to be expanded to k through zero padding. However, zero padding adversely affects the performance of SpMXV , because it results in idle cycles of floating-point multipliers and adders (with zeros as operands). In addition, sending zeros into the architecture wastes the available DRAM memory bandwidth. We thus propose a merging method to reduce zero padding. In this method, if the control logic detects that the last sub-row of the ith row is to be processed, it checks if this sub-row can be merged with a sub-row of the (i+1)th row. Merging is allowed if both rows have a sub-row which contains no more than k/2 nonzeros, that is, if len(i) mod k ≤ k/2 and len(i+1) mod k ≤ k/2. In the merging, the last sub-row of the ith row is read in the left subtree, and one sub-row of the (i+1)th row is read in the right subtree. The improved architecture for SpMXV is shown in Figure 4.3. In the figure, the sums of the two sub-rows are called sum 1 and sum 2 , respectively. For simplicity of the logic, sum 1 goes through the root node and is added to 0. To guarantee that sum 1 and sum 2 are yielded at the same clock cycle, delay is inserted between the root of the right subtree and sum 2 . The delay is the same as the latency of the floating-point 41 Right Subtree Left Subtree Ad d e r Ad d e r Ad d e r sum 1 sum 2 De la y ... ... C ontro l Un it j len(i) R eduction Circ u it y Figure 4.3: Improved architecture for SpMXV adder. Both sum 1 and sum 2 are given to the reduction circuit, which then yields the final result of y elements. The control unit yields the row index of the each y element. This design requires the same DRAM memory bandwidth, BRAM memory and the number of I/O pins as the design for MXV . When the size of the input matrix and vector x is large, the block SpMXV algorithm is employed. Note that block size b results in trade-off between the computation time and the initialization time. As b increases, the initialization time for x g (g =0,...,n/b− 1) increases. However, a smaller b may result in fewer nonzeros in A g . Thus the computation time of A g × x g may increase because of the increased zero padding. On the other hand, the number of zero sub-blocks (blocks which do not contain any nonzero element) increases as b decreases. With a small b, it is possible that the computations on the FPGA must be suspended to wait for the initialization of x g to complete. 42 4.5 Experimental Results In this section, we first present the details in our experimental setup and the design flow on Cray XD1, the reconfigurable computing system we used in our experiments. These information is common to all our designs, and will not be repeated in the following chapters. We next analyze the design trade-offs by tuning the values of the design parameters. We then compare the performance of our design for SpMXV with a design on general- purpose processors. The performance of the MXV design on one compute blade of XD1 is also presented. 
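Before turning to the experimental setup, the CRS-based computation of Sections 4.4.1-4.4.3 can be sketched in software. The code below is a functional model only, not the VHDL design: it computes y = Ax from the val/col/ptr arrays of the example matrix in Section 4.4.1, derives each row length, and counts how many padded zeros a width-k datapath would consume without the merging rule; merging (allowed when two adjacent rows both leave a last sub-row of at most k/2 nonzeros) would lower that count. One deliberate deviation is hedged here: ptr carries an extra final entry equal to n_z, a common variant of the format described in the text.

```c
#include <stdio.h>

/* Example sparse matrix of Section 4.4.1 in CRS format.            */
static double val[] = {10, -2, 3, 9, 7, 8, 4};
static int    col[] = { 0,  3, 0, 2, 1, 2, 3};
static int    ptr[] = { 0,  2, 4, 5, 7};   /* extra final entry = n_z */
enum { N = 4, K = 4 };                     /* K = datapath width k    */

int main(void) {
    double x[N] = {1, 1, 1, 1}, y[N];
    int padded = 0;

    for (int i = 0; i < N; i++) {
        int len = ptr[i + 1] - ptr[i];             /* len(i)           */
        y[i] = 0.0;
        for (int j = ptr[i]; j < ptr[i + 1]; j++)  /* y_i += a_ij x_j  */
            y[i] += val[j] * x[col[j]];

        /* A width-K datapath pads the last sub-row of row i with
         * zeros unless len(i) is a multiple of K.                    */
        if (len % K)
            padded += K - (len % K);
    }

    for (int i = 0; i < N; i++)
        printf("y[%d] = %g\n", i, y[i]);
    printf("padded zeros without merging: %d\n", padded);
    return 0;
}
```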
4.5.1 Experimental Setup In our experiments, we used the Xilinx ISE 7.2i [74] and Mentor Graphics ModelSim 5.7 [48] development tools. Our target device is the Xilinx Virtex-II Pro XC2VP50 [74], which contains 23616 slices, about 4 Mb of on-chip memory and 852 I/O pins. Although it is not a big device, it serves the purpose of trade-off analysis. In addition, it is used in XD1. The building blocks used in our designs include floating-point adder, floating-point multiplier, floating-point divider, and reduction circuit. Table 4.1 gives the character- istics of these blocks. Our floating-point cores comply with the IEEE-754 double- precision format [39]. Their implementation details can be found in [33]. The reduc- tion circuit used in our designs is the SSA design presented in Chapter 5. It employs one floating-point adder only. To measure the floating-point performance of our designs, we use MFLOPS (one million floating-point operations per second) and GFLOPS (one billion floating-point operations per second), because they are popular metrics for floating-point throughput. 43 Table 4.1: 64-bit floating-point cores and reduction circuit Adder Multiplier Divider Reduction Circuit No. of Pipeline Stages 14 11 58 - Area (slices) 892 835 3213 2859 Clock Speed(MHz) 170 170 140 170 4.5.2 Design Flow on XD1 As discussed in Section 3.1, each blade in XD1 contains 2.2 GHz AMD Opteron pro- cessors and Xilinx Virtex-II Pro XC2VP50 FPGAs. To utilize the FPGA accelerator, two programs are needed. One is a C program which runs on the processors, and the other is a VHDL program that describes an FPGA-based design. Before being loaded onto XD1, an FPGA-based design must be modified as fol- lows: 1) SRAM memory controllers (SRAM cores) must be inserted if the design needs to access the SRAM banks; 2) An RT (RapidArray Transport) core must be in- serted for the design to communicate with the processor; 3) An application-specific component Rt Client is needed to control the communications between the FPGA, the processor and the SRAM memory. The resulting structure of an FPGA-based design on XD1 is shown in Figure 4.4. RT Core RapidArray Processor RT Client User Design SRAM Cores SRAM Bank SRAM Bank SRAM Bank SRAM Bank Figure 4.4: Block diagram of an FPGA-based design on XD1 After modifying the design, the following steps are required to load the design onto XD1 [21]: 1. Write the C program and build the executable for it. 44 2. Synthesize, place & route the FPGA-based design. At this step, the design can be debugged using simulators. 3. Generate the binary file for the FPGA-based design. The file is then converted to Cray-specific FPGA logic file using command-line tools on XD1. 4. Load the logic file onto the FPGA. Write a job script for the design, and submit the job to the system. The C program is in charge of file operations and memory accesses. The SRAM and BRAM memory on the FPGA are mapped into the memory space of the processor so that the processor can write to them directly. The results generated by the FPGA- based design are written back to the DRAM memory directly. Besides the shared memory space, the processors and the FPGAs can also communicate through user- defined registers. These registers are located on the FPGAs, and can be read/written by the processors. The C program also handles the network communication using the MPI standard [49]. 
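The interaction pattern just described (the processor writes operands into FPGA-visible memory, signals the design, and polls status registers until the FPGA writes the results back) can be sketched generically as below. This is emphatically not the Cray XD1 API: the device path, register offsets, and overall layout are hypothetical placeholders standing in for whatever the target system actually provides, and only standard POSIX calls are used.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical layout: the FPGA's status registers are exposed
 * through a memory-mapped device file; all offsets are placeholders. */
#define FPGA_DEV        "/dev/fpga0"      /* hypothetical device file  */
#define REG_N           0x0000            /* problem size n            */
#define REG_INIT_DONE   0x0008            /* x written to local storage*/
#define REG_COMP_DONE   0x0010            /* y written back to DRAM    */
#define MAP_LEN         (1 << 20)

int main(void) {
    int fd = open(FPGA_DEV, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint64_t *regs =
        mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    uint64_t n = 1024;
    regs[REG_N / 8] = n;           /* tell the design the problem size */
    /* ... copy A to the SRAM banks and x to the local storage here ... */
    regs[REG_INIT_DONE / 8] = 1;   /* signal that initialization ended  */

    while (regs[REG_COMP_DONE / 8] == 0)
        ;                          /* poll until the FPGA finishes      */

    /* ... y is now in DRAM and can be used by the rest of the program. */
    munmap((void *)regs, MAP_LEN);
    close(fd);
    return 0;
}
```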
4.5.3 Performance Analysis To analyze the design trade-offs, we first determine the range for each design parameter based on the hardware resource constraints. We then vary the values of the parameters, and measure the area and clock speed of the designs after place and route. Next, the latency and the required memory bandwidth are measured. For dot product and MXV , the problem size n is set as 2048, so that the source matrix cannot be stored in the BRAM memory of the FPGA. For SpMXV , we use sparse matrices included in the UF Sparse Matrix Collection [68]. These matrices are from a variety of engineering and science applications. All of the matrices are square, and most are neither symmetric 45 nor diagonal. Table 4.2 gives the characteristics of each matrix used in our experi- ments. The matrices are roughly ordered by the regularity of their structure, with the more regular ones at the top. Table 4.2: Sparse matrices used in our experiments Name Application Area n n z max{len(i)} 1 raefsky3 Fluid/structure 21200 1488768 80 2 bcsstk35 Automobile frame 30237 1450163 222 3 rdist1 Chemical processes 4134 94408 81 4 memplus Circuit simulation 17758 126150 353 5 gemat11 Power flow 4929 33185 27 6 lns3937 Fluid flow 3937 25407 11 7 sherman5 Oil reservoir 3312 20793 21 8 mcfe Astrophysics 765 24382 81 9 jpwh991 Circuit physics 991 6027 16 10 bp1600 Simplex method 822 4841 304 11 str600 Simplex method 363 3279 33 For dot product, the only parameter is k, the number of floating-point multiplica- tions that can be performed each clock cycle. Since each multiplication needs two 64- bit numbers, the range of k is determined by the number of the available I/O pins. For the target device, k ≤ 6. In the design, the control logic occupies less than 5% of the total area. The clock speed of the design is limited by that of the floating-point cores, and remains 170 MHz (Table 4.1). As shown in Figure 4.5, when k increases from 1 to 6, the area of the design increases linearly. The latency of the design decreases pro- portionally as k increases. When the clock speed remains fixed, the required external memory bandwidth also increases linearly with k. For MXV , we need to tune both k and b. The range of k is determined by the total size of the available area, the area of the floating-point cores, and the area of the reduction circuit. For the target device, k ≤ 10. The area of the design increases 46 1 2 3 4 5 6 0 0.5 1 1.5 2 x 10 4 Area (slices) Area 1 2 3 4 5 6 0 10 20 30 40 Latency (us) k Latency Figure 4.5: Area and latency for dot product (n = 2048) linearly with k, similar as in Figure 4.5. The clock speed of the design decreases with k. However, when k increases from 1 to 10, the degradation in the clock speed is less than 10%. The range of b is determined by the size of the on-chip memory of the FPGA. For simplicity, we set the storage constraint as 1 Mbit, and vary b from 256 to 2048 words. Figure 4.6(a) shows that the latency is primarily determined by k. When k is fixed, the latency varies little as b increases. Each point in Figure 4.6(a) has a corresponding point in Figure 4.6(b), which shows that the required memory bandwidth increases linearly with k. From the figures, we see that a large block size does not decrease the latency or the required external memory bandwidth. On the other hand, the total size of on-chip memory needed by the design increases with k as well as b, according to Section 4.3. Thus, when determining the value of b, a small block size is preferred. 
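The trade-off curves of Figure 4.6 follow directly from the closed-form expressions of Section 4.3: the latency of Equation (4.2), the required DRAM bandwidth bw_d = (k + k/b) b_w, and the BRAM requirement m_b = k b b_w. A small exploration loop like the one below (our own script, kept in C for consistency; the clock rate and the 1 Mbit budget are the assumptions used in the text) reproduces the qualitative behavior: latency is set almost entirely by k, while b mainly controls the on-chip storage.

```c
#include <stdio.h>

int main(void) {
    const double n   = 2048.0;     /* problem size                    */
    const double b_w = 8.0;        /* word width in bytes             */
    const double f   = 170e6;      /* assumed clock rate (Hz)         */
    const double bram_budget = 1.0e6 / 8.0;  /* ~1 Mbit, in bytes     */

    printf("  k     b  latency(ms)  bw_d(GB/s)   m_b(KB)\n");
    for (int k = 1; k <= 10; k++) {
        for (int b = 256; b <= 2048; b *= 2) {
            /* Effective latency, Eq. (4.2); T_red(b/k) ~ b/k cycles. */
            double cycles = n*n/k + n*(double)k/b + b + (double)b/k;
            double bw_d   = (k + (double)k/b) * b_w * f;  /* bytes/s  */
            double m_b    = (double)k * b * b_w;          /* bytes    */
            if (m_b > bram_budget) continue;   /* violates storage    */
            printf("%3d %5d %12.2f %11.2f %9.1f\n",
                   k, b, 1e3 * cycles / f, bw_d / 1e9, m_b / 1e3);
        }
    }
    return 0;
}
```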
Most of the trade-off analysis for SpMXV is the same as that of MXV . However, in SpMXV , the effect of b depends on the sparse matrix used. In our experiments, we chose matrix memplus as an example. This matrix has an irregular structure so that different values of b result in different numbers of nonzero elements in each block. As 47 500 1000 1500 2000 0 2 4 6 8 10 0 5 10 15 20 25 b k Latency (ms) (a) Latency vs. k, b 500 1000 1500 2000 0 2 4 6 8 10 0 30 60 90 120 b k Bandwidth (Gb/s) (b) Memory bandwidth vs. k, b Figure 4.6: Performance analysis for MXV (n = 2048) 48 shown in Figure 4.7, when b increases from 2000 to 18000, the time for initializing the internal storage increases, and the time for performing computations decreases. The total time for SpMXV does not vary much when b is larger than 4000. The optimal performance is achieved when the block size is larger than the matrix size, that is, when no block SpMXV is performed. 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 x 10 4 0 1 2 3 4 5 6 7 x 10 4 Block Size No. of Cycles total time(k=4) comp time(k=4) init time(k=4) total time(k=8) comp time(k=8) init time(k=8) Figure 4.7: Total time/Computation time/Initialization time vs. block size for sparse matrix memplus 4.5.4 Implementations on XD1 We implemented the design for MXV on one compute blade of XD1. Before the computation starts, matrix A is read from the DRAM memory of the processor and distributed to the four SRAM banks attached to the FPGA. Next, the processor initial- izes the local storage of the PEs using vector x. During the computations, the design on the FPGA reads one word from each SRAM bank in one clock cycle. The processor and the FPGA communicate through several status registers about the problem size n and completion of initialization and computation. 49 Since the size of SRAM memory attached to an FPGA is 16 MB, n can at most be √ 2× 1024. In our experiments, we set n = 1024. Based on the available bandwidth between the FPGA and the SRAM memory, we set k =4. The additional components including the RT core, memory controllers and control logic for status registers occupy approximately 3000 slices. The entire design occupies 13772 slices, and runs at 164 MHz. As the design reads one 64-bit word and 8-bit parity code from each SRAM bank during every clock cycle, the achieved SRAM memory bandwidth is 5.9 GB/s. The total latency for performing MXV is 8.0× 10 −3 seconds. The computation time is 1.6× 10 −3 seconds, and the remaining time is spent in moving the data between the DRAM memory and the SRAM memory. The achieved DRAM memory bandwidth is 1.3 GB/s. For MXV , the minimum latency equals approximately n 2 bw bw d . Thus, the peak per- formance of any design for MXV is 2n 2 n 2 bw/bw d = 2bw d bw FLOPS. Under the used DRAM memory bandwidth, the peak performance is 325 MFLOPS. The sustained perfor- mance of our design is 262 MFLOPS, which is about 80% of the peak performance. If matrix A is initially stored in SRAM, our design would achieve 1.05 GFLOPS. 4.5.5 Performance Comparison We compare our design for SpMXV with a program on a cache-based system. The system consists of a dual 900 MHz Itanium 2 processor, an L1 cache of 16 KB, an L2 cache of 256 KB, and a peak memory bandwidth of 8.5 GB/s. Note that the dense vector x for each test matrix fits well in the L2 cache of the Itanium system. The system runs the Red Hat Linux Advanced Server 2.1 Operating System. For SpMXV on the Itanium system, we use SPARSITY [38]. 
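The performance figures quoted above follow from simple arithmetic, reproduced below so that the 325 MFLOPS peak and roughly 262 MFLOPS sustained numbers can be checked. The formula peak = 2 bw_d / b_w FLOPS is the one derived in the text; the rest is plain bookkeeping with the measured latency.

```c
#include <stdio.h>

int main(void) {
    const double n    = 1024.0;    /* problem size on one blade        */
    const double b_w  = 8.0;       /* bytes per double-precision word  */
    const double bw_d = 1.3e9;     /* achieved DRAM bandwidth (B/s)    */
    const double T    = 8.0e-3;    /* measured total latency (s)       */

    double flops     = 2.0 * n * n;        /* one MXV                  */
    double peak      = 2.0 * bw_d / b_w;   /* I/O-limited peak         */
    double sustained = flops / T;          /* end-to-end performance   */

    printf("peak performance   = %.0f MFLOPS\n", peak / 1e6);  /* ~325 */
    printf("sustained          = %.0f MFLOPS (%.0f%% of peak)\n",
           sustained / 1e6, 100.0 * sustained / peak);         /* ~262 */
    return 0;
}
```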
As far as we know, this system has 50 reported the highest performance for floating-point SpMXV on general-purpose pro- cessors. The programs were compiled using the Intel C Compiler 7.0 for Linux, with -fast (highest optimization, static linking, interprocedural optimization) and -Ob2 (al- lows the compiler to inline any function that it sees fit) optimizations. The compiled code used only one of the two processors in the system. 1 2 3 4 5 6 7 8 9 10 11 0 10 20 30 Matrix Number Speedup Figure 4.8: Speedup of our design for SpMXV (k=4) over the Itanium 2 system We compare the Itanium system with the implementation of the basic architec- ture for SpMXV , with k =4. The design runs at 160 MHz, and requires a memory bandwidth of 8 GB/s. This memory bandwidth is much less than the SRAM memory bandwidth in XD1. Still, we made a conservative performance estimation by deducting 30% off the performance of our design for the control overhead of the SRAM devices [67]. The MFLOPS speedup of our design over the Itanium 2 system is shown in Fig- ure 4.8. The speedup increases as the matrix number increases. Our design performs best over the general-purpose processors for matrices with the most irregular structure, such as matrix 10 and 11. This is because the performance of our design depends on the number of nonzeros in each row. Even if the nonzeros are distributed randomly, the performance of our design is not affected. On the other hand, the random distribution of the nonzeros causes large numbers of cache misses in cache-based systems. 51 Chapter 5 FPGA-Based Reduction Circuit In the designs in Chapter 4, series of floating-point values must be accumulated. Such accumulation also exists in many other scientific applications. Another example is the Lennard-Jones force calculation in molecular dynamics [61]. At the end of the calculation, values that are generated serially must be accumulated. Accumulation is just one type of “reduction,” which refers to reducing a series of sequentially delivered values to one value using one type of binary operator. Because the FPGA-based floating-point operators are usually deeply pipelined to achieve high clock speed, using them for data reduction may cause data hazards. On the other hand, sending a stream of floating-point values back to the general-purpose processors for reduction consumes memory bandwidth and may even negate the performance gains derived from hardware acceleration. In this chapter, we study high-performance and area-efficient FPGA-based reduc- tion circuits. Using accumulation as a prototypical example, we first analyze the design trade-offs among the number of floating-point adders, the buffer size and the latency of a design. We then study two basic methods for implementing reductions using deeply pipelined adders. In the tree-traversal method, the inputs are considered as leaves of a 52 binary tree, and the reduction of the inputs requires a traversal of the tree. In the strid- ing method, the inputs are first reduced to α values, where α is the number of pipeline stages in the adder. The α values are then further reduced to the single output value. Based on the trade-off analysis, two designs for reduction circuits are proposed. Both designs are able to reduce multiple input sets of arbitrary sizes without stalling the pipeline. In addition, the numbers of adders in the designs are fixed, and the buffer sizes are independent of the number of input sets. 
The first design uses the tree-traversal method, and employs two adders and a buffer of size Θ(lg(n)). The second is based on the striding method. It uses only one adder and needs two buffers of size Θ(α 2 ). The latencies of the designs depend on both n and α. When n is much larger than α, the latencies are Θ(n) clock cycles. 5.1 The Reduction Problem and Related Work 5.1.1 The Reduction Problem: An Intuition Using a pipelined adder to accumulate a series of sequentially delivered values may be problematic. Figure 5.1 shows a simple example of this failure for a pipelined adder with a latency of α =2. Suppose one dot product corresponds to the sequence (a, b, c, d), and the next dot product corresponds to the sequence (v,w,x,y,z), i.e., the sum a + b + c + d is the value of the first dot product and v + w + x + y + z is the value of the second dot product. In step 1, a + b is in the last adder stage, and c and d are in the input buffers. In step 2, c + d, which are the final two input values in the current calculation, enter the adder pipeline. Value v from the next dot product calculation is buffered. In step 3, w is also buffered. In step 4, we are still waiting to finish a+b+c+d. The interlock circuit prevents new values from entering 53 the adder until the existing calculation is finished, so x gets dropped if we do not throttle the input stream (stall the pipeline) or use a bigger input buffer. Pipeline stalls significantly reduce performance. In addition, a buffer whose size increases linearly with n is unacceptable because n could be very large. dc v v w v w x 12 34 ba ba dc ba dc ba dc two-stage adder (α a = 2) interlock Figure 5.1: Buffer overrun 5.1.2 The Reduction Problem: Formal Definition We define the reduction problem as follows: We are given p sets of inputs, with set i containing n i floating-point values (0≤ i≤ p−1). The values in the sets are delivered sequentially as [(s 0,0 ,··· ,s 0,n 0 −1 ),··· , (s p−1,0 ,··· ,s p−1,n p−1 −1 )]. The problem is to reduce n i values in set i into a single value such that r i = n i−1 j=0 s i,j , i=0, 1,...p− 1. The adders used are pipelined with α pipeline stages, andα> 1. Our goal is to design area-efficient and high-performance reduction circuits that satisfy the following requirements: 1. Reduce p sets of inputs correctly without causing any data hazard. 2. Accept one input during each clock cycle without stalling. 54 3. The number of adders employed is fixed, independent of p and n i (0≤ i≤ p−1). 4. Low requirement on buffer size: buffer size of Θ(n i ) (0≤ i≤ p− 1)or Θ(p) is unacceptable. 5.1.3 Previous Work Research on reduction circuits began nearly two decades ago, when pipelined comput- ers and vector computers first became available. Kogge proposed a method which uses lg(n) adders to reduce multiple input sets, where n is the maximum size of the input sets [42]. In this method, the reduction is regarded as an adder tree, and each adder is in charge of computations on one level of the tree. This method may be suitable for relatively small fixed-point FPGA-based implementations. However, it is infeasi- ble for floating-point FPGA-based implementations due to the large size and design complexity of the floating-point adders. In [54], the authors proposed two reduction methods. Each method uses one adder and a buffer of fixed size. In these methods, n inputs are first reduced to α values, which are then reduced to a single value when n≥ α. These methods are well suited for reducing one input set. 
However, for multiple input sets, the methods assume all the sets are available in the external memory before the computation starts. If the in- put sets are generated sequentially, a buffer is needed whose size increases with the number of sets. This is because the adder cannot accept the inputs of the next set when it is reducing the α values of the previous set. Incoming inputs during this period of time must be stored temporarily in a buffer. The large buffer requirement makes FPGA-based implementations of these methods infeasible for large values of n. 55 Some researchers have implemented reduction using specially designed operators. In [70], the authors used variable precision floating-point adders and delayed normal- ization in order to reduce the number of pipeline stages. Our designs work with generic pipelined operators and do not require special arithmetic operators. Moreover, the ar- chitectures and algorithms of our designs are independent of the implementations of the operators. Several reduction circuits have been proposed that are suitable for implementations on FPGAs. When the size of each input set is a power of 2, the design in [75] performs reduction using one floating-point adder. This design uses lg(n) buffers of fixed size, where n is the upper bound on the size of an input set. However, this design does not work with arbitrary number of inputs. In [50], a partial summation unit using a single adder is used. However, this design requires a fixed-sized output binary tree ac- cumulator and Θ(n) buffer space. A similar design was employed in [51]. This design reduces the buffer requirement to Θ(α), but still requires the large output accumulator. Obviously, none of the work described above meets the requirements listed in Section 5.1.2. In our work, we propose two designs which are able to reduce multiple input sets of arbitrary sizes. The numbers of adders in the designs are fixed and the required buffer sizes are moderate. Therefore, the designs have small areas and can be run at high clock speeds. We also provide a thorough discussion on the design trade-offs and the categorization of the intuitive ideas for reduction circuits. 5.1.4 Reduction Circuit Design Analysis The general architecture of reduction circuits is shown in Figure 5.2. The datapath con- sists of the adders and the buffers. The control logic schedules the additions among the 56 adders and controls the reads and writes of the buffers. The input values enter the ar- chitecture sequentially, either from an external memory or from another FPGA-based design. The output of the circuit is the final result of the reduction. data path + control logic buffers adders output input Figure 5.2: General architecture of reduction circuits Design trade-offs exist among the number of adders, the buffer size and the la- tency. When a reduction circuit contains multiple floating-point adders, it is possible that these adders perform computations concurrently. Thus, the latency of the design can be greatly reduced. However, the area of the design also increases with the number of adders. Another trade-off exists between the number of adders and the buffer size of the design. When the design contains enough adders, each output of an adder can be taken immediately by another adder if the output is not the final result of the reduction. In this case, the buffer size required by the design is minimized, but the size of the design can be large. 
In other words, the pipeline stages in the adders serve as the storage for the intermediate results. On the other hand, if the design contains a large buffer to store the intermediate results, the number of the adders, and thus the area of the design, can be reduced. There is also a trade-off between the area of the design and the complexity of the control logic. As the number of adders increases, the workload of each adder decreases 57 and the scheduling of the adders becomes simpler. On the other hand, when the design employs only a few adders, each adder has a higher workload, and the intermediate results are stored in the buffers. In this case, the scheduling algorithm is more complex because it needs to avoid data hazards as well as buffer access conflicts. 5.2 Tree-Traversal Method 5.2.1 Intuitive Idea In the tree-traversal methods, the accumulation of a series of input values is represented by a binary adder tree. In this case, accumulating the inputs is equivalent to performing a breadth traversal of the tree. A simple solution is to use n− 1 adders to form a lg(n)-stage binary adder tree. The latency of this solution is n + αlg(n) cycles. However, the large area of floating-point adders makes such a solution undesirable or, for large n, infeasible. In addition, since the inputs arrive sequentially, such an architecture makes uneconomical use of the adders. For example, the first adder at level 0 of the tree operates on the first two inputs to arrive. Before it can operate on any more inputs, it must wait for the next n− 2 inputs to arrive and enter the other adders in level 0. Such inefficiency exists at all levels of the tree. The partially compacted binary tree (PCBT) design in Figure 5.3 can reduce this inefficiency. There is only one adder and a buffer with two entries at each tree level. When there are two values in the buffer, they are read from the buffer and passed to the adder. Thus, the design haslg(n) adders, a buffer size of 2lg(n), and a latency of n + αlg(n) cycles. However, the utilization of the adders in this design is still very low. We know that the adder at level 0 takes new inputs every other clock cycle. The adder at level 1 must wait for two outputs of the adder at level 0, so this adder reads 58 new inputs only once every four cycles. In particular, the adder at level i reads new inputs only once every 2 i+1 cycles. Additionally, this design requireslg(n) adders, which still may not be feasible for large values of n. + + • • • + input output level buffers ⎡ lg n⎤ -1 1 0 • • • Figure 5.3: Partially compacted bi- nary tree (PCBT) architecture control logic output + + • • • input level buffers ⎡ lg n⎤ -1 1 0 • • • Figure 5.4: Fully compacted binary tree (FCBT) architecture 5.2.2 Proposed Design: Fully Compacted Binary Tree 5.2.2.1 Architecture and Schedule We propose a fully compacted binary tree (FCBT) design that can reduce multiple input sets using only two floating-point adders. It works for sets of arbitrary sizes, n> 1, up to a design-specific maximum (n ≤ n max ). The architecture of the design is shown in Figure 5.4. The first adder performs reductions at level 0, and reads from buffer 0 during each clock cycle. The second adder is shared by levels 1, 2, ..., lg (n)− 1, and reads from or writes to different buffers in different clock cycles. When n is not a power of 2, it is possible that some additions only have one input value and must sbe padded with a zero value. 
We call such additions “singleton.” In our design, only the last addition at a given tree level can be singleton. For both adders in the design, there are two modes of addition: normal mode, and singleton mode. If 59 buffer l contains 2 entries from the same set, the entries are read into an adder and a normal addition is performed. If buffer l only contains one entry from a set and singleton l =1, the entry is fed into an adder with a padded zero and a singleton addition is performed. To set the correct value for singleton l , we attach a label to each number that enters the buffers. The label shows how many of the inputs in the same set remain to be accumulated at that level. Figure 5.5 illustrates how the 5 inputs in a set are labeled at each level. The small boxes represent the inputs with their labels shown below them. The inputs labeled as 0 do not really exist; they are shown in the figure to form the virtual binary tree on which the FCBT design is based. For each singleton add, an input is added with an input labeled as 0. We use t u to denote the label for input u.We know that at level l, n 2 l+1 additions are performed. Thus, if t u ≥ 2 l+1 , u cannot be the last input of the set at level l. There must be another input, v, following u. Thus, singleton l =0, and u is then written into buffer l. When v arrives, u and v are then read as operands of the second adder. On the other hand, if t u < 2 l+1 , item u must be the last input of the set at level l. If no other input of the set is waiting, u requires a singleton add, and singleton l =1. However, if another input of the set is waiting for u, the addition of the two inputs yields the final result. 5 5 1 5 3 5 4 3 2 10 10 0 0 label level 3 1 0 label label label 2 Figure 5.5: Labels for FCBT inputs (n=5) 60 The scheduling algorithm for the FCBT design is shown in Figure 5.6. For conve- nience, we define the last j bits of expression v using the notation v j . We also use op 1 and op 2 to denote the outputs of the first adder and the second adder, respectively. We use out n =lg (n) to represent the adder level at which the final result is found. According to the algorithm, buffer 0 is read during every clock cycle. Buffer l (l=1,...,lg(n)− 1) is read when C =2 l −1+ a2 l for some integer a, where C is the value of a clock cycle counter inside the controller. Counter C islg(n)-bit wide. Buffer l+1 (l=0,...,lg(n)−2) is written to when C−α=2 l −1+b2 l for some integer b. {counter} if the first adder is first used then C =0 else C = C +1 end if {buffers} write input port to buffer 0 if out n ≤ 1 then write op 1 to external memory else write op 1 to buffer 1 end if for l =1tolg(n)− 2 do if (C− α) l =2 l − 1 then if out n ≤ l+1 then write op 2 to external memory else write op 2 to buffer l+1 end if end if end for {adders} if singleton 0 =0 then read 2 values from buffer 0 into the first adder else perform a singleton add in the first adder end if for l =1 to lg (n)− 1 do if C l =2 l − 1 then if singleton l =0 then read 2 values from buffer l into the second adder else perform a singleton add in the second adder end if end if end for Figure 5.6: Scheduling algorithm for FCBT design 61 5.2.2.2 Proof of Correctness We now prove the correctness of this design. Theorem 1: The FCBT scheduling algorithm shown in Figure 5.6 guarantees a collision-free use of the adders. Proof: The statement is obviously true for the first adder since it always adds two values as soon as they are available. 
The second adder can hold the additions from level 1,...,lg(n). Since buffer l is read every 2 l clock cycles, the additions at level l occupy at most 1 2 l of the adder’s utilization. Because lim n→∞ n l=1 1 2 l =1, the utilization of the second adder can never exceed 1. Thus, the usage of the second adder is also collision free. According to the schedule in Figure 5.6, a buffer cannot grow without bound once it begins reading values from that buffer. Moreover, according to the schedule, buffer 0 contains at most 2 values before it is first read; buffer l (1≤ l≤lg(n)− 1) contains at most 3 values when it is first read. Thus, there cannot be buffer overflow. We have proved that the schedule of the FCBT design prevents collisions and that the buffers will not overflow. Since the binary operation, addition, is both associative and commutative, we need not worry about the ordering or grouping of the operations. According to the scheduling algorithm, each input and intermediate result are added once and only once. Thus, our FCBT design properly reduces an input set. Theorem 2: The FCBT design reduces n inputs in less than 3n+(α−1)lg(n)−3 cycles. When n is a power of 2, the FCBT reduces n inputs in n + α lg (n) cycles. Proof: It takes n cycles for all the inputs to arrive. The last value enters buffer 0 at cycle n, and then traverses the first floating-point adder. Then it enters buffer 1, the second floating-point adder, buffer 2, the second floating-point adder, ..., buffer 62 lg(n)− 1, and the second floating-point adder. Clearly it goes through the floating- point adder lg(n) times. At buffer l (l =1,...,lg(n)− 1), it waits for at most 2 l − 1 cycles before being read by the floating-point adder. Therefore, the latency of the algorithm is at most n + αlg (n) + lg(n)−1 l=1 (2 l − 1)≤ n + αlg (n)+2n− 2− lg (n)−1=3n+(α− 1)lg (n)− 3 cycles. When n is a power of 2, the waiting time at all buffers equals zero. Thus, the latency of the algorithm is n+α lg (n) cycles. The FCBT design also works for multiple input sets. If we can prove that neither the inputs nor the intermediate results from two input sets can ever be added, we will have shown that the FCBT design properly reduces multiple sets. We use the termi- nology item to refer to either a raw input value or a partial reduction of several input values. We use the term mingled to describe an item having mixed composition, i.e., an item having values from more than one input set. Theorem 3: In the FCBT design, r i contains only items of set i. Proof: We use a proof by contradiction. Suppose buffer l+1 accepts the first mingled item, i.e., buffer l+1 contains an item, V = v i1 + v i2 , where item v i1 has only items from set i1, and v i2 has only items from set i2. Without loss of generality, we assume i1 <i2. Clearly items v i1 and v i2 were both in buffer l at some earlier clock cycle. There could not have been another item from set i1 in buffer l at the same time as v i1 . Otherwise, that item, rather than v i2 , would have been entered into the adder pipeline with v i1 . Since multiple sets enter buffers sequentially, item v i1 entered buffer l ahead of item v i2 . Furthermore, since no other item from set u was in buffer l, v i1 must either be a singleton at level l or be the final sum of set i1. However, if v i1 were a singleton item, it would have been added with a 0; if v i1 were the final sum of set u, it would have been written to the external memory and not to buffer l. 
63 clock counter buffer contents (n = 7) adder pipelines (α = 4) cycle value 1 000 s 2 s 1 buffer 0 adder 1 2 001 s 3 buffer 0 s 1 +s 2 adder 1 3 010 s 4 s 3 buffer 0 s 1 +s 2 adder 1 4 011 s 5 buffer 0 s 3 +s 4 s 1 +s 2 adder 1 5 100 s 5 s 6 buffer 0 s 3 +s 4 s 1 +s 2 adder 1 s 7 buffer 0 s 5 +s 6 s 3 +s 4 adder 1 s 1 +s 2 buffer 1 buffer 0 s 7 s 5 +s 6 s 3 +s 4 adder 1 s 1 +s 2 buffer 1 8 111 s 1 +s 2 s 3 +s 4 buffer 1 s 7 s 5 +s 6 adder 1 9 000 s 1 +s 2 s 3 +s 4 buffer 1 s 7 s 5 +s 6 adder 1 s 5 +s 6 buffer 1 s 7 adder 1 s 1 +···+s 4 adder 2 s 7 s 5 +s 6 buffer 1 adder 1 s 1 +···+s 4 adder 2 12 011 buffer 2 s 5 +s 6 +s 7 s 1 +···+s 4 adder 2 13 100 buffer 2 s 5 +s 6 +s 7 s 1 +···+s 4 adder 2 14 101 s 1 +···+s 4 buffer 2 s 5 +s 6 +s 7 adder 2 15 110 s 1 +···+s 4 buffer 2 s 5 +s 6 +s 7 adder 2 16 111 s 1 +···+s 4 s 5 +s 6 +s 7 buffer 2 adder 2 17 000 s 1 +···+s 4 s 5 +s 6 +s 7 buffer 2 adder 2 18 001 s 1 +···+s 4 s 5 +s 6 +s 7 buffer 2 adder 2 19 010 buffer 2 s 1 +···+s 7 adder 2 6 101 7 110 10 001 11 010 Figure 5.7: Snapshot of FCBT design Thus, we have our contradiction and proof that the algorithm produces only reductions containing items from a single input set. 5.2.2.3 Snapshot In this section, we give a snapshot for small input sizes and a small α value to show how the FCBT design works. In Figure 5.7, we accumulate seven inputs (n =7), and both adders have four pipeline stages (α=4). We use a 3-bit counter and three 3-slot buffers. If a buffer or pipeline is empty and will not be used again, we do not show it in the snapshot. Several points about this example are worth noting. In clock cycle seven, a singleton addition is performed, that is, the first pipeline adds s 7 with a padded zero. In clock cycle eight, although buffer 1 contains two items, it is not read, because the 64 last bit of the counter is not 0. Instead, buffer 1 is read in clock cycle 9, when the last bit of the counter becomes 0. Similarly, buffer 2 is not read in clock cycle 16 because the last two bits of the counter are not 01. Instead, buffer 2 is read in clock cycle 18, when the last two bits of the counter are 01. 5.3 Striding Method 5.3.1 Intuitive Idea The striding method for designing reduction circuits is based on a different idea: when n ≥ α,an α-stage pipelined floating-point adder can reduce n input values into α segments without any data hazard. Figure 5.8 illustrates the summation of n inputs into α intermediate results when α=4. clock adder pipeline (α = 4) cycle 1 s 1 2 s 2 s 1 3 s 3 s 2 s 1 4 s 4 s 3 s 2 s 1 ... 8 s 4 +s 8 s 3 +s 7 s 2 +s 6 s 1 +s 5 ... 12 s 4 +s 8 +s 12 s 3 +s 7 +s 11 s 2 +s 6 +s 10 s 1 +s 5 +s 9 ... n s 4 +s 8 +s 12 +···+s n s 3 +s 7 +s 11 +···+s n-1 s 2 +s 6 +s 10 +···+s n-2 s 1 +s 5 +s 9 +···+s n-3 Figure 5.8: Reducing n inputs into α segments (α=4). We use the term coalesce to refer to the final stage of reduction when all the partial reductions currently in the adder are coalesced into a single value. We can use the same adder to perform the coalescence. However, during the coalescing stage, the adder cannot accept new inputs and they must be stored in a buffer. Since coalescing happens to every input set, the buffer size is now dependent on the number of input 65 sets, p. When p is large, which is usually the case in matrix-vector multiplication, such a design is infeasible. Suppose α distinct sets are stored in a buffer and each set has α items. Then we can interleave the additions from the α sets so that the adder is fully utilized and no data hazard occurs. 
In this way, reducing α 2 items from α distinct sets takes at most α 2 clock cycles. At the same time, another buffer of size α 2 is needed to store the new inputs. Thus, we can reduce multiple inputs sets with one adder and two buffers of size α 2 . 5.3.2 Proposed Design: Single Strided Adder 5.3.2.1 Architecture and Schedule control logic input buffer 2 + output ctrl ctrl buffer 1 Figure 5.9: Single strided adder (SSA) architecture We propose the single strided adder (SSA) design, whose architecture is shown in Figure 5.9. It contains one floating-point adder and two buffers of size α 2 . The input can either enter one of the buffers, or enter the adder directly. The adder selects its operands from the input and the buffers. If the output of the adder is the final result of a set, it is written to the external memory; otherwise, it is written back to one of the buffers. The buffer that accepts new inputs is denoted as Buf in . For set i,if n i ≤ α, n i inputs are written into Buf in directly; otherwise, α inputs are written into Buf in and are then added with the remaining new inputs of the set. In the n i (if n i >α)or α (if 66 n i ≤ α) cycles in which Buf in accepts new inputs, the adder is used by another buffer. This buffer is called Buf red , which stores no more than α items for each set. Reading from Buf red column by column, the adder interleaves α additions from α distinct sets. The results of these additions are written back to Buf red . Buffer 1 and Buffer 2 function as Buf in and Buf red alternately. When Buf in is full, the two buffers are swapped. That is, Buf in becomes Buf red and Buf red becomes Buf in . Since n i can be of any value larger than 1, when Buf in is full, Buf in may contain more than α distinct sets. The problem now is how to interleave the additions in these sets so that data hazards do not occur. We try to fit all the sets into α rows. This fitting problem can be reduced to the subset-sum problem, which is NP-complete [20], if no input set is allowed to be partitioned among two rows. Therefore, we allow an input set to be split among two rows. Suppose the n i (n i <α)or α (n i ≥ α) items of set i are written into Buf in in row-major order. We know that the items in a set can exist only in two consecutive rows. Otherwise, a set will have more than α items in Buf in , which is not allowed by the algorithm. Figure 5.10 describes the scheduling algorithm of the SSA design. It is assumed that the buffers allow read-after-write access. That is, if an entry is read and written in the same clock cycle, the output of the buffer is the value which is written. Two points must be noted in the algorithm: 1. The first column of Buf red is read twice. During the second read operation, Buf red [jα] contains the intermediate result of the set, which is split between row j− 1 and row j (j =1,...,α− 1). 2. It is possible that an input set is split between two buffers. This case can be seen as the set being split between row 0 and row 0− 1. 
A special register is used to 67 if the input is among the first α inputs of a set then write the input into Buf in if it is the first input of the set, incre- ment i if Buf red is not empty then ifl< α and (Buf red [jα + l] and Buf red [jα + l− 1] are in the same set) then add Buf red [jα + l] with Buf red [jα + l− 1] else if l = α then if a set is split between row j− 1 and row j then add Buf red [(j−1)α+l−1] with Buf red [jα] else if a set is split between two buffers then add Buf red [0] with temp end if end if increment j if j +1 = α, then j =0, increment l end if else adds the input with the kth item of set i in Buf in , increment k end if {buffers} if Buf in is full or (no new input and Buf red is empty) then swap Buf in and Buf red j =0, l =1, k =0, i =−1 end if {output of the adder} if one operand is from Buf in [x] then if it is not from the last set in Buf in then write the adder output to Buf in [x] else write the adder output to Buf red [x] end if else if one operand is the last item of a set in Buf red then write the output to the external memory else if operands are from Buf red [y] and Buf red [y− 1] then write the output to Buf red [y] else if one operand is from row j and is the last item of a set split between row j− 1 and row j then write the output to Buf red [jα] else if one operand is from row 0 and is the last item of a set split between two buffers then write the output to Buf red [0] else if one operand is from row α− 1 and is the last item of a set split between two buffers then write the output to temp end if Figure 5.10: Scheduling algorithm for SSA design 68 store the intermediate result from row 0− 1. The value stored in this register is then added to Buf red [0]. 5.3.2.2 Proof of Correctness We now prove the correctness of the SSA design. Theorem 4: The SSA scheduling algorithm shown in Figure 5.10 guarantees a collision-free and hazard-free use of the adder and does not cause buffer overflow. Proof: The use of the adder is collision-free because the adder reads only from Buf red when the Buf in is accepting new inputs. We next prove data hazards do not occur in the algorithm. When Buf in is full, each of its column contains items from α distinct sets. Otherwise, there must be a set that contains more than α items in Buf in , which is not allowed by the algorithm. Therefore, through interleaving additions in α distinct rows, we interleave additions in α distinct sets. Thus the use of the adder is hazard-free. We next prove the buffers do not overflow. Initially, both buffers are empty. With- out loss of generality, suppose Buffer 1 is used as Buf in , and Buffer 2 is used as Buf red . After the inputs of α sets enter the architecture, Buffer 1 is full and is designated as Buf red . The algorithm then takes α 2 clock cycles to read from Buffer 1. During these clock cycles, the first α inputs of α consecutive sets are written into Buffer 2. When the remaining inputs of the sets enter the architecture, the adder accepts operands from Buffer 2, and not Buffer 1. After α sets enter the architecture, Buffer 1 is empty and Buffer 2 is full; then Buffer 1 is designated as Buf in , and Buffer 2 is designated as Buf red . The algorithm continues in this way until the last input enters the architecture. Thus, the buffers never overflow. Theorem 5: The circuit correctly reduces multiple input sets, i.e., r i = n i −1 j=0 s i,j , i=0,...,p− 1. 69 Proof: We first prove that each set in Buf red is reduced correctly. 
Clearly, this is true if set i is not split between two rows. Suppose set i is split between row j− 1 and row j, j =1,...α− 1. The items in row j− 1 are summed up and the intermediate result is written to Buf red [(j−1)α+α−1]. If only one item of the set is stored in row j, it must be Buf red [jα], because the items of a set are stored consecutively in row-major order. As Buf red [(j−1)α+α−1] is added with Buf red [jα], the set is reduced correctly. If two or more items of set i are stored in row j, they are summed up and the intermediate result is written to Buf red [jα] before l = α (l is used in Figure 5.10). Otherwise, the set contains more than α− 1 items in row j, and thus contains more than α items in Buf in . Since Buf red [(j− 1)α + α− 1] is added with Buf red [jα] when l = α, the set is reduced correctly. The case in which set i is split between two buffers is similar to the case where the set is split between row 0 and row 0− 1. Thus, we can prove the set is reduced correctly using a similar analysis. We now prove that r i contains only items from a single set i. Suppose during some clock cycle, the adder accepts two operands from different sets. This cannot happen whenl<α because it is not allowed by the algorithm. Suppose when l = α, the items in Buf red [(j− 1)α + l− 1] and Buf red [jα] are from different sets. In this case, no set is split between row j− 1 and row j; hence these items will not be read by the adder. Similarly, when l = α, if the items in Buf red [0] and in temp are from different sets, they will not be read by the adder. We thus reach a contradiction. Theorem 6: The circuit reduces p sets in at most ( p−1 i=0 n i +2α 2 ) cycles. Proof: Since the reduction circuit never stalls reading the inputs, the last element of the last set enters the circuit in clock cycle p−1 i=0 n i . At this time, if Buf in is full and Buf red is empty, they are swapped. Hence the adder reads from Buf red , which is full. This takes α 2 clock cycles. 70 clock buffer 1 contents adder buffer 2 contents notes cycle (α = 4) s 00 s 01 s 02 s 10 At clock cycle 16, buffer 1 is full, 16 s 11 s 12 s 20 s 21 and we swap Buf in and Buf red . s 22 s 30 s 31 s 32 s 40 s 41 s 42 s 50 In clock cycles 17-20, the adder s 02 s 10 s 40 +s 41 s 51 s 52 s 60 s 61 reads operands from 4 rows of 20 s 20 s 21 buffer 1. Since s 22 is not in the s 22 s 30 s 31 s 32 s 11 +s 12 same set as s 30 , they are not added. s 42 s 50 s 00 +s 01 s 10 s 40 +s 41 +s 42 s 51 s 52 s 60 s 61 In clock cycles 21-24, the adder 24 s 11 +s 12 s 20 s 21 s 30 +s 31 s 62 s 70 s 71 s 72 continues reading from buffer 1. s 22 s 32 s 11 +s 12 is written to the first s 50 s 00 +s 01 +s 02 location in row 1 of buffer 1. s 10 s 51 s 52 s 60 s 61 28 s 11 +s 12 s 30 +s 31 +s 32 s 62 s 70 s 71 s 72 Starting from clock cycle 29, the s 22 s 20 +s 21 s 80 s 81 s 82 s 90 intermediate results of split rows s 50 are summed up. s 51 s 52 s 60 s 61 32 s 20 +s 21 +s 22 s 62 s 70 s 71 s 72 At clock cycle 32, buffer 1 is empty s 11 +s 12 +s 10 s 80 s 81 s 82 s 90 and the adder begins to reduce sets s 91 s 92 in buffer 2. Note that Temp = s 50 . s 91 +s 92 +s 90 48 At clock cycle 48 the operands of s 63 +s 60 +s 61 the last addition enter the adder s 51 +s 52 +s 50 Figure 5.11: Snapshot of SSA design If only q (q< α 2 ) entries of Buf in are written as the last input enters, Buf red is not empty. Thus, the adder first reads from Buf red before the two buffers are swapped. This takes α 2 − q clock cycles. 
Then Buf in becomes Buf red , and is read by the adder. The operands of the last addition enter the adder in clock cycle p−1 i=0 n i +(α 2 − q)+ α(α− 1) + q α .Asq> q α , the final result of the pth set is available in clock cycle p−1 i=0 n i +(α 2 − q)+ α(α− 1) + q + α = p−1 i=0 n i +2α 2 . 5.3.2.3 Snapshot We illustrate the operations of an α =4 SSA design using a snapshot. During clock cycle 0, Buf in = Buffer 1 and Buf red = Buffer 2; both buffers are empty. The inputs are stored in the buffers in row-major order. In particular, the memory addresses increase 71 from left to right, and from top to bottom. The data in the adder travels from the top to the bottom. There are 10 input sets of size 3. We show the data in the buffers and the adders in the selected clock cycles in Figure 5.11. 5.4 Performance Analysis 5.4.1 Summary of Proposed Designs Table 5.1 summarizes the characteristics of the proposed designs and the PCBT design. Although the PCBT design is unrealistic for most FPGA-based designs because of the large number of floating-point adders, it is very simple to implement. By including the PCBT design, our analysis of the trade-offs will be more complete. Table 5.1: Summary of reduction circuits Design No. of Adders Buffer Size Latency for n inputs PCBT lg(n) 2lg(n) n + αlg(n) FCBT 2 3lg(n) ≤ 3n+(α− 1)lg(n) SSA 1 2α 2 ≤ n+2α 2 The FCBT design is relatively easy to implement. It requires two adders, has the smallest buffer size, but has the longest theoretical latencies for even modest values of n. The SSA design employs only one adder, has a reasonable theoretical latency for larger values of n, but has a relatively complex implementation. Additionally, when α≥ lg(n), the SSA design has the largest buffer size. The PCBT design and the FCBT design must know the maximum number of inputs before the computation starts, while the SSA design has no upper bound on the input size. On the other hand, when there are multiple input sets, the throughput for each set equals its number of inputs except for the first set. For the SSA design, however, 72 the throughput depends not only on the number of inputs in the current set, but also on the sizes of the previous and the subsequent sets. When the input sizes are less than α, the outputs of the SSA design may be out of order. In this case, we can first write the outputs into an output buffer, and read them out in order. As each input set contains at least 2 elements, the maximum size of the output buffer is α 2 2 . 5.4.2 Performance Metrics We use the following metrics to evaluate our designs: • Area: The number of configurable slices used by the design. • Clock Speed: The maximum achievable speed of the design. • Total Time: The period of time between when the first input arrives and when the last result is output. • Throughput: It is computed as (total number of outputs)/(output time). The output time is the period between when the first result is generated and when the last result is output. For applications that sequentially generate input sets for reduction, high throughput is more desirable than shorter latency. Thus, when multiple sets are reduced, throughput is a better measurement of performance than total time. To evaluate the area-efficiency of our designs, we also use the following metrics: • Area-Time Product: It is calculated as Area× Total Time, in thousand slices× μs. A lower value indicates better performance. • Throughput-To-Area Ratio: It is calculated as Throughput/Area, in thousand results per slice per second. 
A higher value indicates better performance. 73 5.4.3 Design Implementations We implemented our proposed designs on a Xilinx Virtex-II Pro XC2VP50 FPGA. The characteristics of the floating-point adder are shown in Table 4.1. Note that the proposed designs are mostly independent of the implementations of the adder, except that the buffer size of the SSA design varies with α. Thus, any adder (or actually any associative and commutative binary operator) can be plugged into our designs with few modifications to the architectures and the algorithms. We implemented the FCBT design, the SSA design and the PCBT design. The characteristics of the implemented circuits are shown in Table 5.2, when n = 128. Ex- cept for the PCBT design, the buffers in the designs are implemented using the BRAM memory on the FPGA. Table 5.2: Characteristics of the reduction circuits (n = 128) Design Area (Slices) Clock Speed (MHz) No. of BRAMs PCBT 6808 165 - FCBT 2859 170 10 SSA 1804 165 6 As shown in Table 5.2, the PCBT design needs the most adders and uses slice logic for buffers. Therefore, its area is much larger than the other designs. For the FCBT design, about 40% of the area is used for memory access logic and control logic. Its clock speed is constrained by the clock rate of the adder. The SSA design uses only one adder and takes up the smallest area. However, because the algorithm is more complex, its control logic occupies more than 50% of its area. Moreover, its clock speed is lower than that of the FCBT design. On the other hand, since the FCBT design needs to carry the tags and set indices along with the inputs, it needs more BRAMs than the SSA design. 74 As n increases, the number of adders in the PCBT design increases, as does the routing complexity. Thus, the area and the clock speed of the design vary with n. Although the number of adders is fixed in the FCBT design, the memory address logic becomes more complex for large n. Therefore, the design occupies more slices and the achievable clock speed decreases as n increases. As shown in Figure 5.12, as n increases from 2 4 to 2 20 , the area of the PCBT design increases by 433%, while the area of the FCBT design increases only by about 20%. However, as shown in Figure 5.13, the degradation in the clock speed of both designs is almost the same. This is because the control logic of the FCBT design is more complex, and its complexity increases with n. Nonetheless, the clock speed degradation is less than 25% as n increases from 2 4 to 2 20 . 0 5000 10000 15000 20000 25000 4 8 12 16 20 lg(n) Area [slices] PCBT Design FCBT Design Figure 5.12: Area vs. lg(n) 5.4.4 Performance Analysis Figure 5.14 shows the total time and the area-time product of our designs for reducing a single set of n inputs, when n = 128. The total time is measured in μs. We see that although the FCBT design takes longer to complete the reduction than the PCBT 75 145 155 165 175 4 8 12 16 20 lg(n) Clock Speed [MHz] PCBT Design FCBT Design Figure 5.13: Clock Speed vs. lg(n) 0 3 6 9 PCBT FCBT SSA Total Time [ ȝ s] 0 5 10 Area × Time [thousand slices × ȝ s] Latency Area-Latency Product Figure 5.14: Latency and area-latency product for reducing a single set of 128 inputs design, it achieves a smaller area-time product because its area is much smaller. The total time of the SSA design is the longest because it has to reduce the entire Buf red , even if Buf red only contains α items. 
However, as its area is the smallest, its area-time product is just slightly larger than that of the FCBT design. Figure 5.15 shows the total time and the area-time product of our proposed designs for reducing n sequential sets of n inputs, when n = 128. The FCBT design uses less time than the PCBT design and has a smaller area-time product. The latency of the 76 95 100 105 PCBT FCBT SSA Total Time [ ȝ s] 0 400 800 Area × Time [thousand slices × ȝ s] Latency Area-Latency Product Figure 5.15: Latency and area-latency product for reducing 128 sets of 128 inputs SSA design is still the longest, but the difference between the total times is less than 10%. Thus, thanks to its small area, the SSA design achieves the smallest area-time product. The throughput and the throughput-to-area ratio of the proposed designs and the PCBT design are shown in Figure 5.16. We see that the throughput of the FCBT de- sign is almost the same as that of the PCBT design. Actually, after the first result, both designs generate one result every 128 clock cycles. The difference in their throughputs is caused by their different clock speeds. On the other hand, the SSA design achieves the highest throughput and throughput-to-area ratio. In this experiment, the SSA de- sign generates the first α results after the 256th set is reduced to α values. After that, α results are generated every 128×α clock cycles. Thus, the SSA design has the shortest output time. From the experimental results, we see that our proposed designs are area-efficient and can achieve high performance. In terms of total time, the FCBT design is suitable for reductions of both one set and multiple sets, while the SSA design is suitable for reducing multiple sets. Moreover, the SSA design achieves higher throughput when 77 1 1.5 2 PCBT FCBT DSA SSA Throughput [results/μs] 0 0.4 0.8 Throughput/Area [thousand results/s/slice] Throughput Throughput/Area Figure 5.16: Throughput and throughput-to-area ratio for reducing 128 sets of 128 inputs multiple sets are reduced. Therefore, for the designs proposed in Chapter 4, the SSA design is used. 78 Chapter 6 Dense Matrix Operations on Reconfigurable Computing Systems In this chapter, we investigate optimized designs for dense matrix operations on re- configurable computing systems. The operations studied are matrix multiplication and LU decomposition for matrix factorization. They are both key computational kernels in linear algebra, and are included in level 3 BLAS [43] and LAPACK [5], respectively. We first analyze the inherent attributes of these two operations and derive a lower bound on their performance. We then identify their design parameters, including the number of PEs, the internal storage size required by the design, and the block size. For matrix multiplication, three designs are proposed: Design 1, Design 2 and De- sign 3, respectively. All of these designs employ a linear array architecture of PEs. Each PE contains one floating-point adder, one floating-point multiplier and an inter- nal storage. Communications are allowed between neighboring PEs only. In this way, the designs make use of the short connection wires on the FPGA device and are able to achieve high clock speed. Also, the linear array architecture with a small and reg- ular control logic results in little degradation in performance when the number of PEs increases. 79 Design 1 and Design 2 achieve the lower bound on the latency of any n× n ma- trix multiplication algorithm, Θ(n 2 ). 
However, in these two designs, the number of PEs and the required storage size increase with the problem size n. In Design 3, the number of PEs is determined by the available resources and the storage size is deter- mined by the size of BRAM memory on the FPGA. With the given resources, Design 3 achieves the optimal latency and minimizes the required memory bandwidth. Design 3 is then further extended to utilize the SRAM memory and the multiple FPGAs in reconfigurable computing systems. For LU decomposition, our design employs a circular linear array of PEs. The first PE contains a floating-point divider, while each of the other PEs contains a floating- point multiplier and a floating-point adder/subtractor. The number of PEs is indepen- dent of the problem size n, and is determined by the available resources on the FPGA. The design requires a storage of size Θ(n 2 ) words. Our design achieves the lower bound on the latency that can be reached by any LU decomposition design. For large matrices, block LU decomposition is employed, and the block size b is determined by the size of the BRAM memory on the FPGA. A hierarchical design is proposed to utilize the SRAM memory in reconfigurable computing systems. On one Xilinx Virtex-II Pro FPGA, our designs for matrix multiplication and LU decomposition achieve 4 GFLOPS and 3.2 GFLOPS, respectively. The performance of Design 3 for matrix multiplication compares favorably with that of optimized programs executed on state-of-the-art processors. Our design for LU decomposition achieves higher performance than the optimized library routines run on an AMD processor. 80 6.1 Related Work 6.1.1 Matrix Operations on Parallel Systems There has been extensive work on parallel algorithms for matrix multiplication. Two classical algorithms are Cannon’s algorithm [13] and Fox’s algorithm [27]. They are designed based on a square processor grid with a block data distribution in which each processor holds consecutive blocks of data. During each iteration, the blocks on one processor are either transferred to its neighboring processors or are broadcast to the other processors in the same row of the grid. Choi et al. developed PUMMA [15] to implement Fox’s algorithm on general 2D grids. In [29], Van de Gejin and Watts proposed matrix multiplication algorithms for distributed memory concurrent computers. In their algorithms, the blocks on one processor are broadcast within the row as well as within the column. Also, the computation and communication are efficiently overlapped. Due to the size and routing complexity of the FPGA-based floating-point cores, a 2D grid is infeasible for implementations on FPGAs. Moreover, broadcasting data within an FPGA-based design causes large overheads. Li and Pan [44] parallelized a well-known sequential algorithm on a linear array of processors. However, their work needs a reconfigurable pipeline optical bus to support massive volume of data transfer. Such a bus is not available on FPGAs, and implementing it on FPGAs causes large overheads. In [28], Gallivan et al. surveyed the parallel algorithms for LU decomposition on shared-memory systems and distributed systems. Parallel factorization schemes for block-tridiagonal systems are discussed in details. Choi et al. examined matrix factorization methods on a 2D processor-grid [16]. The matrix to be decomposed is partitioned among the processors with a block cyclic distribution. The distribution of 81 workload in LU decomposition becomes uneven as the algorithm progresses. 
The au- thors discussed the trade-offs between the load imbalance and communication costs, which can be controlled by varying the block size. In [65], a broadcast-based parallel algorithm for LU decomposition is presented. It takes advantages of the physical com- munication facilities in Ethernet, and overlaps communications with computations. In contrast to the above previous work, the designs proposed in this thesis do not use broadcasting or special hardware to transfer data. In our designs, PEs are con- nected in a linear array, and only the communications among neighboring PEs are al- lowed. Such communications are realized using the short connection wires on FPGAs. Also, the communications are efficiently overlapped with the computations. 6.1.2 Matrix Operations on Reconfigurable Computing Systems Amira et al. implemented a bit-level algorithm for 8-bit fixed-point numbers on FPGAs, which employs a serial-parallel matrix-matrix multiplier based on the Baugh-Wooley algorithm [3]. The I/O bandwidth required by the design is proportional to the problem size. Jang et al. [41] proposed FPGA-based linear array algorithms for fixed-point ma- trix multiplication. Their algorithms achieve optimal latency, Θ(n 2 ), for n× n matrix multiplication. The algorithms need n multipliers, n adders and a total storage of size n 2 words. In [41], no hardware resource constraints are considered. This is because the work in [41] targets fixed-point matrix multiplication, which usually requires much fewer hardware resources than what is provided by the FPGA device. However, due to the design complexity of floating-point cores, various resource constraints, such as the number of slices, the size of on-chip memory and the available memory bandwidth, become important factors that may affect the performance of a design. For example, it 82 is impractical to configure n floating-point adders and n floating-point multipliers on current FPGA devices when n is larger than 50. Floating-point matrix multiplication on FPGAs has been discussed in several pa- pers. In [46], the authors investigated the influence of the floating-point MACs (Multi- plier and ACcumulators) on the performance of matrix multiplication. The authors of [58] studied the implementation of floating-point arithmetic and used matrix multipli- cation as an example application. The focus of that work is FPGA-based designs for arithmetic, and not for matrix multiplication. In [67], Underwood et al. discussed ma- trix multiplication as part of the BLAS (Basic Linear Algebra Subprograms). Their ar- chitecture performs a collection of matrix-vector multiplications in parallel, and blocks the source matrices to reduce the requirements on the storage size. In their work, the authors focused on examining the potential capacity of FPGAs in performing floating- point BLAS operations and comparing the computing capacity of FPGAs with that of general-purpose processors. In [24], the authors proposed a parallel algorithm for matrix multiplication, which consists of a master processor and multiple slave proces- sors. The master distributes the data from the source matrices to the slaves, so that the slaves can compute different blocks in the result matrix in parallel. This algorithm has a latency of Θ(n 3 ), and is only suitable for large matrices. FPGA-based matrix factorization has also been studied. Choi et al [17] proposed a linear array architecture for fixed-point LU decomposition. 
The array consists of n PEs; each PE contains a multiplier, an adder/subtractor and a reciprocator. The latency of this architecture is n 2 . However, using one reciprocator in every PE is very resource- consuming if the architecture is used for floating-point LU. In [71], Wang et al pro- posed a multiprocessor system on an FPGA device and used it for LU decomposition of sparse block-diagonal-bordered matrices. Each processor on the FPGA is attached 83 to a floating-point adder/subtractor, a floating-point multiplier and a floating-point di- vider. As one FPGA-based divider occupies a large number of slices, the number of processors in such a system is highly limited. In [34], a circular array architecture was proposed for floating-point LU decom- position. It uses n PEs, and only the first PE, PE 0 , contains a divider. PE j (1 ≤ j ≤ n− 1) contains one MAC (Multiplier and ACcumulator) and a storage of size n− j. PE j is in charge of computations that involve column j − 1. The latency of the design is n 2 , and the required storage size is n 2 2 . [22] gives a detailed description of block LU decomposition algorithm and proposes an architecture for it. The matrix is partitioned into b× b blocks, where b is determined by the number of configurable slices. The latency of the design in [22] is approximately 2n 3 3b cycles, which is about twice the minimum latency of LU decomposition with the given hardware resources. In this thesis, we provide parameterized and optimized designs for floating-point matrix multiplication and LU decomposition on reconfigurable computing systems. Our designs can be used for various problem sizes and can adapt to various hardware resource constraints. With the given resources, our designs are able to achieve optimal latency and minimize the memory bandwidth requirement. 6.2 Matrix Multiplication Dense matrix multiplication is formulated as C = AB, c ij = n−1 q=0 a iq b qj (i, j =0, 1,...,n− 1) where A, B, C are all n× n matrices. 84 No. of PEs (k) (1,1,n 3 ) Storage Size (m) Latency (T) n n 2 n 3 1 n n 2 n 3 n 2 n 3 (n,n 2 ,n 2 ) Figure 6.1: Trade-offs for matrix multiplication 6.2.1 Operation Analysis 6.2.1.1 Lower Bound The total number of floating-point operations is 2n 3 . As each element of A and B is used n times, the total number of external I/O operations is 2n 2 . As with the vector operations, the number of floating-point operations performed in each clock cycle, k and k , are identified as design parameters. We set k = k so that T comp = Ω( n 3 k ). Because of the data reuse, the size of internal storage for intermediate results needed by the design becomes another important parameter. We use I/O complexity to refer to the total number of external I/O operations performed by an algorithm. It has been proven [36] that the I/O complexity of any implementations of the usual matrix multiplication algorithm is Ω( n 3 √ m ), when Θ(1) ≤ m ≤ Θ(n 2 ), where m is the size of the internal storage. Whenm> Θ(n 2 ), the I/O complexity remains Ω(n 2 ). Therefore, the lower bound on the latency of matrix multiplication is: T ≥ max(T comp ,T I/O )≥ max( n 3 k , n 3 b w / √ m bw d )(m≤ n 2 ) (6.1) 85 Figure 6.1 illustrates the relationship between m, k, and T . When k = O(1), T is Ω(n 3 ) regardless of the value of m, as the computation time forms the bottleneck in the algorithm. This is the case for general-purpose processors. Consider a processor that has a cache large enough to hold all the source matrices and is able to access data from the cache in every cycle. 
In this case, m =Θ(n 2 ), and thus T I/O is reduced to Ω(n 2 ). However, the processor can perform only a constant number of computations during each clock cycle. Hence T comp , and therefore T , remains Ω(n 3 ). On the other hand, if m =Θ(1), T once again cannot be improved, irrespective of the number of PEs that perform computations during each clock cycle. This is the situation where I/O bandwidth, rather than the computing units, forms the bottleneck in the algorithm. The optimal latency is achieved when k =Θ(n) and m =Θ(n 2 ). In this case, both T comp and T I/O are Ω(n 2 ), and T reaches the lower bound, Ω(n 2 ). Note that for FPGA-based implementations of floating-point matrix multiplication, sometimes k =Θ(n) and m=Θ(n 2 ) cannot be satisfied. Due to the complexity of the floating-point cores, only tens of floating-point cores can be configured on a current FPGA device, while n could be several magnitudes higher. Moreover, n 2 can be much larger than the size of the on-chip memory of the FPGA. Thus, we must consider cases wherek<n andm<n 2 . Based on Equation 6.1, the following cases arise for various values of k and m: 1. k ≥ n, m = n 2 : the optimal latency, Θ(n 2 ), can be achieved for matrix multi- plication. 2. k≥ n,m<n 2 : the latency of matrix multiplication is bounded by the I/O time. 3. k<n, m = n 2 : the latency of matrix multiplication is bounded by the compu- tation time. 86 4. k<n,m<n 2 : the latency of matrix multiplication depends on both k and m. As case 2 and case 3 are not balanced, we do not consider them. In our work, we propose designs for cases 1 and 4. 6.2.1.2 Design Trade-off Analysis From the above analysis, the latency of a matrix multiplication algorithm is determined by several parameters, including the number of PEs k, the storage size m and the memory bandwidth bw d . These parameters result in various design trade-offs. The first trade-off exists between the total storage size used by a design and the required memory bandwidth. Suppose the number of PEs and T comp are fixed. From Equation 6.1, we know that T I/O depends on both m and bw d . When the total storage size of the matrix multiplication algorithm decreases, bw d must be increased to pre- vent T I/O from increasing. On the other hand, when the available memory bandwidth decreases, m must be increased. The second trade-off lies between the memory bandwidth and the hardware re- sources required by a design. Suppose m and the I/O complexity of the matrix multi- plication algorithm are fixed. In this case, the I/O time T I/O decreases as the number of I/O operations performed in one clock cycle increases. However, to reduce the latency, more PEs are needed to reduce the computation time T comp . Thus, to utilize higher memory bandwidth, more hardware resources are needed. On the other hand, when more PEs are implemented to reduce T comp , higher memory bandwidth is required to reduce T I/O . The third design trade-off is between the number of PEs in a design and the size of local storage in each PE. As discussed in Section 6.2.1, a design with n PEs must have a total storage of size Θ(n 2 ) to achieve optimal latency. Thus, the local storage in 87 each PE must be of size Θ(n). However, if the memory constraint of the FPGA device requires the designer to decrease the size of the local storage, more PEs are needed. In some cases, even multiple FPGAs are needed. Suppose the FPGA device can hold at most x PEs. 
For some problem size y, Θ(y× x) may exceed the size of the on-chip memory of the device, and optimal latency Θ(n 2 ) cannot be achieved. We solve this problem by setting the storage size in each PE as a variable, which is independent of the problem size. For example, if s× x equals the size of the on-chip memory, the PEs in the algorithm could be designed to store s intermediate results. The design now must employ Θ( n 2 s ) PEs so that m=Θ(n 2 ). Thus, Θ( n 2 sx ) FPGAs are required to achieve the optimal latency. 6.2.1.3 Design Issues Besides the design trade-offs caused by the hardware resources, we must also consider other design issues. The first concerns the floating-point cores used in the design. To exploit the parallelism of matrix multiplication, we hope to maximize the number of PEs that are performing computation concurrently. Hence, it is desirable that the area of each PE is minimized. On the other hand, the clock speed of the floating-point cores, therefore PEs, can be improved through pipelining. The pipeline stages take up extra slices and increase the area of the PEs, further decreasing the number of PEs on the FPGA. Therefore, when we design the floating-point cores for matrix multi- plication algorithm, we cannot simply increase their clock speed or reduce their area. Instead, we must optimize these cores in the context of the algorithm so that they are speed-area balanced, as discussed in [33]. Secondly, note that the memory bandwidth available to an FPGA-based design may be constrained by the number of I/O pins of the FPGA device. This is more prone to happen to floating-point based applications, because floating-point numbers have a 88 much larger bitwidth than fixed-point numbers. We thus choose a linear array architec- ture whose memory bandwidth is independent of the problem size. Another advantage of the linear array architecture is that it can easily scale over multiple FPGAs. In our analysis so far we have ignored the on-chip communication costs, as well as the routing complexity of the algorithm. These two factors are crucial for the effi- ciency of the algorithm. First, to achieve optimal latency, we should ensure that the on-chip communication overheads do not dominate the computation time T comp . Sec- ondly, the routing of the algorithm should be implementable using available FPGA routing resources. For floating-point applications on FPGAs, the floating-point cores usually occupy large numbers of slices and short connection wires on the FPGA. If the long connection wires are used to connect the PEs, the clock speed of the algo- rithm would be greatly reduced. The linear array architecture makes good use of the short connection wires on the FPGA, and reduces the communications among the PEs. Thus, it enables our design to minimize the communication overheads and the routing complexity. 6.2.2 Matrix Multiplication on Single FPGA We now present our matrix multiplication designs on a single FPGA. Matrices A and B are initially stored in the DRAM memory, and are read into the FPGA during the computations. The final results of C are written back to the external memory. The internal storage of these designs are realized using the BRAM memory only. 6.2.2.1 Design 1 In Design 1, there is a linear array of n× n s PEs. Each PE contains one floating-point multiplier, one floating-point adder, two registers and s (1≤ s≤ n) words of storage. 89 For simplicity of analysis, we assume n is a multiple of s. 
Each PE has one input port for A, one input port for B, and one output port for C. The lth PE receives data from the (l− 1)th PE on its left, and passes them to the (l+1)th PE on its right. The final elements of C are transferred from right to left. The first PE, PE 0 , is connected to the external memory. for PE l =0 to n 2 s − 1 (in parallel) do for 1to n cycles do shifts data in RR.B right to PE l+1 end for for n+1 to n 2 + n cycles do shifts data in RR.A, RR.B right to PE l+1 if RR.B = b h,l mod s then copy it into RR.B1 or RR.B2 alternatively end if {i = l s , l s + n s ,...,n− ( n s − l n )} if RR.A = a ih then {b hj is either in RR.B1 or in RR.B2} c ij = c ij + a ih × b hj end if end for end for Figure 6.2: Design 1 for matrix multiplication: describes the cycle-specific behavior of each PE Design 1 is shown in Figure 6.2, and its architecture is shown in Figure 6.3. In this algorithm, each PE calculates s elements of C. PE 0 computes c 00 , c n s ,0 , ... , c (n− n s ),0 ; PE 1 computes c ( 1 n ),1 , c ( 1 n + n s ),1 , ..., c (n−( n s − 1 n )),1 ; ...; PE l computes c ( l n ),(l mod s) ,c ( l n + n s ),(l mod s) , ..., c (n−( n s − l n )),(l mod s) . Matrices A and B are fed into the PEs in column-major and row-major patterns respectively. B starts n cycles earlier than A, and is fed into the lower I/O port in Figure 6.3. A is fed into the upper I/O port. We group n cycles into a phase. In phase 0, the first row of matrix B is fed into PE 0 . In phase h, column h of matrix A (a ih , 0 ≤ i ≤ n− 1) and row h+1 of matrix B (b (h+1),j , 0 ≤ j ≤ n− 1) enter PE 0 in order. For example, a 11 enters the 90 External Memory PE 0 PE l PE p-1 RR.A RR.B RR.B1 RR.B2 MUX MULTIPLIER + PE l From PE l-1 To PE l+1 From PE l+1 To PE l-1 CoutBuffer c ij ’ Figure 6.3: Architecture of Design 1 for matrix multiplication upper I/O port of PE 0 during the same cycle as b 21 enters the lower I/O port of PE 0 . As b hj traverses the PEs, it is picked up by the PEs computing c ij . These PEs keep b hj in the registers until a ih passes through. When a ih reaches the PEs computing c ij , the PEs update c ij = c ij + a ih × b hj . c ij is the intermediate result of c ij (0 ≤ i≤ n− 1, 0≤ j ≤ n− 1) and is stored in the storage within the PE that is computing c ij . The final result of C traverses the linear array in the reverse direction, that is, from PE n 2 s −1 to PE 0 . Suppose during a clock cycle, PE l generates a final element in C, c ij . PE l then transfers c ij toPE l−1 .IfPE l receives c i j fromPE l+1 in the same clock cycle, it stores c i j in its CoutBuffer. The data in CoutBuffer are transferred to PE l−1 during clock cycles in which PE l does not generate C elements. The CoutBuffer is of the same size as the storage which stores the intermediate result of matrix C. 91 For the algorithm to perform correctly, we observe that two requirements should be satisfied. We then state these requirements, and show how they are satisfied by the algorithm with the number of registers in the PEs. 1. Since a ih stays at each PE for one clock cycle only, b hj should arrive no later than a ih . The number of cycles needed for b hj to arrive at PE l is (h− 1)n + j + l. a ih needs n+(h− 1)n + i + l cycles to arrive at PE l . The requirement is satisfied since (h−1)n+j +l≤ n+(h−1)n+i+l for all i, j. Suppose s = n 2 , there are 2n PEs in the linear array. c 1,n−1 is computed by PE n−1 . We show how b 2,n−1 arrives at PE n−1 no later than a 12 for c 1,n−1 = c 1,n−1 + a 12 × b 2,n−1 . 
b 2,n−1 needs (2− 1)n + n cycles to arrive at PE 0 and needs (n− 1) more cycles to move toPE n−1 , requiring a total of 3n− 1 cycles. a 12 needs n+(2− 1)n cycles to arrive atPE 0 and (n− 1) more cycles to move to PE n−1 . Therefore, a 12 arrives PE n−1 at 3n-th cycle, later than b 2,n−1 . 2. Once b hj reaches the PEs that compute c ij , a copy of b hj should reside in these PEs until a ih arrives. A PE can hold b hj in its registers. We now show that using two registers in each PE can satisfy this requirement. The registers are denoted as RR.B1 and RR.B2 in Fig- ure 6.3, and they always store two consecutive elements (b hj and b h+1,j ). For example, when b 22 arrives at PE 2 , b 02 is in RR.B1 and b 12 is in RR.B2. If we can prove that a n−1,0 has passedPE 2 by then, b 22 can replace b 02 in RR.B1, because b 02 is no longer needed by PE 2 . Regardless of the value of s, b hj is no longer needed by a PE after a n−1,h has passed the PE. ForPE l , a n−1,h arrives at the (n+(h−1)n+n+l)th cycle, and b h+2,j arrives at the ((h+1)n+j +l)th cycle. Sincej<n, n+(h−1)n+n+l> (h+1)n + j + l for all h and l. Thus b hj can be replaced when b h+2,j arrives at PE l . This proves that each PE needs only two registers to hold b hj . 92 A 0 X B 0 A 0 X B 1 A 0 X B 2 A 0 X B 3 PE 0 PE 1 PE 2 PE 3 A 1 X B 0 A 1 X B 1 A 1 X B 2 A 1 X B 3 PE 4 PE 5 PE 5 PE 7 a 10 a 30 a 11 a 31 a 12 a 32 a 13 a 33 a 00 a 20 a 01 a 21 a 02 a 22 a 03 a 23 A 0 A 1 B 0 b 00 b 10 b 20 b 30 B 1 b 01 b 11 b 21 b 31 b 02 b 12 b 22 b 32 b 03 b 13 b 23 b 33 B 2 B 3 Figure 6.4: Computation illustrations for Design 1 (n=4, s=2) We illustrate Design 1 using an example in Figure 6.4. When n =4 and s =2, there are 8 PEs in total. PE 0 calculates c 00 and c 20 , ..., PE 4 calculates c 10 and c 30 , ..., and PE 7 calculates c 13 and c 33 . In phase 0, PE 0 stores b 00 in its register; in phase 1, PE 0 updates c 00 = c 00 + a 00 × b 00 , c 20 = c 20 + a 20 × b 00 . b 00 reaches PE 4 in phase 1 and is stored in the register. In phase 2, when column 0 of A reaches PE 4 , PE 4 updates c 10 = c 10 +a 10 ×b 00 and c 30 = c 30 +a 30 ×b 00 . Figure 6.5 shows the data flow for Design 1, assuming the pipeline delays of the adder and the multiplier are both 2. Due to space limitations, we show only the data flow from clock cycle 5 to cycle 13. The last element, a n−1,n−1 , arrives at the last PE at the (n+(n−1)n+n+n n s −1)th cycle. As defined in Section 3.3.2, α is the pipeline delay of the adders. The final c n−1,n−1 is thus generated at (n 2 + n 2 s + n−1+ α)th cycle. It then traverses the PEs and is written back to the external memory at the (n 2 +2 n 2 s + n + α− 1)th cycle. Therefore, the latency of the algorithm T = Θ((1 + 2 s )n 2 ).As 1 ≤ s ≤ n,we have 1 <1+ 2 s ≤ 3, and thus T =Θ(n 2 ). According to Section 6.2.1.1, Design 1 achieves the asymptotically optimal latency. In each clock cycle of Design 1, n floating-point multiplications and n floating- point additions are performed. Thus, k = n. 
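The cycle count derived above can also be evaluated directly. The sketch below encodes only the closed form, n² + 2n²/s + n + α − 1 cycles; the sample values of n, s, and α are illustrative, not measurements from the implementation.

```python
def design1_cycles(n, s, alpha):
    """Closed-form latency of Design 1 as derived above: the last element of C is
    produced at cycle n^2 + n^2/s + n - 1 + alpha and needs another n^2/s cycles
    to drain through the linear array back to external memory."""
    assert n % s == 0, "the analysis assumes n is a multiple of s"
    return n * n + 2 * (n * n) // s + n + alpha - 1

# T / n^2 is roughly 1 + 2/s, matching T = Theta((1 + 2/s) n^2) from the analysis.
for s in (1, 16, 1024):
    print(s, design1_cycles(1024, s, alpha=10) / 1024 ** 2)
```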
The required size of BRAM memory m b 93 a 00 b 10 a 00 *b 00 b 03 b 02 b 01 b 00 a 10 b 11 a 00 *b 00 a 20 b 12 a 20 *b 00 update c 00 a 30 b 13 a 20 *b 00 update c 00 a 00 b 10 a 00 *b 01 b 03 b 02 b 01 b 00 a 10 b 11 a 00 *b 01 a 00 b 10 a 00 *b 02 b 03 b 02 b 01 b 00 a 20 b 12 a 20 *b 01 update c 01 a 10 b 11 a 00 *b 02 a 00 b 10 a 00 *b 03 b 03 b 02 b 01 b 00 a 01 b 20 a 01 *b 10 update c 20 inA inB Multiplier Stage 1 Multiplier Stage 2 Adder Stage 1 Adder Stage 2 a 30 b 13 a 20 *b 01 update c 01 a 20 b 12 a 20 *b 02 update c 02 a 10 b 11 a 00 *b 03 a 00 b 10 b 03 b 02 b 01 a 11 b 21 a 01 *b 10 update c 20 a 01 b 20 a 01 *b 11 update c 21 a 30 b 13 a 20 *b 02 update c 02 a 20 b 12 a 20 *b 03 update c 03 a 10 b 11 a 10 *b 00 a 00 b 10 b 03 b 02 a 21 b 22 a 21 *b 10 update c 00 a 11 b 21 a 01 *b 11 update c 21 a 01 b 20 a 01 *b 12 update c 22 a 30 b 13 a 20 *b 03 update c 03 a 20 b 12 a 10 *b 00 a 10 b 11 a 10 *b 01 a 00 b 10 b 03 a 31 b 23 a 21 *b 10 update c 00 a 21 b 22 a 21 *b 11 update c 01 a 11 b 21 a 01 *b 12 update c 22 a 01 b 20 a 01 *b 13 update c 23 a 30 b 13 a 30 *b 00 update c 10 a 20 b 12 a 10 *b 01 a 10 b 11 a 10 *b 02 a 00 b 10 PE 0 PE 1 PE 2 PE 3 PE 4 PE 5 PE 6 PE 7 Cycle 5 7 6 8 9 10 11 12 Figure 6.5: Data flow for Design 1 (n =4, s =2, and the pipeline delays of both the adder and the multiplier are 2) 94 is 2×s× n 2 s =Θ(n 2 ) words. The design reads two words from and writes one word to the DRAM memory during each clock cycle. Therefore, the required DRAM memory bandwidth bw d =3b w . The number of I/O pins used is 24b w . 6.2.2.2 Design 2 From the discussion in Section 6.2.1.1, we know that a matrix multiplication algo- rithm that requires higher I/O bandwidth can achieve shorter latency. Therefore, we propose the second algorithm whose bw d is 3rb w (1 ≤ r ≤ n), and whose latency is approximately 1/r of that of Design 1. In Design 2, there is a linear array of n 2 r 2 s PEs. Each PE contains r 2 registers, r 2 floating-point multipliers, r 2 floating-point adders, and r 2 storage blocks of s words (1 ≤ s ≤ n r , 1 ≤ r ≤ n). For simplicity, we assume n is a multiple of both s and r. Each PE has 2r input ports and r output ports. Ports IOA 1 , IOA 2 , ..., IOA r are used to input r elements of A concurrently to the PEs. Ports IOB 1 , IOB 2 , ..., IOB r are used to input r elements of B. Ports IOC 1 , IOC 2 ,..., IOC r output r elements of C. As in Design 1, the lth PE receives the input data from the (l− 1)th PE and passes them to the (l+1)th PE. Figure 6.6 and Figure 6.7 show Design 2 and its architecture, respectively. In both of the figures, r =2; each PE contains two CoutBuffers, whose size is the same as the storage in the PE which stores the intermediate results of C. In Design 2, C is partitioned into r 2 submatrices of size n r × n r , which are denoted as C xy (0 ≤ x, y ≤ r− 1). We use c xy ij (0 ≤ x, y ≤ r− 1, 0 ≤ i, j ≤ n r − 1)to denote elements in C xy . PE 0 computes c xy 00 , c xy n s ,0 ,..., c xy (n− n s ),0 ; PE 1 computes c xy ( 1 n ),1 , c xy ( 1 n + n s ),1 , ..., c xy (n−( n s − 1 n )),1 ; ...; PE l computes c xy ( l n ),(l mod n r ) , c xy ( l n + n s ),(l mod n r ) , ..., c xy (n−( n s − l n )),(l mod n r ) . The basic idea of Design 2 is to perform the calculations of C xy (0 ≤ x, y ≤ r− 1) in parallel, using r 2 MACs per PE. 
In Design 2, we partition A and B into r 95 for PE l =0 to n 2 4s − 1 (in parallel) do for 1to n 2 cycles do shifts data in RR1.B and RR2.B right to PE l+1 if RR1.B = b 0 h,l mod s then copy it into RR1.B1 end if if RR2.B = b 1 h,l mod s then copy it into RR2.B1 end if end for for n 2 +1 to n 2 2 + n 2 cycles do shifts data in RR1.A, RR2.A, RR1.B, RR2.B right to PE l+1 if RR1.B = b 0 hj then copy it into RR1.B1 or RR1.B2 alterna- tively end if if RR2.B = b 1 hj then copy it into RR2.B1 or RR2.B2 alterna- tively end if {i = l s , l s + s ,...,n− ( n s − l n )} if RR1.A = a 0 ih then {b 0 jh is in either RR1.B1 or RR1.B2} c 00 ij = c 00 ij + a 0 ih × b 0 hj c 01 ij = c 01 ij + a 0 ih × b 1 hj end if if RR2.A = a 1 ih then {b 1 jh is in either RR2.B1 or RR2.B2} c 10 ij = c 10 ij + a 1 ih × b 0 hj c 11 ij = c 11 ij + a 1 ih × b 1 hj end if end for end for Figure 6.6: Design 2 for matrix multiplication, r =2: describes the cycle-specific behavior of each PE 96 External Memory PE 0 PE l PE n/2-1 RR1.A RR1.B RR1.B1 RR1.B2 PE l From PE l-1 To PE l+1 RR2.A RR2.B RR2.B1 RR2.B2 MULT MULT MULT MULT + + + + To PE l-1 From PE l+1 CoutBuffer1 CoutBuffer2 Figure 6.7: Architecture of Design 2 for matrix multiplication, r =2 submatrices. A 0 contains rows 0, ..., n r − 1 of A; A 1 contains rows n r , ..., 2n r − 1 of A; and so on. B 0 contains columns 0, ..., n r − 1 of B; B 1 contains columns n r , ..., 2n r − 1 of B; and so on. The elements of the submatrices of A and B are denoted as a x ih and b y hj respectively, 0≤ x, y≤ r− 1, 0≤ i, j ≤ n r − 1, 0≤ h≤ n− 1. As in Design 1, matrices A and B are fed to the PEs in column-major and row- major pattern respectively. B starts n r cycles earlier than A. During each clock cycle, one element from each of the r submatrices of A is fed into the PEs; similarly, r elements of B are fed into the PEs. We group n r clock cycles into a phase. In phase 0, the first row of matrix B is fed into PE 0 . In phase h, column h of matrix A and row h +1 of matrix B enter PE 0 in order. As b y hj traverses the linear array, they are stored in registers by the PEs that compute c xy ij until a x ih passes through. When 97 the PE which is in charge of c xy ij is not idle, the r 2 MACs in the PE are performing (c xy ij ) =(c xy ij ) + a x ih × b y hj , 0 ≤ x, y ≤ r− 1. Thus, each element is shared among r MACs. This sharing allows r 2 PEs to perform computations concurrently with 2r registers and 2r input ports. a 22 a 23 a 32 a 33 a 02 a 03 a 12 a 13 a 20 a 21 a 30 a 31 a 00 a 01 a 10 a 11 c 00 c 02 c 20 c 22 PE 0 PE 1 PE 2 PE 3 A 0 A 1 B 0 B 1 b 00 b 10 b 20 b 30 b 01 b 11 b 21 b 31 b 02 b 12 b 22 b 32 b 03 b 13 b 23 b 33 c 01 c 03 c 21 c 23 c 10 c 12 c 30 c 32 c 11 c 13 c 31 c 33 c 20 c 21 c 30 c 31 c 22 c 23 c 32 c 33 c 00 c 01 c 10 c 11 c 02 c 03 c 12 c 13 C 00 C 01 C 10 C 11 Figure 6.8: Computation illustrations for Design 2 (n=4, s=1, r =2) Figure 6.8 illustrates how Design 2 is applied to our example in Section 6.2.2.1, when r =2, s=1. There are 4 PEs in total. PE 0 calculates c 00 ,c 02 ,c 20 ,c 22 ; ...;PE 3 calculates c 11 ,c 13 ,c 31 ,c 33 . In phase 0, PE 0 stores b 00 and b 02 in registers; in phase 1, PE 0 updates c 00 = c 00 + a 00 × b 00 , c 02 = c 02 + a 00 × b 02 , c 20 = c 20 + a 20 × b 00 , and c 22 = c 22 + a 20 × b 02 . In phase 2, the first column of A reaches PE 3 , and PE 3 starts updating elements of C. Similar as with Design 1, we can prove Design 2 performs correctly. 
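The block structure that Design 2 exploits is easy to verify in software: with A split into r row blocks and B into r column blocks, each submatrix C^{xy} is simply A^x B^y, and every element of A^x pairs with all r blocks of B (and vice versa), which is what allows the r² MACs in a PE to be fed from only 2r input ports. The NumPy sketch below checks only this decomposition; it is not a model of the systolic schedule.

```python
import numpy as np

def design2_blocks(A, B, r):
    """Design 2 decomposition: C^{xy} = A^x @ B^y for the r row blocks of A and
    the r column blocks of B; tiling the r*r results rebuilds C = A @ B."""
    n = A.shape[0]
    blk = n // r
    C = np.zeros((n, n))
    for x in range(r):          # each element of A^x is reused by r MACs ...
        for y in range(r):      # ... one per column block B^y
            C[x*blk:(x+1)*blk, y*blk:(y+1)*blk] = (
                A[x*blk:(x+1)*blk, :] @ B[:, y*blk:(y+1)*blk])
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
assert np.allclose(design2_blocks(A, B, r=2), A @ B)
```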
In Design 2, the final result of c n−1,n−1 is available at (n+(n− 1) n r + n 2 r 2 s −1+ α)th cycle. As c n−1,n−1 takes n 2 r 2 s clock cycles to traverse back to the external memory, the matrix mul- tiplication completes in n 2 r + 2n 2 r 2 s +n− n r +α−1 cycles. Therefore, T = O(( 1 r + 2 r 2 s )n 2 ). As 1 ≤ r ≤ n and 1 ≤ s ≤ n r , 0 ≤ 1 r + 2 r 2 s ≤ 3, and thus T =Θ(n 2 ). According to Section 6.2.1.1, Design 2 achieves the asymptotically optimal latency. In Design 2, in each clock cycle, n floating-point multiplications and n floating- point additions are performed. Thus, k = n. The size of BRAM memory used by the 98 design m b is 2× s× r 2 × n 2 rs =Θ(n 2 ) words. The design reads 2r words from and writes r words to the DRAM memory during each clock cycle. Therefore, the required DRAM memory bandwidth bw d =3rb w . The number of I/O pins used is 24rb w . 6.2.2.3 Design 3 In Designs 1 and 2, m b is Θ(n 2 ) words. In the third design, m b equals the size of available BRAM memory on the FPGA M b , and can be smaller than n 2 b w . For the given k and m b , Design 3 minimizes the required memory bandwidth. The architecture of Design 3 is similar to that of Design 1, and is therefore not shown here. There are k PEs connected in a linear array, and the first PE, PE 0 ,is connected to the external memory. Each PE consists of one floating-point multiplier and one floating-point adder. However, different from Design 1, each PE in Design 3 has a local storage of size m b 2k to store intermediate results of C, a storage of size m b 2k to store the final results of C, and 2 √ m b k registers. In Design 3, we are actually performing block matrix multiplication with a block size of b× b, where b = √ M b 2bw . When b = n, Design 3 is the same as Design 1. When b<n, the blocks are denoted as A fz and B zg , where f,g,z =0, 1,..., n b −1. Without loss of generality, we assume n is a multiple of b, and b is a multiple of k. In Design 3, for each block matrix multiplication A fz × B zg , A fz is read in the column-major pattern, while B zg is read in the row-major pattern. PE l is in charge of computing the lth, (k + l)th, ..., (( b k − 1)k + l)th columns of C fg . Before the computation starts, the first row of B zg is read into the architecture. As these b numbers traverse the linear array,PE l (l =0,...,k−1) stores the lth, (k+l)th, ..., (( b k −1)k+ l)th numbers into its registers. Afterwards, every b k clock cycles, one element of A fz and one element of B zg are read into the architecture. As a iq passes through one PE, 99 {for every block matrix multiplication} for PE l =0 to k− 1 (in parallel) do if l = j mod k then copy b qj to a register end if if q =1 (the first row of the B block) then shift b qj right to PE l+1 else shift b qj , a iq right to PE l+1 for every b qj in the registers do c ij <= c ij + a iq × b qj end for end if end for Figure 6.9: Design 3 for matrix multiplication: describes the cycle-specific behavior of each PE it is multiplied with every stored B element whose row index is q. The intermediate results for C fg are stored in the local storage of the PEs. Figure 6.10 illustrates Design 3 applied to our example in Section 6.2.2.1, for m b = 8, p=2. PE 0 computes column 0 and PE 1 computes column 1 of C 00 ,C 01 ,C 10 ,C 11 . Note that since the storage size in each PE is b=2, at any time each PE stores at most two intermediate results of C. C fg (f,g =0, 1,..., n b − 1) must be accumulated. 
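In software terms, Design 3 computes each C_fg as a running sum of block products, as in the NumPy sketch below. The sketch is a reference for the blocking only; how b is chosen from the BRAM budget and whether the accumulation can be done on chip without hazards are taken up next.

```python
import numpy as np

def design3_block_matmul(A, B, b):
    """Reference for Design 3's blocking: C_fg accumulates A_fz @ B_zg over z.
    In the hardware, b is set from the BRAM budget; here it is just a parameter."""
    n = A.shape[0]
    assert n % b == 0
    C = np.zeros((n, n))
    for f in range(0, n, b):
        for g in range(0, n, b):
            acc = np.zeros((b, b))                    # intermediate C_fg kept on chip
            for z in range(0, n, b):
                acc += A[f:f+b, z:z+b] @ B[z:z+b, g:g+b]   # one block multiply by the PE array
            C[f:f+b, g:g+b] = acc                     # final block written to external memory
    return C

rng = np.random.default_rng(1)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
assert np.allclose(design3_block_matmul(A, B, b=4), A @ B)
```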
According to the analysis of Design 1, the effective latency for each block matrix multiplication, T , equals ( b k ) 3 × k 2 = b 3 k . Thus, every T clock cycles, an intermediate result of C is generated. If T is larger than the number of pipeline stages in the adder α, we can simply use a floating- point adder for the accumulation without incurring any data hazards. Otherwise, we can perform the accumulation on a host processor. This accumulation is not considered in this thesis. In Design 3, we perform ( n b ) 3 block matrix multiplications. In addition, b clock cycles are needed to read the first row of the first B block. Therefore, the effective 100 a 22 a 23 a 32 a 33 a 02 a 03 a 12 a 13 a 20 a 21 a 30 a 31 a 00 a 01 a 10 a 11 b 00 b 10 b 20 b 30 b 01 b 11 b 21 b 31 b 02 b 12 b 22 b 32 b 03 b 13 b 23 b 33 c 00 c 10 c 02 c 12 c 20 c 30 c 22 c 32 PE 0 PE 1 c 01 c 11 c 03 c 13 c 21 c 31 c 23 c 33 c 20 c 21 c 30 c 31 c 22 c 23 c 32 c 33 c 00 c 01 c 10 c 11 c 02 c 03 c 12 c 13 C 00 C 01 C 10 C 11 Figure 6.10: Computation illustrations for Design 3 (n=4, m b =8, k=2) latency of Design 3 equals b+( n b ) 3 × b 3 k =Θ( n 3 k ). In this design, 3 words are exchanged with the DRAM memory every b k cycles. Thus, bw d = 3kbw b . The number of I/O pins used is 24b w . According to Section 6.2.1.1, the memory bandwidth required by Design 3 achieves the theoretical lower bound under the given storage size and the number of PEs. 6.2.3 Matrix Multiplication on Reconfigurable Computing Systems All of the designs proposed in Section 6.2.2 employ only the BRAM memory. In this section, we propose a design that employs both the BRAM memory and the SRAM memory in reconfigurable computing systems. We first partition matrices A and B into b 2 ×b 2 blocks. These blocks are denoted as A iq and B qj (i, j, q =0, 1,..., n b 2 −1). Each A iq (B qj ) is further partitioned into smaller blocks of size b 1 × b 1 , which are denoted as A iq fz and B qj zg (g,h,z =0, 1,..., b 2 b 1 − 1). Without loss of generality, we assume n is a multiple of b 2 , and b 2 is a multiple of b 1 . The values of b 1 and b 2 are discussed later in this section. 101 The proposed architecture is shown in Figure 6.11. It contains Design 3 proposed in Section 6.2.2, which is labeled “MM” in the figure. MM multiplies two b 2 × b 2 blocks. The result of MM is given to a floating-point adder and is combined with the intermediate result of C. The internal storage of MM is implemented using the BRAM memory of the FPGA. As the total size of the internal storage equals 2b 2 1 ,we have b 1 = M b 2bw .For A iq fz × B qj zg (g,h,z =0, 1,..., b 2 b 1 − 1), the architecture uses one storage unit of size b 1 × b 2 to store the elements in B qj . It also contains another storage unit of size b 2 2 to store the intermediate results of C ij . These storage units are implemented using the SRAM memory. Therefore, we have M s =(b 2 2 + b 1 b 2 )b w . A B (SRAM) Adder C’ (SRAM) MM (BRAM) DRAM DRAM Figure 6.11: Architecture for matrix multiplication using multiple levels of memory In the proposed design, A iq is read in column-major order, and B qj is read in row- major order. The computations start by reading the first row of b 1 × b 1 blocks of B qj into the FPGA. These blocks are stored in the SRAM memory. Afterward, every b 2 1 b 2 k clock cycles, one block of A iq and one block of B qj are read into the architecture. A iq fz is multiplied by MM with b 2 b 1 blocks of B qj that are stored in the FPGA. 
The output of 102 MM is accumulated with the stored intermediate results by the adder in the architec- ture. If the adder generates a final element of C, the element is output to the external memory; otherwise, the element is written back to the storage unit. The effective latency of each b 1 ×b 1 block matrix multiplication equals b 3 1 k (Section 6.2.2). In our design, ( n b 1 ) 3 block matrix multiplications are performed. Including the time for inputting the first row of B gj , the total latency of our design is: T = b 2 b 1 × b 2 1 +( n b 1 ) 3 × b 3 1 k =Θ( n 3 k ) (6.2) To achieve this latency, one b 1 ×b 1 block of A iq and one block of B qj are read from the DRAM memory every b 2 b 1 × b 3 1 k cycles. The design outputs b 2 2 words to the DRAM memory every b 3 2 k cycles. Thus, the required DRAM memory bandwidth is bw d = 3kbw b 2 . Each PE in MM reads one word from and writes one word to each of its two local storage units, hence bw b =4kb w . As for the SRAM memory, one b 1 × b 1 block of B qj is written into it every b 2 b 1 × b 3 1 k cycles. Also, in every clock cycle, two intermediate results of C ij are read from and written to it respectively. Thus, bw s =(2+ k b 2 )b w . The proposed design can also be implemented on multiple FPGAs in the reconfig- urable computing systems. In this case, p FPGAs are connected in a linear array. The FPGAs are labeled from left to right, as FPGA 0 , FPGA 1 ,..., FPGA p−1 . FPGA 0 reads elements of A and B from the DRAM memory, and the elements traverse the linear array of FPGAs. During the computations, as the b 2 b 1 blocks of B qj traverse the linear array of FPGAs, FPGA h stores blocks h, (p+h), ..., (( b 2 b 1 p −1)p+h) into the BRAM memory. As A iq fz passes through an FPGA, it is multiplied by MM with b 2 b 1 p blocks of B qj that are stored in the FPGA. The final results of C traverse the linear array from right to left, and are written back to the DRAM memory by FPGA 0 . 103 The storage unit of the elements in B qj is now of size b 1 b 2 p ; the storage unit of the intermediate results of C ij is of size b 2 2 p . When a final element of C is generated, it is transferred to FPGA h−1 ifh> 0; otherwise, it is written to the external memory. Each FPGA generates b 2 2 p final elements of C in consecutive clock cycles. When FPGA h is transferring these elements (h> 0) or writing them to the external memory (h =0), the elements transferred to the FPGA from its right neighbor must be stored. After- ward, the FPGA begins transferring or writing the stored final results. Therefore, the architecture needs another storage unit of size b 2 2 p for the final results of C. This storage is also implemented using the SRAM memory. A B (SRAM) Adder C’ (SRAM) MM (BRAM) C (SRAM) DRAM FPGA 0 FPGA 1 ... FPGA p-1 Figure 6.12: Architecture for matrix multiplication on multiple FPGAs The architecture is shown in Figure 6.12. When p FPGAs are employed, each FPGA performs n 3 b 3 1 p block matrix multiplication. Therefore, the effective latency of the design T =Θ( n 3 kp ), with bw d = 3kbw b 2 . According to the analysis in Section 6.2.1.1, 104 this design minimizes the latency as well as the requirement on the DRAM memory bandwidth with the given hardware resources. 6.3 Matrix Factorization The LU decomposition method factors an n×n matrix A into an n×n lower triangular matrix L (the diagonal entries are all 1) and an n×n triangular matrix U. We use a ij , l ij and u ij to denote the elements of A, L and U, respectively (0≤ i, j ≤ n− 1). 
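For a small illustrative instance (numbers chosen by hand, not taken from the thesis):

\[
\begin{pmatrix} 4 & 2 \\ 8 & 7 \end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 2 & 1 \end{pmatrix}
\begin{pmatrix} 4 & 2 \\ 0 & 3 \end{pmatrix},
\]

so l_{10} = 8/4 = 2 and u_{11} = 7 − 2·2 = 3, with the unit diagonal of L as defined above.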
6.3 Matrix Factorization

The LU decomposition method factors an $n \times n$ matrix $A$ into an $n \times n$ lower triangular matrix $L$ (whose diagonal entries are all 1) and an $n \times n$ upper triangular matrix $U$. We use $a_{ij}$, $l_{ij}$ and $u_{ij}$ to denote the elements of $A$, $L$ and $U$, respectively ($0 \leq i, j \leq n-1$).

In [20], a sequential algorithm for LU decomposition is discussed. It consists of three main steps, which are listed below. As is customary in hardware implementations of matrix factorization, we assume no pivoting is needed.

• Step 1: The column vector $a_{i0}$ ($1 \leq i \leq n-1$) is multiplied by the reciprocal of $a_{00}$. The resulting column vector is $l_{i0}$. $u_{0j}$ equals $a_{0j}$ ($1 \leq j \leq n-1$).

• Step 2: $l_{i0}$ is multiplied by the row vector $u_{0j}$. The product is subtracted from the submatrix $a_{ij}$ ($1 \leq i, j \leq n-1$).

• Step 3: Steps 1 and 2 are recursively applied to the new submatrix generated in Step 2. During the $t$th iteration, $l_{it}$ and $u_{tj}$ ($t+1 \leq i, j \leq n-1$) are generated. When $t = n-1$, the decomposition is complete.

6.3.1 Operation Analysis

During the $t$th iteration, $(n-1-t)^2$ multiplications, $(n-1-t)^2$ subtractions, and $(n-1-t)$ divisions are performed. Therefore, LU decomposition requires a total of $\sum_{t=0}^{n-1}(n-1-t) \approx \frac{n^2}{2}$ divisions, and $\sum_{t=0}^{n-1}(n-1-t)^2 \approx \frac{n^3}{3}$ multiplications and $\frac{n^3}{3}$ additions/subtractions each. Because of the large area of the FPGA-based floating-point divider, we perform at most one division each clock cycle. For multiplications and additions/subtractions, we set $k' = k$.

It has been shown that the upper bound on the I/O complexity of LU decomposition is the same as that of matrix multiplication, and that pivoting does not change this bound [73]. Thus, the lower bound on the latency of LU decomposition is $T \geq \max(T_{comp}, T_{I/O}) \geq \max\left(\max\left(\frac{n^3}{3k}, \frac{n^2}{2}\right), \frac{n^3 b_w}{\sqrt{m}\, bw_d}\right)$ ($m \leq n^2$). As in matrix multiplication, we consider only cases where $k \leq n$ and $m \leq n^2$. In our designs, matrix $A$ is initially stored in the DRAM memory. When LU decomposition is complete, matrices $L$ and $U$ are written back to the DRAM memory.

6.3.2 Design Issues

We first propose an extension of the design in [34], which is denoted as Design 1. Based on this extension, we discuss several design issues in FPGA-based LU decomposition. In Design 1, there are $k+1$ PEs connected in a circular linear array as in [34]. The first PE, PE$_0$, contains a floating-point divider, while each of the other PEs contains a floating-point multiplier and a floating-point adder/subtractor. Matrix $A$ is fed into the architecture in column-major order. As $a_{i0}$ travels through PE$_0$, it is divided by $a_{00}$, and the resulting $l_{i0}$ is stored in PE$_1$ ($1 \leq i \leq n-1$). $a_{01}$ ($= u_{01}$) is multiplied with $l_{i0}$, and the products are subtracted from $a_{i1}$ ($1 \leq i \leq n-1$). The subtraction results traverse the circular linear array back to PE$_0$, and are then divided by $u_{11} = a_{11} - a_{01} \times l_{10}$. The resulting $l_{i1}$ is stored back in PE$_1$ ($2 \leq i \leq n-1$). In particular, PE$_h$ stores columns $\frac{n}{k}(h-1)$, $\frac{n}{k}(h-1)+1$, ..., $\frac{n}{k}(h-1)+\frac{n}{k}-1$ of $L$ ($1 \leq h \leq k$).

As column $q$ of matrix $A$ traverses the PEs, Design 1 calculates the $q$th columns of $L$ and $U$. To generate $u_{iq}$ ($i < q$), the row vector $l_{ij}$ is multiplied with the column vector $a_{jq}$ ($0 \leq j \leq i$), and the result of the dot product is subtracted from $a_{iq}$. To generate $l_{iq}$ ($i \geq q$), the row vector $l_{ij}$ is multiplied with the column vector $a_{jq}$ ($0 \leq j \leq q$); the result of the dot product is subtracted from $a_{iq}$, and the resulting $a_{iq}$ is then divided by $u_{qq}$. The computational flow is shown in Figure 6.13.

Figure 6.13: Computational flow of Design 1 for LU decomposition

Design 1 stores $n-q$ words for column $q$ of $L$ ($1 \leq q \leq n-1$). Thus, it requires a storage of $\frac{n^2}{2}$ words.
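For reference, the sequential elimination that Design 1 and the design proposed below both parallelize (Steps 1 to 3, with no pivoting) can be written compactly as the following plain Python sketch. It is a software rendering of the algorithm only, not the hardware schedule of any of the FPGA designs.

```python
def lu_no_pivot(A):
    """In-place LU decomposition without pivoting.

    After the call, the strictly lower triangle of A holds L (the unit
    diagonal is implied) and the upper triangle holds U, mirroring
    Steps 1-3: scale the pivot column, then update the trailing submatrix.
    """
    n = len(A)
    for t in range(n - 1):
        # Step 1: l_it = a_it / a_tt for i > t (a_tt = u_tt when no pivoting).
        for i in range(t + 1, n):
            A[i][t] /= A[t][t]
        # Step 2: subtract the outer product l_it * u_tj from the submatrix.
        for i in range(t + 1, n):
            for j in range(t + 1, n):
                A[i][j] -= A[i][t] * A[t][j]
    return A

# Example on a small 3x3 matrix.
M = [[4.0, 3.0, 2.0],
     [8.0, 7.0, 3.0],
     [4.0, 6.0, 5.0]]
print(lu_no_pivot(M))
```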
However, as Design 1 mainly executes $\sum_{j=0}^{\min(i,q)} l_{ij} a_{jq}$, multiplication results are generated serially and must be accumulated in each PE. Moreover, if two neighboring PEs perform a dot product together, their intermediate results must also be accumulated. In this case, when pipelined multipliers and adders/subtractors are used, read-after-write data hazards may arise. To avoid such hazards, we can interleave the computations of multiple matrices, that is, stack matrices. In this case, the storage size required by the design increases to $\frac{n^2(\alpha+\beta)}{2}$, where $\alpha$ and $\beta$ are the pipeline latencies of the adder/subtractor and the multiplier, respectively. Thus, the first design issue for FPGA-based LU decomposition is to avoid data hazards without greatly increasing the storage requirement.

The second design issue is to effectively utilize the parallelism provided by the architecture. In Design 1, PE$_h$ stores $\frac{n}{k}$ consecutive columns of $L$. Therefore, when the first $\frac{n}{k}$ columns of $A$ traverse the PEs, only PE$_1$ is performing floating-point operations while the other PEs are idle; for the next $\frac{n}{k}$ columns of $A$, only PE$_1$ and PE$_2$ are working, and so on. Only for the last $\frac{n}{k}$ columns are all $k$ PEs working in parallel. The latency of Design 1 is approximately $\frac{2n^3}{3k}$ cycles when $n > k$. As we show later, lower latency can be achieved if the computational parallelism of the PEs is fully exploited.

6.3.3 Proposed Design

Based on the analysis in the previous section, we propose a second design, Design 2. This design parallelizes the sequential algorithm described at the beginning of Section 6.3. We first assume our design employs only the BRAM memory for internal storage.

The architecture of our design is shown in Figure 6.14. It contains a circular linear array of $k+1$ PEs. The PEs are labeled PE$_0$, PE$_1$, ..., PE$_k$ from left to right. Each PE has two input ports and two output ports, denoted inL, outL, inU and outU, respectively. Each PE is connected to its neighboring PEs only, while PE$_k$ is connected to PE$_{k-1}$ and PE$_0$. PE$_0$ contains a floating-point divider and a storage unit of size $n$ words, denoted $S_0$, which stores $u_{ii}$ ($0 \leq i \leq n-1$). PE$_h$ ($1 \leq h \leq k$) contains a floating-point multiplier and a floating-point adder/subtractor. PE$_h$ uses two storage units: $S1_h$ is of size $\frac{(n-1)^2}{k}$ and stores the intermediate results $a'_{ij}$; $S2_h$ is of size $\frac{n-1}{k}$ and, in the $t$th iteration, stores $u_{tj}$ for $j = h, h+k, \dots, h+(\frac{n-1}{k}-1)k$. For simplicity, we assume $n-1$ is a multiple of $k$.

Figure 6.14: Architecture of Design 2 for LU decomposition

Our algorithm contains $n-1$ iterations. During each iteration, the architecture goes through three stages. In Stage 1 of the $t$th iteration ($0 \leq t \leq n-2$), $a_{t,t+1}$, $a_{t,t+2}$, ..., $a_{t,n-1}$ traverse the linear array through ports inU and outU. PE$_h$ performs $u_{tj} = a_{tj} - a'_{tj}$, where $a'_{tj}$ is the intermediate result stored in $S1_h$ and $j = h, h+k, \dots, h+(\frac{n-1}{k}-1)k$. The resulting elements of $U$ are stored in $S2_h$ ($1 \leq h \leq k$) and will be used in Stage 3.

In Stage 2, $a_{tt}$, $a_{t+1,t}$, ..., $a_{n-1,t}$ enter the architecture through port inU, and $a_{it} = a_{it} - a'_{it}$ is performed ($t \leq i \leq n-1$). These subtraction results traverse the circular array through ports outL and inL. After they reach PE$_0$, they are divided by $a_{tt}$ ($= u_{tt}$) to generate $l_{it}$. These elements of $L$ are then used in the next stage. Note that when $t = 0$, $l_{i0} = a_{i0}$ ($1 \leq i \leq n-1$).
During Stage 3, $l_{t+1,t}$, $l_{t+2,t}$, ..., $l_{n-1,t}$ are fed into PE$_1$ through port inL. Each of them is multiplied with the elements stored in $S2_h$ in PE$_h$ ($1 \leq h \leq k$). The intermediate results in $S1_h$ are updated using $a'_{ij} = a'_{ij} + l_{it} \times u_{tj}$ ($t+1 \leq i, j \leq n-1$). The computational flow of Design 2 is shown in Figure 6.15.

Figure 6.15: Computational flow of Design 2 for LU decomposition

In iteration $t$, Stage 1 takes $(n-t-1+k+\alpha)$ cycles and Stage 2 takes $(n-t-1+k+\alpha+\gamma)$ cycles. Note that there is no dependency between Stages 1 and 2. Thus, Stage 2 can start immediately after $a_{t,n-1}$ enters the architecture. In this case, these two stages together need $(2(n-t-1)+k+\alpha+\gamma)$ cycles. In Stage 3 of iteration $t$, $l_{it}$ must be multiplied with at most $\lceil\frac{n-t-1}{k}\rceil$ elements in each PE ($t+1 \leq i \leq n-1$). Thus, Stage 3 requires $(\lceil\frac{n-t-1}{k}\rceil(n-t-1)+\beta+\alpha)$ cycles. Therefore, the total latency of our design equals:

$$T = \sum_{t=0}^{n-1}\left(2(n-t-1)+k+\alpha+\gamma\right) + \sum_{t=0}^{n-1}\left(\left\lceil\tfrac{n-t-1}{k}\right\rceil(n-t-1)+\beta+\alpha\right) \leq \frac{n}{6k}(n-1)(2n-1) + \frac{1}{2}(3n^2-3n-2) + (\beta+2\alpha+\gamma+k)n \approx \frac{n^3}{3k} \quad (n > k) \quad (6.3)$$

where we use $\lceil\frac{n-t-1}{k}\rceil \leq \frac{n-t-1}{k}+1$ for all $t$.

By overlapping the computations in Stage 3 and Stage 1, we can further reduce the latency of Design 2. In this case, the latency of Design 2 is reduced but is still about $\frac{n^3}{3k}$ cycles. According to the discussion in Section 6.3.1, our design achieves the minimum latency, and hence the optimal performance of LU decomposition, when $k < n$. Note that when $\lceil\frac{n-t-1}{k}\rceil(n-t-1)$ is smaller than $\alpha+\gamma$, $a_{t+1,t+2}$ may be read before it is correctly updated. To avoid such data hazards, zero padding is needed so that Stage 3 takes at least $\alpha+\gamma$ clock cycles.

In the proposed design, $m_b = (n-1)^2 + n$. As the design needs to read $n^2$ words from and write $n^2$ words to the DRAM memory in $\frac{n^3}{3k}$ cycles, $bw_d \approx \frac{6kb_w}{n}$.

6.3.3.1 Block LU Decomposition

When $n^2$ is larger than the size of the available on-chip memory, the block LU decomposition algorithm is used. In this algorithm, matrix $A$ is partitioned into blocks of size $b_1 \times b_1$. The value of $b_1$ is determined by the size of the BRAM memory. The detailed block LU algorithm is presented in Section 7.3.2. Its computational flow is similar to that of the algorithm in [20], except that the basic computational unit is now a block instead of a word. In the architecture for block LU decomposition, we need one set of PEs for $b_1 \times b_1$ matrix factorization, one set of PEs for $b_1 \times b_1$ matrix multiplication, and a single PE for matrix subtraction. Such an architecture is proposed in [22].

Suppose the numbers of PEs for matrix factorization and matrix multiplication are $k_1$ and $k_2$, respectively. During block LU decomposition, the PEs for matrix factorization are used about $\frac{n^2}{b_1^2}$ times and the PEs for matrix multiplication are used about $\frac{n^3}{3b_1^3}$ times. Thus, the latency of block LU decomposition is determined by the latency of matrix multiplication, and is approximately $\frac{n^3}{3k_2}$ cycles. The PE sets for matrix factorization and matrix multiplication require storage spaces of size $b_1^2$ and $2b_1^2$, respectively. Therefore, $m_b = 3b_1^2$. As each PE reads one word from and writes one word to each of its local storage units, $bw_b = 6kb_w$. The external bandwidth $bw_d$ is approximately $\frac{2n^2 b_w}{n^3/(3k_2)} = \frac{6k_2 b_w}{n}$. Including the output pins, the number of I/O pins is $3b_w$.
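The closed form in Equation 6.3 can be checked numerically by summing the per-iteration cycle counts of Design 2 directly. The Python sketch below does this; the pipeline depths in the example call are placeholders rather than the latencies of any particular floating-point core.

```python
import math

def design2_cycles(n, k, alpha, beta, gamma):
    """Sum the per-iteration cycle counts of Design 2 (Equation 6.3).

    Stages 1 and 2 of iteration t overlap into 2(n-t-1) + k + alpha + gamma
    cycles; Stage 3 needs ceil((n-t-1)/k)*(n-t-1) + beta + alpha cycles.
    """
    total = 0
    for t in range(n):
        s = n - t - 1
        total += 2 * s + k + alpha + gamma            # Stages 1 and 2
        total += math.ceil(s / k) * s + beta + alpha  # Stage 3
    return total

# Example with placeholder pipeline depths: n = 1024, k = 8 PEs,
# adder depth 10, multiplier depth 8, divider depth 30.
n, k = 1024, 8
print(design2_cycles(n, k, 10, 8, 30), "vs n^3/(3k) =", n ** 3 // (3 * k))
```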
6.3.4 Hierarchical Design

In the above discussion, only the BRAM memory of the FPGA is used. To utilize the SRAM memory, we employ a hierarchical design. Matrix $A$ is partitioned into blocks of size $b_2 \times b_2$, which are further partitioned into blocks of size $b_1 \times b_1$. The design in Section 6.3.3.1 is used to perform $b_2 \times b_2$ matrix factorization. For $b_2 \times b_2$ matrix multiplication, our proposed design in Section 6.2 is employed, and the SRAM memory is used to store the intermediate results. In this case, $m_s = 2b_2^2 + b_1 b_2$ and $bw_s = (2 + \frac{k}{b_2^2})b_w$.

6.4 Experimental Results

In this section, we first analyze the design trade-offs by tuning the values of the design parameters for both matrix multiplication and LU decomposition on a Xilinx Virtex-II Pro XC2VP50. The performance of our designs on a Xilinx Virtex-II Pro XC2VP100 is then compared with designs on general-purpose processors. Next, the performance of our designs on a chassis of XD1 is reported. Their projected performance when improved floating-point cores become available is also presented.

6.4.1 Performance Analysis

For matrix multiplication, we use Design 3 to perform the design trade-off analysis. In this design, the number of PEs is $k$, whose value is determined by the total available area and the area of one PE. Each PE contains one floating-point adder, one floating-point multiplier, some BRAMs, and a control unit. The control unit is in charge of the routing and control circuitry inside the PE. The area of the design increases linearly with $k$, as shown in Figure 6.16(a). On the other hand, as more PEs are used, more routing resources are needed and the routing becomes more complex. Thus, the achievable clock speed of the design degrades as $k$ increases. We can configure at most 10 PEs on the target device. In this case, the maximum achievable clock speed is 125 MHz.

Figure 6.16: Performance analysis for matrix multiplication ($n = 2048$): (a) area and clock speed vs. $k$; (b) latency vs. $k$ and $b_2$; (c) memory bandwidth vs. $k$ and $b_2$

The other two parameters are the block sizes $b_1$ and $b_2$. Their values are determined by the available BRAM memory and SRAM memory. According to the discussion in Section 6.2.2.3, neither the latency nor the required DRAM memory bandwidth depends on $b_1$. Thus, we set $b_1 = k$ so that we need to tune only $k$ and $b_2$. In our experiments, we vary $b_2$ from 16 to 128. Figure 6.16(b) shows that the latency still decreases proportionally with $k$ even though the clock speed decreases with $k$. This is because the degradation in the clock speed is not as fast as the decrease in $T$. Figure 6.16(b) also shows that the latency is not affected by $b_2$. However, the required external memory bandwidth depends on both $k$ and $b_2$, as shown in Figure 6.16(c). A larger $k$ results in a higher required DRAM memory bandwidth. On the other hand, a larger $b_2$ reduces the requirement on the DRAM memory bandwidth.

In the design for LU decomposition, there are $k+1$ PEs. PE$_0$ contains a floating-point divider and a control unit that generates the control signals for the other PEs. The area of PE$_0$ is much larger than that of the other PEs, and its clock speed is bounded by that of the divider. PE$_0$ occupies 4112 slices, while each of the other PEs takes up 2136 slices. In our analysis, we set the problem size to 128.
We do not consider block LU decomposition here because its performance is determined by that of matrix multiplication. For the target device, $k+1 \leq 6$. The area of the design increases linearly with $k$, as shown in Figure 6.17(a). On the other hand, the clock speed of the design decreases with the number of PEs, due to the increased routing complexity. However, the speed of the entire design is bounded by that of the divider when $k+1 \leq 6$. Therefore, the latency decreases proportionally with $k$, as shown in Figure 6.17(a). Actually, the total degradation in clock speed is less than 25% when $k+1$ increases from 1 to 18. The required memory bandwidth increases with $k$, as shown in Figure 6.17(b).

Figure 6.17: Performance analysis for LU decomposition ($n = 128$): (a) area and latency vs. $k$; (b) memory bandwidth vs. $k$

6.4.2 Performance Comparison

We next compare the sustained performance of our designs for matrix multiplication and matrix factorization with designs based on general-purpose processors. For this purpose, we use the Xilinx Virtex-II Pro XC2VP100 as the target device, which contains 44,096 slices and about 1 MB of on-chip memory.

For Design 3 for matrix multiplication, one XC2VP100 can hold at most 20 PEs and runs at 110 MHz. Thus, our design for matrix multiplication achieves a sustained performance of about 4 GFLOPS. Such performance compares favorably with general-purpose processors. For example, a Xeon processor-based platform at 3.2 GHz with a 1 MB L3 cache achieves only 5.5 GFLOPS performing 64-bit matrix multiplication, while a 3-GHz Pentium 4 processor with a 512 KB L2 cache reaches 5.0 GFLOPS [40]. These numbers are collected when the processors execute the Intel Math Kernel Library, which has been optimized to exploit the best performance of the Intel processors [40].

For LU decomposition, 18 PEs are configured on the XC2VP100, and the design can be run at 110 MHz. The size of the matrices ranges from $100 \times 100$ to $1000 \times 1000$. When $n > 128$, we use the block LU decomposition algorithm, where $k_1 = 6$ and $k_2 = 12$. We compare against a 2.2 GHz AMD Opteron processor with an L1 cache of 64 KB and an L2 cache of 1 MB. On the processor, two versions of the LU decomposition routine, dgetrf, are executed. One is from netlib LAPACK, and the other is from ACML [2]. ACML is a set of numerical routines tuned specifically for AMD64 platform processors (including Opteron and Athlon 64). The MFLOPS performance of both versions and of our work is shown in Figure 6.18. We see that our design achieves more than 2X speedup over netlib dgetrf. Our design also achieves better performance than ACML dgetrf for certain problem sizes.

Figure 6.18: Comparison of sustained performance of Design 2 and processor-based designs for LU decomposition

Note that our designs are device independent. In particular, we did not utilize manual placement or any other device-dependent optimizations in our experiments. Otherwise, the sustained floating-point performance of our designs could be further increased.

6.4.3 Performance Projection

In our designs, the implementations of the floating-point cores have no effect on the architecture or the control logic.
Therefore, when improved floating-point cores become available, they can be plugged into the design easily. In our designs for matrix multiplication and matrix factorization, most of the area is occupied by the floating-point cores, and the clock speed is largely bounded by that of those cores. For example, in Design 3 for matrix multiplication, the control unit occupies less than 30% of the area of the PE. Additionally, the area of the control logic for the entire architecture is less than 5% of the total area. In our experiments, when the area of the adder and the multiplier is reduced, the control overheads in the PEs remain constant and the area of the design decreases. On the other hand, as the achievable clock speed of the MAC increases, the clock speed of the PE increases almost linearly.

Thus, when the performance of the cores is improved, the performance of the designs improves accordingly. Based on this observation, we project the performance of our designs as the area of the floating-point cores is reduced to 50% and the clock speed is increased by 50%. Note that 25% of the performance is deducted to account for the clock speed degradation caused by routing. As shown in Figure 6.19, our designs can achieve up to 19.5 GFLOPS and 11.7 GFLOPS for matrix multiplication and LU decomposition on one XC2VP100, respectively. Note that even with the smallest and fastest floating-point cores, the memory bandwidth requirements of our designs are met by the available bandwidth of XD1.

Figure 6.19: Performance projection: (a) projected performance for matrix multiplication; (b) projected performance for LU decomposition

6.4.4 Implementations on XD1

We implemented our design for matrix multiplication on one chassis of XD1. In our implementation, $p$, the number of FPGAs, varied from 1 to 6. Since $2b_2^2$ cannot be larger than the total size of the SRAM (96 MB), we set $b_2 = 2048$. At most 8 PEs can be configured on the FPGA; that is, $k = b_1 = 8$. Every $\frac{b_1^2 b_2}{kp}$ cycles, FPGA$_h$ exchanges at most three $b_1 \times b_1$ blocks with the DRAM if $h = 0$, or with FPGA$_{h-1}$ if $h > 0$. For $k = b_1 = 8$ and $p = 1$, $bw_d = 438.1$ MB/s. This bandwidth is much smaller than the available DRAM bandwidth $B_d$ in XD1. As for the SRAM memory, one word is read from and one word is written into the storage unit for the intermediate results of $C$ in every clock cycle. When multiple FPGAs are employed, one $b_1 \times b_1$ block is read from and written into the storage unit for the final results of $C$ every $\frac{b_2^2}{kp}$ clock cycles. Thus, $bw_s$ is at most 2.1 GB/s + 73 MB/s $\approx$ 2.1 GB/s. This bandwidth is much smaller than the available SRAM bandwidth in XD1.

Figure 6.20 shows the performance of our design as $p$ increases from 1 to 6. We see that our design is scalable over multiple FPGAs, and achieves about 12 GFLOPS with 6 FPGAs.

Figure 6.20: Performance of matrix multiplication on multiple FPGAs

Chapter 7
Hybrid Designs on Reconfigurable Computing Systems

In the previous chapters, we have discussed optimized designs for linear algebra that employ only the FPGAs in reconfigurable computing systems. In this chapter, a design methodology is presented for developing hybrid designs that utilize both the processors and the FPGAs in these systems for computations.
This methodology performs hardware/software co-design, maintains load balancing in the system, and addresses the coordination between the processors and the FPGAs.

We study several example applications: matrix multiplication, block LU decomposition for matrix factorization, and the block Floyd-Warshall algorithm [20] for the all-pairs shortest-paths problem. For each application, we first identify its various tasks. Based on the inherent attributes of the tasks as well as the system parameters, workload partitioning between the hardware and the software is performed. The partitioning is further optimized by overlapping the computations with the data transfer and network communications. Within each node, the computations on the FPGAs and processors are coordinated so that data hazards and memory access conflicts are avoided. The load among multiple nodes is balanced so that none becomes a performance bottleneck.

The tasks in matrix multiplication are block multiplications, which are allocated to the nodes and partitioned between the processors and the FPGAs. For block LU decomposition, only one type of task is suitable for partitioning and is executed on both the processors and the FPGAs; the other types of task run on the processor of one node only. None of the tasks of the block Floyd-Warshall algorithm is partitioned between the hardware and the software. Instead, the tasks are assigned entirely either to the processors or to the FPGAs.

Using 6 nodes in one chassis of XD1, our hybrid design achieves 22 GFLOPS for matrix multiplication. For block LU decomposition and block Floyd-Warshall, our designs achieve 20 GFLOPS and 6.6 GFLOPS, respectively. Moreover, our designs achieve up to 90% of the sum of the computing power of the nodes and more than 85% of the performance predicted based on the design model.

7.1 Related Work

In [60], the authors partition the tasks in a molecular dynamics simulation system between the processor and the FPGA. The most intensive task is executed on the FPGA and the rest are executed by the processor. They modeled the performance of several alternatives for the tasks mapped to the FPGA and implemented one of these alternatives on one SRC-6 MAPstation. A design for Conjugate Gradient (CG) that uses both the FPGAs and the processor in an SRC-6 MAPstation is proposed in [50]. In that work, a high-level language (HLL)-to-hardware description language (HDL) compiler is used in conjunction with custom-built, VHDL-based, floating-point components. The most computationally intensive task of the CG solver is implemented on the FPGAs while the other tasks run on the processor. In contrast to these results, we propose a generic design methodology that can be applied to a class of applications. In addition, our designs utilize both the processors and the FPGAs for computations by partitioning the computationally intensive tasks. Moreover, our work uses multiple nodes in reconfigurable computing systems.

Various programming models have been proposed for hardware/software co-design on systems with FPGA accelerators. The work in [8] starts from a high-level representation and provides methods to explore hardware/software trade-offs based on profiling of the source code. C code is automatically derived, and critical kernels are tagged for FPGA implementation. In [4], a unifying programming model for specifying threads within an application is presented.
Threads are specified from a single multithreaded application program and compiled to run on the processor or synthesized to run on the FPGA. These efforts aim to facilitate programming and compilation on a single node in reconfigurable computing systems. Our work, on the other hand, focuses on the full utilization of multiple nodes in these systems. Our work can be integrated into these programming models to provide high-performance hardware/software co-designs.

7.2 Design Methodology

We propose a design methodology for hybrid designs on reconfigurable computing systems, assuming each node consists of one processor and one FPGA. We target computationally intensive applications suitable for hardware acceleration, such as matrix computations. Our methodology has the following steps:

1. Identify the various tasks inside an application. The computational complexity of the tasks and the dependencies among them are then analyzed.

2. Characterize the hardware resources and the computing capacity of a given system using various parameters.

3. Perform hardware/software partitioning based on the system parameters and the attributes (computational complexity, data dependencies, etc.) of the tasks.

4. Improve the partitioning by overlapping the computations with data transfer and network communications (if multiple nodes are employed).

In the rest of this section, we describe our design methodology in detail. As task identification depends on the application, it is discussed in Section 7.3 for each example application. Note that our methodology is unsuitable for performing hardware/software co-design for control-intensive applications or for applications that contain few computationally intensive tasks. In addition, our methodology is based on approximation.

7.2.1 Workload Partition

A simple way to partition a task within a node is to execute the computationally intensive part on the FPGA and use the processor for the control-intensive part. However, in this case, the computing power of the processor is mostly wasted. Therefore, we partition the workload of an application between the hardware and the software so that both the processor and the FPGA are fully utilized. Suppose $X_p$ floating-point operations are assigned to the processor and $X_f$ operations are assigned to the FPGA. The computation time (time for execution of the task) of the processor is then $T_p = \frac{X_p}{C_p}$, and the computation time of the FPGA is $T_f = \frac{X_f}{C_f}$. We choose $X_p$ and $X_f$ so that $T_p \approx T_f$.

However, the above partition is not accurate because it does not consider the data transfer time inside a node. As the input data are stored in the DRAM memory, they must be transferred to the FPGA. The data are streamed into the FPGA so that the computations on the FPGA can be overlapped with the data transfer. However, the computations on the processor cannot begin until the transfer is completed. Thus, the data transfer time has to be included in the workload partitioning. Suppose there are $D_d$ bytes of data to be transferred to the FPGA. We have

$$T_p + \frac{D_d}{B_d} = T_f \quad (7.1)$$

Note that if multiple threads run on a processor, the transfer time for source data can also be overlapped with the computations. However, in this case, additional overheads such as the cost of context switches and thread synchronization arise. As we are concerned with providing a simple yet effective model, we assume the processors are single-threaded. At the end of the task, the result data generated by the FPGA must be transferred back to the DRAM.
As such a transfer is initiated by the FPGA, it can be overlapped with the computations on the processor. Due to spatial parallelism, the data transfer can also be overlapped with the computations on the FPGA. Therefore, as an approximation, we do not consider this data transfer time.

Some tasks contain a large number of data dependencies. If such tasks are partitioned, the coordination and communication between the processor and the FPGA can easily become the performance bottleneck. Therefore, such tasks are assigned entirely to either the processor or the FPGA. In this case, $T_p \approx T_f$ is achieved by tuning the numbers of tasks assigned to the processor and the FPGA.

7.2.2 Multiple Nodes

An application can also employ multiple nodes in reconfigurable computing systems. Since the nodes communicate through the processors, the computations on the processors cannot overlap with the network communications. The computations on the FPGAs, on the other hand, are not affected. Thus, the partition must be adjusted to include the communication costs. Suppose $D_n$ bytes of data must be transferred between two nodes. Equation 7.1 is then modified to:

$$T_p + \frac{D_d}{B_d} + \frac{D_n}{B_n} = T_f \quad (7.2)$$

When the tasks are assigned to multiple nodes, load balancing among the nodes is important. If a node is overloaded, it becomes the performance bottleneck of the system. Thus, in our model, we adjust the number of tasks assigned to each node so that the execution time of each node is approximately equal.

7.2.3 Hardware/Software Coordination

Besides workload partitioning, the coordination between the processor and the FPGA is also important. First, the processor must notify the FPGA-based design to start and must be notified when the computations on the FPGA are completed. The status registers on the FPGA are used for this purpose. The latency for the processor to check the registers is negligible compared with the computation time of a task. Nonetheless, the frequency of coordination is given for each example application.

The second issue is the coordination of memory accesses. As both the processor and the FPGA have access to the DRAM memory, memory accesses, especially memory writes, must be coordinated to avoid conflicts. Therefore, we must make sure that the processor and the FPGA write to separate memory locations. If they must write to the same address, a status register is used as the lock variable of the memory location. Another coordination issue is data dependency between the processor and the FPGA. For example, if the FPGA reads from the DRAM memory before the computations on the processor are complete, read-after-write hazards may occur. Thus, we do not allow the FPGA to read the DRAM memory before getting permission from the processor. Similarly, the processor must get permission from the FPGA before reading the SRAM memory.

7.2.4 Performance Prediction

Our design methodology can also be used for performance prediction. After the values of the system parameters are determined, the workload for a given application is partitioned. We can then calculate the total execution time of the application on both the processor ($T_{tp}$) and the FPGA ($T_{tf}$) based on the data dependencies among the tasks. For simplicity, we assume all the data transfers and network communications are overlapped with the computations on the FPGA. Thus, the predicted total latency of the design is $\max\{T_{tp}, T_{tf}\}$. In Section 7.4, we compare the predicted performance with the experimental results.
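Equations 7.1 and 7.2 determine the split directly once the system parameters are known. The Python sketch below solves them for the share of work given to the FPGA; the function and parameter names are ours, and the example values are placeholders rather than measured XD1 figures.

```python
def split_workload(total_flops, c_p, c_f, d_d=0.0, b_d=1.0, d_n=0.0, b_n=1.0):
    """Partition an application's floating-point work per Equations 7.1/7.2.

    Solves X_p/C_p + D_d/B_d + D_n/B_n = X_f/C_f with X_p + X_f = total_flops,
    so that the processor (which also waits on data transfer and network
    communication) and the FPGA finish at the same time.
    """
    overhead = d_d / b_d + d_n / b_n            # transfer + communication time
    x_f = c_f * (total_flops + overhead * c_p) / (c_f + c_p)
    x_f = min(max(x_f, 0.0), total_flops)       # clamp to a feasible split
    x_p = total_flops - x_f
    t_f = x_f / c_f
    t_p = x_p / c_p + overhead
    return x_p, x_f, t_p, t_f

# Example with placeholder numbers: 10 GFLOP of work, a 4 GFLOPS processor,
# a 2 GFLOPS FPGA design, and 100 MB streamed to the FPGA at 1 GB/s.
print(split_workload(10e9, 4e9, 2e9, d_d=100e6, b_d=1e9))
```

The same linear balance, with the overhead term replaced by $T_{comm} + T_{mem}$, specializes to Equation 7.3 for the hybrid matrix multiplication design described next.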
7.3 Hybrid Designs

7.3.1 Matrix Multiplication

In the hybrid design for matrix multiplication, we employ the architecture and algorithm described in Section 6.2.3. For $C = A \times B$, the columns of matrix $A$ are grouped into column stripes, each of which consists of $\frac{n}{k}$ submatrices of size $k \times k$, where $k$ is the number of PEs on one FPGA. Similarly, the rows of $B$ are partitioned into multiple row stripes. Initially, matrices $A$ and $B$ are stored in the DRAM memory of $N_0$. During the computation, one column stripe of matrix $A$ and one row stripe of matrix $B$ are read and transferred to the other nodes at the same time.

7.3.1.1 Workload Partition

With $p$ nodes, the tasks in matrix multiplication are $(n \times n) \times (n \times \frac{n}{p})$ block multiplications. Each node performs one task, which is partitioned between the processor and the FPGA. One simple partitioning assigns $n_f$ rows of $A$ to the FPGA and $n_p$ rows to the processor, where $n_f + n_p = n$ and $\frac{n_f}{n_p} = \frac{C_f}{C_p}$. This partitioning can be further improved by considering the data transfer time and the network communication time. It takes $T_{comm} = \frac{2nkb_w}{B_n}$ to transfer one column stripe of $A$ and one row stripe of $B$ from $N_0$ to $N_i$ ($1 \leq i \leq p-1$). Within $N_i$, the processor then moves $n_f k$ elements of matrix $A$ and $\frac{nk}{p}$ elements of matrix $B$ from its DRAM memory to the FPGA. The data transfer time, $T_{mem}$, thus equals $\frac{(n_f k + nk/p)b_w}{B_d}$. After transferring the data, the processor of $N_i$ performs an $(n_p \times k) \times (k \times \frac{n}{p})$ matrix multiplication. Thus, the computation time of the processor is $T_p = \frac{2n_p nk}{pC_p}$. The FPGA of $N_i$ multiplies an $n_f \times k$ matrix and a $k \times \frac{n}{p}$ matrix; that is, it performs $\frac{n_f}{k} \times \frac{n}{kp}$ $(k \times k)$ submatrix multiplications. Therefore, the computation time of the FPGA is $T_f = \frac{2n_f nk}{pC_f}$. Except for the first stripes of $A$ and $B$, $T_{comm}$ and $T_{mem}$ can be overlapped with $T_f$. Thus, the values of $n_p$ and $n_f$ should be determined so that

$$T_f = T_{comm} + T_{mem} + T_p \quad (7.3)$$

Since each FPGA calculates $n_f \times \frac{n}{p}$ elements of $C$, the size of the SRAM memory required by the design is $\frac{n_f n}{p}$ words. When the matrix size is large, block matrix multiplication is performed. Matrices $A$ and $B$ are partitioned into blocks of size $b \times b$. In this case, $b_f$ rows of $A$ are assigned to the FPGA, while $b_p$ rows are assigned to the processor. We have $b_f b_p b_w = M_s$, where $M_s$ is the size of the SRAM memory of the FPGA.

7.3.1.2 Hardware/Software Coordination

In the proposed partitioning, the processor generates $n_p$ rows of $C$ and the FPGA generates $n_f$ rows of $C$. As they write to separate memory locations, there is no memory access conflict. For the multiplication of one column stripe of $A$ and one row stripe of $B$, the processor must signal the FPGA to start the computations, as well as be notified when the FPGA is done. Thus, the frequency of coordination is $\frac{2}{T_f} = \frac{pC_f}{n_f nk}$ times per second. For block matrix multiplication, the frequency of coordination is $\frac{pC_f}{b_f bk}$ times per second.

7.3.2 Block LU Decomposition

We next consider block LU decomposition of large matrices. We follow the block algorithm given in [16], which is described below. At the beginning of the algorithm, there are four matrices: $A^0_{00}$, $A^0_{01}$, $A^0_{10}$, and $A^0_{11}$. $A^0_{00}$ is a $b \times b$ matrix, $A^0_{01}$ is a $b \times (n-b)$ matrix, $A^0_{10}$ is an $(n-b) \times b$ matrix, and $A^0_{11}$ is an $(n-b) \times (n-b)$ matrix.
The goal of the algorithm is to decompose $A$ into two matrices, $L$ and $U$, such that

$$\begin{pmatrix} A^0_{00} & A^0_{01} \\ A^0_{10} & A^0_{11} \end{pmatrix} = \begin{pmatrix} L^1_{00} & 0 \\ L^1_{10} & L^1_{11} \end{pmatrix} \begin{pmatrix} U^1_{00} & U^1_{01} \\ 0 & U^1_{11} \end{pmatrix} \quad (7.4)$$

The steps of the algorithm are as follows:

1. Perform a sequence of Gaussian eliminations on the $n \times b$ matrix formed by $A^0_{00}$ and $A^0_{10}$ in order to calculate the entries of $L^1_{00}$, $L^1_{10}$, and $U^1_{00}$;

2. Calculate $U^1_{01}$ as the product of $(L^1_{00})^{-1}$ and $A^0_{01}$;

3. Evaluate $A^1_{11} \leftarrow A^0_{11} - L^1_{10}U^1_{01}$;

4. Apply Steps 1 to 3 recursively to matrix $A^1_{11}$.

During the $t$th ($0 \leq t \leq \frac{n}{b}-1$) iteration, the initial matrices are $A^t_{00}$, $A^t_{01}$, $A^t_{10}$, $A^t_{11}$; the resulting matrices $L^t_{00}$, $U^t_{00}$, $L^t_{10}$, $U^t_{01}$ and $A^{t+1}_{11}$ are obtained. An iteration denotes one execution of Steps 1 to 3.

7.3.2.1 Task Identification

Five tasks have been identified in block LU decomposition: opLU, opL, opU, opMM and opMS. In the $t$th iteration, opLU is performed in Step 1 to obtain $L^t_{00}$ and $U^t_{00}$. $(\frac{n}{b}-t)$ opL operations are performed in Step 1 to obtain $L^t_{10}$ using $A^t_{10}$ and $(U^t_{00})^{-1}$. In Step 2, $(\frac{n}{b}-t)$ opU operations generate $U^t_{01}$ using $A^t_{01}$ and $(L^t_{00})^{-1}$. In Step 3, block matrix multiplications (opMM operations) and matrix subtractions (opMS operations) are each performed $(\frac{n}{b}-t)^2$ times. opL and opU operations need the outputs of opLU; opMM operations need the outputs of opL and opU operations; opMS operations need the outputs of opMM operations.

Among the five operations, opMS is the least computationally intensive ($\Theta(n^2)$) and does not need hardware acceleration. The other operations are all computationally intensive, with a complexity of $\Theta(n^3)$. However, it is impractical to partition opLU, opL and opU between the processor and the FPGA because they contain many data dependencies. Therefore, in our design, only opMM is performed on both the processor and the FPGA. All the other operations are performed on the processor.

7.3.2.2 Algorithm and Architecture

We now present our hybrid design for block LU decomposition. Matrix $A$ is partitioned into $b \times b$ blocks, which are denoted as $A_{uv}$ ($0 \leq u, v \leq \frac{n}{b}-1$). Initially, $N_i$ ($0 \leq i \leq p-1$) stores $A_{iv}$ and $A_{ui}$ ($i \leq u, v \leq \frac{n}{b}-1$), $A_{(i+p),v}$ and $A_{u,(i+p)}$ ($i+p \leq u, v \leq \frac{n}{b}-1$), ..., $A_{(i+p(\frac{n}{bp}-1)),v}$ and $A_{u,(i+p(\frac{n}{bp}-1))}$ ($i+p(\frac{n}{bp}-1) \leq u, v \leq \frac{n}{b}-1$).

Figure 7.1: Architecture for block LU decomposition at iteration 0

In the $t$th iteration, $N_{t'}$ performs the opLU, opL, and opU operations on its processor, where $t' = t \bmod p$. After the first opL and opU are complete, their outputs are transferred to the other nodes. The outputs of the opMM operations, $A'_{uv}$ ($t+1 \leq u, v \leq \frac{n}{b}-1$), are sent to $N_{t''}$, where $t'' = \max\{u, v\}$. $N_{t''}$ performs $A_{uv} = A_{uv} - A'_{uv}$ ($t+1 \leq u, v \leq \frac{n}{b}-1$). The architecture of our design in the 0th iteration is shown in Figure 7.1.

In our design, $b \times b$ block matrix multiplications are performed by the processors and the FPGAs of $p-1$ nodes together. The partitioning proposed in Section 7.3.1 is employed. In particular, for a block multiplication $E = C \times D$, $b_f$ and $b_p$ rows of matrix $C$ are assigned to the FPGA and the processor, respectively.

7.3.2.3 Hardware/Software Coordination

In our design, the processor generates $b_p$ rows of $E$ and the FPGA generates $b_f$ rows of $E$. As they write to separate memory locations, there is no memory conflict.
For the multiplication of one column stripe of $C$ and one row stripe of $D$, the processor must signal the FPGA to start the computations, as well as be notified when the FPGA is done. Thus, the frequency of coordination is $\frac{(p-1)C_f}{b_f bk}$ times per second.

7.3.2.4 Load Balancing

While the other $p-1$ nodes are performing opMM operations, $N_{t'}$ performs the operations opLU, opL and opU. It is quite possible that $N_{t'}$ becomes overloaded and turns into the performance bottleneck of the system. To avoid this, $l$ opMM operations are performed during one opLU/opL/opU. According to Section 6.2, the total latency of one opMM equals $\frac{b}{k} \times \frac{2b_f bk}{(p-1)C_f} = \frac{2b_f b^2}{(p-1)C_f}$. Suppose the latencies of performing one opLU, opL and opU operation are $T_{lu}$, $T_{opl}$ and $T_{opu}$, respectively. To achieve load balancing, $l$ is determined so that

$$\max\{T_{lu}, T_{opl}, T_{opu}\} + l\frac{b}{k} \times T_{comm} = l \times \frac{2b_f b^2}{(p-1)C_f} \quad (7.5)$$

We include the communication costs in the equation because $N_{t'}$ also needs to send the block matrices to the other nodes.

7.3.3 Floyd-Warshall Algorithm

Consider a weighted and directed graph $G$ with $n$ vertices. The all-pairs shortest-paths problem is the problem of finding a shortest (least-weight) path between every pair of vertices in the graph. The weight of a path is the sum of the weights of its constituent edges. The Floyd-Warshall algorithm [20] is an efficient technique to solve this problem. In this algorithm, $d^t_{ij}$ is the weight of a shortest path from vertex $i$ to vertex $j$ with all intermediate vertices in the set $\{0, 1, \dots, t\}$ ($0 \leq t \leq n-1$). $D^t$ is the matrix formed by $d^t_{ij}$ ($0 \leq i, j \leq n-1$). If there is an edge between $i$ and $j$, $d^0_{ij}$ equals the weight of that edge. Otherwise, $d^0_{ij} = 0$ if $i = j$, and $d^0_{ij} = \infty$ if $i \neq j$. This algorithm is referred to as the "regular" Floyd-Warshall algorithm in this thesis.

We follow a blocked version of the Floyd-Warshall algorithm [69]. Matrix $D^0$ is partitioned into blocks of size $b \times b$, which are denoted as $D^0_{uv}$ ($0 \leq u, v \leq \frac{n}{b}-1$). There are a total of $\frac{n}{b}$ iterations in the algorithm. $D^{t+1}$ denotes the matrix generated after iteration $t$ ($0 \leq t \leq \frac{n}{b}-1$). In iteration $t$, there are three steps (a software sketch of this structure follows the task identification below):

1. Perform the regular Floyd-Warshall algorithm on block $D^t_{tt}$. This operation is denoted op1;

2. Perform the regular Floyd-Warshall algorithm on block $D^t_{th}$ ($0 \leq h \leq \frac{n}{b}-1$, $h \neq t$), using the columns of $D^t_{th}$ and the rows of $D^t_{tt}$. Similarly, perform the regular Floyd-Warshall algorithm on block $D^t_{ht}$ using the rows of $D^t_{ht}$ and the columns of $D^t_{tt}$. We denote these operations op21 and op22, respectively;

3. Perform the regular Floyd-Warshall algorithm on the remaining blocks. For block $D^t_{uv}$ ($0 \leq u, v \leq \frac{n}{b}-1$, $u \neq t$, $v \neq t$), the rows of $D^t_{tv}$ and the columns of $D^t_{ut}$ are needed. This operation is called op3.

7.3.3.1 Task Identification

The operations in the above steps are identified as the tasks of the block Floyd-Warshall algorithm. In each iteration, one op1 operation, $\frac{n}{b}-1$ op21 operations, $\frac{n}{b}-1$ op22 operations, and $(\frac{n}{b}-1)^2$ op3 operations are performed. op21 and op22 operations need the outputs of op1; op3 operations need the outputs of op21 and op22. All the tasks in the block Floyd-Warshall algorithm are computationally intensive and have a complexity of $\Theta(n^3)$. Except for their inputs, these tasks are identical. Therefore, we use the same processor-based algorithm and FPGA-based design for these operations. As the tasks contain a large number of data dependencies, they are not suitable for partitioning between the processor and the FPGA.
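The per-iteration structure referenced above (op1, then op21/op22, then op3) can be made concrete with a short reference implementation. The Python sketch below is a plain software rendering of the blocked algorithm of [69], useful only for checking the data dependencies between the operations; it is not the FPGA design of [12] or the processor code used in the experiments.

```python
INF = float("inf")

def fw_block(D, u, v, t, b):
    """Update block (u, v) of the flat n x n matrix D through pivot block t:
    D_uv[i][j] = min(D_uv[i][j], D_ut[i][m] + D_tv[m][j]) over the b pivots m.
    With u = v = t this is the regular Floyd-Warshall pass on the block itself.
    """
    for m in range(b):
        for i in range(b):
            for j in range(b):
                via = D[u * b + i][t * b + m] + D[t * b + m][v * b + j]
                if via < D[u * b + i][v * b + j]:
                    D[u * b + i][v * b + j] = via

def blocked_floyd_warshall(D, b):
    """Blocked all-pairs shortest paths on an n x n distance matrix D
    (list of lists, INF meaning 'no edge'), with n a multiple of b."""
    nb = len(D) // b
    for t in range(nb):
        fw_block(D, t, t, t, b)                      # op1: diagonal block
        for h in range(nb):
            if h == t:
                continue
            fw_block(D, t, h, t, b)                  # op21: row blocks
            fw_block(D, h, t, t, b)                  # op22: column blocks
        for u in range(nb):
            for v in range(nb):
                if u != t and v != t:
                    fw_block(D, u, v, t, b)          # op3: remaining blocks
    return D

# Example: a 4-vertex graph processed with 2 x 2 blocks.
D = [[0, 3, INF, 7],
     [8, 0, 2, INF],
     [5, INF, 0, 1],
     [2, INF, INF, 0]]
print(blocked_floyd_warshall(D, 2))
```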
7.3.3.2 Algorithm and Architecture

Initially, each node stores $\frac{n}{bp}$ columns of the $b \times b$ blocks in $D^0$. In particular, $N_i$ stores columns $\frac{in}{bp}$, $\frac{in}{bp}+1$, ..., $\frac{(i+1)n}{bp}-1$ ($0 \leq i \leq p-1$).

In iteration $t$ ($0 \leq t \leq \frac{n}{b}-1$), there are $\frac{n}{b}$ phases. In phase 0, the op1 operation is performed on block $D^t_{tt}$ by $N_{t'}$, where $t' = \lfloor\frac{t}{n/(bp)}\rfloor$. The resulting $D^t_{tt}$ is then transferred to all the other nodes. While $N_i$ ($0 \leq i \leq p-1$, $i \neq t'$) performs $\frac{n}{bp}$ op21 operations, $N_{t'}$ performs $\frac{n}{bp}-1$ op21 operations and one op22 operation. In the following phase, the results of the op22 operation are transferred to the other nodes for op3 operations. While $N_i$ ($0 \leq i \leq p-1$, $i \neq t'$) performs $\frac{n}{bp}$ op3 operations, $N_{t'}$ performs $\frac{n}{bp}-1$ op3 operations and one op22 operation. Again, the results of the op22 operation are transferred to the other nodes for op3 operations. This continues until all the op22 and op3 operations are complete.

Note that except for phase 0, each node performs the same number of operations in each phase. Thus, load balancing is easily maintained in the system. Figure 7.2 illustrates the operations of the nodes when $\frac{n}{b}=8$, $p=4$ and $t=2$. Different operations of the application are shown in different patterns.

Figure 7.2: Illustration of the operations of the nodes for block Floyd-Warshall ($\frac{n}{b}=8$, $p=4$, $t=2$)

On the FPGA, we employ the design proposed in [12]. In that design, with $k$ floating-point adders and $k$ floating-point comparators, performing the regular Floyd-Warshall algorithm on a $b \times b$ matrix takes $\frac{2b^3}{k}$ clock cycles.

7.3.3.3 Workload Partition

Within each node, an operation is assigned entirely either to the processor or to the FPGA. Each operation contains $b^3$ floating-point additions and $b^3$ floating-point comparisons. Thus, the computation time of the processor is $T_p = \frac{2b^3}{C_p}$. According to [12], we have $T_f = \frac{2b^3}{C_f}$. The on-chip memory required by the design is of size $2k^2$ words, and the size of the required BRAM memory is $2b^2$ words.

We aim to balance the workload between the processor and the FPGA. In each phase discussed above, each node performs $\frac{n}{bp}$ operations and sends/receives one block. When an operation is implemented on the FPGA, two blocks need to be accessed from the DRAM memory. Thus, we have $T_{comm} = \frac{b^2}{B_n}$ and $T_{mem} = \frac{2b^2}{B_d}$. Suppose the processor and the FPGA perform $l_1$ and $l_2$ operations, respectively. Besides $l_1 + l_2 = \frac{n}{bp}$, we have

$$l_1 \times T_p + T_{comm} + l_2 \times T_{mem} = l_2 \times T_f \quad (7.6)$$

7.3.3.4 Hardware/Software Coordination

As the FPGAs and the processors execute separate tasks in our design, there is no memory access conflict. For the $l_2$ operations performed by the FPGA, the processor must give the start signal and be notified when the FPGA is done. Thus, the frequency of coordination is $\frac{2}{l_2 \times T_f} = \frac{C_f}{l_2 b^3}$ times per second.

7.4 Experimental Results

7.4.1 Implementation Details

We implemented our hybrid designs on one chassis of XD1. A chassis contains 6 nodes, hence $p \leq 6$. The network bandwidth between two nodes within the same chassis is $B_n = 2$ GB/s. On each node, 8 MB of SRAM memory is allocated to store the intermediate results. Thus, $M_s = 2^{20}$. When our FPGA-based matrix multiplier in Section 6.2.2.3 is implemented on one FPGA in XD1, at most 8 PEs can be configured, hence $k = 8$. As each PE performs two floating-point operations in each clock cycle, and the clock speed of the design is 130 MHz, $C_f = 8 \times 2 \times 130 \times 10^6 = 2.08$ GFLOPS.
The FPGA-based design reads one word from the DRAM memory in each clock cycle, hence $B_d = 1.04$ GB/s. On the processor, the dgemm subroutine from ACML is executed. When the matrix size is larger than 1000, one AMD processor achieves $C_p = 3.9$ GFLOPS.

In our experiments, we considered large matrices and used block matrix multiplication. We have $b_f b_p \leq 2^{20}b_w$, and $b$ must be a multiple of both $p$ and $k$. Therefore, in our experiments, when $p = 1, 2, 3$, $b = 1536$; when $p = 4, 6$, $b = 3072$; when $p = 5$, $b = 3000$. The workload partition was determined according to Equation 7.3, and is shown in Table 7.1.

Table 7.1: Workload partition for matrix multiplication on XD1

  p     1     2     3     4     5     6
  b     1536  1536  1536  3072  3000  3072
  b_f   512   512   640   1280  1280  1280
  b_p   1024  1024  896   1792  1720  1792

For block LU decomposition, we used $p = 6$ nodes. In this case, block matrix multiplication was performed on $p-1 = 5$ nodes. Therefore, $b = 3000$, $b_p = 1720$ and $b_f = 1280$. To get the value of $l$, we first obtained the latencies of the opLU, opL and opU operations on the processor. For $b = 3000$, the routine used and the latency of each operation are shown in Table 7.2. According to Equation 7.5, we set $l = 3$.

Table 7.2: Routines and latencies for operations in block LU decomposition

  Operation    opLU    opL    opU
  Routine      dgetrf  dtrsm  dtrsm
  Latency (s)  4.9     7.1    7.1

In the design for the Floyd-Warshall algorithm, we implemented the design proposed in [12] on the FPGA. At most $k = 8$ PEs can be configured, and each PE performs one floating-point operation in each clock cycle. Our implementation achieved 120 MHz, hence $C_f = 8 \times 120 \times 10^6 = 0.96$ GFLOPS. As the FPGA-based design reads one word from the DRAM memory in each clock cycle, $B_d = 960$ MB/s. Again, we have $p = 6$ and $M_s = 2^{20}$. As $b$ needs to be a multiple of $k$, we set $b = 256$. On the processor, for each $b \times b$ block, we implemented the regular Floyd-Warshall algorithm as described in [20]. For $b = 256$, the sustained performance of this algorithm is $C_p = 190$ MFLOPS. According to Equation 7.6, we have $\frac{l_1}{l_2} = \frac{1}{5}$ and $l_1 + l_2 = \frac{n}{bp}$. Therefore, we set $n$ to 18432. In this case, $l_1 = 2$ and $l_2 = 10$.
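The $l_1 : l_2$ split quoted above follows directly from Equation 7.6 and the rates reported in this section. The Python sketch below reproduces it; the helper name and the explicit word size are ours rather than part of the implementation.

```python
def floyd_warshall_split(b, n, p, c_p, c_f, b_d, b_n, word_bytes=8):
    """Solve Equation 7.6 for the per-phase split (l1, l2) of block
    Floyd-Warshall operations between the processor and the FPGA:
        l1*T_p + T_comm + l2*T_mem = l2*T_f,  with  l1 + l2 = n/(b*p).
    """
    t_p = 2 * b ** 3 / c_p                    # processor time per operation
    t_f = 2 * b ** 3 / c_f                    # FPGA time per operation
    t_comm = b ** 2 * word_bytes / b_n        # one block over the network
    t_mem = 2 * b ** 2 * word_bytes / b_d     # two blocks streamed to the FPGA
    total = n // (b * p)                      # operations per node and phase
    # Substitute l1 = total - l2 and solve the linear equation for l2.
    l2 = (total * t_p + t_comm) / (t_f - t_mem + t_p)
    l2 = round(min(max(l2, 0), total))
    return total - l2, l2

# Rates reported for XD1: C_p = 190 MFLOPS, C_f = 0.96 GFLOPS,
# B_d = 960 MB/s, B_n = 2 GB/s, with b = 256, n = 18432, p = 6.
print(floyd_warshall_split(256, 18432, 6, 190e6, 0.96e9, 960e6, 2e9))
```

With the reported rates, the call returns (2, 10), matching the split used in the experiments.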
The latency begins to increase when l is further increased because N 0 is becoming under-utilized. However, the increase is not noticeable until l =5. 0 40 80 120 160 12 10 8 6 4 2 0 l 1 Latency (s) Figure 7.5: Latency of one iteration in block Floyd-Warshall using our design vs. l 1 (b = 256, n = 18432, p=6) Figure 7.5 shows the latency of one iteration in our design for the block Floyd- Warshall algorithm when the workload partition varies. We see that when l 1 decreases from 12 to 2, the latency keeps on decreasing because the FPGA keeps on taking more workload from the processor. However, when l 1 is 1, the FPGA becomes overloaded and the latency begins to increase. Unlike the other two applications, there is a large difference between the computing power of the FPGA and the processor for this appli- cation. Therefore, when the FPGA is employed only, the latency is even smaller than some cases where the processor and the FPGA work together. 7.4.3 Performance Comparison To evaluate our hybrid designs, we compare their performance against two baseline designs. The “FPGA-only design” employs the FPGAs in the nodes only, while the “Processor-only design” employs only the processors. For fair comparison, the base- line designs follow the same steps as the hybrid design. We use GFLOPS to measure 139 the sustained floating-point performance of the designs. Note that the FPGA-only design still needs the processors in the nodes for data transfer and network communi- cations. Figure 7.6 shows the performance of our design and the baseline designs for ma- trix multiplication as p varies. We see that all three designs are scalable, while our design achieves better performance than the other two. In particular, when p =6, the Processor-only design achieves 2.8 GFLOPS/FPGA; the FPGA-only design achieves 2.0 GFLOPS/processor and the hybrid design achieves 4.1 GFLOPS with one proces- sor and one FPGA. That is, our design achieves 1.5X speedup and 2X speedup over the baseline designs, respectively. 0 10 20 30 123 456 No. of Nodes, p GFLOPS Hybrid Design Processor-only Design FPGA-only Design Figure 7.6: GFLOPS performance of matrix multiplication vs. p For block LU decomposition, the performance of the hybrid design increases with the number of blocks, n b , as shown in Figure 7.7. This is because block matrix mul- tiplication opMM is the the only task which exploits the computing power of both the FPGAs and the processors. When n = 30000 and b = 3000, the hybrid design achieves 20 GFLOPS, 2X speedup over the FPGA-only design and 1.3X speedup over the processor-only design. As opMM is the most executed task and is scalable across the nodes, our design for block LU is also scalable. 140 9 12 15 18 21 10 12 14 16 18 20 n/b GFLOPS Hybrid Design Processor-only Design FPGA-only Design Figure 7.7: GFLOPS of block LU decomposition vs. n b (b = 3000) 0 2 4 6 8 123 456 No. of Nodes, p GFLOPS Hybrid Design Processor-only Design FPGA-only Design Figure 7.8: GFLOPS performance of block Floyd-Warshall vs. p The performance of the design for the block Floyd-Warshall algorithm remains almost the same when n increases. The reason is that the proportion between the computational loads of the FPGA and the processor is independent of the problem size. Figure 7.8 shows the scalability of our design for the block Floyd-Warshall algo- rithm. When p=6, the hybrid design achieves 6.6 GFLOPS, 5.8X speedup and 1.15X speedup over the Processor-only design and the FPGA-only design, respectively. 
To further evaluate the efficiency of our designs, we compare their performance against the total performance of the system, $C_t$. $C_t$ is the sum of the computing power of the nodes. For the hybrid designs, $C_t = p(C_f + C_p)$.

Figure 7.9 shows the percentage of $C_t$ our designs have achieved. For block LU decomposition, our design achieves 56% of $C_t$. This is because in our implementation, we used the atomic ACML routines for opLU, opL and opU. In this case, a large part of the network communications has to be performed before the routines start and cannot be overlapped with the computations. On the other hand, in our designs for matrix multiplication and Floyd-Warshall, more than 90% of $T_{comm}$ and $T_{mem}$ is overlapped with $T_f$. Therefore, our designs achieve 70% and 96% of $C_t$, respectively. From these results, we see that although our design model is simple, it generates efficient hybrid designs that fully utilize both the processors and the FPGAs.

Figure 7.9: Percentage of the total performance of the system achieved by our designs

We also compare the performance of our hybrid designs with the performance predicted by the design methodology. In the prediction, we use the same system parameters and the same workload partition as in the experiments. However, we assume all the communication costs and memory transfer time are overlapped with the computations on the FPGA. Our design for block LU decomposition achieves about 86% of the predicted performance. For matrix multiplication and block Floyd-Warshall, the hybrid designs achieve about 90% and 96% of the predicted performance, respectively. For all the applications, the design methodology has provided a fairly accurate prediction.

7.4.4 Performance Projection

The performance of our designs can be further improved. For example, when improved floating-point cores are available, $C_f$ will increase; hence the performance of the hybrid design will also increase accordingly. We can also use a more powerful device to achieve a higher $C_f$. The XC2VP50 is relatively small in the Xilinx Virtex-II Pro family. With a larger device, more PEs can be configured. For example, the Xilinx Virtex-II Pro XC2VP100 contains 44,096 slices, and can contain twice as many PEs as the XC2VP50.

Figure 7.10: Projected performance of the hybrid design for matrix multiplication ($b_f = 2048$, $p = 6$)

We thus project the performance of our design for matrix multiplication when the number of floating-point operations performed each clock cycle, $O_f$, ranges from 20 to 40 and the clock speed $F_f$ increases from 150 MHz to 250 MHz. We fix $b_f$ to 2048, while $b$ and $b_p$ vary. As shown in Figure 7.10, when $O_f = 40$ and $F_f = 250$ MHz, our design can achieve 44.6 GFLOPS with 6 nodes. Note that the projected performance is achieved with the DRAM memory bandwidth and the interconnect bandwidth available in XD1. Also, even with the highest $F_f$, the required SRAM memory bandwidth of the design is lower than the available bandwidth in XD1.

Another way to improve the performance of our designs is to use better designs. For example, the performance of our design for block Floyd-Warshall is low because the design on the processors is not optimized. If existing optimizations are employed [55], both $C_p$ and the performance of the hybrid design will increase.
Chapter 8
Conclusion

In this thesis, we have focused on using reconfigurable computing systems to accelerate scientific computations. We have demonstrated that although various design challenges exist in utilizing such systems, these challenges can be overcome. To illustrate our ideas, we have proposed designs for linear algebra on reconfigurable computing systems. The designs are optimized under given hardware resources and achieve high performance. Our work allows many scientific applications to take advantage of these systems, and to achieve immense performance improvement. In addition, we have provided a high-level model, a parameterized design approach, detailed design trade-off analysis, and an efficient design methodology. All these can be used not only by the end-user community to develop programming models, but also by system architects to further improve existing reconfigurable computing systems. In this chapter, we summarize the contributions of our work. We also discuss ideas for future research on reconfigurable computing systems.

8.1 Summary of Contributions

Design Model for Reconfigurable Computing Systems: We have built a high-level design model based on the architecture of several existing reconfigurable computing systems. This model abstracts the multiple levels of memory in these systems and the co-existence of processors and FPGAs. In the model, a system is characterized by various system parameters, including the number of nodes, the computing power of the processor and the FPGA in each node, the memory bandwidth, and the network bandwidth.

Based on this model, we have proposed a parameterized and flexible design approach. To describe an FPGA-based design, various design parameters are used, including the number of floating-point operations performed in each clock cycle, the size of internal storage and the required memory bandwidth. Our design approach separates the algorithm and architecture from the hardware device used. Therefore, we can generate an optimized FPGA-based design which is independent of the device used. The design can then be adapted to a specific hardware device by tuning the parameters.

Optimized Designs on FPGAs: We have proposed optimized FPGA-based implementations for dot product, matrix-vector multiplication, matrix multiplication and matrix factorization. For each operation, we have analyzed its inherent attributes, such as the number of floating-point operations and I/O operations required, and identified various design parameters. By exploring the design space, we have analyzed the design trade-offs among area, latency and storage size for the operation.

We have proposed tree-based architecture for the vector operations. The area of the architecture increases linearly with the number of floating-point cores used. Moreover, the latency decreases almost proportionally with the number of floating-point cores, if adequate memory bandwidth is provided. On the other hand, the required storage size and the required memory bandwidth are determined by both the number of floating-point cores and the block size. When implemented on one FPGA in XD1, our design for MXV achieved more than 90% of the peak performance of the FPGA device under the given memory bandwidth. For SpMXV, our design conforms to an existing sparse matrix storage format and makes no assumptions about the sparsity of the input matrix. The design has achieved high speedup over an Itanium 2 processor for matrices with very irregular structures.
For matrix multiplication, three designs are proposed, all of which employ a linear array architecture. Design 1 and Design 2 achieve the lower bound on the latency of n × n matrix multiplication algorithms, Θ(n^2). However, in these two designs, the number of PEs and the required storage size increase with the problem size n. In Design 3, the number of PEs is determined by the available resources, and the storage size is determined by the size of on-chip memory on the FPGA. With the given resources, Design 3 achieves optimal latency and minimizes the required memory bandwidth. On one Xilinx Virtex-II Pro XC2VP100, our designs achieve a sustained performance of 4 GFLOPS, which compares favorably with general-purpose processor-based designs. Design 3 has been further extended to utilize the SRAM memory and the multiple FPGAs in reconfigurable computing systems; using 6 FPGAs in XD1, it achieves 12 GFLOPS.

For LU decomposition, our design employs a circular linear array of PEs. The number of PEs is independent of the problem size n and is determined by the available resources on the FPGA. The design requires storage of size Θ(n^2) words, and it achieves the lower bound on the latency attainable by any LU decomposition design. Using one XC2VP100, our design achieves 2X speedup over netlib dgetrf on an AMD 2.2 GHz Opteron processor. A hierarchical design is proposed to utilize the SRAM memory in reconfigurable computing systems.

In our designs, the implementations of building blocks, such as floating-point cores, have no effect on the architecture or the algorithm. Therefore, when smaller or faster floating-point cores become available, they can be plugged into our designs easily, and the performance of our designs increases accordingly. We have shown that with one XC2VP100 and improved floating-point cores, our designs can achieve up to 19.5 GFLOPS for matrix multiplication and 11.7 GFLOPS for LU decomposition.

High-Performance and Area-Efficient Reduction Circuits: We have investigated the design of FPGA-based reduction circuits at both the algorithmic and the architectural level. Using accumulation as an example, we have analyzed the design trade-offs among the number of floating-point adders, the buffer size, and the latency of a design. We have also studied two basic methods for implementing reductions using deeply pipelined adders.

Two designs have been proposed; their area and clock speed are presented and their performance is analyzed. The FCBT design is based on the tree-traversal method and is relatively easy to implement. It uses two adders and requires a buffer of size Θ(lg n), where n is the number of inputs. The upper bound on the latency of this design is 3n + (α − 1) lg n, where α is the number of pipeline stages in the adder. The SSA design uses the striding method. It employs only one adder, but requires two buffers of size Θ(α^2). To reduce n inputs, it uses fewer than n + 2α^2 clock cycles.

The proposed designs have been proven to reduce multiple input sets of arbitrary sizes without stalling the pipeline. They are also area-efficient in that the number of adders in each design is fixed and the buffer sizes are independent of the number of input sets. In this work, the SSA design has been used in the designs for dot product and dense/sparse matrix-vector multiplication.
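To make the latency comparison between the two reduction circuits concrete, the short sketch below evaluates the bounds quoted above for a sample problem size. The value of α used here is only an illustrative choice for a deeply pipelined double-precision adder, not a figure taken from this thesis.

import math

def fcbt_latency_upper_bound(n, alpha):
    """FCBT: at most 3n + (alpha - 1) * lg(n) cycles, two adders, Theta(lg n) buffer."""
    return 3 * n + (alpha - 1) * math.log2(n)

def ssa_latency_upper_bound(n, alpha):
    """SSA: fewer than n + 2 * alpha^2 cycles, one adder, Theta(alpha^2) buffers."""
    return n + 2 * alpha ** 2

n, alpha = 4096, 14   # 4096 inputs, hypothetical 14-stage adder pipeline
print(f"FCBT bound: {fcbt_latency_upper_bound(n, alpha):.0f} cycles")
print(f"SSA bound:  {ssa_latency_upper_bound(n, alpha):.0f} cycles")

For large n both bounds grow linearly in n, but the SSA design needs only one adder, which is consistent with its use in our dot product and matrix-vector designs.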
Design Methodology for Hybrid Designs: We have proposed a design methodology for hybrid designs that utilize both the processors and the FPGAs in reconfigurable computing systems. Based on the system parameters identified in our design model, this methodology follows several simple steps to partition the workload of an application between the processors and the FPGAs. In performing the partitioning, we have considered not only the computation time, but also the data transfer time and the network communication time. We have also addressed other challenges in hardware/software co-design, such as the synchronization between the FPGAs and the processors and the coordination of memory accesses.

We have proposed hybrid designs for matrix multiplication, block LU decomposition and the block Floyd-Warshall algorithm. The tasks in matrix multiplication are block multiplications, which are partitioned between the processors and the FPGAs. For block LU decomposition, we partition one type of task between the processors and the FPGAs, and perform the other types of tasks on the processor of one node only. The tasks of block Floyd-Warshall contain a large number of data dependencies and thus are not partitioned; instead, each task is assigned entirely either to the processors or to the FPGAs.

We have shown through experiments that the workload partition given by the methodology is optimal. In particular, most of the data transfer time and the network communication time are overlapped with the computations. All of our designs are scalable over multiple nodes. Using 6 nodes in one chassis of XD1, our hybrid design achieves 22 GFLOPS for matrix multiplication. For LU decomposition and Floyd-Warshall, our designs achieve 20 GFLOPS and 6.6 GFLOPS, respectively. Our design for the block Floyd-Warshall algorithm achieves 5.8X speedup over the processor-only design, and our designs for the other two applications achieve 2X speedup over the FPGA-only design. In addition, our designs achieve up to 90% of the total computing power of the system, and more than 85% of the performance predicted using the design model.

8.2 Future Work

8.2.1 Portable and Scalable Libraries

Our work has provided high-performance designs for linear algebra on reconfigurable computing systems. These designs are essential for the performance improvement of a large number of scientific applications. However, to be widely accepted and used by the scientific computing community, these designs should be included in a library with a standard API (Application Programming Interface). This API should have the same or a similar format as existing libraries, for example ScaLAPACK, to ease the transition. In addition, the API should hide the architectural details of the system from the end user; in particular, the use of the FPGAs should be transparent unless the end user chooses to implement his/her own FPGA-based designs.

Another important feature of such a library is portability. To achieve this, the library should be parameterized, and the parameters should be tuned to adapt to a specific system architecture. In our work, the values of the system parameters and the design parameters are determined manually. This is acceptable because we target one system. In the future, however, the values of the parameters should be determined automatically, because more and more reconfigurable computing systems will become available. To do so, we must find the optimal point in a large design space in an efficient way.
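As a starting point for such automation, the sketch below (Python) enumerates a small, hypothetical design space and keeps the best configuration that satisfies the device constraints. Every constant and cost expression in it is a placeholder standing in for the area, storage, bandwidth and performance models developed in this thesis; the structure of the search, not the numbers, is the point.

from itertools import product

# Hypothetical device limits (placeholders, not values from this thesis).
SLICES = 23616          # logic budget of the target FPGA
ONCHIP_WORDS = 65536    # on-chip storage budget, in 64-bit words
MEM_BW = 3.2e9          # sustained memory bandwidth, bytes per second
F_F = 130e6             # assumed achievable clock frequency, Hz

SLICES_PER_PE = 2500    # placeholder area cost of one processing element

best = None
for o_f, block in product([2, 4, 8, 16], [256, 512, 1024, 2048]):
    area = o_f * SLICES_PER_PE              # placeholder area model
    storage = 3 * block                     # placeholder on-chip storage model
    required_bw = 8 * o_f * F_F / block     # placeholder bandwidth model, bytes/s
    if area > SLICES or storage > ONCHIP_WORDS or required_bw > MEM_BW:
        continue                            # configuration does not fit the device
    gflops = o_f * F_F / 1e9                # placeholder performance model (peak)
    if best is None or gflops > best[0]:
        best = (gflops, o_f, block)

print("best feasible (GFLOPS, O_f, block size):", best)

Exhaustive enumeration is adequate for a space this small; searching much larger spaces efficiently is exactly the open problem raised above.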
Moreover, the task identification and the workload partitioning of the hybrid designs should also be automated. Current reconfigurable computing systems contain only a few nodes; however, to enter the mainstream of high-performance computing, they will probably grow into systems with hundreds or even thousands of nodes. Therefore, future design models should include information on the network architecture of a system, so that the library can scale over a large number of nodes.

8.2.2 Heterogeneous Computing Platform

In this thesis, we assume that each node contains one processor and one FPGA. However, as FPGAs become more and more powerful, an FPGA device can be plugged into the system as a standalone node. For example, a DRC module that contains a Virtex-4 FPGA can be plugged into XT3/XT4 directly [25]. Other node configurations are also possible: a node may contain various numbers of processors and FPGAs, and the nodes may contain different types of processors and FPGAs with different computing powers.

Therefore, a generic model is needed to address the heterogeneity of future reconfigurable computing systems. This model should not only capture the architectural details of the systems, but also accommodate their heterogeneous features. For example, when a system contains a large number of nodes, it may use a hierarchical network architecture; in this case, the network bandwidth between two nodes at the same level may differ from that between two nodes at different levels.

A future model for heterogeneous reconfigurable computing systems should provide the following functionality. First, it should perform automatic workload partitioning; such partitioning should consider not only the various computing powers, but also the overheads, such as data exchanges among the nodes. Second, it should schedule the tasks onto the nodes while maintaining load balance across the system. Third, it should perform resource selection: in certain network architectures, the overheads of using the FPGA accelerators may offset the gains for some applications, and the model must then determine whether FPGAs should be used and which ones. The model should also be able to provide a reasonably accurate performance prediction.

Such a model can be further extended to more generic heterogeneous systems that contain both general-purpose processors and special-purpose hardware accelerators. These systems can be as large as a cluster or as small as a chip [14]. They have drawn enormous interest from both academia and industry. Proposing efficient designs on these systems, however, will require multidisciplinary efforts among computer scientists, hardware engineers and computer architects.

References

[1] Altera Corporation. http://www.altera.com.
[2] AMD Core Math Library. http://developer.amd.com/acml.aspx.
[3] A. Amira and F. Bensaali. An FPGA based Parametrisable System for Matrix Product Implementation. In Proc. of the IEEE Workshop on Signal Processing Systems Design and Implementation (SIPS 2002), pages 75–79, California, USA, October 2002.
[4] E. Anderson, J. Agron, W. Peck, J. Stevens, F. Baijot, E. Komp, R. Sass, and D. Andrews. Enabling a Uniform Programming Model Across the Software/Hardware Boundary. In Proc. of the 14th IEEE Symposium on Field-Programmable Custom Computing Machines, California, USA, April 2006.
[5] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide, third edition. http://www.netlib.org/lapack/lug/lapack_lug.html, August 1999.
[6] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst. Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. Society for Industrial and Applied Mathematics, 2000.
[7] S. Bajracharya, C. Shu, K. Gaj, and T. A. El-Ghazawi. Implementation of Elliptic Curve Cryptosystems Over GF(2^n) in Optimal Normal Basis on a Reconfigurable Computer. In Proc. of the 12th ACM International Symposium on Field-Programmable Gate Arrays, California, USA, February 2004.
[8] M. Baleani, F. Gennari, Y. Jiang, Y. Patel, R. K. Brayton, and A. Sangiovanni-Vincentelli. HW/SW Partitioning and Code Generation of Embedded Control Applications on a Reconfigurable Architecture Platform. In Proc. of the 10th International Symposium on Hardware/Software Codesign (CODES), Colorado, USA, May 2002.
[9] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, Philadelphia, PA, 1994.
[10] P. Belanovic and M. Leeser. A Library of Parameterized Floating-Point Modules and Their Use. In Proc. of the 12th International Conference on Field Programmable Logic and Applications, pages 657–666, Montpellier, France, September 2002.
[11] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997.
[12] U. Bondhugula, A. Devulapalli, J. Fernando, P. Wyckoff, and P. Sadayappan. Parallel FPGA-based All-Pairs Shortest-Paths in a Directed Graph. In Proc. of the 20th IEEE International Parallel and Distributed Processing Symposium, Rhodes, Greece, April 2006.
[13] L. E. Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. PhD thesis, Montana State University, 1969.
[14] Cell Processor. http://www.research.ibm.com/cell/.
[15] J. Choi, J. J. Dongarra, and D. W. Walker. PUMMA: Parallel Universal Matrix Multiplication Algorithms. Concurrency: Practice and Experience, 6(7):543–570, October 1994.
[16] J. Choi, J. J. Dongarra, L. S. Ostrouchov, A. P. Petitet, D. W. Walker, and R. C. Whaley. Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines. Scientific Programming, 5(3):173–184, Fall 1996.
[17] S. Choi and V. K. Prasanna. Time and Energy Efficient Matrix Factorization using FPGAs. In Proc. of the 13th International Conference on Field Programmable Logic and Applications, Lisbon, Portugal, September 2003.
[18] S. Choi, R. Scrofano, V. K. Prasanna, and J. W. Jang. Energy-Efficient Signal Processing Using FPGAs. In Proc. of the 11th ACM International Symposium on Field-Programmable Gate Arrays (FPGA 2003), California, USA, February 2003.
[19] K. Compton and S. Hauck. Reconfigurable Computing: A Survey of Systems and Software. ACM Computing Surveys, 34(2):171–210, June 2002.
[20] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2nd edition, 2001.
[21] Cray Inc. http://www.cray.com/.
[22] V. Daga, G. Govindu, S. Gangadharpalli, V. Sridhar, and V. K. Prasanna. Efficient Floating-point based Block LU Decomposition on FPGAs. In Proc. of the International Conference on Engineering of Reconfigurable Systems and Algorithms, Nevada, USA, June 2004.
[23] M. deLorimier and A. DeHon. Floating-Point Sparse Matrix-Vector Multiply for FPGAs. In Proc. of the 13th ACM International Symposium on Field-Programmable Gate Arrays, California, USA, February 2005.
[24] Y. Dou, S. Vassiliadis, G. K. Kuzmanov, and G. N. Gaydadjiev. 64-bit Floating-Point FPGA Matrix Multiplication. In Proc. of the 13th International Symposium on Field Programmable Gate Arrays, California, USA, February 2005.
[25] DRC, The Coprocessor Company. http://www.drccomputer.com/.
[26] H. A. ElGindy and Y. L. Shue. On Sparse Matrix-Vector Multiplication with FPGA-based System. In Proc. of the 10th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'02), California, USA, April 2002.
[27] G. C. Fox and S. W. Otto. Matrix Algorithms on a Hypercube I: Matrix Multiplication. Parallel Computing, 4:17–31, 1987.
[28] K. A. Gallivan, R. J. Plemmons, and A. H. Sameh. Parallel Algorithms for Dense Linear Algebra Computations. SIAM Review, 32(1):54–135, 1990.
[29] R. A. van de Geijn and J. Watts. SUMMA: Scalable Universal Matrix Multiplication Algorithm. Concurrency: Practice and Experience, 9(4):255–274, 1997.
[30] M. Gokhale, D. Dubois, A. Dubois, M. Boorman, S. Poole, and V. Hogsett. Granidt: Towards Gigabit Rate Network Intrusion Detection Technology. In Proc. of the 12th International Conference on Field-Programmable Logic and Applications, pages 404–413, Montpellier, France, September 2002.
[31] M. Gokhale, J. Frigo, C. Ahrens, J. L. Tripp, and R. Minnich. Monte Carlo Radiative Heat Transfer Simulation on a Reconfigurable Computer. In Proc. of the 14th International Conference on Field Programmable Logic and Applications, pages 95–104, Leuven, Belgium, August 2004.
[32] M. Gokhale, C. Rickett, J. L. Tripp, C. Hsu, and R. Scrofano. Promises and Pitfalls of Reconfigurable Supercomputing. In Proc. of the 2004 International Conference on Field Programmable Logic and Its Applications, pages 11–20, Nevada, USA, September 2006.
[33] G. Govindu, R. Scrofano, and V. K. Prasanna. A Library of Parameterizable Floating-Point Cores for FPGAs and Their Application to Scientific Computing. In Proc. of the International Conference on Engineering Reconfigurable Systems and Algorithms, June 2005.
[34] G. Govindu, S. Choi, V. K. Prasanna, V. Daga, S. Gangadharpalli, and V. Sridhar. A High-Performance and Energy-efficient Architecture for Floating-point based LU Decomposition on FPGAs. In Proc. of the International Conference on Engineering of Reconfigurable Systems and Algorithms, Nevada, USA, June 2004.
[35] K. S. Hemmert and K. D. Underwood. An Analysis of the Double-Precision Floating-Point FFT on FPGAs. In Proc. of the 2005 IEEE Symposium on Field-Programmable Custom Computing Machines, California, USA, April 2005.
[36] J. Hong and H. Kung. I/O Complexity: The Red Blue Pebble Game. In Proc. of the ACM Symposium on Theory of Computing, pages 326–333, Wisconsin, USA, May 1981.
[37] HPC Challenge Benchmark. http://icl.cs.utk.edu/hpcc/index.html.
[38] E. J. Im, K. A. Yelick, and R. Vuduc. SPARSITY: An Optimization Framework for Sparse Matrix Kernels. International Journal of High Performance Computing Applications, 18(1):135–158, February 2004.
[39] Institute of Electrical and Electronics Engineers. IEEE 754 Standard for Binary Floating-Point Arithmetic. IEEE, 1984.
[40] Intel. http://www.intel.com.
[41] J. W. Jang, S. Choi, and V. K. Prasanna. Area and Time Efficient Implementation of Matrix Multiplication on FPGAs. In Proc. of the First IEEE International Conference on Field Programmable Technology, California, USA, December 2002.
[42] P. M. Kogge. The Architecture of Pipelined Computers. Hemisphere Pub. Corp., 1981.
[43] C. L. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh. Basic Linear Algebra Subprograms for FORTRAN Usage. ACM Transactions on Mathematical Software, 5(3):308–323, 1979.
[44] K. Q. Li and V. Y. Pan. Parallel Matrix Multiplication on a Linear Array with a Reconfigurable Pipelined Bus System. IEEE Transactions on Computers, 50(2):519–525, 2001.
[45] J. Liang, R. Tessier, and O. Mencer. Floating Point Unit Generation and Evaluation for FPGAs. In Proc. of the 11th IEEE Symposium on Field-Programmable Custom Computing Machines, pages 185–194, California, USA, April 2003.
[46] W. B. Ligon III, S. McMillan, G. Monn, K. Schoonover, F. Stivers, and K. D. Underwood. A Re-evaluation of the Practicality of Floating Point Operations on FPGAs. In Proc. of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 206–215, California, USA, April 1998.
[47] J. D. McCalpin. Sustainable Memory Bandwidth in Current High Performance Computers. http://home.austin.rr.com/mccalpin/papers/bandwidth/bandwidth.html, 1995.
[48] Mentor Graphics Corp. http://www.mentor.com/.
[49] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. Technical Report UT-CS-94-230, 1994.
[50] G. R. Morris, R. D. Anderson, and V. K. Prasanna. A Hybrid Approach for Mapping Conjugate Gradient onto an FPGA-Augmented Reconfigurable Supercomputer. In Proc. of the 14th IEEE Symposium on Field-Programmable Custom Computing Machines, California, USA, April 2006.
[51] G. R. Morris, R. D. Anderson, and V. K. Prasanna. An FPGA-based Application-specific Processor for Efficient Reduction of Multiple Variable-length Floating-point Data Sets. Presented at the 17th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP'06), Steamboat Springs, CO, September 2006.
[52] G. R. Morris and V. K. Prasanna. An FPGA-Based Floating-Point Jacobi Iterative Solver. In Proc. of the 8th International Symposium on Parallel Architectures, Algorithms, and Networks, Nevada, USA, December 2005.
[53] Nallatech. http://www.nallatech.com.
[54] L. M. Ni and K. Hwang. Vector Reduction Methods for Arithmetic Pipelines. In Proc. of the 6th International Symposium on Computer Arithmetic, pages 144–150, June 1983.
[55] M. Penner and V. Prasanna. Cache-Friendly Implementations of Transitive Closure. In Proc. of the International Conference on Parallel Architectures and Compilation Techniques, Barcelona, Spain, September 2001.
[56] A. Pinar and M. T. Heath. Improving Performance of Sparse Matrix-Vector Multiplication. In Proc. of Supercomputing 1999, Oregon, USA, November 1999.
[57] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 1992.
[58] I. Sahin, C. S. Gloster, and C. Doss. Feasibility of Floating-Point Arithmetic in Reconfigurable Computing Systems. Presented at the 3rd Military and Aerospace Programmable Logic Devices (MAPLD) Conference, Maryland, USA, September 2000.
[59] K. Sano, T. Iizuka, and S. Yamamoto. Systolic Architecture for Computational Fluid Dynamics on FPGAs. In Proc. of the 15th IEEE Symposium on Field-Programmable Custom Computing Machines, California, USA, April 2007.
[60] R. Scrofano, M. Gokhale, F. Trouw, and V. K. Prasanna. A Hardware/Software Approach to Molecular Dynamics on Reconfigurable Computers. In Proc. of the 14th IEEE Symposium on Field-Programmable Custom Computing Machines, California, USA, April 2006.
[61] R. Scrofano and V. K. Prasanna. Computing Lennard-Jones Potentials and Forces with Reconfigurable Hardware. In Proc. of the International Conference on Engineering of Reconfigurable Systems and Algorithms, pages 284–290, June 2004.
[62] Silicon Graphics, Inc. http://www.sgi.com/.
[63] W. D. Smith and A. R. Schnore. Towards an RCC-based Accelerator for Computational Fluid Dynamics Applications. In Proc. of the 2003 International Conference on Engineering Reconfigurable Systems and Algorithms, Nevada, USA, June 2003.
[64] SRC Computers, Inc. http://www.srccomp.com/.
[65] F. Tinetti and A. D. Giusti. Broadcast-Based Parallel LU Factorization. In Proc. of the 11th International Euro-Par Conference, pages 867–876, Lisbon, Portugal, August 2005.
[66] S. Toledo. Improving Memory-System Performance of Sparse Matrix-Vector Multiplication. IBM Journal of Research and Development, 41(6):711–725, 1997.
[67] K. D. Underwood and K. S. Hemmert. Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance. In Proc. of the 2004 IEEE Symposium on Field-Programmable Custom Computing Machines, California, USA, April 2004.
[68] University of Florida Sparse Matrix Collection. http://www.cise.ufl.edu/research/sparse/matrices/.
[69] G. Venkataraman, S. Sahni, and S. Mukhopadhyaya. A Blocked All-Pairs Shortest-Paths Algorithm. Journal of Experimental Algorithmics, 8, 2003.
[70] X. Wang, S. Braganza, and M. Leeser. Advanced Components in the Variable Precision Floating-Point Library. In Proc. of the IEEE Symposium on Field-Programmable Custom Computing Machines, California, USA, April 2006.
[71] X. Wang and S. G. Ziavras. Performance Optimization of an FPGA-Based Configurable Multiprocessor for Matrix Operations. In Proc. of the IEEE International Conference on Field-Programmable Technology, Tokyo, Japan, December 2003.
[72] R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated Empirical Optimization of Software and the ATLAS Project. Parallel Computing, 27(1–2):3–35, 2001. Also available as University of Tennessee LAPACK Working Note #147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
[73] D. Womble, D. Greenberg, R. Riesen, and S. Wheat. Out of Core, Out of Mind: Practical Parallel I/O. In Proc. of the Scalable Parallel Libraries Conference, pages 10–16, Mississippi, USA, 1993.
[74] Xilinx Incorporated. http://www.xilinx.com.
[75] L. Zhuo, G. R. Morris, and V. K. Prasanna. Designing Scalable FPGA-Based Reduction Circuits Using Pipelined Floating-Point Cores. In Proc. of the 12th Reconfigurable Architectures Workshop, Colorado, USA, April 2005.
Abstract
Recently, high-end computing systems have been introduced that employ reconfigurable hardware as application-specific hardware accelerators for general-purpose processors. These systems provide new opportunities for high-performance implementations of scientific applications. However, they also pose new design challenges, including utilization of available hardware resources, exploitation of multiple levels of memory, and hardware/software co-design.