MAPPING SPARSE MATRIX SCIENTIFIC APPLICATIONS ONTO FPGA-AUGMENTED RECONFIGURABLE SUPERCOMPUTERS by Gerald R. Morris A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING) December 2006 Copyright 2006 Gerald R. Morris Epigraph Romans 10:9 – That if thou shalt confess with thy mouth the Lord Jesus, and shalt believe in thine heart that God hath raised him from the dead, thou shalt be saved. ii Dedication To honor and glorify Jesus Christ, my Lord and Savior. iii Acknowledgements I would like to express my love and appreciation to my wife, Lisa, who supported me during this difficult time and willingly took second place behind my research. She will always be in first place from now on. I want to thank mom and dad for doing a fantastic job with all seven of us kids; they made sure we were ready for the show. I profoundly appreciate the advice and encouragement I received from my advisor, Prof. Viktor K. Prasanna. Several of my articles would have been rejected had it not been for his insight and gentle guidance. His wisdom and expertise allowed me to succeed. I would like to thank my other committee members: Prof. Michel Dubois for his insight into experimental research; he was enthused that I had actually built the FPGA-based circuits and understands that sometimes the only way to discover is to build. Prof. Aiichiro Nakano for his genuine interest; he asked to be on my committee and helped me believe my work was important. I thank all the committee members for their constructive criticism, which helped me produce a defensible dissertation. I thank the late Dr. Johnny Welborn and Warrenton Independent Baptist Church, for their prayers. My boss, John West, for his optimism and support. Deborah Dent, Bill Ward, and Tom Oppe, for their faith and encouragement. My buddies in CS&E, for their merciless Ph.D. humor. Rikk, Ron, and Ling, for helping me over the rough spots. SRC tech support, for helping me get the designs to work. Finally, I’d like to thank Diane, Rosine, Aimee, Mary, Amy, and Jackie for their friendship, laughter, and helping me navigate the administrative maze at USC and at work. iv Contents Epigraph ii Dedication iii Acknowledgements iv List of Tables viii List of Figures ix Abstract xii 1 Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Research contributions . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.2.1 High-level language based reduction operations . . . . . . . . 4 1.2.2 Sparse matrix applications on reconfigurable computers . . . 7 2 Field Programmable Gate Arrays and Reconfigurable Computers 10 2.1 Field programmable gate arrays . . . . . . . . . . . . . . . . . . . . 10 2.1.1 Internal architecture . . . . . . . . . . . . . . . . . . . . . . 12 2.1.2 Hardware description language design flow . . . . . . . . . . 14 2.1.3 High-level language design flow . . . . . . . . . . . . . . . . 15 2.2 Floating-point computations on FPGAs . . . . . . . . . . . . . . . . 16 2.2.1 Floating-point arithmetic cores . . . . . . . . . . . . . . . . . 18 2.2.2 Floating-point kernels . . . . . . . . . . . . . . . . . . . . . 19 2.3 Reconfigurable computers . . . . . . . . . . . . . . . . . . . . . . . 19 2.3.1 System architecture . . . . . . . . . . . . . . . . . . . . . . . 20 2.3.2 Hardware description language design flow . . . . . . . . . . 21 2.3.3 High-level language design flow . . . . 
. . . . . . . . . . . . 22 2.3.4 Hybrid design flow . . . . . . . . . . . . . . . . . . . . . . . 25 v 2.4 Postscript . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3 Sparse Matrix Calculations on General-Purpose Processors 28 3.1 Compressed sparse row format . . . . . . . . . . . . . . . . . . . . . 28 3.2 Sparse matrix kernel performance . . . . . . . . . . . . . . . . . . . 31 3.3 Improving performance of sparse matrix kernels . . . . . . . . . . . . 33 3.3.1 Preprocess data . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3.2 Optimize algorithm . . . . . . . . . . . . . . . . . . . . . . . 34 3.3.3 Specialized hardware . . . . . . . . . . . . . . . . . . . . . . 35 3.4 Reality of improvement efforts . . . . . . . . . . . . . . . . . . . . . 36 4 Floating-Point Reduction Operations 40 4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2 Reduction problem . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.3 Reduction method categories . . . . . . . . . . . . . . . . . . . . . . 46 4.4 Literature review . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 4.5 Hardware description language reduction . . . . . . . . . . . . . . . 51 4.5.1 Tree traversal method design . . . . . . . . . . . . . . . . . . 51 4.5.1.1 Description of schedule . . . . . . . . . . . . . . . 54 4.5.1.2 Proof of sufficient pipeline slots . . . . . . . . . . . 55 4.5.1.3 Proof of collision-free schedule . . . . . . . . . . . 56 4.5.1.4 Proof of bounded buffer growth . . . . . . . . . . . 56 4.5.1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . 58 4.5.2 Striding method design . . . . . . . . . . . . . . . . . . . . . 58 4.5.2.1 Description of operational modes . . . . . . . . . . 59 4.5.2.2 Proof of maximum cycles to coalesce . . . . . . . . 61 4.5.2.3 Proof of dual pipeline sufficiency . . . . . . . . . . 61 4.5.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . 63 4.6 High-level language reduction . . . . . . . . . . . . . . . . . . . . . 63 4.6.1 Pipelines and high-level language reduction . . . . . . . . . . 63 4.6.2 Fundamental problem . . . . . . . . . . . . . . . . . . . . . 64 4.6.3 Striding method design . . . . . . . . . . . . . . . . . . . . . 65 4.6.3.1 Description of toroidal access pattern . . . . . . . . 67 4.6.3.2 Proof of maximum buffer size . . . . . . . . . . . . 68 4.6.3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . 70 5 Accelerating Sparse Iterative Solvers using Reconfigurable Computers 71 5.1 Iterative solver primer . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.1.1 Jacobi method . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.1.2 Conjugate gradient . . . . . . . . . . . . . . . . . . . . . . . 75 5.2 FPGA design considerations . . . . . . . . . . . . . . . . . . . . . . 78 5.3 Sparse matrix Jacobi solver . . . . . . . . . . . . . . . . . . . . . . . 80 5.3.1 Software-only sparse matrix Jacobi system design . . . . . . 80 vi 5.3.2 FPGA design boundary . . . . . . . . . . . . . . . . . . . . . 82 5.3.3 FPGA-augmented sparse matrix Jacobi system design . . . . 82 5.3.3.1 Processor design . . . . . . . . . . . . . . . . . . . 83 5.3.3.2 Processor configuration . . . . . . . . . . . . . . . 86 5.3.3.3 Processor operation . . . . . . . . . . . . . . . . . 88 5.3.4 Jacobi implementation details . . . . . . . . . . . . . . . . . 91 5.3.4.1 Target platform . . . . . . . . . . . . . . . . . . . 91 5.3.4.2 Implementation summary . . . . . . . . . . . . . . 92 5.3.5 Performance comparison . . . 
. . . . . . . . . . . . . . . . . 93 5.3.5.1 Test matrices . . . . . . . . . . . . . . . . . . . . . 93 5.3.5.2 Test conditions . . . . . . . . . . . . . . . . . . . . 94 5.3.5.3 Test results . . . . . . . . . . . . . . . . . . . . . . 94 5.4 Sparse matrix conjugate gradient solver . . . . . . . . . . . . . . . . 95 5.4.1 Buy versus build . . . . . . . . . . . . . . . . . . . . . . . . 95 5.4.2 Software-only sparse matrix CG system design . . . . . . . . 96 5.4.3 FPGA design boundary . . . . . . . . . . . . . . . . . . . . . 97 5.4.4 FPGA-augmented sparse matrix CG system design . . . . . . 98 5.4.4.1 Processor design . . . . . . . . . . . . . . . . . . . 98 5.4.4.2 Processor configuration . . . . . . . . . . . . . . . 99 5.4.4.3 Processor operation . . . . . . . . . . . . . . . . . 100 5.4.5 CG implementation details . . . . . . . . . . . . . . . . . . . 101 5.4.5.1 Target platform . . . . . . . . . . . . . . . . . . . 102 5.4.5.2 Implementation summary . . . . . . . . . . . . . . 102 5.4.6 Performance comparison . . . . . . . . . . . . . . . . . . . . 103 5.4.6.1 Test matrices . . . . . . . . . . . . . . . . . . . . . 103 5.4.6.2 Test conditions . . . . . . . . . . . . . . . . . . . . 103 5.4.6.3 Test results . . . . . . . . . . . . . . . . . . . . . . 104 5.5 Analysis of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 5.5.1 Analytical approach . . . . . . . . . . . . . . . . . . . . . . 105 5.5.2 Performance on next generation reconfigurable computer . . . 106 5.5.3 Expected performance for small test cases . . . . . . . . . . . 109 5.5.4 Explanation for increased performance . . . . . . . . . . . . 110 6 Conclusions and Future Directions 111 6.1 Summary of research contributions . . . . . . . . . . . . . . . . . . . 111 6.1.1 High-level language based reduction operations . . . . . . . . 111 6.1.2 Sparse matrix applications on reconfigurable computers . . . 112 6.1.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 6.2 Impact of research . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.3 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 Reference List 116 vii List of Tables 3.1 Sparse matrix vector multiply performance . . . . . . . . . . . . . . 37 5.1 SJAC processor design parameters . . . . . . . . . . . . . . . . . . . 87 5.2 Jacobi implementation parameters . . . . . . . . . . . . . . . . . . . 92 5.3 Jacobi test matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.4 SMVM processor design parameters . . . . . . . . . . . . . . . . . . 100 5.5 Conjugate gradient implementation parameters . . . . . . . . . . . . 102 5.6 Conjugate gradient test matrices . . . . . . . . . . . . . . . . . . . . 103 5.7 Small test cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.8 MAP specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.9 Small test case performance . . . . . . . . . . . . . . . . . . . . . . 109 viii List of Figures 2.1 Idealized FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 FPGA internal architecture . . . . . . . . . . . . . . . . . . . . . . . 12 2.3 LUT-based digital logic . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4 HDL-based FPGA design . . . . . . . . . . . . . . . . . . . . . . . . 14 2.5 HLL-based FPGA design . . . . . . . . . . . . . . . . . . . . . . . . 16 2.6 The three p’s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.7 RC system architecture . . . . . . . . . . . . . . . . . . . . . 
. . . . 21 2.8 HDL-based RC design flow . . . . . . . . . . . . . . . . . . . . . . . 22 2.9 HLL-based RC design flow . . . . . . . . . . . . . . . . . . . . . . . 24 2.10 Hybrid RC design flow . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.1 Dense matrix vector multiply algorithm . . . . . . . . . . . . . . . . 29 3.2 Example of CSR format . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.3 Sparse matrix vector multiply algorithm . . . . . . . . . . . . . . . . 30 3.4 Memory access pattern . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.5 Smart memory controller access pattern . . . . . . . . . . . . . . . . 35 3.6 Regular matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.7 Irregular matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.1 Reduction trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2 Naive matrix vector multiply unit . . . . . . . . . . . . . . . . . . . . 43 ix 4.3 Buffer overrun . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.4 Reduction methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.5 Parallel recursive doubling . . . . . . . . . . . . . . . . . . . . . . . 49 4.6 Pipelined recursive vector reduction . . . . . . . . . . . . . . . . . . 50 4.7 Pipelined vector reduction with feedback . . . . . . . . . . . . . . . . 51 4.8 Binary adder tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.9 Partially compacted binary tree . . . . . . . . . . . . . . . . . . . . . 53 4.10 Fully compacted binary tree . . . . . . . . . . . . . . . . . . . . . . 53 4.11 Contributions to pipeline fullness . . . . . . . . . . . . . . . . . . . . 55 4.12 Parallel looped adders . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.13 Operating modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.14 Coalescing a 5-stage pipeline . . . . . . . . . . . . . . . . . . . . . . 62 4.15 Sparse matrix vector multiply HLL pseudo-code . . . . . . . . . . . . 64 4.16 Modified sparse matrix vector multiply HLL pseudo-code . . . . . . . 65 4.17 Two-stage sparse matrix vector multiply HLL pseudo-code . . . . . . 66 4.18 Accumulator architecture . . . . . . . . . . . . . . . . . . . . . . . . 67 4.19 Toroidal access pattern of arrayS . . . . . . . . . . . . . . . . . . . 68 4.20 Snapshot of stage 1 partial summation unit . . . . . . . . . . . . . . . 69 5.1 Sparse matrix Jacobi algorithm . . . . . . . . . . . . . . . . . . . . . 74 5.2 Quadratic form of vectorx . . . . . . . . . . . . . . . . . . . . . . . 76 5.3 Conjugate gradient algorithm . . . . . . . . . . . . . . . . . . . . . . 77 5.4 Choosing an FPGA design boundary . . . . . . . . . . . . . . . . . . 78 5.5 Software-only sparse matrix Jacobi system design . . . . . . . . . . . 80 5.6 FPGA-augmented sparse matrix Jacobi system design . . . . . . . . . 82 5.7 Idealized dense matrix Jacobi processor . . . . . . . . . . . . . . . . 83 5.8 Idealized sparse matrix Jacobi processor . . . . . . . . . . . . . . . . 85 x 5.9 FPGA-based sparse matrix Jacobi processor . . . . . . . . . . . . . . 86 5.10 VHDL-based IP cores . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.11 SJAC processor local memory matrix storage . . . . . . . . . . . . . 89 5.12 Jacobi run time comparison . . . . . . . . . . . . . . . . . . . . . . . 95 5.13 Software-only sparse matrix CG system design . . . . . . . . . . . . 96 5.14 FPGA-augmented sparse matrix CG system design . . . . . . . . . . 
98 5.15 FPGA-based sparse matrix vector multiply processor . . . . . . . . . 99 5.16 SMVM processor local memory matrix storage . . . . . . . . . . . . 101 5.17 CG run time comparison . . . . . . . . . . . . . . . . . . . . . . . . 104 5.18 Jacobi actual and estimated speedup . . . . . . . . . . . . . . . . . . 108 5.19 CG actual and estimated speedup . . . . . . . . . . . . . . . . . . . . 109 xi Abstract The large capacity of field programmable gate arrays (FPGAs) has prompted re- searchers to map computational kernels onto FPGAs. In some instances, these ker- nels achieve significant speedups over their software-only counterparts running on general-purpose processors. The success of these efforts has spurred supercomputer companies to develop reconfigurable computers (RCs) that allow the FPGAs to be- come, in effect, application-specific coprocessors. In concert with the RCs are high- level language-to-hardware description language (HLL-to-HDL) compilers that facil- itate development of FPGA-based kernels using HLL-based programming rather than HDL-based hardware design. In theory, these technologies allow end-users to cre- ate high-performance custom computing architectures. In practice, acceleration of floating-point scientific kernels is still problematic. Sequential vector reductions like accumulation are difficult because the pipelined floating-point units introduce loop carried dependences that prevent the hardware from being fully pipelined. This has a profound impact on fundamental scientific kernels such as matrix vector multiply. Without pipelining, the potential performance advantage of FPGA-based kernels is eliminated. This dissertation develops algorithms and architectures for time and area efficient software and hardware implementation of scientific kernels on RCs. In particular, xii it deals with the problem of mapping IEEE Standard 754 double-precision floating- point sparse matrix computations onto FPGA-augmented RCs using an HLL-to-HDL compiler. The major contributions of this research are firstly, a novel algorithm and architec- ture that facilitates HLL-based reduction of multiple, sequentially delivered floating- point vectors without pipeline stalls or buffer overflows, and secondly, the demon- stration of how to speedup an important class of scientific applications, that is, sparse matrix solvers, by mapping them onto reconfigurable supercomputers. Optimized software version of two classic iterative solvers, the Jacobi method, and conjugate gradient are used as a baseline for comparison. Using heuristics and techniques pre- sented in the dissertation, these two solvers are accelerated using FPGA-based ker- nels. To ensure a fair comparison, both versions of each solver are developed using the same software baseline, and both versions are run on the same platform using the same set of sparse linear equations. The FPGA-augmented versions have a measured speedup of over two on a current-generation RC and an estimated speedup of over six on a next-generation RC. xiii Chapter 1 Introduction This chapter provides a context for the dissertation. Section 1.1 identifies the main technologies and problem areas associated with this research. Section 1.2 summarizes the principal research contributions and, in a certain sense, is an executive summary of the dissertation. 
1.1 Background The intention of this background section is to give enough information on field pro- grammable gate arrays (FPGAs), reconfigurable computers (RCs), high-level lan- guage (HLL)-to-hardware description language (HDL) compilers, sparse matrix com- putations, and reduction operations such that the context of this research is estab- lished. A detailed technical discussion on each of these areas is given in later sections of this dissertation. FPGAs consist of programmable logic elements embedded in a programmable in- terconnection network that is surrounded by programmable input/output (I/O) blocks. There are basically three kinds of FPGAs. Flash memory FPGAs can be reconfigured multiple times. Unlike static random access memory (SRAM), flash memory retains 1 its state when the power is removed. Therefore flash memory FPGAs are nonvolatile, that is they retain their configuration even when the power is turned off. Antifuse FPGAs can only be configured one time and are obviously nonvolatile. SRAM FP- GAs can be reconfigured multiple times and are obviously volatile by virtue of their SRAM-based configuration memory. Every time the device is powered up the FPGA configuration memory must be reloaded. It is this last category of FPGAs that are of interest in this dissertation. The others are mentioned only for the sake of posterity. Conceptually, FPGAs can be configured to implement arbitrary digital logic circuits; in practice there are physical constraints, which limit circuit size and speed. Using an HDL or other design mechanism, a digital logic circuit description or behavioral model is created. Via a software tool suite, the design representation is converted into a stream of ones and zeros, which is called a configuration bitstream. When the bit- stream is loaded onto the FPGA, the FPGA takes on the behavior of the digital logic circuit. If one needs a different digital logic circuit, one simply loads the correspond- ing configuration bitstream onto the FPGA. The traditional use of FPGAs is as a sub- stitute for application specific integrated circuits (ASICs) in low volume applications, in prototypes and design scenarios where there is a considerable amount of design uncertainty, and in systems where there are frequent requirements changes. In recent years, the role of FPGAs has been expanded. Since mathematical operations can be implemented via digital logic circuits, and FPGAs can be reconfigured to implement digital logic circuits, it stands to reason that FPGAs can be reconfigured to perform mathematical operations. Thus, the fine-grained resolution, mega logic gate capacity, and other features of contemporary FPGAs allow designers to create floating-point intellectual property (IP) cores such as adders and multipliers. This capability enables the development of massively parallelized, deeply pipelined FPGA-based application 2 specific kernels that can sometimes outperform their software-only equivalents run- ning on a GPP. This is an emerging technology that holds forth great promise. The use of FPGAs to create high performance computational kernels prompted startup and established supercomputer companies to offer reconfigurable computers (RCs) that combine GPPs with high-end FPGAs. The FPGAs can be reconfigured at run time to become, in effect, application-specific coprocessors. 
In addition, research institutions and commercial vendors have introduced compilers that allow FPGA configuration bitstreams to be generated from HLLs such as Java and C rather than from traditional HDLs such as VHDL and Verilog. These exciting new hardware and software technologies allow end-users to create custom computing architectures aimed at the computationally intensive parts of each problem.

Modern GPPs employ architectural features that result in high performance for a broad range of applications. However, for large floating-point sparse matrix calculations having an erratic memory access pattern, these same features can result in performance degradation. The data associated with sparse matrix operations may not exhibit the temporal or spatial locality of reference assumed by the GPP memory hierarchy. Depending on the sparsity structure of the matrix, the GPP could end up fetching a cache block, only using a single value, and then evicting the cache block to make room for another. This is not a new problem, and it has been studied in great detail over an extended period of time. Data optimizations such as reordering, platform-specific kernel optimizations such as cache blocking, and specialized hardware such as processing in memory (PIM) chips can sometimes mitigate the problem. However, these optimizations are highly dependent on the sparsity structure of the matrices being processed. For large, irregular sparse matrices, the traditional optimizations may provide little improvement. In contrast, the FPGA-based processing elements (PEs) on an RC, which do not have the same kind of memory hierarchy as the GPP-based PEs, can sometimes be used to improve the performance of sparse matrix computational kernels.

For integer and fixed-point applications such as digital signal processing and biological sequencing, RCs and HLL-to-HDL compilers have already proven themselves. However, for floating-point scientific computing, accelerating certain kinds of applications via FPGA-based RCs can be difficult. In particular, HLL-to-HDL compilers do not directly support efficient multiple-set floating-point vector reduction operations such as can be seen in the inner loop accumulation of matrix multiply. There are HDL-based vector reduction solutions, however, integration of such HDL-based intellectual property (IP) reduction cores is problematic since the HLL-to-HDL compilers cannot pipeline variable latency components. These deficiencies have a huge impact on the performance of fundamental scientific kernels such as sparse matrix vector multiplication, and they effectively negate the performance advantages of RCs for many scientific applications.

1.2 Research contributions

The primary contributions of this research are described below. The nature of the problem is summarized, a brief description of the solution is provided, and an indication about why each contribution is important is rendered. In a sense, this section is an executive summary of the dissertation.

1.2.1 High-level language based reduction operations

Vector reductions, which are frequently used in scientific calculations, are operations that input one or more n-vectors and reduce them to a scalar value. Vector accumulation, s = Σ x_i, is one of the most well known examples of a reduction. Conceptually, one can use a full binary tree of floating-point functional units to create a reduction architecture. For example, a reduction unit to accumulate the elements of an n-vector could be created by having a (lg n)-level binary tree of adders with n/2 adders at the leaf nodes, n/4 adders at the next tree level, and so forth until there is a single adder at the root node. If one uses fully pipelined adders in the tree, one obtains a fully pipelined accumulator that accepts one n-vector per clock cycle, and (after the latency) emits one result per clock cycle. Clearly, a binary tree n-vector data path of pipelined floating-point adders, which leverages both parallelism and pipelining, results in a very high-performance reduction architecture. However, the large size of floating-point IP cores and FPGA resource constraints preclude implementing full binary tree reduction circuits on a single FPGA for even modest values of n. Therefore, large parallel vector reductions must be translated into a series of smaller parallel vector reductions. The resulting sequential stream of partially reduced values must then be reduced to the final result. Unfortunately, nested loops involving floating-point arithmetic calculations cannot be flattened and pipelined by HLL-to-HDL development environments because of the loop carried dependence associated with the pipelined floating-point functional units. A simple example illustrates the problem. To add three values, one could enter two of them into the adder pipeline. But then, one would have to wait till these two values traversed the adder pipeline before one could add the third number. The archetypal example of where this loop carried dependence would be a problem is sparse matrix vector multiply, where the inner loop does an accumulation reduction. The seemingly obvious solution, mapping HDL-based intellectual property (IP) core accumulators into the HLL-to-HDL environment, also fails. The loop containing the variable latency IP core cannot be pipelined since the HLL-to-HDL compiler cannot pipeline a variable latency loop (how many pipeline stages are in a variable latency loop?). This is a huge problem that, prior to this research, had not been solved. Without pipelining, MHz-scale FPGA-based kernels cannot compete with GHz-scale GPPs. This capability gap prevents development of a number of critical FPGA-based scientific kernels such as matrix multiply, matrix vector multiply, dot product, vector norm, min, max, and so forth, and is one of the primary reasons why RCs have not become part of mainstream supercomputing.

One of the principal contributions of this research, then, is the development of a viable solution to the reduction problem. In particular, using accumulation reduction as an example, a novel HLL-based partial-summation architecture and two-stage toroidal access algorithm are developed. The basic idea is to take the single reduction loop, which cannot be pipelined by the HLL-to-HDL compiler because of the loop carried dependence, and split it up into two fully pipelined, concurrently executing loops. Unlike the apparent concurrency of a threaded uniprocessor, these loops are compiled by the HLL-to-HDL compiler into independent hardware pipelines that are truly operating concurrently. The first loop employs an α-stage pipelined adder and an α-row by α-column partial summation array, S. The second loop has a simple α-input binary tree accumulator that reduces completed rows of S. A round-robin scheduling algorithm guarantees an α-cycle interval between subsequent references to the same memory location in S.
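To make the scheme concrete, here is a minimal sequential C sketch of the two loops. It is an illustration only: ALPHA, the array names, and the one-after-the-other execution are assumptions made here, and in the actual design the two loops are compiled into hardware pipelines that run concurrently.

```c
/* Illustrative C model of the two-stage partial-summation reduction.
 * ALPHA models the adder pipeline depth; all names are invented here.
 * In hardware, stage 1 and stage 2 run concurrently, and the round-robin
 * column stride in stage 1 guarantees ALPHA cycles between successive
 * read-modify-writes of the same S entry, so the pipelined adder never
 * stalls on a loop carried dependence. */
#include <stdio.h>

#define ALPHA 8                        /* assumed adder pipeline depth */

static double S[ALPHA][ALPHA];         /* partial summation array      */

/* Stage 1: fold one input vector into row r of S, visiting the ALPHA
 * columns round-robin so no column is revisited within ALPHA inputs. */
static void stage1(int r, const double *x, int n)
{
    for (int i = 0; i < n; i++)
        S[r][i % ALPHA] += x[i];
}

/* Stage 2: reduce a completed row of S to a scalar and clear it.
 * In hardware this is the ALPHA-input binary tree output accumulator. */
static double stage2(int r)
{
    double sum = 0.0;
    for (int j = 0; j < ALPHA; j++) {
        sum += S[r][j];
        S[r][j] = 0.0;                 /* row is ready for reuse */
    }
    return sum;
}

int main(void)
{
    double v[5] = {1.0, 2.0, 3.0, 4.0, 5.0};
    stage1(0, v, 5);
    printf("reduction = %g\n", stage2(0));   /* prints 15 */
    return 0;
}
```

The sequential model trivially produces the right sum; the point of the structure is that each S entry is touched at most once every ALPHA inputs, which is exactly the slack an ALPHA-stage pipelined adder needs.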
Conceptually, the α × α array is wrapped top-to-bottom to produce a cylinder, and the cylinder is wrapped end-to-end to produce a torus. The accumulation of a given input vector is restricted to a specific row within the S array and accesses a different column for each new input value. For example, if the first column has just been accessed, then the second column will be accessed next, followed by the third column, and so forth. After the last column is accessed, the first column will again be accessed. However, by this time, the previous contents of the column and its associated input have already traversed the adder pipeline and been written back into the S array. Thus, even if there are more than α elements in the sequential input vector, the major circumference of the torus (number of columns) is α, thereby ensuring that any previous data at a given location has already traversed the adder and been written back. When the given input vector is exhausted, all that remains to do is to reduce the α partial reductions in S down to a single value. As mentioned, this is accomplished by a binary tree output accumulator. If a series of small vectors are to be reduced, the minor circumference of the torus (number of rows) is also α, thereby ensuring that by the time a row of the S array needs to be reused, its contents have already been sent to the output accumulator and the row initialized to 0.

This unique approach allows fully pipelined HLL-based sequential vector reduction operations to be implemented on contemporary RCs and removes one of the major impediments preventing mainstream acceptance and use of RC-based supercomputing.

1.2.2 Sparse matrix applications on reconfigurable computers

Solving systems of linear equations, Ax = b, where A is an n × n real-valued matrix, x is the unknown real n-vector, and b is a known real n-vector, is essential for a number of applications. In many scientific codes such as weather forecasting, supersonic air flows, signal and image processing, and so forth, the A matrix is sparse. The fundamental problems associated with sparse matrix computations on GPPs were previously mentioned. Since the compute engine is ultimately a GPP and its memory hierarchy, the sparsity structure of the matrix sometimes prevents the traditional optimizations from being effective. In contrast, RCs have a different kind of memory subsystem that is impervious to the sparsity structure of the matrices. As such, some sparse matrix computations may have an affinity for RCs; that is, sparse matrix computations, if properly designed, should have performance on an RC that is independent of the sparsity structure of the matrix. If one can design an architecture and algorithm for a given sparse matrix computation that has enough parallelism and is fully pipelined, one can also expect high performance compared to an equivalent software-only computation running on a GPP.

The second principal contribution of this research, then, is the development of architectures and algorithms to facilitate the development of high-performance sparse matrix iterative solvers using HLL-to-HDL compilers. This part of the research describes some of the FPGA design considerations one needs to be aware of prior to mapping a kernel onto an RC. Perhaps the most important of these considerations is the FPGA design boundary, which identifies the portion of the design that should be mapped onto the FPGA. The design boundary could range from the entire system to a part of a single module.
Various factors such as anticipated speedup, nature of the algorithm (for example compute intensive versus control intensive), ability to reuse data, anticipated memory bandwidth, algorithm developmental stability, avail- able parallelism, and so forth are considered. Using the appropriate FPGA design con- siderations, binary tree k-vector data paths, and the reduction techniques previously described, IEEE Std 754 double-precision floating-point sparse matrix computations are mapped onto FPGA-augmented RCs. In particular, two well-known sparse matrix iterative linear equation solvers, the Jacobi method and conjugate gradient (CG), are mapped onto reconfigurable computers using a hybrid development technique that in- volves the use of both HDL-based IP cores and HLL-to-HDL compiler technology. In the case of the Jacobi method, the FPGA design boundary is essentially the entire kernel. In contrast, only a portion of the CG kernel is selected as the FPGA design boundary. The sparse matrix iterative solvers are designed and implemented in such a 8 way as to get more than a two-fold improvement over software on current generation RCs and more than a six-fold improvement over software on next-generations RCs. For many of the large scientific codes, the principal resource constraint is CPU cycles. If the scientists and engineers who use supercomputers can cut their simulation time by a factor of two or more, then this frees up the compute cycles such that either more simulations or finer resolution simulations can be undertaken. This research holds forth the promise of significant reductions in run times for crucial computational elements, and takes reconfigurable supercomputing one step closer to mainstream adoption. 9 Chapter 2 Field Programmable Gate Arrays and Reconfigurable Computers This chapter provides details on the principal technologies associated with this re- search. Section 2.1 is a primer on field programmable gate arrays. It includes a dis- cussion of FPGA architecture and describes the design flows used to develop FPGA- based circuits. Section 2.2 primarily talks about the use of FPGAs to implement floating-point computational kernels. Section 2.3 provides an overview of reconfig- urable computers. It includes a look at RC system architectures and describes the design flows used to develop RC-based applications. 2.1 Field programmable gate arrays FPGAs are semiconductor devices consisting of programmable and fixed digital logic elements immersed in a programmable interconnection network that is surrounded by programmable I/O blocks. The FPGA was invented in the mid 1980’s by Xilinx, Inc. cofounder Ross Freeman [93]. At the time, the idea of trading small and fast min- imized logic designs for large and slow memory-based logic emulation seemed like a pretty radical idea. However, Freeman’s idea was ultimately redeemed by Moore’s 10 Law [54, 55], which basically states that the number of transistors on a chip doubles every 18 months. Today, FPGAs have become a staple in many digital logic applica- tions. clk CLB CLB CLB B R A M m u l t CLB CLB CLB CLB CLB GPP CLB I/O I/O I/O I/O CLB Figure 2.1: Idealized FPGA As mentioned in the previous chapter, the focus in this research is on SRAM- based FPGAs, which can be reconfigured by end users via a configuration bitstream to implement digital logic circuits. 
The mega gate capacity of FPGAs allows devel- opment of complex digital circuits such as switches [50], Ethernet controllers [91], USB controllers [4], memory controllers [28], and even microprocessors [1]. Princi- pal FPGA vendors include Xilinx [90], Altera [6], Lattice Semiconductor [47], and Actel [2]. While FPGA-based logic circuits are slower than equivalent application- specific integrated circuits (ASICs), the lower non-recurring engineering costs, faster development time, and the ability to reconfigure the FPGAs, make them a viable al- ternative in many digital design applications. In theory, one can design any digital logic circuit and place it on an FPGA. In practice, the primary constraints are area, clock rate, and I/O. 11 2.1.1 Internal architecture FPGA vendors each use different terminology to describe the internal structure of an FPGA, for example, a Xilinx “slice” is an Altera “adaptive logic module.” In this dissertation, Xilinx terminology is used to describe the internal architecture of an FPGA. As shown in Figure 2.2(a), small one-bit-wide blocks of SRAM, known as 2 n one-bit SRAM cells n address bits a 2 a 1 a 0 a 3 d in d out r/w (a) LUT logic FF a 2 a 1 a 0 a 3 a 6 a 5 a 4 a 7 logic FF control Q Y Y Q X X L U T L U T m u x m u x (b) slice slice slice slice slice r o u t i n g i n t e r n a l (c) CLB Figure 2.2: FPGA internal architecture look-up tables (LUTs) are the fundamental programmable logic element in FPGAs. A slice, as idealized in Figure 2.2(b), contains LUTs, flip-flops, and other control logic and is the basic unit of area when determining the size of an FPGA-based design. A configurable logic block (CLB), as shown in Figure 2.2(c), contains multiple slices 12 and a programmable internal switch that allow modestly sized subcircuits to be built without having to deal with the external place and route issues. 0 1 0 1 0 1 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1 1 1 0 1 1 0 1 0 0 1 b a c i s c o (a) adder truth table 10 11 01 00 1 1 1 1 1 0 c i ab 10 11 01 00 1 1 1 1 1 0 c i ab ab b c a c c i i o + + = b a c s i ⊕ ⊕ = (b) Karnaugh map c i a b o c s (c) minimized logic address (minterms) LUTs (values) 0 1 0 1 0 1 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1 1 1 0 1 1 0 1 0 0 1 b a c i s c o (d) dual-LUT equivalent Figure 2.3: LUT-based digital logic In traditional logic design, one starts with a truth table, for example, the full adder truth table shown in Figure 2.3(a). The boolean equations corresponding to the truth table output values are minimized, for example, via the simple Karnaugh map as shown in Figure 2.3(b). Finally, the minimized boolean equations are mapped onto digital logic gates as shown in Figure 2.3(c). LUTs can mimic the behavior of logical operators. For example, the full adder circuit can be implemented on an FPGA using 13 two LUTs, as suggested by Figure 2.3(d). The adder truth table values are loaded into the LUTs. Each logical operator is expressed in canonical sum of products form wherein the LUT address bits correspond to the logic inputs (the minterms), and the content bit at each LUT address corresponds to the logic operator’s value. Contemporary FPGAs contain tens-of-thousands of CLBs and a programmable interconnection network arranged in a rectangular grid surrounded by I/O blocks. As suggested by Figure 2.1, FPGAs also have higher level fixed logic such as block RAMs (BRAMs), multipliers, clock managers, and even GPPs embedded in the fab- ric. 
A configuration bitstream is loaded onto the FPGA to define its functionality. The two principal design flows used to create FPGA configuration bitstreams are described in the upcoming sections. 2.1.2 Hardware description language design flow Figure 2.4 depicts an HDL-based FPGA design flow. In this approach, the hardware engineer creates a register transfer level or behavioral description of the circuit using a traditional hardware description language such as VHDL [41]. Via a simulation PAR bitstream BITGEN SYNTH HDL FPGA SIM Figure 2.4: HDL-based FPGA design 14 (SIM) environment such as Modelsim [51], the engineer verifies circuit operation at the HDL level. As noted in the diagram, the design can also be simulated, with increasing accuracy, at later stages in the design flow. After circuit operation has been verified, the HDL is sent through a front-end synthesis (SYNTH) tool such as Synplify Pro [80] to produce netlist files, which are essentially text-based descriptions of the schematic. The netlists are processed by the vendor-specific back-end place and route (PAR) and bit generation (BITGEN) tools to produce a configuration bitstream. FPGA vendors typically provide a suite of front-end and back-end tools to facilitate the HDL-based design flow. An example of such a suite is the Xilinx Integrated Software Environment (ISE) [89]. 2.1.3 High-level language design flow Several commercial and open source HLL-to-HDL development environments are now available. Examples include Mitrion-C [53] and the JHDL [14] initiative being undertaken by Brigham Young University (BYU). These tool sets allow the develop- ment of FPGA configuration bitstreams using HLL-based programming rather than HDL-based hardware design. As shown in Figure 2.5, HLL-to-HDL environments typically provide a functional testing (TEST) mechanism, which operates at the HLL level and allows designers to verify the design’s behavior. After the design function- ality has been verified, the HLL-to-HDL compiler inputs the HLL and emits HDL. The HDL is processed by the FPGA tool chain, as described previously, to produce a configuration bitstream. As with the HDL flow, the design can be simulated at later stages using the SIM tool. 15 PAR bitstream BITGEN SYNTH HLL FPGA TEST SIM HLL-to-HDL Figure 2.5: HLL-based FPGA design 2.2 Floating-point computations on FPGAs Modern FPGAs are no longer restricted to their traditional role as substitutes for ASICs. The large capacity of modern FPGAs now allows development of FPGA- based computational kernels. Alex et al. [5] demonstrate a 30-fold speedup over software for an FPGA-based protein sequencer. Cheung et al. [18] describe an FPGA-based elliptic curve cryptosystem that achieves a 25-fold speedup over soft- ware. Sugie et al. [79] describe an FPGA-augmented system that accelerates the Smith-Waterman method by a factor of 142 over software. Baker and Prasanna show at least a 24-fold speedup over software for the Apriori data mining algorithm. Azizi et al. [8] describe an FPGA-based molecular dynamics (MD) system that uses fixed- point arithmetic to achieve a 20-fold speedup over software. Clearly, the potential speedups associated with FPGA-based kernels has already been demonstrated for in- teger and fixed-point applications. However, scientific computing generally requires the greater precision and range afforded by floating-point arithmetic as opposed to 16 integer or fixed-point arithmetic. 
Unfortunately, the MHz-scale clock rate of FPGA- based floating-point computational kernels is relatively low compared to the GHz- scale clock rate of GPPs. In order to compete with GPPs, FPGAs must use deeply pipelined floating-point cores and exploit algorithmic parallelism. Deeply pipelined ®-stage basic floating-point cores such as adders or multipliers provide an ®-fold speedup if the pipelines can be filled for an extended period of time. Interconnect- ingm stages of these basic floating-point cores results in a new higher-order floating- point core that has a pipeline depth and potential speedup of m£ ®. If n of these higher-order floating-point cores are used in parallel, then the performance is boosted by m£®£n. The multiplicative effect of pipelining and parallelization is the key to high-performance FPGA-based floating-point kernels and can be summarized by the three p’s memory aid shown in Figure 2.6. In practice, overheads associated with control, synchronization, routing, etc., tend to slow things down. Furthermore, the size of the FPGA, the available I/O bandwidth, and the algorithmic parallelism limit parallelized pipelined “Performance = Pipelined ×Parallelized” Figure 2.6: The three p’s the extent to which kernels can be pipelined and parallelized. Nonetheless, deeply pipelined, massively parallelized FPGA-based floating-point kernels can sometimes be used to gain a performance advantage over GPPs. The following sections discuss 17 efforts to map floating-point functional units and kernels onto field programmable gate arrays. 2.2.1 Floating-point arithmetic cores Because of the flexibility and large size of modern FPGAs, there is a body of re- search into FPGA-based floating-point arithmetic using custom radixes, formats, and bit-widths. Lienhart et al. [48] develop and use reduced precision floating-point cores to accelerate n-body scientific simulations using FPGAs. Zhao and Leeser [97] de- scribe an application-adaptive precision modeling approach that can facilitate the use of variable-precision floating-point units. Wang et al. [85] demonstrate the use of variable-precision floating-point library cores to accelerate portions of a multi-spectral image processing algorithm. Cantanzaro and Nelson [16] implement and compare pipelined radix 2 and radix 16 single-precision adders and multipliers to show the po- tential benefits of higher radix floating-point number representations. Gu et al. [36] investigate the use of FPGAs and “semi floating point” numbers to achieve a 16-fold acceleration of the ProtoMol MD code. Wang et al. [84] extend their earlier work to include a variable-precision single set accumulation reduction IP core. Govindu et al. [35] describe a library of deeply pipelined IEEE Std 754 [7] double-precision floating-point cores. Commercial floating-point cores are also available. For exam- ple, Xilinx now offers a variable-precision floating-point operator that can be tailored using the Coregen tool that is included with ISE. The ability to create basic floating-point functional units like adders and multipli- ers has been clearly demonstrated by the research examples just considered. The next section looks at research efforts to map entire floating-point kernels onto FPGAs. 18 2.2.2 Floating-point kernels FPGA-based floating-point kernels targeting specific problem domains have been demonstrated. Gokhale et al. [34] achieve better than a 3-fold speedup over software for an FPGA-based single-precision heat transfer simulation. 
Scrofano and Prasanna [70] describe a 3.9 GFLOPS FPGA-based double-precision kernel to accelerate the Lennard-Jones force calculation. Devlin et al. [25] describe an FPGA-based single- precision one-dimensional convolution kernel that can achieve a 10-fold speedup over software. Several researchers have investigated general-purpose FPGA-based floating-point scientific kernels. Dou et al. [26] describe an FPGA-based double- precision floating-point block matrix multiplication kernel. Zhuo and Prasanna [101] present an FPGA-based double-precision sparse matrix vector multiplication kernel that can achieve over 350 MFLOPS. DeLorimier and DeHon [24] describe an FPGA- based sparse matrix vector multiply kernel that is estimated to achieve 1.5 GLOPS. In [98] Zhou and Prasanna consider design trade offs for FPGA-based double-precision basic linear algebra subroutines (BLAS) including vector product, matrix vector mul- tiply, and matrix multiply. Beauchamp et al. [11] consider the advantages of embed- ding IEEE Std 754 floating-point hardware units in the FPGA fabric. Underwood [82] argues that by 2009, FPGAs will have an order of magnitude peak floating-point per- formance over GPPs. Clearly, the use of FPGA-based kernels to accelerate floating- point applications is becoming a reality. 2.3 Reconfigurable computers The reconfigurable computer, which was proposed by Gerald Estrin [32] in 1960, is a “fixed plus variable structure” computer consisting of fixed digital logic modules and a variable structure that can be “temporarily distorted into a problem oriented special 19 purpose computer.” Technological limitations and the advent of the microprocessor caused the RC to languish in relative obscurity. However, the FPGA has precipi- tated a renaissance, and several companies now offer RCs that use GPPs and FPGAs as the fixed plus variable structure envisioned by Estrin. Cray has the XD1 [21], which can have six user-programmable FPGAs per chassis. The late Seymour Cray’s startup company, SRC Computers, offers the MAP processor [77], which includes two user-programmable FPGAs in a number of different configurations ranging from single-MAP workstations to high performance multiple-MAP clusters. SGI offers the dual-FPGA RC100 blade [72], which can be placed as a peer compute node in SGI’s NUMALink switching fabric. In a similar vein, Mercury Computer Systems has the triple-FPGA Powerstream FCN Module [52], which can be placed as a peer compute node in Mercury’s RapidIO switching fabric. 2.3.1 System architecture While different in some respects, all of the RC offerings mentioned in the previous section are the similar in the sense that the FPGA-based portions can be viewed as variable structure processing elements (PEs). Figure 2.7 is an idealized diagram of a fixed plus variable structure RC. The fixed PEs correspond to traditional GPP(s) and the associated memory hierarchy. The variable structure PEs typically have one or more FPGAs surrounded by multiple memory banks. The local memory banks, which are independent from the GPP memory, give the FPGAs the ability to store large amounts of data, and to access multiple values in a single FPGA clock cycle. In contrast to GPPs, the fine-grained resolution of FPGAs allows the RC hardware to be reconfigured specifically for the problem to be solved. The FPGAs are, in ef- fect, programmable application specific processors. 
For applications that have some 20 GPP FPGA main memory cache memory local memory fixed variable structure interconnect Figure 2.7: RC system architecture combination of large-strided or random data reuse, streaming, parallelism, and com- putationally intensive loops, RCs can achieve higher performance than GPPs. Scien- tific applications involving sparse matrices often exhibit such characteristics and may have an affinity for RCs. The following sections describe the primary RC develop- ment flows. 2.3.2 Hardware description language design flow Figure 2.8 illustrates the HDL-based RC design approach. The design is partitioned into software modules, which are written in a traditional HLL and targeted for exe- cution on the GPPs, and FPGA modules, which are written in an HDL and targeted for execution on the FPGAs. Software modules that call FPGA modules also in- clude some vendor-specific calls to control/use the FPGA, for example, the Cray XD1 fpga load call loads a configuration bitstream on the FPGA. The software modules are compiled with the normal software compiler to produce object files. The FPGA mod- ule HDL code is input to the synthesis tool. The netlists produced by synthesis are fed into PAR, which feeds BITGEN to produce the configuration bitstream. The linker 21 COMP LINK PAR bitstream executable BITGEN SYNTH HLL HDL FPGA GPP IP lib obj FPGA modules SW modules API Figure 2.8: HDL-based RC design flow inputs the object and library files and produces the executable. At run time, the GPP and FPGA cooperatively execute the application. The HDL-based RC design flow is relatively primitive. Despite the use of the nomenclature “module,” these designs are not really modular, there is a significant amount of coupling between the software module and FPGA module. The software must have an intimate knowledge of the hardware functionality, and is usually respon- sible for data synchronization. Furthermore, the FPGA module designer must be an experienced hardware designer. The HDL-based RC design flow is not really practical for mainstream scientific computing on RCs. 2.3.3 High-level language design flow Estrin [32] required general-purpose computers in the fixed structure to facilitate “higher level languages for man-machine communication.” An HLL-based RC de- velopment approach is important because it introduces a man-machine communica- tion model that abstracts away many of the FPGA details such as the I/O interface 22 and, perhaps most importantly, the clock. It also provides a truly modular program- ming model. In the HLL-based RC design flow, the FPGA module appears to be a standard call within the software module. One can pass parameters to the FPGA module without having to understand how the FPGA module operates. HLL-based compilers allow FPGA module development using HLL-based programming rather than HDL-based hardware design. However, there are differences between the “nor- mal” HLL in the software modules and the “special” HLL in the FPGA modules. 
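As a purely hypothetical illustration of this modular model, the sketch below shows an FPGA module written as an ordinary-looking C function and a software module that invokes it as a standard call. The names, argument list, and the comment marking where a vendor-specific pipelining construct would appear are assumptions made here, not any particular vendor's API.

```c
/* Hypothetical sketch of the HLL-based RC programming model.
 * No vendor API is implied; names and conventions are invented. */

/* FPGA module: written in the compiler's "special" HLL subset.
 * A vendor-specific construct or pragma (not shown) would mark the
 * loop below for pipelining when it is compiled to hardware. */
void vscale_fpga(const double *x, double *y, double a, int n)
{
    for (int i = 0; i < n; i++)     /* candidate pipelined loop */
        y[i] = a * x[i];
}

/* Software module: runs on the GPP and treats the FPGA module as a
 * standard call, passing parameters without needing to know how the
 * hardware implements the computation. */
void scale_on_rc(const double *x, double *y, double a, int n)
{
    vscale_fpga(x, y, a, n);        /* executes on the FPGA */
}
```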
Tra- ditional HLLs, which were developed for von Neumann uniprocessor architectures, do not have mechanisms for expressing parallelism, therefore HLL-to-HDL compiler developers have taken one of four basic approaches, 1) modify an existing HLL as with Celoxica’s Handel-C [17] or Colorado State University’s SA-C [19], 2) create a new HLL as with Mitrionics’ Mitrion-C [53], 3) add appropriate classes to an object- oriented language as with BYU’s JHDL [14], or 4) use a standard HLL but include pragmas to guide the compiler as with SRC’s Carte compiler [76]. Independent of the mechanism, the goal is deeply pipelined, highly parallelized hardware. There- fore, in addition to parallel blocks, HLL-to-HDL compilers provide features such as pipelined loops, communication channels, synchronization primitives, and applica- tion programmer interface (API) calls to access specialized IP cores. As Figure 2.9 illustrates, the design is still partitioned into software modules and FPGA modules, as with the HDL-based RC flow. The software modules are written with “normal” HLL and compiled with the normal software compiler to produce object files. However, the FPGA modules are written using the “special” HLL provided by the target HLL- to-HDL compiler. The FPGA module HLL code is compiled with the HLL-to-HDL compiler to produce HDL input to the synthesis tool. Synthesis feeds PAR, which feeds BITGEN to produce the configuration bitstream. The linker inputs the object files, library files, the FPGA module call specification, and produces the executable. 23 COMP LINK PAR bitstream executable BITGEN SYNTH HLL “special” HLL HLL-to-HDL FPGA GPP API lib obj FPGA modules SW modules Figure 2.9: HLL-based RC design flow FPGA-augmented RCs and HLL-to-HDL compilers have already proved them- selves for integer and fixed-point applications such as digital signal processing [15]. For floating-point applications, however, contemporary RCs and HLL-to-HDL com- piler technology make programmer access to the FPGAs range from moderately dif- ficult to nearly impossible. For example, in Handel-C there is no floating-point data type at the language level; one must use the Celoxica floating-point library, which models floating-point types via a struct; this obviously limits the portability of the code. As another example, SRC Carte cannot pipeline inner loops, which seriously limits the performance of kernels that require nested loops. The deeply pipelined floating-point cores make certain operations very difficult. The author’s early attempts to speed up a small scientific code actually resulted in a nearly ten-fold slowdown. Fi- nally, even though the HLL kind of looks like software, it is still, in fact, a hardware design. Certainly the modularity and more familiar programming languages make the HLL-based RC design flow an improvement over the HDL-based flow. But the parallel blocks, pipelined loops, and other features added to the HLLs, coupled with 24 the deep pipelines needed for floating-point performance, and the glaring inability to perform multiple-set reduction operations, force one to concede that the HLL-based approach is still not quite ready for mainstream supercomputer users. 2.3.4 Hybrid design flow HLL-based RC development environments such as Celoxica’s DK Design Suite and SRC’s Carte support a hybrid development approach. This hybrid design flow allows the developer to use an HDL such as Verilog to create customized IP cores and im- port them into the HLL environment. 
As a result, the developer can use all the vendor HLL features such as parallel code blocks, pipelined loops, and channels, yet still have HLL access to the customized IP cores. Figure 2.10 illustrates the hybrid approach. The development tools provide some interface mechanism allowing the HLL-to-HDL compiler to obtain visibility into the user IP cores, for example, DK’s interface dec- laration or Carte’s info file. This interface specifies the format of the call statement used within the FPGA module’s HLL code to access the IP core. The interface also COMP LINK PAR bitstream executable BITGEN SYNTH HLL “special” HLL HLL-to-HDL inter face FPGA GPP lib obj FPGA modules SW modules user HDL Figure 2.10: Hybrid RC design flow 25 provides information to the HLL-to-HDL compiler allowing it to integrate the user IP core at the HDL level. The software modules are compiled with the normal software compiler to produce object files. The FPGA module HLL code is compiled with the HLL-to-HDL compiler. The compiler HDL output and the user HDL code are input to the synthesis tool. Synthesis outputs are fed into the PAR tool, which feeds BITGEN to produce the bitstream. The linker inputs the object files, library files, the FPGA module call specification, and produces the executable. The parallel blocks, pipelined loops, and other constructs in the HLL-based de- sign support the three p’s mentioned in Section 2.2. In the hybrid design flow one can use all these HLL features yet still have access to the customized user-defined IP cores from within the HLL-based portion of the FPGA module design. Unfortunately, getting the compiler to pipeline loops or parallelize statements involving user-defined IP cores can be particularly difficult. The IP cores developed using a traditional HDL flow, in some cases, simply do not map well into the HLL-based development envi- ronment. As with the HLL-based flow, the hybrid flow is not quite ready for prime time. The author’s experience with the scientists and engineers who use supercomput- ers is that they want their 20-year-old, 300,000-line, FORTRAN 77 scientific codes to compile and run, with minimal modifications, on whatever new platform comes along. Thus, the challenge in the research at hand is to find ways to make RCs more accessible to these traditional supercomputer users. 2.4 Postscript As mentioned in Section 2.2.1, Govindu et al. [35] describe a library of deeply pipelined IEEE Std 754 double-precision floating-point cores. These high-performance floating-point IP cores are used by the author to do the research documented in this 26 dissertation. It is worth noting that the academic cores developed by [84] and the commercial cores developed by Xilinx and SRC Computers, were not able to meet the timing constraints and/or area constraints of the target RC platform. 27 Chapter 3 Sparse Matrix Calculations on General-Purpose Processors This chapter discusses various aspects of sparse matrix computations on general- purpose processors. Section 3.1 talks about the sparse matrix storage format used in this research. Section 3.2 identifies the performance issues associated with sparse matrix computations on GPPs. Section 3.3 categorizes and reviews some of the previ- ous research efforts that attempt to improve the performance of sparse matrix kernels. Section 3.4 points out that these efforts, while successful to some degree, have still not solved the fundamental performance issues. 
3.1 Compressed sparse row format

A discussion of sparse matrix storage formats is an appropriate opening for this chapter. In particular, this section describes the compressed sparse row (CSR) format used in this research. The naive dense matrix vector multiply algorithm is shown in Figure 3.1. In considering the dense matrix algorithm for processing a sparse matrix, $A$, which has a relatively small number of nonzero values, $n_z \ll n^2$, where $n_z$ is the number of nonzeros and $n$ is the order of the matrix, two things are obvious. First, a significant amount of storage is being used to store the zero-valued $a_{ij}$, and second, the calculation at line 5 is unnecessary whenever $a_{ij} = 0$.

1: algorithm DMVM($y, A, x$)
2:   for $i$ in $1 \ldots n$ do
3:     $y_i \leftarrow 0$
4:     for $j$ in $1 \ldots n$ do
5:       $y_i \leftarrow y_i + a_{ij} \cdot x_j$
6:     end for
7:   end for
8: end algorithm
Figure 3.1: Dense matrix vector multiply algorithm

CSR format and other compressed sparse matrix representations are often used to represent sparse matrices in order to eliminate the unnecessary storage and computations associated with the zero values. CSR format employs three vectors, $val \in \mathbb{R}^{n_z}$, $col \in \mathbb{I}^{n_z}$, and $ptr \in \mathbb{I}^{n+1}$, to store only the nonzero values and to provide a mechanism for identifying the row and column indices.

val  The nonzero values as the matrix is traversed row-wise.

col  The column index of each nonzero value. There is a one-to-one relationship between $val$ and $col$. Specifically, if $val_h = a_{ij}$ then $col_h = j$.

ptr  The index in $val$ (and $col$) where each matrix row starts. The first element of matrix row $i$ is $val_{ptr_i}$, which occupies matrix column $col_{ptr_i}$. The range of index values for row $i$ is $ptr_i \le j < ptr_{i+1}$. To keep row length calculations consistent, $ptr_{n+1} \equiv n_z + 1$.

An example depicting the CSR representation of an order $n = 4$ sparse matrix is shown in Figure 3.2.

matrix $A$:                       CSR representation ($n = 4$, $n_z = 7$):
    [ 5  0  0  1 ]                  index: 1 2 3 4 5 6 7
    [ 0  8  0  0 ]                  val:   5 1 8 2 6 4 7
    [ 2  0  6  0 ]                  col:   1 4 2 1 3 1 4
    [ 4  0  0  7 ]                  ptr:   1 3 4 6 8
Figure 3.2: Example of CSR format

In looking at the dense matrix representation, it can be seen that row three starts with the value two, which "lives" in column one, that is, $a_{ij} = 2$ where $i = 3$ and $j = 1$. Furthermore, row three contains two values. This same information can be deduced from the CSR representation. Clearly, $ptr_3 = 4$, which means that $val_4 = 2$ is the first value in row three. This value has a column index, $col_4 = 1$, as required. Finally, $ptr_4 - ptr_3 = 6 - 4 = 2$ is the length of row three.

Although not obvious from the simple example shown in Figure 3.2, CSR format can significantly reduce the storage requirements of a sparse matrix. An order $n = 1{,}024$ matrix with $n_z = 2{,}048$ double-precision (8-byte) nonzeros has a dense storage requirement of $n^2 \cdot 8 = 8$ MB. The CSR representation has a storage requirement of sizeof($val$) + sizeof($col$) + sizeof($ptr$). If one assumes the $col$ and $ptr$ vectors hold 32-bit (4-byte) integers, then the CSR storage requirement is $n_z \cdot 8 + n_z \cdot 4 + (n+1) \cdot 4 = 28$ KB. This represents a compression ratio of nearly 300:1, which is clearly a significant reduction in storage. A CSR-format sparse matrix vector multiply algorithm is shown in Figure 3.3. In considering the algorithm, two things are again obvious.
The unneeded storage of the zeros has been elim- 1: algorithm SMVM(y;val;col;ptr;x) 2: fori in1:::n do 3: y i Ã0 4: forj inptr i :::ptr i+1 ¡1 do 5: y i Ãy i +val j ¢x col j 6: end for 7: end for 8: end algorithm Figure 3.3: Sparse matrix vector multiply algorithm 30 inated, and the unnecessary calculations involving zeros have also been eliminated. Unfortunately, the situation is not idyllic, as will be made clear in the sections that follow. 3.2 Sparse matrix kernel performance In theory, GPPs could include functional units (FUs) to perform floating-point sparse matrix operations. In practice, such FUs would consume a considerable amount of silicon real estate, complicate the pipelined data path and associated control circuitry, decrease the maximum operating frequency, and increase the cost of GPPs. Further- more, large floating-point sparse matrix operations are not really in the mainstream of personal or business-class computing. Finally, a hardware sparse matrix opera- tor would mandate a particular storage format. The cost of translating formats could be significant and perhaps be on the same order of magnitude as the cost of doing the calculation in software. Accordingly, modern GPPs typically include only basic floating-point operations such as addition, subtraction, square root, etc. They also include architectural features such as a memory hierarchy that, per Hennessy and Pat- terson [39], “make the common case fast.” Such features give GPPs considerable flex- ibility and result in high performance for a broad range of applications. However, for certain types of applications such as those involving large floating-point sparse matrix calculations, these same features can result in performance degradation. The classic example of this degradation is sparse matrix vector multiply, y = Ax, which has a high ratio of memory references to floating-point arithmetic operations and suffers from irregular memory access patterns. Furthermore, the dense n-vector, x, cannot fit in cache for largen, so there may be little chance for data reuse. A symbolic rep- resentation of the problem is shown in Figure 3.4. The current row of sparse matrix, 31 A (not shown), is already in cache. The illustrated portion of memory corresponds to thex vector, and the gray cells correspond to thex elements needed for the cur- rent row of the matrix. As suggested by the diagram, the cache is too small to hold the entirex vector. Given an address, the controller obtains the desired cache block from memory. In the example, only two values in the fetched cache block are actu- ally being used. The controller will have to make several memory accesses to satisfy the computational demands, and none of these accesses is able to supply more than a few numbers The net result is a significant waste of precious memory bandwidth. memory cache controller addr GPP wasted bandwidth data path Figure 3.4: Memory access pattern Depending on the sparsity structure of matrix A, the GPP could end up fetching a cache block, only using a single x i , and then evicting the cache block to make room for another. In a pathological case, this could require the 10’s of cycles associated with a level-one cache miss, or even the 100’s of cycles associated with a level-two cache miss, for everyx i value used. As a result, the GPP can be slowed down by one or two orders of magnitude. 32 3.3 Improving performance of sparse matrix kernels The poor performance of sparse matrix operations on GPPs is certainly not a new problem. 
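Before turning to the work that has tried to address this, it is useful to have the SMVM kernel of Figure 3.3 in concrete form. The C sketch below was written for this discussion (it uses 0-based indices rather than the 1-based indices of the figures) and encodes the order-4 example matrix of Figure 3.2; the indirect access x[col[j]] is the source of the irregular memory behavior just described.

#include <stdio.h>

/* CSR sparse matrix vector multiply, y = A*x, mirroring the SMVM
   algorithm of Figure 3.3 but with 0-based arrays. */
void smvm(int n, const double *val, const int *col, const int *ptr,
          const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = ptr[i]; j < ptr[i + 1]; j++)
            sum += val[j] * x[col[j]];      /* indirect, irregular access into x */
        y[i] = sum;
    }
}

int main(void)
{
    /* 0-based CSR encoding of the order-4 example matrix in Figure 3.2 */
    double val[] = {5, 1, 8, 2, 6, 4, 7};
    int    col[] = {0, 3, 1, 0, 2, 0, 3};
    int    ptr[] = {0, 2, 3, 5, 7};
    double x[]   = {1, 1, 1, 1}, y[4];

    smvm(4, val, col, ptr, x, y);
    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);    /* prints 6, 8, 8, 11 */
    return 0;
}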
There is a large body of research associated with techniques to make sparse matrix operations perform better on GPPs. A search on IEEE Xplore for titles con- taining “sparse matrix” yielded 577 hits. A similar search of the ACM Digital Library turned up 4,440 documents. The problem has been studied in great detail over an ex- tended period of time by hardware engineers, computational engineers, algorithm and data structure designers, compiler researchers, and so forth. According to Whaley et al. [88], The holy grail of compilation research is to take an arbitrary code as an input and produce completely optimal code as output for given languages and hardware platforms. Despite the immense amount of effort that has been poured into the approach, its success has been limited both by practi- cal time constraints (viz., users will not tolerate compile-times that extend into several days) and by the amount of detailed information the compiler can obtain about the software to be compiled and the hardware on which it is supposed to execute. In the interim, there seem to be three general approaches toward improving the perfor- mance of sparse matrix operations on GPPs, 1) preprocess the data to make the access pattern better fit the sparse matrix algorithm behavior on a GPP, 2) optimize the sparse matrix algorithm to make it better fit the data access pattern on the GPP, and 3) mod- ify the GPPs behavior via specialized memory controllers such that the needed data gets to the GPP in a more timely manner and/or allows more data reuse. Examples of research in each of these areas will be considered in the following sections. 33 3.3.1 Preprocess data As early as 1969, Cuthill and McKee [22] proposed an algorithm to minimize the bandwidth of sparse symmetric matrices by reordering the input data. Essentially they permute rows and columns to maximize the clustering of the nonzero elements. With respect to Figure 3.4, this would correspond to findings ways to make each cache access contain more grey cells. The Cuthill-McKee (CM) algorithm and its variants are still used to optimize the performance of sparse matrix operations. Duff [29] gives a 1977 “mid-term” review, primarily in the context of solving sparse simultaneous systems of equations, which looks at matrix storage formats, preordering, partitioning, and so forth. In a more contemporary research effort, Oliker et al. [63] consider CM and various other reordering and partitioning algorithms in order to improve the performance of sparse matrix vector multiply. 3.3.2 Optimize algorithm Researchers have also looked at optimizing the kernels themselves (rather than pre- processing the data). With respect to Figure 3.4, this would correspond to finding ways to reuse the grey cells multiple times before evicting the cache block. Perhaps the most well known algorithm optimization effort is the automatically tuned linear al- gebra software (ATLAS) project [88, 87], which attempts to optimize the basic linear algebra subprograms (BLAS) codes for the particular target platform. Unfortunately, ATLAS does not explicitly support sparse matrix operations. However, according to Whaley, one of the ATLAS principle investigators [86], “some sparse operations can employ the level 1 BLAS for a significant part of their computations, so ATLAS can help there.” Im et al. [42] explicitly try to deal with sparse matrix operations. 
They consider register blocking, cache blocking and other platform-specific optimizations 34 to improve the performance of sparse matrix kernels. Their SPARSITY framework is a two-pass system that currently supports sparse matrix vector multiply. During the first pass, one runs a profiler to obtain characteristics pertaining to the target machine. During the second pass, one runs a kernel optimizer to obtain an optimized version of sparse matrix vector multiply. The second-pass inputs include the target machine pro- file, example matrices, and the maximum number of vectors. For some matrices, the SPARSITY framework was able achieve a 4-fold speedup for a single matrix vector multiply, and up to a ten-fold speedup for multiple matrix vector multiplies. 3.3.3 Specialized hardware A third approach to speeding up sparse matrix operations is to develop specialized smart memory controllers to, in effect, cluster the memory accesses. A symbolic rep- resentation of the basic concept is shown in Figure 3.5. The idea here is that the GPP memory smart controller cache list GPP no wasted bandwidth data path Figure 3.5: Smart memory controller access pattern and memory controller cooperate in some way, such as address translation, to create the cache illusion of contiguous data when dealing with non-contiguous memory data. In the Data-Intensive Architecture (DIV A) research described by Hall et al. [37], the authors propose to accelerate irregular applications such as sparse matrix computation 35 that, “perform poorly on conventional architectures because their control and data ac- cesses cannot be statically predicted, and they do not make effective use of cache.” Their idea is to use PIM chips to bridge “the growing gap between processor and memory speeds.” Zhang et al. [96] describe their Impulse smart memory controller that uses prefetching and an additional level of address translation. Their simulations show performance improvements ranging from 20 percent up to 5-fold for a variety of applications. In a follow-on DIV A effort, Draper et al. [27] describe the proto- type PIM chip to augment the memory subsystem of a GPP. By replacing standard DRAM chips with these PIMs, they are able to achieve a 35-fold speedup for a matrix transpose operation. Tanabe et al. [81] describe how to build a specialized memory module, DIMMnet-2, that can speed up irregular kernels. They predict an over 8-fold speedup of sparse matrix vector multiply on a single Pentium 4-based workstation. 3.4 Reality of improvement efforts Clearly, data preprocessing optimizations, kernel optimizers, specialized memory con- trollers, and other techniques such as prefetching can sometimes mitigate the perfor- mance difficulties associated with sparse matrix computations on GPPs. However, in the final analysis, the computations are still being done using a GPP and its multilevel memory hierarchy. All optimizations ultimately depend on the sparsity structure of the matrices being processed. In some cases the access patterns are so erratic that there is little that can be done to improve performance. As part of their research on FPGA-based sparse matrix vector multiplication, Zhuo and Prasanna [101] use the SPARSITY framework to optimize a software-only sparse matrix vector multi- ply. They then use this optimized kernel with some of documented matrices found at the University of Florida (UFL) sparse matrix collection [23]. 
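Before looking at those measurements, it is worth making the register-blocking idea of Section 3.3.2 concrete. The sketch below is an illustrative 2x2 block-CSR matrix vector multiply written for this discussion, not SPARSITY code: each stored entry is a dense 2x2 block, so the two x values and the two running y values can be held in registers and reused across the block. The profiling pass described above is, in effect, choosing the block dimensions that make this reuse pay off on a given machine and matrix.

/* 2x2 register-blocked (block-CSR) sparse matrix vector multiply.
   nb is the number of 2x2 block rows; bval holds 4 doubles per block
   (row-major within the block), bcol the block column index, and bptr
   the block-row start offsets. */
void bcsr2x2_mvm(int nb, const double *bval, const int *bcol,
                 const int *bptr, const double *x, double *y)
{
    for (int i = 0; i < nb; i++) {
        double y0 = 0.0, y1 = 0.0;                 /* kept in registers */
        for (int j = bptr[i]; j < bptr[i + 1]; j++) {
            const double *b = &bval[4 * j];        /* the 2x2 block      */
            double x0 = x[2 * bcol[j]];
            double x1 = x[2 * bcol[j] + 1];        /* each x loaded once, used twice */
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * i]     = y0;
        y[2 * i + 1] = y1;
    }
}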
The data shown in 36 Table 3.1: Sparse matrix vector multiply performance Matrix MFLOPS raefsky3 763 bcsstk35 660 rdist1 387 memplus 259 gemat11 136 lns 3937 104 sherman5 85 mcfe 100 jpwh 991 25 bp 1600 20 str 600 15 Table 3.1 is adapted from their results. Column Matrix is the name of the matrix as documented at the UFL web site and MFLOPS is the double-precision MFLOPS per- formance obtained from the optimized sparse matrix vector multiply kernel. The the- oretical peak double-precision floating-point performance of the 900MHz Itanium 2 used in their work is 3600 MFLOPS. Some matrices, such as raefsky3, were able to achieve fairly respectable MFLOPS performance. Others, such as str 600, were not. One of the reasons for this broad variance in performance can be seen by looking at the sparse matrix structure diagrams, which depict the sparse matrix as ann£n square with a dot where the matrix has a nonzero, and no dot where the matrix has a zero. The structure diagrams in Figure 3.6 represent the regularly structured matrices, raef- sky3 and bcsstk35. These two sparse matrices have some spatial locality of reference (clustering), which can be exploited by the optimized matrix vector multiply kernel, as evidenced by the reasonable MFLOPS performance values shown in Table 3.1. How- ever, the structure diagrams in Figure 3.7, which represent the irregularly structured matrices, bp 1600 and str 600, exhibit virtually no locality of reference. Therefore the 37 (a) raefsky3 (b) bcsstk35 Figure 3.6: Regular matrices (a) bp 1600 (b) str 600 Figure 3.7: Irregular matrices optimized sparse matrix vector multiply did not perform very well on these matrices, as evidenced by the MFLOPS values shown in Table 3.1. In general, sparse matrix operations do not exhibit the temporal or spatial locality assumed by the GPP memory hierarchy. Even with the three kinds of optimizations described above, the performance of sparse matrix kernels on GPPs can still be pretty poor. The good news is that research efforts such as [101, 24, 57, 56] suggest that FPGA-based solutions, which can be customized for the particular problem at hand and employ a different kind of memory subsystem, may help mitigate the problem, 38 that is, sparse matrix computations may have an affinity for FPGA and reconfigurable computer based solutions. 39 Chapter 4 Floating-Point Reduction Operations A vector reduction operation, in the context of this research, is an operation that in- puts one or more n-vectors of 64-bit double-precision floating-point values and re- duces them to a scalar value. Reductions are frequently used in scientific computing. Examples include dot product,x T y = P x i y i , accumulation, s = P x i , and vector norm calculations such askxk 2 = p P x 2 i andkxk 1 =maxjx i j. Without loss of generality, accumulation reductions will generally be used in this chapter to explain the concepts. Section 4.1 motivates the exposition by showing how binary trees can theoretically be used to implement reduction operations. Section 4.2 then describes why practical reduction implementations are problematic, especially when dealing with pipelined floating-point units. Section 4.3 categorizes reduction methods via a simple two-dimensional taxonomy. Section 4.4 is a brief literature re- view of research associated with reduction operations. Section 4.5 briefly describes two of the HDL-based reduction techniques that were developed as part of this re- search effort. 
Section 4.6 closes out the chapter by describing the HLL-based vector reduction technique that is one of the principal contributions of this dissertation.

4.1 Motivation

In theory, vector reductions can be implemented using a binary tree of basic floating-point IP cores. For example, to compute the dot product of two n-vectors, a full binary tree n-vector data path, as shown in Figure 4.1(a), can be used. To minimize clock cycle time and maximize performance, the floating-point IP cores are deeply pipelined. Thus, the dot product unit itself is also fully pipelined and accepts two n-vectors every clock cycle. The binary tree data path has $1 + \lg n$ tree levels, with $n$ multipliers at the leaf nodes and $n - 1$ adders in the other $\lg n$ tree levels.

[Figure 4.1: Reduction trees. (a) dot product, $\sum_{j=1}^{n} x_j y_j$ ($n = 4$); (b) accumulation, $\sum_{j=1}^{n} x_j$ ($n = 8$); (c) L2-norm, $\sum_{j=1}^{n} x_j^2$ ($n = 4$)]

If $\alpha_m$ and $\alpha_a$ are the latencies of the multiplier and adder IP cores, then the latency of the dot product data path is

    $\alpha_d = \alpha_m + \alpha_a \lg n$    (4.1)

clock cycles. Therefore, after the $\alpha_d$-cycle latency, the dot product unit emits one result every clock cycle. A similar analysis can be applied to the other reduction trees shown in Figure 4.1. Clearly, a binary tree n-vector data path consisting of pipelined floating-point IP cores, which leverages both parallelism and pipelining, results in a very high-performance reduction architecture.

4.2 Reduction problem

Basic floating-point IP cores like add and multiply are on the order of 1,000 slices [84, 35], and the largest contemporary FPGAs have around 50,000 slices and 1,000 I/O pins [92, 94, 95]. These resource constraints preclude implementing $\Theta(n)$ double-precision floating-point cores on a single FPGA for even modest values of $n$; that is, full binary vector reduction tree circuits are usually not possible. Therefore, large parallel vector reductions must be translated into a series of smaller parallel vector reductions. The resulting sequential stream of partially reduced values must then be reduced to the final result. In addition, for some scientific kernels such as the Lennard-Jones force calculation [69, 36], the values to be reduced are calculated sequentially. Either way, a stream of sequentially delivered floating-point values needs to be reduced. The deeply pipelined floating-point IP cores used on FPGAs mean that simple sequential vector reduction solutions such as a looped adder will introduce multiple-cycle delays in the pipelines, that is, they will create loop carried dependences.

This situation is illustrated using Figure 4.2, which shows the block diagram of a matrix vector multiply, $y = Ax$. The architecture consists of the largest $k$-width binary tree dot product circuit that will fit on the FPGA, followed by an accumulator reduction circuit implemented as a looped adder. The algorithm appears to be straightforward.

[Figure 4.2: Naive matrix vector multiply unit. A $k$-width dot product tree ($k = 4$) produces partial dot products $d_{ih}$, which feed a looped-adder accumulator that forms $y_i = \sum_{j=1}^{n} a_{ij} x_j$]

The $a_i$ matrix rows and the $x$ vector are partitioned into multiple $k$-vectors. At each clock edge one pair of $k$-vectors enters the pipelined dot product data path. Per Equation 4.1, the partial dot products, $d_{ih}$, stream out of the root adder after $\alpha_m + \alpha_a \lg k$ clock cycles.
The sequentially delivered double-precision floating-point values are accumulated by the reduction circuit to produce $y_i$. Unfortunately, the pipelined adder causes the simple looped adder circuit to fail. Figure 4.3 shows a simple example of this failure for an adder with $\alpha_a = 2$. Suppose matrix row $i$ corresponds to the partial dot product sequence $(a_1, a_2, a_3, a_4)$, and the next matrix row corresponds to the sequence $(b_1, b_2, b_3, b_4)$; that is, $y_i = a_1 + a_2 + a_3 + a_4$ and $y_{i+1} = b_1 + b_2 + b_3 + b_4$.

[Figure 4.3: Buffer overrun. An $\alpha_a$-stage adder ($\alpha_a = 2$) with input and output buffers, shown at steps 1 through 4 as described below]

In step 1, $a_1 + a_2$ is in the last adder stage, and $a_3$ and $a_4$ are in the input buffers. In steps 2 and 3, the row $i$ values traverse the adder pipeline and are placed in the output buffers, and the $b_1$ and $b_2$ values from the next row are stored in the input buffers. In step 4, the final addition for $y_i$ is getting ready to enter the adder pipeline. In the meantime, an interlock prevents new values from entering the adder until the existing calculation is finished, so $b_3$ gets dropped. One might be tempted to claim that a larger input buffer will solve the problem; the following analysis shows why it cannot.

Starting with row 1, when the first two $d_{1h}$ values are available, they are entered into the adder pipeline. The pipeline delay introduces a data hazard that prevents immediately adding the next two partial dot products. Therefore, every other clock cycle one pair of $d_{1h}$ values is entered into the pipeline, leaving bubbles (pipeline stages that are not doing any useful work) in every other pipeline slot. After $\alpha$ cycles, partial sums appear at the output. At this point the adder output values and the newly arriving $d_{1h}$ values are injected into the adder pipeline (or a pair of the $d_{1h}$ if there is no adder output). When all the bubbles bleed out of the pipeline, there is a full pipeline, an empty input buffer, and at each clock cycle the adder output value and the newly arriving $d_{1h}$ value are injected into the pipeline. As long as the reduction involves row 1, this approach works fine. However, the arrival of a partial dot product from row 2 creates a pipeline schedule conflict. Suppose an interlock circuit prevents the adder from taking values out of the input buffer until row 1 has been completed. Recall, there are $\alpha$ row 1 partial sums in the pipeline. It takes $\alpha$ cycles to "empty" the pipeline, but these $\alpha$ partial sums are reentered pair-wise into the pipeline, so there are now $\alpha/2$ partial sums spaced every 2 slots, then $\alpha/4$ partial sums spaced every 4 slots, and so forth. Therefore, it takes $\alpha \lg \alpha$ cycles to drain row 1 from the pipeline. In the meantime the input buffer has stored $\alpha \lg \alpha$ subrow values from row 2. The interlock circuit now permits injection of row 2 pairs. After $\alpha$ cycles, the pipeline starts emitting row 2 partial sums. Therefore, these output values and the newly arriving row 2 dot product values are entered into the adder pipeline. Only $2\alpha$ row 2 values have been consumed from the input buffer by this point, so the steady state for row 2 requires $\Theta(\alpha \lg \alpha)$ storage. Since the input buffer grows by $\Theta(\alpha \lg \alpha)$ on every row, an input buffer of size $\Theta(n \alpha \lg \alpha)$ is needed, which is not realizable for large $n$. A larger input buffer is out of the question, and design changes to stall the pipeline significantly reduce performance.
For sparse matrices, the number of partial dot products to be accumulated will vary from row to row. Thus, the vector reduction problem boils down to reducing multiple, variable-length sets of sequentially delivered double-precision floating-point values without stalling the pipeline or imposing unrealistic buffer requirements on the design.

The vector reduction problem can be formally defined as follows. There are $m$ vectors, $(v_1, \cdots, v_m)$, with vector $v_i$ containing $n_i$ double-precision floating-point values, $v_i = (d_{i,1}, \cdots, d_{i,n_i})$. The values are sequentially delivered to the reduction unit, one per clock cycle, as $[(d_{1,1}, \cdots, d_{1,n_1}), \cdots, (d_{m,1}, \cdots, d_{m,n_m})]$. The problem is to reduce each vector to a single value, $s_i = d_{i,1} \oplus d_{i,2} \oplus \cdots \oplus d_{i,n_i} = \bigoplus_{j=1}^{n_i} d_{i,j}$, for $1 \le i \le m$, where $\oplus$ represents the basic operation such as addition, and $\bigoplus$ represents the aggregate vector reduction operation such as $\Sigma$. The circuits designed to solve the vector reduction problem must satisfy the following four criteria:

1. Use pipelined double-precision floating-point operators in order to achieve low clock cycle times and high performance.

2. Have a relatively small number of operators. In particular, solutions requiring $\Theta(m)$ or $\Theta(n_i)$ floating-point operators are not acceptable.

3. Accept one input during each clock cycle without stalling the pipeline.

4. Have a relatively small buffer size. In particular, solutions requiring $\Theta(m)$ or $\Theta(n_i)$ floating-point buffers are not acceptable.

4.3 Reduction method categories

To facilitate the discussion of reduction operations, the simple two-dimensional taxonomy presented in Figure 4.4 will be used. The idea here is to use the intersection of a row and a column as a category. For example, one category of reduction methods is an HDL-based tree traversal method. The bullets indicate the three categories of reduction methods that will be described in this chapter.

                           HDL-based    HLL-based
    tree traversal method      *
    striding method            *            *
Figure 4.4: Reduction methods

The rows in Figure 4.4 identify the two basic methods for designing serial reduction circuits, the tree traversal method and the striding method. In the tree traversal method, the inputs are considered to be the leaves of a binary tree and the reduction of the inputs requires a traversal of the tree, which results in a single output value at the root node. The recursive doubling method, to be described shortly, is an example of a tree traversal method. In the striding method, the inputs are first reduced to $\alpha_a$ values, where $\alpha_a$ is the number of pipeline stages in the adder. The $\alpha_a$ values are then further reduced to the single output value. The pipelined vector reductions mentioned in the next section are examples of the striding method.

The columns in Figure 4.4, which only apply to RC-based reductions, represent the development flow, HDL-based or HLL-based. HDL-based designs have considerable flexibility when it comes to reduction circuits. In the HDL-based basic linear algebra subprograms (BLAS) research reported in [100], for example, Zhuo and Prasanna note that, "a reduction circuit is needed in the architectures for both dot product and matrix-vector multiply." In their work, they were able to use the reduction core described in [99]. In the HDL-based Jacobi method research effort documented in [56], the HDL-based reduction method described in Section 4.5.1 was used for the accumulator component. Unfortunately, HLL-based designs are a different story.
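Whatever the development flow, the result that must be produced is fixed by the formal definition above. A plain software reference model, a few C lines written for this discussion against which any of the reduction designs in this chapter can be checked, is:

/* Software "golden model" of the vector reduction problem: m variable
   length sets arrive as one sequential stream, and each set is reduced
   with the basic operation (here, addition) to a single value s[i]. */
void reduce_sets(int m, const int *len, const double *stream, double *s)
{
    int pos = 0;
    for (int i = 0; i < m; i++) {
        double acc = 0.0;                 /* identity value for addition */
        for (int j = 0; j < len[i]; j++)
            acc += stream[pos++];         /* d_{i,j} arrives one per cycle */
        s[i] = acc;
    }
}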
Prior to the research described herein, the reduction problem, for an HLL-based de- velopment context, had not yet been solved. During an initial training session on the use of an HLL-based RC, the author’s attempts to speed up a small scientific kernel actually resulted in a slowdown because an inner loop accumulation reduction could not be efficiently performed. There were no HLL-based reduction operators to ac- commodate multiple sets of data without flushing the pipeline, which, of course, kills the inner loop performance. Furthermore, the existing single-set reduction operations can not be reset or read except upon entry and exit from the loop. Therefore, they cannot be used in flattened-loop, multiple vector reduction scenarios such as matrix vector multiply. Others researchers have also encountered this problem. For the HLL- based Gauss-Seidel iteration described in [13], B¨ ohm and Hammes note that, “At the time of writing this abstract, this and other floating-point macros have not been inte- grated in the MAP compiler yet,” and they simply tolerate a multiple cycle pipeline. In the HLL-based sparse matrix vector multiply research in [3], Akella et al. try sev- eral different approaches but keep running into the reduction problem; they conclude that, “performance is still about 2 – 2.55 slower than software.” As lately as Septem- ber 2006, Smith et al [74] report that their attempts to accelerate conjugate gradient 47 via an HLL-based approach actually resulted in a slowdown. Their slowdown is also attributable to the reduction problem. It is worth noting that their target platform is the same as the target platform used in this dissertation. Clearly, for an HLL-based development environment, more research is needed with respect to the problem of reducing multiple sets of sequentially delivered double-precision floating-point val- ues. The HLL-based reduction solution, which will be described later, is one of the principal contributions of this research. 4.4 Literature review Nearly 3,000 years ago, Solomon [75] noted that “there is no new thing under the sun.” Obviously, he was referring to the human experience from a metaphysical con- text. However, his words still ring true when it comes to reduction operations on pipelined and parallel processors. In 1960, Leonid Kantorovich [33] developed one of the earliest pipelined processors, the Arithmetic Machine. Kantorovich’s design used a four-level buffer and specialized carry-save adder in order to get a high-performance dot product operation. Kogge and Stone [45] use recursive doubling, which is essen- tially a spatial plus temporal version of a binary tree, to solve (among other problems), the reduction operation,x n = ¦a j . The idea with this tree traversal method is to use n=2 processors and do pairwise multiplication of then-vector elements during the first iteration, which results in an (n=2)-vector. During the next iteration, n=4 processors are used to again do pairwise multiplication, which results in an (n=4)-vector. Clearly afterlgn iterations, a scalar value is produced, as shown in Figure 4.5. Kuck’s classic computer architecture text [46] explicitly deals with pipelined vector reductions via a technique called recursive reduction. 
The basic idea with this striding method is 48 a 8 a 7 a 6 a 5 a 4 a 3 a 2 a 1 t 0 t 1 t 2 t 3 a 8 ·a 7 a 6 ·a 5 a 4 ·a 3 a 2 ·a 1 (a 8 a 7 )·(a 6 a 5 ) (a 4 a 3 )·(a 2 a 1 ) x 8 =(a 8 a 7 a 6 a 5 )·(a 4 a 3 a 2 a 1 ) Figure 4.5: Parallel recursive doubling to divide the n-vector into two halves. These two halves are used in the first itera- tion through the pipeline to produce an (n=2)-vector. This is again divided in half to produce an (n=4)-vector, and so forth, until the desired scalar result is obtained after lgn iterations, as shown in Figure 4.6. Kuck’s (and implicitly, Kogge and Stone’s) re- striction thatn be a power of two can be relaxed via the padding technique described by Kogge [44], which basically pads the vectors with identity values, that is, 0 for addition, and 1 for multiplication. However, the recursive reduction approach still requires £(n) buffer space, which is unrealistic for FPGA-based implementations. Ni and Hwang [58, 59] describe a symmetric reduction technique and an asymmetric reduction technique that eliminate the large buffer requirement of recursive reduction by employing a controlled feedback path from the pipeline output back to the input as shown in Figure 4.7. In this striding method, the C input represents the identity value, the e input controls the latch, and the c i inputs control the muxes. During the fill-time for an accumulation reduction, for example, the controller setsC =0,e=0, c 1 = 0, and c 0 = 0, which causes the input to be 0+x i until the pipeline fills. The fill-time is followed by the merge-time, during which the controller causes the inputs and pipeline values to be reduced to® values via the feedback loop. When no more in- puts are available, the controller transitions to drain-time, during which the pipeline is emptied, that is, the® pipeline values are reduced to a single scalar output. In addition 49 1 2 α a buffer b buffer x buffer ⊕ . . . . . . . . n words n/2 words j n j x 1 = ⊕ ctrl Figure 4.6: Pipelined recursive vector reduction to controlling the latch and mux control signals, the controller also causes the pipeline bubbles (Ni and Hwang call them “unproductive segments”) to be ignored. Sips and Lin [73] improve upon the Ni and Hwang techniques by adding additional hardware that allows the feed-time and merge-time to overlap. Zhuo and Prasanna [99] develop a multiple set HDL-based FPGA-based accumulator reduction circuit using the ideas presented by Ni and Hwang, and Sips and Lin. They are able to accomplish this us- ing a single adder by allowing for out-of-order completion. Luo and Martonosi [49], Papadantonakis et al. [65], and Wang et al. [84], attack the reduction problem by cre- ating specially designed FPGA cores. Unfortunately, Luo and Martonosi’s delayed addition technique responds to possible overflow conditions by stalling the pipeline, which requires a “reasonably large” local buffer and the associated control overhead to manage it. The Papadantonakis et al. work only deals with 16-bit saturating in- teger accumulation, and the Wang et al. accumulator requires a pipeline flush after 50 1 2 α mux mux latch ⊕ . . . . . . . . j n j x 1 = ⊕ x i C B(t) c 1 c 0 e ctrl Figure 4.7: Pipelined vector reduction with feedback each accumulation. Accordingly, they are not good candidates for this research. The following two sections describe the vector reduction techniques that were developed as part of the author’s research efforts. 
4.5 Hardware description language reduction 4.5.1 Tree traversal method design This section describes an HDL-based tree traversal reduction solution that accommo- dates multiple sets of input vectors. The intuition behind this design is, of course, a (lgn)-level binary adder tree as depicted in Figure 4.8. Since the inputs arrive sequen- tially, as suggested by the time sequence shown at the bottom of the tree, the latency of this architecture is n + ® a lgn cycles, where ® a is the latency of the pipelined adders. Notice that in this idealized architecture most of the adders are idle most of time. For example, the first adder at level 0 must wait until both d 1 and d 2 arrive. It 51 + + + + + • • • level lg n - 1 1 0 + • • • output d 1 d 2 d 3 d 4 d n t 1 t 2 t 3 t 4 t n • • • input time Figure 4.8: Binary adder tree then has to wait for the next n¡2 inputs before it gets a new operand. Obviously similar inefficiencies exist for all the adders. If the adders are“horizontally compacted” and buffers are placed at each tree level, the partially compacted binary tree (PCBT) shown in Figure 4.9 is the result. Instead of multiple adders at a given tree level, there is only one adder at that level. There is also a two-word buffer at each level. When there are two values in the buffer, they are read from the buffer and passed to the adder. The PCBT design haslgn adders, a buffer size of 2lgn, and a latency of n+® a lgn cycles. However, the utilization of the adders in the PCBT is still very low. The adder at level 0 has two new inputs every two clock cycles, the adder at level 1 has two new inputs every four clock cycles,¢¢¢ , and the adder at leveli has two new inputs every2 i+1 clock cycles. If the adders are “vertically compacted” and some control circuitry and a counter are added, the fully compacted binary tree (FCBT) design shown in Figure 4.10 is the result. The design employs only one adder rather than lgn adders. Each level 52 • • • input output level buffers lg n - 1 1 0 • • • + + + Figure 4.9: Partially compacted binary tree has a buffer of three 64-bit words (except level 0, which only needs 2). The d ih inputs are clocked into buffer 0 every cycle. When there are two values, buffer 0 is input • • • output level buffers lg n - 1 1 0 • • • + ctrl C counter Figure 4.10: Fully compacted binary tree destructively read and the pair of numbers entered into the pipeline. Two cycles later the next pair is entered, etc. A key observation is that the pipeline is only used half the time to accommodate the buffer 0 input values, that is, it only uses half of the available pipeline slots. This is the crux of the design technique; the remaining partially reduced 53 values are interleaved into the open pipeline slots and the intermediate results are buffered. This raises three questions: ² Are there enough pipeline slots? ² Can the slots be scheduled without collision? ² How many buffers are needed at each level? 4.5.1.1 Description of schedule The adder pipeline schedule is dictated by the buffer read and write schedule. To avoid any confusion, it is noted that “reading a buffer” is equivalent to entering a pair of values into the pipeline, and “writing a buffer” is equivalent to removing a single partial sum from the end of the pipeline. This section describes the scheduling approach and proves it guarantees a collision-free use of the pipeline. The (lgn)-bit counter, C, shown in Figure 4.10 is initialized to 0 when buffer 0 is first read. 
The counter is incremented every clock cycle, and values arrive every clock cycle, so the next time buffer 0 is read, the low-order bit of $C$ will be 0. This can be expressed as $C = {*}0$, where $*$ is a multi-bit wildcard. After $\alpha$ cycles, the output of the pipeline is written into buffer 1, therefore the write schedule for buffer 1 is just $C - \alpha = {*}0$. Clearly all the ${*}0$ slots have been used, so only the ${*}1$ slots are left. Buffer 1 is read when $C = {*}01$, and buffer 2 is written when $C - \alpha = {*}01$. This leaves the ${*}11$ slots. Buffer 2 is read when $C = {*}011$, and buffer 3 is written when $C - \alpha = {*}011$, and so forth. This relationship between the buffer read and write schedule, pipeline depth, and counter $C$, is formalized as

    $C = 2^i - 1 + a \cdot 2^{i+1}$    (4.2)
    $C - \alpha = 2^i - 1 + a \cdot 2^{i+1}$    (4.3)

for some integer $a$, where Equation 4.2 is the read schedule for buffer $i$, and Equation 4.3 is the write schedule for buffer $i+1$.

4.5.1.2 Proof of sufficient pipeline slots

The idea of interleaving partial sums in available pipeline slots will only work if there are enough slots to handle arbitrarily large values of $n$. This section provides the necessary proof.

Theorem 1: A single pipeline holds sufficient slots to simultaneously accommodate the partial reductions for any $n$.

Proof: Since the pipeline always ingests two values, buffer 0, in Figure 4.11, can never cause more than $1/2$ of the pipeline slots to be used no matter how many numbers arrive. The partially reduced values produced by the pipeline are placed in buffer 1. Again, since they are taken by pairs, these values can never cause more than $1/4$ of the slots to be used. This geometric series continues, and one can see that buffer $i$ can never cause more than $1/2^{i+1}$ of the slots to be used.

[Figure 4.11: Contributions to pipeline fullness. Buffer 0 contributes at most $1/2$ of the slots, buffer 1 at most $1/4$, and buffer $i$ at most $1/2^{i+1}$]

The technique must work no matter how many values arrive, so one must sum all these contributions:

    $\lim_{n \to \infty} \sum_{i=0}^{n} 1/2^{i+1} = 1$

Thus, one can never completely fill the pipeline by interleaving higher level partial reductions for any value of $n$.

4.5.1.3 Proof of collision-free schedule

It has been proved that there are enough slots in the pipeline to simultaneously accommodate all the partial reductions, and a schedule for pipeline use has been proposed. Proof that the proposed schedule allows collision-free use of the pipeline follows.

Theorem 2: The schedule shown in Equation 4.2 and Equation 4.3 guarantees a collision-free use of the pipeline.

Proof: This is a proof by contradiction. Assume buffer $i$ and buffer $j$, where $j > i$, are scheduled for a read at the same time. From Equation 4.2,

    $2^i - 1 + a \cdot 2^{i+1} = 2^j - 1 + b \cdot 2^{j+1}$

for some integers $a$ and $b$, so

    $2^i(1 + 2a) = 2^j(1 + 2b) \;\Rightarrow\; (1 + 2a) = 2^{j-i}(1 + 2b)$.

Since $j > i$, $(1 + 2a)$ must be even. But $a$ is an integer, so $(1 + 2a)$ can never be even; this is a contradiction and hence proof that the read schedule is collision free. Values cannot change positions inside the pipeline, so the write schedule is also collision free. Thus, the theorem is proved.

4.5.1.4 Proof of bounded buffer growth

The reduction method only works if the buffer at each level: i) does not grow without bound, and ii) is limited to some constant size. This section presents proofs of both.

Theorem 3: A buffer cannot grow without bound once values begin to be read from that buffer.

Proof: For buffer $i$, one value is written (produced) every $2^i$ cycles, and two values are read (consumed) every $2^{i+1}$ cycles.
Thus, rate produced =rate consumed and one cannot have unbounded growth once the buffer begins to be read. Theorem 4: The given schedule and the buffer size,s=3, guarantee there can be no buffer overflow. Proof: Again, a proof by contradiction is used. Let the number of values in buffer i be denoted as b i , and assume the first overflow, b i = 4, occurs at time, t of . Let the next read after the overflow occur at time,t nr = t of +a. Sincea = 0 would make t of a read cycle (no overflow), and reads are scheduled every 2 i+1 cycles, it follows that 0 < a < 2 i+1 . The schedule puts the previous read 2 i+1 cycles earlier, that is t pr =t nr ¡2 i+1 =(t of +a)¡2 i+1 (4.4) In the worst case scenario, buffer i has not overflowed and b i = 3 just before time t pr . At timet pr , there are two cases: 1) two of the values are consumed which leaves leave b i = 1, and 2) a value arrives precisely at time t pr so two of the values are consumed which leaves b i = 2. To overflow buffer i in case 1), three more values must be produced by timet nr . This takes at least 2£2 i +1 = 2 i+1 +1 cycles. For case 2), two more values have to be produced by timet nr . Since a value just arrived, it takes exactly2£2 i = 2 i+1 cycles. Thus, if one definesc2f0;1g, one can express 57 the production time for both cases as, t prod = 2 i+1 +c. If there is an overflow, the following formula must be true: t pr <t of ¡t prod =t of ¡ ¡ 2 i+1 +c ¢ Combining this with Equation 4.4, results in (t of +a)¡2 i+1 <t of ¡ ¡ 2 i+1 +c ¢ soa <¡c. This result violates the condition 0 < a < 2 i+1 . This is a contradiction, and the maximum buffer size,s=3, holds. 4.5.1.5 Summary This section described an HDL-based tree traversal method vector reduction design that runs in£(n) time and uses£(lgn) buffer space. It included proofs that a single adder has enough slots to accommodate a collision-free schedule, and that the size s=3 buffers will not grow without bound. 4.5.2 Striding method design This section describes an HDL-based striding method reduction solution that also accommodates multiple sets of input vectors. The intuition behind this design is that it is a simple extension of the looped adder technique that was described in Section 4.2 and that it uses a control algorithm which is similar to that of Ni and Hwang, and Sips and Lin [73, 59]. However, unlike the single looped adder, which fails unless given an unrealistic input buffer size, this design alternates between multiple looped adders and uses a relatively small input buffer to handle the case when all adders are busy. 58 A block diagram of the design, for two looped adders, is depicted in Figure 4.12. The idea is to allow one adder to reduce all the values for a given input vector. After + output input ctrl + buffers Figure 4.12: Parallel looped adders the last value in a vector arrives, there are ® a partial reductions still “in flight” in the pipeline, so another adder begins reducing the new vector while the first adder finishes reducing its vector. If all adders are busy, new values are buffered until an adder becomes available. Intuitively, the parallel pipeline approach seems reasonable, but there are some unresolved issues. Specifically, there are three questions, which will be answered in the sections that follow: ² How many cycles does it take to coalesce the pipeline? ² How many looped adder pipelines are needed? ² How much buffer space is needed? 
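The first of these questions can be previewed with a small software model before the formal treatment in the subsections that follow. The C program below is an occupancy-only sketch written for this discussion, not the actual circuit: it starts with a full alpha-stage pipeline and an empty one-slot output register, pairs each emerging value with the held one, and counts cycles until the final result emerges; the counts can then be compared with the bound proved in Section 4.5.2.2.

#include <stdio.h>
#include <math.h>

/* Count the cycles needed to coalesce a full alpha-stage adder pipeline
   when each exiting value is paired with the value held in obuf and the
   pair is fed back into stage 0.  pipe[k] is 1 if stage k holds a value. */
int coalesce_cycles(int alpha)
{
    int pipe[64] = {0}, obuf = 0, live = alpha, cycles = 0;
    for (int k = 0; k < alpha; k++) pipe[k] = 1;

    while (1) {
        cycles++;
        int out = pipe[alpha - 1];                  /* value leaving the adder */
        for (int k = alpha - 1; k > 0; k--) pipe[k] = pipe[k - 1];
        pipe[0] = 0;
        if (out) {
            if (live == 1) return cycles;           /* final result has emerged */
            if (!obuf) obuf = 1;                    /* hold it, await a partner */
            else { obuf = 0; pipe[0] = 1; live--; } /* pair re-enters the pipe  */
        }
    }
}

int main(void)
{
    for (int alpha = 2; alpha <= 16; alpha++) {
        int bound = alpha * (int)ceil(1.0 + log2((double)alpha));
        printf("alpha=%2d  simulated=%3d  bound=%3d\n",
               alpha, coalesce_cycles(alpha), bound);
    }
    return 0;
}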
4.5.2.1 Description of operational modes Each reduction circuit adder operates in one of four modes and is controlled by a finite state machine (FSM), as depicted in Figure 4.13. A description of the control FSM states and associated transitions follows. 59 fill wait steady state coalesce inputs available outputs available no inputs no inputs finished Figure 4.13: Operating modes wait The adder is waiting for a new set of values to arrive. When the first two values arrive, the circuit transitions into the fill mode. fill In this mode, the adder pipeline is filling up but has not yet produced any outputs. After® a cycles, partial sums appear at the adder output. At this point, the circuit transitions into the steady state mode. If the last value in the set arrives while in the fill mode, the circuit transitions into the coalesce mode. steady state In the steady-state mode, which is similar to the merge-time of Ni and Hwang, the pipeline is full and a single value is in the input buffer. The pipeline output value and the single buffer value are inserted into the pipeline. As each input buffer value is read, the newly arriving input value is written into the buffer. At any point in time, the pipeline is filled with partially reduced values. After the last value in the set is read from the input buffer, the adder stops reading the buffer and the circuit transitions into the coalesce mode. coalesce The coalesce mode, which is similar to the Ni and Hwang drain-time, begins whenever the last value in a set is read from the input buffer. It can be shown that it takes at most® a dlg® a e cycles to coalesce a full pipeline. After coalescing the 60 pipeline contents down to a single value, the circuit transitions back to the wait mode. If all adders are coalescing, a fixed buffer of size® a dlg® a e+1 ensures that a buffer overflow cannot occur. 4.5.2.2 Proof of maximum cycles to coalesce Suppose seti is currently being reduced in one of the pipelines. After reading the last value from the input buffer the values in the pipeline are coalesced to a single value. Theorem 5: It takes no more than®d1+lg®e cycles to coalesce an®-stage pipeline. Proof: The proof proceeds from the case which has the most unfinished work. This is when the®-stage pipeline has a value in every position, and the output buffer, obuf, is empty. This is depicted for ® = 5 in clock cycle 0 of Figure 4.14. After ® cycles, all the partial reductions have moved through the pipeline. Every time there is one value in obuf, and one value at the pipeline output they are inserted back into the pipeline for further reduction. This means there are now®=2 partial reductions in the pipeline, spaced every 2 positions, as depicted at clock cycle 5 in Figure 4.14. After every ®-cycle pass, the pipeline has 1=2 the number of partial sums. After multiple passes through the pipeline there is a single value as depicted in cycle 12. It then takes an additional® cycles for this final reduced value to appear at the output. Clearly this repeated division of® by2 corresponds todlg®e. Since it takes® cycles for each pass through the pipeline, and it takesd1+lg®e passes to coalesce the pipeline, it takes at most,®d1+lg®e cycles to coalesce an®-stage pipeline. 4.5.2.3 Proof of dual pipeline sufficiency The number of parallel pipelines needed is driven by the size of the input buffer, ibuf. As shown previously, a single adder in the loopback configuration of Figure 4.2 needs £(n®lg®) buffer space, which is unacceptable. 
This section proves that only two 61 cc obuf adder pipeline (α = 5) deabc dea dea dea bc dea bc dea de bc de bc a de bc b a de c b a de d c b a e d c b a bc bc bc a a c e 12 11 10 9 8 7 6 5 4 3 2 1 0 Figure 4.14: Coalescing a 5-stage pipeline pipelines are needed to reduce the input buffer to a reasonable and constant size that does not depend on the input vector length. Theorem 6: Given two®-stage looped pipelines, an input buffer size of®d1+lg®e will prevent buffer overflow. Proof: This is a proof by contradiction. Assume the ®d1+lg®e buffer has just overflowed for the first time. Let the next set to be processed be set j. This means there are ®d1+lg®e values from set j (and perhaps later sets) in the buffer, and one more value has just arrived causing the overflow. If either pipeline were in wait, fill, or steady state mode then buffer values would be consumed on each cycle and an overflow could not occur. Thus both pipelines must be in the coalesce mode. The longest it can take to coalesce a set is ®d1+lg®e cycles. But it took at least®d1+lg®e + 1 cycles to cause an overflow. This constitutes the contradiction 62 and proof that two loop adder pipelines reduce the input buffer to a reasonable and constant size. 4.5.2.4 Summary This section described an HDL-based striding method vector reduction design that runs in £(n) time and uses £(® a dlg® a e) buffer space. It provided a description of the operational modes and proofs showing that a fixed size buffer and two adders were sufficient to accommodate multiple, variable length vectors. 4.6 High-level language reduction An HDL-based flow requires a significant hardware design background on the part of the developer and is unlikely to be adopted by mainstream supercomputer users. Therefore an HLL-based reduction approach, which more closely resembles program- ming, is needed. This section describes the issues associated with HLL-based reduc- tions and then presents the novel HLL-based reduction solution that is one of the key contributions of this research. 4.6.1 Pipelines and high-level language reduction Consider the sparse matrix vector multiply HLL pseudo-code shown in Figure 4.15. There is a parallel section at line 3 that contains two loops. Therefore, unlike the apparent parallelism of a uniprocessor thread, the two loops at line 4 and line 7 cor- respond to independent hardware pipelines that execute concurrently. The parallel block at line 4 has a channel, chD, that is being fed by a k-width dot product unit. Every clock cycle, the dot product unit ingestsk compressed sparse row (CSR) format matrix A values (val h ) and the corresponding k elements from thex vector (x col h ). 63 1: fori in 1:::n do 2: y i Ã0:0 3: parbegin 4: forh in ptr i :::ptr i+1 stride k do 5: chDà putChan(dot(val h ,x col h )) 6: end for 7: forj in ptr i :::ptr i+1 stride k do 8: dà getChan(chD) 9: y i Ãy i +d 10: end for 11: parend 12: end for Figure 4.15: Sparse matrix vector multiply HLL pseudo-code The parallel block at line 7 reads the chD channel and accumulates the partial dot products into output vector element y i using a looped adder. An HLL-to-HDL com- piler might translate this code into the hardware that was previously illustrated in Figure 4.2. As explained earlier, this architecture fails unless the pipeline is stalled or a large buffer is used. An HLL-to-HDL compiler such as Carte will detect the loop carried dependence and stall the pipeline, which results in poor performance. 
4.6.2 Fundamental problem A seemingly obvious solution, mapping an existing HDL-based IP reduction design into the HLL environment via the hybrid flow described in Section 2.3.4, does not solve the problem. Suppose line 9 of Figure 4.15 is replaced with the HDL-based solution described in Section 4.5.1. This would result in the pseudo-code shown in Figure 4.16. Since the reduction on line 9 has a variable latency, the enclosing loop starting on line 7 also has a variable latency. Clearly a variable latency loop cannot be pipelined since the compiler cannot determine the length of the pipeline. A non- pipelined loop results in poor performance. Therefore, one cannot simply integrate an existing HDL solution. 64 1: fori in 1:::n do 2: y i Ã0:0 3: parbegin 4: forh in ptr i :::ptr i+1 stride k do 5: chDà putChan(dot(val h ,x col h )) 6: end for 7: forj in ptr i :::ptr i+1 stride k do 8: dà getChan(chD) 9: reduce(d,y i ) ! variable latency 10: end for 11: parend 12: end for Figure 4.16: Modified sparse matrix vector multiply HLL pseudo-code This is a fundamental problem that cannot be solved in the given form since each row of the sparse matrix can have a different number of elements. This means that there will be a different number of partial reductions, that is, partial dot products. Even if the basic reduction operation such as add, could be done in zero time, all the partial dot products must still be ingested. Thus, for sparse matrix vector multiply and similar cases, the latency of the reduction is always variable and cannot be pipelined. 4.6.3 Striding method design This section describes a novel HLL-based striding method vector reduction solution that accommodates multiple sets of sequentially delivered input vectors. The intu- ition behind this design is to partition the reduction operation into two stages that run in separate pipelined loops, as suggested by the simplified HLL-based pseudo-code shown in Figure 4.17. There are three parallel loops at line 7, line 10, and line 15, that correspond to independent, concurrently executing hardware pipelines. The loop at line 7 feeds the chD channel as previously described. The loop at line 10 reads the channel into a local variable,d, as before. However, on line 12 and 13 it uses the modulo operator to sumd into a different column of the®-row by®-columnS array. 65 1: declareS ®;® 2: ioldÃ1 3: fori in 1:::n do 4: imodÃ((i¡1) mod ®)+1 5: s imod Ã0 6: parbegin 7: forh in ptr i :::ptr i+1 stride k do 8: chDà putChan(dot(val h ,x col h )) 9: end for 10: forj in 1:::(ptr i+1 ¡ptr i )=k do 11: dà getChan(chD) 12: jmodÃ((j¡1) mod ®)+1 13: s imod;jmod Ãs imod;jmod +d 14: end for 15: whilei·n do 16: wait untiliold6=imod 17: y i Ãbintree(s iold ) 18: s iold Ã0 19: ioldÃimod 20: end while 21: parend 22: end for Figure 4.17: Two-stage sparse matrix vector multiply HLL pseudo-code On line 15, the third loop waits until the current row of theS array is finished, that is, until the partial summation for rowi is completed. It then uses an®-width binary tree to accumulate all ® columns of the completed S row into y i , and resets the row to 0 for future use. Figure 4.18 is an idealized block diagram of the reduction architecture that an HLL-to-HDL compiler might produce from the loops at line 10 and line 15. The stage 1 architecture (the loop at line 10) includes an ®-stage pipelined adder, an ®-row by ®-column partial summation array, S, and some control circuitry. 
The stage 2 archi- tecture (the loop at line 15) has a simple®-input binary tree accumulator that reduces completed rows of S and writes the results to vector y. A round-robin scheduling 66 stage 1 stage 2 ∑ = = α ε 1 l l i y + α-stage pipeline ε d ih ε + d ih ∑ l ε α ε ε ε ε 3 2 1 L ctrl α α× S Figure 4.18: Accumulator architecture algorithm guarantees an ®-cycle interval between subsequent references to the same memory location inS. 4.6.3.1 Description of toroidal access pattern Perhaps the easiest way to envision the schedule is to view the toroidal access pat- tern of the S array. The ®£® array in Figure 4.19(a) is wrapped top-to-bottom to produce the cylinder shown in Figure 4.19(b). The cylinder is wrapped end-to-end to produce the torus shown in Figure 4.19(c). The accumulation of a given input vector is restricted to a specific row within theS array. For example, if the black square has just been accessed, then the grey square will be accessed next, followed by the striped square, and so forth. After the checkerboard square is accessed, the black square will again be accessed. However, by this time, the contents of the black square and its associated input,d ih have already traversed the adder pipeline and been written back 67 α α (a) array (b) cylinder (c) torus Figure 4.19: Toroidal access pattern of arrayS to memory. Thus, even if there are more than® elements in the input vector, the ma- jor circumference of the torus (number of columns) is ®, thereby ensuring that any previous data at a given location has already traversed the adder pipeline and been written back by the time that location is again referenced. If a series of small vectors are to be reduced, the minor circumference of the torus (number of rows) ensures that by the time a row needs to be reused, its contents have already been sent to the output accumulator and the row initialized to 0. 4.6.3.2 Proof of maximum buffer size Theorem 7: An ®£® buffer is sufficiently large to prevent loop carried depen- dences and allow for a fully pipelined reduction. Proof: A multiple cycle snapshot of the round-robin schedule is shown in Fig- ure 4.20. An® = 4 adder pipeline and a single row,s i , of theS array, which already contains partially reduced data, (² 1 ;² 2 ;² 3 ;² 4 ), is depicted. The input column indi- cates the incomingd i values to be accumulated. Thes i element being accessed during a given clock cycle is highlighted by a double border. At the first clock cycle, input d 1 and S i1 = ² 1 enter the adder pipeline, as shown on the right side of the snapshot. On the second clock cycle,d 2 andS i2 = ² 2 enter the pipeline and² 1 +d 1 proceed to the next pipeline stage. On the third cycle, d 3 and S i3 = ² 3 enter the pipeline. On 68 i input s access pattern and contents adder pipeline (α = 4) 1 2 3 4 ε 1 ε 2 ε 3 ε 4 ε 1 +d 1 ε 1 ε 2 ε 3 ε 4 ε 2 +d 2 ε 1 +d 1 ε 1 ε 2 ε 3 ε 4 ε 3 +d 3 ε 2 +d 2 ε 1 +d 1 ε 1 ε 2 ε 3 ε 4 ε 4 +d 4 ε 3 +d 3 ε 2 +d 2 ε 1 +d 1 ε 1 +d 1 ε 2 ε 3 ε 4 ε 1 +d 1 +d 5 ε 4 +d 4 ε 3 +d 3 ε 2 +d 2 ε 1 +d 1 ε 2 +d 2 ε 3 ε 4 ε 2 +d 2 +d 6 ε 1 +d 1 +d 5 ε 4 +d 4 ε 3 +d 3 ε 1 +d 1 ε 2 +d 2 ε 3 +d 3 ε 4 ε 3 +d 3 +d 7 ε 2 +d 2 +d 6 ε 1 +d 1 +d 5 ε 4 +d 4 ε 1 +d 1 ε 2 +d 2 ε 3 +d 3 ε 4 +d 4 ε 4 +d 4 +d 8 ε 3 +d 3 +d 7 ε 2 +d 2 +d 6 ε 1 +d 1 +d 5 d 1 d 2 d 3 d 4 d 5 d 6 d 7 d 8 Figure 4.20: Snapshot of stage 1 partial summation unit the fourth cycle,d 4 andS i4 = ² 4 enter the pipeline. 
One will notice, on the fifth clock cycle, that the adder output, ε_1 + d_1, is written into S_i1 and re-entered into the adder pipeline with the incoming d_5 value. Since it takes α cycles to traverse the adder pipeline, and true dual-ported memory with a write-first policy is being employed, the inputs never have to be buffered and a fully pipelined loop is possible. A similar action occurs with inputs d_6 through d_8. The stage 2 output summation can be accomplished with an α-node binary reduction tree, because α is small enough to allow the required α−1 adders to fit on the FPGA. Whenever the last value for a given row enters the adder, the controller waits α cycles for the result to traverse the adder, loads the now completed row of S into the output accumulator, and then initializes the vacant row to all zero values. If multiple small rows are entered into the reduction unit, it takes at most α cycles after the final value of the first row has been entered into the adder for the first row to traverse the adder pipeline, enter the output accumulator, and the S row to be reclaimed. Since there are α rows, the reduction unit also handles multiple sets without pipeline stalls. The toroidal access pattern and temporal considerations make S appear as an infinite two-dimensional array, which can handle multiple variable-length sets of sequentially delivered input vectors. The two-stage algorithm disconnects the input from the output, allowing high-performance, fully pipelined loops to be employed.

4.6.3.3 Summary

This section described an HLL-based vector reduction technique that runs in Θ(n) time and requires Θ(α²) buffer space. The performance speedups that can be obtained from this architecture will be illustrated in the next chapter, where variants of the accumulator are used to accelerate two well-known sparse matrix iterative solvers.

Chapter 5
Accelerating Sparse Iterative Solvers using Reconfigurable Computers

Solving systems of linear equations, Ax = b, where A is an n×n real-valued matrix, x is the unknown real n-vector, and b is a known real n-vector, is essential for a number of applications such as solving partial differential equations (PDE) in a scientific simulation, where the PDE is discretized into a system of equations that are repeatedly solved at each time step. Solvers can be divided into two categories, direct solvers and iterative solvers. Direct solvers are usually based on Gaussian elimination and involve two steps: 1) factor the A matrix, and 2) solve for the x vector using back substitution. Iterative solvers start with an initial guess and iteratively find closer approximations to the solution. For sparse matrix systems, direct methods may be too expensive or even impossible. For example, a very large but very sparse matrix that fits in memory could have dense factors that will not fit in memory. Therefore, iterative methods are often used to solve sparse matrix systems.

In this chapter, two classic sparse matrix iterative solvers are accelerated using FPGA-augmented RCs. Section 5.1 describes the Jacobi method and conjugate gradient (CG) iterative solvers. Section 5.2 describes some general FPGA design considerations. Section 5.3 and Section 5.4 compare software-only Jacobi and CG designs with FPGA-augmented designs and show that significant speedups are possible. The acceleration of sparse matrix iterative solvers via FPGA-based kernels represents a principal contribution of this dissertation.
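Both solvers in this chapter are built around the two-stage reduction of Section 4.6.3, so it is worth restating that schedule as ordinary sequential software before turning to the solvers themselves. The sketch below is purely illustrative: in hardware the two stages run as concurrent, fully pipelined loops, whereas here they simply execute one after the other, and the α value and names are assumptions patterned after Figure 4.17.

```c
/* Sequential software model of the two-stage reduction of Figure 4.17.
 * d[i][0..nparts[i]-1] are the partial dot products for matrix row i,
 * delivered one per "cycle" through the channel in the real design. */
#define ALPHA 14                          /* illustrative reduction depth */

void reduce_rows(int n, const int *nparts, double **d, double *y)
{
    static double S[ALPHA][ALPHA];        /* alpha x alpha partial sums   */

    for (int i = 0; i < n; i++) {
        int imod = i % ALPHA;             /* S row assigned to this input */

        /* stage 1: scatter the incoming partials across the S row        */
        for (int j = 0; j < nparts[i]; j++)
            S[imod][j % ALPHA] += d[i][j];

        /* stage 2: collapse the completed S row (a binary tree in HW)     */
        y[i] = 0.0;
        for (int l = 0; l < ALPHA; l++) {
            y[i] += S[imod][l];
            S[imod][l] = 0.0;             /* reclaim the row for reuse     */
        }
    }
}
```

The α rows of S exist only so that stage 2 can drain one row while stage 1 fills another; in this sequential model they merely mirror the hardware indexing.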
5.1 Iterative solver primer

This section provides a brief description of the Jacobi method iterative solver and the conjugate gradient iterative solver.

5.1.1 Jacobi method

The Jacobi method, named after the German mathematician Carl Gustav Jacob Jacobi [30], is an example of a stationary iterative method. According to Barrett et al. [10] and Black and Moore [12], stationary iterative methods can be expressed in the form

\[ x^{(\delta+1)} \leftarrow B\,x^{(\delta)} + c, \qquad (5.1) \]

where neither B nor c depends upon the iteration index, δ. According to James [43] and Bagnara [9], the Jacobi method is guaranteed to converge whenever matrix A is weakly diagonally dominant (DD), that is,

\[ |a_{ii}| \geq \sum_{j=1,\, j \neq i}^{n} |a_{ij}|. \]

Since the diagonal element dominates the row and column, one can simply solve for each diagonal element and plug in an approximate value. Repeated applications of this technique cause the approximations to converge to the actual solution.

To derive the Jacobi method, Ax = b is massaged to look like Equation 5.1. Let A = L + U + D, where L is the lower triangular matrix containing all elements of A below the diagonal, U is the upper triangular matrix containing all elements of A above the diagonal, and D is the diagonal matrix consisting of only the diagonal elements of A. Substituting L + U + D into Ax = b yields (L + U + D)x = b. This can be manipulated to obtain x = D^{-1}(−L−U)x + D^{-1}b, which can be easily transformed into the vector form of the Jacobi iteration, x^{(δ+1)} ← D^{-1}[b − (L + U)x^{(δ)}]. The i-th component of the diagonal matrix D^{-1} is just 1/a_ii, and (L + U)x^{(δ)} is essentially the matrix vector product Ax^{(δ)} with the a_ii diagonal terms being ignored. Therefore, the Jacobi iteration can be expressed in point form as

\[ x_i^{(\delta+1)} \leftarrow \frac{1}{a_{ii}} \Bigl[\, b_i - \sum_{j=1,\, j \neq i}^{n} a_{ij}\, x_j^{(\delta)} \Bigr]. \qquad (5.2) \]

Barrett et al. [10] note that "the Jacobi method is also known as the method of simultaneous displacements, since the updates could in principle be done simultaneously."

To demonstrate the Jacobi method, the set of linear equations

\[ Ax = b \;\mapsto\; \begin{pmatrix} 3 & 2 \\ 1 & 4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 12 \\ 14 \end{pmatrix} \]

will be solved. From a starting value of (x^(0))^T = (1, 1), it takes nine iterations to converge to the known solution:

  δ   x_1^(δ)   x_2^(δ)   x_1^(δ+1)   x_2^(δ+1)
  0   1.000     1.000     3.333       3.250
  1   3.333     3.250     1.833       2.667
  …   …         …         …           …
  8   1.999     2.998     2.001       3.000
  9   2.001     3.000     2.000       3.000

A CSR-format sparse matrix Jacobi (SJAC) algorithm is shown in Figure 5.1. The algorithm is relatively straightforward. It accepts the A matrix (in CSR format), the b vector, and the approximate solution vector, x.

 1: algorithm SJAC(val, col, ptr, b, x)
 2:   for i in 1…n do
 3:     Σ ← 0
 4:     for j in ptr_i…ptr_{i+1}−1 do
 5:       if i = col_j then            /* save a_ii */
 6:         a ← val_j
 7:       else                         /* sum a_ij x_j */
 8:         Σ ← Σ + val_j × x^{(δ)}_{col_j}
 9:       end if
10:     end for
11:     x_i^{(δ+1)} ← (1/a)(b_i − Σ)
12:   end for
13: end algorithm

Figure 5.1: Sparse matrix Jacobi algorithm

The outer loop at line 2 visits each row of the matrix. The ptr vector is used to iterate through each matrix row, and the col vector is used to index into the x^{(δ)} vector, that is, to match the a_ij and x_j values as mandated by Equation 5.2. Thus, the loop at line 4 basically calculates the dot product, Σ a_ij x_j, while ignoring the a_ii term, as required. One will note that, on line 6, the a_ii term is saved for later use. The dot product is subtracted from b_i and the difference divided by a_ii, as shown on line 11. The calling routine is responsible for performing the convergence test.
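For concreteness, the sweep of Figure 5.1 can be written directly over the CSR arrays in C. This is a minimal illustrative sketch with 0-based indexing and no error handling, not the Carte source used later in this chapter; the diagonal entry is assumed to be stored, as it always is for a DD matrix.

```c
/* One Jacobi sweep over a CSR matrix (val, col, ptr), following Figure 5.1.
 * x_old holds x^(delta); x_new receives x^(delta+1).  Indices are 0-based. */
void sjac_sweep(int n, const double *val, const int *col, const int *ptr,
                const double *b, const double *x_old, double *x_new)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;   /* running dot product with a_ii excluded      */
        double aii = 0.0;   /* diagonal element, assumed present in row i  */
        for (int j = ptr[i]; j < ptr[i + 1]; j++) {
            if (col[j] == i)
                aii = val[j];                   /* save a_ii               */
            else
                sum += val[j] * x_old[col[j]];  /* accumulate a_ij * x_j   */
        }
        x_new[i] = (b[i] - sum) / aii;          /* Equation 5.2            */
    }
}
```

The caller alternates the roles of x_old and x_new on successive sweeps and applies the convergence test, exactly as the software-only design in Section 5.3.1 does.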
5.1.2 Conjugate gradient

CG was independently discovered by Magnus Hestenes and Eduard Stiefel. In 1952, they joined forces at the National Bureau of Standards, now called the National Institute of Standards and Technology (NIST), to publish their findings [40]. CG is an example of a nonstationary iterative method. Nonstationary methods differ from stationary methods in that "the computations involve information that changes at each iteration." [10] Early attempts to use CG on large systems of equations failed, and CG fell out of favor. However, according to O'Leary [62], John Reid's research in the 1970s [66] led to "renewed attention to the algorithm." Indeed, Krylov subspace methods, of which CG is the prototype, were affirmed as one of the top 10 algorithms of the twentieth century [83]. Today, CG is probably the best known iterative method for solving linear equations whenever sparse matrix A is symmetric positive definite (SPD), that is,

\[ A = A^{T} \quad \text{and} \quad x^{T} A x > 0 \;\; \forall x \neq 0. \]

CG-related literature resources are ubiquitous – a Google search for "conjugate gradient" resulted in over 2.6 million hits, an IEEE Xplore search yielded 276 articles having "conjugate gradient" in the title, and the ACM Digital Library reported 23 such articles. Any text dealing with sparse linear equations, such as [67, 10], would, by necessity, include some discussion of CG. Accordingly, this section is not a rigorous mathematical treatment of the CG method. However, it does give a bit of detail and enough of the intuition behind CG such that the algorithm can be understood.

Fundamentally, conjugate gradient is an iterative algorithm, CG(f(x), x_0), for finding the nearest local minimum of a function, f(x), where x is an n-vector of independent variables and x_0 is a starting point. According to Shewchuk [71], if one takes the quadratic form of vector x,

\[ f(x) = \tfrac{1}{2} x^{T} A x - b^{T} x + c, \qquad (5.3) \]

and matrix A is an SPD matrix, it can be shown via "a little bit of tedious math" that Equation 5.3 is minimized by the solution to Ax = b. A plot of the quadratic form of x, involving an order n = 2 SPD matrix, yields a three-dimensional concave-up parabolic surface as illustrated in Figure 5.2. The x value at the lowest point on this surface, f(x_m), is the solution to Ax = b. This point is also the local and global minimum, which can therefore be found using CG. For n > 2 one would obtain an (n+1)-dimensional parabolic surface. Intuitively, at least for the three-dimensional case, one can imagine "walking downhill" to find the lowest point. Unlike the method of steepest descent, which uses the local gradient for going downhill (and can therefore end up taking multiple steps in the same direction), CG uses A-orthogonal (conjugate) search directions that allow the iteration to converge in fewer steps. In theory, the maximum number of CG iterations for an n×n system of equations is n. In practice, it usually takes more.

Figure 5.2: Quadratic form of vector x (surface plot of f(x) = ½xᵀAx − bᵀx + c for an order n = 2 SPD example; the minimum lies at x = x_m, where Ax_m = b)

The CG algorithm shown in Figure 5.3 is based on the equations and algorithms given in [71, 67, 10, 40]. This is a regular CG algorithm that does not consider preconditioning or residual correction. These real-world considerations will be addressed later. The notation (x, y) = Σ x_i y_i represents the dot product, and the iteration index is indicated via a superscripted parenthetical expression such as x^(i).
 1: algorithm CG(A, b, x, x_0)
 2:   x^(0) ← x_0
 3:   p^(0) ← r^(0) ← b − Ax^(0)
 4:   i ← 0
 5:   while (Δ is too big) do
 6:     q ← Ap^(i)
 7:     ρ ← (r^(i), r^(i))
 8:     α ← ρ / (p^(i), q)
 9:     x^(i+1) ← x^(i) + αp^(i)               /* new solution */
10:     r^(i+1) ← r^(i) − αq                   /* new residual */
11:     β ← (r^(i+1), r^(i+1)) / ρ
12:     p^(i+1) ← r^(i+1) + βp^(i)             /* new direction */
13:     i ← i + 1
14:   end while
15: end algorithm

Figure 5.3: Conjugate gradient algorithm

CG keeps three n-vectors: x (the approximate solution), r (the residual), and p (the search direction). The algorithm has a loop that calculates the next value of the estimated solution vector, calculates the new residual vector, and updates the search direction vector. Each loop iteration involves one matrix vector multiply, three dot products, and three vector updates. In the absence of ill-conditioned matrices, roundoff error, cancellation, and so forth, each loop iteration yields a better x by "walking" in the direction derived from A and p. The terminology "Δ is too big" on line 5 is an abstraction for the convergence test. Since an optimistic expectation at line 5 might result in an infinite loop, some authors explicitly set a limit on the iteration count.

5.2 FPGA design considerations

Before describing the Jacobi and CG designs, some general design considerations will be addressed. The primary design consideration is to determine the FPGA design boundary, which identifies the portion of the design that should be mapped onto the FPGA. As suggested by Figure 5.4, the FPGA design boundary could range from the entire system to a part of a single module. Strictly speaking, although not examined in this dissertation, the design boundary could also be run-time dynamic.

Figure 5.4: Choosing an FPGA design boundary (a module hierarchy with modules m1; m1.1, m1.2, m1.3; and m1.2.1, m1.2.2, m1.2.3)

A number of factors come into play when deciding upon a suitable FPGA design boundary. The anticipated speedup should definitely be considered. The software-only system (if available) should be profiled to determine candidate modules. Obviously, modules that constitute a reasonable portion of the execution time would be potential candidates. In the absence of an existing software system, asymptotic algorithm complexity analyses such as "Big-Theta" and "Big-O" can be used [20]. According to Amdahl's Law [39], overall speedup is given by

\[ s_o = \frac{1}{(1 - f_e) + f_e / s_e}, \qquad (5.4) \]

where s_o is the overall speedup, f_e is the fraction of the application that is to be enhanced, and s_e is the speedup of the enhanced portion. A 100-fold speedup of a certain module might seem worthwhile. However, if the module is only 5 percent of the runtime, then the overall speedup is s_o = 1/(1 − 0.05 + 0.05/100) = 1.05, which is hardly worth the effort. On the other hand, a 3-fold speedup of a module that constitutes 90 percent of the program runtime results in an overall speedup of s_o = 1/(1 − 0.9 + 0.9/3) = 2.5, which certainly seems worth the effort.

The nature of the algorithm should also be considered. Section 3.2 showed that sparse matrix codes may have an affinity for FPGA-based processing. In contrast, Harkins et al. [38] look at the performance of sorting algorithms on RCs and show that control-intensive applications like sorts do not perform well on FPGAs.
These authors further note that "factors such as memory bandwidth, clock speed, algorithm computational density and an algorithm's ability to be pipelined all have an impact on FPGA performance."

Potential FPGA area constraints, possible data reuse, the algorithm's developmental stability, and so forth should also be considered. To make a MHz-scale FPGA competitive with a GHz-scale GPP, the FPGA-based kernel must be both deeply pipelined and highly parallelized. Thus, for floating-point kernels, FPGA area constraints may become a limiting factor. The extent to which data in a module can be reused, or the number of simultaneous local memory accesses allowed by the FPGA-based PE, could be considered. If a candidate module is likely to change, then it may not be worth the effort to put it onto an FPGA. If a candidate module takes a lot of run time because of an inefficient software implementation, then it may make more sense to simply redesign the software rather than target it for an FPGA. Obviously, there are no hard-and-fast rules for determining the FPGA design boundary. However, one should at least consider the ideas presented in this section in order to make an informed decision.

5.3 Sparse matrix Jacobi solver

This section describes the software-only and FPGA-augmented Jacobi designs and compares their performance.

5.3.1 Software-only sparse matrix Jacobi system design

Figure 5.5: Software-only sparse matrix Jacobi system design (main drives the mmio, coocsr, and amux utilities and the software SJAC kernel, mapping inputs A_1…A_h and b_1…b_h to outputs x_1…x_h and Θ_1…Θ_h)

The software-only SJAC design shown in Figure 5.5 consists of a main routine, some sparse matrices, A_1…A_h, the mmio, coocsr, and amux utilities, the software-only SJAC routine, and the output solution and statistics files, x_1…x_h and Θ_1…Θ_h. The b_i vectors are depicted as inputs, but in practice they are generated from a known x vector at run time. The main routine is a simple driver program, which essentially measures how long it takes to solve each system of linear equations. The A_1…A_h are DD sparse matrices stored in NIST Matrix Market coordinate format [60]. The mmio utility, also available from the NIST Matrix Market web site, reads coordinate format files. The coocsr utility, which is part of Saad's SPARSKIT [68], converts coordinate format matrices into CSR format. The amux utility, also part of SPARSKIT, is an optimized sparse matrix vector multiply for matrices that are represented in CSR format. The SJAC software kernel implementation is based on the Jacobi pseudocode presented by Barrett et al. [10]. However, for the sake of efficiency, the Θ(n) solution vector copy shown in that algorithm is avoided by 1) alternating between two vectors and 2) performing the convergence test in the main routine.

At run time, main calls upon mmio to read in a coordinate format matrix, calls coocsr to convert matrix A into CSR format, and calls amux to calculate b = Am, where m is a real n-vector consisting of all 1,000's (Roman numeral "m"). The main routine then invokes the SJAC kernel with parameters A, b, and starting point x_0, which is a real n-vector consisting of all 0's. After completing an iteration, the SJAC kernel returns the updated x solution vector. To avoid having to copy the new solution each iteration, main simply keeps two copies of x and alternates between them each time it calls SJAC. When the difference between subsequent iterations is smaller than a programmer-defined delta, main writes the solution to the results file.
It then writes the input matrix name, number of iterations, and wall clock run time to the statistics file and terminates. Since a known n-vector is used to generate b, verification of the solution is trivial.

5.3.2 FPGA design boundary

A profile of the software-only Jacobi system running on the target RC shows that it spends over 98 percent of the time in the SJAC kernel. Furthermore, SJAC is a relatively small, monolithic, sparse matrix code; there are no elaborate control structures or calls to library routines. The behavior of sparse matrix codes on GPPs, and their potential affinity for FPGA acceleration, is described in Section 3.2. In addition, the target RC provides up to six simultaneous 64-bit local memory reads per clock cycle, which allows for algorithmic parallelization; as noted in Section 5.1.1, the Jacobi algorithm itself is embarrassingly parallel. Finally, the A matrix and b vector are invariant during the entire Jacobi iteration, so the transfer cost to an FPGA-based SJAC can be amortized across all iterations. Therefore, the decision was made to include the entire SJAC kernel inside the FPGA design boundary.

5.3.3 FPGA-augmented sparse matrix Jacobi system design

Figure 5.6: FPGA-augmented sparse matrix Jacobi system design (as in Figure 5.5, but with the software SJAC kernel replaced by the FPGA-based SJAC processor, F1 and F2, and its local memory)

The FPGA-augmented SJAC design shown in Figure 5.6 is virtually identical to the software-only design except that the software SJAC kernel has been replaced by the FPGA-based SJAC processor, which will be described later. The run time operation of the FPGA-augmented SJAC design is also virtually identical to the software-only SJAC design but with one important distinction. Since the A matrix and b vector are invariant during the entire Jacobi calculation, the SJAC processor pulls a copy of A and b one time and stores them locally for subsequent iterations. Amortization of the A and b transfer cost across all iterations of Jacobi is a key design feature of the FPGA-based SJAC processor.

5.3.3.1 Processor design

To motivate the exposition of the FPGA-based SJAC processor, this section starts with the description of an idealized dense matrix Jacobi processor, followed by the description of a more detailed, albeit still idealized, sparse matrix Jacobi processor.

Figure 5.7: Idealized dense matrix Jacobi processor (drawn for n = 4: a dot product tree with the a_ii term suppressed, a subtraction unit forming b_i − Σ, and a divider producing x_i^{(δ+1)} = (1/a_ii)[b_i − Σ])

The block diagram of an idealized dense matrix Jacobi processor is shown in Figure 5.7. The inputs include the dense matrix, A, the constant vector, b, and the approximate solution vector, x^{(δ)}. The output is the updated approximation to the solution vector, x^{(δ+1)}. The idealized processor consists of a binary reduction tree (essentially a dot product unit), a subtraction unit, a divider, and some control units including a counter, multiplexer, and delay units. The operation is fairly straightforward. The first matrix row, a_1, is placed at one set of input lines to the dot product unit. Simultaneously, the x^{(δ)} vector is placed at the other set of input lines. On the next clock cycle, the second matrix row, a_2, and the x^{(δ)} vector are ingested. The next cycle deals with the third matrix row, and so forth. The i counter controls the multipliers so that the term a_ii x_i^{(δ)} is ignored, per Equation 5.2.
The dot product unit reduces the inputs to produce

\[ d_i = \sum_{j=1,\, j \neq i}^{n} a_{ij}\, x_j^{(\delta)}. \qquad (5.5) \]

Each constant vector element, b_i, is fed into the subtraction unit via a delay unit, which ensures that the inputs are applied at the correct times. The subtraction unit also inputs Equation 5.5 to produce

\[ b_i - \sum_{j=1,\, j \neq i}^{n} a_{ij}\, x_j^{(\delta)}. \qquad (5.6) \]

This result, and the appropriately delayed matrix diagonal element, a_ii, enter the divider, which outputs the i-th component of the x^{(δ+1)} vector.

The block diagram of an idealized sparse matrix Jacobi processor is shown in Figure 5.8. The inputs include the CSR-format sparse matrix, A, the constant vector, b, and the approximate solution vector, x^{(δ)}. The output is the new approximation to the solution vector, x^{(δ+1)}. The idealized processor consists of several sets of memory units, some registers, a dot product unit, a serial reduction unit, a division unit, a subtraction unit, a multiplication unit, and some control units to orchestrate processor operation.

Figure 5.8: Idealized sparse matrix Jacobi processor (drawn for k = 4: strided val and col memories feed a k-wide dot product unit, k copies of x^{(δ)} supply the other inputs, a pre-divided D^{-1} memory holds the 1/a_ii values, and a serial reduction unit forms b_i − Σ before the final multiply)

The operation begins by inputting the vectors b, x^{(δ)}, val, col, and ptr and storing them in the memory units. The latency of the divider is hidden by doing the division in parallel with the initial input and storing the result in the D^{-1} memory for subsequent iterations. Dot product operation requires k simultaneous values from A, and k simultaneous values from x^{(δ)}. Vectors val and col are strided, step k, across multiple memory banks to allow for k simultaneous memory fetches. For sparse matrices, any dot product leaf node multiplier could reference any x^{(δ)} value, so, to avoid multiple memory cycles, the x^{(δ)} vector is copied k times, once at each leaf node. The first k elements of val, that is, matrix elements a_11…a_1k, are simultaneously fetched from memory and placed in registers at the input lines of the dot product unit. The registers are needed to synchronize the matrix values with the corresponding k elements of the x^{(δ)} vector. The appropriate x^{(δ)} vector values are fetched from memory, using col_1…col_k as addresses, and placed at the other dot product input lines. The registers have control inputs such that the a_ii terms are ignored as mandated by Equation 5.2. On the next clock cycle, the second k-vector from the A matrix and the matching k elements from the x^{(δ)} vector are processed, and so forth. After the pipeline latency, the first partial dot product, d_ih, is emitted from the dot product unit. On the next clock cycle, the next partial dot product is emitted, and so forth. The serial reduction unit inputs b_i and then subtracts the serially delivered partial dot product values to produce the value shown in Equation 5.6. The 1/a_ii value is fetched from the D^{-1} memory and, with Equation 5.6, fed into the output multiplier, which produces the i-th component of vector x^{(δ+1)}.

5.3.3.2 Processor configuration

Given the idealized design information above, it is now easier to describe the actual SJAC processor design configuration. As shown in Figure 5.9, the SJAC processor design consists of two pipelined cooperating FPGAs, F1 and F2, and a set of local memory banks to hold the vectors b, col, val, D^{-1}, and row (to be described later).
The parameters shown in Table 5.1 specify the SJAC processor design characteristics.

Figure 5.9: FPGA-based sparse matrix Jacobi processor (F1: strided val/col memories, the k-wide dot product unit, the D^{-1} memory, and the final multiply by 1/a_ii producing x_i^{(δ+1)}; F2: the α_s-stage pipelined subtracter with its α_s×α_s array S and the output accumulator that forms b_i − Σ_{j≠i} a_ij x_j^{(δ)})

Table 5.1: SJAC processor design parameters

  Parameter   Description
  k           dot product width
  α_s         subtracter loop carried dependence
  α_m         multiplier IP core latency
  α_a         adder IP core latency
  n_max       max number of matrix rows
  n_zmax      max number of non-zeros

The F1 configuration includes on-chip BRAMs to store the x^{(δ)} and ptr vectors, a divider to calculate the 1/a_ii values, a k-width dot product IP core, an output channel to send the partial dot products over to F2, an input channel to receive the Equation 5.6 values back from F2, a multiplier to produce the final x_i^{(δ+1)} values, an output direct memory access (DMA) channel to send the x_i^{(δ+1)} back to GPP memory, and some control circuitry.

The F2 configuration is a variant of the accumulation reduction circuit described in the previous chapter. It includes an input channel to receive the partial dot products from F1, a partial-subtraction unit consisting of an α_s-stage pipelined subtraction unit and an α_s-row by α_s-column array, an α_s-wide output accumulator IP core, an output channel to send the Equation 5.6 values back to F1, and some control logic.

To meet timing constraints, simplify the FPGA module code, and reduce compilation time, the SJAC processor design employs two fully pipelined IP cores that are imported into the HLL-based design using the hybrid development flow described in Section 2.3.4. The k-width dot product unit and the α_s-width output accumulator IP cores were developed ahead of time, using VHDL and the 64-bit IEEE floating-point adder and multiplier IP cores described in [35]. The dot product unit accepts two 64-bit floating-point k-vectors every clock cycle and, after α_m + α_a lg k cycles, emits one 64-bit floating-point dot product every clock cycle. Figure 5.10(a) shows an example for k = 4. The binary tree accumulator IP core accepts α_s 64-bit floating-point values every clock cycle. After α_a⌈lg α_s⌉ cycles, it produces one 64-bit floating-point sum every clock cycle. In the general case, α_s may not be a power of two, and delay units are needed to synchronize the pipeline stages. Figure 5.10(b) shows an example for α_s = 6.

Figure 5.10: VHDL-based IP cores ((a) a k = 4 dot product tree computing Σ_{ℓ=1}^{k} u_ℓ v_ℓ; (b) an α_s = 6 binary tree accumulator computing Σ_{ℓ=1}^{α_s} ε_ℓ)

5.3.3.3 Processor operation

The FPGA-based SJAC processor has four operational modes: startup, which occurs once for each set of equations to be solved; input, which occurs once per iteration; execute, which occurs once per iteration; and output, which also occurs once per iteration. The reader will note that the execute and output modes run concurrently.

During the startup mode, the CSR-format sparse matrix A and vector b are DMA'ed from processor memory and stored in the SJAC processor. The val, col, and b vectors are stored in local memory banks, and the ptr vector is stored in BRAM on F1. To permit the dot product unit to fetch k values from val and k values from col every clock cycle, the vectors are strided, step k, across the local memory banks.
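This striding, together with the col packing and k-alignment padding described in the next few paragraphs, is performed on the GPP while marshalling the matrix; a host-side sketch is given below. It is illustrative only: the bank data structures, the 16-bit packing, and the use of column 0 with value 0.0 as padding are assumptions patterned after Figure 5.11, not the actual Carte code.

```c
/* Host-side marshalling of a CSR matrix for the SJAC processor: rows are
 * padded to a multiple of K (k-alignment), val is strided across K banks,
 * and K 16-bit col indices are packed into each 64-bit word. */
#define K 4   /* dot product width on the target platform */

int marshal_csr(int n, const double *val, const int *col, const int *ptr,
                double *val_bank[K], unsigned long long *col_packed)
{
    int g = 0;                                    /* k-group (clock) counter */
    for (int i = 0; i < n; i++) {
        for (int j = ptr[i]; j < ptr[i + 1]; j += K) {
            unsigned long long packed = 0;
            for (int b = 0; b < K; b++) {
                int idx = j + b;
                int pad = (idx >= ptr[i + 1]);    /* k-alignment padding     */
                double v = pad ? 0.0 : val[idx];
                int    c = pad ? 0   : col[idx];
                val_bank[b][g] = v;               /* stride val across banks */
                packed |= (unsigned long long)(c & 0xFFFF) << (16 * b);
            }
            col_packed[g++] = packed;             /* K 16-bit cols per word  */
        }
    }
    return g;                                     /* number of k-groups      */
}
```

Padding a short group with the value 0.0 keeps the dot product correct regardless of which x element the padded column index happens to select.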
The D^{-1} values and the row control vector are calculated and stored in local memory. Matrix A and vector b are invariant during the entire Jacobi calculation, so the cost of this startup mode processing is amortized across all iterations. The number of simultaneous local memory bank reads, and the width of the banks, are constraints that had to be considered. To conserve memory bank resources in the design, multiple col values are packed into each local memory bank. In Figure 5.11, for example, four elements of the integer vector, col, are packed into a single 64-bit local memory word, and four additional local memory banks are used for the real vector, val. To simplify routing, and to allow the design to meet timing, each val memory bank is attached to exactly one leaf node in the dot product unit. Thus, zeros are padded at the end of each matrix row that does not have a multiple of k values. This process, which takes place during the matrix A marshalling sequence on the GPP, is called k-alignment. The idea, for k = 4, is shown at the bottom of row 1 in Figure 5.11.

Figure 5.11: SJAC processor local memory matrix storage (one 64-bit bank holds the packed col indices, four banks hold the strided val entries with zero padding at the end of each row, and the k copies of x^{(δ)} are addressed by the unpacked col values; registers synchronize the u and v inputs of the dot product unit and suppress the a_ii terms)

During the input mode, x^{(δ)} is stored in BRAM memories on F1. Since the dot product unit must fetch k different values each clock cycle, k copies of x^{(δ)} are created in parallel, one for each input of the dot product unit, as shown in Figure 5.11.

At the beginning of the execute mode, which is implemented as a single-cycle pipelined loop, the first k elements of the A matrix, val_1…val_k, are fetched from the strided val memory and placed in registers at the v input lines of the dot product unit. Simultaneously, the corresponding k elements, col_1…col_k, are fetched from col and used as addresses into the x^{(δ)} memories, as shown in Figure 5.11. This causes the appropriately paired u and v k-vectors to enter the dot product unit at the same time. Note that control circuitry causes the a_ii values to be ignored as required. On the next clock cycle, the second k-group is processed, and so forth. The partial dot products are sent to the serial subtraction unit on F2.

As shown in Figure 5.9, F2 contains a variation of the accumulator reduction described in the previous chapter. The idea is to have an α_s-row by α_s-column array, S, and a meta-data array, row, which associates a row index with each incoming partial dot product value. The first column of S is initialized to b, and the other columns are initialized to zero. The j-th incoming partial dot product value "belongs to" output vector component x^{(δ+1)}_{row_j} and is subtracted from S_{i'j'}, where i' = row_j mod α_s and j' = j mod α_s. For example, suppose the incoming partial dot product, d_ih, is to be subtracted from the value in the black cell in Figure 5.9. The control unit fetches the value from the black cell and sends it and d_ih into the α_s-stage subtracter. The next incoming partial dot product is subtracted from the value in the grey cell, the next from the value in the slashed cell, and so forth across the row until the checkerboard cell. At this point, the access wraps back around to the black cell, which has already been updated by this time.
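The access pattern just described can be stated compactly in software. The sketch below is a sequential model of the stage 1 schedule only, with illustrative names; in hardware each loop body corresponds to one issue of the fully pipelined, single-cycle loop on F2.

```c
/* Sequential model of the F2 stage 1 partial-subtraction schedule: the j-th
 * incoming partial dot product d[j] belongs to matrix row row[j] and is
 * subtracted from S[row[j] % AS][j % AS].  Before a matrix row starts, its
 * S row holds b_i in one column and zeros elsewhere, so the eventual row sum
 * is b_i minus the sum of that row's partial dot products. */
#define AS 14   /* alpha_s on the target platform (cf. Table 5.2) */

void stage1_subtract(int ngroups, const int *row, const double *d,
                     double S[AS][AS])
{
    for (int j = 0; j < ngroups; j++) {
        int ip = row[j] % AS;   /* i': which S row (torus minor direction)   */
        int jp = j % AS;        /* j': which S column (torus major direction) */
        S[ip][jp] -= d[j];      /* round-robin across the row's alpha_s cells */
    }
}
```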
This round-robin scheduling guarantees α_s cycles between subsequent writes to the same memory location and allows a single-cycle pipelined stage 1 loop.

During the output mode, which runs concurrently with the execute mode, the stage 2 single-cycle pipelined loop uses the binary tree output accumulator IP core on F2 to sum each completed S row and reclaim the row for subsequent use. Whenever row_j changes value (indicating the start of a new row), the controller waits α_s cycles (to complete any pending subtractions) and then triggers the stage 2 loop to calculate the summation

\[ \sum_{\ell=1}^{\alpha_s} \varepsilon_\ell = \sum_{\ell=1}^{\alpha_s} S_{i'\ell} = b_i - \sum_{j=1,\, j \neq i}^{n} a_{ij}\, x_j^{(\delta)}. \]

The b_i − Σ value is channeled back to F1, where it is multiplied by 1/a_ii to produce x_i^{(δ+1)}. This output vector element is DMA'ed back to the processor memory.

5.3.4 Jacobi implementation details

This section describes the target platform and summarizes the implementation.

5.3.4.1 Target platform

SRC Computer's Carte programming environment [76] is tightly integrated with their RC hardware [78]. Carte automatically handles inter-chip communication, I/O pin mapping, bitstream generation, loop scheduling, and other necessary (but wearisome) details. This frees the developer to concentrate on mapping the algorithm onto hardware. Carte also directly supports the hybrid development model. Thus, an SRC-6 Model T210-0 MAPStation running Carte v2.1 was used as the target RC.

The target SRC-6 has two 2.8GHz Intel Xeon processors with 512KB cache and 1GB RAM. Its MAP Series MPC processor contains two Xilinx Virtex II 6000 FPGAs running at 100MHz. Each of these FPGAs has 288KB of BRAM. There are six banks of local on-board memory (OBM) associated with the FPGAs, which provide up to 24MB local memory. The MAP processor is connected to the Xeon GPP motherboard through a memory interface. DMA is used to move data between the OBM and GPP memory. The MAP processor has a streaming DMA capability, and an inter/intra-FPGA streaming capability that allows computation to overlap communication and facilitates parallelism within the MAP processor.

5.3.4.2 Implementation summary

The software-only and FPGA-augmented Jacobi designs described above were coded for the SRC-6 platform. The same set of files was used in both implementations to ensure a valid side-by-side comparison. The only difference was the particular SJAC kernel being used. The bulk of the effort was in implementing the FPGA-based SJAC processor, which used the parameters shown in Table 5.2.

Table 5.2: Jacobi implementation parameters

  Parameter   Value
  k           4
  α_s         14
  α_m         10
  α_a         14
  n_max       2,048
  n_zmax      ≈1M

The k value is constrained by the number of local memory banks in the MAP processor. Since the dot product fetches k values from val and k values from col every clock cycle, k = 4 is the widest data path that will fit on the target platform. This was accomplished by packing four 16-bit integers from the col vector into a single 64-bit local memory word. The α_s value is based on the "clocks per iteration" loop carried dependence summary of Carte and represents the number of cycles needed to fetch data from the S array, pass it through the subtraction unit, and store it back to S. The reader will recall that α_s determines the dimensions of the S array. The α_m and α_a IP core latency values are documented in [35]. The n_max value is constrained by the FPGA area. The successful F1 design consumed only 48/144 = 33.3 percent of the available BRAMs.
However, it needed 33,144/33,792 = 98 percent of the available slices. The additional control logic needed for larger BRAMs would simply not fit on FPGA F1. The n_zmax value is constrained by the size of the OBM banks. The k-aligned val and col arrays are placed in half-sized OBMs, so n_zmax = 4 × 262,144 ≈ 1M is the largest number of non-zero values that can be processed on the target RC. After several false starts, and with a considerable amount of effort in the code and test phase, Carte was coaxed into pipelining every loop and meeting the SRC-6 timing and area constraints. The performance obtained from the software-only and FPGA-augmented designs is given in the next section.

5.3.5 Performance comparison

5.3.5.1 Test matrices

The six DD sparse matrices summarized in Table 5.3 were used as test inputs for both the software-only and FPGA-augmented versions of the Jacobi solver. There are two matrix orders, n = 1K (1,000) and n = 2K (2,000). For each order, three matrices having sparsity percentages of two, four, and six percent were generated, that is, n_z = n² × 2%, n_z = n² × 4%, and n_z = n² × 6%. The smallest matrix fits in the 512KB Xeon cache, whereas the largest matrix is over five times larger than cache. It is acknowledged that these matrices are not particularly sparse, since typical sparse matrices have n_z ≈ n. However, it was necessary to use a sufficient number of non-zeros such that the SJAC software would experience cache misses.

Table 5.3: Jacobi test matrices

  n    %   n_z    KB
  1K   2   20K    221
  1K   4   40K    416
  1K   6   60K    611
  2K   2   80K    832
  2K   4   160K   1,613
  2K   6   240K   2,395

5.3.5.2 Test conditions

As mentioned previously, the b vector is generated as b = Am, where m is a real n-vector consisting of all 1,000's (Roman numeral "m"). The initial guess, x_0, is a real n-vector consisting of all zeros. The convergence test is given by

\[ \max_{i=1}^{n} \bigl( |x_i^{(\delta)} - x_i^{(\delta+1)}| \bigr) \leq 10^{-9}. \]

In all 12 cases (six software-only and six FPGA-augmented), Jacobi was run on an unloaded system.

5.3.5.3 Test results

Figure 5.12 shows the wall clock run time of the software-only and FPGA-augmented Jacobi versions. In over 2/3 of the test cases, the software-only run time is approximately the same as or higher than the FPGA-augmented run time. The effect was especially pronounced for the larger test cases, which did not fit in the 512KB Xeon cache.

Figure 5.12: Jacobi run time comparison (wall clock time in seconds, t_sw versus t_fpga: 1K/2%: 1.56 vs. 2.65; 1K/4%: 6.36 vs. 5.72; 1K/6%: 9.06 vs. 5.49; 2K/2%: 3.96 vs. 4.17; 2K/4%: 8.52 vs. 6.80; 2K/6%: 27.23 vs. 12.27)

5.4 Sparse matrix conjugate gradient solver

This section describes the software-only and FPGA-augmented CG designs and compares their performance.

5.4.1 Buy versus build

Unlike the relatively simple Jacobi method, conjugate gradient is notoriously difficult to properly implement and use. According to O'Leary [62], CG "went into eclipse in the 1960s as naive implementations were unsuccessful on some of the ever-larger problems that were being attempted." Building a poor software CG implementation from scratch and then claiming to beat it with RC hardware would have diminished the credibility of this research. Accordingly, the use of an existing CG implementation seemed prudent. Several candidates, including the NAG linear algebra suite [61], UT Austin's NSPCG package [64], and Saad's SPARSKIT library [68], were considered.
Factors such as cost, portability, ease of use, availability, and matrix format support resulted in the selection of SPARSKIT as the software base from which the two CG designs in this investigation were formulated.

5.4.2 Software-only sparse matrix CG system design

Figure 5.13: Software-only sparse matrix CG system design (main drives mmio, coocsr, ddot, the reverse-communication cg routine, and the amux kernel that supplies q ← Ap, mapping inputs A_1…A_h and b_1…b_h to outputs x_1…x_h and Θ_1…Θ_h)

The software-only CG design shown in Figure 5.13 consists of a main routine, some sparse matrices, A_1…A_h, the mmio, coocsr, amux, and ddot utilities, the cg routine, and the output solution and statistics files, x_1…x_h and Θ_1…Θ_h. As before, the b_i vectors are actually generated at run time. The main routine basically measures how long it takes to solve each system of linear equations. The A_1…A_h are SPD sparse matrices stored in coordinate format. The mmio, coocsr, and amux utilities are as previously described in Section 5.3.1. The ddot utility, also included with SPARSKIT, calculates dot products. Of particular importance is the cg routine, which is SPARSKIT's reverse-communication enabled implementation of CG.

At run time, main calls mmio to read in a matrix, calls coocsr to convert the matrix into CSR format, and calls amux to generate b = Am as before. The main routine then initializes a flags array and a work array, which are used by the cg routine. At this point, main invokes the cg routine with parameters b, starting point x_0, flags, and work; the reader will note that matrix A is not passed to cg. Whenever cg needs a matrix vector product, as at line 6 of Figure 5.3, it sets the appropriate request value in flags, uses work to communicate the p vector back to the main routine, and returns. The main routine loop examines flags to determine the course of action (done, matrix vector multiply, or error). If cg requested a matrix vector product, main sends matrix A and vector p to the amux sparse matrix vector multiply routine. When amux returns the q ← Ap matrix vector product, main uses work to communicate q back to cg by setting the appropriate reentry indicator in flags and calling cg, which continues with the iteration. When cg is done, it sets the done value in flags, uses work to communicate x back to the main routine, and returns. The main routine writes the solution to the results file; it also writes the input matrix name, number of iterations, and wall clock run time to the statistics file, and then terminates. As with Jacobi, since a known n-vector was used to generate b, verification of the solution is trivial.

5.4.3 FPGA design boundary

A profile of the software-only CG system running on the target RC shows that it spends over 97 percent of the time in the amux sparse matrix vector multiply (SMVM) routine. Furthermore, the SPARSKIT cg code is a complicated routine that includes sophisticated control structures such as the reverse communication mechanism and has several calls to other routines such as ddot and the convergence tests. In contrast, the amux SMVM kernel, like SJAC, is a relatively small, monolithic, sparse matrix code with an affinity for FPGA acceleration. As with the Jacobi method, the A matrix is invariant during the entire CG iteration, so the transfer cost to an FPGA-based SMVM can be amortized across all iterations. Therefore, the decision was made to put the SMVM kernel inside the FPGA design boundary.
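For reference, the kernel placed inside this boundary is simply the CSR sparse matrix-vector product, q ← Ap. The sketch below is a generic C rendering of such an amux-style kernel with 0-based indexing, not SPARSKIT's Fortran source.

```c
/* Generic CSR sparse matrix-vector multiply, q <- A*p. */
void smvm_csr(int n, const double *val, const int *col, const int *ptr,
              const double *p, double *q)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = ptr[i]; j < ptr[i + 1]; j++)
            sum += val[j] * p[col[j]];   /* gather p through col, as in CSR */
        q[i] = sum;
    }
}
```

Each output element q_i is an independent dot product over one CSR row, which is precisely the structure the FPGA-based SMVM processor exploits with its k-wide data path and two-stage reduction.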
5.4.4 FPGA-augmented sparse matrix CG system design

Figure 5.14: FPGA-augmented sparse matrix CG system design (as in Figure 5.13, but with the software amux kernel replaced by the FPGA-based SMVM processor, F1 and F2, and its local memory)

The FPGA-augmented CG design shown in Figure 5.14 is virtually identical to the software-only design except that the software amux kernel has been replaced by the FPGA-based SMVM processor, which will be described later. The run time operation of the FPGA-augmented CG design is also virtually identical to the software-only CG design. As with the Jacobi calculation, the SMVM processor pulls a copy of A and uses it for all iterations.

5.4.4.1 Processor design

The FPGA-based processor used to accelerate the CG is essentially a specialized sparse matrix vector multiply processor. Most of the design details will be described in the next section.

5.4.4.2 Processor configuration

As shown in Figure 5.15, the SMVM processor design is similar to the SJAC processor. The reader will note that the control structures to "ignore" the a_ii terms are no longer needed, that the division unit is no longer needed, that the output multiplication unit is no longer needed, and that the first column of S no longer needs to be set to b. There are two pipelined cooperating FPGAs, F1 and F2, and a set of local memory banks to hold the vectors col, val, and row. The parameters shown in Table 5.4 specify the SMVM processor design characteristics.

Figure 5.15: FPGA-based sparse matrix vector multiply processor (F1: strided val/col memories, k copies of p, and the k-wide dot product unit; F2: the α-stage pipelined adder with its α×α array S and the output accumulator producing q_i = Σ_j a_ij p_j)

Table 5.4: SMVM processor design parameters

  Parameter   Description
  k           dot product width
  α           adder loop carried dependence
  α_m         multiplier IP core latency
  α_a         adder IP core latency
  n_max       max number of matrix rows
  n_zmax      max number of non-zeros

The F1 configuration includes on-chip BRAMs to store the p and ptr vectors, a k-width dot product IP core, an output channel to send the partial dot products over to F2, an input channel to receive the q_i values back from F2, an output direct memory access (DMA) channel to send the q_i back to GPP memory, and some control circuitry.

The F2 configuration is basically the accumulation reduction circuit described in Section 4.6.3. It includes an input channel to receive the partial dot products from F1, a partial-summation unit consisting of an α-stage pipelined addition unit and an α-row by α-column array, an α-wide output accumulator IP core, an output channel to send the q_i values back to F1, and some control logic.

As with the SJAC processor, the SMVM processor design uses a k-width dot product IP core and an α-width output accumulator IP core that are imported into the design via the hybrid development flow.

5.4.4.3 Processor operation

The SMVM processor has four operational modes: startup, which occurs once for each set of equations to be solved; input, which occurs once per iteration; and the concurrently running execute and output modes, which also occur once per iteration.

During startup mode, the CSR-format sparse matrix A is DMA'ed from processor memory and stored in the SMVM processor. As with the SJAC processor, the val vector is strided, step k, in k-aligned local memory banks, and multiple col values are packed into local memory banks as shown in Figure 5.16. The ptr vector is stored in BRAM on F1, and the row control vector is calculated and stored in local memory banks.
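The row control vector is the meta-data that stage 1 on F2 uses to associate each partial dot product with its matrix row (Section 5.3.3.3). One plausible host-side construction, under the assumption that each k-group of the k-aligned matrix yields exactly one partial dot product, is sketched below; the names and the aligned ptr convention are illustrative.

```c
/* Builds the row meta-data vector: row[g] is the matrix row to which the
 * g-th partial dot product (i.e., the g-th k-group) belongs.  ptr_aligned
 * describes the k-aligned (zero-padded) matrix, so every row length is a
 * multiple of k.  Returns the number of k-groups. */
int build_row_vector(int n, int k, const int *ptr_aligned, int *row)
{
    int g = 0;
    for (int i = 0; i < n; i++)
        for (int j = ptr_aligned[i]; j < ptr_aligned[i + 1]; j += k)
            row[g++] = i;      /* every k-group in row i maps to output q_i */
    return g;
}
```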
Figure 5.16: SMVM processor local memory matrix storage (as in Figure 5.11: one bank holds the packed col indices, four banks hold the strided val entries, and the k copies of p are addressed by the unpacked col values to synchronize the u and v inputs of the dot product unit)

During the input mode, k copies of the input vector, p, are stored in BRAM, one copy for each leaf node of the dot product unit, as shown in Figure 5.16.

The execute mode and output mode are similar to the SJAC processor except that there is a vector summation reduction rather than a vector subtraction reduction on F2. In addition, there is no pre-reduction processing or post-reduction processing, that is, no b vector nor D^{-1} matrix. During each clock cycle, the dot product unit ingests the next k-vector, v, from val, via a set of synchronization registers, and the next k-vector, u, from p, which is fetched from BRAM using the matching k-vector from col, as suggested by Figure 5.16. The resulting partial dot products from each row are channeled to F2, where they are reduced to a single value, as described in Section 4.6.3. The q_i values are returned to F1 and then DMA'ed back to GPP memory.

5.4.5 CG implementation details

This section summarizes the CG implementation on the SRC-6 target platform previously described.

5.4.5.1 Target platform

The target platform is the same reconfigurable computer that was used for the sparse matrix Jacobi solver. Details can be found in Section 5.3.4.1.

5.4.5.2 Implementation summary

The software-only and FPGA-augmented CG designs were coded for the SRC-6 platform. The FPGA-based SMVM processor used the parameters shown in Table 5.5.

Table 5.5: Conjugate gradient implementation parameters

  Parameter   Value
  k           4
  α           14
  α_m         10
  α_a         14
  n_max       4,096
  n_zmax      ≈1M

The k value is constrained by the number of local memory banks in the MAP processor. The α value is based on the "clocks per iteration" loop carried dependence summary of Carte and represents the number of cycles needed to fetch data from the S array, pass it through the addition unit, and store it back to S. The α_m and α_a IP core latency values are documented in [35]. The n_max value, as with SJAC, is constrained by the FPGA area. However, the smaller area requirement of the F1 design (no divider and no multiplier) allowed for a larger n_max value. The n_zmax value is constrained by the size of the OBM banks.

5.4.6 Performance comparison

5.4.6.1 Test matrices

The 12 SPD sparse matrices summarized in Table 5.6 were used as test inputs for both the software-only and FPGA-augmented versions of the CG solver. There are four matrix orders: n = 1K, n = 2K, n = 3K, and n = 4K. For each order, three matrices having sparsity percentages of approximately two, four, and six percent were generated, that is, n_z ≈ n² × 2%, n_z ≈ n² × 4%, and n_z ≈ n² × 6%. The smallest matrix fits in the 512KB Xeon cache, whereas the largest matrix is more than 18 times larger than cache. As before, it was necessary to use a sufficient number of non-zeros such that the amux sparse matrix vector multiply software would experience cache misses.

Table 5.6: Conjugate gradient test matrices

  n    %   n_z    KB
  1K   2   20K    214
  1K   4   40K    409
  1K   6   60K    604
  2K   2   78K    810
  2K   4   160K   1,602
  2K   6   240K   2,381
  3K   2   180K   1,818
  3K   4   364K   3,613
  3K   6   543K   5,365
  4K   2   323K   3,232
  4K   4   639K   6,320
  4K   6   960K   9,450

5.4.6.2 Test conditions

The b vector is again generated as b = Am, where m is a real n-vector consisting of all 1,000's (Roman numeral "m").
The initial guess, x_0, is a real n-vector consisting of all zeros. No preconditioning is used, and the convergence test is based on the 2-norm of the residuals,

\[ \frac{\|r\|_2}{\|b - Ax_0\|_2} \leq 10^{-9}, \]

where ||r||_2 is the 2-norm of the current residual and ||b − Ax_0||_2 is the 2-norm of the initial residual. In all 24 cases (12 software-only and 12 FPGA-augmented), CG was run on an unloaded system.

5.4.6.3 Test results

Figure 5.17 shows the wall clock run time of the software-only and FPGA-augmented CG versions. In 5/6 of the test cases, the software-only run time is approximately the same as or higher than the FPGA-augmented run time. As with Jacobi, the effect was especially pronounced for the larger test cases, which did not fit in the 512KB Xeon cache.

Figure 5.17: CG run time comparison (wall clock time in seconds, t_sw versus t_fpga: 1K trials: 0.5 vs. 2.2, 0.9 vs. 2.4, 2.6 vs. 2.7; 2K trials: 6.8 vs. 6.5, 12.4 vs. 9.2, 20.3 vs. 11.9; 3K trials: 23.3 vs. 16.8, 45.9 vs. 24.0, 72.3 vs. 33.8; 4K trials: 67.3 vs. 35.0, 120.7 vs. 50.4, 157.9 vs. 74.4)

5.5 Analysis of Results

In considering the observed speedups obtained for both the Jacobi and CG iterative solvers, one might be inclined to overestimate the portion of the speedup that results from the poor sparse matrix kernel behavior on a multi-level memory machine, or to conclude that the speedups shown in this research were obtained by merely "throwing hardware at the problem." One might claim that by simply moving the rather substantial 24MB local memory that sits next to the FPGAs and using it as an L2 cache with a GPP, a similar speedup could be obtained. This section analyzes the results and describes why the performance increases cannot be attributed solely to the large 24MB local memory. In particular, it shows that simply using a large L2 cache will not result in similar speedups.

5.5.1 Analytical approach

Since actual hardware was not available to assess the impact of a larger L2 cache on GPP sparse matrix kernel performance, the test cases shown in Table 5.7, which did not significantly exceed L1 cache memory size, will be considered. The analysis shows that even for small cases, which did not use the large 24MB local memory, the FPGA-based solution outperforms the GPP-based solution, that is, the large 24MB memory was not the primary contributor to the speedup. Without question, the memory is needed for the large cases, but only to support the parallelism, which, along with the pipelining, is the principal contributor to the speedup. A couple of important observations are noted. First, the xc2v6000 FPGAs used in the experiments are not the latest generation. Second, modern GPPs generally have both the L1 and L2 cache on the same chip as the GPP. Accordingly, using the 24MB FPGA memory as an L2 cache would require a new GPP design. If one ignores the cost element associated with such a new architecture, or assumes the higher performance will make the cost worthwhile, it is reasonable to also assume a newer FPGA family can be used.

Table 5.7: Small test cases

  solver   n    %   n_z   KB
  Jacobi   1K   2   20K   221
  Jacobi   1K   4   40K   416
  Jacobi   1K   6   60K   611
  CG       1K   2   20K   214
  CG       1K   4   40K   409
  CG       1K   6   60K   604

The next section provides estimates of kernel performance on a next generation RC.
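The estimates in the next section follow directly from Amdahl's Law: the measured SRC-6 speedup is inverted to recover the kernel-only speedup, the kernel is assumed to run three times faster on the SRC-7 (a 2× wider data path at 1.5× the clock), and Amdahl's Law is then reapplied. A small helper that reproduces those numbers is sketched below; the function name is illustrative, and the 3× kernel factor is the assumption argued for in Section 5.5.2.

```c
#include <stdio.h>

/* Estimate the SRC-7 overall speedup from a measured SRC-6 speedup s_o6 and
 * the enhanced fraction f_e, assuming the FPGA kernel runs 3x faster on the
 * SRC-7 (Equations 5.4 and 5.7 of the text). */
double estimate_src7(double s_o6, double f_e)
{
    double s_e6 = f_e / (f_e - 1.0 + 1.0 / s_o6);   /* invert Amdahl's Law  */
    double s_e7 = 3.0 * s_e6;                        /* assumed SRC-7 kernel */
    return 1.0 / ((1.0 - f_e) + f_e / s_e7);         /* reapply Amdahl's Law */
}

int main(void)
{
    /* final Jacobi trial (f_e = 0.98) and final CG trial (f_e = 0.97) */
    printf("Jacobi: %.2f\n", estimate_src7(2.22, 0.98));   /* about 6.12 */
    printf("CG:     %.2f\n", estimate_src7(2.12, 0.97));   /* about 5.65 */
    return 0;
}
```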
5.5.2 Performance on next generation reconfigurable computer

Expected performance of these designs running on the soon-to-be-released SRC-7 MAP Series MPH/RL processor can be estimated using Equation 5.4 (Amdahl's Law) and the specifications given in Table 5.8 [78, 90]. The additional OBMs allow the dot product unit to ingest twice as many values per clock cycle, provided there are enough multipliers and slices available to build a k = 8 dot product unit.

Table 5.8: MAP specifications

  item          SRC-6 MPC   SRC-7 MPH/RL
  FPGA          xc2v6000    xc2vp100
  slices        33,792      44,096
  clock         100MHz      150MHz
  BRAMs         144         444
  mult18x18's   144         444
  OBM reads     6           20

The SRC-6 SJAC processor uses only 101 mult18x18's, and the SRC-6 SMVM processor uses even fewer, so there are clearly enough multipliers in the SRC-7 to build the additional four floating-point multipliers needed for a k = 8 dot product unit. The SRC-6 SJAC processor uses 33,144 slices, and the SRC-6 SMVM processor uses even fewer. As shown in [35], the adder IP cores require 1,078 slices, and the multiplier cores require 935 slices. A k = 8 dot product needs an additional four adders and an additional four multipliers, which translates into 8,052 slices. Since the SRC-7 has 10,304 slices more than the SRC-6, there are enough slices left in the SRC-7 to build the additional floating-point units. As shown in Table 5.8, the SRC-7 MPH/RL MAP also runs at 150MHz. The combination of a wider data path and increased clock rate means that an SRC-7 SJAC processor and an SRC-7 SMVM processor execute 2 × 1.5 = 3 times faster than on the SRC-6.

As mentioned in Section 5.3.2, over 98 percent of the run time is spent in the software SJAC kernel, and Section 5.4.3 shows that over 97 percent of the time is spent in the SMVM kernel. Therefore, the fraction to be enhanced is f_e6 = 0.98 for SJAC and f_e6 = 0.97 for SMVM. Solving Equation 5.4 for s_e results in

\[ s_e = \frac{f_e}{f_e - 1 + 1/s_o}. \qquad (5.7) \]

The final Jacobi experiment in Figure 5.12 has an overall speedup of s_o6 = 2.22. From Equation 5.7, the speedup of the enhanced portion on the SRC-6 can be computed as s_e6 = 0.98/(0.98 − 1 + 1/2.22) = 2.28. On the SRC-7, the enhanced portion runs three times faster than on the SRC-6, that is, s_e7 = 3 × s_e6 = 3 × 2.28 = 6.84. The GPP on the SRC-7 is an Intel x86 architecture much like the SRC-6, so it is reasonable to assume the SRC-7 software SJAC kernel still consumes 98 percent of the run time, that is, f_e7 = 0.98. Therefore, the final FPGA-augmented Jacobi trial on an SRC-7 is estimated to run s_o7 = 1/(1 − 0.98 + 0.98/6.84) = 6.12 times faster than software. A similar analysis holds for all the other trials. Figure 5.18 shows the actual speedup on an SRC-6 and the estimated speedup on an SRC-7 when comparing the software-only Jacobi with the FPGA-augmented design.

Figure 5.18: Jacobi actual and estimated speedup (SRC-6 actual versus SRC-7 estimated: 1K trials: 0.59/1.73, 1.11/3.19, 1.65/4.64; 2K trials: 0.95/2.75, 1.25/3.58, 2.22/6.12)

For the final CG SRC-6 experiment shown in Figure 5.17, the overall speedup is s_o6 = 2.12. From Equation 5.7, the speedup of the enhanced portion on the SRC-6 is s_e6 = 0.97/(0.97 − 1 + 1/2.12) = 2.20. On the SRC-7, the enhanced portion runs three times faster than on the SRC-6, that is, s_e7 = 3 × 2.20 = 6.60. It is still reasonable to assume that the SRC-7 software SMVM kernel consumes 97 percent of the run time, that is, f_e7 = 0.97.
Therefore, for the final CG trial, the SRC-7 is estimated to run s_o7 = 1/(1 − 0.97 + 0.97/6.60) = 5.65 times faster than software. A similar analysis holds for all the other trials. Figure 5.19 shows the actual speedup on an SRC-6 and the estimated speedup on an SRC-7 when comparing the software-only CG with the FPGA-augmented design.

Figure 5.19: CG actual and estimated speedup (SRC-6 actual versus SRC-7 estimated: 1K trials: 0.22/0.65, 0.36/1.05, 0.95/2.69; 2K trials: 1.04/2.94, 1.36/3.76, 1.70/4.63; 3K trials: 1.39/3.85, 1.91/5.15, 2.14/5.69; 4K trials: 1.92/5.18, 2.40/6.29, 2.12/5.65)

5.5.3 Expected performance for small test cases

Notice from Table 5.9 that nearly half of the test cases on the older FPGAs still ran faster than software. Also, the estimated performance of the small test cases on a newer FPGA results in five out of six cases running faster than software.

Table 5.9: Small test case performance

                                speedup
  solver   n    %   n_z   KB    act    est
  Jacobi   1K   2   20K   221   0.59   1.73
  Jacobi   1K   4   40K   416   1.11   3.19
  Jacobi   1K   6   60K   611   1.65   4.64
  CG       1K   2   20K   214   0.22   0.65
  CG       1K   4   40K   409   0.36   1.05
  CG       1K   6   60K   604   0.95   2.69

Further notice that these test cases on the GPP would not have benefited from the larger 24MB L2 cache, since they fit or nearly fit in the existing 512KB L1 cache. Finally, notice that these test cases did not use the 24MB local memory near the FPGAs. In fact, the largest of these only used about 600KB. Clearly there is something more than cache effects that allows the FPGA-augmented solvers to run faster. The next section provides an explanation for this performance enhancement.

5.5.4 Explanation for increased performance

Cache performance aside, the characteristics that allow the FPGA-based kernels to outperform a GPP are the pipelining and parallelism, that is, the three p's mentioned in Section 2.2. Even though the older SRC-6 FPGA-based MAP processor is running at 100MHz, it produces a speedup. From Equation 4.1, the pipelined dot product unit has a potential pipeline speedup of α_d = α_m + α_a lg k = 10 + 14 lg 4 = 38. Furthermore, the dot product has a parallelism of k = 4; therefore, the potential speedup of the dot product unit associated with the three p's is 38 × 4 = 152. If one now takes into account the additional speedup associated with the reuse of the matrices, and that there are no cache misses at all, one can understand how a 100MHz FPGA-based processor can outperform a 2.8GHz processor. In particular, 2.8GHz/152 ≈ 18MHz. For the newer SRC-7 processor, the pipelined dot product has a potential pipeline speedup of α_d = 10 + 14 lg 8 = 52, and a potential parallel speedup of k = 8. Therefore the potential speedup associated with the three p's is 52 × 8 = 416. Obviously, to sustain this speedup, one would have to keep the pipeline full most of the time. Clearly, the three p's constitute a powerful principle, which can result in significant speedups.

Chapter 6
Conclusions and Future Directions

6.1 Summary of research contributions

The major contributions of this research are, firstly, a novel partial reduction algorithm and architecture that facilitates HLL-based reductions without incurring pipeline stalls or experiencing buffer overflows, and secondly, the demonstration of how to speed up an important class of scientific applications, that is, sparse matrix solvers, by mapping them onto reconfigurable supercomputers.
Chapter 6

Conclusions and Future Directions

6.1 Summary of research contributions

The major contributions of this research are, first, a novel partial reduction algorithm and architecture that facilitates HLL-based reductions without incurring pipeline stalls or experiencing buffer overflows, and, second, the demonstration of how to speed up an important class of scientific applications, namely sparse matrix solvers, by mapping them onto reconfigurable supercomputers.

6.1.1 High-level language based reduction operations

The problems associated with FPGA-based floating-point reduction operations were discussed, and a viable solution was described. Using accumulation reduction as an example, a novel partial-summation architecture and two-stage toroidal access algorithm were developed. This unique approach allows fully pipelined HLL-based reduction operations to be implemented on contemporary RCs and removes one of the major impediments preventing mainstream acceptance and use of RC-based supercomputing. With this technique, one can now efficiently reduce multiple variable-length sets of sequentially delivered double-precision floating-point vectors using an FPGA-augmented RC. This facilitates the development of several important kernels beyond those documented in this dissertation, the vector norm, for example.

6.1.2 Sparse matrix applications on reconfigurable computers

The fundamental problems associated with sparse matrix computations on GPPs were discussed, as were the reasons why RC-based solutions fail. Using the reduction techniques previously described, IEEE Std 754 double-precision floating-point sparse matrix computations were mapped onto FPGA-augmented RCs. In particular, using a binary tree k-vector data path followed by the novel serial reduction architecture described above, two well-known sparse matrix iterative linear equation solvers, the Jacobi method and conjugate gradient (CG), were mapped onto reconfigurable computers using a hybrid development technique that involved both HDL-based IP cores and HLL-to-HDL compiler technology. In the case of the Jacobi solver, the FPGA design boundary encompassed the entire solver. In contrast, the conjugate gradient design boundary encompassed only the sparse matrix vector multiply kernel. The result in both cases was a measured speedup of more than two-fold on a first-generation RC and an estimated speedup of over six-fold on a second-generation RC. The techniques developed in this research can be used to implement other important kernels, matrix multiply, for example.

6.1.3 Analysis

An analysis of the results showed that the observed speedups were primarily attributable to the three p's, that is, the effects of pipelining and parallelism. In particular, the analysis showed that simply using a very large L2 cache in lieu of the FPGA-based processor would not provide similar speedups.
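For readers unfamiliar with the kernel being accelerated, the following is a minimal software reference sketch of sparse matrix vector multiply over the compressed sparse row (CSR) storage format used throughout this research (see Section 6.3). It is not the FPGA implementation or the Carte source, and the identifiers are illustrative; it is included only to show the inner-loop accumulation whose loop-carried dependence the binary tree k-vector data path and reduction architecture address in hardware.

```c
#include <stddef.h>

/* Reference sketch of y = A*x with A in CSR form: val[] holds the nonzeros,
 * col[] their column indices, and ptr[i]..ptr[i+1]-1 spans row i. */
void smvm_csr(size_t n, const size_t *ptr, const size_t *col,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;                  /* per-row dot product accumulator */
        for (size_t j = ptr[i]; j < ptr[i + 1]; j++)
            sum += val[j] * x[col[j]];     /* loop-carried dependence on sum */
        y[i] = sum;
    }
}
```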
6.2 Impact of research

Many of the large scientific computing problems, such as weather forecasting, predicting structural and material response to impulsive loads, understanding the air flows around supersonic air vehicles, and signal and image processing, ultimately involve the repeated solution of large sparse systems of equations. In many instances, the solvers are the single most compute-intensive portion of the simulation. The behavior of sparse matrix operations on GPPs is a well-known problem that has been researched for over 30 years. Certainly there have been some performance improvements, but the bottom line is that the compute engine is still a GPP and its memory hierarchy. Some irregular sparse matrices simply will not allow for much performance improvement. This work has clearly demonstrated that significant speedups are possible by using FPGA-based RCs. If the scientists and engineers who use supercomputers can cut their simulation time by 50 percent or more, this frees up compute cycles so that either more simulations or finer-resolution simulations can be undertaken. For many of the large codes, the principal resource constraint is CPU cycles. At the author's place of employment [31], for example, one of the Department of Defense high-performance computing centers, each user is given an annual allocation of supercomputer cycles. In most instances the users burn their annual allocations, and there are usually requests for additional reserve allocations. This research holds forth the promise of significant reductions in run times for crucial computational elements and takes reconfigurable supercomputing one step closer to mainstream adoption.

6.3 Future directions

The current research used a binary tree k-vector data path followed by a serial reduction unit to achieve the speedups. One of the big constraints in the implementation of this architecture is the number of simultaneous local memory bank reads. This forced the use of on-chip BRAMs to keep multiple copies of the x vector, which limited the order of the matrices to a few thousand. One interesting future study might consider the effects of placing the x vector in the local memory banks of a second-generation RC, which has a significantly larger number of memory banks. This could potentially boost the capacity of the solvers to matrices having orders of a few million. One might also do a theoretical study to determine the optimum number of memory banks needed to solve these kinds of problems.

The current research requires the use of an output binary tree accumulator for the reduction operation. Perhaps one might investigate the use of time-division multiplexing across rows. The idea here would be to serially deliver k-vectors from α rows to the binary tree k-vector data path. The subsequent output partial dot products would each belong to a different row and could immediately be added to the running sum for the given row. Since the next partial dot product for a given row would not appear for at least α cycles, the loop-carried dependence and the need for an output accumulator would be eliminated. The investigation would need to determine whether the significantly more complicated control algorithm and the possibility of out-of-order completions become a showstopper. Perhaps this approach, in conjunction with row permutations that group contiguous rows of similar length, might produce an even better result.

Another interesting study might be to look at the use of RCs and MPI to implement very large partitioned solvers. The research at hand has clearly shown an improvement for a single GPP plus FPGA PE pair. There is nothing in the literature about the performance of clustered GPP plus FPGA PEs.

Alternative sparse matrix representations have not yet been considered. This research only considered the CSR storage format and did not take other matrix characteristics into consideration. Perhaps storage formats that take advantage of matrix properties like symmetry could be exploited to obtain even greater speedups.

This research showed a potential speedup of over six-fold for both Jacobi and CG. However, no analysis has been done to show the power requirements needed to obtain this speedup. Since power consumption is becoming a very important issue at supercomputing centers, a trade-off analysis based on power consumption would be interesting.
If the six-fold speedup requires a nine-fold increase in energy maybe it is not such a good deal. The bottom line is that there are still a number of research opportunities with respect to accelerating floating-point computations on reconfigurable computers. 115 Reference List [1] Actel, Inc. IP module - CXZ80CPU. http://www.actel.com/ products/ip/search/detail.aspx?id=547. [2] Actel, Inc. Programmable logic solutions. http://www.actel.com. [3] Sreesa Akella, Melissa C. Smith, Richard T. Mills, Sadaf R. Alam, Richard F. Barrett, and Jeffrey S. Vetter. Sparse matrix vector multiplication on the SRC MAPstation. In Proceedings of the 9th High Performance Embedded Com- puting Workshop (HPEC’05), page http://www.ll.mit.edu/HPEC/ agendas/proc05/HPEC05 Open.pdf, Lexington, MA, USA, Septem- ber 2005. [4] Aldec, Inc. IP cores. http://www.aldec.com/products/ipcores. [5] Anish Alex, Jonathan Rose, Ruth Isserlin-Weinberger, and Christopher Hogue. Hardware accelerated novel protein identification. In Proceedings of the 14th International Conference on Field Programmable Logic and Application (FPL’04), pages 13–22, Antwerp, Belgium, September 2004. [6] Altera, Inc. FPGAs, CPLDs, and structured ASICs. http://www. altera.com. [7] ANSI/IEEE Std 754-1985. IEEE Standard for Binary Floating-Point Arith- metic. Institute of Electrical and Electronics Engineers, New York, NY , USA, 1985. [8] Navid Azizi, Ian Kuon, Aaron Egier, Ahmad Darabiha, and Paul Chow. Recon- figurable molecular dynamics simulator. In Proceedings of the 12th IEEE Sym- posium on Field-Programmable Custom Computing Machines (FCCM’04), pages 197–206, Napa, CA, USA, April 2004. [9] Roberto Bagnara. A unified proof for the convergence of Jacobi and Gauss- Seidel methods. SIAM Review, 37(1):93–97, March 1995. 116 [10] Richard Barrett, Michael Berry, Tony F. Chan, James Demmel, June Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk Van der V orst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia, PA, USA, 1994. [11] Michael J. Beauchamp, Scott Hauck, Keith D. Underwood, and K. Scott Hem- mert. Embedded floating-point units in FPGAs. In Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Ar- rays (FPGA’06), pages 12–20, Monterey, CA, USA, February 2006. [12] Noel Black and Shirley Moore. Stationary iterative methods. In MathWorld–A Wolfram Web Resource, http://mathworld.wolfram. com/StationaryIterativeMethod.html, June 2002. [13] Wim B¨ ohm and Jeffrey Hammes. A transformational approach to high per- formance embedded computing. In Proceedings of the 8th High Performance Embedded Computing Workshop (HPEC’04), pagehttp://www.ll.mit. edu/HPEC/agenda04.htm, Lexington, MA, USA, September 2004. [14] Brigham Young University. JHDL. http://www.jhdl.org. [15] Peter Buxa and David Caliga. Reconfigurable processing design suits UA V radar apps. In COTS Journal, http://www.cotsjournalonline. com/home/article.php?id=100407, 2005, RTC Group. [16] Bryan Catanzaro and Brent Nelson. Higher radix floating-point representations for FPGA-based arithmetic. In Proceedings of the 13th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’05), pages 161– 171, Napa, CA, USA, April 2005. [17] Celoxica Ltd. Handel-C language reference manual. http://celoxica. com/techlib/files/CEL-W0410251JJ4-60.pdf. [18] Ray C. C. Cheung, Wayne Luk, and Peter Y . K. Cheung. Reconfigurable el- liptic curve cryptosystems on a chip. 
In Proceedings of the Conference on Design, Automation and Test in Europe (DATE’05), pages 24–29, Washington, DC, USA, 2005. [19] Colorado State University. SA-C. http://www.cs.colostate.edu/ cameron/SACoverview.html. [20] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. The MIT Press, 2nd edition, 2001. [21] Cray Inc. Cray XD1 supercomputer for reconfigurable computing. http: //www.cray.com/downloads/FPGADatasheet.pdf. 117 [22] Elizabeth Cuthill and J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference of the ACM, pages 157–172, San Francisco, CA, USA, August 1969. [23] Tim Davis. University of Florida sparse matrix collection. Technical report, University of Florida, 2004. http://www.cise.ufl.edu/research/ sparse. [24] Michael deLorimier and Andr´ e DeHon. Floating-point sparse matrix-vector multiply for FPGAs. In Proceedings of the 2005 ACM/SIGDA 13th Interna- tional Symposium on Field-Programmable Gate Arrays (FPGA’05), pages 75– 85, Monterey, CA, USA, February 2005. [25] Malachy Devlin, Robin Bruce, and Stephen Marshall. Implementation of floating-point VSIPL functions on FPGA-based reconfigurable computers us- ing high-level languages. In Proceedings of the 8th MAPLD International Con- ference (MAPLD’05), page www.klabs.org/mapld05/papers/186 devlin paper.doc, Washington, DC, USA, September 2004. [26] Yong Dou, Stamatis Vassiliadis, Georgi K. Kuzmanov, and Georgi N. Gay- dadjiev. 64-bit floating-point FPGA matrix multiplication. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (FPGA’05), pages 86–95, Monterey, CA, USA, February 2005. [27] Jeffrey Draper, J. Tim Barrett, Jeff Sondeen, Sumit Mediratta, Chang Woo Kang, Ihn Kim, and Gokhan Daglikoca. A prototype processing-in-memory (PIM) chip for the data-intensive architecture (DIV A). The Journal of VLSI Signal Processing, 40(1):73–84, May 2005. [28] Michel Dubois, Jaeheon Jeong, Yong Ho Song, and Adrian Moga. Rapid hard- ware prototyping on RPM-2. IEEE Design and Test of Computers, 15(3):112– 118, July 1998. [29] Iain S. Duff. A survey of sparse matrix research. Proceedings of the IEEE, 65(4):500–535, 1977. [30] Encyclopaedia Britannica. Jacobi, Carl. In Encyclopaedia Bri- tannica Premium Service, http://www.britannica.com/eb/ article-9043197, June 2006. [31] ERDC MSRC. Major Shared Resource Center. http://www.erdc.hpc. mil. [32] Gerald Estrin. Organization of computer systems–the fixed plus variable struc- ture computer. In Proceedings of the Western Joint Computer Conference, pages 33–40, San Francisco, CA, USA, May 1960. 118 [33] Yakov I. Fet and Dmitri A. Pospelov. Parallel computing in Russia. Lecture Notes in Computer Science, 964:464–476, 1995. [34] Maya Gokhale, Janette Frigo, Christine Ahrens, Justin L. Tripp, and Ron Min- nich. Monte carlo radiative heat transfer simulation on a reconfigurable com- puter. In Proceedings of the 14th International Conference on Field Program- mable Logic and Application (FPL’04), pages 95–104, Antwerp, Belgium, September 2004. [35] Gokul Govindu, Ronald Scrofano, and Viktor K. Prasanna. A library of para- meterizable floating-point cores for FPGAs and their application to scientific computing. In Proceedings of the International Conference on Engineering Reconfigurable Systems and Algorithms (ERSA’05), pages 137–148, Las Ve- gas, NV , USA, June 2005. [36] Yongfeng Gu, Tom VanCourt, and Martin C. Herbordt. 
Integrating FPGA ac- celeration into the Protomol molecular dynamics code: Preliminary report. In Proceedings of the 14th IEEE Symposium on Field-Programmable Cus- tom Computing Machines (FCCM’06), pages 315–316, Napa, CA, USA, April 2006. [37] Mary Hall, Peter Kogge, Jeff Koller, Pedro Diniz, Jacqueline Chame, Jeff Draper, Jeff LaCoss, John Granacki, Jay Brockman, Apoorv Srivastava, William Athas, Vincent Freeh, Jaewook Shin, and Joonseok Park. Mapping irregular applications to DIV A, a PIM-based data-intensive architecture. In Proceedings of the ACM/IEEE SuperComputing 1999 Conference (SC’99), vol- ume 5, pages 57–74, Portland, OR, USA, November 1999. [38] John Harkins, Terek El-Ghazawi, Esam El-Araby, and Miaoqing Huang. Per- formance of sorting algorithms on the SRC 6 reconfigurable computer. In Pro- ceedings of the 2005 IEEE International Conference on Field-Programmable Technology (FPT’05), pages 295–296, Singapore, December 2005. [39] John L. Hennessy and David A. Patterson. Computer Architecture: A Quan- titative Approach. Morgan Kaufman, San Francisco, CA, USA, third edition, 2003. [40] Magnus Hestenes and Eduard Stiefel. Methods of conjugate gradients for solv- ing linear systems. Journal of Research of the National Bureau of Standards, 49(6):409–436, December 1952. [41] IEEE Std 1076-1993. IEEE Standard VHDL Language Reference Manual. Institute of Electrical and Electronics Engineers, New York, NY , USA, 1994. 119 [42] Eun Jin Im, Katherine A. Yelick, and Rich Vuduc. SPARSITY: An optimiza- tion framework for sparse matrix kernels. International Journal of High Per- formance Computing Applications, 18(1):135–158, February 2004. [43] K. R. James. Convergence of matrix iterations subject to diagonal dominance. SIAM Journal on Numerical Analysis, 10(3):478–484, June 1973. [44] Peter M. Kogge. The Architecture of Pipelined Computers. McGraw-Hill, New York, NY , USA, 1981. [45] Peter M. Kogge and Harold S. Stone. A parallel algorithm for the efficient so- lution of a general class of recurrence equations. IEEE Transactions on Com- puters, C-22(8):786–793, 1973. [46] David J. Kuck. The Structure of Computers and Computations. John Wiley & Sons, New York, NY , USA, 1978. [47] Lattice Semiconductor, Inc. FPGA and CPLD solutions. http://www. latticesemi.com. [48] Gerhard Lienhart, Andreas Kugel, and Reinhard Manner. Using floating-point arithmetic on FPGAs to accelerate scientific n-body simulations. In Proceed- ings of the 10th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’02), pages 182–191, Napa, CA, USA, April 2002. [49] Zhen Luo and Margaret Martonosi. Accelerating pipelined integer and floating-point accumulations in configurable hardware with delayed addition techniques. IEEE Transactions on Computers, 49(3):208–218, March 2000. [50] Patrick Lysaght and Delon Levi. Of gates and wires. In Proceedings of the 18th IEEE International Parallel & Distributed Processing Symposium (IPDPS’04), pages 132–137, Santa Fe, NM, USA, April 2004. [51] Mentor Graphics, Inc. ModelSim. http://www.model.com/. [52] Mercury Computer Systems, Inc. Powerstream FCN module. http://www. mc.com/products/boards.cfm. [53] Mitrionics Inc. Mitrion-C. http://www.mitrionics.com. [54] Gordon E. Moore. Cramming more components onto integrated circuits. Elec- tronics, 38:114–117, April 1965. [55] Gordon E. Moore. Cramming more components onto integrated circuits. Pro- ceedings of the IEEE, 86(1):82–85, January 1998. 120 [56] Gerald R. Morris and Viktor K. Prasanna. 
An FPGA-based floating-point Ja- cobi iterative solver. In Proceedings of the 8th International Symposium on Parallel Architectures, Algorithms, and Networks (ISPAN’05), pages 420–427, Las Vegas, NV , USA, December 2005. [57] Gerald R. Morris and Viktor K. Prasanna. Pipelined data path for an IEEE-754 64-bit floating-point Jacobi solver. In Proceedings of the 9th Annual High Per- formance Embedded Computing Workshop (HPEC’05), pagehttp://www. ll.mit.edu/HPEC/agendas/proc05/HPEC05 Open.pdf, Lexing- ton, MA, USA, September 2005. [58] Lionel M. Ni and Kai Hwang. Vector reduction methods for arithmetic pipelines. In Proceedings of the 6th International Symposium on Computer Arithmetic (ISCA’83), pages 144–150, Aarhus, Denmark, June 1983. [59] Lionel M. Ni and Kai Hwang. Vector-reduction techniques for arithmetic pipelines. IEEE Transactions on Computers, C-34(5):404–411, May 1985. [60] NIST. Matrix market, June 2004. http://math.nist.gov/ MatrixMarket. [61] Numerical Algorithms Group. Mathematics and technology for optimized per- formance. http://www.nag.co.uk. [62] Dianne M. O’Leary. Methods of conjugate gradients for solving linear systems. In David R. Lide, editor, A Century of Excellence in Measurements, Standards, and Technology: A Chronicle of Selected NBS/NIST Publications, 1901-2000, pages 81–85. NIST Special Publication 958, 2001. [63] Leonid Oliker, Xiaoye Li, Gerd Heber, and Rupak Biswas. Ordering unstruc- tured meshes for sparse matrix computations on leading parallel systems. Lec- ture Notes in Computer Science, 1800:497–503, 2000. [64] Thomas C. Oppe, Wayne D. Joubert, and David R. Kincaid. NSPCG user’s guide version 1.0: A package for solving large sparse linear systems by vari- ous iterative methods, TR-CNA-216. Technical report, University of Texas at Austin, Austin, TX, USA, 1988. http://rene.ma.utexas.edu/CNA/ NSPCG. [65] Karl Papadantonakis, Nachiket Kapre, Stephanie Chan, and Andr´ e DeHon. Pipelining saturated accumulation. In Proceedings of the IEEE 2005 Inter- national Conference on Field-Programmable Technology (ICFPT’05), pages 19–26, Singapore, December 2005. 121 [66] John. K. Reid. On the method of conjugate gradients for the solution of large sparse systems of linear equations. In John. K. Reid, editor, Large Sparse Sets of Linear Equations, pages 231–254. Academic Press, New York, NY , USA, 1971. [67] Yousef Saad. Iterative Methods for Sparse Linear Systems. PWS Publishing Company, Boston, MA, USA, 1985. [68] Yousef Saad. SPARSKIT: a basic tool kit for sparse matrix com- putations. http://www-users.cs.umn.edu/ » saad/software/ SPARSKIT/sparskit.html, June 1994. [69] Ronald Scrofano, Maya Gokhale, Frans Trouw, and Viktor Prasanna. A hard- ware/software approach to molecular dynamics on reconfigurable computers. In Proceedings of the 14th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’06), pages 23–32, Napa, CA, USA, April 2006. [70] Ronald Scrofano and Viktor K. Prasanna. Computing Lennard-Jones poten- tials and forces with reconfigurable hardware. In Proceedings of the Inter- national Conference on Engineering Reconfigurable Systems and Algorithms (ERSA’04), pages 284–290, Las Vegas, NV , USA, June 2004. [71] Jonathan R. Shewchuk. An introduction to the conjugate gradient method with- out the agonizing pain, edition 1 1 4 . Unpublished draft, http://www.cs. cmu.edu/ » jrs/. [72] Silicon Graphics Inc. SGI RASC Technology. http://www.sgi.com/ products/rasc. [73] Henk J. Sips and Haixiang Lin. An improved vector-reduction method. 
IEEE Transactions on Computers, 40(2):214–217, February 1991. [74] Melissa C. Smith, Jeffrey S. Vetter, and Sadaf R. Alam. Investiga- tion of benchmark suites for high-performance reconfigurable computing platforms. In Proceedings of the 9th MAPLD International Conference (MAPLD’06), page http://klabs.org/mapld06/abstracts/219 smith paper.pdf, Washington, DC, USA, September 2006. [75] Solomon. Ecclesiastes 1:9. In The Holy Bible. Authorized King James Version, 1611. [76] SRC Computers Inc. Carte programming environment. http://www. srccomp.com/SoftwareElements.htm. [77] SRC Computers, Inc. General purpose reconfigurable computing systems. http://www.srccomp.com. 122 [78] SRC Computers, Inc. MAP processor specifications. http://www. srccomp.com/HardwareSpecs.htm. [79] Takashige Sugie, Tomoyoshi Itoa, and Toshikazu Ebisuzakin. A special- purpose computer for exploring similar biological sequences: Bioler-2 with multi-pipeline and multi-sequence architecture. Journal of Computer Physics Communications, 162(1):37–50, September 2004. [80] Synplicity, Inc. FPGA solutions. http://www.synplicity.com/ products/index.html. [81] Noboru Tanabe, Hirotaka Hakozaki, Masasige Nakatake, and Yasunori Dohi. A new memory module for memory intensive applications. In Proceedings of the International Conference on Parallel Computing in Electrical Engineering (PARELEC’04), pages 123–128, Dresden, Germany, September 2004. [82] Keith Underwood. FPGAs vs. CPUs: Trends in peak floating-point perfor- mance. In Proceedings of the 2004 ACM/SIGDA Twelfth International Sympo- sium on Field Programmable Gate Arrays (FPGA’04), Monterey, CA, USA, February 2004. [83] Henk A. van der V orst. Krylov subspace iteration. Computing in Science & Engineering, 2(1):32–37, January 2000. [84] Xiaojun Wang, Sherman Braganza, and Miriam Leeser. Advanced compo- nents in the variable precision floating-point library. In Proceedings of the 14th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’06), pages 249–258, Napa, CA, USA, April 2006. [85] Xiaojun Wang, Miriam Leeser, and Haiqian Yu. A parameterized floating- point library applied to multispectral image clustering. In Proceedings of the 7th MAPLD International Conference (MAPLD’04), page http:// klabs.org/mapld04/abstracts/wang x a.pdf, Washington, DC, USA, September 2004. [86] R. Clint Whaley. Email archive: math-atlas-devel. http:// sourceforge.net/mailarchive/forum.php?forum id=426. [87] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automati- cally tuned linear algebra software (ATLAS). http://math-atlas. sourceforge.net. [88] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1– 2):3–35, 2001. 123 [89] Xilinx, Inc. Design tools center. http://www.xilinx.com/ products/design resources/design tool. [90] Xilinx, Inc. The programmable logic company. http://www.xilinx. com. [91] Xilinx, Inc. Xilinx intellectual property. http://www.xilinx.com/ ipcenter/. [92] Xilinx, Inc. Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete data sheet. http://direct.xilinx.com/bvdocs/publications/ ds083.pdf, October 2005. [93] Xilinx, Inc. New technology and a new way of working. In His- tory of Xilinx. http://www.xilinx.com/company/xilinxstory/ history.htm, 2006. [94] Xilinx, Inc. Virtex-4 family overview. http://direct.xilinx.com/ bvdocs/publications/ds112.pdf, February 2006. [95] Xilinx, Inc. Virtex-5 LX platform overview. http://direct.xilinx. 
com/bvdocs/publications/ds100.pdf, May 2006. [96] Lixin Zhang, Zhen Fang, Mike Parker, Binu K. Mathew, Lambert Schaelicke, John B. Carter, Wilson C. Hsieh, and Sally A. McKee. The impulse memory controller. IEEE Transactions on Computers, 50(11):1117–1132, November 2001. [97] Zhihong Zhao and Miriam Leeser. Precision modeling of floating-point ap- plications in variable bitwidth computing. In Proceedings of the Interna- tional Conference on Engineering Reconfigurable Systems and Algorithms (ERSA’03), pages 208–214, Las Vegas, NV , USA, June 2003. [98] Ling Zhuo and Viktor K. Prasanna. Design tradeoffs for BLAS operations on reconfigurable hardware. In Proceedings of the 34th International Conference on Parallel Processing (ICPP’05), pages 78–86, Oslo, Norway, June 2005. [99] Ling Zhuo and Viktor K. Prasanna. High-performance and area-efficient reduc- tion circuits on FPGAs. In Proceedings of the 17th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD’05), pages 52–59, Rio de Janeiro, Brazil, October 2005. [100] Ling Zhuo and Viktor K. Prasanna. High performance linear algebra operations on reconfigurable systems. In Proceedings of the ACM/IEEE SuperComputing 2005 Conference (SC’05), pages 2–13, Seattle, WA, USA, November 2005. 124 [101] Ling Zhuo and Viktor K. Prasanna. Sparse matrix-vector multiplication on FP- GAs. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (FPGA’05), pages 63–74, Monterey, CA, USA, February 2005. 125
Abstract
The large capacity of field programmable gate arrays (FPGAs) has prompted researchers to map computational kernels onto FPGAs. In some instances, these kernels achieve significant speedups over their software-only counterparts running on general-purpose processors. The success of these efforts has spurred supercomputer companies to develop reconfigurable computers (RCs) that allow the FPGAs to become, in effect, application-specific coprocessors. In concert with the RCs are high-level language-to-hardware description language (HLL-to-HDL) compilers that facilitate development of FPGA-based kernels using HLL-based programming rather than HDL-based hardware design. In theory, these technologies allow end-users to create high-performance custom computing architectures. In practice, acceleration of floating-point scientific kernels is still problematic. Sequential vector reductions such as accumulation are difficult because the pipelined floating-point units introduce loop-carried dependences that prevent the hardware from being fully pipelined. This has a profound impact on fundamental scientific kernels such as matrix vector multiply. Without pipelining, the potential performance advantage of FPGA-based kernels is eliminated.