ACCELERATING SCIENTIFIC COMPUTING APPLICATIONS WITH RECONFIGURABLE HARDWARE

by Ronald Scrofano, Jr.

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

December 2006

Copyright 2006 Ronald Scrofano, Jr.

Dedication

To my parents and my teachers

Acknowledgments

First, I must thank my advisor, Prof. Viktor K. Prasanna. His belief in me was instrumental in bringing me to USC. His guidance and support have been instrumental in my success. This work would not have been possible without him. I would also like to express my gratitude to my committee members, Prof. Aiichiro Nakano and Prof. Alice Parker. They have provided many useful insights during this process. Additionally, Prof. Nakano was the first to teach me molecular dynamics. Prof. Mary Hall and Prof. Jeff Draper provided helpful feedback at the time of my qualifying exam. I was also privileged to receive guiding advice from Maya Gokhale and Frans Trouw. I would also like to thank my undergraduate advisor, Prof. Myungsook Klassen, for the part she has played in my success.

I have had the benefit of working with a wonderful research group. I would especially like to acknowledge Ling Zhuo, Seonil Choi, Zachary Baker, Animesh Pathak, Gerald Morris, Cong Zhang, Sumit Mohanty, Jingzhao Ou, and Aditya Kwatra. I also thank Aimee Barnard, Henryk Chrostek, and Rosine Sarafian for their administrative assistance.

My parents, Ronald Scrofano and Rachel Scrofano, deserve special recognition for all that they have done for me, not only during this process, but throughout my life. I could not have reached this point without their love and support. I have also been blessed with two amazing sisters, Nancy Scrofano and Diane Scrofano. My extended family has encouraged me as well. My friends, Lance Grange, Mark Schwartzman, Jeremy Lau, Mary Beth Lutz, Alexandra Bulcke, Bilal Shaw, James Barnett, Will Lambert, Alex Robinson, and all the rest, have made life outside the office truly enjoyable.

Finally, I have to thank Los Alamos National Laboratory and the Defense Advanced Research Projects Agency for funding my research; the U.S. Army ERDC MSRC for access to their SRC MAPstation; and the team from SRC Computers, especially Steve Heistand, for their assistance.

Table of Contents

Dedication ii
Acknowledgments iii
List of Tables viii
List of Figures ix
Abstract xi
Chapter 1: Introduction 1
  Background 3
    Reconfigurable Hardware 3
    Reconfigurable Computers 5
    Challenges 6
  Contributions 8
  Organization 11
Chapter 2: Scientific Computing on Reconfigurable Hardware 12
  Linear Algebra and Transforms 13
    Linear Algebra 13
    Transforms 15
  Scientific Simulations 16
Chapter 3: Kernels for Scientific Computing on Reconfigurable Hardware 20
  Floating-Point Cores 21
    IEEE Standard 754 22
    Floating-Point Cores 24
      Testing 25
      Square Root Core 26
      Other Cores in the Library 30
      Transcendental Functions 33
    Utilizing the Parameters of the Floating-Point Cores 33
    Application Kernel 36
    Related Work 38
  Arithmetic Expression Evaluation 40
    Arithmetic Expression Evaluation 42
      Problem Definition 42
      Tree Representation 43
      Applications 43
      FPGA-based Arithmetic Expression Evaluation 46
    Proposed Algorithms and Architectures for the Arithmetic Expression Evaluation Problem 48
      Basic Case 49
      Advanced Case 56
      Properties of the Designs 59
      General Expressions 61
    Experimental Results 62
      Comparison with Compiler-Generated Designs 65
    Related Work 67
      Parallel Arithmetic Expression Evaluation 67
      Pipeline Synthesis 68
Chapter 4: Modeling Reconfigurable Computers 70
  Reconfigurable Computers 71
    Survey of Reconfigurable Computers 71
      Massively Parallel FPGA Arrays 74
    Element-Level Architectures 76
    System- and Node-Level Architectures 77
      Shared-Memory Architecture 78
      Cluster Architecture 78
      Abstract Model and System-Level Programming Model 79
      Node-Level Programming Model 81
  Performance Modeling 82
    Element-Level 84
      GPP Element 84
      RH Element 85
    Node-Level 88
  Complimentary Works 91
Chapter 5: Molecular Dynamics on Reconfigurable Computers 93
  Introduction to Molecular Dynamics 93
    Force Calculation 94
    Benchmark Simulations 98
    Software Implementation 99
      A Note on Precision 100
  Single-Node, Shifted-Force Simulations 101
    Node-Level Design 102
    RH Element-Level Design Alternatives 103
    Choosing a Design for the Implementation 108
    Implementation and Experimental Results 110
      Performance 113
      Scalability 114
  Single-Node, Particle-Mesh-Ewald Simulations 115
    Partitioning 116
    RH-Element Design for the Real-Space Part 121
      erfc(x) and e^(-x^2) 121
    Reconfigurable Computer Implementation Details and Results 122
    Partition Space 124
  Multi-Node, Shifted-Force Simulations 126
    Parallelization 126
    Implementation 128
      Software 128
      Hardware 130
    Experimental Results 130
      Simulations on GPP-Only Nodes 130
      Simulations on Accelerated Nodes 132
  Related Work 134
Chapter 6: Conclusion 140
  Summary of Contributions 140
  Future Work 143
    Further Study of Molecular Dynamics 143
    Libraries for Reconfigurable Computers 145
    New Techniques 147
Reference List 148

List of Tables

3.1 Properties of the Fully Pipelined Square Root Core 31
3.2 Properties of the Fully Pipelined Adder Core 31
3.3 Properties of the Fully Pipelined Multiplier Core 31
3.4 Properties of the Fully Pipelined Divider Core 32
3.5 Throughput-to-area ratios of AEE architectures 65
3.6 Area-latency products of AEE architectures 65
5.1 Shifted-force software profile for two benchmark simulations 101
5.2 Performance results for single-node, shifted-force simulations 114
5.3 PME software profile for two benchmark simulations 117
5.4 Performance results for single-node, PME simulations 124
5.5 Estimated profile of the parallel palmitic acid simulation 131
5.6 Estimated profile of the parallel CheY protein simulation 131
5.7 Speed-Ups of Palmitic Acid Simulation 134
5.8 Speed-Ups of CheY Protein Simulation 135
5.9 Performance Comparison with Related Work 138

List of Figures

1.1 Generic FPGA and its components 4
3.1 Steps to perform floating-point square root 27
3.2 Frequency, area, and frequency/area ratio of square root core 35
3.3 Architecture for the Lennard-Jones force and potential calculation kernel 37
3.4 Effects of compliance on throughput of Lennard-Jones pipeline 38
3.5 Example expression tree from molecular dynamics force calculation 44
3.6 Example expression tree for Floyd-Warshall algorithm 45
3.7 Example of an expression tree for an expression with eight inputs 45
3.8 Compacted-tree architecture 47
3.9 Architecture for the basic case of AEE 50
3.10 Algorithm for the basic case of AEE 52
3.11 Architecture for the advanced case of AEE 56
3.12 Algorithm for the advanced case of AEE 57
3.13 Padding an expression tree 62
3.14 Change in frequency and area as number of types of operators increases 64
4.1 Conceptual drawing of the SRC MAP processor 72
4.2 Conceptual Drawing of a Cray XD1 Blade 75
4.3 Abstract architecture of a GPP element 76
4.4 Abstract architecture of an RH element 77
4.5 Example of task-level parallelism within a node 82
4.6 The design flow for mapping an application to a reconfigurable computer 83
4.7 Steps in implementing an application on a reconfigurable computer 84
5.1 Hardware/software partitioning of the MD simulation application 102
5.2 Algorithm for basic design 104
5.3 Architecture for basic design 104
5.4 Algorithm for write-back design 106
5.5 Data layout in on-board memory 111
5.6 Performance estimates for using 0 to 4 RH elements 115
5.7 Assignment of MD application tasks to RH element and GPP element 119
5.8 Speed-ups of palmitic acid simulation with various parameters 125
5.9 Two-dimensional example of spatial decomposition 127
5.10 Estimated scalability of the parallel MD program 132
5.11 Estimated scalability of the parallel MD program on accelerated nodes 134

Abstract

With recent technological advances, it has become possible to use reconfigurable hardware to accelerate scientific computing applications. There has been a resulting development of reconfigurable computers that have microprocessors, reconfigurable hardware, and high-performance interconnect. We address several aspects of accelerating scientific computing applications with reconfigurable hardware and reconfigurable computers.

Because there is no native support on reconfigurable hardware for the floating-point arithmetic needed by many scientific computing applications, we introduce a library of double-precision floating-point cores and analyze the effects on performance of the degree of pipelining and the implemented features of IEEE standard 754. Scientific computing applications may spend a large amount of time evaluating arithmetic expressions. Hence, we present area-efficient designs for arithmetic expression evaluation that hide the pipeline latencies of floating-point cores. These designs use at most two cores for each type of operator in the expression and have better area and throughput properties than designs generated by a state-of-the-art hardware compiler for FPGAs. Experiments show that for 64- and 1024-input expressions, area increases linearly with the number of types of operators.

Implementing a design on a reconfigurable computer can be difficult and is not guaranteed to give a speed-up.
We thus formulate hierarchical architectural and performance models for reconfigurable computers that facilitate performance prediction early in the design process. The performance model has errors of 5% to 13% in our work in accelerating molecular dynamics. A hierarchical programming model for developing and modeling implementations of scientific computing applications on reconfigurable computers is also provided.

To demonstrate acceleration of a complete scientific computing application, we study molecular dynamics on reconfigurable computers. We investigate single-node, shifted-force simulations; single-node, particle-mesh-Ewald simulations; and multi-node, shifted-force simulations. We attain 2x to 3x speed-ups over state-of-the-art microprocessors through a hardware/software approach in which the most intensive task executes on reconfigurable hardware and the rest of the tasks execute on the microprocessor. In the particle-mesh-Ewald simulation, we exploit parallelism between the microprocessor and the reconfigurable hardware. For the multi-node, shifted-force simulations, we show that a cluster of accelerated nodes has about the same performance as a cluster of twice as many unaccelerated nodes.

Chapter 1: Introduction

From the early days of Monte Carlo simulations on the MANIAC computer to the present day of modeling, fluid dynamics, billion-particle simulations, computational biology, and the like, the computer has become an invaluable tool for scientific discovery. Through the use of mathematical models and numerical algorithms, the computer becomes a virtual laboratory in which experiments to bring about new understanding can be performed [78, 92]. The field of scientific computing supports these experiments through the development of tools and techniques for the experiments [78]. Scientific computing kernels include matrix operations, transforms, and numerical integration. Kernels are brought together to form scientific computing applications, such as N-body simulations, molecular dynamics, computational fluid dynamics, and climate modeling.

As computational scientists try to solve tomorrow's problems [91], they have a great demand for computing power; scientific computing applications are very computationally intensive. Additionally, they commonly require floating-point arithmetic. Indeed, one of the measures of a supercomputer's performance is the number of floating-point operations per second (FLOPS) that it performs [112]. Because of their computational intensity, finding ways to increase the performance of scientific computing applications is of great interest. For instance, the Earth Simulator, a parallel vector supercomputer, was recently developed to provide enormous processing power to researchers in the earth sciences [100]. Blue Gene/L is a newly developed massively parallel supercomputer whose initial target applications are in life sciences, though it can be utilized in other fields as well [35].

There has also been work in using custom application-specific integrated circuits (ASICs) to accelerate scientific computing applications. Because ASICs are developed specifically for a task, they can attain higher performance at that task than a general purpose processor (GPP), which has a lot of overhead because of its generality. One example of an ASIC being used to accelerate a scientific computing application is the GRAPE (GRAvity-PipE) project [68].
The GRAPE hardware is used to accelerate the computation of gravitational interactions between particles in astrophysical N-body simulations. The idea is to have a host computer perform all communication and all but the most intensive calculations. The most intensive calculations are performed on the special-purpose hardware. GRAPE is an ASIC chip that is a special-purpose computer for these calculations. Another example is the MD-GRAPE project [110]. Like GRAPE, it provides ASICs to accelerate a compute-intensive portion of a scientific computing application. In this case, that application is molecular dynamics.

While developments in hardware for scientific computing have progressed along the path of massively parallel processors and customized hardware, researchers in other areas have explored the benefits of another computing technology: reconfigurable hardware [37]. While reconfigurable hardware was first used as glue logic and for prototyping ASICs, increases in density soon made it capable of supporting complete applications. The use of reconfigurable hardware, mostly in the form of field-programmable gate arrays (FPGAs), has provided fantastic performance increases in areas such as signal processing, cryptography, and network security [20, 54, 9]. Until recently, though, using reconfigurable hardware to accelerate scientific computing applications was not feasible. Among other shortcomings, FPGAs lacked sufficient logic resources to implement the necessary floating-point arithmetic. However, recent advances in FPGA hardware, including increased density and the inclusion of embedded multipliers, have made floating-point arithmetic on FPGAs possible. It has, therefore, become of great interest to study the possibility of utilizing reconfigurable hardware for the acceleration of scientific computing applications.

Such acceleration of scientific computing applications with reconfigurable hardware is the focus of this thesis. In the next section, we provide introductory material about reconfigurable hardware, reconfigurable computers, and the challenges in using them for scientific computing. We then present a summary of our contributions.

1.1 Background

1.1.1 Reconfigurable Hardware

Reconfigurable hardware is used to bridge the gap between flexible but performance-constrained GPPs and very high-performance but inflexible ASICs. Reconfigurable hardware is programmed and reprogrammed after fabrication. It thus combines the flexibility of software with the high performance of custom hardware [113]. Because reconfigurable hardware is programmed specifically for the problem to be solved rather than to be able to solve all problems, it can achieve higher performance and greater efficiency than GPPs, especially in applications with a regular structure and/or a great deal of parallelism. Because reconfigurable hardware is programmed and reprogrammed after fabrication, it is less expensive to develop designs for and much more flexible to use than ASICs.

Figure 1.1: Generic FPGA and its components

The state-of-the-art reconfigurable-hardware devices are FPGAs [37]. The FPGAs in the reconfigurable computers considered here are SRAM-based. That is, they are configured by the writing of a configuration bitstream to the FPGA's SRAM-based configuration memory. FPGAs are implemented as a matrix of configurable logic blocks and programmable interconnect between and within the blocks.
Figure 1.1 shows a generic FPGA architecture and its components. The large shaded boxes represent the configurable logic blocks. The small shaded boxes represent switch boxes that are used to program the routing. There are several routing architectures and we do not depict any particular one in the figure. Rather, we simply emphasize that each of the configurable logic blocks connects to the interconnection network and this network is configured such that the desired functionality is achieved on the FPGA.

The configurable logic blocks in FPGAs are made up of configurable combinational logic, flip-flops (FFs), multiplexers (MUXs), and high-speed intra-block connections (e.g., fast carry logic). The configurable combinational logic is implemented as look-up tables (LUTs). An l-input LUT can be configured to perform any logic function of l inputs. The LUT acts like a memory in which the inputs are address lines and the logic function that is implemented is determined by what is stored in the memory. The bits in a LUT memory are written as part of configuration. As an example, a 3-input LUT is shown on the right of Figure 1.1 (A, B, and C are the inputs). The logic function implemented is a 3-input OR: if A, B, and C are each 0, the LUT outputs a 0, and it outputs a 1 otherwise. In practice, the most common type of LUT in FPGAs is the 4-input LUT.

The density of FPGAs has been steadily increasing, allowing more and more complex designs to be implemented in a single chip. In addition to increased density, FPGA vendors are increasing the number of embedded features within the FPGAs. These are dedicated (non-programmable) hardware features embedded into the FPGA fabric that increase the capability of the FPGAs by performing common functions without using programmable logic resources. Because they are dedicated, embedded hardware features take less space and can achieve higher clock rates than comparable features made in programmable logic. One embedded feature common in modern FPGAs is embedded RAM to provide local storage on the FPGAs [2, 125]. Large FPGAs have on the order of 1 MB of embedded memory. Embedded arithmetic units, such as multipliers and multiply-accumulate units, are also common [2, 125]. Even processors have been embedded into FPGAs. For example, the Xilinx Virtex-II Pro FPGAs have up to two PowerPC 405 cores embedded in the FPGA fabric.

1.1.2 Reconfigurable Computers

Reconfigurable computers bring together GPPs and reconfigurable hardware to exploit the benefits of each. With their high frequencies and von Neumann architecture, GPPs are well-suited for control-intensive tasks and tasks that are not too computationally intensive. Reconfigurable hardware, on the other hand, with its pipelining and parallelism, is well-suited for data-intensive tasks that require a lot of processing. Reconfigurable computers consist of one or more GPPs, reconfigurable hardware, memory, and high-performance interconnect. Tasks can be executed on the GPP(s), the reconfigurable hardware, or both. The presence of high-performance, low-latency interconnect distinguishes reconfigurable computers from other systems in which FPGAs are attached to a host processor over a standard bus, such as PCI or VME. For this reason, what we term reconfigurable computers here are sometimes referred to as reconfigurable supercomputers and high-performance reconfigurable computing platforms [37, 104]. In Chapter 4, we will survey currently available reconfigurable computers and use the information to develop relevant models.
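Before turning to the challenges, a minimal C sketch may make the LUT-as-memory behavior of Section 1.1.1 concrete. This sketch is not part of the thesis's tooling; the function and constant names (lut3, LUT3_OR_CONFIG) are ours, chosen only to mirror the 3-input OR example of Figure 1.1, where the inputs act as address lines into a small configured memory.

```c
#include <stdint.h>
#include <stdio.h>

/* A 3-input LUT is an 8-entry, 1-bit-wide memory.  The configuration
 * bitstream fills this memory; the inputs A, B, C form the address.
 * For a 3-input OR, every entry except address 0 holds a 1, so the
 * mask is 0xFE when bit i of the mask is the output for address i. */
static const uint8_t LUT3_OR_CONFIG = 0xFE;   /* hypothetical constant name */

/* Evaluate a 3-input LUT: a, b, c are each 0 or 1. */
static int lut3(uint8_t config, int a, int b, int c)
{
    int address = (a << 2) | (b << 1) | c;    /* inputs act as address lines */
    return (config >> address) & 1;           /* the stored bit is the output */
}

int main(void)
{
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int c = 0; c <= 1; c++)
                printf("%d %d %d -> %d\n", a, b, c,
                       lut3(LUT3_OR_CONFIG, a, b, c));
    return 0;
}
```

Changing the configuration byte (for example, to 0x80) would turn the same structure into a 3-input AND without touching the evaluation logic, which is exactly the sense in which the fabric is reconfigurable.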
1.1.3 Challenges

Reconfigurable hardware has benefited many applications in various fields. Now, it is possible to use it in scientific computing. Doing so is not without its challenges, though. Here, we list some of the major challenges. Our focus is on FPGAs because they are the dominant reconfigurable-hardware devices.

Floating-Point Arithmetic: There is no native support for floating-point arithmetic on FPGAs. That is, no currently available FPGAs provide embedded floating-point cores, though there has been some research in this direction [10]. Consequently, floating-point cores must be implemented using the programmable logic resources on the FPGA. Floating-point cores tend to require large amounts of area. Additionally, in order to achieve high clock frequencies, they must be deeply pipelined. Nonetheless, it has been shown that the floating-point performance of FPGAs is competitive with that of GPPs and is increasing at a faster rate than that of GPPs [116].

Low Clock Frequency: There is a penalty for the hardware flexibility provided by FPGAs: reduced clock frequency. Especially because of the programmable interconnect, state-of-the-art FPGAs operate at clock frequencies about one-tenth those of state-of-the-art GPPs. Customization of the architecture to the problem is used to get around this large frequency discrepancy. As will be discussed more in Section 4.2.1, the main way FPGAs obtain high performance is through the exploitation of pipelining and parallelism. The use of pipelining and parallelism has limits, though.

Development of Designs: Designs for FPGAs have traditionally been written in hardware description languages such as VHDL and Verilog. Essentially, developing designs for FPGAs has been treated as a hardware-design problem, which makes it unfamiliar to most in the scientific computing community. At this lower level, developing large designs can be even more difficult than it is with high-level programming languages. As a result, there have been numerous efforts aimed at facilitating higher-level programming [26, 108, 16]. These efforts focus on the development of compilers that take as input code written in a high-level language, such as C or a customized variant of C, and generate hardware designs. These efforts are becoming especially important with the development of reconfigurable computers.

Acceleration of Complete Applications: Much of the early work on the acceleration of scientific computing applications with reconfigurable hardware has focused on individual tasks, such as matrix operations and fast Fourier transforms. Impressive speed-ups in individual kernels demonstrate potential, but they will not necessarily translate into impressive speed-ups when the kernels are integrated back into complete applications [38]. Knowing early in the development process that acceleration of a kernel will give a speed-up for an entire application is especially important given the difficulty of developing and implementing designs on reconfigurable hardware and reconfigurable computers.

1.2 Contributions

The main contributions of this thesis, summarized below, are in addressing the challenges of using reconfigurable hardware to accelerate scientific computing applications. We realize that the use of reconfigurable hardware is not immediately going to supplant the decades of research into the use of large-scale computers for scientific computing applications. Indeed, compared to the use of GPPs, the use of reconfigurable hardware is still in its infancy.
Nonetheless, we expose and address several issues in the use of reconfigurable hardware to accelerate scientific computing applications, and we show that the use of reconfigurable hardware is a viable acceleration alternative whose promise is only beginning to be realized by the community.

Kernels for Scientific Computing: We study two basic kernels for scientific computing. First, we study the development of a library of double-precision floating-point cores for FPGAs. Our main focus is the floating-point square root core. The cores in the library are the first floating-point cores for FPGAs to support all types of numbers representable, all types of exceptions that can be generated, all rounding modes, and the NaN diagnostics that are described in IEEE standard 754, and to do so in double precision (64-bit). The cores in the library support three levels of compliance with IEEE standard 754, and the square root core in particular supports more combinations of features. The degree of pipelining is also a parameter of the square root core. We investigate the effect on performance of supporting various features of IEEE standard 754, both in the context of the core itself and in the context of a task in molecular dynamics simulation.

The second kernel that we study is arithmetic expression evaluation. Floating-point cores for FPGAs, such as those that are part of the library, must be deeply pipelined to achieve high clock frequencies. This deep pipelining makes it difficult to reuse the same floating-point core for a series of computations that are dependent upon one another, as would be desirable for arithmetic expression evaluation. However, floating-point cores use a great deal of area, so it is important to use as few floating-point cores in an architecture as possible. Hence, we have developed area-efficient architectures and algorithms for arithmetic expression evaluation that effectively hide the pipeline latency of the floating-point cores and use at most two floating-point cores for each type of operator in the expression. Experimental results show that the areas of our designs increase linearly with the number of types of operations in the expression and that our designs occupy less area and achieve higher throughput than designs generated by a commercial hardware compiler.

Modeling Reconfigurable Computers: We survey currently available reconfigurable computers and develop abstract and programming models for them. We then use these models to develop a high-level performance model for reconfigurable computers. These models help address the challenges in both development of designs and acceleration of complete applications by providing a vendor-independent model of reconfigurable computers that can be used to study the properties of potential designs at the system, node, and compute-element levels. Modeling the performance of possible implementations, both software and hardware, of an application is a crucial step in developing successful reconfigurable computer implementations. Our models specifically consider both the GPPs and the reconfigurable hardware present in reconfigurable computers. At the same time, the modeling is performed at a high level so that it can be performed early in the design process.

Molecular Dynamics: We investigate the acceleration of a complete scientific computing application: molecular dynamics. Molecular dynamics (MD), which is introduced thoroughly in Section 5.1, is a technique for simulating the movements of atoms in a system over time.
We present one of the first reconfigurable computer implementations of this application. The application is partitioned such that the most computationally intensive task executes on the reconfigurable hardware and the rest of the tasks execute on the processor. For the task executing on the reconfigurable hardware, we model the performance of several alternative designs using our performance model. After choosing a design, we implement the whole application on an SRC 6 MAPstation reconfigurable computer. We study three types of MD simulations: single-node, shifted-force simulations; single-node, PME simulations; and multi-node, shifted-force simulations. Shifted-force and PME refer to the type of electrostatic calculations performed in the simulations.

In the shifted-force simulations, we are able to achieve approximately 2x speed-ups over the same simulations running on a state-of-the-art GPP. In the PME simulations, we achieve approximately 3x speed-ups. In these simulations, we exploit parallelism by executing the reciprocal-space part of the PME calculations on the GPP while the real-space part of the calculations executes in parallel on the reconfigurable hardware in the reconfigurable computer. Finally, in the multi-node simulations, we examine the effect that acceleration has on the scalability of the simulations. We find that for a spatial decomposition, the accelerated simulations do not scale as well as the unaccelerated simulations, but that this is because the accelerated simulations have already accelerated the task in the simulation that is the greatest beneficiary of the parallelism. We also find the interesting result that for one of our benchmark simulations, P accelerated nodes match the performance of 2P unaccelerated nodes, and in some cases exceed it.

1.3 Organization

To present our contributions, we have organized the remainder of the thesis as follows. In Chapter 2, we describe many of the other efforts in accelerating scientific computing with reconfigurable hardware. In Chapter 3, we detail our contributions in the area of kernels for scientific computing. In Chapter 4, we develop our models for reconfigurable computers. In Chapter 5, we present our work in the acceleration of MD simulations through the use of reconfigurable computers. Finally, in Chapter 6, we summarize the work and present areas for future research.

Chapter 2: Scientific Computing on Reconfigurable Hardware

Because of their computational intensity, scientific computing kernels and applications are attractive targets for acceleration. Now that reconfigurable hardware has sufficient area resources to perform floating-point arithmetic, and with the advent of reconfigurable computers, reconfigurable hardware, especially in the form of FPGAs, has become an attractive platform to perform this acceleration. In fact, a high-level analysis comparing trends in peak floating-point performance of FPGAs to those of GPPs concludes that the peak FPGA floating-point performance is growing at a significantly faster rate than that of GPPs [116]. Reconfigurable hardware is also attractive because its reconfigurability allows the same hardware to be used in multiple kernels and applications.

Thus, there have been several prior research efforts investigating the use of reconfigurable hardware to accelerate scientific computing applications. Below, we summarize several of these efforts. We break them into two categories that encompass most of this work: (1) linear algebra and transforms and (2) scientific simulations.
Note that we leave work that is directly related to ours to be discussed in individual chapters. For instance, we leave the discussion of MD simulations to Chapter 5, which describes our work in accelerating MD simulations.

2.1 Linear Algebra and Transforms

Linear algebra kernels and mathematical transforms are used widely in scientific computing. In fact, a supercomputer's performance is often measured by its performance in executing the LINPACK benchmark, which is composed of linear algebra routines [86, 112]. Additionally, the fast Fourier transform (FFT), as well as LINPACK and several others, appears as a kernel in the HPCChallenge benchmark suite, showing its importance as a kernel in scientific computing applications [27].

2.1.1 Linear Algebra

A wide variety of linear algebra kernels, both dense and sparse, have been studied. Our focus here is on those that use floating-point arithmetic because floating-point arithmetic is often necessary in scientific computing applications. One such study is [126]. In [126], the floating-point adders and multipliers described in [41] are used to study high-performance matrix multiplication. Two algorithms using a systolic, linear-array architecture are presented, and design tradeoffs among latency, storage size, and number of processing elements in the systolic array are analyzed. Targeted to a Xilinx Virtex-II Pro XC2VP125 (a device that, though supported by Xilinx development tools at the time of the publication of [126], to our knowledge never went into production), with double-precision arithmetic, the higher-performing of the two algorithms is able to achieve a performance of 8.3 GFLOPS, almost 3 GFLOPS higher than a state-of-the-art GPP.

[117] studies the performance of FPGA-based implementations for three linear algebra kernels from the BLAS library: vector dot product, matrix-vector multiplication, and matrix multiplication. This work concludes that FPGAs can significantly outperform GPPs in executing these kernels. Further, it predicts that the performance gap between FPGAs and GPPs will widen in the future. To do these projections, the authors use the floating-point adder and multiplier cores for both single-precision and double-precision arithmetic that are presented in [116]. These cores implement most features of IEEE standard 754, but leave out the exception generation and all rounding modes but round to nearest.

Sparse matrix-vector multiplication is studied in [23] and [127]. In [23], an architecture is developed in which the dot products required for sparse matrix-vector multiplication are mapped to processing elements. These processing elements are connected in a bidirectional ring topology so that processing elements can communicate intermediate results. A mapping phase maps computations to processing elements with the goal of minimizing the amount and distance of communication between processing elements. Using this architecture, the authors report performances of 1.5 GFLOPS/FPGA for a single Xilinx Virtex-II XC2V6000-4 FPGA and 750 MFLOPS/FPGA in a 16-FPGA system. The single FPGA does not outperform state-of-the-art GPPs by much, but the multi-FPGA system outperforms the same number of GPPs by almost 500 MFLOPS/device and achieves a greater fraction of its peak floating-point performance.

[127] presents a tree-based architecture for sparse matrix-vector multiplication.
In this architecture, the necessary dot products are computed by a tree in which the leaves are floating-point multipliers and the inner nodes are floating-point adders. The output of the tree is fed to a reduction circuit that accumulates intermediate values. The paper presents several techniques to reduce overhead and effectively handle large matrices. When implemented on a Xilinx Virtex-II Pro XC2VP70 FPGA, speed-ups over a 900 MHz Itanium 2 that range from barely more than 1x to more than 30x are achieved. The speed-ups are affected by the sparsity structure of the matrix.

Sparse matrix-vector multiplication is also the most computationally intensive task in the conjugate gradient method for solving sparse linear systems of equations. In [73], a specialized design for double-precision floating-point, FPGA-based sparse matrix-vector multiplication is presented. Beyond that, though, in [73], the complete conjugate gradient method is implemented on a reconfigurable computer. The GPPs in the reconfigurable computer do all the steps but the sparse matrix-vector multiplication, which is carried out on the reconfigurable hardware. Using an SRC 6 MAPstation, a speed-up of about 1.3x is achieved. The authors then estimate that with a next-generation SRC 7 MAPstation, a speed-up of 3.9x is attainable.

Beyond the kernels mentioned above, other linear algebra kernels have been studied, including matrix factorization [119, 120, 128] and iterative solvers [74, 75].

2.1.2 Transforms

[46] presents the most in-depth analysis of the performance of double-precision floating-point FFTs on FPGAs. For its experiments, this work presents three architectures for the FFT: a parallel architecture in which there is a single stage of parallel butterfly units that operate independently, a pipelined architecture in which there is one butterfly unit for each stage of the FFT, and a parallel-pipelined architecture in which several stages of the FFT are pipelined and each of these stages has several butterfly units working in parallel. For the Xilinx Virtex-II Pro XC2VP100-6 FPGA, which was the largest and fastest FPGA available at the time of the publication of [46], the authors report that for small FFTs, the parallel architecture has the highest performance and that for FFTs with greater than 1024 points, the parallel-pipelined architecture has the highest performance. The authors also estimate performance for the case of (currently nonexistent) FPGAs two and four times as large as the XC2VP100-6. They find that unless the FFT is very, very large, the pipelined architecture has lower performance than the other two. The parallel-pipelined architecture performs better, especially for FFTs of over 1024 points. Finally, the authors compare the FFT performance of current FPGAs with that of current GPPs and compare estimates of the FFT performance of future FPGAs with estimates of the FFT performance of future GPPs. While currently an FFT must be more than about 8192 points for the FPGA to outperform the GPP, the authors estimate that in two years that crossover size will drop to 1024 points, and that within four years, FPGAs will dramatically outperform GPPs for large FFTs.

[105] describes two floating-point FFT alternatives for use with the SRC 6 MAPstation. The need for these FFTs arises as part of a climate modeling application. The performance described in [105] is not as good as that described in [46].
One of the designs turns out to be unusable because it requires more logic resources than are available on the Xilinx Virtex-II XC2V6000 FPGA in the MAPstation. In the other, if only one double-precision butterfly unit is implemented, the FPGA implementation takes longer to run than the GPP version of the same FFT. The use of four butterfly units, which requires the use of two FPGAs, is necessary in order to achieve a speed-up of 2.4x. One factor limiting the performance of the FFT in this case is the MAPstation's requirement that the design run at exactly 100 MHz.

2.2 Scientific Simulations

There have been several efforts to accelerate scientific simulations through the use of reconfigurable hardware. An early example is PROGRAPE-1 (PROgrammable GRAPE-1) [45]. PROGRAPE-1 is an FPGA-based, programmable, multipurpose computer for many-body simulations. It is especially targeted for scientific techniques that have many different algorithmic implementations and for which new algorithms are still being developed. The example cited in [45] is the SPH force calculation algorithm. PROGRAPE-1 is implemented with two Altera EPF10K100 FPGAs, each of which contains about 100,000 gates. With this now somewhat outdated technology, the developers were able to achieve a throughput equivalent to 0.96 GFLOPS for a gravitational interaction calculation. Note that they did not implement IEEE-standard floating-point units due to size limitations of the FPGAs.

[106] uses reconfigurable hardware to accelerate three kernels in computational fluid dynamics: the Euler, Viscous, and Smoothing algorithms. The system on which the application runs has a GPP connected over the PCI bus to several Xilinx Virtex-II XC2V6000 FPGAs and SRAM. The calculations are implemented on the FPGAs in a pipelined fashion in which each floating-point operation in the accelerated kernels gets its own floating-point core on the FPGAs. Using single-precision floating-point numbers, the authors achieve 10.2, 23.2, and 7.0 GFLOPS, respectively, for the kernels listed above.

In [36], a Monte Carlo radiative heat application is explored. It is found that the inner loop of the application accounts for the most computation time, so a hardware design for this loop is developed. The loop is unrolled into a pipeline in which each stage consists of one or more floating-point cores operating in parallel. This pipeline is then replicated to take advantage of coarse-grained parallelism. The target FPGAs have enough resources to allow the pipeline to be replicated three times. Data inputs to the pipeline are stored in on-chip memory. The best results were obtained when targeting the design to a Xilinx Virtex-II Pro XC2VP100-6 FPGA. A speed-up of 10.4x is achieved over a 3 GHz Pentium 4 Xeon. The performance results for the FPGA implementation, however, ignore the cost of writing data to the FPGA. Additionally, the speed-up compares inner-loop kernels, not the complete application with and without acceleration.

[38] presents a survey of available reconfigurable computers and then assesses how they would be used in several floating-point-intensive scientific computing applications. Based on the profiles of these applications, which are from the fields of climate modeling and computational chemistry and physics, it concludes that an approach to using reconfigurable computers in scientific computing that is based solely on the use of libraries for common computational tasks will not be fruitful for large-scale scientific computing applications.
Instead, it is crucial to carefully profile and partition the application, and possibly to re-write existing application code, in order to take advantage of reconfigurable computers.

[114] and [115] describe the implementation of a road traffic simulation application on the Cray XD1 reconfigurable computer. This application is different from the others in that no floating-point arithmetic is required. However, the implementation described in these two works is one of the first implementations of a complete supercomputing application, both hardware and software, on a reconfigurable computer, rather than an implementation of only kernels. Several practical considerations are taken into account, such as the asymmetry between the cost of reading and writing data to/from the reconfigurable hardware from/to the GPP. While no performance model is proposed, some of the issues described in Chapter 4 are also identified in [114], such as the need to consider communication time between the reconfigurable hardware and the GPP and that the performance of the whole application, not just a single accelerated kernel, should be considered when implementing an application on a reconfigurable computer. In this case, the kernel implemented in the reconfigurable hardware achieves a speed-up of 34.4x over the same kernel in software, but the overall application speed-up is limited to 5x. These works, then, also illustrate the need for careful profiling and partitioning between hardware and software.

Clearly, within the community, there is a great deal of interest in and excitement about the use of reconfigurable hardware to accelerate scientific computing applications.

Chapter 3: Kernels for Scientific Computing on Reconfigurable Hardware

In this chapter, we investigate two aspects of performing scientific computation on reconfigurable hardware. The first is floating-point arithmetic. Floating-point arithmetic is essential to many scientific applications. Because floating-point arithmetic, especially double-precision floating-point arithmetic, requires a great deal of hardware resources, it has only recently become possible to implement many floating-point cores on a single FPGA. In the following section, we introduce a library of floating-point cores, focusing mainly on the floating-point square root core, and describe our methodology for testing the correctness of the outputs of the floating-point cores [40]. This library of floating-point cores has the unique feature of varying degrees of compliance with IEEE standard 754 [48]. We also assess the effect of compliance with IEEE standard 754 on the performance of a floating-point application kernel.

The second aspect that we investigate in this chapter is that of evaluating arithmetic expressions using floating-point cores on reconfigurable hardware. Floating-point cores on reconfigurable hardware have two main drawbacks. The first is that in order to achieve a high clock rate, floating-point cores for FPGAs must be deeply pipelined. This deep pipelining makes it difficult to reuse the same floating-point core
Thus, in Section 3.2, we describe area-efcient architectures and algorithms for arithmetic expression evaluation that effectively hide the pipeline latency of the oating-point cores and use at most two oating-point cores for each type of operator in the expression [98]. Experimental results are given that show that the areas of our designs increase linearly with the number of types of operations in the expression and that our designs occupy less area and achieve higher throughput than designs generated by a commercial hardware compiler. 3.1 Floating-Point Cores Many scientic computing applications require oating-point arithmetic. There have been several efforts to develop oating-point cores for FPGAs, all of which have short- comings. For instance, none of the existing oating-point cores are compliant with IEEE standard 754 [48]. Consequently, in the recongurable computing community, there is no general understanding of the resource and performance overheads that arise from adhering to the standard. Also, in commercial recongurable computing plat- forms such as those from SRC and Cray, FPGA-based accelerators interface with fully IEEE-compliant GPPs. Users would nd it inconvenient to have to consider the IEEE compliance, or lack thereof, of the FPGA-based designs and the corresponding effect on the results. On the other hand, if the users know that they do not need full compli- ance, they should have the option of only implementing in hardware the features that they need. 21 In this section, we describe a library of double-precision oating-point cores for FPGAs developed to support all types of numbers representable, all types of excep- tions that can be generated, all rounding modes, and the NaN diagnostics that are described in IEEE standard 754, and to do so in double precision (64-bit). We fo- cus on the square root core, which we have developed, though we mention properties of the adder/subtracter, multiplier, and divider cores as well. These cores are param- eterized by degree of pipelining and by the features of IEEE standard 754 that are implemented. We also show how implementing various features of IEEE standard 754 affects the performance of a kernel in molecular dynamics simulations. Note that we do not assess the effects on accuracy of implementing only selected features of IEEE standard 754. Rather we assess the effects on throughput; we leave it up to users of the cores to determine what features they need to implement in order to obtain the accuracy desired. Below, we briey introduce IEEE standard 754. Then, in Section 3.1.2, we de- scribe our square root core and analyze its performance. In Section 3.1.3, we describe means of analysis to determine the best oating-point core conguration and degree of pipelining for a given situation. In Section 3.1.4, we describe the use of the library of oating-point cores in a scientic computing kernel and analyze the effects of varying levels IEEE compliance. In Section 3.1.5, we describe existing FPGA oating-point cores and contrast them with the cores that we describe here. 3.1.1 IEEE Standard 754 IEEE standard 754 is a standard for binary oating-point arithmetic [48]. It species four numerical formats: single precision, single extended precision, double precision, and double extended precision. The oating-point cores implement double-precision 22 arithmetic, so, from here out, we will only describe the parts of the standard that pertain to double-precision numbers. 
Double-precision numbers are 64-bit values with three fields: a sign bit s, a biased exponent e (11 bits), and a fraction f (52 bits). There are five types of numbers: NaNs, infinities, zeros, normal numbers, and denormal numbers. A number's type is determined by the values of its exponent and fraction. IEEE standard 754 defines four rounding modes: round to nearest (the default), round toward 0, round toward +infinity, and round toward -infinity. All of the operations are to behave as if they produce an infinite-precision result and then round that result using the specified mode.

IEEE standard 754 defines addition, subtraction, multiplication, division, remainder, and square root as operations, as well as several conversions between formats and comparisons. At this time, the library contains floating-point cores for addition/subtraction, multiplication, division, and square root. We have not developed cores for remainder or the conversion or comparison operations and thus do not describe those operations here. Arithmetic operations on infinities and NaNs are both defined in IEEE standard 754. Also, diagnostic information can be stored in the fractional part of NaNs. These NaN diagnostics should be propagated from a NaN that is input to an operation to the NaN that is the output.

Finally, IEEE standard 754 defines five types of exceptions: invalid operation, division by zero, overflow, underflow, and inexact. Note that IEEE standard 754 also defines trapping mechanisms, which we do not implement.

3.1.2 Floating-Point Cores

Here, we describe the implementation and analyze the performance of the square root core. This core has several parameters. One parameter is the degree of pipelining. The other parameters determine the level of compliance with IEEE standard 754: whether or not NaN diagnostics are supported, whether or not denormal numbers are supported, whether or not exceptions are generated, and which rounding modes are supported.

We present three configurations for analysis. The first configuration is the most compliant with IEEE standard 754. In it, NaN diagnostics, all four IEEE rounding modes, exception generation, and denormal numbers are supported. The common features configuration implements only the round to nearest rounding mode and support for exceptions but does not implement NaN diagnostics; denormal numbers are treated as zero. In the lowest overhead configuration, the only rounding mode implemented is round toward 0; denormal numbers are treated as zero, exceptions are not generated, and NaN diagnostics are not supported.

The square root core was coded in VHDL and synthesized with Synplicity Synplify Pro 7.2. The parameters determining the number of pipelining stages and the features supported are implemented as VHDL generics. The user sets the appropriate values before synthesis and only the necessary hardware is synthesized. In our experiments, the target device was the Xilinx Virtex-II Pro XC2VP7 FPGA. The place and route tool used was Xilinx PAR, part of Xilinx ISE 5.2. We analyze the performance and area requirements of the square root core when it is fully pipelined and, in Section 3.1.3, describe techniques for further analysis.

3.1.2.1 Testing

When implementing floating-point cores, especially ones that implement parts of IEEE standard 754, it is important to test the behavior of the cores for a variety of inputs. However, to the best of our knowledge, there is no existing, publicly available set of test vectors and expected outputs that tests compliance with parts of IEEE standard 754. Thus, we have developed one.
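As a minimal C sketch of the ideas that follow, the program below unpacks the double-precision bit layout described in Section 3.1.1 and emits a host FPU's square root results as raw bit patterns that a testbench could compare against a hardware core's outputs. It is our own illustration, not the test-vector generator used in this work; the helper name unpack_double and the sample inputs are hypothetical.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Split an IEEE 754 double into its sign (1 bit), biased exponent (11 bits),
 * and fraction (52 bits) fields. */
static void unpack_double(double x, unsigned *s, unsigned *e, uint64_t *f)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);          /* reinterpret the 64-bit pattern */
    *s = (unsigned)(bits >> 63);
    *e = (unsigned)((bits >> 52) & 0x7FF);
    *f = bits & 0xFFFFFFFFFFFFFULL;
}

int main(void)
{
    double inputs[] = { 2.0, 1e-310 /* denormal */, 0.0, -1.0 /* -> NaN */ };
    for (size_t i = 0; i < sizeof inputs / sizeof inputs[0]; i++) {
        double x = inputs[i];
        double r = sqrt(x);                  /* reference result from the host FPU */
        unsigned s, e;
        uint64_t f, xb, rb;
        unpack_double(r, &s, &e, &f);
        memcpy(&xb, &x, sizeof xb);
        memcpy(&rb, &r, sizeof rb);
        /* Operand and reference result as bit patterns for bit-for-bit checks. */
        printf("x=%016llx sqrt=%016llx (s=%u e=%u f=%013llx)\n",
               (unsigned long long)xb, (unsigned long long)rb,
               s, e, (unsigned long long)f);
    }
    return 0;
}
```

The actual test vectors in this work are produced by the paranoia-derived procedure described next, with a compliant Itanium FPU supplying the reference outputs.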
Our philosophy in developing the test vectors for the floating-point cores is that if, for the same inputs, our floating-point cores give the same output as a floating-point unit (FPU) that complies with IEEE standard 754 does, a system comprised of our cores would also be compliant with the standard. So, in order to develop our test vectors, we needed an IEEE-standard-754-compliant FPU and inputs that test its compliance. Intel claims that the FPU in the Itanium processor is IEEE-standard-754-compliant. In order to test this claim, we used the C version of the program paranoia [34]. This program tests the IEEE-standard-754 compliance of the FPU in a general purpose processor. With the Intel C++ Compiler 8.1 for Linux, we compiled paranoia.c on a Hewlett-Packard rx2600 server with dual 900 MHz Intel Itanium 2 processors and 8.3 GB of memory, running the Red Hat Linux Advanced Server 2.1 operating system. When run on this system, paranoia reports: "No failures, defects, nor flaws have been discovered. Rounding appears to conform to the proposed IEEE standard P754. The arithmetic diagnosed appears to be Excellent!" Thus, we used the Itanium's FPU as the compliant FPU against whose output we would check the output of our floating-point cores.

After testing the FPU, we edited paranoia to output the binary values for all the arguments to the arithmetic operations of interest to us. We wrote a C program that reads in these arguments and performs the desired computations. Our program outputs the arguments to the floating-point operators as stimulus for VHDL testbenches. It also outputs the VHDL code to check that the results from our floating-point cores match those of the Itanium's FPU. Additionally, our program outputs VHDL testbench code to test if the exceptions generated by our floating-point cores are the same as those generated by the Itanium's floating-point unit.

To test the floating-point cores, we first simulate them behaviorally with ModelSim using the generated testbenches [71]. Once they are correct, we then place and route them and perform post-place-and-route simulations using the testbenches to ensure that the cores still perform correctly.

3.1.2.2 Square Root Core

The steps for performing the floating-point square root are shown in Figure 3.1 and detailed below.

Figure 3.1: Steps to perform floating-point square root. s, e, and f are the input sign bit, exponent, and fraction, respectively.

Determine the type of input: The hardware examines the sign bit, exponent, and fraction of the input floating-point number to determine what type of number it is: positive normal number, positive denormal number, negative number, zero, infinity, or quiet or signaling NaN. Determining the input type requires one clock cycle.

Expand the 52-bit fraction to the full 53-bit significand: If the number is a positive normal, the hidden 1 is prefixed to f, making 1.f. If the number is denormal, a 0 is prefixed to f, making 0.f. This step is done in parallel with the input determination. The input is then sent to a priority encoder that determines the number of places, if any, that the number must be left shifted. The shifting itself is done by a logarithmic shifter that takes two cycles.

Adjust the exponent, if necessary: The exponent must be adjusted if the input is denormal and will be left-shifted.
The number of bits by which the input will be shifted must be subtracted from 1. This stage takes place after the priority encoder, in parallel with the rst stage of the logarithmic shifter. Halve the Exponent: The exponent of the result of the square root operation is half that of the original number. Halving the exponent takes one stage and can be done in parallel with the last stage of the logarithmic shifter. Square root the signicand: This step is the most compute- and resource-intensive of the oating-point square root algorithm. We rst perform any necessary shifting to ensure that the value of the signicand is in the range [1, 4). After doing so, this step is the same as taking a xed-point square root. Many methods exist for taking 27 a xed-point square root. For example, in the square root core described in [121], a table lookup method for square root is used. For double-precision oating-point square root, though, the table would require more than twice as much embedded memory as is available in our target FPGA, the Xilinx Virtex-II Pro XC2VP7. Thus, in order to use this method, we would need to use logic resources for storage. Instead, we have chosen a different method. In our square root core, the signicand is square rooted using the non-restoring algorithm of Li and Chu [63]. This algorithm calculates the output one bit at a time, starting with the most signicant bit. Each computation requires only an integer add or an integer subtract, depending on the bit generated in the prior stage. The bit width of the addition/subtraction grows by one bit for each new bit of the square root calcu- lated. The output of the algorithm is not only the square root of the input, but also the remainder. This algorithm has several features that make it attractive for our square root unit. It is easy to pipeline (one stage per bit generated) and requires only adder/subtracters, registers, and logical NOT gates. The exact remainder produced by the algorithm can be used in rounding toward +1 and in generating the inexact exception. The only drawbacks are that the adder/subtracters at the ending stages of the algorithm are quite large and there is a signicant pipeline latency to produce the results. Square rooting the signicand requires 54 clock cycles when fully pipelined. Normalize the result: Because in an earlier stage we ensured that the input signicand was in the range [1; 4), the output signicand is in the range [1; 2). A number in the range [1; 2) is normalized, so no extra normalization step is needed. Round the result: The rst step in rounding is to correct the remainder if it is negative. Correcting the remainder requires a 57-bit addition and takes one clock cycle. Round 28 to the nearest is implemented as an integer addition of 1 to the full output of the square root algorithm. Rounding toward +1 is implemented similarly. If the (corrected) remainder is 0, the number is left as it is. If the (corrected) remainder is nonzero, an integer addition of 1 to the result of the square root algorithm except for the rounding bit is performed. In square root, all outputs that are not special numbers are positive. Consequently, round toward 0 and round toward1 are equivalent. In both cases, all the bits generated by the square root algorithm, except for the rounding bit, are used as the result. 
Output the final, correctly rounded result and raise any exceptions that have occurred: If the input type, determined in the first stage, was positive normal or positive denormal, the correctly rounded result is output. Otherwise, the appropriate special number is output. Of the five types of exceptions described in IEEE standard 754, it is only possible for square root to generate two of them: invalid operation and inexact result. Under the appropriate conditions, these exception flags are raised. The core also passes along any exceptions that were generated by other floating-point cores and given as input to it.
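Returning to the significand square-root step described above, the sketch below computes an integer square root one result bit per iteration, which is the structure that maps naturally to one pipeline stage per bit in the core. It is a plain restoring-style digit recurrence for illustration only, not the exact non-restoring recurrence of Li and Chu used in the core, and it omits the exponent handling and the rounding driven by the exact remainder.

```c
#include <stdint.h>
#include <stdio.h>

/* Digit-recurrence integer square root: one result bit per iteration.
 * Illustrative only; the hardware core uses Li and Chu's non-restoring
 * algorithm on the 53-bit significand (shifted into [1, 4)) and also
 * produces the exact remainder used for rounding. */
static uint64_t isqrt_u64(uint64_t n)
{
    uint64_t x = n;            /* remaining value                       */
    uint64_t c = 0;            /* result built so far                   */
    uint64_t d = 1ULL << 62;   /* highest even power of two to try      */

    while (d > n)
        d >>= 2;

    while (d != 0) {
        if (x >= c + d) {      /* try setting the next result bit       */
            x -= c + d;
            c = (c >> 1) + d;
        } else {
            c >>= 1;
        }
        d >>= 2;               /* move to the next lower bit            */
    }
    return c;                  /* floor(sqrt(n)); x holds the remainder */
}

int main(void)
{
    printf("%llu\n", (unsigned long long)isqrt_u64(1ULL << 52)); /* 2^26 */
    printf("%llu\n", (unsigned long long)isqrt_u64(10));         /* 3    */
    return 0;
}
```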
Table 3.1 shows the properties of the square root core in the three configurations described above when it is fully pipelined [40]. From the table, we can see that the configurations vary little in maximum frequency. Reducing the features of IEEE standard 754 that are implemented gives a small improvement in pipeline latency. The common features configuration has three fewer pipeline stages than the most compliant case because the common features case lacks denormal support: the priority encoder and logarithmic shifter, and their three cycles of pipeline latency, are removed. The lowest overhead configuration has two fewer cycles than that because it additionally removes the cycle for correcting the remainder and the cycle for the addition that is part of round to nearest and round toward +∞. Clearly, the greatest difference between the configurations is the amount of area that each requires. The most compliant configuration requires 40% more area than the lowest overhead configuration and 28% more area than the common features case. In analyzing all possible configurations of the square root core, we find that removing the support for denormals leads to the most area savings: 262 slices, on average.

Table 3.1: Properties of the Fully Pipelined Square Root Core
Configuration      No. of Stages   Max. Frequency (MHz)   Area (slices)   % of FPGA's Total Area   Frequency/Area (MHz/1000 slices)
Most Compliant     60              164                    2332            47                       70
Common Features    57              164                    1826            37                       90
Lowest Overhead    55              169                    1666            34                       101

3.1.2.3 Other Cores in the Library

Tables 3.2, 3.3, and 3.4 show the properties of the other cores in the library [40]. Table 3.2 shows that the largest increase in area of the adder cores is between the lowest overhead adder and the common features adder. Supporting denormals and exceptions does increase the area, but not nearly as much since much of the hardware needed to support denormal numbers, such as large shifters, is already present. Table 3.3 shows that the greatest area increase of the multiplier cores comes in going from the common features configuration to the most compliant configuration. This increase is largely due to the support for denormal numbers in the most compliant configuration.

In Table 3.4, results for the divider are shown. The lowest overhead configuration has been built using both a non-restoring division (NRD) algorithm and an SRT4 algorithm for the fixed-point division that the core internally requires. The most compliant and common features configurations were only built using NRD internally. The NRD version of the lowest overhead configuration runs faster but requires many more pipeline stages.

Table 3.2: Properties of the Fully Pipelined Adder Core
Configuration      No. of Stages   Max. Frequency (MHz)   Area (slices)   % of FPGA's Total Area   Frequency/Area (MHz/1000 slices)
Most Compliant     17              170                    1640            33                       103
Common Features    17              170                    1580            31                       107
Lowest Overhead    14              170                    1078            21                       157

Table 3.3: Properties of the Fully Pipelined Multiplier Core
Configuration      No. of Stages   Max. Frequency (MHz)   Area (slices)   % of FPGA's Total Area   Frequency/Area (MHz/1000 slices)
Most Compliant     19              170                    2085            42                       81
Common Features    12              170                    1165            23                       150
Lowest Overhead    11              170                    935             19                       181

Table 3.4: Properties of the Fully Pipelined Divider Core
Configuration             No. of Stages   Max. Frequency (MHz)   Area (slices)   % of FPGA's Total Area   Frequency/Area (MHz/1000 slices)
Most Compliant (NRD)      68              140                    4243            86                       33
Common Features (NRD)     60              140                    3625            73                       38
Lowest Overhead (NRD)     58              140                    3212            66                       44
Lowest Overhead (SRT4)    32              90                     3713            75                       24

3.1.2.4 Transcendental Functions

Most libraries of floating-point cores, including ours, are limited to addition/subtraction, multiplication, division, and square root cores. More complicated transcendental functions are needed in many scientific applications. Currently, though, options for calculating transcendental functions on FPGAs are limited. Individual users must develop floating-point cores for functions as they are needed. These cores usually make use of CORDIC algorithms or table lookup [5, 111]. For example, in Section 5.3.2.1, we describe our implementation of floating-point cores for erfc(x) and e^(x^2). There have also been efforts to include log and e^x in a floating-point library [25].

3.1.3 Utilizing the Parameters of the Floating-Point Cores

Our floating-point cores provide the user with two types of parameters. The first is the features of IEEE standard 754 that are supported. Whether or not the user can make use of these parameters is dependent upon the application, the data on which it operates, and the numerical accuracy desired. For example, flushing denormals to zero is a latency optimization provided by many compilers, such as the Intel C++ Compiler 8.1. If the user would use this optimization for the software version of his application, then he need not implement denormal support in the hardware version of his application: doing so would increase the area significantly while not providing any benefit. As another example, if the user would only use round to nearest rounding mode in the software version, then he need not implement the other rounding modes in the hardware version. However, it is important that these features of IEEE standard 754 be available if they are needed. For example, if the user wants the accuracy of denormals, he should be able to have that accuracy in his hardware version of the application. The ability to implement only those features necessary is a great benefit of using reconfigurable hardware as the computing platform.

The other parameter provided is the degree of pipelining of the floating-point cores. The degree of pipelining affects both the frequency and the area of the design: pipelining is necessary to achieve a high frequency but it also leads to a larger area. Figure 3.2(a) shows the frequency of the lowest overhead square root core as a function of the number of pipelining stages. We use the lowest overhead square root core as an example, but the analysis technique applies to any of the configurations. The degree of pipelining ranges from 1 pipeline stage to 55 pipeline stages. The frequency increases close to linearly with the number of pipeline stages. The rate of increase is about 3 MHz per pipeline stage.
Note that once 50 pipeline stages are used, the increase in frequency levels off. In fact, it even dips slightly until the full 55 pipeline stages are used, which yields the maximum frequency. Figure 3.2(b) shows the area of the square root core as a function of the degree of pipelining. The increase in area is also close to linear. The rate of increase is about 15 slices per pipeline stage. Note that in contrast to the frequency curve, the area curve is monotonically increasing. Clearly, there is a tradeoff between frequency and area: setting the degree of pipelining to achieve a high frequency implies a large area and setting the degree of pipelining for a small area implies a low frequency.

To further analyze this tradeoff, we use a third metric, the frequency-to-area ratio [41]. The degree of pipelining that leads to the highest frequency-to-area ratio is the one that gives the most throughput per unit area. Any further pipelining will increase the area at a higher rate than it increases frequency. The frequency-to-area ratio for the square root core is shown in Figure 3.2(c). The highest frequency-to-area ratio occurs when the number of pipelining stages in the core is 51. Even though there is a slightly higher frequency when the unit is fully pipelined, the increase in frequency is overshadowed by the increase in area.

Figure 3.2: Frequency (a), Area (b), and Frequency/Area ratio (c) as a function of the number of pipeline stages in the lowest overhead square root core
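The same selection can be made programmatically from synthesis sweep data. The sketch below scans a table of (stages, frequency, area) samples and picks the pipelining depth with the best frequency-to-area ratio; the sample values are invented for illustration and are not the measured data behind Figure 3.2.

```c
#include <stdio.h>

/* One synthesis result for a given degree of pipelining. */
struct sweep_point {
    int    stages;
    double freq_mhz;
    double area_slices;
};

/* Return the index of the point with the highest frequency-to-area ratio. */
static int best_freq_per_area(const struct sweep_point *p, int count)
{
    int best = 0;
    for (int i = 1; i < count; i++) {
        double r_best = p[best].freq_mhz / p[best].area_slices;
        double r_i    = p[i].freq_mhz    / p[i].area_slices;
        if (r_i > r_best)
            best = i;
    }
    return best;
}

int main(void)
{
    /* Hypothetical sweep (stages, MHz, slices); not the measured values. */
    struct sweep_point sweep[] = {
        {  1,  20.0,  900.0 }, { 11,  50.0, 1050.0 }, { 21,  80.0, 1200.0 },
        { 31, 110.0, 1350.0 }, { 41, 140.0, 1500.0 }, { 48, 161.0, 1585.0 },
        { 51, 166.0, 1620.0 }, { 55, 169.0, 1690.0 },
    };
    int n = (int)(sizeof sweep / sizeof sweep[0]);
    int i = best_freq_per_area(sweep, n);
    printf("best: %d stages (%.1f MHz, %.0f slices)\n",
           sweep[i].stages, sweep[i].freq_mhz, sweep[i].area_slices);
    return 0;
}
```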
3.1.4 Application Kernel

We demonstrate the utility of our cores by applying them to a scientific computing application. In [95], we developed a simple, pipelined architecture to calculate the Lennard-Jones force and potential in hardware. The dataflow graph for the calculation is shown in Figure 3.3. Note that this is a simplification of the kernel; it is not as complete as those kernels which are described in Chapter 5. We give each operation in the dataflow graph its own floating-point core. In the architecture, there are eight floating-point multiplications, five floating-point addition/subtractions, two floating-point divisions, and one floating-point square root. The input is a double-precision number representing the square of the distance between two molecules and the outputs are one double-precision number representing a component of the Lennard-Jones force and another double-precision number representing the Lennard-Jones potential. The architecture is very deeply pipelined because it is comprised of the pipelined floating-point cores. However, once the pipeline is full, there are two outputs per clock cycle.

We coded this design in VHDL, utilizing our fully pipelined floating-point cores. We synthesized the design with Synplicity Synplify Pro 7.2 and placed-and-routed the design with Xilinx PAR, part of ISE 7.1. The target device was the Virtex-II Pro XC2VP100. Due to very long place-and-route times, the results for the implementation with fully compliant cores are only an estimate.

Figure 3.3: Architecture for the Lennard-Jones force and potential calculation kernel

When using the lowest overhead configuration of the floating-point cores, it is possible to utilize coarse-grained parallelism by implementing two of the force calculation pipelines in parallel on a single FPGA. The frequency achievable by such an implementation is 100 MHz. The total area of this implementation is 35315 slices.

When using the common features or the most compliant configurations of the floating-point cores, it is not possible to implement two of the force calculation pipelines in parallel on a single FPGA: the area required is too great. So, only a single force calculation pipeline is implemented. When using the common features configuration, the frequency achievable is 125 MHz and the area is 23864 slices. When using the most compliant configuration, the estimated frequency is 100 MHz and the estimated area is 34613 slices.

The graph in Figure 3.4 shows the throughput achievable by the implementations. Because of the parallelism, when the lowest overhead configuration of the floating-point cores is used, the throughput is 1.6 times that of when the common features configuration is used and is twice that of when the most compliant configuration is used.

Figure 3.4: Throughput of the Lennard-Jones force and potential calculation when using the most compliant, common features, and lowest overhead floating-point core configurations

This example demonstrates the dramatic difference in the performance with differently configured floating-point cores. Therefore, it is necessary to provide the user with these parameters so that he can utilize those configurations which best suit his performance and numerical accuracy needs.

3.1.5 Related Work

Some early work in floating-point arithmetic for FPGAs was done in [30], [67], and [66], among others. These early works, however, were targeted to more primitive FPGAs than are available today. Specifically, these FPGAs lacked the large amount of logic resources and the embedded arithmetic units that today's state-of-the-art FPGAs have. Thus, in these works, nothing more than 32-bit precision is ever considered, the area available for the floating-point cores is severely limited, and the clock frequencies are rather low. Much of the work dealt with variations in the arithmetic algorithms and their resource implications. Further, only addition and multiplication are implemented; there is no implementation for division or square root. Consequently, we do not focus on these early attempts but instead look into more recent investigations of floating-point arithmetic on FPGAs.
In [121], this work is ex- tended to include division, square root, and accumulation cores. Like the other cores, these are parameterized by bit width. [24] is another oating-point arithmetic library for FPGAs in which the the bit widths of the exponent and mantissa are parameters. The library provides oating- point addition, subtraction, multiplication, division, and square root. It also isto our knowledgethe rst to provide a logarithm core and one of the rst to provide an e x core. While it does provide this functionality, the library does not closely follow the format dened in IEEE standard 754. For instance, it treats special numbers differently and so must convert from IEEE-standard-754 format to its own internal format. It also does not support denormal numbers. Another early work implementing oating-point cores for FPGAs is [65]. This work describes a parameterized oating-point library and uses this library to imple- ment the SPH force calculation algorithm on a Virtex-II FPGA. The library has addi- tion, multiplication, division, and square root cores. The performance of each oating- point core for up to 24 bits of precision is listed. [64] presents a oating-point unit generator that can generate a large space of oating-point adders, multipliers, and dividers based on a variety of parameters. Trade- offs at the architectural level, the algorithm level, and the representation level are an- alyzed. The oating-point unit generator is implemented as part of a C++ to FPGA 39 conguration bitstream compiler. In the code, users can specify whether cores opti- mized for area, throughput, or latency should be selected. Tradeoffs in area and latency are analyzed, but only for bit widths less than or equal to 32 bits. [41] describes oating-point adders and multipliers that follow the IEEE-standard- 754 single- and double-precision formats, as well as a 48-bit format. It is important to note that while these cores represented the oating-point numbers as specied in IEEE standard 754, they did not implement all the features of the standard. For example, denormal numbers were not supported. Nonetheless, this work was the rst to develop FPGA oating-point cores specically designed for such large bit widths. Furthering the work of [41], a oating-point divider is developed in [39]. [95] develops a 64-bit, oating-point square root core to go with the cores developed in [41] and [39]. The work presented in this section differs in important ways from the related work listed above. The oating-point cores developed for the library are the rst developed for FPGAs to support all types of numbers representable, all types of exceptions that can be generated, all rounding modes, and the NaN diagnostics that are described in IEEE standard 754 and to do so in double precision (64-bit). Consequently, we are also the rst to analyze the performance effects of supporting the standard when the oating-point cores are used in a scientic computing kernel. 3.2 Arithmetic Expression Evaluation The arithmetic expression evaluation problem has long been of theoretical and prac- tical interest [7]. It is the problem of computing the value of an expression that is represented by a tree in which the leaves are numeric values and the internal nodes are operators. Many efcient algorithms have been proposed in the parallel comput- ing community [13, 70]. With the increasing computing power of FPGAs, multiple 40 oating-point cores like those described in the previous section can be congured on one FPGA and work cooperatively. 
Thus, FPGA-based designs for arithmetic expres- sion evaluation have become promising. Such designs are especially important for recongurable computers because the target applications for recongurable computers are large-scale applications that contain many expressions. However, developing high-performance oating-point designs on FPGAs is still challenging. Because of their complexity, the oating-point cores have very large ar- eas. It is thus desirable to use as few of them in a design as possible. In order to achieve a high clock frequency, the oating-point cores must be deeply pipelined. This deep pipelining makes it difcult to reuse the same oating-point core for a series of com- putations that are dependent upon one another. Further, it is important to maintain a high throughput so that the pipeline latency is not a detriment to oating-point designs. Therefore, the design problem is to develop algorithms and architectures that can hide the pipeline latency and use as few oating-point cores as possible in a design while still maintaining a high throughput. In this section, we propose a high-performance and area-efcient solution to the arithmetic expression evaluation problem. Suppose the number of numerical values in an expression is n. In our proposed architectures, the number oating-point cores needed is independent of n, and the buffer size is of (lg(n)). The rst proposed design is most efcient for expressions whose expression trees are complete binary trees: trees in which all the internal nodes have degree 2 and all the leaves have the same depth. We show that if the n inputs to such an expression arrive sequentially, it is possible to evaluate the expression with only one oating-point core of each type used in the expression and complete the evaluation in (n) time. For expressions whose expression trees are not complete binary trees but satisfy certain requirements, we propose a similar design that uses two oating-point cores of each type. We then 41 discuss the penalty for general expressions that cannot be directly evaluated by the proposed designs. We also describe how the design can evaluate multiple expressions without the need for hardware reconguration. We have implemented the proposed solution on Xilinx Virtex-II and Virtex-II Pro FPGAs. The area of our architecture increases linearly with the number of types of operators. On the other hand, when the number of operators is xed, there is only a small area increase as the number of inputs increases. We compare the proposed solution with another hardware design and compiler-generated designs. The rest of this section is organized as follows. In Section 3.2.1, we dene and illustrate the arithmetic expression evaluation problem and the design challenges that it presents. In Section 3.2.2, we describe the proposed algorithms and architectures for arithmetic expression evaluation and we prove their correctness. Section 3.2.3 presents analysis of our designs' performance. In Section 3.2.4, we discuss related work. 3.2.1 Arithmetic Expression Evaluation 3.2.1.1 Problem Denition An arithmetic expression is any well-formed string composed of at least one of the arithmetic operations, left and right parentheses as needed, and numerical values that are either constants or variables. We denote an arithmetic expression e of n dis- tinct numerical values by e(n). Suppose these n values are input sequentially as [r 0 , r 1 , . . . , r n1 ]; and one r i comes in per clock cycle (0 i n 1). 
The arith- metic expression evaluation problem is thus to compute the value of e applied to these n inputs. Also, we use S to denote the set of types of operations in the expression. 42 3.2.1.2 Tree Representation An arithmetic expression can be represented by a tree in which the leaves are numeric values and the internal nodes are operators. In particular, most expression trees can be transformed into a binary tree. Unary operators can be considered with an identity operation. [51], on the other hand, describes ways to transform into binary trees those expression trees in which there are nodes for associative, commutative operators that take more than two inputs. There also have been some works on arithmetic expression tree-height reduction. In [59], an upper bound on the reduced tree height is given assuming that only associa- tivity and commutativity are used. Let e(n) be any arithmetic expression with depth d of parenthesis nesting. The tree of e(n) can be transformed in such a way that its height is less than or equal todlg ne + 2d + 1. Bounds using associativity, commuta- tivity and distributivity have also been given. In [13], it is proved that the tree height of e(n) can be reduced to underd4 lg ne. For expressions of special forms, better bounds can be given, as shown in [58]. Therefore, in our work, we focus on expressions that can be represented by bi- nary trees. Additionally, the trees are balanced, that is, the heights of the subtrees of any node differ by only a small amount. Skinny trees or non-binary trees can be transformed using existing methods. 3.2.1.3 Applications In many scientic computing algorithms, the inner loops need to evaluate arithmetic expressions. Molecular dynamics has several arithmetic expressions that must be eval- uated many times. Such expressions usually have tens of inputs. Figure 3.5 shows one example expression. Specically, this expression calculates the magnitude of the 43 × × + × - × + A r -14 B r -8 r -1 D × × × × - q i q j r -2 b d(erfc(r)) r -2 erfc(r) r -2 f Figure 3.5: Example expression tree from molecular dynamics force calculation Lennard-Jones force and the real-space part of the electrostatic force acting between a pair of atoms in the simulation. This force is used as part of the acceleration update phase of the simulation. All of the leaves in the expression tree are either constants or are calculated prior to this expression's evaluation. The proposed design is also applicable to other scientic kernels, such as vector dot product and matrix-vector multiplication. These kernels require the accumulation of oating-point numbers. Such accumulation is a special case of arithmetic expression evaluation. For large-scale problems, these kernels have expressions with thousands of inputs. Probabilistic inference is another area that uses expression evaluations. For exam- ple, in the Lauritzen-Spiegelhalter algorithm, a general belief network is transformed into a tree of cliques [57]. To nd the probability of events, expressions must be eval- uated at each node and the results propagated up the tree. Besides oating-point scientic applications, arithmetic expression evaluation ex- ists in many other applications. For example, the graph theoretical problems of all- pairs shortest-path and transitive closure are fundamental in a variety of elds, such as network routing and distributed computing. 
These problems are solved by the Floyd-Warshall algorithm, which requires comparisons and additions, either fixed- or floating-point, depending upon the problem domain. The Floyd-Warshall algorithm for a graph of n vertices consists of n steps. We use D^(k)_ij to denote the weight of the shortest path from vertex i to vertex j in step k (i, j, k = 1, ..., n). D^(0) is the weight matrix of the original graph. In the algorithm,

D^(n)_ij = min(D^(0)_ij, D^(0)_i1 + D^(0)_1j, D^(1)_i2 + D^(1)_2j, ..., D^(n-1)_i,n + D^(n-1)_n,j).

Suppose the min function is implemented using a binary comparator. Figure 3.6 shows the expression tree for D^(5)_23 when n = 5.

Figure 3.6: Example expression tree for Floyd-Warshall algorithm

Figure 3.7: Example of an expression tree for an expression with eight inputs: ((r0 × r1) × (r2 − r3)) × ((r4 + r5) + (r6 × r7))

3.2.1.4 FPGA-based Arithmetic Expression Evaluation

We now discuss the design issues for arithmetic expression evaluation on FPGAs. To begin with, we restrict the trees to complete binary trees with four or more inputs. Thus, a tree with k levels has n = 2^k inputs and each of the operators is a binary operator. An example of such a tree is shown in Figure 3.7.

The evaluation problem for such expressions is trivially solved with n − 1 floating-point cores, one for each operation in the expression. However, this is a very inefficient solution. The large areas of floating-point cores make such a solution undesirable or, for large n, infeasible. When the inputs to the expression arrive sequentially, such an architecture also makes uneconomical use of the floating-point cores. The floating-point cores in the architecture do not take new inputs on every clock cycle; rather, they are only active periodically. For example, the first floating-point core at level 0 of the tree operates on the first two inputs to arrive. Before it can operate on any more inputs, it must wait the n − 2 cycles it takes for the next n − 2 inputs to arrive and enter the other floating-point cores in level 0, plus two more cycles for its new inputs to arrive. Such inefficiency exists at all levels of the tree.

An architecture like that of Figure 3.8 is a first step in addressing this inefficiency. There is only one floating-point core of each type necessary in a given tree level, a constant-size buffer, and a multiplexer. We assume throughout the paper that the number of pipeline stages is the same in each floating-point core; we refer to this common value as the pipeline depth (in Section 3.2.2.3 we describe what to do when this is not the case). There are now O(|S| lg(n)) floating-point cores and lg(n) constant-size buffers. At each level of this compacted tree, when there are two values in the buffer, they are read from the buffer and passed to the floating-point cores. The correct floating-point core output, as chosen by the multiplexer's selection logic, is written to the buffer at the next highest level.

Figure 3.8: Complete binary tree for an expression with three types of operators after compaction to one floating-point core of each type per level (+, −, and × could be replaced by any binary operators)

While an improvement over the trivial solution, this solution is still inefficient. The floating-point cores at each level are still not taking new inputs at each clock cycle.
The oating-point cores at level 0 take new inputs every other clock cycle. The oating- point cores at level 1 must wait for two outputs of the oating-point cores at level 0. So, these oating-point cores only read new inputs once every four cycles. The oating- point cores at level i only read new inputs once every 2 i+1 cycles. Additionally, this architecture requires O(jSj lg(n)) oating-point cores, which still may not be feasible for large values of n. But, since the oating-point cores are not fully utilized, we see that with a clever architecture and algorithm, it is possible to evaluate the expression using a single oating-point core for each type of operator in the expression. The difculty in developing such an architecture and algorithm lies in the fact that oating-point cores on FPGAs are very deeply pipelined. Partial results are not available in the cycle immediately following the start of their computation. Instead, 47 they are available stages later. If (n) storage is allowed, the inputs can be buffered until a partial result is ready. Once again, though, this solution leads to inefcient use of the oating-point cores. Additionally, stalling for each operation will drastically reduce throughput. Since we are targeting applications in which the operands will be double-precision oating-point numbers, we do not want to have to store many values on the chip. We anticipate that expression evaluation will be part of a larger application implemented in hardware. So, while there is sufcient memory on current FPGA devices to store a substantial number of double-precision oating-point numbers, we would like to use as little memory as possible for the expression evaluation and leave the rest free for the other parts of the application. Thus, we would like to use O(lg(n)) buffer space. In summary, the focus of our work is to develop designs for arithmetic expression evaluation on FPGAs using the minimum number of deeply pipelined oating-point cores possible. Additionally, the proposed solution should not require (n) buffer size or introduce pipeline stalls. 3.2.2 Proposed Algorithms and Architectures for the Arithmetic Expression Evaluation Problem We now present the proposed solution to the arithmetic expression evaluation problem. We start with the evaluation of expressions whose expression trees are complete binary trees, which is referred to as the basic case. We then extend this solution to the the advanced case, where more general expressions are considered. For each case, we prove the correctness of the proposed solution, and note some important features of it. Lastly, we discuss how our proposed designs can be used for general expressions. 48 3.2.2.1 Basic Case Architecture and Algorithm As described above, the oating-point cores at level i in the architecture in Figure 3.8 are utilized only once every 2 i+1 clock cycles. With careful scheduling, then, we can interleave all oating-point operations in a single set ofjSj oating-point cores. A selection function is used to decide which oating-point core's output is written to buffer i + 1. In that way, it is possible to select which oating-point core's result should be used in the later calculations. By selecting the appropriate result for each buffer write, any complete-binary-tree expression e can be evaluated. We formalize these ideas below. The architecture for the solution to the basic case is shown in Figure 3.9. 
It has one -stage pipelined oating-point core for each type of operation in the expression, a counter (C), some control circuitry, and lg(n) buffer levels. All of the buffers hold three data items, except for the buffer at level 0, which only holds two data items. Note that the buffers are FIFOs. The r i values are written into buffer 0 at every clock cycle. When there are two values in the buffer, they are destructively read and are presented as inputs to the oating-point cores. Two cycles later, the next two r i values are presented as inputs to the oating-point cores. When the oating-point cores have operated on two values from buffer i, one oating-point core's output is chosen and is written to buffer i + 1. To select the output from the correct oating-point core, the architecture has lg(n) 2 buffer control counters bcc 1 ;::: ;bcc lg(n)1 , and has a selection function '. Buffer i, i = 1;::: ; lg(n) 1, has a (lg (n) i)-bit buffer control counter, bcc i , associated with it. The purpose of bcc i is to allow buffer i to decide which oating- point-core output should be written to it. bcc i counts the number of writes to buffer i that have occurred. For example, consider the tree in Figure 3.7. There are three 49 counters Control + × ÷ · · · + × ÷ · · · C bcc lg(n)-1 … bcc 1 j · · · i r buffers 0 · · · 1 ) lg( - n 1 Figure 3.9: Architecture for the basic case (+,, could be replaced by any binary operators) types of operators in level 0 of the tree:,, and +. Each of these operators has a corresponding oating-point core in the architecture for this expression. When data is to be written into buffer 1, buffer 1 must be able to correctly choose which oating- point core's output should be written into it. From the gure, one can clearly see that the rst write to buffer 1 should be from the multiplier, the second write should be from the subtracter, and so on. Thus, when bcc 1 = 0, the output of the multiplier core should be written to buffer 1, when bcc 1 = 1, the output of the subtracter core should be written to buffer 1, and so on. This mapping between bcc i and the functional units is handled by the selection function '. ' is the function that controls the selection of oating-point core outputs to be written into a given buffer. ' has two parameters: the buffer level and the value of the buffer control counter for that level. The output of ' is the type of operator 50 whose output should be written into the buffer. For example, for the tree in Figure 3.7, '(1; 0) = multiplier and '(1; 1) = subtracter. The algorithm for the solution to the basic case is given in Figure 3.10. It describes the schedule for reading from and writing to the buffers, the buffer control counters and the selection function. The sections of the algorithm are executed in parallel every clock cycle. The notation v hji is used to denote the j least signicant bits of v. For notational convenience, we dene the value of ' (lg (n)) to be the oating-point core for the type of operator that is at the root of the complete binary tree. Proof of Correctness We now prove several lemmas and theorems to show that the proposed algorithm and architecture solve the basic case of the problem. Lemma 1: (lg(n)) buffer space is sufcient for the algorithm and architecture, and there are no data hazards in the architecture. Proof : We rst prove that (lg(n)) buffer space is sufcient for the proposed algo- rithm and architecture. 
For buffer i, one value is written every 2 i clock cycles; two values are read every 2 i+1 clock cycles because the oating-point cores in the architec- ture are all binary operators. Thus, we have rate write = rate read , and a buffer cannot grow without bound once we begin reading values from that buffer. Obviously buffer 0 at most contains 2 values before it is rst read because it is read every 2 clock cycles. We now prove that all the other buffers at most contain 3 values before they are rst read. There are 2 i+1 1 clock cycles between two reads of buffer i. When i > 0, we have 2 i + 1 < 2 i+1 1 < 2 2 i + 1. Thus, during 2 i+1 1 clock cycles, at most two values are written into buffer i. Suppose buffer i is rst read at t i . If buffer i contains more than 3 values, then it would have had 2 or more values at t i 2 i+1 , which means that it should have been read earlier and t i is not the rst read. 51 fcountersg if oating-point cores are rst used then C = 0 for i = 1 to lg (n) 1 in parallel do bcc i = 0 end for else C = C + 1 end if fbuffersg write input port to buffer 0 for i = 0 to lg(n) 2 in parallel do if (C ) hi+1i = 2 i 1 then write output of oating-point core ' (i + 1;bcc i+1 ) to buffer i + 1 bcc i+1 = bcc i+1 + 1 end if end for foating-point coresg for i = 0 to lg (n) 1 in parallel do if (C hi+1i = 2 i 1) and (buffer i has more than 1 value) then read 2 values from buffer i for all s2 S in parallel do Enter the two values into oating-point core s end for end if end for foutputg if output of oating-point cores is valid then if (C ) hlg(n)i = n 2 1 then write the output of oating-point core ' (lg (n)) to external memory end if end if Figure 3.10: Algorithm for the basic case 52 This contradicts our assumption. Thus, buffer i needs at most 3 values and (lg(n)) buffer space is sufcient. We now prove that no data hazards occur in the proposed architecture by contra- diction. Assume a hazard occurs when the oating-point cores try to read from buffer i before the correct value is written to buffer i. If the read succeeds, we know that buffer i contains two valid values and this is not an invalid read. If the buffer contains less than two valid values, the read will not succeed and the hazard will not occur. In both cases, we reach the contradiction and the lemma is proved. Lemma 2: The partial results from different input sets are not combined together during computation. Proof : Let R and T be two sets of n inputs. The elements of an input set arrive at the architecture sequentially. Without loss of generality, let the elements of R be the rst to arrive. Since elements arrive sequentially, one per clock cycle, all of the elements of R arrive and enter buffer 0 before any of the elements of T do. The oating-point cores are linear pipelines of the same length, . Since the elements of T enter the oating- point cores after the elements of R, they must exit the oating-point cores after the elements of R. The rst two elements to enter buffer i are the rst two elements to leave buffer i. Thus, for a partial result from R to be combined with a partial result from T , there must be exactly one partial result from R and one or more partial results from T in buffer i, for some i. However, the number of inputs in R is n = 2 k for some integer k. So, at level i, there are n 2 i = 2 ki partial results. This is an even number. The algorithm calls for the oating-point cores to read two values from buffer i at a time. 
Thus, it is impossible for exactly one partial result from R to be left in the buffer with one or more partial results from T . Therefore, it is not possible for operands from multiple sets to be combined together during computation. 53 Theorem 3: The algorithm and architecture together correctly evaluate an expres- sion e whose expression tree is a complete binary tree with n = 2 k inputs for all k 2. Proof : We prove this theorem inductively. First, we assume that e has n = 2 2 = 4 inputs and show that the algorithm works. Let R be the set of n inputs with r 0 the rst to arrive, r 1 the second to arrive, and so on. When r 0 arrives, it is stored in buffer 0. When r 1 arrives, it is also stored in buffer 0. On the next cycle, C is set to 0 and r 0 and r 1 are given as input to all of the oating-point cores. Also, r 2 is written to buffer 0. On the next cycle, r 3 is written to buffer 0. On the next cycle after that (C is 2, so C h1i = 2 0 1), r 2 and r 3 are read from the buffer and given as inputs to the oating-point cores. stages after r 0 and r 1 entered the oating-point cores, (C ) h1i = 2 0 1, so the output of one of the oating-point cores must be written. bcc 1 = 0, so the output of oating-point core '(1; 0) is written to buffer 1. By the denition of ', this is the correct output to be written. bcc 1 is incremented to 1. Two cycles later, (C ) h1i = 2 0 1 again, so it is time to write to buffer 1 again. Since r 2 and r 3 were given as inputs to the oating-point cores two cycles after r 0 and r 1 were, their partial result is the output from the oating-point cores. bcc 1 = 1, so the output of the oating-point core '(1; 1) is written into buffer 1. bcc 1 is incremented and rolls over back to 0. Buffer 1 now has two inputs, so as soon as C h2i = 01, the two partial results will be read from buffer 1 and given as inputs to the oating-point cores. cycles after that, the output is ready. Correspondingly, (C ) h2i = 4 2 1 = 01, so the output of oating-point core '(lg(n)) is written to external memory. By denition, this is the output from the correct oating-point core. So, the output is correct. For the inductive hypothesis, we assume that the algorithm and architecture pro- duce the correct results for n = 2 k , k > 2, inputs. We now prove that, if that is the case, the algorithm and architecture produce the correct results for n = 2 k+1 inputs. 54 The key observation is that evaluating an expression of n = 2 k+1 inputs is almost the same as evaluating two consecutive expressions of m = 2 k = n 2 inputs. For n inputs, an extra level of buffering is added to the architecture and one more operation must take place. Also, ' and bcc i are extended accordingly and bcc n1 is added to the architecture. The rst m inputs to arrive are operated on in the same manner as they would have been if the algorithm and architecture were designed for m inputs. The only difference is that instead of the output of oating-point core '(lg(m)) being written to external memory when (C ) hlg(m)i = m 2 1, the output of what is now '(lg(m); 0) is written to buffer lg(n) 1. By the inductive hypothesis, this value is the correct evaluation of the left subtree of the expression. Similarly, the right subtree of the expression is also correctly evaluated. We know from Lemma 2 that the partial results from two sets of m inputs could not have been combined together during the calculation. The values from the left and right subtrees have been written into buffer lg(n) 1. 
As soon as C hlg(n)i = 2 lg(n1) 1 = n 2 1, those two values are read from buffer lg(n) 1 and given as inputs to the oating-point cores. cycles later, when (C) hlg(n)i = n 2 1, the output of '(lg(n)) is written to external memory. By denition, this is the output of the correct oating-point core. Therefore, the correct result is produced. Thus, the algorithm and architecture together correctly evaluate an expression e of n = 2 k inputs for all k 2. Theorem 4: The latency of the algorithm is at most 3n + ( 1) lg(n) 2 cycles. Proof : It takes n cycles for all the inputs to arrive. The last input will go through the oating-point cores lg(n) times. It will spend at most 2 i+1 1 cycles in buffer i. So the total latency is at most n + lg(n) + P lg(n)1 i=0 (2 i+1 1) = 3n + ( 1) lg(n) 2 cycles. 55 i r counters Control + × ÷ · · · + × ÷ · · · C bcc … bcc 1 j · · · + × ÷ · · · 1 ) lg( - n · · · buffers · · · 1 ) lg( - n 1 Figure 3.11: Architecture for the advanced case (+,, could be replaced by any binary operators) 3.2.2.2 Advanced Case We now extend the architecture for the basic case to the advanced case. In this case, the number of inputs n does not need to be a power of 2, and the expression tree does not need to be a complete binary tree. However, we assume that at each expression tree level, all values must be operands of some binary operators except the last one. If the last value is not paired with any other value, it is called a singleton value. When n is not a power of 2, there may be operations on singleton values in multiple levels of the tree. The architecture in Figure 3.9 is not able to handle the singleton op- erations for the following reason. When we pad a singleton operation with an identity, values in the subsequent sets are still arriving in the buffers. Since only the singleton value is consumed from the buffer instead of two values, the buffers will overow. Algorithm and Architecture The architecture in Figure 3.11 is proposed. This ar- chitecture contains two sets of oating-point cores anddlg(n)e buffers. By using an 56 fcountersg if the rst pipeline is rst used then C = 0 for i = 1 todlg(n)e - 1 in parallel do bcc i = 0 end for else C = C + 1 end if fbuffersg write input port to buffer 0 if valid output from rst set of oating- point cores then write '(1;bcc 1 ) to buffer 1 bcc 1 = bcc 1 + 1 end if for i = 1 todlg(n)e 1 do if (C ) hii = 2 i1 1 and valid out- put from second set of oating-point cores then write '(i+1;bcc i+1 ) to buffer i+1 bcc i+1 = bcc i+1 + 1 end if end for foating-point coresg if last 0 = 1 and singleton 0 = 1 then read 1 value from buffer 0 else if 2 values are in buffer 0 then read 2 values from buffer 0 end if for all s 2 the rst set of operators in parallel do enter the read values into oating- point core s end for for i = 1 todlg(n)e 1 do if C hii = 2 i1 1 then if last i = 1 and singleton i = 1 then read 1 value from buffer i else if two valid values in buffer i then read 2 values from buffer i end if end if for all s2 the second set of operators in parallel do enter the read values into oating- point core s end for end for Figure 3.12: Algorithm for the advanced case additional set of oating-point cores, the architecture provides a sufcient number of available pipeline slots to schedule the extra operations precipitated by the singleton cases. In particular, the rst set of oating-point cores handles the operations in the rst level of the binary tree, while the second set of oating-point cores is in charge of the operations in all the other levels. 
This architecture also containsdlg(n)e 2 buffer control counters bcc 1 , . . . , bcc dlg(n)e1 and the selection function ' to select the correct oating-point-core outputs to be written to the buffers. ' is adjusted so that '(1;bcc 1 ) chooses from the rst set of oating-point cores and '(x;bcc x );x = 2;:::;dlg ne chooses from the second. 57 The algorithm for the architecture is shown in Figure 3.12. In the algorithm, out n is the tree level at which the nal result may be found and equalsdlg(n)e. singleton i denotes whether or not the last value at level i is a singleton. If i = 0, singleton i = n mod 2; otherwise, singleton i =d n 2 l+1 e mod 2, i = 1;::: ;dlg(n)e. last i = 1 when a value entering the oating-point cores is the last for that set of inputs to enter the oating-point cores at level i of the expression tree and last i = 0 otherwise. Proof of Correctness The scheduling of buffer 1, . . . , bufferdlg(n)e 1 in the ad- vanced case is the same as the scheduling of buffer 0, . . . , buffer lg(n) 2 in the basic case. Thus, the correctness of the algorithm and architecture can be proved in the same way as in Section 3.2.2.1. We therefore only present two issues whose proof cannot be explicitly derived from the basic case. Lemma 5: A buffer cannot grow without bound once reading from it commences. Proof : This is obvious for buffer 0. For buffer i, i = 1;:::;dlg(n)e 1, since we may have a singleton case, either one or two values are read every 2 i cycles. Thus, 1 1 2 i rate read 2 1 2 i = 1 2 i1 : During n cycles, at most n 2 i+1 values are written to buffer i. Thus, rate write l n 2 i+1 m 1 n : If n 2 i+1 , then l n 2 i+1 m 1 n 1 2 i+1 + 1 n 1 2 i and if 2 i < n < 2 i+1 , then l n 2 i+1 m 1 n = 1 n < 1 2 i : It is clear, then that rate write 1 2 i for all n > 2 i . 58 In our algorithm, any arithmetic expression that reaches level i must have more than 2 i inputs. Otherwise, the expression has out n i and the evaluation is complete before level i. Therefore, we have rate write rate read . We thus prove that we cannot have unbounded growth once we start reading from the buffer. Theorem 6: The latency of the algorithm is at most 3n + ( 1)dlg(n)e 3 cy- cles. Proof : It takes n cycles for all the inputs to arrive. The last value enters buffer 0 at cycle n, and then traverses the rst set of oating-point cores. Then it enters buffer 1, one of the second set of oating-point cores, and so on until it enters bufferdlg(n)e1 and then one of the second set of oating-point cores. Clearly it goes through the oating-point coresdlg(n)e times, and at buffer i, it waits for at most 2 i 1 cycles before being read by the oating-point cores (i = 1;::: ;dlg(n)e 1). Therefore, L n + dlg (n)e + dlg(n)e1 X i=1 (2 i 1) n + dlg (n)e + 2n 2dlg (n)e 1 = 3n + ( 1)dlg (n)e 3 where L is the latency, in cycles. 3.2.2.3 Properties of the Designs The designs for both the basic case and the advanced case have several important properties. The rst of these is that, as desired, at most two oating-point cores for each type of operator in the expression are present in the designs. Each oating-point core in the designs is being used efciently; each oating-point core takes new inputs as often as possible. This is an important property because, due to their large areas, the oating-point cores dominate the area of expression evaluation circuits. 
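To make the selection function concrete, the sketch below tabulates one plausible encoding of φ for the eight-input tree of Figure 3.7, indexed by buffer level and buffer control counter value. The C representation, the operator encoding, and the array layout are illustrative assumptions, not the control format used in the actual design; the entries themselves follow the level-by-level write order described for Figure 3.7 in Section 3.2.2.1.

```c
#include <stdio.h>

/* Operator types appearing in the expression of Figure 3.7. */
enum op { OP_ADD, OP_SUB, OP_MUL };

static const char *op_name[] = { "add", "sub", "mul" };

/* One plausible table for the selection function phi for
 *   ((r0 * r1) * (r2 - r3)) * ((r4 + r5) + (r6 * r7)).
 * phi_k[bcc] names the core whose output is written to buffer k when the
 * buffer control counter has value bcc; level lg(n) = 3 names the core
 * that produces the root of the tree. */
static const enum op phi1[4] = { OP_MUL, OP_SUB, OP_ADD, OP_MUL }; /* writes to buffer 1 */
static const enum op phi2[2] = { OP_MUL, OP_ADD };                 /* writes to buffer 2 */
static const enum op phi3    = OP_MUL;                             /* root operator      */

int main(void)
{
    for (int bcc = 0; bcc < 4; bcc++)
        printf("phi(1,%d) = %s\n", bcc, op_name[phi1[bcc]]);
    for (int bcc = 0; bcc < 2; bcc++)
        printf("phi(2,%d) = %s\n", bcc, op_name[phi2[bcc]]);
    printf("phi(3)   = %s\n", op_name[phi3]);
    return 0;
}
```

In the evaluation circuit, entries of this kind drive the multiplexer that chooses which floating-point core's output is written into each buffer.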
This property 59 would not be as notable if large buffers were necessary or the time complexity for the expression evaluation increased to more than the optimal, which is (n) since it takes at least n cycles for all of the inputs to arrive. Secondly, only (lg(n)) buffer space is necessary for the designs. This is a small amount of space and leaves most of the FPGA's on-chip memory for other tasks im- plemented on the device. Despite this small buffer size, as shown in Theorem 4 and Theorem 6, the latency remains low. The designs also achieve the highest possible throughput for inputs arriving sequentially: an output every n cycles once the oating- point-core pipelines are full. Another property of the designs is that they place no restriction on the number of pipeline stages in the oating-point cores. In developing the designs, we assumed that the oating-point cores all had the same number of pipeline stages. In practice, this is not always the case. To compensate, shift registers can be added to the oating-point cores with fewer stages so that their pipeline latencies match that of the oating-point core with the most stages. will be set to the number of pipeline stages of the oating- point core with the longest pipeline latency. While adding shift registers does increase the area of the architecture, this increase will be insignicant when compared to the size of the oating-point cores. Importantly, no pipeline stalls are introduced. The designs also allow for the inclusion of the extra pipeline stages before or after the oating-point cores, if necessary. This extra delay can be captured by increasing the value of by the number of added pipeline stages. For example, if the critical path of the design is between the output of the oating-point cores and the input to memory, an extra pipeline stage (or more than one stage) can be added to the path. It is only necessary to increase the value of by the number of added stages. This property adds scalability to the algorithm. 60 Lastly, multiple expressions can be evaluated using the same hardware without reconguration. If there are p expressions to be evaluated and each expression has the same number of inputs, the only difference in evaluating the expressions is the ' function. The outputs of the ' function can be stored in memory, so only the memory needs to be changed to change the expression being evaluated. If the expressions to be evaluated have different numbers of inputs, this can be handled as well. The sizes for the memories and counters should be set for the expression with the largest number of inputs. As long as the ' function is set correctly for a given expression, its result can be read from the evaluation circuit at the appropriate time. 3.2.2.4 General Expressions Thus far, we have only considered certain classes of arithmetic expressions. Despite this, the proposed algorithms and architectures can be applied to a wide variety of general expressions. For example, some expressions involve non-binary operators. As a preprocessing step, the expression trees for such expressions can be transformed into binary trees as discussed in Section 3.2.1.2. After such transformations, the resulting binary expression tree still may not com- ply with the requirements of the basic case or the advanced case. In this case, the input source can pad the inputs to the expression evaluation circuit until the expres- sions meet the requirement. 
For example, the expression tree in Figure 3.13(a) could be transformed into the expression tree in Figure 3.13(b). Doing such padding reduces the throughput. If h is the number of levels in the tree and n is the number of inputs, then the throughput is reduced by a factor of 2 h n . If the binary tree is balancedthat is, if the heights of the subtrees of any node differ by only a small amountthe im- pact on the throughput of the algorithm and architecture is small. For example, for an expression such as that in Figure 3.13, where the subtrees of the root node differ 61 × - + × × + × r 0 r 1 r 2 r 3 r 4 r 5 r 6 r 7 ((r 0 × r 1 ) × (r 2 - r 3 )) × ((r 4 + r 5 ) + ((r 6 × r 7 ) × (r 8 – r 9 ))) × - r 8 r 9 L L L × × + × r 6 r 7 ((r 0 × r 1 ) × (r 2 - r 3 )) × ((r 4 + r 5 ) + ((r 6 × r 7 ) × (r 8 – r 9 ))) × - r 8 r 9 r 4 r 5 + r 2 r 3 - r 0 r 1 × (a) (b) Figure 3.13: (a) Expression tree before padding. (b) Expression tree after padding (? represents any operation. represents a padding input," L represents the operation of passing the left input to the next level) in height by one, the reduction in throughput is at most a factor of 2. As discussed in Section 3.2.1.2, various methods exist that can reduce the height of the expression trees and transform the trees into more balanced ones. Through padding, our designs can be applied to general expressions. 3.2.3 Experimental Results We rst analyze the performance of the proposed solution to the basic case when n = 64 inputs and when n = 1024 inputs as the number of types of oating-point operators ranges from a single type of operator to four types of operators. We think of these operators as add, subtract, multiply, and divide. However, because our goal is to study the performance of our design rather than the performance of individual oating-point cores, we use the same oating-point core in all cases, repeating it as many times as is necessary. In our implementation, we use a oating-point multiplier 62 core. For example, in the case of four types of operators, rather than having a oating- point adder, a oating-point subtracter, a oating-point multiplier, and a oating-point divider, we use four oating-point multipliers. The multiplier, as well as the other oating-point cores used for this paper, are briey described in the las section and more thoroughly described in [40]. We use the lowest overhead oating-point cores. These oating-point cores are sufcient for the purpose of this illustration. Any pipelined cores, even xed-point ones, could be used with our architecture. If more or less complex cores are required, the performance of the proposed design will change accordingly. We coded the proposed architecture and algorithm for the basic case in VHDL. With the Xilinx Virtex-II Pro XC2VP20 as our target device, we synthesized the design with Synplicity Synplify Pro 8.1 and mapped and placed-and-routed the result with the appropriate tools in Xilinx ISE 7.1. Since buffer 0 is written to every clock cycle, we implemented it as two registers. In order to avoid large multiplexers at the inputs to the oating-point cores, we did not implement the other buffers separately. Instead, we combined them into two embedded RAMs, one for the left inputs to the oating- point cores and the other for the right inputs. Also, to simplify the addressing logic, we allowed each buffer level in the RAM to have four words instead of just three. With this scheme, we only need to multiplex between the registers of buffer 0 and the RAM for the other buffer levels. 
The address logic takes care of choosing between the buffers at levels 1 through lg(n) − 1. We do still, however, require multiplexers to select which floating-point core's output should be written into the buffers.

Figure 3.14 shows the effect on the frequency and the area of the implementation as the number of operators goes from one to four. For both input sizes, the area increases roughly linearly with the number of types of operators. Note that the area for n = 1024 is not much larger than the area for n = 64. In each case, as the number of types of operators increases, the frequency decreases.

Figure 3.14: Change in frequency (MHz) and area (slices) as the number of types of operators goes from 1 to 4, for n = 64 and n = 1024.

This makes sense because the amount of control logic necessary increases as the number of operators increases. A curious case is the relatively small decrease in frequency between one and two types of operators when n = 64. In every case except n = 64 with one type of operator, the critical path keeping the design from achieving a higher frequency is in the control logic that writes a floating-point core result to the buffers. However, when n = 64 and there is only one type of operator, the critical path is in the floating-point multiplier. That is, in this case, the design cannot achieve a higher frequency because it has reached the limit of the floating-point multiplier rather than the limit of the control logic. Because of these different limiting factors, with n = 64 there is not a great difference in frequency between when there is one type of operator and when there are two.

We now compare the implementation of the proposed solution to an implementation of the compacted-tree architecture with lg(n) floating-point cores of each type, as in Figure 3.8. Table 3.5 shows the throughput-to-area ratio of each of the architectures when n = 64 and the number of types of operators varies from one to four.

Table 3.5: Throughput-to-area ratio of the proposed architecture and the compacted-tree architecture. Values are in thousands of results per slice per second (higher is better).

                      Number of Types of Operators
                      1        2        3        4
    proposed          2.40     1.34     0.71     0.52
    compacted-tree    0.40     0.27     0.18     0.13

Table 3.6: Area-latency product of the proposed architecture and the compacted-tree architecture. Values are in slice-microseconds per result (lower is better).

                      Number of Types of Operators
                      1        2        3        4
    proposed          1115     1994     3742     5180
    compacted-tree    4866     7053     10880    14924

Clearly, the proposed solution has a much higher throughput-to-area ratio; its area is being used much more efficiently. Table 3.6 shows the area-latency product of the two architectures and algorithms. The compacted-tree algorithm requires fewer cycles to compute a result than the proposed design. However, the area-latency product of the compacted-tree architecture and algorithm is much higher than that of the proposed architecture and algorithm. Thus, by this metric as well, the proposed solution is superior to the compacted-tree approach.

3.2.3.1 Comparison with Compiler-Generated Designs

We now compare the proposed design to designs for the same expressions generated by a high-level language compiler.
The compiler we choose to compare against is the compiler in SRC Computers' Carte programming environment, which is one of the state-of-the-art compilers presently available [108]. This compiler allows hardware designs to be described in the C programming language and includes some extra macros for data movement and special functions. The compiler generates an executable for SRC's MAPstation, which has two microprocessors and two Xilinx Virtex-II XC2V6000 FPGAs. We focus only on the FPGA part, and all results presented are from designs targeted to the Virtex-II XC2V6000. Additionally, all designs run at 100 MHz, as this is a requirement for the SRC 6 MAPstation. In these comparisons, the designs generated by the SRC compiler use SRC's floating-point cores while our designs use the floating-point cores from [40].

One function for which a hardware macro is provided is floating-point accumulation. SRC's floating-point accumulation macro is called from within a pipelined loop. If multiple sets of numbers are to be accumulated, the entire pipeline must be flushed to finish one accumulation before any numbers from the next set can enter the pipeline to begin the next accumulation. Accumulation is a special case of our design in which there is only one type of operator: addition. Our design does not require that the pipeline be flushed between sets of numbers. The drawback to our design compared to the SRC accumulator is that the number of inputs to the accumulation must be known in advance, whereas the SRC accumulator can accumulate conditionally and take any number of inputs.

To accumulate 64 and 1024 inputs, our design requires 1474 and 1589 slices, respectively, while the SRC accumulator requires 2002 slices in each case. So, our design uses less area while not requiring any flushing of the pipeline. If the number of inputs were not a power of two, our design would require about twice as much area because it would need two floating-point adders instead of just one, but there would still be no pipeline flushing required.

We also compare the proposed design's area in evaluating arbitrary expressions to that of the designs generated by the compiler. As examples, we look at an 8-input expression and a 64-input expression. For an 8-input expression with 2 adds, 1 subtract, and 4 multiplies, the proposed design requires 3378 slices while the design generated by the compiler requires 6823 slices. The trouble is that the compiler is limited to creating designs in which each floating-point operation has its own floating-point core. It cannot generate optimized designs like the one developed here. For an 8-input expression, this is not too bad, as the generated design easily fits on the target FPGA. However, as the number of inputs increases, the FPGA quickly fills up. A 64-input expression with 18 adds, 17 subtracts, 17 multiplies, and 11 divides requires only 6738 slices with the proposed design, but requires 56565 slices in the compiler-generated design. 56565 slices is 67% more slices than are present in the target FPGA, so the expression cannot be placed and routed.

3.2.4 Related Work

3.2.4.1 Parallel Arithmetic Expression Evaluation

The arithmetic expression evaluation problem has been studied widely in the parallel computing community. Several algorithms have been proposed for theoretical programming models. For example, [70] and [21] present parallel algorithms for the problem that assume the PRAM model, in which there are many processors, all with uniform access to a shared memory.
[90], on the other hand, solves the problem for the Processor Arrays with Reconfigurable Bus System model, in which computation is performed by cells connected to a grid-shaped reconfigurable bus system. Each of these techniques employs tree contraction [51]. In tree contraction, a rooted binary tree is systematically reduced from consisting of many nodes to consisting of a single node: the root. These methods for arithmetic expression evaluation require O(lg(n)) time.

In [7], the algorithm in [70] is adapted to be practical for SMP machines. The practical SMP version of the algorithm must distribute the work between the available processors and synchronize the processors during the steps of the algorithm.

These existing works in parallel computing assume that a processor is capable of performing any operation. Additionally, it is assumed that a processor counts as a unit of computing resource. However, in FPGA-based designs, each operation has its own operator that occupies certain hardware resources. Therefore, we are not concerned with the number of processors and communication time. Instead, we aim to minimize the number of operators while achieving a high throughput.

[123] discusses area-efficient designs for parallel polynomial evaluation on FPGAs. The goal of that work is to minimize the hardware requirements of the evaluator by using the minimal number of operators for each type of operation. Since they only consider fixed-point arithmetic, each operation takes one clock cycle to complete. For a polynomial of degree n, Θ(n) operators are required. We focus on floating-point expression evaluation, and the floating-point cores are all pipelined. Moreover, the number of operators in our proposed solution is independent of the input size.

3.2.4.2 Pipeline Synthesis

Datapath pipeline synthesis has been studied by many researchers. Often, such as in [81], [84], and [50], pipelining is performed only at the task level and the individual operators are not pipelined. More closely related to our work are works such as [18] and [109], in which both the individual operators and the overall tasks are pipelined. A scheduling algorithm for two-level pipelining, where both the task and each of the operators are pipelined, is proposed in [18]. The scheduling problem is formulated as an integer linear programming problem for minimization of the total execution time. The running time of the algorithm depends on the number of nodes in the dataflow graph, the number of operators, and the number of pipeline stages in each operator.

[109] proposes a design methodology that automatically generates FPGA circuits that meet a given throughput constraint at minimal area cost. Two different scheduling algorithms are used in the methodology, both of which combine module selection with resource sharing during pipeline synthesis. The library elements used in the synthesized designs are pre-pipelined, and their numbers of pipeline stages are taken into account in the algorithms.

In each of these works on pipeline synthesis, design-space exploration has to be performed for each individual dataflow graph. The output of these works depends on the dataflow graph and the selected modules. On the other hand, our proposed architectures and algorithms can be used for all expressions with little modification. Moreover, our designs are independent of the implementations of the floating-point cores.
Chapter 4

Modeling Reconfigurable Computers

As described in Chapter 2, much of the early work in using reconfigurable hardware for scientific computing applications has focused on individual tasks [124, 46, 102, 105, 106, 116]. Oftentimes, though, reconfigurable hardware can provide fantastic speed-ups for tasks in a scientific computing application, but when the tasks are integrated back into the complete application, the speed-up for the complete application is not very impressive. Despite advances in high-level programming for reconfigurable hardware, it can be difficult to develop a working program for reconfigurable computers that utilizes both the GPPs and the reconfigurable hardware. A crucial step, then, is the determination of whether or not the approach considered will provide a significant enough speed-up to justify the implementation effort.

In this chapter, our goal is to develop a high-level performance model for reconfigurable computers in order to address these issues [97]. In Section 4.1, we survey currently available reconfigurable computers and develop abstract architectural and programming models for reconfigurable computers. Section 4.2 develops an analytic performance model for reconfigurable computers. Section 4.3 presents some of the relevant work in the literature.

4.1 Reconfigurable Computers

As described in Section 1.1.2, reconfigurable computers are comprised of GPPs, reconfigurable hardware, memory, and high-performance, low-latency interconnect. Architecturally, there are three different areas to examine. At the system level, we consider the composition of the entire reconfigurable computer. That is, we are looking at all of the global properties of the system, such as homogeneity/heterogeneity, memory distribution, and so forth. At the node level, we study the individual nodes that make up the complete system. At the element level, we look at the individual computing elements that make up nodes. Modeling at each of these levels is necessary in order to predict the performance of the complete system.

As we proceed with our discussion of reconfigurable computers, we will make use of the following definitions for the computing elements.

RH element: The reconfigurable hardware, likely in the form of one or more tightly coupled FPGAs, and off-chip (but on-board) memory that is spread over multiple banks, with each bank accessible independently.

GPP element: One or more tightly coupled GPPs (e.g., Intel Pentium 4 processors) sharing a RAM memory.

4.1.1 Survey of Reconfigurable Computers

Recently, reconfigurable computers have been developed and made available by several vendors. SRC Computers has developed several offerings [108]. The RH element in all of them is SRC's MAP processor. The MAP processor is depicted conceptually in Figure 4.1.

Figure 4.1: Conceptual drawing of the SRC MAP processor.

In the Series C version of the MAP, there are 2 Xilinx Virtex-II XC2V6000 FPGAs available for the user to use for programmable logic. There are 6 banks of on-board memory. Each bank holds 4 MB worth of 64-bit words. The banks can be accessed independently, in parallel, leading to a maximum bandwidth of 4.8 GB/s. However, each bank is single-ported. That is, in a given cycle, a given memory bank may be read from or written to by the FPGA, but not both. There is also a penalty of 4 dead cycles for switching between read mode and write mode.
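As a rough sanity check on the quoted peak bandwidth (a back-of-the-envelope calculation that assumes all six banks are accessed in parallel at the 100 MHz MAP user-logic clock mentioned elsewhere in this work):

6 \text{ banks} \times 8 \ \tfrac{\text{bytes}}{\text{bank} \cdot \text{cycle}} \times 100 \times 10^{6} \ \tfrac{\text{cycles}}{\text{s}} = 4.8 \times 10^{9} \ \tfrac{\text{bytes}}{\text{s}} = 4.8 \ \text{GB/s}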
Also, because the on-board memories are single-ported, both FPGAs cannot access the same on-board memory bank in the same cycle. The FPGAs can communicate and synchronize with one another, though. Data can also be streamed between the two FPGAs. In our work, we have been able to implement 53 single-precision floating-point cores (11 adders, 12 subtracters, 19 multipliers, 5 squaring cores, 4 accumulators, 1 divider, and 1 square root core) in a single FPGA on the MAP. The newer Series H MAP promises to increase the available memory and memory bandwidth and will use the larger, faster Xilinx Virtex-II Pro XC2VP100 for user logic, boosting performance even further.

The GPP element in all of SRC's reconfigurable computers is comprised of two 2.8 GHz Intel Xeon processors connected to a shared memory. The shared memory is 1 GB or more.

In SRC's basic MAPstation, there is one GPP element and one MAP processor as the RH element. Communication between the GPP element and the RH element is through DMA transfers. The transfer rate is 1.4 GB/s in each direction. It is also possible to stream data from/to the GPP element to/from the RH element. When data is streamed, the communication is overlapped with computation and the programmer can write the program as if new data elements arrive at the RH element every clock cycle. MAPstations can be connected over commodity networks to form clusters.

In SRC Hi-Bar systems, the GPP elements and RH elements are connected to SRC's Hi-Bar switch. The GPP and RH elements in this system are, then, more loosely coupled than in the basic MAPstation. There is no longer necessarily a one-to-one correspondence between GPP elements and RH elements. GPP elements can still communicate with RH elements through DMA transfers, but these transfers must go over the Hi-Bar switch, which has a bandwidth of 1.4 GB/s per port and a latency of 180 ns port to port. There also may be memory connected to the switch that is shared by the GPP and RH elements. Each element reads/writes data from/to that shared memory through DMA transfers.

Silicon Graphics has developed the Reconfigurable Application Specific Computing Technology (RASC) RC100 Blade to provide hardware acceleration to their Altix systems [101]. This RH element consists of dual Xilinx Virtex 4 XC4VLX200 FPGAs and 4 banks of on-board memory. The total amount of memory is anticipated to be 80 MB of SRAM or 20 GB of SDRAM. The on-board memory banks are 32 bits wide. The bandwidth between the on-board memory and the FPGA is 9.6 GB/s.

In this system, the RH elements are peers with the GPP elements on the interconnection network (SGI's NUMAlink 4) and all connect to the shared memory in the system. Through the use of NUMAlink, RH elements can communicate directly with one another, and the entire system has access to the same coherency domain. The bidirectional bandwidth between the FPGA and common memory over the NUMAlink is up to 12.4 GB/s and the latency is 50 ns round trip. The GPP elements in this system have Intel Itanium 2 processors.

The Cray XD1 is another reconfigurable computer. An XD1 chassis consists of six blades, and each blade consists of one GPP element and one RH element. In the XD1, the RH element is an FPGA (a Xilinx Virtex-II Pro XC2VP50 or a Xilinx Virtex 4 XC4VLX160) and four on-board memory banks. The on-board memory banks each hold 4 or 8 MB and are 64 bits wide. The bandwidth from the on-board memory banks to the FPGA is 3.2 GB/s and the latency is 8 clock cycles, but this access is pipelined.
The GPP element in the XD1 consists of two AMD Opterons and shared RAM memory. The GPP element communicates with the RH element, and with other blades, through Cray's RapidArray Processor (RAP). In fact (see Figure 4.2), only one of the processors in the GPP element can communicate with the RAP that is connected to the RH element. The RH element can also use the RAP to communicate over the RapidArray interconnect with other processors and/or FPGAs. The bandwidth between the FPGA and the RAP and between the RAP and the GPP is 3.2 GB/s, while the bandwidth between the RAP and the rest of the interconnection network is 2 GB/s.

Figure 4.2: Conceptual drawing of a Cray XD1 blade (a GPP element with two GPPs and shared memory, connected through RAPs and the RapidArray interconnect to an RH element with an FPGA and four SRAM banks).

4.1.1.1 Massively Parallel FPGA Arrays

There have also been some efforts to develop reconfigurable computers consisting of only RH elements. While these reconfigurable computers are beyond the scope of this work, we present them here for completeness. The reader is encouraged to read the cited works on these types of reconfigurable computers for more detail on their properties.

One example of a massively parallel FPGA array is the Berkeley Emulation Engine 2 [17]. In this system, an RH element consists of one Xilinx Virtex-II Pro FPGA and four on-board memories, each with a capacity of up to 1 GB. Four of these RH elements are grouped together and coupled with another RH element that is used for control and has global communication interconnect interfaces that allow the RH elements to communicate with each other and with any secondary devices in the system. At the system level, several interconnection schemes are possible, including Ethernet and Infiniband. For more details on BEE2 and its use in radio astronomy applications, see [17].

Another massively parallel FPGA array is the TMD [82]. The RH elements in this system consist of Xilinx Virtex-II Pro FPGAs and four on-board memories, two with DRAM and two with SRAM, though in future versions the on-board memories may vary from RH element to RH element. The computing model for this reconfigurable computer is different from the others in that applications running on it are seen entirely as computing processes. The compute engines that do the processing are either custom hardware configured on the FPGA or embedded processors on the FPGA. Processes on the same FPGA communicate through FIFOs. Processes on different FPGAs that are on the same board use the high-performance I/O capabilities of the FPGAs that make up the system. Processes on different FPGAs on different boards use the I/O capabilities of the FPGAs to emulate common interconnect standards like Ethernet and Infiniband and go through commodity switches. For more details on TMD, the limited MPI developed for it, and its application to molecular dynamics simulation, see [82].

4.1.2 Element-Level Architectures

In our abstract model, the GPP element consists of one or more GPPs and RAM. If there are multiple GPPs within a GPP element, they are organized into a symmetric multiprocessor (SMP) system, sharing memory. This architecture is depicted in Figure 4.3. The GPP elements in each of the currently available reconfigurable computers conform to this abstract model (see Section 4.1.1). For instance, the GPP element in the SRC 6 MAPstation contains two Intel Xeon processors that share memory [108].

Figure 4.3: Abstract architecture of a GPP element (one or more GPPs sharing a memory).
In our abstract model, the RH element consists of one or more reconfigurable-hardware devices, likely FPGAs, that have access to shared on-board memory. This is depicted in Figure 4.4. If there are multiple reconfigurable-hardware devices, we assume that they can locally communicate with one another and can thus be treated as one large reconfigurable-hardware device. The shared on-board memory in the RH element is organized into multiple banks that can be accessed in parallel. These banks provide single-cycle (possibly pipelined) access and are not implicitly cached. The on-board memory can be considered to be orders of magnitude smaller than the RAM in the GPP element. For example, in the Cray XD1, the RH element has four banks of SRAM memory and each bank holds 4 MB (see Section 4.1.1). Each GPP element in the XD1, on the other hand, has access to up to 16 GB of DRAM.

Figure 4.4: Abstract architecture of an RH element (one or more FPGAs connected to multiple on-board memory banks).

4.1.3 System- and Node-Level Architectures

Reconfigurable computers can be considered multiple-instruction, multiple-data (MIMD) machines by the Flynn classification [31]. That is, the GPP elements can act independently of one another and of the RH elements. Likewise, the RH elements can act independently of one another and of the GPP elements. Of course, it is more likely that the elements in the system will be used together to speed up the execution of a given application.

The system-level architectures of reconfigurable computers, as described above, can be grouped mainly into two categories: shared-memory architectures and cluster architectures. We describe each below, as well as the node-level architectures that support them.

4.1.3.1 Shared-Memory Architecture

As can be seen from the survey of available reconfigurable computers in Section 4.1.1, one common architecture for reconfigurable computers is the shared-memory architecture. In this architecture, the GPP elements and the RH elements are peers on the interconnection network and connect to a global shared memory. Each individual RH element and each individual GPP element is considered a node. Such a system naturally makes a heterogeneous reconfigurable computer because there are at least two types of nodes in the system: GPP elements and RH elements. If all of the GPP elements are identical to one another and all of the RH elements are identical to one another, then the system can be considered fairly homogeneous because there are only two different types of elements available. If, however, there are a variety of GPP elements and a variety of RH elements available, the system is then truly heterogeneous.

Silicon Graphics reconfigurable computers, as described in Section 4.1.1, fit the shared-memory architectural model. The RASC RC100 blade is the RH element node and the Itanium processors are the GPP element nodes. This is a fairly homogeneous system. Similarly, the SRC Hi-Bar reconfigurable computers also fit this architectural model. The MAP processor(s) are the RH element node(s), and the dual Pentium 4s and their memories make up the GPP element node(s).

4.1.3.2 Cluster Architecture

Another common architecture for reconfigurable computers is a cluster of accelerated nodes. We use this term to refer to architectures that are similar to both traditional massively parallel processors (MPPs) and traditional clusters of workstations [47]. The cluster of accelerated nodes architecture differs from the shared-memory architecture in several ways.
One major way in which the cluster architecture differs from the shared-memory architecture is in what constitutes a node in the system. In the cluster architecture, a node consists of a GPP element; an RH element; and high-performance, low-latency intra-node interconnect between them. The GPP element and the RH element are not peers on the interconnection network between nodes. Instead, the node communicates with other, outside nodes as a unit. The RH element acts more as an accelerator for the GPP element.

Also unlike in the shared-memory architecture, in the cluster architecture there is no globally shared memory. The nodes must communicate with one another over the interconnection network using, for example, message passing. The interconnection network between the nodes can be either a commodity network (like the networks in traditional clusters of workstations) or a special-purpose network (like the networks in traditional MPPs).

A cluster of SRC MAPstations is one example of this architecture. Each node in the cluster consists of one GPP element and one RH element. A commodity network would be used to connect the nodes. The Cray XD1 is another example of this architecture. In this reconfigurable computer, too, a node consists of one GPP element and one RH element. In this case, the nodes are connected via Cray's RapidArray interconnect system.

4.1.3.3 Abstract Model and System-Level Programming Model

The abstract model of a reconfigurable computer that we proceed with in the rest of this chapter is based on the cluster architecture. We assume that each node in the system will consist of one GPP element, one RH element, and high-performance intra-node
The ASPMD model is basically the same, except now each node consists of one GPP element and one RH element. At the node level, the GPP element uses the RH 80 element as an accelerator for intensive parts of the program. Either element can com- municate with other nodes but each node communicates as one single entity. We have chosen this model of computing for a two main reasons. First, the SPMD model is used commonly in many applications, including those in scientic computing, which are of great interest to us. For example, the GAMESS computational chem- istry code uses the SPMD model [79]. In the case of these applications, then, the tasks mapped to each node can be accelerated by the use of the RH element. Indeed, though not the focus of this work, even the massively parallel FPGA arrays can be programmed using this model. The second reason that we have chosen the ASPMD model is that even though our architectural model and the ASPMD model match most closely to cluster architec- tures, programs written with the ASPMD computing model in mind can also run on recongurable computers that have shared-memory architectures. Indeed, this often happens with traditional shared-memory machines. For example, programs written in MPI can execute on SMPs. The messages are passed using shared variables. If there is not a one-to-one correspondence between RH elements and GPP elements, there may be a need for load balancing at the system level. 4.1.3.4 Node-Level Programming Model With the ASPMD model used at the system-level, each node in the recongurable computer is executing the entire program. At the node level, then, it is possible to ex- ploit task-level parallelism. One or more tasks can run on the RH element while one or more other tasks run on the GPP element. We assume, then, a fork-join programming model at the node level. An example is shown in Figure 4.5. Note that the tasks in independent branches of the fork are independent of one another. For example, in Figure 4.5, task 5 does not depend on the output of task 3. If 81 GPP Element: RH Element: task 1 task 2 F task 3 task 4 task 6 task 5 J task 7 Figure 4.5: Example of task-level parallelism within a node (F represents forking and J represents joining) there were a dependency, there would need to be a join such that, for example, task 5 did not begin before task 3 completed. Since the tasks in each branch are independent of one another, there is also no data transfer between tasks in different branches. The tasks need not take the same amount of time to execute. The program will wait until all branches of the fork have nished before proceeding with the next tasks. For better performance, load balancing between the GPP element and the RH element may also be possible, as in the hybrid designs in [129]. 4.2 Performance Modeling The design ow of the hardware/software approach to implementing an application on a recongurable computer is shown in Figure 4.6. The rst step is to prole the application. In many cases, the application to be accelerated on a recongurable com- puter already exists as software. Thus, it can be proled using existing commercial or academic tools such as TAU, VTune, gprof, or myriad others [99, 49, 42]. If the application does not already exist in software form, it can be analyzed using existing GPP modeling techniques. The results of the proling step allow the user to determine the tasks in the application that are bottlenecks and are thus candidates for acceleration with recongurable hardware. 
Figure 4.6: The design flow for mapping an application to a reconfigurable computer: profile the application, identify bottlenecks, develop designs, analyze designs, and implement.

It is in the next two steps, developing designs and analyzing those designs, that the application is partitioned into hardware parts and software parts and that the hardware architectures for implementing tasks in the application are developed. During these two steps, a fairly accurate, high-level performance model is a necessity. One reason that a performance model is so important is that it may be the case that using a reconfigurable computer improves the execution time of a particular task or kernel, but the speed-up of the entire application is not significant enough to warrant the effort involved in developing a reconfigurable computer implementation. Thus, any hardware design must be evaluated in the context of the entire application. Additionally, despite advances in compiler tools and development environments, the final step in the design flow, implementation of an application on a reconfigurable computer, is still more difficult than implementation of an application on a GPP or a cluster of GPPs. This final step actually consists of several substeps, as illustrated in Figure 4.7. The steps in Figure 4.7 between synthesize and generate bitstream, inclusive, take the description of the reconfigurable-hardware design to be implemented on the RH element and convert it to a bitstream that can be used to configure the FPGA. These steps, especially place and route, take much longer than compiling software; collectively, they can take hours or even days to run.

We thus develop a performance model to address these issues. Our performance model is especially focused on the analysis step. The goal is to enable users to determine if moving tasks to the reconfigurable hardware in a reconfigurable computer will provide a significant enough speed-up to warrant the cost and effort required in the implementation step.

Figure 4.7: Steps in implementing an application on a reconfigurable computer, starting from a hardware/software partitioned design: develop or modify the software code; develop the RH code; debug/simulate; compile; synthesize; translate, map, and place and route; generate the bitstream; and link the results into an executable.

At the system level, the reconfigurable computer described by our abstract model behaves much like any other cluster or MPP. That is, each node executes a given program on a set of data and the nodes communicate with one another to exchange data as necessary. For the most part, then, the system-level performance can be modeled using known techniques for modeling clusters and MPPs. See Section 4.3 for some references to existing techniques and for a work specifically about system-level performance modeling of reconfigurable computers.

We thus focus our attention on node-level and element-level modeling. The element-level estimates will be used to make the node-level estimates.

4.2.1 Element-Level

4.2.1.1 GPP Element

Modeling of GPP elements is a well-studied problem [8]. If the fork-join model of node-level programming is being used and code for the application exists already, a profiling tool can be used to get a good estimate of the time that the software tasks will take to execute. For example, in the next chapter, we describe the use of profiling data in porting a molecular dynamics application to the MAPstation. For the portions of the
If proling results do not exist, existing modeling techniques can be utilized [19]. Even if code and proling results exist, the GPP element may incur additional costs if it must perform any data remapping in order to use the RH element. That is, the data in the original implementation of the application may be in a format that is well-suited for the GPP element but ill-suited for use by the RH element. In this case, the GPP element may need to reorganize the data such that it is in a suitable format for the RH element before it sends any data to the RH element. The remapping cost can be determined by implementing the remapping functions and proling or by other, higher-level modeling techniques. 4.2.1.2 RH Element To develop the RH element-level component of our performance model, we examine the factors that contribute to the high-performance, or lack thereof, of tasks mapped onto recongurable hardware. Typically, the recongurable hardware in RH elements is one or more FPGAs, so those devices are the focus of our examination. The execution time of a task mapped to an FPGA is, of course, ultimately de- termined by the number of cycles required to nish the computation and the clock frequency at which the implementation executes. The two major factors affecting the number of cycles are pipelining and parallelism. Pipelining and parallelism are espe- cially important in scientic computing applications that use oating-point arithmetic. FPGAs overcome their comparatively low clock frequencies by having the hardware specically tailored to the application at handnamely, by pipelining the execution and by exploiting both ne- and coarse-grained parallelism. 85 As described in Section 3.1, oating-point arithmetic is often a necessity in sci- entic computing applications. Floating-point cores, especially those that conform to IEEE standard 754, are very complex and require a great deal of logic resources to implement. In order to achieve high clock frequencies, oating-point cores for FPGAs must be deeply pipelined. The main difculty with deeply pipelined units for oating-point arithmetic or otherwiseis that their use can lead to data hazards, which may cause the pipeline to stall. Stalling the pipeline is often very detrimental to the performance of a task being executed on an FPGA. Customizing a design for a particular problem also involves exploiting parallelism, when possible. Fine-grained parallelism involves independent operations that are in- volved in producing the same result. This parallelism is similar to instruction-level parallelism in GPPs. For example, in the distance calculations that are part of force cal- culation in molecular dynamics, additions and subtractions on each of the components of atomic position can occur in parallel, but the results of these parallel operations ul- timately come together to provide a single result. Coarse-grained parallelism, on the other hand, involves independent operations that provide independent results. Extend- ing the previous example, if there are two separate distance calculation pipelines, each providing independent results, then coarse-grained parallelism is being exploited. Two factors constrain the amount of parallelism possible in a task being executed on an FPGA: limited logic resources available on the FPGA and limited bandwidth between the FPGA and memory. Logic resources available on the FPGA are still a limiting factor, especially in designs using oating-point arithmetic. 
Even though modern FPGAs are extremely large compared to those of even just a few years ago, floating-point cores still require large amounts of the available logic resources. This limits the parallelism that can be achieved. If a design has a large number of floating-point cores, the floating-point cores will dominate the design's area requirement. So, a simple approximation to determine how many floating-point cores will fit on a chip, and thus how much parallelism is possible, is to add up the areas of the individual floating-point cores that are necessary in the design. Because the tools in the design flow perform optimizations, this approximation will likely provide an upper bound on the amount of area required by the floating-point cores. There are more detailed methods for area approximation as well (see Section 4.3).

The other limit on parallelism is bandwidth. If data cannot enter the FPGA (or leave it) fast enough for all the parallel computations to take place simultaneously, then the amount of parallelism in the design must be reduced. In streaming designs, it is the intra-node bandwidth that can limit the parallelism. That is, the GPP element must be able to supply the RH element with enough new inputs each cycle to support the desired level of parallelism in the design mapped to the FPGAs in the RH element. In designs that do not involve streaming, it is the bandwidth between the FPGA and its on-board memory that can limit performance. The on-board memory of the RH element in a reconfigurable computer is typically much smaller than the RAM in a GPP element, but it also has a lower access latency, in terms of cycles. Additionally, this local memory is divided into b equally sized banks, where the banks can be accessed in parallel. Each access reads or writes one w-bit word. Clearly, then, a design cannot require more than bw bits worth of reading from and writing to on-board memory in a given cycle. This can limit parallelism and/or slow down a pipeline. Another way in which accessing the on-board memory can affect performance is if there are costs associated with accessing it. In some cases, there may be more than a single cycle of latency for a memory access. It is also possible that there are penalties (dead cycles) for switching a memory bank between read mode and write mode.

Because of the flexibility of FPGAs, it is difficult to develop one all-encompassing model for the hardware. Instead, all of the above factors should be taken into account in order to estimate the number of cycles a design unit will require to produce a result. Then, knowing how many design units can be used in parallel, it is possible to estimate the total number of cycles to produce all the results.

Estimating frequency can be more challenging. The SRC 6 and 7 MAPstations require that all reconfigurable-hardware designs run at 100 and 150 MHz, respectively. While that does not guarantee that a design will meet that frequency, it is the best estimate that can be made for designs running on those machines. For other reconfigurable computers, there is no required frequency. The designer can use prior experience to make an estimate. For instance, in our prior work, we have seen up to a 28% decrease in achievable frequency between the lowest frequency of an individual floating-point core and a design in which floating-point cores use up most of the area in the FPGA [40]. There are other, more accurate methods that require knowledge of the target FPGA. See Section 4.3 for details.
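To make this element-level estimation concrete, the following is a minimal C sketch of the back-of-the-envelope calculation described above: the number of parallel design units is capped by the sum of the floating-point core areas and by the on-board memory bandwidth (b banks, one word per bank per cycle), and the cycle count then follows from the pipelined throughput. The structure, names, and numbers are illustrative assumptions, not part of the formal model; a real estimate would plug in area figures from the core library (e.g., [40]) and the bank parameters of the target RH element.

#include <stdio.h>

/* Illustrative RH element-level estimate. All parameter values are
 * hypothetical placeholders.                                          */
typedef struct {
    int  slices_per_unit;    /* sum of floating-point core areas in one design unit */
    int  words_per_cycle;    /* on-board memory words one unit reads+writes per cycle */
    long cycles_per_result;  /* pipeline fill latency for the first result */
} unit_t;

long estimate_cycles(unit_t u, int fpga_slices, int banks, long results)
{
    int by_area      = fpga_slices / u.slices_per_unit;  /* area-limited parallelism      */
    int by_bandwidth = banks / u.words_per_cycle;        /* bandwidth-limited parallelism */
    int units = by_area < by_bandwidth ? by_area : by_bandwidth;
    if (units < 1) units = 1;
    /* once the pipelines are full, each unit delivers one result per cycle */
    return u.cycles_per_result + (results - 1) / units;
}

int main(void)
{
    unit_t u = { 2500, 4, 60 };   /* hypothetical unit: area, memory words/cycle, latency */
    long cycles = estimate_cycles(u, 33792 /* roughly an XC2V6000 */, 6, 1000000L);
    printf("estimated cycles: %ld\n", cycles);  /* divide by the clock rate for time */
    return 0;
}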
4.2.2 Node-Level

To determine the total node-level execution time of an application, we need to know the execution times of the individual tasks, the time spent in communication, and the time spent reconfiguring the hardware. The amount of time each task takes to execute is estimated using the models described in the last section. We need, however, to account for the fact that some tasks execute in parallel. For each collection of tasks within a fork-join section, the total execution time will be the time taken by the sequence of tasks that takes the longest to execute. That is, for fork-join section f, the time to execute is

T_f = \max\left(T^{\mathrm{GPP\,tasks}}_{f},\; T^{\mathrm{RH\,tasks}}_{f}\right)    (4.1)

where T^{\mathrm{GPP\,tasks}}_{f} is the time taken by the tasks mapped to the GPP element in fork-join section f and T^{\mathrm{RH\,tasks}}_{f} is the time taken by the tasks mapped to the RH element in fork-join section f. The total time for task execution is then

T = \sum_{f} T_f    (4.2)

Take the tasks in Figure 4.5 as an example. Let x_i be the execution time of task i. Clearly, T_1 = x_1 + x_2 and T_3 = x_7, since in the first and third sections only one element is used. In the second section, however, both elements are used, so T_2 = \max\left((x_3 + x_6),\,(x_4 + x_5)\right). The total time for task execution is

\sum_{f=1}^{3} T_f = T_1 + T_2 + T_3 = (x_1 + x_2) + \max\left((x_3 + x_6),\,(x_4 + x_5)\right) + x_7

In order for task i to execute on a given element, the data required by task i must reside in the local memory of that element (we deal with the case of streaming shortly). Let m be the element on which the data resides and let n be the element on which task i will execute. That is, m is either the GPP element or the RH element; if m is the GPP element, n is the RH element, and if m is the RH element, n is the GPP element. If, at the start of task i, the data task i needs resides in the local memory of m, it must be transferred to n before task i can begin execution. This will incur a communication cost, C_{i,m,n}, that depends on the amount of data to be transferred, d_{i,m,n}; the bandwidth, B_{m,n}, between m and n; and the overhead for the communication, O_{m,n}, which includes latency and start-up costs. That is,

C_{i,m,n} = \frac{d_{i,m,n}}{B_{m,n}} + O_{m,n}    (4.3)

The total time spent communicating data between elements is thus

C = \sum_{i} C_{i,m,n}    (4.4)

where m and n are set based upon the element to which task i is mapped. An exceptional case is when it is possible to stream data from m to n. In that case, the computation and communication overlap, and C_{i,m,n} should be set equal to the start-up cost and latency for the streaming. Or, a coarser approximation can be made and C_{i,m,n} can be set to zero.

The last parameter needed for the node-level modeling is the amount of time that the reconfigurable computer must spend reconfiguring the reconfigurable hardware. Each time a different task is executed on the reconfigurable hardware, the reconfigurable hardware must be reconfigured, and this takes time. How much time depends upon whether the reconfigurable hardware is fully or partially reconfigured. Also note that it may be possible to overlap reconfiguration of the RH element with computation on the GPP element. We call the total amount of time spent in reconfiguration that cannot be overlapped with other computation R. The total time for an application to execute on a given node is thus

L = T + C + R    (4.5)
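The node-level model is simple enough to evaluate by hand, but as a concreteness check, the following is a minimal C sketch of Equations 4.1 through 4.5 applied to the fork-join structure of Figure 4.5. The task times, data sizes, bandwidth, and overhead values are hypothetical placeholders, not measurements.

#include <stdio.h>

/* Node-level estimate L = T + C + R (Equations 4.1-4.5) for the fork-join
 * example of Figure 4.5. All numeric values are hypothetical.            */
static double max2(double a, double b) { return a > b ? a : b; }

int main(void)
{
    /* x[i] = execution time of task i in seconds (placeholder values) */
    double x[8] = { 0.0, 1.0e-3, 2.0e-3, 5.0e-3, 4.0e-3, 3.0e-3, 1.5e-3, 0.5e-3 };

    /* Equation 4.1 per fork-join section, then Equation 4.2 */
    double T1 = x[1] + x[2];
    double T2 = max2(x[3] + x[6], x[4] + x[5]);
    double T3 = x[7];
    double T  = T1 + T2 + T3;

    /* Equations 4.3 and 4.4: one block transfer to the RH element and one back */
    double d = 1.0e6;     /* bytes moved per direction (placeholder)   */
    double B = 1.4e9;     /* GPP<->RH bandwidth in bytes/s (placeholder) */
    double O = 0.0;       /* communication overhead, neglected here     */
    double C = 2.0 * (d / B + O);

    double R = 0.0;       /* reconfiguration assumed overlapped with GPP work */
    double L = T + C + R; /* Equation 4.5                                 */

    printf("T = %g s, C = %g s, L = %g s\n", T, C, L);
    return 0;
}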
In the next chapter, we use the performance model presented here to analyze design alternatives for implementing MD on a reconfigurable computer. But first, we conclude this chapter with some related, complementary works.

4.3 Complementary Works

In [129], the hybrid design technique is presented. Hybrid designs are task designs that effectively utilize both the FPGAs and the processors in reconfigurable computers. They address key issues in the development of task designs for reconfigurable computers, most notably the exploitation of parallelism between the GPPs and FPGAs present in reconfigurable computers. At design time, load balancing is performed by spreading the floating-point operations over the GPP element and the RH element such that, based on frequency, both elements will take the same amount of time to finish their portion of the task. This is somewhat different from our approach in that tasks can be divided among elements, whereas in the model presented here, each task is mapped to only one element on the node.

System-level modeling for reconfigurable computers is described in [103] and [104]. In this model, the reconfigurable computer consists of a number of distributed computing nodes connected by an interconnection network. Some or all of the nodes have reconfigurable hardware that provides acceleration. Hence, in this model, the reconfigurable computer is more heterogeneous than the reconfigurable computers considered in this chapter. The algorithms are modeled as synchronous iterative algorithms. The model considers background loading on the GPP elements in the reconfigurable computers. This work then applies the performance model to a scheduling problem. Unlike our model, it assumes that there are no conditional statements in the hardware. It also provides no element-level modeling techniques. Thus, [103] and [104] and the work presented in this chapter are complementary: [103] and [104] could be used to fill in the system-level modeling that we have not addressed here.

GPP- and system-level modeling for traditional computing systems is discussed in [8]. The performance modeling methodology is based on application signatures, machine profiles, and convolutions. Application signatures are compact, machine-independent representations of the fundamental operations in an application. They are developed by running application code on some set of existing machines and merging the results into a machine-independent form. For example, one important part of the application signature is the memory access pattern. Machine profiles are generated from the results of low-level benchmarks. Most often, these benchmarks are focused on memory and communication performance. Convolution, along with other techniques, is then used to combine application signatures with machine profiles in order to model the performance of the applications on the target machines.

[60] and [28] describe estimating the area and/or frequency of FPGA designs. In [60], a dataflow graph (DFG) is the input. The area of each operation in the dataflow graph is estimated using parameterized models of common blocks, such as adders. This estimation technique is meant to be used by high-level language compilers. Similarly, [28] begins with a DFG to estimate the area and also the frequency of applications mapped to FPGAs; the target applications are in the fields of telecommunications, multimedia, and cryptography. Algorithms are characterized by their operations, the average word length of those operations, and the degree of parallelism in the DFG. A mapping model of the target architecture is used to determine the area required by the operations in the DFG as well as the clock frequency at which the design will run.
This mapping model requires knowledge of how the operations in the DFG are mapped to hardware. These techniques could be used to make the frequency and area estimates necessary in the element-level modeling phase.

Chapter 5

Molecular Dynamics on Reconfigurable Computers

In this chapter, we investigate the use of reconfigurable computers to improve the performance of a particular scientific computing application: molecular dynamics (MD) simulation. The reconfigurable computer used for this exploration is the SRC 6 MAPstation, as described in Section 4.1. We begin with an in-depth introduction to molecular dynamics. We then present three case studies. In each, we use the performance model described in the last chapter to aid in our evaluations of potential designs. In the first case study, we accelerate a shifted-force MD simulation program executing on a single node. In the second, we accelerate a PME MD simulation program executing on a single node. Finally, in the third case study, we examine the performance and scalability of this approach by accelerating a shifted-force simulation program executing on multiple nodes.

5.1 Introduction to Molecular Dynamics

MD is a widely used technique to simulate the trajectory of an atomic system over time. For this introduction, we mainly reference [1]. A system of atoms is placed
Two techniques to reduce the number of pairs that must be evaluated are the linked cell list and the Verlet neighbor list. In the linked cell list approach, the simulation box is broken into a number of cells, where the length, width, and height of each cell is close to but greater than the cutoff distance. The atoms in each cell, then, only interact with atoms in the same cell or the 26 neighboring cells. This drastically reduces the number of pairs of atoms that must be evaluated. The purpose of the neighbor list is to avoid searching for interacting pairs of atoms at every time step. It is based on the principle that if atoms i and j are close to one another at time step t, they should still be close to one another at time step t+1. So, the neighbor list is built at time step t and updated at time step t+x, where x is commonly between 10 and 20. The neighbor list is really a combination of lists, where each atom has a list of atoms with which it might interact. Throughout this chapter, when we refer to the neighbor list, we mean the all-encompassing combination of lists. When we refer to atom i's neighbor list, we are referring to only the list of atom i's neighbors. The list can be constructed using either the O(n 2 ) search for pairs or the more efcient linked cell list approach. The force due to the Lennard-Jones approximation for the van der Waals poten- tial (hereafter referred to as the Lennard-Jones force) is a common short range force. 95 Equations 5.1 and 5.2 show the equations for the Lennard-Jones potential and force, respectively, where ~ r ij is the distance vector between atoms i and j, r ij is the distance between atoms i and j, and A, B, C, and D are constants. Constants C and D are necessary to compensate for the use of the cutoff distance. This approach is called the shifted-force technique. Note that whenever we refer to force calculation, in the context of the simulation, the potential energy is also calculated at that step. U LJ ij = A ij r 12 ij B ij r 6 ij + C ij D ij r ij (5.1) ~ f LJ ij = 6 r 2 ij 2A ij r 12 ij B ij r 6 ij + D ij r ij ~ r ij (5.2) Long range forces present more of a problem. They do not drop off quickly at long distances. Not only can atoms interact with any other atoms in the system, they can also interact with their images in other boxes. One solution is to use a cutoff scheme, just like for the short range forces. The long-range force used most often in MD simulations is the electrostatic or Coulomb force. A shifted-force approach can be used for the Coulomb force and potential. Equations 5.3 and 5.4 show the shifted-force Coulomb potential and force, respectively, where q i is the charge of atom i. U E ij = q i q j r ij r C (r C r ij ) 2 (5.3) ~ f E ij = q i q j r ij 1 r 2 ij 1 r 2 c ~ r ij (5.4) While simple, this technique unfortunately can lead to artifacts in the results of the simulation. For instance, recent simulations that compare the structure of biologically inspired model membrane structures have demonstrated that the bilayer structures of 96 cells are different at approximately the 5% level compared to when more sophisticated approaches are used [83]. There are also indications that the cutoff Coulomb approach creates incorrect structures for simulations of solvated proteins and DNA. However, this effect is still only at the 5% level or less, and for some simulations the computa- tional efciency inherent in this approach makes it the only realistic choice. 
A more accurate technique, though, is the Ewald summation, which breaks the potential energy calculation into a real-space (or direct-space) part, a reciprocal-space part, and a correction term. The real-space and reciprocal-space parts are given in Equations 5.5 and 5.6, where is a convergence parameter that determines how much work is done in the real- and reciprocal-space parts, the ~ n are vectors representing images of the central simulation box, and the ~ m are reciprocal lattice vectors (see [29]). The asterisk in the summation means to omit the cases where both~ n = ~ 0 and i = j. For the correction term, which is not computationally intensive, see [29]. The expression for S is given in Equation 5.7. Forces are found by taking the negative gradient of these terms. In practice, is often set so that the real-space part of the calculation can be done out to the same cutoff distance as the short range forces and, thus, the only value of ~ n that matters is ~ n = ~ 0. U E real = 1 2 X ~ n ~ n X i;j=1 q i q j erfc(j~ r i ~ r j + ~ nj) j~ r i ~ r j + ~ nj (5.5) U E recip = 1 2V X ~ m6=0 exp( 2 ~ m 2 = 2 ) ~ m 2 S(~ m)S(~ m) (5.6) S(~ m) = n X i=1 q i exp(2~ m~ r i p 1) (5.7) 97 At best, the Ewald summation is an O(n 3=2 ) calculation. The particle mesh Ewald (PME) technique and its smooth variant reduce the complexity to O(n log n) by facil- itating the use of the 3D FFT in the Ewald summation. For an in-depth description, see [29] or [62]. The basic idea is to approximate the S term in Equation 5.6. Instead of considering point charges in space, each charge contributes a certain amount of charge to regularly spaced grid points. The amount of charge contributed is computed through interpolation. Smooth PME (from here on referred to as just PME) uses cardinal B- splines for interpolation. The real-space part and the correction are calculated without any changes. The main steps for calculating the reciprocal-space part are 1. Find B-spline coefcients and their derivatives for each atom 2. Assign charges to grid points 3. 3D FFT the grid 4. Calculate the energy using the transformed grid and other arrays 5. Inverse FFT the product of the grid with other arrays 6. Use the result of step 5 and the derivatives of the B-spline coefcients to nd the forces 5.1.2 Benchmark Simulations We use two real-word simulations to benchmark the performance of our implementa- tions. One simulation is of palmitic acid in water. The simulation consists of 52558 atoms of 8 different types. The cutoff length is 10 A, the time step is 2 fs, and the neighbor list is rebuilt once every 10 steps. The simulation box is 104 84:8 256 A 3 98 and the grid used in the PME method is 128 96 256. Fourth order interpolation is performed to assign charges to grid points. The other simulation is of the CheY protein in water. This simulation consists of 32932 atoms of 17 different types. The cutoff length is 10 A, the time step is 2 fs, and the neighbor list is rebuilt once every 10 steps. The simulation box is 73:8 71:8 76:8 A 3 and the grid used in the PME method is 64 64 64. Fourth order interpolation is performed to assign charges to grid points. 5.1.3 Software Implementation The rst step in implementing the MD simulation application on a recongurable com- puter is to develop a software implementation that can be proled and eventually mod- ied to make use of the recongurable hardware. For our work, we have written a software implementation of a molecular dynam- ics simulation. 
This software implements the velocity Verlet algorithm in single- or double-precision arithmetic. For nonbonded force calculation, we calculate the shifted Lennard-Jones force and use either the shifted-force technique or the PME technique for the Coulomb force. To identify the interacting pairs of atoms, we use the Verlet neighbor list technique. Every x time steps, where x is defined by the user, a linked cell list is created. This linked cell list is then traversed to make the neighbor list. We have also implemented the bonded force calculation. To allow for larger time steps, we employ the RATTLE iterative constraint technique to constrain bond lengths [1].

The inputs to the simulation are provided at run-time. They include the number of steps in the simulation, the length of the time step, the dimensions of the simulation box, the cutoff and neighbor list cutoff distances, the frequency at which the neighbor list should be updated, and the size of the interpolation grid to be used by PME. The structure of the system (atomic types, charges, and masses; groups of atoms that participate in bonded interactions; constants used in force calculation; and so forth) can be specified in an AMBER-format topology file, which is preprocessed into files in the program's internal format [3]. Initial positions and velocities can be specified in a coordinate file in PDB format [85], which also gets preprocessed.

The software is written in C. It has been compiled using the Intel C compiler 8.1, with the -O3 and -Ob2 optimizations turned on. We profiled the single-precision version of the software running the two benchmark simulations for 100 or more steps each. We ran the simulations on one of the 2.8 GHz Xeon processors in the MAPstation. The operating system is Linux. To profile, we used Oprofile versions 0.8.1 and 0.9.1 [80]. To perform the FFTs needed by PME simulations, we employ Intel MKL version 8.1.14.

5.1.3.1 A Note on Precision

While the purely software version of our code can use single- or double-precision, the task running in reconfigurable hardware is limited to single-precision, as will become evident in the following sections. The use of double-precision arithmetic is the conventional approach in MD simulations, although the widely used GROMACS software is often used with mostly single-precision arithmetic [118]. The original driver for using high-precision arithmetic was the desire to avoid the cumulative rounding errors implicit in very long simulations, but this has been of less concern in recent work. Unless divergence from the exact atomic trajectories is important, a system using single-precision arithmetic will still explore realistic configurations during the simulation [87].

In the remainder of this chapter, we examine three versions of the software implementation and how hardware acceleration improves the performance. The first is a single-node, shifted-force simulation [94]. The next is a single-node, PME simulation [96]. The last is a multi-node, shifted-force simulation [93]. The results presented below were obtained with the version of the program available at the time of their original publication. As we have developed features of the MD program, we have found and corrected bugs.
However, none of these corrections have significantly impacted the performance results.

5.2 Single-Node, Shifted-Force Simulations

The profile for the software-only version of the single-node, shifted-force simulation is given in Table 5.1. From this profile, it is clear that accelerating the computation of nonbonded forces would benefit the overall application most. But, before implementing the nonbonded force calculation in reconfigurable hardware, we need to determine if doing so will provide a significant speed-up for the overall application. To make this determination, we use the performance model of the last chapter to model the performance of possible reconfigurable computer implementations before implementing them.

Table 5.1: Shifted-force software profile for two benchmark simulations (% of computation time)

    Task                       Palmitic Acid   CheY Protein
    Nonbonded Forces               75.15           74.09
    Building Neighbor List         20.75           22.76
    Dihedral Torsion Forces         2.14            1.51
    Other                           1.96            1.64

[Figure 5.1: Hardware/software partitioning of the MD simulation application. The general purpose processor initializes the simulation, updates velocities and positions, builds the neighbor list, computes bonded forces, and updates accelerations; the reconfigurable hardware computes the nonbonded forces. Positions and the neighbor list flow to the reconfigurable hardware, and the nonbonded forces flow back.]

5.2.1 Node-Level Design

At the node level, the partitioning of the MD simulation algorithm between hardware and software is as shown in Figure 5.1. At each time step, the current atomic positions need to be transferred to the RH element's on-board memory. Each position vector has three components. So, if n is the number of atoms and b is the number of bytes per floating-point word, 3nb bytes must be transferred to the RH element's on-board memory. Further, the neighbor list must be transferred to the RH element, although it should be possible to stream it to the RH element, thus overlapping the communication with computation. The reconfigurable hardware must also have access to the types and charges of each atom, which are constant throughout the simulation but vary from simulation to simulation. Several other values that are constant throughout the simulation must be available as well, such as the size of the simulation box and the constants in the Lennard-Jones force equations.

As the nonbonded forces are calculated, they will be stored in the RH element's on-board memory. The force acting on each atom has three components, so the amount of data that needs to be stored is 3nb. At the end of the nonbonded force calculation, this data will be transferred back to the GPP element. Therefore, ignoring the time taken to transfer data that remains constant, since it is only transferred once, and ignoring the time to transfer the neighbor list, since we assume that it will be streamed, the total communication time C is 6nb/B_{GPP,RH} s/step, where B_{GPP,RH} is the bandwidth between the GPP element and the RH element. As an approximation, we have set the communication overheads, O_{GPP element, RH element} and O_{RH element, GPP element}, to zero because there are only two data transfers and they transfer large amounts of data. If there were many data transfers, this approximation might hurt the accuracy of the predicted performance [56]. Since only one task will be executed on the RH element, the RH element will only need to be configured once. Thus, we set R, the reconfiguration cost, to 0 in our estimates. The time spent configuring the RH element will be overlapped with computation on the GPP element.

5.2.2 RH Element-Level Design Alternatives

The basic design for performing the nonbonded force calculation task on the RH element is a direct translation of software into hardware.
The algorithm and architecture for such a design are shown in Figures 5.2 and 5.3, respectively. Though omitted in the figures, the type and charge of each atom are also used in the calculation. In the figures, positionOBM and forceOBM represent on-board memories. In Figure 5.2, CALC_NBF represents applying Equations 5.1 to 5.4, as well as other necessary techniques, such as the minimum image convention [1]. Note in Figure 5.2 that the algorithm makes use of the fact that \vec{f}_{ij} = -\vec{f}_{ji} in order to reduce the amount of computation. In this design, the two loops in the algorithm are pipelined such that in every cycle, either a new i atom is entering the pipeline or a new j atom is entering the pipeline.

     1  foreach atom i do
     2      r_i <- positionOBM[i]
     3      f_i <- forceOBM[i]
     4      foreach neighbor j of i do
     5          r_j <- positionOBM[j]
     6          if |r_i - r_j| < r_c then
     7              f_ij <- CALC_NBF(r_i, r_j)
     8              f_i <- f_i + f_ij
     9              f_j <- forceOBM[j]
    10              forceOBM[j] <- f_j - f_ij
    11          end
    12      end
    13      forceOBM[i] <- f_i
    14  end

    Figure 5.2: Algorithm for basic design

The pipeline makes use of fine-grained parallelism: any calculations that can happen in parallel, such as finding the differences between each component of the position vector, are done in parallel. If r_ij >= r_c, the results of the calculation are discarded at the end of the pipeline. In Figure 5.3, the force calculation box represents this pipeline.

[Figure 5.3: Architecture for basic design. A read controller feeds positions from positionOBM and the streamed neighbor list into the force calculation pipeline; a write controller stores the results to forceOBM.]

One problem with this design is that in order to subtract a newly calculated \vec{f}_{ij} from \vec{f}_j, \vec{f}_j needs to be read from the force memory. If, as is common in presently available reconfigurable computers, the on-board memory on the RH element is single ported, the pipelined architecture will have to stall for one cycle. There may also be penalties associated with switching the memory between read mode and write mode. Because of the lengthy pipelines in floating-point cores, this design is also susceptible to a read-after-write hazard. Specifically, the subtraction in line 10 of the algorithm may still be taking place for one i atom when another i atom causes the read in line 9 to execute. Since it is not possible to forward values within the floating-point cores, stalling will be necessary. The time required to do the calculations with this design, in cycles, is given in Equation 5.8, where l is the length of the neighbor list and s is the average number of cycles of stalling due to conflicting memory accesses and hazards. Put another way, on average, the pipeline produces one new result every s cycles.

    T_{RH} = s\, l    (5.8)

A second alternative is to use a write-back design [105].¹ The algorithm for this design is given in Figure 5.4. This design makes use of the fact that an atom j occurs in atom i's neighbor list at most once. The loop beginning on line 5 is pipelined, as is the loop beginning on line 16. The outer loop is not pipelined. The key is that the new \vec{f}_j values are not stored directly into on-board memory but instead are stored temporarily in on-chip memory (forceRAM). They are not written back to on-board memory until all of the neighbors of the current i atom have been processed.

¹ While [105] mentions the concept of a write-back design to reduce time spent going to memory, no comparison of the write-back design's performance against alternate designs is given.
     1  foreach atom i do
     2      r_i <- positionOBM[i]
     3      f_i <- forceOBM[i]
     4      n <- 0
     5      foreach neighbor j of i do
     6          r_j <- positionOBM[j]
     7          if |r_i - r_j| < r_c then
     8              f_ij <- CALC_NBF(r_i, r_j)
     9              f_i <- f_i + f_ij
    10              f_j <- forceOBM[j]
    11              forceRAM[n] <- f_j - f_ij
    12              n <- n + 1
    13          end
    14      end
    15      forceOBM[i] <- f_i
    16      foreach f_j in forceRAM do
    17          forceOBM[j] <- f_j
    18      end
    19  end

    Figure 5.4: Algorithm for write-back design

Because force calculation for a new i atom cannot begin until after all of the j atoms from the previous i atom's neighbor list have been written back to memory, the read-after-write hazard is avoided. The conflicting memory accesses of the basic design are also avoided. The number of times that the on-board memory switches modes is also minimized. The drawback to this design is that it requires that the pipeline be drained after the processing of each i atom. Thus, the pipeline fill-up latency is incurred for each i atom.

To lessen the number of pipeline stages that must be drained after each i atom, the pipelined inner loop can be broken into two separate pipelines joined by a FIFO. This FIFO must be large enough to hold all of the neighbors of two (or more) consecutive i atoms. The first pipeline reads the position, charge, and type data for each neighbor atom j and calculates \vec{r}_{ij}. If r_{ij}^2 < r_c^2, then \vec{r}_{ij}, the charge of j, and the type of j are written to the FIFO. Note that there are no dependencies at this stage, so this pipeline can write new values to the FIFO as long as there is space. When the FIFO is full, it waits until the second pipeline has processed data before it starts writing new values. The second pipeline reads data from the FIFO and does the remaining steps in the force calculation. This pipeline must be drained after each new i atom that it reads from the FIFO. Since the first pipeline should produce data faster than the second pipeline can process it, there should be data ready in the FIFO whenever the second pipeline is ready to begin processing it. Effectively, then, it is only the second pipeline that must be drained and whose fill-up latency is incurred for each atom.

The time required to do the calculations with this design, in cycles, is given in Equation 5.9, where n is the number of atoms, N is the average number of neighbors for each atom, p is the number of pipeline stages (if two pipelines are used, p can be considered the number of stages in the second pipeline), and s is the number of cycles required to switch an on-board memory from read mode to write mode and vice versa. The first term in Equation 5.9 gives the number of cycles spent in the (second) pipeline. The second term gives the number of cycles spent writing data back to memory, assuming these writes can be pipelined. The third term gives the number of cycles spent reading i atoms' data from memory. The last term gives the number of cycles spent switching the on-board memory between read and write modes.

    T_{RH} = n(N + p) + n(N + 1) + n + 2sn    (5.9)

The last design alternative does not exploit the fact that \vec{f}_{ij} = -\vec{f}_{ji}. The algorithm is the same as that in Figure 5.2, except without lines 9 and 10. This design requires twice as many force calculations but removes the hazards and the need to read from the force memory. A caveat is that using this method will require the use of a second neighbor list: if j is a neighbor of i in the original list, then i must be a neighbor of j in the second neighbor list.
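To make the cost of that second list concrete, the following C sketch mirrors a half neighbor list (each pair stored once, under the smaller index) into a full list on the GPP. The flat-array layout and the function name are assumptions made for illustration and do not reproduce the packed list format used in the actual implementation.

    /* Mirror a half neighbor list into a full one: if j appears in i's half
     * list, then i is added to j's full list and j is kept in i's.
     * half_start has n_atoms + 1 entries delimiting each atom's neighbors in
     * half_list; full_count, full_start (n_atoms + 1 entries), and full_list
     * must be allocated by the caller. */
    static void mirror_neighbor_list(int n_atoms,
                                     const int *half_start, const int *half_list,
                                     int *full_count, int *full_start,
                                     int *full_list)
    {
        /* First pass: count how many neighbors each atom has in the full list. */
        for (int i = 0; i < n_atoms; i++)
            full_count[i] = 0;
        for (int i = 0; i < n_atoms; i++)
            for (int k = half_start[i]; k < half_start[i + 1]; k++) {
                full_count[i]++;
                full_count[half_list[k]]++;
            }

        /* Prefix sum turns the counts into offsets into full_list. */
        full_start[0] = 0;
        for (int i = 0; i < n_atoms; i++)
            full_start[i + 1] = full_start[i] + full_count[i];

        /* Second pass: scatter each half-list pair (i, j) into both lists,
         * reusing full_count as a per-atom write cursor. */
        for (int i = 0; i < n_atoms; i++)
            full_count[i] = full_start[i];
        for (int i = 0; i < n_atoms; i++)
            for (int k = half_start[i]; k < half_start[i + 1]; k++) {
                int j = half_list[k];
                full_list[full_count[i]++] = j;
                full_list[full_count[j]++] = i;
            }
    }

Even in this simple form, the extra passes over the list are the cost captured as T_{BNL,GPP} in the timing estimate that follows.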
The time required to do the calculations with this design, in cycles, is given in Equation 5.10, where T_{BNL,GPP} is the amount of time it takes to build the extra neighbor list on the general purpose processor. While this latter time is incurred by a task taking place on the general purpose processor, we include it here to emphasize that this method requires building the second neighbor list.

    T_{RH} = 2l + T_{BNL,GPP}    (5.10)

Note that Equations 5.8, 5.9, and 5.10 assume that there is enough bandwidth between the reconfigurable hardware and its on-board memory to read all three components of the positions, as well as the types and charges, in parallel, while writing all three components of the forces in parallel.

5.2.3 Choosing a Design for the Implementation

We now use Equations 5.8, 5.9, and 5.10 to determine which design we should use in our reconfigurable computer implementation. We are only able to do the nonbonded force computation in single-precision arithmetic on the reconfigurable hardware. Using double-precision arithmetic would require more logic than is available in the FPGAs in the MAPstation. Even if the amount of logic were sufficient, there would not be enough on-board memory to store all of the necessary data. The MAPstation has eight banks, each of which is eight bytes wide, which is not enough to hold position, force, charge, and type data and still have a bank available for streaming the neighbor list. Thus, we focus on single-precision arithmetic, and b, the number of bytes in a floating-point word, is four.

An important property of the MAPstation is its requirement that all hardware designs run at 100 MHz. While this does not guarantee that any of the above designs will run at 100 MHz, 100 MHz is the best possible frequency, so we use it in estimating times. Two other important properties are that the on-board memory banks are single ported and that switching a memory from write mode to read mode and vice versa requires four dead cycles [108].

We use the timing information from our palmitic acid simulation to make our decisions. From profiling, we have found that, on average, a single step of the velocity Verlet algorithm takes 1.27 s to execute in software. T_{GPP tasks} is 0.32 s/step and will be the same for each of the architectures. C will also be the same no matter which architecture is chosen. We want to stripe the three 4-byte position components over two memory banks (each 8 bytes wide) so that all three can be accessed in parallel. In order to achieve this, we must transfer four words for each set of components: the three position components and one extra word. A similar situation arises in writing the forces back to the GPP element. So, rather than transferring 6n words, we actually transfer 8n words per step. With the 52558 atoms in the palmitic acid simulation, this translates to transferring 1.7 MB per step at the MAPstation's transfer rate of 1400 MB/s, requiring only about 1.2 ms per step [108].

The basic design produces poor results. Because of the memory dependencies, the pipeline will only produce a result once every 10 cycles. The neighbor list in the palmitic acid simulation has about 17 million entries, so the number of cycles required per step is 170 million. This leads to an overall time of about 2 s/step, which is a slowdown relative to the software-only version.

The write-back design fares better. From our knowledge of the floating-point cores and the structure of the force calculation pipeline, we know that the pipeline will be very long, about 200 stages.
Through profiling the simulation, we know that the average number of neighbors close enough to interact is 192. With this and other information about the node, we find that, even with the long pipeline, the anticipated total time is 0.63 s/step, a 2× speed-up over the software implementation.

The third design is hampered by the need to build a second neighbor list and gives a speed-up that is lower than that of the write-back approach. The time spent in force computation is only 0.34 s/step, but every 10 steps the neighbor list is rebuilt. Building the second neighbor list contributes an average of 0.27 s/step extra. The total is then 0.93 s/step, only a 1.4× speed-up. Even if data could be reorganized in less time than would be required to build a second neighbor list, this method would still have the practical problem that it requires floating-point accumulation of multiple sets of varying sizes in the hardware. Performing such accumulations is still an active area of research, as they are difficult to do without stalls to the floating-point pipeline, especially from a high-level language like C for the Carte compiler [72]. Clearly, then, the alternative to use is the write-back approach.

5.2.4 Implementation and Experimental Results

For implementation on the MAPstation, we made a few modifications to the traditional techniques. The neighbor list is usually implemented as a long list of atoms with a separate array that points to the start of a particular atom's set of neighbors. To reduce the number of separate arrays, we instead insert the atoms into the neighbor list right before their respective sets of neighbors. We use the most significant bit of the word as a flag to denote which atoms start new sets. To save memory banks in the RH element, we do not store the atom type array in its own on-board memory bank. Instead, we pack the type and the atom into the neighbor list in one 64-bit word. This does not adversely affect the communication time because we are streaming in the neighbor list in 64-bit words anyway. We pack the charge in the same array as the positions because, as mentioned above, we must transfer four 4-byte words, instead of three, every step anyway. The data layout in on-board memory is shown in Figure 5.5.

[Figure 5.5: Data layout in on-board memory. The position components r_x, r_y, r_z are packed with the charge q and striped across two banks; the force components f_x, f_y, f_z are striped across two banks; the neighbor list, with the atom types packed in, occupies further banks feeding the FPGA. Solid lines in the original figure denote borders of memory banks, and dashed lines show data packed into one word.]

The first time the nonbonded force calculation is called, the constants for the simulation are transferred to the FPGA. The dimensions of the simulation box are stored in registers. The constants for the Lennard-Jones force calculation are stored in four block RAMs, one each for the A constants, the B constants, and the two force/potential shift constants. Each time the nonbonded force calculation task is executed, the position and charge data is transferred before calculation starts. Due to limitations with the MAPstation's streaming DMA, we are unable to stream the neighbor list. Instead, we use regular DMAs but overlap them with computation, as sketched below. We first DMA a section of the neighbor list into two on-board memory banks. While the force calculation is processing that section of the neighbor list, we DMA the next section of the neighbor list into another two on-board memory banks. We switch back and forth between reading and writing memory banks so that the computation is overlapped with communication.
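The ping-pong structure just described can be sketched as follows. dma_start(), dma_wait(), and process_section() are hypothetical placeholders standing in for the Carte DMA primitives and the pipelined force loop (they are not the actual Carte calls), with dma_start() assumed to be asynchronous.

    #include <stdint.h>

    /* Hypothetical helpers standing in for the real DMA and compute routines. */
    extern void dma_start(int bank_pair, const uint64_t *src, long n_words);
    extern void dma_wait(int bank_pair);
    extern void process_section(int bank_pair, long n_words);

    /* Overlap neighbor-list transfers with force computation by alternating
     * between two pairs of on-board memory banks. */
    static void stream_neighbor_list(const uint64_t *nlist, long n_words, long sec)
    {
        int  buf = 0;                                /* which bank pair: 0 or 1   */
        long cur = (sec < n_words) ? sec : n_words;

        dma_start(buf, nlist, cur);                  /* prefetch first section    */
        for (long done = 0; done < n_words; ) {
            long next = n_words - (done + cur);      /* size of the next section  */
            if (next > sec)
                next = sec;

            dma_wait(buf);                           /* current section resident  */
            if (next > 0)                            /* overlap the next transfer */
                dma_start(1 - buf, nlist + done + cur, next);

            process_section(buf, cur);               /* compute on current banks  */
            done += cur;
            cur  = next;
            buf  = 1 - buf;                          /* swap bank pairs           */
        }
    }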
For intermediate storage, we use block RAMs. From profiling our simulations, we found that no atom had more than 750 neighbors. The block RAMs in the target Xilinx FPGA hold 512 32-bit words each. Thus, for the FIFO between the distance- and force-calculation pipelines, we use four block RAMs each for the three components of the distance vector, the squared magnitude of the distance vector, the type of the atom, and the charge of the atom; in total, we use 24 block RAMs to implement the FIFO. In the force calculation pipeline, we use two block RAMs per force component as intermediate storage. We use another two block RAMs to store the index values of the neighbors. When the forces are written back to on-board memory, they are striped across two on-board memories. Upon completion of the nonbonded force calculation, the forces are DMAed back to the GPP.

The code is written in C for the SRC Carte compiler. The floating-point cores provided by SRC are used to carry out the floating-point arithmetic. To accumulate the force acting on an i atom, we use three of the accumulator floating-point cores provided by SRC, one for each component of the force. We use another accumulator core to accumulate the potential energy. This floating-point core allows one new input to be presented to the accumulator per clock cycle. When a new accumulation is to start (in our case, when force calculation for a new i atom begins), the accumulator pipeline must be flushed. This flushing of the accumulator pipeline does not lead to any extra delays in our write-back design because, due to the dependencies described above, the pipeline must be flushed when force calculation on a new i atom is to begin anyway. Note that calculating the forces on the j atoms does not require an accumulator, only an adder, because the force on a j atom is updated at most once for any given i atom.

We use Carte version 2.1 together with the Xilinx ISE tools version 7.1.04 and the Intel C compiler version 8.1 to generate the executable. We use only one of the two Xilinx Virtex-II XC2V6000-4 FPGAs on the MAP. The FPGA area of the nonbonded force calculation is 33,634 slices, which is 99% of the available slices. 28% of the available block RAMs and 77% of the available multipliers are used. The required 100 MHz frequency is met.

5.2.4.1 Performance

We ran each of the simulations on the MAPstation for 1000 steps and compared the results to the same simulations running in software only on the MAPstation. Table 5.2 shows the results. For the palmitic acid simulation, we achieve a 2× speed-up, while for the CheY protein simulation, we achieve a 1.9× speed-up. This is noteworthy because we achieve these speed-ups despite leaving parts of the simulation in software and not using the GPP element and RH element in parallel. The difference in speed-up between the two simulations arises because the tasks left in software account for a greater percentage of the simulation time, relative to the task mapped to hardware, in the CheY protein simulation than in the palmitic acid simulation (see Table 5.1). The 2× speed-up for palmitic acid is the same as was predicted in Section 5.2.3. The time for the simulation was slightly longer than predicted, likely due to overheads unaccounted for in the performance modeling, such as DMA start-up costs and the cost of transferring the first sections of the neighbor list.

These speed-ups are influenced by the parameters of the simulation.
For example, were we to build the neighbor list less frequently than once every 10 steps, the speed-ups achieved would be greater because building the neighbor list would account for less of the overall simulation time. The neighbor list cutoff distance can also affect the application's profile, as can other physical properties, such as density. In a more dense system, the atoms would have more neighbors, which would increase the utilization of the pipeline before it is drained, increasing performance. A more dense system would also require that more block RAMs be devoted to the storage of intermediate results.

Table 5.2: MAPstation performance results for single-node, shifted-force simulations

                      Latency (s/step)
    Simulation        SW only   SW + HW   Speed-up
    Palmitic Acid        1.28      0.66        2.0
    CheY Protein         0.74      0.39        1.9

5.2.4.2 Scalability

In the purely software domain, the scalability of MD simulations is judged by the speed-ups obtained by spreading the simulation over multiple nodes, where each node executes all tasks in the velocity Verlet algorithm. This type of scaling is certainly possible with reconfigurable computers; indeed, we investigate it in Section 5.4. Here, we deviate from our standard ASPMD model and examine a different kind of scalability: having one GPP element but multiple RH elements.

As in the current implementation, all tasks in the simulation except for nonbonded force calculation are executed on the GPP element. The constants are communicated to all of the RH elements during the first step of the simulation. At each step of the simulation, the position and charge data are transferred to all of the RH elements. If there are R RH elements, then each one is given approximately 1/R of the neighbor list to process. Upon completion of calculating the nonbonded forces for its portion of the neighbor list, each RH element then transmits the forces back to the GPP element. For each atom i, the GPP element calculates \vec{f}_i = \sum_{r=1}^{R} \vec{f}_{ir}, where \vec{f}_{ir} is the force on atom i calculated by RH element r.

[Figure 5.6: Performance estimates for using 0 to 4 RH elements. Time per step (s) is broken into RH time, communication time, and software time as the number of RH elements varies from 0 to 4.]

Figure 5.6 shows the estimated time per step as the number of RH elements increases from 0 to 4. Equation 5.11 was used to produce the estimates. Note that the communication time is so small that it is barely visible. We ignore the cost of the final summation mentioned above and of deciding how to split the neighbor list among the RH elements, as the time for these computations will be dwarfed by the other software tasks. Clearly, once more than 2 RH elements are employed, the tasks in software become dominant.

    T = T_{SW tasks} + R\, T_{comm} + \frac{T_{RH}}{R}    (5.11)

5.3 Single-Node, Particle-Mesh-Ewald Simulations

Using PME does not change the basic structure of the simulation. The velocity Verlet algorithm, the bonded force calculation, and the Lennard-Jones portion of the nonbonded force calculation remain the same. The major changes are in the electrostatic nonbonded force calculation, as described in Section 5.1.1. Comparing the shifted
The reciprocal-space calculation, however, is wholly new and requires 3D FFTs, interpolations, and the multiplication of 3D arrays. The introduction of these kernels has the potential to signicantly af- fect the application prole, so we begin by examining the prole of a software-only implementation of MD simulation with PME. Following [29] and referencing the relevant functions in NAMD and GROMACS, we developed an implementation of the PME technique that was integrated into the simulation program described in the last section. The computation of erfc(x) and its derivative are implemented with table lookup and interpolation. 5.3.1 Partitioning The application prole for the two benchmark simulations is shown in Table 5.3. Clearly, the real-space part of the force calculation, which includes the shifted-force Lennard-Jones force calculation, is taking the most time. In the palmitic acid simu- lation, the next-most time is taken by the reciprocal-space part of the PME calcula- tion, which includes the interpolation and both FFTs. Building the neighbor list takes roughly half as much time, and the rest of the tasks even less. So, based on the palmitic acid simulation, it seems that we should focus our efforts on the real- and reciprocal- space parts of force calculation to obtain the highest speed-up. 116 Table 5.3: PME software prole for two benchmark simulations Palmitic Acid CheY Protein Task Time (s/step) % Computation Time Time (s/step) % Computation Time Real-space 1.56 70.25 0.93 74.42 Reciprocal-space (total) 0.34 15.40 0.10 7.80 FFTs 0.16 7.03 0.01 0.93 Assign charges 0.08 3.76 0.04 3.51 Find forces 0.06 2.61 0.03 2.09 Find B-spline coeffs. 0.02 1.07 0.01 1.18 Other 0.02 0.92 < 0.01 0.10 Building Neighbor List 0.19 8.55 0.13 10.80 Other 0.13 5.79 0.09 6.96 117 In the CheY protein simulation, the construction of the neighbor list takes the second-most time in the simulation, while the reciprocal-space part of the force calcu- lation takes the third-most time. Note that in the CheY protein simulation, the FFTs take hardly any time. This occurs because the 64 64 64 grid used by the CheY protein simulation causes very few cache misses in the GPP's cache, leading to very high FFT performance. In general this will not be the case and the FFTs and, conse- quently, the reciprocal-space part of force calculation, will take a higher percentage of the time. Thus, we focus our acceleration study on the real- and reciprocal-space parts of the force calculation and leave the remaining tasks, including construction of the neighbor list, in software. As mentioned above, the real-space part of the calculation is very similar to the cal- culation in the shifted-force approach, with the introduction of an erfc(x) computation and an e x 2 computation as part of the derivative of erfc(x). These two terms present a challenge for hardware implementation because there are no existing oating-point cores to calculate them. However, they can be calculated through table lookup and interpolation and we describe such an implementation in Section 5.3.2.1. Thus, with- out too much change, we can accelerate the real-space part in a similar fashion to the way the shifted force approach is accelerated in the last section. We will obtain sim- ilar performance as well, which will be a signicant speed-up over the software-only implementation of the real-space part. For example, the real-space part in the palmitic acid simulation should take only about 0.34 s/step, a 4.6 speed-up over the software- only version [94]. 
Similarly, the real-space part of the CheY protein simulation should take only about 0.19 s/step, a 4.8× speed-up over the software-only version. Clearly, this task should be executed in hardware.

For the reciprocal-space part of force calculation, one option is to not accelerate it and to leave it in software with the rest of the simulation. Because they are not dependent upon one another, we can exploit node-level parallelism and have the reciprocal-space part execute on the GPP element and the real-space part execute on the RH element. Figure 5.7 shows this partition. In the figure, VU and PU represent the velocity and position update tasks, respectively; Recip and Real represent the reciprocal-space part and the real-space part of the force calculation, respectively; BF and Corr represent the bonded force calculation and force correction, respectively; and BNL represents building the neighbor list, with the dashed-line box indicating that this step is only executed on select iterations of the algorithm. The length of time for the two parallel tasks to complete is equal to the maximum of the time the reciprocal-space part takes to execute in software and the time the real-space part takes to execute in hardware. For the palmitic acid simulation, we estimate that the two parts of the simulation will take about the same time. For the CheY protein simulation, the real-space part of the simulation will still be longer than the reciprocal-space part, but not as much longer.

[Figure 5.7: Assignment of MD application tasks to the RH element and the GPP element. The task abbreviations are defined in the text; Real runs on the RH element while the remaining tasks, including Recip, run on the GPP element.]

The other option is to accelerate the reciprocal-space part in hardware as well. The reciprocal-space part of the calculation consists of several subtasks: finding interpolation coefficients, assigning charges to the grid, forward and backward FFTs, and so forth. None of these kernels individually is responsible for much of the simulation's overall computation time. To effectively accelerate the entire application, then, the whole reciprocal-space part must be accelerated. The pipeline to do the real-space part of the calculation and the pipeline to do the reciprocal-space part of the calculation cannot both fit in the FPGA at the same time because of area limitations. So, only one will be able to execute on the FPGA at a time, and the FPGA will have to be reconfigured each time either of the functions is called. Reconfiguration takes about 0.05 seconds, so R equals about 0.1 seconds per step [108]. For the CheY protein simulation, this is not a solution since the reciprocal-space part only takes 0.1 s/step. Even for the palmitic acid simulation, considering the added reconfiguration cost, the reciprocal-space part of the computation may not be significantly accelerated with a custom hardware design.

For this study, we have chosen the first partitioning scheme; that is, the real-space part executes in hardware in parallel with the reciprocal-space part executing in software. Before implementation, though, we estimate the performance using the model of Chapter 4. Take the palmitic acid simulation as an example. Based on profiling, the tasks in the first and last fork-join sections are estimated to take a total of 0.32 s/step. Also based on profiling, the reciprocal-space part is estimated to take 0.34 s/step. The hardware implementation of the real-space part is estimated to take only 0.34 s/step because it is very similar to the design in the previous section.
Thus, the time for the second fork-join section is max(0.34, 0.34) = 0.34 s/step. As described in the last section, for a 52558-atom simulation, the communication time C is estimated to be only 1.2 ms, given the position data and force data being transferred and the transfer rate of the MAPstation. Since the RH element will only be configured once, we set R = 0. The total time is, therefore, estimated to be 0.66 s/step, leading to an estimated speed-up of 3.36×.

5.3.2 RH-Element Design for the Real-Space Part

We use the same write-back design as in the last section, except that the equations are adjusted so that the real-space part of the Coulomb force is calculated instead of the shifted-force version of the Coulomb force. In order to facilitate these new calculations, we need to implement erfc(x) and e^{-x^2}.

5.3.2.1 erfc(x) and e^{-x^2}

There are currently no libraries available for performing transcendental functions such as erfc(x) and e^{-x^2} in floating point on FPGAs. Thus, we had to develop our own implementations. We use table lookup and linear interpolation to calculate these two functions to five digits of precision. This method works particularly well on FPGAs because the table can be stored in on-chip memory. In MD, we are also able to bound the inputs to the two functions; that is, we know ahead of time in what range of values the x in erfc(x) and e^{-x^2} lies. In the case of MD, x = α|\vec{r}_i - \vec{r}_j + \vec{n}|. In our simulations, we utilize the common technique of setting α such that the real-space energy is approximately 0 at the cutoff distance, r_c. Thus, it is guaranteed that for any pair of atoms i and j, 0 ≤ x ≤ αr_c. For a cutoff distance of r_c = 10 Å, as is used in our simulations, this limits the range of inputs to 0 ≤ x ≤ 3.6. Both erfc(x) and e^{-x^2} are well-behaved functions in this region. The table lookup and interpolation steps are accomplished using fixed-point arithmetic and shifting, which is much more area efficient than floating-point arithmetic. Any input larger than αr_c returns erfc(αr_c) ≈ e^{-(αr_c)^2} ≈ 0, and any input smaller than a threshold returns erfc(0) = e^{-0} = 1.

5.3.3 Reconfigurable Computer Implementation Details and Results

For implementation on the MAPstation, we use the same modifications to the traditional techniques as in the shifted-force version. When the program runs, it executes the velocity Verlet algorithm. During the force calculation step, two threads are created using pthreads [89]. One thread calls the reciprocal-space part's function to execute on the GPP element, and the other calls the hardware implementation of the real-space part to execute on the RH element. The first time the real-space part is called, all the necessary constants are transferred to the FPGA in the RH element and stored on-chip, either in registers or in embedded memories. These constants only need to be transferred once, as they remain on-chip throughout the simulation. Each time the real-space part is called, the position-charge array is DMAed to the FPGA's on-board memory. The neighbor list is effectively streamed to the FPGA by alternating DMAs between four on-board memory banks. At the end of the calculation, the forces are DMAed back to the GPP. When both threads have finished, the force calculation proceeds with the bonded force calculation and the force corrections.

We implemented MD with PME for the electrostatics, as described above, on the SRC 6 MAPstation.
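Returning to the erfc(x) and e^{-x^2} evaluation of Section 5.3.2.1, a minimal C sketch of the table-lookup-and-interpolate scheme is shown below. It works in floating point for clarity, whereas the hardware version performs the index computation and interpolation in fixed point with shifts; the table size and names here are illustrative stand-ins, not the actual 2048-entry, 17-bit tables described next.

    #include <math.h>

    #define TBL_SIZE 2048
    #define X_MAX    4.0          /* inputs at or above this return ~0 */

    static float erfc_tbl[TBL_SIZE + 1];    /* samples of erfc(x)      */
    static float gauss_tbl[TBL_SIZE + 1];   /* samples of exp(-x*x)    */

    /* Fill both tables once at start-up (done in software). */
    static void build_tables(void)
    {
        for (int k = 0; k <= TBL_SIZE; k++) {
            double x = X_MAX * (double)k / TBL_SIZE;
            erfc_tbl[k]  = (float)erfc(x);
            gauss_tbl[k] = (float)exp(-x * x);
        }
    }

    /* Table lookup with linear interpolation between adjacent entries. */
    static float table_eval(const float *tbl, float x)
    {
        if (x >= (float)X_MAX) return 0.0f;       /* beyond the cutoff region   */
        if (x <= 0.0f)         return tbl[0];     /* erfc(0) = exp(0) = 1       */
        float t = x * (TBL_SIZE / (float)X_MAX);  /* scale into table units     */
        int   k = (int)t;                         /* table index                */
        float f = t - (float)k;                   /* fraction for interpolation */
        return tbl[k] + f * (tbl[k + 1] - tbl[k]);
    }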
All of the hardware design, except for the table lookup and interpolation functions for erfc(x) and e^{-x^2}, was coded in C for the SRC Carte compiler. The erfc(x) and e^{-x^2} tables each have 2048 17-bit-wide elements. Each table can fit in two embedded memories in the target FPGA. 17-bit-wide values were chosen because, in addition to providing the required precision in the results, they allow each multiply done during the interpolation to use only one of the 18-bit × 18-bit embedded multipliers in the target FPGA. The values in the tables were generated by a separate software program. The thresholds for maximum and minimum input values are set at 4.0 and 2^{-23}, respectively. The maximum is set to 4.0, rather than 3.6, because 4.0 can be checked simply by looking at the exponent of the floating-point number. 2^{-23} is chosen as the minimum because any number smaller than that will have all its meaningful bits shifted out when the number is converted to fixed point. These circuits are coded in VHDL and synthesized using Synplicity Synplify Pro, version 8.1. The complete hardware design is mapped to the FPGA using Xilinx's ISE tools, version 7.1.04.

The results of the acceleration on the reconfigurable computer are shown in Table 5.4. We achieved a 2.92× speed-up and a 2.72× speed-up for the palmitic acid and CheY protein simulations, respectively, by moving the real-space part into hardware and executing it in parallel with the reciprocal-space part. The speed-up for palmitic acid is less than estimated above. A large part of this is due to a memory conflict between the parallel force calculation functions accessing the memory on the GPP element. The reciprocal-space part needs to access this memory as part of its regular calculation because, for example, the interpolation grid is too large to fit into caches. The real-space part needs to have the neighbor list DMAed to the RH element, which also requires the use of the memory bus. Thus, each part of the force calculation can execute more slowly than estimated because it is waiting for data. Our high-level model from Chapter 4 has no way of modeling this. This problem is less pronounced for the CheY protein simulation because there are fewer cache misses in the function and thus fewer accesses to main memory. Another factor contributing to the discrepancy between estimated and actual performance is that we ignored overheads such as DMA start-up costs and thread creation costs in our estimates.

Table 5.4: MAPstation performance results for single-node, PME simulations

                      Latency (s/step)
    Simulation        SW only   SW + HW   Speed-up
    Palmitic Acid        2.22      0.76       2.92
    CheY Protein         1.25      0.46       2.72

5.3.4 Partition Space

The introduction of the PME technique to the simulation program has added tradeoffs to the design space for partitioning the MD application between hardware and software. So far, we have discussed the profile for two simulations that use fourth order interpolation and have particular FFT sizes. However, the interpolation order and the size of the FFTs used in the simulation can significantly impact the application profile. As an example, we consider what would happen if the palmitic acid simulation used sixth-order interpolation and had a grid size of 256 × 192 × 512, instead of its current configuration. In that case, the speed-up from moving the real-space part to hardware drops to 1.30×. The reason is that the real-space part now only accounts for 33.79% of the overall simulation time, while the reciprocal-space part accounts for 58.44% of the overall simulation time.
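A quick bound (our own arithmetic, using the percentages just quoted and treating the remaining roughly 7.8% of the step as unchanged) makes the drop plausible: with the real-space part accelerated by some factor k and run concurrently with the reciprocal-space part in software, the step time can never fall below the reciprocal-space time, so

    S \;\le\; \frac{1}{(1 - 0.3379 - 0.5844) + \max\!\left(0.5844,\; \frac{0.3379}{k}\right)} \;\le\; \frac{1}{0.0777 + 0.5844} \;\approx\; 1.51

and the observed 1.30× sits well below even this optimistic ceiling.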
There is thus a tradeoff between interpolation order, FFT size, and speed-up. Figure 5.8 shows the speed-ups for various configurations of the palmitic acid simulation, where i is the interpolation order and f is the FFT size.

[Figure 5.8: Speed-ups of the palmitic acid simulation with various parameters. Configurations shown: i=4, f=128 × 96 × 256; i=6, f=128 × 96 × 256; i=4, f=256 × 192 × 512; i=6, f=256 × 192 × 512.]

Another way to look at this is that changing the parameters changes the decision about which part or parts of the simulation should be moved into hardware. In the case described above, in which the reciprocal-space part takes more time than the real-space part, it would be advantageous to accelerate the reciprocal-space part in hardware instead of the real-space part. Or, it may in fact be better to accelerate both parts of the force calculation in hardware.

The profile would also change if, through further optimization, the real-space part took less time to execute. For instance, GROMACS uses optimizations based upon the types of molecules interacting in order to reduce the number of operations in the real-space part of the force calculation. On x86 platforms, assembly-coded routines are used to increase the performance even further [118]. Increasing the performance of the real-space part of the force calculation reduces its contribution to the overall computation time, which in turn may make accelerating other parts of the simulation more attractive.

One more case in which we may change our decisions about what should and should not be moved to hardware is the case of a highly parallel simulation implementation. In this case, the number of atoms per node may be small because the atoms are distributed among many nodes. However, it can be difficult to effectively parallelize the 3D FFT. FPGAs have demonstrated high performance in 1D FFT computations, especially large FFTs [46]. There may be a partitioning in which a small number of FPGAs perform the FFTs needed in the reciprocal-space part while the rest of the calculations are distributed among a large number of GPPs. Whether or not such a partitioning is practical will depend upon the performance of FPGAs in executing floating-point 3D FFTs, which is not well studied at this time.

5.4 Multi-Node, Shifted-Force Simulations

Thus far, we have employed fine-grained, element-level parallelism and estimated the effects of coarse-grained, element-level parallelism in Section 5.2, and we have employed fine-grained, element-level parallelism and node-level parallelism in Section 5.3. In this section, we will still employ fine-grained, element-level parallelism, but for the first time we will investigate system-level parallelism. In order to do so, we must parallelize our MD simulation.

5.4.1 Parallelization

There are three main techniques for SPMD parallelization of MD: atom decomposition, force decomposition, and spatial decomposition [88]. For this work, we have chosen to use spatial decomposition because of its good scaling properties for large numbers of atoms and nodes [88] and because of the availability of reference code [78]. A study of the benefits of the other parallelization techniques when using the ASPMD model is left for future work. In the following description of spatial decomposition, we mainly reference [78].

Spatial decomposition partitions the system to be simulated into subsystems of equal volume. Each of the subsystems is assigned to a node.
This node is responsible for doing all the steps of the velocity Verlet algorithm for all of the atoms that reside on it (those in its subsystem, referred to as local atoms). The simplest partition of the system is to partition it into a three-dimensional mesh with P = P_x × P_y × P_z nodes in the x, y, and z dimensions, respectively. A two-dimensional example in which P_x = 4 and P_y = 3 is shown in Figure 5.9. In the figure, the outer box with the solid lines is the periodic box of the simulation system, the dashed lines denote the subsystems that are each assigned to one of the 12 nodes, and the black circles represent the atoms.

[Figure 5.9: Two-dimensional example of spatial decomposition, with P_x = 4 and P_y = 3.]

Once the assignment of subsystems to nodes is made, each node can proceed almost independently, performing all steps of the simulation for its local atoms. Communication is only required for two tasks: atom caching and atom migration. Atom caching is required because the atoms at the edge of a subsystem may interact with atoms in the (periodic) neighboring subsystems. So, the positions of atoms along the edges of the subsystems must be communicated to the neighboring subsystems at each time step. Atom migration is necessary simply because atoms move in physical space during the simulation. Consequently, the subsystem in which an atom resides may change over the course of the simulation. When this happens, the information about the atom, such as its position and velocity, must be transferred from one node to another.

5.4.2 Implementation

5.4.2.1 Software

We modified our software MD program to implement spatial decomposition as described above, using the code from [78] as a reference for the partitioning and communication. In our parallel software MD program, we still use the neighbor list technique and still support the same features, except for PME and RATTLE. Parallelizing these two tasks is beyond the scope of our work. Instead of PME, we use the shifted-force calculation as in Section 5.2. Since we no longer have RATTLE to constrain the lengths of the hydrogen bonds, we must use smaller time steps; in our case, we use a time step of 0.1 fs. Because of this change, we could use a smaller neighbor list cutoff length or rebuild the neighbor list less frequently, but, for consistency with the other simulations, we do not take advantage of these possible optimizations.

For simplicity, we replicate the constants in the simulation (type, mass, charge, exclusion list, bonded force calculation lists) on each node. Doing this avoids some extra communication at the cost of using extra memory at each node. For simulations the size of our benchmarks, there is abundant memory, so this is not a problem. We do, however, have to assign each atom a global ID that it takes with it during atom migration and caching so that it can access the arrays holding the constants. This is similar to the passport of [88].

The reference code uses the linked cell list but not the neighbor list, so we had to modify it to use the neighbor list. Every x steps, where x is user-defined, the neighbor list is rebuilt using the linked cell list. Before the neighbor list can be built, though, atom migration must take place. Each subsystem scans its atoms, and the ones that have moved out are sent to the appropriate neighbor subsystems. By synchronizing the communication correctly, each subsystem is required to do only six sends and six receives to communicate the information about the migrating atoms [78, 88].
This communication is implemented using the Message Passing Interface (MPI) [76]. After atom migration, atom caching takes place using the same communication scheme as atom migration. Any atoms that are within the neighbor list cutoff distance of the edge of their subsystem are cached on the neighboring subsystems. The list of atoms that are sent is saved so that it can be used for the atom caching that takes place during steps in which a new neighbor list is not created. Once the atoms are cached, a linked cell list is created and the neighbor list built from that, as in the sequential simulation. This neighbor list is used for nonbonded force calculation, as before.

On a given node, only the forces on the local atoms need to be calculated and kept track of. Consequently, both bonded and nonbonded force calculation must be modified slightly. In bonded force calculation, the lists of groups of atoms that take part in bonded interactions with one another are not addressable by atom. Instead, they are lists of bonds/angles/dihedrals, where each bond/angle/dihedral in the list holds the indices of the atoms that take part in it. So, during bonded force calculation, each node scans the complete lists of bonds/angles/dihedrals. If one or more of the atoms in a bond/angle/dihedral resides on a given node, the bond stretch/angle bend/dihedral torsion force calculation for that bond/angle/dihedral takes place on that node, and only the forces on the local atoms are stored. If not, that bond/angle/dihedral is skipped. A similar method is used for the exclusion calculations.

Nonbonded force calculation also requires slight modification. If both atoms that are interacting are local atoms, the computation is as before. If one of the two is not, then only the force on the local atom is computed and stored.

5.4.2.2 Hardware

The basic design for the hardware remains the same as in Section 5.2. In hardware, rather than computing the forces conditionally based upon whether or not an atom is local, the forces on each atom in every pair are computed. However, only the forces for the local atoms are sent back to the GPP element.

5.4.3 Experimental Results

5.4.3.1 Simulations on GPP-Only Nodes

We ran the simulation program on P = 2^k nodes, k = 0, ..., 4. The single-node version was run on the 2.8 GHz Xeon in the GPP element of the SRC 6 MAPstation. LAM/MPI 7.1.2 was the MPI implementation [14, 107]. Note that this is a single node, so the particular MPI implementation has little effect. A cluster with the same processors, same compiler, and same MPI library was not available, so we had to estimate the performance for multi-node runs. We ran the simulations for k = 1, ..., 4 on a cluster in which the nodes have Myrinet interconnect [77] and the MPI implementation is Open MPI [33]. We measured the communication time using calls to MPI's wall-clock time function, MPI_Wtime(). For computation time, we ran the simulations for k = 1, ..., 4 on the two processors of the MAPstation. We profiled the simulations using Oprofile 0.9.1, having it keep a separate profile for each MPI process. We then found the average computation time per step per node. Using this computation time and the communication time from the cluster runs, we estimated the performance of
the program running on clusters in which the processors are 2.8 GHz Xeons, Myrinet is the interconnect, and Open MPI is the MPI implementation. The profiling results for the benchmark simulations are shown in Tables 5.5 and 5.6. The scalability of the program running the benchmark simulations is shown in Figure 5.10. Because of the averaging, we believe that these estimates are somewhat optimistic.

Table 5.5: Estimated profile of the parallel palmitic acid simulation
(Atoms/Node columns: Max./Min./Avg.; Time columns in s/step: Comp./Comm./Total; last column: % of time in nonbonded force calculation)

    Nodes (P_x, P_y, P_z)    Max.    Min.    Avg.    Comp.    Comm.   Total   % Nonbonded
     1 (1, 1, 1)            52558   52558   52558    1.34    < 0.01    1.34        73
     2 (2, 1, 1)            26279   26279   26279    0.70      0.02    0.72        71
     4 (2, 2, 1)            13208   13071   13140    0.35      0.02    0.37        67
     8 (4, 2, 1)             6812    6344    6570    0.16      0.06    0.22        58
    16 (4, 4, 1)             3494    3074    3285    0.08      0.09    0.17        29

Table 5.6: Estimated profile of the parallel CheY protein simulation
(Atoms/Node columns: Max./Min./Avg.; Time columns in s/step: Comp./Comm./Total; last column: % of time in nonbonded force calculation)

    Nodes (P_x, P_y, P_z)    Max.    Min.    Avg.    Comp.    Comm.   Total   % Nonbonded
     1 (1, 1, 1)            32932   32932   32932    0.73    < 0.01    0.74        73
     2 (1, 2, 1)            16498   16434   16466    0.40      0.01    0.41        68
     4 (1, 2, 2)             8294    8190    8233    0.19      0.01    0.20        65
     8 (2, 2, 2)             4174    4078    4117    0.08      0.02    0.09        56
    16 (2, 4, 2)             2183    1956    2058    0.03      0.04    0.07        32

The tables show that, as expected, as the number of nodes increases, the computation time decreases but the communication time increases. For both simulations, when there are 16 nodes, the communication time is longer than the computation time. The percentage of time spent in nonbonded force calculation also decreases, as can be seen from the last column of the tables. This decline is mainly due to the rise in communication time. While not shown in the tables, our experiments show that, as a percentage of computation time only, the time spent in nonbonded force calculation does drop, but not as drastically. For instance, for the 16-node case, nonbonded force calculation takes 65% and 68% of the computation time for the palmitic acid and CheY protein simulations, respectively. Nonetheless, the drop in the percentage of time spent in nonbonded force calculation will affect the scalability of the hardware-accelerated clusters.

[Figure 5.10: Estimated scalability of the parallel MD program. Speed-up versus number of nodes (1 to 16) for the palmitic acid and CheY protein simulations.]

5.4.3.2 Simulations on Accelerated Nodes

The single-node version of the simulation was run on the MAPstation, and the two-node version was run on two MAPstations in which the interconnect was gigabit Ethernet and LAM/MPI 7.1.1 was the MPI implementation. The four-node version of the simulation was run on a similar system, except that two of the nodes were the newer SRC 7 MAPstations, which have newer, faster FPGAs. However, having these faster MAPstations likely had very little effect on the overall time because the simulation synchronizes every time step. If the SRC 7 MAPstations finished computing much faster than the SRC 6 MAPstations, they would simply spend the extra time waiting for the SRC 6 MAPstations to finish their computations.

For the 8- and 16-node simulations, we estimate the computation time for nonbonded force calculation based on Equation 5.9, using the average number of atoms on each node for n and the average number of neighbors over the atoms on the nodes for N. We use the software estimates from the GPP-only cluster runs to estimate the rest of the computation time. We estimate the intra-node communication cost, C, based on the average number of local atoms and cached atoms on each node. Finally, we use the communication values from the software runs for estimates of inter-node communication.
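For reference, the per-step estimate from Equation 5.9 used in these projections can be evaluated with a few lines of C; the helper below is our own illustration (not part of the simulation code) and assumes the MAPstation's 100 MHz clock.

    /* Estimated RH-element time per step from Equation 5.9.
     * n: atoms on the node, N: average neighbors per atom,
     * p: pipeline depth, s: cycles to switch a memory bank between modes. */
    static double rh_seconds_per_step(double n, double N, double p, double s)
    {
        double cycles = n * (N + p)      /* cycles in the (second) pipeline */
                      + n * (N + 1.0)    /* writing forces back to memory   */
                      + n                /* reading the i atoms' data       */
                      + 2.0 * s * n;     /* read/write mode switches        */
        return cycles / 100.0e6;         /* 100 MHz clock                   */
    }

Plugging in the single-node palmitic acid values from Section 5.2.3 (n = 52558, N = 192, p ≈ 200, s = 4) gives roughly 0.31 s/step for the force calculation alone, consistent with the estimate used there.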
Figure 5.11 shows the speed-ups over the single-accelerated-node runs as the number of accelerated nodes increases from 1 to 16. Comparing Figure 5.11 to Figure 5.10 shows that the accelerated-node version scales poorly compared to the GPP-only-node version. This is to be expected, though. The parallelization benefits the nonbonded force calculation most. In the accelerated version, nonbonded force calculation is already much faster than it was in software. Further, as mentioned previously, in software the percentage of time taken by nonbonded force calculation drops as the number of nodes is increased. With nonbonded force calculation taking a smaller percentage of the time, Amdahl's law says that the acceleration of nonbonded force calculation can do less to speed up the simulation.

[Figure 5.11: Estimated scalability of the parallel MD program on accelerated nodes. Speed-up versus number of nodes (1 to 16) for the palmitic acid and CheY protein simulations.]

Tables 5.7 and 5.8 also demonstrate this point. The tables show, for the palmitic acid and CheY protein simulations respectively, the speed-ups obtained by using accelerated nodes. As more nodes are added, the benefit of those nodes being accelerated diminishes. Another trend can be seen in the palmitic acid speed-ups in Table 5.7, though: one accelerated node consistently gives about the same performance as two GPP-only nodes. In fact, in some cases, P accelerated nodes gave better performance than 2P GPP-only nodes. This is due to the fact that the accelerated nodes give added performance without adding communication time. We believe that in larger simulations, especially if a higher percentage of time is spent in nonbonded force calculation, this effect will be more pronounced because there will be more atoms per node and more communication required when new nodes are added.

Table 5.7: Speed-Ups of Palmitic Acid Simulation

                      Latency (s/step)
    # of Nodes      SW-Only   Accelerated   Speed-Up
         1             1.34          0.67       2.01
         2             0.72          0.36       2.01
         4             0.37          0.24       1.50
         8             0.22          0.15       1.44
        16             0.17          0.14       1.18

5.5 Related Work

Several prior works have studied implementing tasks in MD simulations in hardware [95, 105, 124, 62]. However, these works do not consider the MD simulation application as a whole. So, we instead focus our comparisons on works in which most of the major tasks in an MD simulation have been implemented. We begin with an in-depth comparison against the first such works in the literature. We then compare against two later works. All of these are single-node simulations; we are unaware of any reconfigurable-hardware-accelerated MD simulations that utilize system-level parallelism as is done in Section 5.4.

Table 5.8: Speed-Ups of CheY Protein Simulation

                      Latency (s/step)
    # of Nodes      SW-Only   Accelerated   Speed-Up
         1             0.74          0.38       1.92
         2             0.41          0.23       1.76
         4             0.20          0.16       1.28
         8             0.09          0.07       1.29
        16             0.07          0.06       1.15

The first complete MD simulation implemented on reconfigurable hardware is described in [6]. In this work, the velocity Verlet algorithm is implemented on a hardware platform with four FPGAs, memory, and interconnect. All steps of the velocity Verlet algorithm are implemented in hardware. Only the Lennard-Jones force is considered. All the computations are done in fixed-point arithmetic. The Lennard-Jones force is calculated with table lookup and interpolation. The system uses the O(n^2) method to find interacting pairs.

[43] is another work in which the entire velocity Verlet algorithm is considered.
5.5 Related Work

Several prior works have studied implementing tasks in MD simulations in hardware [95, 105, 124, 62]. However, these works do not consider the MD simulation application as a whole. So, we instead focus our comparisons on works in which most of the major tasks in an MD simulation have been implemented. We begin with an in-depth comparison against the first such works in the literature. We then compare against two later works. All of these are single-node simulations; we are unaware of any reconfigurable-hardware-accelerated MD simulations that utilize system-level parallelism as is done in Section 5.4.

The first complete MD simulation implemented on reconfigurable hardware is described in [6]. In this work, the velocity Verlet algorithm is implemented on a hardware platform with four FPGAs, memory, and interconnect. All steps of the velocity Verlet algorithm are implemented in hardware. Only the Lennard-Jones force is considered. All the computations are done in fixed-point arithmetic. The Lennard-Jones force is calculated with table look-up and interpolation. The system uses the O(n^2) method to find interacting pairs.

[43] is another work in which the entire velocity Verlet algorithm is considered. This work is similar to [6] in that both implementations perform all steps of the algorithm in hardware using fixed-point arithmetic, only nonbonded force calculation is implemented, this nonbonded force calculation is performed with table lookup and interpolation, and, it seems, the interaction pairs are found using the O(n^2) approach. One advance in [43] is the addition of cutoff Coulomb force calculation. Another advance is support for multiple types of atoms: two types can be accommodated in hardware, while more than two types requires swapping in tables from external memory. Yet another advance is the use of multiple pipelines for coarse-grained parallelism in the design.

Our hardware/software approach to accelerating MD simulations is fundamentally different from those of [6] and [43]. These implementations move all steps of the simulation (position update, velocity update, acceleration update) to hardware. We, on the other hand, have left most of the simulation in software, moving only the nonbonded force calculation to hardware. Further, rather than switching to a fixed-point implementation, we maintain floating-point arithmetic throughout, in both hardware and software. Another difference is in the way we approach the calculation of the nonbonded forces. Both of the other implementations use table lookup and interpolation, while ours directly calculates the forces from their equations.

Our approach gives us the advantage over the previous implementations that it is simple to integrate tasks that are well suited for software. For example, we have implemented (in software) the bonded force calculation, which the earlier implementations have not. We can also support both shifted-force and PME techniques for Coulomb force calculation with little hardware redesign; if we used only reconfigurable hardware, we would need to design a new module for the reciprocal-space part of the calculation and worry about reconfiguration, fitting the entire design into one FPGA, or partitioning it among multiple FPGAs. Further, integrating other advanced tasks is easily accomplished in our approach: it is merely a matter of integrating new function calls in the software.

Our approach has also made it easy to handle multiple types of atoms. Since we use direct calculation of the forces, we need not use any block RAMs for table lookup in the force calculation. Further, the number of type-based constants that need to be stored is only four times the square of the number of types, as opposed to needing separate force calculation tables for interactions between various types of atoms, as is the case in [43]. Thus we can handle a large number of types of atoms. The use of only a small number of block RAMs for constants is crucial in our approach because it keeps the rest of the block RAMs available for the intermediate storage needs of our write-back design.

Finally, our approach has facilitated the use of the neighbor list technique for finding interacting pairs of atoms. This is very important because it drastically reduces the number of pairs that need to be evaluated at each time step. In the O(n^2) approach, most of the time is spent finding distances between pairs of atoms that are too far apart to necessitate the calculation of the forces they exert on one another. Even with the advantages of hardware, it is difficult for a system using this approach to scale to large simulations; software using more advanced techniques such as the neighbor list or the linked-cell list will likely be faster.
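For readers unfamiliar with the neighbor-list technique contrasted with the O(n^2) approach above, the following is a minimal software sketch of the idea only; it is not the hardware or software implementation described in this thesis, and the cutoff and skin parameters are illustrative.

import numpy as np

# Sketch of the neighbor-list idea: pairs within r_cut + skin are enumerated
# once and reused for several time steps, instead of scanning all O(n^2)
# pairs at every step.
def build_neighbor_list(pos, r_cut, skin):
    """pos: (n, 3) coordinate array. For each atom i, list atoms j > i
    within the extended cutoff r_cut + skin."""
    n = len(pos)
    r_list = r_cut + skin
    neighbors = [[] for _ in range(n)]
    for i in range(n):                       # O(n^2) scan, but done rarely
        d = np.linalg.norm(pos[i + 1:] - pos[i], axis=1)
        neighbors[i] = (np.nonzero(d < r_list)[0] + i + 1).tolist()
    return neighbors

def pairs_within_cutoff(pos, neighbors, r_cut):
    """Per-step loop: only listed pairs are distance-checked."""
    for i, js in enumerate(neighbors):
        for j in js:
            if np.linalg.norm(pos[j] - pos[i]) < r_cut:
                yield i, j                   # force evaluation would go here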
Table 5.9 shows the speed-ups over software baselines for the shifted-force design proposed here and the fastest speed-ups reported in [6] and [43]. Note that the speed-ups reported in [6] and [43] are much greater than that reported for the proposed design, but that those speed-ups are obtained by comparing against software baselines that are much slower than our software implementation, despite having fewer atoms. Either the system simulated was very dense (and such a system would improve the speed-ups obtained in our design as well) or the software baseline was slow. Also note that bonded force calculation and potential energy calculation are not done in [6] and [43]. [6] and [43] do, however, use varying-precision fixed-point arithmetic to provide results nearly as accurate as double-precision floating-point arithmetic, while the accuracy of our simulation is limited to single-precision floating-point arithmetic.

Table 5.9: Performance Comparison with Related Work

                                 Shifted-Force   [6]     [43]    [44]
  Number of atoms                52558           8192    8192    8192
  SW baseline latency (s/step)   1.3             10.8    9.5     0.66
  Speed-up (x)                   2               21      89      6

[44] presents several improvements to [43]: a new number format, a new interpolation strategy, integration into a realistic MD simulation program (ProtoMol [69]), and results for using linked cell lists (though no implementation details are given). The first two of those improvements facilitate the use of more atom types while maintaining high precision without the use of true double-precision floating-point arithmetic. Integration into a realistic MD simulation program allows for more advanced simulation techniques, such as bonded force calculation, and also provides more realistic timing results. The use of linked cell lists makes their design more scalable to larger simulations. Still, because all storage is done on-chip, they are limited to simulations with 11200 atoms or fewer.

With these improvements, it is possible to fit two force calculation pipelines on a Xilinx Virtex-II Pro XC2VP70 FPGA and run them at a frequency of 75 MHz. Note that this 75 MHz frequency means that it would not be possible to use this design as-is on the SRC 6 MAPstation. In fact, even 75 MHz may not be achievable on the FPGA in the SRC 6 MAPstation because it is one generation older than the FPGA used in [44]. For an 8192-atom simulation with 26 atom types, the authors achieve a 5.5x speed-up over the software baseline. This is impressive; however, once again, direct comparison of speed-ups is unfair. While over 14x faster than the baseline simulation used in [43], the ProtoMol baseline still seems on the slow side, likely because it uses linked cell lists rather than the more efficient neighbor lists that are implemented in our software baseline (see Table 5.9). The density of the simulated system, which is not reported, could also be a factor.
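Because linked cell lists come up repeatedly in this comparison, the following minimal sketch shows the binning idea they rely on. It is illustrative only (not the design of [44] or ours), it assumes a cubic periodic box with at least three cells per dimension, and all names and parameters are placeholders.

import numpy as np
from itertools import product

# Sketch of linked-cell binning: atoms are placed in cells of edge >= r_cut,
# so candidate partners come only from an atom's own cell and its 26 neighbors.
def bin_atoms(pos, box, r_cut):
    """pos: (n, 3) coordinates in a cubic box of edge `box`.
    Returns the grid dimension and a dict mapping cell index -> atom indices."""
    ncell = max(1, int(box // r_cut))        # cells at least r_cut wide
    cells = {}
    for i, p in enumerate(pos):
        idx = tuple((np.floor(p / box * ncell).astype(int)) % ncell)
        cells.setdefault(idx, []).append(i)
    return ncell, cells

def candidate_pairs(ncell, cells):
    """Yield each candidate pair once, from a cell and its neighbor cells
    (with periodic wrap-around); an exact cutoff test would still follow."""
    for (cx, cy, cz), atoms in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            nb = ((cx + dx) % ncell, (cy + dy) % ncell, (cz + dz) % ncell)
            for i in atoms:
                for j in cells.get(nb, []):
                    if j > i:
                        yield i, j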
The fourth work that accelerates MD simulations on a reconfigurable computer accelerates a modified version of the NAMD simulation package [55]. Notably, this modified version removes the calculation of bonded forces and, hence, the exclusions from nonbonded force calculations of atoms that are bonded to one another. Also, single-precision floating-point arithmetic and 32-bit integer arithmetic are employed. NAMD uses table lookup and interpolation (albeit with floating-point numbers) for force calculation, so this type of force calculation is also performed in the hardware [55]. The authors study various architectural choices and report on their performance. While NAMD is certainly capable of using PME in simulations, it is unclear whether or not PME is actually utilized in the simulations in [55], and, in any case, PME's impact on performance is not addressed.

[55] is similar to our work in that it takes a hardware/software approach to the problem. The acceleration is targeted to certain portions of the code inside the nonbonded force calculation. Using two FPGAs in a single RH element in an SRC 6 MAPstation, a 1.3x speed-up over their baseline NAMD simulation is achieved (the baseline NAMD simulation is different from our simulation, so once again, it is difficult to compare speed-ups). They use coarse-grained parallelism in having two pipelines per RH element. Using a later version of the MAPstation and more optimized code, four pipelines fit on the RH element and a 2.5x speed-up is achieved. The authors also present the idea of using coarse-grained parallelism in the evaluation of whether or not atomic pairs are close enough to interact, and then using less parallelism in the force calculation that is necessary only if they do interact. This is similar to our approach of using a FIFO between distance calculation and the rest of the force calculation.

Chapter 6: Conclusion

The focus of our work has been the acceleration of scientific computing applications with reconfigurable hardware. We have demonstrated that though there are challenges involved with using reconfigurable hardware, these challenges can be overcome. Our contributions to this field, summarized below, allow this new, rapidly improving technology to be applied in scientific computing in order to provide some of the immense computing power required to make new scientific discoveries. Below, we also detail ideas for future work that will make reconfigurable hardware an even better fit for scientific computing.

6.1 Summary of Contributions

Kernels for Scientific Computing: We have presented the first double-precision floating-point cores for FPGAs that support most of the features in IEEE standard 754. Our focus has been the square root core, which is parameterized by the features of IEEE standard 754 implemented and the degree of pipelining. We found that the selection of features that are implemented affects the area of the core more than it does the frequency. The degree of pipelining affects both frequency and area. By comparing the various floating-point core configurations in a simplified kernel for MD, we saw that the implementation utilizing the lowest-overhead configuration allowed for the exploitation of parallelism because of the comparatively small size of the floating-point cores.

We have also presented algorithms and architectures that facilitate the area-efficient evaluation of arithmetic expressions using deeply pipelined floating-point cores. Experiments have shown that for expressions of both 64 and 1024 inputs, the area increases linearly with the number of types of operations in the expression. Beyond area efficiency, the proposed designs have several benefits, including optimal throughput, a low (O(lg n)) memory requirement, and the ability to evaluate multiple expressions without hardware reconfiguration. Because they only receive one input per clock cycle, they also have a low I/O bandwidth requirement. We have also provided proofs that show the correctness of the algorithms.
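The toy model below is included only to illustrate the problem that these expression-evaluation designs address; it is not the proposed architecture. With a deeply pipelined adder of latency L, an addition's result is not available for L cycles, so a naive accumulator that accepts one input per clock cycle ends up rotating among L independent partial sums that must still be combined afterwards. The latency value and function name here are arbitrary.

# Toy software model of accumulation with a deeply pipelined adder.
def pipelined_accumulate(values, latency=12):
    partials = [0.0] * latency               # one partial sum per pipeline slot
    for cycle, x in enumerate(values):       # one input arrives per clock cycle
        # The previous result for this stream is not ready for `latency` cycles,
        # so consecutive inputs are folded into different partial sums.
        partials[cycle % latency] += x
    return sum(partials)                     # leftover partial sums still need reducing

print(pipelined_accumulate(range(1, 1025)))  # 524800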
Modeling Reconfigurable Computers: We have developed architectural models for reconfigurable computers based on what is currently commercially available. Settling on a cluster-of-accelerated-nodes architectural model, we then developed the Accelerated SPMD programming model and a hierarchical performance model, describing key issues in the performance of reconfigurable-hardware designs in the process. Since not all applications will greatly benefit from acceleration with reconfigurable hardware, analytic performance modeling is a necessary step that should take place early in the development process, especially given the cost and effort involved in porting applications to reconfigurable computers. Use of the model in development of the accelerated version of the MD simulation program has demonstrated that the model is fairly accurate.

Molecular Dynamics: To show the acceleration of a complete scientific computing application, we have implemented MD on the MAPstation, a state-of-the-art reconfigurable computer. We studied three types of MD simulations: single-node, shifted-force simulations; single-node, PME simulations; and multi-node, shifted-force simulations. We profiled a software-only version of the single-node, shifted-force program running two benchmark simulations and found the most computationally intensive task to be the nonbonded force calculation. We partitioned the application so that this task would run in hardware while the rest of the simulation ran in software. This hardware/software approach was novel, as all previous efforts in performing MD on reconfigurable computers moved the entire application to the reconfigurable hardware, limiting the simulation features that could be implemented.

Using the performance model, we evaluated several alternative hardware designs and determined that a write-back design would yield the highest performance. Implementing this design in hardware and integrating it with the rest of the software, we were able to achieve a 2x speed-up over the corresponding software-only solution. Extending this work to PME simulations, we were able to achieve speed-ups of almost 3x by exploiting node-level parallelism in which the reciprocal-space and real-space parts of the force calculation execute in parallel on the GPP element and RH element, respectively. Beyond the speed-up, we also discussed parameters in the simulation that can affect the partitioning.

Finally, we analyzed the first multi-node, reconfigurable-computer implementation of MD. Using a spatial decomposition of MD, we showed that reconfigurable-hardware acceleration actually worsens the scalability when the number of atoms remains the same. This is due to parallelization and reconfigurable-hardware acceleration both targeting the nonbonded force calculation and to the fact that the percentage of time taken by this task drops as the number of nodes increases. We also see, though, that the accelerated version requires about half as many nodes to get the same, if not better, performance as the unaccelerated version.
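As a schematic of the node-level parallelism between the GPP and RH elements mentioned above for the PME simulations, the sketch below overlaps the two halves of the force calculation with two threads. The function names are placeholders standing in for the actual real-space (hardware) and reciprocal-space (software) calls, which are not reproduced here.

from concurrent.futures import ThreadPoolExecutor

# Schematic only: the reciprocal-space part of the PME force runs in software
# on the GPP while the real-space part runs on the RH element; both callables
# are placeholders.
def pme_forces(atoms, run_real_space_on_rh, run_reciprocal_space_on_gpp):
    with ThreadPoolExecutor(max_workers=2) as pool:
        real_part = pool.submit(run_real_space_on_rh, atoms)          # RH element
        recip_part = pool.submit(run_reciprocal_space_on_gpp, atoms)  # GPP element
        # With the overlap, the step time is roughly max(t_real, t_recip)
        # rather than their sum.
        return [fr + fk for fr, fk in zip(real_part.result(),
                                          recip_part.result())]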
6.2 Future Work

6.2.1 Further Study of Molecular Dynamics

In Chapter 5, we developed some of the first reconfigurable-computer implementations of MD to consider advanced techniques like PME, bonded force calculation, and the use of neighbor lists. While much more advanced than previous work accelerating MD with reconfigurable hardware, there are still features that can be added.

Double-Precision Arithmetic: As described in Section 5.1.3.1, the accelerated versions of our MD simulation programs use single-precision arithmetic. Single precision is all that can be used because of limitations in the logic resources on the FPGAs and in the number of on-board memories in the RH element. With newer FPGAs and reconfigurable computers, this should no longer be a problem. For instance, the SRC 7 MAPstation has nine banks of on-board memory, which will help alleviate the memory limitations. The SRC 7 MAPstations also have Xilinx Virtex-II Pro XC2VP100 FPGAs, which have 1.3x as many logic resources as the Virtex-II FPGAs in the current MAPstations. These FPGAs are also of a higher speed grade. Preliminary results show that the current single-precision implementation takes up only about half of the logic resources in an XC2VP100 FPGA. There should, then, be enough area for double-precision arithmetic. On the other hand, the pipelines in double-precision floating-point cores are longer than those in single-precision floating-point cores, making the draining of the pipeline more expensive. These changes may affect which of the design alternatives presented in Section 5.2.2 turns out to be the best.

Acceleration of Neighbor List Construction: Besides nonbonded force calculation, construction of the neighbor list takes the highest percentage of the computation time in the MD simulations. In the shifted-force simulations, building the neighbor list takes up between 20% and 23% of the computation time for the benchmark simulations. In the PME simulations, it takes between 8% and 11%. Accelerating neighbor list construction would increase the potential speed-up of using reconfigurable hardware. For instance, by Amdahl's Law, for the shifted-force simulations, the ideal speed-up from using reconfigurable hardware is about 4x without accelerating neighbor list construction. If neighbor list construction is accelerated, the ideal speed-up becomes 24x and 32x for the palmitic acid and CheY protein simulations, respectively. Thus, accelerating both nonbonded force calculation and neighbor list construction increases the potential speed-up dramatically.
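These ideal speed-ups follow directly from Amdahl's law. As a rough check, using the approximate single-node palmitic acid fractions reported in this chapter (about 73% of the time in nonbonded force calculation and roughly 23% in neighbor list construction), with $S_{\mathrm{ideal}} = \frac{1}{1 - f}$ for an accelerable fraction $f$:

\[
f = 0.73: \quad S_{\mathrm{ideal}} \approx \frac{1}{0.27} \approx 3.7,
\qquad
f = 0.73 + 0.23 = 0.96: \quad S_{\mathrm{ideal}} \approx \frac{1}{0.04} = 25.
\]

These values are consistent with the approximately 4x and 24x figures quoted above (small differences come from rounding the profiled percentages); the 32x figure for the CheY protein simulation follows in the same way from its own profiled fractions.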
Advanced Simulation Techniques: Our MD implementations are some of the first for reconfigurable computers to include necessary features like bonded force calculation and to use optimizations like the neighbor list. However, state-of-the-art simulation packages, like NAMD, GROMACS, and Amber, provide even more functionality [53, 118, 15]. This advanced functionality includes pressure calculations, multiple-time-step methods, optimized calculations for common interactions, and the fast multipole method for long-range Coulomb force calculation. In particular, the fast multipole method may be amenable to acceleration with reconfigurable hardware, as there has already been some success in accelerating it with an ASIC [4].

Also, other parallelization strategies can be integrated. In Section 5.4, we use a simple spatial decomposition scheme, but there are other schemes. For instance, GROMACS uses a particle decomposition [118]. NAMD, which is known to scale well to thousands of processors, uses a combination of force and spatial decomposition and goes to great lengths to overlap communication and computation [53, 61]. We have presented trends for the spatial decomposition scheme, but the effect of acceleration on these other schemes is not known. If communication can be hidden by computation, the accelerated simulations may scale better than they have in our implementation.

6.2.2 Libraries for Reconfigurable Computers

Libraries are important components in scientific computing. Libraries exist for widely used kernels such as linear algebra functions and the FFT [12, 32]. Traditionally, the library specifies a common set of functions and a common application programming interface (API) for using those functions. It is up to computer vendors or third parties to provide highly tuned implementations of the library functions, often written in assembly, for their architectures. There are also some library implementations that automatically tune themselves for high performance on the particular target machine on which they will execute [122, 32].

Libraries have also been developed for FPGAs, such as libraries for performing floating-point arithmetic [11, 24, 40] and common tasks in digital signal processing [22]. These libraries have parameters, such as problem size and degree of parallelism, that are set at synthesis time and yield a design giving a particular performance. They are very useful for speeding the design process.

The aforementioned methods have worked well for their respective domains but fall short for use with reconfigurable computers. The reason is that with reconfigurable computers, there are many choices for implementing a kernel, and the right choice is influenced by many variables, the values of some of which are known at design time and the values of others of which are not known until run time. For example, in Section 5.3, the percentage of the computation time taken up by the FFT in the PME MD simulations varies widely; it is dependent upon the size of the FFT, which is not known until run time. Even if a very efficient FFT implementation for reconfigurable hardware were available, using this implementation might not speed up the application and in some cases might even slow it down.

The fact that solely accelerating the FFTs would not have necessarily benefited the application as a whole serves to make a point about libraries for reconfigurable computers: they must be much more flexible than libraries for GPPs. For GPPs, the fastest library implementation for a particular kernel (FFT, for example) is best because the tasks in the application must run sequentially. For reconfigurable computers, on the other hand, there is an inherent parallelism between hardware and software. Running a kernel task more slowly in software may be beneficial if it allows another, more computationally intensive task to run in hardware. In the embedded computing community, this idea is known as software deceleration [52].

Thus, libraries for reconfigurable computers must do more than provide a common interface and highly optimized implementations for various platforms, and they must be more than just wrappers around hardware implementations that always execute the hardware version if it is available. At the very least, the user should be able to give the library hints as to whether the software or hardware version of the library kernel should be employed. A more advanced system would allow the library to choose, at run time, which implementation should be employed, based on parameters such as problem size (the number of points in the FFT, for example). The system would provide an interface between the library's performance models and the application's performance model.
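A minimal sketch of such a run-time dispatch wrapper is given below. It is purely hypothetical: the function name, the hint mechanism, and the performance-model callables are illustrative and do not correspond to any existing library API.

# Hypothetical dispatch wrapper for a library kernel on a reconfigurable computer.
def fft(x, hw_fft, sw_fft, hw_model, sw_model, hint=None):
    """Choose the hardware or software FFT at run time.

    hw_fft/sw_fft are the two implementations; hw_model/sw_model are
    performance models estimating latency (including data transfer) from
    the problem size; hint lets the application force a choice."""
    n = len(x)
    if hint == "hw" or (hint is None and hw_model(n) < sw_model(n)):
        return hw_fft(x)
    return sw_fft(x)

# A caller practicing software deceleration can pass hint="sw" so that the
# RH element stays free for a more profitable kernel.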
6.2.3 New Techniques

Thus far, the work in accelerating scientific computing applications with reconfigurable hardware, including that presented in this thesis, has focused on accelerating existing techniques with reconfigurable hardware. A long-term, open problem is the creation of new techniques that are developed with the strengths of reconfigurable hardware (or other accelerator technologies, for that matter) in mind. Creating these new techniques, however, will require multidisciplinary efforts between computational scientists, computer scientists, and computer engineers.

Reference List

[1] M.P. Allen and D. J. Tildesley. Computer Simulation of Liquids. Oxford University Press, New York, 1987.
[2] Altera Corp. http://www.altera.com.
[3] AMBER 8 Users' Manual, 2004. http://amber.scripps.edu/doc8/amber8.pdf.
[4] Takashi Amisaki, Shinjiro Toyoda, Hiroh Miyagawa, and Kunhiro Kitamura. Development of hardware accelerator for molecular dynamics simulations: A computation board that calculates nonbonded interactions in cooperation with fast multipole method. Journal of Computational Chemistry, 24(5):582–592, April 2003.
[5] Ray Andraka. A survey of CORDIC algorithms for FPGA based computers. In Proceedings of the 1998 ACM/SIGDA Sixth International Symposium on Field-Programmable Gate Arrays, February 1998.
[6] Navid Azizi, Ian Kuon, Aaron Egier, Ahmad Darabiha, and Paul Chow. Reconfigurable molecular dynamics simulator. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, pages 197–206, April 2004.
[7] D.A. Bader, S. Sreshta, and N.R. Weisse-Bernstein. Evaluating arithmetic expressions using tree contraction: A fast and scalable parallel implementation for symmetric multiprocessors (SMPs). In Proceedings of the 9th International Conference on High Performance Computing, volume 2552 of Lecture Notes in Computer Science, pages 63–75, December 2002.
[8] D. H. Bailey and A. Snavely. Performance modeling: Understanding the present and predicting the future. In Proceedings of Euro-Par 2005, September 2005.
[9] Zachary K. Baker and Viktor K. Prasanna. Computationally-efficient engine for flexible intrusion detection. IEEE Transactions on VLSI, 13(10):1179–1189, October 2005.
[10] Michael J. Beauchamp, Scott Hauck, Keith D. Underwood, and K. Scott Hemmert. Embedded floating-point units in FPGAs. In Proceedings of the 14th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 12–20, February 2006.
[11] Pavle Belanovic and Miriam Leeser. A library of parameterized floating point modules and their use. In M. Glesner, P. Zipf, and M. Renovell, editors, Proceedings of the 12th International Conference on Field Programmable Logic and Applications, pages 657–666, Berlin, September 2002. Springer-Verlag.
[12] L. Susan Blackford, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, Michael Heroux, Linda Kaufman, Andrew Lumsdaine, Antoine Petitet, Roldan Pozo, Karin Remington, and R. Clint Whaley. An updated set of Basic Linear Algebra Subprograms (BLAS). ACM Transactions on Mathematical Software, 28(2):135–151, June 2002.
[13] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the Association for Computing Machinery, 21(2):201–206, April 1974.
[14] Greg Burns, Raja Daoud, and James Vaigl. LAM: An open cluster environment for MPI. In Proceedings of Supercomputing Symposium, pages 379–386, 1994.
[15] David A. Case, Thomas E. Cheatham III, Tom Darden, Holger Gohlke, Ray Luo, Kenneth M. Merz, Jr., Alexey Onufriev, Carlos Simmerling, Bing Wang, and Robert J. Woods. The Amber biomolecular simulation programs. Journal of Computational Chemistry, 26(16):1668–1688, December 2005.
[16] Celoxica DK design suite.
http://www.celoxica.com/products/dk/default.asp.
[17] Chen Chang, John Wawrzynek, Pierre-Yves Droz, and Robert W. Brodersen. The design and application of a high-end reconfigurable computing system. In Toomas Plaks, editor, Proceedings of the International Conference on Engineering Reconfigurable Systems and Algorithms, June 2005.
[18] C.Y.R. Chen and M.Z. Moricz. Data path scheduling for two-level pipelining. In Proceedings of the 28th Design Automation Conference, June 1988.
[19] Irina Chihaia and Thomas Gross. Effectiveness of simple memory models for performance prediction. In Proceedings of the 2004 IEEE International Symposium on Performance Analysis of Systems and Software, March 2004.
[20] Seonil Choi, Ronald Scrofano, Viktor K. Prasanna, and Ju-Wook Jang. Energy-efficient signal processing using FPGAs. In Proceedings of the 2003 ACM/SIGDA Eleventh International Symposium on Field-Programmable Gate Arrays, pages 225–234, New York, February 2003. ACM Press.
[21] R. Cole and U. Vishkin. The accelerated centroid decomposition technique for optimal parallel tree evaluation in logarithmic time. Algorithmica, 3:329–346, 1988.
[22] Ben Cordes, Miriam Leeser, and Joe Tarkoff. An FPGA API for VSIPL++. In Proceedings of the Ninth Annual Workshop on High-Performance Embedded Computing, September 2005.
[23] Michael deLorimier and André DeHon. Floating-point sparse matrix-vector multiply for FPGAs. In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays, pages 75–85, February 2005.
[24] Jérémie Detrey and Florent de Dinechin. A VHDL library of parameterisable floating-point and LNS operators for FPGA. http://perso.ens-lyon.fr/jeremie.detrey/FPLibrary/.
[25] Jérémie Detrey and Florent de Dinechin. Parameterized floating-point logarithm and exponential functions for FPGAs. Technical report, École Normale Supérieure de Lyon, 2006.
[26] Pedro Diniz, Mary Hall, Joonseok Park, Byoungro So, and Heidi Ziegler. Bridging the gap between compilation and synthesis in the DEFACTO system. In Proceedings of the Languages and Compilers for Parallel Computing Workshop, August 2001.
[27] Jack J. Dongarra and Piotr Luszczek. Introduction to the HPCChallenge benchmark suite. Technical Report ICL-UT-05-01, University of Tennessee, 2005.
[28] Rolf Enzler, Tobias Jeger, Didier Cottet, and Gerhard Tröster. High-level area and performance estimation of hardware building blocks on FPGAs. In Proceedings of the 10th International Conference on Field-Programmable Logic and Applications, pages 525–534, August 2000.
[29] Ulrich Essmann, Lalith Perera, Max L. Berkowitz, Tom Darden, Hsing Lee, and Lee G. Pedersen. A smooth particle mesh Ewald method. Journal of Chemical Physics, 103(19):8577–8593, November 1995.
[30] Barry Fagin and Cyril Renard. Field programmable gate arrays and floating point arithmetic. IEEE Transactions on VLSI Systems, 2(3):365–367, September 1994.
[31] Michael Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, C-21(9):948–960, 1972.
[32] Matteo Frigo and Steven G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, February 2005.
[33] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation.
In Proceedings, 11th European PVM/MPI Users' Group Meeting, pages 97–104, September 2004.
[34] David Gay and Thos Sumner. paranoia (C version). http://www.netlib.org/paranoia/.
[35] R.S. Germain, Y. Zhestkov, M. Eleftheriou, A. Rayshubskiy, F. Suits, T.J.C. Ward, and B.G. Fitch. Early performance data on the Blue Matter molecular simulation framework. IBM Journal of Research and Development, 49(2/3):447–455, March/May 2005.
[36] Maya Gokhale, Janette Frigo, Christine Ahrens, Justin L. Tripp, and Ron Minnich. Monte Carlo radiative heat transfer simulation on a reconfigurable computer. In J. Becker, M. Platzner, and S. Vernalde, editors, Proceedings of the 2004 International Conference on Field Programmable Logic and Its Applications, volume 3203 of Lecture Notes in Computer Science, pages 95–104, September 2004.
[37] Maya B. Gokhale and Paul S. Graham. Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays. Springer, Dordrecht, The Netherlands, 2005.
[38] Maya B. Gokhale, Christopher D. Rickett, Justin L. Tripp, Chung Hsing Hsu, and Ronald Scrofano. Promises and pitfalls of reconfigurable supercomputing. In Proceedings of the International Conference on Engineering Reconfigurable Systems and Algorithms, June 2006.
[39] Gokul Govindu, Viktor K. Prasanna, Vikash Daga, Sridhar Gangadharpalli, and V. Sridhar. Floating-point based block LU decomposition on FPGAs. In Proceedings of the 2004 International Conference on Engineering Reconfigurable Systems and Algorithms, June 2004.
[40] Gokul Govindu, Ronald Scrofano, and Viktor K. Prasanna. A library of parameterizable floating-point cores for FPGAs and their application to scientific computing. In Toomas Plaks, editor, Proceedings of the International Conference on Engineering Reconfigurable Systems and Algorithms, June 2005.
[41] Gokul Govindu, Ling Zhuo, Seonil Choi, and Viktor K. Prasanna. Analysis of high performance floating point arithmetic on FPGAs. In Proceedings of the 11th Reconfigurable Architectures Workshop (RAW 2004), April 2004.
[42] GNU gprof. http://www.gnu.org/software/binutils/manual/gprof-2.9.1/gprof.html.
[43] Yongfeng Gu, Tom VanCourt, and Martin C. Herbordt. Accelerating molecular dynamics simulations with configurable circuits. In Proceedings of the 2005 International Conference on Field Programmable Logic and Applications, August 2005.
[44] Yongfeng Gu, Tom VanCourt, and Martin C. Herbordt. Improved interpolation and system integration for FPGA-based molecular dynamics simulations. In Proceedings of the 2006 International Conference on Field Programmable Logic and Applications, pages 21–28, August 2006.
[45] T. Hamada, T. Fukushige, A. Kawai, and J. Makino. PROGRAPE-1: A programmable, multi-purpose computer for many-body simulations. Publications of the Astronomical Society of Japan, 52:943–954, October 2000.
[46] K. Scott Hemmert and Keith D. Underwood. An analysis of the double-precision floating-point FFT on FPGAs. In IEEE Symposium on Field-Programmable Custom Computing Machines, April 2005.
[47] Kai Hwang and Zhiwei Xu. Scalable Parallel Computing. McGraw Hill, 1998.
[48] IEEE Standard for Binary Floating-Point Arithmetic, 1985. IEEE Std 754-1985.
[49] Intel Corp., 2006. http://www.intel.com.
[50] R. Jain, A. Parker, and N. Park. Module selection for pipelined synthesis. In Proceedings of the 25th Design Automation Conference, June 1988.
[51] Joseph JáJá. An Introduction to Parallel Algorithms. Addison-Wesley Publishing Company, 1992.
[52] P. James-Roxby, G. Brebner, and D.
Bemmann. Time-critical software deceleration in an FCCM. In Proceedings of the Twelfth Annual International Symposium on Field Programmable Custom Computing Machines, April 2004.
[53] Laxmikant Kalé, Robert Skeel, Milind Bhandarkar, Robert Brunner, Attila Gursoy, Neal Krawetz, James Phillips, Aritomo Shinozaki, Krishnan Varadarajan, and Klaus Schulten. NAMD2: Greater scalability for parallel molecular dynamics. Journal of Computational Physics, 151:283–312, 1999.
[54] J.-P. Kaps. High speed FPGA architectures for the data encryption standard. Master's thesis, ECE Department, Worcester Polytechnic Institute, Worcester, MA, May 1998.
[55] V. Kindratenko and D. Pointer. A case study in porting a production scientific supercomputing application to a reconfigurable computer. In Proceedings of the Fourteenth Annual IEEE Symposium on Field-Programmable Custom Computing Machines, April 2006.
[56] Volodymyr Kindratenko. Code partitioning for reconfigurable high-performance computing: A case study. In Proceedings of the International Conference on Engineering Reconfigurable Systems and Algorithms, June 2006.
[57] Alexander V. Kozlov and Jaswinder Pal Singh. A parallel Lauritzen-Spiegelhalter algorithm for probabilistic inference. In Supercomputing '94: Proceedings of the 1994 ACM/IEEE Conference on Supercomputing, pages 320–329, 1994.
[58] David Kuck and Kiyoshi Maruyama. Time bounds on the parallel evaluation of arithmetic expressions. SIAM Journal on Computing, 4(2):147–162, June 1975.
[59] David Kuck and Yoichi Muraoka. Bounds on the parallel evaluation of arithmetic expressions using associativity and commutativity. Acta Informatica, 3(3):203–216, September 1974.
[60] Dhananjay Kulkarni, Walid Najjar, Robert Rinker, and Fadi J. Kurdahi. Fast area estimation to support compiler optimizations in FPGA-based reconfigurable systems. In Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, April 2002.
[61] Sameer Kumar, Chao Huang, Gheorghe Almasi, and Laxmikant V. Kalé. Achieving strong scaling with NAMD on Blue Gene/L. In Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium, April 2006.
[62] Sam Lee. An FPGA implementation of the smooth particle mesh Ewald reciprocal sum compute engine (RSCE). Master's thesis, University of Toronto, 2005.
[63] Yamin Li and Wanming Chu. A new non-restoring square root algorithm and its VLSI implementations. In Proceedings of the 1996 International Conference on Computer Design, October 1996.
[64] Jian Liang, Russell Tessier, and Oskar Mencer. Floating point unit generation and evaluation for FPGAs. In Proceedings of the 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 185–194, April 2003.
[65] Gerhard Lienhart, Andreas Kugel, and Reinhard Männer. Using floating-point arithmetic on FPGAs to accelerate scientific N-body simulations. In Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 182–191. IEEE Computer Society Press, April 2002.
[66] Walter B. Ligon, III, Scott McMillan, Greg Monn, Kevin Schoonover, Fred Stivers, and Keith D. Underwood. A re-evaluation of the practicality of floating-point operations on FPGAs. In Proceedings of the 6th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, April 1998.
[67] Loucas Louca, Todd A. Cook, and William H. Johnson. Implementation of IEEE single precision floating point addition and multiplication on FPGAs.
In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, April 1996.
[68] Junichiro Makino, Toshiyuki Fukushige, and Masaki Koga. A 1.349 Tflops simulation of black holes in a galactic center on GRAPE-6. In Proceedings of the 2000 ACM/IEEE Conference on Supercomputing, New York, November 2000. ACM Press.
[69] Thierry Matthey, Trevor Cickovski, Scott Hampton, Alice Ko, Qun Ma, Matthew Nyerges, Troy Raeder, Thomas Slabach, and Jesús A. Izaguirre. ProtoMol, an object-oriented framework for prototyping novel algorithms for molecular dynamics. ACM Transactions on Mathematical Software, 30(3):237–265, 2004.
[70] G. L. Miller and J. H. Reif. Parallel tree contraction and its application. In Proceedings of the 26th Annual IEEE Symposium on Foundations of Computer Science, pages 478–489, October 1985.
[71] ModelSim. http://www.model.com/.
[72] Gerald R. Morris, Richard D. Anderson, and Viktor K. Prasanna. An FPGA-based application-specific processor for efficient reduction of multiple variable-length floating-point data sets. In Proceedings of the 17th IEEE International Conference on Application-specific Systems, Architectures and Processors, pages 323–330, Steamboat Springs, CO, September 2006.
[73] Gerald R. Morris, Richard D. Anderson, and Viktor K. Prasanna. A hybrid approach for mapping conjugate gradient onto an FPGA-augmented reconfigurable supercomputer. In Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, April 2006.
[74] Gerald R. Morris and Viktor K. Prasanna. An FPGA-based floating-point Jacobi iterative solver. In Proceedings of the 8th International Symposium on Parallel Architectures, Algorithms, and Networks, pages 420–427, December 2005.
[75] Gerald R. Morris and Viktor K. Prasanna. A hybrid approach for accelerating a sparse matrix Jacobi solver using an FPGA-augmented reconfigurable computer. In Proceedings of the 9th Military and Aerospace Programmable Logic Devices Conference, September 2006.
[76] MPI: A message passing interface, 1995. http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html.
[77] Myrinet overview. http://www.myri.com/myrinet/overview/.
[78] Aiichiro Nakano. Class notes for CSCI 599: High performance scientific computing. University of Southern California, Fall semester, 2003.
[79] Ryan M. Olson, Michael W. Schmidt, and Mark S. Gordon. Enabling the efficient use of SMP clusters: The GAMESS/DDI model. In Proceedings of Supercomputing 2003, November 2003.
[80] OProfile. http://oprofile.sourceforge.net/about/.
[81] N. Park and A. Parker. Sehwa: A program for synthesis of pipelines. In Proceedings of the 23rd Design Automation Conference, June 1986.
[82] Arun Patel, Manuel Saldaña, Christopher Comis, Paul Chow, Christopher A. Madill, and Régis Pomès. A scalable FPGA-based multiprocessor. In Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, April 2006.
[83] M. Patra, M. Karttunen, M. T. Hyvönen, E. Falck, P. Lindqvist, and I. Vattulainen. Molecular dynamics simulations of lipid bilayers: Major artifacts due to truncating electrostatic interactions. Biophysics Journal, 84:3636–3645, 2003.
[84] P.G. Paulin and J.P. Knight. Force-directed scheduling in automatic data path synthesis. In Proceedings of the 24th Design Automation Conference, June 1987.
[85] PDB Format Guide, 1996. http://www.rcsb.org/pdb/file_formats/pdb/pdbguide2.2/guide2.2_frame.html.
[86] A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary.
HPL - a portable implementation of the high-performance Linpack benchmark for distributed-memory computers. http://www.netlib.org/benchmark/hpl/.
[87] L. Phillips, R.S. Sinkovits, E.S. Oran, and J.P. Boris. The interaction of shocks and defects in Lennard-Jones crystals. Journal of Physics: Condensed Matter, 5(35):6357–6376, August 1993.
[88] Steve Plimpton. Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics, 117(1):1–19, March 1995.
[89] POSIX Threads Tutorial, 2006. http://llnl.gov/computing/tutorials/pthreads.
[90] B. Pradeep and C. Siva Ram Murthy. Parallel arithmetic expression evaluation on reconfigurable meshes. Computer Languages, 20(4):267–277, 1994.
[91] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C. Cambridge University Press, New York, second edition, 1992.
[92] D.C. Rapaport. The Art of Molecular Dynamics Simulation. Cambridge University Press, Cambridge, 2004.
[93] Ronald Scrofano, Maya Gokhale, and Viktor K. Prasanna. Parallel molecular dynamics with reconfigurable computers. IEEE Transactions on Parallel and Distributed Systems. (in preparation).
[94] Ronald Scrofano, Maya Gokhale, Frans Trouw, and Viktor K. Prasanna. A hardware/software approach to molecular dynamics on reconfigurable computers. In Proceedings of the Fourteenth Annual IEEE Symposium on Field-Programmable Custom Computing Machines, April 2006.
[95] Ronald Scrofano and Viktor K. Prasanna. Computing Lennard-Jones potentials and forces with reconfigurable hardware. In Toomas Plaks, editor, Proceedings of the International Conference on Engineering Reconfigurable Systems and Algorithms, pages 284–290, June 2004.
[96] Ronald Scrofano and Viktor K. Prasanna. Preliminary investigation of advanced electrostatics in molecular dynamics on reconfigurable computers. In Proceedings of Supercomputing 2006, November 2006.
[97] Ronald Scrofano and Viktor K. Prasanna. A hierarchical performance model for reconfigurable computers. In Sanguthevar Rajasekaran and John Reif, editors, Handbook of Parallel Computing: Models, Algorithms and Applications, Chapman & Hall/CRC Computer & Information Science Series. CRC Press, June 2007.
[98] Ronald Scrofano, Ling Zhuo, and Viktor K. Prasanna. Area-efficient arithmetic expression evaluation using deeply pipelined floating-point cores. IEEE Transactions on Very Large Scale Integration Systems. (under review).
[99] Sameer Shende, Allen D. Malony, Alan Morris, Steven Parker, and J. Davison de St. Germain. Performance evaluation of adaptive scientific applications using TAU. In Proceedings of the International Conference on Parallel Computational Fluid Dynamics, 2005.
[100] Masaaki Shimasaki and Hans P. Zima. The Earth Simulator. Parallel Computing, 30(12):1277–1278, December 2004.
[101] Silicon Graphics, Inc., 2006. http://www.sgi.com.
[102] Robert C. Singleterry, Jr., Jaroslaw Sobieszczanski-Sobieski, and Samuel Brown. Field-programmable gate array computer in structural analysis: An initial exploration. In Proceedings of the 43rd AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, 2002.
[103] Melissa C. Smith. Analytical Modeling of High Performance Reconfigurable Computers: Prediction and Analysis of System Performance. PhD thesis, University of Tennessee, 2003.
[104] Melissa C. Smith and Gregory D. Peterson. Parallel application performance on shared high performance reconfigurable computing resources. Performance Evaluation, 60:107–125, 2005.
[105] Melissa C. Smith, Jeffrey S. Vetter, and Xuejun Liang. Accelerating scientific applications with the SRC-6 reconfigurable computer: Methodologies and analysis. In Proceedings of the 2005 Reconfigurable Architectures Workshop, April 2005.
[106] William D. Smith and Austars R. Schnore. Towards an RCC-based accelerator for computational fluid dynamics applications. In Toomas Plaks, editor, Proceedings of the 2003 International Conference on Engineering Reconfigurable Systems and Algorithms, pages 222–231, June 2003.
[107] Jeffrey M. Squyres and Andrew Lumsdaine. A component architecture for LAM/MPI. In Proceedings, 10th European PVM/MPI Users' Group Meeting, pages 379–387, September/October 2003.
[108] SRC Computers, Inc. http://www.srccomputers.com.
[109] W. Sun, M.J. Wirthlin, and S. Neuendorffer. Combining module selection and resource sharing for efficient FPGA pipeline synthesis. In Proceedings of the 14th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 2006.
[110] Makoto Taiji, Tetsu Narumi, Yousuke Ohno, Noriyuki Futatsugi, Atsushi Suenaga, Naoki Takada, and Akihiko Konagaya. Protein explorer: A petaflops special-purpose computer system for molecular dynamics simulations. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, New York, November 2003. ACM Press.
[111] Ping Tak Peter Tang. Table-lookup algorithms for elementary functions and their error analysis. In Proceedings of the 10th IEEE Symposium on Computer Arithmetic, pages 232–236, June 1991.
[112] Top 500 supercomputing sites. www.top500.org.
[113] Nick Tredennick and Brion Shimamoto. Reconfigurable systems emerge. In J. Becker, M. Platzner, and S. Vernalde, editors, Proceedings of the 2004 International Conference on Field Programmable Logic and Its Applications, volume 3203 of Lecture Notes in Computer Science, pages 2–11, September 2004.
[114] Justin L. Tripp, Henning S. Mortveit, Anders A. Hanson, and Maya Gokhale. Partitioning hardware and software for reconfigurable supercomputing applications: A case study. In Proceedings of Supercomputing 2005, November 2005.
[115] Justin L. Tripp, Henning S. Mortveit, Anders A. Hansson, and Maya Gokhale. Metropolitan road traffic simulation on FPGAs. In Proceedings of the 2005 IEEE Symposium on Field-Programmable Custom Computing Machines, April 2005.
[116] Keith Underwood. FPGAs vs. CPUs: Trends in peak floating-point performance. In Proceedings of the 2004 ACM/SIGDA Twelfth International Symposium on Field Programmable Gate Arrays, February 2004.
[117] Keith D. Underwood and K. Scott Hemmert. Closing the gap: CPU and FPGA trends in sustainable floating-point BLAS performance. In Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, April 2004.
[118] David van der Spoel, Erik Lindahl, Berk Hess, Gerrit Groenhof, Alan E. Mark, and Herman J. C. Berendsen. GROMACS: Fast, flexible, and free. Journal of Computational Chemistry, 26(16):1701–1718, December 2005.
[119] Richard L. Walke, Robert W. M. Smith, and Gaye Lightbody. 20 GFLOPS QR processor on a Xilinx Virtex-E FPGA. In Signal Processing Algorithms, Architectures, and Implementations X, volume 4116. SPIE, June 2000.
[120] Xiaofang Wang and Sotirios G. Ziavras. Parallel LU factorization of sparse matrices on FPGA-based configurable computing engines. Concurrency and Computation: Practice & Experience, 16(4):319–343, April 2004.
[121] Xiaojun Wang, Sherman Braganza, and Miriam Leeser. Advanced components in the variable precision floating-point library.
In Proceedings of the Fourteenth Annual IEEE Symposium on Field-Programmable Custom Computing Machines, pages 249–258, April 2006.
[122] R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, January 2001.
[123] Mathew Wojko and Hossam ElGindy. On determining polynomial evaluation structures for FPGA based custom computing machines. In Proceedings of the 4th Australasian Computer Architecture Conference, January 1999.
[124] Christophe Wolinski, Frans Trouw, and Maya Gokhale. A preliminary study of molecular dynamics on reconfigurable computers. In Toomas Plaks, editor, Proceedings of the 2003 International Conference on Engineering Reconfigurable Systems and Algorithms, pages 304–307, June 2003.
[125] Xilinx, Inc. http://www.xilinx.com.
[126] Ling Zhuo and Viktor K. Prasanna. Scalable modular algorithms for floating-point matrix multiplication on FPGAs. In Proceedings of the 11th Reconfigurable Architectures Workshop (RAW 2004), April 2004.
[127] Ling Zhuo and Viktor K. Prasanna. Sparse matrix-vector multiplication on FPGAs. In Proceedings of the 13th ACM International Symposium on Field-Programmable Gate Arrays, February 2005.
[128] Ling Zhuo and Viktor K. Prasanna. High-performance and parameterized matrix factorization on FPGAs. In Proceedings of the 16th International Conference on Field Programmable Logic and Applications, August 2006.
[129] Ling Zhuo and Viktor K. Prasanna. Scalable hybrid designs for linear algebra on reconfigurable computing systems. In Proceedings of the 12th International Conference on Parallel and Distributed Systems, July 2006.
Abstract
With recent technological advances, it has become possible to use reconfigurable hardware to accelerate scientific computing applications. There has been a resulting development of reconfigurable computers that have microprocessors, reconfigurable hardware, and high-performance interconnect. We address several aspects of accelerating scientific computing applications with reconfigurable hardware and reconfigurable computers.