ENERGY AND TIME EFFICIENT DESIGNS FOR DIGITAL SIGNAL PROCESSING KERNELS ON FPGAS

by
Seonil Choi

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2004

Copyright 2004 Seonil Choi

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

Contents

Dedication
Acknowledgments
List Of Tables
List Of Figures
Abstract
1 Introduction
  1.1 Overview
  1.2 Contributions of the Dissertation
    1.2.1 High-Level Performance Modeling Technique: Domain-Specific Modeling and Design Methodology
    1.2.2 Energy Efficient Algorithmic Design Techniques in FPGAs
    1.2.3 Energy and Time Efficient Designs for Matrix Multiplication Using FPGAs
    1.2.4 Energy and Time Efficient Designs for Matrix Factorization Using FPGAs
    1.2.5 Energy/Time Efficient and Parameterized Designs for Fast Fourier Transforms Using FPGAs
  1.3 Outline of the Dissertation
2 Background
  2.1 Field Programmable Gate Arrays (FPGAs)
    2.1.1 Conventional FPGA
    2.1.2 Inclusion of Embedded Memory, Embedded Multipliers and DSP Blocks
    2.1.3 Availability of DSP IP cores
    2.1.4 Inclusion of Embedded and Soft Microprocessor Cores
  2.2 Energy Efficient Design Techniques
    2.2.1 Power Dissipation in FPGAs
    2.2.2 Low-Level Design Techniques
    2.2.3 Algorithm Level Design Techniques
3 High-Level Performance Modeling Techniques: Domain-Specific Modeling and Design Methodology
  3.1 Related work
  3.2 Domain-Specific Energy Modeling
    3.2.1 High-Level Energy Model
    3.2.2 Component Specific Power Function Estimation
    3.2.3 Deriving System-Wide Energy Function
  3.3 Design Methodology
  3.4 Illustrative Examples of Domain-Specific Modeling and Design Methodology
    3.4.1 Domain 1: Uniprocessor Architecture
      Defining Components and Parameters
      System-Wide Energy Function
      Design Trade-offs and Performance Analysis
    3.4.2 Domain 2: Linear Array Architecture
      Defining Components and Parameters
      System-Wide Energy Function
      Design Trade-offs and Performance Analysis
    3.4.3 Domain 3: Block Matrix Multiplication on Linear Array Architecture
      Defining Components, Parameters, and System-Wide Energy Function
      Design Trade-offs and Performance Analysis
4 Energy and Time Efficient Matrix Multiplication Using FPGA
  4.1 Related work
  4.2 Energy Efficient Algorithms and Architectures for Matrix Multiplication
  4.3 Performance Modeling and Optimization
    4.3.1 Domain-Specific Energy Model
    4.3.2 Functions to Estimate Energy, Area, and Latency
    4.3.3 Trade-offs among Energy, Area, and Latency
    4.3.4 Other Optimization Techniques for Energy Efficiency

    5.2.1 LU Decomposition
    5.2.2 Block LU Decomposition
  5.3 Performance Estimation and Design Trade-offs
    5.3.1 High-Level Performance Model
    5.3.2 Design Trade-offs for Time and Energy Efficiency
  5.4 Design Synthesis, Optimization, and Simulation Methods
    5.4.1 Optimizations for Time and Energy Efficiency
    5.4.2 Macro-Level Power and Resource Analyzers
    5.4.3 Simulation Methods
  5.5 Performance Comparison
    5.5.1 Uniprocessor and Theorem 3
    5.5.2 Theorem 3 and Other Linear Array Architecture
    5.5.3 DSP and Corollary 3
6 Energy/Time Efficient and Parameterized Designs for Fast Fourier Transforms on FPGAs
  6.1 Energy and Time Efficient Design for FFT
    6.1.1 Energy Efficient Design Techniques
    6.1.2 Energy Efficient Design for FFT
  6.2 Performance Estimation and Design Synthesis
    6.2.1 Energy Performance Estimation
  6.3 Performance of Synthesized Designs
7 Conclusion and Future Research
  7.1 Future Research Directions
Reference List

List Of Tables

1.1 Performance comparison of FPGAs and DSP [142]
2.1 (a) Capacitances of CLB and embedded blocks and (b) power dissipation of configured and embedded logic in Xilinx Virtex-II
3.1 Model parameters
3.2 Comparison of our designs on linear array architecture with Xilinx design
3.3 Accuracy of the high-level energy estimation of our designs
3.4 Performance comparison and accuracy of various designs in Domain 3
4.1 Range of parameters for Xilinx XC2V1500
4.2 Number of modules used and the latency of various designs
4.3 Energy and time performance models
4.4 Power and area functions for various modules
4.5 Estimation errors of energy and area functions in Table 4.3
4.6 Performance comparison of various off-chip designs against the Xilinx design and the design A proposed in [110]
4.7 Performance comparison of various on-chip designs
4.8 Performance comparison of various off/on-chip designs for Theorem 2
5.1 States of each PE and their operations
5.2 Power and area functions for various modules
5.3 Energy, area, and time performance models for Theorem 3
5.4 Energy, area, and time performance models for Theorem 4
5.5 Energy, area, and time performance models for Corollary 3
5.6 Memory access rates of various modules in the uniprocessor design and the design in Theorem 3
5.7 Performance comparison of the designs based on Theorem 3 and the uniprocessor design
5.8 Performance comparison of the design A based on [23] [99] and the design based on Theorem 3
5.9 Performance of the designs based on Corollary 3
6.1 Performance of our FFT designs
6.2 FFT performance of Xilinx library based design and TI DSP based design
6.3 FFT performance comparison with Xilinx library based designs
6.4 Average power dissipation of the TMS320C6415
6.5 FFT performance comparison with the TI DSP based designs

List Of Figures

1.1 Power dissipation of various processors
1.2 Algorithms used in Software Defined Radio (SDR): (a) Direction of arrival algorithm (e.g. MUSIC) [115] and (b) MVDR with RLS algorithm [47]
2.1 The evolution of FPGAs
2.2 The conventional FPGA architecture
2.3 The recent FPGA architecture
2.4 (a) Capacitances of various wires and (b) the power dissipation of various wires as a function of frequency in Xilinx Virtex-II [120]
2.5 Power dissipation of various storage element implementations as a function of the number of entries (Virtex-II XC2V1500, 150MHz, 50% switching activity, 16 bits per entry)
2.6 Power dissipation of various multipliers as a function of precision (Virtex-II XC2V1500, 150MHz, 50% switching activity)
2.7 The effect on energy dissipation of disabling an SRAM as a function of the number of entries (Virtex-II XC2V1500, 150MHz, 50% memory access rate, 8 bits per entry)
3.1 Domain-specific modeling
3.2 (a) Domain-specific modeling and system-wide energy estimation and (b) component power state matrices
3.3 (a) Power function estimation and (b) register power function as a function of the number of registers (r) and frequency (f)
3.4 Design methodology based on the domain-specific modeling
3.5 Uniprocessor architecture: (a) off-chip design and (b) on-chip design
3.6 System-wide energy dissipation and energy distribution for 12 x 12 matrix multiplication as a function of cache size: (a) off-chip design and (b) on-chip design
3.7 (a) Linear array architecture for matrix multiplication, (b) PE organization, and (c) corresponding algorithm
3.8 (a) Power dissipation for a single PE and (b) the system-wide energy as a function of the amount of storage (s) for n = 4, 8, 16
4.1 Energy distribution of the design proposed in [110]
4.2 (a) Off-chip design, (b) on-chip design, (c) architecture of PEj used in Theorem 1, and (d) algorithm used in Theorem 1
4.3 Snapshot of the data flow for 3 x 3 matrix multiplication (Theorem 1)
4.4 Architecture of PEj for Theorem 2
4.5 Algorithm for Theorem 2

6.5 Energy and area estimates for various designs for N = 256
6.6 Energy distribution of modules in FFT architecture for various design points (N = 256, BRAM based)
7.1 Design methodology at the application level with the kernel level

Abstract

Reconfigurable hardware such as FPGAs is a flexible alternative to the DSPs or ASICs used in mobile devices, for which energy is a key performance metric. Designs on reconfigurable hardware offer many design parameters such as operating frequency, precision, amount of storage, and degree of parallelism. These parameters define a large design space that must be explored to find energy efficient solutions. It is also challenging to predict the energy variation at the early design phases when a design is modified at the algorithm level. To address this scenario, a methodology to develop energy efficient designs on FPGAs is proposed. The methodology integrates domain-specific modeling, coarse-grained performance evaluation, design space exploration (DSE), and low-level simulation to understand the trade-offs among energy, latency, and area. The domain-specific modeling technique defines a high-level model by identifying various components and parameters specific to a domain that affect the system-wide energy dissipation. A domain is a family of architectures and corresponding algorithms for a given kernel. The high-level model also consists of functions for estimating energy, latency, and area that facilitate trade-off analysis. This model is used to understand the impact of various parameters on system-wide energy and is a basis for energy efficient designs. DSE analyzes the design space defined by the domain and selects a set of designs. Low-level simulations are used for accurate performance estimation for the designs selected by the DSE and also for final design selection.
The modeling technique and design methodology are applied to three digital signal processing kernels: matrix multiplication, matrix factorization, and Fast Fourier Transforms. The designs identified by our methodology demonstrate trade-offs among energy, latency, and area. Our designs are compared with state-of-the-art designs to demonstrate their effectiveness. As the first kernel, matrix multiplication is considered. From the well-known designs for matrix multiplication, "energy hot spots", which are responsible for most of the energy dissipation, are identified. Then three new algorithms and architectures that offer trade-offs among the number of I/O ports, registers, and PEs are proposed. Functions to represent the impact of algorithm design choices on the energy, area, and latency are derived. These functions are used to either optimize the energy performance or provide trade-offs for a family of candidate algorithms and architectures. As the second kernel, two designs for matrix factorization are proposed. The first design is used for a normal LU factorization. A linear array architecture is employed to minimize the usage of long interconnects, leading to lower energy dissipation. The optimal latency is achieved on the linear array architecture. The second design is used for block-based LU decomposition. The linear array based design for LU decomposition and the design for the matrix multiplication kernel are re-used. Through the analysis of design trade-offs, the block size that minimizes the total energy is identified. As the third kernel, energy efficient designs for the Fast Fourier Transform are proposed. Architectural parameters such as degrees of vertical and horizontal
parallelism are identified and a design domain is created through a combination of design choices. Design trade-offs are analyzed using a high-level performance model to obtain energy efficient designs.

Chapter 1

Introduction

1.1 Overview

A dramatic increase in density and speed has been achieved in recent FPGA architectures. The state-of-the-art FPGA from Xilinx has multi-million system gates and delivers over 0.3 Tera MACs/sec at an operating frequency of 300MHz [144]. Table 1.1 shows the peak performance capabilities of the Virtex FPGA compared with the fastest DSP available last year [142].

Indeed, FPGAs have become an attractive fabric for the implementation of massively parallel and computationally intensive applications such as signal, image, and network processing tasks [52][84][96][118]. Also, the reconfigurability and the high performance of FPGAs enable future wireless communication systems such as Software Defined Radio (SDR) [137].

Table 1.1: Performance comparison of FPGAs and DSP [142]

Function                                  Fastest DSP (TI 64xx)    Xilinx Virtex-II
8x8 MAC                                   4.4 billion MACs         600 billion MACs
FIR, 256 tap, 16-bit data/coefficient     17MSPS (1.1GHz)          180MSPS (180MHz)
1024-point FFT (16 bit data)              7.7 μsec (800MHz)        0.1 μsec (140MHz)

Traditionally, the performance metrics for implementing digital signal processing applications and, indeed, most processing applications in general, have been latency and throughput. However, with the proliferation of portable and mobile devices [14] [84], it has become increasingly important that systems are not only fast, but also energy efficient.
[Figure 1.1: Power dissipation of various processors. Bar chart of power (mW) in sleep, idle, and normal modes for the AMD Opteron (1.8GHz), Pentium 4-M (2GHz), Pentium 4 (3GHz), Xilinx Virtex-II (XC2V1500), TI DSP TMS320C64xx (600MHz), PowerPC 405LP (380MHz), and Intel PXA255 (XScale) (400MHz).]

To develop energy efficient designs, optimization is required at various levels. Studies show that optimization at the algorithmic level has a much higher impact on the total energy dissipation of a system than optimization at the RTL or gate level. It is reported that the ratio of impact on energy optimization is 20 : 2.5 : 1 for the algorithmic, register, and circuit levels [114]. Thus, instead of low-level optimization techniques, in this thesis we investigate and apply algorithmic techniques for minimizing the energy dissipated by FPGAs in signal processing applications.

In addition, if the level of abstraction is elevated, high-level models do not capture all the details of a system and consider only a small set of key parameters that affect energy. This lowers the accuracy of energy estimation. Alternatively, a large number of parameters in a high-level model may achieve higher accuracy but results in a large design space.

• Large simulation effort: Low-level simulations using RT-level power simulators are not only time consuming but also require input vectors, which in turn need equally expensive functional simulation. The flexibility of FPGAs results in a large design space. It is not feasible to traverse such a large space with time-consuming low-level simulations using tools such as Mentor Modelsim [92] and Xilinx XPower [146].
to represent the impact of changes in the algorithm on the system-wide energy dissipation, area, and latency. The modeling starts by identifying parameters whose values change depending on the algorithm and have a significant impact on the system-wide energy dissipation. These parameters depend on the algorithm and the architecture used, and on the target FPGA's device features. The parameters give rise to a large design space, which may take a long time to explore when minimizing the system-wide energy dissipation. In this regard, the designer's ability to understand the algorithm and the architecture is very significant in identifying the key parameters. For example, if two algorithms use different numbers of MACs and adders to implement a matrix multiplication, and the MACs and adders are busy in almost all the cycles, the numbers of MACs and adders are good candidates for key parameters. We derive closed-form functions representing the system-wide energy dissipation, area, and latency in terms of the key parameters. Assumptions are then made to simplify the functions. In general, the simplicity of the functions, and the resulting deviation of their coefficients from the actual coefficients obtained via low-level simulations, depend on the application and on the designer's ability to extract key parameters and build an appropriate energy model using them. These functions are meant to be used in the early design phase, where we are more concerned with algorithmic level changes than gate-level changes, so their accuracy is more than sufficient for that purpose. In addition, our techniques can also be used for next generation FPGAs having lower power dissipation features as well as higher computing power.
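The idea of a closed-form energy function in terms of a few key parameters can be illustrated with a minimal sketch. It assumes a hypothetical design built only from MACs and adders that are busy in nearly every cycle. The `Component` type, the per-component power values, and the n^3 / (number of MACs) latency expression are illustrative assumptions, not functions or figures taken from this thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One building block of the architecture (all values are hypothetical)."""
    name: str
    count: int        # key parameter: number of instances
    power_mw: float   # assumed average power per active instance, in mW
    activity: float   # assumed fraction of cycles the component is busy


def latency_cycles(n: int, num_macs: int) -> int:
    """Assumed latency model: n^3 multiply-accumulates shared evenly by the MACs."""
    return -(-n ** 3 // num_macs)  # ceiling division


def energy_nj(components, cycles: int, freq_mhz: float) -> float:
    """System-wide energy as the sum of (power x busy time) over all components."""
    time_us = cycles / freq_mhz  # cycles / MHz gives microseconds
    # mW x microseconds = nJ
    return sum(c.count * c.power_mw * c.activity * time_us for c in components)


design = [Component("mac", 4, 10.0, 1.0), Component("adder", 4, 2.0, 1.0)]
cycles = latency_cycles(8, 4)                         # 512 / 4 = 128 cycles
print(f"{energy_nj(design, cycles, 100.0):.2f} nJ")   # prints 61.44 nJ
```

Because the function is closed-form, changing a key parameter such as the number of MACs only requires re-evaluating the expression rather than re-running a low-level simulation.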
We apply the techniques presented here to designing architectures and algorithms for three well-known digital signal processing kernels: matrix multiplication, matrix factorization, and Fast Fourier Transforms (FFT). Matrix multiplication and matrix factorization are frequently used kernel operations in signal and image processing systems, including mobile and SDR systems [91]. They are also used for signal filtering in embedded target recognition systems. Matrix multiplication is used as part of matrix factorization. The FFT is the compute-intensive portion of broadband beamforming applications such as those generally used in SDR and sensor networks. Figure 1.2 shows two algorithms and several kernels used in SDR, which are similar to the kernels we choose. For example, the Minimum Variance Distortionless Response (MVDR) beamformer with the Recursive Least Squares (RLS) algorithm consists of matrix factorization, multiplication, addition, and subtraction [47].

[Figure 1.2: Algorithms used in Software Defined Radio (SDR), fed by an antenna array: (a) direction of arrival algorithm (e.g. MUSIC) [115], with blocks for covariance matrix factorization, FFT, and matrix-vector multiplication; computational requirement: 3 GOPS for a single user; and (b) MVDR with RLS algorithm [47], with blocks for matrix addition/subtraction, weight update, matrix-vector multiplication, and correlation matrix inversion; computational requirement: 53 GOPS for a single user.]

To show the performance of our designs, we compare the latencies, resource utilizations, and energy dissipations of the energy efficient designs to those of Xilinx IP cores and DSP based designs for the same signal processing kernels. We use both high-level estimation (based on latency and energy equations) and low-level simulation in our comparisons. These comparisons show that our proposed designs using FPGAs can provide significant reductions in not only latency but also energy dissipation.
1.2 Contributions of the Dissertation

The contributions of this thesis include:

1.2.1 High-Level Performance Modeling Technique: Domain-Specific Modeling and Design Methodology

Reconfigurable architectures offer several design parameters such as operating frequency, precision, amount of memory, and degree of parallelism. These parameters define a large design space that must be explored to find energy efficient solutions. It is also challenging to predict the energy variation at the early design phases when a design is modified at the algorithm level. Efficient traversal of such a large design space requires high-level modeling to facilitate rapid estimation of system-wide energy. To address this scenario, we propose a domain-specific modeling technique for energy efficient kernel design that exploits the knowledge of the algorithm and the target architecture family for a given kernel to develop a high-level model. This model captures architecture and algorithm features, parameters affecting energy performance, and power estimation functions based on these parameters. A system-wide energy function is derived based on the power functions and the cycle-specific power state of each building block of the architecture. This model is used to understand the impact of various parameters on system-wide energy and can be a basis for the design of energy efficient algorithms. Our high-level model is used to quickly obtain a fairly accurate estimate of the system-wide energy dissipation of data paths configured using FPGAs.

Based on the high-level modeling technique, a design methodology is developed to explore the design space and obtain high energy performance on FPGAs. Our methodology integrates domain-specific modeling, coarse-grained performance evaluation, design space exploration, and low-level simulation to understand the trade-offs among energy, latency, and area.
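The coarse-grained evaluation and design space exploration steps of this methodology can be sketched as follows. This is only an illustration under stated assumptions: the `estimate` function and all of its coefficients are hypothetical stand-ins for high-level energy, latency, and area functions, and keeping the non-dominated (Pareto-optimal) points is one plausible selection rule, not necessarily the one used in this thesis.

```python
from itertools import product


def estimate(num_pes: int, storage: int, n: int = 8):
    """Hypothetical high-level estimates (energy, latency, area) for one design."""
    latency = n ** 3 / num_pes                         # assumed cycle count
    area = num_pes * (100 + 4 * storage)               # assumed area units
    compute = num_pes * (5.0 + 0.2 * storage) * latency  # assumed power x time
    io = 100.0 * n * n / storage                       # assumed I/O energy, falls with local storage
    return (compute + io, latency, area)


def dominates(a, b) -> bool:
    """True if a is no worse than b in every metric and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def explore(pe_choices, storage_choices):
    """Enumerate the design space and keep only the non-dominated designs."""
    points = {(p, s): estimate(p, s) for p, s in product(pe_choices, storage_choices)}
    return {
        design: metrics
        for design, metrics in points.items()
        if not any(dominates(other, metrics) for other in points.values())
    }


# Candidates that would be passed on to low-level simulation.
for design, (energy, latency, area) in sorted(explore([1, 2], [1, 4, 16]).items()):
    print(design, round(energy, 1), latency, area)
```

In this toy model the designs with 16 storage entries are discarded because a design with 4 entries has lower energy and area at the same latency, which mirrors the idea of using cheap estimates to prune the space before any expensive simulation is run.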
The domain-specific modeling technique defines a high-level model by identifying various components and parameters specific to a domain that affect the system-wide energy dissipation. The high-level model also consists of functions for estimating energy, latency, and area that facilitate trade-off analysis. Design space exploration (DSE) analyzes the design space defined by the domain and selects a set of designs. Low-level simulations are used for accurate performance estimation for the designs selected by the DSE and also for final design selection.

1.2.2 Energy Efficient Algorithmic Design Techniques in FPGAs

We identify techniques that can be applied to FPGA-based designs to obtain energy efficiency.

1.2.3 Energy and Time Efficient Designs for Matrix Multiplication Using FPGAs

we identify "energy hot spots", which are responsible for most of the energy dissipation. Based on this, we develop algorithms and architectures that offer trade-offs among the number of I/O ports, the number of registers, and the number of PEs. To avoid time-consuming low-level simulations for energy profiling and performance prediction of many alternate designs, we derive functions to represent the impact of algorithm design choices on the system-wide energy dissipation, area, and latency. These functions are used to either optimize the energy performance or provide trade-offs for a family of candidate algorithms and architectures. For selected designs, we perform extensive low-level simulations using state-of-the-art tools and target FPGA devices. We show a design space for matrix multiplication on FPGAs that results in trade-offs among energy, area, and latency. The designs are made scalable by using a fixed I/O bandwidth independent of the problem size.

1.2.4 Energy and Time Efficient Designs for Matrix Factorization Using FPGAs

The optimal latency is achieved on the linear array architecture. The second design is used for a block-based LU decomposition.
The first design and the matrix multiplication kernel are re-used. In both designs, high-level models for energy profiling are built, and the time and energy performance of many possible designs is predicted. In the second design, through the analysis of design trade-offs, the block size that minimizes the total energy dissipation is identified. A set of candidate designs is implemented on the FPGAs to verify the estimates. Since we are not aware of any designs that map energy efficient LU decomposition onto FPGAs, we implement a uniprocessor design and the best known design on a linear array. They are compared with our designs in terms of time, area, and energy performance.

1.2.5 Energy/Time Efficient and Parameterized Designs for Fast Fourier Transforms Using FPGAs

We implement a set of parameterized designs, with parallelism, radix, and choice of storage type as parameters, on FPGAs to verify the estimates. Our designs dissipate 57% to 78% less energy than optimized state-of-the-art designs. In terms of a comprehensive metric such as EAT (Energy-Area-Time), our designs offer performance improvements of 3-13x over the state-of-the-art designs.

1.3 Outline of the Dissertation

The remainder of this thesis is organized as follows:

Chapter 2 presents the background for the thesis. The evolution and the state of FPGAs are discussed. The characteristics of power dissipation in FPGAs are discussed to motivate the issues that need to be addressed in developing energy optimization techniques.

The high-level energy performance models are defined and used for rapid design space exploration to find the energy minimal solutions.

Chapter 5 illustrates in detail the energy and time efficient designs for matrix factorization using FPGAs. A new design based on a linear array and a block-based design are proposed. The designs for matrix multiplication described in Chapter 4 are reused as part of the matrix factorization.
More algorithmic techniques are developed and applied. The high-level energy performance models are defined and used to identify the block size that minimizes the energy dissipation.

Chapter 6 describes in detail the energy and time efficient and parameterized designs for Fast Fourier Transforms (FFT). Various parameters that significantly affect the energy performance are identified.

Chapter 2

Background

In this chapter, we give a brief overview of the evolution of FPGA architectures. The characteristics of power dissipation and the knobs to control the power dissipation in FPGAs are discussed to motivate the issues that need to be addressed in developing energy optimization techniques. Also, general energy optimization techniques such as latency reduction are discussed.

2.1 Field Programmable Gate Arrays (FPGAs)

As semiconductor technology improves, with more and more transistors being implemented on a single chip, the speed and density of FPGAs have increased tremendously. Moreover, recent platform FPGAs with new features have evolved to satisfy increasing performance requirements in the industry. We summarize the evolution of the FPGA architecture based on the inclusion of new features (see Figure 2.1).

[Figure 2.1: The evolution of FPGAs. The chart plots system gates (0.5M to 8M) against year (1997 to 2002), with process technology shrinking from 0.25-0.35μ to 0.13μ, and marks the successive addition of embedded RAMs, embedded multipliers, gigabit transceivers, and four 32-bit PowerPC cores to the configurable logic blocks. System gates: the metric used to estimate the typical number of gates and the memory that can be realized in the FPGA device for a design.]
2.1.1 Conventional FPGA

The conventional FPGA architecture consists of an array of configurable logic blocks (CLBs) and an interconnection network of wires that connects the CLBs (see Figure 2.2). Both the CLBs and the interconnection network are configurable. Each CLB usually contains configurable lookup tables (LUTs), which enable the device to implement any function of multiple inputs, along with registers and additional combinational logic. The output of each CLB is either the output of the LUT or that of a configurable register connected to the LUT output. The CLBs at the periphery of the device perform the I/O operations. The interconnection network is configured by changing the connections between the CLBs and the wires, which is done by configuring the switch boxes that connect the various wires. The functionality of the CLBs and the connections of the switch boxes are determined by configuration data stored in a configuration memory, which is typically implemented with static random-access memory (SRAM) bits that control the configuration of transistors. An SRAM-based configuration can be reprogrammed by downloading different configuration bits into the SRAM. The Xilinx XC4000 is an example of this type of FPGA.

Thus FPGA architectures can efficiently provide various levels of memory, with individual cells or LUTs acting as registers, cascaded groups of cells or LUTs acting as reconfigurable single- or multi-port RAMs for mid-size needs, and large dedicated RAM blocks for more demanding uses. The inclusion of these flexible embedded memory elements allows high-speed and local-memory intensive designs, common in DSP, to queue and store data locally without costly off-chip access. With the need for high-speed DSP functions, ASIC multipliers and generic DSP blocks containing multipliers and adders have been embedded as arithmetic operators. This greatly improves the performance of the FPGA for these arithmetic functions.
For example, the Xilinx Virtex is a high-performance FPGA (see Figure 2.3) [144]. It has different versions with capacities ranging from 50 thousand to 2 million system gates. The Virtex architecture comprises an array of CLBs, encircled by programmable I/O blocks, and dedicated Block SelectRAMs of 4096 bits each.

[Figure 2.3: The recent FPGA architecture. Labeled elements: dual port memory, embedded memory, embedded multiplier, and embedded processor (PowerPC or ARM).]

It has a hierarchical routing matrix with local routing and a varying number of global routes of different lengths. There are 24 single length routes, 96 routes of length six, and 12 long lines spanning the chip. There are additional I/O routing resources around the periphery of the logic blocks. The CLBs contain four logic cells each. Each logic cell has a 4-input function generator (LUT), a flip-flop, and some carry logic. The LUTs can operate as function generators or they can be used as distributed RAMs. Additional multiplexers and wires in a CLB provide flexible combination of different logic cell outputs and routing of input signals to the CLB output. High speed arithmetic is facilitated by additional carry logic in each of the logic cells. A dedicated AND gate in each logic cell improves multiplier implementations.

On-chip local memories can be realized on the Virtex architecture in two different ways. The logic cells can be combined and configured as memory cells to obtain multiported RAM of the required size. Each Virtex also has large Block SelectRAM memories. These are organized along the two vertical edges of the FPGA. Each memory block is four CLBs high, and there are as many as 32 such blocks in large Virtex chips that are 64 CLBs high.
Altera Stratix [4] FPGAs contain generic DSP blocks containing ASIC adders and multipliers capable of being configured into MACs or complex number multipliers.

2.1.3 Availability of DSP IP cores

In the past several years, with FPGAs being preferred for embedded systems, the need for IP cores has increased significantly. Moreover, the capabilities of FPGAs have improved such that highly complex designs can be implemented. Thus the complexity of the design process increased and a more sophisticated design team was required to successfully complete a design. Proven IP was a fast way to combat rising design costs and expanding schedules. Designers wanting to take advantage of the time-to-market advantages of platform-based FPGAs need ready access to complex peripheral IP. As a result, today there is an increasing number of commercial IP vendors developing certified IP cores for use in high-end FPGAs. The available IP includes processor and DSP cores, bus-based peripherals, and a host of standard functions designed to facilitate almost drag-and-drop system creation.

2.1.4 Inclusion of Embedded and Soft Microprocessor Cores

FPGA based systems are typically attached to a host system through some interface such as the system bus or the I/O channels. While these systems have shown significant speedups for specific applications, the limiting factor is the communication cost between the FPGAs and the host computer. Moreover, control was sometimes enabled through complex state machines or from outside the chip. Currently, systems try to alleviate this problem by embedding microprocessors into FPGAs on the same die. Moreover, microprocessors can be implemented using the FPGA fabric itself. Thus architectures having combinations of processors and designs on the FPGA fabric can be obtained.
There are various terms used for these architectures, including Configurable System-on-Chip (CSOC), Reconfigurable Systems-on-Chip (RSOC), and Systems on Programmable Chip (SOPC) [15] [42] [105].

FPGA vendors are also aggressively approaching this design space by providing customized processor and other IP cores on their devices. These include the Virtex-II Pro [145] with IBM PowerPC 405 cores and the Altera Excalibur [5] with an ARM922T processor. The PowerPC core operates at 380MHz and communicates with the FPGA fabric at more than 6GB/sec. Virtex-II Pro is an evolution from the high capacity Virtex-II series FPGAs. It is expected to scale up to 10 million system gates and to have 556 18-bit x 18-bit embedded multipliers.

We then present techniques for reducing energy dissipation, some of which do so by lowering power dissipation, others by lowering latency. Several low-level and algorithmic level techniques for energy efficient design are discussed. The main focus is on algorithmic level techniques. "Algorithmic level techniques" refer to those techniques in algorithm development that can be used to reduce energy dissipation.

2.2.1 Power Dissipation in FPGAs

A widely accepted equation for general power dissipation [95] is:

    $P = C V^2 f + V I_{leak}$    (2.1)

The first term defines the dynamic power dissipation, where P, C, V, and f represent power dissipation, effective capacitance, voltage, and running frequency, respectively. Effective capacitance can be used to account for the combined effect of multiple capacitances or varying switching activity.

Table 2.1: (a) Capacitances of CLB and embedded blocks and (b) power dissipation of configured and embedded logic in Xilinx Virtex-II

(a)
Components           C (pF)
Flip-Flops           2.88
LUT                  26.4
Distributed RAM      24.3
Block RAM            982.5
Block Multipliers    1,777.7
I/O LVTTL            100.4

(b)
Modules             Speed (MHz)   Toggle Rate (%)   Area (slice)   Power (mW)   Note
PowerPC             300           n/a               n/a            290.0
Register            150           50                4              2.1          16 bit
Counter             150           50                4              2.8          16 bit
Adder               150           50                4              2.8          16 bit
Multiplier (LUT)    150           50                307            97.8         16 bit
Block Multiplier    150           50                n/a            35.6         16 bit
Distributed RAM     150           50                145            85.3         128Byte
Block RAM           150           50                n/a            25.7         2KByte

Figure 2.4 shows the capacitance and the power dissipation of various interconnection wires in Virtex-II [120]. This analysis differs from ASIC technology, where clock distribution often dominates power dissipation [148]. The sources of power dissipation in these two technologies are different because their interconnect structures are composed differently: FPGA interconnect consists of pre-fabricated wire segments of various lengths, with used and unused routing switches attached to each wire segment.

Figure 2.4: (a) Capacitances of various wires and (b) the power dissipation of various wires as a function of frequency in Xilinx Virtex-II [120]

(a)
Interconnect      C (pF)
Long Line         26.10
Hex Line          18.40
Double Line       13.20
Direct Connect    7.28

(b) [Plot: power dissipation versus frequency (0 to 100 MHz, 20% toggle rate) for long line, hex line, double line, and direct connect wires.]

Another important factor affecting the power dissipation in FPGAs is resource utilization [120]. In typical FPGA designs, a majority of the resources may not be used after the configuration and thus they will not dissipate any dynamic power.
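As a rough illustration of the dynamic term of Equation 2.1, the sketch below evaluates $C V^2 f$ for the capacitances listed in Table 2.1 (a). The 1.5 V core voltage, the 150 MHz clock, and the `activity` scaling factor are assumptions chosen for illustration; the function name is hypothetical, and the results are not expected to reproduce the measured values in Table 2.1 (b), which depend on toggle rate and on the full implementation of each module.

```python
# Sketch of the dynamic term of Equation 2.1: P_dyn = C * V^2 * f.
# Capacitances are the Table 2.1 (a) values; voltage and frequency are
# assumed operating conditions, not values prescribed by the text.

CAPACITANCE_PF = {
    "Flip-Flops": 2.88,
    "LUT": 26.4,
    "Distributed RAM": 24.3,
    "Block RAM": 982.5,
    "Block Multipliers": 1777.7,
    "I/O LVTTL": 100.4,
}


def dynamic_power_mw(c_pf, voltage_v, freq_hz, activity=1.0):
    """Return activity * C * V^2 * f in milliwatts.

    `activity` is a hypothetical knob that folds switching activity into
    the effective capacitance; 1.0 means the full capacitance switches
    every cycle.
    """
    c_farads = c_pf * 1e-12
    return activity * c_farads * voltage_v ** 2 * freq_hz * 1e3


if __name__ == "__main__":
    for name, c_pf in CAPACITANCE_PF.items():
        print(f"{name:18s} {dynamic_power_mw(c_pf, 1.5, 150e6):10.3f} mW")
```

Because the dynamic term is linear in C and f and quadratic in V, lowering the voltage has the largest effect per unit change, while unused resources (zero activity) contribute nothing to this term.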
One more factor in determining power dissipation is the switching activity, which is defined as the number of signal transitions in a clock period. The switching activity for each resource depends not only on the type of design but also on the input stimuli.

The ability to choose the proper binding is due to the existence of several configurations for the same computation. Thus, different bindings affect FPGA energy dissipation. For example, in Figure 2.5, we show three possible bindings for storage in Virtex-II based on the number of entries: registers, slice based RAM (SRAM), and embedded Block RAM (BRAM).

[Figure 2.5: Power dissipation of various storage element implementations as a function of the number of entries (Virtex-II XC2V1500, 150MHz, 50% switching activity, 16 bits per entry). Horizontal axis: number of entries (n), 1 to 10000 on a logarithmic scale. Vertical axis: power dissipation.]

2.2.3 Algorithm Level Design Techniques

It is known that energy performance can be improved significantly by optimizing a design at the algorithm level [114]. We summarize the algorithm-level techniques that can be used to improve the energy performance of designs implemented on FPGAs.

Architecture Selection: Since FPGAs provide the freedom to map various architectures, choosing the appropriate architecture affects the energy dissipation. It plays a large part in determining the amount of interconnect and logic to be used in the design. Since interconnect dissipates a large amount of power, minimizing the number of long wires between building blocks is beneficial [120].
Several past efforts have identified various architecture families, each having different characteristics in terms of I/O complexity, memory requirements, area, etc. [53] [78].

Module Disabling: In developing an algorithm, it is possible to design the algorithm such that it uses the clock gating technique to disable modules that are not in use during the computation. For example, FFT computation has many complex number multipliers to perform twiddle factor computations (multiplication and addition/subtraction). Because of the nature of the algorithm, some twiddle factors are 1, -1, j, or -j and their computation can be bypassed. Thus, the implementation of twiddle factor computation can exploit clock gating to disable the unnecessary computation modules. Another example is that some RAM implementations have sleep states so that power dissipation is reduced when they are idle. Figure 2.7 shows the power dissipation of SRAM (16 bits per entry) of various sizes. The disabled memory dissipates less than 10% of the power that the enabled memory does. The power dissipation of BRAM is even smaller than that of SRAM: it is less than 1% of the power dissipation of the enabled memory. Because of these reduced power dissipations, energy dissipation is also reduced, provided that latency does not increase too much.

Memory Banking: Along with block disabling and using BRAMs, a large memory can be made of smaller memory banks, where each bank has its own enabling/disabling feature. By enabling only the necessary memory banks, the energy used by memory is saved. For example, the block based approaches in matrix multiplication and matrix factorization often require a large number of memory banks consisting of BRAMs. When n = 128, 16 BRAMs are required. However, 3 to 4 BRAMs are often required and by using memory banking we can save 80% of the energy in the memory.

[Figure 2.7: The effect on energy dissipation of the disabling of a SRAM as a function of the number of entries (Virtex-II XC2V1500, 150MHz, 50% memory access rate, 8 bits per entry). Horizontal axis: number of entries (n), 32 to 256. Series: Disabled; Enabled (50% toggle rate).]

Pipelining: Pipelining is an efficient design practice for both time and energy performance. Many digital signal processing applications process streaming data. For these applications with regular data flow, pipelining increases throughput. Pipelining increases power dissipation, however, since all logic in the design is continuously active.

In Chapter 4, we employ a fully parallel approach and a block based approach for matrix multiplication. For problem sizes up to 16, the fully parallel architecture gives good energy performance. However, for problem sizes greater than 16, block matrix multiplication leads to better energy performance. Since the internal storage required for large problem sizes increases dramatically, parallel processing has a negative effect on the total energy dissipation. This result implies that a designer must carefully investigate the trade-offs between the algorithm and the degree of parallelism. Throughout this thesis, we will discuss this effect in detail.

Algorithm Selection: A given application is mapped onto FPGAs differently by selecting different algorithms. For example, using block matrix multiplication is the algorithm-level design choice for larger matrix multiplication.

of the device. This quiescent power cannot be minimized at the algorithm level.
However, the faster a design completes its computations, the less energy will be consumed due to quiescent power.

Chapter 3 High-Level Performance Modeling Techniques: Domain-Specific Modeling and Design Methodology

A high-level model should allow a designer to explore a large design space rapidly in order to evaluate the impact of algorithmic and architectural choices on the energy dissipation of a design in FPGAs, while remaining fairly accurate in predicting the performance. Several issues must be addressed in developing a high-level energy model for FPGAs. There are numerous ways to map an algorithm onto an FPGA, as opposed to mapping onto a traditional processor such as a RISC processor or a DSP, for which the architecture and the components such as ALU, data path, memory, etc. are well defined. For FPGAs, the basic element is the lookup table (LUT), which is too low-level an entity to be considered for high-level modeling. Besides, the architecture design depends heavily on the algorithm. Therefore, no single high-level model can capture the energy behavior of all feasible designs implemented on FPGAs. In addition, if the level of abstraction is elevated, high-level models may not capture the details of a system and may consider only a small set of key parameters that affect energy. This can affect the accuracy of energy estimation.

In order to address the issues discussed above, we propose a domain-specific modeling technique (see Figure 3.1). This technique facilitates high-level energy modeling for a specific domain. A domain corresponds to a family of architectures and algorithms that implements a given kernel. For example, a set of algorithms implementing matrix multiplication on a linear array is a domain.
Detailed knowledge of the domain is exploited to identify the architecture parameters for the analysis of the energy dissipation of the resulting designs in the domain. By restricting our modeling to a specific domain, we reduce the number of architecture parameters and their ranges, thereby significantly reducing the design space. A limited number of architecture parameters also facilitates the development of power functions that estimate the power dissipated by each component (a building block of a design). The advantage of our approach is the ability to rapidly evaluate the system-wide energy using the energy function for different designs within a domain. Our high-level energy model also facilitates algorithmic level energy optimization through identification of appropriate settings for architecture parameters such as frequency, number of components, precision, etc., early in system design.

The organization of the chapter is as follows. Related efforts are discussed in Section 3.1. Section 3.2 describes the domain-specific modeling technique and the methodology to estimate the power functions. A detailed description of modeling and energy estimation using domain-specific modeling for four different domains is presented in Section 3.4.

3.1 Related work

Several research efforts have focused on rapid energy estimation of a design on FPGAs. Shang and Jha [119] proposed a black-box approach to estimate energy based on input and output signal statistics. Stammermann et al. presented ORINOCO, a software tool for power dissipation analysis and optimization at the algorithmic level from C/C++ and VHDL descriptions [125]. However, C/C++ or VHDL descriptions do not capture parameters affecting system-wide energy, and a designer requires complete knowledge of the final system before the code can be written in these languages.
Both ORINOCO and XPower are essentially estimation tools and can be used in our methodology to perform the low-level sample simulations necessary for specifying our component specific power functions. We have compared our estimation accuracy against XPower.

In [17], a regression tree [18] is used to improve the power estimation of an RT-level component. Starting with candidate variables (I/O bits), the variable U_j that has the maximum impact on the power dissipation is identified. Then the sample power dissipation results from power measurement are split into two subsets based on this variable. The splitting is performed recursively to build a regression tree, which ranks variables by their significance with respect to the power. It is a bottom-up approach that starts from the low-level implementation and ends in identifying the significant variables affecting the power.

In contrast, our model starts with candidate parameters chosen from a high-level view of the architecture and algorithm. The effect of the parameters on the system-wide energy is captured in the component specific power functions. The component specific power functions are used to obtain parameter values for optimal power performance by traversing the design space at an algorithmic level.

3.2 Domain-Specific Energy Modeling

Since FPGAs provide the freedom to map various architectures, choosing an appropriate architecture plays a significant role in determining the amount of interconnect and logic to be used in the design, which also affects energy dissipation, latency, and area. The parameters representing algorithm level and architecture level choices for a specific application form a multi-dimensional space. For example, the number of multipliers, registers, and I/O channels can be changed through algorithm level choices for matrix multiplication.
In the course of high-level modeling, we consider many power management techniques that provide control knobs when applied to designing for FPGAs [120] [148]. One such technique is clock gating, which is used to disable parts of the device that are not in use during the computation. In the Virtex-II family of FPGAs, clock gating can be realized by using primitives such as BUFGMUX to switch from a high frequency clock to a low frequency clock [144]. BUFGCE can be used for dynamically driving a clock tree only when the corresponding logic is used. For example, FFT computation has many complex number multipliers to perform twiddle factor computations (multiplication and addition/subtraction). Because of the nature of the algorithm, some twiddle factors are 1, -1, j, or -j and their computation can be bypassed. Thus, the implementation of twiddle factor computation can exploit clock gating to disable the unnecessary computation blocks.

Choosing bindings is another technique. A binding is a mapping of a computation to an FPGA component. The ability to choose the proper binding is due to the existence of several configurations for the same computation. Thus, different bindings affect FPGA energy dissipation. For example, there are three possible bindings for storage elements in Virtex-II devices based on the number of entries: registers, slice based RAM (SRAM), and embedded Block RAM (BRAM). Another example is the choice between hard and soft IP. One such case is the choice of multipliers: block multipliers, such as those in the Virtex-II and Stratix, can be more efficient than CLB-based multipliers. In high-level modeling, we can analyze the trade-offs that arise from various bindings based on the design requirements.
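The binding trade-off described above can be sketched as a small selection routine: given one power function per candidate binding, pick the binding with the lowest estimated power for a given number of entries. The function name `select_binding` and the three models in `EXAMPLE_MODELS` are hypothetical; their coefficients are invented placeholders, not measurements from the Virtex-II, and would in practice be replaced by power functions fitted from low-level simulation.

```python
# Sketch: choose a storage binding by comparing per-binding power functions.
# All coefficients below are invented placeholders for illustration only.

def select_binding(power_functions, n_entries):
    """Return (name, power) of the binding with the lowest estimated power."""
    estimates = {name: fn(n_entries) for name, fn in power_functions.items()}
    best = min(estimates, key=estimates.get)
    return best, estimates[best]


# Hypothetical power models in mW as a function of the number of entries n.
EXAMPLE_MODELS = {
    "registers": lambda n: 0.5 * n,         # grows quickly with n
    "SRAM": lambda n: 5.0 + 0.1 * n,        # moderate fixed cost and slope
    "BRAM": lambda n: 25.0 + 0.001 * n,     # large fixed cost, nearly flat
}

if __name__ == "__main__":
    for n in (4, 100, 1000):
        name, power = select_binding(EXAMPLE_MODELS, n)
        print(f"n = {n:5d}: {name} ({power:.3f} mW)")
```

With these placeholder models the preferred binding shifts from registers to SRAM to BRAM as the number of entries grows, which is the kind of crossover a designer would look for when analyzing bindings at a high level.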
Exploiting the domain knowledge and the power management techniques, the goal of domain-specific modeling (Figure 3.2 (a)) is to represent the energy dissipation of the designs specific to a domain in terms of parameters associated with this domain. For a given domain, only those parameters which can significantly affect system-wide energy dissipation and can be varied at the algorithmic level are chosen for the high-level energy model. As a result, our model a) facilitates algorithmic level optimization of energy performance, b) provides rapid and fairly accurate estimates of the energy performance, and c) provides an energy distribution profile for individual components to identify candidates for further optimization.

3.2.1 High-Level Energy Model

Our high-level energy model consists of RModules, Interconnects, component specific parameters and power functions, component power state matrices, and a system-wide energy function.

Relocatable Module (RModule) is a high-level architecture abstraction of a computation or storage module. It is either CLB (configurable logic block)-based logic or a "larger" module composed of multiple RModules and Interconnects. We define as RModules those modules whose power dissipation can be individually characterized once their input stimuli are known, regardless of their location. For example, a register can be a RModule if the number of registers varies in the design depending on algorithmic level choices. One important assumption about a RModule is that the energy performance of an instance of a RModule is independent of its location on the device. While this assumption can introduce a small error in energy estimation, it greatly simplifies the model. We regard RModules as building blocks which are used to construct the energy model. The granularity of RModules is influenced by the domain. For example, the adders or registers inside the multiplier can be RModules.
But there is no sense in choosing them since there are no corresponding parameters for them in the domain. They are not targets for energy optimization at the algorithm level.

Interconnect represents the connection resources used for data transfer between the RModules. The power dissipated in a given Interconnect depends on its length, width, and switching activity. Interconnect can be of various types. For example, in Virtex-II, there are several Interconnect types such as long lines, hex lines, double lines, and single connections, which differ in their lengths [144] [145]. In the rest of the chapter, we use component to refer to both RModule and Interconnect.

Component specific parameters depend on the characteristics of the component and its relationship to the algorithm. Using knowledge of the application, algorithm, and architecture, we choose those parameters which may significantly affect the total energy and model the domain using the chosen parameters. For example, we model the domain for matrix multiplication using the number of multipliers and registers since power dissipation in these components significantly affects the total energy (Section 3.4.2). From our knowledge of the algorithm, we find that there is frequent systolic movement of intermediate results among them.

If the design has a latency of T cycles, then k two-dimensional matrices are constructed, where the i-th matrix is of size $T \times n_i$ (Figure 3.2 (b)). An entry in a component power state (CPS) matrix represents the power state of a component during a specific cycle and is determined by the algorithm.

System-wide energy function represents the energy dissipation of the designs belonging to a specific domain as a function of the parameters associated with the domain.
The domain-specific nature of our energy modeling is exploited when the designer identifies the level of architecture abstraction (RModules and Interconnects) appropriate to the domain and/or chooses the parameters to be used in the component-specific power functions. This is a human-in-the-loop process and exploits the designer's expertise in the algorithm and the architecture family that constitutes the domain. Well-known power models based on capacitance, voltage, and switching activity can be more accurate and are generic enough to be applicable across many domains. However, they do not give a designer a clear understanding of the impact of his/her algorithmic level design choices on the energy performance. Our modeling enables the designer to rapidly explore a large design space based on an understanding of the effect of the design choices on the overall energy performance.

To handle modeling complexity we follow a hierarchical approach. Each RModule can be recursively divided into RModules and Interconnects. This hierarchical nature allows the designer to capture the details of the architecture in the design at various levels of abstraction to identify parameters affecting performance.

3.2.2 Component Specific Power Function Estimation

Power dissipation by a RModule or Interconnect in a particular state is captured as a power function of a set of parameters. These functions are typically constructed through curve fitting based on some sample low-level simulations. We demonstrate our function estimation technique in detail by deriving the power function for a register-based memory implemented on the Virtex-II device. Figure 3.3 (a) summarizes the technique. This technique was applied for power function estimation during the modeling of the various domains described in Section 3.4.

Let $C.p(p_1, \ldots, p_n)$ be the component power function and $p_1, \ldots, p_n$ be the parameters associated with the component.

[Figure 3.3: (a) Power function estimation and (b) register power function as a function of the number of registers (r) and frequency (f). Panel (b) is a surface plot whose legend bins range from 0.00-5.00 to 15.00-20.00.]

The component specific parameters are frequency of operation, number of registers in a memory, and the precision. We decided not to vary the precision and assumed it to be 8-bit. Therefore the parameters that affect energy dissipation of the memory are the number of registers (r) and the frequency (f). Let (r, f) denote a design point. We identified the candidate designs randomly (for low-level simulation) to be the combinations of r = 1, 4, 8 and f = 10, 50, 150 MHz.

The designer associates a VHDL implementation with each RModule. These VHDL implementations are parameterized based on the parameters supported by the associated RModule. Low-level simulation is performed at each of the chosen design points to estimate the power dissipation at that design point. We use random input vectors since there is no general purpose technique to predict exactly what data is available as input to a component. However, we have developed a technique based on statistical analysis to obtain a reasonably accurate average estimate of the power dissipation of a design [69]. We use confidence intervals about the sample mean energy dissipation for a design. Confidence intervals allow us to address dependency upon input stimuli because they describe the likelihood that the true mean over an entire population is within a certain range of the mean found from a sample out of the population.
The equation $\bar{x} \pm z_{\alpha/2}(s/\sqrt{M})$ is employed to estimate the confidence interval for our simulations, where $\bar{x}$ is the sample mean (the mean found by experiment), $\alpha$ is a number between 0 and 1, $z_{\alpha/2}$ is a constant as explained in [64], s is the population standard deviation, and M is the number of samples. We assume that the distribution from which the results come is not too badly skewed or discrete.

For example, to statistically analyze energy dissipation for the matrix multiplication in Section 3.4.2, we performed 50 different n x n matrix multiplication trials for our linear architecture. Each trial consists of performing the low-level simulation procedure, as described above, with uniformly distributed, randomly generated matrices as input.

These power estimates and the design points are provided as inputs to the power function builder.

can be identified based on long, hex, double and direct wires. After performing synthesis and simulation, we can obtain the number of different wires used from an XDL file, which is the text version of the placed and routed circuit description (.ncd file) [144][146]. The power function of an Interconnect component is

    $I.p(L, w) = \frac{1}{2} V^2 \cdot f \cdot sw \cdot (C_l \cdot l + C_h \cdot h + C_{db} \cdot db + C_{dr} \cdot dr)$

where V is the voltage, f is the operating frequency, sw is the average switching activity, L is the length of an interconnect, w is the precision (or width), and $C_l$, $C_h$, $C_{db}$, $C_{dr}$ and l, h, db, dr are the average capacitance and number of long, hex, double, and direct wires, respectively [120].

However, since the architectures we currently consider have neighboring connections, we use a simplified approach. We use Equation 3.1 to estimate power dissipation in an interconnect.
$C.p$ denotes the power dissipation of a cluster of k RModules connected through the candidate interconnect and $M.p_i$ represents the power dissipation of the i-th RModule. The power dissipated by the cluster and by the RModules is obtained by low-level simulation:

$IC.p = C.p - \sum_{i=1}^{k} M.p_i$    (3.1)

The low-level simulation is performed as follows. The sample VHDL design is synthesized using XST (Xilinx Synthesis Technology) on Xilinx ISE 4.1i. The place-and-route file (.ncd file) is obtained for the target FPGA device using PAR. Mentor Modelsim 5.5e is used to simulate the module and generate simulation results (.vcd file). These two files are then provided to the Xilinx XPower tool to estimate the energy dissipation. The above simulation technique was also applied to the candidate designs to estimate power, which was multiplied by latency to obtain the measured energy estimates shown in Tables 3.3 and 3.4.

While the initial effort to build the component power functions might be expensive, the benefits are noticeable when the same components are re-used in different designs within and (possibly) across domains.

3.2.3 Deriving System-Wide Energy Function

The CPS matrices capture the operating state of each component for every cycle and the power functions provide the power estimate for each state. Therefore, the total energy of the complete system is obtained by summing the energy dissipation of the individual components in each cycle. The system-wide energy function SE is obtained as:

$SE = \sum_{i=1}^{k} \frac{1}{f} \left( \sum_{j=1}^{n_i} \sum_{t=1}^{T} C_i.p.ps \right)$ where $ps = CPS(i, t, j)$    (3.2)

Here $C_i.p.ps$ is the power dissipated in the j-th component ($j = 1 \ldots n_i$) of type i during cycle t ($t = 1 \ldots T$) and $f$ is the operating frequency.
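The summation in Equation 3.2 can be sketched directly as a triple loop over component types, cycles, and component instances. This is only an illustration: the nested-list layout of the CPS data, the dictionary of per-state powers, and all numeric values are hypothetical and are not taken from the tool described in this chapter.

```python
def system_wide_energy(cps, power, freq_hz):
    """Sum per-cycle energy over all components, in joules.

    cps[i][t][j]    -> power state of the j-th component of type i in cycle t
    power[i][state] -> power (W) of a type-i component in that state
    freq_hz         -> operating frequency f, so one cycle lasts 1/f seconds
    """
    power_cycles = 0.0
    for i, per_type in enumerate(cps):
        for per_cycle in per_type:
            for state in per_cycle:
                power_cycles += power[i][state]
    return power_cycles / freq_hz


# Hypothetical example: one component type with two instances over three cycles.
cps = [[["on", "off"], ["on", "on"], ["off", "off"]]]
power = [{"on": 0.010, "off": 0.002}]  # watts, placeholder values
energy = system_wide_energy(cps, power, freq_hz=100e6)
print(energy)
```

The loop visits every entry of the CPS matrices once, which is the worst-case cost; when states repeat periodically, the inner sums can be computed once per pattern and multiplied.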
In the worst case, the complexity of energy estimation is $O(T \times \sum_{i=1}^{k} n_i)$ (see Equation 3.2), which corresponds to iterating over the elements of the CPS matrices and adding the energy dissipated by each component in each cycle. However, there is typically a repeating pattern of state changes for a component (for example, due to loop structures within the algorithms). Also, different components of the same type dissipate the same amount of energy during each cycle. Based on these observations, the time to compute the energy is better than the worst-case complexity stated above. Further, even if we compute the system-wide energy based on each cycle, we do not analyze the activities at the level of individual gates. Typically, there are only a few distinct components within a domain that affect the energy dissipation of the designs in that domain. Indeed, for the illustrative examples considered in this chapter, the time for energy estimation does not depend on the problem size.

The time needed to perform high-level estimation (assuming the power functions are pre-computed) is on the order of minutes on a Pentium III Xeon running at 700 MHz, whereas the time needed for low-level simulation and power estimation was 3-24 hours per design on the same machine. For the domains discussed in this chapter, we typically need 4-8 low-level simulations (one for each design point) for each power function. Once all power functions are computed and the system-wide energy function is derived, they are applied to the complete design space. For Domain 2 (Section 3.4.2), the number of low-level simulations performed to define the domain-specific models was approximately 30. As these simulations are for a component and not the complete design, each low-level simulation takes approximately 30 to 60 minutes.
The model is applicable to all the designs of n x n matrix multiplication where 1 < n < 48 (we chose 20 designs). Our effort therefore takes approximately 10-12 hours of simulation and computation, which is very small compared with the approximately 2 weeks needed to simulate 20 designs.

3.3 Design Methodology

The aim of the design methodology is to design energy efficient data paths specific to an application. To achieve this goal, our methodology presents a set of designs which provide trade-offs among energy, latency, and area. The designer explores these designs and identifies an appropriate design based on some selection criteria and performance metrics.

For a given kernel, there exist several architecture families [53][78][110], each having different characteristics in terms of I/O complexity, memory requirements, area, latency, length of interconnect, etc. Further, the relative cost of memory space and I/O is also considered while determining the size of memory. Based on the performance needs and the capabilities and limitations of the target FPGA chip, we identify a suitable architecture family. The architecture family and the corresponding algorithm for a particular kernel are referred to as a domain. As our approach is based on algorithm level optimizations, the initial step is the most crucial one. Identification of an appropriate domain ensures that we begin with a latency efficient design (we improve energy performance without compromising latency), and that there are architecture parameters that can be varied to increase energy efficiency.

2. Domain-Specific Modeling: The details of the architecture family and the algorithm corresponding to the kernel are captured in a domain-specific model.
This model captures the architecture details in terms of modules (storage and computation) and the connectivity among them, parameters that affect power dissipation, valid ranges for each parameter, various performance constraints, and functions to evaluate power, latency, and area based on these parameters. Further details regarding domain-specific modeling are described in Section 3.2. Domain-specific modeling performs the first level of design space reduction by identifying the ranges of the parameters based on the architecture and algorithm constraints.

3. Coefficient Estimation: The effect of varying different architecture parameters on energy is captured by an energy function associated with the domain-specific model (see Section 3.2.2). This function has three different aspects: the power associated with a module, the number of modules, and the latency. While the number of modules and the latency are derived from the algorithm and architecture details in Step-2, the power for each module is evaluated through low-level simulation of that module. During module specific power estimation, the design of the module also includes the interconnects necessary to add the module to the rest of the design. This ensures that the energy dissipated by the interconnect is also included. As a variation, it is also possible to estimate power for a module as a function of some parameter associated with the module.

Each module is evaluated for the options (a) and (b), and the algorithm and architecture are considered together for option (c). Choosing (a) requires the user to perform Step-2 and Step-3 again. Choices (b) and (c) require the user to perform Step-1 and Step-2 again. The functions associated with the model allow the user to estimate the area, latency, and energy to evaluate each design.
This process of estimating performance through the use of functions is referred to as high-level estimation. After several iterations, once the designer identifies an energy efficient design, the functions associated with the different performance attributes are analyzed to identify the trade-offs among area, energy, and latency and ultimately to identify an energy efficient design for a particular kernel with an area x time requirement similar to that of the base design. The DSE process is repeated for each domain to identify a set of candidate designs for individual kernels. Our methodology advocates identification of a set of designs for each kernel, each with different requirements of energy, latency, and area. Availability of a set of designs ensures flexibility during integration for complete system design. A high-level estimation tool integrates kernel specific performance values to derive system-wide performance values in terms of energy, area, and latency to compare against system-wide performance constraints.

5. Low-level Synthesis & Simulations: This step identifies the energy efficient design. The set of design candidates chosen by DSE is implemented and simulated using low-level simulators such as RT-level or cycle-accurate simulators. Low-level simulations provide accurate latency and energy values and these are used to identify the most efficient design.

3.4 Illustrative Examples of Domain-Specific Modeling and Design Methodology

To illustrate our domain-specific modeling methodology, we apply the techniques discussed in the previous section to define high-level models for three different domains implementing matrix multiplication, a frequently used kernel operation in a wide variety of signal processing algorithms [50]. For each domain, we identify the components and the component specific parameters, evaluate the power functions for each component, and finally derive a system-wide energy function. Three architecture families, a uniprocessor architecture, a homogeneous linear array architecture, and a heterogeneous linear pipelined architecture, are chosen to demonstrate our approach.

In the off-chip design, the PE has one MAC (multiplier and accumulator), a cache (local buffer) of size c, and I/O ports (see Figure 3.5 (a)). Each word of cache is three 8-bit registers. The data matrices are stored in an external memory. For n x n matrix multiplication, the computational complexity of the algorithm is $O(n^3)$ [65]. Block matrix multiplication is performed with block size $\sqrt{c} \times \sqrt{c}$. The I/O complexity (amount of traffic between the PE and external memory) is $O(n^3/\sqrt{c})$. It can be observed that a large cache decreases the I/O traffic and as a result reduces the energy dissipated in performing I/O.

Figure 3.5: Uniprocessor architecture: (a) off-chip design and (b) on-chip design.

In the on-chip design, the PE has one MAC, a cache of size c, and three memory banks for storing the three matrices (see Figure 3.5 (b)). BMM is performed with block size $\sqrt{c} \times \sqrt{c}$. The energy for I/O (outside the device) is not included, but the energy dissipated in the three memory banks is considered. The read/write access frequency of the memory banks depends on the traffic between the memory banks and the PE. It can be observed that as the cache size increases, the number of memory bank accesses decreases and as a result the energy dissipated in the memory banks is reduced.
Defining Components and Parameters

We identified four components: the MAC, the cache, and the memory banks as RModules, and the I/O as an Interconnect. The RModules have w bit precision. We assumed the precision of the input data to be 8 bits and the precision of the intermediate and the output data to be 16 bits. Therefore, the cache size (c) is the only parameter that can be varied at design time. The component specific power functions for the MAC ($MAC.p$), cache ($R.p$), I/O ($IO.p$), and memory bank ($MEM.p$) were obtained through low-level simulation using the method described in Section 3.2.

To implement the MAC in Virtex-II, there are two design choices: a CLB-based multiplier and a dedicated multiplier. A dedicated multiplier is a stand-alone ASIC-based multiplier. A CLB-based multiplier is built using CLBs, and it was observed that it dissipates more power than a dedicated multiplier. Similarly, there are two design choices for implementing the cache using CLBs. If the cache size is small, the cache can be realized using CLBs configured as register modules. A larger cache can be realized using CLBs configured as SRAM [144]. However, an SRAM-based cache can only be configured to be a multiple of 16 bytes.

A Block SRAM can be configured to be a multiple of 2 Kbytes with 8 bit precision. The power function for a 2 Kbyte Block SRAM is: $MEM.p(f_m) = 2.89 f_m^2 + 25.79 f_m + 0.29$ (mW).

System-Wide Energy Function

We now consider the system-wide energy dissipated by the design. In both the on-chip and off-chip designs, the amount of computation performed by the MACs is the same and the MACs dissipate the same amount of energy. For the off-chip design, we do not consider the energy dissipated in the external memory.
The system-wide energy function (SE) for performing n x n matrix multiplication is:

$SE(n, c) = \frac{1}{f} \left( n^3 \, MAC.p + 4 (n^3 + \sqrt{c}) \, R.p(c) + 3 (n^3/\sqrt{c}) \, IO.p \right)$    (3.3)

Note that as c varies, we obtain a family of architectures, each implementing matrix multiplication using BMM with a different block size. The operating frequency of our design was set to 166 MHz. Figure 3.6 shows how different values of c affect the system-wide energy dissipation and the energy distribution among the components of the design for 12 x 12 matrix multiplication. As c increases, the energy for performing I/O decreases but the energy dissipated in the cache increases. Initially, the system-wide energy decreases as c increases, but for large values of c the system-wide energy increases.

For the on-chip design, the energy dissipated in the memory banks is considered instead of the energy dissipated in the I/O. The system-wide energy function is:

$SE(n, c) = \frac{1}{f} \left( n^3 \, MAC.p + 4 (n^3 + \sqrt{c}) \, R.p(c) + 3 (n^3/\sqrt{c}) \, MEM.p \right)$    (3.4)

Figure 3.6: System-wide energy dissipation and energy distribution for 12 x 12 matrix multiplication as a function of cache size: (a) off-chip design and (b) on-chip design.

Note that as c increases, the traffic between the memory banks and the PE decreases and as a result the energy dissipated in the memory banks decreases.

Design Trade-offs and Performance Analysis

As the system-wide energy function is a well-behaved function with easily determinable minima, we were able to identify the most energy efficient designs from the trade-off graphs (see Figure 3.6). For both designs, the cache size c = 16 gives the minimum system-wide energy.
It is critical to ensure that $a_{ik}$ "meets" $b_{kj}$ in a cycle in $PE_j$. For this, $a_{ik}$ and $b_{kj}$ pass through two and one delay(s), respectively, in each PE. The resulting architecture for each PE is shown in Figure 3.7 (b). $a_{ik}$ enters input port AS and goes through two delays (AS.LR and AS.RR), while $b_{kj}$ enters BS and goes through one delay. Details of the algorithm, its analysis, and proof of correctness can be found in [110]. Compared with the uniprocessor design in Section 3.4.1, we use more multipliers to reduce the latency.

The above family of architectures offers several advantages compared to other architecture families. These architectures have a low I/O bandwidth requirement and they scale as the problem size grows. To achieve the minimal I/O complexity ($O(n^2)$), the total amount of storage across all the PEs should be $n^2$. As shown in [110], this architecture can perform n x n matrix multiplication in $O(n^2)$ time using $n \lceil n/s \rceil$ PEs. For the sake of illustration, we consider the on-chip design for this domain.

Defining Components and Parameters

The structure of the linear array is shown in Figure 3.7 (a). It consists of three components: processing elements (PEs), buses connecting adjacent PEs, and memory banks. For the purpose of high-level modeling, we identified the PE and the memory bank as RModules, and the bus between two adjacent PEs as an Interconnect. The PE has a MAC of precision w and storage of size s (see Figure 3.7 (b)). The MAC is implemented using a dedicated multiplier. The PE has two power states, on and off. In the on state, the multiplier is on and thus the PE dissipates more power than in the off state, when the multiplier is off. The power state of
the multiplier is controlled by clock gating. The PE also includes 6 registers and 3 multiplexors of w bits. The key parameters affecting energy are the number of PEs (pe), the amount of storage within a PE (s), and the power states (ps). The system-wide energy function is specified using these three parameters.

Figure 3.7: (a) Linear array architecture for matrix multiplication, (b) PE organization, and (c) corresponding algorithm. The algorithm in panel (c) reads: shift data in the shift registers; read data into the (input) registers from the input ports; if ACT[i] = 1 then C[i] = C[i] + AS.LR * BF.T; depending on a control bit, the multiplexors select data from AS.LR, BS.RR, BS[s], and ACT[s], or otherwise from AS.RR, BF.R, BS.LR, and ACT.L.

Table 3.1: Model parameters

Parameter    Values or ranges
s            1 ≤ s ≤ n
pe           1 ≤ pe ≤ n⌈n/s⌉
w            8
ps           on, off

System-Wide Energy Function

There are several constraints imposed by the algorithm which are exploited to identify component specific parameters and their ranges. The value of s determines the total number of PEs (pe). The latency (T) of this design using $n \lceil n/s \rceil$ PEs and s storage per PE [110] is: $T = n^2 + 2n \lceil n/s \rceil - \lceil n/s \rceil + 1$. We consider problems in the range 1 ≤ n ≤ 16. Precision (w) is set to 8. In each PE, the multiplier is on for $T / \lceil n/s \rceil$ cycles and is off for $T \times (1 - 1/\lceil n/s \rceil)$ cycles. $PE.p_{ps}$ refers to the power dissipation of a PE when its multiplier is in state ps (see Equation 3.5). Note that the I/O traffic between the PEs and the memory banks is $O(n^2)$.
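The latency expression and the multiplier duty cycle stated above can be evaluated with a few lines of arithmetic. This is a minimal sketch; the function names are chosen here for illustration, and the result is in cycles rather than seconds.

```python
def linear_array_latency(n, s):
    """Return (cycles, number_of_PEs) for the linear array.

    cycles = n^2 + 2n*ceil(n/s) - ceil(n/s) + 1, PEs = n*ceil(n/s).
    """
    q = -(-n // s)  # integer ceiling of n/s
    cycles = n * n + 2 * n * q - q + 1
    return cycles, n * q


def multiplier_duty(n, s):
    """Return (on_cycles, off_cycles) for the multiplier of a single PE."""
    cycles, _ = linear_array_latency(n, s)
    q = -(-n // s)
    return cycles / q, cycles * (1 - 1 / q)


for s in (1, 4, 16):
    cycles, pes = linear_array_latency(16, s)
    on, off = multiplier_duty(16, s)
    print(f"s = {s:2d}: T = {cycles} cycles, {pes} PEs, on = {on}, off = {off}")
```

Two properties follow from the formulas: with s = n the latency reduces to n^2 + 2n and the multipliers are never off, and summed over all PEs the multipliers are on for n*T cycles regardless of s.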
The system-wide energy function is:

$SE(n, s) = \frac{1}{f} \left( n \cdot T \cdot PE.p_{ps=on} + T \cdot (n \lceil n/s \rceil - n) \cdot PE.p_{ps=off} + T \cdot (n \lceil n/s \rceil - 1) \cdot IC.p + 3n^2 \cdot MEM.p \right)$    (3.6)

Design Trade-offs and Performance Analysis

Figure 3.8 (a) shows the effect of varying the amount of storage (s) on the power dissipation of a PE. Figure 3.8 (b) shows the effect of varying the amount of storage (s) on the system-wide energy for three problem sizes (n = 4, 8, 16). Based on these plots, to obtain energy efficient designs we choose pe = s = n, where n is the problem size.

Figure 3.8: (a) Power dissipation for a single PE and (b) the system-wide energy as a function of the amount of storage (s) for n = 4, 8, 16.

Table 3.2 shows the energy, latency, and area of the designs for various problem sizes. We compared the performance of our design with a design for 3 x 3 matrix multiplication provided by Xilinx [144]. Since the Xilinx library does not provide an on-chip design, we added Block SRAMs to the Xilinx design. All Xilinx designs execute at 150 MHz. For n > 3, we used block matrix multiplication using the 3 x 3 design. The improvements in energy dissipation and latency of our designs compared with the Xilinx designs are also shown. On average, our designs dissipate 32% less energy than the Xilinx design. The latency improvement varies from 5.8x to 17.3x. However, our designs occupy more area.

In the simulations, the same input data used to obtain the component specific power functions were used. As noted earlier, the average switching activity was observed to be 50%.
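Equation 3.6 for the linear array can be evaluated as below. This is a sketch under stated assumptions: the function name is illustrative, the power values are hypothetical placeholders rather than measured power functions, and the PE power is passed in as a constant even though in practice it depends on s (Figure 3.8 (a)).

```python
def linear_array_energy(n, s, f_hz, pe_on_w, pe_off_w, ic_w, mem_w):
    """Evaluate the linear-array system-wide energy (joules).

    T is the latency in cycles; the four terms are PEs with the multiplier on,
    PEs with the multiplier off, interconnects, and memory bank accesses.
    """
    q = -(-n // s)  # ceil(n/s)
    t = n * n + 2 * n * q - q + 1
    power_cycles = (
        n * t * pe_on_w
        + t * (n * q - n) * pe_off_w
        + t * (n * q - 1) * ic_w
        + 3 * n * n * mem_w
    )
    return power_cycles / f_hz


# Placeholder powers in watts; these are NOT values from this chapter.
energy = linear_array_energy(
    n=4, s=4, f_hz=100e6, pe_on_w=0.010, pe_off_w=0.002, ic_w=0.001, mem_w=0.005
)
print(energy)
```

With s = n the off-state term vanishes because n * ceil(n/s) - n is zero, which matches the observation that every multiplier is busy in the pe = s = n designs.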
We performed this experiment for various problem sizes using designs in Section

Table 3.3 also shows the error percentage of our high-level estimation method when compared with energy estimation values obtained through low-level simulation. The error percentages are below 9.0%.

Table 3.3: Accuracy of the high-level energy estimation of our designs

Matrix size (n x n)      3 x 3    6 x 6    8 x 8    9 x 9    12 x 12    16 x 16
Estimated energy (nJ)    34.0     213.2    497.8    715.5    1801.2     4759.7
Measured energy (nJ)     37.4     228.6    536.9    768.4    1913.6     5078.6
Error                    9.0%     6.7%     7.3%     6.9%     5.9%       6.3%

3.4.3 Domain 3: Block Matrix Multiplication on Linear Array Architecture

The third domain targets large size (n > 12) matrix multiplications. It consists of block matrix multiplication (BMM) and the linear array architecture presented in Domain 2. The BMM algorithm for N x N matrices repeatedly uses the design (hardware) for sub-matrix multiplication of size n x n, where N is a multiple of n. In this domain, we have considered the off-chip design.

Defining Components, Parameters, and System-Wide Energy Function

All components defined in Domain 2 are also applicable to this domain. An additional parameter is the block size (n). For a block size of n, we chose the designs with pe = s = n and ps = on to implement n x n matrix multiplication. Based on our performance trade-off analysis of Domain 2 (Figure 3.8), these designs are the most efficient ones in terms of latency and energy dissipation for n x n matrix multiplication. Since the N x N matrix is divided into n x n sub-matrices, $(N/n)^3$ block matrix multiplications are performed. Therefore, the latency is: $T = (N/n)^3 \times (n^2 + 2n)/f$.
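The latency of the blocked design follows from two facts stated above: (N/n)^3 block multiplications are performed, and each takes n^2 + 2n cycles when pe = s = n. A minimal sketch follows; the function names are illustrative, and the 150 MHz clock in the usage line is an assumed example value, not the frequency of the Domain 3 designs.

```python
def bmm_latency_cycles(big_n, n):
    """Cycles for N x N multiplication using n x n blocks on the linear array."""
    if big_n % n != 0:
        raise ValueError("N must be a multiple of the block size n")
    blocks = (big_n // n) ** 3
    return blocks * (n * n + 2 * n)


def bmm_latency_seconds(big_n, n, f_hz):
    """Latency in seconds at operating frequency f_hz."""
    return bmm_latency_cycles(big_n, n) / f_hz


for n in (3, 6, 12, 24):
    print(f"N = 24, n = {n:2d}: {bmm_latency_cycles(24, n)} cycles")
print(bmm_latency_seconds(48, 16, 150e6))  # assumed clock, for illustration
```

Larger blocks reduce the cycle count because the per-block overhead of 2n cycles is amortized over more work, at the cost of more PEs and storage.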
Therefore, the system-wide energy function is:

$SE(N, n) = n T \cdot PE.p_{ps=on} + (n - 1) T \cdot IC.p + 3T \cdot IO.p$    (3.7)

Design Trade-offs and Performance Analysis

We vary the block size (n) to evaluate various trade-offs. Table 3.4 shows the area, latency, and energy of 24 x 24 and 48 x 48 matrix multiplication using various block sizes. Results show that matrix multiplication for N = 24 dissipates the least energy when n = 12. To verify our result, we simulated all the designs that are within 10% of the optimal design in terms of energy dissipation. Through low-level simulation (Table 3.4), the design with n = 12 is verified as the most energy efficient design for 24 x 24 matrix multiplication. For 48 x 48 matrix multiplication, the design using 16 x 16 block matrix multiplication is the most energy efficient design.

Chapter 4

Energy and Time Efficient Matrix Multiplication Using FPGA

Matrix multiplication is a frequently used kernel operation in a wide variety of graphics, image processing, robotics, and signal processing applications. Several signal and image processing operations can be reduced to matrix multiplication. Most of the previous work on matrix multiplication on FPGAs focuses on latency optimization [7]. However, since mobile devices typically operate under various computational requirements and in energy constrained environments, energy is a key performance metric in addition to latency and throughput [14]. Hence, in this chapter, we develop designs that minimize the energy dissipation. Our designs offer trade-offs between energy, area, and latency for performing matrix multiplication on commercially available FPGA devices. Recent efforts by FPGA vendors have resulted in rapid increases in the density of FPGA devices. Hence, we also develop
a design that attempts to further minimize the energy dissipation and latency in exchange for an increase in area, to utilize higher density FPGAs.

Our effort is focused on algorithmic techniques to improve energy performance, instead of low-level (gate-level) optimizations. We evaluate various alternative designs at the algorithmic level (with accompanying architectural modifications) on their energy performance. For this purpose, we construct an appropriate energy model based on the technique described in Chapter 3 to represent the impact of changes in the algorithm on the system-wide energy dissipation, area, and latency. The modeling starts by identifying parameters whose values change depending on the algorithm and have a significant impact on the system-wide energy dissipation. These parameters depend on the algorithm and the architecture used, and on the device features of the target FPGAs. In the case of matrix multiplication, we achieve a high level of accuracy along with simple representations for the functions. The energy, area, and latency functions provide us with a high-level picture and pointers on where to look for possible savings in system-wide energy, area, and latency. Those functions allow us to make trade-offs in the early design phase to meet the constraints.

Using the energy function, algorithmic and architectural level optimizations are made. Extensive low-level simulations using Xilinx ISE 4.1i and Modelsim 5.5e, with the XC2V1500 and XC2V3000 as example target FPGA devices, are then performed. Xilinx XPower is used on the simulation data to verify the accuracy of the energy and area estimated by the functions. Our experiments

For example, for matrices of size 12 x 12, the system-wide energy dissipation is reduced by an additional 40%, resulting in a 69% reduction when compared with the design from the Xilinx library [144].
The latency is reduced by a factor of 23 and the area increases by a factor of 11.8.

The rest of the chapter is organized as follows. Section 4.1 summarizes the related work in the literature. Algorithms and architectures for energy efficient implementation are presented in Section 4.2. An energy model specific to our implementation is described in Section 4.3. It includes extracting key parameters from our algorithm and architecture to build a domain-specific energy model and deriving functions to represent system-wide energy dissipation, area, and latency. Section 4.3.3 shows the optimization procedure for our algorithms and architectures in an illustrative way. Analysis of the trade-offs between system-wide energy, area, and latency is also provided. Section 4.4 provides implementation details and describes the simulation method along with its statistical representativeness. Section 4.5 analyzes the performance of our algorithms and architectures through various known metrics in addition to the system-wide energy dissipation.

4.1 Related work

To the best of our knowledge, there has been no previous work targeted at energy efficient implementation of matrix multiplication on FPGAs.

Mencer et al. [88] implemented matrix multiplication on the Xilinx XC4000E FPGA device. Their design employs bit-serial MACs using Booth encoding. They focused on trade-offs between area and maximum running frequency with parameterized circuit generators. For the specific example of 4 x 4 matrix multiplication, 954 CLBs are used to achieve a maximum running frequency of 33 MHz.

Amira et al. [7] improved the design in [88] using the Xilinx XCV1000E FPGA device. Their design uses modified Booth-encoder multiplication along with Wallace tree addition. The emphasis was once again on maximizing the running frequency. For the specific example of 4 x 4 matrix multiplication, 296 CLBs are used to achieve a maximum running frequency of 60 MHz. Area/speed, or equivalently the number of CLBs divided by the maximum running frequency, was used as a performance metric.

Even though our designs mainly target the trade-offs among energy dissipation, area, and latency along with algorithmic level energy optimization, they also improve on the designs in [88] and [7] in terms of the area/speed metric. The area/speed metrics for the designs in [88], [7], and for our design are 14.45, 4.93, and 2.35, respectively. For fair comparison, translation of the number of CLBs for different FPGA devices is performed on the basis of the equivalent amount of logic.

Extra 2n registers of 8-bit words are required to store copies and are not involved in the systolic data movement. Their work has never been implemented on FPGAs.

The most appropriate reference design with which the performance of our designs should be compared comes from Xilinx [144]. The state-of-the-art design from the Xilinx library performs matrix multiplication for limited sizes (3 x 3). Xilinx XPower [144] can be used to measure the power dissipation of designs implemented on Xilinx FPGA devices. For fair comparison, we use the same design environment, the same target device, and the same power measurement tool. Details of the simulations can be found in Section 4.5. Xilinx provides only a point design optimized at the gate level. Our work constructs a design space spanned by the possible design choices in our algorithm.

4.2 Energy Efficient Algorithms and Architectures for Matrix Multiplication

For performance comparison purposes, we have implemented the best-known systolic design [110] on FPGA devices.
The energy distribution profile of the design reveals that much of the total energy is dissipated in the registers (see Figure 4.1). For example, 78% of the energy is used in the registers for 12 x 12 matrix multiplication.

Figure 4.1: Energy distribution of the design proposed in [110]

By identifying the energy hot spot, we propose new energy efficient algorithms and architectures for matrix multiplication. We present our algorithms and architectures in two theorems and two corollaries. Pseudo-code for cycle-specific data movement, the detailed architectures, and a snapshot of an example computation are also shown. Theorem 1 improves the best-known algorithm for matrix multiplication [110] in terms of the number of registers used in the designs. Our design has optimal time complexity with a leading coefficient of 1 for matrix multiplication on a linear array. Theorem 1 is extended to Corollary 1 for trade-offs among energy dissipation, area, and latency. Corollary 1 is used to identify energy efficient designs under latency and area constraints.

The second algorithm is developed to exploit further increases in the density of FPGA devices to realize improvements in energy dissipation and latency (Theorem 2). It uses more MACs and I/O ports. Corollary 1 and Theorem 2 are integrated into Corollary 2.

Theorem 1: n x n matrix multiplication can be performed in $n^2 + 2n$ cycles using 3 I/O ports and n PEs (processing elements), each having a MAC (multiplier-and-accumulator), 4 registers, and 2 local memories of n words (Figure 4.2 (a) and (b) show a linear array connecting the PEs and Figure 4.2 (c) shows a PE).

Proof: The algorithm in Figure 4.2 (d) and the architecture in Figure 4.2 (a), (b), and (c) are devised to compute $c_{ij} = \sum_{k=1}^{n} a_{ik} \times b_{kj}$ for all i, j.
$a_{ik}$, $b_{kj}$, and $c_{ij}$ represent elements of the $n \times n$ matrices $A$, $B$, and $C$. $\mathrm{PE}_j$ denotes the $j$-th PE from the left in Figure 4.2 (a), $j = 1, 2, \ldots, n$. $\mathrm{PE}_j$ computes column $j$ of matrix $C$, $c_{1j}, c_{2j}, \ldots, c_{nj}$, which is stored in the local memory Cbuf. In Phase $k$, column $k$ of matrix $A$ ($a_{ik}$, $1 \le i \le n$) and row $k$ of matrix $B$ ($b_{kj}$, $1 \le j \le n$) traverse $\mathrm{PE}_1, \mathrm{PE}_2, \mathrm{PE}_3, \ldots, \mathrm{PE}_n$ in order and allow $\mathrm{PE}_j$ to update $c'_{ij} = c'_{ij} + a_{ik} \times b_{kj}$, where $c'_{ij}$ represents the intermediate value of $c_{ij}$.

Figure 4.2: (a) Off-chip design, (b) on-chip design, (c) architecture of $\mathrm{PE}_j$ used in Theorem 1, and (d) algorithm used in Theorem 1. Panel (d) reads:

    For t = 1 to n do
      For all j, 1 <= j <= n, do in parallel
        PE_j shifts data in BU right to PE_{j+1}
        If (BU = b_kj), copy it into BM
    For t = n+1 to n^2+n do
      For all j, 1 <= j <= n, do in parallel
        PE_j shifts data in A, BU right to PE_{j+1}
        If (BU = b_kj), copy it into BL or BM (alternately)
        If (A = a_ik), c'_ij = c'_ij + a_ik * b_kj
          (b_kj is in either BM or BL)
          (c'_ij is in Cbuf)
    For t = n^2+1 to 2n^2 do
      For all j, 1 <= j <= n, do in parallel
        PE_j stores its input in CObuf
        For t = n^2+1 to n^2+n do
          PE_j outputs data c'_ij to PE_{j-1}
        For t = n^2+n+1 to 2n^2-n do
          PE_j outputs data in CObuf to PE_{j-1}

Figure 4.3: Snapshot of the data flow for 3 × 3 matrix multiplication (Theorem 1)
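The phase-by-phase update described for Theorem 1 can be sketched in software. The following is a minimal functional sketch only: it ignores cycle timing, the BM/BL registers, and CObuf, and simply shows that accumulating $a_{ik} \times b_{kj}$ over phases $k$ in a per-PE column buffer yields the product matrix. The function names `matmul_by_phases` and `matmul_reference` are hypothetical and do not come from the dissertation.

```python
def matmul_by_phases(A, B):
    """Compute C = A x B as one column-times-row update per phase k.

    Functional sketch only (an assumption-laden illustration): cycle timing,
    the BM/BL registers and CObuf are not modelled.
    """
    n = len(A)
    # cbuf[j] plays the role of Cbuf in PE_j: it holds column j of C.
    cbuf = [[0] * n for _ in range(n)]
    for k in range(n):            # Phase k: column k of A and row k of B
        for j in range(n):        # PE_j (all PEs operate in parallel in hardware)
            b_kj = B[k][j]        # the copy of b_kj held while a_1k..a_nk pass by
            for i in range(n):    # a_ik streams through PE_j
                cbuf[j][i] += A[i][k] * b_kj
    # Reassemble the row-major matrix C from the per-PE column buffers.
    return [[cbuf[j][i] for j in range(n)] for i in range(n)]


def matmul_reference(A, B):
    """Textbook triple-loop product, used only as a cross-check."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
    print(matmul_by_phases(A, B) == matmul_reference(A, B))
```

Keeping one buffer per column mirrors the assignment of column $j$ of $C$ to $\mathrm{PE}_j$; the hardware design differs in that the $a_{ik}$ and $b_{kj}$ values arrive over time through the array rather than being indexed directly.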
Once $b_{kj}$ arrives at $\mathrm{PE}_j$, a copy of $b_{kj}$ resides in $\mathrm{PE}_j$ until $a_{1k}, a_{2k}, a_{3k}, \ldots, a_{nk}$ pass through $\mathrm{PE}_j$. We observe that the following two essential requirements should be satisfied: 1) Since $a_{ik}$ stays at each $\mathrm{PE}_j$ for just one cycle, $b_{kj}$ should arrive at $\mathrm{PE}_j$ no later than $a_{ik}$, for any $i$, $1 \le i \le n$. 2) Once $b_{kj}$ arrives at $\mathrm{PE}_j$, a copy of $b_{kj}$ should reside in $\mathrm{PE}_j$ until $a_{nk}$ arrives. We show how these two essential requirements for our systolic implementation are satisfied with a minimal number of registers.

For example, we show how $b_{2n}$ (the last element of matrix $B$ in phase 2) arrives at $\mathrm{PE}_n$ no later than $a_{12}$ (the first element of matrix $A$ in phase 2) for $c'_{1n} = c'_{1n} + a_{12} \times b_{2n}$. $b_{2n}$ needs $3n - 1$ cycles, whereas $a_{12}$ needs $3n$ cycles.

2) Once $b_{kj}$ arrives at $\mathrm{PE}_j$, a copy of $b_{kj}$ should reside in $\mathrm{PE}_j$ until $a_{nk}$ arrives: We show how to minimize the number of registers needed to store copies of $b_{kj}$ ($k = 1, 2, \ldots, n$) in $\mathrm{PE}_j$, for each $j$. We prove that two registers (denoted BM and BL in Figure 4.2 (c)) are sufficient to hold $b_{kj}$ at $\mathrm{PE}_j$ (to store two consecutive elements, $b_{(k+1)j}$ and $b_{kj}$). For example, when $b_{34}$ arrives at $\mathrm{PE}_4$, $b_{14}$ is in BL and $b_{24}$ is in BM. If we can prove that $a_{n1}$ has arrived at $\mathrm{PE}_4$, $b_{34}$ can replace $b_{14}$ in BL. Note that $b_{14}$ is no longer needed in $\mathrm{PE}_4$ after $c'_{n4} = c'_{n4} + a_{n1} \times b_{14}$ is performed using $a_{n1}$. In general, $b_{kj}$ is needed until $a_{nk}$ arrives at $\mathrm{PE}_j$ in the $\{n + (k-1)n + n + j - 1\}$-th cycle. $b_{(k+2)j}$ arrives at $\mathrm{PE}_j$ in the $\{(k+1)n + 2j - 1\}$-th cycle. Since $(k+1)n + 2j - 1 > n + (k-1)n + n + j - 1$ for all $j$, $k$, and $n$, $b_{kj}$ can be replaced when $b_{(k+2)j}$ arrives at $\mathrm{PE}_j$. Also, the time difference between the arrival of $b_{(k+2)j}$ and the last use of $b_{kj}$ is $\{(k+1)n + 2j - 1\} - \{n + (k-1)n + n + j - 1\} = j$, which has a minimum value of 1.
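The arrival-cycle expressions used in this argument can be checked numerically. The sketch below assumes the general arrival cycle of $a_{ik}$ at $\mathrm{PE}_j$ is $n + (k-1)n + i + j - 1$, which is a generalization (not stated explicitly above) of the expression given for $a_{nk}$, and that $b_{kj}$ reaches $\mathrm{PE}_j$ in cycle $(k-1)n + 2j - 1$, obtained by substituting $k$ for $k+2$ in the expression for $b_{(k+2)j}$. All function names are hypothetical.

```python
def a_arrival(n, i, k, j):
    """Cycle in which a_ik reaches PE_j (1-based indices).

    Assumed generalization of the expression given for a_nk in the proof.
    """
    return n + (k - 1) * n + i + j - 1


def b_arrival(n, k, j):
    """Cycle in which b_kj reaches PE_j (1-based indices).

    Obtained by substituting k for k+2 in the expression for b_(k+2)j.
    """
    return (k - 1) * n + 2 * j - 1


def check_theorem1_timing(n):
    """Check both requirements for every PE_j and every phase k."""
    for j in range(1, n + 1):
        for k in range(1, n + 1):
            # Requirement 1: b_kj is present no later than the first a_ik (i = 1);
            # later a_ik (i > 1) arrive even later, so i = 1 is the tight case.
            assert b_arrival(n, k, j) <= a_arrival(n, 1, k, j)
            # Requirement 2: b_(k+2)j overwrites b_kj only after a_nk has used it.
            if k + 2 <= n:
                gap = b_arrival(n, k + 2, j) - a_arrival(n, n, k, j)
                assert gap == j  # at least one cycle, since j >= 1
    return True


if __name__ == "__main__":
    print(b_arrival(3, 2, 3), a_arrival(3, 1, 2, 3))  # b_2n and a_12 at PE_n, n = 3
    print(check_theorem1_timing(12))
```

For $n = 3$ the first line reproduces the $3n - 1$ and $3n$ cycle counts quoted for $b_{2n}$ and $a_{12}$.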
Hence, in the worst case, $b_{(k+2)j}$ arrives 1 cycle after $b_{kj}$ is no longer required, which means that $b_{(k+2)j}$ can replace $b_{kj}$. This also shows that $b_{(k+2)j}$ cannot arrive while $b_{(k+1)j}$ is in use, since $b_{(k+2)j}$ arrives only just after $b_{kj}$ is no longer required. This proves that $\mathrm{PE}_j$ needs at least two temporary registers, BM and BL, to hold $b_{kj}$ ($k = 1, 2, \ldots, n$).

3) $n^2 + 2n$ cycles are needed to complete the matrix multiplication. The computation finishes one cycle after $a_{nn}$ arrives at $\mathrm{PE}_n$, which is the $\{n + (n-1)n + n + n - 1\}$-th or $(n^2 + 2n - 1)$-th cycle.

4) A local memory (Cbuf) of $n$ words is required to store intermediate and final values for the column of matrix $C$ being computed in each PE. Another local memory (CObuf) of $n$ words is required to buffer the final values passed from $\mathrm{PE}_{j+1}$ to $\mathrm{PE}_j$. Starting from the $(n^2 + 1)$-th cycle, results of $c'_{ij}$ are generated from all PEs during the next $2n - 1$ cycles. CObuf is necessary in each PE; otherwise the final $c'_{ij}$ from one matrix multiplication would be overwritten by the intermediate $c'_{ij}$ of the next matrix. $\mathrm{PE}_j$ stores the output from $\mathrm{PE}_{j+1}$ via port $C_{in}$ in CObuf. For the first $n$ cycles, the final $c'_{ij}$ at $\mathrm{PE}_{j+1}$ (for example, $c_{12}$ at cycle 11) is input to $\mathrm{PE}_j$. For the next $n^2 - n$ cycles, the stored $c'_{ij}$ of $\mathrm{PE}_{j+1}$ in CObuf is input to $\mathrm{PE}_j$.

Corollary 1 $n \times n$ matrix multiplication can be performed in $(rn^2 + 2r^2 n)$ cycles using 3 I/O ports and $n/r$ PEs, each having 1 MAC, 2 local memories of $n/r$ words and 4 registers, where $n$ is divisible by $r$.

Proof 2 $n \times n$ matrix multiplication can be decomposed into $r^3$ matrix multiplications of size $(n/r) \times (n/r)$, assuming $n$ is divisible by $r$. Using Theorem 1 with $n$ replaced by $n/r$, the proof follows.

Corollary 1 provides trade-offs between area and latency. Larger values of $r$ reduce the number of PEs, which results in less area.
However, a larger $r$ increases the number of cycles needed to complete the matrix multiplication. Combined with power and area estimation of modules, Corollary 1 provides trade-offs among energy dissipation, area, and latency.

Theorem 2 $n \times n$ matrix multiplication can be performed in $\left(\frac{n^2}{r} + \frac{2n}{r}\right)$ cycles using $3r$ I/O ports and $n/r$ PEs, each having $r^2$ MACs, $2r^2$ local memories of $n/r$ words, and $4r$ registers (Figure 4.4 shows a PE for $r = 2$), where $n$ is divisible by $r$.

Figure 4.4: Architecture of $\mathrm{PE}_j$ for Theorem 2

Proof 3 The $n \times n$ matrices $A$, $B$, and $C$ are divided into $r^2$ submatrices, each of size $(n/r) \times (n/r)$, assuming $n$ is divisible by $r$. Let $A_{xy}$, $B_{xy}$, $C_{xy}$, $1 \le x, y \le r$, denote the submatrices. Then, we have $C_{xy} = \sum_{k=1}^{r} A_{xk} \times B_{ky}$, $1 \le x, y \le r$. The basic idea is to perform $C_{xy} = \sum_{k=1}^{r} A_{xk} \times B_{ky}$ in parallel for all $x, y$, $1 \le x, y \le r$, using $r^2$ MACs per PE. $A_{xk} \times B_{ky}$ for each $x, y$, $1 \le x, y \le r$, can be performed using 3 I/O ports and $n/r$ PEs, each having one MAC, 4 registers, and two local
memories of $n/r$ words per PE, as stated in Theorem 1. Since the computations for all the submatrices of matrix $C$ need to be performed in parallel in each PE, we need to duplicate the resources of a PE by a factor of $r^2$, which is the number of submatrices. This would require $3r^2$ I/O ports and each PE to have $r^2$ MACs, $4r^2$ registers, and $2r^2$ local memories of $n/r$ words. We show how the number of registers and I/O ports can be reduced to $4r$ registers per PE and $3r$ I/O ports.

Figure 4.5: Algorithm for Theorem 2 (shown for $r = 2$; a subscript such as b_{12,kj} denotes element (k, j) of submatrix B_12):

    For t = 1 to n/2 do
      For all j do in parallel
        PE_j shifts words in BU1 and BU2 to the right (to PE_{j+1})
        If (BU1 = b_{11,kj}), copy it into BM1
        If (BU2 = b_{12,kj}), copy it into BM2
    For t = n/2+1 to (n/2)^2+n/2 do
      For all j do in parallel
        PE_j shifts words in A1, A2, BU1, BU2 to the right (to PE_{j+1})
        If (BU1 = b_{11,kj}), copy it into BM1 after moving the word in BM1 into BL1
        If (BU2 = b_{12,kj}), copy it into BM2 after moving the word in BM2 into BL2
        If (A1 = a_{11,ik}), Cbuf11 = Cbuf11 + a_{11,ik} x b_{11,kj}
                             Cbuf12 = Cbuf12 + a_{11,ik} x b_{12,kj}
        If (A2 = a_{21,ik}), Cbuf21 = Cbuf21 + a_{21,ik} x b_{11,kj}
                             Cbuf22 = Cbuf22 + a_{21,ik} x b_{12,kj}
          (b_{11,kj} is in either BM1 or BL1, b_{12,kj} in either BM2 or BL2)
    For t = (n/2)^2+1 to 2(n/2)^2+n do
      For all j do in parallel
        PE_j shifts words in A1, A2, BU1, BU2 to the right (to PE_{j+1})
        If (BU1 = b_{21,kj}), copy it into BM1 after moving the word in BM1 into BL1
        If (BU2 = b_{22,kj}), copy it into BM2 after moving the word in BM2 into BL2
        If (A1 = a_{12,ik}), Cbuf11 = Cbuf11 + a_{12,ik} x b_{21,kj}
                             Cbuf12 = Cbuf12 + a_{12,ik} x b_{22,kj}
        [...]

$\mathrm{PE}_j$ denotes the $j$-th PE from the left in Figure 4.4, $j = 1, 2, \ldots, n/r$. $\mathrm{PE}_j$ computes column $j$ of all submatrices $C_{xy}$, $1 \le x, y \le r$. A MAC is used to update column $j$ of each submatrix $C_{xy}$, $1 \le x, y \le r$, requiring a total of $r^2$ MACs per PE.

1) For each pair of $x$ and $y$, $1 \le x, y \le r$, we show how $C_{xy} = \sum_{k=1}^{r} A_{xk} \times B_{ky}$ can be performed in $\left(\frac{n^2}{r} + \frac{2n}{r}\right)$ cycles using $n/r$ PEs with a MAC and 4 registers per PE and 3 I/O ports.
$C'_{xy}$ represents the intermediate results for $C_{xy}$. Using Theorem 1, we can perform $C'_{xy} = C'_{xy} + A_{xk} \times B_{ky}$ for any specific combination of $x$, $k$, and $y$, $1 \le x, k, y \le r$, in $\left(\frac{n^2}{r^2} + \frac{2n}{r}\right)$ cycles using the aforementioned PEs. For each pair of $x$ and $y$, $1 \le x, y \le r$, $C_{xy} = \sum_{k=1}^{r} A_{xk} \times B_{ky}$ is obtained by performing $C'_{xy} = C'_{xy} + A_{xk} \times B_{ky}$ in a serial manner with $k$ increasing from 1 to $r$. A preliminary analysis reveals that this would take $\left(\frac{n^2}{r^2} + \frac{2n}{r}\right) \times r = \left(\frac{n^2}{r} + 2n\right)$ cycles. However, a close look at the data movement in the proof of Theorem 1 reveals that the input of the last column of submatrix $A_{xk}$ to the array can be overlapped with the input of the first row of submatrix $B_{(k+1)y}$ for $k = 1, 2, \ldots, r - 1$. Using this overlapping, $C'_{xy} = C'_{xy} + A_{xk} \times B_{ky}$ for $k = 1, 2, 3, \ldots, r$ can be performed in a pipelined fashion, taking $\frac{n^2}{r^2}$ cycles for each $k$. At the start, $\frac{n}{r}$ cycles are needed to prefetch the first row of submatrix $B_{1y}$. At the end, after the last column of submatrix $A_{xr}$ is input, $\frac{n}{r}$ additional cycles are needed to move it through the array of $\frac{n}{r}$ PEs to complete the updates for $C_{xy}$. This leads to an execution time of $\frac{n}{r} + \frac{n^2}{r} + \frac{n}{r} = \frac{n^2}{r} + \frac{2n}{r}$ cycles, instead of $\left(\frac{n^2}{r} + 2n\right)$ cycles.

2) We show how $C_{xy} = \sum_{k=1}^{r} A_{xk} \times B_{ky}$ can be performed in parallel for all pairs of $x$ and $y$, $1 \le x, y \le r$, in $\left(\frac{n^2}{r} + \frac{2n}{r}\right)$ cycles using $n/r$ PEs with $r^2$ MACs, $2r^2$ local memories of $n/r$ words, and $4r$ registers per PE, and $3r$ I/O ports. $C'_{xy}$ is the intermediate result for $C_{xy}$. In stage $k$, $1 \le k \le r$, $C'_{xy} = C'_{xy} + A_{xk} \times B_{ky}$ is performed in parallel for all $1 \le x, y \le r$. Column $j$ of submatrix $C'_{xy}$ is updated by $\mathrm{MAC}_{xy}$ and stored in a local memory of size $n/r$, $\mathrm{Cbuf}_{xy}$, for each $x, y$, $1 \le x, y \le r$, in $\mathrm{PE}_j$.

3) One set of local memories (Cbufs) of $n/r$ words in $\mathrm{PE}_j$ is used to store intermediate results for column $j$ of submatrix $C_{xy}$ for any $x, y$, $1 \le x, y \le r$.
Another set of local memories (CObufs) of $n/r$ words is required to buffer the final results passed from $\mathrm{PE}_{j+1}$ to $\mathrm{PE}_j$. Thus, a total of $2r^2$ local memories of $n/r$ words per PE are required. Starting from the ([...] + 1)-th cycle, results of $c'_{ij}$ are generated from all PEs during the next $n$ + [...] − 1 cycles. CObuf is necessary in each PE; otherwise the final $c'_{ij}$ from one matrix multiplication would be overwritten by the intermediate $c'_{ij}$ of the next matrix. $\mathrm{PE}_j$ stores either the output from $\mathrm{PE}_{j+1}$ via ports $C1_{in}, C2_{in}, \ldots, Cr_{in}$ or $c'_{ij}$ from $\mathrm{MAC}_{xy}$ of $\mathrm{PE}_j$ in the CObufs. For the first $n/r$ cycles, the final $c'_{ij}$ at $\mathrm{PE}_j$ is output to $\mathrm{PE}_{j-1}$ via ports $C1_{out}, C2_{out}, \ldots, Cr_{out}$. For the next $(n/r)^2 - n/r$ cycles, the stored $c'_{ij}$ of $\mathrm{PE}_j$ in CObuf is output to $\mathrm{PE}_{j-1}$. The outputs from $\mathrm{PE}_1$ are the resulting matrix $C$. Starting from the ([...] + 3)-th cycle, $\mathrm{PE}_j$ stores the final $c'_{ij}$ of $\mathrm{PE}_{j+1}$ via ports $C1_{in}, C2_{in}, \ldots, Cr_{in}$ in the CObufs for [...] cycles.

4) Figure 4.4 shows our architecture for $r = 2$ and Figure 4.5 shows the accompanying algorithm. Let $A_{xy}$, $B_{xy}$, and $C_{xy}$, $1 \le x, y \le 2$, denote submatrices, each of size $(n/2) \times (n/2)$. In the first stage, $C'_{xy} = A_{x1} \times B_{1y}$ is performed in parallel for all $1 \le x, y \le 2$ by feeding the elements of $A_{11}$, $A_{21}$, $B_{11}$, and $B_{12}$ through the 4 input ports. In the second stage, $C'_{xy} = C'_{xy} + A_{x2} \times B_{2y}$ is performed in parallel for all $1 \le x, y \le 2$. Each stage takes $\frac{n^2}{4} + n$ cycles from Theorem 1. Since overlapping is possible between the end of the first stage and the start of the second stage, the total number of cycles for both stages combined is $\frac{n^2}{2} + n$, as explained before. For larger values of $r$, there is greater parallelism within the PEs and hence the execution time is greatly reduced.

It must be noted that the number of MACs is a key parameter that determines the whole architecture. Based on the number of MACs, Theorem 2 and Corollary 1 can be combined into Corollary 2.
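The cycle counts and resource counts stated in Theorem 1, Corollary 1, and Theorem 2 can be tabulated with a few lines of code, which makes the area/latency trade-off concrete for a given $n$ and $r$. This is a sketch for illustration only: the `Design` record and the function names are hypothetical, and only the closed-form expressions are taken from the statements above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """Resource summary of one design point (hypothetical record type)."""
    cycles: int
    pes: int
    macs_per_pe: int
    registers_per_pe: int
    local_mems_per_pe: int
    words_per_mem: int
    io_ports: int


def theorem1(n):
    # n^2 + 2n cycles, n PEs, 1 MAC, 4 registers, 2 memories of n words, 3 ports
    return Design(n * n + 2 * n, n, 1, 4, 2, n, 3)


def corollary1(n, r):
    # r*n^2 + 2*r^2*n cycles on n/r PEs; per-PE resources as in Theorem 1
    if n % r != 0:
        raise ValueError("n must be divisible by r")
    return Design(r * n * n + 2 * r * r * n, n // r, 1, 4, 2, n // r, 3)


def theorem2(n, r):
    # n^2/r + 2n/r cycles on n/r PEs with r^2 MACs, 4r registers,
    # 2r^2 memories of n/r words per PE, and 3r I/O ports
    if n % r != 0:
        raise ValueError("n must be divisible by r")
    return Design((n * n + 2 * n) // r, n // r, r * r, 4 * r, 2 * r * r, n // r, 3 * r)


if __name__ == "__main__":
    n = 48
    for r in (1, 2, 4):
        print(r, corollary1(n, r).cycles, theorem2(n, r).cycles)
```

With $r = 1$ both `corollary1` and `theorem2` reduce to `theorem1`, which is a quick consistency check on the three statements. Increasing $r$ raises the cycle count in Corollary 1 (fewer PEs) and lowers it in Theorem 2 (more MACs per PE).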
Corollary 2 $n \times n$ matrix multiplication can be performed in $\max\left(\frac{n}{m}, \frac{n^3}{m^3}\right) \times \min(n^2 + 2n, m^2 + 2m)$ cycles using $m$ MACs, $2m$ local memories of [...] words, $4\min(n, m)$ registers, and $3\max\left(\frac{m}{n}, 1\right)$ I/O ports, where [...] is divisible by $m$ and $1 \le m \le n^2$.

Proof 4 For $1 \le m \le n$, the proof follows from Corollary 1 by setting $r = n/m$. For $n < m \le n^2$, the proof follows from Theorem 2 by setting $r = m/n$.

A more detailed analysis of all designs with respect to energy, area, and latency is presented in Section 4.3.3.

4.3 Performance Modeling and Optimization

Given the goal of algorithmic-level optimization of energy performance for matrix multiplication on FPGA devices, we need an energy model that represents the impact of individual algorithmic-level choices on the energy performance. Based on this model, we make the design trade-offs to obtain energy efficient designs. The candidate designs are implemented in Section 4.4.1.

4.3.1 Domain-Specific Energy Model

Our approach to performance modeling is to use the domain-specific energy model proposed in Chapter 3 [30] [36]. The model is applicable only to the design domain spanned by the family of algorithms and architectures being evaluated. The family represents a set of algorithm-architecture pairs that exhibit a common structure and similar data movement. The domain is a set of point designs resulting from unique combinations of algorithm-level and architecture-level changes. The domain-specific energy model abstracts the energy dissipation to suit the design domain. The abstraction is independent of the commonly used levels such as gate, register, or system level. Rather, it is based on knowledge about the family of algorithms and architectures. The parameters are extracted considering their expected impact on the total energy performance.
For example, if the number of MACs and the number of registers change values in a domain and are expected to be frequently accessed, a domain-specific energy model is built using them as key parameters. The parameters may include elements at the gate, register, or system level as needed by the domain. It is a knowledge-based model which exploits the knowledge of the designer about the algorithm and the architecture.

We also use this knowledge to derive functions that represent energy dissipation, area, and latency. Beyond the simple complexity analysis, we make the functions as accurate as possible by incorporating implementation and target device details. For example, if the number of MACs is a key parameter, we implement a sample MAC on the target FPGA device to estimate its average power dissipation. Random input vectors, as many as are needed for the desired confidence interval [64], are generated for simulation. A power function representing the power dissipation as a function of $m$, the number of MACs, is generated. This power function is obtained for each module related to the key parameters. Based on the designer's optimization goal and the time available for design, a balance needs to be struck between accuracy and simple representation of the functions.

In Corollary 1, $n$ denotes the size of the input matrices, and $r$ is introduced for block multiplication using submatrices of size $n/r$. In Theorem 2, $r$ determines the number of I/O ports ($3r$), the number of MACs ($r^2$), and the submatrices of size $n/r$. Due to the nature of our algorithms, the number of each key module depends only on these two parameters. We identify registers of 8-bit and 16-bit words, MACs, SRAMs (Distributed SelectRAMs in the Xilinx devices), and BSRAMs (Block SelectRAMs in the Virtex-II devices) [144] as key modules. Choosing specific values for the parameters in Table 4.1 results in a design point in the design space.
For example, $n = 24$, $p = 6$, $reg = 4$, $m = 1$, $sram = 2$, $K_b = 2$, and $K_{io} = 0$ represents a design where 24 × 24 matrix multiplication is implemented using 6 PEs with 4 registers, one MAC, and two SRAMs per PE. The input and output matrices are stored in two ($\lceil 2 \times 24 \times 24 / 1024 \rceil = 2$) BSRAMs on the device and no I/O ports are used.

Table 4.1: Range of parameters for Xilinx XC2V1500

Parameter | Range | FPGA constraints
Problem size ($n$) | 2, 3, 4, ... |
No. of PEs ($p$) | $n/l$, $n$ is divisible by $l$, $l$ is integer |
No. of registers/PE ($reg$) | [...] ($0 \le k \le \log_b n$) | 8/16-bit registers
No. of MACs/PE ($m$) | [...] | 2-stage pipeline, embedded
No. of SRAMs/PE ($sram$) | [...] | 16 words minimum
No. of BSRAMs/PE ($K_b$) (on-chip design) | $\lceil 2n^2/1024 \rceil$ | 1024 16-bit words minimum
No. of I/O ports ($K_{io}$) (off-chip design) | [...] | 8/16 bits

An energy model specific to the domain is constructed at the module level by assuming that each module of a given type (register, multiplier, SRAM, BSRAM, or I/O port) dissipates the same power independent of its location on the chip. This model simplifies the derivation of system-wide energy dissipation functions. The energy dissipation of each module can be determined by counting the number of cycles the module stays in each power state and by low-level estimation of the power used by the module in that power state, assuming average switching activity. Additional details of the model can be found in Chapter 3.

Table 4.2 lists the key parameters and the number of each key module in terms of the two parameters, for each domain. In addition, it shows the latencies, which also depend on the parameters. By choosing specific values for the parameters in Table 4.2, a different point design is realized in the design space.
For example, the point design $n = 16$ and $r = 4$ represents a design where 16 × 16 matrix multiplication is implemented using 4 PEs with 4 registers, one MAC, and one SRAM per PE. The area function is given by $A = \sum_i A_i$, where $A_i$ represents the area used by module $i$. In general, these simplified energy and area functions may not capture all the implementation details needed for accurate estimation. However, we are concerned with algorithmic-level comparisons rather than accurate estimation. Moreover, our architectures are simple and have regular interconnections, so the error between these functions and the actual values based on low-level simulation is expected to be small. In Section 4.4.2, we evaluate the accuracy of the energy and area functions. The latency functions are obtained easily because the theorems and corollaries already give the latency in clock cycles for the different designs.

Table 4.2 shows the number of modules used by the designs for $n \times n$ matrix multiplication with 8-bit input precision and 16-bit output precision. For the off-chip design, I/O ports are used to fetch elements from outside the FPGA. In the on-chip design, BSRAMs of 1024 16-bit words are used for on-chip storage of the input matrices. SRAMs are CLB-based memory blocks used for storing intermediate results. The power and area values of each module are shown in Table 4.4. For example, $P_{SRAM}$ is the average power used by an SRAM (16-bit word), where $x$ is the number of entries. In the actual implementation of an SRAM, the number of entries should be a multiple of 16. $P_{offset}$ denotes the remaining power dissipation of a PE (after the modules have been accounted for), and covers glue logic and control logic. Similar numbers representing the area of each module are also obtained.
$A_{offset}$ denotes the area of a PE that accounts for glue logic and control logic. The latencies are obtained in seconds by dividing them by the clock frequency. Using Table 4.2, functions that represent energy, area, and latency for Corollary 1 and Theorem 2 are shown in Table 4.3. Functions for other designs can be obtained in the same way. An average switching activity of 50% for input data to each module at a running frequency of 150 MHz is assumed. The multiply operation is performed using the dedicated embedded multipliers available in the Virtex-II device.

Note that throughput is important, since many applications of matrix multiplication process a stream of data. Our design in Corollary 1 is a pipelined architecture, with the first $n/r$ cycles of the computations on the next set of data being overlapped with the last $n/r$ cycles of the computations on the current set of data. Thus, for a stream of matrices, an $(n/r) \times (n/r)$ submatrix can be processed every $(n/r)^2$ cycles.

Table 4.3: Energy and time performance models

Corollary 1
Latency (cycles): $L_{Cor1} = r^3\{(n/r)^2 + 2n/r\}$
Effective latency (cycles): $L_{Cor1} = r^3 (n/r)^2$
Energy (on-chip): $E_{Cor1} = L_{Cor1}\{(n/r)(P_{Mult} + P_{Add} + 2P_{SRAM} + 4P_{R8} + 4P_{R16}) + \lceil 2n^2/1024 \rceil P_{BSRAM} + (n/r)P_{offset}\}$
Energy (off-chip): $E_{Cor1} = L_{Cor1}\{(n/r)(P_{Mult} + P_{Add} + 2P_{SRAM} + 4P_{R8} + 4P_{R16}) + 2P_I + P_O + (n/r)P_{offset}\}$
Area (on-chip): $A_{Cor1} = (n/r)(A_{Mult} + A_{Add} + 2A_{SRAM} + 4A_{R8} + 4A_{R16} + A_{offset})$, and $\lceil 2n^2/1024 \rceil$ BSRAMs
Area (off-chip): $A_{Cor1} = (n/r)(A_{Mult} + A_{Add} + 2A_{SRAM} + 4A_{R8} + 4A_{R16} + A_{offset})$, and two 8-bit input ports, one 16-bit output port

Theorem 2
Latency (cycles): $L_{Thm2} = n^2/r + 2n/r$
Effective latency (cycles): $L_{Thm2} = n^2/r$
Energy (on-chip): $E_{Thm2} = L_{Thm2}\{nr(P_{Mult} + P_{Add} + 2P_{SRAM}) + n(4P_{R8} + 4P_{R16}) + \lceil 2n^2/1024 \rceil P_{BSRAM} + (n/r)P_{offset}\}$
Energy (off-chip): $E_{Thm2} = L_{Thm2}\{nr(P_{Mult} + P_{Add} + 2P_{SRAM}) + n(4P_{R8} + 4P_{R16}) + 2rP_I + rP_O + (n/r)P_{offset}\}$
Area (on-chip): $A_{Thm2} = nr(A_{Mult} + A_{Add} + 2A_{SRAM}) + n(4A_{R8} + 4A_{R16}) + (n/r)A_{offset}$,
and $\lceil 2n^2/1024 \rceil$ BSRAMs
Area (off-chip): $A_{Thm2} = nr(A_{Mult} + A_{Add} + 2A_{SRAM}) + n(4A_{R8} + 4A_{R16}) + (n/r)A_{offset}$, and $2r$ 8-bit input ports, $r$ 16-bit output ports

[...] latency for 48 × 48 matrix multiplication for the off-chip and on-chip designs of Corollary 1. It can be used to choose energy efficient designs that meet given area and latency constraints. For example, if 800 slices are available and the latency should be less than 6,000 cycles (36 μs), an energy efficient design is obtained using $n/r = 4$. The energy dissipation, area, and latency for such a design, evaluated using the functions in Table 4.3, are 6.85 μJ, 524 slices, and 5400 cycles (32.4 μs), respectively. Figure 4.6 shows that as the block size ($n/r$) increases, the area increases and the latency decreases, because the degree of parallelism increases. While the energy dissipation decreases until $n/r = 15$ or 16, it starts increasing afterwards. The reason for this behavior is as follows. The energy used by the local storage, Cbuf and CObuf, is $2n^3\left(0.126\lceil \frac{n/r}{16} \rceil + 2.18\right)$ and is hence proportional to $O(n^4/r)$. The energy used by the rest of the modules, except I/O, is proportional to $O(n^3)$. The energy for I/O is proportional to $O(rn^2)$. As $n/r$ increases ($r$ decreases), the energy used by I/O decreases relatively faster and thus the total energy decreases. However, for $n/r > 16$, the energy used by the local storage becomes the dominant factor. This helps us to identify the optimal block size for energy efficient matrix multiplication.

Trade-off analysis for the on-chip model shows similar behavior. The on-chip design uses BSRAMs instead of I/O ports. Since the energy used in the I/O ports is more than the energy used in the BSRAMs, the energy used in the on-chip design is less than the energy used in the off-chip design.
However, the choice between the off-chip and on-chip design depends on whether the matrix multiplication is stand-alone or part of a larger application (for example, an application consisting of multiple kernels).

Table 4.4: Power and area functions for various modules

Module | Power function (mW) | Area function (slices)
Block multiplier (8 × 8 bit) | $P_{Mult} = 12.50$ | $A_{Mult} = 16$*
Adder (8 bit) | $P_{Add}$ = [...] | $A_{Add}$ = [...]
SRAM (16-bit word, $x$ number of entries) | $P_{SRAM} = 0.126 \lceil x/16 \rceil + 2.18$ | $A_{SRAM} = 18.44 \lceil x/16 \rceil + 16.40$
BSRAM (16 bit, 1024 entries) | $P_{BSRAM} = 16.37$ | $A_{BSRAM} = 16$*
Register (8 bit) | $P_{R8} = 2.12$ | $A_{R8} = 4$
Register (16 bit) | $P_{R16} = 2P_{R8}$ | $A_{R16} = 8$
Output port (16 bit) | $P_O = 10$ |
Input port (8 bit) | $P_I = 10$ |

* Block multiplier or BSRAM uses area equivalent to 16 slices.

Figure 4.7: Energy, area, and latency trade-offs of Theorem 2 as a function of $r$, (a) off-chip design and (b) on-chip design for $n = 48$

Theorem 2 provides asymptotic improvement in energy and latency performance in the on-chip model. As shown in Table 4.2, asymptotically, the energy dissipated in the BSRAMs and the latency of the Xilinx reference design increase as $O(n^5)$ and $O(n^3)$, respectively, assuming a unit of energy is used per cycle for retaining a word in the BSRAM. Energy dissipation and latency for the designs based on Theorem 1 and [110] increase as $O(n^4)$ and $O(n^2)$, respectively, under the same assumptions. Theorem 2 improves these complexities to $O(n^4/r)$ and $O(n^2/r)$, respectively, where $n/r$ is the block size for block multiplication and $n$ is divisible by $r$ with $r \ge 1$. Further increases in the density of FPGAs can be used to increase
the number of multipliers and hence $r$, leading to asymptotic reduction in energy dissipation and latency.

Figure 4.7 shows the trade-offs among energy, area, and latency for Theorem 2. As the value of $r$ increases (or the block size $n/r$ decreases), the area increases and the latency decreases. The energy dissipation, however, decreases continuously. Thus, unlike the designs based on Corollary 1, the designs based on Theorem 2 reach the minimal point of energy dissipation when the block size is the smallest. Note that the local storage consists of registers, Cbufs, and CObufs. The energy used by the registers in the designs based on Theorem 2 is $O(n^3/r)$, while the energy used by the registers in the designs based on Corollary 1 is $O(n^3)$ for the same problem size. Thus, the energy used by the registers in the designs based on Theorem 2 decreases as $r$ increases, while the energy used by the registers in the designs based on Corollary 1 is constant.

Additionally, the linear array architecture facilitates the use of two more techniques: parallel processing and pipelining. Both parallel processing and pipelining decrease the effective latency of a design. Parallel processing does so by increasing the amount of resources, while pipelining does so by increasing the resource utilization. By decreasing effective latency, both techniques can lead to lower energy dissipation. However, these techniques can also increase the power dissipation, which can have a negative effect on the energy dissipation. The designer must reach a compromise between low latency and high power in order to achieve a low energy design.

Another technique that we employ is choosing the appropriate bindings. In an FPGA, there can be many possible mappings of the computation and storage elements to the actual hardware. For example, in the Virtex-II, the storage CBuf can be implemented as registers, a distributed SelectRAM, or a Block SelectRAM.
Each of these types of storage dissipates a different amount of energy and can lead to implementations with wide variation in energy dissipation. When the number of entries is greater than 64, a Block SelectRAM is used since it is energy efficient as a large memory; otherwise, a distributed SelectRAM is used. Similar decisions can be made for other elements of the design, such as choosing between (embedded) Block multipliers and configured (CLB-based) multipliers. In our design, we choose Block multipliers since they are energy efficient when both inputs are not constant.

4.4 Design Synthesis and Simulation

Based on the high-level performance estimation, the chosen designs are implemented and simulated to obtain accurate results. Our target device is the Virtex-II, a high-performance platform FPGA from Xilinx [144]. We have chosen the XC2V1500 and XC2V3000 models, both with speed grade -5, for comparison. These models have 48 and 96 18 × 18-bit Block multipliers, respectively.

4.4.1 Implementation Details

Based on the observations in Section 4.3.3, we implemented the designs using VHDL in the Xilinx ISE 4.1i environment. All parameters are specified using "Generic" variables in VHDL syntax. By changing the values of the Generic variables, different numbers and types of modules are instantiated (see Table 4.2) and the whole architecture is synthesized accordingly. Note that the design for Theorem 1 has one parameter, $n$, and the designs for Corollary 1 and Theorem 2 have two parameters, $n$ and $r$. These are the only parameters necessary for design synthesis. For the off-chip design, all matrix data are fed via I/O ports. Thus the total energy includes the energy used for I/O.
In addition to the data paths of all designs, the control logic is also parameterized based on $n$ and $r$. We chose a mixture of distributed and centralized control. Each PE of the off-chip design for Theorem 1 and Theorem 2 has 6 control signals from the control logic (i.e., centralized control). The centralized control signals are generated by the control logic (outside the PEs) and are fed to the first PE. Note that only the first PE receives the control signals directly and the rest of the PEs use them in a pipelined manner (see Figure 4.2 (a) and (b)): all signals are passed to the next PE with one or two clock delays. Addresses for CBuf and COBuf are generated inside each PE in a distributed manner.

Figure 4.8: (a) Pipelining stages and (b) data hazard. Stage labels: Reg = register loading for A and B, Mux = selecting B data, Mult = multiplication, Acc = accumulation, MW = memory write, MR = memory read.

The first control signal, CtRegLoad, is used to load the input data into BM or BL. It is asserted every $n/r$ cycles (see Figure 4.3). CtMuxToMult is used to determine whether BM or BL is multiplied with A. This signal is identical to CtRegLoad except for the first $n/r$ cycles. CtMultCe and CtRamWe are the enable signals for the multiplier and the SRAM. Both are asserted when the computation starts. CtFlush is asserted after one set of computations for an $(n/r) \times (n/r)$ matrix multiplication is completed and a new set of computations starts. CBuf needs to be flushed before the next result is stored. CtFlush is asserted at cycle $(n/r)^2 + n/r + 3$, is on for $n/r$ cycles, and is off for [...] cycles. In the data path, using this signal, we accumulate the intermediate values for matrix $C$ with either the previous intermediate values or 0 (flushing effect).
CtCoutMux is used to determine whether the results come from an accumulator or from COBuf. It triggers the transfer of data from the current COBuf to the previous PE; that is, the current PE sends the data in its COBuf to the previous PE. It is asserted at cycle $(n/r)^2 + 4$, is on for $n/r$ cycles, and is off for $(n/r)^2 - n/r$ cycles. For the on-chip design, the control logic includes additional signals, such as address generation for the on-chip memory. In addition, the energy used by the control logic is measured separately.