ENERGY AND TIME EFFICIENT DESIGNS FOR DIGITAL SIGNAL PROCESSING KERNELS ON FPGAs

by

Seonil Choi

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

May 2004

Copyright 2004 Seonil Choi

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

UMI Number: 3140452
Copyright 2004 by Choi, Seonil. All rights reserved.

INFORMATION TO USERS

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleed-through, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

UMI Microform 3140452
Copyright 2004 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346

Dedication

To my beloved mother, my father, my two sisters, and my lovely wife

Acknowledgments

I would like to thank Dr. Viktor K. Prasanna, my advisor at the University of Southern California, for his guidance, encouragement, support, endurance, and insight throughout my Ph.D. program. In addition to academic guidance, he has been motivating me in the development of my professional skills and perspective of research. In the long road to my Ph.D.,
he provided me excellent direction and moral support when I faltered in my approach. I understood the process of abstracting the underlying principles and applying problem-solving techniques to multiple domains wholly due to his teaching. His contribution towards my professional and career growth goes far beyond technical and academic advisement. I have been extremely fortunate to have him as my advisor and to have known him as a person over the last several years.

I also thank the members of my qualifying examination and defense committees, Dr. Peter Beerel, Dr. Cauligi S. Raghavendra, Dr. Antonio Ortega, and Dr. Cyrus Shahabi, for their suggestions.

It has been a wonderful experience to participate in collaborative research efforts with my fellow graduate students during my Ph.D. program. I have shared thoughts on research and other things daily with Jingzhao Ou, Gokul Govindu, Neungsoo Park, Henry Park, and Jeonghee Shin, who have been an excellent sounding board for all my ideas. I thank the PARIS and MILAN project members, Ronald Scrofano, Sumit Mohanty, Zachary Baker, and Ling Zhuo. I thank the founding members of the MAARC and MAARC-H research group, Andreas Dandalis, Reetinder Sidhu, and Kiran Bondalapati, for making the joint effort a success. Throughout the years, other students of Dr. Prasanna's research group have made my Ph.D. study an enriching experience: Jongwoo Bae, Yongwha Chung, Jinwoo Suh, Ammar Alhusaini, Amol Bakshi, Prashant Bhat, Myungho Lee, YoungWon Lim, Wenheng Liu, Vaibhav Mathur, Sameer Wadhwa, Bharani Thiruvengadam, Bhargava Gundala, Michael Penner, Mark Redekopp, Mitali Singh, Ashwin Sethuram, Yang Yu, Dhananjay Raghavan, Animesh Pathak, and Harish Krishnaswamy, among others. Henryk Chrostek has been making all administrative affairs run smoothly throughout my stay.
Many thanks also go to my seniors and friends at USC, Jaeheon Jeong, Jongseon Kim, Hyungsuk Kim, Yungho Choi, Jongwook Woo, Chulho Shin, Seongwon Lee, Yongho Song, Sunjoo Kim, Kangwoo Lee, Joonho Ha, Jungyup Kang, Wonwoo Noh, and Dongsoo Kang, for their encouragement and support. I thank DARPA and the National Science Foundation for supporting our research and providing opportunities to frequently visit and interact with active researchers all over the world. I thank my parents, Hojoong Choi and Mija Woo, and my sisters, Yookyung Choi and Yoolee Choi, for being extremely encouraging and patient with my seemingly never-ending academic pursuits for the last twenty-five years, culminating in my Ph.D. Throughout my Ph.D. study, several of my friends provided the needed encouragement for working and diversion from academics as appropriate. Last, but the most, I would like to thank my wife, Jocelyn Zeng, for many years of her relentless support and endurance for my pursuing of a doctoral degree.

Contents

Dedication ii
Acknowledgments iii
List Of Tables ix
List Of Figures x
Abstract xiii

1 Introduction 1
  1.1 Overview 1
  1.2 Contributions of the Dissertation 9
    1.2.1 High-Level Performance Modeling Technique: Domain-Specific Modeling and Design Methodology 9
    1.2.2 Energy Efficient Algorithmic Design Techniques in FPGAs 11
    1.2.3 Energy and Time Efficient Designs for Matrix Multiplication Using FPGAs
11
    1.2.4 Energy and Time Efficient Designs for Matrix Factorization Using FPGAs 12
    1.2.5 Energy/Time Efficient and Parameterized Designs for Fast Fourier Transforms Using FPGAs 14
  1.3 Outline of the Dissertation 14

2 Background 17
  2.1 Field Programmable Gate Arrays (FPGAs) 18
    2.1.1 Conventional FPGA 19
    2.1.2 Inclusion of Embedded Memory, Embedded Multipliers and DSP Blocks 21
    2.1.3 Availability of DSP IP cores 24
    2.1.4 Inclusion of Embedded and Soft Microprocessor Cores 25
  2.2 Energy Efficient Design Techniques 26
    2.2.1 Power Dissipation in FPGAs 27
    2.2.2 Low-Level Design Techniques 30
    2.2.3 Algorithm Level Design Techniques 33

3 High-Level Performance Modeling Techniques: Domain-Specific Modeling and Design Methodology 39
  3.1 Related work 42
  3.2 Domain-Specific Energy Modeling 44
    3.2.1 High-Level Energy Model 48
    3.2.2 Component Specific Power Function Estimation 51
    3.2.3 Deriving System-Wide Energy Function
56
  3.3 Design Methodology 58
  3.4 Illustrative Examples of Domain-Specific Modeling and Design Methodology 63
    3.4.1 Domain 1: Uniprocessor Architecture 64
      3.4.1.1 Defining Components and Parameters 65
      3.4.1.2 System-Wide Energy Function 67
      3.4.1.3 Design Trade-offs and Performance Analysis 69
    3.4.2 Domain 2: Linear Array Architecture 69
      3.4.2.1 Defining Components and Parameters 70
      3.4.2.2 System-Wide Energy Function 73
      3.4.2.3 Design Trade-offs and Performance Analysis 73
    3.4.3 Domain 3: Block Matrix Multiplication on Linear Array Architecture 76
      3.4.3.1 Defining Components, Parameters, and System-Wide Energy Function 76
      3.4.3.2 Design Trade-offs and Performance Analysis 77

4 Energy and Time Efficient Matrix Multiplication Using FPGA 79
  4.1 Related work 82
  4.2 Energy Efficient Algorithms and Architectures for Matrix Multiplication 84
  4.3 Performance Modeling and Optimization 99
    4.3.1 Domain-Specific Energy Model 99
    4.3.2 Functions to Estimate Energy, Area, and Latency 103
    4.3.3 Trade-offs among Energy, Area, and Latency 105
    4.3.4 Other Optimization Techniques for Energy Efficiency
111
  4.4 Design Synthesis and Simulation 113
    4.4.1 Implementation Details 113
    4.4.2 Simulation Method 116
  4.5 Design Analysis and Performance Comparison 118

5 Energy and Time Efficient Matrix Factorization Using FPGA 124
  5.1 Related Work 127
  5.2 Time and Energy Efficient Designs for Matrix Factorization 129
    5.2.1 LU Decomposition 130
    5.2.2 Block LU Decomposition 139
  5.3 Performance Estimation and Design Trade-offs 145
    5.3.1 High-Level Performance Model 145
    5.3.2 Design Trade-offs for Time and Energy Efficiency 148
  5.4 Design Synthesis, Optimization, and Simulation Methods 150
    5.4.1 Optimizations for Time and Energy Efficiency 151
    5.4.2 Macro-Level Power and Resource Analyzers 154
    5.4.3 Simulation Methods 157
  5.5 Performance Comparison
159
    5.5.1 Uniprocessor and Theorem 3 159
    5.5.2 Theorem 3 and Other Linear Array Architecture 166
    5.5.3 DSP and Corollary 3 166

6 Energy/Time Efficient and Parameterized Designs for Fast Fourier Transforms on FPGAs 172
  6.1 Energy and Time Efficient Design for FFT 173
    6.1.1 Energy Efficient Design Techniques 174
    6.1.2 Energy Efficient Design for FFT 175
  6.2 Performance Estimation and Design Synthesis 181
    6.2.1 Energy Performance Estimation 182
  6.3 Performance of Synthesized Designs 184

7 Conclusion and Future Research 192
  7.1 Future Research Directions 195

Reference List 199

List Of Tables

1.1 Performance comparison of FPGAs and DSP [142] 2
2.1 (a) Capacitances of CLB and embedded blocks and (b) power dissipation of configured and embedded logic in Xilinx Virtex-II 28
3.1 Model parameters 73
3.2 Comparison of our designs on linear array architecture with Xilinx design 75
3.3 Accuracy of the high-level energy estimation of our designs
76
3.4 Performance comparison and accuracy of various designs in Domain 3 78
4.1 Range of parameters for Xilinx XC2V1500 102
4.2 Number of modules used and the latency of various designs 103
4.3 Energy and time performance models 106
4.4 Power and area functions for various modules 109
4.5 Estimation errors of energy and area functions in Table 4.3 117
4.6 Performance comparison of various off-chip designs against the Xilinx design and the design A proposed in [110] 120
4.7 Performance comparison of various on-chip designs 121
4.8 Performance comparison of various off/on-chip designs for Theorem 2 123
5.1 States of each PE and their operations 138
5.2 Power and area functions for various modules 147
5.3 Energy, area, and time performance models for Theorem 3 148
5.4 Energy, area, and time performance models for Theorem 4 149
5.5 Energy, area, and time performance models for Corollary 3 150
5.6 Memory access rates of various modules in the uniprocessor design and the design in Theorem 3 162
5.7 Performance comparison of the designs based on Theorem 3 and the uniprocessor design 163
5.8 Performance comparison of the design A based on [23][99] and the design based on Theorem 3 167
5.9 Performance of the designs based on Corollary 3
171
6.1 Performance of our FFT designs 187
6.2 FFT performance of Xilinx library based design and TI DSP based design 188
6.3 FFT performance comparison with Xilinx library based designs 188
6.4 Average power dissipation of the TMS320C6415 190
6.5 FFT performance comparison with the TI DSP based designs 190

List Of Figures

1.1 Power dissipation of various processors 3
1.2 Algorithms used in Software Defined Radio (SDR): (a) direction of arrival algorithm (e.g., MUSIC) [115] and (b) MVDR with RLS algorithm [47] 8
2.1 The evolution of FPGAs 18
2.2 The conventional FPGA architecture 20
2.3 The recent FPGA architecture 22
2.4 (a) Capacitances of various wires and (b) the power dissipation of various wires as a function of frequency in Xilinx Virtex-II [120] 29
2.5 Power dissipation of various storage element implementations as a function of the number of entries (Virtex-II XC2V1500, 150MHz, 50% switching activity, 16 bits per entry) 31
2.6 Power dissipation of various multipliers as a function of precision (Virtex-II XC2V1500, 150MHz, 50% switching activity)
32
2.7 The effect on energy dissipation of the disabling of a SRAM as a function of the number of entries (Virtex-II XC2V1500, 150MHz, 50% memory access rate, 8 bits per entry) 35
3.1 Domain-specific modeling 41
3.2 (a) Domain-specific modeling and system-wide energy estimation and (b) component power state matrices 47
3.3 (a) Power function estimation and (b) register power function as a function of the number of registers (r) and frequency (f) 52
3.4 Design methodology based on the domain-specific modeling 59
3.5 Uniprocessor architecture: (a) off-chip design and (b) on-chip design 65
3.6 System-wide energy dissipation and energy distribution for 12 x 12 matrix multiplication as a function of cache size: (a) off-chip design and (b) on-chip design 68
3.7 (a) Linear array architecture for matrix multiplication, (b) PE organization, and (c) corresponding algorithm 71
3.8 (a) Power dissipation for a single PE and (b) the system-wide energy as a function of the amount of storage (s) for n = 4, 8, 16 74
4.1 Energy distribution of the design proposed in [110] 85
4.2 (a) Off-chip design, (b) on-chip design, (c) architecture of PEj used in Theorem 1, and (d) algorithm used in Theorem 1 87
4.3 Snapshot of the data flow for 3 x 3 matrix multiplication (Theorem 1) 88
4.4 Architecture of PEj for Theorem 2 93
4.5 Algorithm for Theorem 2 94
4.6 Energy, area, and latency trade-offs of Corollary 1 as a function of the block size (n/r): (a) off-chip design and (b) on-chip design for n = 48 108
4.7 Energy, area, and latency trade-offs of Theorem 2 as a function of r: (a) off-chip design and (b) on-chip design for n = 48 110
4.8 (a) Pipelining stages and (b) data hazard 115
4.9 Comparison between our design (based on Theorem 1) and Xilinx design for 3 x 3 matrix multiplication: (a) energy dissipation for randomly generated matrices and (b) average energy dissipation with confidence intervals 118
4.10 Energy distribution over logic, net, and I/O for Theorems 1 and 2 122
5.1 (a) Overall architecture, (b) architecture of PE, and (c) algorithm for LU decomposition 133
5.2 Snapshot for 3 x 3 LU decomposition 137
5.3 (a) Overall architecture for block LU decomposition, (b) a schedule for Theorem 4, and (c) a schedule for Corollary 3 142
5.4 (a) Energy dissipation as a function of b and r for n = 256, and (b) energy distribution as a function of b for n = 256 151
5.5 Design flow of macro power and resource analyzers 155
5.6 Power dissipation over clock cycles for n = 4 158
5.7 (a) Architecture and (b) algorithm for a uniprocessor design 161
5.8 Energy distribution of the uniprocessor design, the design in Theorem 3, and the design A in [23][99] for (a) n = 16 and (b) n = 32 excluding the quiescent power
164
5.9 Energy distribution of the uniprocessor design, the design in Theorem 3, and the design A in [23][99] for (a) n = 16 and (b) n = 32 including the quiescent power 165
5.10 Energy dissipation of the design in Corollary 3 including the quiescent power and the effect of the memory banking for (a) n = 32 and (b) n = 128, where b = 16 168
5.11 Energy distribution of the design in Corollary 3 over various problem sizes (a) with memory banking and (b) without memory banking 169
6.1 (a) Data buffer (DB), (b) twiddle factor computation (TW), (c) data path permutation (PER), (d) parallel-to-serial/serial-to-parallel mux (PS/SP), and (e) radix-4 computation (R4) 177
6.2 Data permutation for DB at the first stage for 16-point FFT (clock cycle (a) t = i, (b) t = i + 1, and (c) t = i + 3) 179
6.3 Architectures for 16-point FFT: (a) (Hp, Vp) = (1, 1), (b) (Hp, Vp) = (2, 1), and (c) (Hp, Vp) = (2, 4) 180
6.4 Algorithm used for the architecture in Figure 6.3 (b) 181
6.5 Energy and area estimates for various designs for N = 256 185
6.6 Energy distribution of modules in FFT architecture for various design points (N = 256, BRAM based) 186
7.1 Design methodology at the application level with the kernel level 200

Abstract

Reconfigurable hardware, such as FPGAs, is a flexible alternative to the DSPs or ASICs used in mobile devices, for which energy is a key performance metric.
Designs on reconfigurable hardware offer many design parameters, such as operating frequency, precision, amount of storage, and degree of parallelism. These parameters define a large design space that must be explored to find energy efficient solutions. It is also challenging to predict the energy variation in the early design phases when a design is modified at the algorithm level. To address this scenario, a methodology to develop energy efficient designs on FPGAs is proposed. The methodology integrates domain-specific modeling, coarse-grained performance evaluation, design space exploration (DSE), and low-level simulation to understand the trade-offs among energy, latency, and area. The domain-specific modeling technique defines a high-level model by identifying various components and parameters specific to a domain that affect the system-wide energy dissipation. A domain is a family of architectures and corresponding algorithms for a given kernel. The high-level model also consists of functions for estimating energy, latency, and area that facilitate trade-off analysis. This model is used to understand the impact of various parameters on system-wide energy and is a basis for energy efficient designs. DSE analyzes the design space defined by the domain and selects a set of designs. Low-level simulations are used for accurate performance estimation for the designs selected by the DSE and also for final design selection.

The modeling technique and design methodology are applied to three digital signal processing kernels: matrix multiplication, matrix factorization, and Fast Fourier Transforms. The designs identified by our methodology demonstrate trade-offs among energy, latency, and area. Our designs are compared with state-of-the-art designs to demonstrate their effectiveness. As the first kernel, matrix multiplication is considered.
From the well-known designs for matrix multiplication, "energy hot spots", which are responsible for most of the energy dissipation, are identified. Then three new algorithms and architectures that offer trade-offs among the number of I/O ports, registers, and PEs are proposed. Functions to represent the impact of algorithm design choices on the energy, area, and latency are derived. These functions are used to either optimize the energy performance or provide trade-offs for a family of candidate algorithms and architectures. As the second kernel, two designs for matrix factorization are proposed. The first design is used for a normal LU factorization. A linear array architecture is employed to minimize the usage of long interconnects, leading to lower energy dissipation. The optimal latency is achieved on the linear array architecture. The second design is used for block-based LU decomposition. The linear array based design for LU decomposition and the design for the matrix multiplication kernel are re-used. Through the analysis of design trade-offs, the block size that minimizes the total energy is identified. As the third kernel, energy efficient designs for the Fast Fourier Transform are proposed. Architectural parameters such as degrees of vertical and horizontal parallelism are identified, and a design domain is created through a combination of design choices. Design trade-offs are performed using the high-level performance model to obtain energy efficient designs.

Chapter 1

Introduction

1.1 Overview

A dramatic increase in density and speed has been achieved in recent FPGA architectures. The state-of-the-art FPGA from Xilinx has multi-million system gates and delivers over 0.3 tera MACs/sec at an operating frequency of 300MHz [144].
Table 1.1 shows the peak performance capabilities of the Virtex FPGA compared with the fastest DSP available last year [142]. The inclusion of new features in the FPGA fabric, such as a large number of embedded multipliers, further enhances their suitability. Recent FPGAs such as Xilinx Virtex-II [145] and Altera Stratix [4] offer hundreds of multipliers on a single chip along with multiple general purpose processors. While the processors can be used to perform bookkeeping activities and schedule the overall execution of tasks, the FPGA fabric and multipliers are used to perform compute-intensive tasks. These new features offer opportunities and challenges to speed up fixed-point computations by using a high degree of parallelism. With such available processing power, FPGAs are attractive fabric devices as flexible and high-speed alternatives to DSPs and ASICs [19][54][87][112][124][129][131]. Indeed, FPGAs have become an attractive fabric for the implementation of massively parallel and computationally intensive applications such as signal, image, and network processing tasks [52][84][96][118]. Also, the reconfigurability and the high performance of FPGAs enable future wireless communications such as Software Defined Radio (SDR) [137].

Table 1.1: Performance comparison of FPGAs and DSP [142]

  Function                           Fastest DSP (TI 64xx)   Xilinx Virtex-II
  8x8 MAC                            4.4 billion MACs        600 billion MACs
  FIR, 256 tap,
  16-bit data/coefficient            17 MSPS (1.1GHz)        180 MSPS (180MHz)
  1024-point FFT (16-bit data)       7.7 usec (800MHz)       0.1 usec (140MHz)

Traditionally, the performance metrics for implementing digital signal processing applications and, indeed, most processing applications in general, have been latency and throughput.
However, with the proliferation of portable and mobile devices [14][84], it has become increasingly important that systems are not only fast but also energy efficient. Therefore, in addition to time performance, energy performance is a key performance metric [95]. For example, many systems for SDR will be implemented in mobile terminals or mobile base stations [91]. Thus, time and energy efficient implementations are crucial.

Figure 1.1 shows the power dissipation of various devices. An FPGA dissipates higher power than the TI DSP and low-power embedded processors (IBM PowerPC 405 LP and Intel XScale PXA255). Note that FPGAs have a wide range of power dissipation. Power is consumed only by the configured portion of an FPGA device; thus, design techniques determine the energy efficiency of the device. By using sophisticated design techniques, the energy used to complete computations in FPGAs can be lower than that of DSP devices or low-power embedded processors.

Figure 1.1: Power dissipation (sleep, idle, and normal modes) of various processors: AMD Opteron (1.8GHz), Pentium 4-M (2GHz), Pentium 4 (3GHz), Xilinx Virtex-II (XC2V1500), TI DSP TMS320C64xx (600MHz), PowerPC 405 LP (380MHz), and Intel PXA255 XScale (400MHz)

To develop energy efficient designs, optimization is required at various levels. Studies show that optimization at the algorithmic level has a much higher impact on the total energy dissipation of a system than at the RTL or gate level. It is reported that the impact (on energy optimization) ratio is 20 : 2.5 : 1 for
Thus, instead of low-level optimiza tion techniques, in this thesis, we investigate and apply algorithmic techniques for minimizing the energy dissipated by FPGAs in signal processing applications. Moreover, a design using FPGAs has to achieve a balance among energy, latency, and area performance. To achieve this balance, a designer has to consider various trade-offs, such as energy versus latency, energy versus area, and energy versus I/O. In order to evaluate the impact of algorithmic changes on the energy dissipation at the early design phase, several challenging issues must be addressed: • Lack of a high-level model. There are numerous ways to map an algorithm onto an FPGA as opposed to mapping onto a traditional processor such as a RISG processor or a DSP, for which the architecture and the components such as ALU, data path, memory, etc. are well defined. For FPGAs, the basic element is the lookup table (LUT), which is too low-level an entity to be considered for high-level modeling. Besides, the architecture design depends heavily on the algorithm. Therefore, no single high-level model can capture the energy behavior of all feasible designs implemented on FPGAs. In addition, if the level of abstraction is elevated, high-level models do not capture all the details of a system and consider only a small set of key parameters that affect energy. This lowers the accuracy of energy estimation. R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Alternatively, a large number of parameters in a high-level model possibly achieve higher accuracy but will result in a large design space. • Large simulation effort: Low-level simulations using RT-level power simula tors are not only time consuming but also require input vectors which need equally expensive functional simulation. Flexibility in using FPGAs results in a large design space. 
It is not feasible to traverse such a large space using time-consuming low-level simulations with tools such as Mentor ModelSim [92] and Xilinx XPower [146]. Our experience shows that simulators running on four 700 MHz Pentium III Xeon processors require an average of 2-3 hours to estimate the energy dissipation of a simple design for 3 x 3 matrix multiplication.

• Limited performance improvement through low-level optimizations: While low-level optimizations at the circuit level through efficient place and route tools are time consuming, the return in terms of energy efficiency is much less compared with high-level optimizations [114].

In this context, there is a need for a high-level energy model which not only enables algorithmic level optimizations but also provides rapid and reasonably accurate energy estimates. Also, we need to evaluate various alternative designs at the algorithmic level (with accompanying architectural modifications) on their energy performance. For this purpose, we construct an appropriate energy model to represent the impact of changes in the algorithm on the system-wide energy dissipation, area, and latency. The modeling starts by identifying parameters whose values change depending on the algorithm and have significant impact on the system-wide energy dissipation. These parameters depend on the algorithm, the architecture used, and the target FPGA's device features. The parameters give rise to a large design space, which may take a lot of time to explore for the minimization of system-wide energy dissipation. In this regard, the designer's ability to understand the algorithm and the architecture is very significant in identifying the key parameters.
For example, if two algorithms use different numbers of MACs and adders for the implementation of a matrix multiplication, and the MACs and adders are busy in almost all the cycles, the numbers of MACs and adders are good candidates for key parameters. We derive closed-form functions representing the system-wide energy dissipation, area, and latency in terms of the key parameters. Assumptions are then made to simplify the functions. In general, the simplicity of the functions and the resulting deviation from the actual coefficients obtained via low-level simulations depend on the application and on the designer's ability to extract key parameters and build an appropriate energy model using them. Moreover, these functions are meant to be used in the early design phase, where we are more concerned with algorithmic level changes than gate-level changes, thus rendering our accuracy level more than sufficient. Moreover, our techniques can also be used for next generation FPGAs having lower power dissipation features as well as higher computing power.

We apply the techniques presented here to designing architectures and algorithms for three well-known digital signal processing kernels: matrix multiplication, matrix factorization, and the Fast Fourier Transform (FFT). Matrix multiplication and matrix factorization are frequently used kernel operations in signal and image processing systems, including mobile and SDR systems [91]. They are also used for signal filtering in embedded target recognition systems. Matrix multiplication is used as part of matrix factorization. The FFT is also the compute-intensive portion of broadband beamforming applications such as those generally used in SDR and sensor networks. Figure 1.2 shows two algorithms and several kernels used in SDR, which are similar to the kernels we choose. For example,
Minimum Variance Distortionless Response (MVDR) beamforming with the Recursive Least Squares (RLS) algorithm consists of matrix factorization, multiplication, addition, and subtraction [47]. Its computational requirement is over 53 GOPS (giga operations per second) per user. Currently, the targeted performance and energy efficiency can be achieved by either FPGA or ASIC based designs. However, the adaptivity requirement of SDR can be met only with the FPGA based design. Note that the three kernels chosen in this thesis are important function blocks and require a significant amount of computation.

[Figure 1.2: Algorithms used in Software Defined Radio (SDR): (a) a direction of arrival algorithm (e.g. MUSIC) [115], built from covariance matrix factorization, FFT, and matrix vector multiplication, with a computational requirement of 3 GOPS for a single user; and (b) MVDR with the RLS algorithm [47], built from matrix addition/subtraction, weight update, matrix vector multiplication, and correlation matrix inversion, with a computational requirement of 53 GOPS for a single user.]

To show the performance of our designs, we compare the latencies, resource utilizations, and energy dissipations of the energy efficient designs to those of Xilinx IP cores and DSP based designs for the same signal processing kernels. We use both high-level estimation (based on latency and energy equations) and low-level simulation in our comparisons. These comparisons show that our proposed designs using FPGAs can provide significant reductions in not only latency but also energy dissipation.

1.2 Contributions of the Dissertation

This thesis addresses several important issues in developing energy efficient designs using configurable architectures such as FPGAs.
Our focus is on developing high-level performance modeling techniques that can enable rapid performance estimation for various metrics such as energy, area, and time in FPGAs. Using the high-level modeling techniques, the other focus is on developing and optimizing several digital signal processing applications for high energy performance. This thesis is one of the earliest efforts to develop high-level energy models and optimization techniques for energy performance in FPGAs. The contributions of this thesis include:

1.2.1 High-Level Performance Modeling Technique: Domain-Specific Modeling and Design Methodology

Reconfigurable architectures offer several design parameters such as operating frequency, precision, amount of memory, degree of parallelism, etc. These parameters define a large design space that must be explored to find energy efficient solutions. It is also challenging to predict the energy variation at the early design phases when a design is modified at the algorithm level. Efficient traversal of such a large design space requires high-level modeling to facilitate rapid estimation of system-wide energy. To address this scenario, we propose a domain-specific modeling technique for energy efficient kernel design that exploits knowledge of the algorithm and the target architecture family for a given kernel to develop a high-level model. This model captures architecture and algorithm features, parameters affecting energy performance, and power estimation functions based on these parameters. A system-wide energy function is derived based on the power functions and the cycle-specific power state of each building block of the architecture. This model is used to understand the impact of various parameters on system-wide energy and can be a basis for the design of energy efficient algorithms.
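The system-wide energy function just described can be sketched in a few lines of Python: sum, over all clock cycles, the power of each building block in its power state for that cycle. The active powers below echo the Virtex-II module figures given later in Table 2.1; the idle powers, component set, and two-cycle schedule are illustrative assumptions, not a derived model.

```python
# Sketch of a system-wide energy function built from cycle-specific power
# states of each building block. Active powers (mW) follow Table 2.1(b);
# idle powers and the component set are illustrative assumptions.

CLOCK_NS = 6.67  # one cycle at roughly 150 MHz

POWER = {
    "multiplier": {"active": 35.6, "idle": 2.0},
    "bram":       {"active": 25.7, "idle": 1.0},
    "register":   {"active": 2.1,  "idle": 0.2},
}

def system_energy_nj(schedule):
    """schedule[t] maps each component to its power state in cycle t.
    Energy = sum over cycles of per-cycle power times the cycle period
    (mW * ns = pJ, so divide by 1000 for nJ)."""
    total_pj = 0.0
    for states in schedule:
        cycle_mw = sum(POWER[c][s] for c, s in states.items())
        total_pj += cycle_mw * CLOCK_NS
    return total_pj / 1000.0
```

For a schedule in which all three components are active for one cycle and idle for the next, the function accumulates (35.6 + 25.7 + 2.1) mW for the first cycle and (2.0 + 1.0 + 0.2) mW for the second, illustrating how clock gating idle blocks directly shrinks the per-cycle sum.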
Our high-level model is used to quickly obtain fairly accurate estimates of the system-wide energy dissipation of data paths configured using FPGAs.

Based on the high-level modeling technique, a design methodology is developed to explore the design space to obtain high energy performance on FPGAs. Our methodology integrates domain-specific modeling, coarse-grained performance evaluation, design space exploration, and low-level simulation to understand the trade-offs between energy, latency, and area. The domain-specific modeling technique defines a high-level model by identifying various components and parameters specific to a domain that affect the system-wide energy dissipation. The high-level model also consists of functions for estimating energy, latency, and area that facilitate trade-off analysis. Design space exploration (DSE) analyzes the design space defined by the domain and selects a set of designs. Low-level simulations are used for accurate performance estimation of the designs selected by the DSE and also for final design selection.

1.2.2 Energy Efficient Algorithmic Design Techniques in FPGAs

We identify techniques that can be applied to FPGA-based designs to obtain energy efficiency. Using the currently available control knobs for power dissipation in FPGAs, we present techniques for reducing energy dissipation, some of which do so by lowering power dissipation, others by lowering latency. Several low-level and algorithmic level techniques for energy efficient design, such as clock gating, binding, architecture/algorithm selection, and memory banking, are discussed. The main focus is on algorithmic level design techniques that can be used to reduce energy dissipation during algorithm development.
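The design space exploration step of the methodology described in Section 1.2.1 can be sketched as a simple enumeration over closed-form estimation functions. The functions, coefficients, and parameter ranges below are illustrative placeholders (a hypothetical 48 x 48 kernel on a linear array of PEs), not the models derived in later chapters.

```python
# Hypothetical DSE sketch: enumerate candidate designs, estimate energy,
# area, and latency with closed-form high-level functions, and keep the
# lowest-energy designs meeting a latency budget for low-level simulation.
# All coefficients here are illustrative placeholders.

def estimate(p):
    latency = 110_592 // p + p          # cycles: 48^3 ops on p PEs + fill
    power   = 188.0 + 10.0 * p          # mW: static + assumed per-PE cost
    area    = 150 * p                   # slices (assumed per-PE area)
    energy  = power * latency           # relative units (mW * cycles)
    return {"pe": p, "energy": energy, "area": area, "latency": latency}

def explore(max_pe=16, latency_budget=40_000, keep=3):
    """Return the `keep` lowest-energy feasible designs."""
    candidates = [estimate(p) for p in range(1, max_pe + 1)]
    feasible = [d for d in candidates if d["latency"] <= latency_budget]
    return sorted(feasible, key=lambda d: d["energy"])[:keep]
```

The shortlisted designs returned by `explore()` would then go to low-level simulation for accurate estimation and final selection, mirroring the flow described above.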
1.2.3 Energy and Time Efficient Designs for Matrix Multiplication Using FPGAs

New algorithms and architectures for matrix multiplication on configurable devices are developed. These have reduced energy dissipation and latency compared with the state-of-the-art FPGA-based designs. By profiling well-known designs, we identify "energy hot spots", which are responsible for most of the energy dissipation. Based on this, we develop algorithms and architectures that offer trade-offs among the number of I/O ports, the number of registers, and the number of PEs. To avoid time-consuming low-level simulations for energy profiling and performance prediction of many alternate designs, we derive functions to represent the impact of algorithm design choices on the system-wide energy dissipation, area, and latency. These functions are used to either optimize the energy performance or provide trade-offs for a family of candidate algorithms and architectures. For selected designs, we perform extensive low-level simulations using state-of-the-art tools and target FPGA devices. We show a design space for matrix multiplication on FPGAs that results in trade-offs among energy, area, and latency. Our designs improve the energy performance of state-of-the-art FPGA-based designs by 29% to 51% without any increase in the area-latency product. The latency of our designs is reduced to between 1/3 and 1/15 of the state-of-the-art, while the area increases by 1.9x to 9.4x. In terms of a comprehensive metric such as EAT (Energy-Area-Time), our designs exhibit performance superior to the state-of-the-art by 50%-79%.

1.2.4 Energy and Time Efficient Designs for Matrix Factorization Using FPGAs

Two new designs are proposed for matrix factorization, a compute-intensive kernel used in many applications such as adaptive beamforming. Both designs are
optimized for time and energy performance. The first design is used for a normal LU factorization. A linear array architecture is employed to minimize the usage of long interconnects, leading to lower energy dissipation. The designs are made scalable by using a fixed I/O bandwidth independent of the problem size. The optimal latency is achieved on the linear array architecture. The second design is used for a block-based LU decomposition; the first design and the matrix multiplication kernel are re-used. In both designs, high-level models for energy profiling are built, and the time and energy performance of many possible designs is predicted. In the second design, through the analysis of design trade-offs, the block size that minimizes the total energy dissipation is identified. A set of candidate designs is implemented on FPGAs to verify the estimates. Since we are not aware of any designs that map energy efficient LU decomposition onto FPGAs, we implement a uniprocessor design and the best known design on a linear array. They are compared with our designs in terms of time, area, and energy performance. Also, the performance of our designs is compared with that of state-of-the-art low power DSP based designs. Our designs dissipate 59% to 78% less energy than the best known architecture and 4x to 36x less energy than the low power DSP based designs.

1.2.5 Energy/Time Efficient and Parameterized Designs for Fast Fourier Transforms Using FPGAs

Parameterized and energy/time efficient designs for the Fast Fourier Transform (FFT) on FPGAs are developed. Architectures for the FFT on FPGAs are designed by investigating and applying techniques for minimizing the energy dissipation.
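As background for the FFT designs: an N-point radix-R FFT consists of log_R(N) stages of N/R butterfly operations each, so the radix choice changes the total operation count, and replicating butterfly units in hardware shortens the schedule. The simplified latency model below is a generic sketch with hypothetical parallelism parameters, not the model derived in Chapter 6.

```python
# Generic sketch of the FFT operation count and a simplified latency model.
# vp and hp are hypothetical vertical/horizontal parallelism degrees; the
# perfect-utilization assumption is an illustrative simplification.
import math

def fft_butterfly_ops(n, radix):
    """Total butterflies in an n-point radix-R FFT: log_R(n) stages of
    n/radix butterflies each (n must be a power of the radix)."""
    stages = round(math.log(n, radix))
    return stages * (n // radix)

def fft_latency_cycles(n, radix, vp, hp):
    """Cycles with vp * hp butterfly units, assuming perfect utilization."""
    return math.ceil(fft_butterfly_ops(n, radix) / (vp * hp))
```

For instance, a 256-point FFT needs 1024 radix-2 butterflies but only 256 radix-4 butterflies, which is one reason radix is among the parameters explored for energy efficiency.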
Architectural parameters such as the degrees of vertical and horizontal parallelism are identified, and a design domain is created through a combination of design choices. We determine design trade-offs using high-level performance estimation to obtain energy efficient designs. We implement a set of parameterized designs, having parallelism, radix, and choice of storage types as parameters, on FPGAs to verify the estimates. Our designs dissipate 57% to 78% less energy than optimized state-of-the-art designs. In terms of a comprehensive metric such as EAT (Energy-Area-Time), our designs offer performance improvements of 3-13x over the state-of-the-art designs.

1.3 Outline of the Dissertation

The remainder of this thesis is organized as follows:

Chapter 2 presents the background for the thesis. The evolution and the state of FPGAs are discussed. The characteristics of power dissipation in FPGAs are discussed to motivate the issues that need to be addressed in developing energy optimization techniques. Also, general energy optimization techniques in FPGAs are discussed at various levels to provide basic power control knobs for the energy efficient designs.

Chapter 3 describes the high-level performance modeling techniques for FPGAs in detail. The abstraction of the architecture and the application by the model parameters is explained in detail. These performance modeling techniques are used inherently in developing all the energy efficient designs presented in the later chapters of this thesis. Based on the high-level modeling techniques, the design methodology to obtain energy efficient designs is presented.

Chapter 4 illustrates in detail the energy and time efficient designs for matrix multiplication using FPGAs. Two algorithms and architectures, based on a linear array and a block based design, are proposed.
Various algorithmic techniques are also developed and applied. The high-level energy performance models are defined and used for rapid design space exploration to find energy minimal solutions.

Chapter 5 illustrates in detail the energy and time efficient designs for matrix factorization using FPGAs. A new design based on a linear array and a block based design are proposed. The matrix multiplication designs described in Chapter 4 are reused as part of the matrix factorization. Further algorithmic techniques are developed and applied. The high-level energy performance models are defined and used to identify the block size that minimizes the energy dissipation.

Chapter 6 describes in detail the energy and time efficient and parameterized designs for Fast Fourier Transforms (FFT). Various parameters that significantly affect the energy performance are identified. Based on these parameters, a high-level performance model is defined and used to explore the design space. In particular, the impact of parallelism on the energy dissipation is analyzed.

Chapter 7 discusses the conclusions from this thesis work and addresses directions for future research with respect to the energy and power performance of FPGAs.

Chapter 2

Background

In the past few years, digital signal processing (DSP) applications requiring high performance and operating in constrained environments have greatly increased. Field Programmable Gate Arrays (FPGAs) show the potential to fill this demand for high performance and flexibility, which has traditionally been satisfied by using multiple DSP processors for massive parallel processing. FPGAs also show potential when power and energy performance become critical concerns in DSP applications.
In this chapter, we give a brief overview of the evolution of FPGA architectures. The characteristics of power dissipation and the knobs available to control power dissipation in FPGAs are discussed to motivate the issues that need to be addressed in developing energy optimization techniques. Also, general energy optimization techniques such as latency reduction are discussed.

2.1 Field Programmable Gate Arrays (FPGAs)

As semiconductor technology improves, with more and more transistors being implemented on a single chip, the speed and density of FPGAs have increased tremendously. Moreover, recent platform FPGAs with new features have evolved to satisfy increasing performance requirements in industry. We summarize the evolution of the FPGA architecture based on the inclusion of new features (see Figure 2.1).

[Figure 2.1: The evolution of FPGAs. The chart tracks device capacity (system gates*) from about 0.5M in 1997 to 8M in 2002, with successive inclusions of embedded RAMs, embedded multipliers, gigabit transceivers, and up to 4 PowerPC 32-bit cores, moving from 0.25-0.35 micron processes to 1.5V-core 0.13 micron processes.]

*System gates: the metric used to estimate the typical number of gates and the memory that can be realized in the FPGA device for a design.

2.1.1 Conventional FPGA

The conventional FPGA architecture consists of an array of configurable logic blocks (CLBs) and an interconnection network of wires that connects the CLBs (see Figure 2.2). Both the CLBs and the interconnection network are configurable. Each CLB usually contains configurable lookup tables (LUTs) that enable the device to implement any function of multiple inputs, registers, and additional combinational logic.
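The role of the LUT can be illustrated with a small software model: a k-input LUT is just a 2^k-entry truth table indexed by the input bits, and configuring the FPGA amounts to filling in that table. The 4-input example below is an illustrative sketch, not tied to any particular device.

```python
# Illustrative model of a 4-input LUT: a 16-entry truth table indexed by
# the input bits. "Configuring" the LUT means choosing the 16 table bits,
# which lets it implement any Boolean function of its four inputs.

class LUT4:
    def __init__(self, truth_table):
        assert len(truth_table) == 16      # 2^4 configuration bits
        self.table = truth_table

    def evaluate(self, a, b, c, d):
        index = (a << 3) | (b << 2) | (c << 1) | d
        return self.table[index]

# Configure the LUT as a 4-input AND gate: only index 0b1111 outputs 1.
and4 = LUT4([0] * 15 + [1])
```

Swapping in a different 16-bit table turns the same hardware into any other 4-input function, which is why the LUT is the basic configurable element of the CLB.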
The output of each CLB is either the output of the LUT or that of a configurable register connected to the LUT output. The CLBs at the periphery of the device perform the I/O operations. The interconnection network is configured by changing the connections between the CLBs and the wires, by configuring the switch boxes which connect the various wires. The functionality of the CLBs and the connections of the switch boxes are determined by configuration data stored in a configuration memory, typically realized as static random-access memory (SRAM) bits that control the configurations of transistors. An SRAM based configuration can be reprogrammed by downloading different configuration bits into the SRAM. The Xilinx XC4000 is an example of this type of FPGA. A detailed architectural survey of this type of FPGA is given in [20].

[Figure 2.2: The conventional FPGA architecture. The diagram shows the interconnect and switch boxes surrounding a configurable logic block, in which a configurable lookup table feeds a configurable register and a carry chain.]

2.1.2 Inclusion of Embedded Memory, Embedded Multipliers and DSP Blocks

Many DSP and telecommunications applications require the local storage of data on the chip so that it can be easily accessed by the compute logic. Recent FPGAs have on-chip embedded memory. While the CLB can be configured as an internal memory, the embedded memory has the advantages of better and variable sizes, speeds, and dimensions. Thus FPGA architectures can efficiently provide various levels of memory, with individual cells or LUTs acting as registers, cascaded groups of cells or LUTs acting as reconfigurable single- or multi-port RAMs for mid-size needs, and large dedicated RAM blocks for more demanding uses.
The inclusion of these flexible embedded memory elements allows high-speed and local-memory intensive designs common in DSP to queue and store data locally without costly off-chip access. With the need for high-speed DSP functions, ASIC multipliers and generic DSP blocks containing multipliers and adders have been embedded as arithmetic operators. This greatly improves the performance of the FPGA for these arithmetic functions. For example, the Xilinx Virtex is a high-performance FPGA (see Figure 2.3) [144]. It has different versions with capacities ranging from 50 thousand to 2 million system gates. The Virtex architecture comprises an array of CLBs, encircled by programmable I/O blocks, and dedicated Block SelectRAMs of 4096 bits each.

[Figure 2.3: The recent FPGA architecture. The diagram shows the FPGA fabric augmented with dual-port embedded memory, embedded multipliers, and an embedded PowerPC or ARM processor.]

It has a hierarchical routing matrix with local routing and a varying number of global routes of different lengths. There are 24 single-length routes, 96 routes of length six, and 12 long lines spanning the chip. There are additional I/O routing resources around the periphery of the logic blocks. The CLBs contain four logic cells each. Each logic cell has a 4-input function generator (LUT), a flip-flop, and some carry logic. The LUTs can operate as function generators or be used as distributed RAMs. Additional multiplexors and wires in a CLB provide flexible combination of different logic cell outputs and routing of input signals to the CLB output. High speed arithmetic is facilitated by additional carry logic in each of the logic cells. A dedicated AND gate in each logic cell improves multiplier implementations.
On-chip local memories can be realized on the Virtex architecture in two different ways. The logic cells can be combined and configured as memory cells to obtain multiported RAM of the required size. Each Virtex also has large Block SelectRAM memories. These are organized along the two vertical edges of the FPGA. Each memory block is four CLBs high, and the number of such blocks is as large as 32 for large Virtex chips that are 64 CLBs high. Each such memory cell is a fully synchronous dual-ported 4096-bit RAM with independent control signals for each port and independent data widths. In addition, they provide application memory for many of the emerging embedded applications built on hard- and soft-IP based processors.

The Virtex-II architecture includes up to 168 18-bit x 18-bit embedded (Block) multipliers, while the Stratix architecture includes up to 22 DSP blocks, each of which can be configured as 4 18-bit x 18-bit embedded multipliers and 3 36-bit adders. Virtex-II scales up to 8 million system gates with internal system clocks up to 420 MHz. Virtex-II also has higher capacity CLBs and larger Block SelectRAM and distributed RAM. Virtex-II also supports high bandwidth serial and parallel interfaces compliant with several industry standards. Altera Stratix [4] FPGAs contain generic DSP blocks with ASIC adders and multipliers capable of being configured into MACs or complex number multipliers.

2.1.3 Availability of DSP IP Cores

In the past several years, with the preference for FPGAs in embedded systems, the need for IP cores has increased significantly. Moreover, the capabilities of FPGAs have improved such that highly complex designs can be implemented. Thus the complexity of the design process has increased, and a more sophisticated design team is required to successfully complete a design.
Proven IP was a fast way to combat rising design costs and expanding schedules. Designers wanting to take advantage of the time-to-market benefits of platform-based FPGAs need ready access to complex peripheral IP. As a result, today there is an increasing number of commercial IP vendors developing certified IP cores for use in high-end FPGAs. The available IP includes processor and DSP cores, bus-based peripherals, and a host of standard functions designed to facilitate almost drag-and-drop system creation.

2.1.4 Inclusion of Embedded and Soft Microprocessor Cores

FPGA based systems are typically attached to a host system through some interface such as the system bus or the I/O channels. While these systems have shown significant speedups for specific applications, the limiting factor is the communication cost between the FPGAs and the host computer. Moreover, control was sometimes enabled through complex state machines or from outside the chip. Current systems try to alleviate this problem by embedding microprocessors into FPGAs on the same die. Moreover, microprocessors can be implemented using the FPGA fabric itself. Thus architectures having combinations of processors and designs on the FPGA fabric can be obtained. Various terms are used for these architectures, including Configurable System-on-Chip (CSOC), Reconfigurable System-on-Chip (RSOC), and System on Programmable Chip (SOPC) [15] [42] [105].

FPGA vendors are also aggressively approaching this design space by providing customized processor and other IP cores on their devices. These include the Virtex-II Pro [145] with IBM PowerPC 405 cores and the Altera Excalibur [5] with an ARM922T processor. The PowerPC core operates at 380 MHz and communicates with the FPGA fabric at more than 6 GB/sec. The Virtex-II Pro is an evolution of the high capacity Virtex-II series FPGAs.
It is expected to scale up to 10 million system gates and to have 556 18-bit x 18-bit embedded multipliers. Another distinguishing inclusion is the high-speed gigabit serial connections that meet the requirements of telecommunication applications. The ability of the Virtex-II Pro to integrate multiple aspects of different architecture paradigms provides a flexible platform for design development. The design tools combine aspects of embedded processor compilation, electronic design automation (EDA), real time operating systems (RTOS), digital signal processing (DSP), and so on.

2.2 Energy Efficient Design Techniques

We discuss techniques that can be applied to FPGA-based designs to obtain energy efficiency. Note that, though the terms are often used interchangeably, power and energy are not the same. Energy is the product of average power dissipation and latency. To understand energy dissipation, therefore, it is necessary to understand power dissipation and its effect on latency, and vice versa. In this section, we first briefly describe the sources of power dissipation in FPGAs. We then present techniques for reducing energy dissipation, some of which do so by lowering power dissipation, others by lowering latency. Several low-level and algorithmic level techniques for energy efficient design are discussed. The main focus is on algorithmic level techniques. "Algorithmic level techniques" refer to those techniques in algorithm development that can be used to reduce energy dissipation.
2.2.1 Power Dissipation in FPGAs

The widely accepted equation for power dissipation [95] is:

P = C V^2 f + V I_leak    (2.1)

The first term is the dynamic power dissipation, where P, C, V, and f represent power dissipation, effective capacitance, voltage, and operating frequency, respectively. The effective capacitance can be used to account for the combined effect of multiple capacitances or varying switching activity. The second term is the static power due to the leakage current I_leak. For example, the Virtex-II XC2V1500 dissipates 188 mW of static power (called quiescent power) while the device is on. In this thesis, we consider both dynamic and static power dissipation.

Several studies of FPGA power dissipation have recently appeared in the literature [119] [120] [148]. Figure 2.4 shows the capacitance and the power dissipation of various interconnection wires in Virtex-II [120]. Table 2.1 presents the capacitance of some components and the power dissipation of configured modules in Virtex-II (the power dissipation of the PowerPC core is obtained from the Virtex-II Pro). These works show that power dissipation in FPGA devices is primarily due to the programmable interconnect. In the Virtex-II family, for example, it is reported that between 50% and 70% of total power is dissipated in the interconnect, with the remainder being dissipated in the clocking, logic, and I/O blocks [120].
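Equation 2.1 can be evaluated directly. In the sketch below, the static term uses the 188 mW XC2V1500 quiescent figure quoted above; the effective capacitance and toggle-rate values are illustrative placeholders, and folding the toggle rate into the dynamic term is a common modeling convention, not part of the equation as stated.

```python
# Evaluating Equation 2.1: P = C * V^2 * f + V * I_leak.
# The 188 mW quiescent power for the XC2V1500 comes from the text; the
# effective capacitance and toggle-rate figures are illustrative.

V_CORE = 1.5          # Virtex-II core voltage (volts)
P_STATIC_W = 0.188    # XC2V1500 quiescent power (watts), i.e. V * I_leak

def dynamic_power_w(c_eff_f, volts, freq_hz, toggle_rate=0.5):
    """Dynamic term of Eq. 2.1; the toggle rate scales the effective
    capacitance to account for switching activity."""
    return toggle_rate * c_eff_f * volts ** 2 * freq_hz

def total_power_w(c_eff_f, freq_hz):
    return dynamic_power_w(c_eff_f, V_CORE, freq_hz) + P_STATIC_W
```

For example, an assumed 1 nF of effective switched capacitance at 100 MHz gives 0.1125 W of dynamic power, so total power is 0.3005 W; energy then follows as this power multiplied by the latency of the computation.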
Table 2.1: (a) Capacitances of the CLB and embedded blocks and (b) power dissipation of configured and embedded logic in Xilinx Virtex-II

(a)
Component            C (pF)
Flip-Flops           2.88
LUT                  26.4
Distributed RAM      24.3
Block RAM            982.5
Block Multipliers    1,777.7
I/O LVTTL            100.4

(b)
Module              Speed (MHz)  Toggle Rate (%)  Area (slices)  Power (mW)  Note
PowerPC             300          n/a              n/a            290.0
Register            150          50               4              2.1         16 bit
Counter             150          50               4              2.8         16 bit
Adder               150          50               4              2.8         16 bit
Multiplier (LUT)    150          50               307            97.8        16 bit
Block Multiplier    150          50               n/a            35.6        16 bit
Distributed RAM     150          50               145            85.3        128 Byte
Block RAM           150          50               n/a            25.7        2 KByte

This analysis differs from ASIC technology, where clock distribution often dominates power dissipation [148]. The sources of power dissipation in the two technologies differ because their interconnect structures are composed disparately: FPGA interconnect consists of pre-fabricated wire segments of various lengths, with used and unused routing switches attached to each wire segment.

Figure 2.4: (a) Capacitances of various wires and (b) the power dissipation of various wires as a function of frequency in Xilinx Virtex-II [120]

(a)
Interconnect     C (pF)
Long Line        26.10
Hex Line         18.40
Double Line      13.20
Direct Connect   7.28

[(b) plots the power dissipation of the Long Line, Hex Line, Double Line, and Direct Connect wires over 0-100 MHz at a 20% toggle rate.]

Another important factor affecting power dissipation in FPGAs is resource utilization [120]. In typical FPGA designs, a majority of the resources may not be
One more factor in determining power dissipation is the switching activity, which is defined as the number of signal transitions in a clock period. The switching activity of each resource depends not only on the type of design but also on the input stimuli. Having understood the sources of power dissipation, we can now discuss the low-level and algorithm-level design techniques for energy efficient FPGA-based design.

2.2.2 Low-Level Design Techniques

In the literature, there are many low-level power management techniques that lead to energy savings when applied to designing for FPGAs [120, 148]. Here, we discuss those low-level techniques that provide control knobs for algorithm-level design.

One such technique is clock gating, which is used to disable parts of the device that are not in use during the computation. In the Virtex-II family, clock gating can be realized by using primitives such as BUFGMUX to switch from a high frequency clock to a low frequency clock [142]. BUFGCE can be used to dynamically drive a clock tree only when the corresponding logic is used.

Choosing energy efficient bindings is another technique. A binding is a mapping of an operation to an FPGA component. The ability to choose the proper binding is due to the existence of several configurations for the same computation. Thus, different bindings affect FPGA energy dissipation.

Figure 2.5: Power dissipation of various storage element implementations as a function of the number of entries (Virtex-II XC2V1500, 150MHz, 50% switching activity, 16 bits per entry)
For example, in Figure 2.5, we show three possible bindings for storage in Virtex-II based on the number of entries: registers, slice based RAM (SRAM), and embedded Block RAM (BRAM). For large storage elements (those with more than 48 entries), BRAM shows an advantage in power dissipation over the other implementations. Another example is the choice between hard and soft IP. One such case is the choice of multipliers: block multipliers, such as those in the Virtex-II and Stratix, can be more efficient than CLB-based multipliers (see Figure 2.6). All results for Figure 2.5, Figure 2.6, and Figure 2.7 are obtained using the techniques proposed in Chapter 3 [30] [36]. In developing an algorithm, a designer can analyze the trade-offs that arise from various bindings based on the design requirements.

Figure 2.6: Power dissipation of various multipliers (block, slice based, and constant multipliers) as a function of input precision in bits (Virtex-II XC2V1500, 150MHz, 50% switching activity)

2.2.3 Algorithm Level Design Techniques

It is known that energy performance can be improved significantly by optimizing a design at the algorithm level [114]. We summarize the algorithm-level techniques that can be used to improve the energy performance of designs implemented on FPGAs.

Architecture Selection: Since FPGAs provide the freedom to map various architectures, choosing the appropriate architecture affects the energy dissipation. It plays a large part in determining the amount of interconnect and logic to be used in the design. Since interconnect dissipates a large amount of power, minimizing the number of long wires between building blocks is beneficial [120].
Several past efforts have identified various architecture families, each having different characteristics in terms of I/O complexity, memory requirements, area, etc. [53] [78]. Based on the performance needs and the limitations of the target FPGA chip, it is possible to identify a suitable architecture. Identification of an appropriate architecture for an algorithm ensures that we begin with an efficient design most suitable for the performance requirements and that there are various architecture parameters that can be varied to explore trade-offs among energy, latency, and area. For example, matrix multiplication can be implemented using a 1-D array (linear array) or a 2-D array. A 2-D array dissipates more power in the interconnect since more interconnects are required. Thus, it is possible that more energy would be dissipated, depending upon the resulting latency.

Module Disabling: In developing an algorithm, it is possible to design the algorithm such that it utilizes the clock gating technique to disable modules that are not in use during the computation. For example, FFT computation has many complex number multipliers to perform twiddle factor computations (multiplication and addition/subtraction). Because of the nature of the algorithm, some twiddle factors are 1, -1, j, or -j, and their computation can be bypassed. Thus, the implementation of the twiddle factor computation can exploit clock gating to disable the unnecessary computation modules. Another example is that some RAM implementations have sleep states so that power dissipation is reduced when they are idle. Figure 2.7 shows the power dissipation of SRAM (16 bits per entry) of various sizes. The disabled memory dissipates less than 10% of the amount of power that the enabled memory does. The power dissipation of BRAM is even smaller than that of SRAM.
Its power dissipation is less than 1% of the power dissipation of the enabled memory. Because of these reduced power dissipations, energy dissipation is also reduced, provided that latency does not increase too much.

Memory Banking: Along with block disabling and using BRAMs, a large memory can be made of smaller memory banks, where each bank has its own enabling/disabling feature. By enabling only the necessary memory banks, the energy used by the memory is saved. For example, the block based approaches in matrix multiplication and matrix factorization often require a large number of memory banks consisting of BRAMs. When n = 128, 16 BRAMs are required. However, only 3 to 4 BRAMs are in use at any given time, and by using memory banking we can save 80% of the energy in the memory.

Figure 2.7: The effect on energy dissipation of disabling a SRAM as a function of the number of entries, for disabled and enabled (50% toggle rate) memory (Virtex-II XC2V1500, 150MHz, 50% memory access rate, 8 bits per entry)

Pipelining: Pipelining is an efficient design practice for both time and energy performance. Many digital signal processing applications process streaming data. For these applications with regular data flow, pipelining increases throughput. Pipelining increases power dissipation, however, since all logic in the design is continuously active. In FPGA designs with streaming data, throughput is another important factor in energy dissipation. Thus, for a pipelined design, a modified version of the energy equation is E_pipe = P_avg/Th, where Th is the throughput of the design. Note that 1/Th can be considered the effective latency of the design.
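The pipelined energy metric E_pipe = P_avg/Th can be made concrete with a small calculation; the power and throughput values below are illustrative, not measurements from any design in this thesis.

```python
# E_pipe = P_avg / Th: energy per result for a pipelined design, where
# 1/Th is the effective latency.  All numbers are illustrative.

def pipelined_energy(p_avg_w, throughput_per_s):
    """Energy per result in joules."""
    effective_latency = 1.0 / throughput_per_s   # seconds per result
    return p_avg_w * effective_latency

# Doubling throughput at the cost of 50% more average power still
# reduces the energy charged to each result:
e_base = pipelined_energy(0.4, 75e6)    # 0.4 W at 75 M results/s
e_pipe = pipelined_energy(0.6, 150e6)   # 0.6 W at 150 M results/s
print(e_pipe < e_base)
```

This is exactly the sense in which increasing power dissipation can decrease overall energy dissipation.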
The effective latency accounts for the benefits of overlapping computations in pipelining. All designs in this thesis adopt pipelining. Pipelining is one technique in which increasing the power dissipation may decrease the overall energy dissipation.

Parallel Processing: Parallel processing is an important technique for reducing energy dissipation in FPGA systems. In practice, the trade-off between pipelining and parallelism is not distinct: merely replicating functional units rather than using pipelining has the negative effect of increasing area and wiring, which in turn increases the energy dissipation. Instead, a more sophisticated approach to parallel processing is needed. In Chapter 4, we employ a fully parallel approach and a block based approach for matrix multiplication. For problem sizes up to 16, the fully parallel architecture gives good energy performance. However, for problem sizes greater than 16, block matrix multiplication leads to better energy performance. Since the internal storage required for large problem sizes increases dramatically, parallel processing has a negative effect on the total energy dissipation. This result implies that a designer must carefully investigate the trade-offs between the algorithm and the degree of parallelism. Throughout this thesis, we discuss this effect in detail.

Algorithm Selection: A given application is mapped onto FPGAs differently by selecting different algorithms. For example, using block matrix multiplication is the algorithm-level design choice for larger matrix multiplication. In Chapter 4, block matrix multiplication is the energy efficient choice for n > 16. Matrix factorization can also be implemented using a block matrix factorization algorithm (see Chapter 5). In implementing the FFT, the choice of radices affects the energy performance. For example, a radix-4 based algorithm significantly reduces the number of complex multiplications that would otherwise be needed if a radix-2 based algorithm were used. All these algorithm selections affect the architectures and the energy dissipation of a design. The trade-offs between different algorithms should be analyzed to achieve energy efficient designs.

Quiescent Power Consideration: Virtex-II does not have a power management scheme like, for example, the CoolRunner-II CPLD, which features normal and low-power modes. The Virtex-II always consumes power when the device is on. For example, the XC2V1500 dissipates 188mW regardless of any operation states
For example, a radix-4 based algorithm significantly reduces the number of complex multiplications that would otherwise be needed if a radix-2 based algorithm were used. All these algorithm selections affect the architectures and the energy dissipation of a design. The trade-offs between different algorithms should be analyzed to achieve energy efficient designs. Quiescent Power Consideration: Virtex-II does not have a power manage ment scheme like, for example, the CoolRunner-II CPLD, which features normal and low-power modes. The Virtex-II always consumes power when the device is on. For example, XC2V1500 dissipates 188mW regardless of any operation states 3 7 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. of the device. This quiescent power cannot be minimized at the algorithm level. However, the faster a design completes its computations, the less energy that will be consumed due to quiescent power. 3 8 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. C hapter 3 H igh-L evel Perform ance M odeling Techniques: D om ain-Specific M odeling and D esign M eth od ology A high-level model should allow to explore a large design space rapidly in order to evaluate the impact of algorithmic and architectural choices on the energy dis sipation of a design in FPGAs while it should be fairly accurate in predicting the performance. Several issues must be addressed in developing a high-level energy model for FPGAs. There are numerous ways to map an algorithm onto an FPGA as opposed to mapping onto a traditional processor such as a RISC processor or a DSP, for which the architecture and the components such as ALU, data path, memory, etc. are well defined. For FPGAs, the basic element is the lookup table (LUT), which is too low-level an entity to be considered for high-level modeling. Besides, the 3 9 R eproduced with perm ission of the copyright owner. 
architecture design depends heavily on the algorithm. Therefore, no single high-level model can capture the energy behavior of all feasible designs implemented on FPGAs. In addition, if the level of abstraction is elevated, high-level models may not capture the details of a system and may consider only a small set of key parameters that affect energy. This can affect the accuracy of energy estimation.

In order to address the issues discussed above, we propose a domain-specific modeling technique (see Figure 3.1). This technique facilitates high-level energy modeling for a specific domain. A domain corresponds to a family of architectures and algorithms that implements a given kernel. For example, a set of algorithms implementing matrix multiplication on a linear array is a domain. Detailed knowledge of the domain is exploited to identify the architecture parameters for the analysis of the energy dissipation of the resulting designs in the domain. By restricting our modeling to a specific domain, we reduce the number of architecture parameters and their ranges, thereby significantly reducing the design space. A limited number of architecture parameters also facilitates the development of power functions that estimate the power dissipated by each component (a building block of a design). For a specific design, the component specific power functions, the parameter values associated with the design, and the cycle specific power state of each component are combined to specify a system-wide energy function.

Our approach is a top-down approach, in contrast with other approaches that exploit low-level simulations and estimations for each component and accumulate
Figure 3.1: Domain-specific modeling

these results to estimate the overall energy dissipation. The advantage of our approach is the ability to rapidly evaluate the system-wide energy using the energy function for different designs within a domain. Our high-level energy model also facilitates algorithm-level energy optimization through the identification of appropriate settings for architecture parameters such as frequency, number of components, precision, etc., early in system design.

The organization of the chapter is as follows. Related efforts are discussed in Section 3.1. Section 3.2 describes the domain-specific modeling technique and the methodology to estimate the power functions. A detailed description of modeling and energy estimation using domain-specific modeling for four different domains is presented in Section 3.4.

3.1 Related Work

Several research efforts have focused on rapid energy estimation of a design on FPGAs. Shang and Jha [119] proposed a black-box approach to estimate energy based on input and output signal statistics. This approach is suitable for estimating the average power dissipation of an RT-level component to be embedded into a system. However, it is not applicable for algorithm-level power analysis. On the other hand, our model captures various architecture parameters that can be manipulated at the algorithm level for energy optimization.
XPower, the power estimation tool provided by Xilinx [144] [146], estimates the energy dissipation of FPGAs based on low-level simulation. The input to the tool is LUT-level place-and-route information along with details of the switching activity of LUT-level components. While its accuracy is comparable with actual execution of the design, it does not support energy estimation early in the design phase, when the complete system description in some HDL is not yet available. Stammermann et al. presented ORINOCO, a software tool for power dissipation analysis and optimization at the algorithmic level from C/C++ and VHDL descriptions [125]. However, C/C++ or VHDL descriptions do not capture the parameters affecting system-wide energy, and a designer also requires complete knowledge of the final system before the code can be generated in these languages. Both ORINOCO and XPower are essentially estimation tools and can be used in our methodology to perform the low-level sample simulations necessary for specifying our component specific power functions. We have compared our estimation accuracy against XPower.

In [17], a regression tree [18] is used to improve the power estimation of an RT-level component. Starting with candidate variables (I/O bits), the variable v_j which has the maximum impact on the power dissipation is identified. Then the sampled power measurements are split into two subsets based on this variable. The splitting is recursively performed to build a regression tree which ranks the variables by their significance with respect to power. It is a bottom-up approach that starts from a low-level implementation and ends in identifying the significant variables affecting the power.
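The recursive splitting just described can be sketched as follows: at each step, pick the binary candidate variable whose split most reduces the variance of the sampled power values, then recurse on the two subsets. The data layout (a 0/1 feature dictionary per sample) is an illustrative assumption about how such samples could be organized, not the exact procedure of [17].

```python
# Variance-reduction splitting over binary candidate variables,
# applied to (features, power) samples.  Illustrative sketch only.

def variance(xs):
    if not xs:
        return 0.0
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def best_split(samples):
    """Return the variable whose split yields the largest variance
    reduction over the sampled power values, or None."""
    if not samples:
        return None
    base = variance([p for _, p in samples])
    best, best_gain = None, 0.0
    for v in samples[0][0]:
        lo = [p for f, p in samples if f[v] == 0]
        hi = [p for f, p in samples if f[v] == 1]
        gain = base - (len(lo) * variance(lo) + len(hi) * variance(hi)) / len(samples)
        if gain > best_gain:
            best, best_gain = v, gain
    return best

def build_tree(samples, depth=2):
    v = best_split(samples) if depth > 0 else None
    if v is None:
        return sum(p for _, p in samples) / max(len(samples), 1)
    lo = [(f, p) for f, p in samples if f[v] == 0]
    hi = [(f, p) for f, p in samples if f[v] == 1]
    if not lo or not hi:
        return sum(p for _, p in samples) / len(samples)
    return (v, build_tree(lo, depth - 1), build_tree(hi, depth - 1))

samples = [({"a": 0, "b": 0}, 1.0), ({"a": 0, "b": 1}, 1.1),
           ({"a": 1, "b": 0}, 5.0), ({"a": 1, "b": 1}, 5.1)]
print(best_split(samples))   # the variable with the largest impact
```

Here bit "a" dominates the sampled power, so it is ranked first; the leaves of the resulting tree hold mean power estimates for each subset.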
The effect of the parameters on the system-wide energy is captured in the component specific power functions. The component specific power functions are used to obtain parameter values for optimal power performance by traversing the design space at an algorithmic level. 3.2 D om ain-Specific E nergy M odeling Since FPGAs provide the freedom to map various architectures, choosing an ap propriate architecture plays a significant role in determining the amount of inter connect and logic to be used in the design which also affects energy dissipation, latency, and area. Therefore, based on the performance needs and the characteris tic of the target FPGA chip, it is possible to identify a set of suitable architectures, each having different characteristics in terms of I/O complexity, memory require ments, area, etc. Defining a domain which consists of an appropriate architecture for an algorithm ensures that we begin with an efficient design most suitable for the performance requirements and that there are various architecture parameters that can be varied to explore trade-offs among energy, latency, and area. For example, matrix multiplication can be implemented using a 1-D array (linear array) or a 2-D 4 4 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. array. A 2-D array would dissipate more power from interconnect since more inter connects are required. Thus, it is possible that more energy would be dissipated, depending upon the resulting latency. The parameters representing algorithm level and architecture level choices for a specific application form a multi-dimensional space. For example, the number of multipliers, registers and the I/O channels can be changed from algorithm level choices for matrix multiplication. In the course of high-level modeling, we consider many power management tech niques that provide control knobs when applied to designing for FPGAs [120] [148]. 
One such technique is clock gating, which is used to disable parts of the device that are not in use during the computation. In the Virtex-II family of FPGAs, clock gating can be realized by using primitives such as BUFGMUX to switch from a high frequency clock to a low frequency clock [144]. BUFGGE can be used for dynamically driving a clock tree only when the corresponding logic is used. For example, FFT computation has many complex number multipliers to perform twiddle factor computations (multiplication and addition/subtraction). Because of the nature of the algorithm, some twiddle factors are 1, — 1, j, or — j and their computation can be bypassed. Thus, the implementation of twiddle factor com putation can exploit clock gating to disable the unnecessary computation blocks. Ghoosing bindings is another technique. A binding is a mapping of a computa tion to an FPGA component. The ability to choose the proper binding is due to the existence of several configurations for the same computation. Thus, different 4 5 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. bindings affect FPGA energy dissipation. For example, there are three possible bindings for storage elements in Virtex-II devices based on the number of entries: registers, slice based RAM (SRAM), and embedded Block RAM (BRAM). An other example is the choice between hard and soft IP. One such case is the choice of multipliers: block multipliers, such as those in the Virtex-II and Stratix, can be more efficient than CLB-based multipliers. In high-level modeling, we can analyze the trade-offs that arise from various bindings based on the design requirements. Exploiting the domain knowledge and the power management techniques, the goal of domain-specific modeling (Figure 3.2 (a)) is to represent energy dissipation of the designs specific to a domain in terms of parameters associated with this domain. 
For a given domain, only those parameters which can significantly affect system-wide energy dissipation and can be varied at algorithmic level are chosen for the high-level energy model. As a result, our model a) facilitates algorithmic level optimization of energy performance, b) provides rapid and fairly accurate estimates of the energy performance, and c) provides energy distribution profile for individual components to identify candidates for further optimization. First, we define the high-level energy model. Then we provide details of energy estimation method using this model. 4 6 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. c Domain High-level Model Function Estimation Relocatable Module interconnect [ Component Specific Parameters | I (n,pe,f,sw ,....) Component Specific Power Functions Component Power State Matrices J System-wide I Area | System-wide Energy Function System-wide Energy System-wide' Latency | Design f A Specific Parameters I Design (a ) ^ Number of components (n,) State of a component in a cycle ■ e : (b) Figure 3.2: (a) Domain-specific modeling and system-wide energy estimation and (b) component power state matrices 4 7 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 3.2.1 High-Level Energy M odel Our high-level energy model consists of RModules, Interconnects, component spe cific parameters and power functions, component power state matrices, and a system-wide energy function. Relocatable Module (RModule) is a high-level architecture abstraction of a com putation or storage module. It is either a CLB (configurable logic block)-based logic or a ’ ’larger” module composed of multiple RModules and Interconnects. We define RModule whose power dissipation can be individually characterized once their input stimuli are known, regardless of their location. 
For example, a register can be a RModule if the number of registers varies in the design depending on algorithm-level choices. One important assumption about a RModule is that the energy performance of an instance of a RModule is independent of its location on the device. While this assumption can introduce a small error in energy estimation, it greatly simplifies the model. We regard RModules as building blocks which are used to construct the energy model. The granularity of the RModules for a specific domain is influenced by the domain. For example, the adders or registers inside a multiplier can be RModules, but there is no sense in choosing them if there are no corresponding parameters for them in the domain; they are not targets for energy optimization at the algorithm level. An Interconnect represents the connection resources used for data transfer between the RModules. The power dissipated in a given Interconnect depends on its length, width, and switching activity. An Interconnect
From our knowledge in the algorithm, we find that there exists frequent systolic movement of intermediate results among them. Another examples are operating frequency and precision of a multiplier RModule if they are varied by the algorithm. Possible candidates parameters include operating frequency (/), input switching activity {sw), word precision {w), power states (jps), number of RModule type i {nfi, etc. Component specific power functions capture the effect of component specific parameters on the average power dissipation of the component. The power functions are obtained by implementing sample designs of individual components and simulating them using low-level simulators (described later in this section). 4 9 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Component Power State (CPS) matrices capture the power state for all the components in each cycle. For example, consider a design that contains k different types of components {Ci, ...,Ck) with m components of type i. If the design has the latency of T cycles, then k two dimensional matrices are constructed where the z-th matrix is of size T x U i (Figure 3.2 (b)). An entry in a CPS matrix represents the power state of a component during a specific cycle and is determined by the algorithm. System-wide energy function represents the energy dissipation of the designs belonging to a specific domain as a function of the parameters associated with the domain. The domain-specific nature of our energy modeling is exploited when the de signer identifies the level of architecture abstraction (RModules and Intercon nects) appropriate to the domain and/or chooses the parameters to be used in the component-specific power functions. This is a human-in-the-loop process and exploits the designer’s expertise in the algorithm and the architecture family that constitutes the domain. 
Well-known power models based on capacitance, voltage, and switching activity can be more accurate and are generic enough to be applicable across many domains. However, they do not provide a designer with a clear understanding of the impact of his/her algorithm-level design choices on the energy performance. Our modeling enables the designer to rapidly explore a large design space based on an understanding of the effect of the design choices on the overall energy performance.

To handle modeling complexity, we follow a hierarchical approach. Each RModule can be recursively divided into RModules and Interconnects. This hierarchical nature allows the designer to capture the details of the architecture in the design at various levels of abstraction to identify parameters affecting performance.

3.2.2 Component Specific Power Function Estimation

The power dissipation of a RModule or Interconnect in a particular state is captured as a power function of a set of parameters. These functions are typically constructed through curve fitting based on some sample low-level simulations. We demonstrate our function estimation technique in detail by deriving the power function for a register-based memory implemented on the Virtex-II device. Figure 3.3 (a) summarizes the technique. This technique was applied for power function estimation during the modeling of the various domains described in Section 3.4.

Let C.p(p_1, ..., p_n) be the component power function and p_1, ..., p_n be the parameters associated with the component. Estimation of the component-specific power function involves estimating the power dissipation through low-level simulation of the component at different design points. A design point is a unique combination of parameter values. For our chosen component, a register based memory, we used the basic register design provided by the Xilinx library.
The 51 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. architecture parameters with ranges O I ^component specific power functions .ncd file ■vcd file netlist .ncd ^ VHDL VHDL XPower ModelSim Xilinx XST Synthesis Waveforms (stimulant) Xilinx Place&Route Domain Specific Modeling Power function icuilder (curve-fitting,..) (a ) I .s .a. □ 15.00-20.00 □ 10.00-15.00 ■ 5.00-10.00 0 0.00-5.00 9° 110 130 1 5 0 ^ ^ Frequency (f) 4 Nurrber of register (r) (b) Figure 3.3: (a) Power function estimation and (b) register power function as a function of the number of registers (r) and frequency (/) 5 2 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. component specific parameters are frequency of operation, number of registers in a memory, and the precision. We decided not to vary the precision and assumed it to be 8-bit. Therefore the parameters that affect energy dissipation of the memory are number of registers (r) and frequency (/). Let (r, / ) denote a design point. We identified the candidate designs randomly (for low-level simulation) to be the combinations of r = 1,4, 8 and / = 10, 50,150MHz. The designer associates a VHDL implementation with each RModule. These VHDL implementations are parameterized based on the parameters supported by the associated RModule. Low-level simulation is performed at each of the chosen design points to estimate the power dissipation at that design point. We use random input vectors since there is no general purpose technique to predict exactly what data is available as input to a component. However, we have developed a technique based on statistical analysis to obtain reasonably accurate average estimate of power dissipation of a design [69]. We utilize confidence intervals about the sample mean energy dissipation for a design. 
Confidence intervals allow us to address dependency upon input stimuli because they describe the likelihood that the true mean over an entire population is within a certain range of the mean found from a sample out of the population. Equation 2 : ^ / 2 (s/x/M ) is employed to estimate the confidence interval for our simulations where x is the sample mean (the mean found by experiment), a; is a number between 0 and 1, Za/2 is a constant as explained in [64], s is the population standard deviation, and M is the number 53 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. of samples. We assume that the distribution from which the results come is not too badly skewed or discrete. For example, to statistically analyze energy dissipation for the matrix multipli cation in Section 3.4.2, we performed 50 different n x n matrix multiplication trials for our linear architecture. Each trial consists of performing the low-level simu lation procedure, as described above, with the uniformly distributed, randomly generated matrices as input. These power estimates and the design points are provided as inputs to the power function builder. For components with a single parameter, the power function are obtained using curve-fitting on sample simulation results. In case of more number of the parameters, surface fitting can be used. Currently, we only focus on building component power functions with at most two parameters. The resulting power functions are provided to the designer. We have used Microsoft Excel for power function estimation. Figure 3.3 (b) shows the graph based on sample simulation results of different design of the register based memory. The power function, based on the graph is i?.p(r, / ) 0.0142 ■ r • f + 0.0011. The component power function of an interconnect depends on its length, op erating frequency, and the switching activity. 
Unfortunately, estimating the interconnect length requires knowledge of the placement of the physical implementation of the components. In the Virtex-II device, various routing resources can be identified based on long, hex, double, and direct wires. After performing synthesis and simulation, we can obtain the number of different wires used from an XDL file, which is the text version of the placed-and-routed circuit description (.ncd file) [144][146]. The power function of an Interconnect component is I.p(L, w) = (1/2) · V² · f · sw · (C_l·l + C_h·h + C_db·db + C_dr·dr), where V is the voltage, f is the operating frequency, sw is the average switching activity, L is the length of the interconnect, w is the precision (or width), and C_l, C_h, C_db, C_dr and l, h, db, dr are the average capacitance and number of long, hex, double, and direct wires, respectively [120]. However, since the architectures we currently consider have neighboring connections, we use a simplified approach, Equation 3.1, to estimate the power dissipation in an interconnect. CL.p denotes the power dissipation of a cluster of k RModules connected through the candidate interconnect and M.p_i represents the power dissipation of the i-th RModule. The power dissipated by the cluster and the RModules is obtained by low-level simulation:

IC.p = CL.p - Σ_{i=1}^{k} M.p_i    (3.1)

The low-level simulation is performed as follows. The sample VHDL design is synthesized using XST (Xilinx Synthesis Technology) on Xilinx ISE 4.1i. The place-and-route file (.ncd file) is obtained for the target FPGA device using PAR. Mentor ModelSim 5.5e is used to simulate the module and generate simulation results (.vcd file). These two files are then provided to the Xilinx XPower tool to estimate the energy dissipation.
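Both interconnect-power estimates above translate directly into code. This is a sketch; the voltage, switching activity, wire counts, capacitances, and module powers in the example are placeholders, not values from the dissertation:

```python
def interconnect_power(V, f, sw, counts, caps):
    """Detailed model: I.p = (1/2) * V^2 * f * sw * sum(C_type * count_type).

    counts and caps are keyed by the Virtex-II routing resource types named
    in the text ('long', 'hex', 'double', 'direct'); the counts would come
    from the XDL dump of the placed-and-routed design.
    """
    return 0.5 * V ** 2 * f * sw * sum(caps[t] * counts[t] for t in counts)

def interconnect_power_simplified(cluster_power, module_powers):
    """Equation 3.1: interconnect power = cluster power - sum of RModule powers."""
    return cluster_power - sum(module_powers)

# Placeholder numbers, for illustration only:
wires = {"long": 2, "hex": 4, "double": 1, "direct": 3}
caps = {"long": 0.01, "hex": 0.005, "double": 0.004, "direct": 0.002}
p_detailed = interconnect_power(V=1.5, f=100.0, sw=0.5, counts=wires, caps=caps)
p_simple = interconnect_power_simplified(120.0, [50.0, 45.0])  # 25.0 mW
```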
The above simulation technique was also applied to the candidate designs to estimate power, which was multiplied with latency to obtain the measured energy estimates shown in Tables 3.3 and 3.4. While the initial effort to build the component power functions might be expensive, the benefits are noticeable when the same components are re-used in different designs within and (possibly) across domains.

3.2.3 Deriving System-Wide Energy Function

The CPS matrices capture the operating state of each component for every cycle and the power functions provide the power estimate for each state. Therefore, the total energy of the complete system is obtained by summing the energy dissipation of the individual components in each cycle. The system-wide energy function SE is obtained as:

SE = (1/f) · Σ_{i=1}^{k} Σ_{j=1}^{n_i} Σ_{t=1}^{T} C_i.p.ps, where ps = CPS(i, t, j)    (3.2)

C_i.p.ps is the power dissipated in the j-th component (j = 1...n_i) of type i during cycle t (t = 1...T) and f is the operating frequency. CPS(i, t, j) is the power state of the j-th component of the i-th type during the t-th cycle. Many state-of-the-art FPGAs feature multiple clock domains. However, we focus on the design of signal processing kernels, which typically perform an atomic task such as FFT, DCT, filters, etc. Therefore, we consider a single clock frequency for the complete kernel design. Since the system-wide energy function is derived using component specific power functions, the energy distribution among the various components (the fraction of the total energy dissipated by each component) can be obtained. This information is used to identify candidate components to be considered by the designer for energy optimization. Further details can be found in Chapters 4 through 6. Due to the high-level nature of the model, we can rapidly estimate the system-wide energy.
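Equation 3.2 is a triple summation over component types, instances, and cycles; a minimal sketch, with made-up states and power values:

```python
def system_energy(cps, power, f):
    """Equation 3.2: SE = (1/f) * sum_i sum_j sum_t C_i.p.ps, ps = CPS(i, t, j).

    cps[i][j][t] is the power state of the j-th component of type i during
    cycle t; power[i][ps] gives the power of a type-i component in state ps.
    """
    total = 0.0
    for i, instances in enumerate(cps):
        for states in instances:     # one state sequence per component instance
            for ps in states:        # one power state per cycle
                total += power[i][ps]
    return total / f

# Toy example: one component type, two instances, three cycles.
cps = [[["on", "off", "on"], ["on", "on", "off"]]]
power = [{"on": 10.0, "off": 2.0}]
energy = system_energy(cps, power, f=1.0)  # (10+2+10) + (10+10+2) = 44.0
```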
In the worst case, the complexity of energy estimation is O(T × Σ_{i=1}^{k} n_i) (see Equation 3.2), which corresponds to iterating over the elements of the CPS matrices and adding the energy dissipation of each component in each cycle. However, there is typically a repeating pattern of state changes for a component (for example, due to loop structures within the algorithms). Also, different components of the same type dissipate the same amount of energy during each cycle. Based on these observations, the time to compute the energy is better than the worst-case complexity of energy estimation stated above. Further, even though we compute the system-wide energy cycle by cycle, we do not analyze the activities at the level of individual gates. Typically, there are only a few distinct components within a domain that affect the energy dissipation of the designs in that domain. Indeed, for the illustrative examples considered in this chapter, the time for energy estimation does not depend on the problem size.

The time needed to perform high-level estimation (assuming the power functions are pre-computed) is on the order of minutes on a Pentium III Xeon running at 700 MHz, whereas the time needed for low-level simulation and power estimation was 3-24 hours per design on the same machine. For the domains discussed in this chapter, we typically need 4-8 low-level simulations (one for each design point) for each power function. Once all power functions are computed and the system-wide energy function is derived, they are applied to the complete design space. For Domain 2 (Section 3.4.2), approximately 30 low-level simulations were performed to define the domain-specific models. As these simulations are for a component rather than the complete design, each low-level simulation takes approximately 30 to 60 minutes.
The model is applicable to all designs of n × n matrix multiplication where 1 ≤ n ≤ 48 (we chose 20 designs). Therefore, our effort takes approximately 10-12 hours of simulation and computation, which is very small when compared with the approximately 2 weeks needed to simulate all 20 designs.

3.3 Design Methodology

The aim of the design methodology is to design energy efficient data paths specific to an application. To achieve this goal, our methodology presents a set of designs which provide trade-offs among energy, latency, and area. The designer explores these designs and identifies an appropriate design based on some selection criteria and performance metrics. Our design methodology is illustrated in Figure 3.4 and consists of 5 steps, as described below.

Figure 3.4: Design methodology based on the domain-specific modeling

1. Algorithm and Architecture Exploration: The kernel is analyzed for the computation and I/O requirements, which influence the selection of the architecture family. Further, because the target is an FPGA, for which the energy cost of interconnect is large, we identify architecture families such that the area for computation logic and storage is much higher than the area for interconnection. For a given kernel, there exist several architecture families [53][78][110], each having different characteristics in terms of I/O complexity, memory requirements, area, latency, length of interconnect, etc.
Further, the comparison of the cost of memory space and I/O is also considered while determining the size of memory. Based on the performance needs and the capabilities and limitations of the target FPGA chip, we identify a suitable architecture family. The architecture family and the corresponding algorithm for a particular kernel are referred to as a domain. As our approach is based on algorithm-level optimizations, this initial step is the most crucial one. Identification of an appropriate domain ensures that we begin with a latency efficient design (we improve energy performance without compromising latency), and that there are architecture parameters that can be varied to increase energy efficiency.

2. Domain-Specific Modeling: The details of the architecture family and the algorithm corresponding to the kernel are captured in a domain-specific model. This model captures the architecture details in terms of modules (storage and computation) and the connectivity among them, the parameters that affect power dissipation, valid ranges for each parameter, various performance constraints, and functions to evaluate power, latency, and area based on these parameters. Further details regarding domain-specific modeling are described in Section 3.2. Domain-specific modeling performs the first level of design space reduction by identifying the ranges of the parameters based on the architecture and algorithm constraints.

3. Coefficient Estimation: The effect of variation of different architecture parameters on energy is captured by an energy function associated with the domain-specific model (see Section 3.2.2). This function has three different aspects: power associated with a module, number of modules, and latency.
While the number of modules and the latency are derived from the algorithm and architecture details in Step-2, the power for each module is evaluated through low-level simulation of that module. During module-specific power estimation, the design of the module also includes the interconnects necessary to add the module to the rest of the design. This ensures that the energy dissipated by the interconnect is also included. As a variation, it is also possible to estimate power for a module as a function of some parameter associated with the module. This requires a number of simulations and curve-fitting based on the simulation results, but eliminates the need to perform coefficient estimation again if the same parameter changes during Step-4.

4. Design Space Exploration (DSE) and Optimization: This step performs a human-in-the-loop design space exploration and optimization based on the different functions associated with the model. Initially, the energy function is analyzed to identify the distribution of energy dissipation among the various types of modules. Modules with a higher percentage of energy dissipation are chosen as possible candidates for design modification. There are three alternatives during energy optimization: (a) power reduction for a module through alternative implementations, (b) reduction in the number of instances of a module through architecture modification, and (c) latency reduction through algorithm variation. Each module is evaluated for options (a) and (b), and the algorithm and architecture are considered together for option (c). Choosing (a) requires the user to perform Step-2 and Step-3 again. Choices (b) and (c) require the user to perform Step-1 and Step-2 again. The functions associated with the model allow the user to estimate the area, latency, and energy to evaluate each design.
This process of estimating performance through the use of functions is referred to as high-level estimation. After several iterations, once the designer identifies an energy efficient design, the functions associated with the different performance attributes are analyzed to identify the trade-offs among area, energy, and latency, and ultimately to identify an energy efficient design for the particular kernel with an area × time requirement similar to that of the base design. The DSE process is repeated for each domain to identify a set of candidate designs for individual kernels. Our methodology advocates identification of a set of designs for each kernel, each with different requirements of energy, latency, and area. The availability of a set of designs ensures flexibility during integration for complete system design. A high-level estimation tool integrates kernel-specific performance values to derive system-wide performance values in terms of energy, area, and latency to compare against system-wide performance constraints.

5. Low-level Synthesis & Simulations: This step identifies the energy efficient design. The set of design candidates chosen by DSE is implemented and simulated using low-level simulators such as RT-level or cycle-accurate simulators. Low-level simulations provide accurate latency and energy values, and these are used to identify the most efficient design.

3.4 Illustrative Examples of Domain-Specific Modeling and Design Methodology

To illustrate our domain-specific modeling methodology, we apply the techniques discussed in the previous section to define high-level models for three different domains implementing matrix multiplication, a kernel operation frequently used in a wide variety of signal processing algorithms [50].
For each domain, we identify the components and the component specific parameters, evaluate the power functions for each component, and finally derive a system-wide energy function. Three architecture families, a uniprocessor architecture, a homogeneous linear array architecture, and a heterogeneous linear pipelined architecture, are chosen to demonstrate our approach. We have chosen the Virtex-II (XC2V1500, speed grade -5) as our target device.

3.4.1 Domain 1: Uniprocessor Architecture

We define a uniprocessor (PE) implementing the "usual" block matrix multiplication (BMM) as the first domain. This domain uses a single multiplier and results in compact, energy efficient designs. There are two possible scenarios: on-chip design and off-chip design. If the matrix multiplication kernel is a stand-alone application, all matrix data are stored in an external memory outside the FPGA. We refer to such a design as an off-chip design. If the matrix multiplication kernel is one of many kernels in an application, it is desirable to have the matrix data reside (in a Block SRAM) on the device. We refer to such a design as an on-chip design. The Block SRAM is a dedicated on-chip memory in Virtex-II and is usually used for storing intermediate data between (kernel) computations. Figure 3.5 shows our target architectures. In the off-chip design, the PE has one MAC (multiplier and accumulator), a cache (local buffer) of size c, and I/O ports (see Figure 3.5 (a)). Each word of cache consists of three 8-bit registers. The data matrices are stored in an external memory. For n × n matrix multiplication, the computational complexity of the algorithm is O(n³) [65]. Block matrix multiplication is performed with block size √c × √c. The I/O complexity (amount of traffic between the PE and external memory) is
O(n³/√c). It can be observed that a large cache decreases the I/O traffic and as a result improves the energy dissipation in performing I/O.

Figure 3.5: Uniprocessor architecture: (a) off-chip design and (b) on-chip design

In the on-chip design, the PE has one MAC, a cache of size c, and three memory banks for storing the three matrices (see Figure 3.5 (b)). BMM is performed with block size √c × √c. The energy for I/O (outside the device) is not included, but the energy dissipated in the three memory banks is considered. The read/write access frequency of the memory banks depends on the traffic between the memory banks and the PE. It can be observed that as the cache size increases, the number of memory bank accesses decreases and as a result the energy dissipated in the memory banks is reduced.

3.4.1.1 Defining Components and Parameters

We identified four components: the MAC, the cache, and the memory banks as RModules, and the I/O as an Interconnect. The RModules have w-bit precision. We assumed the precision of the input data to be 8 bits and the precision of the intermediate and the output data to be 16 bits. Therefore, the cache size (c) is the only parameter that can be varied at design time. The component specific power functions for the MAC (MAC.p), cache (R.p), I/O (IO.p), and memory bank (MEM.p) were obtained through low-level simulation using the method described in Section 3.2. To implement the MAC in Virtex-II, there are two design choices: a CLB-based multiplier and a dedicated multiplier. A dedicated multiplier is a stand-alone ASIC-based multiplier. A CLB-based multiplier is built using CLBs, and it was observed that it dissipates more power than a dedicated multiplier.
Similarly, there are two design choices for implementing the cache using CLBs. If the cache size is small, the cache can be realized using CLBs configured as register modules. A larger cache can be realized using CLBs configured as SRAM [144]. However, an SRAM-based cache can only be configured as a multiple of 16 bytes. We noticed that for c > 6, the SRAM-based cache dissipates less power than a register-based cache of the same size. The caches for matrices A and B have 8-bit precision and the cache for matrix C has 16-bit precision. MAC.p and IO.p are constants. The power function for the register-based cache is R.p(c) = 2.12c (mW) for 8-bit precision. The power function for a 16-bit register-based cache is 2R.p(c). The power function for the SRAM-based cache is R.p(c) = 0.12⌈c/16⌉² + 4.52⌈c/16⌉ + 7.81 (mW). The memory bank is implemented using Block SRAM in Virtex-II. The power state of the Block SRAM can be controlled by gated clocking. However, there is not much difference (< 2%) in the power dissipation between the on and off states. Instead, the power dissipated in the Block SRAM depends mainly on the access frequency (f_m) of the memory bank. A Block SRAM can be configured as a multiple of 2 Kbytes with 8-bit precision. The power function for a 2 Kbyte Block SRAM is MEM.p(f_m) = 2.89f_m² + 25.79f_m + 0.29 (mW).

3.4.1.2 System-Wide Energy Function

We now consider the system-wide energy dissipated by the design. In both the on-chip and off-chip designs, the amount of computation performed by the MACs is the same and the MACs dissipate the same amount of energy. For the off-chip design, we do not consider the energy dissipated in the external memory.
The system-wide energy function (SE) for performing n × n matrix multiplication is:

SE(n, c) = (1/f) · (n³ · MAC.p + 4(n³ + n³/√c) · R.p(c) + 3(n³/√c) · IO.p)    (3.3)

Note that as c varies, we obtain a family of architectures, each implementing matrix multiplication using BMM with different block sizes. The operating frequency of our design was set to 166 MHz. Figure 3.6 shows how different values of c affect the system-wide energy dissipation and the energy distribution among the components of the design for 12 × 12 matrix multiplication. As c increases, the energy for performing I/O decreases but the energy dissipated in the cache increases. Initially, the system-wide energy decreases as c increases, but for large values of c the system-wide energy increases. For the on-chip design, the energy dissipated in the memory banks is considered instead of the energy dissipated in the I/O. The system-wide energy function is:

SE(n, c) = (1/f) · (n³ · MAC.p + 4(n³ + n³/√c) · R.p(c) + 3(n³/√c) · MEM.p)    (3.4)

Figure 3.6: System-wide energy dissipation and energy distribution for 12 × 12 matrix multiplication as a function of cache size: (a) off-chip design and (b) on-chip design

Note that as c increases, the traffic between the memory banks and the PE decreases and as a result the energy dissipated in the memory banks decreases.

3.4.1.3 Design Trade-offs and Performance Analysis

As the system-wide energy function is a well-behaved function with easily determinable minima, we were able to identify the most energy efficient designs from the trade-off graphs (see Figure 3.6). For both designs, the cache size c = 16 gives the minimum system-wide energy.
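The trade-off can be reproduced numerically from Equation 3.3 (as reconstructed here) together with the cache power functions of Section 3.4.1.1. MAC.p and IO.p are constants in the text but their values are not listed, so the numbers below are placeholders; the location of the minimum, however, is driven by the cache and I/O terms:

```python
import math

MAC_P, IO_P = 15.0, 25.0  # placeholder constant powers (mW); not from the text
F = 166e6                 # operating frequency used in the text (166 MHz)

def cache_power(c):
    """Register-based cache for small c, SRAM-based beyond the c > 6 crossover."""
    if c <= 6:
        return 2.12 * c
    b = math.ceil(c / 16)  # SRAM-based caches come in multiples of 16 bytes
    return 0.12 * b ** 2 + 4.52 * b + 7.81

def se_offchip(n, c):
    """Equation 3.3, off-chip design: system-wide energy (mJ)."""
    return (n ** 3 * MAC_P
            + 4 * (n ** 3 + n ** 3 / math.sqrt(c)) * cache_power(c)
            + 3 * (n ** 3 / math.sqrt(c)) * IO_P) / F

# Sweep perfect-square cache sizes for 12 x 12 matrix multiplication (Figure 3.6).
best_c = min((b * b for b in range(1, 7)), key=lambda c: se_offchip(12, c))
```

With these placeholder constants the sweep bottoms out at best_c = 16, matching the c = 16 minimum reported for both designs.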
Since I/O operations are expensive, using a larger cache helps to reduce the energy for the off-chip design. However, as the cache size increases, its energy dissipation becomes dominant and the system-wide energy increases. For the on-chip design, the energy for Block SRAM is not as significant as the energy for I/O. All designs use a single dedicated multiplier.

3.4.2 Domain 2: Linear Array Architecture

For the second domain, we consider a linear array of processing elements (PEs) as the candidate architecture (see Figure 3.7 (a)). Each PE has one multiplier and storage. We start with an algorithm for optimal latency on a linear array [110]. PE_j in Figure 3.7 (a) computes c_ij = Σ_{k=1}^{n} a_ik · b_kj for all i, 1 ≤ i ≤ n, where a_ik, b_kj, and c_ij represent elements of the n × n matrices A, B, and C. In iteration (i, k), 1 ≤ i, k ≤ n, c_ij = c_ij + a_ik × b_kj is computed in PE_j. Elements of matrices A and B are fed to the array via the two input ports of PE_1 in column-major and row-major order, respectively. It is critical to ensure that a_ik "meets" b_kj in a cycle in PE_j. For this, a_ik and b_kj pass through two delays and one delay, respectively, in each PE. The resulting architecture for each PE is shown in Figure 3.7 (b). a_ik enters input port AS and goes through two delays (AS.LR and AS.RR), while b_kj enters BS and goes through one delay. Details of the algorithm, its analysis, and proof of correctness can be found in [110]. Compared with the uniprocessor design in Section 3.4.1, we use more multipliers to reduce the latency. The above family of architectures offers several advantages compared to other architecture families. These architectures have a low I/O-bandwidth requirement and they scale as the problem size grows. To achieve the minimal I/O complexity (O(n²)), the total amount of storage across all the PEs should be n².
As shown in [110], this architecture can perform n × n matrix multiplication in O(n²) time using n⌈n/s⌉ PEs. For the sake of illustration, we consider the on-chip design for this domain.

3.4.2.1 Defining Components and Parameters

The structure of the linear array is shown in Figure 3.7 (a). It consists of three components: processing elements (PEs), buses connecting adjacent PEs, and memory banks. For the purpose of high-level modeling, we identified the PE and the memory bank as RModules, and the bus between two adjacent PEs as an Interconnect. The PE has a MAC of precision w and storage of size s (see Figure 3.7 (b)). The MAC is implemented using a dedicated multiplier. The PE has two power states, on and off. In the on state the multiplier is on, and thus the PE dissipates more power than in the off state, when the multiplier is off.

Figure 3.7: (a) Linear array architecture for matrix multiplication, (b) PE organization, and (c) corresponding algorithm

The power state of the multiplier is controlled by clock gating. The PE also includes 6 registers and 3 multiplexors of w bits. The key parameters affecting energy are the number of PEs (pe), the amount of storage within a PE (s), and the power states (ps).
We implemented the PE using a Virtex-II operating at f = 166 MHz and performed simulations to obtain the power functions for the PE and the bus. The power function for the PE is:

PE.p.ps = 7.01s + 31.04 mW (ps = on)
PE.p.ps = 7.01s + 14.04 mW (ps = off)    (3.5)

The interconnect power function is constant. It is estimated using Equation 3.1, since the interconnect between the PEs is localized in the design and is regular. We implemented two PEs and the interconnect, and measured the power dissipation while both PEs are in the on state. The power dissipated in the interconnect is IC.p = 39.74 mW. The power function for the memory bank is the same as in the uniprocessor architecture (see Section 3.4.1.1). We consider problems of size 1 ≤ n ≤ 16. For the sake of illustration, we fixed w at 8. The parameters and their ranges are shown in Table 3.1. Note that the parameters of interest are pe, ps, and s. The system-wide energy function is specified using these three parameters.

Table 3.1: Model parameters

Parameter    Values or ranges
s            1 ≤ s ≤ n
pe           1 ≤ pe ≤ n⌈n/s⌉
w            8
ps           on, off

3.4.2.2 System-Wide Energy Function

There are several constraints imposed by the algorithm, which are exploited to identify the component specific parameters and their ranges. The value of s determines the total number of PEs (pe). The latency (T) of this design using n⌈n/s⌉ PEs and s storage per PE [110] is T = n² + 2n⌈n/s⌉ - ⌈n/s⌉ + 1. We consider problems in the range 1 ≤ n ≤ 16. Precision (w) is set to 8. In each PE, the multiplier is on for T/⌈n/s⌉ cycles and off for T × (1 - 1/⌈n/s⌉) cycles. PE.p.ps refers to the power dissipation of the PE when its multiplier is in state ps (see Equation 3.5). Note that the I/O traffic between the PEs and the memory banks is O(n²).
The system-wide energy function is:

SE(n, s) = (1/f) · (n·T · PE.p,ps=on + T·(n⌈n/s⌉ - n) · PE.p,ps=off + T·(n⌈n/s⌉ - 1) · IC.p + 3n² · MEM.p)    (3.6)

3.4.2.3 Design Trade-offs and Performance Analysis

Figure 3.8 (a) shows the effect of varying the amount of storage (s) on the power dissipation of a PE. Figure 3.8 (b) shows the effect of varying the amount of storage (s) on the system-wide energy for three problem sizes (n = 4, 8, 16). Based on these plots, to obtain energy efficient designs we choose pe = s = n, where n is the problem size.

Figure 3.8: (a) Power dissipation for a single PE and (b) the system-wide energy as a function of the amount of storage (s) for n = 4, 8, 16

Table 3.2 shows the energy, latency, and area of the designs for various problem sizes. We compared the performance of our design with a design for 3 × 3 matrix multiplication provided by Xilinx [144]. Since the Xilinx library does not provide an on-chip design, we added Block SRAMs to the Xilinx design. All Xilinx designs execute at 150 MHz. For n > 3, we used block matrix multiplication based on the 3 × 3 design. The improvements in energy dissipation and latency of our designs compared with the Xilinx designs are also shown. On the average, our designs dissipate 32% less energy than the Xilinx designs. The latency improvement varies from 5.8× to 17.3×. However, our designs occupy more area.
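Equations 3.5 and 3.6 can be evaluated directly to reproduce the pe = s = n choice. MEM.p is a placeholder constant below (in the text it depends on the memory bank access frequency), so the absolute energies are illustrative:

```python
import math

F = 166e6     # operating frequency (166 MHz)
IC_P = 39.74  # interconnect power from Section 3.4.2.1 (mW)
MEM_P = 30.0  # placeholder memory-bank power (mW); not from the text

def pe_power(s, on):
    """Equation 3.5: 7.01*s + 31.04 mW when the multiplier is on, else + 14.04."""
    return 7.01 * s + (31.04 if on else 14.04)

def se_linear(n, s):
    """Equation 3.6: system-wide energy (mJ) of the linear array design."""
    k = math.ceil(n / s)
    T = n * n + 2 * n * k - k + 1       # latency in cycles [110]
    pe = n * k                          # number of PEs
    return (n * T * pe_power(s, True)            # on-state PE cycles (n*T total)
            + (pe - n) * T * pe_power(s, False)  # remaining PEs, multiplier off
            + (pe - 1) * T * IC_P                # pe - 1 interconnect segments
            + 3 * n * n * MEM_P) / F             # O(n^2) memory-bank traffic
```

For n = 8, se_linear(8, 8) < se_linear(8, 2) < se_linear(8, 1), consistent with Figure 3.8 (b): more storage per PE means fewer PEs, fewer interconnect segments, and a shorter latency.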
Table 3.2: Comparison of our designs on the linear array architecture with the Xilinx design

Matrix size   Xilinx library design     Linear array design      Performance improvement
n × n         T       A     E           T      A      E          E       T
6 × 6         1.71    251   292.40      0.30   1074   213.2      37%     5.8×
9 × 9         5.76    251   986.86      0.30   1935   715.5      38%     9.6×
15 × 15       26.67   251   4568.80     1.54   4305   3812.5     20%     17.3×

T is the latency (μs). A is the number of slices. E is the energy dissipation (nJ).

The energy dissipation of the designs discussed in this section is based on high-level estimation using the system-wide energy function for the domain. In order to validate these energy estimations, we performed the following experiment. For a particular design, we used the corresponding system-wide energy function to estimate the total energy dissipation. We compared this result with a complete VHDL simulation of the design using the Xilinx tools described in Section 3.2.2. In the simulations, the same input data used to obtain the component specific power functions were used. As noted earlier, the average switching activity was observed to be 50%. We performed this experiment for various problem sizes using the designs in Section 3.4.2.2. Table 3.3 also shows the error percentage of our high-level estimation method when compared with energy estimation values obtained through low-level simulation. The error percentages are at most 9.0%.

Table 3.3: Accuracy of the high-level energy estimation of our designs

Matrix size (n × n)      3 × 3   6 × 6   8 × 8   9 × 9   12 × 12   16 × 16
Estimated energy (nJ)    34.0    213.2   497.8   715.5   1801.2    4759.7
Measured energy (nJ)     37.4    228.6   536.9   768.4   1913.6    5078.6
Error                    9.0%    6.7%    7.3%    6.9%    5.9%      6.3%

3.4.3 Domain 3: Block Matrix Multiplication on Linear Array Architecture

The third domain targets large size (n > 12) matrix multiplications.
It consists of block matrix multiplication (BMM) and the linear array architecture presented in Domain 2. The BMM algorithm for N × N matrices repeatedly uses the design (hardware) for sub-matrix multiplication of size n × n, where N is a multiple of n. In this domain, we have considered the off-chip design.

3.4.3.1 Defining Components, Parameters, and System-Wide Energy Function

All components defined in Domain 2 are also applicable to this domain. An additional parameter is the block size (n). For a block size of n, we chose the designs with pe = s = n and ps = on to implement n × n matrix multiplication. Based on our performance trade-off analysis of Domain 2 (Figure 3.8), these designs are the most efficient ones in terms of latency and energy dissipation for n × n matrix multiplication. Since the N × N matrix is divided into n × n sub-matrices, (N/n)³ block matrix multiplications are performed. Therefore, the latency is T = (N/n)³ × (n² + 2n)/f, and the system-wide energy function is:

SE(N, n) = n·T · PE.p,ps=on + (n - 1)·T · IC.p + 3T · IO.p    (3.7)

3.4.3.2 Design Trade-offs and Performance Analysis

We vary the block size (n) to evaluate various trade-offs. Table 3.4 shows the area, latency, and energy of 24 × 24 and 48 × 48 matrix multiplication using various block sizes. Results show that the matrix multiplication for N = 24 dissipates the least energy when n = 12. To verify our result, we simulated all the designs that are within 10% of the optimal design in terms of energy dissipation. Through low-level simulation (Table 3.4), the design with n = 12 is verified as the most energy efficient design for 24 × 24 matrix multiplication. For 48 × 48 matrix multiplication, the design using 16 × 16 block matrix multiplication is the most energy efficient design.
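The latency and energy expressions for this domain can be sketched as follows. IO.p is a placeholder constant, PE.p,ps=on reuses Equation 3.5 with s = n, and the cycle counts land close to, but slightly below, the latencies in Table 3.4:

```python
def bmm_cycles(N, n):
    """Latency of N x N BMM built from n x n blocks: (N/n)^3 * (n^2 + 2n) cycles."""
    return (N // n) ** 3 * (n * n + 2 * n)

def se_bmm(N, n, io_p=25.0, ic_p=39.74, f=166e6):
    """Equation 3.7 (energy in mJ); io_p is a placeholder, ic_p from Section 3.4.2.1."""
    pe_p_on = 7.01 * n + 31.04    # Equation 3.5 with s = n, multiplier on
    T = bmm_cycles(N, n) / f      # latency in seconds
    return n * T * pe_p_on + (n - 1) * T * ic_p + 3 * T * io_p

# 48 x 48 with 16 x 16 blocks: 27 block multiplications of 288 cycles each.
cycles = bmm_cycles(48, 16)  # 7776 cycles, vs. 7803 in Table 3.4
```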
Note that our estimates based on the system-wide energy function are within 10% of the estimates obtained using low-level simulations.

Table 3.4: Performance comparison and accuracy of various designs in Domain 3

Matrix size | Block size | Estimated T | A     | E      | Measured T | A     | E
24 x 24     | 8 x 8      | 2,187       | 1,048 | 5,271  | 2,187      | 1,101 | 5,491
24 x 24     | 12 x 12    | 1,352       | 1,572 | 4,757  | 1,352      | 1,667 | 4,983
48 x 48     | 8 x 8      | 17,496      | 1,048 | 42,164 | 17,496     | 1,101 | 43,929
48 x 48     | 12 x 12    | 10,816      | 1,572 | 38,053 | 10,816     | 1,667 | 39,867
48 x 48     | 16 x 16    | 7,803       | 2,096 | 36,100 | 7,803      | 2,186 | 37,679

* T is the latency (cycles). A is the area (slices).
* E is the energy dissipation (nJ).

Chapter 4

Energy and Time Efficient Matrix Multiplication Using FPGA

Matrix multiplication is a frequently used kernel operation in a wide variety of graphics, image processing, robotics, and signal processing applications. Several signal and image processing operations can be reduced to matrix multiplication. Most of the previous work on matrix multiplication on FPGAs focuses on latency optimization [7]. However, since mobile devices typically operate under various computational requirements and in energy constrained environments, energy is a key performance metric in addition to latency and throughput [14]. Hence, in this chapter, we develop designs that minimize the energy dissipation. Our designs offer trade-offs between energy, area, and latency for performing matrix multiplication on commercially available FPGA devices. Recent efforts by FPGA vendors have resulted in rapid increases in the density of FPGA devices. Hence, we also develop
a design that attempts to further minimize the energy dissipation and latency in exchange for an increase in area, to utilize higher density FPGAs. Our effort is focused on algorithmic techniques to improve energy performance, instead of low-level (gate-level) optimizations. We evaluate various alternative designs at the algorithmic level (with accompanying architectural modifications) on their energy performance. For this purpose, we construct an appropriate energy model based on the technique described in Chapter 3 to represent the impact of changes in the algorithm on the system-wide energy dissipation, area, and latency. The modeling starts by identifying parameters whose values change depending on the algorithm and have significant impact on the system-wide energy dissipation. These parameters depend on the algorithm and the architecture used, and on the device features of the target FPGAs. In the case of matrix multiplication, we achieve a high level of accuracy along with simple representations for the functions. The energy, area, and latency functions provide us with a high-level picture and pointers on where to look for possible savings in system-wide energy, area, and latency. These functions allow us to make trade-offs in the early design phase to meet the constraints. Using the energy function, algorithmic and architectural level optimizations are made. Extensive low-level simulations using Xilinx ISE 4.1i and ModelSim 5.5e, for XC2V1500 and XC2V3000 as example target FPGA devices, are then performed. Xilinx XPower is used on the simulation data to verify the accuracy of the energy and area estimated by the functions. Our experiments show that the estimates of energy dissipation and area are within 3.8% to 7.8% of the actual values obtained from low-level simulations.
Our optimized algorithm and architecture (Corollary 1 in Section 4.2) save 29% to 51% of the system-wide energy dissipation for matrices of sizes 3 x 3 to 48 x 48, when compared with the design from the state-of-the-art Xilinx library [144]. Latency is reduced by a factor of 3 to 15, while area is increased by a factor of 1.9 to 9.4. To pursue the possibility of further reduction in system-wide energy dissipation and latency in exchange for an increase in area, we also develop an algorithm and architecture (Theorem 2 in Section 4.2) with an increased number of MACs. Low-level simulations show further reduction in the system-wide energy dissipation and latency. For example, for matrices of size 12 x 12, the system-wide energy dissipation is reduced by an additional 40%, resulting in a 69% reduction when compared with the design from the Xilinx library [144]. The latency and area reduce and increase by factors of 23 and 11.8, respectively.

The rest of the chapter is organized as follows. Section 4.1 summarizes the related work in the literature. Algorithms and architectures for energy efficient implementation are presented in Section 4.2. An energy model specific to our implementation is described in Section 4.3. It includes extracting key parameters from our algorithm and architecture to build a domain-specific energy model and deriving functions to represent system-wide energy dissipation, area, and latency. Section 4.3.3 shows the optimization procedure for our algorithms and architectures in an illustrative way. Analysis of the trade-offs between system-wide energy, area, and latency is also provided. Section 4.4 provides implementation details and describes the simulation method along with its statistical representativeness.
Section 4.5 analyzes the performance of our algorithms and architectures through various known metrics in addition to the system-wide energy dissipation.

4.1 Related Work

To the best of our knowledge, there has been no previous work targeted at energy efficient implementation of matrix multiplication on FPGAs.

Mencer et al. [88] implemented matrix multiplication on the Xilinx XC4000E FPGA device. Their design employs bit-serial MACs using Booth encoding. They focused on trade-offs between area and maximum running frequency with parameterized circuit generators. For the specific example of 4 x 4 matrix multiplication, 954 CLBs are used to achieve a maximum running frequency of 33 MHz.

Amira et al. [7] improved the design in [88] using the Xilinx XCV1000E FPGA device. Their design uses modified Booth-encoder multiplication along with Wallace tree addition. The emphasis was once again on maximizing the running frequency. For the specific example of 4 x 4 matrix multiplication, 296 CLBs are used to achieve a maximum running frequency of 60 MHz. Area/speed, or equivalently the number of CLBs divided by the maximum running frequency, was used as a performance metric.

Even though our designs mainly target the trade-offs among energy dissipation, area, and latency along with algorithmic level energy optimization, they also improve on the designs in [88] and [7] in terms of the area/speed metric. The area/speed metrics for the designs in [88], [7], and for our design are 14.45, 4.93, and 2.35, respectively. For fair comparison, translation of the number of CLBs for different FPGA devices is performed on the basis of the equivalent amount of logic. For example, the 140 CLBs of the Xilinx XC2V1500 used in our design of 4 x 4 matrix multiplication to achieve a running frequency of 166 MHz can be translated into 280 CLBs of the Xilinx XCV1000E FPGA device used in [7].
Prasanna and Tsai [110] achieved the theoretical lower bound in latency for matrix multiplication with a linear systolic design. They provide trade-offs between the number of registers and the latency. Their work focused on reducing the leading coefficient of the time complexity. Our work focuses on minimizing energy dissipation under constraints for area and latency. We significantly reduce the number of registers involved in the movement of intermediate results and elements of the input matrices: in [110], the number of 8-bit registers involved in the data movement for n x n matrix multiplication depends on a parameter s, 1 ≤ s ≤ n. In our design, only 2n registers of 8-bit words are involved in the systolic data movement (based on Theorem 1). An extra 2n registers of 8-bit words are required to store copies and are not involved in the systolic data movement. The design of [110] has never been implemented on FPGAs.

The most appropriate reference design with which the performance of our designs should be compared comes from Xilinx [144]. The state-of-the-art design from the Xilinx library performs matrix multiplications of limited sizes (3 x 3). Xilinx XPower [144] can be used to measure the power dissipation of designs implemented on Xilinx FPGA devices. For fair comparison, we use the same design environment, the same target device, and the same power measurement tool. Details of the simulations can be found in Section 4.5. Xilinx provides just a point design optimized at the gate level. Our work constructs a design space spanned by possible design choices in our algorithm.

4.2 Energy Efficient Algorithms and Architectures for Matrix Multiplication

For performance comparison purposes, we have implemented the best-known systolic design [110] on FPGA devices.
The energy distribution profile of the design reveals that much of the total energy is dissipated in the registers (see Figure 4.1). For example, 78% of the energy is used in the registers for 12 x 12 matrix multiplication.

By identifying the energy hot spot, we propose new energy efficient algorithms and architectures for matrix multiplication. We present our algorithms and architectures in two theorems and two corollaries. Pseudo-code for cycle-specific data movement, the detailed architectures, and a snapshot of an example computation are also shown.

[Figure 4.1: Energy distribution of the design proposed in [110]; the registers account for 63%, 73%, and 78% of the energy for problem sizes n = 6, 9, and 12, respectively.]

Theorem 1 improves the best-known algorithm for matrix multiplication [110] in terms of the number of registers used in the designs. Our design has optimal time complexity with a leading coefficient of 1 for matrix multiplication on a linear array. Theorem 1 is extended to Corollary 1 for trade-offs among energy dissipation, area, and latency. Corollary 1 is used to identify energy efficient designs under latency and area constraints. The second algorithm is developed to exploit further increases in the density of FPGA devices to realize improvements in energy dissipation and latency (Theorem 2). It uses more MACs and I/O ports. Corollary 1 and Theorem 2 are integrated into Corollary 2. Corollary 2 provides more comprehensive trade-offs among energy dissipation, area, and latency than Corollary 1.

Based on the location of the input and output matrices, we have two design scenarios: off-chip design and on-chip design (see Figure 4.2 (a) and (b)). In the off-chip design, we assume that the input matrices are stored outside the FPGA. I/O ports are used for data access.
While we assume that the input matrices are stored in an external memory outside the FPGA, we do not include the energy used by the external memory. In the on-chip design, however, we store all input and output matrices in an on-chip memory of the FPGA device. The on-chip memory refers to an embedded memory in FPGAs; for example, a Block SelectRAM in the Virtex-II devices can be used as the on-chip memory. Thus, the energy used by the on-chip memory is included in the on-chip design.

Theorem 1: n x n matrix multiplication can be performed in n^2 + 2n cycles using 3 I/O ports and n PEs (processing elements), each having one MAC (multiplier-and-accumulator), 4 registers, and 2 local memories of n words (Figure 4.2 (a) and (b) show a linear array connecting the PEs and Figure 4.2 (c) shows a PE).

Proof 1: The algorithm in Figure 4.2 (d) and the architecture in Figure 4.2 (a), (b), and (c) are devised to compute c_ij = Σ_{k=1..n} a_ik x b_kj for all i, j. a_ik, b_kj, and c_ij represent elements of the n x n matrices A, B, and C. PE_j denotes the j-th PE from the left in Figure 4.2 (a), j = 1, 2, ..., n. PE_j computes column j of matrix C, c_1j, c_2j, ..., c_nj, which is stored in the local memory Cbuf. In phase k, column k of matrix A (a_ik, 1 ≤ i ≤ n) and row k of matrix B (b_kj, 1 ≤ j ≤ n) traverse PE_1, PE_2, PE_3, ..., PE_n in order and allow PE_j to update c'_ij = c'_ij + a_ik x b_kj,
[Figure 4.2: (a) off-chip design, (b) on-chip design, (c) architecture of PE_j used in Theorem 1, and (d) the algorithm used in Theorem 1. The algorithm of Figure 4.2 (d) is:]

For t = 1 to n do
    For all j, 1 ≤ j ≤ n, do in parallel
        PE_j shifts the data in BU right to PE_{j+1}
        If (BU = b_1j), copy it into BM
For t = n + 1 to n^2 + n do
    For all j, 1 ≤ j ≤ n, do in parallel
        PE_j shifts the data in A, BU right to PE_{j+1}
        If (BU = b_kj), copy it into BL or BM (alternately)
        If (A = a_ik), c'_ij = c'_ij + a_ik x b_kj
            (b_kj is in either BM or BL; c'_ij is in Cbuf)
For t = n^2 + 1 to 2n^2 do
    For all j, 1 ≤ j ≤ n, do in parallel
        PE_j stores the input from PE_{j+1} in CObuf
For t = n^2 + 1 to n^2 + n do
    PE_j outputs c'_ij to PE_{j-1}
For t = n^2 + n + 1 to 2n^2 - n do
    PE_j outputs the data in CObuf to PE_{j-1}

[Figure 4.3: Snapshot of the data flow for 3 x 3 matrix multiplication (Theorem 1).]

where c'_ij represents the intermediate value of c_ij. Once b_kj arrives at PE_j, a copy of b_kj resides in PE_j until a_1k, a_2k, a_3k, ..., a_nk pass through PE_j. We observe that the following two essential requirements should be satisfied: 1) Since a_ik stays at each PE_j for just one cycle, b_kj should arrive at PE_j no later than a_ik, for any i, 1 ≤ i ≤ n. 2) Once b_kj arrives at PE_j, a copy of b_kj should reside in PE_j until a_nk arrives. We show how these two essential requirements for our systolic implementation are satisfied with a minimal number of registers. In addition, we evaluate the number of cycles required to finish the operation and the amount of local memory per PE. An illustrative snapshot for n = 3 is provided for more clarity.
1) b_kj should arrive at PE_j no later than a_ik, for any i, 1 ≤ i ≤ n: Matrix B is fed to the lower I/O port of PE_1 (see Figure 4.2 (c)) in row major order (b_11, b_12, b_13, ..., b_1n, b_21, b_22, ...). Matrix A is fed to the upper I/O port of PE_1 in column major order (a_11, a_21, a_31, ..., a_n1, a_12, a_22, ...), n cycles behind matrix B. For example, a_11 is fed to the upper I/O port of PE_1 in the same cycle as b_21 is fed to the lower I/O port of PE_1. The number of cycles required for b_kj to arrive at PE_j is (k - 1)n + 2j - 1. a_ik requires n + (k - 1)n + i + j - 1 cycles to arrive at PE_j. The requirement is satisfied since (k - 1)n + 2j - 1 ≤ n + (k - 1)n + i + j - 1 for all i and j. For example, we show how b_2n (the last element of matrix B in phase 2) arrives at PE_n no later than a_12 (the first element of matrix A in phase 2) for c'_1n = c'_1n + a_12 x b_2n: b_2n needs 3n - 1 cycles, while a_12 needs 3n cycles.

2) Once b_kj arrives at PE_j, a copy of b_kj should reside in PE_j until a_nk arrives: We show how to minimize the number of registers that store copies of b_kj (k = 1, 2, ..., n) in PE_j, for each j. We prove that two registers (denoted BM and BL in Figure 4.2 (c)) are sufficient to hold b_kj at PE_j (to store two consecutive elements, b_(k+1)j and b_kj). For example, when b_34 arrives at PE_4, b_14 is in BL and b_24 is in BM. If we can prove that a_n1 has arrived at PE_4, b_34 can replace b_14 in BL. Note that b_14 is no longer needed in PE_4 after c'_n4 = c'_n4 + a_n1 x b_14 is performed using a_n1. In general, b_kj is needed until a_nk arrives at PE_j, in the {n + (k - 1)n + n + j - 1}-th cycle. b_(k+2)j arrives at PE_j in the {(k + 1)n + 2j - 1}-th cycle. Since (k + 1)n + 2j - 1 ≥ n + (k - 1)n + n + j - 1 for all j, k, and n, b_kj can be replaced when b_(k+2)j arrives at PE_j.
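The two cycle-count facts above can be checked mechanically for small n. The sketch below encodes the arrival and last-use formulas from the proof; the function names are ours, not the dissertation's.

```python
# Check, for small n, the two timing facts in the proof of Theorem 1:
# (i) b_kj reaches PE_j no later than any a_ik, and (ii) b_(k+2)j arrives
# only after b_kj has been used for the last time (by a_nk).

def b_arrival(k, j, n):
    return (k - 1) * n + 2 * j - 1            # cycle at which b_kj reaches PE_j

def a_arrival(i, k, j, n):
    return n + (k - 1) * n + i + j - 1        # cycle at which a_ik reaches PE_j

def b_last_use(k, j, n):
    return a_arrival(n, k, j, n)              # a_nk's arrival is the last use

def check(n):
    for k in range(1, n + 1):
        for j in range(1, n + 1):
            assert all(b_arrival(k, j, n) <= a_arrival(i, k, j, n)
                       for i in range(1, n + 1))
            if k + 2 <= n:                    # replacement in BM/BL is safe
                assert b_arrival(k + 2, j, n) >= b_last_use(k, j, n) + 1
    return True

assert all(check(n) for n in range(2, 10))
```

For n = 3, the text's worked example falls out directly: b_arrival(2, 3, 3) is 3n - 1 = 8 while a_arrival(1, 2, 3, 3) is 3n = 9.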
Also, the time difference between the arrival of b_(k+2)j and the last use of b_kj is {(k + 1)n + 2j - 1} - {n + (k - 1)n + n + j - 1} = j, which has a minimum value of 1. Hence, in the worst case, b_(k+2)j arrives 1 cycle after b_kj is no longer required, which means that b_(k+2)j can replace b_kj. This also shows that b_(k+2)j cannot arrive while b_(k+1)j is still in use, since b_(k+2)j barely arrives after b_kj is no longer required. This proves that PE_j needs at least two temporary registers, BM and BL, to hold b_kj (k = 1, 2, ..., n).

3) n^2 + 2n cycles are needed to complete the matrix multiplication. The computation finishes one cycle after a_nn arrives at PE_n, which is the {n + (n - 1)n + n + n - 1}-th or {n^2 + 2n - 1}-th cycle.

4) A local memory (Cbuf) of n words is required to store intermediate and final values for the column of matrix C being computed in each PE. Another local memory (CObuf) of n words is required to buffer the final values from PE_{j+1} to PE_j. Starting from the (n^2 + 1)-th cycle, results c'_ij are generated from all PEs during the next 2n - 1 cycles. CObuf is necessary in each PE; otherwise the final c'_ij from one matrix multiplication would be overwritten by the intermediate c'_ij of the next matrix. PE_j stores the output from PE_{j+1}, arriving via port Cin, in CObuf. For the first n cycles, the final c'_ij at PE_{j+1} (for example, c_12 at cycle 11) is input to PE_j. For the next n^2 - n cycles, the stored c'_ij of PE_{j+1} in CObuf is input to PE_j. The outputs from PE_1 are the resulting matrix C.

5) A snapshot of the execution of the algorithm is provided for n = 3 in Figure 4.3. It shows the contents of the registers A, BU, BM, and BL of each PE during each cycle of the matrix multiplication process. For example, a_31, b_23, b_11, and b_21 stay in PE_1 during cycle 6.
a_31 and b_11 (in the dark circles) are used to update c'_31 = c'_31 + a_31 x b_11. Note that b_11 is no longer needed after this update and hence can be replaced by b_31, which arrives in cycle 7. Elements of matrix A stay in register A for one clock cycle and pass through, while elements of matrix B are prefetched into the registers BU, BM, and BL of each PE in the linear array and stay until they are no longer needed.

Corollary 1: n x n matrix multiplication can be performed in (rn^2 + 2r^2 n) cycles using 3 I/O ports and n/r PEs, each having 1 MAC, 2 local memories of n/r words, and 4 registers, where n is divisible by r.

Proof 2: n x n matrix multiplication can be decomposed into r^3 matrix multiplications of size (n/r) x (n/r), assuming n is divisible by r. Using Theorem 1 with n replaced by n/r, the proof follows.

Corollary 1 provides trade-offs between area and latency. Larger values of r reduce the number of PEs, which results in less area, but increase the number of cycles needed to complete the matrix multiplication. Combined with the power and area estimates of the modules, Corollary 1 provides trade-offs among energy dissipation, area, and latency.

Theorem 2: n x n matrix multiplication can be performed in (n^2/r + 2n/r) cycles using 3r I/O ports and n/r PEs, each having r^2 MACs, 2r^2 local memories of n/r words, and 4r registers (Figure 4.4 shows a PE for r = 2), where n is divisible by r.

Proof 3: The n x n matrices A, B, and C are divided into r^2 submatrices, each of size (n/r) x (n/r), assuming n is divisible by r. Let A_xy, B_xy, C_xy, 1 ≤ x, y ≤ r, denote the submatrices. Then, we have C_xy = Σ_{k=1..r} A_xk x B_ky, 1 ≤ x, y ≤ r. The basic idea is to perform C_xy = Σ_{k=1..r} A_xk x B_ky in parallel for all x, y, 1 ≤ x, y ≤ r, using r^2 MACs per PE.
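The block identity C_xy = Σ_k A_xk x B_ky underlying this proof can be exercised directly. The sketch below is a plain-Python check (helper names are ours) that the blockwise sums reproduce the full product.

```python
# Check the submatrix identity C_xy = sum_k A_xk * B_ky on a random
# n x n example with r = 2 (pure Python, no external libraries).
import random

def matmul(A, B):
    ni = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(ni))
             for j in range(len(B[0]))] for i in range(len(A))]

def block(M, x, y, s):
    """Submatrix M_xy of size s x s (1-based block indices)."""
    return [row[(y - 1) * s:y * s] for row in M[(x - 1) * s:x * s]]

def madd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

random.seed(0)
n, r = 6, 2
s = n // r
A = [[random.randint(0, 9) for _ in range(n)] for _ in range(n)]
B = [[random.randint(0, 9) for _ in range(n)] for _ in range(n)]
C = matmul(A, B)
for x in range(1, r + 1):
    for y in range(1, r + 1):
        Cxy = [[0] * s for _ in range(s)]
        for k in range(1, r + 1):
            Cxy = madd(Cxy, matmul(block(A, x, k, s), block(B, k, y, s)))
        assert Cxy == block(C, x, y, s)
```

Each inner accumulation step corresponds to one stage k of the architecture, in which MAC_xy performs C'_xy = C'_xy + A_xk x B_ky.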
A_xk x B_ky, for each x, y, 1 ≤ x, y ≤ r, can be performed using 3 I/O ports and n/r PEs, each having one MAC, 4 registers, and two local

[Figure 4.4: Architecture of PE_j for Theorem 2 (r = 2): registers A1, A2, BU1/BM1/BL1, BU2/BM2/BL2, four MACs with Cbuf memories, four CObuf memories, and ports C1, C2 to the neighboring PEs.]

For t = 1 to n/2 do
    For all j do in parallel
        PE_j shifts the words in BU1 and BU2 to the right (to PE_{j+1})
        If (BU1 = b11_1j), copy it into BM1
        If (BU2 = b12_1j), copy it into BM2
For t = n/2 + 1 to (n/2)^2 + n/2 do
    For all j do in parallel
        PE_j shifts the words in A1, A2, BU1, BU2 to the right (to PE_{j+1})
        If (BU1 = b11_kj), copy it into BM1 after moving the word in BM1 into BL1
        If (BU2 = b12_kj), copy it into BM2 after moving the word in BM2 into BL2
        If (A1 = a11_ik), Cbuf_11 = Cbuf_11 + a11_ik x b11_kj; Cbuf_12 = Cbuf_12 + a11_ik x b12_kj
        If (A2 = a21_ik), Cbuf_21 = Cbuf_21 + a21_ik x b11_kj; Cbuf_22 = Cbuf_22 + a21_ik x b12_kj
        (b11_kj is in either BM1 or BL1; b12_kj is in either BM2 or BL2)
For t = (n/2)^2 + n/2 + 1 to 2(n/2)^2 + n do
    For all j do in parallel
        PE_j shifts the words in A1, A2, BU1, BU2 to the right (to PE_{j+1})
        If (BU1 = b21_kj), copy it into BM1 after moving the word in BM1 into BL1
        If (BU2 = b22_kj), copy it into BM2 after moving the word in BM2 into BL2
        If (A1 = a12_ik), Cbuf_11 = Cbuf_11 + a12_ik x b21_kj; Cbuf_12 = Cbuf_12 + a12_ik x b22_kj
        If (A2 = a22_ik), Cbuf_21 = Cbuf_21 + a22_ik x b21_kj; Cbuf_22 = Cbuf_22 + a22_ik x b22_kj
        (b21_kj is in either BM1 or BL1; b22_kj is in either BM2 or BL2)
For t = n^2/2 + 1 to n^2/2 + n/2 do
    For all j do in parallel
        PE_j outputs c11' (from MAC_11) to C1_out and c12' (from MAC_12) to C2_out
        PE_j stores c21' (from MAC_21) in CObuf_21 and c22' (from MAC_22) in CObuf_22
For t = n^2/2 + (2m - 1)n/2 + 1 to n^2/2 + (2m)n/2, where m = 1 to n/2, do
    PE_j outputs CObuf_21 to C1_out and CObuf_22 to C2_out
    PE_j stores input C1_in in CObuf_11 and input C2_in in CObuf_12
For t = n^2/2 + (2m)n/2 + 1 to n^2/2 + (2m + 1)n/2, where m = 1 to n/2 - 1, do
    PE_j outputs CObuf_11 to C1_out and CObuf_12 to C2_out
    PE_j stores input C1_in in CObuf_21 and input C2_in in CObuf_22

* Submatrices A_11 and A_21 (or A_12 and A_22) enter I/O ports A1_in and A2_in of PE_1 in column major order, n/2 cycles behind matrix B.
* Submatrices B_11 and B_12 (or B_21 and B_22) enter I/O ports B1_in and B2_in of PE_1 in row major order.
* a_xy,ik is the element in the i-th row and k-th column of submatrix A_xy; b_xy,kj is the element in the k-th row and j-th column of submatrix B_xy.

Figure 4.5: Algorithm for Theorem 2

memories of n/r words per PE, as stated in Theorem 1. Since the computations for all the submatrices of matrix C need to be performed in parallel in each PE, we need to duplicate the resources of a PE by a factor of r^2, which is the number of submatrices. This would require 3r^2 I/O ports and each PE to have r^2 MACs, 4r^2 registers, and 2r^2 local memories of n/r words. We show how the number of registers and I/O ports can be reduced to 4r registers per PE and 3r I/O ports. PE_j denotes the j-th PE from the left in Figure 4.4, j = 1, 2, ..., n/r. PE_j computes column j of all submatrices C_xy, 1 ≤ x, y ≤ r. An MAC is used to update column j of each submatrix C_xy, 1 ≤ x, y ≤ r, requiring a total of r^2 MACs per PE.
1) For each pair of x and y, 1 ≤ x, y ≤ r, we show how C_xy = Σ_{k=1..r} A_xk x B_ky can be performed in (n^2/r + 2n/r) cycles using n/r PEs, with one MAC and 4 registers per PE, and 3 I/O ports. C'_xy represents the intermediate result for C_xy. Using Theorem 1, we can perform C'_xy = C'_xy + A_xk x B_ky for any specific combination of x, k, and y, 1 ≤ x, k, y ≤ r, in (n/r)^2 + 2(n/r) cycles using the aforementioned PEs. For each pair of x and y, 1 ≤ x, y ≤ r, C_xy = Σ_{k=1..r} A_xk x B_ky is obtained by performing C'_xy = C'_xy + A_xk x B_ky in a serial manner with k increasing from 1 to r. A preliminary analysis reveals that this would take ((n/r)^2 + 2(n/r)) x r = (n^2/r + 2n) cycles. However, a close look at the data movement in the proof of Theorem 1 reveals that the input of the last column of submatrix A_xk to the array can be overlapped with the input of the first row of submatrix B_(k+1)y, for k = 1, 2, ..., r - 1. Using this overlapping, C'_xy = C'_xy + A_xk x B_ky for k = 1, 2, 3, ..., r can be performed in a pipelined fashion, taking (n/r)^2 cycles for each k. At the start, n/r cycles are needed to prefetch the first row of submatrix B_1y. At the end, after the last column of submatrix A_xr is input, n/r additional cycles are needed to move it through the array of n/r PEs to complete the updates for C_xy. This leads to an execution time of n/r + r(n/r)^2 + n/r = n^2/r + 2n/r cycles, instead of (n^2/r + 2n) cycles.

2) We show how C_xy = Σ_{k=1..r} A_xk x B_ky can be performed in parallel for all pairs of x and y, 1 ≤ x, y ≤ r, in (n^2/r + 2n/r) cycles using n/r PEs with r^2 MACs, 2r^2 local memories of n/r words, and 4r registers per PE, and 3r I/O ports. C'_xy is the intermediate result for C_xy. In stage k, 1 ≤ k ≤ r, C'_xy = C'_xy + A_xk x B_ky is performed in parallel for all 1 ≤ x, y ≤ r. I/O ports IOA_1, IOA_2, ..., IOA_r are used to feed submatrices A_1k, A_2k, ..., A_rk simultaneously to the array.
I/O ports IOB_1, IOB_2, ..., IOB_r are used to feed submatrices B_k1, B_k2, ..., B_kr simultaneously to the array. An MAC, MAC_xy, is used to perform C'_xy = C'_xy + A_xk x B_ky for each pair of x and y, 1 ≤ x, y ≤ r. Note that elements of A_xk can be shared among the r MACs MAC_xy, 1 ≤ y ≤ r, in each PE, while elements of B_ky can be shared among the r MACs MAC_xy, 1 ≤ x ≤ r, in each PE. For example, in any stage k, 1 ≤ k ≤ r, all r MACs MAC_xy, 1 ≤ y ≤ r, perform C'_xy = C'_xy + A_xk x B_ky in parallel and need A_xk. This sharing allows us to reduce the number of registers required per PE from 4r^2 to 4r and the number of I/O ports from 3r^2 to 3r. A direct application of Theorem 1 without sharing would use 4 registers per PE and 3 I/O ports to feed each MAC, requiring 4r^2 registers per PE and 3r^2 I/O ports to the array. Column j of submatrix C'_xy is updated by MAC_xy and stored in a local memory Cbuf_xy of size n/r, for each x, y, 1 ≤ x, y ≤ r, in PE_j.

3) One set of local memories (Cbufs) of n/r words in PE_j is used to store the intermediate results for column j of submatrix C_xy, for any x, y, 1 ≤ x, y ≤ r. Another set of local memories (CObufs) of n/r words is required to buffer the final results from PE_{j+1} to PE_j. Thus, a total of 2r^2 local memories of n/r words per PE are required. Starting from the (n^2/r + 1)-th cycle, results c'_ij are generated from all PEs during the next n + n/r - 1 cycles. CObuf is necessary in each PE; otherwise the final c'_ij from one matrix multiplication would be overwritten by the intermediate c'_ij of the next matrix. PE_j stores either the output from PE_{j+1}, arriving via ports C1_in, C2_in, ..., Cr_in, or c'_ij from MAC_xy of PE_j in the CObufs. For the first n/r cycles, the final c'_ij at PE_j is output to PE_{j-1} via ports C1_out, C2_out, ..., Cr_out. For the next cycles, the stored c'_ij of PE_j in CObuf is output to PE_{j-1}.
The outputs from PE_1 are the resulting matrix C. In the following cycles, PE_j stores the final c'_ij of PE_{j+1}, arriving via ports C1_in, C2_in, ..., Cr_in, in the CObufs.

4) Figure 4.4 shows our architecture for r = 2 and Figure 4.5 shows the accompanying algorithm. Let A_xy, B_xy, and C_xy, 1 ≤ x, y ≤ 2, denote the submatrices, each of size (n/2) x (n/2). In the first stage, C'_xy = A_x1 x B_1y is performed in parallel for all 1 ≤ x, y ≤ 2 by feeding the elements of A_11, A_21, B_11, and B_12 through the 4 input ports. In the second stage, C'_xy = C'_xy + A_x2 x B_2y is performed in parallel for all 1 ≤ x, y ≤ 2. Each stage takes (n/2)^2 + n cycles from Theorem 1. Since overlapping is possible between the end of the first stage and the start of the second stage, the total number of cycles for both stages combined is n^2/2 + n, as explained before. For larger values of r, there is greater parallelism within the PEs and hence the execution time is greatly reduced.

It must be noted that the number of MACs is a key parameter that determines the whole architecture. Based on the number of MACs, Theorem 2 and Corollary 1 can be combined into Corollary 2.

Corollary 2: n x n matrix multiplication can be performed in max(n^3/m^3, n/m) x min(n^2 + 2n, m^2 + 2m) cycles using m MACs, 2m local memories of min(m, n^2/m) words, 4 min(n, m) registers, and 3 max(m/n, 1) I/O ports, where 1 ≤ m ≤ n^2 and n is divisible by m (for m ≤ n) or m is divisible by n (for m ≥ n).

Proof 4: For 1 ≤ m ≤ n, the proof follows from Corollary 1 by setting r = n/m. For n ≤ m ≤ n^2, the proof follows from Theorem 2 by setting r = m/n.

Smaller values of m reduce the number of modules such as MACs, registers, and I/O ports used in the design, resulting in less area, but the latency is seen to increase. Combined with the latency of a design and the area and power dissipation of the modules, Corollary 2 provides trade-offs among energy dissipation, area, and latency for 1 ≤ m ≤ n^2.
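The Corollary 2 trade-off can be tabulated directly from the two regimes. In this sketch (helper names are ours), the piecewise latency from Corollary 1 and Theorem 2 is checked against the unified closed form in the corollary statement.

```python
from fractions import Fraction

def latency_piecewise(n, m):
    """Cycles with m MACs: Corollary 1 (r = n/m) for m <= n,
    Theorem 2 (r = m/n) for m >= n."""
    if m <= n:
        r = n // m
        return r * n * n + 2 * r * r * n      # rn^2 + 2r^2 n
    r = m // n
    return n * n // r + 2 * n // r            # n^2/r + 2n/r

def latency_closed_form(n, m):
    """max(n^3/m^3, n/m) * min(n^2 + 2n, m^2 + 2m)."""
    lead = max(Fraction(n ** 3, m ** 3), Fraction(n, m))
    return int(lead * min(n * n + 2 * n, m * m + 2 * m))

def resources(n, m):
    """Totals implied by Corollary 2 (registers and I/O ports)."""
    return {"MACs": m,
            "registers": 4 * min(n, m),
            "io_ports": 3 * max(m // n, 1)}

# Trade-off sweep for n = 8: more MACs, fewer cycles.
for m in (1, 2, 4, 8, 16, 32, 64):
    assert latency_piecewise(8, m) == latency_closed_form(8, m)
```

Sweeping m from 1 to n^2 in this way spans the designs of Corollary 1 (m ≤ n) and Theorem 2 (m ≥ n) as the two ends of one trade-off curve.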
Corollary 2 provides a more comprehensive set of trade-offs than Corollary 1 or Theorem 2, since the number of MACs used varies within a wide range for a given problem size n. Note that Corollary 1 and Theorem 2 provide trade-offs among energy dissipation, area, and latency for 1 ≤ m ≤ n and n ≤ m ≤ n^2, respectively, and hence can be viewed as subsets of Corollary 2. A more detailed analysis of all designs with respect to energy, area, and latency is presented in Section 4.3.3.

4.3 Performance Modeling and Optimization

Given the goal of algorithmic level optimization of energy performance for matrix multiplication on FPGA devices, we need an energy model to represent the impact of individual algorithmic level choices on the energy performance. Based on this model, we make the design trade-offs to obtain energy efficient designs. The candidate designs are implemented in Section 4.4.1.

4.3.1 Domain-Specific Energy Model

Our approach to performance modeling is to use the domain-specific energy model proposed in Chapter 3 [30][36]. The model is applicable only to the design domain spanned by the family of algorithms and architectures being evaluated. The family represents a set of algorithm-architecture pairs that exhibit a common structure and similar data movement. The domain is a set of point designs resulting from unique combinations of algorithm and architecture level changes. The domain-specific energy model abstracts the energy dissipation to suit the design domain. The abstraction is independent of the commonly used levels such as the gate, register, or system level. Rather, it is based on knowledge about the family of algorithms and architectures.
The parameters are extracted considering their expected impact on the total energy performance. For example, if the number of MACs and the number of registers change values in a domain and are expected to be frequently accessed, a domain-specific energy model is built using them as key parameters. The parameters may include elements at the gate, register, or system level as needed by the domain. It is a knowledge-based model which exploits the knowledge of the designer about the algorithm and the architecture. We also use this knowledge to derive functions that represent energy dissipation, area, and latency. Beyond simple complexity analysis, we make the functions as accurate as possible by incorporating implementation and target device details. For example, if the number of MACs is a key parameter, we implement a sample MAC on the target FPGA device to estimate its average power dissipation. Random input vectors, as many as are needed for the desired confidence interval [64], are generated for simulation. A power function representing the power dissipation as a function of m, the number of MACs, is generated. This power function is obtained for each module related to the key parameters. Based on the designer's optimization goal and the time available for design, a balance needs to be struck between accuracy and simple representation of the functions. The estimation error of the functions derived in this chapter ranges from 3.3% to 7.4%. Since the model
The family of architectures and algorithms in Figure 4.2, Figure 4.4, and Figure 4.5, and the parameters in Table 4.1, represent the design space. We build two domains, for Corollary 1 and Theorem 2. Two parameters, n and r, are used. In Corollary 1, n denotes the size of the input matrices, and r is introduced for block multiplication using submatrices of size n/r × n/r. In Theorem 2, r determines the number of I/O ports (3r), the number of MACs (r²), and the submatrices of size n/r × n/r. Due to the nature of our algorithms, the number of each key module depends only on these two parameters. We identify registers of 8-bit and 16-bit words, MACs, SRAMs (distributed SelectRAM in the Xilinx devices), and BSRAMs (Block SelectRAM in the Virtex-II devices) [144] as key modules. Choosing specific values for the parameters in Table 4.1 results in a design point in the design space. For example, n = 24, p = 6, reg = 4, m = 1, sram = 2, Kb = 2, and Kio = 0 represents a design where 24 × 24 matrix multiplication is implemented using 6 PEs with 4 registers, one MAC, and two SRAMs per PE. The input and output matrices are stored in two (⌈2 × 24 × 24/1024⌉ = 2) BSRAMs on the device, and no I/O ports are used.

Table 4.1: Range of parameters for Xilinx XC2V1500

Parameter                                 | Range                              | FPGA constraints
Problem size (n)                          | 2, 3, 4, ...                       |
No. of PEs (p)                            | n/l (n divisible by l, l integer)  |
No. of registers/PE (reg)                 | b^k (0 ≤ k ≤ log_b n)              | 8/16-bit registers
No. of MACs/PE (m)                        | b^k                                | 2-stage pipeline, embedded
No. of SRAMs/PE (sram)                    | ⌈n·b^k/16⌉                         | 16 words minimum
No. of BSRAMs (Kb) (on-chip design)       | ⌈2n²/1024⌉                         | 1024 16-bit words minimum
No. of I/O ports (Kio) (off-chip design)  | 3p                                 | 8/16 bits

An energy model specific to the domain is constructed at the module level by assuming that each module of a given type (register, multiplier, SRAM, BSRAM, or I/O port) dissipates the same power independent of its location on the chip. This model simplifies the derivation of system-wide energy dissipation functions. The energy dissipation of each module can be determined by counting the number of cycles the module stays in each power state, combined with low-level estimation of the power used by the module in that power state, assuming average switching activity. Additional details of the model can be found in Chapter 3.

Table 4.2 lists the key parameters and the number of each key module, in terms of the two parameters, for each domain. In addition, it shows the latencies, which also depend on the parameters. By choosing specific values for the parameters in Table 4.2, a different point design is realized in the design space. For example, the point design n = 16 and r = 4 represents a design where 16 × 16 matrix multiplication is implemented using 4 PEs with 4 registers, one MAC, and one SRAM per PE.

Table 4.2: Number of modules used and the latency of various designs

Domain                 | Corollary 1                     | Theorem 2                       | Corollary 2
Key parameters (range) | n, r (r ≤ n, n divisible by r)  | n, r (r ≤ n, n divisible by r)  | n, m (1 ≤ m ≤ n², n² divisible by m)
No. of PEs             | n/r                             | n/r                             | min(m, n²/m)
No. of registers/PE    | 4                               | 4r                              | max(4, 4m/n)
No. of MACs/PE         | 1                               | r²                              | max(1, m²/n²)
No. of SRAMs/PE        | 2                               | 2r²                             | max(2, 2m²/n²)
No. of BSRAMs          | ⌈2n²/1024⌉                      | ⌈2n²/1024⌉                      | ⌈2n²/1024⌉
No. of I/O ports       | 3                               | 3r                              | max(3, 3m/n)
Latency (cycles)       | r³[(n/r)² + n/r]                | n²/r + 2n/r                     | as in Corollary 1 (m ≤ n) or Theorem 2 (m ≥ n)

4.3.2 Functions to Estimate Energy, Area, and Latency

Functions that represent the energy dissipation, area, and latency are derived for Corollary 1 and Theorem 2.
The energy function of a design is approximated as Σᵢ tᵢPᵢ, where tᵢ and Pᵢ represent the number of active cycles and the average power of module i. For example, P_mult denotes the average power dissipation of the multiplier module. The average power is obtained from low-level power simulation of the module. The area function is given by Σᵢ Aᵢ, where Aᵢ represents the area used by module i. In general, these simplified energy and area functions may not capture all the implementation details needed for accurate estimation. However, we are concerned with algorithmic-level comparisons rather than accurate estimation. Moreover, our architectures are simple and have regular interconnections, so the error between these functions and the actual values based on low-level simulation is expected to be small. In Section 4.4.2, we evaluate the accuracy of the energy and area functions. The latency functions are obtained easily, because the theorems and corollaries already give the latency in clock cycles for the different designs.

Table 4.2 shows the number of modules used by the designs for n × n matrix multiplication with 8-bit input precision and 16-bit output precision. For the off-chip design, I/O ports are used to fetch elements from outside the FPGA. In the on-chip design, BSRAMs of 1024 16-bit words are used for on-chip storage of the input matrices. SRAMs are CLB-based memory blocks used for storing intermediate results. The power and area values of each module are shown in Table 4.4. For example, P_SRAM is the average power used by an SRAM (16-bit words), where x is the number of entries. In the actual implementation of an SRAM, the number of entries must be a multiple of 16. P_offset denotes the remaining power dissipation of a PE (after the modules have been accounted for), and takes care of glue logic and control logic.
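The module-level accounting E = Σ tᵢPᵢ and A = Σ Aᵢ can be sketched as follows. This is a hedged illustration: the per-module power and area constants are placeholders rather than the measured values of Table 4.4, and it assumes the Corollary 1 module mix (n/r PEs, each with one MAC, one adder, two SRAMs, and four registers of each width, plus three I/O ports off-chip) with a latency of r³((n/r)² + n/r) cycles.

```python
# Sketch of the module-level accounting E = sum(t_i * P_i), A = sum(A_i)
# for the Corollary 1 domain. Power (mW) and area (slice) constants are
# placeholders, not the measured values of Table 4.4.

MW = {"mult": 12.5, "add": 2.1, "sram": 4.2, "reg8": 2.1, "reg16": 4.2,
      "io": 10.0, "offset": 3.0}
SLICES = {"mult": 16, "add": 10, "sram": 35, "reg8": 4, "reg16": 8,
          "offset": 30}
CLOCK_MHZ = 150.0

def corollary1_design(n, r):
    """Estimate latency (cycles), area (slices) and energy (nJ) of the
    off-chip n x n design with block size n/r (n divisible by r)."""
    assert n % r == 0
    b = n // r                                  # block size = number of PEs
    counts = {"mult": b, "add": b, "sram": 2 * b, "reg8": 4 * b,
              "reg16": 4 * b, "io": 3, "offset": b}
    latency = r ** 3 * (b ** 2 + b)             # cycles
    power_mw = sum(counts[k] * MW[k] for k in counts)
    area = sum(counts[k] * SLICES[k] for k in SLICES)
    energy_nj = power_mw * latency / CLOCK_MHZ  # mW * us = nJ
    return {"latency": latency, "area": area, "energy_nj": energy_nj}

def best_block_size(n, max_slices):
    """Divisor r minimizing estimated energy under an area budget."""
    feasible = [r for r in range(1, n + 1) if n % r == 0
                and corollary1_design(n, r)["area"] <= max_slices]
    return min(feasible, key=lambda r: corollary1_design(n, r)["energy_nj"])
```

A sweep of this kind is exactly the rapid design-space exploration the model enables: evaluating every feasible block size takes milliseconds, compared with hours for synthesis and low-level simulation of each candidate.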
Similar numbers representing the area of each module are also obtained. A_offset denotes the area of a PE that accounts for glue logic and control logic. The latencies are obtained in seconds by dividing the cycle counts by the clock frequency. Using Table 4.2, functions that represent energy, area, and latency for Corollary 1 and Theorem 2 are shown in Table 4.3. Functions for other designs can be obtained in the same way. An average switching activity of 50% for the input data to each module, at a running frequency of 150 MHz, is assumed. The multiply operation is performed using the dedicated embedded multipliers available in the Virtex-II device.

Note that throughput is important, since many applications of matrix multiplication process a stream of data. Our design in Corollary 1 is a pipelined architecture, with the first n/r cycles of the computations on the next set of data overlapped with the last n/r cycles of the computations on the current set of data. Thus, for a stream of matrices, an n/r × n/r submatrix can be processed every (n/r)² cycles, and the effective latency becomes r³(n/r)², which is the time between the arrivals of the first and last output data of the current computation. Hence, the design in Corollary 1 is a throughput-oriented design, since one output is available every clock cycle for a stream of matrices. The design in Theorem 2 is also throughput-oriented, since r output data items are available every clock cycle. Its effective latency becomes n²/r.

4.3.3 Trade-offs among Energy, Area, and Latency

Table 4.3: Energy and time performance models

Corollary 1
Metric                     | Performance model
Latency (cycles)           | L_Cor1 = r³[(n/r)² + n/r]
Effective latency (cycles) | L̂_Cor1 = r³(n/r)²
Energy (on-chip)           | E_Cor1 = L̂_Cor1 {(n/r)(P_mult + P_add + 2P_SRAM + 4P_R8 + 4P_R16) + ⌈2n²/1024⌉P_BSRAM + (n/r)P_offset}
Energy (off-chip)          | E_Cor1 = L̂_Cor1 {(n/r)(P_mult + P_add + 2P_SRAM + 4P_R8 + 4P_R16) + 2P_I + P_O + (n/r)P_offset}
Area (on-chip)             | A_Cor1 = (n/r)(A_mult + A_add + 2A_SRAM + 4A_R8 + 4A_R16 + A_offset), plus ⌈2n²/1024⌉ BSRAMs
Area (off-chip)            | A_Cor1 = (n/r)(A_mult + A_add + 2A_SRAM + 4A_R8 + 4A_R16 + A_offset), plus two 8-bit input ports and one 16-bit output port

Theorem 2
Metric                     | Performance model
Latency (cycles)           | L_Thm2 = n²/r + 2n/r
Effective latency (cycles) | L̂_Thm2 = n²/r
Energy (on-chip)           | E_Thm2 = L̂_Thm2 {nr(P_mult + P_add + 2P_SRAM) + n(4P_R8 + 4P_R16) + ⌈2n²/1024⌉P_BSRAM + (n/r)P_offset}
Energy (off-chip)          | E_Thm2 = L̂_Thm2 {nr(P_mult + P_add + 2P_SRAM) + n(4P_R8 + 4P_R16) + 2rP_I + rP_O + (n/r)P_offset}
Area (on-chip)             | A_Thm2 = nr(A_mult + A_add + 2A_SRAM) + n(4A_R8 + 4A_R16) + (n/r)A_offset, plus ⌈2n²/1024⌉ BSRAMs
Area (off-chip)            | A_Thm2 = nr(A_mult + A_add + 2A_SRAM) + n(4A_R8 + 4A_R16) + (n/r)A_offset, plus 2r 8-bit input ports and r 16-bit output ports

The functions in Table 4.3 are used to identify trade-offs among energy, area, and latency. For example, Figure 4.6 illustrates the trade-offs among energy, area, and latency for 48 × 48 matrix multiplication for the off-chip and on-chip designs of Corollary 1. It can be used to choose energy efficient designs that meet given area and latency constraints. For example, if 800 slices are available and the latency should be less than 6,000 cycles (36 μs), an energy efficient design is obtained using n/r = 4. The energy dissipation, area, and latency of such a design, evaluated using the functions in Table 4.3, are 6.85 μJ, 524 slices, and 5400 cycles (32.4 μs), respectively. Figure 4.6 shows that as the block size (n/r) increases, the area increases and the latency decreases, because the degree of parallelism increases. The energy dissipation decreases until n/r = 15 or 16, but starts increasing afterwards. The reason for this behavior is as follows.
The energy used by the local storages, CBuf and CObuf, is 2n²(0.126(n/r) + 2.18) and is hence proportional to O(n³/r). The energy used by the rest of the modules, except I/O, is proportional to O(n³). The energy for I/O is proportional to O(rn²). As n/r increases (r decreases), the energy used by I/O decreases relatively faster, and thus the total energy decreases. However, for n/r > 16, the energy used by the local storage becomes the dominant factor. This helps us identify the optimal block size for energy efficient matrix multiplication.

Trade-off analysis for the on-chip model shows similar behavior. The on-chip design uses BSRAMs instead of I/O ports. Since the energy used in the I/O ports is more than the energy used in the BSRAMs, the energy used in the on-chip design is less than that in the off-chip design. However, the choice between the off-chip and the on-chip design depends on the situation: whether the matrix multiplication is stand-alone or part of an application (e.g. an application consisting of multiple kernels).

Figure 4.6: Energy, area, and latency trade-offs of Corollary 1 as a function of the block size (n/r): (a) off-chip design and (b) on-chip design for n = 48

Table 4.4: Power and area functions for various modules

Module                          | Power function (mW)       | Area function (slices)
Block multiplier (8 x 8 bit)    | P_mult = 12.50            | A_mult = 16*
Adder (8 bit)                   | P_add = 2.11              | —
SRAM (16-bit words, x entries)  | P_SRAM = 0.126x + 2.18    | A_SRAM = 18.44⌈x/16⌉ + 16.40
BSRAM (16 bit, 1024 entries)    | P_BSRAM = 16.37           | A_BSRAM = 16*
Register (8 bit)                | P_R8 = 2.12               | A_R8 = 4
Register (16 bit)               | P_R16 = 2P_R8             | A_R16 = 8
Output port (16 bit)            | P_O = 10                  | —
Input port (8 bit)              | P_I = 10                  | —
* A Block multiplier or a BSRAM uses area equivalent to 16 slices.

Theorem 2 provides asymptotic improvement in energy and latency performance in the on-chip model. As shown in Table 4.2, asymptotically, the energy dissipated in the BSRAMs and the latency of the Xilinx reference design increase as O(n⁵) and O(n³), respectively, assuming a unit of energy is used per cycle to retain a word in a BSRAM. The energy dissipation and latency of the designs based on Theorem 1 and [110] increase as O(n⁴) and O(n²), respectively, under the same assumptions. Theorem 2 improves these complexities to O(n⁴/r) and O(n²/r), respectively, where n/r is the block size for block multiplication and n is divisible by r with r > 1. Further increases in the density of FPGAs can be used to increase the number of multipliers (nr), leading to asymptotic reduction in energy dissipation and latency.

Figure 4.7: Energy, area, and latency trade-offs of Theorem 2 as a function of r: (a) off-chip design and (b) on-chip design for n = 48
Figure 4.7 shows the trade-offs among energy, area, and latency for Theorem 2. As the value of r increases (or the block size n/r decreases), the area increases and the latency decreases. However, the energy dissipation decreases continuously. Thus, unlike the designs based on Corollary 1, the designs based on Theorem 2 reach the point of minimal energy dissipation when the block size is smallest. Note that the local storage consists of registers, CBufs, and CObufs. The energy used by the registers in the designs based on Theorem 2 is O(n³/r), while the energy used by the registers in the designs based on Corollary 1 is O(n³) for the same problem size. Thus, the energy used by the registers in the designs based on Theorem 2 decreases as r increases, while that in the designs based on Corollary 1 is constant. The same analysis applies to the energy complexity of the BSRAMs.

4.3.4 Other Optimization Techniques for Energy Efficiency

To optimize the energy performance of our design, we employ several energy efficient design techniques from Chapter 2 [33]. One such technique is architecture selection. FPGAs give the designer the freedom to map almost any architecture onto hardware. Different architectures have varying energy performances, latencies, throughputs, etc. In our design, we have chosen a linear array of processing elements. In FPGAs, long interconnects dissipate a significant amount of power [119]. Therefore, for energy efficient designs, it is beneficial to minimize the number of long interconnects. A linear array of PEs accomplishes this goal: each PE communicates only with its nearest neighbors, minimizing the use of long wires. Additionally, the linear array architecture facilitates the use of two more techniques: parallel processing and pipelining.
Both parallel processing and pipelining decrease the effective latency of a design. Parallel processing does so by increasing the amount of resources, while pipelining does so by increasing resource utilization. By decreasing the effective latency, both techniques can lead to lower energy dissipation. However, these techniques can also increase the power dissipation, which can have a negative effect on the energy dissipation. The designer must reach a compromise between low latency and high power in order to achieve a low energy design. Another technique that we employ is choosing appropriate bindings. In an FPGA, there can be many possible mappings of the computation and storage elements to the actual hardware. For example, in the Virtex-II, the storage CBuf can be implemented as registers, a distributed SelectRAM, or a Block SelectRAM. Each of these types of storage dissipates a different amount of energy and can lead to implementations with wide variation in energy dissipation. When the number of entries is greater than 64, a Block SelectRAM is used, since it is energy efficient as a large memory; otherwise, a distributed SelectRAM is used. Similar decisions can be made for other elements of the design, such as choosing between (embedded) Block multipliers and configured (CLB-based) multipliers. In our design, we choose Block multipliers, since they are energy efficient when both inputs are not constant.

4.4 Design Synthesis and Simulation

Based on the high-level performance estimation, the chosen designs are implemented and simulated to obtain accurate results. Our target device is the Virtex-II, a high-performance platform FPGA from Xilinx [144]. We have chosen the XC2V1500 and XC2V3000 models for comparison; their speed grade is -5. These models have 48 and 96 18 x 18-bit Block multipliers, respectively.
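The binding choices described in Section 4.3.4 can be captured in a small helper. The rules follow the text (more than 64 entries → Block SelectRAM; embedded Block multipliers when neither input is constant); the function names, and the CLB-based fallback for a constant operand, are our own assumptions.

```python
# Sketch of the binding heuristics from Section 4.3.4. The 64-entry
# threshold and the preference for embedded multipliers with variable
# inputs follow the text; names and the constant-input fallback are ours.

def choose_storage(num_entries):
    """Bind a buffer to BSRAM (large) or distributed SelectRAM (small)."""
    return "block_selectram" if num_entries > 64 else "distributed_selectram"

def choose_multiplier(input_a_constant, input_b_constant):
    """Embedded Block multipliers are energy efficient when both inputs
    vary; with a constant operand, CLB-based logic is a plausible choice."""
    if input_a_constant or input_b_constant:
        return "clb_multiplier"
    return "embedded_block_multiplier"
```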
4.4.1 Implementation Details

Based on the observations in Section 4.3.3, we implemented the designs in VHDL in the Xilinx ISE 4.1i environment. All parameters are specified using "generic" variables in VHDL syntax. By changing the values of the generic variables, different numbers and types of modules are instantiated (see Table 4.2) and, eventually, the whole architecture is synthesized accordingly. Note that the design for Theorem 1 has one parameter, n, while Corollary 1 and Theorem 2 have two parameters, n and r. These are the only parameters necessary for design synthesis. For the off-chip design, all matrix data are fed via I/O ports. Thus the total energy includes the energy used for I/O. For the on-chip design, the data is first stored in the BSRAMs and fed into the first PE from a BSRAM. However, unlike the off-chip design, the energy used to store the data in the BSRAMs via I/O ports is not included. The whole design, as well as the design of each PE, is pipelined. Figure 4.8 (a) shows the pipeline stages of each PE and Figure 4.8 (b) shows the pipelining in each PE. The CBuf is implemented using a dual-port SRAM. Note that there is a data feedback path, which is required for the accumulation of intermediate values of matrix C. To accumulate the intermediate values, one operand comes from the MAC and the other from CBuf. If n > 2, the memory read of the intermediate value occurs after the value is written to CBuf. However, if n < 2 for Theorem 1 or n/r < 2 for Theorem 2, a data hazard occurs. Since two clock cycles are required between the memory write and the read, the distance between read-after-writes (RAWs) has to be at least 2 clock cycles to prevent the data hazard.

Besides the data paths, the control logic of all designs is also parameterized based on n and r. We chose a mixture of distributed and centralized control.
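The read-after-write constraint described in the pipelining discussion above can be sketched as a small check; the function names are ours, and the condition simply restates the two-cycle distance requirement.

```python
# Sketch of the RAW constraint above: at least two clock cycles must
# separate a CBuf write from the dependent read, so a block size smaller
# than that distance produces a data hazard. Names are ours.

RAW_DISTANCE = 2  # cycles required between memory write and dependent read

def has_data_hazard(block_size):
    """True when successive accumulations of the same C element are closer
    together than the required RAW distance."""
    return block_size < RAW_DISTANCE

def min_safe_block_size():
    return RAW_DISTANCE
```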
Each PE of the off-chip design for Theorem 1 and Theorem 2 has 6 control signals from the control logic (i.e. centralized control). The centralized control signals are generated by the control logic (outside the PEs) and are fed to the first PE. Note that only the first PE receives the control signals; the remaining PEs use them in a pipelined manner (see Figure 4.2 (a) and (b)). All signals are then passed to the next PE with one or two clock delays.

Figure 4.8: (a) Pipelining stages (Reg: register loading for A and B; Mux: selecting B data; Mult: multiplication; Acc: accumulation; MW: memory write; MR: memory read) and (b) data hazard

The addresses for CBuf and CObuf are generated inside each PE in a distributed manner. The first control signal, CtRegLoad, is used to load the input data into BM or BL. It is asserted every n/r cycles (see Figure 4.3). CtMuxToMult determines whether BM or BL is multiplied with A. This signal is identical to CtRegLoad except during the first n/r cycles. CtMultCe and CtRamWe are the enable signals for the multiplier and the SRAM. Both are asserted when the computation starts. CtFlush is asserted after one set of computation for an n/r × n/r matrix multiplication is completed and a new set of computation starts; CBuf needs to be flushed before the next result is stored. It is asserted at (n/r)² + n/r + 3, is on for n/r cycles, and is off for (n/r)² − n/r cycles. In the data path, using this signal, we accumulate the intermediate values of matrix C either with the previous intermediate values or with 0 (flushing effect). CtCoutMux is used to determine whether the results come from the accumulator or from CObuf. It triggers the pulling of data from the current CObuf to the previous PE.
In fact, the current PE sends the data of its CObuf to the previous PE. It is asserted at (n/r)² + 4, is on for n/r cycles, and is off for (n/r)² − n/r cycles. For the on-chip design, the control logic includes additional signals, such as address generation for the on-chip memory. In addition, the energy used by the control logic is measured separately; it accounts for about 10% of the energy of the designs.

4.4.2 Simulation Method

Using the high-level model defined in Section 4.3.3, several designs can be obtained by optimizing for latency, area, or energy. In this chapter, we attempt to arrive at minimal energy designs. The candidate designs are implemented in VHDL. These designs are synthesized using XST (Xilinx Synthesis Technology) in Xilinx ISE 4.1i. The place-and-route file (.ncd file) is obtained for the Virtex-II XC2V1500 and XC2V3000 (speed grade -5). The input test vectors for the simulation are randomly generated such that their average switching activity is 50%. ModelSim 5.6b, from Mentor Graphics, is used to simulate the designs and generate the simulation results (.vcd file). These two files are then provided to the Xilinx XPower tool to evaluate the average power dissipation. The energy dissipation is then obtained by multiplying the average power by the effective latency.

The estimates from Section 4.3.3 are also compared against the actual values based on the synthesized designs, to test the accuracy of the performance estimation functions. We observe that the estimation error of our functions (see Table 4.5) ranges from 3.3% to 7.4% for energy dissipation and from 3.3% to 4.1% for area.
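The error range quoted above can be recomputed directly from the estimated and measured energies reported in Table 4.5, taking each error relative to the measured value:

```python
# Recomputing the energy estimation errors from the Table 4.5 values (nJ);
# errors are taken relative to the measured (low-level simulation) values.

estimated_energy = [16.7, 114.7, 365.5, 840.8, 1612.1]
measured_energy = [17.3, 110.0, 340.2, 795.2, 1509.2]

def pct_errors(estimated, measured):
    """Relative error (%) of each estimate against the measured value."""
    return [abs(e - m) / m * 100.0 for e, m in zip(estimated, measured)]

energy_errors = pct_errors(estimated_energy, measured_energy)
```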
Table 4.5: Estimation errors of energy and area functions in Table 4.3

                            Matrix size
Metric        |            | 3x3   | 6x6    | 9x9    | 12x12  | 15x15
Energy (nJ)   | Estimated  | 16.7  | 114.7  | 365.5  | 840.8  | 1612.1
              | Measured   | 17.3  | 110.0  | 340.2  | 795.2  | 1509.2
              | Error      | 3.3%  | 4.3%   | 7.4%   | 5.7%   | 6.8%
Area (slices) | Estimated  | 494   | 986    | 1478   | 1970   | 2462
              | Measured   | 477   | 949    | 1431   | 1894   | 2364
              | Error      | 3.6%  | 3.9%   | 3.3%   | 4.0%   | 4.1%

To address the dependency of the energy dissipation on the input matrices, matrices are randomly generated and fed to our design and to the Xilinx design for 3 × 3 matrix multiplication. Equation 4.1 is employed to estimate the confidence interval for our simulation. The confidence interval is the interval into which the real average (over all possible input matrices, in this example) falls with a certain probability (confidence) [64]. Here x̄, z_{α/2}, s, and M represent the average energy over the (randomly generated) sample input matrices, the standard normal percentile, the standard deviation, and the number of sample matrices, respectively. The probability that the real average energy dissipation belongs to the interval in Equation 4.1 is 1 − α [64].

x̄ ± z_{α/2} · s / √M    (4.1)

Figure 4.9 (a) compares the energy dissipation of our design with that of the Xilinx design over 50 randomly generated 3 × 3 input matrices. The 95% confidence intervals are compared in Figure 4.9 (b). Based on the 95% confidence intervals, the average energy dissipation of our design for 3 × 3 input matrices is 7.81 nJ (32%) less than that of the Xilinx design. All designs in this chapter follow the simulation method based on this confidence interval.
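The interval of Equation 4.1 can be computed as follows, with z_{α/2} ≈ 1.96 for a 95% confidence level; the sample values are hypothetical energy measurements, not the dissertation's data.

```python
# Sketch of Equation 4.1: xbar +/- z_{alpha/2} * s / sqrt(M). The sample
# energies (nJ) below are hypothetical, not the dissertation's measurements.
import math
import statistics

def confidence_interval(samples, z=1.96):
    """95% confidence interval for the mean of the sampled energies."""
    xbar = statistics.mean(samples)
    s = statistics.stdev(samples)          # sample standard deviation
    half_width = z * s / math.sqrt(len(samples))
    return xbar - half_width, xbar + half_width

energies = [16.1, 16.4, 15.8, 16.9, 16.2, 16.5, 15.9, 16.3]
low, high = confidence_interval(energies)
```

In practice one keeps generating random input matrices until the half-width of this interval is acceptably small, which is how the number of sample matrices M is chosen.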
Figure 4.9: Comparison between our design (based on Theorem 1) and the Xilinx design for 3 × 3 matrix multiplication: (a) energy dissipation for randomly generated matrices and (b) average energy dissipation with confidence intervals

4.5 Design Analysis and Performance Comparison

Using the functions in Table 4.3, energy efficient designs are identified for various sizes of matrices around an (arbitrary) area constraint of 2,000 slices. This is about
The largest reduction in energy dissipation is 51% and can be obtained by using ” =15 or 16. Our designs improve the performance of the Xilinx reference design by 29%-51% with respect to energy and a factor of 3-15 with respect to latency while increasing the area by a factor of 1.9-9.4. The designs based on [110] with the same latency as ours reduce the energy dissipation, when compared with the Xilinx design by -27% to 31%. Analysis of energy and area functions reveals 119 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. that our designs improve the Xilinx design due to reduction in latency and energy efficient binding. In terms of EAT, our designs based on Corollary 1 offer superior performance by 50%-79%, when compared with the Xilinx design. Table 4.6: Performance comparison of various off-chip designs against the Xilinx design and the design A proposed in [110] Design Metric 3x3 Matrix size 6x612x12 15x15 24x24 48x48 Energy(nJ) 24.4 195.4 1563 3053 12506 100049 Xilinx design Latencyfusec) 0.18 1.44 11.52 22.50 92.16 737.28 Area(sllces) 251 251 251 251 251 251 EATxlE-12 1.1 70.6 4520 17243 289293 18514777 Block size (n/r) 3 6 12 15 12 12 Energy(nJ) 17.3 110.0 795 1509 6361 50892 Proposed design (Corollary 1) (reduction, %) 29% 44% 49% 51% 49% 49% Latency(usec) 0.06 0.24 0.96 1.50 7.68 61.44 (speedup, times) 3.00 6.00 12.00 15.00 12.00 12.00 Area(sllces) 477 949 1894 2364 1894 1894 (Increase, times) 1.90 3.78 7.55 9.42 7.55 7.55 EATxlE-12 0.5 25.1 1446 5351 92534 5922165 Block size (n/r) 3 6 6 5 8 8 Energy(nJ) 31.1 158.8 1271 2729 8983 71862 (reduction, %) -27% 19% 19% 11% 28% 28% Design A Latency(usec) 0.10 0.30 2.36 5.86 13.17 105.40 (speedup, times) 1.87 4.88 4.88 3.84 7.00 7.00 Area(sllces) 451 1124 1124 877 1698 1698 (Increase, times) 1.80 4.48 4.48 3.49 6.76 6.76 EATx1E-12 1.4 52.7 3372 14016 200950 12860824 Table 4.7 shows the energy, area, and latency of on-chip designs in Corollary 1 for various 
problem sizes. While the comparison of the off-chip and on-chip designs is not fair, it is useful for analyzing the effect of the I/O ports and the on-chip memory. For the off-chip design, data is fed via the I/O ports of the FPGA. The power used by a 16-bit I/O port is 80 mW. The input ports use less power than the output ports, since the output ports need to drive a large fan-out. For the on-chip design, all data is stored in BSRAMs. The power dissipation of the read and write operations on a single-port BSRAM with a 50% access rate is 35 mW. Thus, data access from a BSRAM is more energy efficient than from an I/O port. In practice, however, we are likely to have a combination of both situations, where we read the data from input ports and store the results in BSRAMs for further computation; the final result is obtained after several computations and output via the output ports. Thus the design trade-offs have to be performed at the system level, where multiple kernels are integrated.

Table 4.7: Performance comparison of various on-chip designs

                                    Matrix size
Design        Metric           | 3x3   | 6x6   | 12x12  | 15x15  | 24x24  | 48x48
Proposed      Block size (n/r) | 3     | 6     | 12     | 15     | 12     | 12
(Corollary 1) Energy (nJ)      | 16.0  | 105.9 | 775    | 1510   | 6200   | 49598
              Latency (usec)   | 0.06  | 0.24  | 0.96   | 1.50   | 7.68   | 61.44
              Area (slices)    | 434   | 861   | 1699   | 2083   | 1699   | 1699
              EAT x 1E-12      | 0.4   | 21.9  | 1264   | 4717   | 80896  | 5177370

Figure 4.10 shows the energy distribution among logic, nets, and I/O ports for Theorem 1 and Theorem 2. The figures are based on the off-chip designs. Logic energy represents the energy used by the combinational logic in the design, net energy is the energy used by the interconnect wires, and I/O energy that used by the input/output ports. Note that the algorithm of Theorem 2 uses more energy in the nets than in
the logic, unlike Theorem 1. For example, the ratio of logic energy to net energy is 1.21 for Theorem 1 and 0.75 for Theorem 2. The reason is that Theorem 2 has a more complex PE design and uses more interconnects between PEs. In the low-level simulation, Theorem 1 runs at 150 MHz. However, Theorem 2 runs at 143 MHz or less, due to the more complex design involving more interconnect wires.

Figure 4.10: Energy distribution over logic, nets, and I/O for Theorems 1 and 2

Another factor that affects the energy dissipation in Theorem 2 is the parameter r, which determines the degree of parallelism. Table 4.8 shows the results from the low-level simulation of the off-chip and on-chip designs for Theorem 2. As r increases, the usage of logic and interconnect increases and the latency decreases by a factor of r. Since the latency decreases as r increases, we can arrive at a design with reduced energy by increasing r. If r = 4 for n = 16, we use a larger device, the XC2V3000, instead of the XC2V1500, since the XC2V1500 can hold only 48 multipliers, while our design requires 64 multipliers. We do not use CLB-based multipliers, due to their low energy efficiency.
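The device-selection step above can be sketched as a small helper, assuming (consistent with the 64 multipliers required for n = 16, r = 4) that the Theorem 2 designs use n·r embedded multipliers in total; the function names are ours.

```python
# Sketch: selecting a Virtex-II device by embedded-multiplier demand for
# the Theorem 2 designs, assuming n*r multipliers in total (e.g. n = 16,
# r = 4 needs 64, exceeding the XC2V1500's 48 embedded multipliers).

DEVICE_MULTIPLIERS = {"XC2V1500": 48, "XC2V3000": 96}

def multipliers_needed(n, r):
    return n * r

def choose_device(n, r):
    """Smallest listed device with enough embedded 18x18 multipliers."""
    need = multipliers_needed(n, r)
    for device, count in sorted(DEVICE_MULTIPLIERS.items(),
                                key=lambda kv: kv[1]):
        if count >= need:
            return device
    raise ValueError("design needs %d multipliers" % need)
```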
Table 4.8: Performance comparison of various off-chip and on-chip designs for Theorem 2

                                Matrix size
Design     Metric           | 6x6   | 8x8    | 12x12 | 12x12 | 12x12 | 16x16 | 16x16
Theorem 2  r                | 2     | 2      | 2     | 3     | 4     | 2     | 4
(off-chip) Frequency (MHz)  | 143   | 143    | 143   | 117   | 127   | 143   | 130
           Energy (nJ)      | 73.4  | 158.9  | 481   | 503   | 430   | 1212  | 1062
           Latency (usec)   | 0.13  | 0.22   | 0.50  | 0.41  | 0.28  | 0.90  | 0.49
           Area (slices)    | 1669  | 2233   | 3358  | 5879  | 6592  | 4472  | 8850
           EAT x 1E-12      | 15.4  | 79.4   | 813   | 1213  | 804   | 4851  | 4627
Theorem 2  r                | 2     | 2      | 2     | 3     | 4     | 2     | 4
(on-chip)  Frequency (MHz)  | 143   | 143    | 143   | 100   | 110   | 143   | 110
           Energy (nJ)      | 64.8  | 146.6  | 456   | 480   | 412   | 1113  | 1225
           Latency (usec)   | 0.13  | 0.22   | 0.50  | 0.48  | 0.33  | 0.90  | 0.58
           Area (slices)    | 1827  | 2416   | 3558  | 6670  | 6770  | 4700  | 9203
           EAT x 1E-12      | 14.9  | 79.3   | 818   | 1535  | 913   | 4684  | 6559

Chapter 5

Energy and Time Efficient Matrix Factorization Using FPGA

In this chapter, we consider one of the important signal processing kernels: matrix factorization. For example, matrix factorization is a fundamental kernel in adaptive beamforming [62][115]. Approaches to future wireless communications, such as software defined radio (SDR), require implementations of such signal processing kernels on reconfigurable hardware like FPGAs [137], since FPGAs provide adaptability as well as high performance.

While current FPGA devices offer very few features for power control, we identify several techniques and show how to use them effectively to improve energy performance. For example, the popular low power design technique in ASICs, dynamic voltage scaling with dynamic frequency scaling, is not available in FPGAs. However, clock gating can be used in current FPGAs to disable idle modules. Also, imposing a certain architecture and algorithm on the FPGA fabric affects the energy
performance since many different architectures and algorithms can be mapped to perform the same operation. Transposing an original algorithm using appropriate parallelism, and choosing among the various resources available on FPGAs, leads to performance trade-offs between time and energy. We show how we exploit these algorithmic level techniques.

We first develop a design for matrix factorization on a linear array architecture. While various architectures can be implemented on FPGAs, the linear array architecture shows better time and energy performance than a sequential algorithm and its corresponding architecture (see Section 5.5). In the sequential algorithm/architecture, memory access is a dominant factor of energy dissipation. By transposing the sequential algorithm for LU decomposition to a parallel version, the new algorithm/architecture balances the memory access and the computation.

In the second design, we investigate and apply an algorithmic technique that uses a block-based approach to obtain time and energy efficient designs in FPGAs. The matrix multiplication architecture proposed in Chapter 4 is used as part of the matrix factorization design. In the block LU decomposition, the block size determines the complexity. For n x n LU decomposition using block size b, the complexity is determined by the number of b x b block operations: n/b block LU decompositions, (n/b)((n/b) - 1) triangular updates, and (1/6)(n/b)((n/b) - 1)(2(n/b) - 1) block multiply/subtract operations. If b = n, the design with the least amount of complexity is obtained. However, this does not necessarily translate to the minimal energy design. Therefore, high-level performance estimation (based on the latency and energy performance models) is used for rapid design space exploration. Based on the estimation, we identify an optimal block size that minimizes the total energy dissipation. Then several candidate designs are implemented in FPGAs. To the best of our knowledge, we are not aware of any FPGA based designs for LU decomposition.
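The block-size exploration just described can be sketched as a direct search over an analytic model. The snippet below is a toy illustration only: the block-operation counts match those used later for block LU decomposition, but the latency and power terms are made-up stand-ins for the models of Section 5.3, chosen just to show why the optimum need not sit at b = n:

```python
# Toy sketch of rapid design space exploration over the block size b.
# The energy model here is hypothetical: b^2 cycles per block operation,
# and a power term that grows with b plus a quiescent component.
def candidate_energy(n, b, quiescent_power):
    m = n // b
    # block-operation counts: LU blocks, triangular-update blocks, and
    # multiply/subtract blocks
    blocks = m + m * (m - 1) + m * (m - 1) * (2 * m - 1) // 6
    cycles = blocks * b * b                # toy latency model
    return cycles * (b + quiescent_power)  # toy power model

def best_block_size(n, quiescent_power):
    sizes = [b for b in range(2, n + 1) if n % b == 0]
    return min(sizes, key=lambda b: candidate_energy(n, b, quiescent_power))
```

With a small quiescent term the optimum sits at a small block size, while a large quiescent term pushes it toward b = n; locating this interior trade-off is exactly what the high-level estimation is used for.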
Also, no designs considering energy efficiency were developed. Hence, for the sake of comparison, we implemented a uniprocessor design and the best known design on a linear array on Xilinx FPGA devices to compare their time and energy performance with our designs. Also, we implement LU decomposition in software on state-of-the-art low power TI DSP devices and compare its performance. Our designs on the FPGAs are significantly more time and energy efficient in all cases.

The remainder of this chapter is organized as follows. In Section 5.1, related work is discussed. In Section 5.2, we present our algorithms and architectures for matrix factorization. In Section 5.3, time and energy performance is estimated for the proposed algorithms and architectures. Section 5.4 presents the implementation details and the energy optimization techniques. Section 5.5 presents the performance of the synthesized designs, along with a comparison against a uniprocessor design, the best known design, and TI DSP-based implementations.

5.1 Related Work

Meeting latency and throughput requirements is a critical concern in many embedded signal and image processing applications. Consequently, a vast literature has appeared over the last two decades describing fine-grain parallel algorithms [131]. These algorithms are well suited to meeting such requirements now that state-of-the-art FPGAs provide an adequate level of speed, density, and programmability in the form of reconfigurable computers, boards, and chips with embedded computational support [4][142]. Such hardware allows rapid implementation and improvement of parallel algorithms and architectures for signal and image processing applications as inexpensive, programmable, and parameterizable designs.
However, to the best of our knowledge, there has been no previous work targeting time and energy efficient designs for LU decomposition on FPGAs.

To factorize a b x b matrix, we choose to implement LU decomposition. While many designs have been developed for LU decomposition, we are interested in those that can be mapped onto FPGAs. The well-known LU decomposition on 2D systolic arrays was developed by Kung and Leiserson [76]. An automatic 2D systolic array generation tool for LU decomposition was developed by Nash [98]. In [76], a latency of 3b is required for 2b^3 operations (including memory access and computation) using b^2 processing elements (PEs). However, even though significant computation speed is achieved, the design requires a large number of PEs (b^2) and I/O ports (4b). This gives rise to problems of interconnect and data feeding. Also, the I/O operations via I/O pads in FPGAs are expensive in terms of energy dissipation. For these reasons, a linear (1D) systolic array with a small number of PEs and not too many I/O ports was proposed. Similar architectures are proposed in [23][99], and the best known latency on a 1D systolic array with b PEs is achieved for dense matrices and band matrices. Each PE requires 5 I/O ports and 2b words of storage; the total storage requirement is 2b(b - 1). It also has two types of PEs, for multiply and divide operations respectively, assuming that division is more expensive than multiplication. This requires the diagonal entry of the matrix to be propagated in the reverse direction after performing a normalization, which leads to a latency of 2b(b + 1). In case LU decomposition is performed on a stream of matrices, overlapping the LU decompositions of the current and the next matrices is not efficiently managed, which leads to an effective latency (to be defined in Section 5.2.1) of 2b^2.
In our design, each PE uses 4 I/O ports and k words of storage, where k is the index of the k-th PE; thus the total amount of storage is (1/2)b(b + 1). Since division can be efficiently implemented in FPGAs, each PE performs both multiply and divide operations, which removes the data propagation of the diagonal entry of the matrix in the reverse direction. While our design occupies 20% less area on average than the design in [23][99], the latency is significantly reduced to b^2 + b - 1 (the leading constant is 1). In performing the LU decomposition for a stream of matrices, our design overlaps the LU decompositions of the current and the next matrices efficiently, which leads to an effective latency of b^2.

For large matrices, block LU decomposition can be performed. An n x n matrix is decomposed by finding decompositions of b x b submatrices within it, along with other operations needed to update the entries. While block LU decomposition has been mapped onto distributed parallel machines such as Intel iPSC/860 and Paragon systems [27], we are not aware of any time and energy efficient designs for block LU decomposition that are mapped onto FPGAs. In our design for block LU decomposition, two sets of linear array architectures, for LU decomposition and for matrix multiplication, are combined with memory blocks, and a time and energy efficient schedule is implemented. Also, our design is parameterized by the block size b for design trade-offs. By varying the block size, we explore a large number of designs and identify time and energy efficient designs.

5.2 Time and Energy Efficient Designs for Matrix Factorization

Several methods are known for factoring a given matrix [27]. In this chapter, we choose to implement LU decomposition on FPGAs.
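For reference, LU decomposition writes A as the product L x U of a unit lower triangular matrix L and an upper triangular matrix U. A small numeric instance (the matrix below is an arbitrary nonsingular example chosen for illustration; no pivoting is needed):

```python
# A 2x2 instance of LU factorization: L is unit lower triangular, U is upper
# triangular, and L * U reconstructs A exactly.
A = [[4.0, 3.0],
     [6.0, 3.0]]
L = [[1.0, 0.0],
     [1.5, 1.0]]       # l_{2,1} = a_{2,1} / a_{1,1} = 6/4
U = [[4.0, 3.0],
     [0.0, -1.5]]      # u_{2,2} = a_{2,2} - l_{2,1} * u_{1,2} = 3 - 1.5*3

prod = [[sum(L[i][k] * U[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)]
assert prod == A
```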
Essentially, LU decomposition factors a b x b matrix into a b x b lower triangular matrix L (the diagonal entries are all 1) and a b x b upper triangular matrix U.

In Theorem 3, a new algorithm and architecture for LU decomposition are developed for a linear array of PEs. The new algorithm runs in each PE. Each PE performs computations on the input or intermediate matrix, and the results are fed to the neighboring PE on the next clock cycle. Data dependencies between input and intermediate matrices are resolved by efficient and regular scheduling. Each PE uses only two data ports: one for feeding input or intermediate matrices and the other for outputting the decomposed matrix. With this fixed I/O bandwidth, regardless of problem size, we achieve an optimal latency of b^2 + b - 1, with a leading coefficient of 1, for LU decomposition.

In Theorem 4, we propose a new parallel design on FPGAs for block LU decomposition. The design partitions a large matrix into multiple smaller blocks. To perform the computation on the smaller blocks, the architecture and algorithm of Theorem 3 are re-used. Corollary 3 further reduces the latency of Theorem 4 while the energy performance stays the same. By varying the block size, we achieve time and energy efficient designs. We are also not aware of any energy efficient designs for block LU decomposition that are mapped onto FPGAs.

5.2.1 LU Decomposition

Let A be a b x b matrix; a_{x,y} is an element of matrix A, where x is the row index and y is the column index. l_{x,y} (u_{x,y}) is an element of matrix L (U), where x is the row index and y is the column index. We assume that matrix A is a non-singular matrix and, further, we do not consider pivoting.
The sequential algorithm in [41] consists of three main steps:

Step 1: The column vector a_{x,1}, where 2 <= x <= b, is multiplied by the reciprocal of a_{1,1}. The resulting column vector is l_{x,1}.

Step 2: l_{x,1} is multiplied by the row vector a_{1,y} = u_{1,y}, where 2 <= y <= b. The product l_{x,1} * u_{1,y} is computed and subtracted from the submatrix a_{x,y}, where 2 <= x, y <= b.

Step 3: Steps 1 and 2 are recursively applied to the new submatrix formed in Step 2.

An iteration denotes an execution of Steps 1 and 2. During the k-th iteration, the column vector l_{x,k} and the row vector u_{k,y}, where k+1 <= x, y <= b, are generated. The product l_{x,k} * u_{k,y} is subtracted from the submatrix a_{x,y}, where k < x, y <= b, obtained during the (k-1)-th iteration. The time complexity of the sequential algorithm is O(b^3).

We propose a new architecture and algorithm on a linear array, shown in Figure 5.1. To decompose a b x b matrix, a linear array architecture is developed using b PEs. Essentially, PE_j performs the computations for the j-th column of matrices L and U when the appropriate data is fed to it. Each PE consists of an adder/subtracter, a multiplier, a division lookup table, and a storage LU (b entries per PE). The storage LU of PE_j is used to store the j-th column of matrices L and U. Each PE has two input ports (a_in, LU_in) and two output ports (a_out, LU_out). a_in and a_out are used to feed in and out a_{x,y} or l_{x,y}; LU_in and LU_out are used to output the resulting matrices L and U through PE_b in a pipelined manner. Note that the column vector l_{x,k} generated during the k-th iteration has to be distributed to PE_{k+1} through PE_b. Also, the column vector l_{x,k+1} generated during the (k+1)-th iteration has to be distributed to PE_{k+2} through PE_b. Since we only have one datapath, via a_in and a_out, it is crucial that l_{x,k} is distributed before l_{x,k+1} is distributed.
For example, to compute u_{3,3}, both l_{3,1} and l_{3,2} are required to reach PE_3, at cycles 9 and 10 (see the arrows in Figure 5.2). Figure 5.1 (c) shows our algorithm by describing the operations in each PE during each cycle.

Theorem 3 LU decomposition of a b x b matrix can be performed in b^2 + b - 1 cycles using the architecture in Figure 5.1 (a) and (b) and the algorithm in Figure 5.1 (c), using b PEs.

Proof 5 The elements a_{x,y} of matrix A are fed in row-major order (a_{1,1}, a_{1,2}, a_{1,3}, ..., a_{1,b}, a_{2,1}, ..., a_{b,b}) to a_in of PE_1. All data are fed from left to right. Thus a_{x,y} arrives at PE_j at cycle b(x-1) + y + j - 1, where 1 <= j <= b. Eight operations are performed, based on the indices x and y and the index j of PE_j. The indices x and y can be realized using counters in each PE; they can also be fed to PE_1 and propagated in a pipelined manner.

Op1) Bypassing (input data propagation): a_{x,y} is passed from PE_{j-1} to PE_{j+1} via PE_j, except when y = j and x > j. If y = j and x > j, l_{x,y} is generated at PE_j by normalizing the updated a_{x,y} by u_{y,y} (Op6, Op7) and is passed to PE_{j+1} via port a_out. l_{x,y} is also stored in the storage LU.

[Figure 5.1 (a) shows the overall linear array: data flows from PE_{j-1} to PE_j to PE_{j+1} through the ports a_in/a_out and LU_in/LU_out. Figure 5.1 (b) shows the architecture of a PE: the registers RegM, RegT, and RegR, the storage LU, an adder/subtracter, a multiplier, and the reciprocal lookup table. Figure 5.1 (c) gives the per-PE algorithm:]

(Input data path)
for all PE_j, j = 1 to b, receiving input a_{x,y} (via port a_in), do in parallel
    if y = j and x > j then
        a_out = RegM;                      (RegM has l_{x,y})
    else
        a_out = a_{x,y};
    end if;
end for;

(Computation and storing operation)
RegT = 0;
for all PE_j, j = 1 to b, receiving input a_{x,y}, do in parallel
    if y < j then
        RegT = RegT + a_{x,y} * u_{y,j};   (u_{y,j} is read from storage LU; the input a_{x,y} is l_{x,y})
    elseif y = j then
        RegT = a_{x,y} - RegT;
        if x = j then
            RegR = 1/RegT (through table lookup);
        elseif x > j then
            RegM = RegR * RegT;            (the result is l_{x,y} and is stored to storage LU)
        end if;
        u_{x,y} = RegT;                    (the result is stored to storage LU)
        RegT = 0;
    end if;
end for;

(Result data path)
for all PE_j, j = 1 to b, receiving input lu_{x,y} (via port LU_in), do in parallel
    if y = j then
        LU_out = l_{x,y} or u_{x,y};       (from PE_j)
    else
        LU_out = lu_{x,y};                 (from PE_{j-1})
    end if;
end for;

Figure 5.1: (a) Overall architecture, (b) architecture of a PE, and (c) algorithm for LU decomposition
a l x } y denotes the intermediate element of submatrix generated during the k-th iteration, a ^ ^ ^ y is used either for another accumulation or for normalization (Op6) and is stored in RegT. RegT is a temporary storage to hold during the accumulation. Note that accumulation and subtraction share one adder/subtracter since they never occur simultaneously. Op4) Subtraction: If y = j, a subtraction is performed after all accumulations (Op2) are complete. This ensures that is subtracted from the element of sub matrix O x,y where k < x ,y < b during k-th iteration. In Step 2 of the sequential algorithm, the subtraction is performed after multiplication and the result is stored for the next subtraction. These operations are done repeatedly. For example, us^s 1 3 4 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. is computed as {(0 3 , 3 —/ajtxi.a) —/3 ,2 W 2 , 3 }- In the proposed algorithm, all accumula tions (Op2) are performed first. Then one subtraction is performed. For example, U 3 , 3 is computed as { — (^ 3 ,iUi_ 3 + h,2U 2,f) + (see Figure 5.2). Op5) Data Storing: This operation stores an entry ofl^^y oru^^y to storage LU. I f y — h lx,y or U x^y is generated in PEj. If x < j, is stored in storage LU. If X > j, Ix^y is stored in storage LU after normalization (Op6). This operation guarantees that the j-th column of the decomposed matrices L and U is stored in PEj. Op6) Reciprocal: Division is required since the normalization is performed by U k,k for the column vector = {ok+i^k, ■ ■ ■ ,o .b ^ k ) during the k-th iteration ( 1 < k < b). U k^k is stored in RegT after subtraction (Op3) and the reciprocal value of U k^k is stored in RegR. The reciprocal operation occurs if x = y = j. Op7) Normalization: After the subtraction (Op3), the value is stored in RegT. If V = j ond X > j, the values in RegT and RegR are multiplied. 
This operation generates the column vector F,k where k- \-l <x< b in PEk during the k-th iteration. Op8) Output Results: This operation sends out the results and u^^y stored in the storage LU in a pipelined manner. If y = j, F,y or U x^y is sent to port L J7out- Otherwise, Ix^y or Ux,y from P-Ej-i is passed to PEj+i via port Li/out- To satisfy the data dependency of F,k being generated during the k-th iteration and used during the (k + 1 )-th iteration and to obtain the minimum latency, two 13 5 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. conditions have to be satisfied. Note that the column vector lx,k {k + I < x < b) is produced in PEk during the k-th iteration. The first condition is that Ix^k has to propagate from PEk to PEk-\-i after (generated during the ( k -l) -th iteration) propagates to PEk+i and before lx,k+i is generated in PEk+i during the (k -\-l)-th iteration. Let Tk{lx,j) be the sum of the time when E j is generated in PE j and the propagation time when it reaches PEk, which is Tk{lx,j) = b{x — 1) + 2{j — 1) + l - \ - k - j . Then, Tk+i{lx,k-i) = b{x - 1) + - 1, Tk+i{lx,k) = b{x - 1) + 2k, and Tk+i{lx,k+i) = b{x - 1) + 2 A : + 1. Since Tk+i{lx,k-i) < Tk+i{lx,k) < Tk+i{lx,k+i) — 1 < 0 < 1, the condition is satisfied for all x where k -\- \ < x < h. To define the second condition, let lux^y be lx,y if x > j, or U x^y if x < j in P E j. Note that lux^j is computed in PE j every b cycles. To output the resulting matrices L and U without delay, the second condition that lUxj arrives at PEk via port LUin and LUout before PEk produces any lux^k is required to satisfy. We assume j < k. Then, Tk{luxj) = b{x — l) + j - { - k — 1 and Tk{lux,k) = b{x — 1) + 2/c — 1. Since Tk{luxj) < Tk{lux^k) j < k, the second condition is satisfied for all x where 1 < X < b. Total latency is calculated as lubj, to be available as output: Note thatTk{lUxj) = b{x — 1) -\-2{j — 1) 1 k ~ j . 
The time taken for the last result, lub^b to be available as output is Tb{lubx) = b{b — 1) + 2(6 — 1) + 1. Thus total latency ?s 6^ + 6 — 1. Although all 8 operations in the proof of Theorem 3 are based on the indices X, y, and j, it is easy to understand if we describe the state of each PE. Again, 136 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Si 0 1 II o j S 3 S ^ 3 G C h - ZD • c c CC 1 0 1 a m o? cc Figure 5.2: Snapshot for 3 x 3 LU decomposition 1 3 7 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Table 5.1: States of each PE and their operations State Condition Operations Note S1 y < j Opt. Op2, Gp3, Gp8 Perform multiplication/accumulation 82 x = y = j Op4, Gp5, Op6, Op8 Generate S3 x > j , y = j Op4, Op5, Op7 Generate , S4 x < j, y = j Op4, Op8 Generate ^ S5 y > j Opt Idle the state of each PE is based on the indices x, y, and j and there are 5 states, Si, S5. In each state, multiple operations are performed within a single clock cycle. Table 5.1 shows the condition of each state in PEj and multiple operations in its state. For example, in state SI where y < j, PEj performs multiplication and accumulation (Op2, Op3). from port ain and stored in storage LU are multiplied and accumulated with the previously stored in RegT. The resulting value is stored back to RegT. Also, in this state, is propagated to PE j+i via aout (Opl) and lux^y from port LUin is propagated to P E j^i via LU out (Op8). Figure 5.2 shows the states of each PE for 3 x 3 LU decomposition over clock cycles. Note that the conditions for the state are used later to generate control logic of each PE. The storage requirement in the current design is h words per PE for storage LU . In fact, the required storage can be reduced to j — 1 words at PEj since the resulting lu^^y where l < x < j , y = jaX PEj are output to port LU out and are not used in the later computation. 
However, we keep the required storage to be b 138 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. words per PE because this architecture will be used in the block LU decomposition (see Section 5.2.2) and the storage requirement in the design is b per PE. Since our design is a pipelined architecture, the first b cycles of the computations on the next matrix can be overlapped with the last b cycles of the computations on the current matrix. For a stream of matrices, one matrix can be decomposed every 5^ cycles. Thus the effective latency becomes 6^, which is the time between the arrivals of the first and last output data of the current computation. Also, we achieve the minimum latency with a fixed I/O bandwidth which uses only two input ports: one for input datapath and the other for output datapath. 5.2.2 Block LU Decom position For large matrices, it is possible to perform block LU decomposition. In this case, a n X n matrix is decomposed by finding decompositions o ib x b submatrices within it, along with other operations needed to update the entries. The sequential algorithm is given in [27]. The input matrix A is partitioned into four matrices: All, Ai2 , A2 1 , and A 2 2 . A n is a , b x b matrix, A 12 is a b x {n — b) matrix, A 21 is an {n — b) X b matrix, and A 22 is an {n — b) x {n — b) matrix. The goal of the algorithm is to decompose A into two n x n matrices, L and U, such that / \ ( \ ( \ All Ai2 I 0 1 ( U[, U[, V follows: 1 2 1 ^ 2 2 L'n 0 T' T' ■^21 -'^22 J 0 m . The steps of the algorithm are as 2 2 J 139 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Step 1: Perform a sequence of Gaussian eliminations on the n x b matrix formed by ^ 1 1 and A 21 in order to calculate the entries of L 21, and Step 2: Calculate U[2 as the product of and A 12. 
Step 3: Evaluate A '2 2 A 22 — L'2iU[2- S tep 4: Apply Step 1 to 3 recursively to matrix ^ 2 2 - During the k-th iteration, the resulting submatrices L^ii , Uii L ^ 2i ■ > ^ and ^ 2 2 ^ are obtained. An iteration denotes an execution of Step 1 and 3. By utilizing the architecture and algorithm in Theorem 3 in combination with a matrix multiplication/subtraction architecture, we propose an architecture for block LU decomposition on FPGAs as shown in Figure 5.3. The block size b is later used as the parameter to realize time and energy efficient designs. There are two sets of PFs: one set performing a . b x b LU decomposition and the other performing a b x b matrix multiplication/subtraction. Each set of PFs is linearly pipelined and both sets are connected to a memory bank. The input matrix is stored in the memory bank and fed to both sets of PFs. After computation, the results are stored back to the memory bank and used for next computation. Four different operations are identified: opLU, opL, opU, and opMMS. opLU is performed to obtain . opL U from Step 1 is realized by using the algorithm and architecture proposed in Theorem 3. opL from Step 1 is performed to obtain L ^ 2\ ■ The same architecture in Theorem 3 is used. However, the matrix and 1 4 0 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. the reciprocal of its diagonal entries are required to perform opL. Since opL is performed after opLU, all PEs already hold the reciprocals in RegRs. is fed via port LUin. We add one more datapath that feeds the data from port LUin to storage L U . opU from Step 2 is performed to obtain U12 ■ opU also uses the same architecture. It requires from opLU. are fed via port LUin to storage L U . In Step 3, matrix multiplication/subtraction {opMMS) is performed. Once opL and opU are complete, L ^ 2i available for opMMS. The architecture for opMMS is proposed in Chapter 1. 
Since there is matrix subtraction after matrix multiplication, additional subtraction logic is added. The matrix multiplication algorithm takes two b x b submatrices C from L ^ 2i D from and computes the product E C x D. Another b x b submatrix F is taken from ^ 2 2 ^ . Then the final values are obtained hy E F — E. If 6 is large, the bxb matrix multiplication can be decomposed into r x r matrix multiplications, where r is the sub-block size. Note that if n = 6, the set of PEs for opMMS is not used and only the set of PEs for opLU is used. In this case, the design in Theorem 3 dissipates energy. T h eo rem 4 LU decomposition of a n x n matrix can be performed in -\- 1 6 r — i j 1 cycles using the architecture in Figure 5.3 (a) using b PEs for b x b LU decomposition and r PEs for r x r matrix multiplication/subtraction, where block size is b and sub-block size is r. P ro o f 6 At a given time, only one operation is performed and the schedule is shown in Figure 5.3 (b). These operations are all performed on b x b matrices. As 141 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Memory banks PEs for f u T ) f ir ) f u ~ ] Memory PE • • • - ► PEs fo r MS Memory Memory PE • • • Memory (a) [ LU ] opLU [ MS ] opMMS - Iteration 1 - -X - - Iteration 2 - Iteration n/b P - < - 1 - > -< 1 — iVb-1 — — n/b-1 — (n/b-1 f 1 > •< — n/b-2 — — n/b-2 — ► < - (n/b-2f —► ... Q • ------ • ------- L’ , --------• -------U',, --------• -------A’ --------• ------ • -------L "„--------• -------U”, , -------• ------- A'' • • --------- • L'„U '„ ^ L”„U "„ “ (b) •<1X- - iteration 1 - -2 (n /b -1 )- -Iteration 2 - - 2 ( n /b - 2 ) - Iteration n/b P - H U Z D - H Z ) C Z U r Z - 0 ® • • • CZD Lgi and U 1 2 • • L' <-------- -(n/b-1)^--------- ► ... 
( MS ] • — AW ---------- • • • • |_ ( iV b ) y (n A ) ) (n/b-2)= [ Ms] • -------------- A" - (C ) Figure 5.3: (a) Overall architecture for block LU decomposition, (b) a schedule for Theorem 4 and (c) a schedule for Corollary 3 1 4 2 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. all data are streaming, the computation of the current data set and the data feeding of the next data set are overlapped. Therefore, each of opLU, opL, and opU has an effective latency ofb^. opMMS has an effective latency of (see Corollary 1). There are | iterations to complete the block LU decomposition. During each iteration, only one b x b opLU is performed. The effective latency of all opLU is During the k-th iteration, ( | — k) opL and opU for b x b block size are performed. The effective latency of all opL and opU is Y ^= \{^~ k) = |( ^ )( ^ — 1)6^. During the k-th iteration, ( | — kff opMMS for b xb block size are performed. Thus, the total number of opMMS is ~ k ^ = Kf)(f “ 1 )(t ~ b x b matrix multiplication can be decomposed to r x r matrix multiplication, the effective latency of all opMMS is |(f)(f — l)(x ~ ^ Thus, accumulating the latencies of all operations leads — 1^ + 6 — 1, which includes the time to fill the pipeline stages. Theorem 4 uses a straightforward schedule since only one set of PEs performs computations at a given time. We can utilize the two sets of PEs in parallel to reduce the total latency. For example, the overlapping of operations is shown in Figure 5.3 (c). opMMS can start after opL and opU yield their first submatrices. Then the rest of opL and opU can be overlapped with opMMS. We can save (2^ — 4)6^ cycles during the first iteration. 1 4 3 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 
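Independent of the scheduling, the algebra of Steps 1-4 can be cross-checked with a small functional prototype. The sketch below is plain Python for illustration only, not the systolic implementation, and it assumes no pivoting, as stated in Section 5.2.1:

```python
# Functional cross-check of block LU (Steps 1-4): the blocked factorization
# yields the same packed L/U factors as an unblocked Doolittle factorization.

def lu_unblocked(a):
    # Doolittle LU without pivoting: L (unit diagonal, strictly lower part)
    # and U are returned packed into one matrix.
    n, m = len(a), [row[:] for row in a]
    for k in range(n):
        for x in range(k + 1, n):
            m[x][k] /= m[k][k]                    # column of L
            for y in range(k + 1, n):
                m[x][y] -= m[x][k] * m[k][y]      # rank-1 update
    return m

def lu_blocked(a, b):
    # Block LU with block size b (b must divide n).
    n, m = len(a), [row[:] for row in a]
    for k0 in range(0, n, b):
        k1 = k0 + b
        # Step 1: Gaussian elimination on the n x b panel -> L'11, U'11, L'21
        for k in range(k0, k1):
            for x in range(k + 1, n):
                m[x][k] /= m[k][k]
                for y in range(k + 1, k1):
                    m[x][y] -= m[x][k] * m[k][y]
        # Step 2: U'12 = (L'11)^-1 * A12 by forward substitution
        for y in range(k1, n):
            for x in range(k0, k1):
                for k in range(k0, x):
                    m[x][y] -= m[x][k] * m[k][y]
        # Step 3: A'22 <- A22 - L'21 * U'12 (the multiply/subtract update)
        for x in range(k1, n):
            for y in range(k1, n):
                for k in range(k0, k1):
                    m[x][y] -= m[x][k] * m[k][y]
        # Step 4: the outer loop recurses on the updated trailing submatrix
    return m

# Any block size dividing n yields the same factors as the unblocked version.
a = [[4.0, 3.0, 2.0, 1.0],
     [2.0, 4.0, 3.0, 2.0],
     [1.0, 2.0, 4.0, 3.0],
     [1.0, 1.0, 2.0, 4.0]]
ref = lu_unblocked(a)
for b in (1, 2, 4):
    blk = lu_blocked(a, b)
    assert all(abs(ref[i][j] - blk[i][j]) < 1e-9 for i in range(4) for j in range(4))
```

The example matrix is an arbitrary diagonally dominant one, chosen so that no pivoting is needed.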
Corollary 3 LU decomposition of an n x n matrix can be performed in 3bn - 2b^2 + (b^3/6r)(n/b)((n/b) - 1)(2(n/b) - 1) + b - 1 cycles using the architecture in Figure 5.3 (a) and the schedule in Figure 5.3 (c).

Proof 7 During each iteration, after opL and opU for the first blocks are performed, the input matrices for opMMS are ready; thus, the remaining opL and opU can be performed in parallel with opMMS. The effective latency to complete the k-th iteration is 3b^2 + (n/b - k)^2 * (b^3/r). During the (n/b)-th iteration, only one opLU is performed. Thus, the total latency is SUM_{k=1}^{(n/b)-1} {3b^2 + (n/b - k)^2 * (b^3/r)} + b^2 + b - 1 = 3bn - 2b^2 + (b^3/6r)(n/b)((n/b) - 1)(2(n/b) - 1) + b - 1, which includes the time to fill the pipeline stages.

While Corollary 3 reduces the total latency compared with Theorem 4, it does not reduce the amount of computation. The total energy is the sum of the energy used for computation and the quiescent energy (the energy for configuration memory, static power, etc.) used by the device even when the logic is idle. The quiescent energy depends only on the total latency. Since Corollary 3 reduces the latency, the quiescent energy, and hence the total energy, is reduced. Thus, we use the architecture and algorithm in Corollary 3 to obtain both time and energy efficient designs.

5.3 Performance Estimation and Design Trade-offs

For a given problem size n, varying parameters such as the block size b and the sub-block size r creates a large design space. Before implementing the designs and performing low-level simulation, we estimate the performance of possible designs, prune the design space, and finally identify "good" candidate designs for time and energy efficiency. The candidate designs were implemented using VHDL (see Section 5.4).

5.3.1 High-Level Performance Model

To estimate the performance of our designs, we have employed domain-specific modeling. A detailed description of this methodology can be found in Chapter 3.
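As a first, purely analytic illustration of such estimation, the latency models alone can be evaluated directly. The functions below encode the cycle counts of Theorem 4 and Corollary 3 as reconstructed above (Python is used only for illustration):

```python
# Cycle counts for block LU on the two-array architecture (n x n matrix,
# block size b, sub-block size r; b divides n, r divides b).
def mms_cycles(n, b, r):
    # (1/6)m(m-1)(2m-1) block multiplications, each of b^3/r effective cycles
    m = n // b
    return (m * (m - 1) * (2 * m - 1) // 6) * b ** 3 // r

def cycles_thm4(n, b, r):
    # opLU (n*b) + opL/opU (n^2 - n*b) + opMMS + pipeline fill
    m = n // b
    return n * b + m * (m - 1) * b * b + mms_cycles(n, b, r) + b - 1

def cycles_cor3(n, b, r):
    # overlapped schedule of Figure 5.3 (c)
    return 3 * b * n - 2 * b * b + mms_cycles(n, b, r) + b - 1

n = 256
for b in (16, 32, 64, 128, 256):
    t4, c3 = cycles_thm4(n, b, b), cycles_cor3(n, b, b)
    assert c3 <= t4                           # overlapping never hurts
    assert t4 - c3 == (n - b) * (n - 2 * b)   # total saving, algebraically

# b = r = n reduces to a single opLU: ~n^2 cycles (546.1 us at 120 MHz)
assert cycles_cor3(n, n, n) == n * n + n - 1
```

Note that the opMMS term dominates for small b, while the 3bn term penalizes large b, which is why the energy/latency optimum over (b, r) has to be searched rather than read off one formula.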
Domain-specific modeling is a hybrid (top-down plus bottom-up) approach to performance modeling that allows the designer to rapidly evaluate candidate algorithms and architectures in order to determine the design that best meets criteria such as energy, latency, and area.

In the domain-specific modeling proposed in Chapter 3, an architecture is divided into Relocatable modules (RModules) and Interconnects. RModules are hardware elements that are assumed to dissipate the same amount of power no matter where they are instantiated on the chip, and Interconnects are the wires connecting the
In the actual implementation of an SRAM, the number of its entries should be a multiple of 16. P_BRAM is the power used by a Block SelectRAM. Similar numbers representing the area of each module are also obtained. Latencies in seconds are obtained by dividing cycle counts by the clock frequency. Using Table 5.2, functions that represent energy, area, and latency for Theorem 3, Theorem 4, and Corollary 3 are

Table 5.2: Power and area functions for various modules

Module                          Power function (mW)                               Area function (slices)
Block multiplier (16x16 bit)    P_BMult = 22.50                                   1 BMult
Adder (16 bit)                  P_Add = 3.26
SRAM (16-bit word, x entries)   P_SRAM = -0.25[x/16]^2 + 6.07[x/16] + 4.40        A_SRAM = -0.01[x/16]^2 + 17.67[x/16] + 20.08
BRAM (16 bit, 1024 entries)     P_BRAM = 15.49                                    1 BRAM
Mux (16-bit 2x1)                P_Mux = 3.26                                      A_Mux = 8
Register (16 bit)               P_R16 = 2.34                                      A_R16 = 8
Reciprocal unit                 P_Rcp = 6.00                                      1 BRAM
Quiescent power                 P_Q = 188

* Target device is the Xilinx Virtex-II XC2V1500. BMult is a Block multiplier; BRAM is a Block SelectRAM. [x/16] denotes the ceiling of x/16. All modules are assumed to run at 120MHz with 30% switching activity.

shown in Table 5.3, Table 5.4, and Table 5.5, respectively. An average switching activity of 30% for the input data to each module at a running frequency of 120MHz is assumed. Since each PE of the design in Theorem 3 has five states, a PE in a different state dissipates a different amount of energy. To accurately estimate the energy, the number of cycles spent in a specific state over all PEs is defined as its duration. For example, the number of cycles for state S1 over all PEs is (1/3)b(b-1)(2b-1). Only the power function for state S1 is applied to obtain the energy used at state S1 over all PEs.

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
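In this bottom-up scheme, module energy is simply power times active time. A minimal sketch using the Table 5.2 values (the dictionary keys and helper names are ours, not the authors'):

```python
# Module power values from Table 5.2 (mW, Virtex-II at 120 MHz, 30% switching)
POWER_MW = {
    "bmult": 22.50,   # 16x16-bit block multiplier
    "adder": 3.26,    # 16-bit adder
    "bram": 15.49,    # Block SelectRAM
    "mux": 3.26,      # 16-bit 2x1 multiplexer
    "reg16": 2.34,    # 16-bit register
    "rcp": 6.00,      # reciprocal lookup unit
}
F_HZ = 120e6

def sram_power_mw(entries):
    """Distributed-SRAM power model from Table 5.2 (entries rounded up to a multiple of 16)."""
    u = -(-entries // 16)                      # ceil(entries / 16)
    return -0.25 * u * u + 6.07 * u + 4.40

def module_energy_nj(module, active_cycles):
    """Energy (nJ) = power (mW) x active time (s); 1 mW*s = 1e6 nJ."""
    return POWER_MW[module] * (active_cycles / F_HZ) * 1e6

print(sram_power_mw(16))               # 10.22 mW for a 16-entry SRAM
print(module_energy_nj("bmult", 120))  # 22.5 nJ for 120 active cycles
```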
Table 5.3: Energy, area, and time performance models for Theorem 3

Metric             Performance model (Theorem 3)
Latency (cycles)   L_Thm3 = b^2 + b - 1; effective latency b^2
Power              per-state power functions P_Thm3.(S1,S3), P_Thm3.(S2,S4), and P_Thm3.(S5), each a sum of the module power functions of Table 5.2 (registers, block multiplier, adder, SRAM, reciprocal unit, multiplexers)
Duration (cycles)  per-state durations D_Thm3.(S1,S3), D_Thm3.(S2,S4), and D_Thm3.(S5), polynomials in b
Energy             E_Thm3 = P_Thm3.(S1,S3) D_Thm3.(S1,S3) + P_Thm3.(S2,S4) D_Thm3.(S2,S4) + P_Thm3.(S5) D_Thm3.(S5) + L_Thm3 P_Q
Area               A_Thm3 = b x (sum of the module areas of Table 5.2 per PE)

* P_A.(S) is the power dissipation of design A in state S; D_A.(S) is the total duration of design A in state S.

5.3.2 Design Trade-offs for Time and Energy Efficiency

To achieve time and energy efficient designs, we explore design parameters such as frequency, block size, precision, and number of PEs. All of these parameters contribute to the energy dissipation, latency, and area of a design. For example, the latency and energy of Corollary 3 are functions of the block size b and the sub-block size r. By choosing b = r = n, the minimum latency of n^2 cycles (546.1 us at 120MHz for n = 256) can be achieved. However, this design does not necessarily have minimum energy dissipation. We explore the parameters b and r and determine the values that minimize the energy dissipation. The estimates are based on 120MHz designs, the operating frequency that can be achieved after implementation (see Section 5.4). Figure 5.4 (a) shows the energy dissipation as a function of b and r for n = 256. The minimum energy is obtained around r = 16
Table 5.4: Energy, area, and time performance models for Theorem 4

Metric                                Performance model (Theorem 4)
Latency (cycles)                      latencies of the block operations L_opLU, L_opL, L_opU, and L_opMMS, combined over the n/b iterations to give L_Thm4
Power                                 per-state power functions of opLU, opL, and opU (equal to the Theorem 3 state powers) and of opMMS, expressed in the module powers of Table 5.2
Duration (cycles) / block operation   per-state durations of opLU, opL, opU, and opMMS
Energy                                E_Thm4 = E_opLU + E_opL + E_opU + E_opMMS + L_Thm4 P_Q, where each block-operation energy is its power-duration sum over states plus its BRAM energy
Area                                  A_Thm4 = b x (area of the opLU/opL/opU datapath modules) + r x (area of the opMMS datapath modules), using the area functions of Table 5.2

* P_A.(S) is the power dissipation of design A in state S; D_A.(S) is the total duration of design A in state S.
Table 5.5: Energy, area, and time performance models for Corollary 3

Metric                                Performance model (Corollary 3)
Latency (cycles)                      L_Cor3, combining L_opLU and L_opMMS over the n/b iterations (see Corollary 3)
Power                                 the same as the ones in Theorem 4
Duration (cycles) / block operation   the same as the ones in Theorem 4
Energy                                E_Cor3 = E_opLU + E_opL + E_opU + E_opMMS + L_Cor3 P_Q
Area                                  the same as the ones in Theorem 4

* P_A.(S) is the power dissipation of design A in state S; D_A.(S) is the total duration of design A in state S.

and b = 16, while the latency is 2743.5 us. Note that the energy-optimal design runs 5x longer than the latency-optimal design. Figure 5.4 (b) shows the energy distribution over the four operations when n = 256 and b varies. When b = 16, we achieve the minimum energy design, and opMMS is the dominant source of energy dissipation. Through design space exploration, we found that energy efficient designs are obtained with b = r = 16 for n > 32.

5.4 Design Synthesis, Optimization, and Simulation Methods

To obtain time and energy efficient designs, we briefly discuss the optimization techniques used in our designs. Then the synthesized designs for various problem sizes and the results from low-level simulations are presented.

Figure 5.4: (a) Energy dissipation as a function of b and r for n = 256, and (b) energy distribution as a function of b for n = 256

5.4.1 Optimizations for Time and Energy Efficiency

In this section, we summarize the energy efficient design techniques from Chapter 2 [33] employed in our designs. One such technique is architecture selection. FPGAs give the designer the freedom to map almost any architecture onto hardware. Different architectures have varying energy performance as well as latency, throughput, etc. In our design, we have chosen a linear array of processing elements.
In FPGAs, long interconnects dissipate a significant amount of power [119]. Therefore, for energy efficient designs, it is beneficial to minimize the number of long interconnects. A linear array of PEs accomplishes this goal: each processing element communicates only with its nearest neighbors, minimizing the use of long wires. Additionally, the linear array architecture facilitates the use of two more techniques: parallel processing and pipelining. Both parallel processing and pipelining decrease the effective latency of a design. Parallel processing does so by increasing the amount of resources, while pipelining does so by increasing resource utilization. By decreasing effective latency, both techniques can lead to lower energy dissipation. However, they can also lead to increased power dissipation, which can have a negative effect on energy dissipation. The designer must reach a compromise between low latency and high power in order to achieve a low energy design. Since our algorithm has several stages, such as addition, multiplication, and accumulation, the data can be pipelined from one stage to the next. Another technique we use is block disabling. In developing an algorithm, it is possible to design it such that it utilizes the clock gating technique [142] to disable modules that are not in use during the computation. In our design, since opMMS takes longer than the other operations, the set of PEs for opLU, opL, and opU becomes idle and is disabled to save energy. Another technique that we use is choosing appropriate bindings. In an FPGA, there can be many possible mappings of computation and storage elements to the actual hardware. For example, in Virtex-II, the storage LU can be implemented as registers, distributed SelectRAM (SRAM), or embedded Block SelectRAM (BRAM).
Each of these types of storage dissipates a different amount of energy and can lead to implementations with wide variation in energy dissipation. When the number of entries is greater than 64, BRAM is used, since it is energy efficient for large memories; otherwise, SRAM is used. Similar decisions can be made for other elements of the design, such as choosing between embedded (Block) multipliers and configured multipliers. In our design, we choose embedded multipliers, since they are energy efficient when both inputs are not constant. To implement the division unit, we use a lookup table approach. This technique is faster and uses less energy compared with other division algorithms [121]. To calculate a/b, we first obtain 1/b via a lookup table and perform the multiplication a x (1/b). The approach is effective if the multiplication is fast. Since we use the embedded multipliers, fast multiplication (within one cycle) can be performed. The lookup table for the reciprocal is generated as Inv(b) = Round(2^m / b), where b is the value to be inverted, m is the number of bits used to represent the output, and Round is the rounding function. Along with block disabling and BRAM, a large memory can be composed of smaller memory banks (memory banking), where each bank has its own enabling/disabling feature. By enabling only the necessary memory banks, the energy used by the memory is saved. For example, the design for Corollary 3 requires memory banks consisting of BRAMs. When n = 128, 16 BRAMs are required. However, opLU, opL, or opU requires one BRAM, and opMMS requires 3 BRAMs. By using memory banking, 3 BRAMs are used most of the time and 4 BRAMs are used at most. This leads to an 80% energy reduction in the memory.
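The reciprocal-table division described above can be sketched in software (m = 16 is our choice here, matching the 16-bit datapath; in hardware the table lives in a BRAM and the shift is implicit in the fixed-point multiply):

```python
M = 16  # output precision in bits (assumed; matches the 16-bit datapath)

def make_reciprocal_table(max_b):
    """Inv(b) = Round(2^M / b) for b = 1 .. max_b (index 0 unused)."""
    return [0] + [round((1 << M) / b) for b in range(1, max_b + 1)]

def lut_divide(a, b, table):
    """a / b approximated as (a * Inv(b)) >> M: one lookup plus one multiply."""
    return (a * table[b]) >> M

table = make_reciprocal_table(255)
print(lut_divide(100, 4, table), lut_divide(100, 3, table))  # 25 33
```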
5.4.2 Macro-Level Power and Resource Analyzers

While domain-specific modeling provides rapid high-level performance estimation, it is necessary to verify its accuracy. Thus, the candidate designs are implemented in VHDL and mapped onto the target FPGAs using the Xilinx ISE 5.2i design flow. After the designs are placed and routed, the PAR reports provide the resource utilization in terms of slices, I/O, Block multipliers, and BRAMs. XPower also provides lump-sum power values for the logic, nets, I/O, and clock. However, to validate the energy used by each module and the associated interconnects, it is necessary to analyze the design at the macro level after it is placed and routed. A macro corresponds to a module used in a design. We developed two analysis tools: the macro power analyzer (MPA) and the macro resource analyzer (MRA). The design flow of MPA and MRA is shown in Figure 5.5. To analyze the power dissipation of a particular module, Tcl commands of XPower are used [142]. The first purpose of MPA is to verify the high-level estimates against the power dissipation from low-level simulation. While the high-level model captures module power dissipation, the power used by the interconnect between components is not well considered. MPA obtains the power value for each wire of the interconnect and thus measures its power accurately. MPA also supports analysis of how architectural and algorithmic changes affect the power, and eventually the energy, dissipation of the modules and the interconnects.
Figure 5.5: Design flow of the macro power and resource analyzers. A VHDL design is synthesized and placed/routed (NCD), simulated in ModelSim (VCD), and fed to XPower; Tcl commands and the per-macro signal groups yield the macro-level power report (MPA), while NGD2XDL produces the XDL file used for the macro-level resource report (MRA).

In the design flow for MPA, a design is first implemented in VHDL, synthesized using Xilinx XST or Synplicity Synplify, and placed/routed using Xilinx PAR. After the design is placed and routed, an NCD file is generated. A testbench for the design is created, and its simulation is conducted using Mentor Graphics ModelSim; the simulation generates a VCD file. These two files, NCD and VCD, are input to XPower. Using the Tcl commands provided by XPower, MPA generates a detailed power dissipation report for logic, nets, I/O, and clock. Since the signals and variables of the VHDL design appear at the bit level in the XPower report, MPA groups bit-wise signals into word-wise modules and classifies them at the RT level. To obtain the power dissipation of each module, the power values of the bit-wide signals are accumulated. MPA also identifies the interconnect between modules and collects its power dissipation value. We then profile the power dissipation of the design in terms of the various modules and their interconnect. MPA is written in Python [111]. An FPGA consists of many resources such as slices (flip-flops, LUTs), BRAMs, Block multipliers, and interconnect wires. The interconnect is an important resource in an FPGA design and significantly affects the power dissipation [119]. In the Virtex-II (Pro) device, the available interconnects are long, hex, double, and direct wires.
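MPA's bit-to-word grouping step can be sketched as follows (the `<n>` bit-index naming convention and the function are illustrative assumptions, not the authors' code):

```python
import re
from collections import defaultdict

def group_signal_power(signal_power_mw):
    """Sum per-bit power entries (e.g. 'regT<3>') into per-signal word totals."""
    totals = defaultdict(float)
    for name, mw in signal_power_mw.items():
        base = re.sub(r"<\d+>$", "", name)  # strip the trailing bit index
        totals[base] += mw
    return dict(totals)

# Hypothetical bit-level entries as they might appear in an XPower report (mW)
report = {"regT<0>": 0.10, "regT<1>": 0.20, "mux_out<0>": 0.05, "clk": 1.00}
print(group_signal_power(report))
```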
While most resource utilization can be obtained from the current Xilinx design tools, detailed information on interconnect wire usage between slices or other resources is lacking. The XDL file, which is generated from the NCD file, provides all logic and interconnect information at the bit level. However, it does not provide macro-level information, which is critical for analyzing the impact of architectural/algorithmic modifications in our design. MRA is built on the XDL file to analyze the interconnect utilization. The grouped signals identified in the MPA design flow are reused in MRA. The XDL file and the grouped signals are input to MRA, which reports the interconnect (wire) utilization between modules as the numbers of long, hex, double, and direct wires. This information is used to assess how architectural/algorithmic decisions affect interconnect utilization. Together with MPA, this lets us optimize a design on an FPGA along several axes, such as a less interconnect-intensive architecture and a low interconnect-energy algorithm. All results in Section 5.5 are generated using these tools to analyze the designs, provide insight into FPGA design, and, more importantly, lead to energy efficient designs.

5.4.3 Simulation Methods

Using the performance models defined in Section 5.3, we identified energy and time efficient designs based on the parameters. By considering different criteria such as area, latency, and energy, we identified several designs. The minimal energy designs were chosen as candidate designs and implemented in VHDL. The precision of all designs was 16 bits. These designs were synthesized using XST in Xilinx ISE 5.2i, and the achieved frequency was 120MHz.
The place-and-route (.ncd) file was obtained for the Virtex-II XC2V1500 bg575-6 device. The input test vectors for the simulation were randomly generated such that their average switching activity was 30%. Mentor Graphics ModelSim 5.6b was used to simulate the designs and generate the simulation results as a .vcd file. These .vcd and .ncd files were then used by the Xilinx XPower tool to evaluate the average power dissipation. Energy dissipation was obtained by multiplying the average power by the latency. We also compared the estimates from Section 5.3 against actual values based on the implemented designs to test the accuracy of the performance estimation. We observe that the energy estimates (see Table 5.7) are within 10% of the simulation results.

Figure 5.6: Power dissipation over clock cycles for n = 4 (one panel per PE, PE1-PE4)

Figure 5.6 shows the power dissipation of each PE over the clock cycles for problem size 4. The power dissipation values are averages over 50 randomly generated matrices. Since the state of a PE varies over different clock cycles, the power used by a PE varies. Note that a PE with a higher index (for example, PE4 for n = 4) dissipates more than any PE with a lower index. This is because the high-index PE spends more time in state S1, which enables all blocks in a PE and uses more power.

5.5 Performance Comparison

Since we are not aware of any prior FPGA-based designs for LU decomposition, we implemented three baseline designs: one on FPGAs using a single PE, one on FPGAs using the design with the best known latency, and a software implementation on state-of-the-art TI DSPs.
While the designs run on the Virtex-II, the device draws a quiescent power of 188mW (from XPower). When we compare designs that are mapped onto FPGAs, we do not include the quiescent power. However, the quiescent power is included in the comparison between FPGA-based designs and DSP-based designs.

5.5.1 Uniprocessor and Theorem 3

For the first comparison, a uniprocessor design was developed and implemented on the same device. Figure 5.7 shows the architecture and algorithm. Storage A, of size b^2, is used to store the input, intermediate, and resulting matrices. Storage U, of size b, is used to store the single row of data generated in each iteration. Each iteration is executed in a sequential manner. The latency is b(b+1)(2b+1)/6 + 2b^2 cycles.

Note that the uniprocessor design uses a similar amount of storage (b^2 + b + 3) as the design in Theorem 3 (b^2 + 3b). The access rates for the registers are almost the same in both designs. However, the access rate for storage A and storage U in the uniprocessor design, O(b^3), is almost twice that of storage LU in the design for Theorem 3 (see Table 5.6). Figure 5.8 shows the energy distribution of the MAC (multiply-and-accumulate), memory, and miscellaneous modules for n = 16, 32, excluding the quiescent power. While the energy used by the MAC and miscellaneous modules is almost the same in both designs, the energy used by the memory in the uniprocessor design is almost twice that of the design in Theorem 3. Figure 5.9 shows the energy distribution including the quiescent energy used by the device. Since the latency of the uniprocessor design is longer, the quiescent energy becomes the dominant portion compared with the computation energy. Table 5.7 shows the comparison with the uniprocessor design.
The uniprocessor design is a very concise architecture and thus uses the smallest area. However, the design based on Theorem 3 outperforms it in terms of time and energy metrics. For example, when n = 32, the design based on Theorem 3 uses 36% less energy and runs 13.2x faster while using 27.6x more slices. If the quiescent energy is included, the design based on Theorem 3 uses 74% less energy. In minimizing the quiescent energy, the latency affects the energy performance significantly. We conclude that efficient parallelism and pipelining achieve a time and energy efficient design for LU decomposition.

Figure 5.7 (a) shows the architecture (storage A, storage U, and registers RegM, RegT, and RegR around the MAC datapath); Figure 5.7 (b) gives the algorithm:

(Input data)
for i = 1 to b, for j = 1 to b
    store a_ij (from A_in) in storage A;
end for; end for;
(Computation)
for k = 1 to b
    read a_kk and store it in RegT (u_kk = a_kk);
    RegR = 1 / RegT;
    for j = k+1 to b
        RegT = a_kj (u_kj = a_kj); storage U = RegT;
    end for;
    for i = k+1 to b
        l_ik = a_ik * RegR (store in RegM);
        for j = k+1 to b
            a_ij = a_ij - l_ik * u_kj (u_kj is read from storage U);
            store a_ij in storage A;
        end for;
    end for;
end for;
(Output data)
for i = 1 to b, for j = 1 to b
    output lu_ij from storage A to LU_out;
end for; end for;

Figure 5.7: (a) Architecture and (b) algorithm for a uniprocessor design

Table 5.6: Memory access rates of various modules in the uniprocessor design and the design in Theorem 3 (read and write counts, summed over the k-loop, for storage A, storage U, RegM, RegT, and RegR in the uniprocessor design, and for storage LU, RegM, RegT, and RegR in the design for Theorem 3; the storage A and storage U counts are roughly twice the storage LU counts)
Table 5.7: Performance comparison of the designs based on Theorem 3 and the uniprocessor design

Design                 Metric                n=4     n=8     n=12    n=16    n=24     n=32     n=48
Uniprocessor (120MHz)  Slices                128     128     128     128     128      128      160
                       BMult                 1       1       1       1       1        1        1
                       BRAM                  2       2       2       2       2        2        5
                       Latency (us)          0.5     2.8     7.8     16.7    50.4     112.4    355.3
                       E w/o Q (nJ)          19.6    157.2   532.7   1265.8  4421.5   10498.4  52626.2
                       E w/ Q (nJ)           116.7   677.3   2002.2  4411.7  13903.0  31629.6  119416.3
Design in Theorem 3    Slices                442     883     1325    1766    2650     3533     6528
(120MHz)               BMult                 4       8       12      16      24       32       48
                       BRAM                  4       8       12      16      24       32       48
                       Latency (us)          0.1     0.5     1.2     2.1     4.8      8.5      19.2
                       E w/o Q (est) (nJ)    14.9    109.1   357.3   833.8   2845.2   6689.4   25571.6
                       E w/o Q (sim) (nJ)    15.8    118.3   370.8   880.4   2988.5   7081.1   27045.0
                       E w/o Q (error) (%)   6       8       4       5       5        6        5
                       E w/ Q (sim) (nJ)     39.9    209.4   582.9   1234.8  3747.6   8293.7   29181.2
Comparison             More slices (times)   3.5     6.9     10.4    13.8    20.7     27.6     40.8
                       More BMult (times)    4.0     8.0     12.0    16.0    24.0     32.0     48.0
                       More BRAM (times)     2.0     4.0     6.0     8.0     12.0     16.0     9.6
                       Latency (speedup)     3.9     5.2     6.5     7.8     10.5     13.2     18.5
                       Less E w/o Q (%)      24      31      33      34      36       36       51
                       Less E w/ Q (%)       66      69      71      72      73       74       76

* Slices is the number of slices used in a design. BMult is the number of Block multipliers. BRAM is the number of Block SelectRAMs. Latency is the effective latency. E (est) is the estimated energy dissipation; E (sim) is the measured energy dissipation. E w/o Q and E w/ Q are the energy without and with the quiescent power, respectively.
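Two of the Table 5.7 rows can be re-derived numerically; a sketch assuming the uniprocessor latency expression b(b+1)(2b+1)/6 + 2b^2 as reconstructed in Section 5.5.1 (the helper names are ours):

```python
def uniprocessor_latency_cycles(b):
    """Uniprocessor LU latency: b(b+1)(2b+1)/6 update cycles plus 2b^2 I/O cycles."""
    return b * (b + 1) * (2 * b + 1) // 6 + 2 * b * b

def speedup(baseline_us, ours_us):
    return round(baseline_us / ours_us, 1)

def pct_less_energy(ours_nj, baseline_nj):
    return round((1 - ours_nj / baseline_nj) * 100)

# Latency column of Table 5.7 at 120 MHz, and the n = 32 comparison rows
print(uniprocessor_latency_cycles(32) / 120e6 * 1e6)  # 112.4 us
print(speedup(112.4, 8.5))                            # 13.2
print(pct_less_energy(8293.7, 31629.6))               # 74 (less E w/ Q)
```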
Memory 20000 Latency w 16000 12000 U niprocessor Theorem 3 D esign A Figure 5.8: Energy distribution of the uniprocessor design, the design in Theo rem 3, and the design A in [23] [99] for (a) n = 16 and (b) n — 32 excluding the quiescent power 1 6 4 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 4500 Q uiesc Memory ffslices c 1800 U niprocessor Theorem 3 D esign A 18 15 12 o 0 3 ( / ) > , o c C D 32000 Q uiesc I Misc. Memory C D 24000 - t5 t- Latency 16000 Uniprocessor Theorem 3 D esign A Figure 5.9: Energy (distribution of the uniprocessor design, the design in Theo rem 3, and the design A in [23] [99] for (a) n = 16 and (b) n = 32 including the quiescent power 16 5 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 5.5.2 Theorem 3 and Other Linear Array Architecture As discussed in Section 5.1, we choose the best known architecture and algorithm from [23] [99]. This design is implemented in the same design environment and its energy dissipation is measured. While the design in [23] [99] is easily implemented with a regular datapath, the memory access rate is higher than any other designs. Again, Figure 5.8 and Figure 5.9 show the energy distribution of this design. Note that the energy used by the memory is significantly more than any other designs. The quiescent energy is relatively smaller impact than one in the uniprocessor design. In Table 5.8, the designs based on [23] [99] and Theorem 3 are compared. For example, when n = 32, the design based on Theorem 3 uses 69% less energy, uses 20% less slices, and runs 2x faster than the design based on [23] [99]. 5.5.3 D SP and Corollary 3 We also compare the performance of LU decomposition on FPGAs and DSPs. FPGAs are known to be better than DSPs in terms of time and energy performance. 
Since many target applications for DSP devices and FPGAs are similar, comparing their time and energy performance is beneficial to designers. The design based on Corollary 3 is used, since large problem sizes are considered in this comparison. In addition to the energy used by the designs for Theorem 3 and the matrix multiplication in Corollary 1, the energy used by the memory banks that store the whole matrix is included. While the total amount of memory required is n^2, the memory is organized in small chunks. In Virtex-II, a BRAM holds a 16-bit x 1K memory, which can store one 32 x 32 block of matrix data. If n > 32 and b <= 32, the total number of such BRAM memory banks is the ceiling of n^2/1024. During the block LU decomposition, only the memory banks that hold the necessary blocks are enabled for the computation; the other banks can be disabled. For example, for n = 128 and b = 32, the total number of banks is 16 and the 32 x 32 blocks are stored over 16 BRAMs. During opLU, opL, or opU, only one BRAM needs to be enabled; during opMMS, three BRAMs need to be enabled.

Table 5.8: Performance comparison of the design A based on [23] [99] and the design based on Theorem 3

Design             Metric               n=4    n=8    n=12    n=16    n=24     n=32     n=48
Design A (120MHz)  Slices               494    1058   1621    2184    3310     4437     9096
                   BMult                4      8      12      16      24       32       48
                   BRAM                 1      1      1       1       1        1        1
                   Latency (us)         0.3    1.1    2.4     4.3     9.6      17.1     38.4
                   E w/o Q (nJ)         36.4   317.8  1102.0  2646.7  9863.0   23538.0  116198.7
                   E w/ Q (nJ)          86.6   518.4  1553.2  3448.8  11667.8  26746.5  123417.9
Comparison         More slices (times)  0.9    0.8    0.8     0.8     0.8      0.8      0.7
                   More BMult (times)   1.0    1.0    1.0     1.0     1.0      1.0      1.0
                   More BRAM (times)    4.0    8.0    12.0    16.0    24.0     32.0     48.0
                   Latency (speedup)    2.0    2.0    2.0     2.0     2.0      2.0      2.0
                   Less E w/o Q (%)     59     66     68      68      71       72       78
                   Less E w/ Q (%)      54     62     62      64      68       69       76
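The bank organization above reduces to a couple of small formulas; a sketch (function names are ours, and the enable counts follow the text):

```python
import math

def num_banks(n, words_per_bram=1024):
    """Number of 1K-word BRAM banks needed to hold an n x n 16-bit matrix."""
    return math.ceil(n * n / words_per_bram)

def banks_enabled(op):
    """BRAMs that must be enabled per block operation; the rest stay disabled."""
    return 3 if op == "opMMS" else 1   # opLU / opL / opU each touch one block

print(num_banks(128), banks_enabled("opMMS"), banks_enabled("opLU"))  # 16 3 1
```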
Figure 5.10 shows the effect of memory banking for n = 32, 128 and b = 16. For n = 32, the whole matrix fits in one BRAM and memory banking has no impact on the total energy (see Figure 5.10 (a)). However, for n = 128, memory banking saves 30% of the energy compared with no memory banking (see Figure 5.10 (b)). Figure 5.11 shows the energy distribution of the design based on Corollary 3 for various problem sizes and b = 16. Note that the fraction of the quiescent energy decreases as the problem size increases, since the quiescent energy grows more slowly with n than the computation energy. Also note that the fraction of energy used by the memory banks is small and constant with memory banking (see Figure 5.11 (a)), while without memory banking it becomes the dominant portion (see Figure 5.11 (b)).

Figure 5.10: Energy dissipation of the design in Corollary 3 including the quiescent power and the effect of memory banking for (a) n = 32 and (b) n = 128, where b = 16
Figure 5.11: Energy distribution of the design in Corollary 3 over various problem sizes (a) with memory banking and (b) without memory banking

To compare with the design based on Corollary 3, we chose the TI TMS320VC5510 running at 300MHz and the TI TMS320C6415 running at 600MHz as representative DSPs. The TMS320VC5510 is a low power DSP with two 16-bit MAC units (600 MIPS). The TMS320C6415 is a high performance DSP with eight 16-bit MAC units (4800 MIPS). The LU decomposition is implemented in C with 16-bit precision. The matrix multiplication is performed using the function DSP_mat_mul from the TI DSP library. The latency is obtained using TI Code Composer 2.1. To compute the energy dissipation, we assumed the 75% high / 25% low activity category of power dissipation for the DSP_mat_mul call, since it is hand-optimized code [133] [134]. The power dissipation for the rest of the C code is based on the 50% high / 50% low activity category, since that code is
Table 5.9: Performance of the designs based on Corollary 3

                                               Problem size (n)
Design                Metric                8      12     16     32     48      64      128
TI 5510 (300MHz)      Block size (b)        8      12     16     16     16      16      16
                      Latency (usec)        23.0   38.9   57.6   251.7  625.4   1221.7  6691.0
                      Energy (nJ)           3446   5834   8634   37758  93816   183252  1003656
TI 6415 (600MHz)      Block size (b)        8      12     16     16     16      16      16
                      Latency (usec)        2.9    4.9    7.2    31.5   78.2    152.7   836.4
                      Energy (nJ)           4221   7147   10577  46629  116804  229746  1282105
Corollary 3           Block size (b)        8      6      8      16     16      16      16
(Virtex-II, 120MHz)
  LU decomposition    Area: Slice           883    662    883    1766   1766    1766    1766
  (Theorem 3)               BMult           8      6      8      16     16      16      16
                            BRAM            8      6      8      16     16      16      16
  Matrix mult.        Area: Slice           0      693    924    1845   1845    1845    1845
  (Corollary 1)             BMult           0      6      8      16     16      16      16
                            BRAM            0      0      0      0      0       0       0
  Memory bank               BRAM            1      1      1      1      4       4       16
  Total               Area: Slice           883    1356   1807   3612   3612    3612    3612
                            BMult           8      12     16     32     32      32      32
                            BRAM            9      7      9      17     20      20      32
                      Latency (usec)        0.5    1.5    2.7    10.7   25.6    51.2    345.6
                      Energy (nJ)           117    301    687    5182   16374   37872   295507
Improvement over      Latency (speedup)     5.4    3.2    2.7    2.9    3.1     3.0     2.4
TI 6415 (600MHz)      Less energy (times)   36.0   23.7   15.4   9.0    7.1     6.1     4.3

* Slice is the number of slices; BMult is the number of block multipliers; BRAM is the number of Block SelectRAMs.
* All energy values for the FPGA designs include the quiescent energy.

Chapter 7 follows the FFT designs of the next chapter.

Chapter 6

Energy/Time Efficient and Parameterized Designs for Fast Fourier Transforms on FPGAs

In this chapter, we present energy efficient FFT designs on FPGAs. The FFT is the compute-intensive portion of broadband beamforming applications such as those generally used in SDR [91]. We investigate design techniques for minimizing the energy dissipated by FPGAs and apply the techniques for designing architectures and algorithms for FFT.
We identify the architectural parameters that characterize the FFT designs and affect their energy dissipation. A high-level energy performance model is developed using these parameters. This model is used to determine design trade-offs, estimate the energy efficiency, and arrive at energy efficient designs. A parameterized architecture is designed, so that by selecting appropriate parameter values, the architecture of a complete design can be easily synthesized. Our parameterized design has more flexibility than a soft IP core, because it exploits the degrees of parallelism and throughput.

We implement and simulate a set of designs on Virtex-II using Xilinx ISE tools to obtain energy dissipation values. We also compare the latencies, the area, and the energy dissipation of our designs with the Xilinx library based designs. We use both estimated values (based on the model) and actual values (based on the implemented designs) in our comparisons. These comparisons show that our designs can provide significant reductions in not only latency but also energy dissipation. We thus provide a parameterized architecture and a high-level model for fast estimation and implementation of energy efficient FFT designs.

The remainder of this chapter is organized as follows. Section 6.1.1 presents design techniques for minimizing the energy dissipation in FPGAs. Our parameterized FFT architectures are presented in Section 6.1.2. Section 6.2 shows our performance estimation and the implementations of the selected designs on FPGAs. We also compare our designs with Xilinx library based designs.

6.1 Energy and Time Efficient Design for FFT

In this section, we briefly discuss techniques that can be applied to FPGA-based designs to obtain energy efficiency.
Then we present our energy efficient, parameterized architectures for FFT on FPGAs, incorporating the aforementioned design techniques.

6.1.1 Energy Efficient Design Techniques

In the literature, there are many low-level power management techniques that lead to energy savings when applied to FPGA design [120, 148]. One such technique is clock gating, which is used to disable parts of the device that are not in use during the computation. In the Virtex-II family, clock gating can be realized by using primitives such as BUFGCE for dynamically driving a clock tree only when the corresponding logic is used. For example, FFT computation has many complex number multipliers to perform twiddle factor computations (multiplication followed by addition/subtraction). Because of the nature of the FFT algorithm, more than a quarter of the twiddle factors are 1, -1, j, or -j, and their computation can be bypassed. Thus, the implementation of twiddle factor computation can exploit clock gating to disable the unnecessary computation blocks. Choosing energy efficient bindings is another technique. A binding is a mapping of a computation to an FPGA component. The ability to choose the proper binding is due to the existence of several configurations for the same computation. For example, since FFT has a high data storage requirement, different bindings of the storage elements affect energy dissipation significantly. There are three possible bindings for storage elements in Virtex-II devices based on the number of entries: registers, slice based RAM (SRAM), and embedded Block RAM (BRAM). For large storage elements (those with more than 48 entries) BRAM shows an advantage in power dissipation over other implementations. A designer can analyze the trade-offs that
arise from various bindings based on the design requirements. A pipelined architecture might be chosen since many digital signal processing applications process a stream of data. For these applications with regular data flow, pipelining increases throughput. But pipelining might increase power dissipation since all logic in the design is continuously active. Since throughput is maximized, eventually, the energy dissipation is reduced. Moreover, since interconnect accounts for considerable power dissipation, the degree of parallelism and the depth of pipelining, which might increase interconnect and thus energy dissipation, have to be analyzed before implementation. Another technique is algorithm selection. A given application can be mapped onto FPGAs differently by selecting different algorithms. For example, in implementing the FFT, the choice of a radix-4 based algorithm significantly reduces the number of complex multiplications that would otherwise be needed if a radix-2 based algorithm were used. Thus the trade-offs between different algorithms and architectures should be analyzed to achieve energy efficient designs. More techniques are discussed in Chapter 2.

6.1.2 Energy Efficient Design for FFT

For FFT designs, we use the well known Cooley-Tukey method. The calculation of an N-point FFT requires O(N) operations for each of its log2(N) stages, so the total computation required is O(N log2 N) [103].

Due to the fact that, in practice, FFTs often process a stream of data, a pipelined architecture has been chosen. The N-point FFT design is based on the radix-4 algorithm. While there are many design parameters, we identify the parameters that determine the FFT architecture and eventually affect the energy dissipation. The parameterization is the key of our design since we explore the design space based on the parameters for energy efficiency.
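Two of the techniques above, bypassing trivial twiddle factors and selecting the radix-4 algorithm, can be quantified together by enumeration. The sketch below assumes the standard radix-4 decimation-in-frequency twiddle schedule (a twiddle W_N^k is trivial, i.e. 1, -1, j, or -j, exactly when k is a multiple of N/4); it is illustrative rather than a model of the exact designs in this chapter:

```python
def trivial_twiddle_fraction(n):
    """Count twiddle multiplications of a radix-4 DIF FFT whose twiddle
    factor is 1, -1, j, or -j, and return (trivial, total, fraction)."""
    trivial = total = 0
    m = n
    while m >= 4:                      # one iteration per radix-4 stage
        groups = n // m
        for _ in range(groups):
            for j in range(m // 4):    # butterflies within a group
                for r in (1, 2, 3):    # the three non-unity twiddle slots
                    total += 1
                    exponent = r * j * (n // m)   # twiddle is W_N^exponent
                    if exponent % (n // 4) == 0:  # trivial: 1, -1, j, or -j
                        trivial += 1
        m //= 4
    return trivial, total, trivial / total

t, tot, frac = trivial_twiddle_fraction(16)
print(t, tot, frac)   # well over a quarter of the multiplications are trivial
```

For N = 16, 16 of the 24 twiddle multiplications are trivial, which is consistent with the claim that more than a quarter of the twiddle multipliers can be gated off.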
There are five design parameters that characterize an N-point FFT design:

• Problem size (N)
• Degree of horizontal parallelism (Hp)
• Degree of vertical parallelism (Vp)
• Binding for storage elements
• Precision of data.

The horizontal parallelism determines how many radix-4 stages are used in parallel (1 ≤ Hp ≤ log4 N). Vertical parallelism determines the number of inputs being computed in parallel. Using the radix-4 algorithm, up to 4 inputs can be operated on in parallel. We have considered five basic building blocks: radix-4, data buffer, data path permutation, parallel-to-serial/serial-to-parallel mux, and twiddle factor computation (see Figure 6.1). Each individual block is parameterized, so that a complete design for any N can be obtained from combinations of the basic blocks:

Figure 6.1: (a) Data buffer (DB), (b) Twiddle factor computation (TW), (c) Data path permutation (PER), (d) Parallel-to-serial/serial-to-parallel mux (PS/SP), and (e) Radix-4 computation (R4)

• Radix-4 butterfly (R4): This block performs a set of additions and subtractions with 16 adders/subtracters. It takes four inputs and produces four outputs in parallel. Each input data has real and imaginary components. The complex number multiplication by 1, -1, j, or -j is implemented by remapping the input data paths and using adders/subtracters.

• Twiddle factor computation (TW): This block performs the complex number multiplication of the data with twiddle factors. The twiddle factors are obtained from a sine/cosine lookup table. Bypassing the multiplication when the value of the twiddle factor is 1, -1, j, or -j reduces computation and thus energy (by disabling the multipliers). This block contains 4 multipliers, 2 adders/subtracters, and two sign inverters.

• Data buffer (DB): This block consists of two RAMs having N/Vp entries each.
Data is written into one RAM and read from the other simultaneously. The read and write operations are switched after every N inputs. The data write and read addresses are at different strides determined by the architecture. For example, in an N = 16, single-input case, writing is done sequentially and reading is at strides of four.

• Data path permutation (PER): In the parallel architectures (Vp = 4), after computation of each stage, the data paths need to be permuted so that data can be accessed in parallel and in the correct order by the next stage. Dependencies occur due to stride accesses requiring data from either the same locations or the same RAMs. Figure 6.2 shows the permutation of the first stage for the 16-point FFT. There are 16 data (a0, ..., a15) and four data are fed to four data buffers each cycle. On the first clock cycle, four data (a0, a1, a2, a3) are stored in the first entry of each DB in parallel (see Figure 6.2 (a)). On the second clock cycle, another four data (a4, a5, a6, a7) are stored in the second entry of each DB with one location being permuted (see Figure 6.2 (b)). On the third and fourth clock cycles, the operation is performed in the same manner and the final result is shown in Figure 6.2 (c). Note that a4 is stored in the second buffer, not the first one. By doing these operations over 4 cycles, the first buffer has a0, a7, a10, a13, the second buffer a1, a4, a11, a14, and so on. Also, note that the four data a0, a4, a8, and a12 are stored in different DBs so that the radix-4 computation can be performed in parallel. The permutation occurs at every stage in the same manner.
• Parallel-to-serial/serial-to-parallel mux (PS/SP): This block is used when the data is fed into the radix-4 block in parallel and fed out serially in the serial architecture (Vp < 4). While the radix-4 module operates on four data in parallel, the rest of the architecture is flexible. Thus, to match the data rate, a parallel-to-serial mux before the radix-4 module and a serial-to-parallel mux after the radix-4 module are required.

Figure 6.2: Data permutation for DB at the first stage for the 16-point FFT (clock cycle (a) t = i, (b) t = i + 1, and (c) t = i + 3)

For example, a 16-point FFT algorithm has 2 radix-4 stages. In the design, we can use one or two radix-4 blocks (Hp = 1, 2) depending on the sharing of the radix-4 block resource. If Hp = 1, one radix-4 block is used and is shared by the first and second stages. Thus a feedback datapath is necessary, which decreases the throughput of the design. Figure 6.3 (a) shows the fully serial architecture and Figure 6.3 (b) shows an architecture for N = 16 where Vp = 1, Hp = 2. Figure 6.3 (c) shows a fully parallel architecture where Vp = 4, Hp = 2. This design has 12 data buffers, two radix-4 blocks, and 3 twiddle computation blocks.

Figure 6.3: Architectures for the 16-point FFT (a) (Hp, Vp) = (1,1), (b) (Hp, Vp) = (2,1), and (c) (Hp, Vp) = (2,4)

We also
develop the associated algorithms for the various architectures. Figure 6.4 describes the parallel algorithm for the architecture in Figure 6.3 (c). The variable P is used for horizontal parallelism (Hp = log4 N = 2) and there are four-way unrolled do parallel loops (Vp = 4). Quad, Dist, P, L, and R are used for indexing the data buffers in parallel.

  Quad = N/4; Dist = N/4;
  for P = 0 to log4(N)-1 do parallel      (note: horizontal parallelism, Hp = log4 N)
    for K = 0 to 4^P - 1 do
      L = 4*K*Quad/4^P;
      R = L + Quad/4^P - 1;
      idxTW1 = K; idxTW2 = 2*K; idxTW3 = 3*K;
      TW1 = w[idxTW1]; TW2 = w[idxTW2]; TW3 = w[idxTW3];
      for J = L to R do
        do parallel                       (note: vertical parallelism, Vp = 4)
          a_{P+1}[J]              = a_P[J] + a_P[J+Dist/4^P] + a_P[J+2*Dist/4^P] + a_P[J+3*Dist/4^P];
          a_{P+1}[J+Dist/4^P]     = a_P[J] - j*a_P[J+Dist/4^P] - a_P[J+2*Dist/4^P] + j*a_P[J+3*Dist/4^P];
          a_{P+1}[J+2*Dist/4^P]   = a_P[J] - a_P[J+Dist/4^P] + a_P[J+2*Dist/4^P] - a_P[J+3*Dist/4^P];
          a_{P+1}[J+3*Dist/4^P]   = a_P[J] + j*a_P[J+Dist/4^P] - a_P[J+2*Dist/4^P] - j*a_P[J+3*Dist/4^P];
        end parallel
        do parallel
          a_{P+1}[J+Dist/4^P]     = TW1 * a_{P+1}[J+Dist/4^P];
          a_{P+1}[J+2*Dist/4^P]   = TW2 * a_{P+1}[J+2*Dist/4^P];
          a_{P+1}[J+3*Dist/4^P]   = TW3 * a_{P+1}[J+3*Dist/4^P];
        end parallel
      end for
    end for
  end parallel for

Figure 6.4: Algorithm used for the architecture in Figure 6.3 (c)

6.2 Performance Estimation and Design Synthesis

Since the architecture is parameterized, we can generate all possible designs by varying the parameter values. However, rather than implementing and simulating all designs, we define the high-level model using the techniques in Chapter 3 and
Then the chosen candidate designs are implemented. Our target device is Virtex-II (speed grade -5) which is a high-performance, platform FPGA from Xilinx [144]. We have chosen the XC2V1500 for small FFT designs {N < 64) and XC2V3000 for large FFT designs (N > 64). These devices have 48 and 96 18 x 18-bit embedded multipliers, respectively. 6.2.1 Energy Performance Estim ation In FPGA designs with streams of data, throughput is an important factor in energy dissipation. Thus, in our pipelined design, the energy equation is E = P/Th, where P is the average power dissipation and Th is the throughput of the design. Note that 1/Th = L can be considered the effective latency of the design. The effective latency accounts for the benefits of overlapping computations in pipelining. Based on the architecture and algorithm in Section 6.1.2, it can be shown that the equation to calculate the latency (L), of computing an A-point, radix-4 FFT is: L = Alog4A/(VpX Ap), (6.1) where L is in cycles. To convert this latency to seconds, we merely divide by the clock frequency. We also know the types of FPGA components (multipliers, registers, etc.) and the amounts of each type of component that are used by for 182 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. five basic building block. We obtain the power function for each basic building block using the techniques in Chapter 3. We sum the average power dissipation of each blocks to estimate the total power dissipation. Since power is energy divided by latency, and we have earlier calculated latency, we can multiply the power by the latency to estimate the energy used in executing the algorithm. 
The power functions for the data buffer, the radix-4 block, the data path permutation, the parallel-to-serial/serial-to-parallel mux, and the twiddle computation block are P_DB, P_R4, P_PER, P_PS/SP, and P_TW, respectively, where P_DB = 1.23N + 35.44 (mW) using SRAM, P_DB = 0.0156N + 79.65 (mW) using BRAM, P_R4 = 142.84 (mW), P_PER = 54.09 (mW), P_PS/SP = 13.52 (mW), P_TW = 0.0054N + 183.57 (mW) using BRAM, P_TW = 0.4879N + 157.74 (mW) using SRAM, and P_IO = 44 (mW). Thus, the energy can be estimated as

E = L * {Vp(Hp + 1)P_DB + m*Hp*P_PER + Hp*P_R4 + 2s(Hp - 1)P_PS/SP + tv*th*P_TW + 2Vp*P_IO},    (6.2)

where m is the number of data path permutation blocks (m = 1 when Vp = 4, otherwise m = 0), s indicates the use of parallel-to-serial/serial-to-parallel muxes (s = 1 when Vp = 1, otherwise s = 0), and tv*th is the number of twiddle computation blocks (tv = Vp - 1 when Vp = 4, otherwise tv = Vp; th = Hp - 1 when Hp = log4 N, otherwise th = Hp).
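The resulting high-level estimator, combining Equation 6.1 with Equation 6.2 for the SRAM binding, can be sketched in a few lines. The block-count bookkeeping (m, s, tv, th) follows the definitions in the text, but treat it as an approximation of Equation 6.2 rather than a faithful reimplementation:

```python
import math

def fft_energy_nJ(n, vp, hp, clock_mhz=100.0):
    """Approximate Equation 6.2 for the SRAM binding: per-block power
    functions (mW) weighted by block counts, times latency (us) -> nJ."""
    p_db = 1.23 * n + 35.44            # data buffer, SRAM binding
    p_r4 = 142.84                      # radix-4 butterfly
    p_per = 54.09                      # data path permutation
    p_pssp = 13.52                     # parallel/serial muxes
    p_tw = 0.4879 * n + 157.74         # twiddle computation, SRAM binding
    p_io = 44.0

    stages = int(math.log2(n) // 2)    # log4(N)
    m = 1 if vp == 4 else 0            # permutation blocks only when Vp = 4
    s = 1 if vp == 1 else 0            # muxes only in the serial datapath
    tv = vp - 1 if vp == 4 else vp
    th = hp - 1 if hp == stages else hp

    power_mw = (vp * (hp + 1) * p_db + m * hp * p_per + hp * p_r4
                + 2 * s * (hp - 1) * p_pssp + tv * th * p_tw + 2 * vp * p_io)
    latency_us = n * stages / (vp * hp) / clock_mhz
    return power_mw * latency_us       # mW * us = nJ

# energy grows with N, and parallelism reduces it despite the extra logic
print(fft_energy_nJ(16, 1, 2), fft_energy_nJ(256, 1, 4), fft_energy_nJ(256, 4, 4))
```

The trend this produces (larger N costs more, and for a fixed N the fully parallel design is more energy efficient) is the same one shown in Figure 6.5, even where the absolute values differ from the implemented designs.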
We choose the minimal energy dissipation by selecting different parameter values {Hp = 4 and Vp = log4 A” , BRAM based). It shows that parallelism increases the energy efficiency of the FFT design despite increasing the area requirement. 6.3 Perform ance of S ynthesized D esign s All designs are parameterized based on N, Hp, Vp, binding, and precision as described in Section 6.1.2 and are implemented after coding in VHDL. Based on tlie performance estimation, we identified several designs for various problem sizes. We fixed the precision as 16 bits. These designs were synthesized using XST (Xilinx Synthesis Technology) in Xilinx ISE 4.1i [144]. The place-and-route file (.ncd file) 184 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 10000 8000 ts 8000 o Q . ^ 4000 < u L U 2000 — Biergy (SRAM-) - A - Biergy (BRAM) , , - • - A r e a (SRAM) ■ “ ‘ -4<— AreaiBRAMi / , s ' __________ 12000 10000 8000 u ) o > o 6000 3 . n ) < 4000 2000 (1, 1) ( 1,2 ) (1,4) (4,1) Design point (Vp, Hp) (4,2) (4,4) Figure 6.5: Energy and area estimates for various designs for N = 256 was obtained for Virtex-II XC2V1500 and XC2V3000. The input test vectors for the simulation are randomly generated and the average switching activity is 50%. Mentor Graphics Modelsim 5.5e was used to simulate the designs and generate simulation results (.vcd file). These two files are then provided to the XPower tool to evaluate the average power dissipation. The energy dissipation values are obtained by multiplying the average power by the latency. We observed that the error of energy estimation (see Table 6.1) is below 20% for the energy dissipation. The source of error mostly is due to the control logic since the estimation does not include it. If the estimates of designs are within 20% error range, the synthesized and simulated values need to be compared. We present low-level simulation results 1 8 5 R eproduced with perm ission of the copyright owner. 
for FFT using the Virtex-II FPGA running at 100MHz.

Figure 6.6: Energy distribution of modules in the FFT architecture for various design points (N = 256, BRAM based)

Table 6.1: Performance of our FFT designs

Problem    Our designs (100MHz)
size (n)   Vp  Hp  Binding   A     T      Eest     Em       Eerr  EAT
16         1   2   SRAM      1171  0.16   65.4     77.0     15%   0.014
           4   2   SRAM      2390  0.04   63.5     75.2     15%   0.007
64         1   3   SRAM      2266  0.64   552.4    493.3    12%   0.72
           1   3   BRAM      1613  0.64   464.2    390.4    19%   0.40
           4   3   SRAM      5690  0.16   393.9    418.7    6%    0.38
           4   3   BRAM      4193  0.16   403.2    400.4    1%    0.27
256        1   4   BRAM      2050  2.56   2582.2   2223.1   16%   11.67
           4   4   BRAM      5624  0.64   2203.2   1971.3   12%   7.10
1024       1   5   BRAM      2744  10.24  14963.5  13739.4  9%    386.06
           4   5   BRAM      6673  2.56   11424.7  9204.2   20%   157.23

* Eest is the estimated energy (nJ).
* Em is the measured energy (nJ) from the synthesized designs.
* Eerr is the error of the estimated energy.
* The unit of EAT is 1E-9.
* The unit of area (A) is slices; the unit of time (T) is usec.

Note that BRAM is chosen for the architectures for N ≥ 64 since BRAM can store large data using relatively less energy than SRAM. We also use FFT designs from the Xilinx library to compare with our designs. Xilinx provides various sizes of FFT in the CoreGen library. We run the Xilinx FFT at 100MHz. From the results in Table 6.2, our designs dissipate 57% to 78% less energy for the various problem sizes (see Table 6.3). It is clear that our designs can perform the computations both faster and more energy efficiently than the Xilinx based designs. If we use a comprehensive metric, EAT (Energy-Area-Time), our designs offer a performance improvement of 3-14x compared to the Xilinx designs.
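The derived EAT column is simply Em (J) x A (slices) x T (s), reported in units of 1E-9, and the ratios in the comparison table are the Xilinx-to-ours quotients of the corresponding columns. A check using the N = 64, (Vp, Hp) = (1, 3), SRAM row of Table 6.1 and the N = 64 row of Table 6.2 (values transcribed from the tables):

```python
def eat(energy_nJ, area_slices, time_us):
    """Energy-Area-Time metric in units of 1E-9 (J * slices * s * 1e9)."""
    return (energy_nJ * 1e-9) * area_slices * (time_us * 1e-6) * 1e9

# Our N = 64, (Vp, Hp) = (1, 3), SRAM design from Table 6.1
ours = eat(493.3, 2266, 0.64)
# Xilinx N = 64 design from Table 6.2
xilinx = eat(1785.6, 1079, 1.92)

print(round(ours, 2), round(xilinx, 2))   # -> 0.72 3.7
print(round(xilinx / ours, 1))            # EAT improvement, ~5x (Table 6.3: 5.17x)
print(round(1 - 493.3 / 1785.6, 2))       # energy reduction -> 0.72 (72%)
```

Both computed values reproduce the EAT entries (0.72 and 3.70) and the 72% energy reduction reported for this design point.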
Table 6.2: FFT performance of the Xilinx library based designs and the TI DSP based designs

Problem    Xilinx (100MHz)                   TI DSP (500MHz)
size (n)   A     T      Em       EAT         T      Em
16         136   0.16   179.6    0.04        0.17   199.9
64         1079  1.92   1785.6   3.70        0.60   716.4
256        1303  7.68   6927.3   69.32       2.53   3008.3
1024       1557  30.72  34283.5  1639.82     12.07  14365.7

* Em is the measured energy (nJ) from the synthesized designs.
* The unit of EAT is 1E-9.
* The unit of area (A) is slices; the unit of time (T) is usec.

Table 6.3: FFT performance comparison with the Xilinx library based designs

Problem    Our designs            Our designs vs. Xilinx designs
size (n)   Vp  Hp  Binding        E (dec.)  A (inc.)  T (dec.)  EAT (dec.)
16         1   2   SRAM           57%       0.86x     1.0x      2.71x
           4   2   SRAM           58%       1.75x     4.0x      5.45x
64         1   3   SRAM           72%       2.10x     3.0x      5.17x
           1   3   BRAM           78%       1.49x     3.0x      9.18x
           4   3   SRAM           77%       5.27x     12.0x     9.70x
           4   3   BRAM           78%       3.89x     12.0x     13.77x
256        1   4   BRAM           68%       1.57x     3.0x      5.94x
           4   4   BRAM           72%       4.32x     12.0x     9.77x
1024       1   5   BRAM           60%       1.76x     3.0x      4.25x
           4   5   BRAM           73%       4.29x     12.0x     10.43x

* dec. is the decrement; inc. is the increment.

We also compare our FPGA-based designs with a DSP based design. The Texas Instruments TMS320C6415 has been chosen since it is Texas Instruments' highest performance fixed point DSP [134]. This device can perform four 16x16-bit or eight 8x8-bit multiplications per clock cycle. To estimate the power dissipation, Texas Instruments classifies two activity models for the DSP: the high activity model and the low activity model [132]. High activity represents the time the DSP spends performing compute-intensive, optimized algorithms, such as a FIR filter. Low activity represents the time spent performing less compute-intensive tasks, such as setting up registers or executing less optimized algorithms. Texas Instruments then presents data for two power activity categories: 75% high/25% low and 50% high/50% low.
The former is for applications that spend 75% of their time in high activity and 25% of their time in low activity, and the latter is for applications whose time is split into 50% high activity and 50% low activity. Table 6.4 shows the average power dissipation when the DSP is run at clock frequencies of 500MHz and 600MHz. Note that these values do not include any external memory; we assume that all data can fit in the device's cache. We have determined that FFT more closely falls into the 75% high/25% low category because it is an optimized algorithm, spending more of its time in high activity. The design on the TI DSP runs at 500MHz. Table 6.2 and Table 6.5 compare the performance of our designs against the Xilinx design and the DSP solution from TI [132].

Table 6.4: Average power dissipation of the TMS320C6415

Clock            Operating     Power (W)
frequency (MHz)  voltage (V)   50%/50%   75%/25%
600              1.4           1.47      1.61
500              1.2           1.04      1.19

The TMS320C6415 results come from using Texas Instruments Code Composer Studio 2.1. We ran the DSP_fft algorithms from the Texas Instruments DSP library [132]. The latency is based directly on profiling. To compute the energy dissipation, we again assume the 75% high/25% low activity category, as described above. From the results in Table 6.2, our designs dissipate 31% to 72% less energy for the various problem sizes (see Table 6.5). It is clear that our designs can perform the computations both faster and more energy-efficiently than the TI DSP based designs. If we use a comprehensive metric, EAT (Energy-Area-Time), our designs offer a performance improvement of 1-5x compared to the TI DSP based designs.

Table 6.5: FFT performance comparison with the TI DSP based designs

Problem    Our designs            Our designs vs. TI DSP
size (n)   Vp  Hp  Binding        E (dec.)  T (dec.)
16         1   2   SRAM           61%       1.06x
           4   2   SRAM           62%       4.25x
64         1   3   SRAM           31%       0.94x
           1   3   BRAM           46%       0.94x
           4   3   SRAM           42%       3.75x
           4   3   BRAM           44%       3.75x
256        1   4   BRAM           68%       0.99x
           4   4   BRAM           72%       3.95x
1024       1   5   BRAM           60%       1.18x
           4   5   BRAM           73%       4.71x

* dec. is the decrement; inc. is the increment.

Energy efficiency is achieved using pipelining, appropriate disabling of the compute blocks, and choosing efficient memory bindings. Pipelining increases the throughput and decreases the effective latency. Disabling twiddle factor computation blocks reduces the amount of energy dissipated by more than 30%. Also, selecting the radix-4 algorithm as the basic building block reduces the number of complex multiplications in the design. Choosing bindings such that the data buffers and sine/cosine lookup tables are implemented using SRAM for N < 64 or BRAM for N ≥ 64 increases the energy efficiency. We also observed that while parallelism increases the throughput and eventually the energy efficiency, the energy used by the interconnect in FPGAs significantly increases. For example, the design with (Vp, Hp) = (4,1) dissipates 20% more energy than the design with (Vp, Hp) = (1,4) for N = 256, while the former has 2 times higher throughput. This is because the former uses more interconnect and memory elements.

Chapter 7

Conclusion and Future Research

The computational requirement of many recent DSP applications, including SDR [137] and the next generation wireless communications [47], can be met by ASICs. However, there is little or no flexibility in ASIC based designs. General purpose processors can provide a certain degree of flexibility with the programming of DSP functions and kernels in software.
However, the flexibility comes at the expense of lower performance and higher power dissipation than the ASICs. FPGAs can fill the gap between customized, high performance, non-flexible ASICs and lower performance, programmable, power-hungry microprocessors. However, even though FPGAs have been around for more than 20 years, the power and energy efficiency of FPGAs has only recently gained attention. Also, while many design methods have been developed to map applications onto FPGAs, the design methodology for high energy performance is not sufficiently developed.

This thesis introduced domain-specific energy modeling for rapid system-level energy estimation and algorithmic level optimization for configurable architectures. The modeling captures the details of the architecture and the algorithm to identify parameters affecting the power performance, hence facilitating the derivation of a system-wide energy function. Our modeling technique provides a virtual malleable data path. It is virtual because at the time of performance estimation we do not implement the design on the target FPGAs. It is malleable because there are several parameters that can be varied to understand the trade-offs between different performance metrics such as energy, area, and latency. Such characteristics make our proposed model suitable for large application synthesis where several different kernels are integrated to implement a system such as MPEG encoding or Software Defined Radio.

To develop energy efficient designs for signal processing kernels, we discussed algorithmic level techniques that can reduce the energy dissipation of FPGAs. Parallel processing along with pipelining is one energy efficient technique for performing computation on a stream of data. Compute-block disabling also has an immediate impact on energy savings.
We have demonstrated the techniques through the design of three signal processing kernel applications. Through both high-level estimation and low-level simulation, we have provided evidence supporting the idea that FPGAs can be more energy efficient than DSPs. We have shown this result by choosing representative FPGAs and representative DSPs.

Based on the high-level energy modeling technique, new algorithms and architectures were developed for matrix multiplication to significantly reduce the energy dissipation and latency, when compared with the state-of-the-art FPGA-based designs. These improve the best known design for matrix multiplication and provide trade-offs among energy, area, and latency. In our methodology, "energy hot spots," which dissipate most of the energy, are identified through energy estimation functions. Algorithmic level optimizations are performed to reduce the dissipation in these "energy hot spots" without an increase in latency. The latency of our designs is minimal compared with other architectures on a linear array. The energy performance is significantly better than the state-of-the-art FPGA based design and the best known linear array design.

Secondly, we developed time and energy efficient designs for LU decomposition on FPGAs. Before implementing the designs, we analyzed the architecture and algorithm to understand the design trade-offs. An algorithmic technique, the block based approach, was employed for decomposition and multiplication of large matrices, and the block size that results in the energy minimal design was obtained. The latency of our designs is minimal compared with other architectures on a linear array. The energy performance is significantly better than the state-of-the-art FPGA based design and the TI DSP based design.
Thirdly, to develop energy efficient designs for the FFT, we discussed a method to extract various design parameters that significantly affect the parallelism and the energy performance. Based on these parameters, flexible and parameterized FFT architectures and algorithms were proposed. We observed that the interconnects dissipate a significant amount of energy in a parallel architecture. Performance estimation and design trade-off analyses were performed to identify energy efficient designs.

For all three digital signal processing kernels, we provided high-level performance models in order to explore the design space rapidly. After pruning the design space, selected designs were implemented using VHDL on the Xilinx Virtex-II FPGAs through the Xilinx ISE design environment. Low-level simulations were also performed to evaluate the chosen designs.

Our optimized algorithms and architectures for matrix multiplication, matrix factorization, and FFT have achieved significant performance improvements compared with the state-of-the-art designs. For example, our design for matrix multiplication based on Theorem 1 saves 29% to 51% of the system-wide energy dissipation for matrices of sizes 3 x 3 to 48 x 48, when compared with the design from the Xilinx library. Latency is reduced by a factor of 3 to 15, while area is increased by a factor of 1.9 to 9.4.

7.1 Future Research Directions

In this thesis, we have presented high-level performance modeling and algorithmic-level design techniques to develop energy efficient designs on configurable architectures. To the best of our knowledge, this is one of the earliest efforts to address the power and energy performance in FPGAs and develop energy efficient mappings for digital signal processing applications.
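The prune-then-pick flow above can be sketched as: enumerate candidate design points with their high-level (energy, area, latency) estimates, discard those violating the constraints, and keep the minimum-energy survivor for low-level implementation. The candidate tuples below are made-up examples, not designs from this thesis.

```python
# Hedged sketch of design space pruning under area/latency constraints.
# Each candidate: (name, energy_nJ, area_slices, latency_us) - all
# values are illustrative high-level estimates.

candidates = [
    ("p1_serial",    950.0, 1200, 96.0),
    ("p4_pipelined", 410.0, 3900, 24.0),
    ("p8_pipelined", 380.0, 7600, 12.0),
]

def prune(designs, max_area, max_latency_us):
    # Keep only designs meeting both the area and latency constraints.
    return [d for d in designs
            if d[2] <= max_area and d[3] <= max_latency_us]

# Pick the minimum-energy design among the survivors.
best = min(prune(candidates, max_area=4000, max_latency_us=30.0),
           key=lambda d: d[1])
```

Under these constraints the 8-PE design is pruned for area and the serial design for latency, leaving the 4-PE pipelined point as the energy-minimal feasible choice.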
There are several research areas that have to be addressed in the future to extend this research to emerging architectures and applications. We briefly outline some of these areas:

• Enhancing high-level performance modeling: Domain-specific modeling is well suited to estimating the regular data paths that are typically used in many DSP kernels. When we impose a regular data path such as a linear array architecture onto FPGAs, the usage of long wires can often be avoided. This leads to very accurate energy estimation based on our module-based power model. However, if the interconnections between modules are long and the modules are not placed regularly, more accurate models for the interconnect should be developed and geometry-based energy estimation used. Another extension of high-level modeling is to apply it to ASIC-based design. The same modeling techniques can be used by identifying the component power functions for various modules using a similar design methodology.

• Exploring future power control knobs: While we have shown many energy optimization techniques, state-of-the-art FPGAs (e.g., Virtex-II/Pro) do not provide low-power features such as dynamic voltage scaling with dynamic frequency scaling. If such features appear in future FPGAs, our design techniques can be easily ported to those devices and can exploit the new features effectively at the algorithmic level. For example, the operations opL and opU in the block LU decomposition can be executed slower than the operation opMMS (defined in Section 5.2.2). This might provide the opportunity to use a lower frequency and voltage, which leads to energy savings. Also, the startup power, the reconfiguration power, and the static and leakage power of Virtex-II/Pro are relatively higher than those of other devices.
If FPGAs are used in a system which is in a sleep mode most of the time and is active only for a certain period (duty cycle), current FPGAs will use a significant amount of energy during wake-up and reconfiguration. This can be improved when the power-hungry SRAM used for the configuration memory is replaced with non-volatile memory such as magnetoresistive memory, ferroelectric memory, or ovonic unified memory [135]. For example, the Actel ProASIC Plus [2] uses flash memory to store the configuration data when the device is off, but the flash is too slow to keep up with current high-performance and fast-reconfiguration requirements. Using future low-power features such as control knobs for multiple power states, one area of particular interest is analyzing the influence that duty cycle can have on application mapping. For example, one design may be more energy efficient in performing a given application, but it may dissipate more energy while idle than is allowable in the energy budget for the implementation. Choosing the right mapping in this situation involves not only the type of analysis presented in this thesis, but also analysis of specific usage scenarios and the availability of low-power modes in each type of design.

• High-level performance modeling and design methodology at the application level for Reconfigurable-System-On-Chip (RSOC): We focused on developing high-level performance modeling and energy efficient designs on FPGAs for various kernels. However, DSP applications typically consist of many kernels. Thus, high-level performance modeling and a design methodology for the energy performance at the application level are required. Figure 7.1 describes a possible general design methodology including our kernel-level approach (the right side of the dotted line in Figure 7.1).
A design methodology (the left side of Figure 7.1) is required to address the problem of system synthesis for a complete application on an FPGA with an embedded processor such as the Virtex-II Pro [145]. It involves design of data paths for each application kernel, design of a controller to schedule execution of each kernel, and implementation of the data paths on the FPGA. The expectation in terms of performance is an energy efficient design subject to latency and area constraints. RSOCs such as the Triscend Configurable SoC [136] devices are being used to implement many embedded systems, where energy efficiency is a major concern. RSOCs incorporate many different components, such as a processor core, FPGA fabric, and memory. Various power management techniques can be applied to these components. Since kernels within an application can be implemented as hardware in the FPGA fabric or as software in the processor core, task mapping based on energy efficiency, including the communication cost between them, creates a large design space. The reconfiguration costs incurred under different mappings also significantly impact the overall system energy dissipation. In order to achieve energy efficient designs on RSOCs, a performance model that abstracts a general class of RSOC architectures for application development is required. Also, a mathematical formulation of the energy efficient mapping problem for various applications is required [105] [107].

• Design language and tools to support energy efficient designs for digital signal processing applications on FPGAs: While VHDL and Verilog serve as the well-known design languages for most FPGA-based designs, other approaches using different languages and tools for designing DSP applications have been offered [1] [25] [147].
However, there are very few approaches that consider the energy performance in the design tools [89] [106]. Design languages and tools that automatically explore a large design space considering constraints such as energy, area, and time are needed.

Figure 7.1: Design methodology at the application level with the kernel level

Reference List

[1] AccelChip Inc. AccelChip DSP Synthesis. http://www.accelchip.com, 2003.
[2] Actel Corporation. ProASIC Plus Data Sheet. http://www.actel.com, 2003.
[3] A. Agrawal, A. Bakshi, J. Davis, B. James, A. Ledeczi, S. Mohanty, V. Mathur, S. Neema, G. Nordstrom, V. K. Prasanna, C. Raghavendra, and M. Singh. MILAN: A Model Based Integrated Simulation for Design of Embedded Systems. Languages, Compilers, and Tools for Embedded Systems (LCTES), June 2001.
[4] Altera Corporation. Stratix Data Sheet. http://www.altera.com, 2002.
[5] Altera Corporation. Excalibur Data Sheet. http://www.altera.com/products/devices/excalibur/exc-index.html.
[6] S. Aluru, N. Futamura, and K. Mehrotra. Parallel Biological Sequence Comparison Using Prefix Computations. International Parallel Processing Symposium & Symposium on Parallel and Distributed Processing (IPDPS), April 1999.
[7] A. Amira, A. Bouridane, and P. Milligan. Accelerating Matrix Product on Reconfigurable Hardware for Signal Processing. International Conference on Field Programmable Logic and Applications (FPL), 2001.
[8] ATLAS. http://math-atlas.sourceforge.net.
[9] D. A. Bader, B. M. E. Moret, and M. Yan. A Linear-Time Algorithm for Computing Inversion Distance Between Two Signed Permutations with an Experimental Study. Journal of Computational Biology, 8(5):483-491, October 2001.
[10] J. Bae, S. Choi, N. Park, and V. K. Prasanna. Hardware/Software Codesign for a Low Bit-Rate Videoconferencing System. International Conference on Systems Engineering, 1996.
[11] J. Bae, S. Choi, N. Park, and V. K. Prasanna. Design of an Architecture for Low Bit-Rate Videoconferencing System. Multimedia Technology & Applications Conference (MTAC), March 1996.
[12] J. Bae, N. Park, S. Choi, and V. K. Prasanna. Design of a Low Bit-Rate Videoconferencing System. Wescon Conference, October 1996.
[13] B. Bass. A Low-Power, High-Performance, 1024-Point FFT Processor. IEEE Journal of Solid-State Circuits, 34(3):380-387, 1999.
[14] J. Becker, T. Pionteck, and M. Glesner. DReAM: A Dynamically Reconfigurable Architecture for Future Mobile Communication Applications. International Conference on Field Programmable Logic and Applications (FPL), 2002.
[15] J. Becker. Configurable Systems-on-Chip: Challenges and Perspectives for Industry and Universities. International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), 2002.
[16] P. Belanovic and M. Leeser. Library of Parameterized Floating-Point Modules and Their Use. International Conference on Field Programmable Logic and Applications (FPL), September 2002.
[17] A. Bogliolo, L. Benini, and G. De Micheli. Regression-based RTL Power Modeling. ACM Transactions on Design Automation of Electronic Systems, 5(3), 2000.
[18] B. L. Bowerman and R. T. O'Connell. Linear Statistical Models: An Applied Approach, 2nd edition, Brooks/Cole Pub Co., 2000.
[19] G. Brebner and N. Bergman. Reconfigurable Computing in Remote and Harsh Environments. International Conference on Field Programmable Logic and Applications (FPL), 1999.
[20] S. Brown and J. Rose. FPGA and CPLD Architectures: A Tutorial. IEEE Design & Test of Computers, 13(2):42-57, Summer 1996.
[21] D. Buell, J. Arnold, and W. Kleinfelder. Splash 2: FPGAs in a Custom Computing Machine, IEEE Computer Society Press, 1996.
[22] D. Caliga and D. P. Barker. Delivering Acceleration: The Potential for Increased HPC Application Performance Using Reconfigurable Logic. ACM/IEEE Conference on Supercomputing, November 2001.
[23] E. Casseau and D. Degrugillier. A Linear Systolic Array for LU Decomposition. VLSI Design, 1994.
[24] S. Casselman. Virtual Computing and The Virtual Computer. IEEE Workshop on FPGAs for Custom Computing Machines (FCCM), April 1993.
[25] Celoxica Inc. DK1.1 Design Suite. www.celoxica.com.
[26] C. Chen and M. Sarrafzadeh. An Effective Algorithm for Gate-Level Power-Delay Tradeoff Using Two Voltages. International Conference on Computer Design, 1999.
[27] J. Choi, J. J. Dongarra, L. S. Ostrouchov, A. P. Petitet, D. W. Walker, and R. C. Whaley. The Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines. Scientific Programming, 5:173-184, 1996.
[28] S. Choi, J. Bae, and V. K. Prasanna. Synthesis of Memory-Based VLSI Architectures for Discrete Wavelet Transforms. European Signal Processing Conference (EUSIPCO-96), September 1996.
[29] S. Choi, Y. Chung, and V. K. Prasanna. Configurable Hardware for Symbolic Search Operations. IEEE International Conference on Parallel and Distributed Systems (ICPADS), December 1997.
[30] S. Choi, J.-W. Jang, S. Mohanty, and V. K. Prasanna. Domain-Specific Modeling for Rapid System-Wide Energy Estimation of Reconfigurable Architectures. International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), June 2002.
[31] S. Choi, J.-W. Jang, and V. K. Prasanna. Minimizing Energy Dissipation of Matrix Multiplication Kernel on Virtex-II. Reconfigurable Technology: FPGAs & Reconfigurable Processors for Computing and Applications, SPIE Information Technologies and Communications (ITCOM), July 2002.
[32] S. Choi, R. Scrofano, and V. K. Prasanna. Energy-Efficient Design of Kernel Applications for FPGAs Through Domain-Specific Modeling. Military and Aerospace Programmable Logic Devices (MAPLD), September 2002.
[33] S. Choi, R. Scrofano, V. K. Prasanna, and J.-W. Jang. Energy-Efficient Signal Processing Using FPGAs. ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), February 2003.
[34] S. Choi and V. K. Prasanna. Time and Energy Efficient Matrix Factorization. International Conference on Field Programmable Logic and Applications (FPL), September 2003.
[35] S. Choi, G. Govindu, J.-W. Jang, and V. K. Prasanna. Energy-Efficient and Parameterized Designs of Fast Fourier Transforms on FPGAs. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2003.
[36] S. Choi, J.-W. Jang, S. Mohanty, and V. K. Prasanna. Domain-Specific Modeling for Rapid Energy Estimation of Reconfigurable Architectures. Special Issue on Configurable Computing of the Journal of Supercomputing, 26(3):259-281, November 2003.
[37] S. Choi and V. K. Prasanna. Time and Energy Efficient Matrix Factorization. Journal of Parallel and Distributed Computing (submitted), 2004.
[38] P. Chou, R. Ortega, K. Hines, K. Partridge, and G. Borriello. IPCHINOOK: An Integrated IP-Based Design Framework for Distributed Embedded Systems. Design Automation Conference (DAC), 1999.
[39] E. Chu and A. George. Inside the FFT Black Box, CRC Press, 2000.
[40] Y. Chung, S. Choi, and V. K. Prasanna. Parallel Object Recognition on an FPGA-Based Configurable Computing Platform. International Workshop on Computer Architectures for Machine Perception (CAMP), October 1997.
[41] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms, 2nd edition, McGraw-Hill, 2001.
[42] A. Dandalis and V. K. Prasanna. Signal Processing Using Reconfigurable System-on-Chip Platforms. International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), June 2001.
[43] A. DeHon. The Density Advantage of Configurable Computing. IEEE Computer, 33(4):41-49, April 2000.
[44] J. W. Demmel, J. R. Gilbert, and X. Li. SuperLU Users' Guide. Technical report, University of California at Berkeley, 1999.
[45] S. Derrien, S. Rajopadhye, and S. Sur-Kolay. Optimal Partitioning for FPGA Based Regular Array Implementations. International Conference on Parallel Computing in Electrical Engineering, 2000.
[46] D. Deshpande, A. K. Somani, and A. Tyagi. Configuration Scheduling Schemes for Striped FPGA. ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 1999.
[47] M. Devlin. How to Make Smart Antenna Arrays. Xcell Journal Online, Issue 45, http://www.xilinx.com/publications/xcellonline, 2003.
[48] C. Dick. The Platform FPGA: Enabling the Software Radio. Software Defined Radio Technical Conference and Product Exposition (SDR), November 2002.
[49] O. Diessel and H. ElGindy. On Dynamic Task Scheduling for FPGA-Based Systems. International Journal of Foundations of Computer Science, 12(5):645-669, 2001.
[50] H. ElGindy and Y.-L. Shue. On Sparse Matrix-Vector Multiplication with FPGA-Based Systems. IEEE Symposium on Field Programmable Custom Computing Machines (FCCM), April 2002.
[51] B. Fagin and C. Renard. Field Programmable Gate Arrays and Floating Point Arithmetic. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2(3):365-367, September 1994.
[52] O. D. Fidanci, D. Poznanovic, K. Gaj, T. El-Ghazawi, and N. Alexandridis. Performance and Overhead in a Hybrid Reconfigurable Computer. Reconfigurable Architectures Workshop (RAW), 2003.
[53] J. A. B. Fortes, K. S. Fu, and B. Wah. Systematic Approaches to the Design of Algorithmically Specified Systolic Arrays. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1985.
[54] J. Frigo, M. Gokhale, and D. Lavenier. Evaluation of the Streams-C C-to-FPGA Compiler: An Applications Perspective. ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2001.
[55] A. Garcia, W. Burleson, and J. L. Danger. Power Modelling in FPGAs. International Conference on Field Programmable Logic and Applications (FPL), 1999.
[56] V. George, H. Zhang, and J. Rabaey. The Design of a Low Energy FPGA. International Symposium on Low Power Electronics and Design, 1999.
[57] J. Gerlach and W. Rosenstiel. Development of a High-Level Design Space Exploration Methodology. Research Project Report, WSI 98-13, University of Tuebingen, 1998.
[58] Generic Modeling Environment. http://www.isis.vanderbilt.edu.
[59] G. Govindu, L. Zhuo, S. Choi, P. Gundala, and V. K. Prasanna. Area and Power Performance Analysis of a Floating-Point Based Application on FPGAs. High Performance Embedded Computing (HPEC), September 2003.
[60] G. Govindu, S. Choi, V. Daga, V. K. Prasanna, S. Gangadharpalli, and Sridhar V. A High-Performance and Energy-Efficient Architecture for Floating-Point Based LU Decomposition on FPGAs. Reconfigurable Architectures Workshop (RAW), 2004.
[61] G. Govindu, S. Choi, and V. K. Prasanna. Analysis of High-Performance Floating-Point Arithmetic on FPGAs. Reconfigurable Architectures Workshop (RAW), 2004.
[62] S. Haykin. Adaptive Filter Theory, 4th edition, Prentice Hall, 2002.
[63] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach, 3rd edition, Morgan Kaufmann Publishers, 2003.
[64] R. Hogg and E. Tanis. Probability and Statistical Inference, pages 656-657, 6th edition, Prentice Hall, 2001.
[65] J.-W. Hong and H. T. Kung. I/O Complexity: The Red-Blue Pebble Game. ACM Symposium on Theory of Computing (STOC), 1981.
[66] J. Hwang, F. Chiang, and T. T. Hwang. A Re-engineering Approach to Low Power FPGA Design Using SPFD. Design Automation Conference (DAC), 1998.
[67] H. Lim and E. Swartzlander. A Systolic Array for 2-D DFT and 2-D DCT. IEEE International Conference on Application Specific Array Processors (ASAP), 1994.
[68] Institute of Electrical and Electronics Engineers, editors. IEEE 754 Standard for Binary Floating-Point Arithmetic, 1984.
[69] J.-W. Jang, S. Choi, and V. K. Prasanna. Energy-Efficient Matrix Multiplication on FPGAs. International Conference on Field Programmable Logic and Applications (FPL), 2002.
[70] J.-W. Jang, S. Choi, and V. K. Prasanna. Area and Time Efficient Implementation of Matrix Multiplication on FPGAs. IEEE International Conference on Field-Programmable Technology (ICFPT), December 2002.
[71] J.-W. Jang, S. Choi, and V. K. Prasanna. Energy-Efficient Matrix Multiplication on FPGAs. IEEE Transactions on VLSI (TVLSI), submitted.
[72] J. Liang, R. Tessier, and O. Mencer. Floating Point Unit Generation and Evaluation for FPGAs. IEEE Symposium on Field Programmable Custom Computing Machines (FCCM), April 2003.
[73] P. Kogge, V. Freeh, K. Ghose, N. Toomarian, and N. Aranki. Morph: Adding an Energy Gear to a High Performance Microarchitecture for Embedded Applications. Kool Chips Workshop, MICRO-33, December 2000.
[74] D. Kumar and K. Parhi. Performance Trade-off of DCT Architectures in Xilinx FPGAs. Asilomar Conference on Signals, Systems, and Computers, 1999.
[75] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin Cummings, 1993.
[76] H. T. Kung and C. E. Leiserson. Algorithms for VLSI Processor Arrays. In Introduction to VLSI Systems, C. Mead and L. Conway, Addison-Wesley, Chapter 8.3, 1980.
[77] M. Lam, E. Rothberg, and M. E. Wolf. The Cache Performance and Optimizations of Blocked Algorithms. International Conference on Architectural Support for Programming Languages and Operating Systems, 1991.
[78] S. Lei and K. Yao. Efficient Systolic Array Implementations of Digital Filtering. IEEE International Symposium on Circuits and Systems, 1989.
[79] M. P. Leong and P. H. Leong. A Variable-Radix Digit-Serial Design Methodology and Its Application to the Discrete Cosine Transform. IEEE Transactions on VLSI Systems (TVLSI), 2002.
[80] Y.-H. Lu and G. De Micheli. Comparing System-Level Power Management Policies. IEEE Design & Test of Computers, 18, March-April 2001.
[81] W. Luk, N. Shirazi, and P. Y. K. Cheung. Modelling and Optimizing Run-time Reconfigurable Systems. IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), 1996.
[82] W. Luk, P. Andreou, A. Derbyshire, F. Dupont-De-Dinechin, J. Rice, N. Shirazi, and D. Siganos. A Reconfigurable Engine for Real-time Video Processing. International Conference on Field Programmable Logic and Applications (FPL), 1998.
[83] Models, Algorithms, and Architectures for Reconfigurable Computing Project. http://maarcii.usc.edu.
[84] P. Master and P. M. Athanas. Reconfigurable Computing Offers Options for 3G. Wireless Systems Design, January 1999.
[85] V. Mathur and V. K. Prasanna. A Hierarchical Simulation Framework for Application Development on System-on-Chip Architectures. IEEE International ASIC/SOC Conference, 2001.
[86] M. Martina, G. Masera, G. Piccinini, F. Vacca, and M. Zamboni. FPGA Fully Reconfigurable Lifting Kernel for Multimedia Processing. International Conference on Very Large Scale Integration, The Global System on Chip Design and CAD Conference, December 2001.
[87] W. J. C. Melis, P. Y. K. Cheung, and W. Luk. Image Registration of Real-Time Broadcast Video Using the UltraSONIC Reconfigurable Computer. International Conference on Field Programmable Logic and Applications (FPL), 2002.
[88] O. Mencer, M. Morf, and M. Flynn. PAM-Blox: High Performance FPGA Design for Adaptive Computing. IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), 1998.
[89] Model-based Integrated Simulation. http://milan.usc.edu.
[90] P. Mishra, N. Dutt, and A. Nicolau. Functional Abstraction Driven Design Space Exploration of Heterogeneous Programmable Architectures. International Symposium on System Synthesis (ISSS), 2001.
[91] J. Mitola III. Software Radio Architecture: Object-Oriented Approaches to Wireless Systems Engineering, John Wiley & Sons, Inc., 2000.
[92] Model Technologies. ModelSim. http://www.model.com.
[93] S. Mohanty, V. K. Prasanna, S. Neema, and J. Davis. Rapid Design Space Exploration of Heterogeneous Embedded Systems Using Symbolic Search and Multi-Granular Simulation. Languages, Compilers, and Tools for Embedded Systems (LCTES), 2002.
[94] S. Mohanty, S. Choi, J. Jang, and V. K. Prasanna. A Model-Based Methodology for Application Specific Energy Efficient Data Path Design Using FPGAs. IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 2002.
[95] T. Mudge. Power: A First-Class Architectural Design Constraint. IEEE Computer, 34, April 2001.
[96] W. Najjar, W. Böhm, B. Draper, J. Hammes, R. Rinker, R. Beveridge, M. Chawathe, and C. Ross. From Algorithms to Hardware - A High-Level Language Abstraction for Reconfigurable Computing. IEEE Computer, August 2003.
[97] Nallatech. www.nallatech.com.
[98] J. G. Nash. Automatic Latency-Optimal Design of FPGA-based Systolic Arrays. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), April 2002.
[99] J. J. Navarro, J. M. Llaberia, F. J. Nunez, and M. Valero. LU Decomposition on a Linear Systolic Array Processor. International Journal of Mini and Microcomputers, 11(1):4-8, 1989.
[100] S. Neema. System Level Synthesis of Adaptive Computing Systems. Ph.D. Dissertation, Vanderbilt University, Department of Electrical and Computer Engineering, May 2001.
[101] M. Nemani and F. N. Najm. High-Level Area and Power Estimation for VLSI Circuits. IEEE/ACM International Conference on Computer-Aided Design, 1997.
[102] H. Noori, H. Pedram, A. Akbari, and S. Sheidaei. FPGA Implementation of a DSP Core for Full Rate and Half Rate GSM Vocoders. International Conference on Microelectronics, 2000.
[103] A. V. Oppenheim and R. W. Schafer. Discrete-Time Signal Processing, Prentice Hall, 1989.
[104] J. Ou, S. Choi, and V. K. Prasanna. Modeling and Energy-Efficient Application Mapping of Configurable SoC Architectures. Technical Report, Electrical Engineering, University of Southern California, October 2002.
[105] J. Ou, S. Choi, and V. K. Prasanna. Performance Modeling of Reconfigurable SoC Architectures and Energy-Efficient Mapping of a Class of Applications. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), April 2003.
[106] J. Ou, S. Choi, G. Govindu, and V. K. Prasanna. Creating Parameterized and Energy-Efficient System Generator Designs. Annual Military and Aerospace Programmable Logic Devices (MAPLD) International Conference, 2003.
[107] J. Ou, S. Choi, and V. K. Prasanna. Energy-Efficient Hardware/Software Co-Synthesis for a Class of Applications on Reconfigurable SoCs. To appear in International Journal of Embedded Systems.
[108] T. P. Plaks. Configuring of Algorithms in Mapping into Hardware. Journal of Supercomputing, Issue 2, February 2002.
[109] A. Pimentel, L. Hertzberger, P. Lieverse, P. van der Wolf, and E. Deprettere. Exploring Embedded-Systems Architectures with Artemis. IEEE Computer, November 2001.
[110] V. K. Prasanna Kumar and Y. Tsai. On Synthesizing Optimal Family of Linear Systolic Arrays for Matrix Multiplication. IEEE Transactions on Computers, 40(6), 1991.
[111] Python. http://www.python.org.
[112] H. Quinn, M. Leeser, and L. S. King. Implementing Image Processing Pipelines in a Hardware/Software Environment. High Performance Embedded Computing Workshop, 2002.
[113] J. Rabaey. Reconfigurable Processing: The Solution to Low-Power Programmable DSP. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1997.
[114] A. Raghunathan, N. K. Jha, and S. Dey. High-Level Power Analysis and Optimization, Kluwer Academic Publishers, 1998.
[115] J. Razavilar, F. Rashid-Farrokhi, and K. J. Ray Liu. Software Radio Architecture with Smart Antennas: A Tutorial on Algorithms and Complexity. IEEE Journal on Selected Areas in Communications, 17(4), April 1999.
[116] R. Rinker, J. Hammes, W. Najjar, W. Böhm, and B. Draper. Compiling Image Processing Applications to Reconfigurable Hardware. IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 2000.
[117] I. Sahin, C. S. Gloster, and C. Doss. Feasibility of Floating-Point Arithmetic in Reconfigurable Computing Systems. Military and Aerospace Programmable Logic Devices (MAPLD) Conference, September 2000.
[118] R. Scrofano, S. Choi, and V. K. Prasanna. Energy Efficiency of FPGAs and Programmable Processors for Matrix Multiplication. International Conference on Field-Programmable Technology (ICFPT), 2002.
[119] L. Shang and N. K. Jha. High-Level Power Modeling of CPLDs and FPGAs. International Conference on Computer Design, 2001.
[120] L. Shang, A. Kaviani, and K. Bathala. Dynamic Power Consumption in Virtex-II FPGA Family. IEEE International Symposium on Field Programmable Gate Arrays (FPGA), 2002.
[121] N. Shirazi, A. Walters, and P. Athanas. Quantitative Analysis of Floating Point Arithmetic on FPGA Based Custom Computing Machines. IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), 1995.
[122] R. P. Sidhu, K. Bondalapati, S. Choi, and V. K. Prasanna. Computation Models for Reconfigurable Machines. International Symposium on Field-Programmable Gate Arrays (FPGA), Poster, February 1997.
[123] P. Singer. The Optimal Detector. SPIE Signal and Data Processing of Small Targets, August 2002.
[124] N. Srivastava, J. L. Trahan, R. Vaidyanathan, and S. Rai. Adaptive Image Filtering Using Run-Time Reconfiguration. Reconfigurable Architectures Workshop (RAW), 2003.
[125] A. Stammermann, L. Kruse, W. Nebel, and A. Prastsch. System Level Optimization and Design Space Exploration for Low Power. International Symposium on System Synthesis, 2001.
[126] A. Stoica, R. Zebulum, D. Keymeulen, R. Tawel, T. Daud, and A. Thakoor. Reconfigurable VLSI Architectures for Evolvable Hardware: From Experimental Field Programmable Transistor Arrays to Evolution-Oriented Chips. IEEE Transactions on VLSI Systems, Special Issue on Reconfigurable and Adaptive VLSI Systems, pages 227-232, February 2001.
[127] O. Storaasli, R. C. Singleterry, and S. Brown. Scientific Computations on a NASA Reconfigurable Hypercomputer. Military and Aerospace Programmable Logic Devices (MAPLD) Conference, September 2002.
[128] J. Stroomer, J. Ballagh, H. Ma, B. Milne, J. Hwang, and N. Shirazi. Creating System Generator Designs Using jg. IEEE Symposium on Field Programmable Custom Computing Machines (FCCM), 2003.
[129] H. Styles and W. Luk. Customising Graphics Applications: Techniques and Programming Interface. IEEE Symposium on Field Programmable Custom Computing Machines (FCCM), 2000.
[130] J. Sztipanovits and G. Karsai. Model-Integrated Computing. IEEE Computer, April 1997.
[131] R. Tessier and W. Burleson. Reconfigurable Computing and Digital Signal Processing: A Survey. Journal of VLSI Signal Processing, May/June 2001.
[132] Texas Instruments. http://www.ti.com.
[133] Texas Instruments. TMS320VC5510 Power Consumption Summary. http://www.ti.com.
[134] Texas Instruments. TMS320C64xx Power Consumption Summary. http://www.ti.com.
[135] N. Tredennick and B. Shimamoto. Go Reconfigure. IEEE Spectrum, 40(12), December 2003.
[136] Triscend Inc. http://www.triscend.com.
[137] W. Tuttlebee. Software Defined Radio: Enabling Technologies, J. Wiley, 2002.
[138] J. Villarreal, D. Suresh, G. Stitt, F. Vahid, and W. Najjar. Improving Software Performance with Configurable Logic. Kluwer Journal on Design Automation of Embedded Systems, 2002.
[139] E. W. Wanek, J. F. Bogdanowicz, R. E. Maylone, R. Scrofano, S. Mohanty, J. Ou, E. Andreev, S. Choi, and V. K. Prasanna. Energy and Latency Efficient Design of a Personnel Detection and Tracking System. High Performance Embedded Computing (HPEC), September 2003.
[140] E. W. Wanek, J. F. Bogdanowicz, R. E. Maylone, R. Scrofano, S. Mohanty, J. Ou, E. Andreev, S. Choi, and V. K. Prasanna. Designing an Energy- and Latency-Efficient Personnel Detection and Tracking System Using MILAN. Annual Raytheon Company Processing Systems Technology Network's Technology Expo (PSTN), 2003.
[141] F. C. Wolff, M. J. Knieser, D. J. Weyer, and C. A. Papachristou. High-Level Low Power FPGA Design Methodology. National Aerospace and Electronics Conference, 2000.
[142] Xilinx Inc. http://www.xilinx.com.
[143] Xilinx Inc. Two Flows for Partial Reconfiguration: Module Based or Small Bit Manipulations. http://www.xilinx.com.
[144] Xilinx Inc. Xilinx Application Note: Virtex-II Series and Xilinx ISE 4.1i Design Environment. http://www.xilinx.com, 2002.
[145] Xilinx Inc. Virtex-II (Pro) Platform FPGA User Guide. http://www.xilinx.com.
[146] Xilinx Inc. Xilinx ISE 5.2i Design Environment. http://www.xilinx.com, 2003.
[147] Xilinx Inc. System Generator for DSP. http://www.xilinx.com.
[148] G. Yeap. Practical Low Power Digital VLSI Design, Kluwer Academic Publishers, 1998.