DYNAMIC LOGIC SYNTHESIS FOR RECONFIGURABLE HARDWARE

by

Andreas Dandalis

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2001

Copyright 2001 Andreas Dandalis

UNIVERSITY OF SOUTHERN CALIFORNIA
The Graduate School
University Park
Los Angeles, California 90089-1695

This dissertation, written by Andreas Dandalis under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Date: December 17, 2001

Dedication

To my family

Acknowledgements

I would like to thank my advisor at USC, Dr. Viktor K. Prasanna, for his guidance and support throughout my Ph.D. program. His academic leadership has been an inspiration in the development of my professional skills and in my understanding of research. Besides his technical and academic advisement, Dr. Prasanna has always been helpful in balancing my academic responsibilities with my personal life. I feel extremely fortunate and honored to have worked with Dr. Prasanna over the past few years. He is truly a brilliant person and an inspiring advisor.
I would also like to thank the members of my qualifying and dissertation committees, Dr. Peter Beerel, Dr. Jean-Luc Gaudiot, Dr. Ashish Goel, Dr. Doug Ierardi, and Dr. Chris Kyriakakis, for their suggestions. I would also like to thank Dr. Cauligi S. Raghavendra and Dr. Jose D. P. Rolim for their support and encouragement in my research. I also thank the U.S. Department of Defense, the U.S. National Science Foundation, and the USC Integrated Media Systems Center for supporting my research and providing opportunities to frequently visit and interact with researchers all over the world.

In addition, I am grateful to the members of the MAARC research group for making this joint effort a success. I have shared thoughts on research and other matters over daily lunches with Kiran Bondalapati, who has been an exceptional sounding board for all my ideas and an exceptional office mate. I would also like to thank the other founding members of the group, Seonil Choi and Reetinder Sidhu, as well as the later members Mark Redekopp, Bharani Thiruvengadam, and Sameer Wadhwa. Through the years, several students of Dr. Prasanna have made my Ph.D. adventure an enriching experience: Ammar Alhusaini, Amol Bakshi, Prashant Bhat, Myungho Lee, Young Won Lim, Wenheng Liu, Vaibhav Mathur, Sumit Mohanty, Neungsoo Park, Michael Penner, Mitali Singh, Thrasyvoulos Spyropoulos, and Jinwoo Suh, among others. Special thanks go to Henryk Chrostek and Christine Contreras for dealing with all the bureaucratic issues throughout my Ph.D. program. Special thanks go also to several friends who have been very valuable during my Ph.D. adventure: Georgios Asmanis, Spyridon Bakiras, Christos Chrysafis, and Elizabeth Kunda, among others.

Last but not least, I am grateful to my parents, Petros and Georgia, and my brother Gianni, for being extremely encouraging and patient with not only my graduate studies but also my endless personal pursuits throughout my life.

Contents

Dedication ii
Acknowledgements iii
List of Tables viii
List of Figures ix
Abstract xii

1 Introduction 1
1.1 Dissertation Contributions 5
1.1.1 Domain-Specific and Instance-Aware Mapping 5
1.1.2 Application Demonstration 6
1.1.3 Configuration Compression 7
1.2 Dissertation Outline 7

2 Reconfigurable Computing 10
2.1 Reconfigurable Devices 11
2.1.1 Field Programmable Gate Arrays 13
2.1.2 Coarse-grained Architectures 17
2.2 RC Application Execution 18
2.3 RC Hardware Design Tools 20

3 Domain-Specific and Instance-Aware Mapping 22
3.1 Challenges for Dynamic RC 23
3.2 Related Work 25
3.3 Our Approach 28

4 Adaptive Cryptographic Engine 33
4.1 Introduction 34
4.2 Private-Key Cryptography for IPSec 37
4.2.1 Private-Key Cryptography using FPGAs 41
4.3 Adaptive Cryptographic Engine 44
4.3.1 Cryptographic Library Design 46
4.4 Implementation Decisions and Results 49
4.4.1 Implementation Results 54
4.4.1.1 MARS 56
4.4.1.2 RC6 57
4.4.1.3 Rijndael 58
4.4.1.4 Serpent 59
4.4.1.5 Twofish 59
4.4.2 Key-Setup Latency Improvements 60
4.4.3 Comparative Analysis of Our FPGA Implementations 61
4.4.4 Related Work 65
4.4.5 Comparison with Software and ASIC Implementations 67

5 Graph Problems and Matrix Multiplication 71
5.1 The Single-Source Shortest Path Problem 72
5.1.1 Mapping the Bellman-Ford Algorithm 74
5.1.2 Area and Time Performance Estimates 77
5.1.3 Performance Comparisons 79
5.2 Matrix Multiplication 83
5.2.1 Experimental Results 87

6 Parallel Deduction Engine for SAT Solvers 93
6.1 Introduction 94
6.2 Backtrack Search Algorithms and Deduction Process 95
6.3 State-of-the-art 98
6.4 An Adaptive Parallel Architecture for the Deduction Process 100
6.4.1 Hardware Organization 103
6.4.2 Run-time Mapping 106
6.4.3 Run-Time Performance Optimization 108
6.5 Experimental Results 111

7 Configuration Compression 116
7.1 Introduction 117
7.2 FPGA Configuration 121
7.3 Compression Techniques: Applicability and Implementation Cost 123
7.4 Related Work 127
7.5 Our Compression Technique 129
7.5.1 Basic LZW Algorithm 130
7.5.2 Compact Dictionary Construction 132
7.5.3 Enhancement of the Dictionary Representation 135
7.5.4 Configuration Decompression 140
7.6 Experiments and Compression Results 142
7.6.1 Single Configurations 144
7.6.2 Sets of Configurations 148

8 Conclusions and Future Research 152
8.1 Future Directions 157

List of Tables

1.1 Spectrum of Reconfiguration [77] 4
2.1 FPGA Technology Advancements [77] 13
4.1 Implementation Results 55
4.2 Performance Comparisons with FPGA Implementations [39, 42] 67
4.3 Performance Comparisons with Software Implementations [6, 10] 69
4.4 Performance Comparisons with ASIC Implementations [74] 70
5.1 Experimental Results for m* 73
5.2 Performance Comparisons with the Solution in [9] 80
5.3 Matrix Multiplication on CS2112 90
6.1 Speedup Compared with the Baseline Solution (p = 1) 114
7.1 Compression Ratios for Single Configurations 145
7.2 Dictionary and Index Memory Requirements for Single Configurations 146
7.3 Compression Ratios for Sets of Configurations 150
7.4 Dictionary and Index Memory Requirements for Sets of Configurations 151
List of Figures

2.1 Typical FPGA Architecture 14
2.2 Xilinx Platform FPGA 16
3.1 From a High-Level Design Description to a Working Implementation 24
3.2 Domain-Specific and Instance-Aware Mapping 29
3.3 Details of Our Approach 30
4.1 Architecture of ACE 45
4.2 Key-setup Latency Comparisons of Our FPGA Implementations 62
4.3 Throughput Comparisons of Our FPGA Implementations 63
4.4 Area Comparisons of Our FPGA Implementations 65
5.1 The Skeleton Architecture for the Bellman-Ford Algorithm 75
5.2 The Module Architecture (left) and the Placement of the Modules into an Array of FPGAs (right) 76
5.3 Time Performance Comparison: Our Approach vs. Software Implementation 82
5.4 PE Architecture 85
5.5 Chameleon CS2000 88
5.6 CS2000 Reconfigurable Processing Fabric 88
5.7 Memory Access and Requirements Compared with the Solution in [37] 91
6.1 Speedup Compared with p = 1 for Various SAT Instances 102
6.2 Parallel Deduction Engine Architecture and Module Details 104
6.3 UPT in terms of p 109
6.4 The Product of UPT with the Corresponding Speedup in terms of p 109
6.5 Hardware Area in terms of v and B 112
6.6 Clock Speed in terms of v and B 112
7.1 FPGA-based Embedded System Architecture 126
7.2 Our Configuration Compression Technique 130
7.3 An Illustrative Example of Our Dictionary Representation 134
7.4 An Illustrative Example of Memory Organization for the Dictionary and the Index 135
7.5 An Illustrative Example of Enhancing the Dictionary Representation 136
7.6 Our Decompression-based Reconstruction of the Configuration Bit-stream 141
7.7 Conventional Read of the Configuration Bit-stream 141

List of Algorithms

1 The Bellman-Ford Algorithm 72
2 The LZW Algorithm 131
3 Our Heuristic - Phase 1 138
4 Our Heuristic - Phase 2 139

Abstract

The enormous advances in process technology over the past decade have enabled reconfigurable devices to be attractive for a wide spectrum of applications compared with conventional fabrics such as microprocessors, DSPs, and ASICs. The unique capability of reconfigurable devices for post-fabrication and application-specific hardware customization has the potential to deliver ASIC-like performance with microprocessor-like flexibility. However, application development for reconfigurable devices has been dominated by the conventional ASIC design flow, which prevents them from achieving their full potential; reconfigurable devices are treated only as programmable devices, while hardware reconfiguration is not considered part of application execution. This dissertation addresses the fundamental challenges in mapping applications onto reconfigurable hardware based on run-time parameters.

A novel approach is proposed that is based on algorithm-specific and instance-aware hardware reconfiguration. Initially, based on the semantics of a given algorithm and the target reconfigurable device, algorithm-specific configurations are derived off-line. Then, based on these algorithm-specific configurations, instance-aware reconfiguration occurs to adapt the hardware to the given run-time parameters. The performance metric is the effective execution time. This consists of the time required to map the application onto hardware and the time required to execute the application on hardware. The objective is to dramatically reduce the contribution of the mapping time to the effective execution time and thus improve the overall performance.

Using the proposed approach, significant performance improvements are achieved for a wide range of applications compared with the state-of-the-art. The applications of interest include private-key cryptography for Internet security, graph problems, matrix arithmetic operations, and boolean satisfiability. Known solutions for these applications either focus on improving only the execution time on reconfigurable hardware, ignoring the mapping time, or fail to explore run-time hardware reconfiguration to improve the overall performance.

To enhance the feasibility of our approach in embedded environments and to enable multi-personality FPGA-based embedded products, a configuration compression technique is developed that minimizes the memory requirements for storing configuration data. Such a compression approach is essential to cope with the tremendous growth in the size of configuration bit-streams. Our approach requires minimal decompression hardware and does not affect the time required to configure a reconfigurable device. Experimental results demonstrate that near-optimal compression ratios can be achieved for configuration bit-streams of real-world applications.
Chapter 1

Introduction

The continuous evolution of reconfigurable hardware over the past decade has led to reconfigurable devices that promise to deliver the performance benefits of hardware with the flexibility of software for a wide spectrum of applications. Advances in semiconductor technology have led to high-density devices with faster clock speeds at a lower cost. In addition, the advent of near-billion-transistor chips has allowed the development of device architectures with advanced system features (e.g., interconnect, I/O, clock management, embedded processors). As a result, besides the initial application domain of circuit emulation and rapid prototyping, reconfigurable devices have proven advantageous for application domains such as network protocol processing, wireless communication, signal and image processing, cryptography, and genome database scanning, among others.

Typical reconfigurable devices consist of a matrix of configurable logic blocks laid over a programmable interconnection network. By configuring the logic blocks and the interconnection network, application-specific circuits can be created. Moreover, these hard-wired circuits can be reconfigured on demand to adapt to different application requirements. The most representative class of reconfigurable devices is Field Programmable Gate Arrays (FPGAs). FPGAs are the first devices to go onto the newest process technologies at the highest transistor counts and densities. Current estimates suggest that by the year 2005, FPGAs will consist of over 2 billion transistors, offering over 50 million system gates, several 1 GHz embedded processors, high-performance MAC and floating-point units, and over 10 Gbits/sec I/O channels.

The performance advantage of reconfigurable devices lies in the unique combination of hardware parallelism, hardware specialization, and hardware reuse. Hardware parallelism and specialization are essential to achieve superior performance for several application domains compared with general-purpose processors and Digital Signal Processors (DSPs). General-purpose processors and DSPs are mostly amenable to a serial (or limited-parallelism) fashion of computing, while reconfigurable devices favor parallel implementations. On the other hand, by customizing the hardware to the requirements of the computations, reconfigurable hardware becomes the software. As a result, the fetch/decode overhead of instruction-centric computing is eliminated. Moreover, application-specific circuits can lead to superior performance compared with the general-purpose functional units integrated in microprocessors and DSPs.

While hardware parallelism and specialization are crucial to meet the demanding requirements of applications, the key characteristic of reconfigurable devices is the ability of post-fabrication customization of pre-tested silicon. Such post-fabrication customization is the key advantage of reconfigurable devices over Application Specific Integrated Circuits (ASICs), which cannot be reconfigured after fabrication. For one thing, reconfiguration eliminates design risks, significantly reduces non-recurrent and recurrent engineering costs, and leads to fast time-to-market. For another, reconfiguration can revolutionize Reconfigurable Computing (RC), that is, computing using reconfigurable devices.
As shown in Table 1.1, the spectrum of reconfiguration extends from field upgrades to evolving logic. For field upgrades, adaptive products, and multi-personality products, reconfiguration occurs off-line, and the hardware configuration remains static during application execution. In such an approach, the design objective is to explore the configuration space determined by the device to derive an application-specific implementation that meets the application requirements. As a result, reconfigurable devices are considered as general-purpose ASICs, and therefore an ASIC-based solution would always lead to superior performance. However, reconfigurable devices can lead to lower overall cost while meeting the application requirements. Even so, such a static approach prevents RC from achieving its full potential to utilize reconfiguration as part of application execution.

Table 1.1: Spectrum of Reconfiguration [77]

Once in a while: Field Upgrades
Turn-on: Adaptive Products
Application: Multi-personality Products
Task: Task-dependent Logic
Continuous: Evolving Logic

Exploring reconfiguration at run time is probably the only way that RC can become more advantageous than ASIC-based solutions, not only in terms of cost-effectiveness but also in terms of time performance. Algorithm-specific and instance-aware reconfiguration at run time is the key to RC. Hardware circuits optimized for a specific algorithm or for a specific problem instance can prove superior compared with application-specific implementations. Deriving such circuits at run time is the major challenge for RC and requires novel models, tools, and techniques.

This dissertation addresses the critical issues involved in run-time mapping of applications onto reconfigurable devices. We identify the fundamental challenges in reconfiguring hardware based on run-time parameters. A novel mapping approach is introduced that can lead to efficient solutions. Using our approach, we demonstrate solutions for several applications that incorporate dynamic hardware customization based on run-time parameters. Finally, an innovative configuration compression technique is proposed that is essential for minimizing the memory requirements for storing configuration data in embedded environments. The research results of this dissertation have been extensively published in major international conferences, symposiums, and workshops [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 60, 72]. The basic contributions of the dissertation and the dissertation outline are listed in the following.

1.1 Dissertation Contributions

This dissertation addresses several critical issues in mapping applications onto reconfigurable devices based on run-time parameters. Our focus is on developing a novel mapping approach that can enable application execution based on run-time parameters. Several applications are considered to demonstrate the feasibility of our approach, as well as to explore this new research field. This dissertation is one of the earliest efforts to address the problem of dynamic logic synthesis for reconfigurable hardware, that is, deriving hardware configurations based on run-time parameters.
1.1.1 Domain-Specific and Instance-Aware Mapping

We develop a novel approach for fast run-time mapping by exploiting properties of a specific domain and a given problem instance. The specific domain is defined by the algorithm semantics and by the architecture characteristics of the target device. The key idea is that the mapping problem is handled as an algorithm synthesis problem, as opposed to a conventional hardware synthesis approach. Algorithm-specific configurations are generated off-line to facilitate instance-aware hardware customization at run time. The proposed approach significantly reduces the time required to map a given application onto hardware and results in implementations with predictable time performance. Our performance metric is the effective execution time, which includes the execution time on hardware as well as the time required to obtain the configuration and the time required to configure the hardware. Our approach adds value on top of existing tools to facilitate dynamic logic synthesis by overcoming their current limitations: excessive mapping time and poor performance predictability.

1.1.2 Application Demonstration

The domain-specificity of our approach necessitates experimentation with several application domains, both to confirm the feasibility and applicability of our approach and to explore this novel field of mapping. We focus on the following domains:

1. Multi-algorithm applications in adaptive embedded environments. The run-time parameter is the algorithm based on which reconfiguration occurs. The hardware configuration remains static during application execution. Such a domain is essential in the advent of next-generation reconfigurable devices.

2. Instance-dependent applications in contemporary RC systems. The run-time parameter is the problem instance based on which the hardware is customized. The hardware configuration remains static during application execution, or is further modified based on the knowledge gained during problem solving. Contemporary RC systems typically incorporate a host machine to control the application execution on reconfigurable hardware.

1.1.3 Configuration Compression

We also develop a novel configuration compression technique for embedded systems based on reconfigurable devices. Our technique is based on the principles of dictionary-based compression. We address both the problem of minimizing memory requirements for storing configuration data and that of optimizing decompression efficiency. Decompression efficiency corresponds to the decompression hardware cost and decompression rate. Configuration compression is essential as the density of reconfigurable devices is rapidly increasing, resulting in an excessive amount of data required to configure a device. The proposed approach enhances the feasibility of our mapping approach in RC embedded environments, as well as decreases the cost of multi-personality FPGA-based embedded products.
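Since the compression technique of Chapter 7 builds on the basic LZW algorithm (cf. Section 7.5.1 in the Contents), a minimal sketch of textbook LZW compression may help fix ideas. This is standard LZW over a byte stream, not the compact-dictionary variant developed in Chapter 7; the function name is illustrative.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Textbook LZW: grow a dictionary of previously seen strings
    and emit one index per longest dictionary match."""
    # Seed the dictionary with all single-byte strings (codes 0..255).
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    output = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate                 # keep extending the match
        else:
            output.append(dictionary[current])  # emit the longest match
            dictionary[candidate] = next_code   # learn the new string
            next_code += 1
            current = bytes([byte])
    if current:
        output.append(dictionary[current])
    return output

# Repetitive data, such as configuration bit-streams with regular
# structure, compresses well: long runs collapse into single codes.
codes = lzw_compress(b"ABABABABABAB")
print(len(codes), "codes for 12 input bytes")  # 6 codes for 12 bytes
```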
1.2 Dissertation Outline

In Chapter 2, the architecture characteristics and capabilities of reconfigurable devices are described. Models of application execution are discussed in detail, and the current limitations that motivate our work are presented.

In Chapter 3, our mapping approach is described along with the key challenges in run-time mapping. Related work is also presented to fortify the novelty of our approach.

Chapter 4 describes in detail an Adaptive Cryptographic Engine that was developed using our approach. The proposed engine can adapt at run time to diverse security parameters. Problems in designing configurations for multi-algorithm applications are addressed, and a thorough performance study is performed.

In Chapter 5, instance-dependent mapping is demonstrated. We use our approach to solve the single-source shortest path problem and the matrix multiplication problem. For the shortest path problem, the hardware is adapted at run time to the graph topology. For the matrix multiplication problem, coarse-grained reconfigurable architectures are considered. A scalable and partitioned solution is presented that can adapt to the input problem size, as well as to the available hardware resources.

In Chapter 6, a parallel deduction engine for backtrack search algorithms is presented. The engine initially adapts to the input SAT formula. Based on the results obtained during problem solving, the engine evolves in terms of its parallelization factor in order to optimize performance.

In Chapter 7, the proposed configuration compression technique is thoroughly described. Several experiments are demonstrated to validate the applicability and effectiveness of our approach.

Finally, in Chapter 8, conclusions are drawn from this work, and future research is discussed based on the findings of this dissertation and the evolution of reconfigurable hardware.

Chapter 2

Reconfigurable Computing

Reconfigurable Computing (RC) is a revolutionary computing paradigm that is based on the ability of reconfigurable hardware for post-fabrication and application-specific hardware customization. Such post-fabrication customization of pre-tested silicon distinguishes reconfigurable devices from ASICs. In RC, the hardware is configured on demand based on the application requirements. RC promises to deliver hardware-like performance with software-like flexibility for a wide range of applications, thus becoming more attractive than solutions based on conventional fabrics such as ASICs, DSPs, and microprocessors. A brief overview is given of reconfigurable devices, RC application execution, mapping approaches to RC, and RC design tools.
Numerous architectures have been developed that differentiate from each other in term s of the granularity of the logic blocks, the support for reconfig uration. and the system features integrated on-chip. 1. Granularity: Based on the granularity of the configurable logic blocks, re configurable devices can be categorized as fine-grained or coarse-grained architectures. In fine-grained architectures, the basic logic block can re alize a bit-level function of usually 1-8 input bits [4, 8. 13, 63, 77]. On the other hand, coarse-grained architectures are datapath-oriented struc tures that consist of word-based programmable processing elem ents that can perform demanding arithmetic operations (e.g., 16-32 bits MAC) [15. 16, 17, 34, 37, 44. 57, 73]. 11 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 2. Reconfiguration: Complete reconfiguration alters the hardware configura tion of th e complete device while partial reconfiguration alters the hardware configuration of a part of the device. Most state-of-the-art reconfigurable devices support both complete and partial reconfiguration. The configura tion d a ta is stored in on-chip configuration memory, which is distributed throughout the device. Single-context devices can store on-chip only one configuration for the hardware resources. Reconfiguration occurs by loading different configuration data onto the configuration memory. On the other hand, m ulti-context devices can store on-chip multiple configurations for the hardw are resources. As a result, the overhead of loading an externally stored configuration can be eliminated since reconfiguration can be realized by switching among different contexts. 3. System Features: Besides the variety of programmable logic blocks and interconnection networks, advanced system features are integrated on-chip in state-of-the-art reconfigurable devices. It is becoming a common trend to integrate embedded memory blocks to enhance on-chip d ata storage. Moreover, em bedded microprocessors are integrated as hard or soft IP cores to support reconfigurable hardware during application execution. Finally, various clock management techniques and I/O schemes are used to enhance the applicability of the devices. 12 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 2.1.1 Field Programmable Gate Arrays The continuous evolution of Field Program m able G ate Arrays (FPGAs) has been the driving force for RC. Historically. FPGAs were small devices with slow clock speeds th at were used as glue-logic in developing boards. As the FPGA hardware density was increasing rapidly following the Moore's Law. FPGAs were initially an offshoot of the quest for prototyping and em ulation. The evolution of FPGAs (Table 2.1) have led to devices with multi-million transistor counts, fast clock speeds, low cost, and enhanced reconfiguration and system features. As a result, FPGAs are rapidly eroding in application domains such as signal and image pro cessing. wireless communication, network protocol processing, and cryptography, among others. Table 2.1: FPGA Technology Advancements [77] 1996 1998 1999 2002 Density (System Gates) 50K IM 2M 10M M ACs/device (Billions) 1 27 68 600 Cost (Gates/$) 300 4K 10K 40K Cost (Million MACs/$) 6 100 340 2.000 Typical FPGAs consist of fine-grained configurable logic blocks (C’LBs) over laid over a programmable interconnection network and surrounded by a pro gram m able I/O (Figure 2.1). 
Most of the latest devices integrate also embedded 13 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. □□ □□ □□ □□ □□ □□ □□ □□ □□ □□ □□ □□ □□ □□ □□ □□ □□□□□ □□□ □□ □□ □□ □□ □□ □□ □□ □□ Reconfigurable Logic LUT LUT Figure 2.1: Typical FPG A Architecture memory to enhance on-chip data storage. The embedded memory is either dis tributed among columns of C ’LBs or placed at the periphery of the reconfigurable logic array. The reconfigurable logic array is usually a 2-D array of C'LBs that are connected with each other and w ith the I/O using the programmable in terconnection network. The programmable interconnection network consists of local and global connections that are controlled by programmable switches. Lo cal connections can connect adjacent CLBs while global connections facilitate the communication am ong distant CLBs. CLBs are the core elem ents of the reconfig urable logic and can realize combinatorial an d /o r synchronous logic. Early FPGA architectures have incorporated CLBs consisting of a Look-Up Table (LUT). a 14 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. flip-flop, and multiplexers. The LUT can be used either to realize combinatorial logic or local memory. The flip-flop can be used for sequential circuits. The multiplexers determine the d atap ath realized inside th e CLB. as well as the con nectivity of the CLB to the interconnection network. By configuring the LUT and the multiplexers, a variety of functions can be realized. Advanced FPGA architectures incorporate clusters of 2-4 logic blocks to derive larger CLBs and facilitate the realization of more complex functions. In addition, specialized com binatorial logic (e.g., fast carry logic, shift register) is also incorporated in the logic cells of several device families to enhance their functionality. The control bits that dictate the functionality of the logic blocks and the connectivity of the interconnection network are stored in on-chip memory (i.e.. configuration memory). Configuration memory is distributed throughout the de vice and can only be used for storing configuration bit-stream s, that is. the data required to configure the hardware resources. Early devices incorporated non volatile mem ory for storing configuration bit-streams. As a result, only one-time program m ability was allowed. Such devices are out o f the scope of this disser tation since they cannot support reconfiguration. M odern devices incorporate volatile m em ory (i.e., SRAM) to store configuration bit-stream s. Typical sizes of configuration bit-streams range from 0.6 to 33 Mbits [4, 8, 77] and depends on the hardware density of the devices. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Syr t M r t O ™ M uRipUeis VIRTEX-II IP-lmmereion™ Architecture Figure 2.2: Xilinx Platform FPGA T he volatility of the configuration memory prevents from storing configura tion d a ta permanently on-chip. As a result, the tim e required to configure the hardw are resources depend on the amount of the d ata transferred to the config uration memory and the rate of the data transfer. This rate is determined by the the clock rate used for transferring data and the mode of operation. Parallel modes of operation allow word-based data transfer while serial modes transfer 1 bit o f d ata every clock cycle. 
Typical values of d ata transfer rates can be as high as 480 M bits/sec for parallel modes of operation [4. 8, 77]. By incorporating partial instead of complete reconfiguration, the tim e to configure a device can be significantly reduced since less amount of d ata has to be transferred to the configuration memory. 16 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. In the advent of near-billion transistor chips. FPG As are rapidly evolving to Reconfigurable System-on-Chip architectures. Besides the conventional re configurable hardware, next-generation FPGAs will also incorporate em bedded microprocessors, embedded memory, and application-specific accelerators. Such an evolution is essential to advance from stand-alone products to system-level solutions. For example, Platform FPGA [77] (Figure 2.2) is based on an em bedded IBM PowerPC 405 processor and fine-grained reconfigurable hardw are to enable the integration of a broad spectrum of IP blocks. Besides the recon figurable hardware, complex IP-based designs can incorporate an abundance of advanced routing resources, programmable I/O . on-chip memories, and em bed ded multipliers. Platform FPGA promises to deliver over 600 billion M ACs/sec, 420 Dhrystone M IPs of performance, and support up to S40 Mbps I/O bandw idth [77]. 2.1.2 Coarse-grained Architectures Early fine-grained architectures were mainly designed for bit-level tasks and ran dom logic functions. Their performance was limited for computationally dem and ing applications over large word length data. An alternative avenue that has been explored by many research groups is coarse-grained reconfigurable architectures [15, 17, 34, 37, 44. 73]. These architectures are datapath-oriented structures and consist of a small number of word-based, programmable processing elem ents. 17 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. These processing elem ents can perform critical word-based operations (e.g.. mul tiplication) with high performance. This can result in greater computational efficiency and high throughput for coarse-grained comput ing tasks. However, the flexibility of such architectures is lim ited due to the granularity and arithmetic- centric nature of th e processing elements. Recently, commercial coarse-grained architectures have been emerged [16, 57] th at target specific application domains (e.g.. digital signal processing, network protocol processing, wireless communica tions). 2.2 RC Application Execution Once the computing tasks to be executed on reconfigurable hardware are deter mined, the sequence of steps in realizing an RC' solution is: 1. Configuration Design: The com puting tasks have to be m apped to logic descriptions of the reconfigurable devices. 2. Configuring th e Device: The functionality of the configurable logic cells and the connectivity of the programmable interconnection network have to be established in this step by configuring the device. This step is realized under the control of a host that is either external to the device or integrated on-chip. 18 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 3. Execution: The device executes under the control of the host. This also involves data transfers between the device and the host, as well as any external memory. The tim e required for designing the configuration and configuring the device consist the m apping time. 
The m apping tim e may also include the overhead caused by reconfiguring the hardw are during execution. T he tim e required to execute on hardw are denotes the execution time. Based on the tim e that the tasks to be executed on hardware are determined, mapping approaches for RC can be categorized as static or dynamic. In static m apping approaches, th e tasks are known a priori, and thus, the de sign of configurations occur off-line. T he device is considered as a programmable ASIC and its configuration remains static during execution. As a result, the map ping time is not in the critical path to the solution and can be safely ignored. If the hardware resources are not sufficient to implement the given tasks, the hard ware resources are time-shared am ong sub-tasks by switching among predefined configurations. As a result, the tim e required to reconfigure th e device has to be considered in the calculation of the execution time. However, the configuration schedule is derived off-line and reconfiguration is only explored in a limited fash ion to virtually increase the logic capacity of the device. On the other hand, in dynamic approaches, the computing tasks or the problem instance are determined at run time. In such approaches, th e design of configurations is in the critical 19 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. path to the solution since it depends on run-tim e param eters. Such run-time param eters can be also determined on the fly based on results that are d e riv e d during execution. Utilizing reconfigurable hardware for speeding-up applications has been mostly- limited to static solutions that prevent RC from achieving its full potential. In such static solutions, it is sufficient to utilize conventional logic synthesis and technology mapping tools to derive efficient im plem entations. The hardware de sign optim ization problem does not consider the dynamic nature of reconfigurable hardware or exploits it to a limited extend. Only by exploring the ability of hard ware customization at run time, the full potential of RC can be achieved. 2.3 RC Hardware Design Tools Conventional RC solutions are mostly dictated by static m apping approaches. As a result, the conventional ASIC design flow has become a dom inant part in the configuration design process. The required functionality is specified using a high- level language (e.g., HDL. C. Matlab) or schematic tools. Iterative steps of logic synthesis and technology mapping are then performed to derive the configuration bit-stream to configure the device. This iterative process can result in excessive mapping tim e that can go up to several hours for complex and challenging ap plications. Even if the tools are steadily improving in term s of performance and quality of results, the hardware density and complexity of the devices are also 20 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. increasing rapidly making the mapping problem harder to be solved. Moreover, such a design flow lacks of the capability of capturing the dynamic nature of reconfigurable hardware and is only efficient to derive static configurations. Tools for capturing the dynamic nature of reconfigurable hardware are at best in their infancy [50. 78]. Such tools are essential to facilitate the develop m ent of dynamic mapping approaches and interactive simulation and debugging environments. 
Recently, a Java-based design tool (JHDL) has been developed that specifi cally target reconfigurable hardware based system s [50]. JHDL uses object con- structors/deconstructors to control the lifetime of a circuit on the reconfigurable hardware. As a result, it can efficiently capture the dynamic nature of the hard ware. JHDL can be used as a simulation framework or as an interface to recon figurable hardware to facilitate application execution. In addition, a set of software tools and A PIs has also been developed (JB its). which enables the creation of Xilinx FPGA configuration bit-streams from Java code [78]. JBits is a low-level tool that gives access to the FPGA internal logic and routing resources to support run-time reconfiguration. A simulation envi ronm ent is also provided for analysis and experim entation. Since JB its handles configuration bit-streams, the user should be fam iliar with the low-level details of th e device architecture and the structure of the configuration bit-stream . 21 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Chapter 3 Domain-Specific and Instance-Aware M apping Utilizing reconfigurable devices for accelerating applications has been mostly lim ited to developing configurations that optim ize the time required to execute the given applications on hardware. Such a static approach does not take advantage of the key potential of reconfigurable hardware, that is. run-tim e hardware cus tomization. Only by considering dynamic reconfiguration of the hardware as part of the hardware design problem, the full potential of RC can be accomplished. In the following, a novel mapping approach is proposed that enables applica tion mapping onto reconfigurable devices based on run-time param eters. Our approach facilitates dynamic logic synthesis for reconfigurable hardware by over coming the current limitations: the excessive mapping tim e and the performance predictability. 22 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 3.1 Challenges for Dynamic RC Dynamic RC is essential to numerous applications such as Internet security, graph problems, boolean satisfiability, and adaptive signal processing, among others. In these applications, the com putations to be performed onto hardware are dictated by param eters th at are known only at run tim e such as. the topology of a graph, the security param eters negotiated between two communicating entities, a given SAT problem instance, the width of the input data, the problem size. etc. Existing m apping techniques have adopted the ASIC-based design flow and tools that prevent the Reconfigurable Computing paradigm from achieving its full potential. Most of the existing tools and frameworks are a good basis to support designs up to the logic synthesis level, but are inadequate to support dynamic RC. The derived hardware designs are optimized in term s of tim e performance and hardware area, but cannot efficiently exploit the dynamic nature of hardware. The excessive compilation times required for transforming a high-level de sign description to a working implementation is the most severe limitation to dynamic RC. In the case of configurations th at are reused over tim e (i.e., static RC), configuration design occurs only once and is not a bottleneck. However, for configurations th a t depend on run-time parameters (i.e.. dynamic RC), the exces sive compilation tim e is unacceptable. 
Any speedup that can be achieved using reconfigurable devices is overshadowed by the time-consuming mapping process. 23 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. RC Key Challenge: Compilation Times uproc/DSPs Configurable Logic ASICs today future, if not addressed Figure 3.1: From a High Level Design Description to a Working Implementation Four to six orders of magnitude improvement are required to enable conventional tools to be used in the critical path to the solution (Figure 3.1). A nother problem with conventional tools is that the quality of the results can be determ ined very late in the design cycle. Reliable tim e performance and hard ware area requirements can be derived only after place-and-route. In addition, if the same design description is processed repeatedly through the tools for multi ple tim es, it is not guaranteed th at the derived configurations will result to the same tim e performance and hardware area requirements. Moreover, the configu rations generated for two problem instances of slightly different sizes are usually substantially different from each other. As a result, performance predictability becomes a critical limitation if conventional tools are used in the critical path to the solution. 24 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 3.2 Related Work Considerable research has been motivated by the suitability of instance-dependent problems to reconfigurable devices [9, 59. 80, 81]. However, almost all the efforts fail to consider the mapping overhead in their comparisons. These approaches are based on conventional tools and are inadequate to attack the mapping tim e overhead problem. The Dynamic Computation Structures (DCS) approach was utilized to solve graph problems [9]. A compilation technique was developed to autom atically m ap high-level hardware descriptions onto hardware. The topology of the input graph was embedded onto a multi-FPGA system using the conventional ASIC' design flow. Several hours were required for partitioning and place-and-route to m ap each graph edge to a physical wire. The resulting execution time was in the range of msec. Significant speedup was reported compared with state-of-the art software solutions. However, the mapping tim e was not considered and the comparisons were made only in term s of execution time. The proposed approach led to irregular and instance-dependent interconnect. As a result, as the graph size increases, the clock rate degrades rapidly and th e area requirements increases dram atically. No reliable tim e performance and hardware area estimations were possible before actually mapping onto the hardware. Moreover, for denser graphs, the m apping tim e would increase dramatically or, even worse, the derived logic net lists may not be routable. 25 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. In [80], a similar approach was initially utilized to map specific instances of th e boolean satisfiability (SAT) problem. As a result, irregular designs were derived after several hours of compilation. As a result, the proposed approach was effective only for hard SAT instances with very long software running tim e (hours or m ore). However, hard SAT instances would result in increased compilation tim e or. even worse, the derived logic netlists may not be routable. 
Motivated by the m apping time overhead, a new m odular architecture was proposed in [81] that resulted in significantly reduced mapping time. However, while the clock speed was increased and did not depend on the input SAT instance, the num ber of clock cycles required to solve the problem was also increased significantly affecting the execution time. As a result, it was unclear the range of SAT instances of interest th a t can benefit from this approach. A nother approach for solving the SAT problem was developed in [59]. Dy nam ic circuit generation was employed to map specific instances of the SAT problem . The main focus was the compilation tim e while the execution time was not considered of primary interest. A new tool was developed by incorporated the m ost efficient known algorithm s to perform logic synthesis, partitioning, and place-and-route. The objective was to speedup the mapping process. However, only sm all circuits were m apped instead of complete designs as in [80. SI]. Even if design synthesis and place-and-route tim es were reported, no tim e comparisons 26 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. were available, la addition, the derived designs included long physical intercon nects that resulted in slow clock rates. An alternative approach to instance-dependent m apping was proposed in [66j. In [66], self-reconfiguration was considered to reduce the mapping tim e for run tim e hardware reconfiguration. A novel self-reconfigurable architecture was de veloped that can alter the configuration bits by itself without the intervention of an external host. As a result, the mapping tim e can be considerably reduced facilitating fast run-time hardware adaptation. However, self-reconfiguration is out of the scope of this dissertation. Besides the aforementioned research, several researchers have proposed models of reconfigurable architectures and application execution techniques to enable dynamic RC. However, such research does not address explicitly the problem of dynamic logic synthesis for reconfigurable hardware. In [55], the Reconfigurable Mesh computation model was introduced based on a mesh with reconfigurable buses. This model was one of the earliest efforts to address run-time adaptation based on the com putation and communication requirements. Efficient solutions were dem onstrated for a variety of applications including image processing and com putational geom etry and graph problems. In [12], a high level model of reconfigurable hardware was developed to facilitate the development of algorithmic m apping techniques. Loop computations were considered as the application domain in [12]. Loop com putations match extrem ely 27 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. well with the spatial computing paradigm dictated by reconfigurable hardware. In [36j. the Flexible URISC model has been introduced that facilitates run-tim e routing among logic modules. By mapping logic modules into the memory space of URISC. data transfers among modules can be achieved in a shared-memorv fashion without the need for explicit modification of the hardware routing. In [40], a m ethod has been introduced (RTR) to enhance the functional den sity of SRAM-based FPGAs. Given an application, hardware configurations can be derived that correspond to time-exclusive operations. 
At run time, the hardware resources are time-shared among the pre-defined configurations based on a configuration schedule. Based on RTR, several techniques have been introduced for mapping pipelines onto reconfigurable hardware [15, 53]. By time-sharing the hardware resources among pipeline stages and overlapping execution with reconfiguration, efficient solutions have been demonstrated for mapping pipelines with an arbitrary number of stages onto reconfigurable hardware.

3.3 Our Approach

We propose a novel approach for enabling application mapping based on run-time parameters by exploiting properties of a specific domain and a given problem instance (Figure 3.2). The algorithm semantics and the target reconfigurable device define the specific domain. Besides the reconfigurable device, a host machine and external memory are indispensable to facilitate run-time reconfiguration and to store the configuration data, respectively. The host machine and/or the external memory may be integrated on-chip with the reconfigurable hardware.

Figure 3.2: Domain-Specific and Instance-Aware Mapping

Algorithm-specific configurations (the skeleton) are generated off-line based on the specific domain. At run time, instance-aware adaptation occurs to determine the hardware configuration. The instance corresponds to the run-time parameters based on which hardware customization is performed.

The goal of our approach is to remove the bottleneck of conventional tools from the critical path to the solution. Our performance metric is the effective execution time T (Figure 3.3), which includes the execution time on hardware (T_E) and the time required to obtain the hardware implementation (T_M + T_C). The latter is the sum of the time required to derive the hardware configuration (T_M) and the time required to configure the device (T_C). T_C also includes the overhead caused by reconfiguring the hardware during execution.

Figure 3.3: Details of Our Approach

As shown in Figure 3.3, our mapping approach consists of two major interdependent phases: Skeleton Synthesis and Skeleton Adaptation.

A skeleton corresponds to the configuration data that is necessary to cover the parameter space for a given application. As a result, the skeleton corresponds to parameterized configuration bit-streams. The skeleton is derived off-line and its construction does not affect the effective execution time. As a result, conventional tools can also be incorporated. However, it is more beneficial to utilize tools like JBits [78] that enable logic synthesis at the bit-stream level.

Given a run-time parameter, the skeleton is dynamically adapted to enable instance-aware hardware customization. The skeleton adaptation is dictated by algorithms that are developed off-line and are executed by the host machine. Such an adaptation can lead to functional or structural modifications as well as data redistribution.
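To make the metric concrete, consider a minimal cost model in software (Python). The timing values below are hypothetical and serve only to illustrate how a large T_M in a conventional flow dominates the effective execution time:

# A minimal sketch of the effective execution time T = T_M + T_C + T_E,
# using the notation defined above. All times are in seconds; the values
# are hypothetical, chosen only for illustration.

def effective_execution_time(t_m, t_c, t_e):
    """t_m: time to derive the configuration, t_c: time to configure the
    device, t_e: execution time on hardware."""
    return t_m + t_c + t_e

# Conventional flow: synthesis and place-and-route dominate T_M.
conventional = effective_execution_time(t_m=3600.0, t_c=0.1, t_e=0.002)

# Skeleton-based flow: only lightweight run-time adaptation remains in T_M.
skeleton = effective_execution_time(t_m=0.05, t_c=0.1, t_e=0.002)

print(conventional / skeleton)   # roughly four orders of magnitude apart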
The run-time adaptation is dictated by the characteristics of the instance based on which the configuration is derived. The adaptation space of the skeleton corresponds to all the potential hardware modifications that may occur, and determines the degrees of freedom for reconfiguration. As a result, the skeleton adaptation process can be viewed as a domain-specific tool for dynamic logic synthesis. Functional modifications add, remove, or alter the functionality of programmable logic blocks. Structural modifications shape the programmable interconnection network according to the requirements of the input instance.

Compared with conventional static RC approaches and related work [9, 59, 80, 81], the key benefits of our approach are:

1. The optimization metric is the effective execution time, which includes the overhead due to the mapping time.

2. Configuration design is accomplished incrementally, without requiring a complete redesign for every input instance.

3. Reliable time performance and hardware area requirements can be obtained off-line, which is important in applications where the problem parameters are determined at run time.

In the following chapters, the efficiency and applicability of our mapping methodology is demonstrated for a variety of applications and reconfigurable devices. The applications of interest include private-key cryptography for Internet security, the single-source shortest path problem, the matrix multiplication problem, and the boolean satisfiability problem.

Chapter 4

Adaptive Cryptographic Engine

Architectures that implement the Internet Protocol Security (IPSec) standard have to meet the enormous computing demands of cryptographic algorithms. In addition, IPSec architectures have to be flexible enough to adapt to diverse security parameters. In this chapter, an FPGA-based Adaptive Cryptographic Engine (ACE) for IPSec architectures is proposed. By taking advantage of FPGA technology, ACE can adapt to diverse security parameters on the fly, while providing superior performance compared with software-based solutions. Hardware adaptation occurs before computation commences, and is crucial not only for the applicability of the engine, but also for its lifetime. Given a set of hardware configurations, a general graph-based representation is introduced based on which various high-level optimization problems can be addressed (e.g., configuration memory requirements, reconfiguration overhead) for designing a cryptographic library. A diverse set of cryptographic algorithms is utilized to evaluate the time performance of the proposed cryptographic engine. The time performance metrics are throughput and key-setup latency. Throughput corresponds to the amount of data processed per time unit, while the key-setup latency is the minimum time required to commence data encryption after providing the input key. The latency metric is the key measure for IPSec, where a small amount of data is processed per key and key context switching occurs repeatedly.

4.1 Introduction

The enormous success of the Internet has made the Internet Protocol (IP) one of the primary communication protocols of our time.
However, communicating over the Internet involves significant security risks, since the Internet is an unprotected network. Therefore, the need for securing the Internet has become a fundamental issue, especially for transmitting confidential data (e.g., electronic commerce, electronic banking, Virtual Private Networks).

For securing Internet traffic, the Internet Protocol Security (IPSec) standard was developed by the Internet Engineering Task Force [49]. The IPSec standard extends the IP protocol by securing the IP traffic at the IP level using cryptographic methods. IPSec has been adopted by all leading vendors and will be the future standard for secure communications on the Internet [68]. It is also rapidly becoming the industry standard for Virtual Private Networks (VPNs) [41].

Cryptography is the fundamental component of IPSec. Therefore, architectures that implement IPSec have to meet the enormous computing demands of cryptographic algorithms. Moreover, since the security parameters are dynamically negotiated by the communicating entities, IPSec architectures also have to be flexible enough to adapt to diverse security parameters (e.g., cryptographic algorithm, cryptographic operation mode, key) and to be updated with state-of-the-art algorithms and standards. While ASIC-based solutions can provide ultimate performance, they lack flexibility. On the contrary, software-based solutions can provide the required flexibility but are inadequate for high-speed networks.

In this work, we focus on the data encryption aspects of IPSec. Our goal is to design and implement an Adaptive Cryptographic Engine (ACE) using FPGAs. Such an engine consists of a hardware component (FPGA) and a memory component. FPGA configurations of cryptographic algorithms are stored in the memory in the form of a cryptographic library. Using the cryptographic library, the FPGA is configured on demand at run time. Then, the configured FPGA performs the required encryption/decryption tasks. Thus, ACE can provide hardware-like time performance with software-like flexibility.

The efficiency of ACE is determined by the time performance of the FPGA implementations, the memory requirements for storing the corresponding configuration bit-streams, and the reconfiguration time for adapting to run-time parameters. In embedded systems like ACE, memory is the most restrictive resource and significantly affects the cost-effectiveness of the system. By using partial reconfiguration, the memory requirements, as well as the time overhead due to reconfiguration, can be significantly reduced.

A diverse set of private-key cryptographic algorithms was chosen to study the time performance of ACE. Time performance and area requirements results are provided for all the considered algorithms. The time performance metrics are throughput and key-setup latency. Throughput corresponds to the amount of data processed per time unit, while key-setup latency is the minimum time required to commence data encryption after providing the input key. Besides the throughput metric, the latency metric is the key measure for IPSec, where a small amount of data is processed per key and key context switching occurs repeatedly.
To the best of our knowledge, no published results include key-setup latency.

The architectural characteristics of FPGAs match extremely well with the computational requirements of private-key cryptographic algorithms. By taking advantage of hardware customization and hardware parallelism, high encryption rates and low key-setup latency can be achieved. The fine granularity of FPGAs can lead to very efficient implementations of core operations of private-key cryptographic algorithms such as bit-permutations, bit-substitutions, look-up table reads, and boolean functions, among others. Moreover, the constant bit-width of the aforementioned operations alleviates accuracy-related implementation problems and facilitates efficient designs. The inherent parallelism of data encryption can also be efficiently exploited on FPGAs. The spatial computing paradigm of FPGAs allows multiple operations and/or tasks to be executed in parallel, while loops with data dependencies can be efficiently mapped onto hardware.

4.2 Private-Key Cryptography for IPSec

The enormous advances in network technology have resulted in an amazing potential for changing the way we communicate and do business over the Internet. However, in the case of confidential data, the cost-effectiveness and globalism provided by the Internet are diminished by the main disadvantage of public networks: security risks. The significant growth in confidential data traffic over the Internet makes security a fundamental problem. Confidential data has to be protected against security threats including loss of privacy, loss of data integrity, identity spoofing, and denial-of-service, among others [19]. Consequently, applications such as electronic banking, electronic commerce, and Virtual Private Networks (VPNs) require an efficient and cost-effective way to address the security threats over public networks.

The IPSec standard is an on-going effort of the Internet Engineering Task Force [49] that has been adopted by all leading vendors. It will be the future standard for secure communications over the Internet [68]. IPSec is also rapidly becoming the standard of choice for VPNs [41]. VPNs utilize public networks in order to establish a secure private network. As a result, drastic reduction of cost and superior network efficiency can be achieved. The estimated savings range from 20% to 80% by switching from leased lines, serving branch offices, or serving remote access users [41]. The remarkable potential of the VPN market is evidence of the significance of IPSec in the global electronic business. From 224 million U.S. dollars in 1998, the VPN market is forecast to grow to 13,000 million U.S. dollars in 2004 [41].

The IPSec standard is a framework of open standards for ensuring secure private communication over the Internet [19]. It extends the IP protocol by securing the IP traffic at the IP level using cryptographic methods. Since it provides security at the IP level, there is no need to change the end systems, the applications, or the intermediate network infrastructure. The only required changes occur at the end points where the encrypted packets enter/exit the public network.
Consequently, the implementation and management costs are reduced significantly. However, the major advantage of IPSec is its flexibility: it is not based on a particular cryptographic method. The cryptographic methods to be used are negotiated between the communicating entities. IPSec provides an open framework to implement any cryptographic algorithm. As a result, the IPSec standard makes it possible for security systems developed by different vendors to interoperate [41].

In order to protect IP traffic from security threats, IPSec provides the following services [19]:

1. Confidentiality: Data encryption.

2. Authenticity: Proof of sender.

3. Integrity: Data tampering detection.

4. Replay Protection: Preventing unauthorized data resending.

5. Key Management: Negotiating security associations and key exchanging.

Initially, the two communicating parties negotiate and establish a security association (SA) for protecting the transferred data. An SA determines the cryptographic methods and the related keys to be utilized. The cryptographic methods include both private- and public-key cryptography, keyed hash algorithms, and digital certificates. Each SA is restricted by its lifetime, after the expiration of which a new SA has to be established. The lifetime of an SA can be determined in terms of absolute time or the amount of transmitted data. Usually, a small amount of data is processed per key and key-context switching occurs repeatedly.

In this work, we focus on private-key cryptography (i.e., confidentiality) for IPSec architectures. The implementation of a cryptographic algorithm must achieve a high processing rate to fully utilize the available network bandwidth. Otherwise, the cryptographic component of the IPSec architecture becomes a performance bottleneck. The implementation must also support fast key-context switching. Otherwise, additional latency is introduced that is critical in latency-bound applications (e.g., video on-demand, IP telephony). Finally, in order to keep the IPSec architecture independent of the implemented cryptographic algorithms, it must be possible to add new implementations or modify existing ones. Then, the IPSec architecture can be updated with state-of-the-art cryptographic algorithms and standards.

Known IPSec architectures [41] utilize software-based or ASIC-based solutions for implementing private-key cryptographic algorithms. Software-based approaches provide the flexibility required to adapt to various cryptographic algorithms and standards. However, since cryptographic algorithms have enormous computational requirements, software-based solutions are inadequate for the data-transfer rates found in high-speed networks [62]. In this case, ASIC-based architectures can provide significant performance advantages over software-based solutions. Major companies such as AT&T, Cisco, and Lucent endorse specialized encryption/decryption chips integrated in their routers. However, any update of these specialized chips becomes very costly. Even though the price of such chips is dropping dramatically, the question that still remains is what to do with the old "boxes" [41]. The ultimate solution to the problem would be a reconfigurable processor that can provide software-like flexibility with hardware-like performance.
4.2.1 Private-Key Cryptography using FPGAs

FPGAs are a highly promising alternative for implementing private-key cryptographic algorithms. By combining hardware customization, hardware parallelism, and hardware reuse, FPGA-based solutions can provide the high encryption rates and agile key-context switching required in high-speed networks, and the flexibility required by IPSec. Moreover, the post-fabrication customization that distinguishes FPGAs from ASICs can lead to fast turn-around time. Turn-around time is critical in environments where application requirements and standards are changing rapidly (e.g., information-based networked environments). A typical private-key cryptographic algorithm that operates on blocks of data (block cipher) consists of the following tasks:

1. Key Setup: Based on the input key, key-dependent data is derived for the cryptographic rounds (e.g., sub-keys). The lifetime of the key depends on the application. For example, in IPSec, the lifetime of a Security Association determines the lifetime of the related keys.

2. Cryptographic Core: Consists of several rounds that usually have the same functionality. There is a linear data dependency among the cryptographic rounds, that is, the input of a round is the output of the preceding round. A cryptographic round can commence as soon as its key-dependent data is available.

Depending on the operation mode, data dependencies among consecutive blocks of data may be introduced. For example, in the Electronic Codebook (ECB) mode [64], all the data blocks are processed independently. On the contrary, in the Cipher Block Chaining (CBC) mode [64], the results of the processing of previous data blocks are fed back into the processing of the current block. Operation modes that introduce data dependencies among consecutive data blocks provide higher security. However, the ECB mode favors parallel implementations, and thus it can lead to higher processing rates.

The fine granularity of FPGAs matches extremely well the operations required by private-key cryptographic algorithms (e.g., bit-permutations, bit-substitutions, look-up table reads, boolean functions). As a result, such operations can be executed more efficiently in FPGAs than in a general-purpose computer. Moreover, the constant bit-width of the operations in a block cipher alleviates accuracy-related implementation problems and facilitates efficient designs. However, the key advantage of FPGAs over general-purpose computers is hardware parallelism. The spatial computing paradigm of FPGAs can efficiently exploit the inherent parallelism of private-key cryptography. On the contrary, the serial fashion of computing in general-purpose computers is a limiting factor for their performance. FPGA-based implementations can exploit parallelism at different levels:

1. Instruction-level: Multiple operations can be executed concurrently during key setup and/or within a cryptographic round. As a result, low key-setup latency can be achieved, while the time required to process a data block through a cryptographic round can be significantly reduced.
2. Data-level: Multiple blocks of data can be processed in parallel in operation modes that allow parallel processing (e.g., non-feedback, interleaved). In addition, blocks of data corresponding to different keys can be processed in parallel. As a result, superior throughput can be achieved compared with a serial implementation.

3. Task-level: The key-setup component can process data concurrently with the cryptographic core. The cryptographic core can start as soon as the key-dependent data for the first round is derived. As a result, agile key-context switching and low key-setup latency can be achieved. In software implementations, the cryptographic core cannot commence before the key-dependent data for all the rounds is derived.

Security issues also make FPGA implementations more advantageous than software-based solutions. An encryption algorithm running on a general-purpose computer has no physical protection [64]. Hardware cryptographic devices can be securely encapsulated to prevent any modification of the implemented algorithm. In general, hardware-based solutions are the embodiment of choice for military and serious commercial applications (e.g., NSA authorizes encryption only in hardware) [64].

Finally, even if ASICs can achieve superior performance compared with FPGAs, their flexibility is fixed during fabrication. As a result, they incur high turn-around times, and any update with new algorithms or standards becomes very costly [41]. On the contrary, hardware reuse allows FPGA-based solutions to be customized after fabrication. Therefore, FPGA-based solutions can result in cryptographic engines with increased lifetime compared with ASIC-based solutions.

4.3 Adaptive Cryptographic Engine

The proposed Adaptive Cryptographic Engine (ACE) is an FPGA-based architecture (Figure 4.1) that can provide the speed and flexibility required by IPSec. ACE can be dynamically adapted to the cryptographic algorithms of different Security Associations (SAs) at run time. As we will show shortly, agile adaptation to different key contexts can also be achieved by FPGA implementations. In addition, the cryptographic library can be updated with new configurations by updating the memory contents. As a result, ACE can adapt to any cryptographic algorithm.

ACE consists of an FPGA device(s), a cryptographic library, and a configuration controller (Figure 4.1). The computational core of ACE is reconfigurable hardware. The FPGA is configured on the fly by the configuration controller. Subsequently, adaptation to the input key occurs and the data encryption/decryption commences. The configuration controller determines the configuration to be chosen based on the established SA (shown as a request in Figure 4.1). The cryptographic library consists of FPGA configurations stored in the memory in the form of configuration bit-streams. A cryptographic algorithm may correspond to multiple configurations with respect to cryptographic parameters (e.g., key length, block length) or to performance requirements (e.g., throughput, hardware area).

Figure 4.1: Architecture of ACE
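The run-time behavior of ACE can be summarized in a small software model. The sketch below (Python) is illustrative only: the library contents, the request format, and the configure_fpga routine are hypothetical stand-ins for the memory component and the configuration controller described above.

# Illustrative model of ACE's configuration flow (not an actual API).
# The cryptographic library maps (algorithm, variant) to a configuration
# bit-stream; the controller resolves an SA request and loads the FPGA.

library = {
    ("RC6", "128-bit key"):      b"...bitstream...",   # placeholder contents
    ("Rijndael", "128-bit key"): b"...bitstream...",
}

def configure_fpga(bitstream):
    # Stand-in for writing the bit-stream to the device's configuration port.
    print("loading %d bytes onto the FPGA" % len(bitstream))

def handle_request(sa):
    """Resolve an established Security Association to a configuration."""
    key = (sa["algorithm"], sa["variant"])
    configure_fpga(library[key])   # adapt the hardware ...
    # ... then key setup and encryption/decryption commence on the FPGA.

handle_request({"algorithm": "Rijndael", "variant": "128-bit key"})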
The configuration controller is indispensable for configuring the FPGA(s). Besides being the interface between the cryptographic library and the FPGA(s), it also resolves the external requests.

The cryptographic library consists of FPGA configurations of private-key cryptographic algorithms. Each cryptographic algorithm can correspond to various configurations. Such algorithm variants differ from each other in terms of the cryptographic operation mode, the key length, the number of rounds, etc. The configurations can be updated to enhance the applicability of ACE. In the following sections, a methodology is proposed to design a cryptographic library with respect to its memory requirements, and a thorough performance analysis of the engine is performed.

4.3.1 Cryptographic Library Design

The applicability of ACE is determined by the cryptographic library. The cryptographic library consists of FPGA configurations of private-key cryptographic algorithms. For each algorithm, various configurations can be derived based on the cryptographic operation mode, the key length, the number of rounds, etc. However, the time performance of the FPGA configurations should be independent of the input data in order to eliminate the risk of cryptographic attacks that are based on performance measurements.

By matching the algorithm requirements with the low-level hardware details of the FPGA architecture, hardware-optimized designs can be derived. As a result, high performance can be achieved. Furthermore, the time performance and the area requirements can be accurately determined beforehand. This is particularly important in run-time environments where the parameters of the problem are not known a priori but time and area constraints must be satisfied. In the following section, a thorough study of time performance issues is provided. Such a study is crucial to evaluate the potential of FPGA-based cryptographic engines for IPSec.

Besides time performance, the size of the library and the time overhead due to reconfiguration are important aspects in designing cryptographic libraries. In embedded systems like ACE, memory is the most restrictive resource and significantly affects the cost-effectiveness of the system. In addition, the time required for reconfiguration may become a bottleneck in latency-bound applications. Reconfiguration time can, however, be safely ignored, since the FPGA can be reconfigured while the communicating entities are negotiating to establish a Security Association (SA). Typically, the security policy is negotiated first, and then key exchange and authentication occur. Reconfiguration can start as soon as the cryptographic algorithm is decided, while encryption can commence as soon as the SA negotiation is completed. Thus, if reconfiguration is completed before the time that encryption can commence, the reconfiguration time overhead can be safely ignored.
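This overlap argument can be stated as a one-line timing rule; the following sketch (Python, with hypothetical timings in milliseconds) makes it explicit:

# Reconfiguration is hidden behind SA negotiation: only the portion of the
# configuration time that extends past the negotiation is visible overhead.

def visible_reconfig_overhead(t_config, t_negotiation_remaining):
    """t_config: time to load the configuration bit-stream.
       t_negotiation_remaining: time left in the SA negotiation once the
       cryptographic algorithm has been decided."""
    return max(0.0, t_config - t_negotiation_remaining)

print(visible_reconfig_overhead(t_config=20.0, t_negotiation_remaining=50.0))  # 0.0
print(visible_reconfig_overhead(t_config=80.0, t_negotiation_remaining=50.0))  # 30.0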
By using data compression techniques, the memory requirements for storing configuration bit-streams can be significantly reduced. However, in embedded systems like ACE, besides achieving high compression ratios, decompression efficiency is an essential aspect. Decompression efficiency corresponds to the decompression hardware cost as well as the decompression rate. Decompression hardware can significantly affect the cost-effectiveness of the cryptographic engine, while the decompression rate can increase the time required to configure the FPGA. In Chapter 7, we address the problem of configuration compression, and a novel compression technique is introduced for FPGA-based embedded systems. Our technique is applicable to any SRAM-based FPGA, requires very little decompression hardware, and does not affect the time to configure the FPGA.

By using partial reconfiguration, further savings in memory can be achieved, while the time to reconfigure the FPGA can be reduced significantly. Typically, by using partial reconfiguration, less data needs to be loaded onto the FPGA compared with complete reconfiguration [40]. Reconfiguration occurs repeatedly in ACE in order to adapt to run-time requests. The FPGA implementations that can be realized using ACE can be viewed as the vertices of a directed graph whose edges correspond to configuration bit-streams. The derived graph also includes the "null" vertex, which corresponds to the configuration of the FPGA after power-up. The graph edges correspond to the partial or complete reconfiguration that is required to reconfigure the FPGA between any two vertices. Edges that correspond to complete reconfiguration can be incident only from the "null" vertex. Each edge is associated with a cost, that is, the size of the corresponding bit-stream.

Based on the above graph-based representation, the cryptographic library design can be addressed by solving various high-level optimization problems. For example, the Minimum Spanning Tree (MST) that is rooted at the "null" vertex consists of the edges that lead to minimal memory requirements. Algorithms for solving the Directed Minimum Spanning Tree problem can be found in [18, 38]. Furthermore, since configuration time is proportional to the size of the configuration bit-stream, by imposing constraints on the longest path between two vertices, numerous optimization problems can be derived for minimizing reconfiguration time.
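To illustrate the formulation, the sketch below (Python; the vertex names and bit-stream sizes are hypothetical) builds such a graph and applies the first, greedy step of a directed-MST computation: for each configuration, keep the cheapest incoming bit-stream. This step already yields the optimum whenever the chosen edges form no cycle; the general algorithms cited above [18, 38] also handle the cyclic case.

# The library-design problem on a toy instance. Vertices are realizable FPGA
# implementations plus the power-up "null" configuration; each edge (u, v)
# carries the size (in KB, hypothetical) of the bit-stream that reconfigures
# the device from u to v. Edges leaving "null" are complete configurations.

edges = [
    ("null", "Rijndael", 100), ("null", "RC6", 180), ("null", "Serpent", 170),
    ("Rijndael", "RC6", 55), ("RC6", "Rijndael", 120),
    ("Rijndael", "Serpent", 40), ("Serpent", "Rijndael", 130),
]

def cheapest_incoming(edges, root="null"):
    """Greedy first step of a directed-MST (arborescence) algorithm: keep
    the minimum-cost incoming edge of every non-root vertex."""
    best = {}
    for u, v, cost in edges:
        if v != root and (v not in best or cost < best[v][1]):
            best[v] = (u, cost)
    return best

chosen = cheapest_incoming(edges)
for v, (u, cost) in sorted(chosen.items()):
    print("store bit-stream %s -> %s (%d KB)" % (u, v, cost))
print("library size: %d KB" % sum(c for _, c in chosen.values()))
# Here the chosen edges happen to form a valid tree rooted at "null",
# so the library needs only 195 KB instead of three full bit-streams.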
4.4 Implementation Decisions and Results

The significance of 128-bit block ciphers in future cryptography led us to choose the final AES [3] candidates for the performance evaluation of ACE. In our implementations, we assume that all the algorithms are secure. The five final candidates (MARS [14], RC6 [61], Rijndael [21], Serpent [5], Twofish [65]) are private-key cryptographic algorithms that support, at the minimum, block sizes of 128 bits and key sizes of 128, 192, and 256 bits. The computational requirements of the considered algorithms include a diverse spectrum of operations (see Section 4.4.1) that determines the performance of implementations using FPGAs.

Development of the AES is an on-going effort of the National Institute of Standards and Technology (NIST) and the cryptographic community. AES will be a public algorithm designed to protect sensitive government information. It will also serve as an important security tool to support the dynamic growth of electronic commerce. The development effort started in 1997 and, eventually, fifteen candidate algorithms were submitted from all over the world. On August 9, 1999, NIST announced the five finalist algorithms of the AES standard (MARS, RC6, Rijndael, Serpent, Twofish). On October 2, 2000, NIST announced that Rijndael had been selected as the proposed AES. The proposed AES will replace the aging Data Encryption Standard (DES), which was adopted by NIST in 1977 as a Federal Information Processing Standard used by federal agencies and the private sector to encrypt information [3].

As the hardware target for the proposed implementations, we have chosen the Xilinx Virtex family of FPGAs. Virtex is a high-capacity, high-speed FPGA providing a superior system integration feature set [77]. For mapping onto Virtex devices, we used the Foundation Series v2.1i software development tool [77]. The synthesis and place-and-route parameters of the tool remained the same for all the implementations. All the results were based on placed-and-routed implementations (device speed -6) that included both the key-setup component and the cryptographic core along with their control circuits.

Among the various time-space tradeoffs, our focus was primarily time performance. For each algorithm, we have implemented the key-setup component, the control circuitry, and the encryption block cipher for 128-bit data blocks using 128-bit keys. A "single-round" based design was chosen for each implementation. Since one round was implemented, it was reused repeatedly. The key-setup component was processing data in parallel with the cryptographic core until all the key-dependent data was derived. While the cryptographic core was processing the data of the i-th round, the key-setup component was calculating the key-dependent data for the (i+1)-th round. As a result, even if an algorithm does not support on-the-fly key generation in the software domain, the key setup can be executed on the fly in FPGAs. In our implementations, the time required to derive the key-dependent data for a round was less than the time required to process a block of data through a round. As a result, the cryptographic core was processing data continuously after the key-dependent data for the first round was derived.

We have exploited the inherent parallelism of each cryptographic core at the round level and the low-level hardware features of FPGAs to enhance the performance. The performance metrics are throughput and key-setup latency. The throughput metric indicates the amount of data encrypted per time unit after the initialization of the algorithm. The key-setup latency denotes the minimum time required to commence encryption after providing the input key. While throughput indicates the bulk-encryption capability of the implementation, key-setup latency indicates the capability of agile key-context switching. Our goal was to maximize throughput for each candidate algorithm by minimizing the encryption time per round. Since one round was implemented and was reused repeatedly, the throughput results correspond to 128/(n x t_round), where n and t_round are the number of required rounds and the encryption time per round, respectively. Similar performance analysis can be performed for larger sizes of data blocks and keys.
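As a worked instance of this formula (Python; the per-round time below is hypothetical, chosen only to show the arithmetic):

# Throughput of a "single-round" implementation: one 128-bit block takes
# n rounds of t_round each, so throughput = 128 / (n * t_round).

def single_round_throughput(n_rounds, t_round_ns):
    """Returns Mbits/sec for a 128-bit block cipher with n_rounds rounds
    and a per-round encryption time of t_round_ns nanoseconds."""
    block_time_ns = n_rounds * t_round_ns
    return 128.0 / block_time_ns * 1e3   # bits per ns -> Mbits per sec

# Hypothetical example: a 10-round cipher with a 36 ns round.
print(single_round_throughput(10, 36.0))   # ~355 Mbits/sec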
In addition, the derived results can be elaborated for implementations that process multiple blocks of data and/or data streams concurrently.

Regarding the cryptographic cores, the majority of the required operations fit extremely well in Virtex FPGAs. The permutations and substitutions can be hard-wired, while distributed memory can be used as look-up tables. In addition, boolean functions, data-dependent rotations, and addition can be mapped very efficiently onto Virtex FPGAs. Wherever a multiplication with a constant was required, constant coefficient multipliers were utilized to enhance the performance compared with "regular" multipliers.

Regular multiplication is required only by the MARS and RC6 block ciphers. In both cases, two 32-bit numbers are multiplied and the lower 32 bits of the output are used in the encryption process. We tried the multiplier macros provided for Virtex FPGAs, but we found that they were a performance bottleneck. Besides the excessive latency that was introduced due to the numerous pipeline stages, excessive area was also required, since the full multiplier was mapped onto the FPGA. Instead of using these macros, a multiplier that computes partial results in parallel and outputs only the required 32 bits was used. As a result, the latency was reduced by more than 50% and the area requirements were also reduced significantly.

Besides throughput, the key-setup latency issue was also of primary interest, that is, the cryptographic core had to commence as early as possible. Based on the achieved throughput, we designed the key-setup component to sustain the processing rate of the cryptographic core and to achieve minimal latency. The key-setup latency metric is the key metric for applications where a small amount of data is processed per key and key-context switching occurs repeatedly. In software implementations, the cryptographic process cannot commence before the key-setup process for all the rounds is completed. As a result, the key-setup latency time equals the key-setup time.

To implement efficient key-setup circuits, we took advantage of the embedded memory modules (Block Select RAM) of the Virtex FPGAs [77]. The Virtex FPGAs provide dedicated on-chip modules of true dual-read/write-port synchronous RAM, with 4096 memory cells each. Depending on the size of the device, 32-132 Kbits of data can be stored using the available memory modules. The key-setup circuit utilized these memory modules to pass its results to the cryptographic core. As a result, the cryptographic core could commence as soon as the key-dependent data for the first encryption round was available in the memory modules. Then, during each encryption round, the cryptographic core reads the corresponding data from the memory modules.

For each algorithm, we have also implemented the key-setup circuit and the cryptographic core separately. For all the implementations, the maximum clock speed of the key-setup circuit was higher than the maximum clock speed of the cryptographic core. Based on the results of these individual implementations, we also provide key-setup latency estimates for implementations that clock the key-setup and the cryptographic core circuits at their maximum speed.
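The producer/consumer handoff through the Block Select RAM can be mimicked in a few lines. The sketch below (Python; the per-round times are hypothetical) shows why the core never stalls once the round-1 keys are ready, provided the key setup derives each round's keys faster than the core consumes them:

# On-the-fly key generation: the key-setup circuit writes round keys into
# dual-port RAM while the core encrypts. With t_key <= t_round, round i's
# keys (written by time i * t_key) are always ready when the core needs
# them (at t_key + (i - 1) * t_round), so the core runs without stalls.

t_key, t_round, n = 20.0, 35.0, 10   # ns per round, number of rounds

key_ready  = [i * t_key for i in range(1, n + 1)]           # producer side
core_needs = [t_key + (i - 1) * t_round for i in range(1, n + 1)]

assert all(k <= c for k, c in zip(key_ready, core_needs))   # no stalls

key_setup_latency = key_ready[0]     # core commences after round-1 keys
block_time = key_setup_latency + n * t_round
print(key_setup_latency, block_time)  # 20.0 ns latency, 370.0 ns per block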
4.4.1 Implementation Results

In Table 4.1, implementation results are shown for the considered algorithms. The algorithms are presented in alphabetical order. The key-setup latency results are represented both as absolute time and as the fraction of the corresponding encryption time for one 128-bit block of data. In addition, the throughput results are represented both as the encryption rate and as the encryption rate normalized by area. Finally, area requirements are provided for both the key-setup and the cryptographic core circuits. In the following, relevant performance issues specific to each algorithm are discussed.

Table 4.1: Implementation Results

            Key-Setup Latency                Throughput                     Area Requirements (# of slices)
Algorithm   usec    latency / block          Mbits/sec   Kbits/(sec*slice)  Total   Key Setup    Cryptographic Core
                    encryption time
MARS        1.96    3.12                     101.88      29.55              6896    2275 (33%)   4621 (67%)
RC6         0.17    0.15                     112.87      42.59              2650     901 (34%)   1749 (66%)
Rijndael    0.07    0.20                     353.00      62.22              5673    1361 (24%)   4312 (76%)
Serpent     0.08    0.09                     148.95      58.41              2550    1300 (51%)   1250 (49%)
Twofish     0.18    0.25                     173.06      18.48              9363    6554 (70%)   2809 (30%)

4.4.1.1 MARS

The MARS block cipher is the IBM submission to AES [14]. The MARS key expansion procedure expands the input 128-bit key into a 1280-bit key. First, a linear key expansion occurs, followed by stirring of the key-words based on an S-box. Both processes involve simple operations performed repeatedly. However, the final stage of modifying the multiplication key-words involves string-matching operations, which are relatively expensive functions. String-matching is an expensive operation compared with the rest of the operations required by MARS. A compact implementation of string-matching introduces high latency, while a high-performance implementation increases the area requirements dramatically. In our implementation, the last stage of the key-expansion process (i.e., string-matching) was not implemented. In spite of this, the introduced key-setup latency was still relatively high (the worst among all the implementations considered in this study).

The cryptographic core of MARS consists of a 16-round cryptographic layer wrapped with two layers of 8-round "forward" and "backward" mixing [14]. The achieved throughput depended mainly on the efficiency of the multiplier. In our implementation, only one round of each layer was implemented and reused repeatedly. The encryption time for one block of data was 32 clock cycles. An interesting feature of our design is that by increasing the utilization factor of the processing stages (i.e., all three processing stages execute in parallel), the average encryption time for one block of data can be reduced to 16 clock cycles for operation modes that allow concurrent processing of multiple blocks of data (e.g., non-feedback, interleaved).
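The 32-to-16-cycle improvement quoted above follows directly from stage utilization. A back-of-the-envelope check (Python), assuming one clock cycle per round as the 32-cycle total implies:

# One block at a time: the three MARS stages run back-to-back.
stages = {"forward mixing": 8, "core": 16, "backward mixing": 8}
serial_cycles = sum(stages.values())       # 32 cycles per block

# Independent blocks (non-feedback / interleaved modes): the stages form a
# three-stage pipeline, so a block completes every max(stage) cycles.
pipelined_cycles = max(stages.values())    # 16 cycles per block on average

print(serial_cycles, pipelined_cycles)     # 32 16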
4.4.1.2 RC6

The RC6 block cipher is the AES proposal of RSA Laboratories and R. L. Rivest from the MIT Laboratory for Computer Science [61]. The implemented block cipher corresponds to w = 32, r = 20, and b = 16 (i.e., 32-bit round keys, 20 rounds, 16-byte input key). The RC6 key setup expands the input 128-bit key into 42 round keys. The key for each round corresponds to a 32-bit word. The key scheduling is fairly simple. The round keys are initialized based on two constants. We have implemented the initialization procedure using a look-up table, since it is the same for any input key. Then, the contents of the look-up table were used to generate the round keys with respect to the input key. As a result, remarkably low key-setup latency was achieved, equal to 15% of the time for encrypting one block of data.

The cryptographic core of RC6 consists of 20 rounds. The symmetry and regularity found in the RC6 block cipher resulted in a compact implementation. The entire data block was processed at the same time by using two identical circuits. The achieved throughput depended mainly on the efficiency of the multiplier.

4.4.1.3 Rijndael

The Rijndael block cipher is the AES proposal of J. Daemen and V. Rijmen from the Katholieke Universiteit Leuven [21]. Rijndael has been selected as the proposed AES. The implemented block cipher corresponds to Nb = 4, Nk = 4, and Nr = 10 (i.e., a 4 x 32-bit data block, a 4 x 32-bit key, and 10 rounds). The Rijndael key setup expands the input 128-bit key into a 1408-bit key. Simple operations are used, which resulted in extremely low key-setup latency. ROM-based look-up tables were utilized to perform the SubByte transformation. The achieved latency was the lowest among all the implementations considered in this study.

The cryptographic core of Rijndael consists of 10 rounds. The cryptographic core is ideal for implementation on FPGAs. It combines fine-grain parallelism with look-up table operations. The round transformation can be represented as a look-up table, resulting in extremely high speed. We have implemented a ROM-based fully-parallel version of the look-up table. By combining common references to the look-up table, we have achieved a 25% savings in ROM compared with the straightforward implementation suggested in the AES proposal [21]. The simplicity of the operations and the inherent fine-grain parallelism resulted in the highest throughput among all the implementations. Furthermore, the Rijndael implementation had the highest area utilization factor (i.e., throughput per area unit).

4.4.1.4 Serpent

The Serpent block cipher is the AES proposal of R. Anderson, E. Biham, and L. Knudsen from Technion, Cambridge University, and the University of Bergen, respectively [5]. The Serpent key setup expands the input 128-bit key into a 4224-bit key. First, the input key is padded to 256 bits and then it is expanded to an intermediate key by iterative mixing of the key data. Finally, by using look-up tables, the keys for all the rounds are calculated. The simplicity of the required operations resulted in extremely low key-setup latency (the second lowest among all the implementations considered in this study).

The cryptographic core of Serpent consists of 32 rounds. The round transformation is a linear transform consisting of rotations, shifts, and XOR operations. Neither multiplication nor addition is required. As a result, the lowest encryption time per round and the most compact implementation were achieved among all the implementations. Furthermore, the Serpent implementation had the second highest area utilization factor (i.e., throughput per area unit).
4.4.1.5 Twofish

The Twofish block cipher is the AES proposal of Counterpane Systems, Hi/fn, Inc., and D. Wagner from the University of California, Berkeley [65]. The Twofish key setup expands the input 128-bit key into a 1280-bit key. Moreover, it generates the key-dependent S-boxes used in the cryptographic core. Four 128-bit S-boxes are generated. Since our goal was to minimize latency, we have implemented a parallel version of the key setup consisting of 24 permutation boxes and 2 MDS matrices [65]. Moreover, the RS matrix was implemented for the S-box generation. The matrices are used for "constant matrix"-to-matrix multiplication over GF(2^8). The best known implementation of a constant coefficient multiplier in Virtex FPGAs uses a look-up table [77]. As a result, low latency was achieved, but excessive area was required. The area requirements corresponded to 70% of the total area. However, implementing a more compact design (e.g., reusing processing elements) would increase the key-setup latency.

The cryptographic core of Twofish consists of 16 rounds. The structure of the round transformation is similar to the structure of the key-expansion circuit. The only major difference is the S-boxes that the cryptographic core uses.

4.4.2 Key-Setup Latency Improvements

For each algorithm, we have also implemented the key-setup circuit and the cryptographic core separately. For each algorithm, the maximum clock speed of the key-setup circuit was higher than the maximum clock speed of the cryptographic core. Thus, by clocking each circuit at its maximum clock speed, an improvement in key-setup latency can be achieved. No additional synchronization hardware is required, since the read/write ports of the on-chip memory modules can be configured to have different clock speeds. Compared with implementations using one clock, the key-setup latency time can be reduced by a factor of 1.35, 2.96, 1.43, 1.00, and 1.15 for MARS, RC6, Rijndael, Serpent, and Twofish, respectively. Clearly, the RC6 block cipher achieves the best key-setup latency improvement by clocking the key-setup and the cryptographic core circuits at their maximum clock speeds. For the MARS block cipher, the result is based on an implementation that does not include the circuit for modifying the multiplication key-words.

4.4.3 Comparative Analysis of Our FPGA Implementations

In Figure 4.2, key-setup latency comparisons are made among our FPGA implementations. The comparisons are made in terms of absolute time and the ratio of the key-setup latency time to the time required to encrypt one block of data. The latter metric represents the capability of agile key-context switching with respect to the encryption rate. Clearly, Rijndael and Serpent achieve the lowest key-setup latency times, while the latency times for RC6 and Twofish are higher by a factor of 2.5. As we have mentioned earlier, the key-setup latency introduced by MARS is the highest. All the algorithms (except MARS) achieve a key-setup latency time that is equal to 7%-25% of the time for encrypting one block of data.
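The ratio plotted in Figure 4.2 can be recomputed directly from Table 4.1 (a small Python check; only the table's own numbers are used):

# Recompute the latency-to-block-time ratios of Figure 4.2 / Table 4.1.

data = {  # algorithm: (key-setup latency in usec, throughput in Mbits/sec)
    "MARS": (1.96, 101.88),
    "RC6": (0.17, 112.87),
    "Rijndael": (0.07, 353.00),
    "Serpent": (0.08, 148.95),
    "Twofish": (0.18, 173.06),
}

for name, (latency_us, mbps) in data.items():
    block_time_us = 128.0 / mbps   # usec to encrypt one 128-bit block
    print("%-8s  block time %.2f usec  ratio %.2f"
          % (name, block_time_us, latency_us / block_time_us))

# RC6 0.15, Rijndael 0.19, Serpent 0.09, and Twofish 0.24 match Table 4.1
# up to rounding. MARS prints 1.56; the table's 3.12 is consistent with the
# halved, 16-cycle average block time discussed in Section 4.4.1.1.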
Figure 4.2: Key-setup Latency Comparisons of Our FPGA Implementations

In Figure 4.3, throughput comparisons are made among our FPGA implementations. The comparisons are made in terms of the encryption rate and the ratio of the encryption rate to the area requirements. The latter metric reveals the hardware utilization efficiency of each implementation.

Figure 4.3: Throughput Comparisons of Our FPGA Implementations

Rijndael achieves the highest encryption rate due to the ideal match of its algorithmic characteristics with the hardware characteristics of FPGAs. In addition, the encryption rate of Rijndael is higher than the ones achieved by the other algorithms by a factor of 1.7-3.12. Moreover, Rijndael also achieves the best hardware utilization. The hardware utilization metric combines the inherent parallelism of the cryptographic round that is exploited on hardware with the computational characteristics of the algorithm in terms of area requirements.

Serpent achieves the second best hardware utilization while having the lowest encryption time per round. Consequently, under the same area constraints, Serpent can achieve throughput equivalent to Rijndael for operation modes that allow concurrent processing of multiple blocks of data. Similar to Rijndael, the algorithmic characteristics of Serpent match extremely well with the hardware characteristics of FPGAs.

Twofish achieves the second best encryption rate, but the complexity of the cryptographic core and the matrix-to-matrix multiplication are the limiting factors for achieving higher rates. In addition, the area requirements of the key-setup circuit result in low hardware utilization. However, a compact implementation of the key-setup circuit would lead to larger key-setup latency, which is a critical metric for IPSec.

The encryption rate of MARS and RC6 depends on the performance of the multiplier implementation. In general, multipliers on FPGAs can achieve high throughput using pipelined versions of parallel implementations. However, the latency through a multiplier remains a limiting factor for applications where latency is the dominant factor for performance (e.g., feedback operation modes).

Finally, in Figure 4.4, area comparisons are made among our FPGA implementations. The comparisons are made in terms of the total area as well as the area required by each of the key-setup and cryptographic core circuits. Serpent and RC6 have the most compact implementations. Serpent also has the most compact cryptographic core circuit, while RC6 has the most compact key-setup circuit. For the MARS block cipher, the shown result is based on an
implementation that does not include the circuit for modifying the multiplication key-words [14].

Figure 4.4: Area Comparisons of Our FPGA Implementations

4.4.4 Related Work

In [54], the Rijndael algorithm was implemented using JBits [78]. The throughput that was achieved for single-round implementations was in the range of 250-300 Mbits/sec. However, the key-setup latency was in the range of 63-153 msec, which is equivalent to the time to encrypt 123529-355813 data blocks. Given a key, key-dependent data is derived by a host machine. Then, by using the JBits tool, configuration bit-streams are derived based on the key-dependent data. As a result, by using the JBits tool in the critical path to the solution, excessive latency was introduced that makes the solution unacceptable in terms of key agility [75].

In [39, 42], FPGA implementations of the AES candidate algorithms were described using Virtex FPGAs. However, only the cryptographic core of each algorithm was implemented. No results regarding key setup were provided. In Table 4.2, throughput results for [39, 42] and our work are shown for encrypting 128-bit data blocks using 128-bit keys. To make a fair comparison, the results shown for [39] correspond to the performance evaluation for feedback modes. In [39], results for non-feedback modes were also provided, which corresponded to implementations that process multiple blocks of data concurrently.

Table 4.2: Performance Comparisons with FPGA Implementations [39, 42]

            Throughput (Mbits/sec)
Algorithm   [39]      Our       [42]
MARS        -         101.88    39.80
RC6         126.50    112.87    103.90
Rijndael    300.10    353.00    331.50
Serpent     444.20    148.95    339.40
Twofish     119.60    173.06    177.30

The major difference in the throughput results is for the Serpent algorithm. By implementing 8 rounds of the algorithm [39], the distribution of the subkeys among consecutive rounds becomes very efficient, resulting in a 3x speedup compared with our "single-round" implementation. For MARS, our implementation achieved higher throughput by a factor of 2.5 compared with [42]. The MARS block cipher was not implemented in [39]. For RC6 and Rijndael, all the implementations achieved similar throughput performance. For Twofish, the throughput achieved in [42] and in our work is higher than the one in [39] by a factor of 1.5. By combining the throughput results provided in [39, 42] and the performance results provided in our work, we can verify that Rijndael and Serpent favor FPGA implementations the most among all the considered algorithms.

4.4.5 Comparison with Software and ASIC Implementations

Our time performance results are compared with the best software-based results found in [6] and [10]. In [6], optimized assembly-language implementations on the Pentium II were described for MARS, RC6, Rijndael, and Twofish; only throughput results were provided. In [10], ANSI C-based implementations on a variety of platforms were described for all the AES candidate algorithms; both throughput and key-setup time results were provided. In Table 4.3,
throughput and key-setup latency comparisons are shown for encrypting 128-bit data blocks using 128-bit keys. Clearly, the FPGA implementations achieve a significant reduction in key-setup latency time, by a factor of 4-144. In software implementations, the cryptographic process cannot commence before the key-setup process for all the rounds is completed. As a result, the key-setup latency time equals the key-setup time, making key-context switching inefficient. On the contrary, in FPGAs, each cryptographic round can commence as early as possible, since the key-setup process can run concurrently with the cryptographic process. As a result, minimal latency can be achieved.

Table 4.3: Performance Comparisons with Software Implementations [6, 10]

            Throughput (Mbits/sec)               Key-Setup Latency (usec)
Algorithm   Software       Our      Speedup     Software       Our     Speedup
MARS        [6]  188.00    101.88   1/1.84      [10]  8.22     1.96      4.19
RC6         [6]  258.00    112.87   1/2.28      [10]  3.79     0.17     22.29
Rijndael    [6]  243.00    353.00   1.45        [10]  2.15     0.07     30.71
Serpent     [10]  60.90    148.95   2.44        [10] 11.57     0.08    144.62
Twofish     [6]  204.00    173.06   1/1.17      [10] 15.44     0.18     85.78

Regarding the throughput results, the software implementations achieve higher throughput by a factor of 1.84, 2.28, and 1.17 for MARS, RC6, and Twofish, respectively. These algorithms require multiplication operations. Our intuition is that the hardware specialization and parallelism exploited in FPGAs were not enough to outperform the efficiency of multiplication in software. On the contrary, the FPGA implementations achieved higher throughput by a factor of 1.45 and 2.44 for Rijndael and Serpent, respectively. This speedup reconfirms that Rijndael and Serpent favor FPGA implementations the most among the considered algorithms. It is also worth mentioning that Rijndael results in one of the fastest implementations in both software and FPGAs.

For processing multiple data streams with different keys and/or multiple blocks of data (e.g., non-feedback or interleaved operation modes), multiple rounds can be implemented, which would lead to superior performance compared with software-based implementations. In general, if p rounds are implemented and fully utilized, an improvement by a factor of p in the throughput of the FPGA implementations can be achieved compared with the results shown in Table 4.3. Further throughput optimizations can be achieved by adopting deep-pipelined designs. However, for feedback operation modes, the achieved throughput is limited by the per-round time t_round due to the data dependencies between consecutive blocks of data. As a result, a "single-round" implementation that is reused repeatedly is sufficient, unless a "multiple-round" implementation can result in a more efficient distribution of the sub-keys among cryptographic rounds.
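The speedup columns of Table 4.3 are simple ratios of the adjacent columns; a short consistency check (Python, using only the table's numbers):

# Recompute the speedup columns of Table 4.3 from its raw numbers.
# A throughput value below 1 means the software implementation is faster.

rows = {  # algorithm: (sw Mbits/s, our Mbits/s, sw key-setup us, our key-setup us)
    "MARS":     (188.00, 101.88,  8.22, 1.96),
    "RC6":      (258.00, 112.87,  3.79, 0.17),
    "Rijndael": (243.00, 353.00,  2.15, 0.07),
    "Serpent":  ( 60.90, 148.95, 11.57, 0.08),
    "Twofish":  (204.00, 173.06, 15.44, 0.18),
}

for name, (sw_tp, our_tp, sw_ks, our_ks) in rows.items():
    tp_speedup = our_tp / sw_tp   # throughput: FPGA over software
    ks_speedup = sw_ks / our_ks   # latency: software time over FPGA time
    print("%-8s  throughput x%.2f  key-setup x%.2f" % (name, tp_speedup, ks_speedup))

# e.g. Rijndael x1.45 and x30.71; MARS throughput x0.54, i.e. the 1/1.84 in
# Table 4.3; Serpent key-setup x144.62, the top of the 4-144 range above.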
Clearly, besides our implementations, Rijndael achieves the highest throughput in ASICs too.

Table 4.4: Performance Comparisons with ASIC Implementations [74]

             Throughput (Mbits/sec)        Key-Setup Latency (μsec)
Algorithm    [74]      Our      Speedup    [74]    Our     Speedup
MARS         56.71     101.88   1.79       9.55    1.96    4.87
RC6          102.83    112.87   1.09       5.74    0.17    33.76
Rijndael     605.77    353.00   1/1.71     0.00    0.07    —
Serpent      202.33    148.95   1/1.35     0.02    0.08    1/4.00
Twofish      105.14    173.06   1.64       0.06    0.18    1/3.00

Surprisingly enough, the FPGA implementations for MARS, RC6, and Twofish achieve higher throughput than their ASIC-based counterparts. For one thing, since ASIC technology can provide the ultimate performance, we assume that the resulting speedups are due to the design techniques (e.g., inherent parallelism) and the individual components (e.g., the multiplier) incorporated in our implementations. For another, the Virtex FPGAs are fabricated on a leading-edge 0.18μm, six-layer-metal silicon process [77], while a 0.5μm MOSIS-specific technology library was utilized in [74]. Regarding the key-setup latency time, the only major difference is the RC6 algorithm, where an improvement by a factor of 33.76 was achieved by our FPGA-based implementation.

Chapter 5

Graph Problems and Matrix Multiplication

Graph problems and matrix operations are two representative application domains where the hardware can be customized at run time based on the input problem instance. In this chapter, we demonstrate a mapping scheme for the single-source shortest path problem. It is a classical combinatorial problem that arises in many optimization problems (e.g., problems of heuristic search, deterministic optimal control problems, data routing within a computer communication network) [11]. Given a weighted, directed graph and a source vertex, the problem is to find a shortest path from the source to every other vertex. Given a graph instance, the hardware is customized at run time based on the graph topology. In addition, we address the matrix multiplication problem. Matrix multiplication is a kernel operation in digital signal and image processing. A scalable and partitioned solution is derived that can be adapted at run time to the size of the input matrices, as well as to the available hardware resources.

5.1 The Single-Source Shortest Path Problem

For solving the single-source shortest path problem, we consider the Bellman-Ford algorithm [20] (Algorithm 1).

Algorithm 1 The Bellman-Ford Algorithm
Input: G(V, E), w, s
/* Initialize single-source */
for each vertex v ∈ V do
    label(v) ← ∞
label(s) ← 0
/* Relax edges */
for k ← 1 to |V| − 1 do
    for each edge (u, v) ∈ E
        label(v) ← min{label(v), label(u) + w(u, v)}
/* Check for negative-weight cycles */
for each edge (u, v) ∈ E do
    if label(v) > label(u) + w(u, v) then
        return FALSE
return TRUE

The input of the algorithm is a weighted, directed graph and a source vertex. The edge weights can be negative. The complexity of the algorithm is O(ne), where n is the number of vertices and e is the number of edges. Initially, while the source label is set to 0, the labels of all destination vertices are set to ∞. In the relaxation phase, each edge is relaxed by updating the corresponding destination vertex's label according to the min operation in the pseudocode. After updating the labels n − 1 times, the algorithm checks whether there is a negative-weight cycle reachable from the source. If there is such a cycle, no solution exists. Otherwise, the label of each vertex corresponds to the cost of the shortest path from the source.
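For reference, a compact software version of Algorithm 1 is sketched below, including the early-termination check that underlies the m* convergence discussed next; the edge-list layout and the INF sentinel are illustrative choices, not part of the hardware design:

    #include <limits.h>
    #include <stdbool.h>

    #define INF (INT_MAX / 2)   /* "infinity" that survives one addition */

    typedef struct { int u, v, w; } Edge;

    /* Bellman-Ford with early termination: returns true and fills label[]
     * if no negative-weight cycle is reachable from s; false otherwise.
     * Converges after m* passes (m* <= n - 1) over the edge list, which
     * may be stored in any order. */
    bool bellman_ford(int n, const Edge *e, int ne, int s, int *label) {
        for (int v = 0; v < n; v++) label[v] = INF;
        label[s] = 0;
        for (int iter = 0; iter < n; iter++) {       /* at most n passes */
            bool changed = false;
            for (int k = 0; k < ne; k++) {
                if (label[e[k].u] == INF) continue;  /* source not reached yet */
                int cand = label[e[k].u] + e[k].w;
                if (cand < label[e[k].v]) { label[e[k].v] = cand; changed = true; }
            }
            if (!changed) return true;   /* converged after m* = iter passes */
        }
        return false;   /* labels still changing on the n-th pass: negative cycle */
    }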
For graphs with no negative-weight cycles reachable from the source, the algorithm may converge in fewer than n − 1 iterations [11]. The number of required iterations, m*, is the height of the shortest-path tree of the input graph. This height is equal to the maximum number of edges in a shortest path from the source. In the worst case, m* = n − 1, where n is the number of vertices.

We performed extensive software simulations to determine the relation between m* and n − 1 for graphs with no negative-weight cycles. Note that known FPGA-based solutions [9] always perform n − 1 iterations of the algorithm, regardless of the value of m*. Table 5.1 shows the experimental results for different problem sizes. For each problem size, 10^4-10^6 graph instances were randomly generated. Then, the value of m* for each graph instance was found, and the average over all graph instances was calculated. For the considered problem sizes, the number of required iterations grows logarithmically as the number of vertices increases. For values of e/n smaller than those in Table 5.1, m* starts converging to n − 1.

Table 5.1: Experimental Results for m*

Problem Size (n × e)    # of Iterations m* (average)
16 × 64                 4.52
16 × 128                4.40
16 × 240                4.31
64 × 256                4.13
64 × 512                4.06
64 × 1024               4.03
128 × 512               5.32
128 × 1024              4.96
128 × 2048              4.97
256 × 1024              6.04
256 × 2048              5.75
256 × 4096              5.86
512 × 2048              7.02
512 × 4096              6.71
512 × 8192              6.76
1024 × 4096             7.40
1024 × 8192             7.77
1024 × 16384            7.80

5.1.1 Mapping the Bellman-Ford Algorithm

The skeleton corresponds to a general graph G(V, E) with n vertices and e edges. A weight w(i, j) is assigned to each edge (i, j) ∈ E (i.e., the edge from vertex i to vertex j). The derived structure (Figure 5.1) consists of n modules connected in a pipelined fashion. An index id = 0, 1, ..., n − 1 and a label are uniquely associated with each module. Module i corresponds to vertex i. The weights of the edges are stored in the memory. The memory can be external or integrated on-chip. No particular ordering of the weights is required. Each memory word consists of the weight w(i, j) and the associated indices i and j.

[Figure 5.1: The Skeleton Architecture for the Bellman-Ford Algorithm — a Start/Stop module and a memory holding (w(i,j), i, j) words drive a pipeline of n modules; each module computes w(i,j) ← w(i,j) + label(i) and label(j) ← min{label(j), w(i,j)}.]

The Start/Stop module initiates execution on the hardware. An iteration corresponds to the e cycles needed to feed the contents of the memory one-by-one to the modules. The weights w(i, j) are repetitively fed to the modules every e cycles. The algorithm terminates after m* iterations. One extra iteration is required for the Start/Stop module to detect this termination. If no labels are modified during an iteration and m* < n,
the graph contains no negative-weight cycles reachable from the source and a solution exists. Otherwise, the graph contains a negative-weight cycle reachable from the source and no solution exists.

[Figure 5.2: The Module Architecture (left) and the Placement of the Modules into an Array of FPGAs (right).]

In each module (Figure 5.2), the values id and label are stored, and the relaxation of the corresponding edges is performed. In the upper part, the label is added to each incoming weight w(i, j). The index i is compared with id to determine whether the edge (i, j) is incident from vertex id. The weight w(i, j) is updated only if i = id. In the lower part, the weight w(i, j) is relaxed according to the min operation of the algorithm, as shown in Algorithm 1. The index j is compared with id to determine whether the edge (i, j) is incident to the vertex id. The label of the vertex id is updated only if j = id and w(i, j) < label. When label is updated, a flag U is asserted. At the beginning of each iteration, the signal R is set to 1 by the Start/Stop module to reset all the flags. In addition, R resets a register that contains the signal Stop in module n − 1. The signal Stop travels through the modules and samples all the flags. At the end of each iteration, the Stop signal is sampled by the Start/Stop module. If Stop = 0 and m* < n, the execution terminates and a solution exists. Otherwise, if Stop = 1 after n iterations, no solution exists.

The skeleton placement and routing onto the FPGA array (Figure 5.2) is simple and regular. The communication between consecutive modules is uniform and differs only at the boundaries of the array. Depending on the number of required modules, a "cut" (Figure 5.2) is formed that corresponds to the communication links of the last module (Figure 5.1). During the skeleton adaptation, the "cut" is formed and the labels in the allocated modules are initialized. The execution is managed by a control program executed on the host. This program controls the memory for feeding the required weights to the array. In addition, it initiates and terminates the execution via the Start/Stop module.

5.1.2 Area and Time Performance Estimates

The above module was implemented based on the parameterized libraries for the Xilinx 6200 series of FPGAs [48]. The footprint of each module was (p + ⌈log n⌉) × (4p + 2⌈log n⌉ + 10), where p denotes the number of bits in each weight/label and n is the number of vertices. For p = log n = 16, 4 modules can be placed in the largest device of the XC6200 family. However, by using contemporary FPGA devices (e.g., Virtex FPGAs [77]), several hundreds of modules can be mapped onto hardware. The memory space required was (p + 2⌈log n⌉) × e bits, where e is the number of edges. The required memory-array bandwidth is p + 2 log n bits/cycle to support the execution. To fully utilize the benefits of the FastMAP interface, 140 MB/sec bandwidth is required. Under this assumption, the largest XC6200 device can be configured in 165 μsec using wildcards [77].

The algorithm terminates after (m* + 1) × (e + 2n) clock cycles, where m* is the number of required iterations for a given graph. One cycle corresponds to the clock period of the skeleton.
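Plugging representative values into this cycle count gives a quick sense of the execution time; a small sketch using the largest case in Table 5.1 and the 15 MHz clock-rate estimate derived below:

    #include <stdio.h>

    /* Execution cycles for the skeleton: (m* + 1) x (e + 2n). */
    static long skeleton_cycles(long m_star, long n, long e) {
        return (m_star + 1) * (e + 2 * n);
    }

    int main(void) {
        long n = 1024, e = 16384;   /* largest case in Table 5.1 */
        long m_star = 8;            /* average m* of 7.80, rounded up */
        double f_clk = 15e6;        /* 15 MHz clock estimate (see below) */
        long cycles = skeleton_cycles(m_star, n, e);
        printf("%ld cycles = %.1f msec\n", cycles, 1e3 * cycles / f_clk);
        /* prints: 165888 cycles = 11.1 msec */
        return 0;
    }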
The clock rate for the skeleton was estimated to be at least 15 MHz for p = 16 bits, and at least 25 MHz for p = 8 bits. In the clock rate analysis, all the overheads caused by the routing were considered. The clock rate was determined mainly by the carry-chain adder of the modules. By using a faster adder, significant improvements in the clock rate are possible (e.g., Virtex FPGAs [77]). The mapping time was in the range of msec. The above mapping time analysis is based on the timings for the FastMAP interface in the Xilinx 6200 series of FPGAs databook [77].

The time performance can be improved by increasing the degree of parallelization explored on the hardware. For example, by feeding multiple memory contents per clock cycle to the modules, the number of required clock cycles can be significantly reduced. However, more complex modules are required to handle multiple weight values. Moreover, by deriving clusters of multiple modules, the number of required clock cycles can be reduced with respect to the length of the derived pipeline. However, sharing the same data among multiple modules inside a cluster can affect the clock rate of the pipeline. Further study is required to evaluate the time performance of these design improvements with respect to the area requirements and the clock rate.

5.1.3 Performance Comparisons

Table 5.2 shows performance comparisons of our solution with the solution in [9]. In [9], the shortest path problem is solved by using Dynamic Computation Structures (DCS). The key characteristic of the DCS approach is the mapping of each edge onto a physical wire. The experiments considered only problems with an average out-degree of 4 and a maximum in-degree of 8. For the instances considered, the compilation time was 4-16 hours, assuming that a network of 10 workstations was available. Extensive time was spent on placement and routing. For denser graphs, the mapping time would increase dramatically or, even worse, the derived logic netlists may not be routable. Hence, the resulting mapping time eliminated any gains achieved by fast execution time. To make fair comparisons with our solution, we assumed that the available bandwidth for configuring the array is 4 MB/sec, as in [9]. Even so, the mapping time for our solution was estimated to be in the msec range.

Besides the mapping overhead, the mapping of edges onto physical wires resulted in several limitations in [9] with respect to the clock rate and the area requirements. The clock rate depended on the longest wire, which is Ω(n²) in the worst case, where n is the number of vertices. This remark is supported by well-known theoretical results [71], which show that in the worst case a graph takes Ω(n⁴) area to be laid out, where n is the number of vertices, and that the longest wire is Ω(n²) long. Therefore, as n increases, the time performance in [9] degrades dramatically.
Table 5.2: Performance Comparisons with the Solution in [9]

                                  Solution in [9]              Our Solution
Check for negative-weight cycles  NO                           YES
Mapping Time                      > 4 hours                    ~100 msec
# of iterations                   n − 1                        m*
# of cycles                       n − 1                        (m* + 1) × (e + 2n)
Clock rate                        limited by Ω(n²) longest wire  independent of n
Area                              Ω(n⁴)                        O(n + e)

For n = 16, 64, 128, the execution time in [9] was on average 1.5-2.0 times faster than our approach, while it became 1.3 times slower for n = 256. For larger n, the performance degradation in [9] is expected to be more severe. Considering both the execution and the mapping time, the resulting speedup compared with the solution in [9] was 10⁶.

Also, in [9], n − 1 iterations were always executed and negative-weight cycles could not be detected. If checking for algorithm convergence and negative-weight cycles were included in the design, the resulting longest wire would increase further, drastically affecting the clock rate and the execution time. Finally, the time performance and the area requirements in [9] are determined completely by the efficiency of the CAD tools, and no reliable estimates can be made before compilation.

Area comparisons are difficult to make, since different families of FPGAs were used in [9]. Furthermore, the graph instances considered in [9] were not indicative of the entire problem space, since e/n = 4. For the instances considered in [9], one XC4013 FPGA was allocated per vertex but, as e/n increases, the area required grows rapidly. In our solution, O(n) area for FPGAs and O(e) memory were required. Moreover, our design is modular and can be easily adapted to different graph instances without complete redesign.

Software simulations were also performed to make time performance comparisons with uniprocessor-based solutions (Figure 5.3).

[Figure 5.3: Time Performance Comparison: Our Approach vs. Software Implementation — speedup versus the number of vertices n (up to 2500) for several edge densities e/n.]

The algorithm that was mapped onto the hardware was also implemented in the C language. The software experiments were performed on a Sun ULTRA 1 with 64 MB of memory and a clock rate of 143 MHz. No limitations on the in/out-degree of the vertices were assumed. For each problem size, 10^4-10^6 graph instances were randomly generated and the average running time was calculated. The compilation time on the uniprocessor to obtain the executable was not considered in the comparisons. Moreover, the data were assumed to be in the memory before execution, and no cache effects were considered. Under these assumptions, on average, an edge was relaxed every 250 nsec.
the clock rate of our implementation can be significantly improved lead ing to higher speedup. Similarly, by using contemporary microprocessors, faster software-based implementations can be realized too. However, the performance of software-based solutions is limited by the serial fashion of com puting of micro processors. On the contrary, by adopting the design improvements proposed in Section 5.1.2. higher degree of parallelism can be exploited onto hardware leading to further speedup improvement. 5.2 M atrix M ultiplication Conventional FPG A s are fine-grained architectures, mainly designed for imple m enting bit-level tasks and random logic functions. Their performance is lim ited for com putationally demanding applications over large word length data. A S3 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. highly promising avenue th at is being explored by m any research groups is coarse grained configurable architectures. These architectures are datapath-oriented structures and consist of a sm all number of powerful, word-based configurable processing elements (PEs). Such architectures can result in greater computa tional efficiency and high throughput for coarse-grained computing tasks. O ur mapping is based on a simple model of typical coarse-grained configurable architectures. It is a configurable linear array of identical PEs. Adjacent PEs are connected in a pipelined fashion with word parallel links. The data/control channels can be configured to communicate with each other at different speeds (datapath configuration). The PEs can also be configured to have different inter nal structures (functional configuration). This can be exploited to map hetero geneous computations where different computations are performed by different sets of PEs. The parameters o f the model include p. the num ber of PEs. m the amount of total memory in each PE, and w the d ata word width. An external controller/m em ory system provides the required d ata and control signals and can store the results computed by the array. I/O operations can only be performed at the left and right boundary of the array. Even if the target architecture is coarse-grained architectures, th e derived solution can be also applied to FPGAs. T he key idea of our approach is dynamic datapath configuration. By con figuring th e datapaths, we essentially schedule the dataflow along the array and the com putations that each d a ta stream participates in. The data operands are 84 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. transported and aligned through the array via differential speed data channels. Furthermore, the functionality of the PEs is changed by reconfiguring th e con nectivity (datapaths) among the functional units and local memories. For performing C — A x B. where each matrix is of size .V x S’. p PEs with m = 2p storage in each PE are required. Using such an array, a subproblem of size (p x S ) x (S x p) can be solved. We can perform a N x S ’ matrix multiplication using at most [A '/p]2 iterations of the subproblem. In each PE (Figure 5.4). p rows from a (p x A ) submatrix of .4 commute with exactly one column from a ( .V x p) subm atrix of B resulting in p elements of m atrix C'. Although we assum e A > p. similar solutions can be derived for smaller problem sizes. 
[Figure 5.4: PE Architecture — two p-word memory banks with update/upload control; a slow channel carries the A submatrix, a fast channel carries the B submatrix, and a fast OUT channel carries the results; operands are loaded every p cycles.]

Submatrices A (p × N) are fed into the array through a slow data channel (2 clock cycles of delay per PE) in column-major order. Submatrices B (N × p) are fed through a fast data channel (1 clock cycle of delay per PE) in row-major order. A fast output data channel is used to carry the results from the local memories out of the PEs. Two p-word banks of local memory are used for storing the intermediate results. During each iteration, the contents of one memory bank are uploaded onto the OUT data channel, while the intermediate results are stored in the other one. The uploading mechanism is performed in a repetitive manner along the array, starting from the leftmost PE. Using the speed differential between the data channels and the uploading mechanism, regular data flow and full utilization of the array are achieved. The regular structure of the computation makes the control of the array simple and uniform. The control signals travel through the array via fast and slow channels as well.

The clock cycle is determined by the multiply-add-update operation performed in each PE (Figure 5.4). By pipelining the datapath for this operation, the clock cycle time can be decreased. On average, 2 I/O data accesses and p local memory accesses (i.e., 1 per PE) are required during each clock cycle. The external data transfer through the I/O is ⌈N/p⌉(N² + Np), which is minimal [47]. Thus, an N × N matrix multiplication requires ⌈N/p⌉²pN + p² − 1 clock cycles (p results are computed per clock cycle on average). Our mapping achieves optimal time performance.

5.2.1 Experimental Results

For the sake of illustration, we apply the above approach to perform matrix multiplication on the Chameleon CS2000 [16] and on RaPiD [37].

The Chameleon CS2000 [16] incorporates two major cores: a 32-bit embedded processor and a 32-bit high-performance reconfigurable hardware fabric (Figure 5.5). The embedded processor is mainly utilized for control and data-movement operations, while the reconfigurable fabric is the main computation core. An essential feature of the reconfigurable fabric is instantaneous reconfigurability. The CS2000 can deliver 24 billion 16-bit ALU operations and 3 billion 16-bit MAC operations per second, and supports 48 GByte/sec on-chip memory bandwidth, 1 GByte/sec main memory bandwidth, and 2 GByte/sec programmable I/O bandwidth [16].

In Figure 5.6, the reconfigurable processing fabric of the CS2000 is shown. The fabric is divided into slices. Each slice can be reconfigured independently and consists of tiles. Each tile is built with 32-bit programmable datapath units (DPUs), 16×24 multipliers, local memories, and a programmable control unit. The functional units across the fabric are connected using configurable local and global interconnects. A configuration for the fabric determines the functionality of the DPUs, the datapaths across the fabric, as well as the programmable I/O.
[Figure 5.5: Chameleon CS2000.]

[Figure 5.6: CS2000 Reconfigurable Processing Fabric — slices composed of tiles, each tile containing 32-bit DPUs and 16×24 multipliers.]

Besides altering the hardware configuration off-line, instantaneous reconfiguration can occur transparently at runtime (3 μsec per slice).

The reconfigurable fabric of the CS2112 [16] was optimized to realize matrix multiplication. The current CS2112 family does not incorporate dedicated hardware for address generation and memory control for the local memories. As a result, an excessive number of DPUs was utilized to control the local memories, leading to the implementation of p = 12 MAC units even though 24 multipliers are available. Next-generation reconfigurable processors will include dedicated hardware for local memory control and address generation, allowing full utilization of the available multipliers.

In Table 5.3, performance results for matrix multiplication are shown. The results shown for the CS2112 correspond to p = 12 MAC units. Performance comparisons were also made with state-of-the-art DSPs (Texas Instruments TMS320C62x). These results are based on performance benchmarks provided by the vendor [70]. The maximum clock speed (300 MHz) in the TMS320C62x family was used. Note that the clock speed of the CS2112 is 125 MHz. Note also that the CS2112 is in its infancy, while high-end DSP and FPGA devices were considered in the comparisons. For example, the fabrication technology of the CS2112 is 0.25μ CMOS, while the TMS320C62x DSP family and the Xilinx Virtex FPGA are fabricated using 0.18μ CMOS. The performance results shown for the DSP correspond to computation time only; I/O and memory access overheads were not considered (best-case scenario). The results shown for the microprocessor are the fastest published results for matrix multiplication (single precision, floating point) [7].

Table 5.3: Matrix Multiplication on CS2112

            Matrix Order
            200                     500                     1000
Platform    Time (sec)  Relative   Time (sec)  Relative   Time (sec)  Relative
CS2112      0.0059      1.00       0.0667      1.00       0.6654      1.00
DSP         0.0133      2.25       0.2083      2.40       1.6667      2.43
μP          0.0459      7.78       0.5800      6.69       4.6893      6.84

In Table 5.3, the relative time was derived by normalizing the corresponding execution time with respect to the CS2112 execution time. Therefore, the relative time of a platform denotes the speedup achieved by the CS2112 over that platform. Clearly, the CS2112 outperforms the DSP and microprocessor platforms by factors of 2.4 and 7.1, respectively (on average). Even though the clock rate of the CS2112 is slower than that of the other platforms, the high degree of parallelism exploited in hardware results in superior performance.

Our solution was also applied to RaPiD [37]. RaPiD is a linear configurable array of functional units that is optimized for highly repetitive, computation-intensive tasks such as those found in signal and image processing [37]. Each cell (PE) consists of an integer multiplier, two integer ALUs, six general-purpose registers, and three local memories. From our model's perspective, this architecture corresponds to p = 16 and m = 96. Each datapath is 16 bits wide, and up
to 3 programmable delays can be inserted per path inside a cell. A conservative estimate for the clock rate is 100 MHz.

Our solution results in full utilization of the PEs in RaPiD and asymptotically the same time performance as in [37]. However, our mapping uses m = 2p memory per PE for storing intermediate results. In [37], additional local memory is required for storing data operands as well. It is easy to verify that our solution reduces the total number of local memory accesses. In addition, it can result in potential area savings.

[Figure 5.7: Memory Access and Requirements Compared with the Solution in [37] — top: difference in the average number of memory accesses per cycle (local and external) as a function of α = m/p; bottom: ratio of the local memory used in [37] over that needed in our approach, also as a function of α.]

Figure 5.7 compares the memory performance of the two approaches and the amount of local memory used. The difference in the asymptotic average number of memory accesses per cycle is drawn as a function of α = m/p. The value m = αp is the amount of local memory in each of the p PEs. Note that α = 2 and α = 6 represent the minimum and the available size of the local memory in [37], respectively. Moreover, the ratio of the local memory used in [37] over that needed in our approach is shown in Figure 5.7 as a function of α. Our approach reduces the overhead due to local memory accesses. For α > 3, the solution in [37] requires fewer external memory accesses than our approach. However, in order to achieve this, a larger amount of local memory is used, resulting in increased area and control overheads. In addition, the time performance remains the same, since the execution time depends on the number of available multipliers in both approaches.

Chapter 6

Parallel Deduction Engine for SAT Solvers

FPGAs are a promising technology for accelerating SAT solvers. Besides their high density, fine granularity, and massive parallelism, FPGAs provide the opportunity for run-time customization of the hardware based on the given SAT instance. In this chapter, a parallel deduction engine is proposed for backtrack search algorithms. The performance of the deduction engine is critical to the overall performance of the algorithm since, for any moderate SAT instance, millions of implications are derived. We propose a novel approach in which p, the amount of parallelization of the engine, is fine-tuned during problem solving in order to optimize performance. Not only is the hardware initially customized based on the input instance, it is also dynamically modified in terms of p based on the knowledge gained during solving the SAT instance. Compared with conventional deduction engines that correspond to p = 1, we demonstrate speedups in the range of 2.87-5.44 for several SAT instances.

6.1 Introduction

The boolean satisfiability problem (SAT) is a central problem in artificial intelligence, combinatorial optimization, and mathematical logic. SAT is a well-known NP-complete problem, and many problems can be proved NP-complete by reduction from SAT [20].
In this work, the conjunctive normal form (CNF) formula is considered. A CNF formula on boolean variables is expressed as an AND of clauses, each of which is the OR of one or more literals. A literal is the occurrence of a variable or its negation. Given a CNF formula, the objective of a SAT solver is to identify an assignment to the variables that causes the formula to evaluate to logic 1, or to verify that no such assignment exists.

Numerous FPGA-based SAT solvers have been proposed [1, 2, 58, 69, 80, 81] that take advantage of the fine granularity and the spatial computing paradigm of FPGAs. Compared with software implementations, the main advantage of implementing SAT solvers in FPGAs is the hardware customization of the bit-level operations and the inherent parallelism in checking a truth variable assignment against the constraints imposed by the clauses (i.e., the deduction process). Furthermore, reconfigurable hardware allows instance-specific customization and run-time modification of the hardware in order to achieve superior performance.

Our research focuses on FPGA-based implementations of the deduction process described in the Davis-Putnam procedure [33]. The deduction process described in [33] is the most widely used in backtrack search algorithms. Such algorithms have been proven to be superior for several classes of SAT instances, compared with local search algorithms and algebraic manipulation techniques [67]. In general, backtrack search algorithms consist of the iterative execution of decision, deduction, and backtrack processes. The performance of the deduction engine is critical to the overall performance of the algorithm since, for any moderate SAT instance, millions of implications are derived [79]. In [60], we demonstrated that, for various SAT instances, a parallel deduction engine can lead to up to an order of magnitude speedup compared with "serial" implementations. In this work, we develop a parallel deduction engine and introduce a heuristic based on which the engine is fine-tuned during problem solving in order to optimize performance. Not only is the hardware initially customized based on the input instance (a common feature of known FPGA-based SAT solvers), it is also dynamically modified based on the knowledge gained during solving the SAT instance.

6.2 Backtrack Search Algorithms and Deduction Process

Backtrack search algorithms start from an empty truth variable assignment (TVA) and organize the search by maintaining a binary decision tree. Each tree level corresponds to a variable assignment. Thus, each tree node is associated with a truth partial assignment to the set of variables. Searching the decision tree starts from the root of the tree and consists of the following iterative processes [67]:

1. Decision: At a tree level, the current assignment is extended by deciding a value for an unassigned variable.

2. Deduction: The current assignment is extended by implication assignments to unassigned variables, and possible conflicts are detected. If no conflict is detected, a new decision is made one level below. A conflict occurs when a variable is assigned different values.
3. Backtrack: If a conflict is detected, the current assignment is not valid, and the search process continues by making a new assignment at the same level or at a level closer to the root of the tree.

The deduction process described in the Davis-Putnam procedure [33] is the most widely used by backtrack search algorithms (e.g., GRASP [67]). By incorporating various decision and backtrack techniques, numerous backtrack search algorithms have been derived that incorporate the Davis-Putnam procedure [33]. According to this procedure, deduction is based on the unit clause rule: if a clause is not "true" and all but one of the literals in that clause have been assigned, then the unassigned literal is implied to logic 1 in order for the clause to be "true". Implications are derived using Binary Constraint Propagation (BCP) of a TVA (a software sketch of this rule is given at the end of this section). For any moderate SAT instance, millions of implications are derived [79].

In the remainder of this chapter, the Unit Propagation Time (UPT) is the average time to resolve implications or detect a conflict for every TVA determined by decision or backtrack. As a result, UPT = T/d, where T is the total time spent by the deduction process during problem solving and d is the total number of decisions and backtracks.

There are two primary approaches to improving the performance of backtrack search algorithms: reducing the number of tree nodes visited, and optimizing the deduction process. Incorporating decision assignment techniques can have a significant impact on the total number of decisions made. Furthermore, sophisticated backtracking techniques can significantly prune the search space. However, sophisticated decision assignment and backtrack techniques necessitate complex data structures that favor software over hardware implementations. On the other hand, the bit-level operations and the inherent parallelism of the deduction process match extremely well with the hardware characteristics of FPGAs. In this work, we focus on instance-dependent implementations of the deduction process using FPGAs that can be optimized at runtime based on the knowledge gained during solving the SAT instance.
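As a reference for the unit clause rule, a minimal software sketch of BCP over a clause list follows; the data layout (signed literals in zero-terminated rows of at most eight entries) and the function names are illustrative choices, unrelated to the hardware encoding used later in this chapter:

    #include <stdlib.h>

    /* Value of a signed literal (+v or -v) under val[], where val[v] is
     * -1 (unassigned), 0, or 1. */
    static int lit_value(const int *val, int lit) {
        int v = val[abs(lit)];
        return (v < 0) ? -1 : (lit > 0 ? v : !v);
    }

    /* BCP to a fixed point over zero-terminated clauses of signed literals.
     * Applies the unit clause rule; returns 1 on conflict, 0 otherwise. */
    int bcp(int val[], int (*clauses)[8], int nclauses) {
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int c = 0; c < nclauses; c++) {
                int unassigned = 0, last = 0, satisfied = 0;
                for (int i = 0; i < 8 && clauses[c][i] != 0; i++) {
                    int lv = lit_value(val, clauses[c][i]);
                    if (lv == 1) { satisfied = 1; break; }
                    if (lv == -1) { unassigned++; last = clauses[c][i]; }
                }
                if (satisfied) continue;
                if (unassigned == 0) return 1;   /* all literals false: conflict */
                if (unassigned == 1) {           /* unit clause rule fires */
                    val[abs(last)] = (last > 0);
                    changed = 1;
                }
            }
        }
        return 0;
    }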
6.3 State-of-the-art

In [80, 81], FPGA-based SAT solvers have been presented based on the deduction process described in the Davis-Putnam procedure [33]. In [80], instance-dependent mapping was utilized to take advantage of the parallelism inherent in the deduction process. Given a SAT instance, the implication circuit for each variable was derived on the fly and then hard-wired on the FPGA. An implication circuit corresponded to the conditions under which the considered variable would be implied by applying the unit clause rule to all the clauses that contain it. As a result, the deduction process was highly parallelized, resulting in O(1) UPT at the expense of slow clock rates (1-2 MHz). The proposed instance-dependent mapping led to irregular interconnect that required excessive compilation time (i.e., the time to derive the configuration and map it onto hardware). This excessive compilation time degraded the effective execution time. For example, compared with software implementations, the observed speedup of 51× for the hole10 SAT instance [35] was reduced to 8.3× due to the compilation time overhead [80]. Intuitively, the compilation time would be more severe for hard SAT instances or, even worse, the resulting mapping may not be routable.

In [81], a new architecture was proposed to drastically reduce the compilation time. Instead of the irregular global interconnect used in [80], a ring-based interconnect was utilized to connect modules in a pipelined fashion. Each module corresponded to a clause and contained the implication circuits for each of its variables, based on the unit clause rule. Given a SAT instance, the modules were adapted to the input clauses. As a result, compared with [80], the compilation time was drastically reduced. However, the resulting UPT was O(n + e), where n is the number of variables and e is the number of clauses in the input instance. Compared with [80], higher clock speeds were achieved, but a larger number of clock cycles was required for problem solving after mapping the solver onto the hardware.

While the performance of the design in [80] is limited by the compilation time, the performance of the design in [81] is critically affected by the increase in the UPT. In this work, a parallel deduction engine is proposed with low compilation overhead, similar to the compilation overhead in [81]. However, the UPT is O(n + e/p), where 1 ≤ p ≤ e is the amount of parallelization. The parameter p that leads to the overall minimum execution time is closely related to intrinsic characteristics of the given instance. To the best of our knowledge, there is no published work related to the computation of the optimal value for p. Different values of p correspond to different implementations of the deduction engine (see Section 6.4.1) and lead to different time performance. In our approach, the deduction engine is initially adapted based on the instance and a value for p. Then, the architecture dynamically evolves in terms of p to optimize overall performance. On the contrary, in [80] and [81] (as well as in other FPGA-based SAT solvers), the deduction engine is initially adapted to the instance but remains static (i.e., a fixed amount of parallelization) during problem solving. In [81], new clauses are added to the deduction engine to support conflict-directed dynamic learning; however, the amount of parallelization remains the same at p = 1.

6.4 An Adaptive Parallel Architecture for the Deduction Process

In [60], we evaluated the performance of a novel implementation of the deduction process. The key idea was to split the clauses of the input instance into p groups and perform the deduction process in each group in parallel. As a result, the deduction process consists of the following steps (a software sketch is given after the list):

1. The current variable assignment is fed to each group of clauses. Implications are resolved independently in each group. If a conflict is detected, the deduction process is terminated and backtrack is performed. If no implications occur, a new decision is made. Otherwise, Step 2 is performed.

2. A merging process is performed to combine the distinct TVAs derived in each group. As a result, a single TVA is derived that contains all the assigned variables. If a variable has been implied to conflicting values in different groups, the deduction process is terminated and backtrack is performed. Otherwise, Step 1 is repeated using the resulting TVA as input.
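A minimal software sketch of this two-step loop, reusing the bcp() routine sketched in Section 6.2; the fixed array bounds are illustrative, and the inner merge corresponds to Step 2:

    #include <string.h>

    int bcp(int val[], int (*clauses)[8], int nclauses); /* sketch from Section 6.2 */

    /* Run bcp() independently on p clause groups (Step 1), each on its own
     * copy of the TVA, then merge the copies (Step 2). Returns 1 on conflict
     * (inside a group or between groups), 0 once a fixed point is reached. */
    int parallel_deduce(int p, int nvars, int val[],
                        int (*groups[])[8], const int ngroup[]) {
        static int tva[64][1024];  /* illustrative bounds: p <= 64, nvars < 1024 */
        for (;;) {
            int new_implications = 0;
            for (int g = 0; g < p; g++) {                    /* Step 1 */
                memcpy(tva[g], val, (nvars + 1) * sizeof(int));
                if (bcp(tva[g], groups[g], ngroup[g])) return 1;
            }
            for (int v = 1; v <= nvars; v++) {               /* Step 2: merge */
                for (int g = 0; g < p; g++) {
                    if (tva[g][v] < 0) continue;             /* unassigned in group g */
                    if (val[v] < 0) { val[v] = tva[g][v]; new_implications = 1; }
                    else if (val[v] != tva[g][v]) return 1;  /* cross-group conflict */
                }
            }
            if (!new_implications) return 0;   /* no new implications: done */
        }
    }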
A simulator written in C++ was developed in [60] to evaluate the overall performance of a SAT solver that incorporates the above modified deduction process. Each group of clauses was implemented as a linear array of modules operated in a pipelined fashion. Hence, the deduction engine implemented in [80] corresponds to p = 1. The simulation results (cycle count) demonstrated that, compared with the baseline solution, the achieved speedup could be significant. The baseline solution corresponds to a deduction engine with p = 1. Speedup refers to the overall performance of the SAT solver. Static ordering of variables and chronological backtracking were considered. The observed speedups were up to an order of magnitude for the chosen instances [60]. However, given a SAT instance, the selection of a value for p that optimizes the performance of the SAT solver is a key research problem for realizing SAT solvers based on the parallel deduction process proposed in [60].

In Figure 6.1, the speedup for several values of p is shown for various SAT instances. The results shown in Figures 6.1, 6.3, and 6.4 correspond to the following SAT instances from the DIMACS benchmark [35]: par8-1-c, hole6, aim-50-2_0-no-4, aim-50-1_6-yes1-1, and aim-100-3_4-yes1-4. Static ordering of variables and chronological backtracking were incorporated by the simulator. The graphs were obtained using our simulator. For any SAT instance and values of p close to 1, the amount of parallelization of the deduction process is small, resulting in relatively poor speedup compared with the maximum speedup that can be achieved. On the other hand, for values of p close to the number of clauses e, the speedup degrades drastically, since the second step of the modified deduction process is repeated a large number of times.

[Figure 6.1: Speedup Compared with p = 1 for Various SAT Instances — speedup as a function of p.]

Given a SAT instance, the problem is to find a value of p that leads to maximum speedup. This optimal value of p is intrinsic to the instance. In Section 6.4.3, a fast heuristic is described that leads to near-optimal speedup. This heuristic is applied on the fly during problem solving.

6.4.1 Hardware Organization

A parameterized architecture for the deduction engine has been developed, consisting of groups of clauses. The architecture is scalable with respect to the number of clauses in the input instance. Each group of clauses corresponds to a linear array of modules connected in a pipelined fashion. Each module is associated with one clause of the given instance. The parameters of the architecture are the number of clauses e; the number of groups of modules p; the number of variables B that are moved per clock cycle to/from the architecture; and the maximum number of variables v that a module contains (i.e., every module can correspond to a clause with up to v variables). The engine extends an input TVA with implication assignments and detects conflicts. However, it can be modified to incorporate hardware implementations of decision and backtrack engines.
The decision and backtrack processes are assumed to be executed by a host machine.

In Figure 6.2, an illustrative example of a parallel deduction architecture is shown for p = 3 and e = 24, along with the module architecture. The TVA is fed to the architecture by the host at a rate of B variables per clock cycle. The merged TVA is repeatedly fed back to the architecture until no more implications occur or a conflict is detected. The main difference compared with [60] is that the merging process occurs concurrently with implication resolution. The merging process is implemented as a linear array of simple OR structures located at the end of each group. As a result, the hardware organization is simplified, leading to reduced compilation overhead and hardware area.

[Figure 6.2: Parallel Deduction Engine Architecture and Module Details — each module holds up to v variables in local memory and contains implication, conflict-detection, alignment, and control circuits; the signals State_In/State_Out, Variables_In/Variables_Out, Implied_In/Implied_Out, and the global Count connect consecutive modules.]

For the proposed architecture, UPT = t_p × (⌈n/B⌉ + ⌈e/p⌉ + p), where n is the number of variables in the given instance. The parameter t_p is intrinsic to the SAT instance and can be determined only after actually solving the SAT instance. In the proposed architecture, t_p also depends on the parameter p, the way in which the clauses are split into groups, and the ordering of the clauses within each group.

The architecture of a module is the same for all the clauses (Figure 6.2). Modules are connected in a pipelined fashion, and thus the clock cycle of the engine is approximately equal to the latency through a module. The variables associated with a clause are stored in the local memory of the associated module. A standard 2-bit encoding is used for unassigned and assigned variables. Up to v variables can be stored locally. The implication and control circuits are generic for any clause containing up to v literals. The implication circuit resolves implications and detects possible conflicts. The control circuit supervises the dataflow and the operation of the module based on the state of the engine (i.e., idle, implication, conflict). The state is propagated and updated through the modules by the State_In and State_Out signals. Similarly, the Implied_In and Implied_Out signals trace the occurrence of implications. The alignment circuits are specific to each clause and are used to update the variables in the local memory with the variables on the bus, and vice versa. Both alignment circuits have been implemented using ROM-based lookup tables. As a result, a module can be customized with respect to a given clause by updating the ROM contents only. Finally, the global signal Count is used to identify which variables are fed to the module, and thus it controls the alignment of variables.

6.4.2 Run-time Mapping

The time overhead to map the above deduction engine architecture onto reconfigurable hardware should be minimal, since the mapping contributes to the effective execution time. The modularity and the simplicity of the proposed architecture lead to templates with fast compilation times.
Templates can be developed off-line to facilitate run-time mapping. The major requirement for implementing a deduction engine is to adapt a template to the input instance for a given value of p. The templates correspond to FPGA configurations. A configuration corresponds to a deduction engine with specific values for B, v, and e_p, where e_p is the number of modules per group (i.e., e_p = ⌈e/p⌉). The parameter e_p determines the amount of parallelization p explored in the hardware for a given instance. The number of groups mapped onto an FPGA depends on the density of the FPGA. The parameter B depends on the I/O resources of the FPGA. A conservative estimate of the maximum value for B is 32 variables per clock cycle. The parameter v is associated with the clause that contains the largest number of literals in the given SAT instance. However, large clauses can be decomposed into smaller ones, resulting in a smaller v at the expense of an increased e.

Given a SAT instance, the number of clauses e determines the number of FPGAs needed to map the proposed deduction engine architecture. For a given p and a SAT instance, the FPGA(s) is configured at runtime. The resources of the FPGA(s) are assumed to be sufficient for solving the SAT instance. Without loss of generality, we assume that p divides e and that the deduction engine can be mapped onto one FPGA. Initially, the FPGA is loaded with a configuration corresponding to v ≥ v_max and e_p = ⌈e/p⌉, where v_max is the number of literals in the largest clause of the given instance. Then, by altering the interconnect, only the first p groups of modules remain connected to the merge circuits (Figure 6.2). This can be realized using partial reconfiguration, in order either to disconnect the remaining groups of modules from the corresponding merge circuits or to bypass these merge circuits. As a result, e clauses are implemented in p groups. Finally, using partial reconfiguration, the contents of the local memory in the modules are updated based on the clauses of the given instance. A conservative estimate of the time for mapping a deduction engine is twice the time required to configure the entire FPGA. During problem solving, the above procedure can be repeated to implement a deduction engine with a different p. An alternative solution for modifying the parameter p of the engine would be to alter only the interconnect. The advantage of this approach is that there is no need to update the contents of the local memory in the modules. However, further study is needed to understand the impact of such an approach on the reconfiguration cost and the clock speed.

6.4.3 Run-Time Performance Optimization

Given a SAT instance with e clauses and a set of available templates with different values of e_p, the objective of performance optimization is to determine the value of p that leads to maximum speedup compared with the baseline solution. Heuristics for splitting the clauses into groups and ordering the clauses inside a group are not considered in the proposed approach. In general, the time spent in the deduction process is a dominant factor in the performance of a backtrack search algorithm [67]. The total time spent on deduction during problem solving is d × UPT, where d is the total number of decisions and backtracks. Hence, intuitively,
optimizing the deduction process can lead to maximizing the speedup. The behavior of the UPT for various SAT instances and values of p is shown in Figure 6.3. The graphs were obtained using our simulator. For values of p close to 1 or to the number of clauses e, the UPT value degrades from the global minimum. In Figure 6.4, the product of the UPT with the corresponding speedup (Figure 6.1) is shown in terms of p. Clearly, the product is nearly constant. Therefore, the optimization problem of maximizing the speedup can indeed be approximated by the optimization problem of minimizing the UPT.

[Figure 6.3: UPT in terms of p.]

[Figure 6.4: The Product of UPT with the Corresponding Speedup in terms of p.]

In the following, a greedy heuristic is described for optimizing performance at runtime. Given a set of templates and a SAT instance, the objective is to find the template with the minimum UPT while solving the SAT instance. The heuristic is executed by the host machine and decides the template based on which the FPGA(s) is configured. Initially, a sorted linked list is created. Each element in the list corresponds to a template. The list is sorted in ascending order of e_p. As a result, the head of the list corresponds to the template with the minimum e_p.

Starting from the head of the list, the associated template is used to configure the FPGA(s), and the backtrack search algorithm begins execution. The implemented deduction engine is operated for k × (e_p + n + p) clock cycles, and execution is further continued until no more implications occur or a conflict is detected. The parameter k is an integer and e_p = ⌈e/p⌉. Empirical results suggest that for k < 5, a good approximation of the UPT can be derived. The derived UPT is recorded as the minimum UPT. Then, the template associated with the subsequent element in the list is used to configure the FPGA(s). The backtrack search algorithm continues using the mapped deduction engine, and the UPT is evaluated again. If the derived UPT is larger than the minimum UPT, the template corresponding to the minimum UPT is used to configure the FPGA(s) again, and the backtrack search algorithm continues until it solves the SAT instance. Otherwise, the derived UPT becomes the minimum UPT, the template associated with the subsequent element in the list is used to configure the FPGA(s), and the same procedure is repeated.

An interesting feature of the above optimization process is that the backtrack search algorithm need not restart from an empty TVA at the root of the binary decision tree after switching templates; it continues with the current TVA at the current tree level. The convergence rate of the heuristic and the achieved speedup depend on the distance between consecutive e_p values of the templates and on the minimum e_p. Different optimization heuristics or sets of templates can lead to higher speedup. Our idea of UPT-based performance optimization during problem solving can incorporate various heuristics and sets of templates.
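A compact host-side sketch of the greedy loop just described follows; configure(), run_and_measure_upt(), and solved() are hypothetical names standing in for the FPGA configuration call, the k × (e_p + n + p)-cycle UPT measurement, and the solver's termination test:

    /* Hypothetical host-side primitives: */
    typedef struct { int ep; } Template;          /* e_p = ceil(e/p) modules per group */
    extern void configure(const Template *t);     /* load the template onto the FPGA(s) */
    extern double run_and_measure_upt(void);      /* run ~k x (e_p + n + p) cycles, k < 5 */
    extern int solved(void);                      /* has the instance been solved? */

    /* Greedy template selection: walk the list (sorted by ascending e_p),
     * keep the template with the smallest measured UPT, and fall back to it
     * as soon as the UPT stops improving. The current TVA and tree level
     * are preserved across template switches. */
    void optimize(const Template tmpl[], int ntmpl) {
        int best = 0;
        configure(&tmpl[0]);
        double best_upt = run_and_measure_upt();
        for (int i = 1; i < ntmpl && !solved(); i++) {
            configure(&tmpl[i]);
            double upt = run_and_measure_upt();
            if (upt >= best_upt) break;           /* UPT got worse: stop searching */
            best = i;
            best_upt = upt;
        }
        if (!solved())
            configure(&tmpl[best]);               /* solve to completion on the best */
    }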
6.5 Experimental Results

The proposed deduction engine (Figure 6.2) was mapped onto the Virtex XCV1000-6-FG680 FPGA [77] for various values of e, p, v, and B. The Foundation Series 2.1i software development tool [77] was used for logic synthesis and place-and-route. No placement or timing constraints were imposed. In Figures 6.5 and 6.6, for e = 32 and various values of v and B, the hardware area and the clock speed are shown, respectively. The shown results are the average over implementations with p = 1, 2, 4, 8. For a given e, the hardware area and the clock speed are approximately the same for different values of p.

[Figure 6.5: Hardware Area in terms of v and B — number of slices versus v, the number of variables per module, for e = 32.]

[Figure 6.6: Clock Speed in terms of v and B — clock speed (MHz) versus v, the number of variables per module, for e = 32.]

Hardware area is proportional to e and is primarily determined by v and B. Approximately 100 modules with v = 16 and B = 16 can be mapped onto the above FPGA. The clock speed of the engine is primarily determined by v and decreases by a factor of 2-3 as v increases from 4 to 32. The sensitivity of the software tools used, as well as the absence of placement constraints, are the main reasons for the observed reduction of the clock speed with respect to different values of B. We are not aware of any published results regarding hardware area and clock speed, which are essential for making comparisons with software-based solutions.

Using our simulator [60], the heuristic proposed in Section 6.4.3 was applied to several SAT instances. The considered instances are those shown in Figure 6.3. A set of templates with e_p = ⌈e/p⌉ was assumed, where p ∈ {4, 8, 16, 32, 64}. Compared with the baseline solution, the speedup achieved by applying our approach is shown in Table 6.1. The results correspond to B = 16. Furthermore, the optimization steps are shown in terms of the parameter p. In addition, the maximum speedup that can be achieved and the associated value of p are shown. The maximum speedup was determined by solving the SAT instance for each value of p without applying our heuristic.

Table 6.1: Speedup Compared with the Baseline Solution (p = 1)

SAT Instance         Optimization Steps (p)     Speedup Achieved   Maximum (at p)
par8-1-c             64                         1.06               2.26 (p = 16)
hole6                64 → 32 → 16 → 8 → 16      3.12               3.64 (p = 16)
aim-50-2_0-no-4      64 → 32 → 16 → 8 → 16      3.03               3.04 (p = 16)
aim-50-1_6-yes1-1    64 → 32 → 16 → 8 → 16      2.87               2.90 (p = 8)
aim-100-3_4-yes1-4   64 → 32 → 16 → 8 → 16      5.44               6.25 (p = 32)

For the SAT instance par8-1-c, the problem was solved before the evaluation of the UPT was completed for the template with minimum e_p. For hole6 and aim-50-2_0-no-4, the template that leads to maximum speedup was found, and the achieved speedup was 3.12 and 3.03, respectively. Compared with the maximum speedup, the achieved speedup was 1%-15% smaller, depending on the time spent on templates with different values of p. Finally, for aim-50-1_6-yes1-1 and aim-100-3_4-yes1-4,
Finally, for aim-50-1_6-yes1-1 and aim-100-3_4-yes1-4, the template that leads to maximum speedup was not found, but the achieved speedup was only 1%-13% smaller than the corresponding maximum speedup. The achieved speedup is indicative for classes of SAT instances of similar hardness and number of clauses and variables.

The performance results shown in Table 6.1 do not include the time required to configure the FPGA(s). Considering the configuration overhead, the time performance of our approach becomes l x r_t + t_1/s, where l is the number of optimization steps, s is the achieved speedup, and t_1 and r_t are the time performance and the configuration overhead of the baseline solution, respectively. Similarly, the time performance of the baseline solution becomes r_t + t_1, where t_1 is usually in the order of seconds or minutes [81]. If r_t and t_1 are of the same order of magnitude, then the baseline solution and our solution will both achieve similar overall performance. However, for larger values of t_1 (i.e., harder SAT instances), the overall speedup achieved by our solution will be equal to s.

Chapter 7
Configuration Compression

The density and performance of FPGAs have drastically improved over the past few years. Consequently, the size of the configuration bit-streams has also increased considerably. As a result, the cost-effectiveness of FPGA-based embedded systems is significantly affected by the memory required for storing various FPGA configurations. In this work, a novel compression technique is proposed that reduces the memory required for storing FPGA configurations and results in high decompression efficiency. Decompression efficiency corresponds to the decompression hardware cost as well as the decompression rate. The proposed technique is applicable to any SRAM-based FPGA device since configuration bit-streams are processed as raw data. The required decompression hardware is simple and is independent of the individual semantics of configuration bit-streams or specific features of the on-chip configuration mechanism. Moreover, the time to configure the device is not affected by our compression technique. Using our technique, we demonstrate up to 41% savings in memory for configuration bit-streams of several real-world applications.

7.1 Introduction

The enormous growth of embedded applications has made embedded systems an essential component in products that emerge in almost every aspect of our life: digital TVs, game consoles, network routers, cellular base-stations, digital communication devices, printers, digital copiers, multifunctional equipment, home appliances, etc. For example, from the 200 million units shipped in the year 1997, the DSP embedded systems market has been forecasted to grow to 1,200 million units in the year 2001 [76]. The goal of embedded systems is to perform a set of specific tasks to improve the functionality of larger systems. As a result, they are usually not visible to the end-user since they are embedded in larger systems. Embedded systems usually consist of a processing unit, memory to store data and programs, and an I/O interface to communicate with other components of the larger system.
Their complexity depends on the complexity of the tasks they perform. The main characteristics of an embedded system are raw computational power and cost-effectiveness. The cost-effectiveness of an embedded system includes characteristics such as product lifetime, overall price, and power consumption, among others.

The unique combination of hardware-like performance with software-like flexibility makes FPGAs a highly promising solution for embedded systems. FPGA-based embedded systems allow the application designer to adapt the hardware resources to the application. Current estimates suggest that 80% of all System-on-Chip semiconductor devices shipped in the near future will include embedded reconfigurable logic [43]. Such systems will significantly reduce design risks and increase the production volume of the chip. Additionally, the requirements of applications in the information-based networked world are changing so rapidly that reconfigurability is possibly the only way to overcome severe time-to-market requirements.

Typical FPGA-based embedded systems have FPGA devices as their processing unit, memory to store data and FPGA configurations, and an I/O interface to transmit and receive data. FPGA-based embedded systems can sustain high processing rates, while providing the high degree of flexibility required in dynamically changing environments. FPGAs can be reconfigured on demand to support multiple algorithms and standards. Thus, the degree of the system flexibility strongly depends on the amount of configuration data that can be stored in the field. However, the size of the configuration bit-stream has increased considerably over the past few years. For example, the sizes of the configuration bit-streams of the Virtex FPGAs range from 0.6 Mbits to 33 Mbits [77]. As a result, storing configuration bit-streams in an FPGA-based embedded system becomes a critical problem, drastically affecting the cost-effectiveness of the system.

In this work, we propose a novel compression technique to reduce the memory requirements for storing configuration bit-streams in FPGA-based embedded systems. By compressing configuration bit-streams, significant savings in memory requirements can be achieved. The configuration compression occurs off-line. At runtime, decompression occurs and the decompressed data is fed to the on-chip configuration mechanism to configure the device. The major performance requirements of the compression problem are the decompression hardware cost and the decompression rate. These requirements distinguish our compression problem from conventional software-based applications. We are not aware of any prior work that addresses the configuration compression problem of FPGA-based embedded systems with respect to the cost and speed requirements.

Our compression technique is applicable to any SRAM-based FPGA device, since it does not depend on specific features of the configuration mechanism. The configuration bit-streams are processed as raw data without considering any individual semantics of the configuration data. As a result, both complete and partial configuration schemes can be supported. The required decompression hardware is simple and independent of the configuration format or characteristics of the configuration mechanism.
In addition, the achieved compression ratio is independent of the decompression hardware and depends only on the entropy of the configuration bit-stream. Finally, the time to configure an FPGA depends only on the data rate of the on-chip configuration mechanism, the speed of the memory that stores the configuration data, and the size of the configuration bit-stream. The decompression process does not add any overhead to the configuration time.

The proposed compression technique is based on the principles of dictionary-based compression algorithms. In addition, a unified-dictionary approach is proposed for compressing sets of configurations. Even though statistical methods can achieve higher compression ratios, we propose a dictionary-based approach because statistical methods lead to high decompression hardware cost. The dictionary corresponds to configuration data that is stored in the memory. In our scheme, the dictionary is derived based on the well-known LZW compression algorithm [56]. However, a major deviation from LZW-based techniques is the calculation of the compression ratio. Our compression technique proposes a novel way of constructing the dictionary to significantly improve the compression ratio. In addition, our technique delivers the decompressed data in order. On the contrary, in conventional LZW-based techniques, the decompressed data is delivered in reverse order and a stack is utilized to reconstruct the original data. However, by incorporating a stack in a hardware implementation, the decompression rate would be affected significantly.

Using our technique, we demonstrate 11%-41% savings in memory for configuration bit-streams of several real-world applications. The configuration bit-streams corresponded to cryptographic and digital signal processing algorithms. Our target architecture was Virtex FPGAs [77]. Single configuration bit-streams, as well as sets of them, were compressed using our technique. The size of the configuration bit-streams ranged from 1.7 Mbits to 6.1 Mbits.

7.2 FPGA Configuration

An FPGA configuration determines the functionality of the FPGA device. An FPGA device is configured by loading a configuration bit-stream into its internal configuration memory. An internal controller manages the configuration memory, as well as the configuration data transfer via the I/O interface. Throughout this chapter, we refer to both the configuration memory and its controller as the configuration mechanism. Based on the technology of the internal configuration memory, FPGAs can be permanently configured once or can be reconfigured in the field. For example, Anti-Fuse technology allows one-time programmability while SRAM technology allows reprogrammability. In this work, we focus on SRAM-based FPGAs.

In SRAM-based FPGAs, the contents of the internal configuration memory are reset after power-up. As a result, the internal configuration memory cannot be used for storing configuration data permanently. Using partial configuration, only a part of the contents of the internal configuration memory is modified.
As a result, the configuration time can be significantly reduced compared with the configuration time required for a complete reconfiguration. Moreover, partial configuration can occur at runtime without interrupting the computations that an FPGA performs. SRAM-based FPGAs require external devices to initiate and control the configuration process. Usually, the configuration data is stored in an external memory and an external controller supervises the configuration process.

The time required to configure an FPGA depends on the size of the configuration bit-stream, the clock rate and the operation mode of the configuration mechanism, and the throughput of the external memory that stores the configuration bit-stream. Typical sizes of configuration bit-streams range from 0.6 Mbits to 33 Mbits [4, 8, 77] depending on the density of the device. The clock rate of the configuration mechanism determines the rate at which the configuration data is delivered to the FPGA device. The configuration data can be transferred to the configuration mechanism serially or in parallel. Parallel modes of configuration result in faster configuration time. Typical values of data rates can be as high as 480 Mbits/sec [4, 8, 77]. Thus, the external memory that stores the configuration bit-stream should be able to sustain the data rate of the configuration mechanism. Otherwise, the memory becomes a performance bottleneck and the time to configure the device increases. Such an increase could be critical for applications where an FPGA is configured on-demand based on run-time parameters.

Configuration bit-streams consist of data to be stored in the internal configuration memory, as well as instructions to the configuration mechanism. The data configures the FPGA architecture, that is, the configurable logic blocks, the interconnection network, the I/O pins, etc. The instructions control the functionality of the configuration mechanism. Typically, instructions are used for initializing the configuration mechanism, synchronizing clock rates, and determining the memory addresses at which the data will be written. The format of a configuration bit-stream depends on the characteristics of the configuration mechanism as well as the characteristics of the FPGA architecture. As a result, the bit-stream format varies among different vendors, or even among different FPGA families of the same vendor.

7.3 Compression Techniques: Applicability and Implementation Cost

Data compression has been extensively studied in the past. Numerous compression algorithms have been proposed to reduce the size of data to be stored or transmitted over a network. The effectiveness of a compression technique is characterized by the achieved compression ratio, that is, the ratio of the size of the compressed data to the size of the original data. However, depending on the application, metrics such as processing rate, implementation cost, and adaptability may become critical performance issues. In this section, we will discuss compression techniques and the requirements to be met for compressing FPGA configurations in FPGA-based embedded systems.

In general, a compression technique can be either lossless or lossy. Lossless compression techniques reconstruct the exact original data after decompression.
Lossless techniques are used in applications where any loss of information after decompression is critical. On the contrary, lossy compression techniques eliminate certain information of the original data after decompression. Lossy techniques are primarily used in image, video, and audio applications. For configuration compression, the configuration bit-stream should be reconstructed without loss of any information, and thus, a lossless compression technique should be used. Otherwise, the functionality of the FPGA may be altered, or even worse, the FPGA may be damaged.

Lossless compression techniques are based on statistical methods or dictionary-based schemes. For any given data, statistical methods can result in better compression ratios than any dictionary-based scheme [56]. Using statistical methods, a symbol in the original data is encoded with a number of bits proportional to the probability of its occurrence. By encoding the most frequently-occurring symbols with fewer bits than their binary representation requires, the data is compressed. The compression ratio depends on the entropy of the original data, as well as the accuracy of the model that is utilized to derive the statistical information of the given data. However, the complexity of the decompression hardware can significantly increase the cost of such an approach. In the context of embedded systems, dedicated decompression hardware (e.g., CAM memory) is required to align codewords of different lengths, as well as to determine the output of a codeword.

In dictionary-based compression schemes, single codewords encode variable-length strings of symbols [56]. The codewords form an index to a phrase dictionary. Decompression occurs by parsing the dictionary with respect to its index. Compression is achieved if the codewords require a smaller number of bits than the strings of symbols that they replace. Contrary to statistical methods, dictionary-based schemes require significantly simpler decompression hardware. Only memory read operations are required during decompression and high decompression rates can be achieved. Therefore, in the context of FPGA-based embedded systems, a dictionary-based scheme would result in fairly low implementation cost.

In Figure 7.1, a typical architecture of FPGA-based embedded systems is shown. These systems consist of an FPGA device(s), memory to store data and FPGA configurations, a configuration controller to supervise the configuration process, and an I/O interface to transmit and receive data. The configurations are compressed off-line by a general-purpose computer and the compressed data is stored in the embedded system.

[Figure 7.1: FPGA-based Embedded System Architecture (configuration bit-stream memory, configuration controller, FPGA, data memory, and I/O interface)]

Besides the memory requirements for the compressed data, additional memory may be required during decompression. For example, in LZ-based algorithms [56], the dictionary can be reconstructed on the fly based on the index. As a result, in software-based applications, only the index is stored or transmitted. Thus, only the index is considered in the calculation of the compression ratio. However, in the context of embedded systems, the memory requirements to store the dictionary should also be considered.
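The distinction can be stated compactly. Writing M_index, M_dictionary, and M_bitstream for the memory occupied by the index, the dictionary, and the original configuration bit-stream (symbols introduced here only for this comparison):

    ratio_software = M_index / M_bitstream
    ratio_embedded = (M_dictionary + M_index) / M_bitstream

The lower bounds reported in Section 7.6 correspond to ratio_software (the index alone), while all compression ratios claimed for our technique are ratio_embedded.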
At runtime, decompression occurs and the original configuration bit-stream is delivered to the FPGA configuration mechanism. As a result, the decompression hardware cost and the decompression rate become major requirements of the compression problem. The decompression hardware cost may affect the cost of the system. In addition, if the decompression rate cannot sustain the data rate of the configuration mechanism, the time to configure the FPGA will increase.

7.4 Related Work

Work related to FPGA configuration compression has been reported in [45, 46]. In [45], the proposed technique took advantage of the characteristics of the configuration mechanism of the Xilinx XC6200 architecture. Therefore, the technique is applicable only to that architecture. In [46], runlength compression techniques for configurations have been described. Again, the techniques were developed specifically for the Xilinx XC6200 architecture. Addresses were compressed using runlength encoding while data was compressed using LZ compression (sliding-window method [56]). Dedicated on-chip hardware was required for both methods. A set of configuration bit-streams (2-88 Kbits) was used to fine-tune the parameters of the proposed methods. A 16-bit size window was used in the LZ implementation. While this window size led to good results for these bit-streams, it is impractical for larger configuration bit-streams. Moreover, a fine-tuned scheme for larger configuration bit-streams would lead to larger size windows. As stated in [46], larger size windows impose a fairly high hardware penalty with respect to the buffer size as well as the supporting hardware.

In [51, 52], dictionary-based compression techniques were utilized for code minimization in embedded processors. However, code minimization takes advantage of the semantics of programs for Instruction Set Architecture (ISA) based processors and is unlikely to achieve similar results for FPGA configuration bit-streams (i.e., raw data). For example, programs can have jumps that require decompression to be performed in a non-sequential manner, while configuration bit-streams should be decompressed sequentially. In [51], the dictionary was built by solving a set-covering problem. The underlying compression model was developed with respect to the semantics of programs (i.e., control-flow and operational instructions). The size of the considered programs was 0.5-10.0 Kbits and the achieved compression ratios (i.e., the size of the compressed program as a fraction of the original program) were approximately 85%-95%. Since the technique in [51] was developed for code size minimization, it is not fair to make any compression ratio comparisons with our results. In [52], a fixed-size dictionary was used for compressing programs. The size of the programs was in the order of hundreds of bits. No detailed information was provided regarding the algorithm used to build the dictionary. The authors mainly focused on tuning the dictionary parameters to achieve better compression results based on the specific set of programs. However, such an approach is unlikely to achieve the same results for FPGA configurations, where the bit-stream is a data file and not a program for ISA-based processors.
In addition, Huffman encoding was used for compressing the codewords. As a result, dedicated hardware resources were needed for decompressing the codewords.

7.5 Our Compression Technique

Our compression technique is based on the principles of dictionary-based compression algorithms. Even though statistical methods can achieve higher compression ratios [56], we propose a dictionary-based approach because dictionary-based schemes lead to simpler and faster decompression hardware. In our approach, the dictionary corresponds to configuration data and the index corresponds to the way the dictionary is read in order to reconstruct a configuration bit-stream.

In Figure 7.2, an overview of our configuration compression technique is shown. The input configuration bit-stream is read sequentially in the reverse order. Then, the dictionary and the index are derived based on the principles of the well-known LZW compression algorithm [56]. In general, finding a dictionary that results in optimal compression has exponential complexity [56]. By deleting non-referenced nodes and by merging common prefix strings, a compact representation of the dictionary is achieved. Finally, a heuristic is applied that further enhances the dictionary representation and leads to savings in memory. The original configuration bit-stream can be reconstructed by parsing the dictionary with respect to the index in reverse order. The achieved compression ratio is the ratio of the total memory requirements (i.e., dictionary and index) to the size of the bit-stream. In the following, we describe in detail our compression technique as well as the decompression method.

[Figure 7.2: Our Configuration Compression Technique (reverse-order read of the bit-stream; LZW dictionary construction; compact representation by merging common prefix strings and deleting non-referenced nodes; heuristic that selectively deletes substrings and nodes)]

7.5.1 Basic LZW Algorithm

The LZW algorithm [56] is an adaptive dictionary encoder, that is, the coding technique of LZW is based on the input data already encoded. The input to the algorithm is a sequence of binary symbols. A symbol can be a single bit or a data word. Symbols are processed sequentially. By combining consecutive symbols, strings are formed. In our case, the input is the configuration bit-stream. Moreover, the bit-length of the symbol determines the way the bit-stream is processed (e.g., bit-by-bit, byte-by-byte). The main idea of LZW is to replace the longest possible string of symbols with a reference to an existing dictionary entry. As a result, the derived index consists of pointers to the dictionary.

Algorithm 2 The LZW Algorithm
Input: An input stream of symbols IN.
Output: The dictionary and the index.
  dictionary <- input alphabet symbols
  S = NULL
  repeat
    s <- read a symbol from IN
    if Ss exists in the dictionary
      S <- Ss
    else
      output the code for S
      add Ss to the dictionary
      S <- s
    end
  until (all input data is read)

Initially, the dictionary is preloaded with entries for all the symbols of the input alphabet (Algorithm 2).
For example, if the symbol is a byte, the dictionary is preloaded with entries for 0-255. One symbol s is read at a time. A temporary string S is utilized during compression. If the string Ss is not found in the dictionary, the code for S is added to the index and Ss becomes a new entry in the dictionary. The dictionary contains all the previously seen strings. There is no restriction on the size of the dictionary, so more and more phrases are generated as encoding proceeds. If the string Ss is found in the dictionary, a new symbol is read. The procedure terminates when all the input data has been read.

In software-based applications, only the index is considered in the calculation of the compression ratio. The main advantage of LZW (and any LZ-based algorithm) is that the dictionary can be reconstructed based on the index. As a result, only the index is stored in a secondary storage medium or transmitted. The dictionary is reconstructed on-line and the extra memory required is provided by the "host." However, in embedded systems, no secondary storage medium is available, and the extra required memory has to be considered in the calculation of the compression ratio. Also, note that the dictionary includes phrases that are not referenced by its index. This happens because, as compression proceeds, LZW keeps all the strings that are seen for the first time. This is performed regardless of whether these strings will be referenced or not. This is not a problem in software-based applications since the size of the dictionary is not considered in the calculation of the compression ratio.

7.5.2 Compact Dictionary Construction

In our approach, we propose a compact memory representation for the dictionary. In general, the dictionary is a forest of suffix trees (i.e., one tree for each symbol of the input alphabet). Each string in a tree is stored in the memory as a singly-linked list. The root of a tree is the head of all the lists in that tree. Every entry in the memory consists of a symbol and an address to a prefix string, and every string is associated with an entry. A string is read by traversing the corresponding list from the address of its associated memory entry to the head of the list. Furthermore, dictionary entries that are not referenced in the index are deleted and not stored in the memory. Finally, common prefix strings are merged as one string.

An example of our dictionary representation is shown in Figure 7.3. For illustrative purposes, we consider letters as symbols. The root of the tree is the symbol "C". Each one of the strings "COMPUTE", "COMPUTER", and "COMPUTATION" is associated with a node. Since the string "COMPUT" is a common prefix string, it is only represented once in the memory. In Figure 7.4, the memory organization for storing the dictionary and the index of the above example is shown. The memory requirements for the dictionary are n_dictionary x (data_symbol + ceil(log2 n_dictionary)) bits, where n_dictionary is the number of memory entries of the dictionary and data_symbol is the number of bits required to represent a symbol. Similarly, the memory requirements for the index are n_index x ceil(log2 n_dictionary) bits, where n_index is the number of memory entries of the index.
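As a concrete illustration of Algorithm 2, the following Python sketch builds the dictionary and the index for a stream of byte symbols. It is a minimal software model of the encoder only; it does not implement the compact linked-list memory layout, the reverse-order processing, or the heuristic described below, and the function name and data structures are ours.

def lzw_encode(data: bytes):
    """Build an LZW dictionary and index for a stream of byte symbols."""
    # Preload the dictionary with entries for all 256 single-byte symbols.
    dictionary = {bytes([b]): b for b in range(256)}
    index = []          # output codewords (pointers into the dictionary)
    S = b""             # temporary string of already-matched symbols
    for b in data:
        Ss = S + bytes([b])
        if Ss in dictionary:
            S = Ss                            # extend the current match
        else:
            index.append(dictionary[S])       # output the code for S
            dictionary[Ss] = len(dictionary)  # add Ss as a new entry
            S = bytes([b])
    if S:
        index.append(dictionary[S])           # flush the final match
    return dictionary, index

With 8-bit symbols, the formulas above can be evaluated directly on the result: the dictionary occupies len(dictionary) x (8 + ceil(log2(len(dictionary)))) bits and the index len(index) x ceil(log2(len(dictionary))) bits, before the non-referenced entries are pruned and common prefixes merged as described in Section 7.5.2.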
[Figure 7.3: An Illustrative Example of Our Dictionary Representation (the strings "COMPUTE", "COMPUTER", and "COMPUTATION" share the prefix "COMPUT")]

Figure 7.4: An Illustrative Example of Memory Organization for the Dictionary and the Index

Dictionary
Address   Symbol   Pointer
0001      C        0000
0010      O        0001
0011      M        0010
0100      P        0011
0101      U        0100
0110      T        0101
0111      A        0110
1000      T        0111
1001      I        1000
1010      O        1001
1011      N        1010
1100      E        0110
1101      R        1100

Index
String        Codeword
COMPUTATION   1011
COMPUTER      1101
COMPUTE       1100

From the example of Figure 7.4, we notice that during decompression, the decompressed strings are delivered in reverse order. In fact, in software-based implementations [56], a stack is used to deliver each decompressed string in the right order. However, in the considered embedded environment, additional hardware is required to implement the stack. In addition, the size of the stack should be as large as the length of the longest string in the dictionary. Moreover, the time overhead to reverse the order of the decompressed strings would affect the time to configure the FPGA. In our scheme, to avoid the use of a stack, we derive the dictionary after reversing the order of the configuration bit-stream. During decompression, the configuration bit-stream is reconstructed by parsing the index in the reverse order. In this way, the decompressed strings are delivered in order and the exact original bit-stream is reconstructed. We have performed several experiments to examine the impact of compressing a reverse-ordered configuration bit-stream instead of the original one. Our experiments suggest that the memory requirements for both the dictionary and the index are very close to each other in both cases (i.e., variation less than +/-1%).

7.5.3 Enhancement of the Dictionary Representation

After deriving the dictionary and its index, we reduce the memory requirements of the dictionary by selectively decomposing strings in the dictionary. In the following, a prefix string corresponds to a path from any node up to the tree root. Similarly, a suffix string corresponds to a path from a leaf node up to any node. Finally, a substring corresponds to a path between any two arbitrary nodes. The main idea is to replace frequently-occurring substrings by a new or an existing substring. As a result, while memory savings can be achieved for the dictionary, additional codewords are also introduced, leading to index expansion. For example, consider the prefix strings "COMPUTER" and "QUALCOM" (Figure 7.5). Again, for illustrative purposes, we consider letters as symbols.

[Figure 7.5: An Illustrative Example of Enhancing the Dictionary Representation ("COMPUTER" and "QUALCOM" share the substring "COM")]

Since "COM" is a common substring, by storing it in the memory only once, the dictionary size can be reduced. However, one additional codeword is required for "COMPUTER" since it is decomposed into two substrings (i.e., "COM" and "PUTER"). In general, the problem of decomposing substrings that can result in maximum savings in memory has exponential complexity.
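The trade-off behind this decomposition can be quantified with the memory formulas of Section 7.5.2. The following arithmetic is ours and is for illustration only, assuming 8-bit symbols and a dictionary small enough to be addressed with 4 bits:

    Storing "COM" once instead of twice saves 3 entries x (8 + 4) = 36 dictionary bits.
    Decomposing "COMPUTER" into "COM" and "PUTER" adds one extra codeword for every
    index reference to the decomposed string, at 4 bits per codeword.
    The deletion is therefore worthwhile only while 36 - 4c >= 0, i.e., while the
    decomposed string is referenced c <= 9 times; beyond that, the index expansion
    outweighs the dictionary savings.

This is exactly the balance that the cost functions of the heuristic below evaluate.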
In the following, a 2-phase greedy heuristic is described that selectively decomposes substrings to achieve overall memory savings. A bottom-up approach is used that prunes the suffix trees starting from the leaf nodes and replaces deleted suffix strings by new (or existing) prefix strings. We concentrate only on suffix strings that include nodes pointed at by only one suffix string. Otherwise, the suffix string extends over a large number of prefix strings, resulting in a lower possibility for potential savings in memory. Using our heuristic, 80%-85% of the nodes in all suffix trees were examined for the bit-streams considered in our experiments (see Section 7.6).

In the first phase, we delete suffix strings that can lead to potential savings in memory (Algorithm 3). Initially, we identify repeated suffix strings that appear across all the suffix trees of the dictionary. As mentioned earlier, the number of suffix trees in the dictionary equals the number of symbols of the input alphabet. For each distinct suffix string s_i, the potential savings in memory cost(s_i) are computed. The cost(s_i) depends on the potential savings in dictionary memory and the potential index expansion assuming that s_i is deleted from all the suffix trees. Only suffix strings s_i with non-negative cost(s_i) are deleted. By reducing the dictionary size, the number of bits that is required to address the dictionary (i.e., ceil(log2 n_dictionary)) can decrease too. As a result, the word length of both the dictionary and index memories can decrease, resulting in further savings in memory.

Algorithm 3 Our Heuristic - Phase 1
Input: A dictionary D_in and an index I_in.
Output: Enhanced dictionary D_temp and index I_temp.
  STRINGS = {suffix strings in D_in containing nodes that are pointed at by only one suffix string}
  Gamma = {s_i : s_i in STRINGS, duplicates removed}        /* distinct suffix strings */
  Gamma_l = {s_i in Gamma : length(s_i) = l}                /* L = max_i length(s_i) */
  /* data_dictionary: word-length of the dictionary memory */
  /* data_index: word-length of the index memory */
  /* n_i: node of s_i with the highest distance from a leaf node */
  /* r_i: # of x in STRINGS : x = s_i */
  /* c_i: # of times n_i is referenced by the index */
  /* a_i = 0 if a prefix string x in D_in with x = s_i already exists, else a_i = 1 */
  cost(s_i) = (r_i - a_i) x length(s_i) x data_dictionary - c_i x data_index
  S_delete = NULL
  for l = 1..L
    S_temp = NULL
    for all s_i in Gamma_l
      if cost(s_i) >= 0
        S_delete = S_delete U {s_i}
      else
        S_temp = S_temp U {s_i} U {x in STRINGS : s_i is contained in x}
      end
    end
    Gamma = Gamma - S_temp
  end
  delete {x in STRINGS : x = y, y in S_delete}
  S_new = {new prefix strings that replace the deleted suffix strings}
  D_temp = D_in - {deleted substrings} U S_new
  I_temp = {restore I_in due to the deleted substrings}
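The following Python fragment paraphrases the core decision of phase 1, under the cost model reconstructed above (dictionary savings from deleting all copies of a suffix string, charged against the index expansion its deletion causes); the names and packaging are ours, not part of the original algorithm.

def phase1_cost(r_i, a_i, length_i, c_i, data_dictionary, data_index):
    """Net memory savings (in bits) from deleting all copies of suffix string s_i.

    r_i: number of copies of s_i across the suffix trees
    a_i: 1 if a new prefix string must be added to replace s_i, else 0
    length_i: number of nodes in s_i
    c_i: number of index references that each split into an extra codeword
    """
    dictionary_savings = (r_i - a_i) * length_i * data_dictionary
    index_expansion = c_i * data_index
    return dictionary_savings - index_expansion

# Delete s_i only when the net savings are non-negative:
#   if phase1_cost(r_i, a_i, length_i, c_i, w_dict, w_idx) >= 0: mark s_i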
In the second phase, we selectively delete individual nodes of the suffix trees in order to decrease the number of bits required to address the dictionary (Algorithm 4). The deletion of nodes results in index expansion. However, the memory requirements due to the increase of the index size can potentially be amortized by the decrease of the word length of both the dictionary and the index memories. The goal is to reduce the dictionary size while introducing a minimum number of new codewords. Initially, nodes n_i of the same distance across all the suffix trees are sorted with respect to the number of codeword splits cost(n_i) (i.e., the number of new codewords introduced if the node is deleted). Then, starting from the leaf nodes, we mark individual nodes according to their cost(n_i). A marked node is eligible to be deleted. Nodes with a smaller number of codeword splits are marked first. We continue to mark nodes until we achieve a 1-bit savings in addressing the dictionary. If the index expansion results in increasing the total memory requirements, the marked nodes are not deleted and the procedure is terminated. Otherwise, the marked nodes are deleted and the procedure is repeated.

Algorithm 4 Our Heuristic - Phase 2
Input: D_temp and I_temp from Algorithm 3.
Output: Enhanced dictionary D_enh and index I_enh.
  N = {n_i : n_i in s_i, s_i in D_temp and STRINGS}
  /* STRINGS is the same set of strings as in Algorithm 3 */
  /* n_i: dictionary node */
  cost(n_i) = # of times n_i is referenced by the index
  depth(n_i) = distance from a leaf node
  sort N in terms of depth(n_i)                     /* ascending order */
  sort n_i of the same depth in terms of cost(n_i)  /* ascending order */
  N_m = NULL
  n_temp = 2^(ceil(log2 |D_temp|) - 1)              /* |*| = # of nodes in * */
  while |N| > n_temp
    repeat
      mark consecutive nodes in N with respect to the sorting
      N_m = {marked nodes}
    until (|D_temp| - (# of marked nodes - a) == n_temp)
    /* sum of cost(n_i): total index expansion caused by the marked nodes */
    /* a: # of nodes required to replace suffix strings that will be deleted
       if the marked nodes are deleted */
    if (deletion of the marked nodes results in overall savings)
      N = N - N_m
      |D_temp| = n_temp
      n_temp = 2^(ceil(log2 |D_temp|) - 1)
    else
      BREAK
    end
  end
  delete {marked nodes}
  S_new = {new prefix strings that replace the deleted suffix strings}
  D_enh = D_temp - {marked nodes} U S_new
  I_enh = {restore I_temp due to the deleted substrings}

7.5.4 Configuration Decompression

Decompression occurs at power-up or at runtime. The original configuration bit-stream is reconstructed by parsing the dictionary with respect to the index. As shown in Figure 7.6, the contents of the index (i.e., codewords) are read sequentially. A codeword corresponds to an address into the dictionary memory. For each codeword, all the symbols of the associated string are read from the dictionary memory, and then the next codeword is read. A comparator is used to decide if the output data of the dictionary memory corresponds to a root node, that is, whether all the symbols of a string have been read. Depending on the output of the comparator, a new codeword is read or the last-read pointer is used to address the dictionary memory. A software sketch of this parsing loop is given below.

[Figure 7.6: Our Decompression-based Reconstruction of the Configuration Bit-stream (a counter addresses the index memory; a multiplexer selects between the codeword and the last-read pointer to address the dictionary memory; a comparator detects root entries; symbols are delivered to the FPGA configuration mechanism)]
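The following Python fragment mirrors the parsing loop of Figure 7.6 in software. It assumes the dictionary layout of Figure 7.4 (each entry is a (symbol, pointer) pair, with pointer 0 marking a root entry) and an index built over the reverse-ordered bit-stream, so that traversing each codeword's pointer chain while reading the index in reverse yields the original data in order; the function name and representation are ours.

def decompress(dictionary, index):
    """Reconstruct the bit-stream from a (symbol, pointer) dictionary.

    dictionary[addr] = (symbol, pointer); pointer == 0 marks a root entry.
    The index is parsed in reverse order (see Section 7.5.2), so the
    symbols come out in the original order without a stack.
    """
    output = []
    for codeword in reversed(index):
        addr = codeword
        while True:
            symbol, pointer = dictionary[addr]
            output.append(symbol)        # deliver the symbol to the output
            if pointer == 0:             # comparator: root entry reached
                break                    # read the next codeword
            addr = pointer               # follow the last-read pointer
    return output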
In Figure 7.7, a typical scheme for storing and reading the configuration bit-stream is shown. Typically, the configuration bit-stream is stored in memory. It is important to deliver the bit-stream sequentially; otherwise, the configuration mechanism will not be initialized correctly and the configuration process will fail. Depending on the configuration mode, data is delivered serially or in parallel.

[Figure 7.7: Conventional Read of the Configuration Bit-stream (a counter addresses the configuration bit-stream memory; data is delivered to the FPGA configuration mechanism)]

In our decompression-based scheme (Figure 7.6), the only hardware overhead introduced is a comparator and a multiplexer. The output of the decompression process is identical to the data delivered by the conventional scheme. Moreover, the data rate for delivering the configuration data is the same for both schemes and depends only on the memory bandwidth. The decompression process does not add any time overhead to the configuration time.

7.6 Experiments and Compression Results

Our configuration compression technique was applied to configuration bit-streams of several real-world applications. The target architecture was the Virtex FPGAs [77]. For mapping onto the Virtex FPGAs, we used the Foundation Series v2.1i software development tool [77]. Each application was mapped onto the smallest Virtex FPGA that met the area requirements of the corresponding implementation. The size of the configuration bit-streams ranged from 1.7 Mbits to 6.1 Mbits. In Table 7.1, the configuration bit-stream sizes for each implementation are shown.

The considered configuration bit-streams corresponded to implementations of cryptographic and signal processing algorithms. The cryptographic algorithms were the final candidates of the Advanced Encryption Standard (AES): MARS, RC6, Rijndael, Serpent, and Twofish. Their implementations included a key-scheduling unit, a control unit, and one round of the cryptographic core that was used iteratively. Implementation details of the AES algorithms can be found in Chapter 4. We have also implemented digital signal processing algorithms using the logic cores provided with the Foundation 2.1i software tool [77]. A 1024-point and a 512-point complex FFT were implemented that were able to perform IFFT too. In addition, four 256-tap FIR filters were mapped onto the same device. In this implementation, all filters can process data concurrently. Finally, a 1024-tap FIR filter was also implemented.

The configuration bit-streams were processed byte-by-byte during compression, that is, the symbol for the dictionary entries was chosen to be an 8-bit word. As a result, the decompressed data is delivered as 8-bit words, and thus, parallel modes of configuration can be supported. Note that the maximum number of bits used in parallel modes of configuration is typically 8 bits [4, 8, 77]. If the configuration mode requires less than 8 bits (e.g., serial mode), an 8-to-n bit converter can be used, where n is the number of bits required by the configuration mode. In this work, for each configuration bit-stream, we do not attempt to find the optimal bit-length for the symbol that leads to the best compression results.

7.6.1 Single Configurations

The compression results for single configurations are shown in Tables 7.1 and 7.2. The results are organized with respect to the optimization stages of our technique (Figure 7.2). The results shown for LZW correspond to the construction of the dictionary and the index using the LZW algorithm.
The only difference compared with Figure 7.2 is that the LZW results include the optimization of merging common prefix strings in the dictionary. Hence, the results shown for Compact correspond to the deletion of the non-referenced nodes in the dictionary. Finally, the results shown for Heuristic correspond to the optimizations performed by our heuristic and are also the overall results of our compression technique.

In Table 7.1, the achieved compression ratios are shown. The compression ratio is the ratio of the total memory requirements (i.e., memory to store the dictionary and the index) to the bit-stream size. In addition, in Table 7.1, lower bounds on the compression ratios are shown. For our compression technique, the lower bound for each bit-stream corresponds to the entropy of the bit-stream with respect to the LZW compression encoder. As mentioned earlier, the compression ratio is affected by the entropy of the data to be compressed as well as the model of encoding [56]. We have calculated the lower bound by dividing the index size derived using LZW by the bit-stream size. Therefore, the lower bound corresponded to the compression ratio that can be achieved by LZW for software-based applications (assuming 8-bit symbols).

Table 7.1: Compression Ratios for Single Configurations

Configuration   Bit-stream size (bits)   LZW    Compact   Heuristic   Lower Bound
MARS            3608000                  179%   96%       82%         73%
RC6             2546080                  119%   69%       59%         48%
Rijndael        3608000                  198%   104%      89%         81%
Serpent         2546080                  165%   95%       79%         67%
Twofish         6127712                  186%   103%      86%         76%
FFT-256         1751840                  140%   85%       68%         56%
FFT-1024        1751840                  159%   89%       72%         64%
4 x FIR-256     1751840                  180%   97%       80%         73%
FIR-1024        1751840                  177%   96%       79%         71%

In Table 7.2, the compression results are shown in terms of the memory requirements. The memory requirements for the dictionary are n_dictionary x (8 + ceil(log2 n_dictionary)) bits, where n_dictionary is the number of memory entries of the dictionary. Similarly, the memory requirements for the index are n_index x ceil(log2 n_dictionary) bits, where n_index is the number of memory entries of the index and ceil(log2 n_dictionary) is the number of bits required to address the dictionary.

LZW: In software-based applications, only the index is considered in the calculation of the compression ratio. In addition, statistical encoding schemes are utilized for further compressing the index. As a result, in typical LZW applications, superior compression ratios (i.e., 10%-20%) have been achieved by using commercially available software programs (e.g., compress, gzip). However, such commercial programs are not applicable to our compression problem.
As discussed earlier, in the context of embedded environments, both the dictionary and the index are considered in the calculation of the compression ratio. The size of the derived dictionaries was comparable to the size of the original bit-streams. Therefore, negative compression occurred, that is, the memory requirements for the dictionary and the index were greater than the bit-stream size.

Table 7.2: Dictionary and Index Memory Requirements for Single Configurations

                Dictionary memory / word-length (bits)         Index memory / word-length (bits)
Configuration   LZW           Compact       Heuristic          LZW           Compact       Heuristic
MARS            3827070 / 26  1116912 / 24  172032 / 21        2611920 / 18  2351040 / 16  2796586 / 13
RC6             1811575 / 25  667575 / 23   172032 / 21        1227536 / 17  1083120 / 15  1344564 / 13
Rijndael        4231418 / 26  1149810 / 24  172032 / 21        2921874 / 18  2599888 / 16  3055919 / 13
Serpent         2511275 / 25  826152 / 24   172032 / 21        1703332 / 17  1603136 / 16  1845688 / 13
Twofish         6746558 / 26  1919550 / 25  360448 / 22        4666101 / 18  4106876 / 17  4913398 / 14
FFT-256         1479408 / 24  564141 / 23   81920 / 20         982192 / 16   920805 / 15   1107396 / 12
FFT-1024        1657900 / 25  574034 / 23   81920 / 20         1123037 / 17  990915 / 15   1181964 / 12
4 x FIR-256     1883900 / 25  575897 / 23   81920 / 20         1276717 / 17  1126515 / 15  1330044 / 12
FIR-1024        1849725 / 25  580612 / 23   81920 / 20         1253478 / 17  1106010 / 15  1303416 / 12

Compact: By deleting the non-referenced nodes in the dictionary, the number of dictionary entries was reduced by a factor of 2.4-3.4. As a result, the number of bits required to address the dictionaries was also reduced by 1 to 2 bits, affecting the word length of both the dictionary and the index memories accordingly. Compared with the LZW results, the memory requirements for the dictionaries were reduced by a factor of 2.5-3.7. In addition, the memory requirements for the indices were also reduced by 6%-13% even though the number of codewords remained the same. Overall, the compression ratios achieved at this optimization stage were 69%-104%.

Heuristic: Finally, the overall savings in memory were further improved by our heuristic. The goal of our heuristic was to reduce the size of the dictionary at the expense of index expansion. Indeed, compared to the Compact results, the dictionary entries were reduced by a factor of 2.9-6.2 while the number of codewords was increased by 35%-50%. The number of bits required to address the dictionary was reduced by 2 to 3 bits, affecting the word length of both the dictionary and the index memories accordingly. As a result, even though the number of codewords was increased, the total memory requirements were reduced. Compared with the Compact results, the memory requirements of the dictionaries were further reduced by a factor of 3.2-7.1 while the memory requirements of the indices were increased by 18%-40%. Overall, the compression ratios achieved at this optimization stage were 59%-89%. Our heuristic improved the compression ratios provided by the Compact results by 14%-20%.

Considering the compression ratios achieved by LZW and the lower bounds on them, our compression technique performs well. The improvements over the LZW results were significant. On the average, our technique reduced the dictionary memory requirements by 94.5% while the index memory requirements were increased by 11.5%. As a result, our compression results were close to the lower bounds. On the average, our compression ratios were higher than the lower bounds by 14.5%.
Overall, our compression technique reduced the memory requirements of the configuration bit-streams by 0.35-1.04 Mbits. The savings in memory corresponded to 11%-41% of the original bit-streams.

7.6.2 Sets of Configurations

Our technique can be extended to compress a set of configurations by incorporating a unified-dictionary approach. The proposed approach differs from our configuration compression technique (Figure 7.2) only in the way that the dictionary is constructed. Instead of constructing multiple dictionaries by processing the configuration bit-streams independently, the bit-streams are processed in a sequence by sharing the same dictionary. The LZW algorithm (see Algorithm 2) is applied to each configuration bit-stream without initializing the dictionary. Every time LZW is called, it uses the dictionary that was derived by the preceding call. The derived indices are grouped in one index for facilitating the processing through the remaining stages of our compression technique (Figure 7.2). A software sketch of this construction is given below.

The goal of the unified-dictionary approach is to construct a single dictionary for multiple configurations so that the word-length of the index memory will be the same across different configurations. As a result, a simple memory organization will be required for decompression, which is identical to the one shown in Figure 7.6. On the contrary, if the configuration bit-streams are processed independently, a more complex memory organization will be required that consists of multiple memory modules of various word lengths. For comparison purposes, in the remainder of this section, the solution of processing the configuration bit-streams independently is referred to as the baseline. Furthermore, if the dictionaries obtained by the baseline approach were grouped to form a single dictionary, the compression ratio would increase due to the increase of the number of bits required to address the dictionary entries.

In Tables 7.3 and 7.4, the achieved compression ratios and the dictionary and index memory requirements are shown. Clearly, besides resulting in a simple memory organization, the proposed approach achieves better compression ratios than the baseline approach.

Table 7.3: Compression Ratios for Sets of Configurations

Configurations                             Bit-streams size (bits)   LZW    Compact   Heuristic   Baseline
MARS, Rijndael                             2 x 3608000               181%   97%       85%         85.50%
RC6, Serpent                               2 x 2546080               136%   76%       68%         69.00%
FFT-256, FFT-1024, 4 x FIR-256, FIR-1024   4 x 1751840               142%   84%       71%         74.75%

This happens because the increase of the number of bits required to address the dictionary entries is amortized by the decrease of the number of index entries. The number of index entries decreases due to the fact that, after the first call of LZW, the dictionary is not initialized with the alphabet symbols, but already contains strings. Therefore, for a larger number of configuration bit-streams, a larger decrease of the number of index entries is expected. Compared with the baseline approach, for {MARS, Rijndael}, {RC6, Serpent}, and {FFT-256, FFT-1024, 4 x FIR-256, FIR-1024}, the number of index entries decreases by 8.41%, 9.89%, and 19.06%, respectively.
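A minimal software model of the unified-dictionary construction follows, in the style of the lzw_encode sketch of Section 7.5.1: the only change is that the dictionary is threaded through successive bit-streams instead of being re-initialized for each one. The function name and data structures are ours.

def unified_lzw_encode(bitstreams):
    """Encode a set of bit-streams against one shared LZW dictionary."""
    # Preload once with all 256 single-byte symbols (first call only).
    dictionary = {bytes([b]): b for b in range(256)}
    unified_index = []                 # indices of all streams, grouped in one
    for data in bitstreams:
        S = b""
        for b in data:
            Ss = S + bytes([b])
            if Ss in dictionary:
                S = Ss
            else:
                unified_index.append(dictionary[S])
                dictionary[Ss] = len(dictionary)  # dictionary keeps growing
                S = bytes([b])
        if S:
            unified_index.append(dictionary[S])   # flush per bit-stream
    return dictionary, unified_index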
Before applying our heuristic (Figure 7.2), the number of dictionary entries is decreased by a range of 7%-17% compared with the baseline approach. This happens because common entries among different dictionaries are replaced by a single entry in the unified dictionary. However, after applying our heuristic, the number of dictionary entries is the same for both approaches.

Table 7.4: Dictionary and Index Memory Requirements for Sets of Configurations

                                           Dictionary memory / word-length (bits)      Index memory / word-length (bits)
Configurations                             LZW           Compact       Heuristic       LZW           Compact       Heuristic
MARS, Rijndael                             7676856 / 27  2193250 / 25  360448 / 22     5397406 / 19  4829258 / 17  5772508 / 14
RC6, Serpent                               4092062 / 26  1376568 / 24  360448 / 22     2828394 / 18  2514128 / 16  3095974 / 14
FFT-256, FFT-1024, 4 x FIR-256, FIR-1024   5889468 / 26  2063875 / 25  360448 / 22     4072788 / 18  3846522 / 17  4618700 / 14

Chapter 8
Conclusions and Future Research

Reconfigurable Computing (RC) is an emerging computing paradigm that is rapidly eroding a wide range of applications where conventional fabrics such as microprocessors, DSPs, and ASICs used to prevail. The advantage of RC is based on the unique capability of reconfigurable hardware to provide high performance while enabling application-specific and post-fabrication hardware customization. Hence, RC promises to deliver ASIC-like performance with microprocessor-like flexibility. Such flexibility is possibly the only way for hardware-based solutions to cope with the rapid emergence of new algorithms and standards in almost every application of today's interconnected world. However, such flexibility is also a major challenge for revolutionizing application execution using RC by customizing the hardware based on run-time parameters. Only by exploring dynamic hardware reconfiguration can reconfigurable devices evolve from programmable hardware solutions to dynamic hardware solutions.

To realize this potential, new mapping approaches are required that do not adopt the conventional ASIC design flow in the critical path to the solution. For one thing, conventional tools result in excessive compilation times that are unacceptable for hardware customization based on run-time parameters. For another, these tools cannot capture the dynamic nature of reconfigurable hardware, thus preventing RC from achieving its full potential. To combat the excessive compilation times and the inefficiencies introduced by conventional tools, a novel domain-specific and instance-aware mapping approach has been developed to support dynamic RC. The domain consists of the semantics of a given algorithm and the hardware characteristics of the target reconfigurable device. Based on this domain, a skeleton is derived off-line, which corresponds to the configuration data that is necessary to cover the parameter space of the considered application. At run time, instance-aware adaptation of the skeleton occurs to customize the hardware to the run-time parameters. The performance metric is the effective execution time, which includes both the execution time on hardware and the mapping time.
Our approach facilitates dynamic logic synthesis for reconfigurable hardware by overcoming the current limitations: the excessive mapping time and the performance predictability. Known approaches to dynamic RC either overlook the contribution of mapping time to run-time hardware customization or fail to explore dynamic reconfiguration to improve the overall performance. By adopting the conventional ASIC design flow in the critical path to the solution, these approaches mainly focus on matching the application requirements with the characteristics of the hardware to optimize execution time and area requirements. Hardware reconfiguration is not considered in deriving optimized designs, or is explored in a limited fashion.

Our approach was used to address several application domains and reconfigurable devices. The derived results confirmed that, compared with the state-of-the-art RC solutions, significant improvement in the effective execution time has been achieved. The applications of interest included private-key cryptography for Internet security, graph problems, matrix multiplication, and boolean satisfiability.

Based on our approach, a novel Adaptive Cryptographic Engine (ACE) was developed for IPSec architectures. By taking advantage of FPGA technology, ACE can adapt on the fly to diverse security parameters that are negotiated at run time between two communicating entities. Moreover, ACE provides agile key-context switching and a high encryption rate, which are essential to cope with high-speed networks. Agile key-context switching is particularly critical for IPSec, where a small amount of data is processed per key and key-context switching occurs repeatedly. A diverse set of private-key cryptographic algorithms was chosen to demonstrate the applicability of ACE. A thorough performance analysis of the engine was performed. By combining hardware customization, hardware parallelism, and hardware reuse, ACE can outperform software-based solutions with respect to both throughput and key-setup latency. Finally, while ASIC-based solutions can provide superior performance, they lack flexibility, which is essential for IPSec.

A novel solution for the shortest path problem was also derived that adapted the reconfigurable hardware to the topology of the input graph. Compared with the state-of-the-art, significant improvement in the effective execution time was achieved by improving both the mapping and the execution time. Moreover, a scalable and partitioned solution was derived for the matrix multiplication problem. Even though the target architecture was coarse-grained architectures, the derived solution can also be applied to FPGAs. The derived solution can adapt to the matrix problem size and the available resources on hardware. Optimal performance was demonstrated in terms of both time and the amount of local memory used.

Finally, a parallel deduction engine for backtrack search algorithms was demonstrated. A heuristic was also developed based on which the engine evolves during problem solving with respect to the amount of parallelization p. Our objective was to find the value p that, compared with the baseline case (p = 1), gives maximum speedup for the SAT algorithm. The optimal value p is intrinsic to the given instance.
To enhance the feasibility of our approach in embedded environments and to facilitate the development of multi-personality products, a novel configuration compression technique was developed. Our goal was to reduce the memory required to store configurations in FPGA-based embedded systems while achieving high decompression efficiency, where decompression efficiency covers both the decompression hardware cost and the decompression rate. Although data compression has been studied extensively in the past, we are not aware of any prior work that addresses configuration compression for FPGA-based embedded systems with respect to these cost and speed requirements. Our compression technique is applicable to any SRAM-based FPGA device, since it does not depend on specific features of the configuration mechanism. Using our technique, we have demonstrated 11%-41% memory savings for configuration bit-streams of various real-world applications. Compared with the lower bounds derived for the compression ratios, the achieved compression ratios were higher by 14.5% on average.
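The following generic LZW coder is a minimal sketch of the dictionary-plus-indices storage scheme discussed above; it shows where the dictionary and index memory requirements of Table 7.4 come from, but it is not the dissertation's exact algorithm, and the stand-in bit-stream is fabricated for the example.

```python
def lzw_compress(data: bytes):
    """Build an LZW dictionary over the data and emit the index stream.
    Stored memory ~ dictionary entries + len(indices) * index word-length."""
    dictionary = {bytes([i]): i for i in range(256)}  # single-byte seeds
    w, indices = b"", []
    for b in data:
        wb = w + bytes([b])
        if wb in dictionary:
            w = wb                              # extend the current match
        else:
            indices.append(dictionary[w])       # emit index of longest match
            dictionary[wb] = len(dictionary)    # add the new entry
            w = bytes([b])
    if w:
        indices.append(dictionary[w])
    return dictionary, indices

bitstream = bytes([0xAA, 0x55]) * 1000          # stand-in configuration data
d, idx = lzw_compress(bitstream)
word_len = (len(d) - 1).bit_length()            # index word-length in bits
print(len(idx), "indices at", word_len, "bits each")
```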
8.1 Future Directions

In this dissertation, a domain-specific and instance-aware mapping approach was presented that enables application mapping onto reconfigurable devices based on run-time parameters. To the best of our knowledge, this is one of the first efforts to address this novel research field. Further research that would fortify our work includes the following directions:

1. Design Tools: Design tools for run-time hardware customization are in their infancy. As a result, the manual nature of our approach is a limiting factor. Tools like JBits [78] are essential for configuration bit-stream handling. New design tools should be developed on top of existing ones to enable the construction of partial configuration bit-streams. Given an original and a desired hardware configuration, the problem consists of deriving a partial bit-stream of minimum size; a minimal sketch of this formulation appears after this list. Further issues to be addressed are the level of abstraction of the input hardware configurations and the portability of the tool among different architectures and bit-stream structures.

2. Application Domain: The domain-specificity of our approach dictates that new application domains should be explored to fortify the applicability of dynamic RC. An application of particular interest is cryptanalysis. Reconfigurable hardware can lead to impressive performance by customizing the hardware to a specific input data pattern. Moreover, continuous hardware customization based on run-time results can prove beneficial for improving performance.

3. Boolean Satisfiability: We believe that the UPT-based optimization approach introduced in this dissertation could be the basis for parallel deduction engines for backtrack search algorithms. Future work includes further experimentation with instances from the DIMACS benchmark [35] and evaluation and enhancement of the proposed heuristic. Further study of the chosen set of templates and the evaluation of UPT would be essential to fortify our approach. Another interesting topic to study would be the effectiveness of parallel deduction engines with respect to various decision and backtrack heuristics. Finally, further study is needed to understand the behavior of UPT under different heuristics for splitting the clauses into groups and ordering the clauses within a group.

4. Configuration Compression: By using the LZW algorithm to construct the dictionary, the achievable compression ratios are bounded by the entropy of the data with respect to the LZW coding model (i.e., sequential processing of the data). An approach that constructs the dictionary based on a two-dimensional array of data seems closer to the structure of the actual bit-streams. Such dictionary-based approaches are in their infancy, and further exploration is required to evaluate their effectiveness. Furthermore, the development of a skeleton-based approach for our compression technique could improve the achieved compression ratios. A skeleton corresponds to the correlation among a set of configuration bit-streams; by removing the skeleton's data redundancy from the bit-streams, savings in memory can be achieved. Given a set of configurations, the problem consists of deriving a skeleton in order to reduce the size of the individual indices.
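As referenced in direction 1, here is a minimal sketch of the partial bit-stream formulation. Treating the configuration as fixed-size frames and ignoring per-frame addressing overhead are simplifying assumptions made for illustration; real devices impose their own frame formats, and minimizing the true partial bit-stream size would have to account for that overhead.

```python
FRAME_BYTES = 16  # assumed frame size for illustration

def partial_bitstream(original: bytes, desired: bytes):
    """Return (frame_index, frame_data) pairs for every frame that differs
    between the original and the desired configuration. A true minimum-size
    partial bit-stream would also weigh per-frame addressing overhead."""
    assert len(original) == len(desired)
    delta = []
    for i in range(0, len(original), FRAME_BYTES):
        old = original[i:i + FRAME_BYTES]
        new = desired[i:i + FRAME_BYTES]
        if old != new:
            delta.append((i // FRAME_BYTES, new))
    return delta

old_cfg = bytes(64)                                   # four zeroed frames
new_cfg = bytes(16) + b"\xff" * 16 + bytes(32)        # frame 1 changed
print([f for f, _ in partial_bitstream(old_cfg, new_cfg)])  # -> [1]
```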
Reference List

[1] Abramovici M. and Saab D. Satisfiability on Reconfigurable Hardware. In International Conference on Field Programmable Logic and Applications, August 1997.
[2] Abramovici M., Sousa J. T. and Saab D. A Massively-Parallel Easily-Scalable Satisfiability Solver using Reconfigurable Hardware. In ACM/IEEE Design Automation Conference, October 1999.
[3] Advanced Encryption Standard. http://csrc.nist.gov/encryption/aes/.
[4] Altera PLD Devices. http://www.altera.com.
[5] Anderson R., Biham E. and Knudsen L. Serpent: A Proposal for the Advanced Encryption Standard. Technical report, NIST AES Proposal, June 1998.
[6] Aoki K. and Lipmaa H. Fast Implementations of AES Candidates. In Third AES Candidate Conference, April 2000.
[7] Automatically Tuned Linear Algebra Software (ATLAS). http://www.netlib.org/atlas/index.html.
[8] Atmel FPGA. http://www.atmel.com.
[9] Babb J., Frank M. and Agarwal A. Solving graph problems with dynamic computation structures. In SPIE 96: High-Speed Computing, Digital Signal Processing, and Filtering using Reconfigurable Logic, February 1996.
[10] Bassham L. E. III. Efficiency Testing of ANSI C Implementations of Round 2 Candidate Algorithms for the Advanced Encryption Standard. In Third AES Candidate Conference, April 2000.
[11] Bertsekas D. P. and Tsitsiklis J. N. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, Belmont, Massachusetts, 1997.
[12] Bondalapati K. and Prasanna V. K. Mapping Loops onto Reconfigurable Architectures. In International Conference on Field Programmable Logic and Applications, September 1998.
[13] Brown S. and Rose J. FPGA and CPLD Architectures: A Tutorial. In IEEE Design & Test of Computers, 1996.
[14] Burwick C. et al. MARS - a candidate cipher for AES. Technical report, NIST AES Proposal, August 1999.
[15] Cadambi S., Weener J., Goldstein S. C., Schmit H. and Thomas D. E. Managing Pipeline-Reconfigurable FPGAs. In International Symposium on Field Programmable Gate Arrays, February 1998.
[16] Chameleon Systems Inc. http://www.chameleonsystems.com.
[17] Chen D. C. and Rabaey J. M. A Reconfigurable Multiprocessor IC for Rapid Prototyping of Algorithmic-Specific High-Speed DSP Data Paths. IEEE Journal of Solid-State Circuits, 27:1895-1904, 1992.
[18] Chu Y. J. and Liu T. H. On the shortest arborescence of a directed graph. Scientia Sinica, 14:1396-1400, 1965.
[19] Cisco Systems Inc. IPSec. http://www.cisco.com/public/products_tech.shtml.
[20] Cormen T. H., Leiserson C. E. and Rivest R. L. Introduction to Algorithms. The MIT Press, Cambridge, Massachusetts, 1997.
[21] Daemen J. and Rijmen V. The Rijndael Block Cipher. Technical report, NIST AES Proposal, September 1999.
[22] Dandalis A. and Prasanna V. K. Fast parallel implementation of DFT using configurable devices. In International Workshop on Field Programmable Logic and Applications, September 1997.
[23] Dandalis A. and Prasanna V. K. Mapping Homogeneous Computations onto Dynamically Configurable Coarse-Grained Architectures. In IEEE Symposium on Field-Programmable Custom Computing Machines, April 1998.
[24] Dandalis A. and Prasanna V. K. Space-Efficient Mapping of 2D-DCT onto Dynamically Configurable Coarse-Grained Architectures. In International Workshop on Field Programmable Logic and Applications, September 1998.
[25] Dandalis A. and Prasanna V. K. FPGA-based Cryptography for Internet Security. In Online Symposium for Electronic Engineers, November 2000.
[26] Dandalis A. and Prasanna V. K. Configuration Compression for FPGA-based Embedded Systems. In International Symposium on Field-Programmable Gate Arrays, February 2001.
[27] Dandalis A. and Prasanna V. K. Signal Processing using Reconfigurable System-on-Chip Platforms. In International Conference on Engineering of Reconfigurable Systems and Algorithms, June 2001.
[28] Dandalis A., Mei A. and Prasanna V. K. Domain Specific Mapping for Solving Graph Problems on Reconfigurable Devices. In Reconfigurable Architectures Workshop, April 1999.
[29] Dandalis A., Prasanna V. K. and Gaudiot J. L. Run-time Mapping of Graph-Problem Instances onto Reconfigurable Hardware. In Military and Aerospace Applications of Programmable Devices and Technologies, September 1999.
[30] Dandalis A., Prasanna V. K. and Rolim J. D. P. A Comparative Study of Performance of AES Final Candidates Using FPGAs. In Workshop on Cryptographic Hardware and Embedded Systems, August 2000.
[31] Dandalis A., Prasanna V. K. and Rolim J. D. P. An Adaptive Cryptographic Engine for IPSec Architectures. In IEEE Symposium on Field-Programmable Custom Computing Machines, April 2000.
[32] Dandalis A., Prasanna V. K. and Thiruvengadam B. Run-time Performance Optimization of an FPGA-based Deduction Engine for SAT Solvers. In International Conference on Field Programmable Logic and Applications, September 2001.
[33] Davis M. and Putnam H. A Computing Procedure for Quantification Theory. Journal of the ACM, 7:201-215, 1960.
[34] DeHon A. and Mirsky E. MATRIX: A Reconfigurable Computing Architecture with Configurable Instruction Distribution and Deployable Resources. In IEEE Symposium on Field-Programmable Custom Computing Machines, April 1996.
[35] DIMACS SAT benchmarks. ftp://dimacs.rutgers.edu/pub/challenge/satisfiability/.
[36] Donlin A. Self Modifying Circuitry - A Platform for Tractable Virtual Circuitry. In International Conference on Field Programmable Logic and Applications, August 1998.
[37] Ebeling C., Cronquist D. C., Franklin P. and Fisher C. RaPiD - A configurable computing architecture for compute-intensive applications. Technical Report UW-CSE-96-11-03, Department of Computer Science and Engineering, University of Washington, November 1996.
[38] Edmonds J. Optimum branchings. J. Research of the National Bureau of Standards, 71(B):233-240, 1967.
[39] Elbirt A. J., Yip W., Chetwynd B. and Paar C. An FPGA Implementation and Performance Evaluation of the AES Block Cipher Candidate Algorithm Finalists. In Third AES Candidate Conference, April 2000.
[40] Eldredge J. G. and Hutchings B. L. Run-Time Reconfiguration: A Method for Enhancing the Functional Density of SRAM-Based FPGAs. Journal of VLSI Signal Processing, 12(1):67-86, January 1996.
[41] Fowler D. Virtual Private Networks: Making the Right Connection. Morgan Kaufmann Publishers Inc., San Francisco, California, 1999.
[42] Gaj K. and Chodowiec P. Comparison of the hardware performance of the AES candidates using reconfigurable hardware. In Third AES Candidate Conference, April 2000.
[43] Gartner Group Dataquest. http://www.gartner.com.
[44] Hartenstein R. W., Kress R. and Reinig H. A Scalable, Parallel, and Reconfigurable Datapath Architecture. In International Symposium on IC Technology, Systems & Applications, September 1995.
[45] Hauck S., Li Z. and Schwabe E. J. Configuration Compression for the Xilinx XC6200 FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(8):1107-1113, August 1999.
[46] Hauck S. and Wilson W. D. Runlength Compression Techniques for FPGA Configurations. In IEEE Symposium on Field-Programmable Custom Computing Machines, April 1999.
[47] Hong J. and Kung H. I/O complexity: The red-blue pebble game. In ACM Symposium on Theory of Computing, May 1981.
[48] H.O.T. Works Ver. 1.1. Parameterized Library for XC6200. http://www.vcc.com.
[49] Internet Engineering Task Force. http://www.ietf.org.
[50] JHDL. http://www.jhdl.org.
[51] Liao S., Devadas S. and Keutzer K. A Text-Compression-Based Method for Code Size Minimization in Embedded Systems. ACM Transactions on Design Automation of Electronic Systems, 4(1):12-38, January 1999.
[52] Lefurgy C., Bird P., Cheng I-C. and Mudge T. Improving Code Density Using Compression Techniques. In IEEE/ACM Symposium on Microarchitecture, December 1997.
[53] Luk W., Shirazi N., Guo S. and Cheung P. Y. K. Pipeline Morphing and Virtual Pipelines. In International Conference on Field Programmable Logic and Applications, September 1997.
[54] McMillan S. and Patterson C. JBits Implementations of the Advanced Encryption Standard (Rijndael). In International Conference on Field Programmable Logic and Applications, September 2001.
[55] Miller R., Prasanna V. K., Reisis D. and Stout Q. F. Parallel computations on reconfigurable meshes. IEEE Transactions on Computers, 42(6):678-692, June 1993.
[56] Nelson M. and Gailly J-L. The Data Compression Book. M&T Books, New York, 1996.
[57] PACT Corporation. http://www.pactcorp.com.
[58] Platzner M. and De Micheli G. Acceleration of Satisfiability Algorithms by Reconfigurable Hardware. In International Conference on Field Programmable Logic and Applications, August 1998.
[59] Rashid A., Leonard J. and Mangione-Smith W. H. Dynamic Circuit Generation for Solving Specific Problem Instances of Boolean Satisfiability. In IEEE Symposium on Field-Programmable Custom Computing Machines, April 1997.
[60] Redekopp M. and Dandalis A. A Parallel Pipelined SAT Solver for FPGAs. In International Conference on Field Programmable Logic and Applications, August 2000.
[61] Rivest R. L., Robshaw M. J. B., Sidney R. and Yin Y. L. The RC6 Block Cipher. Technical report, NIST AES Proposal, June 1998.
[62] Robinson B. Plans for a Secure Future. efe.com, 418, September 1999.
[63] Rose J., El Gamal A. and Sangiovanni-Vincentelli A. Architecture of Field-Programmable Gate Arrays. In Proceedings of the IEEE, July 1993.
[64] Schneier B. Applied Cryptography. John Wiley & Sons Inc., 2nd edition, 1996.
[65] Schneier B. et al. Twofish: A 128-Bit Block Cipher. Technical report, NIST AES Proposal, June 1998.
[66] Sidhu R., Wadhwa S., Mei A. and Prasanna V. K. A Self-Reconfigurable Gate Array Architecture. In International Conference on Field Programmable Logic and Applications, August 2000.
[67] Silva J. P. M. and Sakallah K. A. GRASP: A New Search Algorithm for Satisfiability. Technical Report CSE-TR-292-96, Computer Science and Engineering Department, University of Michigan, Ann Arbor, April 1996.
[68] SSH IPSEC Express. http://www.ipsec.com.
[69] Suyama T., Yokoo M. and Sawada H. Solving Satisfiability Problems on FPGAs. In International Conference on Field Programmable Logic and Applications, August 1996.
[70] Texas Instruments Inc. http://www.ti.com.
[71] Ullman J. D. Computational Aspects of VLSI. Computer Science Press, Rockville, Maryland, 1984.
[72] Wadhwa S. and Dandalis A. Efficient Self-Reconfigurable Implementations Using On-Chip Memory. In International Conference on Field Programmable Logic and Applications, August 2000.
[73] Waingold E. et al. Baring it all to Software: The Raw Machine. Technical Report MIT/LCS TR-709, MIT Laboratory for Computer Science, Massachusetts Institute of Technology, March 1997.
[74] Weeks B., Bean M., Rozylowicz T. and Ficke C. Hardware Performance Simulations of Round 2 Advanced Encryption Standard Algorithms. In Third AES Candidate Conference, April 2000.
[75] Whiting D., Schneier B. and Bellovin S. AES Key Agility Issues in High-Speed IPsec Implementations. http://www.counterpane.com/aes-agility.html.
[76] World Semiconductor Trade Statistics Organization. http://www.wsts.org.
[77] Xilinx Inc. http://www.xilinx.com.
[78] Xilinx JBits SDK. http://www.xilinx.com/products/jbits.
[79] Zhang H. and Stickel M. An Efficient Algorithm for Unit-Propagation. In International Symposium on Artificial Intelligence and Mathematics, September 1998.
[80] Zhong P., Martonosi M., Ashar P. and Malik S. Accelerating Boolean Satisfiability with Configurable Hardware. In IEEE Symposium on FPGAs for Custom Computing Machines, April 1998.
[81] Zhong P., Martonosi M., Malik S. and Ashar P. Solving Boolean Satisfiability with Dynamic Hardware Configurations. In International Workshop on Field Programmable Logic and Applications, August 1998.