APPLICATION-SPECIFIC EXTERNAL MEMORY INTERFACING FOR FPGA-BASED RECONFIGURABLE ARCHITECTURE

by Joonseok Park

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

August 2004

Copyright 2004 Joonseok Park

Dedication

To my father and mother, who have encouraged and supported me with their endless love.

Acknowledgements

Without the help of many people, it would not have been possible to finish my Ph.D. journey and this dissertation. There are so many people I would like to thank that I cannot even begin to name them all. My apologies in advance to anyone I have forgotten here. First and foremost, I would like to express my deepest gratitude to my advisor, Dr. Pedro C. Diniz, for his patience, advice, and guidance as a mentor and as a friend. He has shown me all the virtues of a researcher, a scientist, and a professor. I cannot think of a word great enough to describe his help during my Ph.D. years. Thank you, Pedro. I would like to thank the members of my dissertation committee, Dr. Victor K. Prasanna and Dr. Timothy Pinkston, who spent their precious time helping me see the big picture of research with their countless pieces of feedback. I also would like to thank the members of my guidance committee: Dr. Mary W. Hall, who has also guided me as the leader of our research team, and Dr. Jean-Luc Gaudiot. I also would like to thank all the ISI compiler group members - Dr. Jacqueline Chame, Jaewook Shin, Heidi Ziegler, Chun Chen, Yoon-Ju Lee, and Nastaran Baradaran - for helpful insights and input to my research. I would like to thank Dongho Kim, June-Sup Lee, Soon-Wook Hwang, In-Young Ko, Ihn Kim, Yong-Dae Kim, Joong-Seok Moon, Hyuck-chul Jung, and Ivan Horn, who showed me the way to complete graduate student life at ISI. To Hyeoksoo Kim, Changwoo Kang, Bokki Min, Wonwoo Ro, Joon-sang Park, Taek-Jun Kwon, and Chon Yi: thank you for being my friends! Last but not least, I would like to thank my family: my father, mother, brother, sister, and brother-in-law. I am forever grateful for their encouragement, support, sacrifices, and prayers toward finishing this thesis.
Table of Contents

Dedication
Acknowledgements
List of Figures
List of Tables
Abstract
Chapter 1 Introduction
  1.1 Motivation
  1.2 Organization
Chapter 2 Background: FPGA-based Reconfigurable Computing Machines
  2.1 Fine-grain Field-Programmable-Gate-Arrays (FPGAs)
  2.2 FPGA Execution Model and Programming Abstractions
  2.3 System Level Mapping of Applications to FPGAs
  2.4 Mapping Computations onto FPGAs
  2.5 Hardware Description Language
  2.6 Trends in FPGA Architectures
Chapter 3 Problem Description
  3.1 Motivation
  3.2 Behavioral Synthesis Tools Drawbacks
    3.2.1 Insufficient Support for External Memory Operations
    3.2.2 Restricted Design Realization using a Single Clock Stream
    3.2.3 Support for Multiple Tasks
    3.2.4 Lack of Data Dependence Analysis and Loop Transformations
  3.3 Research Contributions
Chapter 4 Decoupled Memory Interface
  4.1 Rationale of the Decoupled Memory Architecture
    4.1.1 Solutions to the Drawbacks of Behavioral Synthesis Tools
  4.2 Design Choices and Architecture
  4.3 Components
    4.3.1 Data Channels
    4.3.2 Input and Output Conversion FIFOs
    4.3.3 Offset Reloading
    4.3.4 Channel Controller
    4.3.5 Address Generation Unit
  4.4 Design Library and Parameters
  4.5 Methodology Issues
Chapter 5 Application-Specific Decoupled Memory Interface Features
  5.1 Application-Specific Operation Scheduling
    5.1.1 Group Scheduling for Equal-Rate Channels
  5.2 Address Generation for 2-D Array Indexing
  5.3 Target Memory Physical Protocols
    5.3.1 SRAM Interfacing
    5.3.2 SDRAM Interfacing
  5.4 DMI Designs with Compiler Loop Transformations
Chapter 6 Practical Evaluations
  6.1 Experimental Methodology
  6.2 Experiments
  6.3 Synthesis Results of DMI Components
  6.4 Applications
  6.5 Sobel Edge Detection (SOBEL)
    6.5.1 Comparison of DMI with Alternative Design Choices
    6.5.2 Impact of Group Scheduling and Pre-fetching
  6.6 Binary Image Correlation (BIC)
    6.6.1 Impact of Group Scheduling and Pre-fetching
    6.6.2 DMI Support for Loop Transformations
  6.7 Multi-Task BIC Designs
    6.7.1 Comparison of Alternative Design Choices
    6.7.2 Impact of Group Scheduling and Pre-fetching
  6.8 Matrix Multiplication (MAT)
    6.8.1 Design Evaluations
  6.9 Data Search and Reorganization Computation
    6.9.1 Application Description and Data-path Design
    6.9.2 Synthesis Results
  6.10 Discussion
Chapter 7 Related Work
  7.1 Memory Access Optimizations of High-Level Synthesis
  7.2 Hardware-Oriented Explicit Parallel Programming Languages
  7.3 Non-FPGA-based Reconfigurable Architectures
  7.4 System Level Descriptions Using High-Level Languages
  7.5 Domain Specific Efforts
Chapter 8 Conclusion
References

List of Figures

Figure 1: Xilinx® Virtex™ CLB with 2 slices
Figure 2: The 2-D mesh structure of the Xilinx® Virtex™ device
Figure 3: Execution model - core data-path and data streams
Figure 4: Memory data streams
Figure 5: System level description and execution model of our target architecture
Figure 6: Organization of the Annapolis WildStar™ FPGA-based board
Figure 7: Generic design flow for FPGA devices
Figure 8: Boolean equation with corresponding table and abstract logic gates
Figure 9: Structure of a Decoupled Memory Interface (DMI)
Figure 10: A channel connecting external memory and an I/O port of the data-path
Figure 11: The structure and I/O ports of an input conversion FIFO
Figure 12: The structure and I/O ports of an output conversion FIFO
Figure 13: The structure and I/O ports of an offset reload channel
Figure 14: Structure of the Address Generation Unit (AGU)
Figure 15: Diagram of the channel controller FSM in a round-robin fashion
Figure 16: Example of a group of channels
Figure 17: Address calculation logic supporting column-wise memory accesses for 2D array variables
Figure 18: Diagram of a DMI which supports page-mode SDRAM accesses
Figure 19: Timing diagram for SDRAM page/non-page mode in the DMI controller
Figure 20: The Annapolis™ WildStar™/PCI board
Figure 21: Kernel of Sobel Edge Detection in C (a) and possible implementations: a naive implementation (b) and reusing data in tapped-delay lines (c)
Figure 22: Performance of designs before and after group scheduling, with different pre-fetch factors, and comparison with a manual design
Figure 23: BIC main kernel loop and operations example
Figure 24: Pseudo-code of BIC applying tiling of the i, j loops by e by f, and interchanging the control loops with the n loop
Figure 25: Performance metrics for various tiling shapes for single BIC and the consequent loop interchange as depicted in Figure 24
Figure 26: Two-task execution cycle count for BIC
Figure 27: Performance of MATRIX using a row-wise AGU with one- or two-word pre-fetch (P), a column AGU using offset-reload with one- or two-word pre-fetch, versus a 2D AGU with one- or two-word pre-fetch (P)
Figure 28: Sample sparse mesh and data reorganization query for the pointer kernel
Figure 29: The DMI controller and data-path implementations for the pointer-based application

List of Tables

Table 1: Behavior of the next-address calculation logic
Table 2: Metrics used in the experimental evaluation of designs
Table 3: Synthesis results of individual components of the DMI generated using Xilinx® ISE targeting a Virtex™ 1K BG560
Table 4: Performance and synthesis results of three design methodologies for the Sobel Edge Detection computation
Table 5: Computation and memory overhead of versions of the implementation in cycle count and synthesis results
Table 6: Synthesis results of versions of the SOBEL implementation using the DMI architecture
Table 7: Performance analysis of all versions of the BIC implementation with a single external memory
Table 8: Synthesis results for the tiled version of BIC
Table 9: Performance, synthesis results, and design effort metrics for designs with the DMI architecture and a manual design for 2-kernel BIC in a single Xilinx® Virtex™
Table 10: Performance analysis of all versions of the BIC implementation with two kernels in a single FPGA
Table 11: Synthesis results of 2-kernel-loop BIC implemented on a Xilinx® Virtex™ BG560 in various versions
Table 12: Synthesis results for all versions
Table 13: Synthesis results for the sparse-mesh pointer-tracing kernel (percentage of whole FPGA area)

Abstract

The flexibility of Field-Programmable-Gate-Arrays (FPGAs) has made them the medium of choice for fast prototyping and a popular vehicle for the development of custom hardware designs for reconfigurable computing. Despite their popularity, FPGAs are hard to program, and existing programming tools lack support for external memory operations.
Currently, programmers have to assume the role of hardware designers and embed in their high-level design specifications low-level information regarding external memory operations, such as exact timing and clock cycle latencies. These practices lead to long and very error-prone development cycles, hampering the widespread adoption of reconfigurable computing technology. In this thesis we describe a decoupled memory interface (DMI) design approach to support external memory operations targeting reconfigurable devices such as FPGAs. The proposed approach directly supports an FPGA macro data-flow execution model, where the computation is defined as a behavioral Very High Level Hardware Description Language (VHDL) process interacting with the external memory via the notion of data streams. These abstract concepts of tasks and data streams are pervasive in image and signal processing computations, for which FPGAs have been recognized as an excellent match. The proposed solution also allows for the effective integration of behavioral VHDL with Register-Transfer-Level specifications, thereby taking advantage of a wealth of synthesis techniques for behavioral specifications while promoting the modular development of multiple interacting designs - a notoriously difficult problem for the large designs enabled by the increasing capacity of FPGA devices. We have successfully integrated the design approach presented in this thesis with a compilation and synthesis tool that combines behavioral synthesis with structural synthesis for VHDL designs. The experimental results for a limited set of image processing applications reveal that the design choices of the proposed interface indeed lead to designs that exhibit performance comparable to that of manual designs, attained at a very small fraction of the design time. The automatically generated designs execute correctly on a real target reconfigurable device from a commercial vendor. The increased FPGA device capacity has enabled the development of systems-on-a-chip with multiple, heterogeneous processing cores connected via a programmable interconnection network. Future systems with these characteristics will, undoubtedly, need to communicate with external, or even multiple internal, memory modules. Because of the heterogeneity of these future platforms, the development of flexible interfaces, such as the one proposed here, will allow the rapid development of complete and correct designs. The ability to generate a large number of complete and correct designs will ultimately lead, we believe, to better and more reliable design exploration strategies with which compilation tools can deliver effective designs in useful time.

Chapter 1 Introduction

1.1 Motivation

Reconfigurable systems offer the promise of significant performance improvements over traditional computing architectures, as they allow for the development of custom data-path structures suited to each computation's particular needs [7]. For example, a numerically intensive computation might take advantage of a large number of multiply-accumulate units, whereas an image-processing, memory-intensive computation can use a customized tapped-delay line storage structure to hold windows of pixel values, as in the sketch below.
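As an illustration only, a tapped-delay line can be sketched in a few lines of VHDL; the entity name, the 8-bit pixel width, and the 3-tap depth below are our own assumptions for this example, not a design taken from this thesis:

  library ieee;
  use ieee.std_logic_1164.all;

  entity tap_line is
    port (clk        : in  std_logic;
          pixel_in   : in  std_logic_vector(7 downto 0);
          w0, w1, w2 : out std_logic_vector(7 downto 0));  -- 3-pixel window
  end entity;

  architecture rtl of tap_line is
    type line_t is array (0 to 2) of std_logic_vector(7 downto 0);
    signal taps : line_t := (others => (others => '0'));
  begin
    process (clk)
    begin
      if rising_edge(clk) then
        taps <= pixel_in & taps(0 to 1);  -- shift a new pixel in each cycle
      end if;
    end process;
    w0 <= taps(0);  w1 <= taps(1);  w2 <= taps(2);  -- window exposed in parallel
  end architecture;

Each tap is a register, so an entire window of consecutive pixels is available to the data-path every cycle without re-reading them from memory.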
The advances in VLSI fabrication of recent years have led to an explosive increase in the number of available transistors on a single die. This increase has made it possible to develop various configurable and reconfigurable architectures (that is, architectures configurable at run-time, during program execution) that vary in granularity, in the number of configurable elements, and/or in the connectivity of the computing elements. For example, the RaPiD architecture consists of linear arrays of simple computing elements connected over a bus, where data can be passed between computing elements in a pipelined fashion [18]. The XPP reconfigurable architecture consists of a two-dimensional mesh arrangement of processing elements (PEs) and memory elements (MEs), interconnected by segmented busses of fixed width for data and synchronization [11]. The connections between PEs are accomplished via a handshaking protocol, despite a synchronous implementation of the individual PEs. Each PE can implement simple arithmetic operations and comparisons, and can retain values in counters to conditionally generate data that will be consumed by other PEs in a data-flow style of execution.

In this work, we focus on very fine-grain (finer than the above) reconfigurable systems composed of commercially available Field-Programmable-Gate-Array (FPGA) devices. Available FPGA architectures usually consist of a two-dimensional mesh of configurable logic blocks (CLBs). Each CLB can directly implement an arbitrary combinatorial function of a fixed number of inputs and have its outputs registered using traditional flip-flop elements. By arranging the individual CLBs in a coherent fashion, FPGAs can implement arbitrary logic functions or execute a set of operations implemented by a data-path.

An important motivation for using commercially available FPGAs is the existence of mature synthesis tools for mapping computations in textual specifications to FPGA designs. In general, these tools take specifications in hardware description languages such as the Very High Level Hardware Description Language (VHDL) or Verilog [63] and generate the configuration of the individual programmable elements in the target FPGA in order to implement the required logic function. In this work we distinguish between structural VHDL specifications and behavioral VHDL specifications. Structural VHDL requires hardware designers or programmers to specify the corresponding components of the hardware realization, along with the controller and the corresponding sequencing of operations. It also allows designers to describe the structure of a hardware design consisting of various smaller design units. Synthesis tools that use structural VHDL as input have to map the hardware components of a structural design to the CLBs of the architecture, performing some target-FPGA-specific mapping transformations. Programming in structural VHDL leads to designs that attain maximum performance and lets designers control very low-level design specifications in detail, but it is lengthy and error-prone. On the other hand, behavioral VHDL programming [38] frees the designer from the need to specify the low-level details of the execution.
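As an illustration of this style (a minimal sketch of our own, not an example from the thesis), the designer can state an accumulation as a plain sequential process and leave resource allocation, binding, and scheduling to the tool:

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity acc_sum is
    port (clk, rst : in  std_logic;
          din      : in  unsigned(15 downto 0);
          sum      : out unsigned(31 downto 0));
  end entity;

  architecture behavioral of acc_sum is
  begin
    process (clk)
      variable acc : unsigned(31 downto 0) := (others => '0');
    begin
      if rising_edge(clk) then
        if rst = '1' then
          acc := (others => '0');
        else
          acc := acc + din;  -- numeric_std zero-extends the shorter operand
        end if;
        sum <= acc;
      end if;
    end process;
  end architecture;

Nothing here says which adder is used or in which cycle; those decisions are exactly what behavioral synthesis supplies.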
In order to cast a behavioral VHDL specification into the structural VHDL that will ultimately be mapped to the target FPGA, behavioral synthesis tools automatically allocate, bind, and schedule the resources required to implement the computation specified by the designer. Behavioral VHDL frees the designer from specifying the low-level details of the execution of the computation, but in many cases it leads to worse-performing designs than if the designer were to directly specify a structural design.

Regardless of the level of abstraction used in programming FPGA-based computing systems, dealing with external memories is a major programming and interfacing issue. This thesis explores the possibility, and the performance trade-offs, of developing a modular design approach for external memory accesses while still allowing the exploitation of application-specific knowledge in the mapping of computations to FPGA-based systems. FPGAs have no direct support even for basic read and write memory operations. FPGAs usually provide hardware resources, such as I/O pins and buffers, to connect the FPGA and the memory module, but it requires the designer's effort to make them function correctly in the hardware design. Vendors of FPGA chips, or of systems using FPGA chips, typically offer some low-level raw interface specification of these resources, in register-transfer-level VHDL or even in the hardware specification itself, to which a designer must conform for correctness of the operations. A major issue is the difficulty of supporting more sophisticated operations, such as pipelined memory accesses or address generation, in order to facilitate the mapping of higher-level programming languages such as C to FPGAs via compiler and/or synthesis tools.

At the level of behavioral synthesis there is also a lack of support for sophisticated memory operations, especially if the target memory is an off-chip module. Memory operations are not easily integrated into the synthesis process, where simplistic assumptions about the latency of the accesses are made when scheduling the computation. Behavioral synthesis exhibits other shortcomings. It is not trivial to specify application-specific scheduling strategies without explicitly embedding the scheduling in the behavioral specification, thereby substantially reducing the extensibility of the code and consequently its utility. Also, there is no simple way to specify a design with multiple concurrent processes that share access to the same memory in a manner that distributes the combined scheduling of the various processes throughout the behavioral specification.

In order to address these shortcomings, while retaining the obvious advantages of specifying designs in behavioral synthesis, we have developed a decoupled memory interface (DMI) controller. This approach allows interfacing with different kinds of external memory on FPGA-based systems, for example Dynamic RAM or Static RAM, and supports different access modes, such as page mode or pipelined modes. The approach also allows designers to define application-specific scheduling of memory operations. Lastly, by decoupling the timing of the data-path from that of the DMI controller, we have imposed a handshaking protocol, through a queue structure, to connect the data-path and the DMI.
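The following fragment sketches such a handshake on the read side, under assumed signal names (a valid/ready pair around the DMI's queue; none of these names are taken from the actual DMI implementation):

  library ieee;
  use ieee.std_logic_1164.all;

  entity dp_read_port is
    port (clk        : in  std_logic;
          fifo_valid : in  std_logic;                      -- DMI: data word available
          fifo_data  : in  std_logic_vector(31 downto 0);
          dp_ready   : out std_logic;                      -- data-path consumed a word
          dp_word    : out std_logic_vector(31 downto 0));
  end entity;

  architecture rtl of dp_read_port is
  begin
    process (clk)
    begin
      if rising_edge(clk) then
        dp_ready <= '0';
        if fifo_valid = '1' then   -- stall implicitly until the DMI has data
          dp_word  <= fifo_data;
          dp_ready <= '1';         -- acknowledge, letting the queue advance
        end if;
      end if;
    end process;
  end architecture;

Because neither side assumes anything about the other's clock or latency, the data-path and the memory interface can be developed and timed independently.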
While in some cases this handshaking increases the latency of the memory operations, in our target domain of applications it is possible to pre-fetch the data required for the execution of the computation. A handshaking protocol also allows for an easy integration of various processes in behavioral VHDL with the structural and parameterizable designs we have developed for the DMI. The experimental results in this thesis support these claims. Our DMI structure allows multiple IP core implementations that share external memory to be integrated in a single design - impossible to achieve using contemporary behavioral synthesis tools.

With the growing complexity of current and future reconfigurable systems, we believe it is important to develop flexible abstractions that can be easily interfaced with existing tools that perform some of the more sophisticated mapping steps, such as low-level partitioning and scheduling. The DMI abstraction described in this thesis has been successfully used in the context of the DEFACTO project, supporting the automatic mapping of computations expressed in C directly to FPGAs [15].

1.2 Organization

In the next chapter we outline the basics of the internal organization and capabilities of current FPGA devices and how these devices can be used as the building blocks of reconfigurable computing architectures. In Chapter 3 we describe the problem this thesis addresses: the lack of support for basic and advanced memory operations. In Chapter 4 we present a solution, the Decoupled Memory Interface (DMI), which we have designed and implemented, along with the rationale for some of its design choices. Chapter 5 describes the application-specific features the proposed DMI facilitates, in many cases enabling powerful transformations when mapping computations expressed in high-level programming languages to FPGA-based reconfigurable architectures. In Chapter 6 we present quantitative experimental results that illustrate the benefits of the proposed solution for a sample set of kernel codes from several application domains. In Chapter 7 we present related work, and we conclude in Chapter 8.

Chapter 2 Background: FPGA-based Reconfigurable Computing Machines

In this chapter, we describe the basics of Field-Programmable-Gate-Arrays (FPGAs) and how they can be used as general-purpose reconfigurable computing engines. We also outline the common approaches for mapping computations expressed in high-level programming languages to FPGA-based computing machines.

2.1 Fine-grain Field-Programmable-Gate-Arrays (FPGAs)

Field-Programmable-Gate-Arrays (FPGAs) have evolved from Programmable-Logic-Devices (PLDs) and Mask-Programmable-Gate-Arrays (MPGAs) through the inclusion of SRAM-based field-programmable capabilities. The typical FPGA architecture includes a regular arrangement of logic components, used to implement the combinatorial and state-preserving logic, interconnected via programmable components. A popular FPGA device, such as the Xilinx® Virtex™ part, is organized as a two-dimensional mesh of Configurable Logic Blocks (CLBs) interconnected via a mesh of limited-size buses. Figures 1 and 2 illustrate the basic CLB and interconnection arrangement for the Xilinx® Virtex™ part [71].
Other vendors, such as Altera®, have opted for CLBs of a different granularity, interconnected via a hierarchy of buses [65].

Figure 1: Xilinx® Virtex™ CLB with 2 slices.

Figure 1 depicts the structure of a CLB for a Xilinx® Virtex™ device. The CLB consists internally of two identical slices. Each slice contains two Look-Up-Tables (LUTs), each connected to a D-latch via a Carry&Control logic block. The actual implementation of the LUT is based on static RAM (SRAM), allowing the device to dynamically reconfigure the LUTs as well as their connections to the other logic blocks around them. Essentially, the LUTs implement the combinatorial portion of a hardware design, whereas the D-latches provide the synchronous, or state-preserving, elements of a design. The CLB includes a Carry&Control block to facilitate the implementation of arithmetic combinatorial logic, such as ripple-carry adders and, consequently, multipliers [72]. A contemporary Virtex™ device such as the Virtex™ 1K 560BG contains 12,288 such slices.

Figure 2: The 2-D mesh structure of the Xilinx® Virtex™ device.

Figure 2 illustrates the two-dimensional (2D) mesh interconnection arrangement between the many CLBs in the Virtex™ device. This mesh is composed of Global Routing Matrix (GRM) resources, establishing the connection between any two CLBs in the device, either directly or indirectly via other GRMs. The CLBs can also be directly connected along the rows of the 2D mesh arrangement.

2.2 FPGA Execution Model and Programming Abstractions

In the absence of any structured form of control and data path for the execution of computations in FPGAs, we have imposed a basic structure consisting of a core data-path and several input and output port abstractions, as depicted in Figure 3.

Figure 3: Execution model - core data-path and data streams.

In terms of the execution of the computation, the core data-path controls the flow of data in and out of its ports. It communicates with an external interface by requesting data to be read/written from/to external memory in order to proceed with its internal computation. The core data-path synchronizes its start/termination with the host process via specific register values. In terms of the programming abstraction, we view this computational structure as a vehicle to implement a macro data-flow execution model with input and output streams, where each input/output data stream has a particular sequence of addresses associated with it. For the definition of the locations, or addresses, of each access to external memory, we impose that the addressing of the various accesses occurs externally to the core data-path.
However, if during application execution there is any change from the pattern we have imposed on the address streams, the core data-path must issue the change of address explicitly. The sequence of addresses is defined by an affine function of the invocation, or activation, count of each access. Typically, this activation count is an affine function of a loop iteration count kept in an address generation unit (AGU) external to the core data-path. In Figure 4 we illustrate the address patterns of a simple loop nest for two array references, A[j] and B[j][i], assuming a row-major data layout for array variables. The sequence of addresses generated by the A[j] reference consists of a repeated sequence of the addresses of all the positions of the A array, whereas the sequence of addresses for B[j][i] consists of a sequence of interleaved addresses. At each iteration boundary of the (i,j) loop, the AGU automatically advances to the next address in each sequence using a particular stride value, and performs a "wrap-around" of the addressing when a given address "overflows" a limit value associated with the corresponding dimension of the array variable. Both the stride and limit values are customizable and programmable to suit the needs of a particular set of loop computations.

Figure 4: Memory data streams.

The notion of a sequence of addresses associated with a particular data channel can be viewed as a fixed-length memory-mapped data stream. In addition to this stream address generation, the address generation units also support random addressing by the core data-path, used in the case of irregular data access patterns. A compiler supporting the execution model depicted in Figure 3 and using the capabilities of the address generation unit outlined above is responsible for isolating the computation in the input application. Herein, it defines the core data-path functionality as well as the parameters that define the various addressing sequences for the data to be fed into the data-path.
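Purely as an illustrative sketch of this stride-and-wrap stepping (the generic names and default values below are our assumptions, not the parameters of the thesis AGU), one stream of such an address generator might be described as:

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity agu_stream is
    generic (BASE   : natural := 0;      -- first address of the stream
             STRIDE : natural := 64;     -- step between consecutive accesses
             LIMIT  : natural := 4096);  -- wrap-around bound of the dimension
    port (clk, advance : in std_logic;
          addr : buffer unsigned(31 downto 0) := to_unsigned(BASE, 32));
  end entity;

  architecture rtl of agu_stream is
  begin
    process (clk)
    begin
      if rising_edge(clk) then
        if advance = '1' then                -- one step per loop iteration
          if addr + STRIDE >= LIMIT then
            addr <= to_unsigned(BASE, 32);   -- "overflow": wrap back to the base
          else
            addr <= addr + STRIDE;           -- next address in the stream
          end if;
        end if;
      end if;
    end process;
  end architecture;

Instantiating one such unit per data stream, with per-stream BASE/STRIDE/LIMIT values, yields the repeated and interleaved sequences of Figure 4 without any address arithmetic inside the core data-path.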
Executing a computation on an FPGA requires the loading of a configuration file analogous to the loading of an executable file into a traditional computing system, and transferring the input data into the appropriate system memories that the FPGA can have access to. The system execution model followed in our research is a Master-Slave, where multiple FPGAs operate in slave-mode with respect to a common Master processor. The computation proceeds by having the Master processor depositing data into the memories accessible to each FPGA and then signalizing them for starting the execution. Upon this signal, each FPGA proceeds to execute a specific task that is assigned to it in some commonly agreed specification. This specification typically includes the bit-stream loaded in each FPGA combined with specific parameters values passed as memory values. Upon completion each FPGA signals the Master processor again via memory-mapped registers. Figure 5 depicts a system view of such mapping arrangement showing on (a) the partitioning, compilation and synthesis flow and in (b) the execution organization of combined processor-FPGA computation architecture. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. A pplication Host Thread FPGA Thread Address Space High-Level Compiler lib C Compiler VHDL = r Synthesis Tools cnfg.bit Host-oniy Memory Address write data sync start read data write data wait sync end read data executable (a) compilation flow (b) execution flow Figure 5: System level description and execution model of our target architecture. In terms of the compilation and synthesis flow, a system level mapping tool will have to assess the profitability of executing portions of its computation on the FPGA versus executing it on the host processor. The portion of the computation that is assigned to the host processor will be compiled using the native compiler for that specific target processor. The portion of the computation assigned to the FPGA will have to be translated to a hardware description language, either VHDL or Verilog, and then be translated to the low-level bit-stream programming format used to 14 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. configure the FPGA device. This synthesis process is typically accomplished by a variety of tools and entails many abstraction levels as described in the next section. The common system-level architecture of FPGA-based reconfigurable computing machines consists of a board with one or more FPGA parts, connected with RAM parts external to the FPGAs mounted on a printed-circuit-board (PCB). This PCB has an interface to a common system-level bus such as a PCI or VME through which the FPGAs can be mapped in a host processor address space. Figure 6 below depicts the internal architecture of the FPGA-based configurable board used in the research presented in this thesis, WildStar™ configurable architecture by Annapolis Micro System, Inc. Shared Shared MemorvO Memorv 1 — SRAMQ FPGA ft FPGA SRAM1 32bits 64bits SRAM2 SRAM3 PCI Controtl To Off- Board Shared Shared Memorv2 MemorvS Figure 6: Organization of the Annapolis WildStar™ FPGA-based board. 15 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. In this board there are three FPGA Virtex™ IK parts each with direct access to four external memory parts. 
2.4 Mapping Computations onto FPGAs

We now describe a generic design flow for mapping a description of a computation in behavioral VHDL to FPGA architectures [6]. The traditional mapping of designs to FPGAs, from behavioral synthesis specifications, comprises five major mapping steps, as illustrated in Figure 7. In the front-end of this mapping we have behavioral synthesis, followed by register-transfer-level (RTL) synthesis and logic synthesis, as the architecture-independent mapping steps. At the back-end of the mapping we can identify the architecture-specific steps that map the intermediate specifications to the target hardware blocks, culminating in the generation of a configuration file - a bit-stream - for the FPGA device.

Figure 7: Generic design flow for FPGA devices.

Behavioral Synthesis

Behavioral synthesis, also referred to as high-level synthesis, consists of the translation of a computation expressed in behavioral VHDL to an RTL description of a particular hardware implementation capable of carrying out that computation. This translation is accomplished by three main phases: resource allocation, resource binding, and scheduling.
To limit these choices, designers can specify the maximum number of allowed resources of each kind and/or the maximum number of clock cycles in the execution of the computation. Internally the tools use these user-provided area and time constraints to find a feasible design exploiting the trade-off between area resources and execution time. The feasible region of combination of required resources and resulting performance metrics are defined as the feasible design space. When behavioral synthesis determines a feasible design, it generates a specification of the data-path that can execute the specified computation as well as a controller that orchestrates its execution. Behavioral synthesis tools also determine the number of required control steps (c-steps) for the execution of the computation and in many cases can specify a breakdown of the number of c-steps for each of the iterations of the computation’s loop constructs. The data-path and its controller, typically a Finite-State-Machine (FSM), are defined in structural VHDL or even in RTL representation. Commercial behavioral synthesis tools, such as Synopsys’ Behavioral Compiler™ or Mentor Graphic’s Monet™, also provide an estimation interface reporting the 18 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. amount of physical area (data-path and the corresponding controller) and time (c- steps) resources. The designer uses these estimations to transform the input specification or refine the area and timing constraints in search of a better design. Register-T ransfer-Level Synthesis Register-Transfer-Level (RTL) synthesis translates a particular hardware design, by exploiting the target architecture features. For example, and when targeting FPGA devices, RTL synthesis can select the type of controller architecture between one-hot encoding or simple state-based encoding or even partitioning a single FSM controller between a set of communicating controllers. At the level of the data-path this phase of the synthesis can also take advantage of carry-look-ahead chains for the implementation of adder blocks. In general, RTL synthesis will transform the input RTL specification to suite specific features of the target hardware blocks [44], Logic-Level Synthesis Logic-level synthesis is the process of mapping the individual blocks specified in a RTL description to the target individual, and considered atomic block. The output of previous step, which is blocks of combinational logic and storage elements, will be described in the Boolean functions which will be transformed into gate-level schematic. The output of this synthesis step consists of a network of abstract logic gates typically optimized, via logic minimization. 19 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Architecture Mapping Architecture mapping, also known as library mapping or technology mapping, is the process in which abstract gate-level descriptions are mapped to physical library cells based on the target technology. For the specific case of LUT-based FPGAs, this mapping consists of defining the specific number of LUTs required to implement a given design specification, and for each LUTs the corresponding configuration. Figure 8 illustrates an example of a technology mapping of a Boolean expression to a 4-input LUT. In this example, the description is done by the Boolean expression y = (x i and x2) o r (not x3) o r x 4 . 
This description is directly translated into a network of logic gates depicted as Figure 8 (a) and then into the truth-table in (c) which defines the internal configuration of the LUT (c). x1x2x3x4 V 0 0 0 0 1 0 0 0 1 1 0 0 10 0 0 11 1 0 10 0 I 0 10 0 1 0 10 1 1 0 1 10 0 111 1 10 0 0 1 10 0 1 1 10 10 10 11 1 110 0 1 110 0 1 110 1 1 1 1 1 0 I 1111 1 y — (x l and x2) or (not x 3) or x 4 X1 [ (a) B oolean Equtaion and Logic 4-input (b) boolean table (c) 4 Input LUT Figure 8: Boolean equation with corresponding table and abstract logic gates. 2 0 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. This technology mapping procedure consists of an iterative process of partitioning and combining the abstract gates representations in an attempt to match the LUTs input count [29] in this example of four inputs. Clearly, this partitioning depends heavily on the internal architecture of the FPGA and is, in terms of algorithmic complexity, a very hard problem [9][58]. Placement and Routing (P&R) In the context of FPGAs placement consist in the assignment of logic functions specified in abstract LUTs to physical CLBs. Once given a placement for the abstract LUTs the routing of the connections between these LUTs must be performed. After placement, routing decides which resources will be assigned to connect the output of one logic cell to the input of others. There are numerous routing algorithms some of which take advantage of specific FPGA internal architectures features (e.g., [9] [3]). A commonly used technique for placement and routing of FPGAs is Simulated Annealing (SA) for its robustness across a wide range of mapping scenarios [41]. Often viewed as independent problems, placement and routing are fundamentally intertwined. A poor placement will most likely lead to a more difficult routing and consequently leading to very low clock rates for the design. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 2.5 Hardware Description Language The current practice for mapping applications to FPGA-based reconfigurable architectures relies on the programmer to specify the computation to be mapped onto the FPGAs using hardware oriented programming languages such as VHDL or Verilog [63]. The computation model offered by these languages is inherently different from high-level imperative languages such as C or FORTRAN. Typically, hardware description languages have two aspects to model hardware designs - abstract behavior and hardware structure modeling. Abstract behavior model provides similar programming style as imperative programming languages. The structural specification models the hardware design consisting of a set of smaller hardware designs which will ultimately be translated into a hardware specification by a set of synthesis tools. At its lowest level, HDL supports objects such as signals and I/O ports that directly map to design wires. An important aspect of the HDL description is the specification of the timing and control structures. The designer is responsible to explicitly schedule the flow of data in and out of the functional units that implement the required computation. A significant distinction between an HDL specification and imperative programming languages is that in HDL each construct that is instantiated for a particular design corresponds to an individual resource that can be activated in parallel with other resources. 
Its internal specification, if done using a set of statements is assumed to be 22 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. evaluated concurrently with the exception of specific clocking statement that define the boundaries of synchronization. To help designers’ efforts and increase readability of descriptions, behavioral synthesis tools allow more and more high-level language-like design (algorithmic level description). This trend will, most likely lead to hardware description languages to assume the role of high-level programming languages. However, there is still a fundamental semantic gap between HDL and imperative sequential programming languages. Whereas HDL expose a CSP execution model [34], popular imperative programming languages are inherent sequential. Applications (or algorithms) are coded in high level imperative language. Except few parallel programming languages, there is only single stream of control. Possible advantages over contemporary approaches if we combine existing efforts of parallelizing compiler and behavioral synthesis - such as loop analysis, transformations and optimization, parallelization, scalar replacement, or the mapping of data into internal RAM blocks instead of using discrete registers. 2.6 Trends in FPGAs Architectures The recent increase in the number of available transistors on a die, recent FPGAs has enabled FPGAs to increase not only the number of CLBs on a single device but also to include configurable blocks with higher functional capacity. In addition recent 23 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. FPGA part also include heterogeneous configurable blocks such as specific hardware multiplier and adder blocks (e.g., Xilinx Spartan™ device series) as well as SRAM blocks (e.g., Xilinx Virtex-E™ device series). The number of available transistors has increased so dramatically that Xilinx now offers the ability to include in the Virtex-II Pro™ multiple pared-down Power-PC RISC cores as soft-macros1 [73], We believe this trend, which embeds more specialized units of diverse granularity and heterogeneity, will continue, as the scale of integration of more transistors on a single die will continue to increase. In the context of a growing number heterogeneous programmable logic blocks in future FPGAs, possibly named Field-Programmable-Core-Arrays (FPCA), the need to design complexity of mapping designs, correctly, and in a feasible amount of time will be of critical importance. 1 A n in te lle c tu a l-p ro p e rty (IP ) co re is im p le m e n te d a s a so ft-m a c ro b y p ro g ra m m in g th e u n d erly in g C L B s to em u la te the fu n ctio n ality o f th e IP co re, as o p p o se d to a h a rd -m ac ro w h e re th e IP -c o re is c a st a t the tra n sisto r-lev e l lay e r o f th e F P G A fabric and c a n n o t be rep ro g ra m m ed to carry o u t a n o th e r fu n ctio n . 2 4 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. C hapter 3 P r o b l e m D e s c r ip t io n 3.1 Motivation Despite its internal data bandwidth, FPGAs still have to contend with having to retrieve/store data from/to external data memories. These operations are not directly supported at the FPGA fabric and also not directly supported by any of the commercially available synthesis tools. There are many reasons for this lack of support for memory operations. 
On the physical side, different memories have distinct physical layers, meaning different data width and transfer control lines. Depending on the transfer modes, physical lines assume different roles in more advanced transfer modes such as page-mode or pipelined operations. Finally, the relative timing for memory operations is highly dependent on the particular memory module at hand. On the synthesis side steps, such as the scheduling of operations, depend critically on the specific knowledge of dependences between the memory operations and their timing. Some of this knowledge is not adequately captured in current high-level hardware-oriented programming languages. If designers want to exploit this knowledge they have to “hard code” synthesis decisions such as scheduling in their 25 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. hardware specification, in effect embedding in the code some of the scheduling decisions. At the system level, contemporary behavioral synthesis tools lack the capability to synthesize specifications of system level issues, such as architectural functionalities between FPGAs, memory interfaces, and possible host CPUs. Since behavioral synthesis tools generate single stream of control, we have seen the deficiency in the synthesis of possible different clock speed between data-path and memory bus. For example, based on the survey of reconfigurable systems, many of current FPGA- based Reconfigurable Computing systems view memory as a shared resource [5][6], The sharing might be issued between host processors and tasks in a FPGA chip, and between multiple tasks in single FPGA chip or multiple chips. In any case, external memory access latencies in the view of data-path implemented in FPGAs are non- deterministic. At the same time, physical addresses of data are bound at run-time. Synthesis of designs that multiple processes (or thread in designs) share resources is very cumbersome to implement in many of the hardware-oriented programming languages. In other popular languages such as Behavioral VHDL, it is simply impossible to specify to independent processes that share common resources. While programmer can aggregate multiple processes in a single process and directly manage the shared resources, this leads to non-modular and difficult solutions to maintain. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. With the rapid increase of FPGA chip capacity, the ability to develop designs in a modular fashion while not substantially sacrificing the overall system performance will become a major design issue. Multiple core designs in a single FPGA are likely to share the limited resources, such as RAM and custom IP cores in addition to the external memory controllers. 3.2 Behavioral Synthesis Tools Drawbacks Behavioral synthesis has long been seen as a mechanism to quickly explore the design space corresponding to behavioral specifications that are deliberately vague about some of its execution and implementation aspects [20], Behavioral synthesis uses designer-provided constraints, such as the maximum number of clock cycles required to execute a given computation and/or the maximum number of hardware resources allowed to derive a suite of feasible implementation designs. 
Despite corresponding to a clear increase in the level of programming abstraction over pure RTL descriptions, where designers are responsible for all aspects of resource allocation and scheduling of the execution, behavioral synthesis still exhibits several shortcomings. In the following subsections, we outline the drawbacks we see in behavioral synthesis that directly pertain to the issues of external memory operations in the context of mapping computations written in high-level imperative programming languages, such as C, directly to hardware. There are other aspects tied to the limitations of the current implementation of the analyses in commercially available tools, such as the lack of powerful compiler analyses that can handle multi-dimensional array variables and perform sophisticated data-flow analysis across loop iterations, though some of them are not the object of our study [2].

3.2.1 Insufficient Support for External Memory Operations

Current behavioral synthesis tools, in most cases, map array variables into internal/external memories. Hence the address streams for internal/external memory accesses are determined at the synthesis phase based on the provided constraints - such as the array size, the packing mode of the array, and the options for mapping multiple arrays to a single memory. Some behavioral synthesis tools support memory access optimization, translating an array variable into a module from their internal library that honors the provided constraints. For internal memory, it is possible to specify or modify, through an admittedly cryptic interface, simple operation parameters such as the latency, the number of read/write ports, and the bit width. For external memories, no interface exists for advanced modes such as pipelining or page-mode accesses2. There is no automated support for dealing with the physical layer and timing of external memory operations, and the inclusion of timing information in the internal scheduling algorithm of the tool is non-trivial, if at all possible. To achieve acceptable performance for advanced operation modes, current approaches rely on low-level cryptic interfaces the programmer must master - a very tedious and error-prone job.

2 Mentor Graphics® Monet™ supports pipelined memory accesses only when loop pipelining is exploited.

3.2.2 Restricted Design Realization using a Single Clock Stream

A basic limitation of existing behavioral synthesis tools is that the designs they generate must use the same clock signals throughout the entire design. This is not surprising, because one of the main functions of behavioral synthesis is the scheduling of the input hardware description, which is an algorithmic description of a task. As such, it is not possible to partition a design between the core data-path with a given clock and a design for the memory controllers. For example, contemporary synthesis tools, such as Mentor Graphics® Monet™, schedule control signals based on a single abstract stream of C-steps (control-steps). Schedules based on a single C-step stream tightly couple the memory operation and data-path operation controls. Hence, this clocking restriction forces the memory interface to use the same clock cycle as defined by the critical path of the whole system.
As a consequence, the memory operation frequency will be as slow as the frequency of the whole design, which is not a good design approach when external memory bandwidth is limited by the slow system clock frequency. For example, a memory bus may run at 100 MHz or faster, whereas an automatically synthesized FPGA design hardly works above 50 MHz without manual tuning.

3.2.3 Support for Multiple Tasks

Mapping a single computational kernel may leave a large portion of an FPGA's resources unused. Multiple-task configurations with a time-multiplexed execution scheme reduce expensive run-time reconfiguration costs without loss of possible inter-task parallelism. The tightly coupled control scheme of behavioral synthesis tools, as mentioned above, assumes that designers have full control over the synthesized resources, including external memory. In this context, describing multiple tasks that share fixed resources seriously increases the design complexity. Hence conventional behavioral synthesis tools cannot handle this issue properly. The issue of handling multiple tasks deals mainly with the added complexity in defining a combined "valid" schedule. Obtaining a schedule for multiple tasks leads to two choices: either forcing the programmer to define implicit protocols by introducing many "wait" statements, which results in quite poor efficiency, or forcing him/her to define a controller in behavioral VHDL. Neither of these options is attractive for programmers; in both approaches, designs are error-prone and harder to maintain.

3.2.4 Lack of Data Dependence Analysis and Loop Transformations

In design space exploration, accurate knowledge of the application during the synthesis phase increases the possibility of generating a better design. One of the main advantages of using behavioral synthesis tools is their capability for fast design space exploration. The fact that data dependence analysis and scalar replacement reduce external memory accesses in general-purpose processors is also true in Reconfigurable Computing systems [30][31][64]. Scalar replacement is realized in an FPGA system by assigning storage elements inside the FPGA chip [64]. In this context, the tradeoffs in the realized target designs achieved by data dependence analysis and the corresponding scalar replacements should be important points of design space exploration. Unfortunately, the data dependence analyses in behavioral synthesis are not as powerful as those in existing high-level language compilers. For computations that manipulate multi-dimensional array variables, conventional synthesis tools cannot determine whether data dependences exist between loop iterations in a nest. Typically these tools can only handle single-dimensional array variables, requiring the programmer to manually flatten any multi-dimensional arrays, as illustrated at the end of this subsection. Loop transformations are also important factors to consider in design space exploration. Loop tiling, loop permutation, loop unrolling, and scalar replacement produce large differences in the realized hardware designs in terms of performance and space [64]. Current commercial tools support, in a programmer-controlled fashion, the application of loop unrolling.
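To make the flattening burden concrete, the following minimal C sketch shows a 2-D kernel and the single-dimensional form such tools force the programmer to write by hand. The kernel and its size are hypothetical illustrations, not benchmarks from this thesis.

#define N 64

/* Original kernel over a 2-D array: a vertical 2-point average. */
void smooth_2d(int a[N][N], int b[N][N]) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 0; j < N; j++)
            b[i][j] = (a[i - 1][j] + a[i + 1][j]) / 2;
}

/* Manually flattened version, the only form many behavioral synthesis
 * tools accept: row-major layout, so a[i][j] becomes a[i * N + j]. */
void smooth_flat(int a[N * N], int b[N * N]) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 0; j < N; j++)
            b[i * N + j] = (a[(i - 1) * N + j] + a[(i + 1) * N + j]) / 2;
}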
3.3 Research Contributions

The research presented here addresses the three major issues outlined in the previous sections, while allowing, and even promoting, fast and modular design realization for FPGA-based systems. The approach preserves the generality and modularity of designs across application domains. Finally, by offering a set of execution abstractions, such as data channels with programmable features, the proposed approach promotes the collaboration of high-level compiler analyses with synthesis tools for the generation of specific hardware transformations and optimizations. We propose a Decoupled Memory Control approach to support external memory operations for FPGA-based reconfigurable systems as one possible solution. A decoupled memory architecture provides a target-memory-independent view of the data-path design. There are many advantages to such a decoupled approach. First, by establishing a clear interface between a memory controller unit and multiple designs, the proposed approach effectively enables multiple designs to be developed independently in a modular fashion. By decoupling the scheduling of memory accesses from the execution of the computation data-paths, and by providing buffering between the I/O and the data-path input/output ports, designers can still leverage existing tools and allow compilation tools to derive application-specific memory access scheduling strategies for each individual design. This flexibility comes at a potential price in performance. However, for the target application domains we are focusing on, as presented in the experimental results section, this performance loss is minimal and is, in many scenarios, outweighed by the many design efficiency advantages. In our experimental evaluation, we have focused on a limited set of image processing kernels. These kernels exhibit a large amount of simple computations with stream-based data access patterns. Most of the kernels have regular and statically determined data access patterns, which can be realized with simple memory address generation schemes. Our proposed decoupled memory control scheme also supports irregular data access patterns found in pointer-based kernels, which are not traditionally seen as a good match for FPGA-based computing. Our results reveal that the proposed solution can also support irregular memory accesses without destroying the core of the DMI abstraction, thereby retaining the design advantages mentioned above.

Chapter 4 Decoupled Memory Interface

In this chapter, we describe the solution proposed in this thesis - the decoupled memory interface (DMI), which supports advanced external memory operations in FPGA-based reconfigurable systems. We begin this description by presenting the rationale of its design along with its advantages/disadvantages over current approaches. Next we describe the individual components of the design, illustrating its operation for selected examples.

4.1 Rationale of the Decoupled Memory Architecture

The decoupling of memory operations from the execution stream has been explored in many multithreaded architectures [44][57] as a technique to overcome the increasing gap between processor performance and memory latency.
In traditional architectures this decoupling allows for the implementation of hardware pre-fetching engines with tight synchronization with the internal pipelining and register files. In the context of reconfigurable, and FPGA-based architectures in particular, this decoupling offers several additional benefits. Besides the opportunity to generate customizable data access patterns (specific stride and offset calculations), decoupling the data access from the execution also has the advantage of isolating synthesis issues, such as the timing of the synthesized design, from the scheduling of the memory operations. The decoupling also allows compiler and synthesis tools to capture application-specific knowledge and use it in deriving improved memory operation schedules.

4.1.1 Solutions to the Drawbacks of Behavioral Synthesis Tools

By decoupling the external memory control behind a data-path scheduling-independent interface, we can provide an architecture-specific memory controller design that isolates some of the behavioral synthesis interface issues. We thereby avoid having to explicitly deal with the external memory control signals. The interface design block generates all the control signals to access external memory, while it communicates with the data-path design specified in a high-level hardware description - in our case, behavioral VHDL. The interaction between the memory controllers and the core data-paths is performed via a handshaking protocol, as discussed below. The decoupling of the interface also enables the development of FPGA systems using different clock frequencies. If we separate the memory-related controls from the other design components, we can exploit more of the available bandwidth of a given architecture by using a "faster" memory bus when possible. This separation and further optimizations are only possible by decoupling the memory operations from the other design components. Since most contemporary FPGA chips support multiple clock inputs, it is possible to apply a separate clock input for the memory interfacing [55]. To address the limitation of current behavioral synthesis for multiple-task synthesis in a single FPGA chip, particularly with respect to external memories shared between multiple tasks, the memory interface handles concurrent requests to the same memory. Handling concurrent requests is achieved by internal and programmable arbitration logic. To accommodate the latency of a request, the DMI has internal buffering capacity. Each data-path design tolerates the possible latency of the arbitration via its handshaking protocol. In our currently proposed architecture, we embed a simple FSM to resolve multiple concurrent memory requests. Although not explored extensively in the work presented here, decoupling the memory control features from data-path synthesis reduces the effort of combining behavioral synthesis tools with powerful compiler analysis features - such as loop tiling, loop interchange, loop fission, and scalar replacement in loop nests with array references using affine subscript functions.
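As an example of the kind of transformation these analyses enable, the following minimal C sketch shows scalar replacement across loop iterations; the kernel itself is a hypothetical illustration.

/* Before: a[i] and a[i-1] are both read from external memory,
 * costing two memory accesses per iteration. */
void diff_sum(const int *a, int *c, int n) {
    for (int i = 1; i < n; i++)
        c[i] = a[i] + a[i - 1];
}

/* After scalar replacement: the value loaded as a[i] is carried across
 * iterations in a scalar - after synthesis, a register inside the FPGA
 * rather than a memory port - halving the external accesses to a. */
void diff_sum_sr(const int *a, int *c, int n) {
    if (n < 2) return;
    int prev = a[0];
    for (int i = 1; i < n; i++) {
        int cur = a[i];
        c[i] = cur + prev;
        prev = cur;
    }
}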
Although loop transformations substantially impact the generated design after synthesis, the exploration of the possible design space requires estimates of the performance and resource utilization (FPGA area, in this case) of the realized design under each loop transformation. Our DMI approach makes it possible to capture the resources devoted to external memory accesses more accurately than conventional behavioral synthesis tools do during design space exploration. By combining a standardized interface for external memory with the accurate data-path unit estimates generated by existing behavioral synthesis tools, we can enable design space exploration for designs resulting from loop transformations [62].

4.2 Design Choices and Architecture

The proposed architecture, as depicted in Figure 9, aims at decoupling the memory-interfacing controls, with their target-architecture-dependent characteristics, from the data-path that implements the application's core computation. The proposed architecture provides a programmable Address Generation Unit (AGU), a programmable memory channel controller, various conversion FIFOs, and corresponding offset channels. The main functionality of the decoupled memory controller is to generate memory addresses and control signals on behalf of the core data-path. To accomplish the decoupling of the memory operations from the data-path design, the implementation of the DMI defines two interfaces, as depicted in Figure 9. One interface communicates with the data-path and implements the transfer of data to the (multiple) data ports of the data-path. The conversion FIFOs are used to facilitate the synchronization between memory accesses and the data-path execution. Since there is no execution code stream to synchronize the two entities, the DMI and the data-path, we provide protocols to synchronize through the queues in the FPGA system. This protocol is a transfer of signals implemented as actual hardware. We have opted for a handshaking protocol in this interface between the data-path and the conversion FIFOs, because it is a general, timing-independent way to enable communication of a behaviorally coded design with entities external to the design. Through this protocol, the data-path controls do not need to directly control the external memory modules.

[Figure 9: Structure of a Decoupled Memory Architecture (DMI): the data-path I/O ports connect through conversion FIFOs and offset channels to the channel controller and the Address Generation Unit, which drive the memory control signals, address bus, and data bus of the external memory module.]

The second interface, the physical interface, generates all the external memory control signals matching the vendor-specific interface signals - physical address, Enable, WriteEnable, DataIn/Out, etc. The AGU and the channel controller generate all of these signals, as well as the control signals for the conversion FIFOs.
An AGU is responsible for issuing the physical address to the memory according to the "select" and "data request" signals from the channel controller. In operation, the channel controller has two functions: it synchronizes the AGU and the conversion FIFOs, and it provides synchronized memory control signals. To free the channel controller from a fixed schedule and to take advantage of the available bandwidth, we opted for a "snooping" scheme in which the channel controller detects a conversion FIFO that is not full and speculatively pre-fetches the data. To accomplish this, the channel controller uses the AGU entry corresponding to the conversion FIFO requiring the data and schedules the memory operations. To resolve race conditions on data bus usage, the appropriate AGU register entry is issued to the address bus under the control signals from the channel controller. In this design we do not explicitly support memory protection or read/write hazards. The underlying assumption is that these issues are enforced or guided by compilation tools, which are responsible for setting up the programmable resources in the DMI design.

4.3 Components

We now describe the various components of the proposed design, namely channels, input/output conversion FIFOs, the offset reload channel, the channel controller, and the AGU.

4.3.1 Data Channels

In our target application domain, which exploits stream-based data accesses, we define the term channel as a data transfer path from the external memory to a data-path input port. In the physical implementation, a channel is defined as the aggregation of a conversion FIFO, an AGU entry, and an optional offset reload channel. The data stream that feeds a specific input port of the data-path is transferred through the same channel throughout the execution.

for i = 0 to N-1
    c[i] = a[i] + b[i]

[Figure 10: A channel connecting external memory and an I/O port of the data-path: the loop above iterates N times, with inputs A and B fed through conversion FIFOs under the channel controller, and output C written back to external memory.]

Figure 10 illustrates the operation of a data channel when the core data-path implements a design that realizes the high-level code shown in the upper left-hand corner of the figure. The gray-scaled conversion FIFO holds the input data corresponding to the memory accesses to the array a[i]. The AGU entry holds the data that defines the sequence of addresses. In this example there is no offset reload channel, because the data accesses of the stream a[i] have a uniform increment of the physical memory address throughout the N iterations. To execute a[i] + b[i], implemented in the data-path at the rightmost side of Figure 10, the design requires data that resides in the external memory. Conversion FIFO 1 detects the data request from input port A, and the channel controller then issues the address generation signals to the AGU. The AGU issues the physical addresses of array a[i], starting from the address of a[0]. After issuing a physical external memory request and waiting the given memory access latency, the channel controller enables conversion FIFO 1 to load the data corresponding to a[i]. Once the data is ready in conversion FIFO 1, the FIFO signals input port A of the data-path to load the data and proceed with the computation.
In essence, a data channel corresponds to an AGU entry and its binding in time to a specific conversion FIFO of the interface. From the perspective of a compiler, a data channel can be viewed as a sequence of memory locations accessed as a stream.

4.3.2 Input and Output Conversion FIFOs

In the proposed architecture, we use conversion FIFOs for two purposes - separation of the memory controls from the data-path design, and synchronization between the memory controls and the data-path. First, the data-path implementation is decoupled from the timing issues relating to memory accesses through the conversion FIFOs. The memory controller simply needs to test whether or not each conversion FIFO is full in order to pre-fetch the subsequent data item as specified by the data channel parameters - the access dimension and stride. Since the input and output conversion FIFOs communicate directly with external memory, we implement the conversion FIFOs synchronously with the external memory, while they communicate with the data-path asynchronously. Another important function of a conversion FIFO is to handle data granularity differences between the memory bus and the core data-paths.

[Figure 11: The structure and I/O ports of an input conversion FIFO: an internal buffer with FifoDataLoad, FifoDataFlush, and NotFull signals on the channel controller side (C), DataIn from the external memory bus (M), and DataRequest, DataReady, and DataIn signals on the data-path side (D).]

Without handling this possible granularity gap, the performance of the resulting design would suffer from inefficient external memory accesses. The unattractive alternatives would be, again, manual designer effort on either the data-path design or the external memory interface, both of which we try to avoid. To address these potential mismatches in our DMI design approach, the conversion FIFO fetches data with the granularity of the memory bus width, stores it temporarily in the buffer, and then feeds it into the data-path input port. We provide analogous output conversion FIFOs to temporarily store output data whose granularity differs from that of the data bus. As depicted in Figure 11, the data port widths to the data-path and to the memory bus usually differ. The memory bus width is a constant architecture-dependent value, whereas the input data-path port width is application-specific. A NotFull signal indicates that the FIFO has at least one empty data slot to store into. Notice that under this model there is no immediate termination control of the data fetching. Termination of the fetching is under the control of the core data-path, which either explicitly ignores the extra data or implicitly terminates the stream via offset reloading of an AGU entry and the subsequent flushing of possibly spurious data in the FIFO. In terms of internal operation, the conversion FIFO may initially be non-full. The channel controller notices the FIFO's status and issues the external memory operations. After a given latency, the channel controller stores the fetched data into the FIFO's internal buffer. Since the data is ready once it is loaded in the buffer, the data-path accesses the data at the input port after checking the DataReady signal. Finally, the data-path issues the DataRequest signal to notify the FIFO that one chunk of data has been read; a software model of this handshake is sketched below.
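The following C fragment is an illustrative software model of this handshake for a two-word input conversion FIFO. The names mirror Figure 11, but the model abstracts away all clocking and is not meant as synthesizable code.

#include <stdbool.h>

/* Internal buffer: two memory-bus-wide words, as in our design library. */
typedef struct {
    unsigned buffer[2];
    int      count;        /* words currently held */
} input_fifo;

/* Status signals snooped by the channel controller and the data-path. */
bool not_full(const input_fifo *f)   { return f->count < 2; }
bool data_ready(const input_fifo *f) { return f->count > 0; }

/* Channel controller side: FifoDataLoad, issued after the external
 * memory read latency has elapsed. */
void fifo_data_load(input_fifo *f, unsigned word) {
    if (not_full(f))
        f->buffer[f->count++] = word;
}

/* Data-path side: consume one item; call only when data_ready() holds.
 * This models the DataRequest signal popping the front of the buffer. */
unsigned data_request(input_fifo *f) {
    unsigned word = f->buffer[0];
    f->buffer[0] = f->buffer[1];   /* shift the remaining word forward */
    f->count--;
    return word;
}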
The size of the internal buffer, in terms of the number of data elements it can hold, can vary. If we assign a larger buffer, data is pre-fetched in larger amounts. In our design library, the size is either 32 bits (the memory bus width) or 64 bits (twice the memory bus width). The FifoDataFlush signal is activated if an offset reload channel is defined for the channel. When an offset-reload command is issued, the data stored in the internal buffer of the input conversion FIFO, which was pre-fetched before the data-path reached this point of the execution, must be flushed. The details of offset reloading are addressed in the next section.

[Figure 12: The structure and I/O ports of an output conversion FIFO: an internal buffer with DataPop, Empty, and NotFull signals on the channel controller side (C), DataOut to the external memory bus (M), and StoreRequest, ChannelReady, and DataIn signals on the data-path side (D).]

Figure 12 depicts the structure of the output conversion FIFO. All the signals except NotFull are symmetric with respect to the I/O ports of the input conversion FIFO. If a data-path wants to store data into the buffer of a conversion FIFO, it starts the operation by checking the ChannelReady signal. If valid, it stores the output value through the DataIn port. The other I/O ports, the DataPop and Empty signals, allow for communication with the channel controller. The channel controller reads the Empty port signal, assigns the memory bus to this channel, and then sends the DataPop signal, which completes a memory write. Similar to the input conversion FIFO, the width of the DataIn port from the data-path is application-specific, and the width of DataOut, the port that sends the buffer contents to external memory, matches the external memory bus width. Since the I/O conversion FIFOs communicate directly with external memories, we clock the conversion FIFOs synchronously with the external memory bus. The NotFull signal is omitted if no offset reload has been defined for the channel. If one has, an OffsetReloadRequest is issued while the internal buffer contents are stored to the external memory. The channel controller generates the corresponding control signals after reading the NotFull signal, if the OffsetReloadRequest signal is issued.

4.3.3 Offset Reloading

To extend the support to irregular data accesses, we provide an offset reloading capability. This capability allows updates of the address in the middle of streamed memory accesses. The function of an offset reload is to reset the AGU register entry value for a specific channel. As depicted in Figure 13, the protocols for interacting with the offset reloading, between the data-path and the channel controller, are identical to the protocol used for the I/O conversion FIFOs. The OffsetPush signal is issued from the data-path, if required, after reading the ChannelReady signal. The OffsetValue from the data-path is stored in the register.
Based on the Empty port signal, the channel controller generates an OffsetPop signal, which sends the OffsetValue to the channel controller and, finally, to the AGU. The offset reload channels are identical for input and output channels. The OffsetPortWidth varies according to the application.

[Figure 13: The structure and I/O ports of an offset reload channel: a single-entry register with OffsetPop, Empty, and OffsetOut signals on the channel controller side (C) and OffsetPush, ChannelReady, and OffsetValue signals on the data-path side (D).]

4.3.4 Channel Controller

The memory channel controller serializes the multiple, and possibly concurrent, memory requests of the various data channels that try to access the same external memory. Internally, the channel controller is organized as an FSM, which is responsible for computing and emitting the address for each channel that requires data, and an RTL-level VHDL synchronization process, which generates the external memory operation signals according to the vendor-specific memory interface timing specifications. As outlined in the section on the conversion FIFOs, the channel controller snoops the Empty/NotFull signals of the input and output conversion FIFOs. The FSM has the same number of states as the number of channels. It reads the output signals of the conversion FIFOs, which notify it of their current status. The FSM then determines what to do and starts to generate the sequence of control signals that trigger the generation of an address by the AGU, the channel controller, and the external memory. The sequence of activation of the AGU signals and of the conversion FIFOs can vary based on the channel - input or output, and with or without an offset reload channel. The sequence of external memory operations varies according to the architectural characteristics of the external memory, such as the timing of controls and the read/write latency, as well as the addressing modes - pipelined vs. non-pipelined, page/non-page/burst, etc. In this context, within our decoupled memory controller, the one process written in structural VHDL in the channel controller is the only part affected by the architectural specifications of the external memory. Hence, with minimal modifications, we can port our decoupled memory interface to a different target system. The current implementation allows for application-specific scheduling by defining distinct orderings of the memory accesses for the different data channels. We have also incorporated support for pipelined memory accesses across multiple channels to reduce latency when these features are supported by the specific vendor interface, as is the case with our current target FPGA board - the WildStar™ [70]. In our channel controller implementation, the handling of an OffsetReload request has higher priority than other data requests. For the input conversion FIFO case, the currently fetched data is a pre-fetch. Since the issuing of OffsetReload signals means that pre-fetched data that has not been used at this point is no longer valid, we need to flush the stored data in the conversion FIFOs.
In the case of an output channel, data that has not yet been stored to the external memory - which can happen if the output data width is smaller than the data bus width - must be stored to external memory; otherwise, the integrity of the program is violated. The contents of the offset reload channel are then redirected to the AGU.

4.3.5 Address Generation Unit

The AGU generates the physical addresses corresponding to the various data channels. Similar to units targeting DSPs, we have included in our DMI direct support for next-address generation logic in the AGUs [17]. A multiple-entry register table triggers the generation of multiple streams of data accesses with minimal address generation overhead, while reusing the address computation logic. Internally, the AGU consists of a register table that keeps the physical address data, an arithmetic unit that updates and generates new physical addresses, and an FSM that generates the appropriate control signals. An AGU binds ports to memory channels by mapping the channel parameters, such as base address, stride, and access direction, to one of its entries and by storing the identifier of the data-path port to/from which the data should be stored or retrieved. Because stream-based applications make heavy use of fine-grain streamed data access patterns, we have developed hardware to directly support stream-based data access modes. Finally, the AGU controller adds the current value of the offset to the base address provided by the register table. To support streamed data accesses, the AGU automatically increments its physical address entry to generate the subsequent addresses of a data channel. The concatenation option (also programmable) allows for faster memory accesses at a possible expense of memory in the layout of the array variables. For a given memory access, the AGU computes the next memory address by adding to the value of the current index a given constant value, typically an offset of 0, 1, or 2 (parameterizable), to account for the sizes of the data values to be fetched. We provide a table of registers and arithmetic units to support run-time memory binding, which is important in the case of memory shared between multiple processing units. Our AGU is capable of address generation for any given stride. It is possible to increment at any bit location by changing the base and offset bit-width parameters. By assigning a non-zero value to the stride for the base, we can generate any kind of statically incrementing address stream. The interaction between an external unit and the AGU requires the loading of a series of parameters - (base address, access direction, and stride). The AGU stores the initial base address of the stream and the current value of the index variable corresponding to the latest memory address. The design also allows a data-path design to directly write values to the register table entries to load start addresses for streams, thereby allowing the base and offset values for a specific stream channel to change in the middle of the data-path execution.
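The following minimal C sketch models this streamed next-address computation; the structure and function names are illustrative and do not correspond to the actual hardware interface.

/* One register-table entry per data channel. */
typedef struct {
    unsigned base;     /* start address of the stream, loaded by the host */
    unsigned offset;   /* index of the most recent access                 */
    unsigned stride;   /* constant increment per streamed access          */
} agu_entry;

static agu_entry r_table[4];   /* host-programmable register table */

/* Triggered by the channel controller's "select"/"data request" signals:
 * emit base + offset, then store back the post-incremented offset for
 * the subsequent memory access of this channel's stream. */
unsigned agu_issue(int channel_id) {
    agu_entry *e = &r_table[channel_id];
    unsigned addr = e->base + e->offset;
    e->offset += e->stride;
    return addr;
}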
[Figure 14: Structure of the Address Generation Unit (AGU): a register table (R-Table) loaded from the host (AGU_load_entry, AGU_read/write) or via the Offset_Reload_Value from the channel controller, address calculation logic, and an FSM driven by the DataRequest signals from the channel controller.]

All the operations of the AGU are triggered by signals from the channel controller. The AGU external interface includes a set of signals from the channel controller that indicate which entry is to be used in the memory access and the direction of the access (either read or write). The AGU module is controlled by a simple FSM that stores back into the register the current values of the index at the end of each cycle, for use in subsequent memory accesses. Another set of signals allows an external entity (typically the host processor) to write specific values to the register entries and therefore program the contents of the AGU. Multiple streams for the same array may coexist in the same AGU by using distinct entries in the register files. In the current implementation we have allocated a single AGU per memory module in order to allow concurrent memory accesses. Resources can be shared by sharing AGU entries, multiplexing the utilization of the AGU in time at a possible expense of performance. In our specialized AGU, the bit-widths of the registers and arithmetic units, and even the number of register entries, change according to the application.

4.4 Design Library and Parameters

We implemented the proposed DMI using structural VHDL as a set of program modules. Our DMI generation library generates all the components - conversion FIFOs, offset reload channels, channel controller, and AGU - as well as the top-level design that connects the DMI with the data-paths and the vendor-provided external memory ports. Our DMI library is about 16k lines of C/C++. The current version includes architectural features provided by the FPGA board vendor, such as architecture-constrained signal declarations for the target FPGA chips and communication logic between the FPGAs and host processors. For the DMI components, the implementation of the designs has been fully parameterized, and all of the parameters are exposed to a compiler tool for the synthesis of a complete FPGA design. To implement the DMI for a given application, we need the number of I/O ports of the data-path(s) as a parameter. Each I/O port is classified as streaming or not. If it is not a streaming port, we provide a static register connected to the data-path port3. If it is streaming data, we call it a channel. The parameters of a channel are defined by the tuple (conversion FIFO, base, offset, stride, dir, OffsetReload), indicating a sequence of consecutive addresses for the data to be fetched for a particular conversion FIFO, where conversion FIFO indicates the identity of the source/destination conversion FIFO; base, offset, and stride are the typical memory address calculation parameters; and dir indicates either a read or a write operation. Based on the number of channels to connect to the same memory bank, we can implement the channel controller and AGUs, together with the detailed parameters of each channel. All of the parameters are accessible through static analysis of an application in a current high-level language compiler.

3 For non-stream data, we provide registers connected to the design block that are directly accessible from the outside of the FPGAs. Hence, the host processor or another IP block writes and/or reads these registers before or after the data-path executes.
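To make the channel parameterization concrete, the tuple can be read as the following C structure; the declaration illustrates the information the generation library consumes, not its literal interface.

enum dir { DIR_READ, DIR_WRITE };

struct channel_params {
    int      conv_fifo;      /* id of the source/destination conversion FIFO */
    unsigned base;           /* start address of the stream                  */
    unsigned offset;         /* initial offset into the stream               */
    unsigned stride;         /* address increment per access                 */
    enum dir dir;            /* read or write operation                      */
    int      offset_reload;  /* nonzero if an offset reload channel exists   */
};

/* Example: a read channel streaming a[0..N-1] with unit stride and no
 * offset reload channel (the base address 0x0000 is a placeholder). */
struct channel_params channel_a = { 0, 0x0000, 0, 1, DIR_READ, 0 };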
As pointed out in the section on the rationale (Section 4.1), the correctness of the generated designs and their interaction with the DMI described here relies on the correct set-up of the AGU and the other programmable modules inside the DMI. In the context of the DEFACTO project [15] we have developed a simple array variable analysis that can automatically map different arrays to distinct memories and generate the corresponding AGU entries along with their parameters. We have also developed a simple interface that allows the DEFACTO compiler to generate the structural VHDL code that implements the required DMI modules and aggregates them with the core data-path design, specified in behavioral VHDL, as outlined in the section about the design flow. In short, we have integrated the proposed design in a completely automatic compilation and synthesis tool that generates designs that execute correctly on real FPGA devices. The channel controller also requires knowledge of the parameters related to the external memory module. These parameters are determined at system design time - in our case, by the target vendor-specific FPGA board. The only specific timing knowledge is embedded in the synchronous process of the channel controller.

4.5 Methodology Issues

The fundamental methodology issue addressed by the work in this thesis deals with the effective integration of multiple design specifications in behavioral VHDL while guaranteeing specific timing constraints. To the best of our knowledge there are three ways to achieve this goal. In the first approach, an all-behavioral description, the designer merges two distinct and interacting designs. Because of the need to generate a single coherent scheduling of execution, many behavioral synthesis tools only support a single VHDL process abstraction in the specification. In effect, this approach disallows the possibility of independent and modular design. The only viable alternative is for designers to define a behavioral specification of a dynamic scheduler and embed the interactions of all of the VHDL processes with this common scheduling process. A second approach calls for the integration of a behaviorally specified design with a library. This library, specified internally in structural VHDL, can be modeled to be integrated with the tool's behavioral synthesis flow by exposing enough details of its input/output behavior. This is the common approach used in the case of IP cores that need to be included in behavioral designs. Unfortunately, this approach does not allow handling variable latencies for memory operations and hence cannot be used when the designs have multiple core data-paths competing for a shared resource. To overcome these limitations, we have opted for a third integration methodology - integrating the structural design resulting from the translation of a behavioral design into an RTL specification with an RTL specification of the DMI design presented here.
This approach allowed us to retain the benefits of behavioral synthesis for the design of the individual data-paths together with the flexibility and timing correctness guarantees required for the external memory operations. The price to pay for this abstraction-level mismatch is the definition of a time-independent communication protocol between the structural and the behavioral design - a half-handshaking timing protocol. While in general a handshaking protocol is less efficient than fixed-timing communication, the nature of the domain of computations we are focusing on - image processing with regular data access patterns - allows our implementation to overlap computation with memory traffic, hiding in many cases the latency and delays introduced by this protocol. As a proof of the validity of this methodology, we have successfully integrated the structural designs generated by a compilation and synthesis tool - the DEFACTO compiler - with a commercial behavioral synthesis tool - Mentor Graphics Monet™ [20]. The DEFACTO compiler automatically generates the DMI modules for the required number of memories and extracts the data channel and AGU parameter data. The compiler makes extensive use of the SUIF annotation mechanism to communicate the required analysis information between compiler passes. The compiler generates the core data-path using a tool we have developed - suif2vhdl. The suif2vhdl tool translates a computation in SUIF's internal representation to behavioral VHDL. The compiler then directly interacts with Monet™ in batch mode for the generation of the structural design corresponding to a behavioral specification of the core data-path. It then merges the two RTL specifications, one for the DMI modules and another corresponding to the core data-path, into a single RTL design, which is then synthesized using the appropriate low-level synthesis tools for the target device, in this case a Xilinx® Virtex™ part.

Chapter 5 Application-Specific Decoupled Memory Interface Features

In this chapter, we address the direct support for application-specific features we have included in the proposed DMI design. The efforts described here include the generation of application-specific address streams, the embedding of application-specific knowledge into the channel controller, and the support of advanced memory access modes.

5.1 Application-Specific Operation Scheduling

In order to determine which conversion FIFO requires data from the external memory, the channel controller module of the DMI sequentially checks, in a Round-Robin fashion, the non-full bits associated with every input channel and the non-empty bits associated with every output channel. For a maximum number of N active input and output channels, a simple verification FSM requires N clock cycles4. Under a scenario where some data channels require service whereas others do not, this sequential Round-Robin testing requires the introduction of additional clock cycles, leading to an overall performance degradation.

4 One could opt for a binary search/elimination using log(N) rounds to determine which channels require service. Although elegant, this approach would clearly pay off only for a very large number of active channels and was, therefore, not pursued in our implementation.
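A minimal C model of this Round-Robin check follows; the signal names are illustrative. Every channel consumes one FSM state, and hence one clock cycle, whether or not it actually needs service.

#include <stdbool.h>

#define N_CHANNELS 4

/* Status bits snooped from the conversion FIFOs: NotFull for input
 * channels, non-empty for output channels. */
static bool needs_service[N_CHANNELS];

static int state = 0;   /* current FSM state = channel being checked */

void round_robin_step(void) {
    if (needs_service[state]) {
        /* issue the AGU "select"/"data request" for this channel and
         * start the vendor-specific external memory sequence */
    } else {
        /* a wasted check cycle: nothing to do for this channel */
    }
    state = (state + 1) % N_CHANNELS;   /* advance to the next channel */
}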
Figure 15 illustrates a sample FSM that checks the status of each active channel in Round-Robin style, illustrating the inclusion of pipelined memory accesses in the scheduling of the accesses. Despite its apparent simplicity, this strategy interacts with other techniques common in synthesis, such as loop pipelining. Under pipelining, loop prologues and loop epilogues do not exhibit all of the memory operations of the regular, steady-state bodies of the loop. Typically, the prologue generates no write operations and the epilogue requires no additional input data. The information about the scheduling of prologues and epilogues must be encoded in the scheduler's FSM, therefore increasing its complexity.

[Figure 15: Diagram of a channel controller FSM in a round-robin fashion: Process A steps through states Check Channel 1 through Check Channel N; in each state, if the channel's FIFO is empty (input) or full (output), it issues a memory request with the read/write and channel select signals, and otherwise does nothing. Process B registers, on each clock edge, the current FSM state and the delayed memory read/write and AGU offset signals.]

In addition, Round-Robin performs well if all the channels exhibit the same rate, that is, they produce/consume the same number of data items per computation cycle. If different channels have distinct rates, the scheduler must check, for every loop iteration, whether or not an operation is required, and the performance will suffer
As depicted in the Figure 16, A and B data accesses at the same rate, and C, D, E occur at the same rate. If the channel controller detects that channel C require data, it can also infer that channel D and channel E also need to access external memory. If channel controller noticed that channel A does not need data accesses at a point of 58 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. execution, then neither does channel B. We can embed this information into FSM by assigning different next state based on the current status signals, not full for input channels, from each channel. for i = 0 to I { data_read A group data_read B for j = 0 to J { fork = 0to K{ data_read C dacircad D data_read E g ro u p group ( C h a n n e l D C h a n n e l B C h a n n e l A | C h a n n e l E C h a n n e l C (a) example code (b) group channels (c) round-robin (d) group scheduling transactions Figure 16: Example of group for channels. With the compile time static analysis, mainly group of channels which requires data fetch/store, we can implement specialized channel control FSM. It will reduce the overhead originated from the “waiting cycles” for each channel to occupy external memory bus. We allow to set Groupld when channel declarations in the DMI controller generations. If the Groupld parameters are set to each channel, the transitions in the control FSM will be generated differently from the Round-Robin fashion which is default FSM. As in the figure, the transitions in FSM jumps between different groups if there require no memory accesses. If there is, which means all the channels in that group require memory accesses, state transitions snoops all the channels before jumping to the other group. Again, the benefit of 5 9 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. group channel control scheme is to reduce the overhead caused by unnecessary channel status checking, if there is any. The rate of reductions in performance loss increases as number of channel increases. We can apply group scheduling for multiple task implementations in our DMI architecture. In case, after implementing multiple task which shares same memory, and only one of them is in executions, round-robin scheduling will suffer from unnecessary channel snooping. If we apply group the channels for each task in one group, we can minimize the performance loss by memory overhead. 5.2 Address Generation for 2-D Array Indexing The AGU architecture introduced in the previous chapter, is capable of generating a sequence of address for a given stride of address generations, which is regularly incremented. This AGU architecture, however, cannot generate column-wise accesses of multi-dimensional array, as the value of a given address stream is monotonic, i.e., its address never decreases its value. For example, it is possible to generate a stream of A[0][0], A[1][0], A[2][0],...., but there is no way to reach A[0][1] after A[m][0] for array A[m][n], in a row-major layout format. A possible solution is providing costly reloading of the offset operations from data-path at the boundary to allow accesses A[0][1] after A[m][0]. With a simple modification of address generation logic and upgrade of AGU FSM can solve this issue. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
[Figure 17: Address calculation logic supporting column-wise memory accesses for 2-D array variables: an R-Table entry (Base, dim1, stride1, dim2, stride2), selected by the Channel Id, feeds the address calculation logic, whose multiplexers - driven by the MuxControl signal from the FSM and the adders' overflow signals (ov1, ov2) - select between the row-incremented and column-incremented values.]

As depicted in Figure 17, the control signal provided by the AGU's FSM allows the selection between the column-incremented value and the row-incremented value. Combining the overflow signals with the MuxControl signal enables the increment of the other dimension whenever there is an overflow. This feature in effect implements the "wrap-around" functionality required to perform column-wise and row-wise addressing. Table 1 shows the next address generated according to the values of the control signals and the R-table contents. With a MuxControl value of 0, the logic generates the address stream for row-wise data accesses.

Table 1: Behavior of the next-address calculation logic; the contents of a single R-table entry are (dim1, stride1, dim2, stride2) = (x1, y1, x2, y2).

  MuxControl  ov1  ov2  Next address
  0           0    0    x1 * size(dim2) + x2 + y2
  0           0    1    (x1 + y1) * size(dim2) + x2 + y2 - size(dim2)
  0           1    0    N/A
  0           1    1    (x1 + y1 - size(dim1)) * size(dim2) + x2 + y2 - size(dim2)
  1           0    0    (x1 + y1) * size(dim2) + x2
  1           0    1    N/A
  1           1    0    (x1 + y1 - size(dim1)) * size(dim2) + x2 + y2
  1           1    1    (x1 + y1 - size(dim1)) * size(dim2) + x2 + y2 - size(dim2)

If MuxControl is 1, the MUX1 multiplexer selects the incremented value for dim1, which is a column-wise access. When an overflow issues from the adder of one dimension, the logic increments the value of the other dimension. Since the FSM changes the value of MuxControl for each channel, we can generate a different access mode for each channel. For example, to implement a matrix transpose operation, B[j][i] = A[i][j] in an (i,j) loop nest whose loop index variables start from zero, we use one input channel and one output channel. We exploit normal row-wise accesses in the input channel for array A and column-wise accesses in the output channel for array B. Hence for channel A we load the R-table entry as (dim1, stride1, dim2, stride2) = (0, 0, 0, 1), and for channel B as (0, 1, 0, 0), where the base of each channel is loaded with the starting address of the corresponding array. The channel controller generates a MuxControl of "0" for channel A and "1" for channel B, based on the ChannelId. The implemented address generation logic is applicable to application domains that have channels with different access modes; matrix transpose and matrix multiplication fall into this domain. We will show the performance comparison, with and without this calculation logic, for the Matrix Multiplication kernel in the next chapter. As the address calculation logic generates the next address for each channel data access, the AGU FSM must keep the addressing mode of each channel. As the channel controller sends the ChannelId with each DataRequest, the FSM issues the appropriate MuxControl signal. Our DMI generation library generates this enhanced calculation logic based on the "column-access" mode parameter tag of the channel. Compile-time analysis can determine the access direction of each channel, if any, and issue the DFG to generate this enhanced AGU design.
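The following C sketch models the wrap-around next-address computation of Figure 17 and Table 1, simplified to a unit carry into the orthogonal dimension (Table 1 expresses the carry through the strides y1 and y2); the variable names follow the R-table entry notation.

/* One R-table entry: current indices (x1, x2) and strides (y1, y2)
 * for a size1 x size2 array in row-major layout. */
typedef struct { unsigned x1, y1, x2, y2; } r_entry;

/* mux_control = 0: row-wise access, stepping dim2 by y2 and carrying
 * into dim1 on overflow (ov2).  mux_control = 1: column-wise access,
 * stepping dim1 by y1 and carrying into dim2 on overflow (ov1). */
unsigned next_address_2d(r_entry *e, unsigned size1, unsigned size2,
                         int mux_control) {
    if (mux_control == 0) {
        e->x2 += e->y2;
        if (e->x2 >= size2) {       /* ov2: wrap dim2, carry into dim1 */
            e->x2 -= size2;
            e->x1 += 1;
        }
    } else {
        e->x1 += e->y1;
        if (e->x1 >= size1) {       /* ov1: wrap dim1, carry into dim2 */
            e->x1 -= size1;
            e->x2 += 1;
        }
    }
    return e->x1 * size2 + e->x2;   /* row-major linearization */
}

Under this model, the transpose output channel runs with mux_control = 1 and y1 = 1, stepping down each column and wrapping to the next column - the column-wise sequence needed for array B.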
5.3 Target Memory Physical Protocols

The DMI can support different kinds of external memory. We achieve this by embedding memory-module-specific controls into the channel controller and the AGUs.

5.3.1 SRAM Interfacing

For synchronous SRAM, we can support variable-latency memory accesses with only minimal changes in the channel controller. The channel controller sends out memory control signals based on its channel schedule. For an input channel, after a given latency (in cycles) specified by the architectural constraints, it generates the DataLoad signals for the associated conversion FIFO. For an output channel, it empties the output channel contents while sending DataStore signals to the external memory. Our DMI library automatically generates the clock delays based on the read or write memory latency parameters. It also supports pipelined memory accesses if the target memory module allows them.

5.3.2 SDRAM Interfacing

We have successfully mapped our DMI onto an SDRAM interface. With minimal manual modifications in the channel controller and AGU, we are able to support page-mode operations for SDRAM. To support page mode, the address control signals must separate the Row Address Strobe (RAS) and the Column Address Strobe (CAS) [38]. In Figure 18, we have also shown a RAS/CAS generator block that can be thought of as part of the address generation unit. This component takes an isolated address and determines, should SDRAM page mode be used, whether or not the corresponding memory access falls within the same page as the previous memory access. If so, it informs the channel controller, which in turn bypasses some of its internal states to perform a faster in-page memory access. Internally, this component consists of a simple table lookup, much like a cache block. For the channel controller, supporting page mode requires additional internal FSM states to generate the correct sequence of control signals.

[Figure 18: Diagram of the DMI supporting page-mode SDRAM accesses: a page-mode-detecting FSM, holding the previous RAS value, sits between the Address Generation Unit, the channel controller, and the external SDRAM module.]

[Figure 19: Timing diagram for SDRAM page/non-page mode in the DMI controller (CE, WE, and Data signals): (a) non-page mode, 2 memory reads; (b) page mode, 4 memory reads.]

Figure 19 illustrates a timing diagram for the SDRAM page-mode and non-page-mode memory accesses as supported by our memory controller implementation. The current implementation can interleave the memory accesses of multiple SDRAM channels, and with minor modifications it can support both SRAM modes (pipelined and non-pipelined) together with the SDRAM memory accesses. Although the target FPGA-based configurable board used for the extensive validation of the proposed DMI does not directly support DRAM modules, we have validated the various DMI implementations using DRAM interfaces, in their sophisticated access mode variants, through simulation at the structural level.
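The page-mode detection amounts to comparing the row (page) bits of consecutive addresses. The following C sketch illustrates the idea; the page-size constant and the names are illustrative, not taken from an actual SDRAM part.

#define PAGE_BITS 10   /* low-order column bits within one SDRAM page */

static unsigned prev_row = 0xFFFFFFFFu;   /* row of the previous access */

/* Returns 1 on a page hit, in which case the channel controller can
 * bypass the RAS states of its FSM and issue only the CAS phase for a
 * faster in-page access; otherwise a full RAS/CAS sequence is needed. */
int same_page(unsigned addr) {
    unsigned row = addr >> PAGE_BITS;
    int hit = (row == prev_row);
    prev_row = row;
    return hit;
}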
5.4 DMI Designs with Compiler Loop Transformations

In application-specific hardware design, synthesis tools, or designers, explore the possible design choices to build hardware that meets their purpose. In behavioral synthesis tools, the existence of a loop in the application description provides a design space to explore, mainly through loop unrolling. The impact of loop tiling and unrolling on the performance of the resulting design is addressed in previous research [62]. In this section we address the impact of several loop transformations on the DMI controller designs.

Fully unrolling a loop and executing the loop body in parallel increases the number of input/output ports by the unrolling factor, as well as the data-path resource requirements. Since commercial behavioral synthesis tools are not capable of partial loop unrolling, we can apply loop tiling together with loop unrolling in the DEFACTO compiler. From the DMI controller's viewpoint, a loop-unrolled data-path is supported by increasing the number of data channels and possibly modifying the stride parameters of some of the AGU entries. Loop tiling, while improving the locality of the accesses in a given array region, invariably leads to the introduction of offset reloading operations to delimit the accesses to a given tile. The proposed DMI can easily support these transformations either by simply reloading the base and offset associated with a given address stream or, in some cases, by reducing the parameters of the AGU, artificially scaling the dimensions of the arrays for the purposes of address wrap-around, as illustrated in the sketch below.
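As a concrete, if simplified, illustration of this mapping, the C sketch below shows how a compiler might rewrite the AGU parameters of one address stream under unrolling and tiling. The agu_entry_t type and the helper names are hypothetical stand-ins for the base/offset/stride registers of the AGU, not the actual library interface.

    /* Hypothetical software view of one AGU entry. */
    typedef struct {
        unsigned base;    /* starting address of the array      */
        unsigned offset;  /* current offset within the array    */
        unsigned stride;  /* address increment per data request */
    } agu_entry_t;

    /* Unrolling a stride-s stream by a factor of two: the single
     * channel is split into two channels covering the even and odd
     * elements, each advancing with twice the original stride. */
    void unroll_by_two(agu_entry_t orig, unsigned s, agu_entry_t ch[2])
    {
        ch[0] = orig;
        ch[0].stride = 2 * s;
        ch[1] = orig;
        ch[1].offset += s;        /* start one element later */
        ch[1].stride = 2 * s;
    }

    /* Tiling delimits accesses to a tile: at each tile boundary the
     * data-path reloads the channel's offset (an offset-reload
     * operation) instead of letting the address run past the tile edge. */
    void start_tile(agu_entry_t *ch, unsigned tile_offset)
    {
        ch->offset = tile_offset;
    }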
Chapter 6

Practical Evaluations

In this chapter we present a suite of experimental evaluations for a set of kernel computations implemented on a specific target FPGA - the Xilinx® Virtex™ device. We have deliberately chosen specific computations that require and/or take advantage of selected features of the proposed DMI to attain good execution performance. We begin by describing the experimental methodology used to obtain the results presented in this chapter, followed by a description of each individual kernel computation and the corresponding experimental evaluation.

6.1 Experimental Methodology

For all but one application, we manually converted high-level descriptions of each kernel, originally expressed as C programs, to behavioral VHDL. (For the pointer-tracing application, we built the data-path in structural VHDL, mainly because this kernel contains computations that manipulate record data structures and is difficult to model behaviorally.) For each computation we defined the quantity and parameter values for a set of DMI structures. Next, we use the Mentor Graphics Monet® behavioral synthesis tool to derive an RTL VHDL description of the computation and merge the resulting RTL code with the RTL code for the DMI structures into a single structural VHDL design. With this complete structural design, which interfaces a core data-path with data from an external memory, we use the Synplicity® Synplify Pro 6.2.0 tool to perform logic synthesis and the Xilinx® ISE 4.1i synthesis package to generate a bit-stream FPGA configuration file by performing the mapping, placement and routing.

In these experiments we targeted a Xilinx® Virtex™ 1K BG560 part and tested the generated designs on an FPGA-based configurable computing board - the WildStar™/PCI board shown in Figure 20.

[Figure 20: The Annapolis™ WildStar™/PCI board.]

For each design, and depending on the focus of the experiment, we collected a series of implementation metrics that reflect the number of consumed FPGA resources and the overall execution time. Table 2 lists the various metrics, indicating which of the tools in our flow was used to collect each one or, in some cases, how we compute its value.

Performance Metric | Brief Description | Tool Used
Clock Rate (MHz) | Maximum clock frequency the design can operate at | Synplicity® Synplify; Xilinx® ISE
Total Cycle Count | Total clock cycle execution count | ModelTech® ModelSim® simulator
Computation Cycles | Approximate cycle count for data-path computation only | Computed as: Initiation Interval x loop iteration count
Wall-Clock Time | Projected execution time | Computed as: Cycle Count / Clock Rate
C-step | Distinct control stages for the execution of the data-path | Mentor Graphics Monet® synthesis tools
Initiation Interval (II) | Initiation interval of loops after loop pipelining | Mentor Graphics Monet® synthesis tools
Memory Overhead Cycles | Number of clock cycles dedicated to memory operations | Computed as: Total cycle count - computation cycles
Slices | Area resource measure of the resulting design on the FPGA (refer to Chapter 2.1, Figure 1) | Synplicity® Synplify; Xilinx® ISE

Table 2: Metrics used in the experimental evaluation of designs. (We used Xilinx® ISE 4.1i to perform the FPGA mapping, placement and routing.)

6.2 Experiments

In this study, we compared different design approaches for implementing a given computational kernel on the target FPGA device. We compare the cost/benefit, measured by observing the performance and the corresponding design time, of three distinct design methodologies with progressive programmer involvement and automation, namely:

Full Behavioral: In this design both the data-path and the memory interface are described fully behaviorally. The designs are synthesized using the Mentor Graphics Monet® behavioral synthesis tools. The designs generated using this approach are verified only via ModelSim® simulation.

Decoupled Memory Interface and Behavioral design: In this approach the core data-path is generated from a fully behavioral description, but the DMI is generated as an RTL description. This approach is still automated: the DMI is generated using a set of library code generation functions, whereas the core data-path is generated via behavioral synthesis.

Full Manual: In this design approach all components are developed to fit the specific needs of the computation at hand. The memory interface consists of a set of counters and an FSM that respects the memory access timing constraints. The core data-path is also developed for each specific kernel application in structural VHDL.
The designs that use the proposed DMI module can also take advantage of several distinct memory operation scheduling policies, namely:

Naive: In this scheduling policy the memory operations of the multiple channels are scheduled using a Round-Robin strategy. Internally, the memory controller examines each channel and checks whether it needs to be serviced by either a read or a write operation. Versions using Naive scheduling use 32-bit-long conversion FIFOs as internal buffers (the same length as the bit width of the memory bus). Since our channel controller tries to fetch data as soon as possible (ASAP) based on the status of the input channel, as described in Chapter 5.1, an input conversion FIFO pre-fetches a single 32-bit word.

Pre-fetching (P): As we increase the depth of a conversion FIFO, the DMI increases the amount of data it pre-fetches. Versions using the Pre-fetching strategy use 64-bit-long conversion FIFOs, i.e., a pre-fetching depth of at most 2 memory words in the current implementation. The channel controller schedules the memory operations of the channels in a Round-Robin fashion, as in the Naive strategy.

Group-scheduling (G): In this scheduling strategy the controller avoids some of the internal test states used to determine whether to schedule operations for distinct input channels that exhibit the same data consumption rate. Versions using this scheduling strategy use 32-bit-long conversion FIFOs.

Pre-fetching with Group-Scheduling (P+G): In this strategy both increased pre-fetching and group scheduling are used, with 64-bit-long conversion FIFOs.

Intuitively, increasing the amount of pre-fetching will reduce the average memory overhead per data item. However, operations such as reloading the offset of a given channel require the invalidation of all the pre-fetched data, so there is a potential trade-off between the depth of the conversion FIFO and the frequency of reloading the offset value associated with a given channel. Longer conversion FIFOs lead to larger implementations but also have the potential to increase overall performance by reducing the memory access overhead. So does group scheduling, since the transitions of the channel controller's FSM are more complex than in the naive implementation. The sketch below contrasts the two scheduling policies.
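The following C sketch contrasts the naive and group scheduling policies in software form. It is a minimal behavioral model under the assumptions stated in the comments; the type and function names (channel_t, service_read, service_write) are illustrative and do not correspond to the actual channel controller FSM states.

    /* Simplified software model of one channel and its conversion FIFO. */
    typedef struct {
        int is_input;       /* 1: input (read) channel, 0: output (write) */
        int fifo_free;      /* free word slots in the conversion FIFO     */
        int words_pending;  /* output words waiting to be stored          */
    } channel_t;

    static void service_read(channel_t *c)  { c->fifo_free--;     /* one bus read  */ }
    static void service_write(channel_t *c) { c->words_pending--; /* one bus write */ }

    /* Naive policy: one round-robin pass; every channel is individually
     * tested before a memory operation is issued on its behalf. */
    void naive_pass(channel_t ch[], int n)
    {
        for (int i = 0; i < n; i++) {
            if (ch[i].is_input && ch[i].fifo_free > 0)
                service_read(&ch[i]);            /* pre-fetch up to FIFO depth */
            else if (!ch[i].is_input && ch[i].words_pending > 0)
                service_write(&ch[i]);
        }
    }

    /* Group policy: channels known (at design time) to consume data at
     * the same rate are serviced back-to-back; a single test on the
     * first channel of the group replaces the per-channel test states. */
    void group_pass(channel_t grp[], int n)
    {
        if (grp[0].is_input && grp[0].fifo_free > 0)
            for (int i = 0; i < n; i++)
                service_read(&grp[i]);
    }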
6.3 Synthesis Results of DMI Components

We present the synthesis results for the individual components of the DMI architecture. In this evaluation we synthesized each component in isolation to assess its compactness and its potential for overall clock rate performance. Unfortunately, one cannot simply compose the numerical results presented in Table 3 into a combined design. In a complete design the various components interact, leading to overall clock rates that are often lower than the lowest individual clock rate, and to an area that is larger than the aggregation of the areas of the individual components. Nevertheless, the results in Table 3 reveal that the individual components are very compact and exhibit clock rates that are very unlikely to place them on an overall design's critical path.

Component | Variation | Slices | FPGA Area (%) | Clock (MHz)
I/O channels | Input conversion FIFO (32-bit) | 26 | 0.25% | 120
I/O channels | Input conversion FIFO (64-bit) | 59 | 0.45% | 74
I/O channels | Output conversion FIFO (32-bit) | 26 | 0.25% | 130
I/O channels | Offset reload channel (20-bit) | 3 | 0.03% | 145
Channel controller | 2 I/Os w/o offset reload channel | 11 | 0.08% | 161
Channel controller | 4 I/Os w/o offset reload channel | 51 | 0.75% | 105
Channel controller | 8 I/Os w/o offset reload channel | 89 | 0.72% | 86.1
Channel controller | 16 I/Os w/o offset reload channel | 171 | 1.39% | 63.7
Channel controller with offset reload | 2 I/Os with offset reload channels | 38 | 0.31% | 143
Channel controller with offset reload | 4 I/Os with offset reload channels | 98 | 0.80% | 73.5
Channel controller with offset reload | 8 I/Os with offset reload channels | 208 | 1.69% | 69.4
Channel controller with offset reload | 16 I/Os with offset reload channels | 435 | 3.54% | 63.1
AGU | 4 channel entries | 103 | 0.84% | 76.9
AGU | 8 channel entries | 194 | 1.58% | 75.7
AGU | 16 channel entries | 350 | 2.85% | 56.8
2-D AGU | 8 channel entries | 195 | 1.59% | 53.6
2-D AGU | 16 channel entries | 359 | 2.92% | 41.2

Table 3: Synthesis results for individual DMI components, generated using Xilinx® ISE targeting a Virtex™ 1K BG560 with 12,288 slices.

The component that exhibits the lowest clock rate is the AGU that supports the 2D addressing modes. This unit is clearly more complex than the other AGU versions, as it requires more internal logic. Nevertheless, our observations indicate that the kernel data-path design generated by behavioral synthesis tools is usually much slower than these components, including the AGUs, and is therefore more likely to define the critical path of the complete design.

6.4 Applications

To evaluate the capability and adaptability of our DMI, we implemented four (4) kernel computations using the proposed decoupled DMI design approach, namely a Sobel Edge Detection (SOBEL) computation; a Binary Image Correlation (BIC) computation; a Matrix Multiply (MAT) computation; and a pointer-based data search and reorganization (POINTER) computation. We observe the applicability of the DMI to each kernel computation and analyze the performance impact of using the application-specific features directly supported by the proposed DMI architecture.

Mainly for simplicity of evaluation and of understanding the experimental results, we have chosen to focus on specific kernel codes that exploit or take advantage of the features of the proposed DMI. Finding a single kernel, or application code, that would elicit all of the features we wish to study in this evaluation would most likely be infeasible. It would also lead to a more sophisticated experiment, in an attempt to isolate the features of the proposed DMI we wish to evaluate.

In this evaluation we devised a different set of experiments for each kernel code. For the SOBEL kernel we compare designs obtained with alternative tool and methodology choices: a fully manual design, a design fully generated by a behavioral synthesis tool from a complete behavioral description, and a design resulting from the combination of the DMI and behavioral synthesis tools. We also present results that illustrate the benefits of group scheduling and pre-fetching in the channel controller. For the BIC kernel we compare a multi-task implementation using the combination of the DMI and behavioral synthesis against a fully manual implementation; for this application, behavioral synthesis alone cannot handle a multi-task implementation.
We also provide results for designs with various loop tiling transformations. These results are intended to show the impact of tiling on the performance of the resulting design and the consequent need for DMI designs that support multiple data channels. For the MAT kernel we illustrate the performance impact of the combined row-wise and column-wise addressing directly supported by the AGU, as it avoids the need to reload and flush the data in the conversion FIFOs. Finally, for the POINTER kernel, we illustrate how the DMI can effectively support pointer tracing.

6.5 Sobel Edge Detection (SOBEL)

The Sobel Edge Detection computation uses vertical and horizontal gradient operators over a 3-by-3 window to determine the location of the edges in a gray-scale input image. As depicted in Figure 21, the computation is organized as a doubly nested loop manipulating two-dimensional arrays. For each loop iteration the computation calculates the values of the two gradients and decides whether to assign a '1' value or a '0' value to the output pixel.

    for (i = 0; i < SIZE - 4; i++) {
        for (j = 0; j < SIZE - 4; j++) {
            uh1 = (img[i][j+2] - img[i][j])
                + 2 * (img[i+1][j+2] - img[i+1][j])
                + (img[i+2][j+2] - img[i+2][j]);
            uh2 = (img[i+2][j] - img[i][j])
                + 2 * (img[i+2][j+1] - img[i][j+1])
                + (img[i+2][j+2] - img[i][j+2]);
            if ((abs(uh1) + abs(uh2)) < threshold)
                edge[i][j] = 0xff;
            else
                edge[i][j] = 0x00;
        }
    }

(a) Sobel Edge Detection pseudo code
[(b) direct mapping and (c) implementation using tapped-delay lines: diagrams not reproduced in this transcript]

Figure 21: Kernel of Sobel Edge Detection in C (a) and possible implementations: a naive implementation (b) and one reusing data in tapped-delay lines (c).

For this computation the DEFACTO compiler recognizes the opportunity of reusing previously fetched input data, thereby substantially reducing the number of required memory accesses. The reference implementation for this loop therefore uses three 3-entry tapped-delay lines, as illustrated in Figure 21 (c) and sketched in the code below. (A tapped-delay line is a specialized register set which shifts its contents across registers in tandem in a single clock cycle.)

In terms of the DMI, the implementation uses three input channels and one output channel. The output channel requires the reloading of its offset value, since the computation generates non-sequential accesses at the boundary of the second loop. The three input channels do not require this reloading operation, as they move along consecutive addresses regardless of the jump to a new row.
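The tapped-delay-line organization can be sketched in C as follows. This is a minimal software model of the data reuse only; the names (delay_line_t, shift_in) are illustrative, and the actual design realizes the shifts as single-cycle register transfers.

    /* One 3-entry tapped-delay line: all taps shift in tandem when a
     * new pixel enters, exposing a 3-pixel window of one image row. */
    typedef struct { unsigned char tap[3]; } delay_line_t;

    static void shift_in(delay_line_t *dl, unsigned char pixel)
    {
        dl->tap[0] = dl->tap[1];
        dl->tap[1] = dl->tap[2];
        dl->tap[2] = pixel;
    }

    /* Per iteration, only three new pixels are fetched (one per image
     * row) instead of the nine a naive mapping would read; r0, r1 and
     * r2 then hold the full 3-by-3 window. For instance, the first
     * gradient of Figure 21 (a) becomes: */
    int gradient_uh1(delay_line_t *r0, delay_line_t *r1, delay_line_t *r2)
    {
        return (r0->tap[2] - r0->tap[0])
             + 2 * (r1->tap[2] - r1->tap[0])
             + (r2->tap[2] - r2->tap[0]);
    }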
6.5.1 Comparisons of the DMI with Alternative Design Choices

Table 4 depicts the performance results for the implementation of the SOBEL computation for the three design methodologies presented in Section 6.2. For each design we present several metrics obtained by the methods described in Section 6.2. We illustrate the impact of exploiting different memory access modes, such as pipelining and pre-fetching with one or two memory words of pre-fetch distance. We also report the number of slices for each design, distinguishing the number of slices used for the core data-path from the number used for the complete design. In terms of clock rates we distinguish between the clock rate the core data-path can achieve and the clock rate of the entire design. This allows us to understand which component is the bottleneck of the complete design. The subsequent three rows present timing-oriented performance results, where we show in parentheses the slowdown relative to the manual design. Finally, in the last row, we have attempted to capture the human effort required by the various design methodologies for this particular computation. For the manual design we generated a family of designs, revealing how aggressive programmers can be in the pursuit of performance by adjusting various initiation interval metrics.

Design Type | Full Behavioral | DMI + Behavioral Data-path | DMI + Behavioral Data-path | Full Manual
Memory Access Mode | Pipelined | Pipelined + 1-word pre-fetch | Pipelined + 2-word pre-fetch | Pipelined + "optimal" pre-fetch
Space (slices): data-path | N/A | 245 | 245 | N/A
Space (slices): data-path + mem. interface | 1,552 | 549 | 644 | 360
Space (slices): full design | 4,102 | 2,291 | 2,400 | 1,940
Clock (MHz): data-path | N/A | 33.0 | 30.7 | N/A
Clock (MHz): data-path + mem. interface | 33.1 | 38.1 | 33.4 | 45.0
Clock (MHz): full design | 23.0 | 33.0 | 29.7 | 22.4
Initiation Interval | 12 | 6 | 6 | 1
Execution Clock Cycles | 2,608 | 1,937 | 1,448 | 208
Wall-Clock Time (usecs/iter.) | 113.4 (12.2) | 58.7 (6.31) | 48.8 (5.24) | 9.3 (1.0)
Design Time (minutes) | 60 (1 hour) | 120 (2 hours) | 120 (2 hours) | 2,400 (1 week)

Table 4: Performance and synthesis results of the three design methodologies for the Sobel Edge Detection computation.

For improved performance all designs exploit two external memory modules - one for the input array and the other for the output array. Notice that in behavioral synthesis there is no easy mechanism to exploit external memory access pipelining: the accesses to external storage in behavioral synthesis are controlled by the way external memories are modeled inside the particular tool set.

The results in Table 4 reveal that even the largest design, obtained using a full behavioral VHDL specification and behavioral synthesis tools, is still small, occupying only about 30% of the 12,288 LUTs of each of the target devices. Overall, the full behavioral design is by far the largest, and the fraction of the FPGA space devoted to the DMI is small, at about 20% of the chip for the designs where this metric can be observed.

In terms of wall-clock time performance there is, however, a significant gap between the manual design and the full behavioral design. The final performance, measured as the total wall-clock execution time and taking the clock rate into account, shows that the manual design is about 12 times faster than the full behavioral design. Other than the obvious clock rate benefit, which accounts for a speedup factor of 1.5, the manual design is able to schedule the memory operations so as to completely hide the latency of the external memory accesses, in what we have labeled "optimal" pre-fetching. (This is clearly an abuse of the concept of pre-fetching, since what is exploited here is the ability to carefully schedule the memory operations.)

There is an additional performance gain factor over the full behavioral code. Since behavioral synthesis tools cannot exploit memory accesses with distinct granularities without designer specifications, there is a four-fold memory bandwidth loss in the case of this kernel code.
To fetch a single 8-bit data item, the full behavioral implementation generates a memory access that is independent of previous or future accesses. The manual implementation instead exploits the spatial and temporal locality of the memory accesses to reduce their number.

In terms of the comparison between the design versions using the proposed DMI and the full manual design, we observe a slowdown between 5.25 and 6.5. Despite both versions being able to exploit the pipelined memory access modes offered by the target FPGA-based architecture, the core data-path synthesized using behavioral synthesis has a much higher initiation interval (six clock cycles) than the design produced manually (a single clock cycle). Another factor contributing to this performance gap is that, when using the DMI, the design inevitably spends additional cycles on synchronization between the data-path and the DMI. This factor, however, is substantially mitigated by the use of pre-fetching and can be further reduced by longer input conversion FIFOs.

6.5.2 Impact of Group Scheduling and Pre-fetching

For the design using the proposed DMI, we evaluate the impact of group scheduling of the external memory accesses. Table 5 presents the results of the combined designs with one and two external memories, using the various memory operation scheduling policies, respectively Naive, Pre-fetching, Group-Scheduling, and Pre-fetching with Group-Scheduling. The designs that do not take advantage of pre-fetching have 32-bit-long conversion FIFOs, whereas the designs that exploit pre-fetching have 64-bit-long conversion FIFOs.

Table 5 presents the simulated clock cycles, the clock cycles devoted to the computation, and the clock cycles devoted to external memory accesses, the latter including any idle time awaiting memory transactions to complete. We also report the percentage reduction of the total number of execution cycles and of the memory overhead due to the various scheduling policies.

Metric | 1 Mem: Naive | 1 Mem: P | 1 Mem: G | 1 Mem: P+G | 2 Mem: Naive | 2 Mem: P | 2 Mem: G | 2 Mem: P+G
Clock Cycle Count | 1,962 | 1,463 | 1,950 | 1,450 | 1,937 | 1,448 | 1,937 | 1,448
Computations | 1,176 | 1,176 | 1,176 | 1,176 | 1,176 | 1,176 | 1,176 | 1,176
Mem. Overhead | 786 | 287 | 774 | 274 | 761 | 272 | 761 | 272
Reduced Cycles | 0 | 499 | 12 | 512 | 0 | 489 | 0 | 489
Perc. of Total Exec. | 0 | 25.4% | 0.6% | 26.1% | 0 | 25.2% | 0 | 25.2%
Perc. of Memory Overhead | 0 | 63.5% | 1.5% | 65.1% | 0 | 64.3% | 0 | 64.3%

Table 5: Computation and memory overhead cycle counts for the various implementation versions (Naive: single-word pre-fetch with round-robin scheduling; P: two-word pre-fetch with round-robin scheduling; G: single-word pre-fetch with group scheduling; P+G: two-word pre-fetch with group scheduling).

The results in Table 5 reveal that pre-fetching substantially reduces the number of clock cycles of the overall execution for both memory configurations. Group scheduling improves the performance only very slightly in the case of the single external memory configuration.
For the two-memory configuration there is no benefit in using group scheduling, since all accesses to each memory have the same pattern and the group scheduling therefore degenerates into the naive scheduling.

The plots in Figure 22 depict the cycle counts for all eight designs and contrast them against the manual design implementation. As expected, the manual design outperforms all of the other designs.

[Figure 22: Cycle count and wall-clock time of the designs before and after group scheduling, with different pre-fetch factors, compared with the manual design with two memories (N: single-word pre-fetch with round-robin scheduling; P: two-word pre-fetch with round-robin scheduling; G: single-word pre-fetch with group scheduling; P+G: two-word pre-fetch with group scheduling).]

Code Versions/Metrics | 1 Mem: Naive | 1 Mem: P | 1 Mem: G | 1 Mem: P+G | 2 Mem: Naive | 2 Mem: P | 2 Mem: G | 2 Mem: P+G
Area (Slices) | 2,274 | 2,386 | 2,278 | 2,386 | 2,291 | 2,400 | 2,291 | 2,400
Clock Rate (MHz) | 32.2 | 26.1 | 32.3 | 31.3 | 33.0 | 29.7 | 33.0 | 29.7

Table 6: Synthesis results for the versions of the SOBEL implementation using the DMI architecture (Naive: single-word pre-fetch with round-robin scheduling; P: two-word pre-fetch with round-robin scheduling; G: single-word pre-fetch with group scheduling; P+G: two-word pre-fetch with group scheduling).

We also present in Table 6 the FPGA resource utilization for each of the designs in this study. As can be seen, there is not a large variation in either the space used or the attained clock rate across the designs. We attribute this behavior to the fact that the implementation of the various scheduling policies does not lead to a substantial increase in implementation complexity, and to the fact that these designs are fairly small, occupying a very small portion of the total FPGA area resources.

6.6 Binary Image Correlation (BIC)

This kernel computes a binary correlation between an image template and a window of an input image.

    for (m = 0; m < (t - s); m++) {
        for (n = 0; n < (t - s); n++) {
            for (i = 0; i < s; i++) {
                for (j = 0; j < s; j++) {
                    if (mask[i][j] != 0)
                        th[m][n] += image[m+i][n+j];
                }
            }
        }
    }

(a) BIC kernel loop
[(b) Diagram of operation: an s-by-s window sliding over the t-by-t input image, producing (t-s)-by-(t-s) results; not reproduced in this transcript]

Figure 23: BIC main kernel loop and operation example.

The basic computation is depicted in Figure 23 as a perfectly nested loop with four loops, using a mask variable (a 2D mask array of size s-by-s) over an input image (a 2D image array) of size t-by-t. As illustrated in Figure 23 (b), the computation scans the input image by sliding an s-by-s window over the input and, for non-zero mask values, accumulates the values of the corresponding image window into another 2D array variable. In this study we have fully unrolled the two innermost loops of the nest to expose ample opportunities for data reuse through a transformation called scalar replacement, or register promotion [10][64], illustrated in the sketch below.
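As an illustration of the effect of this transformation, the sketch below shows the two innermost loops fully unrolled for a small mask size (s = 2) with the mask values promoted to scalars. This is our own illustrative rendering of scalar replacement, not code produced by the DEFACTO compiler, and the fixed image width is an assumption for the sketch.

    /* BIC with the i and j loops fully unrolled for s = 2 and the
     * loop-invariant mask values promoted to registers: the mask is
     * read once, and only image values are re-fetched in the loops.
     * A 64-element row width is assumed here for concreteness. */
    void bic_unrolled_s2(int t, int s,
                         int mask[2][2], int image[][64], int th[][64])
    {
        int m00 = mask[0][0], m01 = mask[0][1];
        int m10 = mask[1][0], m11 = mask[1][1];
        for (int m = 0; m < t - s; m++) {
            for (int n = 0; n < t - s; n++) {
                int acc = 0;
                if (m00 != 0) acc += image[m+0][n+0];
                if (m01 != 0) acc += image[m+0][n+1];
                if (m10 != 0) acc += image[m+1][n+0];
                if (m11 != 0) acc += image[m+1][n+1];
                th[m][n] += acc;
            }
        }
    }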
We have implemented this kernel for a 64-by-64 input image with an 8-by-8 mask image. In this implementation all input data is stored in one external memory, and the output result image is stored in a different external memory.

6.6.1 Impact of Group Scheduling and Pre-fetching

Table 7 presents the results for the various external memory operation scheduling policies. As with the previous kernel code, we observe a substantial decrease in the total number of execution cycles. When pre-fetching alone is used we observe an 80% reduction of the memory cycle count, resulting in a net reduction of 20% in the overall total execution cycle count.

Group scheduling leads to a visible performance gain for this kernel over the naive scheduling. We observed that group scheduling is beneficial when the behaviors of the channels connected to the same channel controller differ. Since the data access behavior of the input image and the mask image is different, group scheduling shows a 1.5% reduction in cycle count and a 5.8% reduction in memory access overhead. Since the computation cycles - 32,490 - are the dominant factor in the total execution cycle count, the reduction caused by group scheduling is fairly small. However, a reduction of 5.8% of the memory overhead, plus about 4 percentage points more after aggressive pre-fetching, results in a noticeable performance improvement over the design with pre-fetching only.

Metric | Naive | P | G | P+G
Total Execution | 43,695 | 34,939 | 43,039 | 34,513
Memory Overhead | 11,205 | 2,449 | 10,549 | 2,023
Reduction in Cycle Count | 0 | 8,756 | 656 | 9,182
Reduction of Total Exec. | 0% | 20.0% | 1.5% | 21.0%
Reduction of Memory Over. | 0% | 78.1% | 5.8% | 81.9%

Table 7: Performance results for BIC with a single external memory (Naive: single-word pre-fetch with round-robin scheduling; P: two-word pre-fetch with round-robin scheduling; G: single-word pre-fetch with group scheduling; P+G: two-word pre-fetch with group scheduling).

6.6.2 DMI Support for Loop Transformations

As outlined in Chapter 5, the proposed DMI architecture provides a convenient set of abstractions, in terms of data channels and AGU resources, that can effectively support high-level compiler transformations. In this section we examine the role of the proposed DMI in supporting loop tiling and loop unrolling for the BIC kernel.

Loop unrolling and loop tiling are common loop transformations that promote reuse and expose instruction-level parallelism (ILP). One of the side effects of loop unrolling is to increase the number of memory accesses for data items that are not loop invariant. Figure 24 illustrates the application of tiling and unrolling to the two innermost loops of the BIC kernel.
    for m = 0 to (t-s)
        temp[n] = 0;
        for ii = 0 to s by e
            for jj = 0 to s by f
                for n = 0 to (t-s)
                    // loops i and j tiled and unrolled
                    if (mask[ii+0][jj+0] != 0) temp[n] += image[m+ii+0][n+jj+0];
                    if (mask[ii+0][jj+f] != 0) temp[n] += image[m+ii+0][n+jj+f];
                    if (mask[ii+e][jj+0] != 0) temp[n] += image[m+ii+e][n+jj+0];
                    if (mask[ii+e][jj+f] != 0) temp[n] += image[m+ii+e][n+jj+f];
        for n = 0 to (t-s)
            th[m][n] += temp[n];

Figure 24: Pseudo code of BIC after tiling the i and j loops by e and f, and interchanging the control loops with the n loop.

By tiling the i-loop by a factor of e and the j-loop by a factor of f, the implementation can capture the reuse of data in e tapped-delay lines of depth f each. Because of the tiling of the j-loop, the implementation must interrupt the scanning of the input lines of the input image at the end of e read operations by reloading the offset register associated with each of the e data channels. Using the proposed DMI architecture, a compiler tool can map the transformed code to behavioral VHDL, using many AGU and data channel resources to serve the various data streams resulting from the data access patterns of the unrolled code's array references.

[Figure 25: Performance metrics (cycle count and wall-clock time) for the various tiling shapes (1x16, 2x16, 4x16, 8x16, 16x16) applied to a single BIC kernel, with the loop interchange depicted in Figure 24.]

In Figure 25 and Table 8 we depict the design performance and FPGA resource usage for the application of various tiling factors to the BIC kernel. For this case study we used a 64-by-64 input image and a 16-by-16 mask.

Tiled (e by f) | 1x16 | 2x16 | 4x16 | 8x16 | 16x16
Clk rate (MHz) | 35.3 | 29.9 | 21.7 | 28.3 | 24.2
Slices | 4,558 (37%) | 6,661 (54%) | 7,070 (57%) | 9,734 (79%) | 10,417 (84%)

Table 8: Synthesis results for the tiled versions of BIC.

For any of these designs the modifications required in using the proposed DMI designs are minimal. The design simply has to indicate the number of additional data channels required and load the appropriate base, offset and stride values corresponding to each of the address streams for each data channel in the AGU registers. The DMI interface designs for tilings of 2-by-8 and 2-by-4 of the i and j loops, which use different tile shapes, are identical, since they use the same number of channels and offset reloading channels. Since the design realizations induced by loop tiling, or by choices of loop interchange, are beyond the capability of conventional behavioral synthesis tools, applying these loop transformations using parallelizing compiler techniques helps designers or analysis tools explore the possible design choice space and realize the designs.

6.7 Multi-Task BIC Designs

In this experiment, we map two BIC kernels with different input data and matching operators onto a single FPGA device. The goal of this experiment is to evaluate the ability of the proposed DMI to support a multi-task design whose data-paths are generated using behavioral synthesis approaches.
We selected two tasks that do not need to communicate, hence putting the focus of the implementation on the ability of the tools to schedule memory accesses and coordinate them with the internal execution of both tasks. For the purposes of this experiment we used a 64-by-64 image size and an 8-by-8 pixel mask.

6.7.1 Comparison of Alternative Design Choices

We now compare the performance of two designs: a mixed design using a structurally defined DMI along with a behavioral description of the core data-path, and a fully manual design of both the memory controller and the core data-path. We are unable to provide a full behavioral design because of the large array sizes involved in this code. Yet another, more fundamental, reason current behavioral synthesis cannot handle this design is the existence of multiple task descriptions.

For all versions of the implementation we used three external memory modules (A, B, C). We stored the input image array, which is identical for the two tasks, in the first external memory (A). The two sets of mask images for the two tasks are stored in the second memory (B). The output data generated by the two tasks is written to the third external memory (C). For each task, reading data from the external memories A and B and writing data to the external memory C creates constant memory access contention. Since in the proposed DMI approach the channel controller serializes all the memory requests between multiple data-paths, the programmer/designer can synthesize multiple data-paths in a single FPGA without manual design effort.

Design Type | DMI + Behavioral Data-path | DMI + Behavioral Data-path | Full Manual
Memory Access Strategy | Pipelined + 1-word pre-fetch | Pipelined + 2-word pre-fetch | Optimal pre-fetching
Space (Slices): data-path | 3,238 | 3,238 | N/A
Space (Slices): data-path + DMI | 4,180 | 4,588 | 4,264
Space (Slices): full design | 5,864 | 6,318 | 5,926
Clock (MHz): data-path | 27.7 | 25.8 | N/A
Clock (MHz): data-path + DMI | 27.9 | 28.2 | 31.6
Clock (MHz): full design | 27.7 | 24.0 | 25.5
Initiation Interval | 10 | 10 | 1
Execution Time (cycles) | 43,535 | 34,787 | 8,333
Wall-Clock Time (usec/iter.) | 1,571.7 (4.81) | 1,449.5 (4.43) | 326.8 (1.0)
Design Time (minutes) | 120 (2 hours) | 120 (2 hours) | 2,400 (1 week)

Table 9: Performance, synthesis results and design effort metrics for the designs with the DMI architecture and for a manual design of the two-kernel BIC in a single Xilinx® Virtex™ FPGA.

Table 9 shows the performance and synthesis metrics of the two design approaches. In the full manual design, the data-path and the memory access controllers are tightly coupled, using a single FSM that generates all the data-path control signals and memory control signals. For this manual design we are unable to determine the metrics for the data-path components, since we cannot decouple the data-path from the complete design. The slice counts for the DMI and data-path reveal that the manual design is bigger than the design version with the DMI and a behavioral description of the data-path. In the case of the manual design, we implemented a loop-pipelined data-path with an initiation interval of one cycle. To achieve this single-cycle initiation interval, most of the computational resources and the registers that store temporary values cannot be shared across C-steps, naturally resulting in large designs.
The performance gap between the manual design and the designs that use the proposed DMI is mainly due to the very long initiation interval generated by behavioral synthesis. The numerical gap is nevertheless smaller than the ratio of the initiation intervals, as the tasks can effectively hide the latency of the memory accesses. However, the manual design's execution of two tasks takes almost twice as long as a single-task execution, since the single-task schedule is already optimized by hiding the memory latencies in its computations.

6.7.2 Impact of Group Scheduling and Pre-fetching

In this section we show the impact of the group scheduling versions of the two-task BIC computation. Since we store the two input data sets in separate memory modules, all the input channels connected to the same channel controller exhibit the same access behavior.

Metric | Naive | P | G | P+G
Total Execution | 43,543 | 34,787 | 43,543 | 34,635
Memory Overhead | 11,053 | 2,297 | 11,053 | 2,145
Clock Cycles Reduction | 0 | 8,756 | 0 | 8,908
Perc. of Total Exec. | 0% | 20.1% | 0% | 20.5%
Perc. of Memory Over. | 0% | 79.2% | 0% | 80.6%

Table 10: Performance analysis of all versions of the BIC implementation with two kernels in a single FPGA (Naive: single-word pre-fetch with round-robin scheduling; P: two-word pre-fetch with round-robin scheduling; G: single-word pre-fetch with group scheduling; P+G: two-word pre-fetch with group scheduling).

As with the previous studies on memory overhead reduction, we observe a reduction of the memory overhead cycles of up to 80%, which translates into a 20% reduction of the overall execution time. Group scheduling, however, does not lead to any reduction of the overall execution cycle count: because all external memory accesses exhibit the same patterns, there is no performance gain over the naive Round-Robin strategy. However, when we combine group scheduling with an increased pre-fetching amount, we observe a slight performance improvement over the increased pre-fetch only version - 80.6% vs. 79.2% cycle reduction.

[Figure 26: Two-task execution cycle count and wall-clock time for BIC under the Naive, P, G, and P+G scheduling strategies, compared against the manual design (Naive: single-word pre-fetch with round-robin scheduling; P: two-word pre-fetch with round-robin scheduling; G: single-word pre-fetch with group scheduling; P+G: two-word pre-fetch with group scheduling).]

Compared to single-word pre-fetching, we observed that the two-word pre-fetching version produces unnecessary pre-fetches, which are immediately discarded by an OffsetReload request from the data-path at the loop boundary. Overall, group scheduling combined with two-word pre-fetch reduces the execution cycles by 0.4% over naive scheduling with two-word pre-fetch. As for the synthesis area results, shown in Table 11, we observe that the two-word pre-fetch version increases the slice count by approximately 10% over the one-word pre-fetch version, and that group scheduling increases the slice count by 2 slices, for the channel controller FSM component, over the design with the naive scheduling strategy.
Metric | Naive | P | G | P+G
Area (Slices) | 5,864 | 6,318 | 5,866 | 6,320
Speed (MHz) | 27.7 | 24.0 | 26.1 | 23.7

Table 11: Synthesis results for the versions of the two-kernel BIC implemented on a Xilinx® Virtex™ BG560 (Naive: single-word pre-fetch with round-robin scheduling; P: two-word pre-fetch with round-robin scheduling; G: single-word pre-fetch with group scheduling; P+G: two-word pre-fetch with group scheduling).

6.8 Matrix Multiplication (MAT)

Matrix multiplication is a widely used and studied kernel in image-processing applications. Our study focuses on the performance benefits of using a custom address generation scheme that can support the various addressing patterns of the two input matrices and the single output matrix in this computation. In this example the data accesses to the output matrix exhibit a unit stride, whereas the data accesses to the two input matrices have multi-mode, but still regular, access patterns. An address generator with a (base + offset) address generation scheme requires the reloading of the base offset across the rows and columns of the input matrices.

6.8.1 Design Evaluations

In this experiment we used two 16-by-16 matrices as input and generated one 16-by-16 result matrix. All three matrices are mapped to a single external memory. We implemented Matrix Multiplication in three versions: one using a row-wise AGU with offset reload, which can generate only monotonic increments of the address space (row-wise AGU); another with a column-wise AGU (column-AGU), as shown in Chapter 4.3; and lastly one using a 2D indexing AGU (2d-AGU), as shown in Chapter 5.2. The implementation with an AGU that supports only row-wise address generation requires an offset reloading operation at each iteration of the outermost loop to update the AGU entry corresponding to one of the input matrices. We compare the performance of these versions with and without the pre-fetch option. Figure 27 shows the cycle count and wall-clock time of each design.

[Figure 27: Performance (cycle count and wall-clock time) of MATRIX using a row-wise AGU with one- or two-word pre-fetch (P), a column AGU using offset reload with one- or two-word pre-fetch, and a 2d AGU with one- or two-word pre-fetch (P).]

The two-word pre-fetch option does not show any improvement for the regular row-wise AGU version compared to single-word pre-fetch. The reason is that, for the output array, which exploits column-wise accesses, the pre-fetched data of the row-wise adjacent element is flushed in every loop iteration. It therefore increases the execution cycles, because of the unnecessary pre-fetches as well as the invalidations of the data already in the conversion FIFOs. However, if we pre-fetch two words using the column-wise address generation unit, even with offset reload, performance improves over the single-word pre-fetch version. The versions with the column AGU or the 2d-AGU show a substantial performance gain over the row-AGU versions, because column accesses eliminate a large number of costly offset reloads.
With two-word pre-fetch, we observe 95 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 20% reduction in cycle count reductions. In the column AGU versions, offset reloads occur in the second nested loop. The 2d-AGU eliminates the offset-reload in the second nested loop and requires offset-reloads in the outer-most loop. In this experiment, 2d-AGU versions perform 2-3% better than column AGU versions with regards to the cycle count. Metrics Addressing and M emory Scheduling Strategy Row-wise Row-wise+P Column-wise Column-wise + P A rea (Slices) 2308 2343 2241 2323 S peed (MHz) 24.3 23.4 36.4 22.0 Table 12: Synthesis results for all versions: (R row-wise generic AGU, P: pre-fetch 2 words, C: column-wise AGU) In terms of the FPGA space resources consumed, all code versions exhibit similar area comsumption. The design using column-wise address generation unit exhibits an unusual improvement in clock rate. We attribute this “anomalous” result to the vagaries of the P&R step during the synthesis of this design. 6.9 Data Search and Reorganization Computation To extend the capability of our DMI approach, we support pointer-tracing operations found in irregular application that traverse pointer-based linked data structures. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 6.9.1 Application Description and Data-path design In this experiment we focus on a database searching computation kernel [14]. This computation scans a large pointer-based data structure, a sparse-mesh, and collects the records in the data structure that meet a given matching criteria. A sample sparse- mesh data structure and the core of a Data and Reorganization computation is depicted in Figure 28 (a) and (b) respectively. void SpatialQuery(sm 'sparse, int xjl, int yjl, int x ur, int y_urX row_node*m; data_node *dn; m = sm->raws; while(m 1 = N U L L ){ dn = rn->first_node; while(dn !=NULL){ if(insideRegion(dn,xJI,yJI,x_ur,y_ur) = T R U E E ){ if(selectionCriteria(dn,<args>) = TRUE){ addToList(dn,<args>); } dn = dn->next_col; } } m = m->next_row; } } (b) S am p le D ata a n d R eo rg an izatio n Q u ery Figure 28: Sample Sparse-Mesh and Data and Reorganization query for the Pointer Kernel. The main computations or operations in this kernel are spatial queries and pattern matching. Spatial queries, as shown in the Figure 28 (b), read two dimensional spatial fields of records and compare the values to identify whether each record is inside the rectangular area, partially overlapped or outside the area. The main operators are arithmetic comparators. The pattern matching operations consist of 9 7 y-aris x-axis (a) S am p le S p arse-M esh D ata S tru ctu re Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. matching the contents of a given record text field against three search patterns. Should the record meet the matching criteria, the computation copies the entire contents of the record to a new location and proceeds to the next record, following a pointer field in the record being currently examined. In this kernel the total execution cycles are proportional to the record size, in our experiment 37 word (of 32-bit), and the execution cycles per record (CPR) is proportional to the size of the record. The kernel traces the linked list while detecting whether its internal data attribute -length of 80 characters (640 bits total) - contains any of three predefined strings. 
The implementation pipelines the fetching of the various records from memory to effectively overlap computation with the memory accesses, both for retrieving and for storing data. While the implementation determines whether a node should be selected, and therefore copied to memory via the output channels, it reads a new node by following the pointer link field in the node being examined. The pattern matching examines the three patterns concurrently, using a divide-and-conquer technique to accelerate the search for each pattern. In terms of fetching the data from the data structure, the implementation aggressively pipelines the accesses to the records in the linked lists while performing the pattern matching computation on a given record. The indirect pointer access is accomplished by reloading the contents of the base and offset address of the selected data stream in the AGU through the offset reload channel. As each record of the database is larger than a single bus transaction, we also exploit the consecutive addressing modes of the AGU to fetch the various words of a given record.

[Figure 29: The DMI controller and data-path implementation for the pointer-based application, showing the external memory, channel controller, address generation unit, input and output channels, and offset reload channels.]

Figure 29 depicts the implementation of this kernel's computation. Each record being examined is kept in an internal "wide register," along with another incoming record and possibly an outgoing record to be written to a new location, to implement the reorganization phase of the computation. The data-path keeps track of the position being examined inside the layout of the record with a counter that counts which attribute of the data is being received through the input conversion FIFO.

6.9.2 Synthesis Results

Table 13 presents the synthesis results for the pointer-tracing computation kernel, where we report the area and clock rates for the three main components of the design.

Metric | Computation and FSM | Wide Registers | Complete Data-path | Full Design
Area (Slices) | 1,403 (11%) | 2,642 | 4,045 (32%) | 9,450 (77%)
Clk (MHz) | 32.2 | n/a | 25.7 | 33.6

Table 13: Synthesis results for the sparse-mesh pointer-tracing kernel (percentages are of the whole FPGA area).

The synthesized design for this kernel has been verified on the WildStar™ FPGA board. As mentioned in the previous section, the total execution cycles of this kernel are proportional to the record size - in our experiment 37 32-bit words - and the execution cycles per record (CPR) are proportional to the size of the record. The total execution cycles equal the CPR multiplied by the number of records.

6.10 Discussion

The variety of the kernels and experiments presented here reveals that the proposed DMI architecture and corresponding interface allow designers, or in their absence compiler tools, to quickly generate complete FPGA designs that interface with external memories. Some of the design options of the proposed DMI allow integration with behavioral synthesis tools for single- and multi-task designs, a capability beyond the reach of pure behavioral synthesis tools. The flexibility of the proposed design comes at a performance cost.
In order to deliver a timing-agnostic interface, we have imposed a data transfer protocol, based on handshaking, between the DMI architecture and any external design. To mitigate the potential latency impact this approach imposes, we aggressively exploit pipelined and pre-fetching memory access modes. For the domain of image processing computations, where the data access patterns are very regular and in most cases known at compile time, there are ample opportunities to hide the cost of the external memory latency as well as the synchronization costs. The experimental results presented here support these claims.

While traditionally the application domain of FPGA designs has focused on absolute performance as the ultimate design metric, we believe that with the growing availability of transistors on a die and increasing clock rates, the focus will shift towards correctness and performance predictability. In terms of design complexity and area overhead, the experimental results also show that the proposed design does not severely impact the overall area budget of the designs used in these experiments. This area concern is becoming less of an issue with the growing FPGA capacities; instead, designers are more interested in design correctness and timely development. Our own experience has been that the proposed DMI design promotes modular design development while delivering designs whose raw performance is comparable to hand-tuned and highly optimized designs, at a very small fraction of the development time.

Chapter 7

Related Work

In this chapter we describe related work in the area of external memory accesses for FPGA-based architectures, both in providing direct support for address generation and in the scheduling of memory operations. We also describe previous efforts in the optimization of memory operations in high-level synthesis, as well as work on decoupled architectures for traditional processors.

7.1 Memory Access Optimizations in High-Level Synthesis

External and internal memory access optimization has been an important research topic in high-level synthesis because of its impact on overall design performance. While internal memory operations have been successfully integrated into current synthesis flows, external memory operations expose several implementation issues in this integration, namely varying latency, control signals, and programmable address/data bus widths.

Ly et al. introduce the notion of "behavioral templates" to resolve timing constraints, multi-cycle operations, and hierarchical CDFG (control data flow graph) descriptions [42]. Since then, various commercial synthesis tools have supported the incorporation of interface-related operations using this scheme. This approach is limited to templates described in the form of a CDFG with fixed scheduling constraints. Hence, an RTL-level module library design cannot be reused for design synthesis. Memory-specific optimizations, such as pipelining, pre-fetching, or specialized address generation optimizations, are difficult to apply as well.

Schmit proposes a high-level synthesis solution for memory optimizations for application-specific architectures [59].
The goals of this research include memory binding for arrays, fast address generation, memory design optimizations, and memory access controls for different target memory modules. The automation of multiple array mappings and optimizations is the main advantage over previous HLS tools. However, their approach requires user specifications or library support for memory binding. For external (off-chip) memory accesses, their multiple array mapping and fast address generation (array grouping) are only partially applicable, as their memory optimizations focus on minimizing the resource usage of the memory module implementations.

Panda et al. propose models for well-known operation modes of off-chip memories, incorporating them into the HLS of applications [53]. They provide a mechanism to achieve an "efficient" scheduling of memory access modes, mainly targeting DRAMs. By separating the row-address-strobe (RAS) and column-address-strobe (CAS) control signals, their synthesis can eliminate a duplicated RAS strobe whenever identical RAS streams are generated between consecutive external memory accesses - page mode accesses. They also provide an optimization phase that reorders external memory accesses to reduce costly non-page-mode accesses. Their results reveal 25-60% reductions in execution cycles compared to traditional HLS approaches. There are clear distinctions between their research and ours. Because of the tight integration with the memory controls, their approach retains most of the drawbacks of behavioral synthesis tools identified in this research when targeting FPGA-based systems - the lack of support for physical clock differences between the FPGA chips and the memory bus, and for implementations of multiple IP cores that share a single external memory.

Miranda et al. propose "ADOPT", an address generation optimization scheme [46]. Their goal is to minimize the area overhead introduced by the use of large numbers of customized address calculation units. They expanded their idea by sharing hardware resources for counter-based address generation [47], and by source-level transformations that "cluster" address streams to maximize sharing between multiple streams [48]. Their approach is similar to ours in the separation of memory access optimizations from data-path synthesis. Their goal of reducing the area overhead of address generation, however, is different from ours, which is the abstraction and automated implementation of memory access design modules.

Weinhardt et al. propose maximizing internal reuse, using internal shift registers, to reduce the memory access requests from the data-path implemented in the FPGA [69]. They also proposed an operation-scheduling and memory-access-parallelizing methodology to make full use of the on-chip memory components using a vectorizing execution technique. The main concern of our work is external memory accesses and the handling of possible concurrent accesses to the "off-chip shared resources," rather than parallelization for internal memory modules.

7.2 Hardware-Oriented Explicit Parallel Programming Languages

Gokhale et al. propose an abstract representation, "Stream C", which targets the programmability of FPGAs for stream-data applications [22][3].
There are clear distinctions between their research and ours. Because of the tight integration between memory control and the data-path, their approach retains most of the behavioral synthesis drawbacks identified in this research when targeting FPGA-based systems: no support for physical clock differences between the FPGA and the memory bus, and no support for multiple IP cores sharing a single external memory.

Miranda et al. propose ADOPT, an address generation optimization scheme [46]. Their goal is to minimize the area overhead introduced by the use of large numbers of customized address calculation units. They expand the idea by sharing hardware resources among counter-based address generators [47] and by source-level transformations that cluster address expressions to maximize sharing between multiple address streams [48]. Their approach is similar to ours in separating memory access optimizations from data-path synthesis. Their goal of reducing the area overhead of address generation, however, differs from ours, which is the abstraction and automated implementation of memory access design modules.

Weinhardt et al. propose maximizing internal data reuse, via internal shift registers, to reduce the number of memory requests issued by the data-path implemented in the FPGA [69]. They also propose operation scheduling and memory access parallelization to make full use of on-chip memory components using a vectorizing execution technique. The main concern of our work, in contrast, is external memory accesses and the handling of possible concurrent accesses to off-chip shared resources, rather than parallelization over internal memory modules.

7.2 Hardware-Oriented Explicit Parallel Programming Languages

Gokhale et al. propose an abstract representation, Streams-C, which targets the programmability of FPGAs for stream-data applications [22][3]. Their compiler includes an array allocation algorithm for multi-level memory subsystems to reduce total memory latency during execution [21]. They define a "Hardware Stream Library" to exploit hardware communication channels between multiple processes, or between processes and memory units, in an FPGA design. The library requires the designer (or programmer) to specify all parameters directly. In their system, designers must be aware of the details of the memory access scheme, including the scheduling between multiple streams, which must be implemented in the target architecture. In contrast, our research focuses on extracting the memory access abstraction through compile-time analysis, as well as on memory access optimizations.

The Match compiler automatically translates a subset of MATLAB programs directly to hardware designs for FPGA-based machines [4]. The compilation translates a MATLAB program into an intermediate representation that is directly mapped to a set of predefined structural VHDL modules such as multipliers and adders. The Match compilation system does not address memory access optimizations or application-specific memory access scheduling [25]. Instead, the scheduling of memory accesses in Match is tightly coupled with the data-path control FSM. As such, the Match compiler cannot achieve memory access pipelining, or interleaving, between multiple memory requests [26]. In contrast, our research exploits the notion of decoupled memory control in FPGA-based architectures to separate external memory control from the data-path design. We thereby combine the efforts already established in behavioral (high-level) synthesis tools with system-level synthesis in reconfigurable computing, using high-level language compiler techniques, and ultimately achieve memory access pipelining and interleaving between multiple memory requests, as the sketch below suggests.
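The fragment below sketches that decoupling in C: an address generation process runs ahead of the data-path, keeping a small queue of outstanding requests, so that memory latency overlaps with computation. The queue depth, the latency figure, and the cycle-level model are illustrative assumptions, not the actual DEFACTO or DMI implementation.

```c
#include <stdio.h>

#define QDEPTH   8   /* assumed request-queue depth     */
#define LATENCY  5   /* assumed memory latency (cycles) */
#define N       32   /* number of elements to fetch     */

/* Cycle-level model: the address generator issues one request per cycle
 * while the queue has room; each request completes LATENCY cycles after
 * issue; the data-path consumes at most one completed datum per cycle. */
int main(void) {
    int issue_time[N];
    int issued = 0, consumed = 0, in_flight = 0, cycle = 0;

    while (consumed < N) {
        /* Data-path side: consume the oldest request once its data is back. */
        if (consumed < issued &&
            cycle >= issue_time[consumed] + LATENCY) {
            consumed++;          /* one datum folded into the computation */
            in_flight--;
        }
        /* Decoupled address generator: run ahead, keeping the queue full. */
        if (issued < N && in_flight < QDEPTH) {
            issue_time[issued++] = cycle;
            in_flight++;
        }
        cycle++;
    }
    printf("pipelined:   %d cycles\n", cycle);
    printf("unpipelined: %d cycles\n", N * (LATENCY + 1));
    return 0;
}
```

With the queue deeper than the memory latency, the model completes the N fetches in roughly N plus the latency cycles, versus N times (latency + 1) cycles when each access must complete before the next is issued.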
7.3 Non-FPGA-Based Reconfigurable Architectures

The PipeRench architecture is geared solely to pipelined computations with a virtually unlimited number of stages, where each stage is mapped to a given virtual stripe [61]. Programmers rely on a compiler to map their applications, written in C, to the computational capabilities of each stripe. The compiler then generates a schedule of the virtual stripes and relies on hardware to swap in, on demand, the configuration for each stripe. PipeRench can be seen as a coarse-grained FPGA with less complex control. Each processing element (PE) contains an ALU, an internal register file, and LUTs, and all PEs are connected by global interconnect. The array of PEs can be configured as a set of N B-bit ALUs, where B is the bit width of each PE's ALU. The outputs of the configured ALUs pass through the interconnection network to the inputs of the PEs in the next stripe. The configurations are streams of control signals sent by the host processor.

The Raw architecture is an array of simple RISC-like pipelined processors. It exploits large amounts of instruction-level parallelism and operation pipelining, thereby achieving high performance [1]. Each ALU has its own instruction cache and data cache, and the bit width of each ALU is fixed. The interconnection network between the individual small-bit ALU tiles is the only configurable device in this architecture. According to a static, compiler-driven communication schedule, and through the resulting configuration of the interconnect, a given Raw architecture can be configured into an application-specific hardware engine. Raw achieves substantial data throughput using a large number of processing elements and fast, fixed interconnection networks. Compared to the Raw architecture and its compiler support, our approach can provide a cost-effective solution in some application domains, since we use general reconfigurable hardware (FPGAs) and commercial synthesis tools.

The Garp architecture integrates a MIPS core with reconfigurable hardware used as an accelerator in a co-processor integration model [33]. Garp is a hybrid of a general-purpose processor and tightly coupled FPGA arrays. For a given FPGA array configuration, the host (general-purpose) processor sends data and synchronous control signals. The reconfigurable hardware consists of a two-dimensional array of configurable logic blocks (CLBs) interconnected by programmable wiring. Each array has a section dedicated to memory interfacing and control, and uses a fixed clock. The reconfigurable array has direct access to memory for fetching either data or configurations, hence avoiding both data and reconfiguration bottlenecks. This approach relieves many of the interface design requirements for using the FPGA array. For the same reason, however, it limits the programmability of the FPGA, because the FPGA interface must match the control of the host processor. For applications that exploit coarse-grained parallelism, which we target, there is less opportunity to achieve high performance on this architecture.

RaPiD is a general coarse-grained reconfigurable architecture that allows the user to construct custom application-specific architectures in a run-time-configurable way [18]. Configurations are built from coarse-grained computing elements, such as adders and multipliers, connected through a flexible communication bus. These units are organized linearly along the bus and communicate through registers in a pipelined fashion. The reconfigurable computing engine functions as a set of special arithmetic units acting as co-processors, achieving high performance through large amounts of parallelism, similarly to a superscalar processor. The RaPiD compiler must generate micro-coded instruction streams to control the coarse-grained computing elements [19].

The XPP reconfigurable architecture consists of a two-dimensional mesh of processing elements (PEs) and memory elements (MEs), interconnected by segmented buses of fixed width for data and synchronization. The communication between PEs is accomplished via a handshaking protocol, despite their internally synchronous execution. A PE can implement simple arithmetic operations and comparisons, and can retain values in counters to conditionally generate data to be consumed by other PEs in a data-flow style of execution [11]. Their research efforts include the compilation to this reconfigurable architecture from a high-level language, and temporal partitioning when the program is too large to fit on the target architecture. They assume a data-driven execution model for accesses to on-chip RAM modules. Their scheme for accessing a shared on-chip memory module differs from ours in that it supports a fixed number of distinct requests, while our approach can handle a flexible number of channels.
7.4 System-Level Descriptions Using High-Level Languages

The goal of the Cameron project is to make FPGAs and other reconfigurable computing machines available to a wider range of application programmers by raising the abstraction level from hardware circuits to software algorithms [27]. They have developed a variant of the C programming language, SA-C, and an optimizing compiler that maps high-level programs directly onto FPGAs, focusing on image processing (and other) applications [28][8]. The main difference from our research is that SA-C admits the user's low-level hardware specifications in its own syntax, while our approach uses the original C code as the algorithm description. Their external memory access scheme for array variables, called the element generator, separates memory-related issues from data-path design, similarly to our approach. SA-C relies on the programmer's specification and on compile-time analysis to select an application-specific element generator from a predefined set of implementations. However, it does not allow different streams of array accesses to a single external memory. In our approach, the address generation unit can generate multiple streams of data, as the sketch below illustrates.
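The fragment below sketches such a multi-stream address generation unit in C: one counter per affine stream and a round-robin arbiter that interleaves the streams onto the single external memory channel. The two-stream setup, the affine base/stride form, and the arbitration policy are assumptions made for this example rather than the literal DMI address generator.

```c
#include <stdio.h>

/* One affine access stream: addr = base + i * stride, for count elements.
 * This mirrors the compile-time-known image-processing access patterns
 * the address generation unit targets. */
typedef struct {
    int base, stride, count, i;
} stream_t;

/* Round-robin arbiter: each call emits the next address of one live
 * stream onto the single external memory channel.  Returns 0 when all
 * streams are exhausted. */
static int next_address(stream_t *s, int nstreams, int *addr) {
    static int turn = 0;
    for (int tried = 0; tried < nstreams; tried++) {
        stream_t *cur = &s[(turn + tried) % nstreams];
        if (cur->i < cur->count) {
            *addr = cur->base + cur->i * cur->stride;
            cur->i++;
            turn = (turn + tried + 1) % nstreams;
            return 1;
        }
    }
    return 0;
}

int main(void) {
    /* Two streams sharing one memory: a unit-stride read of array A
     * and a column walk (stride 8) of array B. */
    stream_t streams[2] = {
        { .base = 0,   .stride = 1, .count = 4 },  /* A[0..3]        */
        { .base = 100, .stride = 8, .count = 4 },  /* B[0], B[8], .. */
    };
    int addr;
    while (next_address(streams, 2, &addr))
        printf("issue address %d\n", addr);
    return 0;
}
```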
Other researchers have focused on the development of new high-level languages geared towards hardware synthesis (e.g., SpecC [16] and JHDL™ [37]). Overall, these approaches offer the programmer a CSP-like execution model with a programming syntax close to popular imperative languages such as C or Java. Typically, they also offer the benefits of an integrated hardware/software co-simulation environment. The main difference between these efforts and ours is that we use the original C syntax rather than a variant high-level language. We apply decoupled memory control to the FPGA-based architecture to remove the issues related to external memory control from the designer's concern. In doing so, we combine the efforts already established in behavioral (high-level) synthesis tools with system-level synthesis in reconfigurable computing, using high-level language compiler techniques, which ultimately achieves faster design realization for ordinary programmers than the approaches above.

7.5 Domain-Specific Efforts

Bhattacharyya et al. developed a framework for the scheduling of memory accesses in the context of synchronous data flow graphs [7]. Its main goal is to determine a good schedule that minimizes the communication buffer requirements across multiple tasks. This research is similar to ours in that it handles communication between multiple tasks and schedules the usage of a shared resource: the communication buffer in their case, and the shared external memory in ours.

For FPGA implementations of data search and comparison kernels, other researchers have also recognized the value of using FPGAs to accelerate relational database operations. Jean et al. sketch an example of an application using a mix of image and text elements [36], but they do not present an overall implementation of the translation of SQL queries to FPGAs, their experimental evaluation methodology is limited, and they do not address the structure of their hardware implementation.

Chapter 8

Conclusion

The dramatic increase in the capacity of Field-Programmable Gate Arrays (FPGAs) has enabled their use as domain-specific, reconfigurable computing engines. The performance of FPGA-based computing platforms depends critically on how efficiently basic and advanced external memory access modes are exploited. In this research we proposed a decoupled memory interface (DMI) approach to the issues that arise in the behavioral synthesis of FPGA designs in which multiple design processes share access to external memories. The proposed approach enables the integration of multiple behavioral designs and thus retains the advantages of design modularity and behavioral synthesis. The DMI allows designers to exploit advanced memory access modes without being concerned with the vagaries of the actual memory physical signaling. The design and methodology choices in the proposed DMI also support compiler approaches for fast system-level design space exploration that take into account the external memory interfaces not directly supported by current behavioral synthesis tools. We provide a fully parameterized DMI generation library; the effort required to target a different external memory module is minimal.

The preliminary results of applying the proposed DMI to a limited set of image processing kernels reveal that we are able to significantly increase design productivity. We also observed that the loop transformations enabled by the interface improve the ability to perform fast design space exploration, and that the added complexity of the resulting designs is tolerable. By expanding the protocols and application-specific optimizations, we can broaden our target application domains and reduce the performance gap to manual, highly tuned designs. We also include features that allow for irregular external data access patterns, thereby extending the applicability of the proposed interface to computations such as pointer tracing, traditionally not viewed as a good match for FPGAs; the sketch below shows why such patterns resist pre-fetching.
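As a closing illustration, and only as a minimal sketch under our own assumptions rather than code from the DMI library, the loop below chases a linked structure: each address depends on the datum just fetched, so requests cannot be precomputed, pipelined, or pre-fetched, and must be issued one at a time over the handshaked channel.

```c
#include <stdio.h>

/* A linked list laid out in "external memory": next_index[i] holds the
 * index of the node that follows node i, and -1 terminates the list.
 * Because the next address is only known after the current fetch
 * returns, an address generator cannot run ahead of the data-path. */
static int next_index[8] = {5, -1, 4, 2, 1, 3, 0, 6};
static int value[8]      = {7, 11, 13, 17, 19, 23, 29, 31};

int main(void) {
    int node = 6, sum = 0, hops = 0;
    while (node != -1) {               /* one dependent fetch per hop */
        sum  += value[node];
        node  = next_index[node];      /* address depends on the data */
        hops++;
    }
    printf("visited %d nodes, sum = %d\n", hops, sum);
    return 0;
}
```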
We have successfully integrated the design approach presented in this thesis with a compilation and synthesis tool, the DEFACTO compiler, which combines behavioral synthesis with structural synthesis of VHDL designs. The experimental results for a limited set of image processing applications reveal that the design choices of the proposed interface lead to designs that exhibit performance comparable to that of manual designs, attained at a very small fraction of the design time. The automatically generated designs execute correctly on a real target reconfigurable device, a Xilinx® Virtex™ part.

List of References

[1] A. Agarwal, S. Amarasinghe, R. Barua, M. Frank, W. Lee, V. Sarkar, D. Srikrishna, M. Taylor, "The Raw Compiler Project", In Proc. of the Second SUIF Compiler Workshop, Stanford, CA, August 21-23, 1997.

[2] A. Aho, R. Sethi, J. Ullman, "Compilers: Principles, Techniques and Tools", Addison-Wesley, 1986.

[3] M. Alexander, J. Cohoon, J. Ganley, G. Robins, "Performance-Oriented Placement and Routing for Field-Programmable Gate Arrays", In Proc. of the European Design Automation Conference, 1995.

[4] P. Banerjee, N. Shenoy, A. Choudhary, S. Hauck, M. Haldar, P. Joisha, A. Jones, A. Kanhare, A. Nayak, S. Periyacheri, M. Walkden, and D. Zaretsky, "MATLAB Compiler for Distributed Heterogeneous Reconfigurable Computing Systems", In Proc. of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), IEEE Computer Society Press, Los Alamitos, CA, Apr. 2000, pp. 39-48.

[5] F. Barat and R. Lauwereins, "Reconfigurable Instruction Set Processors: A Survey", In Proc. of the 11th IEEE International Workshop on Rapid System Prototyping (RSP), IEEE Computer Society Press, Los Alamitos, CA, 2000.

[6] M. Barr, "A Reconfigurable Computing Primer", Multimedia Systems Design, Sep. 1998, pp. 44-47.

[7] S. Bhattacharyya, J. Buck, S. Ha, and E. Lee, "Generating compact code from dataflow specifications of multirate signal processing algorithms", IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, Vol. 42, No. 3, pp. 138-150, Mar. 1995.

[8] A. Bohm, B. Draper, W. Najjar, J. Hammes, R. Rinker, M. Chawathe, C. Ross, "One-Step Compilation of Image Processing Algorithms to FPGAs", In Proc. of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2001), IEEE Computer Society Press, Los Alamitos, CA, 2001.

[9] S. Brown, J. Rose, Z. Vranesic, "A Detailed Router for Field-Programmable Gate Arrays", IEEE Transactions on Computer-Aided Design, Vol. 11, No. 5, pp. 620-628, May 1992.

[10] S. Carr and K. Kennedy, "Scalar Replacement in the Presence of Conditional Control Flow", Software: Practice & Experience, Vol. 24, No. 1, January 1994.

[11] J. Cardoso and M. Weinhardt, "XPP-VC: A C Compiler with Temporal Partitioning for the PACT-XPP Architecture", In Proc. of the 12th International Conference on Field Programmable Logic and Applications (FPL'02), Sep. 2002.

[12] A. DeHon, "Reconfigurable Architectures for General-Purpose Computing", Doctoral Dissertation, AI Technical Report 1586, MIT Artificial Intelligence Laboratory, 545 Technology Sq., Cambridge, MA 02139, Sep. 1996.

[13] S. Derrien and S. Rajopadhye, "FCCMs and the Memory Wall", In Proc. of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'00), IEEE Computer Society Press, Los Alamitos, CA, Oct. 2000.

[14] P. Diniz and J. Park, "Automatic Synthesis of Data Storage and Control Structures for FPGA-based Computing Machines", In Proc. of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'00), IEEE Computer Society Press, Los Alamitos, CA, pp. 91-100, Oct. 2000.

[15] P. Diniz, M. Hall, J. Park, B. So and H. Ziegler, "Bridging the Gap between Compilation and Synthesis in the DEFACTO System", In Proc. of the 14th Workshop on Languages and Compilers for Parallel Computing (LCPC), published as Lecture Notes in Computer Science (LNCS), Vol. 2624, Springer, Berlin, 2003.

[16] R. Domer, A. Gerstlauer, D. Gajski, "SpecC Language Reference Manual, Version 2.0", Dec. 2002. http://specc.org

[17] DSP56000 24-bit Digital Signal Processor Family Manual, Motorola, Inc., 6501 William Cannon Drive West, Austin, TX 78735-8598. http://e-www.motorola.com/files/dsp/doc/inactive/DSP56000UM.pdf
[18] C. Ebeling, D. Cronquist, P. Franklin, "RaPiD: Reconfigurable Pipelined Datapath", In Proc. of the 6th International Workshop on Field-Programmable Logic and Applications, 1996.

[19] C. Ebeling, D. Cronquist, P. Franklin, J. Secosky, and S. Berg, "Mapping Applications to the RaPiD Configurable Architecture", In Proc. of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'97), IEEE Computer Society Press, Los Alamitos, CA, Oct. 1997.

[20] J. Elliott, "Understanding Behavioral Synthesis: A Practical Guide to High-Level Design", Kluwer Academic Publishers, January 2000.

[21] M. Gokhale and J. Stone, "Automatic Allocation of Arrays to Memories in FPGA Processors with Multiple Memory Banks", In Proc. of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'99), IEEE Computer Society Press, Los Alamitos, CA, Oct. 1999.

[22] M. Gokhale, J. Stone, J. Arnold and M. Kalinowski, "Stream-Oriented FPGA Computing in the Streams-C High Level Language", In Proc. of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'00), IEEE Computer Society Press, Los Alamitos, CA, Apr. 2000, pp. 49-58.

[23] J. Goodman, J. Hsieh, K. Liou, A. Pleszkun, P. Schechter, and H. Young, "PIPE: A VLSI Decoupled Architecture", In Proc. of the 12th International Symposium on Computer Architecture (ISCA), ACM Press, Jun. 1985.

[24] S. Gupta, M. Miranda, F. Catthoor, and R. Gupta, "Analysis of High-level Address Code Transformations for Programmable Processors", In Proc. of the ACM Design Automation and Test in Europe Conference, pp. 9-13, Mar. 2000.

[25] M. Haldar, A. Nayak, N. Shenoy, A. Choudhary, and P. Banerjee, "FPGA Hardware Synthesis from MATLAB", In Proc. of the 14th International Conference on VLSI Design, pp. 299-304, India, January 2001.

[26] M. Haldar, A. Nayak, N. Shenoy, A. Choudhary, P. Banerjee, "Automated Synthesis of Pipelined Designs on FPGAs for Signal and Image Processing Applications Described in MATLAB", In Proc. of the Asia Pacific DAC, pp. 645-648, Jan. 2001.

[27] J. Hammes, R. Rinker, W. Bohm, W. Najjar, B. Draper, R. Beveridge, "Cameron: High-level Language Compilation for Reconfigurable Systems", In Proc. of the International Conference on Parallel Architectures and Compilation Techniques (PACT), Newport Beach, CA, Oct. 12-16, 1999.

[28] J. Hammes, B. Draper, W. Bohm, "Sassy: A Language and Optimizing Compiler for Image Processing on Reconfigurable Computing Systems", In Proc. of the International Conference on Vision Systems, Las Palmas de Gran Canaria, Spain, Jan. 11-13, 1999.

[29] S. Hauck, "Multi-FPGA Systems", Doctoral Dissertation, University of Washington, 1995.

[30] S. Hauck and A. Agarwal, "Software Technology for Reconfigurable Systems", Northwestern University, Dept. of ECE Technical Report, 1996.

[31] S. Hauck, "The Role of FPGAs in Reprogrammable Systems", In Proc. of the IEEE, Vol. 86, No. 4, pp. 615-638, April 1998.

[32] S. Hauck, "The Future of Reconfigurable Systems", Keynote Address, 5th Canadian Conference on Field Programmable Devices, Montreal, June 1998.

[33] J. Hauser and J. Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor", In Proc. of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'97), IEEE Computer Society Press, Los Alamitos, CA, pp. 24-33, April 1997.
[34] C. Hoare, "Communicating Sequential Processes", Prentice-Hall International, 1985.

[35] J. Jacob and P. Chow, "Memory Interfacing and Instruction Specification for Reconfigurable Processors", In Proc. of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), IEEE Computer Society Press, Los Alamitos, CA, pp. 145-154, April 1999.

[36] J. Jean, G. Dong, H. Zhang, X. Guo and B. Zhang, "Query Processing with an FPGA Coprocessor Board", In Proc. of the 1st International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas, NV, 2001.

[37] JHDL™ Development Release Documentation, Aug. 2001. http://www.jhdl.org/

[38] B. Keeth, R. Baker, "DRAM Circuit Design: A Tutorial", IEEE Press, New York, 1999.

[39] D. Knapp, "Behavioral Synthesis: Digital System Design Using the Synopsys Behavioral Compiler", Prentice Hall, June 1996.

[40] H. Krupnova, G. Saucier, "FPGA Technology Snapshot: Current Devices and Design Tools", In Proc. of the International Workshop on Rapid System Prototyping, 2000.

[41] G. Lemieux, S. Brown and Z. Vranesic, "On Two-Step Routing for FPGAs", In Proc. of the International Symposium on Physical Design, Napa, CA, Apr., pp. 60-66.

[42] T. Ly, D. Knapp, R. Miller, and D. MacMillen, "Scheduling using Behavioral Templates", In Proc. of the 32nd ACM/IEEE Design Automation Conference (DAC), June 1995.

[43] M. McFarland, A. Parker, and R. Camposano, "The High-Level Synthesis of Digital Systems", In Proc. of the IEEE, Vol. 78, No. 2, Feb. 1990.

[44] P. Michel, U. Lauther, P. Duzy, "The Synthesis Approach to Digital System Design", Kluwer Academic Publishers, Boston, 1992.

[45] M. Miranda, F. Catthoor, H. De Man, J. Rabaey, P. Chau, and J. Eldon, "Address Equation Optimization and Hardware Sharing for Real-Time Signal Processing Applications", VLSI Signal Processing VII, pp. 208-217, IEEE Press, New York, 1994.

[46] M. Miranda, F. Catthoor, M. Janssen, H. De Man, "ADOPT: Efficient Hardware Address Generation in Distributed Memory Architectures", In Proc. of the 9th IEEE International Symposium on System Synthesis, pp. 20-25, IEEE Computer Society Press, Los Alamitos, CA, Nov. 1996.

[47] M. Miranda, M. Kaspar, F. Catthoor, and H. De Man, "Architectural Exploration and Optimization for Counter Based Hardware Address Generation", In Proc. of the 8th ACM/IEEE European Design and Test Conference, pp. 293-298, 1997.

[48] M. Miranda, F. Catthoor, and H. De Man, "High-Level Optimization and Synthesis Techniques for Data-Transfer-Intensive Applications", IEEE Transactions on VLSI Systems, Vol. 6, No. 4, Dec. 1998.

[49] P. Moisset, J. Park and P. Diniz, "Very High-Level Synthesis of Control and Datapath Structures for Reconfigurable Logic Devices", In Proc. of the 2nd International Workshop on Compiler and Architecture Support for Embedded Systems (CASES'99), Washington, D.C., October 1999.

[50] P. Moisset, P. Diniz and J. Park, "Matching and Searching Analysis for Parallel Implementation on FPGAs", In Proc. of the 9th ACM International Symposium on FPGAs (FPGA'01), pp. 125-131, ACM Press, New York, Feb. 2001.

[51] Monet™ User's and Reference Manual, Software Release R42, Mentor Graphics Inc., 1999.
[52] P. Panda, N. Dutt and A. Nicolau, "Exploiting Off-Chip Memory Access Modes in High-Level Synthesis", In Proc. of the 1997 IEEE/ACM International Conference on Computer-Aided Design, pp. 333-340, 1997.

[53] P. Panda, "Memory Optimizations and Exploration for Embedded Systems", Doctoral Dissertation, University of California, Irvine, January 1998.

[54] P. Panda, F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandercappelle, P. G. Kjeldsberg, "Data and Memory Optimization Techniques for Embedded Systems", ACM Transactions on Design Automation of Electronic Systems, pp. 149-206, Apr. 2001.

[55] J. Park and P. Diniz, "Synthesis of Pipelined Memory Access Controllers for Streamed Data Applications on FPGA-based Computing Engines", In Proc. of the International Symposium on System Synthesis, ACM Press, New York, Sep. 2001.

[56] J. Park and P. Diniz, "Synthesis and Estimation of Memory Interfaces for FPGA-based Reconfigurable Computing Engines", In Proc. of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), IEEE Computer Society Press, Los Alamitos, CA, 2003.

[57] W. Ro, "Decoupled Memory Architecture with Speculative Pre-Execution", Doctoral Dissertation, University of Southern California, May 2004.

[58] A. Sangiovanni-Vincentelli, A. El Gamal, J. Rose, "Synthesis Methods for Field Programmable Gate Arrays", In Proc. of the IEEE, Vol. 81, No. 7, 1993, pp. 1057-1083.

[59] H. Schmit, "Synthesis of Application-Specific Memory Structures", Doctoral Dissertation, Carnegie Mellon University, Nov. 1995.

[60] H. Schmit and D. Thomas, "Address Generation for Memories Containing Multiple Arrays", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 17, No. 5, May 1998.

[61] H. Schmit, D. Whelihan, A. Tsai, M. Moe, B. Levine and R. Taylor, "PipeRench: A Virtualized Programmable Datapath in 0.18 Micron Technology", In Proc. of the 2002 IEEE Custom Integrated Circuits Conference (CICC).

[62] K. Shayee, J. Park, and P. Diniz, "Performance and Area Modeling of Complete FPGA Designs with Loop Transformations", In Proc. of the 13th International Conference on Field Programmable Logic and Applications (FPL'03), Springer-Verlag, Berlin, 2003.

[63] D. Smith, "HDL Chip Design: A Practical Guide for Designing, Synthesizing, and Simulating ASICs and FPGAs using VHDL or Verilog", Doone Publications, Madison, AL, USA, 1996.

[64] B. So and M. Hall, "Increasing the Applicability of Scalar Replacement", In Proc. of the ACM Conference on Compiler Construction (CC'04), 2004.

[65] Stratix II Device Handbook, Altera®, 101 Innovation Drive, San Jose, CA 95134. http://www.altera.com/literature/hb/stx2/stratix2_handbook.pdf

[66] "The Stanford SUIF Compilation System", version 1.1.2. Public domain software and documentation available at http://suif.stanford.edu.

[67] D. Thomas, J. Adams and H. Schmit, "A Model and Methodology for Hardware-Software Codesign", IEEE Design & Test of Computers, Vol. 10, No. 3, Sep. 1993, pp. 6-15.

[68] J. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. Touati, and P. Boucard, "Programmable Active Memories: Reconfigurable Systems Come of Age", IEEE Transactions on VLSI Systems, Vol. 4, No. 1, pp. 56-69, Mar. 1996.
[69] M. Weinhardt and W. Luk, "Memory Access Optimization and RAM Inference for Pipeline Vectorization", In Proc. of the 9th International Symposium on Field Programmable Logic (FPL'99), Springer-Verlag, 1999.

[70] WildStar™ Reference Manual, revision 4.0, Annapolis MicroSystems Inc., 1999.

[71] Virtex 2.5V Field Programmable Gate Arrays Product Specification, Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95214, DS003-1 (v2.5), 2000.

[72] Virtex-II 1.5V FPGA Complete Data Sheet, Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95214, DS031 (v1.7), 2001.

[73] Virtex-II Pro X FPGA Complete Data Sheet, Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95214, DS110-2 (v1.1), 2004.