AN EFFICIENT DESIGN SPACE EXPLORATION FOR BALANCE BETWEEN COMPUTATION AND MEMORY

by

Byoungro So

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2003

Copyright 2003 Byoungro So

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90089-1695

This dissertation, written by BYOUNGRO SO under the direction of his dissertation committee, and approved by all its members, has been presented to and accepted by the Director of Graduate and Professional Programs, in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY.

Date: December 17, 2003

Dedication

This dissertation is gratefully dedicated to

• My wife, Sooyeon Kim, without whose love, encouragement, support, and patience, not a word would have been written;
• My son, Joshua, and my daughter, Iris, for their love and for relieving my stress;
• My father, Sunyoung, for motivating my graduate study and for his support;
• My mother, Heeja, for teaching me about tolerance.

Acknowledgements

I would like to thank my advisor, Dr. Mary W. Hall, for her guidance, enthusiasm, encouragement, and support, which made the completion of this dissertation possible. I have learned numerous invaluable things from her while studying under her guidance. I was fortunate to have Dr. Viktor Prasanna and Dr. Timothy Pinkston as members of my dissertation committee. I thank them for their valuable comments and advice on my research.

I also would like to thank my colleagues at the USC Information Sciences Institute for their advice and friendship. I feel very fortunate to have worked with Dr. Pedro Diniz and Dr. Jacqueline Chame, both of whom provided me with my initial guidance on data reuse. My officemate Heidi Ziegler was friendly and insightful, and helped me a great deal with my technical writing. Chapter 5 is a challenging and important part of my thesis, and she contributed to implementing its initial framework.
Joonseok Park provided me with his expertise on VHDL and FPGA-related tools. Yoon-Ju Lee contributed to implementing loop unroll-and-jam. I also had the privilege of working with Jaewook Shin, Chun Chen, and Shesha Raghunathan, who were great fun to work with. Finally, I would like to thank my family a million times for supporting and encouraging me during my graduate years.

Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract

1 INTRODUCTION
  1.1 Background
    1.1.1 Target Application Domain
    1.1.2 Target Architecture
    1.1.3 Behavioral Synthesis vs. Parallelizing Compilers
  1.2 Overview of Our Approach
    1.2.1 Analyses and Transformations
    1.2.2 Guiding Metrics
    1.2.3 Summary
  1.3 Example
  1.4 Contributions
  1.5 Organization of Dissertation

2 DATA DEPENDENCE AND REUSE ANALYSES
  2.1 Dependence Analysis
  2.2 Data Reuse Analysis

3 AUTOMATIC DESIGN SPACE EXPLORATION
  3.1 Definitions
  3.2 Assumptions
  3.3 Interaction between Compiler and Behavioral Synthesis
  3.4 Guiding Metrics
    3.4.1 Chip Space Usage
    3.4.2 Balance
    3.4.3 Efficiency
    3.4.4 Saturation Point
  3.5 Monotonic Search Space Properties
    3.5.1 Unroll Factor vs. Data Fetch Rate
    3.5.2 Unroll Factor vs. Data Consumption Rate
    3.5.3 Unroll Factor vs. Balance
    3.5.4 Unroll Factor vs. Design Size
  3.6 Optimization Algorithm
    3.6.1 Data Fetch Rate vs. Data Consumption Rate
    3.6.2 Computing the Best U
    3.6.3 Computing Individual Unroll Factors
  3.7 Summary

4 SCALAR REPLACEMENT
  4.1 Definitions
  4.2 Carr's Approach
    4.2.1 V_O
    4.2.2 V_RC
    4.2.3 V_I
    4.2.4 Summary
  4.3 Extending Scalar Replacement
    4.3.1 Computing R and M for Each Reuse Chain
      4.3.1.1 C_O
      4.3.1.2 C_C
      4.3.1.3 C_M
      4.3.1.4 C_IC
    4.3.2 Summary
    4.3.3 Generalizing the Algorithm
  4.4 Scalar Replacement Transformation
    4.4.1 Register Shift/Rotation
    4.4.2 Loop Peeling and Loop-Invariant Code Motion
  4.5 Control Flow
  4.6 Summary

5 CUSTOM DATA LAYOUT
  5.1 Motivation
  5.2 Overview
  5.3 Virtual Mapping
    5.3.1 Analyzing Data Access Patterns
    5.3.2 Partitioning Array References
    5.3.3 Array Renaming
      5.3.3.1 Single Dimension
      5.3.3.2 Multiple Dimensions
      5.3.3.3 Array Renaming Algorithm
    5.3.4 Examples
    5.3.5 Array Reorganization
  5.4 Physical Memory Mapping
  5.5 Global Optimization Problem
  5.6 Incorporating Conventional Layout Schemes
  5.7 Summary

6 EXPERIMENTS
  6.1 Methodology
  6.2 Scalar Replacement
  6.3 Custom Data Layout
    6.3.1 Memory Access Times
    6.3.2 Speedups
  6.4 Automatic Design Space Exploration
  6.5 Accuracy of Estimates
    6.5.1 Deriving Area and Speed Metrics
    6.5.2 Results
  6.6 Summary

7 RELATED WORK
  7.1 Synthesizing High-Level Constructs
  7.2 Design Space Exploration
  7.3 Data Reuse
  7.4 Data Layout
  7.5 Discussion

8 CONCLUSION
  8.1 Contributions
    8.1.1 Automatic Design Space Exploration
    8.1.2 Scalar Replacement
    8.1.3 Custom Data Layout
  8.2 Future Work

Reference List

Appendix A  Behavioral VHDL Output
Appendix B  Computing Best Unroll Factors in 2-deep Loop Nests
Appendix C  Extension to n-deep Loop Nests

List Of Tables

1.1 Comparison of Behavioral Synthesis and Parallelizing Compiler
3.1 Optimal unroll factors for different dependence vectors
4.1 Array reference categories in Carr's approach
4.2 Comparison of reuse chain categories
5.1 Examples of virtual mapping in a single dimension
5.2 Examples of virtual mapping in two dimensions
6.1 Problem size (iteration count)
6.2 Number of registers
6.3 FPGA space usage (Kilo-slices)
6.4 Overall speedup on a single FPGA

List Of Figures

1.1 Comparison of design flow between conventional and our approaches
1.2 Annapolis' WildStar/PCI board
1.3 Code transformation example
1.4 Code transformation example, continued
2.1 An example code and its dependence and reuse graphs
3.1 Interaction between compiler and behavioral synthesis
3.2 The effect of the increase in unroll factors
3.3 Algorithm for Design Space Exploration
3.4 Difference in execution time for the same U = 8
4.1 An example of Carr's scalar replacement
4.2 Steps of Carr's scalar replacement
4.3 Final output of Carr's scalar replacement
4.4 Steps of our scalar replacement
4.5 Memory/register access behaviors of full and partial data reuse
4.6 An example of C_IC with D = (1, *, 2)
4.7 An example of our scalar replacement
4.8 Loop peeling example
4.9 Issues on control flow in scalar replacement
4.10 Handling conditional generators/finalizers
5.1 An example of custom data layout
5.2 Comparison of three data layouts
5.3 Steps of custom data layout
5.4 Array reference partitioning algorithm
5.5 Array renaming algorithm
5.6 An example of data access order and possible physical mappings
6.1 Application kernels
6.2 Experimental design flow
6.3 Number of memory accesses
6.4 Speedups of scalar replacement
6.5 JAC, memory access times vs. unroll factors
6.6 SOBEL, memory access times vs. unroll factors
6.7 FIR, memory access times vs. unroll factors
6.8 PAT, memory access times vs. unroll factors
6.9 MM, memory access times vs. unroll amounts
6.10 JAC, speedup
6.11 SOBEL, speedup
6.12 FIR, speedup
6.13 PAT, speedup
6.14 MM, speedups
6.15 FIR kernel
6.16 Matrix Multiply kernel
6.17 Jacobi kernel
6.18 Pattern kernel
6.19 Sobel kernel
6.20 Estimated performance
6.21 Achieved performance
6.22 25MHz Time vs. Space
6.23 40MHz Time vs. Space
6.24 25MHz Ratio vs. Space
6.25 40MHz Ratio vs. Space
B.1 Approximate execution time based on the data dependence
C.1 Plots for U and E

Abstract

This dissertation describes an automated system that maps an application written in a high-level language to an FPGA. The current practice of FPGA mapping requires designers to manually and iteratively apply loop transformations to negotiate the inherent space-time trade-offs. This iterative process is called design space exploration. This manual approach is not only error-prone and tedious, but also prohibitively expensive given the large search space and the current long synthesis times. The DEFACTO system automates design space exploration by combining hardware synthesis with parallelizing compiler technologies into a unified system. The compiler uses its high-level knowledge and several metrics to guide code transformations toward a good design. Further, to quantitatively evaluate alternate designs, we use synthesis estimation techniques that are much faster than fully synthesizing a design. Thus, this integration significantly raises the level of abstraction for hardware design and explores a design space much larger than is feasible for human designers.

The code transformations developed for design space exploration are motivated by the flexibility of FPGAs.
We can devote on-chip resources to multiple functional units to exploit more parallelism, or to on-chip data storage to keep data close and repeatedly reuse it. Further, the multiple external memory banks offer opportunities for parallel memory accesses to/from FPGAs. Therefore, the compiler optimizes designs using the following transformations. Scalar replacement replaces repeated accesses to an array element with a scalar variable, so that the element is accessed from a register rather than memory; our approach extends previous work to eliminate both redundant read and redundant write memory accesses across multiple loops in a nest. A novel transformation called custom data layout derives an application-specific data layout across multiple memories that facilitates parallel memory accesses. Unroll-and-jam replicates the loop body and jams the copies of the inner loop, and thus improves fine-grain parallelism. We have reduced design space exploration to the more tractable problem of optimal unroll factor selection.

The DEFACTO system derives an implementation that closely matches the best performance among those considered, and selects the smallest design among implementations with comparable performance. We search on average only 0.3% of the entire design space over 5 multimedia kernels.

Chapter 1

INTRODUCTION

Field Programmable Gate Arrays (FPGAs) are composed of thousands of small programmable logic cells dynamically interconnected to allow the implementation of any logic function. Their extreme flexibility and tremendous growth in device capacity have made them the medium of choice for fast hardware prototyping and a popular vehicle for the realization of custom computing machines. An advantage of FPGAs is that they can sometimes yield even faster solutions than conventional hardware, up to 2 orders of magnitude on encryption, as they can be tailored to the particular computational needs of a given application (e.g., template-based matching).

Despite the growing importance of FPGAs for application-specific designs, these devices are still difficult to program, making them inaccessible to the average developer. Figure 1.1(a) illustrates the current practice of developing FPGA designs. Developers hand-code the application in low-level hardware description languages such as Verilog or structural VHDL. FPGA logic synthesis is the term given to the process of translating functional logic specifications to a bitstream description that configures the device. Developers synthesize or implement the design in hardware using a wide variety of synthesis tools. Place-and-route binds the design components and interconnects to the chip resources. These low-level logic synthesis and place-and-route steps are prohibitively slow; one iteration of this process takes hours or even days.

[Figure 1.1: Comparison of design flow between conventional and our approaches. (a) Current practice: a design specification in low-level VHDL goes through logic synthesis and place-and-route, then validation/evaluation, followed by manual design modification, iterating until a correct, good design is reached.]

As synthesis tools perform only limited optimizations, developers must perform high-level and global optimizations by hand.
Further, because of the complexity of synthesis, it is difficult for developers to predict a priori the performance and space characteristics of the resulting design. Specifically, analytical estimation has no way to know the type and number of resources bound to the design in the scheduling and binding phases of synthesis tools. For this reason, developers engage in an iterative refinement cycle, at each step manually applying transformations, synthesizing the design, examining the results, and modifying the design to trade off performance and space. This repeated process is called design space exploration, encapsulated by the dotted line in Figure 1.1.

We believe the way to make programming of FPGA-based systems more accessible is to offer a high-level imperative programming paradigm, such as C, coupled with compiler technology oriented towards FPGA designs. In this way, developers retain the advantages of a simple programming model via the high-level language but rely on powerful compiler analyses and transformations to optimize the design as well as automate most of the tedious and error-prone mapping tasks. We make the observation that, for a class of FPGA applications characterized as highly parallel array-based computations (e.g., multimedia image and signal processing codes), many hand optimizations performed by developers are similar to transformations used in parallelizing compilers. For example, developers parallelize computations, optimize external memory accesses, explicitly manage storage, and perform loop transformations. For this reason, we argue that parallelizing compiler technology can be used to optimize FPGA designs.

[Figure 1.1, continued. (b) Our approach: an algorithm in C/Fortran and a set of constraints drive compiler optimizations in SUIF (dependence/reuse analysis, unroll-and-jam, scalar replacement, custom data layout) and unroll factor selection, followed by SUIF2VHDL translation and behavioral synthesis estimation; logic synthesis and place-and-route are performed only for the selected design.]

Recently, Behavioral VHDL has been developed to raise the level of abstraction of hardware descriptions. It abstracts many hardware details, and is thus closer to an algorithm description written in a high-level language. The process of taking a behavioral specification and generating a low-level description of a hardware implementation is called behavioral synthesis.

In this dissertation, we describe an automated approach to design space exploration, based on a collaboration between a parallelizing compiler and behavioral synthesis tools. This compiler algorithm effectively enables developers to explore a potentially large design space, which without automation would not be feasible. A behavioral synthesis tool can make use of the compiler's high-level knowledge in its implementation decisions. In turn, synthesis tools can be used to produce estimates of a design's space and performance, as they have much more specific information about hardware resources after partially synthesizing the design. Behavioral synthesis estimation delivers fast and reasonably accurate estimates, thereby facilitating design space exploration.
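To make this collaboration concrete, the sketch below shows the general shape of such an estimate-guided search: propose an unroll factor, ask the estimator for time and space, and stop when the space budget is exceeded or additional area stops paying off. This is a minimal illustration with a made-up cost model; the types and the estimate() function are hypothetical stand-ins for the SUIF passes and the behavioral synthesis estimator, not DEFACTO's actual algorithm (which is presented in Chapter 3).

    #include <stdio.h>

    typedef struct { int u; } Candidate;            /* unroll factor */
    typedef struct { double cycles, slices; } Est;  /* time and space estimate */

    /* Hypothetical stand-in for behavioral synthesis estimation: in this
     * toy model, more unrolling means fewer cycles but more slices. */
    static Est estimate(Candidate c) {
        Est e = { 1024.0 / c.u + 64.0, 500.0 * c.u };
        return e;
    }

    int main(void) {
        const double capacity = 12288.0;   /* chip capacity in slices */
        Candidate best = { 1 };
        Est best_e = estimate(best);
        for (int u = 2; u <= 32; u *= 2) { /* candidate unroll factors */
            Candidate c = { u };
            Est e = estimate(c);
            if (e.slices > capacity)       /* space constraint violated */
                break;
            /* Efficiency: performance gained per unit of space added. */
            double eff = (best_e.cycles - e.cycles) / (e.slices - best_e.slices);
            if (eff <= 0.01)               /* arbitrary cutoff for illustration */
                break;
            best = c;
            best_e = e;
        }
        printf("selected unroll factor %d: %.0f cycles, %.0f slices\n",
               best.u, best_e.cycles, best_e.slices);
        return 0;
    }

Note that the loop never runs logic synthesis or place-and-route; every decision is made against the fast estimates, which is what makes examining many candidates feasible.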
The base infrastructure of the DEFACTO compiler is the SUIF (Stanford University Intermediate Format) compiler system, a platform for research on compiler techniques for high-performance machines, such as scalar data flow optimizations, array data dependence analysis, loop transformations for both locality and parallelism, software prefetching, and instruction scheduling [57]. Figure 1.1(b) shows the steps of automatic application mapping in the DEFACTO compiler. The fundamental differences from the conventional approach illustrated in Figure 1.1(a) are as follows:

• our system takes an algorithm description written in a high-level language such as C or Fortran;
• our system performs design space exploration automatically and efficiently by combining parallelizing compiler and behavioral synthesis technologies;
• our approach avoids the time-consuming low-level phases, such as logic synthesis and place-and-route, as much as possible.

The DEFACTO compiler takes an algorithm description and a set of constraints that the resulting design must satisfy; e.g., the size of the design should be less than the FPGA capacity. Next, the compiler performs several code analyses and transformations, which will be explained later in this chapter. The optimized code is translated into a Behavioral VHDL description using the SUIF2VHDL tool. For this behavioral description, the compiler invokes Monet behavioral synthesis to obtain estimates of space usage and performance. The compiler tries several designs by changing the parameters of the code transformations until it finds the most appropriate design. Finally, once a design is selected, we perform logic synthesis and place-and-route to actually map the design to an FPGA.

The remainder of this chapter is organized as follows. In the next section, we describe background information. We give an overview of our approach in Section 1.2, and present the contributions of this dissertation in Section 1.4. Finally, we show the organization of this dissertation in Section 1.5.

1.1 Background

In this section, we present the characteristics of the target applications and target architectures for this work. Next, we compare the capabilities of behavioral synthesis and parallelizing compilers.

1.1.1 Target Application Domain

Because of their customizability, FPGAs are commonly used for applications that have significant amounts of fine-grain parallelism and possibly can benefit from non-standard numeric formats (e.g., reduced data widths). Specifically, multimedia applications, including image and signal processing on 8-bit and 16-bit data, respectively, offer a wide variety of popular applications that map well to FPGAs. For example, a typical image processing algorithm scans a multi-dimensional image and operates on a given pixel value and all its neighbors. Images are often represented as multi-dimensional array variables, and the computation is expressed as a loop nest. Such applications exhibit abundant concurrency as well as temporal reuse of data. Fortunately, such applications are also a good match for the capabilities of current parallelizing compiler analyses.
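As a concrete instance of this application class, the following standard C loop nest computes a 3x3 neighborhood average over an 8-bit image. The kernel is our own illustration, not one of the five kernels evaluated later; what matters is its shape: every array subscript is an affine function of the loop indices, the loop bounds are constants, and each pixel is read by up to nine different iterations, which is exactly the concurrency and temporal reuse these analyses exploit.

    #define N 64
    #define M 64

    /* Illustrative 3x3 smoothing filter: affine subscripts (i+di, j+dj),
     * constant loop bounds, 8-bit pixel data. */
    void smooth(unsigned char in[N][M], unsigned char out[N][M])
    {
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < M - 1; j++) {
                int sum = 0;
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        sum += in[i + di][j + dj]; /* each pixel is reused by
                                                      neighboring iterations */
                out[i][j] = (unsigned char)(sum / 9);
            }
    }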
We are mainly targeting loop nest computations on array and scalar variables (no pointers), where it is preferable for the array subscript expressions to be in the affine domain; i.e., the array subscript expressions are linear functions of loop index variables and constants. The loop bounds must be constant. Non-constant bounds could potentially be supported by the algorithm, but the generated code and resulting FPGA designs would be much more complex. For example, behavioral synthesis would transform a for loop with a non-constant bound into a while loop in the hardware implementation, and the optimizations synthesis tools provide for a while loop are relatively limited compared to a for loop. We also assume a uniform latency for all memory accesses.

We validate and evaluate the algorithms described in this dissertation on five multimedia kernel applications. Each application is written in standard C, where the computation is a single loop nest. There are no pragmas, annotations, or language extensions describing the hardware implementation.

1.1.2 Target Architecture

FPGAs offer a much faster time to market for time-critical applications and allow post-silicon, in-field modification for prototypical or low-volume designs where an Application Specific Integrated Circuit (ASIC) is not justified. FPGAs are implemented as (re)programmable seas-of-gates with distinct internal architectures. For example, the Xilinx Virtex family of devices consists of 12,288 device slices, where each slice in turn is composed of 2 look-up tables (LUTs), each of which can implement an arbitrary logic function of 4 boolean inputs [25]. Two slices form a configurable logic block (CLB), and these blocks are interconnected in a 2-dimensional mesh via programmable static routing switches.

To configure an FPGA, designers have to download a bitstream file with the configuration of all slices in the FPGA as well as the routing. Other programmable devices, for example the APEX II devices from Altera, have a more hierarchical routing approach to connecting the CLBs in their FPGAs, but the overall functionality is similar [11].

An adaptive computing architecture is a computing system that incorporates configurable logic devices such as FPGAs, usually in combination with conventional logic and memory. Adaptive computing architectures have been proposed ranging from small-scale systems-on-a-chip [54], through board-level systems such as those from Annapolis Micro Systems, up to large-scale multi-board systems [6].

The compilation techniques designed for DEFACTO focus on targeting board-level systems such as the WildStar/PCI board from Annapolis Micro Systems, depicted in Figure 1.2. Such systems consist of multiple interconnected FPGAs, each of which can access its own local memories. A larger shared system memory and a general-purpose processor (GPP) are directly connected to the FPGAs (this connection varies significantly across boards). The GPP is responsible for orchestrating the execution of the FPGAs by managing the flow of control and data from shared system memory to the local memories during application execution.

[Figure 1.2: Annapolis' WildStar/PCI board: multiple FPGAs, each connected to its own local SRAM banks (SRAM0-SRAM3), shared memories on 32-bit and 64-bit paths, and a PCI controller with a connection to off-board memory.]
The work in this dissertation describes an implementation and experimental results for designs that are mapped to a single FPGA and multiple memories, the subsystem identified by a dotted line in Figure 1.2. We thus focus on the algorithmic aspects of design space exploration under simpler data and computation partitioning strategies. Many other issues important to compiling for adaptive computing systems, beyond the scope of this dissertation, are being addressed by DEFACTO [60, 59, 45].

1.1.3 Behavioral Synthesis vs. Parallelizing Compilers

Commercially available behavioral synthesis tools map high-level descriptions of computations expressed in hardware-oriented programming languages, such as VHDL or Verilog, to a target programmable computing fabric. These languages present a completely different model of execution from standard imperative programming languages like C, in that they explicitly expose the parallelism, and in some cases even low-level expression evaluation, to the synthesis tool. In VHDL, for example, the programmer specifies the computation around the notion of concurrent processes, each of which has a well-defined set of internal signals and externally visible side effects (the sensitivity list).

Behavioral synthesis performs three core functions:

• binding operators and registers in the specification to hardware implementations (e.g., selecting a ripple-carry adder to implement an addition);
• resource allocation (e.g., deciding how many ripple-carry adders are needed); and,
• scheduling operations in particular clock cycles.

Behavioral specifications, as opposed to logic or structural specifications, allow designers to specify their computations without committing to a particular implementation. This higher-level abstraction makes it easier to translate to VHDL portions of a computation specified in an imperative programming language such as C. In this respect, the gap between a traditional compilation system and synthesis tools is reduced, allowing for a translation between both domains of representation.

While there are some similarities between the optimizations performed by synthesis tools and parallelizing compilers, they offer complementary capabilities in many ways, as shown in Table 1.1.

Table 1.1: Comparison of Behavioral Synthesis and Parallelizing Compiler.

  Behavioral Synthesis                            | Parallelizing Compilers
  ------------------------------------------------+------------------------------------------------
  Optimizations only on scalar variables          | Optimizations on scalars and arrays
  Optimizations only inside loop body             | Optimizations inside loop body and across
                                                  |   loop iterations
  Supports user-controlled loop unrolling         | Analyses guide automatic loop transformations
  Manages registers and inter-operator            | Optimizes memory accesses; evaluates trade-offs
  communications                                  |   of different storage on- and off-chip
  Considers only a single FPGA                    | System-level view: multiple FPGAs and
                                                  |   multiple memories
  Performs allocation, binding, and scheduling    | No knowledge of hardware implementation
  of hardware resources                           |   of computation
The fundamental concern of behavioral synthesis tools is to extract low-level operations from the process description in the computation specification. They perform only limited optimizations, on tiny segments of a design at a time. The tools rely on the designer to manually apply high-level transformations to negotiate the inherent hardware space-time trade-offs. Because loop unrolling is always a legal transformation to apply, current behavioral synthesis tools offer the possibility of performing loop unrolling semi-automatically: given a pragma, the tool will unroll a given loop either fully or partially, thereby increasing the number of operators available to execute the computations in the loop in parallel, provided there are no data dependences. Also, some tools provide mechanisms for the designer to specify pipelining of the execution of the iterations of a loop. Tools may also generate finite state machines that control the execution of the loop iterations, in either a serial or a pipelined fashion.

The key advantage of parallelizing compiler technology over behavioral synthesis is the ability to perform data dependence analysis on array variables, used as a basis for parallelization, loop transformations, and optimizing memory accesses, such as the techniques described in this dissertation. This technology permits optimization of designs with array variables, which usually reside in off-chip memories. Further, it enables reasoning about the benefits of code transformations (such as loop unrolling) without explicitly applying them. In addition, parallelizing compilers are capable of performing global program analysis, which permits optimization across the entire system.

Therefore, we have combined the two technologies into a unified framework such that the behavioral synthesis tool makes use of the compiler's high-level knowledge in its implementation decisions, and the compiler makes use of the estimates of a design's space and performance produced by the synthesis tool.

1.2 Overview of Our Approach

The DEFACTO compiler takes an algorithm description and a set of architectural constraints, particularly chip capacity, and performs code transformations to optimize the design in an application-specific way. We use a fixed target clock rate to guide the synthesis process, but it is not a hard constraint. Other constraints are described in Chapter 3. Under the given constraints, the optimization criteria for mapping loop computations to FPGA-based systems are as follows:

1. The execution time should be minimized.
2. For a given level of performance, FPGA space usage should be minimized.

For the first criterion, our compiler exploits fine-grain parallelism, data locality, and parallel memory accesses to multiple memories. The second criterion is needed for several reasons:

• If two designs have equivalent performance, the smaller design is more desirable, in that it frees up space for other uses of the FPGA logic, such as mapping other loop nests.
• A smaller design usually has less routing complexity and, as a result, may achieve a faster target clock rate.
• A smaller design consumes less power.

Moreover, these optimization criteria suggest a strategy for selecting among a set of candidate designs. The compiler's decision process has many degrees of freedom. It can leverage coarse-grain parallelism, fine-grain parallelism, and data locality. In addition, the space on an FPGA can be used either for computation logic or for data storage. To automate design space exploration, we must define a set of analyses and transformations to be applied, and metrics to evaluate specific candidate designs and to prune the large search space.
In addition, the space on an FPGA can be used either for computation logic or for data storage. To automate design space exploration, we must define a set of analyses and transfor mations to be applied and metrics to evaluate specific candidate designs and to prune 10 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. the large search space. We leverage the SUIF system, and augment its analyses and transformations to optimize for FPGA-based systems. We first perform several stan dard transformations provided by SUIF. Loop permutation which moves the loops that carry the most data reuse to the innermost position where the temporal data locality can be better exploited. Constant propagation propagates constants assigned to a variable through flow graph and substitutes the use of the variable. Constant folding replaces an operation of constants with the constant result of that operation. Common subexpression elimination replaces the subexpression that appear in more than one computation and the values do not change between computations with the result of the subexpression. Dead code elimination eliminates codes that are unreachable or that does not affect the program (e.g., dead stores). Loop-invariant code motion moves computations that pro duce the same value in every loop iteration out of the loop. These transformations reduce redundant computations. In addition, we perform several other analyses and transformations to improve fine- grain parallelism, on-chip temporal data locality, and memory parallelism across multiple memories, as described in the next subsection. 1.2.1 A n alyses and T ransform ations We now revisit Figure 1.1(b) to describe the analyses and transformations that we have developed or enhanced in the DEFACTO system. The dependence and data reuse anal yses are the basis of all other transformations to optimize the design. Data dependence analysis identifies the essential ordering constraints among statements or operations in a loop nest. Data reuse analysis identifies the opportunities of data reuse on-chip once it is fetched from memory, based on data dependence information. It also identifies the register reuse opportunities that eliminate redundant memory write accesses. We exploit instruction-level and memory parallelism using loop unrolling (for inner most loops) and unroll-and-jam for outer loops. Unroll-and-jam involves unrolling one 11 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. or more loops in a nest and fusing copies of the inner loop together [1]. Unroll-and-jam itself is a well-known technique, but we use it to guide the design space exploration using behavioral synthesis estimates and the guiding metrics described in the next subsection. We make the data reuse explicit in the code using scalar replacement, which reduces the number of memory accesses by replacing array references with temporary scalar vari ables. The standard approach to scalar replacement exploits data reuse only in the innermost loop [8]. Our approach exploits data reuse opportunities across multiple loops in a nest, and eliminates redundant write accesses. A unique feature of our approach is that it increases the applicability of scalar replacement by removing the requirement of iteration-reordering transformation, which is particularly im portant when unroll-and-jam is not legal, or when multiple optimization goals require conflicting loop transformation strategies. 
In addition, we perform several other analyses and transformations to improve fine-grain parallelism, on-chip temporal data locality, and memory parallelism across multiple memories, as described in the next subsection.

1.2.1 Analyses and Transformations

We now revisit Figure 1.1(b) to describe the analyses and transformations that we have developed or enhanced in the DEFACTO system. The dependence and data reuse analyses are the basis of all the other transformations that optimize the design. Data dependence analysis identifies the essential ordering constraints among statements or operations in a loop nest. Data reuse analysis identifies opportunities to reuse data on chip once it has been fetched from memory, based on the data dependence information. It also identifies register reuse opportunities that eliminate redundant memory write accesses.

We exploit instruction-level and memory parallelism using loop unrolling (for innermost loops) and unroll-and-jam for outer loops. Unroll-and-jam involves unrolling one or more loops in a nest and fusing copies of the inner loop together [1]. Unroll-and-jam itself is a well-known technique, but we use it to guide the design space exploration using behavioral synthesis estimates and the guiding metrics described in the next subsection.

We make data reuse explicit in the code using scalar replacement, which reduces the number of memory accesses by replacing array references with temporary scalar variables. The standard approach to scalar replacement exploits data reuse only in the innermost loop [8]. Our approach exploits data reuse opportunities across multiple loops in a nest, and eliminates redundant write accesses. A unique feature of our approach is that it increases the applicability of scalar replacement by removing the requirement of an iteration-reordering transformation, which is particularly important when unroll-and-jam is not legal, or when multiple optimization goals require conflicting loop transformation strategies. In addition, our approach provides a flexible strategy to trade off between exploiting reuse opportunities and reducing the register requirements of scalar replacement.

We have also developed a novel data layout scheme, called custom data layout, which derives a custom array layout across multiple memories by analyzing the data access patterns of the code, and thereby facilitates high-bandwidth parallel memory accesses to multiple memories. We do not use any fixed data layout, but rather select application-specific layouts according to the data access patterns of the code. A unique feature of this approach is its flexibility in the presence of code-reordering transformations, such as the loop nest transformations commonly applied to array-based computations. This feature yields high memory parallelism no matter how memory is accessed, as compared to solutions that optimize for memory parallelism assuming a fixed data layout. In turn, operator parallelism is improved because more data is available to perform independent computations.

1.2.2 Guiding Metrics

There can be many possible designs that implement the same algorithm. It is not easy to find a design that satisfies all the optimization goals described earlier, yet it is not practical to compare all possible designs. Therefore, we introduce three metrics to guide the design space exploration. Here, we describe each metric briefly; details are presented in Chapter 3.

One of the goals of the DEFACTO system is to achieve balanced loop behavior between computation and memory at run time. Balance is defined as the ratio of the data consumption rate of the computation to the data fetch rate from memory. Depending on the balance value, we can determine whether memory or computation is the performance bottleneck.

Another metric is needed because of the inherent space-time tradeoff: as the design exploits more parallelism, more copies of operators are necessary, and as the design exploits more data reuse on chip, more registers for on-chip storage are required. Thus, when an increase in space yields only a minor performance improvement, we should not increase the complexity of the design. As such, we define efficiency as the ratio of the performance increase to the space increase. This metric suggests when to stop the automatic design space exploration. An additional concept, the memory bandwidth saturation point, refers to the unroll factors at which data is fetched/stored at a rate corresponding to the maximum bandwidth of the target architecture.

The code transformations described above adjust balance as follows. Unroll-and-jam increases the data consumption rate, since it can introduce more parallel memory accesses and independent computations. In contrast, custom data layout increases the data fetch rate, since it increases the number of parallel memory accesses within the same memory latency. Improving data locality with scalar replacement decreases both the data consumption rate and the data fetch rate, since it reduces the number of memory accesses. Depending on the balance of a design, we can decide whether increasing the unroll factors would improve overall performance. Therefore, the DEFACTO compiler analyzes the balance and efficiency of a loop to prune the candidate designs in its design space exploration.
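Written symbolically (our paraphrase of the definitions above; the bottleneck reading is our interpretation rather than notation taken from the text):

\[
\mathrm{Balance} = \frac{\text{data consumption rate}}{\text{data fetch rate}},
\qquad
\mathrm{Efficiency} = \frac{\Delta\,\text{performance}}{\Delta\,\text{space}} .
\]

On this reading, Balance > 1 means the datapath can consume operands faster than the memory system delivers them, so memory is the bottleneck; Balance < 1 means computation is the bottleneck; and Balance = 1 is the desired balanced design. Exploration stops increasing unroll factors once Efficiency falls below a threshold or the saturation point is reached.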
1.2.3 Summary

The design space exploration algorithm involves an iterative evaluation to find the best unroll factors for the loops in a loop nest computation. For each fixed set of unroll factors, unroll-and-jam, scalar replacement, and custom data layout are performed. This design space exploration strategy is fast, since it bypasses logic synthesis and place-and-route as much as possible. In addition, it does not explore all possible designs in a brute-force way: because it is guided by the set of metrics, the optimization search considers only a very small portion of the possible unroll factors.

1.3 Example

To illustrate this approach, the example code in Figure 1.3(a) is mapped to an FPGA that has four external memories. This code fetches two data elements from memory, adds them, and writes the result to another memory. The behavioral synthesis tool optimizes only computations and memory operations within a loop body, and thus cannot identify the parallelism across loop iterations. The compiler applies unroll-and-jam to expose fine-grain and memory parallelism to high-level synthesis by replicating the logical operations and their corresponding operands in the loop body. Figure 1.3(b) shows the code after unrolling both loops once and jamming the copies of the inner loop. There are four copies of the original loop body; the array subscript expressions are updated to represent two iterations of the inner loop and two iterations of the outer loop, and the step sizes of both loops are updated as well. Once this design fetches eight data items from memory, it can perform four additions in parallel, since they are independent. Finally, the results are written back to another memory.

Figure 1.3: Code transformation example.

(a) Original program:

    for (i = 1; i <= 64; i++)
      for (j = 1; j <= 32; j++)
        A[i][j] = B[i][j] + B[i-1][j-1];

(b) After unroll-and-jam:

    for (i = 1; i <= 64; i += 2)
      for (j = 1; j <= 32; j += 2) {
        A[i][j]     = B[i][j]     + B[i-1][j-1];
        A[i][j+1]   = B[i][j+1]   + B[i-1][j];
        A[i+1][j]   = B[i+1][j]   + B[i][j-1];
        A[i+1][j+1] = B[i+1][j+1] + B[i][j];
      }

High latency and low bandwidth to external memory, relative to the computation rate, are often performance bottlenecks in FPGA-based systems, just as in conventional architectures. The DEFACTO system addresses these problems in two ways: it reduces redundant memory accesses with scalar replacement, and it decreases the memory latencies by parallelizing the memory accesses with custom data layout.

Because synthesis cannot exploit the data reuse opportunities in array variables, scalar replacement is used to make data reuse explicit by using the same register for the read references that access the same memory location. In Figure 1.3(b), the first and the last statements use the same array element, B[i][j]. Similarly, there are data reuse opportunities between B[i][j+1] and B[i][j-1], since B[i][j-1] accesses the same array element accessed by B[i][j+1] in the next iteration of loop j. Thus, scalar replacement eliminates redundant operand fetches, which are identified by dotted arrows in Figure 1.3(b). Figure 1.4(a) illustrates how we eliminate redundant memory accesses.
We load the array element accessed by B[i][j] into a scalar B0_0, and in the last statement we replace the array reference with this same scalar. Similarly, we load the array element accessed by B[i][j+1] into B1_0, and we shift the data in B1_0 to B1_1 at the end of the loop. In the next iteration, B1_1 contains the correct data without accessing memory. For simplicity, we do not show the code that initializes B1_1 for the very first iteration of loop j.

Another way of tackling the memory bottleneck problem is to reduce the memory access latency by parallelizing the memory accesses that remain after scalar replacement across multiple memory banks. There are still essential memory accesses to arrays B and A in Figure 1.4(a). If we keep the entire array B in a single memory, the four data fetches must be serialized; as a result, these parallelizable computations must also execute serially. We address this challenge using custom data layout, which analyzes the data access patterns of the code in Figure 1.4(a) and lays out the data so that most of the data required by independent computations can be fetched/stored at the same time. Figure 1.4(b) shows the output of our custom data layout. We give different array names to represent accesses to different memories. The FPGA design in Figure 1.4(b) can fetch four data items at the same time, so the four additions can be done in parallel. In addition, writing the results back to the memories can be done concurrently.

Figure 1.4: Code transformation example, continued.

(a) After scalar replacement:

    for (i = 1; i <= 64; i += 2)
      for (j = 1; j <= 32; j += 2) {
        B0_0 = B[i][j];
        B1_0 = B[i][j+1];
        B2_0 = B[i+1][j];
        B3_0 = B[i+1][j+1];
        A[i][j]     = B0_0 + B3_17;
        A[i][j+1]   = B1_0 + B2_16;
        A[i+1][j]   = B2_0 + B1_1;
        A[i+1][j+1] = B3_0 + B0_0;
        B1_1 = B1_0;
        shift_registers(B2_16, ..., B2_0);
        shift_registers(B3_17, ..., B3_0);
      }

(b) After custom data layout (each of the four memories holds one sub-array of B and one of A):

    for (i = 1; i <= 32; i++)
      for (j = 1; j <= 16; j++) {
        B0_0 = B00[i][j];
        B1_0 = B01[i][j];
        B2_0 = B10[i][j];
        B3_0 = B11[i][j];
        A00[i][j] = B0_0 + B3_17;
        A01[i][j] = B1_0 + B2_16;
        A10[i][j] = B2_0 + B1_1;
        A11[i][j] = B3_0 + B0_0;
        B1_1 = B1_0;
        shift_registers(B2_16, ..., B2_0);
        shift_registers(B3_17, ..., B3_0);
      }

Compared with the original code in Figure 1.3(a), the final transformed code in Figure 1.4(b) has increased both the data fetch rate and the data consumption rate, since the loop body includes more parallel memory accesses and independent computations. As a result, the performance has improved at the cost of increased space usage.

The next step of our automatic design space exploration in Figure 1.1(b) is SUIF2VHDL translation. The compiler translates the highly optimized SUIF intermediate code into a Behavioral VHDL description using the SUIF2VHDL tool, which generates a behavioral VHDL description of the core datapath of the selected computation to execute in hardware. The resulting behavioral code is illustrated in Appendix A. Once synthesized, it interacts with the software code running on the host processor. For a proposed behavioral VHDL design, the compiler next invokes a behavioral synthesis estimation step.
The next step of our automatic design space exploration in Figure 1.1(b) is SUIF2VHDL translation. The compiler translates the highly optimized SUIF intermediate code into behavioral VHDL using the SUIF2VHDL tool. This tool generates a behavioral VHDL description of the core datapath of the selected computation to execute in hardware. The resulting behavioral code is illustrated in Appendix A. Once synthesized, it interacts with the software code running on the host processor. For a proposed behavioral VHDL design, the compiler next invokes a behavioral synthesis estimation step.

This estimation step, using the output of SUIF2VHDL, will indicate with high confidence whether the design will fit on a single FPGA device and, if so, what the expected performance is. Using the estimates on space and computational latency from the behavioral synthesis tool, the compiler iterates through the steps of the unroll factor search until it finds an optimal solution within the space constraint. Thus, we have reduced design space exploration to a problem of unroll factor selection. Finally, once an optimal solution is found, the resulting hardware design is sent on to logic synthesis and place-and-route to actually map the design to an FPGA. Since behavioral estimates are not 100% accurate, we need another iteration of design space exploration in some rare cases where the given constraints are not met after place-and-route.

1.4 Contributions

In this dissertation, we present a detailed algorithm for design space exploration, code analyses, transformations, and experimental results demonstrating their effectiveness. While there are a few systems that automatically synthesize hardware designs from C specifications [55], to our knowledge there is no other system that automatically explores the design space in collaboration with behavioral synthesis estimation features. This dissertation presents the following specific contributions, which are all fully automated in the DEFACTO compiler.

• Integration of behavioral synthesis tools and parallelizing compiler technology to automatically map applications written in C or Fortran to FPGA-based custom computing architectures in an application-specific way. The behavioral synthesis estimation is fast and accurate enough to quickly steer our algorithm toward a highly effective and efficient design.

• An automatic design space exploration algorithm guided by several metrics and the monotonicity properties of the design search space. We search on average only 0.3% of the design space. Our algorithm derives an implementation that closely matches the performance of the fastest design in the design space, and selects the smallest design among implementations with comparable performance.

• Custom data layout, which distributes array data across multiple memory banks in an application-specific way by analyzing the data access patterns in the loop nest. As such, our generalized approach facilitates high-bandwidth parallel memory accesses. On a single FPGA with eight memories, we observe a speedup of up to 9.65 and an 86% reduction in memory accesses over a naive approach that maps an array to a single memory.

• An extension of scalar replacement that eliminates both redundant read and write memory accesses across multiple loops in a nest without requiring unroll-and-jam, thereby increasing the applicability of scalar replacement. In addition, our approach provides a flexible strategy to trade off between exploiting reuse opportunities and reducing the register requirements of scalar replacement. Using this technique, we observe a 57% to 92% reduction in the number of memory accesses compared with the original number of array references.

1.5 Organization of Dissertation

The remainder of this dissertation is organized as follows.
In the next chapter, we describe the data dependence and data reuse analyses, which are the bases of our code transformations such as loop permutation, unroll-and-jam, scalar replacement, and custom data layout. Chapter 3 describes our design space exploration algorithm for mapping loop nest computations to hardware. Chapter 4 describes how we eliminate redundant memory accesses, and Chapter 5 presents how we reduce the memory latency by parallelizing the remaining memory accesses. In Chapter 6, we present experimental results for the application of our algorithm to five multimedia computations. We survey related work in Chapter 7 and conclude in Chapter 8.

Chapter 2

DATA DEPENDENCE AND REUSE ANALYSES

The compiler code transformations described in this dissertation improve fine-grain parallelism, data locality on chip, and parallel memory accesses across multiple memory banks. The analyses described in this chapter provide the compiler with the information to guide code transformations that exploit parallelism and data reuse, as described in Chapters 3 and 4.

Data dependence relations represent whether two data accesses in the code may refer to the same memory location. Thus, they impose the essential ordering constraints among statements or operations in a program. A data reuse can be exploited when there are multiple references to an array element in the same or subsequent iterations of a loop. The first such dynamic reference may be either a read or a write; the reuse occurs on subsequent read references. A related optimization opportunity is to eliminate redundant writes to memory, where an array element is written and then overwritten on a subsequent iteration; there is no need to write the initial value of the element to memory. In this dissertation, we refer to this optimization as register reuse. Data reuse analysis identifies the opportunities for data reuse and register reuse based on the data dependence information. Thus, data reuse analysis is closely related to data dependence analysis.

In this dissertation, our major focus is dependences among array references in a loop nest. In a loop, since each operation can be executed many times, a dependence can flow from any instance of execution of an operation to any other operation instance, and even to the same operation. There are other reuse analysis works targeting other data storage such as cache memories [29, 38], which are orthogonal to our work in the register domain. In the next section, we first describe dependence analysis on array references in a loop nest. In Section 2.2, we describe data reuse analysis.

2.1 Dependence Analysis

Given the focus on array references in a loop nest, dependence analysis determines whether Bernstein's conditions [5] are satisfied for every pair of iterations I1 and I2; i.e.,

1. Iteration I1 does not write into a location that is read by a subsequent iteration I2.
2. Iteration I2 does not write into a location that is read by a subsequent iteration I1.
3. Iteration I1 does not write into a location that is also written into by a subsequent iteration I2.

We use D to refer to a set of dependence vectors. D^T ⊂ D is the set of flow (true) dependences, which occur when the first condition does not hold.
D^A ⊂ D is the set of anti-dependences, which occur when the second condition does not hold. D^O ⊂ D is the set of output dependences, which occur when the third condition does not hold. D^I ⊂ D is the set of input dependences, which occur when both iterations I1 and I2 read the same memory location. Unlike the other dependences, an input dependence does not represent a code-reordering constraint, but it provides useful information about data reuse opportunities.

Consider the example code in Figure 2.1(a). There is a true dependence between A[i][j] and A[i-1][j-1], an input dependence between B[i] and B[i-1], and an input dependence from C[j] to itself.

A reordering code transformation preserves a data dependence if it does not change the relative execution order of the source and sink of that dependence.

for (i = 1; i < 65; i++)
  for (j = 1; j < 33; j++)
    A[i][j] = A[i-1][j-1] + B[i] + B[i-1] + C[j] + D[i][j];

(a) Original code

Figure 2.1: An example code and its dependence and reuse graphs. (Parts (b), the dependence graph, and (c), the reuse graph, are diagrams and are omitted here.)

The iteration space of a loop nest is the n-dimensional polyhedron consisting of all the n-tuples of values of the loop indices, and each point in the iteration space corresponds to a loop iteration. A data dependence is called loop-carried if the two operation instances occur in two different iterations in the iteration space. A data dependence is called loop-independent if the two operation instances occur in the same iteration of the loop nest. We say that a data dependence is lexicographically positive when the source precedes the sink in the iteration space [1].

We analyze the data dependences at compile time for array references in the affine domain. In other words, array subscript expressions are of the form a1 L1 + a2 L2 + ... + an Ln + b, where the ai and b are constants and the Li are loop index variables. If array references are not in the affine domain (e.g., A[B[i]]), it is hard to determine the dependences at compile time.

A set of dependence relations can be represented by a directed graph called a data dependence graph. In this graph, a node represents the many instances of an array reference, and an edge represents a set of data dependences D. A dependence relation is characterized by the distance between the source and sink array references in the iteration space. A dependence vector d = (d1, d2, ..., dn) refers to a vector of distances in an n-dimensional loop iteration space [1]. Each entry di of a dependence vector, which represents the dependence distance in iteration counts of loop Li between the source and sink references of the dependence edge, is one of the following:

• c: an integer constant distance.
• +: more than one dependence distance, ranging from 1 to ∞.
• −: more than one dependence distance, ranging from −∞ to −1.
• *: more than one dependence distance, ranging from −∞ to ∞, or an unknown dependence distance.

A constant dependence distance c means that the distance between the two dependent array references in the corresponding loop is c iterations. A data dependence with a constant dependence distance for the loop that carries the dependence throughout the loop is called a consistent dependence [1].
By definition, the first non-zero entry of a lexicographically positive dependence vector must be a positive integer or '+', and the vector entries of a loop-independent data dependence are all zeros.

Figure 2.1(b) illustrates the dependence graph for the example code in Figure 2.1(a). The lexicographically positive true dependence from A[i][j] to A[i-1][j-1] is (1,1), which implies that the array element written by A[i][j] is read back by A[i-1][j-1] one iteration of the outer loop and one iteration of the inner loop later. In addition, the lexicographically positive input dependence vector from B[i] to B[i-1] is (1,*), because the array element accessed by B[i] is read again by B[i-1] in one iteration of the outer loop and all iterations of the inner loop. Similarly, there is a self-loop input dependence (*,0) from C[j] to itself, since the array element accessed by C[j] is accessed again across outer loop iterations. Array reference D[i][j] does not carry any data dependence, because each reference is to a unique location.

2.2 Data Reuse Analysis

Not all data dependences can be translated to data reuse in registers. As such, data reuse analysis constrains the dependence vectors that can be exploited for data reuse through a set of registers in three ways:

1. We only consider elements from D^T, D^I and D^O. Anti-dependences, from D^A, are not considered as candidates for data reuse and are ignored.
2. A dependence vector must be lexicographically positive for the data reuse to be realizable.
3. When there is more than one dependence vector, the smallest one must represent all other vectors.

One example that does not meet the third constraint above is A[i+j+1] → A[i+j], whose dependence vectors include (1,0), (0,1), (2,−1), and so on. Since these vectors require different numbers of registers, it is hard to generate simple code to adjust the number of registers for each dependence vector.

Array references with input and true data dependences always carry opportunities for data reuse. In addition, array references with output dependences carry opportunities for register reuse. Two affine array references are uniformly generated if all the coefficients ai are identical; the offset b may differ. A uniformly generated set (UGS) includes the array references that are uniformly generated in each array dimension. Note that there can be multiple UGSs for a single array. Data reuse analysis identifies data reuse opportunities only within each UGS. In Figure 2.1(a), there are four UGSs: {A[i][j], A[i-1][j-1]}, {B[i], B[i-1]}, {C[j]}, and {D[i][j]}. Temporal data reuse occurs when the same data is accessed twice at different times. Temporal data reuse from an array reference to itself is referred to as self-temporal, while temporal data reuse among different array references is called group-temporal.

Just like a dependence graph in dependence analysis, a set of data reuse relations can be represented by a directed graph, called a data reuse graph, which includes a set of reuse instances. A reuse instance consists of two dependent array references, one source and one sink, and a reuse edge which represents the minimal dependence vector between the two array references.
Figure 2.1(c) illustrates the data reuse graph for the data dependence graph in Figure 2.1(b). A self-loop reuse edge in Figure 2.1(c) means that self-temporal reuse is possible from an array reference to itself across loop iterations.

Since we are exploiting data reuse across multiple loops in a nest, as described in Chapter 4, we introduce two novel concepts, reuse distance and data reuse chains. A group-temporal reuse distance between two dependent references is defined by the minimal difference in iteration counts of the innermost loop in which the references are not invariant. There are two differences between the reuse distance and the dependence distance. First, the dependence distance is with respect to the iterations of a specific loop, while the reuse distance considers all the loops in a nest in which the array references are not invariant. Secondly, the reuse distance considers only the smallest dependence vector when there is more than one dependence vector associated with a pair of references. The smallest dependence vector can be a representative of all other dependence vectors. For a given incoming dependence vector d = (c, ..., c), where c is an integer constant, of a sink reference in an n-deep loop nest, we define the group-temporal reuse distance e(d, n) as follows:

e(d, n) = Σ_{k=1}^{n} ( d_k × ∏_{m=k+1}^{n} I_m )    (2.1)

where I_k refers to the iteration count of loop k. In Figure 2.1(c), for instance, the reuse distance of the dependence vector (1,1) is 33 (= 1 × 32 + 1).

A reuse chain, denoted as C, is a connected component (a collection of reuse instances) of a reuse graph. A UGS may include more than one reuse chain if each chain accesses independent locations of the same array. In a reuse chain, an array reference with no incoming true or input dependence edges and no outgoing output dependence edges is called a reuse generator. Specifically, an array write reference with no outgoing output dependence edge is called a finalizer. A reuse generator provides the data that can be reused by other references in the reuse chain, and the finalizer contains the final value of the array element, which may have been overwritten several times by previous write references. In Figure 2.1(c), the reuse generator of the reuse chain {A[i][j] → A[i-1][j-1]} is A[i][j], which is also the finalizer in this chain.
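To make Equation 2.1 concrete, here is a minimal sketch (our illustration, not DEFACTO code) that computes the group-temporal reuse distance from a constant dependence vector and the loop iteration counts.

#include <stdio.h>

/* Equation 2.1: e(d, n) = sum_k d[k] * prod_{m > k} I[m],
   with d[] the dependence vector and iters[] the loop iteration counts,
   both listed outermost loop first. */
static long reuse_distance(const int *d, const long *iters, int n) {
    long e = 0;
    for (int k = 0; k < n; k++) {
        long inner = 1;
        for (int m = k + 1; m < n; m++)
            inner *= iters[m];   /* iterations of the loops nested inside loop k */
        e += d[k] * inner;
    }
    return e;
}

int main(void) {
    int  d[2]     = {1, 1};      /* the (1,1) dependence of Figure 2.1 */
    long iters[2] = {64, 32};
    printf("e = %ld\n", reuse_distance(d, iters, 2));  /* 1*32 + 1 = 33 */
    return 0;
}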
In the next chapter, we use the dependence information to identify fine-grain parallelism and to check the legality of code reordering transformations. In Chapter 4, we describe how to eliminate redundant memory accesses based on the information given by data reuse analysis.

Chapter 3

AUTOMATIC DESIGN SPACE EXPLORATION

Based on the code analyses described in Chapter 2, DEFACTO iteratively evaluates several candidate FPGA designs by applying compiler transformations that improve fine-grain parallelism, data locality on chip, and parallel memory accesses across multiple memory banks, which are described respectively in Chapters 3, 4, and 5. The focus of this chapter is how the compiler explores a potentially large design space efficiently and automatically to produce a good quality design that satisfies the optimization criteria described in Chapter 1.

We first describe how the compiler and a behavioral synthesis tool interact with each other to complement each other's capabilities. The compiler exploits estimation from behavioral synthesis to determine specific hardware parameters (e.g., size and speed) with which it can quantitatively evaluate the application of a transformation to derive an optimized and feasible implementation of the loop nest computation. Since the hardware implementation is bounded in terms of capacity, the compiler transformations must also consider space constraints.

We introduce several guiding metrics (balance, efficiency, and saturation point) to effectively evaluate a set of candidate designs. These metrics enable the compiler to explore a potentially large design space, which without automation would not be feasible. In addition, we present several monotonicity properties of the design search space that result from our compiler transformations. The algorithm selects a near-optimal design among those considered, while searching on average only 0.3% of the search space for the five multimedia kernels that we have studied.

The rest of this chapter is organized as follows. In the next section, we define the terms that are used to describe the algorithm in this chapter. In Section 3.3, we describe how the compiler and behavioral synthesis interact with each other. In Section 3.4, we introduce several guiding metrics for our design space exploration. In Section 3.5, we present properties of the design search space that result from our compiler transformations. We describe our design space exploration algorithm in Section 3.6, and summarize in Section 3.7.

3.1 Definitions

We define U as the product of all the unroll factors in an n-deep loop nest:

U = ∏_{i=1}^{n} u_i,

where u_i is the unroll factor for each loop i. Our design space exploration algorithm, presented in Section 3.6, determines an optimal U that satisfies the optimization goals. We define the following terms to describe the interaction between the compiler and the synthesis tool.

Clock_targ : the target clock rate (MHz).
Capacity : the target FPGA chip space limit (slices).
width : the data bit width (= memory width) (bits).
l_r : the latency of a read memory access (cycles).
l_w : the latency of a write memory access (cycles).
M_p : the overall number of physical memories.
M_i^R : the number of read accesses to memory bank i per loop iteration.
M_i^W : the number of write accesses to memory bank i per loop iteration.
M_i : the total number of memory accesses to memory bank i per loop iteration; i.e., M_i = M_i^R + M_i^W.
M^R : the total number of read memory accesses per loop iteration; i.e., M^R = Σ_{i=1}^{M_p} M_i^R.
M^W : the total number of write memory accesses per loop iteration; i.e., M^W = Σ_{i=1}^{M_p} M_i^W.
Area(d) : the overall estimated area of design d (slices).
Cycles(d) : the estimated computational latency per loop iteration of design d, taking the memory latencies into account (cycles).
E(d) : the overall number of estimated cycles of design d, taking balance and memory latencies into account (cycles).

Estimating computational latency requires specific information on what kinds of resources are used for the computational logic, how many of them are used, and the delay and space usage of each mapped resource. When memory accesses are pipelined, both l_r and l_w are regarded as one cycle.
3.2 Assumptions

The discussion in this chapter is presented under the following assumptions:

• The memories have a single port, and read and write accesses cannot be serviced at the same time. However, multiple accesses of the same type (either read or write) can be pipelined to service one request every cycle.

• To simplify the discussion, we also assume that the access width matches the memory width.

• Loop-invariant code motion moves loop-invariant code out of the loop.

• Scalar replacement removes all but a single memory read or a single memory write access for each reuse chain, we keep all reused data internally on the FPGA, and the remaining array references are independent. Thus, as the unroll factor increases, the number of parallel memory accesses increases proportionately. We achieve this assumption for many applications with short dependence distances, but it may require too many registers in the general case. We address this problem in Chapter 4.

• Custom data layout maximizes the parallel memory accesses for the independent array references remaining after scalar replacement. We achieve this assumption by accounting for the data access patterns and data dependences among the array references in the affine domain, as described in Chapter 5.

If these assumptions do not hold, our design space exploration algorithm produces a less optimized design, since the compiler optimizations are less successful.

3.3 Interaction between Compiler and Behavioral Synthesis

We describe in this section how the compiler and a behavioral synthesis tool interact with each other to complement each other's capabilities, as illustrated in Figure 3.1.

[Figure 3.1: Interaction between compiler and behavioral synthesis. The compiler takes l_r, l_w, M_p, Clock_targ, and Capacity as inputs; it passes a design d and Clock_targ to behavioral synthesis, which returns Cycles(d) and Area(d); from these, the compiler computes M^R, M^W, E(d), U_sat, Balance(d), and Efficiency(d1, d2). The diagram is omitted.]

The compiler takes architecture-specific parameters such as l_r, l_w, M_p, Clock_targ, and Capacity. It then passes the optimized design d and Clock_targ to the estimation interface. The compiler uses a set of internal functions to format the request using the syntax specific to the synthesis tool. The estimation interface then invokes the synthesis tool in batch mode. After synthesizing the design, the behavioral synthesis tool returns the estimates, which include Cycles(d) and Area(d), in the form of report files. Finally, based on these estimates, the compiler computes several guiding metrics, namely U_sat, Balance(d), and Efficiency(d1, d2). These metrics are computed based on M^R, M^W, E(d), and Area(d).

We can simply scan the optimized code to count M^R and M^W, because custom data layout makes the accesses to separate memories explicit in the code, as described in Chapter 5. Further, in the presence of control flow, scalar replacement moves most of the conditionally executed array references outside the control flow or eliminates them by replacing them with scalars, as described in Chapter 4. Our memory interface speculatively executes the remaining conditionally executed read array references. However, we conservatively assume that the remaining conditionally executed write array references are always executed.

The E(d) computation depends on Balance(d). If a design d is compute-bound or balanced, the estimated total execution time is dominated by Cycles(d), since the memory operations are overlapped with computations.
If the design is memory-bound, the estimated total execution time is dominated by the memory access latencies.

3.4 Guiding Metrics

Our algorithm uses four metrics to continuously limit the number of candidate designs during the iterative search, while still meeting our optimization goals. Recall that the optimization criteria for the design space exploration algorithm are as follows:

1. the execution time should be minimized; and,
2. for a given level of performance, FPGA space usage should be minimized.

The chip space usage metric is used to determine whether the design fits on an FPGA, and the other metrics (balance, efficiency, and saturation point) are used to find a good quality design. These metrics are measured after code transformations such as unroll-and-jam, scalar replacement, and custom data layout.

3.4.1 Chip Space Usage

The results of estimation provide the space usage Area(d) of a design d. This space metric filters out infeasible designs given a fixed chip capacity limit Capacity. This metric is also used in computing the efficiency metric described in Section 3.4.3.

3.4.2 Balance

Balance of a design d, which is related to the optimization criteria above, is defined as follows:

Balance(d) = F(d) / C(d),    (3.1)

where F(d) refers to the data fetch rate, the total data bits that the memories can provide per cycle, and C(d) refers to the data consumption rate, the total data bits the computation consumes during the computational latency. Balance for a particular loop nest refers to the ratio of the data fetch rate from the memory banks to the consumption rate on the FPGA. If balance is less than one, the loop nest needs data at a higher rate than the memories can provide. This kind of loop is memory bound, and the extra on-chip space for computation logic may be used to improve data locality. On the other hand, if balance is greater than one, the loop is compute bound, and the extra space may be used to improve parallelism. We measure balance per iteration of the innermost loop, which is the most frequently executed code in the loop nest.

We borrow the notion of balance from previous work on matching the floating point operations and memory accesses to the maximal capabilities of a given architecture [8]. However, because we have the flexibility in FPGAs to adjust the time spent in either computation or memory accesses, we compare the data fetch rate and the data consumption rate under different optimization assumptions. The data fetch rate and the consumption rate are defined as follows:

F(d) = Σ_{i=1}^{M_p} { (M_i × width) / (l_r × M_i^R + l_w × M_i^W) }    (bits/cycle)    (3.2)

C(d) = Σ_{i=1}^{M_p} { (M_i × width) / Cycles(d) }    (bits/cycle)    (3.3)

These rates are measured per loop iteration, even though the actual number of bits fetched/consumed in each cycle varies during the loop execution.

The DEFACTO compiler controls balance by performing loop transformations that adjust parallelism, data reuse, and parallel memory accesses. Unroll-and-jam increases the data consumption rate, since it decreases Cycles(d) for the same amount of data M_i. Custom data layout increases the data fetch rate, since it increases M_i during the same memory latency. Improving data locality by scalar replacement decreases both the data consumption rate and the data fetch rate, since it reduces M_i. We present how the DEFACTO compiler can adjust balance during design space exploration in Section 3.5.
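As a concrete reading of Equations 3.1 through 3.3, here is a minimal sketch (ours, not the DEFACTO implementation); the per-bank access counts, latencies, and cycle estimate are assumed inputs that would come from the analyses and the synthesis estimates.

#include <stdio.h>

#define MP 4  /* number of physical memories, assumed for this example */

typedef struct {
    long MR[MP], MW[MP];  /* read/write accesses per bank per iteration */
    int  width;           /* memory width in bits */
    int  lr, lw;          /* read/write latencies in cycles */
    long cycles;          /* Cycles(d): latency per loop iteration */
} Design;

/* F(d): sum over banks of (Mi * width) / (lr*MiR + lw*MiW), Equation 3.2. */
static double fetch_rate(const Design *d) {
    double f = 0.0;
    for (int i = 0; i < MP; i++) {
        long mi = d->MR[i] + d->MW[i];
        if (mi > 0)
            f += (double)(mi * d->width) / (d->lr * d->MR[i] + d->lw * d->MW[i]);
    }
    return f;
}

/* C(d): sum over banks of (Mi * width) / Cycles(d), Equation 3.3. */
static double consumption_rate(const Design *d) {
    double c = 0.0;
    for (int i = 0; i < MP; i++)
        c += (double)((d->MR[i] + d->MW[i]) * d->width) / d->cycles;
    return c;
}

int main(void) {
    Design d = { .MR = {1, 1, 1, 1}, .MW = {1, 0, 0, 0},
                 .width = 32, .lr = 1, .lw = 1, .cycles = 8 };
    double balance = fetch_rate(&d) / consumption_rate(&d);  /* Equation 3.1 */
    printf("Balance(d) = %.2f (%s)\n", balance,
           balance < 1.0 ? "memory bound" : "compute bound or balanced");
    return 0;
}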
3.4.3 Efficiency

For an inherently compute-bound design, performance may continue to improve as we increase the number of operators and the on-chip data storage, but perhaps not enough to justify the increased space usage. We may even degrade the overall performance despite a minor decrease in cycles, since the achieved clock rate after place-and-route degrades as design complexity grows. To capture this notion, we compute Efficiency(d1, d2) to compare a design d1 that uses a particular U value with a design d2 that uses a larger U value. Efficiency(d1, d2) is defined as follows:

Efficiency(d1, d2) = (E(d1) − E(d2)) / (Area(d2) − Area(d1))    (3.4)

If Efficiency is below a predefined threshold, our algorithm selects d1 over d2. Therefore, our algorithm increases the number of operators and the on-chip data storage only when doing so will have at least moderate performance gains.

3.4.4 Saturation Point

There is a fixed limit on the effective memory bandwidth utilization we can achieve by increasing the memory parallelism. We define an additional concept, the memory bandwidth saturation point, to suggest the initial unroll factors with which to start the design space exploration algorithm. We say a design has reached a saturation point when the data is being fetched and stored at a rate corresponding to the maximum bandwidth of the target architecture. At this point the memory parallelism reaches its maximum, such that whenever there is a request to access a single memory in the resulting unrolled innermost loop body, all M_p memories must be equally busy; i.e.,

M_1^R = M_2^R = ... = M_{M_p}^R  and  M_1^W = M_2^W = ... = M_{M_p}^W    (3.5)

Intuitively, the above properties translate to increased memory bandwidth utilization. We consider reads and writes separately because they will be scheduled separately, as described in Section 5.4. In particular, we define U_sat as the product of unroll factors that leads to a saturation point. It indicates how much unrolling (U) across the multiple loops in a nest is necessary to reach a saturation point; i.e.,

U_sat = LCM(GCD(M^R, M^W), M_p).    (3.6)

We use GCD(M^R, M^W) to maximize the opportunities for parallel memory accesses, both reads and writes, and the LCM with M_p to make all M_p memories busy whenever there is a request for memory accesses. Simply stated, we are looking for the smallest U that results in a multiple of M_p read and write accesses, as shown in Equation 3.6. Since loop peeling and loop-invariant code motion in scalar replacement have eliminated the memory accesses in the main loop body that are invariant with respect to any loop in the nest, unrolling such loops does not affect memory parallelism. Therefore, if an unroll factor u_i of U_sat is greater than one, the array subscript expressions vary with respect to loop i. That is, to reach a saturation point, we consider unrolling only those loops that will introduce additional memory parallelism.

In some cases where array subscript expressions involve multiple loop index variables, the assumption that scalar replacement will remove all array references except one cannot be achieved. In these cases, it is not possible to reach a saturation point no matter how many times we unroll the loops. Thus, we instead choose the heuristic U_sat = ∏_{i=1}^{n} M_p; that is, we choose M_p as the unroll factor for each loop. The reason is that we can get more memory parallelism by unrolling multiple loops rather than unrolling a single loop many times, because of the data dependence from an array reference to itself across loop iterations.
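As an illustration of Equation 3.6, here is a small self-contained sketch (ours) with the usual gcd/lcm helpers:

#include <stdio.h>

static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }
static long lcm(long a, long b) { return a / gcd(a, b) * b; }

/* Usat = LCM(GCD(M^R, M^W), Mp), Equation 3.6. */
static long saturation_point(long MR, long MW, long Mp) {
    return lcm(gcd(MR, MW), Mp);
}

int main(void) {
    /* e.g., 2 reads and 1 write per iteration on a board with 4 memories:
       GCD(2,1) = 1 and LCM(1,4) = 4, so unrolling by 4 saturates the banks. */
    printf("Usat = %ld\n", saturation_point(2, 1, 4));
    return 0;
}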
3.5 Monotonic Search Space Properties

The optimization involves selecting unroll factors for the loops in the nest. Our search is guided by the following observations about the impact of unrolling a single loop in the nest. The monotonicity property also applies when considering simultaneous unrolling of multiple loops, as long as the unroll factors for all loops are either increasing or decreasing.

3.5.1 Unroll Factor vs. Data Fetch Rate

Observation 1 The data fetch rate is monotonically nondecreasing as the unroll factor increases, but it is also non-increasing beyond the saturation point as the unroll factor increases by multiples of U_sat.

Loop unroll-and-jam exposes the array references of several iterations in one iteration of the innermost loop. Thus, there is a proportional increase in the number of memory accesses in the loop body as the unroll factor increases, unless the array is invariant with respect to the loop. Figure 3.2(a) illustrates the relationship between the unroll factor and the data fetch rate. Intuitively, the data fetch rate increases as the unroll factor increases, since there are more memory accesses available in the loop body for scheduling in parallel. This observation requires that the data be laid out in multiple memories and the accesses be scheduled such that the number of independent memory accesses on each memory cycle is monotonically nondecreasing as the unroll factor increases.

However, the data fetch rate can only be increased up to a saturation point, where the maximal memory parallelism is exploited. With the properties of custom data layout, increasing the unroll factor does not increase the data fetch rate beyond the saturation point. Here, the unroll factor must increase by multiples of U_sat, so that each time a memory operation is performed, there are M_p accesses in the main loop body that are available to schedule in parallel.

[Figure 3.2: The effect of the increase in unroll factors. Six plots against the unroll factor: (a) the data fetch rate, (b) the data consumption rate, (c) balance, (d) design size, (e) the fetch and consumption rates (bits/sec) overlaid, and (f) performance; the saturation point (U_sat) and the optimal solution are marked on the relevant curves. The plots are omitted.]

3.5.2 Unroll Factor vs. Data Consumption Rate

Observation 2 The data consumption rate is monotonically non-decreasing as the unroll factor increases, even beyond the saturation point.

Unlike the data fetch rate, the data consumption rate is not machine-specific; it depends on the number of memory references in a single iteration of a loop. In other words, the loop tries to consume more data as the number of memory references increases. Therefore, the data consumption rate has no fixed upper limit. Figure 3.2(b) shows this relationship between the unroll factor and the data consumption rate.
Intuitively, as the unroll factor increases, more operator parallelism is enabled, thus reducing the computation time and increasing the frequency at which data can be consumed. Further, based on Observation 1, as we increase the data fetch rate, we eliminate idle cycles waiting on memory and thus increase the consumption rate. Although the memory parallelism exploited as a result of unrolling a loop may reach a threshold, the data consumption rate continues to improve at least slightly, due to simpler loop control.

In most cases, as the unroll factor increases, the data consumption rate does not increase linearly, for two reasons. First, the exposed independent computations cannot be scheduled at the same time if all the necessary operands cannot be ready at the same time. In other words, the data fetch rate also limits the computation time. Secondly, memory accesses are completely independent, whereas operator parallelism may be restricted. In the rare case, however, that there is no dependence at all across loop iterations and all the necessary data can be fetched at the same time, the data consumption rate can also increase linearly.

3.5.3 Unroll Factor vs. Balance

Observation 3 Balance is monotonically nondecreasing before the saturation point and monotonically non-increasing beyond the saturation point as the unroll factor increases by multiples of U_sat.

Given the properties of the data consumption rate and the data fetch rate described in the previous sections, we can derive the monotonicity property of balance. Balance increases until the design exploits the maximal memory parallelism, which happens at the saturation point; the consumption rate increases at a smaller rate than the fetch rate does as the unroll factor increases. Therefore, balance increases monotonically up to the saturation point.

Beyond the saturation point, the data fetch rate no longer increases while the data consumption rate keeps increasing. Therefore, balance decreases monotonically beyond the saturation point. Figure 3.2(c) shows this monotonicity of the balance metric; the peak is found at the saturation point.

3.5.4 Unroll Factor vs. Design Size

Observation 4 The design space requirement monotonically increases as the unroll factor increases.

As the unroll factor increases, more computation operators are configured on the FPGA to exploit fine-grain parallelism. Accordingly, the design space requirement grows as the unroll factor increases, because the loop body gets bigger and bigger, as shown in Figure 3.2(d). The growth in chip space usage sometimes does not lead to a reasonable performance gain, so it is not a good idea to unroll the loops imprudently from the perspective of chip space utilization. Therefore, using the efficiency metric defined in Equation 3.4, we achieve the smallest design with near-optimal performance among the designs that the design space exploration algorithm considers.

3.6 Optimization Algorithm

The design search space is exponential, but we would like to find the optimal solution within the space constraint as quickly as possible.
Our design space exploration algorithm relies on the monotonicity properties and the metrics described in Section 3.4 to continuously limit the number of candidate designs during the iterative search, while still meeting our optimization criteria. The overall approach increases U only when doing so will have at least moderate performance gains, and we will see from our experimental results that such a strategy is very compatible with using behavioral synthesis estimates. Once we find the best U, specific unroll factors u_i for each loop i whose total product is U can be decided to minimize the execution time, based on the data dependence information.

In the next subsection, we show the possible relations between the data fetch rate and the data consumption rate, and what the optimal design is in each case. In Section 3.6.2, we show how we use the monotonicity properties and the guiding metrics to find the best U. In Section 3.6.3, we present how to decide the individual unroll factors for a given U.

3.6.1 Data Fetch Rate vs. Data Consumption Rate

There are five possible scenarios for the relation between the data fetch rate and the data consumption rate, as shown in Figure 3.2(e). In any scenario, the slower rate dominates the overall execution time, and the faster rate cannot reach its full performance. In scenarios 1 and 4, the data fetch rate and the consumption rate do not cross each other. In scenarios 3 and 5, there is one crossing point, where the balance is one. In scenario 2, there are two crossing points.

In all scenarios in Figure 3.2(e), the optimal solution can be found beyond the saturation point, if the solution meets the space constraint. In scenarios 2 and 3, for example, there is a crossing point before the saturation point. This solution is a balanced solution whose space utilization is perfect. However, a solution with a greater unroll factor performs better than the balanced solution: even though it is not balanced, its performance-dominating rate is greater than that of the balanced solution.

Once the balanced solution is found beyond the knee (saturation point) of the data fetch rate, increasing the unroll factor beyond the balanced solution will improve the overall performance little, if at all, since the slower memory operations dominate the overall performance and the computation is frequently idle waiting for data to arrive.

Observation 5 The performance improves monotonically before the optimal solution suggested by the balance metric, and improves little, if at all, beyond the optimal solution as the unroll factor increases by multiples of U_sat.

Figure 3.2(f) illustrates the relationship between the unroll factor and the overall performance. The knee of the graph is the optimal solution.
The optimal solutions in scenarios 2 and 5 are the balanced solution, if it meets the space constraint. In scenario 1, the optimal solution is at the saturation point. In scenarios 3 and 4, the optimal solution can be found between the saturation point and the maximal unroll factor, provided the design meets the space constraint. In these cases, performance may continue to improve as unroll factors increase, but perhaps not enough to justify the increased space usage. We prevent this using the Efficiency metric described in Section 3.4.3.

According to these scenarios, therefore, the saturation point is a good starting point for the search for the optimal solution. The next section describes in detail how the design space exploration algorithm searches for the optimal solution, guided by the monotonicity properties and the guiding metrics.

3.6.2 Computing the Best U

We use the following terms to describe the algorithm for design space exploration.

• I_i : the iteration count of loop i.
• U_base : 1, which means no unrolling.
• U_max : ∏_{i=1}^{n} I_i, which means complete unrolling.
• U_cb : the maximal U among the compute-bound designs searched by the algorithm.
• U_mb : the minimal U among the memory-bound designs searched by the algorithm.

Given the described monotonicity of the search space for each loop in the nest in Section 3.5, we start with a design at the saturation point, and we search larger unroll factors that are multiples of U_sat, looking for a balanced design or the two points between which balance crosses over from compute bound to memory bound, or vice versa. In fact, ignoring space constraints, we could search each loop in the nest independently, but to converge on a near-optimal design more rapidly, we select unroll factors based on the data dependences.

bool DONE = False, last_fit = True
int Ucb = 1           /* no unrolling */
int Umb = Umax        /* = prod_i I_i */
int U = Usat          /* initial U is at the saturation point */
Space_last = 0, E_last = 0
while (DONE == False) do {
    Code = Generate(U)
    Estimate = Synthesize(Code)   /* Cycles(Code), Space(Code) in Estimate */
    Balance = Compute_Balance(Code, Cycles(Code))
    E = Compute_Performance(Code, Balance, Cycles(Code))
    Efficiency = Compute_Efficiency(E_last, E, Space_last, Space(Code))
    /* First, deal with space-constrained designs */
    if (Space(Code) > Capacity) then {          /* last was compute bound */
        Umb = U                                 /* prune U through Umax */
        U = (U + Ucb) / 2
        last_fit = False
    } else if (last_fit == False and U < Usat) then {
        /* look for the largest U that fits between 1 and Usat */
        Ucb = U                                 /* prune 1 through U */
        U = (U + Umb) / 2
    } else if (Balance == 1) then
        DONE = True                             /* balanced, so done! */
    else if (Balance < 1) then {                /* memory bound */
        Umb = U                                 /* prune U through Umax */
        if (U == Usat) then DONE = True         /* always memory bound */
        else U = (Ucb + Umb) / 2                /* balanced solution lies between earlier size and this one */
    } else if (Balance > 1) then {              /* compute bound */
        if (Umb == Umax) then {                 /* have only seen compute bound so far */
            if (Efficiency < THRESHOLD) then { DONE = True, U = Ucb }
            else { Ucb = U, U = U * 2 }         /* prune Ucb through U */
        } else { Ucb = U, U = (U + Umb) / 2 }   /* balanced solution lies between earlier and current sizes */
    }
    /* only multiples of Usat are candidates of the search */
    if (U is not a multiple of Usat) then U = Get_Close_Multiple(U, Ucb, Umb)
    if (U == Ucb or U == Umb) then {
        DONE = True       /* no more points to search in between, or Usat is memory bound */
        if (U != Usat) then U = Ucb             /* prefer compute bound and fit */
    } else if (U > Umax) then { DONE = True, U = Umax }   /* Umax is compute bound */
    E_last = E, Space_last = Space(Code)
}   /* end of while loop */
return U

Figure 3.3: Algorithm for Design Space Exploration.
Each time U is decided, U must satisfy the following condition:

U_base ≤ U_cb ≤ U < U_mb ≤ U_max

The algorithm first selects U_sat, the starting point for the search, using Equation 3.6. Since we are starting with a design that maximizes memory parallelism, either the design is memory bound and we stop the search, or it is compute bound and we continue. If it is compute bound, then we consider unroll factors that provide increased operator parallelism, in addition to memory parallelism. If the initial design is space constrained, we must reduce U until the design size is less than the size constraint Capacity, resulting in a suboptimal design. In this case, the algorithm selects the largest unroll factor between U_base and U_sat, because this will maximize the available parallelism.

In every iteration, U_cb and U_mb keep track of the U of the most recent compute-bound and memory-bound designs, respectively. Assuming the initial design is compute bound, the algorithm increases U until it reaches a design that (1) is memory bound; (2) is larger than Capacity; (3) represents full unrolling of all loops in the nest, i.e., U ≥ U_max; or (4) has Efficiency below THRESHOLD. In cases (1) and (2), the algorithm will select U between the last compute-bound design U_cb that fits and the current design U_mb, approximating binary search such that U = ⌊(U_cb + U_mb) / 2⌋. In cases (3) and (4), the optimal solution is the last U searched by the algorithm, which is U_cb.

The candidates for U in the search must be multiples of U_sat. Otherwise, the design is not fully utilizing the memory bandwidth, a key to enhancing the overall performance. The function Get-Close-Multiple(U, U_cb, U_mb) returns an unroll factor U_out which is the closer multiple of U_sat; i.e., U_cb ≤ U_out ≤ U_mb, and

U_out = c × U_sat,        if U − c × U_sat ≤ (c + 1) × U_sat − U;
U_out = (c + 1) × U_sat,  otherwise,

where c is an integer constant.
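A minimal sketch (ours) of Get-Close-Multiple as just described: round U to the nearer multiple of U_sat; the final clamping step is our assumption, since the text only requires U_cb ≤ U_out ≤ U_mb.

#include <stdio.h>

/* Round U to the nearer multiple of Usat, keeping the result in [Ucb, Umb].
   The clamping step is an assumption; the text only states Ucb <= Uout <= Umb. */
static long get_close_multiple(long U, long Usat, long Ucb, long Umb) {
    long c = U / Usat;                          /* c*Usat <= U < (c+1)*Usat */
    long lo = c * Usat, hi = (c + 1) * Usat;
    long Uout = (U - lo <= hi - U) ? lo : hi;   /* pick the nearer multiple */
    if (Uout < Ucb) Uout = hi;                  /* stay inside the interval */
    if (Uout > Umb) Uout = lo;
    return Uout;
}

int main(void) {
    /* With Usat = 4: U = 10 is nearer to 8 than to 12. */
    printf("Uout = %ld\n", get_close_multiple(10, 4, 1, 64));
    return 0;
}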
In the next section, we describe the details of how our algorithm selects the individual unroll factors u_i for a given U in 2-deep loop nests. Then, we extend the algorithm to n-deep loop nests in the following section.

3.6.3 Computing Individual Unroll Factors

We have seen several important characteristics of the search space in Section 3.5. If there are multiple loops in a nest, each loop carries different characteristics of parallelism and data locality, even though it follows the observations described in Section 3.5. For a given U, there can be several different combinations of u_i's. Figure 3.4 illustrates an example with U = 8 and different combinations of unroll factors for two loops. U decides the number of copies of the original rolled loop body. If an array is not invariant with respect to any loop in the nest, each combination of unroll factors whose product is the same U introduces the same number of memory accesses. In this section, we solve the problem of determining the best individual unroll factors that minimize the total execution time for a given U.

The discussion in this section is based on the fact that VHDL distinguishes itself from other languages by the way statements are executed: it can represent concurrent statements, which are especially suited to modeling the parallelism of hardware. So, independent computations can be executed in parallel whenever all the necessary data is ready. It is not possible for the compiler to compute the exact execution time at compile time. For a given U (= ∏_{i=1}^{n} u_i), however, the compiler can decide the individual unroll factors u_i that minimize the execution time. Thus, the purpose of the analysis described in this section is to compare different candidate designs. Input dependences and loop-independent dependences are not considered, because they do not affect the parallel execution across loop iterations. Also, true or output dependences with negative dependence distances are not considered, since they prohibit unroll-and-jam for correctness.

[Figure 3.4: Differences in execution time for the same U = 8 assuming D = {(1,0), (0,2)}. The four panels show the execution step assigned to each of the eight unrolled iterations for the unroll factor combinations (a) 1x8, (b) 2x4, (c) 4x2, and (d) 8x1; the step-number diagrams are omitted.]

Observation 6 For a given U and D, it is better to select a proportional ratio of unroll factors according to the minimal data dependence distance in each loop.

For example, let us assume a loop nest has dependence vectors {(1,0), (0,2)}. Unrolling the inner loop once doubles the data consumption rate, since there is no dependence between two consecutive loop iterations. In addition, the data fetch rate can also be doubled if the memories can provide the increased data requirement per iteration. In contrast, unrolling the outer loop once does not increase the data consumption rate as much as unrolling the inner loop does, because the dependence vector (1,0) prohibits two consecutive iterations of the outer loop from being executed in parallel. If unrolling the inner loop once does not exploit the full memory bandwidth, we can increase the unroll factor of the outer loop. The other operations that are involved in the dependence vector (0,2) may benefit from unrolling the outer loop, and more data might be fetched during the same latency.

Different combinations of unroll factors for a given U and D take different numbers of cycles when the unrolled loop is synthesized to an FPGA. In Figure 3.4, the numbers in the rectangles represent the execution steps of the computation imposed by the dependence vectors. In Figure 3.4(a), where only the inner loop is unrolled by 8, the first two unrolled iterations can be executed in parallel, but the next two iterations cannot be executed at the same time because of the dependence vector (0,2), and thus they must wait for the previous two iterations to finish. Figure 3.4(b), where the outer loop is unrolled by 2 and the inner loop is unrolled by 4, shows an interesting execution behavior similar to wavefront parallelism. In the first step, the lower left two iterations are executed in parallel. In step 2, the upper left two iterations and the lower right two iterations can be executed in parallel. Finally, the upper right two iterations are executed in step 3. The synthesis tools automatically identify these parallel execution opportunities. Looking at Figure 3.4(c), where the outer and inner loops are unrolled by 4 and 2 respectively, the first two iterations are executed in parallel, but the upper two iterations cannot be, because of dependence vector (1,0). In Figure 3.4(d), where only the outer loop is unrolled by 8, all the iterations must be serialized because of dependence vector (1,0).

Therefore, different combinations of unroll factors whose product is the same U represent different data consumption rates and thereby different balance.
Based on Observation 6, the unroll factor combination 2x4 in Figure 3.4(b) is the best choice. The other combinations could also reach a balanced solution, but they would require a greater U value than the 2x4 combination does. The key insight is that we unroll all loops in the nest, with larger unroll factors for the loops carrying larger minimal nonzero dependence distances.

Table 3.1 summarizes the optimal unroll factors for all the possible dependence vectors in a 2D iteration space. Intuitively, the equations for the best combination of unroll factors presented in this section can be summarized as follows:

• A parallelizable loop i, i.e., one for which every dependence vector in D has a zero entry in dimension i (∀d ∈ D : d_i = 0), is given the maximal unroll factor U, and the unroll factors of the other loops are one.

• If there is no parallelizable loop, each dimension is given a relative weight of its minimal non-zero dependence distance over the other dimensions' minimums. A loop with a bigger dependence distance must be given more weight, because more iterations can be parallelized in that dimension between the dependent iterations.

D                                        u1               u2
(a) {(d1,0), (d2,0), ..., (dn,0)}        1                U
(b) {(0,d1), (0,d2), ..., (0,dn)}        U                1
(c) {(d1,d2), (d3,d4), ..., (dn,dn+1)}   1                U
(d) {(d1,0), (0,d2)}                     √(d1 × U/d2)     √(d2 × U/d1)
(e) {(d1,0), (d2,d3)}                    1                U
(f) {(0,d1), (d2,d3)}                    U                1
(g) {(d1,0), (0,d2), (d3,d4)}            √(d1 × U/d2)     √(d2 × U/d1)

Table 3.1: Optimal unroll factors for different dependence vectors in a 2D iteration space.

In Appendices B and C, we present in detail how we determine the best individual unroll factors based on U and D in 2D and 3D iteration spaces, respectively. By mathematical induction, the same methodology can be extended to an n-deep loop nest.
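As an illustration of row (d) of Table 3.1, here is a small sketch (ours) that splits a given U between two loops in proportion to their minimal dependence distances; rounding the result to integer factors whose product is U is our own simplification.

#include <math.h>
#include <stdio.h>

/* For D = {(d1,0), (0,d2)}: u1 = sqrt(d1*U/d2), u2 = sqrt(d2*U/d1) (Table 3.1(d)).
   Rounding u1 to a divisor of U is a simplification for illustration. */
static void split_unroll(long U, long d1, long d2, long *u1, long *u2) {
    double exact = sqrt((double)d1 * U / d2);
    long cand = (long)(exact + 0.5);
    if (cand < 1) cand = 1;
    while (U % cand != 0) cand--;     /* nearest smaller divisor of U */
    *u1 = cand;
    *u2 = U / cand;
}

int main(void) {
    long u1, u2;
    split_unroll(8, 1, 2, &u1, &u2);  /* D = {(1,0),(0,2)}, U = 8 -> 2 x 4 */
    printf("u1 = %ld, u2 = %ld\n", u1, u2);
    return 0;
}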
3.7 Summary

As devices and consequently designs become more complex, there will be a growing need to explore increasingly large design spaces in an efficient fashion. In this chapter, we have described an algorithm for automatic design space exploration in mapping applications to FPGA-based systems. To meet the optimization criteria set forth in this chapter, we have reduced the optimization process to a tractable problem, that of selecting the unroll factors for each loop in the nest that lead to a high-performance, balanced, and efficient design. Guided by the monotonicity properties and several metrics, our algorithm searches a very small portion of the entire design space.

The balance concept can be used for many other purposes where there is a producer/consumer relation. For example, we use it in coarse-grain pipelining across loop nests to minimize the synchronization between them. Formally, an asynchronous linear pipeline is a set of pipe stages which performs a fixed function over a stream of data flowing from the first pipe stage to the last, in a linear progression. We use balance to match the producer and consumer rates between two loop nests, thereby decreasing the need for synchronization and buffer storage [60].

Chapter 4

SCALAR REPLACEMENT

As in conventional architectures, high latency and low bandwidth to external memories, relative to the computation rate, are often performance bottlenecks in FPGA-based systems. Based on the data reuse opportunities identified by data reuse analysis, scalar replacement [7] makes the data reuse explicit by replacing array references with scalar temporaries for references that read the same memory location, resulting in a reduction in memory read accesses. It also eliminates multiple write references by replacing the redundant memory writes with writes to a register, and only writing the final data to memory. The improved data locality resulting from scalar replacement reduces both the data consumption rate and the data fetch rate. Thus, it allows unrolling the loop nest further to reach the memory bandwidth saturation point, thereby enhancing the overall performance.

One previous approach to scalar replacement eliminates unnecessary read accesses based on true and input dependences in the innermost loop of a nest [8]. To increase the data reuse opportunity in the innermost loop, it relies on unroll-and-jam, which moves iterations of the outer loops into the innermost loop body. Thus, unroll-and-jam can be used to decrease some dependence distances in the loop iteration space between two dependent references. While this overall strategy has been shown to be very effective at speeding up important computational kernels, the reliance on unroll-and-jam has several limitations. Unroll-and-jam is not always legal, and because it significantly transforms the code, leading to larger loop bodies, it may conflict with other optimization goals.

In this chapter, we describe a new approach for scalar replacement of array variables that extends this prior work in several important ways. It:

1. increases the applicability of scalar replacement by eliminating the necessity for unroll-and-jam;
2. increases the candidates for data reuse, both by exploiting reuse across multiple loops in a nest (not just the innermost loop) and by removing redundant write memory accesses (in addition to redundant reads);
3. provides a flexible strategy to trade off between exploiting reuse opportunities and reducing the register requirements of scalar replacement; and
4. improves the applicability of scalar replacement in the presence of control flow.

This new approach was motivated by the opportunities arising in compiling to FPGAs: (1) the amount of logic that can be configured for registers far exceeds the number of registers in a conventional system (on the order of several hundred to a few thousand); and (2) data can be copied between registers in parallel. For these reasons, the compiler stores reused data in registers across loops in a nest, even though the reuse distance can be potentially large, and frequently copies between registers. Further, it is beneficial to avoid the code size increase associated with unroll-and-jam, since increased design complexity can lead to slower achievable clock rates and may exceed device capacity.

While FPGAs represent a unique computing platform, these extensions to scalar replacement will become increasingly important due to a number of current architectural trends, such as the larger register files of 64-bit machines (e.g., the Itanium's 128 registers), rotating register files [42], and software-managed on-chip buffers (e.g., Imagine [48]).
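To show the register-copying pattern this approach relies on, here is a minimal C sketch (ours, not compiler output) that exploits a reuse distance of E iterations by keeping the last E loaded values in a small register file and shifting it each iteration, in the style of the shift_registers calls in Figure 1.4.

#include <stdio.h>

#define E 3  /* assumed reuse distance, in innermost-loop iterations */

int main(void) {
    int B[16];
    for (int j = 0; j < 16; j++) B[j] = j * j;

    int regs[E];
    /* Peeled initialization: preload the first E values (cf. loop peeling). */
    for (int j = 0; j < E; j++) regs[j] = B[j];

    long sum = 0;
    for (int j = E; j < 16; j++) {
        int fresh = B[j];        /* the only remaining memory read */
        sum += fresh + regs[0];  /* regs[0] holds B[j-E]: no second read */
        /* Shift the register file, as shift_registers() does in Figure 1.4. */
        for (int k = 0; k < E - 1; k++) regs[k] = regs[k + 1];
        regs[E - 1] = fresh;
    }
    printf("sum = %ld\n", sum);
    return 0;
}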
From the experiments on five multimedia kernels, we observe a 58 to 90 percent reduction in memory accesses and speedups of 2.34 to 7.32 over the original programs.

The remainder of this chapter is organized as follows. The next section defines the terms used to describe the algorithm in this chapter. In Section 4.2, we briefly introduce the prior solution by Carr and Kennedy [8], for purposes of comparison with our approach. In Section 4.3, we describe the details of our algorithm, followed by a discussion of code generation in Section 4.4. In Section 4.5, we show how we replace conditionally executed array references, followed by a summary in Section 4.6.

4.1 Definitions

We define the following terms to describe the algorithm.

A: the group-temporal reuse amount associated with a particular incoming dependence vector of a reuse chain; i.e., how many memory accesses can be replaced.

D: a set of dependence vectors.

D^t: a set of true dependence vectors.

D^o: a set of output dependence vectors.

D^i: a set of input dependence vectors.

I_l: the loop iteration count of loop l.

M: the number of memory accesses remaining after scalar replacement.

G: the number of memory accesses incurred, associated with a reuse generator.

R: the number of registers required to exploit data reuse.

    for (i = 1; i < 65; i += 2) {
      for (j = 1; j < 33; j++) {
        A[i][j]   = A[i-1][j-1] + B[i]   + B[i-1] + C[j] + D[i][j];
        A[i+1][j] = A[i][j-1]   + B[i+1] + B[i]   + C[j] + D[i+1][j];
      }
    }

(a) After unroll-and-jam of loop i by one

(b) Reuse graph: edges among the array references above, labeled with dependence vectors such as <0,1>, <1,1>, <0,0>, <0,+>, and <+,0>

    for (i = 1; i < 65; i += 2) {
      B0 = B[i]; B1 = B[i-1]; B2 = B[i+1];
      for (j = 1; j < 33; j++) {
        if (j == 1) A1 = A[i][j-1];
        C0 = C[j];
        A0 = A[i-1][j-1] + B0 + B1 + C0 + D[i][j];
        A[i][j] = A0;
        A[i+1][j] = A1 + B2 + B0 + C0 + D[i+1][j];
        A1 = A0;
      }
    }

(c) After scalar replacement for data reuse in the innermost loop

Figure 4.1: An example of Carr's scalar replacement.

4.2 Carr's Approach

In this section, we briefly introduce a previous approach to scalar replacement by Carr and Kennedy [8] for the purposes of comparison with our approach. Their scalar replacement algorithm relies on a loop iteration reordering transformation called unroll-and-jam, which moves the iterations of the outer loop into the innermost loop body. It is used both to improve the potential instruction-level parallelism and to shorten the dependence distance between dependent references. The benefit of the latter use is to reduce the number of registers required to exploit a reuse instance.

Carr's approach restricts scalar replacement to array references with dependences carried by the innermost loop, thus of the form (0,0,···,0,dn); i.e., all outer loops should not carry any dependence. Further, their approach considers only consistent data dependences in the innermost loop to exploit data reuse. Unroll-and-jam is used to bring dependences carried by outer loops into the innermost loop. In Figure 2.1(c), for example, the array references A[i-1][j-1], B[i-1], and C[j] that induce the dependence vectors (1,1), (1,*), and (+,0), identified by dotted arrows, are not of the form (0,0,···,0,dn).
After unroll-and-jam, as in Figure 4.1(a), the dependences are (0,1), (0,0), and (0,+), as illustrated in Figure 4.1(b), increasing the innermost-loop reuse. Similarly, the dependence (+,0) of array reference C[j] in Figure 2.1(b) is not carried by the innermost loop. By unrolling the outer loop as shown in Figure 4.1(b), a loop-independent dependence (0,0) is created.

The overall algorithm is to reduce unnecessary memory accesses and to match memory and floating point operations with the peak performance of the target architecture [8]. As unroll factors increase, data reuse increases monotonically. However, register requirements also increase and can exceed the number of available registers. Thus, the algorithm attempts to derive the best unroll factors, computed as a function of M and the number of floating point operations, under the constraint that R is less than or equal to the total number of available registers for a given architecture. Thus, the core of their analysis is to compute M and R, parameterized by the unroll factors u1,···,un-1 for loops 1 to n-1; the algorithm never unrolls the innermost loop because doing so does not affect the data access patterns. In this chapter, we denote the unroll factor of each loop l as u_l.

The steps of their scalar replacement approach are in Figure 4.2. Given a data dependence graph, array references are partitioned into three categories that exhibit different memory access behavior [8]:

1. V^∅: References without an incoming consistent dependence.

2. V_r^c: Read references with an incoming consistent dependence, but not invariant with respect to any loop.

3. V_r^i: Read references that are invariant with respect to a loop.

1. Analyze the dependence vectors in a loop nest, assuming the code is unrolled for a set of unroll factors.
2. Categorize the array references as described below.
3. Select dependence vectors of the form (0,0,···,0,dn).
4. Compute R and M for each array reference, parameterized by the unroll factors.
5. Decide the unroll factors within the fixed register constraint and perform unroll-and-jam.
6. Replace read array references that contain a reuse opportunity in the innermost loop with appropriate scalar variables.
7. Insert register copy statements to exploit data reuse across innermost loop iterations.
8. Perform loop peeling/hoisting to initialize registers.

Figure 4.2: Steps of Carr's scalar replacement.

Table 4.1 shows the characteristics of the consistent dependences in each category and examples of array references from Figure 2.1(b). In Table 4.1, c is a constant, u1 is the unroll factor of the outer loop i, and SM is the total number of memory accesses throughout the entire loop nest execution.

Category   Example        D                   G    A      M    R          SM
V^∅        A[i][j]        ∅                   u1   0      u1   0          2,048
V^∅        D[i][j]        ∅                   u1   0      u1   0          2,048
V_r^c      A[i-1][j-1]    {(c,···,c)}         u1   u1-1   1    2×(u1-1)   312
V_r^i      B[i]           {(···,+/*,···)}     u1   -      0    u1         64
V_r^i      B[i-1]         {(···,+/*,···)}     u1   -      0    u1         64
V_r^i      C[j]           {(···,+/*,···)}     1    -      1    0          256

Table 4.1: Array reference categories in Carr's approach.
For example, in Table 4.1, G = u\ for reference A[i][j] after unrolling the outer loop by u\. Category can reuse data that is either computed or fetched from memory by the must be big enough to make the resulting dependence vector of the form (0, • • ■ , 0, dn). In general, a dependence distance di can be made zero if loop i is unrolled by di or more; i.e., Ui > di. For example, A[Ll][j-l] can reuse data that is computed by A[i][j] if the outer loop is unrolled once, as shown in Figure 4.1(b). However, only Ui — di array references that are created by unrolling Ui can exploit reuse, while di array references still carry non-zero dependence in the outer loops as illustrated in Figure 4.1(b). Thus, the group-temporal reuse amount A — u\-l for reference A[i-l][j-l], In the rest of this subsection, we describe how the number of registers (R) and the number of memory accesses per innermost loop iteration (M) for each array reference are Let d be an outgoing dependence vector from a reuse generator or an incoming depen dence vector to a non-generator array reference. Given the smallest d and a set of unroll 71— 1 if di is a constant; G = J J a(dj,ttj), where a(di,x) = < (4.1) i=i 1, otherwise, V 1, if di is not a constant; n —1 A = J\P (d i,u i), where (3(di,x) = x - du if x > df, (4.2) reuse generator in category F® only if the unroll factors of all the outer loops in the nest 56 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. computed as a function of unroll factors and incoming/outgoing d, according to which category V 9, Vr c and V j the reference belongs. 4.2.1 V % We can compute M 0 and I?0 for each array reference in category V 0 as follows: M 0 = G, R 9 = 0 An array reference in category V 0 does not require a register because it is not involved in a reuse instance that can be exploited by this approach. The number of memory accesses inside the innermost loop increases proportionately as we increase the unroll factor of loops in which the reference is not invariant; i.e., G. In Table 4.1, for example, M 0 = u\ and R 9 = 0 for reference A[i][j], 4.2.2 Vrc We can compute and R% for each array reference in category V f: as follows: M f = G — A, Rr = A x (dn + 1) Since array references in do not access memory for A iterations, they access memory only for G — A iterations. Each array reference which can exploit A amount of group- temporal data reuse requires dn registers to exploit data reuse in the innermost loop ln. One additional register is required to be used by the source of the data dependence, since dn+ l values are live in each loop iteration. In Table 4.1, for example, M f; = 1 and Rr = 2 (? ij — 1) for reference A[i-l][j-l]. 57 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 4.2.3 V r J We can compute M / and R{, for each array reference in category V j as follows: Array references in category V / will not access memory in the innermost loop if they are invariant with respect to it. Only unrolling a loop in which the array reference in category V / is not invariant introduces additional (=G) memory accesses, and unrolling an invariant loop requires additional (=G) registers to exploit data reuse in the innermost loop. Note that if the array subscript expression is not invariant in the innermost loop (e.g., C\j}), the data reuse opportunity cannot be exploited by C arr’s approach. 
In Figure 2.1(a), in the first iteration of the inner loop, B[i] will load data from memory to a register, and in the other iterations, it accesses the register. Unrolling the outer loop by u1 will create u1 copies of B[i], and each of them has a different subscript, thus requiring a register to be used in the innermost loop.

4.2.4 Summary

Since we have calculated R as a function of u1, we can now select a specific u1 that satisfies the register constraints. For example, assume we can use 32 registers for data reuse in the target architecture. The overall number of registers required from Table 4.1 is 4u1 - 2. The maximum unroll factor such that 4u1 - 2 ≤ 32 is u1 = 8. (A smaller unroll factor may be chosen by their algorithm based on their balance metrics.)

Figure 4.3 is the final output of Carr's approach. Compared with Figure 2.1(a), the code after unroll-and-jam in Figure 4.3 exploits more data reuse. The final output of Carr's approach incurs 4,792 memory accesses and requires 31 registers.

    for (i = 1; i < 65; i += 8) {
      B0_0 = B[i];   B0_1 = B[i+1]; B0_2 = B[i+2]; B0_3 = B[i+3];
      B0_4 = B[i+4]; B0_5 = B[i+5]; B0_6 = B[i+6]; B0_7 = B[i+7];
      B1_0 = B[i-1]; B1_1 = B[i];   B1_2 = B[i+1]; B1_3 = B[i+2];
      B1_4 = B[i+3]; B1_5 = B[i+4]; B1_6 = B[i+5]; B1_7 = B[i+6];
      for (j = 1; j < 33; j++) {
        if (j == 1) {
          A0_1 = A[i][j-1];   A1_1 = A[i+1][j-1]; A2_1 = A[i+2][j-1];
          A3_1 = A[i+3][j-1]; A4_1 = A[i+4][j-1]; A5_1 = A[i+5][j-1];
          A6_1 = A[i+6][j-1];
        }
        C0 = C[j];
        A0_0 = A[i-1][j-1] + B0_0 + B1_0 + C0 + D[i][j];
        A1_0 = A0_1 + B0_1 + B1_1 + C0 + D[i+1][j];
        A2_0 = A1_1 + B0_2 + B1_2 + C0 + D[i+2][j];
        A3_0 = A2_1 + B0_3 + B1_3 + C0 + D[i+3][j];
        A4_0 = A3_1 + B0_4 + B1_4 + C0 + D[i+4][j];
        A5_0 = A4_1 + B0_5 + B1_5 + C0 + D[i+5][j];
        A6_0 = A5_1 + B0_6 + B1_6 + C0 + D[i+6][j];
        A[i+7][j] = A6_1 + B0_7 + B1_7 + C0 + D[i+7][j];
        A[i][j]   = A0_0; A[i+1][j] = A1_0; A[i+2][j] = A2_0;
        A[i+3][j] = A3_0; A[i+4][j] = A4_0; A[i+5][j] = A5_0;
        A[i+6][j] = A6_0;
        A0_1 = A0_0; A1_1 = A1_0; A2_1 = A2_0; A3_1 = A3_0;
        A4_1 = A4_0; A5_1 = A5_0; A6_1 = A6_0;
      }
    }

Figure 4.3: Final output of Carr's scalar replacement.
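The arithmetic behind Table 4.1 and Figure 4.3 is small enough to spell out. Below is a hedged C sketch of my own (the NONCONST marker encoding a '+' or '*' distance is an assumption of the sketch, not the thesis' internal representation), which reproduces the V_r^c entries for A[i-1][j-1] with u1 = 8:

    #include <stdio.h>

    #define NONCONST -1  /* encodes a '+' or '*' dependence distance */

    static int alpha(int d, int x) { return d == NONCONST ? 1 : x; }  /* Eq. 4.1 */
    static int beta (int d, int x) {                                  /* Eq. 4.2 */
        if (d == NONCONST) return 1;
        return x > d ? x - d : 0;
    }

    int main(void) {
        /* Reference A[i-1][j-1] of Figure 2.1: d = (1,1), category V_r^c, u1 = 8. */
        int d[] = {1, 1}, u[] = {8};
        int n = 2, G = 1, A = 1;
        for (int i = 0; i < n - 1; i++) {  /* products run over the n-1 outer loops */
            G *= alpha(d[i], u[i]);
            A *= beta(d[i], u[i]);
        }
        int dn = d[n - 1];
        int M = G - A;                     /* M_r^c = G - A        -> 1       */
        int R = A * (dn + 1);              /* R_r^c = A x (dn + 1) -> 2(u1-1) */
        printf("G=%d A=%d M=%d R=%d\n", G, A, M, R);  /* G=8 A=7 M=1 R=14 */
        return 0;
    }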
Compared to C arr’s approach, our approach does not require any further unrolling to expose data locality to the innermost loop. In addition, our approach does not have the limitation of applicable dependences of the form (0, 0, • • • 0, dn). We show how we achieve these goals in the rest of this section. The steps of our algorithm are in Figure 4.4. We partition reuse chains that exploit different kind of data reuse across loop iterations into four categories; i.e., 1. C 0: reuse chains that carry no data reuse opportunities. 2. C c: reuse chains that carry only group-temporal data reuse opportunities. 3. C i0: reuse chains that carry only self-temporal data reuse opportunities. 1Loop skewing can be used to make the dependences safe, but it would be extremely hard to compute the consistent dependence distance for the resulting code. 60 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 1. Perform the dependence analysis for a given code. 2. Build a reuse graph, where we partition all array references into reuse chains C®,Cc,C l® , and C',c, as described below.. 3. Compute R and M for each reuse chain. We compute M as part of register pressure control, described in Section 4.3.3. 4. If needed to control register pressure, compute the efficiency metric of each reuse chain, and tile some inner loops to tradeoff some data reuse opportunities with a smaller number of registers. 5. Replace array references with appropriate scalars. 6. Insert register shift/rotation statements to transfer a data item from the generator down to other references in a reuse chain. 7. Perform loop peeling and loop-invariant code motion to initialize or finalize registers for loop-invariant array references. __________________Figure 4.4: Steps of our scalar replacement.____________________ 4. C IC: reuse chains that carry both self-temporal and group-temporal data reuse opportunities. In Figure 2.1(b), the reuse chain {^4[z][j] -* d[?'-l][j-l]} belongs to C c, and {O B[i] — > B[i- 1]0} belongs to G nc, and {C[j\ 0} belongs to Cn0, and {D[i][j}} belongs to C®. Table 4.2 shows the characteristics of the dependences in each category and examples of reuse chains C® Cc c m Cic D 0 {(•••, +/*,•••,o,---)} {(•••,+/*,---,c, •••)} e.g. imm { A m - A[i-i}[j-i}} {C\j\ 0} {O B\i] -h. B[i-1]0} G 2,048 2,048 32 64 A 1,953 63 M 2,048 2,143 32 65 R 0 34 32 2 Table 4.2: Comparison of reuse chain categories. 61 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. from Figure 2.1(b). Note that we compute M and R throughout the entire execution of a loop nest. Our categorization is different from Carr’s in that (7® and C° are related to F j® and V^: , but include both read and write references. and Clc distinguish between references that would be categorized as V j. Since we are exploiting data reuse across outer loops as well as the innermost loop, we need to discriminate reuse chains that exploit data reuse only in invariant loops (category (7*® ) from reuse chains that exploit data reuse in the invariant loops as well as with a constant dependence distance in some loops (category Clc). We also categorize both read and write references, which allows us to eliminate redundant writes back to memory. In the next section, we assume that there are enough registers to exploit all the data reuse opportunities exposed by the data reuse analysis, and that the loop bounds are constant. 
We relax these assumptions by generalizing the algorithm in Section 4.3.3.

4.3.1 Computing R and M for Each Reuse Chain

The major algorithmic difference between Carr's approach and our approach is that we compute the number of registers and the number of memory accesses for each reuse chain, while Carr's approach does so for each individual array reference. Computing R for reuse chains is more precise than computing it for individual array references; without looking at where the reused data is coming from (the reuse generator or elsewhere), it is hard to compute exactly the register requirement for a given dependence vector.

Because the data fetched by a reuse generator may be used several times until the last reference in a reuse chain uses it, the dependence vector d between the generator and the last reference in a reuse chain decides the number of registers required to fully exploit the possible data reuse. For a given d and a set of loop iteration counts, we compute A and G for each data reuse chain as follows:

G = ∏_{i=1}^{n} α(d_i, I_i),   A = ∏_{i=1}^{n} β(d_i, I_i)   (4.3)

where α(d_i, x) and β(d_i, x) are defined in Equations 4.1 and 4.2. The above equations are functions of I rather than u, since our approach exploits data reuse in the entire loop nest and does not unroll the loops. In Table 4.2, for example, G = 2,048 and A = 1,953 for reuse chain {A[i][j] → A[i-1][j-1]}.

In the rest of this subsection, we describe how to compute the number of registers and memory accesses for each type of reuse chain for a given dependence vector d = (d1, d2, ···, dn) between the generator and the last array reference in the reuse chain.

4.3.1.1 C^∅

M and R required for reuse chains in C^∅ can be computed as follows:

M^∅ = G,   R^∅ = 0

Reuse chains in C^∅ are equivalent to Carr's V^∅ in that array references in C^∅ are not involved in any data reuse.

4.3.1.2 C^c

Reuse chains in C^c capture only group-temporal reuse. Therefore, M and R required for reuse chains in C^c can be computed as follows:

M^c = 2G - A,   R^c = e(d, n) + 1

where the function e(d, n) is defined in Equation 2.1.

[Figure 4.5: Memory/register access behaviors of full and partial data reuse. (a) Full data reuse for a reuse chain with reuse distance (d1, d2); (b) full data reuse for a reuse chain with dependence vector (0,+); (c) partial data reuse. Each diagram divides the iteration space into a region of register accesses and a region of memory accesses.]

The reuse generator accesses memory in the entire iteration space, which leads to G memory accesses. The other references in the same reuse chain fetch data from memory only in the initial d_i iterations of each loop l_i. Figure 4.5(a) illustrates the general data access behavior for a reuse chain with reuse distance (d1, d2). In the shaded region, which represents the group-temporal reuse amount A, the sink reference of a reuse instance gets the necessary data from the source reference. However, the sink references access memory in the L-shaped white region.

To exploit data reuse for reuse chains in C^c, a series of registers is necessary to keep the data until it is fully reused. The reuse generator fetches data from memory to the first register in the series, and each register A_i shifts the data to its neighboring register A_{i+1} in every iteration of the innermost loop.
Finally, after a number of innermost loop iterations corresponding to the reuse distance e(d, n), the last register A_{e(d,n)} gets the necessary data. Meanwhile, the reuse generator keeps fetching a new datum from memory in every iteration; i.e., G memory accesses. The L-shaped white region can be computed by subtracting the shaded region from the whole region; i.e., G - A. Therefore, the total number of memory accesses for a reuse chain in C^c is G + (G - A) = 2G - A, since the generators as well as the other references belong to C^c; in Carr's approach, by definition, the generators do not belong to V_r^c but to V^∅. In Table 4.2, for example, M^c = 2,143 and R^c = 34 for reuse chain {A[i][j] → A[i-1][j-1]}.

4.3.1.3 C^{i∅}

A dependence vector of reuse chains in C^{i∅} consists of '+', '*', and 0. Let d_j be the first non-zero dependence distance, which should be '+' in this reuse chain category; i.e., (0,···,0,+,···). Then, M and R required for each reuse chain in C^{i∅} can be computed as follows:

M^{i∅} = G,   R^{i∅} = ∏_{i=j}^{n} α(d_i, I_i), where d_j = '+'.

The data in a register can be reused without any additional registers during the entire execution of the loops corresponding to '+' or '*', where the array reference is invariant. Figure 4.5(b) illustrates the general data access behavior for a reuse chain with dependence vector (0,+). Array elements along the dimensions corresponding to the dependence distance 0 are kept in a series of registers for the outer loops corresponding to dependence distance '+' or '*' to exploit self-temporal reuse. In the dimensions corresponding to leading zero dependence distances (i.e., dimensions 1 to j-1 if j > 1), the reuse chain does not exploit data reuse at all, so they do not affect the number of registers. In Table 4.2, for example, M^{i∅} = 32 and R^{i∅} = 32 for reuse chain {C[j]↺}.

4.3.1.4 C^{ic}

Reuse chains in C^{ic} contain self-loop edges and data reuse edges between different array references. As such, a dependence vector of reuse chains in C^{ic} consists of '+', '*', and constants. In this category, at least one constant should be non-zero, as opposed to Carr's V_r^i, where this is not required.

Let d_j be the first non-constant dependence distance; i.e., (c,···,c,+/*,···). Then, the number of memory accesses remaining after scalar replacement and the number of registers required for reuse chains in C^{ic} can be computed as follows:

M^{ic} = 2G - A,   R^{ic} = (e(d, j-1) + 1) × ∏_{i=j}^{n} α(d_i, I_i)

In Table 4.2, for example, reuse chain {↺B[i] → B[i-1]↺} can exploit self-temporal data reuse in the inner loop (corresponding to '*') and group-temporal data reuse in the outer loop (corresponding to '1'). Since the two references are invariant in the inner loop, we can determine the reuse distance in terms of the outer loop, which is 1. Thus, R^{ic} = 1 + 1 = 2 for this chain.

In reuse chain category C^c, a non-zero constant dependence distance c means the data is reused after c iterations, and thereafter, the data will not be used again by the same array reference. On the other hand, in C^{ic} the data can be used again in some outer loops corresponding to dependence distance '+' or '*'. Consider dependence vector (+,1). A data item is reused in the next iteration of the inner loop. In addition, the data is reused during the entire iterations of the outer loop.
Thus, a number of registers equal to the inner loop iteration count is necessary, no matter how small c is. Therefore, a first non-constant dependence distance in dimension j means that exploiting self-temporal reuse in dimensions j through n requires the same number of registers as R^{i∅}; i.e., ∏_{i=j}^{n} α(d_i, I_i). Further, the constant dependence distances in the outer loops before dimension j require e(d, j-1) + 1 sets of these registers, in a similar way to computing R^c.

For example, consider the dependence vector (1,*,2) between A[i][k] and A[i-1][k-2] in Figure 4.6(a). To exploit self-temporal reuse in the second dimension, I_k registers are required to keep the array elements accessed in the innermost loop. Further, since the dependence distance in the first dimension is 1, the total number of registers required for this reuse chain ends up multiplying 2 by I_k. Intuitively, this corresponds to I_k registers to exploit self-temporal reuse in the second dimension, and an additional I_k registers to exploit group-temporal reuse in the first dimension. Thus, G = 64, A = 63, M^{ic} = 65, and R^{ic} = 2 for reuse chain {↺B[i] → B[i-1]↺} in Table 4.2.

    for (i = 1; i < 64; i++)
      for (j = 0; j < 33; j++)
        for (k = 2; k < 8; k++)
          if (j == 0)
            ... = A[i][k] + A[i-1][k-2];

(a) Original code.

    for (i = 1; i < 64; i++) {
      for (j = 0; j < 33; j++) {
        for (k = 2; k < 8; k++) {
          if (j == 0) {
            A0 = A[i][k];
            if (i == 1 || k <= 3) A8 = A[i-1][k-2];
            ... = A0 + A8;
            rotate_registers(A0, ···, A5);
            rotate_registers(A6, ···, A11);
          }
        }
      }
      shift_register_sets({A0, ···, A5}, {A6, ···, A11});
    }

(b) Scalar replaced code.

Figure 4.6: An example of C^{ic} with D = (1,*,2).

4.3.2 Summary

Figure 4.7(a) shows the result of our scalar replacement applied to the code in Figure 2.1(a). The code in Figure 4.7(a) does not access any array element more than once, fully exploiting the data reuse opportunities at the cost of using more registers than Carr's approach. In the next subsection, we present how we control the register pressure without losing many data reuse opportunities.

    for (i = 1; i < 65; i++) {
      B0 = B[i];
      if (i == 1) B1 = B[i-1];
      for (j = 1; j < 33; j++) {
        if (i == 1) {
          A33 = A[i-1][j-1];
          C0 = C[j];
        } else if (j == 1)
          A33 = A[i-1][j-1];
        A0 = A33 + B0 + B1 + C0 + D[i][j];
        A[i][j] = A0;
        shift_registers(A0, ···, A33);
        rotate_registers(C0, ···, C31);
      }
      B1 = B0;
    }

(a) Our scalar replacement for full data reuse

    for (jt = 1; jt < 33; jt += 16)
      for (i = 1; i < 65; i++) {
        B0 = B[i];
        if (i == 1) B1 = B[i-1];
        for (j = jt; j < jt+16; j++) {
          if (i == 1) {
            A17 = A[i-1][j-1];
            C0 = C[j];
          } else if (j == jt)
            A17 = A[i-1][j-1];
          A0 = A17 + B0 + B1 + C0 + D[i][j];
          A[i][j] = A0;
          shift_registers(A0, ···, A17);
          rotate_registers(C0, ···, C15);
        }
        B1 = B0;
      }

(b) After tiling the inner loop for partial data reuse

Figure 4.7: An example of our scalar replacement.
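Pulling the four categories together, the per-chain bookkeeping of Section 4.3.1 can be summarized in a few lines of C. This is a hedged sketch of my own, not the compiler's implementation; it assumes, consistent with Equation 2.1, that the reuse distance e(d, k) is the linearization of the leading k distances by the loop iteration counts:

    #include <stdio.h>

    #define NONCONST -1  /* encodes a '+' or '*' dependence distance */

    static long alpha(int d, long x) { return d == NONCONST ? 1 : x; }  /* Eq. 4.1 */
    static long beta (int d, long x) {                                  /* Eq. 4.2 */
        if (d == NONCONST) return 1;
        return x > d ? x - d : 0;
    }

    /* Reuse distance e(d, k) over the first k loops, assumed here to be the
       linearization of d by the iteration counts I (cf. Equation 2.1). */
    static long e_dist(const int d[], const long I[], int k) {
        long e = 0;
        for (int i = 0; i < k; i++) {
            long inner = 1;
            for (int m = i + 1; m < k; m++) inner *= I[m];
            e += d[i] * inner;
        }
        return e;
    }

    int main(void) {
        /* Reuse chain {A[i][j] -> A[i-1][j-1]} of Table 4.2: d = (1,1), I = (64,32). */
        int  d[] = {1, 1};
        long I[] = {64, 32};
        long G = 1, A = 1;
        for (int i = 0; i < 2; i++) { G *= alpha(d[i], I[i]); A *= beta(d[i], I[i]); }
        long M = 2 * G - A;             /* category C^c */
        long R = e_dist(d, I, 2) + 1;
        printf("G=%ld A=%ld M=%ld R=%ld\n", G, A, M, R);  /* G=2048 A=1953 M=2143 R=34 */
        return 0;
    }

The other categories follow the same pattern: C^{i∅} multiplies α over the trailing dimensions only, and C^{ic} scales that product by e(d, j-1) + 1.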
4.3.3 Generalizing the Algorithm

In architectures where there are not enough registers to exploit all possible data reuse opportunities, we have to reduce the required number of registers to avoid register spills. Further, a non-constant loop bound prevents the reuse analysis from computing the reuse distance e(d, n) in Equation 2.1. Fortunately, we can use the same technique to deal with both of these issues.

Partial data reuse trades off data reuse opportunities for lower register pressure. We exploit partial data reuse using a code transformation called tiling, which divides computation and data into blocks and brings the data reuses closer in time. Partial data reuse exploits data reuse only within a tile and introduces memory accesses across tiles. Figure 4.5(c) shows the difference in memory access behaviors between full and partial data reuse. In Figure 4.5(a), d1 × I2 + d2 + 1 registers are used to exploit full data reuse. On the other hand, in Figure 4.5(c), only d1 × I2/2 + d2 + 1 registers are used, at the cost of more memory accesses between the tiles.

The loop to be tiled and the tiling factors are decided based on the dependence information. Tiling each loop i by T_i will reduce the reuse distance, since the tile size T_i takes the place of the iteration count I_i in the reuse distance computation. From this tiled reuse distance, we can compute the number of registers and the number of memory accesses required for the tiled code. Figure 4.7(b) shows the code after tiling the inner loop j for partial data reuse. Array B is read twice, and some elements of array A at the border of the tile are read twice. However, arrays A and C use only half of the registers, since the reuse distance associated with these references has been reduced by half. Thus, the code in Figure 4.7(b) incurs 4,416 memory accesses and requires 36 registers.

Using tiling, we can also derive inner loop nests with constant loop bounds when some bounds are not constant. Some iterations may remain after tiling a loop if the iteration count is not divisible by the tiling factor. Index set splitting divides the index set of a loop into multiple portions, replicating the body of the loop as appropriate [1]. We use index set splitting to isolate the residual (I_k mod T_k) iterations from the main tiled loop l_k. As a result, each tile of iterations contains a fixed, constant iteration count T_k.

Tiling is not always a legal transformation. An alternative way to control the register pressure is selective data reuse, which selectively exploits data reuse for some reuse chains depending on their register requirements and reuse benefits. If a reuse chain requires many registers but eliminates few memory accesses, it is better to give up data reuse for it. Thus, we compute the register efficiency for each reuse chain c as follows:

Efficiency_c = M_c / R_c,

where M_c refers to the number of memory accesses eliminated by scalar replacement and R_c to the number of registers necessary for reuse chain c. As Carr did, this optimization problem can be solved using dynamic programming, as a knapsack problem. For a given set of pairs {(R_c, Efficiency_c)}, we want to select a collection of reuse chains C that maximizes the overall efficiency within the constraint Σ_{c∈C} R_c ≤ R. The solution of this optimization problem may use fewer than R registers. If registers remain, we can use them to exploit data reuse for some part of the reuse edges of a low-efficiency reuse chain.
Our approach can apply selective data reuse to filter out extremely low-efficiency reuse chains, and apply partial data reuse for the rest. Because the legality tests for tiling and unroll-and-jam are essentially the same, this approach is always applicable whenever Carr's algorithm could exploit reuse via unroll-and-jam, and it may be applicable in additional cases. If tiling is not legal and iteration counts are unknown at compile time, our approach eliminates redundant read and write array references only in the innermost loop, which can still eliminate more redundant array accesses.

4.4 Scalar Replacement Transformation

Once the number of registers is computed, a series of scalar variables, S0, S1, ···, S_{R-1}, is introduced, where R is the number of registers to be used to replace the array references in a reuse chain. Since there can be more than one reuse chain for an array, we actually name registers with a tuple (chain id, register id). In this chapter, we omit the chain id for brevity. An array reference is replaced with an appropriate scalar variable depending on the reuse distance between its generator and the array reference within the same reuse chain.

There are two cases where a load/store statement needs to be inserted. Each reuse generator that is a read reference needs a register initialization "S_i = A[ ]" before the register is used, and the finalizer reference in a reuse chain requires a memory store "A[ ] = S_i". Intuitively, the insertion position of these initialization/finalization statements is the innermost loop in which the array reference is not invariant, thereby accomplishing loop-invariant code motion as described in Section 4.4.2. In this section, we assume there is no control flow in the code. We describe how to handle control flow in Section 4.5.

4.4.1 Register Shift/Rotation

Register shifting involves copying the contents of a register A_i to its neighboring register A_{i+1} in every iteration of the innermost loop in which the array references are not invariant. An introduced scalar variable receives data from its generator after the number of iterations corresponding to its reuse distance. Let r represent the reuse distance between the generator and the last array reference in a reuse chain. Then, to achieve reuse, a series of register copy statements {(S_{i+1} = S_i) | r-1 ≥ i ≥ 0} is inserted at the appropriate position. In Figure 4.7(a), the shift_registers operation is equivalent to such a series of register copy statements. The insertion position of the register shifts for each reuse chain can be decided by the same method as the initialization/finalization point of registers.

In the case of reuse chain categories C^{i∅} and C^{ic}, data reuse occurs repeatedly in the loops where the array reference is invariant. The rotate_registers operation shifts the data in a series of registers and rotates the last one into the first position.

In FPGA-based systems, these register-to-register copy operations can be performed in parallel. Further, these parallelizable operations are free in some sense, since they are executed at the same time as the loop index variable increment/decrement and bound check operations. A similar strategy can be applied on the Itanium, using its rotating registers [42].
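In software, the two operations can be modeled as below (a sketch of my own, not the generated hardware: on the FPGA all of these copies happen in the same cycle, whereas C serializes them):

    /* shift_registers: S[i+1] = S[i] for r-1 >= i >= 0, copying in reverse
       order so that no value is overwritten before it has been propagated. */
    static void shift_registers(int s[], int n) {
        for (int i = n - 1; i > 0; i--)
            s[i] = s[i - 1];
    }

    /* rotate_registers: like a shift, but the last register wraps around
       into the first position, for the invariant categories C^{i∅} and C^{ic}. */
    static void rotate_registers(int s[], int n) {
        int last = s[n - 1];
        shift_registers(s, n);
        s[0] = last;
    }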
In an architecture where register shifts and rotates are more costly, a hybrid strategy could be used that applies unroll-and-jam to reduce the reuse distance in some cases, so as to reduce the number of shifts and rotates performed. We can eliminate the shift_registers operation for a reuse chain with dependence vector d = (d1,···,dn) by unrolling each loop l_i by d_i, since the resulting dependence distances are either 0 or 1 after unrolling. Therefore, to eliminate the entire shift_registers across all k reuse chains, we unroll-and-jam each loop l_i by LCM(d_i^1, d_i^2, ···, d_i^k), where d_i^j represents the i-th dependence distance of d of reuse chain j. Note that this could be done without increasing the register requirements or impacting reuse, but the compiler would need to explore the tradeoff space between the costs of shifts and rotates and the overheads associated with unroll-and-jam.

4.4.2 Loop Peeling and Loop-Invariant Code Motion

Loop peeling removes the first (or last) several iterations of the loop body and makes them separate code before (or after) the loop execution [1]. Loop-invariant code motion moves loop-invariant code out of the loop if doing so will not change the meaning of the original program [1]. Scalar replacement uses loop peeling in combination with loop-invariant code motion for register initialization and finalization purposes.

We see in Figure 4.7(a) that values for the A33 register are initialized on the first iteration of loop j and on the first iteration of loop i. For clarity it is not shown here, but the code generated by our compiler actually peels the first iteration of loop j and loop i instead of including these conditional loads, so that the main body of both loops has the same number of memory accesses. Further, we can avoid the overhead of condition checks in every loop iteration. Similarly, the B1 register is initialized on the first iteration of loop i. In addition, array reference B[i] is invariant with respect to loop j, so the initialization of registers B0 and B1 is moved outside the inner loop j using loop-invariant code motion. Similarly, array C is invariant with respect to loop i, so it is initialized in the peeled loop body after the first iteration of loop i is peeled. Within the main loop body, only a read reference to array D and a write reference to array A remain.

Our approach performs loop peeling on multiple loops in a nest. To initialize the scalar variables at the beginning of the loop iterations, each loop needs to be peeled by the maximum dependence distance among all true and input dependence vectors. To finalize the array elements at the end of the loop iterations, each loop needs to be peeled by the maximum dependence distance among all output dependence vectors; namely,

Peeling factor before l_i = max_{∀d ∈ {D^t ∪ D^i}} d_i,
Peeling factor after l_i = max_{∀d ∈ D^o} d_i.

In each peeled iteration, all the array references are initialized and finalized unless their subscript expressions are not invariant with respect to the enclosing loop. However, since we peel each loop by the maximal dependence distance among all dependence vectors, some array references whose dependence distance is less than the peeling factor do not need to be initialized.
Intuitively, we do not initialize/finalize the lexically identical array references in some last/first peeled iterations, since they are already initialized in an earlier peeled iteration or will be finalized in a later peeled iteration. Lexically identical references are present in the last (Peeling factor before l_i - d_i) peeled iterations for an incoming true or input dependence vector d, and equivalently, in the first (Peeling factor after l_i - d_i) peeled iterations for an outgoing output dependence vector d. The lower bound and/or upper bound of the remaining loop needs to be updated to reflect the decreased iteration count of the loop. In addition, the array indices must be updated appropriately, since the peeled body is no longer enclosed by that loop.

Figure 4.8 shows examples of peeling before and after a loop. Looking at Figure 4.8(a), the loop is peeled twice because the maximum dependence distance is two. In the peeled iterations, all scalar variables that replace array references are initialized except the reference A[i] that is shown in a box in Figure 4.8(a), since its dependence distance is smaller than the peeling factor in the second peeled iteration. Similarly, Figure 4.8(b) shows an example of peeling after a loop to finalize the array elements. Since the maximum dependence distance is one for loop i and two for loop j, the inner loop is peeled twice and the outer loop is peeled once. In this case, B[i-1][j-2] in the first peeled iteration is not written back to memory, because the lexically identical array reference in the second peeled iteration will overwrite the same array element.

    peeling section (0):  A[i]    A[i-1]   A[i-3]
    peeling section (1):  A[i+1]  [A[i]]   A[i-2]
    for (i = ...) { ... }

(a) Peeling sections before the loop; the bracketed reference is not initialized

    for (i = ...) { for (j = ...) { ... } }
    peeling section (0,0):  B[i][j+1]  [B[i-1][j-2]]
    peeling section (0,1):  B[i][j]    B[i-1][j-2]

(b) Peeling sections after the loop; the bracketed reference is not finalized

Figure 4.8: Loop peeling example.

Although at first glance the code size appears to increase with peeling, high-level synthesis will usually reuse the operators between the peeling sections and the original loop body, so that the code growth does not correspond to a growth in the hardware design.

4.5 Control Flow

As formulated in Section 4.3, scalar replacement decisions are governed by data dependence analysis. Data dependence analysis does not take control flow into account, yet control flow may prohibit possible data reuse. A problem arises when a data reuse generator and/or a finalizer is conditionally executed, since other array references in the same reuse chain may access an uninitialized register, and the final value in the register may not get written to memory.
Although control flow must be a consideration in a scalar replacement algorithm, it is somewhat orthogonal to the analysis for identifying reuse that is the focus of this chapter. The issues are illustrated by the example codes in Figure 4.9. In each case, we show the original code on the left and the output of scalar replacement on the right.

(a) Conditional input dependence:
    if (...)                          A0 = A[i];
      ... = A[i];           =>        if (...)
    ... = A[i-1];                       ... = A0;
                                      ... = A1;
                                      A1 = A0;

(b) Conditional true dependence:
    if (...)                          if (...)
      A[i] = ...;           =>          { A0 = ...; A[i] = A0; }
    ... = A[i-1];                     else
                                        { A0 = A[i]; }
                                      ... = A1;
                                      A1 = A0;

(c) Conditional output dependence:
    A[i] = ...;                       A0 = ...;
    if (...)                =>        if (...)
      A[i-1] = ...;                     { A1 = ...; }
                                      A[i-1] = A1;
                                      A1 = A0;

(d) Unrealizable input dependence:
    if (...)                          A0 = A[i];
      ... = A[i];           =>        if (...)
    else                                ... = A0;
      ... = A[i];                     else
                                        ... = A0;

Figure 4.9: Issues of control flow in scalar replacement.

In Figure 4.9(a), the data reuse generator A[i] is conditionally executed, and A[i-1] may reuse data from its potential reuse generator. Similarly, in Figure 4.9(b), the conditional write to A[i] may provide data to the read access A[i-1]. In Figure 4.9(c), the write access to A[i] may be unnecessary if the condition holds in the next loop iteration. In Figure 4.9(d), A[i] is read regardless of the condition.

In [9], Carr and Kennedy showed how to handle control dependence using a partial redundancy elimination technique and data flow analysis, assuming there is no backward jump and no multiple loop exits. The essential property of partial redundancy elimination is that it should never increase the number of computations performed along any path [40, 9]. Thus, Carr's approach does not eliminate the potentially redundant array references in Figure 4.9. Further, to make the partial redundancy fully redundant, it inserts register load statements on every possible path in which a datum is needed but not generated [9]. In some cases where read memory accesses are speculatively executed, this may introduce some unnecessary overhead.

We handle conditional reuse generators and finalizers by combining control flow analysis [1] and data dependence analysis. Our approach replaces partially-redundant memory accesses even if this may introduce some unnecessary memory accesses in some first/last iterations. Simply stated, we initialize/finalize outside the conditionally executed reuse generators and finalizers. We exploit data reuse only for reuse chains which require at least one array access no matter what control flow is taken. Since the same array element must be read by another array reference within the same reuse chain in a later iteration, or the same array element must have been written in a previous iteration, this approach does not increase the number of memory accesses except in the boundary iterations, the number of which is less than the reuse distance. In general, the overall number of memory accesses may actually even decrease, since we can eliminate partially-redundant memory accesses in the other remaining loop iterations. Furthermore, we can suppress these unnecessary memory accesses by peeling those boundary iterations, where we keep the generators/finalizers conditionally initialized/finalized. Another difference is that our approach minimizes the number of register load statements, as illustrated in Figure 4.9(d), whereas Carr's approach does not [9].

Figure 4.10 shows the algorithm to handle conditionally executed reuse generators and finalizers. We first define the following terms to describe our algorithm:

A: a conditional reuse generator or finalizer.

A': another array reference that is unconditionally executed in the same reuse chain.

A δ^f A': flow (true) dependence between A and A'.

A' δ^o A: output dependence between A' and A.

A δ^i A': input dependence between A and A'.

X: the basic block to which A belongs; i.e., A ∈ X.

Y: the basic block upon which X is control dependent; i.e., Y →cd X.

DOM(Y): the dominators of Y; i.e., the predecessors of the IF structure, which have to be executed.

PDOM(Y): the postdominators of Y; i.e., the successors of the IF structure, which have to be executed.

IPDOM(X): the immediate postdominator of X; i.e., the first basic block right after the IF structure.

PDF(X): the postdominance frontier of X; i.e., all the basic blocks that are control dependent on Y.
1. If ∃A' such that A' belongs to PDOM(Y) and ∃ A δ^i A', initialize the scalar variable for A in Y.

2. If ∃A' such that A' belongs to PDOM(Y) and ∃ A δ^f A', initialize the scalar variable for A in PDF(X) - DOM(X), except on the control paths that include A', where the dependence vector between A and A' is loop-independent.

3. If ∃A' such that A' belongs to DOM(Y) and ∃ A' δ^o A, finalize A in IPDOM(Y).

4. If ∀A': A' belongs to PDF(X) - DOM(X) and ∃ A δ^i A' or A δ^f A' which is loop-independent, initialize the scalar variable for A in Y.

Figure 4.10: Handling conditional generators/finalizers.

Simply stated, assuming enough loop iterations are peeled to initialize/finalize the registers, the algorithm in Figure 4.10 is the following:

1. Initialize the scalar variables for the read accesses to conditional reuse generators before the IF condition, provided that the reuse chain carries any data reuse, as illustrated in Figure 4.9(a).

2. Initialize the scalar variables for the write accesses to conditional reuse generators on the other control path of the IF condition, provided that the reuse chain carries any data reuse, as illustrated in Figure 4.9(b).

3. Finalize the conditional finalizers after the IF condition, provided that the reuse chain carries any output dependence, as illustrated in Figure 4.9(c).

4. Initialize the scalar variables for the read accesses to the conditional reuse generators before the IF condition if the conditional generator is accessed on all control paths of the condition; i.e., there is a loop-independent input dependence among all control paths of the condition, as illustrated in Figure 4.9(d).

Consider the example code in Figure 4.9(a). Initializing the register A0 for the conditional generator A[i] will ensure the reuse chain is initialized even if the condition does not hold. The shift_registers operation ensures the correct data is in the registers by the time it is referenced. If the condition holds at least once, the transformed code has a reduced number of memory accesses, because the non-generator read reference A[i-1] in the same reuse chain must access the same array element in a later iteration. However, if the condition does not hold in the last iteration, the initialization of A0 from A[i] is unnecessary. If at least one iteration is peeled after the loop, we can suppress the initialization of the conditional generator in the peeled code and keep it inside the IF condition just like the original code, to avoid the unnecessary memory access.

In the example code in Figure 4.9(b), initializing the conditional generator A[i], which is the source of a true dependence, on the other control flow path of the IF condition ensures that the register that replaces the conditional finalizer contains the correct data.
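In loop context, this transformation plus peeling looks as follows. This is a hedged sketch of my own, with cond() and f() as hypothetical stand-ins for the branch predicate and the computed value:

    #include <stdio.h>
    #define N 16

    static int cond(int i) { return i % 3 == 0; }  /* hypothetical predicate */
    static int f(int i)    { return 10 * i; }      /* hypothetical producer  */

    int main(void) {
        int A[N];
        for (int i = 0; i < N; i++) A[i] = i;

        int sum = 0;
        int A0, A1 = A[0];                 /* initialize for the first read of A[i-1] */
        for (int i = 1; i < N - 1; i++) {  /* last iteration peeled off below */
            if (cond(i)) { A0 = f(i); A[i] = A0; }
            else         { A0 = A[i]; }    /* compensating load on the other path */
            sum += A1;                     /* reads A[i-1] from a register */
            A1 = A0;                       /* register shift */
        }
        if (cond(N - 1)) A[N - 1] = f(N - 1);  /* peeled: compensating load suppressed */
        sum += A1;
        printf("%d\n", sum);
        return 0;
    }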
Simply stated, we want to initialize the write accesses to conditional reuse generators only if the initialized data is not overwritten. If the condition does not hold in the last iteration, the initialization of A0 is redundant, since there are no more iterations to exploit data reuse. Again, if at least one iteration is peeled after the loop, we can suppress this additional memory access.

In Figure 4.9(c), the conditional finalizer A[i-1] is finalized after the IF condition to ensure that the final data in a register is written back to memory, even if that happens some iterations later. In addition, the register A1 contains the data that was in register A0 in the previous iteration, via the shift_registers operation. If the condition does not hold in the first iteration, the finalization of A[i-1] after the IF condition is unnecessary. If at least one iteration is peeled before the loop, we can suppress the finalization in the peeled code.

In short, if there is a conditional data reuse generator and an unconditional read reference and the condition does not hold, the transformation becomes similar to prefetching, in that the register initialization fetches the data in advance. If there is a conditional finalizer and an unconditional write reference and the condition does not hold, it becomes similar to a delayed write, in that the final data is written back to memory in a later iteration.

As illustrated in Figure 4.9(d), initializing the scalar variables for the read accesses to unconditional reuse generators that are enclosed within the IF condition does not generally decrease the actual number of memory accesses, because only one of the control flow paths will be executed anyway. However, in the case of speculative memory fetches (such as in DEFACTO), the number of memory accesses can be decreased by pulling the unconditional read array access out of the IF condition.

In the case of nested control flow, the above rules move the initialization and finalization to the outermost region where the reuse chain resides.

4.6 Summary

In this chapter, we have described an algorithm to eliminate redundant memory accesses by replacing array references with scalar variables, which are eventually mapped to registers. Our approach has extended standard scalar replacement in several ways.

First, we increase the applicability by eliminating the necessity for unroll-and-jam. Unroll-and-jam is not always legal, and the unroll factors for scalar replacement may conflict with the desired factors for other optimizations such as instruction-level parallelism.

Secondly, we increase the candidates for data reuse both by exploiting data reuse across multiple loops in a nest (not just the innermost loop) and by removing redundant write memory accesses (in addition to redundant reads). Our data reuse analysis can identify all the data reuse opportunities within a loop nest. As long as there are enough registers, our approach can eliminate both the redundant read and the redundant write memory accesses we have identified, without requiring unroll-and-jam.

Thirdly, we provide a flexible strategy to trade off between exploiting reuse opportunities and reducing the register requirements of scalar replacement. In some architectures where there are not enough registers, not all the data reuse opportunities we have identified can be exploited.
We address this challenge in two ways. First, we trade off some data reuse opportunities for lower register pressure using loop tiling: we exploit full data reuse only within a tile and introduce some extra memory accesses across tiles. Second, we reduce register pressure by selectively exploiting data reuse for reuse chains depending on their efficiency metric.

Lastly, we improve the applicability of scalar replacement in the presence of control flow. We aggressively eliminate partially redundant memory accesses, even though this may introduce some unnecessary memory accesses at some boundary iterations in rare cases. However, the overall number of memory accesses decreases, since we eliminate many more partially redundant memory accesses in the remaining iterations. Further, we can peel the boundary iterations to avoid the unnecessary memory accesses.

Using 32 or fewer registers for this technique, we observe a 58 to 90 percent reduction in memory accesses and speedups of 2.34 to 7.31 over the original programs.

This new approach was motivated by the opportunities arising in compiling to FPGAs: (1) the amount of logic that can be configured for registers far exceeds the number of registers in a conventional system (on the order of several hundred to a few thousand); and (2) data can be copied between registers in parallel. For these reasons, the compiler stores reused data in registers across loops in a nest, even though the reuse distance can be potentially large, and frequently copies between registers.

While FPGAs represent a unique computing platform, these extensions to scalar replacement will become increasingly important due to a number of current architectural trends, such as the larger register files of 64-bit machines (e.g., the Itanium) and software-managed buffers. We foresee a growing need for techniques such as that described in this chapter as the performance gap between computation and memory speed grows, to effectively reduce the memory traffic and maximize the memory bandwidth utilization.

Chapter 5

CUSTOM DATA LAYOUT

Although scalar replacement improves data locality on chip, we still have to fetch some essential data from memory. The fine-grain parallelism exposed by unroll-and-jam, as discussed in Chapter 3, may be limited by the bandwidth to memory for these remaining array accesses. In this chapter, we describe a generalized approach to deriving a custom data layout in multiple memory banks for array-based computations, to facilitate high-bandwidth parallel memory accesses for the memory accesses remaining after scalar replacement.

Most compiler solutions assume a fixed data layout in memories, usually standard row or column major, and transform the code to access memory in parallel whenever possible. In contrast, our approach derives an application-specific layout for array data by analyzing the access patterns of the code. Thus, the compiler has more degrees of freedom in transforming code, and can preserve memory parallelism while accomplishing other optimization goals. This difference is particularly important when the approach is used in conjunction with code reordering transformations, such as the loop nest transformations commonly performed on array-based computations.
When used in conjunction with unroll-and-jam, our experimental results on five multimedia kernels show a speedup of up to 9.65 and an 87.5% reduction in memory accesses on eight memories, as compared to mapping data to a single memory.

    int A[32][16];
    int B[32][16];
    for (i = 0; i < 32; i++)
      for (j = 0; j < 16; j++)
        A[i][j] = B[i][j] + 1;

(a) Original code

    for (i = 0; i < 32; i += 2)
      for (j = 0; j < 16; j += 2) {
        A[i][j]     = B[i][j]     + 1;
        A[i][j+1]   = B[i][j+1]   + 1;
        A[i+1][j]   = B[i+1][j]   + 1;
        A[i+1][j+1] = B[i+1][j+1] + 1;
      }

(b) After unroll-and-jam

    for (i = 0; i < 16; i++)
      for (j = 0; j < 8; j++) {
        A00[i][j] = B00[i][j] + 1;
        A01[i][j] = B01[i][j] + 1;
        A10[i][j] = B10[i][j] + 1;
        A11[i][j] = B11[i][j] + 1;
      }

(c) Final code

Figure 5.1: An example of custom data layout.

The organization of the rest of this chapter is as follows. The next section describes a set of examples to motivate the approach. Section 5.2 presents the overview of the custom data layout algorithm for a single loop nest. Section 5.3 describes the analyses and transformations to identify the parallel memory accesses in virtual memories, and how we reorganize array data from a naive layout in a single memory to a custom layout in multiple memories, and back. Section 5.4 describes how to map virtual memories to a limited number of physical memories. Section 5.5 extends the algorithm to the entire program. In Section 5.6, we discuss the degenerate cases, where we incorporate traditional data layout schemes. In Section 5.7, we present a summary.

[Figure 5.2: Comparison of three data layouts across four memory banks: (a) a naive data layout, with all of arrays A and B in memory M0; (b) modulo unrolling, with the elements of A and B distributed cyclically across the four banks along the lowest-order dimension; (c) custom data layout, with the elements distributed across the four banks according to the 2x2 access pattern of the unrolled loop body.]

5.1 Motivation

There are a number of techniques being developed to increase the likelihood that nearby memory accesses will be to independent memories, thus enabling parallel memory accesses. A number of techniques examine the low-level output of the compiler and, given a fixed mapping of data to memory, reorder individual accesses and operations to increase memory parallelism [58].

We illustrate the differences among several data layouts using the example code in Figure 5.1(a). At the compiler level, the standard approach is to view data as being mapped to a single memory. We refer to this as a naive data layout. Figure 5.1(b) shows the code after we unroll loops i and j by one (equivalent to an unroll factor of two) and jam the copies of loop j. Figure 5.2(a) depicts a possible naive layout for the transformed code shown in Figure 5.1(b). The entire array A and the entire array B are mapped into memory M0. Even though the code is transformed to execute four loop iterations in parallel, it will execute serially due to memory stalls. Additionally, other memories will be idle while the computation stalls for data.
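The three layouts of Figure 5.2 differ only in the function that assigns an element A(i,j) to one of the four banks M0-M3. The following is a hedged paraphrase of my own (modulo unrolling and the custom layout are discussed in detail below), which makes the contrast explicit:

    /* Bank assignment for element A(i,j) under each layout, four banks M0-M3. */
    static int bank_naive (int i, int j) { (void)i; (void)j; return 0; }   /* all of A in M0 */
    static int bank_modulo(int i, int j) { (void)i; return j % 4; }        /* cyclic in the lowest dimension */
    static int bank_custom(int i, int j) { return (i % 2) * 2 + (j % 2); } /* matches the 2x2 unrolled accesses */

Under bank_custom, the four references A[i][j], A[i][j+1], A[i+1][j], and A[i+1][j+1] of Figure 5.1(b) always fall in four distinct banks, whatever the (even) values of i and j.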
Most closely related to our work is modulo unrolling in the Raw compiler [34], where data is laid out in a cyclic order across multiple memory banks in the most quickly varying dimension of the array. Modulo unrolling involves unrolling a loop in the nest that accesses the lowest-order dimension of an array so that the accesses in an unrolled loop body are to statically fixed memory banks. Figure 5.2(b) shows the modulo unrolling layout for the code in Figure 5.1(b). This layout scheme distributes the elements of arrays A and B across four memories in a cyclic fashion along the second array dimension, which is the fastest-changing dimension in the loop nest. The code in Figure 5.1(b) can now fetch two elements of array B at a time. Two memory banks are still idle while the computation stalls for the other two array elements. Only if the j loop were unrolled by four would modulo unrolling fetch all four elements of B at the same time and thus take advantage of the available parallelism in the system.

Figure 5.2(c) illustrates the custom data layout for the code in Figure 5.1(b). Custom data layout distributes arrays A and B across four memories according to the particular data access pattern in the kernel. As such, we reduce the possibility of computation stalls due to memory accesses. It distributes the four array elements, B[i][j], B[i][j+1], B[i+1][j], and B[i+1][j+1], accessed in the unrolled loop body. Since all four memory elements required by the loop body are placed in separate memory banks, no memory will be idle, thus achieving better use of the machine's memory-to-computation bandwidth. The four statements in the loop body can now execute in parallel, since there are no data dependences among them and each of the array references accesses a different memory bank.

This example illustrates an advantage of using a custom data layout as opposed to a fixed layout such as the cyclic one used by modulo unrolling. A compiler might want to apply iteration-space reordering and code reordering transformations that conflict with the modulo unrolling transformation, thus impacting memory parallelism. For example, the DEFACTO compiler uses unroll-and-jam to increase both memory parallelism and instruction-level parallelism, and to derive a balance as described in Chapter 3.

1. Normalize the step size of each loop to have a unit stride.
2. Virtual mapping for each array; i.e.,
   (a) Analyze the data access pattern from the subscript expression in each array dimension of each array reference; i.e., a stride and an offset.
   (b) Partition array references such that accesses to independent array elements belong to separate partitions. In each partition, we reanalyze the data access patterns of references.
   (c) Map each partition to a separate virtual memory, and rewrite array references to represent the accesses to different virtual memories by combining the results of rewriting each dimension.
   (d) Where needed, insert code to copy data to/from a naive layout in one memory from/to a custom data layout across multiple memories.
3. Physical mapping: based on the access orders of the arrays accessed in a particular loop nest, and taking the memory operation scheduling scheme into account, bind virtual memories to physical memories.

Figure 5.3: Steps of custom data layout.
Unrolling outer loops provides additional opportunities for memory parallelism which cannot be exploited by modulo unrolling. Further, a custom layout gives the compiler more freedom to accomplish other optimization goals without impacting memory parallelism.

5.2 Overview

Custom data layout increases parallel array accesses across multiple memories using a three-step algorithm, independent of other common loop transformations. Figure 5.3 provides an outline of the custom data layout algorithm. The first step is normalizing the step size, which involves replacing all the instances of the loop index variable I with s × I, where s is the step size of loop I. Loop normalization is always legal (a small example appears at the end of this section).

In the second step, virtual mapping, we divide a set of array references into partitions that access independent array elements, and map each partition to a separate virtual memory. This mapping is a one-to-one mapping between the original array indices and new array indices in virtual memories, based on array access pattern information. We first analyze the data access pattern (stride and offset) of each individual array reference. Since the data access patterns of different array dimensions are orthogonal, each dimension can be treated independently. Assuming arrays are accessed within their bounds, if two array references access mutually exclusive array indices in at least one dimension, they access independent array elements. In this case, we put them in separate partitions. Otherwise, we put them in the same partition, and derive a single unified data layout for them based on their common data access pattern.¹ To maximize the opportunities for parallel memory accesses, we create as many partitions as possible. The compiler rewrites each array reference so that the transformed subscript expression takes into account both the data element's virtual memory assignment and its position within the newly formed array in the virtual memory. Based on the data access patterns of the code, we insert array distribution/gathering code to/from multiple memories.

In the third step, physical mapping, the compiler binds virtual memories to physical memories, taking into account memory access conflicts based on the array access order in the program, to exploit both memory access and instruction-level parallelism.

The algorithm in Figure 5.3 is most effective in the affine domain, but it can handle some non-affine array subscript expressions. For simplicity of presentation, the bulk of this chapter assumes all dimensions are affine; non-affine references are briefly described in Section 5.3.4. In the remainder of this chapter, M_p refers to the total number of physical memories available in the system, and M_v to the number of virtual memories to which the arrays in a specific partition will be mapped. The next section describes step 2 in Figure 5.3. Step 3 is discussed in Section 5.4.

¹A possible exception to step 2(b) is read-only arrays whose array references access common array elements. They can be replicated to multiple memory banks if deemed cost effective. If replication is used, renaming need not modify the subscript expressions.
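To make step 1 of Figure 5.3 concrete, the following small C sketch (a minimal illustration; the loop bound and the one-dimensional arrays A and B are hypothetical, chosen only for this example) shows a stride-2 loop before and after normalization. Every instance of the index variable i in the body is replaced by 2*i, so the rewritten loop runs with a unit stride:

    /* Before normalization: the loop step size is 2. */
    for (i = 0; i < 32; i += 2)
        A[i] = B[i] + 1;

    /* After normalization: unit stride; each use of i is replaced by 2*i,
     * so the same elements are touched (stride 2, offset 0). */
    for (i = 0; i < 16; i++)
        A[2*i] = B[2*i] + 1;

Normalization changes nothing about which elements are accessed; it only makes the stride explicit in the subscript expression, where the analysis of Section 5.3.1 can see it.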
5.3 Virtual Mapping

In this section, we present how we derive a virtual mapping for each array, as outlined in step 2 of the algorithm in Figure 5.3.

5.3.1 Analyzing Data Access Patterns

For each array reference, we analyze the data access pattern of each array dimension independently. We first transform m-dimensional array subscript expressions into a canonical form by performing constant propagation, constant folding, and algebraic simplification; i.e.,

    A[a_1^1 L_1 + ... + a_n^1 L_n + b_1] ... [a_1^m L_1 + ... + a_n^m L_n + b_m].

We denote an affine array subscript expression in a particular array dimension as a Single Index Variable (SIV)² subscript expression if there is only one non-zero coefficient. If there is more than one non-zero coefficient in the array subscript expression, it is a Multiple Index Variable (MIV) subscript expression. If all the coefficients are zero, it is a Zero Index Variable (ZIV) subscript expression.

Basically, we identify two things from the affine subscript expression in each dimension of each array reference: a stride and an offset. An offset is simply the constant term b of the canonical array subscript expression. In the case of MIV, the subscript expression may access array elements with different strides on different loops; i.e., the stride on each loop L_j in dimension i is a_j^i. We define a stride s_i for a particular array dimension i of a particular canonical array subscript expression as the greatest common stride among all n loops; i.e.,

    s_i = 0, if ZIV;  GCD(a_1^i, ..., a_n^i), otherwise.    (5.1)

Custom data layout attempts to map to memory only those array elements that are accessed by the code. Thus, some elements may be omitted, such as in strided accesses, or accesses where the subscript for one or more dimensions is held constant. For example, the stride and the offset of array reference A[2i+4j+1] are 2 and 1, respectively. We map only the odd array elements starting from array index 1.

5.3.2 Partitioning Array References

When two array references access mutually exclusive array elements, and thus there is no data dependence between them, we can put them in separate partitions. For example, consider array references A[4i], A[4i+1], and A[2i]. A[4i] accesses a subset of the array elements accessed by A[2i], but A[4i+1] accesses independent array elements. So, we derive a unified data layout for A[4i] and A[2i], and a separate data layout for A[4i+1].

The following theorem proves a key property used by the partitioning algorithm.

THEOREM 1 Two m-dimensional array references A and A' access independent locations, and can be placed in separate partitions, if the following condition holds:

    ∃i, 1 ≤ i ≤ m :  b_i ≠ b'_i, if s_i = s'_i = 0;
                     b'_i mod GCD(s_i, s'_i) ≠ b_i mod GCD(s_i, s'_i), otherwise.    (5.2)

where s_i and s'_i are the strides, and b_i and b'_i are the offsets, associated with the two array references A and A' in each array dimension i.

PROOF. When both array references are ZIV in dimension i, they access independent array elements if their constant indices are different; i.e., b_i ≠ b'_i. Otherwise, GCD(s_i, s'_i) represents the greatest common stride of array indices that may be accessed by both A and A' in dimension i.³ We prove Equation 5.2 by contradiction.

²In this dissertation, ZIV, SIV, and MIV refer respectively to zero, single, and multiple index variable subscript expressions, for brevity.
³If either of the two array references is ZIV, GCD(s_i, 0) = s_i.
Let two array references access the same array index in a particular array dimension in two iterations l_j and l'_j of each loop L_j; i.e., a_1 l_1 + ... + a_n l_n + b = a'_1 l'_1 + ... + a'_n l'_n + b'. Rearranging terms, a'_1 l'_1 + ... + a'_n l'_n - a_1 l_1 - ... - a_n l_n = b - b'. Let s = GCD(a_1, ..., a_n, a'_1, ..., a'_n) = GCD(s_i, s'_i). If we divide both sides by s, (a'_1 l'_1 + ... + a'_n l'_n - a_1 l_1 - ... - a_n l_n)/s = (b - b')/s. The left hand side is an integer, since all terms are integers and s is a common factor of a_1, ..., a_n, a'_1, ..., a'_n. For the right hand side to be an integer, s must divide b - b'. Thus, (b - b') mod s = 0 ⟺ b mod s = b' mod s. □

The partitioning algorithm uses Equation 5.2 for each array dimension to divide array references into partitions. We define a common stride s_i for each dimension i across k array references as follows:

    s_i = GCD(s_i^1, ..., s_i^k)    (5.3)

The algorithm for partitioning the array references in a loop nest is shown in Figure 5.4.

    Procedure Partition(P, i, m)
      P: a partition; i.e., a set of k references to an array
      Q: a set of partitions; i.e., {P}
      i: array dimension to be partitioned
      m: total number of dimensions of the array
      A: an array reference B[a_1^1 L_1 + ... + a_n^1 L_n + b_1] ... [a_1^m L_1 + ... + a_n^m L_n + b_m] in P
      s_i^j: stride in array dimension i of array reference j, where 1 ≤ j ≤ k
      P_max: maximum number of subpartitions for P in dimension i

      k = |P|;                                /* cardinality of set P */
      if (i > m or k = 1) return P;
      s_i = GCD(s_i^1, ..., s_i^k);           /* common stride for the k references */
      if (s_i = 0) P_max = max(b_i^1, ..., b_i^k); else P_max = s_i;
      for j = 1, P_max                        /* create P_max empty subpartitions */
        P_j = ∅;
      for each A ∈ P {
        if (s_i = 0) insert A into P_{b_i};
        else insert A into P_{b_i mod s_i};
      }
      Q = ∅;
      for j = 1, P_max {
        if (P_j ≠ ∅ and P_j ≠ P) Q = Q ∪ Partition(P_j, i, m);
      }
      if (Q = ∅) Q = Partition(P, i+1, m);
      if (Q = ∅) Q = P;                       /* unable to derive subpartitions for P */
      return Q;                               /* a set of subpartitions */

Figure 5.4: Array reference partitioning algorithm.

Initially, all array references belong to a single partition, Set, and we call procedure Partition(Set, 1, m) for each array. The procedure then separates array references into different partitions whenever it can prove they are accessing independent array elements using Equation 5.2. In a specific array dimension i, array references within a partition are recursively partitioned according to Equation 5.2 until no further partitioning is possible. The recursive procedure Partition derives a set of P_max possible subpartitions (some of which are empty). If all references are mapped to the same subpartition, then that dimension's references cannot be partitioned, so the algorithm returns the result of partitioning the next dimension. Otherwise, it attempts to further partition each subpartition according to the current dimension.

For example, consider Set = {A[2i], A[4i+3], A[8i+1], A[8i+5]}. The common stride GCD(2, 4, 8, 8) is 2. According to the condition in Equation 5.2, Set is divided into two partitions, {A[2i]} and {A[4i+3], A[8i+1], A[8i+5]}. The second partition is further partitioned into two subpartitions, {A[4i+3]} and {A[8i+1], A[8i+5]}, since GCD(4, 8, 8) = 4 and (3 mod 4) ≠ (1 mod 4) = (5 mod 4). Further, the second subpartition is divided into two subpartitions, {A[8i+1]} and {A[8i+5]}. Therefore, all four references in Set access completely independent array elements, and are mapped to separate virtual memories.
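The independence test at the heart of this algorithm is mechanical. The following C sketch is a minimal illustration under the conventions above, not the DEFACTO implementation itself; it checks whether two references with (stride, offset) pairs (s1, b1) and (s2, b2) in one dimension satisfy the condition of Equation 5.2, using the convention of footnote 3 that GCD(s, 0) = s and a non-negative modulo so that negative offsets are handled as in the footnote to Equation 5.6 below:

    #include <stdlib.h>

    /* Greatest common divisor; by convention GCD(s, 0) = s. */
    static int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }

    /* Non-negative remainder in [0, |s|-1], valid for negative b as well. */
    static int pmod(int b, int s) { int r = b % abs(s); return r < 0 ? r + abs(s) : r; }

    /* Returns 1 if two references with (stride, offset) pairs (s1, b1) and
     * (s2, b2) in this dimension access mutually exclusive array elements,
     * i.e., if the condition of Equation 5.2 holds in this dimension. */
    int independent(int s1, int b1, int s2, int b2)
    {
        if (s1 == 0 && s2 == 0)          /* both ZIV: disjoint iff constants differ */
            return b1 != b2;
        int g = gcd(abs(s1), abs(s2));   /* greatest common stride */
        return pmod(b1, g) != pmod(b2, g);
    }

For instance, independent(4, 3, 8, 1) returns 1, matching the subpartitioning of A[4i+3] and A[8i+1] in the example above, while independent(4, 0, 2, 0) returns 0, since A[4i] accesses a subset of the elements of A[2i].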
5.3.3 Array Renaming

Once all m array dimensions are partitioned, the next step of virtual mapping is to map each partition to a distinct virtual memory independently. We rewrite the array references in a partition P to represent accesses to the same virtual memory. To rename each array reference and to form the new subscript expression within a partition, we must derive the following three components for each array dimension:

1. offset, to be added to the loop index variables with non-zero coefficients in the subscript expression, designating the offset within the virtual memory;
2. coeff, a new coefficient to represent the stride of accesses in the corresponding loop after the custom layout; and
3. suffix, to be concatenated to the original array name, designating the virtual memory to which this reference is mapped.

We derive these components using the common stride in Equation 5.3 within a partition and the offset b of each array reference. In this section, we first describe array renaming for one-dimensional arrays. Then, we extend the approach to multi-dimensional arrays.

5.3.3.1 Single Dimension

A specific one-dimensional array reference with ZIV, SIV, or MIV in a partition is rewritten as follows:

    ZIV: A[b] ⇒ A•suffix[offset]
    SIV: A[aL + b] ⇒ A•suffix[coeff L + offset]
    MIV: A[a_1 L_1 + ... + a_n L_n + b] ⇒ A•suffix[coeff_1 L_1 + ... + coeff_n L_n + offset]

The new array name A•suffix uniquely identifies a partition and its corresponding virtual memory. The new subscript expression represents locations within the virtual memory. We derive offset, coeff_i, and suffix for a specific array reference as follows:

    offset = b, if s = 0;  ⌊b/s⌋, otherwise.    (5.4)

    coeff_i = a_i, if s = 0;  a_i/s, otherwise.    (5.5)

    suffix = b, if s = 0;  b mod s, otherwise.    (5.6)

In Equation 5.4, b is divided by the stride s because s is the unit of mapping to a virtual memory. The coefficient of each loop variable is divided by s in Equation 5.5 to represent the stride on each loop after virtual mapping. Equation 5.6⁴ ensures that array references that access common array indices are mapped to the same virtual memory.

The result of partitioning array references according to the algorithm in Figure 5.4 is that any two array references A and A' in a partition satisfy the following property:

    ∀i, 1 ≤ i ≤ m :  b'_i = b_i, if both s_i and s'_i are zero;
                     b'_i mod s_i = b_i mod s_i, otherwise.    (5.7)

where s_i is given by Equation 5.3.

THEOREM 2 Each partition has a unique suffix_i value in at least one dimension.

PROOF.

⁴Note that one of b or s may be a negative number. Since C compilers do not standardize modulo arithmetic in the presence of negative numbers, we would like to clarify and avoid confusion. The result of (b mod s) should be a non-negative number ranging between 0 and abs(s)-1. If b is negative, it is defined as (s - 1) - ((-b - 1) mod s). If s is negative, it is defined as (b mod -s).
Assume two array references A and A' respectively have offsets b_i and b'_i, and are separated into two partitions, where the strides are respectively s_i and s'_i in a particular array dimension i. We omit the proof for the case where both s_i and s'_i are zero, since it is trivial from the condition in Equation 5.2.

Because A and A' are separated, the following condition holds according to the condition in Equation 5.2: b_i mod GCD(s_i, s'_i) ≠ b'_i mod GCD(s_i, s'_i). We prove this theorem holds if and only if b_i mod s_i ≠ b'_i mod s'_i; namely,

    b_i mod GCD(s_i, s'_i) ≠ b'_i mod GCD(s_i, s'_i)  ⟺  b_i mod s_i ≠ b'_i mod s'_i.

This proposition is equivalent to the following contraposition:

    b_i mod s_i = b'_i mod s'_i  ⟺  b_i mod GCD(s_i, s'_i) = b'_i mod GCD(s_i, s'_i).

Let k = (b_i mod s_i) = (b'_i mod s'_i). Then, b_i = s_i × c + k and b'_i = s'_i × c' + k, where c and c' are integers. Since GCD(s_i, s'_i) is a factor of both s_i and s'_i, (s_i × c) mod GCD(s_i, s'_i) = (s'_i × c') mod GCD(s_i, s'_i) = 0. Therefore, b_i mod GCD(s_i, s'_i) = b'_i mod GCD(s_i, s'_i). □

Thus, we use the condition in Equation 5.7 to identify a unique virtual memory to which a partition is mapped. Theorems 3 and 4 below prove that Equations 5.6, 5.4, and 5.5 ensure a one-to-one mapping between the original array indices accessed by two different array references A[a_1 L_1 + ... + a_n L_n + b] and A[a'_1 L_1 + ... + a'_n L_n + b'] and the array indices in the virtual memories after virtual mapping. In the case of s = 0, where the partition consists of ZIVs only, the proof of one-to-one mapping is trivial. We also omit the proof for SIV, since SIV is a special case of MIV.

THEOREM 3 If two array references access the same array index, i.e., a_1 l_1 + ... + a_n l_n + b = a'_1 l'_1 + ... + a'_n l'_n + b' in two iterations l_j and l'_j of each loop L_j, then the corresponding array element is accessed at the same array index in the same virtual memory after virtual mapping; i.e.,

    1. b mod s = b' mod s, and
    2. coeff_1 l_1 + ... + coeff_n l_n + ⌊b/s⌋ = coeff'_1 l'_1 + ... + coeff'_n l'_n + ⌊b'/s⌋.

PROOF. The proof of b mod s = b' mod s follows from the proof of Theorem 1. Rearranging terms, we know that a'_1 l'_1 + ... + a'_n l'_n - a_1 l_1 - ... - a_n l_n = b - b'. If we divide both sides by s, we obtain (a'_1 l'_1 + ... + a'_n l'_n - a_1 l_1 - ... - a_n l_n)/s = (b - b')/s, which is equivalent to coeff'_1 l'_1 + ... + coeff'_n l'_n - coeff_1 l_1 - ... - coeff_n l_n = (b - b')/s. Since (b mod s) = (b' mod s), ⌊b/s⌋ - ⌊b'/s⌋ = b/s - b'/s. Therefore, coeff_1 l_1 + ... + coeff_n l_n + ⌊b/s⌋ = coeff'_1 l'_1 + ... + coeff'_n l'_n + ⌊b'/s⌋. □

THEOREM 4 If two array references A and A' access two different array indices, i.e., a_1 l_1 + ... + a_n l_n + b ≠ a'_1 l'_1 + ... + a'_n l'_n + b', then the corresponding array elements are accessed either in separate virtual memories or at different array indices within the same virtual memory after virtual mapping; i.e.,

    1. b mod s ≠ b' mod s, or else
    2. coeff_1 l_1 + ... + coeff_n l_n + ⌊b/s⌋ ≠ coeff'_1 l'_1 + ... + coeff'_n l'_n + ⌊b'/s⌋.

PROOF. The first case occurs when the two array references access mutually exclusive array indices, and thus belong to separate partitions. They access array indices with the common stride s, starting from b and b', respectively. If s does not divide the difference between b and b', array index b' is never accessed by A, and array index b is never accessed by A', over the entire iteration space. Thus, (b' - b) mod s ≠ 0 ⟺ b mod s ≠ b' mod s.
95 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. The second case occurs when an array index accessed by A is also accessed by A! in some iterations other than l\, since they are mapped to the same virtual memory. For example, consider A\li] and A[2i + 4], A[2] is accessed by A[2i\ when i= 1, and A [8] is accessed by A^li + 4] when i= 2. These array elements are mapped to different locations in the same virtual memory. The proof is omitted, but is similar to the proof for the second equation in Theorem 3, substituting inequality for equality. □ The following theorem proves that virtual mapping is always a legal transformation by showing that the dependence vector d within a loop nest between two array references is the same after virtual mapping. T H E O R E M 5 The original dependence vector d between array references A[a\L\-\------1 - anLn + b} and A[a\L\ + • • • + a 'nL n + b'] that access common array indices is preserved after virtual mapping. PROOF. Let yl[aiLi + • • • + anL n + b} and A{a\L\ + • • • + a 'nLn + b'] access the same array element in iterations {l\, ■ • •, ln) and (l[, ■ ■ ■ , l '. n), respectively. Thus, aiZi + • • • + anln + b = a^l^ + • • • + a'nlr n + b' = a\(li — di) + • • • + a'n (ln — dn) + b'. We only have to prove that two renamed subscript expressions after virtual mapping are identical in iterations (h, • • •, ln) and (/,, • • • ,l'n). respectively; i.e., coeffiZiH hco effnin+ o ff s e t = coeffjZ^4 l- c o e f f ^ + o f f s e t'. Rearranging terms, coeffiiiH — • + c o e ffriZ ri — co eff [l^ — co eff ' nl 'n = o f f s e t' — o ff s e t = \ b' / s \ — \b/s\. Since there is a dependence between the two array references, (6' mod s) = (b mod s). Thus, [bf/s \ — |_ & /SJ = b'/s — b/s. If we multiply both sides by s, a\l\ + • • • + anln — a[l[ — • • • — a'nl ' n = b' — b. Rearranging terms, aili + ■ ■ ■ T anln + b = + • • • + a'nl ' n + b' = a'^{l\ — dj) + • • • + Q^(Zn — dn) + b'. 96 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Therefore, the dependence vector is preserved after virtual mapping. □ 5.3.3.2 M u ltip le D im ensions To rename multi-dimensional arrays, we combine the results of renaming each dimen sion, as will be discussed in this section. For all applicable dimensions, Equations 5.6, 5.4, and 5.5 from the previous section can be applied. We rewrite m-dimensional array references in an n-deep loop nest as follows: A [a\L \ + • • • + a \L n + & i] • • • [a™ L\ + • • • + a™Ln + bm\ => A • su ff ix i • • • • • su ff ix m[coeff\L \ + •••-(- coeff^L„ + o ffs e ti] • • • • • • [coeff + • • • + coeff™Ln + o ff s e t m] Again, since ZIV and SIV subscript expressions are a special case of MIV, we treat them identically. The new array name in each virtual memory is now the concatenation of the original array name A and a set of suffixes, su ff ix i, • • •, su ff ix m, for each array dimen sion. Each o ffset^ , coeff*-, and suffix^ is computed by Equations 5.4, 5.5, and 5.6, respectively, whenever array renaming is applicable for dimension i. The partitioning decision is based on Equation 2 in Theorem 1, and following the partitioning algorithm, each partition is mapped to a unique virtual memory id. Thus, the number of virtual memories is equivalent to the number of partitions. 
Therefore, for each m-dimensional array, the cardinality of the set of distinct suffix values determines the total number of virtual memories M_v; that is,

    M_v = |{suffix_1 • suffix_2 • ... • suffix_m}|

If the total M_v exceeds M_p, then the physical mapping phase decides which virtual memories are mapped to the same physical memory.

    Procedure Array_Renaming(Q)
      Q: a set of partitions of array references = {P}
      A: an array reference B[a_1^1 L_1 + ... + a_n^1 L_n + b_1] ... [a_1^m L_1 + ... + a_n^m L_n + b_m] ∈ P
      s_i: the common stride in array dimension i in P

      for each P ∈ Q
        for each A ∈ P {
          for i = 1, m {
            if (s_i ≠ 0) {
              suffix_i = b_i mod s_i;
              offset_i = ⌊b_i / s_i⌋;
              for j = 1, n                 /* for each loop of the n-deep loop nest */
                coeff_j^i = a_j^i / s_i;
            } else {                       /* all references in P are ZIVs in dimension i */
              suffix_i = b_i;
              offset_i = b_i;
              for j = 1, n                 /* for each loop of the n-deep loop nest */
                coeff_j^i = a_j^i;
            }
          }
          Replace A with B•suffix_1•...•suffix_m[coeff_1^1 L_1 + ... + coeff_n^1 L_n + offset_1]
                    ... [coeff_1^m L_1 + ... + coeff_n^m L_n + offset_m].
        }

Figure 5.5: Array renaming algorithm.

The proof that array renaming is a one-to-one mapping for multi-dimensional arrays follows from the proofs of Theorems 3 and 4, and the observation that each dimension can be treated independently.

5.3.3.3 Array Renaming Algorithm

Once all m array dimensions are partitioned, we call Array_Renaming in Figure 5.5 for each array in the code. Array_Renaming maps each partition to a distinct virtual memory. Based on the common stride computed in the procedure Partition in Figure 5.4, we compute offset, coeff, and suffix using Equations 5.4, 5.5, and 5.6 for each dimension.
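The per-dimension arithmetic of Equations 5.4 through 5.6 is small enough to show directly. The C sketch below is an illustration only (the function and variable names are hypothetical, and it assumes a positive common stride s); it computes the renaming components for one dimension of one reference:

    /* Non-negative remainder, as defined in footnote 4, assuming s > 0. */
    static int pmod(int b, int s) { int r = b % s; return r < 0 ? r + s : r; }

    /* Floor division, i.e., the ⌊b/s⌋ of Equation 5.4, correct for negative b. */
    static int fdiv(int b, int s) { return (b - pmod(b, s)) / s; }

    /* Renaming components for one array dimension (Equations 5.4-5.6).
     * s    : common stride of the partition in this dimension (Equation 5.3)
     * a[j] : coefficient of loop L_j in this dimension, 0 <= j < n
     * b    : constant offset of this reference in this dimension          */
    void rename_dim(int s, const int *a, int n, int b,
                    int *coeff, int *offset, int *suffix)
    {
        if (s == 0) {                      /* partition consists of ZIVs only */
            for (int j = 0; j < n; j++) coeff[j] = a[j];
            *offset = b;
            *suffix = b;
        } else {
            for (int j = 0; j < n; j++) coeff[j] = a[j] / s;  /* s divides each a[j] */
            *offset = fdiv(b, s);          /* Equation 5.4 */
            *suffix = pmod(b, s);          /* Equation 5.6 */
        }
    }

For instance, for A[2i+3] (case (a) of Table 5.1 in the next section), s = 2 gives coeff = 1, offset = ⌊3/2⌋ = 1, and suffix = 1, reproducing the renamed reference A1[i+1].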
5.3.4 Examples

Table 5.1 shows examples of virtual mapping in a single dimension.

    Original references              Common     Partitions (with stride) and renamed references
                                     stride
    (a) A[2i+3]                      2          P1 = {A[2i+3]} (s=2) -> A1[i+1]
    (b) A[2i], A[2i+1]               2          P1 = {A[2i]} (s=2) -> A0[i];  P2 = {A[2i+1]} (s=2) -> A1[i]
    (c) A[4i], A[6i]                 2          P1 = {A[4i], A[6i]} (s=2) -> A0[2i], A0[3i]
    (d) A[2i], A[4i+2], A[4i+3]      2          P1 = {A[2i], A[4i+2]} (s=2) -> A0[i], A0[2i+1];  P2 = {A[4i+3]} (s=4) -> A3[i]
    (e) A[2i], A[4i+3], A[4i+5]      2          P1 = {A[2i]} (s=2) -> A0[i];  P2 = {A[4i+3]} (s=4) -> A3[i];  P3 = {A[4i+5]} (s=4) -> A1[i+1]
    (f) A[2i], A[4i+3], A[4i+5],     2          P1 = {A[2i]} (s=2) -> A0[i];  P2 = {A[2i+1], A[4i+3], A[4i+5]} (s=2)
        A[2i+1]                                      -> A1[i], A1[2i+1], A1[2i+2]
    (g) A[2i], A[4i], A[4i+3],       2          P1 = {A[2i], A[4i]} (s=2) -> A0[i], A0[2i];  P2 = {A[4i+3]} (s=4) -> A3[i];
        A[4i+5]                                  P3 = {A[4i+5]} (s=4) -> A1[i+1]
    (h) A[2i+4j], A[2i+4j+1],        2          P1 = {A[2i+4j], A[2i+4j+2]} (s=2) -> A0[i+2j], A0[i+2j+1];
        A[2i+4j+2]                               P2 = {A[2i+4j+1]} (s=2) -> A1[i+2j]
    (i) A[4i], A[2i+2j+1]            2          P1 = {A[4i]} (s=4) -> A0[i];  P2 = {A[2i+2j+1]} (s=2) -> A1[i+j]

Table 5.1: Examples of virtual mapping in a single dimension.

The first two cases, (a) and (b), show SIV array references whose strides are identical. Table 5.1(a) contains only a single array reference, so the stride is 2 and we can use one virtual memory. Array renaming rewrites the reference as A1[i+1]. The two array references in Table 5.1(b) access even and odd array elements respectively, with the same stride GCD(2,2)=2, starting from the two different indices 0 and 1. According to the condition in Equation 5.2, each reference accesses mutually exclusive array elements, and they are mapped to two virtual memories, A0 and A1. The new coefficients are both 1, and the new offsets within each virtual memory are both zero. As such, the references in each virtual memory become consecutive accesses.

In Table 5.1(c), the two array references have strides 4 and 6, respectively. Since the common stride is GCD(4,6)=2, they belong to the same partition. This unified data layout distributes only the even-numbered array elements to virtual memory A0. In Table 5.1(d), according to the condition in Equation 5.2, A[4i+3] accesses array elements independent of the other references. So, we derive a separate data layout for it, and a unified data layout for the other array references. The stride of the unified partition is 2 and the stride of A[4i+3] is 4. Each partition can use one virtual memory, A0 and A3 respectively. In Table 5.1(e), the common stride is GCD(2,4,4)=2. According to the condition in Equation 5.2, we first group the array references into two partitions, {A[2i]} and {A[4i+3], A[4i+5]}, because (0 mod 2) ≠ (3 mod 2) = (5 mod 2). The second partition is further divided into two subpartitions, since its common stride is 4 and (3 mod 4) ≠ (5 mod 4). As a result, the three array references do not access any common array indices, and are mapped to separate virtual memories. In Table 5.1(f), we have one more array reference, A[2i+1], than in Table 5.1(e). However, A[2i+1] accesses array indices accessed by both A[4i+3] and A[4i+5]. Thus, we derive a unified data layout for the three array references A[2i+1], A[4i+3], and A[4i+5], and a separate layout for A[2i]. Similarly, in the case of Table 5.1(g), we derive three separate data layouts for {A[2i], A[4i]}, A[4i+3], and A[4i+5], respectively.

Table 5.1(h) shows one-dimensional MIV array references. The array element A[8] accessed by array reference A[2i+4j] is accessed multiple times: when i=0 and j=2, when i=2 and j=1, and when i=4 and j=0. In addition, the array element A[8] accessed by array reference A[2i+4j+2] will be accessed again by array reference A[2i+4j], since (2 mod 2) = (0 mod 2). Therefore, these two array references must be mapped to the same virtual memory. Thus, at most two virtual memories can be used. In Table 5.1(i), A[4i] is SIV and A[2i+2j+1] is MIV. Since GCD(4,2) = 2 and (0 mod 2) ≠ (1 mod 2), we derive separate layouts for each reference, using its own stride, 4 and 2 respectively.

Table 5.2 shows examples of virtual mapping for two-dimensional array references. We consider one dimension at a time.

    Original references                           Common      Partitions (with strides) and renamed references
                                                  strides
    (a) A[2i][4j+4k], A[2i][4j+4k+3]              (2, 4)      P1 = {A[2i][4j+4k]} -> A00[i][j+k];
                                                              P2 = {A[2i][4j+4k+3]} -> A03[i][j+k]
    (b) A[2i][2j+1], A[4i][4j], A[2i+1][2j+4k]    (2, 2)      P1 = {A[2i][2j+1]} (2,2) -> A01[i][j];
                                                              P2 = {A[2i+1][2j+4k]} (2,2) -> A10[i][j+2k];
                                                              P3 = {A[4i][4j]} (4,4) -> A00[i][j]
    (c) A[2i][2j], A[4i][2], A[2i+1][2j],         (2, 2)      P1 = {A[2i+1][2j]} (2,2) -> A10[i][j];
        A[4i][6]                                              P2 = {A[2i][2j], A[4i][2], A[4i][6]} (2,2)
                                                                   -> A00[i][j], A00[2i][1], A00[2i][3]
    (d) A[2i][2j], A[4][2], A[2i+1][2j+1],        (2, 2)      P1 = {A[2i][2j], A[4][2]} (2,2) -> A00[i][j], A00[2][1];
        A[5][7]                                               P2 = {A[2i+1][2j+1], A[5][7]} (2,2) -> A11[i][j], A11[2][3]
    (e) A[2i][2j], A[4][2], A[2i+1][2j+1],        (2, 2)      P1 = {A[2i][2j], A[4][2]} (2,2) -> A00[i][j], A00[2][1];
        A[5][6]                                               P2 = {A[2i+1][2j+1]} (2,2) -> A11[i][j];
                                                              P3 = {A[5][6]} (0,0) -> A56[5][6]
    (f) A[i][2j], A[j][4i]                        (1, 2)      P1 = {A[i][2j], A[j][4i]} (1,2) -> A00[i][j], A00[j][2i]
    (g) A[i][2j+1], A[j][4i]                      (1, 2)      P1 = {A[i][2j+1]} (1,2) -> A01[i][j];
                                                              P2 = {A[j][4i]} (1,4) -> A00[j][i]
    (h) A[2i+1][3B[j]], A[2i][2j]                 (2, 1)      P1 = {A[2i][2j]} (2,2) -> A00[i][j];
                                                              P2 = {A[2i+1][3B[j]]} (2,3) -> A10[i][B[j]]
    (i) A[4i][4j×k+1], A[2i][2j]                  (2, 2)      P1 = {A[2i][2j]} (2,2) -> A00[i][j];
                                                              P2 = {A[4i][4j×k+1]} (4,4) -> A01[i][j×k]

Table 5.2: Examples of virtual mapping in two dimensions.

The array references in Table 5.2(a) are two-dimensional, with an SIV first dimension and an MIV second dimension. Looking at the first dimension, the references are not partitioned. The second dimension of each reference accesses mutually exclusive array elements according to Equation 5.2. In Table 5.2(b), the strides in each dimension differ among the three references. Looking at the first dimension, [2i] and [4i] access common indices, but [2i+1] accesses independent indices. So, we group the three array references into two partitions, {A[2i][2j+1], A[4i][4j]} and {A[2i+1][2j+4k]}. Next, looking at the second dimension of the first partition, [2j+1] and [4j] access independent indices.
Thus, we further divide the first partition into two subpartitions, {A[2i][2j+1]} and {A[4i][4j]}. Each reference is independently renamed. Therefore, the three references are mapped to virtual memories A01, A10, and A00, respectively.

Table 5.2(c), (d), and (e) show examples of ZIV subscript expressions. Looking at the first dimension in Table 5.2(c), we divide the array references into two partitions, {A[2i+1][2j]} and {A[2i][2j], A[4i][2], A[4i][6]}. Looking at the second dimension of
Therefore, they are placed into two partitions, which are mapped to virtual memories 7101 and d 0 0 . Table 5.2(h) and (i) show examples of non-affine array subscript expressions such as [3i?[j]] and [4j x fc+1]. Non-affine subscript expressions are not supported by the algo rithm described in this chapter, since the stride must be computed differently. However, if accesses can be placed in separate partitions according to the data access patterns in other dimensions that have affine subscript expressions, then the algorithm can skip the non-affine dimension during partitioning. Further, a conservative approximation of the 103 Reproduced with permission of the copyright owner. Furiher reproduction prohibited without permission. stride can be used to extend the algorithm for partitioning dimensions with non-affine subscript expressions. For example, the exact value of B [j] or j x k is unknown at com pile time. But, we know that 3B[j] should be at least a multiple of 3, and that 4j x k should be at least a multiple of 4. Thus, we conservatively assume the strides for these non-affine expressions are 3 and 4, respectively. In looking at the first affine dimension in Table 5.2(h), two references access independent array elements. So, we derive sep arate layouts for them. In Table 5.2(i), the common stride of the second dimension is GCD(4,2)=2. According to Equation 5.2, two references are divided into two partitions. 5.3.5 A rray R eorgan ization For data that is upwards exposed to the beginning of the transformed code, or computed within the transformed code and live on exit, it may be necessary to reorganize the data from /to a naive layout in a single memory to/from a custom layout in multiple virtual memories. In this section, we describe how to perform this reorganization based on the data access patterns; i.e., stride (sj) and offset (bt) in each array dimension i. Let L B j and U Bj be the lower and upper bound of each loop L j, respectively. For simplicity, let’s assume s, / 0. Based on s 2 and bi in each dimension i, then, the array indices accessed by array subscript expression [a\L\ -\ f - a ‘ nL n + bi] for the naive layout can be formulated to [(6 i m o d Si) + Vi X S j], where vt in each array dimension i is an integer such that a \L B \ H 1 - al nL B n + 5, < (bi mod s*) + Vi x s* < a\U B i - )------- h al nU Bn + bi < £ > {a \L B \ H + ax nL B n + bi - (bi mod Si)}/si < u* < {a\U B i + h a^UBn + bi - (bi mod Si)}/s^ 104 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. For a virtual memory computed using Equation 5.6 on each dimension, we can con struct a mapping from the naive data layout to a custom data layout as follows: A[(bi mod si) + v\ x si][(&2 mod s2) + v2 x s2] • • • [(bm mod sm) + Vm X ^m] = £ ■ A • s u f f ix i • • • • • s u f f ix m[ui][u2] • • • [vm]. (5-8) The inverse mapping is equivalent. By considering all possible virtual memories, and combining the results for multiple dimensions, we can map all elements from /to a standard layout in a single memory. T H E O R E M 6 Array reorganization described in Equation 5.8 ensures that the array references of the form [(bi mod st) + Vi x st] access the correct data [vj] in a virtual memory after array reorganization, and vice versa; i.e., a \L \ H 1 - al nL n + bi = (bi mod sf) + Vi x s* c o e f f\L i + • • • + c o e ff^ L n + o ffse t^ = Vi PROOF. 
According to Equations 5.4 and 5.5, we can rewrite the right hand side of < = > as (a \/si)L i -\-----+ (al J s i ) L n + = U j- By dividing both sides of the equality on the left hand side of by st and rearranging terms, we get (a \/si)L i H h (al n/si)L n + (bi - (bt mod Si))/si = v*. Thus, (bi — (bi mod s,))/ st = \_ b t/ s , \ , which is always true, since (bt mod sf) is the re mainder of bi/Si, and the result of applying a bottom operation to division b .L /s.t is the same as the result of removing this remainder from b% first and then dividing by sp □ 105 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 5.4 Physical M emory M apping Virtual mapping creates as many virtual memories as needed to maximize opportunities of parallel memory accesses for each array in isolation, and in an architecture-independent way. In this section, we describe how to map virtual memories to a limited number of physical memories such that the exposed parallel memory access opportunities are preserved as much as possible. To map the transformed code to a specific target architecture, we must take the following into account: (1) the number of physical memories Mp; (2) competing demands of multiple arrays; and (3) the scheduling algorithm of memory accesses. Intuitively, we want to distribute M v virtual memories across M p physical memory banks as evenly as possible, since it preserves the exposed parallel memory access opportunities, and minimizes the address bits required for each physical memory. The actual memory operations that can be scheduled concurrently are affected by the physical memory mapping. We denote EM„ as the total number of virtual memories across all the arrays in a loop nest. If E M v < M p, we distribute each virtual memory to a different physical memory. If EM v > M p, some virtual memories must be mapped to the same physical memory, thereby possibly sacrificing potential memory parallelism. Some virtual memories carry a scheduling constraint such that the operations on the right hand side of an assignment statement must be scheduled before the operations on the left hand side. We map the virtual memories that carry the scheduling constraint to the same physical memory to give other less constrained virtual memories more freedom to be mapped to separate physical memories. The final scheduling of memory accesses is handled by ASAP scheduling algorithm [16, 39] employed by Monet. For each loop body, this scheduler collects all the memory accesses to separate virtual memories. Then, it schedules the ones that have no scheduling constraints with each other at the same cycle. In looking at Figure 5.6(a), we know that 106 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. For i= l,N { B[i\=A[i- 2] 0 (a) MO For i= ,N,4 { BOH] = A2[i-2] Bl[i] = A3[i-2] B2[i] = A0[i-2] B3[i] ) = Al[i-2] A O B O Ml A 1 B 1 M2 A 2 B 2 M3 A3 B3 read A 2 write S O read A O write B 2 read A 3 write 67 read A 1 write S3 MO B O (b) A 2 Ml A 3 B 1 M2 A O B 2 M3 read A 2 write S O read A 3 write 67 read A O write B 2 read A 1 write S3 (c) tune A 1 B 3 time Figure 5.6: An example of data access order and possible physical mappings. all four reads to the array A could be performed in parallel and similarly for all four writes to array B. 
Figure 5.6 shows the loop

    for i = 1, N {
        B[i] = A[i-2];
    }

unrolled by four, with references renamed to

    for i = 1, N, 4 {
        B0[i] = A2[i-2];
        B1[i] = A3[i-2];
        B2[i] = A0[i-2];
        B3[i] = A1[i-2];
    }

in part (a); a mapping that places A0 and B0 in M0, A1 and B1 in M1, A2 and B2 in M2, and A3 and B3 in M3 in part (b), whose schedule serializes into four cycles (reads of A2 and A3, then writes of B0 and B1, then reads of A0 and A1, then writes of B2 and B3); and a mapping that places A2 and B0 in M0, A3 and B1 in M1, A0 and B2 in M2, and A1 and B3 in M3 in part (c), whose schedule issues all four reads in one cycle and all four writes in the next.

Figure 5.6: An example of data access order and possible physical mappings.

Looking at Figure 5.6(a), we know that all four reads to array A could be performed in parallel, and similarly for all four writes to array B. If we choose a physical memory mapping as shown in Figure 5.6(b), where we simply place each renamed array in the physical memory with the same identification, the resulting schedule will only perform two memory accesses in parallel. This is because there is a scheduling constraint between the writes to arrays B0, B1, B2, B3 and the reads from arrays A2, A3, A0, A1. Even though the first four memory accesses, B0, A2, B1, and A3, are to separate memories, only A2 and A3 are scheduled in the same cycle, and B0 and B1 remain in the scheduling queue. Further, since B0 and B1 are in the scheduling queue, they block any other memory accesses to those memories until they are completed. Therefore, even though virtual mapping exposes the maximal parallel memory access opportunities, Figure 5.6(b) takes four memory cycles to complete all 8 memory operations within an unrolled loop iteration of Figure 5.6(a).

We take these scheduling constraints into account and distribute arrays in such a way that arrays with scheduling constraints are mapped to the same memory, to take advantage of the full memory bandwidth. In Figure 5.6(c), all four read operations to the A's are scheduled in the same cycle first, and then all four write operations to the B's are scheduled in another cycle. The schedule length within a loop iteration is decreased by half, and the memory bandwidth is fully utilized.

5.5 Global Optimization Problem

When deriving a data layout across an entire program, the best layout for each program region (e.g., loop nest) may not be globally optimal. One goal is to find the best data layout that is fixed throughout the program's execution. We call this a static data layout. This layout does not require any data reorganization. In a degenerate case, all the array elements must be mapped to a single memory to avoid data reorganization. An alternative approach is to allow data reorganization between program regions such as loop nests. We call this approach a dynamic data layout. On an FPGA platform, the data reorganization cost is relatively low compared with conventional architectures, because we can transfer on-chip data in parallel. The cost can also be hidden by reorganizing data at a block granularity, rather than reorganizing the entire array after finishing one loop nest. In the meantime, both the producer and the consumer of the data can be busy working on different blocks of data.

The global optimization problem can mix static and dynamic data layouts. To derive a static layout, we first compute the locally optimal data layouts of each program region. Next, for all references to an array from two different program regions, we compute the common stride and the number of virtual memories for the unified data layout between the two program regions, as described in Figure 5.4. If condition 5.2 holds between the two program regions, we keep each locally optimal layout for these two array references, since they access mutually exclusive array indices. Otherwise, the two array references must be mapped to the same virtual memory, if the communication cost between the program regions exceeds the benefit of the memory parallelism gained by allowing data reorganization. As in Section 5.3.2, we use GCD(s', s) as the common stride of array references in two different program regions.
Since we may use a smaller stride for a unified data layout than we do for the locally optimal data layouts, the number of virtual memories in each region may be decreased. If a static data layout reduces parallelism, a dynamic data layout may be a better solution, when the memory parallelism benefit makes up for the reorganization cost. The global optimization problem can group the loop nests where fixing the layout is more beneficial, and introduce data-reorganizing communication at other program regions where the benefits outweigh the reorganization cost.

5.6 Incorporating Conventional Layout Schemes

Custom data layout improves the opportunities for parallel memory accesses for independent array references. Dependence-carrying array references are mapped to the same memory. If the dependence is loop-carried, the dependent array references do not access the same array elements in each loop iteration. For example, consider the two array references A[i] and A[i-1] within a loop. An array element accessed by A[i] is accessed again by A[i-1] in the next iteration of loop i. Custom data layout does not exploit parallel memory accesses in this case, since we have to statically specify the memory id to which array elements are mapped in the DEFACTO system. In this case, we may want to perform loop unrolling or unroll-and-jam to introduce independent array references within the loop body.

Alternatively, if the unroll factors conflict with other optimization goals, we may want to consider traditional data layout schemes, such as block, cyclic, and block-cyclic, under the assumption that memory bank disambiguation is performed automatically by a memory interface controller or a memory management unit (MMU). If we use a cyclic layout for the example discussed above, the two array references will access two different memories in every loop iteration, even though the instances of an array reference do not access a fixed memory bank.
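To make the contrast with these conventional schemes concrete, the following C sketch (illustrative only; in DEFACTO, bank disambiguation would be done by the memory interface controller, not by user code) shows how the bank id of element i could be computed under each scheme, for a one-dimensional array of N elements spread over M banks:

    /* Bank id of element i under conventional layouts:
     * block       : contiguous chunks of ceil(N/M) elements per bank
     * cyclic      : element i goes to bank i mod M
     * block-cyclic: blocks of BS elements dealt out round-robin      */
    int bank_block(int i, int N, int M)         { return i / ((N + M - 1) / M); }
    int bank_cyclic(int i, int M)               { return i % M; }
    int bank_block_cyclic(int i, int M, int BS) { return (i / BS) % M; }

Under the cyclic scheme with M ≥ 2, A[i] and A[i-1] fall in different banks in every iteration, which is exactly the parallel access noted above; the price is that the bank of a given reference instance varies with i, so it must be resolved at access time rather than fixed statically as custom data layout requires.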
When used in conjunction with unrolling loops in a nest to expose instruction-level parallelism, we observe up to 87.5% reduction in the number of memory access cycles and speedups up to 9.65 for 8 memories, as compared to using a single memory with no unrolling. As the trend towards architectures with increased bandwidth through parallel memory accesses continues, and as data storage and data movement are increasingly exposed to 110 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. software control, we foresee a growing need for techniques such as that described in this chapter, to maximize memory access parallelism. Ill Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Chapter 6 EXPERIM ENTS To show the effectiveness of the design space exploration algorithm, we have performed experiments on five multimedia application kernels; namely, • Finite Impulse Response (FIR) filter, integer multiply-accumulate over 32 consecu tive elements of a 64 element array. • Matrix Multiply (MM), integer dense m atrix multiplication of a 32-by-16 matrix by a 16-by-4 matrix. • String Pattern Matching (PAT), character matching operator of a string of length 16 over an input string of length 64. • Jacobi Iteration (JAC), 4— point stencil averaging computation over the elements of an array. • Sobel Edge Detection (SOBEL), 3-by-3 window Laplacian operator over an integer image. As shown in Figure 6.1, each application is written in C. However, our compiler also can handle Fortran programs. In this chapter, we first present the benefits of each individual compiler transformation in terms of fine-grain parallelism, data locality on chip, and parallel memory accesses, and then summarize the overall performance improvement. 112 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. for ( i = 0; i <32; i+ + ) { for (i—0; i <64; i + + ) { for ( j - 0 ; j <16; j + + ) { data[i] = 0 ; C[*][j] = 0 ; for (j= 0 ; j <32; j + + ) { for ( k = 0; k <4; k + + ) data[i] = data[*] + (sample[* + j] * coeff[j]); C[i][j] = C [i\[j} + B[i]\k] * A[j][fc]; } } (a) FIR for (i=0; i <48; i+ + ) { res[i] = 1 ; for ( j = 0 ; j <16; j + + ) { if (p a t[7] != str[* + j ]) res [ s ’ ] = 0 ; } } (c) PAT for (i= l; i <65; i + + ) { for (j= 0 ; j <33; j + + ) { su m l = - u[i-l][j-l] + 2 * u[f][j+ l] - 2 *u[f][j-l] + u[j+1][j+ 1] - u [i+ l]\j-l}; sum 2 = u [i-l][j-l] - u[z-l][j] + u [f-l][j+ l] - u [f+ l][j-l] - 2 *u[i4 -l][j] - u [i+ l][j+ l]; m agnitude = sum l * su m l + sum 2 * sum 2 ; if (m agnitude > threshold) e[i\[j) = 255; else e\i}[j] = 0 ; } } (e) SOBEL Figure 6.1: Application kernels. } } (b) MM for (*=01; i <33; i+ + ) { for ( j = 1; j <17; j + + ) { = (B [i + W l + B [* - W ) ~kB[i][j + 1] + B[i][j — l])/4 ; } } (d) JAC Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Algorithm FPGA Design Design S pace Exploration G eneral Optimizations Unroll-and-Jam S calar R eplacem ent 5. SU IF2V H D L /B ehavioral Synthesis Logic S ynthesis / Place& Route C ustom D ata Layout Figure 6.2: Experimental design flow 6.1 M ethodology We have used our unified framework by combining our compiler and a behavioral synthesis tool to analyze the given application and to determine the best hardware implementation on an FPGA. Figure 6.2 depicts the design flow used for these experiments. 
We applied the design space exploration algorithm in Figure 3.3 to select a balanced and efficient design for each of the application kernels. In these experiments, we target a Xilinx Virtex XCV1000 with 4 external memories, as in the Annapolis WildStar board [24]. In this platform, read and write operations to external memory require 7 and 3 clock cycles respectively in a non-pipelined access mode, and 1 clock cycle each in a pipelined access mode. In practice, memory latency falls somewhere between these two modes, as some but not all memory accesses can be fully pipelined. As part of the design space exploration, the compiler derives a set of designs in behavioral VHDL, which is then synthesized by the behavioral synthesis tool Monet [39] from Mentor Graphics. Based on several guiding metrics, the compiler decides on the next unroll factors with which to try another design.

Once we find an optimal design within the selective set of our optimization criteria, we produce an application-specific FPGA design by performing logic synthesis and place-and-route on the output of behavioral synthesis. For a design that our algorithm selected, we have performed logic synthesis with Mentor's LeonardoSpectrum and place-and-route with the Xilinx Foundations tool set. In these experiments, we also observe the impact of choosing different target clock rates (25MHz and 40MHz) on the quality of the designs. These rates are reasonable targets for the Virtex parts in our target platform. This variation in target clock rates allows us to verify the sensitivity of our design exploration algorithm to the aggressiveness of the estimation for different clock rates, and its impact on the ability of the algorithm to choose the correct design. The target clock was provided to behavioral synthesis and logic synthesis as a parameter, and both tools were directed to optimize for both performance and area.

In all, we produced behavioral synthesis results for a total of 364 points in the design space. Full place-and-route results were obtained for 209 of these points. Most of the remaining points are too large to fit within the capacity of the Virtex parts, and so logic synthesis does not attempt to produce a design. For seven of the points associated with PAT, there is an incompatibility between the output of behavioral synthesis and the expected input of logic synthesis, which prevents logic synthesis from producing a correct design. Thirteen points that are on the boundary of the Virtex XCV1000 capacity require more memory to synthesize than is available in the machine that produced our synthesis results.

We performed experiments at different stages of the experimental design flow to show the benefits of the individual stages. The rest of this chapter is organized as follows. In Section 6.2, we show the benefits of eliminating redundant memory accesses with scalar replacement, which involves stages 1 and 3 in Figure 6.2. In Section 6.3, we show the benefits of custom data layout after scalar replacement in the presence of the unroll-and-jam iteration reordering transformation, which involves stages 1 through 4 in Figure 6.2. In Section 6.4, we present the effectiveness of our design space exploration algorithm, which involves stages 1 through 6. In Section 6.5, we verify the accuracy of behavioral synthesis
Further reproduction prohibited without permission. estimates , which involves stages 1 through 7 in Figure 6.2. Finally, we summarize the experimental results in Section 6.6. 6.2 Scalar Replacem ent This section presents experimental results that characterize the impact of the scalar replacement algorithm for a set of application kernels. In this experiment, we compare five data reuse schemes: 1. no data reuse 2. redundant write elimination only 3. innermost only 4. an approximation to C arr’s approach 5. our approach All these schemes do not unroll loops except for an approximation to C arr’s approach. The second approach eliminates only redundant write array accesses across all the loops in a nest. The third approach eliminates both redundant read and write array accesses only in the innermost loop. This approach can be thought of as a lower bound on the benefits from C arr’ s scalar replacement algorithm. The approximation to C arr’ s approach selects the maximal unroll factor for the outer loop of the 2-deep loop nests such that the number of registers is not exceeded. We also performed loop-invariant code motion in addition to Carr’s scalar replacement approach. Inner loops are not unrolled in C arr’s algorithm. In the case of MM, we also unrolled the outermost loop since the results would be equivalent for either of the two outer loops. While C arr’ s algorithm might decide not to unroll this much, these results can be thought of as an upper bound on the benefits from C arr’s scalar replacement algorithm, since the measurements for performance improvements on our FPGA system are not affected by code size. 116 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Problem size FIR MM PAT JAC SOBEL 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 Outer loop 64 64 640 64 64 640 64 64 640 32 32 320 64 64 640 Middle loop 5 6 6 Inner loop 30 32 60 5 6 16 31 32 62 15 16 30 14 16 28 Table 6.1: Problem size (iteration count). To characterize the benefits and cost of each data reuse scheme, we measured four metrics: • the number of memory accesses remaining after each data reuse scheme • the number of registers used to exploit data reuse • the speedup over original programs • the FPGA space usage. The first and second metrics heavily depend on the iteration counts of the loops. For each program, thus, we compare three different problem sizes in terms of iteration count of each loop in a nest as shown in Table 6.1. We assume there are 32 registers available in the target architecture. The first problem size for each program requires less than or equal to 32 registers to fully exploit data reuse using our approach. The second size is slightly bigger than the first size, and thus it requires slightly more registers. The third size requires much more registers than 32 to exploit full data reuse. Thus, we perform partial data reuse in the cases of the second and third problem sizes of each program. However, we do not perform selective data reuse based on the efficiency metric in this experiment. For C arr’s approach, we maximize the unroll factor of the outer loop (and jam the copies of the inner loop) such that it uses less than or equal to 32 registers. Our approach is fully automated, but we performed redundant write elimination only, innermost only, and C arr’s approach by hand. 117 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
[Figure 6.3: Number of memory accesses for the five data reuse schemes (no data reuse, output only, innermost only, Carr's approach, our approach) on FIR, MM, PAT, JAC, and SOBEL; (a) problem size 1, (b) problem size 2, (c) problem size 3.]

In Figure 6.3, we measure how effectively each scheme eliminates redundant memory accesses. The first bar represents the original number of memory accesses; the second bar, the number of memory accesses remaining after eliminating only redundant array write accesses; the third and fourth bars, the number of array references remaining after applying the innermost only and Carr's approaches, respectively; and the last bar, the number of array references remaining after applying our approach.

In Figure 6.3(a) and (b), the second bar eliminates about 50% of memory accesses for FIR and MM, and 33% for PAT. However, there is no opportunity for redundant write elimination at all for JAC and SOBEL. In Figure 6.3(c), the second bar eliminates slightly more memory accesses (up to 1% more), since the problem size is bigger. Across the three problem sizes, the innermost only approach eliminates the same amount of redundant memory accesses as the second bar for FIR and PAT, 17% for JAC, and 39% for SOBEL, but only 28% for MM, which is smaller than the second bar's 50%. Carr's approach eliminates 72% of memory accesses for FIR, 52% for JAC, and 48% for SOBEL. However, it eliminates the same amount of redundant memory accesses as the second bar for MM and PAT. Our approach, on the other hand, eliminates slightly more redundant memory accesses (up to 5% more) than Carr's approach for FIR and JAC. Moreover, it eliminates 90% of memory accesses for MM, 65% for PAT, and 77% for SOBEL in Figure 6.3(a), up to 3% less in Figure 6.3(b), and up to 1% more in Figure 6.3(c). For bigger problem sizes, the benefits of our approach to scalar replacement would be greater.

Table 6.2 shows the number of registers required to exploit data reuse under each data reuse scheme. Only one register is required for FIR, MM, and PAT under the redundant write elimination only and innermost only schemes. For JAC and SOBEL, no register is necessary for redundant write elimination only, because there are no redundant array write accesses, while 3 and 9 registers, respectively, are required to eliminate redundant read accesses in the innermost loop. Carr's approach uses 32 registers for FIR, MM, and PAT, but 30 registers for JAC and SOBEL, because increasing the unroll factor further would require more than 32 registers. Our approach uses a comparable number of registers for problem sizes 1 and 3, but considerably fewer registers for problem size 2, because tiling the inner loop reduces its iteration count. Moreover, as shown in Figure 6.3(b), our partial data reuse scheme eliminates more redundant array accesses than Carr's approach, which uses many more registers.

Figure 6.4 presents the overall speedups on a single FPGA. The baseline is the original program, which accesses memory for all array references.
The redundant write elimination only scheme achieves speedups from 1.21 to 1.71, which is quite valuable given that it uses at most a single register. The innermost only scheme achieves speedups from 1.21 to 1.84, and Carr's approach achieves speedups from 1.21 to 3.21. Our approach, on the other hand, achieves speedups from 2.34 to 7.31 across the five programs. If we use full data reuse, which is possible on our FPGA platform, the speedups of our approach would be greater.

                   FIR          MM           PAT          JAC          SOBEL
Size              1  2  3      1  2  3      1  2  3      1  2  3      1  2  3
Output only       1  1  1      1  1  1      1  1  1      0  0  0      0  0  0
Innermost only    1  1  1      1  1  1      1  1  1      3  3  3      9  9  9
Carr's approach  32 32 32     32 32 32     32 32 32     30 30 30     30 30 30
Our approach     31 17 31     30 19 29     32 17 32     31 17 31     32 20 32

Table 6.2: Number of registers.

[Figure 6.4: Speedups of scalar replacement for the output only, innermost only, Carr's approach, and our approach schemes; (a) problem size 1, (b) problem size 2, (c) problem size 3.]

As a final benefit of our approach over using unroll-and-jam, the FPGA space required to implement the unrolled computation may be prohibitive. Efficient space usage is very important to deriving quality FPGA designs: larger designs have more routing complexity and may lead to lower achieved clock rates. In Table 6.3, we show the FPGA space usage of each data reuse scheme. The output only and innermost only schemes use a relatively small amount of FPGA space, since they use an extremely small number of registers. Carr's approach replicates the loop body through unroll-and-jam, which considerably increases the code size and the resulting FPGA space usage. Our approach also increases the code size, by peeling loop iterations, but the FPGA designs share resources between the peeled code and the main loop body, since they do not execute at the same time. In contrast, unrolled copies of the loop body may execute in parallel, so the hardware synthesis tool configures more and more functional logic as the unroll factor increases. Overall, our approach uses less FPGA space (up to 80% less) than Carr's approach.

                    FIR             MM              PAT             JAC                SOBEL
Size             1    2    3     1    2    3     1    2    3     1     2     3      1     2     3
Output only     1.3  1.3  1.3   1.4  1.4  1.5   0.2  0.2  0.3   0.5   0.5   0.5    2.4   2.6   2.9
Innermost only  1.3  1.3  1.4   1.5  1.5  1.5   0.3  0.3  0.3   1.4   1.4   1.5    4.7   4.7   5.1
Carr's approach 5.4  5.4  5.9   6    12   7     3.7  3.6  4     15.9  15.8  11.9   26    26.3  28.7
Our approach    2.6  2.3  2.9   2.8  3.2  4.3   1.5  1.2  1.9   3.3   3.3   4.2    12    11.6  13

Table 6.3: FPGA space usage (Kiloslices).

From the experiments that we performed, we observe the following sources of benefit:

• Redundant write elimination is valuable, yielding up to a 51 percent reduction in memory accesses.
• Most of the benefit nevertheless comes from reuse along input and true dependences.
• Exploiting data reuse across multiple loops does not necessarily require a large number of registers relative to the footprint of the accessed data, and results in the best performance.
• Partial data reuse within a limited number of registers eliminated up to 40% more redundant array accesses than Carr's approach for the five programs that we have studied (see the tiling sketch below).
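The tiling trade-off referred to above, full reuse within a tile at the cost of a few extra accesses across tiles, can be sketched as follows. This is a hypothetical fragment for illustration only; T is an assumed tile size chosen so that the tile's reuse fits in the 32 available registers, and the fragment is not the compiler's actual output.

    /* Hypothetical sketch of partial data reuse via tiling of the inner
     * loop.  Within a tile of T iterations, reused values can be kept in
     * registers; across tile boundaries they are re-read from memory,
     * making the register pressure proportional to T rather than to K. */
    #define T 16   /* assumed tile size, sized to the register budget */

    void fir_tiled(int N, int K, int *Y, const int *C, const int *X)
    {
        for (int i = 0; i < N; i++) {
            int sum = Y[i];
            for (int jj = 0; jj < K; jj += T)              /* tiles */
                for (int j = jj; j < jj + T && j < K; j++)
                    sum += C[j] * X[i + j];                /* reuse within tile */
            Y[i] = sum;
        }
    }

This is why "Our approach" in Table 6.2 needs only 17 to 20 registers for problem size 2 instead of a number proportional to the full inner iteration count.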
6.3 Custom Data Layout

This section presents experimental results that characterize the effectiveness of the previously described custom data layout algorithm for five multimedia kernel applications. In the following experiment, we compare the performance obtained with custom data layout against a naive layout and against modulo unrolling, as described in Section 5.1.

For each of the multimedia kernels, we have automatically generated the naive and custom layouts, performing the modulo unrolling mapping by hand. We assume that all application data fits into the memory components. With higher latencies, the benefits of memory parallelism increase; we therefore conservatively assign a low memory latency of one cycle each for both reads and writes, which is the case on our target platform when all memory accesses are fully pipelined.

[Figure 6.5: JAC, memory access times vs. unroll factors, for the naive, modulo unrolling, and custom layouts; (a) 4 memories, (b) 8 memories.]

[Figure 6.6: SOBEL, memory access times vs. unroll factors; (a) 4 memories, (b) 8 memories.]

6.3.1 Memory Access Times

The first set of results, in Figures 6.5 through 6.9, shows the time (in cycles) spent accessing memory for each of the three layout schemes, as a function of the unroll factors for the loop nest in each application. In the graphs, the x-axis corresponds to different unroll factors for the inner and outer loops. For example, 1x2 refers to no unrolling of the outer loop and a factor of 2 unrolling of the inner loop. We show results for both 4 and 8 memories, to see the impact of additional memories on overall memory performance.

For both Jacobi, in Figure 6.5, and SOBEL, in Figure 6.6, which have multi-dimensional single-induction-variable (SIV) subscripts, we see enormous decreases in memory cycles from both modulo unrolling and custom data layout whenever the inner loop is unrolled by a factor at least as large as the number of memories: a 75% reduction for 4 memories, and an 87.5% reduction for 8 memories. This is because unrolling the inner loop by the number of memories allows the maximally parallel data layout to be used for arrays with the inner loop's index in their lowest-dimension subscript expression. When only the outer loop is unrolled, custom data layout outperforms modulo unrolling, since it can distribute multiple dimensions of the accessed arrays. When the inner loop is unrolled by less than the number of memories and the outer loop is also unrolled, modulo unrolling does yield some improvement (for example, a 50% reduction in memory access time for the 2x2 case on 4 memories), but it is not as successful as custom data layout, which still achieves a 75% reduction in this case.
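The difference between the two distribution schemes can be made concrete with a small sketch. The names and the address arithmetic below are hypothetical and simplified; they convey the flavor of a modulo unrolling-style layout, not our compiler's actual code generation.

    /* Modulo unrolling-style layout: element A[i] of a 1-D array lives
     * in bank (i % NUM_BANKS).  After unrolling a loop by NUM_BANKS,
     * references A[4k], A[4k+1], A[4k+2], A[4k+3] each hit a different
     * bank and can be issued in parallel. */
    #define NUM_BANKS 4
    int bank_of(int i)   { return i % NUM_BANKS; }   /* which memory     */
    int offset_of(int i) { return i / NUM_BANKS; }   /* address within it */

A custom layout, by contrast, may distribute any array dimension (or several of them) across the banks, so parallelism is also available when the unrolled loop index appears in a higher-order dimension, as in the 4x1 and 2x2 cases discussed here.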
For both FIR, in Figure 6.7, and PAT, in Figure 6.8, which exhibit multiple-induction-variable (MIV) subscripts such as X[i+j], both our custom layout and modulo unrolling achieve a modest improvement over the naive layout when only one loop is unrolled, i.e., when one of the unroll factors is 1. Scalar replacement cannot eliminate these accesses without further unrolling. In the 4x1 case, there are slightly more array accesses whose subscript expressions contain the outermost loop index variable in the lowest dimension, and this accounts for the slight decrease in memory cycles compared with 1x4. When both loops are unrolled, we are able to take advantage of the unrolling in two dimensions, affecting not only the lowest-order dimension in array subscript expressions but any dimension related to the i or j loop, and derive a custom layout that achieves the maximum parallelism available on the system. Therefore, custom data layout outperforms modulo unrolling, which can only take advantage of unrolling in the lowest dimension of a given array access expression. The better performance of our custom layout is also attributable to the scalar replacement of further MIV array accesses exposed by unrolling but not exposed under modulo unrolling.

[Figure 6.7: FIR, memory access times vs. unroll factors; (a) 4 memories, (b) 8 memories.]

[Figure 6.8: PAT, memory access times vs. unroll factors; (a) 4 memories, (b) 8 memories.]

[Figure 6.9: MM, memory access times vs. unroll amounts; (a) 4 memories, (b) 8 memories.]

Results for MM are shown in Figure 6.9. Although MM is a three-deep loop nest, we only consider unroll factors for the two outer loops, since scalar replacement eliminates all memory accesses in the innermost loop. We see results similar to those of JACOBI and SOBEL. Custom data layout outperforms modulo unrolling when loops other than the middle loop (which in this case corresponds to the most quickly varying dimension) are unrolled, or when the middle loop is unrolled by a factor smaller than the number of memories. An additional improvement occurs in the cases where only the innermost loop is unrolled, due to parallelizing the memory accesses in the peeled portion of the loop body after scalar replacement. A subtle point is that the same mechanism does not eliminate all memory accesses in the peeled loops.
Looking at the 1x1x4 case in Figure 6.9, there is a decrease in memory access cycles for modulo unrolling and custom layout over the naive case due to unrolling in the lowest array dimension. This win comes from the now-parallelized memory accesses in the peeled loops. For the MM 1x4x1, 2x2x1, and 4x1x1 cases, the arguments are similar to those for the JACOBI and SOBEL 1x4, 2x2, and 4x1 cases, in that custom data layout outperforms modulo unrolling when unrolling occurs in an array dimension other than the lowest-order dimension, or in a combination of lowest and non-lowest order dimensions.

Overall, as we go from 4 memories to 8 memories, we see the growing importance of custom data layout, since larger unroll factors for the loop or loops representing the most quickly varying dimension are needed for modulo unrolling to fully utilize the memory bandwidth of the platform.

6.3.2 Speedups

Now we examine how the reduction in memory cycles translates into overall speedups. For the speedup results, the execution times are normalized to the naive 1x1 (i.e., not unrolled) time for each kernel.

[Figure 6.10: JAC, speedup; (a) 4 memories, (b) 8 memories.]

[Figure 6.11: SOBEL, speedup; (a) 4 memories, (b) 8 memories.]

The speedups for JACOBI and SOBEL are shown in Figures 6.10 and 6.11, respectively. The speedups when the inner loops are unrolled are proportional to the decreased number of memory cycles for each layout. The speedups when both loops are unrolled reflect the win of custom data layout from an increase in useful parallel memory accesses, and are directly proportional to the decrease in overall memory cycles. When the outer loop is unrolled, modulo unrolling does exhibit a speedup over the naive scheme even though the memory access cycles are the same for these two layouts. This is because our memory access accounting model assumes that a read and a write to different memories may not occur in parallel, which matters in JACOBI and SOBEL because many memory accesses have the form A(i,j) = B(i,j). In reality, the schedule that Monet generates does allow a memory read and write to different memories to occur in parallel; coupled with the fact that entire arrays are spread across multiple memories in modulo unrolling, this produces a speedup over the naive case.

The speedups for FIR and PATTERN are shown in Figures 6.12 and 6.13. The slightly higher speedups when the outer loops are unrolled are attributable to the slightly larger number of array accesses whose lowest-order dimension contains the outermost loop index i. When both loops are unrolled, we see speedups of 6 and 7.7 for the 4- and 8-memory versions of FIR, and speedups of 5.7 and 9.6 for the 4- and 8-memory versions of PAT. The speedup for MM is shown in Figure 6.14.
In all cases, the speedups are proportional to the decreases in memory accesses shown in Figures 6.5 through 6.9. Overall, we see the same effect as with the memory cycle times: as the number of memories increases, so does the importance of the custom data layout algorithm.

[Figure 6.12: FIR, speedup; (a) 4 memories, (b) 8 memories.]

[Figure 6.13: PAT, speedup; (a) 4 memories, (b) 8 memories.]

[Figure 6.14: MM, speedups; (a) 4 memories, (b) 8 memories.]

6.4 Automatic Design Space Exploration

We now present experimental results for the application of our design space exploration algorithm to the set of five multimedia kernels. In each figure, we show the results for non-pipelined memory accesses on the left and for pipelined memory accesses on the right, to observe the impact of memory access latency on the balance metric and, consequently, on the selected designs. In all results, we assume 4 memories, which is the number of external memories connected to each of the FPGAs on the Annapolis WildStar board.

The graphs in Figures 6.15 through 6.19 show a large number of points in the design space (substantially more than are searched by our algorithm) to highlight the monotonicity properties relating unroll factors to the metrics of interest. In all plots, a squared box indicates the design selected automatically by our design space exploration algorithm.

The first set of results, in Figures 6.15 and 6.16, plots balance, execution cycles, and design area on the target FPGA as a function of the unroll factors for the inner and outer loops of FIR and MM. Although MM is a 3-deep loop nest, we only consider unroll factors for the two outer loops, since scalar replacement has eliminated all memory accesses in the innermost loop. The graphs in the first two rows have the unroll factor for the inner loop as their x-axis, and each curve represents a specific unroll factor for the outer loop. In the balance plots, a design is balanced for an unrolling factor when the y-axis value is 1. Data points above 1.0 indicate compute-bound designs, whereas points below 1 indicate memory-bound designs. A compute-bound design suggests that more resources should be devoted to speeding up the computation component of the design, typically by unrolling and consuming more resources for computation. A memory-bound design suggests that fewer resources should be devoted to computation, as the functional units that implement the computation are already idle waiting for data.
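As a reading aid for these plots, the following is one plausible simplification of the balance metric consistent with the description above; the exact definition is given in Chapter 3, and this two-line function is only an assumed approximation of it.

    /* Hypothetical simplification of the balance metric of Chapter 3:
     * the rate at which the memory system can deliver data divided by
     * the rate at which the datapath consumes it.  A value of 1 is
     * balanced; above 1 the design is compute-bound (data arrives faster
     * than it is consumed), below 1 it is memory-bound. */
    double balance(double data_fetch_rate, double data_consume_rate)
    {
        return data_fetch_rate / data_consume_rate;
    }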
The design area graphs represent the space consumed (on a log scale) on the target Xilinx Virtex 1000 FPGAs for each of the unrolling factors. A vertical line indicates the maximal device capacity; all designs to the right of this line are therefore unrealizable.

[Figure 6.15: FIR kernel; (a) balance, non-pipelined, (b) balance, pipelined, (c) execution time, non-pipelined, (d) execution time, pipelined, (e) area, non-pipelined, (f) area, pipelined.]

[Figure 6.16: Matrix Multiply kernel; panels as in Figure 6.15.]
With pipelined memory accesses, there is a trend towards compute-bound designs due to the low memory latency. Without pipelining, memory latency becomes more of a bottleneck, leading, in the case of FIR, to designs that are always memory-bound, while non-pipelined MM exhibits both memory-bound and balanced designs.

The second set of results, in Figures 6.17 through 6.19, shows the performance of the remaining three applications: JAC, PAT, and SOBEL. In these figures, we present, as before, balance, execution cycles, and design area as a function of unroll factors.

We make several observations about the full results. First, we see that balance follows the monotonicity properties described in Observation 3, increasing until it reaches a saturation point and then decreasing. The execution time is also monotonically non-increasing, related to Observation 5. In all programs, our algorithm selects a design that is close to the best in terms of performance but uses relatively small unroll factors. Among the designs with comparable performance, in all cases our algorithm selects the design that consumes the smallest amount of space. As a result, we have shown that our approach meets the optimization criteria set forth in Section 1.2. In most cases, the algorithm selects the most balanced design. When a less balanced design is selected, it is either because the effective memory bandwidth utilization of the more balanced design is not yet saturated (as for non-pipelined FIR) or because the more balanced design is too large to fit on the FPGA (as for pipelined MM).

Table 6.4 presents the overall speedups of the design selected by our algorithm for each kernel, on a single FPGA with four memories, as compared to the baseline, for both pipelined and non-pipelined memory access modes. The baseline is the loop nest with no unrolling (an unroll factor of 1 for all loops) but with all other applicable code transformations applied, such as stages 1, 3, and 4 in Figure 6.2.

[Figure 6.17: Jacobi kernel; (a) balance, non-pipelined, (b) balance, pipelined, (c) execution time, non-pipelined, (d) execution time, pipelined, (e) area, non-pipelined, (f) area, pipelined.]
[Figure 6.18: Pattern kernel; (a) balance, non-pipelined, (b) balance, pipelined, (c) execution time, non-pipelined, (d) execution time, pipelined, (e) area, non-pipelined, (f) area, pipelined.]
—x O u te r L o o p U n ro ll F a c to r 32 - \ \ \ > f~ * O u te r L o o p U n ro ll F a c to r 32 O u te r L o o p U n ro ll F a c to r 6 4 « 1.8 * \ \ i \ \ O u te r L o o p U n ro ll F a c to r 6 4 Inner L oop U nroll Factor (a) Balance, non-pipelined □ s e le c te d d e sig n Inner L oop U nroll Factor (b) Balance, pipelined # - # Outer Loop Unroll Factor 1 HMH Outer Loop Unroll Factor 2 Outer Loop Unroll Factor 4 A - A Outer Loop Unroll Factor 8 4— F Outer Loop Unroll Factor 16 ' A — ¥ . Outer Loop Unroll Factor 32 Outer Loop Unroll Factor 64 □ se le c te d d e sig n Inner Loop Unroll Factor (c) Execution time, non-pipelined # - • O uter L oop Unroll • H i O uter L oop Unroll A » O uter L oop Unroll A A O uter L oop Unroll •I— F O uter L oop Unroll 5*“ * O uter L oop Unroll O uter L oop Unroll Factor I Factor 2 Factor 4 Factor 8 Factor 16 Factor 32 Factor 64 □ s e le c te d d e sig n Inner Loop Unroll Factor (d) Execution time, pipelined 0 □ s e le c te d d e s ig n 0 □ s e le c te d d e s ig n • • • • Space (log-scaled) Space (log-scaled) (e) Area, non-pipelined (f) Area, pipelined Figure 6.19: Sobel kernel. 135 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Program Non-Pipelined Pipelined FIR 7.67 17.26 MM 4.55 13.36 JAC 3.87 5.56 PAT 7.53 34.61 SOBEL 4.01 3.90 ___________________Table 6.4: Overall speedup on a single FPGA.___________________ Although we present a very large number of design points in the graphs shown in this section, the algorithm searches only a tiny fraction of those displayed. Instead, the design space exploration algorithm uses the pruning heuristics based on the monotonicity properties and the guiding metrics, as described in Chapter 3. This reveals the effective ness of the algorithm as it finds the best design point having only explored only 0.3% of the design space consisting of all possible unroll factors for each loop. For larger design spaces, we expect the number of points searched relative to the size to be even smaller. 6.5 Accuracy of Estim ates In this section, we compare performance estimates from behavioral synthesis with actual performance obtained from applying logic synthesis and place-and-route. The goal of these experiments is twofold. First, we compare for each design the performance estimates from behavioral synthesis with actual performance obtained from logic synthesis and place-and-route. Second, we determine if our design space exploration algorithm would have selected the same design if it had accurate synthesis data rather than estimates. Our design space exploration algorithm relies on behavioral synthesis estimates to assess the impact on the hardware designs of each of the program transformations the design space exploration algorithm performs. These estimates can be derived far more quickly (up to several orders of magnitude faster) than full synthesis and place-and-route 136 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. of the design, thus allowing the compiler to consider many more designs than would otherwise be practical. In this section, we show that, while not fully accurate, behavioral synthesis estimation is highly effective and efficient when used in conjunction with our design space exploration algorithm. 
Our results show that the same design would be selected by the design space exploration algorithm, whether we use estimates or actual results from place-and-route, because it favors smaller designs, where the estimates are accurate to some extent, and only increases complexity when the benefit is significant. 6.5.1 D erivin g A rea and S p eed M etrics The DEFACTO compiler derives quantitative metrics for the performance and FPGA area for each design in its design space exploration algorithm. We have presented how we derive the overall performance based on the balance metric in Chapter 3. The derivation of Area(d) is complicated by two main factors. First, the estimates do not include space required by logic elements inserted due to place-and-route. Second, the space metric provided for a given design is dependent on the target FPGA devices and on the specific component library used by the synthesis tool. In the case of our experimental set up, Monet does not provide Xilinx specific space metrics, but rather abstract space metrics. To address these two shortcomings, we apply a statistical approach to derive Area(d) from Monet estimate. Using the real space capacity of a Xilinx Virtex XCV1000 FPGA as Capacity — 12,288 slices, we have measured the exact place-and-route space for 209 designs for all our experiments. We then derive the ratio of the space Monet reports in its abstract space units to the measured space usage from place-and-route. Assuming a random normal distribution for the space-ratio (not the real area itself), we can then compute the mean ratio range for a given confidence interval, say 100(1 — a)%. Let X(n) 137 Reproduced with permission of the copyright owner. Furiher reproduction prohibited without permission. be the sample mean and S 2(n) be the sample variance. According to [33], 100(1 — a) percent confidence interval can be given by; X ± t n_ 1 > 1 _a/2X/ S 2(n )/n (6.1) where tn- i )i_ a/2 is a predefined constant. If we want 90% confidence, a must be set to 0.1 normally. Using a particular value for a, the compiler can determine the range of space-ratio for that particular confidence, and derive the range of real space in the target FPGA for each particular design in the abstract space measure. In reality, we can take advantage of the fact that the statistical distribution is symmetric whereas the compiler is only interested in the lower bound for the space-ratio range. Effectively, this allows the compiler to use a value of a that yields a tighter, hence better bound, than the value used for a would suggest. A value of a — 0.2 for a symmetric confidence interval of 80% effectively yields the asymmetric confidence interval for a value of a = 0.1 that is a confidence interval of 90%. For the 209 sample designs used in our experiments, the sample mean of the space- ratio of Monet estimate to the space usage measured by place-and-route is 2.715, and the sample variance is 0.232. For a 90% confidence interval, this ratio’s lower bound is 2.719 and thus the 12,288 Virtex slices translate into 33,411 as estimated by Monet. 6.5.2 R esu lts The first set of results, depicted in Figure 6.20 and Figure 6.21, compare the estimates produced by behavioral synthesis and place-and-route, respectively, for FIR, which is a 2-deep loop nest. Since the design space exploration algorithm evaluates designs resulting from different unroll factors, these results are presented as a function of unroll factors for FIR ’s inner and outer loops. 
Each curve in Figure 6.20 shows results for a fixed unroll factor for the outer loop, and each point on a curve represents a different unroll 138 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 250 k • - • O u t e r L oop U nroll F acto r 1 a~m O uter L oop U nroll F acto r 2 ♦•••♦ O uter L oop U nroll F acto r 4 A -A O u ter L oop U nroll F acto r 8 4 -4 O u ter L oop U nroll F actor 16 O u ter L oop U nroll F actor 32 *• - * O uter L oop U nroll F acto r 64 □ selected design | 1 0 0 k o s 150 k c • - • O u ter L oop U nroll F acto r 1 m~9 O uter L oop U nroll F actor 2 O uter L oop U nroll F actor 4 A -A O uter L oop U nroll F acto r 8 4 -4 O uter L oop Unroll Factor 16 - > —* O u ter L oop U nroll F acto r 32 □ selected design 0 12 4 8 16 Inner Loop U nroll Factor 32 32 Inner Loop U nroll F actor Figure 6.20: Estimated performance. Figure 6.21: Achieved performance. factor for the inner loop. As we increase the unroll factors, the amount of available parallelism increases dramatically, up to the saturation point. For larger unroll factors, instruction-level parallelism may improve but memory parallelism will not. Monotonicity of performance improvement as a function of unroll factors is clearly demonstrated in both the estimated results in Figure 6.20 and the actual performance in Figure 6.21, except the outer unroll factor one where there is so little reduction in cycles due to unrolling that the clock degradation outweighs the benefit. The Efficiency metric captures this behavior and prevents selection of such designs. We compare actual performance of all five programs as a function of measured space in terms of slices in Figure 6.22 and Figure 6.23, for a 25MHz and 40MHz target clock rate, respectively. We see a trend that performance varies somewhat for the smaller designs, but eventually we reach a point (related to the saturation point) where performance improves at most modestly but space continues to grow. We see from these figures that, for all programs, our algorithm selects one of the best-performing designs, and the smallest design among those of comparable performance. In all programs, we can acquire better performance with the 40MHz target clock rate. To examine the accuracy of the estimates, we plot the ratio of estimated to actual performance across the five programs in Figure 6.24 and Figure 6.25 for a 25MHz and 40MHz target clock rate, respectively. The Y-axis is the ratio of estimated to actual 139 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 600 k ■ selected design 500 k F IR JA C O B I M M P A T S O B E L 400 100 k 4 k 1 0 k 12k A ctu a l S p a c e (slices) 800 k ■ selected design — FIR JA C O B I M M P A T S O B E L 500 k 400k 300 k 200 k 4k 10 k 12 k A ctu a l S p a c e (slices) Figure 6.22: 25MHz Time vs. Space. Figure 6.23: 40MHz Time vs. Space. performance, so that values above 1 obtained better than expected performance, and below 1 worse than expected performance. The X-axis is measured space, so that we can see how estimation accuracy varies as the design grows more complex. The number of c-steps remains the same from behavioral synthesis through the final design, but there is some variation in clock rate achieved by place-and-route as the design grows due to increased routing complexity. 
For most of the tiny designs (corresponding to low unroll factors), the 25MHz target was overly conservative, and place-and-route was able to achieve a faster clock rate. As a result, for these designs, the estimated performance was also very conservative, yielding a ratio well above 1. These very small designs tended not to be selected by our algorithm, because their cycle counts were significantly higher than those for slightly larger unroll factors; but they show that, when the target clock rate is overly pessimistic, the algorithm would benefit from examining the estimated clock rate provided by the behavioral synthesis tool. For the large designs, the target clock rate was too optimistic, but the degradation of the clock rate was at most 20% below the 25MHz target. These results reveal a discrepancy between estimated and actual performance that is more noticeable for the small designs than for the larger ones. In Figure 6.25, the accuracy of the estimates improves over that for 25MHz for the smaller designs, but can be much worse for the larger designs (beyond about 50% utilization of the FPGA, at 6k slices).

[Figure 6.24: 25MHz Ratio vs. Space. Figure 6.25: 40MHz Ratio vs. Space.]

Overall, despite some fluctuations in the accuracy of the estimates, for either target clock rate our algorithm selects the appropriate design, because it favors smaller designs and only increases complexity if a significant benefit will be obtained.

6.6 Summary

For the five multimedia kernels that we have studied, scalar replacement eliminates 57 to 92 percent of memory accesses. Custom data layout reduces the time spent on memory operations by 33 to 86 percent, which translates into speedups of up to 9.65 on a single FPGA over a naive data layout. Among the designs considered, our automatic design space exploration selects a design with near-optimal performance and the smallest space usage among the designs of comparable performance. Overall, we achieve speedups of 4 to 34 on a single FPGA over the version with no unrolling but with the other optimizations applied. Our design space exploration algorithm is very efficient: it explores only 0.3% of the entire design space, guided by our metrics of balance, efficiency, and the saturation point.

We have also validated the accuracy of the behavioral estimates by comparing them with the place-and-route results. The behavioral estimates are accurate enough for our compiler to select an optimal solution. The final clock rate degraded by less than 20 percent from the target clock rate, even for huge designs that take up almost all of the chip space. In addition, even if we had accurate place-and-route data instead of behavioral estimates, our algorithm would still select the same design.

Chapter 7 RELATED WORK

In this chapter, we discuss related work in the areas of automatic synthesis of hardware circuits from high-level language constructs, and of design space exploration using high-level loop transformations, data reuse analyses, and data layout schemes.
7.1 Synthesizing High-Level Constructs

The gap between hardware description languages such as VHDL or Verilog and applications in high-level imperative programming languages prompted researchers to develop hardware-oriented high-level languages. These new languages allow programmers to migrate to configurable architectures without having to learn a radically new programming paradigm, while retaining some level of control over the hardware mapping and synthesis process.

One of the first efforts in this direction was the Handel-C [43] parallel programming language. Handel-C is heavily influenced by the OCCAM CSP-like parallel language but has a C-like syntax. The mapping from Handel-C to hardware is compositional, with constructs such as for and while loops mapped directly to predefined template hardware structures [35]. The Streams-C [18] language aims at retaining a C-like flavor but presents computations in a CSP-like parallel programming model. Data streams are a predefined class of data types with predefined put and get operators. Because Streams-C is so close to the VHDL structure, compilation to VHDL is greatly simplified.

Other researchers have developed approaches to mapping applications to their own reconfigurable architectures that are not FPGAs. These efforts, e.g., the RaPiD [12] reconfigurable architecture and PipeRench [21], have developed an explicitly parallel programming language and/or a compilation and synthesis flow tailored to the features of their architecture. The RaPiD architecture consists of a linear array of functional units connected through register units, and is therefore ideally suited to simple pipelined computations. RaPiD-C includes explicitly parallel constructs for managing concurrent execution as well as for mapping loop constructs onto the various stages of the linear array architecture. PipeRench [21] is a custom reconfigurable architecture whose goal is to virtualize the hardware layer of the reconfigurable architecture by overlaying in time the computation of multiple (and possibly unbounded numbers of) stripes over some finite set of hardware stripes allocated to each application. To program PipeRench, researchers have developed the DIL parallel programming language along with their own compiler. The compiler translates the programming constructs into a set of virtual stripe configurations, which are dynamically loaded onto the hardware stripes at run time.

The Cameron research project [47] is a system that compiles programs written in a single-assignment subset of C called SA-C into dataflow graphs and then synthesizable VHDL. The resulting compiled code has both a component that executes on a traditional processor and one that executes on computing architectures with FPGA devices. The SA-C language includes reduction and windowing operators for two-dimensional array variables, which can be combined with doall constructs to explicitly expose parallel operations in the computation. These operators are directly translated into predefined library implementations for the target FPGAs. Like our approach, the SA-C compiler includes loop-level transformations such as loop unrolling and tiling, particularly when windowing operators are present.
However, the application of these transformations is controlled by pragmas and is not automatic. Cameron's estimation approach builds on their own internal data-flow representation using curve-fitting techniques [31]; given the lack of control information in that representation, it cannot easily capture the area and speed impact of the target-specific transformations performed by the behavioral synthesis tools without replicating some of their functionality.

Several other researchers have developed tools that map computations expressed in a sequential imperative programming language such as C to reconfigurable custom computing architectures. Weinhardt [55] describes a set of program transformations for the pipelined execution of loops with loop-carried dependences onto custom machines, using a pipeline control unit and an approach similar to ours. He also recognizes the benefit of data reuse but does not present a compiler algorithm. The two projects most closely related to ours, the Nimble compiler and the work by Babb et al. [3], map applications written in C to FPGAs, but do not perform design space exploration. They also do not rely on behavioral synthesis, but instead replace most of the functionality of synthesis tools with their own in-house tools. The Napa-C compiler [19] maps applications written as sequential C programs to a reconfigurable architecture in which an FPGA-like reconfigurable core is coupled to a RISC core in a co-processor fashion. Their approach relies on programmer annotations to generate code for a reconfigurable computing architecture; the annotations specify which portions of the code should execute on the traditional processor and which should be synthesized for execution directly on the reconfigurable array.
The most related work [8] is explained in detail in Chapter 4. McKinley et al. [38] show a compound data locality optimization algorithm to improve cache per formance. They determine the best loop structure for data locality by applying loop 146 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. permutation, loop fusion, loop distribution, and loop reversal to place the loop that car ries most data reuse at the innermost position for each statement in a loop nest. Their cost model computes both temporal and spatial reuse of cache lines. Kolson et al. [30] use well-known internal data forwarding technique to eliminate redundant memory traffic in their pipelining scheduler. This technique forwards data directly from the source of data reuse to the other array references, instead of keeping the data in registers. Their approach is limited not only in the innermost loop, but also only between neighboring loop iterations when pipelined. Thus, their approach uses a small number of registers. Fink et al. [17] introduce a scalar replacement algorithm that can handle both array and pointer objects. They use an extended array SSA framework and global value numbering to determine if two object references are same. In their model, they treat both pointer and array references as accesses to elements of hypothetical heap arrays and they can avoid an expensive point-to analysis. Kodukula et al. [29] introduce data shackling, a data-centric approach to locality enhancement for L2 cache. This approach is different from control-centric approach in that it fixes a block of data and then determines which computations should be executed on it. The advantage of their approach over tiling is that they can handle imperfectly nested loops without transforming them to perfectly nested loops. Deitz el al. [13] introduce array subexpression elimination that eliminates redundant computation and accompanying memory accesses. They are similar to common subexpression elimination in that they store some intermediate computed data into a tem porary variable and replace the same expression with it. They can handle vector/SIMD array notations such as Fortran90, but their approach focuses only on multi-dimensional stencils. 147 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 7.4 D ata Layout Large-scale multiprocessor systems are highly affected by computation and data parti tioning across processors and memory and to that end, much work derives coarse-grain layouts that avoid communication among processors. Gupta and Banerjee [22] as well as Kennedy and Kremer [27] study the relationship of data partitioning with other com piler optimizations and formulate schemes for use in conjunction with High-Performance Fortran. Anderson’s work [2] automatically derives a coarse-grain data and computation distribution for sequential programs. Further work in the area of data reorganization within procedures as well as across procedure boundaries takes into account effects from the cache(s) on a sequential processor. Cierniak and Li [10] present a framework in cluding both data and control transformations simultaneously. We solve for a fine-grain data layout in our research to decrease memory access latencies. Song et al. [52] employ array contraction in developing a memory piinimization scheme for a cache system. Kim and Prasanna [28] propose a fine-grain data distribution scheme by deriving perfect latin squares for subarrays. 
They ensure conflict-free mapping to multiple memory modules by assigning a unique memory mapping in any row, column, diagonal, or any subsquare. Huang, Carr and Sweany [23] partition registers across multiple banks to aid soft ware pipelining in maximizing parallelism and minimizing the number of remote register accesses. Their approach is similar to ours in that they use unroll-and-jam and scalar replacement but they employ a register distribution algorithm while we focus on mem ories. Sudarsanam and Malik [53] focus on increased memory bandwidth by perm itting multiple memory accesses to occur in parallel when the referenced variables belong to different memory banks in application specific processors (ASIPs). They perform a com plex search while we directly solve for the custom data layout. Delaluz, Kandemir and Sezer [14] maximize parallel memory accesses while reducing energy consumption of the 148 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. memory system. They use profiling to obtain a global data layout solution and are limited to HPF-like layouts. In heterogeneous systems, the Napa-C compiler effort [19, 20] defines extensions to the C language that include mapping variables to storage and also implement the associated compiler processing. Our user input is unannotated and we automatically derive the variable mapping. Others have developed scheduling algorithms either in software or hardware to de crease the overall traffic seen by the memory subsystem. Murthy and Shuvera [41] apply coarse-grain scheduling techniques to synchronous dataflow graphs to minimize memory usage for DSP programs. Mathew et al [36]. implement a memory controller for SDRAM that gathers sparse, strided data and schedules a compound vector command. Rixner et al. [49] and McKee and Wulf [37] exploit memory hierarchy architecture and component characteristics by performing dynamic, out of order scheduling of memory accesses. We leave the details of the scheduling to the Monet scheduler in our system. Custom memory architectures, derived from the application characteristics, also form part of the solution to the memory-computation unit performance gap. Grun, D utt, Nicolau [46] use temporal and spatial locality information to synthesize a memory archi tecture, including caches, completely from scratch. Weinhardt and Luk [56] extract small on-chip RAM in order to limit off- chip memory accesses for reconfigurable platforms. Slock et al. [51] decide how many one or two port memories to implement on Application Specific Integrated Circuits (ASICs). Schmit et al. [50] decrease memory access time by applying techniques that reduce address generation hardware requirements and process ing time. In our work, we assume a fixed memory architecture. We replace array accesses from memory with scalar register accesses in our design for increased on-chip storage of array elements that will be reused. For tiled architectures such as Raw, the Maps [4] compiler performs modulo unrolling as described earlier in the chapter. Other features of the compiler include equivalence 149 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. class unification for pointer analysis and static versus dynamic promotion of memory accesses. 7.5 Discussion The research presented in this dissertation differs from the efforts mentioned above in several respects. 
7.5 Discussion

The research presented in this dissertation differs from the efforts mentioned above in several respects. First, the focus of this research is on developing an algorithm that can explore a wide range of design points, rather than selecting a single implementation. Second, the proposed algorithm takes as input a sequential application description and does not require the programmer to control the compiler's transformations. Third, the proposed algorithm uses high-level compiler analysis and behavioral estimation techniques to guide the application of the transformations as well as to evaluate the various design points. Our algorithm supports multi-dimensional array variables, absent in previous analyses for the mapping of loop computations to FPGAs. Finally, we use a commercially available behavioral synthesis tool to complement the parallelizing compiler techniques rather than creating an architecture-specific synthesis flow that partially replicates the functionality of existing commercial tools. Behavioral synthesis allows the design space exploration to extract more accurate performance metrics (time and area used) rather than relying on a compiler-derived performance model. Our approach greatly expands the capability of behavioral synthesis tools through more precise program analysis.

Chapter 8

CONCLUSION

As devices, and consequently designs, become more complex, there will be a growing need to explore increasingly large design spaces in an efficient fashion. The DEFACTO system presented in this dissertation addresses this growing concern by combining the strengths of parallelizing compiler and behavioral synthesis technologies in a unified framework to deliver a fast and automated design space exploration algorithm. In the next section, we present the contributions of this dissertation. In Section 8.2, we discuss future extensions of this research.

8.1 Contributions

This dissertation has presented the following specific contributions, which are all fully automated in the DEFACTO compiler and the SUIF compiler.

8.1.1 Automatic Design Space Exploration

We have developed a design space exploration algorithm that permits us to adjust parallelism and data reuse within the chip space constraint when automatically mapping applications to FPGA-based systems. To meet the optimization criteria, we have reduced the optimization process to a tractable problem: that of selecting the unroll factors for each loop in the nest that lead to a high-performance, balanced, and efficient design.

Operator parallelism is exposed to high-level synthesis through the unrolling of one or more loops in the nest; any independent operations will be performed in parallel if high-level synthesis deems this beneficial. We introduce several guiding metrics: balance, efficiency, and a memory parallelism saturation point. Balance suggests whether more resources should be devoted to enhancing computational parallelism or data locality. Efficiency indicates whether the increase in on-chip space usage outweighs the performance improvement between two designs. A memory parallelism saturation point indicates that data is being fetched and stored at a rate corresponding to the maximum bandwidth of the target architecture; it suggests a starting point for the search for the best unroll factors.
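The following C sketch shows one way such metrics can steer a search over unroll factors. It is an illustrative addition: estimate_design(), balance(), and fits_on_chip() are hypothetical stand-ins for behavioral synthesis estimation and the metrics above, not DEFACTO's actual interfaces, and the cost model inside them is a toy.

    #include <stdio.h>

    typedef struct { double time; double area; } Estimate;

    /* Toy stand-ins so the sketch is self-contained; in DEFACTO these
       values come from behavioral synthesis estimation. */
    static Estimate estimate_design(int u1, int u2) {
        Estimate e;
        e.time = 1000.0 / (u1 * u2) + 64.0;   /* compute time shrinks with unrolling */
        e.area = 100.0 * u1 * u2;             /* on-chip area grows with unrolling */
        return e;
    }
    static int fits_on_chip(Estimate e) { return e.area <= 3200.0; }
    static double balance(Estimate e) { (void)e; return 2.0; }  /* toy: always compute bound */

    /* Walk candidate unroll factors for a 2-deep nest, growing the
       design only while it fits on chip and remains compute bound. */
    static void explore(int n1, int n2, int *best1, int *best2) {
        Estimate best = estimate_design(1, 1);
        *best1 = 1; *best2 = 1;
        for (int u1 = 1; u1 <= n1; u1 *= 2) {
            for (int u2 = 1; u2 <= n2; u2 *= 2) {
                Estimate e = estimate_design(u1, u2);
                if (!fits_on_chip(e)) break;   /* area is monotonic in u2 */
                if (e.time < best.time) { best = e; *best1 = u1; *best2 = u2; }
                if (balance(e) < 1.0) break;   /* memory bound: stop growing u2 */
            }
        }
    }

    int main(void) {
        int u1, u2;
        explore(16, 16, &u1, &u2);
        printf("selected unroll factors: %d x %d\n", u1, u2);  /* 2 x 16 under this toy model */
        return 0;
    }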
Guided by these metrics and the monotonic properties of the design search space, the design space exploration algorithm increases the complexity of the design only when doing so will have large performance gains. The experimental results for five multimedia kernel computations reveal that the algorithm is able to quickly and automatically derive a good design based on synthesis estimates and the guiding metrics presented in this dissertation. Overall, we achieve speedups of 4 to 34 on a single FPGA over the no-unroll version with scalar replacement and custom data layout applied. The design space exploration algorithm explores on average only 0.3% of the entire design search space.

8.1.2 Scalar Replacement

Because of the high memory latency and low bandwidth to external memories, exploiting data locality on-chip is essential to obtaining good performance in FPGA-based systems. Data reuse is exploited across the loops in the nest as a result of scalar replacement. The scalar replacement algorithm extends the previous research by Carr and Kennedy in several ways. First, we increase the candidates for data reuse both by exploiting data reuse across multiple loops in a nest (not just the innermost loop) and by removing redundant write memory accesses (in addition to redundant reads). Our data reuse analysis can identify all the data reuse opportunities within a loop nest. As long as there are enough registers, our approach can exploit all the redundant read and write memory accesses we have identified.

Second, we increase the applicability by eliminating the necessity for unroll-and-jam. Unroll-and-jam is not always legal, and the unroll factors for scalar replacement may conflict with the desired factors for other optimizations such as instruction-level parallelism. Since our approach is not limited to just the innermost loop, it can exploit all the data reuse opportunities across multiple loops in a nest without requiring unroll-and-jam.

Third, we provide a flexible strategy to trade off between exploiting reuse opportunities and reducing the register requirements of scalar replacement. On some architectures where there are not enough registers, not all the data reuse opportunities we have identified can be exploited. We address this challenge in two ways. We trade off some data reuse opportunities for lower register pressure using loop tiling: we exploit full data reuse only within a tile and introduce some extra memory accesses across tiles. Another way of reducing register pressure is to selectively exploit the data reuse of reuse chains depending on their efficiency metric.

Lastly, we improve scalar replacement in the presence of control flow. We eliminate partially redundant memory accesses even though doing so may introduce some unnecessary memory accesses for a few iterations in rare cases. However, the overall number of memory accesses decreases, since we eliminate many more partially redundant memory accesses in the remaining iterations. Further, we can peel the boundary iterations to avoid the unnecessary memory accesses. Using this technique, we observe a 57 to 92 percent reduction in the number of memory accesses for the five multimedia kernels that we have studied.
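As a concrete illustration of the transformation (a simplified example, not one of the five kernels), consider a loop that sums neighboring elements; scalar replacement rotates the reused value through a scalar so that each array element is read from memory exactly once.

    /* Before: 2n memory reads; a[i+1] in iteration i is re-read as
       a[i] in iteration i+1. */
    void smooth(int n, int a[], int b[]) {
        for (int i = 0; i < n; i++)
            b[i] = a[i] + a[i + 1];
    }

    /* After scalar replacement: n+1 memory reads; the reuse is carried
       in a scalar, which synthesis maps to a register. */
    void smooth_sr(int n, int a[], int b[]) {
        if (n <= 0) return;
        int a0 = a[0];                 /* head of the reuse chain */
        for (int i = 0; i < n; i++) {
            int a1 = a[i + 1];         /* the only memory read per iteration */
            b[i] = a0 + a1;
            a0 = a1;                   /* rotate: becomes a[i] next iteration */
        }
    }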
8.1.3 Custom Data Layout

Custom data layout analyzes the data access patterns of the given optimized code and distributes the remaining memory accesses across multiple memory banks to exploit parallel memory accesses. By examining array subscript expressions and data dependences, our algorithm automatically derives application-specific data layouts in multiple memories.

Our algorithm partitions arrays across multiple memories in two phases. First, based on the analyzed data access patterns, we identify the opportunities for parallel memory accesses in an architecture-independent way. We create as many virtual memories as possible, each of which contains only a single renamed array. Next, we map the virtual memories to the limited number of physical memories that are available in the target architecture, taking memory access scheduling constraints into account.

As compared to solutions that optimize for memory parallelism assuming a fixed data layout, our approach yields high memory parallelism no matter how memory is accessed. This difference is particularly important when used in conjunction with code reordering transformations, such as the loop nest transformations commonly performed on array-based computations. In addition, our approach supports more varied data layouts than HPF-like notations (block, cyclic, block-cyclic) can express. Our compiler has more degrees of freedom in transforming code, and can thus preserve memory parallelism while accomplishing other optimization goals. Using this technique, we observe a speedup of up to 9.65 and an 86% reduction in memory accesses on eight memories.
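A small C sketch makes the virtual-memory phase concrete (the renaming below is illustrative, not the compiler's actual output): an array accessed at stride 2 is split into two renamed arrays, one per virtual memory, which the second phase then binds to distinct physical banks so that both reads can issue in parallel.

    /* Before: a[2*i] and a[2*i+1] contend for the same memory port. */
    void sum_pairs(int n, int a[], int s[]) {
        for (int i = 0; i < n; i++)
            s[i] = a[2 * i] + a[2 * i + 1];
    }

    /* After custom data layout (sketch): even and odd elements of a[]
       are renamed into separate virtual memories; a_even[i] holds
       a[2*i] and a_odd[i] holds a[2*i+1].  Once mapped onto different
       physical banks, the two reads can be scheduled in the same cycle. */
    void sum_pairs_cdl(int n, int a_even[], int a_odd[], int s[]) {
        for (int i = 0; i < n; i++)
            s[i] = a_even[i] + a_odd[i];
    }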
8.2 Future Work

In this dissertation, we have considered three major code transformations that improve fine-grain parallelism, on-chip temporal data locality, and parallel memory accesses across multiple memories. There are many other optimizations used to increase instruction-level parallelism that could potentially improve the resulting design. We discuss the most significant of these here. Software pipelining allows iterations of a loop to be overlapped with one another in order to take advantage of maximum parallelism in a loop body [32]. Predicated execution guards each operation with one of the predicate registers, and the operation is committed only if the guarding predicate is true [44]. Instruction-level parallelism is increased by unconditionally executing all possible control paths at the same time. These techniques affect the design space exploration algorithm in several ways. First, they increase the data consumption rate, which is useful for compute-bound designs. Second, they may consume more on-chip resources, so the compiler must consider space-time trade-offs. Third, they change the data access patterns of the code, and require increased bandwidth from memories.

The design space exploration algorithm presented in this dissertation focuses on deriving the best implementation for each loop nest. Beyond optimizing individual loop nests, a major focus of future work on DEFACTO is a global optimization strategy for multiple tasks (multiple loop nests). The locally optimal solution may not be the best global solution for the following reasons. First, since the optimization of an individual loop nest assumes the entire chip space can be used, there may not be sufficient chip space for all the loop nests. Within a fixed limit of on-chip resources, we have to decide the unroll factors for each task that lead to the best overall performance. Second, we must analyze the communication requirements between tasks. It would be much cheaper to communicate data between producer and consumer tasks using on-chip buffers or direct communication channels between FPGAs than to access memories. Thus, a reduction in memory accesses may allow us to increase the complexity of the critical section (the most time-consuming task) to improve the overall performance. Third, there is always a space-time trade-off among the tasks. As in the local optimization problem, balancing the data processing rate between two tasks that are a producer and a consumer of data is important to effectively utilize the limited on-chip space. Fourth, to expand custom data layout to multiple tasks, we have to recognize both static and dynamic layouts, depending on the reorganization cost and the benefit of parallel memory accesses. We will also investigate the integration of custom data layout with the task-level pipeline analyses [60] and include new analyses for data reordering when profitable during communication. We will further expand this research to multiple tasks on multiple FPGAs.

New and future FPGA devices include many more cores, special-purpose functional units, and complex memory hierarchies such as on-chip memory. Another research extension is how to take advantage of these new architectural features as efficiently as possible.

Reference List

[1] Randy Allen and Ken Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann Publishers, San Francisco, 2002.

[2] Jennifer M. Anderson. Automatic Computation and Data Decomposition for Multiprocessors. Ph.D. thesis, Stanford University, 1997. Published as Stanford CSL-TR-97-719.

[3] J. Babb, M. Rinard, A. Moritz, W. Lee, M. Frank, R. Barua, and S. Amarasinghe. Parallelizing applications into silicon. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '99), pages 70-81, Los Alamitos, California, 1999. IEEE Computer Society Press.

[4] Rajeev Barua, Walter Lee, Saman P. Amarasinghe, and Anant Agarwal. Maps: A compiler-managed memory system for Raw machines. In ISCA, pages 4-15, 1999.

[5] A. J. Bernstein. Analysis of programs for parallel processing. IEEE Transactions on Electronic Computers, 15(5):757-763, October 1966.

[6] D. Buell, J. Arnold, and W. Kleinfelder. Splash 2: FPGAs in a custom computing machine. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines. IEEE Computer Society Press, April 1996.

[7] David Callahan, Steve Carr, and Ken Kennedy. Improving register allocation for subscripted variables. In Proceedings of the ACM SIGPLAN 1990 Conference on Programming Language Design and Implementation, New York, June 1990. ACM Press.

[8] Steve Carr and Ken Kennedy. Improving the ratio of memory operations to floating-point operations in loops. ACM Transactions on Programming Languages and Systems, 16(6):1768-1810, November 1994.

[9] Steve Carr and Ken Kennedy. Scalar replacement in the presence of conditional control flow. Software - Practice and Experience, 24(1):51-77, January 1994.

[10] Michal Cierniak and Wei Li.
Unifying data and control transformations for distributed shared memory machines. In SIGPLAN Conference on Programming Language Design and Implementation, pages 205-217, 1995.

[11] Altera Corp. APEX II programmable logic device family data sheets. 2001.

[12] D. Cronquist, P. Franklin, and C. Ebeling. Specifying and compiling applications for RaPiD. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '98), pages 116-125, Los Alamitos, Calif., 1998. IEEE Computer Society Press.

[13] Steven J. Deitz, Bradford L. Chamberlain, and Lawrence Snyder. Eliminating redundancies in sum-of-product array computations. In Proceedings of the ACM International Conference on Supercomputing, pages 65-77, 2001.

[14] V. Delaluz, M. Kandemir, and U. Sezer. Improving off-chip memory energy behavior in a multi-processor, multi-bank environment. In Proceedings of the 14th Workshop on Languages and Compilers for Parallel Computing, Berlin, August 2001. Springer-Verlag.

[15] S. Derrien and S. Rajopadhye. Loop tiling for reconfigurable accelerators. In Proceedings of the Eleventh International Symposium on Field Programmable Logic (FPL 2001), 2001.

[16] John P. Elliott. Understanding Behavioral Synthesis: A Practical Guide to High-Level Design. Kluwer Academic Publishers, 2nd edition, 2000.

[17] Stephen J. Fink, Kathleen Knobe, and Vivek Sarkar. Unified analysis of array and object references in strongly typed languages. In Proceedings of the 2000 Static Analysis Symposium, pages 155-174, 2000.

[18] J. Frigo, M. Gokhale, and D. Lavenier. Evaluation of the Streams-C C-to-FPGA compiler: an applications perspective. In Proceedings of the ACM Symposium on Field Programmable Gate Arrays (FPGA '01), 2001.

[19] Maya B. Gokhale and Janice M. Stone. Napa C: compiling for a hybrid RISC/FPGA architecture. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 126-135, 1998.

[20] Maya B. Gokhale and Janice M. Stone. Automatic allocation of arrays to memories in FPGA processors with multiple memory banks. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 63-69, 1999.

[21] S. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R. Taylor, and R. Laufer. PipeRench: A coprocessor for streaming multimedia acceleration. In Proceedings of the 26th International Symposium on Computer Architecture, New York, NY, 1999. ACM Press.

[22] Manish Gupta and Prithviraj Banerjee. Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers. IEEE Transactions on Parallel and Distributed Systems, 2(2):179-193, 1992.

[23] Xianglong Huang, Steve Carr, and Phillip Sweany. Loop transformations for architectures with partitioned register banks. In Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems, 2001.

[24] Annapolis MicroSystems Inc. WildStar Reference Manual, revision 4.0, 1999.

[25] XILINX Inc. Virtex-II 1.5V FPGA Complete Data Sheet. DS031 (v1.7). 2100 Logic Drive, San Jose, Calif., 2001.

[26] Vinod Kathail, Shail Aditya, Rob Schreiber, Bob Rau, Darren Cronquist, and Mukund Sivaraman. PICO (program in, chip out): Automatically designing custom computers. IEEE Computer, 35(9):39-47, September 2002.

[27] Ken Kennedy and Ulrich Kremer.
Automatic data layout for High Performance Fortran. 1995.

[28] Kichul Kim and Viktor K. Prasanna. Latin squares for parallel array access. IEEE Transactions on Parallel and Distributed Systems, 4(4):361-370, April 1993.

[29] Induprakas Kodukula, Keshav Pingali, Robert Cox, and Dror E. Maydan. An experimental evaluation of tiling and shackling for memory hierarchy management. In Proceedings of the ACM International Conference on Supercomputing, pages 482-491, June 1999.

[30] D. Kolson, A. Nicolau, and N. Dutt. Elimination of redundant memory traffic in high-level synthesis. IEEE Transactions on Computer-Aided Design, 15(11):1354-1363, November 1996.

[31] D. Kulkarni, W. Najjar, R. Rinker, and F. Kurdahi. Fast area estimation to support compiler optimizations in FPGA-based reconfigurable systems. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM '02), 2002.

[32] Monica Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 1988.

[33] Averill M. Law and W. David Kelton. Simulation Modeling and Analysis. McGraw-Hill, Inc., 1991.

[34] Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek Sarkar, and Saman P. Amarasinghe. Space-time scheduling of instruction-level parallelism on a Raw machine. In Architectural Support for Programming Languages and Operating Systems, pages 46-57, 1998.

[35] W. Luk, D. Ferguson, and I. Page. Structured hardware compilation of parallel programs. Abingdon EE&CS Books, 1994.

[36] Binu K. Mathew, Sally A. McKee, John B. Carter, and Al Davis. Design of a parallel vector access unit for SDRAM memory systems. In Proceedings of the 6th International Symposium on High-Performance Computer Architecture, pages 39-48, 1999.

[37] Sally A. McKee and Wm. A. Wulf. Access order and memory-conscious cache utilization. In Proceedings of the First Symposium on High Performance Computer Architecture (HPCA-1), pages 253-262, January 1995.

[38] Kathryn S. McKinley, Steve Carr, and Chau-Wen Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 18(4):424-453, July 1996.

[39] Mentor Graphics Inc. Monet™, r44 edition, 1999.

[40] E. Morel and C. Renvoise. Global optimization by suppression of partial redundancies. Communications of the ACM, 22(2), February 1979.

[41] J. Murthy and S. Bhattacharyya. A buffer merging technique for reducing memory requirements of synchronous dataflow specifications. In Proceedings of the International Symposium on System Synthesis, pages 78-84, 1999.

[42] John Ng, Dattatraya Kulkarni, Wei Li, Robert Cox, and Scott Bobholz. Interprocedural loop fusion, array contraction and rotation. In The 12th International Conference on Parallel Architectures and Compilation Techniques, pages 114-124, New Orleans, LA, 2003.

[43] I. Page and W. Luk. Compiling occam into FPGAs. In Proceedings of the First International Symposium on Field Programmable Logic (FPL '91), 1991.

[44] J. C. H. Park and M. Schlansker. On predicated execution. Technical Report HPL-91-58, HP Labs, 1991.

[45] Joonseok Park and Pedro Diniz. Synthesis of memory access controllers for streamed data applications for FPGA-based computing engines. In Proceedings of the 14th International Symposium on System Synthesis (ISSS '01), October 2001.
[46] Peter Grun, Nikil Dutt, and Alex Nicolau. Access pattern based local memory customization for low power embedded systems. In Proceedings of Design, Automation and Test in Europe, pages 778-784, 2001.

[47] R. Rinker, M. Carter, A. Patel, M. Chawathe, C. Ross, J. Hammes, W. Najjar, and W. Bohm. An automated process for compiling dataflow graphs into reconfigurable hardware. IEEE Transactions on VLSI Systems, 9(1):130-139, 2001.

[48] Scott Rixner, William J. Dally, Ujval J. Kapasi, Brucek Khailany, Abelardo Lopez-Lagunas, Peter R. Mattson, and John D. Owens. A bandwidth-efficient architecture for media processing. In Proceedings of the International Symposium on Microarchitecture, pages 3-13, 1998.

[49] Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter R. Mattson, and John D. Owens. Memory access scheduling. In Proceedings of the 27th International Symposium on Computer Architecture, pages 128-138, 2000.

[50] Schmit and Thomas. Address generation for memories containing multiple arrays. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17, 1998.

[51] Peter Slock, Sven Wuytack, Francky Catthoor, and Gjalt de Jong. Fast and extensive system-level memory exploration for ATM applications. In Proceedings of the 10th ACM/IEEE International Symposium on System-Level Synthesis, pages 74-81, Antwerp, Belgium, 1997.

[52] Yonghong Song, Rong Xu, Cheng Wang, and Zhiyuan Li. Locality enhancement by array contraction. In Proceedings of the 14th Workshop on Languages and Compilers for Parallel Computing. Springer-Verlag. Published as Lecture Notes in Computer Science.

[53] A. Sudarsanam and S. Malik. Memory bank and register allocation in software synthesis for ASIPs. In Digest of Technical Papers, IEEE/ACM International Conference on Computer-Aided Design, pages 388-392, 1995.

[54] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring it all to software: Raw machines. IEEE Computer, pages 86-93, September 1997.

[55] Markus Weinhardt. Compilation and pipeline synthesis for reconfigurable architectures. In Proceedings of the 1997 Reconfigurable Architectures Workshop (RAW '97). Springer-Verlag, 1997.

[56] Markus Weinhardt and Wayne Luk. Memory access optimization and RAM inference for pipeline vectorization. In Proceedings of the Ninth International Workshop on Field-Programmable Logic and Applications, pages 61-70. Springer-Verlag, 1999. Lecture Notes in Computer Science vol. 1673.

[57] Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe, Jennifer-Ann M. Anderson, Steven W. K. Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary W. Hall, Monica S. Lam, and John L. Hennessy. SUIF: An infrastructure for research on parallelizing and optimizing compilers. SIGPLAN Notices, 29(12):31-37, 1994.

[58] Xiaotong Zhuang, Santosh Pande, and John S. Greenland Jr. A framework for parallelizing load/stores on embedded processors. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT '02), 2002.

[59] Heidi Ziegler, Mary Hall, and Pedro Diniz. Compiler-generated communication for pipelined FPGA applications. In Proceedings of the 40th Design Automation Conference (DAC '03), June 2003.

[60] Heidi Ziegler, Byoungro So, Mary Hall, and Pedro Diniz. Coarse-grain pipelining for multiple FPGA architectures.
In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, 2002.

Appendix A

Behavioral VHDL Output

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

package MyPackage is
  subtype UNSIGNED_INT_8  is integer range 0 to (2**8 - 1);
  subtype UNSIGNED_INT_16 is integer range 0 to (2**16 - 1);
  subtype SIGNED_INT_8    is integer range -(2**7) to (2**7 - 1);
  subtype SIGNED_INT_16   is integer range -(2**15) to (2**15 - 1);
  subtype SIGNED_INT_32   is integer range -(2**31) to (2**31 - 1);
end MyPackage;

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;
library mgc_hls;
use mgc_hls.defs.all;
use work.MyPackage.all;

entity main is
  port (task_start : in  std_logic;
        task_clk   : in  std_logic;
        task_reset : in  std_logic;
        task_done  : out std_logic);
end main;

architecture a_main of main is
begin
  p1 : process
    variable i, j : SIGNED_INT_32;
    -- Scalars introduced by scalar replacement, one per element of the
    -- reuse chains (the declarations of B_0_2_2 .. B_0_2_16 and
    -- B_0_3_1 .. B_0_3_17 repeat the same pattern):
    variable B_0_0_0, B_0_1_0, B_0_1_1 : SIGNED_INT_32;
    variable B_0_2_0, B_0_2_1 : SIGNED_INT_32;
    variable B_0_3_0 : SIGNED_INT_32;

    type Arr00 is array (1023 downto 0) of SIGNED_INT_32;
    -- One renamed (virtual) array per parallel access pattern:
    variable B101, B100, A101, A100 : Arr00;
    variable B211, B200, A211, B210 : Arr00;
    variable A210, B201, A201, A200 : Arr00;

    -- Custom data layout: bind the renamed arrays to four external
    -- memory banks so that the accesses in one iteration go to
    -- different banks and can be scheduled in parallel.
    constant RAM_0 : resource := 0;
    attribute variables       of RAM_0 : constant is "B101 B211 A211";
    attribute map_to_module   of RAM_0 : constant is "ram_s_RW";
    attribute packing_mode    of RAM_0 : constant is "compact";
    attribute external_memory of RAM_0 : constant is TRUE;

    constant RAM_1 : resource := 1;
    attribute variables       of RAM_1 : constant is "B100 B210 A210";
    attribute map_to_module   of RAM_1 : constant is "ram_s_RW";
    attribute packing_mode    of RAM_1 : constant is "compact";
    attribute external_memory of RAM_1 : constant is TRUE;

    constant RAM_2 : resource := 2;
    attribute variables       of RAM_2 : constant is "A101 B201 A201";
    attribute map_to_module   of RAM_2 : constant is "ram_s_RW";
    attribute packing_mode    of RAM_2 : constant is "compact";
    attribute external_memory of RAM_2 : constant is TRUE;

    constant RAM_3 : resource := 3;
    attribute variables       of RAM_3 : constant is "A100 B200 A200";
    attribute map_to_module   of RAM_3 : constant is "ram_s_RW";
    attribute packing_mode    of RAM_3 : constant is "compact";
    attribute external_memory of RAM_3 : constant is TRUE;

  begin
    main_loop : loop
      wait until task_clk'event and task_clk = '1';
      exit main_loop when task_reset = '1';
      task_done <= '0';
      if (task_start /= '1') then
        next;
      end if;
      wait until task_clk'event and task_clk = '1';
      exit main_loop when task_reset = '1';

      -- Peeled first iteration of the inner loop:
      B_0_0_0  := B101(1 * 16 + 0);
      B_0_3_17 := B100(0 * 16 + 0);
      A101(1 * 16 + 0) := B_0_0_0 + B_0_3_17;
      B_0_1_0  := B100(1 * 16 + 1);
      B_0_2_16 := B101(0 * 16 + 0);
      A100(1 * 16 + 1) := B_0_1_0 + B_0_2_16;
      B_0_2_0  := B101(2 * 16 + 0);
      B_0_1_1  := B100(1 * 16 + 0);
      A101(2 * 16 + 0) := B_0_2_0 + B_0_1_1;
      B_0_3_0  := B100(2 * 16 + 1);
      A100(2 * 16 + 1) := B_0_3_0 + B_0_0_0;
      -- Rotate the reuse chains one position:
      B_0_1_1  := B_0_1_0;
      B_0_2_16 := B_0_2_15;   -- ... and so on down to B_0_2_1 := B_0_2_0;
      B_0_3_16 := B_0_3_15;   -- ... and so on down to B_0_3_1 := B_0_3_0;

      for j in 0 to 14 loop   -- pragma dont_unroll
        wait until task_clk'event and task_clk = '1';
        exit main_loop when task_reset = '1';
        B_0_0_0  := B101(1 * 16 + j + 1);
        B_0_3_17 := B100(0 * 16 + j + 1);
        A101(1 * 16 + j + 1) := B_0_0_0 + B_0_3_17;
        B_0_1_0  := B100(1 * 16 + j + 2);
        B_0_2_16 := B101(0 * 16 + j + 1);
        A100(1 * 16 + j + 2) := B_0_1_0 + B_0_2_16;
        B_0_2_0  := B101(2 * 16 + j + 1);
        A101(2 * 16 + j + 1) := B_0_2_0 + B_0_1_1;
        B_0_3_0  := B100(2 * 16 + j + 2);
        A100(2 * 16 + j + 2) := B_0_3_0 + B_0_0_0;
        -- (reuse-chain rotation, same pattern as above)
      end loop;

      for i in 0 to 30 loop   -- pragma dont_unroll
        wait until task_clk'event and task_clk = '1';
        exit main_loop when task_reset = '1';
        B_0_0_0  := B211((i + 1) * 16 + 0);
        B_0_3_17 := B200((i + 1) * 16 + 0);
        A211((i + 1) * 16 + 0) := B_0_0_0 + B_0_3_17;
        B_0_1_0  := B210((i + 1) * 16 + 1);
        A210((i + 1) * 16 + 1) := B_0_1_0 + B_0_2_16;
        B_0_2_0  := B201((i + 2) * 16 + 0);
        B_0_1_1  := B210((i + 1) * 16 + 0);
        A201((i + 2) * 16 + 0) := B_0_2_0 + B_0_1_1;
        B_0_3_0  := B200((i + 2) * 16 + 1);
        A200((i + 2) * 16 + 1) := B_0_3_0 + B_0_0_0;
        -- (reuse-chain rotation, same pattern as above)

        for j in 0 to 14 loop   -- pragma dont_unroll
          wait until task_clk'event and task_clk = '1';
          exit main_loop when task_reset = '1';
          B_0_0_0 := B211((i + 1) * 16 + j + 1);
          A211((i + 1) * 16 + j + 1) := B_0_0_0 + B_0_3_17;
          B_0_1_0 := B210((i + 1) * 16 + j + 2);
          A210((i + 1) * 16 + j + 2) := B_0_1_0 + B_0_2_16;
          B_0_2_0 := B201((i + 2) * 16 + j + 1);
          A201((i + 2) * 16 + j + 1) := B_0_2_0 + B_0_1_1;
          B_0_3_0 := B200((i + 2) * 16 + j + 2);
          A200((i + 2) * 16 + j + 2) := B_0_3_0 + B_0_0_0;
          -- (reuse-chain rotation, same pattern as above)
        end loop;
      end loop;
    end loop main_loop;
  end process p1;
end a_main;
Appendix B

Computing Best Unroll Factors in 2-deep Loop Nests

This section describes how to decide the best unroll factors $u_1$ and $u_2$ in 2-deep loop nests, for given $U$ and $D$, to minimize the execution time. Based on the dependence information $D$, we partition the $u_1 \times u_2$ iteration space into regions that can be executed in parallel. Note that behavioral VHDL is a parallel execution description, and the synthesis tool will schedule all independent operations in parallel once the necessary data is ready, if the optimization preference is given to minimizing the execution time rather than space usage. We use the following terms within the $u_1 \times u_2$ iteration space to describe our algorithm in this section:

• $P_r$ represents the number of rows of parallel regions;
• $P_c$ represents the number of columns of parallel regions; and,
• $E$ represents the approximate number of steps required to execute the $P_r \times P_c$ parallel regions.

Here, $P_r$ is a function of $u_1$ and $P_c$ is a function of $u_2$. We assume each parallel region takes one step, even though each region may involve a different amount of computation and memory accesses. The synthesis tool is responsible for configuring enough resources for the biggest region. Some parallel regions can also be executed in parallel, depending on $D$. $E$ is not an accurate measure of execution time, but it is used as a metric to evaluate different combinations of $u_1$ and $u_2$ whose product is $U$. For a given constant $U = u_1 \times u_2$, we would like to find the individual unroll factors $u_1$ and $u_2$ that minimize $E$. For example, consider the unrolled iterations in Figure 3.4(b). From the dependence information $\{(1,0),(0,2)\}$, we can divide the $2 \times 4$ iteration space into four parallel regions, each of which is a $1 \times 2$ iteration subspace. Therefore, $P_r = 2$, $P_c = 2$, and $E = 3$.

We first describe the cases where $D$ consists of only one of the 3 basic dependence vectors of the form $(d,0)$, $(0,d)$, and $(d,d)$, where $d$ is a non-zero constant. Next, we describe the combinations of these three basic dependence vectors in the rest of this section. We regard the dependence distance '+' as one, since it is the smallest of the distances that '+' includes. We illustrate how the iteration space is divided into $P_r$ and $P_c$ based on $D$, and how we derive the optimal unroll factors, in Figure B.1, where the numbers in each iteration region represent the ordering constraint of loop iteration execution imposed by $D$, which is represented as arrows in Figure B.1. A thicker arrow determines the number of iterations that can be executed in parallel.
Note that behavioral VHDL is a parallel execution description, and the synthesis tool will schedule all independent operations in parallel once the necessary data is ready if the optimization preference is given to minimizing the execution time rather than space usage. We use the following terms within the u\ x 112 iteration space to describe our algorithm in this section: • Pr represents the number of rows of parallel regions; • Pc represents the number of columns of parallel regions; and, • E represents the approximate number of steps required to execute Pr x Pc parallel regions. Here, Pr is a function of u\ and Pc is a function of U2 . We assume each parallel region takes one step, even if each region may involve different amount of computation and memory accesses. The synthesis tool is responsible to configure enough resources for the biggest region. Some parallel regions can also be executed in parallel depending on D. E is not an accurate measure of execution time, but it is used as a metric to evaluate different combinations of u\ and U2 whose product is U . For a given constant U = u\ x 112, we would like to find individual unroll factors u\ and U 2 that minimize E . For example, consider the unrolled iterations in Figure 3.4(b). From the dependence information {(1,0), (0,2)}, we can divide 2 x 4 iteration space into four parallel regions each of which is 1 x 2 iteration subspace. Therefore, Pr = 2 and Pc = 2 and E = 3. We first describe the cases where D consists of only one of 3 basic dependence vectors of the form (d, 0), (0, d), and (d, d), where d is a non-zero constant. Next, we describe the combination of these three basic dependence vectors in the rest of this section. We regard the dependence distance '+ ’ as one, since it is the smallest of the distances that ‘+ ’ includes. We illustrate how the iteration space is divided into Pr and Pc based on D and how we derive the optimal unroll factors in Figure B .l, where the numbers in each iteration region represent the ordering constraint of loop iteration execution imposed by D, which is represented as arrows in Figure B .l. A thicker arrow determines the number of iterations that can be executed in parallel. 169 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 5 4 3 2 1 »2 JLU| f t ( a ) (e) II2 -► mil I (i) m i l' 2 (b) (f) “ 2 -► " 2 -► (j) i n , 1im i 5 :6 7 8 9 “ 2 5: 6 7 8 9 “ 2 5 1 6 7 8 9 4 :5 6 7 8 4 ; 5 6 7 8 4 :5 6 7 8 3 1 4 5 6 7 3: 4 5 6 7 3 !4 5 6 7 .2 - fc- .4.. -5- -6~ i : -3- -4.. -5- - 6 - 2 |3 4 5 6 ' I \2 3 4 5 d i 2 3 4 5 1 2 ' T -5 - (k) 5 6 7 8 9 4 5 6 7 8 3 4 5 6 7 2 3 4 5 6 2 3 4 5 “ 2 mi. (d) (h) “ 2 -► (1 ) Figure B .l: Approximate execution time based on the data dependence. If £>={(<*!, 0>,<d2,0>,...,(d„,0>}, iterations in each row can be executed in paral lel, since there is no dependence among them as shown in Figure B .l(a). The minimal dependence distance on the first dimension determines the parallelizable iterations. Pr = [ui/m in(dj)], where di G {d i,d 2, .. .,d n}. P c = 1. E = Pr + Pc - 1 = Pr. E {—Pr) is the minimum when '«i = 1 and u 2 = U. This explains why the unroll factor of a parallelizable loop must be maximized. iterations in each column can be executed in parallel for the similar reason as for Figure B .l (a). The minimal dependence distance on the second dimension determines the parallelizable iterations. Pc = r«2/m in(dj)l,w here di e {di,d2, ... ,dn}. Pr = 1. E = Pr + Pc — 1 = Pc. 
$E$ ($= P_c$) is the minimum when $u_1 = U$ and $u_2 = 1$.

If $D = \{(d_1,d_2), (d_3,d_4), \dots, (d_n,d_{n+1})\}$, the minimal dependence vector in each dimension decides the execution time. In Figure B.1(b), the L-shaped region in solid lines shows the parallelizable iterations determined by the smaller dependence vector. A bigger dependence vector determines the L-shaped region in dotted lines. Intersecting the two regions results in the overall parallelizable iterations, which are the same as the first region in this case. If the minimal dependence distance in each dimension belongs to different dependence vectors, as is the case in Figure B.1(c), the parallelizable region is not strictly L-shaped. But the extra rectangular iterations can be ignored, since they do not affect the total number of execution steps.

$P_r = \lceil u_1 / \min(d_{odd}) \rceil$, where $d_{odd} \in \{d_1, d_3, \dots, d_n\}$.
$P_c = \lceil u_2 / \min(d_{even}) \rceil$, where $d_{even} \in \{d_2, d_4, \dots, d_{n+1}\}$.
$E = \min(P_r, P_c)$.

$d_{odd}$ and $d_{even}$ represent the dependence distances in the first and second dimension, respectively. $E$ ($=$ the number of L-shaped regions) is the minimum when $u_1 = 1$ and $u_2 = U$, or when $u_1 = U$ and $u_2 = 1$.

As shown so far, only the minimal dependence vector in each dimension decides the parallelizable region. In the rest of the cases, we assume the dependence vector of each basic type represents the minimal dependence vector among those with the same basic type. We now describe the cases where dependence vectors of multiple basic types are included.

If $D = \{(d_1,0), (0,d_2)\}$, the first $d_1 \times d_2$ iterations can be executed in parallel, identified by region number one in Figure B.1(d). At the next step, the $d_1 \times d_2$ iteration block right above iteration region number 1 can be executed. At the same time, the $d_1 \times d_2$ iteration block on the right ($=$ region number 2) can also be executed. The other iteration blocks cannot start yet because of the given dependence $\{(d_1,0), (0,d_2)\}$.

$P_r = \lceil u_1/d_1 \rceil$. $P_c = \lceil u_2/d_2 \rceil$. $E = P_r + P_c - 1$.

$E$ is the minimum when $u_1 = \sqrt{d_1 U / d_2}$ and $u_2 = \sqrt{d_2 U / d_1}$.

PROOF. (1) If $u_1$ and $u_2$ are divisible by $d_1$ and $d_2$ respectively, the ceiling operations for $P_r$ and $P_c$ can be removed.

$E = u_1/d_1 + u_2/d_2 - 1$
$U = u_1 \times u_2$, where $U$, $d_1$, $d_2$ are positive integer constants.
$E = u_1/d_1 + U/(u_1 d_2) - 1$
$E = (d_2 u_1^2 + d_1 U)/(d_1 d_2 u_1) - 1$
$d_2 u_1^2 - d_1 d_2 (E+1) u_1 + d_1 U = 0$

The solution of this equation is the following:

$u_1 = \dfrac{d_1 d_2 (E+1) \pm \sqrt{\{d_1 d_2 (E+1)\}^2 - 4 d_1 d_2 U}}{2 d_2}$   (B.1)

For this solution to be valid, $\{d_1 d_2 (E+1)\}^2 - 4 d_1 d_2 U \ge 0$.

$(E+1)^2 \ge 4U/(d_1 d_2)$
$E \ge \sqrt{4U/(d_1 d_2)} - 1$

Therefore the minimal $E$ is found. Applying this minimal $E$ to Equation B.1, we get

$u_1 = \dfrac{d_1 d_2 \sqrt{4U/(d_1 d_2)}}{2 d_2} = \sqrt{\dfrac{d_1 U}{d_2}}$, $\quad u_2 = \dfrac{U}{u_1} = \sqrt{\dfrac{d_2 U}{d_1}}$.  □

(2) If either $u_1$ is not divisible by $d_1$ and $u_2$ is divisible by $d_2$, or $u_1$ is divisible by $d_1$ and $u_2$ is not divisible by $d_2$,

$E = u_1/d_1 + u_2/d_2$
$U = u_1 \times u_2$
$E = u_1/d_1 + U/(u_1 d_2)$
$E = (d_2 u_1^2 + d_1 U)/(d_1 d_2 u_1)$
$d_2 u_1^2 - d_1 d_2 E u_1 + d_1 U = 0$

The solution of this equation is the following:

$u_1 = \dfrac{d_1 d_2 E \pm \sqrt{(d_1 d_2 E)^2 - 4 d_1 d_2 U}}{2 d_2}$   (B.2)

For this solution to be valid, $(d_1 d_2 E)^2 - 4 d_1 d_2 U \ge 0$, so $E^2 \ge 4U/(d_1 d_2)$.

Therefore the minimal $E$ is found. Applying this minimal $E$ to Equation B.2, we get

$u_1 = \dfrac{d_1 d_2 \sqrt{4U/(d_1 d_2)}}{2 d_2} = \sqrt{\dfrac{d_1 U}{d_2}}$, $\quad u_2 = \dfrac{U}{\sqrt{d_1 U / d_2}} = \sqrt{\dfrac{d_2 U}{d_1}}$.  □

(3) If both $u_1$ and $u_2$ are not divisible by $d_1$ and $d_2$,

$E = u_1/d_1 + u_2/d_2 + 1$
$U = u_1 \times u_2$
$E = u_1/d_1 + U/(u_1 d_2) + 1$
$E = (d_2 u_1^2 + d_1 U)/(d_1 d_2 u_1) + 1$
$d_2 u_1^2 - d_1 d_2 (E-1) u_1 + d_1 U = 0$

The solution of this equation is the following:

$u_1 = \dfrac{d_1 d_2 (E-1) \pm \sqrt{\{d_1 d_2 (E-1)\}^2 - 4 d_1 d_2 U}}{2 d_2}$   (B.3)

For this solution to be valid, $\{d_1 d_2 (E-1)\}^2 - 4 d_1 d_2 U \ge 0$, so $(E-1)^2 \ge 4U/(d_1 d_2)$.
( B ' 2 ) For this solution to be valid, {d\ x d2 x E ) 2 — A x d \ x d 2 x U > Q E 2 > (4 x Cf)/(di x d2) Therefore the minimal E is found. Applying this minimal E to the equation B.2, we get Ul = (di x d2 x sj (4 x U)/{dx x d2))/(2 x d2) = u 2 = U/ yjd\ x U /d 2 = □ (3) If both u\ and u 2 are not divisible by d\ and d2, E = ui/di + u 2/d 2 + 1 U = u\ x u 2 E = ui/di + U/(ui x d2) + 1 E = {d2 x u\ + d\ x U}/{di x d2 x u\] + 1 d2 x u 2 — d\ x d2 x (E — 1) x u\ + d\ x U = 0 The solution of this equation is the following. di x d2 x (E — 1) ± y/{di x d2 x (E — l)} 2 - A x d i x d 2 x U Ul = 2Vd2 -------------------------------------- ( B ' 3 ) For this solution to be valid, {(di x d2 x (E — l)} 2 — 4 x d\ x d2 x U > 0 (E - l) 2 > (4 x U)j(d\ x d2) 172 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. E > y / ( 4 x U ) / { d 1 x d2) + 1 Therefore the minimal E is found. Applying this minimal E to the equation B.3, we get di x d2 x 4 x U)/{d\ x d2) I d i x U U \ = — 2 X d2 V ^2 17 / d2 x 1 7 ~ ~ y'd i x P /d 2 ~ V di □ If I? = {(d i, 0), (d2 , C I 3 )}, the intersected iterations between parallelizable iterations for (d i,0 ) and parallelizable iterations for (d2, d3) are determined by d4 if d4 < d2 as shown in Figure B .l(e). If di > d2 , d2 decides the intersected parallelizable iterations as shown in Figure B .l(f), where the initial extra iterations do not affect the total number loop execution steps. P = J r« iM ], if di < d2; r \ r« i/d 2l, otherwise. P c = l . E = Pr + Pc - 1 = Pr . P (=Pr ) is the minimum when ux = 1 and w ,2 = U. If jD ={(0 , di), (d2, ds)}, the intersected iterations between parallelizable iterations for (0 , d\) and parallelizable iterations for (d2, da) are determined by d\ and d3 , but not d2 , as shown in Figure B .l(g) and (h). P = f r«2 / d ! l, if dx < d3; c \ pu2/d 3] , otherwise. Pr = 1. E = Pr + Pc — 1 = Pc. E (= P C ) is the minimum when when u 2 — 1 and = P . If D={(dx, 0 ), (0 , (I2 ), (d3 , d4)}, dependence vector (d3, dx) does not affect the num ber of execution steps. As shown in Figure B .l(j), (k), and (1), L-shaped region for (d3 ,d 4 ) is a superset of the parallelizable iterations for other dependences, except the case in Figure B.l(m ) where (d3 ,d 4) is less than the other dependences. E = Pr + Pc - 1 = pui/di] + [it2 /d 2] - 1. Pr + Pc ~ 1- Pr + Pc — 1. If dx < d% and d2 < d4, \ux/dx], P c = \u2/d2], P If dx < ds and d2 > d4, Pr = \ux/dx], P c = r « 2 / d 2 l , E If di > d3 and d2 < d4, P r = [u i/d i], Pc = [u2/d 2], E If dx > ds and d2 > d4, P r — [u i/d i], Pc = [u2/d 2], 173 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. E _ / pr + Pc - 1 , if ui/di - \u i/d \ \ < d \ - dz or u 2/d 2 - ^ 2/^ 21 < d2 - d4 ; } Pr + Pc, otherwise The right corner (di — d-j) x (d2 — d.i) iterations at the last step may introduce an extra step if they are included in the last step of either dimension. Anyway, E is the minimum when ui = d\ x U/d 2 and u 2 = yjd 2 x U/d\. These Ui and u 2 computed based on the dependence information are the best unroll factors that minimize the number of steps of parallelizable iterations. For example, re consider the example in Figure 3.4 with U = 8 , D = {(1,0), (0,2)}. The unroll factors ui = 2 and u2— 4 minimize the number of computation steps. We extend the algorithm to n-deep loops nests in Appendix B. 174 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
Appendix C

Extension to n-deep Loop Nests

We present how we determine the best unroll factors for a 3-deep loop nest in this section. The same methodology can be extended to an n-deep loop nest by mathematical induction. For the following equations, we would like to find the individual unroll factors $u_1$, $u_2$, and $u_3$ that minimize $E$:

$U = u_1 \times u_2 \times u_3$
$E = u_1/d_1 + u_2/d_2 + u_3/d_3$

In a 2-deep loop nest, we found the optimal unroll factors at the contact point where the 2D curve $U = u_1 u_2$ adjoins the 2D straight line $E = u_1/d_1 + u_2/d_2$, as depicted in Figure C.1(a). Similarly, in 3D, $E$ is the minimum when the 3D curved surface $U = u_1 u_2 u_3$ adjoins the 3D plane $E = u_1/d_1 + u_2/d_2 + u_3/d_3$, where the 3D axes are $u_1$, $u_2$, and $u_3$, as illustrated in Figure C.1(b). The point of contact minimizes the number of steps of parallelizable iterations just as in the 2-deep loop nest cases.

[Figure C.1 shows the curve $U$ and the line/plane $E$ meeting at the contact point, (a) in 2D and (b) in 3D.]

Figure C.1: Plots for $U$ and $E$.

Let $(x, y, z)$ be the contact point in the 3-dimensional space $u_1 \times u_2 \times u_3$. Since the contact point $(x, y, z)$ is on the 3D curved surface $U$,

$xyz = U$   (C.1)

The normal vector of the tangent plane of the 3D surface $U$ at the contact point $(x, y, z)$ can be computed by applying partial differentiation in each dimension $u_1$, $u_2$, and $u_3$. Thus, the normal vector is $(yz, xz, xy)$. In addition, the normal vector of the plane $E$ is $(1/d_1, 1/d_2, 1/d_3)$. Each entry of the first normal vector must be the same multiple of the corresponding entry of the second:

$yz = c/d_1$   (C.2)
$xz = c/d_2$   (C.3)
$xy = c/d_3$   (C.4)

where $c$ is a non-zero constant. In addition, the contact point $(x, y, z)$ is on the 3D plane $E$: $E = x/d_1 + y/d_2 + z/d_3$.

Multiplying the three equations above results in the following equation: $(xyz)^2 = c^3/(d_1 d_2 d_3)$. From Equation C.1,

$U^2 = c^3/(d_1 d_2 d_3)$
$c^3 = d_1 d_2 d_3 U^2$
$c = \sqrt[3]{d_1 d_2 d_3 U^2}$   (C.5)

Applying this $c$ from Equation C.5 to Equations C.2, C.3, and C.4 determines the specific values of $x$, $y$, and $z$, which are the optimal unroll factors:

$xy = \sqrt[3]{d_1 d_2 d_3 U^2}/d_3$, $\quad xz = \sqrt[3]{d_1 d_2 d_3 U^2}/d_2$, $\quad yz = \sqrt[3]{d_1 d_2 d_3 U^2}/d_1$

$x = \sqrt[3]{\dfrac{(d_1)^2 U}{d_2 d_3}}, \quad y = \sqrt[3]{\dfrac{(d_2)^2 U}{d_1 d_3}}, \quad z = \sqrt[3]{\dfrac{(d_3)^2 U}{d_1 d_2}}$
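The closed form is easy to sanity-check numerically. The short C program below (an illustrative addition, not from the dissertation) computes $x$, $y$, and $z$ from the cube-root formulas and verifies that their product recovers $U$:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double U = 64.0, d1 = 1.0, d2 = 2.0, d3 = 4.0;   /* example values */
        double x = cbrt(d1 * d1 * U / (d2 * d3));        /* optimal u1 */
        double y = cbrt(d2 * d2 * U / (d1 * d3));        /* optimal u2 */
        double z = cbrt(d3 * d3 * U / (d1 * d2));        /* optimal u3 */
        double E = x / d1 + y / d2 + z / d3;
        printf("x=%.3f y=%.3f z=%.3f  x*y*z=%.3f  E=%.3f\n", x, y, z, x * y * z, E);
        /* x*y*z prints 64.000, confirming the contact point lies on U. */
        return 0;
    }

Note that at the contact point the three terms are equal, $x/d_1 = y/d_2 = z/d_3 = \sqrt[3]{U/(d_1 d_2 d_3)}$, which is consistent with the symmetry of the tangency condition.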