COMPILER DIRECTED DATA MANAGEMENT FOR CONFIGURABLE ARCHITECTURES WITH HETEROGENEOUS MEMORY STRUCTURES

by Nastaran Baradaran

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER SCIENCE)

May 2007

Copyright 2007 Nastaran Baradaran

Dedication

To my parents who made this possible.

Acknowledgements

I would like to thank my advisor Dr. Pedro C. Diniz for his generous time, support, and encouragement. He is one of those rare advisors that many students only hope for. Thanks to the members of my committee, Dr. Viktor K. Prasanna and Dr. Timothy M. Pinkston, for their continual guidance and support. I am also thankful to Dr. Ryan Kastner and Dr. Aiichiro Nakano for their feedback and suggestions.

I wish to thank everyone at the Information Sciences Institute (ISI), including the members of the Computational Sciences Division, members of the Action team, as well as the helpful staff. I am especially grateful for the support and assistance that I received from Dr. Robert Lucas and Dr. Mary Hall over the years.

I extend many thanks to my colleagues and friends at USC/ISI. Special thanks to Roshanak Roshandel for her great friendship, Heidi Ziegler for her non-stop support during the last year, and Yoonju Lee Nelson for keeping me sane at work.

Many thanks to my mother Moneer and my father Khosrow who made this possible with their unbelievable love and support. I would like to thank my sisters Yasaman and Kiana for their encouragement and enthusiasm. Last but not least I am grateful to my husband and best friend, Gunnar, for his love, patience, and optimism.

This research was made possible by the financial support of the National Science Foundation, the Information Sciences Institute, and the University of Southern California's Computer Science Department and Graduate School. I would like to thank these entities, as without their help this work could not have been completed.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
1.1 Memory Mapping Problem
1.1.1 Target Architecture
1.1.2 Target Application
1.2 Current Approaches
1.3 Our Approach
1.4 Mapping Example
1.5 Contributions
1.6 Organization

Chapter 2: Data Reuse Analysis
2.1 Definitions
2.2 Data Dependence
2.3 Data Reuse
2.3.1 Reuse Vectors
2.3.2 Reuse Graphs and Reuse Chains
2.4 Reuse Vectors for SIV and MIV Subscripts
2.4.1 Self Reuse for SIVs
2.4.2 Group Reuse for SIVs
2.4.3 Self and Group Reuse for SIVs
2.4.4 Self Reuse for MIVs
2.4.5 Group Reuse for MIVs
2.5 Advantages of Reuse Vectors
2.6 Chapter Summary

Chapter 3: Scalar Replacement
3.1 Single Induction Variables
3.1.1 Self Reuse
3.1.2 Group Reuse
3.1.3 Self and Group Reuse
3.2 Multiple Induction Variables
3.2.1 Self Reuse
3.2.1.1 Subscripts With Two Variables in a Single Dimension
3.2.1.2 Subscripts With Two Variables in Multiple Dimensions
3.2.2 Group Reuse
3.2.3 Limitations of the Analysis
3.3 Chapter Summary

Chapter 4: Analyses of the Critical Path and Scheduling
4.1 Data-Flow-Graph and Critical Path(s)
4.2 Cuts of the Critical Graph
4.2.1 Identifying the Cuts
4.3 Effects of Bandwidth and Scheduling
4.3.1 Unlimited Memory Bandwidth
4.3.2 Limited Memory Bandwidth
4.4 Desired Latency and Delivery Time
4.5 Threshold Metric
4.6 Execution Model
4.7 Putting it all together
4.8 Chapter Summary

Chapter 5: Custom Memory Allocation Algorithm
5.1 Problem Definition
5.2 Problem Formulation
5.3 Data Mapping Approach
5.3.1 Mapping Transformations
5.3.2 Mapping Observations
5.3.3 Mapping Strategies
5.3.3.1 Selecting a Storage Type
5.3.3.2 Selecting a Data Transformation
5.3.4 Mapping Metrics
5.3.5 Putting It All Together
5.4 Algorithm Description
5.4.1 Best Cut Selection
5.4.2 Mapping Overview
5.4.3 Mapping in RAMs
5.4.4 Summary and Analysis
5.5 Algorithm Illustration
5.6 Limitations of the CMA Algorithm
5.7 Chapter Summary

Chapter 6: Experiments
6.1 Methodology
6.2 Memory Organization
6.3 Application Kernels
6.4 Effects of Considering the Critical Path(s)
6.5 Memory Allocation Algorithm
6.5.1 Mapping Techniques
6.5.1.1 Naive
6.5.1.2 Custom Data Layout (CDL)
6.5.1.3 Custom Memory Allocation (CMA)
6.5.1.4 Hand Coded (HC)
6.5.2 Implementation Considerations
6.5.3 Description of the Experiments
6.5.4 Allocation Results Using No Registers
6.5.4.1 Time Performance
6.5.4.2 Area Performance
6.5.4.3 Summary
6.5.5 Allocation Results Using Limited Registers
6.5.5.1 Time Performance
6.5.5.2 Area Performance
6.5.5.3 Summary
6.6 Analysis of the Results
6.7 Chapter Summary

Chapter 7: Related Work
7.1 Configurable Architectures and Compilation Support
7.2 Data Reuse Analysis and Compiler Transformations
7.3 Effects of Scheduling
7.4 Storage Allocation and Management
7.4.1 Allocation for Configurable Architectures
7.4.2 Allocation for Embedded Systems
7.4.3 Register Allocation
7.4.4 Custom Memory Design
7.5 Chapter Summary

Chapter 8: Conclusion
8.1 Contributions
8.1.1 Analysis and Quantifying the Reuse
8.1.2 Analysis of the Critical Paths
8.1.3 Custom Memory Allocation Algorithm
8.2 Future Work

Bibliography

Appendix

List of Tables

6.1 Number of design slices and register bits used in scalar replacement.
A.1 Timing results for all cases when using no registers.
A.2 Resource usage for all cases when using no registers.
A.3 Timing results for all benchmarks when using 64 registers.
A.4 Resource usage for all benchmarks when using 64 registers.

List of Figures

1.1 Mapping process for a configurable hardware.
1.2 The abstract architecture for a Field Programmable Gate Array.
1.3 Example code with various mapping strategies.
2.1 Example codes for reuse analysis.
3.1 Example for SIV self-reuse.
3.2 Example for SIV group-reuse.
3.3 Example for SIV self+group reuse.
3.4 Example for MIV self-reuse with two variables in one dimension.
3.5 Example for MIV self-reuse with two variables in multiple dimensions.
4.1 Example code for critical path analysis.
4.2 Data-Flow Graph, Critical Graph, and Cuts for the example code in 4.1.
4.3 Examples of extreme cases of critical paths.
4.4 Effects of the Threshold selection on the amount of improvement.
4.5 Preliminary flowchart of the allocation algorithm.
5.1 Various transformations applied to example code in figure 4.1.
5.2 Complete flowchart of the allocation algorithm.
5.3 Custom Memory Allocation (CMA) Algorithm.
5.4 Cost assessment algorithm.
5.5 Algorithms to find a mapping in RAMs and registers.
5.6 Final mapping for the example code.
6.1 Timing results reflecting the use of cuts.
6.2 Timing results for FIR using 4 memories and no registers.
6.3 Timing results for MM using 4 memories and no registers.
6.4 Timing results for JAC using 4 memories and no registers.
6.5 Timing results for HIST using 4 memories and no registers.
6.6 Timing results for BIC using 4 memories and no registers.
6.7 Resource usage for FIR using 4 memories and no registers.
6.8 Resource usage for MM using 4 memories and no registers.
6.9 Resource usage for JAC using 4 memories and no registers.
6.10 Resource usage for HIST using 4 memories and no registers.
6.11 Resource usage for BIC using 4 memories and no registers.
6.12 Timing results for all benchmarks using 4 memories and 64 registers.
6.13 Resource usage for all benchmarks using 4 memories and 64 registers.

Abstract

Configurable architectures offer the unique opportunity of realizing hardware designs tailored to the specific data and computational patterns of a given application code. These devices have customizable compute fabric, interconnects, and memory subsystems that allow for large amounts of data and computational parallelism.
This high degree of concurrency subsequently translates to better performance. The flexibility and configurability of these architectures, however, create a prohibitively large design space when mapping computations expressed in high-level programming languages to these devices. To successfully investigate the best mapping there is a need for high level program analyses and abstractions as well as automated tools.

This dissertation describes a high level approach to one of these mapping problems, namely the allocation and management of storage. We develop and evaluate automatic mapping algorithms that can quickly and effectively explore alternative mapping strategies. Our objective is to minimize the overall execution time while considering the capacity and bandwidth constraints of the storage structures.

Our approach combines compiler analyses with behavioral synthesis information in order to map the arrays of a loop-based computation to an architecture with a set of internal memories. In particular, for each computation we consider the access and reuse patterns of the data arrays, the structure of the critical paths, the scheduling information of the synthesis tool, as well as the storage and bandwidth constraints of the target architecture. We utilize various mapping techniques, namely data distribution, data replication, and scalar replacement. We further consider three levels of storage: off-chip memory, on-chip memory banks, and on-chip registers. We illustrate the effects of applying our analyses and mapping algorithm to a set of image/signal processing kernel codes using a Xilinx Virtex(TM) FPGA.

The novelty of our approach lies in creating a single framework that combines various high-level compiler analyses and data transformations with lower-level scheduling information in order to map the data. Our experimental results show that our approach is very effective in finding high-quality data mappings to the storage structures of an FPGA in an automated fashion.

Considering the current tendency towards increasing the variety and capacity of controllable storage structures, and given the continuing gap between computation and data access latencies, effective data management becomes an essential factor in achieving high performance in future architectures.

Chapter 1: Introduction

The increasing number of available transistors per die area has enabled the development of computing architectures with configurable characteristics. These emerging architectures allow the development of customized yet flexible hardware with processing and routing resources that can be adjusted to suit the particular needs of a given application [41, 67]. While their programmability allows a more flexible hardware platform, their direct mapping of a computation to hardware results in much higher performance than can be achieved when targeting more traditional computing architectures. This feature is particularly advantageous for compute-intensive applications such as encryption and image processing. For instance, in an image processing setting a configurable architecture can have multiple processing elements with their own private memory banks to handle specific sections of the input image.

Despite their significant performance benefits, however, configurable devices have still not been widely adopted. One main reason is that these architectures are inherently harder to program and maintain than traditional systems. In order to create a configuration for a device, programmers must assume the role of both hardware and software designers.
These designers not only need to have a detailed knowledge of the low-level hardware and Hardware Description Languages (HDLs), but they must also resolve issues such as data/computation mapping, data partitioning, placement, routing, and control among the various computing and storage elements. These can make the mapping process extremely cumbersome and error-prone.

A typical mapping process is illustrated in figure 1.1. The designers have to decide on different code transformations based on the application characteristics, architectural features, and design constraints. This step includes identifying the amount of parallelism and pipelining, selecting the suitable set of optimizations for the application, and deciding on how to map the data to the various memories and the layout of each memory. After applying the suitable transformations, the designers create the corresponding HDL representation for the resulting code and implement the control states for the design. After this step the design goes through logic synthesis, in which the HDL representation is converted to a netlist. This resulting netlist is then mapped to a specific hardware platform using a place and route tool. The final output is a bit-stream configuration file that is downloaded onto an FPGA to program the logic cells and interconnects. If the design does not meet the constraints and/or objectives, the designers need to alter some of the decisions and repeat the mapping process.

The problem with exploring the design space in this brute-force fashion is that the extreme programmability of these architectures creates a wide range of design choices that can overwhelm the average programmer. Furthermore, the stages of logic synthesis and place and route can be prohibitively slow for a trial-and-error approach. Consequently, and due to the lack of high-level programming analyses and tools, programmers settle for simple designs, not exploring the full potential of these architectures.

[Figure 1.1: Mapping process for a configurable hardware. The flow takes the application code, system constraints, and architectural features as inputs and proceeds through: identifying the pipelining, parallelization, and data mapping; performing code and data transformations; transforming the code to a Hardware Description Language; logic synthesis; and place and route. If the resulting design is not satisfactory the decisions are modified and the process repeats; otherwise a configuration bitstream is produced.]

In this context we believe that the development of high-level design/programming environments for these architectures would significantly lower the barrier of programming these systems and could eventually lead to a widespread adoption of configurable technology. To address the lack of high-level analyses, in this dissertation we focus on developing compiler techniques to tackle one major programming issue, namely the mapping and management of data for configurable architectures.

In the rest of this chapter we first explain the data mapping problem and review our target architecture and application set in section 1.1. We then give a general overview of our approach and express how it differs from current approaches in 1.2 and 1.3. We illustrate an example mapping in 1.4 and outline a list of contributions in 1.5. We conclude with the organization of this dissertation in 1.6.

1.1 Memory Mapping Problem

Storage allocation has been a long-standing key problem in improving the performance of an application. (Throughout this document we use the terms data mapping, storage allocation, and memory mapping interchangeably.) Applications tend to spend a large portion of their execution time in accessing memory and transferring the required data to the processing units.
As such, reducing the number of memory accesses or improving the data access time can dramatically improve their overall performance.

The importance of the memory mapping problem is even more acute in the domain of configurable architectures. One main reason is that in these devices designers aggressively exploit instruction-level parallelism (ILP) to increase the performance. This is mainly achieved by employing a large number of functional units (FUs) to concurrently execute as many operations as possible. This increase in the number of concurrent operations, however, invariably leads to a substantial increase in the required data bandwidth, as more data items need to be fetched and stored from/to memory per unit of time.

To mitigate this potential performance bottleneck, compilation tools (or in their absence the designers themselves) must lay out and organize the data between the available storage to increase the bandwidth and thus reduce the data access time. The configurability and heterogeneity of configurable systems, however, make this specific mapping problem very challenging, due to the following reasons:

• Rich memory structure: Not only are there many possibilities of partitioning the data among various storage structures, but because these architectures are configurable one can also manage the available bandwidth of each storage structure by defining the number of input/output ports. In order to map the data, designers need to be aware of the relative space and access time trade-offs between the various types of storage structures.

• Full control over data management: In the absence of any hardware support, software is exclusively responsible for the complete management of data (placement, replacement, address translation, register allocation, prefetching, etc.). The designers therefore need to explicitly orchestrate the flow of data through the many functional units and ensure the correct timing and location of data throughout the implementation.

• Various degrees of freedom: The design space is prohibitively large due to various degrees of freedom such as multiple storage structures, various data references with different characteristics, several optimization techniques (i.e., replication, distribution, caching), and finally multiple execution points for managing the storage. These factors make a brute-force approach simply infeasible.

• Limited resources: Allocation and management of data in the presence of finite storage and bandwidth resources leads to fundamentally hard optimization problems [31].

A major difference between this problem and previous work in the area of cache-aware compilation is the fact that configurable architectures do not support the abstraction of a single address space, nor do they manage the consistency between the local and external memories. In addition, and due to the absence of any hardware support for data management, the allocation algorithms need to be precise in explicitly saving/retrieving the data to/from the different storages. These facts can greatly complicate high-level storage management for configurable systems. On the other hand, here the data placement is directed by the compiler and is selectively applied to a subset of the data. This control and flexibility, absent in cache-based architectures, allow for algorithms that can customize the allocation based on the applications' needs.
1.1.1 Target Architecture

In this dissertation we target a configurable architecture with an off-chip memory, a set of on-chip memory banks, and a set of on-chip registers. In particular we focus on a single Field Programmable Gate Array (FPGA) with multiple on-chip RAM blocks and an off-chip external memory. (Throughout this document we use the terms "internal memory banks" and "RAM blocks" interchangeably.)

An FPGA, as depicted in figure 1.2, consists of a mesh of Configurable Logic Blocks (CLBs) surrounded by a ring of Input/Output Blocks (IOBs) and programmable interconnects. The CLBs are the primary building blocks that contain elements for implementing customizable functions. In addition, each CLB can also implement discrete data registers which can be accessed independently of each other. The IO blocks provide circuitry for communicating signals with external devices. Finally, the block RAMs allow for storage of data with single or double access ports.

[Figure 1.2: The abstract architecture for a Field Programmable Gate Array, showing the Input/Output Blocks, the Configurable Logic Blocks, the programmable interconnect, and the block RAM.]

A processor or functional unit can simultaneously access various internal RAM blocks as dictated by the design's data communication interconnects. This capability can potentially result in very high performance for computations that require high degrees of data parallelism. As a result there has been a consistent trend of increasing the number and capacity of the RAM blocks in FPGAs. The latest Xilinx Virtex-5 FPGAs offer up to 10 Mbits of flexible embedded block RAM to efficiently store and buffer data without any off-chip memory. Each Virtex-5 memory block stores up to 36 Kbits of data and can be configured either as a dual-port RAM or as a FIFO.

In this work we are not concerned with the low-level details of memory bank organization, such as the different ways a memory bank can be composed of various primitive memory cells. We assume all the memory banks to be identical in terms of their placement and therefore their relative distance to different functional units. We consider memory ports (bandwidth) as the only source of contention and therefore assume the availability of enough functional units to support concurrent operations. We further impose a fixed data allocation to the storage resources for the entire duration of the computation. Finally, in mapping the arrays we are concerned with neither multiple processors nor the details of low-level physical memory mapping.

It is worth noting that the analyses, techniques, algorithms, and principles developed and discussed in this work are not limited to only FPGAs or configurable architectures. They can be easily extended to any architecture with multiple levels of controllable storage. However, the abundance of functional parallelism, large amounts of controllable storage, and customizing capabilities of configurable architectures make them a natural platform for our work.

1.1.2 Target Application

The compiler analysis outlined in our work is geared towards computations that can be structured as perfectly or quasi-perfectly nested loops, with symbolically constant loop bounds, that manipulate array variables. As the analysis relies on a linear algebra framework to determine the access patterns of array variables, it is applicable to array references with affine subscripts. These data access patterns are very regular and can be summarized using well-known abstractions developed in the context of data dependence analyses for high-performance compilation.
Using these abstractions a compiler can uncover the data reuse patterns of a given array reference and generate code that caches reusable data in local storage.

A wide range of computation kernels, most notably from the domain of image and signal processing, meet these requirements. Typically these kernels are organized as doubly nested loops manipulating one- or two-dimensional data structures corresponding to either signal discretization processes or digital images. These computations are very data intensive and have fairly regular access patterns with large degrees of data reuse. These facts, combined with the possibility of vast amounts of ILP, make these applications a natural target for custom hardware implementation using FPGAs. A suitable data management for this class of applications is particularly advantageous and can substantially improve the performance on those architectures. For codes that do not exhibit the above-cited properties, it is still possible to perform a much more coarse-grained data reuse analysis at a lower precision level and therefore with fewer opportunities to take advantage of specific target architectures' features.

1.2 Current Approaches

Considering the large design space as well as the storage management challenges presented by configurable architectures, researchers have recognized the need for high-level data mapping strategies. To that extent various compilers and tools have been developed to map the data to different memories at a high level. Applying these techniques, and based on applications' and architectures' characteristics, designers can decide on the memory layout at the early stages of the design. This can substantially reduce the size of the design space.

A body of work (e.g. [69]) has focused on allocating storage to various references solely based on an application's characteristics, mainly the data access patterns. Although these analyses are necessary for a high-level understanding of the data behavior, they are not sufficient for a good storage allocation. High-level data pattern analyses neither provide any insight into the relations between different references nor consider the opportunities for concurrent data accesses during the program's execution. In order to overcome this shortcoming one needs to consider the implications of the execution behavior as well as the scheduling of the computation given the architecture's constraints.

On the other hand, techniques that do consider the scheduling information lack the knowledge of high-level data dependence analysis or simply of data access behavior (e.g. [32]). As such they mainly map each array to a single memory based on the characteristics of scheduling. In this way these techniques miss any chance of data reuse. Furthermore, by mapping each array to a single memory, they limit any data parallelism by the bandwidth of that memory.

While these researchers have recognized the importance of the data mapping problem, none of their approaches has combined critical path and data access pattern information in a single analysis and algorithmic framework that addresses various storage types and transformations as we do in our work.

1.3 Our Approach

The goal of our approach is to develop and evaluate a compiler algorithm and a framework that maps and manages a computation's data to a set of heterogeneous storage resources. Our data management attempts to minimize the execution time subject to the available storage and bandwidth.
In order to meet the minimal execution time, the proposed algorithm combines the data access pattern information of the arrays with the analysis of the critical path of the computation. We attempt to achieve the best timing using a minimal design space in terms of storage area. Not only is the design constrained by the chip capacity; minimizing the area also results in less routing complexity, less power consumption, and ultimately better clock rates.

Our proposed approach applies a comprehensive analysis to identify the reuse and access patterns for the different array references as well as their storage and bandwidth requirements. On the other hand, and since the execution time might not be sensitive to all references, our algorithm extracts the critical paths of the computation in order to identify the array references of interest. In detecting the critical paths we consider the data flow of the computation as well as the scheduling technique of the synthesis tool. Based on these analyses our algorithm applies various data-oriented transformations to the data arrays of the computation and decides where the data corresponding to each array reference should be stored.

We have built our approach in a compilation framework that operates on computations written as unannotated C loop constructs. Our compiler analysis recognizes the parallel data accesses and exploits the data reuse opportunities in order to generate a corresponding hardware design with superior performance. The hardware specification reflecting the results of our analysis is then mapped onto an FPGA device. For this mapping we use commercially available synthesis tools such as Mentor Graphics' Monet(TM) for synthesis and Synplicity's Synplify(TM) tool for Place and Route (P&R). We evaluate our work in the context of an FPGA; however, our analysis is limited neither to a particular configurable architecture nor to a commercial synthesis tool.

The novelty of our approach lies in creating a single framework that combines high-level compiler techniques with lower-level scheduling information, while considering the target architecture's resource constraints. We consider multiple data-oriented compiler transformations, namely distribution, replication, and scalar replacement. Furthermore, our algorithms stage and manage the data between various memory structures, namely an external memory, a number of fixed-capacity internal RAM blocks, and a limited number of internal registers.

1.4 Mapping Example

In order to illustrate the basis of our approach, we map the data corresponding to an unannotated kernel to an FPGA with 4 single-ported RAM blocks. The simple 2D-vector multiplication code, depicted in figure 1.3(a), includes a loop nest that accesses the data corresponding to arrays A and B, performs a multiplication, and saves the data in array C. Further, in order to increase the functional parallelism the loop is unrolled by a factor of 2 (figure 1.3(b)).

Figure 1.3: Example code with various mapping strategies.

(a) Original code:
    for (i = 0; i < 30; i++) {
      for (j = 0; j < 20; j++) {
        C[j][i] = A[i][j] * B[i][j];
      }
    }

(b) Code unrolled by a factor of 2:
    for (i = 0; i < 30; i++) {
      for (j = 0; j < 20; j += 2) {
        C[j][i]   = A[i][j]   * B[i][j];
        C[j+1][i] = A[i][j+1] * B[i][j+1];
      }
    }

(c) Naive mapping: arrays A, B, and C are each placed in a separate RAM block.
(d) Scheduling-only mapping: arrays A and C share one RAM block while B occupies another.
(e) Access-pattern-only mapping: the references A1, A2, B1, B2, C1, C2 are distributed over the four RAM blocks in a round-robin fashion, so A1 and B1 end up sharing a block.
(f) Our mapping: A1 and C1 share one RAM block, A2 and C2 share another, and B1 and B2 each occupy their own block.

Figure 1.3(c) presents a naive data mapping in which each array is mapped to a separate RAM block. This mapping is irrespective of any scheduling or data access information.
The mapping strategy in figure 1.3(d) is only concerned with scheduling and ignores the data access patterns. As a result arrays A and C can share the same bandwidth, as they are accessed at different execution cycle times. Arrays A and B are mapped to separate RAMs as they are accessed concurrently. It is clear that increasing the unroll factor would lead this mapping to a serialization of accesses to the RAMs.

Figure 1.3(e) presents a mapping that only considers the data access patterns. Considering that references A1 = A[i][j], A2 = A[i][j+1], B1 = B[i][j], B2 = B[i][j+1], C1 = C[j][i], and C2 = C[j+1][i] all access disjoint data sets, their corresponding data is mapped to different RAMs. Since there are not enough memory banks to accommodate all these references, the data sets are mapped in a round-robin fashion. As we can see, the sharing of the same bandwidth by references A1 and B1 slows down the computation, while the available bandwidth in RAM3 and RAM4 is not utilized.

Our own mapping in figure 1.3(f) considers the scheduling as well as the data access patterns. As a result the data for all references is distributed between the RAMs, but this time the references to arrays A and C share the same bandwidth. This does not have any negative effect on performance, as these references are accessed at different times. As a result we achieve the most parallelism in the computation, which results in the shortest execution time.

In this simple example there is no data reuse and we only consider data distribution as a viable transformation. In addition to data distribution, our full algorithm applies data replication and/or scalar replacement in registers when necessary. The details of which techniques to use are fully discussed in chapter 5.
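To make the distribution in figure 1.3(f) concrete, the following C sketch shows one way the unrolled kernel of figure 1.3(b) could look after data distribution. This is an illustration only: the split arrays (A_even, A_odd, and so on) are hypothetical names, the binding of each split array to a physical RAM block is performed by the synthesis flow rather than by the C source, and the dissertation's transformation is applied automatically by the compiler rather than by hand.

    /* Sketch (not from the dissertation): each array is split into an "even j"
     * and an "odd j" half so the two statements of the unrolled body read and
     * write disjoint storage.  Following figure 1.3(f), a synthesis tool could
     * then place A_even and C_even in one RAM block, A_odd and C_odd in a
     * second, and B_even and B_odd in the remaining two. */
    #define N 30
    #define M 20

    float A_even[N][M/2], A_odd[N][M/2];
    float B_even[N][M/2], B_odd[N][M/2];
    float C_even[M/2][N], C_odd[M/2][N];   /* C is indexed [j][i] in the kernel */

    void vmult_distributed(void)
    {
      for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j += 2) {
          /* The two statements touch different banks, so both memory accesses
           * and both multiplies can be scheduled in the same cycle. */
          C_even[j/2][i] = A_even[i][j/2] * B_even[i][j/2];   /* was C[j][i]   */
          C_odd [j/2][i] = A_odd [i][j/2] * B_odd [i][j/2];   /* was C[j+1][i] */
        }
      }
    }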
1.5 Contributions

This dissertation presents an approach to improving the execution time through an effective storage allocation. In this allocation we minimize the memory access time for the array references on the critical paths of the computation. We consider various mapping techniques, different storage structures, and multiple constraints. While there have been a few allocation strategies for mapping a computation's data to the storage structures of an FPGA, to our knowledge there has been no work that combines the high-level data and low-level system information to find the best mapping. In particular, our contributions may be summarized as:

• Analysis of Reuse and Extension of Scalar Replacement: Creating a reuse analysis framework based on reuse vectors that seamlessly works for Single Induction Variable and Multiple Induction Variable array references. We use this framework to estimate the number of storage elements needed to exploit the possible data reuses.

• Analysis of Critical Paths: Proposing algorithms and techniques to identify the most valuable references for caching in internal storage and to calculate the desired amount of reduction in memory access time. We do this by analyzing the critical paths of the computation. We consider the data flow of the computation as well as the scheduling technique of the synthesis tool.

• Storage Allocation Algorithm: Proposing a compiler algorithm that combines high-level program information with physical-level architectural information to map the data arrays onto configurable architectures with heterogeneous storage structures. Our algorithm combines three compiler transformations, scalar replacement, data distribution, and data replication, in a greedy fashion, attempting to minimize execution time subject to the capacity and bandwidth constraints.

• Experimental Results: Presenting experimental data for the application of the allocation algorithm to a set of image and signal processing kernels targeting a Xilinx Virtex(TM) FPGA. We further compare the performance of hardware designs resulting from the application of the allocation algorithm against designs using naive, custom data layout, and hand-coded mapping techniques in terms of execution time and storage resources.

1.6 Organization

The remainder of this dissertation is organized as follows. In chapter 2 we describe our data dependence and reuse analyses, which are the bases of our storage allocation. Chapter 3 explains how we use the reuse analyses to develop formulas that quantify the reuse. We subsequently use this information to identify the cost associated with the storage allocation of each array. In chapter 4 we present how we analyze the critical paths of the computation in order to identify the most beneficial data for caching. We also calculate how much improvement we need in terms of memory access time at each phase of the execution. In chapter 5 we fully explain our data mapping algorithm. Chapter 6 presents the experimental results for the application of our storage allocation algorithm to 5 multimedia computations. We survey related work in chapter 7 and conclude in chapter 8.

Chapter 2: Data Reuse Analysis

One objective of the storage allocation techniques discussed in our work is to improve the execution performance by reducing the number of accesses to external memory. A common strategy for reducing the number of accesses is to identify and exploit the data reuse in a computation. Data dependence and data reuse relations represent the reuse by uncovering whether two data references in the code access the same memory location. Saving the data associated with the first data access in local storage and reusing it during the later accesses reduces the number of external memory accesses and hence improves the performance.

In order to exploit the potential reuse in a program, a compiler needs to identify the possible reuse, quantify the required local storage, and finally manage the storage between a data access and its reuses during the execution of the program. In this chapter we explain how to identify the data reuse in a computation and introduce the abstractions that capture and represent this reuse. We first present some general definitions in section 2.1. We describe data dependence and data reuse in sections 2.2 and 2.3. In section 2.4 we explain in detail the abstraction of reuse vectors for different array references and describe their advantages in 2.5. We finally summarize the chapter in 2.6. In the next chapter we show how to use reuse vectors to quantify and exploit the reuse.

2.1 Definitions

In this section we describe the general definitions that we use throughout this document. In our work we focus on array references in perfectly or quasi-perfectly nested loops with constant loop bounds. The iteration space of an n-level perfectly nested loop is represented by $\vec{I} = (i_1, \ldots, i_n)$, where $i_1$ and $i_n$ are the loop indices corresponding to the outermost and innermost loops.

The array references in the loop have subscripts that are affine functions of the enclosing loop indices. Hence each subscript can be represented as $a_1 i_1 + a_2 i_2 + \ldots + a_n i_n + c$, where $a_1, \ldots, a_n, c$ are integers. The array references are furthermore uniformly generated.
In a loop nest of n levels, two array references with subscript functions $a_1 i_1 + a_2 i_2 + \ldots + a_n i_n + c_1$ and $b_1 i_1 + b_2 i_2 + \ldots + b_n i_n + c_2$ are uniformly generated if and only if $a_1 = b_1, a_2 = b_2, \ldots, a_n = b_n$.

The subscript index of an affine reference is represented in matrix form as $H\vec{I} + \vec{C}$. Here $H$ is the access matrix denoting the coefficients of the various index variables for each array reference dimension, $\vec{I}$ represents the iteration space vector, and $\vec{C}$ is a constant offset vector. In matrix $H$ each row corresponds to one dimension of the subscript while each column represents an index variable. For example, for array reference A[i-1][j+1] in figure 2.1(a):

$$H = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \vec{I} = \begin{pmatrix} i \\ j \end{pmatrix}, \qquad \vec{C} = \begin{pmatrix} -1 \\ 1 \end{pmatrix}$$

Figure 2.1: Example codes for reuse analysis.

(a)
    for(i = 0; i < 10; i++) {
      for(j = 0; j < 20; j++) {
        A[i][j] = A[i-1][j+1] + C[i+j];
        D[i] = D[i+2] + B[j] + E[5];
      }
    }

(b)
    for(i = 0; i < 5; i++) {
      for(j = 0; j < 10; j++) {
        for(k = 0; k < 20; k++) {
          A[2j+4k] = B[k] + A[2j+4k+2];
        }
      }
    }

(c) Reuse graph for part (a): a group-reuse edge from A[i][j] to A[i-1][j+1] labeled (1,-1), a group-reuse edge from D[i+2] to D[i] labeled (2,0), and self-reuse edges labeled (1,0) on B[j], (1,-1) on C[i+j], (0,1) on the references to D, and (1,0) and (0,1) on E[5].

The subscript function in each array dimension can define a Zero Induction Variable (ZIV), a Single Induction Variable (SIV), or a Multiple Induction Variable (MIV). In figure 2.1(a), array reference A[i-1][j+1] has SIV subscripts in both of its dimensions, whereas reference C[i+j] has an MIV subscript, and reference E[5] has a ZIV subscript. An array dimension subscript leads to a ZIV, SIV, or MIV, respectively, if it has zero, one, or multiple nonzero coefficients in the corresponding row of the H matrix.
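As an illustration of this bookkeeping, the following C sketch shows one possible way to store an affine reference as its access matrix H plus offset vector C, classify each dimension as ZIV, SIV, or MIV, and test whether two references are uniformly generated. The struct and function names are hypothetical and are not part of the dissertation's compiler.

    /* Minimal sketch, not the dissertation's implementation.  An affine
     * reference in an n-deep loop nest is kept as its access matrix H (one row
     * per array dimension, one column per loop index) and constant vector C. */
    #include <stdbool.h>

    #define MAX_DIMS  4
    #define MAX_DEPTH 4

    typedef struct {
      int dims, depth;
      int H[MAX_DIMS][MAX_DEPTH];   /* coefficients of i_1 .. i_n per dimension */
      int C[MAX_DIMS];              /* constant offset per dimension            */
    } AffineRef;

    /* Returns 0 for ZIV, 1 for SIV, and more than 1 for MIV in dimension d. */
    int classify_dim(const AffineRef *r, int d)
    {
      int nonzero = 0;
      for (int l = 0; l < r->depth; l++)
        if (r->H[d][l] != 0) nonzero++;
      return nonzero;
    }

    /* Two references are uniformly generated iff their H matrices agree;
     * only the constant vectors C may differ. */
    bool uniformly_generated(const AffineRef *a, const AffineRef *b)
    {
      if (a->dims != b->dims || a->depth != b->depth) return false;
      for (int d = 0; d < a->dims; d++)
        for (int l = 0; l < a->depth; l++)
          if (a->H[d][l] != b->H[d][l]) return false;
      return true;
    }

    /* Example: A[i-1][j+1] from figure 2.1(a) has H = {{1,0},{0,1}} and
     * C = {-1,1}; both of its dimensions classify as SIV, and it is uniformly
     * generated with A[i][j] (same H, C = {0,0}). */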
2.2 Data Dependence

There is a data dependence between two array references if they access the same data memory location at some point of the iteration space. This dependence dictates an execution order between different statements and needs to be preserved by any code transformation. Based on the type of read/write operation of the two dependent array references, their dependence is categorized as:

• Input Dependence: when both accesses are data reads.
• True Dependence: when the first data access is a write operation and the second access is a read operation.
• Anti Dependence: when the first data access is a read operation and the second access is a write operation.
• Output Dependence: when both accesses are data writes.

A data dependence is loop carried if its two dependent accesses occur in different iterations of the loop. The dependence is loop independent if both accesses happen in the same iteration of the loop.

2.3 Data Reuse

Data dependence can lead to data reuse, as two array references access the same array element at different points of the iteration space. In the case of input/true dependences, the second access reads the same data location that is accessed in the first read/write operation. In the case of anti/output dependences, the second access writes to the same data location that is accessed in the first read/write operation.

The data reuse is called a self-reuse if it is induced by the same data reference, and a group-reuse if it is induced by two distinct array references. In figure 2.1(a), the references of array A exhibit a group-reuse across the iterations of the i and j loops, while B[j] exhibits a self-reuse across the iterations of the i loop.

2.3.1 Reuse Vectors

For a given array reference, the reuse analysis determines the set of reuse vectors that span the reuse space for that reference. Each reuse vector, represented as $\vec{R} = (r_1, \ldots, r_n)$, defines the minimal distance between the iterations of a reference and its reuse [47].

Given a reference with subscript index $H\vec{I} + \vec{C}$, its self-reuse occurs for distinct iterations $\vec{I}_1$ and $\vec{I}_2$ that satisfy the reuse equation:

$$\vec{I}_2 - \vec{I}_1 = \Delta\vec{I} \in Null(H)$$

Given two uniformly generated array references with subscript indices $H\vec{I} + \vec{C}_1$ and $H\vec{I} + \vec{C}_2$, a group-reuse occurs for iterations $\vec{I}_1$ and $\vec{I}_2$ that satisfy the following reuse equation:

$$\vec{I}_2 - \vec{I}_1 = \Delta\vec{I} \in Null(H) + \vec{C}$$

Here $\vec{C}$ is a vector satisfying the equation $H\vec{C} = \vec{C}_2 - \vec{C}_1$.

The solutions to the reuse equations above characterize the reuse space, defined as $Null(H)$ or $Null(H) + \vec{C}$, and the bases that span these spaces are the reuse vectors. These vectors are always selected to be lexicographically positive, so that their linear combinations lead to feasible reuse distances. In figure 2.1(a), the self-reuse vector for B[j] is (1,0), the self-reuse vector for C[i+j] is (1,-1), and the group-reuse vector for the references of array A is (1,-1).

In cases where a reference exhibits both self and group reuse, the reuse space is characterized by the composition of the two sets of reuse vectors. One set consists of the self-reuse vectors of each uniformly generated reference, whereas the other set consists of the group-reuse vectors derived by solving the group-reuse equations. For example, for array D in figure 2.1(a), this space is a composition of the self-reuse vector (0,1) with the group-reuse vector (2,0).

In general, references can have more than a single reuse vector. Each reuse vector dictates the minimum iteration distance along a set of loops between reuses. For a reuse vector $\vec{R}$ we define the level of reuse as the level of the outermost non-zero element of $\vec{R}$. In figure 2.1(b), reference B[k] has two reuse vectors, (1,0,0) and (0,1,0). Vector (1,0,0) exhibits a self-reuse at level i, while (0,1,0) indicates a self-reuse at level j.
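For uniformly generated SIV references the group-reuse equation can be solved one dimension at a time, since each dimension involves a single loop index. The sketch below, with hypothetical names and a deliberately simplified interface, illustrates that computation; it is not the dissertation's implementation, and it leaves the lexicographic-positivity normalization discussed above to the caller.

    /* Sketch: group-reuse vector between two uniformly generated SIV
     * references.  For each array dimension d, lev[d] is the loop level of its
     * single index variable and coeff[d] its coefficient; c1[d] and c2[d] are
     * the constant offsets of the two references.  The per-level iteration
     * distance follows from (c1[d] - c2[d]) / coeff[d].  Returns 0 when some
     * offset difference is not divisible by its coefficient (no group-reuse). */
    int siv_group_reuse_vector(int dims, int depth,
                               const int lev[], const int coeff[],
                               const int c1[], const int c2[],
                               int dI[])
    {
      for (int l = 0; l < depth; l++) dI[l] = 0;
      for (int d = 0; d < dims; d++) {
        int diff = c1[d] - c2[d];
        if (coeff[d] == 0) {                 /* ZIV dimension */
          if (diff != 0) return 0;
          continue;
        }
        if (diff % coeff[d] != 0) return 0;
        dI[lev[d]] = diff / coeff[d];
      }
      return 1;   /* caller may negate dI to make it lexicographically positive */
    }

    /* For A[i][j] (c1 = {0,0}) and A[i-1][j+1] (c2 = {-1,1}) in figure 2.1(a),
     * with lev = {0,1} and coeff = {1,1}, this yields dI = (1,-1), the
     * group-reuse vector quoted for array A above. */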
2.3.2 Reuse Graphs and Reuse Chains

The reuse relations among a set of array references can be represented using a directed graph called a reuse graph. Each reuse relation consists of two nodes that correspond to two dependent array references, called a source and a sink. These nodes are connected by a reuse edge, which is labeled by the reuse vector between its source and sink. Figure 2.1(c) illustrates the reuse graph for the code in figure 2.1(a). An edge between two distinct nodes represents a group-reuse, while a self-loop edge on a node indicates self-reuse.

Every connected component of the reuse graph is called a reuse chain [69]. In this chain, a node with no incoming edge with true or input dependence, or with no outgoing edge with output dependence, is called a generator. A node corresponding to a reference with a write operation and with no outgoing output dependence is called a finalizer. The generator accesses for the first time the data that is reused by the rest of the references in the chain. The finalizer contains the final value of its reference, which may have been overwritten several times by previous write references and therefore needs to be updated in memory. In figure 2.1(c) the reuse chains include {A[i][j] -> A[i-1][j+1]}, {D[i+2] -> D[i]}, {B[j]}, {C[i+j]}, and {E[5]}. For reuse chain {A[i][j] -> A[i-1][j+1]}, A[i][j] is both the generator and the finalizer of the chain.

For a reference that exclusively exhibits self-reuse, the reuse chain consists of only one reference. In other words, there is a single reuse instance where the source and sink are the same array reference. This is illustrated for references B[j] and C[i+j] in figure 2.1(c). For a set of references that exclusively exploit group-reuse, the reuse chain includes all these references. Here the data flows from the generator all the way to the last reference of the chain, going through the different references. This is illustrated in figure 2.1(c) for the references to arrays A and D.

2.4 Reuse Vectors for SIV and MIV Subscripts

In this section we describe the structure and properties of the reuse vectors for arrays with SIV and MIV subscripts. To that extent we define a free variable for an array reference A as a loop index that does not appear in any of the array's subscripts and therefore corresponds to a zero column in the H matrix of A.

To make the cases of MIVs tractable, we consider references with at most two index variables in each dimension. We further impose a separability requirement on the MIV array index functions. A separable index function is one in which a given index variable is used in the subscript of at most one of the dimensions of the array. For example, reference A[i][j+k] is separable while A[i][i+j] is not, as index i appears in both dimensions. This requirement greatly simplifies the analysis without substantially compromising its applicability. In the domain of kernel codes we have focused on, we have not seen a single example where the indices were not separable.

2.4.1 Self Reuse for SIVs

For references that exclusively exhibit self-reuse, the reuse vectors are in the form of elementary vectors $e_k$, where k is the loop level of a free variable. An elementary reuse vector $e_k$ means that the data accessed by the array reference is invariant with respect to the loop at level k, and therefore some reuse (not necessarily the outermost) is carried at this level. In figure 2.1(b), the array reference B[k] has two free variables, i and j. In this case the elementary vectors (1,0,0) and (0,1,0) correspond to the self-reuse vectors at levels i and j.

Lemma: Array references with ZIV and SIV subscripts have exclusively elementary vectors as part of their self-reuse space.

Proof: These references have at most one loop index in each dimension. This leads to a matrix H in which each row has at most one non-zero entry. As a result, Null(H) requires all the indices that appear in the subscript of the array reference to be zero. The remaining indices, namely the free variables, can take any value. As the reuse vectors are the bases of this null space, they are all in the form of elementary vectors. Each vector has only a single non-zero element, equal to 1, at level k, where k is the level of a free variable.
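The lemma above translates directly into a simple column scan of the access matrix. The following C sketch, with hypothetical names, enumerates the loop levels whose columns of H are entirely zero; each such free variable contributes one elementary self-reuse vector. It is an illustration of the rule, not the dissertation's code.

    /* Sketch: for a ZIV/SIV reference, every all-zero column l of the access
     * matrix H (a free loop index) yields the elementary self-reuse vector e_l.
     * H is stored row-major as H[d*depth + l]; level_is_free[l] is set to 1
     * when e_l is a self-reuse vector.  Returns the number of such vectors. */
    int siv_self_reuse_levels(int dims, int depth, const int *H,
                              int level_is_free[])
    {
      int count = 0;
      for (int l = 0; l < depth; l++) {
        int is_free = 1;
        for (int d = 0; d < dims; d++)
          if (H[d * depth + l] != 0) { is_free = 0; break; }
        level_is_free[l] = is_free;
        count += is_free;
      }
      return count;
    }

    /* For B[k] in the 3-deep nest of figure 2.1(b), H = {0,0,1} (one row),
     * levels i and j are free, giving the elementary self-reuse vectors
     * (1,0,0) and (0,1,0) quoted in section 2.4.1. */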
2.4.2 Group Reuse for SIVs

For a set of references that exclusively exploit group-reuse, the reuse vectors are of the form $(0, 0, \ldots, 0, c, c/0, \ldots, c/0)$: a run of leading zeros followed by constant entries, where each c represents a constant value (possibly zero after the first) and the reuse occurs at the outermost non-zero level. For example, in figure 2.1(a) the references to array A have a group-reuse vector of (1,-1). This indicates that the reuse occurs at the next iteration of the i loop, for a previous iteration of the j loop.

2.4.3 Self and Group Reuse for SIVs

For reuse chains that exploit both self and group reuse, the reuse vectors are in the form of elementary vectors as well as vectors of the form $(0, 0, \ldots, 0, c, c/0, \ldots, c/0)$. The self-reuse happens at any loop level k for which there exists a reuse vector of the form $e_k$. The group-reuse happens at level k for vectors of the form $\vec{R} = (0, 0, \ldots, 0, \underbrace{c}_{k}, c/0, \ldots, c/0)$. For example, in figure 2.1(a), the references of array D have a self-reuse vector of (0,1) with reuse at level j and a group-reuse vector of (2,0) with reuse at level i.

Lemma: At a specific reuse level k of a loop, a reuse chain either exhibits self-reuse or group-reuse, but not both.

Proof: To prove this we need to show that a self-reuse at level k prevents any group-reuse at that level, and vice versa. To prove the first part we note that, for a reuse chain with uniformly generated references, self-reuse occurs at level k if there exists a reuse vector of the form $e_k$. In other words, the index k is not present in any of the subscripts. As the index is absent, there cannot exist any reuse vector of the form $(0, 0, \ldots, 0, \underbrace{c}_{k}, c/0, \ldots, c/0)$. As a result there will not be any group-reuse at level k. The argument for proving the reverse part is analogous.

2.4.4 Self Reuse for MIVs

We consider an MIV array reference with a subscript of the form $[a_1 i_{k_1} + a_2 i_{k_2} + c]$, where $i_{k_1}$ and $i_{k_2}$ are index variables at levels $k_1$ and $k_2$ and $0 \le k_1, k_2 \le n$. For such a reference the reuse analysis uncovers a single null space including a reuse vector with two non-zero entries at levels $k_1$ and $k_2$, which has reuse at level $k_1$. The vector is of the form

$$\vec{R} = (\underbrace{0, \ldots, 0}_{k_1 - 1}, c_1, \underbrace{0, \ldots, 0}_{k_2 - k_1 - 1}, c_2, \underbrace{0, \ldots, 0}_{n - k_2})$$

in which $c_1$ and $c_2$ can be calculated as $c_1 = a_2 / \gcd(a_1, a_2)$ and $c_2 = -a_1 / \gcd(a_1, a_2)$.

In addition to this vector, the null space includes a set of elementary vectors corresponding to the index variables that are absent from the subscript, if any. For each free variable k the corresponding $e_k$ represents a self-reuse at level k. Reference A[2j+4k] in figure 2.1(b) has reuse vectors of the form (0,2,-1) and (1,0,0), corresponding to its MIV subscript and the free variable i.

The interpretation of a self-reuse vector with two non-zero elements (exclusive to MIV cases) is more complex. A reuse vector $(0, \ldots, \alpha, \ldots, -\beta, \ldots, 0)$, with $\alpha$ and $\beta$ at levels $k_1$ and $k_2$, indicates that the data item accessed at the current iteration is first reused $\alpha$ iterations later in the loop at level $k_1$ and $\beta$ iterations earlier in the loop at level $k_2$. This means that the loop at level $k_1$ carries the reuse, but the data is reused at an offset iteration of the loop at level $k_2$. Reuse vector (0,2,-1) for A[2j+4k] in figure 2.1(b) indicates that a reuse occurs two iterations later in j for an earlier iteration of k.

Discussion: The self-reuse vector of an MIV reference looks very similar to the group-reuse vector of an SIV reference in that they both have more than one nonzero element. The difference lies in the fact that the group-reuse is just a single vector while the self-reuse is a space spanned by a vector. So group-reuse in the SIV case is a special case of self-reuse in the MIV case. For example, the references to array A in figure 2.1(a) exhibit a group-reuse induced by the reuse vector (1,-1), whereas the reference to array C has a self-reuse indicated by the reuse vector (1,-1). The group-reuse only occurs when $\Delta i = 1$ and $\Delta j = -1$, whereas the self-reuse occurs at every iteration where $\Delta i = -\Delta j$.
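The $c_1$/$c_2$ rule above is easy to mechanize. The short C sketch below (hypothetical names, not the dissertation's code) computes the non-elementary self-reuse vector of a one-dimensional MIV subscript of the form a1*i_k1 + a2*i_k2 + c.

    /* Sketch: self-reuse vector of an MIV subscript a1*i_k1 + a2*i_k2 + c.
     * The entry at level k1 is a2/gcd(a1,a2) and the entry at level k2 is
     * -a1/gcd(a1,a2); all other levels are zero (levels are 0-based here). */
    static int gcd(int a, int b)
    {
      if (a < 0) a = -a;
      if (b < 0) b = -b;
      while (b != 0) { int t = a % b; a = b; b = t; }
      return a;
    }

    void miv_self_reuse_vector(int a1, int a2, int k1, int k2,
                               int depth, int vec[])
    {
      int g = gcd(a1, a2);
      for (int l = 0; l < depth; l++) vec[l] = 0;
      vec[k1] =  a2 / g;    /* reuse carried at the outer level k1             */
      vec[k2] = -a1 / g;    /* ... paired with an offset iteration at level k2 */
    }

    /* For A[2j+4k] in figure 2.1(b): a1 = 2 at level j, a2 = 4 at level k, so
     * miv_self_reuse_vector(2, 4, 1, 2, 3, vec) fills vec with (0, 2, -1),
     * the vector derived in section 2.4.4. */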
2.4.5 Group Reuse for MIVs

The structure of the reuse vectors in the case of group-reuse for MIVs is very complicated. This is due to the fact that the reuse points in the iteration space do not necessarily form a clear subspace. In figure 2.1(b) there is a group-reuse between A[2j+4k] and A[2j+4k+2] whenever $\Delta(2j + 4k) = 2$. This condition could be satisfied through any of the reuse vectors (-1,1), (1,0), (3,-1), (0,1/2), ..., and these vectors do not correspond to any clear subspace of the iteration space. As a result, in order to analyze the MIV group-reuse in this work, we reduce the group-reuse of MIVs to the group-reuse of SIVs by only considering the reuse along one variable.

2.5 Advantages of Reuse Vectors

Here we contrast the notion of reuse vectors with another abstraction used for capturing data dependence, called dependence vectors. A dependence vector in an n-level loop nest is an n-dimensional vector $\vec{D} = (d_1, \ldots, d_n)$. Each $d_k$ represents the difference between the iteration counts of loop k in the iteration vectors of the two dependent array references [47]. The elements of $\vec{D}$ are either a constant, '+', '-', or '*'. The constant term corresponds to a constant difference between the iterations of two dependent references. A '+' sign indicates a positive but non-constant difference between the iterations. A '-' sign indicates a negative but non-constant difference between the iterations. Finally, a '*' represents an unknown difference between the iterations of two dependent references. In figure 2.1(a), the input dependence between B[j] and itself across the iterations of the i loop is captured by the vector (+,0).

In previous work the distance vectors were used to capture the reuse information by only considering the shortest dependence distance [69]. For example, in figure 2.1(a) the dependence vector (+,0) for reference B[j] is transformed into the vector (1,0), exhibiting the shortest dependence distance. This new vector indicates that the first reuse of data in B[j] happens in a subsequent iteration of i and the same iteration of j.

The vectors representing the shortest dependence distance accurately identify the reuse for SIV subscripts. For MIV subscripts, however, the reuse information gets lost in the compact representation of these vectors. Namely, these vectors neither provide sufficient information as to when and where the reuse occurs nor represent the relationship between loop indices in MIV references. In figure 2.1(b), the reference A[2j+4k] has dependence vectors (+,0,0) and (0,+,*) with shortest distance vectors (1,0,0) and (0,1,-1). None of these vectors capture the relation between the iterations of the j and k loops where the reuse occurs. Namely, they do not represent that each reuse occurs two iterations later in j and one iteration earlier in k.

Addressing this shortcoming, the notion of reuse vectors not only indicates the relation between the variables in SIV and MIV subscripts, but also captures the exact iteration difference between the two dependent references.

2.6 Chapter Summary

In this chapter we described the concept of data reuse for array references in a loop nest. We presented how to compute and interpret the reuse vectors for different reuse types in the case of arrays with SIV and MIV indices. We also explained and contrasted the abstractions of dependence vectors and reuse vectors, as they are commonly used to study the subject of data reuse. These concepts form the basis of our analysis for identifying and quantifying the reuse, fully discussed in the next chapter.

Chapter 3: Scalar Replacement

For an array reference with data reuse, the technique of keeping the data associated with the first data access in local storage and reusing it during the later accesses is called scalar replacement. An important aspect of scalar replacement is to determine the number of storage elements required to capture the reuse of a given array reference.
Compilers can use this storage metric, along with the number of saved memory accesses, to select the most beneficial array references to keep in local storage. In particular, they can adjust the aggressiveness of scalar replacement in the presence of limited storage based on a cost-benefit analysis.

For a reference with data reuse, if the reuse is carried by loop level k, the data accessed by the inner loop levels needs to be saved in local storage. As computations have different references, and each reference can have reuse at multiple loop levels, exploiting the reuse can require different amounts of storage. In the rest of this chapter we discuss how to compute the required storage for the various cases of SIV and MIV subscripts in sections 3.1 and 3.2, and give a summary of the chapter in section 3.3.

for (i = 0; i < 3; i++) {
  for (j = 0; j < 4; j++) {
    for (k = 0; k < 6; k++) {
      ... = A[j];
    }
  }
}

[Access-footprint diagram omitted: the four elements touched by A[j] are accessed and reused within every iteration of k, and the same four elements are reused across the iterations of i.]

Figure 3.1: Example code and reuse pattern for SIV self-reuse.

3.1 Single Induction Variables

3.1.1 Self Reuse

For a chain that exclusively exhibits self-reuse, the set of reuse vectors R consists only of elementary vectors (section 2.4.1). To exploit the reuse at level k of a vector e_k, the following expression computes the amount of required storage. The expression reflects only the reuse at level k and is independent of the requirements of the inner loops. Here Size_sr is the number of distinct array elements accessed between any data access and its first reuse:

Size_sr = β_k × ∏_{l=k}^{n} α_l        (3.1)

where α_l = 1 if there exists e_l in R and α_l = I_l otherwise, and β_l = 1 if there exists e_l in R and β_l = 0 otherwise.

Figure 3.1 illustrates an example code with its data access footprint. Reference A[j] has reuse vectors of the form e_i = (1,0,0) and e_k = (0,0,1), which implies that A[j] exhibits data reuse at levels i and k of the loop. Using the formula for Size_sr, we need I_j = 4 storage elements to exploit the reuse at level i, but only one storage element to exploit the reuse at level k.

3.1.2 Group Reuse

For a set of references that exclusively exhibit group-reuse, the required storage for exploiting the reuse corresponds to the number of distinct array elements accessed between the generator and the last reference of the reuse chain. For a reuse vector R = (r_1,...,r_n) with reuse at level k, the number of storage elements required to capture the reuse is given by Size_gr in the following expression:

Size_gr = Σ_{i=k}^{n-1} ( ∏_{l=i+1}^{n} I_l × r_i ) + r_n

This captures the number of elements accessed between a reference and its first reuse. For example, the code in figure 3.2(a) has a group-reuse vector (1,2,3). In order to capture the reuse at level i we need I_j × I_k × 1 + I_k × 2 + 3 = 28 storage elements.

It is possible for a reuse chain to have sub-chains that exhibit reuse at different levels. For example, the code in figure 3.2(b) has a reuse chain of the form {A[i-1][j] → A[i][j-1] → A[i][j+1] → A[i+1][j]}. This results in the set of reuse vectors {(1,-1),(0,2),(1,-1)}, which indicates the possibility of reuse at both levels i and j.
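Both storage expressions are simple enough to evaluate directly. The following C sketch is our own illustration (the parameter layout and zero-based loop levels are assumptions, not the dissertation's implementation) and reproduces the two worked examples above:

#include <stdio.h>

/* Size_sr (equation 3.1): bounds[l] = I_l; elementary[l] = 1 if the
   elementary vector e_l belongs to the reuse-vector set R. */
long size_self_reuse(int n, int level, const long bounds[], const int elementary[]) {
    if (!elementary[level]) return 0;          /* beta_k = 0: no self-reuse at this level */
    long size = 1;
    for (int l = level; l < n; l++)
        size *= elementary[l] ? 1 : bounds[l]; /* alpha_l */
    return size;
}

/* Size_gr: r[] is the group-reuse vector, with reuse carried at 'level'. */
long size_group_reuse(int n, int level, const long bounds[], const long r[]) {
    long size = r[n - 1];
    for (int i = level; i < n - 1; i++) {
        long inner = 1;
        for (int l = i + 1; l < n; l++) inner *= bounds[l];
        size += inner * r[i];
    }
    return size;
}

int main(void) {
    long b31[] = {3, 4, 6};  int  e31[] = {1, 0, 1};   /* figure 3.1: A[j]     */
    long b32[] = {2, 3, 5};  long r32[] = {1, 2, 3};   /* figure 3.2(a)        */
    printf("%ld %ld\n", size_self_reuse(3, 0, b31, e31),   /* 4: reuse at level i */
                        size_self_reuse(3, 2, b31, e31));  /* 1: reuse at level k */
    printf("%ld\n", size_group_reuse(3, 0, b32, r32));     /* 28: reuse at level i */
    return 0;
}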
for (i = 0; i < 2; i++) {
  for (j = 0; j < 3; j++) {
    for (k = 0; k < 5; k++) {
      ... = A[i][j][k];
      ... = A[i+1][j+2][k+3];
    }
  }
}

(a) [Access-footprint diagram omitted: it marks which of the elements touched by A[i][j][k] and A[i+1][j+2][k+3] are reusable and which are not.]

for (i = 1; i < 3; i++) {
  for (j = 1; j < 12; j++) {
    ... = A[i-1][j] + A[i][j-1];
    ... = A[i][j+1] + A[i+1][j];
  }
}

(b) [Access-footprint diagram omitted: it distinguishes, for the four references, the reusable data that is saved from the reusable data that is not saved.]

Figure 3.2: Example codes and reuse patterns for SIV group-reuse.

For chains with reuse at various levels, for a specific reuse level k we consider all the reuse vectors whose first non-zero entry c falls at level k, that is, vectors of the form R = (0,0,...,0,c,...). In order to compute the required storage, we apply the Size_gr equation to the vector R with the largest value of c at level k. So for figure 3.2(b) we can exploit the reuse for the sub-chain {A[i][j-1] → A[i][j+1]} with reuse vector (0,2) at level j.
Alternatively we can consider the reuse at level i for the subchain offA[i¡1][j]!A[i+1][j]g with reuse vector (2;0). Discussion: The values of r i in ~ R can be arbitrarily small or large. As a result the relationbetweenthevaluesofr i andI i becomessigni¯cant. Thisa®ectsthereusebehavior at the boundaries of the iteration space and therefore can have a direct impact on the required number of storage. For example in ¯gure 3.2(a), applying the equation of Size gr results in a very loose upper bound for the number of required storage (shaded boxes in the ¯gure). This is due to the fact that the values of I i and r i are very close and therefore a considerable number of elements in each iteration do not exhibit any reuse. A better alternative would be to use the equation P n¡1 i=k f Q n l=i+1 (I l ¡r l )£r i g+r n . This way the elements with no reuse will be excluded and this results in the exact number of required storage (solid boxes in the ¯gure). On the other hand, in ¯gure 3.2(b) using the equation Size gr results in a lower bound for the number of required storage (solid oval in the ¯gure). This is due to the fact that some reusable elements are ignored at the boundaries of the iteration space. A better alternative would be to use the equation P n¡1 i=k f Q n l=i+1 (I l +r l )£r i g+r n . This equations 35 captures all the reusable elements, which results in the exact number of required storage (solid and shaded ovals in the ¯gure). To address this inaccuracy we consider two points. First, and in most practical appli- cations, the value of I i is considerably larger than r i (I i >>r i ). As a result applying the Size gr equation creates only very small amounts of imprecision at the boundaries of the iteration space. Second, we need the number of required storage as a point of comparison to select between various reuse chains. This comparison can tolerate small amounts of imprecision. As a result we opt for using the Size gr equation for all cases. This gives us a close enough approximation for the actual numbers. 3.1.3 Self and Group Reuse For reuse chains that exploit both self and group-reuses each reuse level exhibits either self-reuse or group-reuse. Calculating the amount of required storage at any reuse level depends on the type of reuse at that level as well as the type of reuse in any level above it. More accurately, if a self-reuse occurs at level i and a group-reuse occurs at level j, the required storage depends on whether i<j or i>j. If i < j, namely the self-reuse occurs at a loop level outside the group-reuse, all the elements accessed in the iterations inside of level i need to be saved no matter how small/large the group distance is. As a result the number of required storage is Size sr (i). In ¯gure 3.3(a) the self-reuse for references of A occurs at level i while the group-reuse occurs at level j. As a result the entire data accessed with references of array A needs to be saved for later reuse. 36 for(i = 0; i < 2; i++)f for(j = 0; j < 2; j++)f for(k = 0; k < 5; k++)f ... 
= A[j][k]+ A[j+1][k+1]; A[j+2][k+6] = :::; g g g 0,0 1,1 2,6 1,0 2,1 3,6 0,1 1,2 2,7 1,1 2,2 3,7 0,2 1,3 2,8 1,2 2,3 3,8 0,3 1,4 2,9 1,3 2,4 3,9 0,4 1,5 2,10 1,4 2,5 3,10 i=0, j=0, k=0,…,4 i=0, j=1, k=0,…,4 0,0 1,1 2,6 1,0 2,1 3,6 0,1 1,2 2,7 1,1 2,2 3,7 0,2 1,3 2,8 1,2 2,3 3,8 0,3 1,4 2,9 1,3 2,4 3,9 0,4 1,5 2,10 1,4 2,5 3,10 i=1, j=0, k=0,…,4 i=1, j=1, k=0,…,4 Data access in group reuse Data reuse in group reuse Data access in self reuse Data reuse in self reuse Access footprint for A[j][k], A[j+1][k+1], and A[j+2][k+6] (a) Self-reuse level outside of group-reuse level. for(i = 0; i < 3; i++)f for(j = 0; j < 5; j++)f for(k = 0; k < 5; k++)f ::: = A[i][k]+ A[i+2][k+1]; g g g 0,0 2,1 1,0 3,1 2,0 4,1 2,0 4,1 0,1 2,2 1,1 3,2 2,1 4,2 2,1 4,2 0,2 2,3 1,2 3,3 2,2 4,3 2,2 4,3 0,3 2,4 1,3 3,4 2,3 4,4 2,3 4,4 0,4 2,5 1,4 3,5 2,4 4,5 2,4 4,5 i=0, i=1, i=2, i=2, j=0,…,4, j=0,…,4, j=0, j=1,…,4 k=0,…,4 k=0,…,4 k=0,…,4 k=0,…,4 Data access in group reuse Data reuse in group reuse Data access in self reuse Data reuse in self reuse Access footprint for A[i][k] and A[i+2][k+1] (b) Group-reuse level outside of self-reuse level. Figure 3.3: Example codes and reuse patterns for SIV self + group reuse. 37 If on the other hand i > j, the group-reuse occurs at a level outside of the self-reuse. In order to exploit the self-reuse we need Size sr (i) storage elements. In addition, we need atmostSize gr (j)storageelementstoexploitthegroup-reuseup tolevelj. Itisimportant to notice that we do not consider the reusable data in level j itself, since this data will be considered as part of the self-reuse. As a result the total number of required storage is captured by adding these two values. In ¯gure 3.3(b) the group-reuse occurs at the outermost level i, while the self-reuse occurs at level j. As a result, at each iteration of j some data (Size sr (j) = I k elements) needs to be saved for later self-reuse. In addition, the data corresponding to group-reuse should be saved for the iteration that reuses the data (in this case i=2). Here we combine the two cases above where the self-reuse occurs at level i and group- reuse occurs at level j. Assuming that8i:I i >>r i . Size sr+gr =f® ij £( n Y l=j+1 I l £r j )g+Size sr (i) where: ®ij = 8 > > < > > : 0 if i<j 1 if i>j (3.2) Note that the reason behind removing the sigma term in Size gr is that we do not consider the reusable data in level j itself, since this data is already considered as part of the self-reuse in the inner loop i. 38 3.2 Multiple Induction Variables Here we discuss the references that have subscript with multiple induction variables. We make use of the auxiliary function ° l for a given array reference and loop level l de¯ned as following. Here I l denotes the number of iterations executed by the loop at index l. ° l = 8 > > < > > : I l if l2 Subscript(ref) 1 otherwise (3.3) In this section we analyze the reuse and required storage for the MIV cases with two index variables in single and multiple dimensions. 3.2.1 Self Reuse 3.2.1.1 Subscripts With Two Variables in a Single Dimension Recallfromsection2.4.4thatthereusevectorforanMIVarrayreferencewithonedimen- sion in the form of [a 1 i k 1 +a 2 i k 2 +c] is of the form ~ R=(0;:::;0 | {z } k 1 ¡1 ;c 1 ;0;:::;0 | {z } k 2 ¡k 1 ¡1 ;c 2 ;0;:::;0 | {z } n¡k 2 ). 
For a reuse vector ~ R with non-zero values of c 1 = a 2 =gcd(a 1 ;a 2 ) and c 2 = ¡a 1 =gcd(a 1 ;a 2 ) at levels k 1 and k 2 , the number of required storage elements to exploit the reuse at levelk 1 isthe number of distinct solutions to the reuse equations as described in section 2.3.1. While in general this problem can be solved using general frameworks such as Presburger formulas [64], for the special cases of loops with symbolically con- stant bounds the number of such solutions is given by equation 3.4. In this equation we distinguish between various levels of reuse relative to k 1 , which is the reuse level of ~ R. 39 for(l = 0; l < 3; l++) for(i = 0; i < 4; i++) for(j = 0; j < 3; j++) for(k = 0; k < 5; k++)f ::: = A[i+2k] + D[3i+k]; ::: = B[i+2][k-1] + B[i][k]; g 0+0 1+0 2+0 3+0 0+2 1+2 2+2 3+2 0+4 1+4 2+4 3+4 0+6 1+6 2+6 3+6 0+8 1+8 2+8 3+8 Data that needs to be saved for reuse at level l i 0 1 2 3 k 0 1 2 3 4 (a) Example code. (b) Access footprint for A at level l. 0+0 1+0 2+0 3+0 0+2 1+2 2+2 3+2 0+4 1+4 2+4 3+4 0+6 1+6 2+6 3+6 0+8 1+8 2+8 3+8 Data that needs to be saved for reuse at level i i 0 1 2 3 k 0 1 2 3 4 Data that needs to be saved for reuse at level j i 0 1 2 3 k 0 1 2 3 4 0+0 1+0 2+0 3+0 0+2 1+2 2+2 3+2 0+4 1+4 2+4 3+4 0+6 1+6 2+6 3+6 0+8 1+8 2+8 3+8 (c) Access footprint for A at level i. (d) Access footprint for A at level j. 0+0 3+0 6+0 9+0 0+1 3+1 6+1 9+1 0+2 3+2 6+2 9+2 0+3 3+3 6+3 9+3 0+4 3+4 6+4 9+4 Data that needs to be saved for reuse at level i i 0 1 2 3 k 0 1 2 3 4 0+0 3+0 6+0 9+0 0+1 3+1 6+1 9+1 0+2 3+2 6+2 9+2 0+3 3+3 6+3 9+3 0+4 3+4 6+4 9+4 Data that needs to be saved for reuse at level j i 0 1 2 3 k 0 1 2 3 4 (e) Access footprint for D at level i. (f) Access footprint for D at level j. Figure 3.4: Example code and reuse patterns for MIV self-reuse with two variables in one dimension. 40 Size sr = 8 > > > > > > > > > > < > > > > > > > > > > : n Y l=k+1 ° l if k >k 1 c 1 2 4 k 2 ¡1 Y l=k 1 +1 ° l £(I k 2 ¡jc 2 j)£ n Y l=k 2 +1 ° l 3 5 if k =k 1 n Y l=k+1;l6=k 1 ;k 2 ° l £[c 1 £I k 2 +(I k 1 ¡c 1 )jc 2 j] if k <k 1 (3.4) The ¯rst case occurs when the reuse is being exploited at a loop level inside the reuse level carried by ~ R. This means that the reuse at level k 1 is not exploited and thereforethereuseislimitedtotheself-reuseatlevelk representedbyreusevector ~ R=e k . This is essentially the case for the SIV self-reuse. As such, the index variable at loop level k 1 remains constant and the total number of memory locations accessed is simply dictated by the index variables present in the subscript functions for the array reference. In ¯gure 3.4(d), to exploit the reuse at level j for reference A[i + 2k] we need I k = 5 storage elements. In the second case, the reuse level k 1 dictates that we need a number that is given by the bound of the loop at level k 2 , with the exception of the values accessed by the ¯rst c 2 iterations. For the indices at the k 1 level the number of distinct values are repeated after c 1 iterations of the loop at level k 1 , but need to be o®setted by c 2 for all loop levels nested belowlevelk 2 . In¯gure3.4(c),forarrayAandreuseleveliwewouldneed2£(I k ¡1)=8 storage elements. For a level lower than k 1 , i.e. for a loop outside the loop at level k 1 , one needs to accumulate all the values accessed by the combination of the indices at levels k 1 and k 2 . The number of such distinct combinations is [c 1 I k 2 +(I k 1 ¡c 1 )jc 2 j] as all but c 2 values can be reused after the ¯rst c 1 iterations of the k 1 loop. 
The other factor in the 41 equation captures the presence of a given loop index variable in the subscript function. In ¯gure 3.4(b), for reference A at level l we need 2I k +(I i ¡2)=12 storage elements. Discussion: ² Unlike the case of SIVs, here the value of Size sr for the reuse at level k does not re°ect any possible reuse at an inner level l > k. Therefore the actual number of required storage elements for a speci¯c level, taking into account all reuse vectors, is calculated as the maximum value across all vectors. So in order to exploit data reuse along all reuse vectors with level( ~ R) ¸ k we compute the value of Size sr (k) as Size sr (k)=Max(Size sr (l)) where k·l·n In¯gure3.4(a),forarrayDwithreusevector(0;1;0;¡3)thereuseatlevelirequires only (I k ¡3) = 2 of the the I k distinct values (¯gure 3.4(e)). However, for reuse at level j all I k =5 values need to be saved (¯gure 3.4(f)). ² WenotethatforSIVsubscriptstheanalysispresentedhereisgreatlysimpli¯ed. For SIV references the reuse analysis uncovers one elementary reuse vector e l for each loop nest level l that is not included in any of the array reference subscripts. This elementaryreusevectorisaparticularcaseoftheMIVreusevectorwithk 1 =k 2 =k andc 1 =1=gcd(0;1)=1andc 2 =0=gcd(0;1)=0leadingtoSize sr (k)= Q n l=k+1 ° l . This equation gives the same results as the one developed in 3.1. 42 for(m = 0; m < 5; m++) for(l = 0; l < 3; l++) for(i = 0; i < 4; i++) for(j = 0; j < 3; j++) for(k = 0; k < 4; k++)f ... = A[l+j][i+k]; g Data that needs to be saved for reuse at level m i+k 0 1 2 3 4 5 6 l+j 0 1 2 3 4 (a) Example code. (b) Access footprint for array A at level m. Data that needs to be saved for reuse at level l i+k 0 1 2 3 4 5 6 l+j 0 1 2 3 4 Access footprint at level l i+k 0 1 2 3 4 5 6 l+j 0 1 2 3 4 Data that needs to be saved for reuse at level i Access footprint at level i (c) Access footprint for array A at level l. (d) Access footprint for array A at level i. Figure 3.5: Example code and reuse patterns for MIV self-reuse with two variables in multiple dimensions. 3.2.1.2 Subscripts With Two Variables in Multiple Dimensions We consider an array reference with multiple MIV dimensions each in form of [a p i kp + a q i kq +c pq ], where i kp and i kq are index variables at levels k p and k q , 0·k p ;k q ·n, and eachk p andk q arepresentonlyinonedimension. Foreachdimensionthereisavectorwith two non-zero entries at levels k p and k q in form of ~ R=(0;:::;0 | {z } kp¡1 ;c 1 ;0;:::;0 | {z } kq¡kp¡1 ;c 2 ;0;:::;0 | {z } n¡kq ). In ¯gure 3.5(a) reference A[l+j][i+k] has three reuse vectors. Vector (0;1;0;¡1;0) correspondingtodimension[l+j]withreuseatlevell,vector(0;0;1;0;¡1)corresponding to dimension [i+k] with reuse at level i, and ¯nally vector (1;0;0;0;0) corresponding to the self-reuse at level m. 43 As each MIV vector corresponds to the reuse in a di®erent dimension, considering all thesevectorssimultaneouslywouldextremelycomplicatetheanalysisofthereusepatterns. Asaresultweonlyexploitthereusealongasingleselecteddimensiontheselectionofwhich depends on the available storage elements. Combining this technique with equation 3.4, however, does not guarantee an exact number in terms of required storage. The reason is that as a result of selecting a single reuse vector we ignore any reuse in other dimensions. As equation 3.4 considers the reuse vectors independently, it misses on any reuse in the inner loop levels and therefore only captures an upper bound on the number of required storage elements. 
Figures 3.5(b), (c), and (d) depict the data that needs to be saved for reuse across the iterations of m, l, and i. Based on equation 3.4 exploiting the reuse at levels m, l and i respectively requires I l £I i £I j £I k = 90, I i £(I j ¡1)£I k = 48, and I j £(I k ¡1) = 9 elements. In reality however, and as shown by the ¯gure, we need to save only 35, 14, and 9 elements. The large numbers in case of reuse at levels m and l are due to the fact that we ignore the reuse in all the inner loops. 3.2.2 Group Reuse As mentioned in section 2.4.5, the case of group-reuse for MIVs is extremely complex as the reuse does not occur in any clear sub-space. As a result, we reduce the group-reuse of an MIV to an SIV group-reuse by only considering the reuse along one index variable. As an example consider the group-reuse between references A[i+j] and A[i+j+1] in an i;j loop. We can keep the value of j constant and exploit the reuse along the vector (1;0) 44 for when ¢i=1. Alternatively we can keep the value of i constant and exploit the reuse along the vector (0;1) for when ¢j =1. 3.2.3 Limitations of the Analysis In this section we summarize the limitations of our analysis for MIV cases that have been outlined throughout the chapter. ² The analytical formulas for Size sr generate tight bounds only for the cases of array references with MIV in a single dimension. For a reference with MIV in multiple dimensionsSize sr derivesthenumberofrequiredstorageelementsforaspeci¯cloop level taking into account the reuse along at most one dimension. As a result this equation will lead to an upper bound on the number of required storage as it will essentially ignore the opportunities for reuse along other vectors/dimensions. ² Consideringthereuseinonlyonedimensionandignoringtherestofthereusevectors causesthecompilertomisssomeopportunitiesforoptimization. Incaseofself-reuse, for either input or output references, this simply means a sub-optimal exploitation of reuse. However, for group-reuse where one of the references is a read operation and the other is a write operation, ignoring opportunities of reuse means that the scalar replaced implementation could use the wrong data value. ² As a result of the separability requirement, we cannot handle more intricate refer- ences with non-separable subscripts such as A[i+2j][j +k] in a (i;j;k) loop nest. In these case the kernel depends on how these subscripts interact with each other 45 and with loop indices. As a result using our analytical expression and considering one of the dimensions independent of the others leads to incorrect results. As we can see the main causes of these limitations are that 1) we consider the reuse corresponding to multiple reuse vectors to be orthogonal and 2) we assume the indices of the subscripts to be separable. These assumptions however are necessary to make the analysisofMIVcasestractable. Fortunately,theresultingshortcomingsdonotseemtobe limiting in practice, as the above instances do not occur often, if hardly at all, in practical applications. 3.3 Chapter Summary In this chapter we explained how to utilize the reuse vectors in order to calculate the amount of required storage for exploiting the reuse for di®erent array references. We derivedanalyticalformulastoquantifythereuseforpossiblereusetypesinSIVsubscripts. We also developed analytical formulas for the case of self-reuse in MIV references with two induction variables. Finally we expressed the limitations of our analysis for the MIV cases. 
46 Chapter 4 Analyses of the Critical Path and Scheduling The objective of our memory allocation algorithm is to reduce the execution time of the underlying computation by shortening its critical paths. Considering su±cient computa- tional resources the critical path is solely determined by the dependencies and latencies of the individual operations as well as the availability of the data. This data availability is further de¯ned by the storage mapping choices for each data item in the computation. In this chapter we ¯rst explain our analysis of the critical path in section 4.1. We then describe in 4.2 how we use the operational and data dependency information em- bedded in the critical paths of the computation in order to identify the most bene¯cial arrayreferences. Wecontinuebyexplainingtheimplicationsofschedulingandbandwidth availability for di®erent references in sections 4.3 and 4.4. We introduce our metric for identifying the critical paths in 4.5 and describe our execution model in 4.6. We ¯nally summarize the chapter in 4.8. 47 for (i = 0; i < 5; i++) for (j = 0; j < 10; j++) for (k = 0; j < 20; k++)f C[i][j] += (A[i][k] * B[k][j])/4 + B[k][j+2]; g (a) Original code. for (i = 0; i < 5; i++) for (j = 0; j < 10; j+=2) for (k = 0; j < 20; k++)f C[i][j] += (A[i][k] * B[k][j])/4 + B[k][j+2]; C[i][j+1] += (A[i][k] * B[k][j+1])/4 + B[k][j+3]; g (b) After unrolling the j loop by a factor of 2. Figure 4.1: Example code for critical path analysis. 4.1 Data-Flow-Graph and Critical Path(s) To capture the notions of data °ow and data dependence we abstract the computation of the body of a loop nest as a Data-Flow-Graph (DFG) derived from a Static-Single- Assignment (SSA) intermediate form [24]. Edges of the DFG represent data dependences whereas the nodes represent data accesses or arithmetic/logic operations. In order to investigate the execution time of the computation, we augment each node of the DFG with the latency of the corresponding operation. In doing so we assume the latencies of the numeric operations to be known and the latency of a memory access to be dependent on the corresponding memory level (i.e., internal vs. external storage). Figure 4.2(a) depicts the DFG for the computation in the example code in ¯gure 4.1. A source and sink nodes are arti¯cially added representing the beginning and the end of the computation. 48 In this representation, multiple updates to the same data item, say a scalar, are rep- resented by distinct symbolic values re°ecting the sequence of values the data item may assume. Updatestoanarrayitem,unlessdisambiguatedviatheanalysisoftheindices,are conservatively assumed to operate on the same item. In this representation, loop-carried dependences are represented in an acyclic fashion by explicitly using the last updated value as the initial value for subsequent iterations. Cyclical dependences are handled by numbering the various values symbolically and using only the last value assigned to each data element as the new value. Conditional statements in the loop are converted to data-dependences where both branchesoftheconditionalareevaluatedconcurrently. Weimposeaspeculativeexecution of all the memory operations regardless of the outcome of the control °ow predicates and thereforeignoretheextremecasesofmemoryexceptionhandlingbyassumingtheydonot occur. Upon determination of the value of a conditional predicate, the values computed on either branch are assigned to the individual variables. 
Results for variables written in both branches are merged via multiplexor nodes. Whenever possible, memory nodes are hoisted and therefore executed speculatively outside each branch of the conditional statement. Wede¯netheCriticalPath(s)(CP)ofaDFGasthepath(s)withthelongestexecution time, given the delays of data references and operations with no computational resource contention. A Critical Graph (CG) is a subgraph of DFG only including its critical paths. Figure 4.2(b) depicts the critical graph for the DFG represented in ¯gure 4.2(a). In the worst case the number of execution paths in a computation can be exponential on the number of statements in the source code. In order to identify the critical paths we 49 C[i][j+1] C[i][j] Sink Source B[k][j] A[i][k] B[k][j+1] * * /4 /4 temp1 temp2 + + + + B[k][j] A[i][k] B[k][j+1] temp1 /4 temp2 + C[i][j] C[i][j+1] * * /4 Source (1) (2) (3) (4) (a) Data Flow Graph (DFG) (c) Cuts for the initial CG (b) Initial Critical Graph (CG) (b) Initial Critical Graph (CG) Sink + + + B[k][j] A[i][k] Source B[k][j+1] * * /4 /4 temp1 B[k][j+2] temp2 B[k][j+3] C[i][j+1] + C[i][j] + + + Sink Figure 4.2: Data-Flow Graph, Critical Graph, and Cuts for the example code in 4.1. do not apply any aggregation and simply enumerate all the paths to ¯nd the longest ones. In practice, as is the case with our experiments in chapter 6, the codes that exhibit such worst case behavior are very rare. 50 4.2 Cuts of the Critical Graph For each critical graph we de¯ne a cut to be a minimal subset of nodes corresponding to data references such that their removal would bisect all the paths in the CG. Eliminating the access latencies associated with the references of a cut reduces the CPs of the DFG by the duration of one memory latency time [8]. Figure 4.2(c) depicts the set of cuts for the CG represented in ¯gure 4.2(b). The four sets of cuts for this example arefB[k][j];A[i][k];B[k][j+1]g,fB[k][j];A[i][k];C[i][j+1]g, fB[k][j +1];A[i][k];C[i][j]g, and fC[i][j];C[i][j +1]g. Each one of these cuts represents a minimal set of memory references such that the elimination of their latencies would shorten the critical paths of the computation by one memory latency time. 4.2.1 Identifying the Cuts A simple algorithm to ¯nd a single cut of a critical graph consists of iteratively selecting a node of the graph and eliminating all its ancestors and descendants. The algorithm continues until no more nodes are left in the graph. In the worst-case, ¯nding all the cuts is exponential with respect to the number of nodes of the graph. Figure 4.3 depicts the two extreme hypothetical cases for the critical graph(s) of a DFG. Here each node represents a data reference. In part (a) all the n data references are accessed independently and as a result each one creates a cut for the graph. Finding the set of cuts takes O(n), where n is the number of nodes corresponding to data elements. In part (b), however, each cut requires one data element from each path. This leads to an O(n p ) combinations, where p is the number of critical paths. 51 Source Sink 1 P 1 2 n-1 n 3 Source Sink 1 P 1 2 n-1 n 3 1 C 2 C 3 C 1 - n C n C Source Sink 1 2 n-1 n 3 1 P 2 P 3 P p P Source Sink 1 P 2 P 3 P p P 1 2 n-1 n 3 1 2 n-1 n 3 1 2 n-1 n 3 1 2 n-1 n 3 1 2 n-1 n 3 1 2 n-1 n 3 1 2 n-1 n 3 1 C 2 C 3 C 4 C (a) (b) Figure4.3: Examplesofextremecasesofcriticalpaths. Part(a)hasonlyonecriticalpath and therefore n cuts C 1 ;:::;C n . Part (b) has p critical paths and n p cuts, from which C 1 ;:::;C 4 are shown. 
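The single-cut procedure of section 4.2.1 can be sketched directly over a reachability matrix. The following C fragment is only our own toy illustration (the graph representation, node numbering, and helper names are invented, not the dissertation's implementation): it repeatedly picks a remaining memory-reference node, adds it to the cut, and discards its ancestors and descendants.

#include <stdio.h>

#define MAXN 16

static int reach[MAXN][MAXN];

/* Transitive closure of the critical graph's adjacency matrix. */
static void closure(int n, const int adj[MAXN][MAXN]) {
    for (int u = 0; u < n; u++)
        for (int v = 0; v < n; v++)
            reach[u][v] = adj[u][v];
    for (int w = 0; w < n; w++)
        for (int u = 0; u < n; u++)
            for (int v = 0; v < n; v++)
                if (reach[u][w] && reach[w][v]) reach[u][v] = 1;
}

/* One cut: pick a live memory-reference node, add it to the cut, and drop
   all of its ancestors and descendants from further consideration. */
static int one_cut(int n, const int is_ref[], int cut[]) {
    int live[MAXN], m = 0;
    for (int v = 0; v < n; v++) live[v] = is_ref[v];
    for (int v = 0; v < n; v++) {
        if (!live[v]) continue;
        cut[m++] = v;
        for (int u = 0; u < n; u++)
            if (u != v && live[u] && (reach[u][v] || reach[v][u]))
                live[u] = 0;
        live[v] = 0;
    }
    return m;
}

int main(void) {
    /* Tiny example: reference nodes 0 and 1 feed operation 2; reference 3 and
       operation 2 both feed the final operation 4. */
    int adj[MAXN][MAXN] = {0};
    adj[0][2] = adj[1][2] = adj[2][4] = adj[3][4] = 1;
    int is_ref[MAXN] = {1, 1, 0, 1, 0};
    int cut[MAXN];
    closure(5, adj);
    int m = one_cut(5, is_ref, cut);
    printf("cut of size %d:", m);
    for (int i = 0; i < m; i++) printf(" %d", cut[i]);
    printf("\n");                  /* here: the three reference nodes 0, 1, 3 */
    return 0;
}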
In practice, and in the context of loop-unrolling, the execution paths are very similar and parallel. As a result, in these cases the structure of the critical graph and its cuts mostly resembles the ¯gure in part (b). 4.3 E®ects of Bandwidth and Scheduling Theabstractionofcutsidenti¯esthesetofreferencesthatweneedtomaptofasterstorage in order to improve the access delays. As such, examining the storage requirements for references of a given cut is a good indicator for the amount of required storage. This notion of cuts, however, does not provide any information about the concurrency or the dependency of the data accesses. As a result, we need to consider the issues of bandwidth and scheduling as complimentary factors to DFGs and cuts. 52 We de¯ne the term bandwidth (BW) as the number of concurrent data accesses. In particular the memory bandwidth refers to the number of possible concurrent accesses to an internal/external memory. In addition the required bandwidth represents the maximum number of data accesses that occur simultaneously. For example a RAM block with two 32-bit ports and storing 16-bit data has a memory bandwidth of 4 (64 bits), while a computation with at most six concurrent 16-bit data accesses has a required bandwidth of 6 (96 bits). Here we discuss the implications of bandwidth in assigning storage to the elements of a cut in order to reduce the execution time. 4.3.1 Unlimited Memory Bandwidth Inthiscaseallthedataelementsofacutcanbeaccessedinparallelandthereisnoconcern about the amount of bandwidth. For the purpose of improving the execution time of a computation, the references of a cut need to be considered inclusively. Improving the access time corresponding to only a subset of references will not reduce the execution time of the critical graph and therefore of the computation. For example in ¯gure 4.2(a), the data associated with references B[k][j], A[i][k], and B[k][j +1] should all have the same latency. Improving the access time for only two of these references will not result in any overall improvement as one multiplication operation would stall for the third reference's data. 4.3.2 Limited Memory Bandwidth Even though it is desirable to map all the elements of a cut to faster storage and access them concurrently, sometimes this is not possible due to a limited available bandwidth. 53 This is the more realistic case in which the concurrency and dependency of the operations matter. As various cuts have di®erent bandwidth requirements and scheduling character- istics, their e®ects on the execution time might vary substantially. Here we describe some observations that help us identify the better cuts for storage allocation. Observation1: Thereferencesinacutmayormaynotbeaccessedatthesametime depending on the details of scheduling. As a result these references may or may not be able to share the bandwidth of their corresponding storage locations. This issue directly in°uences the overall execution time of the computation. As an example if we select cut number 2 in ¯gure 4.2(c), references B[k][j], A[i][k], andC[i][j+1]needtobemappedtothesametypeofstorage. Butweonlyneedtoaccess at most two of these elements simultaneously. Reference C[i][j] can share the bandwidth with B[k][j] and A[i][k] as it is only accessed at a later time due to the computation dependencies. Ultimately all the paths of the CG get shorter, while using at most two parallel accesses. 
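The bandwidth bookkeeping used in the observations above is elementary. A small sketch (our own helper names, not part of the dissertation) reproduces the worked example of a RAM block with two 32-bit ports holding 16-bit data:

#include <stdio.h>

/* Memory bandwidth: number of data elements deliverable concurrently. */
static int memory_bandwidth(int ports, int port_bits, int elem_bits) {
    return (ports * port_bits) / elem_bits;
}

int main(void) {
    int available = memory_bandwidth(2, 32, 16);  /* two 32-bit ports, 16-bit data: 4 */
    int required  = 6;                            /* at most six concurrent 16-bit accesses */
    printf("memory bandwidth = %d, required bandwidth = %d\n", available, required);
    printf("%s\n", required <= available ? "no stalls needed" : "accesses must be serialized");
    return 0;
}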
Observation 2: In case of limited bandwidth, assigning only a subset of references to faster storage might still result in an overall gain in data access time. This creates an exception to the policy described in 4.3.1, since the references have to share the same bandwidth.

As an example, if we select cut number 1 in figure 4.2(c), references B[k][j], A[i][k], and B[k][j+1] need to be accessed concurrently. If the bandwidth of the external memory is insufficient, it still helps to bring the data corresponding to a subset of these references into local storage: accessing a smaller number of data references from the external memory reduces the maximum access time for this cut.

4.4 Desired Latency and Delivery Time

So far we have talked about reducing the critical path(s) of a computation in order to improve the overall execution time. Here we explain that the latency of the critical path does not necessarily need to be minimized.

Observation 3: The amount of desired reduction for a critical path should be limited to the difference between the critical path(s) and the second longest path(s) in the DFG. Reducing the critical paths' latency beyond the second longest path would not yield any further reduction of the overall execution time of the computation. In other words, if P_1 and P_2 are the first and second longest paths of the computation and LP_1 and LP_2 represent their latencies, the critical path should be shortened by at most δ = (LP_1 - LP_2). After this point P_2 becomes the new critical path, and further reducing P_1 is ineffective. This fact is illustrated in figure 4.4(a).

As we wish to reduce the CG by δ, we need to reduce the latencies of the data references of the critical path(s). We call this new, reduced latency for each data reference r its Desired Latency (DL_r), and we use this value to guide the mapping decision for that reference. In particular, each reference r should be mapped to a storage structure M that can accommodate DL_r. Whether it can depends on the data that is already mapped to M, on the schedule, which dictates how many references are accessed simultaneously from M, and on the amount of memory bandwidth that M can provide.

To investigate whether a storage module can accommodate the DL of a reference, we introduce the notion of Delivery Time (DT), used in our allocation algorithm. The delivery time of a given memory module is the time it takes to acquire a data item from that memory. In the presence of multiple read requests to memory M, the delivery time DT_M(k) of the k-th concurrent data item is given by the two expressions below for the non-pipelined and pipelined access modes. In these expressions Lat_M denotes the latency of accessing the first data item in M, while II_M is the pipelining initiation interval for subsequent accesses. The value of k is determined by how the computation is scheduled.

DT_M^nopipe(k) = ⌈k / BW_M⌉ × Lat_M        (4.1)

DT_M^pipe(k) = Lat_M + ⌈(k - 1) / BW_M⌉ × II_M        (4.2)

We note that for two memories M_1 and M_2, BW_{M_1} = BW_{M_2} does not necessarily imply DT_{M_1} = DT_{M_2}, since the delivery time also depends on which elements are mapped to and accessed from each memory.

4.5 Threshold Metric

In section 4.1 we defined the critical path as the longest path in the execution. However, a computation might include multiple execution paths that are very close in length.

Observation 4: Picking only the longest path as the critical path could make the value of (LP_1 - LP_2) very small.
As a result, investing the storage resources at each stage 56 1 P 2 P ) LP (LP 2 - 1 G time (a) ` 1 GLP ) LP (LP 2 - 1 G 1 P 2 P 3 P 4 P 5 P 6 P time ` 2 GLP ) LP (LP 2 - 1 G 1 T 1 P 2 P 3 P time (b) (c) ) LP (LP 2 - 1 G 2 T 1 P 2 P 3 P ` 1 GLP time ) LP (LP 2 - 1 G 2 T 1 P 2 P 3 P ` 1 GLP time (d) (e) Figure 4.4: E®ects of the Threshold selection on the amount of improvement. leads to only very small amounts of improvement in the critical graph, an approach that does not e®ectively utilize the resources. To address this issue we introduce the metric of Threshold. We use this metric to bundle together all the paths that are very close in their execution length (¯gure 4.4(b)). This prevents the waste of time and resources for achieving only very small improvements in the execution time. So instead of having only a single longest path as P 1 , there will be a group of longest paths called GLP 1 such that: 57 8path p:(length(P 1 )¡length(p)·Threshold)=>p2GLP 1 (4.3) AsthevalueofThresholddeterminesallthepathsthatbelongtoGLP 1 ,itconsequently identi¯esthenewsecondlongestpathP 2 andthereforethevalueofLP 1 ¡LP 2 astheideal amount of improvement. In¯gure4.4(b),P 4 becomesthenewsecondlongestpathaftergroupingthepathsbased on their closeness. As a result the amount of improvement ± = length(P 1 )¡length(P 4 ) should be applied to all the paths in GLP 1 , namely P 1 , P 2 , and P 3 . Since the amount of improvement plays an important role in our allocation strategy, the value of Threshold must be selected with great care. For example in ¯gures 4.4(c) and (e), even though the values for thresholds T 1 and T 2 are very close, they do result in very di®erent amounts of improvement for the same set of execution paths. Namely, T 2 requires a much more aggressive reduction of the paths in GLP 1 . One should also notice that the value of Threshold can vary in di®erent cases as it directly depends on the characteristics of the computation. Comparing the cases in 4.4(d) and (e), we observe that with identical sets for GLP 1 and identical values for T 2 the amounts of ideal improvement are very di®erent. This is clearly due to the properties of the rest of the paths in the computation, in this case P 3 . To make the selection process of the value of Threshold as independent as possible, we consider our ultimate objective of reducing the execution time while consuming the minimal space. The best execution time is resulted from accessing all the elements of the cut in parallel from local storage (block RAMs). This results in a very low latency without using too much design space. Considering this as the desired latency along with 58 thede¯nitionofThresholdwereachatthefollowingexpressionsthatassistusincomputing an e®ective value for Threshold: 8 > > > > > > > > > > < > > > > > > > > > > : DL·(Lat RAM +II RAM ) (1) Threshold·(LP 1 ¡LP 2 ) (2) improvement=± =(LP 1 ¡LP 2 ) (3) improvement=OriginalLatency¡DesiredLatency (4) Here II RAM is the initiation interval of the block RAM, where II RAM = 0 for non- pipelined accesses. Also the value of OriginalDelay is the access time to the external memory considering that everything is originally mapped to external memory. ² Equation (1) enforces parallel accesses to the RAMs. This is the best case where the desired latency can only be accommodated by a direct access to the RAM. ² Equation (2) is a direct result of equation 4.3. In other words if Threshold>(LP 1 ¡LP 2 ) the second longest path (P 2 ) has to be part of the GLP 1 . 
This contradicts the de¯nition of the second longest path. ² Equation (3) represents the amount of improvement, desired for the critical path(s), as the value of ±. ² Equation (4) represents the amount of improvement, desired for the critical path(s), asthe di®erencebetweentheoriginal latency(when thedatais in external memory) and the desired latency that we would try to achieve. 59 Combining these expressions results in: Threshold·OriginalDelay¡(Lat RAM +II RAM ) (4.4) This value for Threshold results in the best possible case as it encompasses parallel accesses to the RAMs. If using this value of Threshold results in a feasible allocation, the desired latency and therefore the execution time have the best values. If however thereisnofeasibleallocation, werelaxtheThresholdbyone II RAM ateachiteration(one Lat RAM for non-pipelined cases). This allows the references to have desired latencies that canbesatis¯edbysharingsomebandwidthandthereforesequentialaccesses. Thisresults in longer acceptable latencies and therefore increases the possibility for ¯nding a feasible allocation. 4.6 Execution Model Todetermineifthedeliverytimeofastorage M satis¯esthedesiredlatencyofareference we need to identify the simultaneous accesses to M (value of k in equations 4.1 and 4.2). This requires a full knowledge of scheduling which is not available at the high level of our analyses. As such, we consider an execution model in which all independent memory read operations are scheduled concurrently at the beginning of the computation's execution and all the independent memory write operations are scheduled as-soon-as-possible at a later time. We use the simple as-soon-as-possible memory access scheduling strategy, as supported by current synthesis tools such as Monet TM [55], to allow the algorithm to determinethenumberofconcurrentmemoryaccesses. Ourownexperiencewithbehavioral synthesis tools validates this approach. 60 Threshold Improvement Current Mapping DFG CMA Algorithm 2 1 GLP and GLP (r : reference in a cut) CG Cuts DLr M DT ) ( G System Characteristics (Memory delays, Pipelining info, etc) Original Application Code Execution Model Input Functions Figure 4.5: Preliminary °owchart of the allocation algorithm. 4.7 Putting it all together Hereweexplainhowtheconceptsdiscussedinthischapterworktogethertoformthebasis of our mapping algorithm. Figure 4.5 illustrates the relationship between these various concepts. We will explain the mapping algorithm in full detail in the next chapter and complete this °owchart. In this ¯gure: 1. The study of the application code leads to the extraction of its DFG, CG, and the set of cuts. Here we identify what should be improved. 2. Basedonthecharacteristicsofthetargetsystem(memorydelay,memorybandwidth, pipelining information, etc), we compute the value of Threshold. This value helps identifying the sets of ¯rst and second longest paths of the DFG and therefore 61 the amount of improvement necessary for the CG. Here we specify how much the accesslatenciesshouldbeimproved. Thisvaluesubsequentlydeterminesthedesired latencies for di®erent data accesses of the selected cut. 3. Basedonthecharacteristicsofthetargetsystem,theexistingdatamappedtovarious memories, and the underlying execution model, we identify the best delivery times of di®erent memories. This step shows how fast each memory can provide the data if the data is mapped to it. The mapping algorithm tries to ¯nd a mapping for the data corresponding to the references identi¯ed in step (1). 
It does so by considering the desired latencies computed in step (2), the availability of di®erent memories determined in step (3), as well as the informationprovidedbytheexecutionmodel. Weexplainthisprocessinthenextchapter. 4.8 Chapter Summary In this chapter we analyzed the role of scheduling and critical paths in an e®ective storage allocation. We argued that in order to reduce the overall execution time all the critical paths in a DFG should be reduced. To this e®ect we introduced the notions of a critical graph and its cuts. We de¯ned the concepts of Desired Latency (DL) for a reference, Delivery Time (DL) for a storage, and the inter-relation between the two concepts. At last we de¯ned the metric of Threshold which controls the selection of critical paths and amountofimprovementintheoverallstorageallocation. Inthenextchapterweshowhow the allocation algorithm utilizes these concepts to determine an e®ective data mapping strategy. 62 Chapter 5 Custom Memory Allocation Algorithm In this chapter we describe our custom memory allocation algorithm. This algorithm is based on the combined knowledge of reuse analysis, critical path/scheduling, resource constraints, and capabilities of various allocation strategies. Multiple degrees of freedom including di®erent array references, di®erent compiler transformations, various storage structures, and the execution point at which the data needs to be stored make this algorithm nontrivial. As this problem is in general NP- Complete [31], we focus on a greedy algorithm to ¯nd a feasible data mapping. We ¯rst de¯ne the problem in 5.1 and formulate it in 5.2. In section 5.3 we describe our mapping approach including the data transformations and metrics that it uses. We thenexplain ouralgorithm in5.4, presentanexample illustratingitsoperationin 5.5, and express its limitations in 5.6. Finally we summarize the chapter in 5.7. 5.1 Problem De¯nition The objective of our algorithm is to ¯nd a feasible storage allocation for the computation athandthatminimizesexecutiontimesubjecttotheavailablestorageandbandwidth. In 63 case of multiple identical designs in terms of execution time, we seek a hardware solution thatyieldsthesmallestimplementationareaastypicallysmalldesignsexhibitbetterclock rates and power properties. In approaching and addressing this problem, and considering the discussions in chap- ter 4, we highlight the following key factors: 1. Execution Time: Minimizing this value is the main objective of our algorithm. Ourapproachleveragestheconceptsofschedulingandcriticalpath(s)inidentifying the access latencies that should be improved. Here we would take advantage of the notions of cuts, threshold, and desired latency, developed in chapter 4, to identify which references should be improved and by how much. 2. Available Storage: This is one of the constraints of our allocation problem. We call this a hard constraint as satisfying it is required for a feasible design. In other words if the storage requirements of our application surpasses the available storage there wont be any solutions. We rely on the concepts of reuse analysis and reuse chains explained in chapter 2. Analyzing the data reuse and allocating storage exclusively to the reuse chains of the computation would substantially reduce its storage requirements. 3. Available Bandwidth: This is one of the constraints of our allocation problem. We call this a soft constraint as satisfying it is not required for a feasible design. It, however, is required for achieving a low execution time. 
In other words if the bandwidthrequirementsofourapplicationsurpassestheavailablebandwidthwecan 64 still map the application's data, however, the execution time will be increased due to memory stalls. We will combine the concept of delivery time with a given execution model and resource characteristics to investigate if the available bandwidth can satisfy our requirements. 4. System Characteristics: This is part of the input to the allocation problem and represents the attributes of our resources. Namely the number and types of storage, the capacity and bandwidth of each storage, access latencies associated with each storage type, latencies associated with each operation, the attributes of pipelining (if there exists any), etc. Clearly this information is vital for ¯nding a meaningful solution to our problem. 5. Compiler Transformations: This includes all the available techniques for manip- ulating the data in order to solve the allocation problem. These could potentially change the resource (storage/bandwidth) requirements of the computation to better serve our mapping objectives. In order to select a suitable transformation we will utilizethereuseinformationofdi®erentdatareferencesaswellasthecharacteristics of various storage structures. Considering these points, it becomes evident that our problem has various dimensions that are closely inter-connected. This clearly increases the complexity of the problem. 65 5.2 Problem Formulation In addressing our problem we consider three levels of memory including an external mem- ory, on-chip memory banks and on-chip registers. We consider allocating storage exclu- sively to reuse chains of array variables. Finally we intend to allocate storage to di®erent variables so that their desired latencies be satis¯ed. We de¯ne a binary variable X ij which is set to 1 if the data accessed by an array reference r i is mapped to an internal memory bank j, and is set to 0 otherwise. Likewise we de¯ne Y i to be 1 if the data for r i is mapped to internal registers, and 0 otherwise. For each array reference r i , Size(r i ) de¯nes the required storage as the size of the unique reuse chain corresponding to r i as described in chapter 3. For a memory M, Size(M) represents the size of the memory in terms of number of elements. Lastly, the number of available internal registers is denoted by Size(Reg). ConsideringM memories,Size(Reg)registers,andaDFGwithN arrayreferences,our problemisto¯ndanallocationofreferencestothememoriesandregistersthatminimizes the execution time ET subject to the following constraints: 8 > > > > > > > > > > > > > > < > > > > > > > > > > > > > > : 8j : ( P N i=1 X ij £Size(r i ))·Size(M j ) (1) ( P N i=1 Y i £Size(r i ))·Size(Reg) (2) (9i;j : X ij =1)) Y i =0 V (9i: Y i =1)) (8j : X ij =0) (3) (8i;j : X ij =1)) DT j ·DL i (4) The¯rsttwoconstraintsstatethatthestorageassociatedwiththevariousreusechains must ¯t in the storage they are mapped to, either internal memory banks (1) or registers 66 (2). The third constraint determines that each reference is either mapped to registers or to memory banks but not to both. Finally, the fourth constraint guarantees that if a reference is mapped to a memory bank the bank can provide its desired latency. The desiredlatenciesofreferencesmappedtoregistersareautomaticallysatis¯edastheaccess delay for registers is negligible. Theoverallgoaloftheproblemisto¯ndamongthemappingchoicesthatminimizeET the one(s) that lead to the implementation with minimal design area. 
Registers increase the area by increasing the amount of used logic, number of multiplexors, and interconnect complexity. On the other hand, in memory banks more sophisticated address generation could lead to more complex control and therefore more area. Predicting an implementa- tion's overall design area for ¯ne-grained con¯gurable architectures is an extremely hard problem. We sidestep this issue by assuming a positive correlation between the storage size and design area. As a result our selection criterion for the smallest design is captured by the minimum value for M X j=1 N X i=1 (X ij £Size(r i ))+ N X i=1 (Y i £Size(r i )) . This formulation captures the total amount of required storage in terms of RAM elements and registers. Itisworthnotingthatininvestigatingtheexecutiontimesofvariousallocationswedo not include the cost of bringing the data from external memory to on-chip storage. In our target applications that include loop nests with large loop bounds this cost is amortized over various iterations of the loops. 67 5.3 Data Mapping Approach We mitigate the algorithmic complexity of the storage allocation and mapping problem formulated above using a greedy approach. In this approach initially all data is mapped to external o®-chip memory. In an iterative process the algorithm examines the critical graphofthecomputationonecutatatime. Foreachcutthealgorithmthenselectswhich arrayreferencesshouldbemappedtowhichstorageresourcesusingwhattransformations. This is done by examining the costs/bene¯ts of each mapping choice as well as the target desired latencies. The algorithm terminates when no improvement in performance is possible given the initial available resources. We now explain in detail the various possibilities, observations, and considerations that form the basis of our approach and subsequently our allocation algorithm. 5.3.1 Mapping Transformations Inordertoincreasethedataavailabilityfortheelementsofacomputationweutilizethree compiler transformations namely data distribution, data replication, and scalar replace- ment. Even though these transformations all increase the availability of data, there are di®erences that make each one of them appealing in di®erent cases. Data distribution: This transformation partitions the array's data into disjoint (non-overlapping)datasetsandmapsthemtodistinctmemorymodules. Arrayreferences are bound to the memory module holding their corresponding data, allowing concurrent memory accesses. Distribution increases the data availability as the number of memory 68 modules increases, while preserving the total storage used for holding the data. This is an idealcaseasthedataavailabilityisincreasedwithoutanyincreaseinstoragerequirements. In¯gure4.1, referencesC[i][j]andC[i][j+1]accessnonoverlappingsectionsofthear- ray,namelyoddandevenelementswithrespecttothej index. Wecandistributethedata forarrayC betweentwodi®erentmemorymodulesbindingeachreferencetoonememory, thus allowing concurrent accesses to the corresponding data elements (¯gure 5.1(a)). Datareplication: Thistransformationincreasestheavailabilityofthedatabycreat- ingcopiesofthedataindistinctmemorymodules,therebyallowingconcurrentaccessesto the data. Though costly in terms of storage, it can be used pro¯tably when the replicated data is small and frequently accessed. Consistency is an issue in the presence of read and write operations. 
In ¯gure 4.1, we can replicate the B array among 4 memories and concurrently access the data referenced by B[k][j];B[k][j+1], B[k][j+2] and B[k][j+3] (¯gure 5.1(b)). Thistransformationstendstoincreasethestorage,sometimesdramatically,tocreatea higher data availability. Also it is only suitable for cases when data cannot be distributed due to the overlap between the data accessed by di®erent references. Scalarreplacement: Mainlysuitableforreusabledata,thistechniqueconvertsarray references to scalar variables and then maps them to local storage. As a result it elimi- nates memory operations and increases the internal data availability. The ¯rst access to each scalar replaced data item (typically a read operation) requires an external memory operation, but subsequent accesses can use the data cached in the local storage. Because the number of storage elements required for capturing the data reuse of a given array reference is tied to the loop bounds, this transformation may be very expensive in terms 69 (c) Scalar replacement for array B C[0][8] C[4][8] C[0][0] C[4][0] C[i][j] C[0][9] C[4][9] C[0][1] C[4][1] C[i][j+1] B[19][0] B[0][0] B[0][9] B[19][9] B[19][0] B[0][0] B[0][9] B[19][9] B[19][0] B[19][9] B[0][0] B[0][9] Odd indices of j Even indices of j B[k][j+2] B[k][j+3] B[k][j] B[k][j+1] B[0][0] B[19][9] R_199 R_180 Odd indices of j Even indices of j B[k][j+2] B[k][j+1] B[k][j] B[k][j+3] B[19][0] B[0][0] B[0][8] B[19][8] B[k][j] B[k][j+1] B[k][j+2] B[k][j+3] considering the scheduling (d) Distribution for array B (b) Replication for array B B[19][0] B[19][9] B[0][0] B[0][9] (a) Distribution for array C External Memory i=0 R_0 R_19 i=1,...,4 B[19][1] B[19][9] B[0][1] B[0][9] Figure 5.1: Various transformations applied to example code in ¯gure 4.1. 70 ofareaandsometimeseveninfeasible. Forexamplein¯gure5.1(c), array B wouldrequire b k £b j =200 registers tocachethe data accessed in the¯rst iteration of the i to be reused in the remaining iterations of the same loop. 5.3.2 Mapping Observations Clearly, applying these mapping transformations indiscriminately for all the data refer- encesmayleadtocapacityissues. Toaddressthisissuewecombinethesethreetechniques with the information regarding the data reuse of di®erent references. Making an e®ective selection of these possibilities relies on the following key observations: ² Observation 1: The desired amount of reduction for a critical path, ±, is bound by the di®erence between the lengths of the critical path and the second longest path in the DFG as described in section 4.4. ² Observation 2: Array references that exclusively exhibit self-reuse access either the exact same data or disjoint data items. In this case each reference belongs to only onereusechainandeachreusechainonlyincludesonereference. Asaresultthedata can be partitioned to disjoint sets and mapped independently. Here distribution is an ideal transformation in which the data for the array can be placed in di®erent memory modules. As an alternative the array's data can be scalar replaced in registers, o®ering the same amount of bandwidth and consuming the same amount of storage as data distribution. Hence, in order to increase the availability of data the only viable transformations are data distribution and scalar replacement. 71 ² Observation 3: For an array whose references only exhibit group-reuse part of the data is accessed by multiple references. As a result of this overlap it is not possible to partition and distribute the array's data for di®erent references. 
5.3.2 Mapping Observations

Clearly, applying these mapping transformations indiscriminately to all the data references may lead to capacity issues. To address this issue we combine the three techniques with information regarding the data reuse of the different references. Making an effective selection among these possibilities relies on the following key observations:

- Observation 1: The desired amount of reduction for a critical path, δ, is bounded by the difference between the lengths of the critical path and the second longest path in the DFG, as described in section 4.4.

- Observation 2: Array references that exclusively exhibit self-reuse access either the exact same data or disjoint data items. In this case each reference belongs to only one reuse chain and each reuse chain only includes one reference. As a result the data can be partitioned into disjoint sets and mapped independently. Here distribution is an ideal transformation in which the data for the array can be placed in different memory modules. As an alternative the array's data can be scalar replaced in registers, offering the same amount of bandwidth and consuming the same amount of storage as data distribution. Hence, in order to increase the availability of data, the only viable transformations are data distribution and scalar replacement.

- Observation 3: For an array whose references only exhibit group-reuse, part of the data is accessed by multiple references. As a result of this overlap it is not possible to partition and distribute the array's data for the different references. In order to increase the data availability, however, we can replicate the data of the reuse chain. Even though replication has a large effect on storage requirements, as we only replicate the data of the reuse chain (the reusable data) the storage requirements are actually relatively small. As an alternative the array's data can be scalar replaced in registers, offering the same data availability and consuming less storage than data replication. Hence, in order to increase the available data, the only viable transformations are data replication and scalar replacement.

5.3.3 Mapping Strategies

We now lay out the rationale our algorithm uses in selecting a suitable transformation and storage type. We emphasize that the objective is to find the fastest design that consumes the smallest area.

5.3.3.1 Selecting a Storage Type

Combining observation 1 with the goal of achieving the smallest design area, for each cut of the critical graph the algorithm attempts to achieve the maximum reduction δ by exclusively using the available RAM blocks. The algorithm utilizes registers only if the RAM blocks are incapable of providing the desired reduction due to insufficient bandwidth. So if at any point the memory banks cannot satisfy the requirements of all the elements of a cut, the algorithm uses registers by applying scalar replacement. The rationale for this cautious approach is derived from two factors:

1. Using registers substantially increases the design area, as it results in using more logic, complex interconnects, and extra multiplexors. This can subsequently have many adverse effects on the clock rate.

2. Mapping to registers can result in a very aggressive reduction of the critical path(s) that simply might not be necessary. As a result it could leave too few storage elements to devote to later critical paths, without having any effect on the execution time.

5.3.3.2 Selecting a Data Transformation

The choice of a suitable transformation is primarily based on observations 2 and 3 in 5.3.2. Based simply on the reuse types and desired latencies of the various references, the algorithm selects a proper transformation.

In this process, however, a subtle complication arises due to the greedy nature of the mapping algorithm. In practice, references that exhibit group-reuse may belong to different cuts of the critical graph. This leads to the possibility that in different iterations the algorithm might try to apply two distinct data transformations to the reuse chain that the two references are part of. Rather than attempting to backtrack, and thus substantially complicating the implementation and complexity of the mapping algorithm, our greedy approach reconciles the current mapping of the reuse chain. The update to the chain corresponding to a reference r of an array A is as follows:

- Scalar Replaced: If the reuse chain associated with A has already been scalar replaced, then the new reference r can use the data already mapped to registers. As there will not be any bandwidth limitations, no update to the mapping is required.

- Distributed: If the data associated with array A is distributed and reference r's data is a subset of an existing partition, then r is bound to that partition if the partition's bandwidth allows. Otherwise, the algorithm needs to create a new copy of the reuse chain's data and find a suitable mapping for the new partition.

- Replicated: If the memory bank holding the current copy of the reuse chain has enough bandwidth to accommodate the desired latency of r (DL_r), then r can use the existing copy at that memory bank.
Otherwise, the algorithm creates a new copy of the reuse chain and finds a suitable mapping for the new copy.

5.3.4 Mapping Metrics

As there might be multiple cuts associated with the critical graph of a given DFG, the algorithm must choose which cut to focus on at each step of its greedy approach. Furthermore, in cases where the algorithm is forced to use registers, it must assign them to the most profitable data references. We now describe the metrics used in our algorithm for selecting the "best" cut as well as selecting the "best" references of a cut for mapping to registers.

1. Cost Factor: This metric is used in selecting the most valuable cut in a set of cuts. As described, removing the access latencies of the references of any cut decreases the critical path by one memory access time. As a result all cuts are identical in terms of their benefit. To select the least expensive cut we define the metric Cost as

Cost = Storage x NumRefs

where NumRefs denotes the number of data references in the cut and Storage is the number of storage locations required to capture the data reuse associated with these references (explained in chapter 3). The algorithm selects the cut with the lowest Cost value.

2. Gain Factor: When it is necessary to use registers in order to satisfy the desired latency for the references of a selected cut, the algorithm must choose the references that use this important resource most efficiently. As such we define the Gain metric as

Gain = NumAccesses / Storage

where NumAccesses denotes the number of accesses corresponding to a reference r and Storage is the number of storage locations required to capture the reuse of r. From a set of references, the algorithm picks the one with the highest Gain factor to map to registers.
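Both metrics are straightforward to evaluate once per-cut storage estimates are available. The following is a minimal C sketch of how a cut and a register candidate might be selected with them; the struct fields and the sample values are illustrative assumptions and not the compiler's internal representation.

#include <stdio.h>

/* Assumed, simplified summaries of cuts and of individual references. */
typedef struct { int storage; int num_refs; } Cut;
typedef struct { int num_accesses; int storage; } Ref;

/* Cost = Storage x NumRefs: pick the cut with the lowest cost. */
static int select_best_cut(const Cut *cuts, int n) {
    int best = 0;
    for (int c = 1; c < n; c++)
        if (cuts[c].storage * cuts[c].num_refs <
            cuts[best].storage * cuts[best].num_refs)
            best = c;
    return best;
}

/* Gain = NumAccesses / Storage: pick the reference with the highest gain
   when registers have to be used.                                        */
static int select_best_register_ref(const Ref *refs, int n) {
    int best = 0;
    for (int r = 1; r < n; r++)
        if ((double)refs[r].num_accesses / refs[r].storage >
            (double)refs[best].num_accesses / refs[best].storage)
            best = r;
    return best;
}

int main(void) {
    Cut cuts[] = { {200, 2}, {20, 1}, {100, 3} };    /* assumed values */
    Ref refs[] = { {100, 20}, {400, 200}, {50, 1} };
    printf("best cut = %d, best register ref = %d\n",
           select_best_cut(cuts, 3), select_best_register_ref(refs, 3));
    return 0;
}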
Figure 5.2: Complete flowchart of the allocation algorithm.

5.3.5 Putting It All Together

We begin by outlining the mapping algorithm at a higher level, as illustrated in figure 5.2. The algorithm takes as inputs the computation's code, the system characteristics, the execution model, as well as the set of viable data transformations. As the output it generates an assignment of array references to the various storage structures such that the execution time of the computation is minimal.

Our approach is a greedy process that at each stage reduces the critical paths of the computation. Given the DFG, at each iteration the algorithm extracts the critical graph and its cuts. It then selects the best cut based on the Cost metric and tries to map the references of this cut to the RAMs and registers. This assignment changes the length of the execution paths and therefore the shape of the critical graph at each stage. As a result the algorithm needs to recompute the critical graph and its cuts at each iteration.

In mapping each reference of the best cut, the algorithm needs to assess the amount of required storage and bandwidth. In terms of storage, and as the algorithm maps the reuse chains and not the individual references, an analysis of the reuse and therefore the identification of the reuse chains is necessary. In terms of bandwidth, it is sufficient to find a mapping such that DT_M ≤ DL_r for each reference r mapped to memory M. Here the value of DT_M depends on the current references mapped to M, as well as the execution model and the system characteristics. The value of DL_r is dictated by the amount of desired improvement δ, which is itself a result of the lengths of the execution paths as well as the value of Threshold. Here we assume that the delivery times at each stage are independent of future data mappings. Therefore the mapping choice of each data reference will not be influenced by later iterations of the algorithm. In other words, reducing the critical paths at each stage does not have a negative effect on the already improved execution paths.

Regarding the types of storage, the algorithm first tries to map all the chains to the available RAMs. In doing so, and in order to meet the value of DL_r, there might be a need for replication/distribution based on the reuse type of each chain. If, however, the delivery times of the RAMs cannot accommodate the desired latencies of the references, the algorithm utilizes the available registers as well. This order guarantees the requirement for the smallest design area.

The algorithm terminates when all the references are mapped or when no additional improvement is possible due to resource (storage and bandwidth) limitations.

5.4 Algorithm Description

We now describe our Custom Memory Allocation (CMA) algorithm, depicted in figure 5.3. As this is a complex algorithm, we first describe the main allocation algorithm and then explain the different functions contributing to it.

Initially all the arrays are mapped to off-chip memory and all internal RAM banks and registers are available. In each iteration, CMA calculates the value of Threshold based on the discussion in 4.5 (line 5). This step specifies which execution paths should be included in the critical graph. Basically, any path whose execution time is within the neighborhood of Threshold from the longest path is also considered a critical path (lines 6 - 8).

After extracting the critical paths and forming the critical graph, the algorithm determines the value of improvement, δ, as the difference between the first and second longest paths of the computation. Also in this step the set of cuts for the critical graph is identified (lines 9 - 10). To detect the most profitable cuts, the set of cuts is sorted based on the value of Cost for each cut. This step is performed by the FindCostFactors function, described later in 5.4.1 (lines 12 - 13).

After selecting the best cut, the algorithm attempts to map the reuse chains corresponding to the references of the selected best cut to the actual RAMs and/or registers (line 17).

Inputs: Application Code, System Characteristics, Data Transformations, Execution Model.
Output: Assignment of references to storage.
Mapping Algorithm {
01: FindReuseChains(Refs(DFG), LoopNest);
02: ComputeAvailableStorage();
03: MappedRefs = ∅; MappingPossible = TRUE;
04: while ((AvailableStorage > 0) && (MappedRefs ≤ N) && MappingPossible) do
05:   Threshold = EXTDELAY - (RAMDELAY + INITINTER);
06:   CG = FindCriticalGraph(DFG);
07:   LP1 = FindLongestPath(DFG);
08:   LP2 = FindSecondLongestPath(DFG);
09:   δ = LP1 - LP2;
10:   cuts = FindCuts(CG);
11:   while (δ > 0 && Threshold ≥ 0) do
12:     FindCostFactors(cuts);
13:     cuts = SortCutsBasedOnCost(cuts);
14:     while (cuts ≠ ∅) do
15:       bestCut = SelectBestCut(cuts);
16:       cuts = cuts - {bestCut};
17:       map = FindMapping(bestCut);
18:       if (map = TRUE) then
19:         UpdateDFGDelays(DFG, bestCut, map);
20:         UpdateAvailableStorage(map);
21:         UpdateMappedRefs(map);
22:         break;
23:       endif;
24:     endwhile
25:     if (!map && Threshold > 0) {
26:       Threshold -= 1;
27:       Update CG, LP1, LP2, δ, cuts;
28:     } else if (!map && Threshold == 0)
29:       δ = δ - 1;
30:     else
31:       break;
32:     endif;
33:   end while
34:   if (map = FALSE) then
35:     MappingPossible = FALSE;
36: end while
}

Figure 5.3: Custom Memory Allocation (CMA) Algorithm.

The objective here is to find a mapping that satisfies the value of δ. This step is captured by the function FindMapping, explained later in section 5.4.2. If a suitable mapping is found, the algorithm revises the DFG by assigning the new access delays to the references of the selected cut. It also updates the amount of available storage as well as the set of mapped references (lines 18 - 23).

If no mapping is found for the best cut, the algorithm selects the next best cut. A new cut means a new set/subset of references, therefore a new set of requirements and a new set of conflicts with the data already mapped to the different storage structures. As a result there is a good chance that the algorithm can find a mapping for the references of the new cut and thereby reduce the critical path (embedded in line 14). If the CMA finds a desired mapping for one of the cuts, the first iteration is over. If there is still storage available and there is still a set of unmapped references in the DFG, the algorithm attempts to further reduce the execution time by repeating the above process for a new critical graph and set of cuts.

If, however, CMA does not find a mapping for any of the cuts that satisfies the maximum possible improvement, δ, the algorithm reduces the value of Threshold and tries to find a mapping that satisfies this new latency (lines 25 - 27). It is worth noting that the value of δ = LP1 - LP2 is dependent on the value of Threshold. If the original selection of Threshold is too aggressive, the value of δ might become too large (figure 4.4(e)). As a result the algorithm would not be able to find a mapping that satisfies this large improvement. By reducing the value of Threshold the value of δ can become smaller (figure 4.4(c)), and the CMA might then be able to find a mapping that satisfies this smaller improvement.

If the Threshold is reduced to 0 and there is still no possible mapping, it simply means that the resources are not adequate to reduce LP1 past LP2 of the computation. As a result the algorithm opts for reducing the longest path(s) by reducing the value of δ and repeating the mapping attempt (lines 28 - 29).

In subsequent iterations of the algorithm, the critical path is updated to reflect the newly allocated storage and a new set of cuts is identified. The algorithm proceeds until either the storage resources are fully committed, the data references are all mapped, or no feasible allocation is possible (embedded in line 4).
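To make the latency bookkeeping of lines 5-9 and 25-29 concrete, the following is a hedged C sketch of how the threshold and the improvement δ interact; the delay constants mirror the illustration of section 5.5, the find_mapping stub is an assumption standing in for FindMapping, and in the real algorithm reducing the threshold also triggers a recomputation of the critical graph and of δ (line 27), which this sketch omits.

#include <stdio.h>

#define EXTDELAY  10   /* assumed external memory latency (cycles) */
#define RAMDELAY   3   /* assumed on-chip RAM latency (cycles)     */
#define INITINTER  1   /* assumed pipelining initiation interval   */

/* Stub for FindMapping: succeeds only if the desired latency is not
   smaller than what the RAM banks can deliver.                      */
static int find_mapping(int desired_latency) {
    return desired_latency >= RAMDELAY;
}

int main(void) {
    int lp1 = 27, lp2 = 22;                               /* assumed paths */
    int threshold = EXTDELAY - (RAMDELAY + INITINTER);    /* line 5        */
    int delta = lp1 - lp2;                                /* line 9        */

    while (delta > 0 && threshold >= 0) {                 /* line 11       */
        int dl = EXTDELAY - delta;     /* desired latency for the cut      */
        if (find_mapping(dl)) {
            printf("mapped cut with desired latency %d\n", dl);
            break;
        }
        if (threshold > 0)      /* lines 25-27: relax the threshold        */
            threshold -= 1;
        else                    /* lines 28-29: settle for a smaller delta */
            delta -= 1;
    }
    return 0;
}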
5.4.1 Best Cut Selection

The computation for selecting the best cut is captured by the function FindCostFactors depicted in figure 5.4. This function takes as input the list of cuts of the current critical graph and calculates the value of the metric Cost for each of the cuts. For each cut it examines each reference r and determines whether the exact reference has already been mapped. If so, the function checks whether the corresponding mapping can deliver the data in time, meaning that it satisfies DL_r. If that is not the case, clearly the current mapping cannot satisfy the latency for r and its corresponding cut. As a result a default large value is assigned to the cut, making it ineligible for selection as a best cut (lines 5 - 7). If, however, r itself is not mapped but Chain(r) has been mapped before, the function attempts to use the same data for reference r if the corresponding mapping can deliver the data in time.

Inputs: List of cuts, Amount of improvement δ.
Output: Cost metric for each cut.
FindCostFactors(cuts, δ) {
01: for (each cut C ∈ cuts) do
02:   for (each reference r ∈ C) do
03:     DL_r = ExternalDelay - δ
04:     if (mapped(r, M) = TRUE) then
05:       if (DL_r < DT_M) then
06:         storage = numrefs = ∞;
07:         break;
08:     else
09:       if (mapped(Chain(r), M) = TRUE) then
10:         if (r can use the data mapped to M) then
11:           BindRef(r, M);
12:         else
13:           storage += CalStorage(Chain(r));
14:           numrefs += 1;
15:         endif
16:       endif
17:   end for
18:   Storage_cut = storage; NumRefs_cut = numrefs;
19:   Cost_cut = Storage_cut x NumRefs_cut
20: end for
}

Figure 5.4: Cost assessment algorithm.

If the delivery time is met, r is bound to the mapped data using the auxiliary function BindRef. In this case there is no additional cost generated by satisfying the mapping of r (lines 10 - 11). Should the reference not be able to use the previous mapping, the function adds the cost of Chain(r) to the cut. This cost includes the required storage for Chain(r) as well as an additional reference that needs to be mapped (lines 12 - 14).

At the end the value of Cost is calculated for each of the cuts (line 19). This information is used by the CMA algorithm to identify the most beneficial cut.

5.4.2 Mapping Overview

The process of finding a data mapping that satisfies the value of δ is captured by the function FindMapping depicted in figure 5.5. This function takes as input the references of a selected cut and their desired latencies. If possible, it outputs a mapping of the references to the available RAMs and registers so as to satisfy the desired latency for each reference.

If the desired latency for the references is less than the delivery time of each RAM, these references have to be scalar replaced in registers. The mapping fails if there are not enough registers (lines 3 - 8). If, however, the desired latency is not that small, the algorithm tries to map all the references to RAMs. This is a desirable scenario, as a successful mapping of all the data to RAM blocks results in the memory latency of the cut being equal to DL. In this case the consumed area is minimal, since no registers are introduced for storing the arrays (line 11). This step is performed by the MapInRAMs function described in 5.4.3.

If it is not possible for the algorithm to accommodate the references of the best cut with the available RAMs, the algorithm uses a combination of the available registers and memory modules to reduce the memory latency of the cut.
In this alternative scenario, the algorithm employs the Gain factor to map the most demanding data reference to registers and maps the remaining references to the available RAMs (lines 13 - 14).

Inputs: A selected cut, Desired latencies of the references.
Output: Mapping of the references to RAMs and registers.
FindMapping(cut, DL) {
01: unmappedList = unmappedReferences(cut);
02: if (DL < RAMDelay) do
03:   for (∀ ref ∈ unmappedList)
04:     map = MapInRegs(ref);
05:     unmappedList = unmappedList - {ref};
06:     if (map = FALSE)
07:       mappingFound = FALSE;
08:   endfor
09: else
10:   while (unmappedList ≠ ∅ and AvailableResources ≠ ∅) do
11:     map1 = MapInRams();
12:     if (map1 = FALSE) then
13:       ref = FindBestReferenceForRegisters(unmappedList);
14:       map2 = MapInRegs(ref);
15:       if (map2 = TRUE) then
16:         unmappedList = unmappedList - {ref};
17:       else
18:         mappingFound = FALSE;
19:       endif
20:   end while
21: endif
}

Figure 5.5: Algorithms to find a mapping in RAMs and registers.

This process continues until a mapping is found or the algorithm exhausts all registers. If a mapping is achieved by a hybrid mapping to RAMs and registers, the access latency of the cut is equal to DL but the area is increased due to the use of registers.

5.4.3 Mapping in RAMs

In mapping the reuse chain corresponding to a reference of a cut to RAMs, the function's algorithm considers both the available storage space and the bandwidth in each memory bank.

As references with self-reuse can be distributed (observation 2), the corresponding reuse chains are treated as separate entities. For the references with group-reuse, however, the data needs to be replicated (observation 3). The number of replicas depends on the desired latency DL as well as the available bandwidth of the RAM banks. Due to this inherent complexity for the case of group-reuse, the MapInRAMs function first attempts to map the references with group-reuse and then maps the chains corresponding to self-reuse, as they can be considered independently. If the reuse chain corresponding to a given reference has been mapped before, the reference currently being considered might be able to make use of the same data if the memory bandwidth allows; otherwise the data needs to be replicated. Under this scenario the new data replica is mapped in another memory bank which has sufficient bandwidth to allow a latency of at most DL.

Internally, MapInRAMs attempts to satisfy the latency requirements of a cut considering the execution model and using a worst-fit strategy in a bin packing setting. Here each reference is an item while each RAM is considered as an available bin. In a worst-fit strategy each reference is mapped to the RAM with the largest available space. Using this technique guarantees that the references are distributed across the available RAMs as much as possible, leading to maximum data concurrency.
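A minimal C sketch of this worst-fit placement is shown below; the bank and chain sizes are illustrative assumptions, and the real pass also checks bandwidth and the execution model before committing a placement.

#include <stdio.h>

#define NUM_BANKS 4

/* Worst-fit placement: each reuse chain (item) goes to the RAM bank (bin)
   with the most remaining capacity, spreading data across banks and thus
   maximizing the opportunity for concurrent accesses.                     */
static int place_worst_fit(int *free_space, int chain_size) {
    int best = -1;
    for (int b = 0; b < NUM_BANKS; b++)
        if (free_space[b] >= chain_size &&
            (best < 0 || free_space[b] > free_space[best]))
            best = b;
    if (best >= 0)
        free_space[best] -= chain_size;
    return best;                           /* -1 means no bank can hold it */
}

int main(void) {
    int free_space[NUM_BANKS] = { 100, 100, 100, 100 };  /* assumed capacity */
    int chains[] = { 100, 20, 100, 10 };                  /* assumed sizes    */
    for (int c = 0; c < 4; c++)
        printf("chain %d -> bank %d\n", c, place_worst_fit(free_space, chains[c]));
    return 0;
}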
5.4.4 Summary and Analysis

To summarize, the CMA algorithm takes as input an un-annotated description of the computation in a high-level programming language. At each iteration the algorithm extracts the references on the critical path and analyzes their access patterns. Based on these analyses it tries to find a mapping with δ > 0 for at least one cut. It removes/improves the memory latencies until it exhausts all available storage or concludes that a mapping is not possible. The net result of each iteration of the algorithm is a reduction of at most one memory latency in the critical path, depending on the mapping decisions in the selected cut. If at any iteration no mapping is found, we conclude that the critical paths of the critical graph cannot be reduced any further and there is no point in spending resources on any of the remaining array references.

This approach of examining and greedily minimizing the execution time for the references of a cut, one at a time, is simpler than a global search solution in which many cuts would be examined in a single step. As a result of this lack of a global perspective we expect our algorithm to lead to non-optimal but nevertheless good hardware designs fairly fast. The exhaustive search for finding a mapping for a selected cut can make the algorithm exponential in the worst case. However, in practice, and due to the limited size of a cut, the search space is small enough for a brute-force approach.

5.5 Algorithm Illustration

We now illustrate the mapping algorithm using the example code in figure 4.1 with the DFG and CG depicted in figure 4.2. In this example the algorithm targets an architecture with 4 memory banks M0, M1, M2, and M3, each with a single port and a capacity of 100 elements. The access latencies for registers, memory banks, and external memory are respectively 0, 3, and 10 clock cycles, with no pipelined access modes. The latencies of division, multiplication, and addition are respectively 3, 2, and 1 clock cycles.

Figure 5.6: Final mapping for the example code: (a) distribution for array C and scalar replacement for array A, (b) distribution for array B.

In the beginning the first and second longest paths take respectively 27 and 22 clock cycles. For the selected cut labeled as (4), we would like to improve the latency by at least LP1 - LP2 = 27 - 22 = 5 clock cycles. In other words the new access latency for this cut should be at most 10 - 5 = 5 cycles. This latency is achieved by mapping the elements of array C to memory banks. The algorithm investigates the access patterns of the references to array C, and as C[i][j] and C[i][j+1] access disjoint data, the algorithm distributes array C between two memories. As the result of the first phase of the algorithm, array C is distributed across M0 and M1. In this case the delay of accessing both references in parallel is 3 clock cycles.

During the second phase the algorithm updates the critical path information, leading to only one possible cut, {B[k][j], A[i][k], B[k][j+1]}. The latencies of the two longest paths are 20 and 15 cycles. As a result the longest path should be improved by at least LP1 - LP2 = 20 - 15 = 5 cycles and therefore the new latency for this cut should be at most 10 - 5 = 5 cycles. This latency is achieved by mapping the three references to three banks and reading them in parallel. At first it seems that only the ports of M2 and M3 are available. However, based on the data dependences and the scheduling, writing to the references of C only occurs after reading the references to A and B. As a result B[k][j], A[i][k], and B[k][j+1] can share the ports of M0 and M1 with C[i][j] and C[i][j+1]. Knowing that there is enough bandwidth, we check the storage requirements against the available storage. As the unrolled loop references {B[k][j]} and {B[k][j+1]} access disjoint data, the data for array B can be distributed.
Considering the required storage for B[k][j], A[i][k], and B[k][j+1] (respectively 100, 20, and 100), we map the references of B to M2 and M3 while mapping array A to either M0 or M1. As a result of the second phase of the algorithm, A is mapped to M0, array C is distributed across M0 and M1, and finally array B is distributed across M2 and M3. The references all have an access latency of 3 cycles.

During the third and final phase the algorithm updates the critical path information, leading to only one possible cut, {B[k][j+2], B[k][j+3]}. As the latencies of the two longest paths are 15 and 13 cycles, this cut needs to be improved to result in an access latency of at most 10 - 2 = 8 cycles. The desired latency is achieved by at most two sequential accesses to the RAMs. The algorithm checks whether the locations accessed by the new set of references in the new cut are a subset of the already mapped data. As this is the case here, the new references can use the data previously mapped to M2 and M3. In the final mapping, array C is distributed across M0 and M1, array B is distributed across M2 and M3, and array A is mapped to M0. This mapping leads to a final critical path of 13 clock cycles.

5.6 Limitations of the CMA Algorithm

Here we outline the limitations of our algorithm, which are mostly due to the greedy nature of our approach:

- Our mapping strategy has a greedy nature. At each step the algorithm only considers the references of the current cut and maps them according to the current desired latencies. This can be problematic if the same references require a different allocation in a later cut. For example in figure 4.2, making an assignment for A[i][k] as part of cut number 2 might conflict with the desired latency of A[i][k] as part of cut number 3.

- In assigning the references of a selected cut to memory, the order in which these references are mapped matters. Assignments of the earlier references can change the attributes of the RAMs/registers. Hence the algorithm might be unable to find a mapping that satisfies the desired latencies of the remaining references.

- The value of Threshold strongly affects the allocation process (section 4.5). Yet it is extremely difficult to determine this value in a generic way, as various applications have very different characteristics in terms of their execution paths.

Despite these shortcomings, and as we explain in the next chapter, our algorithm is very effective in identifying good allocation strategies.

5.7 Chapter Summary

In this chapter we presented our allocation algorithm in depth. We described the various data transformations, storage types, and metrics used by our algorithm. We explained how the algorithm uses this information, combined with the notions described in chapter 4, in order to allocate storage to the different data references of a computation. We presented a full description of our greedy algorithm and illustrated its application with an example. We finally pointed out the limitations of the approach, mainly caused by its greedy nature.

Chapter 6
Experiments

In this chapter we report on the effectiveness of our Custom Memory Allocation (CMA) algorithm in assigning storage to the array references of a loop nest written in sequential C code. Our experiments are designed in the context of an FPGA and illustrate the effects of the CMA algorithm in terms of execution time and consumed programmable resources. We evaluate the results of CMA against other common memory mapping approaches, namely Naive, Custom Data Layout (CDL), and Hand Coded (HC) designs.
In addition we highlight the effects of using the information of the cuts of the critical graph of the computation for the allocation of storage, specifically registers.

We begin by presenting the methodology we have used and the memory organization in sections 6.1 and 6.2. We then describe the characteristics of our sample application kernels in section 6.3. In section 6.4 we illustrate the effects of considering the cuts for scalar replacement and compare this with two other approaches that are based solely on reuse information. In section 6.5 we illustrate and compare the results of the various mapping techniques for our kernel codes in terms of execution time and consumed area resources. In this section we report on two sets of experiments, namely mappings with and without the use of registers. Overall, we demonstrate that combining the knowledge of program analyses with the scheduling information results in effective memory allocations that yield superior designs both in terms of area and performance. We analyze our results in section 6.6 and summarize the chapter in 6.7.

6.1 Methodology

We implement our storage allocation algorithm as part of the Stanford University Intermediate Format (SUIF) compiler [74]. SUIF includes a front end which transforms an original program in C or FORTRAN to its intermediate format. It then uses various passes to analyze and/or transform the intermediate format. The result can then be converted back to C or FORTRAN or be compiled into an executable using a specific architecture back-end.

We build our memory allocation algorithm as an independent pass that uses the information built and gathered by some of the pre-existing passes. Our CMA algorithm combines these analyses with its own analysis of the critical path to identify a suitable storage type, storage size, and allocation technique for the different array references of a loop.

Initially, the code written in C is transformed to the SUIF intermediate format. We apply basic optimization passes such as dead code elimination, constant propagation, and common subexpression elimination. We then unroll each loop nest by a factor specified by the user, transforming the code by applying the Unroll and Jam pass [69]. We expect this unrolling to accentuate the effects of our algorithm, as it increases the bandwidth requirements of the resulting transformed code.
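To illustrate why unrolling raises the bandwidth requirements, the following is a minimal sketch of a 1x4 unroll of the j loop of a matrix-product-like kernel; the loop bounds and the kernel body are illustrative assumptions and not the exact output of the Unroll and Jam pass.

#include <stdio.h>

#define N 16

static int A[N][N], B[N][N], C[N][N];

int main(void) {
    /* After a 1x4 unroll of the j loop, each iteration of the jammed k loop
       issues four references to B (B[k][j..j+3]) and four to C in the same
       loop body, quadrupling the memory bandwidth demanded per iteration.   */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j += 4)
            for (int k = 0; k < N; k++) {
                C[i][j]     += A[i][k] * B[k][j];
                C[i][j + 1] += A[i][k] * B[k][j + 1];
                C[i][j + 2] += A[i][k] * B[k][j + 2];
                C[i][j + 3] += A[i][k] * B[k][j + 3];
            }
    printf("%d\n", C[0][0]);
    return 0;
}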
After unrolling, we run the passes that analyze and quantify the data reuse for the array references. We then apply the existing CDL pass and our own CMA pass for storage allocation. We create various designs and their corresponding VHDL codes based on the outcome of the CDL and CMA passes, and project the Naive and HC data mappings, as there is no SUIF pass for them. We then convert the behavioral VHDL codes into structural VHDL designs using Mentor Graphics' Monet [55] high-level synthesis tool. We set up the synthesis tool with a frequency of 33 MHz targeting the fastest design. This stage calculates the total number of clock cycles required for the entire computation. We also compute the total number of storage bits using the number of allocated storage elements and the bitwidth of the elements. Finally we input the netlists generated by the synthesis tool into the Place-and-Route (P&R) tool. We use the Synplify Pro 6.2 and Xilinx ISE 4.1i tool sets for logic synthesis and P&R, targeting a Xilinx Virtex XCV 1K BG560 device. At this stage the tool determines the actual clock periods for the actual designs. We also obtain the area of the designs in terms of the number of used slices. Using the cycle counts from Monet's synthesis and the clock lengths from Synplicity's P&R, we calculate the wall clock time that each design takes to execute.

6.2 Memory Organization

We define a memory organization that includes a single off-chip memory and four on-chip block RAMs. We represent these modules using single-port RAM blocks, where each port has both read and write capabilities. Reading and writing to external memory requires five clock cycles, while the same operations take two clock cycles for on-chip block RAMs. All the memory accesses, either off-chip or internal, are fully pipelined with an initiation interval of one clock cycle. In terms of the number of memories in our system we assume a single off-chip memory and a maximum of four on-chip memories. Each on-chip memory has enough capacity for the data arrays in the computation. In practice each of these single-ported memories can be implemented by cascading multiple smaller RAM blocks. Our algorithm is, however, completely independent of the number of blocks, the number of ports per block, and the capacity of the block RAMs.

In terms of distributed memory we impose a maximum limit of 64 registers, for a total of 768 bits, to store the arrays' data. We utilize these registers to highlight the benefits of the CMA algorithm in situations where using the block RAMs alone would not improve the design. This limit is selected based on the bounds of the loops in our application codes. In practice, however, this limit would be imposed by the compiler as part of a global resource allocation policy, orthogonal to these experiments.

6.3 Application Kernels

We focus on a set of image and signal processing code kernels. These applications not only meet the requirements of our compiler analyses but are also highly parallelizable. These computations are very data intensive and exhibit ample opportunities for data reuse, making them ideal candidates for reuse analysis and custom hardware implementation. In these kernels all arithmetic computations operate only on integer values. Using floating-point arithmetic would simply stress the space requirements of the implementation of each design without invalidating the bandwidth analysis of our algorithm in the presence of loop unrolling.

We have selected the following set of five image/signal processing benchmark codes in order to show the effects of our algorithm:

- Finite Impulse Response (FIR) computes the convolution of a vector with 1024 values against a sequence of 32 coefficients.

- Matrix Multiplication (MM) performs an integer dense matrix multiplication of two 16-by-16 integer matrices.

- Jacobi (JAC) performs a four-point Jacobi stencil relaxation computation over a 32-by-32 element array.

- Histogram (HIST) enhances a 256 gray-level 64-by-64 pixel image stored as an array by applying a global histogram equalization. The data access pattern of the histogram array is irregular.

- Binary Image Correlation (BIC) computes a binary image correlation between a 4-by-4 template image and successively overlapping regions of a larger 32-by-32 pixel image.

These kernels exhibit the major computational patterns present in image/signal processing applications. They include array references with regular affine accesses, irregular non-affine accesses, and finally stencil (window-based) computations; a sketch of two of these access patterns follows below. Considering the prevalence of these data accesses in image/signal processing applications, we believe that our benchmark selection is a representative set for showing the effects of our algorithm.
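The following hedged sketch shows plausible C forms of the FIR and JAC kernels with the sizes given above (1024 samples and 32 taps, and a 32-by-32 stencil grid); the exact loop structure and variable names of the benchmark sources may differ.

#include <stdio.h>

#define NSAMP 1024   /* FIR input length       */
#define NTAP    32   /* FIR coefficient count  */
#define DIM     32   /* JAC grid dimension     */

static int x[NSAMP], coef[NTAP], y[NSAMP - NTAP + 1];
static int u[DIM][DIM], v[DIM][DIM];

int main(void) {
    /* FIR: regular affine accesses; x[i + k] is an MIV-style reference
       whose data overlaps across iterations of i.                       */
    for (int i = 0; i <= NSAMP - NTAP; i++) {
        int acc = 0;
        for (int k = 0; k < NTAP; k++)
            acc += coef[k] * x[i + k];
        y[i] = acc;
    }

    /* JAC: four-point stencil; the neighboring references to u exhibit
       group-reuse across iterations of i and j.                         */
    for (int i = 1; i < DIM - 1; i++)
        for (int j = 1; j < DIM - 1; j++)
            v[i][j] = (u[i - 1][j] + u[i + 1][j] +
                       u[i][j - 1] + u[i][j + 1]) / 4;

    printf("%d %d\n", y[0], v[1][1]);
    return 0;
}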
6.4 Effects of Considering the Critical Path(s)

In this section we highlight the effects of considering the critical paths of the computation for storage allocation. We conduct these experiments independently of the full storage allocation algorithm described in the next section. Here we mainly focus on the allocation of distributed memories (registers) in the process of scalar replacing the arrays' data into them.

An aggressive application of scalar replacement might require a large number of registers, leading to infeasible designs. As a result, different techniques have been developed to apply scalar replacement selectively to a subset of the references in the computation. To illustrate two of these techniques and compare them with our own, we conduct our experiments for three versions of scalar replacement, namely:

- Innermost Loop's Reuse (ILR): In this method only the references with reuse in the innermost loop are scalar replaced. This reduces the register requirements for scalar replacement, as reuse in the innermost loop results in shorter reuse distances and therefore fewer elements that need to be saved. If there are no references with such reuse, this technique results in a design with no scalar replacement. This method is part of the scalar replacement strategy in [15].

- Full Scalar Replacement (FSR): This technique uses the value of SavedAccesses/RequiredRegisters as a metric to guide the application of scalar replacement. It uses this value to greedily assign registers to the references that yield the best benefit/cost ratio. This method is the basis for scalar replacement in [70].

- Critical Path Aware (CPA): This technique also uses a greedy approach to assign the registers to the most valuable references on the critical path. For this purpose we use the notion of cuts introduced in [8] to identify the sets of references that need to be considered inclusively. We then select a set of references that requires the fewest registers.

We now illustrate the application of these three techniques to our set of benchmarks using 64 registers with a total of 768 bits. The timing results are illustrated in figure 6.1 while the area results are depicted in table 6.1. The number of used slices is out of a total of 12288 slices available in our target FPGA device.

Figure 6.1: Timing results reflecting the use of cuts: (a) cycles, (b) clock rate, (c) execution time.

Kernel    ILR             FSR             CPA
          Bits   Slices   Bits   Slices   Bits   Slices
FIR         0     389     768     720     768     675
MM          0     296     512     637     408     648
JAC        24     299     512     720     512     720
HIST       12     422     768     564      12     422
BIC         0     341     512    1630     512    1630

Table 6.1: Number of design slices and register bits used in scalar replacement.

FIR: Applying the CPA technique we observe a 39.6% and 10% improvement over the ILR and FSR strategies in terms of clock cycles. By considering the critical paths, the CPA algorithm assigns the registers to the references on the path and therefore reduces the cycle count. In terms of clock rate the CPA design has a 14.8% degradation compared to the ILR technique and a 1.9% advantage over the FSR design. This translates to respectively 30.7% and 11.7% faster execution times over the ILR and FSR strategies.

MM: In this case the CPA and FSR designs both have 6.6% fewer clock cycles than their ILR counterpart. The CPA algorithm results in the same number of cycles as the FSR design while using 20% fewer storage bits.
The application of scalar replacement, however, results in a 16.8% slower clock rate and therefore a 9.1% slower execution time for the CPA and FSR designs compared to the ILR design.

JAC: Here applying the CPA algorithm results in a design identical to the one resulting from the FSR technique. In this benchmark all references belong to the same reuse chain and as a result there is not much room for improvement. The CPA and FSR designs obtain a 4.9% improvement over the ILR version in terms of clock cycles. However, the 18.8% clock degradation in the cases of the CPA and FSR designs results in a 12.9% slower overall execution time compared to the ILR version.

HIST: In this case the CPA technique results in a design identical to the one resulting from the ILR algorithm. Here all three versions have the same clock cycle performance. However, the scalar replacement in the FSR version consumes 64 times the storage bits of the ILR and CPA designs. Even while using more storage, the FSR design leads to a 16.8% slower clock rate and execution time compared to the ILR and CPA cases.

BIC: The CPA design is identical to the one resulting from the FSR technique. In this benchmark all references are on the critical path and therefore the FSR algorithm does as well as the CPA algorithm. Even though these designs achieve a 35.6% improvement over the ILR version in terms of clock cycles, they use a large amount of storage. In addition, their complex codes worsen the clock rate by 56.1%, leading to a small 0.52% degradation in execution time.

Summary: These results reveal that in terms of clock cycles the CPA algorithm performs either better than both the FSR and ILR techniques or similarly to the better of the two. Moreover, CPA achieves this result using an equal or smaller number of storage bits. In terms of clock rates, however, the ILR technique leads to superior designs. This is mainly due to the lack of references with innermost reuse, which leads to solutions with no scalar replacement. This in turn results in simpler designs, better clock rates, and sometimes faster execution times. In general, for architectures with fixed clock rates, the CPA strategy has a clear advantage over the ILR and FSR algorithms. For architectures with variable clock rates, however, scalar replacement should be applied discriminately, as its overhead might reverse its benefit. We take this observation into account in designing our full memory allocation algorithm described in the next section.

6.5 Memory Allocation Algorithm

In this section we present and analyze the results from applying our complete memory allocation algorithm. We first explain and contrast the different mapping approaches used in our experiments. After describing the experiments we present our allocation results with and without the use of registers, comparing our technique with the other mapping strategies in the different storage settings.

6.5.1 Mapping Techniques

In our experiments we compare the designs resulting from our CMA algorithm with the outcome of allocation using the Naive, Custom Data Layout (CDL), and Hand Coded (HC) mapping strategies. Here we outline the mapping strategy of each of these techniques.

6.5.1.1 Naive

In the Naive mapping the data corresponding to each array (not necessarily an array reference) is mapped to a different RAM regardless of the access patterns or the schedule of the computation, i.e., the order in which memory accesses are executed. All data originally resides in the external memory. The arrays' data is brought from the external memory to the on-chip RAMs at the beginning of the computation.
If any array has a write operation, the corresponding data is written back to external memory only at the end of the computation.

6.5.1.2 Custom Data Layout (CDL)

In the Custom Data Layout [72] mapping, arrays are distributed among the different RAMs based on their access pattern but still irrespective of the schedule of the computation. This technique is mainly based on the overlap between the data accessed by different array references. Array references accessing disjoint sections of the same array are distributed across different RAMs. This is the case for the group-reuse in SIVs as well as self- and group-reuse in MIVs. For the references with overlapping data we consider two possible scenarios. First, if there are sufficient registers available, the references are scalar replaced in registers. Second, in case of insufficient registers the references are mapped to the same RAM. If the number of block RAMs is insufficient for a total distribution, the references' data are allocated to the available RAMs in a round-robin fashion. As in the Naive case, the arrays' data is brought from the external memory to the on-chip RAMs at the beginning of the computation and, if necessary, written back to the external memory at the end of the computation.

6.5.1.3 Custom Memory Allocation (CMA)

In our technique, Custom Memory Allocation, arrays are mapped based on their access patterns as well as the computation's schedule. The access patterns are identified by analyzing the reuse chain of the individual arrays, while the computation's schedule is considered through analyzing the critical paths. In CMA all three levels of memory are considered, namely external memory, on-chip block RAMs, and on-chip distributed memories (registers). The basis of the algorithm is to:

- Distribute the data for an array that exhibits only self-reuse.

- Replicate the data for an array that has either group-reuse or overlapping self-reuse (the case of MIV).

- Leave an array with no reuse in the external memory.

Unlike the previous techniques, in which the whole data is brought from off-chip memory to on-chip memory at the beginning of the computation, in CMA only the reusable data is brought on chip. The on-chip registers are used when mapping data to RAMs is not effective in reducing the length of the critical path(s).

6.5.1.4 Hand Coded (HC)

In this strategy the arrays' data are distributed, replicated, and/or scalar replaced based on the reuse of the different elements, the availability of storage resources, and the computation's characteristics. Unlike CMA, the Hand Coded versions are not based on a quantitative analysis of the sequential C code. In these versions the allocation is based solely on analyzing the equivalent VHDL code and observing the details of the computation. Hence the arrays that need to be accessed at the same time are distributed as much as possible, in either RAMs or registers. This version does not guarantee optimality; it does, however, simulate a practical design implemented by careful inspection of the hardware characteristics.

6.5.2 Implementation Considerations

In implementing our CMA algorithm we were faced with various practical issues. Here we describe how we addressed them:

- Calculating the critical path(s): In order to calculate the length of the critical path we aggregate the memory and operation delays along the edges of the Data Flow Graph. For memory accesses, and considering the time required for address generation, we add the latency of one clock cycle to the total latency.
- Accessing the data of a reuse chain: As mentioned, in the CMA code versions we only save the data of a reuse chain in on-chip storage. In the case of reuse chains with group-reuse, the generator of the chain needs to be accessed (read) from the external memory at each iteration of the loop corresponding to the reuse. The same applies to the finalizer of the reuse, which might need to be written back to external memory at each iteration of the loop. In these cases we overlap this read/write operation with the computation of the loop. This allows the memory latency to be fully or partially hidden.

- Threshold for the critical paths: In chapter 4 we introduced the need for a threshold in order to find the critical paths that are too close in length to one another. In our experiments, considering the access delays and the pipelining initiation interval, we pick this threshold to be one clock cycle. However, in some cases this threshold is too aggressive due to a short computation time. Namely, all the paths become critical and therefore all the references need to be mapped to local storage without any priority. In cases where the available storage cannot accommodate this, we dynamically reduce the threshold in order to find a feasible mapping.

We made these decisions through a close study of our algorithm and implementation. Our experiments with a variety of designs conform to these adjustments.

6.5.3 Description of the Experiments

We experiment with two different memory resource settings. In the first setting the target architecture has no registers for the scalar replacement of data. This guarantees that the resources are identical across all mapping strategies, and that all strategies can only exploit the external memory and the on-chip RAM blocks. In the second set of experiments we allow the use of 64 registers with a total of 768 bits. The different mapping strategies utilize these registers in order to find a better mapping that leads to a shorter execution time. In this case we compare the outcome of the CMA mapping with the mappings of the CDL and HC designs.

6.5.4 Allocation Results Using No Registers

We now present the results of our experiments using no registers. For each code kernel we derived 24 designs, reflecting the four different mapping strategies as well as various unroll factors. We designate a given design as 'X x Y' to indicate the amount of unrolling for the two innermost loops of the code. We first explain and compare the performances in terms of execution time. Here the execution times also include the cost of bringing the data from external memory to on-chip RAMs. We then illustrate the results for the area consumption of the different designs. The detailed numerical data for this section is presented in the Appendix.

6.5.4.1 Time Performance

The timing results are depicted in figures 6.2 through 6.6. Each figure includes the data in terms of number of cycles, clock rates, and execution times for the different unroll factors and mapping strategies.

Figure 6.2: Timing results for FIR using 4 memories and no registers: (a) number of cycles, (b) clock rate, (c) execution time.

FIR (figure 6.2): This kernel includes an array reference with an MIV subscript. The CDL algorithm maps the entire data of this reference to the same storage, serializing its accesses. In the CMA algorithm, however, the references of the MIV reuse chain are replicated in various block RAMs.
This replication results in simpler address generation schemes and therefore 2.5% faster clock rates over the CDL versions. The concurrent accesses to the data result in a 14.5% improvement in terms of clock cycles, yielding an average 16.6% better execution time compared to the CDL versions. Comparing the CMA and HC versions, the CMA designs show a negligible 0.02% degradation in terms of cycles, a small 0.5% clock gain, and a 0.33% degradation in terms of execution time. We attribute this difference to the greedy nature of CMA and the fact that the best local allocations do not necessarily result in the best global allocation.

Figure 6.3: Timing results for MM using 4 memories and no registers: (a) number of cycles, (b) clock rate, (c) execution time.

MM (figure 6.3): This kernel only includes SIV array references that exhibit self-reuse and therefore access disjoint data sets. Both the CDL and CMA strategies distribute these array references and access them concurrently. In terms of clock cycles, the CMA designs exhibit an average 3.9% improvement over the CDL designs. The small advantage of CMA is due to considering the schedule and the critical paths. Simpler address generation results in 2.3% faster clock periods and a 6% improvement in execution time for the CMA designs. In comparison to the HC designs the CMA versions suffer a slight 0.89% degradation in the number of clock cycles. Similarly to the case of FIR, the reason for this difference is the greedy nature of CMA. Even though the simpler CMA designs achieve a 0.6% faster clock period than the HC designs, their overall improvement in execution time is only 0.14% smaller.

Figure 6.4: Timing results for JAC using 4 memories and no registers: (a) number of cycles, (b) clock rate, (c) execution time.

JAC (figure 6.4): This kernel only includes references that either exhibit group-reuse or no reuse at all. In the case of the references with group-reuse, the CDL design only brings a single copy of the array to an on-chip RAM, creating a large bandwidth pressure, especially in the case of the unrolled versions. In the CMA designs, however, the data corresponding to the reuse chain is replicated in different RAMs. Furthermore, the CDL designs bring the arrays with no reuse to RAMs at the beginning of the computation, whereas in the CMA versions the data that does not have any reuse is kept in the external memory. In terms of clock cycles the CMA designs perform 52.5% better than their CDL counterparts. This is the largest improvement in our set of benchmarks and is very consistent across the various unroll factors. More complex address generation results in 5.7% slower clock rates and subsequently a 49.7% improvement in terms of wall clock time.

In the case of the HC designs the data is replicated the same way as in the CMA versions; however, the assignment of the various arrays to the different RAMs differs. In terms of the number of clock cycles the CMA designs exhibit a very similar performance to the HC designs, showing only a 0.02% decrease. However, the more complicated code in the case of the HC designs leads to a 1.9% increase in clock rate. As a result the CMA designs have a 1.92% advantage over the HC versions in terms of wall clock time.

Figure 6.5: Timing results for HIST using 4 memories and no registers: (a) number of cycles, (b) clock rate, (c) execution time.

HIST (figure 6.5): This kernel includes a non-affine array access and multiple arrays with affine index functions, some of which exhibit no reuse. The CDL designs distribute the references with disjoint data among the different RAMs in a round-robin fashion. Not only does this create sequential accesses to some of the RAMs, the addressing overheads also revert any advantage of parallel accesses to these references. As a result the CDL versions even perform 4.3% worse than the Naive codes in terms of clock cycles.
This performance degradation increases to 7.8% for the wall clock time, since the clock rates are 3.4% slower.

In the CMA designs the array references with no reuse are kept in external memory. In addition, and as a result of analyzing the critical path, conflicting arrays are distributed among the different RAMs. This results in 32.2% fewer clock cycles in comparison with the CDL designs. Considering a 1.4% clock gain, the performance improvement increases to 33.1% in terms of wall clock time. This improvement is more evident for larger unroll factors, as more parallelism is accommodated in the CMA versions.

In comparison with the HC designs the CMA versions show a very small improvement, namely 0.8% fewer clock cycles, a 1.3% faster clock rate, and a 2.15% shorter execution time. Although the mapping strategy for the CMA and HC designs is the same, the HC versions use fewer RAMs. As a result of packing more arrays into the same RAM, the addressing becomes slightly more complex, resulting in slight performance degradations for the HC designs.

Figure 6.6: Timing results for BIC using 4 memories and no registers: (a) number of cycles, (b) clock rate, (c) execution time.

BIC (figure 6.6): This kernel includes a two-dimensional MIV array as well as an array with no reuse. As in the case of FIR, the CMA algorithm replicates the data of the MIV reuse chain in various block RAMs. Also, the data with no reuse is kept in external memory and the accesses to this data are overlapped with the rest of the computation. These factors result in 19.7% fewer clock cycles than the CDL designs. Although replication creates a large performance boost in the CMA designs, due to the multi-dimensional array references as well as a deep loop nest these designs are very complicated. This complexity results in a 6.6% degradation in terms of clock rates over the CDL versions, decreasing the performance gains of the CMA designs to 14.8% in terms of execution time.

In comparison with the HC designs the CMA versions have a small 2% degradation in the number of clock cycles. This discrepancy is due to the imprecision of computing the critical path at the source level, as outlined in chapter 4. As a result multiple references are assigned to the same block RAM, an issue that is avoided in the HC versions. The clock rates in the case of the CMA designs are on average 0.9% faster than the HC designs, creating a 1.1% better execution time for the CMA versions.

6.5.4.2 Area Performance

The consumed area, i.e., the number of slices, closely follows the code complexity. This fact is confirmed by the generally larger areas for larger unroll factors. Overall the Naive designs are the simplest and smallest. The area of the CDL, CMA, and HC designs depends on various factors, including:

- Routing complexities due to the number of used block RAMs as well as how the data is distributed among them.

- Multiplexor bits, which increase as a result of replication through scalar replacement for the CMA and HC designs, and of manipulating large arrays in the case of CDL.

- The number and type of additional operations created by larger unroll factors and/or storing a larger number of arrays on-chip.

- Complex address generation resulting from the mapping of multiple references to the same storage and/or manipulating a larger number of arrays.

Figures 6.7 through 6.11 present the area results in terms of consumed number of slices, number of RAMs, and storage bits in the RAMs.

Figure 6.7: Resource usage for FIR using 4 memories and no registers: (a) number of slices, (b) number of used RAMs, (c) consumed RAM bits.
FIR (figure 6.7): In general the CMA designs use more physical resources than their CDL counterparts, as they are slightly more complex. Data replication, which is absent in the CDL cases, requires more control in terms of addressing the elements. This is more evident in the HC designs, where the replication is more aggressive. On average the CMA designs use 1.1% more slices than the CDL versions and 10.7% fewer slices than the HC designs.

In terms of storage bits, and unlike CDL, the CMA algorithm performs scalar replacement in RAMs and keeps a large array in external memory. This results in 47.2% fewer bits compared to the CDL design. In comparison with the HC design the CMA algorithm uses 3.2% fewer bits, as the data replication in the CMA case is less aggressive.

Figure 6.8: Resource usage for MM using 4 memories and no registers: (a) number of slices, (b) number of used RAMs, (c) consumed RAM bits.

MM (figure 6.8): For this kernel the CDL, CMA, and HC designs are very similar and therefore use similar resources. The various arrays are distributed among the RAM blocks in subtly different ways, which results in slight variations in terms of routing and addressing. On average the CMA designs are 5.8% and 1.6% smaller than their CDL and HC counterparts respectively. Furthermore, the CMA designs consume 2.7% more storage bits than the CDL designs while being identical to the HC designs.

Figure 6.9: Resource usage for JAC using 4 memories and no registers: (a) number of slices, (b) number of used RAMs, (c) consumed RAM bits.

JAC (figure 6.9): The replication of the reuse chains in the case of the CMA algorithm makes the designs more complicated, and as a result the number of slices shows a 27.3% increase compared to the CDL designs. In comparison with the HC designs the CMA solutions lead to designs with 0.7% more slices, due to a more aggressive replication.

As the CMA algorithm only replicates the reuse chains and keeps the array with no reuse in external memory, the resulting designs consume 88.5% fewer storage bits compared to the CDL designs. In the case of the HC designs the replication is less aggressive, as the designer recognizes that some memory latencies can be hidden through memory pipelining. As a result these solutions consume 61.2% fewer bits than the CMA designs.

Figure 6.10: Resource usage for HIST using 4 memories and no registers: (a) number of slices, (b) number of used RAMs, (c) consumed RAM bits.

HIST (figure 6.10): For this kernel the CDL designs are more complicated, as they manipulate a large array that has no reuse. The same array is kept in external memory in the CMA and HC versions. This leads to simpler designs for the HC and CMA algorithms, which differ slightly due to different distribution decisions. The CMA designs are on average 20.5% and 3.9% smaller than the CDL and HC versions respectively. In terms of storage bits the CMA and HC designs identically achieve a 91.4% improvement over the CDL designs.

Figure 6.11: Resource usage for BIC using 4 memories and no registers: (a) number of slices, (b) number of used RAMs, (c) consumed RAM bits.

BIC (figure 6.11): Due to the MIV reference in this kernel, the CMA and HC algorithms engage in some aggressive data replication. This data replication, in the form of scalar replacement in block RAMs, results in an overhead in terms of address generation. In addition, the code transformations required for scalar replacement make these codes significantly more complicated than the CDL designs. The level of complexity increases with the unrolling
As a result the CMA versions are 61% larger than the CDL designs but use 90.6% fewer storage bits. The CMA designs are 0.8% smaller than the HC designs and use 25% fewer bits, as they perform less data replication.

6.5.4.3 Summary

Throughout these experiments the CMA algorithm leads to hardware designs that have a clear performance advantage over the CDL designs in terms of clock cycles. This improvement, an average of 24.5% fewer cycles, is present for all kernels and code versions. Even with a 1.2% degradation of clock rates, the CMA designs have 24.08% faster execution times.

The performance gains are higher for cases with a large number of references with overlapping data, for instance those created by the application of loop unrolling. For these cases the CMA designs replicate the data and distribute the array references. The CDL designs, however, keep only a single copy of the data and map all the array references to it. For these CMA and CDL designs the performance difference is small for the lower unrolling factors, as the pipelined memory accesses in CDL achieve a reasonable performance. For the larger unroll amounts, however, the performance of the distributed accesses in the CMA designs far surpasses that of the pipelined accesses in the CDL designs. In these cases the CMA designs achieve better performance than the CDL designs while consuming the same available storage.

The comparison between the CMA and HC designs reveals that their performances are identical for all practical purposes. In fact the CMA designs have only a 0.4% loss in terms of cycles; their 1.06% clock rate gains lead to designs that are only 0.63% faster.

In terms of resource usage, and as a result of more complex code due to replication and scalar replacement, the CMA designs consume 12.6% more slices than the simpler CDL designs. Due to more aggressive optimization, the HC designs use 3.2% more slices than the CMA versions. As for storage bits, the CMA designs outperform the CDL solutions by 63%. They achieve this by considering the reuse chains as the units of replication and scalar replacement, as well as by leaving the arrays with no reuse in external memory. In comparison with the HC designs, the CMA versions use on average 6.6% more storage bits. This average is, however, highly influenced by the JAC benchmark, as in all other cases the CMA designs do the same or better than their HC equivalents.

6.5.5 Allocation Results Using Limited Registers

We now present the results of our second experimental setup, in which we allow the use of 64 registers. For the designs that map a portion of the data to registers, we first bring this data from external memory to the RAMs and then read it into registers. For each kernel we derive 3 designs reflecting the CDL, CMA, and HC mapping strategies.

We focus specifically on the case of 1x4 as the unroll factors for the two innermost loops. The reason behind this selection is to have a relatively high bandwidth demand so that the use of registers can create some performance gain. If the bandwidth requirements are small, CMA simply maps different arrays to different RAMs and accesses them in parallel, therefore not taking advantage of the registers. If the bandwidth requirements are too large, the CMA algorithm either maps the majority of references to registers or relaxes the timing requirements by reducing the desired latencies of the references.

Figure 6.12: Timing results for all benchmarks using 4 memories and 64 registers. (a) Number of Cycles; (b) Clock Rate; (c) Execution Time.

We first explain and compare the performances in terms of execution time in figure 6.12.
Here the execution times also include the cost of bringing the data from external memory to the on-chip RAMs. We then report the area consumption of the different designs in figure 6.13. The actual numerical data for this section is presented in the Appendix.

6.5.5.1 Time Performance

FIR: In this kernel the available number of registers allows for a full scalar replacement of the arrays' data in registers. In the CDL mapping, the references to the MIV array that have overlapping data are mapped to registers. In the CMA design, and considering the possible schedule, the references to the MIV array are scalar replaced and replicated in RAMs, while the references to the SIV array are scalar replaced in registers. This partial mapping to registers results in very similar designs, in which CMA has only 1.2% fewer clock cycles. In terms of clock, however, the CDL design has a 13.8% longer clock period, leading to a 14.9% performance gain for CMA.

In the HC version of the code every reference is scalar replaced in registers. As a result the CMA designs have an average 47% disadvantage in terms of the number of cycles compared to the HC version. In the case of the CMA designs, the scalar replacement in RAMs results in a 1% loss in clock rate over the HC design, resulting in an overall 48.5% slower design.

MM: In this kernel all the array references access disjoint data sets. As a result CDL distributes the references among the RAMs and does not use any registers. As for the CMA and HC versions, which are identical for this kernel, the number of available registers is insufficient for a full scalar replacement of the data in registers. As a result these versions can only scalar replace part of the references that participate in the computation; as mentioned in section 4.2, this does not reduce the computation time. Due to this fact, the timing results for these cases are very similar to those of the CDL designs, showing only a 4.2%, 0.6%, and 4.8% advantage in terms of cycle count, clock rate, and execution time, respectively.

JAC: In both the CDL and CMA designs the array with group-reuse is mapped to registers. In the CDL designs the references that exhibit no reuse are distributed among the RAM blocks, as they access disjoint data. The CMA algorithm, however, keeps these references in external memory. This has a negative effect on the performance of CDL, as the overheads of bringing the data to on-chip RAMs are never compensated by data reuse. This leads to 38.3% fewer cycles for CMA compared to CDL. The complex routing in the case of CDL also results in a 7.9% slower clock, leading to 43.22% faster CMA designs.

Examining the VHDL code reveals that the execution time of the computation is bounded by the inevitable external memory accesses due to the sources of the reuse chains. As such, mapping the data corresponding to the references of the group-reuse to RAMs does not affect the performance. Based on this observation, in the HC design we scalar replace the references with group-reuse in RAMs, while keeping the rest of the references in external memory. As a result of this mapping, the number of clock cycles in the HC version is only 0.3% more than in the CMA version, while using no registers. The simpler design results in a 22.5% better clock rate, which leads to a 22.1% better execution time.

HIST: In this kernel all the array references either have self-reuse (disjoint data access) or no reuse at all. As a result, the CDL designs distribute all the data references among the available RAM blocks without exploiting the registers.
The CMA algorithm, on the other hand, maps a subset of the array references to registers, keeps the array references with no reuse in external memory, and distributes the remaining array references among the available RAMs. As a result, the CMA designs exhibit 49% fewer cycles and a 3.2% slower clock, and subsequently a 47.3% faster execution time in comparison to the CDL equivalent.

The HC designs exploit the registers and map all the references with reuse to registers. As in the CMA designs, the array references with no reuse are kept in external memory. Although it might seem that a more aggressive scalar replacement in registers would result in a faster execution time, in this case, and due to the dependences in the computation, HC gains only a small advantage over CMA. Given the pipelined accesses to the RAM blocks in the CMA mapping, these accesses can indeed overlap with the actual computation. As a result the data is available when it is required, and mapping the data to registers is not that beneficial. This generates a small 1.6% gain in terms of cycles, a 3.9% faster clock, and a 5.6% faster execution time for the HC design.

BIC: The CDL algorithm distributes the array references with self-reuse as well as the ones with no reuse among different RAMs, as they access disjoint data sets. Ideally the references that exhibit overlapping data would be mapped to registers; however, the number of registers required for this scalar replacement surpasses the number of available registers. As a result CDL maps these array references to a single RAM as if no registers were available. In the case of CMA, not only are the references with group-reuse scalar replaced and therefore replicated in RAMs, but the array references with no reuse are kept in external memory. In addition the array references with self-reuse are partially mapped to registers. This strategy leads to designs with a 38.4% advantage in terms of clock cycles over the CDL designs. The more complicated CMA design leads to a 5.3% clock degradation. Overall, however, the execution time of CMA is 35.1% faster than the CDL designs.

The HC designs are almost identical to the CMA designs, with the difference that in the HC mapping all the references with self-reuse are scalar replaced in registers. This results in an average of 18% fewer clock cycles and a 4% slower clock rate, but an overall 13.3% improvement in execution time for the HC designs.

Figure 6.13: Resource usage for all benchmarks using 4 memories and 64 registers. (a) Number of Slices; (b) Number of Used RAMs; (c) Consumed RAM Bits; (d) Consumed Register Bits.

6.5.5.2 Area Performance

FIR: In this case all the designs use some number of registers. The HC design uses the largest number of registers and has the largest area, surpassing the CMA version by 8.9%. The CMA design uses the same number of registers as the CDL version but has to manipulate the MIV reference. This requires additional work that results in a 5.4% larger area compared to the CDL version. The CMA design uses 43% fewer RAM bits compared to both the CDL and HC designs. In the case of CDL a large array is mapped to the on-chip RAM, while the CMA design only maps the data corresponding to its reuse chain. The HC design does more scalar replacement in registers, which requires the corresponding full array in RAMs.

MM: For this application, unlike the CMA and HC designs, the CDL design does not use any registers. This directly affects the area, resulting in a 43% larger area for the CMA and HC designs. For this kernel all the designs use an identical number of RAM bits.
The CMA and HC designs, however, also use 144 bits of registers.

JAC: Here the CDL and CMA designs both utilize a large number of registers, while the HC design only uses the block RAMs. In addition the CDL version considers more arrays for mapping to block RAMs. As a result the CMA design is 7% smaller than the CDL version while 23% larger than the HC equivalent. Instead of the replication used in the HC design, the CDL and CMA designs each use 576 bits of registers, and their RAM usage is three times that of the HC design.

HIST: For this application the CDL solution does not use any registers, the CMA design uses a subset of the available registers, and the HC version utilizes all the registers. This directly affects the area, with the CMA design being 31% larger than the CDL and 27% smaller than the HC versions. As the CDL design maps the whole arrays to the RAMs, the CMA and HC designs use 91% fewer RAM bits than the CDL design.

BIC: For this application the small number of registers forces the CDL solution to use only the block RAMs. The CMA and HC designs, on the other hand, use both registers and RAMs, with the HC design using slightly more registers. The scalar replacement in registers creates very complex codes in the CMA and HC cases. The CMA design is 5.3% smaller than the HC design while 114% larger than the CDL design. Furthermore, the CMA design uses 90% and 48% fewer RAM bits compared to the CDL and HC designs, respectively.

6.5.5.3 Summary

In these experiments, and in terms of execution time, the CMA versions achieve between a 4.8% and a 47.3% (on average 29.1%) gain over the CDL versions. At the same time they exhibit between a 0% and a 48.5% (on average 17.9%) degradation relative to their HC counterparts. This large variance is a direct result of using registers in the various designs. Mapping the reusable data to registers can decrease the number of cycles dramatically; at the same time, an aggressive scalar replacement in registers can have a substantial negative effect in terms of code complexity and subsequently clock rates.

Compared to the CDL designs, CMA achieves its advantage by mapping the data to registers only if this improves the critical path(s). Also, for overlapping data and in the case of insufficient registers, CMA scalar replaces and replicates the reuse chain in multiple RAMs, while CDL maps the entire data to a single RAM.

In comparison with the HC designs, and as a result of the inaccuracy in the calculation of critical paths, the CMA versions might over-utilize or under-utilize the registers. This is prevented in the HC versions, where register mapping is applied carefully based on the characteristics of the computation.

In terms of area, the use of registers has a direct negative effect on the design size. Not only do the registers consume additional space, they also introduce a large degree of complexity in the design. The effects of intricate routing and of the sophisticated code generation that follows the application of scalar replacement can be observed in the area outcomes. Furthermore, as the scalar replacement in registers accesses the data in RAMs, applying scalar replacement to more arrays translates to an increase in the required RAM bits.

The CMA designs are on average 31% larger than the CDL versions but use 45% fewer RAM bits. On the other hand the CMA designs are 3.1% smaller than their HC counterparts while using 41.6% more RAM bits.

6.6 Analysis of the Results

Our CMA mapping algorithm shows a solid advantage over the CDL strategy across all studied kernel codes and for all cases.
We have identified that CMA achieves its performance gains due to the following:

• Considering the schedule, the critical paths, and the interaction between the references in the computation.

• Basing the mapping decisions on an analysis of the reuse of the data elements, as opposed to their overlap.

• Utilizing data replication in cases where this technique increases the parallelism, while limiting the storage requirements.

• Leaving the data with no reuse in external memory. Not only does this simplify the address generation schemes in many cases, but the accesses to these elements can be overlapped with the actual computation, so their latencies are mostly hidden.

We have also investigated the sources of discrepancy between the CMA and HC designs. This difference mainly corresponds to the inherent differences between the representation of an application code in a sequential high-level language like C versus a hardware description language like VHDL.

• Critical path calculation: Identifying the critical paths of the computation is the core of the CMA algorithm. A simple computation in the C program might translate to a much more complicated VHDL code depending on the data/memory accesses, address generation, temporary variables, data shifts/rotations, loop overhead, etc. The lack of this low-level information in the high-level C code prevents a precise calculation of the critical paths. As a result of this deviation the algorithm might allocate the data differently.

• Clock discretization: The synthesis tool performs some operations (like memory accesses) only at the edge of the clock. This can cause many clock cycles to be only partially used. For example, if an address calculation is required before accessing a memory location, one full clock cycle is spent on it, no matter how small that calculation is. This behavior is not captured in the high-level code and hence affects the calculation of the length of the paths.

• Address generation: This can be a significant issue for loops with small computations and/or large bounds. If the address generation cannot be overlapped with the rest of the computation, its associated latency increases the computation time of the loop. This increase, multiplied by the number of loop iterations, can be responsible for a substantial part of the execution time.

• Greedy nature of the algorithm: Even though CMA considers the relation between different arrays, it does so only one cut at a time. Due to this lack of a global perspective, CMA might select the cuts, or the elements of a selected cut, in an order that is not optimal. This might cause multiple references in different cuts to share the same bandwidth, which subsequently hurts the performance.

• Precision of metrics and thresholds: In any greedy algorithm, the selection of suitable metrics is a deciding factor in the performance of the algorithm. While some metrics (like gain) are derived analytically, others (like the threshold) are set empirically and might not be optimal.

Despite these shortcomings, the CMA algorithm leads to designs that outperform the CDL designs, even with the same resources, and that perform very close to the hand coded solutions. This makes the CMA strategy an effective technique for mapping an application's data to the storage structures of an FPGA.

6.7 Chapter Summary

In this chapter we described our experiments for a set of five benchmarks, each with different characteristics. We demonstrated the effectiveness of our CMA algorithm for identifying the best storage allocation strategy for the array references of a loop nest.
We also compared these designs with the designs resulting from other common mapping strategies, namely the Naive, Custom Data Layout, and Hand Coded designs. We presented the results for two configurations, first assuming no registers and then using only a limited number of registers.

The results are highly dependent on the type and size of the target benchmark. However, using no registers, we achieved up to a 53.9% performance gain over the CDL versions in terms of execution time while consuming 12% more area. The same set of experiments showed only an average 2.15% degradation in execution time and a 3.2% area improvement compared to the HC versions. Exploiting a limited number of registers, our gains were between 4.8% and 47.3% in terms of execution time and 31% in terms of area over the CDL versions. In comparison to the HC versions we observed a 17.9% performance loss while consuming 3% more area.

We investigated the sources of discrepancies between the CMA and HC designs as well as with the CDL designs. Overall, CMA improves the performance through better utilization of the available resources, making it an effective mapping algorithm for this class of configurable computing architectures.

Chapter 7

Related Work

Minimizing the impact of the access time to memory has been a long-standing problem. The issues of data mapping and storage allocation have been well studied in the context of parallelizing compilers (e.g., [54, 47, 81, 51]). However, it is not possible to directly import these well-established techniques into the area of configurable architectures, as these devices have different requirements and characteristics [22, 41]. First, these architectures do not support the abstraction of a single address space, nor do they manage the consistency between the local and external memories. Due to the absence of any hardware support, mapping algorithms need to be precise in explicitly saving/retrieving the data to/from the different storages. Finally, the limitations in terms of area, bandwidth, and power make this problem more sensitive.

In this chapter we first highlight some projects in the area of configurable architectures in section 7.1. Our reuse analysis and scheduling considerations are an extension of the work described in sections 7.2 and 7.3. We survey the related work in terms of storage allocation in section 7.4. Lastly we summarize the differences of our approach in section 7.5.

7.1 Configurable Architectures and Compilation Support

Contemporary configurable architectures, both in industry and academia, have configurable storage resources. The Xilinx Virtex-II [83] family of FPGAs has a limited set of RAM blocks, as well as a large number of available slices that can be organized as discrete registers. PACT's XPP configurable array [59] organizes its processing elements in a data-flow oriented execution model where each element has access to local registers as well as global RAMs. Another effort is represented by the Xtensa family of processors [84], where only the internal pipeline of a RISC processor can be configured, in a very rigid fashion so as not to destroy the timing of the overall processor design.

In terms of academic projects, PipeRench [34] virtualizes the notion of hardware resources through the introduction of a virtual stripe, which can be context-switched in every clock cycle. During execution, each stripe can only access a small set of registers during the life-time of the slice before the values are passed along to another stripe.
The RaPiD [23] architecture is organized as a spatial pipeline of functional units and registers and focuses on instruction-level parallelism and fine-grained computations. Finally, the Garp [42] architecture merges a traditional RISC processor with an FPGA-like fabric with which it communicates via a set of predefined memory-mapped registers.

The compilation effort supported by each architecture, targeting the mapping and management of storage resources, is very limited. The typical approach relies on data-flow analysis for scalar variables, focusing on very simple array access patterns for which there are no dependences.

In terms of compilation support, the Cameron project [57] operates on programs written in SA-C, a single-assignment subset of the C language. The SA-C compiler includes various loop-level transformations; however, it requires inserted pragmas and specific language constructs to explicitly determine the application of transformations for different references. The Napa-C language [33] includes a selection of annotations and library function calls that programmers need to use for declaring various processes, signals, and data streams. The RaPiD [23] compiler is designed for a C-like language called RaPiD-C. Using this language a programmer needs to explicitly specify the parallelism, data movement, and partitioning in an application. Additional efforts in the area of new languages include Handel-C [20], SpecC [29], and MATLAB [58]. Although some of these languages have been integrated into commercial products, they put a burden on programmers by requiring them to learn new languages and paradigms. Furthermore, they lack any advanced analysis of the data-flow and parallelism present in a program.

In our work we rely on the precision of array data dependence analysis for un-annotated programs in imperative programming languages, such as C or FORTRAN, to uncover and exploit the opportunities for reuse. The compiler uses this information to automatically allocate and manage the location of the data across the various storage structures.

7.2 Data Reuse Analysis and Compiler Transformations

For traditional cache-based architectures many researchers have used data dependence analysis frameworks and various optimizations in order to detect and exploit locality in scientific computations [54, 53, 47, 81, 36, 21, 30, 63]. Most, if not all, of the previous work in this area has focused on loop reordering transformations to improve the locality of the transformed code. In this context researchers have developed various analysis frameworks for capturing self- and group reuse for a given array reference and between various array references.

Loop tiling [81] is the fundamental loop transformation that allows compilers to convert reuse into locality that can then be exploited by data caches. Scalar replacement [15] is another technique to exploit the reuse in array references. It has been extensively studied for traditional architectures [17, 18], focusing on reuse at the innermost loop level. More recently this technique has been extended to FPGA-based designs [70], considering only a subset of array references.

In this dissertation we have extended data reuse analysis by considering the more sophisticated data reuse indicated by Multiple-Induction Variables, and by capturing the reuse at multiple loop levels in a given loop nest. The need to capture these more sophisticated data reuse patterns requires a fundamentally different formulation of the problem, based on reuse vectors rather than dependence vectors.
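To make the scalar replacement transformation discussed above concrete, the following is a minimal sketch of the classic innermost-loop case; it is a generic three-point averaging loop written for exposition, not one of the kernels studied in this dissertation, and the function and array names are assumptions. The three references a[i-1], a[i], and a[i+1] read overlapping data, so the values can be carried in scalars (which a register allocator or synthesis tool can bind to registers) and each array element is loaded from memory only once.

/* Generic sketch of scalar replacement; not code from the dissertation.
 * Assumes n >= 3.                                                         */
void smooth(const float *a, float *b, int n) {
    float a0 = a[0];                  /* pre-load the start of the reuse chain */
    float a1 = a[1];
    for (int i = 1; i < n - 1; i++) {
        float a2 = a[i + 1];          /* one new load per iteration            */
        b[i] = (a0 + a1 + a2) / 3.0f; /* former a[i-1], a[i], a[i+1] accesses  */
        a0 = a1;                      /* rotate the scalars for the next i     */
        a1 = a2;
    }
}

With an MIV subscript such as a[i + j] inside a doubly nested loop, the reuse chain spans iterations of both loops rather than only the innermost one, which is exactly the kind of pattern the extended, reuse-vector-based analysis described above is meant to capture.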
7.3 Effects of Scheduling

A body of work reflects the importance of the interplay between memory access time and scheduling. Murthy et al. [56] suggest scheduling strategies for minimizing memory usage for synchronous dataflow (SDF) specifications. This technique is based on utilizing the local storage by merging the input and output buffers. In [66] Rixner et al. introduce a technique called memory access scheduling, in which the references are reordered and operations are scheduled in order to exploit the locality and hence optimize the memory bandwidth usage.

In the context of parallelizing compilers, Thies et al. [76] argue that the gains of parallelism are generally overshadowed by the costs of data locality and communication. To address this they introduce an intricate mathematical framework that incorporates the schedule and storage constraints to analyze the tradeoffs between parallelism and storage space. The work by Szymanek et al. in [75] presents a framework for exploring the data assignment and scheduling for an application with multiple tasks in a multi-layer memory system. The quality of the assignments and schedules is assessed based on various costs and constraints. In [79] Wang et al. present a data partitioning and scheduling approach to balance the ALU and memory operations to achieve an optimal overall schedule. They utilize data prefetching and software pipelining by considering the access pattern information. Finally, Zaretsky et al. in [85] describe the effects of scheduling and register allocation in mapping software binaries to FPGAs using their FREEDOM compiler.

In our work we leave the details of scheduling to the Monet [55] scheduler. However, we exploit the data-flow information of the computation to co-allocate storage to references on the critical path.

7.4 Storage Allocation and Management

Storage allocation and management is an important issue and, not surprisingly, has been extensively reported in the literature. We now outline some of the research efforts in the domain of configurable architectures and embedded systems that directly correlate with the approach and overall goals of our work.

7.4.1 Allocation for Configurable Architectures

In the context of configurable and custom architectures, various authors attempt to minimize the overall VLSI area dedicated to memories as well as the number of accesses and the bandwidth requirements [82, 65, 43, 46]. Gokhale et al. [32] describe an algorithm for the mapping of array variables to memory banks in order to minimize the execution time. In this approach, using a precedence graph, variables that are accessed at the same time are distributed among different memory banks. Each individual array, however, is wholly mapped to the same memory bank, thus constraining all accesses to the same array to the available ports of that bank. The algorithm does not consider the capacity or bandwidth limitations and is insensitive to data reuse.

Weinhardt and Luk in [80] use scheduling and data reuse information in order to exclusively map the arrays to RAM blocks. However, in this approach arrays are not distributed and all accesses to the same array are sequentialized. In our own recent work [8] we focused exclusively on assigning array data elements to the available registers, taking into account the reuse patterns in the code. Gong et al. [35] describe an approach that partitions the data among various RAMs based on a loop's iteration space and the arrays' data foot-prints.
Their objective is to minimize the execution time considering a desired frequency at a very early stage of the design flow. Although this approach considers the capacity constraints of the RAMs, there is no consideration of bandwidth, critical paths, reuse analysis, or registers.

Dimitroulakos et al. [26] present a compiler method for application mapping in coarse-grained reconfigurable architectures. This memory-aware technique, with the goal of reducing the required memory bandwidth, focuses on the small RAMs associated with processing elements (PEs). The complete mapping process is a combination of scheduling the operations, mapping the operations to processing elements, and routing the data through specific interconnects. The algorithm is a greedy one based on the total cost of these three phases.

The work by Guo et al. [38] attempts to balance computation and I/O by exploiting the reuse of window-based operations in on-chip buffers for a domain-specific language. The data is originally mapped to the on-chip RAMs and the reusable data is subsequently brought to buffers and sent to the processing elements, without considering the area limitations or the critical paths.

The work by Al Atat et al. [1] maps the variables to memory banks instead of registers. The objective is to decrease the area through a reduction of the interconnects and multiplexors. The data is distributed among different memory banks based on a pre-scheduled control/data flow graph. Variables that are not read/written in the same control step are mapped to the same memory and are distributed otherwise. Here there is no notion of data access patterns or bandwidth limitations.

Hanning et al. [40] describe a methodology to map regular nested loops to coarse-grained configurable arrays. They take advantage of loop parallelization and try to minimize the number of processor elements as well as the latency on a PACT architecture. Schmit et al. [68] describe an approach for mapping arrays to physical memories, considering the memory cost and the length of the schedule. Their technique is based on grouping the arrays and binding these groups to various memory components. They use a simulated annealing approach to find the best design.

Diguet et al. [25] discuss the issue of data reuse for power optimization in hierarchically organized memories, whereas Diniz and Park [28] describe a compiler analysis algorithm that exploits reuse in inner loops but only targets FPGA-based hardware designs. In [52] Lee et al. propose an algorithm to map loops onto DRAA (Dynamically Reconfigurable ALU Array) architectures. The objective of the work is to maximize the memory interface utilization. The algorithm is based on clustering the input operations for different rows/columns, such that the number of memory operations in each cluster is smaller than the number of memory buses for that row/column.

Barua et al. [11] apply modulo loop unrolling to the innermost loop level to map distinct sections of an array to different RAMs in a cyclic fashion, regardless of any reuse information. More recently So et al. [72] refined this approach by examining the data access patterns of the various array references in a loop and developing a strategy to map the various disjoint address streams to different RAMs. They use dependence vectors to determine which loops carry data reuse and what the minimal reuse distance is. They do not cover the case of MIV subscripts, nor do they address the issues of scheduling.
The success of these approaches relies on the various data streams being non-overlapping so as to maximize the memory bandwidth. The analyses proposed in this dissertation significantly extend the scope and reach of the data reuse opportunities that any of these efforts can handle.

7.4.2 Allocation for Embedded Systems

Targeting the domain of embedded systems, Panda et al. [61] describe a methodology to partition data between on-chip and off-chip storage in order to minimize cache conflicts, thereby improving the performance and the consumed power. In [77, 2] the authors outline an approach that assigns different application-level data structures to the memory heap, global memory, and stack data. The effort described in Sudarsanam's work [73] uses the input instruction stream to build a constrained access graph and then uses simulated annealing techniques to find a feasible combined mapping of registers and memory banks for the data. The work by Pande et al. [62] focuses on embedded and DSP applications and describes an approach that uses the array definitions and loop bounds to identify the footprint associated with each array reference within a loop nest, trying to allocate array variables to memories. Jha et al. [44] map the data to a hierarchy of memory modules selected from a library of fixed components, considering constraints on the number of words, ports, and bit-width.

Researchers at IMEC have extensively explored the issues of storage allocation and memory optimization [5, 7, 19, 78]. In these works the data is mapped to a hierarchical memory structure based on data reuse, while minimizing power consumption and area usage. The work in [48] attempts to reduce the interconnection cost for a data assignment to multi-ported memories.

In contrast, our work deals with architectures that have neither cache structures nor a unified address space for all of the data storage structures. In our approach, the compiler has to explicitly place and manage the location of a given data item at all times. If the data is modified locally, the compiler must ensure that when the space used by that data item is to be reused by another item, the data corresponding to the first item is updated in memory.

7.4.3 Register Allocation

With respect to register allocation, and given its significance and intractable worst-case complexity, many researchers have developed various algorithmic strategies. These approaches are based on graph coloring [13], spill minimization [49], greedy techniques [15], simulated annealing [73], or linear programming [4], and target the limited number of registers available in contemporary architectures. Our register allocation approach differs from previous efforts in that we use reuse information to select the more profitable array references in order to limit the number of required registers.

7.4.4 Custom Memory Design

The notion of custom memory design is another remedy for the problem of memory performance in terms of execution time or power. Kandemir et al. [45] design scratch-pad memories based on the size of the array sections that exhibit reuse at different levels of the loop. Grun et al. [37] customize the local memory based on the access pattern and locality type of different variables. Zhao and Malik [86] estimate the minimum memory size requirement based on a formulated live variable analysis. The authors in [14] develop a memory size model considering the life times of variables and their associated processes.
Finally, the work by Balasa et al. [6] finds the minimum number of memory locations based on a data flow analysis operating on groups of signals.

Here we map the data to a specific target architecture as opposed to designing a memory hierarchy.

7.5 Chapter Summary

While building on the foundation of this related work, we can distinguish our research in several aspects. First, we work with an unannotated C program, with no special extensions or constructs, as well as with commercially available FPGAs and synthesis tools. Second, we have addressed the fundamentally hard problem of allocating array data in the presence of limited resources in terms of storage and bandwidth. Third, we use a precise notion of the overlap of the array sections that the various array references access. We also exploit the notion of precedence by using the data-flow graph information and preferably targeting the references on the critical path of the computation. Fourth, we use several data-oriented compiler transformations and different storage resources. Finally, we map the data to a specific target architecture as opposed to designing a memory hierarchy.

Chapter 8

Conclusion

Given the extreme flexibility of future configurable architectures, there is a growing need for advanced analyses and algorithms in order to effectively and efficiently explore the wide range of design solutions when mapping computation data to internal configurable storage. To address this concern, in this dissertation we develop and evaluate compiler algorithms that map a computation's data to a set of heterogeneous storage resources. The novelty of our approach lies in creating a single framework that combines various high-level compiler techniques with lower-level scheduling information, while considering the target architecture's resource constraints.

In this chapter we first present the contributions of this dissertation. We then discuss future extensions to this research.

8.1 Contributions

The goal of this research is to develop and evaluate a compiler algorithm that maps the arrays of a loop-based computation in an unannotated C program to an architecture with a set of internal memories. Our objective is to minimize the overall execution time while considering the capacity and bandwidth constraints of the storage structures. To address this goal we have developed the following major areas.

8.1.1 Analysis and Quantifying the Reuse

In order to identify and exploit the data reuse we use the abstraction of reuse vectors. We describe the structure and properties of the reuse vectors and present how to compute and interpret these vectors for the cases where array references have SIV and MIV subscripts. We utilize the structure of reuse vectors to derive analytical formulas that quantify the reuse for the possible reuse types in SIV subscripts. We also develop analytical formulas for the case of self reuse in MIV references with two induction variables, and we explain how to reduce the more complex MIV cases to the one with two induction variables. This quantitative information forms the basis of our cost factor for detecting the least expensive data references in terms of required storage.

8.1.2 Analysis of the Critical Paths

We analyze the role of scheduling and critical paths in an effective storage allocation. We use the operational and data dependency information embedded in the critical paths of the computation in order to identify the most profitable array references. We introduce the notions of the critical graph and its cuts.
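As a brief aside to ground the SIV and MIV terminology used throughout these contributions, the fragment below contrasts the two subscript forms; it is a made-up example written only for exposition, not one of the dissertation's benchmarks.

/* Illustrative fragment only; the names and bounds are assumptions.
 * a[i + 2] is an SIV subscript: a single induction variable (i), so the
 *          same element is reused by every iteration of the inner loop.
 * b[i + j] is an MIV subscript: multiple induction variables; iterations
 *          with equal i + j touch the same element, so the reuse chain
 *          spans both loop levels.
 * Assumes a[] has at least n + 2 elements and b[] at least n + m - 1.     */
long siv_miv_example(const int *a, const int *b, int n, int m) {
    long s = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            s += a[i + 2] + b[i + j];
    return s;
}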
We present experimental results that exhibit the advantages of scalar replacement based on the notion of cuts over techniques that rely only on the reuse information. We also define the concepts of Desired Latency (DL) for a data reference and Delivery Time (DT) for a storage, and the inter-relation between the two concepts. Finally, we introduce the Threshold metric, which controls the selection of critical paths and the amount of improvement in the overall storage allocation.

8.1.3 Custom Memory Allocation Algorithm

We present a data mapping algorithm that is based on the combined knowledge of reuse analysis, critical paths/scheduling, resource constraints, and the capabilities of various allocation strategies. Our greedy algorithm considers various data transformations, namely data distribution, data replication, and scalar replacement, as well as multiple storage types, namely off-chip memory, on-chip RAM blocks, and on-chip registers. This algorithm minimizes the execution time while keeping the design within the storage limitations. We also introduce the metrics of cost and gain, which help in assigning the storage to the most beneficial cuts and data references.

We present experimental results for a set of five benchmarks, each with different characteristics, targeting a Xilinx Virtex™ device. We demonstrate the effectiveness of our CMA algorithm and compare it with other common mapping strategies, namely the Naive, Custom Data Layout, and Hand Coded designs.

8.2 Future Work

Here we outline some areas that are interesting to investigate, either for improving our techniques or for complementing them.

• Better Estimation of the Critical Graph: A major source of discrepancy between our solutions and the hand coded solutions is the lack of precision in critical path estimation. Due to the differences between a high-level programming language and its low-level HDL representation, calculating the critical path at a high level is error prone. Improving the techniques for calculating these paths directly affects the performance of our allocation algorithm.

• Better Assessment of Clock Rate and Area: The factors of clock rate and area are major points of concern when designing for configurable devices. Optimization techniques very often complicate the designs so much that the degradation of the clock rate negates any improvement in clock cycles. Furthermore, the resulting routing complexity can increase the area in an unpredictable way. These effects are present in our experimental results and can influence the effectiveness of our allocation algorithm. Determining the clock rate and consumed area, however, are extremely hard problems. Place-and-route tools often do very complex and sometimes cryptic work in mapping a design to hardware. Developing techniques that closely estimate the routing complexity and/or the design area not only would improve the performance of our algorithm, but would be invaluable in any design process for configurable devices [50, 71].

• Expanding the Data Management Techniques: In our work we use three major mapping strategies in order to primarily cache the data in on-chip storage, namely data distribution, data replication, and scalar replacement. There are other techniques that exploit the storage and bandwidth resources to improve the memory performance of the computation. For example, software prefetching virtually eliminates memory access delays by scheduling a memory access ahead of the time when it is needed [3, 16].
In this way the storage requirement of the application decreases at the price of more bandwidth. Another technique would be to use LIFO and FIFO buffers in order to reduce the address generation complexity. Combining these techniques with our proposed algorithm can lead to better utilization of the limited storage and bandwidth resources.

• Increasing the Granularity: We consider the memory banks to be identical irrespective of their physical placement. However, incorporating this placement information in our framework can lead to allocations in which the storage holding the data is close to the functional unit operating on the data. This reduces the routing complexity and as a result improves the performance. Another opportunity for refinement is to change the data mapping in the middle of a loop's execution. Currently we fix the allocation for the entire execution of the loop. For larger loops with array references that have non-overlapping lifetimes this might be a limiting factor.

• Allocation for Multiple Loops: In this work we consider the allocation for the computation in a single loop nest. For applications with multiple nests it is desirable to analyze the reuse patterns between the various loop nests [87].

In terms of future systems we believe that the storage management issue will become even more significant. The ever-growing presence of real-time and power-aware applications, combined with the increase in transistor counts, has created an architectural trend towards more controllable storage, especially in configurable/embedded devices. Considering this trend, and given the continuing gap between computation and data access latencies, suitable data management becomes an essential factor in achieving high performance on future architectures. The work described in this dissertation is a step in this direction.

Bibliography

[1] H. Al Atat and I. Quaiss. Register Binding for FPGAs with Embedded Memory. In Proc. of the IEEE Symp. on FPGAs for Custom Computing Machines (FCCM), 2004.

[2] O. Avissar, R. Barua, and D. Stewart. Heterogeneous Memory Management for Embedded Systems. In Proc. of the Int. Conf. on Compiler, Architecture and Synthesis for Embedded Systems (CASES), 2001.

[3] A. Badawy, A. Aggarwal, D. Yeung, and C. Tseng. The Efficacy of Software Prefetching and Locality Optimizations on Future Memory Systems. In Journal of Instruction Level Parallelism, Vol. 6, 2004.

[4] M. Balakrishnan and P. Marwedel. Integrated Scheduling and Binding: A Synthesis Approach for Design Space Exploration. In Proc. of the 26th Design Automation Conference (DAC), 1989.

[5] F. Balasa, F. Catthoor, and H. De Man. Dataflow-driven Memory Allocation for Multi-dimensional Signal Processing Systems. In Intl. Conf. on Computer Aided Design (ICCAD), 1994.

[6] F. Balasa, F. Catthoor, and H. De Man. Exact Evaluation of Memory Size for Multi-dimensional Signal Processing Systems. In Intl. Conf. on Computer Aided Design (ICCAD), 1993.

[7] F. Balasa, F. Catthoor, and H. De Man. Practical Solutions for Counting Scalars and Dependences in ATOMIUM - A Memory Management System for Multidimensional Signal Processing. In IEEE Trans. on Computer-Aided Design and Integration of Circuits and Systems, 16(2):133-145, 1997.

[8] N. Baradaran and P. Diniz. A Register Allocation Algorithm in the Presence of Scalar Replacement for Fine-Grain Architectures. In Proc. of the Design Automation and Test in Europe (DATE), 2005.

[9] N. Baradaran, P. Diniz, and J. Park. Extending the Applicability of Scalar Replacement to Multiple Induction Variables. In Proc.
of the 17th Workshop on Languages and Compilers for Parallel Computing (LCPC), 2004.

[10] N. Baradaran, P. Diniz, and J. Park. Compiler Reuse Analysis for the Mapping of Data in FPGAs with RAM Blocks. In Proc. of the IEEE Intl. Conference on Field-Programmable Technology (FPT), 2004.

[11] R. Barua, W. Lee, S. Amarasinghe, and A. Agarwal. Maps: A Compiler-Managed Memory System for RAW Machines. In Proc. of the ACM Intl. Symp. on Computer Architecture (ISCA), 1999.

[12] R. Barua, W. Lee, S. Amarasinghe, and A. Agarwal. Compiler Support for Scalable and Efficient Memory Systems. In IEEE Transactions on Computers, 50(11), pp. 1234-1247, 2001.

[13] P. Briggs, K. Cooper, K. Kennedy, and L. Torczon. Coloring Heuristics for Register Allocation. In Proc. of the ACM Conf. on Programming Language Design and Implementation (PLDI), pages 275-284, 1989.

[14] L. Cai, H. Yu, and D. Gajski. A Novel Memory Size Model for Variable-Mapping in System Level Design. In Proc. of the Asia and South Pacific Design Automation Conference (ASPDAC), 2004.

[15] D. Callahan, S. Carr, and K. Kennedy. Improving Register Allocation for Subscripted Variables. In ACM Conf. on Programming Language Design and Implementation (PLDI), 1990.

[16] D. Callahan, K. Kennedy, and A. Porterfield. Software Prefetching. In Proc. of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 1991.

[17] S. Carr and K. Kennedy. Scalar Replacement in the Presence of Conditional Control Flow. In Software: Practice and Experience, 24(1), pp. 51-77, 1994.

[18] S. Carr and P. Sweany. An Experimental Evaluation of Scalar Replacement on Scientific Benchmarks. In Software: Practice and Experience, 33(15), pp. 1419-1445, 2003.

[19] F. Catthoor, K. Danckaert, K.K. Kulkarni, E. Brockmeyer, P.G. Kjeldsberg, T. van Achteren, and T. Omnes. Data Access and Storage Management for Embedded Programmable Processors. Kluwer Academic, 2002.

[20] Celoxica Ltd. Handel-C for Hardware Design, 2002.

[21] S. Chatterjee, E. Parker, P. Hanlon, and A. Lebeck. Exact Analysis of the Cache Behavior of Nested Loops. In Proc. of the ACM Conf. on Prog. Language Design and Implementation (PLDI), 2001.

[22] K. Compton and S. Hauck. Reconfigurable Computing: A Survey of Systems and Software. In ACM Computing Surveys (CSUR), 34(2), pp. 171-210, 2002.

[23] D.C. Cronquist, P. Franklin, S.G. Berg, and C. Ebeling. Specifying and Compiling Applications for RaPiD. In Proc. of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), 1998.

[24] R. Cytron, J. Ferrante, B. Rosen, and M. Wegman. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. In ACM Trans. on Prog. Lang. and Systems (TOPLAS), 13(4):451-490, 1991.

[25] J.P. Diguet, S. Wuytack, F. Catthoor, and H. De Man. Formalized Methodology for Data Reuse Exploration in Hierarchical Memory Mappings. In Proc. of the IEEE Intnl. Symp. on Low Power Design, 1997.

[26] G. Dimitroulakos, M. Galanis, and C. Goutis. A Compiler Method for Memory-Conscious Mapping of Applications on Coarse-Grained Reconfigurable Architectures. In Proc. of the International Parallel and Distributed Processing Symposium (IPDPS), 2005.

[27] P. Diniz, M. Hall, J. Park, B. So, and H. Ziegler. Bridging the Gap between Compilation and Synthesis in the DEFACTO System. In Proc. of the 14th Workshop on Languages and Compilers for Parallel Computing (LCPC), 2001.

[28] P. Diniz and J. Park. Automatic Synthesis of Data Storage and Control Structures for FPGA-based Computing Machines. In Proc.
of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), 2000.

[29] D. Gajski, J. Zhu, R. Domer, A. Gerstlauer, and S. Zhao. SpecC: Specification Language and Design Methodology. Springer, 2000.

[30] D. Gannon, W. Jalby, and K. Gallivan. Strategies for Cache and Local Memory Management by Global Program Transformation. In Journal of Parallel and Distributed Computing, 5:587-616, 1988.

[31] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.

[32] M. Gokhale and J. Stone. Automatic Allocation of Arrays to Memories in FPGA Processors with Multiple Memory Banks. In Proc. of the IEEE Symp. on FPGAs for Custom Computing Machines (FCCM), 1999.

[33] M. Gokhale and J. Stone. NAPA C: Compiling for a Hybrid RISC/FPGA Architecture. In Proc. of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), 1997.

[34] S. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R. Taylor, and R. Laufe. PipeRench: A Coprocessor for Streaming Multimedia Acceleration. In Proc. of the 26th Intl. Symp. on Comp. Architecture (ISCA), 1999.

[35] W. Gong, G. Wang, and R. Kastner. Storage Assignment during High-Level Synthesis for Configurable Architectures. In Proc. of the ACM/IEEE Intl. Conf. on Computer-Aided Design (ICCAD), 2005.

[36] S. Ghosh, M. Martonosi, and S. Malik. Cache Miss Equations: A Compiler Framework for Analyzing and Tuning Memory Behavior. In ACM Transactions on Programming Languages and Systems, 1998.

[37] P. Grun, N. Dutt, and A. Nicolau. Access Pattern Based Local Memory Customization for Low Power Embedded Systems. In Proc. of the Design, Automation and Test in Europe (DATE), 2001.

[38] Z. Guo, B. Buyukkurt, and W. Najjar. Input Data Reuse in Compiling Window Operations onto Reconfigurable Hardware. In Proc. of Languages, Compilers and Tools for Embedded Systems (LCTES), 2004.

[39] S. Gupta, M. Luthra, N.D. Dutt, R.K. Gupta, and A. Nicolau. Hardware and Interface Synthesis of FPGA Blocks using Parallelizing Code Transformations. Invited talk at the special session on Synthesis for Programmable Systems at the Intl. Conference on Parallel and Distributed Computing and Systems, 2003.

[40] F. Hanning, H. Dutta, and J. Teich. Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays - Constraints and Methodology. In Proc. of the International Parallel and Distributed Processing Symposium (IPDPS), 2004.

[41] R. Hartenstein. A Decade of Reconfigurable Computing: A Visionary Retrospective. In Proc. of the Conf. on Design Automation and Test in Europe (DATE), 2001.

[42] J. Hauser and J. Wawrzynek. Garp: A MIPS Processor with a Reconfigurable Coprocessor. In Proc. of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 1997.

[43] W. Ho and S. Wilton. Logical-to-Physical Memory Mapping for FPGAs with Dual-Port Embedded Arrays. In Proc. of the International Workshop on Field Programmable Logic and Applications (FPL), 1999.

[44] P. Jha and N. Dutt. High-level Library Mapping for Memories. In ACM Trans. on Design Automation of Electronic Systems, 5(3), pp. 566-603, Jan. 1999.

[45] M. Kandemir and A. Choudhary. Compiler-Directed Scratch Pad Memory Hierarchy Design and Management. In Proc. of the ACM/IEEE Design Automation Conference (DAC), 2002.

[46] D. Karchmer and J. Rose. Definition and Solution of the Memory Packing Problem for Field-Programmable Systems. In Proc. of the ACM/IEEE International Conference on Computer-Aided Design (ICCAD), 1994.

[47] R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann Publishers,
San Francisco, 2002.

[48] T. Kim and C.L. Liu. Utilization of Multiport Memories in Data Path Synthesis. In Proc. of the ACM/IEEE Design Automation Conference (DAC), 1993.

[49] D. Kolson, A. Nicolau, N. Dutt, and K. Kennedy. Optimal Register Assignment to Loops for Embedded Code Generation. In ACM Trans. on Design Automation of Electronic Systems (TODAES), 1(2):251-279, 1996.

[50] D. Kulkarni, W. Najjar, R. Rinker, and F. Kurdahi. Fast Area Estimation to Support Compiler Optimizations in FPGA-based Reconfigurable Systems. In Proc. of the IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), 2002.

[51] M. Lam, E. Rothberg, and M. Wolf. The Cache Performance and Optimizations of Blocked Algorithms. In Proc. of the Sixth Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 63-74, 1991.

[52] J. Lee, K. Choi, and N. Dutt. An Algorithm for Mapping Loops onto Coarse-Grained Reconfigurable Architectures. In Proc. of the ACM Workshop on Languages, Compilers and Tools for Embedded Systems (LCTES), 2003.

[53] D. Maydan, J. Hennessy, and M. Lam. Efficient and Exact Data Dependence Analysis. In Proc. of the ACM Conference on Programming Language Design and Implementation (PLDI), 1991.

[54] K. McKinley, S. Carr, and C.-W. Tseng. Improving Data Locality with Loop Transformations. In ACM Trans. on Prog. Lang. and Syst., 18(4), pp. 424-453, July 1996.

[55] Mentor Graphics Inc. Monet, revision 44, 1999.

[56] P. Murthy and S. Bhattacharyya. Buffer Merging - A Powerful Technique for Reducing Memory Requirements of Synchronous Dataflow Specifications. In ACM Trans. on Design Automation of Electronic Systems (TODAES), 9(2), pp. 212-237, 2004.

[57] W. Najjar, B. Draper, A.P.W. Böhm, and R. Beveridge. The Cameron Project: High-Level Programming of Image Processing Applications on Reconfigurable Computing Machines. In Proc. of the PACT Workshop on Reconfigurable Computing, 1998.

[58] A. Nayak, M. Haldar, A. Choudhary, and P. Banerjee. Parallelization of MATLAB Applications for a Multi-FPGA System. In Proc. of the IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), 2003.

[59] XPP Technologies, Inc. The XPP White Paper, 2002.

[60] P. Panda, F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandercappe, and P. Kjeldsberg. Data and Memory Optimization Techniques for Embedded Systems. In ACM Transactions on Design Automation of Electronic Systems (TODAES), 6(2), pp. 149-206, 2001.

[61] P. Panda, N. Dutt, and A. Nicolau. On-Chip vs. Off-Chip Memory: The Data Partitioning in Embedded Processor Based Systems. In ACM Trans. on Design Automation of Electronic Systems (TODAES), 5(3), pp. 682-704, 2000.

[62] D. Bairagi, S. Pande, and D. Agrawal. Framework for Containing Code Size in Limited Register Set Embedded Processors. In Proc. of the ACM Workshop on Languages, Compilers and Tools for Embedded Systems (LCTES), 2000.

[63] N. Park, B. Hong, and V. K. Prasanna. Tiling, Block Data Layout, and Memory Hierarchy Performance. In IEEE Transactions on Parallel and Distributed Systems, 14(7), pp. 640-654, 2003.

[64] W. Pugh. Counting Solutions to Presburger Formulas: How and Why. In Proc. of the ACM Conf. on Prog. Language Design and Implementation (PLDI), 1994.

[65] I. Ouaiss and R. Vemuri. Hierarchical Memory Mapping During Synthesis in FPGA-based Reconfigurable Computers. In Proc. of the Conf. on Design Automation and Test in Europe (DATE), 2001.

[66] S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens. Memory Access Scheduling. In Proc. of the ACM Intl. Symp.
on Computer Architecture (ISCA), 2000.

[67] P. Schaumont, I. Verbauwhede, K. Keutzer, and M. Sarrafzadeh. A Quick Safari Through the Reconfiguration Jungle. In Proc. of the 38th Design Automation Conference (DAC), 2001.

[68] H. Schmit and D. Thomas. Array Mapping in Behavioral Synthesis. In Proc. of the International Symposium on System Synthesis (ISSS), 1995.

[69] B. So. An Efficient Design Space Exploration for Balance between Computation and Memory. PhD thesis, Dept. of Computer Science, Univ. of Southern California, Dec 2003.

[70] B. So and M. Hall. Increasing the Applicability of Scalar Replacement. In Proc. of the Intl. Conf. on Compiler Construction (CC), 2004.

[71] B. So, P. Diniz, and M. Hall. Using Estimates from Behavioral Synthesis Tools in Compiler-directed Design Space Exploration. In Proc. of the ACM/IEEE Design Automation Conference (DAC), 2003.

[72] B. So, M. Hall, and H. Ziegler. Custom Data Layout for Memory Parallelism. In Proc. of the International Symposium on Code Generation and Optimization (CGO), 2004.

[73] A. Sudarsanam and S. Malik. Register and Memory Bank Allocation for Software Synthesis in ASIPs. In Proc. of the 1995 Intl. Conf. on Computer-Aided Design (ICCAD), pp. 388-392, 1995.

[74] The Stanford SUIF Compilation System. Public Domain Software and Documentation, suif.stanford.edu.

[75] R. Szymanek, F. Catthoor, and K. Kuchcinski. Data Assignment and Access Scheduling Exploration for Multi-layer Memory Architectures. In Embedded Systems for Real-Time Multimedia, 2004.

[76] W. Thies, F. Vivien, J. Sheldon, and S. Amarasinghe. A Unified Framework for Schedule and Storage Optimization. In Proc. of the ACM Conf. on Programming Language Design and Implementation (PLDI), 2001.

[77] S. Udayakumaran and R. Barua. Compiler-decided Dynamic Memory Allocation for Scratch-pad Based Embedded Systems. In Proc. of the Int. Conf. on Compiler, Architecture and Synthesis for Embedded Systems (CASES), 2003.

[78] A. Vandecappelle, M. Miranda, E. Brockmeyer, F. Catthoor, and D. Verkest. Global Multimedia System Design Exploration Using Accurate Memory Organization Feedback. In Proc. of the ACM/IEEE Design Automation Conference (DAC), 1999.

[79] Z. Wang, M. Kirkpatrick, and E. Sha. Optimal Two-Level Partitioning and Loop Scheduling for Hiding Memory Latency for DSP Applications. In Proc. of the ACM/IEEE Design Automation Conference (DAC), 2000.

[80] M. Weinhardt and W. Luk. Memory Access Optimization for Reconfigurable Systems. In IEE Proc.-Comput. Digit. Tech., 148(3), pp. 105-112, 2001.

[81] M. Wolf and M. Lam. A Data Locality Optimization Algorithm. In Proc. of the ACM Conf. on Programming Language Design and Implementation (PLDI), pp. 30-44, 1991.

[82] S. Wuytack, F. Catthoor, G. de Jong, and H. De Man. Minimizing the Required Memory Bandwidth in VLSI System Realizations. In IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7(4), pp. 433-441, 1999.

[83] Xilinx Inc. Virtex 2.5V FPGA Product Specification (v2.4), 2000.

[84] Tensilica, Inc. Xtensa Architecture White Paper, 2002.

[85] D. Zaretsky, G. Mittal, X. Tang, and P. Banerjee. Evaluation of Scheduling and Allocation Algorithms While Mapping Assembly Code onto FPGAs. In ACM Great Lakes Symposium on VLSI, 2004.

[86] Y. Zhao and S. Malik. Exact Memory Size Estimation for Array Computations without Loop Unrolling. In Proc. of the ACM/IEEE Design Automation Conference (DAC), 1999.

[87] H. Ziegler. Compiler Directed Design Space Exploration for Pipelined Field Programmable Gate Array Applications. PhD thesis, Dept. of Electrical and Computer Engineering, Univ.
Appendix

Here we include the complete results of our allocation experiments in terms of execution time and resource usage. First we show the allocation results when no registers are available. Table A.1 reports the number of cycles, the clock rate, and the wall-clock execution time for all benchmarks, all unrolling factors, and all allocation policies. Table A.2 reports the number of slices, the number of storage bits, and the number of RAM blocks used in each case. For the experiments where a limited number of registers is available we focus on the designs with the 1x4 unrolling factor; Tables A.3 and A.4 show the timing and resource usage results, respectively, for these cases.

Table A.1: Timing results for all cases when using no registers.
(Each cell lists Cycles, Clk(ns), ET(ms) for the corresponding allocation policy.)

Kernel  Version | Naive                 | CDL                   | CMA                   | HC
FIR     (1x1)   | 338920  27.9   9.45   | 372712  26.95  10.04  | 327976  23.94  7.85   | 327976  23.94  7.85
FIR     (1x2)   | 257000  26.5   6.81   | 274408  29.17  8.0    | 246056  28.72  7.06   | 239880  28.97  6.94
FIR     (2x1)   | 238056  26.89  6.40   | 222184  27.14  6.03   | 200168  28.45  5.69   | 184072  26.87  4.94
FIR     (1x4)   | 216040  28.37  6.12   | 224232  29.9   6.70   | 207400  30.07  6.23   | 199496  30.2   6.02
FIR     (4x1)   | 187624  26.36  4.94   | 146920  28.5   4.18   | 105736  28.71  3.036  | 114504  29.17  3.34
FIR     (2x2)   | 197096  27.99  5.52   | 164840  30.35  5.00   | 132872  27.71  3.68   | 143688  29.54  4.24
MM      (1x1)   | 31777   25.55  0.81   | 36129   28.15  1.01   | 31777   25.55  0.81   | 31777   25.55  0.81
MM      (1x2)   | 25889   27.17  0.70   | 25889   28.65  0.74   | 23841   29.49  0.70   | 23841   29.35  0.69
MM      (2x1)   | 25633   27.92  0.71   | 22977   29.4   0.67   | 22945   28.62  0.65   | 22945   29.52  0.67
MM      (1x4)   | 22817   27.63  0.63   | 20513   30.26  0.62   | 20513   30.25  0.62   | 20513   30.26  0.62
MM      (4x1)   | 22433   28.9   0.64   | 19457   36.33  0.70   | 19457   35.26  0.68   | 19457   35.26  0.68
MM      (2x2)   | 22561   26.42  0.59   | 22993   30.28  0.69   | 22241   29.73  0.66   | 21217   30.12  0.63
JAC     (1x1)   | 50020   24.63  1.23   | 50020   28.11  1.41   | 24166   26.95  0.65   | 24136   29.77  0.72
JAC     (1x2)   | 47320   27.92  1.32   | 46296   30.04  1.39   | 21436   29.62  0.63   | 21436   30.33  0.65
JAC     (2x1)   | 47290   27.04  1.27   | 46330   28.88  1.33   | 22338   29.15  0.65   | 22338   28.19  0.62
JAC     (1x4)   | 45820   28.31  1.29   | 43772   29.85  1.31   | 20148   29.95  0.60   | 20148   30.57  0.62
JAC     (4x1)   | 45778   29.64  1.35   | 43794   32.8   1.43   | 21454   29.63  0.64   | 21454   34.33  0.73
JAC     (2x2)   | 45490   29     1.32   | 43120   34.12  1.47   | 20298   28.64  0.58   | 20298   34.36  0.69
BIC     (1x1)   | 145559  25.55  3.72   | 146400  26.8   3.92   | 129557  28.96  3.75   | 129557  28.96  3.75
BIC     (1x2)   | 105192  25.73  2.70   | 106033  27.73  2.94   | 81731   29.69  2.43   | 81731   29.69  2.42
BIC     (2x1)   | 100146  28.22  2.83   | 102011  29.62  3.02   | 98471   29.24  2.88   | 98471   29.24  2.87
BIC     (1x4)   | 74916   27.34  2.05   | 76781   27.07  2.07   | 49857   30.06  1.49   | 48169   32.79  1.57
BIC     (4x1)   | 74916   26.36  1.97   | 76781   28.75  2.21   | 56239   31.36  1.76   | 52991   30.21  1.60
BIC     (2x2)   | 83326   26.82  2.23   | 81827   28.02  2.29   | 66387   29.62  1.96   | 64763   29.88  1.93
HIST    (1x1)   | 119234  24.67  2.94   | 127426  26.09  3.32   | 94658   26.67  2.52   | 94658   26.98  2.55
HIST    (1x2)   | 106946  27.14  2.90   | 111106  26.07  2.89   | 76226   26.06  1.98   | 78274   27.36  2.14
HIST    (2x1)   | 106882  26.44  2.82   | 110978  28.33  3.14   | 76130   27.18  2.07   | 78205   28.02  2.19
HIST    (1x4)   | 100802  28.88  2.91   | 104962  28.96  3.04   | 68098   28.75  1.96   | 68098   28.75  1.96
HIST    (4x1)   | 100706  28.5   2.87   | 104770  28.76  3.01   | 67890   28.54  1.94   | 67890   28.54  1.94
HIST    (2x2)   | 100738  26.59  2.67   | 103810  29.33  3.04   | 67938   27.85  1.89   | 67938   27.56  1.87
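The three columns reported for each policy are related by simple arithmetic: the wall-clock execution time is the cycle count multiplied by the achieved clock period. The sketch below is a hypothetical post-processing helper, not part of the compiler described in this thesis, that recovers the ET(ms) column from the other two; the MM (1x1) Naive entry of Table A.1 is used as a check, and small last-digit differences elsewhere stem from rounding of the reported clock periods.

```python
def exec_time_ms(cycles: int, clk_ns: float) -> float:
    """Wall-clock execution time: cycle count times clock period, converted from ns to ms."""
    return cycles * clk_ns * 1e-6

# MM (1x1) under the Naive policy in Table A.1: 31777 cycles at a 25.55 ns clock.
print(f"{exec_time_ms(31777, 25.55):.2f} ms")  # prints "0.81 ms", matching the reported ET(ms)
```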
Table A.2: Resource usage for all cases when using no registers.
(Each cell lists Slices, RAM blocks, RAM bits for the corresponding allocation policy.)

Kernel  Version | Naive            | CDL              | CMA              | HC
FIR     (1x1)   | 376   3  25344   | 386   1  25344   | 374   2  13440   | 374   2  13440
FIR     (1x2)   | 494   3  25344   | 498   2  25344   | 503   3  13056   | 569   4  13440
FIR     (2x1)   | 519   3  25344   | 560   2  25344   | 545   2  13056   | 581   3  13440
FIR     (1x4)   | 593   3  25344   | 638   4  25344   | 679   4  13824   | 895   4  14208
FIR     (4x1)   | 646   3  25344   | 779   4  25344   | 809   4  13440   | 927   4  14208
FIR     (2x2)   | 680   3  25344   | 751   2  25344   | 759   4  13440   | 839   4  14208
MM      (1x1)   | 275   3  6912    | 340   1  6912    | 275   3  6912    | 275   3  6912
MM      (1x2)   | 303   3  6912    | 428   2  6912    | 390   4  6912    | 394   4  6912
MM      (2x1)   | 346   3  6912    | 487   2  6912    | 425   4  6912    | 451   4  6912
MM      (1x4)   | 385   3  6912    | 594   4  6912    | 550   4  6912    | 594   4  6912
MM      (4x1)   | 424   3  6912    | 792   4  6912    | 819   4  6912    | 819   4  6912
MM      (2x2)   | 410   3  6912    | 647   2  6912    | 710   4  8064    | 680   4  8064
JAC     (1x1)   | 342   2  18432   | 334   1  18432   | 415   2  1152    | 405   3  1701
JAC     (1x2)   | 395   2  18432   | 506   2  18432   | 659   4  2304    | 713   2  1152
JAC     (2x1)   | 403   2  18432   | 581   2  18432   | 735   4  2304    | 659   2  1152
JAC     (1x4)   | 612   2  18432   | 816   4  18432   | 1332  4  2304    | 1433  4  2304
JAC     (4x1)   | 755   2  18432   | 947   4  18432   | 766   4  2304    | 763   2  1152
JAC     (2x2)   | 603   2  18432   | 831   4  18432   | 1154  4  2304    | 1105  2  1152
BIC     (1x1)   | 341   3  16512   | 338   1  16512   | 450   2  896     | 450   2  896
BIC     (1x2)   | 343   3  16512   | 421   2  16512   | 627   4  1664    | 627   4  1664
BIC     (2x1)   | 345   3  16512   | 421   2  16512   | 670   4  1664    | 670   4  1664
BIC     (1x4)   | 369   3  16512   | 464   4  16512   | 949   4  1664    | 949   4  3328
BIC     (4x1)   | 352   3  16512   | 454   4  16512   | 727   4  1664    | 727   4  3328
BIC     (2x2)   | 440   3  16512   | 569   2  16512   | 980   4  1664    | 980   4  3328
HIST    (1x1)   | 430   4  53760   | 450   1  53760   | 344   3  4608    | 318   4  4608
HIST    (1x2)   | 547   4  53760   | 644   2  53760   | 492   4  4608    | 583   3  4608
HIST    (2x1)   | 561   4  53760   | 744   2  53760   | 533   4  4608    | 635   3  4608
HIST    (1x4)   | 643   4  53760   | 902   4  53760   | 799   4  4608    | 799   4  4608
HIST    (4x1)   | 679   4  53760   | 1087  4  53760   | 848   4  4608    | 848   4  4608
HIST    (2x2)   | 684   4  53760   | 960   2  53760   | 824   4  4608    | 824   4  4608

Table A.3: Timing results for all benchmarks when using 64 registers.
(Each cell lists Cycles, Clk(ns), ET(ms) for the corresponding allocation policy; all designs use the 1x4 unrolling factor.)

Kernel  Version | CDL                  | CMA                  | HC
FIR     (1x4)   | 96248   34.62  3.33  | 95058   29.83  2.83  | 64627   29.54  1.91
MM      (1x4)   | 20513   30.26  0.62  | 19649   30.07  0.59  | 19649   30.07  0.59
JAC     (1x4)   | 32559   40.7   1.32  | 20078   37.47  0.75  | 20148   30.57  0.61
BIC     (1x4)   | 76781   27.07  2.07  | 47281   28.53  1.34  | 40045   29.72  1.19
HIST    (1x4)   | 133602  28.96  3.86  | 68080   29.9   2.03  | 66993   28.76  1.92

Table A.4: Resource usage for all benchmarks when using 64 registers.
(Each cell lists Slices, RAM blocks, RAM bits, Register bits for the corresponding allocation policy; all designs use the 1x4 unrolling factor.)

Kernel  Version | CDL                   | CMA                   | HC
FIR     (1x4)   | 1433  4  25344  384   | 1511  4  14208  384   | 1659  3  25344  756
MM      (1x4)   | 594   4  6912   0     | 849   4  6912   144   | 849   4  6912   144
JAC     (1x4)   | 1902  4  9216   576   | 1757  2  9216   576   | 1433  4  2304   0
BIC     (1x4)   | 464   4  16512  0     | 992   4  1664   64    | 1048  4  3200   128
HIST    (1x4)   | 902   4  53760  0     | 1180  4  4608   192   | 1614  4  4608   768
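As an illustration of how these tables can be read together, the following sketch (illustrative arithmetic only, with values copied from Tables A.1 and A.2; it is not part of the compiler described in this thesis) computes the cycle-count improvement and the on-chip RAM-bit reduction of the CMA policy relative to the Naive policy for the JAC kernel at the 1x4 unrolling factor.

```python
# JAC (1x4) values taken from Table A.1 (cycles) and Table A.2 (RAM bits).
naive = {"cycles": 45820, "ram_bits": 18432}
cma   = {"cycles": 20148, "ram_bits": 2304}

cycle_improvement = naive["cycles"] / cma["cycles"]      # about 2.27x fewer cycles
ram_bit_reduction = naive["ram_bits"] / cma["ram_bits"]  # 8x fewer on-chip RAM bits
print(f"cycles: {cycle_improvement:.2f}x, RAM bits: {ram_bit_reduction:.0f}x")
```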
Abstract
Configurable architectures offer the unique opportunity of realizing hardware designs tailored to the specific data and computational patterns of a given application code. These devices provide customizable compute fabrics, interconnects, and memory subsystems that allow for large amounts of data and computational parallelism. This high degree of concurrency in turn translates into better performance. The flexibility and configurability of these architectures, however, create a prohibitively large design space when mapping computations expressed in high-level programming languages to these devices. Successfully exploring this space for the best mapping requires high-level program analyses and abstractions as well as automated tools.
Asset Metadata
Creator
Baradaran, Nastaran (author)
Core Title
Compiler directed data management for configurable architectures with heterogeneous memory structures
School
Andrew and Erna Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Science
Degree Conferral Date
2007-05
Publication Date
02/15/2007
Defense Date
01/10/2007
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
compilers, design automation, memory management, OAI-PMH Harvest
Language
English
Advisor
Diniz, Pedro C. (committee chair), Pinkston, Timothy M. (committee member), Prasanna, Viktor K. (committee member)
Creator Email
nastaran@ieee.org
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m262
Unique identifier
UC1166796
Identifier
etd-Baradaran-20070215 (filename), usctheses-m40 (legacy collection record id), usctheses-c127-166792 (legacy record id), usctheses-m262 (legacy record id)
Legacy Identifier
etd-Baradaran-20070215.pdf
Dmrecord
166792
Document Type
Dissertation
Rights
Baradaran, Nastaran
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu