HIGH-LEVEL SYNTHESIS OF MEMORY-INTENSIVE APPLICATION-SPECIFIC SYSTEMS

by

Pravil Gupta

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

August 1994

Copyright 1994 Pravil Gupta

UMI Number: DP22881. All rights reserved.

INFORMATION TO ALL USERS: The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

UMI DP22881. Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author. Microform Edition © ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346.

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Pravil Gupta under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dean of Graduate Studies
Dissertation Committee

Dedication

To my parents.

Acknowledgements

I take this opportunity to express my gratitude to several people who have made this thesis possible. I am grateful to my advisor, Prof. Alice Parker, for her constant guidance and inspiration during my dissertation work. She has been a great source of encouragement and support. Her numerous comments and suggestions made this thesis possible. I also wish to thank Profs. Melvin Breuer and Ken Goldberg for being on my dissertation and guidance committee, and Profs. Michel Dubois and Sandeep Gupta for being on my guidance committee.

I would like to thank my colleagues Chih-Tung Chen, Diogenes Silva, Shiv Prakash, Atul Ahuja, J.C. Batista-Desouza, Dong-Hyun Heo, Jen-Pin Weng, Kayhan Kucukcakar, and Yung-Hua Hung for their friendship and help. I benefited greatly from interacting with them. Other friends who made my years at USC pleasant are Rajgopal Srinivasan, Ishwar Parulakar, Deb Mukherjee, and Sridhar Narayanan.

My parents have been a continuous source of love, support and sacrifice. I dedicate this thesis to them. My brothers and sister were always encouraging. I especially thank my wife Pratibha for her patience through these years. Her love always provided me comfort.

This research was supported by the Semiconductor Research Corporation (Contract No. 89-DJ-075), the Advanced Research Projects Agency (Contract No. JFBI90092) and the National Science Foundation (Contract No. GER-9023979). I would like to thank these organizations for their support.

Contents

Dedication
Acknowledgements
List of Figures
List of Tables
Abstract

1 Introduction
  1.1 High-Level Synthesis
  1.2 System-Level Synthesis
    1.2.1 System-level components and issues
    1.2.2 Design methodology
  1.3 The USC Project
  1.4 Memory-Intensive Systems
  1.5 Motivation
  1.6 Introduction to Memory Design
  1.7 A Simple Example
  1.8 Problem Description
    1.8.1 Problem Statement
  1.9 Decisions made by SMASH
  1.10 Our Approach
  1.11 Storage Design Tradeoffs
    1.11.1 Storage Size vs. Number of Execution Cycles
    1.11.2 Number of Ports vs. Number of Execution Cycles
    1.11.3 Number of Ports vs. Size of the Storage
    1.11.4 3-Way Tradeoff
    1.11.5 A Storage Architecture Tradeoff Example
  1.12 SMASH as a Part of USC
  1.13 Thesis Organization

2 Related Research
  2.1 Introduction
  2.2 High-Level Synthesis Research
  2.3 Memory Architecture Research

3 Problem Approach
  3.1 Introduction
  3.2 Target System Architecture
    3.2.1 Datapath
    3.2.2 Storage Architecture
    3.2.3 Target Architecture Discussion
  3.3 Module Library Characteristics
    3.3.1 Functional Modules
    3.3.2 Storage Modules
  3.4 Clocking Scheme
  3.5 Overall Synthesis Approach
    3.5.1 Control Data Flow Graph Extraction
    3.5.2 Datapath Synthesis with Storage Tradeoffs
    3.5.3 Storage Architecture Synthesis
      3.5.3.1 Data Transfer Scheduling
      3.5.3.2 Module Allocation
    3.5.4 RTL Synthesis
      3.5.4.1 MABAL and Epoch
  3.6 Summary

4 Estimation Techniques
  4.1 Introduction
  4.2 Storage Cost Estimation
    4.2.1 Lower Bound on Read (Write) Ports on Buffers
      4.2.1.1 Computational Complexity
    4.2.2 Buffer Size Estimation
    4.2.3 Lower Bound on the Size of the Input Only Buffer
      4.2.3.1 Computational Complexity
    4.2.4 Lower Bound on the Size of I/O Buffer
    4.2.5 Computational Complexity
    4.2.6 Storage Structure Construction
      4.2.6.1 Implementing Storage Structure with Registers
      4.2.6.2 Implementing Storage Structure with Register Files
      4.2.6.3 Implementing Storage Structure with On-chip RAMs
  4.3 Functional Cost Estimation
    4.3.1 Lower Bound on Functional Modules
    4.3.2 Lower Bound on the Total Functional Cost
  4.4 Upper Bounds on the Design Parameters
    4.4.1 Upper Bound on Read (Write) Ports on Buffers
      4.4.1.1 Computational Complexity
    4.4.2 Upper Bound on the Size of Buffers
      4.4.2.1 Computational Complexity
    4.4.3 Upper Bound Cost of Storage Structure
    4.4.4 Upper Bound on the Number of Functional Modules
    4.4.5 Upper Bound on the Total Functional Cost
  4.5 Summary

5 Synthesis with Storage Tradeoffs Lookahead
  5.1 Introduction
  5.2 Datapath Scheduling Problem
    5.2.1 Problem Definition
    5.2.2 Various Approaches
  5.3 Datapath Scheduling in SMASH
  5.4 Preprocessing the CDFG
  5.5 Scheduling the CDFG
    5.5.1 Assumptions
    5.5.2 Overview of the Scheduling Algorithm in SMASH
    5.5.3 Discussion on the Scheduling Algorithm
  5.6 Analysis of the Schedule
  5.7 Summary

6 Storage Synthesis
  6.1 Introduction
  6.2 Data Transfer Scheduling for I/O Buffers
    6.2.1 Our Approach
    6.2.2 The Algorithm
    6.2.3 Selecting the Data Values for Transfer
    6.2.4 Discussion on the Algorithm
  6.3 Background Memory Synthesis
  6.4 Summary

7 Experimental Results
  7.1 Introduction
  7.2 Experiments Prior to Memory Research
    7.2.1 Experiment 1: Layout Studies of an AR Filter
    7.2.2 Experiment 2: Input Latches vs. Input RAM
  7.3 Experiments using SMASH
  7.4 Module Library
    7.4.1 Functional Modules
    7.4.2 Storage Modules
  7.5 High-Level Synthesis Benchmark Examples
    7.5.1 Differential Equation Example
    7.5.2 AR Filter and Elliptic Wave Filter Examples
    7.5.3 Discussion
  7.6 Rapid Prototyping of a JPEG Still Image Compression System
    7.6.1 Discussion
  7.7 Enhanced Design of JPEG Components
    7.7.1 Synthesis of the Quantizer
    7.7.2 Synthesis of 1D-DCT with Inner Loops
    7.7.3 Discussion
  7.8 Summary

8 Conclusion and Future Research
  8.1 Introduction
  8.2 Contributions
    8.2.1 Development of a High-Level Synthesis System
    8.2.2 Identification of Design Parameters
    8.2.3 Combined Datapath Scheduling with I/O Accesses
    8.2.4 Storage Tradeoffs
    8.2.5 Storage Cost Estimations
    8.2.6 Upper Bounds on Design Parameters
    8.2.7 Storage Synthesis
    8.2.8 Experiments
  8.3 Future Directions
    8.3.1 High-Level Memory Management
    8.3.2 Storage Module Allocation
    8.3.3 Improvement in Datapath Memory Synthesis
    8.3.4 Address and Control Generation
    8.3.5 Interfacing SMASH with DPSYN

Appendix A: MABAL to SSCNET Netlist Translator

Appendix B: VHDL Descriptions
  B.1 VHDL description of 2nd Order Differential Equation Solver
  B.2 VHDL description of an AR Filter Element
  B.3 VHDL description of an Elliptic Wave Filter Element
  B.4 VHDL description of 8-point 1D-DCT
  B.5 VHDL description of a Quantizer

List of Figures

1.1 Block diagram of the Unified System Construction (USC) Project
1.2 An Example Showing Scheduling with Memory Related Issues
1.3 SMASH Synthesis System
1.4 Storage Size vs. Number of Execution Cycles
1.5 Number of Ports vs. Number of Execution Cycles
1.6 Number of Ports vs. Size of Storage
1.7 Example of a Noise Cleaning Algorithm
1.8 Design 1
1.9 Design 2
1.10 Design 3
1.11 Design 4
1.12 3-Way Tradeoff in Storage Architecture
3.1 Target Architecture in SMASH
3.2 Target Architecture with Communication Links
3.3 2-phase Clocking Scheme in SMASH
3.4 An Example Illustrating 2-phase Clocking
3.5 Synthesis Approach in SMASH
3.6 Representing Conditional Branches
3.7 Representing Loops
3.8 Read/Write Nodes in SMASH
3.9 Read and Write Timing for Data Values
4.1 ASAP and ALAP Times for Various Read Nodes
4.2 Buffer Configurations
4.3 Lower Bound Estimation for Input Buffers
4.4 Lower Bound Estimation for Output Buffers
4.5 Constructing Storage using Registers
4.6 Constructing Storage using Register Files
5.1 Datapath Scheduling in SMASH
5.2 ASAP Analysis of the CDFG
5.3 Scheduling Algorithm in SMASH
5.4 Selecting the Most Suitable Step in SMASH
5.5 Operators with Varying Delays
5.6 Determining Mutual Exclusion between Nodes v_i and v_j
5.7 Loop Folding in SMASH
5.8 Data Transfer into I/O Buffers for Conditional Branches
5.9 Array Accesses
6.1 Data Transfer Timing for I/O Buffers
6.2 Data Transfer Scheduling in SMASH
6.3 Data Transfer Scheduling for I/O Buffers in SMASH
7.1 The AR Filter Dataflow Graph
7.2 Cost-performance Tradeoff Curve for a 16-bit Non-Pipelined AR Filter Datapath Element
7.3 Layout of the Most Parallel Non-Pipelined Design
7.4 Overall Cost-performance Tradeoff Curve for a 16-bit Pipelined AR Filter Datapath
7.5 Area vs. Size for 1R/1W Register File in EPOCH
7.6 Area vs. Size for 2R/1W Register File in EPOCH
7.7 Area vs. Size for 1R/1W RAM Module in EPOCH
7.8 Area vs. Size for 2R/1W RAM Module in EPOCH
7.9 Scheduled CDFG for Design 5 (2nd-order Differential Equation)
7.10 JPEG Still Image Compression System
7.11 Design Flow for Still Image Compression System Example
7.12 2D DCT Implementation from 1D DCTs
7.13 Layout of 1D DCT Module
7.14 Layout of 2D DCT Chip
7.15 Quantization in JPEG Image Compression System
7.16 Data Flow for the 8-point 1D-DCT
7.17 The Whole 8-point 1D-DCT CDFG

List of Tables

7.1 Summarized Area-Delay Statistics of the Non-Pipelined Designs
7.2 Summarized Area-Delay Statistics of the Pipelined Designs
7.3 Storage Area Statistics of Layouts
7.4 Module Library used by SMASH
7.5 Parameters from SMASH for Differential Equation Example
7.6 Parameters from SMASH for AR Filter Example
7.7 Parameters from SMASH for Elliptic Wave Filter
7.8 Data Transfer Schedule for Design 5 (2nd-order Differential Equation)
7.9 1D-DCT Design Parameters Obtained using SMASH
7.10 1D DCT RTL Designs from MABAL
7.11 Area Analysis for the Layouts
7.12 Chip-set Parameters
7.13 2D-DCT Implementations from SOS
7.14 Quantizer Design Parameters Obtained from SMASH
7.15 Enhanced 1D-DCT Design using SMASH

Abstract

This thesis addresses the high-level synthesis of memory-intensive application-specific systems, with emphasis on hierarchical storage architecture design.
These systems are commonplace in real-time applications, where they demand high-performance datapaths along with efficient storage schemes to support them. SMASH (Synthesis of Memory-intensive Application-Specific Hardware) is a program which combines storage hierarchy design with datapath synthesis for a given behavioral specification with constraints on cost and performance. SMASH includes the following major tasks: datapath synthesis, which includes operation scheduling combined with I/O accesses by the datapath from on-chip I/O buffers while looking ahead to evaluate storage architecture tradeoffs; and storage hierarchy design, which includes determining the data transfers between different levels of the memory hierarchy. The synthesis techniques are based on (i) feasibility analysis of input/output accesses during datapath scheduling using global system parameters such as the memory bandwidth and I/O timing constraints, (ii) tradeoffs in the storage structure, and (iii) cost-delay estimations for both functional and storage structures.

Experimental results are presented which validate our techniques and demonstrate the existence of a cost-performance tradeoff in the storage architecture. In addition, a system-level synthesis experiment illustrates the way SMASH can be integrated into a system-level synthesis environment. The experiment involved rapid prototyping of a JPEG image compression system, where SMASH was invoked both in the initial phase of the design flow, to obtain the area-delay tradeoff curves for different components of the system, and in the final phase, to synthesize the selected design points.

Chapter 1

Introduction

There has been exponential growth in microelectronics over the last couple of decades. Electronic systems, which are mostly application-specific special-purpose hardware, are becoming increasingly complex. Furthermore, intense competition in the electronics industry is forcing companies to ensure very early time-to-market for these extremely complex, yet reliable, systems at a competitive price. Achieving these goals is the motivation behind design automation. Designers are relying more and more on CAD tools for assistance during the design process. CAD tools are being used successfully for physical design, logic synthesis, simulation, and other activities. However, system-level design still remains an art. Most system-level decisions are still made by human designers, and the correctness of these decisions depends entirely on the knowledge and experience of the designer. Furthermore, these manual designs are so time-consuming that there is virtually no exploration of the design space. Therefore, there is a growing need for system-level design tools. The research addressed here deals with the development of such system-level synthesis tools.

1.1 High-Level Synthesis

Computer-aided design is being used successfully in the automatic synthesis of application-specific ICs (ASICs) from algorithmic descriptions. Such automatic synthesis of register-transfer level (RTL) designs from a given algorithmic behavioral specification of a digital system, while satisfying a set of constraints and best meeting a design goal, is called high-level synthesis [MMC88].

In the above definition, the RTL datapath consists of a network of functional units (e.g., adders, multipliers), storage units (registers, register files), interconnection units (multiplexers, tri-state drivers), and buses. The behavioral specification reflects the mapping from inputs to outputs.
Constraints specify design parameters (such as the performance, area, or power consumption of the chip); a design goal could be to minimize one or more of these design parameters. For example, the design goal could be to minimize the total power consumption of the chip while achieving a certain area and performance.

The ADAM (Advanced Design Automation) system at the University of Southern California deals with datapath and control synthesis [JKMP89, GKP85]. The inputs to ADAM are

• a behavioral specification, either in VHDL or in the form of a control data flow graph,

• a module library of available hardware operators (e.g., adders, multipliers) with their cost/delay characteristics, and

• a set of area/delay constraints and goals to be satisfied.

The outputs from ADAM are

• an RTL netlist consisting of components specified in the module library, and

• a description of a finite state machine as the controller for the generated RTL structure.

Though the realization of an RTL structure from the behavior appears straightforward, the goal of high-level synthesis is to explore the design space and, from the many RTL structures that can realize the given behavior, select the design that best meets the design constraints while achieving the design goals. The RTL netlist is then processed by a silicon compiler to produce an IC.
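To make these inputs concrete, the sketch below shows one minimal way such a module library and a clock-cycle feasibility check might be represented. The names and fields are illustrative assumptions for this thesis's setting, not ADAM's actual input format.

```python
from dataclasses import dataclass

@dataclass
class Module:
    """One hardware operator in the module library (illustrative fields)."""
    name: str        # e.g. "add16", "mult16"
    op: str          # operation implemented, e.g. "+", "*"
    area: float      # layout area in library units
    delay_ns: float  # propagation delay
    bitwidth: int

# A toy library: two adders and a multiplier at different area/delay points.
LIBRARY = [
    Module("add16_fast", "+", area=1200, delay_ns=18, bitwidth=16),
    Module("add16_small", "+", area=700, delay_ns=35, bitwidth=16),
    Module("mult16", "*", area=9000, delay_ns=70, bitwidth=16),
]

def candidates(op, clock_ns):
    """Modules that implement `op` and fit within one clock cycle."""
    return [m for m in LIBRARY if m.op == op and m.delay_ns <= clock_ns]

print([m.name for m in candidates("+", clock_ns=25.0)])  # ['add16_fast']
```

A synthesis tool explores the design space precisely by choosing among such candidates: the fast adder buys shorter schedules at a higher area, the small adder the reverse.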
The drawback in this approach is that 3 the design space exploration at the system level is very restricted as the system- level tools have to work with the pre-designed components. 3. Mixed design flow: This methodology is a combination of the above two ap proaches. For example, first some representative implementations of individ ual components are synthesized by the high-level synthesis tools to obtain the cost-performance tradeoff curve for these components (a bottom -up step) Then, based on the tradeoff curve, the exact specification and constraints for individual components are generated using the system-level tools. Finally, the individual components are synthesized (a top-down step) for these specifica tions and constraints using high-level synthesis tools. As we can see in the methodologies described above, high-level synthesis tools are at the core o f system-level tools. Therefore, it is essential that these high-level synthesis tools are able to deal with the system-level information. In the top-down approach, they should be able to handle system-related param eters specified by the system-level tools; similarly, in the bottom -up approach, they should be able to pro vide system-related param eters to the system-level tools; and in the mixed approach both. Even though, there has been significant progress in the field of high-level syn thesis during the last decade and researchers have developed powerful techniques to synthesize designs from behavioral specifications, these techniques are constrained in their scope and focus mainly on the synthesis of datapaths alone. There has been very little research to address the synthesis of storage architectures in the designs. It is essential for high-level synthesis to become practical by addressing more issues during synthesis and broadening the scope. The Unified System Construction (USC) project at the University of Southern California is one such effort in this direction. 1.3 The USC Project The Unified System Construction (USC) project involves the development of an inte grated suite of system-level tools for synthesizing multi-chip, heterogeneous applica tion specific systems which m eet cost, performance and power constraints [PCG93]. The focus of the USC project is on real-time systems, such as entertainm ent and 4 communication technologies, but does not exclude other applications requiring spe cialized system design. The m ajor tasks that are accomplished using USC tools are • partitioning, selection of components and packaging styles, and scheduling and allocation of components to meet cost and performance constraints. • autom atic synthesis of memory-intensive architectures, including both on and off chip memory, in conjunction with design of the processing units. • exploration of system-level design tradeoffs by means of predictors, prior to synthesis, including power and therm al effects. A block diagram of the system is shown in Figure 1.1. All the parts of USC project are not relevant to this thesis and will not be discussed here. This thesis addresses the tools and techniques to support autom ation of the design of memory intensive architectures, including both on-chip and off-chip memory, in conjunc tion with design of the processing units. The techniques developed on this subject have been incorporated in a tool called SMASH (Synthesis of Memory-Intensive Application-Specific Hardware). 
1.4 Memory-Intensive Systems

Memory-intensive application-specific systems are commonplace in video signal processing and real-time applications. These systems consist of four basic subsystems: the datapath, controller, memory and I/O architectures. They demand high-performance datapaths along with efficient storage schemes to support them. Datapaths for application-specific designs may process enormous amounts of real-time data. Such data must be stored in structures which are cost-effective and allow access to the data as required by the datapaths. Although the cost of storage per bit is very low, the total cost of memory may dominate the overall system cost due to the huge storage requirements of today's complex systems.

Figure 1.1: Block diagram of the Unified System Construction (USC) Project, showing the SAS style assignment, SOS synthesis-of-systems, CHOP partitioning, MCS multi-chip scheduling, SMASH and ADAM tools, and the silicon compilers (Cascade Design Automation, COMPASS Design Automation) used by the system-level tools.

1.5 Motivation

The storage architecture is closely connected to the datapath of the system, and isolating its synthesis from datapath synthesis may not result in an efficient solution. Datapath synthesis procedures themselves must take into account the design of the memory hierarchy which is companion to the datapath; ultimately, the design of the datapaths and the memory hierarchies must somehow be coordinated. Therefore, we aim for a combined datapath and storage architecture design. In addition, datapaths and memory interact with the controller and with the external world, and require interfaces to both. In order to keep the problem complexity within limits, we focus more on the issues relevant to storage architecture synthesis.

Our topic was motivated by a study of a digital video processing example, which indicated that memory design is as important and difficult as datapath design. Datapaths for application-specific designs may process enormous amounts of real-time data. Such data must be stored in storage structures which are cost-effective and allow access to the data as required by the datapaths. With the declining cost of hardware, memories increasingly dominate the cost of digital systems.

The storage architecture synthesis problem is also important in applications where the data transfer rate is very high (e.g., real-time applications such as personal communications). Here the major design issue is how and where to store the data, and then how to distribute it efficiently. In such cases, depending on the processing speed requirements and cost limitations of different systems, a variety of strategies is needed to handle the data.

Furthermore, our layout studies made it clear that datapath synthesis must consider physical design effects along with the design of the storage architecture and other system modules in order to be more widely useful to industry [PWGH90, PGH91]. Because of the complexity of these systems, we must now automate the memory design process in order to gain both cost and performance.
1.6 Introduction to Memory Design

There are many ways of designing an efficient system (in terms of performance and cost) which satisfies all the system specifications, but such designs must currently be done by skilled human designers. In this research, various design tradeoffs and issues in storage architecture synthesis have been identified and characterized, and a more general approach to storage architecture synthesis has been developed, so that the design process can be automated.

1.7 A Simple Example

We design a hierarchical storage system concurrently with the datapath, and also determine the input/output data-transfer schedule between the various hierarchy levels and the datapath as the datapath itself is scheduled. The need for such a combined synthesis step is illustrated in the following example, where a simple data flow graph is scheduled considering various memory-related issues (Figure 1.2 a, b, and c).

In Figure 1.2 a, the datapath scheduling is done without considering any storage-related issues. As a result, all the inputs A, B, and C are required simultaneously in step 1, so the module(s) storing A, B and C must have 3 read ports, and these inputs must be transferred on-chip from outside in one step, demanding a bandwidth of 3 words/cycle. Furthermore, a 3-word buffer is required to store these inputs after the transfer.

In Figure 1.2 b, the scheduling is done with limited read ports. This schedule results in operators requiring at most two inputs in any step (so only 2 read ports are required on the storage module(s) here), but the design still requires a bandwidth of 2 words/cycle. The buffer size required here is reduced to two, as A can be overwritten with C.

Finally, Figure 1.2 c shows the scheduling done with limitations on both the read ports and the bandwidth. This schedule requires only one data transfer per cycle from the external world and at most two inputs for operators in any step. The storage size required here remains two. Notice that in a larger CDFG there may be little or no delay in execution due to data transfer, as the execution of other parts of the CDFG is overlapped with the data transfer.

Figure 1.2: An Example Showing Scheduling with Memory-Related Issues. (a) Datapath schedule; (b) datapath schedule with 2 read ports; (c) datapath schedule with data prefetching (one data transfer per cycle allowed).
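The port- and bandwidth-limited schedules above can be produced by a straightforward greedy pass that defers an operation whenever its step would exceed the read-port or transfer budget. The sketch below is a minimal illustration of that idea, not the scheduler actually used in SMASH (which is described in Chapter 5); the graph encoding and budgets are assumptions for the example, and transfers are serialized with execution for simplicity, whereas SMASH overlaps them.

```python
def schedule(ops, read_ports, bandwidth):
    """Greedy scheduler for a dependence chain of operations.

    ops:        list of (name, external_inputs) in dependence order
    read_ports: max I/O-buffer reads the datapath may issue per step
    bandwidth:  max words transferred on-chip per step
    """
    onchip = set()   # values already prefetched into the I/O buffer
    log, step = [], 0
    for name, inputs in ops:
        if len(inputs) > read_ports:
            raise ValueError(f"{name} needs more than {read_ports} reads")
        missing = [v for v in inputs if v not in onchip]
        while missing:                       # prefetch, `bandwidth` words/step
            moved, missing = missing[:bandwidth], missing[bandwidth:]
            onchip.update(moved)
            step += 1
            log.append((step, f"transfer {moved}"))
        step += 1
        log.append((step, f"{name}: read {list(inputs)}, execute"))
    return log

# Figure 1.2(c)-style constraints: two read ports, one transfer per cycle.
for entry in schedule([("op1", ["A", "B"]), ("op2", ["C"])],
                      read_ports=2, bandwidth=1):
    print(entry)
```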
1.8.1 P roblem Statem en t The problem statem ent is as follows: We are given • the behavioral VHDL description of a memory-intensive application-specific system, which may contain inner loops and conditional branches, with the loop structure, arrays and the indexed references assumed to be already transform ed and optimized; 10 • the m odule library consisting of (i) functional modules (e.g. adders) with each m odule characterized by its area, delay and bitw idth, and (ii) storage modules (e.g. registers, single-port/m ultiport register files, single-port/m ulti- port RAMs) characterized by cost per word, num ber of ports, access tim e and storage capacity; • area-performance constraints; • the clock cycle, which is the duration of each control step in the datapath; • inp u t/o u tp u t tim ing constraints imposed by the external world; and • m em ory bandwidth constraints (the num ber of words th at can be transferred onto the chip in one control step). The synthesis software must produce the target system with • a datapath consisting of operators and operation schedule, • size and port configuration for on-chip foreground memory to store inputs, outputs and interm ediate variables, • data-transfer schedule between the datapath and on-chip memory, • size and port configuration off-chip (or on-chip) background memory for bulk storage, and • data-transfer schedule between the foreground and background memory. The block diagram of the system is illustrated in Figure 1.3. 1.9 Decisions made by SMASH The key decisions m ade by SMASH in designing the target system are to • determ ine the num ber of functional modules of each type in the datapath; • determ ine the operation schedule in the datapath; • design the storage hierarchy: 11 Module Library & parameters □r onstramts Behavioral VHDL Description architecture beh of A is begin process variable a, x, dx: Array8; Bandwidth end process, end behaviour; Clock Cycle SMASH SYNTHESIS SYSTEM CDFG Extraction Datapath Synthesis Datapath Memory Synthesis I/O buffer Synthesis Off-chip Memory Synthesis I/O buffers Datapath Operation Schedule Back ground Memory Data transfer Schedules DP mem. Figure 1.3: SMASH Synthesis System 12 — decide the size and port configuration for foreground memory, and — decide the size and port configuration for background memory; • determ ine the data transfer schedule between datapath and the foreground memory; and • determine the data transfer schedule between foreground and background memory. In the current implementation of SMASH, the num ber of ports on the foreground memory by SMASH, whereas the num ber of ports on the background memory is de rived from the bandw idth between the foreground and background memory specified by the user. To vary the num ber of ports on the background memory the user m ust invoke SMASH repeatedly. 1.10 Our Approach The overall synthesis approach consists of two m ajor subtasks: First, operation scheduling combined with scheduling of on-chip data transfers to/from I/O buffers is performed. As a result of this scheduling, constraints are placed on the mem ory structure. Second, storage architecture synthesis is carried out, which includes determining the data transfers between the foreground and background m em ory1. The first step of the stepwise construction of the system takes into account the sec ond step by looking ahead so th at the second step is not overly constrained. This approach ensures th at the partial design obtained in each step supports the syn thesis of the next structure. 
Global design param eters like the bandw idth between the on-chip and off-chip memories, and the I/O tim ing constraints are considered when constructing the partial design in each step, tying the whole synthesis process together. The datapath scheduling imposes significant constraints on the design; therefore, all the storage structure related tradeoffs m ust be considered in the dat apath scheduling software itself. 1 synthesis of the storage structures is not described in this research. 13 D atapath design requires scheduling the d ata flow graph and then binding the operations to the hardware modules while taking into account the external band width. In order to satisfy read/w rite port constraints on the storage architecture, the datapath operations and I/O buffer reads and writes by the datapath are sched uled simultaneously. The data is provided to the datapath when and only when it is required, because unavailability of the data would result in processing delays and unnecessary transfers would result in extra bandw idth and buffer-size requirements. This can be achieved by scheduling an operation in such a way th at the I/O d ata can be prefetched into I/O buffers from the background memory before it is required. The second step is the synthesis of the memory hierarchy. As m entioned earlier, this part of the system consists of two levels: on-chip foreground memory and off-chip background memory. Synthesis of each part involves scheduling the data transfer and creating data value bindings to the physical modules. The datapath schedule obtained in the previous step (datapath scheduling) determines the reads/w rites between the datapath and I/O buffers. The objective in this step is to schedule the writes into the I/O buffers from the background memory and the writes back into the background memory from the I/O buffers such th at the buffer size is minimized. W hile doing so the memory bandwidth constraints between the on-chip and the off- chip memory and the timing constraints on the I/O cannot be violated. To minimize the buffer size the data is transferred only if it is required and is overwritten whenever possible. The outcome of this step is a complete data transfer schedule in and out of the I/O buffers. Once the data transfer from background memory to the I/O buffers is known, we have the read schedule for the background memory. If there are tim ing constraints on the I/O then the background memory w rites/reads of the data are already scheduled, otherwise, they are scheduled the way the data transfers for the I/O buffers were scheduled. The final step, datapath memory design, has been researched in the literature quite extensively. There are several good approaches to merge variables into larger modules like single/m ulti port register files, which will be described in Chapter 2. 14 1.11 Storage Design Tradeoffs This section describes the three storage architecture tradeoffs included in SMASH. The total storage size, the num ber of read/w rite ports on the storage structure, and the num ber of clock cycles available for data transfer, can be traded off with each other as described below. 1.11.1 Storage Size us. N um ber o f E xecu tion C ycles The storage size vs. num ber of execution cycles tradeoff can be exploited in two situations. 1. W hen data needs to be prefetched in the buffers to avoid delay in the future, it can be stored in the buffers. 
Otherwise, if the data cannot be prefetched into the buffers because of the storage size lim itation, processing has to be delayed because the required d ata is unavailable in the buffers. 2. W hen the data is required again in the future, it can be stored in the buffers which may result in increased buffer size. Otherwise it m ust be fetched again, which may result in extra clock cycles in the execution. This tradeoff can be illustrated by the example shown in Figure 1.4. The design is allowed only one adder and inputs ‘a ’ and lc’ are available after step 1, £ b ’ and ‘d ’ are available after step 3, and ‘e’ is available after step 4. The sequence of operations is shown in the table for both the designs. Observe th at, in the first design, step 2 is utilized in prefetching ‘c’ for future use in step 4, whereas in the second design, step 2 is wasted due to insufficient storage space resulting in a delayed operation later on. Also, in the first design ‘a’ could be saved for use in step 5, whereas in the second design ‘a’ was fetched again, adding one more step in the total execution tim e. (Note th at the choice of a particular storage module type is not a part of this tradeoff.) 1.11.2 N um ber o f P orts vs. N um ber o f E xecu tion C ycles The num ber of ports vs. num ber of execution cycles tradeoff is a trivial tradeoff between space and tim e multiplexing. The required data can be transferred by 15 Sequence b Sequence c Step 1 Read a from M, Store in R1 Read a from M, Store in R Step 2 Read c from M, Store in R2 None Step 3 Read b from M, Add a+b Read b from M, Add a+b Step 4 Read d from M, Add c+d Read c from M, Store in R Step 5 Read e from M, Add a+e Readd from M, Add c+d Step 6 Read a from M Store in R Step 7 Read e from M, Add a+e c (stepl) a (stepl) d (step3) b (step3) e (step4) a+b (a) A scheduled CDFG. (d) Execution sequence for designs (b) and (c). (b) Design using 2 registers. (c) Design using 1 register. R2 R1 Figure 1.4: Storage Size vs. Number of Execution Cycles 16 Sequence b Sequence c Step 1 Read a from M, Store in R None Step 2 None None Step 3 Read b from M, Add a+b Read a, b from M, Add a+b Step 4 Read c from M, Store in R Read c, d from M, Add c+d Step 5 Read d from M, Add c+d Read a, e from M, Add a+e Step 6 Read a from M, Store in R Step 7 Read e from M, Add a+e a (stepl) c (stepl) d (step3) b (step3) e (step4) c+d a+b (a) A scheduled CDFG. (d) Execution sequence for designs (b) and (c). (c) Design using a 2-port Memory. (b) Design using a 1-port Memory. Figure 1.5: Num ber of Ports vs. Num ber of Execution Cycles 17 a (stepl) c (stepl) d (step3) b (step3) e (step4) c+d a+b (a) A scheduled CDFG. Sequence b Sequence c Step 1 Read a from M, Store in R1 None Step 2 Read c from M, Store in R2 None Step 3 Read b from M, Add a+b Read a, b from M, Add a+b Step 4 Read d from M, Add c+d Read c, d from M, Add c+d Step 5 Read e from M, Add a+e Read a, e from M, Add a+e (d) Execution sequence for designs (b) and (c). (b) Design using a 1-port memory and 2 registers. (c) Design using a 2-port Memory. R1 R2 Figure 1.6: Number of Ports vs. Size of Storage 18 transferring more words per clock cycle through more ports or by using more cycles to transfer these words. This tradeoff is exemplified by the two implementations shown in Figure 1.5. The first design has a one-port memory and requires seven clock cycles for execution. In contrast, the second design has a two-port memory and requires only three clock cycles. 1.11.3 N um ber o f P orts vs. 
Size o f th e Storage From the above two tradeoffs one can deduce the tradeoff between the num ber of ports and the size of the storage. This tradeoff is very effective when d ata values are used again. In such a case, they can either be retrieved repeatedly whenever needed with more ports or m ust be saved for the future use which will increase the storage size. In the example shown in Figure 1.6 ‘a ’ is used twice. In the second design there are two ports on the memory so ‘a ’ could be fetched twice instead of being stored. In the first design there is only one port on the memory, so ‘a’ is stored for future use. 1.11.4 3-way Tradeoff We have seen th at in storage architecture design there are three param eters which vary while m aking the cost-performance tradeoffs: 1. num ber of ports, 2. size of the storage structure, and 3. execution time. These tradeoffs m ust be m ade during the datapath scheduling step, as the dat apath schedule determines the execution tim e, buffer size and the read/w rite port requirements. If these decisions are postponed until the storage synthesis task, the datapath schedule might have to be altered in order to accomm odate the storage structure tradeoffs. This could result in complex backtracking and iteration. 19 1.11.5 A Storage A rchitecture Tradeoff E xam ple Consider the following example of a noise-cleaning algorithm. The filter can be described as shown below [Pra78]: if 1 8 x - y » •= i > £ then i = 1 ... N (Picture elements) Xy o Figure 1.7: Example of a Noise Cleaning Algorithm. The whole image is assumed to be stored in an off-chip background memory and only the required data is transferred into the on-chip input buffers. Inputs buffers are the part of the memory shown in detail in the designs in Figures 1.8, 1.9, 1.10, and 1.11. The buffers are being used to fetch the data from the background memory and make them available to the datapath for further processing. The values are required more than once (actually 9 times). Depending on the B W 0n- 0f f and the execution tim e allowed, the data may have to be stored in the buffers. T hat will determ ine the size of the buffers. 20 In designs 1.8, 1.9, and 1.10, it is assumed th at the datapath has already been designed and requires all 9 data of the image for processing at the same tim e. The storage-related tradeoff in these designs is described below. In addition, design 1.11 illustrates the need for combining the datapath design with the design of storage architecture. INPUT B K G N D M E M O R Y Dataj in H h h U }5 I 6 ?7 *9 Storage size 2 Rows + 3 Number of input ports 1 Number of output ports 9 Window processing rate every clock cycle Figure 1.8: Design 1. The storage tradeoffs considered in SMASH illustrated by these designs are as follows (Figure 1.12). • Designs 1 and 2 show the tradeoff between the size of the storage and the num ber of ports, • Designs 2 and 3 show the tradeoff between the num ber of ports and the exe cution tim e, and • Designs 3 and 1 show the tradeoff between the size of the storage and the execution tim e required. In Design 1 the buffer size required is 2 x rows + 3. Only 1 write port is required, as only 1 datum is transferred from the background memory in each cycle. 9 read 21 INPUT B K G N D M E M O R Y Data, in^ P atai+N ” 1 Pataj+ 2 N in h h U }5 ]6 h *9 Storage size 9 Number of input ports 3 Number of output ports 9 Window processing rate every clock cycle Figure 1.9: Design 2. 
INPUT Storage size 9 Number of input ports 1 Number of output ports 9 Window processing rate every 3 clock cycles Figure 1.10: Design 3. 22 )atapath design with lookahead D ata; INPUT Storage size 9 Number of input ports 1 Number of output ports 3 Window processing rate every 3 clock cycles Figure 1.11: Design 4. ports are required to make all the 9 data points available to the datapath. In this im plem entation the datapath can process a window every clock cycle. The buffers in Designs 2 and 3 are of size 9 but in Design 2 there are 3 write ports whereas, in Design 3 there is only 1 write port. Both the designs provide 9 read ports. Both the designs need 3 new data from the background memory to process the next window as th e data is not being stored in the buffers in these cases. Since, in Design 2, there are 3 write ports, 3 data points can be transferred from the background memory into the input buffer every clock cycle; the datapath processes a window every clock cycle. Design 3 has only 1 write port, therefore 3 clock cycles are needed to transfer the new data. This design processes a window every 3 clock cycles. Notice th at in Design 3 the delay is caused by the data transfer and not the datapath. Design 4 illustrates the necessity of looking ahead into the storage architecture param eters while designing the datapath. This design has the same performance as Design 3 but it requires a slower and cheaper datapath as well as less read ports on the buffer. The datapath in this im plem entation is designed with lookahead into the storage param eters B W 0n- 0f f , which is the num ber of write ports on the buffers. 23 INPUT ► » 2 increasing buffer size decreasing ,l" ', VS. ■ ► decreasing no. of ports increasing increasing X decreasing buffer size vs. exe. time decreasing increasing increasing f decreasing no. of ports vs. exe. time decreasing f increasing INPUT B K (i N J L ) Data M H M O R Y l 2 l 3 H h i f i h * 8 Figure 1.12: 3 Way Tradeoff in Storage Architecture 24 During the datapath design SMASH determ ined th at the bottleneck in processing is the small bandwidth, B W 0n-.0f f , it takes 3 clock cycles to transfer the required data for processing. Therefore, a high performance datapath (as used in Design 3) cannot improve overall performance. Knowing this, a slower and cheaper datapath is designed for this implem entation. The datapath in this case processes a window in 3 clock cycles and requires only 3 data per clock cycle. Furtherm ore, by reducing the data requirement per clock cycle, the num ber of read ports on the input buffers could also be reduced resulting in a cheaper storage structure. Thus a cheaper datapath as well as storage architecture was designed w ithout degrading the overall performance. This example illustrates the need for combining the datapath design with the design of storage architecture. 1.12 SMASH as a Part of USC High-level synthesis tools (like SMASH) are at the core of any system-level tools (like USC) and they can be used in three basic ways: (i) bottom -up, (ii) top-down, or (in ) mixed, in system synthesis. SMASH can be used in any of three design methodologies within USC. • Using SMASH in bottom -up design flow: In bottom -up design flow, SMASH provides the area-delay tradeoff curve for system-level tools like SOS and Propart. Several im plementations of a design with varying cost/perform ance param eters can be quickly synthesized using SMASH and then these param eters are provided to the system-level tools. 
Finally, the system-level tools select the most cost-effective design for the final system design.

• Using SMASH in the top-down design flow: When used in the top-down design flow, SMASH synthesizes ASIC designs to meet the cost/performance constraints determined by system-level tools such as SOS and Propart.

• Using SMASH in a mixed design flow: SMASH can generate the area-delay tradeoff curves of individual components for SOS and Propart and then produce the selected design point for the final design.

1.13 Thesis Organization

The thesis is organized as follows. Chapter 2 surveys the related work: it describes existing research in high-level synthesis in general, and in high-level synthesis of storage structures. Chapter 3 describes the SMASH approach to solving the storage synthesis problem. Chapter 4 discusses the techniques we have developed for various design parameter estimations; these estimations are used during the synthesis process. Chapter 5 describes the details of the synthesis techniques used in SMASH for datapath synthesis combined with I/O transfer scheduling while looking ahead at storage structure tradeoffs. Chapter 6 describes the storage structure synthesis performed in SMASH. Chapter 7 outlines the experiments we have performed to demonstrate our ideas and the results obtained. Finally, Chapter 8 presents the conclusions and future directions. Appendix A briefly describes a utility developed during the course of this research; the utility was widely used in our work as well as in a variety of other experiments. Appendix B contains the VHDL descriptions of all the examples synthesized by SMASH, as described in Chapter 7.

Chapter 2

Related Research

2.1 Introduction

The goal of this research is to develop a software system which performs high-level synthesis with emphasis on memory architecture synthesis. In the following sections we briefly review the research performed in high-level synthesis in general, and then review the work done in high-level synthesis of memory architectures.

2.2 High-Level Synthesis Research

There has been a great deal of research on the automatic synthesis of a design structure from a behavioral specification. Some of these systems are briefly described below.

The ADAM system at USC deals with datapath synthesis, starting with a control flow/data flow graph as the behavioral description of the desired system [JKMP89, GKP85]. Pipelined designs are synthesized by Sehwa [PP88] or a multi-chip scheduler, and non-pipelined designs by MAHA [PPM86] or LADS [WP91]. MAHA uses freedom-based scheduling to schedule the data flow graph; we propose to use a very similar but modified approach for our scheduler. For module allocation, MABAL is used [KP90]. MABAL allocates functional units as they are required, and registers are allocated by performing lifetime analysis of the intermediate values. It also trades off functional units against interconnect (multiplexers, drivers, etc.) when required.

The ELF system uses list scheduling, with the urgency of performing an operation before any enclosing timing constraint as the priority function [GK84]. Graph grammar productions are used for the allocation of functional units and registers. After scheduling and allocation, a greedy global partitioning/clustering algorithm is used to minimize the interconnect.

Paulin's HAL system accepts a data flow/control flow graph as the design behavior input [PK87, PK89]. It uses a powerful scheduling technique known as force-directed scheduling. In force-directed scheduling, a "force" for each operation in each possible step is used as the priority function. First, both an ASAP and an ALAP schedule are generated; these schedules indicate the possible control steps for each operation. Then, for each operation and each possible time step, distribution graphs are computed. These graphs are used to calculate a self force for each operation, as well as external forces reflecting the effect of the schedule on other operations. The total force is used as the priority function in the scheduling.
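To make the mechanics concrete, a minimal sketch of the distribution-graph and self-force computation is given below. This is our paraphrase of the published technique [PK87, PK89] under simplifying assumptions (unit-delay operations, uniform placement probability over each time frame), not code from HAL itself.

    # Minimal sketch of force-directed scheduling's self force [PK87, PK89].
    # Assumptions: unit-delay operations; uniform probability over [asap, alap].
    def distribution_graph(frames, steps):
        """DG(s): expected number of same-type operations active in step s."""
        dg = [0.0] * (steps + 1)
        for asap, alap in frames:
            p = 1.0 / (alap - asap + 1)        # uniform placement probability
            for s in range(asap, alap + 1):
                dg[s] += p
        return dg

    def self_force(frame, step, dg):
        """Force of tentatively fixing an operation with this frame into `step`."""
        asap, alap = frame
        p = 1.0 / (alap - asap + 1)
        force = 0.0
        for s in range(asap, alap + 1):
            delta = (1.0 - p) if s == step else -p   # change in placement probability
            force += dg[s] * delta
        return force

    # Example: three additions with frames (1,2), (1,3), (2,3) over 3 steps.
    adds = [(1, 2), (1, 3), (2, 3)]
    dg = distribution_graph(adds, 3)
    print(min(range(1, 3), key=lambda s: self_force(adds[0], s, dg)))  # -> 1

The full technique also adds predecessor and successor forces before choosing the minimum-force assignment, which is what balances operator usage across control steps.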
Allocation is done by a rule-based expert system, followed by operation binding using a greedy algorithm. A clique partitioning algorithm is used to minimize the number of registers; then multiplexers and interconnect are added. Finally, the design is improved by some local optimizations.

CMU's second CMU-DA system supports behavioral transformations [DPST81]. It uses the ASAP technique for scheduling. Its datapath allocator, EMUCS, binds operations onto hardware elements based on the cost of each unbound data flow element. Other synthesis packages from CMU include DAA (Design Automation Assistant), a knowledge-based expert system [KT83]; CSTEP (control step scheduler), which uses list scheduling with timing-constraint evaluation as the priority function [LT89]; and FACET, which uses ASAP scheduling [TS83].

The University of California at Berkeley's HYPER system applies optimizing compiler transformations to the data flow graph and determines lower and upper bounds on the number of functional units, registers, and buses in order to fix the initial number of resources [RP90, RMV+88]. It then schedules each control step in turn, and applies transformations such as multiplexer reduction and datapath partitioning to perform module binding.

Stok's scheduler uses a variation of force-directed scheduling [Sto91]. It tries to balance the use of each functional unit; multi-cycle operations and chaining are allowed. For datapath synthesis, operations are assigned to functional units based on a weighted clique partitioning algorithm. Intermediate variables are merged into register files using an edge-coloring algorithm.

IMEC's CATHEDRAL-II system is designed for the high-level synthesis of DSP chips [RMV+88]. It constructs datapaths from a set of execution units, a set of memories (register files and buffers), I/O units, and controller modules. Its scheduler uses list scheduling, giving priority to the operations on the longer critical path. Intermediate values are assigned to the register files based on lifetime analysis. CATHEDRAL-2nd and CATHEDRAL-III are two successors to the CATHEDRAL-II system. Philips' PIRAMID system uses CATHEDRAL-II as a synthesis engine along with a floorplanner and a module generator.

AMICAL, from INPG/TIMA, France, is targeted towards control-flow dominated systems [JPO93]. It provides an interactive environment where automatic and manual synthesis can be mixed. Starting with pure VHDL input, AMICAL produces a full specification for existing logic and RTL synthesis tools. The target architecture allows complex, synchronous, heterogeneous, parallel application-specific architectures.

There are many more synthesis systems from universities as well as from industry. Unfortunately, no single universally-accepted theoretical framework has yet emerged, due to the complexity of the whole synthesis process.
2.3 Memory Architecture Research

There has been very little work on the automatic design of memory hierarchies, with the exception of some European and Canadian activity. The bulk of research on memory hierarchy design in the United States involves theoretical and probabilistic studies for general-purpose computer design, where the design issues are quite different from those in ASICs. Little application-specific memory design research has been performed. In a general-purpose computer, the memory access pattern varies from application to application; therefore, for these machines memory design is based on probabilistic models. SMASH, on the other hand, is used for systems designed for specific applications. In our case the memory access pattern is not only relatively fixed but also known beforehand. This mostly deterministic access characteristic allows us to be more specific, and hence more efficient, in our designs, and it also makes it feasible to automate the design process. An example of the kind of tradeoff study related to our work is that performed a number of years ago by Parker and Nagle [NP77].

Many researchers have noted the need for fast, large memories in high-performance systems. Multi-port RAMs can provide high throughput, since simultaneous accesses are possible. The MIMOLA design system was the first system to make tradeoffs in the use of multi-port memories [Mar79, Zim79]. The design space parameters in MIMOLA included memories and the number of memory ports. The designer starts by minimizing high-cost, low-utilization resources such as memory ports, beginning with a very small number of resources in the database. The system description is processed, and if more resources are demanded by a microstatement, the program attempts to resolve the demand by sequentialization, by introducing storage cells for intermediate results, or by restructuring. Unresolvable situations are reported to the designer, who can revise the declarations and process the description again.

Balakrishnan et al. presented an approach that uses multi-port memories to implement single isolated registers [BMBL87, BMB+88]. The approach "packs" these registers into a homogeneous group of modules, considering the registers' interconnection to operators during packing in order to minimize the interconnection cost. The register packing problem is solved sequentially, i.e., by placing registers into one memory at a time: each memory is packed by solving an integer-linear program, and the leftover registers become candidates for the next memory, and so on. For a more global approach and a faster solution, a heuristic method based on a graph model of the problem was developed. After a feasible packing is found, a 3-step post-processing is done to reduce the interconnection cost. The first step interchanges pairs of registers between memories. The second step minimizes the number of memories that require access to both input terminals of a given functional unit, and the third step minimizes the number of operator terminals that require access to multiple ports of a given memory.

Chen explored the design space for multiport memory synthesis [CS90, Che91]. The memory modules were generated in sequence, which resulted in locally optimal solutions and in some cases failed to generate the globally optimal solution minimizing the number of multiport memories [CS90].
Later the work was modified and extended to generate a more globally optimal solution [Che91].

Ahmad and Chen use 0-1 integer-linear programming to group intermediate variables in the datapath into a minimum number of multiport memories, depending on their ports and their access patterns [AC91]. The formulation also takes the interconnection hardware into account. Though their technique works well with small examples, 0-1 ILP is not suitable for bigger examples and will exhibit run-time problems.

For each design style there is an area/time tradeoff possible when the memory cell array is physically laid out. In the case of large memories, a large row count causes very long bit lines, resulting in more delay and an inefficient layout. S. Hiorofume et al. presented a compiler with a flexible port arrangement and layout [HMFK90].

Grant et al. suggested an approach that groups the memory requirements of various operators such that control and communications may be optimized [GD90]. They consider single-port memory modules. To optimize the communication network between the functional units and memory, simple heuristics are used which optimize the write-bus network and the read-bus network; the controller is optimized by optimizing the control bit sequence. In an earlier study the same group examined a different aspect of memory synthesis, address generation hardware [GDF89].

Stok optimizes register files during the synthesis process by splitting the read and write phases of registers, and by considering parallel storage and rewrites of values that have to be read several times [Sto89].

Recently Lippens et al. from Philips Research Labs, in PHIDEO, implemented techniques to perform automatic memory allocation and address allocation for high-speed applications [LvMvdW+91]. They synthesize memory after the design of the arithmetic units and after scheduling. They model multi-dimensional periodic signals as data streams and then manipulate these streams to form a distributed memory structure. They assume a limited number of available memory types (1- and 2-port RAMs), so their approach is to distribute the data among parallel memories. They do not distinguish between background and foreground memory, they perform memory synthesis independent of datapath synthesis, and they do not consider conditional branching in the behavior of the system.

In IMEC's CATHEDRAL-II, efficient storage schemes and memory access techniques were implemented by De Man et al. [VBM91]. According to them, efficient storage schemes and memory access are as crucial as allocation and scheduling of datapaths in DSP ASIC design. They compile multi-dimensional data structures into distributed dual-port register files and single-port SRAMs. They also consider multi-dimensional signals and optimize the high-level memory organization using transformations. They use a polyhedral model for linear, piecewise-linear, and data-dependent signal indexing [FBS+93]; the model is used to derive alternative control flow structures for a given data flow specification in order to optimize large-scale memory organization, both in terms of storage locations and access order.

All of the above efforts concentrated on separate aspects of memory synthesis. An overall tradeoff approach like the one we have constructed has not been reported elsewhere.
Chapter 3

Problem Approach

3.1 Introduction

We observed in the introductory chapter that every new design of a storage architecture requires a different strategy to obtain the most efficient design in terms of cost and performance. Each strategy is unique and may be difficult or impossible to automate. We also observed that the storage architecture is a companion to the datapath, and that the design of the datapaths and memory hierarchies must be coordinated. In this research, our goal is to develop a more general approach which combines the synthesis of datapaths with the synthesis of storage structures. We wish to apply the approach to memory-intensive application-specific systems in order to automate the design process while producing efficient and correct designs. Of course, this kind of general approach cannot match human designs and may produce larger or slower hardware than a human designer would. Nevertheless, we wish to exploit all the advantages design automation has over manual design: faster design time, fewer errors, and exploration of a larger design space. This chapter describes the target system architecture produced by SMASH, the SMASH module library (which includes functional modules as well as storage modules), the clocking scheme used in SMASH, and the overall approach to synthesizing the target design implemented in SMASH.

3.2 Target System Architecture

Our target system architecture consists of a datapath and a hierarchical storage structure, as shown in Figure 3.1. Our view of the architectural structure and the role of each structure in the system is described in this section.

[Figure 3.1: Target Architecture in SMASH. On-chip foreground memory (I/O buffers, e.g., a 2R/1W RAM and a 1R/1W register file, and datapath memory, e.g., a 1R/1W RAM) and off-chip background memory built from off-the-shelf memory chips.]

3.2.1 Datapath

In our model, the datapath consists of the functional hardware that executes the specified behavior. The functional operators used in the datapath are characterized in the user-specified module library in terms of bitwidth, number of inputs and outputs, area, and execution delay. This datapath model is the conventional model used by other researchers, with one difference: in our model, the datapath per se does not have any storage capability. All the intermediate variables are stored in datapath memory, which is described later.

3.2.2 Storage Architecture

The overall storage structure is classified into two major sub-architectures based on their location in the storage hierarchy (Figure 3.1):

1. on-chip foreground memory to store inputs, outputs and intermediate variables, and
2. off-chip background memory for bulk storage.

The sub-architectures may have similar hardware structure but may vary in functionality. The overall architecture of the system may be fixed, but each sub-architecture is determined by the synthesis system and may consist of various storage modules and devices such as registers, register files, and RAMs. The cost of each sub-architecture is a function of

1. the number of read and write ports on the sub-architecture, and
2. the size of the sub-architecture.

The total cost of the storage sub-architectures can be further optimized during the actual construction of the storage structure by using the most cost-effective modules from the storage library, as explained later in the thesis. The sub-architectures are briefly described below.
On-chip Foreground Memory

The on-chip foreground memory consists of I/O buffers and datapath memory. In cases where the background memory can directly interface to the datapath, the I/O buffers are made transparent by using simple wires in place of storage modules. The I/O buffers are the part of the foreground memory that temporarily stores the I/O variables required for processing and makes them available to the datapath in the appropriate control steps. In each control step the required data is loaded onto the buffer ports, and as the datapath reads data from the buffers, the controller adds new data to them from the off-chip memory for further processing. Similarly, the output variables are first stored in the on-chip memory as they are produced and are then transferred to off-chip memory at an appropriate time, overlapped with the execution of the datapath. An efficient overlap between processing and data transfer can make the data access latency transparent to the system. In addition, the I/O buffers are also used to store data which will be used in future steps; if such data cannot be retrieved easily later, it may be retained in the buffers. The relevant parameters to be determined for the I/O buffers are

1. the total buffer size, which is determined by the maximum number of inputs and outputs stored in the buffers in any given control step, and
2. the number of read ports (R_buf) and write ports (W_buf) accessible to the datapath, which is the maximum number of inputs and outputs accessed by the datapath in any given control step.

The user specifies the bandwidth between the buffers and the off-chip background memory, BW_on-off. BW_on-off is the maximum number of inputs and outputs that can be transferred between the on-chip I/O buffers and the background memory in one control step.

Datapath memory stores the intermediate variables in the datapath. The parameters which determine this subpart are derived from the scheduled CDFG and include

1. the number of intermediate variables, and
2. the lifetimes of these variables.

Datapath memory synthesis tradeoffs have already been explored by other researchers [BMB+88, Che91, Sto89] and are not described in this thesis.

Off-chip Background Memory

The off-chip background memory is the bulk storage. All the I/O data values from and to the external world are stored in the background memory. The purpose of the off-chip memory is to provide large, cheap storage space, just as in a general-purpose computer. In general, off-chip memory can be shared or distributed between various parts of the datapath, or between multiple datapaths in a multiprocessor system, but in SMASH distributed memory is implemented, for the following reasons:

1. Performance requirements are generally quite high for the target systems. Distributed memory is faster and more efficient in access, as the data values can be accessed directly by the processing elements.
2. We know (or can at least implicitly enumerate) the access pattern of the data, so we can efficiently distribute the data among the datapaths, saving unnecessary routing and switching.

The parameters relevant to the off-chip memory are

1. the number of read and write ports (each equal to the user-specified BW_on-off), and
2. the number of words, which is expected to be large compared to the size of on-chip storage, and which is implicitly determined by our software as a side effect of the data transfer scheduling.
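The buffer port parameters above follow directly from the datapath schedule. A minimal sketch of that derivation is shown below; the schedule representation (per-step lists of the inputs read and outputs written by the datapath) is our own assumption for illustration, not a SMASH data structure.

    # Deriving I/O buffer port parameters from a datapath schedule (illustrative).
    # reads[s] / writes[s]: inputs read / outputs written by the datapath in step s.
    def buffer_port_parameters(reads, writes):
        r_buf = max((len(r) for r in reads), default=0)    # read ports needed
        w_buf = max((len(w) for w in writes), default=0)   # write ports needed
        return r_buf, w_buf

    reads  = [["a", "b"], ["c"], ["a", "d", "e"]]   # input accesses in steps 1..3
    writes = [[], ["x"], ["y"]]                     # output writes in steps 1..3
    print(buffer_port_parameters(reads, writes))    # -> (3, 1)

The total buffer size cannot be read off as directly, since it also depends on when values are transferred to and from the background memory under the BW_on-off limit; that interaction is handled by the data transfer scheduling step of Chapter 6 and estimated in Chapter 4.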
3.2.3 Target Architecture Discussion

Although the top-level target architecture is fixed, there is a great deal of flexibility in each subpart. The number of read and write ports in each sub-architecture is variable and is decided by SMASH. Each of these sub-architectures can be implemented using a heterogeneous combination of storage modules. For example, the I/O buffers in the on-chip memory can be made up of a heterogeneous combination of registers, single-port/multi-port register files, and single-port/multi-port RAMs. Datapath memory may consist of registers and single-port/multi-port register files, and may even contain RAMs if the storage requirements are large enough. The background memory, which is bulk off-chip storage, is constructed using larger modules such as RAMs. In an extreme case, a subpart can degenerate to interconnections if there is no need for any storage; such a part has no cost and no storage capability.

The cost model used in SMASH is in terms of chip area. The model assumes that the cost of the storage structure primarily depends on the number of required ports and the total size of the structure. The objective of the techniques used in SMASH is to minimize the number of ports and the total size of the storage structure. The heterogeneous construction mentioned above can further minimize the chip area of each sub-architecture by using the most cost-effective modules from the given library.

In the proposed model, the bandwidth between the on-chip and the off-chip memory, BW_on-off, imposes cost constraints on the overall design because of (i) the pin constraints on the chips, and (ii) the expense of having multiport memory for off-chip bulk storage. Furthermore, the access time of the off-chip background memory will be greater because (i) it is off-chip, and (ii) it is bigger. Since access to the data values in the off-chip memory may consequently be slower than on-chip access, the software schedules the transfer of the input variables from off-chip memory into the on-chip I/O buffers before they are required, in order to avoid delaying the execution. An interesting feature of this model is that the storage hierarchy can be ignored by providing a higher BW_on-off (which corresponds to more ports on the background memory). In such a situation, SMASH can produce designs without I/O buffers, as done in PHIDEO [LvMvdW+91]. This categorization of the storage architecture is quite similar to the one used for general-purpose computer architectures. However, their datapath and control architectures are clearly different from application-specific designs, and hence the storage architecture design problem differs as well.

3.3 Module Library Characteristics

In this section we describe the specification of the module library in SMASH, including the characteristics of each module that must be specified in the library. The actual library used in our experiments is described later in the thesis, in Chapter 7. The module library in SMASH has two components:

1. functional modules, and
2. storage modules.

3.3.1 Functional Modules

The functional modules are characterized in the conventional way in terms of

1. cost (which could be area, number of transistors, or static power dissipation),
2. execution delay,
3. number of inputs,
4. number of outputs, and
5. bit width of each input and output.

Examples of functional modules are adders, multipliers, and comparators.
3.3.2 Storage Modules

The storage modules are characterized differently. Their specification includes

1. bitwidth per word,
2. cost of storage (area, number of transistors, or static power) per word,
3. maximum storage capacity per module, and
4. number of read and write ports on the module.

A brief description of some example storage modules is given below:

• Register or latch: The register is the simplest storage element. It can store only one word at a time. Its bitwidth is predetermined.

• Single-port register file: A register file is a collection of registers with addressing hardware included. The size of the register file (the number of words that can be stored) is variable but must be determined prior to instantiation in the layout. Register files also have a maximum capacity limit per module. The cost of each register file is a function of its size. Their bitwidth is predetermined.

• Multi-port register file: The multi-port register file is the same as the single-port register file except that it has multiple read and write ports. Usually there is only one write port and multiple read ports. Each read port has an address bus, an output data bus, and a read-enable signal. Similarly, each write port has an address bus, an input data bus, and a write-enable signal. Simultaneous reads from the same location are possible, but in the case of multiple write ports, simultaneous writes to the same location are prohibited.

• Single-port RAM: The single-port RAM modules considered in the library are to be used on-chip. Each RAM module has a data input bus, an address bus, an output enable, and a write enable. The output can be standard or tristated. The bitwidth and the number of words per module are specified before silicon compilation. The cost of each RAM module is a function of its size.

• Multi-port RAM: Multi-port RAM modules are also used on-chip. Each write port on the RAM module has an address bus, an input data bus, and a write-enable signal. Similarly, each read port has an address bus, a data output bus (standard or tristated), and an output-enable signal. Simultaneous reads from the same address are legal, but simultaneous writes to the same address are obviously not allowed. Timing is critical for write vs. read operations only when accessing data that is being written in the current address cycle.
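To make the two characterizations concrete, the sketch below shows one possible way of recording such library entries. The field names and the example numbers are invented for illustration; they are not taken from the actual SMASH library, which is described in Chapter 7.

    # Illustrative records for the two kinds of library entries (invented values).
    from dataclasses import dataclass

    @dataclass
    class FunctionalModule:
        name: str
        cost: float           # area, transistor count, or static power
        delay_ns: float       # execution delay
        n_inputs: int
        n_outputs: int
        bitwidth: int         # bit width of each input and output

    @dataclass
    class StorageModule:
        name: str
        bits_per_word: int
        cost_per_word: float  # storage cost per word
        max_words: int        # maximum capacity per module
        read_ports: int
        write_ports: int

    adder   = FunctionalModule("add16", cost=950.0, delay_ns=18.0,
                               n_inputs=2, n_outputs=1, bitwidth=16)
    regfile = StorageModule("rf_2r1w", bits_per_word=16, cost_per_word=70.0,
                            max_words=32, read_ports=2, write_ports=1)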
3.4 Clocking Scheme

We assume two-phase clocking for our target system. Data is written to the off-chip background memory (either from the external world or from the I/O buffers) in phase one (φ1) and read from the off-chip background memory (either by the external world or by the I/O buffers) in phase two (φ2). Similarly, data is read from the on-chip foreground memory (either by the background memory or by the datapath) in φ1 and written to the on-chip foreground memory (from the background memory or by the datapath) in φ2. The datapath processes the data in φ1 and writes it back into the buffers in φ2. If the design does not have I/O buffers, the datapath interacts directly with the background memory; instead of writing the data back into the buffers, it writes it into the background memory in φ2. The scheme is shown in Figures 3.2 and 3.3 and illustrated by the example in Figure 3.4.

[Figure 3.2: Target architecture with communication links between the input/output, the datapath, the datapath memory, the I/O buffers, and the background memory.]

[Figure 3.3: Two-phase clocking scheme in SMASH. φ1: background memory write, buffer read, datapath memory read, datapath execute; φ2: background memory read, buffer write, datapath memory write, datapath output write.]

[Figure 3.4: An example illustrating two-phase clocking: a partial DFG with timing constraints on its inputs (A in step 1, C in step 2, B in step 3); an RTL implementation using a RAM, a register file, and an adder; and per-step φ1/φ2 timing details for each RTL module (RAM writes and reads of A, B, and C; register-file writes of A and B, read of A&B, and write of SUM; adder computing SUM = A + B).]

3.5 Overall Synthesis Approach

Our overall approach is shown in Figure 3.5. The synthesis process starts with the VHDL behavioral description of the system to be designed. This description is compiled and translated into a control data flow graph (CDFG) represented in DDS (Design Data Structure) [KP85, KP83] format. Next, the datapath for this CDFG is synthesized, followed by synthesis of the supporting storage architecture. Finally, the RTL netlist of the synthesized design is generated and then compiled to obtain the chip layouts. These steps are briefly described below. The two primary tasks in the synthesis process, viz. combined datapath and I/O transfer synthesis, and storage architecture synthesis, are described in detail in Chapters 5 and 6 respectively.

[Figure 3.5: Synthesis Approach in SMASH. VHDL behavioral description; CDFG extraction (a. VHDL to DDS, b. CDFG transformations); datapath synthesis with storage tradeoffs (scheduling of a. datapath operations, b. I/O buffer reads/writes, and c. datapath memory reads/writes); storage architecture synthesis (datapath memory synthesis: module allocation for datapath memory; I/O buffer synthesis: a. data transfer scheduling between background memory and I/O buffers, b. module allocation for I/O buffers; background memory synthesis: a. data transfer scheduling between external world and background memory, b. module allocation for background memory); RTL synthesis (MABAL + Epoch); chip layout.]

3.5.1 Control Data Flow Graph Extraction

The very first step in the synthesis process is to extract the control data flow graph (CDFG) of the target system from the VHDL description. This is done by generating the DDS description of the system using V2DSS [CP91] and then extracting the relevant information from the DDS description. The DDS describes the data flow in its data flow model, and the control flow required for conditional branches and inner loops in its timing model. These two models are merged into a CDFG for processing by SMASH. The CDFG specifying the target system is defined as follows:

Definition 3.5.1 A CDFG is a directed acyclic graph G(V, E) where

• V is a finite set of nodes. Each node v ∈ V represents an operation O_k in the behavioral description of the system. It includes control-flow operations like distribute and join; and

• E is a finite set of directed edges between the nodes in the CDFG. A directed edge e_ij from node v_i ∈ V to node v_j ∈ V exists in E if (1) the data produced by v_i is consumed by v_j (data edge), or (2) the data produced by v_i controls v_j (control edge).
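A minimal in-memory form of Definition 3.5.1 might look as follows. This is an illustrative sketch only: SMASH's actual CDFG is extracted from the DDS representation, and the field names here (including the per-node predicate used below for conditional branches) are ours.

    # Illustrative CDFG skeleton per Definition 3.5.1 (not the DDS encoding).
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        op: str                   # operation type O_k, e.g. "+", "read", "join"
        predicate: tuple = ()     # conditions under which this node executes
        preds: list = field(default_factory=list)   # Pred(v)
        succs: list = field(default_factory=list)   # Succ(v)

    def add_edge(u, v):
        """Directed edge u -> v: v consumes data produced by, or is controlled by, u."""
        u.succs.append(v)
        v.preds.append(u)

    a, b = Node("read"), Node("read")
    s = Node("+")
    add_edge(a, s); add_edge(b, s)    # the addition consumes the two read values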
Some related definitions follow; they are used later in the thesis.

Definition 3.5.2 Pred(v) ⊆ V is the set of all immediate predecessors of node v.

Definition 3.5.3 Succ(v) ⊆ V is the set of all immediate successors of node v.

Definition 3.5.4 Op(v) = O_k implies that node v is executed using an operator of type O_k.

Definition 3.5.5 V_Ok ⊆ V is the set of all nodes of operation type O_k in the CDFG, i.e., V_Ok = {v | v ∈ V and Op(v) = O_k}. Example: V_R is the set of all the read nodes in the CDFG.

The CDFG represents a partial order ≺ on all the operations in the behavior of the system. The partial order defines the operational precedence constraints. For example, v1 ≺ v2 implies that v1 has to be fully executed before v2 can start, where v1 and v2 are any two operations in the behavioral description, i.e., v1, v2 ∈ V.

Representing Conditional Branches

To represent conditional branches, a predicate is attached to each node in the CDFG, describing the conditions under which that node is executed. The predicate is extracted from the range information in the control and timing model of DDS. In DDS, a pair of conditional (i.e., exclusive) branches is represented by a pair of ranges connected at both ends by an or-fork and an or-join point in the control flow graph; a predicate is attached to each range describing the conditions under which that path is to be taken, as illustrated in Figure 3.6 a. Bindings (triples) link operations in the DFG to time ranges in the control and timing model. While constructing the CDFG for SMASH, the appropriate predicates are attached to each node, as shown in Figure 3.6. Multi-way branches are converted into multiple 2-way branches, as implemented in the VHDL compiler.

[Figure 3.6: Representing Conditional Branches. (a) A conditional branch in the DDS control and timing model, with predicates P = true and P = false on the two ranges; (b) the branch in the SMASH CDFG: a distribute/join pair enclosing the nodes of the then branch (associated predicate P = 1) and the nodes of the else branch (associated predicate P = 0).]

Representing Loops

Loop boundaries and loop bodies are identified using the special points α and ω of the timing model in DDS, as shown in Figure 3.7 a. α is the initial point of a loop, and ω is the point at which the loop iterates back to α. The α and ω have symbolic subscripts, which are used to distinguish values and nodes in different iterations of the loop. In our CDFG, the loop body is enclosed by a distribute node and a join node. One branch between the distribute-join pair contains the loop body, whereas the other branch is the exit branch, as shown in Figure 3.7 b. The number of iterations of the loop is extracted from the loop condition.

[Figure 3.7: Representing Loops. (a) A loop in the DDS timing model, delimited by α and ω, with the loop condition P at an or-fork; (b) the loop in the SMASH CDFG: a distribute/join pair whose one branch contains the loop body (with the inputs and outputs for iteration i) and whose other branch is the loop exit; the number of iterations is extracted from the loop condition.]

Representing Inputs and Outputs

The input and output variables (arrays or scalars) are explicitly specified in the VHDL description of the target system. Corresponding to each access of an input or output variable, a Read or Write node is introduced in the CDFG, as described below. The sets of input and output nodes are denoted by IP and OP respectively. A data variable can be of two types, distinguished by the type of addressing scheme used to access it: deterministic addressable or nondeterministic addressable.

Definition 3.5.6 Deterministic addressable: the address of the data is constant or can be determined at VHDL compile time.
Such data can be treated on an individual basis and can be assigned to storage locations independently of other data, as long as the required interconnections are provided.

Definition 3.5.7 Nondeterministic addressable: the address is a variable or is data dependent, and so cannot be determined a priori. The address refers to a specific location in an array. In such a case the whole array is considered as one entity, rather than treating each value individually. The whole array is made accessible to the datapath as required and is mapped to one or more storage modules in the I/O buffers, depending on its size and the number of variables that have to be accessed simultaneously. The required data is accessed immediately after the address has been determined.

Every input read in the CDFG is represented by a Read node, and every output write by a Write node, as illustrated in Figure 3.8. The read node has (i) the array A (which is being read) and (ii) the indices i_1, i_2, ..., i_n (the "address" of the value being read) as input edges, and the value A[i_1, i_2, ..., i_n] as its output edge. Similarly, the write node has (i) the array A, (ii) the indices i_1, i_2, ..., i_n, and (iii) the value v to be written as input edges, and the modified array A' with A'[i_1, i_2, ..., i_n] = v as its output edge. The number of indices in both the read node and the write node is variable and depends on the dimension of the array being accessed; we do not assign just one input edge corresponding to the "address" of the referenced value. This is done to preserve the behavior of the array representation: assigning just one edge would require precomputation of the absolute address of the value in the array, which would implicitly fix a specific arrangement within the array. A single data value is represented as an array of one element with the index input being a constant zero. Each input/output array (or scalar) has a size associated with it.

[Figure 3.8: Read/Write Nodes in SMASH. (a) A read node with inputs Array and i_1, ..., i_n and output Value = Array[i_1, ..., i_n]; (b) a write node with inputs Array, i_1, ..., i_n, and Value, and output the modified array with Array[i_1, ..., i_n] = Value.]

Definition 3.5.8 The size of an input or output array, SZ(a), a ∈ IP ∪ OP (where IP is the set of inputs and OP is the set of outputs), is the number of data elements in a. The size of a scalar input is always 1.

Each input/output variable may have an optional timing constraint. A timing constraint on input ip implies that ip is available for processing only after time step s. Since the input timing information is usually available in terms of real time, we require the user to provide the clock cycle length, and the program computes the corresponding step s for each input. The description format is given in Chapter 5.

This CDFG can then be transformed and optimized, as suggested by many researchers [FBS+93, LvMvdW+91, WL91]. Various transformations can be applied to optimize the loop structure, arrays, and indexed references. However, these transformations require extensive research in themselves and will be implemented in other USC tools.
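For instance, an access A[i, j] in the VHDL source gives rise to a read node with three input edges (A, i, and j) and an output edge carrying A[i, j]. The standalone sketch below illustrates the shape of these nodes; the dictionary encoding is ours, chosen purely for illustration.

    # Building read/write nodes for an array access A[i, j] (illustrative).
    def read_node(array, indices):
        # Input edges: the array plus one index per dimension; output: the value.
        return {"op": "read", "inputs": [array] + list(indices),
                "output": array + "[" + ", ".join(indices) + "]"}

    def write_node(array, indices, value):
        # Input edges: array, indices, and value; output: the modified array.
        return {"op": "write", "inputs": [array] + list(indices) + [value],
                "output": array + "'"}

    r = read_node("A", ["i", "j"])        # value A[i, j]
    w = write_node("A", ["i", "j"], "v")  # array A' with A'[i, j] = v
    print(r["output"])                    # -> A[i, j]

A scalar input fits the same scheme as a one-element array read with the constant index zero.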
In general, datapath synthesis consists of two fundam ental steps: • D atapath scheduling: In this step, each operation in the CDFG is assigned to a control step, and • Module allocation and binding: Here the num ber of functional units to be used in the design is determ ined and then each operation in the CDFG is m apped into a functional unit. D a ta p a th Scheduling This step determ ines the serial/parallel tradeoffs of the design, which result in the area-performance tradeoff. For example, if ‘n ’ add operations are scheduled in a single step during scheduling then it implies th at the final design will have at least ‘n ’ adders. In our application this step is even more im portant because the scheduled CDFG also dictates the access-pattern for inputs and outputs by the datapath, which in turn determines the num ber of read and write ports on the buffers, and influences the buffer size. For example, if the scheduled operations access ‘n ’ inputs in a single step then the final design m ust have at least ‘n ’ ports on the input buffers so th at the datapath can access these ‘n ’ inputs in one step. In other words, the datapath scheduling step imposes constraints on the storage structure. Later, during storage synthesis, these constraints m ust be m et by the synthesized storage structure. Furtherm ore, the constraints generated for the storage structure m ust not be very stringent, otherwise the storage synthesis software m ight not be able to satisfy the given cost constrains and would fail in the future. Therefore, in our approach the storage-related param eters are considered during datapath scheduling as follows: • The scheduling technique combines data scheduling w ith scheduling of I/O transfers between the datapath and the I/O buffers so th a t the num ber of ports on the storage structure under synthesis is acceptable. 50 • The scheduling technique looks ahead into d ata prefetching requirem ents dur ing scheduling, so th at the constraints for the on-off chip bandw idth are not violated in the future. Decisions regarding the tradeoffs in storage architecture m ust also be m ade during datapath synthesis. If these decisions were postponed until the storage synthesis step, we m ight have had to alter the datapath schedule in order to accomm odate storage architecture tradeoffs. This could have resulted in complex backtracking and iteration. The storage architecture tradeoffs are described in detail in C hapter 5. Many decisions m ade during this step have great im pact on the final design. Therefore, it is necessary to evaluate the im pact of these decisions on the final design prior to complete synthesis. This is achieved through estim ation of design param eters for storage architecture as well as datapath. These techniques are explained later in Chapter 4. The result of the above approach is a schedule which guarantees th at no band width or I/O tim ing constraint is violated in the next synthesis step when the complete I/O transfer schedule between the foreground and background mem ory is determ ined. This is how the whole synthesis process is tied together in SMASH. M od u le A llo ca tio n and B in d in g For m odule allocation and binding a decision was m ade to use existing ADAM tools. These tasks are completed using MABAL [KP90]. Unfortunately, MABAL is unable to handle loops, arrays and multicycle operations in the design specification. Therefore, the designs we synthesize using MABAL do not contain loops. 
Module Allocation and Binding

For module allocation and binding, a decision was made to use existing ADAM tools; these tasks are completed using MABAL [KP90]. Unfortunately, MABAL is unable to handle loops, arrays, and multicycle operations in the design specification. Therefore, the designs we synthesize using MABAL do not contain loops. Multicycle operations are taken care of by explicitly specifying module bindings to MABAL for these operations.

After datapath synthesis, the exact I/O access pattern is known: we know exactly when each input is required by the datapath and when each output is produced by the datapath. Our next step is to construct, from this access pattern, a cost-effective storage structure that will support the datapath.

3.5.3 Storage Architecture Synthesis

As described earlier, our target system has two levels of hierarchy in the storage architecture: (i) on-chip foreground memory (I/O buffers and datapath memory), and (ii) off-chip background memory. To reduce the complexity of the problem, these hierarchies are synthesized separately. Dividing the storage architecture also helps us decide the order in which the various data requirements should be considered. Since the datapath has already been scheduled in the previous step, the data requirements of the datapath in each control step are now known deterministically, and the foreground memory, which directly interacts with the datapath, can be synthesized. (Within the foreground memory, the order between I/O buffer synthesis and datapath memory synthesis is arbitrary.) Following the synthesis of the foreground memory, the data writes into the background memory can be scheduled and its synthesis can be completed. Like datapath synthesis, the I/O buffers and the off-chip memory are synthesized in two basic steps:

• data transfer scheduling, where the read and write times of each word are determined; and

• module allocation, where a physical location is assigned to each word.

3.5.3.1 Data Transfer Scheduling

There is a finite interval from the production of a data value to its consumption. The inputs required by the datapath can be transferred to the buffers from the background memory and stored locally in the buffers in any time step before they are consumed by the datapath. Similarly, the outputs produced by the datapath can be stored locally in the buffers and transferred to the background memory in any time step before they are required elsewhere. Both storing a value in the buffers and transferring it into or out of the buffers require resources: a memory location to store the value and a port to access it. In fact, the size and port requirements of the storage structure depend on the data transfer schedule.

The data transfer scheduling step determines a definite schedule for the transfer of all the data values to and from the storage sub-architecture being designed which have not been previously scheduled. This lets us know the exact storage, port and interconnect requirements, and it also helps us construct the address generator. Before formulating the problem mathematically, we define the following terms. For any data value the following time points are very important (they are given with respect to the storage module, m1, being constructed):

Definition 3.5.9 The birth time, T_B^{m1}(d), is the time point when the data d is made available for storage in module m1.

Definition 3.5.10 The write time, T_W^{m1}(d), is the time when d is written into the storage module m1. For example, T_W^{buf}(d) represents the time point when d is written into the I/O buffers from either the background memory or the datapath.
Note that the data d may be written into storage module m1 multiple times; in such a case T_W^{m1}(d) has multiple values associated with it, one for each write time.

Definition 3.5.11 The read time, T_R^{m1}(d), is the time when d is read from the storage module m1. Note that the data d may be read from m1 multiple times; in such a case T_R^{m1}(d) has multiple values associated with it, one for each read time.

Definition 3.5.12 The death time, T_D^{m1}(d), is when d dies or is no longer available.

These definitions are illustrated in Figure 3.9.

[Figure 3.9: Read and Write Timing for Data Values. T_W^{bk} and T_R^{bk} marked at the background memory, and T_W^{buf} and T_R^{buf} marked at the I/O buffers, on the path from the background memory through the I/O buffers to the datapath.]

When the data is being transferred in real time from the outside world to the background memory, the birth time and the write time are the same, as the data is written into the background memory as soon as it is available. When the data is being transferred from the background memory to an I/O buffer, the birth time is the time when it is written into the background memory from the external world, and the write time is the time when it is actually written into the I/O buffer from the background memory; it can be written into the buffer any time after its birth time (i.e., after it is written into the background memory from the external world). For some data values the write time is fixed; for example, in our case, the time at which the datapath writes an output into the buffers is fixed once datapath scheduling is done. For some data values the read time is fixed; for example, after datapath scheduling, the times at which the datapath reads inputs from the I/O buffers are fixed. The birth time and the death time of all values are either specified a priori by external timing constraints or assigned the execution start time and the execution end time, respectively.

Some inherent constraints on these time points, called the timing constraints, are given below. These constraints are with respect to the storage structure (I/O buffer or off-chip memory) being considered:

• While constructing the I/O buffers, the write time of a data value is the time when the value is written into the I/O buffers from the background memory (for inputs) or by the datapath (for outputs). Similarly, the read time of a data value is the time when the value is read from the buffer by the datapath (for inputs) or by the background memory (for outputs).

• While constructing the off-chip background memory, the write time of a data value is the time when the value is written into the background memory from the external world (for inputs) or from the I/O buffers (for outputs). The read time is the time when the value is read from the background memory by the I/O buffers (for inputs) or by the external world (for outputs).

The timing constraints are as follows:

1. T_B^{m1}(d) ≤ T_W^{m1}(d). The data value cannot be written into storage module m1 before it is made available to m1.

2. T_W^{m1}(d) ≤ T_R^{m1}(d). The write and read times are the times when the data is written into and read from m1, respectively. Whether equality is allowed depends on the situation: the data value can be written and read in the same step only with a two-phase clocking scheme. Strictly speaking, the write must be done before the read.

3. T_R^{m1}(d) ≤ T_D^{m1}(d). The value cannot be read from m1 after it is no longer available.
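Under these three constraints, a value d whose read time is already fixed must be written somewhere in the window [T_B^{m1}(d), T_R^{m1}(d)]; choosing a point in that window is exactly the freedom the data transfer scheduler exploits. A small sketch follows (assuming, for simplicity, a single write and a single read per value; the names are ours):

    # Feasible write window for a value with fixed birth, read, and death times
    # (illustrative; assumes a single write and a single read per value).
    def write_window(t_birth, t_read, t_death):
        """Return the earliest and latest step at which the value may be written."""
        assert t_birth <= t_read <= t_death, "timing constraints violated"
        return (t_birth, t_read)   # constraint 1 gives the lower end, constraint 2 the upper

    print(write_window(t_birth=2, t_read=7, t_death=9))   # -> (2, 7)

Writing late shortens the value's residence in the module (and hence the storage it occupies), but may collide with the per-step port limits introduced next.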
As m entioned earlier 77”1 (d) and 77"1 (d) are specified prior to scheduling unless a value is created or used by an operation not yet scheduled. However, 77"l (d) and/or 77”' (d) may not be specified at th at tim e. The goal in the d ata transfer scheduling step is to determ ine T™l{d) and T jni{d). The port constraints th at m ust be satisfied are: V™1 is the num ber of write ports on storage structure m i, num ber of read-write ports on m i, and V™1 is the num ber of read ports on m i. N™l (s) is the num ber of all the d ata values being w ritten in m i in step s, i.e. . y r o o = i{<n w s . t . 7 r ‘(<o = * and A//"1 (s) is the num ber of d ata values being read from m \ in step s, = |{<H Vd s.t. T™l {d) = 5 Then, assuming a tw o p h a se clo ck in g scheme with the reads and writes being perform ed in alternate clock phases, so that we can use V for reads in one phase and for writes in the other phase, the following three inequalities hold: V™'+7>% > J C '[ s ) Vs n mi + V Z 1 > A C mi(s) Vs r z ' + r z z + v ? 1 > M Z 'W + A r r i s ) v s Given the read and write tim es of all the d ata values stored in a m odule, a lower bound on the size of th at module is given by S M mi = m ax{|{d}| s.t. T™'(d) < s < 77"l (d),Vd} V S 55 Note th at this is a lower bound because we m ay read the same value into the storage m odule more than once in order to reuse the m em ory locations. Nevertheless, this relationship indicates th at the size of a storage m odule depends on the read and write tim ings of the data values stored in th at module. Next, we present our objective in this step. Given the num ber of ports V ™ 1, and V™1 th at can be used to read and write in one step1, the birth tim e T^n '1{d) and the death tim e 77"1 (d) of the data d, our goal is to determ ine a transfer schedule, read tim e 77"1 (d) and write tim e 77"1 (d) for all d, such th at the size of the memory SM-rmi is minimized, and the port and tim ing constraints are m et. The details of the d ata transfer algorithm and its im plem entation in SMASH are described in C hapter 6. 3 .5 .3 .2 M od u le A llo ca tio n After every d ata transfer has been scheduled and various requirem ents at each step are known we should allocate each value to a physical storage location. The objective here is to optim ize the total storage area, m eeting all the requirem ents. This step is not covered in this dissertation but will be addressed in the future. 3.5.4 RTL Synthesis The next and final step in the overall synthesis process is the register-transfer level (RTL) synthesis. This includes m apping the scheduled CDFG onto an RTL netlist and generating the layout of the netlist. We used existing tools (commercial and ADAM) to assist us in this step instead of developing our own tools. 3.5.4.1 M A B A L and E p och In this pathway, MABAL [KP90] is used for m odule allocation, binding and complet ing th e RTL netlist, and Epoch (a commercial silicon compiler from Cascade Design Autom ation Corporation) [Cas93] is used for layout generation. MABAL accepts the datapath schedule along with other relevant inform ation such as the CDFG and m odule library, and generates the RTL netlist of the design. This netlist is then is assumed to be zero in the current implementation of SMASH. 56 translated into Epoch’s netlist form at using the MABAL2Epoch netlist translator. Finally, the layout of the design is generated using Epoch. SMASH was quickly interfaced with these tools as all the required interfaces for these tools were already in existence. 
Several designs were generated using this pathway. These designs are presented in Chapter 7. As m entioned above, MABAL is does not handle loops, arrays and multicycle operations in the design specification. Therefore, the layouts of designs with loops cannot be generated currently. However, designs with multicycle operations can be handled by explicitly specifying module bindings to MABAL for these operations. We are looking into another commercial RTL synthesis tool called DPSYN from COMPASS Design Autom ation. This option is planned for future research and is described in C hapter 8. 3.6 Summary In this chapter we have outlined our approach to solving the high-level synthesis problem for memory-intensive systems. We presented the target architecture pro duced by SMASH, which consists of a datapath and a two-level memory hierarchy. Next, we described the characteristics of the module library used by SMASH. The m odule library includes functional modules as well as storage modules. We also briefly described the clocking scheme assumed in SMASH. Finally, we outlined all the m ajor steps of the overall synthesis process. These steps are 1. CDFG extraction, 2. datapath synthesis, 3. storage architecture synthesis, and 4. RTL synthesis. The next chapter describes the estim ation techniques used in the datapath syn thesis step. The details of datapath synthesis step and the storage architecture synthesis step are presented in Chapters 5 and 6 respectively. 57 Chapter 4 Estimation Techniques 4.1 Introduction As discussed earlier, high-level synthesis is known to contain NP-com plete problems [SJ94]. Searching the design space for an optim al solution is a complex and tim e consuming process. Therefore, m ost synthesis algorithms are based on heuristics, searching for near-optim al solutions. The use of heuristics makes it necessary to evaluate the decisions m ade using the heuristics during the synthesis process. Lower and upper bounds on the cost can help us in doing th at. Specifically, these bounds can be used in 1. guiding the user (or other software tools) in design space search by quickly providing useful design param eters w ithout synthesizing the design, 2. evaluating the im pact of high-level decisions on the final design at an early design stage w ithout going through the complete synthesis process, 3. speeding up the design space search during synthesis, and 4. evaluating the quality of a design produced by the heuristic. A lower and upper bound are com puted at the beginning of the synthesis process. Later, as the synthesis proceeds, these bounds are updated and analyzed to evaluate the im pact of the decisions m ade on the final designs. If the estim ated cost indicates a violation of the cost constraints in the future then th at decision is discarded. The total cost of the system consists of various components viz. functional cost, storage cost, interconnection cost, and controller cost. There has been a lot of 58 research in estim ating some of the components like functional cost and controller cost but not much research on estim ating the storage cost. In this research we have developed techniques to estim ate the storage cost. Storage cost estim ation combined w ith functional cost estim ation which is based on existing techniques has been incorporated in SMASH. In this chapter we will describe the methodology to estim ate the storage and functional cost of the final im plem entation from the partial operation schedule. 
The overall estim ation step is divided into two parts: (i) storage cost estim ation, and (ii) functional cost estim ation. We begin w ith some definitions which will be used in this chapter. Control steps 2 3 4 5 6 Figure 4.1: ASAP and ALAP Times for Various Read Nodes D e fin itio n 4 .1 .1 3 [il,<2] is defined as an interval between clock boundaries tx and t2, where 0 < t x < t2 < M axSteps and M axSteps is the number of control steps required in executing the CDFG. Equivalently, [tl,t2\ is the sequence of control steps {si + 1 , . . . , S2 — 1, S2 }. For example, in Figure 4.1 [2,6] is the interval between clock boundaries t x = 2 and t 2 = 6 and includes control steps 3, 4, 5, and 6. O b s e rv a tio n 4.1.1 A node v representing operation Ok with delay Dok m u s t b e scheduled in the interval [tx,t 2], i f t x < A S A P (v )+ D o k < A L A P (v) < t2. A S A P (v) is the As Soon As Possible time for v, Dok is the delay of operator Ok, and A L A P (v ) is the As Late As Possible time for v. Nodes R1 R2 R3 R4 R5 t1= 2 ' asap Interval [2,6] t2 = 6 Clock boundaries — 1 2 3 4 5 6 59 In the example (Figure 4.1), R1 and R5 m ust be scheduled in the intervals [2,5] and [2,6] respectively. Note th at R1 m ust also be included when considering the interval [2,6]. D efin itio n 4 .1 .1 4 Lok,ti,t2 kb* is defined as a set of nodes such that the node v € Vok m u st b e scheduled in the interval [ii,^ ]- Using observation 4-l-l> Lok,tut2 = {v \ v € Vok and tx < A S A P (v) + Dok < A L A P (v ) < t2} In the example (Figure 4.1), Lr i2$ = {R l, R5}. O b servation 4.1.2 A node v representing operation Ok with delay Dok m ay b e scheduled in the interval [<i,<2]> if A S A P (v) < t x < A L A P (v) + Dok, or A S A P (v ) < t 2 < A L A P (v ) + Dok, or ti < A S A P (v) < A L A P (v ) + Dok < t2 Example: In Figure 4.1, all of the 5 read nodes R l, R2, ..., R5 m ay be scheduled between clock boundaries 2 and 6. D efin itio n 4.1.15 Uok,ti,t2 Q ko* is defined as a set of nodes such that the node v € Vok m ay b e scheduled in the interval [tx,tf\. Using observation 4-1-2, Uok,ti,t2 — {v | v € Vok and any of the 3 conditions given in observation 4-1-2 is satisfied.} In the example (Figure 4.1), Ur i2,6 = {^?1, -R2, R3, R4, R5}. 4.2 Storage Cost Estimation E stim ated cost of the storage (buffers) is used in the design process to avoid violating the area constraint in the final design. The storage cost of a design is the cost of constructing the storage structure such th at it has the num ber of ports and size 60 required by the final im plem entation. The construction m ust be done using modules from the given storage library. During the datapath synthesis, estim ated storage cost is included in the total estim ated area. This enables us to avoid datapaths at an early stage, which m ay violate the area constraints in the future. W ithout any knowledge of the storage cost, a datapath th a t requires unacceptably large storage in the final im plem entation m ight be selected at this point but eventually would have to be rejected. The storage cost estim ation consists of the following three steps: 1. determ ining a lower bound on the num ber of read and write ports on the buffers. 2. determ ining a lower bound on the total size of all the buffers, and 3. m eeting these requirem ents with the lowest-cost storage modules from the library. 
4.2.1 Lower Bound on Read (Write) Ports on Buffers

A lower bound on the number of read (R) ports on the buffers, $N_{LB}(R)$, is the minimum number of read ports that are needed on the buffers in the final design. Actually, $N_{LB}(R)$ is nothing but the maximum number of inputs that must be accessed in a step. Since in our approach each read operation (input access) is represented by a read node, $N_{LB}(R)$ can be determined the same way that the lower bound for functional operators is determined. In fact, our techniques are based on the techniques developed to obtain the lower bound for functional operators [SJ94, JMP88, JPP87, Kuc91].

Basically, we count the number of reads that must be scheduled only during a specified interval, using the ASAP and ALAP analysis of the read nodes, as the actual scheduling has not been done yet. We assume that these reads will be uniformly distributed over that interval, because any other distribution will result in a greater number of reads than the average in at least one control step. By performing this analysis over all the possible intervals and then determining the maximum number of reads that must be performed in any interval, we determine the desired lower bound.

Theorem 4.2.1 A lower bound on the number of read (R) ports on the buffers, $N_{LB}(R)$, required in executing the CDFG in $MaxSteps$ is

$$N_{LB}(R) = \max_{[t_1,t_2]} \left\lceil \frac{|L_{R,t_1,t_2}|}{t_2 - t_1} \right\rceil$$

where $t_1 = 0 \ldots (MaxSteps - 1)$, $t_2 = (t_1 + 1), (t_1 + 2), \ldots, MaxSteps$, and $L_{R,t_1,t_2}$ is as per Definition 4.1.14.

Proof: Consider an interval $[t_1, t_2]$ and the set of read nodes $V_R$. The number of read nodes $n(R,t_1,t_2)$ that must be scheduled in the interval $[t_1, t_2]$ is at least equal to the cardinality of the set of read nodes $v \in V_R$ such that $v$ must be scheduled in $[t_1, t_2]$. Using Observation 4.1.1,

$$n(R,t_1,t_2) = |L_{R,t_1,t_2}|$$

Our objective is to determine the minimum number of read ports $n_{lb}(R,t_1,t_2)$ that are required to access these inputs (i.e. perform the read operations) in the interval $[t_1,t_2]$; therefore we must assume a uniform distribution of these nodes over the interval. A uniform distribution results in an average number of input reads in a control step, whereas any other distribution will result in more input reads than the average in at least one control step. Therefore, the maximum number of input reads in a single control step for the interval $[t_1,t_2]$ (which is the lower bound on the number of read ports required for that interval) is minimum when the distribution is uniform. With that assumption we get

$$n_{lb}(R,t_1,t_2) = \left\lceil \frac{n(R,t_1,t_2)}{t_2 - t_1} \right\rceil = \left\lceil \frac{|L_{R,t_1,t_2}|}{t_2 - t_1} \right\rceil \quad (4.1)$$

The maximum taken over all possible intervals $[t_1,t_2]$ gives the minimum number of read ports required in the final design:

$$N_{LB}(R) = \max\{n_{lb}(R,t_1,t_2)\} \quad (4.2)$$

where $t_1 = 0 \ldots (MaxSteps - 1)$ and $t_2 = (t_1 + 1) \ldots MaxSteps$. From Equations 4.1 and 4.2 we get the desired lower bound. □

A lower bound on the number of write ports on the buffers is determined in a similar manner.

Theorem 4.2.2 A lower bound on the number of write ports on the buffers, $N_{LB}(W)$, required in executing the CDFG in $MaxSteps$ is

$$N_{LB}(W) = \max_{[t_1,t_2]} \left\lceil \frac{|L_{W,t_1,t_2}|}{t_2 - t_1} \right\rceil$$

where $t_1 = 0 \ldots (MaxSteps - 1)$, $t_2 = (t_1 + 1) \ldots MaxSteps$, and $L_{W,t_1,t_2}$ is as per Definition 4.1.14.

Proof: Similar to the above proof. □
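A minimal sketch of Theorem 4.2.1, under the same node representation as in the earlier sketch (read nodes as (ASAP, ALAP, delay) tuples over clock boundaries 0 through $MaxSteps$; illustrative, not SMASH's implementation):

    from math import ceil

    def n_lb_read_ports(read_nodes, max_steps):
        bound = 0
        for t1 in range(max_steps):
            for t2 in range(t1 + 1, max_steps + 1):
                # |L_{R,t1,t2}|: reads that must fall inside [t1, t2]
                l = sum(1 for (asap, alap, d) in read_nodes
                        if t1 <= asap and alap + d <= t2)
                bound = max(bound, ceil(l / (t2 - t1)))
        return bound

The write-port bound $N_{LB}(W)$ of Theorem 4.2.2 is the same computation applied to the write nodes $V_W$.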
4.2.1.1 Computational Complexity

Computing $n(R,t_1,t_2)$ for each interval $[t_1,t_2]$ is done by processing all the elements of the set $V_R$, which requires $O(|V_R|)$ computations. This analysis is done for all the possible intervals, i.e. it is done $O(MaxSteps^2)$ times. Therefore, the overall complexity of determining the lower bound on the read ports is $O(MaxSteps^2 \cdot |V_R|)$. Similarly, the complexity of determining the lower bound on the write ports is $O(MaxSteps^2 \cdot |V_W|)$.

4.2.2 Buffer Size Estimation

The next step in estimating the cost of the storage structure is to estimate the total size of all the buffers. The problem is as follows: given a CDFG and the I/O bandwidth $BW_{on\text{-}off}$ on the chip, determine a lower bound on the total size of all the I/O buffers, $BufSize_{LB}$.

Figure 4.2: Buffer Configurations. (a) Input-only buffer, (b) output-only buffer, (c) combined I/O buffer between the datapath and off-chip memory.

Though in our model the buffers store both inputs and outputs, we have developed a lower-bound estimation theory for the following three possible configurations of the buffers:

1. input-only buffer (Figure 4.2 a),
2. output-only buffer (Figure 4.2 b), and
3. input as well as output buffer (Figure 4.2 c).

A lower bound on the total size of the I/O buffers is the smallest buffer size which is able to support the I/O transfer performed concurrently with the datapath execution, using the given bandwidth $BW_{on\text{-}off}$. In other words, it is the smallest I/O buffer required to store

1. all the inputs that are prefetched before they are required by the datapath, and/or
2. all the outputs produced by the datapath before they are transferred back to the off-chip memory.

Before presenting the estimation techniques we define some terms.

Definition 4.2.16 $L_{I,t_1,t_2} \subseteq I$ is defined as a set of inputs such that each input $i \in L_{I,t_1,t_2}$ must be accessed in the interval $[t_1,t_2]$. Using Observation 4.1.1,

$$L_{I,t_1,t_2} = \{i \mid i \in Pred(v) \text{ and } v \in L_{R,t_1,t_2}\}$$

where $L_{R,t_1,t_2}$ is as per Definition 4.1.14.

Definition 4.2.17 $U_{I,t_1,t_2} \subseteq I$ is defined as a set of inputs such that the input $i$ may be accessed in the interval $[t_1,t_2]$. Using Observation 4.1.2,

$$U_{I,t_1,t_2} = \{i \mid i \in Pred(v) \text{ and } v \in U_{R,t_1,t_2}\}$$

where $U_{R,t_1,t_2}$ is as per Definition 4.1.15.

Definition 4.2.18 $L_{O,t_1,t_2} \subseteq O$ is defined as a set of outputs such that each output $o \in L_{O,t_1,t_2}$ must be produced in the interval $[t_1,t_2]$. Using Observation 4.1.1,

$$L_{O,t_1,t_2} = \{o \mid o \in Succ(v) \text{ and } v \in L_{W,t_1,t_2}\}$$

where $L_{W,t_1,t_2}$ is as per Definition 4.1.14.

Definition 4.2.19 $U_{O,t_1,t_2} \subseteq O$ is defined as a set of outputs such that the output $o$ may be produced in the interval $[t_1,t_2]$. Using Observation 4.1.2,

$$U_{O,t_1,t_2} = \{o \mid o \in Succ(v) \text{ and } v \in U_{W,t_1,t_2}\}$$

where $U_{W,t_1,t_2}$ is as per Definition 4.1.15.

Definition 4.2.20 $I_{acc}(t_1,t_2)$ is the number of inputs that must be accessed during $[t_1,t_2]$ (Figure 4.3). $O_{prod}(t_1,t_2)$ is the number of outputs that must be produced during $[t_1,t_2]$ (Figure 4.4).
Definition 4.2.21 $I_{st}(t_1,t_2)$ is the number of inputs that are prefetched and stored in the buffers before $t_1$ to support the processing in $[t_1,t_2]$ (Figure 4.3). Similarly, $O_{st}(t_1,t_2)$ is the number of outputs that are stored in the buffers after $t_2$ (Figure 4.4).

Definition 4.2.22 $I_{trans}(t_1,t_2)$ is the number of inputs and $O_{trans}(t_1,t_2)$ is the number of outputs that are transferred during $[t_1,t_2]$, as illustrated in Figures 4.3 and 4.4 respectively.

Figure 4.3: Lower Bound Estimation for Input Buffers ($I_{trans}$ is the number of inputs that can be transferred between steps $t_1$ and $t_2$; $I_{acc}$ is the number of inputs that will be accessed between steps $t_1$ and $t_2$; the buffer size is $I_{st}$, the inputs prefetched and stored in the buffers)

Figure 4.4: Lower Bound Estimation for Output Buffers ($O_{trans}$ is the number of outputs that can be transferred between steps $t_1$ and $t_2$; $O_{prod}$ is the number of outputs that will be produced between steps $t_1$ and $t_2$; the buffer size is $O_{st}$, the outputs stored in the buffers)

We start with estimation of the lower bounds for the first two cases in order to convey the idea behind the theory, and then develop the estimate for the third configuration. Although the third configuration subsumes the first and second configurations, it is the one which is actually used in SMASH.

4.2.3 Lower Bound on the Size of the Input-Only Buffer

When the input buffers are separate from the output buffers, a lower bound for the total size of all the input buffers can be computed using the following theorem.

Theorem 4.2.3 A lower bound on the size of the input buffers, $BufSize_{LB}(I)$, required in executing the CDFG in $MaxSteps$ is given by

$$BufSize_{LB}(I) = \max_{\forall \text{ intervals } [t_1,t_2]} \max\{(|L_{I,t_1,t_2}| - BW_{on\text{-}off} \cdot (t_2 - t_1)),\ 0\}$$

where $t_1 = 0 \ldots (MaxSteps - 1)$, $t_2 = (t_1 + 1) \ldots MaxSteps$, and $BW_{on\text{-}off}$ is the bandwidth between on- and off-chip memory.

Proof: Consider an interval $[t_1,t_2]$. The size of the input buffer required to support the processing in $[t_1,t_2]$ is the number of inputs that should have been prefetched and stored into the buffers prior to this interval, i.e. $I_{st}(t_1,t_2)$. Now, $I_{st}(t_1,t_2)$ is the difference between $I_{acc}(t_1,t_2)$ and $I_{trans}(t_1,t_2)$ (assuming that the inputs transferred during $[t_1,t_2]$ are not stored in the buffer, and therefore do not contribute to the buffer size), i.e.

$$I_{st}(t_1,t_2) = I_{acc}(t_1,t_2) - I_{trans}(t_1,t_2)$$

$I_{acc}(t_1,t_2)$ is the cardinality of the set $L_{I,t_1,t_2}$, i.e. $I_{acc}(t_1,t_2) = |L_{I,t_1,t_2}|$. $I_{trans}(t_1,t_2)$ is the number of inputs that can be transferred during $[t_1,t_2]$ using a bandwidth of $BW_{on\text{-}off}$, which can be determined by considering the following two cases:

Case (i): $I_{acc}(t_1,t_2) \ge BW_{on\text{-}off} \cdot (t_2 - t_1)$. In this case, we assume the maximum utilization of the bandwidth because we are interested in the lower bound, i.e. $I_{trans}(t_1,t_2) = BW_{on\text{-}off} \cdot (t_2 - t_1)$.

Case (ii): $I_{acc}(t_1,t_2) < BW_{on\text{-}off} \cdot (t_2 - t_1)$. In this case, all the accessed inputs can be transferred in the interval $[t_1,t_2]$, i.e. $I_{trans}(t_1,t_2) = I_{acc}(t_1,t_2)$.

All the remaining inputs should have been prefetched and stored into the buffers prior to this interval. Therefore, $BS_{lb}(I,t_1,t_2)$, the minimum size of the buffer required to support the processing in $[t_1,t_2]$, is

$$BS_{lb}(I,t_1,t_2) = I_{st}(t_1,t_2) = I_{acc}(t_1,t_2) - I_{trans}(t_1,t_2) = \begin{cases} I_{acc}(t_1,t_2) - BW_{on\text{-}off} \cdot (t_2 - t_1) & \text{if } I_{acc}(t_1,t_2) \ge BW_{on\text{-}off} \cdot (t_2 - t_1) \\ 0 & \text{otherwise} \end{cases}$$

$$= \max\{(I_{acc}(t_1,t_2) - BW_{on\text{-}off} \cdot (t_2 - t_1)),\ 0\} = \max\{(|L_{I,t_1,t_2}| - BW_{on\text{-}off} \cdot (t_2 - t_1)),\ 0\} \quad (4.3)$$

The maximum taken over all the possible intervals $[t_1,t_2]$ gives the minimum size of the input buffer required in the final design:

$$BufSize_{LB}(I) = \max\{BS_{lb}(I,t_1,t_2)\} \quad (4.4)$$

where $t_1 = 0 \ldots (MaxSteps - 1)$ and $t_2 = (t_1 + 1) \ldots MaxSteps$. From Equations 4.3 and 4.4 we get the desired lower bound. □

Theorem 4.2.4 A lower bound on the size of the output buffers, $BufSize_{LB}(O)$, required in executing the CDFG in $MaxSteps$ is given by

$$BufSize_{LB}(O) = \max\{(|L_{O,t_1,t_2}| - BW_{on\text{-}off} \cdot (t_2 - t_1)),\ 0\}$$

where $t_1 = 0 \ldots (MaxSteps - 1)$, $t_2 = (t_1 + 1) \ldots MaxSteps$, and $BW_{on\text{-}off}$ is the bandwidth between on- and off-chip memory.

Proof: Similar to the above proof. □
4.2.3.1 Computational Complexity

The computational complexity of determining the lower bound of the input-only buffer is $O(MaxSteps^2 \cdot |V_R|)$: computing $BS_{lb}(I,t_1,t_2)$ in Equation 4.3 requires linear processing of the read nodes in the CDFG, which can be done in $|V_R|$ steps. This analysis must be done for all the possible intervals $[t_1,t_2]$, i.e. $O(MaxSteps^2)$ times. Therefore, the overall complexity of this analysis is $O(MaxSteps^2 \cdot |V_R|)$.
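A sketch of Theorem 4.2.3, with `must_inputs(t1, t2)` assumed to return $|L_{I,t_1,t_2}|$ (the same double loop serves Theorem 4.2.4 with $|L_{O,t_1,t_2}|$; illustrative interface, not SMASH's):

    def bufsize_lb_input(must_inputs, bw, max_steps):
        best = 0
        for t1 in range(max_steps):
            for t2 in range(t1 + 1, max_steps + 1):
                # Inputs that cannot be streamed in during [t1, t2] must be
                # prefetched, so they occupy the buffer before t1.
                best = max(best, must_inputs(t1, t2) - bw * (t2 - t1))
        return best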
4.2.4 Lower Bound on the Size of the I/O Buffer

Under the assumption that the same buffers are used to store inputs as well as outputs, the buffer size is estimated using the theorem given later in this section. Prior to presenting the theorem, we present the following lemma, which will be used in the proof of the theorem.

Lemma 4.2.1 $F = \max\{A - x, B + x\}$ is minimum when $A - x = B + x$, or $x = \frac{A-B}{2}$.

Proof: The proof follows by considering the following three cases:

case (i): $A - x > B + x$, or $x < \frac{A-B}{2} \Rightarrow F > \frac{A+B}{2}$.
case (ii): $A - x < B + x$, or $x > \frac{A-B}{2} \Rightarrow F > \frac{A+B}{2}$.
case (iii): $A - x = B + x$, or $x = \frac{A-B}{2} \Rightarrow F = \frac{A+B}{2}$.

Clearly, $F$ is minimum in case (iii). □

Now we present the theorem to determine a lower bound on the total size of all the I/O buffers.

Theorem 4.2.5 A lower bound on the size of the I/O buffers, $BufSize_{LB}(IO)$, required in executing the CDFG in $MaxSteps$ is given by:

$$BufSize_{LB}(IO) = \max\{(|L_{I,t_1,t_2}| - I_{trans}(t_1,t_2)),\ (|L_{O,t_1,t_2}| - BW_{on\text{-}off} \cdot (t_2 - t_1) + I_{trans}(t_1,t_2))\}$$

where $t_1 = 0 \ldots (MaxSteps - 1)$, $t_2 = (t_1 + 1) \ldots MaxSteps$, $BW_{on\text{-}off}$ is the bandwidth between on- and off-chip memory, and

$$I_{trans}(t_1,t_2) = \begin{cases} 0 & \text{if } |L_{O,t_1,t_2}| \ge BW_{on\text{-}off} \cdot (t_2 - t_1) + |L_{I,t_1,t_2}| \\ BW_{on\text{-}off} \cdot (t_2 - t_1) & \text{if } |L_{I,t_1,t_2}| \ge BW_{on\text{-}off} \cdot (t_2 - t_1) + |L_{O,t_1,t_2}| \\ \frac{BW_{on\text{-}off} \cdot (t_2 - t_1) + |L_{I,t_1,t_2}| - |L_{O,t_1,t_2}|}{2} & \text{otherwise} \end{cases}$$

Proof: For a given interval $[t_1,t_2]$, a lower bound on the buffer size, $BS_{lb}(IO,t_1,t_2)$, is the maximum of (i) $I_{st}(t_1,t_2)$, the number of inputs required to be stored before $t_1$, and (ii) $O_{st}(t_1,t_2)$, the number of outputs required to be stored after $t_2$:

$$BS_{lb}(IO,t_1,t_2) = \max\{I_{st}(t_1,t_2),\ O_{st}(t_1,t_2)\}$$

We also know that

$$I_{st}(t_1,t_2) = I_{acc}(t_1,t_2) - I_{trans}(t_1,t_2)$$
$$O_{st}(t_1,t_2) = O_{prod}(t_1,t_2) - O_{trans}(t_1,t_2)$$

$I_{acc}(t_1,t_2)$ is the cardinality of the set $L_{I,t_1,t_2}$ and $O_{prod}(t_1,t_2)$ is the cardinality of the set $L_{O,t_1,t_2}$, i.e.

$$I_{acc}(t_1,t_2) = |L_{I,t_1,t_2}| \quad (4.5)$$
$$O_{prod}(t_1,t_2) = |L_{O,t_1,t_2}| \quad (4.6)$$

Furthermore, the total number of inputs/outputs that can be transferred from/to the buffers is limited by the bandwidth and the duration of the interval:

$$I_{trans}(t_1,t_2) + O_{trans}(t_1,t_2) \le BW_{on\text{-}off} \cdot (t_2 - t_1)$$

Since we are interested in the lower bound, we consider the case when

$$I_{trans}(t_1,t_2) + O_{trans}(t_1,t_2) = BW_{on\text{-}off} \cdot (t_2 - t_1)$$

Therefore

$$BS_{lb}(IO,t_1,t_2) = \max\{I_{st}(t_1,t_2),\ O_{st}(t_1,t_2)\} = \max\{(I_{acc}(t_1,t_2) - I_{trans}(t_1,t_2)),\ (O_{prod}(t_1,t_2) - O_{trans}(t_1,t_2))\}$$
$$= \max\{(|L_{I,t_1,t_2}| - I_{trans}(t_1,t_2)),\ (|L_{O,t_1,t_2}| - BW_{on\text{-}off} \cdot (t_2 - t_1) + I_{trans}(t_1,t_2))\} \quad (4.7)$$

$I_{trans}(t_1,t_2)$ is yet to be determined. Since we are interested in the lower bound, the desired $I_{trans}(t_1,t_2)$ is the one which minimizes $BS_{lb}(IO,t_1,t_2)$. From Lemma 4.2.1, $BS_{lb}(IO,t_1,t_2)$ is minimum when

$$I_{acc}(t_1,t_2) - I_{trans}(t_1,t_2) = O_{prod}(t_1,t_2) - BW_{on\text{-}off} \cdot (t_2 - t_1) + I_{trans}(t_1,t_2)$$

i.e.

$$I_{trans}(t_1,t_2) = \frac{BW_{on\text{-}off} \cdot (t_2 - t_1) + I_{acc}(t_1,t_2) - O_{prod}(t_1,t_2)}{2} \quad (4.8)$$

Furthermore, since $0 \le I_{trans}(t_1,t_2) \le BW_{on\text{-}off} \cdot (t_2 - t_1)$, $I_{trans}(t_1,t_2)$ must be modified as follows:

Case (i): $O_{prod}(t_1,t_2) \ge BW_{on\text{-}off} \cdot (t_2 - t_1) + I_{acc}(t_1,t_2)$. In this case we get $I_{trans}(t_1,t_2) \le 0$ in Equation 4.8. This implies that $I_{trans}(t_1,t_2)$ should be equal to zero. In other words, in this situation the number of outputs produced is so high that, in order to minimize the buffer size, we should not transfer the inputs into the buffer and only transfer the outputs out of the buffer.

Case (ii): $I_{acc}(t_1,t_2) \ge BW_{on\text{-}off} \cdot (t_2 - t_1) + O_{prod}(t_1,t_2)$. In this case $I_{trans}(t_1,t_2) \ge BW_{on\text{-}off} \cdot (t_2 - t_1)$ in Equation 4.8. Since $I_{trans}(t_1,t_2) \le BW_{on\text{-}off} \cdot (t_2 - t_1)$, $I_{trans}(t_1,t_2)$ is equal to $BW_{on\text{-}off} \cdot (t_2 - t_1)$.

Case (iii): Otherwise, we use Equation 4.8.

In summary,

$$I_{trans}(t_1,t_2) = \begin{cases} 0 & \text{if } O_{prod}(t_1,t_2) \ge BW_{on\text{-}off} \cdot (t_2 - t_1) + I_{acc}(t_1,t_2) \\ BW_{on\text{-}off} \cdot (t_2 - t_1) & \text{if } I_{acc}(t_1,t_2) \ge BW_{on\text{-}off} \cdot (t_2 - t_1) + O_{prod}(t_1,t_2) \\ \frac{BW_{on\text{-}off} \cdot (t_2 - t_1) + I_{acc}(t_1,t_2) - O_{prod}(t_1,t_2)}{2} & \text{otherwise} \end{cases} \quad (4.9)$$

From Equations 4.7 and 4.9 we get the desired result. □

4.2.5 Computational Complexity

The computational complexity of determining the lower bound on the I/O buffer size is $O(MaxSteps^2 \cdot \max\{|V_R|, |V_W|\})$. Computing $BS_{lb}(IO,t_1,t_2)$ in Equation 4.7 requires linear processing of the read nodes and write nodes in the CDFG, which can be done in $O(|V_R|) + O(|V_W|)$, or $O(\max\{|V_R|, |V_W|\})$, steps. This analysis is done for all the possible intervals $[t_1,t_2]$, which takes $O(MaxSteps^2)$ steps. Therefore, the overall complexity of this analysis is $O(MaxSteps^2 \cdot \max\{|V_R|, |V_W|\})$.
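A sketch of Theorem 4.2.5, with `i_acc` and `o_prod` assumed to be callables returning $|L_{I,t_1,t_2}|$ and $|L_{O,t_1,t_2}|$; the clamped bandwidth split follows Equation 4.9 (illustrative, not SMASH's implementation):

    def bufsize_lb_io(i_acc, o_prod, bw, max_steps):
        best = 0
        for t1 in range(max_steps):
            for t2 in range(t1 + 1, max_steps + 1):
                cap = bw * (t2 - t1)
                ia, op = i_acc(t1, t2), o_prod(t1, t2)
                # Equation 4.9: split the bandwidth so the larger of the two
                # stored quantities is as small as possible (Lemma 4.2.1).
                if op >= cap + ia:
                    i_trans = 0
                elif ia >= cap + op:
                    i_trans = cap
                else:
                    i_trans = (cap + ia - op) / 2
                # Equation 4.7: max of inputs stored before t1 and outputs
                # stored after t2.
                best = max(best, ia - i_trans, op - cap + i_trans)
        return best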
4.2.6 Storage Structure Construction

The third and final step in the storage cost estimation for the I/O buffers is to implement a storage structure of size $S$ with $R$ read ports and $W$ write ports with the modules from the storage library. In SMASH, this is done by implementing the storage architecture with three types of storage modules:

1. registers,
2. register files, and
3. on-chip RAMs.

The cost of implementing the storage on these three module types is computed as described below. The minimum of the three costs is used as the estimated storage structure cost. Note that the cost obtained here is only an estimate, and not a lower bound.

4.2.6.1 Implementing the Storage Structure with Registers

A storage structure of size $S$ with $R$ read ports and $W$ write ports using registers can be constructed as shown in Figure 4.5. The details of such a construction are as follows:

Number of registers needed ($M$) = $S$
Size of input multiplexers = $W$-to-1
Number of input multiplexers = $S$
Size of output multiplexers = $S$-to-1
Number of output multiplexers = $R$

The total cost of this implementation is

$$Cost = S \cdot C_{reg} + S \cdot (2^{\lceil \log W \rceil} - 1) \cdot C_{mux} + R \cdot (2^{\lceil \log S \rceil} - 1) \cdot C_{mux}$$

Note that an $n$-to-1 multiplexer is constructed using $(2^{\lceil \log n \rceil} - 1)$ 2-to-1 multiplexers of cost $C_{mux}$ each.

Figure 4.5: Constructing Storage using Registers ($W$ write ports feed $S$ $W$-to-1 multiplexers, one per register; $R$ $S$-to-1 multiplexers drive the $R$ read ports)

4.2.6.2 Implementing the Storage Structure with Register Files

A storage structure of size $S$ with $R$ read ports and $W$ write ports using register files with $r$ read ports and $w$ write ports can be constructed as shown in Figure 4.6. The register files are allowed to have a variable size up to a maximum capacity of $s_{max}$. We connect each of the $r$ read ports of a register file to $R/r$ multiplexers only, because each read port of the register file can access all the data values of the register file. Thus, each data value is available at all the $R$ read ports of the overall storage structure. Note that in this configuration certain access patterns are prohibited; nevertheless, this configuration suffices for our needs. In this implementation:

Total number of register files $N = \lceil S / s_{max} \rceil$
Number of register files of size $s_{max}$ = $\lfloor S / s_{max} \rfloor$
Last register file size (for the remaining words) = $S \bmod s_{max}$
Size of input multiplexers = $\lceil W/w \rceil$-to-1
Number of input multiplexers = $w \times N$
Size of output multiplexers = $N$-to-1
Number of output multiplexers = $R$

The total cost of this implementation is

$$Cost = \left\lfloor \frac{S}{s_{max}} \right\rfloor \cdot C_{regfile}(s_{max}) + C_{regfile}(S \bmod s_{max}) + w \cdot N \cdot (2^{\lceil \log \lceil W/w \rceil \rceil} - 1) \cdot C_{mux} + R \cdot (2^{\lceil \log N \rceil} - 1) \cdot C_{mux}$$

Figure 4.6: Constructing Storage using Register Files (the $W$ write ports are distributed, $W/w$ at a time, over the $w$ write ports of each register file; each of the $r$ read ports of a register file feeds $R/r$ of the $R$ output multiplexers)
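A sketch of the register-based construction cost of Section 4.2.6.1 (the register-file and RAM variants follow the same pattern); `c_reg` and `c_mux` are assumed per-module costs from the storage library:

    from math import ceil, log2

    def muxes_2to1(n):
        """Number of 2-to-1 multiplexers in an n-to-1 multiplexer tree."""
        return (1 << ceil(log2(n))) - 1 if n > 1 else 0

    def register_storage_cost(size, r_ports, w_ports, c_reg, c_mux):
        input_muxes = size * muxes_2to1(w_ports)    # one W-to-1 mux per register
        output_muxes = r_ports * muxes_2to1(size)   # one S-to-1 mux per read port
        return size * c_reg + (input_muxes + output_muxes) * c_mux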
4.2.6.3 Implementing the Storage Structure with On-chip RAMs

A storage structure of size $S$ with $R$ read ports and $W$ write ports using on-chip RAMs with $r$ read ports and $w$ write ports can be obtained in the same manner as done for register files. The RAM modules are allowed to have a variable size up to a maximum capacity of $s_{max}$. We connect each of the $r$ read ports of a RAM module to $R/r$ multiplexers only, because each read port of the RAM module can access all the data values of the RAM. Thus, each data value is available at all the $R$ read ports of the overall storage structure. Note that in this configuration certain access patterns are prohibited; nevertheless, this configuration suffices for our needs. In this implementation:

Total number of RAMs $N = \lceil S / s_{max} \rceil$
Number of RAMs of size $s_{max}$ = $\lfloor S / s_{max} \rfloor$
Last RAM size (for the remaining words) = $S \bmod s_{max}$
Size of input multiplexers = $\lceil W/w \rceil$-to-1
Number of input multiplexers = $w \times N$
Size of output multiplexers = $N$-to-1
Number of output multiplexers = $R$

The total cost of this implementation is

$$Cost = \left\lfloor \frac{S}{s_{max}} \right\rfloor \cdot C_{ram}(s_{max}) + C_{ram}(S \bmod s_{max}) + w \cdot N \cdot (2^{\lceil \log \lceil W/w \rceil \rceil} - 1) \cdot C_{mux} + R \cdot (2^{\lceil \log N \rceil} - 1) \cdot C_{mux}$$

Note that the cost of a RAM module, $C_{ram}$, is a function of its size.

4.3 Functional Cost Estimation

To estimate the functional cost, first the lower bound on the number of operators is determined; then the minimum functional cost is estimated.

The lower bound on the number of functional operators is basically the maximum number of operators required simultaneously in a step. Since the actual scheduling has not been performed yet, we count the number of operations that can be scheduled only during a specified interval (from the ASAP and ALAP analysis of the nodes) and assume that these operations will be uniformly distributed over that interval. A uniform distribution results in an average number of operations in a control step, whereas any other distribution will result in more operations than the average in at least one control step. Therefore, the maximum number of operations performed in a single control step for the interval $[t_1,t_2]$ (which is the lower bound on the number of functional operators required for that interval) is minimum when the distribution is uniform. By performing this analysis over all the possible intervals and then determining the maximum number of operators that might be required in any interval, we determine the desired lower bound. Once the lower bounds on the number of operators required for each type of operator are known, the functional cost can be estimated.

4.3.1 Lower Bound on Functional Modules

Theorem 4.3.6 A lower bound on the number of operators of type $O_k$, $N_{LB}(O_k)$, required in execution of the CDFG in $MaxSteps$ is

$$N_{LB}(O_k) = \max_{[t_1,t_2]} \left\lceil \frac{|L_{O_k,t_1,t_2}|}{t_2 - t_1} \right\rceil$$

where $t_1 = 0 \ldots (MaxSteps - 1)$, $t_2 = (t_1 + 1) \ldots MaxSteps$, and $L_{O_k,t_1,t_2}$ is as per Definition 4.1.14.

Proof: Same as the proof of Theorem 4.2.1, except that the operator type here is $O_k$ instead of $R$. □

4.3.2 Lower Bound on the Total Functional Cost

The minimum functional cost is

$$Cost_{functional} = \sum_{\forall O_k} N_{LB}(O_k) \cdot Cost(O_k)$$

where $Cost(O_k)$ is the cost of operator type $O_k$.
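Theorem 4.3.6 reuses the machinery of Theorem 4.2.1 per operator type; a sketch of the resulting minimum functional cost, assuming `ops` maps each operator type to its (ASAP, ALAP, delay) tuples and `costs` gives the per-module cost (illustrative names):

    from math import ceil

    def functional_cost_lb(ops, costs, max_steps):
        total = 0
        for op_type, nodes in ops.items():
            n_lb = 0
            for t1 in range(max_steps):
                for t2 in range(t1 + 1, max_steps + 1):
                    # |L_{Ok,t1,t2}| per Definition 4.1.14
                    l = sum(1 for (a, al, d) in nodes
                            if t1 <= a and al + d <= t2)
                    n_lb = max(n_lb, ceil(l / (t2 - t1)))
            total += n_lb * costs[op_type]
        return total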
4.4 Upper Bounds on the Design Parameters

Upper bounds on various parameters of the storage and functional architecture of the design are developed for theoretical reasons. Though we have not used these bounds in SMASH, they are developed for future inclusion in BEST, a behavioral area-delay estimator [Kuc91]. BEST requires both the lower and upper bounds for a tight estimation of various design parameters. In this section, the following upper bounds are determined:

1. upper bound on the read and write ports on the buffers,
2. upper bound on the buffer size, and
3. upper bound on the number of functional modules of each type.

4.4.1 Upper Bound on Read (Write) Ports on Buffers

Theorem 4.4.7 An upper bound on the number of read ports on the buffers, $N_{UB}(R)$, required in executing the CDFG in $MaxSteps$ is

$$N_{UB}(R) = \max_{t_1}\{|U_{R,t_1,t_1+1}|\}$$

where $t_1 = 0 \ldots (MaxSteps - 1)$ and $U_{R,t_1,t_1+1}$ is as per Definition 4.1.15.

Proof: An upper bound on the number of read ports on the buffers, $N_{UB}(R)$, is nothing but the maximum number of inputs that may be accessed (i.e. the maximum number of read nodes that may be scheduled) in a step, or interval $[t_1, t_1+1]$. Using Observation 4.1.2 and Definition 4.1.15, the maximum number of read nodes that may be scheduled in this step is $|U_{R,t_1,t_1+1}|$. The maximum taken over all the steps gives the desired upper bound. □

Theorem 4.4.8 An upper bound on the number of write ports on the buffers, $N_{UB}(W)$, required in executing the CDFG in $MaxSteps$ is

$$N_{UB}(W) = \max_{t_1}\{|U_{W,t_1,t_1+1}|\}$$

where $t_1 = 0 \ldots (MaxSteps - 1)$ and $U_{W,t_1,t_1+1}$ is as per Definition 4.1.15.

Proof: Same as above. □

4.4.1.1 Computational Complexity

The computational complexity of determining the upper bound on the number of read ports on the buffers is $O(|V_R| \cdot MaxSteps)$. Determining $U_{R,t_1,t_1+1}$ takes $O(|V_R|)$ steps for each $t_1$, and this must be done for all the control steps, i.e. for $t_1 = 0 \ldots (MaxSteps - 1)$. Therefore the overall complexity of this computation is $O(|V_R| \cdot MaxSteps)$.
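Unlike the lower bounds, the port upper bounds only need the single-step intervals $[t_1, t_1+1]$; a sketch using the may-be-scheduled test of Observation 4.1.2, with the same node representation as in the earlier sketches:

    def n_ub_read_ports(read_nodes, max_steps):
        return max(
            sum(1 for (asap, alap, d) in read_nodes
                if asap <= t1 <= alap + d
                or asap <= t1 + 1 <= alap + d
                or (t1 <= asap and alap + d <= t1 + 1))
            for t1 in range(max_steps))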
4.4.2 Upper Bound on the Size of Buffers

The following theorem gives the upper bound on the total size of all the I/O buffers (when the buffers store both the inputs and outputs). This theorem is then extended to get the upper bound on the total size of all the input-only buffers and also that of the output-only buffers.

Theorem 4.4.9 An upper bound on the size of the I/O buffers, $BufSize_{UB}(IO)$, required in executing the CDFG in $MaxSteps$ is the sum of the input and output data sizes, i.e.

$$BufSize_{UB}(IO) = |IP| + |OP|$$

where $IP$ and $OP$ are the input and output data sets.

Proof: In a given interval $[t_1,t_2]$, the maximum number of inputs that may require storage is nothing but the number of inputs being accessed, $I_{acc}(t_1,t_2)$, because in the worst case all the inputs may require storage. Similarly, the maximum number of outputs that may require storage is $O_{prod}(t_1,t_2)$. Therefore, the maximum I/O buffer size $BS_{UB}(IO,t_1,t_2)$ required for processing in the interval is

$$BS_{UB}(IO,t_1,t_2) = I_{acc}(t_1,t_2) + O_{prod}(t_1,t_2)$$

Now consider the interval $[0, MaxSteps]$:

$$BS_{UB}(IO,0,MaxSteps) = I_{acc}(0,MaxSteps) + O_{prod}(0,MaxSteps) = |IP| + |OP|$$

as $I_{acc}(0,MaxSteps)$ is the total number of inputs that may be accessed in the interval $[0,MaxSteps]$ and $O_{prod}(0,MaxSteps)$ is the number of outputs that may be produced during $[0,MaxSteps]$. □

Corollary 4.4.1 An upper bound on the size of the input-only buffer, $BufSize_{UB}(I)$, required in executing the CDFG in $MaxSteps$ is $BufSize_{UB}(I) = |IP|$, where $IP$ is the input data set.

Proof: Similar to the above proof. □

Corollary 4.4.2 An upper bound on the size of the output-only buffer, $BufSize_{UB}(O)$, required in executing the CDFG in $MaxSteps$ is $BufSize_{UB}(O) = |OP|$, where $OP$ is the output data set.

Proof: Similar to the above proof. □

Unfortunately, these upper bounds are very loose and are of little use. If we assume that the inputs and outputs are transferred such that they are not stored in the buffers, we get a better estimate of the maximum buffer size that may be required by the final design. (Because of this assumption, the result remains an estimate of the maximum buffer size and not an upper bound.) The following theorem estimates the maximum size of all the I/O buffers. The result is then extended to estimate the maximum size of all the input-only buffers as well as that of the output-only buffers.

Theorem 4.4.10 The maximum size of the I/O buffers, $BufSize_{max}(IO)$, required in executing the CDFG in $MaxSteps$ can be estimated by:

$$BufSize_{max}(IO) = \max\{BS_{max}(IO,t_1,t_2)\}$$

where $t_1 = 0 \ldots (MaxSteps - 1)$, $t_2 = (t_1 + 1) \ldots MaxSteps$, $BW_{on\text{-}off}$ is the bandwidth between on- and off-chip memory, and

$$BS_{max}(IO,t_1,t_2) = \begin{cases} |U_{I,t_1,t_2}| & \text{if } |U_{I,t_1,t_2}| \ge |U_{O,t_1,t_2}| \ge BW_{on\text{-}off} \cdot (t_2 - t_1) \\ |U_{O,t_1,t_2}| & \text{if } |U_{O,t_1,t_2}| \ge |U_{I,t_1,t_2}| \ge BW_{on\text{-}off} \cdot (t_2 - t_1) \\ 0 & \text{if } |U_{I,t_1,t_2}| + |U_{O,t_1,t_2}| \le BW_{on\text{-}off} \cdot (t_2 - t_1) \\ |U_{I,t_1,t_2}| + |U_{O,t_1,t_2}| - BW_{on\text{-}off} \cdot (t_2 - t_1) & \text{otherwise} \end{cases}$$

Proof: For a given interval $[t_1,t_2]$, let the number of inputs that may be accessed be $I_{acc}(t_1,t_2)$, the number of inputs that are transferred be $I_{trans}(t_1,t_2)$, the number of outputs that may be produced be $O_{prod}(t_1,t_2)$, and the number of outputs that are transferred be $O_{trans}(t_1,t_2)$. Then the maximum size of all the I/O buffers, $BS_{max}(IO,t_1,t_2)$, is the maximum of (i) the number of inputs required to be stored before $t_1$, $I_{st}(t_1,t_2)$, and (ii) the number of outputs required to be stored after $t_2$, $O_{st}(t_1,t_2)$. The assumption here is that $I_{trans}(t_1,t_2)$ and $O_{trans}(t_1,t_2)$, the inputs and outputs transferred during the interval $[t_1,t_2]$, are scheduled such that they are not stored in the buffers. (Because of this assumption the result obtained here is an estimate of the maximum buffer size and not an upper bound.) Therefore

$$BS_{max}(IO,t_1,t_2) = \max\{I_{st}(t_1,t_2),\ O_{st}(t_1,t_2)\}$$

where

$$I_{st}(t_1,t_2) = I_{acc}(t_1,t_2) - I_{trans}(t_1,t_2)$$
$$O_{st}(t_1,t_2) = O_{prod}(t_1,t_2) - O_{trans}(t_1,t_2)$$

Furthermore, the total number of inputs/outputs that can be transferred from/to the buffers is limited by the bandwidth and the duration of the interval; assuming maximum utilization of the bandwidth,

$$I_{trans}(t_1,t_2) + O_{trans}(t_1,t_2) = BW_{on\text{-}off} \cdot (t_2 - t_1)$$

Therefore,

$$BS_{max}(IO,t_1,t_2) = \max\{(I_{acc}(t_1,t_2) - I_{trans}(t_1,t_2)),\ (O_{prod}(t_1,t_2) - BW_{on\text{-}off} \cdot (t_2 - t_1) + I_{trans}(t_1,t_2))\} \quad (4.10)$$

where $I_{trans}(t_1,t_2)$ is yet to be determined. Since we are interested in the upper bound, the desired $I_{trans}(t_1,t_2)$ is the one which results in the maximum $BS_{max}(IO,t_1,t_2)$. This is determined by considering the following two cases.

Case (i): $I_{acc}(t_1,t_2) \ge O_{prod}(t_1,t_2)$. Clearly, $BS_{max}(IO,t_1,t_2)$ will be maximum when $I_{trans}(t_1,t_2)$ is minimum. Basically, this will be the case when inputs are transferred into the buffers only after transferring the outputs from the buffers. Assuming maximum utilization of the bandwidth (i.e. the maximum number of inputs and outputs are transferred), we get the following three subcases:

1. $O_{prod}(t_1,t_2) \ge BW_{on\text{-}off} \cdot (t_2 - t_1)$. In this case the whole bandwidth is utilized in transferring the outputs from the buffers. None of the inputs are transferred into the buffers, i.e. $I_{trans}(t_1,t_2) = 0$.
2. $I_{acc}(t_1,t_2) + O_{prod}(t_1,t_2) \le BW_{on\text{-}off} \cdot (t_2 - t_1)$. In this case all the inputs can be transferred into the buffers and all the outputs out of the buffers in $[t_1,t_2]$. Therefore $I_{trans}(t_1,t_2) = I_{acc}(t_1,t_2)$.

3. Otherwise, first the outputs are transferred out of the buffers and then the remaining bandwidth is utilized in transferring the inputs into the buffers, i.e. $I_{trans}(t_1,t_2) = BW_{on\text{-}off} \cdot (t_2 - t_1) - O_{prod}(t_1,t_2)$.

In summary,

$$I_{trans}(t_1,t_2) = \begin{cases} 0 & \text{if } O_{prod}(t_1,t_2) \ge BW_{on\text{-}off} \cdot (t_2 - t_1) \\ I_{acc}(t_1,t_2) & \text{if } I_{acc}(t_1,t_2) + O_{prod}(t_1,t_2) \le BW_{on\text{-}off} \cdot (t_2 - t_1) \\ BW_{on\text{-}off} \cdot (t_2 - t_1) - O_{prod}(t_1,t_2) & \text{otherwise} \end{cases} \quad (4.11)$$

Correspondingly, we get

$$BS_{max}(IO,t_1,t_2) = \begin{cases} I_{acc}(t_1,t_2) & \text{if } O_{prod}(t_1,t_2) \ge BW_{on\text{-}off} \cdot (t_2 - t_1) \\ 0 & \text{if } I_{acc}(t_1,t_2) + O_{prod}(t_1,t_2) \le BW_{on\text{-}off} \cdot (t_2 - t_1) \\ I_{acc}(t_1,t_2) + O_{prod}(t_1,t_2) - BW_{on\text{-}off} \cdot (t_2 - t_1) & \text{otherwise} \end{cases}$$

Case (ii): $I_{acc}(t_1,t_2) \le O_{prod}(t_1,t_2)$. Similar to case (i), here we get

$$BS_{max}(IO,t_1,t_2) = \begin{cases} O_{prod}(t_1,t_2) & \text{if } I_{acc}(t_1,t_2) \ge BW_{on\text{-}off} \cdot (t_2 - t_1) \\ 0 & \text{if } I_{acc}(t_1,t_2) + O_{prod}(t_1,t_2) \le BW_{on\text{-}off} \cdot (t_2 - t_1) \\ I_{acc}(t_1,t_2) + O_{prod}(t_1,t_2) - BW_{on\text{-}off} \cdot (t_2 - t_1) & \text{otherwise} \end{cases}$$

By combining the two cases,

$$BS_{max}(IO,t_1,t_2) = \begin{cases} I_{acc}(t_1,t_2) & \text{if } I_{acc}(t_1,t_2) \ge O_{prod}(t_1,t_2) \ge BW_{on\text{-}off} \cdot (t_2 - t_1) \\ O_{prod}(t_1,t_2) & \text{if } O_{prod}(t_1,t_2) \ge I_{acc}(t_1,t_2) \ge BW_{on\text{-}off} \cdot (t_2 - t_1) \\ 0 & \text{if } I_{acc}(t_1,t_2) + O_{prod}(t_1,t_2) \le BW_{on\text{-}off} \cdot (t_2 - t_1) \\ I_{acc}(t_1,t_2) + O_{prod}(t_1,t_2) - BW_{on\text{-}off} \cdot (t_2 - t_1) & \text{otherwise} \end{cases} \quad (4.12)$$

$I_{acc}(t_1,t_2)$ and $O_{prod}(t_1,t_2)$ can be determined using Definitions 4.2.17 and 4.2.19, and Observation 4.1.2:

$$I_{acc}(t_1,t_2) = |U_{I,t_1,t_2}| \quad (4.13)$$
$$O_{prod}(t_1,t_2) = |U_{O,t_1,t_2}| \quad (4.14)$$

From Equations 4.12, 4.13 and 4.14 we get the desired result. □

Corollary 4.4.3 The maximum size of all the input-only buffers, $BufSize_{max}(I)$, required in executing the CDFG in $MaxSteps$ can be estimated by:

$$BufSize_{max}(I) = \max\{(|U_{I,t_1,t_2}| - BW_{on\text{-}off} \cdot (t_2 - t_1)),\ 0\}$$

where $t_1 = 0 \ldots (MaxSteps - 1)$, $t_2 = (t_1 + 1) \ldots MaxSteps$, and $BW_{on\text{-}off}$ is the bandwidth between on- and off-chip memory.

Proof: The proof follows from the proof of Theorem 4.4.10, as this is a special case of that theorem. In this case the outputs are not considered; therefore $O_{prod}(t_1,t_2)$ is made 0 in Equations 4.11 and 4.12, i.e.

$$BS_{max}(I,t_1,t_2) = \begin{cases} 0 & \text{if } I_{acc}(t_1,t_2) \le BW_{on\text{-}off} \cdot (t_2 - t_1) \\ I_{acc}(t_1,t_2) - BW_{on\text{-}off} \cdot (t_2 - t_1) & \text{otherwise} \end{cases}$$

$$= \max\{(I_{acc}(t_1,t_2) - BW_{on\text{-}off} \cdot (t_2 - t_1)),\ 0\} = \max\{(|U_{I,t_1,t_2}| - BW_{on\text{-}off} \cdot (t_2 - t_1)),\ 0\}$$

The maximum taken over all the possible intervals $[t_1,t_2]$ gives the desired bound. □

Corollary 4.4.4 The maximum size of all the output-only buffers, $BufSize_{max}(O)$, required in executing the CDFG in $MaxSteps$ can be estimated by:

$$BufSize_{max}(O) = \max\{(|U_{O,t_1,t_2}| - BW_{on\text{-}off} \cdot (t_2 - t_1)),\ 0\}$$

where $t_1 = 0 \ldots (MaxSteps - 1)$, $t_2 = (t_1 + 1) \ldots MaxSteps$, and $BW_{on\text{-}off}$ is the bandwidth between on- and off-chip memory.

Proof: Similar to the above proof. □
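A sketch of Equation 4.12, with `u_i` and `u_o` assumed to be callables returning $|U_{I,t_1,t_2}|$ and $|U_{O,t_1,t_2}|$ (illustrative interface):

    def bufsize_max_io(u_i, u_o, bw, max_steps):
        best = 0
        for t1 in range(max_steps):
            for t2 in range(t1 + 1, max_steps + 1):
                cap = bw * (t2 - t1)
                ia, op = u_i(t1, t2), u_o(t1, t2)
                if ia + op <= cap:
                    bs = 0                 # everything can be streamed through
                elif min(ia, op) >= cap:
                    bs = max(ia, op)       # one side monopolizes the bandwidth
                else:
                    bs = ia + op - cap     # residue after full-bandwidth transfer
                best = max(best, bs)
        return best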
4.4.2.1 Computational Complexity

The computational complexity of estimating the maximum size of all the I/O buffers is $O(MaxSteps^2 \cdot \max\{|V_R|, |V_W|\})$. Computing $BS_{max}(IO,t_1,t_2)$ requires $O(\max\{|V_R|, |V_W|\})$ steps, and this analysis is done $O(MaxSteps^2)$ times; therefore, the overall complexity is $O(MaxSteps^2 \cdot \max\{|V_R|, |V_W|\})$. Similarly, the complexity of estimating the maximum total size of the input-only buffers is $O(MaxSteps^2 \cdot |V_R|)$ and that of the total size of all the output-only buffers is $O(MaxSteps^2 \cdot |V_W|)$.

4.4.3 Upper Bound on the Cost of the Storage Structure

An upper bound on the cost of storage is determined by implementing the I/O buffers of size $BufSize_{UB}(IO)$ with $N_{UB}(R)$ read ports and $N_{UB}(W)$ write ports on registers, register files and on-chip RAMs, as described in Section 4.2.6. The minimum of the three implementations is considered to be the upper bound cost of the storage structure.

4.4.4 Upper Bound on the Number of Functional Modules

Theorem 4.4.11 An upper bound on the number of functional modules of type $O_k$, $N_{UB}(O_k)$, required in executing the CDFG in $MaxSteps$ is

$$N_{UB}(O_k) = \max_{t_1}\{|U_{O_k,t_1,t_1+1}|\}$$

where $t_1 = 0 \ldots (MaxSteps - 1)$ and $U_{O_k,t_1,t_1+1}$ is as per Definition 4.1.15.

Proof: Same as the proof of Theorem 4.4.7, except that the operator type here is $O_k$ instead of $R$. □

4.4.5 Upper Bound on the Total Functional Cost

The maximum functional cost is

$$Cost_{functional} = \sum_{\forall O_k} N_{UB}(O_k) \cdot Cost(O_k)$$

where $Cost(O_k)$ is the cost of operator type $O_k$.

4.5 Summary

In this chapter we developed techniques to estimate the storage cost and the functional cost. The storage cost estimation consists of the following three steps:

1. determining a lower bound on the number of read and write ports on the buffers, with the theoretical basis required to prove these bounds,
2. determining a lower bound on the total size of all the buffers, again with a theoretical basis provided through a proof, and
3. implementing these requirements on the lowest-cost storage modules from the library.

We also developed theory to determine the upper bounds on

1. the number of read and write ports on the buffers,
2. the total size of all the buffers, and
3. the number of functional modules in the design.

Chapter 5

Synthesis with Storage Tradeoffs Lookahead

5.1 Introduction

In this chapter we address datapath synthesis with lookahead in storage structure tradeoffs. During this synthesis step, datapath operations are scheduled together with the data transfers between the I/O buffers and the datapath, while looking ahead to evaluate storage structure tradeoffs. The general scheduling problem in architectural synthesis is presented first; then datapath scheduling in SMASH is described in detail. Datapath scheduling is a crucial step, as many of the architectural tradeoffs, including storage-related tradeoffs, are made during this step. In this chapter, the details of the scheduling algorithm used in SMASH are presented. The chapter contains a description of how SMASH considers storage-related parameters, makes tradeoffs in storage architecture, and uses storage and functional cost estimations during datapath scheduling. Also included are the details of the techniques used in SMASH for handling various issues such as conditional branches, loops, operators with varying execution delays, and constraints on bandwidth and I/O timing.

5.2 Datapath Scheduling Problem

The scheduling problem in architectural synthesis is generally formulated in the following two ways:

1. Scheduling under hardware constraints:
For a given choice of hardware modules with prespecified properties and total cost, sequence the operations in the CDFG such that the schedule length is minimized.

2. Scheduling under timing constraints: For given time constraints, minimize the number of hardware modules needed for execution of the CDFG.

In synchronous systems the basic correctness constraint is that every hardware unit can be used only once during a control step: combinational logic cannot have feedback, buses cannot carry more than one value, registers can be loaded only once, and read/write ports of storage modules can be accessed only once.

5.2.1 Problem Definition

The general control data flow graph scheduling problem in architectural synthesis is to associate with each node $v_i \in V$ a control step $S(v_i)$ such that

$$v_i \prec v_j \Rightarrow S(v_i) < S(v_j) \quad (5.1)$$

using the functional modules from the library, while satisfying the input timing constraints. Also provided is a library of the functional modules such that there is one (and only one) type of functional operator in the module library to execute each functional operation in the CDFG. It is assumed that operator types have been preselected by another ADAM tool like SLIMOS or by the user. Each operator of type $k$ has a cost $Cost_{O_k}$ (which could be in terms of area, number of transistors or static power) and a delay $D_{O_k}$ (the time taken for execution) associated with it.

5.2.2 Various Approaches

Much research has been done in datapath scheduling. Four major types of scheduling techniques exist in the literature [Sto91]:

• Transformational scheduling algorithms: These algorithms apply transformations to improve an existing schedule [BCM+88, Pen86]. The algorithm may start from either a fully serial schedule, which is then parallelized by applying transformations, or a maximally parallel schedule, to which serializing transformations are applied.

• Integer-linear programming techniques: These techniques formulate the problem as an ILP problem. Here again, researchers have formulated the scheduling problem under timing constraints [LHL89, PK90] and also under hardware constraints [HHL90]. These techniques have high time complexity.

• Neural network scheduling algorithms: These algorithms use self-organizing rules to solve the problem [HP90]. The cited approach schedules under timing constraints. This technique is also quite slow but is extremely suitable for parallel machines.

• Iterative scheduling algorithms: These algorithms schedule one operation at a time. Generally, either of the following two types of schemes is adopted:

1. The maximum number of operations is scheduled in a step and then the next step is considered. Examples are as-soon-as-possible (ASAP) scheduling, as-late-as-possible (ALAP) scheduling and list scheduling. In ASAP scheduling, an operation is scheduled in the earliest time step. ASAP scheduling results in the shortest execution time but may require extra hardware. Similarly, in ALAP scheduling, an operation is scheduled in the latest time step, by starting from the terminal nodes in the data flow graph and determining the latest time an operation can be scheduled. In list scheduling, some criterion is used to delay certain operations. Different researchers have experimented with different criteria in their schedulers. ELF uses urgency [GBK85], whereas Slicer uses mobility as the criterion to delay the operation [PG87].
Sehwa uses a combination of list schedulers but can switch to exhaustive search to find a final solution when the search space has been sufficiently pruned [PP88].

2. Each operation is chosen by some priority function and is scheduled in the best possible step. Examples include critical-path [PPM86] and distribution-based schedulers such as force-directed scheduling [PK87].

5.3 Datapath Scheduling in SMASH

The problem being addressed here is as follows: given the behavior of an application-specific system in the form of a CDFG (obtained from the VHDL description), the module library, area/performance constraints, the external bandwidth constraints, and an optional I/O timing constraint, the scheduling algorithm must schedule the operations in the CDFG, including I/O reads and writes, while satisfying all the constraints. It also determines the number of functional modules and the number of read and write ports on the I/O buffers required by the schedule. Figure 5.1 shows the top-level view of the inputs and outputs of this step in SMASH.

Figure 5.1: Datapath Scheduling in SMASH (inputs: the CDFG extracted from the VHDL description; area/performance, bandwidth and external I/O timing constraints; functional and storage library parameters; the clock cycle. Outputs: the datapath, i.e. the number of functional modules per type and the operation schedule; the I/O buffers, i.e. the number of R/W ports and the I/O accesses; and the datapath memory, i.e. the intermediate variable accesses.)

Besides combining the datapath scheduling with I/O access scheduling during this step, (i) the storage architecture parameters are considered so that the schedule is guaranteed to satisfy the constraints during storage synthesis, (ii) tradeoffs in the storage architecture are considered so that the schedule does not require alterations during storage synthesis, and (iii) estimation techniques for storage and functional cost are used to evaluate the impact of high-level decisions on the final design, so that potentially inferior designs are discarded early in the design process.

SMASH first preprocesses the CDFG, then schedules the processed CDFG, and finally analyzes the schedule to verify its correctness and quality. These three stages are described below.

5.4 Preprocessing the CDFG

SMASH preprocesses the CDFG before performing scheduling. The tasks performed during preprocessing are described below.

Inserting R/W Nodes in the CDFG

To schedule I/O accesses, SMASH inserts a read/write (R/W) node in the CDFG corresponding to each I/O access from the buffers. These R/W nodes have already been described in Chapter 3.

ASAP and ALAP Analysis of the CDFG

The ASAP and ALAP analyses of the CDFG are performed in this step. The $ASAP(v)$ and $ALAP(v)$ times for a node $v \in V$ are defined as follows.

Definition 5.4.23 $ASAP(v)$ is the earliest possible time for executing the node $v$ in the CDFG.

Definition 5.4.24 $ALAP(v)$ is the latest possible time for executing the node $v$ so that the CDFG is executed in $MaxSteps$.

The basic procedure to determine the $ASAP(v)$ times for all the nodes in the CDFG is outlined in Figure 5.2, and is a depth-first recursion. The ALAP procedure is similar to the ASAP procedure.
e l s e , f o r a l l th e o u tg o in g edges of node (a) nextjnode - head o f th e o u tg o in g edge. (b) next^step = step + d e la y of O P{v) ; (c) ASAP (nextjnode,next^step) 4 . r e tu r n ; Figure 5.2: ASAP Analysis of the CDFG D eterm in a tio n o f Lower B ou n d s on D esig n P aram eters SMASH determ ines th e lower bounds on the following design param eters based on the ASAP and ALAP schedules of the nodes: 1. num ber of read and write ports on the I/O buffers, 2. num ber of functional modules of each type, and 3. size of the I/O buffers. These lower bounds are used as the initial resource constraints in the scheduling algorithm as described in the next section. In addition, SMASH estim ates the storage and functional costs based on these lower bounds. The cost is in term s of chip area. This inform ation is provided to the user to help h im /her in setting the design constraints. 91 S p ecification o f th e D esig n G oal Besides specifying the design constraints, the user can also specify the design goal. By design goal we m ean whether priority should be given to (i) performance op timization, or (ii) area optimization during scheduling while satisfying the area or perform ance constraints [PPM86]. • Performance optim ization: when performance optim ization is the design goal, whenever the scheduler is unable to schedule a node in a control step due to resource lim itations, SMASH adds an extra resource. However, if the addition of th e extra resource causes an area constraint violation, then SMASH extends the execution tim e by inserting an extra step. W hen both the options result in a constraint violation, SMASH prom pts an error and aborts the scheduling. • Area optim ization: when area optim ization is the design goal, SMASH inserts an extra step whenever the scheduler is unable to schedule a node. However, if inserting the extra step causes a performance constraint violation then an extra resource is added. Here also, when both the options cause constraint violation, the scheduling is aborted. Specification of the design goal facilitates the user in obtaining a design which is near-optim al for the desired param eter (area or performance). 5.5 Scheduling the CDFG 5.5.1 A ssu m p tion s The scheduling software assumes two-phase clocking as described in C hapter 3. The datapath reads and processes the d ata in the first phase and writes it back into m em ory in th e second phase. In case the datapath interacts directly with the off-chip memory, it writes the d ata into the off-chip memory in the second phase. SMASH allows loops with fixed iterations in the VHDL description. W hile estim ating the storage cost, only the I/O buffer cost is included. Estim ates of the cost of datapath m emory have already been researched [Kuc91] and will be included in future. The basic scheduling algorithm is outlined in Figure 5.3. 92 Procedure ListScheduling 1. candidateJist - MostUrgentNodesQ ; /* sort nodes in CDFG based on their freedom. */ 2. while (candidateJist != NULL) { (a) node - GetBestNode (.candidateJist) ; /* get the node with the least freedom from candidate list. */ (b) step = GetBestStep(node) ; /* determine the best step for scheduling node. */ (c) if node cannot be scheduled in any step due to lack of resources then increase i. resources when optimizing performance is the design goal, or ii. execution time when optimizing area is the design goal. (d) ScheduleNodelnStep{node,step) ; /* schedule node in step. 
5.5.2 Overview of the Scheduling Algorithm in SMASH

The following are the salient features of the scheduling algorithm used in SMASH.

Scheduling I/O accesses with datapath operations

SMASH considers I/O accesses from/to the on-chip buffers as R/W operations and schedules them concurrently with the datapath operations. SMASH performs this concurrent scheduling by treating the R/W nodes (which are inserted during the preprocessing of the CDFG) as functional operators performing data transfers. Whenever an operation involving I/O is scheduled, the corresponding R/W node is also scheduled, implying an I/O buffer access.

Scheduling Algorithm - Freedom-Based List Scheduling

SMASH uses freedom-based list scheduling, with the freedom of a node as the inverse of the urgency factor [PPM86]; however, any other scheduling technique which performs scheduling under hardware and performance constraints could be used. The freedom of a node $v$ is given by:

$$Freedom(v) = ALAP(v) - ASAP(v)$$

The higher the freedom, the lower the urgency for scheduling. Distribution graphs of the R/W operations and the datapath operations are used to assign them probabilistically to the most suitable control step, as done in force-directed scheduling [PK89]. The distribution probability of an operation type $O_k$ in step $s$ is given by

$$DistProbability(O_k, s) = \sum_{v \in V_{O_k}} \frac{1}{ALAP(v) - ASAP(v) + 1}, \text{ where } ASAP(v) \le s \text{ and } ALAP(v) \ge s$$

Procedure GetBestStep(node)
1. min_dist = BIGNUMBER;
2. best_step = NONE;
3. for step = ASAP(node) to ALAP(node) {
   (a) op_dist = read_dist = write_dist = 0.0;
   (b) Check if there are sufficient resources (including read and write ports) for node to be scheduled in step.
   (c) If yes then,
       i. compute the distribution probability of node's operator,
       ii. if node requires inputs, compute the distribution probability of the READ operator,
       iii. if node produces outputs, compute the distribution probability of the WRITE operator,
       iv. compute the total of the above three distribution probabilities.
   }
4. best_step = step with sufficient resources and minimum total distribution probability.
5. return(best_step);

Figure 5.4: Selecting the Most Suitable Step in SMASH

Port Constraints on the Buffers

The read and write port constraints are used to avoid a high cost for the buffers. The scheduler ensures the availability of a functional operator as well as of the R/W operations during scheduling (Figure 5.3) by checking whether the I/O buffers are able to provide the required number of read and write ports in each step, i.e.

$$R_{req}(s) \le R_{buf}\ \forall s, \text{ and } W_{req}(s) \le W_{buf}\ \forall s$$

Bandwidth and I/O Timing Constraints

To avoid violation of the bandwidth and I/O timing constraints imposed by the user, the scheduler ensures that the inputs can be prefetched into the buffers from the background memory, and that the outputs can be transferred back to the background memory from the buffers before they are required elsewhere:

$$I(s) \le BW_{on\text{-}off} \times s\ \forall s$$
$$O(s) \le BW_{on\text{-}off} \times (MaxSteps - s)\ \forall s$$

where $I(s)$ is the number of data variables required by all the operations scheduled before or in control step $s$, and $O(s)$ is the number of data variables produced by all the operations scheduled after or in control step $s$.
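A sketch of the bandwidth feasibility check, assuming `input_need` maps each input to the step before or in which it is first required and `output_ready` maps each output to the step in or after which it is produced (illustrative names, not SMASH's data structures):

    def io_timing_ok(input_need, output_ready, bw, max_steps):
        for s in range(1, max_steps + 1):
            i_s = sum(1 for need in input_need.values() if need <= s)
            o_s = sum(1 for ready in output_ready.values() if ready >= s)
            if i_s > bw * s or o_s > bw * (max_steps - s):
                return False   # prefetch or write-back cannot keep up
        return True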
The timing constraints on the inputs and outputs are specified in a file in the following format:

data-name data-type timing-constraint

The data-name is the name of the data variable, the data-type is the type of the data variable, and the timing-constraint is the timing constraint on the data variable. For an input variable, the data-type is INPUT, and the timing-constraint is the time when the input is made available for processing. Similarly, for an output variable, the data-type is OUTPUT, and the timing-constraint is the time by which the output must have been produced.

Storage Architecture Tradeoffs

The scheduler considers storage architecture tradeoffs. During scheduling, SMASH takes the storage architecture features into account and ensures that in the following synthesis steps, when it is actually doing the storage synthesis, the datapath schedule supports the selected storage architecture. The following three types of tradeoffs possible in the storage architecture are considered in SMASH:

1. storage size vs. number of execution cycles,
2. number of ports vs. number of execution cycles, and
3. number of ports vs. storage size.

To deal with this 3-way tradeoff, SMASH iterates on the number of ports and, for each choice of the number of ports, it trades off between the size and the execution time. This can be done because the number of ports does not vary much in practical designs. Finally, the most cost-effective design (in terms of chip area) is chosen from these designs. The outer design loop minimizes the number of ports on the I/O buffers, and the inner design loop minimizes the chip area by minimizing the size of the buffers.

Chained and Multicycle Operations

To improve the total execution time of the CDFG and allow operators with varying execution delays, the nodes in the CDFG are chained, or distributed over multiple cycles, as appropriate. In operation chaining, if the total delay of multiple dependent nodes is less than the clock cycle, then they are scheduled in the same control step, as illustrated in Figure 5.5 a. Alternately, if the delay of an operation is longer than the clock cycle, then it is executed in multiple control steps, as illustrated in Figure 5.5 b.

Figure 5.5: Operators with Varying Delays ((a) operation chaining: two dependent operations with delays $t_1$ and $t_2$ share one control step; (b) multicycle operation: an operation with delay greater than the clock period spans several steps)

Handling Conditional Branches

The scheduler allows operator sharing among mutually-exclusive nodes of conditional branches. A predicate (branching condition) bitmap $PD(v)$ is associated with each node $v$. Also associated with each $v$ is a predicate value map $PV(v)$. $PV(v)$ represents the predicate values (True or False) of all the predicates associated with $v$. A one at the $i$th bit in $PD(v)$ implies that predicate $i$ ($P_i$) controls $v$, whereas a zero at the $i$th bit in $PD(v)$ implies that $P_i$ does not control $v$. When $P_i$ controls $v$, a one at the $i$th bit in $PV(v)$ implies that $v$ is executed when $P_i$ is True; similarly, a zero at the $i$th bit in $PV(v)$ implies that $v$ is executed when $P_i$ is False. $PD(v)$ and $PV(v)$ are used to determine the mutual exclusion among the nodes efficiently. The procedure to determine the mutual exclusion between two nodes $v_i$ and $v_j$ is described in Figure 5.6.

Procedure MutualExclusion(vi, vj)
1. if PD(vi) == PD(vj) then
   (a) if PV(vi) == PV(vj) then return(NO);
   (b) else return(YES);
2. else return(NO);

Figure 5.6: Determining Mutual Exclusion between Nodes vi and vj
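A sketch of Figure 5.6 using Python integers as the PD/PV bitmaps (bit $i$ set in pd means predicate $P_i$ controls the node; bit $i$ of pv gives the value under which the node executes):

    def mutually_exclusive(pd_i, pv_i, pd_j, pv_j):
        if pd_i != pd_j:
            return False      # different predicate sets: not provably exclusive
        return pv_i != pv_j   # same predicates, opposite value somewhere

    # Two nodes on opposite branches of one conditional (predicate P0):
    print(mutually_exclusive(0b1, 0b1, 0b1, 0b0))   # True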
Handling Loops

Loops are folded in order to achieve higher performance, as shown in Figure 5.7. Associated with every loop $\ell$ we have the following two parameters:

Figure 5.7: Loop Folding in SMASH (iterations $Iter_1, Iter_2, \ldots, Iter_n$ of the loop body overlap, each offset from the previous one by the folding factor)

1. The folding factor $f_\ell$, which is the delay (in clock cycles) between two consecutive iterations of the loop. $f_\ell$ is based on the inter-iteration data dependency of the nodes within the loop body and is determined by the scheduler. A variation in the folding factor also results in tradeoffs in the storage structure as well as the datapath. This kind of tradeoff is illustrated by an experiment described later in the thesis in Section 7.7.2.

2. The number of iterations $Iter_\ell$, which is the number of times the loop body is executed. SMASH assumes that $Iter_\ell$ is finite and is specified in the VHDL description for every loop.

A node $v \in V$ belonging to loop $\ell$, when scheduled in step $s$, is also executed in steps $s + f_\ell, s + 2f_\ell, \ldots, s + (Iter_\ell - 1) \cdot f_\ell$, i.e. $v$ will be scheduled in

$$steps = s + i \cdot f_\ell \quad \text{for } i = 0, \ldots, (Iter_\ell - 1)$$

In the case of nested loops, if $v$ belongs to loops $\ell_n, \ell_{n-1}, \ldots, \ell_1$, where $\ell_n$ is the outermost loop and $\ell_1$ is the innermost, then $v$ will be scheduled in

$$steps = s + i_1 \cdot f_{\ell_1} + i_2 \cdot f_{\ell_2} + \cdots + i_n \cdot f_{\ell_n}$$

for $i_1 = 0, \ldots, (Iter_{\ell_1} - 1)$, $i_2 = 0, \ldots, (Iter_{\ell_2} - 1)$, ..., $i_n = 0, \ldots, (Iter_{\ell_n} - 1)$.
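A sketch of the folded-loop step expansion above; `loops` lists (folding factor, iteration count) pairs from the innermost loop outward, and `s` is the step assigned to node $v$ (illustrative interface):

    from itertools import product

    def execution_steps(s, loops):
        return sorted(s + sum(i * f for i, (f, _) in zip(idx, loops))
                      for idx in product(*(range(it) for _, it in loops)))

    # A node at step 2 in a loop with folding factor 3 and 4 iterations:
    print(execution_steps(2, [(3, 4)]))   # [2, 5, 8, 11]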
I/O for Conditional Branches

To handle the I/O requirements of the operations inside conditional branches efficiently, the scheduler attempts to schedule the test node as early as possible. As a result, the interval between the test node and the operation node requiring I/O is large enough to transfer the data into the I/O buffers from off-chip memory. The data transfer must be done concurrently with the transfer of the other data values being used in other parts of the CDFG. When the data transfer is scheduled, the software may or may not be able to make all the required transfers in time. This gives rise to the following two scenarios (Figure 5.8).

1. Dynamic selection of the data to be transferred: if SMASH is able to transfer the data between the test execution and the operation execution, then the data which satisfies the predicates can be chosen dynamically during execution. This strategy avoids unnecessary transfers and is effective when the data (array) sizes are huge.

2. Worst-case analysis: in case there is not enough time (or bandwidth) to transfer the data, after the test is executed all the data values are scheduled to be transferred to the buffers irrespective of the outcome of the test, so that the appropriate data value is available during the execution. This results in extra data transfers but avoids delay in execution. Such situations may arise in real-time systems demanding very high performance.

Figure 5.8: Data Transfer into I/O Buffers for Conditional Branches (transfer A AND B under worst-case analysis; transfer A OR B under dynamic selection of data)

Array Accesses

Our approach to array accesses is very similar to our handling of conditional branches (Figure 5.9). Here, the node computing the array index is scheduled as early as possible. Again, during the actual data-transfer scheduling there could be the following two scenarios.

1. Dynamic selection of the data to be transferred: transfer only the referenced data of the array.

2. Worst-case analysis: transfer the whole array. This strategy is suitable for high-performance real-time systems.

Figure 5.9: Array Accesses (transfer the whole array under worst-case analysis; transfer only Array[Index] under dynamic selection)

Cost Estimation

The scheduler uses the storage cost and functional cost estimations described in Chapter 4. The costs of the storage and functional modules are estimated based on the partial schedule. These estimated costs are used in

- verifying whether the area constraints will be satisfied in the final designs, and
- evaluating the decision of scheduling an operation in a step, i.e., whether scheduling the operation in that step would require more resources in the future.

5.5.3 Discussion on the Scheduling Algorithm

To summarize, the scheduling approach in SMASH uses freedom-based list scheduling. It is enhanced by using distribution graphs to select the best step whenever there is a choice. It combines scheduling of I/O accesses with datapath scheduling by modeling them as transfer operations. The implementation allows operation chaining and also multicycle operators. It can handle conditional branches and loops in an efficient manner. All the storage-related constraints (which include port constraints on the buffers, on-chip/off-chip bandwidth constraints, and I/O timing constraints) are also considered.

As a result of this approach, the schedule obtained guarantees that no bandwidth or I/O timing constraint is violated in the next synthesis step, when the complete I/O transfer schedule between the on-chip and off-chip memory is determined. In addition, this process provides the I/O requirement schedule, which is the starting point for the second step.

5.6 Analysis of the Schedule

SMASH verifies the schedule by checking data dependencies, as defined in Equation 5.1, for all the nodes in the CDFG. It also outputs the percentage utilization of each type of module in each step, which helps us determine the quality of the schedule.

5.7 Summary

In this chapter we presented the details of the scheduling algorithm used in SMASH. We described the three stages of the scheduler: preprocessing of the CDFG, scheduling of the CDFG, and analysis of the schedule. The core of the scheduler is the scheduling stage. We presented the details of the techniques used in SMASH to combine scheduling of I/O accesses with datapath scheduling, to handle conditional branches and loops, to include the bandwidth and I/O timing constraints during scheduling, and to incorporate storage architecture tradeoffs during datapath synthesis.

The schedule obtained using the approach presented here guarantees that no bandwidth or I/O timing constraint is violated in the next synthesis step, when the complete I/O transfer schedule between the on-chip and off-chip memory is determined. In addition, this step provides the I/O requirement schedule, which is the starting point for the second step and is described in the next chapter.

Chapter 6

Storage Synthesis

6.1 Introduction

After performing combined datapath and I/O access scheduling (as described in Chapter 5), our next step in the synthesis process is the design of the storage structure.
At this stage in the synthesis process, we have determined the following design parameters:

- the complete schedule for the datapath,
- the I/O access schedule as required by the datapath,
- the number of read and write ports on the I/O buffers (implicitly determined by the I/O access schedule),
- the time interval during which the inputs and outputs are available, and
- $BW_{on\text{-}off}$, the bandwidth between the on-chip and off-chip memory.

Also note that the datapath schedule obtained in the previous synthesis step (datapath scheduling) guarantees that no bandwidth or I/O timing constraint will be violated during the storage synthesis step.

Our objective in this step is to construct the storage structure, which consists of the following three sub-architectures, as illustrated in Figure 3.5:

1. on-chip I/O buffers,
2. on-chip datapath memory, and
3. off-chip background memory.

As was described in Chapter 3, the construction of each sub-architecture consists of two basic tasks:

1. data-transfer scheduling, where the read and write times of each word are determined, and
2. module allocation, where a physical location is assigned to each word.

In this chapter we describe the design of the I/O buffers and the background memory. The design of the datapath memory has not been addressed in this thesis, as it has been researched extensively by other researchers [BMB+88, Che91, Sto89]. For the design of the I/O buffers and the background memory, we have developed techniques to perform the data-transfer scheduling. These techniques are described below and have been implemented in SMASH. The module allocation step requires further research and is not addressed in this dissertation.

6.2 Data Transfer Scheduling for I/O Buffers

The data-transfer scheduling step consists of scheduling the data transfers between the off-chip background memory and the I/O buffers with the objective of minimizing the buffer size, as shown in Figure 6.1. The data-transfer schedule must satisfy the data requirements determined in the datapath scheduling step (Chapter 5).

Figure 6.1: Data Transfer Timing for I/O Buffers (between the datapath and the I/O buffers, the transfer times $T_r^{buf}$ and $T_w^{buf}$ are known; between the I/O buffers and the background memory, they are unknown)

The problem is described as follows. The software is given as inputs

- the number of read and write ports on the I/O buffers,
- the time interval during which the inputs and outputs are available, i.e., the birth time $T_b(d)$ and death time $T_d(d)$ for all the data values $d$, and
- the data requirement of the datapath: the data read time from the buffers $T_r^{buf}(d_i)$ for the inputs and the data write time into the buffers $T_w^{buf}(d_o)$ for the outputs.

The following parameters are to be determined:

- the data transfer schedule to and from the off-chip background memory, which consists of
  - the data write time into the buffers from the background memory, $T_w^{buf}(d_i)$, for the inputs, and
  - the data read time from the buffers to the background memory, $T_r^{buf}(d_o)$, for the outputs,
  such that the size of the I/O buffers is minimized, and the buffers provide the data to the datapath when they are required and store the outputs when they are produced; and
- the required buffer size.

Figure 6.2 shows the top-level view of the inputs and outputs of this step in SMASH.

Figure 6.2: Data Transfer Scheduling in SMASH (inputs: the number of R/W ports on the buffers and the data accesses by the datapath, both from the datapath scheduling step, plus the I/O timing constraints; outputs: the data transfer schedule between the I/O buffers and the background memory, and the I/O buffer size)
6.2.1 Our Approach

The data transfer scheduling problem is analogous to the operation scheduling problem in datapath synthesis in the following two ways:

- Equivalent to the total time required for CDFG execution (the schedule length) in datapath scheduling, we have the total time required for data transfer in this problem.

- Similarly, equivalent to the number of resources in datapath scheduling, we have the number of ports, which determines the bandwidth, or the number of words that can be accessed per cycle, in this problem. In this case, the ports are the resources executing the data transfer operations.

However, the objective function in this problem differs from that in the datapath scheduling problem. In datapath scheduling, the objective is to minimize either the execution time or the number of resources, whereas in this problem both the data transfer time and the number of resources (read/write ports) have already been decided in the previous step. The objective here is to minimize the size of the buffers. This aspect of our problem makes it different from the traditional datapath scheduling problem and more complex. However, we have used existing scheduling techniques with a redefinition of the objective function. Before presenting our algorithm, we describe how the read and write times of the data values influence the buffer size.

Definition 6.2.25 The set of inputs $Buf_I(s)$ present in the buffers in step $s$ is
$$Buf_I(s) = \{d_i \mid d_i \in IP \text{ and } T_r^{buf}(d_i) \ge s\} - \{d_i \mid d_i \in IP \text{ and } T_w^{buf}(d_i) > s\}$$
where $T_r^{buf}(d_i)$ is the time when $d_i$ is read by the datapath from the I/O buffers, and $T_w^{buf}(d_i)$ is the time when $d_i$ is written into the I/O buffers from the background memory.

Definition 6.2.26 The set of outputs $Buf_O(s)$ present in the buffers in step $s$ is
$$Buf_O(s) = \{d_o \mid d_o \in OP \text{ and } T_w^{buf}(d_o) < s\} - \{d_o \mid d_o \in OP \text{ and } T_r^{buf}(d_o) < s\}$$
where $T_w^{buf}(d_o)$ is the time when $d_o$ is written by the datapath into the I/O buffers, and $T_r^{buf}(d_o)$ is the time when $d_o$ is transferred into the background memory from the I/O buffers.

$Buf_I(s)$ is the set of inputs which are read from the buffers by the datapath in steps $s, s+1, s+2, \ldots, MaxSteps$ but are not written into the buffers after step $s$; therefore these inputs must be written into the buffers in or before step $s$ and will be present in the buffers in step $s$. Similarly, $Buf_O(s)$ is the set of outputs which are written by the datapath in steps $s-1, s-2, \ldots, 1$ but are not read by the background memory in these steps. Therefore, they will be present in the buffers in step $s$.

Definition 6.2.27 The size of the I/O buffers is given by
$$BufSize = \max_s \{|Buf(s)|\} = \max_s \{|Buf_I(s)| + |Buf_O(s)|\}$$

The size of the buffers is determined by the maximum number of data values in the buffers in a control step, which implicitly depends on the data transfer timings of the I/O.
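These definitions translate directly into a check that can be run over a candidate transfer schedule. The following C sketch computes $|Buf_I(s)|$ and the input contribution to $BufSize$; the data layout is illustrative, not SMASH's actual implementation, and the output term $|Buf_O(s)|$ would be added symmetrically.

    #include <stddef.h>

    /* One input value's transfer times, as in Definition 6.2.25:
     * t_read  = step in which the datapath reads it from the buffers,
     * t_write = step in which it is written into the buffers from the
     *           background memory. */
    typedef struct { int t_read; int t_write; } xfer;

    /* |Buf_I(s)|: inputs read at step >= s and written at step <= s. */
    static int buf_in(const xfer *in, size_t n, int s)
    {
        int count = 0;
        for (size_t i = 0; i < n; i++)
            if (in[i].t_read >= s && in[i].t_write <= s)
                count++;
        return count;
    }

    /* BufSize = max over all steps of |Buf_I(s)| (Definition 6.2.27,
     * restricted here to the inputs). */
    static int buf_size(const xfer *in, size_t n, int max_steps)
    {
        int best = 0;
        for (int s = 1; s <= max_steps; s++) {
            int b = buf_in(in, n, s);
            if (b > best) best = b;
        }
        return best;
    }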
6.2.2 The Algorithm

The data-transfer scheduling algorithm used in SMASH is outlined in Figure 6.3. The salient features of this algorithm are discussed below.

Algorithm DataTransferScheduling
1. Read the I/O timing constraints, and the number of read ports $R_{buf}$ and write ports $W_{buf}$ on the buffers.
2. Read the I/O access pattern as required by the datapath:
   (a) For step = 1 to MaxSteps, get Reads[step] and Writes[step].
3. Schedule data transfers for INPUTS.
   (a) list = NULL;
   (b) For step = MaxSteps down to 1:
       i. Add all the inputs being read by the datapath in step to list.
       ii. Compute the urgency factor for each input in list.
       iii. avail_ports = $W_{buf}$ - |Writes[step]|; /* the number of writes that can be done in step */
       iv. Select avail_ports inputs from list to be written into the buffers in step, based on their urgency factors.
       v. Remove the scheduled inputs from list.
4. Schedule data transfers for OUTPUTS.
   (a) list = NULL;
   (b) For step = 1 to MaxSteps:
       i. Add all the outputs being written by the datapath in step to list.
       ii. Compute the urgency factor for each output in list.
       iii. avail_ports = $R_{buf}$ - |Reads[step]|; /* the number of reads that can be done in step */
       iv. Select avail_ports outputs from list to be read from the buffers in step, based on their urgency factors.
       v. Remove the scheduled outputs from list.
5. return;

Figure 6.3: Data Transfer Scheduling for I/O Buffers in SMASH

Scheduling Technique

SMASH uses list scheduling to schedule data transfers. A list of all the input and output values to be scheduled is maintained in each step, and the most urgent data transfers are scheduled based on their urgency factor. The urgency factor $U(d, s)$, associated with each data value, determines the necessity of transferring $d$ in step $s$. The following theorems provide the basis for computing the urgency factor.

Theorem 6.2.12 The contribution of an input, read by the datapath from the buffer, to the buffer size is minimum when the input is transferred to the buffers as late as possible.

Proof: Consider an input $d_i$ which is accessed by the datapath from the buffers in control step $T_r^{buf}(d_i)$ and is transferred into the buffers from the background memory in step $s = T_w^{buf}(d_i)$. We know that the size of the buffers is given by
$$BufSize = \max_s \{|Buf(s)|\} = \max\{|Buf(1)|, |Buf(2)|, \ldots, |Buf(MaxSteps)|\}$$
Clearly, by Definition 6.2.25, the set $Buf(s-1)$ of data values does not include $d_i$, since $d_i$ is not written into the buffers until step $s > s-1$. However, if $d_i$ is transferred into the buffers in control step $T_w^{buf}(d_i) - 1$ instead of $T_w^{buf}(d_i)$, the new set $Buf'(s-1)$ does include $d_i$, which implies
$$|Buf'(s-1)| = |Buf(s-1)| + 1$$
Hence, the new buffer size $BufSize'$ is
$$BufSize' = \max\{|Buf'(1)|, \ldots, |Buf(s-1)| + 1, \ldots, |Buf'(MaxSteps)|\} \ge BufSize$$
Therefore, if the data $d_i$ is transferred into the buffers prior to $T_w^{buf}(d_i)$, it may increase the buffer size. □

Theorem 6.2.13 The contribution of an output, written by the datapath into the buffer, to the buffer size is minimum when the output is transferred out of the buffers as soon as possible.

Proof: Similar to the above proof. □

Ideally, the inputs should be transferred into the buffers in the same step as they are required, and the outputs should be transferred out of the buffers in the same step as they are produced. Unfortunately, this is not possible, as the bandwidth available to perform the data transfers is limited, and there may not be enough bandwidth available for the required data transfers in each step.
Therefore, these transfers should be scheduled such that the bandwidth constraints are met while their presence in the buffers is minimized (to minimize the buffer size).

Based on the above theorems, our heuristic attempts to transfer the input data into the buffers from the background memory as late as possible, and to transfer the output data from the buffers into the background memory as soon as possible. Unless it is necessary to transfer an input data value into the buffers, its transfer is postponed; similarly, the outputs are transferred out of the buffers as soon as possible.

Computing the Urgency Factor

The urgency factor $U(d_i, s)$ for an input $d_i$ in step $s$ depends on $s - T_b(d_i)$. This is the number of steps still available to transfer $d_i$ into the buffers from the background memory before $d_i$ is required by the datapath. The smaller this value is, the more urgent it is to transfer $d_i$ into the buffers. If $s = T_b(d_i)$, then $d_i$ must be transferred in step $s$. Furthermore, if $d_i$ is accessed again by the datapath in a step $s'$ such that $s' > s$, then its priority is lowered for the second transfer from the background memory, as it can be stored in the buffers. As described during the discussion of storage tradeoffs, multiple transfers of data require a higher $BW_{on\text{-}off}$. Since $BW_{on\text{-}off}$ is user-specified, the inputs that are accessed again are retransferred only when there is sufficient bandwidth available for multiple transfers. SMASH still allows multiple transfers for these inputs, because storing them may increase the size of the buffers.

Similarly, the urgency factor $U(d_o, s)$ for an output $d_o$ in step $s$ depends on $T_d(d_o) - s$. This is the number of steps available to transfer $d_o$ from the buffers into the background memory before it is required elsewhere. The smaller this value is, the more urgent it is to transfer $d_o$ from the buffers. If $s = T_d(d_o)$, then $d_o$ must be transferred in step $s$. (Note that we do not consider multiple writes of an output, because if an output is produced multiple times, only the value which is produced last is valid.)

Updating the List of Data Values to be Scheduled

As mentioned above, for every control step a list of data values is maintained. Only these data values are considered for scheduling in that control step. While scheduling the input data transfers, the candidate list is determined as follows. In this case, the processing is done backwards, from MaxSteps to the first step. Assume that steps MaxSteps, MaxSteps - 1, ..., s + 2, s + 1 have been scheduled and we are now at step s. The candidate list (the input buffer set $Buf_I(s)$) contains the values that are

- read by the datapath from the buffers in step s, which is Reads[s] in our implementation, and
- read by the datapath from the buffers in steps s + 1, s + 2, ..., MaxSteps but not written into the buffers from the background memory in these steps, which is the candidate list remaining after scheduling the data transfers in the previous iteration (corresponding to control step s + 1).

These are the inputs that need to be transferred into the buffers from the background memory at or before step s. In other words, this is the list of inputs which is considered for scheduling in step s.
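For illustration, one step of the backward input pass of Figure 6.3 might look as follows in C, assuming the urgency is computed from the birth time as described above. The struct layout, selection loop, and names are our own simplifications, not SMASH's implementation.

    /* Candidate input during the backward pass.  Steps start at 1,
     * so t_write == 0 means "not yet scheduled". */
    typedef struct {
        int t_read;    /* step in which the datapath reads the input */
        int t_birth;   /* earliest step it exists in background memory */
        int t_write;   /* assigned transfer step (0 = unscheduled) */
    } cand;

    /* Remaining slack of candidate c at step s: the transfer can
     * still be pushed back to any step in [t_birth, s].  Smaller
     * slack = more urgent. */
    static int slack(const cand *c, int s) { return s - c->t_birth; }

    /* Schedule up to avail_ports of the most urgent candidates to be
     * written into the buffers in step s (step iv of Figure 6.3). */
    static void schedule_step(cand *list, int n, int s, int avail_ports)
    {
        while (avail_ports-- > 0) {
            cand *pick = 0;
            for (int i = 0; i < n; i++) {
                if (list[i].t_write != 0) continue;  /* already done  */
                if (list[i].t_read < s)  continue;   /* not a candidate */
                if (!pick || slack(&list[i], s) < slack(pick, s))
                    pick = &list[i];
            }
            if (!pick) return;       /* nothing left to transfer */
            pick->t_write = s;       /* transfer as late as ports allow */
        }
    }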
For the outputs, the candidate list contains the outputs th at are • produced in step s by the datapath, which is Writes [5 ] in our im plem entation, and • produced in steps 1,2,..., s — 1 but not transferred into the background m em ory in these steps, which is the candidate list rem aining after scheduling the data transfers in the previous iteration corresponding to control step s — 1. Note th at in this case the processing is done forwards from step 1 to M axSteps. 6.2.3 S electin g th e D ata V alues for Transfer After creating the candidate list the algorithm selectively schedules the m ost urgent values in each step. Prior to scheduling any input transfer from (or output transfer to) the background mem ory from the buffers we have to take into account the output writes (or input reads) by the datapath into (or from) th e buffers because the ports on the buffers are used by background memory as well as the datapath. This is done by first determ ining all the reads and writes by the datapath (Reads {.step] and Writes [step] in Figure 6.3) from the datapath schedule, then adjusting the num ber of available ports (avail-ports) on the buffers in each step. Finally, the algorithm schedules availjports num ber of d ata values for transfer between the background mem ory and the buffers from the candidate list based on their urgency factor. This process is repeated for all the steps. 6.2.4 D iscu ssion on th e A lgorith m In sum m ary, the salient features of the algorithm are th at it 113 • minimizes the size of the I/O buffers, while m eeting the port and tim ing con straints; • transfers the inputs into the buffers from the background memory as late as possible and transfers the outputs to the background m em ory from th e buffers as soon as possible, since keeping a value in the I/O buffers will contribute to the buffer size; • checks if the value which is required again by the datapath, can be refetched, and — if yes, then the algorithm overwrites the value in order to save a memory location, — else, the algorithm stores the value in the buffers for later use. The scheduling is guaranteed to succeed because the datapath scheduling step ensures th at no bandw idth or I/O tim ing constraint will be violated in this step. 6.3 Background Memory Synthesis Synthesis of the background memory is done in the sam e way as I/O buffer synthesis. First d ata transfers are scheduled and then m odule allocation is performed. As m entioned earlier, the module allocation step is not addressed in this thesis. The background memory interacts with the on-chip buffers and the external I/O . After scheduling the data transfers between the buffers and the background memory, the read schedule from the background m em ory for all the input values and the write schedule into the background memory for the output values are known. The rem aining unknown timings for the data values include • d ata write tim e to the background memory from the external world for all the inputs d,, and • d ata read tim e from the background m em ory to the external world 7~ bk(d0) for all the outputs d0. These tim ings are either fixed by the external I/O constraints or can be determ ined in the same way as the I/O buffers, as described in Section 6.2. 114 6.4 Summary In this chapter we have described our approach for scheduling d ata transfers between the I/O buffers and the background memory. The d ata transfer scheduling is the first step towards synthesizing the storage structure. 
SMASH schedules the data transfers between the I/O buffers and the background memory, given the number of read and write ports on the buffers, the data accesses by the datapath, and the time interval during which the inputs and outputs are available, while minimizing the storage size. The algorithm used in SMASH was also presented. The algorithm is based on list scheduling with a modified objective function; the scheduling technique is based on the urgency factor associated with each data value. The approach is extended to schedule the data transfers between the background memory and the external world if they are not already constrained by the user specification.

Chapter 7

Experimental Results

7.1 Introduction

This chapter describes the overall experimental process and the results. In the first section we briefly describe two layout experiments done prior to the memory-synthesis research. The remainder of the chapter describes the experiments done using SMASH.

The experiments done prior to the memory-synthesis research not only motivated us to address the memory synthesis problem but also showed us some of the area-time tradeoffs possible in the storage architecture of a design. The experiments done using SMASH demonstrated the effectiveness of our techniques during high-level synthesis of "real" designs. As mentioned in the introduction (Chapter 1), we developed our techniques and synthesis tools keeping "real" designs in mind. An integral part of this research was to demonstrate the effectiveness of SMASH during synthesis of such "real" designs.

7.2 Experiments Prior to Memory Research

As mentioned in the introductory chapter, our research was also motivated by our layout studies of the automatic synthesis of an AR filter through the ADAM system [PGH91]. In a separate experiment we manually replaced the input latches with a RAM and compared the areas of the two implementations. These experiments are described below.

7.2.1 Experiment 1: Layout Studies of an AR Filter

In this experiment, our goal was to study the effects of wiring area and delay and unused area on the final chip characteristics. We also wanted to demonstrate the existence of the cost-performance tradeoff curve in actual layouts based on automatically synthesized designs, using both pipelined and non-pipelined design styles. This was the first published tradeoff study of its kind [PGH91].

The example chosen for this study was the AR lattice filter element, a design with a clear cost-performance tradeoff curve at the register-transfer level. The data-flow graph for this AR filter is shown in Figure 7.1. For this experiment, we produced 6 non-pipelined and 6 pipelined RTL designs. We used MAHA [PPM86] to generate the schedules for the non-pipelined designs and Sehwa [PP88] for the pipelined designs. MABAL completed the RTL designs, which were then translated to Cascade Design Automation's ChipCrafter format through our netlist translation and expansion program. For the non-pipelined designs, cost was measured as the total area of the bounding box and performance as the delay through the active area of the chip. For the pipelined designs, cost was measured as the area of the bounding box, and performance as the delay between the initiations of new data into the pipeline. We ran ChipCrafter with the OKI 1.2 micron, twin-well, double-layer metal CMOS ruleset and achieved layouts ranging from 20,000 to 30,000 transistors. For each design, we measured the individual contributions to the final chip area.
We then assessed whether these layouts still fit our cost-speed tradeoff curve.

Non-Pipelined Results

These designs varied in parallelism. A cost-performance tradeoff curve for the non-pipelined datapaths is shown in Figure 7.2. This curve shows the register-transfer design points, and the physical parameters when layout is considered. The register-transfer design points included raw cell area and raw cell delays. Although the tradeoff curve is not a smooth convex surface when physical factors are taken into account, there are no faster designs which are cheaper, nor slower designs which are also larger. The most parallel non-pipelined layout is shown in Figure 7.3.

Figure 7.1: The AR Filter Dataflow Graph

Figure 7.2: Cost-Performance Tradeoff Curve for a 16-bit Non-Pipelined AR Filter Datapath Element (area in sq. microns, x10^7, vs. datapath delay in ns; actual points vs. RTL points)

Figure 7.3: Layout of the Most Parallel Non-Pipelined Design

Design No. | No. of Control Steps | Active Area (10^6 µm²) | Total Delay (ns) | Datapath Delay (ns)
1 | 1 | 44.5 | 171 | 171
2 | 4 | 23.4 | NA | NA
3 | 8 | 19.9 | 836 | 568
4 | 10 | 12.4 | 1036 | 670
5 | 16 | 12.0 | 1693 | 1161
6 | 18 | 7.4 | 1781 | 1200

Table 7.1: Summarized Area-Delay Statistics of the Non-Pipelined Designs

Pipelined Results

The tradeoff curve in Figure 7.4 is for the pipelined designs. The curve shows the register-transfer-level design points and the actual points considering the layout. In these designs the basic clock period remains almost the same, and therefore the delay depends on the initiation interval of the circuit.

Design Number | No. of Control Steps | Active Area (10^6 µm²) | Initiation Interval (ns)
1 | 1 | 64.1 | 63
2 | 4 | 20.6 | 265
3 | 6 | 18.5 | 410
4 | 8 | 12.9 | 529
5 | 12 | 11.0 | 830
6 | 16 | 10.0 | 1185

Table 7.2: Summarized Area-Delay Statistics of the Pipelined Designs

Figure 7.4: Overall Cost-Performance Tradeoff Curve for a 16-bit Pipelined AR Filter Datapath (area in sq. microns, x10^7, vs. datapath delay in ns; actual points vs. RTL points)

Discussion

None of the layouts described above included the input latches. When these input latches are included (as most traditional synthesis systems do), the chip area becomes very high. Furthermore, without proper I/O storage and management, the number of I/O pins required is unnecessarily high and results in impractical designs. Buffering I/O with a RAM instead of registers, subsequent to these experiments (described in the next section), halved the area of the smallest design. An important outcome of this simple experiment was the need for high-level synthesis systems to take into account a number of factors, such as efficient buffering and I/O data management, that were ignored in the past.
Only then can we use them for real designs and make them acceptable to industry.

7.2.2 Experiment 2: Input Latches vs. Input RAM

In a separate experiment, we included the input latches and generated the layouts for 5 non-pipelined AR filter designs with varying parallelism. Then we manually merged these input latches into a RAM and again generated the layouts. On analyzing the layouts, we observed a significant improvement in area where the registers were merged; in fact, the saving was as high as 50% in some cases. The experimental results are given in Table 7.3. This experiment motivated us to explore the memory design tradeoffs possible in this dimension.

Design Number | No. of Steps | With Registers: Register Area (10^6 µm²) | With Registers: Chip Area (10^6 µm²) | With RAM: Reg. + RAM* Area (10^6 µm²) | With RAM: Chip Area (10^6 µm²) | Improvement %
1 | 4 | 2.28 | 26.83 | 1.48 | 22.87 | 14.76
2 | 8 | 2.28 | 24.38 | 1.16 | 16.96 | 30.43
3 | 10 | 2.28 | 17.74 | 0.91 | 11.39 | 35.79
4 | 16 | 2.28 | 17.30 | 0.83 | 9.88 | 52.90
5 | 18 | 2.28 | 14.76 | 0.75 | 6.64 | 55.01

*RAM area = 0.5 x 10^6 µm².

Table 7.3: Storage Area Statistics of the Layouts

7.3 Experiments Using SMASH

These experiments were designed to demonstrate the following aspects of our research:

- the capabilities of SMASH in combining datapath scheduling with I/O accesses from the buffers,
- the improvement in datapath schedules when storage-related constraints are also considered,
- the existence of the cost-performance tradeoff in the storage architecture (cost being a function of the R/W ports, the storage size, and $BW_{on\text{-}off}$),
- the capabilities of the software in terms of handling "real" designs, which include descriptions of arrays, loops, and conditional branches in the input behavioral specification, and
- the role of SMASH in a system-level design environment like USC's.

These experiments have been organized into three categories:

1. experiments with representative high-level synthesis workshop benchmarks,
2. rapid prototyping of a JPEG still image compression system, and
3. enhanced design of the components of the JPEG still image compression system.

The input VHDL behavioral descriptions for the designs were manually written. As mentioned earlier, closeness to "reality" was always emphasized in our research; therefore the module library was generated using a commercial silicon compiler, by first laying out the different modules and then characterizing them individually. The details of the module library are given in the following section.

7.4 Module Library

We used the Epoch Silicon Compiler from Cascade Design Automation to design our library elements. The technology used is based on Epoch's module library and uses Orbit's 1.2 micron technology. The design as well as the area-delay analysis was done using Epoch's tools. Since SMASH requires functional modules and storage modules, the two libraries were developed as described below.

7.4.1 Functional Modules

The functional modules were characterized by (i) cost and (ii) execution delay. We designed the required functional operators using Epoch and analyzed the designs to obtain the area and delay of each module. These parameters are summarized in Table 7.4.

Module Name | Area (sq. microns) | Delay (ns)
Multiplier | 2398490 | 55
Divider | 2398490 | 55
Adder | 81558 | 30
Subtractor | 81558 | 30
Comparator | 81558 | 30
2-1 Mux | 24858 | 3
3-1 Mux | 69834 | 3
4-1 Mux | 76028 | 3
Distribute* | 0 | 0
Join* | 0 | 0

*dummy module

Table 7.4: Module Library Used by SMASH

7.4.2 Storage Modules

The storage modules were characterized by (i) the cost of storage per word, (ii) the maximum storage capacity, and (iii) the number of read and write ports on the module. This required us to characterize cost as a function of size. To do that, we generated several implementations of various types of storage modules (such as register files and on-chip RAMs); then, by interpolation, we derived the relationship between size and cost.
The following storage modules were used in our library:

- Register: The area required by a 16-bit wide register is 37821 µm².

- Register file with 1 read port and 1 write port: The following cost function was derived from the layouts of 1R/1W register files with varying sizes (as shown in Plot 7.5):
  $$Cost_{Area}(size) = 53707 \times \left\lceil \frac{size}{8} \right\rceil - 14405\ \mu m^2$$
  $Cost_{Area}(size)$ is a step function because Epoch's library required the size of the register file to be a multiple of 8, and the maximum allowed size was 64 words per module.

- Register file with 2 read ports and 1 write port: For the 2R/1W register file, $Cost_{Area}(size)$ (from Plot 7.6) is
  $$Cost_{Area}(size) = \cdots \times \left\lceil \frac{size}{8} \right\rceil - 19135\ \mu m^2$$
  where size varied in multiples of 8, up to 64 words per register file.

- On-chip RAM with 1 read port and 1 write port: For the 1R/1W RAM module, $Cost_{Area}(size)$ (from Plot 7.7) is
  $$Cost_{Area}(size) = 9043 \times \left\lceil \frac{size}{2} \right\rceil + 147480\ \mu m^2$$
  where size should be an even number (between 4 and 4096) for each RAM.

- On-chip RAM with 2 read ports and 1 write port: For the 2R/1W RAM module, $Cost_{Area}(size)$ (from Plot 7.8) is
  $$Cost_{Area}(size) = 18329 \times \left\lceil \frac{size}{2} \right\rceil + 303680\ \mu m^2$$
  where size should be an even number (between 4 and 1024) for each RAM.

Figure 7.5: Area vs. Size for a 1R/1W Register File in Epoch
Figure 7.6: Area vs. Size for a 2R/1W Register File in Epoch
Figure 7.7: Area vs. Size for a 1R/1W RAM Module in Epoch
Figure 7.8: Area vs. Size for a 2R/1W RAM Module in Epoch

Epoch's library also provides several other port configurations (such as 2R/2W, 3R/1W, 3R/2W, 3R/3W, 4R/1W, and 4R/2W) for RAM modules, but these were not used in our experiments. These modules will be characterized and included in the library in the future.

The clock cycle was assumed to be 30 ns.
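As an illustration, such a step-function characterization can be coded directly. The following C sketch mirrors the 1R/1W formulas above (the constants are the ones given there; the function names are our own, and this is not part of SMASH itself):

    /* Area model for the 1R/1W register file (Plot 7.5): Epoch
     * required multiples of 8 words, at most 64 per module, so the
     * cost steps in blocks of 8. */
    static double regfile_1r1w_area(int words)
    {
        int blocks = (words + 7) / 8;          /* ceil(words / 8) */
        return 53707.0 * blocks - 14405.0;     /* square microns  */
    }

    /* Area model for the 1R/1W on-chip RAM (Plot 7.7): sizes must be
     * even, so the cost steps in blocks of 2 words. */
    static double ram_1r1w_area(int words)
    {
        int pairs = (words + 1) / 2;           /* ceil(words / 2) */
        return 9043.0 * pairs + 147480.0;      /* square microns  */
    }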
7.5 High-Level Synthesis Benchmark Examples

From the High-Level Synthesis Workshop benchmarks, the following three examples were synthesized to illustrate the above-mentioned capabilities of SMASH [GP94]:

1. a second-order differential equation solver [PK89],
2. an AR filter element [PGH91], and
3. an elliptic wave filter element.

To simplify the experiments, it was assumed that the input data was stored in an off-chip RAM before processing. The software was executed with priority on performance optimization while meeting the area constraint. We also generated some designs without considering the storage-related constraints ($BW_{on\text{-}off}$ and $R_{buf}/W_{buf}$). SMASH generated several designs with varying parameters for these examples. These implementations are summarized in Tables 7.5, 7.6, and 7.7. The parenthesized storage architecture parameters indicate the designs without storage-related constraints; these parameters were manually determined when the designs were mapped onto our target architecture.

Design no. | Functional area (10^6 µm²) | BW_on-off (words/cycle) | R_buf ports | W_buf ports | Buffer size (words) | Functional resources | Execution time (ns)
1 | 7.5 | (4) | (4) | (2) | NA (6) | <, 3 * | 240
2 | 7.5 | 2 | 4 | 2 | 8 | <, 3 * | 240
3 | 6.0 | (4) | (4) | (1) | NA (5) | <, 2 * | 300
4 | 6.0 | 2 | 3 | 1 | 6 | <, 2 * | 300
5 | 6.0 | 1 | 3 | 1 | 7 | <, +, -, 2 * | 330
6 | 3.0 | 2 | 2 | 1 | 4 | <, +, * | 450
7 | 3.0 | 1 | 2 | 1 | 6 | <, +, * | 480

Table 7.5: Parameters from SMASH for the Differential Equation Example

Design no. | Functional area (10^6 µm²) | BW_on-off (words/cycle) | R_buf ports | W_buf ports | Buffer size (words) | Functional resources | Execution time (ns)
1 | 10.0 | (6) | (6) | (2) | NA (8) | 2 +, 4 * | 390
2 | 10.0 | 4 | 4 | 2 | 6 | 2 +, 4 * | 390
3 | 10.0 | 3 | 4 | 2 | 10 | 2 +, 4 * | 420
4 | 5.0 | 2 | 4 | 2 | 8 | 2 +, 2 * | 600

Table 7.6: Parameters from SMASH for the AR Filter Example

Design no. | Functional area (10^6 µm²) | BW_on-off (words/cycle) | R_buf ports | W_buf ports | Buffer size (words) | Functional resources | Execution time (ns)
1 | 10.0 | (3) | (3) | (4) | (10) | 4 +, 4 * | 540
2 | 10.0 | 3 | 3 | 4 | 10 | 4 +, 4 * | 540
3 | 7.5 | 2 | 2 | 2 | 7 | 3 +, 3 * | 600
4 | 5.5 | 2 | 3 | 3 | 8 | 3 +, 2 * | 690
5 | 5.0 | 2 | 2 | 1 | 4 | 2 +, 2 * | 720

Table 7.7: Parameters from SMASH for the Elliptic Wave Filter

7.5.1 Differential Equation Example

In this section, we analyze the differential equation example in detail. Seven different implementations of the differential equation example were generated, varying in cost and performance. These designs are briefly described below.

- Design 1: This design is a high-performance but expensive (high-cost) design. It was synthesized without considering the storage-related parameters, so that a comparison could be made with the designs where these parameters were also optimized. This design was synthesized by executing SMASH without constraining $BW_{on\text{-}off}$ and $R_{buf}/W_{buf}$; the parenthesized storage architecture parameters shown in the table were manually determined from the output.

- Design 2: This design was generated by executing SMASH under $BW_{on\text{-}off}$ constraints. It requires the same number of functional resources as Design 1 and has the same performance. However, the $BW_{on\text{-}off}$ requirement is 50% lower in this case compared to Design 1. An analysis of the operation schedule showed that SMASH achieved this by scheduling the operations in such a manner that the operations requiring inputs were postponed until the inputs were available, while other operations were moved ahead.

- Design 3: Again, this design was synthesized without considering the storage-related parameters. It is a slower but cheaper design compared to the above two designs.

- Design 4: This design has the same functional cost and performance as Design 3. However, $BW_{on\text{-}off}$ is 50% lower, and even the number of read ports on the buffers, $R_{buf}$, is lower.

- Design 5: This design was synthesized to show the tradeoff in $BW_{on\text{-}off}$. Here, the functional cost is the same as in Designs 3 and 4, but $BW_{on\text{-}off}$ is lower, hence the longer execution time. The lower $BW_{on\text{-}off}$ also requires inputs to be prefetched and stored in the buffers; therefore, the buffer size is higher than in Design 4.

- Design 6: This is a slow and cheap design. There is substantial resource sharing, and therefore the execution time is longer. The buffer size decreased in this case, as the longer execution time provided enough flexibility for the data transfers: SMASH could schedule the operations such that data could be transferred whenever required, without storing them in the buffers.

- Design 7: Again, this design has the same functional cost as Design 6, but lower $BW_{on\text{-}off}$. As a result, some operations were postponed until the required data could be fetched, hence the longer execution time. Also, some inputs were prefetched and stored in the buffers to avoid further delay in execution, hence the buffer size is higher than in Design 6.

Designs 1 and 3 versus Designs 2 and 4, respectively, clearly demonstrate the effectiveness of our techniques in reducing unnecessary storage cost. The schedules with storage constraints achieved the same performance with the same functional area while reducing $BW_{on\text{-}off}$ and $R_{buf}/W_{buf}$. This was because SMASH distributed the data requirement uniformly and overlapped the data transfers with the execution.

Designs 5, 6, and 7 were generated to demonstrate the tradeoff curve existing in the storage architecture. Designs 4 vs. 5 and 6 vs. 7 show that for the same functional resources, varying the storage parameters resulted in different execution delays. Designs 5 and 7 are slower than Designs 4 and 6, respectively. Analysis shows that this was because of the limited bandwidth allowed in Designs 5 and 7. Furthermore, in Designs 5 and 7, as the I/O transfer was the bottleneck, unnecessary transfers were avoided by storing the data values in the buffers for further use, hence the buffer size increased. Note that though these slower designs require bigger buffer sizes, they are still cheaper because of the lower bandwidth requirement (which in turn also determines the number of pins on the chip).

Figure 7.9 shows the datapath schedule in detail for Design 5, excluding the R/W nodes. This figure also does not include the on-chip/off-chip data transfer schedule, as that task is performed in the second design step. The results obtained from the data transfer scheduling software are summarized in Table 7.8.
Schedules with storage constraints achieved the same performance with the same functional area while re ducing B W 0n- 0f f and Rbuj/W buf- This was because SMASH distributed the data requirem ent uniformly and overlapped the data transfer with the execution. Designs 5, 6 and 7 were generated to dem onstrate the tradeoff curve existing in the storage architecture. Designs 4 vs. 5 and 6 vs. 7 show th at for the same func tional resources, varying storage param eters resulted in different execution delays. Designs 5 and 7 are slower than the designs 4 and 6 respectively. Analysis shows th at this was because of the lim ited bandw idth allowed in 5 and 7. Furtherm ore, in designs 5 and 7, as the I/O transfer was the bottleneck, unnecessary transfers were avoided by storing the d ata value in the buffers for further use, hence the buffer size increased. Note th at though these slower designs require bigger buffer sizes, they are still cheaper because of the lower bandw idth requirem ent (which in turn also determines the num ber of pins on the chip). Figure 7.9 shows the datapath schedule in detail for design 5, excluding the R /W nodes. This figure also does not include the on-chip off-chip data transfer schedule, as th at task is performed in the second design step. The results obtained from the 134 stepl 0 2 J step6 dx step9 unext xnext Figure 7.9: Scheduled CDFG for Design 5 (2nd-order Differential Equation). 135 Step number Buffers-Off chip data trasfer Buffer contents Buffers-Datapath data transfer Write into buffer Read from buffer Read from buffer Write into buffer 1 X - X - - 2 3 - 3, x 3,x - 3 dx - 3, dx - - 4 u - 3, dx, u u,dx - 5 y - 3, dx, y 3,y - 6 X - dx, x, y - - 7 a - dx, x, y, a a,dx,x - 8 u - dx, x, y, u y.u - 9 - ynext dx, x dx,x ynext 1 0 - unext - - unext 11 - xnext - - xnext Table 7.8: D ata transfer schedule for design 5 (2nd-order differential equation). 136 data transfer scheduling software are sum m arized in Table 7.8. Observe how inputs ‘x ’ and ‘u ’ are read twice into the buffer because there was an em pty ‘slot’ available for re-transfer before they were required again, and storing them in the buffers would have resulted in an increase in the buffer size. On the other hand, ‘dx’ was stored in the buffers for future use because its re-transfer was not possible before the given tim e. Also, note th at after re-fetching ‘x ’ in step 6 (for consumption in step 7) it was saved in the buffers for step 9. Fetching ‘x ’ again in step 8 (for step 9) instead of ‘u ’, and saving ‘u ’ in the buffers could have been another option, but th at would have resulted in a buffer size of 5 words (in step 7). Therefore, SMASH decided to save ‘x ’ and re-fetch ‘u ’. 7.5.2 A R F ilter and E llip tic W ave F ilter E xam ples The AR filter example showed the same tradeoffs as shown by the differential equa tion example. Designs 1, 2, and 3 had the same functional cost but varying storage cost and performance. Design 1 was produced without considering the storage cost, therefore it resulted in high storage cost. Both B W 0n- 0f j and Rbuj were high in this case. W hen considering the storage cost during synthesis (design 2), SMASH could reduce both the B W on-o j/ and Rbuj while achieving the same perform ance with the same functional hardware. In design 3, the B W 0n- 0 / / was further reduced but with an increase in the execution tim e. Design 4 was synthesized w ith tighter area and B W 0n- 0f f constraints. It is the slowest and cheapest of the four designs. 
The elliptic wave filter example was also synthesized to show the above-mentioned tradeoffs. It clearly shows a tradeoff between the cost (both the functional cost and the storage cost) and the execution time.

7.5.3 Discussion

To summarize, these designs showed the tradeoffs in the storage design in addition to the tradeoffs in the datapath. They showed why storage-related constraints must be considered during datapath synthesis to get better designs. They also showed SMASH's capability of determining the parameters of the storage structure and scheduling the required data transfers.

7.6 Rapid Prototyping of a JPEG Still Image Compression System

The objective of this system-level experiment [CGDBP94, GCDBP94] was twofold: first, to synthesize more realistic designs with SMASH, and second, to illustrate the way SMASH can be used in a system-design environment. In this experiment, we chose to focus our design activity on a standard for still image compression, JPEG [Wal91]. We chose to synthesize the system shown in Figure 7.10, and used SMASH along with other system-level tools (ProPart and SOS) to explore the design possibilities.

Figure 7.10: JPEG Still Image Compression System (2D-DCT, followed by a quantizer and an entropy encoder)

Figure 7.11 shows the design flow used for this experiment. It is important to note that the design flow has some bottom-up portions, which represent the flow between the applications of each tool, each of which operates in an essentially top-down fashion. Thus the design flow is both top-down and bottom-up.

We began with the synthesis of the DCT (Discrete Cosine Transform) function. The 2D-DCT was decomposed into repeated row-column 1D-DCTs prior to the application of the system-level tools (Figure 7.12).

Figure 7.11: Design Flow for the Still Image Compression System Example

Figure 7.12: 2D-DCT Implementation from 1D-DCTs (an 8 x 8 pixel frame is transformed by a 1D-DCT over the rows followed by a 1D-DCT over the columns)

The 1D-DCT macro was synthesized first and used to construct a 2D-DCT, clearly a bottom-up step. The 8-point 1D-DCT matrix is as follows [FLS+92]:

$$\begin{bmatrix} X_0 \\ X_4 \\ X_2 \\ X_6 \\ X_1 \\ X_3 \\ X_5 \\ X_7 \end{bmatrix} =
\begin{bmatrix}
d & d & 0 & 0 & 0 & 0 & 0 & 0 \\
d & -d & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & b & f & 0 & 0 & 0 & 0 \\
0 & 0 & f & -b & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & a & c & e & g \\
0 & 0 & 0 & 0 & c & -g & -a & -e \\
0 & 0 & 0 & 0 & e & -a & g & c \\
0 & 0 & 0 & 0 & g & -e & c & -a
\end{bmatrix}
\begin{bmatrix} x_0 + x_7 + x_3 + x_4 \\ x_1 + x_6 + x_2 + x_5 \\ x_0 + x_7 - x_3 - x_4 \\ x_1 + x_6 - x_2 - x_5 \\ x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix}$$

This 1D-DCT description was translated into a behavioral VHDL description.
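For reference, the factorization above can be written directly as code. The following C sketch assumes the customary coefficient assignment for this decomposition, a = cos(pi/16), b = cos(2pi/16), ..., g = cos(7pi/16), and omits any overall scale factor, since the text does not show one; it illustrates the even/odd butterfly structure and is not the synthesized VHDL design.

    #define PI 3.14159265358979323846
    #include <math.h>

    /* 8-point 1D DCT through the even/odd factorization above:
     * sums of mirrored samples feed the even outputs, differences
     * feed the odd outputs. */
    static void dct8(const double x[8], double X[8])
    {
        const double a = cos(1*PI/16), b = cos(2*PI/16),
                     c = cos(3*PI/16), d = cos(4*PI/16),
                     e = cos(5*PI/16), f = cos(6*PI/16),
                     g = cos(7*PI/16);

        double s07 = x[0] + x[7], s16 = x[1] + x[6];
        double s25 = x[2] + x[5], s34 = x[3] + x[4];
        double d07 = x[0] - x[7], d16 = x[1] - x[6];
        double d25 = x[2] - x[5], d34 = x[3] - x[4];

        X[0] = d * (s07 + s34 + s16 + s25);
        X[4] = d * (s07 + s34 - s16 - s25);
        X[2] = b * (s07 - s34) + f * (s16 - s25);
        X[6] = f * (s07 - s34) - b * (s16 - s25);
        X[1] = a * d07 + c * d16 + e * d25 + g * d34;
        X[3] = c * d07 - g * d16 - a * d25 - e * d34;
        X[5] = e * d07 - a * d16 + g * d25 + c * d34;
        X[7] = g * d07 - e * d16 + c * d25 - a * d34;
    }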
SMASH was used to generate five schedules from this VHDL description. The module library used is shown in Table 7.4. These datapath schedules, with varying cost and performance, are shown in Table 7.9, which also gives the buffer size and bandwidth requirement determined by SMASH.

Design number | Functional area (10^6 µm²) | BW_on-off (words/cycle) | R_buf | W_buf | Buffer size (words) | Functional resources | Execution time (cycles)
1 | 30 | 5 | 12 | 6 | 21 | 3 +, 4 -, 12 * | 8
2 | 20 | 3 | 6 | 5 | 14 | 3 +, 3 -, 8 * | 13
3 | 18 | 3 | 4 | 4 | 11 | 3 +, 3 -, 7 * | 15
4 | 10 | 3 | 3 | 3 | 7 | 2 +, 2 -, 4 * | 20
5 | 6 | 2 | 3 | 3 | 9 | 2 +, 2 -, 2 * | 26

Table 7.9: 1D-DCT Design Parameters Obtained Using SMASH

The 1D-DCT schedules were then processed by another ADAM tool called MABAL [KP90] to generate the RTL netlists. These netlists were analyzed to obtain the area characteristics of the datapaths, as shown in Table 7.10. The areas of the functional units, multiplexers, and registers were determined from the netlists, and the wiring area was estimated manually using a rule of thumb which we observed in our earlier experiments [PGH91].

Design Number | Functional area A (10^6 µm²) | Muxes and registers B (10^6 µm²) | Wiring C = 2(A+B) (10^6 µm²) | Total area (10^6 µm²)
1 | 29.35 | 3.74 | 66.18 | 99.27
2 | 19.68 | 3.87 | 47.09 | 70.63
3 | 17.20 | 4.05 | 42.50 | 63.74
4 | 9.92 | 3.67 | 27.18 | 40.77
5 | 5.12 | 4.12 | 18.48 | 27.73

Table 7.10: 1D-DCT RTL Designs from MABAL

Next, for the system-level design, we estimated the performance and silicon area of the remaining components of the system. A 2D-DCT architecture consisting of two 1D-DCT modules and an 8 x 8 frame buffer was selected, as shown in the literature [FLS+92]. The worst-case datapath delay was used to calculate the performance of each implementation with a two-phase, non-overlapping clocking scheme. The quantizer performance and silicon area were estimated similarly, and the parameters used are comparable to those reported in the literature [FLS+92]. For the Huffman codec, the parameters of an existing chip were used [PP93].

After estimating the performance and silicon area of all the parts of the compression system, it was partitioned by ProPart. ProPart selected the 2D-DCT design which is constructed using the second 1D-DCT design produced by SMASH. Finally, the layouts of the 1D-DCT macro and the 2D-DCT chip were generated using Epoch from Cascade Design Automation. These layouts are shown in Figures 7.13 and 7.14, and the analysis of the area distribution is shown in Table 7.11.

Figure 7.13: Layout of the 1D-DCT Module

Figure 7.14: Layout of the 2D-DCT Chip

Design | Total area | Functional | Controller | Interconnect: Muxes + Registers | Interconnect: Wiring
1D-DCT | 83.72 | 19.83 | 1.03 | 4.67 | 58.19
2D-DCT | 209.94 | 39.66 | 2.06* | 13.91** | 154.31

* controller for the frame buffer and I/O pads not included
** a 64-word on-chip RAM included

Table 7.11: Area Analysis for the Layouts (areas in 10^6 µm²)

A comparison of our chip set with others is shown in Table 7.12 [CS93]. Since we obtained the Huffman coding chip parameters from another source, they are compared here only to show that the parameters we are using are comparable to those in the literature. The die we did design, the DCT, has a somewhat larger die size than
Cost/perform ance param eters predicated from the RTL netlists for the 1D-DCT im plementations (Table 7.10) were input to SOS, so th at it could choose from all five 1D-DCT implem entations. The design space was searched for various performance constraints with the objective of minimizing the cost. The sets of 1D-DCT implementations selected by SOS for various tim ing constraints are shown in Table 7.13. 7.6.1 D iscussion The prim ary outcome of the experim ent was a clear understanding of how SMASH can be used in a system-level design environment. It showed how high-level synthesis tools like SMASH can be used not only in designing the systems but also in perform ing ‘w hat-if’ analysis during the design process. Even if the designer doesn’t want to use high-level synthesis for actual design, he/she can perform a quick ‘w hat-if’ analysis of his decisions using these tools. Thus, the design space can be explored very quickly. 144 Design number Input to SOS Outputs from SOS Time constraint Processors* Cost Execution time/ 8x8 frame Pixel rate ns 10s p-m2 ns 106pixel/s 1 6400 4 P5 110.90 6350 10.08 2 3200 2 P I, 2 P5 255.08 3200 20.00 3 950 5 P I, 2 P2, 1 P3 1402.93 950 67.37 * P I ... P5 are the five 1D-DCT designs from SMASH. Table 7.13: 2D-DCT im plementations from SOS 7 . 7 Enhanced Design of JPEG Components After the initial success with the JP E G system, we decided to 1. synthesize additional components of the JP E G system, and 2. improve the 1D-DCT design. The design of the quantizer and enhanced 1D-DCT dem onstrate the capability of SMASH in designing “real” designs. These VHDL descriptions included conditional- branches, loops, arrays, and on-chip constants. 7.7.1 Synthesis o f th e Q uantizer In the initial JP E G system design experim ent, the quantizer param eters were m an ually estim ated assuming only one possible im plem entation. The design space was not explored at all because the functional description of the quantizer (Figure 7.15) is so trivial th at there is no obvious tradeoff in the functional hardware. The quan tizer consists of a single division by a predeterm ined coefficient, therefore, there is no area-delay tradeoff in the functional hardware. However, there is a fair am ount of I/O activity in the design as the whole stream of 8 elements has to be quantized and th at m otivated us to explore the tradeoffs possible due to I/O . We anticipated 145 a tradeoff in storage-related param eters and performed this experim ent to demon strate the existence of tradeoff curve due to high I/O activity even though there was no tradeoff possible in the functional hardware. Input__________ J_^ (8 points) ,, , , , , , ,, Q QTable Figure 7.15: Quantization in JP E G Image Compression System Observe that the description also includes an if-then-else condition to handle a negative input coefficient in case the divider is not designed to handle negative values. The design param eters obtained from SMASH are sum m arized in Table 7.14. As anticipated, the two designs show a tradeoff between the storage-related cost and Design no. Input constraints SMASH output Loop latency (ns) Fun. area (106 |xm2) on-off (words/ cycle) R buf W buf Buffer size (words) Fun. resources Execution time (ns) 1 3.0 1 1 1 2 <, /, Neg, + 720 60 2 3.0 2 1 1 3 <, /, Neg, + 600 60 Background memory size = 8 (input stream) + 64 (QTable) + 8 (output stream) Table 7.14: Quantizer Design Param eters O btained from SMASH execution tim e, though none in the functional hardware. 
They differ in bandw idth constraints, buffer size, and execution tim e. The functional hardware cost in both the designs is the same. If (l> 0) O = l/Q; else O = -(-l/Q) O ^ Output (8 points) 146 The first design is the cheaper and slower of the two designs. Our analysis of the design showed th at I/O transfer is the bottleneck in this design and m ost of the tim e was being spent in transferring the I/O and the coefficients. There are 8 inputs, 8 quantization-coefficients, and 8 outputs which are accessed by the datapath. The B W on- 0f / allows 1 data transfer per cycle (30 ns) requiring at least 720 ns for data transfer. By overlapping the datapath execution w ith the d ata transfer, the design achieves the optim al execution tim e of 720 ns. After realizing the bottleneck we increased the B W 0n- 0 f j to 2 data transfers per cycle. The result was a faster design. However, the total execution tim e did not reduce proportionately because now the datapath execution dom inated the total tim e. And of course, the speed up was gained at the expense of more bandw idth which implies higher cost. 7.7.2 S ynthesis o f 1D -D C T w ith Inner Loops Our next step was to make the 8-point 1D-DCT, which is the m ost crucial com ponent of the JP E G system, more efficient and realistic. To m ake it more efficient we decided to use a different description with fewer operations (in particular m ulti plications) [NK93]. And, to make the description more realistic, we (i) introduced arrays for the fram e buffer instead of individual points, (« ) defined the on-chip con stants instead of assuming them to be inputs, and (Hi) included the loop definition inside the VHDL description instead of assuming it to be an outer-loop. By bringing the loop inside the description, SMASH could exploit the tradeoff associated with folding the loop. The data-flow for the 8-point 1D-DCT (inside the loop body) is shown in Figure 7.16. The VHDL description of the whole design is included in Appendix B. The complete CDFG of the design including read and write nodes is shown in Figure 7.17. The design param eters obtained from SMASH are summarized in Table 7.15. A brief description of these designs follows: • Design 1: This design is the fastest design with execution tim e of 840 ns for the whole fram e buffer. It is also the most expensive design in term s of total (functional and storage) cost. It is highly parallel and requires a large num ber 147 Figure 7.16: D ata Flow for tlie 8-point 1D-DCT 148 Figure 7.17: The W hole 8-point 1D-DCT CDFG 8x81 D-DCT design parameters from SMASH Design no. Input constraints SMASH output Loop latency (ns) Fun. area (10® pm2) on -off (words/ cycle) R;buf w buf Buffer size (words) Fun. resources Execution time (ns) 1 50 6 8 8 32 <, 16 *,5 14 + 840 60 2 50 4 8 8 28 <, 16*. 7 -, 14 + 960 60 3 35 4 6 6 20 <, 12*. 4 -, 12 + 1080 90 4 25 4 4 3 7 <, 8 *, 3 7 + 1170 120 5 20 4 4 3 7 < ,7 *,3 10 + 1590 180 6 15 2 3 3 7 <, 5 *, 4 8 + 2190 240 Background memory size = 64 (input frame) + 64 (output frame) Table 7.15: Enhanced 1D-DCT Design using SMASH of funtional modules. Furtherm ore, the buffer size is large because, in order to m eet the high performance of the datapath, the inputs were required to be prefetched and stored in the buffers. • Design 2: This design is a bit slower and cheaper than design 1. Though, the functional cost of this design is higher than design 1, it is a cheaper design because of the lower B W on- o j / (which translates into num ber of pins on the chip). 
An analysis of the design showed th at the extra delay was spent on transferring the d ata on and off chip as B W on- 0f f is lower. Buffer size contin ued to be high in this case also, due to prefetching of inputs to m eet the high datapath performance. • Design 3: This design is substantially cheaper and slower th an the above two designs. Even the buffer size is smaller in this design; this is true because the longer execution tim e allowed more flexibility in scheduling the I/O transfers. Later, during the data-transfer scheduling SMASH could scatter the transfers in such a way th at the memory locations could be shared. • Design 4• ' This design was allowed a lim ited functional area and th a t resulted in a lower performance. Notice th at in this case prefetching of inputs is not required, the buffers are there ju st to latch the inputs and outputs in each 150 step. This is because the long execution tim e and sufficient B W o n - o f f allowed the inputs to be fetched just in tim e and outputs to be transferred back im mediately. Therefore no extra storage is required. • Design 5: This design is quite close to design 4 in term s of cost but very slow in execution. Our analysis showed th at it was prim arily because of the 1D- DCT description itself. The data dependencies in the 8-point 1D-DCT CDFG is such th at for up to 8 multipliers, the execution schedule is very regular; however, when there are less than 8 multipliers, some of the operations are delayed significantly. • Design 6: This is the cheapest and the slowest design with a lot of resource sharing. The storage cost is also low because the longer execution tim e al lowed a lower bandw idth (providing any higher bandw idth is of no use as the datapath requires data at a slower rate) and a smaller buffer. 7.7.3 D iscussion The design of the quantizer and the 1D-DCT with inner loop showed dem onstrated the existence of cost-performance tradeoff in datapath as well as the storage struc ture. The quantizer example showed the tradeoffs in the storage structure. The 1D-DCT design showed how designs w ith arrays, on-chip constants and loops can be handled by SMASH. The design contained an inner loop which was folded to achieve higher performance. 7.8 Summary In this chapter we presented the design experiments performed using SMASH along with other USC and commercial tools. We first presented two layout experiments which were conducted prior to development of SMASH.R The rem aining chapter described three set of experiments performed using SMASH. The first set of experiments was to synthesize some representative benchm ark examples from the high-level synthesis workshop. These designs showed the tradeoffs in the storage structure in addition to the tradeoffs in the datapath. They also 151 showed th at the storage-related constraints m ust be considered during datapath synthesis to achieve a cost-effective storage structure. Separating the synthesis of datapath from the synthesis of storage structure results in an inefficient design. The second experim ent involved the design of the JP E G Still Image Compression System. SMASH was used in the experiment to implement the 1D-DCT module. This experiment showed how high-level synthesis tools like SMASH can be used in designing the systems. It also showed that SMASH like tools can be used in exploring the design space quickly by performing ‘w hat-if’ analysis during the design process. 
7.8 Summary

In this chapter we presented the design experiments performed using SMASH along with other USC and commercial tools. We first presented two layout experiments which were conducted prior to the development of SMASH. The remainder of the chapter described three sets of experiments performed using SMASH.

The first set of experiments was to synthesize some representative benchmark examples from the high-level synthesis workshop. These designs showed the tradeoffs in the storage structure in addition to the tradeoffs in the datapath. They also showed that the storage-related constraints must be considered during datapath synthesis to achieve a cost-effective storage structure. Separating the synthesis of the datapath from the synthesis of the storage structure results in an inefficient design.

The second experiment involved the design of the JPEG Still Image Compression System. SMASH was used in the experiment to implement the 1D-DCT module. This experiment showed how high-level synthesis tools like SMASH can be used in designing systems. It also showed that tools like SMASH can be used to explore the design space quickly by performing "what-if" analysis during the design process: the designer can change the system specification and immediately study the impact of such changes on the final implementations.

The third set of experiments consisted of the design of the quantizer used in the JPEG system and an improved design of the 1D-DCT. These specifications contained on-chip constants, arrays and loops. SMASH generated several implementations of these designs with varying cost and performance.

Chapter 8

Conclusion and Future Research

8.1 Introduction

In this thesis we have looked at techniques that support automatic synthesis of memory-intensive application-specific systems. Based on these techniques, a software tool called SMASH (Synthesis of Memory-intensive Application-Specific Hardware) has been developed. The primary objective of this research was to broaden the scope of high-level synthesis systems by considering memory-related issues during the synthesis process.

Several memory-intensive application-specific systems were studied to understand various issues in the design of these systems. The following observations were made during the preliminary research:

• Synthesis tools must consider the design of storage architectures and other system modules in order to be accepted by industry.

• The storage architecture is closely connected to the datapath, and isolating its synthesis from datapath synthesis may not result in an efficient solution. Datapath synthesis procedures themselves must take into account the design of the memory hierarchy, and the design of the datapaths and memory hierarchies must be coordinated.

• Every application requires a unique strategy to design the most efficient storage architecture in terms of both cost and performance. However, a more general approach must be developed which can be applied to automate the design process while producing efficient and correct designs.

• "Real" design specifications contain a wide variety of features like mixed control and data flow, I/O timing constraints, on-chip constants, arrays, conditional branches, and loops. These features should be handled by the synthesis system.

These observations motivated us to develop synthesis techniques which (i) combine datapath synthesis with memory hierarchy design, (ii) handle the above-mentioned features in "real" design specifications, and (iii) exploit all the advantages design automation has over manual design, like faster design time, fewer errors, and exploration of a larger design space.

In this chapter we have summarized our contributions. Next, we have outlined the areas for future research.
8.2 Contributions

In summary, our contributions are as follows:

8.2.1 Development of a High-Level Synthesis System

The main contribution of this research is SMASH. This tool set, given

• a behavioral VHDL description of a memory-intensive application-specific system,

• a module library consisting of (i) functional modules and (ii) storage modules,

• area-performance constraints,

• a clock cycle, which is the duration of each control step in the datapath,

• input/output timing constraints imposed by the external world, and

• memory bandwidth constraints,

produces a target system with

• a datapath consisting of operators and an operation schedule,

• a size and port configuration for the on-chip foreground memory that stores inputs, outputs and intermediate variables,

• a data-transfer schedule between the datapath and the on-chip memory,

• a size and port configuration for the off-chip (or on-chip) background memory for bulk storage, and

• a data-transfer schedule between the foreground and background memory.

The development of this tool set required identification of the problem, the issues, and the design parameters involved. Next, we developed the required techniques to achieve our goals. The main contributions in these areas are described below.
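To make the shape of the tool set's interface concrete, the sketch below captures the two lists above as data structures. It is purely illustrative, written in Python for brevity; the type and field names are ours, not SMASH's internal representation.

from dataclasses import dataclass

@dataclass
class SynthesisProblem:
    """What SMASH is given (field names are ours, for illustration)."""
    vhdl_source: str        # behavioral VHDL description
    module_library: dict    # functional and storage modules
    area_constraint: float  # area-performance constraints
    clock_ns: int           # duration of each control step, e.g. 30
    io_timing: dict         # I/O timing constraints from the outside world
    bw_on_off: int          # memory bandwidth: off-chip transfers per cycle

@dataclass
class SynthesizedSystem:
    """What SMASH produces (again, illustrative names only)."""
    op_schedule: dict       # operators and the operation schedule
    fg_size_words: int      # foreground (on-chip) memory size
    fg_ports: tuple         # (read ports, write ports) on the buffers
    fg_transfers: list      # datapath <-> on-chip memory schedule
    bg_size_words: int      # background (bulk) memory size
    bg_ports: int           # read/write ports on the background memory
    bg_transfers: list      # foreground <-> background memory schedule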
8.2.2 Identification of Design Parameters

Based on the study of various memory-intensive examples, the design parameters relevant to each memory structure were identified, and the following architecture with two levels of memory hierarchy was proposed to solve the problem:

1. on-chip foreground memory, which consists of I/O buffers to store inputs and outputs, and datapath memory to store intermediate variables, and

2. off-chip background memory for bulk storage.

For the I/O buffers, the relevant parameters which must be optimized by the synthesis software are

1. the number of read ports and write ports accessible to the datapath, which is the maximum number of inputs and outputs accessed by the datapath in any given control step, and

2. the total buffer size, which is determined by the maximum number of inputs and outputs stored in the buffers in any given control step.

The parameters which determine the datapath memory are

1. the number of intermediate variables, and

2. the lifetimes of these variables.

Datapath memory synthesis has already been performed by other researchers [BMB+88, Che91, Sto89] and was not addressed in this thesis.

For the off-chip memory, the relevant parameters that must be optimized by the synthesis software are

1. the number of read/write ports, and

2. the size, which is expected to be large compared to the on-chip storage size and is implicitly determined by our software as a side effect of data-transfer scheduling.

Our next contribution was to develop techniques to determine these parameters for a given behavioral description.
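Both I/O-buffer parameters are per-control-step maxima over a schedule, so once a schedule exists they can be read off mechanically. A minimal sketch (ours, in Python; the reads, writes and resident sets are assumed to have been extracted from a finished schedule):

def buffer_parameters(reads, writes, resident):
    """reads[s] / writes[s]: the input/output values the datapath reads
    or writes in control step s; resident[s]: the values that must sit
    in the I/O buffers during step s.
    Returns (read ports, write ports, buffer size in words)."""
    r_buf = max(len(r) for r in reads)
    w_buf = max(len(w) for w in writes)
    size  = max(len(v) for v in resident)
    return r_buf, w_buf, size

# Three control steps of a toy schedule:
reads    = [{"a", "b"}, {"c"}, {"d", "e"}]
writes   = [set(), {"x"}, {"y"}]
resident = [{"a", "b", "c"}, {"c", "d", "e", "x"}, {"d", "e", "y"}]
print(buffer_parameters(reads, writes, resident))   # (2, 1, 4)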
8.2.3 Combined Datapath Scheduling with I/O Accesses

We developed techniques to combine datapath scheduling with I/O access scheduling. We considered I/O accesses from and to the I/O buffers as read/write operations and scheduled them concurrently with the datapath operations. Besides combining the datapath scheduling with the I/O access scheduling, our datapath scheduling techniques consider

• the storage architecture parameters, so that the schedule is guaranteed to satisfy the constraints during storage synthesis,

• the tradeoffs in the storage architecture, so that the schedule does not require alterations during storage design, and

• storage and functional cost estimations, to evaluate the impact of high-level decisions on the final design, so that potentially inferior designs are discarded early in the design process.
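A minimal sketch of the core idea: a resource-constrained list scheduler in which buffer reads and writes are ordinary nodes that compete for the R_buf and W_buf ports, exactly as functional operations compete for functional units. This is our own illustration of the principle, not SMASH's scheduler (which also folds in the tradeoff and cost-estimation considerations listed above); it assumes single-cycle operations and an acyclic CDFG.

def list_schedule(ops, deps, fu_count, r_buf, w_buf):
    """ops: {name: kind}, where kind is 'read', 'write', or a functional-
    unit type such as '*' or '+'. deps: {name: set of predecessors}.
    fu_count: {'*': 2, '+': 1, ...}. Returns {name: control step}."""
    step, done, sched = 0, set(), {}
    while len(done) < len(ops):
        used = {}                              # resource usage this step
        for op in sorted(ops):                 # fixed priority: name order
            if op in done or not deps.get(op, set()) <= done:
                continue                       # finished, or not ready yet
            kind = ops[op]
            cap = (r_buf if kind == "read"
                   else w_buf if kind == "write"
                   else fu_count[kind])
            if used.get(kind, 0) < cap:        # a port or FU is still free
                used[kind] = used.get(kind, 0) + 1
                sched[op] = step
        done |= {op for op, s in sched.items() if s == step}
        step += 1
    return sched

# One multiplier, one adder, one read port, one write port:
ops  = {"r1": "read", "r2": "read", "m1": "*", "a1": "+", "w1": "write"}
deps = {"m1": {"r1", "r2"}, "a1": {"m1"}, "w1": {"a1"}}
print(list_schedule(ops, deps, {"*": 1, "+": 1}, r_buf=1, w_buf=1))
# {'r1': 0, 'r2': 1, 'm1': 2, 'a1': 3, 'w1': 4}: the single read port
# serializes r1 and r2; with r_buf=2 they would share step 0.

The example makes the central point of this chapter visible: the buffer port counts enter the datapath schedule directly, so tightening or loosening the storage constraints reshapes the schedule itself.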
8.2.4 Storage Tradeoffs

We identified the following tradeoffs which can be made in the storage structures:

1. storage size vs. number of execution cycles,

2. number of ports vs. number of execution cycles, and

3. number of ports vs. storage size.

8.2.5 Storage Cost Estimations

We developed techniques to estimate the storage cost. The storage cost estimation consists of the following three steps:

1. determining a lower bound on the number of read and write ports on the buffers,

2. determining a lower bound on the total size of all the buffers, and

3. implementing these requirements using the lowest-cost storage modules from the library.

We also presented the theoretical basis required to prove these bounds.

8.2.6 Upper Bounds on Design Parameters

We developed theories to determine the upper bounds for

1. the number of read and write ports on the buffers,

2. the total size of all the buffers, and

3. the number of functional modules in the design.

8.2.7 Storage Synthesis

We proposed a two-step design process for storage synthesis. These steps are

1. data-transfer scheduling, and

2. storage module allocation.

We developed techniques to perform the data-transfer scheduling. The second step, module allocation, requires further research and will be completed in the future.

8.2.8 Experiments

We conducted a number of design experiments to test and verify our techniques and software. The experiments included

1. designs from the high-level synthesis workshop benchmarks,

2. rapid prototyping of a JPEG still image compression system, and

3. enhanced designs of two of the components (1D-DCT and quantizer) of the JPEG still image compression system.

The following aspects of our research were successfully demonstrated through these experiments:

• the capabilities of SMASH in combining datapath scheduling with I/O accesses from the buffers,

• the improvement in datapath schedules when storage-related constraints are also considered,

• the existence of a cost-performance tradeoff in the storage architecture (cost being a function of R/W ports, storage size, and BW_on-off),

• the capabilities of the software in handling "real" designs which include arrays, loops and conditional branches in the input behavioral specification, and

• the role of SMASH in a system-level design environment like USC.

8.3 Future Directions

The following issues were not addressed in this research. They must be addressed in the future in order to make the synthesis system versatile and thorough:

• high-level memory management,

• storage module allocation,

• improvement in datapath memory synthesis,

• address and control signal generation, and

• interfacing SMASH with DPSYN.

8.3.1 High-Level Memory Management

It has been well recognized that transformations on the algorithm description are crucial in obtaining efficient implementations. The requirement for these transformations arises because user-specified algorithms may not be efficient. For instance, we do not need to keep independent operations together (as within loop boundaries) just because of the algorithmic construct. Studies on memory organization for multi-dimensional signals have shown that loop transformations can have significant effects on final implementation costs [FBS+93, LvMvdW+91]. These transformations have been found to be very effective in achieving two goals: parallelism and efficient use of the memory hierarchy.

In the case of arrays, transferring huge arrays into on-chip buffers can be very expensive. In such cases, memory accesses can be greatly optimized by considering data locality. The consideration of data locality makes it more important to apply loop transformations in a systematic manner. Another transformation that can be used is decomposition of arrays into smaller arrays, so that only the part of the array which is useful for the processing is transferred. Unfortunately, decomposing each array into individual data points would increase the complexity of the problem, as we would then have to deal with many more arrays. Therefore, the decomposition must be done in such a way that there are not excessive arrays to handle and each array is small enough to avoid unnecessary data transfers. Similar issues are being addressed by several researchers working on advanced compilers [WL91, Lov77, PW86]. These ideas can be extended to our applications.
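As a simple illustration of the kind of locality-driven decomposition meant here (our example, not taken from the thesis tools): if a frame is processed strip by strip, only one strip ever needs to occupy the on-chip buffer instead of the whole array. The sizes below are arbitrary.

FRAME_H, FRAME_W, STRIP = 64, 64, 8    # illustrative sizes

def process(strip):
    """Stand-in for the real per-strip computation."""
    return [x + 1 for x in strip]

def transform_frame(frame):
    """Visit the frame one STRIP-word slice at a time, so the on-chip
    buffer only ever has to hold STRIP words, not a 64x64 array."""
    out = [[0] * FRAME_W for _ in range(FRAME_H)]
    for col in range(0, FRAME_W, STRIP):
        for row in range(FRAME_H):
            buf = frame[row][col:col + STRIP]          # fetch one strip
            out[row][col:col + STRIP] = process(buf)   # compute, write back
    return out

# A 64x64 frame of zeros comes back as all ones:
result = transform_frame([[0] * FRAME_W for _ in range(FRAME_H)])
assert all(x == 1 for row in result for x in row)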
8.3.2 Storage Module Allocation

As described earlier, in our research we have not addressed the issue of allocating a physical memory location to each data value. After the data transfers are scheduled and the data requirements in each control step are known, we need to allocate each value to an appropriate physical module. The objective of this step is to distribute these data values among different modules such that there is no access clash, and access time and storage cost (both storage area and interconnection area) are minimized. This step consists of the following basic tasks:

1. selecting the storage modules,

2. distributing the data values among the selected modules, and

3. completing the interconnection network and the addressing hardware.

Notice that during data-transfer scheduling it was ensured that the feasibility of a complete allocation was guaranteed. Therefore, the choice of the module type will depend only on the following requirements:

1. the number of read/write ports on the modules, and

2. the size of the modules.

Next, the data values can be assigned to the selected modules. This is a rather difficult task, as different module types have different numbers of ports and sizes. Distributing the data values among such heterogeneous modules has yet to be researched; existing approaches do not consider distribution of data among heterogeneous modules at the same time. Another issue is allocation of arrays as single entities rather than expanding them into individual elements. PHIDEO assigns data streams to multiport memory modules by dividing the overall problem into subproblems which are then solved using an ILP formulation [LvMVvdW93]. Though this approach is based on PHIDEO's data stream model, it could be modified to allocate arrays to background memory in our case. Furthermore, the interconnect cost must also be considered during the storage construction. To reduce the interconnection overhead, we prefer to keep the data required by a functional module in a storage module to which that functional module is already connected.

We can start distributing the data values from control step one. To perform a heuristic module allocation, we can first choose a value, then compute the cost of assigning it to each particular module; the cost should include the interconnection cost. Depending on the cost, we assign it to the cheapest module. We then choose another value from the set of values remaining in that control step and assign it to an appropriate module, and so on. Overall, this problem needs to be researched extensively.
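A minimal sketch of the heuristic just described (our illustration; the real step must also respect the no-clash feasibility established during data-transfer scheduling, which this sketch ignores):

def greedy_allocate(values_by_step, capacity, word_cost, wire_cost, links):
    """values_by_step: per control step, a list of (value, fu) pairs, fu
    being the functional unit that uses the value. capacity/word_cost:
    per-module size limit and per-word storage cost. links: existing
    (module, fu) wires. Each value goes to the module that is cheapest
    once the penalty for a new module-to-FU connection is included."""
    load = {m: 0 for m in capacity}
    links = set(links)
    assignment = {}
    for step_values in values_by_step:
        for value, fu in step_values:
            free = [m for m in capacity if load[m] < capacity[m]]
            best = min(free, key=lambda m: word_cost[m]
                       + (0 if (m, fu) in links else wire_cost))
            assignment[value] = best
            load[best] += 1
            links.add((best, fu))
    return assignment

print(greedy_allocate(
    [[("a", "mult1"), ("b", "add1")], [("c", "mult1")]],
    capacity={"rf0": 1, "rf1": 2}, word_cost={"rf0": 1.0, "rf1": 1.2},
    wire_cost=3.0, links={("rf0", "mult1")}))
# {'a': 'rf0', 'b': 'rf1', 'c': 'rf1'}: "a" reuses the existing
# rf0-to-mult1 wire; "b" and "c" overflow to rf1 once rf0 is full.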
8.3.3 Improvement in Datapath Memory Synthesis

The datapath memory synthesis problem has been extensively studied by Balakrishnan et al. [BMBL87, BMB+88]. However, their approach considers only a single type of module, and they optimize the interconnection separately. A more efficient solution could be obtained if heterogeneous module types (like register files with varying numbers of ports, registers, and on-chip RAMs with varying numbers of ports) were allowed. Also, the interconnection cost could be considered while merging the registers into larger modules. Heterogeneous modules allow us to avoid using expensive modules when they are not required. Furthermore, registers should be merged selectively, without insisting on merging all the registers into one large module. Registers which are used most frequently should not be merged, as this might increase the access clash with other variables stored in that module.

A possible approach to this problem is to first form a no-conflict graph for all the data values. Two nodes of this graph are connected if the corresponding values can be put in the same module and accessed in the same time step without causing an access clash. Then, based on the access pattern, various compatible sets can be formed. A compatible set is a set of data values which can be allocated to the same module without any access clash. All the elements of the same compatible set can be allocated to the same storage module (register file or RAM). In order to minimize the interconnection overhead, the storage modules must be assigned based on their connectivity with the functional modules. In other words, clusters of data values which can be put together should be formed first; then, depending on the functional modules they interact with, the storage modules can be assigned (allocating the data values to register files or RAMs).

8.3.4 Address and Control Generation

Synthesis of the address and control signal generators is another issue which must be considered in the future. For non-addressable memory modules like registers and register files, control signal generation is the issue; for addressable memory modules like RAMs, address generation is the issue. For addressable memories like on-chip RAMs, the address to access a memory location can be generated using a variety of algorithms and hardware, and the cost of the decoding logic can be very significant [Bur90]. Grant et al. studied the address generation problem for the case when a block of memory is accessed repeatedly [GDF89]. The comprehensive address generation problem is yet to be addressed. Since address generation is directly related to module allocation, we recommend that this issue be considered during the module allocation step.

8.3.5 Interfacing SMASH with DPSYN

DPSYN is an RTL synthesizer from COMPASS Design Automation Inc. It accepts a scheduled data flow graph in the form of a finite state machine and generates the layout of the design. DPSYN was acquired quite recently and is under evaluation. The required interfaces between SMASH and DPSYN need to be developed in the near future.

Appendix A

MABAL to SSCNET Netlist Translator

A netlist translator was written to complete the pathway from specification to layout, with automation of all major design steps. The translator translates the ADAM RTL output netlist to the Cascade Design Automation ChipCrafter (now Epoch) bit-level netlist. The program has the following features:

• It expands the RTL netlist to a bit-level netlist for a user-specified bitwidth.

• It constructs some "complex" modules from existing basic modules. For example, a subtracter is constructed using adder and inverter cells. Similarly, a shifter is implemented by simply shifting the bit connections by the required number of bits.

• It integrates the controller and the datapath by connecting the control signals to the datapath modules appropriately.

• It can translate testable designs having BILBO or SCAN registers.
  - For the BILBO methodology, it can create the BILBO registers for a given polynomial.
  - For the SCAN methodology, it can connect all the scan registers in a scan chain.

• It is modular. Any new type of module can be included simply by writing a function defining the attributes of the new module.

This work helped us in performing a number of experiments:

• More than 50 layouts have been generated so far to study the effects of high-level decisions on the final layouts.

• An arithmetic Fourier transform chip was designed from the VHDL description to physical layout within 48 hours.

• Some AR filter designs were made testable to study the impact of adding testability on the area and the performance of the final designs.

• 1D-DCT and 2D-DCT chips were designed as part of a JPEG image compression system design.

Appendix B

VHDL Descriptions

In this appendix the VHDL descriptions of the examples synthesized in this thesis are included. These examples are:

1. a second-order differential equation solver,

2. an AR filter element,

3. an elliptic wave filter element,

4. a 1D-DCT with inner loop, and

5. a quantizer.

B.1 VHDL Description of a 2nd-Order Differential Equation Solver

-- diffeq.vhdl
-- author      : chih-tung chen
-- source      : diffeq.v in the high-level synthesis workshop.
-- description : This is a VHDL behavioral description of a differential
--               equation. It is modified from diffeq.v in such a way
--               that the internal loop of the original description is
--               transformed using Mitchell's method in order to have a
--               loopless description.
-- assumption  : 1. there are external feedbacks as follows:
--                  oxport -> ixport, oyport -> iyport, and
--                  ouport -> iuport.
--               2. control signals such as reset, ready, nxt and over
--                  should be added externally.
--               3. It should be implemented as a non-pipelined design
--                  due to the unfixed loop in the original algorithm.

package TYPES is
  type SixteenBitVector is array(15 downto 0) of Bit;
end TYPES;

use work.TYPES.all;
package OPR16 is
  function "+"(opn1, opn2 : SixteenBitVector) return SixteenBitVector;
  function "-"(opn1, opn2 : SixteenBitVector) return SixteenBitVector;
  function "*"(opn1, opn2 : SixteenBitVector) return SixteenBitVector;
  function less(opn1, opn2 : SixteenBitVector) return Boolean;
end OPR16;

use work.TYPES.all;
use work.OPR16.all;
entity diffeq is
  port(aport, dxport          : in  SixteenBitVector;
       ixport, iyport, iuport : in  SixteenBitVector;
       oxport, oyport, ouport : out SixteenBitVector);
end diffeq;

architecture behaviour of diffeq is
begin
  process
    variable a, x, y, dx, u, u1, u2, u3, u4, u5, u6, y1 : SixteenBitVector;
    constant three : SixteenBitVector := X"0003";
    constant five  : SixteenBitVector := X"0005";
  begin
    x  := ixport;
    y  := iyport;
    u  := iuport;
    dx := dxport;
    a  := aport;
    u1 := u * dx;
    u2 := three * x;   -- modified by Pravil
    u3 := three * y;
    y1 := u1;          -- original: y1 := u * dx;
    u4 := u1 * u2;
    u5 := dx * u3;
    if less(x, a) then
      x  := x + dx;
      y  := y + y1;
      u6 := u - u4;
      u  := u6 - u5;
    end if;
    oxport <= x;
    oyport <= y;
    ouport <= u;
  end process;
end behaviour;

B.2 VHDL Description of an AR Filter Element

-- arf.vhdl
-- author      : chih-tung chen
-- date        : 3/18/92
-- source      : originated from Jagan's description.
-- description : an AR filter element in VHDL behavioral description

package TYPES is
  type SixteenBitVector is array(15 downto 0) of Bit;
end TYPES;

use work.TYPES.all;
package OPR16 is
  function "+"(opn1, opn2 : SixteenBitVector) return SixteenBitVector;
  function "*"(opn1, opn2 : SixteenBitVector) return SixteenBitVector;
end OPR16;

use work.TYPES.all;
use work.OPR16.all;
entity arf is
  port(a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12,
       a13, a14, a15, a16, a17, a18, a19, a20, a21, a22,
       a23, a24, a25, a26 : in SixteenBitVector;
       out1, out2 : out SixteenBitVector);
end arf;

architecture behaviour of arf is
begin
  process
    variable e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12,
             e13, e14, e15, e16, e17, e18, e19, e20, e21, e22, e23, e24,
             e25, e26, e27, e28, e29, e30 : SixteenBitVector;
  begin
    e1  := a1 * a2;
    e2  := a3 * a4;
    e3  := a5 * a6;
    e4  := a7 * a8;
    e5  := a9 * a10;
    e6  := a11 * a12;
    e7  := a13 * a14;
    e8  := a15 * a16;
    e9  := e1 + e2;
    e10 := e3 + e4;
    e11 := e5 + e6;
    e12 := e7 + e8;
    e13 := e11 + a17;
    e14 := e13;
    e15 := a18 + e12;
    e16 := e15;
    e17 := a19 * e15;
    e18 := e13 * a20;
    e19 := e14 * a21;
    e20 := e16 * a22;
    e21 := e17 + e18;
    e22 := e21;
    e23 := e19 + e20;
    e24 := e23;
    e25 := a23 * e23;
    e26 := e21 * a24;
    e27 := a25 * e22;
    e28 := a26 * e24;
    e29 := e25 + e26;
    e30 := e27 + e28;
    out1 <= e9 + e29;
    out2 <= e10 + e30;
  end process;
end behaviour;

B.3 VHDL Description of an Elliptic Wave Filter Element

-- ellipf.vhdl
-- date       : 9/15/93
-- source     : ellipf.vhdl in the HLS92 benchmarks.
-- assumption : 1. control signals such as reset, ready, nxt and over
--                 should be added externally.
--              2. only the internal body of the while loop is
--                 considered.
package TYPES is
  type SixteenBitVector is array(15 downto 0) of Bit;
end TYPES;

use work.TYPES.all;
package OPR16 is
  function "+"(opn1, opn2 : SixteenBitVector) return SixteenBitVector;
end OPR16;

use work.TYPES.all;
use work.OPR16.all;
entity ellipf is
  port (inp  : in  SixteenBitVector;
        outp : out SixteenBitVector;
        sv2, sv13, sv18, sv26, sv33, sv38, sv39 : in SixteenBitVector;
        sv2_o, sv13_o, sv18_o, sv26_o, sv33_o, sv38_o, sv39_o
             : out SixteenBitVector);
end ellipf;

architecture ellipf of ellipf is
begin
  process
    variable n1, n2, n3, n4, n5, n6, n7   : SixteenBitVector;
    variable n8, n9, n10, n11, n12, n13   : SixteenBitVector;
    variable n14, n15, n16, n17, n18, n19 : SixteenBitVector;
    variable n20, n21, n22, n23, n24, n25 : SixteenBitVector;
    variable n26, n27, n28, n29           : SixteenBitVector;
  begin
    n1  := inp + sv2;
    n2  := sv33 + sv39;
    n3  := n1 + sv13;
    n4  := n3 + sv26;
    n5  := n4 + n2;
    n6  := n5;
    n7  := n5;
    n8  := n3 + n6;
    n9  := n7 + n2;
    n10 := n3 + n8;
    n11 := n8 + n5;
    n12 := n2 + n9;
    n13 := n10;
    n14 := n12;
    n15 := n1 + n13;
    n16 := n14 + sv39;
    n17 := n1 + n15;
    n18 := n15 + n8;
    n19 := n9 + n16;
    n20 := n16 + sv39;
    n21 := n17;
    n22 := n18 + sv18;
    n23 := sv38 + n19;
    n24 := n20;
    n25 := inp + n21;
    n26 := n22;
    n27 := n23;
    n28 := n26 + sv18;
    n29 := n27 + sv38;
    sv2_o  <= n25 + n15;
    sv13_o <= n17 + n28;
    sv18_o <= n28;
    sv26_o <= n9 + n11;
    sv38_o <= n29;
    sv33_o <= n19 + n29;
    sv39_o <= n16 + n24;
    outp   <= n24;
  end process;
end ellipf;

B.4 VHDL Description of the 8-point 1D-DCT

-- dct.vhdl
-- source      : IEEE Transactions on CAD, August 1993, Vol. 12, No. 8,
--               page 1120.
-- description : an 8x8 16-bit DCT VHDL behavior. Consists of eight
--               8-point DCTs in a loop.
-- date        : Dec. 23rd, 1993.
-- written by  : Pravil Gupta

package TYPES is
  type Frame is array(0 to 7, 0 to 7) of integer;
end TYPES;

-- Entity declaration doesn't determine the number of ports on the chip.
use work.TYPES.all;
entity dct is
  port(InFrame : in Frame; OutFrame : out Frame);
end dct;

architecture behavior of dct is
  use work.TYPES.all;
begin
  process
    variable t1, t2, t3, t4,
             m1, m2, m3, m4, m5, m6, m7, m8,
             n1, n2, n3, n4 : integer;
    variable I : integer;
    constant c1, c2, c3, c4, c5, c6, c9, c12, c14, c15, c16 : integer := 0;
    -- each constant must be assigned appropriately.
  begin
    I := 0;
    while I < 8 loop
      t1 := InFrame(0,I) + InFrame(7,I);
      t2 := InFrame(3,I) + InFrame(4,I);
      t3 := InFrame(1,I) + InFrame(6,I);
      t4 := InFrame(2,I) + InFrame(5,I);
      m1 := t1 + t2;
      m2 := t3 + t4;
      m3 := t1 - t2;
      m4 := t3 - t4;
      -- d = c12 = c13
      OutFrame(0,I) <= c12 * (m1 + m2);
      OutFrame(4,I) <= c12 * (m1 - m2);
      -- c14 + c15 = b and c15 = f
      -- c15 = f and c15 + c16 = -b
      n1 := c15 * (m3 + m4);
      OutFrame(2,I) <= c14 * m3 + n1;
      OutFrame(6,I) <= c16 * m4 + n1;
      m5 := InFrame(0,I) - InFrame(7,I);
      m6 := InFrame(1,I) - InFrame(6,I);
      m7 := InFrame(2,I) - InFrame(5,I);
      m8 := InFrame(3,I) - InFrame(4,I);
      -- c6 = c, c3 = g-c, c9 = -a-c
      n2 := c6 * (m5 + m6 + m7 + m8);
      n3 := c3 * (m5 + m8);
      n4 := c9 * (m6 + m7);
      -- c2 = c10, c1 = c8, c4 = c11, c5 = c7
      OutFrame(1,I) <= c1*m5 + n3 + n2 + c2*m7;
      OutFrame(3,I) <= c5*m8 + n2 + c1*m6 + n4;
      OutFrame(5,I) <= c2*m5 + n2 + n4 - c4*m7;
      OutFrame(7,I) <= n3 + c4*m8 + n2 + c5*m6;
      I := I + 1;
    end loop;
  end process;
end behavior;

B.5 VHDL Description of a Quantizer

-- dct.vhdl
-- source      :
-- description :
-- date        : March 12th, 1994
-- written by  : Pravil Gupta

package TYPES is
  type Row   is array(0 to 7) of integer;
  type Frame is array(0 to 7, 0 to 7) of integer;
end TYPES;

-- Entity declaration doesn't determine the number of ports on the chip.
use work.TYPES.all;
entity quant is
  port(InRow  : in  Row;
       QTable : in  Frame;
       OutRow : out Row);
end quant;

architecture behavior of quant is
  use work.TYPES.all;
begin
  process
    variable temp : integer;
    variable I    : integer;
    constant J    : integer := 0;
Singleport/M ultiport Memory Synthesis in D ata P ath Design. In Proc. o f the IE E E In t’ l Symp. Circuits and System s, pages 1110-1112, 1990. C.F. Chang and B. J. Sheu. A Multi-Chip Module Design for Portable Video Compression Systems. In IE E E Multi-Chip Module Conf., pages 39-44, 1993. S. W. Director, A. C. Parker, D. P. Siewiorek, and D. E. Thomas. A Design Methodology and Computer Aids for Digital VLSI Systems. IE E E Transactions on Circuits and System s, CAS-28:634-645, July 1981. Frank Franssen, Florin Balasa, M. Swaaij, F. Catthoor, and H. De Man. Modeling M ultidimenstional D ata and Control Flow. IE EE Tran, on VLSI System s, 1(3):319— 327, Septem ber 1993. H. Fujiwara, M.L. Liou, M.T. Sun, K.M. Yang, M.M. M aruyama, K. Shomura, and K. Ohyama. An All-ASIC Im plem entation of a Low Bit-Rate Video Codec. IE E E Trans, on Circuits and Systems fo r Video Technology, 2(2):123-134, June 1992. E.F. Girczyc, R.J. Buhr, and J. Knight. Applicability of a Subset of ADA as an Algorithmic Hardware Description Language for Graph Based Hardware Compilation. IE E E Trans, on Computer Aided De sign, CAD-4(2), April 1985. 179 [GCDBP94] [GD90] [GDF89] [GK84] [GKP85] [GP94] [HHL90] [HMFK90] [HP90] P. Gupta, C. T. Chen, J. C. DeSouza-Batista, and A. C. Parker. Ex perience with Image Compression Chip Design using Unified System Construction Tools. In Proc. o f the 31st Design Autom ation Conf., June 1994. D.M. Grant and P.B. Denyer. Memory, Control and Communication Synthesis for Scheduled Algorithms. In Proc. o f the 27th Design Autom ation Conf., pages 162-167, June 1990. D.M. G rant, P.B. Denyer, and I. Finlay. Synthesis of Address Gen erators. In Proc. of the In t’ l Conf. on Computer Aided Design, pages 116-118, 1989. E. Girczyc and J. Knight. An ADA to Standard Cell Hardware Compiler Based on Graph Grammars and Scheduling. In Proc. o f the In t’ l Conf. on Computer Design, pages 726-729, October 1984. J. Granacki, D. Knapp, and A. C. Parker. The ADAM Design Au tom ation System: Overview, Planner and N atural Language Inter face. In Proc. o f the 22nd Design Autom ation Conf., pages 727-730, June 1985. P. G upta and A. C. Parker. SMASH: A Program for Scheduling Memory-Intensive Application Specific Hardware. In 7th Int ’ I Sym posium on High-Level Synthesis, May 1994. C. T. Huang, Y.C. Hsu, and Y.L. Lin. Optim um and Heuristic D ata Scheduling under Resource Constraints. In Proc. o f the 27th Design Autom ation C onf, pages 65-70, June 1990. S. Hirofumi, N. M atsumoto, K. Fujimori, and S. Kato. A Flexi ble M ulti-Port RAM Compiler for D atapath. In Proc. o f the IE E E Custom Integrated Circuits C onf, pages 16.5.1-16.5.4, May 1990. A. Hemani and A. Postula. A Neural Net Based Self Organising Scheduling Algorithm. In Proc. o f the European Design Autom ation C onf, pages 136-139, March 1990. 180 [JKMP89] [JMP88] [JP093] [JPP87] [KP83] [KP85] [KP90] [KT83] [Kuc91] [LHL89] R. Jain, K. Kiigiik^akar, M. J. Mlinar, and A. C. Parker. Experience with the ADAM Synthesis System. In Proc. o f the 26th Design Automation Conf., pages 56-61, June 1989. R. Jain, M. J. Mlinar, and A. C. Parker. Area-Time Model for Synthesis of Non-Pipelined Designs. In Proc. o f the In t’ l Conf. on Computer Aided Design, November 1988. A. Jerraya, I. Park, and K. O ’Brien. AMICAL: An Interactive High- Level Synthesis Environment. In ED A C 93, February 1993. R. Jain, A. C. Parker, and N. Park. Predicting Area-Time Tradeoffs for Pipelined Design. In Proc. of the 24th Design Autom ation Conf., pages 35-41. 
IEEE and ACM, July 1987. D. Knapp and A. C. Parker. A D ata Structure for VLSI Synthesis and Verification. Technical report, Digital Integrated Systems Cen ter, Dept, of EE-Systems, University of Southern California, October 1983. D. Knapp and A. C. Parker. A Unified Represention for Design Infor mation. In Proceedings o f the IFIP Conf. on Hardware Description Languages, August 1985. K. Kiigiikgakar and A. C. Parker. D ata P ath Tradeoffs using MA- BAL. Proc. o f the 27th Design Autom ation Conf., June 1990. T. J. Kowalski and D. E. Thomas. The VLSI Design Autom ation Assistant: Prototype System. In Proceedings o f the 20th Design Automation C onf, 1983. K. Kucukcakar. System-Level Synthesis Techniques with Emphasis on Partitioning and Design Planning. PhD thesis, University of Southern California, September 1991. J.H. Lee, Y. C. Hsu, and Y. L. Lin. A New Integer Linear Pro gramming Formulation for the Scheduling Problem. In Digest of 181 [Lov77] [LT89] [LvMvdW+91] [LvMVvdW93] [Mar79] [MMC88] [NK93] [NP77] Technical Papers of the Int. Conf. of Computer Aided Design, pages 20-23, November 1989. D. B. Loveman. Program Improvement by Source-to-Source Trans formation. Journal o f the Association fo r Computing Machinery, 24(1):121-145, January 1977. E. Lagnese and D. E. Thomas. A rchitectural Partitioning for System Level Design. In Proc. of the 26th Design Autom ation Conf., June 1989. P. E. R. Lippens, J. L. van Meerbergen, A. van der Werf, W. F. J. Verhaegh, and B. T. McSweeney. Memory Synthesis for High Speed DSP Applications. In Proc. of the IE E E Custom Integrated Circuits Conf., pages 11.7.1-11.7.4, May 1991. P. E. R. Lippens, J. L. van Meerbergen, W. F. J. Verhaegh, and A. van der Werf. Allocation of M ultiport Memories for Hierarchical D ata Streams. In Proc. o f the I n t’ l Conf. on Computer Aided Design, pages 728-735, 1993. P. Marwedel. The MIMOLA Design System: Detailed Description of the Software System. In Proc. of the 16th Design Automation Conf., pages 59-62, 1979. A. C. Parker M. McFarland and R. Camposano. Tutorial on High- Level Synthesis. Proc. o f the 25th Design Autom ation C onf, Jul 1988. J.A. Nestor and G. Krishnamoorthy. SALSA: A New Approach to Scheduling with Timing Constraints. IE E E Trans, on on Computer- Aided Design, 12(8):1107-1122, August 1993. A. Nagle and A. Parker. Hardware/Software Tradeoffs in a Variable Word W idth, Variable Queue Length Buffer Memory. In Proc. o f the 4th Annual Symposium on Comp. Architecture, pages 159-163, March 1977. 182 [PCG93] [Pen86] [PG87] [PGH91] [PK87] [PK89] [PK90] [PP88] [PP93] A. Parker, C. T. Chen, and P. Gupta. Unified System Construction. In Proc. o f the Synthesis And Simulation Meeting and International Interchange, October 1993. Z. Peng. Synthesis of VLSI Systems with the CAMAD Design Aid. In Proc. of the 23th Design Automation C onf, pages 278-283. IEEE and ACM, June 1986. B.M. Pangrle and D.D. Gajski. Design Tools for Intelligent Silicon Compilation. IE E E Trans, on Computer Aided Design, pages 1098- 1112, November 1987. A. C. Parker, P. Gupta, and A. Hussain. The Effects of Physical Design Characteristics on the Area - Performance Tradeoff Curve. In Proc. of the 28th Design Automation C onf, pages 530-534, June 1991. P. Paulin and J. Knight. Force-Directed Scheduling in Autom atic D ata Path Synthesis. In Proc. o f the 24 th Design Autom ation C onf, pages 195-202. IEEE and ACM, July 1987. P.G. Paulin and J.P. Knight. Force-Directed Scheduling for the Be havioral Synthesis of ASICs. 
IE E E Tran, on Computer Aided De sign, pages 661-679, June 1989. C.A. Papachristou and H. Konuk. A Linear Program Driven Scheduling and Allocation Method. In Proc. o f the 27th Design Automation Conf., pages 77-83, June 1990. N. Park and A. C. Parker. Sehwa: A Software Package for Syn thesis of Pipelines from Behavioral Specifications. IE E E Trans, on Computer-Aided Design, March 1988. H. Park and V.K. Prasanna. Area Efficient VLSI Architectures for Huffman Coding. Int. Conf. on Acoustics, Speech and Signal Pro cessing, 1993. 183 [PPM86] [Pra78] [PW86] [PWGH90] [RMV+88] [RP90] [SJ94] [Sto89] [Sto91] [TS83] A. C. Parker, J. Pizarro, and M. Mlinar. MAHA: A Program for D atapath Synthesis. In Proc. o f the 23th Design Autom ation C onf, pages 461-466, July 1986. W. K. P ratt. Digital Image Processing, pages 319-321. Wiley, 1978. D. A. Padua and M. J. Wolfe. Advanced Compiler Optimizations for Supercomputers. Communications o f the ACM, 29(12):1184-1201, December 1986. A. C. Parker, Jen-Pin Weng, P. Gupta, and A. Hussain. The Effects of Physical Design Characteristics on the Quality of Synthesized De signs. In Canadian Conf. on VLSI Design, pages 1.1.1-1.1.7, 1990. J. Rabaey, H. De Man, J Vanhoof, G. Goossens, and F. Catthoor. Silicon Compilation, ed. D.D. Gajski, chapter CATHEDRAL-II: A Synthesis System for Multiprocessor DSP Systems, pages 311-360. Addison-Wesley, 1988. J. Rabaey and M. Potkonjak. Resource Driven Synthesis in the HY PER System. In Proc. I n t i Symposium on Circuits and Systems, pages 2592-2595, May 1990. A. Sharma and R. Jain. Estim ating A rchitectural Resources and Performance for High-Level Synthesis Applications. In Proc. of the 30th Design Automation C onf, pages 355-360, June 1994. L. Stok. Interconnect Optimization during D atapath Synthesis. In Fourth International Workshop on High-Level Synthesis, pages 1-6, October 1989. L. Stok. Architectural Synthesis and Optimization o f Digital Sys tems. PhD thesis, Technishe Universiteit Eindhoven, April 1991. C.-J. Tseng and D.P. Siewiorek. Facet: A Procedure for the Au tom ated Synthesis of Digital Systems. In Proc. o f the 20th Design Automation Conf., pages 490-496, June 1983. 184 [VBM91] [Wal91] [WL91] [WP91] [Zim79] J. Vanhoof, I. Bolsens, and H. De Man. Compiling M ulti dimensional D ata Streams into D istributed DSP ASIC Memory. In Proc. o f the I n t’ l Conf. on Computer Aided Design, pages 272-275, 1991. G. K. Wallace. The JP E G Still Picture Compression Standard. Communications o f the ACM, 34(4):31-44, April 1991. M. E. Wolf and M. S. Lam. A Loop Transformation Theory and an Algorithm to Maximize Parallelism. IE E E Trans, on Parallel and Distributed Systems, 2(4):452-471, October 1991. J. Weng and A. C. Parker. 3D Scheduling: High-Level Synthesis with Floor planning. In Proc. o f the 28th Design Automation C onf, pages 668-673, July 1991. G. Zimmermann. The MIMOLA Design System: A Com puter Aided Digital Processor Design Method. In Proc. o f the 16th Design A u tomation C onf, pages 53-58. ACM SIGDA, IEEE Com puter Society - DATC, June 1979. 185