CONCURRENT HIGH-LEVEL SYNTHESIS WITH FLOORPLANNING

by

Jen-Pin Weng

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

August 1993

Copyright 1993 Jen-Pin Weng

UMI Number: DP22877. Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author. Microform Edition © ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code.

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Jen-Pin Weng under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dedication

To my mother.

Acknowledgements

I would like to take this opportunity to express my sincere appreciation to my advisor, Professor Alice C. Parker, for her valuable guidance and continued encouragement. Without her, this thesis would not exist. I would also like to thank Professors Ming-Deh Huang and Massoud Pedram for serving on my dissertation committee, and Professors Melvin A. Breuer, Sarma Sastry and Gerald G. Medioni for serving on my guidance committee.
I thank all my friends and colleagues at USC for their warm friendship and help. In particular, I would like to mention Yung-Hua Hung, Chih-Tung Chen, Pravil Gupta, Shiv Prakash, Sen-Pin Lin, Atul Ahuja, Charles Njinda, Mary Zittercob and Donnalyn Combest.

Finally, I would like to thank my mother, Kuei-Ing, for her patience, support, and sacrifices throughout my graduate study.

This research was supported by Semiconductor Research Corporation (Contracts 89-DJ-075 and 92-DJ-075). I would like to thank them for the support.

Contents

Acknowledgements
List Of Figures
List Of Tables
Abstract
1 Introduction
  1.1 Behavioral Synthesis
    1.1.1 Data Path Synthesis
    1.1.2 Control Path Synthesis
  1.2 Motivation and Problem Approach
  1.3 Thesis Organization
2 Related Research
  2.1 Data Path Synthesis
    2.1.1 Prediction Methods
    2.1.2 Floorplanning Approaches
  2.2 Control Path Synthesis
  2.3 Thermal Analysis and Synthesis
3 Overview of the Synthesis System Research
  3.1 Limitations
  3.2 Clocking Scheme
  3.3 An Overview of Data Path Synthesis
  3.4 An Overview of Control Path Synthesis
    3.4.1 Controller for Designs without Conditional Branches
    3.4.2 Controller for Designs with Conditional Branches
4 Preliminary 3D Scheduling Research
  4.1 Algorithm Description
    4.1.1 Scheduling and Module Assignment
    4.1.2 Improvement Procedure
  4.2 Experimental Results
5 Data Path Synthesis
  5.1 Force-Directed Scheduling
  5.2 Prediction Technique
    5.2.1 Operator Estimation
      5.2.1.1 An Operator Lower-Bound Estimation for Pipelined Designs
      5.2.1.2 An Operator Lower-Bound Estimation for Non-Pipelined Designs
    5.2.2 Wiring Estimation
      5.2.2.1 Wiring Area Estimation
      5.2.2.2 Wiring Delay Estimation
    5.2.3 Controller Estimation
      5.2.3.1 PLA Abstract Characteristic Estimation
      5.2.3.2 Controller Area Estimation
      5.2.3.3 Controller Delay Estimation
  5.3 Floorplanning Technique
    5.3.1 Incremental Cluster Tree Generation
    5.3.2 Shape Function Computation
  5.4 3D Scheduling Technique
    5.4.1 Timing Model Assumptions
    5.4.2 Loop Model
    5.4.3 3D Scheduling Algorithm
    5.4.4 Complexity Analysis of the 3D Scheduling Algorithm
    5.4.5 Discussion
6 Data Path Synthesis Experiments and Results
  6.1 Introduction
  6.2 A Non-Pipelined FIR Filter Example
  6.3 A Pipelined FIR Filter Example
  6.4 A Non-Pipelined Differential Equation Solver Example
  6.5 Non-Pipelined Elliptic Filter Examples
    6.5.1 A 12-Time-Step Non-Pipelined Elliptic Filter Example
    6.5.2 A 10-Time-Step Non-Pipelined Elliptic Filter Example
  6.6 Robot Arm Controller Examples
  6.7 An Inner Loop Example
7 Control Path Synthesis
  7.1 Controller Assumptions
  7.2 Control Signal Generation
  7.3 Designs without Conditional Branches
    7.3.1 Controllers for Non-Pipelined Designs
    7.3.2 Controllers for Pipelined Designs
  7.4 Controllers Using Status Registers
    7.4.1 Controllers for Non-Pipelined Designs
    7.4.2 Controllers for Pipelined Designs
  7.5 Controllers without Status Registers
    7.5.1 Controllers for Non-Pipelined Designs
    7.5.2 Controllers for Pipelined Designs
  7.6 Control Path Experiments and Results
    7.6.1 Non-Pipelined Design Experiments and Results
    7.6.2 Pipelined Design Experiments and Results
8 Thermal Analysis and Simulation Results
  8.1 Thermal Models
  8.2 Derivation of the Analytical Solution
  8.3 Derivation of the Numerical Solution
  8.4 Thermal Experiments and Results
9 Conclusions and Future Research
  9.1 Data Path Synthesis
  9.2 Control Path Synthesis
  9.3 Thermal Analysis and Synthesis
Appendix A Layouts of the 10-Time-Step Non-Pipelined FIR Filter Design
Appendix B Layouts of the 4-Time-Step Non-Pipelined Differential Equation Solver Example
Appendix C Layouts of the 12-Time-Step Non-Pipelined Elliptic Filter Design

List Of Figures

3.1 A Portion of the USC High-Level Synthesis System
3.2 Proposed System Architecture
3.3 Proposed Two-Phase Clocking Scheme
3.4 The VHDL Description for FIR Filter Design
3.5 The Translated Dataflow Graph for FIR Filter Design
3.6 The Schedule for the 10-Time-Step Non-Pipelined FIR Filter
3.7 The Floorplan for the 10-Time-Step Non-Pipelined FIR Filter
3.8 An RTL Design for the 10-Time-Step Non-Pipelined FIR Filter
3.9 A Synthesis Example without Conditional Branches
3.10 A Synthesis Example with a Conditional Branch
3.11 Two Possible Controllers for the Example Shown in Figure 3.10
4.1 Scheduling and Module Assignment Procedure
4.2 Iterative Improvement Procedure
4.3 The Scheduling Process of a 2-Time-Step Non-pipelined FIR Filter
4.4 The Schedule of the 2-Time-Step Non-pipelined FIR Filter Design
4.5 The Floorplan of the 2-Time-Step FIR Filter with Minimum Operators
4.6 The Floorplan of the 2-Time-Step FIR Filter with a Redundant Adder
4.7 A Non-pipelined FIR Filter Using ChipCrafter Library Set
4.8 Floorplan with Minimum Operators Using ChipCrafter Library Set
4.9 Floorplan with a Redundant Adder Using ChipCrafter Library Set
5.1 A 4-time-step FIR Filter ASAP and ALAP Schedule
5.2 Time Frames for the 4-time-step FIR Filter Example
5.3 Distribution Graph for 4-time-step FIR Filter Example
5.4 The Estimation of Wiring Area
5.5 A Finite State Machine Implementation Using PLA
5.6 Internal Construction of PLA (Taken from [Mli91])
5.7 The Flow Chart of the Floorplanning Process
5.8 FIR Filter Floorplan with Overall Aspect Ratio Goal 2:1
5.9 FIR Filter Floorplan with Overall Aspect Ratio Goal 3:1
5.10 3-Room Floorplan Patterns, Orientations and Some Possible Labelings
5.11 The Construction of a Cluster Tree
5.12 Shape Function Addition for Horizontal and Vertical Cuts
5.13 Merging of Shape Functions
5.14 Calculation of the Shape Function for a Cluster Node with 3 Children
5.15 The Loop Model
5.16 The Proposed Approach for Loop Designs
5.17 The Flow Chart of 3D Scheduling Algorithm
5.18 The Allocation Table for the 4-Time-Step FIR Filter Example
6.1 The 10-Time-Step FIR Filter Schedule Produced by LADS
6.2 The RTL 10-Time-Step FIR Filter Design using the LADS Schedule
6.3 The Floorplan created by LADS
6.4 The Final Implementation Floorplan using the LADS Schedule
6.5 The 10-Time-Step FIR Filter Design Produced by FDS
6.6 The RTL 10-Time-Step FIR Filter Design using the FDS Schedule
6.7 A Manually Created Timing-Driven Floorplan using the FDS Schedule
6.8 The 10-Time-Step FIR Filter Schedule Produced by MAHA
6.9 The RTL 10-Time-Step FIR Filter Design using the MAHA Schedule
6.10 A Manually Created Timing-Driven Floorplan using the MAHA Schedule
6.11 Comparisons with Different ChipCrafter Placements
6.12 Comparisons with Different Placement Strategies
6.13 The Pipelined FIR Filter Schedule by LADS (Init.I. = 4, P.Len. = 9)
6.14 The Floorplan for Pipelined FIR Filter Design Created by LADS
6.15 The Pipelined FIR Filter Schedule by Sehwa (Init.I. = 4, P.Len. = 6)
6.16 The 4-Time-Step Non-Pipelined Differential Equation Solver Schedule
6.17 The RTL Design for the 4-Time-Step Differential Equation Solver
6.18 The Floorplan for the 4-Time-Step Differential Equation Solver Produced by LADS
6.19 The 12-Time-Step Non-Pipelined Elliptic Filter Schedule by LADS
6.20 The RTL Design of 12-Time-Step Non-Pipelined Elliptic Filter by MABAL
6.21 The Floorplan for 12-Time-Step Non-Pipelined Elliptic Filter Design by LADS
6.22 The Final Implementation Floorplan LADS Placement I for 12-Time-Step Non-Pipelined Elliptic Filter Design
6.23 The Final Implementation Floorplan LADS Placement II for 12-Time-Step Non-Pipelined Elliptic Filter Design
6.24 The 17-Time-Step Non-Pipelined Elliptic Filter Schedule by MAHA
6.25 The 10-Time-Step Non-Pipelined Elliptic Filter Schedule by LADS
6.26 The Floorplan for 10-Time-Step Elliptic Filter Design Created by LADS
6.27 The 14-Time-Step Non-Pipelined Elliptic Filter Schedule by MAHA
6.28 The Control/Data Flow Graph for the Robot Arm Controller
6.29 The VHDL Description of the Robot Arm Controller Example
6.30 The Schedule and Data Flow Graph of the Robot Arm Controller
6.31 The Timing Graph of the Robot Arm Controller Example
7.1 A Ring State Transition Diagram
7.2 A Simple Pipelined Design Synthesis Example
7.3 Control Specifications Using Status Registers
7.4 A Pipelined Design Synthesis Example with Conditional Branches
7.5 The Controller for the Pipelined Example Shown in Figure 7.4
7.6 A PLA-Based Controller Synthesized by Finesse Program
7.7 PLA Area Comparison of the Non-Pipelined Designs
7.8 Controller Area Comparison of the Non-Pipelined Designs
7.9 MAHA.12 State Transition Diagram without Status Registers
7.10 MAHA.12 State Transition Diagram Using Status Registers
7.11 PLA Area Comparison of the Pipelined Designs
7.12 Controller Area Comparison of the Pipelined Designs
8.1 A Typical Single-Chip Heat Transfer Model
8.2 A Four-Layer Chip Thermal Model
8.3 A Simplified Layered Chip Thermal Model
8.4 Finite Difference Approximation and The Corresponding Nodal Network
8.5 A 2-time-step Non-pipelined FIR Filter Design
8.6 The Floorplan for the FIR Filter Design with Minimum Operators
8.7 The Floorplan for the FIR Filter Design with a Redundant Adder
8.8 The Thermograms of the FIR Filter Design
8.9 The Floorplan for the Heat-Balanced FIR with a Redundant Adder
8.10 The Floorplan for the Heat-Balanced FIR with Redundant Adders
8.11 The Thermogram of the Heat-Balanced FIR with Redundant Adders
A.1 The Layout of ChipCrafter Placement I using LADS Schedule
A.2 The Layout of ChipCrafter Placement II using LADS Schedule
A.3 The Layout of ChipCrafter Placement III using LADS Schedule
A.4 The Layout using LADS Schedule and Floorplan without Pin Assignment
A.5 The Layout using LADS Schedule and Floorplan with Pin Assignment
A.6 The Layout of ChipCrafter Placement I using FDS Schedule
A.7 The Layout of ChipCrafter Placement II using FDS Schedule
A.8 The Layout of ChipCrafter Placement III using FDS Schedule
A.9 The Performance-Driven Layout using FDS Schedule without Pin Assignment
A.10 The Performance-Driven Layout using FDS Schedule with Pin Assignment
A.11 The Layout of ChipCrafter Placement I using MAHA Schedule
A.12 The Layout of ChipCrafter Placement II using MAHA Schedule
A.13 The Layout of ChipCrafter Placement III using MAHA Schedule
A.14 The Performance-Driven Layout using MAHA Schedule without Pin Assignment
A.15 The Performance-Driven Layout using MAHA Schedule with Pin Assignment
B.1 The Layout of ChipCrafter I using LADS Schedule
B.2 The Layout of ChipCrafter II using LADS Schedule
B.3 The Layout using LADS Schedule and Floorplan
C.1 The Layout of ChipCrafter Placement I before Buffer Resizing
C.2 The Layout of ChipCrafter Placement II before Buffer Resizing
C.3 The Layout of ChipCrafter Placement III before Buffer Resizing
C.4 The Layout of ChipCrafter Placement IV before Buffer Resizing
C.5 The Layout of ChipCrafter Placement V before Buffer Resizing
C.6 The Layout using LADS Placement I before Buffer Resizing
C.7 The Layout using LADS Placement II before Buffer Resizing
C.8 The Layout of ChipCrafter Placement I after Buffer Resizing
C.9 The Layout of ChipCrafter Placement II after Buffer Resizing
C.10 The Layout of ChipCrafter Placement III after Buffer Resizing
C.11 The Layout of ChipCrafter Placement IV after Buffer Resizing
C.12 The Layout of ChipCrafter Placement V after Buffer Resizing
C.13 The Layout using LADS Placement I after Buffer Resizing
C.14 The Layout using LADS Placement II after Buffer Resizing

List Of Tables

4.1 First Library Set Used for FIR Filter
4.2 Minimum-Operator Designs versus Redundant-Operator Designs
4.3 Library Set Created by ChipCrafter
4.4 A FIR Filter 2-Time-Step Design Using ChipCrafter Library Set
6.1 Design Library Set Used by LADS (Obtained from ChipCrafter)
6.2 The 10-Time-Step Non-Pipelined FIR Filter Designs using the LADS Schedule
6.3 The 10-Time-Step Non-Pipelined FIR Filter Designs using the FDS Schedule
6.4 10-Time-Step Non-Pipelined FIR Filter Designs using the MAHA Schedule
6.5 The 4-Time-Step Non-Pipelined Differential Equation Solver Example
6.6 The 12-Time-Step Elliptic Filter Designs before Buffer Resizing
6.7 The 12-Time-Step Elliptic Filter Designs after Buffer Resizing
6.8 Non-Pipelined Robot Arm Controller Examples by LADS
6.9 Pipelined Robot Arm Controller Examples by LADS
7.1 Design Library Set Used by CSG (Obtained from ChipCrafter)
7.2 Non-Pipelined Robot Arm Controller Examples by MAHA
7.3 Non-Pipelined Robot Arm Controller RTL Examples by MABAL
7.4 Controllers Using Status Registers for Non-Pipelined Examples
7.5 Controllers without Status Registers for Non-Pipelined Examples
7.6 Pipelined Robot Arm Controller Examples by Sehwa
7.7 Pipelined Robot Arm Controller RTL Examples by MABAL
7.8 Controllers Using Status Registers for Pipelined Examples
7.9 Controllers without Status Registers for Pipelined Examples
8.1 Library Set Created by ChipCrafter Silicon Compiler
A.1 The Bindings for the 10-Time-Step Non-Pipelined FIR Filter Design Produced by LADS
B.1 The Bindings for the 4-Time-Step Non-Pipelined Differential Equation Solver Example Produced by LADS
C.1 The Bindings for the 12-Time-Step Non-Pipelined Elliptic Filter Design Produced by LADS
Abstract

In modern integrated circuits, submicron feature sizes result in delays of interconnections being comparable to delays of functional units. This thesis describes a new approach to high-level synthesis which simultaneously considers the area and delays of interconnections as well as functional units during the scheduling process using floorplanning. The floorplan concurrently constructed during the scheduling process provides an accurate estimate of the area and delays caused by interconnections to the scheduler. Our experiments show that this new scheduling technique produces satisfactory results for various practical-size examples.

The control path synthesis program CSG described in this thesis automatically synthesizes controllers with conditional branches for pipelined as well as non-pipelined datapath-dominated designs. CSG can introduce status registers into a controller design to store the condition state in a design with conditional branches. A multiple-code state encoding may be obtained using status registers, which in turn enables logic synthesis programs to produce a better minimization. The experimental results show that controllers using status registers result in smaller PLAs in many design examples.

A result of reduced feature sizes is the increase in power density. Design strategies are described to relieve potential thermal problems during high-level synthesis. Operators are placed as close as possible to their data predecessors in order to minimize the interconnection cost while ensuring that the thermal constraints are not violated. Spreading out overused functional units around the problem area is often at the cost of performance degradation. Introducing redundant operators is suggested to reduce the module utilization among problem modules when system performance is important. Our experiments produce quite satisfactory results for a power-dominated example.
Chapter 1

Introduction

Due to the decreasing cost of VLSI chips, Application Specific Integrated Circuits (ASICs) have emerged as one of the fastest growing aspects of IC design. As the life cycle of IC products is reduced, ASIC designers have to make the design process quicker as well as simpler to further meet the needs of markets. To reduce the IC design turnaround time, the design automation industry has responded by introducing many design tools. These tools include layout generation, placement and routing, module generation, simulation and many others. The development of higher-level design tools is needed due to the constantly increasing market pressure. This raises research interest in behavioral-level synthesis (i.e., high-level synthesis).

1.1 Behavioral Synthesis

High-level synthesis takes an abstract behavioral specification of a digital system and finds a register-transfer level structure which realizes the given behavior. By behavior we mean the way that the system interacts with its environment. Usually there are many different structures that can be used to realize a given behavior. One of the synthesis tasks is to find the structure that best meets the constraints (limitations on the performance, area or power) while minimizing other costs.

The input specification algorithm gives the required mappings from sequences of inputs to sequences of outputs. From that input specification, the synthesis system produces a description of a register-transfer level structure. This structure includes a data path as well as a control path. A data path is a network of registers, functional units, multiplexers and buses. A control path, on the other hand, is a finite state machine which drives the data paths so as to produce the required behavior. The control path can be realized in terms of microcode, a PLA circuit or random logic.
1.1.1 Data Path Synthesis

The core of data path synthesis can be divided into three subtasks, namely, scheduling, module allocation and binding. They are closely interrelated and depend on each other.

Scheduling assigns the operations to time steps. A time step is a fundamental sequencing unit in synchronous systems. The aim of scheduling is to minimize the number of time steps needed to realize the specified behavior under certain limits on the available hardware resources. Scheduling is also used to minimize the amount of hardware resources needed to realize the specified behavior within a limited number of time steps.

Module allocation allocates a sufficient amount of hardware to implement the design. The hardware consists of functional units, memory elements and interconnection paths. The interconnection paths include multiplexers, tri-state drivers and buses. They are designed so that the functional units and registers are connected to support the data path transfers required by the schedule.

Module binding consists of assigning the operations to operators (hardware resources) and values to registers. The goal of module allocation and binding is to minimize the amount of hardware needed to realize the design.

1.1.2 Control Path Synthesis

Once the schedule and data path have been chosen, it is necessary to synthesize a controller to drive the data paths as specified by that schedule. The controller is derived using the information from the schedule, the register-transfer network and bindings produced by data path synthesis. The activated control signals will select operations to be performed at specific time steps and route the data through appropriate functional units.

The synthesis of the controller itself can be done in different ways. For example, in a non-pipelined design without conditional branches, a control step could correspond to a state in the finite state machine.
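The subtasks of Section 1.1.1 and the one-state-per-control-step controller just mentioned can be illustrated with a small Python sketch. It is not the 3D scheduling algorithm of this thesis: it uses a plain resource-constrained list scheduler, and the dataflow graph, the operator limits, the one-time-step operation delay, the wrap-around state order and all function names are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical dataflow graph for y = (a*b + c*d) + e*f.
# Each entry: operation name -> (operator type, predecessor operations).
# Assumptions: the graph is acyclic and every operation takes exactly
# one time step (no operation chaining).
DFG = {
    "m1": ("mul", []),
    "m2": ("mul", []),
    "m3": ("mul", []),
    "a1": ("add", ["m1", "m2"]),
    "a2": ("add", ["a1", "m3"]),
}


def list_schedule(dfg, limits):
    """Scheduling: place each operation in the earliest time step in which
    its predecessors have finished and an operator of its type is free."""
    kinds = {kind for kind, _ in dfg.values()}
    if any(limits.get(kind, 0) < 1 for kind in kinds):
        raise ValueError("every operator type in use needs at least one unit")
    step_of, step = {}, 0
    while len(step_of) < len(dfg):
        used = defaultdict(int)  # operators of each type busy in this step
        for op, (kind, preds) in dfg.items():
            if op in step_of:
                continue
            ready = all(p in step_of and step_of[p] < step for p in preds)
            if ready and used[kind] < limits[kind]:
                step_of[op] = step
                used[kind] += 1
        step += 1
    return step_of


def bind(dfg, step_of):
    """Module binding: operations sharing a time step get distinct operator
    instances; instances are reused in later time steps."""
    binding = {}
    next_free = defaultdict(int)  # (step, type) -> next unused instance index
    for op in sorted(step_of, key=lambda o: (step_of[o], o)):
        kind = dfg[op][0]
        key = (step_of[op], kind)
        binding[op] = f"{kind}{next_free[key]}"
        next_free[key] += 1
    return binding


def controller(step_of, binding):
    """One controller state per time step. Each state lists the
    (operator, operation) pairs it activates; the states are assumed
    to be visited in order and to wrap around after the last one."""
    states = [[] for _ in range(max(step_of.values()) + 1)]
    for op, step in step_of.items():
        states[step].append((binding[op], op))
    return [sorted(signals) for signals in states]


schedule = list_schedule(DFG, {"mul": 2, "add": 1})
binding = bind(DFG, schedule)
states = controller(schedule, binding)
for number, signals in enumerate(states):
    print(f"state {number}: {signals}")
```

With two multipliers and one adder allowed, this toy graph is scheduled in three time steps, and the third multiplication reuses the first multiplier instance in the second step. Interconnection area and delay play no role in the sketch, which is the gap the approach proposed in this thesis addresses.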
After the controlling finite state machine has been specified, the state machine can be further synthesized by logic synthesis tools, including state encoding and optimization of the combinational logic.

1.2 Motivation and Problem Approach

Submicron integrated circuits have extremely small, fast gates. The area and delay gaps between functional modules and interconnection modules are narrowing, partly due to smaller feature sizes, and partly due to larger designs being integrated into a chip, requiring proportionally more interconnections. Module binding can affect the performance of the subsequent physical implementation greatly due to long interconnection delays between operators. An “optimal” register-transfer level schedule can actually be quite suboptimal when interconnection delays dominate execution delays. However, interconnection delays can only be determined accurately after floorplanning is completed. Obviously, high-level synthesis tools using submicron technology will not be able to make intelligent scheduling decisions without considering the effects of interconnections using floorplanning.

The thermal problem is another issue in designing a high-speed integrated circuit. A result of reduced feature sizes is an increase in power density, partly due to the higher operating speed, and partly due to the increased component density on an integrated circuit. On the other hand, the operating temperature of a chip is limited to a certain range for acceptable system reliability. For silicon devices, this temperature is in the range of 75-85°C [Jr.83]. With increasing power density and with limits on the operating temperature, thermal limitations of the chip must be considered during the design process.
Consequently, it is imperative to study not only the thermal properties of the device and package material but also the run-time thermal properties on the chip/die so that a better thermal layout can be obtained to alleviate potentially high thermal stress during the design stage.

A new approach for scheduling called 3D scheduling,¹ which performs scheduling, module allocation, module binding and floorplanning concurrently to take into account interconnection area and delay effects during the scheduling process, is proposed. Interconnections here include wiring, multiplexers and registers in the data path. Prediction tools are used in the 3D scheduling technique to preview register-transfer level design characteristics before the scheduling process is performed. The 3D scheduling approach can produce schedules with operation chaining.² The delay of an operation chain is determined according to characteristics of chained modules, which are described in the module library. For example, the delay of two cascaded carry-select adders is roughly twice the delay of a single carry-select adder. However, the delay of two cascaded ripple-carry adders is only slightly longer than the delay of a single ripple-carry adder. The 3D scheduling technique estimates the delays of operation chains accurately, which gives the 3D scheduler more degrees of freedom during the scheduling process.

To resolve potential thermal problems in a design, a new scheduling approach which simultaneously balances heat distribution by means of floorplanning is proposed. Operations are scheduled and placed as close as possible to their data sources to minimize interconnection costs while ensuring that thermal constraints are not violated. Two strategies are proposed for designs having heat concentration problems. First, “hot” circuits can be spread around the problem area over the unused space on the floorplan.
Alternatively, thermal problems may be alleviated by introducing extra redundant operators to reduce the utilization of operators around the problem area.

After the data path is produced, control path synthesis is performed to create the respective controller. The ability to synthesize controllers for pipelined designs with conditional branches contributes to the problem complexity. Two controller implementation schemes, namely, controllers with status registers and controllers without status registers, are proposed to minimize the area of a design. The purpose of introducing status registers into a controller is to store reserved condition values in a design with conditional branches, which can sometimes produce smaller layouts than traditional approaches.

Footnote 1: 3D scheduling refers to the problem of simultaneously scheduling time and the X-Y plane.
Footnote 2: An operation chain allows two or more operations with data dependencies to be scheduled in the same time step.

1.3 Thesis Organization

The work presented in this thesis falls into three groups: data path synthesis, control path synthesis, and thermal analysis and synthesis. Chapter 2 describes the related research on data path synthesis, prediction methods, floorplanning approaches, controller design and thermal analysis.

In Chapter 3, overviews of the data path synthesis and control path synthesis are presented. Chapter 4 describes the preliminary data path synthesis research. Chapter 5 then presents our current data path synthesis approach, in which the 3D scheduling technique is given. The experiments on data path synthesis are presented in Chapter 6. An extensive set of examples was synthesized. Layouts and timing simulation results were used to validate our system. Chapter 7 describes the work on control path synthesis. Two controller implementation schemes are proposed: controllers with and without status registers. The second part of Chapter 7 describes the experiments performed on control path synthesis.
Both types of controllers (i.e. with and without status registers) were synthesized and laid out to illustrate the tradeoff between area and performance. Chapter 8 presents thermal models, an analytical solution and a numerical solution for thermal analysis on the surface of a chip/die. A group of thermograms corresponding to different module allocations, bindings and floorplans was produced to simulate the temperature profiles of different designs that implement the same behavior. Chapter 9 concludes the research done in this thesis and discusses future research problems.

Chapter 2
Related Research

In this chapter, we study previous research efforts in the literature. The survey is divided into the following five areas: (i) data path synthesis, (ii) prediction methods, (iii) floorplanning approaches, (iv) control path synthesis and (v) thermal analysis and synthesis.

2.1 Data Path Synthesis

It is very difficult to predict the performance of submicron silicon before the layout is completed. Very little synthesis research has taken physical design effects into account in high-level synthesis. BUD is a unique program which floorplans prior to synthesis [McF86]. BUD first builds a hierarchical cluster tree based on common functionality, common interconnections and potential parallelism. Resource allocation and operation assignment are performed by cutting this tree at different levels. A schedule is produced for each configuration (i.e. each cut), and final design characteristics are estimated using an approximate floorplan.

Fasolt floorplans and analyzes area impact after scheduling [Kna90]. The current schedule is adjusted in the next design iteration to reduce the area overhead caused by interconnections. ELF is an early system which estimates interconnection effects during synthesis [Gir84]. ELF predicts the wiring area from register-transfer level characteristics to make design decisions.
Chippe is a constraint-driven expert system which predicts wire delays from the structural RT-level design [BG90]. The worst case wiring delay is predicted after behavioral-level synthesis, and this delay is used by the evaluation mechanism of the iterative synthesis process.

HAL uses a force function proposed by Paulin [PK89] to perform the scheduling subtask. The allocation approach in HAL is rule-based. Cloutier and Thomas at Carnegie Mellon University modified the force function to perform scheduling, module allocation and module assignment at the same time, because the scheduling and module binding subtasks are closely interrelated [CT90].

Splicer [Pan88] is an example of a system which uses extensive search techniques to perform synthesis. A technique based on branch-and-bound search is used in Splicer to bind the best operation assignment while the interconnections are generated simultaneously. LYRA and ARYL [HCLH90] formulate the operation assignment task as a bipartite graph matching problem. Interconnection allocation is performed by a greedy heuristic after register allocation and operation assignment are done. LYRA performs register allocation first, followed by operation assignment. ARYL performs operation assignment first, followed by register allocation.

2.1.1 Prediction Methods

The early work on the estimation of functional resource allocation, by Jain and Parker [JPP87] [JMP88], predicted the lower-bound functional area due to varying time characteristics for pipelined and non-pipelined design styles. This prediction research focused on estimating the resource allocation behavior of Sehwa [PP88] and MAHA [PPM86], the pipelined and non-pipelined scheduling programs in the ADAM synthesis system. Only single-cycle operations were considered in this work. An estimation technique for functional unit allocation was recently proposed by Hagerman [Hag91].
The concept of distribution graphs [PK89] is used to estimate the functional unit allocation to be produced by an imperfect scheduling tool (e.g. a force-directed type scheduling tool). Estimation results are favorable as compared to the scheduling results produced by SAM [CT90].

In area estimation research, PLEST is a program which estimates the layout area of a register-transfer level design using the standard-cell placement and routing style. PLEST estimates areas of layouts with various aspect ratios, feed-throughs and track densities using probabilistic methods. PLEST results were verified against RCA CADDAS standard-cell layouts and were within 10% of actual layout areas [KP89].

Zimmerman proposed a constructive layout estimation based on predicting shape functions for a binary slicing tree [Zim88]. Shape functions of leaf nodes are used to predict shape functions of composite nodes. The process is repeated hierarchically until the shape function of the root node is calculated. Kurdahi and Ramachandran combined Zimmerman’s constructive methods with the probabilistic methods used in PLEST [KR91]. Since this estimation approach does not traverse the whole slicing tree, a favorable speed advantage over purely constructive methods (e.g. Zimmerman’s approach) is expected.

2.1.2 Floorplanning Approaches

Most current approaches consider the scheduling and floorplanning problems separately, optimizing designs with different objective functions. Lauther’s version of min-cut placement [Lau79] is an early work in floorplanning research. A top-down method that divides cells into two blocks was used. The process continues to divide blocks into smaller blocks until the number of cells in a block is small. Placement decisions made at higher levels of the hierarchy may suffer from a lack of lower-level information.

Otten’s shape propagation floorplanning technique [Ott83] initially derives a physical hierarchy in the form of a binary slicing tree.
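The shape-function combination that slicing-tree methods of this kind rely on can be sketched as follows. This is a minimal illustration, not the procedure of any cited system: it assumes each leaf module offers a finite list of candidate (width, height) shapes, that a "V" cut places two children side by side and an "H" cut stacks them, and that wiring area is ignored. The function names and the tree encoding are hypothetical.

```python
from itertools import product

def prune(shapes):
    """Keep only non-dominated (width, height) pairs, sorted by width."""
    kept, best_height = [], float("inf")
    for w, h in sorted(set(shapes)):
        if h < best_height:          # wider shapes survive only if strictly shorter
            kept.append((w, h))
            best_height = h
    return kept

def combine(left, right, cut):
    """Shape function of a composite node from the shape functions of its children."""
    if cut == "V":                   # children side by side: widths add
        cands = [(w1 + w2, max(h1, h2)) for (w1, h1), (w2, h2) in product(left, right)]
    elif cut == "H":                 # children stacked: heights add
        cands = [(max(w1, w2), h1 + h2) for (w1, h1), (w2, h2) in product(left, right)]
    else:
        raise ValueError("cut must be 'V' or 'H'")
    return prune(cands)

def shape_function(node):
    """Bottom-up traversal: a leaf is a list of shapes, an internal node is (cut, left, right)."""
    if isinstance(node, list):
        return prune(node)
    cut, left, right = node
    return combine(shape_function(left), shape_function(right), cut)

# Hypothetical module shapes (arbitrary units).
adder = [(4, 2), (2, 4)]
register = [(2, 1), (1, 2)]
multiplier = [(6, 6)]
tree = ("V", ("H", adder, register), multiplier)
root_shapes = shape_function(tree)
```

For the example tree, the root shape function contains the single shape (8, 6); the smallest-area entry of the root shape function gives the area estimate for the whole slicing structure.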
The slicing tree is then traversed bottom-up. At each internal node of the tree, a composite shape function is calculated from the shape functions of its children and the directions of cuts. These shape functions are combined and propagated recursively to the root of the tree. However, in Otten’s approach, a binary slicing tree is restrictive, and global connection costs are not considered during the bottom-up traversal.

La Potin [PD86] improved the min-cut method by considering shape functions and pad positions during floorplanning. Dai [DK87] extended La Potin’s work to general non-slicing structures and multi-way cluster trees and combined floorplanning with hierarchical global routing. Pedram [Ped91] further extended Dai’s work by computing shape functions for cluster nodes bottom-up, incorporating cell sizes, shapes and pin positions to meet physical and timing constraints.

Zimmerman [Zim88] improved Otten’s technique to optimize the directions of cuts during the shape function propagation. The wiring area is estimated for each node of the binary slicing tree, and the shape functions of nodes are shifted to account for wiring areas. Herrigel [HGF89] and Lengauer [Len90] proposed a top-down follow-up phase to minimize the total interconnection length by switching cells across cuts. However, the improvement of this method is limited, because the minimizations of layout area and interconnection length are performed separately.

2.2 Control Path Synthesis

In the area of automated controller synthesis, most existing high-level synthesis systems synthesize controllers only for non-pipelined designs. A widely applied controller style in many existing synthesis systems is the ROM-based microprogrammed controller model. Examples include the control allocator for the CMU-DA [NCP82] and the ATOMICS system for Cathedral-II [GVM87] [Gaj88a] [ZSRM90].
The control allocator for the CMU-DA system assumes a canonical microprogrammed model, and performs optimizations based on microcode format constraints. The ATOMICS system takes an RT-level description as input and performs microprogram scheduling in order to minimize the global machine cycle count. Three pipeline stages can be identified in the control critical path of designs produced by ATOMICS.

Some systems map the control/data flow graph representing the hardware behavior into a corresponding state transition diagram, which can be implemented by either a PLA circuit or random logic. For example, the Yorktown Silicon Compiler [Gaj88b] implements the controller as a hierarchy of finite state machines, where a finite state machine is associated with each routine. Control state splitting allows the delay through the combinational part of the data path to be reduced to satisfy clock cycle constraints. AT&T’s Bridge system [TPL+86] [TWR+88] first creates a symbolic control table once the data path is allocated. The symbolic control table identifies all modules which need to be activated in each cycle. The detailed implementation of the controller is decided by the module generator. In the Hyper synthesis system [CPTR89], a state transition diagram is derived from a control/data flow graph based on the scheduling information. The procedure is recursive because of the hierarchical nature of the control/data flow graph. The state transition diagram is then optimized by removing dummy states.

The PUBSS system [WTL91] developed at Princeton University synthesizes control-dominated machines by performing state minimization or simplification, state decomposition and state collapsing on the behavior FSMs to achieve user specified constraints. A behavior FSM is an automaton whose inputs and outputs are only partially scheduled (i.e. the timing behavior is incompletely specified).
Kim’s controller synthesis research at UC-Irvine [KK91] is an example which synthesizes controllers for pipelined designs using PLAs as its platform. The controller is modeled as a Moore-style finite state machine (footnote 1). Control states are decided by the possible execution modes in a scheduled control/data flow graph using a coloring scheme proposed by Park [PP88]. Next state transitions for each control state are determined by the compatibility of two possible execution modes between two consecutive groups of states. A condition value needs to be produced at least two clock cycles before the conditional execution is performed, because of the use of a Moore-style controller.

Although the systems mentioned above are part of powerful synthesis systems and effective in synthesizing some classes of designs, they address only part of our objectives: to synthesize controllers for pipelined designs as well as non-pipelined designs, with or without conditional branches.

Footnote 1: Outputs of a Mealy-style finite state machine depend on inputs as well as the current state. Outputs of a Moore-style finite state machine only depend on the current state.

2.3 Thermal Analysis and Synthesis

To the best of our knowledge, no high-level synthesis research which considers thermal effects has been reported. Instead of taking care of the thermal problem during the design process, most of the reported works focus on a post-design analysis procedure. Many improvement procedures at the board level have suggested that thermal design should be considered during the PCB layout design process [OP90]. However, the thermal models and improvement procedures proposed at the board level cannot be applied directly to designs at the chip level.

Thermal analysis using a numerical approach has been done many times.
Numerical techniques such as finite-difference [MAD83], finite-element [WFM83] and boundary element [CPB88] methods have been widely used to analyze thermal profiles of electronic circuits with irregular boundary conditions. Some analytical research which focuses on thermal analysis of ASIC chips has been published recently. The analyzed structure was simplified to a layer model. The number of layers varies from two [CA78] to four [LPM89] [LMP89]. The computational complexity increases rapidly as the number of layers increases. These works analyzed and pointed out potential thermal problems but did not provide any improvement strategies during the circuit design process. However, considering thermal design after the circuit design is completed is usually too late to alleviate potential thermal problems. For some critical circuit designs in particular, it is almost impossible to meet thermal constraints after the circuit design is completely done. In such cases, iterations would have to be performed.

The remaining chapters of this dissertation describe our research. Any results described here which are used for our research will be cited at the appropriate time.

Chapter 3
Overview of the Synthesis System Research

The current USC high-level synthesis system synthesizes register-transfer level designs which implement the behavior from algorithmic-level specifications. A portion of the USC high-level synthesis system is shown in Figure 3.1. A register-transfer level design can be subdivided by its functionality into two parts, the data path and the control path, as shown in Figure 3.2. The data path consists of a network of functional units, registers, multiplexers, tri-state-drivers and buses. The data path carries out the complex computation in the algorithmic-level specification.
The control path generates the control signals (footnote 1) required by the registers, multiplexers and tri-state-drivers (footnote 2) for sequencing the events in the data path, as well as the signals required by the control path itself. In the rest of this chapter, overviews of data path synthesis and control path synthesis are given by following the proposed clocking scheme.

Footnote 1: ALU control signals are not considered in the current CSG (Control Signal Generator). However, they can be obtained similarly to control signals for multiplexers by simple extension of the current CSG.
Footnote 2: In our approach, buses are logical interconnection elements whose usages are controlled by tri-state-drivers.

[Figure 3.1: A Portion of the USC High-Level Synthesis System. Data items: Data Flow & Timing Graphs, Constraints, Module Library, Technology Information, Schedule & Allocation Table, Floorplan, Register-Transfer Level Design, Controller Specification, Cell-Level Netlist, Layout. Tools: CLSI VHDL Parser, VHDL2DDS Translator, LADS (Data Path Synthesis: Scheduling with Floorplanning), MABAL Allocation & Binding, Netlist Expansion, Cascade Design Automation ChipCrafter.]

[Figure 3.2: Proposed System Architecture. Control Path Logic with State Registers, Status Registers and Output Registers; Data Path Logic with Output Registers; signals: External Inputs, Control Signals, Condition Values, Outputs.]

3.1 Limitations

Hierarchical control/data flow graphs are not currently considered; control/data flow graphs should be flattened before high-level synthesis. Pipelined designs as well as non-pipelined designs are synthesized in our data path and control path synthesis programs. The control/data flow graph may contain conditional branches; the mutually exclusive operations of the same type may share an operator to reduce the hardware cost. Pipelined operators and multi-cycle operators are not considered in the current synthesis programs.
In data path synthesis, control/data flow graphs with nested inner loops are handled, but parallel loops are not handled for non-pipelined designs. The controller model is proposed for non-pipelined designs with nested inner loops but is not implemented in the current control path synthesis program. Data recursion edges (footnote 3) are not considered for pipelined designs. The possibilities of operation chaining are examined in data path synthesis to shorten the schedule. The data path synthesis program performs operator allocation and binding, but register allocation, value binding and multiplexer allocation are done only partially. Buses are not considered but can be handled similarly to multiplexers by simple extension of the current data path synthesis program. A set of floorplans with different aspect ratios is created for a tentative design (or schedule).

The control path synthesis program considers designs using multiplexer architectures as well as bus architectures. A finite state machine is produced using the Cascade Design Automation ChipCrafter silicon compiler. Two controller implementation styles are provided, namely, controllers using status registers and controllers without status registers.

Footnote 3: A data recursion edge with a degree d is a pseudo edge representing data dependency in the control/data flow graph for a pipelined design, where d means the value used by an operation is generated d iterations earlier.

[Figure 3.3: Proposed Two-Phase Clocking Scheme. The control-path clock and the data-path clock, showing the control path execution period, the data path execution period, the points at which control signals and data path values are stored, the edge tolerances, and the clock periods Clock 1a, Clock 1b and Clock 2a.]

3.2 Clocking Scheme

The clocking scheme chosen is vital because different clocking scheme assumptions may lead to completely different data path and control path synthesis results. In our approach, a conservative two-phase non-overlapping clocking scheme is assumed.
A data-path clock cycle, Clock_DP, is always followed by a control-path clock cycle, Clock_CP, as shown in Figure 3.3.

Memory elements in the synthesized circuits can be classified into four categories, namely the state registers, output registers, data-path registers and status registers. State registers are used to keep track of the current state, which allows the controller to decide the next state transitions. Output registers hold the output control signals to avoid race conditions between the data path and control path during the data path execution. Data-path registers store the intermediate computation values created by functional elements in the data path at the end of each data path execution. If the status-register controller implementation style is assumed, status registers are used by the control path to store the condition values. Note that all the memory elements can be realized by either positive/negative edge-triggered D-type flip flops or level-sensitive latches, depending on the clocking scheme used.

Positive edge-triggered D-type flip flops are proposed to implement all the memory elements, including the registers in the data path and control path. All values in the controller’s registers can only be changed at the rising edges of control-path clocks, and all the values generated by the data path can be stored into registers only at the rising edges of data-path clocks. If each clock lasts for a sufficiently long period, then the correct executions of both the control path and data path can be ensured. The possibilities of race conditions can also be avoided because the registers in the data path and control path would never be changed at the same time.

Some significant parameters for an edge-triggered D-type flip flop (ETDFF) are listed below, with brief definitions [UT86]. The setup time, T_setup, for an ETDFF is the minimum time that the D (input) signal must be stable prior to the triggering edge of the C (clock) pulse.
The hold time, T_hold, is the minimum time that the D signal must be held constant after the triggering edge of the C pulse. The propagation delay, T_delay, is the time required for a signal to propagate from the C terminal to the Q terminal, assuming that the D signal has been set up sufficiently far in advance as specified by the setup time constraint.

T_LC, T_TC, T_LD and T_TD specify the tolerances of the leading and trailing edges of the control path clock and the data path clock, respectively. For the clocking scheme we specified, there is a certain tolerance range within which we can assume the errors will be confined. This is essentially a worst case approach. Δ_DC and Δ_CD define the phase differences from Clock_DP to Clock_CP and from Clock_CP to Clock_DP, respectively. Assume, for example, that the leading edge of the Clock_CP pulse of some cycle would arrive at every ETDFF at time t (illustrated in Figure 3.3). Then, in actual systems, this edge is received in the interval (t - T_LC, t + T_LC). Corresponding assumptions apply for the other three edges.

Our goal is to design our system so that if this assumption and the corresponding assumptions about the precision of the components used are valid, then there will be no failure due to timing, even in the extreme case. A set of constraints is developed such that if all the flip flop input signals arrive on time for the first cycle, then they will also arrive on time for the next cycle. By induction, for all succeeding cycles, the flip flop inputs are also stable over appropriate intervals, so that the system will behave according to the specification. The timing constraints which ensure correct behavior are summarized as follows:

• The worst case control path execution period, T_CP, should be longer than the sum of the propagation delay T_delay of the data path’s DFFs, the control path combinational logic delay and the setup time T_setup of the control path’s DFFs.
• The worst case data path execution period, T_DP, should be longer than the sum of the propagation delay T_delay of the control path’s DFFs, the data path combinational logic delay and the setup time T_setup of the data path’s DFFs.
• The control/data path input signals for the next cycle should not arrive so early as to violate the hold time constraints for the current cycle, i.e. T_CP > T_hold, the hold time of the control path’s DFFs, and T_DP > T_hold, the hold time of the data path’s DFFs.
• The phase differences, Δ_DC and Δ_CD, should be non-negative to ensure a non-overlapping clocking scheme.

The clocking scheme defined above implicitly requires condition values to be created at least one clock cycle before they can be used to determine which conditional execution paths are executed. For example, if a condition value is created by the data path during the Clock 1b period, then the value will be stored into a register in the data path at the end of Clock 1b. The condition value will be ready to be taken by the controller to decide the proper output signals, according to the condition value, during the Clock 2a period. Clock 2a is therefore also the earliest clock period in which the condition value can be used by the controller.

3.3 An Overview of Data Path Synthesis

The data path synthesis software in the high-level synthesis system described in this thesis takes VHDL descriptions of algorithmic/behavioral specifications as inputs [VHD88]. The VHDL descriptions characterize the data dependencies but not the parallelism of the input algorithm. For convenience of analysis, the VHDL description is translated into a data flow graph and a timing graph in our approach. The DDS [Kna86], the internal representation in the USC/ADAM synthesis subsystem, maintains separate representations of data flow and control timing information. A VHDL description of an FIR filter is shown in Figure 3.4.
Figure 3.5 shows the translated data flow graph of the FIR filter design.

After the preprocessing of the VHDL description is done, the data path synthesis program LADS (Layout And Data path Synthesis) takes user specified constraints, a module library, technology dependent information (footnote 4) and the translated data flow graph and timing graph to perform the scheduling, allocation and binding synthesis subtasks using the concurrently generated floorplan. The schedule and one of the possible floorplans for the non-pipelined 10-time-step FIR filter design are shown in Figures 3.6 and 3.7, respectively.

LADS uses a 3D scheduling technique to perform scheduling simultaneously with floorplanning. The 3D scheduler first applies prediction tools to estimate the lower bound on the number of operators and the interconnection cost from the data flow graph. The prediction results are employed to decide module allocation and bindings of scheduled operations with a more global view before the whole scheduling process is completed. Module allocation and binding must be done along with scheduling in order to be able to floorplan with the bound modules. The 3D scheduler uses a modified force function to determine the scheduling and binding for one operation at a time. The area and delay effects of wires, multiplexers and registers are considered by the floorplanner during scheduling in our approach.

A constructive cluster-tree technique is used in the floorplanning process. The floorplan generated by the 3D scheduler is used to feed back low-level physical information, such as the impact of wiring area and delays, to the scheduler. In this way, the effects of interconnection in a register-transfer level design can be estimated more precisely during scheduling. Therefore, the 3D technique is able to produce more timing-balanced schedules as compared to traditional approaches that only consider the delays of functional modules.
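To make the idea of an operator lower bound concrete, the following is a minimal sketch of the simplest possible resource-count bound. It is not the prediction tool used by LADS: it assumes single-cycle operators, no operation chaining and no sharing across operator types, and simply divides the number of operations of each type by the number of time steps. The operation counts (15 additions and 8 multiplications) are read from the FIR filter description of Figure 3.4; the function name is hypothetical.

```python
from collections import Counter
from math import ceil

def operator_lower_bound(op_types, time_steps):
    """Fewest operators of each type able to execute all operations in `time_steps` steps,
    assuming every operation occupies one operator for exactly one time step."""
    counts = Counter(op_types)
    return {op: ceil(n / time_steps) for op, n in counts.items()}

# FIR filter of Figure 3.4: e1..e8 and e17..e22 plus outf are additions, e9..e16 are multiplications.
fir_operations = ["add"] * 15 + ["mul"] * 8
fir_bounds = operator_lower_bound(fir_operations, time_steps=10)
```

For the 10-time-step schedule this gives a bound of two adders and one multiplier, which is consistent with the modules ADD.001, ADD.002 and MUL.001 that appear in Figures 3.7 and 3.8. A bound of this kind ignores data dependencies, so a real predictor may report a larger number.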
After LADS finishes scheduling and binding, the allocation and binding program MABAL takes the schedule and bindings produced by LADS to complete the register-transfer level design. One possible register-transfer level design for the non-pipelined 10-time-step FIR filter design is shown in Figure 3.8. The register-transfer level designs can be further synthesized by logic-level and circuit-level CAD tools, such as the Cascade Design Automation ChipCrafter compiler, to obtain chip layouts.

Footnote 4: Technology dependent information includes transistor parameters, layout parameters and other parameters.

entity FIR_FILTER is
  port (f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12,
        f13, f14, f15, f16, f17, f18, f19, f20, f21, f22, f23, f24 : in Integer;
        outf : out Integer);
end FIR_FILTER;

architecture Behaviour of FIR_FILTER is
begin
  process
    variable e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11,
             e12, e13, e14, e15, e16, e17, e18, e19, e20, e21, e22 : Integer;
  begin
    e1 := f1 + f2;
    e2 := f3 + f4;
    e3 := f5 + f6;
    e4 := f7 + f8;
    e5 := f9 + f10;
    e6 := f11 + f12;
    e7 := f13 + f14;
    e8 := f15 + f16;
    e9 := e1 * f17;
    e10 := e2 * f18;
    e11 := e3 * f19;
    e12 := e4 * f20;
    e13 := e5 * f21;
    e14 := e6 * f22;
    e15 := e7 * f23;
    e16 := e8 * f24;
    e17 := e9 + e10;
    e18 := e17 + e11;
    e19 := e18 + e12;
    e20 := e19 + e13;
    e21 := e20 + e14;
    e22 := e21 + e15;
    outf <= e22 + e16;
  end process;
end Behaviour;

Figure 3.4: The VHDL Description for FIR Filter Design

[Figure 3.5: The Translated Dataflow Graph for FIR Filter Design]

3.4 An Overview of Control Path Synthesis

In synchronous systems (the only kind we consider in this approach), a controller can be described by a finite state machine that can be implemented by PLA circuits, random logic or microprograms. Data-path scheduling, module allocation and module binding are assumed to have been performed before control path synthesis.
As shown in Figure 3.1, the control path synthesis program CSG takes MABAL’s output to synthesize a controller in the current USC high-level synthesis system. To retain flexibility in choosing the controller implementation style, only the behavior of the finite state machine is produced. Flexibility is important because we wish to synthesize digital circuits that have proper functionality and also meet the performance and area constraints. To accomplish this, the controller implementation style may vary from problem to problem. In this thesis, the controller is assumed to be implemented with a PLA circuit. However, if a microprogrammed controller were preferred, the finite state machine could also be implemented with a microprogrammed controller.

[Figure 3.6: The Schedule for the 10-Time-Step Non-Pipelined FIR Filter]

[Figure 3.7: The Floorplan for the 10-Time-Step Non-Pipelined FIR Filter. Modules shown include ADD.001, ADD.002, MUL.001, multiplexers, registers and the controller.]

[Figure 3.8: An RTL Design for the 10-Time-Step Non-Pipelined FIR Filter. Modules shown include MUX6-1.003, MUX4-1.004, MUX4-1.001, MUX5-1.002, REG16.001, ADD.001, ADD.002 and MUL.001.]

3.4.1 Controller for Designs without Conditional Branches

For a design without conditional branches, the controller specification is simple, and its state transition diagram is a simple ring. Figure 3.9 illustrates a non-pipelined design example without conditional branches. The dashed lines in the control/data flow graph represent the proposed schedule, which is generated by the scheduling program in the data path synthesis system. An outer loop is assumed implicitly in this graph but would be found explicitly in the corresponding control timing graph, not shown here.
As described in the Limitations section, a more complex timing scheme is not supported here, since the problem was simplified in order to make progress on the combined issues of scheduling and floorplanning. In this example, there are three time steps. A possible non-pipelined register-transfer level data path and controller are shown in Figures 3.9b and 3.9c, respectively. The default action of data-path registers is to keep the values stored in the previous time step. The contents of data-path registers are modified only when the control signals “Register-WRITE” are activated by the controller during the clock period.

The state transition diagram of a controller implicitly has a feedback arc from the last time step to the first time step that allows the data path to process the next set of inputs. This example shows that every state in the state transition diagram can be matched to a time step in the scheduled control/data flow graph. All the control signals specified in a state are necessary for the data path to complete the scheduled operations correctly.

Figure 3.9: A Synthesis Example without Conditional Branches
a) A Control/Data Flow Graph (without conditional branch)
b) A Possible Non-Pipelined Register-Transfer Level Design
c) Controller Specification:
  State1: Mux1_ENABLE = 0; Mux2_ENABLE = 0; Reg1_WRITE = TRUE; State = State2
  State2: Mux1_ENABLE = 1; Mux2_ENABLE = 1; Mux3_ENABLE = 0; Reg2_WRITE = TRUE; State = State3
  State3: Mux3_ENABLE = 1; Reg2_WRITE = TRUE; State = State1

3.4.2 Controller for Designs with Conditional Branches

Distribute-join pairs in the control/data flow graph are used to represent OrFork and OrJoin operations on the conditional execution paths. The exact execution paths are determined by condition values that may come from external inputs or be created by the data path during execution. In our approach, we assume condition values generated by the data path will be stored in the data-path registers for at least one clock cycle. Since the inputs of the controller we synthesize are assumed to come either from the external inputs or from data-path registers, keeping data in the data-path register for at least one clock cycle avoids race conditions. This assumption follows from the proposed clocking scheme. Condition values should be ready in the data-path registers for controller use no later than the time step in which the conditional execution path begins. A condition value is then maintained by the controller until no more operation executions depend on it.

Figure 3.10: A Synthesis Example with a Conditional Branch
a) A Control/Data Flow Graph (with a conditional branch and a distribute-join pair)
b) A Possible Non-Pipelined Register-Transfer Level Design (including a comparator whose result is stored in Reg1 and passed to the controller)

Figure 3.10 shows a non-pipelined design example with a conditional branch. There are four time steps and one distribute-join pair in this scheduled control/data flow graph. The conditional execution paths are found in time steps 2 and 3.
The condition value “p > q” determines whether the output is “(a+b)*e” or “(((a+b)+c)*d)*e.” In this example, the condition value “p > q” should be created no later than time step 1, since the conditional operation (the addition of a + b to c) is scheduled to be executed in time step 2 and the controller must decide which execution path to follow at the beginning of time step 2. The condition value generated in time step 1 should also be stored in one of the data-path registers (Reg1 in this example). In order to “memorize” the condition values, two controller implementations are proposed. The first approach uses status registers to store the condition values. The alternative approach uses different states to distinguish different condition value combinations.

Figure 3.11: Two Possible Controllers for the Example Shown in Figure 3.10.

a) Controller without Using Status Register:

    State1: Mux1_ENABLE = 0, Mux2_ENABLE = 0, Mux3_ENABLE = 0, Reg1_WRITE = TRUE, Reg2_WRITE = TRUE; State = State2
    State2: If (Reg1) { Mux1_ENABLE = 1, Mux2_ENABLE = 1, Mux3_ENABLE = 0, Reg2_WRITE = TRUE; State = State4 } else { State = State3 }
    State3: State = State5
    State4: Mux3_ENABLE = 1, Mux4_ENABLE = 0, Reg2_WRITE = TRUE; State = State5
    State5: Mux4_ENABLE = 1, Reg2_WRITE = TRUE; State = State1

b) Controller Using Status Register:

    State1: Mux1_ENABLE = 0, Mux2_ENABLE = 0, Mux3_ENABLE = 0, Reg1_WRITE = TRUE, Reg2_WRITE = TRUE; State = State2
    State2: If (Reg1) { Mux1_ENABLE = 1, Mux2_ENABLE = 1, Mux3_ENABLE = 0, Reg2_WRITE = TRUE, CSGvar0 = 1 } else { CSGvar0 = 0 }; State = State3
    State3: If (CSGvar0) { Mux3_ENABLE = 1, Mux4_ENABLE = 0, Reg2_WRITE = TRUE }; State = State4
    State4: Mux4_ENABLE = 1, Reg2_WRITE = TRUE; State = State1

Figure 3.11a shows a possible controller implementation for the synthesis example in Figure 3.10. In state2, Reg1 holds the condition value “p > q,” which is used to determine the next state transition (state3 or state4).
With this implementation, the condition value is stored in the state memory. For example, state3 and state4 represent the condition value “p > q” being FALSE and TRUE, respectively. Both state3 and state4 have state5 as their next state. Since no conditional execution path exists after time step 3, the condition value “p > q” dies at the end of time step 3.

A possible controller using status registers is shown in Figure 3.11b. The condition value is initially stored by Reg1 in the data path; it is then stored in the status register CSGvar0, an internal control-path register created by the control path synthesis program CSG. The value of CSGvar0 is set in time step 2 (state2) and is discarded after time step 3 (state3). The state transition diagram is a simple ring, and the number of states is the same as the number of time steps.

The controller without status registers thus avoids status registers at the cost of extra states. In extreme cases, the number of extra states grows exponentially with the number of status registers that would otherwise be required in that time step.

Chapter 4

Preliminary 3D Scheduling Research

High-level synthesis systems start with an abstract behavioral specification, such as a VHDL description of a digital system, and find a register-transfer level structure that realizes the given behavior [MPC88]. Data path synthesis can be subdivided into three subtasks, namely scheduling, module allocation and binding, which are the core of the transformation from behavior to structure. These three subtasks are closely related. Scheduling assigns the operations to time steps. Module allocation allocates a set of modules (i.e. functional units, registers and interconnection units) that is sufficient to realize the schedule. Binding assigns the operations and values to modules.
A time step denotes a fundamental sequencing time unit (a clock cycle) in synchronous systems.

The preliminary 3D scheduling approach uses the concept of freedom [PPM86] as its scheduling priority function. A floorplan is constructed simultaneously with scheduling in order to feed wiring delays back to the scheduler. Wiring area is estimated with a fudge factor, as an amount proportional to the area of the allocated functional modules. Note that the effects of multiplexers and registers are not considered in the preliminary 3D scheduling approach.

The basic idea behind the preliminary 3D scheduling work is as follows: when operations are scheduled and functional modules are allocated, we simultaneously decide the shape and position of those modules on the floorplan. In the floorplanning process, the module with minimal area among the allocated functional modules is chosen as the basic area unit. A basic dimension unit is equal to the square root of the basic area unit. The area of a module with non-minimal area is rounded up to the next integer multiple of the basic area unit. To construct a floorplan, we first determine the number of basic area units required by the floorplan using the operator lower-bound estimation proposed by R. Jain [JPP87]. The bounding box of the floorplan is then approximated as a discrete square whose size is that number of basic area units. A discrete square is a rectangle whose width and height differ by at most one basic dimension unit and whose area is the minimum that satisfies this condition. The shape of a newly allocated module is assumed to be a discrete square whose size is equal to its number of basic area units, if it can be placed within the current bounding box. Otherwise, the newly allocated module takes the shape of a general rectangle chosen so that placing it causes the minimal increase in the area of the current bounding box.
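The rounding and discrete-square rules above can be illustrated with a short sketch. It is only one reading of the definition (smallest-area integer rectangle whose sides differ by at most one unit); the function names and the module areas are hypothetical and are not taken from the prototype program.

```python
import math

def basic_units(area, basic_area):
    """Round a module area up to a whole number of basic area units."""
    return -(-area // basic_area)  # integer ceiling division

def discrete_square(units):
    """Return (width, height), in basic dimension units, of the smallest-area
    rectangle with |width - height| <= 1 that can hold `units` basic area units."""
    if units < 1:
        raise ValueError("units must be positive")
    side = math.isqrt(units)            # floor of the square root
    if side * side >= units:            # perfect square
        return side, side
    if side * (side + 1) >= units:      # fits in a side x (side + 1) rectangle
        return side + 1, side
    return side + 1, side + 1

# Hypothetical module areas in a common unit; the smallest module
# defines the basic area unit, as described in the text.
areas = {"adder": 4200, "multiplier": 49000}
basic = min(areas.values())
shapes = {name: discrete_square(basic_units(a, basic)) for name, a in areas.items()}
print(shapes)
```

With these illustrative areas the multiplier occupies 12 basic area units and is shaped as a 4-by-3 rectangle, while the adder is a single 1-by-1 unit.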
The approach taken is to schedule the operations along the critical paths (the longest paths through the data flow graph) first and to assign operators according to their data dependencies. The critical paths are determined using the module delays obtained from the module library; the wiring delays are assumed to be zero in the derivation of the critical paths. A partial floorplan for the currently allocated functional modules is constructed once all the operations along the critical paths are scheduled and assigned. The off-critical-path operations are then scheduled and assigned to the operators (and placed when a new operator is allocated) so as to minimize the overall estimated interconnection length. After all the operations have been scheduled, the routine checks the timing constraints and tries pairwise interchanges to reduce the interconnection delays. If the routine still cannot meet the timing constraints, redundant operators¹ are introduced in order to reduce the interconnection delays.

¹ Redundant operators are operators not required for the minimum feasible design.

4.1 Algorithm Description

The preliminary 3D scheduling approach can be divided into two parts: (1) scheduling and module assignment and (2) a wiring delay improvement procedure. The first part is a constructive procedure; the second part is an iterative procedure.

4.1.1 Scheduling and Module Assignment

The preliminary 3D scheduling approach develops the schedule, module allocation and bindings simultaneously. We start from a null module set and add module units only when the scheduled operations cannot share existing ones. The scheduling algorithm schedules the operations on the critical paths first. Along the critical paths, an operation is scheduled once its predecessors have been scheduled, following the data dependencies, and it is assigned to its earliest possible time step (ASAP).
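The ASAP assignment described above, together with the ALAP schedule and the freedom measure that the following paragraph relies on, can be sketched as below. This is a simplified illustration, not the prototype's code: it assumes every operation takes exactly one time step (no operator chaining within a step, unlike the FIR examples in this thesis), and the graph and the names `asap_alap`, `preds` and `num_steps` are hypothetical.

```python
from collections import deque

def asap_alap(preds, num_steps):
    """preds maps each operation to the list of its predecessors.
    Returns (asap, alap): dictionaries of 1-based time steps,
    assuming unit-latency operations (an assumption of this sketch)."""
    succs = {op: [] for op in preds}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)

    # Topological order via Kahn's algorithm.
    indeg = {op: len(ps) for op, ps in preds.items()}
    ready = deque(op for op, d in indeg.items() if d == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for s in succs[op]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    # ASAP: one step after the latest predecessor.
    asap = {}
    for op in order:
        asap[op] = 1 + max((asap[p] for p in preds[op]), default=0)
    # ALAP: one step before the earliest successor, or the last step.
    alap = {}
    for op in reversed(order):
        alap[op] = min((alap[s] for s in succs[op]), default=num_steps + 1) - 1
    return asap, alap

# Hypothetical data flow graph: two multiplications feed a1; a1 and a3 feed a2.
preds = {"m1": [], "m2": [], "a1": ["m1", "m2"], "a2": ["a1", "a3"], "a3": []}
asap, alap = asap_alap(preds, num_steps=3)
freedom = {op: alap[op] - asap[op] for op in preds}
# Least freedom first, as in the priority rule described in the text.
schedule_order = sorted(preds, key=lambda op: freedom[op])
print(freedom, schedule_order)
```

Operations on the critical path (m1, m2, a1, a2) have zero freedom, so they are handled first; a3 has one step of freedom and is handled last.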
The scheduling approach is similar to that used by Nagle [NCP82] and in MAHA [PPM86]. The remaining unscheduled operations are scheduled in ascending order of freedom. Freedom is defined as the difference between the latest possible time step (ALAP) and the earliest possible time step (ASAP) of an operation for a given clock cycle. The operation with the least freedom is scheduled first, at its earliest possible time step (ASAP). When two or more operations have the same freedom, the operation with the larger area is scheduled first.

Once an operation has been scheduled, we assign it to a specific module unit. Module assignment uses a best-first approach: among the modules that perform the required function and are available at that time step, we assign the operation to the one with the minimum estimated wiring cost. The wire length between two connected modules is estimated by the longest possible wire length (i.e. the worst-case wire length). In our code, the fanout is assumed to be one for all the modules, but it would be trivial to assume a larger (constant) fanout. The wiring delay is then estimated by a first-order RC model taken from Gupta [Gup91], which was originally published by Sakurai [Sak83]. These assignments are iteratively improved later by the improvement procedure. The scheduling algorithm is outlined in Figure 4.1.

     1. comment: Process the operations along the critical path first
     2. for all operations along the critical path whose predecessors have been processed do
     3. begin
     4.     comment: Search all possible time slots on allocated module list
     5.     for all allowable time steps do
     6.     begin
     7.         use the best-first approach to find an available module which can perform the operation with minimum wiring cost
     8.         if found then continue to next operation
                comment: Otherwise, a new module needs to be allocated
                else allocate a new module of this operator type
     9.         free all the schedulings of this operator type and reschedule them
    10.     end
    11. end
    12. comment: Now, assign the positions for new allocated modules
    13. generate a floorplan of allocated modules using the constructive method
    14. comment: Process the rest of unscheduled operations
    15. repeat the procedure from step 4 to step 9 for the off-critical path unscheduled operations, except use freedom as the priority function for choosing the operation to schedule next; floorplan unplaced modules as they are allocated

Figure 4.1: Scheduling and Module Assignment Procedure

4.1.2 Improvement Procedure

The improvement procedure contains two phases: operation rebinding and redundant operator allocation. The purpose of both phases is to minimize the total wiring delay. Operation rebinding sequentially examines each potential module exchange that might reduce the wiring delays. If an exchange reduces the wiring delay, we switch the bindings of the two operations involved; otherwise, the bindings are left unchanged.

After all possible reassignments have been searched, redundant operators are allocated to reduce the wiring delays as long as the generosity [KP90] constraint is still satisfied. The generosity indicates the maximum tolerance for introducing redundant operators. In our experiments, one iteration usually produced a satisfactory result. The outline of the improvement procedure is shown in Figure 4.2.

4.2 Experimental Results

A simple prototype program was written to verify the preliminary 3D scheduling approach. This first prototype handles only non-pipelined designs without conditional branches, and no delay nodes are inserted during the scheduling process. All of these limitations have been relaxed in our current program, described in Chapter 5. The inputs to the prototype program are a data flow graph, a module library and a generosity factor.
The constraint can be either the maximum allowable area cost or the maximum allowable execution delay. Note that the execution delay includes the delays caused by both the functional modules and the interconnection wiring.

We applied our prototype to an FIR filter example. Figure 4.3 shows snapshots of the scheduling process for a 2-time-step non-pipelined FIR filter design. There are two critical paths in this example, which are shown by solid lines in the figures. A partial floorplan, shown in Figure 4.3c, is constructed once all the operations along the critical paths are scheduled. The final schedule of the 2-time-step non-pipelined FIR filter design is shown in Figure 4.4. A floorplan of the 2-time-step non-pipelined FIR filter design that minimizes the number of operators is shown in Figure 4.5. The floorplan of an alternative design with a redundant adder is shown in Figure 4.6. These two designs took 0.2 second of CPU time on a SUN 4/460 machine.

     1. comment: Process the operations along the critical path first
     2. for all operations along the critical path whose predecessors have been processed do
     3. begin
     4.     comment: Search all possible exchanges
     5.     for all modules with the same operator type do
     6.     begin
     7.         if this module is assigned to an operation on critical path then skip it and continue
     8.         if the total wiring cost is reduced after exchange then switch the module assignments of these two operations
     9.     end
    10.     comment: Allocate a redundant operator to improve the wiring cost
    11.     if allocated redundant operators greater than generosity allowed then goto step 2 to process next operation
    12.     allocate a redundant operator and place it as close as possible to its predecessor to reduce the total wiring delays
    13.     if the total wiring cost is reduced then put the allocated redundant operator into allocated module list else free the allocated redundant operator
    14. end
    15. comment: Processing the non-critical path operations
    16. for all operations ordered by freedom do
    17. begin
    18.     repeat the procedure from step 4 to 13, except skip the possible exchanges with operations along the critical path
    19. end

Figure 4.2: Iterative Improvement Procedure

                  16-bit adder                16-bit multiplier
    Technology    Delay (ns)   Area (mil²)    Delay (ns)   Area (mil²)
    3 µm          34           4200           375          49000
    2 µm          22           1867           250          21778
    1.6 µm        18           1195           200          13938
    1.2 µm        13           672            150          7840

Table 4.1: First Library Set Used for FIR Filter

Table 4.1 lists the first library set we used for this example. For comparison purposes, we first linearly scaled down each functional unit's area and delay from the 3-micron process parameters. We used different device parameters for the different fabrication processes to calculate the wiring delay in this program. We later show an actual scaled library (see Table 4.3) used to test our concepts. Table 4.2 lists the predicted delay times of both designs for the linearly scaled fabrication technologies. The errors caused by considering functional delays only are clearly visible in this table.

The library set shown in Table 4.1 was originally obtained from a 3-micron RCA design library. The linearly scaled adder and multiplier of the 1.2-micron technology are still very large compared to the adder and multiplier derived from a 1.2-micron silicon compiler (see Table 6.1). For example, the 1.2-micron multiplier in Table 4.1 is three times larger than the multiplier shown in Table 6.1. The large module sizes and long operator chains (up to five operators are chained in a time step) result in estimated wiring delays of 44 ns in each time step for the 1.2-micron design listed in Table 4.2.

By inspecting Table 4.2, we found that the differences between the functional delays and the overall system delay are quite linear.
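The figures in Table 4.2 can be checked with a few lines of arithmetic. The sketch below uses only numbers from that table; the observation that each reported total equals the number of time steps multiplied by the longer step is an inference from the data (it is consistent with the clock period being set by the slowest time step), not a rule stated in the text, and the helper name `total_delay` is hypothetical.

```python
def total_delay(step_delays):
    """Total delay if every time step is stretched to the longest one
    (the clocking assumed by this sketch)."""
    return len(step_delays) * max(step_delays)

# technology: (functional-only total, (step 1, step 2) with estimated wiring,
#              reported total), all in ns, taken from Table 4.2
table_4_2 = {
    "3 um":   (1022, (586, 589), 1178),
    "2 um":   (676,  (419, 423), 846),
    "1.6 um": (544,  (326, 328), 656),
    "1.2 um": (404,  (244, 246), 492),
}

overhead = {}
for tech, (functional, steps, reported) in table_4_2.items():
    # The reported totals match 2 x the longer time step in every row.
    assert total_delay(steps) == reported
    overhead[tech] = reported - functional  # delay added by estimated wiring, ns
    share = overhead[tech] / functional
    print(f"{tech}: +{overhead[tech]} ns ({share:.1%} over the functional delay)")
```

The same relation holds for the redundant-adder rows, for example 2 x 586 ns = 1172 ns for the 3-micron design.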
However, this is not true of real process parameters. We believe this phenomenon is due to the linear scaling of the area and delay of the functional units. For this reason, we used Cascade Design Automation's ChipCrafter to create an 8-bit adder and an 8-bit multiplier in a 1.2-micron fabrication technology. The library data is shown in Table 4.3. We resynthesized the 2-time-step FIR example, and the resynthesized results are shown in Figures 4.7, 4.8 and 4.9 and Table 4.4. The errors caused by considering only functional delays remain about the same (about 20% to 30%). The effect of introducing redundant operators is also obvious, especially with submicron processes, since the area overhead decreases in significance. It should be noted that even though wiring delays are reduced for the same design as the feature size decreases, smaller feature sizes allow larger designs to fit on a single chip. For these large designs, wires will be longer and wiring delays will have a significant effect on performance.

Figure 4.3: The Scheduling Process of a 2-Time-Step Non-pipelined FIR Filter. a) 2 Adders and 2 Multipliers allocated. b) 3 Adders and 2 Multipliers allocated. c) 5 Adders and 2 Multipliers allocated. d) 7 Adders and 2 Multipliers allocated.

Figure 4.4: The Schedule of the 2-Time-Step Non-pipelined FIR Filter Design

Figure 4.5: The Floorplan of the 2-Time-Step FIR Filter with Minimum Operators

Figure 4.6: The Floorplan of the 2-Time-Step FIR Filter with a Redundant Adder

    3 µm Technology      Time Step 1   Time Step 2   Total Delay   Est. Area
    Functional Units     511 ns        511 ns        1022 ns       229600 mil²
    with Est. Wiring     586 ns        589 ns        1178 ns       275520 mil²
    with a Red. Adder    586 ns        583 ns        1172 ns       280560 mil²

    2 µm Technology      Time Step 1   Time Step 2   Total Delay   Est. Area
    Functional Units     338 ns        338 ns        676 ns        102048 mil²
    with Est. Wiring     419 ns        423 ns        846 ns        122458 mil²
    with a Red. Adder    419 ns        416 ns        838 ns        124698 mil²

    1.6 µm Technology    Time Step 1   Time Step 2   Total Delay   Est. Area
    Functional Units     272 ns        272 ns        544 ns        65312 mil²
    with Est. Wiring     326 ns        328 ns        656 ns        78374 mil²
    with a Red. Adder    326 ns        324 ns        652 ns        79808 mil²

    1.2 µm Technology    Time Step 1   Time Step 2   Total Delay   Est. Area
    Functional Units     202 ns        202 ns        404 ns        36736 mil²
    with Est. Wiring     244 ns        246 ns        492 ns        44083 mil²
    with a Red. Adder    244 ns        243 ns        488 ns        44890 mil²

Table 4.2: Minimum-Operator Designs versus Redundant-Operator Designs

                  8-bit adder                 8-bit multiplier
    Technology    Delay (ns)   Area (mil²)    Delay (ns)   Area (mil²)
    1.2 µm        7            75             32           903

Table 4.3: Library Set Created by ChipCrafter

Figure 4.7: A Non-pipelined FIR Filter Using ChipCrafter Library Set

    1.2 µm Technology    Time Step 1   Time Step 2   Total Delay   Est. Area
    Functional Units     46 ns         46 ns         92 ns         6018 mil²
    with Est. Wiring     59 ns         62 ns         124 ns        7222 mil²
    with a Red. Adder    59 ns         59 ns         118 ns        7312 mil²

Table 4.4: A FIR Filter 2-Time-Step Design Using ChipCrafter Library Set

Figure 4.8: Floorplan with Minimum Operators Using ChipCrafter Library Set (Total Delay = 124 ns; Dimension: 119 mil by 119 mil)

Figure 4.9: Floorplan with a Redundant Adder Using ChipCrafter Library Set (Total Delay = 118 ns; Dimension: 119 mil by 119 mil; 5% improvement)

Chapter 5

Data Path Synthesis

This chapter presents a comprehensive integrated data path synthesis technique called 3D scheduling, which performs scheduling simultaneously with floorplanning. The 3D scheduler performs scheduling, module allocation, binding and floorplanning at the same time to meet the user-specified constraints. The inputs to the 3D scheduler are a VHDL description, the design style (e.g. non-pipelined or pipelined), a tentative clock cycle, a module library, the process-dependent parameters and the design constraints. The constraints include the performance constraint (i.e. the total execution delay), the area constraint (i.e. the chip area), and the floorplan aspect-ratio constraint. The outputs of the 3D scheduler are a schedule, a set of allocated functional modules, a set of operator bindings, a set of value bindings, and a set of floorplans with different aspect ratios.

The 3D scheduling technique uses a modified force function similar to that of Paulin [PK89] to perform scheduling as well as binding for the operations in the data flow graph. Because the force-function heuristic uses one-step lookahead, it is expected to give more globally optimal results than other techniques.
In the 3D scheduling approach, the effects of interconnections are considered during the scheduling process using a concurrently constructed floorplan. Input multiplexers are estimated, allocated and placed in the floorplan when a functional unit is allocated. Wiring delays, which guide the scheduling process toward better schedules, are estimated using the first-order RC model presented in Section 5.2.2.2. Our current approach uses worst-case wire lengths to estimate the wiring delays. The worst-case wire length between two connected modules is equal to the half perimeter of the box enclosing the two modules. Since the fanout of a module is unknown until the register-transfer level network is completed, the fanout is currently assumed to be one for all the modules in our approach.

The 3D scheduler uses the prediction/estimation and floorplanning techniques to make effective scheduling decisions. In the rest of this chapter, the force-directed scheduling proposed by Paulin [PK89] is discussed first. The prediction technique used in 3D scheduling is then given. Finally, the 3D scheduling technique is introduced, followed by a presentation of the constructive cluster-cut-tree floorplanning technique.

5.1 Force-Directed Scheduling

Force-directed scheduling, proposed by Paulin [PK89], places similar operations in different time steps so as to balance the concurrency of the operations assigned to the functional units without increasing the total execution time. Balancing the concurrency of operations ensures that each functional unit has a high utilization, which in turn decreases the total number of units required. This balancing is done in three steps: determination of a possible time frame for each operation, creation of a distribution graph, and calculation of the force associated with each possible assignment.
We repeatedly select the assignment associated with the minimum force and execute the force calculation process until all the operations are assigned.

Determine possible time frame

The time frame of each operation is determined by evaluating the ASAP (as soon as possible) and ALAP (as late as possible) schedules. By combining the results of both schedules, we can derive the time frame for each operation. The data flow graph of a partial FIR filter design shown in Figure 5.1 illustrates this process. Nodes represent the functional operations, and edges represent the data dependencies between these operations. We assume that the clock cycle delay due to functional units is 53 ns, the delay of the adder is 25.5 ns and the delay of the multiplier is 53 ns. The ASAP and ALAP schedules are shown with horizontal lines in the figure. The resulting time frames are drawn in Figure 5.2. The width of each box represents the probability that an operation will be assigned to any given time step. Without loss of generality, the probabilities of operation assignments are assumed to be uniformly distributed.

Figure 5.1: A 4-time-step FIR Filter ASAP and ALAP Schedule (Note: adder delay = 25.5 ns, multiplier delay = 53 ns and clock cycle = 53 ns)

Figure 5.2: Time Frames for the 4-time-step FIR Filter Example

Creation of the distribution graph

The next step is to sum up the probabilities of each type of operation for each time step in a data flow graph.
Figure 5.3: Distribution Graph for 4-time-step FIR Filter Example (panels: Adder, Multiplier)

Definition 5.1 The distribution graph $\mathcal{DG}_k(i)$ of a data flow graph for type $k$ operations in time step $i$ is defined as

$$\mathcal{DG}_k(i) = \sum_{o \in Opn_k} Prob(o, i) \tag{5.1}$$

where $Opn_k$ is the set of operations of type $k$ in the data flow graph, and $Prob(o, i)$ is the probability that the operation $o$ is assigned to time step $i$.

The distribution graphs indicate the concurrency of similar operation type(s) in a time step. Using Figure 5.2, the distribution graphs $\mathcal{DG}_{+}$ and $\mathcal{DG}_{\times}$ for the addition and multiplication operations are calculated. The results are depicted in Figure 5.3.

Force calculation

The last step is to calculate the forces associated with every feasible time step assignment of each operation. The force-directed scheduling process temporarily reduces the operation's time frame to the selected time step. The force associated with the reduction of the initial time frame (bounded by time steps $t$ and $b$) to a new time frame (bounded by time steps $nt$ and $nb$) is calculated by the following equation:

$$Force(nt, nb) = \sum_{i=nt}^{nb} \frac{\mathcal{DG}_k(i)}{nb - nt + 1} - \sum_{i=t}^{b} \frac{\mathcal{DG}_k(i)}{b - t + 1} \tag{5.2}$$

Each sum represents the average of the distribution values for the time steps bounded by that time frame. The force is therefore equal to the difference between the average distribution for the time steps bounded by the new time frame and the average for the time steps of the initial one. The force induced on an operation $o$ itself by assigning it to a specific time step $i$ is called the direct force. We also calculate the force for all predecessors and successors of the current operation whenever their time frames are affected. These additional forces are called indirect forces. The total force is the sum of the direct and indirect forces.
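As a worked illustration of Equations 5.1 and 5.2, the sketch below builds a distribution graph from uniformly distributed time frames and evaluates the direct force of each candidate assignment for one operation. The three operations and their time frames are hypothetical, indirect forces on predecessors and successors are deliberately left out, and exact fractions are used so that the averages are not blurred by rounding.

```python
from fractions import Fraction

def distribution_graph(frames, num_steps):
    """frames: {op: (t, b)}, the inclusive 1-based time frame of each
    operation of a single type. Returns dg with dg[i] = DG(i); dg[0] is unused."""
    dg = [Fraction(0)] * (num_steps + 1)
    for t, b in frames.values():
        p = Fraction(1, b - t + 1)      # uniform probability over the frame
        for i in range(t, b + 1):
            dg[i] += p                  # Equation 5.1
    return dg

def force(dg, t, b, nt, nb):
    """Equation 5.2: average DG over the new frame [nt, nb]
    minus the average DG over the initial frame [t, b]."""
    new_avg = sum(dg[nt:nb + 1], Fraction(0)) / (nb - nt + 1)
    old_avg = sum(dg[t:b + 1], Fraction(0)) / (b - t + 1)
    return new_avg - old_avg

# Hypothetical additions: a1 is fixed in step 1, a2 may go in step 1 or 2,
# a3 may go in step 2 or 3.
frames = {"a1": (1, 1), "a2": (1, 2), "a3": (2, 3)}
dg = distribution_graph(frames, num_steps=3)

t, b = frames["a2"]
direct = {step: force(dg, t, b, step, step) for step in range(t, b + 1)}
best_step = min(direct, key=direct.get)
print(dg[1:], direct, best_step)
```

Here DG is 3/2, 1 and 1/2 for steps 1 to 3, so placing a2 in step 1 has force +1/4 (it adds to an already crowded step) while step 2 has force -1/4 and is preferred.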
The force is calculated for each possible assignment of every operation in the data flow graph. In each force calculation pass, the operation assignment with the minimal force is selected. The possible time frames of the unscheduled operations are recalculated after each operation assignment. The scheduling process proceeds by repeatedly assigning a time step to the unscheduled operation with minimal force and terminates when all the operations in the data flow graph are scheduled.

5.2 Prediction Technique

The prediction model includes the lower-bound estimation for the number of functional modules, and area and delay predictions for the wiring and controller. The predictions are based solely on the behavior of the target design, its area/performance constraints, the tentative clock cycle time and a module library. Predictions are performed by scanning through the input behavior without dealing with the actual synthesis details. Predictions are fast and help guide the data path synthesis program to prune infeasible designs at an early synthesis stage. Because the scheduling and binding processes are closely related, predictions are also helpful in determining operation-to-operator bindings more globally. The prediction methods introduced in this section combine earlier theoretical and heuristic prediction research done at USC with new additions to operator lower-bound estimation.

5.2.1 Operator Estimation

The input to the operator estimation consists of a user-supplied tentative clock cycle time,¹ a directed acyclic data flow graph² (which can be obtained from an input VHDL behavioral description), a module library, and a tentative task delay and initiation interval (for pipelined designs only). A single-cycle architecture model is assumed in the operator estimation.
The single-cycle architecture model is defined as a scheduling model in which all operations are assumed to be combinational and take at most one time step to execute. Therefore the clock cycle time has to be longer than any operation delay to ensure the correct execution of the behavioral specification. The pipeline model (functional pipelining) assumed in this section is the same as the one in [PP88] [Jai89]. A non-pipelined design is modeled as a special case of a pipelined design whose initiation interval equals the pipeline length. In the following operator estimation, we discuss the theorems/heuristics for pipelined designs only; non-pipelined designs are handled in a similar fashion. We first define the utilization of an operator type. Then the lower-bound prediction for pipelined designs used by R. Jain [JPP87] is introduced. Next, a new, tighter operator lower-bound estimation algorithm for non-pipelined designs is presented.

5.2.1.1 An Operator Lower-Bound Estimation for Pipelined Designs

Definition 5.2 The utilization $U_k$ of operator type $k$ in a pipelined design is defined as

$$U_k = \frac{n_k}{O_k \times I} \tag{5.3}$$

where $n_k$ is the number of type $k$ operations in the behavior, $O_k$ is the number of type $k$ operators in the register-transfer structure and $I$ is the initiation interval of the pipelined design.

¹ The tentative clock cycle should be greater than the delays of the operations in the input data flow graph. For a multiple-chip design, the tentative clock cycle is usually the system clock cycle.

² For a design with inner loops, the data flow graph is still acyclic. The loop information is characterized in the timing graph but not considered in the operator estimation process.
In [PP88], Park predicts the lower bound on the operator allocation for a pipelined design as shown in Equation 5.4:

$$O_k \ge \left\lceil \frac{n_k}{I} \right\rceil \tag{5.4}$$

The ceiling function in Equation 5.4 reflects the discrete nature of operators: any fractional operator requirement has to be rounded up to the smallest integer that is not less than the required amount. If the distribution graph is used to explain the operator lower-bound estimation, then Equation 5.4 is rewritten as

$$O_{k,(st,et)} \ge \left\lceil \frac{\sum_{i=st}^{et} \mathcal{DG}_k(i)}{et - st + 1} \right\rceil \tag{5.5}$$

where $O_{k,(st,et)}$ is the number of operators of type $k$ needed to implement the design, $st$ and $et$ are the starting and ending time steps of the design, respectively, and $1 \le st \le et \le I$. For an $n$-time-step pipelined design with an initiation interval $I$, $O_k$ is equivalent to $O_{k,(1,I)} \ge \left\lceil \sum_{i=1}^{I} \mathcal{DG}_k(i) / I \right\rceil$. Equation 5.5 shows that the minimal number of type $k$ operators needed to implement the operations of type $k$ from time step $st$ to time step $et$ can be estimated by summing the type $k$ distribution graph from time step $st$ to time step $et$.

5.2.1.2 An Operator Lower-Bound Estimation for Non-Pipelined Designs

Equation 5.4 shows the theoretical lower bound on the operator allocation for a pipelined design without looking into the structure of the data flow graph. The operator lower bound estimated by Equation 5.4 is exact (i.e. necessary and sufficient) if there are no maximum time constraints between operations. However, a tighter operator allocation lower bound can be derived for a non-pipelined design by inspecting the distribution graphs of operation types in a data flow graph. We define a maximally consecutive distribution window $\mathcal{DG}_{k,mc}$ first. The procedure for estimating a tighter operator lower bound for non-pipelined designs is then presented.

Definition 5.3 A distribution window $\mathcal{DG}_{k,w}(st, et)$ is a subgraph of distribution graph $\mathcal{DG}_k$ of operator type $k$. It is characterized by a starting time step $st$ and an ending time step $et$ for an $N$-time-step pipelined design, where $1 \le st \le et \le N$.

Definition 5.4 A consecutive distribution window $\mathcal{DG}_{k,c}(st, et)$ is a distribution window $\mathcal{DG}_{k,w}(st, et)$ of operator type $k$ in which no $\mathcal{DG}_k(i)$ is zero for $st \le i \le et$.

Definition 5.5 A consecutive distribution window $\mathcal{DG}_{k,c}(mst, met)$ is called a maximally consecutive distribution window $\mathcal{DG}_{k,mc}(mst, met)$ if $\mathcal{DG}_k(mst - 1) = 0$ and $\mathcal{DG}_k(met + 1) = 0$.

The following procedure, Procedure 1, uses the characteristics of maximally consecutive distribution windows to iteratively calculate a tighter operator lower bound for a given non-pipelined data flow graph. $X_{op}$ of a distribution window $\mathcal{DG}_{k,w}(st, et)$ denotes the set of operations of type $k$ which can be scheduled into time step $i$ when $\mathcal{DG}_k(i)$ is smaller than the currently estimated operator lower bound, for $st \le i \le et$. Note that $X_{op}$ is an empty set at the beginning of the operator lower-bound estimation.

Procedure 1, LowerBound, is a recursive procedure. The basic idea of this procedure is to estimate the operator lower bound of the current distribution window iteratively using Equation 5.4. The procedure first computes the operator lower bound of the current distribution window, and $X_{op}$ is then recalculated. The distribution graph of the current distribution window is recomputed due to the change of $X_{op}$. The operations in $X_{op}$ are excluded from the calculation of the distribution graph because these operations could be scheduled into a time step whose expected operator demand (i.e. the value of the distribution graph) is lower than the currently estimated operator lower bound. The maximally consecutive distribution windows are then identified, and the operator lower bounds of the maximally consecutive distribution windows are derived similarly by recursively calling LowerBound.
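The two building blocks used by the recursion, the window bound of Equation 5.5 and the identification of maximally consecutive distribution windows (Definition 5.5), can be sketched as follows. This is not the full Procedure 1: it omits the bookkeeping of $X_{op}$ and the regeneration of the distribution graph. The dictionary representation of the distribution graph, the convention that the graph is treated as zero outside the window being scanned, the rounding guard and the sample values are all assumptions of this sketch.

```python
from math import ceil


def window_lower_bound(dg: dict, st: int, et: int) -> int:
    """Equation 5.5: ceil( sum_{i=st..et} DG_k(i) / (et - st + 1) )."""
    total = sum(dg[i] for i in range(st, et + 1))
    # Round first so that floating-point noise cannot push the ceiling up by one.
    return ceil(round(total / (et - st + 1), 9))


def maximally_consecutive_windows(dg: dict, st: int, et: int) -> list:
    """Maximal runs of non-zero DG_k(i) inside [st, et] (Definition 5.5).

    Assumption: DG_k is treated as zero just outside [st, et].
    """
    windows, start = [], None
    for i in range(st, et + 1):
        if dg[i] > 0:
            if start is None:
                start = i               # a new run of non-zero values begins
        elif start is not None:
            windows.append((start, i - 1))
            start = None                # the run ended at time step i - 1
    if start is not None:
        windows.append((start, et))     # the run reaches the end of the window
    return windows


if __name__ == "__main__":
    # Hypothetical 4-time-step distribution graph for one operator type.
    dg = {1: 0.0, 2: 3.0, 3: 1.0, 4: 0.0}
    print(window_lower_bound(dg, 1, 4))             # bound over the whole window
    print(maximally_consecutive_windows(dg, 1, 4))  # non-zero runs
    print(window_lower_bound(dg, 2, 3))             # bound over the tighter window
```

For these sample values the bound over time steps 1 to 4 is 1, while the bound over the single maximally consecutive window (2, 3) is 2, which is why narrowing the window can tighten the estimate.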
The procedure stops and outputs the estimated operator lower bound when the recursion termination condition (i.e. $st$ is equal to $et$) is satisfied. GenerateDG($\mathcal{G}$, $st$, $et$, $X_{op}$) calculates the distribution graph of the distribution window $\mathcal{DG}_{k,w}(st, et)$ from time step $st$ to time step $et$ for all the type $k$ operations in the data flow graph $\mathcal{G}$ except the ones already put into $X_{op}$. To estimate the tighter operator lower bound of type $k$ for a non-pipelined design, LowerBound($\mathcal{G}$, 1, $N$, $\emptyset$) is calculated.

Procedure 1
/* Subroutine LowerBound($\mathcal{G}$, $st$, $et$, $X_{op}$) calculates the operator lower bound of type $k$ for the digital circuit behavior specified by the data flow graph $\mathcal{G}$. */
if $st$ is equal to $et$
    /* No more iterations are necessary. */
    return $\mathcal{DG}_k(st)$
/* Calculate the lower bound for the distribution window $\mathcal{DG}_{k,w}(st, et)$. */
$\mathcal{OP}_{lb} = \left\lceil \sum_{i=st}^{et} \mathcal{DG}_k(i) / (et - st + 1) \right\rceil$
$X_{op} = X_{op} \cup \{ op \mid op$ can be scheduled to time step $i$ and $\mathcal{DG}_k(i) < \mathcal{OP}_{lb}$ for $st \le i \le et \}$
/* Recalculate the distribution window $\mathcal{DG}_{k,w}(st, et)$ due to the change of $X_{op}$. */
GenerateDG($\mathcal{G}$, $st$, $et$, $X_{op}$)
for every maximally consecutive distribution window $\mathcal{DG}_{k,mc}(mst, met)$ between time step $st$ and $et$ do
    $\mathcal{OP}_{lb} = \max(\mathcal{OP}_{lb},$ LowerBound($\mathcal{G}$, $mst$, $met$, $X_{op}$)$)$
return $\mathcal{OP}_{lb}$ □

The following example shows the use of Procedure 1.

Example: For the 4-time-step non-pipelined FIR filter design shown in Figure 5.1, the multiplier lower bound is first estimated as $\left\lceil \sum_{i=1}^{4} \mathcal{DG}_\times(i) / 4 \right\rceil = 1$ by applying Procedure 1. Since $\mathcal{DG}_\times(1)$ and $\mathcal{DG}_\times(4)$ are zero, which is less than the estimated lower-bound value 1, we can remove $\mathcal{DG}_\times(1)$ and $\mathcal{DG}_\times(4)$ from our lower-bound estimation. The distribution window now runs from time step 2 to time step 3, and the new lower bound becomes $\left\lceil \sum_{i=2}^{3} \mathcal{DG}_\times(i) / 2 \right\rceil = 2$. Since $\mathcal{DG}_\times(2)$ is 3 (which is greater than the currently estimated lower bound), we do nothing for this time step. $\mathcal{DG}_\times(3)$ is 1 (which is less than the currently estimated lower bound) and it is contributed to by mul3 and mul4, so $X_{op}$ is equal to $\emptyset \cup \{mul3, mul4\}$, which is $\{mul3, mul4\}$. GenerateDG($\mathcal{G}$, 2, 2, $X_{op}$) is then called to recalculate the distribution graph due to the change of $X_{op}$. $\mathcal{DG}_\times(2)$ is thus equal to 2. Since no further iterations are possible, the multiplier lower bound estimated by Procedure 1 is equal to 2.

As compared to the multiplier lower bound estimated by Equation 5.4 (i.e. $\lceil 4/4 \rceil = 1$), Procedure 1 predicts a tighter multiplier lower bound for this non-pipelined example. Notice that the multiplier lower bound estimated by Equation 5.4 is insufficient to realize a design, but the multiplier lower bound predicted by Procedure 1 can implement a real-life design. □

Procedure 1 provides a systematic way to estimate operator lower bounds using distribution graphs. In Hagerman's work at Carnegie Mellon University [Hag91], by comparison, the estimated operator lower bound depends strongly on the order in which distribution graphs are manipulated (i.e. a different order of managing distribution graphs may yield a different operator lower-bound estimate).

Theorem 5.1 Procedure 1, LowerBound($\mathcal{G}$, 1, $N$, $\emptyset$), computes the operator lower bound of type $k$ for a non-pipelined design, and the operator lower bound estimated is greater than or equal to the one calculated by Equation 5.4.

Proof: From the estimation of the loose operator bound, Equation 5.4 can be rewritten as Equation 5.5, $O_{k,(st,et)} \ge \left\lceil \sum_{i=st}^{et} \mathcal{DG}_k(i) / (et - st + 1) \right\rceil$. Since the operations in $X_{op}$ can be scheduled into a time step $i$ for which $\mathcal{DG}_k(i)$ is smaller than the currently estimated operator lower bound $\mathcal{OP}_{lb}$, it is possible to schedule these operations into time step $i$ in order to increase the operator utilization.
In Procedure 1, a maximally consecutive distribution window is derived by excluding the operations in $X_{op}$, so that the new operator lower bound is derived from the new maximally consecutive distribution graph. The operator lower bound of the new maximally consecutive distribution graph is also a lower bound for the data flow graph $\mathcal{G}$, because the new maximally consecutive distribution graph is contributed to by a subset of the operations in the data flow graph $\mathcal{G}$.

Procedure 1 initially calculates the operator lower bound $\mathcal{OP}_{lb}$ from $st = 1$ and $et = N$ using Equation 5.5. The procedure then recursively searches the new maximally consecutive distribution windows within the current distribution windows due to the change of $X_{op}$. The operator lower bounds of the new maximally consecutive distribution windows are calculated. Since the new operator lower bound $\mathcal{OP}_{lb}$ is the maximum of the current $\mathcal{OP}_{lb}$ and the newly calculated operator lower bounds, the operator lower bound estimated by LowerBound($\mathcal{G}$, 1, $N$, $\emptyset$) is greater than or equal to the one calculated by Equation 5.4. □

5.2.2 Wiring Estimation

Considering only the functional characteristics of an RTL design is usually not enough to make correct design decisions in high-level synthesis. With the incorporation of the floorplanning process into the scheduling process, the effects of wiring delays can be estimated more precisely than in traditional approaches. In our wiring estimation, a channel routing technique is assumed in the layout process. We first present a wiring area approximation technique. A first-order RC model used to estimate wiring delays is then introduced.

5.2.2.1 Wiring Area Estimation

In the estimation of wiring area, a channel routing technique is assumed in the layout system; the input/output, clock and power terminals of the module are assumed to be uniformly distributed around the perimeter.
A technology-dependent routing factor called Track Utilization $U_t$ is used to reflect the possibility of routing wires that share a routing channel in a specific layout system. We assume the wiring tracks are equally distributed along the boundary. The effect of extra area induced by the wiring is modeled by $\Delta x$ on each side, which is estimated as

$$\Delta x = \frac{ibw_i + obw_i + cbw_i + pbw_i}{4} \times U_t \times W_t \tag{5.6}$$

where $ibw_i$ and $obw_i$ are the input and output bit widths of module $i$; $cbw_i$ and $pbw_i$ are the bit widths required to connect the clock and power terminals for module $i$. $W_t$ is the average width of a routing track in the layout system under the current technology. The effective area of a module including the wiring area, $A_e$, can be approximated by

$$A_e \approx (W + 2 \cdot \Delta x) \times (H + 2 \cdot \Delta x) \tag{5.7}$$

where $W$ and $H$ are the width and height of the module before considering the wiring area. Note that $(W + 2 \cdot \Delta x)$ and $(H + 2 \cdot \Delta x)$ are the effective width and height of this module, respectively.

Figure 5.4: The Estimation of Wiring Area (module i without wiring area; module i with estimated wiring area)

Figure 5.4 shows examples of a module with different aspect ratios before and after considering wiring area. Notice that a module with different aspect ratios may result in a different wiring area in the routing process, as is the case in real-world designs. The effective area of the allocated modules will be used in the floorplanning process to take the impact of wiring area into account. Wiring area increases the wire lengths, which results in longer wiring delays in a design.

5.2.2.2 Wiring Delay Estimation

Wiring delay estimation is based on a first-order RC model, taken from Gupta [Gup91] and originally published by Sakurai [Sak83]. In this model, the wiring delay is modeled as a function of the wire length and the fanout of the driver. Wire delay arises from signal propagation through the RC network and also from charging the load capacitance. The wire length is used to estimate the sheet resistance and capacitance along the signal propagation path in the RC delay model; the fanout of the driver is used to estimate the load capacitance and the charging time associated with that capacitance. Wiring delay can therefore be represented by the following first-order relationship [Sak83]:

$$t_{0.9} = 1.02\,RC + 2.21\,(C_t R_t + C_t R + C R_t) \tag{5.8}$$

where:
$R$ = Total resistance of the wiring.
$C$ = Total capacitance of the wiring.
$C_t$ = Load capacitance.
$R_t$ = Equivalent resistance of the driving transistor.

$R$, $C$, $R_t$ and $C_t$ are calculated as described in [Gup91]. These parameters are strongly dependent on the design and fabrication process. In the case of metal wiring, $R$ is very low compared to $R_t$ and can be neglected in the estimation procedure. Therefore, Equation 5.8 reduces to ([Gup91])

$$t_{0.9} = 2.21\,R_t (C_t + C) \tag{5.9}$$

Using Equation 5.8 (or Equation 5.9), we can calculate the worst-case wiring delay from the worst-case wire length. The worst-case wire length is measured by the diagonal rectilinear distance of the bounding box containing the two connected modules on the floorplan. Although this is only a first-order RC model, it produces a very good approximation of wiring delays, as Gupta verified by SPICE simulations [Gup91].

5.2.3 Controller Estimation

For the controller estimation, we will focus on controllers implemented by programmable logic arrays (PLAs). The estimation of controller characteristics is performed in two phases.

Figure 5.5: A Finite State Machine Implementation Using PLA

The first phase estimates PLA characteristics, namely the number of PLA inputs, the number of PLA outputs and the number of PLA product terms.
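Before the controller model is developed further, the wiring estimates of Section 5.2.2 can be made concrete with a short sketch. It assumes that Equation 5.6 multiplies the per-side terminal count by the track utilization and the track width, and it takes the "diagonal rectilinear distance" of the bounding box to be its width plus its height. The rectangle representation, the function names and all numeric values are hypothetical.

```python
from math import isclose


def wiring_margin(ibw, obw, cbw, pbw, track_utilization, track_width):
    """Extra width added on each side of a module (Equation 5.6, as read here)."""
    return (ibw + obw + cbw + pbw) / 4 * track_utilization * track_width


def effective_area(width, height, margin):
    """Module area including wiring area on all four sides (Equation 5.7)."""
    return (width + 2 * margin) * (height + 2 * margin)


def worst_case_wire_length(a, b):
    """Width plus height of the bounding box of rectangles a, b = (x, y, w, h)."""
    x_min, y_min = min(a[0], b[0]), min(a[1], b[1])
    x_max = max(a[0] + a[2], b[0] + b[2])
    y_max = max(a[1] + a[3], b[1] + b[3])
    return (x_max - x_min) + (y_max - y_min)


def wire_delay(r_wire, c_wire, r_driver, c_load):
    """First-order RC delay t_0.9 of Equation 5.8."""
    return 1.02 * r_wire * c_wire + 2.21 * (
        c_load * r_driver + c_load * r_wire + c_wire * r_driver
    )


def wire_delay_metal(c_wire, r_driver, c_load):
    """Equation 5.9: Equation 5.8 with the wire resistance neglected."""
    return 2.21 * r_driver * (c_load + c_wire)


if __name__ == "__main__":
    margin = wiring_margin(16, 16, 1, 2, track_utilization=1.0, track_width=2.0)
    print(margin, effective_area(100.0, 50.0, margin))
    print(worst_case_wire_length((0, 0, 10, 10), (30, 20, 10, 5)))
    # With zero wire resistance the two delay formulas agree.
    print(isclose(wire_delay(0.0, 2e-13, 1e3, 5e-14),
                  wire_delay_metal(2e-13, 1e3, 5e-14)))
```

Turning a wire length into $R$ and $C$ requires per-unit-length process parameters, which the text defers to [Gup91]; the sketch therefore takes $R$ and $C$ as inputs.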
The second phase estimates the actual PLA area and delay from abstract PLA characteristics.

5.2.3.1 PLA Abstract Characteristic Estimation

Estimations of PLA abstract characteristics are taken from Mlinar [Mli91]. As depicted in Figure 5.5, a PLA consists of two regions. On the left side, one or more inputs (and/or their complements) are ANDed together to form each product term. On the right side, each output line is driven by one or more product terms which are ORed together. Several assumptions are made in the development of this PLA characteristic estimation model:

• The controller area/delay estimations are specific to finite state machines.
• The PLA state feedback registers and output registers are considered to be part of the controller.
• Wiring area/delay within a controller is estimated in the same way as the wiring area/delay in the data path.

Inputs to a PLA finite state machine include external inputs and the state/status signals fed back from the outputs of the PLA, as shown in Figure 3.2. These external inputs reflect the status of external hardware or the impact of loop control/conditional branch execution. The number of states $\zeta$ in a finite state machine determines the number of state bits, $\lceil \log_2 \zeta \rceil$. The number of PLA inputs $i_c$ thus can be estimated as ([Mli91])

$$i_c = \lceil \log_2 \zeta \rceil + i_{cond} + i_{loop} \tag{5.10}$$

where $i_{cond}$ is the number of condition value inputs, which is affected by the type and the number of conditional branches in the data path; $i_{loop}$ is the number of loop status inputs, which is determined by data path and/or loop counter status. Similarly, the PLA outputs can be estimated as ([Mli91])

$$o_c = \lceil \log_2 \zeta \rceil + R_c + M + A + o_{loop} + o_{cond} \tag{5.11}$$

where $R_c$ and $M$ are the numbers of control signals associated with registers and multiplexers, respectively. $A$ is the number of control signals for operating ALUs and other multi-function hardware.
$o_{loop}$ and $o_{cond}$ are the control signals related to loop counter controls and conditional branch selections.

The prediction of the number of product terms $p_c$ in a PLA is the major difficulty in controller estimation. It is clear that the number of product terms in a PLA is a function of the control states as well as the number of output control signals. Loops and conditions also have an impact on the number of product terms of a PLA. More details about Mlinar's PLA estimation techniques can be found in [Mli91].

5.2.3.2 Controller Area Estimation

The area of a controller consists of three parts: the PLA, the state/status/output registers and the wiring area. The PLA area is based on predefined macro cells as proposed by Mlinar. Given the number of PLA inputs $i_c$, outputs $o_c$ and product terms $p_c$ from earlier equations, the area of a PLA can be defined as [Mli91]

$$PLA_{area} = c_1 \cdot (2 i_c + o_c) \cdot p_c + c_2 \cdot p_c + c_3 \cdot (2 i_c + o_c) + c_4 \tag{5.12}$$

where $c_1$, $c_2$, $c_3$ and $c_4$ are constants depending on the technology. Each term in Equation 5.12 represents a portion of the PLA area, which is clarified by Figure 5.6, taken from [Mli91].

Figure 5.6: Internal Construction of PLA (Taken from [Mli91])
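Equations 5.10 to 5.12 translate directly into a few lines of arithmetic. The sketch below is only an illustration: the parameter names mirror the symbols in the text, the technology constants $c_1$ to $c_4$ and the signal counts in the example are made-up placeholders, and the number of product terms is taken as an input because its prediction is deferred to [Mli91].

```python
def state_bits(n_states: int) -> int:
    """ceil(log2(n_states)) computed with integers; a single state needs 0 bits."""
    if n_states < 1:
        raise ValueError("a finite state machine needs at least one state")
    return (n_states - 1).bit_length()


def pla_inputs(n_states: int, i_cond: int, i_loop: int) -> int:
    """Equation 5.10: state bits plus condition and loop status inputs."""
    return state_bits(n_states) + i_cond + i_loop


def pla_outputs(n_states, r_c, m, a, o_loop, o_cond) -> int:
    """Equation 5.11: state bits plus all control signal outputs."""
    return state_bits(n_states) + r_c + m + a + o_loop + o_cond


def pla_area(i_c, o_c, p_c, c1, c2, c3, c4):
    """Equation 5.12; c1..c4 are technology-dependent constants."""
    columns = 2 * i_c + o_c  # each input appears in true and complemented form
    return c1 * columns * p_c + c2 * p_c + c3 * columns + c4


if __name__ == "__main__":
    # Hypothetical controller: 10 states, 2 condition inputs, 1 loop input.
    i_c = pla_inputs(10, i_cond=2, i_loop=1)
    o_c = pla_outputs(10, r_c=5, m=6, a=3, o_loop=1, o_cond=2)
    print(i_c, o_c)
    print(pla_area(i_c, o_c, p_c=20, c1=1, c2=2, c3=3, c4=4))
```

Using `bit_length` instead of a floating-point logarithm avoids rounding surprises when the number of states is an exact power of two.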
The number of registers required in the controller is equal to the number of PLA outputs. Wiring area is estimated in a similar manner to the approach in Section 5.2.2.1. The geometry of the estimated PLA is presumed to be a rectangular block with aspect ratio $(2 i_c + o_c) : p_c$, and wiring area is estimated as previously. The controller area thus includes the area of the PLA and registers,^3 with estimated wiring area in each module.

5.2.3.3 Controller Delay Estimation

PLA delays are transition and personality matrix dependent. Our aim is to estimate the PLA delay according to the number of inputs, the number of outputs and the number of product terms. Since the exact personality matrix is not available during the estimation phase, the PLA delay estimation is characterized by three cases, namely the best, the average and the worst cases.

^3 The geometries of the registers can be obtained from the module library directly.

The PLA delay estimation technique is taken from Gupta [Gup91], in which the PLA charging delay $t_{total\_ch}$ and discharging delay $t_{total\_dis}$ are estimated as

$$t_{total\_ch} = t_{inv1\_ch} + t_{inv2\_dis} + t_{pt\_ch} + t_{op\_dis} + t_{inv3\_ch} \tag{5.13}$$

$$t_{total\_dis} = t_{inv1\_dis} + t_{inv2\_ch} + t_{pt\_dis} + t_{op\_ch} + t_{inv3\_dis} \tag{5.14}$$

where $t_{inv1\_ch}/t_{inv1\_dis}$ and $t_{inv2\_ch}/t_{inv2\_dis}$ are input inverter charging/discharging delays, $t_{pt\_ch}/t_{pt\_dis}$ is the product term stage charging/discharging delay, $t_{op\_ch}/t_{op\_dis}$ is the output stage charging/discharging delay and $t_{inv3\_ch}/t_{inv3\_dis}$ is the final inverter charging/discharging delay.

An RC model, similar to the one used for wiring delays, is used to characterize the PLA delay. Each term in the delay equations is estimated in two steps. The capacitances and resistances are first estimated based on the proposed empirical models.
Three types of PLA delay estimations were used [Gup91], namely

• Best-Best Case: Only one product term is connected to the output and all the inputs are driving the product term (minimum load capacitance and maximum driving capability).
• Average-Average Case: Only half of the product terms are connected to the output and only half of the inputs are driving the product terms (average load capacitance and average driving capability).
• Worst-Worst Case: All product terms are connected to the output and only one input is driving the product term (maximum load capacitance and minimum driving capability).

Once the capacitances and resistances are known, each term in the PLA delay equations can then be estimated by Equation 5.8 or Equation 5.9. The estimated results are also verified by SPICE simulation results. More details can be found in [Gup91].

5.3 Floorplanning Technique

A floorplan contains information about cell geometries as well as cell locations. Floorplanning is beneficial in providing an accurate estimate of the area and delay of circuits. The accurate estimate of area and delay characteristics of a circuit makes an early feedback loop in a high-level synthesis system possible; it also helps the high-level synthesis process prune infeasible designs at an earlier synthesis stage.

A typical floorplanning approach first determines the floorplan topology using the connectivity information among modules/cells. Various optimizations are then performed in order to minimize the specified cost function (e.g. area and/or performance). The general floorplanning problem has been proven to be NP-complete, so heuristic methods are used. In our approach, we use the cluster tree structure, which was also adopted by Dai [DK87], to represent the module topology. Figure 5.7 shows the flow chart of the floorplanning process. For the sake of efficiency, we limit the cluster size to four.
However, the algorithms presented here can be extended to a cluster size of five or larger. Once the cluster tree is determined, polynomial-time optimal algorithms for a slicing cluster tree [Sto83] or a non-slicing cluster tree [WW90] can be applied to obtain floorplans with different aspect ratios. Figures 5.8 and 5.9 show two additional floorplans with different aspect ratio goals for the 10-time-step non-pipelined FIR filter design shown in Figure 3.6.

We define our terminology in the following. A k-room floorplan pattern is a floorplan structure with exactly k rooms, where a room is an enclosed region. An orientation of a pattern is a clockwise rotation of the pattern with respect to its boundary. A labelling of a pattern corresponds to an assignment of nodes of the cluster tree to individual rooms. A topological possibility refers to a particular choice of floorplan pattern, pattern labelling and pattern orientation [Ped91]. A 3-room floorplan example which illustrates the definitions is shown in Figure 5.10.

A primitive corresponds to a functionally non-divisible module, such as an adder, a multiplexer or a register. A cluster node is a non-primitive node in the cluster tree. The shape function for a node in the cluster tree gives the lower bound on the height of the node as a function of its width. Note that a shape function is a discrete function; each discrete value in a shape function is called a shape point. The shape function of a primitive, which is a collection of variable shape points, is given in the input module library. The shape function of a cluster node is computed by merging shape functions of its children [Ott83] [Zim88] [Ped91].

Figure 5.7: The Flow Chart of the Floorplanning Process (box labels: Initialize the Cluster Tree as an Empty Cluster Tree, with the number of children equal to the estimated operator lower bound; Current Floorplan + New Modules; Start Floorplan Construction; Functional Unit?; Form a Function Cluster; Empty Cluster Node?; Insert the New Modules to an Existing Function Cluster with Min. Tree Traversal Distance; Assign the Function Cluster to the Empty Cluster Node with the Min. Tree Traversal Distance; Expand the Cluster Tree and Assign the Function Cluster to the Empty Cluster Node Newly Created; Recompute Shape Functions Bottom Up and Determine Module Positions Top Down; Partial Floorplan)

Figure 5.8: FIR Filter Floorplan with Overall Aspect Ratio Goal 2:1 (Size: 3631.59 um x 1819.00 um, Area: 6605861.93 um2)

Figure 5.9: FIR Filter Floorplan with Overall Aspect Ratio Goal 3:1 (Size: 4314.09 um x 1414.75 um, Area: 6103358.61 um2)

Figure 5.10: 3-Room Floorplan Patterns, Orientations and Some Possible Labelings

5.3.1 Incremental Cluster Tree Generation

Hierarchy is a typical way to deal with the complexity of a large problem. In the context of floorplanning, the hierarchy usually is represented by a cluster tree where highly connected primitives are grouped together. This hierarchical scheme simplifies the floorplanning problem, since floorplanning algorithms can recursively operate on one hierarchical child (which may be a primitive or a cluster node) at a time.
The tree itself may have a restricted structure (e.g. binary slicing with fixed cut directions and cluster-to-room labelings) or a flexible structure (e.g. a multi-way unoriented cluster tree). A more flexible structure allows a higher degree of floorplan optimization. A multi-way unoriented cluster tree is used in our floorplanning procedure. The cluster tree approach allows us to consider more floorplan topologies than those that can be represented by horizontal or vertical cuts alone. The maximum branching factor in the cluster tree is restricted to a small value (four in our approach). If the branching factor is too large, the complexity of floorplanning a cluster node is the same as that of the general floorplanning problem. Non-slicing floorplans are also avoided in our approach for efficiency reasons. However, the procedures presented here can be applied to non-slicing floorplans as well as slicing floorplans.

In order to incorporate the floorplanning process into the scheduling process, a cluster tree is incrementally constructed as follows. A new cluster node is added to the cluster tree whenever a new functional unit is allocated during the scheduling process. The main shortcoming of this incremental approach is its greedy, sequential nature, which cannot account for the global connectivity information. However, this drawback is somewhat alleviated by utilizing prediction tools. The prediction results are helpful since they allow us to overview the design before scheduling is performed (or completed). They are also beneficial in making better trade-off decisions with a more global view at early stages of the synthesis process.

Figure 5.11: The Construction of a Cluster Tree
For example, in contrast to using a greedy approach to operation binding, the operator estimation allows us to select operation bindings among all possible operators. Also, estimating the wiring area of a module allows us to consider the wiring effects during scheduling and floorplanning.

The proposed clustering strategies are based on the characteristics of register-transfer level design architecture and functionality. A function cluster, consisting of a functional unit, possible multiplexers (which are directly connected to the input/output terminals of the functional unit) and a possible register (which stores the values generated by the functional unit), is used in our scheme. The cluster tree shown in Figure 5.11, for example, has four function clusters: one function cluster for the controller, two function clusters for the adders and one function cluster for the multiplier. Function clusters are shown by solid dots in the figure. The function cluster itself may have a two-level hierarchy, such as the function cluster add.01. Function clusters represent a "natural" partitioning of a register-transfer level design, which is intuitive because modules inside a function cluster are tightly connected and thus should be placed together in order to reduce the wiring area and delay.

In the data path synthesis process, each allocated functional unit corresponds to a function cluster in the cluster tree. The process of constructing a cluster tree contains the following steps. The cluster tree contains empty spots at the beginning of the scheduling process. The number of empty spots $n$ initially allocated to the cluster tree is equal to the estimated operator lower bound. In determining the topology of a cluster tree with $n$ empty spots, a cluster tree of height $l$ is used whose leaf nodes are of level 0 and whose root node is of level $l$ [AHU74].
The cluster tree with $n$ empty spots ($n > 1$) is determined as

$$n_{i+1} = \left\lceil \frac{n_i}{m} \right\rceil, \quad 0 \le i < l \tag{5.15}$$

where $m$ is the cluster size, and $n_i$ and $n_{i+1}$ are the numbers of nodes of level $i$ and $i+1$, respectively. Note that $n_0 = n$ and $n_l = 1$. The number of children of the $j$-th node in level $i$ is determined by a two-case expression [equation not legible in this transcript; its cases are $j = 1$, $i > 0$ and $j > 1$, $i > 0$].

The next step is to determine the assignment of a newly allocated functional unit (i.e. a function cluster) to an empty spot in the cluster tree. The location of a newly constructed function cluster is based on its connectivity to already placed function clusters in the cluster tree. The number of edges on the path connecting two spots is used to represent the interconnection cost between the two function clusters placed in these two spots. The interconnection cost for assigning a newly formed function cluster to a candidate empty spot is thus calculated as the minimal number of edges between the candidate spot and the occupied spots which are connected to the new function cluster. Note that the interconnection cost between an unscheduled predecessor^4 and the currently scheduled operation is not considered and is assumed to be zero in our approach. The spot with the minimum interconnection cost is selected as the location for the function cluster. A new spot is added to the floorplan cluster tree only when all spots in the initial floorplan cluster tree are occupied. Once all the functional units are allocated and placed, the construction of the floorplan cluster tree is complete.

^4 Let node A and node B be two operations in the data flow graph. Node A is a predecessor of node B if any input of node B comes from the output of node A.

For illustration purposes, the floorplan cluster tree construction process for a 10-time-step non-pipelined FIR filter design is shown in Figure 5.11. A cluster tree with four empty spots is created initially using the results of operator lower-bound estimation.
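The level sizes given by Equation 5.15, including the four-spot tree of this example, and the spot selection by tree traversal distance can be sketched as follows. The parent-map representation of the cluster tree, the node names and the tie-breaking rule (first candidate wins) are assumptions of this sketch, not details taken from the text.

```python
def level_sizes(n_spots: int, cluster_size: int) -> list:
    """Node counts n_0, n_1, ..., n_l from Equation 5.15: n_{i+1} = ceil(n_i / m)."""
    if n_spots < 1 or cluster_size < 2:
        raise ValueError("need at least one spot and a cluster size of two or more")
    sizes = [n_spots]
    while sizes[-1] > 1:
        sizes.append(-(-sizes[-1] // cluster_size))  # integer ceiling division
    return sizes


def tree_distance(parent: dict, a, b) -> int:
    """Number of edges on the path between nodes a and b (root maps to None)."""
    depth_from_a, node, d = {}, a, 0
    while node is not None:              # record every ancestor of a
        depth_from_a[node] = d
        node, d = parent[node], d + 1
    node, d = b, 0
    while node not in depth_from_a:      # climb from b to the first shared ancestor
        node, d = parent[node], d + 1
    return d + depth_from_a[node]


def choose_spot(parent: dict, empty_spots: list, connected_spots: list):
    """Pick the empty spot with minimal interconnection cost.

    The cost of a candidate is its minimal tree distance to an occupied spot
    connected to the new function cluster, or 0 if there is none yet.
    """
    def cost(spot):
        return min((tree_distance(parent, spot, c) for c in connected_spots), default=0)
    return min(empty_spots, key=cost)


if __name__ == "__main__":
    print(level_sizes(4, 4))    # a root with four leaf spots
    print(level_sizes(10, 4))
    # Hypothetical two-level tree: root -> a, b; a -> s1..s3; b -> s4..s6.
    parent = {"root": None, "a": "root", "b": "root",
              "s1": "a", "s2": "a", "s3": "a", "s4": "b", "s5": "b", "s6": "b"}
    print(choose_spot(parent, ["s4", "s2"], connected_spots=["s1"]))
```

In the hypothetical tree, spot s2 is two edges away from the occupied spot s1 while s4 is four edges away, so s2 is selected.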
The next allocated function module is add.01; the corresponding function cluster is formed and placed at the cluster node next to the controller. mul.01 and add.02 are then allocated and placed. The function cluster add.01 is modified after it is placed: the multiplexer mux2-1.07 is inserted between add.01 and reg.01 in the register-transfer level design because reg.01 is shared. Note that the function cluster add.01 now has two levels of hierarchy because of the limit on cluster size.

5.3.2 Shape Function Computation

Once the cluster tree is constructed, the floorplanner needs to determine the exact topological relationships of the cluster nodes as well as the geometries of the primitives. To obtain this information, the floorplanner first performs a bottom-up traversal of the cluster tree to compute the shape functions for the cluster nodes. A top-down traversal then follows in order to determine the topological relationships of cluster nodes and the geometries of primitives for specific shape points.

In [Sto83] and [Ott83], Stockmeyer and Otten showed how to combine two shape functions due to a horizontal cut or a vertical cut. Briefly, the procedure proposed by Stockmeyer is similar to merging two sorted lists. The procedure for merging two shape functions, $S'(h', w') = \{(h'_1, w'_1), (h'_2, w'_2), \ldots, (h'_p, w'_p)\}$ and $S^*(h^*, w^*) = \{(h^*_1, w^*_1), (h^*_2, w^*_2), \ldots, (h^*_q, w^*_q)\}$, into a new shape function $S(h, w)$ for a vertical cut is summarized in Figure 5.12.

Procedure 2
/* Merge two shape functions S'(h', w') and S*(h*, w*) due to a V cut. */
Initialize i ← j ← 1.
while i ≤ p and j ≤ q do
begin
    Add J((h'_i, w'_i), (h*_j, w*_j)) to shape function S(h, w).
    if h'_i > h*_j then i ← i + 1; continue.
    if h'_i < h*_j then j ← j + 1; continue.
    if h'_i = h*_j then i ← i + 1; j ← j + 1; continue.
end □

[Figure 5.12: Shape Function Addition for Horizontal and Vertical Cuts. Panels: adding shape functions and adding routing area estimations for blocks B1 and B2.]

$\mathcal{J}((h'_i, w'_i), (h^*_j, w^*_j))$ is a join function which puts two shape points, $(h'_i, w'_i)$ and $(h^*_j, w^*_j)$, together:

$$\mathcal{J}((h'_i, w'_i), (h^*_j, w^*_j)) = (\max(h'_i, h^*_j),\; w'_i + w^*_j) \tag{5.16}$$

A key fact is that we do not have to consider all $p \cdot q$ new pairs, since many of them are clearly suboptimal. A theorem proved by Stockmeyer [Sto83] and Otten [Ott83] shows that only $p + q$ pairs need to be considered.

Zimmerman extended the shape function computation procedure to a binary cut with unspecified cut orientation [Zim88]. Briefly, to obtain the merged shape function for an unspecified orientation cut, the shape functions corresponding to both horizontal and vertical cuts are calculated. The resulting shape function is derived by merging the shape points on the lower bound of the two shape functions. The procedure is illustrated in Figure 5.13.

The shape function computation for a binary cut with unspecified cut orientation was further extended by Dai to calculate the shape function for an unoriented multi-way cluster tree [DK87]. For a cluster node with $k$ children, the shape function can be determined as follows. Assuming that the cluster tree corresponds to a slicing structure [footnote 5], we can always further decompose the cluster tree into a binary composition tree without specifying cut orientations.
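The merge for a vertical cut and the lower-bound merge for an unspecified cut orientation can be sketched in Python as follows. The sketch assumes that each shape function is a list of (height, width) points sorted by strictly decreasing height and increasing width; the function names and the sample shapes are illustrative and are not part of the original procedure.

```python
def merge_vertical(s1, s2):
    """Merge two shape functions for a vertical cut (blocks side by side).

    Each joined point is (max of heights, sum of widths), as in Eq. 5.16.
    At most len(s1) + len(s2) - 1 points are produced.
    """
    i = j = 0
    merged = []
    while i < len(s1) and j < len(s2):
        (h1, w1), (h2, w2) = s1[i], s2[j]
        merged.append((max(h1, h2), w1 + w2))
        if h1 > h2:
            i += 1          # only the taller block can lower the joint height
        elif h1 < h2:
            j += 1
        else:
            i += 1
            j += 1
    return merged


def merge_horizontal(s1, s2):
    """Horizontal cut (blocks stacked): the vertical-cut merge with the roles
    of height and width exchanged."""
    def flip(s):
        return [(w, h) for h, w in reversed(s)]
    return flip(merge_vertical(flip(s1), flip(s2)))


def lower_bound(points):
    """Keep only the non-dominated (height, width) points of a point set."""
    best_h = float("inf")
    kept = []
    for h, w in sorted(points, key=lambda p: (p[1], p[0])):
        if h < best_h:      # strictly lower than anything at most this wide
            kept.append((h, w))
            best_h = h
    return kept


a = [(4, 1), (1, 4)]        # a 4x1 block that may be rotated
b = [(3, 2), (2, 3)]        # a 3x2 block that may be rotated
v = merge_vertical(a, b)
h = merge_horizontal(a, b)
print(v, h, lower_bound(v + h))
```

Taking `lower_bound` over the union of the two merged lists corresponds to the unspecified-orientation merge: each surviving point is the better of the two cut directions for its width.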
For a binary decomposition tree, there exists a unique canonical representation, such as the Polish expression [WL86]. Note that the leaves of the binary decomposition tree correspond to the children of the cluster node and the internal nodes of the binary decomposition tree correspond to the unspecified orientation cut nodes in a binary tree structure. The combined shape function for the root of this cluster tree, which is first decomposed into a binary tree with unspecified cut orientation, can therefore be calculated in the same manner as the procedure described for a binary tree with unspecified cut orientation.

[Footnote 5: For a topological relationship that corresponds to a non-slicing structure, the shape function can also be calculated similarly. For example, the calculation of the shape function of a non-slicing structure with 5 children was published in [WW90].]

[Figure 5.13: Merging of Shape Functions. The composite shape function of a cluster node with blocks B1 and B2 is obtained by lower-bound merging.]

In the cluster tree generation phase, a multi-way cluster tree is created without determining the topological relationships of cluster nodes. The shape function for an unoriented multi-way cluster node is first computed and the topological relationships of cluster nodes are then determined. The wiring area is estimated for each child in the cluster node before the shape function is calculated [Ped91]. (This is in contrast to Zimmerman's approach [Zim88], which calculates the shape function for each child without estimating the wiring area first and then shifts the whole shape function in order to account for the wiring area.)

[Figure 5.14: Calculation of the Shape Function for a Cluster Node with 3 Children. The shape functions of blocks B1, B2 and B3 are combined and a lower-bound merge is applied to obtain the shape function of the multi-way cluster node.]
Next, the lower bound of merging all the calculated shape functions is taken to obtain the shape function for the unoriented multi-way cluster node. Figure 5.14 depicts the calculation of the shape function for an unoriented multi-way cluster node. Note that the number of shape points of a cluster node is linearly reduced to a limited number for efficiency and to reduce complexity. For example, based on our experiments, twenty-five to thirty shape points per cluster node are usually sufficient to produce a fairly smooth shape function.

This procedure is recursively applied up the cluster tree until the composite shape function for the root node is calculated. This bottom-up calculated shape function is used to generate a feasible floorplan by propagating the associated geometries of the floorplan solution, which corresponds to a shape point in the shape function, to the leaf nodes. As long as the cost function is non-decreasing in all of its arguments, the above procedure minimizes the floorplan with respect to the cost function [Sto83].

5.4 3D Scheduling Technique

In this section, the 3D scheduling algorithm is presented. We first discuss some assumptions made by the 3D scheduling algorithm. Loops are considered in the 3D scheduling algorithm; however, only nested loops are handled, not parallel loops. The 3D scheduling algorithm is then presented. After presenting the complexity analysis of 3D scheduling, we finally justify the use of force-directed scheduling and incremental cluster tree techniques.

5.4.1 Timing Model Assumptions

As stated before, a two-phase non-overlapped clocking scheme is assumed: a control-path clock followed by a data-path clock.
More precisely, the delay through a control path consists of the wiring delay of the values generated by the data path and then fed back to the PLA circuit (if any), plus the internal delay of the PLA circuit and the wiring delays of control signals to the data path. The data path execution always starts from register outputs, multiplexers or external inputs and ends at registers or external outputs. Several assumptions on timing are made in the calculation of control path and data path delays:

1. The external input values are ready (stable) to be fetched by the data path at the beginning of the respective data path execution cycle. These values are assumed to be latched by an external circuit. This assumption was made to simplify the problem and could be relaxed without changing the basic approach used here.

2. The external output values are produced and latched by the data path until the data path execution is completed. The delay introduced by the output latch circuit is considered.

3. The data path execution paths start from either external inputs or registers and terminate at either external outputs or registers.

4. The condition values (internal values fed back to the controller) produced by the data path are created at least one clock cycle before they are used by the control path.

5. The delay for transferring a value directly between two data-path registers (with no operations in between) is less than the longest operation delay in the data path.

Note that the external input values are latched by additional logic until the values are no longer needed by the data path. The additional logic may be created by a binding/allocation program, such as MABAL [KP90]. The values created by the data path are latched by the data-path registers. Since synchronous circuits are the only kind being considered here, values that are produced in the data path, but not "consumed" in the time step generated (i.e.
the value will be used by some operations in later time steps), are stored in registers to avoid possible race conditions. This register requirement is consistent with the third assumption about the data path execution paths. The fourth assumption, made about condition values, is due to the timing dependencies between the control path and the data path.

The assumption concerning the delay of transferring a value between two data-path registers is important. In a circuit design, the designer may want to share hardware modules, which may result in some values being created but not consumed in the following execution cycle. In that case, it may become necessary to move a value from one register to another during the data path execution cycle. (This situation frequently happens in a pipelined design.) Since data-path registers are only partially allocated and the floorplan is incrementally produced, the data-path registers allocated by the 3D scheduling process are floorplanned as close to the operators as possible. The data-path registers allocated by an allocation and binding program (like MABAL) are placed at the boundary of the partial floorplan produced by the 3D scheduling process. The assumption that the delay of transferring a value directly between any two data-path registers is less than the longest delay of the data path allows us to give higher priority to binding the values just produced in the current data path execution cycle to nearby registers without violating the clock cycle constraint (i.e. degrading system performance).

5.4.2 Loop Model

For designs with loops, the 3D scheduling algorithm considers only nested loops but not parallel loops [footnote 6]. Figure 5.15 shows the representation of a loop in our DDS translator [Che91]. A block in the data flow graph model represents a set of operations. A loop structure in our approach consists of a loop entrance, a loop condition and a loop body.
The loop feedback values are implicitly specified and represented by the subscripts. The loop condition determines the execution of the loop body. $\alpha$ and $\omega$ points are used to represent loop structures in our approach. An $\alpha$ point is a point with one incoming edge and two outgoing edges. An $\omega$ point, in contrast, has two incoming edges and one outgoing edge. A loop structure in a DDS timing graph always begins at an $\alpha$ point and ends at an $\omega$ point. The loop condition is represented by a pair of OrFork and OrJoin nodes in the timing graph. The left edge of the OrFork point always has one time range to which the loop exit is bound. The right side of the OrFork usually has several consecutive ranges to which the loop-body operations are bound. Note that a pseudo/dummy edge (or range) is added between the $\omega$ point and the OrJoin point to make the timing graph connected.

[Footnote 6: Parallel loops are loops without data dependencies in between which could be executed in parallel.]

[Figure 5.15: The Loop Model. Panels: the data flow graph model (before loop, loop input values, loop entrance, loop condition, loop body, loop output values, after loop) and the timing graph model (OrFork, loop, OrJoin).]

Due to timing considerations in implementing a future controller for a design with loops, break nodes are introduced in the 3D scheduling algorithm. A break node is a pseudo node in the data flow graph that is inserted at the beginning of the loop structure and right after the loop body. The break nodes ensure that the 3D scheduler assigns different time steps to the operations outside the loop structure and the operations inside the loop structure. After the break nodes are inserted, the data flow graph can be scheduled in the same way as a design without loops. Operations outside the loop which are parallel to the loop structure are scheduled concurrently with the operations in the loop structure.
However, these operations are executed in the first instance of loop structure execution; the resources to which the operations outside the loop structure are bound are idle until the execution of the loop structure is completed. Since the number of loop iterations is sometimes non-deterministic, only area constraints can be given to the 3D scheduler for designs with loops.

[Figure 5.16: The Proposed Approach for Loop Designs. Panels: the data flow graph model with break nodes before the loop entrance and after the loop body, and the controller model with before-loop, loop-entrance, loop-condition, loop-body and after-loop states; transitions depend on the loop condition and on the status register.]

Figure 5.16 shows the insertion of break nodes into the data flow graph and the proposed controller model. The states that activate the control signals for the operations outside the loop structure are derived in the same manner as for designs without loops. The states corresponding to the operations in the loop structure are classified into three groups, namely loop-entrance states, a loop-condition state and loop-body states. A status register is used here to indicate the execution status of the loop structure. Three possibilities are found in the loop-condition state:

• If the loop condition is false, the controller exits the loop and jumps to the first state of the after-loop states.

• If the loop condition is true and the loop has not been executed before, the controller initiates the loop input values and executes the operations inside the loop structure.

• If the loop condition is true and the loop has been executed before, the controller initiates the loop feedback values and continues executing the operations inside the loop structure.
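The three cases above amount to a small decision on the loop condition and the status register, which can be sketched in Python as below. The action names and the `run_loop` driver are hypothetical labels introduced for this illustration; they do not come from the controller generated for LADS designs.

```python
def loop_condition_state(condition_true, visited):
    """Decide what the controller does in the loop-condition state."""
    if not condition_true:
        return "exit_to_after_loop"
    if not visited:
        return "init_loop_inputs_and_run_body"
    return "init_feedback_values_and_run_body"


def run_loop(conditions):
    """Trace the controller actions for a sequence of loop-condition values."""
    visited = False            # status register: not-visited before loop entrance
    trace = []
    for condition_true in conditions:
        action = loop_condition_state(condition_true, visited)
        trace.append(action)
        if action == "exit_to_after_loop":
            break
        visited = True         # status register set once the loop body has run
    return trace


print(run_loop([True, True, False]))
```

The first pass loads the loop input values, later passes load the feedback values, and a false condition leaves the loop.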
Note that the status register is set to not-visited status by the controller before entering the first state of the loop-entrance states. The status register is set to visited status in the last state of the loop-body states. The operations outside the loop structure are executed only when the status register is in not-visited status.

5.4.3 3D Scheduling Algorithm

The basic idea of 3D scheduling is the following: edges in a data flow graph are annotated to represent the interconnection delays in the register-transfer network. At the beginning of the scheduling process, edges are initialized with the estimated wiring delays and possible multiplexer delays. As an operation is scheduled, the module assignment is performed according to the cost function. A new module is allocated and floorplanned if no allocated module is available or if allocating a new module incurs less cost. The cost function is a combination of the distribution sum, instance cost, connection compatibility and register cost. The user specifies relative weights among these factors. We explain these factors in more detail later in this section. The scheduler then does the register assignment if the time steps of the newly scheduled operation and its scheduled successors are different. The scheduler also inserts the necessary multiplexers between functional modules and registers. The effects of wiring and of allocated registers and multiplexers are considered in the floorplanning. The wiring delays obtained from the partial floorplan are then fed back to adjust the interconnection delays in the data flow graph, which will affect the ASAP, ALAP and distribution graphs used by the 3D scheduler in the next scheduling iteration.

Figure 5.17 shows the flow chart of the 3D scheduling algorithm. A cluster tree is constructed at the beginning of the scheduling process, using the results of the operator lower-bound estimation. The 3D scheduling algorithm then performs scheduling, module allocation and module binding for the operations in the data flow graph. This is achieved by extending force-directed scheduling to include the individual module instances in the force function [CT90]. The blocks with solid lines in the flow chart show the effects of the wiring delays fed back from the partial floorplan to the 3D scheduler. The blocks covered by the hatch patterns show the floorplanning process. The detailed floorplanning process has been shown in Figure 5.7.

[Figure 5.17: The Flow Chart of 3D Scheduling Algorithm. Boxes include: Estimate Operator Lower Bound; Estimate Controller Area & Delay; Initialize the Floorplan Cluster Tree; Recompute ASAP, ALAP and Distribution Graphs using Predicted Wiring Delays; Assign Schedule and Binding for the Op. with the Minimum Force; Form a function cluster and assign to the cluster which minimizes wiring length; Insert the allocated modules to an existing function cluster which minimizes the wiring length; Allocate Multiplexers and Registers, If Necessary; Recompute Shape Functions; Recompute Wiring Delays; Update the Wiring Delays to the Edges in the Data Flow Graph for Next Scheduling; and a test for remaining unscheduled operations.]

The 3D scheduling algorithm takes the same two-dimensional view of scheduling and binding that Park presented [PP88]. Operations in a data flow graph are assigned in a two-dimensional allocation table, one dimension defined by the time steps and the other by instances of hardware modules. Figure 5.18 shows the allocation table for the 4-time-step FIR filter example (the data flow graph is shown in Figure 5.1). Each small rectangle in the table is called a cell. A cell is a logical unit which represents the possible binding of an operation to the module in a certain time step.
By using the allocation table, the 3D scheduling algorithm is able to consider time steps as well as module instances during scheduling.

[Figure 5.18: The Allocation Table for the 4-Time-Step FIR Filter Example. Columns: adder1, adder2, mult1, mult2; rows: time steps 1 to 4.]

The potential for assigning an operation to each instance of a hardware module is the self force, which is calculated as

$$\mathit{self\_force}(op, \mathit{cs\_range}) = \sum_{j \in \mathit{new\_range}} \{\Delta prob(op, j) \times springK(op, j, instance(j))\} \;-\; \sum_{j \notin \mathit{new\_range}} \{prob(op, j) \times springK(op, j, instance(j))\}$$

where $\mathit{cs\_range}(op)$ is a period of time steps for which the force is calculated (see Section 5.1 for the definitions), $range(op)$ is the set of cells into which the op can be scheduled and bound, $\mathit{new\_range}(op)$ is the set of cells that are within $\mathit{cs\_range}$, and $instance(cell)$ projects the cell to the associated module instance. $prob(op, j)$ is the probability that op will be assigned to cell $j$. Without loss of generality, equal probabilities are assumed for all cells in a range. Thus $prob(op, j)$ is equal to $1/|range(op)|$. $\Delta prob(op, j)$ is the change of probability associated with op and cell $j$ when a range is reduced to include only the new cells in the $\mathit{new\_range}$. For example, in Figures 5.1 and 5.18, the current range of addb covers 2 cells (4, 7), so each has a probability of 1/2. However, when addb is limited to time step 3 (i.e. $\mathit{cs\_range}$), the $\mathit{new\_range}$ covers one cell (7) only and the $\Delta prob$ is $1 - 1/2 = 1/2$.

The springK is made up of four terms:

$$springK(operation, cell, instance) = \alpha \times \mathit{distribution\_sum}(operation, cell) + \beta \times \mathit{instance\_cost}(instance) + \gamma \times \mathit{connection\_compatibility}(operation, instance) + \delta \times \mathit{register\_cost}(operation, instance)$$

where $\alpha$, $\beta$, $\gamma$ and $\delta$ are user-specified constants which allow the users to change the relative importance of the four terms.

The distribution sum term is a measure of how strongly the operation should be assigned to the cell.
The distribution sum is equal to the value of the distribution graph in the time step which the cell occupies. A larger distribution sum means more functional modules are required in this time step.

The instance cost term is used to represent the allocation cost of the tentative schedule. The area of the functional module with the minimal area is chosen as the unit-instance area. There is no instance cost if the number of functional modules allocated is less than the number of functional modules predicted. Otherwise, the area of the newly allocated functional module normalized to the unit-instance area is used as the instance cost.

The connection compatibility term is a measure of how closely the connection requirements of the operation match the existing connections. The estimated wiring area needed to establish the connection paths is used to reflect the potential connection cost. The connection paths considered are the paths from the predecessors of the operation to the instance (i.e. a functional module) and the paths from the instance to the successors of the operation. There is no connection cost if a connection path already exists. To estimate the connection cost, the wiring area is approximated by the estimated wire length multiplied by the track width, which is a parameter depending on the technology. For a scheduled predecessor/successor (i.e. it is bound to a placed functional module) and a placed instance, the wire length is estimated by the half perimeter of the bounding box of these two functional modules. Notice that the instance may not be placed yet if it is a newly allocated functional module. For an unscheduled predecessor/successor or an unplaced instance, an effective wire length is used to estimate the wire length.
The effective wire length $\mathcal{L}_e$ is estimated as

$$\mathcal{L}_e = \sqrt{A_{ef} + (n_{if} \cdot A_{em})} \tag{5.17}$$

$A_{ef}$ is the effective area of the unplaced functional module (see Equation 5.7 for the definition of effective area), $n_{if}$ is the number of inputs of the unplaced functional module and $A_{em}$ is the estimated area of the input multiplexer. The number of multiplexer inputs is estimated from $n_k$ and $o_k$, where $n_k$ is the number of operations in the data flow graph whose type is the same as the unplaced functional module and $o_k$ is the number of operators predicted.

For the case that the predecessor/successor has been scheduled but the instance has not been placed, the wire length for the connection cost prediction $\mathcal{L}_w$ is estimated as

$$\mathcal{L}_w = \mathcal{W}_{es} + \mathcal{H}_{es} + \mathcal{L}_{ei} \tag{5.18}$$

where $\mathcal{W}_{es}$ and $\mathcal{H}_{es}$ are the effective width and height of the placed predecessor/successor, respectively, and $\mathcal{L}_{ei}$ is the effective wire length of the unplaced instance.

For the case that the predecessor/successor has not been scheduled but the instance has been placed, the wire length for the connection cost prediction $\mathcal{L}_w$ is estimated as

$$\mathcal{L}_w = \mathcal{L}_{es} + \mathcal{W}_{ei} + \mathcal{H}_{ei} \tag{5.19}$$

where $\mathcal{L}_{es}$ is the effective wire length of the unscheduled predecessor/successor, and $\mathcal{W}_{ei}$ and $\mathcal{H}_{ei}$ are the effective width and height of the placed instance, respectively.

For the case that neither the predecessor/successor nor the instance has been scheduled and placed, the wire length for the connection cost prediction $\mathcal{L}_w$ is estimated as

$$\mathcal{L}_w = \max(2\mathcal{L}_{es} + \mathcal{L}_{ei},\; \mathcal{L}_{es} + 2\mathcal{L}_{ei}) \tag{5.20}$$

The sum of the estimated connection costs of the connection paths is finally normalized to the area of a 2-to-1 multiplexer to obtain the value of the connection compatibility term. Note that the allocated modules should not be shared if the connection cost is larger than the instance cost.

The register cost term reflects the register requirements of the operation under the tentative schedule.
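The four wire-length cases (half perimeter for two placed modules, and Equations 5.18 to 5.20 otherwise) can be collected into one function, as in the Python sketch below. The `Module` record, its field names and the sample numbers are assumptions introduced for this illustration only; the lower-left-corner coordinates in particular are a modelling choice of the sketch.

```python
import math
from dataclasses import dataclass


@dataclass
class Module:
    placed: bool      # for a predecessor/successor: scheduled and bound to a placed module
    x: float = 0.0    # lower-left corner, meaningful only when placed
    y: float = 0.0
    w: float = 0.0    # effective width, meaningful only when placed
    h: float = 0.0    # effective height, meaningful only when placed
    l_e: float = 0.0  # effective wire length, meaningful only when not placed


def effective_wire_length(a_ef, n_if, a_em):
    """Eq. 5.17: square root of the module area plus its input multiplexer areas."""
    return math.sqrt(a_ef + n_if * a_em)


def predicted_wire_length(neighbor, instance):
    """Wire length between a predecessor/successor and a module instance."""
    if neighbor.placed and instance.placed:
        # Half perimeter of the bounding box enclosing both modules.
        width = (max(neighbor.x + neighbor.w, instance.x + instance.w)
                 - min(neighbor.x, instance.x))
        height = (max(neighbor.y + neighbor.h, instance.y + instance.h)
                  - min(neighbor.y, instance.y))
        return width + height
    if neighbor.placed:                                   # Eq. 5.18
        return neighbor.w + neighbor.h + instance.l_e
    if instance.placed:                                   # Eq. 5.19
        return neighbor.l_e + instance.w + instance.h
    return max(2 * neighbor.l_e + instance.l_e,           # Eq. 5.20
               neighbor.l_e + 2 * instance.l_e)


unplaced = Module(placed=False, l_e=effective_wire_length(9.0, 2, 3.5))
print(predicted_wire_length(Module(True, 0, 0, 2, 3), unplaced))
```

Multiplying the returned length by the technology-dependent track width gives the wiring-area estimate used in the connection compatibility term.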
The register cost of a scheduled operation is dependent on the lifetime of the values produced. In our approach, we consider the register cost of an operation whose successors are scheduled. For an operation whose successors are not scheduled, the register cost is assumed to be zero. There is no register cost if a value is produced and consumed in the same time step (i.e. the value is produced inside an operator chain). Otherwise, the register cost is equal to the number of time steps between the time step in which the value is produced and the time step in which it is last used (i.e. the lifetime of this value).

The total force is the sum of the self force and the indirected force. The indirected force is calculated in the same manner as in Paulin's approach (Equation 5.2). Once the total force has been used to schedule an operation, the 3D scheduling algorithm must then decide on its operator binding (and value binding, if necessary). In the 3D scheduling algorithm, the instance and interconnection costs are included in the terms of springK; the instance with minimum springK in the scheduled time step determines the binding of the scheduled operation.

In the 3D scheduling process, each allocated functional unit corresponds to a function cluster in the cluster tree. A new function cluster is created if a functional unit is newly allocated. The position of a newly constructed function cluster in the cluster tree is determined by its interconnections to other placed function clusters. Multiplexers and registers are allocated and added into the cluster tree when the existing interconnections are not adequate. A partial floorplan is then obtained by computing the shape functions. The wiring delays derived from the floorplan are fed back to the scheduling process to guide the next scheduling iteration. Once all the operations in the control/data flow graph are scheduled, the scheduling process is completed.
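To make the self-force definition given earlier in this section concrete, the Python sketch below evaluates it for the addb example, where the range shrinks from cells (4, 7) to cell 7. It assumes equal probabilities within a range, as the text does; the springK values, the unit weights and the function names are made up for illustration.

```python
def spring_k(distribution_sum, instance_cost, connection_compatibility,
             register_cost, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Weighted sum of the four springK terms (weights are user-specified)."""
    return (alpha * distribution_sum + beta * instance_cost
            + gamma * connection_compatibility + delta * register_cost)


def self_force(range_cells, new_range, k):
    """Self force of restricting an operation from range_cells to new_range.

    k maps each cell to its springK value; new_range must be a non-empty
    subset of range_cells.
    """
    old_prob = 1.0 / len(range_cells)
    new_prob = 1.0 / len(new_range)
    kept = set(new_range)
    force = 0.0
    for cell in range_cells:
        if cell in kept:
            force += (new_prob - old_prob) * k[cell]   # delta-prob term
        else:
            force -= old_prob * k[cell]                # probability removed
    return force


# Hypothetical springK values for the two cells in the range of addb.
k = {4: spring_k(2.0, 0.0, 0.0, 0.0), 7: spring_k(1.0, 1.0, 0.0, 1.0)}
print(self_force([4, 7], [7], k))
```

With these made-up values the probability of cell 7 rises by 1/2 and that of cell 4 falls by 1/2, so the force is half the difference of the two springK values.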
5.4.4 Complexity Analysis of the 3D Scheduling Algorithm

The run-time complexity of the 3D scheduling algorithm depends on

• the complexity of the operator lower-bound estimation,

• the complexity of the force calculation, and

• the complexity of the floorplanning procedure.

The complexity of calculating the distribution graph is $O(n)$, where $n$ is the number of operations being scheduled in the data flow graph. Since the tighter operator lower-bound estimation procedure reduces the size of the maximally consecutive distribution window every iteration, the distribution graph is calculated at most $O(c)$ times to obtain the operator lower bound, where $c$ is the number of time steps into which the data flow graph is being scheduled. Therefore, the complexity of the operator lower-bound estimation is $O(cn)$.

In the calculation of the force, $O(ci)$ forces are calculated for an unscheduled operation, where $i$ is the number of functional modules being allocated in the design. To determine the operation with minimum force, the forces of $O(n)$ unscheduled operations are calculated every scheduling iteration. Since there are $O(n)$ unscheduled operations at the beginning of the scheduling process, the complexity of the force calculation is $O(cn^2 i)$. This is $i$ times the complexity of the force-directed scheduling used by Paulin. The additional $i$ term is caused by the fact that all the possible instances for an operation are examined for every force calculation.

The floorplanning procedure consists of a bottom-up traversal and a top-down traversal. Because the cluster tree is balanced and the number of shape points and children in a cluster node are limited, the complexity of the bottom-up traversal is $O(i)$. Since the process is repeated for every functional module allocated, the
The complexity of the floorplan ning procedure is 0 ( i 2) due to O(i) functional modules allocated in th e schedul ing process. The overall run-tim e complexity of the 3D scheduling algorithm is 0 (c n -f cn2i + i2) which can be simplified to 0 (c n 2i). 5.4.5 Discussion From our experim ents, we found th at the scheduling order of force-directed schedul ing is not fixed. The operations along the critical path are not always scheduled earlier th an the operations on off-critical paths. The quality of th e constructive cluster tree is affected by the order of new modules inserted. And, the order of constructing the floorplan cluster tree depends on the order of the operations being scheduled and bound. Therefore, a scheduling technique which schedules the op erations along the critical path first is expected to produce b etter floorplan cluster trees. The freedom-based scheduling technique is an example which always schedules the operations along the critical path (i.e. the operations w ith sm aller freedom [PPM 8 6 ]) first. However, a good decision on inserting dum m y operations along the critical p ath in order to produce more serialized designs is not easy to obtain. A modified force-directed scheduling technique which partially changes the scheduling order to schedule the operations along the critical path first may be a solution in our application. A scheduling approach which combines the freedom-based approach and force-directed approach is another possible solution to this problem . T he floorplanning approach is greedy. However, most inform ation in register- transfer level is unknown during scheduling. Since a cluster tree represents the [ floorplan in our approach, the num ber of edges traversed is used to m easure the | distance between two function clusters. This is simply heuristic and not well reflecting connection costs in the floorplan. A new heuristic is needed in the feature development. 
Chapter 6

Data Path Synthesis Experiments and Results

This chapter describes several experiments performed using the data path synthesis program, LADS (Layout And Data path Synthesis). We compared the schedules created by LADS to those created by traditional scheduling programs/techniques, such as MAHA and Force-Directed Scheduling. Various filter designs and a robot arm controller were synthesized to experiment with the 3D scheduling technique under different design cases.

6.1 Introduction

The 3D scheduling algorithm for data path synthesis was implemented in a program called LADS using the C language. LADS performs scheduling, operation bindings and partial value bindings for a VHDL input according to user-specified constraints. A library contains the characteristics of available modules (such as delays, areas and geometries), fabrication-dependent parameters (such as transistor sizes, resistances and capacitances) and other adjustment parameters to guide the scheduling process as well as the floorplanning process. Table 6.1 shows the module library we used in our data path experiments.

In the data path experiments, we used LADS to generate schedules and bindings. Some examples were further synthesized by MABAL to produce register-transfer level designs. CSG took the outputs from MABAL to generate controller specifications. The register-transfer level designs and controller specifications were compiled by the Cascade Design Automation ChipCrafter silicon compiler to obtain layouts.

Table 6.1: Design Library Set Used by LADS (Obtained from ChipCrafter)

1.2 µm Technology              Delay (ns)    Area (µm²)
16-bit Carry Select Adder         25.0        279264.00
16-bit Ripple Adder               35.0         81557.50
16-bit Ripple Subtractor          38.0        113529.25
16-bit Comparator                 33.5        134380.50
16-bit Multiplier                 53.0       1804522.00
The schedules produced by MAHA and the Force-Directed Scheduling technique were used as examples to compare the differences among schedules, register-transfer level designs and final chip layouts. We also used the timing analyzer to compare and validate the delays estimated by LADS.

We tested LADS with a number of examples. The filter examples included a FIR filter, an AR filter and an Elliptic filter. A robot arm controller served as a non-filter example. Both pipelined and non-pipelined design schedules were also experimented with for most of the designs.

6.2 A Non-Pipelined FIR Filter Example

For the FIR filter non-pipelined design experiments, we used a 10-time-step design as our example. We used carry select adders for all additions in this experiment. The schedule created by LADS is shown in Figure 6.1. We used MABAL to synthesize the register-transfer level design (shown in Figure 6.2) using the schedule and bindings (shown in Appendix A) produced by LADS. The register-transfer level design was then compiled by ChipCrafter to obtain the layouts (see Figures A.1-A.5 in Appendix A). The solid lines on the layouts are the critical paths obtained from the timing analysis program within ChipCrafter. The figures are scaled in order to reflect the relative dimensions among the layouts.

[Figure 6.1: The 10-Time-Step FIR Filter Schedule Produced by LADS. Additions and multiplications assigned to time steps 1 to 10.]

[Figure 6.2: The RTL 10-Time-Step FIR Filter Design using the LADS Schedule. Modules shown include the controller, REG01 to REG03, ADD01, ADD02, MUL01 and the multiplexers MUX01 to MUX09; the possible critical paths run from the controller through MUX04 and MUL01 to MUX07 and REG01, or to MUX08 and REG02.]

Table 6.2 summarizes the results of the layouts produced by ChipCrafter. LADS Prediction shows the prediction results produced by LADS at the scheduling stage. Comparing the prediction results to the actual layouts, the estimated area and aspect ratio are favorable in this experiment. However, the estimated data path delay is longer than the actual data path delay. This may be caused by using the worst-case wire length and ignoring the effects of buffers in estimating wiring delays. The estimated control path delay is also longer than the actual controller delay. This may be due to the merging of duplicate outputs in the PLA optimization phase, which is not considered in the current PLA estimation program.

The placement program in ChipCrafter uses a simulated annealing algorithm. Figures A.1-A.3 (ChipCrafter Placement I, ChipCrafter Placement II and ChipCrafter Placement III) show the layouts using the LADS schedule but with different cooling schedules in the placement program. In order to compare the design characteristics of layouts placed by LADS with those placed by ChipCrafter, we created a layout using the placement produced by LADS. Figure 6.3 shows the floorplan created by LADS. Since LADS only performs partial value bindings, extra registers and/or multiplexers are expected to appear in the final register-level design created by MABAL. Figure 6.4 shows the floorplan for our final implementation, which is obtained by manually modifying the floorplan created by LADS
The std_group is a set of standard cells that are produced by the finite state machine synthesis program within ChipCrafter. Figure A.4 shows the actual layout of LADS Placement I using the placement of the modified LADS floorplan shown in Figure 6.4.

Figure 6.4: The Final Implementation Floorplan using the LADS Schedule

In this experiment, we found that the data path and control path delays using the LADS placement are reduced from those in the ChipCrafter placements. The critical paths shown in Figures A.1-A.4 reveal that the floorplan produced by LADS is able to reduce the delays along the critical path. Note that two possible critical paths (i.e. MUX04 → MUL01 → MUX07 → REG01 and MUX04 → MUL01 → MUX08 → REG02) are found in the register-transfer level design shown in Figure 6.2. The simulation critical paths are not identical in these layouts. The area of the LADS layout is comparable to ChipCrafter Placement I and ChipCrafter Placement III; however, it is larger than ChipCrafter Placement II.

By inspecting the layout shown in Figure A.4, we found that a large part of the area was a result of wiring. We believe the excessive amount of wiring area results from the misplacement of pins and may be reduced through proper pin assignment [PMSK90] [Con89]. We thus manually performed pin assignment on LADS Placement I to validate our idea. Figure A.5 shows the layout of LADS Placement II, which uses the same placement as LADS Placement I but with manual pin assignment. The result is encouraging; the chip area is reduced dramatically (about 30%), which is consistent with the reduction predicted in pin assignment papers. Note that the performance is slightly improved in this particular case due to the reduction in chip area.
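The size of the pin-assignment effect can be checked directly against the areas listed in Table 6.2. The short sketch below (helper names are hypothetical) computes the change in two ways, since the quoted percentage depends on which layout is taken as the baseline.

```python
# Areas in um^2 copied from Table 6.2.
AREA_LADS_PLACEMENT_I = 14108370.75    # before manual pin assignment
AREA_LADS_PLACEMENT_II = 10915203.50   # after manual pin assignment


def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


def relative_excess(larger, smaller):
    """Fraction by which `larger` exceeds `smaller`."""
    return (larger - smaller) / smaller


reduction = relative_reduction(AREA_LADS_PLACEMENT_I, AREA_LADS_PLACEMENT_II)
excess = relative_excess(AREA_LADS_PLACEMENT_I, AREA_LADS_PLACEMENT_II)
print(f"area saved relative to Placement I:   {reduction:.1%}")
print(f"Placement I larger than Placement II: {excess:.1%}")
```

With the Table 6.2 values, the saving measured against LADS Placement I is somewhat under a quarter, while LADS Placement I is close to 30% larger than LADS Placement II.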
The simulation critical path in the layout with manual pin assignment is also changed.

To compare the schedule results among different scheduling techniques, we applied the same functional unit area constraint to scheduling programs using different techniques. The schedule produced by a Force-Directed Scheduling (FDS) program is shown in Figure 6.5. The schedule is further synthesized by MABAL to obtain the register-level design shown in Figure 6.6. Note that even though the allocated functional units are the same as in the LADS design, the schedule and the allocated interconnection units (e.g. multiplexers and registers) are different.

The results of the layouts using the FDS schedule are shown in Table 6.3. The layouts and simulation critical paths are shown in Figures A.6-A.10. The delays of the layouts using the FDS schedule increase by about 25% compared to the results of the LADS schedule. The performance degradation in this schedule is due to the chaining of two adders in some time steps, which in turn changes the critical path from the multiplier (in the LADS schedule) to the two adders (e.g. MUX12 → ADD01 → MUX21 → ADD02 → MUX71 → REG04 for the layout shown in Figure A.6).

We also manually created a performance-driven floorplan (see Figure 6.7) to reduce the delays along the critical paths. We used this placement to produce a timing-driven layout, Manual Placement I. The experiment shows that the performance is better than two of ChipCrafter's placement cases but slightly worse than one of ChipCrafter's placements. For comparison purposes, we also performed pin assignment manually, as shown in Manual Placement II, to reduce wiring area. There is not much improvement in this example. We believe this is due to the more complicated interconnections between modules; a good pin assignment could not be obtained manually in this example.

Similar experiments were performed using MAHA and MABAL.
The schedule and register-transfer level design are shown in Figures 6.8 and 6.9, respectively. The timing-driven floorplan, which is derived manually, is shown in Figure 6.10. The layouts and simulation critical paths are shown in Figures A.11-A.15. The experimental results are summarized in Table 6.4. We found that in the designs using the MAHA schedule, the delays of the data path and control path are slightly worse than in the layouts using the Force-Directed Scheduling technique (about a 10-15% increase). This is probably due to the more complex interconnections and larger wiring area requirements in this register-transfer level design.

Figure 6.5: The 10-Time-Step FIR Filter Design Produced by FDS

Figure 6.6: The RTL 10-Time-Step FIR Filter Design using the FDS Schedule. Possible critical paths: controller → MUX12 → ADD01 → MUX21 → ADD02 → MUX41 → REG02 and controller → MUX12 → ADD01 → MUX21 → ADD02 → MUX71 → REG04.

Figure 6.7: A Manually Created Timing-Driven Floorplan using the FDS Schedule

Figure 6.8: The 10-Time-Step FIR Filter Schedule Produced by MAHA

Figure 6.9: The RTL 10-Time-Step FIR Filter Design using the MAHA Schedule

Figure 6.10: A Manually Created Timing-Driven Floorplan using the MAHA Schedule

Description                  Dimensions (µm x µm)    Area (µm²)
LADS Prediction              3320.94 x 3329.75       11057903.83
ChipCrafter Placement I      3696.75 x 4961.79       18342497.18
ChipCrafter Placement II     3506.25 x 3312.04       11612840.25
ChipCrafter Placement III    3001.75 x 4753.79       14269689.13
LADS Placement I             3642.75 x 3873.00       14108370.75
LADS Placement II            3089.50 x 3533.00       10915203.50

Description                  D.P. Delay†    C.P. Delay‡    Exec. Delay§
LADS Prediction              96.10 ns       51.56 ns       147.66 ns
ChipCrafter Placement I      77.13 ns       22.92 ns       100.05 ns
ChipCrafter Placement II     78.25 ns       19.58 ns        97.83 ns
ChipCrafter Placement III    75.86 ns       20.13 ns        95.99 ns
LADS Placement I             73.18 ns       16.72 ns        89.90 ns
LADS Placement II            71.48 ns       16.74 ns        88.22 ns

Table 6.2: The 10-Time-Step Non-Pipelined FIR Filter Designs using the LADS Schedule

To compare different scheduling techniques and layouts, a set of charts was drawn. Figure 6.11 compares the area, data path delay, control path delay and total execution delay among the three different ChipCrafter placements and three different scheduling techniques. It can be seen from these charts that different cooling schedules of the simulated annealing placement algorithm cause different variations in the design characteristics.
However, the variations among different cooling schedules are far less than the variations among schedules produced by different scheduling techniques. Therefore, we can conjecture that the design performance in our system is dominated more by schedules than by placements.

We also compared the placement generated by LADS to the ones using the simulated annealing algorithm (i.e. the placements created by ChipCrafter). Figure 6.12 shows the comparisons of the area, data path delay, control path delay and execution delay with different scheduling techniques. The LADS/Manual Placement I and LADS/Manual Placement II of the designs using the LADS schedule represent the layouts using the LADS placement without and with manual pin assignment, respectively. The LADS/Manual Placement I and LADS/Manual Placement II of the designs using the FDS/MAHA schedule represent the layouts using the performance-driven placement shown in Figure 6.7/6.10 without and with pin assignment, respectively. Since there are three layouts whose placements were produced by the ChipCrafter placement program for each schedule, the layouts with minimal area are chosen to represent the placements produced by ChipCrafter in these charts. It is found that the design characteristics (i.e. the performance and area) are improved in the layouts using the LADS placement. The performance of the layout with pin assignment using the LADS schedule is similar to the one without pin assignment, but the area is improved. However, the design characteristics do not vary much between the layouts with and without pin assignment using the Force-Directed schedule and the MAHA schedule, due to the complex interconnections.

† The D.P. delay represents data path delay.
‡ The C.P. delay represents control path delay.
§ The execution delay includes the delays in the data path and the control path.

Description                  Dimensions (µm x µm)    Area (µm²)
ChipCrafter Placement I      3089.50 x 3533.00       10915203.50
ChipCrafter Placement II     3628.50 x 5592.29       20291624.27
ChipCrafter Placement III    3953.75 x 3830.33       15144167.24
Manual Placement I           4298.00 x 2917.00       12537266.00
Manual Placement II          4103.75 x 2939.25       12061947.19

Description                  D.P. Delay     C.P. Delay     Exec. Delay
ChipCrafter Placement I       91.77 ns      33.38 ns       125.15 ns
ChipCrafter Placement II     103.58 ns      35.29 ns       138.87 ns
ChipCrafter Placement III     93.79 ns      32.43 ns       126.22 ns
Manual Placement I            89.94 ns      32.27 ns       122.21 ns
Manual Placement II           92.17 ns      32.41 ns       124.58 ns

Table 6.3: The 10-Time-Step Non-Pipelined FIR Filter Designs using the FDS Schedule

Description                  Dimensions (µm x µm)    Area (µm²)
ChipCrafter Placement I      3301.00 x 3332.25       10999757.25
ChipCrafter Placement II     3485.79 x 3657.75       12750148.37
ChipCrafter Placement III    3708.50 x 3290.50       12202819.25
Manual Placement I           3982.00 x 3473.50       13831477.00
Manual Placement II          4005.25 x 3448.00       13810102.00

Description                  D.P. Delay     C.P. Delay     Exec. Delay
ChipCrafter Placement I       99.97 ns      45.86 ns       145.83 ns
ChipCrafter Placement II     116.65 ns      36.96 ns       153.61 ns
ChipCrafter Placement III    116.18 ns      37.24 ns       153.42 ns
Manual Placement I           102.91 ns      48.43 ns       151.34 ns
Manual Placement II          100.11 ns      49.05 ns       149.16 ns

Table 6.4: 10-Time-Step Non-Pipelined FIR Filter Designs using the MAHA Schedule

As for comparing the design characteristics, the results of the layouts using the LADS schedule are favorable. The execution delay of the LADS design schedule is about 40% less than that of the designs using other scheduling techniques. The data path delay of the LADS design schedule is about 30% less than that of the designs using other scheduling techniques.
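The rule used to pick one ChipCrafter layout per schedule for the comparison charts (take the layout with minimal area) is simple enough to sketch. The data below are the ChipCrafter rows of Tables 6.2, 6.3 and 6.4; the dictionary layout and the function name `representative` are hypothetical and only illustrate the selection step.

```python
# (area in um^2, execution delay in ns) for the three ChipCrafter placements
# of each schedule, copied from Tables 6.2, 6.3 and 6.4.
CHIPCRAFTER_LAYOUTS = {
    "LADS": {
        "Placement I":   (18342497.18, 100.05),
        "Placement II":  (11612840.25, 97.83),
        "Placement III": (14269689.13, 95.99),
    },
    "FDS": {
        "Placement I":   (10915203.50, 125.15),
        "Placement II":  (20291624.27, 138.87),
        "Placement III": (15144167.24, 126.22),
    },
    "MAHA": {
        "Placement I":   (10999757.25, 145.83),
        "Placement II":  (12750148.37, 153.61),
        "Placement III": (12202819.25, 153.42),
    },
}


def representative(layouts):
    """Return (name, area, exec_delay) of the minimal-area layout."""
    name = min(layouts, key=lambda n: layouts[n][0])
    area, delay = layouts[name]
    return name, area, delay


for schedule, layouts in CHIPCRAFTER_LAYOUTS.items():
    name, area, delay = representative(layouts)
    print(f"{schedule}: ChipCrafter {name}, "
          f"area {area:.2f} um^2, exec. delay {delay:.2f} ns")
```

Selecting by area alone means the representative layout is not necessarily the fastest one for that schedule; for the LADS schedule, for example, the minimal-area placement is not the one with the shortest execution delay.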
It is found from this set of experiments that a scheduling program may produce very poor schedules at the register-transfer level if the effects of wiring and interconnections are completely ignored during the scheduling process. The 3D scheduling technique, using the interconnection information fed back from the floorplanner, is able to produce better schedules than traditional scheduling techniques.

Figure 6.11: Comparisons with Different ChipCrafter Placements (panels: area in square microns, data path delay, control path delay and execution delay in ns; schedules: LADS, FDS, MAHA; bars: Placement 1, Placement 2, Placement 3)

Figure 6.12: Comparisons with Different Placement Strategies (schedules: LADS, FDS, MAHA; bars: ChipCrafter Placement, LADS/Manual Placement I, LADS/Manual Placement II)

6.3 A Pipelined FIR Filter Example

For the pipelined FIR filter design example, we used carry select adders for all additions in the control/data flow graph. A 9-time-step design with an initiation interval of 4, in which four adders and two multipliers were allocated, was created by LADS. The schedule and a tentative floorplan are shown in Figures 6.13 and 6.14, respectively.

In order to compare the results of the 3D scheduling technique with other scheduling techniques, we applied the same performance constraint to Sehwa and a Force-Directed scheduler. Both programs produced the same schedule, which is a 6-time-step design, as shown in Figure 6.15. In this example, LADS produced a design with a longer pipeline length because LADS took advantage of the fact that chaining two adders into one time step will cause longer clock cycles in the final implementation step due to wiring delays.
From the previous non-pipelined FIR filter example, we found that chaining two adders into one time step, versus no operator chaining, results in a 30-40% longer clock cycle (e.g. the execution delay of the design LADS Placement II is 88.22 ns versus 125.15 ns for the design ChipCrafter Placement I using the FDS schedule). Since there are two adders chained into one time step (i.e. the tentative critical paths) in the Sehwa schedule (or Force-Directed schedule) and there are no operator chains in the LADS schedule, the clock cycle of the design using the Sehwa schedule (or Force-Directed schedule) is expected to be 30-40% longer than that of the design using the LADS schedule. The throughput of the design using the Sehwa schedule (or Force-Directed schedule) is accordingly expected to be 30-40% lower. Therefore, the LADS schedule can produce a better-performing design when the resynchronization rate is low.

6.4 A Non-Pipelined Differential Equation Solver Example

In this example, we show that the floorplan produced by LADS is able to reduce the wiring delays along the critical path. Figure 6.16 shows the schedule of a 4-time-step non-pipelined differential equation solver example produced by LADS. We used ripple carry adders for all additions in this experiment. The complete register-transfer level design synthesized by MABAL is shown in Figure 6.17. Figure 6.18 shows the floorplan produced by LADS. The modules shown with solid lines are the functional units along the critical path in the data flow graph. The modules shown with lightly hatched patterns are the functional units not on the critical path. In this example LADS successfully groups the functional units along the critical path in the data flow graph together and places them in the center of the floorplan to reduce the wiring delays along the critical path in the data flow graph.
The modules on the off-critical paths, in contrast, are placed at the corners of the floorplan because the execution delays are not critical along these paths.

Figure 6.13: The Pipelined FIR Filter Schedule by LADS (Init.I. = 4, P.Len. = 9)

Figure 6.14: The Floorplan for Pipelined FIR Filter Design Created by LADS. Annotation: PD-4-9 (est. D.P. delay 85.90, C.P. delay 34.44); Size: 4322.65 x 3894.25, Area: 16833469.87 (A.R. 1.11).

Figure 6.15: The Pipelined FIR Filter Schedule by Sehwa (Init.I. = 4, P.Len. = 6)

Figure 6.16: The 4-Time-Step Non-Pipelined Differential Equation Solver Schedule

Figure 6.17: The RTL Design for the 4-Time-Step Differential Equation Solver

Figures B.1 and B.2 show the layouts using the placements produced by the simulated annealing placement program in ChipCrafter with different cooling schedules. Figure B.3 shows the layout produced by ChipCrafter using the LADS floorplan shown in Figure 6.18. The results of these layouts are summarized in Table 6.5. Note that the areas of these layouts are close. However, the layout using the LADS floorplan has a shorter critical path than the layouts using the placements produced by ChipCrafter.
Since the placement program in ChipCrafter does not pay special attention to the wiring delays along the critical path, it places modules that are not on the critical path between the modules along the critical path. Due to the insertion of these modules along the critical path, a performance degradation of approximately 6% is found in the layouts using the ChipCrafter placements.

Figure 6.18: The Floorplan for the 4-Time-Step Differential Equation Solver Produced by LADS. Annotation: estimated data path delay 74.69, control path delay 23.40; Size: 3188.50 x 3057.25, Area: 9748041.62; Aspect Ratio = 1.04.

Description                  Dimensions (µm x µm)    Area (µm²)
LADS Prediction              3188.50 x 3057.25        9748041.62
ChipCrafter Placement I      3795.29 x 3358.00       12744583.82
ChipCrafter Placement II     3114.50 x 3601.25       11216093.12
LADS Placement               3442.50 x 3506.50       12071126.25

Description                  D.P. Delay     C.P. Delay     Exec. Delay
LADS Prediction              74.69 ns       23.40 ns        98.09 ns
ChipCrafter Placement I      82.71 ns       49.63 ns       132.34 ns
ChipCrafter Placement II     82.64 ns       50.13 ns       132.77 ns
LADS Placement               78.12 ns       48.86 ns       126.98 ns

Table 6.5: The 4-Time-Step Non-Pipelined Differential Equation Solver Example

6.5 Non-Pipelined Elliptic Filter Examples

Two non-pipelined elliptic filters using ripple carry adders to execute the additions in the data flow graph were experimented with. Although a ripple carry adder has a longer propagation delay (≈ 35 ns) than a carry select adder (≈ 25 ns), two cascaded ripple carry adders have a shorter critical path delay (≈ 30 ns) than two cascaded carry select adders (≈ 50 ns). 3D scheduling takes into account the delays introduced by cascaded modules to give the scheduler more degrees of freedom to perform the scheduling.
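One way to see why chained ripple carry adders cost far less than twice a single adder is a bit-level timing model: the low-order sum bits of the first adder are available early, so the carry chain of the second adder starts rippling before the first adder has finished. The sketch below is an assumed, idealized model (uniform per-bit delay, no wiring or multiplexer delay, the function name `ripple_sum_arrivals` is hypothetical). It is not the delay estimator used in LADS, and its absolute numbers need not match the figures quoted in the text.

```python
def ripple_sum_arrivals(a_ready, b_ready, t_bit, carry_in_ready=0.0):
    """Arrival time of each sum bit of a ripple carry adder (LSB first).

    a_ready and b_ready give the time at which each input bit becomes valid.
    Assumed model: a bit slice produces its sum and carry t_bit after its two
    input bits and its incoming carry are all valid.
    """
    arrivals = []
    carry_ready = carry_in_ready
    for a, b in zip(a_ready, b_ready):
        done = max(a, b, carry_ready) + t_bit
        arrivals.append(done)
        carry_ready = done
    return arrivals


BITS = 16
T_BIT = 35.0 / BITS        # assumption: the 35 ns of Table 6.1 spread evenly over 16 slices
zeros = [0.0] * BITS       # all primary inputs valid at time 0

first = ripple_sum_arrivals(zeros, zeros, T_BIT)
second = ripple_sum_arrivals(first, zeros, T_BIT)   # second adder fed by the first

single_delay = first[-1]        # one adder on its own
chained_delay = second[-1]      # two adders chained, bit-level overlap included
block_sum = 2 * single_delay    # what a block-level delay model would charge

print(single_delay, chained_delay, block_sum)
```

Under these assumptions the chained pair is only one bit-slice delay slower than a single adder, whereas a block-level model that simply adds module delays would charge twice the single-adder delay. A scheduler that knows this can chain additions that a block-level scheduler would have to spread over separate time steps.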
Due to the possibilities of adder chaining in the data flow graph, the elliptic filter design provides a good example to demonstrate the use of extra degrees of freedom to produce better schedules than traditional approaches.

6.5.1 A 12-Time-Step Non-Pipelined Elliptic Filter Example

Figure 6.19 shows the schedule of a 12-time-step non-pipelined elliptic filter design produced by LADS. The register-transfer level design produced by MABAL is shown in Figure 6.20. The corresponding tentative floorplan is shown in Figure 6.21. Based on the register-transfer level design, the floorplan for the final implementation was modified manually from the floorplan produced by LADS to include additional registers added by MABAL. Note that the sizes of some multiplexers were also changed to be compatible with the design produced by MABAL, as shown in Figure 6.22.

Seven layouts (shown in Figures C.1-C.7) were produced by ChipCrafter for the schedule shown in Figure 6.19. The placements of the first five layouts were produced by ChipCrafter's simulated-annealing-based placement program with different cooling schedules. Two placements, LADS Placement I and LADS Placement II, shown in Figures 6.22 and 6.23, use the LADS floorplan but with different manual placements for the modules added by MABAL. The respective layouts are shown in Figures C.6 and C.7. Due to the more complicated interconnections between modules in this register-transfer level design (see Figure 6.20) and the lack of automatic pin assignment tools, we were unable to produce pin assignments manually. Therefore, the effects of pin assignment were not investigated in this set of experiments. However, we expect an average 7.5% reduction in the layout area and a 25% reduction in the total interconnection length with proper pin assignments [PMSK90].
Figure 6.19: The 12-Time-Step Non-Pipelined Elliptic Filter Schedule by LADS

Figure 6.20: The RTL Design of 12-Time-Step Non-Pipelined Elliptic Filter by MABAL

Table 6.6 shows the design characteristics of the layouts. The areas of the layouts are larger than we predicted because a substantial amount of additional interconnection (e.g. two multiplexers, seven registers and the respective interconnection wiring) is needed to store the values between time steps. The delays of all the layouts are also longer than the scheduler estimated (about 30% longer). From the results of the timing analysis, we found that the signal propagation from the registers to the inputs of the multiplexers has much longer delays than we predicted. For example, the signal propagation from register REG06 to multiplexer MUX02 took approximately 10 ns in the LADS Placement I layout.

After further investigation of the results of the timing analysis and the register-transfer level design, we noticed that the unexpectedly long delays are caused by large fanouts and improper sizes of the output buffers of the datapath registers and functional modules in this design example.
Since we are not able to assign the buffer sizes individually in ChipCrafter, an automatic buffer resizing tool provided within ChipCrafter is used to reduce wiring delays by changing the sizes of output buffers. The buffer resizing program does not change the placement of the layouts; only the sizes of output buffers are altered according to their loads. Figures C.8-C.14 show the layouts and simulation critical paths after running the buffer resizing program. The layout characteristics after running the buffer resizing program are summarized in Table 6.7.

Figure 6.21: The Floorplan for 12-Time-Step Non-Pipelined Elliptic Filter Design by LADS. Annotation: NPD-12 (est. D.P. delay 97.55, C.P. delay 31.38); Size: 3367.00 x 3684.25, Area: 12404869.75 (A.R. 0.91).

Figure 6.22: The Final Implementation Floorplan LADS Placement I for 12-Time-Step Non-Pipelined Elliptic Filter Design

Figure 6.23: The Final Implementation Floorplan LADS Placement II for 12-Time-Step Non-Pipelined Elliptic Filter Design

Description                  Dimensions (µm x µm)    Area (µm²)
LADS Prediction              3367.00 x 3684.25       12404869.75
ChipCrafter Placement I      4240.25 x 4981.71       21123695.83
ChipCrafter Placement II     4367.50 x 4932.00       21540510.00
ChipCrafter Placement III    4352.75 x 6419.54       27942652.73
ChipCrafter Placement IV     2936.75 x 8148.04       23928756.47
ChipCrafter Placement V      3923.75 x 5454.96       21403899.30
LADS Placement I             4067.00 x 4733.00       19249111.00
LADS Placement II            4038.75 x 4577.50       18487378.12

Description                  D.P. Delay     C.P. Delay     Exec. Delay
LADS Prediction               97.55 ns      31.38 ns       128.93 ns
ChipCrafter Placement I      135.48 ns      10.90 ns       146.58 ns
ChipCrafter Placement II     138.35 ns      11.70 ns       150.04 ns
ChipCrafter Placement III    151.47 ns      13.87 ns       165.34 ns
ChipCrafter Placement IV     156.90 ns      13.87 ns       170.77 ns
ChipCrafter Placement V      136.23 ns      10.82 ns       147.05 ns
LADS Placement I             133.93 ns      17.70 ns       151.63 ns
LADS Placement II            130.05 ns      16.83 ns       146.88 ns

Table 6.6: The 12-Time-Step Elliptic Filter Designs before Buffer Resizing

The design performance after running the buffer resizing program is quite satisfactory. Not only are the delays reduced in all the layouts, but the area of some layouts is also reduced. Note that the data path delay estimated by LADS is longer than that of the actual layouts after running the buffer resizing program. This result is the same as the one we obtained in the non-pipelined FIR filter example. We believe that this is due to the use of worst-case wire length in estimating wiring delay.

The data path delays of LADS Placement Ib and LADS Placement IIb are comparable to or slightly longer than the data path delays of some layouts using the placement program within ChipCrafter. This is probably because a substantial amount of interconnection (≈ 50% more modules) is allocated by MABAL after the scheduling is done. The additional interconnections are then manually placed along the boundary of the floorplan produced by LADS. However, in the placement produced by the placement program within ChipCrafter, all the modules are considered and placed at the same time, which allows the placement program within ChipCrafter to produce better placements.

Description                  Dimensions (µm x µm)    Area (µm²)
ChipCrafter Placement Ib     4168.25 x 4941.50       20597407.38
ChipCrafter Placement IIb    4330.75 x 4927.25       21338687.94
ChipCrafter Placement IIIb   4352.75 x 6435.75       28013210.81
ChipCrafter Placement IVb    2935.25 x 8114.75       23818819.94
ChipCrafter Placement Vb     3922.25 x 5290.25       20749683.06
LADS Placement Ib            4038.00 x 4771.75       19268326.50
LADS Placement IIb           4032.50 x 4644.25       18727938.12

Description                  D.P. Delay     C.P. Delay     Exec. Delay
ChipCrafter Placement Ib     77.94 ns       4.28 ns        82.22 ns
ChipCrafter Placement IIb    78.38 ns       4.35 ns        82.73 ns
ChipCrafter Placement IIIb   81.77 ns       4.24 ns        86.01 ns
ChipCrafter Placement IVb    81.28 ns       4.58 ns        85.86 ns
ChipCrafter Placement Vb     79.11 ns       4.25 ns        83.36 ns
LADS Placement Ib            81.22 ns       4.53 ns        85.75 ns
LADS Placement IIb           78.35 ns       4.29 ns        82.64 ns

Table 6.7: The 12-Time-Step Elliptic Filter Designs after Buffer Resizing

Also notice that the simulation critical paths change before and after running the buffer resizing program. Simulation critical paths are also different among the layouts. By inspecting the register-transfer level design shown in Figure 6.20, many possible critical paths are found in this elliptic filter example. This fact leads to the conclusion that the LADS floorplanner cannot effectively reduce the delays along all the critical paths in this example to produce a good performance-driven floorplan.

To compare the results of 3D scheduling to traditional scheduling approaches, we ran MAHA on the elliptic filter example with the same functional unit area constraint. The schedule produced by MAHA is shown in Figure 6.24, which is a 17-time-step non-pipelined design (versus a 12-time-step non-pipelined design produced by LADS).
It is obvious that MAHA is unable to utilize knowledge about the actual delay caused by chaining ripple carry adders in a time step. Therefore, only a 17-time-step non-pipelined schedule was produced by MAHA, which uses 5 more time steps than the schedule produced by LADS. A longer data path delay and execution delay are expected in this 17-time-step non-pipelined design.

6.5.2 A 10-Time-Step Non-Pipelined Elliptic Filter Example

A 10-time-step non-pipelined schedule for an elliptic filter was generated using LADS, as shown in Figure 6.25. A tentative floorplan created by LADS is shown in Figure 6.26. For comparison purposes, we ran MAHA on the elliptic filter using the same resource constraints. The schedule produced by MAHA (shown in Figure 6.27) is a 14-time-step non-pipelined design.

This experiment shows that a shorter schedule was produced by LADS under the same resource constraint. We conjecture that a shorter schedule (i.e. a better-performing design) may be produced when the scheduler is able to accurately estimate the delays of operation chains during scheduling. The ability to estimate path delays accurately helps a scheduler to explore more possible schedules and potentially produce a shorter schedule. A shorter schedule in a non-pipelined design results in better performance (i.e. a shorter execution period). In a pipelined design, a shorter schedule means a shorter pipeline length. For a pipelined design, the performance is determined by its initiation interval if resynchronization does not occur. In this case, a shorter pipeline length does not change the performance. However, a pipelined design with a shorter pipeline length is able to reestablish pipeline execution faster when resynchronizations occur, which is often the case in practice [PP88].
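The trade-off between pipeline length, initiation interval and clock period can be made concrete with a toy timing model. The sketch below assumes that a pipeline which starts empty delivers its first result after `pipeline_length` cycles and one more result every `initiation_interval` cycles, and that every resynchronization forces the pipeline to be refilled from empty. The function names, the normalized clock periods (1.0 without operator chaining, 1.3 for a 30% longer cycle with chaining) and the workload of 100 inputs are illustrative assumptions; only the pipeline shapes (length 9 and length 6, both with initiation interval 4) come from the pipelined FIR filter example in Section 6.3.

```python
def burst_cycles(n_inputs, pipeline_length, initiation_interval):
    """Clock cycles to push n_inputs through a pipeline that starts empty.

    Assumed model: the first result appears after pipeline_length cycles and
    each later input adds initiation_interval cycles.
    """
    if n_inputs <= 0:
        return 0
    return pipeline_length + (n_inputs - 1) * initiation_interval


def stream_time(n_inputs, inputs_per_burst, pipeline_length,
                initiation_interval, clock_period):
    """Total time when the pipeline is refilled after every burst of inputs."""
    full_bursts, remainder = divmod(n_inputs, inputs_per_burst)
    cycles = full_bursts * burst_cycles(inputs_per_burst, pipeline_length,
                                        initiation_interval)
    cycles += burst_cycles(remainder, pipeline_length, initiation_interval)
    return cycles * clock_period


# Pipeline shapes from Section 6.3; the clock periods are normalized assumptions.
LONG_NO_CHAIN = dict(pipeline_length=9, initiation_interval=4, clock_period=1.0)
SHORT_CHAINED = dict(pipeline_length=6, initiation_interval=4, clock_period=1.3)

for burst in (100, 10, 1):   # inputs processed between resynchronizations
    t_long = stream_time(100, burst, **LONG_NO_CHAIN)
    t_short = stream_time(100, burst, **SHORT_CHAINED)
    print(burst, t_long, round(t_short, 1))
```

In this toy model the longer, unchained pipeline finishes first when resynchronizations are rare, and the shorter pipeline wins only when the pipeline has to be refilled after almost every input.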
Figure 6.24: The 17-Time-Step Non-Pipelined Elliptic Filter Schedule by MAHA

Figure 6.25: The 10-Time-Step Non-Pipelined Elliptic Filter Schedule by LADS

Figure 6.26: The Floorplan for 10-Time-Step Elliptic Filter Design Created by LADS. Annotation: NPD-10 (est. D.P. delay 108.47, C.P. delay 33.99); Size: 4184.25 x 3786.50, Area: 15843662.62 (A.R. 1.11).

Figure 6.27: The 14-Time-Step Non-Pipelined Elliptic Filter Schedule by MAHA

6.6 Robot Arm Controller Examples

For another experiment, a portion of a robot arm controller example obtained from UC-Berkeley served as our example. The robot arm controller was originally written in the C language. The C code was then translated into a VHDL description to obtain an internal control/data flow graph representation, as shown in Figure 6.28. Ripple carry adders are used for all additions in the data flow graph. Six non-pipelined designs and fourteen pipelined designs were experimented with, as shown in Tables 6.8 and 6.9, respectively. The schedule results are similar to the ones using traditional scheduling approaches. This is probably because not many additions or subtractions exist and functional chaining is not possible in this example.

Figure 6.28: The Control/Data Flow Graph for the Robot Arm Controller

Non-Pipelined      Functional Units       Estimated Characteristics
Design Name¹       Cmp  Add  Sub  Mul     D.P. Delay (ns)    Area (µm²)
Robot.11           1    2    3    4       93.42              22894387.66
Robot.12           2    2    2    3       95.85              21153578.12
Robot.13           1    2    2    3       90.76              19545892.38
Robot.14*          1    2    2    3       90.68              19519360.88
Robot.15*          1    2    2    3       95.65              20246519.36
Robot.16           1    1    1    2       90.54              16468172.00

Table 6.8: Non-Pipelined Robot Arm Controller Examples by LADS

6.7 An Inner Loop Example

A portion of the robot arm controller serves as our inner loop example. The VHDL description of the robot arm controller example shown in Figure 6.29 has two nested inner loops. The translated data flow graph and timing graph are shown in Figures 6.30 and 6.31, respectively. The NOP operations are dummy operations which are introduced for the feedback of loop values [Che91].

¹ The naming convention for the non-pipelined robot arm controller designs is Robot.[time steps] for feasible designs and Robot.[time steps]* for infeasible designs.
² The naming convention for the pipelined robot arm controller designs is Robot.[time steps].[initiation interval] for feasible designs and Robot.[time steps].[initiation interval]* for infeasible designs; Robot.[time steps].[initiation interval]a represents an alternative design.

-- This is the portion of a robot arm controller.
-- This was modified from the C code of the controller.
    entity robotc2 is
      port( kp1, kv1, km11, km12, T1, K, L, mh11i, mh12i, ref1, xp1, xv1 : in Integer;
            q1 : out Integer);
    end robotc2;

    architecture Behaviour of robotc2 is
    begin
      process
        variable mh110, mh111, mh120, mh121, ev1, xvh1, ep1, uv1, u10, u11, em1 : Integer;
        variable I, J : Integer;
      begin
        -- initialize states
        mh111 := mh11i;
        mh121 := mh12i;
        u11 := 0;        -- to set SubValuePath.
        xvh1 := 0;
        q1 <= 0;         -- to set SubValuePath.
        -- BEGIN_OUTERLOOP
        -- for J in 1 to K loop
        J := 0;
        while J < K loop
          J := J + 1;
          -- I-control
          ep1 := ref1 - xp1;
          uv1 := kp1 * ep1;
          -- BEGIN INNERLOOP
          -- for I in 1 to L loop
          I := 0;
          while I < L loop
            I := I + 1;
            -- D-control
            ev1 := uv1 - xv1;
            u10 := kv1 * ev1;
            em1 := xvh1 - xv1;
            -- ref. model
            xvh1 := u10 * T1 + xv1;
            -- inertia estimation
            mh110 := km11 * em1 * u11 + mh111;
            mh120 := km12 * em1 * u11 + mh121;
            -- torque estimation
            q1 <= mh110 * u10 + mh120 * u10;
            -- update states and refresh constants in ram
            mh111 := mh110;
            mh121 := mh120;
            u11 := u10;
          end loop;
        end loop;
      end process;
    end Behaviour;

Figure 6.29: The VHDL Description of the Robot Arm Controller Example

Figure 6.30: The Schedule and Data Flow Graph of the Robot Arm Controller

Figure 6.31: The Timing Graph of the Robot Arm Controller Example

    Pipelined        Functional Units      Estimated Characteristics
    Design Name²     Cmp  Add  Sub  Mul    D.P. Delay (ns)   Area (µm²)
    Robot.11.4        1    4    2    5      93.45            28839741.56
    Robot.12.4        1    3    2    5      96.44            28687950.00
    Robot.13.4*       1    3    2    5      97.85            29756338.12
    Robot.14.4        1    4    2    5      92.46            29354080.88
    Robot.15.4        1    3    2    5     103.02            28462010.00
    Robot.16.4        1    3    2    5      92.75            28537831.88
    Robot.11.9        1    2    2    4      96.81            22284067.88
    Robot.12.9        ?    2    2    3      93.10            21395779.43
    Robot.13.9*       1    2    1    3      94.95            21651528.00
    Robot.13.9a       1    2    1    3      95.48            20675078.69
    Robot.15.9        1    2    1    2      94.99            18002333.22
    Robot.15.9a*      1    2    1    2      99.30            18593615.34
    Robot.16.9        1    2    1    2      95.14            16808586.25
    Robot.17.4        1    2    1    2      90.56            18213024.50

Table 6.9: Pipelined Robot Arm Controller Examples by LADS

Ripple carry adders are used for all additions in the data flow graph. A schedule with 8 time steps using LADS is shown in Figure 6.30. Note that operations such as $*_{34}$, $*_{39}$, $*_{42}$, $+_{40}$, $+_{43}$, $*_{44}$, $*_{45}$ and $+_{46}$ are closely coupled together in this example. Due to the limitations of the force-directed scheduling technique, we are unable to get a more serialized design for the robot arm controller example.

Chapter 7

Control Path Synthesis

Many design problems of practical interest exhibit conditional branching behavior. However, most existing high-level synthesis systems only synthesize controllers for non-pipelined designs. The ATOMICS system for Cathedral-II [ZSRM90] synthesizes ROM-based microprogrammed controllers for the data paths produced by Cathedral-II. Three pipeline stages can be identified in the control critical path of designs produced by ATOMICS. Pipelined designs that contain conditional branches were not given serious thought by high-level synthesis researchers until the mid-1980s, when such pipelined data paths were first synthesized [PP88]. The issue of controllers for such data paths was only recently addressed (e.g., by Kim and Kurdahi [KK91]).

In this chapter, we present algorithms which generate a controller specification from a description of system behavior and the respective data path. The function of a controller is to issue control signals to the data path. Control signals select operations to be performed at specific time steps and route the processed data to appropriate functional units.
Here, we are particularly interested in automatically synthesizing controllers for both pipelined and non-pipelined data paths produced by data-path dominated high-level synthesis tools, such as MABAL. Controllers for designs with or without conditional branches are discussed.

Data path scheduling, module allocation and bindings are assumed to have been performed to provide the exact schedule and data path information before control path synthesis. Three aspects of a controller need to be determined at this point, namely, the number of states, the next state transitions of a given state and the activated control signals in a given state. The controller design process first determines a set of states that is sufficient to represent the control behavior, and then proceeds by determining outputs and next state transitions for each state, according to the input behavior. The design of a controller is usually a complex task since the controller must retain the current execution status to determine next state transitions and accordingly provide the proper control signals. This becomes more difficult when the design is pipelined.

The rest of this chapter presents a method for the synthesis of controllers for pipelined and non-pipelined data paths. We first discuss the assumptions made in our control path synthesis. The next section deals with the control signal generation for a scheduled design. The implementation techniques of controllers for designs with and without conditional branches are then given. Controller designs for both pipelined and non-pipelined designs are discussed.

7.1 Controller Assumptions

Historically, there are two basic controller design approaches: hardwired and microprogrammed [Hay88]. The former views the controller as a sequential circuit with a number of logic components, such as random logic and PLAs.
Once a controller is constructed, changes in behavior can only be implemented by redesigning and physically rewiring the unit. Therefore, it is called a hardwired controller. On the other hand, a microprogrammed controller is designed around a control memory in which sets of micro instructions are stored. In general, a microprogrammed controller is often more costly and slower than a hardwired controller due to the presence of the control memory and its access circuitry. In this research, we will focus on hardwired controllers characterized by a state table of a finite state machine. The controller is assumed to be implemented by a PLA circuit. However, if a microprogrammed implementation is preferred, the finite state machine we synthesize can also be implemented by a microprogrammed controller.

The controller is modeled as a Mealy-style finite state machine in which state memory holds the present state and a combinational circuit decides next state transitions and output signals. Depending on the implementation style selected, status registers may or may not be used in the controller synthesis. The purpose of introducing status registers into the controller is to store binary condition values in a design with conditional branches. If the controller design does not use status registers, condition values are stored in state memory.

7.2 Control Signal Generation

Control signals are organized as a set of tuples. A tuple has several attributes, such as the activated device name, the activated time step and the activated condition values. For a multiplexer control signal, an extra attribute, viz. the selected port number, is added.

In order to determine execution conditions of condition values for operations along conditional branches, a mutually-exclusive condition analysis of the control/data flow graph is required.
We analyze mutual exclusion conditions of a control/data flow graph using an algorithm similar to the one proposed by Park [PP88]. The algorithm assigns to every operation a label consisting of a sequence of one or more integer codes. Using these labels, we can test mutual exclusion between any pair of nodes (operations) and obtain activated condition values for operations inside conditional execution paths.

Control signals are obtained by combining the information from the input behavior and the register-transfer level data path. A set of control signals is issued by the controller to achieve the activities specified in the binding list under proper timing sequences. Basically, there are two classes of bindings in the list. An operation binding binds an operation to an operator, and a value binding binds a data value to a register. Three types of control signals are defined in our approach, namely, register write signals, multiplexer port select signals and tri-state-driver enable signals. These control the contents of registers, the data transferred through multiplexers and the data values on buses, respectively. Inputs for an operation are expected to come from external inputs or register outputs which are specified in the operation bindings; the input of a value binding should come from an external input or an operation's output. Control signals are then constructed after data transfer paths are identified from operation bindings and value bindings.

7.3 Designs without Conditional Branches

The controller for a design without conditional branches is simple and its state transition diagram is a ring. Time steps, to which operations are scheduled, are a continuous sequence of numbers in a scheduled control/data flow graph. The first time step of the execution is either time step 0 or time step 1, depending on whether the synthesized design has its input data latched before the execution of the process or not [KP90].
Without loss of generality, we assume that the schedule of a control/data flow graph is started at time step 1 unless specified explicitly. With a simple extension, the procedures discussed here can also handle designs starting at time step 0.

7.3.1 Controllers for Non-Pipelined Designs

The controller for a non-pipelined design without conditional branches can be realized by a simple finite state machine. The problem can be formally stated as follows:

Given a scheduled n-time-step control/data flow graph $\mathcal{G}_n$, vertices are grouped into a set of executions $\mathcal{E} = \{\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_n\}$. An execution $\mathcal{E}_i$ is defined as the set of control/data flow graph operations which are scheduled to be executed in time step $i$, where $i = 1 \ldots n$. $\mathcal{E}_i$ and $\mathcal{E}_j$ are disjoint where $i, j = 1 \ldots n$ and $i \neq j$. Note that the vertices of a control/data flow graph are basically the union of executions and distribute/join nodes.¹ The task of a controller is then to generate the set of control signals to ensure that all operations in $\mathcal{E}_i$ are executed according to the behavior specified in time step $i$, $i = 1 \ldots n$.

A state transition diagram $\mathcal{T}_n$ is a graph with a set of states $\mathcal{S} = \{\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_n\}$ as its vertices and transition arcs as its edges. A state consists of a set of control signals which is activated when the state is being visited.

¹ Distribute/join nodes in a control/data flow graph are not assigned to any time step in the current ADAM system.

Figure 7.1: A Ring State Transition Diagram (StateS1 sets State = StateS2, StateS2 sets State = StateS3, ..., StateSn sets State = StateS1)

Definition 7.1 A ring state transition diagram $\mathcal{R}_n$ is a state transition diagram $\mathcal{T}_n$ whose vertex set is $V = \{\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_n\}$ and whose edges are $\{\mathcal{S}_1, \mathcal{S}_2\}, \{\mathcal{S}_2, \mathcal{S}_3\}, \ldots, \{\mathcal{S}_{n-1}, \mathcal{S}_n\}, \{\mathcal{S}_n, \mathcal{S}_1\}$.

An example of a ring state transition diagram is shown in Figure 7.1. We deduce the following lemma from the definition of the ring state transition diagram.
Lemma 7.1 A ring state transition diagram $\mathcal{R}_n$ is sufficient to specify the control behavior of an n-time-step non-pipelined design without conditional branches.

Proof: For an n-time-step non-pipelined design, a new data set is processed every n clock cycles (i.e., the behavior is repeated every n clock cycles). Under the proposed two-phase non-overlapping clocking scheme (see Section 3.2 for more details), operations scheduled in time step $i$ (where $1 \le i \le n$) should be executed and finished within a data-path clock cycle. In a design without conditional branches, activated control signals are merely dependent on the clock cycle being executed (or the state being visited).

Each time step in a non-pipelined design without conditional branches requires only one state to specify the activated control signals, so only n states are needed for an n-time-step non-pipelined design without conditional branches. Since the behavior is repeated every n clock cycles, a ring state transition diagram $\mathcal{R}_n$ is sufficient to specify the controller for an n-time-step non-pipelined design without conditional branches. □

Lemma 7.1 reveals that the number of states required to specify the controller for a non-pipelined design is equal to the number of time steps in the scheduled data flow graph. The controller can thus be realized by defining a bijective function $\mathcal{F} : \mathcal{E}_i \to \mathcal{S}_i$ where $i = 1 \ldots n$. The control signal generating function $\mathcal{F}$ produces the set of control signals which are activated during the execution of operations within execution $\mathcal{E}_i$, and this set of control signals is assigned to state $\mathcal{S}_i$. By assigning the first state to state $\mathcal{S}_1$, the controller design for a non-pipelined design without conditional branches can be completed.

7.3.2 Controllers for Pipelined Designs

The controller design for a pipelined design is similar to the procedure for a non-pipelined design.
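Before turning to the pipelined case, the ring construction of Section 7.3.1 can be illustrated with a minimal Python sketch. The input format (a mapping from time step to the control signals activated in that step) and the name `build_ring_controller` are assumptions made only for this illustration; they are not taken from the ADAM system.

```python
def build_ring_controller(signals_by_step):
    """Map each time step i to a state (activated signals, next state).

    Hypothetical helper: one state per time step, with the last state
    wrapping around to state 1, as in a ring state transition diagram.
    """
    n = len(signals_by_step)
    if sorted(signals_by_step) != list(range(1, n + 1)):
        raise ValueError("time steps must be the contiguous range 1..n")
    controller = {}
    for i in range(1, n + 1):
        next_state = 1 if i == n else i + 1  # edge {S_n, S_1} closes the ring
        controller[i] = (frozenset(signals_by_step[i]), next_state)
    return controller


# Illustrative 3-time-step schedule; the signal names are made up.
example = build_ring_controller({
    1: {"Reg1_WRITE"},
    2: {"Mux1_ENABLE", "Reg2_WRITE"},
    3: {"Mux3_ENABLE", "Reg2_WRITE"},
})
```

The number of states equals the number of time steps, which is the property stated after Lemma 7.1.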
The basic idea is to "fold" the scheduled control/data flow graph every initiation interval. Therefore, the number of time steps in the folded control/data flow graph is equal to the length of the initiation interval, and operations in the folded control/data flow graph are executed every initiation interval. We now present a formal model of the folded control/data flow graph for a pipelined design.

Vertices in the folded scheduled n-time-step control/data flow graph $\mathcal{G}^p_n$ for a pipelined design with an initiation interval $I$ can be represented by a set of p-executions, $\mathcal{E}^p = \{\mathcal{E}^p_1, \mathcal{E}^p_2, \ldots, \mathcal{E}^p_I\}$. A p-execution $\mathcal{E}^p_i$ is defined as the union of executions $\{\mathcal{E}_i, \mathcal{E}_{i+I}, \mathcal{E}_{i+2I}, \ldots, \mathcal{E}_{i+mI}\}$ where $i + mI \le n$. Note that p-executions $\mathcal{E}^p_u$ and $\mathcal{E}^p_v$ are disjoint where $u, v = 1 \ldots I$ and $u \neq v$.

Lemma 7.2 A ring state transition diagram $\mathcal{R}_I$ is sufficient to specify the control behavior for an initiation-interval-$I$ pipelined design without conditional branches.

Proof: The proof is similar to that of Lemma 7.1. For an initiation-interval-$I$ pipelined design, a new data set is processed every $I$ clock cycles. Under the proposed two-phase non-overlapping clocking scheme, operations scheduled in time steps $i, i + I, \ldots, i + mI$ (i.e., the folded time step $i$, where $1 \le i \le I$, $1 \le i + mI \le n$ and $n$ is the pipeline length) should be executed and finished within a data-path clock cycle. In a design without conditional branches, activated control signals are merely dependent on the clock cycle being executed (or the state being visited).

Each folded time step in a pipelined design without conditional branches requires only one state to specify activated control signals, so $I$ states are needed for an initiation-interval-$I$ pipelined design without conditional branches. Since the behavior is repeated every $I$ clock cycles, a ring state transition diagram $\mathcal{R}_I$ is sufficient to specify the controller for an initiation-interval-$I$ pipelined design without conditional branches. □

    State1 : Mux1_ENABLE = 0
             Mux2_ENABLE = 0
             Reg1_WRITE = TRUE
             Mux3_ENABLE = 1      -- In the non-pipelined design, those two
             Reg2_WRITE = TRUE    -- control signals were in State3.
             State = State2

    State2 : Mux1_ENABLE = 1
             Mux2_ENABLE = 1
             Mux3_ENABLE = 0
             Reg2_WRITE = TRUE
             State = State1

Figure 7.2: A Simple Pipelined Design Synthesis Example. a) A Folded Control/Data Flow Graph w/o Conditional Branch; b) Controller Specification for a Pipelined Design with Initiation Interval 2 (listing above)

The previous lemma shows that the number of states required to specify the controller for a pipelined design is independent of the pipeline length (the number of time steps) in a scheduled control/data flow graph. The number of states required is equal to the length of the initiation interval in a pipelined design. For example, two states are sufficient to specify the controller for an initiation-interval-2 pipelined design without conditional branches, no matter what the number of time steps is in the scheduled control/data flow graph.

To realize the controller for a pipelined design without conditional branches, we again define a bijective function $\mathcal{F}^p : \mathcal{E}^p_i \to \mathcal{S}_i$ where $i = 1 \ldots I$. The control signal generating function $\mathcal{F}^p$ produces the set of control signals associated with the p-execution $\mathcal{E}^p_i$, which is the union of control signals activated in time steps $i, i + I, i + 2I, \ldots, i + mI$ where $i + mI \le n$, and this set of control signals is assigned to state $\mathcal{S}_i$. Finally, by assigning the first state to state $\mathcal{S}_1$, the controller design for a pipelined design without conditional branches is done.

Example: We use the example shown in Figure 3.9 to illustrate this procedure. Assume that the scheduled control/data flow graph shown in Figure 3.9a represents a pipelined design with an initiation interval of 2 time units. The control/data flow graph after being folded is shown in Figure 7.2a.
The register-transfer level data path shown in Figure 3.9b implements the function correctly even if the design is pipelined. The controller for this pipelined design is shown in Figure 7.2b. The controller for this pipelined design is similar to the controller shown in Figure 3.9c, except that the activated control signals of state1 in Figure 7.2b are the union of the activated control signals of state1 and state3 in Figure 3.9c. □

7.4 Controllers Using Status Registers

For a design with conditional branches, condition values should be "remembered" by the controller until no more conditional operations are dependent on these condition values. In this section, we present a controller design that uses status registers to store condition values. In general, a controller using status registers generates a simpler state transition diagram than a controller without status registers. However, this may come at the cost of increased chip area. Before going further, we define the following terminology:

Definition 7.2 The first time step in which the condition value $V_c$ is created and stored into a register is defined as $\mathcal{B}_v$ where $\mathcal{B}_v \ge 1$, and the condition value $V_c$ is said to be born in time step $\mathcal{B}_v$.

Definition 7.3 The last time step in which the execution of operations is dependent on the condition value $V_c$ is defined as $\mathcal{D}_v$ where $\mathcal{D}_v \ge \mathcal{B}_v$, and the condition value $V_c$ is said to be dead in time step $\mathcal{D}_v$.

Definition 7.4 The lifetime of the condition value $V_c$ is defined as the span of time from time step $\mathcal{B}_v + 1$ to time step $\mathcal{D}_v$. The condition value $V_c$ is said to be alive within the lifetime.

Definition 7.5 The reserved period of the condition value $V_c$ is defined as the span of time from time step $\mathcal{B}_v + 2$ to time step $\mathcal{D}_v$ if $(\mathcal{B}_v + 1) < \mathcal{D}_v$. Otherwise, the reserved period of the condition value $V_c$ is defined as NIL. The condition value $V_c$ is said to be reserved within the reserved period.
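Definitions 7.2 through 7.5 can be made concrete with a small Python sketch. The class name `ConditionValue` and its field names are hypothetical and chosen only for illustration; the arithmetic follows the definitions above, with NIL represented as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConditionValue:
    """Hypothetical record of a condition value's birth and death time steps."""
    name: str
    born: int  # B_v, with B_v >= 1 (Definition 7.2)
    dead: int  # D_v, with D_v >= B_v (Definition 7.3)

    def __post_init__(self):
        if self.born < 1 or self.dead < self.born:
            raise ValueError("require 1 <= born <= dead")

    def lifetime(self) -> range:
        """Time steps B_v + 1 .. D_v (Definition 7.4); empty if D_v == B_v."""
        return range(self.born + 1, self.dead + 1)

    def reserved_period(self) -> Optional[range]:
        """Time steps B_v + 2 .. D_v if B_v + 1 < D_v, else None for NIL."""
        if self.born + 1 < self.dead:
            return range(self.born + 2, self.dead + 1)
        return None


long_lived = ConditionValue("Vc", born=2, dead=5)
short_lived = ConditionValue("Vd", born=3, dead=4)
```

With these illustrative numbers, `long_lived` is alive in time steps 3 to 5 and reserved in time steps 4 and 5, while `short_lived` is alive only in time step 4 and has no reserved period.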
Note that the death of a condition value is determined not only by the schedule but also by the module allocation and bindings. The death of a condition value can be obtained correctly from the condition-value lifetime analysis. The condition-value lifetime analysis inspects all control signal tuples, and the death of the condition value $V_c$ is the last time step in which the activation of a control signal is dependent on it. It is obvious that a condition value $V_c$ must be "remembered" by the controller if and only if the reserved period of the condition value $V_c$ is not NIL. The number of status registers required to implement a controller using status registers is no less than the number of condition values reserved in any clock cycle during execution.

7.4.1 Controllers for Non-Pipelined Designs

For an n-time-step non-pipelined design controller using status registers, a ring state transition diagram $\mathcal{R}_n$ is sufficient to specify the controller behavior because status registers are used to "remember" reserved condition values. The controller specification for a state in which no condition value is alive is the same as the controller specification for a design without conditional branches (i.e., a set of activated control signals followed by the next state assignment). Figure 7.3a shows a typical controller specification for a state without conditional branches. For a state with only one condition value alive, an if-else statement is used to specify the conditional branch. Otherwise, a case statement is used to specify a state with more than one condition value alive. Figures 7.3b and 7.3c show typical controller specifications for states with one and more than one condition value alive, respectively. Note that a state with $k$ condition values alive has at most $2^k$ possible conditional branches.
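The two counting statements above (the lower bound on status registers and the $2^k$ bound on conditional branches) can be sketched in Python as follows. The function names and the input format, a mapping from condition-value name to a (born, dead) pair, are assumptions for illustration; the sketch covers the non-pipelined case only.

```python
from collections import Counter


def reserved_steps(born, dead):
    """Reserved period B_v + 2 .. D_v, or an empty range when it is NIL."""
    return range(born + 2, dead + 1) if born + 1 < dead else range(0)


def min_status_registers(condition_values):
    """Largest number of condition values reserved in any single time step.

    condition_values: {name: (born, dead)} (hypothetical input format).
    """
    load = Counter()
    for born, dead in condition_values.values():
        for step in reserved_steps(born, dead):
            load[step] += 1
    return max(load.values(), default=0)


def max_conditional_branches(alive_count):
    """Upper bound on branches of a state with alive_count values alive."""
    return 2 ** alive_count


# Illustrative lifetimes only; not taken from any design in this thesis.
values = {"a": (1, 4), "b": (2, 5), "c": (3, 4)}
```

Here `a` is reserved in time steps 3 and 4, `b` in time steps 4 and 5, and `c` is never reserved, so at least two status registers are needed (both `a` and `b` are reserved in time step 4).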
We determine the status register assignment of reserved condition values using the Left Edge algorithm that is also used for register assignment during data path synthesis [KP87]. The following procedure assigns a reserved condition value to a non-conflicting status register for every time step.

    StateSx : < Activated Control Signals >
              State = StateSx + 1

    StateSy : if ( predicate ) {
                < Activated Control Signals >
              } else {
                < Activated Control Signals >
              }
              State = StateSy + 1

    StateSz : case ( pred1 ... predn ) {
                < Activated Control Signals >
              }
              State = StateSz + 1

Figure 7.3: Control Specifications Using Status Registers. a) No Condition Value Reserved (StateSx); b) Single Condition Value Reserved (StateSy); c) Multiple Condition Values Reserved (StateSz)

Procedure 1
    status_register_queue ← ∅.
    for time step i from 1 to the last time step do
    begin
        status_register_queue ← allocated status registers.
        for the condition value V_c reserved in time step i do
        begin
            if status_register_queue ≠ ∅
                remove a status register from status_register_queue and
                assign it to the condition value V_c for time step i.
            else
                allocate a new status register and assign it to the
                condition value V_c for time step i.
        end
    end
□

Procedure 3 synthesizes the controller for a non-pipelined design using status registers by calling Procedures 2 and 1. Procedure 2 generates the control signals for each time step using results from the condition-value lifetime analysis and Procedure 1. Obviously, the controller specification produced by Procedure 3 is a ring state transition diagram $\mathcal{R}_n$ for an n-time-step non-pipelined design using status registers.

Procedure 2
    /* Subroutine to generate the controller specification for time step i. */
    if no condition value is alive in time step i
    begin
        print out activated control signals in time step i.
        return
    end
    /* Otherwise, at least one condition value is alive. */
    if only one condition value V_c is alive in time step i
    begin
        /* Use an "if-else" statement for the single condition value alive cases */
        if the condition value V_c is not reserved
            /* V_c is a newly born condition value */
            use the external input register value as the condition value.
        else
            use the assigned status-register value as the condition value.
        print out activated control signals in time step i for which the
        alive condition value is "TRUE", for the "if" part first.
        /* No condition value must propagate in the last time step */
        if time step i is not the last time step
        begin
            /* Propagate the reserved condition value. */
            if the condition value V_c is reserved in time step i + 1
                assign "1" to the status register assigned to the
                condition value V_c in time step i + 1.
        end
        /* The "if" part is done here. */
        print out activated control signals in time step i for which the
        alive condition value is "FALSE", for the "else" part.
        if time step i is not the last time step
            if the condition value V_c is reserved in time step i + 1
                assign "0" to the status register assigned to the
                condition value V_c in time step i + 1.
        return
    end
    /* For multiple condition values alive, we use a "case" statement. */
    for each condition value V_j alive in time step i do
    begin
        if the condition value V_j is not reserved
            /* V_j is a newly born condition value */
            use the external input register value as the condition value.
        else
            use the assigned status-register value as the condition value.
    end
    for each possible condition value combination P_k in time step i do
    begin
        print out activated control signals in time step i which satisfy
        the condition value combination P_k.
        if time step i is not the last time step
        begin
            /* Propagate the reserved condition value. */
            for each condition value V_j reserved in time step i + 1 do
            begin
                assign "0"/"1" to the status register assigned to the
                condition value V_j in time step i + 1 using the
                condition value combination P_k.
            end
        end
    end
    return
□

Procedure 3
    generate all possible control signal tuples first.
    perform the condition-value lifetime analysis to determine the
    lifetime and reserved period for each condition value.
    execute Procedure 1 to assign status registers for reserved
    condition values.
    /* Now, generate the controller specification for each time step. */
    for time step i from 1 to the last time step do
    begin
        use Procedure 2 to generate the controller specification for time step i.
        if time step i is the last time step
            print the next state transition to state S_1.
        else
            print the next state transition to state S_{i+1}.
    end
□

7.4.2 Controllers for Pipelined Designs

The controller design for a pipelined design using status registers is similar to the design for a non-pipelined design using status registers. The idea of a folded scheduled control/data flow graph is used to model a pipelined design. We view the folded control/data flow graph as a set of p-executions. The function of the controller for an initiation-interval-$I$ pipelined design is to issue the respective control signals that are sufficient to correctly execute operations in the p-execution $\mathcal{E}^p_i$ at time step $t$, where $i = 1, \ldots, I$, $t = mI + i$ and $m$ is a non-negative integer.

A ring state transition diagram $\mathcal{R}_I$ is sufficient to specify a controller for a pipelined design with an initiation interval $I$ even though there are conditional branches, if status registers are used. The set of activated control signals for state $\mathcal{S}_i$ is the union of control signals activated in time steps $i, i + I, i + 2I, \ldots, i + mI$ where $i + mI \le n$, which is derived in a similar manner to the controller for pipelined designs without conditional branches. A simple example illustrates this procedure.
Example: The scheduled control/data flow graph shown in Figure 3.10a is assumed to be a pipelined design with an initiation interval of 2 time units. The register-transfer level design for the pipelined design is the same as the one for the non-pipelined design shown in Figure 3.10b. The folded control/data flow graph and controller are shown in Figures 7.4a and 7.4b, respectively. Note that the register-transfer level data path is the same as the non-pipelined one, so the activated control signals of state1 in this example are simply the union of the activated control signals of state1 and state3 in Figure 3.11b. Similarly, the activated control signals of state2 in Figure 7.4b are the union of the activated control signals of state2 and state4 in Figure 3.11. □

Since reserved condition values are stored in status registers, the controller specification associated with each time step can be viewed as an individual block. The activated control signals of a state in the controller for a pipelined design are actually the union of the activated control signals of overlapped time steps. However, due to the overlapped execution of time steps in a pipelined design, a condition value is distinguished by being associated with its working time step. A condition value with two instances reserved in two different time steps should be stored separately in two status registers. A new status register assignment procedure for reserved condition values is thus needed, as described below in Procedure 4.

Procedure 4
    status_register_queue ← ∅.
    for state l from 1 to the initiation interval do
    begin
        for time step i from l to the last time step step initiation interval do
        begin
            status_register_queue ← allocated status registers.
            for the condition value V_c reserved in time step i do
            begin
                if status_register_queue ≠ ∅
                    remove a status register from status_register_queue and
                    assign it to the condition value V_c for time step i.
                else
                    allocate a new status register and assign it to the
                    condition value V_c for time step i.
            end
        end
    end
□

Figure 7.4: A Pipelined Design Synthesis Example with Conditional Branches. a) A Folded Control/Data Flow Graph with Conditional Branch; b) Controller Using Status Register. In panel b, State2 branches on Reg1 and sets the status variable CSGvar0, and State1 branches on CSGvar0; the figure annotations mark the control specifications that were originally in State3 and the two control signals that were originally in State4.

Procedure 4 produces non-conflicting status register assignments for reserved condition values in a pipelined design. The following procedure, modified from Procedure 3, generates the controller for a pipelined design.

Procedure 5
    generate all possible control signal tuples first.
    perform the condition-value lifetime analysis to determine the
    lifetime and reserved period for each condition value.
    execute Procedure 4 to assign status registers for reserved
    condition values.
    /* Now, generate the controller specification for each state. */
    for state l from 1 to the initiation interval do
    begin
        for time step i from l to the last time step step initiation interval do
        begin
            use Procedure 2 to generate the controller specification for time step i.
            if state l is equal to initiation interval
                print the next state transition to state S_1.
            else
                print the next state transition to state S_{l+1}.
        end
    end
□

7.5 Controllers without Status Registers

We now present a controller design that uses state memory to store reserved condition values in a design with conditional branches.
Compared to the controller using status registers, the controller without status registers usually has a more complicated state transition diagram. To realize a state transition diagram, three items need to be determined: the number of states required, the next state transitions of a given state, and the activated control signals in a given state. In the implementation of a controller without status registers, the number of states required to specify the controller depends on the lifetimes of condition values. The possible next state transitions of a state are determined by the status of reserved condition values (born, reserved or dead). Obviously, more work is needed to design the controller without status registers. However, in our experiments this type of controller does not usually produce smaller PLA circuits than controllers using status registers.

7.5.1 Controllers for Non-Pipelined Designs

For a design with conditional branches, a condition value should be "remembered" by the controller for the reserved period. The reserved period and the birth and death time steps of a condition value are derived from the condition-value lifetime analysis, which is identical to the one performed in designing the controller using status registers. Some characteristics useful for synthesizing a controller without status registers follow from the results of the condition-value lifetime analysis:

• For an n-time-step non-pipelined design, at most $2^k$ states are required to specify the controller behavior for any time step i if there are k condition values reserved in time step i, where $1 \le i \le n$.

• A state $S_{i,p}$ in time step i has exactly one possible next state transition $S_{i+1,q}$ in time step i + 1, if there is no condition value born in time step i or the condition values born in time step i are also dead in time step i. It should be noted that if i is equal to n then time step 1 is next instead of time step i + 1.

• A state $S_{i,j}$ in time step i has two possible next state transitions, $S_{i+1,p}$ and $S_{i+1,q}$, in time step i + 1 if and only if there is one condition value $V_c$ which is born in time step i and reserved in time step i + 1. Also note that if i is equal to n then time step 1 is next instead of time step i + 1.

• A state $S_{i,j}$ in time step i has at most $2^k$ possible next state transitions in time step i + 1 if there are k condition values born in time step i and reserved in time step i + 1. Note that if i is equal to n then time step 1 is next instead of time step i + 1.

Lemma 7.3 Only one state is sufficient to specify the controller behavior in time step 1 for a non-pipelined design.

Proof: For a non-pipelined design without conditional branches, one state is sufficient to represent the controller behavior for a time step in a scheduled control/data flow graph (Lemma 7.1). Thus, one state is sufficient to represent the controller behavior in time step 1 for a non-pipelined design without conditional branches. For a non-pipelined design with conditional branches, a set of states is used to represent the controller behavior for a time step in a scheduled control/data flow graph. Since only reserved condition values are required to be stored by the controller, the number of states required to represent the controller behavior for a time step depends on the number of condition values reserved in that time step. The upper bound on the number of states required to represent a time step with k reserved condition values is $2^k$. However, by the definition of the reserved condition value, no condition value is reserved in time step 1, so one state is also sufficient to represent the controller behavior in time step 1 for a non-pipelined design with conditional branches. □

Other characteristics, such as one state or two states transiting to a state and the upper bound on the number of states transiting to a state, can be obtained in a similar manner:

• A state $S_{i+1,q}$ in time step i + 1 has exactly one state $S_{i,p}$ in time step i transiting to it if no condition values die in time step i + 1, where $1 \le i \le n-1$ for an n-time-step non-pipelined design.

• Two states, $S_{i,p}$ and $S_{i,q}$, in time step i have the same next state transition $S_{i+1,j}$ in time step i + 1 if and only if there is one condition value $V_c$ that is reserved and dies in time step i, where $1 \le i \le n-1$ for an n-time-step non-pipelined design.

• At most $2^k$ states in time step i have the same next state transition in time step i + 1 if there are k condition values that are reserved and die in time step i, where $1 \le i \le n-1$ for an n-time-step non-pipelined design.

The activated control signals in a given state are obtained by excluding non-activated control signals from the set of all possible control signal tuples, according to the condition value combination assigned to the given state and the working time step.

The following procedures synthesize the controller for a non-pipelined design without status registers. Procedure 6 produces the controller specification for a state based on the status of condition values (born, reserved or dead). Procedure 7 generates a set of states which is sufficient to represent the controller for a given time step, and assigns a unique condition value combination to each state generated. Next state transitions are then determined using the above characteristics. Finally, assigning the first state to the state corresponding to time step 1 completes the controller design. Note that only one state is needed to specify the controller behavior in time step 1, by Lemma 7.3.
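The state-count bound above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the CSG implementation (which was written in C): the function names `enumerate_states` and `build_state_sets`, and the representation of a condition value combination as a Python dict, are hypothetical. It enumerates the full upper bound of $2^k$ combinations and does not prune combinations excluded by mutual exclusion of condition values.

```python
from itertools import product


def enumerate_states(reserved_values):
    """Return one state per combination of the reserved condition values.

    Each state is a dict mapping a condition value name to 1 or 0.
    With k reserved values this yields 2**k states (the upper bound);
    with none it yields a single state with an empty combination,
    which is the situation Lemma 7.3 describes for time step 1.
    """
    names = list(reserved_values)
    return [dict(zip(names, bits)) for bits in product((1, 0), repeat=len(names))]


def build_state_sets(reserved_by_step):
    """Map each time step to the list of states generated for it."""
    return {step: enumerate_states(vals) for step, vals in reserved_by_step.items()}


# Hypothetical four-time-step design with one condition value reserved in step 3.
state_sets = build_state_sets({1: [], 2: [], 3: ["Vc"], 4: []})
```

Here `state_sets[3]` holds two states (one per value of `Vc`) and every other time step holds a single state with an empty combination.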
Procedure 6
/* Subroutine to create the controller specification for state $S_{i,j}$ */
if no condition value is born in time step i
begin
    print out activated control signals in time step i.
    if time step i is not the last time step
    begin
        let the condition value combination $\mathcal{P}'_{i,j}$ equal to $\mathcal{P}_{i,j}$.
        remove condition values which die in time step i from $\mathcal{P}'_{i,j}$.
        use state $S_{i+1,p}$ in time step i + 1 with the same condition value combination as $\mathcal{P}'_{i,j}$ to be the next state transition.
    end
    else
        use State 1 in time step 1 as the next state transition.
    return
end
/* Otherwise, at least one condition value is born */
if only one condition value $V_c$ is born in time step i
begin
    /* Use an "if-else" statement for single condition value born cases */
    print out activated control signals in time step i which satisfy the new born condition value $V_c$ = "TRUE" and the condition value combination $\mathcal{P}_{i,j}$ for the "if" part first.
    if time step i is not the last time step
    begin
        let the condition value combination $\mathcal{P}'_{i,j}$ equal to $\mathcal{P}_{i,j}$.
        add the new born condition value $V_c = 1$ to $\mathcal{P}'_{i,j}$.
        remove condition values which die in time step i from $\mathcal{P}'_{i,j}$.
        use state $S_{i+1,p}$ in time step i + 1 with the same condition value combination as $\mathcal{P}'_{i,j}$ to be the next state transition.
    end
    else
        use State 1 in time step 1 as the next state transition.
    /* Thus, the "if" part is done */
    print out activated control signals in time step i which satisfy the new born condition value $V_c$ = "FALSE" and the condition value combination $\mathcal{P}_{i,j}$ for the "else" part.
    if time step i is not the last time step
    begin
        let the condition value combination $\mathcal{P}'_{i,j}$ equal to $\mathcal{P}_{i,j}$.
        add the new born condition value $V_c = 0$ to $\mathcal{P}'_{i,j}$.
        remove condition values which die in time step i from $\mathcal{P}'_{i,j}$.
        use state $S_{i+1,p}$ in time step i + 1 with the same condition value combination as $\mathcal{P}'_{i,j}$ to be the next state transition.
    end
    else
        use State 1 in time step 1 as the next state transition.
    return
end
/* For multiple newly born condition values, we use a "case" statement */
for each newly born condition value combination $\mathcal{P}_{b,i}$ in time step i do
begin
    print out activated control signals in time step i which satisfy condition value combinations $\mathcal{P}_{b,i}$ and $\mathcal{P}_{i,j}$.
    if time step i is not the last time step
    begin
        let the condition value combination $\mathcal{P}'_{i,j}$ equal to $\mathcal{P}_{i,j}$.
        add the new born condition value combination $\mathcal{P}_{b,i}$ to $\mathcal{P}'_{i,j}$.
        remove condition values which die in time step i from $\mathcal{P}'_{i,j}$.
        use state $S_{i+1,p}$ in time step i + 1 with the same condition value combination as $\mathcal{P}'_{i,j}$ to be the next state transition.
    end
    else
        use State 1 in time step 1 as the next state transition.
end
return
□

Procedure 7
/* Generate a set of states $S_i^*$ for each time step */
for time step i from 1 to the last time step do
begin
    generate a set of states, $S_i^* = \{S_{i,1}, S_{i,2}, \ldots, S_{i,2^{k_i}}\}$, where $k_i$ is the number of condition values reserved in time step i.
    for condition value combination $\mathcal{P}_{i,j}$ from $\mathcal{P}_{i,1}$ to $\mathcal{P}_{i,2^{k_i}}$ in time step i do
        assign state $S_{i,j}$ to the condition value combination $\mathcal{P}_{i,j}$.
end
/* Now, produce the controller specification for each generated state */
for time step i from 1 to the last time step do
    for each state $S_{i,j} \in S_i^*$ in time step i do
        use Procedure 6 to generate the controller specification for state $S_{i,j}$.
□

7.5.2 Controllers for Pipelined Designs

The controller design for a pipelined design without status registers is analogous to the design for a non-pipelined design without status registers. A folded scheduled control/data flow graph is used to model the execution of a pipelined design.
The task of the controller in a pipelined design with an initiation interval I is to produce control signals that carry out the execution of operations in the p-execution $S_{p_i}$ at clock cycle $t = mI + i$, where $i = 1, \ldots, I$ and m is a non-negative integer. An approach similar to the one used in the controller design for a non-pipelined design without status registers is applied here. A set of states which is sufficient to specify the controller behavior is first created. Next state transitions of the states created are then determined. Activated control signals in a given state are finally produced to complete the controller design.

For an initiation-interval-I pipelined design, the control behavior of a folded time step is repeated every I clock cycles. Thus I sets of states, one per folded time step, are used to specify the controller for a pipelined design without status registers. In the controller design for a pipelined design using status registers, a condition value within the reserved period is treated as a different condition value instance in each time step and should be "remembered" distinctly by the controller. Therefore, the instance of a reserved condition value is a function of the condition value itself as well as its working time step. In order to differentiate the instances of a reserved condition value in different time steps, the time step i is added as a second subscript to the condition value $V_c$, making it $V_{c,i}$. In designing a controller without status registers, each possible condition value combination must correspond to a state in the controller in order to cover all possible execution cases. Due to the concurrent execution of multiple data sets in a pipelined design and the mutual exclusion of condition values, the set of states of a folded time step can be derived as the Cartesian products of the sets of states which are mapped to this folded time step.
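The mapping of time steps onto folded time steps described above can be sketched as follows. This is a minimal illustration, assuming time steps are numbered from 1 and that time step t is executed in folded time step $((t-1) \bmod I) + 1$, which follows from $t = mI + i$; the function name `fold_time_steps` is hypothetical.

```python
def fold_time_steps(n, interval):
    """Group time steps 1..n by the folded time step that executes them.

    Time step t = m * interval + i (with 1 <= i <= interval) is mapped
    to folded time step i, so each folded time step collects the time
    steps i, i + interval, i + 2 * interval, ... that do not exceed n.
    """
    folded = {i: [] for i in range(1, interval + 1)}
    for t in range(1, n + 1):
        folded[(t - 1) % interval + 1].append(t)
    return folded


# A 4-time-step schedule pipelined with an initiation interval of 2.
mapping = fold_time_steps(4, 2)
```

For this input, folded time step 1 collects time steps 1 and 3, and folded time step 2 collects time steps 2 and 4.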
To define the Cartesian product of two sets of states, we first define the product of two states. A product of two states produces a new state whose condition value combination is the union of the condition value combinations of the two states. The product of two states and the Cartesian product of two sets of states are defined as follows.

Definition 7.6 The product of two states $S_{i,p}$ and $S_{j,q}$ is a new state $S_{i \times j,r}$ whose condition value combination is the union of the condition value combinations $\mathcal{P}_{i,p}$ and $\mathcal{P}_{j,q}$ associated with states $S_{i,p}$ and $S_{j,q}$, respectively, where index $r = (p-1)|S_j^*| + q$.

Definition 7.7 The Cartesian product of two sets of states $S_i^* = \{S_{i,1}, S_{i,2}, \ldots, S_{i,p}\}$ and $S_j^* = \{S_{j,1}, \ldots, S_{j,q}\}$ is a new set of states $S_{i \times j}^* = \{S_{i \times j,1}, \ldots, S_{i \times j,pq}\}$ where $S_{i \times j,k}$ is the product of the two states $S_{i,\lfloor (k-1)/q \rfloor + 1}$ and $S_{j,((k-1) \bmod q) + 1}$, $1 \le k \le pq$.

Example: The Cartesian product of the sets of states
$S_2^* = \{S_{2,1} \leftarrow (V_{1,2}:1),\ S_{2,2} \leftarrow (V_{1,2}:0)\}$
and
$S_4^* = \{S_{4,1} \leftarrow (V_{1,4}:1, V_{2,4}:1),\ S_{4,2} \leftarrow (V_{1,4}:1, V_{2,4}:0),\ S_{4,3} \leftarrow (V_{1,4}:0, V_{2,4}:1)\}$
is a new set of states
$S_{2 \times 4}^* = \{$
    $S_{2 \times 4,1} \leftarrow (V_{1,2}:1, V_{1,4}:1, V_{2,4}:1)$,
    $S_{2 \times 4,2} \leftarrow (V_{1,2}:1, V_{1,4}:1, V_{2,4}:0)$,
    $S_{2 \times 4,3} \leftarrow (V_{1,2}:1, V_{1,4}:0, V_{2,4}:1)$,
    $S_{2 \times 4,4} \leftarrow (V_{1,2}:0, V_{1,4}:1, V_{2,4}:1)$,
    $S_{2 \times 4,5} \leftarrow (V_{1,2}:0, V_{1,4}:1, V_{2,4}:0)$,
    $S_{2 \times 4,6} \leftarrow (V_{1,2}:0, V_{1,4}:0, V_{2,4}:1)\}$.
□

In the controller design using status registers, the controller uses status registers to store reserved condition values. However, in the controller design without status registers, different states are used to distinguish and "remember" different instances of a reserved condition value.
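The product of Definition 7.6 and the Cartesian product of Definition 7.7 can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the author's implementation: a state is represented only by its condition value combination (a dict from an instance name such as "V1,2" to 0 or 1), and the function names `state_product` and `cartesian_product` are hypothetical. The result is ordered so that the product of the p-th and q-th states lands at the 1-based index $r = (p-1)|S_j^*| + q$.

```python
def state_product(a, b):
    """Product of two states: the union of their condition value combinations."""
    merged = dict(a)
    merged.update(b)
    return merged


def cartesian_product(states_i, states_j):
    """Cartesian product of two sets of states, kept as ordered lists.

    The product of the p-th state of states_i and the q-th state of
    states_j is stored at 1-based index r = (p - 1) * len(states_j) + q.
    """
    return [state_product(a, b) for a in states_i for b in states_j]


# The two sets of states from the example above.
s2 = [{"V1,2": 1}, {"V1,2": 0}]
s4 = [
    {"V1,4": 1, "V2,4": 1},
    {"V1,4": 1, "V2,4": 0},
    {"V1,4": 0, "V2,4": 1},
]
s2x4 = cartesian_product(s2, s4)
```

With these inputs `s2x4` has six entries, and its fourth entry combines the second state of `s2` with the first state of `s4`, in line with the index formula.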
For an n-time-step pipelined design, the set of states required to specify the controller without status registers in a folded time step t can be derived by Cartesian products of the sets of states $S_i^*, S_{i+I}^*, \ldots, S_{i+mI}^*$, where $i + mI \le n$, $t = i + m'I$ and $m'$ is a non-negative integer. The following theorem proves this statement.

Theorem 7.1 The set of states $S_{p_i}^*$ produced by Cartesian products of the sets of states $S_i^*, S_{i+I}^*, \ldots, S_{i+mI}^*$ is sufficient to specify the control behavior for an initiation-interval-I pipelined design in the folded time step t, where $t = i + m'I$, $i + mI \le n$, n is the pipeline length and m, m' are non-negative integers, if the set of states $S_j^*$ is sufficient to specify the control behavior in time step j, where $j = i, i+I, \ldots, i+mI$.

Proof: In the controller design without status registers for a pipelined design with an initiation interval I, a set of states $S_{p_i}^*$ which is sufficient to represent the control behavior in a given folded time step t should be able to represent all possible condition value combinations for time steps $i, i+I, \ldots, i+mI$ in a scheduled control/data flow graph. From the definition of the Cartesian product of two sets of states, $S_i^*$ and $S_j^*$, the Cartesian product $S_{i \times j}^*$ includes all possible condition value combinations from these two sets of states. Since the set of states $S_{p_i}^*$ is produced by Cartesian products of $S_i^*, S_{i+I}^*, \ldots, S_{i+mI}^*$, and the sets of states $S_i^*, S_{i+I}^*, \ldots, S_{i+mI}^*$ are sufficient to specify the control behavior in time steps $i, i+I, \ldots, i+mI$, respectively, the set of states $S_{p_i}^*$ is sufficient to represent the control behavior in the folded time step t for an initiation-interval-I pipelined design. □

To determine next-state transitions, the characteristics used to determine next state transitions for a non-pipelined design controller without status registers are applied here.
Note that due to the use of the folded control/data flow graph, a time step in a non-pipelined design controller is now interpreted as a folded time step in a pipelined design controller, which generally includes a set of time steps being executed at the same clock cycle. Activated control signals for a given state are derived in a similar manner as for the non-pipelined design controller without status registers. Finally, arbitrarily assigning one of the states $S_{p_1,j} \in S_{p_1}^*$ in the folded time step 1 as the first execution state completes the design of the controller without status registers. Note that any compound state in the set of states $S_{p_1}^*$ corresponding to the folded time step 1 is sufficient to specify the control behavior in the first execution clock cycle. This arises from the proof of Theorem 7.1. We now use a simple pipelined example to illustrate the design process of the controller.

Example: In the example shown in Figure 7.4, one state is required for each of time steps 1, 2 and 4, and two states are used for time step 3, before the scheduled control/data flow graph is folded. Note that "()" after each state shows the possible condition value combinations associated with this state. NIL means that no condition value combination is associated with this state.

[Figure 7.5: The Controller for the Pipelined Example Shown in Figure 7.4. State transition diagram with three states, State1, State2 and State3.]
The sets of states for time steps before the control/data flow graph is folded are

$S_1^* = \{State1 \leftarrow (NIL)\}$
$S_2^* = \{State2 \leftarrow (NIL)\}$
$S_3^* = \{State3 \leftarrow (V_{c,3}:1),\ State4 \leftarrow (V_{c,3}:0)\}$
$S_4^* = \{State5 \leftarrow (NIL)\}$

where $V_c$ represents the condition value of "p > q".

For an initiation-interval-2 pipelined design, two sets of states are needed, which are derived from Cartesian products of $S_1^*$, $S_3^*$ and $S_2^*$, $S_4^*$, respectively.

$S_{p_1}^* = S_1^* \times S_3^* = \{State1 \leftarrow (V_{c,3}:1),\ State2 \leftarrow (V_{c,3}:0)\}$
$S_{p_2}^* = S_2^* \times S_4^* = \{State3 \leftarrow (NIL)\}$

Next state transitions between the two sets of states $S_{p_1}^*$ and $S_{p_2}^*$ are determined in a similar manner as for the non-pipelined design controller without status registers. The complete state transition diagram is shown in Figure 7.5. The first state can be arbitrarily assigned to either State1 or State2. □

The following procedures synthesize the controller for an initiation-interval-I n-time-step pipelined design without status registers.

Procedure 8
/* Subroutine to create the controller specification for state $S_{p_i,j}$ */
if no condition value is born in the folded time step i
begin
    print out activated control signals in the folded time step i.
    let the condition value combination $\mathcal{P}'_{p_i,j}$ equal to $\mathcal{P}_{p_i,j}$.
    remove condition values which die in the folded time step i from $\mathcal{P}'_{p_i,j}$.
    increase the time step of condition value instances in $\mathcal{P}'_{p_i,j}$ by 1.
    if i is not equal to the initiation interval I
        use state $S_{p_{i+1},t} \in S_{p_{i+1}}^*$ with the same condition value combination as $\mathcal{P}'_{p_i,j}$ to be the next state transition.
    else
        use state $S_{p_1,t} \in S_{p_1}^*$ with the same condition value combination as $\mathcal{P}'_{p_i,j}$ to be the next state transition.
    return
end
/* Otherwise, at least one condition value is born */
if only $V_{c,s}$ is born in the folded time step i where $s = i + mI$ and $i + mI \le n$
begin
    /* Use an "if-else" statement for single condition value born cases */
    print out activated control signals in the folded time step i which satisfy the new born condition value $V_{c,s}$ = "TRUE" and the condition value combination $\mathcal{P}_{p_i,j}$ for the "if" part first.
    let the condition value combination $\mathcal{P}'_{p_i,j}$ equal to $\mathcal{P}_{p_i,j}$.
    add the new born condition value $V_{c,s} = 1$ to $\mathcal{P}'_{p_i,j}$.
    remove condition values which die in the folded time step i from $\mathcal{P}'_{p_i,j}$.
    increase the time step of condition value instances in $\mathcal{P}'_{p_i,j}$ by 1.
    if i is not equal to the initiation interval I
        use state $S_{p_{i+1},t} \in S_{p_{i+1}}^*$ with the same condition value combination as $\mathcal{P}'_{p_i,j}$ to be the next state transition.
    else
        use state $S_{p_1,t} \in S_{p_1}^*$ with the same condition value combination as $\mathcal{P}'_{p_i,j}$ to be the next state transition.
    /* Thus, the "if" part is done */
    print out activated control signals in the folded time step i which satisfy the new born condition value $V_{c,s}$ = "FALSE" and the condition value combination $\mathcal{P}_{p_i,j}$ for the "else" part.
    let the condition value combination $\mathcal{P}'_{p_i,j}$ equal to $\mathcal{P}_{p_i,j}$.
    add the new born condition value $V_{c,s} = 0$ to $\mathcal{P}'_{p_i,j}$.
    remove condition values which die in the folded time step i from $\mathcal{P}'_{p_i,j}$.
    increase the time step of condition value instances in $\mathcal{P}'_{p_i,j}$ by 1.
    if i is not equal to the initiation interval I
        use state $S_{p_{i+1},t} \in S_{p_{i+1}}^*$ with the same condition value combination as $\mathcal{P}'_{p_i,j}$ to be the next state transition.
    else
        use state $S_{p_1,t} \in S_{p_1}^*$ with the same condition value combination as $\mathcal{P}'_{p_i,j}$ to be the next state transition.
    return
end
/* For multiple new born condition values, we use a "case" statement */
for each new born $\mathcal{P}_{b,i}$ in the folded time step i do
begin
    print out activated control signals in the folded time step i which satisfy condition value combinations $\mathcal{P}_{b,i}$ and $\mathcal{P}_{p_i,j}$.
    let the condition value combination $\mathcal{P}'_{p_i,j}$ equal to $\mathcal{P}_{p_i,j}$.
    add the new born condition value combination $\mathcal{P}_{b,i}$ to $\mathcal{P}'_{p_i,j}$.
    remove condition values which die in the folded time step i from $\mathcal{P}'_{p_i,j}$.
    increase the time step of the condition value instances in $\mathcal{P}'_{p_i,j}$ by 1.
    if i is not equal to the initiation interval I
        use state $S_{p_{i+1},t} \in S_{p_{i+1}}^*$ with the same condition value combination as $\mathcal{P}'_{p_i,j}$ to be the next state transition.
    else
        use state $S_{p_1,t} \in S_{p_1}^*$ with the same condition value combination as $\mathcal{P}'_{p_i,j}$ to be the next state transition.
end
return
□

Procedure 9
/* Generate a set of states $S_{p_i}^*$ for each folded time step */
for time step i from 1 to the last time step do
begin
    generate a set of states, $S_i^* = \{S_{i,1}, S_{i,2}, \ldots, S_{i,2^{k_i}}\}$, where $k_i$ is the number of condition values reserved in time step i.
    for condition value combination $\mathcal{P}_{i,j}$ from $\mathcal{P}_{i,1}$ to $\mathcal{P}_{i,2^{k_i}}$ in time step i do
        assign state $S_{i,j}$ to the condition value combination $\mathcal{P}_{i,j}$.
end
for the set of states $S_{p_i}^*$ from $S_{p_1}^*$ to $S_{p_I}^*$ do
    $S_{p_i}^*$ ← Cartesian products of the sets of states $S_i^*, S_{i+I}^*, \ldots, S_{i+mI}^*$, $i + mI \le n$.
/* Now, produce the controller specification for each generated state */
for the set of states $S_{p_i}^*$ from $S_{p_1}^*$ to $S_{p_I}^*$ do
    for each state $S_{p_i,j} \in S_{p_i}^*$ do
        use Procedure 8 to generate the controller specification for state $S_{p_i,j}$.
□

1.2 µm Technology          Delay (ns)   Area (µm²)
16-bit Comparator (>)      36           142087
16-bit Adder (+)           34           68825
16-bit Subtractor (-)      36           93646
16-bit Multiplier (*)      48           1838537

Table 7.1: Design Library Set Used by CSG (Obtained from ChipCrafter)

7.6 Control Path Experiments and Results

The algorithms presented for the control path synthesis have been implemented in a program called CSG using the C language. The experiments focus on the comparison of two controller implementation styles, i.e., using status registers versus not using status registers. A portion of the robot arm controller example obtained from UC-Berkeley served as our example, as shown in Figure 6.28. In this example, there is a total of 46 operations/nodes including 4 distribute-join pairs. The module library used for this experiment is shown in Table 7.1.

7.6.1 Non-Pipelined Design Experiments and Results

For the non-pipelined designs, we created nine designs using MAHA in the current ADAM synthesis system, as shown in Table 7.2. The non-pipelined design schedules generated by MAHA are further processed to create register-transfer level designs by the MABAL program. CSG then takes the schedule, bindings and register-transfer level data paths to generate the controller specifications. In order to compare the number of control signals before the Finesse program synthesis and after the duplicated output signals are merged, Table 7.3 lists the number of allocated modules in each design. For each design, both types of controller specifications (with/without status registers) were produced and are shown in Tables 7.4 and 7.5, respectively. These designs are then synthesized using ChipCrafter to obtain the layouts.

Design Name   Time Steps   Execution Delay† (ns)   Area (µm²)
MAHA.11       11           528                     6193552
MAHA.12       12           576                     4473482
MAHA.13       13           624                     4237749
MAHA.14       14           672                     4168924
MAHA.15       15           720                     4050457
MAHA.19       19           912                     2472474
MAHA.20       20           960                     2378828
MAHA.21       21           1008                    2236741
MAHA.22       22           1056                    2143095

Table 7.2: Non-Pipelined Robot Arm Controller Examples by MAHA

Non-Pipelined    Functional Units        Interconnection Units
Design Name      Cmp   Add   Sub   Mul   Registers   Multiplexers
MAHA.11          2     3     2     3     34          25
MAHA.12          2     2     4     2     34          21
MAHA.13          1     2     3     2     34          20
MAHA.14          1     1     3     2     34          15
MAHA.15          1     2     1     2     34          19
MAHA.19          2     1     3     1     34          14
MAHA.20          2     1     2     1     34          15
MAHA.21          1     1     2     1     34          12
MAHA.22          1     1     1     1     34          12

Table 7.3: Non-Pipelined Robot Arm Controller RTL Examples by MABAL

Figure 7.6 shows a typical controller layout produced by ChipCrafter, which includes a PLA component and a set of D-type flip-flops (DFFs). It can be seen from this layout example that the area of the DFFs is almost the same as the PLA area. Wiring also takes a substantial amount of area in a controller. Therefore, in order to estimate the area of a PLA-type controller accurately, the area of DFFs and wiring cannot be ignored.

† The execution delay here includes the delays caused by functional units only.
‡ Note that the number of states for a controller using status registers is always one more than the number of time steps. This is because the first state in the controller specification is used to latch the input values.
§ The area of a controller includes the area of PLA, DFFs and wiring.

[Figure 7.6: A PLA-Based Controller Synthesized by Finesse Program]

Controller Specification
Design Name   No. of States‡   Status Registers   Outputs
MAHA.11       12               2                  79
MAHA.12       13               2                  72
MAHA.13       14               1                  76
MAHA.14       15               2                  70
MAHA.15       16               2                  74
MAHA.19       20               2                  63
MAHA.20       21               2                  66
MAHA.21       22               2                  64
MAHA.22       23               2                  63

PLA Implementation
Design Name   Rows   Ipts   Opts   PLA Area   M. Opts   DFFs   Area§ (µm²)
MAHA.11       17     10     39     153990     45        42     603325
MAHA.12       19     10     37     157954     41        38     575055
MAHA.13       18     9      44     162491     36        52     879981
MAHA.14       22     10     47     206672     29        53     919443
MAHA.15       24     10     48     220070     32        50     755715
MAHA.19       25     11     44     212503     26        45     753246
MAHA.20       26     11     49     244219     24        50     863212
MAHA.21       31     11     49     280061     22        49     858183
MAHA.22       30     11     47     260929     23        48     769165

Table 7.4: Controllers Using Status Registers for Non-Pipelined Examples

[Figure 7.7: PLA Area Comparison of the Non-Pipelined Designs. Bar chart comparing PLAs using status registers with PLAs without status registers.]

In the non-pipelined design experiments, the size of the PLAs in controllers increased as the number of time steps increased (see Figure 7.7). This behavior is consistent with our prediction. However, the area of the controllers (see Figure 7.8) increased in the beginning, but then fluctuated with less than 10% variation. This phenomenon may be explained as follows. From Tables 7.4 and 7.5, the number of output signals in the controller specifications decreased as the number of time steps increased. Nevertheless, the number of merged outputs is reduced when the number of time steps is increased. This is probably because more output bits are encoded into a select word for more serialized non-pipelined designs. For example, two 3-to-1 multiplexers (4 control bits) may be replaced by one 6-to-1 multiplexer (3 control bits) in a more serial design. Even though the number of controller outputs after removing the merged outputs, which is similar to the number of PLA outputs, does not increase in the designs with more than 12 time steps, the area of the PLAs still increases.
This is probably due to the fact that the numbers of PLA product terms (rows) and state encoding bits (columns) increase as the number of time steps increases.

Figures 7.7 and 7.8 show the area comparisons of PLAs and control paths for the non-pipelined designs with and without status registers. It is interesting to note that the controller specified using status registers usually creates a smaller PLA area as well as a smaller total control area for the non-pipelined designs, although it uses more state encoding bits⁵ than the one without status registers. In order to understand the reasons for those results, we examine the state transition diagrams for MAHA.12 in both designs (with/without status registers) after the Finesse⁶ synthesis. Notice that a twelve-time-step design needs at least thirteen clock cycles to complete the task because one extra clock cycle is used to latch input values.

[Figure 7.8: Controller Area Comparison of the Non-Pipelined Designs. Bar chart comparing controllers using status registers with controllers without status registers.]

Figures 7.9 and 7.10 depict the state transition diagrams for MAHA.12 without and with status registers, respectively. Even though the two state transition diagrams look completely different, they represent exactly the same controller behavior. By comparing these two state transition diagrams, we found that some of the states in the state transition diagrams without status registers split into several states as compared to the state transition diagram with status registers. All the split states are shown in bold face and cross-hatch patterns in Figures 7.9 and 7.10. The splits of those states indicate that a multiple-code state encoding is produced in the controller using status registers.
Apparently, Espresso is able to exploit the don't-care bits in state codes in order to minimize the size of the binary encoded cover. For example, no status registers are used in State0 for the controller using status registers so, in Finesse, the two don't-care bits and "0100" are assigned by the unicode (single-code) state encoding algorithm. To reduce the product terms, Espresso assigned the two don't-care bits to "00" and "01", and the state encodings of State0 became "0100 00" and "0100 01" (a multiple-code state encoding). Note that most current state encoding algorithms can produce unicode encodings only. Our heuristic functionally separates the encoding task between states and condition values, which makes the unicode state encoding algorithm produce a multiple-code state encoding.

⁵ The state encoding bits here include the bits to represent states and the bits used by status registers.
⁶ Finesse is a finite state machine synthesis program in Cascade Design Automation ChipCrafter.

Controller Specification
Design Name   No. of States   Outputs
MAHA.11       21              79
MAHA.12       25              72
MAHA.13       19              76
MAHA.14       25              70
MAHA.15       30              74
MAHA.19       40              63
MAHA.20       41              66
MAHA.21       41              64
MAHA.22       42              63

PLA Implementation
Design Name   Rows   Ipts   Opts   PLA Area   M. Opts   DFFs   Area (µm²)
MAHA.11       21     9      40     171527     44        40     635578
MAHA.12       25     9      36     179181     41        37     778786
MAHA.13       19     9      47     188595     34        50     863749
MAHA.14       25     9      48     228518     27        50     964412
MAHA.15       29     9      50     265571     29        51     942046
MAHA.19       40     10     42     275989     27        43     965367
MAHA.20       39     10     46     311978     26        49     923976
MAHA.21       39     10     47     315747     23        48     893265
MAHA.22       41     10     46     327286     23        47     915990

Table 7.5: Controllers without Status Registers for Non-Pipelined Examples

[Figure 7.9: MAHA.12 State Transition Diagram without Status Registers. States State0 through State24, each labeled with a 5-bit state code.]

[Figure 7.10: MAHA.12 State Transition Diagram Using Status Registers. Each state is labeled with a 4-bit state code followed by 2 status-register bits; several states appear under more than one code, e.g. State0 (0100 00) and State0 (0100 01).]

Design Name    Init. Intvl   Pipeline Length   Init. Delay** (ns)   Area (µm²)
Sehwa.15.3     3             15                144                  11777988
Sehwa.17.3     3             17                144                  11777988
Sehwa.13.4     4             13                192                  9728539
Sehwa.15.4     4             15                192                  9728539
Sehwa.11.5     5             11                240                  7890002
Sehwa.12.5     5             12                240                  7890002
Sehwa.19.6     6             19                288                  5888994
Sehwa.24.6     6             24                288                  5888994
Sehwa.15.9     9             15                432                  4050457
Sehwa.16.9     9             16                432                  4050457
Sehwa.14.11    11            14                528                  3981632
Sehwa.20.11    11            20                528                  3981632
Sehwa.20.17    17            20                816                  2143095
Sehwa.27.17    17            27                816                  2143095

Table 7.6: Pipelined Robot Arm Controller Examples by Sehwa
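The State0 case discussed in Section 7.6.1 can be illustrated by expanding a state code that contains don't-care positions into the fully specified codes it covers. This is only a notational sketch, not a description of how Finesse or Espresso works internally; the function name `expand_code` and the use of "-" for a don't-care bit are assumptions made for the illustration.

```python
from itertools import product


def expand_code(cube):
    """List every fully specified code covered by a code with '-' don't-cares.

    Characters other than '-' (including a separating space) are kept
    as they are; each '-' is replaced by both '0' and '1'.
    """
    choices = [("0", "1") if ch == "-" else (ch,) for ch in cube]
    return ["".join(bits) for bits in product(*choices)]


# State code "0100" followed by two status-register bits, the second left free.
state0_codes = expand_code("0100 0-")
```

Under these assumptions `state0_codes` contains the two encodings "0100 00" and "0100 01" quoted for State0, whereas leaving both status-register bits free ("0100 --") would cover four codes.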
7.6.2 Pipelined Design Experiments and Results

Fourteen pipelined designs with different initiation intervals and/or pipeline lengths were created by Sehwa as shown in Table 7.6. The respective register-transfer level designs produced by MABAL are shown in Table 7.7. Both controller styles are generated for all the designs, which are shown in Tables 7.8 and 7.9.

** The initiation interval here includes the delays caused by functional units only.

Pipelined      Functional Units        Interconnection Units
Design Name    Cmp   Add   Sub   Mul   Registers   Multiplexers
Sehwa.15.3     2     4     2     6     110         40
Sehwa.17.3     2     4     2     6     110         36
Sehwa.13.4     1     3     2     5     84          28
Sehwa.15.4     1     3     2     5     76          35
Sehwa.11.5     1     3     2     4     55          26
Sehwa.12.5     1     3     2     4     60          29
Sehwa.19.6     1     2     1     3     71          20
Sehwa.24.6     1     2     1     3     92          26
Sehwa.15.9     1     2     1     2     42          22
Sehwa.16.9     1     2     1     2     45          22
Sehwa.14.11    1     1     1     2     35          19
Sehwa.20.11    1     1     1     2     41          20
Sehwa.20.17    1     1     1     1     36          15
Sehwa.27.17    1     1     1     1     42          16

Table 7.7: Pipelined Robot Arm Controller RTL Examples by MABAL

To simplify the comparison between the two implementation styles, bar charts of the PLA area and the complete controller area are shown in Figures 7.11 and 7.12, respectively. The results are slightly different from those of the non-pipelined designs. In the pipelined designs, some of the designs that used status registers were found to be larger than those not using status registers. This may be due to the high utilization of status registers, which leaves Espresso with little or no freedom to assign the don't-care bits to status registers. However, the number of cases that improved is greater than the number of cases that degraded in our experiments. Notice that the percentage of area improvement on both the PLA area and total controller area is more dramatic than for non-pipelined designs, and the area increase percentage of the controller using the status registers is much less than the one not using the status registers.

Controller Specification
Design Name   No. of States   Status Registers   Outputs
Sehwa.15.3    3               4                  170
Sehwa.17.3    3               2                  168
Sehwa.13.4    4               3                  131
Sehwa.15.4    4               1                  137
Sehwa.11.5    5               1                  101
Sehwa.12.5    5               2                  109
Sehwa.19.6    6               1                  115
Sehwa.24.6    6               4                  145
Sehwa.15.9    9               1                  88
Sehwa.16.9    9               2                  90
Sehwa.14.11   11              1                  76
Sehwa.20.11   11              2                  87
Sehwa.20.17   17              2                  70
Sehwa.27.17   17              2                  77

PLA Implementation
Design Name   Rows   Ipts   Opts   PLA Area   M. Opts   DFFs   Area (μm²)
Sehwa.15.3    11     10     13     62823      163       21     310456
Sehwa.17.3    6      6      11     37616      159       14     234454
Sehwa.13.4    10     8      14     54584      120       24     285479
Sehwa.15.4    8      7      14     47011      125       26     304104
Sehwa.11.5    9      8      26     73783      79        33     408352
Sehwa.12.5    11     9      22     73680      92        29     369543
Sehwa.19.6    11     7      31     83722      87        39     593705
Sehwa.24.6    22     10     40     175973     111       42     871637
Sehwa.15.9    13     9      45     152487     48        48     666393
Sehwa.16.9    17     10     50     192984     46        57     1029564
Sehwa.14.11   18     10     51     197505     30        53     826538
Sehwa.20.11   19     9      48     191454     44        54     868873
Sehwa.20.17   23     11     47     221892     29        50     877914
Sehwa.27.17   23     11     49     228270     35        51     783790

Table 7.8: Controllers Using Status Registers for Pipelined Examples

Controller Specification
Design Name   No. of States   Outputs
Sehwa.15.3    36              170
Sehwa.17.3    9               168
Sehwa.13.4    18              131
Sehwa.15.4    8               137
Sehwa.11.5    9               101
Sehwa.12.5    14              109
Sehwa.19.6    10              115
Sehwa.24.6    52              145
Sehwa.15.9    14              88
Sehwa.16.9    23              90
Sehwa.14.11   19              76
Sehwa.20.11   16              87
Sehwa.20.17   25              70
Sehwa.27.17   28              77

PLA Implementation
Design Name   Rows   Ipts   Opts   PLA Area   M. Opts   DFFs   Area (μm²)
Sehwa.15.3    30     10     19     123853     157       26     461174
Sehwa.17.3    6      6      7      32234      163       10     155644
Sehwa.13.4    19     9      25     103717     111       32     429688
Sehwa.15.4    8      7      14     47200      125       26     316469
Sehwa.11.5    11     8      24     74366      81        32     429125
Sehwa.12.5    13     8      26     85107      87        32     431443
Sehwa.19.6    9      7      32     80207      86        38     570495
Sehwa.24.6    48     10     45     360937     106       51     994086
Sehwa.15.9    13     8      47     152167     45        50     742560
Sehwa.16.9    23     9      49     220554     46        53     811965
Sehwa.14.11   20     9      50     201343     31        51     821688
Sehwa.20.11   15     8      52     176593     39        54     979000
Sehwa.20.17   24     9      48     216030     27        49     754222
Sehwa.27.17   28     9      49     244170     33        50     767844

Table 7.9: Controllers without Status Registers for Pipelined Examples

[Figure 7.11: PLA Area Comparison of the Pipelined Designs. Bar chart; vertical axis: PLA area (square micron); series: PLA using status registers, PLA without status registers.]

[Figure 7.12: Controller Area Comparison of the Pipelined Designs. Bar chart; vertical axis: controller area (square micron); series: controller using status registers, controller without status registers.]

Chapter 8
Thermal Analysis and Simulation Results

As integrated circuit feature sizes have decreased, designers have been forced to consider physical effects at higher levels of the design process. An increase in power density results from reduced feature sizes, partly due to higher operating speeds, and partly due to the increased component density on integrated circuits. However, the operating temperature of a chip is limited to a certain range for acceptable system reliability. For silicon devices, this temperature is in the range of 75-85 °C [Jr.83]. With increasing power density and with limits on the operating temperature, thermal limitations of the chip must be considered during the design process. Consequently, it is imperative to study not only the thermal properties of the device and package material but also the run-time thermal properties of the chip/die so that, during the design stage, a better thermal layout can be obtained to alleviate potentially high thermal stress.
While thermal effects on-chip might be relatively inconsequential for many single MOS chips, the current thermal problems associated with multichip modules indicate that thermal effects cannot be ignored during multichip synthesis. For the design of electronic circuits for reliability, there are a number of handbooks, specifications and guidelines which help us establish a common basis for comparing, evaluating and predicting related or competitive designs. The Military Handbook MIL-HDBK-217E [Mil90] was developed by the Department of Defense to unify reliability prediction methods for integrated circuits produced by the military. According to this handbook, surface temperature is one of the major factors affecting circuit reliability. Computer-aided design software should predict and analyze thermal properties in circuits while they are under design, highlighting areas which are overused in order to prevent such failures. Engineers can then improve designs so that systems themselves operate at lower temperatures and more reliably.

Localized heat-concentration problems can appear when a high-level synthesis tool or designer uses a greedy approach to the allocation and binding synthesis subtasks which overuse some central resources. Along with the scheduling procedure, the schedule and topology of functional modules and the selection of the package material affect the design enormously. Since not all functional modules are active all the time, utilizations of functional modules, which are the result of scheduling, can also affect the reliability of the design. Thermal effects thus can represent a limiting factor in the development of ASIC chips.

An accurate model of the thermal behavior of the die structure is necessary in order to make reliable designs. Thermal analysis methods can be categorized into two general approaches, namely, analytical solutions and numerical solutions.
The analytical approach involves the search for an exact analytical solution for structures with regular geometries. However, using an analytical approach, a solution is not always obtainable for a structure with complex geometries. Therefore, a numerical method is a better approach for a structure with irregular geometries. Numerical techniques such as finite-difference [MAD83], finite-element [WFM83], and boundary element [CPB88] have been widely used to analyze thermal profiles of electronic circuits.

The purpose of this chapter is to provide a novel method for improving overall thermal characteristics of circuits during the high-level synthesis process. The basic idea in this work is to alleviate the heat-concentration phenomenon by averaging work loads¹ between modules of the same functionality during scheduling and by balancing power dissipation during floorplanning. Averaging or spreading the power dissipation of functional modules allows the design to operate with a more balanced thermal profile. Compared to the worst case, our simulation results indicate that the rise in temperature can be reduced by 20% by combining the efforts of improving the floorplan and the schedule.

¹ The amount of heat produced by a functional module is proportional to its work load.

[Figure 8.1: A Typical Single-Chip Heat Transfer Model. Labels: ambient, package, substrate, printed circuit board, ambient.]

In the following sections, we first introduce thermal models for a chip under thermal consideration. An analytical solution is then derived. A numerical finite-difference method is also presented. An example is used to illustrate thermal impacts on floorplanning and scheduling results. Finally, some remarks on the thermal analysis are given.

8.1 Thermal Models

To calculate the induced thermal stress of a design, the temperature profile must be known.
In our analysis, the heat conduction from the top of the die surface to the working fluid² is assumed to be negligible, as compared to the heat conducted laterally through the doped substrate to the printed circuit board. The thermal conductivities of the substrate and package material are constant at the steady state, and the surface temperature is primarily a function of functional module utilizations and the topology of functional modules. For a steady-state conductive cooling environment, the single-chip heat transfer environment can be modeled by a network of thermal resistances as shown in Figure 8.1 [TR89]. In the thermal resistance model, the temperature and heat flux are analogous to the voltage and current in the analysis of an electronic circuit. The thermal resistances in Figure 8.1 are derived as

$$R = \frac{W_c}{K \cdot A_c}$$

where $W_c$ is the conduction distance, $K$ is the thermal conductivity of the conducting material and $A_c$ is the cross-sectional area of the conduction. Since the amount of heat produced inside a chip and the ambient temperature are known, the temperature on the surface of the package can then be derived from the network of thermal resistances.

² Usually, the working fluid is air.

[Figure 8.2: A Four-Layer Chip Thermal Model. Labels: heat sources, silicon.]

In order to obtain the analytical solution of the temperature profile on the surface of the silicon substrate, we modified this thermal resistance model to a four-layer thermal model as shown in Figure 8.2 [LPM89]. The thermal characteristics are retained in this simplified model. For a common silicon device, the first (top) layer is usually silicon. The second layer is usually a bonding material, such as epoxy. The third layer represents metallization on the substrate (usually gold). The fourth (bottom) layer, the package material, uses material such as alumina.
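The thermal-resistance analogy described above (temperature as voltage, heat flow as current) can be illustrated with a minimal sketch for layers stacked in series. This is not the thesis software: the function names, the layer thicknesses, the conductivities (roughly silicon-like and alumina-like) and the die area are all illustrative assumptions, and only one-dimensional conduction through the stack is modeled.

```python
def layer_resistance(distance_m, conductivity_w_per_m_k, area_m2):
    """Thermal resistance R = W_c / (K * A_c) of one conduction path, in K/W."""
    if distance_m < 0 or conductivity_w_per_m_k <= 0 or area_m2 <= 0:
        raise ValueError("distance must be >= 0; conductivity and area must be > 0")
    return distance_m / (conductivity_w_per_m_k * area_m2)


def surface_temperature(ambient_c, power_w, resistances_k_per_w):
    """Series network: temperature rise = heat flow * sum of resistances."""
    return ambient_c + power_w * sum(resistances_k_per_w)


# Hypothetical two-layer stack over a 1 cm^2 die (assumed values, not from the thesis).
AREA_M2 = 1.0e-4
stack = [
    layer_resistance(0.5e-3, 150.0, AREA_M2),  # assumed silicon-like layer
    layer_resistance(1.0e-3, 25.0, AREA_M2),   # assumed alumina-like layer
]

if __name__ == "__main__":
    # Temperature at the heat source for 2 W dissipated into a 25 degree C ambient.
    print(surface_temperature(25.0, 2.0, stack))
```

Summing the resistances corresponds to resistors in series in the electrical analogy; a fuller network such as the one in Figure 8.1 would also contain parallel paths.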
Several assumptions have been made in this four-layer thermal model:

• The surface temperature at any position on the silicon depends on the heat dissipated by the functional modules, the thermal conductivities of the silicon and package material and the temperature of the working fluid.
• The heat generated from each functional module on the substrate is assumed to be uniformly distributed. The heat dissipation of a functional module is dependent on the clock frequency and its utilization.
• The average power dissipation of a functional module is used in the thermal analysis, since the propagation speed of the heat dissipated is much slower than the clock frequency.

Note that the average power dissipation of a functional module is the product of the maximal power dissipated at the working clock frequency and its utilization. For example, the power dissipated by a typical adder circuit when operating at a frequency of 10 MHz and Vdd = +5 volts is about 2 mW. If the utilization of this adder on a given design is 80%, the average power dissipation of this adder is 1.6 mW.

8.2 Derivation of the Analytical Solution

Due to the small thickness of the bonding and metallization layers, only the first silicon layer and the fourth package layer are considered. Therefore, the four-layer thermal model shown in Figure 8.2 is further simplified to a two-layer thermal model as shown in Figure 8.3 [CA78]. In this model, the package size is assumed to be the same as the die size; only the thermal properties of the silicon and package layers are considered.

[Figure 8.3: A Simplified Layered Chip Thermal Model. Label: silicon.]

To derive the temperature distribution of the die structure in the steady state, the classic heat-flow governing equation must be solved:

$$\nabla^2 T(x, y, z) = 0 \qquad (8.1)$$

where the thermal conductivity of the medium is assumed not to be a function of temperature.
The need for a three-dimensional solution is due to the fact that heat transfer in the die is three-dimensional. The following boundary conditions are considered:

• The bottom surface (the package surface) has an arbitrary but known temperature distribution $T_0(x, y)$.
• The lateral sides of the layer structure are considered to be adiabatic.³
• The heat dissipated on the top surface of the silicon chip is described by the function $P(x, y)$.
• The heat flux is continuous at the interfaces between layers.

Equation 8.1 can be solved by separation of variables or other equivalent techniques. The solution that calculates the temperature on the top silicon surface is thus derived as [CA78]

$$T(x, y, 0) = T_0(x, y, h_1 + h_2) + \frac{1}{L_1 L_2}\left(\frac{h_1}{K_1} + \frac{h_2}{K_2}\right) \int_0^{L_1}\!\!\int_0^{L_2} P(x, y)\,dx\,dy + \frac{4}{K_1 L_1 L_2} \sum_{\substack{m=0,\,n=0 \\ (m,n) \neq (0,0)}}^{\infty} \frac{\int_0^{L_1}\!\!\int_0^{L_2} P(x, y) \cos\!\left(\frac{m\pi x}{L_1}\right) \cos\!\left(\frac{n\pi y}{L_2}\right) dx\,dy}{\lambda_{mn} \cdot (\delta_m + 1) \cdot (\delta_n + 1)} \cdot \frac{\sinh \lambda_{mn} h_1 + \frac{K_1}{K_2} \cdot \cosh \lambda_{mn} h_1 \cdot \tanh \lambda_{mn} h_2}{\frac{K_1}{K_2} \cdot \sinh \lambda_{mn} h_1 \cdot \tanh \lambda_{mn} h_2 + \cosh \lambda_{mn} h_1} \cdot \cos\!\left(\frac{m\pi x}{L_1}\right) \cdot \cos\!\left(\frac{n\pi y}{L_2}\right) \qquad (8.2)$$

where

$$\lambda_{mn} = \sqrt{\frac{m^2\pi^2}{L_1^2} + \frac{n^2\pi^2}{L_2^2}},$$

$\delta_m$ and $\delta_n$ are the Kronecker delta and $K_1$, $K_2$ are the thermal conductivities of the first layer and the second layer, respectively. Note that the exact solution of the four-layer structure can be derived in a similar manner [LPM89].

³ Adiabatic means no heat transfer actions between the analyzed structure and the working environment.

In the case of ASIC design, the power distribution $P(x, y)$ can be modeled by a group of uniformly distributed functional modules. This fact suggests that we can represent the function $P(x, y)$ as a set of box functions⁴ with the amplitude $Q_{f_i}$ for each functional module. Due to the linearity of the heat distribution function $P(x, y)$, the solution for the structure with a single heat source is first derived.
For the structure with multiple sources, the solutions are then obtained by the superposition of corresponding single-source solutions. Two integrals in Equation 8.2 are thus derived as

$$\int_0^{L_1}\!\!\int_0^{L_2} P(x, y)\,dx\,dy = \sum_{i \in M} Q_{f_i} \qquad (8.3)$$

$$\int_0^{L_1}\!\!\int_0^{L_2} P(x, y) \cos\!\left(\frac{m\pi x}{L_1}\right) \cos\!\left(\frac{n\pi y}{L_2}\right) dx\,dy = \sum_{i \in M} Q_{f_i} \cdot \left\{ \mathrm{sinc}\!\left(\frac{m\pi U}{2L_1}\right) \cdot \mathrm{sinc}\!\left(\frac{n\pi V}{2L_2}\right) \cdot \cos\!\left(\frac{m\pi(a+b)}{2L_1}\right) \cdot \cos\!\left(\frac{n\pi(c+d)}{2L_2}\right) \right\} \qquad (8.4)$$

where

$$\mathrm{sinc}(g) = \frac{\sin(g)}{g},$$

$Q_{f_i}$ is the heat dissipated by module $i$ and $M$ is the allocated module set.

Since the solution has the form of an infinite series, a criterion is needed to truncate the summation at a given required precision. Through a closer inspection of the infinite double Fourier cosine series, a rule of thumb, $m = 6L_1/U$, $n = 6L_2/V$, is used which generally allows the temperature to converge within 1 percent of the final value [LPM89].

⁴ A box function, $B(x, y)$, is a function with zero value everywhere, except a constant value inside a rectangular area.

[Figure 8.4: Finite Difference Approximation and the Corresponding Nodal Network. Labels: node i, cube, topmost layer, nodal network for each layer.]

8.3 Derivation of the Numerical Solution

The basic principle of the finite difference approach is derived from differential equations via Taylor's expansion. Instead of using finite difference equations directly, these equations can be interpreted in a physical way to permit a more convenient application. Consider the die structure shown in Figure 8.4. The die is divided into a number of cubes. The cross-sectional area of each cube equals the area of a unit cell,⁵ so a larger functional module (such as a multiplier) is divided into several cells. In this case, the heat dissipated in the module is assumed to be equally distributed among all topmost cubes which cover the module. The heat dissipation of a cube is assumed to be a point heat source on the geometric center of the cube.
A nodal network is thus obtained, representing the structure under a steady-state condition. Each node $i$, which is the geometric center of cube $i$, must satisfy the equation

$$\sum_{j \in B_i} \frac{T_j - T_i}{R_{ij}} + q_i = 0 \qquad (8.5)$$

where $B_i$ is the set of all neighboring nodes adjacent to node $i$ and $R_{ij}$ is the thermal resistance between node $i$ and node $j$, which equals $\frac{S_{ij}}{K \cdot A_{ij}}$. $S_{ij}$ denotes the conduction distance between node $i$ and node $j$ and $A_{ij}$ is the cross-sectional area for heat conduction normal to $S_{ij}$. $q_i$ is the heat produced in the volume lump at $i$ (i.e. cube $i$), which is the average heat dissipated in the module divided by the number of topmost cubes covering the module. Note that the heat produced by a module is assumed to come only from the topmost layer of cubes, since the power is mainly consumed by semiconductor circuits implanted on the surface of the silicon substrate, during signal switching.

⁵ We use a constructive approach on floorplanning in the preliminary 3D scheduling research. We define a unit cell as a primitive block.

1.2 μm Technology
                   Delay    Area
8 bit adder        13 ns    75 mil²
8 bit multiplier   32 ns    903 mil²

Table 8.1: Library Set Created by ChipCrafter Silicon Compiler

The formulation so far is restricted to interior points of the structure in which the heat conduction is taking place. In order to solve the equation set represented by Equation 8.5, imposed boundary conditions must also be satisfied. The boundary condition here is the equilibrium temperature on the bottom of the structure surface, which is the surface of the package [TR89]. The temperature is assumed to be uniformly distributed on the surface of the package due to the high heat conductivity of the package material. By solving the n-equation set simultaneously, the temperature profile can be obtained. The profile of the thermal stress can then be calculated.
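The nodal-network formulation of this section can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions, not the program used in the thesis: it assumes a single uniform conductivity, cubes of equal side length, adiabatic top and lateral faces, a package surface held at one uniform temperature, and it solves the node equations by Gauss-Seidel sweeps (the text only says the equations are solved simultaneously). All function and parameter names are hypothetical.

```python
NEIGHBOUR_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def cube_heat(max_power_w, utilization, n_top_cubes):
    """Heat per topmost cube: average module power divided by the cubes covering it."""
    return max_power_w * utilization / n_top_cubes


def solve_nodal_network(nx, ny, nz, side_m, k, q_top, t_package=0.0,
                        tol=1e-12, max_sweeps=200000):
    """Solve sum_j (T_j - T_i) / R_ij + q_i = 0 for every cube centre.

    q_top maps (ix, iy) to the heat in watts injected into the topmost cube
    (iz == 0) of that column. Returns temp[ix][iy][iz].
    """
    # Between adjacent cube centres: 1 / R_ij = K * A_ij / S_ij = K * side^2 / side.
    g = k * side_m
    # From a bottom cube centre to the package surface the distance is side / 2.
    g_bottom = 2.0 * k * side_m
    temp = [[[t_package for _ in range(nz)] for _ in range(ny)] for _ in range(nx)]
    for _ in range(max_sweeps):
        largest_change = 0.0
        for ix in range(nx):
            for iy in range(ny):
                for iz in range(nz):
                    inflow = q_top.get((ix, iy), 0.0) if iz == 0 else 0.0
                    conductance = 0.0
                    for dx, dy, dz in NEIGHBOUR_OFFSETS:
                        jx, jy, jz = ix + dx, iy + dy, iz + dz
                        if 0 <= jx < nx and 0 <= jy < ny and 0 <= jz < nz:
                            inflow += g * temp[jx][jy][jz]
                            conductance += g
                    if iz == nz - 1:  # boundary condition: fixed package temperature
                        inflow += g_bottom * t_package
                        conductance += g_bottom
                    updated = inflow / conductance
                    largest_change = max(largest_change, abs(updated - temp[ix][iy][iz]))
                    temp[ix][iy][iz] = updated
        if largest_change < tol:
            break
    return temp


if __name__ == "__main__":
    # A 2 mW adder at 80% utilization covering one cube of an assumed 3 x 3 x 2 grid.
    q = {(1, 1): cube_heat(2.0e-3, 0.8, 1)}
    profile = solve_nodal_network(3, 3, 2, side_m=1.0e-3, k=150.0, q_top=q)
    print(profile[1][1][0])
```

Each sweep rewrites Equation 8.5 for one node as a conductance-weighted average of its neighbours plus the injected heat. Missing neighbours at the top and lateral faces simply contribute no term, which is how the adiabatic sides are represented here.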
8.4 Thermal Experiments and Results

The thermal model presented in this chapter was used to experiment with the ADAM high-level synthesis tools developed at USC. Figure 8.5 shows an example of a data flow graph which is used as an input for high-level synthesis. Table 8.1 lists the module library set, which was derived from the Cascade Design Automation ChipCrafter silicon compiler. The data path synthesis program used to experiment with the thermal model is based on the preliminary 3D scheduling research, which incorporates interconnection delays during scheduling using floorplanning. A two-time-step schedule for a non-pipelined FIR filter design is shown with the cross-hatched line in Figure 8.5.

[Figure 8.5: A 2-time-step Non-pipelined FIR Filter Design. Data flow graph of add and mul operations divided into Time Step 1 and Time Step 2.]

[Figure 8.6: The Floorplan for the FIR Filter Design with Minimum Operators. Annotations: 1.2 μm technology; total delay on time step 1 = 70 ns; total delay on time step 2 = 121 ns; wiring on the critical path is marked for each time step; the utilization rate of the marked adders is 50%.]

[Figure 8.7: The Floorplan for the FIR Filter Design with a Redundant Adder. Annotations: 1.2 μm technology; total delay on time step 1 = 70 ns; total delay on time step 2 = 115 ns; wiring on the critical path is marked for each time step; the utilization rate of the marked adders is 50%.]

The 3D scheduling algorithm minimizes interconnection delays along critical paths to improve the design performance.
The software tries to introduce redundant operators⁶ to alleviate interconnection delays when the synthesized design cannot achieve user-specified timing constraints (see Section 4 for more details). For example, the floorplan of the two-time-step non-pipelined FIR filter design in Figure 8.5 is shown in Figure 8.6. In this example, the feasible design with minimum operators allocated contains 8 adders and 4 multipliers. The software found that interconnection delays along critical paths are reduced by introducing one more adder, which is shown shaded in Figure 8.7, in this two-time-step non-pipelined FIR filter design.

⁶ Redundant operators are operators not required for the feasible design with minimum operators.

The thermograms of the design with minimum operators and the design with a redundant adder are shown in Figures 8.8a and 8.8b, respectively. The temperature shown in the thermograms is the temperature difference between the top surface of the silicon substrate and the bottom surface of the package. The thermal profiles were calculated by both analytical and numerical methods. The results of these two solutions match, but the analytical solution provides a more flexible and efficient way to compute simulation results. The complexity of the analytical solution is dependent on the number of heat sources and the number of terms in the infinite Fourier series to be summed. For example, in the design with minimum operators, it took on average 2.2 seconds of CPU time on a Solbourne 5e/900 machine (or 2.25 seconds of CPU time on a SUN 4/460 machine) to compute one sample temperature when there are twelve heat sources on the top surface of the silicon substrate.

In order to demonstrate the problem of heat concentration, adders in our experiments are considered to be "hot-problem devices" which produce three times the amount of heat that multipliers produce when both adders and multipliers are fully utilized.
The thermogram of the design with minimum operators in Figure 8.8a presents a smaller "hot-spot" as compared to the thermogram of the design with a redundant adder in Figure 8.8b. The reason was obvious after these two floorplans were compared and analyzed. The design with a redundant adder does produce better system performance as compared to the design with minimum operators. However, the design with a redundant adder attempts to use modules around the central area intensively to shorten wiring delays, which results in a large amount of heat being produced in the central area, causing heat dissipation problems.

To resolve the heat-concentration problem, two solutions are proposed. First, by spreading adders around the problem area over the unused space on the floorplan, the heat-concentration problem can be alleviated. We applied this strategy to the FIR filter example. The resultant floorplan and thermogram are shown in Figures 8.9 and 8.8c, respectively. The results are successful. The temperature dropped about 20%.⁷ This heat-balancing strategy resolved the heat-concentration problems at the cost of performance degradation.

⁷ The dropped percentage is the ratio of reduced temperature versus overall temperature difference.

[Figure 8.8: The Thermograms of the FIR Filter Design. Panels: a) Minimum Operator Design; b) One Redundant Adder Design; c) Thermal-Improved Design.]

[Figure 8.9: The Floorplan for the Heat-Balanced FIR with a Redundant Adder. Annotations: 1.2 μm technology; total delay on time step 1 = 70 ns; total delay on time step 2 = 118 ns; wiring on the critical path is marked for each time step; the utilization rate of the marked adders is 50%.]

In a design whose timing constraint is crucial, the synthesis program or designer does not allow any degradation of the system performance. In this case, we propose another strategy: allocating additional redundant adders to reduce the utilizations of adders around the central area. The resulting floorplan is shown in Figure 8.10. Three redundant adders were allocated to reduce the heat-concentration problem in this example. Utilizations of some adders are reduced in this floorplan. The related thermogram shown in Figure 8.11 reveals that both the thermal constraint and the performance constraint are achieved simultaneously. The thermal improvement achieved is the same as in the previous case: the temperature is reduced by about 20%. (Indeed, the thermal profile of this floorplan is slightly better than the previous one.)

[Figure 8.10: The Floorplan for the Heat-Balanced FIR with Redundant Adders. Annotations: 1.2 μm technology; total delay on time step 1 = 70 ns; total delay on time step 2 = 115 ns; wiring on the critical path is marked for each time step; the utilization rate of the marked adders is 50%.]

[Figure 8.11: The Thermogram of the Heat-Balanced FIR with Redundant Adders]

These two design cases whose thermal properties have been improved illustrate that design tradeoffs are possible between area, performance and reliability factors. We believe tradeoffs can be done much better if thermal effects are simultaneously considered during the data path synthesis process instead of considering them as two problems separately and/or sequentially.

Chapter 9
Conclusions and Future Research

This thesis has attempted to solve many design issues associated with high-level synthesis. Most design automation systems consist of a collection of tools which are individually focused on specific optimizations and do not have a global understanding of the problem due to the high complexity of synthesis.
As a result, excessive post-design iterations may be required to produce good quality designs. The work presented here describes an integrated high-level synthesis approach in which a number of design issues such as effects of interconnections and thermal problems are concurrently being considered during data path synthesis. Both pipelined and non-pipelined designs with or without conditional branches are considered in our high-level synthesis system. The nested inner-loop problem for a non-pipelined design is also addressed. Different design strategies are proposed to trade off area against performance while the functionality of a circuit design is still preserved.

The research presented in this thesis can be classified into three major areas: data path synthesis, control path synthesis, and thermal analysis and synthesis. Below we describe the contributions made in each area by this thesis. We also list some open problems for future research, which in some cases are related problems that could not be fully addressed in this work, but in other cases arise as a direct consequence of the contributions in this work itself.

9.1 Data Path Synthesis

In this thesis, a new approach to the data path synthesis problem is addressed. We consider scheduling, module allocation, module assignment and floorplanning simultaneously during data path synthesis. The main objective of our approach is to produce better schedules and maximize the sharing among allocated modules for register-transfer level designs. A floorplan is constructed incrementally during data path synthesis which feeds back effects of interconnections to the scheduling process in order to guide the scheduler to determine the next scheduling iteration. We have shown the effects of interconnections in our experiments.
Totally ignoring the existence of interconnections during the scheduling process may produce poor schedules in register-transfer level designs that lead to unnecessary performance degradation. Taking effects of interconnections into account during scheduling is thus necessary in order to produce better schedules (i.e. better designs), especially in designing submicron integrated circuits. In our experiments, the placement and pin assignment of a module greatly affect the area and performance of a circuit design. Also, the performance of a circuit design is dramatically influenced by its schedule.

In future work, a more accurate delay estimation model, which considers various sizes of output buffers, is needed in order to provide more accurate timings to the scheduler. A more precise estimate of the wiring area and delay, information on pin locations of a module and a global routing procedure may be incorporated during floorplanning to increase the accuracy in estimating effects of interconnections in the 3D scheduling algorithm.

9.2 Control Path Synthesis

In the research on control path synthesis, two possible controller implementation strategies are presented: controllers with status registers and without status registers. Intuitively, controllers without status registers would seem better, because fewer state encoding bits are used. However, experimental results contradicted our initial expectation: controllers implemented using status registers are smaller in all non-pipelined design cases tested and many pipelined design cases. After we traced one of the non-pipelined design cases, we believe that the improvement is possibly due to the use of a multiple-code state encoding (instead of a conventional Unicode state encoding) in controllers using status registers. Therefore, in some design cases, controllers implemented using status registers are smaller than those without status registers after logic simplification.
The controller design which uses status registers provides an effective heuristic for producing multiple-code state encodings using an Unicode state encoding algorithm. In general, a multiple-code state encoding, if done intelligently, is expected to produce better results than an Unicode state encoding. However, to the best of our knowledge, there is no current reported work on multiple-code state encoding algorithms. One possible reason is the high complexity of computing optimal multiple-code state encodings.

There is still much work to be done here. More research on controller design with and without status registers may lead to more concrete results that could help us choose a better controller implementation style in an early design stage of control path synthesis. This research will also assist us to develop more powerful multiple-code state encoding algorithms by giving hints to Unicode state encoding algorithms. We believe that with further research on multiple-code state encodings, we may be able to derive more powerful and efficient heuristics for the state encoding problem.

9.3 Thermal Analysis and Synthesis

The main objective of this approach is to floorplan heat-balanced designs by avoiding overuse of the functional modules around the central area. We have shown a localized heat concentration which induces a high thermal stress in the problem area and causes reliability problems. To calculate the thermal profile on the surface of an integrated circuit, a four-layer thermal model and a simplified two-layer thermal model were proposed. Both analytical and numerical solutions were investigated for the two-layer thermal model in our experiments and they produced similar results. However, the analytical solution gives us more flexibility and efficiency during the computation of thermal profiles.
We also found that thermal problems may be successfully alleviated by rearranging functional modules around the problem area or by introducing extra redundant operators.

In future work, a more exact thermal model should be studied which allows more layers to be considered in the computation of thermal profiles. Computation time may be improved by using an infinite plate model [LMP89] when the dimension ratio of heat sources to die structure is large.

Reference List

[AHU74] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, editors. The Design and Analysis of Computer Algorithms, chapter 2, pages 52-55. Addison Wesley Publishing Company, 1974.
[BG90] F. Brewer and D. Gajski. Chippe: A System for Constraint Driven Behavioral Synthesis. IEEE Transactions on Computer-Aided Design, 9(7):681-695, July 1990.
[CA78] R. Castello and P. Antognetti. Integrated-Circuit Thermal Modeling. IEEE Journal of Solid-State Circuits, SC-13(3):363-366, June 1978.
[Che91] C. Chen. VHDL2DDS: A VHDL Language to DDS Data Structure Translator. Technical Report 91-21, Department of Electrical Engineering, University of Southern California, July 1991.
[Con89] J. Cong. Pin Assignment with Global Routing. In Proceeding of 1989 IEEE Int. Conf. on Computer-Aided Design, pages 302-305, November 1989.
[CPB88] C. C. Chen, A. L. Palisoc, and J. M. Baynham. Thermal Analysis of Solid-State Devices Using Boundary Element Method. IEEE Transactions on Electronic Devices, 35(7):1151-1153, July 1988.
[CPTR89] C. Chu, M. Potkonjak, M. Thaler, and J. Rabaey. Hyper: An Interactive Synthesis Environment for High Performance Real Time Applications. In Proceeding of 1989 IEEE Int. Conf. on Computer Design, pages 432-435, October 1989.
[CT90] R. J. Cloutier and D. E. Thomas. The Combination of Scheduling, Allocation, and Mapping in a Single Algorithm. In Proceeding of 27th ACM/IEEE Design Automation Conference, pages 71-76, July 1990.
[DK87] W. Dai and E. S. Kuh. Simultaneous Floor Planning and Global Routing for Hierarchical Building-Block Layout. IEEE Transactions on Computer-Aided Design, CAD-6(5):828-837, September 1987.
[Gaj88a] D. Gajski, editor. Silicon Compilation, chapter 8, pages 311-360. Addison-Wesley Publishing Company, Inc., 1988.
[Gaj88b] D. Gajski, editor. Silicon Compilation, chapter 7, pages 204-310. Addison-Wesley Publishing Company, Inc., 1988.
[Gir84] E. Girczyc. Automatic Generation of Microsequenced Data Paths to Realize ADA Circuit Descriptions. PhD thesis, Carleton University, July 1984.
[Gup91] P. Gupta. PLA and Wire Delay Analysis. Department of Electrical Engineering, University of Southern California, 1991. Internal Report.
[GVM87] G. Goossens, J. Rabaey, J. Vandewalle, and H. De Man. An Efficient Microcode Compiler for Custom DSP Processors. In Proceeding of 1987 IEEE Int. Conf. on Computer-Aided Design, pages 24-27, November 1987.
[Hag91] J. W. Hagerman. A Fast and Accurate Technique for Function Unit Allocation Estimation. Technical Report CMUCAD-91-28, Carnegie Mellon University, April 1991.
[Hay88] J. Hayes. Computer Architecture and Organization. McGraw-Hill Book Company, second edition, 1988.
[HCLH90] C. Huang, Y. Chen, Y. Lin, and Y. Hsu. Data Path Allocation Based on Bipartite Weighted Matching. In Proceeding of 27th ACM/IEEE Design Automation Conference, pages 499-504, June 1990.
[HGF89] A. Herrigal, M. Glaser, and W. Fichtner. A Global Floorplanning Technique for VLSI Layout. In Proceeding of 1989 IEEE Int. Conf. on Computer-Aided Design, pages 92-95, November 1989.
[Jai89] R. Jain. High-Level Area-Delay Prediction with Application to Behavioral Synthesis. PhD thesis, University of Southern California, July 1989.
[JMP88] R. Jain, M. J. Mlinar, and A. C. Parker. Area-Time Model for Synthesis of Non-Pipelined Designs. In Proceeding of 1988 IEEE Int. Conf. on Computer-Aided Design, pages 48-51, November 1988.
[JPP87] R. Jain, A. C. Parker, and N. Park. Predicting Area-Time Tradeoffs for Pipelined Designs. In Proceeding of 24th ACM/IEEE Design Automation Conference, pages 35-41, June 1987.
[Jr.83] A. J. Boldgett Jr. Microelectronic Packaging. Scientific American, pages 86-96, July 1983.
[KK91] J. Kim and F. Kurdahi. Synthesis of Time-Stationary Controllers for Pipelined Data Paths. In Proceeding of 1991 IEEE Int. Conf. on Computer-Aided Design, pages 30-33, November 1991.
[Kna86] D. Knapp. A Planning Model of the Design Process. PhD thesis, University of Southern California, December 1986.
[Kna90] D. Knapp. Feedback-Driven Datapath Optimization in Fasolt. In Proceeding of 1990 IEEE Int. Conf. on Computer-Aided Design, pages 300-303, November 1990.
[KP87] F. Kurdahi and A. C. Parker. REAL: A Program for REgister ALlocation. In Proceeding of 24th ACM/IEEE Design Automation Conference, pages 210-215, June 1987.
[KP89] F. J. Kurdahi and A. C. Parker. Techniques for Area Estimation of VLSI Layouts. IEEE Transactions on Computer-Aided Design, 8(1):81-92, January 1989.
[KP90] K. Kucukcakar and A. Parker. Data Path Tradeoffs using MABAL. In Proceeding of 27th ACM/IEEE Design Automation Conference, pages 511-516, June 1990.
[KR91] F. J. Kurdahi and C. Ramachandran. LAST: Layout Area and Shape Function EsTimator. In Proceeding of 1st ACM/IEEE European Design Automation Conference, pages 351-355, February 1991.
[Lau79] U. Lauther. A Min-Cut Placement Algorithm for General Cell Assemblies Based on a Graphical Representation. In Proceeding of 16th ACM/IEEE Design Automation Conference, pages 1-10, 1979.
[Len90] T. Lengauer. Combinatorial Algorithms for Integrated Circuit Layout. Wiley-Teubner Series in Computer Science, 1990.
[LMP89] C. C. Lee, Y. J. Min, and A. L. Palisoc. A General Integration Algorithm for the Inverse Fourier Transform of Four-Layer Infinite Plate Structure. IEEE Transactions on Components, Hybrids, Manufacturing Technology, 12(4):710-716, December 1989.
[LPM89] C. C. Lee, A. L. Palisoc, and Y. J. Min. Thermal Analysis of Integrated Circuit Devices and Packages. IEEE Transactions on Components, Hybrids, Manufacturing Technology, 12(4):701-709, December 1989.
[MAD83] L. M. Mahalingam, J. A. Andrews, and J. E. Drye. Thermal Analysis on Pin Grid Array Packages for High Density LSI and VLSI Logic Circuits. IEEE Transactions on Components, Hybrids, Manufacturing Technology, CHMT-6:246-256, September 1983.
[McF86] M. McFarland. Using Bottom-up Design Techniques in the Synthesis of Digital Hardware from Abstract Behavioral Descriptions. In Proceeding of 23rd ACM/IEEE Design Automation Conference, pages 474-480, July 1986.
[Mil90] Military Handbook MIL-HDBK-217E. Reliability Prediction of Electronic Equipment. Rome Air Development Center, February 1990.
[Mli91] M. Mlinar. Control Path/Data Path Tradeoffs in VLSI Design. PhD thesis, University of Southern California, May 1991.
[MPC88] M. McFarland, A. Parker, and R. Camposano. Tutorial on High-Level Synthesis. In Proceeding of 25th ACM/IEEE Design Automation Conference, pages 330-336, July 1988.
[NCP82] A. Nagle, R. Cloutier, and A. Parker. Synthesis of Hardware for the Control of Digital Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, CAD-1(4):201-212, October 1982.
[OP90] M. Osterman and M. Pecht. Placement for Reliability and Routability of Convectively Cooled PWB's. IEEE Transactions on Computer-Aided Design, 9(7):734-744, July 1990.
[Ott83] R. H. J. M. Otten. Efficient Floorplan Optimization. In Proceeding of 1983 IEEE Int. Conf. on Computer Design, pages 499-502, October 1983.
[Pan88] B. M. Pangrle. Splicer: A Heuristic Approach to Connectivity Binding. In Proceeding of 25th ACM/IEEE Design Automation Conference, pages 536-541, June 1988.
[PD86] D. P. La Potin and S. W. Director. Mason: A Global Floorplanning Approach for VLSI Design. IEEE Transactions on Computer-Aided Design, CAD-5(4):477-489, October 1986.
[Ped91] M. Pedram. An Integrated Approach to Logic Synthesis and Physical Design. PhD thesis, University of California at Berkeley, August 1991.
[PK89] P. Paulin and J. Knight. Force-Directed Scheduling for the Behavioral Synthesis of ASIC's. IEEE Transactions on Computer-Aided Design, June 1989.
[PMSK90] M. Pedram, M. Marek-Sadowska, and E. Kuh. Floorplanning with Pin Assignment. In Proceeding of 1990 IEEE Int. Conf. on Computer-Aided Design, pages 98-101, November 1990.
[PP88] N. Park and A. Parker. Sehwa: A Software Package for Synthesis of Pipelines from Behavioral Specifications. IEEE Transactions on Computer-Aided Design, 7(3):356-370, March 1988.
[PPM86] A. Parker, J. Pizarro, and M. Mlinar. MAHA: A Program for Datapath Synthesis. In Proceeding of 23rd ACM/IEEE Design Automation Conference, pages 461-466, July 1986.
[Sak83] T. Sakurai. Approximation of Wiring Delay in MOSFET LSI. IEEE Journal of Solid-State Circuits, SC-18(4):418-426, August 1983.
[Sto83] L. Stockmeyer. Optimal Orientations of Cells in Slicing Floorplan Designs. Information and Control, 57:91-101, 1983.
[TPL+86] C. Tseng, A. M. Prabhu, C. Li, Z. Mehmood, and M. M. Tong. A Versatile Finite State Machine Synthesizer. In Proceeding of 1986 IEEE Int. Conf. on Computer-Aided Design, pages 206-209, November 1986.
[TR89] R. Tummala and E. Rymaszewski. Microelectronics Packaging Handbook, chapter 4, pages 167-224. Van Nostrand Reinhold, 1989.
[TWR+88] C. Tseng, R. Wei, S. G. Rothweiler, M. M. Tong, and A. K. Bose. Bridge: A Versatile Behavioral Synthesis System. In Proceeding of 25th ACM/IEEE Design Automation Conference, pages 415-420, June 1988.
[UT86] S. Unger and C. Tan. Clocking Schemes for High-Speed Digital Systems. IEEE Transactions on Computers, C-35(10):880-895, October 1986.
[VHD88] The Institute of Electrical and Electronics Engineers, Inc. IEEE Standard VHDL Language Reference Manual, March 1988. IEEE Std 1076-1987.
[WFM83] D. L. Waller, L. R. Fox, and R. J. Mannemann. Analysis of Surface Mount Thermal and Thermal Stress Performance. IEEE Transactions on Components, Hybrids, Manufacturing Technology, CHMT-6:257-266, September 1983.
[WL86] D. F. Wong and C. L. Liu. A New Algorithm for Floorplan Design. In Proceeding of 23rd ACM/IEEE Design Automation Conference, pages 101-107, June 1986.
[WTL91] W. Wolf, A. Takach, and T. Lee. High-Level VLSI Synthesis, chapter 10, pages 231-254. Kluwer Academic Publishers, 1991.
[WW90] T. Wang and D. F. Wong. An Optimal Algorithm for Floorplan Area Optimization. In Proceeding of 27th ACM/IEEE Design Automation Conference, pages 180-186, June 1990.
[Zim88] G. Zimmerman. A New Area and Shape Function Estimation Technique for VLSI Layouts. In Proceeding of 25th ACM/IEEE Design Automation Conference, pages 60-65, June 1988.
[ZSRM90] J. Zegers, P. Six, J. Rabaey, and H. De Man. CGE: Automatic Generation of Controllers in the CATHEDRAL-II Silicon Compiler. In Proceeding of 1990 IEEE Int. Conf. on Computer-Aided Design, pages 617-621, November 1990.

Appendix A

Layouts of the 10-Time-Step Non-Pipelined FIR Filter Design

This appendix shows the partial bindings produced by LADS and the layouts of the 10-time-step non-pipelined FIR filter design example presented in Chapter 6. The layouts were produced using ChipCrafter. Solid lines in the layouts show the critical paths obtained by the timing simulator in ChipCrafter. The figures are scaled in order to reflect the relative dimensions among the layouts. Five layouts were created for each schedule. ChipCrafter I, ChipCrafter II and ChipCrafter III use the placements produced by ChipCrafter with different cooling schedules.
Operation   Operator   Register   Period
add1        ADD01      REG01      3 - 3
add2        ADD01      REG01      2 - 2
add3        ADD01      REG01      4 - 4
add4        ADD01      REG01      5 - 5
add5        ADD01      REG01      6 - 6
add6        ADD01      REG01      7 - 7
add7        ADD01      REG01      8 - 8
add8        ADD01      REG01      9 - 9
mul1        MUL01      REG02      4 - 4
mul2        MUL01      REG02      3 - 3
mul3        MUL01      REG02      5 - 5
mul4        MUL01      REG02      5 - 6
mul5        MUL01      REG02      7 - 7
mul6        MUL01      REG02      8 - 8
mul7        MUL01      REG02      9 - 9
mul8        MUL01      REG01      10 - 10
adda        ADD02      REG03      5 - 5
addb        ADD02      REG03      6 - 6
addc        ADD02      REG03      7 - 7
addd        ADD02      REG03      8 - 8
adde        ADD02      REG03      9 - 9
addf        ADD02      REG02      10 - 10
addg        ADD02      None       N.A.

Table A.1: The Bindings for the 10-Time-Step Non-Pipelined FIR Filter Design Produced by LADS

Figure A.1: The Layout of ChipCrafter Placement I using LADS Schedule
Figure A.2: The Layout of ChipCrafter Placement II using LADS Schedule
Figure A.3: The Layout of ChipCrafter Placement III using LADS Schedule
Figure A.4: The Layout using LADS Schedule and Floorplan without Pin Assignment
Figure A.5: The Layout using LADS Schedule and Floorplan with Pin Assignment
Figure A.6: The Layout of ChipCrafter Placement I using FDS Schedule
Figure A.7: The Layout of ChipCrafter Placement II using FDS Schedule
Figure A.8: The Layout of ChipCrafter Placement III using FDS Schedule
Figure A.9: The Performance-Driven Layout using FDS Schedule without Pin Assignment
Figure A.10: The Performance-Driven Layout using FDS Schedule with Pin Assignment
Figure A.11: The Layout of ChipCrafter Placement I using MAHA Schedule
Figure A.12: The Layout of ChipCrafter Placement II using MAHA Schedule
Figure A.13: The Layout of ChipCrafter Placement III using MAHA Schedule
Figure A.14: The Performance-Driven Layout using MAHA Schedule without Pin Assignment
Figure A.15: The Performance-Driven Layout using MAHA Schedule with Pin Assignment

Appendix B

Layouts of the 4-Time-Step Non-Pipelined Differential Equation Solver Example

This appendix shows the partial bindings produced by LADS and the layouts of the 4-time-step non-pipelined differential equation solver example presented in Chapter 6. The layouts were produced using ChipCrafter. Solid lines in the layouts show the critical paths obtained by the timing simulator in ChipCrafter. The figures are scaled in order to reflect the relative dimensions among the layouts. Three layouts were created for this example. ChipCrafter I and ChipCrafter II use the placements produced by ChipCrafter with different cooling schedules.

Operation   Operator   Register   Period
mul1        MUL01      REG01      2 - 2
mul2        MUL02      REG02      2 - 2
mul3        MUL01      REG01      3 - 3
mul4        MUL02      REG02      3 - 3
mul5        MUL02      REG02      4 - 4
mul6        MUL01      REG01      4 - 4
add1        ADD01      None       N.A.
add2        ADD01      REG03      2 - 2
sub1        SUB01      REG03      4 - 4
sub2        SUB01      None       N.A.
cmp1        CMP01      None       N.A.

Table B.1: The Bindings for the 4-Time-Step Non-Pipelined Differential Equation Solver Example Produced by LADS

Figure B.1: The Layout of ChipCrafter I using LADS Schedule
Figure B.2: The Layout of ChipCrafter II using LADS Schedule
Figure B.3: The Layout using LADS Schedule and Floorplan

Appendix C

Layouts of the 12-Time-Step Non-Pipelined Elliptic Filter Design

This appendix shows the partial bindings produced by LADS and the layouts of the 12-time-step non-pipelined elliptic filter design presented in Chapter 6. The layouts were produced using ChipCrafter. Solid lines in the layouts show the critical paths obtained by the timing simulator in ChipCrafter. The figures are scaled in order to reflect the relative dimensions among the layouts. ChipCrafter I-V use the placements produced by ChipCrafter with different cooling schedules. LADS Placement I and LADS Placement II use the floorplan produced by LADS but with different manual placements for the modules allocated by MABAL. All the layouts were also produced using the buffer resizing program in ChipCrafter to improve design performance.

Operation   Operator   Register   Period
add1        ADD01      REG01      2 - 2
add2        ADD01      REG03      3 - 3
add3        ADD02      REG02      2 - 2
add4        ADD03      REG03      2 - 2
add5        ADD03      REG01      3 - 3
mul1        SQR01      REG03      5 - 5
mul2        SQR01      REG01      4 - 4
add6        ADD01      REG03      6 - 6
add7        ADD01      REG02      5 - 5
add8        ADD02      REG02      6 - 6
mul3        SQR01      REG02      7 - 7
add9        ADD01      REG01      12 - 12
add10       ADD02      REG01      5 - 5
mul8        SQR01      REG01      6 - 6
add11       ADD03      None       N.A.
add13       ADD01      REG01      8 - 8
add14       ADD01      REG03      7 - 7
add15       ADD02      REG03      8 - 8
mul4        SQR01      REG02      9 - 9
add16       ADD01      REG01      10 - 10
add17       ADD02      None       N.A.
add18       ADD02      REG03      9 - 9
mul5        SQR01      REG02      10 - 10
add19       ADD01      REG01      11 - 11
mul6        SQR01      REG02      12 - 12
add20       ADD03      REG01      7 - 7
mul7        SQR01      REG02      8 - 8
add21       ADD02      None       N.A.
add22       ADD02      None       N.A.
add23       ADD03      None       N.A.
add24       ADD01      None       N.A.
add25       ADD01      REG01      9 - 9
add26       ADD02      None       N.A.
add27       ADD02      None       N.A.

Table C.1: The Bindings for the 12-Time-Step Non-Pipelined Elliptic Filter Design Produced by LADS

Figure C.1: The Layout of ChipCrafter Placement I before Buffer Resizing
Figure C.2: The Layout of ChipCrafter Placement II before Buffer Resizing
Figure C.3: The Layout of ChipCrafter Placement III before Buffer Resizing
Simulation Critical Path: REG03 -> MUX02 -> ADD01 -> MUX05 -> ADD02 -> MUX08 -> ADD03 -> MUX10 -> REG03
Figure C.4: The Layout of ChipCrafter Placement IV before Buffer Resizing
Figure C.5: The Layout of ChipCrafter Placement V before Buffer Resizing
Figure C.6: The Layout using LADS Placement I before Buffer Resizing
Figure C.7: The Layout using LADS Placement II before Buffer Resizing
Figure C.8: The Layout of ChipCrafter Placement I after Buffer Resizing
Figure C.9: The Layout of ChipCrafter Placement II after Buffer Resizing
Figure C.10: The Layout of ChipCrafter Placement III after Buffer Resizing
Simulation Critical Path: REG03 -> MUX02 -> ADD01 -> MUX05 -> ADD02 -> MUX08 -> ADD03 -> MUX10 -> REG03
Figure C.11: The Layout of ChipCrafter Placement IV after Buffer Resizing
Figure C.12: The Layout of ChipCrafter Placement V after Buffer Resizing
Figure C.13: The Layout using LADS Placement I after Buffer Resizing
Figure C.14: The Layout using LADS Placement II after Buffer Resizing
Asset Metadata
Core Title
00001.tif
Tag
OAI-PMH Harvest
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC11255701
Unique identifier
UC11255701
Legacy Identifier
DP22877