INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer.

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book.

Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

UMI
A Bell & Howell Information Company
300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA
313/761-4700  800/521-0600

HIGH-LEVEL SYNTHESIS FOR ASYNCHRONOUS SYSTEM DESIGN

by

Tzyh-Yung Wuu

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

May 1995

Copyright 1995 Tzyh-Yung Wuu

UMI Number: 9625042

UMI Microform 9625042
Copyright 1996, by UMI Company. All rights reserved.
This microform edition is protected against unauthorized copying under Title 17, United States Code.

UMI
300 North Zeeb Road
Ann Arbor, MI 48103

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Tzyh-Yung Wuu under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of

DOCTOR OF PHILOSOPHY

Dean of Graduate Studies

Date

DISSERTATION COMMITTEE

Chairperson

Tzyh-Yung Wuu
Sarma Sastry

High-Level Synthesis for Asynchronous System Design

This thesis presents a method for the automatic synthesis of asynchronous digital systems from high-level data flow specifications. An extended data flow model that accurately reflects the behavior of the asynchronous components is presented so that the data flow specification can be mapped directly into a hardware realization. In addition, a timing model for the basic asynchronous building blocks is developed. Based on the block-level timing model, a precise and concise timing model for high-level design representations is derived. For data path synthesis, the sequencing and allocation problem and the register minimization problem are formulated as graph-theoretic problems, and algorithms are developed to solve these two problems. For control synthesis, design templates in the data flow graph are developed that implement the resource sharing required by the sequencing and allocation result. Finally, a real FIR digital filter is designed to demonstrate the design process of this thesis.
The results of this design show the effectiveness of this design methodology.

Dedication

To my parents.

Acknowledgements

I thank my advisor Prof. Sarma Sastry for his guidance, support, and encouragement. During my Ph.D. studies, I developed a feeling of deep respect and admiration for him as a researcher as well as a person. His enthusiasm towards research and his insight into the problems of design automation made this thesis possible.

I thank Prof. Alice Parker for her invaluable advice and comments. Her excellent critiques of my work have enhanced the depth of this thesis. I would also like to thank her for warm support throughout my doctorate years.

I thank Prof. Kennith Alexander for serving on my dissertation committee and for taking the time to read my thesis, and Prof. Jean-Luc Gaudiot and Prof. Sandeep Gupta for serving on my guidance committee.

I owe a special debt of gratitude to MOSIS of USC/ISI, which has supported my research and provided the resources to complete my Ph.D. work. Especially, I thank Dr. Shih-Lien Lu, Jeff Sondeen, and Jen-I Pi for their friendship and encouragement when I was working at MOSIS.

I thank my friends Dr. Yung-Te Lai, Dr. King Ho, Dr. Ravi Kumar, Ming-Yeh Shen, Chen-Huan Chiang, and Chih-Hsu Yen for their warm friendship. Thank you guys for lending a hand and helping me throughout the struggle of my doctorate years.

I thank my parents for their endless love, support, and patience, which made it possible for me to continue my graduate study, and finally, to finish this thesis.

I thank Amber Hensley for editing this thesis.

I gratefully acknowledge the financial support received from the Advanced Research Projects Agency (Grant MDA903-92-D-0020) and the Powell Foundation.

Contents

List of Tables
List of Figures

1 Introduction
  1.1 Asynchronous System Design for VLSI
  1.2 The Fundamentals of Digital Circuit Design
  1.3 Asynchronous and Synchronous Design
  1.4 High-Level Synthesis for Asynchronous System Design
  1.5 Research Overview
    1.5.1 Micropipeline-based Asynchronous Systems
    1.5.2 Data Flow Specifications
    1.5.3 Timing Model for the System Performance Estimates
    1.5.4 Data Path Synthesis
    1.5.5 Control Synthesis
    1.5.6 Register Minimization
  1.6 Related Work
    1.6.1 Asynchronous Circuit Synthesis
    1.6.2 High-Level Synthesis
      1.6.2.1 Design Representation
      1.6.2.2 Algorithmic Transformation
      1.6.2.3 Scheduling and Allocation
  1.7 Thesis Outline

2 System Overview
  2.1 Design Flow Overview
  2.2 Data Flow Graph
  2.3 Data Path Synthesis
  2.4 Control Synthesis
  2.5 Register Minimization
  2.6 Hardware Realization
  2.7 Summary

3 Micropipelines and Data Flow Specifications
  3.1 Micropipelines and Data Flow Model
    3.1.1 Data Transfers and Handshaking Protocols
    3.1.2 Realization of a Basic Block
    3.1.3 Data Flow Model for Basic Blocks
  3.2 Data Flow Specifications
    3.2.1 Basic Constructs
    3.2.2 Data
      3.2.2.1 Data Types
      3.2.2.2 Data Representations
      3.2.2.3 Data Manipulation
    3.2.3 Simulation
  3.3 The Need of an Extended Data Flow Graph
    3.3.1 Register Blocks and Computational Blocks
    3.3.2 Extended Data Flow Model for Computational Blocks
  3.4 Extended Data Flow Graph
    3.4.1 The Hardware Components for EDFG
  3.5 Timing Model for Data Flow Specification
    3.5.1 Timing Behavior Model for Basic Blocks
    3.5.2 Timing Behavior of Composed Blocks
    3.5.3 Performance Analysis of Linear Pipelines
      3.5.3.1 Performance Analysis for a Stage
      3.5.3.2 Performance Analysis of a Linear Pipeline
    3.5.4 Timing Model for EDFG and DFG
      3.5.4.1 Timing Model for EDFG
      3.5.4.2 Timing Model for DFG

4 Data Path Synthesis
  4.1 Introduction
  4.2 Timing Model for Sequencing and Allocation Problem
  4.3 Notations and Terminologies
    4.3.1 Notations and Terminologies for the Graph Theory
    4.3.2 Notations for Data Path Synthesis
  4.4 The Problem Formulation
    4.4.1 Timing Constraint and Graph Model
    4.4.2 Sequencing and Allocation
    4.4.3 The System Completion Time
  4.5 Minimizing the System Completion Time
    4.5.1 The Design Subspace Containing a Minimum Completion Time Design
    4.5.2 An Optimum Completion Time Design Generation
    4.5.3 The Branch and Bound Technique
      4.5.3.1 Bounded by the Data Dependency Constraint
      4.5.3.2 Bounded by the Resource Dependency Constraint
    4.5.4 Heuristic Algorithms
  4.6 The Problem Formulation for Minimizing the System Initiation Period
    4.6.1 Extended Graph Model for the System Initiation Period
    4.6.2 The Problem Formulation and Possible Approaches

5 Control Synthesis
  5.1 Problem Definition and Classification of Sharing Schemes
  5.2 Sharing Structure for SAFS
    5.2.1 Effects of the Sharing Scheme
  5.3 Sharing Structure for SAVS
    5.3.1 Effects of the Sharing Scheme
  5.4 Sharing Structure for DAFS
    5.4.1 Effects of the Sharing Scheme
  5.5 Sharing Structure for DAVS
    5.5.1 Effects of the Sharing Scheme
  5.6 Sharing Structure for SAFS with a Micropipelined Shared Unit
  5.7 Local Transformations
    5.7.1 Transformation for Sharing Structure Reduction

6 Register Minimization
  6.1 Timing Model for Register Minimization
  6.2 The Problem Formulation
    6.2.1 Graph Model for Register Minimization
    6.2.2 Unnecessary Registers
  6.3 The Maximum Cost Unnecessary Register Set
    6.3.1 An Optimum Set Generation
    6.3.2 Heuristic Algorithms

7 FIR Design
  7.1 DFG Specification
  7.2 Data Path Synthesis
  7.3 Control Synthesis
  7.4 Register Minimization
  7.5 Implementation Results

8 Conclusions and Future Research
  8.1 Conclusions
  8.2 Future Research

List of Tables

1.1 Circuit classification in terms of the operation mode and the delay model.
2.1 Timing parameters for the operations in the multiplier.
3.1 The input-output data type relation of basic constructs.
4.1 Timing parameters for each operation in Figure 4.2(c).
4.2 The comparison of the longest path delays between the SA graph in Figure 4.3 and the SA graph in Figure 4.6.
4.3 Operator timing parameters for the example of active SA graph generation.
4.4 An active SA graph generation.
6.1 All RE edge sets for the SA graph in Figure 6.4(a).
6.2 Edge ordering and checking in the heuristic algorithm.
7.1 Timing parameters for the FIR filter synthesis.
7.2 The sequencing and allocation results of the 16-point FIR filter.
7.3 Timing parameters for the EDFG in the experiment of the FIR design.
7.4 Experimental results of the FIR design.

List of Figures

2.1 Design flow overview for the synthesis of asynchronous systems.
2.2 DFG description of the 4-bit add-and-shift multiplier.
2.3 Timing model for the computation of an operation.
2.4 Timing diagram for the computation of a multiplication operation.
2.5 (a) The synthesis result for the non-pipeline system. (b) The synthesis result for the pipeline system.
2.6 (a) 2 operations in the input DFG. (b) The sharing structure for 2-operation SAFS.
2.7 The output DFG for the non-pipeline system synthesis result.
2.8 The output DFG for the non-pipeline system synthesis result after local transformation.
2.9 Timing diagram for the computation of a multiplication after the removal of the ash2's input register.
2.10 (a) The timing diagram with 6 registers remaining. (b) The timing diagram after the removal of the Asher2's input register.
2.11 (a) Phantom AddShift: the AddShift with no input register. (b) Equivalent structure to the AddShift with the input register.
2.12 The EDFG description for the non-pipeline system synthesis result after register minimization.
2.13 EDFG to basic block mapping.
3.1 Data transfer and handshaking protocol. (a) Data transfer between two blocks. (b) Two-phase handshaking. (c) Four-phase handshaking.
3.2 The behavior of the Muller C-element.
3.3 The behavior of the asynchronous register.
3.4 (a) The block diagram of the basic block in micropipelines. (b) The timing diagram of the basic block.
3.5 (a) The behavior of a basic block. (b) Data flow model for the basic block.
3.6 Basic constructs of the DFG.
3.7 Example: manipulation of data types.
3.8 Special token functions.
3.9 Fork-Join factoring. (a) Before transformation. (b) After transformation.
3.10 Two-input addition.
3.11 (a) Decomposed basic blocks. (b) Corresponding EDFG constructs.
3.12 (a) Timing diagram of a register block. (b) Timing diagram of a computational block.
3.13 (a) The behavior of a computational block. (b) Extended data flow model.
3.14 Basic constructs of EDFG.
3.15 The behavioral differences between the phantom Fork and the Fork. (a) State diagram of the phantom Fork. (b) Partial state diagram of the Fork.
3.16 (a) A Fork. (b) A phantom Fork with an input Storage.
3.17 Generic basic blocks.
3.18 Block design for the 4-output Fork.
3.19 (a) Timing model for a register block. (b) Timing model for a non-register block.
3.20 The process of merging a register block to a non-register block. (a) Before merging. (b) After merging.
3.21 The behavior of composed blocks. (a) Register block to non-register block. (b) Non-register block to register block. (c) Non-register block to non-register block. (d) Register block to register block.
3.22 (a) Block diagram of a single stage between two registers. (b) Timed Petri net description of the block diagram.
3.23 Completion time and throughput analysis of the system in Figure 3.22.
3.24 A linear pipelined system.
3.25 Timing model for storage nodes.
3.26 Timing model for non-storage nodes.
3.27 Timing model for DFG nodes.
3.28 Simplified timing model for DFG nodes.
4.1 The input-output relation of the data path synthesis.
4.2 (a) A four-addition four-multiplication DFG. (b) A sequencing and allocation over one adder and one multiplier. (c) The Gantt chart representation of the sequencing and allocation.
4.3 The timed graph with edges for the sequencing and allocation in Figure 4.2(b). (In fact, this is the SA graph representing the sequencing and allocation.)
4.4 (a) A cyclic SA graph. (b) An acyclic SA graph.
4.5 Moving a node from P_t,i to P_t,r. (a) Before adjustment. (b) After adjustment.
4.6 A left-shift SA graph of the SA graph in Figure 4.3.
4.7 Enumeration tree of the SA graph generation.
4.8 A series of computations.
5.1 The input-output relation of the control synthesis.
5.2 An abstract-sharing scheme with static allocation for 4-operation DFG to 2-operator DFG mapping. (a) Before applying the sharing scheme. (b) After applying the sharing scheme.
5.3 An abstract-sharing scheme with dynamic allocation for 4-operation DFG to 2-operator DFG mapping. (a) Before applying the sharing scheme. (b) After applying the sharing scheme.
5.4 N operations in the input DFG.
5.5 (a) Sharing structure for SAFS. (b) Definition of CFfs_N.
5.6 (a) Sharing structure for SAVS. (b) Definition of CFvs_N.
5.7 Sharing structure for DAFS.
5.8 Sharing structure for DAVS.
5.9 Sharing structure for SAFS with micropipelined shared units.
5.10 (a) Timing behavior for SAFS with a 2-stage micropipelined unit. (b) Timing behavior for DAFS with 2 non-pipelined shared units.
5.11 Constant folding. (a) Before transformation. (b) After transformation.
5.12 Multiple paths between two operators.
5.13 Multiple common source-destination path reduction. (a) Before transformation. (b) After transformation.
5.14 Multiple common multiple-source-destination path structure.
6.1 The input-output relation of the register minimization.
6.2 (a) Two-operation DFG. (b) EDFG description for the two-operation system. (c) The timing diagram of these two operations.
6.3 Graph model for the register removal. (a) Before the register at (v_i, v_j) is removed. (b) After the register at (v_i, v_j) is removed. (c) Before the registers at path (v_i, ..., v_j) are removed. (d) After the registers at path (v_i, ..., v_j) are removed.
6.4 (a) An SA graph. (b) An RE graph corresponding to E_RE1 = {(v+,1, v*,3)}. (c) An RE graph corresponding to E_RE2 = {(v+,2, v*,3)}. (d) An RE graph corresponding to E_RE3 = {(v+,2, v*,3), (v*,3, v+,4)}.
6.5 (a) Case 1: G_RE1. (b) Case 1: G_RE2. (c) Case 2: G_RE1. (d) Case 2: G_RE2.
6.6 (a) E_RE = {(v+,2, v*,3), (v*,2, v+,3)}. (b) E_RE = {(v+,2, v*,3), (v*,2, v+,3), (v*,2, v+,4)}. (c) E_RE = {(v+,2, v*,3), (v+,4, v*,4)}. (d) E_RE = {(v*,2, v+,4), (v+,4, v*,4)}.
7.1 Input DFG description for a FIR digital filter.
7.2 (a) DFG description for mem_H. (b) DFG description for SeqOlr.
7.3 DFG description of the FIR filter for synthesis.
7.4 The SA graph for the 2-MUL, 2-ADD design.
7.5 The Gantt chart for the 2-MUL, 2-ADD FIR filter.
7.6 2-part ADD and 2-part MUL.
7.7 Mapped DFG description for the 2-MUL, 2-ADD FIR design.
7.8 Reduced DFG description for the 2-MUL, 2-ADD FIR design.
7.9 The RE graph for the 2-MUL, 2-ADD design.
7.10 EDFG description for the 2-MUL, 2-ADD FIR design (Part I).
7.11 EDFG description for the 2-MUL, 2-ADD FIR design (Part II).
7.12 (a) EDFG description of mem_H. (b) EDFG description of CFfs8.1.
7.13 Layout for the 2-ADD, 2-MUL FIR design.

Chapter 1

Introduction

1.1 Asynchronous System Design for VLSI

In the past few decades, digital systems have been designed primarily using the synchronous paradigm. The dominance of the synchronous paradigm is due to the simplicity of design methodologies for synchronous circuits. As VLSI fabrication advances toward deeper submicron technologies, the gate count on a single chip will exceed one million, the logic delay of a single gate will fall below one nanosecond, and clock speeds will reach 100 MHz or faster.
The global clock signal and the timing assumptions for clocked memory elements in synchronous circuit design become a serious hurdle to the full exploitation of advanced fabrication technology. This problem worsens as device sizes are scaled down further, the number of devices on a chip continues to increase, and clock speeds continue to rise.

Asynchronous design eliminates both the need for a global clock and the problems associated with clock skew. Classical asynchronous design theory views the entire system as a finite state machine and deals primarily with the design of state machines using flow tables as the specification. This specification is exponential in complexity. Furthermore, the difficulties involved in eliminating hazards and races, and the lack of system-level asynchronous design methodologies, have prevented the widespread use of asynchronous design for the implementation of large systems.

An alternative approach is to design the system in a modular fashion. The basic idea is for all components or processes to operate at their own speed, with each component interacting with other components through signal transitions. The concept of a component is based on the level of granularity to which the system is partitioned. A component can be as simple as a Muller C-element or as complex as a floating-point multiplication unit. The basic requirement for a component is that the surrounding environment, which represents the behavior of all components connected to this component, does not violate the behavior of the component. The design of each component can be based on the same principle.

This thesis presents a system for carrying out the automatic synthesis of asynchronous systems from data flow specifications. For this thesis, a fully characterized library of asynchronous components that support the data flow model of computation through the use of a handshaking protocol was developed. Incorporating the characteristics of these components, the synthesis system repeatedly transforms an abstract system specification into a lower, more detailed specification until sufficient detail exists to permit fabrication. The goal of these transformations is either to maximize system performance under a resource constraint or to minimize resource usage under a performance constraint.

1.2 The Fundamentals of Digital Circuit Design

To build a complex, high-speed system, an understanding of the assumptions and limitations of different digital circuit styles is necessary for designers to judge whether a system design is realizable in VLSI for a given goal and constraint. There are two main classes of digital circuits [31]:

1. A combinational circuit is one in which the outputs depend only upon the current inputs, provided both the inputs and internal signals of the circuit are stable and unchanged.

2. A sequential circuit is one in which the outputs depend upon the current inputs and the past inputs, provided both the inputs and internal signals of the circuit are stable and unchanged. The past history of the circuit is represented by the current state of the circuit.

In general, a combinational circuit is implemented by an acyclic logic network, and its behavior is specified by a boolean expression. A sequential circuit is implemented by a logic network with feedback, and its behavior is specified by either a flow table or a state table.
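To make the two classes concrete, the distinction can be sketched behaviorally. The following Python fragment is an illustration only (the full adder and flip-flop are generic textbook examples, not components from this thesis): the combinational function always returns the same outputs for the same current inputs, while the sequential element's output also depends on stored state.

```python
# Illustrative sketch: combinational vs. sequential behavior.

def full_adder(a: int, b: int, cin: int):
    """Combinational: outputs depend only on the current inputs."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

class DFlipFlop:
    """Sequential: the output reflects past inputs via stored state."""
    def __init__(self):
        self.q = 0            # current state

    def clock(self, d: int) -> int:
        self.q = d            # state is updated on the clock event
        return self.q

if __name__ == "__main__":
    print(full_adder(1, 1, 0))   # (0, 1): same inputs, always same outputs
    ff = DFlipFlop()
    ff.clock(1)
    print(ff.q)                  # 1: depends on the input history
```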
The behavior of a sequential circuit in the time domain is analogous to the behavior of finite or infinite iterations of a combinational circuit in the space domain [21, 48]. Although both classes of circuits can implement a given system specification, most digital systems are implemented by sequential circuits due to limitations in the space domain.

The general class of sequential circuits, level-mode sequential circuits [21], are circuits in which all inputs and all internal signals affect the behavior of the circuit. The operation of level-mode sequential circuits raises two problems:

1. When do inputs change?

2. When does the circuit settle down, that is, when is every internal signal stable at a single value?

Sequential circuits are further classified into fundamental-mode circuits and pulse-mode circuits [31]. This classification is made by the restriction imposed on the way in which inputs change relative to the stability of internal signals.

• Fundamental-mode operation: There are no special kinds of inputs, such as pulses. Only one input is allowed to change at any given time, and only when the circuit is in a stable state. Fundamental-mode sequential circuits are a restricted case of level-mode sequential circuits.

• Pulse-mode operation: There are one or more special inputs that are treated as pulses and often are used to drive the clock inputs of flip-flops. Other inputs are treated as level inputs. During a certain period, which relates to the data storing of the flip-flops, the data inputs of the flip-flops need to be stable. This period is called the active state of the flip-flops [31], and is usually defined in terms of the setup time of the flip-flop, the hold time of the flip-flop, and the pulse signal(s) for the flip-flops. Pulse-mode sequential operation is a special case of fundamental-mode operation. Pulse-mode sequential circuits restrict the operation of the flip-flops to ensure that the flip-flops run in the fundamental mode, and release the rest of the circuit from the restriction of fundamental-mode operation; for example, unstable internal signals during the non-active state of the flip-flops can be ignored entirely.

In general, fundamental-mode circuits are called asynchronous circuits, and pulse-mode circuits are called synchronous circuits [31]. This definition of asynchronous and synchronous circuits is misleading; for example, an asynchronous system usually has both a fundamental-mode control circuit and a pulse-mode datapath. In this thesis, a synchronous circuit controls all pulse-mode circuitry by global clock(s), while an asynchronous circuit does not have a global clock controlling all pulse-mode circuitry.

All operation modes are defined from the time the circuit settles down. The operation timing depends on the delays in the circuit. Three delay models are used to analyze circuits:

1. Bounded delay model: The delays of components and connection wires are bounded by practical values.

2. Unbounded component delay model: The delays of components can be any value between zero and infinity, while the delays of connection wires are zero.

3. Unbounded component delay and unbounded wire delay model: The delays of both components and connection wires can be any value between zero and infinity.

Sequential circuits are further grouped into six classes based on operation modes and delay models, as shown in Table 1.1.

Delay Model                 | Fundamental Mode            | Pulse Mode
----------------------------+-----------------------------+------------------------------------------
Bounded                     | Muller C-elements           | Latches/flip-flops; synchronous systems;
                            |                             | datapath in Sutherland's micropipelines;
                            |                             | DCVSL (static)
Unbounded component         | Speed-independent circuits  |
Unbounded component & wire  | Delay-insensitive circuits  | DCVSL (dynamic) (?)

Table 1.1: Circuit classification in terms of the operation mode and the delay model.

Because the operation of a synchronous system relies heavily on the correct timing of every clocked memory element in the system, the synchronous system runs in pulse-mode operation under the bounded delay model. On the other hand, an asynchronous system can be viewed as a set of independently operated components. If a circuit runs correctly under the unbounded component delay and unbounded wire delay model, it is a delay-insensitive circuit [10]. A system is delay-insensitive if it is composed of a set of delay-insensitive circuits. However, a basic set of delay-insensitive circuits has yet to be found. The speed-independent circuit relaxes the assumption of delay-insensitive circuits, and runs correctly under the unbounded component delay model. However, this kind of circuit does not have the property of composition closure.

Both delay-insensitive circuits and speed-independent circuits are too pessimistic, and are not realistic in physical implementation. Most asynchronous circuits are delay-sensitive. Several important circuits, which are often used in asynchronous system designs, are examined below.

• The Muller C-element, which has a structure similar to the SR-latch [33], currently has no delay-insensitive implementation [10]. Although this element is used as an atomic component in most asynchronous circuits, the existing implementation of this element is a delay-sensitive circuit in fundamental-mode operation. (A behavioral sketch of the C-element follows this list.)

• The datapath used in Sutherland's micropipelines, which has the same structure as the datapath used in synchronous system design, is a delay-sensitive circuit in pulse-mode operation, where the asynchronous controller generates the pulses.

• Differential cascode voltage switch logic (DCVSL) is often used in the datapath implementation of self-timed circuits [33]. DCVSL is a pulse-mode circuit with the request as the pulse signal for the circuit. The dynamic-logic implementation of DCVSL is delay-insensitive when there is no charge leakage problem; in reality, the charge leakage problem does exist in the dynamic-logic implementation. In the static-logic implementation of DCVSL, the feedback transistors are equivalent to a latch and are delay-sensitive. Therefore, DCVSL is neither a delay-insensitive circuit nor a speed-independent circuit.
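The hold behavior of the Muller C-element mentioned in the first bullet can be summarized in a few lines. The following Python model is a behavioral sketch for illustration only (it abstracts away the delay and hazard issues that the surrounding discussion is about): the output copies the inputs when they agree and holds its previous value when they disagree.

```python
# Illustrative behavioral model of a two-input Muller C-element.

class CElement:
    def __init__(self, init: int = 0):
        self.out = init

    def update(self, a: int, b: int) -> int:
        if a == b:
            self.out = a      # both inputs agree: output follows them
        return self.out       # inputs disagree: hold the previous value

if __name__ == "__main__":
    c = CElement()
    print(c.update(1, 0))  # 0: inputs disagree, output holds
    print(c.update(1, 1))  # 1: both high, output rises
    print(c.update(0, 1))  # 1: disagree again, output holds
    print(c.update(0, 0))  # 0: both low, output falls
```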
1.3 Asynchronous and Synchronous Design

As technology advances, the physical size of an integrated circuit (IC) shrinks, allowing more complex designs to be placed on a single chip and causing the wiring delay to become very significant. Because of the large number of elements, the clock must be distributed over a large area. For synchronous systems running at very high clock speeds, the original assumptions of globally clocked pulse-mode operation become disadvantages. The causes and corresponding design problems are described below.

• Synchronous systems suffer from the clock distribution, or clock skew, problem. Clock skew is the phase difference of a clock signal at different locations in a complex system, and is caused by the delay of clock signal transmission. Synchronous system design therefore requires extensive timing analysis and circuit delay tuning to ensure that every flip-flop or memory element in the system operates properly with respect to both the distributed global clocks and the signals generated by the combinational circuits.

• As new technology emerges, re-tuning is required for every design. This re-tuning process is as difficult as solving the clock skew problem for the design in the original technology.

• Because all stages are controlled by the global clocks, the slowest stage in the entire system limits the clock speed. The original quantization assumption simplifies the design, but costs system performance.

An asynchronous circuit, which is composed of asynchronous components, does not have any global clock signals. Although asynchronous circuits are also delay-sensitive, the timing problem of system operation is local to each component and to the communication-interfacing parts between two connected components. Asynchronous system design is therefore easier than synchronous system design for the following reasons:

• There is no clock distribution/clock skew problem, and there is no global signal distribution problem.

• Usually, an asynchronous design system is composed of a set of basic subcircuits. For a new technology, the re-tuning process may apply only to the set of basic subcircuits. Therefore, for a new technology little effort is needed to re-tune every design, or the re-tuning is limited to the communication-interfacing parts of each system.

• The slowest subcircuit will not slow down the operation of the other parts of the system, although it may be a bottleneck for the system, for example, the slowest stage in a pipelined system.

Hence, asynchronous designs are easily adapted to advanced technology. Also, their performance may be comparable to or better than synchronous designs when clock speed becomes a limiting factor for the system.

1.4 High-Level Synthesis for Asynchronous System Design

High-level synthesis translates a system behavior specification into a system structure description [19, 32]. High-level synthesis allows designers to explore design alternatives at the algorithmic (functional) level, and to find a design satisfying their constraints and goals, such as system performance, physical size, power consumption, and so on.

Most existing approaches for asynchronous system design focus primarily on the synthesis of interfacing circuits among asynchronous components. Little work has been done in the area of high-level synthesis for asynchronous system design. Although the essential high-level synthesis issues for asynchronous system design are similar to those for synchronous system design, the models and formulations of these two synthesis problems may differ. The asynchronous system synthesis issues are described in the following sections.

Design styles, components, and technology mapping: Before the system is synthesized, designers may want a certain style of design structure, such as a central-control structure or a distributed-control structure, a bus-based design or a mux-based design, and so on.
For each design style, certain register-transfer-level (RTL) components are used, such as Muller C-elements, XOR gates, or handshaking registers in Sutherland's micropipelined design style [44]. The implementation of each component may vary for different design databases and for different technologies, for example, a custom macro block, a boolean network, or a composition of other functional blocks. The combination of the design style, the components, and the technology mapping decides the high-level synthesis result, and also directly affects the tasks of high-level synthesis.

Design specifications and representations: A design specification describes the behavior of a system. Ideally, a specification is not bound to any specific design style; therefore, the same system behavior specification can be implemented using different design styles. In practice, a designer selects a specification biased toward either the target structure or the synthesis assumptions; for example, different subsets of VHDL are used in different synthesis systems.

Design representations, such as graphs, (timing) diagrams, charts, and so on, are used during the synthesis process. Usually, the first step in high-level synthesis is to transform the behavior specification into an internal representation that is closely related to the target design style. For example, the token data flow graph (DFG) is used in asynchronous system synthesis, while the control/data flow graph (CDFG) is used in synchronous system synthesis.

Design quality measures and estimates: Design quality measures can be obtained from the synthesized result. However, it is time-consuming to generate a design using the entire synthesis process, and the result may not meet the design constraints and goals. Therefore, the estimation of design quality during the synthesis process is very important. Because only limited detail of the design is available during synthesis, the quality measures of a design rely on abstract estimation models. A meaningful estimate is either close to the real quality value of the design implementation or related to it, for example, a lower bound on the real quality value, or a value within a certain percentage range of the real quality value. In addition to being meaningful, the estimation model needs to estimate the quality without exploring all details of the design.

Most design quality estimation models for asynchronous and synchronous systems are similar except for performance measures. Due to the difference in behavior between asynchronous and synchronous systems, the performance of asynchronous systems, such as the completion time and the initiation period, is measured by the time interval between two signal events, while the performance of synchronous systems is measured by the number of clock cycles and the clock period.

Design automation and synthesis algorithms: The goal of synthesis is to develop a systematic design methodology and to automate the design process. To automate this process, the synthesis problem needs to be formulated, and algorithms that solve this problem need to be provided. Since the complexity of the design problem is very high, the design process is partitioned into several sub-problems. The partitioning of the synthesis problem is not unique, and the sub-problems are interdependent. For asynchronous system synthesis, there are several main sub-problems.
• Design representation transformations: Transformations convert the design representation so that the subsequent synthesis algorithms can reach a better synthesis result. For example, expression factorization is used to reduce the number of operations in the design description, tree height reduction is used to reduce the critical path delay, and other transformations are used for hardware-specific reduction.

• Operation sequencing and allocation: This problem, one of the most important in data path synthesis, is analogous to the scheduling and allocation problem in synchronous system synthesis. Operation sequencing and allocation determines the execution order of operations in the design and the allocation of each operation to available operator(s) under a given constraint and goal, such as the system performance and the set of available operators. Unlike in the synchronous system, which has clock steps, sequencing and allocation cannot be solved separately.

• Control synthesis: Control synthesis converts the design representation from the algorithmic description to the structure description. In the algorithmic description of a design, the main components are operations, and the data dependency describes the operation order. In the structure description, the main components are operators, and a control sequence generator describes the operation order. The mapping of operations in the algorithmic description to operators in the structure description, and the execution order of the operations through the operators, have to satisfy the sequencing and allocation result obtained from data path synthesis.

• Register minimization: Since there is no clock signal, the sharing of a register by multiple operations may be very expensive in terms of performance overhead as well as control overhead. Therefore, each operation has its own input register(s) in data path synthesis, even though not all of them are necessary in the structure of the design. Register minimization eliminates unnecessary registers from the structural representation of the design. The removal of a register between two operations is equivalent to the operation chaining of these two operations. The elimination of registers should maintain the goals and constraints from the result of data path synthesis without reducing the expected performance.

• Module selection: Module selection determines which operators are used to construct the design. The selection of operators may affect the synthesis result dramatically. Most synthesis problems assume that human designers select the modules.

• System partition: Due to the limitation of die size, a system may need to be partitioned into multiple chips. This problem involves many physical constraints, such as off-chip communication overhead, pin limits for each chip, and so on.

1.5 Research Overview

The main goal of this research is to develop a methodology for asynchronous system synthesis from a high-level behavior specification to an RTL structure description for a given design.

1.5.1 Micropipeline-based Asynchronous Systems

The target design style in this thesis is based on Sutherland's micropipeline structure [44], in which every function is implemented by a component. The communication between two components relies on a handshaking protocol.
The data-sending component sends a request signal to notify the data-receiving component when the data is ready, and the data-receiving component sends an acknowledge signal to notify the data-sending component when the data is received. Based on this design style, each component can be built independently. A complex function can be built by the composition of simple components following the same principle. A set of components developed in a standard cell library of the HP CMOS34 technology is used to compose complex computational functions and digital signal processing (DSP) designs throughout the synthesis system presented in this thesis.

1.5.2 Data Flow Specifications

The data flow graph (DFG) is used as the input specification, and is based on the token model used in data flow computing [16]. The input tokens, which start the computation of a function, are analogous to the input-request signals that start a computation in the micropipeline-based component. The direct correspondence between micropipelines and the data flow computation model provides unique advantages for the high-level synthesis of the data paths and the control unit(s).

At the beginning of a design, where the concurrency in the system is fully explored, the DFG is used as an algorithmic description of the design. After data path synthesis, the sequencing and allocation information is added to the DFG description. Control synthesis converts the algorithmic-descriptive DFG into a structural-descriptive DFG that satisfies the data path synthesis sequencing and allocation result.

At the structure level, the extended data flow graph (EDFG) separates the registers from the computational part of the DFG description. The output of register minimization is the EDFG description of the design. This EDFG description represents the RTL structure of the design.
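The correspondence between input tokens and input requests noted above can be illustrated with a small behavioral sketch. The following Python fragment is an invented illustration (the class and method names are not from this thesis): a node fires only when a token is present on every input, just as a micropipeline component starts computing once all of its input requests have arrived, and consuming the tokens plays the role of the acknowledgment back to the senders.

```python
# Illustrative sketch of the token-firing rule behind the DFG.

class DFGNode:
    def __init__(self, n_inputs, fn):
        self.fn = fn
        self.tokens = [None] * n_inputs   # one token slot per input arc

    def put(self, port, value):
        assert self.tokens[port] is None, "previous token not yet consumed"
        self.tokens[port] = value
        if any(t is None for t in self.tokens):
            return None                       # still waiting for a request
        result = self.fn(*self.tokens)        # all tokens present: fire
        self.tokens = [None] * len(self.tokens)  # consume (acknowledge)
        return result

adder = DFGNode(2, lambda a, b: a + b)
print(adder.put(0, 3))   # None: only one input token has arrived
print(adder.put(1, 4))   # 7: both tokens present, the node fires
```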
Thus, there are four possible sharing schemes, and they will be presented in Chapter 5, Control Synthesis. This thesis presents an algorithm for finding an optimal static allocation of the operations in the DFG to the library components and an optimal fixed sequencing of the set of operations for each given library component. The resource allocation and sequencing are done simultaneously. The criterion for optimalization is the minimum completion time. This problem is formulated as a graph theoretic problem. The input is the DFG representation of a system, and the DFG is assumed to be acyclic. The result is another graph. In this new graph, a node represents an operation node in the DFG, and a direct edge from one node to another node represents either the data dependency relation in which the output of a node is the input of the other node or the resource dependency relation in which two operations share the same operator and one operation is directly sequenced before the other operation. Each node delay equals the computation delay for that operation. For a data dependency relation an edge delay equals 0, and for a resource dependency relation an edge delay equals the control delay. For the resulting graph, the longest path delay equals the completion time. The operation sequencing and allocation finds the set of edges, which represents the resource dependency relation that satisfies the resource constraint, for a minimal longest path delay. 1.5.5 Control Synthesis Control synthesis generates a structural description of a design according to the allocation of resources and the sequencing of operations from data path synthesis. In this thesis, DFG is not only used as the algorithmic description language, but also as the structural description language of a system. Therefore, control synthesis 15 transforms the algorithmic-descriptive DFG, in which a node corresponds to an operation, to the structural descriptive DFG, in which a node corresponds to an operator. There are four basic templates for the transformations corresponding to the four possible resource allocation and sequencing schemes. Each template includes shared units, the data path routing part, which provides the structure for the re source allocation, and the sequence generating part, which generates the control signal for the operation sequencing. Further, this thesis develops accurate estima tors of the area and performance overhead for each control template. As a result, the controller overhead can be incorporated into the data path synthesis step. 1.5.6 R egister M inim ization Register minimization eliminates unnecessary registers in the design structure de scription obtained from the control synthesis. Register minimization should keep the completion time under the performance constraint and should not cause the system to deadlock. By extending the graph model used in the data path synthesis, this problem is formulated as a graph theoretic problem. The elimination of registers from the design can be modeled as an extra control delay edge added to the end of each operation chain. An operation chain is formed by a chain of operations without a register between any two operations in the chain. For the extended graph, the longest path delay equals the completion time. Register minimization finds the set of registers, whose corresponding graph has the longest path delay under the performance constraint. After unnecessary registers are eliminated, an EDFG description is formed to represent the RTL structure of this design. 
1.6 Related Work

1.6.1 Asynchronous Circuit Synthesis

Asynchronous system synthesis is highly dependent upon the selection of both the specification and the hardware model of systems. In this section, several synthesis methods are examined in terms of the behavior specification of the circuit.

Flow table: The Huffman model, the most dominant model for digital system design in centrally-controlled architecture, is the earliest implementation model for the finite state machine (FSM) [18, 30, 47]. The most popular methods for synthesizing an asynchronous controller specify the behavior of the FSM by the state table, formulate both the next-state functions and the output functions as boolean equations, and implement the FSM in logic gates [18, 25, 30, 35, 47]. These synthesis methods from the state table emphasize the design of hazard-free next-state/output functions with respect to unbounded gate delays and/or wire delays. These approaches are limited to single-input-change operation. The drawbacks of this approach are the difficulty of describing concurrent behavior and the difficulty of implementing large systems due to the lack of decomposability [12]. Recent research [36, 56] emphasizes multiple input changes for the state machine approach to improve asynchronous circuit performance.

Trace theory: Trace-theory-based design methods start from the trace structure [41, 50], which can be either a simple FSM or a composition of FSMs. A trace structure is composed of some basic operations. The circuit is built by translating each basic operation to a pre-designed hardware component [49]. (In terms of implementation, the trace-theory-based methods can also be classified as syntax-directed methods.) These synthesis methods stress the decomposition of the trace structure. Some researchers [17] have added decomposition rules, substitution rules, and reordering rules to ensure speed-independent operation. Although little work uses this theory, trace theory has significantly influenced the research on decomposition-based asynchronous circuit/system design.

Signal transition graph: Signal transition graph (STG) based design methods start with the STG specification and build state flow tables using the total states [12] to obtain the next-state/output functions. These synthesis methods [6, 13, 27, 28, 34] focus on the construction of STGs that ensure hazard-free resulting circuits.

In the three methods mentioned previously (flow table, trace theory, and signal transition graph), the target is an asynchronous control circuit that does not consider functional operations or system performance. All emphasize low-level circuits, which are not sufficient for the design of asynchronous systems [13].

CSP-like specification: CSP (communicating sequential processes) [23], a programming language, and CSP-like specifications have the capability of describing concurrent behavior and distributed control behavior. The synthesis methods based on CSP-like specifications are usually syntax-directed methods.

In the syntax-directed method, there is a set of basic circuits. A system is synthesized by mapping each construct in the behavior description to the corresponding circuit. Therefore, a synthesized system operates concurrently with a distributed controller if the description of the system does.
Though Martin and Burns's specification has the capability to describe functional operations by using decomposition rules, their method separates the functional operations from the control unit. Therefore, the resulting circuits in their method are centrally controlled [14, 29]. Since their method considers controller design primarily, it does not consider system performance.

Another syntax-directed method, developed by Brunvand and Sproull [8], used Occam, a CSP-like language, for the system behavioral specification. They developed high-level modules [7] to simulate the behavior of the Occam constructs. A system is synthesized by translating a given Occam description to corresponding modules. They then use peephole optimization at the circuit level. This approach has the potential to describe the concurrent behavior of a system with a distributed controller. However, this specification is used primarily to synthesize control circuits rather than functional circuits.

1.6.2 High-Level Synthesis

High-level synthesis translates a behavior specification to a structure description of the system [19, 32]. As discussed earlier, the high-level synthesis of asynchronous systems and the synthesis of synchronous systems are similar in many aspects. Due to the different design styles of these two kinds of systems, the formulations of the problems differ. This thesis focuses on the design representation, the sequencing and allocation problem (data path synthesis), and the control structure generation (control synthesis). The sequencing and allocation problem in the asynchronous system is analogous to the scheduling and allocation problem in the synchronous system. The control structure generation in the asynchronous system is similar to the algorithmic transformation in the synchronous system.

1.6.2.1 Design Representation

For synchronous systems, the design representations are clock-oriented or step-oriented; for example, ISPS [5] is an instruction-like design specification. Recently, data flow based descriptions have been used to represent the design behavior, for example, CDFG [19] and DDS [26]. The data flow based behavior description places more emphasis on the concurrent behavior at the beginning of the description. As the design proceeds through synthesis, timing, scheduling, and structural information are attached to the behavior description.

1.6.2.2 Algorithmic Transformation

Some transformations can be applied to the behavior description at the algorithmic level [43, 45, 51] before the behavior description is mapped to the structure description at the functional block level (or the register transfer level). The transformations at this level re-organize the operations in the algorithmic-level description to reduce the number of operations and/or the number of steps in the entire computation. This kind of transformation is especially useful for a structure design that is directly mapped from the algorithmic description.

1.6.2.3 Scheduling and Allocation

In synchronous system synthesis, operation scheduling and allocation assign each operation in the behavior specification a time step and a resource. There are three types of scheduling problems: time-constrained scheduling, resource-constrained scheduling, and a combination of both.

In as-soon-as-possible (ASAP) scheduling [46], the operations in the behavior description are scheduled from the input of the design to the output, and each operation is scheduled at the time step following the scheduling of all its predecessors; a small sketch of this rule is given below.
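The ASAP rule can be stated in a few lines. The sketch below is illustrative only: it assumes one time step per operation and a hypothetical dependence graph, and is not taken from the cited work.

    # A minimal ASAP sketch, assuming every operation takes one time step
    # and the dependence graph is acyclic.
    def asap_schedule(preds):
        """preds: dict op -> list of predecessor ops. Returns op -> step."""
        step = {}
        def visit(op):
            if op not in step:
                # an op is scheduled one step after all its predecessors
                step[op] = 1 + max((visit(p) for p in preds[op]), default=0)
            return step[op]
        for op in preds:
            visit(op)
        return step

    # Hypothetical dependence graph: m3 needs m1 and m2, a1 needs m3.
    print(asap_schedule({"m1": [], "m2": [], "m3": ["m1", "m2"], "a1": ["m3"]}))
    # -> {'m1': 1, 'm2': 1, 'm3': 2, 'a1': 3}

ALAP scheduling, described next, is the mirror image of this rule, working backwards from the outputs.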
In as-late-as-possible (ALAP) scheduling, the operations are scheduled from the output of the design to the input, and each operation is scheduled at the time step preceding the scheduling of all its successors. These two scheduling types are often used when no resource constraints exist. The scheduling results are used as bounds for other algorithms, for example, bounds for integer linear programming (ILP) formulations of the scheduling problem [24].

List scheduling [15, 22, 37] is often used in high-level synthesis. While similar to ASAP, it uses a priority function to decide the scheduling order. List scheduling is often used for the resource-constrained scheduling problem. Badia [4] used the list scheduling technique for asynchronous system synthesis, but did not formulate the control delay in the problem.

In freedom-based scheduling [38], freedom is the cost that indicates how critical each operation is. The operations on the critical path(s) are the most critical and are scheduled first. In force-directed scheduling [39], the concept of force evenly distributes the operations over clock steps. Previously, freedom-based and force-directed scheduling were used for time-constrained scheduling. Now, in cooperation with list scheduling, these two scheduling types can be used for resource-constrained scheduling.

The formal approach to the scheduling and allocation problem is either the ILP approach [24] or the mixed integer linear programming (MILP) approach [20]. Hwang's approach for synchronous system synthesis uses both ASAP and ALAP scheduling to find the scheduling bounds of each operation, which reduce the solution search space. Hafer's approach for asynchronous system synthesis uses a standard MILP to solve the problem. The standard MILP limits the size of the problem that can be solved.

1.7 Thesis Outline

Chapter 2 overviews the synthesis system using a 4-bit add-and-shift multiplier example. This example proceeds from an input DFG specification through data path synthesis, control synthesis, and register minimization for a given set of characterized components, to an EDFG description, which is an abstract RTL netlist of the design.

Chapter 3 describes the hardware implementation and the data flow specifications. The micropipeline-style asynchronous circuit design is described first, and the behavior of asynchronous components is linked to the data flow model. Then the data flow specifications and their respective timing models used in the synthesis system of this thesis are defined.

Chapter 4 describes the problem formulation and the algorithms of the operation sequencing and allocation for the data path synthesis. A graph model for specifying the input DFG with the operation sequencing and allocation is described. The sequencing and allocation algorithms for minimizing the completion time under the resource constraint are discussed.

Chapter 5 addresses the synthesis of the control structure in accordance with the allocation of resources and the sequencing of operations. Four basic templates for control structures corresponding to the four possible resource allocation and sequencing schemes are presented. Area and performance overheads for each template are accurately estimated. At the end of Chapter 5, local transformations that reduce the structure generated by these templates are presented.

Chapter 6 describes the problem formulation and the algorithms of the register minimization problem.
A graph model extended from the model used in the operation sequencing and allocation problem is used in the formulation of this problem. The algorithms for the register minimization problem are discussed.

Chapter 7 presents some design examples that illustrate the synthesis steps. Several designs for each input DFG specification are implemented through the MOSIS Netlist-to-Parts Service. Finally, Chapter 8 summarizes the conclusions and future research.

Chapter 2
System Overview

This chapter describes the design approach used in this thesis. First, the design flow of the synthesis system is introduced. Then, a simple example, which uses this design flow, illustrates the problems and terms associated with each synthesis step.

2.1 Design Flow Overview

Figure 2.1 shows the design flow proposed in this thesis for the synthesis of asynchronous systems. The ovals in the diagram represent synthesis steps, and the rectangular boxes represent the inputs and outputs of the synthesis steps. The synthesis begins with the behavior description of the design, which is specified by the data flow graph (DFG). With the resource/performance constraints, data path synthesis sequences and allocates operations to available operators. Control synthesis restructures the input DFG into a scheduled DFG according to the sequencing and allocation result obtained from data path synthesis. Finally, register minimization eliminates unnecessary registers from the scheduled DFG and produces an extended data flow graph (EDFG), which is the abstract of the RTL netlist of asynchronous systems. A set of RTL asynchronous blocks was developed using the HP C34100 standard cell library [58]. Through the MOSIS Netlist-to-Parts Service [59], the final layout with the extracted capacitance can be obtained from a back-annotation tool.

Figure 2.1: Design flow overview for the synthesis of asynchronous systems.

In the following sections, the design representations, synthesis problems, and terms will be defined using a simple example.

2.2 Data Flow Graph

The data flow graph (DFG), which is based on the token model used in data flow computing [16], is used as the input specification. A DFG is a directed graph with typed nodes and port-specific arcs, where a port refers to an input/output terminal of a node. Each directed arc in a DFG connects a specific output port of a node to a specific input port of a node.

Figure 2.2: DFG description of the 4-bit add-and-shift multiplier.

The semantics of a DFG are expressed by the movement of tokens. A token represents the presence of data on the corresponding input. A node is activated when all its necessary input arcs have tokens. An activated node computes, or fires, by absorbing all the tokens on its inputs and placing tokens on its outputs. Figure 2.2 is the DFG description of a 4-bit add-and-shift multiplier, where mj1, ash1, ... are the node instance names and m, q, (m, 0, q), ... are the symbolic values of the tokens passing through the arcs.
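As a sketch of this firing rule, the fragment below fires a node only when every input arc holds a token; the list-based arcs and the Join-like node are assumptions made for illustration, not constructs from the thesis.

    # A minimal sketch of the DFG firing rule: a node fires only when all
    # of its input arcs hold tokens; firing absorbs the inputs and places
    # an output token.
    def fire(node_fn, input_arcs, output_arc):
        if all(input_arcs):                        # every arc has a token
            tokens = [arc.pop() for arc in input_arcs]   # absorb inputs
            output_arc.append(node_fn(*tokens))          # emit output

    # A Join-like node on arcs carrying m, 0, and q.
    m_arc, c_arc, q_arc, out_arc = ["m"], [0], ["q"], []
    fire(lambda m, a, q: (m, a, q), [m_arc, c_arc, q_arc], out_arc)
    print(out_arc)   # -> [('m', 0, 'q')]; the input arcs are now empty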
The phantom boxes M and Q are two inputs, the multiplicand and the multiplier, and the phantom box Output is the output. Instance const1, a constant-0 generator, generates tokens with value 0. Instance mj1, a DFG Join construct, absorbs a token from each of its inputs and generates an output token containing the values of all its inputs. If the token from M has value m and the token from Q has value q, the output token of the Join has value (m, 0, q), where "( )" represents the join of several distinct data values.

AddShift is an atomic function that performs an addition and a one-bit right-shift operation. Let the input of AddShift have the value (M, A, Q) and the output have the value (M', A', Q'). For a 4-bit multiplier, M, A, Q, M', A', and Q' are 4-bit. The function of AddShift can be described as follows:

    {ov, Atmp} <- A + M * Q_LSB,                  (2.1)
    {A', Q'}   <- {ov, Atmp, Q_MSB..LSB+1},       (2.2)
    M'         <- M,                              (2.3)

where ov is a 1-bit overflow for the addition, Atmp is a 4-bit temporary datum, MSB means the most significant bit, LSB means the least significant bit, and LSB+1 means the second least significant bit. {} smashes data into the bit level with the leftmost bit as the MSB; for example, {ov, Atmp} is a 5-bit datum. Equation (2.1) performs the addition if Q_LSB = 1. Equation (2.2) performs the right shift, discarding Q_LSB and moving ov in from the left of Atmp. For 4-bit multiplication, four AddShift operations are needed. R([2,3]) is a special atomic function, called the router function; it extracts the second and third data items from the output of the last AddShift, for example, generating (a(4), q(4)) if the input is (m, a(4), q(4)).

Timing model for the DFG
For each node there are two timing parameters.

• Forward propagation delay time D_FP: the delay from the time when the input tokens are absorbed to the time when the output tokens are produced.

• Backward propagation delay time D_BP: the delay from the time when all output tokens are removed to the time when the node can absorb new input tokens.

Figure 2.3: Timing model for the computation of an operation.

Figure 2.3 shows the timing of an operation executed by an operator. When all input data are available, the computation starts; in other words, the tokens are absorbed. Let t0 be the time at which all input tokens are absorbed. At time t0 + D_FP, the output data are produced. The time delay δ is the period during which the operator waits for all output tokens to be absorbed by the following stages. If the input registers of the following stages are always available, δ = 0. For this synthesis system, δ = 0. After all output tokens are absorbed, the operator starts to reset itself for the next computation. At time t0 + D_FP + δ + D_BP, the operator is ready for another computation.

Table 2.1: Timing parameters for the operations in the multiplier.

                 D_FP   D_BP
    Join          0      1
    AddShift      3      1
    R([2,3])      0      1
    overhead      0      1

Assume that each node in Figure 2.2 has a distinct operator for its operation, and that the D_FP and D_BP of these operations are as listed in Table 2.1. Assume that data M and Q arrive at time 0. Figure 2.4 shows the timing interval of each operation in the computation of a multiplication.

Figure 2.4: Timing diagram for the computation of a multiplication operation.
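To check equations (2.1) through (2.3) concretely, the following bit-level sketch, written for this discussion rather than taken from the thesis, iterates AddShift four times and recovers the 8-bit product in {A', Q'}.

    # A minimal sketch of equations (2.1)-(2.3) for the 4-bit add-and-shift
    # multiplier: four AddShift steps turn (M, 0, Q) into (M, A', Q'), with
    # {A', Q'} holding the 8-bit product M * Q.
    def add_shift(m, a, q, bits=4):
        tmp = a + (m if q & 1 else 0)        # (2.1): {ov, Atmp} <- A + M*Q_LSB
        a_next = tmp >> 1                    # (2.2): {ov, Atmp} shifted into A'
        q_next = ((tmp & 1) << (bits - 1)) | (q >> 1)   # Atmp LSB enters Q'
        return m, a_next, q_next             # (2.3): M' <- M

    def multiply(m, q, bits=4):
        a = 0
        for _ in range(bits):                # four AddShift operations
            m, a, q = add_shift(m, a, q, bits)
        return (a << bits) | q               # product = {A', Q'}

    print(multiply(13, 11))                  # -> 143, i.e. 13 * 11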
2.3 Data Path Synthesis

Data path synthesis takes a specification of the system behavior and a set of constraints and goals that need to be satisfied, and finds an operation-operator mapping that implements the behavior while satisfying the goals and constraints. In the design system of this thesis, the behavior of a design is specified by the DFG, which describes the data dependencies among operations in the system. The constraints and goals are the system performance and the area cost.

The system performance can be either the system completion time or the pipeline period. The system completion time, L, is the time interval from the time when the input tokens are ready to the time when the output tokens are ready; for example, L = 12 for the multiplication operation in Figure 2.4. The system pipeline period, P, is the shortest input token initiation period for the system. Because ash1 can start a new computation every D_FP(AddShift) + D_BP(AddShift) time units, and AddShift has the maximum D_FP + D_BP among all nodes, P = 4 for the system in Figure 2.2. For a non-pipeline system, a new set of input tokens is taken after the previous output tokens are generated. For a pipeline system, a new set of input tokens is taken as long as the input stage of the system is ready. Therefore, data path synthesis minimizes the system completion time for the non-pipeline system and the system pipeline period for the pipeline system. The area cost is described by the number of operators.

The operation-operator mapping is the sequencing and allocation of operations over operators, and it includes the following information:

• What operator is each operation allocated to?

• What is the execution order of the operations that are allocated to the same operator?

This thesis only solves the data path synthesis problem of minimizing the system completion time for a given set of operators, that is, the resource-constrained data path synthesis problem for the non-pipeline system.

When an operator is shared by several operations, both area overhead and performance overhead exist. Because resource sharing always reduces the system performance, if the area overhead is greater than the area reduction due to the resource sharing, the resource should not be shared. For example, an AND gate will not be shared by many AND operations, because the area cost of the control structure necessary to support sharing is much greater than the area cost of the AND gates to be saved. The area overhead ratio, A_overhead_ratio, is the ratio of the area overhead of the sharing structure to the area of a shared unit, for each operation allocated to the shared unit. Therefore, A_overhead_ratio needs to be less than 1 before resource sharing is applied to an operator.

Figure 2.5: (a) The synthesis result for the non-pipeline system. (b) The synthesis result for the pipeline system.

Resource sharing also contributes performance overhead. Let D_FPov and D_BPov be the extra forward propagation delay time and the extra backward propagation delay time for each operation sharing an operator with other operations. In other words, the D_FP and D_BP of an operation inflate to D_FP + D_FPov and D_BP + D_BPov when the operation is executed by a shared operator.

Assume that there are two AddShift units available, for example, Asher1 and Asher2, for the system previously described in Figure 2.2.
Assume that A_overhead_ratio < 1, D_FPov = 0, and D_BPov = 1. Figure 2.5(a) is a synthesis result with L = 12 and P = 11. Figure 2.5(b) is another synthesis result with L = 16 and P = 10. Although this thesis only discusses the synthesis algorithm for non-pipeline system design, this example shows the difference between the synthesis result for non-pipeline system design and the synthesis result for pipeline system design. The synthesis result of Figure 2.5(a) has the minimum system completion time, and the synthesis result of Figure 2.5(b) has the minimum system pipeline period.

2.4 Control Synthesis

Control synthesis is the process of re-shaping the original (input) DFG with respect to the sequencing and allocation result obtained during data path synthesis. Two tasks are included in the control synthesis presented in this thesis:

1. Sharing structure generation
2. Local transformation

Sharing structure generation produces a re-shaped (output) DFG by applying generic control structures, called sharing schemes, to the input DFG with respect to the data path synthesis result. Since sharing schemes are general templates for any sequencing and allocation result, the size of the sharing constructs and the number of data flow paths in each sharing structure may be further reduced. Local transformation uses the peephole optimization technique to reduce the size of the generated sharing structure.

Each sharing scheme needs to handle the allocation of operations to operators and the execution order of operations sharing the same operators. In Chapter 5, four kinds of sharing schemes will be presented:

1. Static allocation with fixed sequencing (SAFS)
2. Static allocation with variable sequencing (SAVS)
3. Dynamic allocation with fixed sequencing (DAFS)
4. Dynamic allocation with variable sequencing (DAVS)

In sharing schemes with static allocation, each operation is allocated to a specific operator whenever the system is activated. On the other hand, in sharing schemes with dynamic allocation, each operation is dynamically allocated to one of the available operators. In sharing schemes with fixed sequencing, the operation execution order is fixed whenever the system is activated. However, in sharing schemes with variable sequencing, the operation order is not fixed and depends on the execution speed of each operation in the system.

For the same sequencing scheme, sharing schemes with dynamic allocation, which have a complicated allocation structure, have larger area overhead than sharing schemes with static allocation. For the same allocation scheme, sharing schemes with variable sequencing, which have a complicated sequencing structure, have larger area overhead than sharing schemes with fixed sequencing.

In Chapter 5, the general structures for these four sharing schemes are presented, and the analytical formulations of D_FPov, D_BPov, and A_overhead_ratio are derived. Because the SAFS sharing scheme has the smallest area overhead, the SAFS sharing scheme is used in this synthesis system. The algorithms for data path synthesis are also based on the overhead functions of the SAFS sharing scheme.

For this example, the data path synthesis result contains only the case of a 2-operations-to-1-operator allocation, so only the sharing scheme for this specific case is presented here.
Assume the two operations sharing the same operator OP are op1 and op2, shown in Figure 2.6(a), where I1 and I2 can be connected to either the input port(s) of the input DFG or the output(s) of some nodes, and O1 and O2 can be connected to either the output port(s) of the input DFG or the input(s) of some nodes. For example, if op1 corresponds to ash1 in Figure 2.2, I1 corresponds to the output of mj1 and O1 corresponds to the input of ash2.

Figure 2.6: (a) Two operations in the input DFG. (b) The sharing structure for 2-operation SAFS.

Figure 2.6(b) is the sharing structure for the 2-operations-to-1-operator SAFS sharing scheme. Control function CFfs_2 repeatedly generates a sequence of a token with data value 0 followed by a token with data value 1. Corresponding to the token with data value 0 from CFfs_2, OP obtains input data from input 0 of the Selector and generates output data at output 0 of the Distributor. Corresponding to the token with data value 1 from CFfs_2, OP obtains input data from input 1 of the Selector and generates output data at output 1 of the Distributor. Therefore, this sharing structure realizes the allocation of op1 and op2 to OP with the fixed execution sequence of op1 followed by op2.

By applying the SAFS sharing scheme to the synthesis result in Figure 2.5(a), the output DFG in Figure 2.7 is obtained, where ms1, md1, Asher1, and ctl1 form one sharing structure, and ms2, md2, Asher2, and ctl2 form the other. Therefore, Asher1 in Figure 2.7 executes both ash1 and ash3 of Figure 2.2 with the execution order of ash1 followed by ash3. Asher2 executes both ash2 and ash4 with the execution order of ash2 followed by ash4. Hence, the DFG in Figure 2.7 satisfies the synthesis result in Figure 2.5(a).

Figure 2.7: The output DFG for the non-pipeline system synthesis result.

In Figure 2.7, the two outputs of Asher1, which are distributed by md1, are connected to the two inputs of Asher2, which are selected by ms2. Whenever there is data at output i of md1, input i of ms2 absorbs the data, for i = 0 and i = 1. Therefore, one path can be used between these two operators. After local transformation, the output DFG is reduced to the DFG in Figure 2.8. The output DFG not only has a smaller control structure but also preserves the sequencing and allocation result.

Figure 2.8: The output DFG for the non-pipeline system synthesis result after local transformation.

Figure 2.9: Timing diagram for the computation of a multiplication after the removal of ash2's input register.

2.5 Register Minimization

Register minimization eliminates unnecessary registers in the scheduled DFG obtained from the control synthesis. Each node input in a DFG implicitly has a register, which may be removed to save area cost without sacrificing the system performance. For example, if the input register of ash2 in Figure 2.2 is removed, ash2 can start computing right after receiving data from ash1. Since the input register of ash1 needs to hold the data for the computation of ash2, ash1 cannot be reset until ash2 finishes computing.
Figure 2.9 shows the timing diagram of a multiplication after the removal of the input register of ash2.

Assume there is a register at each node input in the output DFG of control synthesis, and assume register minimization does not remove the system input registers. This thesis formulates the register minimization problem as a graph optimization problem under the performance constraint.

By applying register minimization to the DFG in Figure 2.8 with the timing constraint L <= 12, five registers remain: three registers at the inputs of mj1, the register at input 1 of ms1, and the register at the input of Asher2. Figure 2.10(a) shows the timing diagram of the system with the five remaining registers. Comparing the timing diagram in Figure 2.10(a) with the timing diagram from data path synthesis in Figure 2.5(a) shows that the completion time of the system after register minimization remains 12.

Figure 2.10: (a) The timing diagram with the five remaining registers. (b) The timing diagram after the removal of Asher2's input register.

Now examine the effect of removing the register at either the input of Asher2 or input 1 of ms1. If the register at the input of Asher2 is removed, Asher1 executing operation ash1 cannot be reset until Asher2 completes the computation of ash2. Therefore, the execution of ash3 on Asher1 is postponed until after Asher1 is reset from ash1. Figure 2.10(b) shows the timing diagram of a multiplication after the removal of the input register of Asher2. Comparing the timing diagram in Figure 2.10(b) with the timing diagram in Figure 2.5(a) shows that the system completion time has deteriorated from 12 to 14. Therefore, the input register of Asher2 cannot be removed unless the timing constraint is relaxed to L <= 14. If the register at input 1 of ms1 is removed, Asher2 executing operation ash2 cannot be reset until Asher1 completes the computation of ash3. To complete the computation of ash3, the input register for operation ash4 needs to be ready. However, the input of ash4 shares the same register as the input of ash2, which is occupied until the computation of ash3 is completed. Therefore, to avoid deadlock, the register at input 1 of ms1 cannot be removed.

Figure 2.11: (a) Phantom AddShift: an AddShift with no input register. (b) Equivalent structure to the AddShift with an input register.

The extended data flow graph (EDFG) is an extension of the DFG that distinguishes input-non-registered nodes from input-registered nodes. A phantom node represents a node with no input register, and a non-phantom node represents a node with input register(s). For example, Figure 2.11(a) represents an AddShift with no input register, and Figure 2.11(b) is an equivalent structure for a non-phantom AddShift. Although the syntax of the EDFG is the same as the syntax of the DFG, the token model and the timing model of the EDFG are quite different; they are discussed in Chapter 3.

Figure 2.12: The EDFG description for the non-pipeline system synthesis result after register minimization.
By removing those registers, the register minimization result for the DFG in Figure 2.8 becomes the EDFG in Figure 2.12.

2.6 Hardware Realization

This thesis adopts the syntax-directed method [9, 11] to realize the physical design from the extended data flow graph (EDFG) specification. In this method, each basic construct in the high-level specification is translated directly to a corresponding hardware component. Therefore, the data flow graph not only describes the behavior of a system but also represents the structure of the system.

Figure 2.13: EDFG to basic block mapping.

A path in an EDFG is mapped to a two-phase handshaking data transfer bus, which includes a data bus (wires) of the same data type, a request line, and an acknowledge line. A token on a data flow path corresponds to the state in which data is being transferred between the two ends of the data transfer bus. In terms of the handshaking protocol, this state corresponds to the interval from the request signal transition to the acknowledge signal transition of the data transfer bus. Each component is designed to satisfy this correspondence. Thus, there is a one-to-one correspondence between the elements of an EDFG and the hardware components (see Figure 2.13). A set of asynchronous components corresponding to the EDFG elements has been developed [52]. Therefore, an RTL netlist based on these cells can be obtained for the final layout implementation.

2.7 Summary

In this chapter, the synthesis procedure was described briefly through a simple example to provide readers with an overview of this design system. Each synthesis step will be addressed in depth in the rest of this thesis. At the end of the thesis, several experimental results for two real designs, with performance and area measurements, will be used to demonstrate the effectiveness of this synthesis procedure.

Chapter 3
Micropipelines and Data Flow Specifications

In this chapter, the target design style, the design representations, and the timing models for the synthesis system of this thesis are described. In the first section, the structure of micropipelines is described as the target design style for the synthesis system, and the behavior of each component in micropipelines is linked to the data flow model. In the second section, the data flow graph (DFG) and its extension (EDFG) are described as the design representations in this synthesis system. In the last section, the timed Petri net is used to define the timing model for the components in micropipelines, and the timing models for the DFG and the EDFG are then derived.

3.1 Micropipelines and the Data Flow Model

The hardware model employed in this thesis is based on Sutherland's Micropipelines [44]. This model assumes that request signals are bundled with the data signals to ensure proper operation, namely, the bundled data convention [42, 44]. Unlike speed-independent and delay-insensitive designs, the micropipeline model requires the determination of the delays in the computational blocks. As this can be done in a manner similar to the conventional design of synchronous systems, it does not pose any serious problem.
C yrlft - : O n r C y rlf r : n Dnft.ftyrl/* (b ) 1® © ® © ® © ° iu M 1 m © © © © © © Request j \ / \ r~ \ ® © © © © © A cknow ledge j ) _ _ J ~ I n (c ) Figure 3.1: Data transfer and handshaking protocol, (a) Data transfer between two blocks, (b) Two-phase handshaking, (c) Four-phase handshaking. A system consists of a collection of functional blocks with data transfers taking place between either two or more functional blocks, or a functional block and the surrounding environment. Data transfers between any two blocks rely on a handshaking protocol. Each block is activated whenever its input data is available. The operations of functional blocks in micropipelines are asynchronous, concurrent, and data-driven. 3.1.1 D ata Transfers and Handshaking P rotocols The handshaking protocol used in the design method of this thesis can be a two- phase and/or a four-phase handshaking protocol. These are shown in Figure 3.1. Two-phase handshaking protocol with the bundled data convention is shown in Figure 3.1(b). Referring to Figure 3.1(b), it can be seen that there are three 43 events in each cycle of data transfer. First, the sender puts valid data on the data bus. Second, the sender activates a signal transition on the request line to notify the receiver that data is available. Third, the receiver activates a signal transition on the acknowledge line to notify the sender that the data has been received so that another cycle of the data transfer can begin. For the four-phase handshaking protocol shown in Figure 3.1(c) the request line and the acknowledge line are initialized to 0 at the start of each data transfer cycle. The first three events are the same as those in the two-phase handshaking protocol. In the next two events, the sender resets the request to 0 and the receiver resets the acknowledge to 0 . In terms of data validation period, there are two kinds of conventions for the four-phase handshaking protocol [8]. In the narrow convention the sender holds the data valid from the rising request signal to the rising acknowledge signal. In the broad convention the sender holds the data valid from the rising request signal to the falling acknowledge signal. Although the current implementation follows the two-phase handshaking pro tocol with the bundled data convention, a system may contain different protocols for different data transfers in its implementation as long as the sender and the receiver of each data transfer follow the same protocol. 3.1.2 R ealization of a Basic Block A functional block as proposed by Sutherland [44] has the structure shown in Figure 3.4(a). There are three basic elements in this structure. The Muller C- element, represented by a C gate, is used to control the handshaking protocol. The asynchronous register, represented by a reg block, is used to capture and pass input data. The computational part, represented by a Logic block, is used to perform the functional computation for the structure, for example, addition. The 44 Figure 3.2: The behavior of Muller C-element. Di1’ DI3’ J ___________ \ Di DI1 | DI2 DI3 | Di4 DI5 Di 1 reg 1 p - r Do Do D ll D ll' D13 Di3’ Di5 Figure 3.3: The behavior of asynchronous register. oval node in this structure represents an added delay. This added delay is used to ensure that the output data transfer satisfies the bundled data convention. M u lle r C -elem ent A Muller C-element has two inputs and one output. 
The output of a C-element is 1 if all its inputs are 1, and 0 if all its inputs are 0; otherwise, its value remains unchanged [42]. A two-input C-element can be viewed as a logical AND of two events, where an event can be a 0-to-1 or a 1-to-0 transition [44]. The behavior of the Muller C-element, ignoring the gate delay, is shown in Figure 3.2.

Asynchronous register
The asynchronous register proposed by Sutherland has four terminals: the data input terminal, Di; the data output terminal, Do; and two one-bit control signals, c and p, called capture and pass, respectively. If the value of c equals the value of p, then the value of Di passes to Do; otherwise, the value of Do remains unchanged. Operationally, c and p are initialized to 0; signal transitions (events) then occur at c and p in the sequence c, p, c, p, ... The behavior of the asynchronous register, ignoring the gate delay, is shown in Figure 3.3. In the operation of the asynchronous register, event c always captures the input data so that the output holds the last input value before event c; for example, Di1' and Di3' in Figure 3.3 are captured by c events. Event p starts the passing mode of the register; for example, Di1, Di3, and Di5 in Figure 3.3 are passed to Do. A behavioral sketch of these two elements follows.
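The fragment below is a minimal behavioral model, written for this discussion, of the two control elements just described: a two-input C-element and a capture/pass register. Gate delays are ignored, as in Figures 3.2 and 3.3.

    # A minimal behavioral sketch of the two control elements, ignoring
    # gate delays (as in Figures 3.2 and 3.3).
    class CElement:
        def __init__(self):
            self.out = 0
        def update(self, a, b):
            if a == b:            # output follows only when inputs agree
                self.out = a
            return self.out       # otherwise the old value is held

    class AsyncRegister:
        def __init__(self):
            self.do = 0
        def update(self, di, c, p):
            if c == p:            # pass mode: Di flows through to Do
                self.do = di
            return self.do        # capture mode (c != p): Do is held

    r = AsyncRegister()
    print(r.update(5, 0, 0))      # c == p: passing, Do = 5
    print(r.update(7, 1, 0))      # c event: captured, Do stays 5
    print(r.update(9, 1, 1))      # p event: passing again, Do = 9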
Computational part
The computational part can be implemented by combinational logic. To ensure the handshaking protocol, an added delay is required. The computational part can also be implemented by differential cascode voltage switch logic (DCVSL) without an added delay [34]. DCVSL is suitable for four-phase handshaking operation. To make this kind of circuit useful in a two-phase design, two-to-four and four-to-two phase conversion circuits are needed.

The combination of Muller C-elements and asynchronous registers forms the pipeline structure of asynchronous systems, or micropipelines [44]. A single stage from Sutherland's micropipelines is a basic functional block in this system and is shown in Figure 3.4(a). Ri, Ai, and Di correspond to the input request signal, the input acknowledge signal, and the input data signals. Ro, Ao, and Do correspond to the output request signal, the output acknowledge signal, and the output data signals.

Figure 3.4: (a) The block diagram of the basic block in micropipelines. (b) The timing diagram of the basic block.

The input/output behavior of the basic block, taking the gate delay into account, is shown in Figure 3.4(b). Ri, Ai, Ro, and Ao are initialized to 0. Notice that there is an inversion at one of the inputs of the Muller C-element; this means that there is an initial event at that input. Therefore, event Ri generates an event at the output of the Muller C-element, which in turn captures the input data Di. After passing through the register, this event passes to Ai and notifies the input data transfer that Di is stored and its value can be changed. In other words, one cycle of input data transfer is complete. The computational part then operates on the captured input at the output of the register. The delay added in parallel with the computational part ensures that valid data is produced before the arrival of event Ro; therefore, the added delay equals the critical path delay of the computational part plus some safety margin. Event Ao completes one cycle of output data transfer, and it allows the register to pass new input data to the functional block. After p of the register receives an event from Ao, the Muller C-element receives an (initial) event that allows another cycle of input data capture. In case a new Di and Ri have arrived before the transfer of output data is completed, the Muller C-element waits until its other input receives an (initial) event from Ao through p of the register.

Figure 3.5: (a) The behavior of a basic block. (b) Data flow model for the basic block.

3.1.3 Data Flow Model for Basic Blocks

The main reason this system uses a data flow specification is that the behavior of the basic blocks of asynchronous systems can be described by a data flow model. Each basic block is a functional unit. The block captures available data at the input and produces data at the output, which becomes the input data of another functional block. The behavior of the basic block is analogous to the behavior of a functional node in a data flow graph, which absorbs input tokens and generates output tokens. This analogy is shown in Figure 3.5.

Figure 3.5(a) shows a sequence of states that describes the same input/output behavior as that of the basic asynchronous block in Figure 3.4(b). Each state in Figure 3.5(a) represents the block having just received or produced an event and waiting for the next event. Events in Figure 3.5(a) are labeled by numbered circles: 1 represents data becoming available under the two-phase handshaking protocol; 2 represents the request signal transition; and 3 represents the acknowledge signal transition.

State 1 and State 6 in Figure 3.5(a) correspond to an idle functional node in the data flow model. Although there is input data available at State 1, the block has not been notified of the availability of input data by the external world. (The external world means the surrounding environment of the basic block.) Only when event Ri is activated by the external world does the basic block know that there is data available at Di. Therefore, State 2 in Figure 3.5(a) corresponds to a data flow function node with an input token. After the block captures the input data and completes its calculation, the block produces output data, and the external world is notified by event Ro. Therefore, State 5 in Figure 3.5(a) corresponds to the output token generated in the data flow model. After the external world releases the output data by activating event Ao, the block becomes idle again.

Two transient states, State 3 and State 4, in Figure 3.5(a) are not mapped to the data flow model. These states represent the functional operation in the real circuit, which takes time. However, they can be ignored in the high-level data flow model and represented by a proper timing model for system analysis.

Handshaking protocol and token model
The key to this analogy is modeling the data transfer, which is based on the two-phase handshaking protocol, by the token movement in the data flow model.
The data transfer between event number 2 (Request) and event number 3 (Acknowledge) in Figures 3.1(a) and 3.1(b) is the state in which the data is available on the data bus and waiting to be captured by the receiver. This state is mapped to the appearance of a token between two function nodes in the corresponding data flow graph. The function node corresponding to the sender block generates the token, which will be absorbed by the function node corresponding to the receiver block. Later in this thesis, the extended data flow graph will be derived based on the same token model.

3.2 Data Flow Specifications

The data flow graph (DFG) is used as the input specification. It is based on the token model used in data flow computing [16]. Unlike synchronous systems, in which operations are scheduled to time steps, the behavior of a DFG is described by the activation of operations in the DFG, that is, by the token movement.

A DFG is a directed graph with specific types of nodes and port-specific arcs, where a port refers to an input/output terminal of a node. Each node in a DFG belongs to a finite set of node types that represent the basic constructs of the DFG specification. Each directed arc in a DFG connects a specific output port of a node to a specific input port of a node.

The semantics of a DFG are expressed by the movement of tokens. A token represents the presence of data on the corresponding input. A node is activated when all its necessary input arcs have tokens. An activated node computes, or fires, by absorbing all the tokens on its inputs and placing tokens on its outputs. There is no notion of synchronization among activated nodes, as these nodes operate asynchronously and concurrently [16].

3.2.1 Basic Constructs

By considering the area/performance efficiency of the asynchronous block implementation, the basic DFG constructs from the conventional data flow specification are generalized and enriched. Figure 3.6 shows the set of DFG constructs used in this thesis. For example, the conventional data flow specification often uses binary-input (or binary-output) control constructs such as the Join, the Fork, the Distributor, and the Selector. To join eight data inputs to one destination, three levels of two-input Joins are needed. In terms of delay and area consumption, the implementation of a single block that can handle an eight-to-one join is more efficient than three levels of two-input Joins [53]. This enhancement implies that if new constructs satisfy the data flow model and are needed in the description of asynchronous systems, the set of basic constructs may grow.

Figure 3.6: Basic constructs of the DFG.

The behavior of the basic constructs shown in Figure 3.6 is given below; an executable sketch of several of these firing rules follows the list.

• Join: If there is a token at each input, Join absorbs them and generates an output token that represents all the input token data values.

• Fork: If there is an input token, Fork absorbs the input token and generates a token on each output with the same data value as the input.

• Distributor: If there is a token at the data input port and a token at the condition input port carrying the value m, Distributor absorbs both input tokens and generates a token at output port m with the same value as the data input.

• Selector: If there is a token at input port m and a token at the condition input port carrying the value m, Selector absorbs both input tokens and generates an output token with the same data value as input port m.

• Pass(<cond>): If there is a token at the data input port and a token at the condition input port, Pass(<cond>) absorbs both input tokens. If the condition data equals <cond>, Pass(<cond>) generates an output token with the same data value as the data input.

• Arbiter: If token(s) exist at the input port(s) and there is no token at the output port, one and only one input token is absorbed and passed to the output port.

• Atomic functions: These represent computational nodes, for example, adders, multipliers, and so on.

• Macro function: A macro function represents a function defined by another data flow graph. It supports hierarchical description.
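As a concrete reading of these rules, the sketch below gives illustrative firing functions for Join, Fork, Distributor, and Selector; the tuple-based token encoding is an assumption made for this example only.

    # A minimal sketch of four firing rules, assuming each arc holds at
    # most one token and tokens are plain Python values.
    def join(tokens):                 # n inputs -> one grouped output
        return tuple(tokens)

    def fork(token, n):               # one input -> n copies
        return [token] * n

    def distributor(token, cond, n):  # route token to output port `cond`
        outs = [None] * n
        outs[cond] = token
        return outs

    def selector(tokens, cond):       # take the token at input port `cond`
        return tokens[cond]

    print(join(["m", 0, "q"]))        # -> ('m', 0, 'q')
    print(distributor("d", 1, 2))     # -> [None, 'd']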
The behavior of the basic constructs shown 51 • S elector: If there is a token at input port m and a token at the condition input port carrying the value m, Selector absorbs both input tokens and generates an output token with the same data value as input port m. • P ass((co n d )): If there is a token at the data input port and a token at the condition input port, Pass((cond)) absorbs both input tokens. If the condition data equals (cond), Pass({cond)) generates an output token with the same data value as the data input. - • A rb ite r: If token(s) exist at the input port(s) and there is no token at the output port, one and only one input token is absorbed and passed to the output port. • A tom ic functions: These represent computational nodes, for example, adders, multipliers, and so on. • M acro function: A macro function represents a function defined by an other data flow graph. It supports hierarchical description. Certain assumptions in the hardware implementation are imposed on the behavior model of the DFG. • S to rag e elem en t A ssum ption: There are a limited number of storage elements in hardware. Besides the assumption that each input of any node has a storage element, extra storage elements need to be specifically expressed in the DFG description. • O p e ra tio n -o p e ra to r m ap p in g assu m p tio n : There are a limited num ber of functional units (operators). Each node in a DFG corresponds to a operator in the hardware implementation. These assumptions lead to the following rules for the data flow specification and its behavior model. Due to the storage element assumption, only one token is ever 52 allowed on an arc at any time. In addition, every basic construct can absorb tokens from its input port(s) only when no tokens are present at any of its output arcs. In other words, no tokens are allowed to accumulate in any of the basic constructs. Due to the operation-operator mapping assumption, no recursive (macro) function is allowed in this system. 3.2.2 D ata 3.2.2.1 D ata Types Since the goal is to transform DFG descriptions into hardware realizations, each data item has a fixed format, or data type. There are three basic data types: 1. A null data type is denoted by null. 2. A set of n-bit wire data types is denoted by nb, where n is a positive integer. 3. Group data types are denoted by {g\,g 2 , ••• where each < / , - is a null, wire, or another group data type. The following items in a DFG have data types: input/output ports, directed arcs, and tokens. If an output port is connected to an input port through a directed arc, then the output port, the input port, the directed arc, and the tokens that flow through the arc should have the same data type. The data type of an input/output port depends upon the function of the node. For example, the input of a 16-bit adder contains two 16-bit data and the output of this adder contains a 1-bit carry out and a 16-bit sum. The data types of the input and output are specified by (16b,16b) and (lb,16b), respectively. Therefore, the data type of the arc and the output of any node connected to the input of this adder should be (16b,16b). Each node in a DFG is a data manipulator that receives input data and gen erates output data. There are certain relations among the types of inputs and the types of outputs for each node. Except for atomic functions, macro functions, 53 Construct name Data types Input (s) Output(s) Condition Join to, t l , • • • , t,i— i tout — (to,tl, . . . , t n— l) Fork tin to • “ tl — • • . 
Except for atomic functions, macro functions, and Join, all input ports (excluding the conditional input) and output ports of any basic construct have the same data type. Join absorbs tokens from all inputs and generates a token that represents all the input tokens. If the n inputs of Join from left to right have data types t0, t1, ..., tn-1, then the data type of the output is (t0, t1, ..., tn-1).

Let the data type of the conditional input be tc, the data type of the single non-conditional input be tin, the data type of the single output be tout, the data type of <cond> for Pass be t<cond>, and the data types of the n inputs or n outputs be t0, t1, ..., tn-1 from left to right of the node. The input-output data type relation of the basic constructs is summarized in Table 3.1.

Table 3.1: The input-output data type relation of basic constructs.

    Construct      Input(s)                Output(s)                     Condition
    Join           t0, t1, ..., tn-1       tout = (t0, t1, ..., tn-1)
    Fork           tin                     t0 = t1 = ... = tn-1 = tin
    Arbiter        t0 = t1 = ... = tn-1    tout = t0
    Distributor    tin                     t0 = t1 = ... = tn-1 = tin    ⌈log2 n⌉b
    Selector       t0 = t1 = ... = tn-1    tout = t0                     ⌈log2 n⌉b
    Pass(<cond>)   tin                     tout = tin                    tc = t<cond>

During the manipulation of a DFG, a data type may be reduced to an equivalent data type. The data type reduction rules are as follows:

1. (g1) = g1. Since g1 has a simpler form than (g1), (g1) is reduced to g1.
2. (g1, g2, ..., gj-1, gj, gj+1, ..., gm) is reduced to (g1, g2, ..., gj-1, gj+1, ..., gm) if gj is a null data type.

A data type is primitive if it cannot be reduced. Every data type can be reduced to a unique primitive data type. Hence, only primitive data types are considered during the manipulation of DFG constructs; see, for example, the input/output port data types of the Join shown in Figure 3.7.

Figure 3.7: Example: manipulation of data types. (The Join of null and 8b is (null,8b); (null,8b) is equivalent to (8b), and (8b) is equivalent to 8b.)

3.2.2.2 Data Representations

A DFG is not only a specification language but also a simulation language. Let Bi ∈ {0, 1, X} be a binary value, Oi ∈ {0, 1, 2, 3, 4, 5, 6, 7, X} be an octal value, and Hi ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f, X} be a hexadecimal value, where X is the unknown value. The representations of data values are defined as follows.

1. The data value of null is ().

2. The data value of nb is represented by any of the following forms:

   • n'hH1H2...Hj, where j = ⌈n/4⌉ and H1 is the most significant half byte.
   • n'oO1O2...Ok, where k = ⌈n/3⌉ and O1 is the most significant octal digit.
   • n'bB1B2...Bn, where B1 is the most significant bit.
   • (B1, B2, ..., Bn), where B1 is the most significant bit.

   The first three formats are adopted from the Verilog description language and are useful for data representation. Because n gives the number of bits, the leading zeros of H1H2...Hj, O1O2...Ok, and B1B2...Bn can be omitted. The last format is adopted from John Backus' FP [3] and is useful for the definition of functions in the FP format. The leading zeros of the FP format cannot be omitted.

3. The data value of (g1, g2, ..., gm) is represented by (d1, d2, ..., dm), where di is a data representation of gi.

For example, 4 and 10, the two data inputs of a 16-bit adder, can be expressed in any of the following formats:

• (16'h4, 16'ha),
• (16'o4, 16'o12),
• (16'b100, 16'b1010), or
• ((0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0), (0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0)).

In this example, the two data inputs of the adder are merged together to form a single input representation.
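The two reduction rules can be read as a small rewriting procedure. The sketch below is an illustration written for this text: it encodes groups as tuples, wire types as strings such as "8b", and the null type as "null", and reduces a type to its primitive form as in Figure 3.7.

    # A minimal sketch of the two data type reduction rules.
    def reduce_type(t):
        if not isinstance(t, tuple):
            return t                          # wire or null: already primitive
        members = [reduce_type(g) for g in t]
        members = [g for g in members if g != "null"]   # rule 2: drop nulls
        if not members:
            return "null"                     # a group of nulls reduces to null
        if len(members) == 1:
            return members[0]                 # rule 1: (g1) = g1
        return tuple(members)

    print(reduce_type(("null", "8b")))                 # -> '8b' (Figure 3.7)
    print(reduce_type(("16b", ("null", "1b"), "16b"))) # -> ('16b', '1b', '16b')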
3.2.2.3 Data Manipulation

Each node in a DFG is a data manipulator; an adder, for example, is a data manipulator. This thesis does not discuss general functional nodes, but it presents special functions that compose new data by extracting, repeating, and shuffling the input data. In addition, some special functions that produce tokens without input or absorb tokens without output are defined.

Data extraction, repetition, and shuffling
Functions that rearrange data by extraction, repetition, and shuffling are called routers in the DFG, because they are implemented by wire routing. The notation for routers is based on Backus' FP [3], where i is the FP selector function, the square bracket pair [...] is the functional form of construction, and the circle, o, is the functional form of composition. These are defined as follows:

    i : (x1, x2, ..., xn) = xi, for any positive i <= n,
    [f1, f2, ..., fn] : x = (f1 : x, f2 : x, ..., fn : x),
    f o g : x = f : (g : x),

where x and (x1, x2, ..., xn) are the input data for the functions, and f1, f2, ..., fn, f, and g are functions, for example, FP selector functions.
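These three FP forms translate directly into higher-order functions. The sketch below is an illustrative rendering, using 1-based indices as in FP, and, anticipating the router example in the next paragraph, evaluates [1 o 1, 2 o 1, 1 o 2, 2 o 2] on a (2b, 2b) input.

    # A minimal sketch of the FP forms used by routers: the selector i,
    # construction [...], and composition o.
    def sel(i):
        return lambda x: x[i - 1]                 # i : (x1,...,xn) = xi

    def construct(*fs):
        return lambda x: tuple(f(x) for f in fs)  # [f1,...,fn] : x

    def compose(f, g):
        return lambda x: f(g(x))                  # f o g : x = f : (g : x)

    # [1 o 1, 2 o 1, 1 o 2, 2 o 2] flattens ((b11,b12),(b21,b22)).
    router = construct(compose(sel(1), sel(1)), compose(sel(2), sel(1)),
                       compose(sel(1), sel(2)), compose(sel(2), sel(2)))
    print(router((("b11", "b12"), ("b21", "b22"))))
    # -> ('b11', 'b12', 'b21', 'b22')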
For example, if the conditional input of a Distributor always receives non-zero tokens, an arc connecting to output 0 of this Distributor is equivalent to an arc connecting to an Ift. Designers may want to check if this float is caused by the design error. 3.2.3 Sim ulation Similar to other hardware description languages, DFG can be simulated for any given test input. This thesis focuses on the symbolic simulation of a DFG and the 58 Figure 3.9: Fork-Join Factoring, (a) Before transformation, (b) After transforma tion. verification of the equivalency of two DFGs by symbolic simulation. This technique is used to verify the correctness of any DFG transformation. S ym bolic to k en sim ulation Symbolic token simulation simulates a DFG by assigning a token of a given symbolic value to each input of a DFG. For example, if there is a token with data value d ,- and data type f,- at each input /,■ of the DFG in Figure 3.9(a) for i = 1,2, then a token is produced with data value (dl,d2) and data type (ti,t2) at each output of 0\ and 0 2. In this example, di, d2, and (di, d2) are all symbolic values. P ro o f of th e equivalency of tw o D FG s by sym bolic sim u latio n Since a symbolic value represents a general form of data value, it can be replaced by any specific value. If the two outputs of two DFGs are the same with respect to the same set of input symbolic values, the two DFGs are functionally and behaviorally equivalent. For example, applying the same symbolic values d\ and d2 to the DFG in Figure 3.9(b) produces two output tokens with a data value (di,d2) at outputs 0\ and 0 2. Therefore, the two DFGs in Figure 3.9 are functionally and behaviorally equivalent. R il D il A il Ri2 D i2 A i2 I6b dcliy' • 16b • • 16b <lb,l6b> ADD Join ADD A o D o Ro Figure 3.10: Two-input addition. This technique can be used to prove the validity of a DFG transformation. A DFG can be transformed to another DFG only if the two DFGs are equivalent. Therefore, the transformation from the DFG in Figure 3.9(a) to the DFG with fewer nodes in Figure 3.9(b) is valid. 3.3 T he N eed o f an E xten d ed D a ta Flow G raph In Section 3.1.3 a data flow graph models the basic blocks in micropipelines. Con versely, a DFG description can be translated into an asynchronous system com posed of basic blocks. However, each input port of every block needs a register to latch data. This could possibly result in many registers. For example, the two-input addition DFG description in Figure 3.10 can be translated directly into 60 the implementation by mapping nodes Join and ADD to blocks. In this example, there are two levels of registers. In terms of area and performance efficiency, not both levels of registers are needed. Therefore, removing the input register of the ADD block yields an implementation with better performance and smaller area. 3.3.1 R egister Blocks and C om putational Blocks In order to reduce the cost of registers, the registers are separated from the basic blocks of micropipelines. Two basic blocks, the register block and the computa tional block shown in Figure 3.11(a), are defined. Their input/output behaviors are shown in Figure 3.12. The delays between events, Dsfi, D st,i, Dsp, Djp, and Dbp, are defined later in this thesis. In terms of input/output events, the register block behaves exactly the same as the basic block of micropipelines. On the other hand, the input events and the output events of a computational block are closely related. 
3.3 The Need of an Extended Data Flow Graph

In Section 3.1.3 a data flow graph models the basic blocks of micropipelines. Conversely, a DFG description can be translated into an asynchronous system composed of basic blocks. However, each input port of every block then needs a register to latch data, which can result in many registers. For example, the two-input addition DFG in Figure 3.10 can be translated directly into an implementation by mapping the Join and ADD nodes to blocks.

Figure 3.10: Two-input addition.

In this example, there are two levels of registers. In terms of area and performance, both levels are not needed: removing the input register of the ADD block yields an implementation with better performance and smaller area.

3.3.1 Register Blocks and Computational Blocks

In order to reduce the cost of registers, the registers are separated from the basic blocks of micropipelines. Two basic blocks, the register block and the computational block shown in Figure 3.11(a), are defined, and their input/output behaviors are shown in Figure 3.12. The delays between events, Dsfl, Dsbl, Dsp, Dfp, and Dbp, are defined later in this thesis. In terms of input/output events, the register block behaves exactly like the basic block of micropipelines. The input and output events of a computational block, on the other hand, are closely related. Because there is no storage in the computational block, the input data cannot be released until the output data is released. Therefore, a complete output event cycle of the two-phase handshaking protocol occurs between event Ri and event Ai of the input event cycle. In other words, the progression of events is: Di, Ri, Do, Ro, Ao, and finally Ai. Figure 3.13(a) shows the sequence of input/output events.

Figure 3.11: (a) Decomposed basic blocks. (b) Corresponding EDFG constructs.

Figure 3.12: (a) Timing diagram of the register block. (b) Timing diagram of the computational block.

3.3.2 Extended Data Flow Model for Computational Blocks

Since the behavior of a functional block without storage differs from that of the original functional block of micropipelines, a phantom node represents the corresponding data flow construct for the computational block in Figure 3.11(b). Using the same handshaking-protocol/token analogy from Section 3.1.3, the extended data flow model for computational blocks is defined as shown in Figure 3.13.

Figure 3.13: (a) The behavior of a computational block. (b) Extended data flow model.

Each state in Figure 3.13(a) represents the block waiting for the next event after either receiving or producing an event. State 6 and State 1 in Figure 3.13(a) correspond to an idle phantom functional node in the extended data flow model. When event Ri occurs, the external world notifies the computational block of the availability of input data at Di. Because there is no memory in the computational block, the external world keeps the value of Di until the release of the corresponding output data, in other words, until event Ao occurs. States 2, 3, and 4 in Figure 3.13(a) correspond to a phantom functional node with an input token. When Do produces valid data, Ro notifies the external world. State 4 in Figure 3.13(a) corresponds to a phantom functional node with an output token. After the external world activates Ao and releases the output data, the computational block activates event Ai and is idle again.

Figure 3.13(b) is the extended data flow model for the computational block. Unlike the conventional data flow model, the phantom node does not absorb an input token when it produces the corresponding output token; rather, it absorbs the input token when the external world absorbs the corresponding output token of the node. The output token resembles an extension of the input token through the phantom node, and is therefore called an extended token (of the input token).

One transient state in Figure 3.13(a), the state in which the computational block is reset, is not mapped to the extended data flow model. In the implementation shown in Figure 3.11(a), Ai is directly connected to Ao; therefore, this transient state has zero delay, assuming that wiring delay can be ignored. However, if the computational block is implemented in DCVSL, this transient state may take some time. The transient state can be ignored in the high-level data flow model and accounted for with a proper timing model during system analysis.
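The two event orderings can be contrasted in a few lines. The sketch below is illustrative only: the cycle lists are single representative interleavings of the concurrent handshakes, and the helper name holds_input_until_release is invented here.

# One representative event order per block, as described above.
REGISTER_BLOCK_CYCLE      = ["Di", "Ri", "Ai", "Do", "Ro", "Ao"]
COMPUTATIONAL_BLOCK_CYCLE = ["Di", "Ri", "Do", "Ro", "Ao", "Ai"]

def holds_input_until_release(cycle):
    # The block holds its input across the whole output cycle
    # exactly when Ai comes after Ao in the event order.
    return cycle.index("Ai") > cycle.index("Ao")

print(holds_input_until_release(REGISTER_BLOCK_CYCLE))       # False
print(holds_input_until_release(COMPUTATIONAL_BLOCK_CYCLE))  # True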
3.4 Extended Data Flow Graph

After separating the register from the micropipeline structure, a simple but very useful extension of the DFG is presented for describing these new basic blocks. The result is called an extended data flow graph (EDFG), which provides a bridge between the abstract DFG specification and the circuit implementation. The set of basic constructs in the EDFG is shown in Figure 3.14.

Figure 3.14: Basic constructs of the EDFG.

Syntactically, an EDFG is the same as a DFG. However, the rules of token movement for the EDFG differ slightly from those for the DFG. There are two kinds of tokens:

• Regular token: a regular token, denoted by a dark circle, represents the direct output token of a register.
• Extended token: an extended token, denoted by an open circle, represents the output token of a non-register node.

Both kinds of tokens represent the availability of data. The behavior of Storage, the only non-phantom construct in the EDFG, is the same as the behavior model described for the DFG: a Storage absorbs input tokens, either regular or extended, and generates regular output tokens with the same values as the input tokens. The behavior of a phantom construct in the EDFG is similar to the behavior of the corresponding non-phantom construct in the DFG, except for the following differences:

• Input tokens for a phantom node can be regular or extended.
• A phantom node generates only extended tokens.
• A phantom node does not absorb its input tokens when it generates output tokens.
• A token on the input of a phantom node can be absorbed only after all of its extended tokens are absorbed.

Figure 3.15: The behavioral differences between the phantom Fork and the Fork. (a) State diagram of the phantom Fork. (b) Partial state diagram of the Fork.

Figure 3.15 shows the behavioral differences between the phantom Fork and the Fork. After a Fork or a phantom Fork generates its output tokens, the input arc of the Fork in the DFG can receive a new data token. However, the input arc of the phantom Fork in the EDFG cannot receive a new data token until the external world has absorbed all of its output tokens.

Extended tokens and regular tokens There is no difference between the extended token and the regular token in terms of the availability of data and the analogous meaning in the hardware implementation. Defining extended tokens emphasizes the semantic difference between phantom nodes and the non-phantom nodes used in the conventional data flow graph [16], and the relationship between phantom node input and output data.

Figure 3.16: (a) A Fork. (b) A phantom Fork with an input Storage.
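The phantom-Fork token rule can be made concrete with a small Python model. This is a sketch under the token rules just listed; the class name PhantomFork and its methods are invented for illustration. The input token is released only once every extended output token has been absorbed.

class PhantomFork:
    def __init__(self, n_outputs):
        self.n = n_outputs
        self.pending = 0            # extended tokens not yet absorbed

    def put(self, token):
        # Accept an input token and emit one extended token per output.
        if self.pending:
            raise RuntimeError("input arc blocked until outputs absorbed")
        self.pending = self.n
        return [("ext", token)] * self.n

    def absorb_output(self):
        # The external world absorbs one extended token; return True when
        # the last one goes, i.e. when the input token is finally absorbed.
        self.pending -= 1
        return self.pending == 0

f = PhantomFork(2)
f.put("d")
print(f.absorb_output())   # False: one extended token still outstanding
print(f.absorb_output())   # True: input token released only now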
DFG vs. EDFG A DFG node is equivalent to its phantom counterpart with an input Storage at each input port; see, for example, the Fork/phantom Fork pair in Figure 3.16. If a DFG description can be replaced by an EDFG description, why is the DFG needed? The first reason is that the DFG allows designers to focus on the functional description without worrying about the hardware implementation, for example, how many adders to use, where to assign registers, and so on. The DFG is also a well-known language and concept for data flow computing, so designers can adopt it easily. Another reason is the convenience of system synthesis: this thesis uses the DFG and the EDFG at different stages of asynchronous system synthesis. The DFG is used primarily in the early steps, such as sequencing and allocation, mapping of sharing schemes, and local transformation. The EDFG is used primarily in the later steps, such as register minimization, deadlock prevention, and local transformation, before the specifications are mapped to hardware modules.

3.4.1 The Hardware Components for EDFG

This thesis adopts the syntax-directed method [9, 11] to realize a physical design from the extended data flow graph (EDFG) specification. In this method, each basic construct in the high-level specification is translated directly into a corresponding hardware module. Therefore, the data flow graph not only describes the behavior of a system but also represents its structure, and the correctness of a hardware implementation is proven by construction. The design method thus focuses primarily on mapping the constructs and behavior models of the EDFG description to the functional/control blocks of the micropipeline structure, and on reflecting the hardware characteristics of those blocks in the parameters of the DFG constructs.

Translating EDFG constructs into asynchronous components An EDFG path is mapped to a two-phase handshaking data transfer bus, which includes a data bus (wires) of the same data type, a request line, and an acknowledge line. A token, either regular or extended, on a data flow path corresponds to the state in which the sender waits for the acknowledge after sending the data and the request. This correspondence allows a component to be designed for each construct in an EDFG such that the interface requirement is satisfied. Thus there is a one-to-one correspondence between the elements of an EDFG and the hardware modules (see Figure 3.17).

Figure 3.17: Generic basic blocks.

All the mappings in the cell library of this thesis have been developed [52]; for example, Figure 3.18 shows a design for the 4-output phantom Fork.

Figure 3.18: Block design for the 4-output Fork.

Since each basic construct of the EDFG is directly mapped to an asynchronous module, the characteristics of each hardware component, such as its delay parameters, area cost, and power consumption, can be attached directly to the basic construct. Furthermore, designers and synthesis algorithms can make design decisions in a high-level data flow description based on these attached hardware parameters.
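In code form, syntax-directed translation is little more than a table lookup plus attached attributes. The sketch below is a toy rendering of that idea; the library entries, attribute values, and the function name translate are all placeholders, not the thesis's actual cell library [52].

# Hypothetical cell library: construct kind -> (module name, delay attrs).
CELL_LIBRARY = {
    "Storage":     ("register_block", {"Dsfl": 1, "Dsbl": 1, "Dsp": 2}),
    "PhantomFork": ("fork_block",     {"Dfp": 1, "Dbp": 1}),
    "PhantomJoin": ("join_block",     {"Dfp": 1, "Dbp": 1}),
    "PhantomAtom": ("function_block", {"Dfp": 4, "Dbp": 1}),
}

def translate(edfg_nodes):
    # One-to-one, syntax-directed mapping of EDFG constructs to modules.
    return {name: CELL_LIBRARY[kind] for name, kind in edfg_nodes}

print(translate([("add1", "PhantomAtom"), ("r1", "Storage")]))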
3.5 Timing Model for Data Flow Specification

A high-level specification is useful not only for describing the functional behavior of a system but also for analyzing and predicting the resulting implementation. In this section timed Petri nets [40] are used to model the timing behavior of basic blocks. The composition of these timed Petri net models can then express the timing behavior of asynchronous systems composed of basic blocks. Based on the timing models derived from timed Petri nets, the timing parameters and timing behaviors of both the DFG and the EDFG are defined for high-level system analysis and high-level synthesis.

3.5.1 Timing Behavior Model for Basic Blocks

As described in Section 3.4, "Extended Data Flow Graph," the storage elements and functional units are considered separately. The two basic blocks in the hardware implementation are the register block and the computational block, shown in Figure 3.11(a). The corresponding EDFG constructs for these two basic blocks are the Storage and the phantom atomic function, shown in Figure 3.11(b).

In addition to the register block and the computational block, another basic block is the control block. Control blocks are basic blocks corresponding to phantom constructs in the EDFG that have more than one input and/or more than one output, for example, the asynchronous block for the Selector. If the joining of multiple events in a control block is considered as "one event", a control block behaves the same as a computational block. For example, a two-input Join can be activated only when both input request events have arrived; the event "both input request events have occurred" is equivalent to the input request event of a computational block. Therefore, only register blocks and computational (non-register) blocks need to be modeled.

Since data transfer between blocks follows the two-phase handshaking protocol, data values are always valid from the request event to the acknowledge event. Therefore, only the events on the control signals need to be modeled. Four events are associated with the register and computational blocks:

• Input data ready, Ri: this event corresponds to the input "request" signal transition.
• Input data done, Ai: this event corresponds to the input "acknowledge" signal transition.
• Output data ready, Ro: this event corresponds to the output "request" signal transition.
• Output data done, Ao: this event corresponds to the output "acknowledge" signal transition.

The timing behavior of basic blocks is best described using timed Petri nets [40], with transitions representing the input/output (control) events of the block and places representing the conditions of events in the block. The delay from a place (condition) to a transition (event), which is labeled on the arc between them, is the minimum time interval from the time the condition is satisfied to the time the transition is activated. The state of the block is represented by the distribution of tokens in the timed Petri net. Figure 3.19 shows the timed Petri net models for the register block and the computational block; for each model the tokens represent the initial state of the corresponding block. By simulating the token movement in each Petri net, the correspondence between Figures 3.19(a) and 3.12(a) for the register block, and between Figures 3.19(b) and 3.12(b) for the computational block, is easily established.

Timing parameters Each pair of sequential events is associated with a delay.
These delays are shown in Figures 3.12 and 3.19.

Figure 3.19: (a) Timing model for the register block. (b) Timing model for the non-register block.

The timing parameters for a register block are defined as follows.

• Forward latch time, Dsfl: the time for the register to latch the input data when the register is ready and the new data has just arrived. This corresponds to the delay from Ri to Ai in the Petri net, where the post-Ri condition represents that the input data is ready.
• Backward latch time, Dsbl: the time for the register to latch the input data when the input data is ready and the register has just become available. This corresponds to the delay from Ao to Ai in the Petri net, where the post-Ao condition represents that the register is ready.
• Propagation delay time, Dsp: the time from when the input data is latched to when the output data becomes valid. This corresponds to the delay from Ai to Ro in the Petri net.

The timing parameters for a non-register block are defined as follows.

• Forward propagation delay, Dfp: the time from when all required input data are valid to when all corresponding output data are valid. This corresponds to the delay from Ri to Ro in the Petri net.
• Backward propagation delay, Dbp: the time from when the output data is acknowledged to when the input data is acknowledged. This corresponds to the delay from Ao to Ai in the Petri net.

In addition to the basic delay parameters associated with each block, there are two delays associated with the environment, δ1 and δ2. The delay δ1 is the delay from Ai to Ri, in other words, the time from the input acknowledge event (completion) to the next input request event (starting). Similarly, δ2 is the delay from Ro to Ao, in other words, the time from the output request (starting) to the next output acknowledge (completion). Both δ1 and δ2 depend on the response of the block's environment and are not constrained by the hardware implementation. Therefore, δ1 and δ2 are bracketed, for example, [δ1] and [δ2].
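To make the later performance formulas easy to compute, the five block parameters can be bundled as data. This is a minimal sketch with invented class names and made-up delay values, not part of the thesis:

from dataclasses import dataclass

@dataclass
class RegisterBlock:
    d_sfl: float   # forward latch time, Ri -> Ai (register was ready)
    d_sbl: float   # backward latch time, Ao -> Ai (data was waiting)
    d_sp: float    # propagation delay, Ai -> Ro

@dataclass
class NonRegisterBlock:
    d_fp: float    # forward propagation delay, Ri -> Ro
    d_bp: float    # backward propagation delay, Ao -> Ai

reg = RegisterBlock(d_sfl=1.0, d_sbl=1.0, d_sp=2.0)
comp = NonRegisterBlock(d_fp=4.0, d_bp=1.0)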
3.5.2 Timing Behavior of Composed Blocks

Two behavior models, the register model and the non-register model, have been defined to describe the behavior of basic asynchronous blocks. If the composed behavior of these two models in timed Petri nets can be derived, the behavior of any given system can be derived. There are four possible combinations of the two models:

1. The connection of the output of a register to the input of a non-register.
2. The connection of the output of a non-register to the input of a register.
3. The connection of the output of a non-register to the input of a non-register.
4. The connection of the output of a register to the input of a register.

Figure 3.20: The process of merging a register block with a non-register block. (a) Before merging. (b) After merging.

Let B1 and B2 be two basic blocks. Let event E of block B be denoted by E(B), for example, Ri(B1), and let timing parameter Dxx of block B be denoted by Dxx(B), where xx is any of sfl, sbl, sp, fp, or bp, for example, Dsfl(B1). When the outputs of block B1 feed the inputs of block B2, Ro(B1) connects directly to Ri(B2) and Ao(B1) connects directly to Ai(B2). The behavior of the composed block can be generated by merging the timed Petri nets of B1 and B2 with Ro(B1) = Ri(B2) and Ao(B1) = Ai(B2). Figure 3.20 shows the process of connecting a register to a non-register.

After the two blocks are merged with Ro(B1) = Ri(B2) and Ao(B1) = Ai(B2), there are two redundant paths, indicated by the shaded paths in Figure 3.20(b).

1. The path from Ro(B1) to Ao(B1) with environment delay [δ2(B1)] represents the output environment of the register block. The path from Ri(B2) through Ro(B2) and Ao(B2) to Ai(B2), which is parallel to it, now plays that role. Therefore, the path from Ro(B1) to Ao(B1) is redundant after the two blocks are merged.
2. Similarly, the path from Ao(B1) through Ai(B1) to Ro(B1) is the input environment of the non-register block, so the path from Ai(B2) to Ri(B2) is redundant.

Figure 3.21(a) is the final block composition for the connection from the output of a register block to the input of a non-register block. The timed Petri nets for all four combinations are summarized in Figure 3.21. These compositions will be used to model the delay parameters and the performance analysis of the data flow specification in the following discussion.

Figure 3.21: The behavior of composed blocks. (a) Register block to non-register block. (b) Non-register block to register block. (c) Non-register block to non-register block. (d) Register block to register block.

3.5.3 Performance Analysis of Linear Pipelines

Two measures evaluate the performance of a system: the completion time and the throughput rate. The completion time measures how long the execution of a set of data takes from the inputs to the outputs of the system. The throughput rate measures the number of data sets the system can process per time unit in steady state; its inverse is called the pipeline period. Let the completion time, pipeline period, and throughput rate be denoted by L, P, and R, respectively. By formulating these performance measures for linear pipelines at the block level, a proper timing model for the high-level data flow specification can be found.

A linear pipeline is a series of computations divided by registers. Although many systems are not linear pipelines, each input-output computation path in a system can be viewed as one. Unlike synchronous systems, the computation time between two consecutive registers is not fixed. The performance of a single stage is analyzed in the next section and then extended to the performance of a pipeline.
3.5.3.1 Performance Analysis for a Stage

Observing the timed Petri net descriptions in Figure 3.19 and the descriptions of the composed blocks in Figure 3.21, the event Ai of a register breaks the input and the output of the register into two loops, with Ai being the joint event of the two loops. The part of the hardware corresponding to one loop of events in the Petri net defines a stage. Two timing parameters, the forward propagation delay time, FPi, and the backward propagation delay time, BPi, are defined for stage i. The delay from Ai of the input register to Ai of the output register determines the forward propagation delay time of a stage; the delay from Ai of the output register to Ai of the input register determines the backward propagation delay of a stage.

Figure 3.22(a) is a simple asynchronous system with computational blocks Comp1 and Comp2 between registers Reg1 and Reg2, and Figure 3.22(b) is the composed behavior of this system. In this system, the output of Reg1, Comp1, Comp2, and the input of Reg2 form a stage; the input of Reg1 and the output of Reg2 form two further, separate stages. The three stages from input to output are labeled stages 0, 1, and 2.

Figure 3.22: (a) Block diagram of a single stage between two registers. (b) Timed Petri net description of the block diagram.

The forward and backward propagation delay times for stages 0, 1, and 2 are formulated as follows. (Note that the input δ1 of Reg1 and the output δ2 of Reg2 are always assumed to be zero when the performance of a system is measured; in other words, new data are fed into the system as soon as the input register is free, and output data are removed as soon as they are available.)

FP0 = Dsfl(Reg1)
BP0 = 0
FP1 = Dsp(Reg1) + Dfp(Comp1) + Dfp(Comp2) + Dsfl(Reg2)
BP1 = Dbp(Comp2) + Dbp(Comp1) + Dsbl(Reg1)
FP2 = Dsp(Reg2)
BP2 = Dsbl(Reg2)

Using these parameters, the timing diagram of the system, obtained by simulation, is shown in Figure 3.23. The measures L, P, and R can be formulated as follows.

L = FP0 + FP1 + max{(FP2 + BP2), BP1} (3.1)
P = FP1 + BP1 (3.2)
R = 1/P (3.3)

Figure 3.23: Completion time and throughput analysis of the system in Figure 3.22.

In the Petri net description of Figure 3.22, each loop of the Petri net always carries exactly one token. The minimum time for the token to move around the loop of stage i is (FPi + BPi). Therefore, the lower bound of the pipeline period of stage i is (FPi + BPi); in other words, the throughput rate of stage i is at most 1/(FPi + BPi). Since two consecutive stages i and (i + 1) share the joint event Ai(Regi), the throughput rate of these two stages, which equals the firing rate of event Ai(Regi), is at most both 1/(FPi + BPi) and 1/(FP(i+1) + BP(i+1)). Therefore, the pipeline period of stages i and (i + 1) is at least both (FPi + BPi) and (FP(i+1) + BP(i+1)), and the lower bound of the system pipeline period equals the maximum of (FPi + BPi) over all stages.

A final observation comes from the timing diagram in Figure 3.23. The forward execution of data on a stage is concurrent with the backward execution on its previous stage; for example, FP1 overlaps BP0 in the timing diagram, and FP2 overlaps BP1. Therefore, the forward propagation delay time mainly determines the completion time.
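Equations (3.1) through (3.3) are easy to evaluate numerically. The sketch below plugs in made-up delay values for Reg1, Comp1, Comp2, and Reg2 (all numbers are illustrative, not measured):

# Assumed block delays (illustrative values only).
d_sfl = {"Reg1": 1, "Reg2": 1}
d_sbl = {"Reg1": 1, "Reg2": 1}
d_sp  = {"Reg1": 2, "Reg2": 2}
d_fp  = {"Comp1": 4, "Comp2": 3}
d_bp  = {"Comp1": 1, "Comp2": 1}

FP = [d_sfl["Reg1"],                                               # stage 0
      d_sp["Reg1"] + d_fp["Comp1"] + d_fp["Comp2"] + d_sfl["Reg2"],
      d_sp["Reg2"]]                                                # stage 2
BP = [0,
      d_bp["Comp2"] + d_bp["Comp1"] + d_sbl["Reg1"],
      d_sbl["Reg2"]]

L = FP[0] + FP[1] + max(FP[2] + BP[2], BP[1])   # completion time (3.1)
P = FP[1] + BP[1]                               # pipeline period (3.2)
R = 1.0 / P                                     # throughput rate (3.3)
print(L, P, R)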
Therefore, the pipeline period of these two stages i and (i + 1) is greater than or equal to (FP{ + BPi) and (FP(,+i) -f PP(,+1)). The lower bound of the system pipeline period equals the maximum of (FP, + BPi) of all stages. A final observation is from the timing diagram in Figure 3.23. The data for ward execution on a stage is concurrent with the data backward execution on its previous stage. For example, F P i overlaps BP 0 in the timing diagram, and F P2 79 Ai Ri Di Regl RegO Comp2 Compl o o — -----Ao Comp n Reg n ----- -- Ro ' ■ I -----/ — S D q Stage 0 Stage 1 o o Stage n Stage (n+1) Figure 3.24: A linear pipelined system. overlaps BP\. Therefore, the forward propagation delay time mainly determines the completion time. 3.5.3.2 Performance Analysis of a Linear Pipeline From previous analysis, two registers without any register blocks between them form a stage. W ithout loss of generality, a series of computation blocks with a register between any two consecutive computation blocks defines a linear pipeline, as shown in Figure 3.24. In this figure, registers are labeled RegO, Regl, ..., Regn and the computation block between Reg(i — 1) and Regi is labeled Compi for 1 < i < n. There are (n -f 2) stages in the system for which the forward and backward propagation delay times are defined below. FP0 = DSfi(RegO) BP0 = 0 FPi — Dap(Reg(i — 1)) + D jp(Compi) + Da fi(Regi), for 1 < i < n BP{ = Dbp(Compi) + Dsbi[Reg(i — 1)), for 1 < i < n PP(n+1) = DS p(Regn) BP(n+i) = Dsbi(Regn) 80 By generalizing the result of the previous example, the performance measures of this system can be formulated as follows. n + l L = £ FPi + BPin+1) + A, i-.O where A = max{0, maxJ=n{5P,- - £"=/+! FPj - BP(n+1)}} (3.4) P = mVx{(FPi + BPi)} (3.5) R = 1 / P (3.6) In equation (3.4), A is most likely zero because usually £ ”= /+1 FPj BP,. There fore, it is convenient to assume A is zero. 3.5.4 Tim ing M odel for EDFG and DFG After understanding the timing behavior of asynchronous systems at the circuit level, a timing model for the high-level data flow specification that reflects the behavior of the low-level implementation is defined. The key to modeling timed- behavior for data flow specification is the interpretation of tokens in the data flow specification with respect to the events in the block model, for example, the token- handshaking protocol relation described in Section 3.1.3. Since an EDFG is an abstract representation of an asynchronous system and a DFG description can be replaced by an EDFG description, the timing model for EDFG is described first. 3.5.4.1 Timing Model for EDFG An EDFG is an abstract representation for an asynchronous system. Therefore, the timing parameters, Dsji, Dm, Dsv,D jv, and Dbp, are directly attached to corre sponding nodes in the EDFG description. Ds/;/sm and (D 3fi/M + Dsp) correspond to the input token absorption time and the time a token takes to move from input to output, respectively. The timed-state diagram for a storage node is shown in Figure 3.25, where each state corresponds to a possible token distribution in Figure 81 [61] ,[53] (new input. X arrived) (idle) ■ (input arrived) (data latched) Ro [62] Ao Ao (output produced) (register resetting) Figure 3.25: Timing model for storage nodes. 3.19(a). Two labels are attached to each directed arc between two states in Figure 3.25, the delay between the two states and the event which occurs between the two states. For example, SI in Figure 3.25 corresponds to the initial token distri bution shown in Figure 3.19(a). 
3.5.4 Timing Model for EDFG and DFG

Having established the timing behavior of asynchronous systems at the circuit level, a timing model for the high-level data flow specification that reflects the behavior of the low-level implementation can now be defined. The key to modeling timed behavior for a data flow specification is the interpretation of tokens in the specification with respect to the events in the block model, for example, the token-handshaking-protocol relation described in Section 3.1.3. Since an EDFG is an abstract representation of an asynchronous system and a DFG description can be replaced by an EDFG description, the timing model for the EDFG is described first.

3.5.4.1 Timing Model for EDFG

An EDFG is an abstract representation of an asynchronous system. Therefore, the timing parameters Dsfl, Dsbl, Dsp, Dfp, and Dbp are attached directly to the corresponding nodes in the EDFG description. Dsfl/Dsbl and (Dsfl/Dsbl + Dsp) correspond to the input token absorption time and the time a token takes to move from input to output, respectively. The timed-state diagram for a storage node is shown in Figure 3.25, where each state corresponds to a possible token distribution in Figure 3.19(a).

Figure 3.25: Timing model for storage nodes.

Two labels are attached to each directed arc between two states in Figure 3.25: the delay between the two states and the event that occurs between them. For example, S1 in Figure 3.25 corresponds to the initial token distribution shown in Figure 3.19(a). After event Ri occurs, S1 transits to S2, and the time interval between S1 and S2 is δ1. S2 in Figure 3.25 corresponds to the token distribution of the post-Ri condition and the post-Ao condition in Figure 3.19(a). The unshaded part of Figure 3.25 represents the arrival of the input token in the state where the register does not hold other data; Dsfl is the latch time for the token absorption. The shaded part represents the arrival of the input token in the state where the register is holding other data; Dsbl is the latch time for the token absorption. The delays associated with the environment are δ1, δ2, δ3, δ4, δ5, and δ6, where δ1, δ3, δ4, and δ5 represent input token arrival times, and δ2 and δ6 represent output token removal times.

For phantom nodes, Dfp represents the time from the generation of an input token to the generation of the corresponding output token, and Dbp represents the time from the removal of an output token to the removal of the corresponding input token. To show that multi-input/multi-output phantom nodes behave the same as single-input single-output phantom nodes, a timed-state diagram for a two-input Join is presented in Figure 3.26. In this figure, each state corresponds to a possible token distribution in Figure 3.19(b). For example, S1 in Figure 3.26 corresponds to the initial token distribution shown in Figure 3.19(b). After both Ri_0 and Ri_1 occur, S1 transits to S2; S2 in Figure 3.26 corresponds to the token distribution of the post-Ri condition in Figure 3.19(b).

Figure 3.26: Timing model for non-storage nodes.

The timing model described in this section is an enhancement of the (extended) data flow model from Section 3.1.3 and Section 3.3.2. Comparing the unshaded part of Figure 3.25 with Figure 3.5(b) for the storage node, S3 is an extra state in the timed extended data flow model: S3, the state between S2 and S4, describes the transient states in Figure 3.5(a). S5, also an extra state in the timed model, represents the register resetting state after the removal of the output token, and is merged into the idle state in Figure 3.5(a). Comparing Figure 3.26 with Figure 3.13(b) for the phantom node, S4 is an extra state in the timed extended data flow model; again, S4 describes the transient state in Figure 3.13(a). At this point, the timed extended data flow model fully reflects the low-level behavior in the high-level description.

3.5.4.2 Timing Model for DFG

Each node in a DFG corresponds to an EDFG phantom node plus a Storage at each input of that node (see Figure 3.16). Therefore, the timing model of the EDFG can be used to simulate a DFG. Alternatively, a timing model for the DFG can be developed using the timed Petri net model of the composition of the register block and the non-register block in Figure 3.21. The timed-state diagram for a DFG node is shown in Figure 3.27.

Figure 3.27: Timing model for DFG nodes.
The part of the diagram corresponding to the shaded part of Figure 3.25 is not shown in this model. Comparing Figure 3.27 with Figure 3.5(b), there are two extra states in Figure 3.27, S3 and S5. S3 corresponds to the transient states for the data latch and the data computation; S5 corresponds to the transient state for functional unit resetting.

Figure 3.28: Simplified timing model for DFG nodes.

Based on the analysis in Section 3.5.3, the timing model is further simplified as shown in Figure 3.28, where only two timing parameters, D_FP and D_BP, are needed:

D_FP = D'sp + Dfp + D''sfl
D_BP = Dbp + D'sbl

Referring to Section 3.5.3, D''sfl in D_FP is the forward latch time of the output register of the functional unit, and D'sbl in D_BP and D'sp in D_FP are the backward latch time and the propagation delay time of the input register of the functional unit. In other words, the stage delay parameters are adopted for the simplified model. This simplified DFG timing model complies with the data flow model in Section 3.1.3 and reduces the simulation complexity, in addition to providing a simpler model for high-level synthesis problems.

According to the timed behavior of a DFG/EDFG, a system can be simulated and analyzed from its behavioral description. Furthermore, the attached parameters and the analysis formulations can be used as measures in high-level synthesis. These will be discussed in the next section.

Chapter 4

Data Path Synthesis

4.1 Introduction

Data path synthesis takes a system behavioral specification and maps operations to operators, where an operation represents a functional operation in the behavioral specification and an operator represents a hardware component that executes operations in the implementation. In the design system of this thesis, the behavior of a design is specified by the DFG, which describes the data dependencies between operations in the system. The operation-operator mapping is the sequencing and allocation of operations over operators, and includes the following information:

• The allocation of each operation to an operator.
• The execution order of the operations allocated to the same operator.

Allocation can be either static or dynamic. In a static allocation, each operation is allocated to and executed by a specific operator. In a dynamic allocation, each operation is allocated to several operators and can be executed by any one of them; the operator that executes an operation can differ from time to time.
The completion time is the time between the arrival of a set of inputs and the computation of their corresponding outputs. The initiation rate is the number of times a set of inputs for a computation can be initiated for a given time period. The reciprocal of the initiation rate is called the initiation period. This thesis presents an algorithm for finding both the optimal static allocation of the components in the library to the operations in the DFG and the optimal fixed sequencing of the set of operations that are assigned to the same component. 88 The resource allocation and sequencing are done simultaneously. The criterion for optimization is the minimum completion time. Figure 4.1 shows the input-output relationship of the data path synthesis problem in this system. In this chapter, the timing model used in the data path synthesis of asyn chronous systems is introduced. The control delay and the delay overhead caused . by the structure for the resource sharing is considered in addition to the operation execution time. Then the data path synthesis problem is formulated as a graph theoretic problem. In the following section, a condition that forms a design subspace containing at least one minimum completion time design is presented. Also, algorithms that generate an optimum design or near-optimal designs from this design subspace are provided. In the last section, the resource-constrained fixed sequencing and static allocation problem is formulated. The goal of this problem is to minimize the initiation period by extending the graph model used in the minimization of the completion time. 4.2 T im ing M od el for Sequencing and A llocation P rob lem In an asynchronous system the time interval during which an operator executes an operation and the time interval during which an operation occupies an operator are different. The former is dominated by the computation delay of the operator, and the latter is dominated by the control delay and the computation delay of the operator. Two timing parameters correspond to an operation executed by operator m*: Dfp(m,k), the forward propagation delay time, and DBp{m,k), the backward prop agation delay time. 89 The forward propagation delay time is also known as the computation delay. When all operation inputs are available and operator m* is available, execution of this operation starts. At Dpp(mk) time after the start of this operation, operator mk generates the operation outputs. The backward propagation delay time is also known as the control delay. The operator starts resetting after the other operators take all of the outputs. Dsp(mjt) is the time interval in which the operator resets so that it is available for another operation. There is a waiting period, < 5 , between the generation of outputs and the removal of all outputs. Therefore, Dpp(mk) corresponds to the time when operator nik executes the operation and (Dpp(mk)+DBp(mk)+ 6 ) corresponds to the time when the operation occupies operator mk- Assume that every output of an operation in data path synthesis has a register, then all data outputs are taken immediately after they are generated, so 6 = 0 . In asynchronous systems, an operation starts its computation as soon as all inputs and the resource for the operation are available. Except for the data depen dency and the resource dependency, no other mechanism postpones the starting time of a computation. In the following discussion, assume that all inputs arrive at time 0 , the time when the system is started. 
E x am p le Figure 4.2(a) is an example DFG that has four additions, v+ii , . . . , v+ t4 , and four multiplication, . . . , v,,4. Assume that one adder, m +ij, and one mul tiplier, m„,i, are available for the implementation. Also assume that the tim ing parameters for these two operators are Dpp(m+ti) = 2, Z?sp(m+ij) = 1, DFPim.'i) = 4, and Dapim*^) = 1. Figure 4.2(b) is one sequencing and allocation result, in which all additions are allocated to with the sequence (u+)i,u +i2,u+, 4 , ^ , 3) and all multiplications 90 nu,: v.j -> v .,-> v.2-> vv m * , i • \ i ->v+ .2 ->K,< -> v + i J (b) FP BP m 2 3 5 6 9 14 19 22 25 (C) Figure 4.2: (a) A four-addition four-multiplication DFG. (b) A sequencing and allocation over one adder and one multiplier, (c) The G antt chart representation of the sequencing and allocation. are allocated to m»,i with the sequence (v*,3,i>»,i,u.,2,v«,4). Assume that there is no sharing overhead. Figure 4.2(c) shows the Gantt chart for this sequencing and allocation result, where dashed arrows represent the data precedence of the DFG. In this example, u»,i can start only when u. i3 releases m»( i since both are allocated to the same operator and vr< i is sequenced after On the other hand, v , t3 can start when both u+)i and v +<2 finish their execution because u»)3 does not share the same operator with them. 91 S h arin g overhead For several operations to share an operator, a control struc ture is needed to manage the sequencing and allocation. This control structure results in area overhead as well as performance overhead. The area overhead ratio, Aoverhead-ratio, is the ratio of the area overhead of the sharing structure to the area saved as a result of the resource sharing. For example, if n additions share 1 adder, the area overhead ratio is A the area cost for n-to-1 sharing structure * *overhead-ratto = , * x r / i \ j i the area cost ol (n — 1 ) adders If Aoverhead.ratio > 1, the operator should not be shared by these kind of op erations. Therefore, this type of operation is not put into the data path syn thesis. For example, sharing logic AND gate for AND functions is not worth while. The performance overheads, Dppo v and D b p ov, inflate the execution delay time and the control delay time, respectively. Therefore, when operator m* exe cutes the operation u ,- the forward propagation delay time of the operation in a shared structure is (Dppimk) + D f p ov) and the backward propagation delay time is (D]3 p{mk) + D b p ov). The formulations for j40U e r/iea(f_ rat,0, D f pov, and D b p o v for a given sharing struc ture will be discussed in Chapter 5, “Control Synthesis” . Based on the analysis in Chapter 5, D fpo v = 0 and a constant D b p o v are used in this chapter. Also assume Aovtrhead-ratio < 1 is checked for the operators that are used in data path synthesis. 4.3 N o ta tio n s and T erm inologies 4.3.1 N otations and Term inologies for the Graph Theory A directed graph G = (V,E) is defined by a set of nodes V and a set of directed edges E. A directed edge is an ordered pair of nodes (u,-, Vj) for u,-, vj 6 V, where Vi is the head of the edge and Vj is the tail of the edge. Indegree of a node u ,-, 92 ind(v{), is the number of edges whose tail is V{. Outdegree of a node v,-, outd(vi), is the number of edges whose head is u ,-. 
A directed path is composed of a set of directed edges {(ui,U2)i (^2^ 3)7 • ■ •, (u„_i, vn) }, and can be simply represented by a sequence of nodes, (wi, U 3 , ..., u„-i, un)- This path begins from node Uj, passes through v2, U 3, ..., and ends at node vn. A path is simple if all nodes on the path are distinct. However, the path may begin and end at the same node. A cycle is a path that begins and ends at the same node. A directed acyclic graph is a directed graph with no cycle. A directed graph is strongly connected, if for every two nodes u ,- and Vj two paths exist, one from V { to vj and one from vj to Vj. A directed graph is unilaterally connected, if for every two nodes V { and vj only one path exists, either one from u ,- to Vj or one from Vj to w « . A directed graph is weakly connected, if for every two nodes V { and vj a sequence of nodes (w ,-,, u ,-2, . . . , u,n) such that u ,- = u,-,, Vj = u)n, and an edge between vp and vp+\ for p = 1,..., n — 1 exists. A directed graph is disconnected, if it is not weakly connected. 4.3.2 N otations for D ata Path Synthesis The notations used in data path synthesis for a design are defined as follows. • O T: the set of operation types in the design. Let k = \OT\ and OT = {ti,<2, • • • ,**}■ • Vt\ the set of type t € OT operations in the design. Let nt = |Kt| and Vt = {vt,\,vt,2 , .. • ,ut,n,}. Let V be the set of all operations in the design, therefore, V = UtjeorVtj- • Mt: the set of type t components (operators) in the design. Let rt = \Mt\ and Mt = {m i,i,m ti2, ... Let M be the set of all components for the design, therefore, M = U ^ e o rMi}. 93 Each component m* € M has following parameters. • OTopr(mk): the operation type of the operator. • DFp(irik): the forward propagation delay time of the operator. • Dspirriit): the backward propagation delay time of the operator. • Sopr(mi!): the area cost of the operator. Each operation uf - 6 V has following parameters. • OTopn(vi): the operation type of the operation. • m(vi): the operator that executes the operation. • Tea(v{): the time at which the operation begins execution. • Tec(vi,m(vi)): the time at which the operation completes execution. • Tra(vi): the time at which the operator m(u;) for operation i> ; begins to reset. • Trc(vi,m(vi)): the time at which the operator m* for operation Vi finishes resetting. A DFG, which describes the behavior of the design, is denoted by a directed acyclic graph G = (V,E). A node u ,- 6 V represents an operation in the design, therefore, V = U tjg o r^ " ^ directed edge (u,-, vj) € E represents an output of operation V { is an input of operation Vj, where V { is the direct predecessor of Vj and Vj is the direct successor of u,-. If there is a path from u; to Vj, V { is the predecessor of vj and Vj is the successor of u;. If ind(v{) = 0, u ,- is an input node; while, if outd(vi) = 0 , V { is an output node. 94 w Tr . j r < ] u+,l 0 y 2 V+.2 3 5 5 6 W +,3. ... y g . " 2 T 24 '25 U+,4 19 21 2 F '22 W .,1 10 14 14 15 V..2 .... 15_ " 19'' 19 .... "20 ”.,3 5 9 10 W.,4 .... 2 r 25 25 '26 Table 4.1: Timing parameters for each operation in Figure 4.2(c). E x am p le Using the notation defined in this section, the example in Figure 4.2 can be defined as follows. • OT = {+,*}. • V+ = {u+il,t;+i2,u+)3 ,i;+,4}, K = and V = V+U V.. • M+ = M , = {771, 4 }, an<i M = M+ U A/»; m(v+ti) — m +)i for i = 1,2,3,4, and m (v,tj) — m ,t 1 for j = 1,2,3,4. 
• The DFG is G = (U, E), where V = V+UK and E = {(u,,i,u,i2), (u.,2,v+l3), (u,,2,u+l4), (u+ll,U.l3), (u+,2,l>.l3), (u+1 4,u ,l4), (v.,3,U.,4)}- • For the sequencing and allocation result in Figures 4.2(b) and 4.2(c), the corresponding timing parameters Tes, Tec,Trs, and Tt c are shown in Table 4.1. 4.4 T he P rob lem Form ulation 4.4.1 T im ing Constraint and Graph M odel According to the timing model described in Section 4.2, the timing relationship between the execution starting time, the execution completion time, the operator 95 reset starting time, and the operator reset completion time of operator u ,- € V can be described as follows, where operator m(u,-) executes w ,- . Tec(vi,m(vi)) = Tes(v {) + DFp(m(vi)). (4.1) Tra(vi) = Tec{vi,m(vi)), since 8 = 0. (4.2) Trc(vi, m(vi)) = Tra{vi) + DBp(m(vi)). (4.3) The time interval that operation u ,- occupies operator m{vi) is from Tes(vi, m( Vj )) to Trc{vi,m(vi)), which is denoted by (Tea(u,-,m(v,)),Trc(u,-,m(u,))). Data dependency constraint: If directed edge (vi,Vj) is in DFG G = (V, E), the computation of operation vj can start only after the computation of operation V{ is completed, in other words, Tec(vi,m(vi)) < Tes(vj). (4.4) Resource dependency constraint: If the same operator m executes operation Vi and operation Vj, then the time interval when u ,- occupies m and the time in terval when Vj occupies m cannot overlap, in other words, (Tes(u,), Trc(u,-, m(u,))) and (Te3(vj),Trc(vj,m(vj))) cannot overlap. Therefore, only one of following rela tionships holds. Trc(vi,m(vi)) < Tes(vj), that implies u ,- is executed before Vj; (4.5) Trc (vj,m(vj)) < Tea(vi), that implies vj is executed before u,-. (4.6) Graph M odel The timing parameters and the timing constraints are modeled as delays in the graphical representation of DFG, G = (V,E). A delay is assigned to each operation u; € V to represent the delay from the inputs to the outputs of an operation. Since this delay equals the execution time of the operation, it equals DFp{m(vi)). A delay is assigned to each edge (v{,Vj) 6 E to represent the 96 delay from an output of u; to an input of vj. Since Vj can start as soon as t> j generates outputs, this delay equals 0. Later edge (vr, vs) will be added into G to represent that vr and vs share the same operator and that vs starts execution after vT finishs execution. Due to the resource dependency constraint, this edge delay equals Dgp(m(vr)), where m(vr) = m(us). N o ta tio n s for th e p a th delay The path delay is the sum of all node delays and edge delays in the path, and is denoted by l(vi,V2, ... ,v„-i,vn) for path (t?i, V 2 , . . . , vn-i,v n). Either one delay, the U i delay or the vn delay, or both can be excluded. These delays are denoted as follows. lv1{Vl,V2,...,V n_ 1,Vn) = l(vuv 2, ...,v n-i,v n) - d ( v i). lvn{v i , U 2, . . . , U „ _ i , l > n ) = l{v l,t> 2, . . . , U „ - l , U n ) ~ d{vn). lvu V„{vi,v2, . . . , U „ _ 1 , 1 > „ ) = l(vu v2,..., un_i, v„) - d(v i) - d{vn). The longest path from u; to vj is the path with the largest path delay out of all paths from u ,- to Vj. This delay can be denoted by L(vi,vj), LVi(v;,vj), LVj(vi,vj): or LVitVj(vi,Vj), depending upon whether one delay, the u ,- delay or the vj delay, or both are excluded. The inputs of a DFG include the inputs of all input nodes in the DFG. The outputs of a DFG include the outputs of all output nodes in the DFG. 
The longest path from the inputs of a DFG to node ut - is the path with the largest path delay out of all paths from the inputs of the DFG to node u;, and its delay is denoted by L(I,V{). The longest path from node u ,- to the outputs of a DFG is the path with the largest path delay out of all paths from node u ,- to the outputs of the DFG, and its delay is denoted by L(u,-, O). The longest path from the inputs of a DFG to the outputs of the DFG is the path with the largest path delay out of all paths in the DFG, and its delay is denoted by L(I,0). When more than one graph is considered, LGk(.. .) and i° k( ...) are used to denote the path delays for graph Gk- *2 Figure 4.3: The timed graph with edges for the sequencing and allocation in Figure 4.2(b). (In fact, this is the SA graph representing the sequencing and allocation.) E x am p le For the example DFG in Figure 4.2(a), the graph representation of the sequencing and allocation result of Figure 4.2(b) is shown in Figure 4.3, where delays associated with the nodes and edges represent the timing relationships be tween operations. The thin edges, which represent the data dependency, are part of the original DFG, and the thick edges, which represent the resource dependency of the sequencing and allocation result, are not part of the original DFG. Let G' = (V,E U E') represent the graph in Figure 4.3, where E' is the set of thick edges. In G', the set of input nodes is {w+,i}, and the set of output nodes is {u.,4}. The path delay of (u+,i, v»,3,u.,i) is 11, but the longest path delay from u+,i to u»,i is 14, in other words, /G W i,t> „ l3,u„,i) = 11, LG ' (?;+,!,?;„,!) = 14. 98 The path delay of (u+tl,w,i 3,u .i4) is 10, but the longest path delay from u+,j to u*i4, which is also the longest path from the input of the DFG to the output of the DFG, is 25, in other words, lG'{v+ t l,V„,3,U„i4) = 10, LG'(I,0) = LG'(v+,1,vm A) = 25. 4.4.2 Sequencing and A llocation This section defines a sequencing and allocation for a given DFG over a given set of resources and a feasible sequencing and allocation. Then a graphical represen tation corresponding to each sequencing and allocation is shown. This graphical representation is used in the rest of this chapter for data path synthesis. Definition 4.4.1 A sequencing and allocation (SA) of a DFG G = (V, E) over a set of resources M is defined as follows. • Allocation: For each tj 6 OT, Vtj is partitioned into rt3 subsets, Vi, , 21 ..., VtJ> rtji in other words, r,j Vt] = U Vi,,,-, and <=1 Vtj,P p) Vtj,q = 0, for 1 < p < q < rtj. The operations in each partition are allocated to an operator with the same operation type as these operations, for example, operations in Vtj< i are allo cated to operator mt j. • Sequencing: The operations in each set Vt]< i form a sequence, which rep resents the execution order of the operations through operator 99 A feasible sequencing and allo catio n (a feasible SA) is one in which a timing assignment of Tes, Tec, Trs, and Trc for every operation in V exists such that the data dependency constraint, the timing relationship in Equation (4.1), is satisfied for every (u,-, vj) € E, and the resource dependency constraint, the timing relationship in Equation (4.5), is satisfied for every (V{,Vj) in any sequence in SA. D efinition 4.4.2 For the SA defined in Definition 4.4.1, assume that (ut , vtjj2, ..., U tj.ijv, |) denotes the sequence for the operations in partition Let EsaV i be an edge set corresponding to the sequence in Vt]ti defined as ■EsAv^'j = | < Z = 1> • • • > — ! 
} • An SA edge set E sa corresponding to the SA of a DFG G = (V, E) is defined as r‘ > E s a = U U E sav, . • tjGOTi= 1 An SA g rap h G s a corresponding to the SA of a DFG G = ( V , E ) is defined as G Sa = { V , E \ J E s a ). Note that E fl E s a may not be empty set. L em m a 4.4.1 Let G s a be an SA graph with the SA edge set E s a corresponding to a feasible SA for DFG G = (V,E) over a set of operators M. If Dppim-k) and DBp{mk) for every m* € M are greater than zero, the following properties are true. 1 . W{ € V, m(vi) € M. 2 . If (vi,vj) € Esa, rn{v{) = m{vj). 3. G s a = (V,E U E s a ) is acyclic. 4. In graph G'SA = (V,ESA), 1 0 0 (a) ind(v{) < 1, for every V { G V. (b) outd(vi) < 1, for every u ,- G V. (c) For any v,-, vj G V, if there does not exist a path between and Vj, m(v{) ■ £ m(vj). Proof: 1. Since the allocation is over the set of resource M, V u,- G V, € M. 2. If (v,-, vj) G there exists a sequence containing i > ,- and vj in SA. Since the operations in any sequence in SA share the same operator, ?n(u,) = m(vj). 3. If G s a is cyclic, there exists a cycle, (vi, V 2, ...,v n, uj). If (u,-, u,+i) G E , then Tec(vi,m(v{)) < Te3{vi+1) is due to the data dependency constraint (Inequal ity (4.4)). If (uj, uj+i) G E s a , then TT C (vi,m{vi)) < Tea(vi+\) is due to the resource dependency constraint (Inequality (4.6)). Since DFp(mk) > 0 and DBp(pik) > 0 for every m* G M, according to relations (4.1), (4.3) and (4.2), Tea(vi) < Tec(v{,m(vi)) = Tra(vi) < Trc(u,-,m(u,)) for every u; G V. There fore, Tea(vi) < Tea(v2) < . . . < Tea(vn) < Tea(v 1). This is a contradictory result. 4. (a) If there exists an operation V { with ind{vi) > 1 for graph G'SA = (F, Esa), there exists at least two distinct operations Vj and such that (vj,V{), (Vk,v;) G Esa- Therefore, u ,- is in two distinct sequences in SA, where each sequence corresponds to a partition or an allocation. Then the intersection of these two partitions is not empty. This result contradicts the definition of an SA. (b) Similarly, outd(vi) < 1 for every u ,- G V can be proven. (c) If there does not exist a path between V { and Vj in G'SA, vi and vj are in the different allocations, so m(vi) ^ m(vj). □ 1 0 1 If a graph G sa = (V, E U E s a ) has the four properties stated in Lemma 4.4.1, for a given DFG G = (V.E), a feasible SA for DFG G can be generated. Algorithm 4.4.1 [G §^ to SA conversion] 1 . Find the set of nodes with 0 indegree in G'SA = (V ,E s a )• Let this set be denoted by Sh- 2. For every Vi G Sh, find a simple path starting from u ,- in G'SA. 3. For each path from Step 2, a set is formed by the operations in the path, and a sequence is formed by the ordering of the path. An SA is formed by all sets and sequences corresponding to paths from Step 2 . Lemma 4.4.2 For a given DFG G = (V.E), if a graph G sa = (V,EU E s a ) has the four properties stated in Lemma 4.4.1, Algorithm 4.4.1 produces a feasible SA over M. Proof: Since G sa is acyclic, G 'SA is acyclic. Therefore, Sh is not empty for Step 1 of the algorithm. Since ind(v{) < 1 and ouid(vi) < 1 for any u ,- G V in G'SA, an unique simple path is generated for every v, G Sh from Step 2 of the algorithm. Therefore, no two paths pass through the same operation. Since two operations connected by an edge in E sa are allocated to the same operator, all operations on the same path are allocated to the same operator. Therefore, all subsets in Step 3 satisfy both the allocation and the sequencing conditions in Definition 4.4.1. 
Provided that the node delay is the operation execution time, the edge delay is 0 for the edge in E, and the edge delay is the operator reset time for the edge in E s a • Since G sa is acyclic, the longest path delay from the inputs of G sa to the input of each node can be found, and the longest path delay can be assigned as the execution starting time for each node. This timing assignment satisfies the timing constraint in Definition 4.4.1. Therefore, the result produced by Algorithm 4.4.1 is an SA over M. □ 1 0 2 Corollary 4.4.1 Let G sa = (V, E U E s a ) satisfy the properties stated in Lemma 4.4.1 corresponding to a DFG G = (V,E ) and a resource constraint M . There exists a feasible SA for DFG G over M , where Gsa is the corresponding SA graph for this SA. If the condition that Gsa = (V, E U Esa) is acyclic is relaxed to the condition that G'sa — (V, Esa) is acyclic, Algorithm 4.4.1 still works. However, the resulting SA may not be feasible. Since an SA graph is interchangeable with an SA for a DFG, an SA graph is used to represent an SA, and an acyclic SA graph is used to represent a feasible SA. E x am p le The SA in Figure 4.2(b) can be described as follows. • The SA for additions: with sequencing (v+ii,v+t2,v+A,v+i3 ). • The SA for multiplications: K,,i with sequencing (u*,3,u*,i,u«,2)t;»t 4 ). The graph in Figure 4.3 is the SA graph corresponding to the SA in Figure 4.2(b). Since the SA graph in Figure 4.3 is acyclic, the corresponding SA is feasible. Table 4.1 shows a possible timing assignment for all operations that satisfies both the data dependency constraint and the resource dependency constraint. Change the SA for additions from the previous SA to • F+,i with sequencing (u+,4 ,u+i3,u+il,u+ > 2). Figure 4.4(a) is the corresponding SA graph for this change. The new SA is not feasible, since there is a cycle (u.,2,u+,3,u+ > i,u»i 3,u ,tl, u»,2). In addition to the previous change for the SA for additions, change the SA for multiplications to • with sequencing (v«,i,u.i 2,u.,3,u«,4). Figure 4.4(b) is the SA graph corresponding to the lastest change. The lastest SA is feasible, since there is no cycle. 103 (a) (b ) Figure 4.4: (a) A cyclic SA graph, (b) An acyclic SA graph. 4.4.3 T he System C om pletion Tim e This section formulates the synthesis problem as a graph problem by using the graph model defined in the previous sections. Theorem 4.4.1 The completion time of a DFG G = (V, E) for a given acyclic SA graph G s a = (V, E U E s a ) equals the longest path delay from the inputs of G s a to all its outputs. This path delay is denoted by C T ( G , E s a ), therefore, CT(G,Esa) = LGsa(I,0) P roof: This theorem is proved by induction. Assume that pred(vi) is the set of the direct predecessors of V{. Let O s a represent the set of output nodes in G s a • Since G sa is acyclic, the operations in the SA graph can be levelized. For example, if pred(v{) = 0, Level(vi) = 1; otherwise, Level(v{) = AIaxVjepre^ V i){Level(vj)} + 1. Assume that there are £ levels of operations in G s a • For ut - with Level(vj) = 1 , Tec(vi,m{vi)) = Te,(vi) + DFP(m{vi)) = DFP(m(vi)) = LGsA{I,Vi). 104 Assume that Tec,m[vj)(vi) = L°SA(I,Vi) for u ,- with Level(v{) < I < £. 
For any vk with Level(vk) = / + 1, Te,(vk) > Tec(vj,m(vj)) + DBp{m(vj)), if {vj,vk) £ ESA Tea(vj) > Tec{vk,m{vk)), if (vj,vk) € E Since an operation is started as soon as all data and resources are available, Tec(vk,m(vk)) = Tea(vk) + DFp{m(vk)) _ , , J Tec(uj,m (uj)) + DBp(m(vj)) if (vj,vk) £ ESA 1 _ | m M ) jf (v,' v t ) e E | + D irp(m(ui )). Since vj £ pred(v), Level(vj) < /, Tec(vj,m(vj)) = LGsA(I,Vj). Therefore, Tec(vk, m(vk)) = L°SA(I,vk). Hence, CT(G,Esa) = MaxVieoSA{Tec(viMvi))} = LGsa(I,0). □ Lemma 4.4.3 Let Gsa, = (V,EUEsai) and Gsa2 = {V, E u E sa 2) be two acyclic SA graphs corresponding to a DFG G = (V,E). If Esai — Esa2i then CT(G,EsAl)< C T (G ,E SA2). Proof: Assume that pathi is the longest path delay from the inputs of Gsa, to all its outputs. Since Esa, Q Esa2, pathi is a path in Gsa2• Therefore, CT(G,ESAi)< C T(G ,EsA2). □ Corollary 4.4.2 For a given DFG G = {V.E), let Gsa, — (V, E) be the SA graph with Esa, = CT(G, 0) is the lower bound of the completion time for all acyclic SA graphs. 105 A path produced in Algorithm 4.4.1 corresponds to an SA of a set of operations to an operator. Therefore, the number of paths for each type in G'SA = (V, E sa ) represents the number of operators of a specific type required for this SA graph. The synthesis problem is formulated as follows. For a given DFG G = (V, E) and a given set of operators M, find an SA edge set E sa in which E sa and the corresponding G s a — (V', E L I E s a ) and G'Sa = (V, E s a ) satisfy the properties of Lemma 4-4-1 so that the corresponding SA graph G sa has the minimum longest path. 4.5 M inim izing th e S ystem C om pletion T im e This section presents a condition that forms a design subspace containing at least one minimum completion time design. An algorithm enumerating all S As satisfying this condition is then developed. The minimal completion time design in these SAs is the minimum completion design. The search space is further reduced using the branch and bound technique. The worst case complexity of this algorithm is 0 (rn • n!), where r is the number of operators and n is the number of operations. Finally, heuristic algorithms that search a design through this design subspace are developed. Instead of enumerating all possible SAs satisfying this condition, these heuristic algorithms direct the searching to a design using some cost functions. 4.5.1 The D esign Subspace Containing a M inim um C om pletion Tim e D esign D efinition 4.5.1 Let G s a : = (V,E U ESAi) and G s a 2 = (V,E U E s a 2) be two acyclic SA graphs corresponding to a DFG G = (V,E ). Let Pt,i and PtJ be two simple paths in G'SAl = (V, Esax) corresponding to two sets of operations allocated to the same type of operators, for example, type t operators. Pt); and Pt,j may 106 Pu P ' J ...► ( V y ^ (a) (b) Figure 4.5: Moving a node from Ptj to Ptj- (a) Before adjustment, (b) After adjustment. be the same path. If Esa 2 is an adjustment of Esax in which an operation vT is moved from Ptj to P t j * and if this adjustment satisfies following relations, LGsa*{I,vt) < LGsa'{ I, vt), and L G s a 2 (7, u,) < LGsai (7, va) for vB € V \ {ur}, this adjustment is called the left-shift. Assume that Ptj and Pt)j are two simple paths in G'SAl for the allocation of two same-type operators. Assume that there are n,- operations in path Ptj and nj operations in path P t j * and assume that these two paths are denoted as follows. p _ f (uMnnt,i2,--.,n t,ip_1,nt,ip,ut1< p + 1, . . . , u 1|in.), if p> 1, I (vt,iP, vt,ip + l vtli„.), if p = 1. 
p _ f > Vt,j2 1 ' ‘ • 1 Vt,jq 1 Vt< jq + 1 ) ' • • > )) if < 1 ^ 1 1,3 \ i f 9 = 0 . Esa 2 is an adjustment of Esax in which vtjp is moved from Ptj to P t j and put between vtljq and n«j,+1. Figures 4.5(a) and 4.5(b) show the node before and after 107 this adjustment, E s A: \ { K p - i , (utlip, Ui,ip+1), (vt,jq, Vt,jq+1)} u {K <p-,»ut,» P + i )»K ,i,,vt,iP), K P, vtJq+, )} if P > 1 ,9 > 1, E s Ai \ { ( W( , i p _ i ) V t , i p ) i (Vt,ipi V t,ip + 1 ) } E s a 2 = u { K i p . , , v t,ip+1), K , p , v ttjq+1)} i f p > 1, q = 0, E s a , \ {(vt,ip, v t i'p+ i u { K i , 1uMp)»K«p^(J,+1)} if P = 1,9 > 1, . Esa, \ { K ip , «i,ip+i)} U {(t7tlfp, v t,Jq+, )} if p = 1 , 9 = 0 . L em m a 4.5.1 Let Gsa, = (V, EUEsa, ) and Gsa2 = (V,EUEsa 2 ) be two acyclic SA graphs of DFG G = (V,E). If Gsa2 is a left-shift of Gsa, , the following two conditions are satisfied. 1 . If Gsa, satisfies the resource constraint M, Gsa 2 satisfies the resource con straint. 2 . CT(G,Esa 2)< C T (G ,E Sa,). P roof: 1. The left-shift adjustment does not increase the number of partitions. There fore, if G s a , satisfies the resource constraint, G s a 2 satisfies the resource constraint. 2. Since the longest path delay from the inputs to any node is not increased by the left-shift adjustment, CT(G, E s a 2) = LGsAi(I,0 ) < LGsai(I,0 ) = C T ( G , E s a ,). □ D efinition 4.5.2 An acyclic SA graph G s a is called active, if another acyclic SA graph, which is a left-shift of G s a > does not exist. T h eo re m 4.5.1 There exists at least one minimum completion time design that is active. 108 P ro o f: Assume that all minimum completion time designs are not active. Let Gsai be one of the minimum completion time designs. Since Gsai is not active, Gsa 2 > which is the left-shift of Gsai > can t> e found. If Gsa2 is not active, Gsa,, which is the left-shift of Gsa2 » can be found. Eventually, a series of acyclic SA graphs GsAni G SAn— i , • GsAt is found, where Gsa„ is active and G s ^ , is the left-shift of GsAi ^ or * = 1 , — 1 . According to the Lemma 4.5.1, CT(G, Esa„) < CT(G,EsAn~i) ^ GT(G,Esa 1 )• Since Gsai is a minimum completion time design, CT(G,EsAn) = CT(G,EsAi)- Therefore, Gsa„ is a minimum completion time design and Gsa„ is active. This is a contradiction. Therefore, there exists at least one minimum completion time design that is active. □ IIow is it determined if a left-shift SA graph exists for a given SA graph? First, determine if a “gap” exists for the left-shift adjustment. Assume that the operations in Pt,i and the operations in Pt< j are allocated to operator nit,i and operator mtij, respectively. In order to move an operation in front of vt,jq + 1 without increasing the longest path delay from the inputs to vt,jq+1, the gap before vtj q + 1 must be greater than or equal (Dpp(mt,j) + DBp{mt,j) + Dbpov), in other words, U - ’W J - (L° SA' V , * * ) + DBP(mt,j) + DBPo„) > Dpp(mt'j) + DBp{mtij ) + DBpov, for q > 1 , (4.7) - DFP(m Ui) + DBp{rnt,j) + DBpov, for q = 0 . (4.8) The first item, (I,vt< jq+ 1), in the lefthand side of Inequality (4.7) is the longest path delay from the inputs of the system to the inputs of Vt,jq+,, and the second item, (L°SA 1 (I,Vt,jq) + DBp{mt,j) + DBpov) is the longest path delay from the inputs of the system through {vt,jq,vtj q+ 1) € Esa^ to the input of vtijq+ 1. 
If the difference between these two items is greater than or equal to (Dpp(mt< j ) + DBp{nit,j) + DBPo v)i an operation can be put between them without increasing the longest path delay from the inputs of the system to the inputs of vt+I. 109 Second, find a potential candidate to move into the gap. Assume that U (,Ip is the left-shift operation. Let GgAj be defined as (V,EU Esa^ \ {(ut,>p_,,Ut,«p)})- By definition of the left-shift adjustment, the adjustment needs to satisfy the following relation. L Z A p2{I,vttip) < L Z A p'{I,vt< ip). G s Therefore, the starting time of the gap needs to be smaller than L v , s^ 1 (7, vtiIp), in other words, LGsa> (7, vtjq) + DBp{mt>j) + DBPo v < Lv * * 1 (7, uW p), for q > 1, (4.9) 0 < l S " * ( / , v t|ip), for g = 0. (4.10) And the longest path delay from the inputs of the system to the inputs of ut),p can be reduced by moving out of the Ptj, in other words, In order to move vtlip in front of vtj q + 1 without increasing the longest path de lay from the inputs to U (t j9+ , after adjustment, the following relation needs to be satisfied. L v t X 1 (7 , i>mp) + DFp{mt,j) + DBp(mt,j) + DBPov < L vff^ ( 7 , vtijq+1). (4.12) To summarize, if vt< ip,vtjq, and vt,jq + 1 satisfy the relations from (4.7) to (4.12), a left-shift adjustment can be made by moving vt,ip to the SA after vtj q and before Vt,jq+1 • 1 1 0 Figure 4.6: An left-shift SA graph of the SA graph in Figure 4.3. E x am p le Let the SA graph in Figure 4.3 be G s a , • There is a gap before u,,3 satisfying the inequality (4.8), where q = 0 , (u„,3 is the starting operation of m»,i), Lvfj'iliV*, 3) = 5, and DFP{m»,i) + DBp{m*'i) = 5. u„,i can move into this gap, because the inequalities (4.10), (4.11), and (4.12) are satisfied, where Lv? * 1 (/, u»,i) = 10 and Q tt L v . T ( I , v . tl) = 0. Therefore, u„,i moves into the gap in front of v,,3, so G s a ,. is not active. On the other hand, v,# cannot move into the gap in front of u,i3, because the inequality (4.11) is not satisfied, where 2) = 15 and Gn L v. T { I , v.,2) = 15. After moving u.,x into the gap in front of u,i3, the new SA graph, which is shown in Figure 4.6, is active, because no SA graph that is the left-shift of the L tT H h v u ) L Z A 2 (I,vu ) u+,l 0 0 w +A .. 3' 3 V+,3 2 2 17 V+A iy 14 «*,! 1U U V ..2 ..... 15' 1 0 " 5 5 IV 4 21 16 Table 4.2: The comparison of the longest path delays between the SA graph in Figure 4.3 and the SA graph in Figure 4.6. new SA graph can be found. The comparison of the longest path delays between the operations of the SA graph in Figure 4.3 and the operations of the SA graph in Figure 4.6 is shown in Table 4.2, where Lvts*2 (I,vt,i) < LvtSAl for every vt,i € V. The completion time of the SA graph in Figure 4.3 is 25, and the completion time of the SA graph in Figure 4.6 is 20, therefore, 20 = CT{G,Esm ) < CT(G,Esm ) = 25 4.5.2 A n O ptim um C om pletion Tim e D esign G eneration Since there exists at least one minimum completion time design that is active, a minimum completion time design can be found by searching for the minimal completion time design of all active SA graphs. D efinition 4.5.3 A p a rtia l acyclic SA g rap h GP SA = (VP,E PU EgA) of a DFG G = (V, E) over a set of resource M is defined as follows. • Vp C V is the set of operations sequenced and allocated to M. If v € Vp, then all predecessors of v in G have to be in Vp. • Gp = (VP,E P) is the induced subgraph of G corresponding to V p, where E p = E D {Vp x I/p). 
1 1 2 • E v sa C V p x Vp is the SA edge set corresponding to an SA for Gp under the resource constraint M. In other words, GP SA = (VP,E P U EgA) is an acyclic SA graph of Gp over the set of resource M. A systematic approach for generating all active SA graphs is now developed. The notations used in the algorithm are defined as follows. • n denotes the number of operations that already have been sequenced and al located in a partial SA graph, and VP n represents the set of these n sequenced and allocated operations. • GP n = (VPn,E Pn) is the induced subgraph of G corresponding to V v", where EP n = E n (VP n x VPn). • The SA graph for GP n under the resource constraint M is denoted by GV gA = (VPn,E P n U Eg]4 ), where EgA is the the SA edge representing the partial SA for these n operations in V Pn. • G s a P u = (V,E U EgA) is an acyclic SA graph of G, and may not satisfy the resource constraint M. • Sn is the set of operations not in VPn, but all of whose predecessors are in VPn. • For each mj € M, Vp" represents the set of operations in VP n and allocated to rrij. The starting operation and the ending operation in are denoted by <S(V^) and £(VP ”), respectively. Therefore, ZG !£i(/,£(]/£’ ’)) is the earliest time £(V^") can finish its operation. Let cr^(rrij) represent the earliest time the next operation can start at mj after £ ( V ^ ) , therefore, °m" (m i) = £ G" » (/,£ (V ^ ;)) + DBP(mj) + DBPov. 113 The algorithm presented below uses the depth-first search to generate all active SA graphs. This algorithm starts with n = 0, where VV Q = 0, Ep gA = 0, and So is the set of input nodes of DFG G. For each n, VPn, Eg[4 , and Sn are known, and V£n, E Pn, GPn, Gp sn A, and GSAP n can be derived from VV n and Ep sn A. From V Pn, E p sn A, and Sn, several sets of V P n,, EgA, and Sn> with n' = n -f 1 can be found to form new partial SA graphs with (n + 1) operations. Find VP n,, E p sn A, and Sn< from yr„ e p sa, and Sn by finding potential candidate operations in Sn and adding one of them into the existing partial SA graph Gp sn A. This algorithm is presented as follows. Algorithm 4.5.1 (Active SA Graphs Generation) 1. Initialization: let n = 0, • Sn = {vi € V | ind(v{) = 0 in G}, the set of input nodes in G , • VP n — 0, • E's -a = 0, • V?” = 0, £(V£") is null, and crp ?(mj) = 0 for each mj € M. 2. Find < j > * and the corresponding M m for a given Gp sn A: • For each u ,- G Sn, av{vi) = M a x i L v ^ (I,v{), }, where m(u,) G M is the module realizing < 7„(v;), and (j> v(vi,m{vi)) = av{vi) + DFP{m(vi)) + DBp{m(vi)) + DBpov- • < t > ' = M inViSSn{<f>{vi,m(vi))}. • M* is the set of operators by which < j )* could be realized. 3. For each operation u; G S„ executable by any m* G M* with cr„(ut -,m*) < < /> * , create a new partial SA graph in the following ways. • Select any mj G M~. 114 . If V% ± 0, E P s T = E& U m v * ;),«,-)}; otherwise, E%$' = E&. • Kn"+1 = Kn" U { t> ,-} and £ ( 1 ^ ) = u,-. For other m'- € A/ \ {mj}, K S;*‘ = KJ;. and W f ) = a v,p n + 1 = vp n u { « ,- } . 4. For each new partial SA graph from Step 3 with Vi sequenced and allocated, create a corresponding Sn+i as follows. (a) 5„+i = S n \ {r>,}. (b) For each direct successor Vk of V { with all its direct predecessors in 1 /Pn+1, add Vk into S'n+1. 5. For each partial SA graph with n + 1 operations, if n + 1 = |V|, this SA graph is active; otherwise, go to Step 2 with n < — n + 1 until all active SA graphs have been generated. L em m a 4.5.2 Each partial SA graph, G^A, generated in above algorithm is ac tive. 
P ro o f: If Gp sA is not active, an Vt,ip G VPn, which can be left-shifted, exists. Assume that (ut,j,,^,j,+1) € E p sn A, and vuip can left-shift into the gap between Vt,jq and • Therefore, n ^ P n r ip n Lv,% (I,vt,ip) -1 - DFP(mt,j) + DBp{mt,j) + DBpo v < L»,^+ I (I,vtj q+1), (due to Inequality (4.12)) Lgsa(I ,vt,jq) - 1 - DFP(mt< j) + 2 * (DBp(mtj) -1 - DBPov) < Lv,5*^ {I,vtljq+ 1), (due to Inequality (4.7)) 115 where G " sp a = iVPn, ESA\{{vt,iP -i ,vt,iP)})- Since both vuip and v t ,jq+ 1 are sequenced and allocated after vtljq, < f> * after vtj q in Step 2 is smaller than or equals the maximum of the lefthand side of the above two inequalities, that is, F < M ax{L°% {I, vt< ip ), (LGsA(I,vtijq) + DBP(mtyj) + DBPov)} +DFp{mt,j) + DBp{mt,j) + DBpo v If Vt,ip is sequenced and allocated earlier than vt,j +1, then uti,p can fit in the gap before Vt,jq+1. However, this is not this case. If vt,jq+1 is sequenced and allocated earlier than u£ i,p, then s~>Pn P < (I, v-M,) This is a contradiction to the condition on Step 3. Therefore, Gp sn A has to be active. □ L em m a 4.5.3 If G sa is an active SA graph of G under the resource constraint M, there exists an SA graph GP JA{ = G sa generated by Algorithm 4.5.1. Proof: 1 . Let ( f> v(vi,m(vi)) be the operator release time by operation V { in Gsa > that is, ( /> v(vi,m(vi)) = LGsA(I,Vi) + DBP(m(vi)) + DBPov. Sort all operations in V to a sequence (v£li^,, u < 2 ,^2, . . . , such that all predecessors of are in the subsequence ... ,uln_ll^„_1) and M vti,4> i,m (vu ,< l> i)) < <Mwti+i.^+nmK + is not a predecessor of vU + u< t> i+ i- This sequence exists because G sa is acyclic. 2. Let VP n = { u t j ^ t W a , . . . , *>(„,*,,}, Epn = E n (VP n x VPn), and E p s n A = E sa H (VP n x V Pn). Then GP sA = (VPn,E p" U EgA) is a partial active SA graph. 116 (a) Since for every vtit^ 6 VP n its predecessors are in the sequence vt2 , < t > 2i • • • ivu-iA -i)i ad °f its predecessors are in VPn. Since Gsa is an SA graph of G, Esa satisfies the resource constraint M which E p sn A also satisfies. Therefore, GP S " A is a partial SA graph of G. (b) If Ggn A is not active, there exists a left-shift adjustment. This adjustment is also a valid left-shift adjustment for Gsa • However, Gsa is active. This is a contradiction, so GP gA is active. 3. By induction, it can be proven that Gsa can be generated by Algorithm 4.5.1. Gp s° a is the initial partial active SA graph. Assume that Gp sn A is a partial active SA graph generated by Step 5 of Algorithm 4.5.1 with VPn,E Pn, and E p sn A defined above with n < |V|. Since all predecessors of vtn+ lt< f , n+1 are in V Pn, v«n+i,tf»+i e Sn' For every u« u ,< * u 6 V \ VP n and u > n + 1 , either vtn + li< t> n + i is a predecessor of vta^ u, which is not in 5„, or M vtn+ w t> n+^Tn{vtn+ ull> n+ 1)) < I n o t h e r w o r d s , M v tn+u4 ,n+i,m (vtn + u < t,n+ i)) < M vtu,*u, m ( vt»,K )) if vturfu € Sn and u ± n + 1 , that is, ^ « ( U * n + l i 0 n + l 1 m ( V t n + 1 . 0 n + l ) ) — ^ W U j i n + l & l ; l u r f u 6 S „ { ^ ( V i u . ^ u > T O ( U « u ,^ u ) ) } Therefore, < f> * corresponds to t n + 1 type operations. Since G^ 1 is an active partial SA graph, GP sa' is one of the partial SA graphs generated in Step 3, 4 and 5 after G P S " A . Therefore, Gsa can be generated by Algorithm 4.5.1. □ T h eo re m 4.5.2 Algorithm 4.5.1 generates all active SA graphs for DFG G under the resource constraint M. 
E x am p le Use the DFG in Figure 4.2(a) to obtain an active SA graph by sim ulating the enumeration scheme in Algorithm 4.5.1. Assume that there are one 117 Ufp l> b p 4.0 0.5 ™+,i 2 .0 ' (15"' ™ + ,2 3.0 0.5 overhead 0 .0 0.5 Table 4.3: Operator timing parameters for the example of active SA graph gener ation multiplier and two adders, whose timing parameters are shown in Table 4.3. As sume that D fp o v = 0 .0 and D b p o v = 0.5. In Step 1 , the following is obtained. So = {v„,i,u+)1,i>+i2}, Vn = 0 ,£ & = 0, y p o _ yp o — yp o — r m*,i ra+,1 m +,2 1 < ° ( m *,i) = = <°(™ +,2) = 0 . In Step 2, try to find ( j > * and the corresponding M*, where M ^ .i ) = 0, m .tl) = 5 c«(u+,i) = 0 ,< f> v{v+il,m +ti) = 3 < rv{v+i2) = 0, < A „(u+,2,m+,i) = 3 Therefore, ( j> m = 3 and M m = {m+,i}. There are two choices in Step 3; either u+ti or v + i2 can be scheduled next. For this example, pick u+,i. Then the following is obtained. ^rn+ll = {u+,1 },£{VQa) = u+,1 with ^ ‘(m+.i) = 3, V ™ = VI) and £(V%) = £ (V » ) for m'j e M \ V* = {v+A}. 118 n s n V{ € (V P n + 1 \ VPn) e , € ( ^ r \ £ & ) m k € M ' 0 v+,i null 3 m+A 1 ivmJ,V+ .2 i V + .2 null 4 ” *+.2 . 2 V-A null 5 m».i 3 V..2, V..3 U ..2 K i , v.,2) 10 m..i 4 • 1 V..3, V+.3, V+A } V+A ( v + ' U V + a ) 1 2 ™ + a 5 { U +,3 } U +,3 (v+,2,V+'3) 13 m + . 2 6 U.,3 (V.,2,U.,3) 15 7 1 V«,41 U.,4 2 0 Table 4.4: An active SA graph generation n = 0 + .2 n = 2 n = 3 n = 4 n = 5 + .3 n = 6 •j n = 7 n = 8 Figure 4.7: Enumeration tree of the SA graph generation. 119 Continuing with the selection of v+,i, in Step 4 the following is obtained. Si = {u+,2,u.,i}. In Step 5, n « — 1 and return to Step 2. All SA graphs can be obtained by repeating the procedure. Table 4.4 shows how one SA graph is produced following through the previous result, where n represents the indexing, u ,- £ (VPn+‘ \ VPn) represents the operation selected for S„, ej £ (.E p s n A 1 \ E P S" A) represents the allocation of vx for Sn, and < f> * and mk £ M* represent the corresponding parameters in Step 2 for a given n. Figure 4.7 shows the enumeration tree, where the ©s represent the branches to be selected in Table 4.4 and the »s represent the branches not to be selected. 4.5.3 T he Branch and Bound Technique Although the design space of all active SA graphs is smaller than the design space of all SA graphs, the number of all active SA graphs is still very large. In this chapter, the branch and bound technique is used to further reduce the search space for an optimum solution. 4.5.3.1 Bounded by the Data Dependency Constraint As stated in the proof of Lemma 4.5.3, to find a Gsa Algorithm 4.5.1 generates the corresponding partial SA graphs, GP S° A, Gp s\ , . . ., GP JA = Gsa • For each Gp sn A, the corresponding SA edge set E p sn A and SA graph Gsap„ for G can be found. Since Esa — E s a * ^ or 0 < n < |V| — 1 , according to Lemma 4.4.3 CT(G, E ^ ) < CT(G,Ep sn A +1), where CT(G,Ep sn A) = LGs*p»(I,0). If CT(G,Ep sn A) is larger than the current minimal solution, the active SA graphs Gsa with the edge set Esa 2 Esa d° n°f need to be searched. This bound is called the data dependency bound is because L° sap»(/, O) is the longest path delay decided by the data dependency of the operations in V \ VPn. 1 2 0 The CT(G, E p sn A) = LGsAm ( /, O) can be found by traversing the directed acyclic graph GsaP u • From G sa 1 to Gp sn A, only vtn^ n and an SA edge are added into the graph, so the delays of paths not passing through vtn^ n are not changed. 
Therefore, the bound can be found without traversing the whole graph every time. The bound LGsap"(I,0 ) equals the maximum of XG s/' p« - 1 (1,0) and the longest path delay through vtn t< j> n , that is, LGsapn ( / , O) = Max{L ° SA p"-> ( / , O), LGs^ ( I , v tn^ + LP ts n^ ( v t ni4> n , 0 )} Since G p s n A 1 is an acceptable partial SA graph, ZGSi4pn- 1 (/, 0) is smaller than the q s a G s a current minimal solution. Therefore, Lv,s nA ^ (utn,0n,O) needs to be checked to see if smaller than the current minimal solution. Since LG ^ " ( vtn,< j> „ , 0 ) is not in the sequenced and allocated part of graph Gsap„ , Therefore, LG .(vi,0) can be calculated once for all u; € V at the beginning of GSA the algorithm. At Step 3 of Algorithm 4.5.1, aV(vtn,< f,n) = LV l^ (I,vtn t< j,n), so q s a Lv > S n*£(I ’vtn,'l> n) = Vv(vtn,< t > „) + DFp(m(vt Therefore, the data dependency bound, Bdata, is defined as follows. Bdata = a'v(Vtn ,< t> n ) + ^ F P ( rn ( Vtn,’ l>n)) + ( U(n,^n ) ^ ) ( 4 . 1 3 ) If Bdata is smaller than the current minimal solution, then Ggn A is an acceptable partial SA graph. Otherwise, the designs with E p s n A as the subset of its SA edge set are pruned. 4.5.3.2 Bounded by the Resource Dependency Constraint For a given G p £a or Gsap„ > there is also a lower bound for the system completion time of GsaP u for u > n and EgA D E p s n A due to the resource constraint M . 1 2 1 This bound can be found by removing all data dependency relations between the unscheduled nodes. Let G sn A be a partial SA graph with VPn, E P n = E fl (VP n x VPn), E$[\ corre sponding to a DFG G = {V,E). Let G(pn) corresponding to a DFG G = (V, E) and a partial SA graph Gp sn A be defined as a DFG graph in which all data depen dency relations for the nodes in V \ VP n are removed and all data dependency relations for the nodes in VP n are kept, that is, G(Pn) = (V,Ep"). Let G(pnYsA be ^ e partial SA graph corresponding to the DFG G(pn) and the SA edges Egn A, that is, G(Pn)& = ( ^ P"U£&). L em m a 4.5.4 For a given VP n and E P S" A, the longest path delay from inputs to outputs for G(pn)p sn A is smaller than or equal to the longest path delay from inputs to outputs for G s a p„ , that is, Lg{p^ sa(I,0) < LGsAm (I,0). P ro o f: Because (EP n U EgA) Q (E U Egn A), any path in G(pn)p sn A is in G s a p„ ■ Therefore, the condition in this lemma holds. □ For a given E sa 2 E p s n A under the resource constraint A/, let G sap„ Sa he the SA graph of the DFG G = (V,E) corresponding to this E s a , that is, G sA Pn_SA = { V , E U E s a ), and let G ( p n )sA be the SA graph of the DFG G'(pn) corresponding to the same E s a , that is, G{pn)sA = (V,Ep" U E SA)- 1 2 2 L em m a 4.5.5 For a given Esa ^ E V S" A under the resource constraint M, the longest path delay from inputs to outputs for G(pn)sA is smaller than or equal to the longest path delay from inputs to outputs for Gsap„_sA, that is, x g ( p-,) s^(/j0 ) < L GsAp n - s * ( I , 0 ) . For all E sa 5 Esa under the resource constraint M, the minimum of the longest path delay from inputs to outputs for G (pn)sA is smaller than or equal to the minimum of the longest path delay from inputs to outputs for Gsap„_sA, that is, Min^EsA2 EP sn{LG{Pn)sA{I,0)} < MinVEsA 2 E^A{LGsApp-^(I,0)}. P ro o f: The two conditions in this lemma hold for the reason stated previously in the proof of Lemma 4.5.4. 
□ Since LG^ SA(I,0) is the longest path delay without considering the data dependency relation among operations in V \ VPn, the minimum value of all La(pn)SA(I,0) represents the resource dependency bound. Let Ny\vpn(ti) be the number of type U operations in V \ VPn, that is, the number of type t{ operations that have not been sequenced and allocated. Let B res 0urce{U) be the lower bound of the system completion time due to type t; operators. After vtn^ n is sequenced and allocated, BTe 3 0urce(ti) can be obtained by the following algorithm. A lg o rith m 4.5.2 1. Sort all mj € Mik in the non-decreasing order of (crp ?(mj)+DEp(mj)), where ap?(mj) = LgP s\{I,£{V ^)) + DBP{m.j) + DBPm. Let M Ltk be the list. 2. B Tesource{ik) = max,njgA/(fc {((r^(rrij) DBP(mj) DBPov)} 3. for i = 1 to Nv\vpn(h) begin 4. Let m'j be the first operator in M Ltk\ remove m ' from M L tk. 123 5. if (crP"(m') + DFp(mj)) > Bres0U T C e{tk) then B teaourceiik) ~ ® rrf iP^j) "L 6. o-^(m'j) = crP"(m') + D Fp (m ') + DBp{rrij) + £>spot,. 7. Insert m'- back to MLtk such that the non-decreasing order of (crj"(mj) +DFp(mj)) is maintained. end of for 8. return Bresource(tk). Lemma 4.5.6 Algorithm 4.5.2 generates the lower bound of the system comple tion time for type tk operators. Proof: This lemma can be proven by induction. □ The lower bound of the system completion time bounded by the resource depen dency constraint is B r e s o u r c e = m ax { B rc30urce ( ) } tk€T Algorithm 4.5.1 for the minimum system completion time design generation with the branch and bound can be rewritten as follows. Algorithm 4.5.3 (An Optimum SA Graph Generation) 0 . Let G sa be an initial SA graph generated by some heuristic algorithm, and let CTmin = L °SA(I,0) be the current minimal system completion time. (Heuristic algorithms will be discussed in the next section.) 1 . Initialization: let n = 0, • Sn = {uj € V | ind(vi) = 0 in G}, the set of input nodes in G, • VP n = 0, 124 • E& = 0 > • ^m" = 0 ? £(Kn") * s nu^i and < 7m"(mj) = 0 for each m;- € M. 2. Find 4 > " ' and the corresponding M* for a given Gp sn A: • For each u ,- € 5„, <r„(u,) = M a*{LSMp"(/,t>,0 ,M *nraj.6JlfoiV (m (,.){aJS'(mJ -)} }, where m(v{) € A/ is the module realizing < rv(vi), and (t> v{vi,m.{vi)) = cr„(vj) + DFP{m{vi)) + DBp{m{vi)) + DBPov. • 4 > * = M inV i< zsn{< f> (v{,m(vi))}. • M * is the set of operators by which 4> * could be realized. 3. For each operation v ,- € Sn executable by any m~ € M" with crv(vi, m') < < / > * : (a) Create a new partial SA graph according to the following. • Select any mj € M*. . If V% ± 0, E l r = E P s a U {(£(V £),u,)}; otherwise, • V^"+ i = U {u,} and £(V£"+1) = u ,-. For other m ' e M \ {mj}, = K I;. m d £ (V ^ * ') = £ (V ^ ). • V rP "+ > = f/P" U {v;}. (b) Check if Bdata and B Tea0u rce are smaller than CTmin. If yes, accept this partial SA graph and go to Step 4. Otherwise, ignore this partial SA graph. 4. For each new accepted partial SA graph from Step 3 with V { sequenced and allocated, create a corresponding S„+\ as follows. (a) 5 „+1 - S n \ {u,}. (b) For each direct successor Vk of u ,- with all its direct predecessors in V r 'n+i , add Vk into S n+ 125 5. For each partial SA graph with n + 1 operations, either of following cases occurs. • If n + 1 = |V|, this SA graph is active. If O) < CTm{n, G^ 1 becomes the new current minimal completion time design and GTm iri = L gsa 1 (/, O). If Lgs a (I,0 ) > CTmin, ignore this SA graph. • I f n + 1 < |V|, n < — n + 1 and go to Step 2 to expand this partial SA graph. 
If all accepted active SA graphs are examined, output the current minimal completion time SA graph and GTm,„. Theorem 4.5.3 For a given DFG, Algorithm 4.5.3 generates a minimum comple tion time SA graph under the resource constraint M. 4.5.4 H euristic A lgorithm s The number of active SAs is very large, and it is very time-consuming to find an optimum design using Algorithm 4.5.3. Instead of enumerating all active SA graphs, a set of V Pn+1, E P n+ > , and E ^ 1 is found for a set of V Pn, E Vn, and Eg'A. Therefore, a single search path from (VP o,E P a,Es°A), (VPl,E Pl,EgA), . .. to (Vpiv 'i,E pivi,EgAl) can be found. This is also known as the list scheduling algorithm [15]. The order of operations to be scheduled is determined by a few priority func tions. The freedom [38] of a node equals the difference between the critical path delay of the design and the longest path delay through the node, and represents how critical this operation is in the design. Operations with smaller freedom values are more critical and have a higher priority than operations with larger freedom 126 values. The other priority function used is the longest path delay from an opera tion to output nodes. The larger this delay is, the higher the priority the operation has. The list scheduling algorithm is described as follows. Algorithm 4.5.4 (List Scheduling Algorithm) 1. Initialization: let n = 0, • S„ = {vi e V | ind(v{) = 0 in G}, the set of input nodes in G, • V P n = 0, • B ? S A = I • Kn" = £(K«") * s nu^t and crfXlmj) — 0 for each m j € M. 2. Find < f > * and the corresponding M * for a given Gsa- • For each u; 6 < ?„, a„(v,) = M ax{LV iSApn(I,v i),M in mj& M oTopn(v.){a^{Tnj )} }, where m(u,) € M is the module realizing ct„(i;,), and < j> V(vi,m(v.;)) = < rv(vi) -f DFp{m{vi)) + DBp{m{vi)) DBpov. • < / > * = M inV i< zSn{<t>{vi,m{vi))}. • M m is the set of operators by which < j > * could be realized. 3. Find a V { € Sn with the highest priority among operations requiring an operator m* G M* with crv(v{, m*) < < f> * . Create a new partial SA graph G%i + I according to the following. Gsa • Select a mj € M * which has the minimal Lv, Pn+1 (I, vi) for v ,-. . If V% + 0, E I T = E P sa u {(^(V ^),u,)}; otherwise, E & ' = E p s\. • ^m" +1 — Kn" {vi} and ) = V{. For other m'- € M \ {mj}, y ^ ; 1 = y% , ^ d = 5 ( v p . 127 • V * " * 1 = VP n U {u,}. 4. For the new partial SA graph from Step 3 with V { sequenced and allocated, create a corresponding Sn+i as follows. (a) iSn+l = Sn \ {u,}. (b) For each direct successor of u ,- with all its direct predecessor in F p"+>, add ujt into 5n+i. 5. If n + 1 = |V|, this SA graph is an active SA graph, then output the SA graph and stop the procedure. Otherwise, return to Step 2 with n « — n + 1. There are several variations of this algorithm. One variation skips Step 2 of Algorithm 4.5.4, in other words, no < £ * is calculated. In Step 3, the i> , • with the highest priority of all operations in Sn is selected to create a new partial SA graph. Another variation calculates the priority function for each node before the scheduling starts, so the input DFG determines the scheduling order of all operations. E x am p le Let the longest path delay from an operation to output nodes be the priority function used in the heuristics, that is, Lg (v{,0) for u ,- € V. In the previous example, DFp{mm < \) = 4, Dpp(m+tj) = 2, and DFp{m+t2) = 3, so the average Dpp for the multiplication is set to be 4 and the average Dpp for the addition is set to be 2.5. 
Then the scheduling order can be obtained as (v»,i,u«,2,u+)i,v +i2,u.,3, v+t 4 ,v ,< 4 ,v+'3) with the corresponding priority cost (14.5, 10.5, 10.5, 10.5, 8 , 6.5, 4, 2.5). 128 4.6 T he P rob lem Form ulation for M inim izin g th e S ystem In itiation P eriod 4.6.1 Extended Graph M odel for th e System Initiation Period Each computation of a system corresponds to a set of data from inputs to outputs of the system’s DFG description? Many computations of the system correspond to multiple copies of the same DFG descriptions, where each copy of a DFG descrip tion corresponds to a computation for a set of input data. Assume that there is no data dependency between any two computations. To initiate as many computations as possible during an implementation of the system, make sure that the resources for each computation are ready whenever a specific resource is required. The resource dependency between any two consecutive computations is described by a set of SA edges. If <S(m/..) and £{m,k) are the starting operation and the ending operation of operator m*, respectively, an SA edge from £(mjt) of a computation to <S(mjt) of its following computation is used to describe the resource dependency between two consecutive computations in terms of operator m*. In Figure 4.8, u ,- and vj are the starting operation and the ending operation of an operator, respectively, and vp and vq are the starting operation and the ending operation of another operator, respectively. Therefore, there is an SA edge from vj of a computation to V { of its following computation, and there is also an edge from vq of a computation to vp of its following computation. 129 identical SA graphs D F G ; Computation 1 D F G i Computation 2 Figure 4.8: A series of computations. Let n represent operation vt,n in the zth computation. Let S '( m k ) and be the starting operation and the ending operation of operator m* in the z'th com putation. Let Gm SA be the extended graph model to describe many computations of the system. Gm SA is defined as follows. oo oo i=i i=i where V = e v) E = { (uti,ni) vt2,n2)l''?,(uii.niiui2.n2) € E} E sa = { ( v ii,ni >ui2, n 2 »u<2.«2) ^ E s a } E s a = {(5,(mf c ) ,5 ,+1(mf c ))|Vm) t € M , i = 1 , 2 ,...} 130 The initiation period equals the time from the computation starting time of a computation to the computation starting time of its next computation. In addi tion to directly affecting the starting time of its next computation, a computation indirectly affects the starting time of the computations after its next computa tion. Therefore, the initiation period of an SA graph means the average initiation period of the SA graph. Since there is only the resource dependency across com putations, the initiation period can be defined by the longest path delays between the starting operations of the same operator from different computations, that is, L^ N (mk){S'{mk),St+N{mk)) for all m k 6 M. D efinition 4.6.1 For a given DFG G = (V, E) and an SA edge set Esa under the resource constraint A/, the initiation period of the SA graph Gsa is , „ r . „ , £ & V . ) IP \G \E sa) — }• /V— ► oo iV Later it will be shown that the initiation period can be found with N < \M\. 
To find L%s +N{mk' )(S'(mk),S ,+N(mk)), matrix manipulation is used to find the all-pairs longest paths for a given G* SA, in which a pair of nodes start at a starting operation of an operator in a computation and end at a starting operation of either this operator or another operator in another computation that is started later than this computation. This method is similar to the all-pairs shortest paths problem for a directed acyclic graph [1]. Let the operators in M be labeled from 1 to |M |, that is, ... ,m|A/|. The matrix Ai represents all-pairs longest path delays from the starting operation of a computation to the starting operation of its following computation. In other words, Ai[z,j] represents the longest path delay from inputs of < S p(m,) to inputs of < S p+1(m,), where p is any positive integer, that is, LGSA(s (mi),£{mj)) + DBp(mj), if S(rrij) = £(mj); ^GSA(S(mi),£(mj)) -f DBp{mj) + DBpov, if S{rrij) ^ £(m j); —oo, otherwise. 131 For p, q > 0, the matrix Ap+ q is defined as follows. Ap+q — Ap[i, A : ] -|- L em m a 4.6.1 The entry (i,j) in matrix A n represents the longest path delays from the input of S p(m,) to the input of S^p+n\m.j), that is, An[iJ] = L f & {m.)(S'(rni),S ’+»{mJ)). P ro o f: By induction. □ C o ro llary 4.6.1 The initiation period of the SA graph is IP (G ,E sa ) = M axmi&M{ lim ^ M } . /V -+00 T v A„ is called the SA all-pair matrix for n computations. After the formulation of the system initiation period is transformed from the path delays of G * SA to the manipulation of matrix Ai, there is no need to search longest paths through G* s a any more. Instead, the system initiation period can be obtained by the manipulation of the SA all-pair matrices, which is only a function of G s a - 4.6.2 T he Problem Formulation and Possible Approaches The synthesis problem for finding the minimum initiation period design is formu lated as follows. For a given DFG G = {V,E) and a given set of operators M , find an SA edge set E sa in which • E sa and the corresponding G sa = (F, E U E s a ) and G'SA = (F, E s a ) satisfy the properties in Lemma 4-4 A • Operators in M are labeled from 1 to \M\, where A n is the corresponding SA all-pair matrix for n computations 132 so that Maa:,=ii...i|M|{liniiv->oo } is minimum. To minimize the system completion time, the condition for the left-shift ad justm ent and active SAs are defined. The similar definitions for the minimization of the system initiation period are obtained, but the objective function is Ayv[z',z] for all i £ { 1 ,2 ,..., \M\}. There are several properties for Ayv[s, * ’ ]. L em m a 4.6.2 A/aM > V z £ { 1 ,2 ,..., |A/j}. P ro o f: For any i 6 { 1 ,2 ,..., |A/|}, A;v[m] = A/ax,-, e{ij2 |a/|}{Ai[z,z'i] + j4i [ z ’ i, * 2] + • • • + y4i[z‘ iv-i,*]} An [i,*] > Ai[z,z] + Ai[z,?] + ... + Ai[z,z] Aat[z’,z] > N * a C o ro llary 4.6.2 IP (G ,E SA) > Jl/ax,-G{li2 |a/|) M i [*>*]}• C o ro llary 4.6.3 If for any positive integer N and for any z £ { 1 ,2 ,..., |A/|} A/a®,', €{1,2 |a/|}\{.}{Ai[z,z‘i] + Aifz-!,^ ] + ... + Ai[z'/V-M]} = -o o , then IP (G ,E s a) = A/ax,-=i i2 |m|Mi[z',z]}. The levelization approach is widely used in the pipeline design of the data path synthesis of synchronous system. This approach guarantees the condition in Corollary 4.6.3. Also, the initiation period equals the maximum number of clock cycles of all levels, which is analogous to maximum delay of Ai[z,z] for all z. 
Therefore, a similar approach can be used to find a near-optimal system initiation period design for asynchronous systems, however, the size of each level to meet the clock period does not need to be considered. 133 C hapter 5 C ontrol S ynthesis After data path synthesis, the control structure that supports resource sharing and computation sequencing needs to be synthesized. In the data flow design paradigm, control synthesis is the process of re-shaping the original DFG with respect to the sequencing and allocation result obtained during data path synthesis. Figure 5.1 shows the input-output relationship of control synthesis. Two tasks are included in control synthesis for this thesis: 1. Sharing structure generation 2. Local transformation Sharing structure generation produces structures for an output DFG from an input DFG and a sequencing and allocation result. The output DFG performs the same function as the input DFG, but with fewer resources, and satisfys the sequencing and allocation result obtained from data path synthesis. In this system, this synthesis task is carried out by the input DFG to output DFG mapping. Several sharing schemes, which are the general templates of sharing structures for the mapping from the input DFG with the sequencing and allocation result to the output DFG, are developed. This chapter focuses on the development and analysis of sharing schemes. 134 Sequencing Selection of & allocation sharing scheme result Original DFG \ I / Control Synthesis Re-shaped DFG Figure 5.1: The input-output relation of the control synthesis. Since sharing schemes are general templates for any DFG with any sequencing and allocation result, the size of the sharing constructs and the number of data flow paths in each sharing structure may be further reduced. Local transformation uses the peephole optimization technique, which is similarly used in the compiler design, to reduce the size of the sharing structure generated by the sharing structure synthesis. 5.1 P rob lem D efin ition and C lassification o f Sharing Schem es Assume that the input DFG is acyclic and each functional node is executed exactly once. A sharing scheme is a template that maps the input DFG into an output DFG such that: 1. The execution of each functional node in the input DFG corresponds to an execution of a functional node in the output DFG, and each functional node in the output DFG may be executed more than once. 2 . The output DFG performs the same function as the input DFG. 135 Functional nodes in the input DFG are called operations, and functional nodes in the re-shaped DFG are called operators. The first item in the above sharing scheme means that the operation-operator mapping satisfys the allocation result of data path synthesis. The second item means that the operation execution sequence in the output DFG satisfys the sequencing result of data path synthesis. There are two kinds of allocations: 1 . Static allocation 2. Dynamic allocation In a static allocation, each operation is executed by a specific operator every time. In a dynamic allocation, each operation can be executed by one of several operators. The operator that executes an operation can be different from time to time. In general, an idle operator is assigned to execute an operation that is ready to be executed. Sequencing schemes can be classified as: 1 . Fixed sequencing 2. Variable sequencing In the sharing scheme with fixed sequencing, the execution order of the set of operations sharing the same operator(s) are fixed. 
On the other hand, with vari able sequencing, the execution order of the set of operations sharing the same operator(s) is not pre-determined. For example, both Figure 5.2 and Figure 5.3 are two different, abstract-sharing schemes, where the output DFG uses two operators, m\ and m 2, to execute four operations, v\, v2, v3 and v4, in the input DFG. Figure 5.2 is an abstract-sharing scheme with fixed allocation; for example, operation v\ is always executed by operator m\. Figure 5.3 is an abstract-sharing scheme with dynamic allocation; for example, operator v\ can be executed by either operator mi or operator m 2. 136 control & routin ; control & routin; (a) (b) Figure 5.2: An abstract-sharing scheme with static allocation for 4-operation DFG to 2-operator DFG mapping, (a) Before applying sharing scheme, (b) After ap plying sharing scheme. ( control & routing^ (b) (a) Figure 5.3: An abstract-sharing scheme with dynamic allocation for 4-operation DFG to 2-operator DFG mapping, (a) Before applying sharing scheme, (b) After applying sharing scheme. 137 In both cases, the control and routing parts need to be synthesized so that the output DFG satisfys the sequencing result. Since v\ is allocated to mi in Figure 5.2, the input and output paths of v\ in the input DFG are routed to and from mi in the output DFG. Similarly, the input and output paths of i> i in the input DFG are routed to and from both and m 2 in Figure 5.3. If is scheduled after Vi from data path synthesis, the control sequence has to satisfy the sequencing constraints. If fixed sequencing is used, the control parts in both Figure 5.2 and Figure 5.3 need to be synthesized so that v\ is always followed by V 2 . If variable sequencing is used, the control sequence does not need to be synthesized since the control part accept any sequence. Based on the two allocation types and the two sequencing schemes, sharing schemes are categorized into four classes: • Static-allocation with fixed sequencing (SAFS). • Static-allocation with variable sequencing (SAVS). • Dynamic-allocation with fixed sequencing (DAFS). • Dynamic-allocation with variable sequencing (DAVS). Structurally, each sharing scheme has three parts, which together are referred to as the sharing structure. • S h ared u n it(s): The shared unit(s) are the operator(s) that are to be shared by a set of operations, which can be either non-pipelined or pipelined. A non-pipelined unit and a pipelined unit in asynchronous system design are respectively analogous to a single-cycle operator and a multicycle operator in synchronous system design. If the pipelined operator is formed by inserting registers and Muller C-elements into a combinational logic, the operator is called a micropipelined unit [44]. 138 I2 OP, OP, I n I 3 1 A A A O R ) • • T T O n O 3 Figure 5.4: N operations in the input DFG. • D a ta ro u tin g p a rt: DFG constructs, which direct the data of each oper ation to the shared unit(s), form the data routing part. Basically, the data routing part deals with the allocation. • C o n tro l p a rt: The control part generates sequencing information and con trols the data routing part. The control part deals with the sequencing scheme. Since the SAFS sharing scheme has small timing and area overhead, the data path synthesis algorithms in this thesis focus on this sharing scheme with non-pipelined shared units. The following sections describe sharing structures for each of the four shar ing scheme classes, using non-pipelined shared units. 
The area and performance overheads, which are used in data path synthesis, are formulated for each sharing scheme. In addition, a sharing structure with micropipelined units for SAFS is presented to show how the pipelined units affect the timing behavior of the shared structure. In the following sharing structures, N same type operations, shown in Figure 5.4, are used to represent the part of the input DFG, where N is greater than 1. ii,..., I n can be connected to either the input port(s) of the input DFG or the 139 output(s) of some nodes in the DFG. O i,..., On can be connected to either the output port(s) of the DFG or the input(s) of some nodes in the DFG. For example, N is 4 in Figure 5.3(a). Assume that v\ in Figure 5.3(a) corresponds to OP\ in Figure 5.4, the output of q\ and the input of < 7 5 in Figure 5.3(a) correspond to I\ and 0 \ in Figure 5.4, respectively. For static-allocation sharing schemes, assume N operations share one operator; for example, the mi is the operator shared by 2 operations in Figure 5.2(b). For dynamic-allocation sharing schemes, assume N operations share M operators; for example, mi and m 2 are the 2 operators shared by 4 operations in Figure 5.3(b). 5.2 Sharing Structure for SA FS In the following sharing structure for SAFS, shown in Figure 5.5(a), assume that N operations share 1 operator with fixed sequencing in which OP{ is followed by OPi+1 for i = 1 to N — 1. In this structure, a Selector, which directs the inputs of the N operations to the input of the shared unit, and a Distributor, which directs the output of the shared unit to the outputs of the N operations, form the data routing part. CFfs.N, which forms the control part of Figure 5.5(a), is a macro function that generates a fixed sequence of condition tokens ensuring that the data paths open in a pre-determined order. (CFfs stands for the control function for the fixed sequencing.) Since the fixed sequencing is OP, followed by OP,+ 1 for i = 1 to N — 1, the data routing part requires the fixed sequence of condition tokens, 0,1,2,..., Af — 1 from CFfs-N. In other words, CFfs.N is a \log2 N]-b\t O-to-(N-l) cyclic counter with initial value |7 o < 72Af]-bit 0. Figure 5.5(b) is the DFG definition of function CFfs-N, where COUNT.N is an atomic function that generates the next condition token from the current condition token plus 1 , and “'600...0” represents the initial value |7 o < 72JV]-bit 0. ('b is used to lead the binary numbers.) Since the 140 1 N- Selector CFfs_N OP Distributor taunt. < t ’bOO..O Fork ! Oport! J Oport! I I (a) (b ) Figure 5.5: (a) Sharing structure for SAFS. (b) Definition of CFfs_N. sequencing is fixed in this scheme, the data at / 4, which is available before the data I 3 , cannot be processed until OP(I 3) is completed. The sharing structure in Figure 5.5(a) can handle sequencing results in two ways. In the first method, the input/output locations are rearranged without changing the definition of CFfsJVfunction in Figure 5.5, in which CFfs.N is a cyclic counter. For the execution order {OPqi, OPq2, OPQ 3,..., OPqN), Iq i is connected to input i — 1 of Selector, and Oq i is connected to output i — 1 of Distributor. In the second method, a new function New-COUNT.N replaces the COUNT-N function in function CFfs.N without changing the input/output locations, in which Iq i is connected to input < 7; — 1 of Selector and Oq i is connected to output < 7; — 1 of Distributor. 
For the execvition order (OPqi, OPq2, OPq3, ..., OPqN), function New.COUNT-N generates an output token with value q{ + 1 — 1 for an input token with value < 7; — 1 for i = 1 to N — 1 , and it generates an output token with value < 7 1 — 1 for an input token with vaule — 1. This thesis uses the first method so that New.COUNT-N does not have to be synthesized for each different sequencing. 141 F u n ctio n ality p reserv atio n To determine if functionality is preserved by the output DFG, it must be shown that the output DFG maintains the execution or der of operations in the input DFG. Since the sequencing from data path synthesis should preserve the functionality, SAFS preserves the functionality if the sequenc ing of operations by the sharing structure is consistent with the result of data path synthesis. 5.2.1 Effects of the Sharing Schem e A sharing scheme is a fixed template with respect to the number of operations, the number of shared operators, and the type of scheme. Therefore, the effects of the sharing scheme with respect to area and performance can be easily estimated. A rea The area gained by a mapping equals the area of the eliminated atomic functions. The area overhead of a mapping equals the area of the extra nodes needed in the sharing structure. Based on the design in the asynchronous block library for this thesis [54], the area costs for the DFG constructs used in the sharing structure are as follows: DFG constructs Area cost function OP: Ao p r-bit data n-input Selector: Asi -f \log2 n\ * A a2 + n * AS3 ■ f * (p 1 ) * r * A muX 2toi r-bit data n-output Distributor: Ad\ + \log2n\ * Ad2 + n * n-bit counter (CFfs_n function): Ac\ + n* A c2 In above list, all A factors are constants. A op and Amux2t0- [ are the area cost of OP and the area cost of 1-bit 2-to-l multiplexer, respectively. For example, the cost of the r-bit data n-input Selector has constant factor Aa\. The cost grows logarithmically with respect to constant factor A a2. The cost grows linearly 142 with respect to constant factor j4 S3 and the cost of the r-bit 2-to-l multiplexer, r * A muX 2to\- Similarly, the costs of the r-bit data n-output Distributor and the n-bit counter are defined with respect to constant factors Adi, Ad2, Ad3 , Aci, and A C 2- The net area gain of the sharing structure for SAFS, Agotn, is the area gained due to the resource sharing less the area overhead or Again — (N 1) * A 0p (Aai -f- A di H ” A cj) — \l0g 2 NI * (AS 2 + Ad2 + Ac2) — N * (Aa3 + Ad3) — {N — 1) * r * Am uX 2toi, (5.1) where N is the number of operations sharing the same operator and r is the number of bits for the input of OP. Because resource sharing always reduces system performance, this resource sharing scheme can be used only when Again is greater than 0 , or (N — l ) * A op > (Asi + Adi + Aci) 4- \log2N ~\ * (A „2 + Ad2 + Ac 2) + N * (As3 + A (J3) + {N - 1 ) * r * Amux2toi (5.2) A op ^ + * ( ^ . 2 + Ad2 + Ac2) + (A a3 -f Ad3) + r * Amux2tol (fi'3) , 1 (Asi + A di + A c 1 + A a 3 + Ad3 ) 1 > ( F 3 T ) * ------------------A7P ------------------ (i432 + Ad2 + AC 2) + 7 7 7 ------7 7 * ------- -------------------- (N - 1) A op , (A S3 + Ad3) + r * A mux2toi /c + ------------------2--------------------------------------- (5-4) / i 0p 143 The righthand side of inequality (5.4) is the area overhead ratio, A oveThead.ratioi or the ratio of the area overhead to the area gain resulting from the resource sharing. Theoretically, as long as 1 > Aoverhead-ratioi the sharing scheme can be applied. 
Practically, the threshold value for Ao v e T head-ratio, the lefthand side of inequality (5.4), may be set less than 1 . Since Aoverhead-ratio depends on the number of operations, N , the upper bound or lower bound of Aoverhead-ratio may be used to decide if it is worthwhile applying the sharing scheme to operator OP. The upper bound of Aoverhead.ratio can be obtained by setting N = 2 in inequality (5.4), that is, ■ , ^ Mai + A di + A l + A„ 3 -f A d 3 ) , { A , 2 + A d 2 + A c 2) 1 > A + A n 0 p f*op , (A 3 + A 3 ) + r * Amux2tol /c e\ + --------------- 1 ------------------ • (5-5) The lower bound of AO V erhead-ratio can be obtained by setting iV — * 0 0 in inequality (5.4), that is, , M s3 + A ft) + r * Arjmx2tol /- 1 > ---------------- . (5.6) /I o p P erfo rm an c e Before sharing, the execution of each operation at operator OP has a forward propagation delay time Dppo p and a backward propagation delay time D bP o P• 1° a sharing structure, the execution time of each operation that shares an operator is increased by the routing delay and possible control delay. Based on the analysis of the design library in this thesis, the forward and backward propagation delay times of Selector and Distributor are defined as follows: DFG constructs Delay parameters n-input Selector. DFPscUc,O T {n) = DFPtl + \log2n \ * DF P aJ DbpSc,cc,or(n ) = DBPtl + \log2ri\ * DBpa2 n-output Distributor. DFPscUctor{n) = DFPdl + \log2 n] * D F P d 2 144 D BPSc,cccr(n ) = D BPdl + \log2n 1 * D Bpd2, where Dpp,,, D fps2, D bpal, Dbp,2, Dppdi, D fp^ D bpm, and Dbp& are constants. The forward propagation delay time overhead, D fpo v , and the backward propaga tion delay time overhead, Dbpov, for the sharing structure in Figure 5.5 are DfP o v = DFPSclcctor(N ) + DFPD ialributor(N), and (5.7) Dbpo v = DBpSclcctor{N) + DBpD i,tributor{N). (5.8) Since the data computation of OP and the control generation of Selector and Distributor can be executed at the same time, Dppo v is assumed to be zero if DFpo p > DFpov; otherwise, DFpo p is assumed to be zero. Usually, DFpo p > DFpov. Since the logarithmic factor is much smaller than the constant factor in the above timing parameters, Dppa2 Dfp b1 , the logarithmic term in the above timing parameters may be ignored. Therefore, it can be assumed that Dbpo v is a constant during data path synthesis. 5.3 Sharing Structure for SAVS In the following sharing structure for SAVS, shown in Figure 5.6(a), assume that N operations share 1 operator with variable sequencing in which any execution order of 0P{ for i = 1 to N is acceptable. In other words, SAVS takes only the allocation result from data path synthesis. With the exeception of the control part, this structure is similar to the sharing structure in Figure 5.5(a). N f£([ ]) functions and a CFvs.Nfunction form the control part. (CFvs stands for the control function for the variable sequencing.) J?([ ]) is an atomic function that passes the input token to the output without passing the input data value. CFvs.N is a macro function that generates condition tokens to indicate which input token is available so that a proper data path in the data routing part is open. 145 [Fork) (Fork Fork Selector OP CFvs.N Distributor I p o r t! Iport i | Iport ifN -1 Fork I I I I I Oport (a) (b ) Figure 5.6: (a) Sharing structure for SAVS. (b) Definition of CFvs_N. Figure 5.6(b) is the DFG definition of function CFvs.N, where C ((const)) is an atomic function that generates a token with constant data value (const) if it receives a null data token. 
For example, when I 2 has data available, it activates the second f?([ ]), which generates a null data token to CFvs-N. Due to the token from the second input of CFvs-N, a data token with value “1” is generated on every output of CFvs.N. These data tokens open the data route from input 1 of the Selector to output 1 of the Distributor. When more than one input data are available, an Arbiter, denoted by 0 , in CFvs.N decides the order of routing paths/operations using a first-come-first-served (FCFS) ordering scheme. F u n ctio n ality p reserv atio n An FCFS ordering scheme enforces no specific or der on operations sharing the same operator. Any order of these operations is 146 acceptable to the sharing structure, so the output DFG which uses the FCFS or dering scheme follows and preserves the operation execution order of the input DFG. Therefore, the sharing structure using an FCFS ordering scheme preserves the functionality. 5.3.1 Effects of the Sharing Scheme The effects of the sharing structure for SAVS are analyzed in the following sections. A re a The area gain/overhead of this sharing scheme is formulated similarly to the formulation for the previous scheme. The area costs for DFG constructs com mon to both sharing schemes are as previously defined in Section 5.2.1. The area costs for DFG constructs found only in the sharing structure for SAVS are defined as follows: DFG constructs Area cost function n-input Arbiter. Aa 1 + n * Aa3 2 -output Fork: Amuuerc 2 R([}): 0 C( (const)): 0 Aai and A a 3 are constant factors for the n-input Arbiter. Amu//erc 2, which is a constant, is the area cost of the 2 -input Muller C-element. The total area gain of the structure for SAVS, Aso:n, is Apain = (A 1) * A0p (A si -f- Adi " f" Aa 1 -f- Amu((erC ? 2 ) — \log2 N ~\ * (A 32 + Ad2) — N * (A„3 -f- Ad3 + A a3 -f- AmunerC2 ) {N — l)*r* Am xlX 2toi (5.9) 147 This sharing scheme can be used when A gain is greater than 0, that is, (A T 1) * A op > (A .,1 “ h Aj i A ai -J - A mullerC 2 ) + \log2 N ~\ * (Aj2 + Adi) + N * (A53 + Ad3 -f A az + AmullerC 2 ) Aop > 1 > “ h (Af 1 ) * r * AmuX 2toi 1 ( T V - 1 ) * ( A j l + Adi + Aai + A S3 + j4d3 + Aa3 + 2 * A mullerC2 ) \log2 N] . . + (N - T ) * ( a + + (A s3 + Ad3 + A a3 + AmullerC2) + 7 * * •<4m ux2*ol (5.10) 1 (A T -1) ( A i l + Adi + A a i + A S3 + A d 3 + Aq3 + 2 * AmullerC2) * 1 ----------------------------------- ■ r ± o p \logjN ] (Aa2 + Ad2) + ( N - 1 )* Aop ( A S3 + A d 3 + A a 3 + A mul(erC2) + r * A mur2£ol / c i i \ + ----------------------------- ------------------------------- (5.11) * * -o p As in Section 5.2.1, the righthand side of inequality (5.11) is the area overhead ratio. The threshold value for Aoverhead-ratio, the lefthand side of inequality (5.11), may be set less than 1. Similar to the analysis of SAFS, the upper bound and the lower bound of Aoverhead.ratio can be derived from the righthand side of inequality (5.11), and can be used as the simplified Aoverhead.ratio to decide the feasibility of sharing OP. P erfo rm an ce The performance overhead for the sharing structure for SAVS is higher than the overhead for the sharing structure for SAFS. The extra overhead is caused by the control generation starting after a datum arrives. The timing 148 parameters for DFG constructs found only in the sharing structure for SAVS are defined as follows: DFG constructs Delay parameters n-input Arbiter. 
DFpArbitcr(n) = DFpal + flog2n\ * D F F a 2 DBPA rbl"r(n ) = DBPal + \log2n] * DBpa2 2-output Fork: DFpFork{2) = 0 D B P fork (2) = D m utlerC 2 ^ ( [ ] ) : D FPm ]) = D B p m J) = 0 G( (const)): £ f p c(.) = £ b p c(.) = 0 , where DFpaJ, DFpa 2 , DBpa,, and DBpa2 are constants. Dmuuerc 2 , which is a con stant, is the 2 -input Muller C-element delay. The forward propagation delay time overhead, DFpov, and the backward propagation delay time overhead, DBpov, for the sharing structure in Figure 5.6 are Dfpo v = DFPS'lcctor{N) + DFpD iatributor{N) + DFpA rbitcr(N), and (5.12) DbP o v = DbpS c ,",JN) + DBpD iatribulJ N ) + DB PA TbitJ N ) + 2 * Dmuu„c 2 - (5.13) Since the data computation of OP and the control generation of Selector and Distributor can be executed at the same time, DFpo v is assumed to be DFpArbiler(N) if DfPop > DFPsclcctor{N) + DFpD iitributor{N)\ otherwise, DF F op is assumed to be zero. In other words, DFP ov > 0 even if DF P op > DFPsclector(N) + DFPDi3lributor(N). Unlike the sharing scheme with fixed sequencing, the logarithmic factor of Arbiter cannot be ignored. Therefore, both DFpo v and DBPov increase when N increases for the sharing scheme with variable sequence. 149 o c Selector / ' ' ’ D istrib u to r''^ . w M-l storages M-l storages o, O 2 O n Figure 5.7: Sharing structure for DAFS. 5.4 Sharing Structure for D A F S In dynamic allocation, operators are partitioned into several subsets. Each opera tion in the input DFG is mapped to one of the subsets and dynamically allocated to an operator in the subset when the operation is to be executed. Therefore, each operator is used by a set of operations (also see Figure 5.3). In the following sharing structure for DAFS, shown in Figure 5.7, assume that N operations share M operators with fixed sequencing in which OPi is followed by OP{+\ for i = 1 to N — 1 . This structure is similar to the sharing structure for SAFS. For DAFS, an extra Distributor and Selector are added in the data routing part to route input data to every shared unit and every shared unit to data output. An extra control token generator indicates which shared unit is used. By assuming the operator that starts first is first released, the sequentially cyclic ordering is used for the 150 operator allocation, in other words, operator 1 to operator M is sequentially and cyclically allocated. Function CFfs.M is used as the control token generator for operator allocation in the structure. In terms of the execution order of these operations, the scheme in DAFS is the same as in SAFS. However, DAFS allows more than one operation in the sharing structure to be executed at the same time. Therefore, extra storages are required to store the condition tokens for the output paths of unfinished operations in the structure. In Figure 5.7, S is an atomic function which copies input tokens to output, in other words, 5 is a storage node. For example, if I m arrives while the first M — 1 operations are still executing in the structure, then two £,• storages keep the condition tokens for OP(Ii) to open the path from OPi to O, for i = 1 to M - l . Functionality preservation As with SAFS, the functionality preservation of DAFS is ensured by the sequencing result from data path synthesis. 5.4.1 Effects of the Sharing Schem e The effects of this sharing structure for DAFS are analyzed as follows. Area The area cost in Section 5.2.1 can also be used for each of the nodes. Let the area cost of a n-bit storage node be Asti + n * Aa ti with constant factors Asn and A a i2 . 
The total area gain of the structure for DAFS, Again-, is Again = (A — M) * Aop — 2 * (Aai 4- Adi + Aci) - (R °5 2 A] + \log2 M ] ) * (A 32 + Ad2 + Ac2) - (N + M) * (As3 + Adz) ( A 1 ) * t * A mux2toi - (M - 1) * s * Amux2toi 151 — 2 * (M — 1 ) * A a t\ — (M — 1)*(\log 2 N~\ + \log2M~\) * A at2, (5.14) where r is the number of bits for the input of OP and s is the number of bits for the output of OP. Because resource sharing always reduces system performance, DAFS can be applied only when Aflain is greater than 0, in other words, (N — M) * Aop > 2 * (j43 i + Adi + Aa) + (\log2N] + \log2M~\) * (A a2 + Ad2 + Ac2) + (N 4- M) * (AS3 + Ads) + (N — 1 ) * r * A m ux2t01 + (M — 1 ) * S * A m U x2tol + 2 * (M — 1 ) * A„ti + (M - 1) *(\log 2 N] + \log2 M]) * A a t2 (5.15) 2 A op > y * (^ * 1 + Adi + Ad + A5 3 + Ad3) (\log2 N] + \log2 M ]) + --------(N ~ M ) -------- ( 32 + d 2 + c2) (N + M) (N — 1 ) * r + (M — 1 ) * s (N - M) 2 * ( M - l ) , + (N - M ) * M “ t” 7T7 7T\ * Amux2tol 1 > ( M - l ) * ( [ % 2i V l + r % 2 M l ) + ---------------(aF T m ) * < 5-16) 2___ (A si + Adi + Ad + Aa 3 + Ad3) (N - M) Aop (\log2N) + \log2M \) (Aj2 + Ad2 + Ae2) ( N - M ) * A0p , (N + M) (As3 + Ad3) + T T T T T 7 * ( N - M ) A op 152 (./V — 1 ) * r + ( M - 1) * s A mux2toi H -------------7Tr---- T 7 T ------------ * " ---- ( N - M ) A, op , 2 * ( M - l ) Asn H — ttt— r r r ~ * ( N - A f ) A, op . (M - 1 ) . (\log,tf\ + fl<®Af|) . + -( A ^ M ) -------------------------------* 1 ^ (5-I7) As in Section 5.2.1, the righthand side of inequality (5.17) is the area overhead ratio. A threshold value of less than 1 is defined for A overhead.ratio for determining whether or not to apply this sharing scheme to operator OP. P erfo rm an c e The performance overhead for DAFS is higher than the overhead for SAFS. The forward propagation delay time overhead, D f p ov , and the backward propagation delay time overhead, D b p ov, for the sharing structure are D f p ov = DFpSelcctor{N) + DFpD istributor(N) + D FPSc, 'ctor(M ) + D FPD„,ribu,oAM ) (5 -1 8 ) DbP o v = Dbp^ J N ) + DBpD latrifeu to r M + d b p S c U c , o t ( m ) + D BPDist ributor{M) (5.19) Since the data computation of OP and the control generation of Selector and Distributor can be executed at the same time, Dpp0 „ is assumed to be zero if D fP op > D f p ov \ otherwise, D f p op is assumed to be zero. As with the analysis for SAFS, the logarithmic term in the above timing pa rameters may be ignored. Therefore, it is assumed that D b p b„ is a constant during data path synthesis. 5.5 Sharing stru ctu re for DAVS In the following sharing structure for DAVS, shown in Figure 5.8, assume that N operations share M operators with variable sequencing in which any execution 153 Fork) (Fork, Fork Selector [CFvs_N CFfs_M Distributor J M-l 'M-l M-l storages 1 M- Selector Distributor O, O 2 On Figure 5.8: Sharing structure for DAVS. 154 order of OP,- for i = 1 to N is acceptable. Since DAVS has the same sequencing as SAVS, it only takes the allocation result from data path synthesis. The structure for DAVS is similar to the sharing structure for DAFS in terms of the data routing part for operation allocation, and to the sharing structure for SAVS in terms of the sequencing control generator. Functionality preservation As with SAVS, the mapping of DAVS preserves the functionality. 5.5.1 Effects o f the Sharing Scheme The effects of the sharing structure for DAVS are analyzed as follows. Area The area costs for the various nodes defined in previous sections can be used also for DAVS. 
The total area gain of the structure for DAVS, Asa,n, is Again ” (Af A/) * Aop (2 * Aai A 2 * Adi A AC 1 d" Aai A ArnullerC2) — \l0g2N] * (As2 + Adi) — \l0g2M) = 1 = (A 32 A Ad2 A AC 2) — N * ( A S3 + A d 3 - f A a3 d" A m ullerC2) — M * (Aa3 d- Ad3) — (A - 1 ) * r * Am uX 2toi (AI 1) * s * A m uX 2toi — 2 * (M — 1 ) * — (Af — 1) * (\log2N) d- \log2M 1) * Ast2, (5.20) 155 where r is the number of bits for the input of OP and s is the number of bits for the output of OP. DAVS can be applied only when Again is greater than 0 , in other words, (T V — M) * Aop 1 > (2 * A si 4- 2 * Adi + A ci + A ai + A muuerc 2) + \^O g2N) * [As2 + Ad2) + \^Og2M ) * ( A a2 + Ad2 + A c2 ) + TV * ( A a3 + Adz + A az + A m ullerC2) + M * (j4 43 + Adz) + (TV - 1) * r * Am uX 2toi + (M - 1) * s * A mux2toi + 2* (M — 1) * Asa + (M - 1 ) * (\log2N) + \log2M}) * A 3t2 1 (2 * A si + 2 * Adi + A ci + A a 1 + AmullerCz) > T T T T * (T V - M) A, \log2N] (AS2 4- Ad2) 4- t t — t t * op (N - M) A op \log2M ] {As2 4- Ad2 4- Ac 2) + T T r T T * ( N- M) A op . A ( A S3 + A d z + A a 3 + A m u U erC 2) + w n f ) * a -------------------- M (As3 4- Adz) + T T — t t * op (T V - M) A, op [N - 1) * r + (M - 1) * s Amux2toi H-------------------- T T -------T T -------------------* (T V - M) A op 2 * (M — 1) Astl H TTr T T ~ * (T V - M) A, op , (M - 1) *(\log2N] + \log2M]) . Ast2 ,r n i , + (JTWj * 1 ^ (5'2 1 1 156 Again, the righthand side of inequality (5.21) is the area overhead ratio. A thresh old value of less than 1 is defined for AoveThead-ratio to determine whether or not to apply this sharing scheme to operator OP. Performance The delays for the various nodes defined in previous sections also can be used for DAVS. The forward propagation delay time overhead, DFpoll, and the backward propagation delay time overhead, D bpov, for the sharing structure are D f p o v = DFPscUctor{N) + DFpDt3trlbutor{N) + DFpArbH cT(N) +DFPScU ct0r(M) + DppD i atnbutor (M) (5.22) DBpov = DBPs'U ctor(N) + DBPDi,tributJ N ) + DBpArbiUr(N) +2 * Dm uU eTc2 + DBpS'lcctor{M) + DBpD istTibutor(M) (5.23) Since the data computation of OP and the control generation of Selector and Distributor can be executed at the same time, DFpo v is assumed to be DFpArbitcr(N) if DFPop > DFPscUctor{N) +DFpD istributor(N) +DFpSclcctor(M) +DFpD istributor(M); otherwise, DFpo p is assumed to be zero. 5.6 Sharing Structure for SA FS w ith a M icropipelined Shared U n it In addition to the sequencing and allocation assumption in Section 5.2, assume that the shared unit is pipelined in k stages for the following sharing structure, shown in Figure 5.9. Let the k stages of OP be OP-1, OP-2, ..., and OP-k from input to output. In other words, for any input /,-, OP : /, = OP-k 0 . . . 0 OPJ2 o O P -1 : /,• 157 Selector CFfs_N k-stage pipeline iP_k Distributor Figure 5.9: Sharing structure for SAFS with micropipelined shared units. Time 160-200 h 120-160 h I, 80-120 h I2 40-80 h I, 0-40 I, O P J OP_2 (a) Resource Time 160-200 120-160 80-120 40-80 0-40 h I, h h i I2 1. b ! I, ! OP, OP, (b) Resource Figure 5.10: (a) Timing behavior for SAFS with a 2-stage micropipelined unit, (b) Timing behavior for DAFS with 2 non-pipelined shared units. 158 in Backus’ FP format [3], or OP(I{) = OPJk( . .. (OP-2(OP-l(/,-)))) in mathematical functional format. Figure 5.9 shows a sharing structure for SAFS with a k-st&ge micropipelined shared unit. 
If the data initiation rate is less than or equal to the peak data initiation rate of the micropipelined unit, the behavior of the sharing structure in Figure 5.9 resembles the behavior of the sharing structure for DAFS with k non-pipelined shared units, which is the structure in Figure 5.7 with M = k. For example, assume that OP is partitioned into a 2-stage micropipelined unit with OP-1 and OP-2. W ithout considering the control overhead for the inserted registers in the micropipelined unit, assume that Dpp(OP) = 80 nsec, Dpp(OP-l) = D f p (OPJ2) = 40 nsec, and DBp{OP) = DBp{OP. 1) = DBp(OP-2) = 0 nsec. W ith one data input per 40 nsec for both schemes, the time vs. resource usage dia grams in Figure 5.10 show the behavioral resemblance between these two schemes. In both schemes, new data can be read in every 40 nsec. Each data execution time is 80 nsec. SAFS with a micropipelined unit uses two stages of an operator to execute two operations at the same time. DAFS, however, uses two operators to execute two operations at the same time. Although a micropipelined operator has the execution power of multiple non shared operators, it has some overheads such as the register overhead. If operator OP is equally partitioned into k stages in terms of the forward propagation delay time, Dpp(OP-i) does not equal Dpp(OP)/k but Dpp(OP)/k + Dpp(register). For example, Dpp(OP-l) should be greater than Dpp(OP)/k = 40 in the previous 159 example. Therefore, the execution for each operation, which equals k * D p p {O P -i), is greater than D f p (OP). The area cost of this structure is similar to the cost of the structure for SAFS with a non-pipelined shared unit with the following exceptions, which are added to the area overhead. The area cost of the shared unit is A'op = S L i Aop_ , . The extra area cost for the storages of condition tokens is (k — 1) * (Asti + \log2N ) * A st2). Therefore, Again — (A T fop) * A op (A 3l -f" Adi A c\) — \log2N 1 * (A32 + Ad2 + AC 2) — N * (A S 3 + Adf) — ( N — l) * r * A mux2toi — (k — 1) * (A sti + rlog2N~\ * A st2), (5.24) where f op = A'op/ A op. The area overhead ratio for the sharing structure for SAFS with a &-stage micropipelined shared unit is A over head-ratio = ( N - f op) (Asi + A d i + Ae i + Aj3 + Aj3 ( k — 1) * Ajti) A„ \log2N] (As2 + Ad2 + AC 2 + (k — 1) * A3t2) + (N - f op) * Aop (N — 1) (As3 + Ad3) + r * Amiix2tol / - + A , (5'25) Again, A oveThead-ratio is used to decide whether or not the sharing scheme can be applied to OP. 160 l) I 2) (9) ( 7 GO GO (* ) (a) (b) Figure 5.11: Constant folding, (a) Before transformation, (b) After transforma tion. 5.7 L ocal T ransform ations Algorithmic transformations can be used to improve the design efficiency at the behavioral level so that the resulting design description can generate a suitable implementation [43, 45, 51]. Most transformations use the peephole optimization technique and are referred to as local transformations. The biases in behavioral level descriptions are caused by the coding style of the designer or generated by other transformations such as sharing schemes. Transformations are developed to reduce the number of operations, the size of control structures, the length of the critical path, or to remove the redundancy, and so on. For example, the constant folding is one of Snow’s transformations for the C-MU RT-CAD system [43]. This transformation replaces a sub-description, which is a function of a set of constants c ,- for i = 1 ,... ,n , by a constant. 
This transformation can eliminate the arithmetic operations in the description. For the example in Snow’s thesis, function F = (l+ 2 )* (9 — 7), can be replaced by constant 6 , as shown in the transformation in Figure 5.11. This replacement eliminates two additions and one multiplication including function F. The same technique can be used for the DFG-based descriptions. 161 OP, Distributor Selector OP2 Figure 5.12: Multiple paths between two operators. Because of the token model in DFG, the correctness of local transformations can be easily proved by the symbolic token simulation shown in Chapter 3. In the rest of this section, a new transformation, which is used extensively to reduce the routing and control parts of sharing structures, for the control synthesis is presented. 5.7.1 Transformation for Sharing Structure R eduction The sharing structures developed earlier are general schemes for resource sharing. When these sharing schemes are applied to a real design, multiple paths from an operator to another operator may exist, such as, the situation in Figure 5.12. Since these paths have the same source operator and destination operator, they may be replaced by one path to transfer data between these two operators. These paths can be merged only when the order of tokens generated by the outputs of Distributor is the same as the order of tokens absorbed by the inputs of Selector. This condition is called the path ordering condition. 162 Figure 5.13 is the transformation of the multiple common source-destination path reduction, where output i of Distributor is connected to input p of Selector, output j of Distributor is connected to input q of Selector, and the path ordering condition is satisfied for these two paths. Two additional problems need to be solved in this transformation: 1. Port mapping: mapping of data paths in the input DFG to the data paths in the output DFG. 2. Control token generation: control token generation that satisfies the port mapping. One possible port mapping for the outputs of the Distributor and the input of the Selector is shown below. • Output port index mapping for Distributor with ij = i: Before transformation After transformation Output x Output a:, if 0 < x < i; Output i , if x = i or x = j; Output (a: — 1), if i < x < n and x ^ j. • Input port index mapping for the Selector with pq = p: Before transformation After transformation Input y Input y, if 0 < y < p; Input p, if y = p or y = q\ Input (y — 1), if p < y < m and y ^ q. The control token generation needs to be consistent with the port mapping stated above. A simple way for dealing with the mapping is to attach an atomic function at the output of the original control token generation. 163 IN„ IN I IN„ C Distributor ^ , ^ C 1 — - ^ T T . . p . . q . / m ' - x Selector OUT,, • p . . q..m Selector T " OUT C2 OUT„ INn IN I IN * C Distributor 'v. Q..H. ■ n - l _ S > ^ C1 ✓^TTTpq.. ^ — . Selector C2 OUT„ (a) T OUT (b) OUT. Figure 5.13: Multiple common source-destination path reduction, (a) Before trans formation. (b) After transformation. OP, OP2 Distributor Distributor Join Join Selector OP3 Figure 5.14: Multiple common multiple-source-destination path structure. 164 This transformation can be extended to the merging of any number of data paths. This transformation can also extended to the merging of multiple common multiple-source-destination paths, for example, the structure in Figure 5.14. 
Ac cording to the area cost function and the delay function of Selector and Distributor, both the area and performance overheads of a sharing structure are reduced by this transformation with only a minor increase in area for the control token generation. In summary, this chapter presented several sharing schemes for the control synthesis of asynchronous systems. The area and performance analysis of these schemes were derived, so data path synthesis could get accurate sharing overhead for operation sequencing and allocation. A transformation for the sharing structure reduction was also presented. This transformation may reduce the area cost and increase the performance of a sharing structure for a given synthesis result after sharing schemes are applied. 165 C hapter 6 R eg ister M inim ization A DFG description obtained from the control synthesis represents a scheduled- structure description of a system in which each input of every operation has a register. Register minimization is the process of removing unnecessary registers from the scheduled DFG under performance constraints. The objective of this process is to minimize the cost of registers for the hardware realization or, in other words, to maximize the cost of registers removed from the scheduled DFG. This thesis presents an algorithm for finding the optimal set of unnecessary registers under the constraint of the system completion time. The criterion for optimization is the maximum total cost of these unnecessary registers. Figure 6.1 shows the input-output relationship of register minimization. The input SA graph, obtained from the data path synthesis, represents the timing relation among operations in the scheduled DFG. The input scheduled DFG, obtained from the control synthesis, represents a structure description for the design. The removal of the set of unnecessary registers from the scheduled DFG results in the EDFG description for the design implementation. This chapter first analyzes the impact of removing a register from a DFG in terms of the system timing behavior. Then, the register minimization problem is formulated as a graph theoretic problem in which register removal is modeled by the addition of an edge in the SA graph. The last section of this chapter presents 166 SA graph (Sequencing Performance & allocation constraints result) Scheduled DFG Removing registers Graph problem for register minimization A set of unnecessary registers Register-minimized EDFG description Figure 6.1: The input-output relation of the register minimization. algorithms for finding an optimal set of unnecessary registers or a near-optimal set of unnecessary registers. 6 . 1 T im ing M od el for R egister M in im ization This section analyzes the impact of removing a register on the timing behavior of a DFG. Figure 6.2(a) is a simple DFG with two function nodes. By replacing each DFG node by a storage node and a phantom function node, the EDFG in Figure 6.2(b) is obtained, where storage s3 is added for the analysis purpose. Assume that all registers have the same forward latch time, Dsji, the same backward latch time, Dm , and the propagation delay time, Dap. Assume that the forward propagation delay time and the backward propagation delay time for phantom function node 167 @------ 0 n, n (a) stage 0; stage I stage 2 • stage 3 stage 0 stage 1+2 stage 3 Sj Il| Sj Ilj Sj Si Hj S3 (b) (d) DppCfnl) | DH P (fnl)| 1 t j Dpp(fn2) |D R P (fn2)| (c) D ppC fn 1 +fn2) | P B P (fn 1 +fn2) (e) Figure 6.2: (a) Two operation DFG. 
(b) EDFG description for the two operation system, (c) The timing diagram of this two operations. fname are represented by Djp(fname) and Dbp{fname). According to the timed- Petri net analysis in Section 3.5, the timing diagram for stage 1 and stage 2 of the system is shown in Figure 6.2(c), where Dpp{stage i) = DFp(fnl) = Djp(fnl) + Dsji + D3p (6.1) DBp{stagei) = DFP(fnl) = Dbp(fnl) + DM (6.2) Dpp(stage 2) = DFp(fn2) = Djp(fn2) + D3ji + D3p (6.3) DBp(stage2) = DFp(fn2) — Dbp(fn2) + Dsbi (6.4) The time interval during which /n,- is occupied by the operation at stage i is (DFp(stagei) + DBp(stagei)) for i = 1 and 2. If s2 is removed from the previous example, EDFG description in Figure 6 .2 (d) is obtained, where stage 1 and stage 2 in Figure 6 .2 (b) are merged into one stage, 168 stage 1+2. According to the timed-Petri net analysis in Section 3.5, the timing diagram for the stage 1+2 of the updated design is shown in Figure 6.2(e), where DFP{stagei+2 ) = DFP{fnl + fn2) = Djp(fnl) + Dfp(fn2) + D,fl + D,p (6.5) DBp(stage i + 2) = DBp(fnl + fn2) = Dbp( fn l ) + Dbp(fn2 ) + Dsb! (6 .6 ) Since there is no register between operations f n l and fn2, the output of / n l cannot be released until fn2 gets the acknowledge signal from storage S3. In other words, the time interval during which fn l is occupied by the corresponding operation is (DFp(stagei^.2 ) + DBp(stage 1+ 2 ))- To reduce the complexity of the register minimization problem, the timing effects of removing a register from a DFG are further simplified. Assume that Djp(fname) is much greater than (Dsfi-\-Dsp) for any function /name, D jp(fname) (Dsji + Dsp), and assume that Dbp(fname) <£ D„b t. (In the current implemen tation, Dbp(fname) equals 0 for any function fnam e .) Then DFp(stage-]i_^2) and DBP(stage 1 + 2 ) can be simplified as follows: DFP(stagei+2) - DFP(stage 1) + DFP(stage2) (6.7) DBP(stagei+2 ) ^ DBP(stage 1) ~ DBP(stage2) (6.8) Similarly, if all registers except for both the input and output register in a n-stage path are removed, following relations can be obtained. n DFP{stagei+" + n ) ~ £ DFP{stagei) (6.9) 1=1 DBp(stagei_|___j.n ) ^ DBp(stagei), for i = 1 ,..., n (6.10) 169 Therefore, only the DFG timing parameters, Dpp and Dpp, are needed for the register minimization problem. The time interval during which fn j is occupied by the corresponding operation is (£"=J- Dpp(stagei)) + Dpp(stagej). 6 .2 T h e P rob lem Form ulation 6.2.1 Graph M odel for R egister M inim ization In Chapter 4 the SA graph was developed to represent not only the sequencing and allocation of operations over operators but also the timing relation among operations. This chapter extends the SA graph to represent the timing relation for the removal of a register. Assume that G sa = (V, jE U E s a ) is an acyclic SA graph for DFG G = (V, E). There is a register for each input of every operation in DFG, and each edge e € E corresponds to an input of an operation. Therefore, E also represents the set of registers in G before register minimization. Eliminating the register at (w,-,Vj) G E will delay the releasing time for the operator which executes u ,-. If (u,-, vk) G E s a , removing the register at (u,-,Uj) may delay the starting time of u*. 
According to the timing model described in relations (6.7) and (6 .8 ), the elimination of the register at (Vi,Vj) can be modeled by the deletion of the timing constraint at (u;, vk) and the addition of a the timing constraint at (vj,Vk) with DBp{ms) delay, where operator m s is shared by v ,- and Vk, in other words, m(vj) = m(vk) = m s. Figure 6.3(a) shows the timing relation among u,-, Vj, and Vk before the removal of the register at (v{,vj). Figure 6.3(b) shows the timing relation among V{, Vj, and Vk after the removal of the register at (u;,Uj), where the mark on edge (u;,uj) denotes the register removal. Similarly, if there is an edge (v{,Vk) G Esai the elimination of the registers at a simple path (u,-,. . . , vj) in E can be modeled by the deletion of the timing constraint at (Vi,Vk) and the addition of a timing constraint at {vj,Vk) with DBp{rns) delay, 170 (a) (b) (c) (d) Figure 6.3: Graph model for the register removal, (a) Before the register at (v,-, vj) is removed, (b) After the register at is removed, (c) Before the registers at path (vi,..., V j) are removed, (d) After the registers at path («j, ...,Vj) are removed. where operator m s is shared by V { and t> * . Figure 6.3(c) and Figure 6.3(d) show the timing relation before and after the removal of registers in path («,-,... ,iij). Definition 6.2.1 For a given directed graph G' = (F ', E 1 ), a directed path is dominant, if and only if no other paths in G' contain this path. A directed path is dominant with the starting vertex if and only if the path starts at u; and no other paths starting at u ,- contain this path. Definition 6.2.2 Let G s a = (F, E U E s a ) be an acyclic SA graph corresponding to a DFG G = (V,E) with a feasible SA edge set Esa- An RE edge set Ere with respect to the SA graph G s a is the set of edges from which registers are eliminated. Corresponding to E r e , E e e is the set of edges deleted from the timing relation in G s a - i and E e e is the set of edges added into the timing relation in G s a - They are defined as follows. E r e — {(u< > vk) € Es/i|V(u,-,. . . , Vj) is a dominant path starting at Vi in G' = (F, E r e ) } E re = { ( u ^ l V K , . . . , vj) is a dominant path starting at u ,- in G' = (V,Ere ) with (u,-,Vk) e E sa) 171 The register elimination graph or the RE graph, G r e , is the graph model representing the timing relation after the removal of registers at E r e . The RE graph is defined as follows. G r e = (V,E U ( E sa \ E e e ) U E r E ) Since the edges connected to input ports and output ports interact with the sur rounding environment of the system, the registers at these edges are not eliminated in the register minimization. E x am p le The DFG in Figure 6.4(a), which is the same as the DFG in Figure 4.6, is an SA graph. If EREl = {(u+,i,v,l3)}, then EEEl = {(v+tl,u+,2)} and E r Ei = {(u*,3> u+,2)}- The RE graph corresponding to E r Ei , G r e , , is shown in Figure 6.4(b). There is a loop in Figure 6.4(b), (u+i2,u«,3,i’ +t 2)- The edge (u+!2,u»i3) G E means that w » i3 is waiting for the output from v+ < 2. The edge (u+t2,v.,3) G E EEi means that u+)2 is waiting for both the release of the operator executing u,,3 and the release of the operator executing u+)i. Therefore, the system is deadlocked. Definition 6.2.3 If the RE graph is cyclic, the system is deadlocked. Definition 6.2.4 Let E r E be an RE edge set for a DFG G = (V,E) and an SA edge set E s a - If the RE graph G r e corresponding to G , E s a , and E r E is acyclic, the RE edge set E r e is feasible. 
Otherwise, E r e is infeasible. E x am p le Using the same DFG as before (the DFG in Figure 6.4(a)), If E re2 = {(u+i2,u„t3)}, then EEEi = {(u+l2,t;+,4)} and E%E 2 = {(u.)3 ,u+)4)}. The RE graph corresponding to E r e 2 , G r e2, is acyclic and shown in Figure 6.4(c), where the delay at (u»,3,u+i4) G EEE2 is the same as the delay at (u+i2,u+i4) G EEE2. Further the register (u„)2,u+)4) can be removed from the RE graph G r e2. If E re3 = { ( v + , 2, u . , 3 ) , ( v . l2,t>+ i 4 ) } , then EEE3 = {(u+i2,u+t4), (u„,2,u„,4)} and E ^ Es = 172 + ,/ (a) (b) (d) i2 (C) Figure 6.4: (a) An SA graph, (b) An RE graph corresponding to Erei = { (u +)i , u » )3)} . (c) An RE graph corresponding to E r e 2 = { (u +12 ,u»,3)}. (d) An RE graph corresponding to Ere3 = {(u+i2, u,)3), (u,)2, ^+,4)}- 173 {(u»,3,u+,4),(u+,4,u»,4)}- The RE graph corresponding to E re3, G r e3, is acyclic and shown in Figure 6.4(d), where the delay at (i> „t 3,t;+t4) € Ere2 is the same as the delay at (u+i2,u+i4) € ERE2. 6.2.2 U nnecessary R egisters Similar to the SA graph, the completion time for the system with unnecessary registers eliminated equals the longest path delay of the corresponding RE graph. Theorem 6.2.1 Let G r e — (V, E U ( E sa \ E r e ) U E r e ) be an RE graph corre sponding to a DFG G = (V, E) with an SA edge set E sa and a feasible RE edge set E r e , where E r e and E r e are two edge sets corresponding to E r e as defined in Definition 6.2.2. The completion time of the system corresponding to G, E s a , and E r e equals to the longest path delay from the inputs of G r e to the outputs of G r e , and it is denoted by CT(G, E s a , E r e ), that is, C T ( G , E s a , E r E ) = L g *e ( I , 0 ) Proof: Use the method presented in the proof for Theorem 4.4.1. □ Lemma 6.2.1 Let E r e, and E re2 be two feasible RE edge sets corresponding to a DFG G = (V,E) and an SA edge set E s a • If E r e2 = E r e , U {e}, where e € E \ E r e, , then C T { G , E s a , E r e ,) < C T ( G , E s a , E r e 2) Proof: Let e be denoted by (vi,vj). There are two cases to consider. 1. There does not exist any € V such that (v{,Vk) € E sa or £ E e e . Figure 6.5(a) and Figure 6.5(b) show the vt-vj parts in G r e, and G r e2• In this case, E REl = E RE2 and E REl = E REq , so G r e , — G r e2, s o L G r e i ( 1 ,0 ) = L Gre2 ( I ,0 ) . 174 (C) (d) Figure 6.5: (a) Case 1: Gr e (b) Case 1: Gre2■ (c) Case 2: Gre,. (d) Case 2: Gre2. 2. There exists at least one Vk € V such that (v;,Vk) € Esa and/or {vi,Vk) € EftEl ■ Figure 6.5(c) and Figure 6.5(d) show the v,-Vj-Vk parts in Gre, and Gre2. • If (vi,Vk) € E s a , there is only one such edge. If a path p in G r e , does not contain the edge (v{,Vk) £ E s a , then path p is also in G r e 2- If a path p' in G r e , contains the edge (u,-, Vk) € E s a , then path p' is not in G r e 2. If p' is denoted by (ui, v2, . • •, u,-, Vk,..., vz), then a path p" = (uj, v2, ..., vj, V k , . . . , vx) in G r e 2 exists, and / ° * * ( p " ) = / ° « i (p') + D Fp(m(vj)), where operation vj is executed by operator m(vj). Therefore, LGrbj (1,0 ) < LGrei(I,0). • Similarly, for any (u,-, Vk) € EEEi, LGrei (I, O) < LG re2(I,0). Therefore, CT(G,Esa,E re1) < CT(G,Esa,E re2). □ 175 Corollary 6.2.1 Let Erei and Ere2 be two feasible RE edge sets corresponding to a DFG G — (V,E) and an SA edge set Esa- If Ere^ Q Ere2, C T ( G , E s a , E r Ei ) < C T { G , E s a , E re2) Corollary 6.2.2 Let Ere. 
be a feasible RE edge set corresponding to a DFG G = (V ", E) and an SA edge set Esa- CT(G,Esa) < CT(G,E sa,E r e ) E x am p le The system completion time of the SA graph in Figure 6.4(a) is 20, that is, C T ( G , E s a ) = C T { G ,E s a ,0) = 20. The system completion time of the RE graph in Figure 6.4(c) is 20, that is, CT{G,EsA-,EREi = {(u+il,u»,3)}) = 20. The system completion time of the RE graph in Figure 6.4(d) is 21, that is, C T { G , E s a , E r e2 = {(u+il,v„> 3 ), (u,,2,u+,4)}) = 21. In the preceding examples, 0 Q E r e , C E re2, so C T ( G , E s a ,®) < C T ( G , E s a , E r e 1) < C T ( G , E s a , E r e2)- Corollary 6.2.2 means that removing registers from a scheduled DFG never reduces the system completion time. Usually, removing registers increases the system completion time. Definition 6.2.5 If the removal of a set of registers from a DFG does not increase the system completion time over the constraint, this set of registers is called a set of unnecessary registers. Each register has a cost, which is usually proportional to the bit width of the register. Let w(e) represent the register cost at edge e. The register minimization problem is to find the maximum cost of unnecessary registers, and the problem is formulated as follows. Let G sa be an acyclic 5/1 graph corresponding to a DFG G = (V, E). Let the SA edge set of G sa he E s a - Each e € E has a cost w(e) representing the register cost 176 at this edge. Each v 6 V has a delay d(v) representing the operation execution time, each e £ E has a zero delay, and each e € Esa has a delay d(e) representing the operator reset time. Let L be the constraint of the system completion time. The register minimization problem is to find a feasible RE edge set Ere in which • E ee and Eee are two edge sets corresponding to E re as defined in Definition 6.2.2, • Gre = {V,E U (Esa \ E rE) U Ere) zs acyclic, • LGre{ I ,0 ) < C maximizes the cost of Ere, lZe&ERBw{e)- 6.3 T h e M axim um C ost U n n ecessary R eg ister Set 6.3.1 A n O ptim um Set G eneration D efinition 6.3.1 Let Si and S2 be two sets of sets. The union set production of Si and S2, denoted by S\ ® S2 , is defined as follows. Si® S2 = {si US2 IS 1 G Si, a3 € S2}. L em m a 6.3.1 Let J E ,- be the set containing all sets which have i unnecessary registers corresponding to a DFG G = (V, E), an SA edge set E sa, and a system completion time constraint C for i = 1,2,..., \E\. Then Ej+i C ((Ej ® Ei) \ Ej) for j = 1,2,...,|J5| - 1. 177 Proof: Assume that there exists an E r e € Ej+j, and E r e $ ((Ej ® Ex) \ Ej). Let E r e = E r e 1 U E r e j and E r e 1 fl E r e } = 0 with |E r e ^ = 1 and \ E r e } \ = j . Since E r e £ ({Ej ® E\) \ Ej), E r e 1 Ex or E r e 3 £ Ej. If E r e x £ E\, then C T ( G , E s a , E r e J > C. Since E r e 3 E r e ,, according to Lemma 6.2.1, C T ( G , E s a , E r e ) > C T ( G , E Sa i E r E i) > C. Similarly, if E r e } ^ Ej, then CT(G, E s a , E r e } ) > C , and C T ( G , E s a , E r e ) > C. In either case, E r e $ Ej+1 . This is a contradiction. □ If \Ei\ = r, then \Ej\ < (p for j > 1. Therefore, there are at most 2r — 1 different sets of unnecessary registers. The next algorithm enumerates all 2r — 1 possible sets for a given E \. Algorithm 6.3.1 (Possible unnecessary register set enumeration) • Main Algorithm /* Assume that r, E\, and the edge ordering are global. */ 1. Find Ei for a given DFG G = (V, E) and an SA edge set E s a - 2. Let r = \E\\. Ordering the r edges in Ei, says, ej -< e-i -< ... -< er. 3 . ExpandNext($,l). 
• Procedure ExpandNext(ERE, num) 1. if num > r, then return. 2. for i = num to r begin 3 . E r e *— E re U { e , } /* Ere is a possible unnecessary register set. */ 4 . ExpandN ext(ERE,i + 1 ) end end of Procedure ExpandN ext Lemma 6.3.2 Algorithm 6 . 3 . 1 generates all possible unnecessary register sets for a given DFG and an SA edge set. 178 P roof: Let E r e = {e,j, e,-2, ..., e,f c _,, eIJ(} C {ei, e2, ..., er}. The edges in FJ/je can be ordered. W ithout a loss of generality, assume that the edge ordering in E r e is e< i -< e ,-2 -< ... -< e1 J f e . The set E r e can be generated by a sequence of function calls, ExpandN ext 1), ExpandNext({ei1},ii + l), ExpandNext({e^, e ,-2}, ? '2 +1). .. ExpandN ext({e{„ etJ, ..., + 1). □ In Step 3 of Procedure E x p a n d N e x t , C T { G , E s a , E r e ) is the lower bound of the system completion time for all E 'r e D E r e - (Corollary 6.2.1) Therefore, if C T ( G , E s a , E r e ) > £ , where £ is the system completion time constraint, then the branch does not need to be further expanded after this E r e - Assume that S r e is the current maximal cost of the unnecessary register set with cost Cost(SRE). Let £ be the system completion time constraint. The algorithm to search the maximum cost of the unnecessary register set is described as follows. A lg o rith m 6.3.2 (M axim um cost u n n ecessary re g iste r set) • M ain A lg o rith m 0. Let Sre < — 0 and Cost(SRE) < — 0. 1. Find Ei for a given DFG G = (V, E) and an SA edge set E sa- 2. Let r = \E\\. Ordering the r edges in E\, says, ei -< e2 -<...-< er. 3. ExpandNextOpt(9,1). 4. Print Sre and Cos^ S re)- 9 P ro c e d u re ExpandNextOpt(ERE, num) 1. if num > r, th e n re tu rn . 2. for i = num to r begin 3. Ere < — Ere U {e,}, and Gre is the corresponding RE graph. 4. if Gre is acyclic an d CT(G, Esa, Ere) < £ and Ee.gEflE ttf(cj) + E L ,+ i < Cost(SRE) th e n begin 5. if CT(G,E sa,E re) > Cost(SRE) th e n begin 179 6. Ere * — Ere 7. Cost(eRE) «- CT(G , E s a , E r e ) end 8. ExpandN ext{ERE,i + 1) end end end o f th e P ro c e d u re ExpandNextOpt E x am p le For the DFG and the SA edge set in Figure 6.4(a), E l = {{(U+,2,U.,3)},{(U.,2,U+I3 )} ,{ ( u .1 2,n+ ,4)},{(U + ,4,U.1 4)}}. Let ei, e2, e3, and e4 represent (u+i2,v*i3), (u„t2,u+)3), (■ u » i2,u+i4), and {v+A,vm A), respectively. Assume that ru(et) = 1 for i = 1,2,3,4, and assume that there is no completion time constraint. In other words, only the system deadlock needs to be avoided. The result is shown in Table 6.1, where the corresponding RE graphs are shown in Figure 6.4 and Figure 6.6. The order of E r e shown in this table follows the enumeration scheme in Algorithm 6.3.2. If the completion time constraint is set at 22, then 9 of these 15 entries are examined, where 6 of these examined entries are either deadlocked or over the constraint. 6.3.2 H euristic A lgorithm s As stated previously, the number of unnecessary register sets can be as large as 2 l£d? where E\ is the set containing all sets with at least one unnecessary regis ter. Instead of enumerating all possible sets, the heuristic algorithm incrementally checks and adds the registers into the unnecessary register sets. The quality of the result from the heuristic algorithm depends on order in which the registers (edges) in the SA graph are checked. In the previous example, if the checking order is (e4, e2, e3, e4), then the result is E r e = {ej,e2,e3}. If the checking order is (ej, e4, e2, e3), then the result is E r e = {ei,e4}. 
Currently, the 180 (a ) + ,3 (b) (c ) +,/ (d) Figure 6.6: (a) E r e = {(u+i2,u„i3), (u.,2,u+)3)}. (b) E r e = {(v+l2, v,,3), (u,t2,u+t3), (u.,2 ,i>-m)}. (c) E r e - {(u+i2,i;.)3), (u+t4,v.,4)}. (d) E r e = {(t>.,2, u+,4), (w + 1 4,u.,4)}- 181 E r e G T ( G , E Sa , E r e ) G rE ( Figure) 0 20 6.4(a) ei 20 6.4(c) e i , e 2 } 24 6.6(a) 6i , e 2, e3 24 6.6(b) e i, e 2, e3, e4 } deadlock N/A e j, e 2, e4} deadlock N/A ei>C3 21 6.4(d) e i, e3, e 4 ) deadlock N/A e i , e 4 23 6.6(c) e2l 24 N/A ^2, e3 24 N/A 62, e3, e4 } deadlock N/A 62, e4 1 deadlock N/A 63} 21 N/A 63, 64 } deadlock 6.6(d) e 4 23 N/A Table 6.1: All RE edge sets for the SA graph in Figure 6.4(a). longest path delay in the SA graph is used to sort all edges. Edges with shorter longest path delays from the inputs of the system to the tail of the edge are checked earlier. Assume that £ is the system completion time constraint. The algorithm is described as follows. Algorithm 6.3.3 (Heuristic Algorithm for Register M inimization) 1. Sort all edges in E in the non-decreasing order of the longest path from the inputs of the system to the tail of each edge. Let the ordering be e4 -< e2 •< c\e \. 2. E r e < — 0; G r e < — G s a ', 3. for i = 1 to \E\ begin 4. Ere< < — Ere U { e,-} , and Gre> is the corresponding RE graph. 5. if Gre> is acyclic and L grb, [ I , 0 ) < C then 6 . E r e < — Ere>\ G r e < — Gre>\ end 182 Edges The longest path delay for edge ordering Checking status (Current completion time) Initial SA graph — 26 (u + .i,u » ,3) 2 deadlock 4 deadlock (U+.2,U..3) : ei 5 20 (v., 3,vmA) 9 deadlock (vm ,2,v+A) : e3 14 21 (u».2,u+.3) : e2 14 24 (U+,4i V*A) • ^4 16 deadlock Table 6.2: Edge ordering and checking in the heuristic algorithm. 7. Print E r e and H eeERE w(e)- The algorithm first decides the order in which the edges are checked. Since E\ is not known in this algorithm, all edges in E except the input and output edges are ordered. For the previous example in Figure 6.4(a), the order is listed in Table 6.2, where e,-’s in the first column are the corresponding edges in E\ used in Algorithm 6.3.2. The second column shows the longest path delay for the ordering in Step 1 of the heuristics. The third column of this table shows the incremental checking status in Step 5 of the algorithm. 183 C hapter 7 F IR D esign This chapter presents a 16-point, 16-bit, programmable, causal FIR filter design using the flow described in Chapter 2. The convolution sum of the filter is 1 5 y(n) = ^ 2 h(k) * x(n — k) k=o A causal FIR system with linear phase has the property that h(k) = h(15 — k), for k = 0 ,...,7 . Therefore, the convolution sum can be reduced to the following form, 7 y{n) = * (® (n — ® (n — is + k)) f c = 0 = h(0) * (x (n ) + x(n — 15)) + h( 1) * (x(n — 1) + x(n — 14)) + h(7) * (x(n — 7) + x(n — 8 )) The filter coefficients, h(-)s, are programmable, so this filter can be configured to any 16-point causal FIR filter. In the implementation, the computation is based on the 16-bit fixed-point number system, where the addition is in 16-bit computation and the multiplication is in 8 -bit computation. 184 CFfs8.1 Distributor mem_H) (mem_H mem_H mem_H mcm_H) (mem_H mcm_H mcm_H in_X 16’hO 16'hO J6 'h 0 I6’h0 J 6 ’h0 J 6 ’h0 Fork Fork Fork Fork Fork Fork Fork Fork 16’hO 16’h( 16’hO 16’hO Fork Fork Fork Fork Fork Fork Fork ADD ADD ADD ADD ADD ADD ADD ADD Tran Trun Tran Tran Tran Trun Tran Tran MUL MUL MUL MUL MUL MUL MUL MUL ADD ADD ADD ADD ADD ADD ADD : o u i_y i Figure 7.1: Input DFG description for a FIR digital filter. 
185 ;’bi Fork OUT (b) IN SeqOlr Selector Fork OUT Figure 7.2: (a) DFG description for memJH. (b) DFG description for SeqOlr. 7.1 D F G Specification Figure 7.1 is the input DFG description of the FIR filter. There are two inputs for this system: in.H is the 8 b input for the initialization of system coefficient h{i) for 0 < i < 7 and in .X is the 16b input for x(n) for n > 0. There is one 16b output, out.Y, for y(n) for n > 0 in the system. In Figure 7.1, ADD, MUL, and Trun are atomic functions. ADD is the 16-bit addition. MUL is the 8 -bit multiplication. Trun is the truncation function which truncates 8 least significant bits from a 16-bit data of an ADD to a 8 -bit data for the following MUL. In Figure 7.1, CFfs8.1 and mem.H are macro functions. CFfs8.1 generates the sequence (0,1,2,3,4,5,6 ,7) after the system starts. Eight data tokens are read in from in.H and distributed to proper mem.Hs for h(0) to /i.(7). mem-H, whose DFG description is shown in Figure 7.2, reads in a data token from its input and produces the same data token forever. There are fifteen data tokens in the DFG description of the FIR filter. These tokens represent the initial data of x(n — k) for 1 < k < 15. In this specification, 186 Construct Name D f p (nsec) D b p (nsec) A d d 16.0 4.4 MUL 36.5 4.4 Trun 0 .0 0 .0 Fork( 2 -output) 0 .0 3.1 overhead 0 .0 " 14.3“ ... Table 7.1: Timing parameters for the FIR filter synthesis. the data value of the tokens is 0, that is, a:(—1) = x ( — 2) = ... = x ( —15) = 0 with n = 0 initially. After h(i) for 0 < i < 7 are read, each input x(n) generates an output y(n). 7.2 D a ta P a th S ynthesis There are three types of atomic functions in this design. Only two of the functions, ADD and MUL, need to be considered, since the third, Trun, can be implemented by physically truncating unused data. Currently, each type of operation uses one kind of implementation. Timing parameters for all implementations are shown in Table 7.1. The D f p for both Fork and the sharing overhead are zero due to the parallel between data computation and control generation in the hardware implementation. Currently, the overhead of the backward propagation delay time for the sharing scheme is 14.3 nsec. For convenience, the token input part of the DFG is removed. The DFG with the token input part removed is shown in Figure 7.3. It is assumed that all x (n —i) for i = 0 ,..., 15 and all h(j) for j = 0 , .. ., 7 arrive at the same time. Data path synthesis finds the sequence and allocation with the minimum system completion time for a given set of resources. Table 7.2 shows the sequencing and allocation results, where the heuristic algorithm always found the optimum solution for this example. The following steps only show the intermediate format of the design procedure for a design with 2 MULs and 2 ADDs. The 187 il :2 i3 i4 i5 i6 i7 i8 i9 ilO i l l il2 il3 il4 i!5 il6 add I, add: add.l add' addi add! addS/ add7/ A D D A D D A D D A D D A D D A D D A D D A D D il7 l Tn,n) i 18 l Tn,7 il9 l Tru7 i20 l Tru7 i21 l Tru7 i22 l Tn,7 i23 l Tn,7 i24 V ™ mul: mul' mul! m u ll mul m u l3 m ul7/ mul! M U L M U L M U L M U L M U L M U L M U L M U L a d I I ad d i a d ii ad li ac df/ ad li addi A D D A D D A D D A D D A D D A D D A D D out Figure 7.3: DFG description of the FIR filter for synthesis. Resource Constraints (No. of operators) Completion time (nsec) No. of MUL No. 
of ADD heuristic Optimum > 1 1 “ 5U1.8 ' 51)1.8 " 1 > 2 “ 434.9 ' 454.9 ' > 2 2 258.9 258.9 2 > 3 " 250.1 " 25U.I 3 ....> 3 194.9 194.9 " > 4 3 T 8 6 .8 186.8 4 > 4 171.7 .. . m - v "'■> 5 " 4 “ T67.2 15772 > 5 > 5 164.5 164.5 Table 7.2: The sequencing and allocation results of the 16-point FIR filter. 188 sequencing and allocation result of the 2-M UL, 2-AD D design is as follows: (Note the three columns give the execution starting time, Te3, the execution completion time, Tec, and the operator reset completion time, Ttc, respectively.) ADDI: add2: ( 0 .0 16.0 3 4 .7 ) add3: ( 34 .7 5 0 .7 6 9 .4 ) addS: ( 69 .4 8 5 .4 104.1) add6: (104.1 120.1 138.8) add7: (1 3 8 .8 154.8 173.5) add8: (1 7 3 .5 189.5 208 .2 ) adde: (2 0 8 .2 224.2 242 .9 ) addg: (2 4 2 .9 258.9 277 .6 ) a d d i: ( 0 .0 16.0 3 4 .7 ) add4: ( 34 .7 50 .7 6 9 .4 ) adda: ( 6 9 .4 8 5 .4 104.1) addb: (1 0 7 .7 123.7 142.4) addc: (1 4 2 .4 158.4 177.1) addd: (177.1 193.1 211 .8 ) a d d i: (2 2 4 .2 240.2 258 .9 ) mul2: ( 16.0 5 2 .5 7 1 .2 ) mul3: ( 7 1 .2 107.7 126.4) mul5: (1 2 6 .4 162.9 181.6) m ul7: (1 8 1 .6 218.1 236 .8 ) m u ll: ( 16.0 5 2 .5 7 1 .2 ) mul4: ( 7 1 .2 107.7 126.4) mul6: (1 2 6 .4 162.9 181.6) mul8: (1 8 9 .5 226.0 244 .7 ) The corresponding SA graph for the above result is shown in Figure 7.4, and the corresponding Gantt chart for the above result is shown in Figure 7.5. 7.3 C ontrol S ynthesis In the next step the sequencing and allocation result is applied to the input DFG using the sharing scheme with fixed allocation and fixed sequence. To save the sharing structure, a multi-input operation is broken into two parts: the Join of all input part and the computation part, for example, Figure 7.6 shows the 2-part 189 addi. lddl a d d ? id d i odd! lddl lddl ADD ADD ADD ADD ADD ADD ADD ADD Trun Trun Trun Trun Trun Trun Trun Trun m u l mull m u l,] mull mul' mull 'm u l7 / mull MUL MUL MUL MUL MUL MUL MUL MUL nddi ud II ac It ud li o c ii uc if/ id J i ADD ADD ADD ADD ADD ADD ADD Figure 7.4: The SA graph for the 2-M U L, 2-AD D design. □ FP(ADD) = 16.0 nsec 1 I FP(MUL) = 36.5 nsec 0 BP(ADD/MUL) = 4.4 nsec I Q BP(overhead) = 14.3 nsec Data Dependency t ADDI ADD2 MUL1 MUL2 add2 add3 add5 add6 add7 add8 adde addg II H aildl add4 addq hddb a'ddc iaddd a jd f ^ 1 H mud / I H I ZBHQfflZEIGIH i k \ 7 * mul^ — # » • » i * \4 1 multi i m i mul! Figure 7.5: The Gantt chart for the 2 - M U L , 2-A D D FIR filter. 190 Figure 7.6: 2-part AD D and 2-part M U L. ADD and the 2-part MUL. Assume that the original label is divided similarlj\ For the ADD, the label addi becomes Ji and addi for Join and 1-input ADD. For the MUL, the label muli becomes JMi and muli for Join and 1-input MUL. Figure 7.7 is the mapped DFG corresponding to the synthesis results in Figure 7.5, where net labeling is used to represent data path connectivity. For example, addi is connected to the output of J l and to the right input of JM I in Figure 7.3. In Figure 7.7, these two paths are labeled Jlo and JM lr. Referring to Figure 7.5, addi is scheduled as the first operation at operator ADDS. J lo and J M lr are linked to the corresponding paths of the sharing structure for the ADDS in Figure 7.7, such that, input port 0 of Selector and output port 0 of Distributor for the ADDS. After applying the sharing scheme for the sequencing and allocation result, possible reduction on the mapped DFG is searched. In this design, there are two operands for each operation. 
In order to apply the principle of the local transforma tion in Figure 5.13, the inputs and outputs of two operations need to correspond to the same sources and destinations. For example, in Figure 7.7 (Jb l,Jb r) for addb and (J d l,J d r) for addd originate at the same sources, Distributor of MULI for both left operands and Distributor of ADDS for both right operands. (Jbl,Jbr) for addb and (Jdl,Jdr) for addd have the same destination, Selector of ADDS, through Jbo and Jdo. Therefore, these two sets are merged to (Jbdl,Jbdr) and 191 Jbdo. Similarly, (Jel,Jer) and Jeo for aide and (J g l,J g r) and Jgo for addg can be merged to (Jeg l,Jeg r) and Jego. Figure 7.7 is thus reduced to Figure 7.8. The control part of the sharing structure depends on the routing of the inputs of Selector and the outputs of Distributors. For example, in Figure 7.7 the con trol parts CFfsJh CFfs7, and CFfs8 generate repeatedly the sequence (0,1,2,3), the sequence (0,1,2,3,4,5,6 ), and the sequence (0,1,2,3,4,5,6 ,7) for both Selectors and Distributors. In Figure 7.8, after local transformations, the control part CFfs4 'gen erates repeatedly the sequence (0,1,2,3) for the Selector and the sequence (0,1,1,2) for the Distributor, the control part CFfs4” generates repeatedly the sequence (0,1,2,3) for the Selector and the sequence (0,1,2,2) for the Distributor. Similarly, the corresponding sequences generated by CFfs8’ and CFfs7’ can be found. 7.4 R egister M inim ization Using the SA graph in Figure 7.4 the maximum cost-unnecessary register set that does not increase the system completion time can be found. The registers at (a d d l,tl), (add'2,t2), (add3,t8), (add4,tA), (add5,t5), (add6,tQ), (add7,t7), (add8,t8), (m u lti, addf), (m ult8,addg), and (addf,addg) can be removed with out increasing the system completion time. Figure 7.9 is the corresponding RE graph. Removing the set of unnecessary registers from the DFG, the EDFG de scription can be obtained. Figures 7.10 and 7.11 are the final EDFG description. The EDFG descriptions of mem-H and CFfs8.1 are shown in Figure 7.12. In the final EDFG description, COUNT4, COUNT7, and COUNT8 are used to produce the sequences (0,1,2,3), (0,1,2,.. .,6 ), and (0,1,2,.. .,7). D ecJijJk is a decoder func tion that converts an z-bit control signal to a j-bit control signal with index k to distinguish different decoders. 192 JM 3 o JM 5o Selector M UL] M U L . [CFfs4 Distributor Ja r J lr Jb r Jd r Selector MUI M U L , CFfs4 Distributor Jer Jcr J 6 o J7o iS o JRo J3o Selector ADD] AD D CFfs8, Distributor out rTnu JM Zr JM 3r JMHr JM 6 r JM 7 r Jbo Jco Jdo_ J4o J lo Selector a d d : AD D Distributor J M lr (4 JM 4r Jdl Figure 7.7: Mapped DFG description for the 2- M U L 2-A D D FIR design. 193 il i2 i3 i4 iS i6 i7 i8 i9 ilO ill il2 il3 il4 ilS il6 il7 J M lr il8 JM 2 r i 19 JM 3 r i20 JM 4 r i21 JM 5 r i22 JM 6 r i23 JM 7 r i24 jM 8 r 1 ' ^ T " 1 J M 3 o T J M 4 o IJM S o l J M 6 o T jM 7 o T JM H o Jal Ja r Jbl Jb r Jc! Jcr Jdl id r Jel Selector M UL] M U L Distributor Jbdr Jfr JM 81 Selector MUI M U L , Distributor CFfs4* Jc r Jeg r J 6 0 J7o J5 o J3o Selector ADDI A D D Distributor out JM 2r (Tnin) JM 3 r JMBr J K B 7 JM b r JM 7r Jao Jbdo J4o Jco i l o Selector a d d : A D D Distributor J M lr t4 JM 4r Jegl :Ffs7’l Jbdl Jet Figure 7.8: Reduced DFG description for the 2- M U L 2-A D D FIR design. 194 addi. o d d ? udd5/ ADD) ki ADD ADD ADD ADD ADD ADD Trun (Trun T ru n k s ( T r u n k s (TrunV** T r u n Trun mul mul' [null muI7/ mul! 
MUL MUL MUL MUL MUL MUL MUL MUL addi u d I I ad li a d J < a d d i r d d i ADD ADD ADD ADD ADD ADD ADD Figure 7.9: The RE graph for the 2-M U L, 2-A D D design. 7.5 Im p lem en tation R esu lts In the last step the EDFG description is mapped into RTL netlist for the layout generation. The example designs discussed in the preceding section have been im plemented using a library of asynchronous building blocks [52] composed with an industrial standard cell library, HP C34100 [58], in a commercial CAD tool, Ca dence Design Framework \li™ 1. The performance of each design has been simu lated in a mix-mode simulator, Verilog-XL(R), using the model distributed with the cell library. The wiring capacitances of the design are extracted by D RA CU LA ^. The implementation and the simulation of these designs at the layout level show 1 Design Framework II is a design framework, and it is a trademark of Cadence Design Systems, Inc. Cell Ensemble is a standard cell placement and routing tool used in our experiments for the layout generation, and it is a trademark of Cadence Design Systems, Inc. Verilog-XL is a mix mode simulator, and it is a registered trademarks of Cadence Design Systems, Inc. DRACULA is an IC layout verification system, and it is a registered trademark of Cadence Design Systems, Inc. 195 in_H CFA8.I 'Bfstributbr m m c ;> [mcm_Hj (mcm_Hj (mem_HJ hO hi h2 h3 h7 h6 i F o rk 1 . F o ik i F o rk I ) pD(*) £D<5) £ ) ( |) £ > (^ ( y \ Join j \ JainJ ( J o i n t ( Jo in I Join I '.Joint J o i n t '.J o i n t 'llilii ’Hii! 'Hjii 'Hj4a 'Tji« ~ l'j7» ~ Vik .. ! Jo in 1 i Join J ( J o i n t V jM lo *T iM 2 n 1 JM3o Jo in t ( Jo in I % Tjmsi> *1 JM6» Figure 7.10: EDFG description for the 2 - M U L 2 - A D D FIR design (Part I). 196 M5n ' 0 ' T ' 2 " 3 . S e le c to r IM U L l D istrib u to r 'I-' -2 - f JM4o Mfto S e le c to r m u l : MUL. D istrib u to r J6o J7o J8o 2 3 4 5 6**7 S e le c to r . . . . A D D I'* ; a d d i I Forte I D istrib u to r out JfT {Trun! JMHr fMftr J S 7 r J4< > i l , ..... . - - y 1 2 3 4 5 - SjiJector A D D J' u > a d d 'i i ....... D istrib u to r . , . 2 3. 4 .S.L.- t T r u n * jM T M 4 ,-L JTnini J*' J K R F Tbl Co u n t) ! F a c t I l>j<j [ * • 013)3 " jc T Jdl Figure 7.11: EDFG description for the 2-MUL 2-AD D FIR design (Part II). 197 Selector Fork 1 OUT ( r o u N T i 3’b()(X Fork 1 (a ) (b) Figure 7.12: (a) EDFG description of mem-H. (b) EDFG description of CFfsS.l. the feasibility of the design method of this thesis. To show the effectiveness of the data flow model, the area/performance of these designs obtained from the DFG/EDFG model are compared with the area/performance obtained from the final layouts. To use the DFG/EDFG model, all the timing parameters, such as D a ji, Da ti, Dsp, D jp, and Z )& p, need to be known during the course of synthesis procedure. Since the asynchronous building blocks have been designed, these timing param eters are obtained by simulation. Table 7.3 shows the EDFG timing parameters for the FIR design. The fanout/loading capacitance internal to each block is con sidered, but the fanout/loading capacitance external to the block, which depends on the interconnection of the block in a real design, is not considered currently. In Table 7.3, several modules, such as Selector-4, have more than one set of timing parameters. The additional sets of timing parameters represent more than one implementation for the same EDFG construct. (Select-4 represents a 4-input Se lector.) 
The slow module is used for non-critical nodes of the FIR filter design. 198 “DFGsim” is used to label the performance measure of the design from the simu lation of the DFG/EDFG model. At the DFG/EDFG level, the area measure of a design is the area sum of all asynchronous blocks used in the design, and the area of each block is the area sum of all standard cells implementing the block. Therefore, no wiring area is considered at the DFG/EDFG level. “Cell” is used to label the area measure. The performance for a real layout is obtained by the simulation of a stan dard cell netlist with wiring capacitances derived from a parasitic extraction tool, DRACULA, and is labeled “Csim”. The area for a real layout is obtained by the multiplication of the width and the height of the layout. The area measured for the core size of the final layout is labeled “Core”. Table 7.4 gives the experimental results of the 16-point, 16-bit FIR filter. Figure 7.13 shows one layout for the FIR filter implementation. Again the performance estimation from the simulation of the DFG/EDFG model was found to be within 87.6% to 98.1% of the final layout performance measurement. This experimental result also shows that the cell area obtained from the DFG/EDFG model approx imately occupies 56.2% to 59.4% of the final layout. The Csim (actual extracted) value is larger than the DFGsim (estimated) value for each design, due to the following reasons. 1. The operation fanout and the control fanout external to the basic blocks are not considered in the current model. 2. The wiring delays between modules are also not considered in the current model. This delay cannot be accurately estimated until the actual layout is generated. 3. Some extra buffers were needed between some modules to comply with a CAD tool limitation in the current implementation, and these extra delays were 199 Construct Name DSp (nsec) Db p (nsec) Dsji (nsec) L > s6/ (nsec) D sp (nsce) (Ph) Trun 0 .0 0 0 .0 0 — — — (PH) C(‘bl) 0 .0 0 0 .0 0 — — — (Ph) ADD 11.60 0 .0 0 — — — (Ph) MUL 32.10 0 .0 0 — — — (Ph COUNTx 1.70 0 .0 0 — — — (Phj D ecJtjJc 2 .1 0 0 .0 0 — — — (Ph) Selector_2 8.87 6.28 — — — (Ph) Selector J 1 8.92 5.91 — — — (Ph) Selector_4 9^1 5.29 6.13 — — — (Phj Selector_4 # 2 8.92 6.59 — — — (Ph) Selector-5 6.77 6.90 — — — (Ph) Selector_6 7.08 6.50 — — — (Ph) Selector_7 6.87 6.72 — — — (Phj Selector.8 10.26 7.64 — — — (Ph) Selector_10 6.51 7.77 — — — (Ph) Distributor_2 9.19 1.53 — — — (Phj Distributor_3 #1 5.18 2 .6 8 — — — (Ph) Distributor_3 # 2 8.72 2.84 — — — (Ph) Distributor_5 5.96 3.72 — — — (Ph) Distributor-8 #1 6.30 3.75 — — — (Ph) Distributor_8 # 2 1 1 .0 0 3.92 — — — (Ph) Distributor.lO 6.96 5.66 — — — (Ph) Fork_2 0 .0 0 3.10 — — — (Phj Fork_3 0 .0 0 3.10 — — — (Ph) Join.2 3.10 0 .0 0 — — — Storage! (8 b,8 b)) — — 2.80 3.83 0 Storage((16b,16b)) — — 2.80 3.83 0 Storage!8 b) #1 — — 2.80 3.83 0 Storage(sb) ^ 2 — — 4.25 5.27 0 Storage(l6 b) #1 — — 2.80 3.83 0 Storage(16b) # 2 — — 4.25 5.27 0 Storage(nb), for n < 3 — — 2.73 3.77 0 Table 7.3: Timing parameters for the EDFG in the experiment of the FIR design. 200 Designs # 1 # 3 # of shared units mult. 1 2 3 add 1 2 . . . . 3 . Completion time (nsec) Csim 615.84 298.53 210.40 DFCsim ■539.40 “273T85 202.68 Ratio 0.876 0.917 0.963 Pipeline period (nsec) Csim 627.20 307.32 217.01 DFGsim S&IAl 283.90 "212.88 Ratio 0.881 0.924 0.981 Area (xlO6 /i2) Core 16.434 20.923 23.707 Cell 9.583 12.419 13.318 Ratio 0.583 0.594 0.562 Table 7.4: Experimental results of the FIR design. 
Despite these factors, the high-level timing model of this thesis is quite accurate; the DFGsim/Csim ratio is 98.1% in the best case and 87.6% in the worst case over all experimental results.

It is common to use the cell area to estimate the routing overhead before placement and routing [57]. From the two preceding examples, the Cell/Core ratio lies within 56.2% to 64.5%. Although the area ratios for these designs are not fixed, they vary within a small range; therefore, the cell area used as the area measure in the data flow model is sufficient for high-level synthesis algorithms. With an accurate timing model and a proper area measure at the data flow level, synthesis results represent the design space properly. One of the main reasons the synthesis algorithms explore the design space properly is that the area/performance overhead of resource sharing, such as control units and multiplexers/demultiplexers, is reflected and predicted accurately; this allows both the performance and the area at the data flow level to be estimated quite accurately.

Figure 7.13: One layout of the FIR filter implementation.

Chapter 8

Conclusions and Future Research

8.1 Conclusions

This thesis presented a synthesis procedure for asynchronous system design. The design style used in this system is based on Sutherland's micropipelines, in which a system is composed of a set of components communicating through handshaking protocols. To automate the synthesis procedure, the following research results were achieved.

• A design representation based on a data flow model was defined, which fully reflects the behavior of asynchronous systems. (In Chapter 3)

• A delay model in the data flow specification was defined, with which the system performance is accurately estimated. (In Chapter 3)

• Based on the delay model in the data flow specification, two of the high-level synthesis problems for asynchronous system design were modeled as graph-theoretic problems.

  - One is the resource-constrained fixed sequencing and static allocation problem, whose goal is to minimize the system completion time. (In Chapter 4)

  - The other is the register minimization problem. (In Chapter 6; a simplified illustration appears in the sketch at the end of this section.)

Both theoretical results lead to algorithms for an optimum solution. Although both algorithms are time-consuming, the results provide insight into these problems and, furthermore, lead to efficient and effective heuristic algorithms for solving them.

• Control synthesis is defined by a set of templates corresponding to the sequencing and allocation schemes obtained from data path synthesis. Because the structures of these templates are parameterized, the performance and area overhead due to resource sharing are easily estimated and passed into the formulation for data path synthesis. (In Chapter 5)

At the end of this thesis, a digital filter was successfully designed using this synthesis procedure. This design and others [55] have shown the effectiveness of this synthesis procedure.
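As a concrete, if simplified, illustration of the register minimization problem mentioned above, the following sketch applies the classical left-edge (interval-coloring) heuristic from synchronous data path synthesis to a set of value lifetimes. It is a generic analogue only: the function, its lifetime model, and the example intervals are assumptions made for illustration, and the graph-theoretic formulation and algorithms developed in Chapter 6 for the asynchronous setting differ.

```python
def min_registers(lifetimes):
    """Greedy left-edge allocation: assign each value lifetime
    (birth, death) to the first register that is free at its birth,
    and return the number of registers used.  A classical
    synchronous-style analogue, not the Chapter 6 formulation."""
    regs = []  # death time of the value currently held in each register
    for birth, death in sorted(lifetimes):
        for i, free_at in enumerate(regs):
            if free_at <= birth:        # register i is free: reuse it
                regs[i] = death
                break
        else:
            regs.append(death)          # no free register: add one
    return len(regs)

# Four values whose lifetimes overlap pairwise force two registers.
print(min_registers([(0, 4), (1, 3), (4, 7), (5, 9)]))  # -> 2
```

For interval lifetimes on a single timeline this greedy allocation is optimal; part of what the asynchronous setting changes is precisely that value lifetimes are no longer fixed intervals of a global clock.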
8.2 Future Research

There are several directions for future research.

• Expand the graph model for the fixed sequencing and static allocation problem whose goal is to minimize the initiation period. (The preliminary formulation is in Chapter 4.)

• The formulation for data path synthesis in this thesis focuses primarily on the fixed sequencing and static allocation problem. Other sequencing and allocation schemes exist; they carry higher design overhead than the fixed sequencing and static allocation scheme but are well suited to sharing large resources. Data path synthesis formulations for these schemes are needed so that the tradeoffs of using different schemes for different resources or system configurations can be quantified.

• In our synthesis system, the focus was on computation-intensive designs; control-intensive designs remain to be investigated. Modeling control-intensive systems, with or without the data flow model, is another interesting research topic.

Reference List

[1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. Data Structures and Algorithms. Addison-Wesley, 1983.

[2] V. Akella and G. Gopalakrishnan. “SHILPA: A High-Level Synthesis System for Self-Timed Circuits”. In Proceedings of the ICCAD-92, pages 587-591, 1992.

[3] J. Backus. “Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs”, The 1977 Turing Award Lecture. Communications of the ACM, 21(8):613-641, 1978.

[4] R. M. Badia and J. Cortadella. “High-Level Synthesis of Asynchronous Systems: Scheduling and Process Synchronization”. In Proceedings of the European Conference on Design Automation, pages 70-74, 1993.

[5] M. R. Barbacci. “Instruction Set Processor Specifications (ISPS): The Notation and Its Applications”. IEEE Transactions on Computers, C-30(1):24-40, January 1981.

[6] P. A. Beerel and T. H.-Y. Meng. “Automatic Gate-Level Synthesis of Speed-Independent Circuits”. In Proceedings of the ICCAD-92, pages 581-586, 1992.

[7] E. Brunvand. Parts-R-Us. Technical Report CMU-CS-87-119, School of Computer Science, Carnegie Mellon University, May 1987.

[8] E. Brunvand. Translating Concurrent Communicating Programs into Asynchronous Circuits. Technical Report CMU-CS-91-198, School of Computer Science, Carnegie Mellon University, September 1991.

[9] E. Brunvand and R. F. Sproull. Translating Concurrent Communicating Programs into Delay-Insensitive Circuits. Technical Report CMU-CS-89-126, School of Computer Science, Carnegie Mellon University, April 1989.

[10] J. A. Brzozowski and J. C. Ebergen. Recent Developments in the Design of Asynchronous Circuits. Technical Report CS-89-18, Computer Science Department, University of Waterloo, May 1989.

[11] S. M. Burns and A. J. Martin. Synthesis of Self-Timed Circuits by Program Transformation. Technical Report 5253:TR:87, Dept. of Computer Science, California Institute of Technology, 1987.

[12] T.-A. Chu. “On the models for designing VLSI asynchronous digital systems”. INTEGRATION, the VLSI journal, 4(2):99-113, 1986.

[13] T.-A. Chu. Synthesis of Self-Timed VLSI Circuits from Graph-Theoretic Specifications. PhD thesis, Dept. of EECS, Massachusetts Institute of Technology, September 1987.

[14] I. David, R. Ginosar, and M. Yoeli. “Self-Timed Architecture of Reduced Instruction Set Computer”. Computer Science 577, Course Material, University of Utah, pages 263-283, 1990.
[15] S. Davidson et al. “Some Experiments in Local Microcode Compaction for Horizontal Machines”. IEEE Transactions on Computers, C-30(7):460-477, July 1981.

[16] A. L. Davis and R. M. Keller. “Data Flow Program Graphs”. IEEE Computer, 15(2):26-41, 1982.

[17] J. C. Ebergen. Arbiters: An Exercise in Specifying and Decomposing Asynchronously Communicating Components. Technical Report CS-90-29, Dept. of Computer Science, University of Waterloo, July 1990.

[18] A. D. Friedman and P. R. Menon. Theory and Design of Switching Circuits. Computer Science Press, Inc., 1975.

[19] D. Gajski et al. High-Level Synthesis: Introduction to Chip and System Design. Kluwer Academic Publishers, 1992.

[20] L. J. Hafer and A. C. Parker. “A Formal Method for the Specification, Analysis, and Design of Register-Transfer Level Digital Logic”. IEEE Transactions on Computer-Aided Design, CAD-2(1):356-370, January 1983.

[21] F. J. Hill and G. R. Peterson. Computer Aided Logical Design. John Wiley and Sons, Inc., 4th edition, 1993.

[22] C. Y. Hitchcock and D. E. Thomas. “A Method of Automatic Data Path Synthesis”. In Proceedings of the 20th Design Automation Conference, pages 484-489, 1983.

[23] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall, Inc., 1985.

[24] C.-T. Hwang, J.-H. Lee, and Y. C. Hsu. “A Formal Approach to the Scheduling Problem in High Level Synthesis”. IEEE Transactions on Computer-Aided Design, CAD-10(4):464-475, April 1991.

[25] R. M. Keller. “Towards a Theory of Universal Speed-Independent Modules”. IEEE Transactions on Computers, C-23(1):21-33, 1974.

[26] D. W. Knapp and A. C. Parker. A Data Structure for VLSI Synthesis and Verification. Technical Report DISC/83-6a, EE-Systems, University of Southern California, March 1984.

[27] L. Lavagno, K. Keutzer, and A. Sangiovanni-Vincentelli. “Algorithms for Synthesis of Hazard-free Asynchronous Circuits”. In Proceedings of the 28th Design Automation Conference, pages 302-308, 1991.

[28] K.-J. Lin and C.-S. Lin. “Automatic Synthesis of Asynchronous Circuits”. In Proceedings of the 28th Design Automation Conference, pages 296-301, 1991.

[29] A. J. Martin et al. “The Design of an Asynchronous Microprocessor”. In Proceedings of the Decennial Caltech Conference on VLSI, pages 351-373, March 1989.

[30] E. J. McCluskey. Introduction to the Theory of Switching Circuits. McGraw-Hill, New York, 1965.

[31] E. J. McCluskey. Logic Design Principles. Prentice-Hall, Inc., 1986.

[32] M. C. McFarland, A. C. Parker, and R. Camposano. “The High-Level Synthesis of Digital Systems”. Proceedings of the IEEE, 78(2):301-318, 1990.

[33] T. H.-Y. Meng. Synchronization Design for Digital Systems. Kluwer Academic Publishers, 1991.

[34] T. H.-Y. Meng, R. W. Brodersen, and D. G. Messerschmitt. “Automatic Synthesis of Asynchronous Circuits from High-Level Specifications”. IEEE Transactions on Computer-Aided Design, 8(11):1185-1205, 1989.

[35] D. E. Muller and W. S. Bartky. “A Theory of Asynchronous Circuits”. In Proceedings of an International Symposium on the Theory of Switching, 29:204-243, 1959.

[36] S. Nowick and D. L. Dill. “Synthesis of Asynchronous State Machines Using a Local Clock”. In Proceedings of the ICCD-91, pages 192-197, 1991.

[37] B. M. Pangrle and D. D. Gajski. “State Synthesis and Connectivity Binding for Microarchitecture Compilation”. In Proceedings of the ICCAD-86, pages 210-213, 1986.

[38] A. C. Parker, J. Pizarro, and M. Mlinar. “MAHA: A Program for Datapath Synthesis”. In Proceedings of the 23rd Design Automation Conference, pages 461-466, 1986.
[39] P. G. Paulin and J. P. Knight. “Force-Directed Scheduling for the Behavioral Synthesis of ASIC's”. IEEE Transactions on Computer-Aided Design, CAD-8(6):661-679, June 1989.

[40] C. V. Ramamoorthy and G. S. Ho. “Performance Evaluation of Asynchronous Concurrent Systems Using Petri Nets”. IEEE Transactions on Software Engineering, SE-6(5):440-449, 1980.

[41] M. Rem, J. L. A. van de Snepscheut, and J. T. Udding. “Trace Theory and Definition of Hierarchical Components”. In Third Caltech Conference on Very Large Scale Integration, pages 225-239, March 1983.

[42] C. L. Seitz. “System Timing”. In Introduction to VLSI Systems, by C. Mead and L. Conway, Addison-Wesley, pages 128-262, 1980.

[43] E. A. Snow. Automation of Module Set Independent Register-Transfer Level Design. PhD thesis, Electrical Engineering Department, Carnegie-Mellon University, April 1978.

[44] I. E. Sutherland. “Micropipelines”, The 1988 Turing Award Lecture. Communications of the ACM, 32(6):720-738, 1989.

[45] H. Trickey. “Flamel: A High-Level Hardware Compiler”. IEEE Transactions on Computer-Aided Design, 6(2):259-269, March 1987.

[46] C. Tseng and D. P. Siewiorek. “Automatic Synthesis of Data Paths in Digital Systems”. IEEE Transactions on Computer-Aided Design, CAD-5:379-395, July 1986.

[47] S. H. Unger. Asynchronous Sequential Switching Circuits. John Wiley and Sons, Inc., New York, 1969.

[48] S. H. Unger. The Essence of Logic Circuits. Prentice-Hall, Inc., New Jersey, 1989.

[49] J. L. A. van de Snepscheut. “Deriving Circuits from Programs”. In Third Caltech Conference on Very Large Scale Integration, pages 241-256, March 1983.

[50] J. L. A. van de Snepscheut. Trace Theory and VLSI Design. Lecture Notes in Computer Science 200. Springer-Verlag, 1985.

[51] R. A. Walker and D. E. Thomas. “Design Representation and Transformation in The System Architect's Workbench”. In Proceedings of the ICCAD-87, pages 166-169, 1987.

[52] T.-Y. Wuu. A Data-Driven Model for Asynchronous System Synthesis. Thesis Proposal, Electrical Engineering-Systems, University of Southern California, December 1991.

[53] T.-Y. Wuu and S. B. K. Vrudhula. “A Design of a Fast and Area Efficient Multi-input Muller C-element”. IEEE Transactions on VLSI Systems, 1(2):215-219, 1993.

[54] T.-Y. Wuu and S. B. K. Vrudhula. Design of Asynchronous Blocks using the HP C34100 Standard Cell Library. Internal Memo, Information Sciences Institute, University of Southern California, May 1993.

[55] T.-Y. Wuu and S. B. K. Vrudhula. Synthesis of Asynchronous Systems from Data Flow Specifications. Technical Report ISI/RR-93-366, Information Sciences Institute, University of Southern California, December 1993.

[56] K. Y. Yun and D. L. Dill. “Automatic Synthesis of 3D Asynchronous State Machines”. In Proceedings of the ICCAD-92, pages 576-580, 1992.

[57] MOSIS prices and gate equivalents. Tanner Research, April 1989.

[58] The HP C34100 Standard Cell Library Data Manual. Number 5091-4273EUS. Hewlett-Packard Company, Integrated Circuit Business Division, May 1992.

[59] The MOSIS Service. USC/ISI, MOSIS Project, July 1993.