HIGH-LEVEL SYNTHESIS WITH PIN CONSTRAINTS FOR MULTIPLE-CHIP DESIGNS

by Yung-Hua Hung

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Computer Engineering)

December 1992

Copyright 1992 Yung-Hua Hung

UMI Number: DP22849. All rights reserved. INFORMATION TO ALL USERS: The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion. Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author. Microform Edition © ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346.

UNIVERSITY OF SOUTHERN CALIFORNIA, THE GRADUATE SCHOOL, UNIVERSITY PARK, LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Yung-Hua Hung under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dean of Graduate Studies
December 2, 1992
DISSERTATION COMMITTEE: Chairperson

Acknowledgements

I take this opportunity to express my sincere appreciation to my advisor, Prof. Alice C. Parker, for her valuable guidance and continued encouragement. Without her, this thesis would not exist. I would also like to thank Profs. Ming-Deh Huang and Sandeep Gupta for serving on my dissertation committee, and Profs. Sarma Sastry, Michel Dubois, and Viktor Prasanna for serving on my guidance committee.
I thank all my friends and colleagues at USC for their warm friendship and various help. In particular, I would like to mention Chih-Tung Chen, Pravil Gupta, Jen-Pin Weng, Shiv Prakash, Atul Ahuja, Sen-Pin Lin, Charles Njinda and Mary Zittercob.

I have received support from Computer & Communication Research Laboratories (CCL), a division of Industrial Technology Research Institute (ITRI), and encouragement from my friends and colleagues there, including my former managers, Dr. Ting-Hao Chao and Dr. Chih-Tien Hsing.

To my little honey - Elbert - your welcome interruptions kept my goal in perspective. Finally, I would like to thank my wife, Lan-Hsiang, for her patience, support, and sacrifices throughout my graduate study.

Contents

Acknowledgements
List Of Figures
List Of Tables
Abstract
1 Introduction
1.1 Motivation
1.2 Problem Statement Overview
1.3 Related Work
1.4 Thesis Outline
2 Research Overview
2.1 Problem Statement
2.2 Problem Assumptions
2.2.1 I/O Operation Model
2.3 Primary Problem Issues
2.3.1 Pin Constraints
2.3.2 Interchip Connections
2.3.3 Time Division I/O Multiplexing
2.4 Research Difficulties
3 Synthesis for Designs with a Simple Partitioning
3.1 Pin Allocation Subproblem
3.1.1 The ILP Formulation
3.1.2 The Size of the ILP Tableau
3.2 Scheduling under Resource Constraints
3.3 Determining Feasibility of an ILP
3.4 Experimental Results
4 Synthesis for Designs with a General Partitioning
4.1 Determining Interchip Connections
4.1.1 The ILP Formulation
4.1.2 A Heuristic Search Procedure
4.2 Scheduling with Given Interchip Connections
4.3 Bidirectional I/O Ports
4.4 Experimental Results
4.4.1 AR Lattice Filter
4.4.1.1 Designs with Unidirectional I/O Ports
4.4.1.2 Designs with Bidirectional I/O Ports
4.4.2 Fifth-order Wave Elliptic Filter
4.4.2.1 Designs with Unidirectional I/O Ports
4.4.2.2 Designs with Bidirectional I/O Ports
5 Interchip Connection Synthesis After Scheduling
5.1 Scheduling
5.2 Interchip Connection Synthesis
5.3 Experimental Results
6 Sharing Buses in a Cycle
6.1 Interchip Connection Synthesis
6.1.1 ILP Formulation
6.1.1.1 Assignment Constraints
6.1.1.2 Data Transfer Constraints
6.1.1.3 Resource Constraints
6.1.1.4 Linearization
6.1.2 Heuristic Search
6.2 Reassignment of I/O Operations to Buses
6.3 Experimental Results
7 Other Extensions
7.1 Data Recursive Edges
7.2 Conditional I/O Operations
7.3 Time Division I/O Multiplexing
7.4 Multiple-cycle Operations
7.5 Conclusion
8 Conclusions and Future Work
8.1 Contributions
8.2 Future Work

List Of Figures

1.1 An example of a CDFG
2.1 Multiple-chip Synthesis
2.2 The scheduling and binding of a CDFG containing only functional operations
2.3 A portion of CDFG containing I/O operations
2.4 A partial schedule and interchip connection for Figure 2.2
2.5 A portion of simple partitioning of a CDFG
3.1 A simple partitioning example
3.2 Possible communications in a simple partitioning
3.3 Possible interchip connection in a design of simple partitioning
3.4 A list scheduling algorithm with pin allocation feasibility checker
3.5 A simple-partition AR filter
3.6 Scheduling of the simple-partition AR filter
3.7 Interchip connection for the partitioned AR filter in Figure 3.5
4.1 Interchip connection model
4.2 An interchip connection example
4.3 A heuristic search procedure to find an interchip connection structure
4.4 An example of an interchip connection
4.5 Snapshots of communication bus allocation
4.6 Interchip connection model with bidirectional I/O ports
4.7 A general-partition AR filter
4.8 Interchip connection for the AR filter design with unidirectional I/O ports and an initiation rate of 3
4.9 Interchip connection for the AR filter design with unidirectional I/O ports and an initiation rate of 4
4.10 Interchip connection for the AR filter design with unidirectional I/O ports and an initiation rate of 5
4.11 Schedule for the AR filter design with unidirectional I/O ports and an initiation rate of 3
4.12 Schedule for the AR filter design with unidirectional I/O ports and an initiation rate of 4
4.13 Schedule for the AR filter design with unidirectional I/O ports and an initiation rate of 5
4.14 Interchip connection for the AR filter design with bidirectional I/O ports and an initiation rate of 3
4.15 Interchip connection for the AR filter design with bidirectional I/O ports and an initiation rate of 4
4.16 Interchip connection for the AR filter design with bidirectional I/O ports and an initiation rate of 5
4.17 Schedule for the AR filter design with bidirectional I/O ports and an initiation rate of 3
4.18 Schedule for the AR filter design with bidirectional I/O ports and an initiation rate of 4
4.19 Schedule for the AR filter design with bidirectional I/O ports and an initiation rate of 5
4.20 A partitioned elliptic filter
4.21 Interchip connection for the elliptic filter design with unidirectional I/O ports and an initiation rate of 6
4.22 Interchip connection for the elliptic filter design with unidirectional I/O ports and an initiation rate of 7
4.23 Schedule for the elliptic filter design with unidirectional I/O ports and an initiation rate of 6
4.24 Schedule for the elliptic filter design with unidirectional I/O ports and an initiation rate of 7
4.25 Interchip connection for the elliptic filter design with bidirectional I/O ports and an initiation rate of 6
4.26 Interchip connection for the elliptic filter design with bidirectional I/O ports and an initiation rate of 7
4.27 Schedule for the elliptic filter design with bidirectional I/O ports and an initiation rate of 6
4.28 Schedule for the elliptic filter design with bidirectional I/O ports and an initiation rate of 7
5.1 Compatible graph for interchip connection problem
5.2 A heuristic procedure for constructing interchip connection
6.1 Illustration of communication slots, sub-buses, and sub-slots
6.2 Interchip connection for the AR filter with an initiation rate of 3
6.3 Interchip connection for the AR filter with an initiation rate of 4
6.4 Interchip connection for the AR filter with an initiation rate of 5
6.5 Schedule for the AR filter with an initiation rate of 3
6.6 Schedule for the AR filter with an initiation rate of 4
6.7 Schedule for the AR filter with an initiation rate of 5
7.1 A partial CDFG with a data recursive edge
7.2 An example schedule of two execution instances
7.3 The CDFG representation for Expression (7.1)
7.4 An example showing no feasible scheduling
7.5 An instance in ASG transformed from an instance in PCS
7.6 An example showing exclusion of conditional sharing
7.7 A heuristic procedure for conditional resource sharing among I/O operations
7.8 Model for multiplexed I/O operations
7.9 An Example of Operation Chaining
7.10 An example of an allocation wheel

List Of Tables

4.1 Resource Constraints for the AR filter designs with unidirectional I/O ports
4.2 Summarized results for the AR filter designs with unidirectional I/O ports
4.3 Bus assignment for the AR filter design with unidirectional I/O ports and an initiation rate of 3
4.4 Bus allocation for the AR filter design with unidirectional I/O ports and an initiation rate of 3
4.5 Bus assignment for the AR filter design with unidirectional I/O ports and an initiation rate of 4
4.6 Bus allocation for the AR filter design with unidirectional I/O ports and an initiation rate of 4
4.7 Bus assignment for the AR filter design with unidirectional I/O ports and an initiation rate of 5
4.8 Bus allocation for the AR filter design with unidirectional I/O ports and an initiation rate of 5
4.9 Resource Constraints for the AR filter designs with bidirectional I/O ports
4.10 Summarized results for the AR filter designs with bidirectional I/O ports
4.11 Bus assignment for the AR filter design with bidirectional I/O ports and an initiation rate of 3
4.12 Bus assignment for the AR filter design with bidirectional I/O ports and an initiation rate of 4
4.13 Bus assignment for the AR filter design with bidirectional I/O ports and an initiation rate of 5
4.14 Resource Constraints for the elliptic filter designs with unidirectional I/O ports
4.15 Bus allocation for the elliptic filter design with unidirectional I/O ports and an initiation rate of 6
4.16 Bus allocation for the elliptic filter design with unidirectional I/O ports and an initiation rate of 7
4.17 Resource Constraints for the elliptic filter designs with bidirectional I/O ports
4.18 Bus allocation for the elliptic filter design with bidirectional I/O ports and an initiation rate of 6
4.19 Bus allocation for the elliptic filter design with bidirectional I/O ports and an initiation rate of 7
5.1 Resources required for the AR filter with variations of initiation rates and pipe lengths using the technique described in this chapter
5.2 Pipe length for the AR filter with variations of initiation rates and resource constraints using the technique described in Chapter 4
5.3 Resources required for the elliptic filter with variations of initiation rates and pipe lengths using the technique described in this chapter
5.4 Pipe length for the elliptic filter with variations of initiation rates and resource constraints using the technique described in Chapter 4
6.1 I/O operation to bus assignment with an initiation of 3
6.2 I/O operation to bus assignment with an initiation of 4
6.3 I/O operation to bus assignment with an initiation of 5
6.4 Comparisons of number of pins required and pipe length for different assumptions

Abstract

In this thesis, data-path synthesis techniques for multiple-chip pipelined systems from partitioned behavioral-level specifications are presented. Most digital designs are too large to fit into a single chip.
Designing such systems is becoming increasingly difficult because the complexity of digital systems is growing drastically, and synthesis of multiple-chip digital systems will become crucial.

The problem of synthesizing multiple-chip digital systems is complicated by several factors. First, the output and input of a value between partitions must take place in the same transfer cycle. Second, the values transferred across chips may have a variety of bit widths, which makes it difficult to utilize I/O pins efficiently. Finally, no devices are allowed on the off-chip connections, which makes traditional connection binding techniques inapplicable.

A class of partitioning, called simple partitioning, is identified, for which it is proven that there exists an interchip connection without communication conflicts for a schedule without I/O pin conflicts. For the other class of partitioning, general partitioning, the problem is divided into two subproblems, namely interchip connection synthesis and scheduling. The problem is approached in either order: interchip connections are synthesized before or after scheduling. The experimental results obtained by both approaches are compared. An Integer Linear Programming (ILP) formulation and a heuristic search technique are developed for the subproblem of interchip connection synthesis.

Chapter 1 Introduction

High-level synthesis [MPC88] is the process of transforming a behavioral description into a structural design that can implement the behavior. In general, the transformation is a one-to-many mapping. That is, there are a number of implementations that perform the given behavior. One of the tasks of synthesis is to find the structural implementation that meets a set of constraints while achieving a design goal.
Because of the complexity of high-level synthesis, the task of synthesis is usually divided into two sequential sub-tasks: synthesis of the data path, and synthesis of the control path. The core of traditional data path synthesis is further divided into three simpler tasks, namely scheduling, module allocation, and resource binding. Scheduling assigns operations to appropriate control steps (states). Module allocation determines the number and types of hardware modules required. Resource binding assigns operations and data values to specific hardware resources. Control synthesis refers to the automatic generation of a controller which provides the control signals for sequencing events in the data path.

In general, the input to data-path synthesis tools is modeled in an intermediate format, such as the Value Trace (VT) [McF78], Design Data Structure (DDS) [KP85], dataflow/control-flow (dacon) [Tri87], and Sequencing Intermediate Form (SIF) [MK88], sometimes referred to as a Control Data Flow Graph (CDFG), which can be obtained by compiling a Hardware Description Language (HDL), such as ISPS [Bar81], VHDL [VHDL88], and HardwareC [MK88]. A CDFG is a directed graph, in which the nodes represent operations to be performed, and the arcs represent their dependencies. An example of a CDFG is shown in Figure 1.1. The output from data-path synthesis tools is usually a register transfer level (RTL) design. An RTL data path consists of operators and registers interconnected via multiplexers, buses, and wires.

[Figure 1.1: An example of a CDFG]

1.1 Motivation

Most modern digital systems can hardly fit into a single chip, even though the number of components that can be fabricated on a single chip has been increasing drastically in recent years, because the complexity of digital systems has also increased significantly in the same period of time.
In a design composed of multiple chips, interchip communication is an important issue and should be considered at an early stage of the design process. However, most current high-level synthesis research [PP88, PK89, PK91, HHL91] pays no attention to the issue of communication across chips.

Pin constraints should also be taken into account during the process of high-level synthesis for multi-chip designs, as all IC chips have only a finite number of pins available for off-chip communication, although this number has increased over the years. However, many current data path synthesis tools do not consider pin constraints. In the process of scheduling, they make an implicit assumption that every input value is ready whenever it is required (stated another way, all input values are available before any functional operation can execute). In general, the assumption is not true because the number of input pins is limited and inputs cannot be assured to be available at any time. Some tools [CK86, NT86, NK90] can take pin limitations into account implicitly by the use of time constraints on inputs and outputs.

1.2 Problem Statement Overview

In our approach, partitioning of a design onto multiple chips is performed before the behavioral synthesis process. Using predictions, a behavioral partitioner, such as CHOP [KP91], partitions the behavioral specification into a number of clusters in such a way that the synthesized multi-chip design will likely be feasible with respect to the given constraints.

The intent of this research is to propose an approach to multi-chip data path synthesis given a partitioned behavioral specification. A pipelined schedule and pin allocation for a design consisting of multiple chips are to be produced, with each chip corresponding to a cluster of operations in the behavioral specification.
Also, a set of communication buses between the chips will be synthesized, and the values to be transferred will be scheduled and assigned to the pins and buses. The designs produced will satisfy user-supplied constraints, including the number of I/O pins on individual chips.

1.3 Related Work

List scheduling [Girc84, PP88, PG86] has been widely used, and attempts to minimize the execution time of the design where hardware resource constraints are specified. In the algorithm, ready operations, whose precedent operations have been scheduled, are sorted according to a heuristic priority function and scheduled in the next control step until a resource conflict occurs, at which point the control step is advanced. There are many variations of list scheduling, which differ mainly in how the heuristic priority function is defined.

Force-Directed Scheduling (FDS), developed by Paulin [PK89], tries to reduce the number of hardware resources by balancing the concurrency of the operations among control steps such that a global time constraint, expressed as the maximum number of control steps, is met. In each iteration, Distribution Graphs (DGs) for all operation types are computed, from which forces can be derived. Forces are calculated for all operations at all feasible control steps. The operation/control-step pair yielding the lowest (most negative) force, that is, the one most likely to balance the distribution, is selected and assigned.

An Integer Linear Programming (ILP) formulation has been applied to the data-path synthesis problem. Hafer and Parker [HP83] were the first to model the data-path synthesis problem with a formal method, and mapped a behavioral specification into a scheduled and allocated data path by using Mixed Integer Linear Programming (MILP). Lee et al. [LHL89], Hwang et al. [HHL90], and Gebotys et al. [GE91] have applied ILP to the scheduling problem in data-path synthesis.
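The list scheduling procedure described earlier in this section can be illustrated with a short sketch. It is a minimal illustration under stated assumptions, not the algorithm of any cited tool: every operation is assumed to take one control step, the dependency graph is assumed to be acyclic, and the names list_schedule, preds, units, and priority are hypothetical. The priority values are supplied by the caller, since the choice of priority function is what distinguishes the variants.

```python
def list_schedule(ops, preds, op_type, units, priority):
    """Resource-constrained list scheduling for single-cycle operations.

    ops      -- iterable of operation names
    preds    -- dict: operation -> set of predecessor operations
    op_type  -- dict: operation -> hardware module type
    units    -- dict: module type -> modules available per control step
    priority -- dict: operation -> number (larger is scheduled first)
    Returns a dict: operation -> control step (1-based).
    """
    if any(units.get(op_type[o], 0) < 1 for o in ops):
        raise ValueError("every operation type needs at least one module")

    schedule = {}
    remaining = set(ops)
    step = 0
    while remaining:
        step += 1
        # Ready operations: all predecessors were placed in earlier steps.
        ready = [o for o in remaining
                 if all(p in schedule for p in preds.get(o, ()))]
        if not ready:
            raise ValueError("dependency graph contains a cycle")
        # Sort by the heuristic priority (ties broken by name).
        ready.sort(key=lambda o: (-priority[o], o))
        used = {}
        placed = []
        for o in ready:
            t = op_type[o]
            if used.get(t, 0) < units[t]:
                used[t] = used.get(t, 0) + 1
                placed.append(o)
            # Otherwise a resource conflict defers o to a later step.
        for o in placed:
            schedule[o] = step
        remaining.difference_update(placed)
    return schedule
```

With two multipliers and one adder, three multiplications feeding two chained additions would be scheduled in three control steps when the priority is the longest path to the output.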
Only two research projects in the literature explicitly address high-level synthesis of multiple-chip designs. De Micheli et al. [KM90, GM90] applied an iterative approach consisting of three stages, namely resource binding, scheduling, and partitioning. In their approach, resource binding is performed before scheduling. In the binding stage, operations are bound to specific resources. There will be a large number of binding configurations. In each iteration, one of the binding configurations is selected, either by an exhaustive search or by a heuristic search with some cost criteria. Once a resource binding is selected, the operations bound to the same resource are serialized to resolve resource conflicts. A branch-and-bound approach is used to explore the alternatives. Then, a SIF model, a variation of a CDFG containing binding and serialization information, is passed to a partitioning program to be partitioned into multiple chips. Their system can only handle non-pipelined designs. Moreover, pin sharing among I/O operations is not considered, as the pin cost, which is used as one of the constraints in the process of partitioning, is computed by simply adding the costs of all I/O operations in the partition. So, the design produced by this approach will require many more I/O pins than necessary.

Gebotys [Gebo92] approached the problem of synthesizing multi-chip designs by formulating partitioning, scheduling, and allocating hardware (functional units, I/O pins, and interchip buses) as an ILP. Since she made the assumptions that every interchip bus is connected to all of the chips and every value transferred off-chip has the same bit width, the interchip communication bus structure does not need to be determined. Instead, only the numbers of input ports, output ports, and interchip buses are to be determined. These assumptions will be fine for two-chip designs.
However, they would require more I/O pins than necessary for systems which contain more than two chips. The larger the number of chips in a system, the more I/O pins are likely to be wasted under the assumptions. Also, values transferred across chips may have different bit widths. In addition, the problem size will be limited by the run time needed to solve the ILP. When the number of integer variables in an ILP reaches a few hundred, the run time will grow drastically with the number of integer variables.

1.4 Thesis Outline

Chapter 2 gives an overview of the research, and addresses the difficulties which arise during the multi-chip pipelined data-path synthesis process. In Chapter 3, a class of partitioning, called simple partitioning, is identified. The technique to synthesize multi-chip designs with a simple partitioning is presented. For systems with a general partitioning, the problem of synthesizing multi-chip designs is divided into two subproblems: (1) interchip connection synthesis, and (2) scheduling. Chapters 4 and 5 describe different techniques to solve these subproblems in different orders, interchip connection synthesis before and after scheduling, respectively. The experimental results produced by these approaches are compared. Chapter 6 presents a technique to deal with extended cases in which a communication bus can be used to transfer multiple values in a cycle by using different portions of the bus. In Chapter 7, some extensions to the research are discussed. Finally, conclusions and future research directions are summarized in Chapter 8.

Chapter 2 Research Overview

In this chapter, a brief definition of the research problem is given first. Then, some restrictive assumptions which will be made are discussed. Following this, several primary problem issues of the research are addressed. Finally, the difficulties encountered are shown with simple examples.
2.1 Problem Statement
The research problem is to synthesize synchronous multiple-chip pipelined designs from partitioned behavioral specifications, achieving a user-specified goal, while satisfying user-given constraints. The limited number of I/O pins will be taken into account. As shown in Figure 2.1, the problem inputs include
• a CDFG which has been partitioned into a number of partitions, each of which will be implemented on a separate chip,
• a set of hardware modules, with parameters of cost and delay, which will be used to implement the operations in the CDFG,
• constraints on cost and pin count for each individual chip, or a performance requirement of the whole design, and
• a specified goal of maximizing performance of the whole design, or minimizing the total cost of the design, depending on what constraints are given.

[Figure 2.1: Multiple-chip Synthesis. Inputs: partitioned control dataflow graph, constraints including pin constraints per partition, and library. Outputs: schedule of functional operations and I/O transfers, interchip connections, and assignment of I/O transfers to communication buses.]

The problem outputs include
• a schedule for the functional operations and I/O transfers,
• a set of interchip communication buses, and
• an assignment of I/O operations to communication buses.

For multiple-chip designs, it is beneficial to have functional units of all partitions operating concurrently to fully utilize multiple hardware resources. One of the effective ways to achieve this is to design such systems as pipelined systems, at least from the view of the system. Individual partitions can be implemented in either a non-pipelined or pipelined manner. In this research, only synthesis of pipelined systems is discussed since scheduling of a non-pipelined design can be regarded as a special case of scheduling of a pipelined design with initiation rate equal to or greater than the pipe length.
2.2 Problem Assumptions
In order to focus on the major problem issues and to keep the complexity of the problem to a manageable level, several restrictive assumptions are made. These include selection of a clocking scheme, determination of a Control Data Flow Graph (CDFG) representation, and consideration of I/O transfers, functional operations, and the control mechanism.
• Clocking Scheme
A global clock is used for all partitions, and to synchronize all functional operations and I/O transfers. The period of the clock will be set by the user or by prior software and input to the program. Two minor clocks, one for data manipulation and one for I/O transfers, can also be handled. In this case, both clocks must be derived from a global clock, and the initiation interval of the design must be a multiple of the period of the global clock.
• Control Data Flow Graph
A flat CDFG without internal loops is assumed. A hierarchical CDFG must be flattened beforehand. Internal loops with a fixed or maximum iteration count should be unwound. Internal loops with indeterminate iteration counts cannot be unwound, and so will not be covered in the research since they make the design of a pipelined system very difficult and, for some real-time systems, impossible. The CDFG can contain nested conditionals as well as data recursive edges.
• I/O Transfers
Interchip communication is performed synchronously. A value produced in a partition is transferred directly to the requesting partition at a predetermined time. In other words, a value is not transferred to the requesting partition through any other partition(s). I/O transfers are activated at the beginning of a clock cycle, and must be completed in a single cycle. When two minor clocks are used, I/O transfers are activated at the beginning of a minor clock cycle and must be completed in the same minor cycle.
• Functional Operations
For each operation type, there exists exactly one functional unit that can perform the operation in the module set. In other words, module selection has been done before scheduling. Each partition can have its own module set. All functional units are combinational. The delays of functional units can be greater than the period of a clock cycle. Chaining of operations and multi-cycle operations are allowed. I/O transfers and functional operations can also be chained. The module set can be specified by a behavioral-level partitioning tool [KP91].
• Control Mechanism
A distributed control mechanism is assumed. There are several disadvantages to applying a central controller approach. The bandwidth of data transfers across chips would be reduced significantly because a large number of I/O pins would have to be reserved for control signals. Also, control signals traveling off-chip would cause long delays in the control path. Both of these could result in designs with poor performance.

2.2.1 I/O Operation Model
For a system consisting of a number of partitions, a value produced in a partition may be used in other partitions. So, an I/O transfer is required when an operation in a partition requires a value produced in another partition. The criteria for an I/O transfer are
• that the output operation of the value from a partition and the input operation of the value to the other partition must take place in the same control step,[1] and
• that there is a free[2] communication path between these partitions in the control step.

In our system, a value can be output once or several times when a value produced in a partition is required by two or more partitions. For the case that a value is to be output once, the value can be output to a communication bus and all of the requesting partitions can input that value in the same control step.
This requires that all of the requesting partitions have their input ports connected to the communication bus. In this case, only one communication bus will be allocated for only one cycle. On the other hand, for the case that a value is to be output several times, the value can be output to different communication buses in the same or different control steps and the requesting partitions can input the value from different communication buses in the corresponding control steps. This has the flexibility that the requesting partitions are not required to have their input ports connected to the same communication bus. In this case, more than one communication bus will be allocated to transfer the value. Note that, for each partition, a value needs to be input only once and stored even if it is used by more than one operation in a partition. In this case, I/O pins can be saved at the expense of chip area due to an extra register used to store the input value. However, the extra register does not have a first-order effect on the chip area.

[1] Recall that we have assumed that values are transferred directly.
[2] A communication path is said to be free in a control step if it has not been allocated for data transfer in the control step.

An I/O operation is modeled as a single I/O operation node which consists of both one output operation for a partition and one input operation for another partition, since an output operation for a partition is always accompanied in the same control step by an input operation for another partition. Scheduling an I/O operation node in a certain control step means that the output and input operations it denotes will take place in that control step. For a value required by more than one partition, there are several I/O operation nodes, one for each requiring partition, since a value is allowed to be output several times if necessary.
In order to be able to identify the I/O operations transferring the same value, each I/O operation is associated with the name of the value it will transfer. A cost, called bit-width, is associated with each I/O operation to indicate the bit width of the value it will transfer. There is also a delay associated with each I/O operation. The I/O operation delay is defined as the duration from the beginning of the clock cycle where the I/O operation is activated to the time the value becomes valid at the destination. An estimated delay is assumed for all I/O operations, because the actual delays of output drivers and interchip connections are unknown a priori.

2.3 Primary Problem Issues
A number of major problem issues are to be explored in this research. These include
• pin constraints,
• interchip connections, and
• time division I/O multiplexing.
Each of the research topics will be described in the following sections.

2.3.1 Pin Constraints
Values produced in a chip and used in another chip require I/O pins for data transfers. Pins are resources, much like functional units are resources, and I/O transfers are modeled as operations allocated for the pins. However, the problem is complicated by the fact that a value is generally composed of several bits, so more than one pin will be allocated for transferring a value. In a pipelined design with an initiation rate of L, the operations which are scheduled in the same control step group[3] will execute overlapped in time, and cannot share the same hardware resources. So, an I/O pin can be allocated at most L times, once in each control step group. This can be modeled as a multiple bin packing problem, where the bin capacity represents the number of available pins and the number of bins is the initiation rate of the design.

2.3.2 Interchip Connections
There must be interconnection among chips to allow interchip communication.
The synthesis of interchip connections is quite different from and harder than the binding of connections within a chip. The complication arises from the fact that no switching devices, such as MUX's or tri-state buffers, can be used on the off-chip connections. All traditional datapath connection binding techniques [KP90a, WS89, HCLH90, PK90] can no longer be applied to the problem since all of them assume that MUX's and/or tri-state buffers can be used on connections. The difficulty of the problem will be discussed in Section 2.4.

[3] For a pipelined design with an initiation rate of L, the set of control steps k, k + L, ... will be referred to as control step group k.

2.3.3 Time Division I/O Multiplexing
A value transferred across chips may have a large bit width. A large number of pins will be required to transfer the value as a whole. Sometimes, it may be beneficial to split a value into several sub-values, each having a smaller number of bits, and to have these sub-values transferred in a number of cycles. This will be referred to as time division I/O multiplexing. Although the number of I/O pins required may be reduced with time division I/O multiplexing, larger chip area and degraded performance can result because more register control signals are required to latch the multiplexed values and/or more control steps are needed. This kind of tradeoff should be considered at an early stage in the synthesis process.

2.4 Research Difficulties
Most of the research difficulties in multi-chip synthesis are caused by interchip communications. These difficulties will be described in detail in this section. For ease of discussion, some simple examples will be used to demonstrate these difficulties.
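As a rough illustration of the pins-versus-cycles tradeoff described in Section 2.3.3, the sketch below splits a value into sub-values that fit a given number of pins and reports how many transfer cycles result. The function name `split_value` and the simple left-to-right splitting policy are assumptions made for illustration only; the text does not prescribe how a value is divided.

```python
def split_value(bit_width, pins):
    """Split a value of `bit_width` bits into sub-values of at most `pins` bits.

    Returns the list of sub-value widths; its length is the number of
    cycles needed to move the whole value over `pins` I/O pins.
    Hypothetical helper: the splitting policy is an assumption.
    """
    if bit_width <= 0 or pins <= 0:
        raise ValueError("bit_width and pins must be positive")
    widths = []
    remaining = bit_width
    while remaining > 0:
        chunk = min(pins, remaining)  # fill the available pins each cycle
        widths.append(chunk)
        remaining -= chunk
    return widths


# Fewer pins means more cycles (and more latching) for the same 16-bit value.
for pins in (16, 8, 6):
    parts = split_value(16, pins)
    print(f"{pins:2d} pins -> {len(parts)} cycle(s), sub-value widths {parts}")
```

Each extra cycle in the result corresponds to an additional control step and an additional latch enable, which is the area and performance cost the section warns about.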
Scheduling I/O operations is quite different from and harder than scheduling functional operations because no switching elements are allowed on interchip connections, while switching devices can be used on intrachip connections. The discrepancy will be illustrated by two CDFG fragments which have the same data dependence, one for functional operations and the other for I/O operations. For ease of comparison, each value is assumed to be one bit wide.

Figure 2.2(a) shows the portion of a CDFG containing only functional operations, which can be scheduled in 2 control steps if there are 1 multiplier, 1 subtractor, and 2 adders available for these operations. The numbers on the left of the figure are control steps, and the operations between two shaded lines are scheduled in that control step. The operation binding for the schedule is indicated by dotted ellipses, and the operations within an ellipse are assigned to the same functional unit. The corresponding RTL design is shown in Figure 2.2(b). Note that a multiplexor has been inserted before the input of Sub1.

[Figure 2.2: The scheduling and binding of a CDFG containing only functional operations. (a) A portion of scheduled CDFG. (b) The corresponding RTL structure.]

On the other hand, a similar CDFG with I/O operations is shown in Figure 2.3. Assume that each of the chips Pa and Pb has 1 output pin, and each of the chips Pc and Pd has 1 input pin, which is similar to the assumption made in Figure 2.2. If I/O operations OP1, IP1, OP3, and IP3 are scheduled in control step 1, as shown in Figure 2.4(a), I/O operations OP2 and IP2 are unable to be scheduled in any control step under the pin constraints given. This can be illustrated by the partial RTL design, as shown in Figure 2.4(b), corresponding to the partial schedule.
Transferring value V2 requires a connection between the output pin of partition Pa and the input pin of partition Pd, as indicated by a dotted line in Figure 2.4(b). However, the connection would cause the transfers of the values V1 and V3, which occur in the same control step, to conflict, even though partitions Pa and Pd have 1 output pin and 1 input pin, respectively, unallocated in control step 2. For the given constraints, the design requires 3 control steps, each for transferring one value. This leads to the following result:

    Scheduling an I/O operation in a control step in which there are unallocated I/O pins does not guarantee that there will be no communication conflict in the schedule. Moreover, scheduling an I/O operation in a control step, even with no communication conflict in that control step, can make achieving a completed schedule impossible.

[Figure 2.3: A portion of CDFG containing I/O operations]

Even a restricted case, a simple partitioning,[4] where scheduling an I/O operation in a control step in which there are unallocated I/O pins guarantees no communication conflict in the control step, may not lead to a schedule (as we show later). The problem is due to the requirement that the input and output operations for a value must occur in the same control step. Consider the partial CDFG shown in Figure 2.5. The assumption of one bit per value is made. Suppose that partition Pa has 2 output pins, and partitions Pb and Pc have 2 and 1 input pins, respectively. Also, an initiation rate of 2 is assumed. If the I/O operations for transferring values V1 and V2 are scheduled in control step s, we will never be able to complete the schedule. The I/O operations for transferring values V3 and V4 cannot be scheduled in any of the control steps s, s + 2, ...
because partition Pa has no unallocated output pins in those control steps, whereas one of them must be scheduled in one of those control steps because partition Pc has only one input pin. So, one of the I/O operations for transferring values V1 and V2 should be scheduled in one of control steps s, s + 2, ..., while the other should be scheduled in one of the control steps s + 1, s + 3, .... One possible schedule is that the I/O operations for transferring values V1, V2, V3, and V4 are performed in control steps s, s + 1, s + 1, and s + 2, respectively.

[4] See Definition 3.2.

[Figure 2.4: A partial schedule and interchip connection for Figure 2.2. (a) Partially scheduled CDFG. (b) Partial inter-chip connection.]

Values transferred across chips may have a variety of bit widths. How to utilize I/O pins efficiently makes the problem even harder because transferring different values may require using different numbers of I/O pins.

[Figure 2.5: A portion of simple partitioning of a CDFG, with values V1, V2, V3, and V4]

Chapter 3 Synthesis for Designs with a Simple Partitioning
In this chapter, the definition of a simple partitioning is given. For a system with a simple partitioning, the interchip communication problem can be reduced to a pin allocation problem. That is, there exists an interchip connection with no communication conflicts for a schedule without I/O pin conflicts. An integer-linear programming (ILP) formulation for this pin allocation subproblem is given in Section 3.1. Section 3.2 describes how list scheduling is adapted to schedule a system with a simple partitioning and pin constraints. Before each I/O operation is scheduled, the feasibility of the ILP is checked to determine whether a feasible pin allocation still exists. The feasibility of the ILP is determined by applying the Dual Simplex algorithm to the ILP tableau.
Section 3.3 shows how the feasibility of the ILP is determined and shows how the ILP tableau can be updated incrementally as the scheduling progresses. Last, experimental results are given in Section 3.4.

Definition 3.1 In a partitioning, partition $P_a$ is said to drive partition $P_b$ if a value required by an operation in partition $P_b$ is produced by an operation in partition $P_a$. In this case, partition $P_a$ is a driver of partition $P_b$. Partition $P_b$ is driven by partition $P_a$.

Definition 3.2 A partitioning is said to be simple if it satisfies all of the following conditions:
1. Every partition drives at most two partitions.
2. Every partition is driven by at most two partitions.
3. If a partition is driven by two partitions, its drivers drive no other partitions.
4. If a partition drives two partitions, it is the only driver of these two partitions.
An example of a simple partitioning is given in Figure 3.1.

3.1 Pin Allocation Subproblem
A pin allocation is an allocation of I/O pins for all I/O operations in some control step groups. Recall that an I/O operation is composed of an output operation of a partition and an input operation of another partition. So, two sets of I/O pins will be allocated for an I/O operation in the same control step.

Definition 3.3 Pin Allocation Problem
Given
• the number of I/O pins (used for data transfers) for each partition,
• initiation rate L (L control step groups),
• a set of I/O operations, and
• a partial pin allocation.
The pin allocation problem is to determine if a valid pin allocation still exists. That is, I/O pins can be allocated for every I/O operation in some control step group.

3.1.1 The ILP Formulation
Suppose the CDFG is divided into N partitions 1, ..., N, and that a pipelined system with an initiation rate of L is to be designed.
An additional pseudo partition, partition 0, reflects the outside world and is used to model the pin constraints of the whole system. It contains only input operations and output operations. The input operations are used for outputs from the system, and the output operations are used for inputs to the system. The numbers of input pins and output pins available for partition 0 are the output pins and input pins, respectively, of the system.

[Figure 3.1: A simple partitioning example]

The notations and variables to be used in the formulation and discussion are
• $L$ : Initiation rate of the design.
• $N$ : Number of partitions.
• $W = \{ w \mid \text{every I/O operation } w \}$.
• $W_v = \{ w \mid \text{every I/O operation } w \text{ used to transfer value } v \}$.
• $IS_i = \{ w \mid \text{every I/O operation } w \text{ used to input a value to partition } i \}$.
• $OS_j = \{ v \mid v \text{ is a value output from partition } j \}$.
• $B_w$ : Bit width of I/O operation $w$.
• $B_v$ : Bit width of value $v$.
• $I_i$ : Total number of input pins in partition $i$.
• $O_j$ : Total number of output pins in partition $j$.
• $T_i$ : Total number of pins used for data transfers in partition $i$. Pins used for power and control lines are excluded.
• $o_j$ : An integer variable specifying the number of pins used for output from partition $j$.
• $x_{w,k}$ : A binary variable denoting that I/O pins can be allocated for I/O operation $w$ in control step group $k$. $x_{w,k} = 1$ if I/O pins can be allocated for I/O operation $w$ in control step group $k$; $x_{w,k} = 0$ otherwise.

The pin allocation subproblem can be formulated as follows:

Maximize
$$0 \qquad (3.1)$$
subject to
$$\sum_{w \in IS_i} B_w x_{w,k} \le I_i, \quad \text{for } 0 \le i \le N,\ 0 \le k < L; \qquad (3.2)$$
$$\sum_{v \in OS_j} B_v \max_{w \in W_v} \{x_{w,k}\} \le O_j, \quad \text{for } 0 \le j \le N,\ 0 \le k < L; \qquad (3.3)$$
$$\sum_{k=0}^{L-1} x_{w,k} \ge 1, \quad \text{for } w \in W. \qquad (3.4)$$

It is assumed in the formulation that the pins of each partition have been divided into input pins and output pins. A trivial objective function 3.1 is used because only the feasibility of the constraints is of interest. Constraint 3.2 states that, for each partition, the total bit width of the input operations scheduled in control step group $k$ should not exceed the number of input pins available. Constraint 3.3 states that the total bit width of the output operations scheduled in control step group $k$ should not exceed the number of output pins available. The maximum of $x_{w,k}$ is used instead of the summation of $x_{w,k}$ because only one output slot is required for the I/O operations transferring the same value being scheduled in the same control step. Constraint 3.4 ensures that I/O pins can be allocated for every I/O operation in a certain control step group. The greater-or-equal-to rather than equal-to is used because I/O pins can be allocated for an I/O operation in several control step groups, and the initial ILP tableau[1] can be made dual feasible.[2]

[1] A matrix form composed of the objective function and constraints. The inequalities are converted into equality form by introducing slack or surplus variables.
[2] A tableau is said to be dual feasible if all elements in the reduced cost row are nonnegative.

To get linear constraints, the maximum function in Constraint 3.3 is eliminated by replacing $\max_{w \in W_v}\{x_{w,k}\}$ by $y_{v,k}$, a new binary variable, and introducing a new constraint which captures the relation between $y_{v,k}$ and $x_{w,k}$. Then, Constraint 3.3 can be replaced with the following equations:
$$\sum_{v \in OS_j} B_v y_{v,k} \le O_j, \quad \text{for } 0 \le j \le N,\ 0 \le k < L; \qquad (3.5)$$
and
$$\sum_{w \in W_v} x_{w,k} - |W_v|\, y_{v,k} \le 0, \quad \text{for } |W_v| > 1,\ 0 \le k < L; \qquad (3.6)$$
where $|W_v|$ denotes the cardinality of set $W_v$. The maximum of $x_{w,k}$ can be expressed as Constraint 3.6, because $x_{w,k}$ and $y_{v,k}$ are binary variables. Note that $\max_{w \in W_v}\{x_{w,k}\}$ needs to be replaced by $y_{v,k}$ only if $|W_v| > 1$.
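To make the feasibility question concrete, the following sketch checks Constraints 3.2 to 3.4 by plain enumeration on the small example of Figure 2.5. It is not the method used in the thesis (which relies on a cutting plane algorithm); the function names, the data layout, and the assumption that V1 and V2 go from Pa to Pb while V3 and V4 go from Pa to Pc are illustrative choices read off the discussion in Section 2.4. Each operation is placed in exactly one control step group, which loses no generality for feasibility because dropping extra groups never increases pin usage.

```python
from itertools import product


def respects_pins(ops, assign, in_pins, out_pins, L):
    """Check Constraints 3.2 and 3.3 for a complete assignment op -> group."""
    for k in range(L):
        used_in = {}
        out_values = {}  # (source, value) -> width, counted once per group (the max term)
        for name, (value, src, dst, width) in ops.items():
            if assign[name] != k:
                continue
            used_in[dst] = used_in.get(dst, 0) + width
            out_values[(src, value)] = width
        used_out = {}
        for (src, _value), width in out_values.items():
            used_out[src] = used_out.get(src, 0) + width
        if any(n > in_pins.get(p, 0) for p, n in used_in.items()):
            return False
        if any(n > out_pins.get(p, 0) for p, n in used_out.items()):
            return False
    return True


def feasible(ops, in_pins, out_pins, L, fixed=None):
    """Return one valid assignment op -> group, or None if none exists.

    `fixed` is a partial pin allocation (op -> group already committed).
    """
    fixed = fixed or {}
    names = sorted(ops)
    choices = [[fixed[n]] if n in fixed else range(L) for n in names]
    for groups in product(*choices):
        assign = dict(zip(names, groups))
        if respects_pins(ops, assign, in_pins, out_pins, L):
            return assign
    return None


# Figure 2.5 (destinations are an assumption consistent with Section 2.4).
ops = {
    "w1": ("V1", "Pa", "Pb", 1),
    "w2": ("V2", "Pa", "Pb", 1),
    "w3": ("V3", "Pa", "Pc", 1),
    "w4": ("V4", "Pa", "Pc", 1),
}
in_pins = {"Pb": 2, "Pc": 1}
out_pins = {"Pa": 2}

print(feasible(ops, in_pins, out_pins, L=2))
print(feasible(ops, in_pins, out_pins, L=2, fixed={"w1": 0, "w2": 0}))
```

The second call fixes both V1 and V2 in the same control step group and finds no completion, which matches the observation in Section 2.4 that such a partial schedule can never be finished.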
If the partition of total pins into input pins and output pins is not specified, that is, constants $I_i$ and $O_j$ are not known in advance, introducing new integer variables $o_j$, substituting $I_i = T_i - o_i$ and $O_j = o_j$ into Constraints 3.2 and 3.3, and moving terms $o_j$ to the left hand side yields
$$\sum_{w \in IS_i} B_w x_{w,k} + o_i \le T_i, \quad \text{for } 0 \le i \le N,\ 0 \le k < L; \qquad (3.7)$$
$$\sum_{v \in OS_j} B_v \max_{w \in W_v} \{x_{w,k}\} - o_j \le 0, \quad \text{for } 0 \le j \le N,\ 0 \le k < L. \qquad (3.8)$$

Theorem 3.1 For a simple partitioning, there exists an interchip connection configuration without communication conflict for a schedule satisfying Constraints 3.7, 3.8, and 3.4.

Proof: Since the connections at the input and output ends of a partition can be constructed independently, the communication among partitions in a simple partitioning must follow one of the two forms shown in Figure 3.2. Note that the connection between just two partitions is a special case of either one shown in the figure.

[Figure 3.2: Possible communications in a simple partitioning, cases (a) and (b)]

First, consider the case shown in Figure 3.2(a). Let
$$X_k = \{\text{Components of } w \mid w \in IS_a \wedge x_{w,k} = 1\},$$
$$Y_k = \{\text{Components of } w \mid w \in IS_b \wedge x_{w,k} = 1\},$$
$$Z_k = \{\text{Components of } v \mid v \in OS_f \wedge |W_v| = 2 \wedge (\forall w \in W_v,\ x_{w,k} = 1)\},$$
$$a_k = |X_k|, \quad b_k = |Y_k|, \quad c_k = |Z_k|,$$
$$M_a = \max_k \{a_k\}, \quad M_b = \max_k \{b_k\},$$
$$O_i = o_i, \quad \text{and} \quad I_i = T_i - o_i.$$
$X_k$ and $Y_k$ are sets of bits of values which are to be input to partitions $P_a$ and $P_b$, respectively, at control step group $k$. $Z_k$ is a set of bits of values which are to be output from partition $P_f$ to both $P_a$ and $P_b$ at the same control step group $k$. $a_k$ and $b_k$ are the numbers of bits transferred from partition $P_f$ to $P_a$ and $P_b$, respectively, at control step group $k$. $c_k$ is the number of bits transferred from partition $P_f$ to both $P_a$ and $P_b$ at the same control step group $k$. $M_a$ and $M_b$ are the maximum numbers of bits transferred from partition $P_f$ to $P_a$ and $P_b$, respectively, at any control step.
Since $\{x_{w,k}\}$ is a feasible solution of Constraints 3.7, 3.8, and 3.4, we have
$$a_k \le M_a \le I_a, \qquad (3.9)$$
$$b_k \le M_b \le I_b, \qquad (3.10)$$
$$a_k + b_k - c_k \le O_f. \qquad (3.11)$$
The above relations state that the numbers of bits of values input and output at any control step group $k$ must not be greater than the numbers of input and output pins, respectively. The minus term in the last inequality reflects the fact that every value output to both partitions during the same control step can share output pins.

Case (i): If $M_a + M_b \le O_f$, that is, the number of output pins of partition $P_f$ is greater than or equal to the sum of the maximum numbers of bits transferred from partition $P_f$ to $P_a$ and $P_b$ at any control step, we can construct the interchip connection for these partitions, as shown in Figure 3.3(a), with $N_a = M_a$, $N_b = M_b$, and $N_c = 0$, where $N_a$, $N_b$, and $N_c$ are the numbers of links in connections A, B, and C, respectively. Then, all values from $P_f$ to partitions $P_a$ and $P_b$ can be transferred through connections A and B, respectively. So, no communication conflict results.

Case (ii): If $M_a + M_b > O_f$, that is, the sum of the maximum numbers of bits transferred from partition $P_f$ to $P_a$ and $P_b$ at any control step is greater than the number of output pins available in partition $P_f$, the interchip connection shown in Figure 3.3(a) with $N_c = M_a + M_b - O_f$, $N_a = I_a - N_c$, and $N_b = I_b - N_c$ can also be used.

When $c_k \le N_c$, we can make all bits in $Z_k$ transferred through connection C. After $c_k$ links of connection C are allocated to $Z_k$, $N_c - c_k$ links of connection C remain unused and can be used to transfer bits in $X_k$ and/or $Y_k$. Only in the case of $a_k - c_k > N_a$ and $b_k - c_k > N_b$ is it necessary to examine communication conflict. In this case, $N_a$ values in $X_k$ and $N_b$ values in $Y_k$ can be transferred through connections A and B, respectively.

[Figure 3.3: Possible interchip connection in a design of simple partitioning, cases (a) and (b)]
The unallocated portion of connection C is used to transfer the rest. No communication conflict occurs if the number of bits transferred does not exceed the number of links available. That is,
$$(a_k - c_k - N_a) + (b_k - c_k - N_b) \le (N_c - c_k).$$
Indeed,
$$(a_k - c_k - N_a) + (b_k - c_k - N_b) - (N_c - c_k)$$
$$= a_k + b_k - c_k - (N_a + N_b + N_c)$$
$$= a_k + b_k - c_k - (I_a + I_b - M_a - M_b + O_f)$$
$$= (a_k + b_k - c_k - O_f) + (M_a - I_a) + (M_b - I_b) \le 0.$$
Since each term of the above expression is less than or equal to zero according to Constraints 3.9, 3.10, and 3.11, the entire expression is less than or equal to zero. So, no communication conflict will happen.

When $c_k > N_c$, $N_c$ bits of $Z_k$ are transferred through connection C, and the rest are transferred through connections A and B. It is required to check whether there are enough links in connections A and B. That is,
$$(a_k - c_k) + (c_k - N_c) \le N_a, \quad \text{and} \quad (b_k - c_k) + (c_k - N_c) \le N_b.$$
Indeed,
$$(a_k - c_k) + (c_k - N_c) - N_a = a_k - I_a \le 0,$$
$$(b_k - c_k) + (c_k - N_c) - N_b = b_k - I_b \le 0.$$
So, no communication conflict will happen.

Now, consider the case shown in Figure 3.2(b). Let
$$X_k = \{\text{Components of } w \mid w \in OS_a \wedge x_{w,k} = 1\},$$
$$Y_k = \{\text{Components of } w \mid w \in OS_b \wedge x_{w,k} = 1\}, \quad \text{and} \quad Z_k = \emptyset.$$
It can be proven in a similar way that the connection shown in Figure 3.3(b) can be used for communication without conflict. □

Note that the above formulation is only required for designs with an initiation rate greater than 1. For designs with an initiation rate of 1, the constraints for the pin allocation subproblem are reduced to
$$\sum_{w \in IS_i} B_w + \sum_{v \in OS_i} B_v \le T_i, \quad \text{for } 0 \le i \le N.$$
The feasibility of pin allocation can be guaranteed if the summation of the bit widths of all input and output values does not exceed the total number of pins available for data transfers for each partition.

3.1.2 The Size of the ILP Tableau
In the above ILP formulation, there are $|W| \cdot L$ variables $x_{w,k}$, and $N + 1$ variables $o_j$.
The numbers of constraints in Constraints 3.7, 3.8, and 3.4 are $(N+1) \cdot L$, $(N+1) \cdot L$, and $|W|$, respectively. For the best case, in which there is no value going to more than one partition, the numbers of variables and constraints are $|W| \cdot L + (N+1)$ and $2(N+1) \cdot L + |W|$, respectively. For the worst case, in which all values go to exactly two partitions, there are an extra $|W| \cdot L/2$ variables $y_{v,k}$ and $|W| \cdot L/2$ constraints in Constraint 3.6. Summing up results in $|W| \cdot 3L/2 + (N+1)$ variables and $2(N+1) \cdot L + |W|(1 + L/2)$ constraints.

However, the size of the ILP formulation can be reduced in the following way. Let $w_1, w_2, \ldots, w_q$ be single-fanout I/O operations transferring values that have the same bit width from partition $i$ to $j$. Define $x_{w_1,k} + x_{w_2,k} + \cdots + x_{w_q,k}$ as $x_{w,k}$, and substitute into Constraints 3.7 and 3.8. In addition, the inequalities in Constraint 3.4 which correspond to I/O operations $w_1, w_2, \ldots, w_q$ are replaced by
$$\sum_{k=0}^{L-1} x_{w,k} \ge q.$$
In practice, most of the values have the same bit width. So, the tableau size can be reduced quite a lot.

3.2 Scheduling under Resource Constraints
In the scheduling approach described in this section, all partitions of the system are processed at the same time, since there is strong interdependence among partitions. Scheduling an output or input operation of a partition will also restrict the schedule of other input or output operation(s), and so the allocation of input or output pins of other partition(s). In addition, scheduling a partition can affect the quality of the schedule of other partitions, and so the whole system.

List scheduling is used in our prototype program to schedule the system under resource constraints, in terms of the numbers of modules for each type. I/O
I/O 29 operations, each of which requires an input operation of a partition and an out pu t operation of another partition to occur in the same control step, can cause a problem when scheduling a design consisting of m ultiple partitions, each having pin constraints. Scheduling can be impossible to com plete if I/O operation nodes are scheduled inappropriately, as discussed in Section 2.4. A solution to avoid these situations is to determ ine w hether there still exists a valid pin allocation if I/O pins are allocated for the I/O operation in a certain control step before each I/O operation is actually scheduled. If allocating I/O pins for (or scheduling) an I/O operation in a certain control step results in no feasible pin allocation, the I/O operation will be postponed to a later control step. The procedure for scheduling under resource constraints is shown in Figure 3.4. A pin allocation feasibility checker has been incorporated into the scheduling procedure to determ ine the feasibility of pin allocation before scheduling an I/O operation in a control step, as shown by the bold boxes in the flow chart. 3.3 Determining Feasibility of an ILP C ertain ILPs, such as max-flow and weighted bip artite m atching problem s, can be solved as linear program s with the Simplex algorithm . In such problem s, the optim al solution to the corresponding LP is guaranteed to be integer. For m ost ILPs, however, the solution obtained usually does not satisfy the integral constraint im posed by ILP. U nfortunately, the problem we have here does not exhibit th e in tegral property. General algorithm s for ILP fall into two categories [PS82]: cutting plane algorithms, which are derived from the simplex algorithm , and enumera- tive algorithms, which are based on enum erating, either explicitly or im plicitly, all possible solutions. 
The main idea of the cutting plane algorithm is to add inequalities, or "cuts", that do not exclude integer feasible points to the corresponding LP until the solution to the LP is integer.

[Figure 3.4 is a flow chart. Decision boxes: any unscheduled nodes? any feasible pin allocation? any resource unallocated in current control step? next ready node? I/O operation? scheduling the node in the current control step will make pin allocation infeasible? Action boxes: advance control step; schedule the node in current control step. Exits: Done, Fail.]

Figure 3.4: A list scheduling algorithm with pin allocation feasibility checker

The cutting plane algorithm with Dual All-Integer Cuts was developed by Ralph Gomory [Gomo60] in 1960. In the algorithm, the initial tableau is assumed to be all-integer and dual feasible. At each iteration of the dual simplex algorithm, an all-integer inequality is generated and used as the pivot row, which ensures a pivot of $-1$. So, the tableau remains all-integer after pivoting.

After the initial tableau is constructed, the Dual All-Integer Cutting Plane algorithm is called. Infeasibility of the initial formulation means that there is no feasible schedule under the pin constraints.

This technique can be applied to our problem. Every time before an I/O operation node $w$ is scheduled in a control step $s$, the constraint $x_{w,k} \ge 1$, where $k = s \bmod L$, is added to the ILP and the cutting plane algorithm is called to check whether it is feasible. The initial tableau of the new ILP is obtained by updating the result tableau of the previous step. That is, the solution obtained in the previous step is used as the starting point for the current step. The cutting plane algorithm usually terminates in a few iterations since the feasible solution space defined by the constraints is modified only slightly.

The initial tableau of the updated ILP can be obtained in the following way. Let

\[ \begin{pmatrix} A_1 & \cdots & A_j & \cdots & A_m \end{pmatrix} \begin{pmatrix} z_1 \\ \vdots \\ z_j \\ \vdots \\ z_m \end{pmatrix} = b \qquad (3.12) \]

where $A_j$ and $b$ are column vectors and $z_j$ is a single variable,³ be the result tableau of the previous step. Let $z_j$ and $z'_j$ denote $x_{w,k}$ and $x'_{w,k}$, respectively. The constraint $x_{w,k} \ge 1$ can be added into the tableau simply by letting $x'_{w,k} = x_{w,k} - 1 \ge 0$. Substituting $x_{w,k} = x'_{w,k} + 1$ into Equation 3.12 and rearranging it yields

\[ \begin{pmatrix} A_1 & \cdots & A_j & \cdots & A_m \end{pmatrix} \begin{pmatrix} z_1 \\ \vdots \\ z'_j \\ \vdots \\ z_m \end{pmatrix} = b - A_j \qquad (3.13) \]

The modified tableau with the additional constraint can therefore be obtained simply by subtracting the $j$-th column, $A_j$, of the original matrix from the constant vector, $b$.

³ $z_j$ rather than $x_{w,k}$ is used since $z_j$ can be $x_{w,k}$ or $x_{w,k} - 1$.

3.4 Experimental Results

The AR Lattice Filter [Kun84] is used as a test example. It is to be implemented on 4 chips. Two chips have 48 pins used for data transfer, while the other two have 32 pins. A simple partitioning of the AR filter into 4 blocks is shown in Figure 3.5, in which I/O operation nodes, indicated as shaded nodes, have been inserted on the arcs across partition boundaries. All I/O operations are 8 bits wide. In addition, the following assumptions are made:

• The stage time is 250 ns.
• Inputs arrive every 2 cycles.
• I/O operations take 10 ns.
• Additions are performed by adders with 30 ns delay.
• Multiplications are performed by multipliers with 210 ns delay.
• A minimum number of functional units is used.
• Chained operations are allowed.

Figure 3.5: A simple-partition AR filter

The minimum functional operators required for partitions $P_1$, $P_2$, $P_3$, and $P_4$ are (2+, 2*), (2+, 2*), (1+, 2*), and (1+, 2*), respectively, under the given initiation rate. With these resource constraints, our program scheduled the partitioned AR filter as shown in Figure 3.6 by using list scheduling incorporated with the safety check mechanism. Scheduling takes 0.5 seconds on a Sun 3/280.
Each of the partitions $P_1$ and $P_2$ has 10 input operations and 2 output operations, and each of the partitions $P_3$ and $P_4$ has 6 input operations and 2 output operations. With the initiation rate of 2, the 48 pins have to be grouped into 6 bundles of 8 bits, 5 for input and 1 for output, and the 32 pins have to be grouped into 4 bundles of 8 bits, 3 for input and 1 for output.

Note that I/O operations I1, I2, I3, and I4 were assigned in control step 1 even though there was still one group of input pins unallocated after I/O operations I5, I6, I7, and I8 were assigned in control step 0. The safety checker in the program predicted that the schedule could not be completed if any of I/O operations I1, I2, I3, and I4 was assigned to control step 0 after I/O operations I5, I6, I7, and I8 were scheduled. One of I/O operations X5 and X6 must be assigned in control step group 0, and the other in control step group 1, because partition $P_4$ has only one group of output pins. At least one group of pins should therefore be reserved for I/O operations X5 and X6. I/O operations Ij and Ik of partition $P_3$, and In and Io of partition $P_4$, were postponed to control step 1 for the same reason.

The interchip connection for the schedule, which is obtained by following the procedure in the proof of Theorem 3.1, is shown in Figure 3.7.

Figure 3.6: Scheduling of the simple-partition AR filter

Figure 3.7: Interchip connection for the partitioned AR filter in Figure 3.5

Chapter 4

Synthesis for Designs with a General Partitioning

In Section 2.4, we discussed the fact that assigning an I/O operation in a control step in which there are unallocated input and output pins in the corresponding partitions does not guarantee that communication conflicts will not occur, because no switching devices are allowed outside partitions.
For designs with simple partitioning, it is guaranteed, as discussed in Chapter 3, that there will be no communication conflict if an I/O operation is assigned in a control step in which there are unallocated input and output pins in the corresponding partitions. The technique discussed in the previous chapter can be used to predict the feasibility of a schedule with simple partitioning before an I/O operation is scheduled. For general partitioning, however, we have to develop new methods.

In this chapter, a technique to produce designs which have a general partition structure is described. Here, the problem has been simplified so that every communication bus can be used to transfer at most one value in one control step. That is, values with smaller bit widths are not grouped and transferred together on a communication bus with a larger bit width. Also, values have to be transferred as a whole. In other words, a communication bus does not carry a portion of a value.

Figure 4.1: Interchip connection model

Due to its complexity, the problem will be approached in two phases:

1. determining the interchip connections before scheduling, and
2. scheduling the design given the interchip connections.

4.1 Determining Interchip Connections

The interchip connection model has the bus structure shown in Figure 4.1. In the figure, $p_{i,h}$ ($q_{i,h}$) is the width of the output (input) port of partition $P_i$ connected to bus $C_h$. In our model, each bus can be connected to more than one partition at its input port, and to more than one partition at its output port. A partition can be connected to more than one bus. However, an input or output port can be connected to one and only one bus.
The width of the input or output port of a partition connected to a bus is not necessarily the same as the width of the input or output port of other partitions connected to the bus. For example, the interchip connection shown in Figure 4.2 can be used to transfer values of at most 16 bits from partition $P_a$ to partition $P_c$, and values of at most 12 bits from partition $P_b$ to partition $P_c$. Only 12 instead of 16 output pins of partition $P_b$ are connected to the bus if only values with a bit width of at most 12 are to be transferred from partition $P_b$ to partition $P_c$ through the bus.

Figure 4.2: An interchip connection example

4.1.1 The ILP Formulation

The communication among partitions can be denoted by a directed multigraph¹ $G = (P, W)$, where $P$ is a set of nodes, $P = \{P_0, P_1, \ldots, P_N\}$, and $W$ is a set of edges, $W = \{w_1, w_2, \ldots, w_q\}$. Each node represents a partition, and each edge $w = (P_i, P_j)$ represents an I/O transfer (operation) from partition $P_i$ to partition $P_j$.

¹ In a multigraph, there can be several edges between a pair of nodes.

The following notation will be used in the discussion.

• $L$ : Initiation rate of the design.
• $N$ : Number of partitions.
• $R$ : Maximum number of communication buses.
• $B_w$ : Bit width of I/O operation $w$.
• $T_i$ : Total number of I/O pins used for data transfers (pins used for power and control lines excluded) in partition $P_i$.
• $C = \{C_1, C_2, \ldots, C_R\}$ : A set of communication buses. A communication bus is connected to one or more partitions at their output ports and one or more partitions at their input ports. A communication bus is capable of performing I/O transfer $w = (P_i, P_j)$ if it is connected to partition $P_i$ at its output port and partition $P_j$ at its input port, and the widths of both of these ports are at least $B_w$ lines.
• $OW_i = \{ w \mid w = (P_i, P) \wedge P \text{ is any partition} \}$.
• $IW_i = \{ w \mid w = (P, P_i) \wedge P \text{ is any partition} \}$.
• $W_v = \{ w \mid w \text{ is an I/O operation used to transfer value } v \}$.
• $O_i$ : An integer variable specifying the number of pins to be used for output from partition $P_i$.
• $y_{w,h}$ : A binary variable denoting whether I/O operation $w$ is assigned to communication bus $C_h$. $y_{w,h} = 1$ if I/O operation $w$ is assigned to communication bus $C_h$; $y_{w,h} = 0$ otherwise.
• $p_{i,h}$ : An integer variable denoting the width of the output port of partition $P_i$ connected to communication bus $C_h$. $p_{i,h} = 0$ if $C_h$ is not connected to the output of $P_i$.
• $q_{i,h}$ : An integer variable denoting the width of the input port of partition $P_i$ connected to communication bus $C_h$. $q_{i,h} = 0$ if $C_h$ is not connected to the input of $P_i$.

The problem of determining the interchip connection can be formulated as an objective function subject to a set of constraints. The assignment constraint

\[ \sum_{h=1}^{R} y_{w,h} = 1, \quad \text{for } w \in W; \qquad (4.1) \]

states that every I/O operation must be assigned to one and only one communication bus. The data transfer constraints

\[ p_{i,h} \ge \max_{w \in OW_i} B_w \, y_{w,h}, \quad \text{for } 0 \le i \le N,\ 1 \le h \le R; \qquad (4.2) \]

\[ q_{i,h} \ge \max_{w \in IW_i} B_w \, y_{w,h}, \quad \text{for } 0 \le i \le N,\ 1 \le h \le R; \qquad (4.3) \]

specify that the width of the output (input) port of partition $P_i$ connected to communication bus $C_h$ must be greater than or equal to the bit widths of the values output from (input to) partition $P_i$ via the bus. The resource constraints

\[ \sum_{h=1}^{R} p_{i,h} + \sum_{h=1}^{R} q_{i,h} \le T_i, \quad \text{for } 0 \le i \le N \qquad (4.4) \]

ensure that the total number of input and output pins of a partition connected to buses does not exceed the total number of pins available.

For each communication bus, only one value can be transferred in a control step because communication buses are connected components. So, for a pipelined design with an initiation rate of $L$, each communication bus can be used to transfer at most $L$ values. This is specified by the capacity constraints

\[ \sum_{v} \max_{w \in W_v} y_{w,h} \le L, \quad \text{for } 1 \le h \le R. \]
(4.5)

The heuristic objective function

\[ \text{Maximize } \sum_{h=1}^{R} \max_{w \in W} y_{w,h} \qquad (4.6) \]

tries to maximize the number of buses actually used, since a higher I/O bandwidth is more likely to result.

The maximum function of binary variables in Constraint 4.5 can be eliminated by introducing a new binary variable as discussed in Section 3.1. The maximum function of integer variables in Constraints 4.2 and 4.3 can be eliminated in the following way. The inequality

\[ p_{i,h} \ge \max_{w \in OW_i} B_w \, y_{w,h} \]

can be replaced by

\[ p_{i,h} \ge B_w \, y_{w,h}, \quad \text{for } w \in OW_i. \]

In the above formulation, there are approximately $(|W| + 2N + 2)R$ variables and $(2R + 1)|W| + (N + 1) + R$ constraints. The variables are composed of $|W| \cdot R$ variables $y_{w,h}$, $(N + 1) \cdot R$ variables $p_{i,h}$, and $(N + 1) \cdot R$ variables $q_{i,h}$. Constraints 4.1, 4.2, 4.3, 4.4, and 4.5 have $|W|$, $|W| \cdot R$, $|W| \cdot R$, $N + 1$, and $R$ constraints, respectively. Linearizing $\max_{w \in W_v} y_{w,h}$ in Constraint 4.5 will introduce extra binary variables and constraints.

The above formulation requires a maximum number of communication buses, $R$, to be specified. A naive way of determining the upper bound of $R$ is to count the total number of I/O operations. However, a tighter upper bound can be obtained. The concept is based on the observation that every communication bus must be connected to at least one input port and one output port, and no input port or output port can be connected to more than one communication bus. For each chip, the maximum number of input (output) ports is estimated according to the number of I/O pins available for input (output), and the number and bit widths of input (output) operations. Let

• $B_1 < B_2 < \cdots < B_t$ : An increasing sequence of bit widths.
• $T_i$ : Total number of I/O pins used for data transfers in partition $P_i$.
• $n^I_{i,k}$ : Number of input operations in $P_i$ whose bit width is equal to $B_k$.
• $Iub_{i,k}$ : Upper bound on the number of input ports of $P_i$ having a width of $B_k$.
• $Ilb_{i,k}$ : Lower bound on the number of input ports of $P_i$ having a width of $B_k$, assuming that a minimum number of input ports of $P_i$ are used for input operations with widths greater than $B_k$.
• $IP_{i,k}$ : Upper bound on the number of input pins left unallocated in $P_i$ after pin allocation for input values with bit widths greater than $B_k$.
• $IS_{i,k}$ : Number of input slots² in partition $P_i$ available for input values with bit widths less than or equal to $B_k$ when a minimum number of input pins are used for input values with bit widths greater than $B_k$. ($IS_{i,t} = 0$.)
• $IPl_i$ : Lower bound on the number of input pins of $P_i$ required for all input operations in $P_i$.
• $OPl_i$ : Lower bound on the number of output pins of $P_i$ required for all output operations in $P_i$.

Before the upper bounds on the numbers of input ports and output ports are computed, the lower bounds on the numbers of input pins and output pins are determined, because the values produced while computing these lower bounds are required to compute the upper bounds on the numbers of input ports and output ports.

The lower bound on the number of input ports of $P_i$ with a width of $B_k$, assuming that a minimum number of input ports of $P_i$ are used for input operations with widths greater than $B_k$, $Ilb_{i,k}$, can be given as

\[ Ilb_{i,k} = \left\lceil \frac{n^I_{i,k} - IS_{i,k}}{L} \right\rceil. \]

The unallocated input slots left over by not fully utilizing input ports with bit widths greater than $B_k$ can be used to input values with $B_k$ or fewer bits. This is reflected by the minus term in the numerator.

² For an initiation rate of $L$, an input port can be used to input $L$ values, one value per control step. We say there are $L$ input slots for each input port.
The number of input slots in partition $P_i$ available for input values with bit widths less than or equal to $B_{k-1}$, when a minimum number of input pins is used for input values with bit widths greater than $B_{k-1}$, can be obtained by adding the number of additional slots provided by the required input ports of width $B_k$ to the number of slots already available for input values with bit widths less than or equal to $B_k$, and subtracting those actually used for input operations with a width of $B_k$. That is,

\[ IS_{i,k-1} = IS_{i,k} + Ilb_{i,k} \cdot L - n^I_{i,k}. \]

So, a lower bound on the number of input pins of $P_i$, $IPl_i$, can be computed by

\[ IPl_i = \sum_{k=1}^{t} Ilb_{i,k} \cdot B_k. \]

A lower bound on the number of output pins of $P_i$, $OPl_i$, can be computed in a similar way.

For each partition $P_i$, the maximum number of I/O pins available for input can be obtained by subtracting the minimum number of I/O pins required for output from the total number of I/O pins available for data transfers. That is,

\[ IP_{i,t} = T_i - OPl_i. \]

The upper bound on the number of input ports of $P_i$ having a width of $B_k$, $Iub_{i,k}$, is given as

\[ Iub_{i,k} = \min\left( \left\lfloor \frac{IP_{i,k}}{B_k} \right\rfloor,\ n^I_{i,k} \right). \]

The first term gives the maximum number of input ports having a width of $B_k$ which can be formed from the remaining input pins. However, the upper bound should not exceed the number of input operations with $B_k$ bits, $n^I_{i,k}$.

The maximum number of input pins available for input values with bit widths less than or equal to $B_{k-1}$ can be obtained by subtracting the minimum number of input pins allocated for input values with $B_k$ bits from the maximum number of those available for input values with bit widths less than or equal to $B_k$. That is,

\[ IP_{i,k-1} = IP_{i,k} - Ilb_{i,k} \cdot B_k. \]

So, an upper bound on the number of input ports of $P_i$, $Iub_i$, can be computed by summing up the upper bounds on the numbers of input ports with different bit widths.
That is,

\[ Iub_i = \sum_{k=1}^{t} Iub_{i,k}. \]

An upper bound on the number of output ports of $P_i$, $Oub_i$, can be computed in a similar way.

So, an upper bound on the number of communication buses, $R$, which is tighter than that obtained by counting the total number of I/O operations, can be given by

\[ R = \min\left( \sum_{i=0}^{N} Iub_i,\ \sum_{i=0}^{N} Oub_i \right). \]

The reason for taking the minimum is that every communication bus must be connected to at least one input port and one output port.

4.1.2 A Heuristic Search Procedure

The above formulation can be submitted to an ILP package to get a solution. We have implemented a program to generate the ILP formulation automatically, and have submitted the formulation to two ILP packages, Bozo [HH90] and LINDO [LIN87]. However, for problems of practical size, the formulation is too large to obtain a solution within a reasonable time. Still, the ILP formulation is useful for verification of synthesized results. So, a heuristic search technique has been developed, as shown in Figure 4.3.

At the beginning, the I/O operations are sorted. An I/O operation with a higher bit width requires more I/O pins, and is assigned before the others. It is believed that the search tree is then more likely to be pruned at an earlier stage. At each
At each 46 1: d e te r m in e J n terch ip _ co n n ectio n 2: build a sorted list of I/O operations L in descending order of bit w idth 3: C = assign_nodes ( L ) 4: a ssign _n od es ( L ) 5: if ( L = ( j > ) th e n return < / > 6 : w = first elem ent of L 7: L’ = L - w 8 : select a small num ber of com m unication buses for w having best gain 9: for ( each com m unication bus Ch ) do 10: if ( there still are I/O pins available to assign w to Ch ) th e n 11: assign w to Ch tentatively and update connection table 1 2 : C ’ = assign_nodes ( L’ ) 13: if ( C ’ 7 ^ FAIL ) th e n return Ch + C’ 14: restore connection table 15: e n d if 16: en d for 17: retu rn FAIL Figure 4.3: A heuristic search procedure to find an interchip connection structure 47 level of the search tree, only a small num ber of com m unication buses having the best gain will be explored. For the sake of run tim e efficiency, the com m unication buses to be explored do not have the same topology. Two com m unication buses are said to have same topology if they are connected to same partitions even though they can have different bit widths. The num ber of branches to be explored at each node is controlled by an branching factor, which is given by users to trade off the execution tim e and th e chance of finding a solution. The search tim e is still exponential in term of the num ber of I/O operations in the worst case if the branching factor is greater th an 1 . The gain, g, of assigning I/O operation w = ( p , Pj) to com m unication bus Ch is composed of three factors gx, g2 , and g3 shown as follows: g — 10000 * gi + 100 * # 2 + # 3 . T he above weighting constants are used to order factors gx, <72 and g$ according to their priorities. T he values of 10000, 100, and 1 are chosen arbitrarily. 
Factor $g_1$ is used to make the I/O operation try to utilize an already existing communication path in order to save pin resources, and is given as

\[ g_1 = \begin{cases} 0 & \text{if } p_{i,h} = 0 \wedge q_{j,h} = 0 \\ wf_i & \text{if } p_{i,h} \ne 0 \wedge q_{j,h} = 0 \\ wf_j & \text{if } p_{i,h} = 0 \wedge q_{j,h} \ne 0 \\ wf_i + wf_j & \text{otherwise} \end{cases} \qquad (4.7) \]

where

\[ wf_i = \frac{\text{number of bits of unassigned I/O operations in } P_i}{\text{number of unallocated pins in } P_i}. \]

Priority is given to the partitions with tight pin resources. This is reflected by the weighting factor $wf_i$.

The factor $g_2$ given below is used to force the I/O operations transferring the same value to be assigned to a common communication bus, so that no extra communication slot is consumed.

\[ g_2 = \begin{cases} 1 & \text{if } w \text{ and some } w' \text{ assigned to } C_h \text{ transfer the same value} \\ 0 & \text{otherwise} \end{cases} \]

Factor $g_3$ tries to balance the utilization of communication buses, and is given as follows:

\[ g_3 = \text{number of unassigned slots in } C_h. \]

4.2 Scheduling with Given Interchip Connections

After the interchip connection has been determined, data path and I/O scheduling using this interconnection can be performed. A list scheduling technique is used. Before each I/O operation $w$ is scheduled in control step $s$, the availability of a communication bus in that control step for this I/O operation is checked. Recall that every I/O operation is also assigned to a communication bus in the procedure of determining the interchip connection. A direct way of checking the availability of communication buses for the I/O operation is to check whether the communication bus previously assigned to the I/O operation is free in the control step. However, the resulting schedule using static assignment can be too pessimistic. For the simple example shown in Figure 4.4(a), the interchip connection can be constructed as shown in Figure 4.4(b). Assume that initially I/O operations $w_1$ and $w_2$ are assigned to communication bus $C_1$, and I/O operations $w_3$ and $w_4$ are assigned to communication bus $C_2$.
After $w_1$ is scheduled in control step $s$, $w_2$ cannot be scheduled in the same control step since $C_1$ has been allocated for $w_1$. However, $w_2$ can actually be scheduled in the same control step by using $C_2$. In this case, $w_3$ can be reassigned to $C_1$. When assigning $w_2$ to $C_2$, it must be ensured that $w_3$ or $w_4$ can be reassigned to another communication bus. Therefore, it is of benefit to reassign communication buses dynamically as the scheduling proceeds.

Figure 4.4: An example of an interchip connection. (a) Communication between partitions. (b) Interchip connection.

Reassignment can be done in the following way. If the communication bus currently assigned to the I/O operation being checked has been allocated to another I/O operation in the current control step group, the I/O operation tries to preempt another I/O operation which is currently assigned to a communication bus that is unallocated in the current control step group. The preempted I/O operation, in turn, tries to preempt another I/O operation, until a preempted I/O operation can find a free³ communication slot and be assigned to it. Before the preemption process starts, the communication slot currently assigned to the I/O operation being checked is released.

³ A free communication slot is a communication slot not assigned to any I/O operation.

The process of checking for any free communication bus to assign an I/O operation at a control step, and reassigning I/O operations to communication buses if necessary, is similar to that of finding an augmentation path in a bipartite matching problem, which can be solved in polynomial time [PS82]. A bipartite graph $G = (V, U, E)$ is a graph in which every edge in $E$ has one terminal in $V$ and the other terminal in $U$. A matching of a graph is a subset of the edges with the property that no two edges in the subset share the same node. The matching problem is to find a maximum matching of a graph.
Given a graph and a matching, matched (free) edges are edges (not) in the matching. Exposed nodes are nodes not incident upon any matched edge. An alternating path is a path $p = [v_1, v_2, \ldots, v_k]$, where $[v_1, v_2], [v_3, v_4], \ldots$ are free, and $[v_2, v_3], [v_4, v_5], \ldots$ are matched. An augmentation path is an alternating path $p = [v_1, v_2, \ldots, v_k]$, where $v_1$ and $v_k$ are exposed.

The problem of reassigning I/O operations to communication buses can be represented by a bipartite graph $G = (V, U, E)$, in which $V$ is a set of nodes representing I/O operations, $U$ is a set of nodes representing communication slots, which are 2-tuples $(C_h, S_k)$, where $C_h$ is a communication bus and $S_k$ is a control step group, and $E$ is a set of edges, each representing the capability of a communication bus to perform an I/O operation. The reassignment of I/O operations to communication slots within a communication bus can be done by simply switching the assignment. So, the communication slots within a communication bus are grouped to eliminate unnecessary search.

The reassignment of I/O operations for the example is shown in Figure 4.5. A thin line indicates that the communication bus is capable of performing the I/O operation. A heavy line indicates that the I/O operation is currently assigned to the communication bus. A shaded line indicates that the assignment has been fixed because the communication bus has been allocated for an I/O operation at a control step. Circles in ellipses show the status of communication buses at certain control steps. An empty one indicates that the bus is free at the control step. The dotted line in Figure 4.5(b) corresponds to an augmentation path in the matching problem.
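The preemption chain described above can be sketched as a depth-first search for an augmentation path. The sketch below is a simplified illustration under stated assumptions: the names `try_place` and `reassign` are hypothetical, each bus is modeled as a bucket with a fixed number of slots (the slots of a bus are grouped, and individual control step groups are not tracked), and operations whose assignment is already fixed are never moved.

```python
def try_place(op, capable, assign, slots, fixed, banned=frozenset(), seen=None):
    """Try to put `op` on a bus, displacing movable operations if needed.

    capable -- dict: op -> list of buses able to perform it
    assign  -- dict: op -> bus currently assigned (updated in place)
    slots   -- dict: bus -> number of communication slots on the bus
    fixed   -- set of ops whose assignment may no longer change
    banned  -- buses `op` itself may not use (busy in the current step)
    """
    seen = set() if seen is None else seen
    for bus in capable[op]:
        if bus in banned or bus in seen:
            continue
        seen.add(bus)
        occupants = [o for o, b in assign.items() if b == bus]
        if len(occupants) < slots[bus]:       # a free slot: path ends here
            assign[op] = bus
            return True
        for victim in occupants:              # otherwise preempt someone
            if victim not in fixed and try_place(
                    victim, capable, assign, slots, fixed, seen=seen):
                assign[op] = bus
                return True
    return False


def reassign(op, busy, capable, assign, slots, fixed):
    """Find `op` a bus that is not busy in the current control step."""
    old = assign.pop(op, None)                # release the current slot first
    if try_place(op, capable, assign, slots, fixed, banned=frozenset(busy)):
        return True
    if old is not None:                       # no path found: undo the release
        assign[op] = old
    return False


# The example of Figure 4.4: w1 has just been scheduled on C1.
capable = {"w1": ["C1"], "w2": ["C1", "C2"], "w3": ["C1", "C2"], "w4": ["C2"]}
assign = {"w1": "C1", "w2": "C1", "w3": "C2", "w4": "C2"}
ok = reassign("w2", busy={"C1"}, capable=capable, assign=assign,
              slots={"C1": 2, "C2": 2}, fixed={"w1"})
```

Here `w2` moves to `C2` and preempts `w3`, which takes the slot that `w2` released on `C1`. The capability lists in the example are assumed for illustration.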
Figure 4.5: Snapshots of communication bus allocation. (a) Initial assignment. (b) After $C_1$ allocated for $w_1$. (c) Reassignment.

4.3 Bidirectional I/O Ports

In the previous section, we have assumed that all I/O ports are unidirectional. That is, an I/O port is either an input port or an output port, but not both. However, I/O pins can be utilized more efficiently if bidirectional I/O ports are used.

Interchip connection with bidirectional I/O ports can be modeled as shown in Figure 4.6. The ILP formulation for the model is similar to the formulation for the model with unidirectional I/O ports, except that data transfer constraints 4.2 and 4.3 are replaced by

\[ r_{i,h} \ge \max_{w \in IW_i \cup OW_i} B_w \, y_{w,h}, \quad \text{for } 0 \le i \le N,\ 1 \le h \le R, \]

where $r_{i,h}$ is an integer variable denoting the width of the I/O port of partition $P_i$ connected to communication bus $C_h$. Resource constraint 4.4 is replaced by

\[ \sum_{h=1}^{R} r_{i,h} \le T_i, \quad \text{for } 0 \le i \le N. \]

Figure 4.6: Interchip connection model with bidirectional I/O ports

The upper bound on the number of communication buses can be estimated similarly to the procedure described in the previous section. However, the maximum number of communication buses is half of the total number of I/O ports, since every bus must have at least two I/O ports connected to it.

The heuristic search technique developed for designs with unidirectional I/O ports can be applied to designs with bidirectional I/O ports, except that Equation 4.7 for factor $g_1$ is replaced by

\[ g_1 = \begin{cases} 0 & \text{if } r_{i,h} = 0 \wedge r_{j,h} = 0 \\ wf_i & \text{if } r_{i,h} \ne 0 \wedge r_{j,h} = 0 \\ wf_j & \text{if } r_{i,h} = 0 \wedge r_{j,h} \ne 0 \\ wf_i + wf_j & \text{otherwise} \end{cases} \]

4.4 Experimental Results

It would be difficult to directly compare this work to De Micheli's work [GM90] because they only consider nonpipelined designs.
    Initiation Rate   P0     P1              P2             P3
    3                 120P   135P  2+  2*    90P  2+  2*    90P  2+  2*
    4                 120P   135P  1+  1*    90P  1+  2*    90P  1+  2*
    5                 120P   135P  1+  1*    90P  1+  2*    90P  1+  2*

    P: Pins, +: Adders, *: Multipliers

Table 4.1: Resource constraints for the AR filter designs with unidirectional I/O ports

4.4.1 AR Lattice Filter

The AR Lattice Filter is partitioned in a general-partition structure as shown in Figure 4.7, and a variety of bit widths are assumed. The number next to an I/O operation is its bit width. The I/O operations without numbers next to them have 8 bits. Other assumptions are stated as follows:

• The stage time is 250 ns.
• I/O operations take 10 ns.
• Additions are performed by adders with 30 ns delay.
• Multiplications are performed by multipliers with 210 ns delay.
• Chained operations are allowed.

4.4.1.1 Designs with Unidirectional I/O Ports

The partitioned AR filter has been synthesized with initiation rates of 3, 4, and 5. The hardware resources available for designs with different initiation rates are shown in Table 4.1.

First, the interchip connections are synthesized by using the heuristic search technique described in this chapter. Interchip connections for designs with different initiation rates are shown in Figures 4.8, 4.9, and 4.10. The actual numbers of I/O pins allocated for the interchip connections are given in the second column of Table 4.2.

Figure 4.7: A general-partition AR filter

Figure 4.8: Interchip connection for the AR filter design with unidirectional I/O ports and an initiation rate of 3

Figure 4.9: Interchip connection for the AR filter design with unidirectional I/O ports and an initiation rate of 4

Figure 4.10: Interchip connection for the AR filter design with unidirectional I/O ports and an initiation rate of 5
The schedules for the designs with these interchip connections and the resource constraints given are shown in Figures 4.11, 4.12, and 4.13. During the scheduling process, I/O operations have been reassigned to communication buses. The numbers of control steps required for the schedules with and without bus reassignment are given in the third and fourth columns of Table 4.2, respectively. From the results, it can be seen that the schedules with bus reassignment have smaller numbers of control steps than the schedules with static bus assignment.

    Initiation   # Pins used              # Control steps
    Rate         P0    P1    P2    P3     w/ reassignment   w/o reassignment
    3            109   133   87    87     11                12
    4            101   125   87    87     14                15
    5            85    125   79    79     9                 13

Table 4.2: Summarized results for the AR filter designs with unidirectional I/O ports

Figure 4.11: Schedule for the AR filter design with unidirectional I/O ports and an initiation rate of 3

Figure 4.12: Schedule for the AR filter design with unidirectional I/O ports and an initiation rate of 4

Figure 4.13: Schedule for the AR filter design with unidirectional I/O ports and an initiation rate of 5

    Communication Bus   Initial Assignment   Final Assignment
    C1                  O1 O2                O1 O2
    C2                  X1 X3 X5             X1 X3 X5
    C3                  X2 X4 X6             X2 X4 X6
    C4                  Im Ip Iq             Ie Im Ip
    C5                  Ik In Io             Ia Ik In
    C6                  Ig Ii Il             Ig Ii Il
    C7                  Ic Ih Ij             Ic Ih Ij
    C8                  Id Ie If             Id If Iq
    C9                  I9 Ia Ib             I9 Ib Io
    C10                 I6 I7 I8             I1 I3 I8
    C11                 I3 I4 I5             I4 I6
    C12                 I1 I2                I2 I5 I7

Table 4.3: Bus assignment for the AR filter design with unidirectional I/O ports and an initiation rate of 3

    Control Steps   C1   C2   C3   C4   C5   C6   C7   C8   C9   C10  C11  C12
    0, 3, ...       O2   X1   X2   Ie   Ia   Ig   Ic   If   Ib   I8   I6   I7
    1, 4, ...       O1   X3   X4   Im   Ik   Ii   Ih   Id   I9   I3   I4   I5
    2, 5, ...       -    X5   X6   Ip   In   Il   Ij   Iq   Io   I1   -    I2

Table 4.4: Bus allocation for the AR filter design with unidirectional I/O ports and an initiation rate of 3

The initial and final assignments of I/O operations to communication buses for the schedules are shown in Tables 4.3, 4.5, and 4.7. The communication bus allocations for these schedules and interchip connections are shown in Tables 4.4, 4.6, and 4.8.

    Communication Bus   Initial Assignment   Final Assignment
    C1                  O1 O2                O1 O2
    C2                  X1 X3 X5             X1 X3 X5
    C3                  X2 X4 X6             X2 X4 X6
    C4                  Il Im Ip Iq          Ie Il Im
    C5                  Ij Ik In Io          Ia Ij Ik
    C6                  Ie If Ig Ii          Ig Ii Ip
    C7                  Ia Ib Ic Ih          Ic Ih In
    C8                  Id                   Id If Iq
    C9                  I9                   I9 Ib Io
    C10                 I5 I6 I7 I8          I2 I4 I6 I8
    C11                 I1 I2 I3 I4          I1 I3 I5 I7

Table 4.5: Bus assignment for the AR filter design with unidirectional I/O ports and an initiation rate of 4

    Control Steps   C1   C2   C3   C4   C5   C6   C7   C8   C9   C10  C11
    0, 4, ...       -    X3   X4   Ie   Ia   Ig   Ic   If   Ib   I8   I7
    1, 5, ...       O2   X5   X6   Im   Ik   Ii   Ih   Id   I9   I6   I5
    2, 6, ...       -    -    -    Il   Ij   Ip   In   Iq   Io   I4   I3
    3, 7, ...       O1   X1   X2   -    -    -    -    -    -    I2   I1

Table 4.6: Bus allocation for the AR filter design with unidirectional I/O ports and an initiation rate of 4

    Communication Bus   Initial Assignment   Final Assignment
    C1                  O1 O2                O1 O2
    C2                  X1 X3 X5             X1 X3 X5
    C3                  X2 X4 X6             X2 X4 X6
    C4                  Ii Il Im Ip Iq       Id If Ii Il
    C5                  Ih Ij Ik In Io       I9 Ib Ih Ij
    C6                  Id Ie If Ig          Ie Ig Im Ip Iq
    C7                  I9 Ia Ib Ic          Ia Ic Ik In Io
    C8                  I4 I5 I6 I7 I8       I1 I4 I6 I8
    C9                  I1 I2 I3             I2 I3 I5 I7

Table 4.7: Bus assignment for the AR filter design with unidirectional I/O ports and an initiation rate of 5

    Control Steps   C1   C2   C3   C4   C5   C6   C7   C8   C9
    0, 5, ...       -    X5   X6   If   Ib   Ig   Ic   I8   I7
    1, 6, ...       O1   -    -    Id   I9   Ie   Ia   I6   I5
    2, 7, ...       -    -    -    Ii   Ih   Im   Ik   I4   I3
    3, 8, ...       O2   X1   X2   Il   Ij   Iq   Io   I1   I2
    4, 9, ...       -    X3   X4   -    -    Ip   In   -    -

Table 4.8: Bus allocation for the AR filter design with unidirectional I/O ports and an initiation rate of 5

4.4.1.2 Designs with Bidirectional I/O Ports

The partitioned AR filter has also been synthesized with the assumption that bidirectional I/O ports are allowed. The hardware resources available for designs with different initiation rates are shown in Table 4.9. The interchip connections for the different initiation rates are shown in Figures 4.14, 4.15, and 4.16. The actual numbers of I/O pins allocated for interchip connections are given in the second column of Table 4.10. The schedules for the designs with these interchip connections and the resource constraints given are shown in Figures 4.17, 4.18, and 4.19. During the scheduling process, I/O operations have been reassigned to communication buses. The numbers of control steps required for the schedules with and without bus reassignment are given in the third and fourth columns of Table 4.10. From the results, it can be seen that the schedules with bus reassignment have smaller numbers of control steps than the schedules without bus reassignment. The initial and final assignments of I/O operations to communication buses for the schedules are shown in Tables 4.11 - 4.13.

    Initiation Rate   P0     P1              P2             P3
    3                 110P   100P  2+  2*    90P  2+  2*    90P  2+  2*
    4                 110P   100P  1+  1*    90P  1+  2*    90P  1+  2*
    5                 110P   100P  1+  1*    90P  1+  2*    90P  1+  2*

    P: Pins, +: Adders, *: Multipliers

Table 4.9: Resource constraints for the AR filter designs with bidirectional I/O ports

    Initiation   # Pins used              # Control steps
    Rate         P0    P1    P2    P3     w/ reassignment   w/o reassignment
    3            109   97    87    87     11                12
    4            101   89    78    78     15                16
    5            77    81    70    70     10                14

Table 4.10: Summarized results for the AR filter designs with bidirectional I/O ports

As we expected, the designs with bidirectional I/O ports require fewer I/O pins than the corresponding designs with only unidirectional I/O ports.
Figure 4.14: Interchip connection for the AR filter design with bidirectional I/O ports and an initiation rate of 3

Figure 4.15: Interchip connection for the AR filter design with bidirectional I/O ports and an initiation rate of 4

Figure 4.16: Interchip connection for the AR filter design with bidirectional I/O ports and an initiation rate of 5

Communication Bus   Initial Assignment   Final Assignment
C1                  I8 O1 O2             I8 O1 O2
C2                  X3 X5 X6             X1 X5 X6
C3                  X1 X2 X4             X2 X3 X4
C4                  Im Ip Iq             Ie Im Ip
C5                  Ik In Io             Ia Ik In
C6                  Ig Ii Il             Ig Ii Il
C7                  Ic Ih Ij             Ic Ih Ij
C8                  Id Ie If             Id If Iq
C9                  I9 Ia Ib             I9 Ib Io
C10                 I5 I6 I7             I2 I7
C11                 I2 I3 I4             I3 I5
C12                 I1                   I1 I4 I6

Table 4.11: Bus assignment for the AR filter design with bidirectional I/O ports and an initiation rate of 3

Figure 4.17: Schedule for the AR filter design with bidirectional I/O ports and an initiation rate of 3

Figure 4.18: Schedule for the AR filter design with bidirectional I/O ports and an initiation rate of 4

Figure 4.19: Schedule for the AR filter design with bidirectional I/O ports and an initiation rate of 5

Communication Bus   Initial Assignment   Final Assignment
C1                  I7 I8 O1 O2          I3 I8 O1 O2
C2                  X3 X4 X5 X6          X3 X4 X5 X6
C3                  X1 X2                X1 X2
C4                  Il Im Ip Iq          Ie Il Im
C5                  Ij Ik In Io          Ia Ij Ik
C6                  Ie If Ig Ii          Ig Ii Ip
C7                  Ia Ib Ic Ih          Ic Ih In
C8                  Id                   Id If Iq
C9                  I9                   I9 Ib Io
C10                 I3 I4 I5 I6          I1 I5 I6
C11                 I1 I2                I2 I4 I7

Table 4.12: Bus assignment for the AR filter design with bidirectional I/O ports and an initiation rate of 4

Communication Bus   Initial Assignment   Final Assignment
C1                  I6 I7 I8 O1 O2       I4 I6 I8 O1 O2
C2                  X1 X3 X4 X5 X6       X1 X3 X4 X5 X6
C3                  X2                   X2
C4                  Ii Il Im Ip Iq       Id If Ii Il
C5                  Ih Ij Ik In Io       I9 Ib Ih Ij
C6                  Id Ie If Ig          Ie Ig Im Ip Iq
C7                  I9 Ia Ib Ic          Ia Ic Ik In Io
C8                  I1 I2 I3 I4 I5       I1 I2 I3 I5 I7

Table 4.13: Bus assignment for the AR filter design with bidirectional I/O ports and an initiation rate of 5

L   P0    P1            P2            P3            P4            P5
5   32P   (64P,3+,1*)   (48P,1+,1*)   (48P,2+,2*)   (64P,3+,2*)   (32P,1+,2*)
6   32P   (64P,2+,1*)   (48P,1+,1*)   (32P,1+,1*)   (48P,2+,1*)   (32P,1+,1*)
7   32P   (48P,1+,1*)   (48P,1+,1*)   (48P,1+,1*)   (48P,2+,1*)   (32P,1+,1*)
L: Initiation rate, P: Pins, +: Adders, *: Multipliers

Table 4.14: Resource Constraints for the elliptic filter designs with unidirectional I/O ports

4.4.2 Fifth-order Wave Elliptic Filter

The fifth-order wave elliptic filter [KWK85] is partitioned as shown in Figure 4.20, and all values are assumed to have 16 bits. It is also assumed that I/O operations and additions take 1 cycle, while multiplications take 2 cycles and are not pipelined. In the elliptic filter, there are a number of data recursive edges (see footnote 4) with degree 1. A degree d on an edge denotes that the value produced by the source operation will be consumed by the destination operation d iterations (execution instances) later.
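As a rough illustration of how the degree on a recursive edge limits pipelining, the sketch below computes the smallest initiation rate that a loop allows: the loop length in cycles divided by the total degree around the loop, rounded up. This is the usual recurrence bound and is an assumption here, not a formula given in the text; the function name and its arguments are hypothetical.

```python
def min_initiation_rate(loop_cycles: int, loop_degree: int) -> int:
    """Smallest initiation rate (in cycles) permitted by one recursive loop.

    loop_cycles: total delay of the operations around the loop.
    loop_degree: sum of the degrees on the recursive edges of the loop.
    Assumption: the bound is ceil(loop_cycles / loop_degree).
    """
    if loop_cycles <= 0 or loop_degree <= 0:
        raise ValueError("loop_cycles and loop_degree must be positive")
    # Integer ceiling division, avoiding floating point.
    return (loop_cycles + loop_degree - 1) // loop_degree


if __name__ == "__main__":
    # A 20-cycle loop with degree 1, then the same loop with degree 4.
    print(min_initiation_rate(20, 1))
    print(min_initiation_rate(20, 4))
```

Under this assumption, a 20-cycle loop with degree 1 gives a bound of 20 cycles, and raising the degree to 4 lowers the bound to 5 cycles.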
With a single data stream, the initiation rate for the filter must be at least 20 cycles without retiming or transformation [PR91, LP91] to optimize the filter, because the length of the critical loop (X33, +2, Xf, +5, Xe, ..., Xh, ..., Xj, ..., +28) is 20 cycles. In order to demonstrate the sharing of multiple buses and to let more I/O operations execute concurrently, the degree on each data recursive edge has been modified to 4. Thus, the design for the modified filter can operate on four independent multiplexed streams of data, and the minimum initiation rate becomes 5 cycles.

Footnote 4: Data recursive edges will be discussed in more detail in Chapter 7.

Figure 4.20: A partitioned elliptic filter

4.4.2.1 Designs with Unidirectional I/O Ports

The partitioned elliptic filter has been synthesized with initiation rates of 5, 6, and 7. The hardware resources available for designs with different initiation rates are shown in Table 4.14. The schedule for the design with an initiation rate of 5 cannot be obtained under the resource constraints, even if one exists, because of the very tight time constraints imposed by data dependencies between execution instances and the greedy heuristic of list scheduling.

Figure 4.21: Interchip connection for the elliptic filter design with unidirectional I/O ports and an initiation rate of 6

Interchip connections for the designs with different initiation rates are shown in Figures 4.21 and 4.22. The actual numbers of I/O pins used for interchip connections are the same as the I/O pin constraints. The schedules for the designs with these interchip connections and the given resource constraints are shown in Figures 4.23 and 4.24.
Note that, in these schedules, the I/O operations with negative indexes are scheduled in control steps with negative numbers. This means that the values produced by the previous execution instance have been input and stored before the current execution instance is initiated. No bus reassignments are done during scheduling because the initial assignment is the only valid bus assignment for the given interchip connections. The communication bus allocations for these schedules and interchip connections are shown in Tables 4.15 and 4.16. Note that I/O operations which transfer the same value and are scheduled in the same control step can be assigned to the same communication bus without bus conflicts. For example, in the design with an initiation rate of 7, assigning I/O operations Ia and Ib, which are scheduled in control step 0, to communication bus C1 does not result in a bus conflict.

Figure 4.22: Interchip connection for the elliptic filter design with unidirectional I/O ports and an initiation rate of 7

Control Steps   Bus Allocation (buses C1, C2, C3, listed in bus order)
0, 6, ...       (Ia,Ib)  X26  X39
1, 7, ...       Xf  Op
2, 8, ...       Xa  Xi  (Xc,Xd)
3, 9, ...       Xb  Xg  Xj
4, 10, ...      X33  Xh
5, 11, ...      X13  X2  Xe

Table 4.15: Bus allocation for the elliptic filter design with unidirectional I/O ports and an initiation rate of 6

Figure 4.23: Schedule for the elliptic filter design with unidirectional I/O ports and an initiation rate of 6

Figure 4.24: Schedule for the elliptic filter design with unidirectional I/O ports and an initiation rate of 7

Control Steps   Bus Allocation (buses C1, C2, C3, listed in bus order)
0, 7, ...       (Ia,Ib)  Xj
1, 8, ...       X33  (Xc,Xd)  Xf
2, 9, ...       Xa  Xg
3, 10, ...      Op  Xh
4, 11, ...      X13  X39  Xi
5, 12, ...      Xe  X26
6, 13, ...      Xb  X2

Table 4.16: Bus allocation for the elliptic filter design with unidirectional I/O ports and an initiation rate of 7

4.4.2.2 Designs with Bidirectional I/O Ports

The partitioned elliptic filter has also been synthesized with the assumption that bidirectional I/O ports are allowed. The hardware resource constraints are shown in Table 4.17. List scheduling cannot find any schedule for the design with an initiation rate of 5 under the resource constraints, even if one exists, for the same reasons given above.

L   P0    P1            P2            P3            P4            P5
5   32P   (48P,2+,1*)   (48P,1+,1*)   (32P,2+,2*)   (32P,3+,2*)   (32P,1+,1*)
6   32P   (32P,2+,1*)   (32P,1+,1*)   (32P,1+,1*)   (32P,2+,1*)   (32P,1+,1*)
7   32P   (32P,1+,1*)   (32P,1+,1*)   (16P,1+,1*)   (32P,2+,1*)   (16P,1+,1*)
L: Initiation rate, P: Pins, +: Adders, *: Multipliers

Table 4.17: Resource Constraints for the elliptic filter designs with bidirectional I/O ports

Figure 4.25: Interchip connection for the elliptic filter design with bidirectional I/O ports and an initiation rate of 6

The interchip connections for the designs with initiation rates of 6 and 7 are shown in Figures 4.25 and 4.26, respectively. The actual numbers of I/O pins used for interchip connections are the same as the I/O pin constraints. The schedules for the designs with these interchip connections and the given resource constraints are shown in Figures 4.27 and 4.28. In these schedules, the I/O operations with negative indexes are scheduled in control steps with negative numbers; the meaning is the same as given above. No bus reassignments are done during scheduling.
The communication bus allocations for these schedules and interchip connections are shown in Tables 4.18 and 4.19. Note that I/O operations which transfer the same value and are scheduled in the same control step can be assigned to the same communication bus without bus conflicts. Also, I/O operations transferring the same value are not required to be scheduled in the same control step using the same communication bus. For example, in the design with an initiation rate of 7, assigning I/O operations Ia and Ib, which are scheduled in control step 0, to communication bus C1 does not result in a bus conflict. I/O operations Xc and Xd are scheduled in control step 8 using bus C1 and control step 9 using bus C2, respectively. Also, the designs with bidirectional I/O ports require fewer I/O pins than the corresponding designs with only unidirectional I/O ports.

Figure 4.26: Interchip connection for the elliptic filter design with bidirectional I/O ports and an initiation rate of 7

Control Steps   Bus Allocation (buses C1, C2, C3, C4, listed in bus order)
0, 6, ...       Ib  Ia  X2  X39
1, 7, ...       Xf  Xi
2, 8, ...       Xd  Xa  Xj
3, 9, ...       X26  Xb  Xg
4, 10, ...      Xc  Xh
5, 11, ...      Op  Xe  X13  X33

Table 4.18: Bus allocation for the elliptic filter design with bidirectional I/O ports and an initiation rate of 6

Figure 4.27: Schedule for the elliptic filter design with bidirectional I/O ports and an initiation rate of 6

Figure 4.28: Schedule for the elliptic filter design with bidirectional I/O ports and an initiation rate of 7

Control Steps   Bus Allocation (buses C1, C2, C3, listed in bus order)
0, 7, ...       (Ia,Ib)  Xj
1, 8, ...       Xc  Xf  X33
2, 9, ...       Xa  Xd  Xg
3, 10, ...      Xi  Xh
4, 11, ...      X13  X39
5, 12, ...      X2  Xe  Op
6, 13, ...      Xb  X26

Table 4.19: Bus allocation for the elliptic filter design with bidirectional I/O ports and an initiation rate of 7

Chapter 5

Interchip Connection Synthesis After Scheduling

In Chapter 4, an approach to multiple-chip design synthesis in which the interchip connections are synthesized before scheduling was discussed. This chapter presents another approach to the synthesis problem, in which scheduling is performed before interchip connection synthesis. We make the same assumptions as before: every communication bus can be used to transfer at most one value in a cycle, and values are transferred as a whole. The experimental results are then compared to those obtained by the previous approach.

5.1 Scheduling

In the prototype program, Force-Directed Scheduling (FDS) [PK89] is used to schedule a CDFG that has been partitioned into a number of partitions, each of which will be implemented in a chip, given an initiation rate and pipe length. All partitions are scheduled at the same time because of the interdependency among partitions imposed by the constraint that the output and the inputs of each value transferred between partitions must take place in the same cycle. FDS tries to minimize the number of hardware resources required by balancing the utilization of the resources. In FDS, the distribution graphs for all operation types are required to derive the force of assigning an operation to a control step.
Intuitively, the distribution graph of I/O operation types can be obtained by combining the distribution graphs of the input operation types and output operation types of the corresponding partitions, since an I/O operation consists of an output operation from one partition and an input operation to another partition; this is the approach we take. Unfortunately, unlike the case of functional units, balancing the distribution graph for I/O operation types does not necessarily lead to the minimum number of I/O pins required, because no switching devices are allowed off-chip.

5.2 Interchip Connection Synthesis

In pipelined designs, an execution instance is initiated before the previous execution instance is completed. So, the operations scheduled in the same control step group are executed in overlapped fashion and cannot share resources. The operations scheduled in different control step groups are never executed concurrently and can share the same hardware resources.

Two I/O operations w_1 = (P_{i1}, P_{j1}) and w_2 = (P_{i2}, P_{j2}) are said to be compatible if and only if they can share a communication bus. The I/O operations are said to be in conflict if they are not compatible. I/O operations which are scheduled in different control step groups can share a communication bus, and so are compatible. I/O operations which are scheduled in the same control step group can share a communication bus only if they transfer the same value in the same control step.

After scheduling is performed, each I/O operation is assigned to one of the control step groups. Then, the interchip connection can be determined with the goal of minimizing the number of I/O pins required. The problem can be modeled as a compatibility graph, in which each node represents an I/O operation. There is an edge between two nodes if and only if their corresponding I/O operations are compatible.
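The compatibility rule above can be sketched as a small predicate. This is a minimal illustration, not the prototype's code: the IOOp record, its field names, and the assumption that the control step group of an operation is its control step modulo the initiation rate L are all introduced here for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IOOp:
    """Hypothetical record for a scheduled I/O operation."""
    name: str
    src: int     # index of the partition that outputs the value
    dst: int     # index of the partition that inputs the value
    value: str   # identifier of the value being transferred
    step: int    # control step assigned by the scheduler


def step_group(op: IOOp, L: int) -> int:
    # Assumption: steps k, k + L, k + 2L, ... form control step group k.
    return op.step % L


def compatible(a: IOOp, b: IOOp, L: int) -> bool:
    """True if a and b may share a communication bus."""
    if step_group(a, L) != step_group(b, L):
        # Different groups never execute concurrently.
        return True
    # Same group: sharing is allowed only for the same value
    # transferred in the same control step.
    return a.value == b.value and a.step == b.step


if __name__ == "__main__":
    w1 = IOOp("w1", 0, 1, "v1", 0)
    w2 = IOOp("w2", 0, 2, "v2", 1)
    w3 = IOOp("w3", 1, 2, "v3", 3)
    print(compatible(w1, w2, 3), compatible(w1, w3, 3))
```

With L = 3, steps 0 and 3 fall in the same group, so w1 and w3 conflict because they carry different values, while w1 and w2 are compatible.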
On each edge is an associated weight, which reflects the benefit of having the I/O operations corresponding to the two end nodes assigned to the same communication bus. The weight, weight(w_1, w_2), on the edge between w_1 and w_2 corresponding to two compatible I/O operations is given by

\[
\mathrm{weight}(w_1, w_2) =
\begin{cases}
(wf_{i_1} + wf_{j_1}) \, \min(B_{w_1}, B_{w_2}) & \text{if } i_1 = i_2 \wedge j_1 = j_2 \\
wf_{i_1} \, \min(B_{w_1}, B_{w_2}) & \text{if } i_1 = i_2 \wedge j_1 \neq j_2 \\
wf_{j_1} \, \min(B_{w_1}, B_{w_2}) & \text{if } i_1 \neq i_2 \wedge j_1 = j_2 \\
0 & \text{otherwise}
\end{cases}
\]

where wf_i is a weighting factor used to specify the importance of sharing I/O pins of partition P_i, and B_{w_1} and B_{w_2} are the bit widths of I/O operations w_1 and w_2, respectively. When wf_i = 1 for all i, the weight on an edge denotes the number of I/O pins which can be shared, and so can be saved, if the I/O operations corresponding to the two end nodes are assigned to a shared communication bus. The minimum of the two bit widths is taken because it indicates the number of pins which can be saved if the two I/O operations share a communication bus. The minimization of I/O pins for different partitions can compete with each other. The partition with a higher weighting factor is given priority over one with a lower value. Note that an edge can have a zero weight, which is quite different from having no edge at all: two I/O operations whose nodes are joined by a zero-weight edge can still share a communication bus even though they do not share I/O pins.

For the case of bidirectional I/O ports, in which an I/O port can be used as either a source or a destination of an I/O transfer, both I/O operations w = (P_i, P_j) and w' = (P_j, P_i) require the same communication path on a communication bus if they are assigned to the same communication bus.
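The edge weight for the unidirectional case can be sketched directly from the four cases above. The tuple layout (source partition, destination partition, bit width) and the name edge_weight are assumptions made for this illustration, and the function presumes the two operations have already been found compatible.

```python
def edge_weight(w1, w2, wf):
    """Weight on the edge between two compatible I/O operations.

    w1, w2: (src_partition, dst_partition, bit_width) tuples (hypothetical layout).
    wf: mapping from partition index to its weighting factor.
    """
    (i1, j1, b1), (i2, j2, b2) = w1, w2
    shared_bits = min(b1, b2)  # pins that one shared port can save
    if i1 == i2 and j1 == j2:
        # Same source and same destination: pins saved on both chips.
        return (wf[i1] + wf[j1]) * shared_bits
    if i1 == i2:
        # Same source only: pins saved on the source chip.
        return wf[i1] * shared_bits
    if j1 == j2:
        # Same destination only: pins saved on the destination chip.
        return wf[j1] * shared_bits
    # No common port; the operations can still share the bus.
    return 0


if __name__ == "__main__":
    unit = {0: 1, 1: 1, 2: 1, 3: 1}
    print(edge_weight((0, 1, 16), (0, 1, 8), unit))
    print(edge_weight((0, 1, 16), (2, 3, 16), unit))
```

With unit weighting factors the result is the number of pins saved by sharing; raising wf for one partition biases the later clique partitioning toward saving pins on that chip.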
The problem of determining the interchip connection with the goal of minimizing the total number of I/O pins required can be reduced to the problem of partitioning the compatibility graph into a number of disjoint cliques having the largest gain, where the gain is the sum of the weights on all edges that are contained in the cliques. The nodes in a clique represent I/O operations that are assigned to the same communication bus. The number of communication buses is equal to the number of cliques in the partition. The structure of a communication bus (that is, which chips have I/O ports connected to the bus) can be easily determined from the assignment of I/O operations to the bus, since there must be a communication path with a sufficient bit width for each I/O operation assigned to the bus.

The problem is similar to the clique partitioning problem, which was initially used by Tseng and Siewiorek [TS83] to model the problem of operator, register, and on-chip interconnection binding, and which is widely used for synthesis. The general clique partitioning problem is known to be NP-hard. For interval graphs, which represent the intersections between a set of intervals along the real axis, the left-edge algorithm [HS71], which is a polynomial time algorithm, can be used to partition the graph into a minimum number of cliques. Springer [Spr91] has shown that the least-cost clique partitioning problem is NP-hard even for interval graphs. We believe that the problem of determining the interchip connection with the goal of minimizing the number of I/O pins required is also an NP-hard problem.

The compatibility graph of our problem has some special properties. The graph can be divided into L groups G_0, G_1, ..., G_{L-1}, as shown in Figure 5.1, where L is the initiation rate of the design.
Group G_k contains the nodes which correspond to the I/O operations scheduled in control step group k. So, there is an edge between every node in G_i and every node in G_j (for i ≠ j). Group G_k can be further divided into a number of subgroups SG_{k,1}, SG_{k,2}, ..., SG_{k,n_k}. The nodes in subgroup SG_{k,j} correspond to the I/O operations which transfer the same value in the same control step and so can share a communication bus. Each subgroup usually contains only one node. There is an edge between any two nodes in a subgroup, while there is no edge between any two nodes in two different subgroups of the same group. Finding a set of cliques is equivalent to finding a set of disjoint subsets, S_k, in which there can be more than one node from the same group only if all of the nodes from that group are in the same subgroup.

Figure 5.1: Compatible graph for interchip connection problem (groups G_0 to G_{L-1}, one per control step group; an edge between every two nodes in different groups; an edge between every two nodes in the same subgroup, which transfer the same value in the same control step; no edge between nodes in the same group but not in the same subgroup)

By taking advantage of the above properties, the heuristic procedure shown in Figure 5.2 has been developed. At first, the nodes corresponding to the I/O operations assigned to the same control step group are grouped. The nodes in each group are then divided into subgroups depending on whether the corresponding I/O operations transfer the same value in the same control step. The nodes in a subgroup are combined to form a supernode. A supernode represents a set of nodes which are in the same clique. Combining nodes v_1, ..., v_k to form a supernode v means that nodes v_1, ..., v_k are replaced by node v, all edges connected to any of v_1, ..., v_k are removed, and an edge (v', v) is created for each node v' which is connected to all of v_1, ..., v_k.
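The supernode step just described can be sketched on a weighted adjacency map. The dictionary-of-dictionaries representation and the name combine_into_supernode are assumptions for illustration; the weight given to each new edge is the sum of the weights from the outside node to the merged members, which is the rule the text states next.

```python
def combine_into_supernode(graph, members, new):
    """Replace `members` by a single node `new` in a symmetric weighted graph.

    graph: dict mapping node -> dict mapping neighbour -> weight.
    An outside node keeps an edge to `new` only if it was connected
    to every member; the new weight is the sum of the old weights.
    """
    members = list(members)
    # Outside nodes adjacent to all members.
    common = set(graph[members[0]])
    for m in members[1:]:
        common &= set(graph[m])
    common -= set(members)
    new_edges = {u: sum(graph[m][u] for m in members) for u in common}

    # Remove the members together with every edge touching them.
    for m in members:
        for u in list(graph[m]):
            del graph[u][m]
        del graph[m]

    # Insert the supernode and its summed edges.
    graph[new] = new_edges
    for u, weight in new_edges.items():
        graph[u][new] = weight
    return graph


if __name__ == "__main__":
    g = {
        "a": {"b": 0, "c": 16, "d": 4},
        "b": {"a": 0, "c": 8},
        "c": {"a": 16, "b": 8},
        "d": {"a": 4},
    }
    print(combine_into_supernode(g, ["a", "b"], "ab"))
```

In the example, node c is connected to both a and b and so keeps an edge to the supernode, while d, which was connected only to a, loses its edge.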
The weight, weight(v', v), on the newly created edge is given by

\[
\mathrm{weight}(v', v) = \sum_{i=1}^{k} \mathrm{weight}(v', v_i)
\]

After these steps, there are only edges between nodes in different groups. Every clique will contain at most one node in each group, since the nodes in a group conflict with each other. The graph induced from any two groups is bipartite. So, the cliques can be constructed by applying a series of bipartite weighted matchings to pairs of groups, giving priority to larger groups. After the matching between two groups is found, a new group is created by combining these two groups according to the matching, and it replaces them. Finally, only one group is left, and it contains a number of supernodes, each of which represents a set of nodes in a clique. The I/O operations corresponding to the nodes in a clique are assigned to a communication bus. Then, the bus structure can be constructed according to the assignment of I/O operations to buses.

The most time-consuming step in the procedure is solving the bipartite weighted matching problem by applying the Hungarian algorithm, which has a complexity of O(n^3). So, the worst-case complexity of the procedure is O(L n^3), where n is the total number of I/O operations. For the cases in which the I/O operations are evenly distributed among control step groups, the complexity is O(n^3 / L^2).

Determine Interchip Connection
    For each group G_k do
        Divide G_k into SG_{k,1}, ..., SG_{k,n_k}
    Order G_i such that n_i >= n_j for i < j
    For i = 1 to L - 1 do
        Apply Hungarian algorithm to G_0 and G_i
        For j = 1 to n_0 do
            SG_{0,j} = SG_{0,j} ∪ Match(j)

Figure 5.2: A heuristic procedure for constructing interchip connection

5.3 Experimental Results

For comparison, the AR lattice filter is partitioned in the same way as shown in Figure 4.7, and the same assumptions are made. Table 5.1 shows the numbers of I/O pins and operators required with variations of initiation rates and pipe lengths.
For a given initiation rate, increasing the pipe length does not necessarily lead to a lower hardware requirement. For a given initiation rate, the design with the least resource requirements is shown in bold font. The results produced by the previous approach described in Chapter 4 using different resource constraints are shown in Table 5.2. For all of the test cases, given the same interchip connections, better scheduling results than those shown can be obtained by postponing some of the operations, as we have done here by constraining some of the operations and rerunning the program. These results are shown in parentheses in the output column.

Inputs                    Outputs
Initiation   Pipe         # I/O pins                # Adders        # Multipliers
Rate         Length       P0    P1    P2    P3      P1   P2   P3    P1   P2   P3
3            6            154   162   95    95      2    2    2     2    4    4
3            7            170   90    130   114     2    2    2     2    3    2
3            8            146   90    114   114     2    2    2     2    3    2
3            9            133   97    105   113     2    2    2     2    2    3
3            10           149   105   105   113     2    2    2     2    2    2
4            6            154   98    95    95      2    1    1     2    2    2
4            7            154   106   114   98      2    2    2     2    2    2
4            8            146   82    105   105     2    2    2     1    2    2
4            9            146   98    95    95      2    2    1     1    2    2
4            10           138   98    95    95      2    1    2     1    2    2
5            6            146   82    106   122     2    1    1     2    2    2
5            7            122   90    87    87      2    1    1     2    2    2
5            8            130   90    79    97      2    1    1     2    2    2
5            9            154   90    97    113     2    2    1     2    2    2
5            10           138   90    89    105     2    2    2     2    2    2

Table 5.1: Resources required for the AR filter with variations of initiation rates and pipe lengths using the technique described in this chapter

Inputs                                                               Outputs
Initiation   # I/O pins                # Adders        # Multipliers   Pipe
Rate         P0    P1    P2    P3      P1   P2   P3    P1   P2   P3    Length
3            109   97    87    87      2    2    2     2    2    2     11(9)
4            101   89    78    78      1    1    1     1    2    2     15(11)
4            101   53    87    87      1    1    1     1    2    2     12(11)
5            77    81    70    70      1    1    1     1    2    2     10(9)
5            85    53    79    79      1    1    1     1    2    2     13(10)

Table 5.2: Pipe length for the AR filter with variations of initiation rates and resource constraints using the technique described in Chapter 4

The elliptic filter is also partitioned in the same way as shown in Figure 4.20, and the same assumptions are made. Table 5.3 shows the numbers of I/O pins and operators required, and the input-to-output delay, with variations of initiation rates and pipe lengths. The results produced by the previous approach are shown in Table 5.4. The previous approach cannot produce any schedule for several designs with tight time and resource constraints, even though a schedule exists. The reason is the small slack time in the critical loop and the greedy nature of the list scheduling technique implemented in our earlier approach. For most of the test cases, given the same interchip connections, better scheduling results than those shown can be obtained by postponing some of the operations, as we have done here by constraining some of the operations and rerunning the program. These results are shown in parentheses in the output columns.

Inputs            Outputs
Init.   Pipe      # I/O pins                    (# Adders, # Multipliers)               In-Out
Rate    Length    P0   P1   P2   P3   P4   P5   P1      P2      P3      P4      P5      Delay
5       22        32   48   48   32   32   32   (2,1)   (1,1)   (2,2)   (3,2)   (1,1)   21
5       23        32   48   48   32   32   32   (2,1)   (1,1)   (2,2)   (3,2)   (1,1)   23
5       24        32   64   48   32   32   32   (2,1)   (1,1)   (1,2)   (3,2)   (1,1)   23
5       25        48   48   48   16   48   32   (2,1)   (1,1)   (1,2)   (3,2)   (1,2)   24
5       26        48   64   32   32   48   16   (2,1)   (1,1)   (2,2)   (4,2)   (1,2)   23
6       22        48   64   48   32   32   32   (2,1)   (1,1)   (1,1)   (2,2)   (1,1)   18
6       23        16   48   32   32   16   48   (2,1)   (1,1)   (1,1)   (2,1)   (1,1)   18
6       24        32   32   32   32   32   16   (2,1)   (1,1)   (1,1)   (2,1)   (1,1)   20
6       25        32   48   48   32   32   48   (2,1)   (1,1)   (1,1)   (2,1)   (1,1)   20
6       26        32   48   32   16   48   32   (2,1)   (1,1)   (1,1)   (2,1)   (1,1)   22
7       22        32   48   32   16   48   16   (2,1)   (2,1)   (1,1)   (2,2)   (1,1)   19
7       23        32   48   32   16   48   32   (2,1)   (2,1)   (1,1)   (2,1)   (1,1)   20
7       24        16   48   32   16   32   48   (2,1)   (1,1)   (1,1)   (2,1)   (1,1)   20
7       25        32   48   32   16   48   16   (2,1)   (1,1)   (1,1)   (2,1)   (1,1)   22
7       26        32   32   32   16   48   16   (2,1)   (1,1)   (1,1)   (2,1)   (1,1)   23

Table 5.3: Resources required for the elliptic filter with variations of initiation rates and pipe lengths using the technique described in this chapter

Inputs                                                                           Outputs
Init.   # I/O pins                    (# Adders, # Multipliers)               Pipe     In-Out
Rate    P0   P1   P2   P3   P4   P5   P1      P2      P3      P4      P5      Length   Delay
5       32   48   48   32   32   32   (2,1)   (1,1)   (2,2)   (3,2)   (1,1)   -(23)    -(22)
6       32   32   32   32   32   16   (2,1)   (1,1)   (1,1)   (2,1)   (1,1)   -(23)    -(19)
6       32   32   32   32   32   32   (2,1)   (1,1)   (1,1)   (2,1)   (1,1)   24       18
7       32   32   32   16   48   16   (2,1)   (1,1)   (1,1)   (2,1)   (1,1)   28(24)   20(19)
7       32   32   32   16   32   16   (1,1)   (1,1)   (1,1)   (2,1)   (1,1)   27(26)   20(19)

Table 5.4: Pipe length for the elliptic filter with variations of initiation rates and resource constraints using the technique described in Chapter 4

From the above results, the approach described in this chapter usually produces a design that requires more I/O pins, because scheduling puts more constraints on the I/O pin optimization in the current approach.
In addition, the distribution graph of I/O operations used in FDS, which is derived by combining the distribution graphs of the corresponding input and output operations, does not reflect the usage of the communication buses accurately. On the other hand, because interchip connection synthesis is done without considering its possible impact on scheduling, and because of the greedy heuristics used in list scheduling, the previous approach described in Chapter 4 usually produces a schedule with a longer input-to-output delay. For a design with tight time constraints, the previous approach may not be able to produce any schedule. The above conclusions are drawn from the test cases and may not be valid for some cases. However, we believe that they hold in general. In most of the test cases, better schedules can be obtained by postponing some of the operations. So, we believe that a better scheduling result might have been obtained if a more advanced scheduling technique, such as PLS [HHL91], had been used in the previous approach instead of list scheduling. The I/O pin optimization described in this chapter tries to minimize the total number of I/O pins required. However, minimizing the total number of I/O pins used does not mean minimizing the manufacturing cost, because packages come with discrete numbers of I/O pins. In conclusion, the approach described in the previous chapter, combined with a more advanced scheduling technique, is more desirable than the approach described in this chapter.

Chapter 6

Sharing Buses in a Cycle

I/O pin resources can be utilized more effectively, at the expense of more complicated control, if we allow a communication bus to transfer more than one value at the same time by using different portions of the bus.
In order to avoid the situation in which each I/O pin would require an individual control signal, we assume that each communication bus is logically divided into a small number of sub-buses, each of which is composed of a number of bits. One or more consecutive sub-buses can be grouped and used to transfer a value. So, more than one value can be transferred on a communication bus at the same time by using different portions of the bus.

Like the problems described earlier, this problem is also approached in two steps: (1) interchip connection synthesis, and (2) scheduling. First, an ILP formulation for the subproblem of interchip connection synthesis is given. Due to the large amount of computation time required to solve the problem exactly, the heuristic search technique described in Chapter 4 is extended to cope with the extended problem. Then, a method of reassigning I/O operations to communication buses dynamically during scheduling is presented. Last, some experimental results are shown and compared to the previous results for the restricted cases where only one value can be transferred on a communication bus in a single bus cycle. Here, it is still assumed that each value must be transferred as a whole. That is, a value will not be divided into a number of sub-values and then transferred in a number of control steps.

Figure 6.1: Illustration of communication slots, sub-buses, and sub-slots

6.1 Interchip Connection Synthesis

The same bus structure shown in Figure 4.6 will be used. For the sake of simpler control, a communication bus is logically divided into S sub-buses, each of which is composed of a number of bits, rather than into individual single-bit buses.
Each communication bus consists of $L$ communication slots, each of which corresponds to one communication bus usage in each control step group. Figure 6.1 illustrates the concept of communication slots, sub-buses, and sub-slots. The sub-bus $(h,s)$ denotes the $s$'th sub-bus of the communication bus $h$. The communication slot $(h,l)$ denotes the usage of the communication bus $h$ in the control step group (bus cycle) $l$. The sub-slot $(h,l,s)$ denotes the usage of the sub-bus $(h,s)$ in the control step group $l$.

An I/O value transferred across chips using a partial or whole communication bus can be divided conceptually into $S$ sub-components, each of which will be assigned to a sub-bus of a communication bus. Some of these sub-components can have zero bit widths, which means that the corresponding sub-buses are not assigned to the I/O transfer. Sub-components of a single value must be transferred at the same time by using some contiguous sub-buses. That is, an I/O transfer must be assigned to one or more contiguous sub-slots.

6.1.1 ILP Formulation

The subproblem can be formulated as an ILP, which consists of three classes of constraints: (1) assignment constraints, (2) data transfer constraints, and (3) resource constraints. Each class of constraints will be described in Sections 6.1.1.1, 6.1.1.2, and 6.1.1.3. Some of the constraints in the formulation are not in linear form. Linearization of these constraints will be discussed in Section 6.1.1.4. The following notation will be used in the formulation.

• $L$: Initiation rate of the design.
• $N$: Number of partitions.
• $R$: Maximum number of communication buses.
• $B_w$: Bit width of I/O operation $w$.
• $T_i$: Total number of pins used for data transfers (pins used for power and control lines excluded) in partition $P_i$.
• $W_i = \{ w \mid$ every I/O operation $w$ used to transfer a value to or from partition $P_i \}$.
• $W_v = \{ w \mid$ every I/O operation $w$ used to transfer the same value $v \}$.
• $V = \{ v \mid$ every value $v$ transferred across chips $\}$.
• $x_{w,h,l,s}$: A binary variable denoting whether a portion or the whole of I/O operation $w$ is assigned to the sub-slot $(h,l,s)$. $x_{w,h,l,s} = 1$ if a portion or the whole of I/O operation $w$ is assigned to the sub-slot $(h,l,s)$; $x_{w,h,l,s} = 0$ otherwise.
• $z_{w,h,l,s}$: An integer variable denoting the number of bits of I/O operation $w$ to be assigned to the sub-slot $(h,l,s)$.
• $bw_{h,s}$: An integer variable denoting the bit width of the sub-bus $(h,s)$.
• $r_{i,h}$: An integer variable denoting the width of the I/O port of partition $P_i$ connected to communication bus $C_h$. $r_{i,h} = 0$ if $C_h$ is not connected to $P_i$.

6.1.1.1 Assignment Constraints

The class of assignment constraints states that every I/O operation must be assigned to contiguous sub-slots, and no two I/O operations can be assigned to the same sub-slot unless they transfer the same value.

$$\sum_{h=1}^{R} \sum_{l=0}^{L-1} \max_{s=1}^{S} x_{w,h,l,s} = 1 \quad \forall w \in W \qquad (6.1)$$

The above constraints state that every I/O operation must be assigned to some sub-slots of one and only one communication slot. $\max_s x_{w,h,l,s} = 1$ denotes that I/O operation $w$ is assigned to some sub-slots of the communication slot $(h,l)$.

An I/O operation can only be assigned to some contiguous sub-slots. In other words, the bit vector $\langle x_{w,h,l,1}, x_{w,h,l,2}, \ldots, x_{w,h,l,S} \rangle$ can only contain at most one sequence of 1's. That is, there cannot be more than two $0 \rightarrow 1$ or $1 \rightarrow 0$ transitions in the bit vector $\langle 0, x_{w,h,l,1}, x_{w,h,l,2}, \ldots, x_{w,h,l,S}, 0 \rangle$. The beginning and ending zeroes are padded to exclude the illegal bit vectors with patterns of $\langle 1 \ldots 1 0 \ldots 0 1 \ldots 1 \rangle$, in which there are only two $0 \rightarrow 1$ or $1 \rightarrow 0$ transitions.
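The contiguity condition can be made concrete with a small sketch. The following Python fragment is only an illustration and is not part of the formulation; the function names `transitions` and `is_contiguous` are hypothetical. It counts the transitions in the zero-padded bit vector, using the fact that two consecutive bits differ exactly when their exclusive-OR is 1, and accepts a vector when the count is at most two.

```python
def transitions(bits):
    """Count 0->1 and 1->0 transitions in the zero-padded vector <0, bits..., 0>."""
    padded = [0] + list(bits) + [0]
    # The XOR of two consecutive bits is 1 exactly when they differ.
    return sum(a ^ b for a, b in zip(padded, padded[1:]))


def is_contiguous(bits):
    """True if the sub-slot assignment vector holds at most one run of 1's."""
    return transitions(bits) <= 2


if __name__ == "__main__":
    for vector in ([0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0], [1, 0, 0, 1]):
        print(vector, transitions(vector), is_contiguous(vector))
```

The last vector, [1, 0, 0, 1], shows why the padding matters: without the padded zeroes it would contain only two transitions and would be accepted, although it holds two separate runs of 1's.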
A $0 \rightarrow 1$ or $1 \rightarrow 0$ transition between two consecutive bits can be detected by exclusive-ORing these two bits. The technique of using exclusive-OR to detect transitions in a bit stream has been widely used in the field of testing. So, the following constraints are satisfied if and only if the bit vector $\langle x_{w,h,l,1}, x_{w,h,l,2}, \ldots, x_{w,h,l,S} \rangle$ contains at most one sequence of 1's.

$$x_{w,h,l,1} + \sum_{s=2}^{S} \left( x_{w,h,l,s-1} \oplus x_{w,h,l,s} \right) + x_{w,h,l,S} \le 2 \quad \forall w, h, l \qquad (6.2)$$

The following constraints state that no more than one I/O operation can be assigned to the same sub-slot.

$$\sum_{\forall w} x_{w,h,l,s} \le 1 \quad \forall h, l, s \qquad (6.3)$$

However, I/O operations transferring the same value can be assigned to the same sub-slots. In this case, Constraint 6.3 must be replaced by the following constraints.

$$\sum_{\forall v} \max_{w \in W_v} x_{w,h,l,s} \le 1 \quad \forall h, l, s \qquad (6.4)$$

However, if any sub-slot is assigned to two or more I/O operations transferring the same value, all of these I/O operations must be assigned to the same sub-slot(s). In other words, if the bitwise AND of the bit vectors $\langle x_{w,h,l,1}, \ldots, x_{w,h,l,S} \rangle$ and $\langle x_{w',h,l,1}, \ldots, x_{w',h,l,S} \rangle$, where $w$ and $w'$ transfer the same value, is not all zeroes, these two bit vectors must be identical. The integer variable $ov_{w,w',h,l}$ defined as follows can be used to check if any sub-slot of the communication slot $(h,l)$ is shared by the I/O operations $w$ and $w'$ transferring the same value.

$$ov_{w,w',h,l} = \max_{s} \left( x_{w,h,l,s} + x_{w',h,l,s} \right) \quad \forall v, h, l \;\; \forall w, w' \in W_v,$$

where $+$ indicates arithmetic add. $x_{w,h,l,s} + x_{w',h,l,s}$ can be 0, 1, or 2, since $x_{w,h,l,s}$ and $x_{w',h,l,s}$ can be either 0 or 1. So, $ov_{w,w',h,l}$ can have a value of 0, 1, or 2, and will have a value of 2 if the bitwise AND of these two bit vectors is not all zeroes.

The following constraints are used to force these two bit vectors to be identical if their bitwise AND is not all zeroes.
$$\left( ov_{w,w',h,l} \ge 2 \right) \Rightarrow \left( \sum_{s} \left( x_{w,h,l,s} \oplus x_{w',h,l,s} \right) = 0 \right) \quad \forall w, w' \in W_v \qquad (6.5)$$

6.1.1.2 Data Transfer Constraints

The class of data transfer constraints states that the sub-buses to which an I/O operation is assigned must provide a communication path wide enough for the I/O operation.

There is a relation between $x_{w,h,l,s}$ and $z_{w,h,l,s}$. That is, the number of bits of a value to be transferred on a sub-slot can be greater than zero if and only if the sub-slot is assigned to the I/O operation.

$$\left( z_{w,h,l,s} > 0 \right) \Leftrightarrow \left( x_{w,h,l,s} = 1 \right) \quad \forall w, h, l, s \qquad (6.6)$$

The number of bits to be transferred on a sub-bus cannot exceed the bit width of the sub-bus.

$$bw_{h,s} \ge \max_{\forall w, l} z_{w,h,l,s} \quad \forall h, s \qquad (6.7)$$

For every I/O operation, the total number of bits to be transferred must be equal to the bit width of the I/O operation.

$$\sum_{\forall h, l, s} z_{w,h,l,s} = B_w \quad \forall w \qquad (6.8)$$

Define

$$a_{i,h,s} = \max_{w \in W_i,\, l} z_{w,h,l,s},$$

which is the maximum number of bits of I/O operations over all bus cycles in partition $P_i$ assigned to the sub-bus $(h,s)$. $a_{i,h,s}$ is also the minimum number of I/O pins of partition $P_i$ connected to the sub-bus $(h,s)$. The following constraints state that the minimum number of I/O pins connected to a communication bus must provide a communication path for I/O operations assigned to the bus, based on the idea that if a partition has I/O pins connected to some sub-bus, all previous sub-buses must be connected to the partition.

$$\left( a_{i,h,s} > 0 \right) \Rightarrow r_{i,h} \ge \sum_{t=1}^{s-1} bw_{h,t} + a_{i,h,s} \quad \forall i, h, s \qquad (6.9)$$

6.1.1.3 Resource Constraints

The resource constraints ensure that for each partition the total number of pins required cannot exceed the total number of pins available.

$$\sum_{h=1}^{R} r_{i,h} \le T_i \quad \text{for } 0 \le i \le N \qquad (6.10)$$

6.1.1.4 Linearization

In the above formulation, some of the constraints are not in linear form. Here, linearization of these constraints will be discussed. The following notation will be used in the discussion.

• $B_x$: a binary variable.
• $I_x$: an integer variable, or an expression having an integer value.
• $C$: a variable having a value of 0, 1, or 2.
• $M$: a very large value.

A constraint with a maximum function of the form

$$B_x \ge \max_{i=1}^{n} B_i$$

can be linearized and replaced by the following constraints

$$B_x \ge B_i, \quad \text{for } 1 \le i \le n.$$

A constraint with a maximum function of the form

$$B_x = \max_{i=1}^{n} B_i$$

can be replaced by the following constraints

$$B_x \ge \max_{i=1}^{n} B_i, \text{ and } B_x \le \sum_{i=1}^{n} B_i.$$

The second inequality is used to force $B_x$ to be zero if all of the $B_i$ are zeroes.

A constraint with a minimum function of the form

$$B_x \le \min_{i=1}^{n} B_i$$

can be linearized and replaced by the following constraints

$$B_x \le B_i, \quad \text{for } 1 \le i \le n.$$

A constraint with a minimum function of the form

$$B_x = \min_{i=1}^{n} B_i$$

can be replaced by the following constraints

$$B_x \le \min_{i=1}^{n} B_i, \text{ and } B_x \ge \sum_{i=1}^{n} B_i - (n - 1).$$

The second inequality is used to force $B_x$ to be one if all of the $B_i$ are ones.

A constraint with an exclusive-OR function of the form

$$B_z = B_x \oplus B_y$$

can be replaced by

$$B_z = \max(B_x, B_y) - \min(B_x, B_y).$$

Constraint 6.5 has the form

$$(C \ge 2) \Rightarrow (I_x = 0),$$

which can be replaced by the following constraint

$$(2 - C) M \ge I_x.$$

For the cases where $C \le 1$, the above constraint will be satisfied automatically since the left-hand side will have a large value. For the cases where $C = 2$, the left-hand side of the above constraint will become zero, and $I_x$ will be forced to zero.

Constraint 6.6 has the form

$$(I_x > 0) \Leftrightarrow (B_x = 1),$$

which can be replaced by the following constraints.

$$I_x \le M B_x, \text{ and } I_x \ge B_x.$$

The first inequality implies that $B_x$ must be one if $I_x$ is greater than zero. The second inequality implies that $I_x$ must be greater than zero if $B_x$ is equal to one.

Constraint 6.9 has the form

$$(I_z > 0) \Rightarrow (I_x \ge I_y),$$

which can be replaced by

$$(I_z > 0) \Leftrightarrow (B_z = 1), \text{ and } (B_z = 1) \Rightarrow (I_x \ge I_y).$$

Then, the second implication can be replaced by

$$I_x \ge I_y - (1 - B_z) M.$$
For the cases where $B_z = 0$, the above inequality will be satisfied automatically since the right-hand side will have a negative value. For the cases where $B_z = 1$, the above inequality will be reduced to $I_x \ge I_y$.

6.1.2 Heuristic Search

Since the size of the formulation is even larger than that of the formulation of the restricted cases described in Chapter 4, the heuristic search technique described in the previous chapter has been extended to handle this case.

To reduce the search space and also the computation time, for our prototype program we assume that a communication bus can be divided into at most two sub-buses. So, an I/O operation can be assigned to the first sub-bus, the second sub-bus, or the whole bus.

The same cost function described previously will be used. At each stage of selecting the communication buses with least costs for an I/O operation, each unsplit communication bus is tentatively split into two sub-buses if the second sub-bus is wide enough for the I/O operation while at least one of the I/O operations previously assigned to the bus can fit in the first sub-bus after the bus is split. That is, a communication bus will be split only if the width of the bus is greater than or equal to the sum of the bit width of the I/O operation under consideration and the bit width of one of the previously assigned I/O operations.

The search space is also restricted by not allowing extension of the width of a communication bus to make two I/O operations able to share a communication slot. In other words, the width of a communication bus will be determined and set to the largest bit width of the I/O operations assigned to the bus.
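The splitting condition described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the prototype program itself: the function name `tentative_split` is hypothetical, and the second sub-bus is assumed to be given exactly the bit width of the I/O operation under consideration, with the remainder of the bus forming the first sub-bus.

```python
def tentative_split(bus_width, assigned_widths, new_width):
    """Try to split an unsplit bus for a new I/O operation.

    Returns (first_width, second_width, fitting), where `fitting` lists the bit
    widths of previously assigned operations that fit in the first sub-bus, or
    None when the bus should not be split.
    Assumption: the second sub-bus is exactly as wide as the new operation.
    """
    first_width = bus_width - new_width
    if first_width <= 0:
        return None
    # The bus is split only if at least one earlier operation fits in the rest.
    fitting = [w for w in assigned_widths if w <= first_width]
    if not fitting:
        return None
    return first_width, new_width, fitting


if __name__ == "__main__":
    # A 37-bit bus already carrying an 8-bit and a 37-bit operation,
    # considered for a new 29-bit operation.
    print(tentative_split(37, [8, 37], 29))
    # A 36-bit bus whose only operation needs all 36 bits cannot be split.
    print(tentative_split(36, [36], 8))
```

The condition `w <= bus_width - new_width` is the same as requiring the bus width to be at least the sum of the two bit widths, which is the test stated in the text.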
6.2 Reassignment of I/O Operations to Buses

For the simplified cases where no communication buses are split, a successful reassignment of an I/O operation to a communication bus is composed of a sequence of preemption operations, since an I/O operation can preempt at most one I/O operation. For the cases where some of the communication buses are split into two sub-buses, an I/O operation may require one or both components of a communication slot. So, during the preemption procedure, one I/O operation could preempt two I/O operations, each of which could, in turn, preempt two I/O operations. In addition, the I/O operation preemption could be done only if both of the preempted I/O operations could successfully preempt other I/O operations. This would require backtracking.

To reduce the search space and the computation time, we only allow one I/O operation to be preempted by another I/O operation, and prune the branches which require two I/O operations to be preempted at the same time. Then, the procedure is similar to that of finding an augmenting path in the bipartite matching problem, the same as in the simplified cases. The drawback is that the answer may be "NO" even if the reassignment of an I/O operation is actually possible. However, this does not seem to prune better solutions in most cases.

6.3 Experimental Results

For comparison, the same partitioned AR lattice filter in Figure 4.7 is used, and the same assumptions are made. The interchip connections and schedules for the partitioned AR filter with different initiation rates are shown in Figures 6.2 - 6.7, assuming that all I/O ports can be bidirectional and two values can be transferred on a bus in a control step. The initial and final bus assignments are shown in Tables 6.1 - 6.3. For the designs with initiation rates of 3 and 4, the communication bus $C_1$ is split into two sub-buses, one with 8 bits and the other with 29 bits.
For the design with an initiation rate of 5, in addition to the communication bus $C_1$ being split into two sub-buses, the communication bus $C_2$ is also split into two sub-buses, each with 18 bits.

The comparison among different bus assumptions is shown in Table 6.4. As we might expect, a smaller number of I/O pins is required if two values are allowed to be transferred on a communication bus at the same time. Because of the smaller numbers of I/O pins and communication buses used, the length of the pipeline may be longer as I/O operations compete for communication buses. One thing which needs to be mentioned is that the longer pipe length (15 stages) for the design in which two values are allowed to be transferred on a communication bus at the same time is not mainly caused by the smaller number of communication buses, but rather by the greedy heuristic used in list scheduling. A better scheduling result might have been obtained if a more advanced scheduling technique had been used. If +3 were postponed to control step 4, +11 and O1 would have been able to be scheduled in control steps 8 and 9, respectively. So, the length of the pipeline could be reduced to 10 stages, which is the same as the earlier design which did not share the bus cycles.

[Figure 6.2: Interchip connection for the AR filter with an initiation rate of 3]

[Figure 6.3: Interchip connection for the AR filter with an initiation rate of 4]

[Figure 6.4: Interchip connection for the AR filter with an initiation rate of 5]

[Figure 6.5: Schedule for the AR filter with an initiation rate of 3]

[Figure 6.6: Schedule for the AR filter with an initiation rate of 4]

[Figure 6.7: Schedule for the AR filter with an initiation rate of 5]

Table 6.1: I/O operation to bus assignment with an initiation rate of 3

Communication Bus   Initial Assignment   Final Assignment
C1'                 O1 O2 I8             O1 O2 I8
C1''                I7                   I7
C2                  X3 X5 X6             X1 X5 X6
C3                  X1 X2 X4             X2 X3 X4
C4                  Im Ip Iq             Ie Im Ip
C5                  Ik In Io             Ia Ik In
C6                  Ig Ii Il             Ig Ii Il
C7                  Ic Ih Ij             Ic Ih Ij
C8                  Id Ie If             Id If Iq
C9                  I9 Ia Ib             I9 Ib Io
C10                 I4 I5 I6             I2 I4 I6
C11                 I1 I2 I3             I1 I3 I5

Table 6.2: I/O operation to bus assignment with an initiation rate of 4

Communication Bus   Initial Assignment   Final Assignment
C1'                 O1 O2 I5 I8          O1 O2 I5 I8
C1''                I6 I7                I4 I7
C2                  X3 X4 X5 X6          X3 X4 X5 X6
C3                  X1 X2                X1 X2
C4                  Il Im Ip Iq          Ie Il Im
C5                  Ij Ik In Io          Ia Ij Ik
C6                  Ie If Ig Ii          Ig Ii Ip
C7                  Ia Ib Ic Ih          Ic Ih In
C8                  Id                   Id If Iq
C9                  I9                   I9 Ib Io
C10                 I1 I2 I3 I4          I1 I2 I3 I6

Table 6.3: I/O operation to bus assignment with an initiation rate of 5

Communication Bus   Initial Assignment   Final Assignment
C1'                 O1 O2 I3 I5 I8       O1 O2 I5 I8
C1''                I4 I6 I7             I1 I4 I7
C2'                 X3 X4 X5 X6 X1       X3 X4 X5 X6 X1
C2''                X2                   X2
C3                  Ii Il Im Ip Iq       Id If Ii Il
C4                  Ih Ij Ik In Io       I9 Ib Ih Ij
C5                  Id Ie If Ig          Ie Ig Im Ip Iq
C6                  I9 Ia Ib Ic          Ia Ic Ik In Io
C7                  I1 I2                I2 I3 I6

Table 6.4: Comparisons of number of pins required and pipe length for different assumptions

                    Bidirectional (No sharing)   Bidirectional (Sharing)
Initiation Rate     PinReq       PipeLen         PinReq       PipeLen
3                   97 87 87     11              89 87 87     11
4                   89 78 78     15              81 78 78     15
5                   81 70 70     10              81 52 52     15

Designs with fewer I/O pins would be obtained if a communication bus could be split into more than two sub-buses. For the test case with an initiation rate of 4, 8 I/O pins of partition $P_1$ would have been saved by eliminating communication bus $C_{10}$ (in Figure 6.3), if bus $C_1$ had been split into 4 sub-buses, each of which has at least 8 bits.
Bus $C_1$ could be used to transfer I1, I2, I3, and I4 in the same cycle, I5, I6, I7, and I8 in another cycle, and O1 and O2 in the other two cycles.

Chapter 7

Other Extensions

In this chapter, some extensions to our research are discussed. First, modeling of data recursive edges and problems arising due to data recursive edges are presented. If conditional branches exist across multiple partitions, some I/O operations will never be executed in the same execution instance since they are executed conditionally. Conditional sharing among these I/O operations is discussed in the next section. Then, modeling of time division I/O multiplexing is given. Last, handling a CDFG containing multiple-cycle operations is presented.

7.1 Data Recursive Edges

In general, an implicit outermost control loop is assumed on a CDFG. That is, the CDFG is performed repeatedly, each time for a new set of data. Each iteration of the CDFG is referred to as an execution instance. An operation in an execution instance may require a value generated by an operation in the current or a previous execution instance. This data dependency can be represented by an edge associated with a degree $d$ in the CDFG, where $d$ means that the value used by an operation is generated $d$ iterations earlier. An edge with a degree 0 represents data dependency in the same execution instance. We refer to an edge associated with a degree greater than zero as a data recursive edge.(1)

(1) This is equivalent to subscripted values in the DDS [KP85], with $y_{i-1}$ being of degree 1. Such information is explicit in signal flow graphs, where delays are shown.

[Figure 7.1: A partial CDFG with a data recursive edge]

For the partial CDFG shown in Figure 7.1, an input value to operation $op_a$ comes from the output of operation $op_b$ of the previous execution instance. This data dependency is denoted by an edge (indicated by a dashed line) from $op_b$ to $op_a$ with degree 1.
No input operation is required if an input value to an operation is produced by an operation of a previous execution instance in the same partition. The value produced in the previous execution instance can be stored for later use.

A maximum time constraint will be imposed on the two operations which have a data recursive edge connected between them, because a value must have been produced before it can be consumed. Assume that the initiation rate is $L$, and $op_a$ and $op_b$ of execution instance $i$ are scheduled in time steps $t_a$ and $t_b$, respectively. Then, $op_b$ of execution instance $i-1$ is performed in time step $t_b - L$. Shown in Figure 7.2 are two execution instances of the CDFG. In a valid schedule, $op_b$ of execution instance $i-1$ must be completed before $op_a$ of execution instance $i$ can be started. That is, $t_b - L < t_a$, or equivalently $t_b - t_a < L$, which can be viewed as a maximum time constraint between operations $op_a$ and $op_b$. In general, for a data recursive edge with degree $d$, the maximum time constraint will be $t_b - t_a < dL - (c_b - 1)$ if $op_b$ takes $c_b$ cycles to execute, and $t_b$ and $t_a$ indicate the time steps in which the operations begin execution.

[Figure 7.2: An example schedule of two execution instances]

[Figure 7.3: The CDFG representation for Expression (7.1)]

Some researchers [GVM89, HHL91] implicitly assume that there will be $d$ registers on the data transfer path corresponding to a data recursive edge with a degree $d$. So, a precedence edge from $op_a$ to $op_b$ must be added to prevent $op_b$ from being scheduled before $op_a$, to avoid an over-writing hazard where the new value $y_i$ generated by $op_b$ is written to the register before the old one is used by $op_a$. Such a precedence edge would be inappropriate for some cases where the old value $y_{i-1}$ is required after the new version $y_i$ is used. This can be shown by the following simple example:

$$y_i = a_i * b_i; \quad z_i = c_i * y_i + y_{i-1}.$$
(7.1)

The CDFG representation for the above expression is shown in Figure 7.3. If an edge from operation +1 to operation *1 is added, there will be a cyclic dependency among operations *1, *2, and +1. So, it will be impossible to schedule the CDFG unless some dataflow graph transformations are applied to the CDFG, such as splitting operation *1 into two operations, one for $y_i$, and the other for $y_{i-1}$. In our approach, no precedence edge from $op_a$ to $op_b$ will be added for a data recursive edge from $op_b$ to $op_a$. In this case, we may require more than $d$ registers on the corresponding data transfer path.

[Figure 7.4: An example showing no feasible scheduling. (a) A partial CDFG. (b) A possible interchip connection.]

For a CDFG containing maximum time constraints, there might exist no schedule if the interchip connections are synthesized without envisioning the effect of the maximum time constraints. This can be shown by the example CDFG in Figure 7.4(a). Figure 7.4(b) shows one possible interchip connection, where $C_1$ is the only communication bus on which I/O operations X and Y can transfer values. Suppose that I/O operations X and Y can only be scheduled in the same control step group due to the maximum time constraint and the minimum time constraint between them. The minimum time constraint is due to the precedence dependency among the operations between X and Y and resource constraints for these operations. There exists no schedule for the example given the interchip connection because I/O operations X and Y have to execute in the same control step group and cannot share the same communication bus, while $C_1$ is the only bus capable of transferring values for X and Y.
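The two conditions that interact in this example, the maximum time constraint of a data recursive edge and the membership of two I/O operations in the same control step group, can be written down directly. The following Python sketch is only illustrative; the function names are hypothetical, and time steps are assumed to be integers.

```python
def satisfies_max_time(t_a, t_b, L, d=1, c_b=1):
    """Check the maximum time constraint t_b - t_a < d*L - (c_b - 1).

    t_a, t_b: time steps in which op_a and op_b begin execution
    L: initiation rate, d: degree of the data recursive edge from op_b to op_a
    c_b: number of cycles op_b takes to execute
    """
    return t_b - t_a < d * L - (c_b - 1)


def same_control_step_group(t1, t2, L):
    """Two time steps fall in the same control step group (bus cycle)
    when they differ by a multiple of the initiation rate."""
    return (t1 - t2) % L == 0


if __name__ == "__main__":
    L = 4
    print(satisfies_max_time(t_a=2, t_b=5, L=L))          # 3 < 4
    print(satisfies_max_time(t_a=2, t_b=6, L=L))          # 4 < 4 fails
    print(satisfies_max_time(t_a=2, t_b=5, L=L, c_b=2))   # 3 < 3 fails
    print(satisfies_max_time(t_a=2, t_b=6, L=L, d=2))     # 4 < 8
    print(same_control_step_group(1, 9, L))               # 8 is a multiple of 4
```

Two I/O operations that `same_control_step_group` reports as sharing a group cannot use the same communication bus unless they transfer the same value, which is the conflict that leaves the example of Figure 7.4 without a schedule.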
One might attempt to develop a technique to predict, during interchip connection synthesis, whether any two I/O operations can be assigned to the same communication bus without excluding all feasible pipelined schedules for a CDFG containing a data recursive edge. Unfortunately, the problem of determining whether assigning any two I/O operations to the same communication bus will result in no pipelined schedule for a CDFG containing a data recursive edge is NP-complete. This can be proven by polynomially transforming a known NP-complete problem, precedence constrained scheduling, to the problem.

Definition 7.1 Precedence Constrained Scheduling (PCS) [GJ79]

Given a set of tasks $T = \{T_1, T_2, \ldots, T_n\}$, each taking 1 time unit, $M$ processors, a partial order $P$ on $T$, and a deadline $D$, is there an $M$-processor schedule for $T$ that meets the overall deadline $D$ and obeys the precedence constraints?

Theorem 7.1 Given a partitioned CDFG which contains data recursive edges, resource constraints for each partition, a partial interchip connection, and an assignment of any two I/O operations to the same communication bus, the problem (ASG) of determining if there exists a multi-chip pipelined schedule with a bus assignment which meets the given partial bus assignment is NP-complete.

Proof: (a) $ASG \in NP$ since it can be checked in polynomial time whether a schedule satisfies all given constraints.

(b) We transform an instance $p \in PCS$ to an instance $q \in ASG$. Let $T = \{T_1, T_2, \ldots, T_n\}$ be a set of tasks, $P$ be a partial order, $M$ be the number of processors, and $D$ be a deadline in $p$. We construct an instance $q \in ASG$, as shown in Figure 7.5, in which the numbers shown on the left are time steps for one possible schedule.
The instance is pipelined between partitions, but within partition $P_2$ execution is not pipelined. The instance contains

• a set of operations $V = V_1 \cup V_2 \cup IO$, where $V_1 = \{t_1, \ldots, t_{D+1}\}$, a set of operations in partition $P_1$, $V_2 = T$, a set of operations in partition $P_2$, and $IO = \{X, Y\}$, a set of I/O operations, where X is an I/O operation feeding a value forward to $P_2$ and Y is an I/O operation feeding the data recursive value back to $P_1$; all operations in $V$ take one control step to execute;
• a set of precedence edges $E = \{(Y, t_1), (t_1, t_2), \ldots, (t_D, t_{D+1}), (t_{D+1}, X)\} \cup E_X \cup P \cup E_Y$, where $(Y, t_1)$ is a data recursive edge with degree 2, $E_X = \{(X, T_1), \ldots, (X, T_n)\}$, a set of edges from X to all nodes in $V_2$, and $E_Y = \{(T_1, Y), \ldots, (T_n, Y)\}$, a set of edges from all nodes in $V_2$ to Y;
• 1 operator for $P_1$;
• $M$ operators for $P_2$;
• an initiation rate $L = D + 2$; and
• X and Y being assigned to the same communication bus.

[Figure 7.5: An instance in ASG transformed from an instance in PCS. Labels: time step; at least D+1 delays; time steps D+1, D+2, and 2D+3.]

Obviously, the above transformation can be done in polynomial time. Now, we want to prove that $p$ is satisfied if and only if $q$ is satisfied.

(i) Suppose that $q$ is satisfied. Then, there exists a schedule for $q$, in which $t_1$, X, and Y are scheduled in control steps $s_t$, $s_X$, and $s_Y$, respectively, such that

$$s_Y - s_X \ne L, \qquad (7.2)$$
$$s_X - s_t \ge D + 1, \text{ and} \qquad (7.3)$$
$$s_Y - s_t \le 2L - 1. \qquad (7.4)$$

The inequality 7.2 holds because of the conflict-free constraint on the communication bus to which X and Y are assigned. The inequality 7.3 holds because of the precedence constraints through $t_1, t_2, \ldots, t_{D+1}$, X. The inequality 7.4 holds because of the maximum time constraints imposed by the data recursive edge. From the above inequalities, we have $s_Y - s_X \le D + 1$. Hence, the maximum number of time steps allocated for $T$ in the schedule is $(s_Y - 1) - (s_X + 1) + 1 \le D$. So, $p$ is satisfied.
(ii) Suppose that $p$ is satisfied. Then, there is a schedule for $p$ with a deadline $D$. Let $s_k$ be the time step in which $T_k$ is executed in the schedule, with $1 \le s_k \le D$. So, we can let $V$ of $q$ be scheduled as follows:

• $t_1, \ldots, t_{D+1}$ are scheduled in time steps $s_X - (D + 1), \ldots, s_X - 1$;
• X is scheduled in time step $s_X$;
• $T_k \in V_2$ is scheduled in time step $s_X + s_k$; and
• Y is scheduled in time step $s_X + D + 1$.

The above schedule for $q$ satisfies the precedence constraints and resource constraints. The conflict-free constraint on the communication bus is also satisfied since I/O operations X and Y are not scheduled in the same control step group. ($(s_X + D + 1) - s_X = D + 1$ is not a multiple of $L = D + 2$.) The maximum time constraint imposed by the data recursive edge is also satisfied. ($(s_X + D + 1) - (s_X - (D + 1)) = 2(D + 1) < 2L$.) So, $q$ is satisfied.

ASG is NP-complete since $ASG \in NP$, and PCS is polynomially transformable to ASG. □

From the above theorem, we know that synthesizing an interchip connection which guarantees that a pipelined schedule exists for a partitioned CDFG containing data recursive edges is an intractable problem. However, a technique which can give a good approximate solution is still desirable.

7.2 Conditional I/O Operations

The behavioral description of a digital system usually contains conditional branches. For such a system, only the operations of selected branches will be executed in an execution instance. The operations on different conditional branches, which are never executed at the same time in an execution instance, can share hardware resources. In conditional resource sharing, a hardware module is shared among operations of mutually exclusive conditional branches executed in the same control step.
Conditional resource sharing is not applied between different execution instances since execution of selected conditional branches can hardly be predicted in advance, as stated earlier by Park [PP88].

For a partitioned CDFG in which a whole conditional block is put in a partition, all I/O operations will be executed in every execution instance. So, no conditional sharing among I/O operations is possible. In some cases, however, a conditional block can be too large to fit into a single partition because of area constraints. In these cases, a conditional block must be divided among several partitions, and only some of the I/O operations will be executed in an execution instance. In this case, conditional resource sharing should also be applied to I/O operations to achieve a better utilization of I/O pin resources. Conditional sharing of functional units has been reported by several researchers [PP88, HCDdA88, WY89]. Usually, mutually exclusive operations are identified before the task of conditional sharing begins. The coloring algorithm given by Park et al. [PP88] and the condition vector (CV) defined by Wakabayashi et al. [WY89] have been used to determine mutual exclusiveness between functional operations. These techniques can also be used to determine mutual exclusiveness between I/O operations. In some approaches [PP88, WY89], conditional sharing is performed during scheduling, while in Hwang's approach [HCDdA88], conditional sharing is predetermined before scheduling. In this section, the discussion will concentrate on conditional sharing among I/O operations, assuming that the mutual exclusiveness between I/O operations has already been determined by one of these techniques.

Conditional resource sharing among I/O operations is performed before interchip connection synthesis. A compatibility graph is used in the process of conditional resource sharing.
In the compatibility graph, a node represents a set of mutually exclusive I/O operations which are to share a communication slot. Two nodes are said to be compatible if and only if the I/O operations in the corresponding sets are all mutually exclusive and can be scheduled in the same control step in order to share a communication slot. Compatible nodes are connected by edges. The conditional sharing process consists of a series of steps, each combining two compatible nodes. The pair of nodes with the most promising "benefit" is combined in each iteration. Combining two compatible nodes means that the I/O operations in the corresponding sets are to be scheduled in the same control step to share a communication slot.

Each node is associated with two attributes:

1. a time frame,(2) $frame = \{asap, asap + 1, \ldots, alap\}$, which is a range of control steps in which the I/O operations in the corresponding set can be scheduled, assuming that these I/O operations are scheduled in the same control step in order to share a communication slot; and

2. a bus connection structure, $r = (r_0, r_1, \ldots, r_N)$, where $r_i$ is the width of the I/O port of partition $P_i$ connected to the bus, which is a communication bus to which the least number of I/O pins are connected such that the bus can still be used by these I/O operations, where $\max_{i=0}^{N} \{r_i\}$ is the bus width.

Each edge is also associated with a weight reflecting the benefit of having the I/O operations in the sets represented by the two end nodes share a communication slot. The benefit is computed in two steps.
First, the basic benefit $w(e)$ on an edge $e = (v_1, v_2)$, which represents the benefit of combining the two end nodes without taking into account possible exclusions of combinations of other nodes, is computed as follows:

$$w(e) = gain(e) - pf * penalty(e),$$

where

$$gain(e) = \sum_{i=0}^{N} \min(r_i(v_1), r_i(v_2))$$

is the total number of I/O pins which can be shared if nodes $v_1$ and $v_2$ are combined;

penalty(e) = [expression illegible in the source transcript]

is the percentage of freedom of scheduling lost due to the combination of nodes $v_1$ and $v_2$; and $pf$ is the weighting of the penalty factor specified by users. In the above equation, $\|frame\|$ denotes the number of control steps in $frame$.

(2) The same as freedom used in MAHA [PPM86].

Next, the modified weight $w'(e)$, which represents the benefit of combining the two end nodes while considering the first-order effect of possibly excluding combinations of other nodes, is computed in the following way. Let

$E_1 = \{e = (v_1, v) \mid v$ is neither $v_2$ nor connected to $v_2\}$,
$E_2 = \{e = (v_2, v) \mid v$ is neither $v_1$ nor connected to $v_1\}$,
$e_1 \in E_1 \wedge (\forall e \in E_1 (w(e_1) \ge w(e)))$, and
$e_2 \in E_2 \wedge (\forall e \in E_2 (w(e_2) \ge w(e)))$.

Then, the modified benefit $w'(e)$ is given by

$$w'(e) = w(e) - (\max(w(e_1), w(e_2)) + f * \min(w(e_1), w(e_2))).$$

The second term reflects the possible exclusion of other node combinations. Combining nodes $v_1$ and $v_2$ will not exclude the combinations $\{v, v_1\}$ and $\{v, v_2\}$ if $v$ is connected to both $v_1$ and $v_2$. However, the combination $\{v_a, v_1\}$ will be excluded by combining $v_1$ and $v_2$ if $v_a$ is not connected to $v_2$. Similarly, the combination $\{v_b, v_2\}$ will be excluded by combining $v_1$ and $v_2$ if $v_b$ is not connected to $v_1$. The combinations $\{v_a, v_1\}$ and $\{v_b, v_2\}$ may or may not be satisfied at the same time if $v_1$ and $v_2$ are not combined.

[Figure 7.6: An example showing exclusion of conditional sharing]
The possible exclusion of other combinations can be illustrated by the example shown in Figure 7.6, in which a dotted edge indicates the compatibility between the two end nodes, and an arrowed edge indicates the precedence relation between the two end nodes. The combination $\{v_2, v_3\}$ will exclude both combinations $\{v_1, v_3\}$ and $\{v_2, v_4\}$, while the combination $\{v_1, v_3\}$ will only exclude one of the combinations $\{v_2, v_3\}$ and $\{v_1, v_4\}$, and the exclusion will be due to the precedence constraints. So, $f$ is used to partially reflect these variations, and will have a value between zero and one. It can be given by the user to explore alternative solutions.

After combining nodes $v_1$ and $v_2$, the compatibility graph is updated by replacing nodes $v_1$ and $v_2$ by node $v'$, replacing each pair of edges $(v, v_1)$ and $(v, v_2)$ by edge $(v, v')$, and deleting all edges $(v_a, v_1)$ and $(v_b, v_2)$. The attributes of the combined node $v' = \{v_1, v_2\}$ are determined by

$$frame(v') = frame(v_1) \cap frame(v_2),$$

and

$$r_i(v') = \max(r_i(v_1), r_i(v_2)), \text{ for } 0 \le i \le N.$$

A heuristic procedure used for conditional resource sharing among I/O operations is summarized in Figure 7.7.

    1: Perform ASAP/ALAP
    2: Repeat until no more edge
    3:     Compute basic weight $w(e)$ for each edge $e$
    4:     Compute modified weight $w'(e)$ for each edge $e$
    5:     Select an edge $e = (v_1, v_2)$ having largest $w'(e)$
    6:     Update the compatibility graph by combining nodes $v_1$ and $v_2$
    7:     Compute attributes for the combined node $\{v_1, v_2\}$
    8:     Update ASAP/ALAP for predecessors and successors of $v_1$ and $v_2$

Figure 7.7: A heuristic procedure for conditional resource sharing among I/O operations

For a partitioned CDFG having $n$ conditional I/O operations, the procedure takes at most $n - 1$ iterations, since the number of nodes is decreased by one in each iteration. The most time consuming step in the loop is step 4, which has a time complexity of $O(n^3)$. So, the overall time complexity of the procedure is $O(n^4)$.
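The loop of Figure 7.7 can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis implementation: steps 1 and 8 (ASAP/ALAP computation and propagation) are omitted, the edge weight is passed in as a callable and defaults to the pin gain alone rather than $w'(e)$, and an edge to the combined node is kept only if the time frames still overlap. All names are hypothetical. Nodes are keyed by the frozenset of I/O operations they contain and carry a (frame, widths) pair.

```python
def gain(nodes, u, v):
    """Hypothetical stand-in weight: pins shared, sum_i min(r_i(u), r_i(v))."""
    return sum(min(a, b) for a, b in zip(nodes[u][1], nodes[v][1]))


def combine(nodes, adj, v1, v2):
    """Steps 6-7: replace v1 and v2 by one combined node; return its key."""
    frame = nodes[v1][0] & nodes[v2][0]                      # frame(v')
    widths = tuple(max(a, b)                                 # r_i(v')
                   for a, b in zip(nodes[v1][1], nodes[v2][1]))
    # Keep an edge (v, v') only if v was connected to both v1 and v2.
    # ASSUMPTION: additionally require that the time frames still overlap.
    common = {v for v in adj[v1] & adj[v2] if nodes[v][0] & frame}
    for old in (v1, v2):
        for neighbour in adj.pop(old):
            adj[neighbour].discard(old)
        del nodes[old]
    new = v1 | v2
    nodes[new] = (frame, widths)
    adj[new] = common
    for v in common:
        adj[v].add(new)
    return new


def conditional_sharing(nodes, adj, weight=gain):
    """Steps 2-7: merge the best edge until no edge is left; return iterations."""
    iterations = 0
    while True:
        # List each undirected edge once.
        edges = [(u, v) for u in adj for v in adj[u] if sorted(u) < sorted(v)]
        if not edges:
            return iterations
        u, v = max(edges, key=lambda e: weight(nodes, *e))
        combine(nodes, adj, u, v)
        iterations += 1


# Hypothetical example: three mutually compatible conditional I/O operations.
a, b, c = frozenset({"io1"}), frozenset({"io2"}), frozenset({"io3"})
nodes = {
    a: (frozenset({1, 2, 3}), (8, 8, 0)),
    b: (frozenset({2, 3}), (8, 0, 8)),
    c: (frozenset({3, 4}), (4, 0, 0)),
}
adj = {a: {b, c}, b: {a, c}, c: {a, b}}
iterations = conditional_sharing(nodes, adj)
```

With three operations the loop runs twice, matching the bound of at most $n - 1$ iterations, and ends with a single set whose frame is the intersection of the three original frames.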
After the conditional sharing process is completed, we have a number of disjoint sets of I/O operations. The I/O operations in a set can be scheduled in the same control step to share a communication slot. However, these I/O operations are not required to share a communication slot. The objective of the conditional sharing process is to provide a good set-division on conditional I/O operations such that the sharing among the I/O operations in a set does not affect the sharing among the I/O operations in the other sets. The final decision on which I/O operations are to share a communication slot is left to the procedure of synthesizing interchip connections, because interchip connection synthesis handles interchip connections in a more global way than the conditional sharing process, during which no unconditional I/O operations and I/O pin constraints are considered. In the procedure of interchip connection synthesis, the I/O operations in a set can be handled in the same way as I/O operations transferring the same value.

7.3 Time Division I/O Multiplexing

A value transferred across chips may have a large bit width. A large number of pins will be required to transfer the value as a whole. Sometimes, it would be desirable to split a value into several sub-values, each having a smaller number of bits, and to have these sub-values transferred in a number of cycles. In this case, I/O operations for such a value can be modeled with a split node, a merge node, and a number of I/O operation nodes, as shown in Figure 7.8(b). A split node splits the bits of a value into several sub-components, each containing a smaller number of bits, while a merge node merges several values into a value having a larger number of bits. Note that only one split node is required for several I/O operations transferring a value to several partitions.
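As a small illustration of what the split and merge nodes compute, the sketch below splits a 32-bit value into four 8-bit sub-values and merges them back. The function names, the equal-width split, and the least-significant-first ordering are assumptions chosen for the example; the thesis does not prescribe them.

```python
def split(value, width, parts):
    """Split a `width`-bit value into `parts` sub-values, least significant first."""
    sub_width = -(-width // parts)          # ceiling division
    mask = (1 << sub_width) - 1
    subs = [(value >> (i * sub_width)) & mask for i in range(parts)]
    return subs, sub_width


def merge(sub_values, sub_width):
    """Inverse of split: reassemble sub-values into one wider value."""
    value = 0
    for i, sub in enumerate(sub_values):
        value |= sub << (i * sub_width)
    return value


# A 32-bit value sent over 8 pins in 4 transfers instead of 32 pins in 1.
subs, sub_width = split(0xDEADBEEF, width=32, parts=4)
restored = merge(subs, sub_width)
```

Each element of `subs` corresponds to one I/O operation node between the split node and the merge node.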
Although the number of pins required may be reduced with time division I/O multiplexing, larger chip area and degraded performance can result because more register control signals are required to latch the multiplexed values and/or more control steps are needed. We have assumed that the designer decides which I/O operations are to be split and into how many components an I/O operation is split. Further study is required to develop a tool which could assist designers in making a time division I/O multiplexing decision, or even make the decision by itself.

[Figure 7.8: Model for multiplexed I/O operations. (a) Original Control Data Flow Graph; (b) Control Data Flow Graph after multiplexed I/O nodes inserted]

7.4 Multiple-cycle Operations

An operation can take more than one cycle to execute depending on the hardware module selected. Such an operation is called a multiple-cycle operation. Multiple-cycle operations can be pipelined or non-pipelined depending on the implementation of the hardware modules. In this section, non-pipelined multiple-cycle operations are assumed. In our approach, a multiple-cycle operation will not be chained with other operations. The reason is that the chaining may degrade the utilization of functional units. Suppose that operations +1 and *1 are chained, as shown in Figure 7.9, and that they are bound to functional units add1 and mul1, respectively. Add1 cannot be used for other operations in control steps 2, 3, and 4, during which operation *1 is executing, because the result of +1 is not latched. (Values can only be latched at the boundary of cycles.)

[Figure 7.9: An Example of Operation Chaining]

The execution of an m-cycle operation is different from the sequential execution of m single-cycle operations.
The m-cycle operation must be bound to a single functional unit, whereas the m single-cycle operations can be bound to different functional units. So, the lower bound for multiple-cycle operations, which is tighter than the bound given in BAD [KP90b, Kuc91] and MOSP [Jai90], is given by Equation 7.5 (of which only the final case, “undefined otherwise”, is legible in this transcript), where

$o_i$: the number of functional units of type $i$,
$n_i$: the number of operations of type $i$,
$L$: the initiation rate,
$m_i$: the number of cycles to execute an operation of type $i$.

Note that we cannot have a pipelined design with an initiation rate less than the largest number of cycles taken by an operation.

Even for a given number of modules satisfying Equation 7.5, the schedule may not be completed if multiple-cycle operations are scheduled inappropriately. For example, suppose there are three 2-cycle operations op1, op2, and op3 in a CDFG, and an initiation rate of 6 is assumed. From Equation 7.5, one functional unit is sufficient. However, if operations op1 and op2 are scheduled to start executing in control steps 0 and 3, respectively, operation op3 cannot be scheduled, since the functional unit is free only in control steps $s_2$ and $s_5$, which are not contiguous. This can be shown with an allocation wheel in Figure 7.10, in which the shaded cells indicate that a functional unit has been allocated in those control steps.

[Figure 7.10: An example of an allocation wheel]

Before scheduling a multiple-cycle operation, the program is required to check whether such an assignment will cause the number of resources to be insufficient for the rest of the operations. Initially, an allocation wheel of $L$ cells is associated with each multiple-cycle module. When a multiple-cycle operation is scheduled, the corresponding cells in an allocation wheel are marked used. As scheduling proceeds, the allocation wheel may contain many fragments of unused cells. The fragmentation can decrease the maximum number of operations to which a functional unit can be allocated.
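The allocation-wheel check can be sketched as below for a single functional unit executing operations that all take the same number of cycles. This is an illustrative sketch under assumptions, not the prototype's code, and every name in it is hypothetical. In particular, `assumed_lower_bound` is not the recovered Equation 7.5: it encodes one bound that is consistent with the surrounding text, namely that a non-pipelined unit can hold at most $\lfloor L / m_i \rfloor$ operations per initiation interval, giving $\lceil n_i / \lfloor L / m_i \rfloor \rceil$ units when $L \ge m_i$.

```python
def assumed_lower_bound(n_ops, cycles, L):
    """ASSUMED bound (not the thesis's Equation 7.5): ceil(n / floor(L / m))."""
    if L < cycles:
        return None                          # undefined: L is too small
    per_unit = L // cycles                   # operations one unit can hold
    return -(-n_ops // per_unit)             # ceiling division


def free_runs(wheel):
    """Lengths of maximal runs of free cells on a circular wheel (True = used)."""
    L = len(wheel)
    if not any(wheel):
        return [L]
    first_used = wheel.index(True)
    runs, run = [], 0
    for step in range(1, L + 1):             # walk once around the wheel
        if wheel[(first_used + step) % L]:
            if run:
                runs.append(run)
            run = 0
        else:
            run += 1
    return runs


def capacity(wheel, cycles):
    """How many more `cycles`-long operations still fit on this wheel."""
    return sum(run // cycles for run in free_runs(wheel))


def is_safe(wheel, start, cycles, remaining_after):
    """Tentatively place an operation; accept only if the rest still fits."""
    cells = [(start + k) % len(wheel) for k in range(cycles)]
    if any(wheel[c] for c in cells):
        return False
    trial = list(wheel)
    for c in cells:
        trial[c] = True
    return capacity(trial, cycles) >= remaining_after


# The example from the text: L = 6, three 2-cycle operations, op1 starts at step 0.
wheel = [True, True, False, False, False, False]
op2_at_3 = is_safe(wheel, start=3, cycles=2, remaining_after=1)   # strands op3
op2_at_2 = is_safe(wheel, start=2, cycles=2, remaining_after=1)   # leaves steps 4-5
```

Starting op2 in control step 3 leaves only the isolated cells 2 and 5 free, so the check rejects it; starting op2 in step 2 leaves the contiguous cells 4 and 5 for op3.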
If fragmentation caused by a tentative assignment causes the resources to be insufficient for the rest of the operations, the assignment has to be postponed until it is safe.

7.5 Conclusion

Of the topics discussed in this chapter, data recursive edges and multiple-cycle operations have been implemented in our prototype program. However, much more work needs to be done on the first three topics, namely data recursive edges, conditional I/O operations, and time division I/O multiplexing. This chapter has described the nature of the problems and spelled out some tentative approaches.

Chapter 8

Conclusions and Future Work

8.1 Contributions

In this thesis, we have addressed some problems which arise during the synthesis process of multiple-chip systems. We have also presented some techniques to synthesize multi-chip digital systems. These techniques have been implemented in our prototype software.

A class of partitioning, called simple partitioning, is identified. We have shown that the interchip communication problem of synthesizing multi-chip pipelined systems with a simple partitioning can be reduced to a pin allocation problem, which can be solved simultaneously during the scheduling process. The pin allocation problem has been formulated as an ILP. The size of the formulation is small enough to be able to solve the pin allocation problem during the scheduling process.

Due to the complexity of the problem, for systems with a more general partitioning, the problem is divided into two subproblems: (1) interchip connection synthesis, and (2) scheduling. The techniques to solve these subproblems in either order have been discussed in Chapters 4 and 5. For the approach in which interchip connections are synthesized before scheduling, the problem of interchip connection synthesis has been formulated as an ILP, and solved by a heuristic search technique.
We have also proven that synthesizing an interchip connection which guarantees that a pipelined schedule exists for a partitioned CDFG containing data recursive edges is an intractable problem.

For the other approach, in which interchip connections are synthesized after scheduling, interchip connection synthesis can be modeled as a maximum-gain clique partitioning problem. A heuristic clique partitioning technique has been developed by taking advantage of the special properties of the compatibility graph for the problem.

I/O pin resources can be utilized more effectively if a communication bus is allowed to transfer more than one value at the same time by using different portions of the bus. The interchip connection synthesis problem for these cases has also been formulated as an ILP. The techniques described in Chapter 4 have also been extended to deal with these cases.

We also point out some future research directions, and have presented a tentative technique for sharing communication buses among conditional I/O operations.

8.2 Future Work

The work discussed in this thesis is just the beginning. There is still more work to be done.

As the experimental results in Chapter 5 show, the list scheduling technique implemented in our prototype software does not work well for pipelined scheduling, and usually produces inferior schedules. Better scheduling results could be obtained by simply replacing the list scheduling by a more advanced scheduling technique, such as PLS [HHL91], which can be easily adapted to the prototype program.

The number of pins required may be reduced by using time division I/O multiplexing. However, larger chip area and degraded performance can result because more register control signals are required to latch the multiplexed values and/or more control steps are needed. The tradeoff among the number of I/O pins required, chip area, and system performance requires further study.
For the approach in which interchip connection is synthesized before scheduling, the decisions made in the process of interchip connection synthesis will impose constraints on scheduling, and so affect the quality of scheduling results. Furthermore, the constraints imposed by interchip connection synthesis can exclude all feasible pipelined schedules for a design with maximum time constraints. It is desirable to develop some techniques which can give a good prediction of the impacts on scheduling during interchip connection synthesis.

As an alternative approach to obtain better designs, interchip connection synthesis and scheduling should be performed simultaneously. Some scheduling techniques [NK90, PK91] are good candidates to be integrated with interchip connection synthesis, since these techniques are iterative improvement methods, in which tentative interchip connections can be determined and improved during the refinement process.

Since very little structural information is available during behavioral-level partitioning, partitioners make decisions based on some sort of predictions, which may not be accurate enough. So, the tasks of partitioning and synthesizing would likely be repeated a number of times. It would be desirable if useful information from the synthesis tools could be fed back to guide the behavioral-level partitioner. Further study is required to define some measurements which are useful for a partitioner to produce a better partitioning.

Reference List

[Bar81] M. Barbacci, “Instruction Set Processor Specification (ISPS): The Notation and its Applications,” IEEE Transactions on Computers, C-30(1), pp. 24-40, Jan. 1981.
[CK86] R. Camposano and A. Kunzmann, “Considering Timing Constraints in Synthesis from a Behavioral Description,” Proceedings IEEE International Conference Computer Design, pp. 6-9, Oct. 1986.
[GM90] R. Gupta and G. De Micheli, “Partitioning of Functional Models of Synchronous Digital Systems,” Proceedings of International Conference on Computer-Aided-Design, pp. 216-219, Nov. 1990.
[GE91] C. H. Gebotys and M. I. Elmasry, “Simultaneous Scheduling and Allocation for Cost Constrained Optimal Architectural Synthesis,” Proceedings of the 28th Design Automation Conference, pp. 2-7, June 1991.
[Gebo92] C. H. Gebotys, “Optimal Synthesis of Multichip Architectures,” Proceedings of International Conference on Computer-Aided-Design, pp. 238-241, Nov. 1992.
[GJ79] M. R. Garey and D. S. Johnson, “Computers and Intractability: A Guide to the Theory of NP-Completeness,” W. H. Freeman and Company, 1979.
[Girc84] E. Girczyc, “Automatic Generation of Microsequenced Data Paths to Realize ADA Circuit Descriptions,” PhD thesis, Carleton University, Jul. 1984.
[Gomo60] R. Gomory, “All-Integer Integer Programming Algorithm,” IBM Research Center Report RC-189, January 1960; also in Industrial Scheduling (eds.: Muth and Thompson), Englewood Cliffs, N.J., Prentice-Hall, 1963.
[GVM89] G. Goossens, J. Vandewalle, and H. DeMan, “Loop Optimization in Register-Transfer Scheduling for DSP-systems,” Proceedings of the 26th Design Automation Conference, pp. 826-831, June 1989.
[HCDdA88] K. S. Hwang, A. E. Casavant, M. Dragomirecky, and M. A. d’Abreu, “Constrained Conditional Resource Sharing in Pipeline Synthesis,” Proceedings of International Conference on Computer-Aided-Design, pp. 52-55, Nov. 1988.
[HCLH90] C. Huang, Y. Chen, Y. Lin, and Y. Hsu, “Datapath Allocation Based on Bipartite Weighted Matching,” Proceedings of the 27th Design Automation Conference, pp. 499-504, June 1990.
[HH90] L. Hafer and E. Hutchings, “Bringing up Bozo,” Tech Report CMPT TR 90-2, School of Computer Science, Simon Fraser University, Burnaby, B.C. V5A 1S6, March 1990.
[HHL90] C. Hwang, Y. Hsu, and Y. Lin, “Optimum and Heuristic Data Path Scheduling Under Resource Constraints,” Proceedings of the 27th Design Automation Conference, pp. 65-70, June 1990.
[HHL91] C. Hwang, Y. Hsu, and Y. Lin, “Scheduling for Functional Pipelining and Loop Winding,” Proceedings of the 28th Design Automation Conference, pp. 764-769, June 1991.
[HP83] L. Hafer and A. C. Parker, “A Formal Method for the Specification, Analysis, and Design of Register-Transfer Level Digital Logic,” IEEE Transactions on Computer-Aided Design, CAD-2(1), pp. 4-18, January 1983.
[HS71] A. Hashimoto and J. Stevens, “Wire Routing By Optimizing Channel Assignment Within Large Apertures,” Proceedings of the 8th Design Automation Workshop, pp. 155-169, June 1971.
[Jai90] R. Jain, “MOSP: Module Selection for Pipelined Designs with Multiple-Cycle Operations,” Proceedings of International Conference on Computer-Aided-Design, pp. 212-215, Nov. 1990.
[KM90] D. Ku and G. De Micheli, “High-level Synthesis and Optimization Strategies in Hercules and Hebe,” EURASIC, Proceedings of the European Conference on ASIC design, May 1990.
[KP85] D. Knapp and A. C. Parker, “A Unified Representation for Design Information,” Proceedings of the IFIP Conference on Hardware Description Languages, Aug. 1985.
[KP90a] K. Küçükçakar and A. C. Parker, “MABAL - A Software Package for Module And Bus Allocation,” International Journal of Computer Aided VLSI Design, pp. 419-436, Nov. 1990.
[KP90b] K. Küçükçakar and A. C. Parker, “BAD: Behavioral Area-Delay Predictor,” Tech Report 90-31, University of Southern California, Nov. 1990.
[KP91] K. Küçükçakar and A. C. Parker, “CHOP: A Constraint-Driven System Level Partitioner,” Proceedings of the 28th Design Automation Conference, pp. 514-519, June 1991.
[Kuc91] K. Küçükçakar, “System-Level Synthesis Techniques With Emphasis On Partitioning And Design Planning,” PhD Dissertation, University of Southern California, Oct. 1991.
[Kun84] S. Y. Kung, “On Supercomputing with Systolic/Wavefront Array Processor,” Proceedings of the IEEE, pp. 867-884, July 1984.
[KWK85] S. Y. Kung, H. J. Whitehouse, and T. Kailath, “VLSI and Modern Signal Processing,” Prentice-Hall, pp. 257-276, 1985.
[LHL89] J. Lee, Y. Hsu, and Y. Lin, “A New Integer Linear Programming Formulation for the Scheduling Problem in Data Path Synthesis,” Proceedings of International Conference on Computer-Aided-Design, pp. 20-23, Nov. 1989.
[LIN87] “LINDO - Users Manual for Linear, Integer and Quadratic Programming with LINDO,” LINDO Systems, Inc., 1987.
[LP91] D. A. Lobo and B. M. Pangrle, “Redundant Operator Creation: A Scheduling Optimization Technique,” Proceedings of the 28th Design Automation Conference, pp. 775-778, June 1991.
[McF78] M. C. McFarland, “The Value Trace: A Data Base for Automated Digital Design,” Master’s thesis, Dept. of Electrical Engineering, Carnegie-Mellon University, Dec. 1978.
[MK88] G. De Micheli and D. C. Ku, “HERCULES: A System for High-Level Synthesis,” Proceedings of the 25th Design Automation Conference, pp. 483-488, June 1988.
[MPC88] M. C. McFarland, A. C. Parker, and R. Camposano, “Tutorial on High-Level Synthesis,” Proceedings of the 25th Design Automation Conference, pp. 330-336, June 1988.
[NK90] J. A. Nestor and G. Krishnamoorthy, “SALSA: A New Approach to Scheduling with Timing Constraints,” Proceedings of International Conference on Computer-Aided-Design, pp. 262-265, Nov. 1990.
[NT86] J. A. Nestor and D. E. Thomas, “Behavioral Synthesis with Interfaces,” Proceedings of International Conference on Computer-Aided-Design, pp. 112-115, Nov. 1986.
[PG86] B. M. Pangrle and D. D. Gajski, “State Synthesis and Connectivity Binding for Microarchitecture Compilation,” Proceedings of International Conference on Computer-Aided-Design, pp. 210-213, Nov.
[PK89] P. G. Paulin and J. P. Knight, “Force-Directed Scheduling for the Behavioral Synthesis of ASIC’s,” IEEE Transactions on Computer-Aided Design, vol. 8, pp. 661-679, June 1989.
[PK90] C. A. Papachristou and H. Konuk, “A Linear Program Driven Scheduling and Allocation Method Followed by an Interconnect Optimization Algorithm,” Proceedings of the 27th Design Automation Conference, pp. 77-83, June 1990.
[PK91] I. Park and C. Kyung, “Fast and Near Optimal Scheduling in Automatic Data Path Synthesis,” Proceedings of the 28th Design Automation Conference, pp. 680-685, June 1991.
[PP88] N. Park and A. C. Parker, “Sehwa: A Software Package for Synthesis of Pipelines from Behavioral Specifications,” IEEE Transactions on Computer-Aided Design, vol. 7, pp. 356-370, March 1988.
[PPM86] A. C. Parker, J. Pizarro, and M. J. Mlinar, “MAHA: A Program for Datapath Synthesis,” Proceedings of the 23rd Design Automation Conference, pp. 461-466, June 1986.
[PR91] M. Potkonjak and J. Rabaey, “Optimizing Resource Utilization Using Transformations,” Proceedings of International Conference on Computer-Aided-Design, pp. 88-91, Nov. 1991.
[PS82] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, pp. 326. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1982.
[Spr91] D. L. Springer, “Coloring and Clique Partitioning for Data Path Allocation,” PhD thesis, Carnegie Mellon University, May 1991.
[Tri87] H. Trickey, “Flamel: A High-Level Hardware Compiler,” IEEE Transactions on Computer-Aided Design, vol. 6, pp. 259-269, March 1987.
[TS83] C. Tseng and D. P. Siewiorek, “Facet: A Procedure for the Automated Synthesis of Digital Systems,” Proceedings of the 20th Design Automation Conference, pp. 490-496, June 1983.
[VHDL88] IEEE Standard VHDL Language Reference Manual. The Institute of Electrical and Electronics Engineers Inc., March 1988.
[WS89] N. Woo and H. Shin, “A Technology-Adaptive Allocation of Functional Units and Connections,” Proceedings of the 26th Design Automation Conference, pp. 602-605, June 1989.
[WY89] K. Wakabayashi and T. Yoshimura, “A Resource Sharing and Control Synthesis Method for Conditional Branches,” Proceedings of International Conference on Computer-Aided-Design, pp. 62-65, Nov. 1989.
Asset Metadata
Core Title
00001.tif
Tag
OAI-PMH Harvest
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC11255770
Unique identifier
UC11255770
Legacy Identifier
DP22849