SYSTEM-LEVEL DESIGN TECHNIQUES AND TOOLS FOR SYNTHESIS OF APPLICATION-SPECIFIC DIGITAL SYSTEMS

by

Chih-Tung Chen

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

August 1994

Copyright 1994 Chih-Tung Chen

UMI Number: DP22879

All rights reserved

INFORMATION TO ALL USERS
The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author.
Microform Edition © ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code.

ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Chih-Tung Chen under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY

Dedication

To my wife, my parents and my son.

Acknowledgements

I take this opportunity to express my sincere appreciation to my advisor, Professor Alice Parker, for her guidance and support through the years. She was always there to help me with research and personal problems and to motivate me when I was discouraged. It is my fortune to have known her.
I would like to thank Professors Ming-Deh Huang and Sandeep Gupta for serving on my dissertation committee, and Professors Melvin Breuer, Michel Dubois, and Dennis Mcleod for serving on my guidance committee. Their comments and suggestions were most valuable.

I also thank all my friends and colleagues at USC, past and present, for their friendship and various help. In particular, I like to mention Yung-Hua Hung, Jen-Pin Weng, Sen-Pin Lin, Pravil Gupta, Charles Njinda, Diogenes Silva Jr., Atul Ahuja, Shiv Prakash and Kayhan Kucukcakar.

I especially thank my parents for their continuous love and support. They give me their complete trust and accept so naturally that their son is still a student for countless years. To my beloved child, Phillip, your interruption made the time more enjoyable and kept my goal in perspective. Finally, I would like to thank my wife, Shu-Mei, for her patience, support and sacrifices throughout my graduate study. She has given up her most precious years in her life to give her husband a chance to achieve his selfish goal. I gratefully dedicate this thesis to her.

I would like to acknowledge the financial support from the Advanced Research Projects Agency under Contract No. J-FBI-90092, the Department of Air Force, the Department of Army and the Department of Navy under Contract No. N00039-87-C-0194, and the National Science Foundation under Contract No. GER-9023979.

Contents

Dedication
Acknowledgements
List Of Figures
List Of Tables
Abstract

1 Introduction
    1.1 Background
    1.2 Behavioral Synthesis
    1.3 Motivations
    1.4 Problem Approach
    1.5 Thesis Organization

2 Related Research
    2.1 Design Specification and Representation
    2.2 System Partitioning
    2.3 Multiple-Process Synthesis
    2.4 Design Verification

3 System Specification, Representation and Translation
    3.1 Translation of VHDL Processes into DDS
        3.1.1 Problem Statement
        3.1.2 VHDL Translation Approach
            3.1.2.1 Control Flow Analysis
            3.1.2.2 Local Data Flow Analysis
            3.1.2.3 Global Data Flow Analysis
            3.1.2.4 Graph Generation
            3.1.2.5 Graph Optimization
    3.2 Array and Input/Output Modeling
        3.2.1 Arrays
        3.2.2 Input/Output and Inter-Process Communication
            3.2.2.1 Types of Communication
            3.2.2.2 Modelling Communication in VHDL and DDS
    3.3 Experiments
    3.4 Summary

4 System-Level Partitioning of Application-Specific Digital Systems
    4.1 Introduction
    4.2 Problem Approach
    4.3 An MILP Partitioning Method
        4.3.1 Overview of the Partitioning Model
        4.3.2 Notation
        4.3.3 Objective Function
        4.3.4 Constraints
            4.3.4.1 Process to Chip Assignment
            4.3.4.2 Chip Package Selection
            4.3.4.3 Inter-chip Connection
            4.3.4.4 Pin Capacity Constraints
            4.3.4.5 Area Capacity Constraints
            4.3.4.6 Timing Constraints
            4.3.4.7 Exploration of Design Alternatives
        4.3.5 Linearization
    4.4 Experiments
        4.4.1 A Mobile Phone System Example
        4.4.2 A Powertrain Control System Example
        4.4.3 A JPEG Image Compression System Example
    4.5 Extensions
        4.5.1 Tradeoffs of Communication Delay and Hardware
        4.5.2 A Genetic-Search Partitioning Method
    4.6 Summary

5 Synthesis of Systems with Unbounded-Delay Operations and Communicating Processes
    5.1 Introduction
    5.2 Overview of Relative Scheduling
    5.3 Synthesis of Single-Threaded Processes
        5.3.1 Scheduling Approach
        5.3.2 Timing Constraints
        5.3.3 Resource Allocation
        5.3.4 Control Scheme
        5.3.5 An ILP Method
        5.3.6 Experiments
    5.4 Synthesis with Multiple Processes
        5.4.1 Communication Model
        5.4.2 Communication Feasibility
        5.4.3 Synchronization of Blocking Events
        5.4.4 Synchronization of Non-blocking Events
        5.4.5 ILP modification
        5.4.6 Experiments
    5.5 A Heuristic Approach for Multi-Process Synthesis
    5.6 Summary

6 Verification of Synthesized RTL Designs
    6.1 Problem Statement
        6.1.1 Properties of the Synthesized RTL Designs
    6.2 Approach Overview
    6.3 Hybrid Symbolic/Numeric Simulation
        6.3.1 Element Evaluation
            6.3.1.1 Data Path
            6.3.1.2 Controller
        6.3.2 Representation of Symbolic Data
        6.3.3 Examples
    6.4 Graph-Based Behavior Comparison
        6.4.1 The Isomorphic Property
        6.4.2 A Graph Matching Procedure
    6.5 Experiments
    6.6 Analysis of the General RT-Level Design Verification Problem
    6.7 Summary

7 Conclusion
    7.1 System Partitioning
    7.2 Multiple-Process Synthesis
    7.3 RTL Design Verification
    7.4 Other Contributions

Appendix A
    The VHDL Subset of the ADAM System

List Of Figures

1.1 A schedule for a CDFG
1.2 An RTL data-path implementation of the behavior in Figure 1.1
1.3 System synthesis process
3.1 A sample VHDL description
3.2 The parallelization during the VHDL to DDS translation
3.3 The steps to translate a VHDL process into DDS
3.4 The flow graph of the one-counter example
3.5 The data flow graph of the basic block example
3.6 Meta blocks for if-then-else and while-loop structures
3.7 A conditional assignment
3.8 Global data flow analysis of the one-counter example
3.9 The DDS template of an if-then-else structure
3.10 The DDS template of a while-loop structure
3.11 The DFG and CTG of the ones-counter example
3.12 The array read and write operations
3.13 A VHDL example with two processes
3.14 The communication model in DDS
3.15 The VHDL description of a square root function
3.16 The data flow graph of the square root function
3.17 The VHDL statements of the robot arm controller example
3.18 The data flow graph of the robot arm controller
3.19 The VHDL description of DCT
3.20 The data flow graph of DCT produced by VHDL2DDS
4.1 An ASIC-based CCITT H.261 video decoder from Bellcore Inc.
4.2 Selection of process implementations
4.3 Overview of ProPart's partitioning approach
4.4 Overview of the process-level partitioning parameters
4.5 Trimming the design space to be explored using local performance and area constraints
4.6 Approximating the design space using piece-wise linear equations
4.7 A multiple-process system to be partitioned
4.8 The approximated area/delay curves of processes p1, p2 and p3
4.9 A two-chip partitioning for the mobile phone example
4.10 An example derived from the GM powertrain application
4.11 A three-chip partitioning of the powertrain example
4.12 Design flow for the JPEG system example
4.13 Decomposing a 2D-DCT into repeated row-column 1D-DCTs
4.14 1D-DCT RTL designs from MABAL
4.15 Parameters used by ProPart
4.16 A three-chip partitioning of the JPEG system
4.17 The parameter coding of the JPEG example
5.1 The execution model of relative scheduling
5.2 An example of the counter-based control in relative scheduling
5.3 An example of anchor ordering in a process description
5.4 The scheduling of the anchor set of a single-threaded process
5.5 A minimum timing constraint satisfied under an unbounded-delay operation
5.6 An unsatisfiable maximum timing constraint
5.7 Scheduling restrictions imposed by Uy
5.8 An implementation of a simple unbounded control step
5.9 A single-threaded constraint graph
5.10 A schedule with minimum number of control steps
5.11 A schedule which requires minimum resources
5.12 Modeling of an inter-process communication event
5.13 The synchronization of a non-blocking communication event
5.14 A Two-Process Example
5.15 A schedule with a minimum number of control steps
5.16 The solution obtained by minimizing the total resources
6.1 A high-level synthesis model
6.2 An example of the links between CDFGs and its synthesized RTL design I
6.3 An overview of our approach for checking synthesized RTL designs
6.4 An example of symbolic simulation
6.5 Typical flow of event-driven simulation
6.6 Execution paths of a state-transition graph
6.7 A four-function ALU behavioral model
6.8 An example of a synthesized RTL design without conditional branches
6.9 The simulation result of the single-path example
6.10 An example of a synthesized RTL design with conditional branches
6.11 The simulation result of the multiple-path example
6.12 An example of extracting DFGss from a CDFGs
6.13 A vertex in Ci has no correspondence in Cs
6.14 An example to show the isomorphic property between the cones of two corresponding outputs of DFGs and DFGi
6.15 An example of matching operations and values between two cones
6.16 The cones of two output values obtained from the hybrid simulation of a non-pipelined AR filter
6.17 The data flow graph of the AR filter example
6.18 The verification model and its relationships with the functional and RTL models

List Of Tables

3.1 The use/definition table of the basic block example
3.2 Calculation of input and output sets for the meta blocks
3.3 A partial list of examples translated by VHDL2DDS
4.1 Edge parameters for the mobile phone example
4.2 Package Library 1
4.3 Design points chosen for processes p1, p2 and p3
4.4 Package Library 2
4.5 The parameters of the powertrain example
4.6 Design points chosen for the single-chip partitioning of the powertrain example

Abstract

Previously, most behavioral synthesis tools emphasized the synthesis of single-chip, single-process designs. However, application-specific digital systems today are usually too large to fit on a single chip and consist of multiple concurrent processes. This thesis addresses two important issues in the design of such systems, namely system partitioning and the synthesis of concurrent processes. The system partitioning methodology presented here partitions the system at the process level onto multiple chips while considering chip packaging options as well as the potential design alternatives for each process in the system.
Synthesis techniques for concurrent processes are also introduced so that not only can the performance and area constraints of each individual process be met, but the communication among the processes can also be synchronized.

In this thesis, we also address the issue of design verification. Although synthesized designs are often considered to be correct by construction, in reality there is no such guarantee unless the whole synthesis process, including techniques and programs, can be formally validated. Hence, we developed an efficient verification approach using a hybrid symbolic/numeric simulation to check both the functional and timing correctness of synthesized RTL designs.

Finally, a VHDL-to-DDS compiler, which transforms a system specification written in the VHDL hardware description language into a synthesizable representation for the ADAM synthesis system, is also included here.

Chapter 1
Introduction

1.1 Background

VLSI technology has reached densities of over one million transistors per chip, and VLSI chips are being used in more diverse applications than ever before. At the same time, the life cycles of electronic products are rapidly shortening. An effective way to deal with the increasing complexity of designs while reducing the design turnaround time is to apply design automation at more of the levels of abstraction at which circuits are designed. Today, design automation tools for layout generation, placement and routing, module generation, simulation, and logic synthesis have become fairly reliable and widely available. As market pressure constantly demands higher-level design tools, automation of the entire design process from conceptualization to silicon has become an important goal and an intensive research field in the last decade.
The benefits of such a methodology include not only shorter design time, but also ease of modification of the design specifications and the ability to explore different design tradeoffs more effectively.

This thesis addresses the issues that arise in behavioral and system-level synthesis of application-specific digital systems.

1.2 Behavioral Synthesis

Behavioral synthesis, also known as high-level synthesis, is a process which takes a behavioral specification of a digital system along with a set of constraints and goals on the resulting hardware and finds a structure that realizes the given behavior while satisfying the given goals and constraints. The behavior is usually described algorithmically in some hardware description language (HDL) such as VHDL [Ins88], ISPS [Bar81], or HardwareC [MK88]. The structure is a register-transfer level (RTL) implementation which includes a data path as well as a control path. The data path is a network of registers, functional units, multiplexers, and buses. The control path, on the other hand, is usually a finite-state machine which drives the data path in order to produce the required behavior.

Due to its complexity, behavioral synthesis is often divided into several distinct yet inter-dependent tasks [MPC88]. First, the behavioral specification must be translated, and possibly optimized, into an internal representation, generally referred to as a control data flow graph (CDFG), that models both the control and data flow of the input behavior. Next, scheduling, allocation and binding are performed to map the CDFG into structure. Scheduling discretizes the execution of operations in the CDFG by assigning them to control steps as shown in Figure 1.1. The allocation task determines the number and types of required hardware resources including functional units, storage elements and interconnection paths.
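The scheduling and allocation tasks just described can be illustrated with a minimal sketch. It is not the scheduling method used in this thesis: it assumes a simple as-soon-as-possible (ASAP) policy, single-step operations, and a hypothetical four-operation data flow graph, and the names `asap_schedule` and `allocate` are introduced here for illustration only.

```python
from collections import defaultdict


def asap_schedule(ops, deps):
    """Assign each operation to the earliest control step its inputs allow.

    ops  : dict mapping operation name -> functional-unit type ("mul", "add")
    deps : dict mapping operation name -> operations whose results it consumes
    Assumption: every operation takes exactly one control step.
    """
    step = {}

    def visit(op):
        if op not in step:
            preds = deps.get(op, [])
            # One step after the latest predecessor; step 1 if there is none.
            step[op] = 1 + max((visit(p) for p in preds), default=0)
        return step[op]

    for op in ops:
        visit(op)
    return step


def allocate(ops, step):
    """Units needed per type: the peak number of same-type ops in one step."""
    usage = defaultdict(lambda: defaultdict(int))
    for op, s in step.items():
        usage[ops[op]][s] += 1
    return {kind: max(per_step.values()) for kind, per_step in usage.items()}


# Hypothetical behavior: y = (a * b) + (c * d); z = y + e
ops = {"mul1": "mul", "mul2": "mul", "add1": "add", "add2": "add"}
deps = {"add1": ["mul1", "mul2"], "add2": ["add1"]}

schedule = asap_schedule(ops, deps)
units = allocate(ops, schedule)
print(schedule)
print(units)
```

Under these assumptions both multiplications land in the first control step, so two multipliers are allocated, while the two additions fall in different steps and can share one adder. A resource-constrained scheduler could instead serialize the multiplications to trade a control step for a multiplier.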
Resource binding assigns operations and values to specific allocated hardware resources. At this stage, an RTL data-path implementation is produced as shown in Figure 1.2. Once the schedule and the data path have been determined, the required control path can be synthesized to activate components in the data path according to that schedule. Most behavioral synthesis approaches stop at this stage and pass the RTL implementation to lower-level design automation tools for further optimization and layout generation.

Figure 1.1: A schedule for a CDFG

Figure 1.2: An RTL data-path implementation of the behavior in Figure 1.1

1.3 Motivations

Although the sizes of implementable VLSI chips have increased drastically in recent years, most modern digital systems can still hardly fit on a single chip. Traditional behavioral synthesis research has had the objective of producing single-chip VLSI implementations from behavioral specifications. After a design is synthesized as a single-chip design, it may not be possible to partition the design onto multiple chips while satisfying the constraints, especially timing constraints. Even if a feasible partitioning can be found, the multi-chip implementation is likely to be inferior, since all the synthesis decisions and optimizations assumed a single-chip implementation as the target design.

Another common characteristic of hardware systems is that they are inherently parallel. In fact, complex application-specific systems often consist of multiple concurrent tasks (processes). For example, three GM production designs illustrated by Fuhrman [Fuh91] contain from 4 to 10 concurrent processes. A typical JPEG image compression system consists of six major tasks, namely forward/inverse discrete cosine transform, quantization/dequantization and encoding/decoding.
Most existing behavioral synthesis approaches primarily address synthesis of a single component (process) in a system without considering its system-level interaction. Using this synthesis paradigm for multiple-process systems, each process must be synthesized separately, and the integration and synchronization of the processes usually have to be done manually by the designers. However, if processes are synthesized one at a time, the decisions made previously may affect and constrain the synthesis of other processes in the system, which is also likely to result in an inferior system implementation. Furthermore, these synthesis approaches are limited in their ability to handle I/O and communication requirements, with a few exceptions [Nes87, Hay90, KM92]. In fact, many of them only consider the overall latency between the inputs and outputs of synthesized designs. Hence, it may become very difficult for the designers to integrate the processes in the system after their implementations have been synthesized without taking the synchronization issue into account.

In order to design complex ASIC systems effectively with acceptable design time and quality using design automation tools, we identify several important system-level issues to be addressed and incorporated into the current behavioral synthesis paradigm.

• Partitioning. As design prediction techniques become more advanced and accurate [KP93], it is now feasible and preferable to partition a system onto multiple chips before the behavioral synthesis process. Partitioning before synthesis allows the design space to be explored from a system-level view. Several system-level tradeoffs, such as chip count, chip packaging, and the performance and resource requirements of data processing and data transfers, can be taken into account during partitioning.
Consequently, proper directions can be provided to the subsequent synthesis process, and the synthesized multi-chip system is likely to be feasible with respect to the given constraints.

• Synthesis. To design a system such that its processes can coordinate their actions flawlessly, the synthesis approach requires simultaneously solving all the timing and synchronization constraints imposed by one process on another. Also, if the time when an external synchronization will occur is not known a priori, the delay of the corresponding operation becomes data dependent; hence, synthesis algorithms can no longer assume that all operations have fixed delays. Furthermore, we need to budget the chip resources allocated to each process on a chip, since the total resources taken by the processes on a chip are limited by the chip package to be used.

• Verification. A common approach to avoiding costly design iterations is to find the design problems as early as possible. Although synthesized designs are often argued to be correct by construction, in reality there is no such guarantee unless the whole synthesis process, including techniques and software, can be formally validated. As the synthesis tools advance to higher levels of abstraction, they become even more sophisticated and may still be under constant evolution. Also, they often require complex interaction with other tools and/or the designers. Hence, the chances for errors are greatly increased, and an effective verification methodology to cross-check the synthesized results becomes highly desirable.

With a better understanding of these system-level issues, an effective system design methodology can be formed to increase the design quality and to reduce the design cycles.

1.4 Problem Approach

This research focuses on a system design methodology for synthesizing multi-chip systems with multiple concurrent processes as shown in Figure 1.3.
In this figure, the specific problems addressed in this thesis are highlighted by bold-line boxes. The first task is to compile the system specification written in VHDL into a unified design representation called the Design Data Structure (DDS) [KP85]. At this step, the behavior of each process is transformed into a pair of data-flow and control-flow graphs so that the parallelism is extracted and the representation is ready for synthesis. In addition, the inter-process communication is extracted and represented by the DDS bindings in a manner to be described in Section 3.2.2.

Figure 1.3: System synthesis process (boxes: System Specification in VHDL; VHDL to DDS Compilation; Process Transformations; Process-Level Partitioning; Scheduling/Allocation for Concurrent Processes; Interconnect/Controller Synthesis; Functional/Timing Verification; Silicon Compiler)

Next, several process behavioral transformations could be evaluated to trade off hardware sharing, communication overhead, and control complexity. For example, we could collapse two processes into one so that the inter-process communication between them can be replaced by direct data transfers, and the hardware resources can be shared in the merged behavior. On the other hand, if a process is found to be too big to fit on a single chip while meeting its constraints, its behavior should be decomposed into a number of smaller processes. The process transformations essentially try to determine a proper coarse-grain concurrency for the system by resetting the process boundaries.

In this system synthesis methodology, partitioning of the system onto multiple chips is performed at the process level. The advantage of this approach is that the number of objects to be considered during partitioning is small and the functional boundaries specified by the designers are preserved.
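Because only a handful of processes have to be placed, even an exhaustive search illustrates what process-level partitioning decides. The sketch below is an assumption-laden stand-in, not the MILP or genetic-search methods of Chapter 4: the function name `partition_processes`, the area figures, the signal counts, and the single area capacity per chip are all hypothetical, and pin limits, packaging options and timing are ignored.

```python
from itertools import product


def partition_processes(areas, wires, n_chips, chip_area):
    """Try every process-to-chip assignment and keep the cheapest feasible one.

    areas     : dict process -> estimated area (hypothetical units)
    wires     : dict (process, process) -> number of signals between the pair
    n_chips   : number of identical chips available
    chip_area : area capacity of each chip
    Returns (cut, assignment) with the fewest inter-chip signals,
    or None if no assignment fits within the area capacity.
    """
    procs = sorted(areas)
    best = None
    for choice in product(range(n_chips), repeat=len(procs)):
        assign = dict(zip(procs, choice))
        load = [0] * n_chips
        for p, c in assign.items():
            load[c] += areas[p]
        if max(load) > chip_area:
            continue  # violates the area capacity of some chip
        # Signals whose two endpoints sit on different chips must cross pins.
        cut = sum(w for (p, q), w in wires.items() if assign[p] != assign[q])
        if best is None or cut < best[0]:
            best = (cut, assign)
    return best


areas = {"p1": 40, "p2": 35, "p3": 30, "p4": 25}
wires = {("p1", "p2"): 16, ("p2", "p3"): 4, ("p3", "p4"): 12, ("p1", "p4"): 2}
result = partition_processes(areas, wires, n_chips=2, chip_area=80)
print(result)
```

With these made-up numbers the search keeps p1 with p2 and p3 with p4, so only the two lightly connected pairs cross the chip boundary. The search space grows as the chip count raised to the number of processes, which is why this brute-force form is only reasonable when the objects being partitioned are whole processes.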
Also, with the aid of advanced prediction tools, the exploration of process design alternatives can be done concurrently with partitioning. Finally, the chip count and the chip capacities (area and pins) can be traded off according to the available chip packaging options.

After partitioning, a concurrent approach is used to synthesize the processes in the system. In this approach, each process is mapped to its own data path with a single thread of control. The objective is to schedule and allocate each process in such a way that all the timing and synchronization constraints are met and the hardware resources are distributed to processes according to their performance requirements. Once the schedule and allocation of each process have been determined, its RTL implementation and layout can be generated independently from other processes using lower-level design tools.

Finally, a distinctive feature of this synthesis methodology is the inclusion of a design verification step to ensure the correctness of the RTL implementations, not only with respect to their functional behaviors but also with respect to their I/O sequencing and timing. Because synthesized implementations are derived from their specifications in a well-defined manner, a hybrid symbolic/numeric approach is developed in this research to perform this verification task formally and yet automatically.

Some potential applications of the system design methodology described here include, but are not limited to,

• design of cost-effective systems,
• rapid prototyping of complex system designs, and
• partial or full system redesign.

1.5 Thesis Organization

The organization of this thesis is as follows. Chapter 2 surveys the related work in four major areas of this research: design specification and representation, system partitioning, multiple-process synthesis and design verification.
In Chapter 3, we describe the problem of HDL compilation in general and the translation from VHDL to the DDS representation in particular. The required techniques for control/data flow analysis and flow graph generation/optimization are presented. We will also discuss the modeling of arrays, input/output and inter-process communication in both VHDL and DDS. The implementation of the prototype software VHDL2DDS based on this work has created a VHDL front-end for the ADAM synthesis system and has led to numerous top-down design experiments from behavioral VHDL descriptions to chip layouts.

Chapter 4 describes the research on the system-level partitioning problem. A novel partitioning approach is presented to partition a system at the process level, to explore each process's design alternatives, to determine a proper chip count, and to consider chip packaging options concurrently. Two partitioning techniques, an MILP formulation and a genetic-search procedure, will be described. Experimental results obtained with the prototype software ProPart, including a JPEG image compression system, are also given.

Chapter 5 deals with the synthesis of designs with unbounded-delay operations under timing constraints and with multiple communicating processes. The concept of single-threaded processes will be introduced and used as the basis of our synthesis approach. We will also show how to satisfy detailed timing constraints when unbounded-delay operations are present and how to synchronize inter-process communication.

In Chapter 6, we discuss the RTL design verification problem. We will show that, though the correctness of synthesized designs cannot be guaranteed, there exist several important properties in synthesized RTL designs which can be used to simplify the verification task. An effective verification approach based on a hybrid symbolic/numeric simulation will be introduced.
The advantage of this verification approach is that it not only can formally verify the data path but also can faithfully exercise the control path and allow timing issues, such as delays, clocking schemes, and I/O protocols, to be taken into account.

Finally, Chapter 7 summarizes this thesis and outlines future research directions.

Chapter 2
Related Research

In this chapter, a number of related research efforts will be described. They are divided into four major areas: design specification and representation, system partitioning, multiple-process synthesis and design verification. Some related research directly relevant to the research described in this thesis will be described further in the appropriate chapters.

2.1 Design Specification and Representation

In general, a behavioral synthesis system requires a behavioral specification language which has the expressive power to describe all the design behaviors in the targeted domain. In addition, the semantics of the language must be translated into a representation that is suitable for synthesis.

To date, there is still no agreement in the synthesis community on how a behavioral specification language should be designed. In fact, several input languages have been used by current behavioral synthesis systems. For example,

• VHDL [Ins88], used by the ADAM system at USC, the SpecCharts language [VNG91] at UCI, and the System Architect's Workbench (SAW) [TLW+90] at CMU,
• HardwareC, used by Hercules [MK88],
• ISPS [Bar81] and Verilog, used by SAW [TLW+90],
• Silage [Hil85], used by Cathedral [GRVM90].

The selection of a specification language depends on the domain of the targeted designs, the degree of expressivity and the ease of implementation. A survey of several specification languages can be found in [Nar92].
The internal representations used by most behavioral synthesis systems to model the design behaviors are also different in style, but share a common objective, which is to capture the essential control and data flow information of the design behavior using flow graphs. Most of these flow-graph representations can be classified into either disjoint or hybrid control and data flow representations.

The Design Data Structure [KP85] of the ADAM system is a typical disjoint control and data flow representation. The control and data flow information are kept in separate graphs, and a set of bindings is used to relate objects in these graphs. Descart [OG86] and VSS [LG88] use a control-flow graph similar to the one used in standard software compilers. Each node in the control-flow graph represents a basic block, and the computations within each basic block are mapped to a separate data-flow graph.

The hybrid control and data-flow representation merges both the control and data flow information into one graph. The Value Trace (VT) [McF78] of the CMU-DA system and the Control and Data Flow Graph (CDFG) of the HAL system [PK89] follow this scheme. The SIF representation used in the Olympus system [MKMT90] uses a hierarchical sequencing graph to show both data and control flow dependencies. The DSFG representation [LGP+91] for capturing DSP designs in the Silage language is another hybrid representation, in which the signal-flow graph is annotated with control information and design constraints.

The task of compiling design descriptions and producing the flow-graph representation as output is complicated by the fact that the design behavior is described by sequential statements in some hardware description language like VHDL. Therefore, extensive local/global flow analysis and graph optimization are needed to produce fully parallelized flow graphs.
Many of the analysis techniques required for this compilation task are similar to those used in traditional software compilers [ASU86]. However, additional global data-flow analysis and graph generation steps are needed. Though HDL compilation using these steps had been performed as early as 1978 [McF78], to our knowledge neither a formalization of the problem nor a detailed description has been presented.

2.2 System Partitioning

Partitioning plays a key role in the design of digital systems. There are various goals that are commonly achieved during system partitioning. For example, a digital system may be partitioned to reduce the design complexity, to perform concurrent design, and to satisfy physical-capacity constraints. System partitioning approaches can be classified into two categories: structural partitioning and behavioral partitioning.

In structural partitioning, the system specification (behavior) is first synthesized into structure, and then the structure is partitioned onto chips, modules and boards. Numerous approaches have been proposed for solving circuit partitioning problems at the logic or RT level, such as group migration [KL70, FM82], simulated annealing [CH90, GS84], evolution [SR89], clustering [LLT69], and even interactive partitioning [BKM+66]. While structural partitioning usually can provide a solution that satisfies physical constraints such as area and pins, it ignores the fact that the design of the structure can be heavily influenced by the system partitioning. For example, after the system structure is synthesized as a single-chip design, it may not be possible to partition the design onto multiple chips while satisfying the constraints (especially timing constraints).

Behavioral partitioning is a process of dividing the system behavior into a number of partitions which can be synthesized into separate hardware modules or chips.
There are essentially two levels of granularity at which behavioral partitioning can be performed. Previous approaches have been mostly at the operation level. A detailed survey of these approaches has been done by Vahid [Vah91]. Alternatively, behavioral partitioning can be performed at a higher level of granularity, such as processes, procedures and memory blocks.

In BUD [McF86], McFarland uses a clustering algorithm based on a similarity measure to partition control data flow graphs (CDFGs) in a manner that encapsulates scheduling and allocation decisions. A similar approach was proposed by Lagnese and Thomas in APARTY [LT91] by employing multiple stages of clustering with different closeness functions. The partitioning method used in YSC [CvE87] by Camposano and van Eijndhoven clusters logic and operations in the behavioral specification into groups in order to improve the tractability of subsequent logic synthesis. In Vulcan [GM90], Gupta and De Micheli use a hierarchical hypergraph model for repeated two-way partitioning of the largest vertex at each level of the hierarchy until the constraints are met.

CHOP by Küçükçakar [Kuc91] evaluates a partitioned CDFG using a comprehensive behavioral area-delay prediction tool called BEST [KP93] to explore possible alternative implementations of each partition. The tool then searches for a combination of predicted design points for the partitions which will make the synthesized multi-chip design feasible while taking into account the design integration overhead. An interactive partitioning interface and a simple two-way partitioning procedure are also provided.

A system-level behavioral partitioning tool called SpecPart is described by Vahid and Gajski [VG92]. In this work, the objects to be partitioned are elevated to a higher level of abstraction such as processes, procedures, and other code groupings imposed by the specification language.
An estimated area and delay are assigned to each object, and a group migration technique similar to the Kernighan-Lin algorithm is used for partitioning. Gupta and De Micheli [GM92] also described an approach for partitioning the system specification at the process level into hardware and software components in order to satisfy the performance and cost constraints. In SpecSyn [GGVN93], Gajski et al. proposed a system-level design methodology and framework for mapping system functionalities onto a set of mixed-technology modules through a comprehensive designer interface.

2.3 Multiple-Process Synthesis

Though behavioral synthesis of digital hardware has received enormous attention over the years, most approaches have focused on the synthesis of single-process designs. (A comprehensive study of these approaches and related synthesis topics can be found in [MPC88].) Hence, synthesis of systems with multiple processes is often achieved by synthesizing individual processes separately and then interconnecting their structures [Fuh91]. This kind of approach requires that I/O and inter-process communication be specified manually by the designers.

Hung [HP92] and Gebotys [Geb92] extended traditional behavioral synthesis to multiple-chip designs. Under these approaches, the design behavior is still represented by a single CDFG. The CDFG is first partitioned, and then the partitioned CDFG is scheduled globally so that each chip's execution is fully synchronous with the others and the inter-chip communication is achieved completely in lock step. Like many previous scheduling techniques, these approaches still assume that all the operations in the CDFG have fixed execution delays.

There is a class of work [TW93, Nes87, Hay90] which addresses the issue of I/O and inter-process communication as a separate problem, known as interface synthesis.
These techniques, however, are mostly applicable to control-dominated designs with little or no data computation. The approach reported by Takach and Wolf [TW93] first analyzes a network of communicating processes to generate a small set of communication constraints. Solving the set of generated communication constraints together with the internal constraints of each process results in a feasible schedule for the network.

For the synthesis of more general designs, Ku introduced a technique called relative scheduling [KM92], which can handle operations with unbounded delays under detailed timing constraints. In relative scheduling, the start time of an operation is specified in terms of offsets from a set of anchors (unbounded-delay operations). This approach requires that resource binding be performed before scheduling. Once resource binding is determined, the operations bound to the same resource must be serialized by adding proper sequencing dependencies among them to resolve conflicts. Furthermore, the control scheme for the hardware architecture targeted by relative scheduling is fairly complicated.

Recently, Ku et al. [KFJM92] applied relative scheduling to designs with multiple processes. Like Takach's approach [TW93], the inter-process communication events are first extracted from the process specifications and composed into a causality graph. The causality graph is then scheduled using relative scheduling. The sequencing and timing constraints implied by this composed interface schedule are then reflected to the individual processes as external timing constraints, and each process is scheduled and synthesized accordingly. Since relative scheduling requires resource allocation/binding to be performed before scheduling, this approach is not suitable for the synthesis of multiple-process designs under global resource constraints, e.g., the area capacity of a chip.
A preferred approach is one that can concurrently trade off the performance and resource requirements of each process during scheduling.

2.4 Design Verification

There are a variety of systems and approaches for design verification. A comprehensive survey can be found in [CP88, McF93]. The most common approach for design verification at virtually every abstraction level is numeric simulation. However, without simulating all possible operating conditions, numeric simulation can only show the existence of errors but not their absence [CP88].

Formal methods, on the other hand, use mathematical reasoning rather than test-case experiments to show that the design will behave properly as required by the specification. Generally, formal verification relies on a formal model, which must be mathematically sound, to represent specifications, implementations and properties. Additionally, a set of reasoning rules associated with this formalism must be developed to perform proofs. Some well-known formalisms are higher-order logic [HD86], temporal logic [Boc82] and behavior expressions [MP83]. However, it requires a great deal of expertise and time to express designs or to do a nontrivial proof in any of these formalisms [McF]. Hence, formal verification at present is still extremely difficult, if not impossible, to automate.

There is another class of verification approaches called symbolic simulation. The idea behind symbolic simulation is to evaluate circuit behavior over expanded sets of signal values so that a number of operating conditions can be simulated in a single run. In the late 1970s, researchers at IBM applied symbolic simulation to hardware verification at the register-transfer level [Dar79]. The research activities on this problem only lasted until the early 1980s [Cor81] due to the weakness of the symbolic manipulation methods.
For example, the algebraic expressions built up during the course of simulation may become too large and cumbersome to manipulate effectively. Also, conditional branching and looping constructs proved even more intractable, since in general the simulator could not determine which conditional branches would be taken or the conditions under which a loop would terminate. The fundamental problem is that many properties of algebra on the integer domain are undecidable [Bry90]. Hence, it becomes very difficult for the algebraic manipulator to determine the equivalence of two complex expressions.

Recently, there has been renewed interest in symbolic simulation at the logic level due to the introduction of Ordered Binary Decision Diagrams (OBDDs) [Bry86]. This representation has been shown to be canonical; i.e., each Boolean function has a unique representation. Hence, a BDD-based symbolic simulator [BBB+87] becomes very useful for verifying combinational circuits. The verifier can simply introduce Boolean symbols for each primary input and simulate the circuit to obtain a Boolean function in OBDD form for each primary output. These functions can then be compared with the ones derived from the circuit specification. For sequential circuits, however, more sophisticated approaches are required if the circuit and its specification use different state encodings. One approach [BF89] is to require that the relation between the state encodings be specified explicitly. A hindrance of BDD-based approaches is that the size of an OBDD can become exponential in the number of Boolean symbols. Hence, the scale of circuits that can be handled is limited. In CATHEDRAL [GRVM90], a divide-and-conquer approach called SFG-tracing [GGP+91] is employed.
In SFG-tracing, the specification is first partitioned, and the partitions in the specification are used to verify the implementations by BDD-based symbolic simulation.

RLEXT [KW89] is a rule-based system developed by Knapp and Winslett for automatic verification of synthesized RTL designs. RLEXT relies on a unified representation similar to the USC Design Data Structure (DDS) [KP85] for design information, including the behavioral specification, the structural implementation and the relation between them. A set of rules is defined to detect possible design inconsistencies like value collisions, unbound operations and data dependency violations. One novel feature of RLEXT is that it allows the user to modify the design structure and then provides the ability to repair some design-rule violations automatically. This approach, however, considers neither the control logic nor the interaction between the datapath and the controller. How timing issues such as clocking schemes and delays are handled is not described.

Chapter 3
System Specification, Representation and Translation

In general, the first task of an automatic synthesis system is to translate a behavioral specification of a design into a representation that is suitable for synthesis. In fact, this task may be the most critical one when the specification language and the representation system have incompatible models to describe the given design. This is because most specification languages and their simulation models need to be intuitive to the designers; in contrast, the internal design representations for behavioral synthesis are geared toward ease of synthesis.

Several languages have been used to specify behavioral descriptions as input to high-level synthesis systems.
The major consideration in the selection of the behavioral specification language involves specifying the behavior without introducing unnecessary structure or timing information which overconstrains the synthesis process. In the USC ADAM system, VHDL [Ins88] has been adopted as the primary specification language due to its standardization and popularity. Although the popularity of VHDL as a behavioral specification language for behavioral synthesis is increasing, the major problem with VHDL is that it was intended to be a simulation language. Therefore, many semantics of the language are not suitable for synthesis or may severely overconstrain the design. Hence, the VHDL subset used by the ADAM system is restricted to purely functional behavior. The essential timing constraints must be supplied separately to the synthesis tools by the designers. ADAM's VHDL subset is described in Appendix A.

The design representations used by most behavioral synthesis approaches are quite different in style but, in general, they share a similar objective, which is to capture the necessary control and data flow information of the design behavior using one or two flat or hierarchical graphs. The ADAM system uses a unified multi-level design representation called the DDS (Design Data Structure) [KP85] to model the design under development. This unique form of representation divides design information into four models (graphs) which describe the data flow behavior, the control and timing behavior, the logical structure and the physical structure of a hardware design. In addition, the relations among the objects in these models are represented by two- or three-way bindings.

The task of VHDL to DDS translation is complicated by the fact that each VHDL process is described by sequential statements while DDS employs single-assignment data flow graphs for behavioral representation.
Hence, extensive local/global data flow analysis and graph optimization are needed to produce fully parallelized data flow graphs for behavioral synthesis.

In this chapter, we will discuss several important aspects of the translation from VHDL to DDS, namely flow analysis, optimization and modeling. A prototype compiler called VHDL2DDS has been developed and used to experiment with many examples, including an AR filter, a robot-arm controller, arithmetic Fourier transformation, an FFT butterfly node and a JPEG image compression system. This tool currently serves as the front end of the ADAM system for accepting new designs to be synthesized. Since the details regarding VHDL and DDS have been described thoroughly elsewhere [Ins88, KP85], their terminology shall be used without further explanation in the following discussion.

3.1 Translation of VHDL Processes into DDS

In practice, most hardware designs consist of a number of concurrent processes, where a process is defined as an independent thread of control. Therefore, a hardware system described in VHDL is modeled by a number of concurrently running processes, each of which represents a piece of hardware that gets activated when one or more of the inputs on its sensitivity list change, or simply runs forever if it has neither a sensitivity list nor any wait statement. The behavior of a process in VHDL is described by a sequence of statements. For example, Figure 3.1 shows a sample VHDL description that consists of one process for a ones-counter module. Hence, the fundamental problem in the VHDL to DDS translation is to produce a data flow graph (DFG) and a control timing graph (CTG) in DDS for each process, along with bindings between the two, so that together they are functionally equivalent to the process description in VHDL.
3.1.1 Problem Statement

Basically, a VHDL process is described algorithmically by a sequence of statements. The statements are sequentially executed unless they specify alternative branching destinations. The main side effect of these statements is to modify the current state (variables) of the process.

In DDS, the behavior specification of a design is represented by a data flow graph (DFG), a control timing graph (CTG) and a set of bindings B between the DFG and the CTG. The DFG is a directed acyclic bipartite graph (O, V, E), where O is a set of operations, V is a set of values, and E is a set of directed edges that connect operations and values. This graph is a single-assignment graph. That is, a value in the DFG is the result of a function application and can be generated only once. The CTG is also a directed graph (P, R), where P is a set of points representing control or timing events and R is a set of directed edges (ranges) representing relations between points. Special points are defined in the CTG to represent conditional branches and loops. The modeling of conditional branches and loops in DDS will be discussed shortly in Section 3.1.2.4. The bindings B are a set of 2-tuples (O, R) which tentatively assign data flow operations to timing ranges.

    -- Ones-Counter - An Example VHDL Description
    package OCtypes is
      type BYTE is array (0 to 7) of BIT;
    end OCtypes;

    use work.OCtypes.all;

    entity ONES_CNT is
      port (A: in BYTE; C: out INTEGER);
    end ONES_CNT;

    architecture BEHAVIOR of ONES_CNT is
    begin
      process
        variable NUM, I: INTEGER;
      begin
        NUM := 0;
        I := 0;
        while I < 8 loop
          if A(I) = '1' then
            NUM := NUM + 1;
          end if;
          I := I + 1;
        end loop;
        C <= NUM;
      end process;
    end BEHAVIOR;

Figure 3.1: A sample VHDL description
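As an informal illustration of the definitions above, the following Python sketch models the DFG (O, V, E), the CTG (P, R) and the bindings B as plain sets. It is only a sketch under stated assumptions: the class and function names (DataFlowGraph, ControlTimingGraph, add_operation, add_range, build_example), the value names NUM_0, NUM_1 and const_1, and the point names p0 and p1 are hypothetical and are not part of DDS. The sketch assumes that operation names and value names are disjoint, and it enforces only the single-assignment property.

```python
from dataclasses import dataclass, field


@dataclass
class DataFlowGraph:
    """Bipartite single-assignment graph (O, V, E); names are hypothetical."""
    operations: set = field(default_factory=set)  # O
    values: set = field(default_factory=set)      # V
    edges: set = field(default_factory=set)       # E: (value, op) or (op, value)

    def add_operation(self, op, inputs, output):
        # Single assignment: a value may be generated by one operation only.
        produced = {dst for (src, dst) in self.edges if src in self.operations}
        if output in produced:
            raise ValueError(f"value {output!r} is already produced")
        self.operations.add(op)
        self.values.update(inputs)
        self.values.add(output)
        self.edges.update((v, op) for v in inputs)  # value -> operation
        self.edges.add((op, output))                # operation -> value


@dataclass
class ControlTimingGraph:
    """Directed graph (P, R) of points and ranges; names are hypothetical."""
    points: set = field(default_factory=set)  # P
    ranges: set = field(default_factory=set)  # R

    def add_range(self, start, end):
        self.points.update((start, end))
        self.ranges.add((start, end))
        return (start, end)


def build_example():
    """Model the single statement NUM := NUM + 1 of the ones-counter."""
    dfg = DataFlowGraph()
    ctg = ControlTimingGraph()
    # Each assignment to NUM yields a fresh value (NUM_0 -> NUM_1).
    dfg.add_operation("add1", ["NUM_0", "const_1"], "NUM_1")
    time_range = ctg.add_range("p0", "p1")
    bindings = {("add1", time_range)}  # B: (operation, range) 2-tuples
    return dfg, ctg, bindings
```

Renaming NUM to NUM_0 and NUM_1 is one simple way to picture how repeated assignments to a VHDL variable become distinct values in a single-assignment graph; a second operation that tried to produce NUM_1 again would be rejected.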
The task of VHDL to DDS translation can be briefly described as follows:

• map the usage of each variable in the process into a sequence of values in the DFG;
• map the data manipulation constructs in the statements into the operations of the DFG;
• map the essential control flow implied by the process description into the CTG and produce the proper bindings between the DFG and the CTG.

This translation is both important and difficult, since it is analogous to the parallelization problem that transforms a sequential algorithm into a maximally parallel one. The objective is to extract the parallelism from the design specification in VHDL so that the parallel flow graphs in DDS can be selectively serialized during high-level synthesis to produce an RTL implementation that satisfies the design constraints.

[Figure 3.2: The parallelization during the VHDL to DDS translation. Labels: Design in VHDL; VHDL to DDS; DFG/CTG; High-Level Synthesis; RTL Implementation. The three stages are marked sequential, parallel and selectively serialized.]

3.1.2 VHDL Translation Approach

There are two major issues to be dealt with in this mapping: how to separate the control flow from the data computation information, and how to analyze the data usage and dependency in a process description. To make this large task manageable, we approach it by a sequence of steps, as shown in Figure 3.3. Briefly, our approach is to first construct a flow graph from the parsed VHDL description¹ by performing control flow analysis [ASU86]. The vertices in this flow graph are so-called basic blocks. The flow graph effectively isolates the computations from the control flow by confining them to the basic blocks. Then, local data flow analysis is performed to collect intra-block data dependencies, followed by global data flow analysis to find the inter-block data dependencies.
After this step, the annotated flow graph becomes a combination of the data flow and control flow graphs. Finally, a number of graph reduction rules are executed to optimize the flow graph. In what follows, we will briefly discuss each of these steps.

¹ We use a commercial VHDL syntax analyzer from Compass Design Automation to parse the VHDL descriptions.

[Figure 3.3: The steps to translate a VHDL process into DDS. Parsed VHDL Description → Control Flow Analysis → Flow Graph (basic blocks) → Local Data Flow Analysis → Annotated Flow Graph (a local data flow graph for each basic block) → Global Data Flow Analysis → Global Data Flow Graph & Control Flow Graph → Graph Optimization → DDS Data Flow Graph & Control Timing Graph.]

3.1.2.1 Control Flow Analysis

In a VHDL process, the statements are sequentially executed unless the current statement specifies an alternative branching destination. Therefore, the sequence of statements from the one which was the target of the previous branch to the one currently altering the flow of execution constitutes a computation unit.

Definition 3.1 A sequence of consecutive statements S_i, ..., S_j that does not have the possibility of branching out in the middle is called a basic block B_i [ASU86].

Definition 3.2 A flow graph G is a directed graph (B, E), where B is a set of vertices representing the basic blocks and each edge (B_x, B_y) ∈ E denotes a conditional or unconditional branch from the basic block x to y.

Control flow analysis is a procedure that partitions a sequence of statements into basic blocks and constructs a flow graph that represents the flow of control. Let the first statement of a basic block be called the leader. The set of leaders in a sequence of statements can be determined by the following rules:

• The first statement is a leader.
• Any statement that is the target of a conditional or unconditional branch is a leader.
• Any statement that immediately follows a branching statement is a leader.

After the leaders are identified, the following rules can be used to construct the flow graph of a sequence of statements {S_1, ..., S_n}. Let the leaders be {S_{l_1}, ..., S_{l_m}}, where l_1 < ... < l_m.

1. For each leader S_{l_k}, produce a basic block vertex B_k that consists of the statements from S_{l_k} to S_{l_{k+1}-1} (or to S_n for the last leader).

2. For each branching statement, add an edge from the basic block in which it is located to each one of its target basic blocks. The edge is associated with the branching condition if the statement is conditional, or simply with true otherwise.

For example, after performing control flow analysis on the ones-counter example given in Figure 3.1, six small basic blocks can be found. The corresponding flow graph is shown in Figure 3.4.

[Figure 3.4: The flow graph of the ones-counter example.]

The flow graph represents the essential flow of control of the VHDL process being translated. In fact, it can be regarded as the preliminary CTG in DDS, since each basic block corresponds to one or more sequential ranges in the CTG and each edge² between two basic blocks can be transformed into a range from the ending point of the source block to the starting point of the destination block.

3.1.2.2 Local Data Flow Analysis

Each basic block in the flow graph is a computation unit. The only forms of statements that will appear in a basic block are those which perform computation and/or produce side effects, namely assignments, procedure calls and expressions. Each of them can be modeled as (I(S), O(S), F(S)), where I(S) is a set of variables which are referenced by the statement S, O(S) is a set of variables whose values are updated by S, and F(S) is the set of operations performed by S.
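Returning briefly to the control flow analysis of Section 3.1.2.1, the three leader rules and the two construction rules can be sketched in Python as follows. This is a minimal sketch, not the VHDL2DDS implementation. The statement encoding (a dict with a "text" field and an optional "targets" map from branch condition to target statement index), the names find_leaders, build_flow_graph and ONES_COUNTER, and the way the while, if and end loop constructs are lowered to explicit branches are all assumptions made for illustration. The sketch also adds fall-through edges between consecutive blocks when the first one does not end in a branch; this follows common compiler practice and goes slightly beyond the two rules stated in the text.

```python
def find_leaders(stmts):
    """Return the sorted indices of the leader statements."""
    leaders = {0}                                 # rule 1: first statement
    for i, stmt in enumerate(stmts):
        targets = stmt.get("targets", {})
        if targets:
            leaders.update(targets.values())      # rule 2: branch targets
            if i + 1 < len(stmts):
                leaders.add(i + 1)                # rule 3: follows a branch
    return sorted(leaders)


def build_flow_graph(stmts):
    """Return (blocks, edges); edges are (source, destination, condition)."""
    leaders = find_leaders(stmts)
    bounds = leaders + [len(stmts)]
    # Block k holds the statements from leader k up to the next leader.
    blocks = [list(range(bounds[k], bounds[k + 1]))
              for k in range(len(leaders))]
    block_of = {i: k for k, blk in enumerate(blocks) for i in blk}
    edges = set()
    for k, blk in enumerate(blocks):
        # A branch can only be the last statement of its block.
        targets = stmts[blk[-1]].get("targets", {})
        for cond, target in targets.items():
            edges.add((k, block_of[target], cond))
        if not targets and k + 1 < len(blocks):
            edges.add((k, k + 1, "true"))         # assumed fall-through edge
    return blocks, edges


# Hypothetical lowering of the ones-counter process body of Figure 3.1.
ONES_COUNTER = [
    {"text": "NUM := 0"},
    {"text": "I := 0"},
    {"text": "while I < 8 loop",
     "targets": {"I < 8": 3, "not (I < 8)": 7}},
    {"text": "if A(I) = '1' then",
     "targets": {"A(I) = '1'": 4, "not (A(I) = '1')": 5}},
    {"text": "NUM := NUM + 1"},
    {"text": "I := I + 1"},
    {"text": "end loop", "targets": {"true": 2}},
    {"text": "C <= NUM"},
]
```

With this particular encoding, build_flow_graph(ONES_COUNTER) yields six basic blocks, which agrees with the count reported for the ones-counter example, although the exact block contents of Figure 3.4 cannot be confirmed from the text. The edge from the block ending in "end loop" back to the block holding the while test is the loop-back edge.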
² Except the loop-back edges. Each such edge is implied by a pair of special points in the CTG which denote the places where control is returned or transferred when a loop-back is made.

Here, we are interested in the inter-statement data dependency within each basic block. That is, we want to know which statements within a basic block define the values for each I set and, similarly, which statements will use the values of each O set. Let B = {S_i, ..., S_j} and i ≤ k ≤ j. The following rules can be used to collect the definition/use information:

1. The definition of a variable x ∈ I(S_k) is S_d for the largest d such that i ≤ d < k and x ∈ O(S_d); if no such S_d can be found, x is defined outside B.
2. The use of a variable y ∈ O(S_k) is a list of statements U after S_k such that y is in their I sets but y is not in the O set of any statement strictly between S_k and the last statement in U.

For example, let a basic block contain the following statements:

    a1 = in1 * p1
    a2 = in1 * p2
    b1 = in2 + a2
    out1 = b1 * p3
    c1 = out1 * p4
    out2 = a1 + c1

After producing the input and output sets of each statement and analyzing the inter-statement data dependency, we have the definition/use table shown in Table 3.1.

Additionally, we need to know the input set of B, which contains the variables whose values are defined elsewhere but are referenced in B, and the output set of B, which contains all the variables being updated by B. These two sets are important in the global data flow analysis. I(B) and O(B) can be determined by the next two equations.
\[ I(B) = I(S_i) \cup \bigcup_{k=i+1}^{j} \Bigl( I(S_k) - \bigcup_{l=i}^{k-1} O(S_l) \Bigr) \]

\[ O(B) = \bigcup_{k=i}^{j} O(S_k) \]

Table 3.1: The use/definition table of the basic block example

                          Input Set       Output Set
    Statement             val    def      val    use
    1: a1 = in1 * p1      in1    0        a1     7
                          p1     0
    2: a2 = in1 * p2      in1    0        a2     3
                          p2     0
    3: b1 = in2 + a2      in2    0        b1     4
                          a2     2
    4: tmp = b1 * p3      b1     3        tmp    5, 6
                          p3     0
    5: c1 = tmp * p4      tmp    4        c1     7
                          p4     0
    6: out1 = tmp         tmp    4        out1   0
    7: tmp = a1 + c1      a1     1        tmp    8
                          c1     5
    8: out2 = tmp         tmp    7        out2   0

As a result, the basic block example given earlier will have an input set {in1, in2, p1, p2, p3, p4} and an output set {a1, a2, b1, c1, tmp, out1, out2}.

After this analysis, it is conceptually easy to produce a piece of DFG that corresponds to each basic block. This is because the definition and use information provides the inter-statement data dependencies while the syntax trees of the expressions within a statement give the intra-statement data dependencies. By transforming each operation in the F sets of all the statements in the basic block into a node and each data dependency into a link between nodes, a piece of DFG is built. For instance, Figure 3.5 shows a piece of data flow graph that corresponds to our basic block example.

[Figure 3.5: The data flow graph of the basic block example]

3.1.2.3 Global Data Flow Analysis

Unfortunately, the inter-block data dependency is much more complex than the intra-block one previously discussed, because the flow of execution is no longer a straight line as within a basic block. For example, a basic block after a conditional structure may have more than one basic block which defines its input values; consequently, the correct data dependencies are decided by the branching conditions. Hence, it is not sufficient to simply scan backward (forward) along the incoming (outgoing) paths of a basic block in the flow graph.
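For reference before the global case, the local definition/use rules and the block-level sets I(B) and O(B) of Section 3.1.2.2 can be sketched as follows. The sketch is only an illustration under stated assumptions: each statement is reduced to a pair of input and output variable lists (the F sets are ignored), the function names are hypothetical, statements are numbered from 1 as in Table 3.1, and an entry of 0 in the table, which we read as "outside the block", is returned as 0 for a definition and as an empty list for a use.

```python
def def_use(block):
    """block: list of (inputs, outputs) per statement.

    Returns (defs, uses): defs[k][x] is the number of the defining statement of
    input x of statement k+1 (0 if defined outside the block); uses[k][y] lists
    the statements inside the block that read output y of statement k+1.
    """
    n = len(block)
    defs, uses = [], []
    for k, (ins, outs) in enumerate(block):
        d = {}
        for x in ins:
            d[x] = 0
            for j in range(k - 1, -1, -1):   # nearest earlier definition
                if x in block[j][1]:
                    d[x] = j + 1
                    break
        defs.append(d)
        u = {}
        for y in outs:
            users = []
            for j in range(k + 1, n):
                if y in block[j][0]:
                    users.append(j + 1)
                if y in block[j][1]:         # y is redefined; stop scanning
                    break
            u[y] = users
        uses.append(u)
    return defs, uses


def block_io(block):
    """Compute I(B) and O(B) following the two equations of Section 3.1.2.2."""
    defined, i_b = set(), set()
    for ins, outs in block:
        i_b |= set(ins) - defined            # referenced before any definition in B
        defined |= set(outs)
    return i_b, defined


# The eight statements of Table 3.1.
block = [
    (["in1", "p1"], ["a1"]),
    (["in1", "p2"], ["a2"]),
    (["in2", "a2"], ["b1"]),
    (["b1", "p3"], ["tmp"]),
    (["tmp", "p4"], ["c1"]),
    (["tmp"], ["out1"]),
    (["a1", "c1"], ["tmp"]),
    (["tmp"], ["out2"]),
]
defs, uses = def_use(block)
i_b, o_b = block_io(block)
```

Note that a statement which both reads and redefines a variable is still counted as a user of the earlier value, because the read is checked before the redefinition ends the scan.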
Our approach for this global data flow analysis is outlined in the following steps:

1. Hierarchically group the basic blocks within the same structure, such as if-then-else or while-loop structures, into a meta block as shown in Figure 3.6.
2. Calculate the input and output sets of those meta blocks according to Table 3.2³.
3. Analyze the definition and use relationships among the meta blocks hierarchically.

³ Other forms of conditional and iterated statements can be transformed into those two shown in Table 3.2.

[Figure 3.6: Meta blocks for if-then-else and while-loop structures]

Table 3.2: Calculation of input and output sets for the meta blocks

    Type of meta block      Input and output sets
    if C then T else E      I(IF) = I(C) ∪ I(T) ∪ I(E) ∪ {O(IF) − (O(T) ∩ O(E))}
                            O(IF) = O(T) ∪ O(E)
    while C loop L          I(WL) = I(C) ∪ I(L) ∪ O(L)
                            O(WL) = O(L)

Readers may wonder why the input set of an if-then-else meta block shown in Table 3.2 contains some variables that are also in its output set. Specifically, any variable appearing in the output set of the IF block but not modified by both the T and E blocks must be included in the input set of the IF block. This is illustrated in Figure 3.7. Here y gets the value of x either from the if statement, if C is true, or the old value of x from the earlier assignment. By enforcing x as one of the input variables of the if meta block, this block can then supply the correct value for y by merging these two possible definitions of x during the global definition and use analysis.

[Figure 3.7: A conditional assignment]

By introducing meta blocks into the flow graph, the global data flow analysis can be done hierarchically.
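The set calculations of Table 3.2 translate directly into set operations. The following minimal sketch assumes that every subblock is given as a pair (I, O) of Python sets; the function names and the variable names in the two examples (c, a, x, I, NUM, A) are hypothetical and chosen only to mirror the conditional-assignment situation of Figure 3.7 and a simple counting loop.

```python
def if_sets(cond, then_body, else_body):
    """I/O sets of an if-then-else meta block; each argument is an (I, O) pair of sets."""
    o_if = then_body[1] | else_body[1]
    # Variables written on only one path must also be inputs, so that the
    # old value is available when the other path is taken.
    not_on_both_paths = o_if - (then_body[1] & else_body[1])
    i_if = cond[0] | then_body[0] | else_body[0] | not_on_both_paths
    return i_if, o_if


def while_sets(cond, loop_body):
    """I/O sets of a while-loop meta block; each argument is an (I, O) pair of sets."""
    i_wl = cond[0] | loop_body[0] | loop_body[1]
    return i_wl, set(loop_body[1])


# "if c then x := a; end if;" with an empty else body.
i_if, o_if = if_sets(({"c"}, set()), ({"a"}, {"x"}), (set(), set()))

# "while I < 8 loop ... end loop;" where the body reads I, NUM, A and writes I, NUM.
i_wl, o_wl = while_sets(({"I"}, set()), ({"I", "NUM", "A"}, {"I", "NUM"}))
```

In the first example x ends up in both the input and the output set of the meta block, which is exactly the case discussed above: the block needs the old value of x to supply it when C is false.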
Since each level of hierarchy basically consists of a sequence of basic blocks or meta blocks, the inter-block data dependency can be analyzed using techniques similar to the local data flow analysis discussed earlier, except that the granularity is at the block level instead of at the statement level.

The definition and use analysis within each meta block, however, is done in a bottom-up fashion. An if-then-else meta block IF consists of three subblocks: a condition C, a then body T and an else body E. After we have analyzed these three subblocks, the definition and use of I(IF) and O(IF) can be obtained by the following rules:

1. The definition of each variable x ∈ O(IF) is a 2-tuple (def_T, def_E). If x ∈ O(T), def_T is from subblock T; otherwise, def_T is defined outside the IF meta block and will be determined at the next level of hierarchy. def_E is determined in a similar way for subblock E.
2. The use of each variable y ∈ I(IF) denotes every subblock whose input set contains y.

On the other hand, a while-loop meta block WL consists of two subblocks, a condition C and a loop body L. Its data dependency is analyzed in the following way:

1. The definition of each variable x ∈ O(WL) is a 2-tuple (def_L, def_init), where def_L is from subblock L and def_init is the initial value of x defined outside the WL meta block.
2. The use of each variable y ∈ I(WL) denotes every subblock whose input set contains y.
3. If a variable z in I(C) or I(L) is also in O(L), its definition becomes a 2-tuple (def_L, def_init), where def_L is the previous-iteration value from subblock L and def_init is the initial value defined outside WL.

Figure 3.8 shows the block hierarchy of the ones-counter example and the result of global data flow analysis. The definition/use table shown in this figure describes the inter-block data dependency.
For example, from this table we can find that the definition of I in Block 3 comes from its parent block (Block 5), which in turn comes from higher levels (Block 7 and Block 8) of the block hierarchy. From Block 8, we know that the initial value of I comes from Block 1 and its previous-iteration value is from Block 7.

[Figure 3.8: Global data flow analysis of the ones-counter example (block hierarchy B1 to B9 with the inter-block definition/use table)]

3.1.2.4 Graph Generation

After the global data flow analysis, all the inter-block data dependencies are found and the global DFG and CTG can be generated. Generalized DDS templates for the if-then-else and while-loop structures are shown in Figures 3.9 and 3.10. In each of the templates, two pseudo operations, D (distribute) and J (join), are used to support conditional data dependencies. Each D operation is like a 1-to-2 demultiplexer which distributes the input value to one of its outputs according to the predicate. Conversely, a J operation is like a 2-to-1 multiplexer which merges two possible input values and presents one at its output. The while-loop template uses a pair of pseudo operations, LB (loop begin) and LI (loop iterate), to denote an implicit value feedback from LI to LB so that the DFG can remain acyclic. These pseudo operations do not necessarily correspond to actual hardware modules after synthesis. Their presence is mainly for synthesis tools to detect mutual exclusiveness and loop feedback in the DFG.

In the CTGs of both templates, a pair of or-fork and or-join points is used to represent conditional execution. Each outgoing range of an or-fork point is associated with a condition.
The flow of execution follows the range whose condition is satisfied. The CTG of a while-loop structure begins at an α point and iterates at an ω point, at which time the flow of execution returns to the α point. Note that the range between the ω and or-join points will never be taken. Its presence is mainly to make the CTG connected.

Figure 3.11 shows the DFG and CTG generated for the ones-counter example. The implicit control and feedback edges are also shown in the DFG using grey arrows.

[Figure 3.9: The DDS template of an if-then-else structure. DFG pseudo operations: D = distribute, J = join. CTG: condition, then body (Pred = T), else body (Pred = F), with simple, or-fork and or-join points]

[Figure 3.10: The DDS template of a while-loop structure. DFG pseudo operations: D = distribute, J = join, LB = loop begin, LI = loop iterate. CTG: loop condition, loop body (Pred = T), exit (Pred = F), with or-fork, or-join, alpha (α) and omega (ω) points]

[Figure 3.11: The DFG and CTG of the ones-counter example]

3.1.2.5 Graph Optimization

Like traditional compilers for programming languages, there is plenty of opportunity to optimize the graphs during the VHDL to DDS translation. However, some transformations, such as tree-height reduction and loop unfolding, do not always result in a hardware design that is preferred by designers. For example, tree-height reduction [NP91] can reduce the critical paths of data flow graphs, but it can also introduce additional operations. We feel that these kinds of transformations should be performed separately and interactively by the designer. Here, we focus on those transformations that are guaranteed to improve the final design⁴.
The flow graphs and the definition/use information are particularly useful in performing these transformations.

⁴ Some straightforward transformations like constant propagation and copy operation elimination will not be discussed here.

Lemma 3.1 For any basic block B_k in a flow graph where k > 1, if B_k does not have any incoming edges then it is dead code.

Proof: Since B_k does not have any incoming edge, there exists no path from B_1 to B_k. Therefore, when the execution begins at B_1, there will be no way to reach B_k under any input. □

This lemma provides an easy way to eliminate the redundant parts of a behavioral specification, but it is not guaranteed to identify all of them. This is because basic blocks with incoming edges are also dead code if their path conditions can never evaluate to true.

After the definition and use analysis, some values in the data flow graph may not be used by any operations and are not primary outputs either. These values are called dangling. The following rules can be used to eliminate the dangling parts of a data flow graph:

1. If there is no entry in the use list of a value and it is not a primary output, mark it dangling.
2. If the outputs of an operation are all dangling, remove it and all its output values from the graph. Additionally, remove this operation from the use list of each of its input values.

Lemma 3.2 For any two operations O_1(x_1, ..., x_n) and O_2(y_1, ..., y_n), if O_1 and O_2 are of the same type and x_i and y_i are defined by the same value for all i, then O_1 and O_2 are common subexpressions. Therefore, one of them can be removed from the graph and the definition and use information of its outputs can be redirected to the other.

Proof: The proof is trivial.
Since O_1 and O_2 perform the same function, they should produce equivalent results under the same input condition. Furthermore, if the operation is commutative, the order of the inputs is not important as long as their correspondence can be established. □

By recursively applying this lemma to a data flow graph, many of the common subexpressions can be eliminated.

Lemma 3.3 For any operation O(x_1, ..., x_n) within a loop, if x_i are defined outside the loop for all i, then O is a loop invariant operation which can be moved out of the loop.

Proof: Since all the inputs of O are defined before the loop is entered, the input condition of O will not change at any iteration of the loop. Consequently, its outputs will remain the same at every iteration. Therefore, it is equivalent to compute O once before entering the loop. □

However, care must be taken when performing this transformation, because the loop may execute zero times. In this situation, moving O out of the loop will cause an additional operation to be executed, which may produce an invalid result. Hence, each output of O may need to be merged with its previous value before the loop if there is a subsequent use of it after the loop.

3.2 Array and Input/Output Modeling

Unlike scalar variables, input/output ports and arrays have to be handled in a different way since they are associated with structural attributes. For example, we cannot safely assume that a sequence of assignments to an output port is equivalent to the last assignment to this port. This is because the designer may intend to transfer a set of data via a single pair of input/output ports.
In addition, when an element of an array gets modified by an assignment statement, the subsequent references to the array cannot be made parallel with the assignment unless we can make sure that they refer to different elements in the array⁵.

⁵ We parallelize array accesses in the event that the synthesis software has available multiport memories.

3.2.1 Arrays

Arrays cannot be handled like scalar variables in a single-assignment data flow graph, because an array is a composite object that consists of elements of the same type. A sequence of assignments to the elements of an array cannot be done in parallel when there is a possibility that two assignments may refer to the same element. Even if two assignments can be guaranteed to refer to two different elements of an array, they may still have to be sequentialized if the array is going to be implemented in a single-port memory.

[Figure 3.12: The array read and write operations. (a) Array Read: inputs Array and Index, output Value of Array[Index]. (b) Array Write: inputs Array, Index and Value, output New Array (with Array[Index] = Value)]

In our model, for each array type declared in VHDL, two additional operations, array read and array write, are implicitly defined as shown in Figure 3.12. An array read operation takes the designated array and the associated indexes as inputs and provides the value of the referred element as the output. On the other hand, an array write operation takes an additional input for the value to be written to the referred element and produces a new instance of the array. If an indexed name appears at the right (left) hand side of an assignment statement, an array read (write) operation is created in the DFG. A sequence of array read operations occurring before the array is modified are made in parallel by VHDL2DDS.
Hence, the subsequent synthesis tools can selectively sequentialize them according to the number of read ports of the memory module chosen to implement the array. On the other hand, a sequence of writes to an array is explicitly sequentialized according to the order of assignments appearing in VHDL. This sequentialization is done by making the array input of a write operation refer to the instance of the array produced by the previous write operation. The current implementation of VHDL2DDS does not analyze the indexes of array write operations in order to parallelize those array write operations which will refer to different elements in the array.

3.2.2 Input/Output and Inter-Process Communication

As we mentioned earlier, most hardware systems consist of a number of concurrent processes which may communicate with each other or the environment. For instance, the VHDL example given in Figure 3.13 consists of two processes which communicate with each other through the global signals start, finished and p2temp. Additionally, process one is communicating externally through two input ports data1 and data2 as well as an output port result.

3.2.2.1 Types of Communication

In general, the communication is considered synchronous if the send action in one process and the receive action in another refer to the same clock; otherwise, it is asynchronous and must be done through explicit handshaking. Furthermore, the communication can be either static or dynamic. Communication is static if the exact time it takes place can be determined during synthesis. That is, the send and receive operations are synchronized statically in time. This type of communication requires little synchronization overhead, but it imposes a strict timing constraint for synchronization on the schedules of the sending and receiving processes.
We define dynamic communication to be communication that cannot be scheduled at a fixed time, e.g., waiting for the assertion of an external signal. Obviously, dynamic communication needs more synchronization overhead (such as handshaking signals and the associated logic) than static communication does. It may also lead to higher controller costs due to the busy waiting on both the sending and receiving processes. The communication can be further classified into either buffered or unbuffered. In buffered communication, the execution of the sender or receiver does not have to be blocked (suspended) unless the buffer is full or empty.

    entity example is
      port(data1  : in  bit_vector(3 downto 0);
           data2  : in  bit_vector(3 downto 0);
           result : out bit_vector(3 downto 0));
    end example;

    architecture example of example is
      signal start, finished : bit;
      signal p2temp : bit_vector(3 downto 0);
    begin
      p1: process
      begin
        start <= '1';
        wait on finished until finished = '1';
        start <= '0';
        result <= p2temp;
        wait on data1, data2;
      end process;

      p2: process
        variable res : bit_vector(3 downto 0);
      begin
        finished <= '0';
        res := data1 + data2;
        p2temp <= res;
        finished <= '1';
        wait on start;
      end process;
    end example;

Figure 3.13: A VHDL example with two processes

3.2.2.2 Modelling Communication in VHDL and DDS

In VHDL, inter-process communication is specified by the assignments or references of the global signals which are shared by the processes. Similarly, inter-chip communication is done via the input/output ports.
Additionally, sensitivity lists and wait statements provide the mechanism to specify blocking communication events.

A reference to an input port or a global signal in a statement of a process implies that a read operation of a communication event is taking place. For instance, the following statement from the VHDL example shown in Figure 3.13 consists of two read operations from input ports data1 and data2:

    res := data1 + data2;

Similarly, an assignment to an output port or a global signal becomes a write operation of the corresponding communication event. For example, the statement

    p2temp <= res;

implies a write operation of an inter-process communication event via the global signal p2temp. A communication event consists of a write operation in one process and a number of read operations in the others.

In this model, a sequence of references to an input port or a global signal implies not only a number of read operations but also a sequential ordering among them which must be obeyed by the implementation⁶. This also applies to a sequence of assignments to an output port or a global signal. A typical example of this situation is to transfer a set of data via one pair of input/output ports.

⁶ See McFarland's model [MP83] for an early formalization of this.

A blocking communication event can be specified by using a wait statement or a sensitivity list that contains the signal in question. Otherwise, the communication event is intended to be implemented statically as a non-blocking one. For example, the statements

    wait on x, y;
    z := x + y;

imply that here the communication events for x and y are to be implemented as blocking ones.

In DDS, the timing and sequencing of communication can be represented hierarchically by the process-level CTGs and a system-level CTG as shown in Figure 3.14.
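The rules above for deriving communication operations from a process body can be sketched as follows. This is an illustrative reading and not the VHDL2DDS data structure: the dictionary encoding of statements, the function name and the choice to record a wait statement only as a blocking marker (without emitting a separate read operation for it) are assumptions. The input mirrors process p2 of Figure 3.13.

```python
def communication_ops(stmts, carriers):
    """Collect ordered read/write operations on ports and global signals.

    stmts: list of dicts, either
        {"kind": "assign", "reads": [...], "writes": [...]} or
        {"kind": "wait", "signals": [...]}.
    carriers: set of port and global signal names.
    Returns (ops, blocking): ops is a list of (operation, carrier, statement index)
    in program order; blocking is the set of carriers named in a wait statement.
    """
    ops, blocking = [], set()
    for idx, s in enumerate(stmts):
        if s["kind"] == "wait":
            blocking |= set(s["signals"]) & carriers
            continue
        for name in s["reads"]:
            if name in carriers:
                ops.append(("read", name, idx))
        for name in s["writes"]:
            if name in carriers:
                ops.append(("write", name, idx))
    return ops, blocking


carriers = {"data1", "data2", "result", "start", "finished", "p2temp"}
p2 = [
    {"kind": "assign", "reads": [], "writes": ["finished"]},               # finished <= '0';
    {"kind": "assign", "reads": ["data1", "data2"], "writes": ["res"]},    # res := data1 + data2;
    {"kind": "assign", "reads": ["res"], "writes": ["p2temp"]},            # p2temp <= res;
    {"kind": "assign", "reads": [], "writes": ["finished"]},               # finished <= '1';
    {"kind": "wait", "signals": ["start"]},                                # wait on start;
]
ops, blocking = communication_ops(p2, carriers)
```

The two writes to finished appear in the list in program order, which is the sequential ordering that the implementation has to preserve for repeated accesses to the same carrier.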
A read or write operation in a process is represented by a binding (O, T, C), where O is a read or write operation in the DFG of the process, T is a range in the process's CTG, and C is a structural carrier which is either an input/output port or an inter-process communication link. The ranges of a sequence of read (write) operations from (to) an input/output port or a global signal are explicitly ordered in the CTG.

[Figure 3.14: The communication model in DDS (sender and receiver processes, each with a DFG and a CTG, the system structure and the system CTG)]

Similarly, each event for inter-process or inter-chip communication can be represented by a range in the system-level CTG. This range for a communication event is a composite range that is composed of the ranges of a write action in one process and the read actions in the others. If the communication event is a blocking one, the duration of the range is specified as unbounded. Otherwise, a fixed delay, if known, can be given to the range to specify the communication time for a non-blocking event. The relationships and constraints among the event ranges in the system-level CTG can be added by the designer to describe the system timing requirements and the interaction with the external world.

Table 3.3: A partial list of examples translated by VHDL2DDS

    Example       Description                     Lines   Condition   Loop   Synthesized
    arf.vhdl      AR filter                       79      no          no     yes
    aft.vhdl      arithmetic Fourier transform    145     no          no     yes
    fir.vhdl      FIR filter                      37      no          no     yes
    ellip.vhdl    elliptic wave filter            90      no          no     yes
    robot.vhdl    robot arm controller            99      yes         no     yes
    diffeq.vhdl   differential equation solver    94      yes         no     yes
    dct.vhdl      discrete cosine transform       71      no          yes    yes
    oc.vhdl       ones counter                    29      yes         yes    no
    sqt.vhdl      square root function            25      yes         yes    no

3.3 Experiments

The VHDL2DDS program has been involved in many top-down chip design experiments using the ADAM high-level synthesis system.
Table 3.3 lists some of the examples that were translated by VHDL2DDS. In this section, we will use three examples to show the features of VHDL2DDS.

Figure 3.15 shows a simple VHDL description which illustrates the way that VHDL2DDS handles conditional branches, loops and arrays. Basically, this module performs a simple integer square root function. Though this description does little computation, its control flow is a good test for global data flow analysis. The data flow graph produced by VHDL2DDS is shown in Figure 3.16. In this graph, two pairs of LB (loop begin) and LI (loop iterate) pseudo operations are used to denote the implicit feedback values for a and b to be used in the next iteration of the while loop. The pair of D (distribute) and J (join) pseudo operations at the left hand side of the figure corresponds to the if statement in the VHDL description, and the D/J pair at the right hand side is the loop-exit condition for the while statement. The control edges (flags) and feedback edges are shown by grey arrows in this graph.

    -- A simple integer square root function
    entity SQT is
      port(x : in integer;    -- the input value
           y : out integer);  -- the result
    end SQT;

    architecture BEHAVIOR of SQT is
    begin
      process
        variable a, b : integer;
      begin
        if x > 0 then
          a := 1;
          b := x;
          while abs(a - b) > 1 loop
            a := (a + b) / 2;
            b := x / a;
          end loop;
          y <= a;
        else
          y <= x;
        end if;
      end process;
    end BEHAVIOR;

Figure 3.15: The VHDL description of a square root function

[Figure 3.16: The data flow graph of the square root function]

The robot arm controller example given in Table 3.3 was originally written in the C language. The C code was translated into a VHDL description in order to be synthesized through the ADAM system. Figure 3.17 shows the VHDL statements of this example.
Though the VHDL description of this example contains four sequential conditional branches, VHDL2DDS was able to retain only the essential data dependencies and produced a highly parallelized data flow graph as shown in Figure 3.18.

Recently, a JPEG image compression system was designed using the Unified System Construction system [GCDBP94], which is an integration of several newer system-level tools with the ADAM synthesis system. Figure 3.19 shows the VHDL description of an 8x8 16-bit Discrete Cosine Transform module used in the compression system. This module performs an 8-point DCT row-wise over an 8x8 frame. Figure 3.20 shows the data flow graph produced by VHDL2DDS. The feedback edges are not shown here to make the graph easier to interpret. In this graph, there are sixteen array read operations to InFrame in parallel, but the eight array write operations to OutFrame in the lower right part of the figure have been sequentialized by VHDL2DDS according to the order of the assignments in the VHDL description.
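The treatment of array accesses described in Section 3.2.1 and observed in this example, with reads of one array instance left parallel and writes chained, can be sketched as follows. The tuple layout, the instance naming scheme (A_0, A_1, ...) and the function names are assumptions made for this illustration; index analysis is deliberately omitted, as it is in the current VHDL2DDS.

```python
def build_array_ops(accesses, array="A"):
    """Turn program-order array accesses into array read/write operations.

    accesses: list of ("read", index) or ("write", index, value).
    Returns tuples (kind, array_in, index, value, array_out). Every write
    consumes the instance produced by the previous write and produces a new
    one, which sequentializes the writes; reads refer to the current instance.
    """
    ops, version = [], 0
    for acc in accesses:
        current = f"{array}_{version}"
        if acc[0] == "read":
            ops.append(("read", current, acc[1], None, None))
        else:
            version += 1
            ops.append(("write", current, acc[1], acc[2], f"{array}_{version}"))
    return ops


def parallel_read_groups(ops):
    """Group read indexes by the array instance they depend on.

    Reads in one group have no ordering among themselves in this sketch.
    """
    groups = {}
    for kind, array_in, index, _, _ in ops:
        if kind == "read":
            groups.setdefault(array_in, []).append(index)
    return groups


accesses = [("read", 0), ("read", 7), ("write", 0, "v0"), ("write", 4, "v1"), ("read", 1)]
ops = build_array_ops(accesses)
groups = parallel_read_groups(ops)
```

A later synthesis step could then sequentialize the reads within a group according to the number of read ports of the chosen memory module.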
    -- The VHDL statements of the robot arm controller example
    ev1 := uv1 - xv1;
    ev2 := uv2 - xv2;
    u10 := kv1 * ev1;
    u20 := kv2 * ev2;
    em1 := xvh1i - xv1;
    em2 := xvh2i - xv2;
    -- 1st conditional branch
    if em1 < Z1 then tmp1 := Z1 - em1;
    else tmp1 := Z1 + em1;
    end if;
    -- 2nd conditional branch
    if tmp1 < evthresh then em1 := Z2;
    else em1 := Z2 + em1;
    end if;
    -- 3rd conditional branch
    if em2 < Z3 then tmp1 := Z3 - em2;
    else tmp1 := Z3 + em2;
    end if;
    -- 4th conditional branch
    if tmp1 < evthresh then em2 := Z4;
    else em2 := Z4 + em2;
    end if;
    xvh1o <= u10*T1 + xv1;
    xvh2o <= u20*T1 + xv2;
    mh110 := km11*em1*u11i + mh11i;
    mh120 := km12*em1*u21i + mh12i;
    mh210 := km21*em2*u11i + mh21i;
    mh220 := km22*em2*u21i + mh22i;
    q1 <= mh110*u10 + mh120*u20;
    q2 <= mh210*u10 * mh220*u20;
    mh11o <= mh110;
    mh12o <= mh120;
    mh21o <= mh210;
    mh22o <= mh220;
    u11o <= u10;
    u21o <= u20;

Figure 3.17: The VHDL statements of the robot arm controller example

[Figure 3.18: The data flow graph of the robot arm controller]

    -- dct.vhdl : an 8x8 16-bit DCT VHDL behavioral description
    package dct_data is
      type Frame is array(0 to 7, 0 to 7) of integer;
      constant c1,c2,c3,c4,c5,c6,c9,c12,c14,c15,c16 : integer;
    end dct_data;

    use work.dct_data.all;
    entity dct is
      port(InFrame : in Frame; OutFrame : out Frame);
    end dct;

    use work.dct_data.all;
    architecture behavior of dct is
    begin
      process
        variable t1,t2,t3,t4,i : integer;
        variable m1,m2,m3,m4 : integer;
      begin
        for i in 0 to 7 loop
          t1 := InFrame(0,i)+InFrame(7,i);
          t2 := InFrame(3,i)+InFrame(4,i);
          t3 := InFrame(1,i)+InFrame(6,i);
          t4 := InFrame(2,i)+InFrame(5,i);
          m1 := t1+t2;
          m2 := t3+t4;
          m3 := t1-t2;
          m4 := t3-t4;
          OutFrame(0,i) <= c12*(m1+m2);
          OutFrame(4,i) <= c12*(m1-m2);
          t1 := c15*(m3+m4);
          -- [OutFrame(2,i) and OutFrame(6,i) are assigned from c14*m3, c16*m4
          --  and t1; the operators are illegible in the source]
          m1 := InFrame(0,i)-InFrame(7,i);
          m2 := InFrame(1,i)-InFrame(6,i);
          m3 := InFrame(2,i)-InFrame(5,i);
          m4 := InFrame(3,i)-InFrame(4,i);
          -- [the remaining assignments to t1..t4 and to OutFrame(1,i),
          --  OutFrame(3,i), OutFrame(5,i) and OutFrame(7,i) are illegible
          --  in the source]
        end loop;
      end process;
    end behavior;

Figure 3.19: The VHDL description of DCT

[Figure 3.20: The data flow graph of DCT produced by VHDL2DDS (with the 8 sequentialized array write operations to OutFrame marked)]

3.4 Summary

In this chapter, we have discussed the problem of translating design specifications in behavioral VHDL into a flow-graph representation called DDS for synthesis. The primary objective of this translation is to extract the parallelism from the given sequential design specification by retaining only the essential data and control dependencies in the generated flow graphs. The techniques for control flow analysis, local/global data flow analysis, and graph generation/optimization have been presented. We have also described the modelling of arrays, input/output and inter-process communication in VHDL as well as DDS.

A compiler called VHDL2DDS has been implemented using the techniques described here and is fully operational. VHDL2DDS currently serves as the VHDL front-end of the ADAM high-level synthesis system, accepting new designs to be synthesized. It has been involved in numerous chip design experiments.

We believe the specific methodology presented here could also be applied to translate other HDLs or high-level languages into a synthesizable format. In addition, it can also be used to generate parallel machine code from sequential programs for highly parallel computers such as data-driven machines.
Chapter 4

System-Level Partitioning of Application-Specific Digital Systems

4.1 Introduction

The complexity of modern digital systems keeps increasing even as device sizes shrink dramatically. As a result, many such systems cannot fit into a single integrated circuit using available chip packages while satisfying all the given design constraints. Consequently, these systems must be partitioned onto a number of chips at some point during the design process.

System partitioning can be classified into either structural partitioning or behavioral partitioning. In structural partitioning, the system behavior is first converted to structure, and then the structure is partitioned among chips. Structural partitioning usually can provide a solution that satisfies physical constraints such as area and pins. However, after the system structure is synthesized as a single chip design, it may not be possible to partition the design onto multiple chips while satisfying the constraints (especially timing constraints). Alternatively, if the system partitioning is performed before the behavioral synthesis, such partitioning can heavily influence decisions made in subsequent synthesis tools and may therefore lead to higher performance designs and more efficient use of area and pins. A major difficulty with behavioral partitioning is that the feasibility and quality of the partitioning is hard to measure. Generally speaking, with the help of new prediction techniques [Kuc91], behavioral partitioning is advantageous.

There are essentially two levels of granularity at which behavioral partitioning can be performed.
Previous approaches to behavioral partitioning have mostly focused on partitioning the design behavior, usually a control data flow graph (CDFG), at the operation level onto a number of partitions which can be synthesized into separate hardware modules or chips. On the other hand, behavioral partitioning can be done at a higher level of granularity, such as processes, procedures and memory blocks. Figure 4.1 shows the system partitioning of an ASIC-based CCITT H.261 video decoder from Bellcore Inc. In this figure, one can find that the system has already been divided into a number of well-defined functional entities by the designers. As chip capacities keep increasing, each chip is likely to contain more than one functional entity. In this context, we use the term process-level partitioning to denote the form of system partitioning that preserves the functional boundaries specified by the designers.

[Figure 4.1: An ASIC-based CCITT H.261 video decoder from Bellcore Inc. Blocks shown: IDCT, Rec-Buffer Memory, Error Correction, Variable Length Decoder, Variable Delay, Frame Memory, Image Reconstruction.]

There are several advantages to process-level partitioning. First, there are far fewer objects at the process level than at the operation level, which allows us to use more comprehensive partitioning techniques, such as mixed-integer linear programming, and to take into account more partitioning issues concurrently, such as the chip count, chip packaging and performance constraints. Also, since process-level partitioning preserves functional boundaries specified by the designer, the objects grouped into a chip are familiar to the designers. Therefore, it becomes much easier for the designers to test these chips functionally, compared with chips which are partitioned at the operation or lower levels and may consist of many fragments of different system functions.
In addition, if a process needs to be changed in a future redesign of the system, only the chip where the process is located is affected and the rest of the multi-chip system can remain intact.

Another important issue when partitioning a system with multiple processes is that each process to be synthesized later may have a wide range of alternative implementations with varying area and delay characteristics. The right combination of process implementations has to meet local/global design constraints as well as chip-packaging constraints. For example, Figure 4.2 shows two processes p1 and p2 assigned to a chip. If a performance constraint T is imposed on the signal path from p1 to p2, the design points chosen for p1 and p2 have to satisfy both the performance and chip-area constraints as shown in the figure. At present, selection of cost-performance characteristics for individual processes is mostly done manually by designers. However, the quality of partitioning results is highly sensitive to these design decisions and many tradeoffs are possible. By exploring the design alternatives of the processes in the system during partitioning, design point selection can be done not only to meet the given performance constraints but also in consideration of effects on the partitioned system in terms of cost, size, reliability, and so on.

[Figure 4.2: Selection of process implementations. Processes p1 and p2 are placed on a chip K and annotated with a delay constraint (delay(p1) + delay(p2) < T) and an area constraint relating area(p1) + area(p2) to area_capacity(K).]

In this chapter, a process-level system partitioning approach will be presented. The novel aspect of this approach is that the exploration of process design alternatives is done concurrently with partitioning. In addition, the chip count and chip capacities (area and pins) are not simply a number of given constraints to be satisfied; instead, they are traded off according to the available chip packaging options.
Prototype software, ProPart, based on this process-level partitioning approach has been developed and used to experiment with several examples including a JPEG image compression system.

4.2 Problem Approach

The problem which we are trying to solve here can be briefly described as

    Partitioning a digital system with multiple processes into a number of custom chips so that the design constraints are met and the cost of these custom chips is minimized.

Figure 4.3 shows an overview of our partitioning approach.

[Figure 4.3: Overview of ProPart's partitioning approach. Inputs: system specification (hypergraph (P, E) and process behaviors), design constraints and packaging library. The process behaviors pass through estimation to give predicted points. Process-level partitioning determines chip counts, process to chip assignment, design point selection and chip package selection so as to minimize cost and meet design constraints.]

In our approach, the system to be partitioned is defined by a hypergraph G(P, E), where P is a set of processes and E represents the interconnections among P. Each process in P is either an unrealized one to be synthesized later or an already-implemented chip or chip portion to be used "as is"¹. For each process to be synthesized, its behavior can be passed to behavioral area-delay estimation tools like BEST [KP93] in order to predict a number of possible design points to be chosen during partitioning. In contrast, each implemented macro is assumed to have known area and delay parameters². The hyperedges in E are multi-point interconnections weighted by their bitwidths. Each such edge represents a communication link among the connected processes. Given such a hypergraph, the number of pins needed for a partition (chip) is calculated as the sum of the weights of all hyperedges which cross the partition boundary.

Other inputs of the problem include a package library and a number of design constraints.
The package library contains a set of available or preferred chip packages to be chosen for each chip. The area and pin capacities of these packages as well as their costs are specified in the library. Designers are allowed to impose constraints on performance and cost as well as on the chip count.

¹ The selection of an already-implemented macro versus one to be designed has already been made by higher-level tools.
² Programmable macros, of course, have unknown memory costs and delays. Characteristics of such macros can be predicted using appropriate estimators [RP93].

Our process-level partitioning approach involves several interacting tradeoffs and partitioning decisions. The decisions to be made during partitioning, besides assigning processes to chips, are

• determining the number of chip partitions (chip count),
• selecting a suitable chip package for each chip partition from the library,
• selecting a feasible combination of design points for the processes in the system.

These partitioning decisions are inter-dependent in a complicated way. The design point selection and the process-to-chip assignment determine the sizes of chip partitions and the required numbers of pins. Typically the dimensions of a chip can range from 2 mm² to 25 mm². The area and pin capacities of a chip depend on the package chosen for the chip. For example, a Dual In-line Package (DIP) allows around 64 pins while a Pin Grid Array (PGA) package may allow as many as 300 pins. The cost of a chip is usually proportional to its size, pin count and chosen package. Hence, to reduce the cost of chips, small chip partitions are favored. However, this may increase the chip count. On the other hand, since the off-chip delay is much larger than the on-chip delay, large chip partitions can help in reducing the off-chip delay and in satisfying the performance constraints.
A small chip count (large chip partitions) can also increase the reliability of the system, but may suffer from low chip yields.

Therefore, the ideal techniques for our process-level system partitioning problem would be ones which can make the partitioning decisions and tradeoffs described here concurrently. Two such partitioning techniques, an MILP method and a genetic-search method, will be described in the following section and Section 4.5.2 respectively.

4.3 An MILP Partitioning Method

In this section, we will present a mathematical programming method to solve the process-level system partitioning problem described in the previous section. As we have indicated earlier, the mathematical programming method is feasible since there are relatively few objects at the process level and the method gives us the ability to make the partitioning decisions and satisfy the constraints concurrently. In addition, there exist techniques [Pra93, GE92] to improve the run time for solving ILP models by at least an order of magnitude. In fact, Prakash's result [Pra93] is particularly encouraging to us due to the similarity between his model and ours. For instance, the levels of abstraction on which these two models are based are roughly equivalent though the application domains are completely different. Furthermore, several analogies can be drawn between the decisions to be made during Prakash's synthesis and our partitioning.

Before describing the detailed formulation, we first give an overview of the partitioning model that our formulation is based on. The notation and objective function will be discussed next, followed by the detailed formulation. The linearization of some non-linear constraints in the formulation will be discussed at the end of this section.
4.3.1 Overview of the Partitioning Model

Figure 4.4 shows various partitioning parameters which will be taken into account in our MILP formulation. First, the design point chosen for a process determines the area consumed at the chip where the process is assigned and the delay added to the signal path that the process involves. If not all the processes connected by a hyperedge are assigned to the same chip partition, the hyperedge becomes an inter-chip connection. If a hyperedge is an inter-chip connection, it will introduce an off-chip delay to the signal path that it involves and will add an off-chip connection cost to the overall system cost. In addition, it will consume a number of pins (equal to its weight) at every chip which it connects. On the other hand, if a hyperedge is an on-chip connection, it will only introduce an on-chip delay to the signal path that it involves. In Section 4.5.1, we will extend this partitioning model to trade off communication delay, cost and I/O pins.

[Figure 4.4: Overview of the process-level partitioning parameters. The system hypergraph (processes p; hyperedges with bitwidth, on-chip delay, off-chip delay and interconnect cost) is mapped onto chip partitions with area and I/O pin capacities and cost, drawn from a chip package library that lists, for each package k, its area $AREA_k$, pins $PIN_k$ and cost $COST_k$.]

The package selected for a chip partition determines the area and pin capacities of the chip and contributes a chip packaging cost to the overall system cost. The number of pins required for a chip partition is the sum of the weights of all hyperedges which cross the chip partition boundary. Similarly, the area requirement of a chip partition is determined by the processes assigned to the chip. Both the area and pin requirements of a chip partition must meet the capacities imposed by the selected chip package.
Finally, if any performance constraint is imposed on a signal path, the delays introduced by the processes and the hyperedges on this path must satisfy the performance constraint.

4.3.2 Notation

• $P = \{p \mid p \text{ is a process in the system to be partitioned}\}$. $\alpha_p$ and $d_p$ denote the area and delay of process p. They are constants if p is an already-implemented macro to be used "as is"; otherwise, they are real variables that depend on the design point chosen for p.
• $C = \{c \mid c \text{ is one of the chip partitions}\}$³. The number of elements in C can be determined either by the design constraint on the total chip count from the designer or by a reasonable upper bound like the number of processes in the system.
• $K = \{k \mid k \text{ is one of the available chip packages}\}$. The area and pin capacities of k are denoted by $AREA_k$ and $PIN_k$ respectively and its cost is $COST_k$.
• $E = \{e \mid e \text{ is a hyperedge in the system}\}$. If e becomes an inter-chip connection, its cost and estimated delay are denoted by $COST_e$ and $D^{off}_e$. Otherwise, its estimated on-chip delay is $D^{on}_e$ and it will not introduce an additional cost to the system. The bitwidth of e is denoted by $BW_e$.
• $P_e = \{p \mid p \text{ is a process connected by the edge } e\}$.
• $x_{p,c}$: A binary variable which denotes whether or not process p is assigned to chip c.
• $y_{c,k}$: A binary variable which denotes whether or not package type k is selected for chip c.
• $\beta_c$: A binary variable which indicates whether chip c is needed (not empty) or not.
• $b_e$: A binary variable which indicates whether edge e becomes an inter-chip or on-chip connection.

³ Each partition after system partitioning corresponds to a chip in the system. In the following discussion, we will refer to a chip partition in C by simply saying "a chip c in C".
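As a rough illustration of this notation, the following Python sketch derives the quantities that the model reasons about (which chips are needed, which edges are inter-chip, and the pin and area requirement of each chip) from a given process-to-chip assignment. It is not ProPart code: the `Edge` class, the helper names, and the four-process system with its topology and numbers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    name: str
    bitwidth: int          # BW_e
    processes: frozenset   # P_e, the processes connected by the edge


def chips_needed(assign):
    """Chips with beta_c = 1: those hosting at least one process."""
    return set(assign.values())


def is_inter_chip(edge, assign):
    """b_e = 1 when the processes on the edge sit on more than one chip."""
    return len({assign[p] for p in edge.processes}) > 1


def pins_required(chip, edges, assign):
    """Total bitwidth of the inter-chip edges that touch `chip`."""
    return sum(
        e.bitwidth
        for e in edges
        if is_inter_chip(e, assign)
        and any(assign[p] == chip for p in e.processes)
    )


def area_required(chip, area, assign):
    """Total area alpha_p of the processes assigned to `chip`."""
    return sum(a for p, a in area.items() if assign[p] == chip)


# Hypothetical system: names, topology and numbers are illustrative only.
edges = [
    Edge("e1", 16, frozenset({"p1", "p2"})),
    Edge("e2", 16, frozenset({"p2", "p3"})),
    Edge("e3", 64, frozenset({"p1", "p3", "p4"})),
]
area = {"p1": 1400, "p2": 1040, "p3": 1800, "p4": 950}
# x_{p,c} = 1 exactly when assign[p] == c
assign = {"p1": "c1", "p2": "c1", "p3": "c2", "p4": "c2"}

for chip in sorted(chips_needed(assign)):
    print(chip, area_required(chip, area, assign), pins_required(chip, edges, assign))
# c1 2440 80
# c2 2750 80
```

In this example e1 stays on chip c1, while e2 and e3 cross the boundary, so each chip needs 16 + 64 = 80 pins. A three-point hyperedge such as e3 is counted once per chip it touches, not once per process.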
In summary, the partitioning variables described previously provide the following information:

• The binary variables $x_{p,c}$ and $y_{c,k}$ give us the process-to-chip assignment and the chip package selection respectively.
• From the set of binary variables $\beta_c$ we can determine and/or control the final chip count.
• The set of binary variables $b_e$ is used to denote the inter-chip connections as well as the pin requirement of each chip.
• The real variables $\alpha_p$ and $d_p$ of process p, once determined by the selection of a design point, become the area and performance constraints when synthesizing p. $\alpha_p$ can also be used to perform a process-level floorplan before synthesizing the system structure. Thus, more accurate global wiring delays can be available to the synthesis tools.

4.3.3 Objective Function

The objective is to minimize the overall cost after system partitioning. It consists of the cost of the new custom chips and the cost of the inter-chip connections. This cost function $T$ to be minimized can be stated as follows:

\[ \sum_{c \in C} \sum_{k \in K} COST_k \times y_{c,k} + \sum_{e \in E} COST_e \times b_e \tag{4.1} \]

Other objective functions can be used as well if necessary. For example, designers may simply want to perform an N-way system partitioning while minimizing inter-chip connections. This can be achieved by the following objective function:

\[ \sum_{e \in E} BW_e \times b_e \]

and the following constraint on the total chip count:

\[ \sum_{c \in C} \beta_c = N \]

4.3.4 Constraints

The formulation for process-level system partitioning consists of 7 classes of constraints.

4.3.4.1 Process to Chip Assignment

In our partitioning model, each process must be assigned to one and only one chip. This can be achieved by the following constraint using the binary variables $x_{p,c}$:

\[ \sum_{c \in C} x_{p,c} = 1 \quad \forall p \in P \tag{4.2} \]

Hence, one and only one of $x_{p,c}$ will be 1 for each process p.
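To make the cost function (4.1) and the assignment constraint (4.2) concrete, here is a minimal Python sketch that evaluates both for two hand-written candidate solutions. It only evaluates given 0/1 values and does not solve the MILP. The function names, the three-process system, the choice of cut edges and all numbers are assumptions made for illustration.

```python
def assigned_exactly_once(x, processes, chips):
    """Constraint (4.2): for every p, the x[p, c] over all chips sum to 1."""
    return all(sum(x.get((p, c), 0) for c in chips) == 1 for p in processes)


def total_cost(y, b, package_cost, edge_cost):
    """Cost function (4.1): selected package costs plus inter-chip edge costs."""
    chip_cost = sum(package_cost[k] * sel for (_chip, k), sel in y.items())
    link_cost = sum(edge_cost[e] * cut for e, cut in b.items())
    return chip_cost + link_cost


# Illustrative numbers only (hypothetical packages k1, k2 and edges e1, e2, e7).
package_cost = {"k1": 800, "k2": 1800}
edge_cost = {"e1": 160, "e2": 160, "e7": 640}
processes, chips = ["p1", "p2", "p3"], ["c1", "c2"]

# Candidate A: everything on chip c1 in package k2; no edge is cut.
x_a = {("p1", "c1"): 1, ("p2", "c1"): 1, ("p3", "c1"): 1}
y_a = {("c1", "k2"): 1}
b_a = {"e1": 0, "e2": 0, "e7": 0}

# Candidate B: two k1 chips; e2 and e7 are assumed to cross the boundary.
x_b = {("p1", "c1"): 1, ("p2", "c1"): 1, ("p3", "c2"): 1}
y_b = {("c1", "k1"): 1, ("c2", "k1"): 1}
b_b = {"e1": 0, "e2": 1, "e7": 1}

print(assigned_exactly_once(x_a, processes, chips))   # True
print(total_cost(y_a, b_a, package_cost, edge_cost))  # 1800
print(total_cost(y_b, b_b, package_cost, edge_cost))  # 2400
```

With these numbers the single-chip candidate is cheaper (1800 against 800 + 800 + 160 + 640 = 2400), which shows how the inter-chip connection cost can outweigh the saving from smaller packages.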
If there exist processes which cannot be assigned to a single chip using any package in the library while meeting their design constraints, then no solution will be found by this model. In such cases, these processes can be decomposed by operation-level partitioning approaches into a number of smaller synchronous processes before partitioning the system. However, we believe such cases will become less frequent as chip capacities continue to increase.

4.3.4.2 Chip Package Selection

Each chip c in C is needed (not empty) if and only if there are one or more processes assigned to it. This can be stated as:

\[ (\beta_c = 1) \Leftrightarrow \Big( \sum_{p \in P} x_{p,c} > 0 \Big) \quad \forall c \in C \tag{4.3} \]

The $\Leftrightarrow$ symbol denotes the if-and-only-if relation. Hence, for each chip c, $\beta_c$ becomes 1 if and only if there is at least one $x_{p,c}$ which is equal to 1. If a chip is needed, we must select a chip package for it from the library. Therefore, for each chip c, one and only one of the binary variables $y_{c,k}$ must be 1 when $\beta_c$ is equal to 1; otherwise, all $y_{c,k}$ must be 0. This is achieved by:

\[ \sum_{k \in K} y_{c,k} = \beta_c \quad \forall c \in C \tag{4.4} \]

Since the constraint on the total chip count, if given, is reflected in the number of elements in C, the final number of chips that are needed will always be less than or equal to the given constraint. If the designer insists on a particular number of chips after partitioning, it can be achieved by adding the following constraint:

\[ \sum_{c \in C} \beta_c = CHIPCOUNT \tag{4.5} \]

where $CHIPCOUNT$ is the final chip count given by the designer.

4.3.4.3 Inter-chip Connection

A hyperedge e will become an inter-chip connection if and only if not all the processes connected by e are assigned to the same chip. This is reflected by the following constraint:

\[ b_e = 1 - \sum_{c \in C} \min_{p \in P_e} x_{p,c} \quad \forall e \in E \tag{4.6} \]

Hence, $b_e$ will be 0 when the term $\sum_{c \in C} \min_{p \in P_e} x_{p,c}$ at the right-hand side of this equation becomes 1.
Due to the function min, the only case in which this term is equal to 1 is when there is a chip c such that $x_{p,c}$ is 1 for every process p in $P_e$⁴. In such a case, all the processes connected by e are assigned to chip c. Otherwise, every min function will be 0, which means that not all the processes connected by e are assigned to the same chip. Therefore, $b_e$ is equal to 1 and e is an inter-chip connection.

⁴ There exists at most one such chip in any case; hence, $\sum_{c \in C} \min_{p \in P_e} x_{p,c}$ will never be greater than 1.

4.3.4.4 Pin Capacity Constraints

Obviously, the I/O pins used at each chip cannot exceed the pin capacity of the selected package for this chip. This can be enforced by

\[ \sum_{e \in E} b_e \times BW_e \times \max_{p \in P_e} x_{p,c} \le \sum_{k \in K} PIN_k \times y_{c,k} \quad \forall c \in C \tag{4.7} \]

The left-hand side of this constraint represents the total bitwidth of the inter-chip connections involving chip c. When an edge e is an inter-chip connection involving chip c, both $b_e$ and $\max_{p \in P_e} x_{p,c}$ will be 1. The right-hand side of this constraint is simply the pin capacity of the package selected for chip c.

4.3.4.5 Area Capacity Constraints

Like the I/O pin constraints, the area consumed by the processes assigned to a chip cannot exceed the area capacity of the selected package for this chip. This can be constrained by

\[ \sum_{p \in P} x_{p,c} \times \alpha_p \le \sum_{k \in K} AREA_k \times y_{c,k} \quad \forall c \in C \tag{4.8} \]

Here, $\alpha_p$ is a constant if process p is an already-implemented macro to be used "as is"; otherwise, it is a variable which depends on the design point chosen for p.

4.3.4.6 Timing Constraints

The designer may impose a number of timing constraints on the system. In our model, each timing constraint is associated with a path in the system hypergraph. Each such path is represented by a sequence $p_0, e_1, p_1, e_2, \ldots, p_{t-1}, e_t, p_t$ such that $p_i \in P$ and $e_i \in E$ for all i and $p_i \in P_{e_i}$ for $i = 1, \ldots, t$.
The timing constraint over this path becomes

\[ d_{p_0} + d_{e_1} + d_{p_1} + \ldots + d_{e_t} + d_{p_t} \le T_t \tag{4.9} \]

where

• $T_t$ is a constant that represents the upper bound on the delay over this path.
• $d_{e_i}$ denotes the delay of edge $e_i$ for $i = 1, \ldots, t$. Since $e_i$ may become either an on-chip or off-chip connection, $d_{e_i}$ is a variable. Let $D^{on}_{e_i}$ and $D^{off}_{e_i}$ be the estimated on-chip and off-chip delays of $e_i$ respectively. We have the following equation:

\[ d_{e_i} = b_{e_i} \times D^{off}_{e_i} + (1 - b_{e_i}) \times D^{on}_{e_i} \tag{4.10} \]

• $d_{p_i}$ is the delay of process $p_i$ for $i = 0, \ldots, t$. It is a constant if $p_i$ is an already-implemented macro; otherwise, it is a variable which depends on the design point chosen for $p_i$.

4.3.4.7 Exploration of Design Alternatives

As we discussed earlier, each process may have many possible design alternatives to be chosen during partitioning. For processes to be synthesized later, behavioral area-delay estimation tools like BEST [KP93] can be used to predict a number of possible design points from their behavioral specifications. If local performance and area constraints are imposed on a process whose design alternatives are to be explored during partitioning, the local constraints can be used to trim the design space first. For example, in Figure 4.5, there are only 3 feasible design points left for process p after trimming its design space using the local performance and area constraints.

[Figure 4.5: Trimming the design space to be explored using local performance and area constraints]

If the number of feasible design points for a process p is small, a set of binary variables $z_{i,p}$ is introduced to select one and only one of them. Let $I_p$ be the set of design points for p. We have the following constraint.
\[ \sum_{i \in I_p} z_{i,p} = 1 \quad \forall p \in P \tag{4.11} \]

In addition, the $\alpha_p$ (area) and $d_p$ (delay) of process p become

\[ \alpha_p = \sum_{i \in I_p} z_{i,p} \times AREA_{i,p} \]

\[ d_p = \sum_{i \in I_p} z_{i,p} \times DELAY_{i,p} \]

However, if the number of design points for a process is large, it will not be feasible to use binary variables for selection. In such a case, the design space is approximated by a set of piece-wise linear equations [NW88] as shown in Figure 4.6. For example, let the design space of a process p contain four design points, $\{(A_1, D_1), (A_2, D_2), (A_3, D_3), (A_4, D_4)\}$.

[Figure 4.6: Approximating the design space using piece-wise linear equations]

We introduce the following three piece-wise linear constraints into the partitioning model:

\[ \alpha_p - A_1 \ge \frac{A_2 - A_1}{D_2 - D_1} \times (d_p - D_1) \]

\[ \alpha_p - A_2 \ge \frac{A_3 - A_2}{D_3 - D_2} \times (d_p - D_2) \]

\[ \alpha_p - A_3 \ge \frac{A_4 - A_3}{D_4 - D_3} \times (d_p - D_3) \]

Since these piece-wise linear constraints involve only real variables, the run time of solving MILP models will only be affected moderately by the number of these constraints.

4.3.5 Linearization

The formulation presented earlier contains some non-linear constraints which have to be linearized in order to be solved by MILP solvers.

Constraint 4.3 contains an if-and-only-if relation. It can be replaced by the following linear constraints:

\[ \beta_c \ge x_{p,c} \quad \forall p \in P \]

\[ \beta_c \le \sum_{p \in P} x_{p,c} \]

Hence, if $\beta_c$ is 0, then the first set of constraints will ensure that $x_{p,c}$ is 0 for all p. If $\beta_c$ is 1, the second constraint requires that there be at least one p in P such that $x_{p,c}$ is 1.

The right-hand side of Constraint 4.6 contains the following non-linear terms:

\[ \min_{p \in P_e} x_{p,c} \quad \forall c \in C \]

Each of these terms can be replaced by a binary variable $n_{e,c}$ subject to the following constraints:

\[ n_{e,c} \le x_{p,c} \quad \forall p \in P_e \]

\[ n_{e,c} + (|P_e| - 1) \ge \sum_{p \in P_e} x_{p,c} \]

The first set of constraints ensures that $n_{e,c}$ is 0 if there exists a p in $P_e$ such that $x_{p,c}$ is 0.
If $x_{p,c}$ is 1 for all p in $P_e$, the right-hand side of the second constraint becomes $|P_e|$; hence, $n_{e,c}$ must be 1.

Similarly, each max function in Constraint 4.7 can be replaced by a binary variable $m_{e,c}$ used in the following constraints:

\[ m_{e,c} \ge x_{p,c} \quad \forall p \in P_e \]

\[ m_{e,c} \le \sum_{p \in P_e} x_{p,c} \]

The first set of constraints ensures that $m_{e,c}$ is 1 if there exists a p in $P_e$ such that $x_{p,c}$ is 1. If $x_{p,c}$ is 0 for all p in $P_e$, the second constraint will force $m_{e,c}$ to be 0.

After replacing each max term by a binary variable as described above, the left-hand side of Constraint 4.7 will contain $b_e \times m_{e,c}$, which can be further replaced by another binary variable $w_{e,c}$ subject to

\[ w_{e,c} \le b_e \]

\[ w_{e,c} \le m_{e,c} \]

\[ w_{e,c} + 1 \ge b_e + m_{e,c} \]

Finally, each non-linear term $x_{p,c} \times \alpha_p$ in Constraint 4.8 can be replaced by a real variable $v_{p,c}$ confined by

\[ v_{p,c} \ge 0 \]

\[ v_{p,c} \ge \alpha_p - (1 - x_{p,c}) \times BIG \]

where $BIG$ is a constant with a reasonably large value. In addition, the original cost function $T$ to be minimized is changed to

\[ T + BIG \times \sum_{p \in P} \sum_{c \in C} v_{p,c} \]

As a result, if $x_{p,c}$ is 1, $v_{p,c}$ will be set to $\alpha_p$; otherwise, $v_{p,c}$ will be 0.

4.4 Experiments

A prototype software, ProPart, based on our process-level system partitioning approach has been developed and used to experiment with several examples. It consists of an MILP model generator, an MILP solver and a solution analyzer. The MILP solver is a branch-and-bound program called Bozo, written by Hafer [HH90], which invokes a commercial linear programming package, XLP, developed by XMP Software, Inc.

4.4.1 A Mobile Phone System Example

[Figure 4.7: A multiple-process system to be partitioned. Processes: P1 (digital modem), P2 (voice codec), P3 (data processor), P4 (system controller); edges e1 to e7.]

Figure 4.7 shows a multiple-process system derived from the digital cellular mobile telephone system illustrated in the referenced article [BS92].
In this example, there are four processes and seven edges. Three processes, p1, p2 and p3, are assumed to be synthesized later. Their approximated area-delay curves are given in Figure 4.8. Process p4 is assumed to be an already-implemented macro whose area is 950 and delay is 200. A performance constraint, 800, is imposed on the path from p1 to p2. The delay from p3 to the inputs of p1 or p2 through edge e7 is limited to 300. Finally, the parameters of these edges are shown in Table 4.1.

[Figure 4.8: The approximated area/delay curves of processes p1, p2 and p3. Curve vertices as (delay, area): P1: (150, 2000), (300, 1400), (450, 1000); P2: (200, 3000), (350, 1600), (500, 1000); P3: (175, 2500), (250, 2000), (325, 1750).]

Edge          Bitwidth   On-chip Delay   Off-chip Delay   Off-chip Cost
e1, ..., e6   16         10              30               160
e7            64         10              30               640

Table 4.1: Edge parameters for the mobile phone example

The package library shown in Table 4.2 was first used to partition the system. The MILP model produced by ProPart consists of 130 constraints and 64 variables (47 binary variables). Bozo was able to find a minimal-cost solution in a few seconds on a Sun SPARCsystem 300. The solution is a single-chip partitioning using package k2. The design points chosen for processes p1, p2 and p3 are shown in Table 4.3.

Package   Area    Pins   Cost
k1        3000    80     800
k2        6000    100    1800
k3        9000    120    4000
k4        15000   140    6600

Table 4.2: Package Library 1

Process   Area     Delay
p1        1400     300
p2        1040     490
p3        1866.7   290

Table 4.3: Design points chosen for processes p1, p2 and p3

Package   Area   Pins   Cost
k1        2000   128    800
k2        4000   140    1000
k3        5000   150    4000
k4        6000   160    6600

Table 4.4: Package Library 2

In order to show the tradeoff ability of our approach, we modified the package library (shown in Table 4.4) as well as the timing constraints. Specifically, we reduced the area capacities of each package but increased their pin capacities.
In addition, we lowered the costs of smaller packages so that multiple-chip solutions could become competitive. We also tightened the two timing constraints to 450 and 250 respectively so that processes p1, p2 and p3 would require design points with smaller delays but larger areas. As we expected, Bozo found a minimal-cost two-chip partitioning in a similar run time. The solution is shown in Figure 4.9.

[Figure 4.9: A two-chip partitioning for the mobile phone example. Chip 1 uses package k3 (Area = 5000, Pins = 150) and Chip 2 uses package k2 (Area = 4000, Pins = 140). Design points shown: P1 (A: 2000, D: 150), P2 (A: 2160, D: 290), P3 (A: 2200, D: 220), P4 (A: 950, D: 220).]

4.4.2 A Powertrain Control System Example

Similar experiments were also performed on a system derived from the GM powertrain control application [Fuh91] as shown in Figure 4.10. This example contains 10 processes and 17 edges. Five general processes p1, ..., p5 are assumed to be synthesized later; u1, ..., u4 are counters and f1 is a finite-state machine. Table 4.5 summarizes the parameters used in the experiments. Two experiments were performed using the two previous package libraries, Tables 4.2 and 4.4, each paired with one of the two sets of timing constraints shown in Table 4.5. The MILP model generated by ProPart consists of 405 constraints and 184 variables (155 binary variables). The minimal-cost solution under package library 1 and the first set of timing constraints was a single-chip partitioning using package k4. The design points chosen for processes p1, ..., p5 are shown in Table 4.6. When package library 2 and the second set of timing constraints were used, the minimal-cost solution found was a three-chip partitioning as shown in Figure 4.11.
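Several of the design points reported for these experiments are not vertices of the area/delay curves, which is what one would expect if the design space is handled by the piece-wise linear approximation of Section 4.3.4.7. The short Python check below interpolates along the curve segments, using the powertrain vertices from Table 4.5. The helper `area_at` is a hypothetical name introduced here, and treating the reported points as lying exactly on a segment is an assumption of this sketch, not a statement about how the solver works internally.

```python
def area_at(points, delay):
    """Linearly interpolate the area at `delay` on a curve of (delay, area) vertices."""
    pts = sorted(points)
    for (d0, a0), (d1, a1) in zip(pts, pts[1:]):
        if d0 <= delay <= d1:
            return a0 + (a1 - a0) * (delay - d0) / (d1 - d0)
    raise ValueError("delay lies outside the curve")


# (delay, area) vertices of the powertrain processes, taken from Table 4.5.
curves = {
    "p3": [(175, 2500), (250, 2000), (325, 1750)],
    "p4": [(175, 2500), (325, 1500), (475, 1100)],
    "p5": [(125, 1900), (250, 1200), (350, 950)],
}

print(area_at(curves["p4"], 340))            # 1460.0
print(area_at(curves["p5"], 290))            # 1100.0
print(round(area_at(curves["p3"], 270), 1))  # 1933.3
```

The first two values match the points (1460, 340) and (1100, 290) listed in Table 4.6, which therefore lie on the p4 and p5 curves, and the third matches the point (1933.3, 270) shown in Figure 4.11.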
In these experiments, feasible solutions were found in a few seconds; however, minimal-cost solutions took 6-12 hours on a Sun 4/200 system. We have not made any attempt to optimize the run time of the MILP solver.

[Figure 4.10: An example derived from the GM powertrain application. Processes p1 to p5, counters u1 to u4 and finite-state machine f1, connected by edges e1 to e17.]

Process Characteristics
Process       Area/delay points
u1, ..., u4   500/20 (fixed)
f1            1500/40 (fixed)
p1            2000/150   1400/300   1000/450
p2            3000/200   1600/350   1000/500
p3            2500/175   2000/250   1750/325
p4            2500/175   1500/325   1100/475
p5            1900/125   1200/250   950/350

Edge Parameters
Edge     Bitwidth   On-chip Delay   Off-chip Delay   Cost
e1       24         10              30               240
e9       32         10              30               320
e11      32         10              30               320
others   8          10              30               80

Timing Constraints
Path              1st Set   2nd Set
t1: p2 e10 p3     900       680
t2: p2 e12 p4     850       705
t3: p1 e3 p3      800       580
t4: p5 e14        300       280

Table 4.5: The parameters of the powertrain example

Process   Area   Delay
p1        1000   450
p2        1000   500
p3        1750   325
p4        1460   340
p5        1100   290

Table 4.6: Design points chosen for the single-chip partitioning of the powertrain example

[Figure 4.11: A three-chip partitioning of the powertrain example. Chips 1, 2 and 3 each use package k2 (Area = 4000, Pins = 140). Design points shown: (A: 1400, D: 300), (A: 1933.3, D: 270), (A: 1520, D: 370), (A: 1200, D: 250), (A: 1500, D: 325).]

4.4.3 A JPEG Image Compression System Example

Our system partitioning methodology was applied to the design of a JPEG still-image compression system [Wal91] using a number of the Unified System Construction (USC) tools, including ProPart. The USC project at the University of Southern California involves the production of an integrated set of system-level tools for synthesizing multi-chip, heterogeneous application-specific systems which meet cost, performance and power constraints.
Figure 4.12 shows the input JPEG system specification, the design flow and the output of various tools. Four major synthesis tools were used in this experiment. For datapath synthesis, we used SMASH [GP94] (scheduling) and MABAL [KP89] (allocation and binding). The ProPart tool that we discussed in this chapter was used to partition the JPEG system. A multiprocessor synthesis tool called SOS [Pra93] was used to study the architecture tradeoff of the 2D-DCT (Discrete Cosine Transform) function. This experiment has been described in detail in another article [GCDBP94]. We will only discuss the design activities related to the system partitioning here.

In this experiment, we began with the synthesis of the DCT function. The 2D-DCT was first decomposed into repeated row-column 1D-DCTs prior to the application of the synthesis tools (Figure 4.13). The 1D-DCT macro was synthesized first and used to construct a 2D-DCT, clearly a bottom-up step. SMASH was used to generate five schedules with varying cost and performance for a 1D-DCT macro from a behavioral VHDL description of the 8-point DCT described in the referenced article [FLS+92]. These 1D-DCT schedules were then processed by MABAL to generate the RTL datapath netlists. The netlists were analyzed to obtain the area characteristics of the datapaths as shown in Figure 4.14.
The areas for functional units, multiplexers and registers were determined from the netlists, and wiring area was estimated manually using a rule of thumb which we observed in our earlier experiments^5 [PGH91].

^5 We did not use our wiring area estimation tools in this experiment due to lack of time.

[Figure 4.12: Design flow for the JPEG system example. The JPEG system specification (source image data, 2D-DCT, quantizer, entropy encoder, compressed image data, entropy decoder, dequantizer, 2D-IDCT, reconstructed image data) is processed as follows: the 2D-DCT is decomposed into two 1D-DCTs; the 1D-DCT VHDL description passes through VHDL2DDS, SMASH and MABAL to give five 1D-DCT RTL designs and five 2D-DCT estimations; three 2D-DCT multiprocessor architectures are also considered; ProPart produces the partitioned three-chip implementation of JPEG; and the 2D-DCT layout (Chip 1) is generated using ChipCrafter.]

[Figure 4.13: Decomposing a 2D-DCT into repeated row-column 1D-DCTs. An 8 x 8 pixel frame passes through a 1D-DCT over rows and a 1D-DCT over columns to give the transformed frame.]

All areas are in 10^6 µm^2.

| Design Number | Functional area (A) | Interconnect area: Mux + Registers (B) | Interconnect area: Wiring (C = 2(A+B)) | Total area |
|---|---|---|---|---|
| 1 | 29.35 | 3.74 | 66.18 | 99.27 |
| 2 | 19.68 | 3.87 | 47.09 | 70.63 |
| 3 | 17.20 | 4.05 | 42.50 | 63.74 |
| 4 | 9.92 | 3.67 | 27.18 | 40.77 |
| 5 | 5.12 | 4.12 | 18.48 | 27.73 |

Figure 4.14: 1D-DCT RTL designs from MABAL

Before partitioning the JPEG system using ProPart, we estimated the performance and area of all the functions in the system. A 2D-DCT architecture consisting of two 1D-DCT modules and an 8 x 8 frame buffer was selected as described in the literature [FLS+92]. The worst-case datapath delay was used to calculate the performance for each design, assuming a two-phase non-overlapping clocking scheme. The quantizer performance and area were estimated similarly,
and the parameters estimated are comparable to those reported in the literature [FLS+92]. We used the parameters of an existing implementation for the Huffman coding [PP93].

(a) Process Characteristics

| Process | Estimated area/delay points |
|---|---|
| 2d-dct | 217096/480, 159819/780, 146024/900, 100106/1200, 74004/1560 |
| quan | 19036/89 (fixed) |
| enco | 12200/100 (fixed) |
| deco | 12200/100 (fixed) |
| dequ | 19036/89 (fixed) |
| 2d-idct | 217096/480, 159819/780, 146024/900, 100106/1200, 74004/1560 |

(b) Package Library

| Package | Area | Pins | Cost |
|---|---|---|---|
| k100 | 84174 | 260 | 1214 |
| k209 | 175528 | 304 | 4240 |
| k271 | 227306 | 344 | 6849 |
| k464 | 389800 | 452 | 18923 |

Note: 1. Area is 10 sq. microns. 2. Delay is in nanoseconds. 3. Cost is a function of area and pin capacities.

Figure 4.15: Parameters used by ProPart

^6 The library data was derived from a commercial ASIC library (LSI LCA300K).

After estimating the performance and area of all the functions in the JPEG system, we partitioned it using ProPart. The data for each function in the system is summarized in Figure 4.15(a). As we can see from the table, the 2D-DCT and IDCT functions each have five possible design alternatives to be chosen during partitioning. The package library^6 used in this experiment is shown in Figure 4.15(b). The MILP model generated by ProPart consists of 162 constraints and 83 variables (73 binary variables). The minimal-cost partitioning was found in less than 1 minute on a Sun 4/200 system. The solution is a three-chip partitioning as shown in Figure 4.16. The design point selected for 2D-DCT corresponds to the second 1D-DCT design produced by SMASH and MABAL. Note that ProPart placed the DCT and IDCT on separate chips, and lumped the remaining functions onto a single chip. Finally, the layouts of the 1D-DCT macro and 2D-DCT chip
were generated using Cascade Design Automation's ChipCrafter according to the design point selected by ProPart.

[Figure 4.16: A three-chip partitioning of the JPEG system. One k209 package holds the 2D IDCT, a k100 package holds the dequantizer, entropy decoder, entropy encoder and quantizer, and a second k209 package holds the 2D DCT. The figure is annotated with the estimated area and delay of the design choice for the 2-D DCT and IDCT chips.]

From this experiment, we found that the overall system design flow was not completely top-down. There were some portions where the tools were used in a bottom-up fashion. For example, the use of macro synthesis, then the use of the macro parameters in partitioning, and finally the incorporation of macros into chips clearly show a cycling between chip-level and system-level design tasks. We believe that this mixture of top-down and bottom-up design flows will be common in the design of large digital systems. This is because the most suitable methodologies to implement various system functions or to estimate their parameters are usually different and also vary from one design situation to another. We found that our system partitioning approach does provide this kind of flexibility.

4.5 Extensions

In this section, two extensions to our system partitioning approach will be discussed. First, we will show how to extend our MILP partitioning method to trade off communication delay, cost and I/O bandwidth. A genetic-search partitioning method will also be described.

4.5.1 Tradeoffs of Communication Delay and Hardware

The system partitioning approach described previously assumes that when a hyperedge becomes an inter-chip connection, the number of pins required at each chip connected to the edge is fixed. For some data communication between processes, there exist tradeoffs between the communication delays and hardware such as I/O pins and buffers.
For example, a 16-bit parallel data transfer between two processes on separate chips could be converted to two 8-bit transfers in order to reduce pins at the expense of a longer communication delay. In this section, we will discuss a way to incorporate communication tradeoffs into our MILP formulation for system partitioning.

Like the exploration of each process's design alternatives as discussed in Section 4.3.4.7, the communication tradeoffs can be achieved by selecting a suitable communication alternative for each hyperedge in the system during partitioning. For each hyperedge $e$, let $I_e$ be a set of communication alternatives for $e$. Each point $i$ in $I_e$ defines the off-chip communication delay $D^{off}_{i,e}$, the bitwidth $BW_{i,e}$, the buffer area $BA_{i,e}$ and the cost $COST_{i,e}$ for the hyperedge $e$ when $i$ is selected. The selection of the communication alternatives can be done by using a set of binary variables $z_{i,e}$ such that

$$\sum_{i \in I_e} z_{i,e} = 1 \quad \forall e \in E \qquad (4.12)$$

In our original MILP formulation, $BW_e$ (bitwidth), $D^{off}_e$ (off-chip delay) and $COST_e$ are constants. They now become variables subject to

$$bw_e = \sum_{i \in I_e} z_{i,e} \times BW_{i,e}$$

$$d^{off}_e = \sum_{i \in I_e} z_{i,e} \times D^{off}_{i,e}$$

$$cost_e = \sum_{i \in I_e} z_{i,e} \times COST_{i,e}$$

An additional variable $ba_e$ is introduced for each edge $e$ to denote the buffer area that will be consumed at each chip it connects.

$$ba_e = \sum_{i \in I_e} z_{i,e} \times BA_{i,e}$$

Finally, the area-capacity constraint needs to be modified to reflect the buffer area consumed by the inter-chip communications. Specifically, Constraint 4.8 becomes

$$\sum_{p \in P} [\ldots] + \sum_{e \in E} [\ldots] \times ba_e \times \max_{p \in P_e} [\ldots] \le \sum_{k \in K} AREA_k \times y_{c,k} \quad \forall c \in C \qquad (4.13)$$

[The summand of the first term, one factor of the second term and the argument of the max are illegible in the source and are shown as [...].]

where the second term on the left-hand side of this constraint represents the total buffer area used by all the hyperedges which involve chip $c$ and become inter-chip communications.

Of course, all enhancements and additions to this model increase MILP run time, and so alternative partitioning methods have been explored.
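The one-hot selection of Equation 4.12 and the derived edge variables can be illustrated with a small Python sketch. It is not part of ProPart: the class name `CommAlternative`, the function `apply_selection` and all numeric values are hypothetical, chosen only to mirror the 16-bit versus two 8-bit transfer example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommAlternative:
    """One communication alternative i in I_e for a hyperedge e (illustrative)."""
    bitwidth: int        # BW_{i,e}
    off_chip_delay: int  # D^off_{i,e}
    buffer_area: int     # BA_{i,e}
    cost: int            # COST_{i,e}


def apply_selection(alternatives, z):
    """Evaluate bw_e, d_off_e, cost_e and ba_e for a 0/1 selection vector z.

    z must satisfy Equation 4.12: exactly one entry equals 1.
    """
    if len(z) != len(alternatives) or any(v not in (0, 1) for v in z):
        raise ValueError("z must be a 0/1 vector with one entry per alternative")
    if sum(z) != 1:
        raise ValueError("exactly one alternative must be selected")
    return {
        "bw": sum(v * a.bitwidth for v, a in zip(z, alternatives)),
        "d_off": sum(v * a.off_chip_delay for v, a in zip(z, alternatives)),
        "cost": sum(v * a.cost for v, a in zip(z, alternatives)),
        "ba": sum(v * a.buffer_area for v, a in zip(z, alternatives)),
    }


# Hypothetical alternatives for one edge: a single 16-bit transfer, or two
# 8-bit transfers that halve the pins but add delay and buffer area.
EDGE_ALTERNATIVES = [
    CommAlternative(bitwidth=16, off_chip_delay=30, buffer_area=0, cost=160),
    CommAlternative(bitwidth=8, off_chip_delay=60, buffer_area=50, cost=80),
]

if __name__ == "__main__":
    print(apply_selection(EDGE_ALTERNATIVES, [0, 1]))
```

In the MILP itself the entries of `z` are decision variables chosen by the solver; the sketch only shows how a fixed selection determines the four edge quantities.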
4.5.2 A Genetic-Search Partitioning Method

In this section, we will discuss the application of Genetic Algorithms (GA) to our system partitioning problem in order to find acceptable solutions in a more manageable run time than the MILP method presented earlier. GA is a more suitable alternative to our MILP method than group-migration techniques like the Kernighan-Lin algorithm. This is because the objects (processes) to be partitioned may not have constant attributes such as area and delay if they have several design alternatives to be explored; hence, it is not clear how to measure the improvement resulting from the movement of such an object across partitions in a group-migration approach. On the other hand, GA is a global optimization technique which can be applied to complex problems whose parameters are inter-dependent and should be considered concurrently.

Overview of Genetic Algorithms

GAs use a computational model inspired by evolution. They encode a potential solution to a specific problem on a simple chromosome-like data structure. GAs are iterative procedures which maintain a population of candidate solutions. Each structure in the population is usually represented by a fixed-length binary string which represents a vector of parameters to the objective function. During each iteration, the current population is evaluated, and, on the basis of that evaluation, a new population of candidate solutions is formed using several genetic operators like selection, crossover and mutation. The selection operator ensures that the expected number of times that a solution is chosen for the next generation is proportional to the "goodness" of the solution. The crossover and mutation operators are used to introduce variation into the new population in order to search other points in the search space.
Under the crossover operator, two parent solutions exchange portions of their binary representation to generate new sample points in the search space. After crossover, each bit in the population undergoes mutation with some low probability. The termination of a GA procedure may be triggered by finding an acceptable approximate solution, by fixing the total number of iterations, or by some other application-dependent criterion. For an in-depth introduction to GAs, readers may consult additional references [Gol89, Whi93].

Encoding of the System Partitioning Problem

Generally, there are only two major components of a GA that are problem dependent; namely, the solution encoding and the evaluation function. From the viewpoint of a GA, the problem to be solved is like a black box with a series of control dials representing different parameters, and the output of the black box is only a value returned by the evaluation function which indicates how well a particular combination of parameter settings solves the given problem. In what follows, we will discuss how to encode our system partitioning problem as a genetic-search problem and demonstrate the idea using a task-independent GA system called GAucsd [SG92].

Parameter Representation

GAs work on the coding of the problem parameters rather than the actual problem. Hence, we need to determine

1. what the essential parameters are that can characterize a valid system partitioning, and
2. how to encode them in a way that the crossover and mutation operators will still generate valid solutions.

In our system partitioning problem, there are two primary partitioning decisions: the process-to-chip assignment and the design point selection. The chip package selection is a secondary decision which can be determined using the best-fit strategy once the primary decisions are made.
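A minimal Python sketch of a best-fit package selection follows. It assumes that "best fit" means the cheapest package whose area and pin capacities both cover the requirement; the function name `best_fit_package` and the pin requirement of 100 used in the example are hypothetical. The library entries are the four packages listed in Figure 4.15(b).

```python
# Package library from Figure 4.15(b): (name, area capacity, pin capacity, cost).
PACKAGE_LIBRARY = [
    ("k100", 84174, 260, 1214),
    ("k209", 175528, 304, 4240),
    ("k271", 227306, 344, 6849),
    ("k464", 389800, 452, 18923),
]


def best_fit_package(area, pins, library=PACKAGE_LIBRARY):
    """Return (name, cost) of the cheapest package that fits, or None.

    Packages are scanned in order of increasing cost, so the first one
    that satisfies both the area and the pin requirement is the cheapest.
    """
    for name, area_cap, pin_cap, cost in sorted(library, key=lambda p: p[3]):
        if area <= area_cap and pins <= pin_cap:
            return name, cost
    return None


if __name__ == "__main__":
    # Quantizer, encoder, decoder and dequantizer together (areas from
    # Figure 4.15(a)); the pin count of 100 is an assumed value.
    print(best_fit_package(19036 + 12200 + 12200 + 19036, 100))
    # The second 2D-DCT design point, again with an assumed pin count.
    print(best_fit_package(159819, 100))
```

Because the package follows deterministically from the area and pin demand of a chip partition, it does not need to be encoded as a separate search parameter.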
The process-to-chip assignment can be represented by a number of assignment parameters. Each process in the system is associated with an assignment parameter whose value denotes the chip partition where the process is assigned. These assignment parameters can be encoded by a sequence of fixed-length bit vectors. The length of these bit vectors is determined by the upper bound on the final chip count.

Similarly, the exploration of a process's design space can be done by using a selection parameter whose value determines a specific design point from the set of possible alternatives. However, encoding such a selection parameter into a fixed-length bit vector may result in several unnecessary bit patterns. For example, if the set of design alternatives of a process consists of 100 points, we need at least 7 bits to cover this range. This coding consists of 128 discrete values; therefore, there are 28 additional bit patterns. One solution is to randomly duplicate the design points in the set to fill out these unnecessary bit patterns; otherwise, they should be evaluated by default to a worst possible point which will never be selected when compared to others in the set.

For example, the JPEG example given in Section 4.4.3 can be encoded by six 2-bit assignment parameters, one for each function in the system, and two 3-bit selection parameters for exploring the design alternatives of DCT and IDCT. Figure 4.17 shows the representation of a JPEG partitioning solution under this coding.
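The coding can be made concrete with a short Python sketch that decodes the 18-bit string of Figure 4.17 (six 2-bit assignment fields followed by two 3-bit selection fields). The helper names are hypothetical. Folding an unused selection pattern back onto a real design point with a modulo is an assumption made here for determinism; it stands in for the random duplication of design points described above.

```python
PROCESSES = ["DCT", "Quantizer", "Encoder", "IDCT", "Dequantizer", "Decoder"]
SELECTABLE = ["DCT", "IDCT"]   # processes with several design alternatives
ASSIGN_BITS = 2                # up to 4 chip partitions
SELECT_BITS = 3                # 8 patterns for 5 design alternatives
NUM_ALTERNATIVES = 5


def bits_needed(n_points):
    """Smallest bit-vector length that can index n_points alternatives."""
    return max(1, (n_points - 1).bit_length())


def decode(bits):
    """Split a bit string into chip assignments and design-point selections."""
    bits = bits.replace(" ", "")
    expected = ASSIGN_BITS * len(PROCESSES) + SELECT_BITS * len(SELECTABLE)
    if len(bits) != expected or set(bits) - {"0", "1"}:
        raise ValueError("expected a binary string of %d bits" % expected)
    assignment, selection, pos = {}, {}, 0
    for name in PROCESSES:
        assignment[name] = int(bits[pos:pos + ASSIGN_BITS], 2)
        pos += ASSIGN_BITS
    for name in SELECTABLE:
        raw = int(bits[pos:pos + SELECT_BITS], 2)
        # Assumption: unused patterns (5..7) wrap around to a real point.
        selection[name] = raw % NUM_ALTERNATIVES
        pos += SELECT_BITS
    return assignment, selection


if __name__ == "__main__":
    print(bits_needed(100), 2 ** bits_needed(100) - 100)   # 7 bits, 28 unused
    print(decode("00 01 01 10 01 01 001 001"))
```

Selection value 1 is the second design point when alternatives are counted from zero, which matches the design point reported for the MILP solution in Section 4.4.3.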
| Parameter group | Function | Value |
|---|---|---|
| Assignment | DCT | 00 |
| Assignment | Quantizer | 01 |
| Assignment | Encoder | 01 |
| Assignment | IDCT | 10 |
| Assignment | Dequantizer | 01 |
| Assignment | Decoder | 01 |
| Selection | DCT | 001 |
| Selection | IDCT | 001 |

Figure 4.17: The parameter coding of the JPEG example

Since the parameters are encoded by a sequence of non-overlapping fixed-length bit vectors, applying crossover and mutation operators to any position in such a coding structure will always produce bit strings which represent valid partitioning solutions. However, a valid partitioning solution is not necessarily a feasible solution which meets the design constraints. The search for feasible partitioning solutions has to be done through the evaluation function.

Evaluation Function

In a GA, the evaluation function provides a measure of performance (or cost) with respect to a particular combination of parameters which represents a point in the search space. In terms of our system partitioning problem, this evaluation function not only has to reflect the partitioning cost as discussed in Section 4.3.3 but also needs to guide the genetic search away from the solutions which violate the design constraints. Hence, we define this function $\mathcal{F}$ as follows:

$$\mathcal{F} = \sum_{c \in C} COST_c + \sum_{e \in E} COST_e + \sum_{t} W_t \times (EXCESS_t)^2 \qquad (4.14)$$

where

• $COST_c$ is the cost of packaging chip partition $c$,
• $COST_e$ is the cost when a hyperedge $e$ becomes an inter-chip connection,
• $EXCESS_t$ is the amount of violation of a timing constraint $t$ imposed on the system, and
• $W_t$ is a constant weight which indicates the relative importance of violating timing constraint $t$.

$COST_e$ and $EXCESS_t$ can be calculated directly from the coding structure which encapsulates the parameters for the process-to-chip assignment and the design point selection.
This is because for each particular combination of these parameters the set of hyperedges which cross chip partitions can be identified and the delay of each element on a signal path with a timing constraint can be determined. $COST_c$, however, depends not only on the process-to-chip assignment and the design point selection but also on the available chip package options specified in the given library. This can be handled by incorporating a best-fit chip-package selection into the evaluation function. In other words, the available chip packages in the library are sorted according to their costs. For each chip partition in a candidate solution which is not empty, the chip package which meets the area and pin requirements and costs the least is selected. If there is no chip package in the library which can satisfy the area or pin requirements of a chip partition $c$, a large value is given to $COST_c$ to make this solution unfavorable during the genetic search.

Since the amount of violation of timing constraints is reflected in the cost function to be minimized, the solutions found will most likely be feasible. However, the GA is not guaranteed to find a feasible solution which meets all the timing constraints if the initial population is chosen completely at random. Hence, we can seed the initial population with some feasible solution found manually or by another heuristic technique. Alternatively, we can increase the size of the population so that the initial solutions selected randomly will provide enough variance to cover potential search paths toward feasible solutions.

Experiment

To validate the GA partitioning method described here, we experimented with the JPEG system example given in Section 4.4.3 using GAucsd [SG92], a task-independent GA system.
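The shape of such an evaluation function can be sketched in Python as below. This is an illustration of Equation 4.14 only, not the C function used with GAucsd, which is not reproduced in the text. The name `evaluate`, the constant `LARGE_COST`, the reading of $EXCESS_t$ as max(0, path delay minus limit), and the edge costs and delays in the example are all assumptions.

```python
LARGE_COST = 10 ** 9   # assumed penalty when no package fits a chip partition


def evaluate(chip_costs, cut_edge_costs, path_delays, limits, weights):
    """Cost of one candidate partitioning in the form of Equation 4.14.

    chip_costs     : package cost per non-empty chip, or None if none fits
    cut_edge_costs : COST_e of every hyperedge that crosses chips
    path_delays    : delay of each timing-constrained path
    limits         : the corresponding timing constraints
    weights        : W_t for each timing constraint
    """
    packaging = sum(LARGE_COST if c is None else c for c in chip_costs)
    communication = sum(cut_edge_costs)
    # Assumption: EXCESS_t is zero when the constraint is met.
    excess = [max(0, d - lim) for d, lim in zip(path_delays, limits)]
    penalty = sum(w * x ** 2 for w, x in zip(weights, excess))
    return packaging + communication + penalty


if __name__ == "__main__":
    # Package costs of k209, k100 and k209 from Figure 4.15(b); the edge
    # costs, path delays, limits and weights are hypothetical.
    print(evaluate([4240, 1214, 4240], [80, 80], [900], [1000], [2]))
    print(evaluate([4240, 1214, 4240], [80, 80], [1010], [1000], [2]))
```

Squaring the excess makes large timing violations much more expensive than small ones, so the search is drawn toward feasible solutions without ruling out slightly infeasible intermediate ones.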
In this experiment, the problem of partitioning the JPEG system was encoded into a C-language evaluation function using the coding structure shown in Figure 4.17. We used a population size of 64, a crossover rate of 43%, a mutation rate of 0.53% and 2040 generations. The same 3-chip partitioning solution given by our MILP method as shown in Figure 4.16 was produced by GAucsd in less than 3 seconds on an HP 9000/720 workstation.

Remarks

We found that the genetic algorithm is a promising optimization technique for the system partitioning problem not only because it can provide acceptable solutions in less run time than our MILP method but also because it will allow us to handle issues like yield and power which are difficult to incorporate into the MILP formulation due to their non-linearity. For example, when a system is partitioned into an MCM with an acceptable yield, the yield of the MCM is a non-linear function of the yields of the attached dies, where the yield of a die is also a non-linear function of the die size. This partitioning problem can be encoded into a genetic-search problem whose evaluation function performs these non-linear calculations of MCM and die yields.

4.6 Summary

In this chapter, we have presented a system-level partitioning approach to partition a system at the process level, to explore each process's design alternatives, to determine proper chip count, and to consider chip packaging options concurrently. An MILP partitioning method was given and implemented in a prototype tool called ProPart. Several experiments including a JPEG image compression system were performed to demonstrate the usefulness of this tool. Two extensions were discussed including the communication tradeoffs during partitioning and a genetic-search partitioning method.
We believe that system partitioning at a higher level of granularity such as processes and procedures will become more and more advantageous and necessary as both chip capacities and system complexity keep increasing. For future development, our system partitioning approach should be extended in two directions: non-uniform technology and mixed packaging devices. The first issue deals with mapping system functionalities onto a number of interconnected components, which may be ASICs, pre-designed parts or programmable devices. These components in turn can be distributed among mixed packaging devices such as chips, multi-chip modules (MCM) and boards in order to satisfy or optimize the constraints on cost, size, yield, power, and other design characteristics.

Chapter 5

Synthesis of Systems with Unbounded-Delay Operations and Communicating Processes

5.1 Introduction

Though high-level synthesis of digital hardware has received enormous attention over the years, most approaches have focused on the synthesis of single-process designs. Under these approaches, the design specification is usually represented by a single graph which captures the essential data and control flow of the design behavior, and synthesis tasks such as scheduling and module allocation/binding are applied to the operations of this graph. Additionally, an important assumption often made in previous approaches is that all the operations in the graph have fixed execution delays.

In practice, we find that complex application-specific systems often consist of multiple concurrent and interacting processes. For example, three GM production designs illustrated in [Fuh91] contain from 4 to 10 concurrent processes. Synthesizing a system with multiple concurrent processes poses new challenges to synthesis tools. First, the processes may need to interact with the external environment or with each other.
Due to the I/O and inter-process communication, the synthesis of each process often involves detailed timing constraints as well as operations with unbounded delays; i.e., delays that are unknown at compile time. Second, the synthesis tool has to concurrently solve all the timing constraints imposed by one process on another in order to synchronize all the processes in the system. Also, since the processes on a chip are competing for chip resources such as area and pins and since the total resources on a chip are limited by the chip package, resource allocation for each process should be done by trading off the performance and resource requirements of all the processes on the chip. Finally, if we synthesize one process at a time, the decisions made previously may affect and constrain the synthesis of other processes in the system, which may result in an inferior system implementation.

Traditional approaches for designing multiple-process systems are to synthesize individual processes separately. The integration and synchronization of the processes in the system are usually done manually by the designers. For example, System Architect's Workbench (SAW) [TLW+90], a single-process synthesis system, was used in the design of three industrial applications as mentioned in [Fuh91]. In these design experiments, each process was described in a separate ISPS or Verilog file and synthesized individually. The synthesized netlists of the processes were then manually interconnected. The I/O and inter-process communication was specified manually by the designers.

There is a class of work [TW93, Nes87, Hay90] which addresses the issue of I/O and inter-process communication as a problem separate from datapath synthesis, known as interface synthesis. These techniques, however, are mostly applicable to control-dominated designs with little or no data computation.
For the synthesis of more general designs, Ku introduced a technique called relative scheduling [KM92] which can handle operations with unbounded delays under detailed timing constraints. In this approach, the start time of an operation is specified in terms of offsets from a set of anchors (unbounded-delay operations). Since relative scheduling requires module allocation/binding to be performed before scheduling, we found it not very suitable for the synthesis of multiple processes under global resource constraints, e.g., the area capacity of a chip. In addition, the control scheme used by this approach is quite complicated.

In this chapter, we will present a synthesis approach which is not only suitable for the synthesis of designs with unbounded-delay operations under detailed timing constraints but also applicable to the synthesis of systems with multiple communicating processes. Unlike Ku's relative scheduling, our approach allows us to trade off between performance and resource requirements of the processes during scheduling, to satisfy the timing constraints, and to synchronize the inter-process communication, despite the presence of unbounded-delay operations. Furthermore, the control overhead can be reduced in our approach.

In what follows, we first give an overview of Ku's relative scheduling and discuss its limitations in Section 5.2. Then in Section 5.3 we present our approach for synthesizing designs with unbounded-delay operations. Our approach is based on the observation that each process generally corresponds to one thread of control and there exists a sequential order among the unbounded-delay operations in the process description. By preserving this order, scheduling of single-threaded processes can still be done statically in terms of control steps despite the presence of unbounded-delay operations.
Consequently, many good synthesis techniques originally developed for designs with only fixed-delay operations can still be utilized in our approach with some modifications. An ILP scheduling method as well as some experimental results are also given in this section. In Section 5.4, we extend our approach to handle systems with multiple communicating processes, where two additional issues need to be addressed; namely, inter-process communication and global resource allocation. Finally, in Section 5.5 we describe a heuristic approach modified from freedom-based scheduling to meet our scheduling requirements.

5.2 Overview of Relative Scheduling

As we have discussed earlier, traditional scheduling approaches assume fixed execution delays for the operations in the design behavior. These techniques are not suitable for the synthesis of many real-time ASIC designs that involve detailed timing constraints and operations with unbounded delays.

In [KM92], Ku presented a technique called relative scheduling that supports operations with fixed and unbounded delays. In relative scheduling, the scheduling problem is given in the form of a directed constraint graph $G(V, E)$, where the vertices $V$ represent the operations and the edges $E$ represent the dependencies. Each edge $(v_i, v_j)$ is associated with a weight $w_{ij}$ that defines either the upper or lower bound between the execution of $v_i$ and $v_j$. In the following, some notation and definitions used by Ku in his article [KM92] are given to briefly illustrate relative scheduling.

Definition 5.1 The anchors $A \subseteq V$ of a constraint graph $G(V, E)$ are the source vertex $v_0$ and all vertices with unbounded delay.

The anchors will serve as reference points for specifying the start time of operations in relative scheduling.
Definition 5.2 The anchor set of a vertex $v_i$ is a subset of anchors $A(v_i) \subseteq A$ such that $a \in A(v_i)$ if there exists a path in the forward constraint graph $G_f$^1 from $a$ to $v_i$ containing at least one edge with unbounded weight.

^1 $G_f$ is obtained by removing the backward edges (maximum timing constraints) from the original constraint graph $G$.

In other words, an anchor $a$ is in the anchor set of a vertex if the vertex can begin execution only after the completion of $a$. Furthermore, the anchor set of a vertex represents the unknown factors that can affect the activation time of the vertex (operation).

Definition 5.3 The start time of a vertex $v_i$, denoted by $T(v_i)$, is recursively defined as follows:

$$T(v_i) = \max_{a \in A(v_i)} \{ T(a) + \delta(a) + \sigma_a(v_i) \}$$

where $T(a)$ is the start time of anchor $a$, $\delta(a)$ is the execution delay of $a$, and $\sigma_a(v_i)$ is the offset of $v_i$ with respect to $a$.

The offset $\sigma_a(v_i)$ defines the amount of time that $v_i$ has to wait after the completion of $a$.

Definition 5.4 A relative scheduling $\Omega$ of a constraint graph $G(V, E)$ is the set of offsets of each vertex $v_i \in V$ with respect to each anchor in its anchor set $A(v_i)$ such that all the timing constraints can be satisfied.

The execution model of relative scheduling is illustrated in Figure 5.1. A vertex $v$ can begin execution only after all the anchors in its anchor set $A(v)$ are completed. When an anchor $a \in A(v)$ is completed, it sends a signal to a counter or shifter which delays for $\sigma_a(v)$ time units before sending a completion signal to vertex $v$. The vertex $v$ is activated after it receives a completion signal from every anchor that it depends on.

Remarks

An important assumption made in Ku's relative scheduling technique is that module allocation/binding has been performed prior to scheduling.
Any conflict caused by the assignment of multiple operations to a single module must be resolved in advance by adding proper sequencing dependencies among these operations. Without scheduling information, these serialization decisions made for module sharing may result in inefficient use of hardware resources or poor performance. For the synthesis of multiple-process systems under global resource constraints, techniques that combine scheduling with resource allocation are generally preferred so that the performance and resource requirements of each process can be traded off.

[Figure 5.1: The execution model of relative scheduling. The figure plots, against time, the anchors on which $v$ depends, with $T$ the start time, $A$ the anchor set, $\delta$ the operation delay and $\sigma$ the offset, so that $T(v) = \max_{a \in A(v)} (T(a) + \delta(a) + \sigma_a(v))$. Note: anchors are operations with unbounded delays such as waiting for an event to occur.]

Furthermore, the control scheme for the hardware architecture targeted by relative scheduling is fairly complicated. This is because the control logic is implemented as an interconnection of finite-state machines (FSMs), one for each state vertex^2 of the constraint graph [MK88]. These FSMs are called control elements. A control element interacts with others via handshaking signals. Two handshaking signals, enable and done, are defined to indicate when a control element is enabled and when it has finished. The enable signal for an operation is the logical conjunction of the done signals generated by the completion of the anchors that the operation depends on, each delayed by the appropriate amount of time (its offset) using counters or shifters. For example, Figure 5.2 shows a counter-based approach for an operation $v$ that depends on two anchors $a$ and $b$ with offsets 2 and 3 respectively.
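The start-time rule of Definition 5.3 can be worked through for this two-anchor case with a few lines of Python. Only the offsets 2 and 3 come from the example; the function name `start_time` and the anchor start times and delays are hypothetical, since the delays of anchors are by definition known only at run time.

```python
def start_time(anchor_terms):
    """T(v) = max over anchors a of T(a) + delta(a) + sigma_a(v).

    anchor_terms: iterable of (T_a, delta_a, sigma_a) triples, one per
    anchor in the anchor set of v.
    """
    return max(t_a + delta_a + sigma_a for t_a, delta_a, sigma_a in anchor_terms)


if __name__ == "__main__":
    # Anchor a starts at 0 and happens to take 5 time units (offset 2);
    # anchor b starts at 2 and happens to take 4 time units (offset 3).
    print(start_time([(0, 5, 2), (2, 4, 3)]))    # b determines the start of v
    # If a takes 10 time units instead, a becomes the critical anchor.
    print(start_time([(0, 10, 2), (2, 4, 3)]))
```

Which anchor determines the start of $v$ changes with the observed delays, which is why the hardware needs one delayed done signal per anchor and a conjunction of all of them.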
This fine-grain control is inevitable in relative scheduling due to the execution of operations with respect to their anchor sets as described earlier.

The "processes" in Ku's model are different from the traditional view of processes. In his model, a process is defined by a constraint graph whose anchors are only partially ordered. Therefore, there may be more than one anchor whose completion time is unknown at any given time. This makes the execution times of the associated operations nondeterministic. That is a major reason that module binding has to be done before scheduling in Ku's approach and the control scheme has to be so complicated.

^2 State vertices are those operations that require at least one cycle for execution.

[Figure 5.2: An example of the counter-based control in relative scheduling. The done signals of anchors a and b pass through counters to produce the enable signal of v.]

5.3 Synthesis of Single-Threaded Processes

In this section, we will present a synthesis approach which allows us to trade off performance and resource requirements during the synthesis of designs with unbounded-delay operations while reducing the cost of control as well.

Observations

In general, each process described in a design specification corresponds to one thread of control in designers' minds as well as in simulation models. The execution of a process proceeds from one waiting state (an anchor) to another in a sequential order that is explicitly given in the process description. In addition, the execution between two consecutive waiting states is deterministic. For example, Figure 5.3 shows a portion of the VHDL specification of a decoder process in an error-correction system. In this description, two wait statements are expected to and actually have
; ------------------------- ^ra-^w ait f o r in co m in g d a ta | w a it) on d a ta _ r e a d y u n t i l d a ta _ r e a d y = '1 '; -I- s ig n a l s t a r t o f p r o c e s s o u l r e a d y <= 'O'; - - isam p le in p u t s tre a m ± 0; wJaJfle i < 16 lo o p fw a iy i on s tr o b e u n t i l s tr o b e = '1 '; rnS3ut_dat a (i ) : = d e c o d e r _ in ; i : = i + 1 ; en d lo o p ; Figure 5.3: A n exam ple of anchor ordering in a process description to be executed sequentially even though there is no d a ta dependency betw een them . We found th a t b o th th e scheduling of designs w ith unbounded-delay operations and th e required control schem e can be simplified if we preserve th e anchor order em bedded in th e process description. Briefly speaking, our idea is to schedule each anchor to an exclusive control step in which th e process execution will stay until th e anchor is com pleted. In addition, th e order of these control steps m atches th e anchor order given in th e process description. T he rem aining operations are scheduled to m eet the perform ance requirem ent, th e resource availability and the tim ing constraints. A lthough one m ay argue th a t a certain degree of parallelism m ay be undiscov ered w hen keeping th e anchor order, we believe th a t th e ability to tra d e off the perform ance and resource requirem ents during scheduling is preferable to K u ’s rel ative scheduling for th e synthesis of m ultiple-process system s u nder global design constraints. 106 5 .3 .1 S c h e d u lin g A p p ro a ch T raditional scheduling approaches assum e th a t operations w ith “fixed” delays are assigned to a sequence of “equal-length” control steps. We call th is kind of ap proach static scheduling. N um erous static scheduling techniques have been pro posed, including ILP-based, rule-based, and heuristic m ethods. 
They can be further classified into scheduling under performance constraints, scheduling under resource constraints, and simultaneous scheduling and resource allocation. Unfortunately, these kinds of scheduling approaches are not applicable to designs with unbounded-delay operations.

In this section, we will introduce a scheduling approach based on the idea of single-threaded processes. An advantage of this approach is that scheduling can be done statically in terms of a sequence of "variable-length" control steps; hence, most of the existing static scheduling techniques, with some modifications, become applicable to designs with unbounded-delay operations. In the following discussion, we will try to use Ku's notation as much as possible for consistency.

Definition 5.5 A constraint graph $G(V, E)$ is single-threaded if its anchors $A$ are an ordered list $\{a_0, a_1, \ldots, a_K\}$, where $a_0$ is the source vertex $v_0$ and there exists a path from $a_{i-1}$ to $a_i$ in $G$ for $i = 1, \ldots, K$.

In other words, the anchors of a single-threaded $G$ are sequentialized by either data dependencies or explicit sequencing edges among them according to the order of their appearance in the process description. If there are anchors which can be executed concurrently, this explicit sequentialization of the anchors could represent some performance penalty as compared with the relative scheduling approach.

In our model, a schedule of a single-threaded $G(V, E)$ is defined by an ordered list of control steps, $\{C_0, C_1, \ldots, C_N\}$, where $length(C_i)$, $i = 0, \ldots, N$, denotes the duration of control step $C_i$. The execution of $G$ proceeds sequentially and cyclically from $C_0$ to $C_N$. Note that the control steps may have different lengths.

Definition 5.6 A schedule of a single-threaded $G(V, E)$ is an integer labelling $\sigma : V \to \mathbb{N}$ such that $\sigma(v)$ is in the range $[0, \ldots, N]$ for all $v \in V$.
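Definition 5.5 can be checked mechanically. The following is a minimal sketch, assuming the constraint graph is given as a list of directed edges and the anchors as a list in process-description order; the function names and the small example graph are hypothetical and are not part of the original tool.

```python
from collections import deque


def has_path(edges, src, dst):
    """Return True if dst is reachable from src along directed edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for w in adj.get(u, []):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False


def is_single_threaded(edges, anchors, source):
    """Definition 5.5: a_0 is the source vertex and a_{i-1} reaches a_i for every i."""
    if not anchors or anchors[0] != source:
        return False
    return all(has_path(edges, a, b) for a, b in zip(anchors, anchors[1:]))


# Hypothetical graph: v0 -> v1 -> a1 -> v2 -> a2, with anchors v0, a1, a2.
edges = [("v0", "v1"), ("v1", "a1"), ("a1", "v2"), ("v2", "a2")]
chained = is_single_threaded(edges, ["v0", "a1", "a2"], "v0")
# Dropping the only edge into a2 breaks the anchor chain.
broken = is_single_threaded(edges[:-1], ["v0", "a1", "a2"], "v0")
```

When the check fails, the definition suggests adding explicit sequencing edges between consecutive anchors until every pair is connected.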
Obviously, for a schedule to be valid, all the dependencies and timing constraints must be met.

Normally, the execution of a process stays in an unbounded-delay operation when it is activated, and the process execution proceeds to other operations only after the completion of the unbounded-delay operation. This is translated into the following definition in our model:

Definition 5.7 For all anchors $a$ of a single-threaded $G(V, E)$, $a$ is scheduled to an exclusive control step such that $length(C_{\sigma(a)})$ is equal to $delay(a)$, which is unbounded. The schedule of the anchor set $A$ of $G$ is $\{C_{\sigma(a_0)}, C_{\sigma(a_1)}, \ldots, C_{\sigma(a_K)}\}$ such that $\sigma(a_0) = 0$ and $\sigma(a_0) < \sigma(a_1) < \ldots < \sigma(a_K)$.

The scheduling of the anchor set $A$ of a process actually divides the scheduling space into $K + 1$ zones to which non-anchor vertices can be scheduled (see Figure 5.4). Formally, a zone is a range of control steps defined by the following equation:

$$zone(i) = \begin{cases} [C_{\sigma(a_i)+1}, \ldots, C_{\sigma(a_{i+1})-1}] & \text{if } 0 \le i < K \\ [C_{\sigma(a_K)+1}, \ldots, C_N] & \text{if } i = K \end{cases}$$

The duration of each zone is a variable which depends on the performance requirement, the resource availability, and the timing constraints among those operations.

For designs with conditional branching and loop constructs, the scheduling approach described in this section can be used as the basis for hierarchical scheduling. In other words, scheduling is applied hierarchically in a bottom-up fashion, where the body of a loop is another constraint graph of lower hierarchy and each branch of a conditional construct is also a constraint graph.
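To make the zone equation concrete, the following sketch computes the zones from the control-step indices of the anchors. It is only an illustration under stated assumptions: the function name `zones` and the example schedule are hypothetical, and control steps are represented by their integer indices.

```python
def zones(anchor_steps, last_step):
    """Return zone(0..K) as lists of control-step indices.

    anchor_steps: sigma(a_0), ..., sigma(a_K), strictly increasing, sigma(a_0) = 0.
    last_step:    the index N of the last control step C_N.
    """
    if not anchor_steps or anchor_steps[0] != 0:
        raise ValueError("sigma(a_0) must be 0")
    if any(b <= a for a, b in zip(anchor_steps, anchor_steps[1:])):
        raise ValueError("anchor steps must be strictly increasing")
    if anchor_steps[-1] > last_step:
        raise ValueError("last anchor lies beyond C_N")
    # zone(i) ends just before the next anchor; zone(K) ends at C_N.
    upper = list(anchor_steps[1:]) + [last_step + 1]
    return [list(range(lo + 1, hi)) for lo, hi in zip(anchor_steps, upper)]


# Hypothetical schedule: anchors at C_0, C_4 and C_5, with N = 8.
example = zones([0, 4, 5], 8)
```

In this example zone(1) is empty because the anchors $a_1$ and $a_2$ occupy adjacent control steps, which the zone equation permits.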
Figure 5.4: The scheduling of the anchor set of a single-threaded process (each anchor $a_i$ occupies $C_{\sigma(a_i)}$ and is followed by zone $i$, for $i = 0, \ldots, K$)

5.3.2 Timing Constraints

A major complication when synthesizing designs with unbounded-delay operations is to meet the timing constraints no matter how long these unbounded-delay operations will actually take. Generally, timing constraints are used to define the upper and lower bounds between the execution of two operations. Since the delays of the anchors are unbounded, it is only meaningful to define a timing constraint from the end time of the first operation to the start time of the second one. In our model, the start time of a vertex $v$ is defined as follows:

$$Ts(v) = \sum_{i=0}^{\sigma(v)-1} length(C_i) + \sum_{v_j \in pred(v) \,\wedge\, \sigma(v_j) = \sigma(v)} delay(v_j)$$

The first summation is the time when the control step $C_{\sigma(v)}$ starts, and the second one denotes the total delay taken by those predecessors of $v$ which are also scheduled to $C_{\sigma(v)}$ (chaining). The end time of $v$ can be stated as follows:

$$Te(v) = Ts(v) + delay(v)$$

Definition 5.8 A minimum timing constraint between two operations $v_i$ and $v_j$ is defined by a lower bound $l_{ij} \ge 0$ such that

$$Ts(v_j) \ge Te(v_i) + l_{ij}$$

Similarly, a maximum timing constraint is defined by an upper bound $u_{ij} \ge 0$ such that

$$Ts(v_j) \le Te(v_i) + u_{ij}$$

If the given timing constraints are inconsistent, they may not be satisfiable under any schedule. This is even more important in cases involving unbounded-delay operations. For minimum timing constraints, their satisfiability is only affected by the lower-bound delays of the unbounded-delay operations. This is illustrated in Figure 5.5. In this example, the lower-bound delay of anchor $a$ is 1 cycle; hence, the separation between $v_i$ and $v_j$ in this schedule will be at least 2 cycles as required by the minimum timing constraint $l_{ij}$.

Figure 5.5: A minimum timing constraint satisfied under an unbounded-delay operation (anchor $a$ with delay in $[1, ?]$; $l_{ij} = 2$)

However, a maximum timing constraint $u_{ij}$ can easily be unsatisfiable if there exists an anchor in the path from $v_i$ to $v_j$, as shown in Figure 5.6.

Figure 5.6: An unsatisfiable maximum timing constraint

This is because the separation between $v_i$ and $v_j$ depends on $\delta(a)$, the delay of $a$, which is unbounded. Therefore, there exists a value of $\delta(a)$, e.g. $u_{ij} + 1$, that makes $Ts(v_j)$ larger than $Te(v_i) + u_{ij}$. The following lemma can be proved in a similar way for checking the satisfiability of maximum timing constraints in a single-threaded process.

Lemma 5.1 For each maximum timing constraint $u_{ij}$ of a single-threaded $G(V, E)$, $u_{ij}$ is feasible if and only if there is no path from $v_i$ to $v_j$ which goes through an anchor other than $v_i$ and $v_j$ themselves.

Additionally, a maximum timing constraint $u_{ij}$ imposes restrictions on the scheduling of $v_i$ and $v_j$. This is illustrated in Figure 5.7, where $v_i$ is either the anchor $a_x$ or a fixed-delay operation scheduled to $zone(x)$. Since the separation from $Te(v_i)$ to $Ts(v_j)$ cannot be larger than $u_{ij}$, $v_j$ cannot be scheduled later than $a_{x+1}$. Otherwise, the satisfiability of $u_{ij}$ will depend on the delay of $a_{x+1}$.

Figure 5.7: Scheduling restrictions imposed by $u_{ij}$ (anchors $a_{x-1}$, $a_x$, $a_{x+1}$; zones $x-1$ and $x$; $Te(v_i)$ and $Ts(v_j)$)

These scheduling restrictions are formally described as follows. Let $\kappa : V \to \mathbb{N}$ be the function defined as follows:

$$\kappa(v) = \begin{cases} k & \text{if } v \text{ is the anchor } a_k \\ z & \text{if } v \text{ is not an anchor and } v \text{ is scheduled to } zone(z) \end{cases}$$

Theorem 5.2 For each feasible maximum timing constraint $u_{ij}$ of a single-threaded $G(V, E)$, if $v_j$ is an anchor then $\kappa(v_j) \le \kappa(v_i) + 1$; otherwise, $\kappa(v_j) \le \kappa(v_i)$.

Proof: Let $\kappa(v_i) = x$ and $\kappa(v_j) = y$.

Case I. $v_j$ is an anchor. Hence, $v_j$ corresponds to $a_y$ in $A$. Assume $y > x + 1$.
Then, $C_{\sigma(a_{x+1})}$ is between $zone(x)$ and $C_{\sigma(a_y)}$. Since $\kappa(v_i) = x$, $v_i$ is scheduled in either $C_{\sigma(a_x)}$ or $zone(x)$. Therefore, the gap between $Te(v_i)$ and $Ts(v_j)$ depends on the delay of $a_{x+1}$, which is unbounded, and $u_{ij}$ cannot be satisfied in general. Therefore, $y \le x + 1$.

Case II. $v_j$ is not an anchor. Hence, $v_j$ is scheduled to $zone(y)$. Assume $y > x$. Then, $C_{\sigma(a_y)}$ is between $zone(x)$ and $zone(y)$, which means that the separation between $Te(v_i)$ and $Ts(v_j)$ depends on the delay of anchor $a_y$, and $u_{ij}$ cannot be satisfied in general. Therefore, $y \le x$. □

5.3.3 Resource Allocation

In our scheduling approach, although the control steps to which the anchors are scheduled have unbounded lengths, the order in which the control steps $\{C_0, C_1, \ldots, C_N\}$ are executed is deterministic. Hence, the life span of operations and values can still be determined statically in terms of control steps. The life span of an operation or value is defined by the control step where it starts (or is created) to the one where it ends (or is no longer needed).

The ability to determine the life span of operations and values allows us to determine or even control the resource requirements at each control step during scheduling. For example, two operations of a scheduled single-threaded $G(V, E)$ can share a single module if there is no overlap between their life spans or they are mutually exclusive. Also, we can schedule less critical operations to those control steps whose resources are under-utilized.

Since resource sharing can be done statically in terms of control steps in our scheduling approach, we are able to trade off performance and resource requirements, despite the presence of unbounded-delay operations, by performing scheduling under timing constraints while minimizing resources or by doing scheduling and resource allocation simultaneously.
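The sharing rule above reduces to an interval test on control-step indices. The sketch below is an illustration only: life spans are assumed to be inclusive (first step, last step) pairs, the function names are hypothetical, and `units_needed` deliberately ignores mutual exclusion between operations.

```python
from collections import Counter


def can_share(span_a, span_b, mutually_exclusive=False):
    """Two operations can share a module if their life spans do not overlap
    or if they are mutually exclusive (e.g. on different conditional branches)."""
    if mutually_exclusive:
        return True
    return span_a[1] < span_b[0] or span_b[1] < span_a[0]


def units_needed(spans):
    """Largest number of same-type operations alive in any one control step."""
    busy = Counter()
    for first, last in spans:
        for step in range(first, last + 1):
            busy[step] += 1
    return max(busy.values(), default=0)


# Hypothetical one-cycle operations of the same type.
before = units_needed([(3, 3), (3, 3), (5, 5)])   # two operations collide in C_3
after = units_needed([(3, 3), (4, 4), (5, 5)])    # one of them deferred to C_4
```

Deferring a less critical operation by one control step lowers the requirement from two modules to one, which is the kind of trade-off that scheduling in terms of control steps makes visible.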
5.3.4 Control Scheme

The control scheme required to support the designs synthesized by our approach is much simpler than the one employed in relative scheduling. As the term "single-threaded" implies, the control unit of a single-threaded process is a single finite-state machine (FSM). This is because a single-threaded process will stay in one and only one of the control steps $\{C_0, C_1, \ldots, C_N\}$ at any given time. Furthermore, the task to be performed at each control step is fixed. Hence, most existing control synthesis techniques for static scheduling are still applicable here with some modification. For example, each non-anchor control step can be mapped to one or a sequence of states of the FSM as usual according to its length and the clock cycle. A simple anchor, such as waiting for an expected condition to be met, can be implemented by a busy-waiting state with a conditional state transition, as shown in Figure 5.8.

Figure 5.8: An implementation of a simple unbounded control step

5.3.5 An ILP Method

In this section, we will demonstrate our scheduling approach using an integer linear programming (ILP) method. Our formulation is based on Hsu's work [HLH91]; however, we expect no major difficulty in applying more comprehensive formulations like [GE92] to our scheduling approach. The problem which we are trying to solve here can be defined as follows:

Given a single-threaded control-data flow graph $G(V, E)$, find a minimal-cost schedule that satisfies the given set of timing constraints.

The following notation will be used in the formulation:

• $V = \{v_0, \ldots, v_L\}$. $v_0$ is the source vertex. $i$ and $j$ are used as vertex indexes.
• $A = \{a_0, \ldots, a_K\}$ is the ordered list of anchors in $G$, where $a_0 = v_0$. $k$ is the anchor index.
• $C = \{C_0, \ldots, C_N\}$ denotes the control steps to be scheduled, where $N$ is an upper bound of the total number of control steps. $n$ is the control step index.
• There are $M$ types of functional units. $v_i \in F_m$ if $F_m$ can perform operation $v_i$. The cost of $F_m$ is $CS_m$.

For simplicity of illustration, we made the following assumptions:

• The lower-bound delay of each anchor is one cycle.
• Each non-anchor operation takes one cycle.
• Only the costs of functional units are considered.
• All the maximum timing constraints have been checked to be feasible.

The techniques to model multi-cycle operations and chaining in ILP can be found in several other formulations [GE92, WGB, Pra93] for static scheduling.

The following discussion begins with the detailed formulation. The linearization of non-linear constraints will be described next. Finally, we will show some experimental results using a public-domain ILP solver.

Formulation

Operation to Control Step Assignment

Let $x_{i,n}$ be a binary variable such that $x_{i,n} = 1$ if $v_i$ is scheduled to control step $C_n$; otherwise, $x_{i,n} = 0$. Obviously, each operation, including the anchors, must be assigned to one and only one control step.

$$\sum_{n=0}^{N} x_{i,n} = 1 \quad \forall v_i \in V \qquad (5.1)$$

However, the control step to which an anchor $a_k \in A$ is scheduled must be exclusively used by $a_k$. Let $a_k$ correspond to $v_i \in V$.

$$(x_{i,n} = 1) \Rightarrow x_{j,n} = 0 \quad \text{for } 0 \le n \le N, \ \forall v_j \in V - \{v_i\} \qquad (5.2)$$

Let $\sigma_{v_i}$ be the index of the control step to which $v_i$ is scheduled. $\sigma_{v_i}$ can be obtained by

$$\sigma_{v_i} = \sum_{n=0}^{N} n \cdot x_{i,n} \quad \forall v_i \in V \qquad (5.3)$$

Data Dependency

For each $(v_i, v_j) \in E$, we know that $v_j$ must be scheduled after $v_i$. This can be ensured by the following constraint:

$$\sigma_{v_j} - \sigma_{v_i} \ge 1 \qquad (5.4)$$

In fact, each data dependency $(v_i, v_j)$ is like a minimum timing constraint $l_{ij}$ such that $l_{ij} = 1$.
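Equations 5.1, 5.3 and 5.4 can be read off directly from a candidate 0/1 assignment. The sketch below is illustrative only: it assumes the assignment is stored as one list of 0/1 values per vertex, and the helper names and the three-vertex example are hypothetical.

```python
def control_step(x_row):
    """Equation 5.3: sigma = sum_n n * x[n]; Equation 5.1 requires exactly one 1."""
    if any(x not in (0, 1) for x in x_row) or sum(x_row) != 1:
        raise ValueError("each operation needs exactly one control step")
    return sum(n * x for n, x in enumerate(x_row))


def dependencies_met(x, edges):
    """Equation 5.4: sigma(v_j) - sigma(v_i) >= 1 for every edge (v_i, v_j)."""
    sigma = {v: control_step(row) for v, row in x.items()}
    return all(sigma[vj] - sigma[vi] >= 1 for vi, vj in edges)


# Hypothetical assignment over control steps C_0..C_3.
x = {
    "v0": [1, 0, 0, 0],
    "v1": [0, 1, 0, 0],
    "v2": [0, 0, 0, 1],
}
sigma_v2 = control_step(x["v2"])
ok = dependencies_met(x, [("v0", "v1"), ("v1", "v2")])
violated = dependencies_met(x, [("v2", "v1")])
```

An ILP solver searches over such assignments; the helpers only verify one of them after the fact.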
Minimum Timing Constraints

Each minimum timing constraint $l_{ij}$ can be enforced easily by the following constraint:

$$\sigma_{v_j} - \sigma_{v_i} \ge l_{ij} \qquad (5.5)$$

Since the lower-bound delay of each anchor scheduled between $\sigma_{v_i}$ and $\sigma_{v_j}$ is 1, the minimum timing constraint will be satisfied as long as the above constraint is met.

Maximum Timing Constraints

Unlike minimum timing constraints, maximum timing constraints need to be analyzed carefully in order to guarantee their satisfaction in all circumstances. For each maximum timing constraint $u_{ij}$, there are four possible cases:

Case I. Both $v_i$ and $v_j$ are anchors. Let them correspond to $a_x$ and $a_y$ in $A$. From Theorem 5.2, we know $y \le x + 1$. If $y = x + 1$, we add the following constraint to the formulation:

$$\sigma_{v_j} - \sigma_{v_i} \le u_{ij}$$

Otherwise, $u_{ij}$ is ignored since $y \le x$ implies $\sigma_{v_j} \le \sigma_{v_i}$.

Case II. $v_i$ is an anchor but $v_j$ is not. Let $v_i$ be $a_x$ in $A$. From Theorem 5.2, we know $zone(v_j) \le x$. This can be enforced by

$$\sigma_{v_j} < \sigma_{a_{x+1}} \quad \text{and} \quad \sigma_{v_j} - \sigma_{v_i} \le u_{ij}$$

Case III. $v_i$ is not an anchor but $v_j$ is. Let $v_j$ be $a_y$ in $A$. From Theorem 5.2, we know $zone(v_i) \ge y - 1$. This can be constrained by

$$\sigma_{v_i} > \sigma_{a_{y-1}} \quad \text{and} \quad \sigma_{v_j} - \sigma_{v_i} \le u_{ij}$$

Case IV. Neither $v_i$ nor $v_j$ is an anchor. From Theorem 5.2, we know $zone(v_j) \le zone(v_i)$. Assuming $zone(v_i)$ is $x$, this maximum constraint can be enforced by

$$\sigma_{v_j} < \sigma_{a_{x+1}} \quad \text{and} \quad \sigma_{v_j} - \sigma_{v_i} \le u_{ij} \qquad (5.6)$$

Unfortunately, $x$ is not a constant because it depends on $\sigma_{v_i}$. Alternatively, Constraint 5.6 can be replaced by the following constraints to ensure $zone(v_j) \le zone(v_i)$:

$$(\sigma_{v_i} < \sigma_{a_k}) \Rightarrow (\sigma_{v_j} < \sigma_{a_k}) \quad \forall a_k \in A \qquad (5.7)$$

Resource Allocation

Let $f_m$ denote the number of functional units of type $m$ used in the solution. Hence, the number of operations of type $m$ scheduled to each control step cannot exceed $f_m$. This is stated as follows:

$$\sum_{v_i \in F_m} x_{i,n} \le f_m \quad \text{for } 0 \le n \le N \qquad (5.8)$$

If we want to minimize the total resources, $f_m$ is a variable to be included in the objective function.
For scheduling under resource constraints, $f_m$ becomes a constant.

Objective Function

To find a minimal-cost schedule that satisfies the given timing constraints, the objective function $T$ to be minimized can be stated as follows:

$$\sum_{m=1}^{M} CS_m \cdot f_m \qquad (5.9)$$

We can also try to minimize the total number of control steps $C_{step}$ while meeting the resource constraints by adding the following constraint to the formulation:

$$C_{step} = \max(\sigma_{v_i}) \quad \text{for all } v_i \text{ without successors} \qquad (5.10)$$

Finally, we can simply search for a feasible solution that meets both timing and resource constraints without an objective function.

Linearization

Since there are some non-linear constraints in the formulation presented earlier, they have to be linearized in order to be solved by ILP solvers.

Constraint 5.2 can be easily linearized as follows:

$$(1 - x_{i,n}) \cdot BIG \ge \sum_{v_j \in V - \{v_i\}} x_{j,n}$$

where $BIG$ is a reasonably large number; e.g., a number which is slightly larger than $\|V\|$ can be used here. Hence, if $x_{i,n}$ is 1, $x_{j,n}$ will be 0 as required. On the other hand, if $x_{i,n}$ is 0, the above constraint will always be satisfied.

To linearize Constraint 5.7, it is first split into the following two constraints:

$$(\sigma_{v_i} < \sigma_{a_k}) \Leftrightarrow (b_{i,k} = 1) \qquad (5.11)$$

and

$$(b_{i,k} = 1) \Rightarrow (\sigma_{v_j} < \sigma_{a_k}) \qquad (5.12)$$

where $b_{i,k}$ is a binary variable which is equal to 1 if and only if $\sigma_{v_i} < \sigma_{a_k}$. Constraint 5.12 can be linearized by

$$\sigma_{a_k} > \sigma_{v_j} - BIG \cdot (1 - b_{i,k})$$

Hence, if $b_{i,k}$ is 1, $\sigma_{v_j}$ will be less than $\sigma_{a_k}$. Constraint 5.11 can be replaced by the following constraint:

$$\sigma_{a_k} - \sigma_{v_i} \le BIG \cdot b_{i,k}$$

Hence, if $\sigma_{a_k} > \sigma_{v_i}$, $b_{i,k}$ will be 1. However, this constraint does not guarantee $b_{i,k}$ to be 0 when $\sigma_{a_k} < \sigma_{v_i}$. This requirement can be enforced by adding the term $BIG \cdot b_{i,k}$ to the cost function $T$ to be minimized.
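The linearization of Constraint 5.2 can be sanity-checked by brute force on a small instance. The following sketch enumerates every 0/1 assignment for one anchor variable and a few other variables in the same control step, and compares the implication with its linear form; the function names and the instance size are assumptions made for illustration.

```python
from itertools import product


def implication(x_i, others):
    """Constraint 5.2: if the anchor occupies the step, nobody else does."""
    return x_i == 0 or all(x == 0 for x in others)


def linearized(x_i, others, big):
    """Linear form: (1 - x_i) * BIG >= sum of the other variables."""
    return (1 - x_i) * big >= sum(others)


def equivalent(num_others, big):
    """True if both forms agree on every 0/1 assignment."""
    return all(
        implication(x_i, others) == linearized(x_i, others, big)
        for x_i in (0, 1)
        for others in product((0, 1), repeat=num_others)
    )


large_enough = equivalent(3, 4)   # BIG slightly larger than the vertex count
too_small = equivalent(3, 2)      # BIG smaller than the number of other vertices
```

The second call shows why $BIG$ has to be at least the number of other vertices: with a smaller value the linear form wrongly rejects assignments in which the anchor is absent and several other operations share the control step.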
5.3.6 Experiments

A number of experiments were performed using a public-domain ILP solver, called lpsolve (see footnote 3), to validate our scheduling method. Figure 5.9 shows a single-threaded constraint graph which was the running example used by Ku in his article [KM92].

Figure 5.9: A single-threaded constraint graph (legend: fixed-delay operations, anchors, minimum constraints or data dependencies, maximum constraints)

In this example, there are two anchors, one of which is the source vertex $v_0$, and three maximum timing constraints. The ILP model produced for these experiments consists of 56 constraints and 86 binary variables.

First, we scheduled this example with the objective of minimizing the number of control steps. A 13-control-step schedule, shown in Figure 5.10, was found in less than 2 seconds on an HP 9000/720 workstation. This solution is the same as the minimum schedule given in [KM92] using relative scheduling.

Footnote 3: lpsolve is an efficient C program based on a sparse matrix dual simplex LP solver for solving mixed-integer linear programming problems. It was written by Berkelaar at Eindhoven University of Technology, Design Automation Section, P.O. Box 513, NL-5600 MB Eindhoven, Netherlands.

Figure 5.10: A schedule with minimum number of control steps

Next, we tried to schedule this example while minimizing the total resources. A schedule which also used 13 control steps was found, as shown in Figure 5.11. Compared to the previous schedule in Figure 5.10, this solution defers $v_3$ by one control step while still satisfying all the timing constraints. Consequently, $v_3$ and $v_5$ can share a single module if there exists a functional unit that can perform both of them.
On the contrary, since Ku's relative scheduling is similar to as-soon-as-possible (ASAP) scheduling and does not consider resource utilization during scheduling, the minimum schedule it produces will require at least two modules since both $v_3$ and $v_5$ are scheduled to the same control step. This clearly shows the advantage of our scheduling approach, which allows us to trade off performance and resource requirements during scheduling.

Figure 5.11: A schedule which requires minimum resources

5.4 Synthesis with Multiple Processes

In this section, we will apply our synthesis model to handle systems with multiple communicating processes. There are two major issues to be considered when synthesizing systems with multiple processes:

1. to synchronize processes which interact with each other, and
2. to distribute the resources to each process on a chip according to its performance and resource requirements.

In the following discussion, we will first review our communication model. The problem of multiple-process synthesis which we are focusing on will be defined next. We will then analyze the feasibility of a given set of communication events before discussing the synchronization of both blocking and non-blocking communication events. Finally, we will show how to modify the ILP scheduling method presented earlier to handle systems with multiple processes.

5.4.1 Communication Model

Our inter-process communication model has been discussed in Section 3.2.2. In short, the inter-process communication of a system is defined by a set of communication events. Each communication event is a point-to-point communication which consists of a write operation in the sending process and a read operation in the receiving process, as shown in Figure 5.12.
An event is called blocking if its synchronization has to be achieved dynamically via hand-shaking. On the other hand, the write and read operations of a non-blocking event are synchronized statically in time.

Figure 5.12: Modeling of an inter-process communication event (sender, receiver)

Notation

In our synthesis model, each process is represented by a single-threaded constraint graph $G(V, E)$. A point-to-point communication event $ev_k^{ij}$ is defined by a tuple $(w_k^i, r_k^j)$, where $w_k^i \in V_i$ and $r_k^j \in V_j$. $w_k^i$ is a write vertex (operation) in the sending process $G_i$. Similarly, $r_k^j$ corresponds to a read vertex in the receiving process $G_j$.

If $ev_k^{ij}$ is a blocking event, the time for $ev_k^{ij}$ to complete is not deterministic in both the sending and receiving processes. Therefore, $w_k^i$ and $r_k^j$ are anchors in $G_i$ and $G_j$ respectively. On the other hand, if $ev_k^{ij}$ is a non-blocking event, both the sending and receiving processes are required to be synchronized in time when $ev_k^{ij}$ occurs. In other words, $w_k^i$ and $r_k^j$ must be started at the same time when scheduling $G_i$ and $G_j$. They are represented by normal vertices with a fixed delay.

Problem statement

The problem to be solved here can be stated as follows:

Given a set of single-threaded processes $PS = \{G_0, G_1, \ldots\}$ and a set of communication events $EV = \{ev_0, \ldots, ev_m\}$, schedule and allocate each process in $PS$ such that the timing constraints associated with each process are met, each event in $EV$ is synchronized, and the total cost (resource) is minimized or its constraint is met.

5.4.2 Communication Feasibility

Unfortunately, it is not always possible to synchronize every communication event in $EV$ due to potential occurrences of deadlock or lack of synchronization points (see footnote 4) for non-blocking events. Deadlocks occur when there are circular dependencies among the blocking events.
On the other hand, each non-blocking event is required to be scheduled to a period of time where the execution of both the sending and receiving processes is synchronous. Hence, it is important to analyze the feasibility of the given set of communication events before scheduling.

Footnote 4: Basically, a synchronization point between two processes is a point in time where their execution will start becoming synchronous. We will formally define it later.

Definition 5.9 The dependency graph of a set of communication events $EV$ is a directed graph $DG(V, E)$, where $V = EV$ and $(ev_i, ev_j) \in E$ if there exists a process $G_k \in PS$ such that there is a path from $ev_i$ to $ev_j$ in $G_k$.

In other words, $DG$ represents a partial order among the communication events. If $ev_i$ is a predecessor of $ev_j$ in $DG$, it implies that $ev_i$ must take place before $ev_j$.

Theorem 5.3 A set of communication events $EV$ is infeasible if its dependency graph $DG$ has a cycle.

Proof: Let $ev_i$ and $ev_j$ be two vertices on the cycle, and let $Ts$ and $Te$ denote the start time and the end time of an event respectively. Since $ev_i$ is a predecessor as well as a successor of $ev_j$, we know

$$Te(ev_i) < Ts(ev_j) \qquad (5.13)$$
$$Te(ev_j) < Ts(ev_i) \qquad (5.14)$$

However, each event takes some time to complete, which means

$$Ts(ev_i) < Te(ev_i) \qquad (5.15)$$
$$Ts(ev_j) < Te(ev_j) \qquad (5.16)$$

Combining Equations 5.13 with 5.15 and 5.14 with 5.16, we have

$$Ts(ev_i) < Ts(ev_j) \qquad (5.17)$$
$$Ts(ev_j) < Ts(ev_i) \qquad (5.18)$$

Equation 5.17 contradicts 5.18; hence, $EV$ is infeasible. □

In short, an acyclic dependency graph ensures that the communication is deadlock-free.

5.4.3 Synchronization of Blocking Events

Since a blocking communication event is synchronized dynamically via hand-shaking, it will be accomplished as long as both the write and read operations are eventually executed by the sending and receiving processes.
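Because blocking events rely on an acyclic dependency graph (Theorem 5.3), a cycle check on $DG$ is a natural first step before scheduling. The sketch below uses Kahn's topological-sort algorithm, which is our choice for illustration and is not taken from the original text; it assumes the edges of $DG$ have already been extracted from the process graphs, and the event names are hypothetical.

```python
def has_cycle(events, edges):
    """Return True if the dependency graph (events, edges) contains a cycle."""
    indegree = {e: 0 for e in events}
    successors = {e: [] for e in events}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = [e for e in events if indegree[e] == 0]
    visited = 0
    while ready:
        u = ready.pop()
        visited += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    # Events left unvisited can only be waiting on a circular dependency.
    return visited != len(events)


events = ["ev0", "ev1", "ev2"]
# One process orders ev0 before ev1, and ev1 before ev2.
feasible = has_cycle(events, [("ev0", "ev1"), ("ev1", "ev2")])
# A second process that orders ev2 before ev0 closes a cycle.
deadlock = has_cycle(events, [("ev0", "ev1"), ("ev1", "ev2"), ("ev2", "ev0")])
```

A `True` result means the event set is infeasible in the sense of Theorem 5.3, so no schedule of the individual processes can repair it.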
The scheduling of processes does not have a direct effect on the synchronization of blocking events. In fact, if every event of an acyclic dependency graph $DG$ is blocking, the inter-process communication can be made valid by simply scheduling individual processes properly. This is because a valid schedule of each process will not only satisfy the timing constraints but also meet all its data and sequencing dependencies. Consequently, if two vertices (blocking events here) are ordered in $DG$, then they cannot be scheduled out of order in any valid schedule. Hence, each vertex in $DG$ will take place only after all its predecessors have completed.

5.4.4 Synchronization of Non-blocking Events

If a communication event is non-blocking, its sending and receiving processes cannot be scheduled independently. A non-blocking event $ev_k^{ij} = (w_k^i, r_k^j)$ is synchronized only if the scheduling of processes $G_i$ and $G_j$ can guarantee that the start times of $w_k^i$ and $r_k^j$ are the same. Unfortunately, this synchronization cannot be done by simply scheduling both $w_k^i$ and $r_k^j$ to the same control step in $G_i$ and $G_j$, as in [HP92, Geb92], due to the occurrences of unbounded-delay operations. Instead, we will introduce the notion of synchronization points and show how they can be utilized to synchronize non-blocking events.

Definition 5.10 A synchronization point of two processes $G_i$ and $G_j$ is a tuple $(C_x^i, C_y^j)$ such that $C_x^i$ and $C_y^j$ are control steps of $G_i$ and $G_j$ respectively and the execution of $G_i$ and $G_j$ always leaves $C_x^i$ and $C_y^j$ simultaneously.

Some synchronization points of two interacting processes can be found even before their scheduling has been done. For example, if two processes $G_i$ and $G_j$ have a common starting state, $(C_0^i, C_0^j)$ will be a synchronization point. Each blocking event $ev_k^{ij} = (w_k^i, r_k^j)$ will contribute a synchronization point between $G_i$ and $G_j$.
If there exists a synchronization point between two processes, their execution will be synchronous for a period of time from the synchronization point to the next control step with an unbounded length in either process. In fact, each non-blocking event requires such a period of time in which its write and read actions can be scheduled and synchronized. This is illustrated in Figure 5.13 and in the following lemma.

Figure 5.13: The synchronization of a non-blocking communication event (process $i$ and process $j$; equal offsets from the synchronization point to $C_{\sigma(w_k)}$ and $C_{\sigma(r_k)}$; next unbounded control step)

Lemma 5.4 Let $(C_x^i, C_y^j)$ be a synchronization point between processes $G_i$ and $G_j$, and let $C_u^i$ and $C_v^j$ be the control steps with unbounded lengths which immediately follow $C_x^i$ and $C_y^j$. A non-blocking event $ev_k^{ij} = (w_k^i, r_k^j)$ scheduled in the following regions,

$$x < \sigma(w_k^i) < u \qquad (5.19)$$

and

$$y < \sigma(r_k^j) < v \qquad (5.20)$$

is synchronized if

$$offset(C_x^i, w_k^i) = offset(C_y^j, r_k^j)$$

where $offset(a, b)$ is the time from the end of $a$ to the start of $b$.

Proof: From Equations 5.19 and 5.20, we know that there is no control step with an unbounded length between $C_x^i$ and $C_{\sigma(w_k^i)}^i$, as well as between $C_y^j$ and $C_{\sigma(r_k^j)}^j$. Therefore, $offset(C_x^i, w_k^i)$ and $offset(C_y^j, r_k^j)$ are bounded and equal. Also, since $(C_x^i, C_y^j)$ is a synchronization point, the execution of $G_i$ and $G_j$ will leave $C_x^i$ and $C_y^j$ simultaneously; i.e., $Te(C_x^i) = Te(C_y^j)$. Hence, we have

$$Te(C_x^i) + offset(C_x^i, w_k^i) = Te(C_y^j) + offset(C_y^j, r_k^j) \ \Rightarrow\ Ts(w_k^i) = Ts(r_k^j)$$

In other words, the start times of $w_k^i$ and $r_k^j$ are the same. □

5.4.5 ILP Modification

In what follows, the ILP formulation presented in Section 5.3.5 will be extended to handle systems with multiple processes.
Assuming the dependency graph $DG$ of the given set of communication events $EV$ is acyclic, the original ILP formulation, except the objective function, can simply be duplicated once for each process in $PS$. Since $DG$ is acyclic, the blocking events will be synchronized by themselves as long as each process is scheduled properly.

For each non-blocking event $ev_k^{ij} = (w_k^i, r_k^j)$ in $EV$, the following constraint is added to the formulation for each pair of $a_x^i$ and $a_y^j$ such that both $a_x^i$ and $a_y^j$ are anchors and $(C_{\sigma(a_x^i)}^i, C_{\sigma(a_y^j)}^j)$ is a synchronization point between $G_i$ and $G_j$.

$$(\sigma(a_x^i) < \sigma(w_k^i) < \sigma(a_{x+1}^i)) \ \wedge \qquad (5.21)$$
$$(\sigma(a_y^j) < \sigma(r_k^j) < \sigma(a_{y+1}^j)) \ \Rightarrow$$
$$\sigma(w_k^i) - \sigma(a_x^i) = \sigma(r_k^j) - \sigma(a_y^j) \qquad (5.22)$$

This constraint is basically a repetition of Lemma 5.4. If the left-hand side of this constraint is true, $w_k^i$ and $r_k^j$ are to be synchronized with respect to $(C_{\sigma(a_x^i)}^i, C_{\sigma(a_y^j)}^j)$ using the same offsets.

Obviously, Constraint 5.21 needs to be linearized. First, it can be replaced by:

$$\sigma(a_x^i) < \sigma(w_k^i) \Leftrightarrow b_k^1 = 1 \qquad (5.23)$$
$$\sigma(w_k^i) < \sigma(a_{x+1}^i) \Leftrightarrow b_k^2 = 1 \qquad (5.24)$$
$$\sigma(a_y^j) < \sigma(r_k^j) \Leftrightarrow b_k^3 = 1 \qquad (5.25)$$
$$\sigma(r_k^j) < \sigma(a_{y+1}^j) \Leftrightarrow b_k^4 = 1 \qquad (5.26)$$
$$b_k^5 = b_k^1 \wedge b_k^2 \wedge b_k^3 \wedge b_k^4 \qquad (5.27)$$
$$b_k^5 = 1 \Rightarrow \sigma(w_k^i) - \sigma(a_x^i) = \sigma(r_k^j) - \sigma(a_y^j) \qquad (5.28)$$

Constraints 5.23 to 5.26 can be linearized in a similar way as we did for Constraint 5.7. Constraint 5.27 is equivalent to

$$4 \cdot b_k^5 \le b_k^1 + b_k^2 + b_k^3 + b_k^4 \quad \text{and} \quad b_k^1 + b_k^2 + b_k^3 + b_k^4 - 3 \le b_k^5$$

Hence, if any of $b_k^1, \ldots, b_k^4$ is 0, $b_k^5$ will be 0. If all of $b_k^1, \ldots, b_k^4$ are 1, $b_k^5$ becomes 1 as well. Finally, Constraint 5.28 can be replaced by:

$$\sigma(w_k^i) - \sigma(a_x^i) \le \sigma(r_k^j) - \sigma(a_y^j) + (1 - b_k^5) \cdot BIG$$
$$\sigma(w_k^i) - \sigma(a_x^i) \ge \sigma(r_k^j) - \sigma(a_y^j) - (1 - b_k^5) \cdot BIG$$

Hence, if $b_k^5$ is 1, $\sigma(w_k^i) - \sigma(a_x^i)$ will be equal to $\sigma(r_k^j) - \sigma(a_y^j)$.

In order to distribute the resources to each process according to the performance requirement, the objective function $T$ can be replaced by one which represents the total resources consumed by all the processes in $PS$.
$$\sum_{p \in PS} \sum_{m=1}^{M} CS_m \cdot f_m^p \qquad (5.29)$$

Therefore, the solution found by the above formulation will be one which meets the timing constraints associated with each process, synchronizes every communication event and minimizes the total cost.

5.4.6 Experiments

Figure 5.14 shows a two-process example derived from a network package decoding/encoding system [KFJM92]. In this example, there are three communication events (shown as dotted lines) between the decoder $G_1$ and encoder $G_2$ processes. One of them, $(V_3, N_5)$, is a blocking event and the other two are non-blocking. Furthermore, three maximum constraints are imposed on process $G_1$ and two are imposed on $G_2$. The ILP model produced for this example consists of 114 constraints and 211 variables (190 binary).

Figure 5.14: A two-process example (process P1, the decoder, and process P2, the encoder)

First, we scheduled both processes in a minimum number of control steps while satisfying all the timing constraints as well as synchronizing the three communication events. As a result, $G_1$ and $G_2$ are scheduled in 9 and 10 control steps respectively, as shown in Figure 5.15. In this schedule, the two non-blocking events $(V_5, N_6)$ and $(V_7, N_7)$ are synchronized by scheduling them with respect to the synchronization point $(V_3, N_5)$ using offsets 2 and 5 respectively.

Figure 5.15: A schedule with a minimum number of control steps (process $G_1$ and process $G_2$)

Next, we tried to schedule this example while minimizing the total resources. Due to the tight timing constraints, the solution found, as given in Figure 5.16, does not show improvement over the previous solution in terms of resource allocation. The total number of control steps, however, remains the same.
Figure 5.16: The solution obtained by minimizing the total resources

5.5 A Heuristic Approach for Multi-Process Synthesis

In the earlier sections, we demonstrated our synthesis approach for designs with unbounded-delay operations and communicating processes using an ILP method. However, for large designs, the ILP models may be too large to obtain a solution within a reasonable run time. Fortunately, as we discussed earlier, our synthesis approach is compatible with most existing static scheduling techniques. Hence, a good heuristic approach can be obtained easily by modifying an existing heuristic-search procedure. In this section, we will discuss how to modify the freedom-based scheduling technique [PPM86] to handle unbounded-delay operations under timing constraints and communicating processes.

The basic idea behind freedom-based scheduling is to schedule the operations on the critical path first. For the remaining off-critical-path operations, their freedoms are calculated. The freedom of an operation is determined from the earliest time at which the execution of the operation can start to the latest time at which the operation has to be finished. The operations are assigned to control steps in the order of increasing freedom.

Freedom-based scheduling can be modified in the following way to handle unbounded-delay operations under detailed timing constraints.

1. All the operations with unbounded delays have to be scheduled to an exclusive control step.

Since the critical path is the longest path in a process's constraint graph, the unbounded-delay operations of a single-threaded process should, by definition, all be on the critical path. Hence, by scheduling the critical path first, we can ensure that each unbounded-delay operation is assigned to an empty control step.
In addition, when scheduling the remaining fixed-delay operations, the control steps occupied by the unbounded-delay operations can be avoided.

2. The calculation of freedom should take into account both minimum and maximum timing constraints.

When detailed timing constraints are present among the operations, the freedom of an operation not only depends on the data dependencies but is also affected by the associated timing constraints. Basically, the freedom of an operation $v$ is determined by an execution interval $[el(v), lt(v)]$, where $el(v)$ is the earliest time that $v$ can be scheduled and $lt(v)$ the latest time. Since each minimum timing constraint $l_{ij}$ can be represented by an edge $(v_i, v_j)$ weighted by $l_{ij}$ in the constraint graph, $el(v_j)$ of an operation $v_j$ can be recursively defined as:

$$el(v_j) = \max_{v_i \in im\_preds(v_j)} \{ el(v_i) + w_{ij} \}$$
$$el(v_j) = 0 \quad \text{if } im\_preds(v_j) = \emptyset$$
$$el(v_j) = \sigma(v_j) \quad \text{if } v_j \text{ has been scheduled to } \sigma(v_j)$$

where $im\_preds(v_j)$ is the set of immediate predecessors of $v_j$ and $w_{ij}$ is either the delay of $v_i$ or a minimum timing constraint $l_{ij}$. Similarly, the latest execution time $lt(v_i)$ of an operation can be defined as:

$$lt(v_i) = \min_{v_j \in im\_succs(v_i)} \{ lt(v_j) - w_{ij} \}$$
$$lt(v_i) = N \quad \text{if } im\_succs(v_i) = \emptyset$$
$$lt(v_i) = \sigma(v_i) \quad \text{if } v_i \text{ has been scheduled to } \sigma(v_i)$$

where $im\_succs(v_i)$ is the set of immediate successors of $v_i$, $N$ is the schedule length, and $w_{ij}$ is either the delay of $v_j$ or a maximum timing constraint $u_{ij}$ between $v_i$ and $v_j$.

For systems with multiple communicating processes, if we schedule individual processes one at a time and then propagate the synchronization constraints from one process to another, feasible solutions may not be found. Hence, a better approach is one that can ensure the synchronization of all the inter-process communication events.
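As an illustration of the freedom calculation in item 2 above, the following Python sketch evaluates the $el$ and $lt$ recursions on a small constraint graph. It is only a sketch under stated assumptions: the function name `freedom_intervals`, the edge-dictionary encoding, and the example graph are hypothetical, the graph is assumed to be acyclic, and a single weight per edge is used for both recursions, whereas the text distinguishes delays, minimum constraints, and maximum constraints.

```python
from functools import lru_cache

def freedom_intervals(edges, n_steps, scheduled=None):
    """Return {v: (el(v), lt(v))} for an acyclic constraint graph.

    edges     : dict mapping (v_i, v_j) -> w_ij (assumed: one weight per edge)
    n_steps   : schedule length N
    scheduled : dict mapping already-scheduled operations to their control step
    """
    scheduled = scheduled or {}
    nodes = {v for edge in edges for v in edge} | set(scheduled)
    preds = {v: [] for v in nodes}
    succs = {v: [] for v in nodes}
    for (vi, vj), w in edges.items():
        preds[vj].append((vi, w))
        succs[vi].append((vj, w))

    @lru_cache(maxsize=None)
    def el(v):
        if v in scheduled:            # el(v) = sigma(v) once v is scheduled
            return scheduled[v]
        if not preds[v]:              # no immediate predecessors
            return 0
        return max(el(u) + w for u, w in preds[v])

    @lru_cache(maxsize=None)
    def lt(v):
        if v in scheduled:            # lt(v) = sigma(v) once v is scheduled
            return scheduled[v]
        if not succs[v]:              # no immediate successors
            return n_steps
        return min(lt(s) - w for s, w in succs[v])

    return {v: (el(v), lt(v)) for v in nodes}

# Hypothetical example: unit-weight edges, schedule length N = 4.
edges = {("a", "c"): 1, ("b", "c"): 1, ("c", "d"): 1, ("a", "e"): 1}
intervals = freedom_intervals(edges, n_steps=4)
pinned = freedom_intervals(edges, n_steps=4, scheduled={"c": 2})

# Operations ordered by increasing interval width (smallest freedom first).
order = sorted(intervals, key=lambda v: (intervals[v][1] - intervals[v][0], v))
```

Fixing an operation to a control step, as `pinned` does for `c`, narrows the intervals of its neighbours, which is the update step that freedom-based scheduling performs after each assignment.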
The basic principle of freedom-based scheduling can still be applied here; i.e., the operations with higher scheduling difficulties should be scheduled first. As we have discussed in Section 5.4.3, the scheduling of processes does not have a direct effect on the synchronization of blocking events. In addition, the scheduling of a non-blocking event requires a synchronization point, e.g. a blocking event, between the sending and receiving processes. Therefore, we can begin by scheduling the critical path, including the unbounded-delay operations, of each process. Once all the unbounded-delay operations are scheduled, the synchronization points among the processes can be identified. Each non-blocking communication event can then be scheduled simultaneously at both the sending and receiving processes according to the available synchronization points between them. Finally, the remaining operations can be scheduled according to their freedoms as usual. This approach is outlined as follows:

    multi-process-fbs(PS, EV)
    /* PS is a set of processes */
    /* EV is a set of communication events among PS */
    for (each process G_i in PS) do
        schedule the critical path of G_i;
        /* All the anchors in G_i should have been scheduled now */
        calculate the freedoms of all non-critical-path operations in G_i;
    endfor;
    while (there exist non-blocking events in EV that have not been scheduled) do
        select the event ev_k^ij = (w_k^i, r_k^j) with the smallest freedom;
        schedule w_k^i and r_k^j in G_i and G_j respectively using the same offset
            with respect to a synchronization point between G_i and G_j;
        update the freedoms of the affected operations;
    endwhile;
    for (each process G_i in PS) do
        while (there exist operations in G_i that have not been scheduled) do
            schedule the operation with the smallest freedom;
            update the freedoms of the affected operations;
        endwhile;
    endfor;

5.6 Summary

In this chapter, we presented an approach for synthesis of designs with unbounded-delay operations under timing constraints and with multiple communicating processes. Compared to relative scheduling, this approach allows us to trade off between performance and resource requirements during scheduling as well as to reduce the control overhead. Our approach is based on the observation that each process generally corresponds to one thread of control and that there exists a sequential order among the unbounded-delay operations in the process description. By preserving this order, scheduling of single-threaded processes can still be done statically in terms of control steps despite the presence of unbounded-delay operations. Consequently, many good synthesis techniques originally developed for designs with only fixed-delay operations can still be utilized in our approach with some modifications. In this chapter, we demonstrated our scheduling method using an ILP formulation and also discussed how to modify an existing heuristic technique, freedom-based scheduling, to meet our scheduling requirements.
Chapter 6

Verification of Synthesized RTL Designs

Due to the cost of engineering and fabrication and the critical marketing time, design errors of digital systems should be eliminated at all costs. Although one may argue that the synthesized designs should be correct by construction, in reality there is no such guarantee unless the whole synthesis process, including techniques and programs, can be formally validated. However, formally validating a large software system such as a high-level synthesis system is still impractical, if not impossible, for current formal verification techniques [McF93]. A more practical alternative is to verify the synthesized designs with respect to their specifications. Although designs can be verified at various levels of abstraction, it is desirable to find any design problem as early as possible. In addition, a high-level synthesis system may produce many designs for a given set of constraints; without proper verification, there is no assurance of correctness while the designs are being evaluated or compared. On the other hand, pure functional validation is not enough because many design errors are also related to the control and timing of the design. Hence, we believe that there is a strong need for an automatic tool, integrated in a high-level synthesis system, which can efficiently check both the functionality and the timing of synthesized designs.

In this chapter, we will present an efficient approach for checking the RTL designs produced by the USC ADAM high-level synthesis system. This methodology is also applicable to other synthesis systems incorporating a similar design flow. Our approach is motivated by the observation that the structural designs are derived in a well-defined manner from the behavioral specifications [MPC88] in the ADAM system.
These RTL designs possess several common properties so that symbolic simulation can be effectively utilized to perform the checking task. Using this approach, we are able not only to verify the design functionality formally but also to take into account the interaction between the data path and the controller as well as the timing issues, such as delays and the clocking scheme.

This chapter is organized as follows. First, we describe the problem of verifying synthesized RTL designs in Section 6.1, and analyze a generic high-level synthesis model in order to identify properties of automatically synthesized designs. In Section 6.2, we give an overview of our approach for checking synthesized RTL designs. The hybrid symbolic/numeric simulation model used in our approach will be described in Section 6.3. In Section 6.4 we will show that there is an isomorphic property between the behavioral specifications and the extracted behaviors of the corresponding implementations. A behavior-comparison procedure based on this property will then be given. Finally, we will present some experimental results and summarize this chapter.

6.1 Problem Statement

In Figure 6.1, we show a generic model of high-level synthesis. The problem which we are solving here can be briefly described as follows:

Show whether or not the RTL implementation I will perform the required computation specified in the design behavioral specification S for every execution instance.

Figure 6.1: A high-level synthesis model (the design specification S, a CDFG, is mapped by scheduling, data path allocation/binding, and controller generation to the RTL design I, a data path plus a controller)

The design specification S here is a control data flow graph $CDFG_S$ which defines the required computations to be done for every execution instance. The implementation I itself is a static structure which consists of a data path and a controller.
We are asked in this problem to obtain the dynamic relationship between sequences of inputs and outputs while I is physically operated under the specified clocking scheme and input/output protocol. However, even if we can faithfully obtain this dynamic behavior of I, it is still very difficult to prove the correctness if I and S are regarded as two independent entities. This is because the problem is similar to showing whether or not two arbitrary behaviors are functionally equivalent, which is undecidable in general. In Section 6.6, we will apply computation theory to demonstrate this undecidability of the general RTL verification problem, where S and I are considered independent. Even a simpler problem like determining the equivalence of two finite-precision algebraic expressions cannot be done in either P or NP time if NP ≠ co-NP.¹

¹ Informally, co-NP (complement of NP) defines a set of problems whose complement problems are in NP. Currently, it is not known whether NP is closed under complementation, although it is generally doubted.

Lemma 6.1 Showing the equivalence of two finite-precision algebraic expressions is a co-NP problem which cannot be done in either P or NP time if NP ≠ co-NP.

Proof: The proof is straightforward. First, we know that the not tautology problem² is an NP-complete problem and its complement, the tautology problem, is a co-NP problem which cannot be solved in P nor NP time if NP ≠ co-NP [GJ79]. The tautology problem can be easily reduced in polynomial time to the problem of showing the equivalence of two Boolean expressions, since each instance of this problem is the same as showing whether or not the Boolean expression in question is equivalent to the constant 1.
The latter problem in turn can be easily reduced to one which determines the equivalence of two finite-precision algebraic expressions, since Boolean expressions are a subset of finite-precision algebraic expressions. Therefore, the latter problems are both co-NP problems which cannot be done in P nor NP time if NP ≠ co-NP. □

In high-level synthesis, the synthesis system actually derives the structural design from the behavioral specification in a well-defined manner [MPC88]. Consequently, the specification S and the implementation I are not independent. In fact, there are links between S and I. These links can be utilized in verifying their correspondence. In the following section, we will describe the links between S and I in terms of a number of common properties of the synthesized RTL implementations.

6.1.1 Properties of the Synthesized RTL Designs

In high-level synthesis, the RTL implementation I is the result of a mapping from $CDFG_S$. The major tasks of this mapping involve assigning the operations to control steps (scheduling), assigning the operations and values to hardware (data path allocation/binding), and generating a controller to deliver the required control signals (control synthesis) [MPC88]. Consequently, the implementation I, if mapped correctly, will have the following properties:

² A tautology is a Boolean expression that has the value 1 for all assignments of values to its variables. The not tautology problem is to return 1 if a given Boolean expression is not a tautology and return 0 otherwise.

Property 6.1 For each operation $op$ in $CDFG_S$, there exists a functional unit $u$ in I such that $u$ is designated to perform $op$, and $op$ is achieved by directing all the input values of $op$ to the corresponding input ports of $u$.

For example, Figure 6.2 shows a CDFG and a possible RTL implementation.
The Adder in I is designated to perform both additions $+_1$ and $+_2$ in $CDFG_S$. $+_1$ is achieved by directing $a$ and $b$ to the input ports of Adder through Mux1 and Mux2. Similarly, the multiplication $*$ in $CDFG_S$ is bound to the Multiplier in I.

Property 6.2 For each data dependence $(op_i, op_j)$ in $CDFG_S$, there exists an interconnect path in I between the functional units $u_i$ and $u_j$ which are designated to perform $op_i$ and $op_j$. The interconnect path is set up by using buses and/or switching devices. If the output of $u_i$ is not consumed by $u_j$ within a cycle, a storage element is needed in the interconnect path to store the value before sending it to the input port of $u_j$.

In Figure 6.2, the data dependency between $+_1$ and $*$ is realized through the interconnect path Adder → Reg1 → Multiplier. Similarly, the dependency between $+_2$ and $*$ is achieved by the path Adder → Mux3 → Reg2 → Multiplier.

Property 6.3 $CDFG_S$ defines the required computations which have to be done in I for every execution instance.

For example, after applying four input values $a$, $b$, $c$ and $d$ to the RTL design I in Figure 6.2, I is expected to perform two additions and one multiplication and produce an output which is equal to $(a + b) * (c + d)$ after three cycles.

Figure 6.2: An example of the links between $CDFG_S$ and its synthesized RTL design I

We will show in Sections 6.3 and 6.4 how these properties can be effectively utilized to verify synthesized designs. In fact, the information regarding these properties is generally available after the synthesis process. For example, this information is represented explicitly in the Design Data Structure (DDS) of the ADAM high-level synthesis system by means of bindings [KP85].
This information is particularly useful in diagnosing design errors because these bindings represent the design decisions made during the synthesis process and the events that should occur while the design is operating. If errors are found, the implementation bindings which are generated during the simulation can be traced to determine the cause.

6.2 Approach Overview

From the previous section, we find that the correctness of a synthesized RTL design is really determined by whether or not it will perform the required data operations and data transfers as specified in the given CDFG. Hence, we developed an approach which combines symbolic simulation at the RT level with a behavior-comparison procedure based on the properties described in Section 6.1.1. The motivations for applying symbolic simulation in our approach are twofold. First, it provides formal results because the simulator operates over a symbolic domain, and at the same time we are able to take into account design timing in terms of the clocking scheme, delays, and input/output protocols. Second, the symbolic simulation results are ready for comparison with the design specification for high-level synthesis since they both can be represented in a similar form such as a data flow graph.

Figure 6.3 shows a flow chart which briefly illustrates our approach.

Figure 6.3: An overview of our approach for checking synthesized RTL designs. (The synthesized RTL data path, the controller's STG, and the input/output protocol feed the hybrid symbolic/numeric RTL simulation; its result is passed, together with $DFG_S$ and the path condition, to the DFG-based behavior comparison. If the graphs differ, diagnosis reports errors; otherwise the loop repeats while more paths remain, and the design is reported correct when none are left.)

First, the inputs to our approach include the behavioral specification $CDFG_S$, the synthesized RTL data path, the state-transition graph of the controller, and the input/output protocol.
The control flow of the design is analyzed to produce a list of all possible execution paths. For each execution path, the associated path condition will be used to drive the subsequent simulation. The hybrid symbolic/numeric simulation performed next proceeds from one execution path to another. It is a hybrid model because the data path is evaluated symbolically but the controller is simulated numerically, so that all the control signals will be either 1, 0 or unknown throughout the simulation. The result of the simulation is represented by a data flow graph which describes the actual data operations and data transfers occurring in the synthesized data path. Finally, the simulation result is compared graphically with the given $CDFG_S$ under the same path condition. If the comparison procedure finds any difference between these two graphs, there exist design errors and the current simulation result is diagnosed to find the possible causes. The whole process is repeated until no more execution paths are left. If the data flow graph obtained from the simulation matches the given $CDFG_S$ for every execution path, the given RTL design is considered to be synthesized correctly.

6.3 Hybrid Symbolic/Numeric Simulation

As we have discussed earlier, what is most important for a synthesized RTL design is whether or not it performs the required data operations and the correct sequencing of data transfers for each execution instance. Also, many design errors are related to the control and timing of the design.
Hence, we need to be able to extract the circuit behavior in terms of the symbolic data operations and data transfers that occur in the data path and at the same time to emphasize exercising the control and modeling the timing. The hybrid symbolic/numeric simulation to be described here performs exactly this task.

The idea behind symbolic simulation is similar to extending arithmetic over numbers to symbolic algebraic operations over symbols and numbers. For example, Figure 6.4 shows an RTL circuit with an adder, a two-to-one multiplexer and four registers. Let the symbolic values $a$, $b$ and $c$ represent the initial register values stored in R1, R2 and R3 respectively. Suppose both the control signals sel and en are 1. After simulating the circuit symbolically, a new symbolic value $d$ is produced by the adder and stored in R4, and $d$ is equivalent to $a + c$. In this way, we have the response to all possible initial conditions of R1, R2 and R3 in one simulation run.

Figure 6.4: An example of symbolic simulation (initial state: R1 = a, R2 = b, R3 = c; control signals: sel = 1, en = 1; result: R4 = d, where d = a + c)

The simulation model we use is event-driven. Figure 6.5 shows a typical flow of event-driven simulation [ABF90]. The main difference between our model and the traditional ones is that the evaluation of activated elements and the representation of signal values are symbolic in our model. In addition, a transport delay model can be used to reflect the circuit operation more accurately.

(Flow of Figure 6.5: after initialization, stop if no event is left; otherwise process the current events, determine the activated elements, evaluate the activated elements, then schedule new events and update the simulation time.)
Figure 6.5: Typical flow of event-driven simulation

    Execution path    Path condition
    0-1-6-7-8         c1 = 0
    0-2-3-5-8         c1 = 1, c2 = 0
    0-2-4-5-8         c1 = 1, c2 = 1

Figure 6.6: Execution paths of a state-transition graph

For designs with conditional branches, the state-transition graphs of their controllers will consist of several possible execution paths.

Definition 6.1 An execution path is a directed path from an initial state to an end state in the state-transition graph of a finite-state machine (FSM) controller.

Definition 6.2 A path condition is an assignment to a set of Boolean variables which together determine alternative paths through the state-transition graph.

For example, the state-transition graph shown in Figure 6.6 contains three possible execution paths, each of which is associated with a path condition that represents the assumptions made along the path during the control flow analysis.

    S0  S1  |  C
    0   0   |  A + B
    0   1   |  A - B
    1   0   |  A & B
    1   1   |  A | B
    propagation delay = 10 ns

Figure 6.7: A four-function ALU behavioral model

6.3.1 Element Evaluation

The evaluation of an element is done to compute its output values according to its current input values.

6.3.1.1 Data Path

The evaluation of a data path module depends on its behavior model. For example, in ADAM's DDS, this information is available from the behavioral model of the component used to implement the module. Currently, we represent this information in the form of function tables. The function table of a data path module defines the manipulation of symbolic data for each possible condition on the control lines. For example, the behavioral model for a simple four-function ALU is shown in Figure 6.7. There are three kinds of data path modules:

• Functional Units. These modules are used to perform the operations given in a specification (CDFG).
When a functional unit is evaluated, a new symbolic operation is performed. In general, its control inputs, if any, are used for function selection. Then, one or more new symbolic values are produced at its output ports after a specified delay time. The input/output symbolic values are related by the symbolic operation being performed.

• Switching Devices. The result of evaluating a switching device is that symbolic values are transferred from its input ports to its output ports. The current values of its control lines determine the paths on which the transfers take place. Similarly, the output change is separated from the input change by a propagation delay. No new symbolic value is produced in the evaluation of a switching device.

• Storage Elements. Symbolic values can be written into or read from storage elements via their data input/output ports. Both registers and on-chip memory are allowed in our simulation model. A storage element is evaluated whenever its clock or enable signals change. The memory addresses are regarded as control signals; therefore, they are numeric.

If the input condition of a module being evaluated is invalid, its outputs and data storage, if any, are set to unknown.

The data path carriers (nets) are used for propagating the symbolic values. A carrier connecting more than one output port requires those outputs to be tristate. Normally, at most one tristate output is enabled at any time. A value collision occurs if two or more output ports drive a carrier at the same time [KW89]. This type of design error can be easily detected by the simulator.

6.3.1.2 Controller

The controller is evaluated when there is a change on its clock signal.
If the clock change results in a state transition, the outputs are computed and the controller moves to the next state in the current execution path. The path condition of the current execution path is updated, if necessary, at each state transition. For example, if the state transition requires the inputs $i_1$ and $i_2$ to be 1 and 0, and if the symbolic values currently appearing at $i_1$ and $i_2$ are $a$ and $b$ respectively, then $a = 1$ and $b = 0$ will be added to the path condition. If there is a conflict between the assumptions made in the state transition and the path condition to be updated, the current execution path is a false path which will never occur. The simulation will proceed to the next execution path immediately if a false path is found. On the other hand, if a required input for the state transition contains an unknown value, the simulator aborts the current execution path and reports a data-dependency violation.

6.3.2 Representation of Symbolic Data

The symbolic values and operations which occur during the simulation are the actual events exhibited by the RTL design. These values and operations constitute the actual data path behavior of the structural implementation to be compared with the design specification. Hence, it is very important to represent these symbolic data in a way that is suitable for comparison with the specification.

In our simulation model, the symbolic values and operations produced during the simulation are used to build a bipartite data flow graph, which is similar to the representation used for the design specification. A vertex is created in the data flow graph whenever a new symbolic value or operation is produced during the simulation. If a symbolic value is the result of an operation performed by a functional unit, the operation becomes its direct predecessor.
Similarly, the symbolic values which appear at the input ports of a functional unit become the direct predecessors of the operation being performed.

Our model is different from the early works [Dar79, Cor81] on symbolic simulation at the RT level in the following ways:

1. The overhead of propagating the algebraic expressions is eliminated since we focus on merely collecting the actual data operations and data transfers that occur in the data path.

2. A powerful algebraic manipulator is not required since we do not try to simplify the expressions during the simulation. Instead, the data flow graph representing the simulation result is compared with the specification using the graph-isomorphism property.

These differences are the reason that symbolic simulation can be effectively applied to solve our problem.

Besides the data flow graph which represents the computations performed by the design, the simulator can also check the bindings between the specification and the implementation. This is because the occurrence of a symbolic value or operation, the structural component in which it takes place, and the current simulation time constitute a binding which represents what is actually happening during the simulation. By collecting the actual operation and value bindings from the simulation and comparing them with those specified in the DDS, we are able to determine which design decisions are causing the problem if the design fails.

6.3.3 Examples

In this section, we will use a single-path example and a multiple-path example to illustrate our hybrid symbolic/numeric simulation. A simple RTL data path and its controller's state-transition graph without conditional branches are shown in Figure 6.8. Initially, the contents of Reg1 and Reg2 are unknown. Four symbolic values $a$, $b$, $c$ and $d$ are applied to the input ports of the data path before simulation.
At the first cycle, since the enable signals of Mux1 and Mux2 are 0, the left inputs of Mux1 and Mux2 are selected and $a$ and $b$ propagate to the inputs of Adder. Hence, an addition operation on $a$ and $b$ is performed by the data path, and a new symbolic value $t1$, which is the result of the addition operation, is created at the output of Adder. Since the write signal of Reg1 is 1, $t1$ is stored in Reg1 at the end of the cycle.

    Controller state-transition graph of Figure 6.8:
    State1: Mux1_ENABLE = 0; Mux2_ENABLE = 0; Reg1_WRITE = TRUE; State = State2
    State2: Mux1_ENABLE = 1; Mux2_ENABLE = 1; Mux3_ENABLE = 0; Reg2_WRITE = TRUE; State = State3
    State3: Mux3_ENABLE = 1; Reg2_WRITE = TRUE; State = State1

Figure 6.8: An example of a synthesized RTL design without conditional branches (RTL data path with inputs a, b, c, d, an Adder, a Multiplier and an Output, together with the controller state-transition graph)

Figure 6.9 summarizes the simulation result of this single-path example. As we described in Section 6.3.2, the symbolic values and operations produced during the simulation are represented by a data flow graph, as shown in the bottom half of Figure 6.9.

    Initial: Reg1 = u, Reg2 = u
    Cycle 1: Reg1 = t1, where t1 = a + b; Reg2 = u
    Cycle 2: Reg1 = t1; Reg2 = t2, where t2 = c + d
    Cycle 3: Reg1 = t1; Reg2 = t3, where t3 = t1 * t2
    Final:   Output = t3

Figure 6.9: The simulation result of the single-path example

Figure 6.10 shows a synthesized RTL design with two possible execution paths, 1-2-4-5 and 1-2-3-5. For each execution path, a symbolic value $t1$, which is the result of a comparison (>) operation on $p$ and $q$, is produced and stored in Reg1 at the end of the first cycle.
At the second cycle, since Reg1 now contains $t1$, the path condition is set to $t1 = 1$ for the execution path 1-2-4-5 and $t1 = 0$ for 1-2-3-5.

    Controller with 2 execution paths (Figure 6.10):
    State1: Mux1_ENABLE = 0; Mux2_ENABLE = 0; Mux3_ENABLE = 0; Reg1_WRITE = TRUE; Reg2_WRITE = TRUE; State = State2
    State2: if (Reg1) { Mux1_ENABLE = 1; Mux2_ENABLE = 1; Mux3_ENABLE = 0; Reg2_WRITE = TRUE; State = State4 } else { State = State3 }
    State3: Mux3_ENABLE = 1; Mux4_ENABLE = 0; Reg2_WRITE = TRUE; State = State5
    State4: State = State5
    State5: Mux4_ENABLE = 1; Reg2_WRITE = TRUE; State = State1

Figure 6.10: An example of a synthesized RTL design with conditional branches (RTL data path with an Adder, a Multiplier and a Comparator whose result is fed to the controller)

The simulation result of this multiple-path example is summarized in Figure 6.11. The data flow graph obtained from the simulation for each execution path will be compared separately with the given design specification $CDFG_S$ under the same path condition. This comparison procedure will be discussed in the following section.

Figure 6.11: The simulation result of the multiple-path example (one data flow graph for execution path 1-2-4-5 under path condition Reg1 = t1 = 1, and one for execution path 1-2-3-5 under path condition Reg1 = t1 = 0)

6.4 Graph-Based Behavior Comparison

In our approach, the behavior comparison is based on the data flow graph model. Since the design specification is already represented in a CDFG, which is a superset of this model, only the behavior of the structural implementation has to be translated into this model. The hybrid symbolic/numeric simulation described in Section 6.3 performs the translation task for us. Therefore, the verification of the RTL implementation I becomes the problem of comparing two CDFGs: $CDFG_S$ derived from the input specification and $CDFG_I$ derived during the simulation.
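To make the translation step concrete, the following Python sketch replays the single-path example of Section 6.3.3 and collects the resulting data flow graph. It is a simplified illustration, not the simulator described in this chapter: it is cycle-based rather than event-driven, it ignores delays, and the data path wiring (Mux1 selecting $a$ or $c$, Mux2 selecting $b$ or $d$, Mux3 selecting the Adder or Multiplier output for Reg2) is an assumption read off the description of Figures 6.2 and 6.8. The names `simulate`, `expand`, `STATES` and `SPEC` are hypothetical.

```python
from itertools import count

UNKNOWN = "u"  # value of an uninitialised register or an undriven net

def mux(select, in0, in1):
    """Switching device: forwards a value, never creates a new one."""
    if select is None:            # control signal not driven in this state
        return UNKNOWN
    return in1 if select else in0

def simulate(states):
    """Numeric control, symbolic data: returns (output value, DFG)."""
    regs = {"Reg1": UNKNOWN, "Reg2": UNKNOWN}
    dfg = {}                      # result symbol -> (operation, left, right)
    names = (f"t{i}" for i in count(1))

    def unit(op, left, right):
        """Functional unit: creates a new symbolic value and a DFG vertex."""
        if UNKNOWN in (left, right):
            return UNKNOWN
        result = next(names)
        dfg[result] = (op, left, right)
        return result

    for ctl in states:            # one controller state per clock cycle
        add_out = unit("+", mux(ctl.get("Mux1"), "a", "c"),
                            mux(ctl.get("Mux2"), "b", "d"))
        mul_out = unit("*", regs["Reg1"], regs["Reg2"])
        reg2_in = mux(ctl.get("Mux3"), add_out, mul_out)
        # Registers latch at the end of the cycle, after all units settle.
        if ctl.get("Reg1_WRITE"):
            regs["Reg1"] = add_out
        if ctl.get("Reg2_WRITE"):
            regs["Reg2"] = reg2_in
    return regs["Reg2"], dfg

def expand(dfg, value):
    """Unfold a symbolic value into a nested (op, left, right) tuple."""
    if value not in dfg:
        return value
    op, left, right = dfg[value]
    return (op, expand(dfg, left), expand(dfg, right))

# Controller states of Figure 6.8, written as numeric control signals.
STATES = [
    {"Mux1": 0, "Mux2": 0, "Reg1_WRITE": 1},
    {"Mux1": 1, "Mux2": 1, "Mux3": 0, "Reg2_WRITE": 1},
    {"Mux3": 1, "Reg2_WRITE": 1},
]
output, dfg = simulate(STATES)

# Specification (a + b) * (c + d), in the same nested-tuple form.
SPEC = ("*", ("+", "a", "b"), ("+", "c", "d"))
matches = expand(dfg, output) == SPEC
```

The final equality test is only a stand-in for the graph-matching procedure of this section; it works here because operand order is preserved, and it would not recognise operands swapped on a commutative unit.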
CDFG_I, however, is actually a set of data flow graphs (DFGs), each of which corresponds to the result of simulating I for one of its execution paths. Because the design I is the result of a mapping from CDFG_S, there exists a strong relationship between CDFG_S and CDFG_I (see Section 6.1.1). In fact, we will show that there is an isomorphic property between the two. Consequently, a graph-matching procedure based on this property has been developed to compare them efficiently.

6.4.1 The Isomorphic Property

From Section 6.1.1, we know that the RTL implementation I, if mapped correctly, will have several properties. In summary, I will perform the required computations specified by CDFG_S for every execution instance. The computations are done by making sure the input values of each required operation are available at the corresponding input ports of the designated functional unit, which is configured properly. Let DFG_I be the result of simulating I for one execution instance under the path condition pc. If the specification CDFG_S is interpreted symbolically under the same path condition pc, the result is a data flow graph DFG_S such that the predicate of each operation in DFG_S is evaluated to true under pc. For example, Figure 6.12 shows a CDFG_S with conditional branches. Two DFG_Ss extracted from CDFG_S under t1 = 1 and t1 = 0 are shown on the right-hand side of this figure. Before we show the isomorphic property between DFG_S and DFG_I, we will first establish the correspondence for all their primary input/output values. This correspondence is important because it provides the starting point to compare these two graphs.

[Figure 6.12: An example of extracting DFG_Ss from a CDFG_S. Left: a CDFG_S with Distribute and Join nodes over inputs p, q, a, b. Right: DFG_S under t1 = 0 and DFG_S under t1 = 1.]

Lemma 6.2 There exists a one-to-one correspondence between DFG_S and DFG_I for the primary input/output values.

Proof: The primary input values are applied to I according to the input protocol. For each primary input value in_S of DFG_S, there exists an input port iport of I and some period of time [t_s, t_e] such that a symbolic value in_I, which corresponds to in_S, is created and applied to iport externally from t_s to t_e during the simulation. Therefore, in_I is in DFG_I as a primary input value and it is assumed to correspond to in_S by the input protocol. Similarly, for each primary output value out_S of DFG_S, there exists an output port oport of I and some time t such that a symbolic value out_I is read from oport at time t. Hence, the symbolic value out_I is in DFG_I and is assumed by the output protocol to correspond to out_S. □

Intuitively, the isomorphic property between DFG_S and DFG_I exists because I is synthesized in such a way that each required operation in DFG_S will be performed by a designated functional unit at some time and every data dependency will be preserved by establishing a proper interconnection. Hence, if I is synthesized correctly, each corresponding primary output of DFG_S and DFG_I should have similar geometric properties, which is explained by the following theorem.

Theorem 6.3 For each pair of the corresponding primary output values (out_S, out_I) of DFG_S and DFG_I, the cones³ of out_S and out_I are isomorphic.

³ A cone of a vertex v in a graph G = (V, E) is a subgraph C = (V', E') such that
• V' = {v} ∪ predecessors(v)
• for all v1, v2 in V', if edge (v1, v2) is in E then (v1, v2) is also in E'.
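The cone definition in footnote 3 can be sketched in Python as follows. The sketch assumes that predecessors(v) means every vertex from which v can be reached (transitive predecessors) and that the graph is given as a set of directed edges. The function name cone and the vertex names in the usage lines are hypothetical.

```python
def cone(edges, v):
    """Return (V', E') for the cone of vertex v in a directed graph.

    edges is an iterable of (source, destination) pairs.
    """
    # Index direct predecessors of every vertex.
    preds = {}
    for src, dst in edges:
        preds.setdefault(dst, set()).add(src)

    # V' = {v} plus all transitive predecessors of v (assumption, see lead-in).
    cone_vertices = {v}
    stack = [v]
    while stack:
        node = stack.pop()
        for p in preds.get(node, ()):
            if p not in cone_vertices:
                cone_vertices.add(p)
                stack.append(p)

    # E' keeps every edge of E whose endpoints both lie in V'.
    cone_edges = {(s, d) for s, d in edges
                  if s in cone_vertices and d in cone_vertices}
    return cone_vertices, cone_edges


# Data flow graph in the style of Figure 6.9: t3 = (a + b) * (c + d).
example_edges = {
    ("a", "add1"), ("b", "add1"), ("add1", "t1"),
    ("c", "add2"), ("d", "add2"), ("add2", "t2"),
    ("t1", "mul"), ("t2", "mul"), ("mul", "t3"),
}
print(cone(example_edges, "t1"))
```

In this usage example the cone of t1 contains only a, b, the first addition and t1 itself, while the cone of t3 is the whole graph.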
Proof: From Properties 6.1 and 6.2, we know that for each operation opr_S in DFG_S, there exists a functional unit u of the design I and some time t such that u is configured to perform the operation type of opr_S; otherwise, a synthesis error occurs. Hence, if opr_S is a required computation, a list of symbolic values in DFG_I which corresponds to the input values of opr_S in DFG_S must appear at the respective input ports of u at time t so that an operation opr_I which is equivalent to opr_S is performed. Consequently, a list of new symbolic values which corresponds to the output values of opr_S in DFG_S will be produced in DFG_I. In other words, there exists a one-to-one mapping⁴ from DFG_S to DFG_I for all the operations and values in DFG_S.

⁴ It is not necessarily an onto mapping.

Let C_S and C_I be the cones of out_S and out_I respectively. We claim that there exists a one-to-one and onto relation between C_S and C_I. This relation is one-to-one as we have discussed earlier. If this relation were not onto, there would exist either a vertex or an edge in C_I which did not have a counterpart in C_S.

Case I. Let v_{I,1} be the vertex that has no correspondence. Since v_{I,1} is a predecessor of out_I, there exists a path in C_I from v_{I,1} to out_I. In this path, there must be an edge (v_{I,2}, v_{I,3}) such that v_{I,3} corresponds to v_{S,3} in C_S but v_{I,2} does not have a counterpart in C_S, as shown in Figure 6.13. Hence, (v_{I,2}, v_{I,3}) is an incident edge of v_{I,3} which does not correspond to any incident edge of v_{S,3}. However, the correspondence between v_{I,3} and v_{S,3} implies that there is a one-to-one correspondence for all their incident edges. Therefore, we have a contradiction.

[Figure 6.13: A vertex in C_I has no correspondence in C_S. The figure shows cone C_S with output out_S, cone C_I with output out_I, and an edge in C_I that has no match.]

Case II. Let (v_{I,1}, v_{I,2}) be the edge in C_I that does not have a counterpart in C_S. From Case I, we know that every vertex in C_I must have a counterpart in C_S. Let v_{S,2} be the vertex in DFG_S that corresponds to v_{I,2}. Then, v_{I,2} has an incident edge (v_{I,1}, v_{I,2}) which does not correspond to any incident edge of v_{S,2}. As in Case I, this contradicts the correspondence between v_{I,2} and v_{S,2}; therefore, this edge does not exist.

Thus, there exists a one-to-one and onto relation between C_S and C_I for all the vertices and edges; i.e., C_S and C_I are isomorphic. □

The isomorphic property between C_S and C_I not only implies that there is a one-to-one correspondence between their vertices and edges such that the incidence relationship is preserved, but also requires that each pair of corresponding vertices be compatible. In other words, if the corresponding vertices are operations, they must be of the same type. If they are values, they must have the same bitwidths. Figure 6.14 shows the DFG_S and DFG_I obtained (under the path condition t1 = 1) respectively from the specification and simulation of the multiple-path example given earlier. The isomorphic property between the cones of the outputs of DFG_S and DFG_I can be seen easily.

[Figure 6.14: An example to show the isomorphic property between the cones of two corresponding outputs of DFG_S and DFG_I. Left: DFG_S under t1 = 1. Right: DFG_I from simulation, execution path (1-2-4-5), path condition (Reg1 = t1 = 1).]

6.4.2 A Graph Matching Procedure

Knowing that there is an isomorphic property between CDFG_S and CDFG_I, it becomes straightforward to develop a method for behavior comparison. In fact, all we need to do is to check whether or not the cones of their corresponding output values are isomorphic for all the execution paths.
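As a rough illustration of such a check, the following Python sketch compares two cones by propagating identifiers from the primary inputs toward the outputs, in the spirit of the attribute-labelling procedure presented in this subsection. It is a sketch under stated assumptions, not the implementation used in this work: the dictionary layout of a cone, the argument input_pairs, and the decision to match operand lists in order (so commutativity is ignored) are all hypothetical choices, and operations that appear more than once with identical inputs are simply treated as equivalent.

```python
from itertools import count


def cones_equiv_check(cone_s, cone_i, input_pairs):
    """Return True if the two cones can be fully matched.

    cone_* = {"values": set of value names,
              "ops": {op_name: (op_type, [input values], [output values])}}
    input_pairs = [(spec_input, impl_input), ...], known in advance.
    """
    fresh = count(1)                      # source of unique identifiers
    cones = (cone_s, cone_i)

    # One attribute per vertex (values and operations), initially nil (None).
    attrs = []
    for c in cones:
        attr = {v: None for v in c["values"]}
        attr.update({op: None for op in c["ops"]})
        attrs.append(attr)

    # Corresponding primary inputs share a unique identifier.
    for s_in, i_in in input_pairs:
        ident = next(fresh)
        attrs[0][s_in] = ident
        attrs[1][i_in] = ident

    changed = True
    while changed:                        # repeat until no change can be made
        changed = False
        # Ready operations: unlabelled, with all input values labelled.
        groups = {}
        for side, c in enumerate(cones):
            for op, (op_type, ins, outs) in c["ops"].items():
                if attrs[side][op] is None and all(
                        attrs[side][v] is not None for v in ins):
                    key = (op_type, tuple(attrs[side][v] for v in ins), len(outs))
                    groups.setdefault(key, []).append((side, op))
        # Match ready operations of the same type with identical input labels.
        for (op_type, in_ids, n_outs), members in groups.items():
            if {side for side, _ in members} != {0, 1}:
                continue                  # no partner in the other cone
            op_ident = next(fresh)
            out_idents = [next(fresh) for _ in range(n_outs)]
            for side, op in members:
                attrs[side][op] = op_ident
                for v, ident in zip(cones[side]["ops"][op][2], out_idents):
                    attrs[side][v] = ident
            changed = True

    # Any vertex left unlabelled means the cones differ.
    return all(a is not None for attr in attrs for a in attr.values())


spec = {"values": {"a", "b", "s"}, "ops": {"add_s": ("+", ["a", "b"], ["s"])}}
impl = {"values": {"x", "y", "z"}, "ops": {"add_i": ("+", ["x", "y"], ["z"])}}
print(cones_equiv_check(spec, impl, [("a", "x"), ("b", "y")]))
```

To verify a whole design, a check of this kind would be repeated for every pair of corresponding outputs and for every execution path.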
Unlike the general isomorphism problem in graph theory, which is still an important unsolved problem, it is much easier to check the isomorphic property between the cones of the corresponding output values, for the following reasons:

1. The correspondences of the primary input and output values of the cones are known in advance.
2. The correspondence of two operations can be established as soon as they are determined to be of the same type and all their input values are equivalent.

The first reason gives us the starting point to compare the cones. The second one enables us to perform the comparison in an iterative-improvement way which establishes the correspondences between the cones incrementally from the primary inputs to the outputs. For example, Figure 6.15 shows two small cones to be compared. Once we know that a and b are equivalent to x and y respectively, the correspondences of the two addition operations and their outputs can be established as shown by arrows 3 and 4 in the figure.

[Figure 6.15: An example of matching operations and values between two cones.]

In what follows, we will present a polynomial-time procedure for checking the isomorphic property between the cones of two corresponding output values. Let out_S and out_I of DFG_S and DFG_I be a pair of corresponding output values and let C_S and C_I be their respective cones to be checked. The following procedure will return true if C_S and C_I are equivalent; otherwise, false is returned.

cones_equiv_check(C_S, C_I)
1. Create an attribute for each vertex in C_S and C_I and initialize it to nil.
2. For each pair of corresponding primary input values of C_S and C_I, give their attributes a unique identifier.
3. Find all operations in C_S or C_I such that
   • their attribute is still nil; and
   • the attributes of all their input values are not nil.
   Insert these operations into a ready list L_ready.
4. Find all the operations in L_ready such that
   • they are of the same type; and
   • all of their corresponding input values have the same attributes.
   If found then
   (a) Remove these operations from L_ready.
   (b) Give the attributes of these operations a unique identifier.
   (c) For each set of the corresponding output values of these operations, give their attributes a unique identifier as well.
5. Repeat step 3 until no change can be made.
6. If there exists a vertex in C_S or C_I whose attribute is still nil, return false. Otherwise, return true.

The worst-case time complexity of this procedure is O(n^2), where n is the number of operations in a cone. The average complexity, however, is much lower because the number of operations in the ready list is usually only a small fraction of the total operations in the cones, and a separate ready list can be created for each function type to further reduce the number of operations to be matched. The number of possible execution paths of a design to be verified, however, is O(2^c), where c is the number of conditional branching constructs in the design behavior. Although the number of execution paths is exponential with respect to c, c is relatively small and manageable compared to the total number of operations in the design behavior.

6.5 Experiments

In order to show the effectiveness of our approach, we performed a number of experiments with designs synthesized by the USC ADAM system. In fact, our preliminary experiments immediately revealed that the controllers generated for these designs by the early version of the ADAM control signal generator (CSG) [WP92] were incorrect. CSG was then revised using the FINESSE system from Cascade Design Automation as the back-end FSM synthesis tool.

Shortly after CSG was revised, we experimented with a non-pipelined AR filter. MAHA [PPM86] was used for scheduling and MABAL [KP89] for data path allocation and binding.
This design is characterized as follows:
• It has 4 time steps.
• There are 26 input ports and 2 output ports, and both the input and output values are not latched.
• The data path consists of 6 multipliers, 4 adders, 6 registers, 14 3-to-1 and 5 2-to-1 multiplexors.
• A two-phase non-overlapping clocking scheme is used.

The analysis of the controller generated by CSG resulted in only one execution path with four states. The experiment was carried out by holding the input values (symbolic) at the input ports during the execution and obtaining two output values from the output ports at the end of the 4th clock cycle. The cones of these two output values, as shown in Figure 6.16, were extracted from the data flow graph built during the simulation. These two cones were compared correctly with the original data flow graph shown in Figure 6.17.

[Figure 6.16: The cones of two output values (out1, out2) obtained from the hybrid simulation of a non-pipelined AR filter.]

[Figure 6.17: The data flow graph of the AR filter example.]

We also experimented with a robot arm controller whose control flow is much more complex than the previous one. The control data flow graph of the robot-arm controller can be found in Figure 3.18. The design was synthesized in a similar way except that we required the input values to be latched. The RTL implementation has 12 time steps, 16 possible execution paths, 25 input and 14 output ports. The controller was generated by CSG using 2 status registers. The verification of this RTL implementation was done by comparing 14 pairs of cones for each execution path. From this experiment, some design inefficiencies were found:
• All the constant values were required to be supplied externally, which results in inefficient use of input ports.
• Conditional values were unnecessarily routed to the output ports.
• Some of the input values were not latched as specified.

6.6 Analysis of the General RT-Level Design Verification Problem

As we have seen from previous sections, the design specification and its synthesized RTL implementation are not independent. By utilizing the links between the two, the problem of verifying synthesized RTL designs becomes tractable. In this section, we will study the difficulty of the general RTL design verification problem, where the specification and its implementation are considered to be two independent entities. The RTL verification problem, in the most general form, can be described as follows:

Given:
• a behavioral design specification S.
• an RTL implementation I.
Goal: show that the behaviors B_S and B_I of S and I respectively are functionally equivalent.

The specification S specifies what the system should do, and it is usually described functionally in some HDL (Hardware Description Language). On the other hand, the implementation I, typically described by a netlist, models the system structurally as an interconnection of RTL components. The verification task is to show that, for all feasible inputs, the outputs produced by I (B_I) are equivalent to the outputs computed by S (B_S) [McF93]. The behaviors B_S and B_I are regarded as the way the system or its components interact with their environment, i.e., the mapping from inputs to outputs [MPC88]. However, if the system possesses sequential behavior⁵, B_S and B_I describe the mapping from "sequences" of inputs to "sequences" of outputs. Therefore, the definition of equivalence of two behaviors has to be modified accordingly. Since S and I are given in different languages, it is necessary to translate both of them into a model in which their behaviors can be compared.
For example, we can simulate both S and I using their respective simulators for all feasible inputs, in theory, to get B_S and B_I and verify their output correspondence. Alternatively, we can describe both S and I in a formalism in which a theorem-prover can be used, hopefully, to prove that B_S and B_I are equivalent. Whichever way we use, a "reasonable" verification model representing the behaviors must be capable of expressing every design in the functional model of S and in the RTL model of I. Figure 6.18 shows the relationships between the verification model and the associated functional and RTL models.

⁵ The output at any point depends on the current state, which in turn relies on the past history of inputs [McF].

[Figure 6.18: The verification model and its relationships with the functional and RTL models. The functional model and the RTL model are each connected by a translation to the verification model.]

Definition 6.3 A verification model is called complete if for each element φ in the functional and RTL models there exists a counterpart element ψ in the verification model such that φ and ψ are functionally equivalent.

Definition 6.4 A complete verification model is feasible if we can find a translation function to map each φ in the functional and RTL models to its counterpart ψ in the verification model.

In other words, a complete verification model is at least as powerful as the functional and RTL models in terms of expressive capability, and the model is feasible (useful) only if we can effectively translate all the designs into it. If we do not consider the fact that I is derived from S, B_S and B_I simply correspond to two independent points in a complete verification model. Let Ω_V be
the complete verification model. Solving the general RTL verification problem is then equivalent to finding a decision function f such that for all ψ_i, ψ_j ∈ Ω_V,

\[ f(\psi_i, \psi_j) = \begin{cases} 1 & \text{if } \psi_i = \psi_j \\ 0 & \text{otherwise} \end{cases} \qquad (6.1) \]

In what follows, we will show that it is not possible to find such a decision function no matter which complete verification model we use. However, some background computation theory needs to be introduced first.

Informally, a function is a partial recursive function (prf) if it is effectively "computable" [MY78]. In other words, given a definition of a partial recursive function we can produce an algorithm, e.g. write a C program, to compute it. A programming system is a list of programs φ_0, φ_1, ... which includes all of the prfs. A programming system PS is an acceptable programming system (APS) if and only if the following conditions are met.
• For all prf f, there exists an index i such that φ_i = f. That is, there is at least one program in PS for each prf.
• For all indices i, φ_i is a prf. In other words, every program in PS is a prf.
• There exists a universal program φ_u such that for all i and x, φ_u(i, x) = φ_i(x), where x is a function argument over N (natural numbers).
• There exists a total recursive function c such that φ_{c(i,j)} = φ_i ∘ φ_j for all i and j, where ∘ denotes function composition.

APSs are used in computation theory to model the common characteristics of "reasonable" programming systems mathematically. Hence, any results applied to APSs also hold for all "reasonable" programming systems, and certainly for all existing general purpose programming languages like C, PASCAL, and VHDL [MY78].

Now, we can introduce the notion of undecidable (algorithmically unsolvable) problems concerning APSs. Let N be the set of natural numbers.
Definition 6.5 For all Γ ⊆ N, the function C_Γ : N → {0, 1} is called the characteristic function of Γ if and only if

\[ C_\Gamma(i) = \begin{cases} 1 & \text{if } i \in \Gamma \\ 0 & \text{otherwise} \end{cases} \]

Definition 6.6 For all Γ ⊆ N, Γ is decidable if and only if C_Γ is a prf.

The following lemma describes an important undecidable set which will be used to show the undecidability of the general RTL verification problem.

Lemma 6.4 For all APS {φ}, the set V = {<i, j> | φ_i = φ_j}⁶ is undecidable.

The proof of this lemma can be found in [MY78]. Basically, Lemma 6.4 says that there is no algorithm for deciding whether or not two arbitrary programs in any APS are equivalent. Finally, we can establish the main result of this section with the following theorem.

Theorem 6.5 The general RTL verification problem is algorithmically unsolvable for all complete verification models.

Outline of proof: Recalling the earlier discussion, we know that solving the general RTL verification problem is equivalent to finding a decision function f to determine whether or not two behaviors in a verification model are equivalent, as defined in Equation 6.1. Let Ω_V be a complete verification model. The general RTL verification problem can be represented by the following set:

V = {<i, j> | ψ_i = ψ_j for all ψ_i, ψ_j in Ω_V}

Hence, the decision function f is actually the characteristic function of the set V. From Lemma 6.4, we know that V is undecidable for all APSs. Therefore, if Ω_V is an APS, f is not a partial recursive function. In other words, there exists no program (f) which can decide whether two arbitrary behaviors in Ω_V are equivalent or not.

⁶ < , > is any pairing function which can establish an effective one-to-one correspondence between the 2-tuples in N x N and the numbers in N. For example, f(x, y) = 2^x (2y + 1) - 1.
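As an aside on the pairing function < , > used in the set V, the footnote's example can be tried out in a few lines of Python. The sketch assumes the formula reads f(x, y) = 2^x (2y + 1) - 1 (the exponent is hard to read in the transcript); the names pair and unpair are hypothetical.

```python
def pair(x, y):
    """Map the 2-tuple (x, y) of natural numbers to a single natural number."""
    return (2 ** x) * (2 * y + 1) - 1


def unpair(n):
    """Invert pair: recover (x, y) from n."""
    m = n + 1                 # m = 2^x * (2y + 1), with 2y + 1 odd
    x = 0
    while m % 2 == 0:         # strip factors of two to find x
        m //= 2
        x += 1
    return x, (m - 1) // 2    # the remaining odd factor is 2y + 1


for n in range(8):
    print(n, unpair(n))
```

Every natural number n + 1 factors uniquely into a power of two and an odd number, which is why this mapping is one-to-one and onto and can be computed in both directions.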
Here we argue that any complete verification model Ω_V should satisfy the definition of an APS⁷, since its associated functional model for design specifications is based on some hardware description language such as VHDL or C, which certainly qualifies as a "reasonable" programming system. Hence, the general verification problem is algorithmically unsolvable for all complete verification models. □

⁷ A formal proof will need to show this argument mathematically. However, we feel that this enormous work will lead us out of focus of this research.

6.7 Summary

In this chapter, we have identified several properties of automatically synthesized RTL designs. From these properties, we found that the correctness of a synthesized RTL design is really determined by whether or not the data operations and transfers occurring in the data path conform to the given design specification (a control data flow graph). Consequently, a hybrid symbolic/numeric simulation was introduced in this chapter to extract the behaviors of synthesized RTL designs into data flow graphs which can be compared directly with their specifications. We have also shown that there exists an isomorphic property between the data flow graphs obtained from the specification and the hybrid simulation respectively. A polynomial-time procedure based on this isomorphic property was presented for behavior comparison.

The experiments we have conducted show that the correctness of synthesized designs cannot be taken for granted, especially when the synthesis tools are in their early releases. Hence, we believe that the verification methodology presented in this chapter is valuable not only for obtaining confidence in the correctness of the synthesized RTL designs but also for identifying unforeseen problems in synthesis tools.
In future development, techniques to handle loops which are not unrolled are needed, in which case the controller's state-transition graph becomes cyclic. When there are cycles in a state-transition graph, our definition of possible execution paths has to be changed. One solution is to traverse each cycle only once; therefore, the number of possible execution paths will not be affected by the actual iterations taken by the loops. This approach avoids the problem of determining the conditions under which a loop would terminate during the symbolic simulation. Additionally, we need a method to check that the values produced by the loop body are fed back correctly for subsequent iterations.

Chapter 7

Conclusion

As we have mentioned earlier, most high-level synthesis systems address the task of transforming a behavioral description of a design into an equivalent register-transfer level structure, where the design behavior is often assumed to contain a single process and the structure is to be implemented on a single chip. This thesis has attempted to address many design issues associated with the synthesis of multiple-chip systems with multiple concurrent processes. A system-level synthesis methodology is proposed in this thesis using system-level partitioning, multiple-process scheduling/allocation and RTL design verification to reduce the design time required for finding good quality system designs. Below we summarize the contributions of the research presented in this thesis and discuss some future research topics in each of these areas.

7.1 System Partitioning

In this thesis, a new approach for the system partitioning problem is presented.
An important aspect of this approach is that partitioning is performed at the process level, where the objects to be considered are far fewer than those at the operation level and the functional boundaries specified by the designers are preserved. Furthermore, the exploration of process design alternatives is done concurrently with partitioning, and the chip count and the chip capacities (area and pins) are not simply a number of given constraints to be satisfied; instead, they are traded off according to the available chip packaging options. We believe that system partitioning at a higher level of granularity such as processes and procedures will become more and more advantageous and necessary as both chip capacities and system complexity keep increasing.

Two partitioning methods were described in this thesis. First, an MILP formulation was presented and implemented in a prototype tool called ProPart. Several experiments including a JPEG image compression system were performed to demonstrate the usefulness of this tool. A genetic-search technique was also described to find acceptable solutions in a more manageable run time. The genetic-search technique was found to be a promising optimization technique for system partitioning to handle complex issues like yield and power.

In the future, the communication tradeoffs need to be done more thoroughly, considering pin sharing among different data transfers and considering different pin/buffer requirements at the sender and the receivers. Non-uniform technology should also be taken into account. In other words, the processes could be partitioned onto a number of mixed components, which may be ASICs, pre-designed parts or programmable devices.
These components in turn could be distributed among various packaging devices such as chips, multi-chip modules (MCM) and boards in order to satisfy or optimize the constraints on cost, size, yield, power, and other design characteristics.

7.2 Multiple-Process Synthesis

In Chapter 5, we presented a new approach for synthesis of designs with unbounded-delay operations under timing constraints and with multiple communicating processes. Compared to relative scheduling, this approach allows us to trade off between performance and resource requirements during scheduling as well as to reduce the control overhead.

Our synthesis approach is based on a notion of single-threaded processes. In other words, each process corresponds to one thread of control and there exists a sequential order among the unbounded-delay operations in the process description. We found that scheduling of a single-threaded process can be done statically in terms of control steps by preserving the sequential order of unbounded-delay operations embedded in the process description. Two important issues were also addressed; namely, how to satisfy detailed timing constraints when unbounded-delay operations are present and how to synchronize inter-process communication. An advantage of our approach is that it is compatible with many good synthesis techniques originally developed for designs with only fixed-delay operations. In Chapter 5, we demonstrated our scheduling method using an ILP formulation and also proposed a heuristic procedure modified from freedom-based scheduling.

There is still much work to be done here. Though our approach can be used as the basis for hierarchical scheduling to handle designs with conditional branches and loops, timing constraints can only be applied to the operations within the same level of hierarchy.
A thorough analysis method will be needed to distribute the constraints across the hierarchy, or a non-hierarchical approach should be developed. The handling of inter-process communication needs to be extended to allow one-to-many and buffered communication events. Furthermore, the inter-process communication scheme given by the designers may be too conservative and may contain some blocking communication events that can be converted to non-blocking ones. Techniques to automatically determine a minimal-cost inter-process communication scheme during synthesis or as a separate step should be investigated.

7.3 RTL Design Verification

In the research on design verification, we have identified several important properties of synthesized RTL designs. As a result, we found that the correctness of a synthesized RTL design is mainly determined by whether or not the data operations and transfers occurring in the data path conform to the given design specification (a control data flow graph). Therefore, we presented a hybrid symbolic/numeric simulation to extract the behaviors of synthesized RTL designs into data flow graphs which can then be compared directly with their specifications. We have also shown that there exists an isomorphic property between the data flow graphs obtained from the specification and the hybrid simulation respectively. A polynomial-time procedure based on this isomorphic property was given for behavior comparison.

The advantage of this verification approach is that it not only can formally verify the synthesized data path but also can faithfully exercise the control path, which is also a major source of design errors. The value of this approach was shown by its ability to identify problems with the early CSG tool in the ADAM system.
This also shows that the correctness of synthesized designs cannot be guaranteed, especially when the synthesis tools are in their early releases.

In the future, better handling of loops is definitely needed. When there are loops which are not fully unrolled, the controller's state-transition graph becomes cyclic. Therefore, our definition of possible execution paths has to be changed. More research should be done on diagnosis of design errors in order to trace the possible causes automatically. Finally, modification of our verification methodology to handle multiple-process designs is also needed.

7.4 Other Contributions

In Chapter 3, we have formulated the behavioral VHDL compilation problem in detail. The techniques for control flow analysis, local/global data flow analysis, and graph generation/optimization were presented. We also discussed the modeling of arrays, input/output and inter-process communication in VHDL as well as in the DDS representation. A prototype software tool called VHDL2DDS based on these techniques has been developed and is fully operational. VHDL2DDS currently serves as the VHDL front-end of the ADAM high-level synthesis system, and has been used in numerous chip design experiments.

One piece of work done during the course of this research but not discussed in this thesis is on integrating synthesis and test in the USC ADAM project. The goal is to combine tradeoffs in cost, performance and testability during the high-level synthesis process. An EDIF interface¹ [Che91] was developed to serve as the bridge between the ADAM synthesis system and the built-in test system, and to export ADAM's synthesized designs via the standard EDIF format [Eng87]. Hence, a fully automated design path from VHDL behavioral specifications to testable chip layouts was created.
A number of experiments were also performed via the EDIF interface to study the tradeoffs among performance, cost and testability and to assess what features of the synthesized design have an impact on the design testability. Five example designs were synthesized and made testable. Four of them are the AR filter with different design characteristics and the other one is a robot arm controller. Two AR filter testable designs were laid out using the Seattle Silicon Compiler. From these experiments, a great deal of insight was obtained about the synthesis techniques which support testability and the area overhead for making a design testable. The details of these experiments can be found in [NCPB91].

¹ The EDIF interface is a multi-way design translation system involving three design representations: MABAL [KP89], EDIF [Eng87] and CBASE [GCG+89].

Reference List

[ABF90] M. Abramovici, M. A. Breuer, and A. D. Friedman. Digital Systems Testing and Testable Design. W. H. Freeman and Company, New York, 1990.
[ASU86] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.
[Bar81] M. Barbacci. Instruction Set Processor Specification (ISPS): The Notation and its Applications. IEEE Transactions on Computers, C-30(1):24-40, January 1981.
[BBB+87] R. E. Bryant, D. Beatty, K. Brace, K. Cho, and T. Sheffler. COSMOS: A Compiled Simulator for MOS Circuits. In 24th Design Automation Conference, pages 9-16. ACM/IEEE, 1987.
[BF89] S. Bose and A. L. Fisher. Verifying Pipelined Hardware Using Symbolic Logic Simulation. In Int'l Conference on Computer Design. IEEE, October 1989.
[BKM+66] M. Beardslee, C. Kring, R. Murgai, H. Savoj, R. K. Brayton, and A. R. Newton. SLIP: A Software Environment for System Level Interactive Partitioning. In Int'l Conference on Computer-Aided Design, pages 280-283. IEEE, 1966.
[Boc82] G. V. Bochmann. Hardware Specification with Temporal Logic: An Example. IEEE Transactions on Computers, C-31(2):223-231, March 1982.
[Bry86] R. E. Bryant. Graph-Based Algorithms for Boolean Function Manipulation. IEEE Transactions on Computers, C-35(8):677-691, August 1986.
[Bry90] R. E. Bryant. Symbolic Simulation - Techniques and Applications. In 27th Design Automation Conference, pages 517-521. ACM/IEEE, 1990.
[BS92] S. H. Bang and B. J. Sheu. A Multi-Chip Module for Hand-Held Digital Cellular Mobile Telephone. In Multi-Chip Module Conference, pages 115-118. IEEE, March 1992.
[CH90] A. Chatterjee and R. Hartley. A New Simultaneous Circuit Partitioning and Chip Placement Approach Based on Simulated Annealing. In 27th Design Automation Conference, pages 36-39. ACM/IEEE, 1990.
[Che91] C. T. Chen. Manual Pages for the ADAM EDIF Interface. Department of Electrical Engineering - Systems, University of Southern California, July 1991.
[Cor81] W. E. Cory. Symbolic Simulation for Functional Verification with ADLIB and SDL. In 18th Design Automation Conference, pages 82-89. ACM/IEEE, 1981.
[CP88] P. Camurati and P. Prinetto. Formal Verification of Hardware Correctness: Introduction and Survey of Current Research. IEEE Computer, 21(7):8-19, July 1988.
[CvE87] R. Camposano and J. van Eijndhoven. Partitioning a Design in Structural Synthesis. In Int'l Conference on Computer Design, pages 564-566. IEEE, October 1987.
[Dar79] J. A. Darringer. The Application of Program Verification Techniques to Hardware Verification. In 16th Design Automation Conference, pages 375-381. ACM/IEEE, 1979.
[Eng87] Engineering Department, Electronic Industries Association. Electronic Design Interchange Format Version 2 0 0, 1987.
[FLS+92] H. Fujiwara, M. L. Liou, M. T. Sun, K. M. Yang, M. M. Maruyama, K. Shomura, and K. Ohyama. An All-ASIC Implementation of a Low Bit-Rate Video Codec. IEEE Transactions on Circuits and Systems for Video Technology, 2(2):123-134, June 1992.
[FM82] C. M. Fiduccia and R. M. Mattheyses. A Linear-Time Heuristic for Improving Network Partitions. In 19th Design Automation Conference, pages 175-181. ACM/IEEE, 1982.
[Fuh91] T. E. Fuhrman. Industrial Extensions to University High Level Synthesis Tools: Making it Work in the Real World. In 28th Design Automation Conference, pages 520-525. ACM/IEEE, 1991.
[GCDBP94] P. Gupta, C. T. Chen, J. C. DeSouza-Batista, and A. C. Parker. Experience with Image Compression Chip Design using Unified System Construction Tools. In 31st Design Automation Conference. ACM/IEEE, 1994.
[GCG+89] R. Gupta, W. Cheng, R. Gupta, I. Hardonag, and M. Breuer. An Object-Oriented VLSI CAD Framework. IEEE Computer, 22(5):28-37, May 1989.
[GE92] C. H. Gebotys and M. I. Elmasry. Simultaneous Scheduling and Allocation for Cost Constrained Optimal Architectural Synthesis. In 28th Design Automation Conference, pages 2-7. ACM/IEEE, 1992.
[Geb92] C. H. Gebotys. Optimal Synthesis of Multichip Architectures. In Int'l Conference on Computer-Aided Design, pages 238-241. IEEE, 1992.
[GGP+91] M. Genoe, L. Claesen, E. Proesmans, E. Verlind, and H. De Man. Illustration of the SFG-Tracing Multi-Level Behavioral Verification Methodology, by the Correctness Proof of a High to Low Level Synthesis Application in CATHEDRAL-II. In Int'l Conference on Computer Design. IEEE, October 1991.
[GGVN93] D. Gajski, J. Gong, F. Vahid, and S. Narayan. The SpecSyn Design Process and Human Interface. Technical Report TR ICS 93-3, Department of Information and Computer Science, University of California, Irvine, 1993.
[GJ79] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York, 1979.
[GM90] R. Gupta and G. De Micheli. Partitioning of Functional Models of Synchronous Digital Systems. In Int'l Conference on Computer-Aided Design, pages 216-219. IEEE, 1990.
[GM92] R. Gupta and G. De Micheli. System Synthesis via Hardware-Software Co-design. Technical Report CSL-TR-92-548, Department of EE and CS, Stanford University, October 1992.
[Gol89] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA, 1989.
[GP94] P. Gupta and A. C. Parker. SMASH: A Program for Scheduling Memory-Intensive Application Specific Hardware. In 7th Int'l Symposium on High-Level Synthesis, May 1994.
[GRVM90] G. Goossens, J. Rabaey, J. Vandewalle, and H. De Man. An Efficient Microcode Compiler for Application Specific DSP Processors. IEEE Transactions on Computer-Aided Design, 9(9):925-937, September 1990.
[GS84] J. Greene and K. Supowit. Simulated Annealing Without Rejected Moves. In Int'l Conference on Computer Design, pages 658-663. IEEE, October 1984.
[Hay90] S. Hayati. The Synthesis of Control-Dominated Application Specific Integrated Circuits Using Global Based Design Management. PhD thesis, Department of Electrical Engineering - Systems, University of Southern California, November 1990.
[HD86] F. K. Hanna and N. Daeche. Specification and Verification of Digital Systems using Higher-Order Logic. IEE Proceeding, 133(5):242-254, September 1986.
[HH90] L. J. Hafer and E. Hutchings. Bringing up Bozo. Technical Report CMPT TR 90-2, School of Computing Science, Simon Fraser University, Burnaby, B.C., V5A 1S6, March 1990.
[Hil85] P. Hilfinger. A High-level Language and Silicon Compiler for Digital Signal Processing. In Int'l Symposium on Circuits and Systems, pages 213-216. IEEE, May 1985.
[HLH91] C. T. Hwang, J. H. Lee, and Y. C. Hsu. A Formal Approach to the Scheduling Problem in High-Level Synthesis. IEEE Transactions on Computer-Aided Design, 10(4), April 1991.
[HP92] Y. H. Hung and A. C. Parker. High-Level Synthesis with Pin Constraints for Multiple-Chip Designs. In 29th Design Automation Conference, pages 231-234. ACM/IEEE, 1992.
[Ins88] The Institute of Electrical and Electronics Engineers Inc. IEEE Standard VHDL Language Reference Manual, 1988.
[KFJM92] D. Ku, D. Filo, C. N. Coelho Jr., and G. De Micheli. Interface Optimization for Concurrent Systems under Timing Constraints using Interface Matching. In 6th Int'l Workshop on High-Level Synthesis, November 1992.
[KL70] B. W. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. Bell System Technical Journal, 49:291-307, January 1970.
[KM92] D. Ku and G. De Micheli. Relative Scheduling Under Timing Constraints: Algorithms for High-Level Synthesis of Digital Circuits. IEEE Transactions on Computer-Aided Design, 11(6):696-718, June 1992.
[KP85] D. W. Knapp and A. C. Parker. A Unified Representation for Design Information. In Int'l Symposium on Computer Hardware Description Languages and their Applications, 1985.
[KP89] K. Kucukcakar and A. C. Parker. MABAL - A Software Package for Module And Bus ALlocation. In Int'l Journal of Computer Aided VLSI Design, June 1989.
[KP93] K. Kucukcakar and A. C. Parker. BEST: Behavioral Area-Delay Predictor. Technical Report CEng 93-24, Department of Electrical Engineering - Systems, University of Southern California, May 1993.
[Kuc91] K. Kucukcakar. System-Level Synthesis Techniques with Emphasis on Partitioning and Design Planning. PhD thesis, Department of Electrical Engineering - Systems, University of Southern California, October 1991.
[KW89] D. W. Knapp and M. Winslett. A Formalization of Correctness for Linked Representations of Datapath Hardware. In IFIP Workshop on Applied Formal Methods for Correct VLSI Design, November 1989.
[LG88] J. Lis and D. Gajski. Synthesis from VHDL. In Int'l Conference on Computer Design, pages 378-381. IEEE, 1988.
[LGP+91] D. Lanneer, G. Goossens, M. Pauwels, J. Van Meerbergen, and H. De Man. An Object-Oriented Framework Supporting the Full High-Level Synthesis Trajectory. In 10th Int'l Symposium on Computer Hardware Description Languages and their Applications, pages 281-300, 1991.
[LLT69] E. L. Lawler, K. N. Levitt, and J. Turner. Module Clustering to Minimize Delay in Digital Networks. IEEE Transactions on Computers, C-18(1):47-57, January 1969.
[LT91] E. Lagnese and D. Thomas. Architectural Partitioning for System Level Synthesis of Integrated Circuits. IEEE Transactions on Computer-Aided Design, 10(7), July 1991.
[McF] M. C. McFarland. Practical Lessons in Verification and High-Level Synthesis. AT&T Bell Laboratories, Murray Hill, NJ 07974-2070.
[McF78] M. C. McFarland. The Value Trace: A Data Base for Automated Digital Design. Technical Report DRC-01-4-80, Dept. of Electrical Engineering, Carnegie-Mellon University, December 1978.
[McF86] M. C. McFarland. Using Bottom-Up Design Techniques in the Synthesis of Digital Hardware from Abstract Behavioral Descriptions. In 23rd Design Automation Conference, pages 474-480. ACM/IEEE, June 1986.
[McF93] M. C. McFarland. Formal Verification of Sequential Hardware: A Tutorial. IEEE Transactions on Computer-Aided Design, 12(5):633-654, May 1993.
[MK88] G. De Micheli and D. Ku. HERCULES - A System for High-Level Synthesis. In 25th Design Automation Conference. ACM/IEEE, 1988.
[MKMT90] G. De Micheli, D. Ku, F. Mailhot, and T. Truong. The Olympus Synthesis System. IEEE Design and Test of Computers, October 1990.
[MP83] M. C. McFarland and A. C. Parker. An Abstract Model of Behavior for Hardware Descriptions. IEEE Transactions on Computers, C-32(7):621-636, July 1983.
[MPC88] M. C. McFarland, A. C. Parker, and R. Camposano. Tutorial on High-Level Synthesis. In 25th Design Automation Conference, pages 330-336. ACM/IEEE, 1988.
[MY78] M. Machtey and P. Young. An Introduction to the General Theory of Algorithms, the Computer Science Library. Elsevier North Holland, Inc., 1978.
[Nar92] S. Narayan. A Survey of System-Level Specification Languages. Technical Report TR ICS 92-100, Department of Information and Computer Science, University of California, Irvine, 1992.
[NCPB91] C. Njinda, C. T. Chen, A. C. Parker, and M. Breuer. Integrating Synthesis and Test in ADAM. In IFIP International Workshop on the Relationship Between Synthesis, Test, and Verification, November 1991.
[Nes87] J. Nestor. Specification and Synthesis of Digital Systems with Interfaces. PhD thesis, Carnegie-Mellon University, April 1987.
[NP91] A. Nicolau and R. Potasman. Incremental Tree Height Reduction For High Level Synthesis. In 28th Design Automation Conference, pages 770-774. ACM/IEEE, 1991.
[NW88] G. L. Nemhauser and L. A. Wolsey. Integer and Combinatorial Optimization. Wiley-Interscience, 1988.
[OG86] A. Orailoglu and D. Gajski. Flow Graph Representation. In 23rd Design Automation Conference, pages 503-509. ACM/IEEE, 1986.
[PGH91] A. C. Parker, P. Gupta, and A. Hussain. The Effects of Physical Design Characteristics on the Area-Performance Tradeoff Curve. In 28th Design Automation Conference, pages 530-534. ACM/IEEE, June 1991.
[PK89] P. G. Paulin and J. P. Knight. Force-Directed Scheduling for the Behavioral Synthesis of ASIC's. IEEE Transactions on Computer-Aided Design, 8(6):661-679, June 1989.
[PP93] H. Park and V. K. Prasanna. Area Efficient VLSI Architectures for Huffman Coding. Int'l Conference on Acoustics, Speech and Signal Processing, 1993.
[PPM86] A. C. Parker, J. Pizarro, and M. J. Mlinar. MAHA: A Program for Datapath Synthesis. In 23rd Design Automation Conference, pages 461-466. ACM/IEEE, 1986.
[Pra93] S. Prakash. Synthesis of Application-Specific Multiprocessor Systems. PhD thesis, Department of Electrical Engineering - Systems, University of Southern California, 1993.
[RP93] J. Raghavendran and A. C. Parker. Hardware/Software Tradeoffs in ADAM. Technical Report CEng 93-28, Department of Electrical Engineering - Systems, University of Southern California, 1993.
[SG92] N. Schraudolph and J. Grefenstette. A User's Guide to GAucsd 1.4. Technical Report CS92-249, University of California, San Diego, July 1992.
[SR89] Y. Saab and V. Rao. An Evolution-Based Approach to Partitioning ASIC Systems. In 26th Design Automation Conference, pages 767-770. ACM/IEEE, 1989.
[TLW+90] D. E. Thomas, E. D. Lagnese, R. A. Walker, J. A. Nestor, J. V. Rajan, and R. L. Blackburn. Algorithmic and Register-Transfer Level Synthesis: The System Architect's Workbench. The Kluwer International Series in Engineering and Computer Science. Kluwer Academic Publishers, 1990.
[TW93] A. Takach and W. Wolf. Scheduling Constraint Generation for Communicating Processes. Technical report, Department of Electrical Engineering, Princeton University, February 1993.
[Vah91] F. Vahid. A Survey of Behavioral-Level Partitioning Systems. Technical Report TR ICS 91-71, Department of Information and Computer Science, University of California, Irvine, 1991.
[VG92] F. Vahid and Daniel D. Gajski. Specification Partitioning for System Design. In 29th Design Automation Conference, pages 219-224. ACM/IEEE, 1992.
[VNG91] F. Vahid, S. Narayan, and D. Gajski. SpecCharts: A Language for System Level Synthesis. In Int'l Symposium on Computer Hardware Description Languages and their Applications, 1991.
[Wal91] G. K. Wallace. The JPEG Still Picture Compression Standard. ACM Communications, 34(4):31-44, April 1991.
[WGB] T. C. Wilson, G. W. Grewal, and D. K. Banerji. An ILP Solution for Simultaneous Scheduling, Allocation, and Binding in Multiple Block Synthesis. VLSI-CAD Group, Department of Computing and Information Science, University of Guelph, Guelph, Ontario, Canada N1G-2W1.
[Whi93] D. Whitley. A Genetic Algorithm Tutorial. Technical Report CS-93-103, Colorado State University, November 1993.
[WP92] J. P. Weng and A. C. Parker. CSG: Control Path Synthesis in the ADAM System. Technical Report CEng 92-03, Department of Electrical Engineering - Systems, University of Southern California, April 1992.

Appendix A

The VHDL Subset of the ADAM System

The VHDL used in the ADAM system and in the early USC system is a subset of the IEEE Standard VHDL [Ins88], since we are only concerned with representing behavioral specifications in VHDL. This subset was carefully defined to avoid features that are incompatible with the notion of behavioral description or that cannot be represented in DDS, while still giving sufficient expressive power for most applications. The allowed VHDL constructs are limited to the following:

1. Design Entities

The primary hardware abstraction in VHDL is the design entity. A design entity is defined by an entity declaration together with a corresponding architecture body.

• Entity Declarations
The entity declaration basically defines the inputs and outputs of the design entity. A given entity declaration is restricted to be used by only one design entity; that is, it cannot be shared in this VHDL subset. The restrictions described in Item 5 (Declarations) are applied accordingly to the entity header and the entity declarative part of a given entity declaration¹. The entity statement part must be empty in each design entity; in other words, the behavior of a design entity must only be specified in the corresponding architecture body.
• Architecture Bodies
There are three general styles of description possible within an architecture body: structural, dataflow and behavioral. However, only the behavioral one is supported in ADAM/USC. Behavioral descriptions specify data transforms in terms of algorithms for computing output responses to input changes. The feature of multiple asynchronous processes is not yet supported in the current version of VHDL2DDS; therefore, each architecture body is required to have one and only one concurrent statement in the architecture statement part.

¹ In fact, this rule is applied to all declarative parts in this VHDL subset. It will not be stated explicitly in the rest of the VHDL subset definition unless additional restrictions are required.

2. Subprograms

Since the configuration is not included in the VHDL subset, subprograms serve as the major mechanism for building the desired design hierarchy². The definition of a subprogram can be given in two parts: a subprogram declaration and a subprogram body. Subprograms without subprogram bodies are usually used as the leaf nodes in the design hierarchy and as the interfaces to the modules in the system library. Both procedures and functions are allowed. Subprogram overloading is not supported, and operator overloading is limited to once for each scope of declarations.

² In fact, this limitation makes the design entity unsuitable for describing internal blocks because there is no way to bind a collection of design entities into a design hierarchy without using a configuration declaration.

3. Packages

Packages provide a means of defining declarations which can be shared by different design units. One of the major usages of packages in ADAM is to define the interfaces of some implementation-dependent module libraries. In such a case, the package declaration has no corresponding package body. No special restrictions except those in Item 5 are imposed on packages.

4. Types

In VHDL, a type is characterized by a set of values and a set of operations. All implicitly declared operations for a given type declaration are supported and will be translated automatically by VHDL2DDS. However, their use in VHDL descriptions is not recommended because there may be no corresponding modules in the module libraries. As a result, the explicitly declared subprograms for a type are more appropriate in terms of module bindings. Two classes of types are allowed with restrictions; namely, scalar types and composite types.

• Scalar Types
Scalar types are limited to the predefined types BIT, BOOLEAN, and INTEGER only. Currently, users cannot define their own scalar types. The INTEGER type is assumed to be a 32-bit implementation.

• Composite Types
The composite type is the only user-definable type class in this VHDL subset. It is further limited to array types only. An array object is a composite object consisting of elements that have the same type. Its primary usages are to model different bit-width values and memories. The maximal dimensionality of an array type is limited to 2. Both unconstrained array types and constrained array types are allowed. The index definition of an unconstrained array type must be INTEGER, and the index constraint of a constrained array type must be ranges. BIT_VECTOR is the predefined array type supported by VHDL2DDS.

Since subtypes are not supported, there are several limitations on the uses of array types. First, a constrained array type is not defined as an unconstrained array type plus a subtype of this unconstrained array type; it is itself a data type. Also, defining a constrained array type from an existing unconstrained array type is not allowed. This makes the unconstrained array types of little use.

For each array type, two additional operations are implicitly defined by this VHDL subset.
They are array read operations and array write operations. If an indexed name appears on the right (left) hand side of an assignment statement, an array read (write) operation will be used. This feature is well suited for modeling memories; however, it cannot model the extraction of a subvalue from a multi-bit value.

5. Declarations

In addition to design entities, subprograms, packages and types, the other kinds of declarations allowed are object declarations and interface declarations.

• Object Declarations
All three classes of objects are allowed; namely, constants, signals, and variables. An object declaration declares an object of a specified type. The feature of deferred constants is not supported. Signals will be treated as variables; that is, only the syntactical aspect of signals is preserved, but their semantics will be identical to variables in terms of the VHDL to DDS translation. Therefore, a signal declaration is not allowed to have a resolution function, guards, or the signal kind.

• Interface Declarations
Interface objects also include constants, signals, and variables. The restrictions described above are applied accordingly. In addition, the mode of an interface object is limited to either in or out.

6. Names

All forms of names except attribute names and slice names are allowed. The identifier for an entity, a package, a subprogram, or an interface object has only its first 5 characters significant after translation. An indexed name is considered to be an array read (write) operation instead of simply denoting an element of an array.

7. Expressions

An expression is a formula that defines the computation of a value. It consists of a set of operators and their operands. Though all VHDL predefined operators are supported by VHDL2DDS, VHDL2DDS does not assume any specific implementation for a predefined operator, nor is it aware of the availability of any library module for binding.
Care must be taken not to use any predefined operator unless the user can make sure that some corresponding module exists in the library or that the operator in question will somehow be implemented. In ADAM, a more appropriate approach is to define a package for each available module library using function declarations or overloaded operators, and to use these functions or operators in expressions instead of the predefined ones.

The allowed operands in an expression include names, literals, and function calls. In addition, an expression enclosed in parentheses may be an operand in an expression. A literal is either an integer literal, a Boolean literal, a bit literal, or a bit string literal.

8. Sequential Statements

Sequential statements are the major means for describing the behavior of the component under design. The allowed sequential statements are

• Signal assignment statement.
• Variable assignment statement.
• Procedure call statement.
• If statement.
• Case statement.
• Loop statement.
• Next statement.
• Exit statement.
• Return statement.
• Null statement.

Statement labels can be used whenever necessary. A signal assignment statement is treated like a variable assignment statement. Hence, transport delay is not supported and the waveform on the right hand side can only consist of one element. In addition, a waveform element is not allowed to have an after clause. The iteration scheme of a for loop must be a range of type INTEGER.

9. Concurrent Statements

The process statement is the only form of concurrent statement allowed in this VHDL subset³. A process statement defines an independent sequential process representing the behavior of the design. The execution of a process statement is modeled by the endlessly repetitive execution (an implicit loop) of its sequence of statements. Hence, a process statement is not allowed to have a sensitivity list.

³ Since the feature of multiple concurrent processes is not supported, the inclusion of the process statement in this VHDL subset is merely for syntactical reasons.
Asset Metadata
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC11255747
Unique identifier
UC11255747
Legacy Identifier
DP22879