CONTROL PATH/DATA PATH TRADEOFFS IN VLSI DESIGN

by

Mitchell J. Mlinar

A Dissertation Presented to the Faculty of the Graduate School, University of Southern California, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Computer Engineering)

May 1991

Copyright 1991 Mitchell J. Mlinar

UMI Number: DP22827. All rights reserved.

INFORMATION TO ALL USERS: The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

UMI Dissertation Publishing. UMI DP22827. Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author. Microform Edition © ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346.

UNIVERSITY OF SOUTHERN CALIFORNIA, THE GRADUATE SCHOOL, UNIVERSITY PARK, LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Mitchell J. Mlinar under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dean of Graduate Studies
Date
DISSERTATION COMMITTEE
Chairperson

Dedication

To my wife, Sharon and my children, Alison, Eric, and Rebecca. You who endured the sweat and the tears, can now share in the glory.

Acknowledgments

I would like to thank my advisor, Prof. Alice Parker, under whose direction this thesis formed. Only she can appreciate the amount of time spent towards my guidance and technical growth. I would also like to thank Professors Melvin Breuer and Dennis Estes for serving on my thesis committee.
Their suggestions and comments have been most helpful.

Two other faculty members have also contributed in part to this thesis. Technical discussions with Professor Sarma Sastry have proved most illuminating. Professor James Yee opened new doors which exposed me to alternative paths and provided new insight on many problems.

I would also like to thank my friends in the “pits”: Jorge Seidel, Meera Balakrishna, Kayhan Küçükçakar, and Shiv Prakash. They made the long years more bearable. Special thanks go to my colleague, Rajiv Jain. His loyalty and friendship helped to keep my “eyes on the prize” even in the worst of times. Long talks with Sally Hayati also kept my sanity intact. My PhD was far more interesting with you two around.

I also acknowledge the support of TRW; their financial aid during the early years is appreciated. Research assistance provided by the Defense Advanced Research Projects Agency (Contract N00014-87-K-0861) was also helpful. Last, but not least, is my current employer EEsof, which has quietly tolerated my school schedule.

To my three children - Alison, Eric, and Rebecca - your welcome interruptions kept my goal in perspective. I would also like to thank my mother for getting me started on this long path.

Finally, and most important, my wife deserves a large part of the credit. Her unlimited patience and support have turned my dream into a reality.

Contents

Dedication
Acknowledgments
List Of Figures
List Of Tables
1 Introduction
1.1 Problem Statement
1.2 Motivation
1.3 Problem Approach
1.3.1 Data Path Analysis
1.3.2 Routing and Storage Analysis
1.3.3 Control Path Analysis
1.3.4 System-Level Partitioning
1.4 Evaluation Tools within ADAM
1.4.1 SLIMOS: Module Set Selection
1.4.2 MAHA: Non-Pipelined Data Path Synthesis
1.4.3 Sehwa: Pipelined Data Path Synthesis
1.4.4 HAPPE: Data Path Estimation
1.4.5 MABAL: Multiplexer and Bus Allocation
1.4.6 Berkeley PLA Synthesis Package
1.4.7 PASTA: PLA Control Area Estimator
1.4.8 PLEST Data Path Area Estimator
1.5 Related Work
1.5.1 Data Path Synthesis
1.5.2 Control Path Synthesis
1.5.3 Synthesis considering both control and data path analysis
1.5.4 High-level hardware/firmware/software partitioning
1.6 Thesis Outline
2 Extensions to Data Path Synthesis
2.1 Introduction
2.2 Overview of the MAHA Algorithm
2.3 Evolution of MAHA
2.3.1 Synthesis of Serialized Designs
2.3.2 Register Extensions
2.3.3 Multiplexer Extensions
2.3.4 Extension for Localized Constraints
2.4 The MAHA Program Structure
2.4.1 Algorithm Input
2.4.1.1 Dataflow Graph Input
2.4.1.2 Library Generation
2.4.1.3 Local Constraint Specification
2.4.1.4 Global Constraints
2.4.2 Clock Cycle Generation
2.4.3 Critical Path Partitioning and Allocation
2.4.4 Off-Critical Path Analysis and Allocation
2.4.5 Dataflow Graph Stretching
2.4.6 MAHA Synthesis Procedure
2.4.7 Runtime Analysis
2.5 Loop Transformation in Data Path Synthesis
2.5.1 Simple Loop Transformation
2.5.2 Transformation Algorithm
2.5.3 Runtime Analysis for Loop Transformation
2.6 Examples and Synthesis Results
2.7 Limitations of MAHA
2.8 Summary
3 Area Estimation of Data Path Controllers
3.1 Introduction
3.2 A PLA Control Area Model
3.2.1 General Model
3.2.2 Computing Rc and p
3.2.2.1 Stage Control Method
3.2.2.2 Register Control Method
3.2.2.3 Multiplexer/Bus Control Lines
3.2.3 A Specific Model for Berkeley PLA Synthesis Tools
3.3 Loops and Conditionals
3.3.1 Effect of Loops on PLA Area
3.3.1.1 Loop Unrolling
3.3.1.2 Implementing Loops using a Counter
3.3.2 Effect of Conditional Branches on PLA Area
3.4 Extending the PLA Model
3.4.1 Estimation of Folded PLA Area
3.4.2 Extensions for Pipelined Controllers
3.5 Validation of the PLA Model
3.5.1 Validation Process
3.5.1.1 Basic Model
3.5.2 Loops and Conditionals
3.5.3 Pipelined Designs
3.5.4 PLA Column Folding
3.5.5 Use of PASTA
3.5.6 Shortcomings of Using the Berkeley Toolset
3.6 Summary
4 Predicting Register and Multiplexer Requirements
4.1 Introduction
4.1.1 System to be Modeled
4.2 Register Area in Non-pipelined Designs
4.2.1 Register Bounds in Non-Pipelined Designs
4.2.2 Estimating Register Use in Non-Pipelined Designs
4.3 Register Area in Pipelined Designs
4.3.1 Register Bounds in Pipelined Designs
4.3.2 Estimating Register Use in Pipelined Designs
4.4 Predicting the Number of Microcycles
4.5 Multiplexer Area
4.5.1 Theoretical Bounds on Multiplexer Area
4.5.2 Estimating Multiplexers in Non-Pipelined Designs
4.5.2.1 Multiplexer Area with Operators in Non-Pipelined Design
4.5.2.2 Multiplexer Area with Registers in Non-Pipelined Design
4.5.2.3 Example for Non-Pipelined Design
4.5.3 Estimating Multiplexers in Pipelined Designs
4.5.3.1 Multiplexer Area with Operators in Pipelined Design
4.5.3.2 Multiplexer Area with Registers in Pipelined Design
4.6 Experiments and Validation
4.7 Summary
5 Analysis of Control Path/Data Path Tradeoffs
5.1 Introduction
5.2 Types of Control Path/Data Path Tradeoffs
5.2.1 Composition of Operators
5.2.2 Operator Decomposition
5.2.3 Sequential/Combinational Tradeoffs
5.3 Control Path/Data Path Bitwidth Tradeoff Model
5.3.1 A Simple Model
5.3.2 Bit-dependent Operators
5.3.3 Special Operator Bitwidth Considerations
5.3.4 Example using the small model
5.3.5 Limitations of the Control/Data Bitwidth Model
5.4 Summary
6 Control Path/Data Path Tradeoff Evaluation
6.1 Introduction
6.2 Control Path/Data Path Tradeoffs Experiments
6.2.1 A Methodology for Evaluating Tradeoffs
6.2.2 Synthesis versus Prediction: An Example
6.2.2.1 Design Synthesis
6.2.2.2 Design Prediction
6.2.2.3 Synthesis versus Prediction
6.3 Tradeoff Examples
6.3.1 Bitwidth Tradeoffs
6.3.1.1 Non-Pipelined Bit Serialization: Floating Point Processor
6.3.1.2 Pipelined Bit Serialization: Elliptical Filter
6.3.2 Composition Tradeoffs
6.3.2.1 Merging Different Bitwidths: Temperature Controller
6.3.2.2 ALU Substitution: Floating Point Coprocessor
6.3.3 Multi-processor Tradeoffs
6.3.4 Tradeoff Trends
6.4 Summary
7 Conclusions and Future Research
7.1 Contributions
7.2 Automating System-Level Tradeoffs
7.2.1 Detection of Tradeoff Locations
7.2.2 Ordering System-Level Tradeoffs
7.2.3 Automated Application of Tradeoffs
7.3 Future Research
Appendix A
Notation
Appendix B
MAHA Usage: Inputs and Outputs
B.1 MAHA Inputs
B.1.1 Node Description
B.1.2 Edge Description
B.1.3 Module Description
B.2 Executing MAHA
B.2.1 Interactive Execution of MAHA
B.2.1.1 MAHA Automatic Operation
B.2.2 Command Line Execution of MAHA
B.3 MAHA Interactive Output
B.4 An Interactive Example
B.5 File Formats
B.5.1 Dataflow Description File
B.5.1.1 Node Description
B.5.1.2 Edge Description
B.5.2 Module Description File
B.5.3 Constraint Description File
B.5.4 MAHA Output File
Appendix C
PLA Loop Counter Area Evaluation
C.1 Product Terms in the PLA for the Register Control Method
Appendix D
PASTA Usage: Inputs and Outputs
D.1 PASTA Input
D.2 PASTA Output
D.3 Notes
D.4 PASTA Example
Appendix E
REG Usage: Inputs and Outputs
E.1 REG Input
E.2 REG Output
E.2.1 Non-pipelined Register Estimation
E.2.2 Pipelined Register Estimation
Appendix F
MUX Usage: Inputs and Outputs
F.1 MUX Input
F.2 MUX Output
Appendix G
Floating Point Coprocessor Description
G.1 Floating Point Coprocessor Value Trace
G.2 Floating Point Coprocessor Park Normal Form Description

List Of Figures

1.1 Area/Time for Serialization
1.2 Tradeoff Analysis Flow Diagram
2.1 Example dataflow graph
2.2 General Overview of MAHA Algorithm
2.3 Example delay operation insertion
2.4 MAHA Critical Path Partitions (pcp)
2.5 MAHA Critical Path Scheduling and Allocation
2.6 Example with Critical Path Partitioned
2.7 Demonstration dataflow graph #2
2.8 MAHA Off-Critical Path ASAP Scheduling and Allocation: Part 1 of 2
2.9 MAHA Off-Critical Path ASAP Scheduling and Allocation: Part 2 of 2
2.10 MAHA Critical Path Stretching
2.11 Initialization of MAHA
2.12 Demonstration dataflow graph #1
2.13 Overall operation of MAHA: Part 1 of 2
2.14 Overall operation of MAHA: Part 2 of 2
2.15 Simple loop
2.16 Loop with mid-graph exit condition
2.17 Loop with mutual exclusion and mid-graph exit
2.18 Loop after simple transformation
2.19 Disconnection of mutually exclusive branches
2.20 Completed loop transformation
2.21 Procedure for Transforming Cyclic into Acyclic Dataflow
2.22 Loop Transformations
2.22 Loop Transformations (cont.)
2.23 Demonstration data flow Graph
2.24 MAHA Results for 2-stage Design
2.25 MAHA Results for 3-stage Design
2.26 MAHA Results for Cheapest Design
2.27 Sehwa Results for Cheapest Design
2.28 Parallel Example
2.29 Temperature Controller
2.30 Multiplier Example
2.31 FIR Filter
2.32 AR Lattice Filter
2.33 Random Graph
2.34 Simple Conditional
2.35 Large Conditional
2.36 Synthesis Results for Small example
2.37 Synthesis Results for Multiplier
2.38 Synthesis Results for FIR Filter
2.39 Synthesis Results for AR Lattice Filter
2.40 Synthesis Results for Random Graph
2.41 MAHA design of AR filter
2.42 Human design of AR filter
2.43 Elliptical Wave Filter
2.44 Elliptical Filter Designs of Several Synthesis Systems
2.45 Sample Schedule of FIR Filter
2.46 Identical Delay/Cost of FIR Filter with Fewer Registers/Muxes
3.1 PLA Finite State Machine
3.2 Computer Runtime
3.3 Two Example Data Paths
3.4 8-state PLA with a register control line shown
3.5 AR Lattice Filter
3.6 Multiplexer control methods
3.7 Different PLA multiplexer control styles
3.8 Internal construction of PLA
3.9 Loop on Status Flag
3.10 Fixed Count Loop
3.11 Variable Count Loop
3.12 PLA with External Counter and Register Decoding for Loop
3.13 Conditional Path Example
3.14 Complex Conditional Path Example
3.15 Data Path of Averaging Operation
3.16 Serialized ADD operation
3.17 Complex Conditional Graph ©1986 N. Park
4.1 AR lattice filter showing cutset
4.2 Example layered network
4.3 Example figure highlighting eligible edges
4.4 Algorithm for DFG transformation/flow lower bound assignment
4.5 Transformation of dataflow showing lower bound on flow
4.6 Transforming conditional graphs for register computation
4.7 Criss-cross
4.8 Modified criss-cross marked with lower bound on flow
4.9 Register values for AR lattice filter
4.10 Graph depicting pipelined scheduling
4.11 Relation between partitions and latency
4.12 Register values for pipelined AR lattice filter
4.13 Example dataflow graph showing scheduling
4.14 Example dataflow graph showing RTL architecture
4.15 Computation of ?/, and
4.16 Pipelined conditional graph
5.1 AR Filter showing Operation Groupings
5.2 8-bit multiplier implemented using 4-bit multipliers
5.3 Example Dataflow Graph
5.4 Serial versus Parallel Implementation
5.5 Incremental Area vs Serialization Quotient for = 100000
5.6 Bit-dependent Operator Area versus Maximum Serialization
5.7 Best Operator Area versus Serialization (multiplexers and registers)
5.8 8-bit Multiply using 4-bit Multiplier
5.9 Small Dataflow Graph
5.10 Random Graph
6.1 Control Path/Data Path Tradeoff Evaluation
6.2 Evaluation of Design Space: Prediction and Synthesis
6.3 AR Filter Area Comparison
6.4 AR Filter Totals
6.5 AR filter design (clock cycles = 10)
6.6 AR Filter Design Region
6.7 Floating Point Coprocessor: Serialize multiply/divide
6.8 Floating Point Coprocessor: Serialize add/sub/cmp/neg
6.9 Floating Point Coprocessor: Serialize all modules
6.10 Predicted Elliptical Filter: Operators Only
6.11 Predicted Elliptical Filter: Total Design w/o wiring
6.12 Temperature Controller
6.13 Module Bitwidth Tradeoffs: Data Path
6.14 Module Bitwidth Tradeoffs: Total
6.15 Floating Point Coprocessor: Operators
6.16 Floating Point Coprocessor: Total
6.17 Floating Point Coprocessor: Modified (Datapath)
6.18 i8251: Parallel designs
6.19 i8251: Serial designs
C.1 Minimal Covering of product terms for Counter

List Of Tables

1.1 Minimum Sensitivity of PLA Complexity versus Area
2.1 MAHA Results for Simple Example
2.2 Modules Used for Synthesis
2.3 Sehwa Results for Simple Example
2.4 Comparison of Synthesis Results: Sehwa
2.5 Summary of Register and Multiplexer Area for Cheapest Design
2.6 Module Library of Adders and Multipliers
2.7 Module Sets Evaluated by MAHA
2.8 Comparison of MAHA results to Human/Random: Area 1
2.9 Comparison of MAHA to Human/Random: Time 1
2.10 Summary of MAHA Validation
3.1 Multiplexer Control Analysis
3.2 Validation of PLA model
3.3 Unrolling versus External Counter for Loop Implementation
3.4 PLA Area Estimation: Basic Model
3.5 PLA Area Estimation: Conditionals and Loops
3.6 Summary of Results for Pipelined PLA
3.7 Comparison of Folded PLA Area
3.8 PLA Area Estimation using Data Path Prediction Tools
4.1 Comparison of register requirements: non-pipelined designs
4.2 Comparison of AR Filter estimated and actual register use
4.3 Comparison of pipelined AR Filter estimated and actual register use
4.4 Pipelined Predicted Data Values
4.5 Comparison of Predicted and Actual Microcycles
4.6 Register prediction: Non-pipelined designs
4.7 Register prediction: Non-pipelined designs
4.8 Register prediction: Pipelined designs
4.9 Register prediction: Pipelined designs
4.10 Multiplexer estimation: Non-pipelined designs
4.11 Multiplexer estimation: Non-pipelined designs
4.12 Multiplexer estimation: Pipelined designs
5.1 Sensitivity of 8 to n and £
5.2 Minimum and Maximal Change in PLA Parameters versus n
5.3 Analysis of Serialization Factor #1 for the Example shown in Figure 5.9
5.4 Analysis of Serialization Factor #2 for the Example shown in Figure 5.10
6.1 Module Library for AR Example
6.2 AR Filter Designs: Predicted
6.3 AR Filter Designs: Actual
6.4 AR Filter Designs: Error Summary
6.5 Non-Pipelined Design Curve Computation Time (sec.)
6.6 Module Library
6.7 Multiplier/Divider Floating Point Bit-Serialization
6.8 Add/Sub/Cmp/Neg Floating Point Bit-Serialization
6.9 Combined Floating Point Bit-Serialization
6.10 Elliptical Filter Module Sets
6.11 Original Elliptical Filter: Predicted
6.12 Original Elliptical Filter: Actual
6.13 Floating Point Coprocessor with ALU
6.14 i8251: Individual designs
6.15 i8251: Parallel Design Combinations
6.16 i8251: Parallel Design Comparisons
6.17 i8251: Serial Design Combinations
6.18 i8251: Serial Design Comparisons
C.1 Comparison of Internal versus External Decoding of Loop Counter
C.2 Summary of Counter Impact on PLA Parameters
D.1 Fixed Values in PASTA

Chapter 1
Introduction

With the lowered cost and broadened functionality of digital systems, an increasing number of primarily non-electronic companies are inserting complex digital chips into their products. In conjunction with this rapid growth, the life cycle of any given digital product is decreasing. Thus, an intense competitive stance is needed to ensure survival in the electronics industry. A company must get to the market first with a new chip; those getting there second may find the market saturated by a competitor or exhausted - the market window was missed.

A consequence of this requirement is that digital systems designers use CAD tools to produce chips which comprise hundreds or thousands of individual predesigned modules. Unfortunately, this is not just a simple matter of manufacturing the first design that is feasible. To produce chips of superior price and performance, a number of candidate chip designs must be analyzed. Exploration of this design space in digital systems can be very expensive in computer run-time. Further, there are portions of the design space which are not even considered by current CAD programs; thus, it is the intuitive judgment of the human designer - a good designer - that navigates the design through these areas.

One important aspect of design, partitioning a behavior at the system level into control and data path structures, is currently performed solely by human designers.
Expert designers understand the tradeoffs between these two structures, but have no tools for locating good partitions. Even experienced designers may choose inferior designs due to the enormity of the design space and the interaction between the data path and control path structures.

The research presented here is directed towards resolving the problem of applying control path/data path tradeoffs and evaluating design decisions against a set of global constraints. Both experienced and inexperienced system designers will be capable of realizing superior designs; in addition, expert designers have a technique for evaluating design changes at the system level. A methodology for exploring the design space to locate superior designs quickly using these tradeoffs and evaluation tools will be described. The remainder of this chapter details the problem and the approach taken.

1.1 Problem Statement

In the design of large electronic systems, particularly those which utilize a central processor, partitioning to separate the hardware, firmware, and software subsystems is usually based upon the experience of human designers. Given a human-dominated design environment, high-level partitioning into these subspaces is economical and necessary to completing a design in a timely manner; this separation allows each to be developed fairly independently of the others. The quality of the initial partitioning drives the final result since design options are reduced at an early stage. Tradeoffs between the control path and data path functionality performed early in the design process to produce the structure, henceforth referred to as control path/data path tradeoffs, are crucial to producing good designs. When performed at an early stage, control path/data path tradeoffs allow major decisions to be evaluated at far less computing cost than at a later stage.
For example, the decision whether to develop a pipelined design can be made at the system level, but may be impossible to change once actual synthesis has started at a lower (RT (Register-Transfer) or circuit) level. Even if such major backtracking is feasible, a substantial part of the design would have to be discarded with a comparable increase in design time and potentially wasted resources.

The research suggested here is intended to fill a gap in the design process and subsequently improve the RT-level design produced. To be viable, the method used should locate the globally superior design region quickly without resorting to exhaustive search. Current design tools are inadequate as they do not explore all regions of the design space even if considerable computing resources are expended. For example, in the AR Lattice Filter shown later, there are 16 multiply and 12 addition operations. Using a small module library of only three elements (multiplier, adder, and multiply-adder pair), there are 256 possible structures at the system level (not achievable using current synthesis tools) resulting in over 3000 feasible RT-level designs. The enormous detail present at the RT-level spawns tradeoffs of a local nature since global evaluation is too costly. Clearly, a set of tools for eliminating inferior structures at the start would enhance the design process by dramatically reducing the number of RT-designs to fully construct and evaluate later.

One approach for exploring designs at the system level is transforming the behavior to produce different structural representations. An early effort which discusses global transformations is found in Snow’s dissertation [Sno78]. Snow showed how analysis of behavior provides an implementation-independent environment for performing some control path/data path tradeoffs.
A collection of translation templates which operated upon the data path was derived from the experience of optimizing compilers. As a practical demonstration, some of the transformations were applied to two subsets of the PDP-11 processor to show the resultant improvement.

There are also many other types of transformations which operate upon dataflow graphs and alter their internal behavior. Tree height reduction [Tri85] and loop winding [Gir87] are two examples. Other compiler optimizations such as strength reduction and loop unwinding are also candidates [AUS79].

A typical synthesis engine produces alternative implementations from the same behavior. For example, some engines provide a range of designs from the most parallel to the most serial, which changes the area of the controller and data path hardware. A decision whether to use a pipelined or non-pipelined approach is another high-level tradeoff.

To more closely describe the effort, the nature of control path/data path tradeoffs as they are treated in this research will be defined. At the system level, a design can be viewed structurally as a control path operating data path hardware including routing and storage. Design information present at an early stage which affects system-level tradeoffs includes
• the behavioral description,
• user goals and constraints, and
• available (hardware) modules.

A behavioral description is an abstract view of the system in terms of operations and the values produced and consumed; this pure form lacks any structural or physical attributes. In addition, there are a number of different gauges by which to measure the quality of a design, among which are
• circuit operation time,
• circuit area,
• pin count,
• test coverage, and
• power consumption.
Describing the interaction within the control path and the data path as well as between the control and data paths is fundamental to this research, with the goal to locate superior implementations. Emphasis is placed upon characterization of this separating line between the two design subspaces. Since this “line” is a direct consequence of individual data path and control path architectures which are governed by system-level constraints, an understanding of each subspace is also important. Tools which separately address data path and control path design will be necessary to evaluate system-level partitioning; such tools become an integral part of the research.

1.2 Motivation

An objective of the research is to produce techniques which accept a behavioral description, perform user tradeoffs at a high level, evaluate those tradeoffs with respect to global constraints, and produce a system-level partitioning between the control path and data path which results in better designs. The boundary between the control path and data path is varied through the use of tradeoff templates to locate superior designs. (A superior design is one whose structure achieves or surpasses the goals while meeting constraints imposed by the design specification.)

The foremost design parameters considered in this thesis are time and area. Note that other parameters may be chosen such as power consumption or testability when performing tradeoffs; this thesis does not address them. A designer either strives to minimize time with respect to an area constraint or minimize area with respect to a time constraint. Based upon these two primary factors, items related to circuit design are
• operator area and time,
• routing logic and storage area and delay,
• controller area and time, and
• wiring area and delay.
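
The two optimization modes named above (minimize time under an area constraint, or minimize area under a time constraint) can be sketched in a few lines. This is only an illustrative sketch: Python is used for convenience, and the `Design` record, the function names, and the candidate numbers are hypothetical and do not come from the thesis or from any ADAM tool.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Design:
    """Hypothetical summary of one candidate design (arbitrary units)."""
    name: str
    area: float  # total area: operators + routing/storage + controller + wiring
    time: float  # total execution time


def fastest_within_area(designs: Iterable[Design], max_area: float) -> Optional[Design]:
    """Minimize time subject to an area constraint; None if nothing fits."""
    feasible = [d for d in designs if d.area <= max_area]
    return min(feasible, key=lambda d: d.time, default=None)


def smallest_within_time(designs: Iterable[Design], max_time: float) -> Optional[Design]:
    """Minimize area subject to a time constraint; None if nothing fits."""
    feasible = [d for d in designs if d.time <= max_time]
    return min(feasible, key=lambda d: d.area, default=None)


# Made-up candidates, ranging from most parallel to most serial.
CANDIDATES = [
    Design("parallel", area=900.0, time=4.0),
    Design("mixed", area=600.0, time=7.0),
    Design("serial", area=450.0, time=12.0),
]

if __name__ == "__main__":
    print(fastest_within_area(CANDIDATES, max_area=700.0))
    print(smallest_within_time(CANDIDATES, max_time=8.0))
```

Both functions assume the area and time of every candidate are already known; in practice those figures would have to come from estimation or synthesis.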
As an example, a given behavior can usually be implemented by a large amount of hardware which operates quickly or a small amount of hardware that must be shared serially to complete the task. Effects of this “serialization” upon the control path and data path area are shown in Figure 1.1. The area of the data path drops with increasing serialization. However, since the controller has more states as well as increased routing and storage hardware, its area grows. The total design area, which is the sum of area for the data and control path, reaches a “boundary point” in the area/time curve where the area is a minimum. Designs greater in time than the “boundary line” at the point x_0 are also larger and thus inferior to solutions lower or equal in time to the boundary point.

Figure 1.1 clearly shows that control area cannot be ignored, although its effect is governed by the size and complexity of the data path. The figure also shows that there is a relation between the amount of serialization and both the control path and data path area. Applying control path/data path tradeoffs relies on the development and verification of accurate control path and data path estimation models.

[Figure 1.1: Area/Time for Serialization. Curves: Total, Data Path, Control Path; horizontal axis: Time, marked at x_0; annotations: boundary point, boundary line.]

1.3 Problem Approach

The focus of this research is upon describing control path/data path tradeoffs for partitioning a circuit description and producing a methodology for evaluating these tradeoffs. Starting with a behavioral description in the form of a dataflow graph and a set of constraints, a circuit structure has been found which is capable of producing the best designs. Steps in this structure’s construction include the data path, routing and storage, and the control path.

1.3.1 Data Path Analysis

There are many aspects of data path synthesis which affect circuit area and time.
Those related to system-level tradeoffs are readily defined. One task of data path synthesis is partitioning of a behavioral description (usually in the form of a dataflow graph) into timesteps. This offers a tradeoff between operator sharing which reduces the cost of the data path and the larger execution time which results.

Second, the type of data path synthesis, or design style, has two major categories which are specified at the system level: pipelined and non-pipelined design. Both design styles will be considered, although a larger focus will be upon non-pipelined designs.

Module selection forms a third category. There are numerous issues among which are ranking of acceptable modules (area and time), generation of modules, bitwidth considerations, and influence of shared operators on the selection process.

Finally, module allocation is distinct from module selection. After module selection, the type of module bound to a given operation has been chosen, but not the quantity of modules or sharing of the specific hardware with other operations. Wiring and routing/storage delays are two often ignored factors in module allocation. In this research, system partitioning will not consider second-order aspects of module sharing such as fan-in/fan-out (other than multiplexer allocation) or placement proximity; module speed, area, and sharing will be considered.

Module selection, bit-serial operators, and timestep partitioning will be addressed in this thesis. Although an important factor, design style selection is beyond the scope of this thesis.

1.3.2 Routing and Storage Analysis

Signal routing and storage requirements must be considered to evaluate the effects of applying control path/data path tradeoffs. Signal routing is accomplished through buses or multiplexers; storage is performed by registers.
The use of multiplexers, buses, and registers can play a significant role in determining the quality of a design. Since area is the dominant contribution of routing and storage hardware to a design, analysis will concentrate on this parameter. Time delay is of lower priority since its effect is essentially uniform over a wide range of designs (i.e. a fixed contribution to each time step).

The effect of a tradeoff upon wiring area depends upon the transformation and the circuit size. Module selection may have some effect upon wiring between modules, and impacts internal module wiring. However, for typical standard cell and random logic designs, local changes have small impact on the global wiring area. Since a general model for standard cell wiring area exists [KP86], one can assess the overall impact of system partitioning upon wiring area.

1.3.3 Control Path Analysis

In contrast to data path synthesis, little has been written about control path synthesis in recent years. This is primarily due to the regular structures in control synthesis (both PLAs and microcode) which have been heavily researched in the past; the problem is considered solved. Current control path literature concentrates upon either minimizing the product terms in PLAs or construction of microcoded and nanocoded controllers.

At present, PLAs are the predominant method for constructing controllers. Automated synthesis tools for PLAs have been in widespread use for several years. However, a theoretical analysis of the area consumption of PLA state machines has yet to be published. Basic analysis will consider the following:
• describing the PLA area with respect to its parameters (inputs, outputs, and product-terms),
• theoretically bounding and/or estimating the number of product-terms,
• state machine size effects,
• register and multiplexer operation effects,
• conditional path effects, and
• fixed and variable loop effects.
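
To make the first item in the list above concrete, the following sketch uses a common first-order textbook approximation for an unfolded PLA: the AND plane has two columns per input (true and complement), the OR plane has one column per output, and each product term is one row. This is not the model developed in this thesis; the function names and the state-encoding assumption (minimum-width binary state codes fed back through the PLA) are illustrative assumptions only.

```python
def state_bits(num_states: int) -> int:
    """Bits needed for a minimum-width binary state encoding (assumed)."""
    if num_states < 1:
        raise ValueError("a state machine needs at least one state")
    return max(1, (num_states - 1).bit_length())


def pla_core_cells(inputs: int, outputs: int, product_terms: int) -> int:
    """First-order cell count of an unfolded PLA core.

    AND plane: 2 * inputs columns; OR plane: outputs columns;
    each product term contributes one row across both planes.
    """
    return (2 * inputs + outputs) * product_terms


def fsm_pla_cells(num_states: int, test_inputs: int,
                  control_outputs: int, product_terms: int) -> int:
    """Cell count when the state register is fed back through the PLA."""
    bits = state_bits(num_states)
    return pla_core_cells(test_inputs + bits, control_outputs + bits, product_terms)


if __name__ == "__main__":
    base = fsm_pla_cells(num_states=5, test_inputs=1, control_outputs=4, product_terms=6)
    grown = fsm_pla_cells(num_states=5, test_inputs=1, control_outputs=4, product_terms=7)
    # Under this approximation one extra product term costs one full row.
    print(base, grown, grown - base)
```

Under this approximation, adding one product term adds a full row of width 2 * inputs + outputs, and adding one output adds a full column whose height is the number of product terms, so the cost of a single extra term grows with the size of the existing PLA.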
The PLA area is a direct consequence of the data path requirements. Hence, predicting the PLA area with the data path parameters is one important goal of the research. State machine size as well as register and multiplexer control influence PLA area; these effects will also be modeled. Conditional paths and loops introduce complex but necessary extensions. Finally, the model will be extended for folding and data path design style considerations.

Table 1.1: Minimum Sensitivity of PLA Complexity versus Area

                         Impact of Increasing Term by One
                         Output Term          P-term
States   Area (mil²)     Δ %     Δ Area       Δ %     Δ Area
2        11.8            10.0    1.2          6.1     0.7
5        27.8            5.5     1.5          5.9     1.6
10       51.0            4.1     2.1          5.0     2.6
20       121.8           2.7     3.2          3.4     4.1
50       459.4           1.5     6.7          1.7     7.8
75       932.3           1.0     9.5          1.1     10.3
100      1518.4          0.8     12.4         0.9     13.1
150      3158.7          0.6     18.1         0.6     19.0
200      5345.9          0.4     23.8         0.5     24.7
300      11576.5         0.3     35.3         0.3     36.4

One difficulty with a centralized PLA controller is that its size grows until it has the dominant area in a design. Furthermore, a point is reached where a small increase in control operation results in an enormous impact on the PLA area. Expanding control using an existing PLA would entail increasing the number of output lines and/or the product terms and possibly the input lines. In the best case, just one of these parameters would be increased. Table 1.1 reveals the effect of increasing a term by one as the 2 µm CMOS PLA size increases. Although the increased output and product-term area effects as a percentage of the total shrink with increasing PLA size, raw area rises at a much higher rate. For example, increasing a state machine from 200 to 201 states consumes nearly as much area as doubling the states of a 5-state PLA or building a completely separate 4-state PLA controller.
Furthermore, the layout may ultimately become too large to use efficiently since either the power bus must be enlarged or a longer clock cycle time must be accepted or both. Large PLAs also provide less flexibility for the floorplan when combined with the data path on a single chip. These results suggest that partitioned or distributed controllers might be useful in large and/or complicated designs. Indeed, this approach has been used in the Motorola 680x0 and Intel 80x86 lines of microprocessors.

In data path synthesis, large blocks can be partitioned into smaller functional blocks to produce a more efficient layout. This concept will also be applied to the PLA controller for specific operations such as register and multiplexer control as well as loop constructs; some aspects of the control will be handled by external hardware to reduce the overall area and keep the control blocks manageable. Large PLAs are often much slower than the hardware they operate. Thus, breaking the controller into smaller functional blocks may also prove useful in improving the speed of the controller.

In summary, there are two observations about the PLA area. First, any additional functionality would, at minimum, impact the number of outputs or product-terms. Additional terms cause the PLA to grow in area at a non-linear rate. Second, and more important for system-level partitioning, this relation reveals that neglecting control costs when partitioning a given dataflow graph into smaller timesteps (resulting in potentially less data path hardware but more states) may not produce the best designs.

1.3.4 System-Level Partitioning

System-level partitioning, as defined in this thesis, is strictly data path/control path partitioning. Tradeoffs are performed to explore a larger portion of the design space and realize implementations that are unobtainable through a direct interpretation of the behavior.
Thus, a design which better meets the user objectives may be achieved.

In this thesis two global methods are explored for performing system-level partitioning: analytical and empirical. Despite the lure of a purely analytical approach for obtaining an optimal solution, a complete and accurate analysis is exceedingly difficult, although some progress is reported here on some data path/control path tradeoffs. By contrast, the empirical approach proves not only to be adaptable to any transform type, but also can be controlled intelligently to yield results rapidly. A flow diagram of this second method is shown in Figure 1.2.

[Figure 1.2: Tradeoff Analysis Flow Diagram. Blocks: Behavior, Constraints, Apply Transform, Datapath Estimates, Routing/Storage Estimates, Control Estimates, Design Curve, Datapath Synthesis, Routing/Storage Synthesis, Control Synthesis, Final Designs.]

The tradeoff procedure relies on both the presence of quick estimators and also a complete set of synthesis tools. Estimators are used to bound an area/time region in the design space which encompasses the design objectives. Should the estimates appear promising, synthesis tools are targeted to produce designs in this smaller region. Therefore, it should be possible to perform a number of tradeoffs in a timely manner.

There are many types of tradeoffs which affect the ultimate performance of a circuit. The classes of tradeoffs having the most impact on the overall design are
• bitwidth tradeoffs,
• decomposition tradeoffs,
• composition tradeoffs, and
• distributed control.

For bitwidth and decomposition tradeoffs, such as bit serial operators or use of shift-add circuitry to perform a multiply, increased serialization or simplification of the data path may adversely affect the number of control states and, hence, the control area.
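
The shift-add decomposition mentioned above can be sketched as follows. This is a minimal behavioral illustration, not a description of any circuit in this thesis: the function name and the `steps` counter are hypothetical, and the sketch assumes one partial product is handled per control step.

```python
from typing import Tuple


def shift_add_multiply(a: int, b: int, width: int = 8) -> Tuple[int, int]:
    """Multiply two unsigned `width`-bit values with one adder and a shifter.

    Returns (product, steps), where `steps` counts the sequential
    iterations a controller would have to sequence (assumed: one per bit).
    """
    limit = 1 << width
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands must fit in the given bit width")
    product = 0
    steps = 0
    for bit in range(width):
        if (b >> bit) & 1:
            product += a << bit  # add the shifted multiplicand
        steps += 1               # every bit position costs one time step
    return product, steps


if __name__ == "__main__":
    print(shift_add_multiply(13, 11, width=8))
```

A single-step parallel multiplier is replaced here by an adder and a shifter that are reused `width` times, which is the kind of data path simplification that is paid for with additional control states.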
Composition tradeoffs such as ALU substitution increase the potential for sharing data path hardware at the expense of additional ALU control lines. Finally, distributed control divides a centralized controller into smaller slave controllers which may reduce the total control and wiring area as well as offer independent synchronous operation. This is offset by a data path area increase since sharing to the same degree might no longer be possible.

System-level tradeoffs allow the designer to explore design alternatives which enable a structural decision to be made quickly and correctly. However, such tradeoffs require an accurate model of both the data path and control path area. Currently, there is a model available for predicting data path area, including wiring area, which will be utilized [KP84]. However, an area estimation model for PLA controllers did not exist when this research began. Thus, one was developed. In addition, the predictors lack register and multiplexer estimation; these models were also formulated.

Another parameter to consider in evaluating tradeoffs is data path and control path delay. If it is assumed that a controller is faster than the hardware it is operating, which is usually the case for small designs, then this problem simply reduces to bitwidth effects on data path delay. For larger controllers, control delay may be a factor. However, such work is beyond the scope of this thesis, and has since been addressed by others within the ADAM group.

1.4 Evaluation Tools within ADAM

To evaluate tradeoffs, utilities for both synthesizing and estimating a complete design from a behavioral description must be available.
The different parts of a design such as the control path, routing/storage and data path are constructed using different and, unfortunately, unrelated tools; these must be completely integrated into a single coherent package to accomplish system-level tradeoffs.

1.4.1 SLIMOS: Module Set Selection

Prior to estimation or synthesis, module styles which will implement the various functions in the dataflow graph are selected. The module styles chosen from a library of candidates are driven by the behavior (dataflow graph) and the constraints. SLIMOS selects modules for pipelined design using a theoretical area-time lower-bound model [JPP88]. Although SLIMOS is targeted for pipelined designs, it has also been able so far to construct the best module sets for non-pipelined designs. A theoretical model for module set selection of non-pipelined designs has yet to be published.

1.4.2 MAHA: Non-Pipelined Data Path Synthesis

Data path synthesis is concerned with timestep partitioning and allocation of pre-defined hardware modules to a dataflow graph. Typically, these functions are constrained by cost at one boundary and speed at the other. Early in the design process, it is desirable to explore the possible global design space, using the dataflow graph and fixed module styles, to find the superior solution space.

MAHA was written to provide the designer (automated or human) with scheduling and allocation solutions given a dataflow graph, a module library, and area and/or time constraints [PPM86]. An initial linear module assignment and scheduling of critical path operations is followed by a cost-based assignment using the concept of freedom. Freedom is a measure of the time range during which operations (off the critical-path) can be scheduled consistent with the timing constraint. Operations with the least freedom (tightest fit) are scheduled first.
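
The notion of freedom can be illustrated with the usual as-soon-as-possible/as-late-as-possible formulation, in which the freedom of an operation is the difference between its latest and earliest start times under the timing constraint. This sketch is an assumption-laden illustration, not MAHA's implementation: the function names, the dictionary-based graph encoding, and the example graph and delays are all hypothetical.

```python
from typing import Dict, List


def topological_order(preds: Dict[str, List[str]]) -> List[str]:
    """Order operations so every predecessor precedes its successors."""
    order: List[str] = []
    placed = set()
    remaining = set(preds)
    while remaining:
        ready = sorted(op for op in remaining
                       if all(p in placed for p in preds[op]))
        if not ready:
            raise ValueError("dataflow graph contains a cycle")
        order.extend(ready)
        placed.update(ready)
        remaining.difference_update(ready)
    return order


def freedom(preds: Dict[str, List[str]], delay: Dict[str, int],
            deadline: int) -> Dict[str, int]:
    """Freedom of each operation = latest start (ALAP) - earliest start (ASAP)."""
    order = topological_order(preds)
    succs: Dict[str, List[str]] = {op: [] for op in preds}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)
    asap: Dict[str, int] = {}
    for op in order:  # earliest start: after all predecessors finish
        asap[op] = max((asap[p] + delay[p] for p in preds[op]), default=0)
    alap: Dict[str, int] = {}
    for op in reversed(order):  # latest start: before any successor must start
        alap[op] = min((alap[s] for s in succs[op]), default=deadline) - delay[op]
    slack = {op: alap[op] - asap[op] for op in order}
    if min(slack.values()) < 0:
        raise ValueError("timing constraint cannot be met")
    return slack


# Hypothetical graph: two multiplies feed an add chain; a3 is off the critical path.
PREDS = {"m1": [], "m2": [], "a1": ["m1", "m2"], "a3": [], "a2": ["a1", "a3"]}
DELAY = {"m1": 2, "m2": 2, "a1": 1, "a2": 1, "a3": 1}

if __name__ == "__main__":
    slack = freedom(PREDS, DELAY, deadline=4)
    # Least freedom first, ties broken by name.
    print(sorted(slack, key=lambda op: (slack[op], op)))
```

In this example the critical-path operations have zero freedom and would be placed first, while `a3` can start in any of three time steps and is considered last.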
The program either minimizes time subject to an area constraint, minimizes area subject to a time constraint, or provides a set of feasible solutions.

Extensions to MAHA include resource-sensitive scheduling, register allocation (worst case), and multiplexer allocation to produce a more accurate area estimate. In addition, delay operation insertion was added to widen the design search space, and local timing constraints can be entered.

1.4.3 Sehwa: Pipelined Data Path Synthesis

Sehwa is a synthesis program for producing pipelined designs which uses the same inputs as MAHA. Sehwa performs functional pipelining, which means that there is no physical stage which corresponds to a logical grouping of operations. The program handles conditional branches and considers the effects of register assignment; it also alters its schedule dependent upon the degree of resynchronization expected in the design. The underlying theory and implementation of Sehwa are given in [PP88].

1.4.4 HAPPE: Data Path Estimation

HAPPE (High level Area Performance Prediction and Estimation) consists of two tools which predict the area and delay of the data path for completed designs given the module set and dataflow graph information. PSADNP is used to obtain non-pipelined estimates; PSADP is used for pipelined estimates. The underlying theories of the pipelined area/time prediction [JPP87] and non-pipelined area/time prediction [JMP88] are given elsewhere.

1.4.5 MABAL: Multiplexer and Bus Allocation

MABAL allocates the routing and storage hardware for either a pipelined or non-pipelined design [KP90]. This program uses a cost-driven heuristic to allocate multiplexers and/or buses for interconnecting operators and registers. MABAL recognizes operator commutativity, which allows it to reduce the size of multiplexer trees or produce smaller buses.
Starting with the schedule produced by a synthesis program, MABAL performs register, bus, and multiplexer assignment. (An algorithm which can produce optimal register assignment in polynomial time was incorporated into MABAL [KP87].) The cost (area) of sharing registers and operators is compared against the actual routing hardware area and a projection of the wiring area; the cheapest approach is always taken. The output of MABAL is an RT-level design which could be used as an input for a logic/layout synthesis tool.

1.4.6 Berkeley PLA Synthesis Package

There are many methods for building a controller, of which random logic, timing generators, microcode, and PLAs are the most popular. For larger designs, microcode and PLAs are favored due to their regular structure and the availability of design tools to implement them. Currently, PLAs are increasing in popularity among designers. Since more complex VLSI chips are currently being produced, there is often more than one controller to allow multiple tasks (such as memory management and prefetch) on the same chip. Smaller distributed controllers favor PLA implementation over microcontrollers when cost is to be minimized. Consequently, the thesis will focus upon controllers implemented by PLAs, although the area and time tradeoffs for microcode are expected to be similar.

The Berkeley CAD Toolbox has a number of programs to assist in the design of a PLA. At the top level is PEG (PLA Equation Generator), which accepts a high-level state machine description as input and produces a set of state equations useful to other tools to implement a controller [Ham83]. The high-level state machine grammar accepted by PEG is sufficient to describe any synchronous controller; programs created are isomorphic to Moore state diagrams.

EQNTOTT is a program which accepts state equations and derives the sum-of-product terms used in PLA design.
By piping the output of PEG into EQNTOTT, the PLA “personality matrix” is generated. To produce a heuristically optimized PLA (minimum number of product terms), the output of EQNTOTT is passed through ESPRESSO. Ultimately, MKPLA accepts this personality matrix and produces a layout of the PLA. A tool was written for this thesis research which extracts the area from the PLA layout for validation of the PLA models.

1.4.7 PASTA: PLA Control Area Estimator

Rather than synthesizing a PLA, it is more useful to obtain a quick area estimate when a number of designs need to be explored rapidly. The Berkeley tools have proved to be adequate for small examples, but their run time grows exponentially with the number of controller steps. A survey of the literature did not reveal any such estimation tool, so one was constructed as part of this thesis.

Control area estimation specific to PLA state machines has been extensively analyzed here. Several parameters which determine the size of an unfolded PLA have been identified including

• timestep partitions (states),
• register control,
• (conditional) test inputs,
• conditional branches,
• multiplexer control signals, and
• control loops (fixed and variable).

An algorithmic model called PASTA (PLA Area eSTimator Algorithm) which determines the area of a general PLA state-machine controller is discussed in Chapter 3. This model encompasses both pipelined and non-pipelined design and has been extended for prediction of folded PLAs.

1.4.8 PLEST Data Path Area Estimator

PLEST estimates the area of a standard cell layout from the number of cells (modules), number of nets, and average wire length [KP86] [Kur87]. Although use of PLEST allows the designer to consider another aspect of design which contributes area, there is an issue of design time versus utility.
During the initial system-level evaluations where a wide variety of designs are attempted (which is likely to result in wide area changes), a precise wiring estimate may not be necessary. Behavioral descriptions on the order of 100 operators take 30% longer for data path analysis when PLEST is used versus the sole use of MAHA. Hence, execution of PLEST may only be warranted for discrimination between designs having similar areas.

1.5 Related Work

In this section, a survey of research related to control path/data path tradeoffs is presented. Extensive search has uncovered no references in the specific area of interest; however, there are several published works in the area of hardware/firmware/software tradeoffs that are closely related and will be reviewed. In addition, since the foundation of the proposed research relies on robust tools for data path and control path analysis, an overview of these areas is included first.

1.5.1 Data Path Synthesis

Over the past decade, there has been broad emphasis on automated design of data path hardware, more aptly described as data path synthesis. (This is also referred to as high-level synthesis.) Data path synthesis starts with a behavioral description in the form of a dataflow graph to produce the data path hardware in two steps: event scheduling and resource allocation. Event scheduling partitions the dataflow graph into one or more clock cycles of some fixed length (for non-pipelined synthesis). Resource allocation assigns hardware modules to operators with an emphasis upon resource sharing; the amount of sharing depends upon the particular design and the event scheduling.

Since optimally synthesizing data path hardware based upon a set of constraints is thought to be an NP-complete problem [KT83, Tho86], a variety of approaches have been published which achieve good designs in a reasonable amount of time.
These range from a purely heuristic approach for allocation such as EMUCS [McF81] to a formal approach for scheduling and allocation which uses mixed integer-linear programming [HP83]. Purely heuristic approaches suffer from being overly rigid. Although they typically perform the synthesis task quickly, the resulting designs can be far from optimal due to some aspect of the design that was missed. For even the small examples cited in literature, a human designer could do better.

At the other end of data path synthesis is a rigorous formal definition which can be used to produce optimal designs. Early work by the author was directed at describing and synthesizing register-transfer (RT) designs [PKM84]. Peripheral research related to Hafer’s formal approach revealed that even for small problems, producing optimal designs can be expensive [HP83, HP81]. Optimal data path synthesis involved solving a mixed integer-linear programming problem where the matrix size grows roughly at $O(n^3)$, where $n$ is the number of operations in the data path. (According to Hafer [HP83], the number of relations grows at $O(hn^2)$, where $h$ is the number of hardware elements. In the case where every operation gets a hardware element, $h = n$.) Clearly, there is a limit on the size of designs which can be performed using this method. An overview of other data path synthesis programs is published elsewhere [PH87].

1.5.2 Control Path Synthesis

In comparison to data path synthesis, research in control path synthesis is limited. Control path design is a less difficult problem since its characteristics depend closely upon the data path configuration. Furthermore, controllers are generally either regular structures such as PLAs or standard architectures such as a microengine, which further reduces the design complexity.
During early research efforts, primary emphasis was placed upon automatic generation of microsequence controllers for data path hardware [Nag80, NP81, NCP82]. In Nagle’s research, a control graph is formed by analyzing an ordered acyclic dataflow graph which has been preallocated (hardware assigned to each operation node), but without any specific timeslot specified. A collection of routines reduces the control graph to a minimal (not necessarily optimal) microprogram through two methods: autonomous reduction and attraction weights. Autonomous reduction evaluates the current control graph and selects the node with the highest weight (maximal state overlap) to be assigned a separate field in the microcode; the iterative process terminates when the microcode word width increases beyond some specified limit. This task is accomplished in linear time with respect to the control graph. Attraction weights are used to determine sharing in $O(n^2)$ time; control nodes which can be executed in the same partition are grouped during successive iterations until no further sharing is possible.

A more general approach to controller design appeared in a thesis by Baldwin [Bal84]. An expansion of the concepts discussed in Leive’s thesis [Lei81] pertaining to automated logic synthesis and module selection is the basis of Baldwin’s dissertation. Baldwin formulated a controller design language to describe the operation of a circuit which is subsequently transformed into a control structure using a rule-based system. Ultimately, the controller is synthesized. Problems specific to controllers such as loops and conditional branches were discussed and implemented in his system.

Although Baldwin’s transforms applied to the control structure are generally useful, experimental results using TTL components were marginal.
A variety of synthesized designs were compared against a small number of human-produced controllers using the same components. Even the most inexperienced of the human designers generated a circuit far better than that synthesized by Baldwin’s heuristics. Synthesis of a PLA or microcode controller is difficult using this approach since it relies upon a library of control components which can be allocated to the control graph in much the same manner as data path synthesis. This method is not viable for the research presented here.

More recent control path synthesis concentrates upon PLA design. To date, all of the known available (non-commercial) tools in this area are derivatives of the Berkeley CAD system for PLA design [Ham83]. Since a large part of the preliminary research pertains to PLA area estimation, which was validated using the Berkeley synthesis tools, further evaluation of the Berkeley tools is presented in Chapter 3.

1.5.3 Synthesis considering both control and data path analysis

At a higher level in the design process, interactions between the control path and data path are considered. An early paper that addressed the problem of control path/data path tradeoffs is the Architecture and Design Assessment System (ADAS) for digital signal processors [FSC84]. Using a directed graph to describe the behavior, ADAS allows the designer to evaluate performance and cost of a design using a top-down approach. A high-level design is first partitioned into major functional blocks by the designer, who can verify (using Petri Net simulators) whether the graph appears to be performing the desired function properly. Nodes in the graph are successively decomposed by the designer and performance/cost analyzed through design tools.
These tools provide the user with information about the current design at each step; however, it is the designer who decides what action should be undertaken to improve the result.

The impact of design changes upon the control path and data path is assessed in the ADAS module binding package. Initially, a set of hardware resources are explicitly mapped onto data path operations by the designer; these can be rearranged to improve sharing while determining the impact on the controller. Conversely, the control path architecture could be modified to analyze alternative data path designs to reflect the varied hardware allocation. Both steps are performed so as to minimize cost without exceeding the original constraints.

Although ADAS utilizes algorithms and heuristics to assist in the design process, a human operator must make all of the decisions. Design decisions are still required in the system described here. However, such interaction can be confined to the top level of the design with lower level predictors and synthesis tools performing the lower level tasks automatically. Furthermore, the set of transformations described here could be automated in some future system to produce superior combined control path/data path designs.

In Marshall’s PhD dissertation [Mar86], a more automated system is presented. Marshall constructed a system which produces a board level design from a high-level description. Starting with a system description in OCCAM [Ltd85], the OCCAM intermediate code generated is used for simulation and synthesis. Marshall focused upon a “microcontroller” implementation where a predefined set of I/O modules is available to be attached to a Z-80 microprocessor with some RAM and ROM. These I/O modules are allocated as needed and software is generated for the Z-80 to handle the additional hardware.
Marshall’s work is unique in synthesizing both hardware and software at the board level. Some software is part of a pre-defined executive for communicating with I/O devices; however, the overall system (as in his voltmeter and cash register examples) is custom generated by the synthesis engine. Although Marshall requires a minimal transputer as the structural base, in order to achieve better synthesized designs, he will eventually have to consider both hardware and firmware (CPU with software) analysis with no predefined and restrictive macro structures.

The large scope of combining data path and control path synthesis becomes apparent in Casavant’s dissertation on design automation [Cas85]. Casavant presents a working system which synthesizes both control path and data path elements from a behavioral description. Initially, functional units are assigned to the behavior which are then scheduled into microinstructions. After scheduling is complete, registers are assigned heuristically to minimize area. Microcode generation and multiplexer allocation form the last steps of the process.

Although Casavant has produced an impressive number of synthesis tools, the scope of each is limited to a small set of user-controlled algorithms. In fact, all actions performed by the system are done under user control; no attempt was made to find a good solution automatically. In addition, the initial scheduling of the system is fixed and cannot be modified; backtracking is not directly supported. Resource changes, which are dictated solely by the user, will invoke a new schedule, hardware/register assignment, and microcode design. Minimizing time subject to a maximum cost (area) constraint is the goal of all of his synthesis tools.
Casavant’s work is well-suited to an interactive workstation as major decisions such as hardware availability, register usage, bus resources, and microcode word sizes must be provided by the user. Clearly, an experienced designer is necessary to achieve the full potential of the tools, or to achieve good designs at all. An automated design manager would be a valuable and useful extension of this work.

Another recent work which focuses upon high-level tradeoffs between the control and data path is the CAMAD system [Pen86]. A behavioral description is transformed into a structural representation through synthesis of both the control and data parts. The standard data flow digraph is used for data representation whereas an extended timed Petri Net model is used for control. In the initial construction, each node in the data path has a corresponding node in the control graph to which it is linked.

For the actual control path/data path tradeoff, Peng treats the two parts as a single graph which includes both C-edges (control) and D-edges (data). Edge “costs” are assigned based upon frequency of execution (from simulation or analysis) with arbitrary weighting ($W_d$ and $W_c$). Graph partitioning (Kernighan and Lin) is performed until sufficiently small modules are realized. Circuit area and performance are directly influenced by varying the weights, $W_d$ and $W_c$.

Peng’s initial work appears very promising in that it combines both control and data path parts into a single model. However, this major salient feature is also its weakness as even small designs are very complex to represent. Optimizations on the data path are performed simultaneously with the control path, making design changes more expensive computationally and potentially introducing “design oscillations”. Interestingly, Peng’s partitioning of both the data path and control path results in numerous localized controllers.
It was not determined in his research whether widely distributed control is more cost effective than fewer and more centralized PLAs or microcode controllers.

1.5.4 High-level hardware/firmware/software partitioning

A thorough search of the literature reveals sparse research in the area of system-level partitioning. One aspect that is closely related is hardware/firmware tradeoffs. Early analysis of hardware/firmware/software tradeoffs focused upon computer design [BD71, Man72]. Both Barsamian and Mandell discussed hardware/firmware/software tradeoffs as related to processor and system design. In particular, analysis centered on (at that time) a fairly new method called microprogramming. The authors saw the microprogrammable processor as a vehicle to increase hardware complexity. (The limitation of hardware complexity was due to the only type of controller then available for small processors: random logic.) Increased hardware capability reduces external software requirements.

Although the specific topic is now dated, some of the concepts discussed are not. Mandell argues that the “outward trade” of functions (distributed control) is expensive but may be worth the increase in cost. Also, the amount of hardware required should be adjusted to achieve a specific performance goal which initially might appear infeasible. In essence, a tradeoff between hardware and firmware/software is performed so that constraints of cost and speed are met. Barsamian states that machines need not be designed by “the faster, the better” rule. Rather, tradeoffs can allow the use of slower hardware if the central processor can execute more complex operations locally. A computer model (MAMO) which considers usage of major components (CPU, memory, I/O, etc.) is presented; simulation of a sorter revealed where the best cost/performance improvements could be performed.
All of these concepts have an analogy in VLSI chip design and, in particular, the research suggested here.

Another idea which has surfaced more recently is evaluating the frequency of execution. Hennessy claims hardware/software tradeoffs in processor design should consider the typical execution of a program [H+82]. Among his many points, he argues that general CPU operations should not set (a group of) condition codes but rather a specific compare instruction should set a single flag. He cites a vast amount of actual program execution which shows the infrequent use of condition codes without first using a compare instruction. Although his arguments are directed towards RISC architecture, expected execution is an important tool for chip design. For example, if a conditional which involves a divide operation is used infrequently, even if it is located on the critical path (the path with the longest delay), the “average” performance is affected little by the actual implementation (whether in hardware or software). In the search for good designs, one can target parts of the behavioral description which offer the greatest potential improvement globally and result in more efficient use of computer resources.

1.6 Thesis Outline

Since many of the tools needed to perform control path/data path tradeoffs did not exist prior to this research, an extensive part of the research is tool development. A description of MAHA, the non-pipelined synthesis program, is presented in Chapter 2. Scheduling, allocation, and stretching the dataflow graph (serializing the design) are detailed. In addition, transformation of a behavioral description containing loops into an acyclic form is also described.

A model for estimating PLA control area (PASTA) is included in Chapter 3. Two methods for producing data path controllers using PLAs are formulated and extended for loops and conditionals.
Control of pipelined architecture is also addressed. Finally, the model is broadened to encompass estimation of folded PLAs.

Estimation of routing and storage requirements is given in Chapter 4. Prediction of register usage relies upon a modified flow analysis of the behavioral graph. Results from the register estimates and output from the data path prediction tools are used by the multiplexer estimator. A statistical model for multiplexer usage is derived and verified.

In Chapter 5, control path/data path tradeoff types are described. Using the control model described in Chapter 3, an analytical model is derived for one type of control path/data path tradeoff and validated.

A more general tradeoff analysis method, which uses the models and heuristics of earlier chapters, is presented in Chapter 6. This method is applied to numerous designs to illustrate a variety of tradeoffs against a set of user constraints. Large examples include a floating point coprocessor and the i8251 UART.

Finally, Chapter 7 concludes with a summary and discussion of future research topics.

Chapter 2

Extensions to Data Path Synthesis

2.1 Introduction

Over the past decade, there has been broad emphasis on automated design of data path hardware, also known as data path synthesis or high-level synthesis. Data path synthesis starts with a behavioral description in the form of a dataflow graph to produce data path hardware with three tasks: event scheduling, resource allocation, and module binding. Event scheduling partitions the dataflow graph into one or more clock cycles of some fixed length (for non-pipelined synthesis).
Resource allocation assigns hardware modules to operators and also allocates data steering (multiplexers and/or buses) and storage modules (registers) while maximizing resource sharing; the amount of sharing depends upon the particular design, event scheduling, and module bindings to the graph operations.

Since optimally synthesizing data path hardware to meet a set of user-defined constraints is recognized as an NP-complete problem [KT83, Tho86], a variety of approaches have been published which achieve good designs in a reasonable amount of time. These range from a purely heuristic approach for allocation such as EMUCS [McF81] to a formal approach for scheduling and allocation which uses mixed integer-linear programming [HP83]. Purely heuristic approaches suffer from being overly rigid. Although they typically perform the synthesis task quickly, the resulting designs can be far from optimal due to some aspect of the design that was missed. For even the small examples cited in literature, a human designer could do better.

At the other end of data path synthesis is a rigorous formal definition which can be used to produce optimal designs. Early work by the author was directed at describing and synthesizing Register-Transfer (RT) designs [PKM84]. Peripheral research related to Hafer’s formal approach revealed that even for small problems, producing optimal designs can be expensive [HP83, HP81]. Data path synthesis included solving a linear programming problem where the matrix size grew nearly at $O(n^3)$, where $n$ is the number of operations in the data path. Clearly, there is a limit to the size of designs which can be performed using this method.

The work described in this paper is similar in intent to the EMUCS algorithm designed by McFarland, but actually has its roots in a control synthesis algorithm developed by Nagle [NCP82].
Nagle’s notion of freedoms has been directly applied to this research.

For a synthesis tool to meet the requirements on flexibility, the program has to make decisions in some order such that earlier decisions do not overly constrain later decisions. It must also have some method for computing the effect of a single design decision on the overall area and speed of the resulting hardware. Thus, the synthesis problem is described as follows:

The program should input a dataflow description of the hardware behavior and optional constraints on area and performance, and must output a data path structure consisting of registers, operators, and required interconnections, along with a time schedule giving the ordering of operations. The program should make the most constrained decisions first, so that the ordering of decisions does not greatly affect the optimality of the resulting design. It should adjust either to area or speed constraints and it should be able to measure the impact of each design decision to avoid extensive search of the design space. Finally, the program should be able to restart or backtrack when it is clear constraints will not be met with the current strategy.

The next section of this chapter describes the operation of the Modified Automatic Hardware Allocator (MAHA), which synthesizes non-pipelined designs in the manner described above. Restrictions of the original version of MAHA [PPM86], as well as refinements and additions which have since been incorporated, are described. The MAHA program structure and loop handling are detailed. Synthesis examples and limitations conclude the chapter.

2.2 Overview of the MAHA Algorithm

MAHA carries out the synthesis tasks described above in the following manner. First, a list of operations is read by MAHA, followed by a list of values transferred between these operations.
(An example dataflow graph containing operations as nodes and values as edges is shown in Figure 2.1.) The longest path (in time) from top to bottom, or critical path, is located; MAHA divides this path into p time steps (or partitions) of equal duration. Each time step now represents one minor cycle or register transfer. MAHA allocates operators for the critical path in a first-come first-serve fashion, with multiple operations sharing resources as long as the operations do not occur in the same time-step.

Then, MAHA decides which of the remaining or off-critical path operations have the least scheduling freedom. Thus, the first operations scheduled, which may have scheduling difficulties, get the first chance to share resources. Operations scheduled later may find most resources fully utilized; however, they have the greatest scheduling flexibility, and thus are more likely to find a free resource in some time slot. MAHA adds resources as necessary when it schedules these off-critical path operations.

At some point, MAHA may run out of resources due to an area constraint. At this point, it repartitions the critical path into more time steps, and begins allocation over. Adding more partitions allows scheduling the resources to be used by more operations than before; thus, fewer resources may be required.

If area is to be minimized, MAHA will incrementally add as many time steps as possible without violating the timing constraint; if speed is to be maximized, MAHA will add resources as long as it can without violating the area constraint. (Note that both area and time are optionally adjusted by the associated register and/or multiplexer scheduling and allocation.) A general overview of the Modified Automatic Hardware Allocator and scheduler is included in Figure 2.2.
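The scheduling loop just described can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the MAHA program: every operation is assumed to take exactly one time step, freedom is taken as the width of an operation's feasible step window, and the names `schedule`, `fit_area`, `op_type`, `preds`, `area`, and `max_area` are hypothetical. Critical-path operations have zero freedom, so they are placed first without being treated as a separate phase.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def schedule(op_type, preds, steps):
    """Least-freedom-first scheduling of unit-delay operations into `steps` steps.

    Returns (step_of, units): the step chosen per operation and the number of
    modules allocated per operation type.
    """
    order = list(TopologicalSorter({o: preds.get(o, ()) for o in op_type}).static_order())
    succs = {o: [] for o in op_type}
    for o in op_type:
        for p in preds.get(o, ()):
            succs[p].append(o)

    step_of, busy, units = {}, {}, {}

    def windows():
        # Earliest and latest feasible step of each operation given decisions so far.
        early, late = {}, {}
        for o in order:
            early[o] = step_of.get(o, max((early[p] + 1 for p in preds.get(o, ())), default=0))
        for o in reversed(order):
            late[o] = step_of.get(o, min((late[s] - 1 for s in succs[o]), default=steps - 1))
        return early, late

    while len(step_of) < len(op_type):
        early, late = windows()
        pending = [o for o in order if o not in step_of]
        op = min(pending, key=lambda o: late[o] - early[o])  # least freedom first
        if late[op] < early[op]:
            raise ValueError("too few time steps for the critical path")
        t = op_type[op]
        # Prefer the first step in the window where an allocated module is idle.
        idle = [s for s in range(early[op], late[op] + 1)
                if busy.get((t, s), 0) < units.get(t, 0)]
        if idle:
            s = idle[0]
        else:
            s = early[op]                    # schedule at the earliest point
            units[t] = units.get(t, 0) + 1   # and allocate a new module
        step_of[op] = s
        busy[(t, s)] = busy.get((t, s), 0) + 1
    return step_of, units


def fit_area(op_type, preds, area, max_area, max_steps):
    """Add time steps until the operator area fits; return (steps, step_of, units) or None."""
    for steps in range(1, max_steps + 1):
        try:
            step_of, units = schedule(op_type, preds, steps)
        except ValueError:
            continue
        if sum(area[t] * n for t, n in units.items()) <= max_area:
            return steps, step_of, units
    return None


# Hypothetical example: two multiplications feeding one addition.
types = {"m1": "mul", "m2": "mul", "a1": "add"}
deps = {"a1": ["m1", "m2"]}
fastest = schedule(types, deps, 2)
fitted = fit_area(types, deps, area={"mul": 10, "add": 2}, max_area=12, max_steps=5)
```

With two time steps both multiplications land in the same step and need two multipliers; relaxing to three steps lets them share one multiplier, which is the area/time exchange the repartitioning step is after.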
[Figure 2.1: Example dataflow graph]

[Figure 2.2: General overview of the MAHA algorithm. Flowchart steps: read DFG and module library; read local constraints; accept global constraints; initialize partition count, p = 1; determine resources from p; find critical path (c.p.), c.p. partitions p_cp, and clock cycle time c given the resources; test whether p_cp = p and the time is acceptable; schedule c.p. operations and allocate using fixed resources; while any operations remain unscheduled, compute the freedoms of the unscheduled operations, select the operation with the lowest freedom, and schedule it in the first clock cycle where a resource can be shared, or, if it could not be scheduled, schedule it at the earliest point and allocate a new resource; if the cheapest design has not been reached, set p = p + 1 and repeat; otherwise done.]

Each node in the dataflow graph represents an operation to be performed by a piece of hardware. The critical path operations, which include operations on multiple critical paths, are removed from the set of all operations and are assigned hardware first. Since the critical path operations are sequentially ordered, there are no timing conflicts. Also, since the critical path determines the overall speed, doing the critical path assignment first ensures the fastest possible resulting design. Further, once the critical path has been assigned, one has a lower bound on the total area. Thus, the approach used in this heuristic partitions the problem.

The method for computing the critical path in the original version of MAHA is carried out by a program by Park called the Clocking Scheme Synthesis Package (CSSP) [PP85a]. Although CSSP was originally intended to clock data paths already designed, it was used within MAHA to avoid duplication of effort. This program takes a set of operations and values and computes optimal clocking schemes, the critical path, and clock cycle times for single and multiphase clocks.
Given this information, hardware can be assigned to clock periods; specifically, critical path operations are assigned to a time step and bound to hardware modules for that clock cycle. Operations which are not on the critical path could share this hardware during other clock cycles.

Before discussing the algorithm used in MAHA further, the notion of freedom introduced in Chapter 1 is again described. The freedom of an operation is defined as the difference between the time when the input values are available for a given operation and the time when the result of that operation is required, less the propagation delay of the hardware. Once the critical path has been assigned, the remaining operations are assigned by iteratively computing the freedoms of each operation and scheduling the operation with the least freedom.

2.3 Evolution of MAHA

In its initial form, MAHA performed a subset of the overall data path design as originally envisioned by Parker. One limitation involved the critical path partitioning scheme whereby the dataflow could be divided or serialized into a maximum of n time-steps, where n is approximately the number of operations on the critical path. This restricted the degree by which MAHA could serialize a design for lower area, a limitation which was resolved with delay operation insertion discussed in this section.

Further, register and multiplexer effects were ignored in the original utility. Solely using operator area may give misleading results. For example, assume a design with two clock cycles has twice the operator area consumption of a design with four cycles and both exhibit identical overall delay. One might be tempted to discard the larger design. However, additional multiplexers and registers are needed for the smaller design, which increases both its delay time and area.
Furthermore, control area, which is not addressed here, is also increased. Consideration of routing and storage elements was added to MAHA and is discussed in this section.

A third consideration in synthesis is localized constraints. For example, two operations in a dataflow graph may be constrained to occur within/outside a certain interval. Extensions to the synthesis program addressed the need for localized constraints.

Finally, wiring area is another aspect of the data path area; however, this aspect of data path analysis is handled externally to MAHA using the PLEST utility [KP86]. In its current form, MAHA computes and outputs values needed to estimate the wiring area, but does not perform any internal wiring area analysis.

2.3.1 Synthesis of Serialized Designs

The first release of MAHA serialized designs up to the limit of as-soon-as-possible (and as-late-as-possible) scheduling of the critical path. There are two problems with this approach. First, scheduling was not resource sensitive. If two identical operations in series along the critical path had a collective delay smaller than the minimum clock time, they might always be scheduled into the same time step. Hence, they could never share the same operator, even though sharing may result in a lower circuit area. Only a resource sensitive scheduling technique would eventually force these operations into different time steps.

Second, resource-sensitive scheduling of off-critical path operations was also lacking. In particular, when identical operations lie either on and off (or both off) the critical path, there is the potential that they would be scheduled into the same time-step and not be able to share resources. One approach would be to arbitrarily force the off-critical path operation into a different time step.
However, this may result in a scheduling conflict with another operation and, more importantly, may extend and thereby change the critical path. Given this impact, it is also quite probable that operations scheduled up to this point could more efficiently utilize resources given the extra "breathing" space.

A solution is to stretch the critical path by the introduction of delay operations. These operations, which have no area associated with them, are inserted along the critical path where resource conflicts occur with (or between) off-critical path operations. One such example is shown in Figure 2.3 where a delay operation has been inserted prior to -7. By allowing time for the off-critical path operation -3, it could utilize the same hardware resource as -7 and lower the overall circuit area. Once the modification to the graph is completed, it is synthesized normally.

Insertion of delays may continue until only one resource of each operation type is needed for a design; at this point, the dataflow algorithm is fully serialized and no further area reduction is possible (given the module set).

2.3.2 Register Extensions

Although registers may be a small percentage of the total area in large designs, they can represent an appreciable area in smaller or serialized designs. In addition, register delay affects the clock cycle time. A simple linear strategy to compute register area and delay was incorporated into MAHA. After operator allocation is complete, each value in the dataflow graph is visited; if the source and destination are in different time steps, the value is a candidate for register assignment. If a register has been attached to this value from the operation in a prior check, additional registers are superfluous. Also, values emanating from root do not have registers attached.
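The register pass just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the MAHA implementation: the function name `allocate_registers`, the edge-list representation of values, and the `step` mapping from operation to time step are all assumptions made for the example. It also assumes that one register per producing operation is sufficient, which is one reading of the rule that additional registers on an already registered value are superfluous.

```python
def allocate_registers(values, step, root="root"):
    """Return the set of producing operations whose output needs a register.

    values: iterable of (source, destination) pairs, one per value (edge)
            in the dataflow graph (hypothetical representation).
    step:   dict mapping each non-root operation to its time step.
    """
    registered = set()
    for source, destination in values:      # single pass: linear in the values
        if source == root:
            continue                        # values from root get no register
        if step[source] == step[destination]:
            continue                        # same time step: no storage needed
        registered.add(source)              # a repeat visit adds nothing new
    return registered


# Hypothetical five-value graph scheduled into two time steps.
edges = [("root", "a"), ("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
steps = {"a": 1, "b": 1, "c": 2, "d": 2}
needed = allocate_registers(edges, steps)   # outputs of a and b cross a step
```

Multiplying the size of the returned set by a per-register area would give the register area contribution under these assumptions.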
[Figure 2.3: Example delay operation insertion]

This approach accomplishes register allocation in O(values) time and provides a hard upper bound on the number of registers required; the results may be improved by taking advantage of better external algorithms.

Register delay is also a consideration for data path synthesis. Since a register is attached between each time step, the clock cycle time is increased to accommodate the register delay.

2.3.3 Multiplexer Extensions

Multiplexers can also impact area in a design with extensive sharing. For more accurate results, and in particular for assessing design changes, the area of multiplexers must be considered. MAHA adds multiplexers as needed for operator sharing. Multiplexers are not shared.

Within any time-step where multiplexers are utilized, multiplexer delay may exceed the "clock slack" time. "Clock slack" is the difference between the clock cycle time and the maximum delay of all paths through operations assigned to that specific time-step which contain the operation being allocated multiplexers. If there is sufficient slack, a given multiplexer tree may not impact the clock cycle time. In the worst case, the clock time would be increased by the particular conflicting multiplexer tree delay. MAHA increases the clock time to accommodate multiplexer tree delay.

Area and delay calculations for each multiplexer tree are incorporated into MAHA. Basically, for a given $m$-bit $n$:1 multiplexer, a tree consisting of 2:1 multiplexers has an area of

$$A_{muxtree} = m\,(n - 1)\,A_{mux} \qquad (2.3.1)$$

where $A_{mux}$ is the area of a 2:1 multiplexer [Kur87]. The time delay is determined by the maximum depth of the tree and is written as

$$T_{muxtree} = \lceil \log_2 n \rceil \, T_{mux} \qquad (2.3.2)$$

where $T_{mux}$ is the delay time of a 2:1 multiplexer.
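As a quick numerical check of Equations 2.3.1 and 2.3.2, the two expressions can be written as small Python functions. The function and parameter names are assumptions chosen for this sketch, and the area and delay figures in the usage lines are hypothetical, not values from the module library.

```python
def mux_tree_area(bits, inputs, area_2to1):
    """Area of an m-bit n:1 multiplexer built from 2:1 multiplexers (Eq. 2.3.1)."""
    if bits < 1 or inputs < 1:
        raise ValueError("bits and inputs must be positive")
    return bits * (inputs - 1) * area_2to1


def mux_tree_delay(inputs, delay_2to1):
    """Delay through the deepest path of the tree (Eq. 2.3.2)."""
    if inputs < 1:
        raise ValueError("inputs must be positive")
    depth = (inputs - 1).bit_length()   # integer ceil(log2(inputs)) for inputs >= 1
    return depth * delay_2to1


# Hypothetical 8-bit 4:1 multiplexer with a 2:1 cell of area 10 and delay 5.
area = mux_tree_area(8, 4, 10)    # 8 * 3 * 10 = 240
delay = mux_tree_delay(4, 5)      # 2 levels * 5 = 10
```

Using `bit_length` keeps the depth computation in integer arithmetic, which avoids rounding surprises from a floating-point logarithm when the input count is an exact power of two.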
2.3.4 Extension for Localized Constraints

In many designs, constraints may be imposed upon a local area of the algorithm. As an example, due to some constraint, one operation cannot begin until another has been started. Synthesis bounded by local constraints is not a new concept, although only one synthesis package to date is known to accept them [NT86]. Local constraints should allow specifying minimum and/or maximum delays between any two operations.

In a dataflow graph, there is an implicit constraint given by the ordering of the operations. For example, a value between operation x and operation y can be viewed as "operation y starts after operation x finishes", in addition to the data transfer aspect. Extending this concept, one could model a minimum delay between any two operations by inserting an edge along with a delay operation (if needed), provided no loops are introduced by adding the new edge.

Contrary to specification of minimum delay, maximum delay cannot be represented solely by modifying the dataflow graph. As MAHA serializes the graph, the delay between operations naturally lengthens. There is no inherent mechanism by which the delay between any two operations can be limited. Hence, the simplest method is used for MAHA: modify the scheduling algorithm to detect and avoid constraint violation (both minimum and maximum). When an operation is scheduled, its constraints are checked and the schedule discarded if a violation occurs.

2.4 The MAHA Program Structure

The original version of MAHA was written in Franz LISP. Since then, MAHA has been ported to the C programming language with several enhancements added. However, the basic structure of MAHA has remained unchanged throughout.
Discussion of MAHA is divided into five separate areas: algorithm input, clock cycle generation, critical path partitioning and allocation, off-critical path analysis and allocation, and graph extension. An overview of the complete procedure is then given.

2.4.1 Algorithm Input

There are four inputs to the algorithm: behavior in the form of a dataflow graph, library of hardware modules which can implement the operations in the dataflow, local constraints, and global constraints.

2.4.1.1 Dataflow Graph Input

The behavior of a system is described using a dataflow graph where operations are represented as nodes and value transfers described by directed edges. MAHA expects the graph to have a single entry point from a node specifically named root which has no incoming edges. Similarly, the graph terminates at a single exit point named outport which has no outgoing edges. All other nodes must lie along some acyclic path from root to outport. An example graph was shown in Figure 2.1. Note that root and outport are "place holders"; the function performed is essentially a no-op with no area or delay associated with them. Thus, they have no impact on the synthesis results.

In the actual program, node tags which specify the operation as being the top or bottom of a loop are also accepted. Although MAHA does not accept cyclic graphs, these tags allow it to generate useful designs for algorithms which had loops prior to transformation. An external algorithm which transforms cyclic graphs into acyclic ones prior to synthesis is presented later in this chapter.

Finally, the graph is colored to delineate mutually exclusive operations. Two special operations designated dist and join are used to initiate and terminate a multi-way conditional branch. The square boxes labelled d and j in Figure 2.1 are distribute and join, respectively.
Each dist must have a matching join and all operations within the specific conditional must have a path to the matching join. Operations +8 and -9 are mutually exclusive to -10 and +11. Thus, the same operator could be used for +8 and +11 since only one is active at any time. Coloring of a graph for conditional branches is described in Park [Par85].

2.4.1.2 Library Generation

After the hardware library is selected by the user, MAHA analyzes each type of operation performed in the dataflow graph and generates a list of all components in this library which can perform it. If more than one hardware module can perform a given operation, MAHA averages the area and speed of all such modules to form a single "average module" type associated with that particular operation. Operations which have a local delay constraint require an additional check to eliminate non-compliant hardware modules which do not meet this constraint.

Having an average library simplifies hardware assignment as each operation has only a single component in the library which can implement its function. Thus, the propagation delay of every operation is fixed. It also allows centering the design in a chosen area in the design space without fine tuning for a particular module selection. MAHA is not intended to perform module selection.

If module selection is performed prior to synthesis, then a reduced library can be passed to MAHA. Using this method along with appropriate function labelling on the operations in the dataflow graph, each operation will have only one hardware type which can implement the function. This fixes the propagation delay; the area associated with the implementation depends upon hardware resource sharing. This is usually the mode in which MAHA is operated.
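The construction of an "average library" can be illustrated with a short Python sketch. It is not taken from the MAHA source: the tuple layout of a module, the function name `average_library`, and the simplification of applying a delay limit per operation type (rather than per individual operation in the graph) are assumptions made for the example, as are the module names and figures.

```python
from collections import defaultdict


def average_library(modules, max_delay=None):
    """Collapse a module library to one averaged entry per operation type.

    modules:   list of (name, operation, area, delay) tuples (hypothetical layout).
    max_delay: optional dict of operation -> largest delay allowed by a local
               constraint; modules slower than the limit are dropped first.
    Returns a dict of operation -> (average area, average delay).
    """
    max_delay = max_delay or {}
    candidates = defaultdict(list)
    for _name, operation, area, delay in modules:
        limit = max_delay.get(operation)
        if limit is not None and delay > limit:
            continue                          # non-compliant module eliminated
        candidates[operation].append((area, delay))

    library = {}
    for operation, entries in candidates.items():
        count = len(entries)
        library[operation] = (
            sum(area for area, _ in entries) / count,
            sum(delay for _, delay in entries) / count,
        )
    return library


# Hypothetical library with two adders and one subtractor.
modules = [
    ("add_fast", "+", 300, 80),
    ("add_slow", "+", 100, 120),
    ("sub", "-", 220, 110),
]
averaged = average_library(modules)   # "+" -> (200.0, 100.0), "-" -> (220.0, 110.0)
```

With a single module per operation, as in the reduced-library mode described above, the averaging step leaves the module's own area and delay unchanged.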
In our example of Figure 2.1, one adder and one subtractor module type would be selected; the adder has a delay of 100 and area of 200, the subtractor 110 and 220. The number of instantiations (uses) of these operators depends upon the particular design.

2.4.1.3 Local Constraint Specification

MAHA can optionally accept local user constraints. These constraints specify any of the following:

• minimum propagation delay of a given operation
• maximum propagation delay of a given operation
• minimum delay between two operations
• maximum delay between two operations

For delay between two operations, either the starting or ending time of the operation can be specified. These constraints are independent of the actual graph dependencies. An example local constraint would be

$$T_b(-7) \le T_e(-4) - 50 \qquad (2.4.3)$$

which reads "operation -7 should start at least 50 time units before operation -4 completes".

2.4.1.4 Global Constraints

When designing hardware, one is usually confronted by constraints on the area and maximum delay associated with the circuit. Global constraints are

• $A_u$, the maximum circuit area allowed, and
• $T_u$, the maximum circuit time.

Usually one global constraint is minimized while obeying an upper limit on the other.

2.4.2 Clock Cycle Generation

Feasible clock cycle times are constructed by MAHA prior to synthesizing any designs. Each operation is assigned a propagation delay associated with its hardware module. A union of all maximum times between any pair of operations which are connected via some path forms the clock cycle array; theoretical foundations and the algorithm are detailed in [PP85a]. The clock times are bounded on the low side by the maximum delay of any operator ($c_{min}$) and naturally bounded by the maximum delay from root to outport on the high side.
The first condition is imposed by a precondition that a given operation must be completed in the same cycle that it is initiated; however, multiple operations per clock cycle are allowed.

Using the example module library described earlier, $c_{min} = 110$ and the maximum delay is 720. The list of all possible clock times, in order, is 110, 200, 210, 220, 300, 310, 320, 410, 420, 520, 620, and 720, which forms the clock array.

2.4.3 Critical Path Partitioning and Allocation

With the dataflow graph, module library, and the number of partitions, $p$, of equal clock time, $c$, into which the graph should be divided, the critical path can be determined. (The clock cycle time is one chosen from the clock array.) It will be shown later how to determine the best clock time $c$ given $p$.

First, the minimum number of resources required is computed using lower-bound area-time theory for non-pipelined synthesis [JMP88]. Given the number of partitions, the number of hardware (operator) resources ($o_i$) necessary for implementing $n_i$ operations of type $i$ in the dataflow graph is

$$o_i = \left\lceil \frac{n_i}{p} \right\rceil \qquad (2.4.4)$$

giving a minimal area of

$$A_m = \sum_i o_i \, A(i) \qquad (2.4.5)$$

where $A(i)$ is the area of hardware resource type $i$. ($n_i$ is the maximum number of modules of type $i$ needed for the most parallel design; $n_i$ may be less than the actual number of operations of type $i$ in the presence of conditional branches.) If there is an area constraint ($A_u$) and $A_m > A_u$, there is no need to synthesize this particular design. A higher partition count which results in an area less than $A_u$ must be chosen.

The critical path marking algorithm is a variation of that given in Park [PP85b]. Each operation is assigned a propagation delay equal to its component delay. All possible paths from root to the output are followed and the maximum time at each operation is retained.
Each operation is marked as it is visited to prevent traversing any portion of the graph more than once. Hence, the critical path is found in $O(n^2)$ time, where $n$ is the number of operations in the graph.

Given clock cycle time ($c$), Park's original algorithm which determines the number of critical path partitions, $p_{cp}$, is modified to consider hardware resources as described in Figure 2.4. Since resources are tracked, a resource link between the operator and operation is maintained for later use in critical path allocation. This algorithm schedules the critical path operations and also determines the total delay ($T$) associated with this design.

    /* Inputs are minimal resources, o_i, and clock time, c. */
    Initialize partition count and partition time, p_cp = 1 and T_p = 0.
    Set resources left to maximum available, u_i = o_i for all i.
    Starting from top (bottom) of critical path for ASAP (ALAP) scheduling,
    while more unassigned operations remain in critical path,
        Select next unscheduled critical path operation k where type is j.
        While delay too large to fit into current partition, T_p + Delay_k > c,
              or no resources of type j are left in this partition, u_j = 0, then
            Create a new partition, p_cp = p_cp + 1.
            Reset partition time, T_p = 0.
            Reset resources used, u_i = o_i for all i.
        Endwhile /* assign operation to partition number */
        Set resource link and adjust resources left, u_j = u_j - 1.
        Compute new time used in partition, T_p = T_p + Delay_k.
        Schedule operation k into partition p_cp.
    Endwhile /* scheduling critical path */
    Increment p_cp to give actual number of partitions.
    Return total partitions, p_cp, and delay, T = (p_cp - 1) x c + T_p.

Figure 2.4: MAHA Critical Path Partitions ($p_{cp}$)

The number of critical path partitions, $p_{cp}$, is determined by the clock cycle time and the number of resources. However, the resources were computed from the number of partitions, $p$.
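The partitioning loop of Figure 2.4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the MAHA code: the function name `partition_critical_path` and its argument layout are hypothetical, partitions are numbered from 1 with no final increment (so the returned count is the number of partitions actually used), and only ASAP order is shown. The path and delays in the usage lines are illustrative.

```python
def partition_critical_path(path, delay, resources, clock):
    """Resource-sensitive ASAP partitioning of one critical path.

    path:      list of (operation name, operation type) in path order.
    delay:     dict of operation type -> propagation delay.
    resources: dict of operation type -> operators available per partition (o_i).
    clock:     clock cycle time c.
    Returns (partition count, total delay T, {operation name: partition}).
    """
    partitions = 1                 # number of the partition currently being filled
    time_used = 0                  # T_p, time consumed in the current partition
    left = dict(resources)         # u_i, operators still free in this partition
    schedule = {}

    for name, kind in path:
        d = delay[kind]
        if d > clock or resources.get(kind, 0) < 1:
            raise ValueError(f"operation {name} can never be scheduled")
        if time_used + d > clock or left[kind] == 0:
            partitions += 1        # open a new partition with fresh resources
            time_used = 0
            left = dict(resources)
        left[kind] -= 1            # resource link for later allocation
        time_used += d
        schedule[name] = partitions

    total = (partitions - 1) * clock + time_used
    return partitions, total, schedule


# Illustrative path of four identical operations with a single operator.
path = [("a1", "a"), ("a2", "a"), ("a3", "a"), ("a4", "a")]
count, total, _ = partition_critical_path(path, {"a": 50}, {"a": 1}, 100)
# One operator forces one operation per partition: count == 4, total == 350.
```

Allowing two operators of the same type lets two operations share each 100-unit cycle, which halves the partition count for this path. The resource count therefore changes the schedule even though the operator delays are unchanged.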
Clearly, unless $p = p_{cp}$, the critical path schedule is invalid. To overcome this problem while also finding the lowest clock cycle time for which $p = p_{cp}$, $p$ is arbitrarily set and the set of feasible clock times is examined. Starting with the minimum clock that is theoretically possible [JMP88] and the maximum clock time, a binary search is performed to obtain the lowest clock time which gives the desired resource-sensitive partition count, where $p = p_{cp}$.

    /*
     * Inputs are minimal resources, o_i,
     * and desired partition count, p_desired, and
     * clock array, CLK[maxclk], which is in ascending order.
     */
    Using binary search, find minimum value of l such that partition count
    for CLK[l] equals p_desired, if possible; p_cp is determined (Figure 2.4).
    If l exists (p_cp = p_desired), then
        Set clock cycle time c = CLK[l], and circuit time T = c x p_cp.
        Allocate hardware to critical path;
        (turn all resource links into resource allocations).
    Endif /* check if partition count feasible */

Figure 2.5: MAHA Critical Path Scheduling and Allocation

With the minimal clock time which gives $p = p_{cp}$, scheduling and allocation of critical path hardware can now be accomplished. MAHA sequentially assigns hardware to each operation in the critical path, keeping track of hardware usage and area. A design is generated using both top-to-bottom or as-soon-as-possible (ASAP) scheduling/allocation as well as bottom-to-top or as-late-as-possible (ALAP) scheduling/allocation. An "assigned hardware" list is generated which contains all "purchased" hardware and the time range(s) currently assigned. Since each operation has only a single component type which can be used, allocating hardware to an operation only requires analyzing the usage of previously allocated hardware.
If a component of the correct type has been previously assigned, the corresponding time ranges are examined to determine if that hardware component can be reused. If not, a new piece of hardware is purchased. This maximal sharing greedy algorithm is performed on the dataflow graph in approximately linear time. The critical path scheduling and allocation algorithm is shown in Figure 2.5.

Assume that the partition count was set to four ($p = 4$). With 7 effective add operations and 4 effective subtract operations, two adders and one subtractor are initially allocated for the example graph implementation. Using a clock time of 200 does not violate the resources, but gives $p_{cp} = 5$ which is invalid. A clock time of 210 gives 4 partitions on the critical path as shown in Figure 2.6.

In MAHA, hardware area is as important as operator delay. Consider the subgraph depicted in Figure 2.7 which is part of a larger dataflow graph. Assume the delays of operators a, b, and c are identical. By solely considering the delay time, the right path (bcbcbc) is 50% longer than the left path (aaaa). However, if only one hardware resource is available for each type, then the left path must be divided into four partitions (a | a | a | a) whereas the right path can be divided into three (bc | bc | bc). With this resource dependence, the critical path (time) of a design must be recomputed whenever the resources allowed change. A single evaluation of the critical path based upon operator delay is insufficient.

Besides resource dependence, the critical path of a design is also affected by clock cycle time. As the clock cycle time is changed, non-critical paths may now become the critical path. To illustrate, assume operator delays of a = 50, b = 30, and c = 30 for the dataflow graph in Figure 2.7.
For clock cycle times greater than or equal to 60, the left path is the critical path (ignoring resources for the moment). However, for clock times less than 60, the right path becomes the critical path. Thus, the critical path of a design must be recomputed whenever the clock cycle time changes.

Local constraints also affect the scheduling of the critical path. Before each operation is actually scheduled, it is tested to determine if any of the user-specified constraints are violated. If not, MAHA will accept the scheduling. Otherwise, if scheduling the operation in a later time-step results in acceptance of the constraint, the latter schedule is used. In the case where the operation cannot be scheduled without violating a constraint, a message of the form "constraint violation" is given and MAHA tries again with a larger partition count.

A dataflow graph may have more than one critical path. In this case, MAHA selects only one for performing critical path allocation. The remaining critical paths are scheduled and allocated along with non-critical path operations. Since the scheduling freedom of "ignored" critical path operations is limited as compared to other non-critical path operations, clearly these additional critical paths will be scheduled before other off-critical path operations.

[Figure 2.6: Example with Critical Path Partitioned (root, 1st through 4th clock cycles)]

[Figure 2.7: Demonstration dataflow graph #2]

The presence of multiple critical paths is not a major issue in non-pipelined designs. Multiple critical paths usually occur in repetitive designs, where each critical path contains the same operations in the same order. Since every critical path is identical, one can arbitrarily select any of them and obtain the same result.
Furthermore, highly repetitive dataflow graphs of this nature are more aptly suited to pipelined design where MAHA is less likely to be used.

2.4.4 Off-Critical Path Analysis and Allocation

After completing the critical path allocation, off-critical-path operations are assigned hardware. Initially, each operation not previously allocated is analyzed to determine its freedom. Essentially, freedom is the slack time in which an operation can be performed without lengthening the critical path.

Let $\mathcal{N}_{parent}(k)$ be the set of all direct predecessors of operation $k$; in other words, every operation with an outgoing value connected to operation $k$ is a member of this set. Similarly, $\mathcal{N}_{child}(k)$ is the set of all direct successors to $k$. Then, the earliest time $operation_k$ can be scheduled is

$$T_{early}(k) = \max\bigl(T_{early}(operation_j) + D(operation_j)\bigr) \qquad (2.4.6)$$

where $operation_j \in \mathcal{N}_{parent}(k)$ and $D(operation_j)$ is the delay associated with operation $j$. Similarly, the latest time is

$$T_{late}(k) = \min\bigl(T_{late}(operation_m)\bigr) - D(operation_k) \qquad (2.4.7)$$

where $operation_m \in \mathcal{N}_{child}(k)$.

Operations which have been scheduled have $T_{early}$ and $T_{late}$ set by the clock cycle and their position within that cycle. Note that previously allocated critical path operations have had their values of $T_{early}$ and $T_{late}$ assigned, starting with $T_{early}(root) = 0$, before off-critical path allocation occurs. Since all paths must eventually connect to a critical path operation, root, or outport, the recursive nature of Equations 2.4.6 and 2.4.7 is bounded. A list of freedom for all non-allocated operations is formed in $O(n^2)$.

    /* Input is clock time, c. */
    offcp_main:
    Compute freedoms of all operations not scheduled (or allocated).
    Select operation k with lowest freedom (T_late - T_early); type is j.
    Set initial indices h and l to the early and late time partitions;
        h = [illegible in source] and l = [illegible in source].
    /* Find partition to share resource, or just arbitrarily assign */
    While more partitions to check (h ≤ l), then
        Set u_j to resources used of operation type j in partition h.
        If resources of type j still left (u_j < o_j), then
            Break while loop.
        Else /* no resources of type j left */
            Set next partition, h = h + 1.
            If out of partitions to use (h > l), then
                Reset initial h value (h = [illegible in source]).
                Break while loop.
            Endif /* out of partitions - abort while */
        Endif
    Endwhile /* determine partition to assign operation */

Figure 2.8: MAHA Off-Critical Path ASAP Scheduling and Allocation: Part 1 of 2

The off-critical path operation with the smallest freedom (tightest constraint) is chosen for scheduling as described in the algorithm of Figure 2.8. If the freedom is so small that the operation must occur during one specific time stage, then the operation is scheduled in that stage. However, it is more common to see freedoms which allow an operation to be assigned to any one of a number of consecutive stages. In this case, the hardware allocated in the allowed stages is sequentially examined and the operation is assigned to the first stage where hardware sharing can occur. If none of the stages allows resource sharing, the earliest stage for ASAP scheduling (latest stage for ALAP scheduling) is arbitrarily chosen.

Once the off-critical path operation has been assigned to the chosen stage, either existing hardware is shared or a new operator is allocated. Figure 2.9 contains a description of the allocation portion of the off-critical path algorithm. Exceeding the maximum area during off-critical path allocation may result in re-partitioning of the dataflow graph. Otherwise, after allocating each chosen off-critical path operation, MAHA recalculates the freedoms for all operations affected by it and picks another off-critical path operation which has not been scheduled or allocated. This process continues until all operations have been allocated hardware.

    /* Found partition, now schedule and allocate */
    Assign operation k to partition h.
    Update resources used in partition, u_j = u_j + 1.
    If resources exceeded previous limit (u_j > o_j), then
        If first operation which cannot share resource (n_ocp not set), then
            Set n_ocp to the node's index (k) to identify the
            off-critical path operation.
        Endif /* determine where to stretch graph at */
        Increase limit, o_j = o_j + 1.
        Increase total area, A, by area of operation type j.
    Endif /* resource limit exceeded */
    If area constraint (if any) exceeded (A > A_u) and n_ocp set, then
        Exit routine with partial design.
    Endif /* constraint check */
    If any operations left unscheduled, then
        Goto offcp_main.

Figure 2.9: MAHA Off-Critical Path ASAP Scheduling and Allocation: Part 2 of 2

2.4.5 Dataflow Graph Stretching

When a dataflow graph has been scheduled into its maximum number of time-steps along the critical path, but has not yet reached its minimal resource limit, MAHA lengthens the dataflow graph to reduce the likelihood of a scheduling conflict. This function can be performed with a partially synthesized design provided at least one scheduling conflict has occurred since an area of resource conflict has been identified.

    (a) Set h to most recently assigned partition of operation n_ocp;
        set j to the operation type.
    (b) Find critical path operation k such that:
        - partition of operation k is h,
        - type of operation k is j (if exists in this partition), and
        - there are no other operations of type j in partition h which occur later.
    (c) Insert new delay operation d into graph such that:
        - all predecessors of operation k now point to delay operation d,
        - a connection is made between operations d and k, and
        - delay of operation d is set to push the following operations into the
          next partition.
Figure 2.10: MAHA Critical Path Stretching

The procedure used to stretch a dataflow graph is simplified since the critical path scheduling algorithm resolves resource conflicts along a given path. As such, the only scheduling conflict that arises is between critical path and off-critical path operations or solely between off-critical path operations. Resolution involves the insertion of a delay operation into the critical path at the point of overlap as described in Figure 2.10 and depicted in Figure 2.3. (The algorithm describes ASAP node insertion; ALAP is similar.) The selected critical path operation and its successors are delayed so that the off-critical path operation can be allocated to the same hardware as the conflicting operation. This delay operation is a MAHA artifact that has no associated area or "real" delay in the actual design; however, the scheduling algorithm treats this operation like any other and achieves the desired results. The delay of this operation is adjusted depending upon the clock cycle time to force the following operations into the next clock cycle.

    Read in dataflow graph; check for errors. (Section 2.4.1.1)
    Read in local constraint file, if any. (Section 2.4.1.3)
    Read in module library; construct "average library". (Section 2.4.1.2)
    Construct array of all possible clocks (c). (Section 2.4.2)
    Get user global constraints, T_u and A_u. (Section 2.4.1.4)
    Initialize minimum and current partition counts, p = p_min = 1.
    Clear all hardware resources.
    Clear first conflicting off-critical-path operation, n_ocp.
    Assign minimum clock time, c_min, to delay of slowest module.

Figure 2.11: Initialization of MAHA

2.4.6 MAHA Synthesis Procedure

In order to synthesize a graph, several values must be initialized and operations performed as summarized in Figure 2.11. The actions taken by MAHA are dictated by the global constraints.
These are supplied by the user for both time ($T_u$) and area ($A_u$); a value of zero instructs MAHA to minimize that parameter with respect to a limit on the other. Assigning a zero to both forces MAHA to generate all possible designs.

With global constraints set, the dataflow graph is initially divided into one stage where every operation requires a distinct hardware resource except for sharing which occurs on mutually exclusive paths. MAHA synthesizes designs starting with this most parallel (fastest and most expensive) towards the most serial (cheapest and slowest) which has some larger number of partitions, $p$.

As the graph is partitioned into two or more stages (clock cycles), hardware can be shared thereby lowering the data path area. Figure 2.12 shows a simple dataflow graph that has been partitioned into two clock cycles indicated by the solid line near the middle of the graph. On the left path, the top operation a is implemented using hardware resource A. The second a must use a different hardware resource of the same type. The clock cycle ends after the second operation and a register is used to store the result. A third a occurs in the second clock cycle. Since the results from the first and second operations are complete and the incoming value is stored in a register, hardware resources which performed those operations are free to be reused as shown by the dashed lines. Similarly, operations b and c require two hardware resources. Note that resources $B_1$ and $C_1$ are unused during the second clock cycle.

By partitioning a graph into many clock cycles ("serializing" the dataflow algorithm), operator hardware can be reduced. This is offset by additional resources required for registers to store values and multiplexers or buses for routing of values to a given hardware resource. These lower area designs are usually slower.
Depending upon the clock cycle time used (as compared to operator delays), some portions of the design may exhibit idle time. If all operations in Figure 2.12 have equal delay, then the left path is 50% idle; each clock cycle takes 4 time units, but hardware is only active for 2 of those. In addition, the overall circuit time has increased from 6 to 8 time units. There are also control area effects which will be addressed in the next chapter.

MAHA synthesizes designs by scheduling each operation into a specific clock cycle and allocating a hardware resource to perform the operation, given starting resources and number of partitions. Successive design steps increment the number of partitions until either no further designs meeting the constraints are possible, or the cheapest possible design is found (e.g. only one hardware resource for each operator type). An overview of the MAHA algorithm is given in Figures 2.13 and 2.14 where T is time, A is area, and T_m and A_m are the minimum time and area possible. n_ocp is the first off-critical-path operation encountered during synthesis which cannot share a utilized resource.

If a constraint on both area and time is given, MAHA searches for all non-inferior designs meeting these constraints. If either parameter is being minimized, MAHA will search for the single best design. MAHA exits early with the best design if time is minimized with a limit on area, this limit has been met, and the graph is about to be stretched; further partitioning can only result in slower designs. Conversely, if area is being minimized with respect to a limit on time, MAHA will continue generating designs until the limit on circuit time is exceeded or the cheapest design is produced. The best design produced up to this point is chosen.
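The effect of the two global constraints on which designs are reported can be sketched as a filter over a list of finished designs. This is only an illustration of the selection rules stated above, not MAHA's incremental search: the function names and the example (time, area) pairs are hypothetical, a constraint of zero is taken to mean "minimize this parameter", and "non-inferior" is interpreted as in Section 2.6 (a design is dropped if another design is no worse in both time and area and differs in at least one).

```python
def non_inferior(designs):
    """Keep (time, area) designs not dominated by a different design."""
    keep = []
    for d in designs:
        dominated = any(
            o != d and o[0] <= d[0] and o[1] <= d[1] for o in designs
        )
        if not dominated:
            keep.append(d)
    return keep

def select_designs(designs, t_u=0, a_u=0):
    """Choose designs under global constraints t_u (time) and a_u (area).

    Assumption: 0 means "minimize this parameter"; 0 for both means
    "report every design".
    """
    if t_u == 0 and a_u == 0:
        return list(designs)
    if t_u and a_u:
        feasible = [d for d in designs if d[0] <= t_u and d[1] <= a_u]
        return non_inferior(feasible)
    if t_u == 0:  # minimize time subject to the area limit
        feasible = [d for d in designs if d[1] <= a_u]
        return [min(feasible)] if feasible else []
    # minimize area subject to the time limit
    feasible = [d for d in designs if d[0] <= t_u]
    return [min(feasible, key=lambda d: (d[1], d[0]))] if feasible else []

# Hypothetical (time, area) pairs in arbitrary units.
designs = [(10, 100), (12, 80), (12, 90), (20, 40)]

print(select_designs(designs, t_u=15, a_u=100))  # [(10, 100), (12, 80)]
print(select_designs(designs, a_u=85))           # [(12, 80)]
print(select_designs(designs, t_u=15))           # [(12, 80)]
```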
Figure 2.12: Demonstration dataflow graph #1

    /* Determine new resources needed and if max serialization reached */
    /* Enter loop with p = p_min = 1 */
    main_loop:
      Determine minimum resource area, A_m, needed given partitions, p.
      If first loop or resources changed from last loop, then
        calculate new maximum number of partitions, p_max, which is the
        number of partitions when clock time is c_min.
      Endif /* resource change or first loop */
      If graph is past maximum serialization, p > p_max, then
    stretch_entry:
        If resources at minimum possible (one of each type), then
          display best design(s) and exit.
        Else if an off-critical-path operation was allocated a new resource
        in the last iteration (n_ocp), then
          Stretch the dataflow graph.
          Set partition count p to its minimum possible value, p_min.
        Else increment the number of partitions (p = p + 1)
          /* Either an n_ocp will be found or minimum resources will be reached */
        Goto main_loop.
      Endif /* graph at maximum serialization */
      /* Now perform preliminary check to see if circuit exceeds constraints */
      If area constraint (A_u) provided and is exceeded (A_m > A_u), then
        Increment the number of partitions (p = p + 1); Goto main_loop.
      Endif /* area constraint check */
      Calculate minimum circuit time (T_m = c_min x p).
      If time constraint (T_u) provided and is exceeded (T_m > T_u), then
        Goto stretch_entry. /* stretch graph (if possible) */
      Endif /* time constraint check */

Figure 2.13: Overall operation of MAHA: Part 1 of 2

      /* Critical path synthesis and constraint check */
      Perform critical path scheduling and allocation (Section 2.4.3);
      produces partition count p_cp and circuit time T.
      If the partition count p not reachable (p_cp ≠ p), then
        If partition count at minimum (p = p_min), then
          Set new minimum (p_min = p + 1) due to graph stretching.
        Endif /* increase minimum partition count */
      Else
        /* Off-critical path synthesis and constraint check */
        If time constraint (T_u) provided and is exceeded (T > T_u), then
          Goto stretch_entry. /* stretch graph (if possible) */
        Endif /* time constraint check with actual time, not estimate */
        Perform off-critical path scheduling and allocation (Section 2.4.4);
        produces circuit area A and may set n_ocp.
        If area does not exceed constraint, if any, (A < A_u), then
          Update list of feasible designs.
        Endif /* final area constraint check */
      Endif /* partition count not reachable check */
      Increment the number of partitions (p = p + 1); Goto main_loop.

Figure 2.14: Overall operation of MAHA: Part 2 of 2

Finally, constraints are tested during both critical path and off-critical path synthesis. Thus, MAHA may only have synthesized a partial design before detecting a constraint violation and terminating construction of the current design.

Upon completion, MAHA displays the component types used and all instantiations of them including the clock cycles where they were utilized. As an aid to other portions of the ADAM system, MAHA can generate a file of the clock cycle assigned to each operation, hardware resources used, and conditional coloring.

2.4.7 Runtime Analysis

In a dataflow graph with N nodes (operations) and E edges (values), the number of unique edges is bounded by the number of operations in an acyclic representation. (For this analysis, edges arising from the same source and terminating at the same sink are not unique.) The first node may have N - 1 edges emanating from it, but the second node can have only N - 2 edges since a “loop” back to the first node is prohibited. By expansion and series substitution,

    $E \le \sum_{i=1}^{N-1} i = \frac{N(N-1)}{2}$    (2.4.8)

or $E \propto N^2$.

Prior to synthesis, MAHA generates a list of all possible clock times. This consists of the union of the maximum path between two nodes over all node pairs.
The maximum path from a single node to every other node is found in O(E) time. Thus, construction of all possible (N × (N - 1)) clock times takes O(N × E) or O(N^3). These clock times are then sorted which has a complexity of O(N^2 log N). During the first program loop, or when either the resources or the clock cycle time change, the critical path must be determined, which takes O(N^2) time. The critical path is scheduled and has hardware allocated in linear time. Freedom of off-critical path operations is found in O(N) time; in the worst case, scheduling and allocation of all off-critical operation nodes is accomplished in O(N^2). With up to N designs possible from the most serial to most parallel design, synthesis of a fixed dataflow graph may take O(N^3).

Construction of the entire design space, which might be necessary to find the best single design meeting some constraints, may involve insertion of delay operations to “serialize” the design. This alters the dataflow graph. In the worst case, where a dataflow graph consists of N parallel paths with one operation in each path, up to N - 1 delay operations might be inserted. With synthesis of a fixed dataflow graph taking O(N^3), worst case synthesis of all possible designs is accomplished in O(N^4) time.

2.5 Loop Transformation in Data Path Synthesis

Synthesis tools for generating RT-level designs from behavioral descriptions often simplify the problem to make it more manageable. One simplifying assumption prevalent in data path synthesis is the removal of loops. With the exception of an implicit outermost loop for a given design, loops internal to a description, to our knowledge, do not share resources outside of the loop in any of the currently published systems [GK84, PK87, Cam85, McF86, PPM86]. Rather, the loop
is either ignored or modified by a human to remove feedback. A more robust solution is needed.

For approximating the area and time, removal of loops is a reasonable simplification for data path synthesis. After all, a loop is a control path construct and may have no data path hardware associated with it. However, if one wishes to synthesize a correct design which considers secondary effects such as multiplexers, buses, registers and, most importantly, a controller, then implementation and control of loops must be considered.

This section contains a method for transforming a cyclic dataflow graph. Such a mechanism is useful since current synthesis programs which determine the critical path or “color” the graph to detect mutually exclusive operations rely upon the acyclic nature of the dataflow graph [PP86, PG87]. However, a robust synthesis system must be fully cognizant of loops.

2.5.1 Simple Loop Transformation

Loops are explicitly represented in behavioral descriptions such as ISPS [Bar73]. (They are also easily detected using methods described in Aho [AHU74].) The difficulty lies in transforming the cyclic subgraph into an acyclic one suitable for data path synthesis while preserving the behavior. Even for the simple loop of Figure 2.15a which has a behavior of a* (where a* signifies that a is executed one or more times), disregarding the feedback edge may lose important control information. (En is the loop entrance and Ex is the loop exit.) Also, there is a precondition which forces the placement of a register somewhere in the feedback path as shown by the dashed line. This restriction could be extended to include the binding of values to registers at the top of the loop; the value(s) clocked at each iteration should use the same registers to reduce controller complexity.
However, as will be shown later, by considering the expected execution of the loop, registers will be properly placed.

One hardware description language used by many of the current synthesis programs is ISPS. ISPS is easily converted into a dataflow graph through an intermediate representation known as the Value Trace (VT). Translation of ISPS into the VT cleverly avoids the problem of loops by instantiating each loop into its own VT-body (or subroutine) [McF78]. Each VT-body can be synthesized individually (since an outermost loop is implicit) before merging with larger pieces of the behavior. However, if the behavior is flattened to allow optimizations over multiple levels of the Value Trace, another approach is necessary to accommodate loops.

The transformations proposed here treat a loop as a dataflow graph having mutually exclusive branches. One branch is taken for the loop continuation (restart in ISPS) and the other is taken for loop exit (leave in ISPS). It is assumed that there are no hidden side effects for any operations within the loop. Thus, the portion of the subgraph containing the loop can be analyzed in isolation.

Coloring of a graph is used to detect mutually exclusive operations. Special operations in the dataflow graph called distribute and join signify the start and finish of one or more mutually exclusive paths. They are special operations being merely “place holders” so that the dataflow is unbroken and dependencies are readily determined.

Figure 2.15: Simple loop. (a) Original dataflow; (b) Transformed dataflow.

The analogy between loops and conditional paths is plain. At the top of the loop, values are either entering the loop for the first time or are the results from a previous loop; this selection is analogous to a conditional join.
At the bottom of the loop, values are either propagated to the next operation or fed back to the top of the loop which is the same as a conditional distribute. Hence, one can substitute join for loop entrance and distribute for loop exit as they effectively represent the same operations. Since this leaves a mismatched conditional branch (a join precedes a distribute), the entrance (join) operation is moved after the exit (distribute) operation as shown in Figure 2.15b. This is only a modification to the dataflow graph; since no operations have been transposed, the order of operations is retained. Furthermore, since the a* loop behavior is determined by the controller (which will still execute a one or more times), the behavior is unchanged. Note that the feedback register cuts lie on either end of the new subgraph. This would be the same physical register as values produced at the bottom of the loop are then implicitly available at the top of the loop.

Since the behavior in both branches of the Ex/En conditional is identical (no-op), in this particular example it can be collapsed into a single edge in the dataflow graph. Execution differences and register clocking are handled by the controller.

Transformation of a more complicated loop is shown in Figure 2.16a. The behavior is written a(ba)*; En is moved resulting in the dataflow graph of Figure 2.16b. As in the first example, values produced at En are implicitly fed back to the top of the loop via use of the same register prior to a and after En.

Loops which have their exit condition within a nested conditional path, such as that shown in Figure 2.17, are a special case. Clearly, the single transformation results in two intertwined conditional paths as depicted in Figure 2.18; such a graph cannot be colored in order to determine mutual exclusion properly.
The transformation can be completed properly by noting that distribute and join operations as well as entrance and exit operations are place holders and represent no physical behavior. Thus, the dataflow edge which bypasses the J operation can be rerouted through it. Behavior is retained by duplicating the exit/entrance pair on either side of the J operation as shown in Figure 2.19. Similar to the first example where adjacent branches of a conditional path are identical, a single edge can be substituted resulting in the dataflow graph of Figure 2.20. The controller must track whether the loop is active or being terminated and operate the data path accordingly.

Figure 2.16: Loop with mid-graph exit condition. (a) Original dataflow; (b) Modified dataflow.

Figure 2.17: Loop with mutual exclusion and mid-graph exit

Figure 2.18: Loop after simple transformation

Figure 2.19: Disconnection of mutually exclusive branches

Figure 2.20: Completed loop transformation

2.5.2 Transformation Algorithm

Figure 2.21 contains the algorithm for converting a cyclic dataflow description into an acyclic graph. The outermost repeat isolates each loop and writes the loop identifier and loop control parameters such as entrance, exit, and feedback edge to a loop control file. (This information is used for control synthesis and estimation; it could also be used for expected execution analysis.) The inner repeats break overlapping mutually exclusive branches (dist/join with Ex/En) and remove any useless Ex/En pairs. Any new entrance/exit pairs are written to the control file with the loop identifier. The list of loop transformations applicable is shown in Figures 2.22a through c. The transformation shown in Figure 2.22b is applied recursively until all distribute/join and entrance/exit dataflow subgraphs are “untwined”.

    procedure break_loop(G)
    begin
      repeat
        Find a feedback edge. If none, exit repeat.
        Mark feedback and loop entrance edges.
        Find all matching loop exit edges.
        If more than one unique edge, exit with error.
        Write loop id and entrance, exit, and feedback edges to loop control file.
        Apply transformation of Figure 2.22a.
        repeat
          Find overlapping dist/join-entrance/exit pairs. If none, exit repeat.
          Apply transformation of Figure 2.22b to remove overlap.
          Write entrance/exit pair and loop id to control file.
        end repeat
        repeat
          Find exit/entrance pair with matching no-op edges. If none, exit repeat.
          Apply transformation of Figure 2.22c.
          Write entrance/exit pair and loop id to control file.
        end repeat
      end repeat
    end

Figure 2.21: Procedure for Transforming Cyclic into Acyclic Dataflow

Figure 2.22: Loop Transformations. (a) Breaking the loop; (b) Break propagation; (c) Break compaction.

With the transformations complete, any acyclic data path synthesis program can process the graph. A control path synthesis program or estimation tool can utilize the loop information to operate the hardware properly as it can be given all information regarding which conditionals are part of a specific loop.

Although data path synthesis of cyclic graphs is possible after performing the loop transformations to make them acyclic, good designs are not necessarily produced by the synthesis tool. This is an unfortunate artifact of any acyclic transformation of loops; from the view of data path synthesis, every operation in the graph is only executed once. As a result, hardware and register allocation may be non-optimal and timing estimates for the graph are minimal rather than typical values.
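The first step of the procedure in Figure 2.21, finding a feedback edge, can be sketched with a standard depth-first search. The sketch below is not the thesis software: it assumes the graph is stored as a dictionary of successor lists, the function names and the example graph are hypothetical, and it only removes and records feedback edges. The repositioning of entrance and exit operations (Figures 2.22a through c) and the loop control file are not modeled.

```python
def find_feedback_edges(graph):
    """Return edges (u, v) that close a cycle, found by depth-first search.

    `graph` maps each node to a list of successor nodes.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}
    for succs in graph.values():
        for s in succs:
            color.setdefault(s, WHITE)
    feedback = []

    def visit(u):
        color[u] = GRAY
        for v in graph.get(u, []):
            if color[v] == GRAY:      # v is on the current search path
                feedback.append((u, v))
            elif color[v] == WHITE:
                visit(v)
        color[u] = BLACK

    for n in list(color):
        if color[n] == WHITE:
            visit(n)
    return feedback

def break_loops(graph):
    """Return an acyclic copy of `graph` and the removed feedback edges."""
    feedback = set(find_feedback_edges(graph))
    acyclic = {u: [v for v in succs if (u, v) not in feedback]
               for u, succs in graph.items()}
    return acyclic, sorted(feedback)

# Hypothetical loop with behavior a(ba)*: b feeds back to the entrance En.
graph = {"En": ["a"], "a": ["Ex"], "Ex": ["b", "out"], "b": ["En"], "out": []}
acyclic, removed = break_loops(graph)
print(removed)  # [('b', 'En')]
```

The removed edges correspond to the feedback edges that would be written to the loop control file, so that a controller can still iterate the loop even though the data path graph is acyclic.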
A method which reduces the adverse effect of loop transformations is unrolling the loop into two or more iterations prior to synthesis. Unrolling of simple graphs merely entails replicating the acyclic loop region. However, for more complex graphs having exit conditions inside mutually exclusive branches, replication of the subgraph being iterated occurs along the feedback edge as indicated by the ellipse-marked edge in Figure 2.16b.

As the loop is unrolled and designs evaluated, register and hardware allocation quickly fall into place, e.g. registers appear at the loop entrance or exit and feedback edges, while hardware is shared between loops. By increasing the number of unrolled loops until little change is detected in the loop hardware allocation, the best hardware allocation and register placement can be determined for the loop.

The loop transformation software is a separate package for preprocessing a potentially cyclic graph; the output is usable by either the non-pipeline (MAHA) or pipeline (Sehwa) synthesis packages for the data path. In addition, the control area estimation package utilizes the loop control information generated. The effect of loop transformation was determined experimentally by running MAHA and analyzing the results manually. This could be integrated as an automatic extension to the transformation software in the future.

2.5.3 Runtime Analysis for Loop Transformation

Each feedback edge in procedure break_loop can be found in O(E) time where E is the number of edges in the graph. N is the number of nodes. Backtracking to find the loop entrance and exit edges may take an additional O(E) steps. Each isolated subgraph which comprises the loop has N' nodes and E' edges where N' < N and E' < E, when the maximum number of exit/entrance and dist/join pairs is one per node.
From Equation 2.4.8, a worst case conversion time of O(N^3) occurs for a graph entirely composed of conditional operations.

2.6 Examples and Synthesis Results

A group of example data flow graphs which have varying degrees of parallelism, sharable operators, conditional paths, and operator complexity was selected for validating MAHA. Some of these graphs are small enough that the results generated by MAHA could be verified using exhaustive search. For the remainder, other tools were used to search for good designs. (Samples of MAHA program input and output can be found in Appendix B.)

The first example is a simple one to demonstrate the power of the enhanced version of MAHA, as shown in Figure 2.23. Each operation produces either the sum or product of its inputs and is 16-bits wide for both input and output. (For clarity, some input edges external to the figure are not shown.) Clearly, the most expensive (and fastest) design of Figure 2.23 would require the purchase of five add and three mul operators; the cheapest would be the purchase of a single mul and add. Table 2.1 shows the results from the most parallel to the most serial implementation which are individually shown in Figures 2.24 through 2.26. Only the non-inferior designs are listed. (Design A is inferior to design B if A has an area and time greater than or equal to that of B, excepting where both parameter pairs are equal.) Time and area are in nanoseconds and square mils, respectively, and area does not include wiring. Modules used are based upon the RCA 3-micron CADDAS library with parameters listed in Table 2.2 (see footnote 21).
Footnote 21: The addition delays in this module library are somewhat pessimistic due to a historical error in reading a graph, and should not be taken as representative of actual modules.

Figure 2.23: Demonstration data flow Graph

Table 2.1: MAHA Results for Simple Example

    Time          Basic             With reg          With reg + mux
    Partitions    Time    Area      Time    Area      Time    Area
    1             2110    168000    2115    168512    2115    168512
    2             2110    110600    2120    112136    2128    113288
    3             2145    106400    2160    108960    2184    110688
    6             2250     53200    2280     57808    2328     60688

Table 2.2: Modules Used for Synthesis

    Type      Bitwidth    Area (mil^2)    Prop. Delay (ns)
    add       16           4200           340
    sub       16           4200           340
    mul       16          49000           375
    cmp       16           4200           340
    and        2              5             3
    shift     16             32             2
    inv        1              2             2
    bufpad     1             26             4
    reg        1             32             5
    mux        1             18             4

Figure 2.24: MAHA Results for 2-stage Design

Figure 2.25: MAHA Results for 3-stage Design

Figure 2.26: MAHA Results for Cheapest Design

Table 2.3: Sehwa Results for Simple Example

    Basic                                      With reg
    Stages  Init. Intrvl  Time    Area         Stages  Init. Intrvl  Time    Area
    6       1             2250    168000       3       1             2160    170560
    10      5             3750     53200       7       5             2660     59856

In Table 2.1, Basic results are for operator scheduling and allocation only. When MAHA is instructed to include registers (With reg), a fixed delay of 5 nanoseconds is added per clock cycle. For the fastest design, a single 16-bit register is needed after add5 which impacts the circuit area. Finally, when MAHA allocates both registers and multiplexers, circuit time and area are impacted further.

The results for the first example were verified manually; however, such a process is tedious and the possibility for error exists.
Fortunately, several alternative tools are available for verifying the results: Sehwa, optsyn, and randsyn.

Sehwa is a pipeline synthesis program written by Park as part of his dissertation on pipeline synthesis [PP86]. Results produced by Sehwa are shown in Table 2.3; “Time” in this table represents total time from top to bottom as opposed to initiation interval. Sehwa does not handle registers in the same fashion as MAHA due to the pipelined architecture. In addition, multiplexers are not addressed in Sehwa, so a comparison against MAHA is restricted to the basic case (no registers or multiplexers).

Since the fastest design implies a 1:1 mapping of operators onto operations, results from the two synthesis programs have identical area in the basic case. However, the times are different due to the difference between pipelined and non-pipelined architectures. The overall input-to-output delay for the fastest pipelined design is often slower than the comparable non-pipelined design due to the penalty of partitions being equal size even though operator times vary; however, throughput of the pipelined design is far better than the non-pipelined result. In the example of Figure 2.23, the first output value takes 2250 ns and 2110 ns to compute for the pipelined and non-pipelined designs, respectively. This 2110 ns value is fixed for any output values in the non-pipelined design. Conversely, the pipelined design produces a new result each 375 ns (after the initial 2250 ns delay) until the pipe is flushed. The fastest time computed by MAHA should never be slower than the fastest pipelined time, which provides a comparison point.

Converse to the fastest design, only the area is identical between MAHA and Sehwa results for the cheapest design.
When the cheapest pipelined and non-pipelined designs have the same number of partitions, the time computed should also match. The cheapest design of Figure 2.23 is depicted in Figures 2.26 and 2.27 as produced by MAHA and Sehwa, respectively. For the basic design, the areas are the same.

A time comparison between MAHA and Sehwa is more difficult for the cheapest design. The time shown in the table is Sehwa’s begin-to-end delay for a single dataset and not the initiation interval; hence, it is possible for MAHA to produce a better overall time than Sehwa. Despite these differences, Sehwa is an effective tool for comparing both the cheapest and fastest designs in many cases. Table 2.4 lists the results for the data flow graphs shown in Figures 2.23 through 2.35. Only the basic module area is used; registers and multiplexers are not considered.

Additional synthesis tools, optsyn and randsyn, were developed specifically to verify MAHA synthesis results. Both of these tools determine the operator scheduling given the exact hardware resources. Optsyn performs optimal synthesis by constructing every legal schedule permutation within the resource limitations. Unfortunately, computation time is exponential allowing only the smaller graphs to be verified with this procedure. Randsyn generates legal schedule permutations using a Monte-Carlo approach; here, the optimal design cannot be guaranteed. However, given a sufficient design sample size, the probable best design is indicated.

Given other synthesis utilities and some human designers by which to compare MAHA, a number of descriptions were attempted. Behavior ranged from irregular to regular structures and included conditional path examples; results are shown in Figures 2.28 through 2.35.
Note that the human designers produced excellent designs in all but a few cases.

Figure 2.27: Sehwa Results for Cheapest Design

Table 2.4: Comparison of Synthesis Results: Sehwa

                                                     MAHA               Sehwa
    Description               Figure     Design          Time    Area       Time    Area
    Simple Example            2.23       Fastest         2110    168000     2250    168000
                                         Cheapest [1]    2250     53200     3750     53200
    Parallel Example          2.28       Fastest         2720     67200     2720     67200
                                         Cheapest [2]    2720      8400     3060      8400
    Temperature Controller    2.29       Fastest         1711     46343     2380     46343
                                         Cheapest        4152     12663     3740     12663
    Multiplier Example        2.30 [3]   Fastest         2075    217000     2250    217000
                                         Cheapest        2625     53200     2625     53200
    FIR Filter                2.31 [3]   Fastest         3095    455000     3375    455000
                                         Cheapest        5625     53200     5625     53200
    AR Lattice Filter         2.32       Fastest         2825    834400     3000    834400
                                         Cheapest        6750     53200     7125     53200
    Random Graph              2.33       Fastest         2075    207200     2250    207200
                                         Cheapest        6000     57400     6000     57400
    Simple Conditional [4]    2.34 [3]   Fastest         1360     16800     -       -
                                         Cheapest        1360      8400     -       -
    Large Conditional         2.35 [3]   Fastest         1700     46200     1700     46200
                                         Cheapest        2380      8400     2040      8400

    [1] Begin-to-end data path time is slower for Sehwa in the cheapest case due to a number of inactive clock cycles added to achieve pipelining. Only six clock cycles (2250 ns) showed module activity.
    [2] Time is slower for Sehwa in the cheapest case due to the additional step it added for a join node which has no delay.
    [3] Figures 2.30, 2.31, 2.34 and 2.35 originally appeared in Park’s dissertation [PP86].
    [4] Sehwa failed to give an answer for this graph. Since the graph is small, results were verified manually.
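The trade-off between begin-to-end delay and throughput described for the fastest designs of Figure 2.23 can be made concrete with a short calculation. The sketch below uses only the figures quoted in the text: 2110 ns per result for the non-pipelined design, and a 2250 ns initial delay followed by one new result every 375 ns for the pipelined design. The function names are hypothetical, and it is assumed that the non-pipelined design processes datasets strictly one after another.

```python
def non_pipelined_time(n, latency_ns=2110):
    """Total time for n datasets when each takes the full latency."""
    return n * latency_ns

def pipelined_time(n, latency_ns=2250, interval_ns=375):
    """Total time for n datasets: initial delay, then one result per interval."""
    return latency_ns + (n - 1) * interval_ns

for n in (1, 2, 10):
    print(n, non_pipelined_time(n), pipelined_time(n))
# 1 2110 2250
# 2 4220 2625
# 10 21100 5625
```

Under these assumptions the pipelined design is slower for a single dataset but is ahead from the second dataset onward.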
Figure 2.28: Parallel Example

Figure 2.29: Temperature Controller

Figure 2.30: Multiplier Example

Figure 2.31: FIR Filter

Figure 2.32: AR Lattice Filter

Figure 2.33: Random Graph

Figure 2.34: Simple Conditional

Figure 2.35: Large Conditional

As an additional check, any comparisons between MAHA and optsyn or randsyn synthesis results which differed were checked against an exhaustive search design from Sehwa. In essence, Sehwa was used to confirm the validity of optsyn and randsyn. Overall errors of the five graphs for which optimal and random synthesis designs were generated as compared to MAHA are less than 2%. Only 2 of the 27 total designs produced by MAHA using this module set were shown to be inferior. Errors are sufficiently small to confirm that MAHA is operating in the intended fashion; a more fundamental examination of the algorithm is included in the next subsection.

In addition to validation of MAHA for basic synthesis, verification of register and multiplexer extensions was performed manually for the cheapest designs. (The cheapest design was chosen since it has the lowest operator complexity giving a reasonable design for a human to optimize.) Table 2.5 summarizes the applicable examples. In all cases, the register and multiplexer values met the expected non-optimized costs (e.g. no sharing of register and multiplexer resources).

Since MAHA was developed, another utility called MABAL has appeared. This tool accepts the scheduling and allocation from MAHA and attaches the routing and storage hardware.
Unlike MAHA, MABAL is capable of allocating both multiplexers and buses as well as registers. MABAL uses heuristics to minimize these values. Thus, the register and multiplexer results are better than MAHA and the usefulness of register and multiplexer allocation within MAHA is diminished. However, MAHA does provide a useful “first cut” at the completed data path, and is useful for non-pipelined designs with inner loops, which MABAL cannot handle.

Register and multiplexer cost can contribute a significant value for heavily serialized designs as shown in Table 2.5. A shift in both the area and time of the designs is noted. In fact, for the FIR filter, inclusion of register and multiplexer costs causes a different implementation to be the cheapest. Whereas the basic design used 15 stages, a design which included routing and storage used 8 stages. Introducing register and multiplexer effects favors the lower partition count in cost and also resulted in a lower time than one would expect, an observation also recognized by McFarland [McF87]. This example also illustrates the problem with data path synthesis: unless all steps of the synthesis process are done simultaneously, one may not achieve optimal or near-optimal results.

Figure 2.36: Synthesis Results for Small example. [Plot of area (square mil) versus time (ns); series: MAHA, exhaustive search, human design.]

Figure 2.37: Synthesis Results for Multiplier. [Plot of area (square mil) versus time (ns); series: MAHA, exhaustive search, human design.]

Figure 2.38: Synthesis Results for FIR Filter. [Plot of area (square mil) versus time (ns); series: MAHA, Monte Carlo (260,000 pts), human design.]

Figure 2.39: Synthesis Results for AR Lattice Filter. [Plot of area (square mil) versus time (ns); series: MAHA, Monte Carlo (520,000 pts), human design.]

Figure 2.40: Synthesis Results for Random Graph. [Plot of area (square mil) versus time (ns); series: MAHA, Monte Carlo (380,000 pts).]

Table 2.5: Summary of Register and Multiplexer Area for Cheapest Design

    Description               Figure    Synthesis    Basic    With reg.    With reg + mux
    Simple Example            2.23      Time          2250     2280         2328
                                        Area         53200    57808        60688
    Parallel Example          2.28      Time          2720     2760         2856
                                        Area          8400    16592        20624
    Temperature Controller    2.29      Time          4152     4212         4308
                                        Area         12663    18039        23583
    Multiplier                2.30      Time          2625     2660         2716
                                        Area         53200    57168        58680
    FIR Filter                2.31      Time          5625     5760         5856
                                        Area         53200    65592        77112
    AR Lattice Filter         2.32      Time          6750     6840         7128
                                        Area         53200    74704        85360
    Random Graph              2.33      Time          6000     6080         6144
                                        Area         57400    80440        94840
    Simple Conditional        2.34      Time          1360     1380         1412
                                        Area          8400    11472        12624
    Large Conditional         2.35      Time          2380     2415         2499
                                        Area          8400    13520        15392

2.7 Limitations of MAHA

As a non-pipelined synthesis program, can MAHA be used in higher-level synthesis as a basic analysis tool? In order to answer this question, an understanding of the limitations of MAHA is necessary. It has been shown in the previous section that MAHA produces designs that are not necessarily optimal. It is therefore useful to know the frequency and extent by which MAHA “misses” the optimal design. There are several factors which can affect the synthesis results including

• the size of the dataflow graph,
• the number of operation types in the graph, and
• the selected module set areas and delays.

First, specific errors noted in the previous section will be detailed. There are two cases where MAHA generates designs which are inferior to Monte Carlo and human designs.

One design where MAHA generates an inferior result is for the AR lattice filter where the resources are 2 multipliers and 1 adder. The actual partitioning results are shown in Figures 2.41 and 2.42. Note that although the clock cycle time is identical, the human design took three fewer partitions.
This discrepancy is due to both the graph itself and the technique by which MAHA serializes it. The graph has sixteen unique but identical critical paths which are interwoven. Since MAHA only uses a single critical path, one is arbitrarily chosen. The remaining “non-”critical paths will be scheduled and allocated once the critical path is completed. This path will also be stretched in order to realize the more serialized designs. It is in the stretching of this graph with the given resources that MAHA runs into a problem. With so many identical critical paths, the one chosen runs through the two adders which are marked ADD in Figure 2.41. Each of these particular add operations is pushed to a later time whenever a conflict occurs. These operations cannot be performed until all other competing add operations have completed, contributing a clock cycle. This effect also extends to the surrounding mul operations. In particular, with four multipliers contending for two slots, only one delay should be needed. However, since MAHA inserts a critical path delay against the first non-critical path resource conflict, graph symmetry is not considered and delays are not inserted optimally. Furthermore, in future synthesis, that non-critical path operation is forced to occur in the same timestep as that delay node on the critical path. Each cluster bounded by the boxes shown is scheduled in one additional clock cycle each, giving the three clock cycle difference.

The other design encountered which is non-optimal is the random dataflow graph having 16 add, 10 sub and 2 mul operations. Resources consisted of one subtractor, one multiplier, and two adders.
In this case, stretching the dataflow graph resulted in a single stage where a subtractor was prevented from being used, causing an additional clock cycle to be required.

Figure 2.41: MAHA design of AR filter (schedule of mul and add operations over 16 time steps; the two add operations on the chosen critical path are marked ADD)

Figure 2.42: Human design of AR filter (schedule of mul and add operations over 13 time steps)

It should be noted that the AR lattice filter is probably a worst case example for non-pipelined design. Clearly, this highly symmetrical graph is best implemented using a pipelined architecture. However, both examples highlight one potential weakness of this algorithm. In stretching the dataflow graph to produce more serial designs, there is no understanding of shapes and layout of the graph. In all cases, a human designer could observe the pattern and exploit the symmetry. Graph analysis and pattern detection at a high level would improve the intelligence of the algorithm. However, given the complexity of such a task and the present state of MAHA, it is not clear such an activity would greatly improve the overall quality of the synthesis results.

In general, synthesis errors on small dataflow graphs are greatly pronounced since a small change has a large impact. However, optimal (exhaustive search) techniques can be employed in such cases as needed without excessive computation time. Human designers also excel in this area. Conversely, large designs are difficult for a human designer to do well (given a limited time to produce a circuit) and exhaustive techniques are not feasible. Given the heuristic nature of synthesis, one would expect most designs to be non-optimal, but still acceptable.
Unfortunately, criteria for what is acceptable are usually subjective due to external design factors such as processing costs, production run size, and the application environment.

The number of module types and their areas and delays also affect the synthesis process. A graph with a wide range of operation types may limit the flexibility of MAHA. However, if many of these operations are available in a single complex module such as an ALU, then sharing can be improved. This is a module-set tradeoff problem which can greatly impact the results. Also, in the previous section, one type of add and one type of multiply were chosen. If a number of adders and multipliers are available, which set should be chosen? Clearly, this is a separate topic of research in which some progress has been made [Jai89].

Given the RCA 3 µm CADDAS library, several types of multipliers and adders were constructed; one set was shown earlier. Additional modules are shown in Table 2.6 (see footnote 22). Using the same dataflow graphs given in Section 2.6, other module sets were constructed and used by MAHA for synthesis. (No attempt was made to use two different modules of the same type in a single design.) Since the programs optsyn and randsyn accept a quantized module set as input, a direct comparison between them and MAHA is possible. To simplify the analysis, every MAHA design point was distinguished as follows:

1. For each MAHA design point, the same module set and quantities were input into optsyn/randsyn to compare the time.
2. After completing step (1), the module set quantities were lowered to ascertain whether the same circuit time could be met using a smaller module set. The area difference, if any, was then compared.

Table 2.6: Module Library of Adders and Multipliers

Type    Bitwidth  Area (mil²)  Prop. Delay (ns)
add-f   16        4200         340
add-m   16        2880         530
add-s   16        1200         1510
sub-f   16        4200         340
sub-m   16        2880         530
sub-s   16        1200         1510
mul-f   16        49000        375
mul-m   16        9800         2950
mul-s   16        7100         7370

Footnote 22: The addition delays in this module library are somewhat pessimistic due to a historical error in reading a graph, and should not be taken as representative of actual modules.

With this approach, each design point produced by MAHA could be compared in both area and time against the verification utilities. Three module sets chosen by SLIMOS as being the most useful were used and are listed in Table 2.7 [JPP88]. These module sets were applied against a number of dataflow graphs to ascertain errors in both area and time as shown in the first two columns of Tables 2.8 and 2.9. (An elliptical wave filter shown in Figure 2.43 was included so that a comparison could be made with other synthesis tools.) MAHA was then executed; the total number of non-inferior designs obtained is listed in the third column. The quantities of these designs which exhibited optimal area and optimal time are shown in Tables 2.8 and 2.9, respectively. Optimal area means that no design of the same (or lower) time could be realized with less hardware; optimal time means that no design of the same (or less) area could be constructed with a smaller circuit delay. From these results, the maximum and average error over all non-optimal designs is computed as well as the average error over all designs (both optimal and non-optimal).

Table 2.7: Module Sets Evaluated by MAHA

Name     Hardware Modules
fast     mul-f   add-f   sub-f
medium   mul-m   add-s   sub-s
slow     mul-s   add-s   sub-s

At first glance, it might appear that the synthesis program is not producing good results for some descriptions.
In particular, the elliptical wave filter MAHA designs diverge from the Monte Carlo results by more than a single module. Area differences are due to the heuristic nature of the scheduling algorithm. Unfortunately, time delay error is magnified in this dataflow graph, since time is the product of clock cycle time and the number of partitions. For example, the most serial elliptical filter design had 28 time steps. The worst case arises for 16 partitions when the clock is off by only 11%.

Table 2.8: Comparison of MAHA Results to Human/Random: Area (1)

Dataflow      Module  Total    Optimal  Max.          Ave.          Overall
Graph         Set     Designs  Designs  Err. (%) (2)  Err. (%) (3)  Err. (%)
Parallel      fast    4        4        0.0 (0)       0.0           0.0
              medium  4        4        0.0 (0)       0.0           0.0
              slow    4        4        0.0 (0)       0.0           0.0
Multiplier    fast    5        5        0.0 (0)       0.0           0.0
              medium  5        5        0.0 (0)       0.0           0.0
              slow    6        5        20.0 (*)      20.0          3.3
FIR Filter    fast    5        5        0.0 (0)       0.0           0.0
              medium  8        7        42.0 (1)      42.0          5.3
              slow    10       8        15.5 (*)      12.3          2.5
AR Filter     fast    6        6        0.0 (0)       0.0           0.0
              medium  9        7        55.5 (1)      50.0          11.1
              slow    10       8        15.5 (*)      12.4          2.5
Random Gr.    fast    8        8        0.0 (0)       0.0           0.0
              medium  7        7        0.0 (0)       0.0           0.0
              slow    10       10       0.0 (0)       0.0           0.0
Simple Cond.  fast    2        2        0.0 (0)       0.0           0.0
              medium  2        2        0.0 (0)       0.0           0.0
              slow    2        2        0.0 (0)       0.0           0.0
Large Cond.   fast    2        2        0.0 (0)       0.0           0.0
              medium  2        2        0.0 (0)       0.0           0.0
              slow    2        2        0.0 (0)       0.0           0.0
Ellip. Filt.  fast    7        5        17.4 (*)      10.6          3.0
              medium  12       5        91.1 (4)      58.8          34.2
              slow    12       7        66.3 (1)      38.0          15.8

(1) A comparison of area between the designs is made where the times are identical.
(2) The number in parentheses is the number of extra modules purchased. A value of * means that the total number of modules is identical, but the individual module type counts are different.
(3) The average error is the average of all designs which did not have the optimal area.

Table 2.9: Comparison of MAHA to Human/Random: Time (1)

Dataflow      Module  Total    Optimal  Max.          Ave.          Overall
Graph         Set     Designs  Designs  Err. (%) (2)  Err. (%) (3)  Err. (%)
Parallel      fast    4        4        0.0 (0)       0.0           0.0
              medium  4        4        0.0 (0)       0.0           0.0
              slow    4        4        0.0 (0)       0.0           0.0
Multiplier    fast    5        5        0.0 (0)       0.0           0.0
              medium  5        5        0.0 (0)       0.0           0.0
              slow    6        4        20.5 (*)      11.9          4.0
FIR Filter    fast    5        5        0.0 (0)       0.0           0.0
              medium  8        7        16.7 (1)      16.7          2.1
              slow    10       8        16.7 (1)      13.7          2.7
AR Filter     fast    6        4        23.1 (3)      16.5          5.5
              medium  9        5        25.0 (2)      15.9          8.8
              slow    10       5        20.5 (1)      18.4          9.2
Random Gr.    fast    8        7        10.0 (1)      10.0          1.3
              medium  7        6        10.0 (1)      10.0          1.4
              slow    10       9        2.0 (*)       2.0           0.2
Simple Cond.  fast    2        2        0.0 (0)       0.0           0.0
              medium  2        2        0.0 (0)       0.0           0.0
              slow    2        2        0.0 (0)       0.0           0.0
Large Cond.   fast    2        1        16.7 (1)      16.7          8.3
              medium  2        1        16.7 (1)      16.7          8.3
              slow    2        1        16.7 (1)      16.7          8.3
Ellip. Filt.  fast    7        4        7.1 (1)       5.7           2.4
              medium  12       4        44.4 (3)      27.4          18.3
              slow    12       4        26.9 (2)      16.3          10.9

(1) A comparison of time between the designs is made where the areas are identical.
(2) The number in parentheses is the total time difference expressed as the number of extra minimum clock delays. A value of * means that the delay is less than the minimum clock (i.e. the faster module).
(3) The average error is the average of all designs which did not have the optimal time.

Figure 2.43: Elliptical Wave Filter

A comparison of the MAHA elliptical wave filter designs against other synthesis packages is plotted in Figure 2.44. Other tools include HAL [PK87], EMUCS [MPC88], SPAID [HE89], CATREE [GE88], and SPLICER [Pan88]. In this example, MAHA not only produced better designs when the same module sets are used, but it also explored a larger portion of the design space.

Figure 2.44: Elliptical Filter Designs of Several Synthesis Systems (area in mil² versus time step; MAHA, Monte Carlo, HAL, EMUCS, SPAID, CATREE, and SPLICER)

Another approach to determine the usefulness of MAHA is to analyze the data to determine what causes the synthesis errors. Table 2.10 contains a summary of the comparison ranked by graph size (the number of nodes in the dataflow graph) as compared to the time and area variance in module area and delay units. If the numbers of designs that are optimal in area and in time differ, these values are separated by a slash in the Optimal Designs column.

Table 2.10: Summary of MAHA Validation

                Graph  Total    Optimal  Worst Case (modules)
Description     Size   Designs  Designs  Area Increase (1)  Time Increase (2)
Simple Cond.    6      6        6        0                  0
Multiplier      9      16       15       *                  *
Large Cond.     15     6        6/4      0                  1
Parallel        16     12       12       0                  0
FIR Filter      23     23       20       1                  1
AR Filter       28     25       21/16    1                  2
Random Gr.      28     25       25/22    0                  1
Ellipt. Filt.   36     31       17/15    4                  3

(1) The number indicates the quantity of extra modules purchased. A value of * means that the total number of modules is identical, but the individual module type counts are different.
(2) The number is the total time difference expressed as the number of extra minimum clock delays. A value of * means that the delay is less than the minimum clock (i.e. the faster module).

This second perspective of the non-pipelined synthesis program also indicates that MAHA is producing quality designs. Based upon the analysis, the operation of MAHA can be summarized as follows:

• The number of time steps (partitions) or module difference between the optimal and MAHA results appears to be weakly influenced by the dataflow graph size.
• Small dataflow graphs and conditional path graphs are subject to greater error in actual area. This is due to the enormous impact of being off the optimal design by even a single module.
• Similarly, as a given dataflow graph is partitioned into more time steps, its likely error increases. Serializing a design involves many more possibilities and, hence, a greater risk that the algorithm will select a non-optimal design. As the area shrinks and the number of time steps rises, choosing the wrong value has significant impact on the resultant design.
• Finally, the impact on time is much more pronounced than the impact on area for serialized designs. This is due to the discrete nature of time-step scheduling; being off by a slight amount on each clock cycle results in an error magnified by the number of clock cycles.

Another aspect of data path synthesis will demonstrate additional non-optimality as a result of the allocation process. In data path synthesis, the way in which a dataflow graph is partitioned into timesteps can affect the results for a complete design which includes registers and multiplexers. Figures 2.45 and 2.46 illustrate the problem. In this example using the FIR filter, the two different partitioning schemes result in the same time and primary hardware allocation: an eight time-step implementation utilizing a multiplier and two adders; both of these are optimal designs. In Figure 2.45, the REAL optimal register allocation scheme utilizes 10 registers. However, in Figure 2.46, only 2 registers are needed. The total area difference between the different implementations due to register costs contributes nearly another adder in area. However, the additional multiplexer area due to increased register sharing might overwhelm any savings noted in register area.
In dynamic schemes where register cost is negligible, the register count difference would not present a problem. However, for the static case, the area impact is not negligible. Unfortunately, unless one performs exhaustive analysis including all contributing effects, it is unlikely that such design differences will be detected. In short, the non-optimality of the designs produced by MAHA is of the same order of magnitude as the secondary effects due to register allocation and partitioning. Until a method which accounts for secondary effects is introduced, the fundamental MAHA algorithm is adequate.

Figure 2.45: Sample Schedule of FIR Filter

Figure 2.46: Identical Delay/Cost of FIR Filter with Fewer Registers/Muxes

2.8 Summary

MAHA is a design tool which assigns operations in a dataflow representation to hardware operators. The resulting bindings can then be passed to a placement and routing program to produce silicon. As an example, using the MP2D cell library from RCA as the hardware module library, a dataflow representation can be taken to silicon using SLIMOS to select the module set, MAHA to perform the partitioning and allocation, and MABAL to assign the routing and storage hardware [KP90]. The resulting RT level design would be fed to MP2D to produce the final layout.

MAHA illustrates a flexibility not found with other synthesizers, including its adaptability to either area or speed constraints, depending on the application. MAHA currently assigns operations to operators, schedules when the operations should or must occur, and allows exploration of the design space given the constraints of the user. Registers are assigned to the data paths as necessary, although MAHA does not attempt to share registers during free clock cycles.
At present, multiplexers are indicated where needed, with no consideration of when it would be advantageous to use a bus or other types of control for signal paths.

Chapter 3

Area Estimation of Data Path Controllers

3.1 Introduction

Control path synthesis has been heavily researched in the past, but today receives little attention. Due to the regular structures of controllers (both PLAs and microcode), synthesis issues have been mostly resolved excepting area optimization. Current control path literature concentrates upon either minimizing the product terms in PLAs or construction of microcode and nanocode controllers to produce smaller control area [KC86, GR83, WC88].

At present, PLAs are the predominant method used for control within VLSI chips. Past research has resulted in a number of automated synthesis tools for PLAs [Ham83, DSVV83, KY85]. With a finite state machine (FSM) description as their inputs, these tools produce a PLA layout. As indicated in Figure 3.1, these PLAs consist of two regions. On the left side, one or more of the inputs (or their inverse) are ANDed together to form each product term. On the right side, each output line is driven by one or more product terms which are ORed together. Thus, given a set of active inputs, the PLA will drive some set of output lines. An FSM can be realized by having a subset of the input and output lines represent the current state and next state, respectively. These lines are then connected through a feedback register as shown in the figure.

During circuit design, the data path is produced, and control of the operators, registers, and buses/multiplexers is determined. Thus, control path synthesis is performed as a distinct step following data path synthesis.
However, in the course of evaluating a large design space or complex circuits where multiple designs are constructed, such methods are computationally expensive. As shown in Figure 3.2, the PLA synthesis execution time increases exponentially as the number of control states is increased. The existence of a control area predictor would greatly enhance the ability of a system or designer to rapidly explore the design space.

Figure 3.1: PLA Finite State Machine (inputs, product terms, outputs, and a register feeding the state bits back to the inputs)

Control area is affected by the data path hardware allocation and scheduling which, in turn, is determined by the dataflow algorithm being synthesized, resource library, and user constraints. A PLA generation tool is usually used after forming the data path using these input parameters. These data path input parameters might also be provided by prediction tools. In this case, specific scheduling and allocation of individual operations will not be known; hence, a PLA synthesis tool cannot be used at that time for building a controller. However, a control area predictor may be capable of utilizing data path estimates.

A predictor based solely upon data path parameters should be capable of integration into data path synthesis programs to immediately assess design changes after partial synthesis. It could also be used in conjunction with data path estimation tools to more quickly explore the design space. With such a prediction, a method is available for evaluating control path/data path tradeoffs prior to data path synthesis.
Hence, the research problem described in this chapter is to predict control area from either synthesized or estimated data paths.

In this chapter a theoretical analysis of the area of PLA finite state machines is presented. The solution approach entails:

• describing the PLA area in terms of its control parameters (inputs, outputs, and product terms),
• predicting these parameters using data path attributes,
• considering control of pipelined designs as a special case,
• extending the predictive model to folded PLAs, and
• validating the models using actual designs.

The PLA area estimation technique examines the dataflow graph, its scheduling into timesteps, and register and bus/multiplexer control requirements.

Figure 3.2: Computer Runtime (runtime in ms versus number of states; Berkeley and Predictor)

Although prior discussion has focused upon area, operation time of a PLA cannot be neglected. However, advances in technology have resulted in PLA controllers being used effectively up to 40 MHz. This is reflected in actual designs where high speed circuits have previously used other control methods such as random logic or timing generators. The use of PLA controllers is not believed to be a limitation in most digital designs. Therefore, PLA delay will not be examined.

In the next section a basic PLA controller area prediction model is presented, followed by an analysis of loops and conditional paths. Estimation of folded PLAs and a brief discussion of pipelined control are also included. Finally, experimental results of the complete model are presented.

Several assumptions are made in the development of this control area prediction model.
• The control area estimated is specific to finite state machines.
• The PLA state feedback register is considered part of the control area.
• Hardware external to the PLA which is used for control, such as a counter for loops, is considered part of the control area model. Any hardware not specifically part of the controller is excluded.
• Wiring area within a controller is not fully analyzed. The area of a PLA inherently includes internal wiring connectivity; however, wiring between the PLA and any additional external control hardware, including the feedback register, is ignored. It is presumed this area is a small fraction of the PLA and external hardware, but could be taken into account during wiring area analysis of combined data path and control.

This chapter focuses upon the PLA as the mechanism for control. However, this does not preclude use of the underlying model for other control methods such as multi-level logic.

3.2 A PLA Control Area Model

In a finite state machine the controller performs state transitions based upon the current state and, where needed, external inputs. The Moore state machine model is used here. Control area is determined by three major parameters:

1. the number of inputs, $i$, some of which are used to represent the current state and others as input status from the data path or external control path hardware;
2. the number of outputs, $o$, some of which are used to represent the next state plus others for controlling external control path or data path hardware; and,
3. the number of product terms, $p$, in the PLA.

The PLA parameters $i$, $o$, and $p$ are driven by the number of data path states, which is the number of partitions in non-pipelined design and the initiation interval for pipelined design.
Other factors such as conditional execution and loops may also affect the external control area, as in the use of a counter for loops, but they predominantly influence the inputs, outputs, and number of control states in the PLA itself.

3.2.1 General Model

The number of FSM states, $\zeta$, determines the number of state bits (the minimum count on the number of input and output lines) which is $\lceil \log_2 \zeta \rceil$. Assuming that no inputs, outputs, or product terms are redundant is important in estimating these parameters of a PLA. The presence of redundancy would unnecessarily increase the size of the PLA. Unfortunately, the presence of redundant product terms is not uncommon. Currently, there is broad emphasis in the literature on minimizing PLA product terms and the results are encouraging [RSV86, GR87]. One would expect future PLA synthesis tools to produce the smallest possible PLA; hence, only the area of a fully minimized PLA will be predicted.

Inputs to a PLA state machine are comprised of external inputs (to the PLA) and the $\lceil \log_2 \zeta \rceil$ next-state bits fed back from the output. These external inputs reflect the status of external hardware or events which impact either loop control or conditional branch execution. The number of loop status input lines, $i_{loop}$, is determined by both data path and external control path (loop counter) status. The number of conditional status input lines, $i_{cond}$, is affected by the type and number of conditional branches in the data path. With these values, the number of PLA inputs, $i$, is

$$i = \lceil \log_2 \zeta \rceil + i_{cond} + i_{loop} \tag{3.2.1}$$

Similarly, the PLA state machine outputs contain the next-state bits and external control lines.
Some of these external control lines are associated with loop (counter) control, $o_{loop}$, and conditional branch selection and control, $o_{cond}$. The remainder are used for explicit control of registers, multiplexers and auxiliary hardware. The number of control lines associated with registers, $R_c$, depends upon the type of register control as described in the next section. Multiplexers may be explicitly controlled by the PLA or implicitly controlled via the state feedback bits. In the latter case, the number of multiplexer control lines, $M$, is set to zero. Finally, the number of control lines for operating ALUs and other multi-function hardware, $A$, depends upon the data path module selection. Hence, the number of output lines, $o$, is

$$o = \lceil \log_2 \zeta \rceil + R_c + M + A + o_{loop} + o_{cond} \tag{3.2.2}$$

The number of control states ($\zeta$) is directly estimated from the number of stages (control steps) in the data path design ($P$) and adjusted by any conditional path or loop state contribution. In particular, if loops are unwound, $\zeta_{loop}$ additional control steps are needed as detailed in Section 3.3.1. The number of additional control states associated with conditionals, $\zeta_{cond}$, can only be approximated as described later in Section 3.3.2. $\zeta_{cond}$ reflects separate control of each conditional path. For example, a 4-way conditional of 3 states each would have $\zeta_{cond} = 9$. The number of control states can be approximated as follows:

$$\zeta \approx P + \zeta_{loop} + \zeta_{cond} \tag{3.2.3}$$

3.2.2 Computing $R_c$ and $p$

The major difficulty in formulating a PLA area estimator using data path parameters is computing the number of product terms, $p$. It is clear that the number of control states and number of output control lines directly affect the number of product terms; loops and conditionals may also have an impact.
For a simple sequential controller with no loops or conditionals, we show in this section that only state count and output control determine $p$.

The number of output control lines is determined by the control style. Two distinct PLA control styles whose choice only affects the number of register control lines, $R_c$, have been identified. One is for the controller to assert a unique output line at each stage or control step; this output line operates all of the registers active in the given stage and is thus labelled “stage control”. (For a register used over several stages we OR the appropriate PLA outputs to produce its control.) The second type of PLA controls the registers with output lines directly. This method is labelled “register control”. We will choose the approach which yields minimal control area for a given design.

To illustrate, Figure 3.3 contains two different partitioned dataflow graphs. In Figure 3.3a, a single control line may suffice if the same register is used at each stage and, hence, the “register control” method is superior. Conversely, the controller for the graph in Figure 3.3b would be smaller using the “stage control” method. In general, a small number of control states favor “stage control” whereas larger controllers favor “register control” as the minimal area PLA style.
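As a minimal sketch of the general model of Section 3.2.1, the following Python fragment evaluates Equations 3.2.1 through 3.2.3 for a given set of data path attributes. The names `state_bits`, `PlaLines`, and `estimate_lines`, as well as the sample values, are illustrative assumptions and are not taken from the dissertation or from any synthesis tool.

```python
from dataclasses import dataclass


def state_bits(states: int) -> int:
    """Return ceil(log2(states)) using integer arithmetic only."""
    if states < 1:
        raise ValueError("a controller needs at least one state")
    return (states - 1).bit_length()


@dataclass
class PlaLines:
    states: int   # zeta, Equation 3.2.3
    inputs: int   # i, Equation 3.2.1
    outputs: int  # o, Equation 3.2.2


def estimate_lines(stages: int, reg_lines: int, mux_lines: int = 0,
                   alu_lines: int = 0, loop_states: int = 0,
                   cond_states: int = 0, i_loop: int = 0, i_cond: int = 0,
                   o_loop: int = 0, o_cond: int = 0) -> PlaLines:
    """Estimate PLA state, input, and output counts from data path attributes.

    stages is P, reg_lines is R_c, mux_lines is M (zero when multiplexers
    are driven by the state feedback bits), and alu_lines is A.
    """
    states = stages + loop_states + cond_states            # Eq. 3.2.3
    bits = state_bits(states)
    inputs = bits + i_cond + i_loop                         # Eq. 3.2.1
    outputs = (bits + reg_lines + mux_lines + alu_lines
               + o_loop + o_cond)                           # Eq. 3.2.2
    return PlaLines(states, inputs, outputs)


if __name__ == "__main__":
    # Hypothetical 8-stage sequential design with 8 register control lines.
    print(estimate_lines(stages=8, reg_lines=8))
    # PlaLines(states=8, inputs=3, outputs=11)
```

The state-bit count is computed with `int.bit_length` rather than a floating-point logarithm so that exact powers of two are never rounded the wrong way.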
3.2.2.1 Stage Control Method

For the "stage control" method, a unique "stage" control line is asserted during each step of the $\zeta$ control stages, yielding an estimated register control line quantity of

$$R_c = \zeta \qquad (3.2.4)$$

[Figure 3.3: Two Example Data Paths. (a) Serial dataflow graph; (b) Parallel dataflow graph]

[Figure 3.4: 8-state PLA with a register control line shown]

The "stage control" method relies on each of the $R$ uniquely controlled registers being controlled by ORing together one or more of the $R_c$ output lines. (If two registers have identical control values in all states, where "don't care" is an automatic match, then these registers can share the same control line.) A schematic view of an 8-stage PLA with one register control line asserted in 5 states is shown in Figure 3.4.

Since during PLA area estimation we do not know during which of the $P$ data path stages a given register control line is asserted, the external control area associated with OR gates is formulated based upon empirical observations. Each of the $R$ uniquely controlled registers is assumed to have its control formed using 2-input OR gates arranged into an $m:1$ tree. The value for $m$ depends upon the actual register control, which is unknown, and the number of data path stages $P$. If $k$ is the ratio of stages where a given register is asserted versus the total number of stages, then $m = kP$ and the incremental area associated with stage control random logic, $Area_{stage}$, is

$$Area_{stage} \approx R \times A_{or} \times (\lceil kP \rceil - 1) \qquad (3.2.5)$$

where $A_{or}$ is the area of a 2-input OR gate. If each register is expected to be used in every state, $k = 1$. For small graphs, where "stage control" is superior to "register control", an empirical value of $k = 0.8$ has been observed.
Varying this factor from 0.5 to 1 typically alters the total control area by less than a percent.

The number of product terms for a PLA is simple to determine in the "stage control" case. Since each stage control line is asserted in one and only one state as indicated in Figure 3.4, it must be formed by a single product term. There may also be additional product terms associated with loop termination tests, $p_{loop}$, and conditional branch tests, $p_{cond}$. Thus,

$$p = \zeta + p_{loop} + p_{cond} \qquad (3.2.6)$$

Given the PLA architecture where each state has one product term active that is inactive in every other state, the equations for any next-state, multiplexer control, or auxiliary output such as ALU control can clearly be achieved. One simply ORs together one or more of these unique state product terms for multiplexers and auxiliary control. Next state generation may also rely upon the additional product terms associated with loops or conditionals.

3.2.2.2 Register Control Method

With the "register control" style, each unique register control line is specifically output by the PLA. Given an exact or estimated value of the number of uniquely controlled registers, $R$, then

$$R_c = R \qquad (3.2.7)$$

Determining the number of product terms is more difficult in this case. By construction, one or more register control lines must be asserted during each stage; a given register control line may also be asserted during multiple stages. If each register is only asserted during a single stage, the product term count is identical to that of the "stage control" method. Asserting a register during multiple stages is typical even for moderately sized designs, which offers the possibility of reducing the number of product terms. In a simple sequential state machine (no loops or conditionals), the next state only depends on the current state.
As derived in Appendix C, a minimum area $\zeta$-state counter can be constructed within a PLA using product terms, where $n = \lceil \log_2 \zeta \rceil$. For designs where $\zeta < 8$, the actual number of product terms is $\zeta - 1$, which is only one less than the "stage control" method. As the number of control states increases, this difference could increase, but does not. The example using a synthesized design with 19 states will demonstrate the problem.

The synthesized design of Figure 3.5 has 1 multiplier, 1 adder, and 6 registers which execute in 19 clock cycles or states. A simple counter for 19 clock cycles can be achieved using 14 product terms in the PLA. Except for states 1 and 17 through 19, the control states have multiple product terms active. (A single product term simplifies register control.) Of the 6 registers, only 2 can use existing product terms. It takes 11 additional product terms to satisfy the register control requirements, giving 25 product terms total. By comparison, a single product term per state offers the same control capability with only 19 product terms.

Other designs can be tried as well. Since register sharing tends to increase with the number of clock cycles, one would not expect the situation described in the example to change for more serial designs. More parallel designs do have an opportunity for savings as there is less sharing. However, unless every register control line can be formed by some ORed combination of state bits and controlling a register is possible in the last stage (where all state bits are zero), then $\zeta$ product terms are required.

As can be seen, the potential savings in product terms from the register control method are insignificant because of the register control requirements. (A detailed explanation is given in Appendix D.)
Thus, in comparing the two register control styles, one notes that the product terms and input count are unchanged, but that the number of output terms can change substantially ($R$ versus $\zeta$). Clearly, the method chosen should be the one which results in the minimum control area. The decision whether to select "stage control" or "register control" is influenced by the number of control states as well as the number of unique register control lines.

[Figure 3.5: AR Lattice Filter]

3.2.2.3 Multiplexer/Bus Control Lines

Multiplexers and buses are functionally identical hardware as viewed by the controller. Both select a specific value to be utilized in a given state; buses can be thought of as "global" multiplexers since they are, in general, widely distributed in a circuit as compared to multiplexers, which are local to some hardware unit.

The hardware realization of value selection is depicted in Figure 3.6. Here, Figure 3.6a shows encoded control lines and Figure 3.6b shows unencoded control lines. For unoptimized multiplexers, during one and only one state, a given input line is selected through assertion of one or more multiplexer control lines. Note that if the same value is selected during multiple states, multiplexer area optimizers reduce the number of multiplexers, which eliminates this single-state uniqueness. Both optimized and unoptimized multiplexers will be considered.

For VLSI chips using cells, 2:1 multiplexer cells are typically used. An $n:1$ multiplexer tree comprised of 2:1 multiplexer cells could have as many as $n - 1$ control lines when they are unencoded (Figure 3.6b) instead of the $\lceil \log_2 n \rceil$ control lines of the encoded form (Figure 3.6a).
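The two control-line counts for an $n:1$ multiplexer tree can be compared with a short sketch. The helper name is hypothetical; the counts are the $\lceil \log_2 n \rceil$ (encoded) and $n - 1$ (unencoded, one line per 2:1 cell) figures quoted above.

```python
def mux_control_lines(n, encoded):
    """Control lines for an n:1 multiplexer tree built from 2:1 cells.

    encoded=True  -> ceil(log2(n)) lines, one per tree level
    encoded=False -> up to n - 1 lines, one per 2:1 cell
    """
    if n < 2:
        return 0  # a single input needs no selection
    if encoded:
        return (n - 1).bit_length()  # integer form of ceil(log2(n))
    return n - 1


for n in (2, 4, 5, 8, 16):
    print(n, mux_control_lines(n, True), mux_control_lines(n, False))
```

The gap between the two counts grows quickly with $n$, which is why the choice of control style matters mainly for wide multiplexers and buses.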
The unencoded form offers more control flexibility, which can be exploited provided it results in an overall smaller circuit area.

A global examination of the general multiplexer control problem suggests three styles, depicted in Figure 3.7: directly using PLA next-state output lines, indirectly using PLA next-state output lines and random logic, or directly using PLA multiplexer output lines. In the first case, specific PLA state output lines are connected directly to the multiplexer control lines. Clearly, since any PLA next-state output bit combination is unique for each state, these lines can be used to control unoptimized multiplexers. No other additional hardware is needed.

[Figure 3.6: Multiplexer control methods. (a) encoded control lines; (b) unencoded control lines]

[Figure 3.7: Different PLA multiplexer control styles. (a) mux control via PLA state bits; (b) indirect mux control via PLA state bits; (c) direct mux control via PLA output lines]

When multiplexer area is optimized (duplicate selection hardware eliminated), additional random logic (AND/OR/INV) may be needed for multiplexer control as shown in Figure 3.7b. This hardware uses the PLA next-state output bits to indirectly operate multiplexers. The amount of additional hardware required determines the area efficiency of this approach.

As an example, multiplexers used in four synthesized designs based on two dataflow graphs will be examined. Using the minimized multiplexer assignment from MABAL, the control lines were analyzed to determine which state bits were used and the amount of additional hardware in the form of AND/OR/INV gates needed. Table 3.1 lists the statistics for these designs. With the exception of the 28-stage Efilter, the additional hardware required was small.

Table 3.1: Multiplexer Control Analysis

Dataflow   Stages   Ops/Modules       Module         muxl/muxr (1)   S/A/O/I (2)
Graph               Add      Mul      id
AR         9        12/2     16/2     add1           5/3             3/0/0/0
                                      add2           3/4             3/1/0/0
                                      mul1           6/8             3/0/0/0
                                      mul2           5/8             3/0/0/0
                                      regs (worst)   3               2/0/0/0
AR         19       12/1     16/1     add1           4/6             4/6/0/3
                                      mul1           10/16           4/1/0/0
                                      regs (worst)   2               1/0/0/0
Efilter    8        26/5     8/2      add1           3/2             2/0/0/0
                                      add2           4/3             2/0/0/0
                                      add3           5/3             4/2/0/1
                                      add4           7/3             3/0/0/0
                                      add5           3/3             4/2/2/3
                                      mul1           5/5             3/0/0/0
                                      mul2           2/2             1/0/0/0
                                      regs (worst)   4               2/2/0/0
Efilter    28       26/1     8/1      add1           7/9             5/8/6/5
                                      mul1           5/5             4/1/0/0
                                      regs (worst)   3               2/0/0/0

(1) muxl and muxr are the number of multiplexers used at the left and right inputs, respectively, for that particular module.
(2) S is the number of state bits used; A is the number of 2-input AND gates, O is the number of 2-input OR gates, and I is the number of inverters.

Comparing the approaches which utilize the PLA state bits, it is in the larger designs that the random logic needed for indirect control might be significant. If this additional hardware exceeds the area reduction due to multiplexer minimization, then using unoptimized multiplexers results in a smaller design. (Multiple control signal patterns are used to select the same value.) For example, in the 28-stage Efilter, approximately 33 transistors of random logic are needed when multiplexers are minimized. The unoptimized configuration requires 10 additional multiplexers of two pass transistors each; thus, a savings of 30% in area is obtained for the direct PLA next-state bit control style.

Finally, the third approach is direct control of the multiplexers via PLA outputs as presented in Figure 3.7c; the number of multiplexer control lines is $M$.
This approach is functionally identical to that of Figure 3.7b except that the decode logic is internal to the PLA. Despite the clean appearance of this method as compared to the indirect PLA control methods, the PLA area for implementing complete multiplexer control via PLA output lines is usually larger than the earlier approaches. To see why this is true, one needs to compare the area of each additional output line versus external random logic.

For NMOS technology, each $n$-bit AND or OR gate requires $n + 1$ equivalent transistors; an INV requires two. As observed in the Berkeley NMOS PLA, approximately |p + 6 equivalent transistors are used for each multiplexer control line in a folded PLA. For example, where the 9-stage AR filter uses a total of one two-input AND gate (3 transistors) for indirect control, 60 equivalent transistors are needed within the PLA for the 3 control lines (assuming only 3 lines are needed). If the PLA cannot have its output lines completely folded, then this difference increases. Clearly, none of the designs in Table 3.1 would benefit from direct PLA control.

Although designs can be contrived where direct PLA control produces the lowest overall area, the design examples encountered (albeit small) had smaller total control area using the indirect control methods. Hence, using existing PLA outputs rather than individual multiplexer control outputs results in a lower overall circuit area.

3.2.3 A Specific Model for Berkeley PLA Synthesis Tools

In control synthesis, PLA layout generators use predefined macrocells. Given the values for $i$, $o$, and $p$ from the earlier equations, the area of a PLA can be defined as

$$PLA_{area} = c_1 (2i + o) p + c_2 p + c_3 (2i + o) + c_4 \qquad (3.2.8)$$

where $i$ is the number of inputs, $o$ the number of outputs, and $p$ the number of product terms.
The first term determines the total area of the cellpairs; a cellpair is a block having an input (or inverse input or output) line and a product term line passing through it. These interior blocks are labelled as $c_1$ in Figure 3.8. The second term computes the output register and interconnect area between the input and output sides as well as the product term pre-charge pair (if any) and ground cell area; these are labelled as $c_2$. The third term describes the pre-charge pair, ground cell, and feedback register area for the input and output ($c_3$). Finally, the last term is any additional area for "closing the bounding box" and should only be of consequence for small state machines.

The coefficients $c_1 \ldots c_4$ of Equation 3.2.8 can be readily determined by measuring the area of each type of macrocell specific to a control synthesis tool. Initial calibration and verification of the model entailed use of the Berkeley PLA synthesis package [Ham83] and the MAGIC layout tool [OHM+84]. (The PLA package includes PEG to produce the state equations, EQNTOTT, which transforms these into a PLA personality matrix, ESPRESSO, which minimizes the product terms of this matrix, and MKPLA, which generates a CIF description of the final controller including the state feedback register.) Initially, ten arbitrary PLAs (each with a randomly created personality matrix) were constructed using MKPLA, and "CIF block" areas were analyzed using MAGIC to determine the coefficients. The Berkeley package defaults to 2 µm NMOS; thus, the particular coefficients used for our experiments use this technology.
Based upon the interior blocks measured, the coefficients which give the PLA area in square microns are

$$c_1 = 73.7717, \quad c_2 = 19.603, \quad c_3 = 611.698, \quad c_4 = 3025.94 \qquad (3.2.9)$$

[Figure 3.8: Internal construction of PLA, showing the $c_1$, $c_2$, and $c_3$ macrocell regions together with the PLA ground, pullup, input, output (+ register), and interconnect blocks]

Table 3.2: Validation of PLA model

PLA                  MAGIC                    CIFAREA                  Model
In   Out   Pterm     H      W      Area       H      W      Area       Area
3    6     5         124    131    25.2       121    131    24.6       24.5
4    11    10        156    196    47.4       153    195    46.2       46.8
5    25    20        236    325    118.9      233    323    116.7      120.6
6    12    40        407    231    145.7      401    227    141.1      138.4
6    56    50        492    612    466.7      483    603    451.4      446.2
7    40    70        660    488    499.2      651    483    487.4      490.2
7    87    80        761    908    1071.0     741    895    1028.0     1036.6

To avoid manually using MAGIC during later validation, a utility (CIFAREA) was written which analyzes the MKPLA CIF file and outputs the bounding box in $\lambda$ and total area in square microns. Table 3.2 summarizes results of seven specific PLAs processed by MAGIC, CIFAREA, and the PLA model. H and W are the height and width of the PLA in microns; Area is the total bounding box area in square mils.

Note that MAGIC is only capable of determining the area of a user-defined box. Thus, one was created which encapsulated the PLA to the best ability of a human user and used for comparison. It is therefore not surprising that the area given by MAGIC is always greater than that given by CIFAREA; human perception of enclosure entails inclusion of some bounding area.
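As an illustration, Equations 3.2.8 and 3.2.9 can be evaluated directly. The sketch below is not PASTA itself; the function names are hypothetical. It assumes two things beyond the equation: outputs and product terms are rounded up to the nearest even number, as the Berkeley macrocell layout requires, and square microns are converted to square mils with 1 mil = 25.4 µm so the result can be compared with the Model column of Table 3.2.

```python
# Coefficients of Equation 3.2.9, in square microns (2 um NMOS Berkeley cells).
C1, C2, C3, C4 = 73.7717, 19.603, 611.698, 3025.94

UM2_PER_MIL2 = 25.4 ** 2  # one square mil in square microns


def round_up_even(x):
    """Round a non-negative integer up to the nearest even number."""
    return x + (x % 2)


def pla_area_um2(inputs, outputs, pterms):
    """Equation 3.2.8, with outputs and product terms rounded up to even
    (assumption taken from the paired-line macrocell construction)."""
    o = round_up_even(outputs)
    p = round_up_even(pterms)
    columns = 2 * inputs + o  # true and complement input lines plus outputs
    return C1 * columns * p + C2 * p + C3 * columns + C4


def pla_area_mil2(inputs, outputs, pterms):
    """Same estimate expressed in square mils."""
    return pla_area_um2(inputs, outputs, pterms) / UM2_PER_MIL2


print(round(pla_area_mil2(3, 6, 5), 1))    # first row of Table 3.2
print(round(pla_area_mil2(4, 11, 10), 1))  # second row of Table 3.2
```

Under these assumptions the sketch gives about 24.5 and 46.8 square mils for the first two rows of Table 3.2, in agreement with the Model column.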
Since CIFAREA is reasonably precise, all PLAs described later in this chapter have their area computed using this program; validation time is reduced and introduction of human errors minimized.

To verify Equations 3.2.8 and 3.2.9, over 100 PLA personality matrices were randomly generated and synthesized using the Berkeley tools. The size of the controller varied from a single input, output, and product term up to a controller with 22 inputs, 45 outputs, and 100 product terms. Comparison against the results from CIFAREA showed a worst case area error of 3.3% for the smallest example, with an average error of 0.73%.

One feature of using macrocells in the Berkeley toolset is that each cell handles a pair of lines. Since the number of internal input lines is always even, there is no influence on this parameter. However, both the output and product term values are rounded up by PASTA to the nearest even number to be consistent with this layout construction. Other PLA tools may not be restricted to paired lines for each primitive cell.

3.3 Loops and Conditionals

In the previous section, a detailed model for PLA area for controllers without loops was derived. In this section, the effects of loops and conditionals are derived and their impact on the PLA parameters assessed.

3.3.1 Effect of Loops on PLA Area

Loops are a common construct in control. For the most trivial case, a PLA state machine has one inherent loop where the last state connects to the first state. Cycles in the dataflow graph which become minor loops in the controller are of more interest. Each of the minor loops is labelled as $\alpha_i$.

    while (input_i is FALSE)
    begin
        perform some functions
        (over one or more control steps)
    end

Figure 3.9: Loop on Status Flag
Loop parameters associated with loop $\alpha_i$ consist of its length in control steps, $PL(\alpha_i)$, and the number of iterations associated with it, $N(\alpha_i)$. There are three types of loops: loop on status flag, variable count and fixed count; these are depicted in Figures 3.9 through 3.11.

    begin
        perform some functions
        (over one or more control steps)
    end

Figure 3.10: Fixed Count Loop

    for i := 1 to variable_i
    begin
        perform some functions
        (over one or more control steps)
    end

Figure 3.11: Variable Count Loop

The simplest type of loop is where a status condition or flag (generated externally to the PLA) is checked to determine loop termination as shown in Figure 3.9. Since the loop operations are invariant, only a single test is performed to determine restart or exit from the loop. ($N(\alpha_i)$ is unknown in this case.)

By construction, each product term determines a specific next-state based upon the input lines. An external status flag consumes one input line which did not exist in a simple sequential state machine. As the outcome of the test chooses one of two distinct states, there must be two product terms rather than one. One next-state product term enables restart of the loop; the other terminates the loop. Although two product terms now select the same state (for loop start as well as loop restart), they cannot be merged due to the presence of the loop status variable in one which is absent in the other.

In the remaining cases (variable or fixed iteration loops), one can either "unroll" the loop or provide a count mechanism for tracking the loop as presented in Figures 3.10 and 3.11, respectively. In the first method, a fixed loop is unrolled into a number of PLA states so that no branching (to the top of the loop) is necessary.
(A variable loop could be implemented by unrolling the loop into the maximum number of loops expected; an "exit condition" would check for loop termination at the end of each unrolled loop and skip to just beyond the last loop state if the exit condition were true. In this case, $N(\alpha_i)$, the number of iterations for loop $\alpha_i$, is set to the maximum number of iterations.) As will be shown, unwinding may be useful for very small loops, but is impractical for either large or long loops because the number of control states increases dramatically.

3.3.1.1 Loop Unrolling

Each of the $N(\alpha_i)$ iterations of loop $\alpha_i$ consists of $PL(\alpha_i)$ control steps; $PL(\alpha_i)$ states inherently exist as part of the acyclic control. Thus, the additional number of PLA states due to loop unrolling, $\zeta_{loop}$, is

$$\zeta_{loop} \le \sum_i PL(\alpha_i) \left[ N(\alpha_i) - 1 \right] \qquad (3.3.10)$$

The sum is taken over all $\alpha_i$ which are being unrolled. The inequality arises from the case where, given two or more loops,

• the loops are the same length ($PL$),
• the loops occur in the same timesteps (completely overlap in time),
• the loops have the same number of iterations ($N$),
• and the loops are in different parallel branches (not conditional branches).

Unrolling (unwinding) a loop is estimated to increase the number of control states by $\phi$, which is computed as

$$\phi = PL(\alpha_i) \left[ N(\alpha_i) - 1 \right] \qquad (3.3.11)$$

Register sharing between different iterations of the loop is considered. The degree of register independence in a loop, $K_{reg}$, which is a value between 0 and 1 inclusive, will affect the amount of additional hardware needed for loop operation. (Each value-storage operation during loops can be loop-dependent, which means different physical registers are used during each loop iteration, or loop-invariant, which means the same register can be reused.)
For example, a 16-bit add operation using a 4-bit adder would have loop-dependent registers and $K_{reg} = 1$. A 4-bit multiply using shift-and-add would have a 16-bit register which is loop-invariant and $K_{reg} = 0$.

Since registers in this case are controlled directly via the PLA, there are approximately $K_{reg}\phi$ additional product terms and output lines needed when the percentage of loop-invariant registers is low. These additional product terms also produce the next-state value. However, when $K_{reg}$ is small, the minimal number of product terms associated with a simple counter contributes to control area, as only the normal next-state generation is needed. The effective counter size, $\psi$, is inversely related to $K_{reg}$ and computed as

$$\psi = \begin{cases} (1 - K_{reg})\phi & \text{if } (1 - K_{reg})\phi > 1 \\ 1 & \text{otherwise} \end{cases} \qquad (3.3.12)$$

The combined effect upon the incremental area associated with unrolling, $Area_{unroll}$, becomes

$$Area_{unroll} \approx c_1 K_{reg} \phi^2 + (c_2 + c_3) K_{reg} \phi + (c_1 + c_2) \lfloor \log_2 \psi \rfloor \left( \lfloor \log_2 \psi \rfloor + 1 \right) \qquad (3.3.13)$$

In the unlikely situation where the four conditions listed for Equation 3.3.10 are met, the same PLA states can be used to control both loops. However, since the inequality presumes this condition is detectable by some complex state minimization utility, the inequality of Equation 3.3.10 is removed in the PASTA implementation.

For unrolled loops, the values for $i_{loop}$, $o_{loop}$, and $p_{loop}$ are zero. The inputs, outputs, and product terms derived from $\zeta$ (which includes $\zeta_{loop}$) will reflect the number of additional states incurred by unrolling the loop.

3.3.1.2 Implementing Loops using a Counter

Loop control hardware using a counter has a variety of implementations. Two primary schemes involve building a counter as part of the PLA or using an external counter plus some additional hardware. Four operational considerations are

1. initialization of the counter,
2. decrement of the counter,
3. testing for count completion,
4. and uniquely defining the state (for clocking the correct data path register).

A comparison between an external counter and a PLA counter was accomplished for 3 µm NMOS technology with the Berkeley PLA synthesis tools [Ham83]. As compared to external hardware using the same technology, the incremental PLA area to construct an internal counter is greater than attaching an external counter under PLA control. One would expect this difference to hold for other technologies as well. Detailed analysis can be found in Appendix C.

A counter used by the controller for operating loops will be assumed by PASTA to be constructed externally to the PLA. One example architecture of a PLA containing an external counter (with external decoding for the loop) is shown in Figure 3.12. Here, a simple 5-state machine has a loop with three iterations during the middle two steps (for a total of nine states).

Use of an external counter impacts the PLA parameters for each loop. A given controller has $N_c$ external counters. Each of these counters may have a different bitwidth; $n_i$ is the bitwidth of counter $i$. It is assumed that individual counters do not share any control lines. Thus, from Appendix C, the increases in PLA inputs, outputs, and product terms due to loops are

$$i_{loop} = N_c \qquad (3.3.14)$$

$$o_{loop} = \sum_i (n_i + 1) \qquad (3.3.15)$$

$$p_{loop} = N_c \qquad (3.3.16)$$

An area comparison between unrolling a loop and using an external counter determines which method to use for a given design. Although non-overlapping counters could be shared to reduce external area for an actual design, counter cost is not a dominating factor, so the analysis presented here assumes there is no sharing among counters.
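The loop bookkeeping of Equations 3.3.10 and 3.3.14 through 3.3.16 can be sketched as follows. The function names and data layout are hypothetical; the unrolling estimate uses the equality form that the text says PASTA adopts, and the counter bitwidths are simply taken as inputs rather than derived.

```python
def unrolled_states(loops):
    """Additional PLA states from unrolling (Equation 3.3.10 as an equality).

    loops: iterable of (PL, N) pairs, loop length in control steps and
           number of iterations, for every loop being unrolled.
    """
    return sum(pl * (n - 1) for pl, n in loops)


def external_counter_params(bitwidths):
    """PLA parameter increases for external loop counters
    (Equations 3.3.14 to 3.3.16); one bitwidth n_i per counter,
    with no control lines shared between counters."""
    n_c = len(bitwidths)
    return {
        "i_loop": n_c,
        "o_loop": sum(n + 1 for n in bitwidths),
        "p_loop": n_c,
    }


# The 5-state machine of the text: a 2-step loop executed 3 times.
extra = unrolled_states([(2, 3)])
print(5 + extra)                     # total states when the loop is unrolled
print(external_counter_params([2]))  # the same loop driven by a 2-bit counter
```

For the 5-state example the unrolled controller has nine states, as stated above, whereas the external counter leaves the state count at five and adds one input, three outputs, and one product term.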
[Figure 3.12: PLA with External Counter and Register Decoding for Loop]

As shown in Figure 3.12, there are several contributions to circuit area for each external counter. The $n$-bit resettable counter itself has an area of $n A_{cntr}$, where $A_{cntr}$ is the area of a single-bit counter with carry. A decoder and AND gates are also used to load the proper registers during a given loop iteration. The maximum amount of hardware is needed if registers are completely loop-dependent ($K_{reg} = 1$); no extra hardware is needed if registers are loop-invariant. Decode hardware distinguishes a loop state; its area is thus approximated by $K_{reg} PL(\alpha_i) n A_{decod}$, where $A_{decod}$ is the area of a 1:2 decoder. Similarly, AND gates determine which loop-dependent register to select during each loop state, contributing $K_{reg} PL(\alpha_i) N(\alpha_i) A_{and}$ to the area, with $A_{and}$ the area of a 2-input AND gate. The incremental area increase of a PLA controller having an external counter, $Area_{ext\,cntr}$, becomes

$$Area_{ext\,cntr} \approx (c_1 + c_3)(n + 3) + c_2 + n A_{cntr} + K_{reg} PL(\alpha_i) N(\alpha_i) A_{and} + K_{reg} PL(\alpha_i) n A_{decod} \qquad (3.3.17)$$

as derived in Appendix C. (Note that $K_{reg} PL(\alpha_i)$ must be an integer.) The first two major terms in Equation 3.3.17 reflect the PLA area increase to operate the counter; the remaining terms are for external hardware.

Examining Equations 3.3.17 and 3.3.13, one can determine the correct implementation before synthesis of the PLA. Table 3.3 lists the "breakpoint" between unrolling a loop and using an external counter. The value listed at each position is the number of loop-invariant register control lines. If the number of lines for the target controller meets or exceeds this value, then the loop should be implemented using an external counter.
Other symbols shown are * to signify that the unrolled loop has the lowest area regardless of the number of loop-invariant register control lines, and o when the external counter always has the lowest area. Given the number of iterations and loop length, the number of loop-invariant registers (and, hence, the number of control lines per loop) solely determines which approach is superior. As expected, except for very short loops and small iteration counts, the smallest area is realized using an external counter.

Table 3.3: Unrolling versus External Counter for Loop Implementation

Iter             Loop Length (PL(α_i))
(N(α_i))    1    2    4    8    16   32   64   128
2           4 3 3 3 2 2   (only six of the eight entries survive; column positions unknown)
3           *    2    2    2    2    2    1    1
4           *    2    2    1    1    1    1    1
5           *    2    1    1    1    1    1    o
6           o    1    1    1    1    1    1    o
8           o    1    1    1    1    1    1    o
16          o    1    1    1    1    1    1    o
64          o    1    1    1    1    1    1    o
128         o    1    1    1    1    1    o    o

*, o: See text for explanation.

3.3.2 Effect of Conditional Branches on PLA Area

Another important extension for a PLA model is the handling of conditional branches such as the if-then-else or switch-case statements in programming. Here, a number of mutually exclusive data paths could be executed dependent upon some condition external to the PLA. Conditional branches can be implemented in two different ways: different state sequences for each conditional path or "sharing" the same states among different branches.

Definition 3.1 A conditional subgraph is a directed subgraph of a dataflow graph which has a distribute operation as its source and the matching join operation as its sink.

Definition 3.2 A conditional path, $\beta$, is any path in a conditional subgraph from its source to its sink.

In a dataflow graph, each path starts at a distribute operation and ends at the associated join operation, taking any arbitrary branch including nested conditionals that are encountered along the path.
Complete overlap of common operations is not allowed; thus there is a restriction for any two paths $\beta_i$ and $\beta_j$: for some operation $op_t$, $op_t \in \beta_i$ and $op_t \notin \beta_j$, or the reverse, exclusive of the distribute and join operations. A given conditional path may contain other conditional paths within it that are associated with nested conditionals.

Definition 3.3 A maximal conditional subgraph or mcs is a conditional subgraph that is not contained in any other conditional subgraph. The set of "outermost" conditional subgraphs in a dataflow graph is composed solely of maximal conditional subgraphs.

Two different paths within the same mcs may be mutually exclusive if they lie along different branches of the same conditional. However, if these paths lie in two different mcs, they are not mutually exclusive. This is important for determining the number of additional control states needed in the PLA.

One method for extending the PLA model for conditionals is to assume each conditional path has a set of unique states. The location of the conditional is important in determining the number of control states which are added. In particular, where an $m$-way conditional occurs along the primary (critical) path, only $m - 1$ of its paths contribute additional control states. (The critical path is the longest path in the graph.) As demonstrated in Figure 3.13a, only two of the three conditional branches contribute additional control states. In the absence of the conditional, the control states along one path would still be counted.

[Figure 3.13: Conditional Path Example]

Each conditional path $\beta_i$ has $PC(\beta_i)$ control steps. $C(\beta_i)$ is a binary function whose value is 1 if the conditional path $\beta_i$ is not on the critical path and 0 otherwise; if a conditional path is part of the critical path, the states associated with it have already been counted. For a PLA controlling conditional $j$, where all $\beta_i$ paths are different mutually exclusive branches of $j$ and $D_j$ is the total number of conditional paths of $j$, the number of PLA states is increased from the unconditional case by

$$C_{cond}(j) = \sum_{i=1}^{D_j} C(\beta_i) \times PC(\beta_i) \tag{3.3.18}$$

over all conditional paths $\beta_i$.

A more complicated case is where two (or more) conditionals are contained in different mcses. For each control state where both conditionals are active, the number of control states is the product of the number of conditional paths for each. This complex example is illustrated in Figure 3.14. $j_t$ is the number of conditional paths simultaneously active in control step $t$. $A_i$ is the number of conditional branches of conditional $i$ which could be active during control step $t$. In the example, $j_t = 2$ and $A_1 = A_2 = 2$ for 3 control steps (one per operation).

Within PASTA, estimating the number of control steps is performed on a state-by-state basis dependent upon the input description. Since exact overlap of operations between paths is not known, the additional number of states, $C_{cond}$, is estimated as

$$C_{cond} \approx \sum_{t} \left( A_1 \times A_2 \times \cdots \times A_{j_t} - 1 \right) \tag{3.3.19}$$

where $t$ is the control state index. Control of the graph shown in Figure 3.14 would entail adding 9 control steps.

Unlike sequential data paths where register sharing is not obvious, conditional branches provide a clear opportunity for sharing registers. In the "stage control" method, one only need consider the longest path in any maximal conditional subgraph to determine the number of control lines necessary.
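As a quick check of Equation 3.3.19, the estimate can be sketched in a few lines of Python. This is only an illustrative sketch, not PASTA's implementation: the function name and the list-of-lists input format are assumptions made here, with one inner list per control step holding the branch count $A_i$ of each conditional active in that step.

```python
from math import prod

def additional_conditional_states(active_branches_per_step):
    """Estimate C_cond following Equation 3.3.19.

    active_branches_per_step holds one list per control step t; each
    list contains the branch count A_i of every conditional active in
    that step (hypothetical input format, chosen for illustration).
    """
    # Each step contributes (A_1 * A_2 * ... * A_jt) - 1 extra states.
    return sum(prod(branches) - 1 for branches in active_branches_per_step)

# Figure 3.14 example from the text: two 2-way conditionals
# overlap for 3 control steps.
figure_3_14 = [[2, 2], [2, 2], [2, 2]]
print(additional_conditional_states(figure_3_14))  # 9
```

A step with no active conditional is passed as an empty list and contributes nothing, since the empty product is 1.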
Within each of the maximal conditional subgraphs, the same register control lines can be used at each control step independent of which conditional branch is taken. However, different mcs need different control lines and the sum of each maximal path is taken. This value is offset by the number of timesteps associated with a conditional path which lies on the critical path, $PC'_{mcs}$. Thus, the number of additional output lines needed, $o_{cond}$, is

$$o_{cond} = \sum_{mcs} \max_i \{ PC(\beta_i) \} - \sum_{mcs} PC'_{mcs} \tag{3.3.20}$$

$o_{cond}$ is summed over all maximal conditional subgraphs.

The number of product terms is also adjusted for conditionals. Using the same reasoning as that given for the loop termination condition, a $U_k$-way conditional requires $U_k - 1$ product terms to choose the correct conditional branch. Thus, the number of additional product terms for conditionals, $p_{cond}$, is

$$p_{cond} = \sum_{k} (U_k - 1) \tag{3.3.21}$$

and summed over all conditionals.

[Figure 3.14: Complex Conditional Path Example]

In the "register control" method, each register already has a control line which will be coupled to the additional product terms to reflect conditional usage. Thus, $o_{cond} = 0$, since no additional register control lines are required.

The use of status bits offers another method for a PLA to operate conditionals. A status bit (or bits) determines which of $D$ conditional paths to operate and requires $\lceil \log_2 D \rceil$ PLA inputs. In order for one path to be distinguishable from another, a separate product term is required for each unique "state". Similarly, the output lines must control the proper hardware and may have to determine the next state based upon the inputs. The number of outputs and product terms remain unchanged since the status bit is used to distinguish between several states which share the same product term. Likewise, the number of registers controlled does not change. However, since each conditional path is likely to need a unique status bit (and this is a requirement for overlapping conditional paths in parallel branches), the status-bit method input line count can exceed the input count for the state expansion method. As a result, if status bits are used, they may yield a larger PLA.

3.4 Extending the PLA Model

There are two natural extensions to the PLA model described earlier. First, only a minimized PLA was considered. In some designs which have more placement freedom and where space is at a premium, folding of the PLA controller is performed. Estimating the size of a folded PLA is useful for such designs. In addition, the PLA model is specific to control of a non-pipelined data path design. By adjusting the model, control of pipelined designs can be accommodated. Both of these extensions are discussed in the following sections.

3.4.1 Estimation of Folded PLA Area

Folding of PLAs is a common practice for reducing their size. Numerous heuristic techniques are available, but few models predicting folding possibilities have appeared until recently. A unique approach which uses statistical modeling to quantify the amount of PLA folding has been published [MT86]. Using as a basis the number of rows, number of columns, and cross-point density of a PLA, Makarenko analytically derived and verified experimentally a probability density function (PDF) for the expected number of folds.

This model has two limitations. One is the assumption that cross-point density is uniform. In most PLAs, cross-point density varies.
The second limitation is that accurate results could only be obtained by repeated trials of a "random selection heuristic" after tuning to a particular PLA generation algorithm.

To overcome these limitations, the folding model as described in the equations of Theorem 3.16 in [MT86] was only applied to the output lines of the PLA; folding of input columns presents a special problem. The authors of this paper [MT86] suggest that inverters be provided at the inputs as needed. However, if $a$ and $\bar{a}$ appear on opposite sides of the PLA as inputs, $a$ must still be routed to the other side of the PLA to produce its inverse. This would likely result in an area increase. In addition, the cross-point density on the input side for a state machine is generally close to 0.5, a value where folding is no longer possible according to the probabilistic model.

Conversely, on the output side, regardless of which of the basic methods described here is used to predict the state-machine parameters, the cross-point density is more closely distributed. Furthermore, except for very small PLAs, the density is less than 0.50, making this portion amenable to folding. With this adjustment to the folding model, multiple sampling is not necessary to account for a particular folding algorithm; the PDF value is calculated once for each folding quantity. This is possible as folding methods published (e.g. [BB87] and [RSV86]) yield near-optimal results for the output columns as predicted in other publications ([LVSV82], [LA83]).

Inputs to Makarenko's folding model are the number of rows ($r$), number of columns ($c$) and cross-point density ($d$). The output rows and columns are trivial to compute:

$$r = p \tag{3.4.22}$$

$$c = o \tag{3.4.23}$$

Cross-point density here is the number of physical connections between product terms and output lines.
(An example PLA is shown in Figure 3.1 with cross-points marked.) Since PASTA assumes a simple state assignment where sequential states are represented as sequential binary numbers, an exact value can be determined for connections on the output state bits by summing the binary 1's in each state bit representation over all states, giving $S_b$. A lower bound can be computed for the remaining output lines. Regardless of whether "stage control" or "register control" is used, each product term drives at least one register/stage output line. In addition, multiplexer/bus and auxiliary outputs (if any) must be driven by at least one product term. Thus, a lower estimate of crosspoint density is

$$d = \frac{S_b + p + A + M}{o \times p} \tag{3.4.24}$$

Since $A$ and $M$ (auxiliary and multiplexer) output lines are generally a small quantity, the density error will be small. Otherwise, the lower crosspoint density may result in higher folding predictions than is actually possible.

The PDF described in [MT86] is taken over all folding numbers from zero to $c/2$ using the crosspoint density described here; the fold number with the highest probability is taken as the estimate on the number of folds. Experimental results are described in Section 3.5.

3.4.2 Extensions for Pipelined Controllers

Pipelined designs have additional control considerations since multiple portions of the data path are being independently and simultaneously operated. Problems include tracking valid data in the pipe (fill/flush) as well as handling conditional paths. Note that inner loops are not allowed within the pipe itself, so they will not be considered here.

One method for tracking valid data in the pipe is for the controller to track and update the pipe status during filling, full operation, and flushing of the pipe contents. In practice, this is rarely done.
Rather, a status bit is maintained at each stage which indicates data validity. (In some high-speed applications such as video processing, where occasional invalid results are still acceptable, data validity checking is not even performed.) Essentially, the valid bit is introduced with the data and is reset during any stage where the data becomes invalid. This bit propagates to the end of the pipe stage-by-stage; it is also used to disable any detrimental operations that might be performed during a given stage. For example, if a memory write should only contain valid data, this bit is used to inhibit the write.

The control problem focuses upon handling unconditional and conditional pipelines, which is an extension of the non-pipelined case previously described. Two new terms are introduced: initiation interval ($I$), which is the number of clock cycles between successive initiations of input data, and microcycles ($u$), the number of groups of clock cycles which overlap in execution.

$$u = \frac{P}{I} \tag{3.4.25}$$

$P$ is the number of clock cycles before a given dataset is completely processed and appears at the output. Since data is introduced every $I$ clock cycles, $I$ is the basic number of states in the PLA provided there is no explicit fill/flush mechanism other than the use of "valid" bit registers.

$$C = I \tag{3.4.26}$$

The PLA area estimation equations accept $I$ instead of $P$ to determine the PLA parameters for unconditional pipelined designs.

Conditional branches pose a more difficult problem. If a conditional branch appears across two or more microcycles, a different part of the path may be active in each. One must consider all possible distinct paths in determining the number of PLA states. Thus, the state contribution for conditionals, $C_{cond}$, becomes

$$C_{cond} = \sum_{i=1}^{I} \zeta_i \tag{3.4.27}$$

where

$$\zeta_i = \prod_{k=1}^{u} DC'(\beta_k) \tag{3.4.28}$$

For each relative clock cycle of the $I$ in a microcycle, the number of distinct paths in every microcycle is multiplied together. $DC'(\beta_k)$ is the number of distinct conditional paths in microcycle $k$ and relative clock cycle number $i$; it is one if there are no conditional paths. The number of PLA states is determined by summing the value for each clock cycle over the initiation interval.

From Equation 3.4.28, it can be seen that the controller size grows as the number and length of the conditional branches increase. Eventually, it would be less costly to use one controller in each microcycle operating the $I$ stages. This tradeoff is a topic of ongoing research in control area estimation which will be examined in Chapters 5 and 6.

3.5 Validation of the PLA Model

In the previous sections, models for PLA state machine parameters were developed for sequential dataflow graphs as well as for loops and conditional branches. Extensions for PLA folding and use in pipelined designs were also discussed. Experimental results are presented in this section.

3.5.1 Validation Process

To validate the PLAs, a number of scheduled behavioral graphs had their controllers generated using the Berkeley toolset as well as predicted by PASTA. Since both the Berkeley toolset and PASTA require a description file as their input, it is essential that the two descriptions address control in the same fashion. Thus, both files should

• address the same graph,
• start with the same data path schedule,
• be configured for the same type of control (register versus stage control), and
• (for the Berkeley tools) not be a redundant or poor description.

The first two are obvious; the third requires a decision regarding which control method to adopt.
For validation, a variety of designs cover one or the other control method, depending upon which is smaller in area. Finally, the fourth point highlights a problem encountered during validation: poor descriptions. This issue will be examined more closely.

3.5.1.1 Basic Model

For simple descriptions with no loops or conditionals, comparison between the predicted and actual PLA is reduced to the type of control method employed. Small PLAs tend to favor stage-control as this limits the size of the PLA. As the controller grows, a transition to control of registers is used to maintain the lowest possible area.

Earlier in this chapter, the basic model parameters (inputs, outputs, and product terms) were calibrated against the Berkeley tools. Now the complete control model as implemented in PASTA will be used to validate the basic PLA equations presented earlier. For "stage control", verification is easily accomplished. Since no control of registers is being considered, each PEG description takes on the following form:

    -- Generic state machine (state outputs)
    OUTPUTS : s1 s2 ... sn;
    Start : ASSERT s1;
          : ASSERT s2;
    -- insert states as needed (each colon represents a state)
    -- last state MUST loop back to top of PLA
          : ASSERT sn;
            GOTO Start;

Some additional results and examples are shown in Table 3.4. The first portion of Table 3.4, where type is Sta, are the stage-control PLAs. Note that in all cases, the inputs, outputs, and product terms match. The error introduced is strictly due to the PLA area model coefficients.
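The input/output/product-term (i/o/p) counts in Table 3.4 follow a simple pattern that can be sketched in Python. The sketch below is an assumption read off the table rows, not the calibrated basic equations referred to above: inputs are taken as the number of state bits, outputs as the control lines plus the state bits, and product terms as the number of states. The function name and argument names are hypothetical.

```python
def basic_pla_counts(states, control_lines):
    """Return (inputs, outputs, product_terms) for a loop-free controller.

    Assumed pattern, inferred from the rows of Table 3.4:
      inputs        = state bits for a sequential binary assignment
      outputs       = control lines + next-state bits
      product terms = one per state
    For stage control, control_lines equals the number of states;
    for register control, it is the number of registers R.
    """
    state_bits = max(1, (states - 1).bit_length())  # ceil(log2(states))
    return state_bits, control_lines + state_bits, states

# Stage control, 16 states: one output line per stage.
print(basic_pla_counts(16, 16))   # (4, 20, 16)
# Register control, 128 states with R = 40 registers.
print(basic_pla_counts(128, 40))  # (7, 47, 128)
```

Under this pattern, the choice between stage and register control only changes the output count, which is why the smaller of the state count and the register count decides the method.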
Table 3.4: PLA Area Estimation: Basic Model

                      Estimate                Actual
    States  Type    i/o/p      Area (mil²)   i/o/p      Area (mil²)   Error (%)
       2    Sta     1/3/2        11.4        1/3/2        11.8         3.5
       4    Sta     2/6/4        18.7        2/6/4        18.9         0.1
       8    Sta     3/11/8       38.0        3/11/8       38.5         1.2
      16    Sta     4/20/16      80.7        4/20/16      82.3         2.0
      32    Sta     5/37/32     225.1        5/37/32     226.8         0.1
      64    Sta     6/70/64     674.2        6/70/64     684.5         1.5
      96    Sta     7/103/96   1402.4        7/103/96   1414.8         0.1
     128    Sta     7/135/128  2335.8        7/135/128  2346.3         0.0
       4    R=2     2/4/4        16.1        2/4/4        16.1         0.0
       8    R=4     3/7/8        31.2        3/7/8        31.0         0.0
      16    R=8     4/12/16      60.8        4/12/16      60.7         0.0
      32    R=16    5/21/33     159.2        5/21/32     153.1         3.8
      64    R=32    6/38/64     421.5        6/38/64     420.0         0.0
      96    R=32    7/39/96     656.0        7/39/96     651.6         0.1
     128    R=40    7/47/128    978.1        7/47/128    974.8         0.0

The second half of the table addresses register control. A number of PLAs were constructed with an arbitrary number of registers which were randomly asserted (but at least one register asserted per state). The form of the PEG file is as follows:

    -- Generic state machine (register outputs)
    OUTPUTS : r1 r2 ... rn;
    Start : ASSERT r? r? ...;
          : ASSERT r? r? ...;
    -- insert states as needed (each colon represents a state)
    -- last state MUST loop back to top of PLA
          : ASSERT r? r? ...;
            GOTO Start;

[Figure 3.15: Data Path of Averaging Operation]

Results for the register control method are even more accurate. However, one anomaly did occur for the 32 state design. In this case, the PLA minimization tools could not reduce the number of product terms to the theoretical estimate; one additional product term was needed in the actual design.

3.5.2 Loops and Conditionals

As part of the PLA model validation for loops and conditional paths, four example behaviors were synthesized manually and compared against the estimates.
In addition, a comparison was made against both loop unrolling and an external counter for the loop examples. Two of these behavioral graphs are specific to loops as shown in Figures 3.15 and 3.16 and the others specific to conditionals, with one conditional graph shown in Figure 3.17. Table 3.5 lists the results.

[Figure 3.16: Serialized ADD operation]

Table 3.5: PLA Area Estimation: Conditionals and Loops

                                                Physical Control Area (mil²)
    Type    Figure      Impl.        K_reg     Estimate    Actual    Error (%)
    Loop    3.15        Unroll        0.0        27.8       27.3       1.8
                        Ext. Cntr.    0.0        83.1       85.1       2.4
    Loop    3.16        Unroll        1.0        27.8       27.6       0.1
                        Ext. Cntr.    1.0        31.4       29.9       5.0
    Cond.   not shown   Expand         -         34.8       34.6       0.1
    Cond.   3.17        Expand         -        130.1      132.7       2.0

For each of the loop examples, which represent the extremes of register sharing, construction of the PEG control file was slightly different. Note that the multiplexer control was included for the loop unrolling example, but is not required as a single 3-input AND gate would have sufficed. The PEG descriptions for the larger loop shown in Figure 3.16 are similar.

    -- Loop1 example for unrolling (Figure 3.15)
    OUTPUTS : mux1 reg1 reg2;
    Start : ASSERT mux1 reg1;
          : ASSERT reg1;
          : ASSERT reg1;
          : ASSERT reg1;
          : ASSERT reg1;
          : ASSERT reg1;
          : ASSERT reg1;
          : ASSERT reg2;
            GOTO Start;

    -- Loop1 example for hardware cntr (Figure 3.15)
    INPUTS : instat;
    OUTPUTS : dec ld reg1 reg2 val1 val2 val3;
    Start : ASSERT ld reg1 val1 val2 val3;
    Back  : ASSERT dec reg1;
            IF instat THEN Back;
          : ASSERT reg2;
            GOTO Start;

The PASTA input files for the examples shown in Figures 3.15 and 3.16 are fairly simple. (The PASTA notation is described in Appendix E.) Each description starts with an equal sign "=" which indicates the data path states, registers, and optional multiplexer lines for the controller.
In the second description, a unique register is controlled during each of the 7 iterations of the control-step loop. (The colon signifies that $K_{reg}$ follows (scaled by a factor of 100); by default, $K_{reg} = 1$.) The last description has a 4-iteration loop with the same registers operated during each loop. Thus, $K_{reg} = 1$ and the value after the loop count is set to 100.

    #
    # PASTA input for examples
    #
    # Figure 3.15 unrolled
    = 8 2 1
    # Figure 3.15 counter
    = 2 2 (7:0 * 1) + 1
    # Figure 3.16 unrolled
    = 6 2 2
    # Figure 3.16 counter
    = 2 2 (4:100 * 1)

Despite the small examples used for designs with $K_{reg} = 0$, errors are within acceptable tolerances. For both designs, the same number of inputs, outputs, and product terms was predicted and measured. Conversely, the error for designs with $K_{reg} = 1$ is higher. In this case, there is additional decode hardware besides the external counter. This area can only be estimated by PASTA, whereas a human designer can optimize this portion of the control logic.

In a typical design, one would expect that a large dataflow graph would have some loops as well as some loop-free portions. Given the high accuracy of the non-looping portion of the controller model, large examples should improve upon the error shown here.

The designs of Table 3.5 containing conditional branches include a simple 8-node graph with one conditional branch (not shown) and a complex conditional branch shown in Figure 3.17. The PEG descriptions become complex due to the quantity of conditional branches. In these designs, multiplexers were specifically included since that is the way they were built by the circuit engineer.
    -- Simple conditional
    INPUTS : s1;
    OUTPUTS : r1 r2 r3 r4 m1;
    Start : ASSERT r1;
            IF s1 THEN Lpr;
          : ASSERT m1 r2;
          : ASSERT m1 r3;
            GOTO Lpend;
    Lpr   : ASSERT r2;
          : ASSERT r3;
    Lpend : ASSERT r4;
            GOTO Start;

    -- Park conditional by hand
    INPUTS : d1 d2 d3 d4 d5;
    OUTPUTS : r1 r2 r3 r4 r5 r6 r7 m1 m2 m3 m4 m5 m6;
    Start : ASSERT r1;
            CASE (d1 d3)
              0 ? => d1m;
              1 0 => d3m;
              1 1 => d3p;
            ENDCASE => d3p;
    d3m   : ASSERT r2 m1;
          : ASSERT r3 m3;
            GOTO j3;
    d1m   : ASSERT r2 m1;
          : ASSERT r3 m6;
            IF d2 THEN d2p;
          : ASSERT r4 m2 m1;
            CASE (d4)
              1 => d4p;
              0 => j1;
            ENDCASE;
    d2p   : ASSERT r4 m5;
            CASE (d4)
              1 => d4p;
              0 => j1;
            ENDCASE;
    d3p   : ASSERT r2 m1 m4;
          : ASSERT r3 m2;
    j3    : ASSERT r4 m4 m5;
            IF d4 THEN d4p;
    j1    : ASSERT r5 m3 m1;
            CASE (d4)
              1 => d5p;
              0 => d5m;
            ENDCASE;
    d4p   : ASSERT r5 m6 m4;
            IF d5 THEN d5p;
    d5m   : ASSERT r6 m6 m5;
            GOTO j5;
    d5p   : ASSERT r6 m3 m2;
    j5    : ASSERT r7 m3 m2 m1;
            GOTO Start;

[Figure 3.17: Complex Conditional Graph. © 1986 N. Park]

In contrast to the PEG input, the PASTA input is much smaller, being an estimator. The number of states with which an inner loop is controlled is obtained by applying the data path estimator (PSAD-NP) to the inner branches after the resources are fixed by PSAD-NP for the entire design.

    #
    # Park stuff
    #
    = 4 4 1
    1 + s|{2,2} + 1
    = 5 7 6
    (a|{(1+b|{1,1}),(c|{1,2})} & (1 + d|{1,1})) + (1 & e|{1,1})

There is no error in the predicted parameters for the small conditional; the inputs, outputs, and product terms match exactly. However, the large conditional has an error associated with it. Although the inputs and outputs matched (9 and 17, respectively), the number of product terms did not (22 versus 23 actual). An examination of outputs showed that the estimated design generated a PLA with 14 states; the real design took 15 states. The extra state is due to the actual schedule which entailed one more state within an inner conditional than predicted.
During the first portion of the validation, details of the individual estimated and actual PLAs have been given. For the remainder of this section, detailed descriptions will be excluded and only the summarized results are presented.

3.5.3 Pipelined Designs

A comparison of experimental results and predictions for pipelined designs is shown in Table 3.6. The same behavioral descriptions that have been previously used were pipelined using Sehwa and a controller manually constructed to operate each design [PP86]. Note that behavior containing inner loops is not allowed in pipelined design. Thus, only sequential and conditional behavior is demonstrated.

Table 3.6: Summary of Results for Pipelined PLA

    Dataflow                     Init.      Inputs       Outputs      P terms
    Graph               Parts    Interval   Pre.  Act.   Pre.  Act.   Pre.  Act.
    Multiplier            4         2         2     2      5     5      3     3
                          7         5         3     3      7     7      6     6
    AR Lattice Filter     7         3         2     2      5     5      3     3
                         16         6         3     3      9     9      6     6
    Large Conditional     6         2        11    11      8     8     38    40
                          6         3        10    10      7     7     29    30

3.5.4 PLA Column Folding

The validity of the PLA folding model can be seen by the results contained in Table 3.7. Several of the designs described previously in Table 3.5 were manually folded using the algorithm described in [BB87]. The density adjustment factor determined in [MT86] (0.82) is used for the estimation model. Folding over 30 designs resulted in an average error of ±1 columns.

Table 3.7: Comparison of Folded PLA Area

                                                    Folded Columns
    Description                States    Columns    Actual    Estimate
    AR Lattice Filter             7         9          4         5
                                 19        11          5         4
    Multiplier                    4         6          3         3
                                  6         7          3         4
    Large Conditional            21        11          5         4
    Average Loop (Unrolled)       8         5          2         2
    Serial ADD (Counter)          3         7          1         2

3.5.5 Use of PASTA

Although PLA area estimation can be used as a post data path synthesis tool, its primary purpose is that of estimating control area prior to synthesis. In order to accomplish this task, tools which predict the number of states, hardware resources and register requirements are necessary. The number and length of conditional paths, as well as the length of each loop, can be estimated from the resources and dataflow graph; loop counts are not known. For practical designs, this is rarely a problem as most loops either have a fixed count or a loop termination test based upon some condition external to the PLA.

When using estimated data path results, the actual scheduling and allocation of each operation are not known. However, using the predictor described here, the control area can be estimated. Two examples illustrate the use of this technique.

For the first example, an elliptical wave filter (a dataflow graph with 36 nodes and 57 edges) was processed first through SLIMOS to obtain the best module set [JPP88]. The dataflow graph and module set are examined by the pipeline estimation tool to produce a set of predicted designs [JPP87]. One design having an initiation interval of $I = 7$ has resources of 4 adders and 2 multipliers and a predicted clock of 2950 ns. Finally, a register prediction tool estimated that 14 registers would be needed. Since the number of registers is larger than the number of states, the stage control method is used. The predicted and actual results for synthesized controller and data path are shown in Table 3.8.

For the second example, the large conditional (Park) is revisited. Estimated resources and clock cycle time were generated for an initiation interval of five [JMP88]. Based upon this prediction, conditional path lengths were estimated as their length in operations; these paths were separated into control steps (17).
Since only 3 registers are predicted as being needed, register control was used for this method. The results are also included in Table 3.8.

Table 3.8: PLA Area Estimation using Data Path Prediction Tools

    Dataflow       Stages       Inputs       Outputs      P terms      Area
    Graph          Pre.  Act    Pre.  Act    Pre.  Act    Pre.  Act    Pre.    Act
    Ellip. Fltr     18    14      3     3     10    10      7     7     34.7    33.4
    Large Cond.     14    17     10     9      8     7     22    23    102.3    97.9

3.5.6 Shortcomings of Using the Berkeley Toolset

The Berkeley toolset offers a complete and robust solution to the synthesis of PLA controllers. Unfortunately, there are several shortcomings to this package found in: PEG, EQNTOTT, and ESPRESSO. PEG is the front-end tool which accepts the user-specified control description and generates the state equations. EQNTOTT accepts the state equations produced by PEG, outputting an unoptimized PLA personality matrix. ESPRESSO reads this personality matrix and minimizes the number of product terms.

One shortcoming (or feature) of PEG is that the controller must be completely specified by the user, but the user has considerable freedom in expressing equivalent behavior. PEG will merrily generate useless descriptions as easily as excellent ones; the primary driver is the input control file. Unfortunately, this could result in a large discrepancy between PEG and PASTA control area estimates.
For example, the following is a description in PEG:

    OUTPUTS: register_1_WRITE register_2_WRITE register_3_WRITE
             register_4_WRITE register_5_WRITE register1_1_WRITE;
    Start  : ASSERT register_1_WRITE register_2_WRITE register_3_WRITE;
    State1 : ASSERT register_2_WRITE register_4_WRITE register_5_WRITE;
    State2 : ASSERT register_4_WRITE register_5_WRITE register1_1_WRITE;
    State3 : ASSERT register_1_WRITE register_4_WRITE register_5_WRITE;
    State4 : GOTO Start;

The minimized PLA resulting from this description (after executing PEG, EQNTOTT, and ESPRESSO) has 3 inputs, 9 outputs, and 4 product terms for a total area of 25.3 mil². The comparable PASTA input file specifies 5 states and 6 registers and gives 3 inputs, 8 outputs, and 5 product terms for a total area of 27.7 mil²; an error of 10%. Although the area is close, one would expect the product-term and output counts to match.

The difference in output lines is due to the method by which the PLA is controlled. The PEG description is controlling six registers in five control steps. Conversely, PASTA uses the stage control method, thereby saving one output line. (In this example, three external 2-input OR gates will be required for the registers; however, the total control area would still be larger using register control.)

Besides a difference in output lines, the product-term counts are dissimilar. This is caused by the State4 description where nothing is happening. Although this could be an idle state, it is likely the user meant to loop the description at the end of State3. The PLA controller is now reduced by one state giving 2 inputs, 8 outputs, and 4 product terms. An adjustment in the PASTA input file and forcing register control yields the same PLA parameters.
In this case, PASTA gave an area of 21.6 mil² versus the 21.3 mil² actual, an error of only 1.4%.

Finally, in this example, there is a pair of redundant register control lines: register_4_WRITE and register_5_WRITE. (This is why the earlier OR gate estimate was three instead of four.) Unfortunately, since the Berkeley tools have PLA interior modules which use pairs of lines, reducing the output line count from 8 to 7 does not have any effect in this case. However, the net effect of all changes has reduced the PLA area by 19%, which would be even higher for single-line PLA blocks. It is clear that careful attention must be paid to the control description; a poor one will result in a more costly PLA to carry out the same functions.

Another unfortunate consequence of the PEG description language is its inability to accept "don't care" output signals. All output lines produced by PEG are positively controlled. An output line that is not asserted is deasserted. A "don't care" state occurs with some frequency for both registers and multiplexers. These "don't cares" could be exploited to reduce the PLA density and, possibly, the number of product terms. Lower PLA density might allow increased folding and a further reduction in PLA area.

The last feature lacking in the toolset is the ability to recognize redundant output lines and combine them. In particular, it is easy to produce PLAs with identically controlled output. A useful extension would be for EQNTOTT or ESPRESSO to combine these outputs. This would be even more critical for reducing the PLA size when output lines have many "don't care" states. Many examples have been encountered which exhibit redundant output control lines.
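Detecting identically controlled output lines, the extension suggested above, can be illustrated with a short Python sketch. This is not part of EQNTOTT or ESPRESSO; the function name and the set-per-state input format are assumptions chosen here for illustration, and "don't care" entries are not modeled. The data reproduce the PEG example of this section.

```python
from collections import defaultdict

def redundant_output_groups(states, outputs):
    """Group output lines that are asserted in exactly the same states.

    states  : list of sets, one per control state, naming asserted lines
    outputs : list of all output line names (hypothetical format)
    """
    groups = defaultdict(list)
    for line in outputs:
        # The column of the personality matrix for this output line.
        column = tuple(line in asserted for asserted in states)
        groups[column].append(line)
    # Only columns shared by two or more lines are redundant.
    return [lines for lines in groups.values() if len(lines) > 1]

outputs = ["register_1_WRITE", "register_2_WRITE", "register_3_WRITE",
           "register_4_WRITE", "register_5_WRITE", "register1_1_WRITE"]
states = [
    {"register_1_WRITE", "register_2_WRITE", "register_3_WRITE"},   # Start
    {"register_2_WRITE", "register_4_WRITE", "register_5_WRITE"},   # State1
    {"register_4_WRITE", "register_5_WRITE", "register1_1_WRITE"},  # State2
    {"register_1_WRITE", "register_4_WRITE", "register_5_WRITE"},   # State3
    set(),                                                          # State4
]
print(redundant_output_groups(states, outputs))
# [['register_4_WRITE', 'register_5_WRITE']]
```

Each group found this way could share a single PLA output line, with the fan-out to the individual registers done outside the PLA.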
3.6 Summary

In this chapter, a method for estimating the area of a PLA controller was presented. The model encompasses common controller design considerations such as loop control, conditional branches, and design style selection (non-pipelined versus pipelined architecture). In addition, the model was extended for estimating the area of folded PLAs. Inputs to the model are values obtained during data path synthesis such as operator allocation and scheduling.

Another, more important application of PASTA, that of high-level area estimation, was also demonstrated. The results suggest that this control area predictor, using the results of data path estimation tools, is capable of providing the designer with accurate chip design area which includes both the control path and data path. Future efforts involve integrating the PLA area estimator with these high-level data path estimators of scheduling and allocation, as well as register and multiplexer models, into a large system tradeoff tool. Estimation techniques could then be used to accurately describe the shape of the design space in area and time with minimal execution time. Synthesis would be reserved for local exploration to find the best design in the chosen region.

Chapter 4

Predicting Register and Multiplexer Requirements

4.1 Introduction

An RTL design consists of hardware modules connected via registers, multiplexers, and wires. Buses could also be used in place of multiplexers. The design space is characterized by the area and time parameters of a design; parameters such as power consumption and temperature could also be used. Currently, prediction techniques exist for area/time (AT) characterization using operator area/delay only [JPP87] [JMP88].
These estimates do not include register and multiplexer/bus area, which have been shown to have a measurable impact on the AT curve [GP82].

Prediction of register and multiplexer usage requires knowledge about the data path. The inputs available to such prediction models consist of

• the behavioral dataflow graph,
• the module set including the number of each type of operator,
• the number of partitions into which the dataflow graph is divided, and
• the initiation interval (for pipelined designs only).

Estimating bus requirements is a more difficult problem. The number of bus drivers can be determined similarly to multiplexer estimation. However, this value is also influenced by the number of available buses and the tradeoff between buses and multiplexers. Bus and driver area estimation is not addressed in this thesis.

In this chapter, methods for accurately predicting register and estimating multiplexer area are described. Upper and lower register bounds are derived for non-pipelined and pipelined designs; these bounds are dependent upon the dataflow graph and, for pipelined designs, the partitions and initiation interval. An accurate model for predicting register use from these bounds is described. The model derived for multiplexer area also requires operator and register usage in addition to the register model parameters. Finally, these models are validated against synthesis tools.

4.1.1 System to be Modeled

The following is a description of the system to be modeled.

• Behavior is described in the form of an acyclic dataflow graph. Nodes represent operations and edges represent values. A given dataflow graph contains one source that has no incoming edges (henceforth called root) and one sink which has no outgoing edges (outport). An example dataflow
graph is shown in Figure 4.1; all incoming edges not tied specifically to another node are implicitly connected to root. All outgoing edges not specifically tied to another node are implicitly connected to outport.
• An edge begins at one node (called the source) and ends at a different node (sink).
• Between any two nodes, there is a maximum of one edge. Multiple edges between two nodes are merged into a single edge which combines the separate values along the original edges into a new value.
• Root can only be a source node; outport can only be a sink node.
• Conditional branches are allowed within the dataflow graph. They may be nested to any depth, and have no restriction on fanout. Special nodes, dist and join, respectively describe the beginning and end of a multi-way conditional branch. All paths from a given dist must pass through the matching join.
• A module library consists of components that can implement the operations in the dataflow graph. Components have area, propagation delay, and other physical attributes associated with them. A module set is a subset of the module library used for a specific design. The module set is complete in that every operation in the dataflow graph has one (and only one) operator type taken from the module library which is used for implementation.

Other terms used throughout this chapter are described here.

• A path is a sequence of one or more edges (e_1, e_2, ..., e_n) such that the sink of e_i is the source of e_{i+1}.
• A conditional path is one which starts at a dist node and ends at the matching join node.
• A complete path is one which starts at root and terminates at outport.

4.2 Register Area in Non-pipelined Designs

To estimate the number of registers, knowledge of both the theoretical upper and lower register bounds is needed.
Theoretical register limits can be determined for non-pipelined designs strictly by examining the dataflow graph. More correctly, one must examine all possible partitions of this dataflow graph. A partition is any edge cutset that separates root and outport; an example partition is shown in Figure 4.1. A scheduled dataflow graph will have one or more partitions which, when the partition lines are drawn in order on the dataflow graph (from the first on up), may partially overlap but will never fully intersect since values on edges only propagate forward in the graph. Physically, a partition distinguishes different clock cycles of the scheduled graph.

Figure 4.1: AR lattice filter showing cutset

Each partition in a dataflow graph is comprised of a set of edges and each of these edges has a value associated with it. A register is required for each unique value that must be retained past the clock cycle in which it is generated. Thus, each edge within a given partition cutset signifies potential register use. The actual number of registers needed for a given clock cycle is determined by four effects:

• For non-pipelined designs, it is assumed that no registers are needed on any incoming edges from root as these values are stored externally. Edges which have a source of root are therefore deliberately excluded when considering register use for non-pipelined designs.
• Conversely, it is also assumed that registers are required on all unique outgoing edges. (These two constraints reflect register assignment used by the validating synthesis tools.)
• A unique value generated by an operation may be present on one or more edges.
Clearly, the value need only be stored once regardless of how many edges hold the value in a partition cutset.
• If two values in the cutset occur on different conditional branches, only one of these values is actually active during that clock cycle. These mutually exclusive values are thus able to share the same register.

The assumptions made regarding which of the graph incoming or outgoing edges are eligible for register assignment reflect operation of the current USC synthesis tools. Where incoming and/or outgoing values are used (V_in and V_out), the equations which will be given are readily modified to reflect assignment of registers to V_in and/or V_out.

By examining all possible partitions, theoretical bounds can be established upon the minimum (R_min) and maximum (R_max) number of registers necessary for non-pipelined designs without performing synthesis. Essentially, the partition with the maximum number of edges is the cutset which defines the maximum number of registers needed. Conversely, the partition with the minimum number of edges is the smallest cutset which defines the minimum number of registers. (The actual size of a given cutset is not that simple, but can be determined using the assumptions just presented.) Estimating the number of registers for non-pipelined designs (the quantity Rnp_est) from these limits is complicated by the fact that only a subset of these partition cutsets is observed in actual designs.

4.2.1 Register Bounds in Non-Pipelined Designs

Analysis of a dataflow graph to determine minimum and maximum register limits can be treated as a network flow problem.
An algorithm which derives the upper bound on the number of registers using network flow has been published by Kurdahi [Kur87].

In network flow analysis, each edge of the dataflow graph is first assigned a non-negative value which defines the minimum (or maximum) flow (akin to oil pipeline flow or electrical current) allowed through that edge. A legal network flow is one where a non-negative integer flow value is assigned to each directed edge such that

1. the flow on each edge is at least at its minimum (or at most at its maximum),
2. the total flow into each node (excepting root and outport) equals the total flow out of that node, and
3. the total outgoing flow from root is equal to the total incoming flow to outport.

When flow through an edge exceeds its minimum, it is said to have excess flow. (Conversely, when flow is less than its maximum, it has excess capacity.)

In Kurdahi's algorithm, a minimum flow constraint of one is assigned to every edge. Then, the graph is initially seeded by adding flow along different complete paths until a legal network flow is achieved. Since this seeding usually results in excess flow along one or more edges, an attempt is made to reduce flow along complete paths until the minimum flow is reached. Reducing total network flow to the smallest legal network flow possible can be accomplished using a min-flow, max-cut algorithm. Algorithms for determining min-flow, max-cut of a dataflow graph are detailed in Even [Eve79]; Dinic's method was chosen by Kurdahi.

After an initial flow is seeded, the Dinic method finds the minimum flow by first forming a layered network to determine if any complete path (from root to outport) can have its flow reduced, and then reducing it if one exists.
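The three legality conditions for a minimum-flow network can be checked mechanically. The following Python sketch is illustrative only: the function names, the dictionary-based graph encoding, and the small diamond-shaped example graph are assumptions made for this example and are not taken from Kurdahi's algorithm or from the thesis tools.

```python
from collections import defaultdict


def is_legal_min_flow(edges, flow, lower, root="root", outport="outport"):
    """Check the three legality conditions for a minimum-flow network.

    edges: list of (source, sink) pairs (hypothetical encoding)
    flow:  dict mapping each edge to its assigned flow
    lower: dict mapping each edge to its minimum allowed flow
    """
    inflow = defaultdict(int)
    outflow = defaultdict(int)
    for edge in edges:
        f = flow[edge]
        # Condition 1: non-negative integer flow, at least the edge minimum.
        if not isinstance(f, int) or f < 0 or f < lower[edge]:
            return False
        src, dst = edge
        outflow[src] += f
        inflow[dst] += f
    # Condition 2: flow is conserved at every node except root and outport.
    nodes = {node for edge in edges for node in edge}
    for node in nodes - {root, outport}:
        if inflow[node] != outflow[node]:
            return False
    # Condition 3: flow leaving root equals flow entering outport.
    return outflow[root] == inflow[outport]


def excess_flow(edges, flow, lower):
    """Return the edges whose flow exceeds the minimum, with the excess amount."""
    return {e: flow[e] - lower[e] for e in edges if flow[e] > lower[e]}


# Hypothetical diamond graph with a unit lower bound on every edge.
EDGES = [("root", "a"), ("root", "b"), ("a", "c"), ("b", "c"), ("c", "outport")]
LOWER = {e: 1 for e in EDGES}
SEEDED = {("root", "a"): 1, ("root", "b"): 1, ("a", "c"): 1,
          ("b", "c"): 1, ("c", "outport"): 2}
```

In this example the seeded flow is legal, and only the edge into outport carries excess flow; no complete path consists solely of excess-flow edges, so the flow cannot be reduced further.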
Figure 4.2: Example layered network

Formation of the layered network entails finding a complete path along edges with excess flow. By reducing flow only along edges with excess flow, it is ensured that a legal network flow will be retained. For construction of the first layer, a set of all nodes is constructed which have flow to outport along an edge with excess flow as shown in Figure 4.2. (All edges are assumed to have excess flow in this example.) Each successive layer consists of nodes having edges with excess flow to any of the previous layers.

At some point, no further layers can be formed. If the last layer reached contains root, then one or more complete paths have been found where flow can be reduced. In this case, the flow is reduced along one path and another attempt to reduce flow is made, starting with formation of the layered network. If root was not reached, there is no such path, indicating that minimum flow has been achieved. The total network flow at this point gives the Kurdahi predicted upper bound on the number of registers, R_kur.

Kurdahi's approach gives an exact solution when all edges in the dataflow graph represent a unique value. If the same value resides on multiple edges leaving the same node, or conditionals exist in the graph, this approach instead yields a result which is greater than the exact value. A revised method will be presented which accommodates these more general dataflow graphs to provide an improved solution.

In Kurdahi's algorithm, the lower bound on flow through the dataflow graph is one unit for every edge. However, if a given node has more than one outgoing edge where each edge has the same value, only the total flow among these edges need be one (as only one register is required); the flow along a single edge might be zero.
This problem is resolved by locally transforming the node to reflect the proper flow constraints and applying eligibility rules to the edges.

Eligibility determines which edges emanating from a node are suitable for register assignment. Essentially, if two edges have the same value (which implies they are from the same node) and one edge has a lifetime that always meets or exceeds the other, then the edge with the shorter lifetime is ineligible for register use. The longer lifetime eligible edge is said to dominate the ineligible edge lifetime. Ineligible edges emanating from a node cannot be used to satisfy the minimum flow through that node.

Definition 4.1 Eligible edges: Given a node n in a dataflow graph with outgoing edges E(n) = {e_1, e_2, ...}, e_i ∈ E(n) is ineligible for meeting the minimum flow through n if

1. there exists an edge e_j ∈ E(n) containing the same value as e_i (which means that e_i and e_j have the same source, but must have different sinks), and
2. there is a path from the sink (or destination) of e_i to the sink of e_j.

Any edge which is not marked ineligible is eligible to meet the flow requirements through n. Clearly, any edge which is the sole outgoing edge of a node is automatically eligible. (See Figure 4.3.)

As an example, a dataflow graph with eligible edges highlighted is included in Figure 4.3. Assume that both edges from s1 hold the same value; thus, either edge might be assigned a register. However, assigning a register to edge s1-s2 is not helpful since the value from s1 must be active until s3 completes. Edge s1-s3 dominates value lifetime, making s1-s2 ineligible to store the value produced at s1. As a result, there is no minimum flow constraint on s1-s2.

Figure 4.3: Example figure highlighting eligible edges
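Definition 4.1 translates directly into a reachability test. The sketch below is a minimal Python illustration, not the thesis implementation: the (source, sink, value) edge encoding, the function names, and the value labels are hypothetical, and the example graph assumes that Figure 4.3 contains an edge from s2 to s3, which the text implies but does not state outright.

```python
def reachable(adjacency, start, goal):
    """Depth-first search: True if a path leads from start to goal."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adjacency.get(node, ()))
    return False


def ineligible_edges(edges):
    """Apply Definition 4.1 to a list of (source, sink, value) edges.

    Returns the set of (source, sink) pairs that are ineligible for
    meeting the minimum flow through their source node.
    """
    adjacency = {}
    for src, dst, _ in edges:
        adjacency.setdefault(src, []).append(dst)
    result = set()
    for src_i, dst_i, val_i in edges:
        for src_j, dst_j, val_j in edges:
            # Condition 1: same source and value, different sinks.
            same_value = src_i == src_j and val_i == val_j and dst_i != dst_j
            # Condition 2: a path from the sink of e_i to the sink of e_j.
            if same_value and reachable(adjacency, dst_i, dst_j):
                result.add((src_i, dst_i))
                break
    return result


# Assumed reading of Figure 4.3; value names "x", "a", "b", "c" are made up.
FIGURE_4_3 = [
    ("root", "s1", "x"),
    ("s1", "s2", "a"),
    ("s1", "s3", "a"),   # same value as s1-s2
    ("s2", "s3", "b"),   # assumed edge: s3 completes after s2
    ("s3", "out", "c"),
]
```

Under these assumptions only s1-s2 is reported as ineligible, which matches the discussion of the example: the value on s1-s3 outlives the one on s1-s2.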
At most nodes, two or more edges with the same value may be eligible. Only one of these eligible edges need have a flow (register assignment) to satisfy the legal network flow. Which edge is assigned a register in an actual design is not known.

To reflect the eligibility constraints, each node which outputs a value along multiple edges is transformed into a subgraph. The algorithm which assigns flow lower bound and transforms nodes with multiple edges is shown in Figure 4.4; it replaces the unity flow lower bound constraint of Kurdahi. Node transformation is performed as needed so that the min-flow, max-cut Dinic method described in the original method can be used. This new procedure produces the modified maximum number of registers, R_max.

procedure low-bound-assign
begin
    Mark all nodes in G(N, E) as "not visited"
    For each node n ∈ G(N, E) marked as "not visited"
    begin
        For each outgoing value v from node n
        begin
            If v is assigned to multiple edges
            begin
                Add no-op node n'_v into dataflow graph below n
                Add edge between n and n'_v, assign flow lower bound of one
                Mark node n'_v as "visited"
                For each edge e having value v
                begin
                    If edge e is eligible
                    begin
                        Move source of e to n'_v
                    end
                    Assign flow lower bound of zero to e
                end
            end
            Else assign flow lower bound of one to the single edge
        end
        Mark node n as "visited"
    end
end

Figure 4.4: Algorithm for DFG transformation/flow lower bound assignment

The effects of this procedure upon the dataflow graph are shown in Figure 4.5(a) and (b). In Figure 4.5a, node n has generated three values among five edges. Excepting the middle c valued edge, all edges are eligible.

After applying the transformation, the graph shown in Figure 4.5b is produced.
For each value v which originally resided on multiple edges in Figure 4.5, a new no-op node n'_v is introduced which has an edge from n (flow constraint of one) and the original eligible multiple edges with value v have their sources tied to this new node. Ineligible edges are not moved, as shown by the one edge with a value of c connected to n. The flow constraints along all constructed edges which have not been assigned are set to zero. This transformation ensures that at least one edge for each value has a flow, while excluding ineligible edges from having a flow requirement.

Determining register requirements for a general dataflow graph with conditional branches entails building outward from the innermost conditional. When a dataflow graph contains conditionals, mutually exclusive subgraphs off the same conditional branch are compared and the subgraph exhibiting the largest number of registers is retained while the others are removed. (A mutually exclusive subgraph is a connected graph with a starting source node of dist and terminating sink node of the matching join along one of the conditional branches of dist.) Repeatedly applying this transformation starting with the innermost conditional alters the dataflow graph into one without conditionals while retaining the register requirements. A conditional dataflow graph and its unconditional transformed view are displayed in Figure 4.6.

Collectively, conditional removal and dataflow transformation/lower bound assignment produce a graph that can have flow seeded and Dinic's algorithm applied as Kurdahi described in his register algorithm. Then, the total minimum network flow realized is the maximum number of registers needed, R_max, if every value in the graph requires a storage register and assuming this register is optimally shared.
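The node transformation and lower-bound assignment of Figure 4.4 can be sketched in a few lines of Python. This is a hedged illustration rather than the program used in the thesis: the function name, the (source, sink, value) edge encoding, the naming of no-op nodes, and the sink names t1 to t5 are assumptions, and eligibility is taken as a precomputed input.

```python
from collections import defaultdict


def low_bound_assign(edges, ineligible=frozenset()):
    """Sketch of the Figure 4.4 transformation.

    edges:      list of (source, sink, value) triples
    ineligible: set of (source, sink) pairs found ineligible beforehand
    Returns a list of (source, sink, flow_lower_bound) triples.
    """
    # Group the sinks of every (node, value) pair.
    groups = defaultdict(list)
    for src, dst, val in edges:
        groups[(src, val)].append(dst)

    transformed = []
    for (src, val), sinks in groups.items():
        if len(sinks) == 1:
            # A value carried on a single edge keeps a lower bound of one.
            transformed.append((src, sinks[0], 1))
            continue
        # The value is on multiple edges: add a no-op node below src.
        noop = f"{src}'_{val}"
        transformed.append((src, noop, 1))
        for dst in sinks:
            # Eligible edges move to the no-op node; ineligible edges stay.
            new_src = src if (src, dst) in ineligible else noop
            transformed.append((new_src, dst, 0))
    return transformed


# Node n of Figure 4.5a: three values among five edges, one c edge ineligible.
FIGURE_4_5A = [
    ("n", "t1", "b"), ("n", "t2", "b"),
    ("n", "t3", "a"),
    ("n", "t4", "c"), ("n", "t5", "c"),
]
FIGURE_4_5B = low_bound_assign(FIGURE_4_5A, ineligible={("n", "t4")})
```

For this input the sketch produces two no-op nodes (one each for values b and c), leaves the single a edge with a lower bound of one, and keeps the ineligible c edge attached to n with a lower bound of zero, which is the shape described for Figure 4.5b.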
Hence, the estimated maximum number of registers needed for non-pipelined designs, Rnp_max, is

$$Rnp_{max} = R_{max} \qquad (4.2.1)$$

Figure 4.5: Transformation of dataflow showing lower bound on flow

Figure 4.6: Transforming conditional graphs for register computation

                                     Computed maximum            Actual
Dataflow Graph       Nodes  Edges   Rnp_max  Kurdahi [Kur87]    maximum
AR filter              30     52       8          12               8
FIR filter             25     47       8           8               8
Elliptical filter      36     57       8          15              12
HAL example [PK87]     12     23       5           8               5
Random                 28     57      11          13              11
Multiplier             11     23       5           8               5
Criss-cross             8     13       2           4               3

Note: Kurdahi's algorithm was adjusted to ignore register cutsets along incoming edges.

Table 4.1: Comparison of register requirements: non-pipelined designs

The procedure for finding R_max does not distinguish between edges having the same value when both are eligible. This simplification can lead to underestimating the maximum number of registers when "dependent register assignment" occurs. Depicted in Figure 4.7, this condition arises when two or more nodes, each having two or more outgoing eligible edges containing the same value, have identical outgoing paths. Dependent upon the graph scheduling, an additional register may be required.

In the example, both a2 and a3 have edges to a4 and a5. Using the algorithm, this criss-cross dataflow is transformed into that shown in Figure 4.8; numbers given are the minimum flow bound. All edges are eligible, giving a register value (minimum flow) of two instead of the actual three. Assuming a flow of one from a2 to a4 and from a3 to a5, an additional register is needed for the register dependent edge from a3 to a4 (or a2 to a5) to accommodate the worst case cutsets of {(a2, a4), (a3, a4), (a5, a6)} or {(a2, a5), (a4, a6), (a3, a5)}. In the
In th e | i dataflow exam ples encountered, few exhibit dependent register assignm ent as th e I results in T able 4.1 indicate. H ence, resolving dependent flow was not considered. D eterm ining th e m inim um num ber of registers required for non-pipelined ! designs, R n p min , is trivial; one sim ply counts th e num ber of edges w ith unique values having a sink of outport. Hence, RuPmin = Vout (4 .2.2) 172 .root a2 «4 a5 a6 out Figure 4.7: Criss-cross 173 .root aV a2 a5 a4 a6 out Figure 4.8: M odified criss-cross m arked w ith lower bound on flow 174 E quatio n 4.2.2 yields th e m inim um num ber of registers for non-pipelined designs since by our definition: I • registers are always required after th e last operation(s) in th e dataflow graph, and • any sm aller register cu tsets in th e dataflow graph are do m in ated by the required registers at th e outport. i I i 4,2.2 Estimating Register Use in Non-Pipelined Designs J A lthough register bounds are readily determ ined for non-pipelined design, th e J count for a specific design could fall anyw here betw een. A m eth o d for pre- j dieting register usage for any given design is im p o rtan t, p a rticu larly for larger designs w here these bounds m ay be widely separated. T h e approach presented is based upon em pirical observations on th e num ber of registers needed in any non-pipelined design. T h e estim ate of th e num ber of registers in non-pipelined designs, R stab, is dependent upon th e register bounds described earlier as well th e topology of th e dataflow graph. Term inology w hich will be used th ro u g h o u t th e rem ain d er of th is chapter reflect th e dataflow graph and resources assigned to it. h 4 - is th e nu m b er of op erations of ty p e i in th e dataflow graph. n 4 - is th e effective nu m b er of operations of ty p e i as defined by Ja in [JPP87]. Essentially, n; is th e m axim um num ber of resources of ty p e i needed for th e m ost parallel design. 
Dataflow graphs with no conditionals have n_i = h_i; otherwise, n_i may be less than h_i.

The dataflow graph "shape", as measured by the total number of nodes versus its length from root to outport L (in nodes), gives the effective number of expected registers per partition as*

$$R_{eff} = \frac{\sum_i n_i}{L} \qquad (4.2.3)$$

* Actually, one must also consider operations which generate more than one unique outgoing value. However, this clouds the derivation without any benefit. The computer program which implements the algorithm determines the actual number of outgoing values.

Note that L never includes nodes having no delay or area such as dist, join, root, and outport. Dist and join only propagate values; they do not generate any new ones. Per the register assignment assumption, root values are not considered and outport does not produce any values.

If R_eff is small compared to Rnp_max, the graph is relatively long (measured from root to outport in nodes) and narrow (fewer nodes in parallel). Hence, the registers needed for a given design should be closer to Rnp_min as only a small number of edge cutsets or partitions are likely to have a large cardinality. Conversely, as R_eff approaches Rnp_max, the graph becomes much more squat and wide; an increasing number of cutsets would exhibit the larger cardinality and Rnp_max registers would be needed. A linear approximation between these limits gives

$$R_{stab} = \frac{R_{eff}}{Rnp_{max}} \left( Rnp_{max} - Rnp_{min} \right) + Rnp_{min} \qquad (4.2.4)$$

$$R_{stab} = \frac{\sum_i n_i}{L \times Rnp_{max}} \left( Rnp_{max} - Rnp_{min} \right) + Rnp_{min} \qquad (4.2.5)$$

The estimated value for the number of registers in non-pipelined designs, Rnp_est, is determined by R_stab except for the case where the number of partitions p is one. Here, the value is known to be Rnp_min.

In the program REGEST, which implements the register prediction algorithm, non-pipelined design register count and area are computed for the specified graph. Both Rnp_min and R_stab are printed. In addition, REGEST computes the average bitwidth at the node outputs in the graph and estimates the cost in register-bits.

The AR filter serialization versus register usage graph in Figure 4.9 is constructed from Table 4.2. Model results are compared against the register allocation of REAL [KP87] (which uses the non-pipelined designs synthesized by MAHA [PPM86]). Additional examples are analyzed later in this chapter.

               Registers
Partitions   Actual   Estimate
     1          2         2
     3          4         5
     4          5         5
     7          6         5
     8          6         5
     9          6         5
    14          5         5
    17          7         5
    18          7         5
    19          6         5

Table 4.2: Comparison of AR Filter estimated and actual register use

Figure 4.9: Register values for AR lattice filter (X: stages, Y: registers; actual and Rnp_est values plotted against the R_Kurdahi, Rnp_max, and Rnp_min levels)

4.3 Register Area in Pipelined Designs

Computation of register requirements for pipelined designs is an extension of the non-pipelined results. In non-pipelined designs, the value assigned to a given edge is fixed until the output values are latched at the end of the p-th (last) clock cycle. However, in pipelined designs, the value stored along an edge is only stable for at most the initiation interval, I, where I < p, before it is either transferred to another register or discarded. (Non-pipelined design can be considered a special case of pipelined design where p = I.) Whereas a single edge could never have more than one register in a non-pipelined design, edges in pipelined designs may have more if the value crosses microcycle boundaries (described shortly).
Another major difference between pipelined and non-pipelined designs is that the dataflow graph input values are latched instead of the output values (per the USC synthesis tools used for validation). This alters the restriction described for non-pipelined designs, which not only demonstrates the flexibility of the model, but allows comparison against the pipeline synthesis tools which assign registers to the dataflow graph inputs.

4.3.1 Register Bounds in Pipelined Designs

A pipelined design consists of two or more microcycles, u. Each microcycle is comprised of I consecutive partitions in the dataflow graph, excepting the last microcycle which may have fewer than I. What distinguishes pipelined design from non-pipelined design is that these microcycles all operate synchronously in parallel. In other words, the first clock cycle or partition of all microcycles executes simultaneously, followed by the second clock cycle of all microcycles, and so forth, until the I-th clock cycle is completed, at which time the procedure repeats.

The tremendous impact of pipelined design upon register count can be realized by examining Figure 4.10. (All edges with no source shown are assumed to be connected to root with all the left-hand-side incoming graph edges of this type taking one value and the right-hand-side taking another value.) Here, a graph has been divided into two microcycles (indicated by the triple line) with an initiation interval of three for a total of six partitions or clock cycles. In the first clock cycle, add2, add3, add4, and add5 generate their values at the same time. In the second, all the multipliers operate. In the last clock cycle of each microcycle, the remaining add operations occur, two values are saved across the microcycle boundary, and an output is generated.
At this point, the procedure repeats with add2 and add4 using a new set of data and add3 and add5 using the last output of adda and addb.

A non-pipelined design would only require two registers. However, for this pipelined design, eight registers are required:

• Two registers are needed for the incoming root values along the left and right edge.
• Two registers are needed within the first microcycle; these registers are shared over the three clock cycles.
• Two registers are needed in the second microcycle. Since this microcycle operates in parallel with the first, it cannot share the registers used in the first microcycle.
• Similarly, two registers are needed to retain the old incoming root values during the second microcycle as the first microcycle root registers now have a new value assigned.

Although the total number of clock cycles or partitions, p, is simply u × I in this example, the last microcycle could have fewer than I clock cycles. The parameters which are provided are p and I. The number of microcycles is computed as

$$u = \left\lceil \frac{p}{I} \right\rceil \qquad (4.3.6)$$

Figure 4.10: Graph depicting pipelined scheduling

One feature of pipelined designs is that registers are needed on all incoming values, V_in, of the dataflow graph. Another feature is that one edge may have more than one register dependent upon whether it crosses a microcycle boundary. For the first microcycle, the number of registers needed is the maximum of V_in and Rnp_max.
Remaining microcycles would each need the register estimate derived for non-pipelined designs, giving the estimated maximum number of registers for pipelined designs, Rp_max, as

$$Rp_{max} = \max(V_{in}, Rnp_{max}) + (u - 1) Rnp_{max} \qquad (4.3.7)$$

Determining the lower bound on estimated number of registers for pipelined designs, Rp_min, is similar to the method used to compute Rp_max.

$$Rp_{min} = \max(V_{in}, Rnp_{min}) + (u - 1) R_{min} \qquad (4.3.8)$$

R_min is computed using a dual of the Dinic algorithm for R_max described in the previous section. There are several differences to the heuristic:

• The flow constraint assigned to each edge is the maximum rather than the minimum allowed.
• Any edge with a zero constraint for finding R_max is replaced by the value for R_max or some higher arbitrary number. These edges do not constrain flow.
• The network is not seeded, or rather, the initial network flow is zero.
• The layered network is built forward starting with root and working towards outport through edges with excess capacity.
• If a complete path exists through the layered network, flow is added along this path and the process repeats with a new layered network.

When the maximum flow is reached, the total network flow is the minimum number of registers needed in the graph, or R_min.

4.3.2 Estimating Register Use in Pipelined Designs

Estimating register area for pipelined designs is complicated by the partitioning of the design into microcycles. Each microcycle can be treated, in isolation, as a non-pipelined subgraph. And, as will be shown, only the microcycle count is significant to estimate pipelined design register usage; individual partition and initiation interval differences are secondary.
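The maximum-flow computation used for R_min in Section 4.3.1 can be illustrated with a small Python sketch. Note the substitution: instead of the layered-network (Dinic) dual described in the text, this example uses plain breadth-first augmenting paths (Edmonds-Karp), which reaches the same maximum flow on small graphs. The function name, the capacity dictionary, and the two example graphs with unit capacities are assumptions made for illustration.

```python
from collections import defaultdict, deque


def max_flow(capacity, source="root", sink="outport"):
    """Maximum flow by BFS augmenting paths (Edmonds-Karp, not Dinic).

    capacity: dict mapping (u, v) edges to their maximum allowed flow.
    Starts from zero flow, mirroring the unseeded network in the text.
    """
    residual = defaultdict(int)
    neighbours = defaultdict(set)
    for (u, v), cap in capacity.items():
        residual[(u, v)] += cap
        neighbours[u].add(v)
        neighbours[v].add(u)   # reverse direction for the residual graph

    total = 0
    while True:
        # Search forward from the source through edges with excess capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in neighbours[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total       # no complete path left: flow is maximal

        # Find the bottleneck along the path, then add flow along it.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        total += bottleneck


# Hypothetical graphs with a maximum flow of one on every edge.
DIAMOND = {("root", "a"): 1, ("root", "b"): 1, ("a", "c"): 1,
           ("b", "c"): 1, ("c", "outport"): 1}
PARALLEL = {("root", "a"): 1, ("root", "b"): 1,
            ("a", "outport"): 1, ("b", "outport"): 1}
```

In the diamond graph the single edge into outport is the smallest cutset, so the maximum flow (the R_min analogue) is one; in the parallel graph two independent paths exist and the result is two.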
One feature of non-inferior pipelined designs is that the initiation interval and the number of stages are closely related; as the number of stages in a non-inferior design space increases, so does the initiation interval as shown in Figure 4.11. (Non-inferior designs are the subset of all synthesized designs that are superior or equal to comparable designs; quality is measured in area and time.) As in the non-pipelined case, increasing the number of stages boosts the register quantity needed towards some limit. Thus, one would expect a dataflow graph partitioned into a large number of stages to exhibit register usage related to this limit (times the number of microcycles).

Conversely, as initiation interval increases, the expected register count declines. When initiation interval is low, there are few stages where register sharing can occur and the register limit is again reached. As initiation interval increases, the ability to reuse registers rises. (Generally, register sharing only occurs during the same microcycle.)

Given the inverse effects of initiation interval and partition increase upon the number of registers, the prediction model assumes that only the number of microcycles has an effect on the register count. The actual values for initiation interval and stage count induce secondary effects that tend to cancel each other out. Hence, as pipelined design register count is related to the sum of individual microcycles, a simple average between the two register limits is taken for Rp_est, the estimated number of registers for pipelined designs.
$$R_{pest} = \tfrac{1}{2}\,(R_{pmax} + R_{pmin}) \tag{4.3.9}$$

For the unique case where a pipelined design has a single microcycle, registers are used at only the input edges, giving

$$R_{pest}(1) = V_{in} \tag{4.3.10}$$

[Figure 4.11: Relation between partitions and latency. X axis: latency; Y axis: partitions. Series: AR filter, FIR filter, Elliptical, Random, Multiplier.]

Table 4.3: Comparison of pipelined AR filter estimated and actual register use

Init.                    Registers
Interval   Partitions    Actual   Estimate
    1           6          74        68
    2           7          46        48
    3           7          32        38
    4          12          36        38
    6          17          28        38
    8          13          24        28
   12          16          21        28
   16          19          23        28

The program REGEST prints out the number of registers for each specified initiation interval and partition count of pipelined designs. As in the non-pipelined case, the register-bit count is also given. Pipelined results for the AR filter are contained in Table 4.3 and Figure 4.12, with additional results described in Section 4.6.

4.4 Predicting the Number of Microcycles

Predicting the number of microcycles is not important for operator area estimation in pipelined design, but is critical to the register and multiplexer predictors and, to some extent, the controller. The number of microcycles ($u$) is given by

$$u = \left\lceil \frac{p}{l} \right\rceil \tag{4.4.11}$$

where $p$ is the total number of clock cycles and $l$ is the pipeline initiation interval or latency. (Physically, the number of microcycles is the maximum number of data sets which can be resident in the pipeline at any time.)

Prediction of the area-time data path for pipelined designs (PSADP) only generates the module counts and the initiation interval; neither the number of clock cycles ($p$) in a given design nor the number of microcycles are computed.
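The microcycle count of Equation 4.4.11 can be sketched in a few lines, reading the equation as a ceiling of $p / l$. That reading is an assumption about the garbled printed form, but it agrees with the actual clock and microcycle columns of Table 4.5. The function name is hypothetical.

```python
def microcycles(p, l):
    """Number of microcycles u = ceil(p / l) for p clock cycles and initiation interval l."""
    if p < 1 or l < 1:
        raise ValueError("p and l must be positive integers")
    return -(-p // l)  # integer ceiling division


# (clock cycles, initiation interval) pairs taken from the actual columns of Table 4.5.
for p, l in [(17, 4), (14, 5), (15, 15)]:
    print(p, l, microcycles(p, l))
```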
Given initiation interval, efforts here focused on deriving a model for predicting the number of clock cycles, which would ultimately allow computation of microcycles.

[Figure 4.12: Register values for pipelined AR lattice filter. X axis: latency; Y axis: registers. Series: actual, predicted.]

There is a good reason for the lack of a model for pipelined clock cycle count. The number of clock cycles is heavily influenced by the individual data dependencies and ordering of the operations in the dataflow graph, besides the direct influence of initiation interval ($l$). With graph topology having a large effect, clock cycle count is also likely to be affected by the type of pipelining and could even be synthesis tool specific. Since pursuit of such a generalized model is beyond the scope of this thesis, an alternative solution was found in the form of an empirical model.

A brief discussion of empirical observations of initiation interval ($l$) and clock cycles ($p$) was given earlier. It was noted that $p$ increased at a somewhat lower rate than $l$. Additional examination of Sehwa (pipelined synthesis) output revealed that $p$ had a constant offset and a somewhat fixed shape versus initiation interval over several dataflow graphs; $u$ steadily decreased as the design was serialized.

A simple linear model was formulated to reflect the observed behavior of non-inferior pipelined designs. In particular, the lower-bound pipelined area-delay predictor provided a direct input to this model.
The estimate for the number of clock cycles, $p$, in pipelined design is

$$p = L' + \left\lfloor \frac{l - 1}{l_{max}}\; n_{offcp} \right\rfloor \tag{4.4.12}$$

$L'$ is the minimum relative length of the critical path used to establish the “floor” for the clock cycle count and is given by

$$L' = \left\lceil \frac{T_{cp}}{c_{min}} \right\rceil \tag{4.4.13}$$

where $T_{cp}$ is the delay of the critical path and $c_{min}$ is the minimum clock cycle time (the maximum delay of any given module). For the fastest design, where $l = 1$, experiments indicated that clock cycle time was as short as possible, which justifies the “floor” on clock cycle count. (Resynchronization introduces another dimension to the model which entails some knowledge of the graph topology; its effects were not analyzed.)

Table 4.4: Pipelined Predicted Data Values

Design              c_min    L'    max{n_i}   n_offcp
Elliptical Filter     375    14       26         20
AR Lattice Filter    2950     8       16         20
FIR Filter            375     4       15         14

The fraction in the second term of Equation 4.4.12 scales the maximum off-critical-path operation count which could be assigned to separate clock cycles and reduce overall resource requirements. Thus, when $l = 1$, the maximum number of resources are used and $p = L'$. According to the AT theory, the graph can be serialized up to

$$l_{max} = \max_i \{ n_i \} \tag{4.4.14}$$

where $n_i$ is the effective number of operations of type $i$.

For the minimum area design, there is only one expected resource of each type $i$. Experiments revealed that in the general worst case, each off-critical-path operation would require an additional clock cycle. $n_{offcp}$ is the relative number of operations not in the critical path, or

$$n_{offcp} = \sum_i n_{offcp_i} = \sum_i n_i - N_{cp} \tag{4.4.15}$$

where $N_{cp}$ is the number of operations in the critical path.
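The clock-cycle model can be exercised numerically. The sketch below reads Equation 4.4.12 as $p = L' + \lfloor (l - 1)\, n_{offcp} / l_{max} \rfloor$. The rounding direction is an assumption: it was chosen because it reproduces the predicted clock counts listed for the AR and FIR filters in Table 4.5 from the parameters in Table 4.4. The function and argument names are hypothetical.

```python
def predicted_clock_cycles(l, crit_len, l_max, n_offcp):
    """Clock-cycle estimate p for initiation interval l (assumed reading of Eq. 4.4.12).

    crit_len is L', the critical-path length in minimum-length clock cycles.
    The off-critical-path term is rounded down (an assumption, see lead-in).
    """
    if not 1 <= l <= l_max:
        raise ValueError("l must lie between 1 and l_max")
    return crit_len + ((l - 1) * n_offcp) // l_max


# Table 4.4 values for the AR lattice filter: L' = 8, max{n_i} = 16, n_offcp = 20.
for l in (1, 2, 3, 4, 6, 8, 12, 16):
    p = predicted_clock_cycles(l, crit_len=8, l_max=16, n_offcp=20)
    u = -(-p // l)  # microcycles, ceil(p / l)
    print(l, p, u)
```

At $l = 1$ the second term vanishes and the estimate collapses to $L'$, matching the statement in the text.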
Clearly, the minimum number of additional cycles needed is

$$n_{offcp}(min) = \max_i \,(n_i - N_{cp_i}) \tag{4.4.16}$$

Although this value is intuitively more appealing than $n_{offcp}$, it did not fit the pipelined data as well for the graph sizes used in validation (< 60 operations).

To verify the model, values from several different dataflow graphs were entered into the model and compared against Sehwa output. Tables 4.4 and 4.5 contain the results of these comparisons. Note that since the goal is to predict the number of microcycles given by Equation 4.4.11, verification of this model did not compare $p$. The results indicate a close correlation between the predicted and actual number of microcycles.

Table 4.5: Comparison of Predicted and Actual Microcycles

           Init.    Actual                 Predicted
Design     Intrvl   Clocks  Microcycles    Clocks  Microcycles   Error
E filter      1       14        14           14        14          0
              2       14         7           14         7          0
              3       15         5           15         5          0
              4       17         5           16         4          1
              5       14         3           17         4          1
              6       17         3           18         3          0
              7       19         3           18         3          0
              8       21         3           19         3          0
              9       24         3           20         3          0
             13       19         2           23         2          0
             26       32         2           34         2          0
AR            1        6         6            8         8          2
              2        7         4            9         5          1
              3        7         3           10         4          1
              4       12         3           11         3          0
              6       16         3           14         3          0
              8       12         2           16         2          0
             12       13         2           21         2          0
             16       19         2           26         2          0
FIR           1        4         4            4         4          0
              2        6         3            4         2          1
              3        6         2            5         2          0
              4        8         2            6         2          0
              5       10         2            7         2          0
              8       15         2           10         2          0
             15       15         1           17         2          1

For one example, the AR filter, a deviation between predicted and actual microcycles occurred when initiation interval was small. Due to the dataflow graph topology, whereby the top consists solely of multipliers and the bottom consists of adders, there are two clock cycles in which there is no resource conflict. This advantage was exploited by Sehwa. As initiation interval increases, this resource polarization is weakened since a given microcycle is now more likely to contain both module types.
Since the partition estimation model does not consider the location of modules, polarized resources will result in a smaller partition count than predicted.

4.5 Multiplexer Area

Another consideration in chip area is the cost (area) of routing data to shared hardware. Two common methods which provide sharing are attaching the operator or register inputs to buses or multiplexer trees. Bus structures are generally more useful when a number of values are transported relatively large distances in the physical circuit, whereas multiplexers are more aptly suited to localized communication. Given only the number of operators of each type and the scheduling, a tradeoff between bus and multiplexer allocation is not predictable at this time due to the lack of adequate models for bus estimation. Consequently, for the estimation technique described here, buses are not considered.

In the following sections, the bounds on multiplexer area and a technique for estimating the cost are presented. The equations are simplified in that the bitwidths of individual operators and values are normalized to one. Expansion to include bitwidth attributes is readily accomplished, but needlessly complicates the discussion. Note that MUXEST, the program which implements this multiplexer estimation, does take into account operator bitwidth for computing multiplexer costs.

4.5.1 Theoretical Bounds on Multiplexer Area

Bounds on multiplexer area can be determined by finding the separate bounds on multiplexers for operators and registers and summing them. A given dataflow graph contains $h_i$ operations of type $i$ (with $n_i$, or effective number of operations, being equal to the worst-case number of operators needed). A specific design implementation has $o_i$ resources of type $i$ ($o_i \le n_i \le h_i$) and $R$ registers.
The number of multiplexers needed depends upon the design parameters ($n_i$, $h_i$, $o_i$, $R$) and the shape of the dataflow graph. Clearly, in the best case, where incoming values to an operator or register always arrive from the same source, no multiplexers are needed. In the worst case, all incoming values arrive from unique sources. These bounds were derived earlier by Kurdahi [Kur87].

$$0 \le M_i \le h_i - o_i \tag{4.5.17}$$

$$0 \le M_r \le V_{reg} - R \tag{4.5.18}$$

$M_i$ is the number of 2:1 multiplexers needed for all operators of type $i$ and $M_r$ is the total number of 2:1 multiplexers needed for registers, which lie within the bounds given above.

$V_{reg}$ is the number of values in a dataflow graph which are assigned to registers and is dependent upon the shape of the graph, the design style (pipelined or non-pipelined) and the scheduling. Each design style is separately addressed.

4.5.2 Estimating Multiplexers in Non-Pipelined Designs

Multiplexer area varies with the degree of operator sharing in non-pipelined designs. When the number of stages is low, there is little sharing of resources except along conditional paths and, hence, multiplexing is minimal. Increasing the number of stages increases sharing and raises the multiplexer count up to some limit defined by the dataflow graph and hardware allocation for a particular design. Beyond this point, the number of multiplexers may remain constant or decrease. Despite the increased sharing as more serialized designs are produced, fewer multiplexers may be needed since the number of distinct locations where values are produced is decreasing.

The number of values routed through multiplexers is some fraction of the total values in the dataflow graph, $V$. $V$ is defined as the number of operations, all $n_i$, plus the number of unique incoming values from root, $V_{in}$.
(Again, if a given operation type produces more than one value, then the equations must be adjusted to reflect these additional values.)

$$V = \sum_i n_i + V_{in} \tag{4.5.19}$$

$V$ will ultimately be used to determine the number of unique values presented at a given operator or register input. Using the effective number of operations ($n_i$) more accurately reflects implementation than the total number of operations ($h_i$), as will be shown.

For dataflow graphs with conditionals, $h_i > n_i$ for one or more $i$. However, it is not necessary to consider all the conditional values generated, just one path from root to outport, as values primarily leave a conditional through the nearest join. Operations outside of the conditional use values generated in conditionals through the associated join; essentially, this gives outside operators access to values in only one branch of a conditional. Operations in mutually exclusive branches can never use values generated by operations in another branch of that conditional. Because of these characteristics, the effective number of operations ($n_i$) best approximates the number of values produced within the graph which are available for input by other operators or registers.

The number of values which are assigned to registers, $V_{reg}$, is schedule dependent. Given that $R$ registers are used for a specific design, the dataflow graph has at least $V_{reg} \ge R$ values to be stored. Given that a design has $p$ clock cycles, a minimum of one stored value is generated per clock cycle, giving a different lower bound: $V_{reg} \ge p$. The maximum of these two quantities provides a greatest lower bound on $V_{reg}$. One upper bound on $V_{reg}$ is simply all values that could be assigned a register, which is all edges in the graph which do not have root as their source.
However, the maximum value of $V_{reg}$ may not exceed the assignment of $R$ registers for each of the $p$ stages. Summarizing,

$$\max(R, p) \le V_{reg} \le \min(V - |E_{in}|,\; p \times R) \tag{4.5.20}$$

Since $V_{reg}$ cannot be explicitly determined without knowing the exact scheduling in a dataflow graph, $V_{reg}$ is arbitrarily set to an average of these bounds, or

$$V_{reg} = \frac{\max(R, p) + \min(V - |E_{in}|,\; p \times R)}{2} \tag{4.5.21}$$

Note that a special case occurs for $p = 1$, giving $V_{reg} = R$, as registers are only assigned at the output and no multiplexers are needed for registers.

The multiplexer estimation models can be divided into two categories: multiplexers attached to operator inputs and multiplexers attached to register inputs. This separation is due to the input connectivity of these elements. Operator inputs are either connected to operator or register outputs or root; register inputs are only connected to operator outputs (for non-pipelined designs). The models derived are based on the probable number of unique values expected at each operator or register input.

4.5.2.1 Multiplexer Area with Operators in Non-Pipelined Design

A model for predicting multiplexer usage with operators in non-pipelined designs is detailed in this section. First, the number of values from other operators and registers which could be found on the inputs of all operators of a given type is defined. Next, the number of distinct values expected at an input is derived based upon the number of values and number of selections drawn from this pool. Using these results, the multiplexer count is readily computed.
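The register-value estimate of Equations 4.5.20 and 4.5.21 above reduces to a pair of bounds and their midpoint. The following sketch shows the calculation under that reading, including the $p = 1$ special case. The function name, argument names, and the sample numbers are hypothetical and are not drawn from MUXEST.

```python
def estimate_v_reg(v, e_in, r, p):
    """Estimate V_reg, the number of values assigned to registers.

    v    -- total values in the graph (Equation 4.5.19)
    e_in -- |E_in|, as used in Equation 4.5.20
    r    -- number of registers R in the design
    p    -- number of clock cycles (stages)
    """
    if p == 1:
        return float(r)               # special case: registers only at the output
    lower = max(r, p)                 # greatest lower bound
    upper = min(v - e_in, p * r)      # least upper bound
    return (lower + upper) / 2        # Equation 4.5.21: midpoint of the bounds


# Hypothetical design: 20 values, |E_in| = 5, 5 registers, 4 stages.
print(estimate_v_reg(v=20, e_in=5, r=5, p=4))
```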
In a given design, the total number of distinct values arriving at the input of an operator of type $i$ is dependent upon the number of occurrences of this operation type in the graph ($h_i$) and the number of operators actually implementing the function ($o_i$). Other factors include the number of distinct physical inputs ($I_i$) and outputs ($O_i$) associated with this operation type. For example, an adder would have two inputs (three if carry-in), and one output (two for carry-out); $I_i$ and $O_i$ do not inherently reflect the bitwidth of a particular connection and are assumed to be one. (However, the MUXEST estimation program does use the actual bitwidth.)

Values found on inputs of a specific operator may be from any operator or register, excepting those values which are solely directed to outport ($V_{out}$) and any values output by the node itself ($O_i$). The values eligible for input to an operator of type $i$ are thus

$$V^i = V - V_{out} - O_i \tag{4.5.22}$$

Some of these $V^i$ eligible values are assigned to registers ($V^i_{reg}$), while the remainder are only associated with operators ($V^i - V^i_{reg}$). The number of values which are assigned registers, $V^i_{reg}$, is modified to reflect the exclusion of values directed solely towards outport.

$$V^i_{reg} = \max(V_{reg} - V_{out} - O_i,\; 0) \tag{4.5.23}$$

The probabilistic model used to estimate the number of multiplexers needed at each of the $I_i$ inputs makes the following assumptions:

• Every incoming value either arrives from a register or an operator.

• The $V^i$ eligible values have a uniform probability of being found on a given input of each operator of type $i$.

• Each selection occurrence is independent of any other. Thus, the selection of a specific eligible value does not affect the probability of it being selected again.
• The multiplexer count needed at each input of operator $i$ is directly related to the number of statistically distinct values connected to it; the maximum number, when all connected values are distinct, is $h_i - o_i$.

An output value $V_a$ is distinct from another output $V_b$ if they are physically output from different registers or operators. As sharing of operations increases, the number of distinct values in the graph decreases as there are fewer physical places where values are generated.

An operator of type $i$ has multiplexers at each of the $I_i$ inputs determined by the expected number of distinct values originating in operators, $V_o(i)$, and the expected number of distinct values originating in registers, $V_r(i)$. $V_r(i)$ is weighted by the fraction of values assigned to registers; $V_o(i)$ is weighted by the fraction of values not assigned to registers. The number of multiplexers, $M_i$, required at the inputs of all operators of type $i$ is

$$M_i = \frac{V^i - V^i_{reg}}{V^i}\, V_o(i) + \frac{V^i_{reg}}{V^i}\, V_r(i) \tag{4.5.24}$$

[Only the two weighted terms of Equation 4.5.24 are legible in the source scan; any further factor, offset, or rounding in the printed equation is not recoverable.]

$V_o(i)$ and $V_r(i)$ are found by computing the expected number of distinct members chosen from a pool of values. For $V_o(i)$, the pool consists of all values ($V^i$); for $V_r(i)$, the number of values is equal to the number of registers ($R$). $V^i$ is not reduced by $V^i_{reg}$ (the number of values assigned to registers) since a given value may be found both on a register input and an operator input. The number of selections drawn from each pool is dependent upon the number of operations ($h_i$) and the number of operators used to implement this operation ($o_i$).

$$V_o(i) = V(h_i - o_i + 1,\; V^i) \tag{4.5.25}$$

$$V_r(i) = V(h_i - o_i + 1,\; R) \tag{4.5.26}$$

The total number of operations ($h_i$) is taken instead of the effective number of operations ($n_i$).
Operations of the same type along different mutually exclusive paths may share the same hardware, but may not have the same connectivity; multiplexers might be necessary.

$V(k, n)$ is the average number of distinct members chosen after $k$ selections from a pool of $n$ distinct members, with replacement. Drawing upon a pool of $n$ members $k$ times gives a total of $n^k$ possible ordered $k$-tuples. $Q$ is the quantity of these $k$-tuples which have exactly $q$ distinct members. The weighted average of $Q$ taken over all possible values of $q$ gives the expected (or mean) number of distinct members, $V$. Clearly, $q$ is bounded by the lesser of $k$ selections or $n$ members.

$$V(k, n) = \frac{1}{n^k} \sum_{q=1}^{\min(k,n)} q\, Q(n, k, q) \tag{4.5.27}$$

As an example, when $n = 5$ and $k = 3$, $q$ could range from 1 for tuples $(a,a,a)$, $(b,b,b)$, ..., $(e,e,e)$ to 3 for tuples $(a,b,c)$, $(a,b,d)$, ..., $(c,d,e)$. The number of 3-tuples where $q$ = 1, 2, or 3 is $Q(5,3,1) = 5$, $Q(5,3,2) = 60$, and $Q(5,3,3) = 60$, giving

$$V = \frac{1 \times 5 + 2 \times 60 + 3 \times 60}{5^3} \tag{4.5.28}$$

or 2.44 as the expected number of distinct values.

Finally, $Q(n, k, q)$ is mathematically described by ordered Stirling numbers of the second kind [Liu68], giving

$$Q(n, k, q) = \frac{P(n, q)}{q!} \sum_{j=0}^{q} (-1)^j\, C(q, j)\, (q - j)^k \tag{4.5.29}$$

where $P()$ is the permutation operator and $C()$ is the combination operator.

4.5.2.2 Multiplexer Area with Registers in Non-Pipelined Design

The number of multiplexers with registers is dependent upon the number of values assigned to registers ($V_{reg}$) and the number of registers available ($R$). For non-pipelined design, we assume that the only values found on a register input are produced by operators. Many designers and synthesis programs make the same assumption. Exceptions usually occur in CPU design, where synthesis is less applicable. Thus, the pool of values is the sum of all $O_i$.
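The expected-distinct count $V(k, n)$ of Equations 4.5.27 through 4.5.29 above can be checked numerically. The sketch below counts tuples with exactly $q$ distinct members by inclusion-exclusion, using `math.comb` for $C()$ and the identity $P(n, q)/q! = C(n, q)$. It also compares the result with the standard closed form $n\,(1 - (1 - 1/n)^k)$ for draws with replacement, which is not stated in the text and is included only as a cross-check. The function names are hypothetical.

```python
from math import comb


def q_count(n, k, q):
    """Number of ordered k-tuples over n symbols that use exactly q distinct symbols."""
    # Inclusion-exclusion count of k-tuples that use all q chosen symbols.
    onto = sum((-1) ** j * comb(q, j) * (q - j) ** k for j in range(q + 1))
    return comb(n, q) * onto  # choose which q of the n symbols appear


def expected_distinct(k, n):
    """V(k, n): mean number of distinct members after k draws with replacement."""
    total = sum(q * q_count(n, k, q) for q in range(1, min(k, n) + 1))
    return total / n ** k


print([q_count(5, 3, q) for q in (1, 2, 3)])   # counts for the n = 5, k = 3 example
print(expected_distinct(3, 5))                 # weighted average of Equation 4.5.27
print(5 * (1 - (1 - 1 / 5) ** 3))              # closed-form cross-check
```

For the worked example in the text ($n = 5$, $k = 3$) the three counts sum to $5^3 = 125$, as they must.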
These values have a uniform probability of being found on a register input. Thus, the number of multiplexers required for all registers in non-pipelined design, $M_r$, is

$$M_r = V\!\left(V_{reg} - R + 1,\; \sum_i O_i\right) \tag{4.5.30}$$

4.5.2.3 Example for Non-Pipelined Design

To illustrate the model, a small example was chosen, as shown in Figure 4.13. A single design produced by MAHA, a non-pipelined synthesis program, provided the scheduling and resource allocation for the adders and multipliers, while REAL provided the register assignment. Multiplexers were inserted manually and verified using MABAL, a module and bus allocator (which can determine register and multiplexer requirements).

[Figure 4.13: Example dataflow graph showing scheduling]

For this design, three adders and two multipliers are used in four stages with five registers. (Edge cutsets indicating the schedule produced by MAHA are marked in Figure 4.13.) All multiplier inputs are attached to one of four distinct graph inputs. By swapping inputs on the multiplier, the total multiplier multiplexer count can be reduced from three to two. Values found on all adder inputs consist of multiplier outputs in two cases, register outputs in six cases, and adder output and constant value in one case each. All carry-in pins are either attached to a constant (zero) or to one of two registers. Rearranging inputs reduces the number of adder multiplexers to four on a single adder: one for one input, one for the other input, and two for the carry-in bit. (The other adders do not need multiplexers.) Finally, two registers require three multiplexers amongst them, giving a total of 9 multiplexers for the manual design. The final architecture, validated using MABAL, is shown in Figure 4.14.
For the probabilistic model, a more general examination of the graph is made. There are five inputs to the graph (one is a constant), five register outputs, two adder outputs, and two multiplier outputs which might be found at each adder input. (The carry line is folded into one of the adder inputs.) The resultant multiplexer estimator predicts that four multiplexers are needed for all adders and four for all multipliers. (The distribution of these eight total multiplexers among the three adders and two multipliers is unknown.) The predicted multiplexer requirement for registers is one. The net result is that the probabilistic model, at nine multiplexers, matches the highly minimized design. If the maximum number of multiplexers were taken as an estimate, the result would have been 12 multiplexers. A broader assessment of this model compared to actual synthesis results is given later.

4.5.3 Estimating Multiplexers in Pipelined Designs

As in the pipelined register estimation, multiplexer estimation for pipelined designs shifts to computation of an average for a single microcycle which is repeated over all microcycles. The model described here does not consider sharing of multiplexers between different microcycles.

[Figure 4.14: Example dataflow graph showing RTL architecture]

Similar to non-pipelined design, but confined to a single microcycle, estimates are derived for values ($V^u$), registers ($R^u$), and values assigned to registers ($V^u_{reg}$). The average number of registers used in a given microcycle, $R^u$, is

$$R^u = \frac{R_p - V_{out}}{u} \tag{4.5.31}$$

Graph output edges ($V_{out}$) are excluded from $R^u$ since, in the model used by the synthesis tools, they are not assigned to any register.

Values found on an operator input within a given microcycle are produced by registers or operators within that microcycle.
The one exception occurs in the first microcycle, where values from root may be found on an operator or register input. The average number of values per microcycle, $V^u$, can be computed as

V^u = [equation partially illegible in the source scan; the legible terms are R^u (u - 1), V_in, and u]    (4.5.32)

with the fraction of these values assigned to registers bounded by the number of stages within a given microcycle (the initiation interval). The estimated number of values assigned to registers per microcycle for pipelined design, $V^u_{reg}$, is a slight modification of Equation 4.5.20. $V^u$ and $R^u$ substitute for $V - |E_{in}|$ and $R$; initiation interval $l$ replaces $p$.

$$\max(R^u, l) \le V^u_{reg} \le \min(V^u,\; l\,R^u) \tag{4.5.33}$$

Similar to non-pipelined estimation, the limits are averaged to compute $V^u_{reg}$.

$$V^u_{reg} = \frac{\max(R^u, l) + \min(V^u,\; l\,R^u)}{2} \tag{4.5.34}$$

4.5.3.1 Multiplexer Area with Operators in Pipelined Design

Equation 4.5.24 is modified to compute the number of multiplexers used in a given microcycle. For a specific microcycle $z$, the number of multiplexers needed for all operators of type $i$, $M(1)^z_i$, is

$$M(1)^z_i = \frac{V^u - V^u_{reg}}{V^u}\, V^z_o(i) + \frac{V^u_{reg}}{V^u}\, V^z_r(i) \tag{4.5.35}$$

[Only the two weighted terms of Equation 4.5.35 are legible in the source scan; any further factor, offset, or rounding in the printed equation is not recoverable.]

Again, $V^z_o(i)$ and $V^z_r(i)$ are found by computing the number of unique objects chosen from the pool of values, but restricted to a single microcycle.

$$V^z_o(i) = V(y_i + 1,\; V^u) \tag{4.5.36}$$

$$V^z_r(i) = V(y_i + 1,\; R^u) \tag{4.5.37}$$

$y_i$ is a variable which describes the excess selection set size of operation type $i$ during a given microcycle. Initially, it is set to

$$y_i(\text{initial}) = \left\lceil \frac{h_i - o_i}{u} \right\rceil \tag{4.5.38}$$

$$T = \left\lfloor \frac{h_i - o_i}{y_i} \right\rfloor \tag{4.5.39}$$

This value for $y_i$ is used for up to $T$ microcycles, which is the microcycle count where the sum of all $y_i$ matches the total excess ($h_i - o_i$). Clearly, $T \le u$.

procedure compute_sum
begin
    Set y_i = ceil((h_i - o_i) / u).
    Set T = floor((h_i - o_i) / y_i).
    Set M_i^u = T * M(1)_i.
    Set new y_i = h_i - o_i - T * y_i.
    If y_i > 0
    begin
        Set M_i^u = M_i^u + M(1)_i.
    end
end

Figure 4.15: Computation of $y_i$ and $M^u_i$
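The bookkeeping of Figure 4.15 can be sketched as follows. This is a reading of a badly degraded figure, not a transcription: it assumes $y_i$ starts at $\lceil (h_i - o_i)/u \rceil$, that $T = \lfloor (h_i - o_i)/y_i \rfloor$ microcycles use that value, and that any remainder is charged to one further microcycle. The per-microcycle multiplexer estimate is passed in as a caller-supplied function because Equation 4.5.35 is not fully legible. All names are hypothetical.

```python
def excess_schedule(h_i, o_i, u):
    """Return (y, t, remainder): excess per microcycle, full microcycles, leftover excess."""
    excess = h_i - o_i
    if excess <= 0:
        return 0, 0, 0                 # no sharing, hence no excess to distribute
    y = -(-excess // u)                # ceil(excess / u), assumed Equation 4.5.38
    t = excess // y                    # floor(excess / y), assumed Equation 4.5.39
    return y, t, excess - t * y


def total_operator_muxes(h_i, o_i, u, mux_per_microcycle):
    """Sum a per-microcycle estimate M(1) over the schedule, as in Figure 4.15.

    mux_per_microcycle(y) is a stand-in for Equation 4.5.35 evaluated at excess y.
    """
    y, t, remainder = excess_schedule(h_i, o_i, u)
    total = t * mux_per_microcycle(y) if t else 0
    if remainder > 0:
        total += mux_per_microcycle(remainder)
    return total


# Hypothetical case: 10 operations on 3 operators over 3 microcycles.
print(excess_schedule(10, 3, 3))
# With the worst-case stand-in M(1) = y, the total equals the excess h_i - o_i.
print(total_operator_muxes(10, 3, 3, lambda y: y))
```

Taking the ceiling in the first step is what keeps $T \le u$; with a floor, the leftover excess could spill into more microcycles than the design has.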
The sum of the multiplexers used in each of the microcycles, $M(1)^z_i$, where $y_i > 0$, gives the number of multiplexers for all operators of type $i$, $M^u_i$. An algorithm for computing $y_i$ and $M^u_i$ is detailed in Figure 4.15.

4.5.3.2 Multiplexer Area with Registers in Pipelined Design

Multiplexer area in pipelined designs is similar to the non-pipelined computation of Equation 4.5.30. In each microcycle, the number of values which can be assigned to registers is

V_r^u = [right-hand side illegible in the source scan]    (4.5.40)

This quantity of values is fixed for all microcycles. When applied over all microcycles, the total number of multiplexers with registers, $M^u_r$, is determined.

M_r^u = [ u V(V_reg^u - R^u + 1, V_r^u) ] - 1    (4.5.41)

[The square brackets in Equation 4.5.41 stand for a rounding operator whose direction is not legible in the source scan.]

4.6 Experiments and Validation

To validate the models, a number of tools were used to generate designs with registers and multiplexers for comparison. For non-pipelined designs, a method has been published whereby register allocation is minimal (REAL) [KP87]. By coupling the data path synthesis results from MAHA with REAL, accurate register values can be obtained. Predicted register results were produced by REGEST, a program which implements the register analysis algorithm presented in this chapter. (The use of REGEST is described in Appendix F.)

Tables 4.6 and 4.7 summarize the register experiments on non-pipelined designs. For these examples, a variety of algorithms with dataflow graph representations under 60 nodes in size were used. Each non-inferior design as produced by MAHA and REAL is listed in Tables 4.6 and 4.7 along with the predicted register count from REGEST. The error is also given.
The last example in Table 4.7, Conditional, is a conditional dataflow graph described in Park [PP86]. It should be noted that REAL does not guarantee finding the optimal number of registers for conditionals; the optimal number is actually 3 instead of 5. Hence, the estimates are low in comparison to REAL, but correct.

Table 4.6: Register prediction: Non-pipelined designs

Dataflow                        Registers
Graph              Partitions   Actual   Predicted   Error
AR filter               1          2         2         0
                        3          4         5         1
                        4          5         5         0
                        7          6         5         1
                        8          6         5         1
                        9          6         5         1
                       14          5         5         0
                       17          7         5         2
                       18          7         5         2
                       19          6         5         1
FIR filter              1          1         1         0
                        2          8         4         4
                        3          7         4         3
                        4          7         4         3
                        5          5         4         1
                        8          2         4         2
                       11          5         4         1
                       15          5         4         1
Elliptical filter       1          8         8         0
                        3          8         8         0
                        4          8         8         0
                        5          8         8         0
                        7          9         8         1
                        8          9         8         1
                        9          9         8         1
                       13         11         8         3
                       14          9         8         1
                       15          9         8         1
                       28         11         8         3

Table 4.7: Register prediction: Non-pipelined designs

Dataflow                        Registers
Graph              Partitions   Actual   Predicted   Error
HAL example             1          3         3         0
                        2          4         4         0
                        3          3         4         1
                        7          4         4         0
Random                  1          2         2         0
                        2          6         7         1
                        3          8         7         1
                        4          7         7         0
                        5          7         7         0
                        7          9         7         2
                       10          8         7         1
                       14          8         7         1
Multiplier              1          4         4         0
                        3          4         4         0
                        4          5         4         1
                        5          4         4         0
                        6          4         4         0
Conditional*            1          2         2         0
                        2          3         3         0
                        5          5         3         2
                        7          5         3         2

* REAL did not find correct values for actual designs; see text.

The results of register experiments on pipelined designs are depicted in Tables 4.8 and 4.9. Examples were processed through Sehwa and results from REAL collected; it is not known whether optimal results are generated for pipelined designs. The register estimate errors are clearly higher than those for non-pipelined designs. Given that register prediction in a pipelined design is treated as two or more concatenated non-pipelined designs, the prediction error in a given stage is essentially multiplied by the number of microcycles. (An error measure which is the total register error divided by the number of microcycles gives errors comparable to that for non-pipelined estimates.)
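The normalized error measure mentioned in the parenthetical above can be sketched directly. The snippet assumes that the "Partitions" column of Table 4.8 plays the role of the clock-cycle count $p$, so that the microcycle count is $\lceil \text{partitions} / \text{interval} \rceil$; that identification is an assumption, and the function name is hypothetical.

```python
def error_per_microcycle(actual, predicted, partitions, interval):
    """Total register error divided by the number of microcycles."""
    u = -(-partitions // interval)          # ceil(partitions / interval)
    return abs(actual - predicted) / u


# Two rows of Table 4.8: AR filter and Elliptical filter at initiation interval 1.
print(error_per_microcycle(actual=74, predicted=68, partitions=6, interval=1))
print(error_per_microcycle(actual=73, predicted=53, partitions=7, interval=1))
```

On these rows the per-microcycle error falls into the 0 to 4 register range seen in Tables 4.6 and 4.7, in line with the remark in the text.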
Unfortunately, a more precise estimate is likely to require some knowledge of the number of values crossing each microcycle boundary; specific scheduling information is not available during prediction.

Problems were also experienced using REAL with a conditional graph when the number of microcycles was greater than one. Therefore, manual register assignment was accomplished, as shown by a fast conditional pipelined design in Figure 4.16. In this figure, the number of registers assigned by a human designer is shown. (This is divided into the number of registers required on that stage “+” the number of registers required for incoming values from root.) Since no registers can be shared, the human result of 25 registers (27 if output latches are used) is optimal and compares favorably to the estimate.

[Figure 4.16: Pipelined conditional graph]

For both pipelined and non-pipelined designs, the register prediction model indicates a good match. Where errors occurred in non-pipelined designs, there appeared to be no pattern as to their magnitude. Conversely, for pipelined designs, the location of operations within the dataflow graph could be correlated to error. In particular, graphs which had acute “necking”, where edge cutsets in one region of the graph have large cardinality whereas those in another have small cardinality, often produced estimates higher than measured. (Both the AR and Elliptical filters exhibit necking.) The register prediction error in pipelined designs magnifies any register variation between microcycles or register sharing across microcycle boundaries.

Table 4.8: Register prediction: Pipelined designs

Dataflow            Init.                   Registers
Graph               Interval   Partitions   Actual   Predicted   Error
AR filter               1           6         74        68          6
                        2           7         46        48          2
                        3           7         32        38          6
                        4          12         36        38          2
                        6          17         28        38         10
                        7          15         24        38         14
                        8          13         21        28          7
                       16          19         23        28          5
FIR filter              1           4         44        42          2
                        2           6         32        33          1
                        3           6         25        25          0
                        4           8         24        25          1
                        5          10         20        25          5
                        8          15         20        25          5
                       15          15         17        17          0
Elliptical filter       1           7         73        53         20
                        2           8         43        32         11
                        3           9         35        25         10
                        4          12         33        25          8
                        5           9         18        18          0
                        6          10         18        18          0
                        7          10         17        18          1
                        8          11         17        18          1
                        9          12         18        18          0
                       13          17         16        18          2
                       26          32         16        18          2

Table 4.9: Register prediction: Pipelined designs

Dataflow            Init.                   Registers
Graph               Interval   Partitions   Actual   Predicted   Error
Random                  1           3         30        26          4
                        2           4         21        19          2
                        3           5         15        19          4
                        4           4         14        14          0
                        5           5         14        14          0
                        6           6         14        14          0
                        8           8         14        14          0
                       10          10         14        14          0
Multiplier              1           2          8         8          0
                        2           4          9         8          1
                        3           4          8         8          0
                        4           5          8         8          0
                        5           7          8         8          0
Conditional*            1           5         55        25         30
                        2           6         30        19         11
                        3           6         19        16          3
                        5           5         13        13          0
                        6           6         13        13          0

* REAL did not find correct values for actual designs; see text.

To validate the multiplexer results produced by MUXEST, a tool recently added to the ADAM system was used. Known as MABAL, for module and bus allocator, it accepts a schedule and a dataflow graph and performs the tasks of operator allocation, register allocation, bus and/or multiplexer allocation, and module binding to produce a complete netlist at the RT-level [KP90]. Register assignment is based upon the algorithms used in REAL; bus and multiplexer assignment is driven by cost-based heuristics. MABAL does a high degree of minimization; thus, for very regular graphs, one would expect it to achieve better than the expected results.

A comparison between the multiplexer estimation and MABAL results is presented in Tables 4.10 through 4.12 for non-pipelined and pipelined designs, respectively. (The use of MUXEST is described in Appendix G.) Designs with a single stage are excluded, with the exception of the conditional, as no multiplexers are needed.
As can be seen, the estimators compare favorably against actual results with a couple of exceptions. The average error of several multiplexers appears to be independent of the graph size.

4.7 Summary

In this chapter, models for estimating register and multiplexer requirements have been presented. Both non-pipelined and pipelined designs, as well as behavior which includes conditional paths, are encompassed. The simplicity of these models eases their insertion into current synthesis packages and provides useful results.

One of the disadvantages is the computational complexity of theoretical register bounds. Although this analysis need only be performed once per dataflow graph, the computation time is on the order of $O(n^2)$ where $n$ is the number of nodes in the dataflow graph. Extensions which include bus structures would also be useful in trading off multiplexer and bus usage for a given design prior to synthesis. However, the current model is adequate for integration with other data path operator and control path area prediction models.
Table 4.10: Multiplexer estimation: Non-pipelined designs

Dataflow Graph      Partitions  Actual Op  Actual Reg  Predicted Op  Predicted Reg  Total Error
AR filter                3         24          3            30             3             6
                         4         30          5            30             5             0
                         7         34          3            34             5             2
                         8         33          7            34             4             2
                         9         34          7            36             3             2
                        14         33          5            34             2             2
                        17         34          5            34             3             2
                        18         34          5            34             2             3
                        19         32          4            34             1             1
FIR filter               2         16          1            14             3             0
                         3         25          3            24             5             1
                         4         27          4            26             5             0
                         5         29          6            28             4             3
                         8         24          2            28             3             5
                        10         28          2            30             2             2
                        15         27          1            30             1             3
Elliptical filter        3         26          8            30             6             2
                         4         27          7            32             7             5
                         5         31          8            34             7             2
                         7         30          7            36             6             5
                         8         29         10            36             6             3
                         9         28          9            36             5             4
                        13         31         10            38             5             2
                        14         30         11            38             4             1
                        15         35          7            40             2             0
                        28         26          3            38             1            10

Table 4.11: Multiplexer estimation: Non-pipelined designs

Dataflow Graph      Partitions  Actual Op  Actual Reg  Predicted Op  Predicted Reg  Total Error
HAL example              2          6          2             8             1             1
                         3         10          3            12             2             1
                         7         10          2            12             2             2
Random                   2         23          1            24             3             3
                         3         29          3            30             5             3
                         4         33          4            32             6             1
                         5         35          5            34             6             0
                         7         37          5            36             5             1
                        10         35          4            36             4             1
                        14         35          4            36             3             0
Multiplier               3          4          3             4             2             1
                         4          8          3             8             1             2
                         5         10          3            12             2             1
                         6         10          2            12             1             1
Conditional              1          8          0             8             0             0
                         2         18          3            16             3             2
                         5         19          4            18             4             1
                         7         22          6            18             5             5

Table 4.12: Multiplexer estimation: Pipelined designs

Dataflow Graph      Init. Interval  Partitions  Actual Op  Actual Reg  Predicted Op  Predicted Reg  Total Error
AR filter                  2             7         28         12            26            11             3
                           3             7         32         14            40             8             2
                           4            12         34         13            40             8             1
                           6            16         32         12            38            10             4
                           7            12         33         11            40             8             4
                           8            13         32         14            34             8             4
                          16            19         33         10            38             9             4
FIR filter                 2             6         18          8            20             6             0
                           3             6         27         11            28             5             5
                           4             8         28         10            32             5             1
                           5            10         27         10            32             7             2
                           8            15         28         10            36             6             4
Elliptical filter          2             8         22         10            30            11             9
                           3             9         26         10            38            10            12
                           4            12         29         12            42            10             9
                           5             9         27         18            40            10             5
                           6            10         29         19            40             9             1
                           8            11         30         14            38             9             3
                           9            12         29         13            40             9             7
                          13            17         28         11            34             9             4
                          26            32         21          9            28             8             6
Random                     2             4         24          8            26             8             2
                           3             5         25         13            32             9             3
                           4             4         30          9            34             5             0
                           5             5         32         11            36             5             2
                           6             6         34         14            36             5             7
                           8             8         32         13            38             4             3
                          10            10         30         11            40             4             3
Conditional                1             5          8          0             4             0             4
                           2             6         19          4            14             4             5
                           3             6         22          4            18             6             2
                           5             5         20          4            16             6             2
                           6             6         20          6            18             6             2

Chapter 5

Analysis of Control Path/Data Path Tradeoffs

5.1 Introduction

In order to migrate functionality between the control path and data path, explicit tradeoffs can be applied to the behavior. In this chapter, the focus is upon control path/data path tradeoff templates. The types of tradeoffs and example usage will be presented in the first part. In the second part, one of the tradeoff types, bitwidth tradeoffs, is further examined. Using the control path area model of Chapter 3, an analytical model for assessing bitwidth tradeoffs is developed. Finally, the model is applied to several examples and the limitations of this model are described.

5.2 Types of Control Path/Data Path Tradeoffs

Control path/data path tradeoffs encompass a large range of design activities. Interactions within the data path and control path as well as between data path and control path hardware collectively comprise these tradeoffs. Tradeoffs having primary impact on the data path which, in turn, affect the control path are presented in this section.
(Controller specific tradeoffs such as use of the PLAs versus microcode or multilevel logic are not addressed here.)

Tradeoffs which affect the area of the data path and the controller are useful for exploring the migration of functionality as certain decisions are made. In the simplest case, a designer may desire a very small design and would thus have the tendency to reduce hardware to a minimal quantity. If the impact of data path serialization upon the control path and routing/storage hardware is ignored, an inferior design may be realized. Even if considered, sharing of data path hardware may not be adequate to meet a tight area design goal, and other types of tradeoffs must be examined. The use of bit-serial operators, for example, will expand the design space into a region which may satisfy the objectives.

There are two types of control path/data path tradeoffs: explicit and implicit. Explicit tradeoffs are those which specifically affect both the control path and data path area and/or time. These tradeoffs involve implementation of control path functions in data path hardware. For example, the implementation of a loop counter within a PLA versus an external counter comprises an explicit tradeoff. There are also implicit tradeoffs where changes to the data (control) path indirectly affect the control (data) path. Three types of implicit tradeoffs which have been identified are composition, decomposition, and sequential/combinational tradeoffs. These are explored in this chapter.

5.2.1 Composition of Operators

Operator composition or complex operator substitution is one class of hardware/firmware analysis problem. Choosing an adder/subtractor or ALU over a simple adder is not obvious. Clearly, there is an immediate negative impact on both the data path and control path area.
However, a parallel design may perform faster or, more likely, a serialized design may achieve a previously unreachable area objective.

Use of an ALU versus dedicated hardware (such as a multiplier) is only partly a module selection issue. This substitution is not a 1:1 mapping since the ALU could also perform other functions. Additional control is also required for operation and sharing of this resource which affects the overall circuit size.

Another form of composition is transformation from a subgraph of interconnected operations into a single operation. One such example is shown in the AR lattice filter of Figure 5.1. There are eight instances of the multiply/adder pair (highlighted in the graph). A costly (but fast) module implementation, as compared to separate multipliers and adder, might result in a finished design where both area and time are lower than implementations of the original graph. This is true when the data path area reduction dominates the overall area change and the increased circuit speed is not hampered by a slower controller.

5.2.2 Operator Decomposition

Decomposition entails expansion of some operation into a collection of more primitive operations which achieve an identical outcome. Operator decomposition is useful

• to reduce the area of hardware implementation (possibly at the expense of controller size),
• when no library module is capable of providing the function, and
• to explicitly define operations at a lower level (such as a ripple-carry versus carry-lookahead adder).

A multiplier serves as an example of operator decomposition. A candidate implementation for an 8-bit multiply is a 16-bit adder with a shift register. The controller area increases to accommodate looping 8 times to perform the multiply.
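The decomposition of an 8-bit multiply into a 16-bit adder, a shift register, and an 8-iteration control loop can be sketched behaviorally as follows. This is only an illustrative software model under assumed names (`shift_add_multiply` and its arguments are hypothetical, not taken from the ADAM tools); each loop iteration stands for one controller state in which the single adder is reused.

```python
def shift_add_multiply(a: int, b: int, width: int = 8) -> int:
    """Multiply two unsigned `width`-bit values using one adder and shifts.

    Behavioral sketch only: `product` models a 2*width-bit accumulator
    register and the loop models the controller iterating `width` times.
    """
    mask_in = (1 << width) - 1
    mask_out = (1 << (2 * width)) - 1
    multiplicand = a & mask_in
    multiplier = b & mask_in
    product = 0
    for _ in range(width):                # one controller state per bit
        if multiplier & 1:                # low bit of the shift register
            product = (product + multiplicand) & mask_out  # shared adder
        multiplicand <<= 1                # shift multiplicand left
        multiplier >>= 1                  # shift multiplier right
    return product


if __name__ == "__main__":
    print(shift_add_multiply(13, 11))
```

The sketch makes the cost shift visible: the data path holds one adder instead of a full multiplier, while the control loop body is executed `width` times.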
Although such an approach reduces data path hardware, data path time rises and controller complexity may increase significantly.

Another transformation involves strength reduction. For example, a multiply or divide operation by a power-of-2 term would be transformed into the appropriate set of shift registers. Since transformations of this type introduce operations that may exist elsewhere in the dataflow graph, further area savings may be realized due to resource sharing. Other reductions include transforming addition/subtraction by a small constant into a presettable counter, and multiplication by a small constant implemented via a series of additions.

Figure 5.1: AR Filter showing Operation Groupings

5.2.3 Sequential/Combinational Tradeoffs

Sequential/combinational or bitwidth tradeoffs are a class of problems where an operation implemented as a single combinational circuit is transformed into a smaller bitwidth set of operators performing in a sequential fashion. Consider a single operation which implements an 8-bit multiply. Expansion into sequential 4-bit multiplies yields the dataflow graph of Figure 5.2. This simple change has considerable impact on both circuit area and time. First of all, the multiply is an indivisible structure as far as any synthesis engine is concerned. Consequently, it tends to dominate the dataflow graph since it is either slow or expensive as compared to adders, comparators, and other more prevalent operators. By transforming the multiply, hardware area can decrease at the expense of time, in general. If the 8-bit multiply uses CSAs (Carry Save Adders) and the 4-bit multiply is a flash (very high speed) ROM, the second implementation might be faster and/or larger.
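The expansion of an 8-bit multiply into 4-bit multiplies can be illustrated with a small behavioral sketch. It assumes the usual partial-product decomposition (four 4-bit multiplies combined by three additions, with the shifts treated as wiring); the exact wiring of Figure 5.2 is not recoverable here, and the function names are hypothetical.

```python
def mul4(x: int, y: int) -> int:
    """Stand-in for a 4-bit x 4-bit hardware multiplier (8-bit result)."""
    assert 0 <= x < 16 and 0 <= y < 16
    return x * y


def mul8_from_mul4(a: int, b: int) -> int:
    """8-bit multiply built from four 4-bit multiplies and three adds."""
    a_lo, a_hi = a & 0xF, (a >> 4) & 0xF
    b_lo, b_hi = b & 0xF, (b >> 4) & 0xF
    p0 = mul4(a_lo, b_lo)        # weight 2^0
    p1 = mul4(a_lo, b_hi)        # weight 2^4
    p2 = mul4(a_hi, b_lo)        # weight 2^4
    p3 = mul4(a_hi, b_hi)        # weight 2^8
    mid = p1 + p2                # add 1
    upper = (p3 << 4) + mid      # add 2 (shift is wiring only)
    return (upper << 4) + p0     # add 3


if __name__ == "__main__":
    print(mul8_from_mul4(200, 123))
```

If the four `mul4` calls are scheduled onto one shared 4-bit multiplier, the data path shrinks while the controller gains the states needed to sequence them, which is the tradeoff this section examines.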
This "recursive" expansion could continue, resulting in additional design possibilities.

Only a global analysis can determine whether such a substitution has pushed the design in the right direction. If the operation occurs frequently in the dataflow graph, the data path area savings may be overwhelmed by the added control loops. Alternatively, a subroutine in the controller could be used. Although extra procedural hardware and stack registers would be needed, there might be other operators which could take advantage of a controller subroutine capability, thereby lowering the overall area.

In the second example dataflow graph of Figure 5.3, a multiplication is located both on and off the critical path. Depending upon the constraints, either none, mul1, or mul1 and mul2 are candidates for transformation. (Mul2 is not a candidate by itself since it is on the critical path; if a transformation to a slower design is acceptable on the critical path, it is clearly acceptable off the critical path for this graph. The transformation must consider such effects.) Furthermore, each data path transformation candidate is accompanied by the choice of a sequential, looping, or subroutine control path, giving 13 total implementations. If sub7 was also a multiplier, this amount would increase to 44 possibilities. An exponential increase in complexity demonstrates why global brute force evaluation is impractical for transformation evaluation.

Figure 5.2: 8-bit multiplier implemented using 4-bit multipliers

Figure 5.3: Example Dataflow Graph

5.3 Control Path/Data Path Bitwidth Tradeoff Model

Given the types of control path/data path tradeoffs, an analytical model is derived for the most useful type: bitwidth tradeoffs.
The results obtained by this model are dependent upon the control path area prediction derived in Chapter 3; thus, it is only the control area model developed earlier which allows a bitwidth model to be developed.

A common tradeoff to apply to any behavior is bitwidth modification of the operators. The usefulness of bit-serial operators has been illustrated; a commercial package even implements a user-defined degree of serialization [HC88]. Consequently, a model for evaluating bitwidth effects on the total area of a given dataflow graph will be derived. Terminology used throughout the remainder of this section is described below.

• $A_{dp}$ is the area of the data path.
• $A_{cp}$ is the area of the PLA controller.
• $c_1 \ldots c_4$ are area constants reflecting the area of a PLA controller from Equation 3.2.8 in Chapter 3 where
$$A_{cp} \approx c_1(2i + o)p + c_2 p + c_3(2i + o) + c_4 \tag{5.3.1}$$
and $i$, $o$, and $p$ are the number of inputs, outputs, and product-terms respectively.
• $\zeta$ is the number of states in the PLA.
• $A_{mux}$ is the area of a 1-bit 2:1 multiplexer. The area of an $n$-bit $m$:1 multiplexer is approximated by
$$A^{n}_{mux} \approx n \times (m - 1) \times A_{mux} \tag{5.3.2}$$
• $A_{reg}$ is the area of a 1-bit register. The area of an arbitrary $n$-bit register is
$$A^{n}_{reg} \approx n A_{reg} \tag{5.3.3}$$
• $A^{n}_{cntr}$ is the area of an $n$-bit binary counter with reset capability. Assuming the use of a ripple-carry counter, then
$$A^{n}_{cntr} \approx n A^{1}_{cntr} \tag{5.3.4}$$

A dataflow graph consisting of a single operation (of the right type) can be implemented to perform its operation fast with parallel logic or cheap with serial logic as depicted in Figure 5.4. Here, a $b$-bit operation of type op can be implemented using $b/4$-bit operations upon each subset of the data. Multiplexers select which subset to operate upon and registers hold the intermediate results.
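The area terms of Equations 5.3.1 through 5.3.4 translate directly into small helper functions. The following is a minimal sketch; the class and function names are hypothetical, and any numeric constants supplied to them are placeholders rather than values from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaConstants:
    """Technology-dependent PLA area constants c1..c4 (placeholders)."""
    c1: float
    c2: float
    c3: float
    c4: float


def pla_area(k: PlaConstants, i: int, o: int, p: int) -> float:
    """Equation 5.3.1: PLA area from inputs i, outputs o, product terms p."""
    return k.c1 * (2 * i + o) * p + k.c2 * p + k.c3 * (2 * i + o) + k.c4


def mux_area(n_bits: int, m_inputs: int, a_mux: float) -> float:
    """Equation 5.3.2: n-bit m:1 multiplexer built from 1-bit 2:1 muxes."""
    return n_bits * (m_inputs - 1) * a_mux


def reg_area(n_bits: int, a_reg: float) -> float:
    """Equation 5.3.3: n-bit register."""
    return n_bits * a_reg


def counter_area(n_bits: int, a_cntr1: float) -> float:
    """Equation 5.3.4: n-bit ripple-carry counter."""
    return n_bits * a_cntr1


if __name__ == "__main__":
    k = PlaConstants(c1=1.0, c2=2.0, c3=3.0, c4=4.0)  # arbitrary values
    print(pla_area(k, i=2, o=3, p=5))
```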
5.3.1 A Simple Model

For the initial model, only the incremental area of the control and data path is analyzed, including multiplexers and a register after each micro-cycle. It is assumed that

• a dataflow graph partitioned into stages exists,
• the hardware type which implements each operation has been selected,
• the particular operation has been scheduled, but not allocated hardware,
• a PLA controller exists, and
• the constructed PLA has $\zeta$ states to operate the original (unmodified) data path.

Clearly, sharing of this hardware with other portions of the design impacts bitwidth tradeoffs. For the moment, it is assumed that the operator is not shared. This assumption will be addressed later.

Assume that allocation of the full $b$-bit hardware would increase the data path area by $A_{bi\text{-}op}$ and is bit-independent, i.e. the resultant binary value of a given bit position is independent of the values of any other bit position. (For example, 1's complement and xor are bit-independent whereas add and compare are bit-dependent.) Op can be implemented in the data path by a single operator of bitwidth $b$ or $b$ operators of 1 bit. If $n$ is the number of operators of bitwidth $b'$ required to implement op, then
$$n = \frac{b}{b'} \tag{5.3.5}$$
and $n$, the serialization quotient, can range from 1 to $b$.

Figure 5.4: Serial versus Parallel Implementation

For the following equations, hardware assignment for a single serialized operation will be examined in isolation. Assume that the operator has been divided into integral parts (e.g. $b/n$ is an integer) where only a single operator is repeatedly used (in a loop) to implement the operation.
Since intermediate partial results need to be stored, registers will be needed; the combined registers hold the complete result after the last partial operation is finished. Also, multiplexers will be needed to select the appropriate input for each partial output. The additional data path area (exclusive of wiring) required to implement this operator, $\Delta A_{dp}$, is
$$\Delta A_{dp} \approx \frac{A_{bi\text{-}op}}{n} + b A_{reg} + \frac{(n - 1)\,y\,b}{n} A_{mux} \tag{5.3.6}$$
where $A_{bi\text{-}op}$ is the area of a $b$-bit bit-independent operator and $y$ is the number of inputs of bitwidth $b$ for the operator. (Typically, $y$ has a value between two and four.) Note that although the number of distinct registers increases with serialization, the total register bitwidth (and area) is unchanged by serialization as reflected by the second term. The multiplexer area contribution is worst-case.

The controller area to operate this additional arbitrary serial path depends upon the current size of the PLA. The controller state bit count increase, $\delta$, is the difference between the original combinational and an arbitrary sequential implementation of the operator as defined in Equation 5.3.7. Since the original schedule reflected no serialization ($n = 1$), only the additional serial states impact control area. Again, this is an upper bound analysis where any additional states added through serialization cannot use existing control states, thereby impacting the control area. For example, if the number of control states increases from 14 to 17, then the number of state bits increases from 4 to 5. The controller state bit increase is
$$\delta = \lceil \log_2 (n - 1 + \zeta) \rceil - \lceil \log_2 \zeta \rceil \tag{5.3.7}$$

Table 5.1: Sensitivity of $\delta$ to $n$ and $\zeta$

 n \ ζ     2     5    10    20    50   100   150   200   250   300
   1       1     -     -     -     -     -     -     -     -     -
   2       1     -     -     -     -     -     -     -     -     -
   4       2     1     -     -     -     -     -     -     -     -
   8       3     1     1     -     -     -     -     -     1     -
  16       4     2     1     1     1     -     -     -     1     -
  32       5     3     2     1     1     1     -     -     1     -
  64       6     4     3     2     1     1     -     -     1     -

Table 5.2: Minimum and Maximal Change in PLA Parameters versus $n$

Parameter                 Minimum       Maximum
$\Delta i$ (inputs)       0             $\delta$
$\Delta o$ (outputs)      0             $n - 1 + \delta$
$\Delta p$ (p-terms)      $n - 1$ *     $n - 1$

* Assumes new states are required in PLA, else this term is zero.

Using Equation 5.3.7, Table 5.1 shows how the sensitivity of $\delta$ decreases rapidly with increasing $\zeta$ until it becomes nearly independent of $n$. Notice, however, that near the "power-of-2" boundaries an extra state bit is likely to be added.

Given $n$ and $\delta$, the upper and lower bounds for the changes in the number of inputs, outputs, and product-terms can be determined as provided in Table 5.2. The number of inputs is only affected by state-bit increase. Product-term count is adjusted by the number of additional states. The output count is affected by the state-bit increase (if any) and the number of additional register control lines needed. Note that control of this serialized operation is done in consecutive states; additional multiplexer control lines are not needed since the controller state bits can be used as described in Chapter 3.

A worst-case incremental area associated with the controller can be derived using Equation 5.3.1 and substituting the maximum values listed in Table 5.2 for $\Delta i$, $\Delta o$, and $\Delta p$.
The incremental cost of the controller, $\Delta A_{cp}$, is the difference between the controller for the serialized data path and the original control which is
$$\begin{aligned}
\Delta A_{cp} &\approx A_{cp}(i + \Delta i,\, o + \Delta o,\, p + \Delta p) - A_{cp}(i, o, p) \\
&\approx c_1(2\Delta i + \Delta o)p + c_1(2i + o)\Delta p + c_1(2\Delta i + \Delta o)\Delta p + c_2 \Delta p + c_3(2\Delta i + \Delta o) \\
&\approx (c_1 p + c_3)(3\delta + n - 1) + c_1(n - 1)(2i + o + 3\delta + n - 1) + c_2(n - 1)
\end{aligned} \tag{5.3.8}$$

As $n$ increases, the incremental data path area drops roughly as $1/n$ and incremental control path area rises as $n^2$. The minimum serialized area, or "boundary point", is where the sum $(\Delta A_{cp} + \Delta A_{dp})$ is minimum with respect to $n$. Beyond the boundary point, total area and time increase, yielding only inferior designs.
$$\frac{\partial}{\partial n}\left(\Delta A_{cp} + \Delta A_{dp}\right) = 0 \tag{5.3.9}$$
$$\frac{\partial}{\partial n}\Delta A_{cp} + \frac{\partial}{\partial n}\Delta A_{dp} = 0$$
A bi- op ^ + + C3 _ Ci _ j_ nCi^ 3 ^ + c 2 on d p c\{2i T o T 35 T n — 1) T (35 T n — l)c j ~ — T + = 0

Although in a continuous domain, the problem will be migrated to a discrete domain under the assumption that only a small incremental value for $n$ is considered. Hence, the partial derivatives for $i$, $o$, and $p$ can be substituted by $\Delta i$, $\Delta o$, and $\Delta p$. These, in turn, can be replaced by the maximum parameter change values of Table 5.2 and the equation solved for the operator area at the boundary point, $A_{bi\text{-}op(b.p.)}$.

Abi-op(b.p.) « 2ct n 2(z — 1 ) -f c in 2(p + o + n + 35) + n 2 (c2 -f c 3) + 3« 2 — (ciP + c 3 — ci) -1 - can 3(35 + 1) n — 1 + C (5.3.10)

With this equation, the area of the original unserialized operator, $A_{bi\text{-}op}$, can be related to the serialization quotient, $n$, which results in the minimal circuit area increase. For a given bit-independent operation, its optimal bitwidth implementation can be computed with respect to area. It remains to resolve the PLA input, output, and product-term values.
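The expansion in Equation 5.3.8 can be checked numerically by comparing the direct difference of two evaluations of Equation 5.3.1 against the closed form obtained with the maximum changes of Table 5.2. The sketch below does this with integer constants so the comparison is exact; all names and the sample values are assumptions made for illustration.

```python
def pla_area(c, i, o, p):
    """Equation 5.3.1 with c = (c1, c2, c3, c4)."""
    c1, c2, c3, c4 = c
    return c1 * (2 * i + o) * p + c2 * p + c3 * (2 * i + o) + c4


def delta_acp_direct(c, i, o, p, n, delta):
    """First line of Equation 5.3.8 using the Table 5.2 maxima."""
    d_i, d_o, d_p = delta, n - 1 + delta, n - 1
    return pla_area(c, i + d_i, o + d_o, p + d_p) - pla_area(c, i, o, p)


def delta_acp_closed(c, i, o, p, n, delta):
    """Last line of Equation 5.3.8."""
    c1, c2, c3, _ = c
    s = 3 * delta + n - 1                     # equals 2*delta_i + delta_o
    return (c1 * p + c3) * s + c1 * (n - 1) * (2 * i + o + s) + c2 * (n - 1)


if __name__ == "__main__":
    consts = (2, 3, 5, 7)                     # arbitrary integer constants
    print(delta_acp_direct(consts, 4, 14, 10, 4, 1),
          delta_acp_closed(consts, 4, 14, 10, 4, 1))
```

Both functions agree for any non-negative inputs, which confirms that the closed form is an exact rearrangement of the difference and that $c_4$ cancels.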
The basic PLA parameters of Equation 5.3.11 are used in conjunction with Equation 5.3.10 to compute the actual "boundary point" where $o_{stage}$ and $o_{reg}$ are the stage control and register control PLA outputs, respectively. $R$ is the number of original register control lines and $\Delta R$ is the additional number of register control lines, if any, required for serialization.
$$\begin{aligned}
i &= \lceil \log_2 \zeta \rceil \\
o_{stage} &= \zeta + \lceil \log_2 \zeta \rceil \\
o_{reg} &= R + \Delta R + \lceil \log_2 \zeta \rceil \\
p &= \zeta
\end{aligned} \tag{5.3.11}$$

Assuming that any additional control steps require register control, then the change in output lines is the same for either the register control or stage control model. The stage control model will be used throughout the remainder of this section.

A graph showing the tradeoff between incremental control path and data path area versus serialization quotient is included in Figure 5.5 for initial controller states of $\zeta$ = 1, 10, 100, and 500. The incremental area is associated with allocating the operator and serial scheduling impact on the controller for some arbitrary design. (If this serialization occurred along some conditional path controlled by the same PLA, then the incremental number of control states might be greater than that shown here.) To see the effects of operator serialization only upon the tradeoff, multiplexer and register effects were not included in computing $\Delta Area_{cp}$ and $\Delta Area_{dp}$. As expected, the optimal serialization quotient falls as the size of the PLA increases. Clearly, choosing an incorrect serialization quotient can have a detrimental effect on area when a large PLA is being extended.
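The tradeoff described above can be explored by evaluating the incremental area for each candidate serialization quotient and keeping the smallest. The following sketch combines Equations 5.3.7, 5.3.8, and 5.3.11 (stage control model) with the operator term $A_{bi\text{-}op}/n$ only, omitting multiplexers and registers as was done for Figure 5.5. The function names and every constant passed to them are assumptions; real values of $c_1 \ldots c_3$ depend on the PLA technology.

```python
def ceil_log2(x: int) -> int:
    """Ceiling of log2(x) for x >= 1, using integer arithmetic only."""
    return (x - 1).bit_length()


def incremental_area(n, states, a_op, c1, c2, c3):
    """Operator area a_op/n plus worst-case controller growth (Eq. 5.3.8)."""
    i = ceil_log2(states)                 # Eq. 5.3.11, inputs
    o = states + i                        # Eq. 5.3.11, stage control outputs
    p = states                            # Eq. 5.3.11, product terms
    delta = ceil_log2(n - 1 + states) - i # Eq. 5.3.7
    s = 3 * delta + n - 1
    d_cp = (c1 * p + c3) * s + c1 * (n - 1) * (2 * i + o + s) + c2 * (n - 1)
    return a_op / n + d_cp


def best_serialization(states, a_op, c1, c2, c3, n_max):
    """Serialization quotient in 1..n_max with the least incremental area."""
    return min(range(1, n_max + 1),
               key=lambda n: incremental_area(n, states, a_op, c1, c2, c3))


if __name__ == "__main__":
    # Placeholder constants chosen only to exercise the sweep.
    for zeta in (1, 10, 100, 500):
        print(zeta, best_serialization(zeta, 100000, 2.0, 10.0, 10.0, 27))
```

In a real flow the candidate list would be restricted to values of $n$ for which $b/n$ is an integer, and the multiplexer and register terms of Equation 5.3.6 would be added to the data path contribution.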
Figure 5.5: Incremental Area vs Serialization Quotient for $A_{bi\text{-}op}$ = 100000 (X: Serialization Quotient; Y: Incremental Operator + Ctrl Area (mil²); curves for 1, 10, 100, and 500 original states)

From this derivation, an optimal $n$ can be computed for a given operator size ($A_{bi\text{-}op}$) with an initial size for the PLA. Figure 5.6 shows the serialization factor for any given bit-independent operator and number of current control states. Choosing the best controller and $n$ value is straightforward. For example, assume the operator area is 5000 mil². A vertical line at $(1000 \log_{10} 5000) = 3700$ in Figure 5.6 intersects with the boundary line of 10 control states between four and five. Hence, the design with an $n$ value of four has the least cost (assuming it is feasible); the actual cost of the minimum can then be computed from the sum of Equations 5.3.6 and 5.3.8.

Considering multiplexers and registers affects the boundary curve by introducing a constant term, with the resulting curves shown in Figure 5.7. As compared to Figure 5.6, the serialization quotient $n$ for the boundary point in Figure 5.7 has been reduced for a given design.

Even in this small example, the discrete nature of the results is evident. A likely situation is where the serialization quotient cannot be realized. In this case, the first usable value on either side of this serialization quotient should be evaluated using Equations 5.3.6 and 5.3.8 to find the minimum.
Although the larger serialization quotient yields an inferior design in the continuous space of the model, it may still be a viable design in the discrete serialization space.

Figure 5.6: Bit-dependent Operator Area versus Maximum Serialization (X: Operator Area (10³ log₁₀ mil²); Y: Optimal Serialization Quotient; curves for 1, 10, 100, and 500 original states)

Figure 5.7: Best Operator Area versus Serialization (multiplexers and registers) (X: Area (10³ log₁₀ mil²); Y: Optimal Serialization Quotient; curves for 1, 10, 100, and 500 original states)

5.3.2 Bit-dependent Operators

A second type of operator is the bit-dependent type. These operators, such as add and compare, must be performed in a particular order and have an additional cost to retain the dependencies. For example, an add is performed from the low-order to the high-order bits with a carry bit retained at each micro-cycle. On the other hand, the compare could have two retained status bits: less-than and greater-than. There are more complex issues regarding serial implementation of a compare, such as "early out" consideration for greater-than or less-than, which would eliminate the need of such bits. However, "early out" is only possible for high-order to low-order comparisons.

Changes to the bit-independent model for bit-dependent operators are minimal. One more term is defined, $r$, the number of "retained" bits (which is carry for add, borrow for subtract, etc.).
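The role of a retained bit can be seen in a behavioral sketch of a serialized add: a $b$-bit addition performed as $n$ passes through a $(b/n)$-bit adder, low-order slice first, with one carry bit ($r = 1$) held between passes. The function name and default parameters are hypothetical and chosen only for illustration.

```python
def serial_add(x: int, y: int, b: int = 16, n: int = 4):
    """Add two unsigned b-bit values using n passes of a (b/n)-bit adder.

    Returns (sum modulo 2**b, final carry). The carry variable models the
    single retained bit stored in a register between micro-cycles.
    """
    if b % n != 0:
        raise ValueError("b/n must be an integer")
    w = b // n                         # bitwidth of the shared adder
    mask = (1 << w) - 1
    carry = 0                          # retained bit, zero on the first pass
    result = 0
    for step in range(n):              # low-order slice first
        xs = (x >> (step * w)) & mask  # multiplexer selects the x slice
        ys = (y >> (step * w)) & mask  # multiplexer selects the y slice
        total = xs + ys + carry        # the (b/n)-bit adder with carry-in
        result |= (total & mask) << (step * w)   # partial result register
        carry = total >> w             # retained for the next pass
    return result, carry


if __name__ == "__main__":
    print(serial_add(1234, 4321))
```

The initial `carry = 0` corresponds to selecting a constant on the first pass, and the reuse of `carry` on later passes corresponds to selecting the stored bit, which is why a multiplexer accompanies each retained bit.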
These retained bits are stored in registers between iterations (such as a carry-bit for an adder). Furthermore, the retained bits are only selected after the first iteration; the retained bits are initialized on the first loop (which is zero for the adder). Thus, a multiplexer is also needed for each retained bit. The incremental data path area for the bit-dependent operator becomes
$$\Delta A_{dp} \approx \frac{A_{bd\text{-}op}}{n} + (b + r)A_{reg} + (n - 1)\left(\frac{y\,b}{n} + r\right)A_{mux} \tag{5.3.12}$$
where $A_{bd\text{-}op}$ is the area of a bit-dependent operator.

Control of bit-dependent operations may require one additional control line for multiplexing the "retained" bits. For the first state, a constant value (typically zero) is input into the operator for one side of the multiplexer; remaining cycles use the previous output value of the retained bits available at the other input of the multiplexer. The PLA outputs and boundary point operator area are modified as follows:
$$o_{reg} = R + \Delta R + \lceil \log_2 \zeta \rceil + 1 \tag{5.3.13}$$
and

Abd— op(b.p.') ss Abi— op{b-p.) rn A m u x (5.3.14)

with $A_{bi\text{-}op(b.p.)}$ defined in Equation 5.3.10. This further flattens the curve shown in Figure 5.7.

5.3.3 Special Operator Bitwidth Considerations

The final class of operators has been labelled "special operator" because they expand into complex implementations of two or more operations. Two members of this category are multiply and divide. A distinguishing feature is that expansion complexity is some function of $n$ which is non-linear.

As an example, Figure 5.8 depicts one method for generating lower bitwidth multipliers, with a serialization quotient of $n = 2$ demonstrated here. The area impact of the multiplier data path for a general $n$ and bitwidth $b$ (where the number of inputs is fixed at two) is

26 y b t A A d p * = = * H + ( (re — 1 ) + tybAmux + 26/4 re re re g (5.3.15)

Figure 5.8: 8-bit Multiply using 4-bit Multiplier

Based upon one type of hardware multiplier, the area of the multiplier can be estimated as

§ A'2 m u l ^ m u l A*, « 2 2 A (5.3.16)

where $b$ is the number of bits and $A^{2}_{mul}$ is the area of a 2-bit multiplier. Hence,

A L , « 2 (5.3.17)

Other methods for implementing multiply and divide would first need to generate an equation similar to Equation 5.3.15 with which to derive the boundary point line.

5.3.4 Example using the small model

To demonstrate the control path/data path discrete model bitwidth tradeoff, a small dataflow graph was chosen as shown in Figure 5.9, with 16-bit data paths. Add3 was implemented using smaller bitwidth adders. The first two rows of Table 5.3 contain the results of the original design and optimal bit-serial implementation predicted by the model. The third row shows the best value obtained through exhaustive search of all power-of-2 add bitwidths for add3 (from 1 to 16) and its associated PLA area as synthesized by MAHA and the Berkeley PLA tools, respectively. (Bitwidths of 1, 2, 4, 8, and 16 were attempted.)

Figure 5.9: Small Dataflow Graph
Table 5.3: Analysis of Serialization Factor #1 for the Example shown in Figure 5.9

Serial Method   add3 bitwidth   Time-steps   Data Path Area (mil²)   Controller Area (mil²)   Total Area (mil²)
None                 16              4              13256                    768                   14024
Model                 4              7              10538                   1250                   11788
Actual                4              7              10618                   1368                   11986

This example demonstrates that the expected bitwidth is the same as that found through exhaustive search. It also highlights a shortcoming of the current model: its local scope. Since add2 might share hardware with add1 and/or add3 prior to serialization, the implementation having the lowest area may have been missed. Secondly, serialization is restricted by any system-level time constraint which may implicitly limit the degree of serialization.

A second example highlights another limitation: the bitwidth expansion model assumes one needs additional control states and output control lines. Here, the 16-bit adder labelled add12 in the random dataflow graph of Figure 5.10 was chosen for analysis with results in Table 5.4. Despite the very small change in area, the model finds the correct serialization quotient. Note that only one more control state was added for the optimal bitwidth tradeoff versus three states for the computed serialization. Bitwidth expansion which does not affect the number of control states would potentially allow for more serialization than the current model predicts.

Table 5.4: Analysis of Serialization Factor #2 for the Example shown in Figure 5.10

Serial Method   add12 bitwidth   Time-steps   Data Path Area (mil²)   Controller Area (mil²)   Total Area (mil²)
None                 16              15              77216                   3264                   80480
Model                 4              18              74162                   3868                   78030
Actual                4              16              74066                   3443                   77509

5.3.5 Limitations of the Control/Data Bitwidth Model

The model given in this section derives the impact on both the data path and controller of serializing one operation of a given type.
If there is more than one operation of this type, they all might be serialized depending upon the data path constraints. For example, if a tight time constraint existed, only those operations with sufficient slack time might be serialized. Conversely, if the constraints are such that the critical path operations can be serialized, then it is possible that all operations of this type can be serialized. Note, however, that the critical path could change if serialization along any path is extensive.

Figure 5.10: Random Graph

Data path cost can also vary depending upon the graph topology. Hardware modules of the desired reduced bitwidth may already exist in the data path, which entails no additional hardware area for serialization. Further, after transforming the first operation of a given type, serializing the next might not impact the data path cost. Detailed scheduling information is required to resolve this issue.

In this section, the control impact of serializing a single operation was derived. However, when another operation of any type is now serialized, the control cost could range from the model estimate down to zero. The latter case occurs when the operation being serialized has available control states. Furthermore, the control model does not consider serialization in the presence of conditional paths.

The model presented here is thus limited to a local, non-conditional branch, single operation view of the design. Extending it would require knowledge about the connectivity and scheduling in the dataflow graph. For a single operation having multiple occurrences in the dataflow graph, it is evident that serialization is unlikely to be a binary decision.
Given a partial conversion, selection of which operations to serialize becomes an issue that also extends to multiple operation types. One candidate solution technique is the use of integer programming to determine not only which operations to serialize, but to what degree. However, since the dataflow graph, resources, and timing are all potentially altered by a single transformation, it is unlikely that a single closed form model could be derived.

Although there are seemingly numerous limitations to the model given here, its best use would be as part of a larger transformation engine. By having a model, even one with a limited view, it is possible to make decisions regarding which operations of the given operation type to serialize. Exploratory single-point changes could be made quickly while an intelligent engine guides the search based upon the changes which have been made. It is in this role that the model becomes of value.

5.4 Summary

In this chapter, control path/data path tradeoff types were discussed. Three implicit types of tradeoffs were identified: composition, decomposition, and sequential/combinational. The last tradeoff type includes bitwidth tradeoffs, for which an analytical model was developed. Finally, this model was validated against some examples and its limitations were presented.

It is apparent that deriving analytical models for all different types of tradeoffs would be extremely difficult due to the complexity of the problem. Migration of functionality by altering the bitwidth alone revealed that global analysis is necessary to determine the optimal solution. These limitations can be overcome by resorting to prediction tools which are able to achieve good solutions without the cumbersome execution time associated with synthesis. Tradeoffs could be applied and their impact quickly assessed.
This approach is explored in the next chapter.

Chapter 6
Control Path/Data Path Tradeoff Evaluation

6.1 Introduction

In Chapters 2 through 5, discussion centered upon mechanisms which enable a user to perform system area/time analysis. Combined with other tools in the ADAM system, a complete capability to estimate as well as perform actual synthesis for both the data path and control path is available.

In this chapter, the focus is upon the application and evaluation of control path/data path tradeoffs. A method for determining the quality of the tradeoffs is described and a variety of example dataflow graphs are modified and analyzed. Finally, tradeoff trends observed during the experiments are discussed.

6.2 Control Path/Data Path Tradeoffs Evaluation

Given the types of control path/data path tradeoffs, the effects of applying specific tradeoffs need to be evaluated. There are two approaches for determining the effect of applying a given tradeoff: analytical and empirical. The analytical approach is usually the more desirable; given a system of equations, one could readily determine the outcome of applying a transformation. Furthermore, it might be possible to apply the inverse equation set and thereby derive the best solution given the design objectives. As was shown in the last chapter, the discrete nature of the design space makes it extremely difficult to derive any general model. Thus, the analytical approach has limited application.

The second approach for evaluating tradeoffs is synthesis of each design or set of designs as transformations are applied. Although practical for small designs, this is not viable for large circuits. An acceptable alternative is to predict the outcome using estimation techniques. Both of these concepts will be explored.
6.2.1 A Methodology for Evaluating Tradeoffs

A general approach for evaluating control path/data path tradeoffs is to process the modified behavior through a design system and assess the results. Unfortunately, this concept exhibits serious computer run time limitations for larger implementations. Instead of synthesizing designs, would it not be faster to predict the set of designs which results when a change is made? Only when predicted designs meet user constraints would synthesis of complete designs occur.

A method for evaluating designs using coarse estimates and fine synthesis results is depicted within each vertical path of Figure 6.1. The system starts with a dataflow graph representing the behavior, a library of hardware modules, and a set of design constraints and goals. Initially, designs from the most parallel to the most serial are predicted using the estimation utilities. Should a region in the design space appear promising, actual synthesis of designs is performed in that region. The resultant predicted design curve and discrete synthesized designs will appear similar to that shown in Figure 6.2.

At this stage, many current systems would consider the design complete. However, by transforming the dataflow graph, a designer can effect further structural changes. When a transformation is applied to the dataflow graph, a new design curve is computed as indicated by the loop in Figure 6.1. Predicted designs which appear promising are synthesized, compared against the original results, and either accepted or rejected. This process continues until either the set of applicable transformations is exhausted, the design goals and constraints are reached, or some prior limitation on design time/computer resources has been exceeded.
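The predict-then-synthesize loop of this methodology can be sketched in code. The Python fragment below is only an illustrative sketch, not the ADAM implementation: the function name evaluate_tradeoffs, the predict and synthesize callables, and the representation of a design as a (time, area) pair are all assumptions introduced here, and "fastest design within an area constraint" is just one possible design goal.

```python
from typing import Callable, Iterable, List, Optional, Tuple, TypeVar

Graph = TypeVar("Graph")
Design = Tuple[float, float]  # hypothetical representation: (time, area)


def evaluate_tradeoffs(
    graph: Graph,
    transforms: Iterable[Callable[[Graph], Graph]],
    predict: Callable[[Graph], List[Design]],
    synthesize: Callable[[Graph, Design], Design],
    max_area: float,
    max_transforms: int = 10,
) -> Tuple[Graph, Optional[Design]]:
    """Apply transforms one at a time, keeping those that improve the design."""

    def best_design(g: Graph) -> Optional[Design]:
        # Coarse step: predict the whole design curve cheaply.
        promising = [d for d in predict(g) if d[1] <= max_area]
        # Fine step: synthesize only the promising predicted designs.
        synthesized = [synthesize(g, d) for d in promising]
        feasible = [d for d in synthesized if d[1] <= max_area]
        # Tuples compare by time first, so min() picks the fastest design.
        return min(feasible, default=None)

    best = best_design(graph)
    for count, transform in enumerate(transforms):
        if count >= max_transforms:  # prior limit on design effort
            break
        candidate_graph = transform(graph)
        candidate = best_design(candidate_graph)
        # Accept the transformed graph only if it beats the best so far.
        if candidate is not None and (best is None or candidate < best):
            graph, best = candidate_graph, candidate
    return graph, best
```

The loop ends when the transformations are exhausted or the effort limit is hit; a test for "design goals reached" could be added in the same place. Synthesis is called only on predicted points that already satisfy the constraint, which is where the runtime saving is expected to come from.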
[Figure 6.1: Control Path/Data Path Tradeoff Evaluation. Flowchart with inputs Behavior and Constraints; an estimation path (Datapath Estimates, Routing/Storage Estimates, Control Estimates); a synthesis path (Datapath Synthesis, Routing/Storage Synthesis, Control Synthesis); a "more xforms or done" decision leading either to Apply Transform or to Final Designs.]

[Figure 6.2: Evaluation of Design Space: Prediction and Synthesis. Axes: Area versus Time. Labels: Predicted Design Space, Synthesized Design, Max., Subspace.]

6.2.2 Synthesis versus Prediction: An Example

To illustrate the evaluation technique, a design curve will be produced for the AR lattice filter. Synthesis and estimation results will be compared. Available inputs are the dataflow graph and a library of modules having differing areas and delays. An area constraint of 70000 mil^2 was specified and a non-pipelined design style was chosen using the module library in Table 6.1 (see footnote 6.1). First we describe the synthesis process, followed by the estimation procedure.

Table 6.1: Module Library for AR Example

Name    Type       Bits   Area (mil^2)   Delay (nS)
mulf    multiply   16     49000          375
mulm    multiply   16     9800           2950
muls    multiply   16     7100           7370
addf    addition   16     4200           340
addm    addition   16     2880           530
adds    addition   16     1200           1510

Footnote 6.1: The addition delays in this module library are somewhat pessimistic due to a historical error in reading a graph, and should not be taken as representative of actual modules.

6.2.2.1 Design Synthesis

As defined here, the design synthesis program constructs an RT (register transfer) description from the dataflow graph and chosen module set. These results, complete with hardware selection, routing, storage, detailed controller design, and wiring assignment, could be passed to a package which determines the floorplan and produces the actual chip layout.
Because of the detail involved, this process is time consuming.

To begin, the program SLIMOS is used to select the module set [JPP88]. (As described in Chapter 1, although SLIMOS is intended for pipelined designs, it has so far been shown to produce the best module sets for non-pipelined designs as well.) The AR lattice filter consists of 16-bit adds and multiplies. Using this designer's 60/40 rule, an initial area constraint of 28000 mil^2 was entered into SLIMOS (40% of 70000) and feasible module sets, each consisting of one adder and one multiplier, were chosen from Table 6.1. There are five module sets which form the entire design space: {(mulf, addf), (mulf, addm), (mulm, adds), (mulm, adds), (mulf, adds)}; (mulm, adds) best meets the criteria.

Using the chosen module set and dataflow graph, MAHA synthesizes designs from parallel to serial. A total of eight non-inferior designs were produced, as illustrated in Figure 6.3. During its search, MAHA internally generated 80 individual designs and discarded the remainder as inferior in time and/or area to these eight.

With scheduling and operator allocation completed, routing and storage hardware is attached by MABAL. Multiple executions of MABAL are performed on each of the eight designs in order to obtain the minimum register and multiplexer area.

Next, a PLA controller is constructed from the data path results. The Berkeley tools (PEG/ESPRESSO/EQNTOTT) are used to generate the PLA personality matrix. Layout is produced by MKPLA from this matrix. Results of the control synthesis, multiplexer and storage allocation are also included in Figure 6.3. The total area including the data path is depicted in Figure 6.4.

Note that Figure 6.3 has "timesteps" instead of "time" along the X axis.
This allows a direct comparison of area for similar predicted and synthesized designs while removing any clock cycle time differences. In reality, clock cycle time for predicted and synthesized designs can differ for the same number of timesteps. This would result in both a time and area displacement on the graph and make it difficult to compare similar designs. However, since that is more realistic, future results will reflect area and time. It should also be noted that the data represent either control path synthesis of data path synthesized designs or estimated control path area for predicted data path designs. For control synthesis, PEG PLA descriptions were written to eliminate redundant output lines.

Finally, the wiring area of each design can be estimated using PLEST. PLEST is given the total block width for data path, register, and multiplexer hardware plus the number of two-wire nets. The version of PLEST used automatically computed the average wire length based upon the RCA 3 µm CMOS library. Since PLEST is not an actual router, but an area estimator, it more correctly belongs in the prediction leg. However, until a router compatible with the rest of the ADAM system is obtained, PLEST will continue to be used to factor in wiring results.

6.2.2.2 Design Prediction

The estimated design points were produced in the following manner.

As in synthesis, SLIMOS is initially used to generate the module set meeting the constraints for the estimators. Using the chosen module set, the design curve is predicted using PSADNP for non-pipelined designs. Besides module library and operation quantities, PSADNP requires critical path delay; this value is obtained by executing the first portion of MAHA, which lists the operations in the nominal critical path and associated delay time prior to actual synthesis.
The output of PSADNP gives the number of clock cycles, the anticipated delay, and the predicted area. Results which are closest to the actual synthesis results are plotted in Figure 6.3. (For the remainder of this example, the remaining predicted points were dropped since a comparison is being made against the synthesized results.) As can be seen in Table 6.4 for the design with 5 timesteps, a large error exists for time. However, the clock cycle time difference between the predicted and synthesized designs is less than half the minimum clock cycle time; this difference is magnified by the number of timesteps, giving the larger total time error. Fortunately, it is only in the mid-range of timesteps that such a time error occurs. For larger timestep values, the clock cycle time rapidly approaches the minimum, reducing circuit time errors.

The next step is to estimate register and multiplexer usage with REGEST and MUXEST. REGEST determines the storage requirements from the dataflow graph. Multiplexer estimation in MUXEST uses the output from the register estimation as well as the data path prediction results.

Finally, PASTA is used to estimate the controller area. A file which contains the data path control information is read by PASTA; the size of each PLA personality matrix as well as folded and unfolded area is displayed when PASTA is run.

6.2.2.3 Synthesis versus Prediction

During design evaluation, a choice must be made between prediction and synthesis. Prediction is faster than synthesis, with an inherent risk in that predictions may be infeasible or inaccurate. A comparison between the predicted and synthesized designs for the AR lattice filter reveals some differences, as indicated in Figures 6.3 and 6.4 and Tables 6.2 and 6.3.
Only the four designs boldfaced in Table 6.4 (marked * below) appear comparable; the remaining four designs are in error. The comparable designs have identical predicted and synthesized operator area (clock cycles of 1, 5, 10, 18). Hence, any error is due to routing, storage, and control.

For seven designs, the actual and predicted register count track closely. Multiplexer and control error are also not significant. For the remaining design with 10 clock cycles, register use is higher than expected (by two), contributing the predominant error. If predicted register assignment is optimal, then two values in the graph are being stored longer than expected. Indeed, an examination of Figure 6.5, which depicts the scheduling generated by MAHA, reveals that adders on the upper left are scheduled later than necessary. By manually reassigning the adders to the earliest clock cycle possible (which does not change the resources or timing), MABAL allocates six registers instead of eight and the difference is explained.

[Figure 6.3: AR Filter Area Comparison. X: Timesteps; Y: Area (square mil). Series: actual operator area; estimated operator area; actual control path + mux + reg; estimated control path + mux + reg.]

[Figure 6.4: AR Filter Totals. X: Time (nS); Y: Area (square mil). Series: actual total + wiring; estimated total + wiring; actual CP + operators + mux + reg; estimated CP + operators + mux + reg.]

Table 6.2: AR Filter Designs: Predicted

Clock    Op. Area                   Ctrl Area   Total w/wiring
Cycles   (mil^2)    Regs   Muxes    (mil^2)     Time     Area
1        171200     2      0        -           16405    239711
4        42800      5      33       578         16452    87942
5        42800      5      34       764         16485    88462
6        31800      5      33       764         17802    72617
8        22000      5      32       983         23704    58528
10       22000      5      32       1535        29630    59632
16       11000      5      30       1786        47408    44862
18       11000      5      30       2353        53334    45006

Table 6.3: AR Filter Designs: Actual

Clock    Op. Area                   Ctrl Area   Total w/wiring
Cycles   (mil^2)    Regs   Muxes    (mil^2)     Time     Area
1        171200     2      0        -           16405    239711
4        63600      6      32       578         17892    117485
5        42800      6      35       881         22365    90834
6        44000      6      24       999         18198    85823
8        41600      6      36       1117        23680    89972
10       22000      8      34       1665        29630    66226
16       20800      7      39       1786        47408    62442
18       11000      6      34       2353        53334    47370

For predicted and synthesized designs having the same operator area, the combined error in routing, storage, and control area is a small fraction of the overall area. With an average area error of 4.4% (worst case is 9.9%) and average error in time of 6.6%, clearly the prediction results are comparable.

[Figure 6.5: AR filter design (clock cycles = 10). Scheduled dataflow graph of mul and add operations.]

Table 6.4: AR Filter Designs: Error Summary
(* = comparable design, boldfaced in the original)

Clock    w/o wiring              with wiring
Cycles   Area (%)   Time (%)     Area (%)   Time (%)
1 *      0.0        0.0          0.0        0.0
4        27.7       8.1          25.1       8.0
5 *      1.4        26.4         2.6        26.3
6        18.8       2.3          15.4       2.2
8        38.7       0.0          34.9       0.0
10 *     8.0        0.0          9.9        0.0
16       37.6       0.0          28.1       0.0
18 *     6.9        0.0          5.0        0.0

The four poorer predictions in the AR filter comparison have their error introduced during operator area estimation. Due to data dependencies which PSADNP does not consider, the predicted designs for timesteps of 4, 6, 8, and 16 are, in fact, unachievable with the computed resources. The cause of this is detailed by Jain in [Jai89] and will not be elaborated here.
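The error figures quoted for the four comparable designs can be checked against Tables 6.2 and 6.3. The short Python sketch below assumes that the error is measured as the absolute difference between actual and predicted values, taken as a percentage of the actual value; that definition is an assumption, since the text does not state one. The variable names are introduced here for illustration.

```python
# Total (time, area) with wiring, keyed by clock cycles, copied from
# Table 6.2 (predicted) and Table 6.3 (actual) for the comparable designs.
predicted = {1: (16405, 239711), 5: (16485, 88462),
             10: (29630, 59632), 18: (53334, 45006)}
actual = {1: (16405, 239711), 5: (22365, 90834),
          10: (29630, 66226), 18: (53334, 47370)}


def percent_error(pred: float, act: float) -> float:
    """Assumed error measure: |actual - predicted| as a percentage of actual."""
    return abs(act - pred) / act * 100.0


time_errors = {c: percent_error(predicted[c][0], actual[c][0]) for c in actual}
area_errors = {c: percent_error(predicted[c][1], actual[c][1]) for c in actual}

avg_time_error = sum(time_errors.values()) / len(time_errors)
avg_area_error = sum(area_errors.values()) / len(area_errors)
worst_area_error = max(area_errors.values())

print(f"average area error:    {avg_area_error:.1f}%")
print(f"average time error:    {avg_time_error:.1f}%")
print(f"worst-case area error: {worst_area_error:.2f}%")
```

Under this assumed definition the averages round to 4.4% in area and 6.6% in time, in agreement with the text. The worst case evaluates to roughly 9.96%, which Table 6.4 lists as 9.9, suggesting the table entries may have been truncated rather than rounded.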
This suggests that tradeoff analysis may be relying upon inaccurate data path estimation. However, this error must be taken in context. That is, for the predictors to be usable in tradeoff analysis, they must bound the design space changes. The goal of the predictors is not to establish a precise design, but to ascertain the effects of performing certain tradeoffs. Consequently, absolute errors are less relevant than trends observed when a tradeoff is applied.

For a typical analysis, the predictors would be used to bound an area in the design space; the synthesis tools would be targeted to produce designs only in that region. Figure 6.6 illustrates this concept using the AR filter. The predictors give global estimates, with synthesis tools producing the actual localized designs. Note how the constraint given initially is met by synthesizing only two designs rather than the entire design space. It should be noted that the 60/40 rule used by the designer was close to the actual data path to total ratio; had a different ratio been taken, additional designs might have been synthesized.

[Figure 6.6: AR Filter Design Region. X: Time (nS); Y: Area (square mil). Series: Actual; Estimated; the area constraint is marked.]

The results of Table 6.4 and Figure 6.4 suggest another simplification, as wiring area is predicted to have little impact on the area error. Evaluation of tradeoffs is reduced in complexity by avoiding this step. Only designs that are close in area might consider wiring effects as a final differentiator.

Table 6.5: Non-Pipelined Design Curve Computation Time (sec.)

                      Synthesis                      Prediction
Function              Tool       Designs   Time      Tool      Designs   Time
Module Selection      SLIMOS     -         0.00      SLIMOS    -         0.00
Scheduling/Alloc.     MAHA       8         7.64      PSADNP    16        0.02
Routing/Storage       MABAL      8         972.16    REG/MUX   8         1.18
Controller            Berkeley   8         10.10     PASTA     8         0.12
Subtotal                         8         989.90              8         1.32
  Average                                  202.34                        0.17
Interconnect          PLEST      8         628.80    PLEST     8         622.64
Total                            8         1618.7              8         623.96
  Average                                  202.34                        78.00

The computation time expended for tradeoff evaluation is shown in Table 6.5. Clearly, prediction is substantially faster than synthesis. Even if MABAL is not iterated to produce the minimum register and multiplexer assignment, synthesis is 36 times slower than prediction for this small dataflow graph. Worst case analysis gives a runtime of O(n^2) for prediction; synthesis takes O(n^4). Thus, even for graphs of moderate size, prediction is essential if one wishes to explore a large design space using control path/data path tradeoffs.

6.3 Tradeoff Examples

With a powerful mechanism for evaluating many parts of a design, accurately evaluating tradeoff effects is possible. In this section, the types of tradeoffs described earlier will be demonstrated using algorithms ranging from a small filter to a large coprocessor, covering both pipelined and non-pipelined architectures.

During the course of evaluation, two difficulties could not be overcome: pipeline synthesis and routing/storage synthesis of large graphs. Sehwa, the pipeline synthesis utility, does not operate efficiently on large graphs. In fact, Sehwa often does not complete designs of dataflow graphs with over 100 nodes and 200 edges. The first design point takes over four hours to reach, with successive points taking another three to six hours each on a Sun 3/280. Even upon reaching a design, only a subset of the operation scheduling is displayed.
Fortunately, the pipelined data path operator prediction results are indistinguishable from actual pipelined synthesis, allowing one to evaluate large designs using estimates with some assurance of accuracy.

A second difficulty was using MABAL to attach registers and multiplexers to large designs. Unlike MAHA and Sehwa, MABAL utilizes extensive information about the dataflow graph such as commutativity, explicit number of input and output lines, and sensitivity to bitwidth. At the time this work was done, there was no tool available for converting between the data path synthesis tool and MABAL file formats; manual reconstruction of the dataflow graph is therefore necessary. Clearly, manual transformation of large graphs is not practical. Rather, since each utility will ultimately use the EVE database for storage and retrieval of design information, a solution will be available in the future.

The solution to the latter problem was to utilize MAHA and REAL in place of MABAL for large examples. REAL is a program for register assignment whose algorithm was incorporated into MABAL [KP87]; thus, the results are comparable. Further, MAHA can be instructed to allocate multiplexers for operators. Register multiplexers can be determined by counting the true storage instances and actual operator allocation defined by MAHA.

For all the designs, a module library was chosen based upon the RCA 3 µm CMOS standard cells. Coefficients for the PLA model were modified to reflect this type of technology; values in square microns are

C1 = 663.95
C2 = 176.42
C3 = 5505.28
C4 = 27233.46

Table 6.6 contains the area and delay for the major data path modules used throughout the remainder of this chapter (see footnote 6.2).

6.3.1 Bitwidth Tradeoffs

One major category of tradeoffs is that related to bitwidth.
Serializing a specific operation into smaller components allows one to trade increased circuit time for less area. In this section, the effects of serialization upon pipelined and non-pipelined designs are explored. As will be observed, the impact is different for each design style due to the method by which pipeline quality is measured.

6.3.1.1 Non-Pipelined Bit Serialization: Floating Point Processor

One design typically constructed in a non-pipelined manner is a floating point coprocessor. Since the coprocessor spends much of its time idle and often does not have instruction prefetch cognizance, the benefit of pipelining is limited. The complexity of building floating point hardware also favors a non-pipelined approach, as chip size is usually the limiting factor. (The description for this example is included in Appendix H.)

There is a tradeoff between the size of the chip and its performance. In particular, floating point hardware boasts large multiply and divide circuitry which could be bit-serialized to decrease the chip area at the expense of increased circuit delay. Non-pipelined tools will construct the design space from parallel to serial implementations given a fixed module set. Additional portions of the design space can be explored by adjusting the bitwidth of the multiplier and divider.

The example described here is a floating point coprocessor (courtesy of Michael McFarland from Bell Labs) which contains both a 64-bit multiplier and divider as well as numerous smaller bitwidth operations: add, subtract, negate and compare. Since the construction and serialization of a hardware divider is impractical, this was initially implemented using shift-compare-add/subtract hardware, which allows it to be bit-serialized.
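To make the shift-compare-subtract idea concrete, the following Python sketch shows the textbook restoring division algorithm for unsigned integers. It is offered only as an illustration of why such a divider lends itself to bit-serialization (one shift, one compare, and at most one subtract per result bit); the text does not specify the exact divider used in the coprocessor, and the function name and width parameter are assumptions introduced here.

```python
from typing import Tuple


def shift_subtract_divide(dividend: int, divisor: int,
                          width: int = 64) -> Tuple[int, int]:
    """Unsigned restoring division, producing one quotient bit per step.

    Assumes 0 <= dividend < 2**width and divisor > 0.
    """
    if divisor <= 0:
        raise ZeroDivisionError("divisor must be positive")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in the given width")

    quotient = 0
    remainder = 0
    # Process the dividend from its most significant bit downward.
    for bit in reversed(range(width)):
        # Shift: bring the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # Compare: subtract only when the divisor fits.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

In this sketch the number of loop iterations equals the operand width, so narrowing the datapath and iterating more often trades time for hardware in the manner the section describes.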
Footnote 6.2: The addition delays in this module library are somewhat pessimistic due to a historical error in reading a graph, and should not be taken as representative of actual modules.

Table 6.6: Module Library

Operation       Bit     Module   Area      Delay
Type            Width   Name     (mil^2)   (nS)
Addition        4       add4f    960       120
Subtraction             add4m    450       220
Compare                 add4s    300       540
                16      addf     4200      340
                        addm     2880      530
                        adds     1200      1510
                32      addf     8850      578
                        addm     7540      800
                        adds     2880      2530
                64      addf     18520     985
                        addm     12000     1220
                        adds     6900      4250
Multiply        8       mul8f    13000     360
                        mul8m    5000      980
                        mul8s    3600      2500
                16      mulf     49000     375
                        mulm     9800      2950
                        muls     7100      7370
                32      mulf     184700    390
                        mulm     19208     8880
                        muls     14000     21730
                64      mulf     696000    410
                        mulm     38000     26730
                        muls     27620     64050

Table 6.7: Multiplier/Divider Floating Point Bit-Serialization

Dataflow   Area                                          Total
Graph      Operator   Register   Multiplexer   Control   Time     Area
Original   102828     1065       0             0         0        103893
           85453      10650      4520          4800      53470    105423
           83801      10650      9040          5698      80214    109189
           76401      10650      11865         5826      106972   104742
32-bit     75033      10000      6894          4800      43470    96727
           72181      10000      11660         5698      65229    99539
           50781      10000      15900         5826      86988    82507
           49581      10000      17490         6470      108755   83541
16-bit     107853     9600       10998         4953      14750    133404
           89301      9600       16074         5879      22149    120854
           66001      9600       20727         6011      29531    102339
           56201      9600       15651         7648      51737    89100
           48601      9600       16497         8772      81302    83470
           41501      9600       17343         9474      118256   77918
           41001      9600       18189         9474      155211   78264

The design space was first searched using the conventional parallel-to-serial partitioning with fixed 64-bit modules; design points are listed in the first portion of Table 6.7 as well as shown in Figure 6.7. Then, the multiplier and divider were bit-serialized into 32- and 16-bit module blocks, producing a cheaper implementation. (The multiplier and divider modules always have the same bitwidth.)
Moreover, the coprocessor is now generally faster for these serial designs. This unexpected result is due to the large delay associated with the 64-bit multiplier and divider, which dominate the clock cycle time for non-pipelined designs. (The synthesis and prediction tools used for this research assume that any operation must complete in a single clock cycle.)

For the fastest designs, the 16-bit modules yield the best results. However, this effect is due to the type of module set chosen: shift-and-add/subtract rather than carry-look-ahead or array blocks. When bit-serialized, its area and time are linearly related to its bitwidth. Thus, there is little penalty for lowering the bitwidth. This might not be the case if a tight time constraint were imposed, which would result in a different module set being chosen.

[Figure 6.7: Floating Point Coprocessor: Serialize multiply/divide. X: Time (nS); Y: Area (mil^2). Series: Original (64-bit); 32-bit; 16-bit.]

Table 6.8: Add/Sub/Cmp/Neg Floating Point Bit-Serialization

Dataflow   Area                                          Total
Graph      Operator   Register   Multiplexer   Control   Time     Area
Original   85453      10650      4520          7000      53470    107623
           83801      10650      9040          7500      80214    110991
           76401      10650      11865         8243      106972   107159
32-bit     57813      10000      6800          4800      43470    79413
           52081      10000      11200         5698      65229    78979
           34701      10000      13600         5826      86988    64127
           33501      10000      14800         6470      108770   64771
           30621      10000      16000         7698      174008   64319
16-bit     85353      6563       13923         4953      14750    110792
           64401      6563       18207         5879      22149    95050
           44401      6563       21090         6011      29531    78065
           33901      6563       23919         6945      44370    71328
           25101      6563       21777         8772      59160    62213
           16301      6563       21777         9474      118320   54115
           15801      6563       20349         9755      155295   52468

Serialization of the less complex operations in the floating point coprocessor was also evaluated: add, subtract, compare, and negate.
Results for these implementations, where the five modules always have the same bitwidth, are shown in Table 6.8 and Figure 6.8. Total area here is less than observed for complex operator serialization, but circuit time has increased. There are three times as many of these less complex operations; hence, serialization results in a greater reduction in cost as compared to the original design curve, despite their smaller individual size.

Due to the dominant delay of the complex modules, the cost of a fast design can be reduced by serializing the remaining components. As noted in Figure 6.8, both time and cost are reduced even in the high speed region. However, the module set used may not be the best one, as this fastest design region was not targeted by SLIMOS, the module selection program. One could adjust the constraints and restart the evaluation, which may result in selection of a different module set.

[Figure 6.8: Floating Point Coprocessor: Serialize add/sub/cmp/neg. X: Time (nS); Y: Area (mil^2). Series: Original (64-bit); 32-bit; 16-bit.]

Table 6.9: Combined Floating Point Bit-Serialization

Dataflow   Area                                          Total
Graph      Operator   Register   Multiplexer   Control   Time     Area
32-bit     56853      9000       5324          4800      43470    75977
           47601      9000       10648         5698      65229    72947
           29901      9000       14036         5826      86988    58763
           25501      9000       16940         6743      130506   58184
           22301      9000       17424         8243      261012   56968
16-bit     88953      8000       16400         4953      14750    118306
           64401      8000       26000         5879      22149    104280
           46201      8000       14800         6011      29531    75012
           33901      8000       16800         6945      44346    65646
           24301      8000       10800         8772      81302    51873
           15701      8000       11600         9474      118256   44775
           13701      8000       11600         9755      177384   43056

Finally, the two different transformations can be combined to produce a third range of designs, as shown in Table 6.9 and Figure 6.9. Here, the difference is significant.
Not only are the designs much lower in cost, but the speed is nearly unchanged. Due to the small size of the controller as compared to the data path hardware, full serialization is cost effective. A useful design region has been found as a result of bitwidth transformation.

[Figure 6.9: Floating Point Coprocessor: Serialize all modules. X: Time (nS); Y: Area (mil^2). Series: Original (64-bit); 32-bit; 16-bit.]

6.3.1.2 Pipelined Bit Serialization: Elliptical Filter

There are generally two types of pipelined designs: one which only considers throughput and the other which also considers start-to-end delay. Most signal processing filters fall into the first category, whereas some general purpose configurable pipelines fall into the second. For this filter example, only the throughput is of importance.

In a pipelined design where throughput (or area) is the prime consideration, serializing a given slowest (or largest) operator potentially yields a better design. An elliptical filter consisting of 16-bit multiplies and adds is a prime candidate for improvement given the goal of maximizing throughput with an area constraint of 100000. SLIMOS chose the fastest possible module set for the original design, consisting of 26 16-bit adds and 8 16-bit multiplies. (All the module sets chosen for this example are listed in Table 6.10 and the module set library is given in Table 6.6.)

As discussed earlier, design evaluation of large pipelines is hampered by the limitation of the pipeline synthesis program. Sehwa was only able to synthesize the original dataflow graph; synthesis and prediction results are given in Tables 6.11 and 6.12. The average error is 1.6% in area and 0.0% in time; the worst case area error is 3.6%. This small error establishes a high degree of confidence in the estimators. Since no altered dataflow graphs could be synthesized, only prediction tools will be used for further evaluation of pipelined designs.

Table 6.10: Elliptical Filter Module Sets

               Modules chosen/quantity
Design         Multiply               Add
Curve          8-bit      16-bit      4-bit       16-bit
Original       -          mulf/8      -           addf/26
8-bit mult.    mul8f/32   -           add4m/40    addf/26
4-bit add.     -          mulm/8      add4s/104   -
Both           mul8f/32   -           add4m/144   -

Table 6.11: Original Elliptical Filter: Predicted

Init.    Clock    Op. Area                   Ctrl Area   Total w/wiring
Intrvl   Cycles   (mil^2)    Regs   Muxes    (mil^2)     Time     Area
1        14       501200     141    0        -           380      828369
2        14       250600     71     35       412         768      439905
3        15       184800     51     40       674         1164     334878
4        16       127400     41     54       674         1552     246335
5        17       123200     41     54       992         1940     233966
6        18       119000     31     52       1100        2352     227039
7        18       114800     31     50       1246        2744     219318
8        19       65800      31     52       1368        3136     161397
9        20       61600      31     50       1801        3528     145263
13       23       57400      21     48       2508        5148     127869
26       34       53200      21     40       4953        10400    118802

Table 6.12: Original Elliptical Filter: Actual

Init.    Clock    Op. Area                   Ctrl Area   Total w/wiring
Intrvl   Cycles   (mil^2)    Regs   Muxes    (mil^2)     Time     Area
1        14       501200     138    0        -           380      827139
2        14       250600     70     35       412         768      438253
3        15       184800     50     47       674         1164     336106
4        17       127400     47     40       674         1552     249047
5        14       123200     31     42       992         1940     228678
6        17       119000     31     46       1100        2352     225599
7        19       114800     29     44       1246        2744     215056
8        21       65800      28     43       1368        3136     157593
9        24       61600      30     40       1801        3528     141883
13       19       57400      19     37       2508        5148     123989
26       32       53200      16     30       4953        10400    112846

The multiplier is the largest and slowest component in the dataflow graph. Each 16-bit multiplier was translated into a subgraph consisting of 8-bit multiplies and 4-bit adds.
The module set chosen by SLIMOS included the fast 8-bit multiply, 16-bit adder, and medium speed 4-bit adder. Since circuit time is still dominated by the 16-bit add and 8-bit multiply, selecting the fastest 4-bit adder could only serve to increase the area. The results for the design curve are plotted in Figures 6.10 and 6.11.

Addition is the most common operation in the graph; serialization of 16-bit adders into 4-bit adders was performed and included in Figure 6.11. Combining the multiplier and addition bitwidth reductions results in a third design curve which is also shown.

An analysis of the results reveals several things. In Figure 6.10, when only data path hardware is examined, it appears that serializing the adder is of dubious quality whereas serializing the multiplier is better than the original design. Serializing both is the best; however, this design space does not consider registers, multiplexers, control area, or wiring.

By including routing, storage, and control effects in the area/time computation, the original unaltered dataflow graph remains the best overall as shown in Figure 6.11.
Only for the fastest design and cheapest design is a serialized multiplier beneficial. Although a serialized multiplier results in two points which extend the design curve, the additional routing, storage, and control erase any benefit for most of the design space. Also, the poor quality of serialized adder designs is clearly evident; excessive delay in the adder as well as the additional registers and multiplexers needed make these designs unsatisfactory.

[Figure 6.10: Predicted Elliptical Filter: Operators Only. Plot of Area (square mil) versus Time (nS) for the Serialize add, Serialize multiply, Serialize both, and Original designs.]

[Figure 6.11: Predicted Elliptical Filter: Total Design w/o wiring. Plot of Area (square mil) versus Time (nS) for the Serialize add, Serialize multiply, Serialize both, and Original designs.]

6.3.2 Composition Tradeoffs

Composition tradeoffs consist of substituting complex modules for specific operations in the dataflow graph to increase module sharing and, in turn, lower circuit area. This tradeoff is at the expense of increased control and signal routing area. Two types of composition tradeoffs are explored: upward equating of bitwidths and ALU substitution.

6.3.2.1 Merging Different Bitwidths: Temperature Controller

The temperature controller shown in Figure 6.12 has multiple bitwidth adders and subtractors which could be merged into single bitwidth add and subtract operations.
There are 8-bit and 9-bit additions and subtractions, and 10-bit additions being performed. Rather than use three different modules, a 9-bit module could perform both the 8-bit and 9-bit operations, which is plotted in the figure. A 10-bit adder would also suffice for any of the addition operations. This second transformation was combined with transforming the subtractor to accommodate a compare operation. A third design curve is then realized as shown in Figures 6.13 and 6.14. Due to the size of the dataflow graph, prediction tools were not used.

The tradeoff curve in Figure 6.13 suggests that performing both transformations results in superior designs excepting the fastest. Inclusion of registers, multiplexers, and control shrinks the difference as shown in Figure 6.14. A higher area and delay incurred by using larger bitwidth operators and lack of hardware sharing skew the high speed portion of the design space towards the original configuration.

[Figure 6.12: Temperature Controller. Dataflow graph of add, sub, div, cmp, and, and inv operations feeding three outputs.]

[Figure 6.13: Module Bitwidth Tradeoffs: Data Path. Plot of Area (square mil) versus Time (nS) for three curves: Original; Merge 8-bit and 9-bit add/sub; 10-bit add with merged cmp and sub.]
[Figure 6.14: Module Bitwidth Tradeoffs: Total. Plot of Area (square mil) versus Time (nS) for three curves: Original; Merge 8-bit and 9-bit add/sub; 10-bit add with merged cmp and sub.]

Merging different bitwidths has little impact upon the controller for such simple modules and small dataflow graph; thus, one would expect such a tradeoff to be performed. However, there is additional data path hardware needed to handle the extra bits when executing a smaller bitwidth operation. In an addition, a multiplexer is necessary to enter a zero in the next higher bit position as well as select the appropriate carry-out bit. Complex operations may have to multiplex all additional bits and have cumbersome selection logic for the outputs.

6.3.2.2 ALU Substitution: Floating Point Coprocessor

Given an area constraint on the floating point coprocessor, an alternative to bit serialization is the use of ALUs. In the original design, there is a collection of closely related 64-bit and 8-bit operations: addition, subtraction, negation, and compare. Combining each bitwidth category into a single ALU increases the area and delay of any one operation, but should result in a cheaper data path for similar serialized scheduling. For each ALU, the controller is burdened with two additional control lines.

To determine the effects of ALU substitution, 64-bit and 8-bit ALUs were separately used in place of their four associated operations, then taken together. Results produced by the estimation tools are presented in Table 6.13 and Figures 6.15 and 6.16. (Note that all extremely slow designs which are impractical were discarded. For example, if the delay increased by two-fold with a 10% decrease in area, this was deemed impractical.)
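The pruning described in the note above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the threshold parameters, and the sample points are hypothetical, and the two-fold/10% figures are taken from the single example in the text rather than from a documented rule of the tools.

```python
from typing import List, Tuple

Design = Tuple[float, float]  # (time in nS, area in mil^2)


def non_inferior(designs: List[Design]) -> List[Design]:
    """Keep only designs that no other design matches or beats in both time and area."""
    kept = []
    for d in designs:
        dominated = any(
            o != d and o[0] <= d[0] and o[1] <= d[1] for o in designs
        )
        if not dominated:
            kept.append(d)
    return sorted(kept)


def drop_impractical(curve: List[Design], slowdown: float = 2.0,
                     min_saving: float = 0.10) -> List[Design]:
    """Walk a time-sorted curve and drop slow points that save too little area.

    A point is dropped when it is at least `slowdown` times slower than the
    last kept point yet reduces area by no more than `min_saving` (a fraction).
    Both thresholds are assumptions generalized from the example in the text.
    """
    if not curve:
        return []
    kept = [curve[0]]
    for time, area in curve[1:]:
        last_time, last_area = kept[-1]
        saving = (last_area - area) / last_area
        if time >= slowdown * last_time and saving <= min_saving:
            continue  # much slower for a marginal area gain
        kept.append((time, area))
    return kept


# Hypothetical (time, area) points, not taken from the tables.
designs = [(100, 900), (150, 700), (160, 750), (300, 650), (700, 640)]
curve = non_inferior(designs)
practical = drop_impractical(curve)
```

Comparing each point against the last kept point, rather than its immediate neighbour, keeps a long tail of slightly cheaper but much slower designs from surviving one small step at a time.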
As one would expect, the data path cost drops off more rapidly for the ALU versus the original design curve as revealed in Figure 6.15. This effect is still observed in Figure 6.16 since the control area is small compared to the data path, despite its 25% increase over the original controller. However, it is immediately apparent that 8-bit ALU substitution is not useful in this example as it is inferior to all other designs.

Table 6.13: Floating Point Coprocessor with ALU

Dataflow       ----------------- Area -----------------   ------ Total ------
Graph          Operator  Register  Multiplexer  Control       Time       Area
ALU (64-bit)      77781      9860         2700     5259      63684      95600
                  60721      9860         5850     6061      80229      82492
                  51941      9860         8100     6945     160506      76846
ALU (8-bit)       85353      9860         4750     5412      62014     105375
                  79653      9860        10925     6197     123988     106635
ALU (both)        77073      9860         3600     5871      63684      96404
                  62736      9860         6750     6606      80229      85952
                  58161      9860         8100     6569     132188      82690

[Figure 6.15: Floating Point Coprocessor: Operators. Plot of Area (mil^2) versus Time (nS) for the Original, 64-bit ALU, 8-bit ALU, and Both ALUs designs.]

[Figure 6.16: Floating Point Coprocessor: Total. Plot of Area (mil^2) versus Time (nS) for the No ALUs, 64-bit ALU, 8-bit ALU, and Both ALUs designs.]

Closer inspection reveals that it is the mix of operations which defines the usefulness of ALU substitution. Eleven of the fourteen 64-bit operations are compare; the remaining operations listed earlier are each used once. Given the large number of 64-bit compare operations, there are likely to be timesteps where one or more are idle. By using ALUs, the total module count is reduced since the hardware associated with the three separate operations can be reused.
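The dependence on operation mix can be illustrated with a small module-count sketch. It assumes a fixed schedule (a list of timesteps, each listing the operations executed in that step), dedicated modules that each perform one operation type, and ALUs that can perform any of them; the schedules and function names are hypothetical and are not taken from the coprocessor dataflow graph.

```python
from collections import Counter
from typing import List


def dedicated_modules(schedule: List[List[str]]) -> int:
    """Modules needed when each operation type has its own module.

    Each type needs as many modules as its peak use in any single timestep.
    """
    peak = Counter()
    for step in schedule:
        for op, count in Counter(step).items():
            peak[op] = max(peak[op], count)
    return sum(peak.values())


def alu_modules(schedule: List[List[str]]) -> int:
    """Modules needed when every operation runs on a general ALU.

    Only the busiest timestep matters, because any ALU can take any operation.
    """
    return max(len(step) for step in schedule)


# Hypothetical schedule dominated by compares: idle compare hardware could
# have served the occasional add/sub/neg if it were an ALU.
skewed = [["cmp", "cmp", "cmp"], ["add", "cmp"], ["sub", "cmp", "cmp"], ["neg"]]

# Hypothetical balanced schedule: every module is busy in every timestep.
balanced = [["add", "sub", "cmp"], ["add", "sub", "cmp"]]
```

In the skewed schedule the dedicated design needs six modules against three ALUs, whereas in the balanced schedule both need three, so only the extra cost and delay of each ALU remains. Whether the count reduction outweighs the larger ALU area still depends on the module library.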
In contrast, the sixteen 8-bit operations are evenly spread between compare, addition, and subtraction; ALU substitution does not result in any improvement. Only the increased cost and delay of the ALU is observed.

To confirm this effect, the dataflow graph was altered such that 64-bit operations were evenly spread and subtract was set as the dominant 8-bit operation; results are displayed in Figure 6.17. The even spread of the 64-bit operations lessens the improvement of introducing ALUs. However, due to the large area of the 64-bit operations, even a balanced set of operations is benefited by ALU substitution which results in operator sharing. Conversely, concentrating the 8-bit operations into a single operation resulted in improvement when introducing ALUs as compared to the unmodified graph. A tradeoff therefore exists for ALU substitution between the original operation costs and operation mix versus the ALU cost.

6.3.3 Multi-processor Tradeoffs

Multi-processor tradeoffs encompass partitioning a large design into smaller individually controlled segments. Although identical operations in different segments may no longer be shared, it is hoped that the number of states associated with controlling the smaller subdesigns would result in a reduced area.

One circuit which could benefit from distributed control transformation is the Intel 8251 Asynchronous Communications circuit. The design is divided into three modules: main (which services the microprocessor connection), receiver, and transmitter (which handle the serial I/O). Under the original design, these modules run independently of one another while sharing some signals and a common data bus. The area and time of each module and the total design (for non-inferior solutions) are detailed in Table 6.14.
Although the hardware required for each module is nearly the same, the receiver with its synchronization and error checking requires substantially more control. The objective is to reduce the overall design area while allowing the time to increase.

[Figure 6.17: Floating Point Coprocessor: Modified (Datapath). Plot of Area (mil^2) versus Time (nS) for the No ALUs, 64-bit ALU, 8-bit ALU, and Both ALUs designs.]

Table 6.14: i8251: Individual designs

Dataflow   ----------------- Area -----------------   ------ Total ------
Graph      Operator  Register  Multiplexer  Control       Time       Area
main           1390       640          288      384        718       2702
               1300       640          504      551       1452       2995
rcvr           2390       640          216     7796        718      11042
               2255       640          504     7957       1452      11356
xmtr           1301       896          360     1394        226       3951
               1276       896          576     3333        452       6081
Total          5081      2176          864     9574        718      17695
               4831      2176         1631    11841       1452      20432

Although an intuitive design improvement suggests combining the smaller transmitter and main sections while leaving the larger receiver intact, every combination of merging the sections in parallel was attempted. For the parallel designs, the section pairs or trios would share the same controller and hardware executed in a parallel fashion. The control area is expected to rise, but the data path cost should drop significantly. Results for the parallel designs are summarized in Table 6.15.

Table 6.15: i8251: Parallel Design Combinations

Dataflow         ----------------- Area -----------------   ------ Total ------
Graph            Operator  Register  Multiplexer  Control       Time       Area
main-rcvr            3430       896          720    16195        718      21241
                     2345       896          936    16889       1077      21066
                     2320       896         1224    17062       2178      21502
main-xmtr            2646      1408          792     4720        718       9566
                     2531      1408         1152     7693       1077      12784
                     2416      1408         1368     7801       2178      12993
rcvr-xmtr            3526      1536          648     4720        718      10430
                     2441      1536         1368     7694       1077      13039
                     2416      1536         1584     7800       2178      13336
main-rcvr-xmtr       3776      1920         1224    41733        726      48653
                     2596      1920         2088    41733       1101      48337
                     2441      1920         2520    43755       2202      50636
Total area and time for completed i8251s are contained in Table 6.16 and Figure 6.18. For the parallel designs, a large increase in controller area is expected when the receiver is one of the merged sections. The best design curve consists of one controller for the main module and the other for a combined transmitter/receiver. The anomaly for the rcvr-xmtr pair is due to the nature of serial I/O control. Whereas the receiver has its control concentrated in the later clock cycles, the transmitter exhibits the opposite characteristics. Hence, their control meshes well and this particular architecture yields a substantial improvement in overall area with little impact on performance.

[Figure 6.18: i8251: Parallel designs. Plot of Area (mil^2) versus Time (nS) for the Original; main-rcvr,xmtr; main-xmtr,rcvr; main,rcvr-xmtr; and main-rcvr-xmtr configurations.]

Table 6.16: i8251: Parallel Design Comparisons

Design                    ------------------- Area -------------------
Configuration     Time    Data Path  Route/Storage  Controller    Total
main/rcvr/xmtr     718         5081           3040        9574    17695
                  1452         4831           3807       11841    20432
main-rcvr/xmtr     718         4731           2872       17592    25192
                  1077         3646           3088       18283    25017
                  2178         3595           3592       20395    27583
main-xmtr/rcvr     718         5036           3056       12516    20608
                  1077         4921           3416       15489    23826
                  2178         4671           3920       15758    24349
main/rcvr-xmtr     718         4916           3112        5104    13132
                  1077         3741           4318        8245    16304
                  2178         3716           4264        8351    16331
main-rcvr-xmtr     726         3776           3144       41733    48653
                  1101         2596           4008       41733    48337
                  2202         2441           4440       43755    50636

Since control area dominates the i8251 chip area, combining the individual main, transmitter, and receiver sections in series is another approach for lowering circuit area. A reduction in both data path and control path area should be observed at the expense of time.
Tables 6.17 and 6.18 and Figure 6.19 contain the results of this tradeoff. In the serial tradeoff, the original conjecture that the receiver is best left by itself holds. Its dominant control area drives this tradeoff to favor combining the smaller main and transmitter controllers together while leaving the receiver intact. Area reduction is observed, but is not as substantial as in the parallel tradeoff.

It is clear that control area alone cannot be compared when determining the distributed control architecture. Interaction between controllers might result in either a slight increase or large area degradation dependent upon the actual state machine sequencing. Effective distributed control requires a detailed view of the entire control sequencing, making this tradeoff difficult to automate or predict.

6.3.4 Tradeoff Trends

During experimentation, ten dataflow graphs were transformed and estimates computed and/or synthesis performed to yield well over 500 designs.
Only a small subset was presented in this chapter. However, with such a large population, a general trend for some of the transforms was observed.

Table 6.17: i8251: Serial Design Combinations

Dataflow         ----------------- Area -----------------   ------ Total ------
Graph            Operator  Register  Multiplexer  Control       Time       Area
main-rcvr            3240       640          288     8601        778      12769
                     2640       640          504     9150       1089      12934
                     2510       640          504     9316       2178      12970
main-xmtr            2932       896          360     1958        758       6156
                     2721       896          507     3469       1452       7593
                     2606       896          576     3573       2569       7651
rcvr-xmtr            3027       896          360    12910       1110      17193
                     2656       896          504    15015       1452      19071
                     2606       896          576    15325       2214      19403
main-rcvr-xmtr       3612       896          360    15260       1200      20128
                     2746       896          504    15450       1815      19596
                     2696       896          576    15260       2611      19428
                     2606       896          576    15830       3321      19908

Table 6.18: i8251: Serial Design Comparisons

Design                    ------------------- Area -------------------
Configuration     Time    Data Path  Route/Storage  Controller    Total
main/rcvr/xmtr     718         5081           3040        9574    17695
                  1452         4831           3807       11841    20432
main-rcvr/xmtr     778         4541           2184        9995    16720
                  1089         3916           2616       12483    19015
                  2178         3786           2616       12649    19051
main-xmtr/rcvr     758         5322           2122        9754    17198
                  1452         4976           2547       11426    18949
                  2569         4861           2616       11530    19007
main/rcvr-xmtr    1110         4417           2184       13294    19895
                  1452         4046           2328       15399    21773
                  2214         3906           2616       15876    22398
main-rcvr-xmtr    1200         3612           1256       15260    20128
                  1815         2746           1400       15450    19596
                  2611         2696           1472       15260    19428

[Figure 6.19: i8251: Serial designs. Plot of Area (mil^2) versus Time (nS) for the Original; main-rcvr,xmtr; main-xmtr,rcvr; main,rcvr-xmtr; and main-rcvr-xmtr configurations.]

First, it is clear that a distinction must be made between pipelined and non-pipelined designs. Since the method by which design quality is measured differs between these two architectures, tradeoffs which yielded improvements in one architecture were detrimental in the other.
Bitwidth tradeoffs underscore this polarization. Although bitwidth trades realized improvements with most non-pipelined designs, especially with slower design goals, many pipelined designs analyzed did not realize any improvements as the pipelined elliptical filter example revealed. In the experimental results obtained, pipelined designs tended to favor larger original bitwidth hardware as compared to non-pipelined designs.

Another trend observed is that bitwidth composition into single modules is useful unless modules are complex or differ by a significant number of bits. As in bitwidth tradeoffs, experimental results revealed that pipelined designs showed much less improvement with bitwidth composition for the faster implementations, but did not degrade the design curve.

Complex operator substitution quality, such as use of an ALU, was almost solely determined by the type mix and quantity of operators for which the ALU would be used. Neither the relative size nor speed of the ALU module appeared to significantly influence this trend. However, an insufficient quantity of operations or a balanced type mix would negate ALU tradeoff benefits. With a large imbalance in operator type mix, even a small quantity of operations would benefit from ALU substitution.

Finally, multiprocessor tradeoffs revealed two effects. First, the relative size of the subcircuits (such as the i8251 main, transmitter and receiver subcircuits) determines the effect of combining these subcircuits in series (where each subcircuit operates in sequence) or parallel (where each subcircuit operates simultaneously). Smaller overall area was observed when subcircuits were combined in series so that they could share data path hardware at the expense of additional control size.
Combining these same subcircuits in parallel favored designs which were faster and exhibited smaller control area, but had larger overall area. Hence, whereas the data path dominated serial multiprocessor effects, the control path determined parallel multiprocessor effectiveness.

In applying tradeoffs, it is clear that no aspect of design - data path, control path, routing, or storage - can be ignored during evaluation. Although wiring area was not analyzed, designs that are similar in area would need it to distinguish the best design.

Of equal importance is the application of constraints. Area and time constraints not only delineate design quality, but determine the initial model set for both the estimation and synthesis tools as well as which tradeoffs are of use. Large changes in constraint values could result in different design curves and additional regions in the design space to be explored.

6.4 Summary

In this chapter, a methodology for control path/data path tradeoff analysis was explored. A description of a system which can be used to perform tradeoffs using either estimation or synthesis tools was detailed. Finally, tradeoffs were applied to illustrate bit serialization, bitwidth analysis, operator composition, and distributed control.

It is evident that applying numerous tradeoffs entails enormous computation time if only synthesis tools are employed. Despite some inaccuracies, estimation techniques are extremely useful for establishing trends and locating regions which are capable of meeting the design constraints. The execution time of prediction tools also allows numerous detailed tradeoffs to be performed within a short period. This disparity between synthesis and estimation runtime is acute when larger practical designs are attempted.

Finally, it is clear from the examples that tradeoffs can yield unexpected results.
Some general trends can be noted, but there seem to be no absolutes. However, these exceptions can be recorded and exploited on subsequent analyses. With a usable framework established for actually performing these tradeoffs, it is now possible to write higher-level tools which could automate the concepts to use and build upon the tradeoff matrix. Further efforts to identify other tradeoffs which designers employ would also be of use.

Chapter 7

Conclusions and Future Research

7.1 Contributions

In this thesis a system was described which allows a user to quickly locate superior designs in a large discrete digital design space. Through the use of system-level tradeoffs, regions which might be ignored by present day design tools can be constructed and analyzed. Through the use of estimation techniques, evaluating the design space after applying a tradeoff allows a designer to ascertain the resultant circuit quality quickly. This approach is of particular importance in today's design community where ever larger chips are being constructed while development time decreases. The concept will also be used by more intelligent systems which transform and locate superior designs automatically.

In order to perform system-level tradeoffs, the mechanism for predicting the circuit parameters must be in place. (In the digital design community, area and time are the most widely used criteria for determining design quality.) Also, only a minimal amount of information should be required from the designer. One begins with a behavioral description in the form of a dataflow graph, module library, and a set of user-defined global constraints and finishes with a candidate set of one or more designs which meet the requirements.
Central to this methodology are a complete suite of synthesis tools and a set of accurate prediction tools; these latter utilities rapidly provide a set of estimated designs to establish the design region of interest. Then, slower synthesis tools can be targeted to locate the precise designs within this area. In order for this concept to be viable, both the synthesis and prediction methods must have comparable analysis capability. Prior to this thesis, there were a number of missing tools; thus, a large portion of the research was involved in closing this gap.

In Chapter 2, extensions to a non-pipelined data path synthesis program MAHA were presented and the tool validated. Useful additions to MAHA were resource sensitive scheduling, full serialization of the data path, register assignment, and multiplexer assignment. The latter two features are useful if one does not wish to use other utilities for register and multiplexer assignment.

In Chapter 3, an estimation model for PLA control area was derived. Without a control model, one is unable to perform control path/data path tradeoffs using prediction tools; this is the first known research to address this topic. Primary features of the PLA area estimation model are

• applicability to both pipelined and non-pipelined designs,
• alternate PLA control structures depending upon the quantity of data path registers and number of control states,
• handling of conditional branches and loops, and
• applicability to both unfolded and folded PLAs.

An area estimation model for registers and multiplexers comprised Chapter 4. Register estimation employed an empirical approach by bounding the limits and observing the characteristics of real designs. Multiplexer area estimation used a statistical approach which assesses the degree of duplicity of each operator and register input.
Parameters available to both of these models are restricted to the information that would be available during data path estimation; the models do not rely upon any synthesis results. As in the control area estimator, both pipelined and non-pipelined design styles are accommodated.

With a suite of powerful tools to perform design synthesis and prediction, application of tradeoffs was discussed in Chapter 5. The types of tradeoffs were identified.

Finally, in Chapter 6, a methodology for determining the effects of system-level transformations was presented. A number of examples were analyzed with different control path/data path tradeoff types applied to real designs. Using these tools, assessing the impact of performing these tradeoffs was demonstrated, including

• operator decomposition into simpler or smaller parts,
• operator composition into complex modules such as ALUs,
• varying the bitwidth of operators versus improved sharing, and
• multi-processor approaches to achieve better throughput.

7.2 Automating System-Level Tradeoffs

Clearly, the effort expended on this research can be used as a basic foundation for future research on system tradeoffs. The primary bottleneck of not having an evaluation methodology and a suite of tools - both estimation and synthesis engines - has been resolved. Additional topics related to system-level tradeoffs would address the non-automated parts of the system.

7.2.1 Detection of Tradeoff Locations

A major problem in system-level tradeoffs is detecting where such tradeoffs are applicable. For a tradeoff involving a single data path operation, this is not difficult. However, for a tradeoff which manipulates a group of operations, matching every subgroup of operations in a large dataflow graph against the target template becomes an intractable problem.
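The matching task just described can be sketched as a brute-force search in Python. This is a minimal illustration, not the method used in this research: the graph encoding (a dict of vertex to operation type plus a set of directed edges), the function name, and the example graph are all assumptions. It tries every ordered selection of dataflow vertices for the template vertices, so the number of candidates grows roughly as the number of graph vertices raised to the template size.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def match_template(graph_ops: Dict[str, str], graph_edges: Set[Edge],
                   tmpl_ops: Dict[str, str], tmpl_edges: Set[Edge]
                   ) -> List[Dict[str, str]]:
    """Return every one-to-one mapping of template vertices onto graph vertices
    that preserves operation types and maps each template edge onto a graph edge.

    Extra graph edges among the chosen vertices are not rejected here.
    """
    t_nodes = list(tmpl_ops)
    matches = []
    for candidate in permutations(list(graph_ops), len(t_nodes)):
        mapping = dict(zip(t_nodes, candidate))
        if any(tmpl_ops[t] != graph_ops[mapping[t]] for t in t_nodes):
            continue  # an operation type does not match
        if all((mapping[a], mapping[b]) in graph_edges for a, b in tmpl_edges):
            matches.append(mapping)
    return matches


# Hypothetical dataflow graph: two multiply-multiply-add clusters feeding an add.
graph_ops = {"m1": "mul", "m2": "mul", "a1": "add",
             "m3": "mul", "m4": "mul", "a2": "add", "a3": "add"}
graph_edges = {("m1", "a1"), ("m2", "a1"), ("m3", "a2"), ("m4", "a2"),
               ("a1", "a3"), ("a2", "a3")}

# Hypothetical template: two multipliers connected to an adder.
tmpl_ops = {"x": "mul", "y": "mul", "z": "add"}
tmpl_edges = {("x", "z"), ("y", "z")}

matches = match_template(graph_ops, graph_edges, tmpl_ops, tmpl_edges)
sites = {frozenset(m.values()) for m in matches}  # distinct graph locations
```

Because the two multipliers of the template are interchangeable, each location is found twice; collapsing the mappings into vertex sets leaves the two distinct sites. Even this seven-vertex graph requires 210 candidate assignments for a three-vertex template, which hints at why the search becomes impractical on large graphs.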
Essentially, the template graph is being compared against all subgraphs of the dataflow graph to find a one-to-one correspondence for both vertices and edges. Henceforth, the problem will be called graph matching although some mathematicians prefer describing a matching template as being embeddable in the dataflow graph.

Graph matching is currently used to some extent in synthesis and compiler design. Researchers in hardware synthesis use matching to determine allocation and sharing of hardware [Bra75]. Compiler developers use graph matching for common subexpression elimination and code generation [AU77]. Graph matching in synthesis is applied over the entire graph, but is confined to single operator matching; matching in compiler design may involve several operations, but is generally restricted to small local subgraphs or expressions.

The difficult graph matching problem to resolve for system-level tradeoffs is a global search of a graph to find a particular subgraph which is isomorphic to some particular template. Given a specific subgraph, $\kappa$, to be matched against a dataflow graph, $\Omega$, the complexity of the problem can be written as

$$C \propto |\Omega|^{|\kappa|} \tag{7.2.1}$$

where $|\kappa| \ll |\Omega|$ and $|S|$ means the cardinality (number of vertices or operations) in set $S$.

Equation 7.2.1 reveals that increasing the template size results in an exponential increase in search time. Therefore, keeping the number as well as the size of the templates small is important; this is at odds with initial analysis showing that, in general, functions having numerous permutations are common. For example, a single multiplier operation is an apparent candidate for expansion into a variety of multiplier/add or shift/add components. At first glance, the problem of both locating a particular subgraph as well as detecting its myriad of permutations appears unsolvable.

Reducing the scope to only subgraph matching lowers the complexity. Since system-level tradeoffs operate in the behavioral space, any permutations such as the multiplier example above will be detected: the behavioral description will remain unchanged, only the structural implementation is different. Thus, the series-of-adders is known to be a constant multiplier from the original specification, which also demonstrates one advantage of working down from a higher-level description.

Another aspect of template matching is detection of common subgraphs. In this case, a subgraph of some arbitrary shape and function is duplicated within the dataflow graph. Detection of these graphs is considerably more difficult than common subexpression detection in compiler design. Subexpression analysis considers a tree and constructs the groups always starting with the leaf nodes; common subgraphs share the same leaf nodes (values) and vertices (operations). For a dataflow graph, there is no distinct starting point for building a subgraph. Furthermore, unlike compilers, subgraphs are considered similar if they perform the same set of operations in the same order. Each may actually operate on a different set of values. (In compilers, both the operation and the value must match.) However, the commonality among values can be used to determine the binding attraction of two common subgraphs.

There are two methods known for accomplishing this task in an automated fashion. Template reduction based upon compiler design is one approach and dataflow graph reduction based upon graph theory and dataflow machine analysis is another candidate.

7.2.2 Ordering System-Level Tradeoffs

With a set of procedures for locating tradeoff templates or detecting multiple occurrences of a given subgraph, it is important to determine when such tradeoffs should be performed.
Clearly, blind application of the most useful localized template may not result in any global improvement in some application. Therefore, a strategy for applying transforms (templates) to the graph which result in a better solution - based upon user constraints - is an important aspect of design transformation.

Given a complete set of allowable transformations and an efficient graph matching algorithm, is there an optimal order AND location to apply the transforms? Ordering is necessary since there is often more than one transform that can be applied; a location schema arbitrates in the (likely) event that the best transform can be applied at more than one graph location.

To illustrate the problem, the AR filter shown earlier is used as an example. If a template comprised of two multipliers connected to an adder exists and can be used as a single complex operation to replace the subgraph, there are a total of 256 transformed graphs. Of course, each implementation can be partitioned into a number of timesteps to produce a variety of designs.

Due to the discrete nature of the design space, it is unrealistic to assume that a single analysis will result in the optimal design. If an unknown number of steps is necessary to achieve an optimal design, it is desirable to be able to halt the search and still note an improvement towards the optimal design.

7.2.3 Automated Application of Tradeoffs

Once methods for determining where and when to apply tradeoffs have been determined, a general system-level analysis package can be produced in an automated fashion. Since this approach is intended to automate an important step in the design process, it should entail minimal user input.
However, to allow for the expertise of individual designers, the design space can be bounded by the user and rules and/or templates eliminated from consideration by the user before analysis begins. Clearly, it would be difficult to capture all possible transformation templates in a single utility. Rather, a useful subset would be included to account for the most significant design changes.

Ordering of the tradeoffs would result in the most substantial improvements being performed early in the tradeoff analysis with successively smaller improvements at each step. A constraint of minimum design improvement should be introduced such that the process has a finite run time.

7.3 Future Research

As in all research, a solution to one problem only results in the uncovering of numerous others. In fact, some problems that are discovered actually change the direction of the thesis substantially. Such was the case in this research.

Originally, the research topic involved determining the types of system-level tradeoffs and showing how high-level partitioning affects the ultimate design. These tradeoffs would be performed in an automated manner to transform a design into the best structure for the given constraints and goals. Although many types of tradeoffs were addressed by this thesis, there may be others yet unexplored.

Due to the effort required in generating a system that could perform system-level tradeoffs, the initial goal of automating tradeoffs was never realized. A major portion of the research focused upon area estimation models for control, routing, and storage hardware. Only after these tools were complete could system-level tradeoffs be performed.

Much of the potential research reflects the importance of accurate estimation models. In particular, both the control and routing models can be extended.
The PLA model should be reformulated into a more general model which accommodates multi-level logic and microcontrollers as well as PLAs. Although a multi-processor tradeoff was shown in the previous chapter, a model which describes the effects of distributed and/or hierarchical controllers would enhance the work. Nearly all large designs in production today, especially microprocessors, use a complex hierarchy of controllers to increase parallel operation or to minimize area.

Another model lacking entirely is area estimation for bus access logic. Although this would seem to be a simple revision of the multiplexer model, there are more issues in buses which affect their size such as the scheduling and value usage. Even more difficult is a method for estimating the steering logic area where both buses and multiplexers are used.

The importance of estimation models cannot be overstated. Although synthesis techniques will continue to improve, there is unlikely to be a major breakthrough in one aspect of real design: computation time. Even with the advances in computer throughput, the complexity of chips being produced has also risen; absolute computation time has virtually remained constant. For large designs, solely using synthesis tools would be too costly in computer run time and user patience to be of any use. However, more robust and precise models allow the user to explore extremely large design spaces with numerous tradeoffs in a reasonable time. With these additional tools, a system capable of producing high quality designs automatically from a general set of user constraints and goals becomes practical.

Reference List

[AHU74] A. Aho, J. Hopcroft, and J. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Massachusetts, 1974.
[AU77] A. Aho and J. Ullman. Principles of Compiler Design. Addison-Wesley, Massachusetts, 1977.
[AUS79] A. Aho, J. Ullman, and V. Sethi. Compiler Design and Techniques. Addison-Wesley, Massachusetts, 1979.
[Bal84] D. Baldwin. Automatic Evaluation of Design Choices in Digital Controller Synthesis. PhD thesis, Yale University, December 1984.
[Bar73] M. Barbacci. ISP: A Notation to Describe a Computer's Instruction Sets. COMPUTER, March 1973.
[BB87] N. Biswas and C. Bhat. A Maximum PLA Folding Algorithm. In Proc. of ICCD, pages 686-689. IEEE, October 1987.
[BD71] H. Barsamian and A. DeCegama. Evaluation of Hardware-Firmware-Software Trade-offs with Mathematical Modeling. In Spring Joint Computer Conference, pages 151-161, 1971.
[Bra75] F. Bradshaw. Directed Graph Models for Hardware/Software Design. In 1975 Intl. Symp. on Computer Hardware Description Languages and Their Applications Proc., pages 7-15. ACM SIGDA, ACM SIGARCH, IEEE Comp. Soc. Tech. Com. on Computer Architecture, September 1975.
[Cam85] R. Camposano. Synthesis Techniques for Digital System Designs. In Proc. 22nd Design Automation Conference, pages 475-481, June 1985.
[Cas85] A. E. Casavant. Algorithms for Logic Design Automation. PhD thesis, University of Illinois at Urbana-Champaign, 1985. Dept. of Computer Science.
[DSVV83] G. DeMicheli, A. Sangiovanni-Vincentelli, and T. Villa. Computer-aided Synthesis of PLA Based Finite State Machines. In Proc. Intl. Conf. on Computer Aided Design, pages 154-156, November 1983.
[Eve79] S. Even. Graph Algorithms. Computer software engineering series. Computer Science Press, Rockville, MD, 1979.
[FSC84] G. Frank, C. Smith, and J. Cuadrado. An Architecture Design and Assessment System for Software/Hardware Codesign. White Paper, Research Triangle Park, November 1984.
[GE88] C. Gebotys and M. Elmasry. VLSI Design Synthesis with Testability. In Proc. 25th Design Automation Conference, pages 16-21, June 1988.
[Gir87] E. Girczyc. Loop Winding - A data flow approach to functional programming. In IEEE Intl. Symp. on Circuits and Systems, pages 382-385, May 1987.
[GK84] E. Girczyc and J. Knight. An ADA to Standard Cell Hardware Compiler Based on Graph Grammars and Scheduling. In Proc., 1984 Intl. Conference on Computer Design - ICCD, pages 726-729, October 1984.
[GP82] J.J. Granacki and A.C. Parker. The Effect of Register-Transfer Design Tradeoffs on Chip Area and Performances. In Proc. 20th Design Automation Conference, December 1982.
[GR83] R. Galivanche and S. Reddy. A Parallel PLA Minimization Program. In Proc. Intl. Conference on Computer Aided Design, pages 600-607, November 1983.
[GR87] R. Galivanche and S. Reddy. A Parallel PLA Minimization Program. In Proc. 24th Design Automation Conference, pages 600-607. IEEE and ACM, June 1987.
[H+82] J. Hennessy et al. Hardware/Software Tradeoffs for Increased Performance. Computer Architecture News, 10(2):2-11, March 1982.
[Ham83] G. Hamachi. Designing Finite State Machines with PEG. UC Berkeley, 1983.
[HC88] R. Hartley and P. Corbett. A Digit-Serial Silicon Compiler. In Proc. 25th Design Automation Conference, pages 646-649. IEEE/ACM, June 1988.
[HE89] B. Haroun and M. Elmasry. Architectural Synthesis for DSP Silicon Compilers. IEEE Trans. on CAD, 8(4):431-447, April 1989.
[HP81] L. Hafer and A. Parker. A Formal Method for the Specification, Analysis and Design of Register-Transfer Level Digital Logic. In Proc. 18th Design Automation Conference, pages 846-853, June 1981.
[HP83] L. Hafer and A. Parker. A Formal Method for the Specification, Analysis, and Design of Register-Transfer Level Digital Logic. IEEE Trans. on Computer-Aided Design, CAD-2(1):4-17, January 1983.
[Jai89] R. Jain. High-Level Area-Delay Prediction with Application to Behavioral Synthesis. PhD thesis, Dept. of Electrical Engineering, University of Southern California, July 1989.
[JMP88] R. Jain, M. Mlinar, and A. Parker. Area-Time Model for Synthesis of Non-Pipelined Designs. In Proc. Intl. Conference on Computer-Aided Design, pages 48-51, November 1988.
[JPP87] R. Jain, A. Parker, and N. Park. Predicting Area-Time Tradeoffs for Pipelined Design. In Proc. 24th Design Automation Conference, pages 35-41, July 1987.
[JPP88] R. Jain, A. Parker, and N. Park. Module Selection for Pipelined Designs. In Proc. 25th Design Automation Conference, pages 542-547, June 1988.
[KC86] Y. Kuo and W. Chou. Generating Essential Primes for a Boolean Function with Multiple-Valued Inputs. In Proc. 23rd Design Automation Conference, pages 193-199, June 1986.
[KP84] F. Kurdahi and A. Parker. Area estimation of standard cell designs. DISC Report 84-2, EE-Systems Dept. USC, 1984.
[KP86] F. J. Kurdahi and A. C. Parker. PLEST: A Program for Area Estimation of VLSI Integrated Circuits. In Proc. 23rd Design Automation Conference, pages 467-473, June 1986.
[KP87] F. J. Kurdahi and A. C. Parker. REAL: A Program for REgister ALlocation. In Proc. 24th Design Automation Conference, pages 210-215, June 1987.
[KP90] K. Kucukcakar and A. C. Parker. MABAL - A Software Package for Module And Bus ALlocation. To appear in Intl. Journal of Computer-Aided VLSI Design, 2(4), 1990.
[KT83] T. J. Kowalski and D.E. Thomas. The VLSI Design Automation Assistant: Prototype System. In Proc. 20th Design Automation Conference, pages 479-483, 1983.
[Kur87] F. J. Kurdahi. Area Estimation of VLSI Circuits. PhD thesis, University of Southern California, August 1987.
[KY85] Y. Koseki and T. Yamada. PLAYER: A PLA Design System for VLSIs. In Proc. 22nd Design Automation Conference, pages 766-769, June 1985.
[LA83] Wentai Liu and Daniel Atkins. Bounds on the Saved Area Ratio due to PLA Folding. In Proc. 20th Design Automation Conference, pages 538-544. IEEE and ACM, June 1983.
[Lei81] G. W. Leive. The Design, Implementation, and Analysis of an Automated Logic Synthesis and Module Selection System. PhD thesis, Dept. of Electrical Engineering, Carnegie-Mellon University, January 1981.
[Liu68] C. L. Liu. Introduction to Combinatorial Mathematics. McGraw-Hill Book Company, New York, 1968.
[Ltd85] INMOS Ltd. OCCAM Programming Manual, July 1985.
[LVSV82] M. Luby, U. Vazirani, and A. Sangiovanni-Vincentelli. Some Theoretical Results on the Optimal PLA Folding Problem. In Intl. Conference on Circuits and Computers, pages 165-170. IEEE, 1982.
[Man72] R. L. Mandell. Hardware/software trade-offs - Reasons and directions. In Fall Joint Computer Conference, pages 453-459, 1972.
[Mar86] R. M. Marshall. Synthesis of Hardware Systems from Very High Level Behavioural Specifications. PhD thesis, University of Edinburgh, 1986.
[McF78] M. McFarland. The value trace: A data base for automated digital design. Master's thesis, Dept. of Electrical Engineering, Carnegie-Mellon University, Pittsburgh, Pa., December 1978.
[McF81] M. McFarland. Allocating Registers, Processors, and Connections. Internal Paper, 1981.
[McF86] M. McFarland. Using Bottom-Up Design Techniques in the Synthesis of Digital Hardware from Abstract Behavioral Descriptions. In Proc. 23rd Design Automation Conference, pages 474-480, June 1986.
[McF87] M. McFarland. Reevaluating the Design Space for Register-Transfer Hardware Synthesis. In Intl. Conference on Computer-Aided Design, pages 262-265, November 1987.
[MPC88] M. C. McFarland, A. C. Parker, and R. Camposano. Tutorial on High-Level Synthesis. In Proc. 25th Design Automation Conference, June 1988.
[MT86] Darrell Makarenko and John Tartar. A Statistical Analysis of PLA Folding. IEEE Trans. on Computer-Aided Design, CAD-5(1):39-51, January 1986.
[Nag80] A. Nagle. Automatic Design of Sequencers for The Control of Digital Hardware. PhD thesis, Carnegie-Mellon University, October 1980.
[NCP82] A. Nagle, R. Cloutier, and A. Parker. Synthesis of Hardware for the Control of Digital Systems. IEEE Trans. on Computer-Aided Design, CAD-1(4):201-212, 1982.
[NP81] A. Nagle and A. Parker. Algorithms for Multiple-Criterion Design of Microprogrammed Control Hardware. In Proc. 18th Design Automation Conference, pages 486-493, June 1981.
[NT86] J. Nestor and D. Thomas. Behavioral Synthesis with Interfaces. In Intl. Conference on Computer Aided Design, November 1986.
[OHM+84] J. Ousterhout, G. Hamachi, R. Mayo, W. Scott, and G. Taylor. MAGIC: A VLSI Layout System. In Proc. 21st Design Automation Conference, pages 152-159, June 1984.
[Pan88] B. Pangrle. SPLICER: A Heuristic Approach to Connectivity Binding. In Proc. 25th Design Automation Conference, pages 536-541, June 1988.
[Par85] N. Park. Synthesis of High Speed Digital Systems. PhD thesis, University of Southern California, August 1985.
[Pen86] Z. Peng. Synthesis of VLSI Systems with the CAMAD Design Aid. In Proc. 23rd Design Automation Conference, pages 278-283, June 1986.
[PG87] B. Pangrle and D. Gajski. Design Tools for Intelligent Silicon Compilation. IEEE Trans. on Computer-Aided Design, CAD-6(6):1098-1112, November 1987.
[PH87] A. Parker and S. Hayati. Automating the VLSI Design Process using Expert Systems and Silicon Compilation. Proc. IEEE, 75(6):777-785, 1987.
[PK87] P. Paulin and J. Knight. Force-Directed Scheduling in Automatic Data Path Synthesis. In Proc. 24th Design Automation Conference, pages 195-202, July 1987.
[PKM84] A. Parker, F. Kurdahi, and M. Mlinar. A General Methodology for Synthesis and Verification of Register Transfer designs. In Proc. 21st Design Automation Conference, June 1984.
[PP85a] N. Park and A. Parker. Synthesis of Optimal Clocking Schemes. In Proc. 22nd Design Automation Conference. ACM IEEE, June 1985.
[PP85b] N. Park and A. Parker. Synthesis of optimal pipeline clocking schemes. Technical Report DISC/85-1, Dept. of EE-Systems, University of Southern California, January 1985.
[PP86] N. Park and A.C. Parker. Sehwa: A Program for Synthesis of Pipelines. In Proc. 23rd Design Automation Conference, pages 454-460, July 1986.
[PP88] N. Park and A. Parker. Sehwa: A Software Package for Synthesis of Pipelines from Behavioral Specifications. IEEE Trans. on Computer Aided Design, 7(3), March 1988.
[PPM86] A. Parker, J. Pizarro, and M. Mlinar. MAHA: A Program for Datapath Synthesis. In Proc. 23rd Design Automation Conference, pages 461-466, July 1986.
[RSV86] R. Rudell and A. Sangiovanni-Vincentelli. Exact Minimization of Multiple-Valued Functions for PLA Optimization. In Proc. Intl. Conference on Computer Aided Design, pages 352-355, November 1986.
[Sno78] E. Snow. Automation of Module Set Independent Register Transfer Level Design. PhD thesis, Dept. of Electrical Engineering, Carnegie-Mellon University, Pittsburgh, Pa., April 1978.
[Tho86] D. E. Thomas. Design Methodologies, pages 401-439. Elsevier Science Publishers B. V., North-Holland, 1986.
[Tri85] H. Trickey. Compiling Pascal Programs into Silicon. PhD thesis, Dept. of Computer Science, Stanford University, July 1985.
[WC88] C.-L. Wey and T.-Y. Chang. PLAYGROUND: Minimization of PLAs with Mixed Ground-True Outputs. In Proc. 25th Design Automation Conference, pages 421-426, June 1988.

Appendix A
Notation

Some notation is common throughout the thesis. The meanings of these symbols are listed here.

• E is the number of edges in a dataflow graph.
• N is the number of nodes in a dataflow graph.
• £ is the number of control states.
• P is the number of stages into which a dataflow graph is partitioned.
• M is the number of unique multiplexer control lines.
• S is the quantity of input status lines.
• A is the number of additional control lines (ALUs, etc.).
• $R_c$ is the number of control lines associated with registers.
• R is the number of unique registers being controlled in the datapath.
• i is the number of inputs on the PLA.
• o is the number of outputs on the PLA.
• p is the number of product-terms on the PLA.
• $c_1 \ldots c_4$ are coefficients for determining the area of a PLA; they are technology dependent parameters.
• $PL(\alpha_i)$ is the length of loop $\alpha_i$ in stages where $PL(\alpha_i) > 0$.
• $N(\alpha_i)$ is the number of iterations of loop $\alpha_i$ where $N(\alpha_i) > 1$.
• $K_{reg}$ is a value from 0 to 1 which reflects the degree of register uniqueness per loop. If a value stored at a given step is put into a different register each loop, $K_{reg} = 1$. If the same register is used each loop, then $K_{reg} = 0$.
• $PC(\beta_i)$ is the length of conditional path $\beta_i$ in stages where $PC(\beta_i) > 0$ (at least one partition is reached or crossed).
• C is the number of conditional paths in the entire graph.
• P is an edge partition cutset of the dataflow graph. Its members are the edges which divide the dataflow graph into exactly two subgraphs where root and outport are not members of the same subgraph.
• C is the set of all partition cutsets, P.
• c is the clock cycle time.
• p is the number of stages (of clock cycle c) into which the dataflow graph has been partitioned.
• $V(e_i)$ is the value assigned to edge $e_i$.
• L is the height of the dataflow graph. This height or maximum length (in operations) from root to outport is exclusive of the end nodes.
• $n_i$ is the number of operations of type i in the dataflow graph.
• $o_i$ is the number of operators of type i used in the actual design.
• I is the initiation interval of a given pipelined design. Initiation interval is defined as the number of clock cycles between two successive inputs.
• $E_{in}$ is the set of edges from root to unique destination nodes.
• $E_{out}$ is the set of edges to outport from unique source nodes.
• R is the number of registers needed for a given design.
• $R_{min}$ and $R_{max}$ are the minimum and maximum number of registers required for a given dataflow graph (any given cutset of C).
• $R_{np_{min}}$ and $R_{np_{max}}$ are the minimum and maximum number of registers required for any non-pipelined design of a given dataflow algorithm, respectively.
• $R_{p_{min}}$ and $R_{p_{max}}$ are the minimum and maximum number of registers required for a given pipelined design, respectively.
• $R_{np_{est}}$ and $R_{p_{est}}$ are the estimates on the number of registers for a given non-pipelined and pipelined design, respectively.
• $R_{np_{lim}}$ is the empirical limit on the maximum number of registers required for a non-pipelined design.
• u is the number of microcycles for a pipelined design.
• V is the number of unique values in a given dataflow graph.
• $V_{reg}$ is the number of unique values which are assigned to registers for a given design.
• $M_i$ is the number of 2:1 multiplexers needed at each input of operator type i in a given design.
• $M_r$ is the number of 2:1 multiplexers needed for all registers in a given design.

Appendix B
MAHA Usage: Inputs and Outputs

MAHA is a non-pipelined synthesis program that relies upon files for describing a dataflow graph, the module library, and constraints. MAHA can be run interactively as well as in batch mode.

B.1 MAHA Inputs

Although MAHA usually relies on an interactive environment to control program flow, fixed input data cannot be entered at runtime.
Two data input files are required by MAHA: a dataflow description and a module library. The dataflow description consists of a single file which contains both a list of nodes and a list of edges; the module library consists of modules and their associated cost, speed, and bit-width. Although both the dataflow file and library file may be named in any manner of your choosing, it is recommended that the primary names be identical while changing the extension to indicate the type of file. For example, the dataflow might be contained in garbage.dfg and the module library in garbage.lib.

The dataflow description file contains the description of a complete dataflow graph. The graph has a number of restrictions on it:

• The graph can contain no more than 1000 nodes or 2000 edges.
• The graph must be acyclic (no loops).
• The top of the graph is indicated by the reserved word root which cannot be defined elsewhere. Root has NO input edges, only output edges. If you do not have a root node, MAHA will add one for you along with the necessary edges emanating from it. If you have a root node, it should be of operation type dummy since it does not represent any physical operation.
• The bottom of the graph is indicated by the reserved word outport and cannot be defined elsewhere. Outport has NO output edges, only input edges. If you do not have an outport node, MAHA will add one for you along with the necessary edges connected to it. If you have an outport node, it should be of type dummy.
• Conditional branches are indicated by the reserved words dist for fork (distribution) nodes and join for join nodes. Dist has one input arc and one or more output arcs; join is the converse.
• Parallel branches may be indicated by the reserved words parbeg where the parallel branches begin and parend where the parallel branches recombine.
However, MAHA assumes multiple edges are in parallel implicitly, so these nodes are more for readability.

The dataflow description file has both node and edge information in a one node (edge) per line format as follows:

    node-description-1
    node-description-2
    ...
    node-description-n

    edge-description-1
    edge-description-2
    ...
    edge-description-m

Note the blank line between the node and edge descriptions; this is a REQUIREMENT. Also, one can insert comments into the description. A comment line is indicated by the presence of a pound sign ('#') in the first column; comments may appear anywhere and are ignored by MAHA.

B.1.1 Node Description

A node description contains the node name, node type, and bitwidth as follows:

    node-name node-type bitwidth

Node-name is any 1 to 15 character name which is unique to the dataflow graph description. The node-type is also a name of up to 15 characters which specifies the function of the node; this node-type MUST match one or more module functions in the module library. Bitwidth is a non-negative integer; a bitwidth of 0 informs MAHA that this node is an implied node (i.e. one that has no associated cost or delay). For example,

    add1 add 8

names an adder add1 which performs an add function and is of bitwidth 8. There must be at least one add in the module library. Keep in mind that case is important; hence, add and Add are NOT the same. A brief example of some valid node descriptions is shown below.

    c3 dummy 8
    x1_p1 bit-read 8
    storage1 d-flop 140
    add7 add 8
    subtraction3 sub 5
    no_op dummy 0
    flag_check cmp 5

B.1.2 Edge Description

An edge description follows the node description in a dataflow file with an intervening blank line. It consists of a source node, destination node, and bitwidth.
    source-node destination-node bitwidth

Like the node description, source-node and destination-node are names up to 15 characters in length. The node names MUST match names previously included in the node description. An example of edge descriptions using previously listed node names is shown below.

    x1_p1 add7 8
    subtraction3 flag_check 5
    c3 add7 8

B.1.3 Module Description

The second input data file is the module library which consists of a list of individual functional modules and some of their physical parameters. The form of the module library is:

    module-description-1
    module-description-2
    ...
    module-description-p

Similar to the dataflow graph description, comments are allowed in the module library. Any line with a pound sign in the first column is ignored by MAHA. Each module-description is of the form:

    module-name module-function bit-width prop-delay cost std-width nets

Module-name is a name of up to 15 characters which is uniquely identified in the module library. Module-function is also a name of up to 15 characters and describes the general function of the module. Some typical general functions are add, sub, or, and, and d-flop. The node-type of every node description MUST match the module-function of one or more modules.

Bit-width, prop-delay, and cost are integer values which describe the bit width of the module, propagation delay in some arbitrary units, and cost (usually area, although could be power, etc.) in some arbitrary units. Finally, std-width and nets are any associated standard cell width and (internal) 2-wire nets; these are only used if one wishes to execute the wiring area estimation program PLEST upon completion of MAHA. Otherwise, the values can be set to zero as they are not used by MAHA for synthesis.

When read into MAHA, the module library is reduced to a list which is linked to the node list.
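The two file formats described above are simple enough to read with a few lines of code. The following is an illustrative sketch only, not MAHA's own reader: the function names `parse_dataflow` and `parse_module_library` are hypothetical, std-width and nets are assumed to be integers, blank lines in the module library are assumed to be skipped, and the module line `adder8 add 8 40 120 0 0` is an invented example.

```python
def parse_dataflow(text):
    """Parse a dataflow description into (nodes, edges).

    nodes maps node-name -> (node-type, bitwidth);
    edges is a list of (source-node, destination-node, bitwidth).
    """
    nodes, edges = {}, []
    reading_edges = False
    for raw in text.splitlines():
        if raw.startswith("#"):            # comment: '#' in the first column
            continue
        fields = raw.split()
        if not fields:                      # blank line ends the node section
            reading_edges = reading_edges or bool(nodes)
            continue
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields: {raw!r}")
        first, second, width = fields[0], fields[1], int(fields[2])
        if max(len(first), len(second)) > 15:
            raise ValueError(f"name longer than 15 characters: {raw!r}")
        if not reading_edges:
            if first in nodes:
                raise ValueError(f"duplicate node name: {first}")
            nodes[first] = (second, width)
        else:
            if first not in nodes or second not in nodes:
                raise ValueError(f"edge uses an undeclared node: {raw!r}")
            edges.append((first, second, width))
    return nodes, edges


def parse_module_library(text):
    """Parse module lines into dictionaries keyed by the seven field names."""
    names = ("name", "function", "bit_width", "prop_delay",
             "cost", "std_width", "nets")
    modules = []
    for raw in text.splitlines():
        fields = raw.split()
        if raw.startswith("#") or not fields:
            continue
        if len(fields) != 7:
            raise ValueError(f"expected 7 fields: {raw!r}")
        values = fields[:2] + [int(f) for f in fields[2:]]
        modules.append(dict(zip(names, values)))
    return modules


SAMPLE_DFG = """\
# node descriptions
c3 dummy 8
x1_p1 bit-read 8
add7 add 8
subtraction3 sub 5
flag_check cmp 5

# edge descriptions
x1_p1 add7 8
subtraction3 flag_check 5
c3 add7 8
"""

nodes, edges = parse_dataflow(SAMPLE_DFG)
modules = parse_module_library("# invented entry\nadder8 add 8 40 120 0 0\n")
print(len(nodes), "nodes,", len(edges), "edges,", len(modules), "module")
```

Rejecting edges that name undeclared nodes mirrors the requirement that edge node names match names already given in the node section; a fuller reader would also enforce the 1000-node and 2000-edge limits and the reserved words.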
Specifically, for each node-type named in the node list, the module library is scanned as follows:

1. If the node-type and module-function are the same, proceed to step 2, else proceed to step 7.
2. If the module bit-width is less than or equal to zero, proceed to step 3, else proceed to step 5.
3. If the module bit-width is zero, the actual cost and prop-delay are defined as the module cost times the node bit-width and module prop-delay times the node bit-width, respectively. Proceed to step 6.
4. If the module bit-width is less than zero, the actual cost is the module cost times the node bit-width whereas the prop-delay is simply the module prop-delay (no adjustment). Proceed to step 6.
5. If the module bit-width is less than the node bit-width, proceed to step 7.
6. The cost and prop-delay information is added to the matching module list.
7. If there are more modules to check, proceed to step 1, else proceed to step 8.
8. An average module is created which has as its prop-delay (cost) the average of all prop-delays (costs) from the matching module list.

A pass over the module library is made for each node to create the average module which implements each node function. If two nodes generate the same matching module list, they will point to the same average module.

Since an average module is used by MAHA, a module library should contain identical functions with different prop-delays, costs, and bit-widths. If you wish to have direct control over the module library, then only a single compatible module should be included in the library for each node type.

B.2 Executing MAHA

There are two methods for executing MAHA: interactive execution or command line execution. Each will be detailed.

B.2.1 Interactive Execution of MAHA

MAHA is executed for interactive synthesis by running the program with no other parameters.
    maha

Once MAHA begins execution, it first inquires for the name of the dataflow description file.

    Dataflow graph filename?

After you enter the name of your file, MAHA reads the node and edge list and attaches root and outport nodes if missing. MAHA then marks the conditional paths using a node coloring algorithm. The user is prompted for optional display of the completed node coloring.

    Show node coloring? (Y/N):

If answering Y, MAHA displays the list of nodes and associated color. Next, MAHA prompts for the name of a constraint file.

    Constraint filename?

MAHA accepts a set of limited local constraint equations. These are not the global goals and constraints on time and area, but rather constraints on individual node area or relative delay between two nodes. Section B.5.3 contains a description of the constraint file. Enter the name of the constraint file. If there are no constraints, as is usually the case, just press RETURN with no name.

The last file MAHA needs is the module library.

    Module library filename?

After the module library filename is entered, MAHA generates the average module library as described in the previous section. It displays the number of average modules and the maximum delay associated with any module; this is also the minimum clock cycle time.

Besides allocation and scheduling of operations, MAHA is capable of analyzing register and multiplexer area and delay at the option of the user.

    Consider register cost and delay? (Y/N):

If you do not wish to include register analysis, then answer N. Answering Y brings up a list of the current cost per bit and delay.

    Current values: Regcst/bit = 16, regtim = 5. Change? (Y/N):

If these values are acceptable, one should respond with N. Otherwise, MAHA will prompt for the new cost per bit and delay. Enter the two integer numbers separated by one or more blanks.
MAHA constructs registers of any length needed using the single-bit values for cost; delay is independent of bitwidth.

The same questions given the user for register analysis are now given for multiplexer analysis. The analysis is specific to the multiplexer being 2:1 single bit. MAHA constructs multiplexers of any bitwidth and any n:1 multiplex using these 2:1 values. Delay is dependent upon the height of the n:1 tree and is given by

$\lceil \log_2 n \rceil \times mux\_delay$

At this time, you are prompted for a number of self-explanatory items.

    Echo to output file? (Y/N):
    Print the node list? (Y/N):
    Print the edge list? (Y/N):
    Print the module library (Y/N):

For the first of the above inquiries, MAHA will prompt for a filename if you choose to echo the output to a file. Echoing is similar to a script file as all console output is also directed to the specified file.

MAHA now calculates the nominal critical path and lists the nodes in it, the total critical path time, and the minimum clock cycle time. (Critical path is the longest delay path from root to outport. Only a single critical path is returned even if there is more than one critical path.) MAHA also displays the input processing time and the list of nodes in the critical path in order from left to right starting from root and ending at outport.

You are now asked for constraints which direct the search for a solution.

    Enter maximum time (0 to minimize):
    Enter maximum cost (0 to minimize):

There are three choices you have regarding the constraints assigned:

1. You can specify time (some positive integer) and set cost to 0 (to minimize). MAHA will search for the cheapest design which meets the time constraint.
2. You can specify cost (some positive integer) and set time to 0 (to minimize). MAHA will search for the fastest design which meets the cost constraint.
3. You can specify BOTH cost and time.
MAHA will find the best design that meets both constraints. Due to the way MAHA functions, it will produce the fastest design that meets both of the constraints.

Notice that the case where BOTH maximum time and cost are set to zero (minimize both) is NOT currently allowed. MAHA will print a message and ask for the time and cost again.

After displaying the constraints, MAHA inquires whether the user wishes to manually control the search.

    Do you wish to manually control the search? (Y/N):

Normally, you will want the MAHA algorithm to find the result for you; in this case, answer N to the above question. If, for some reason, you wish to bypass the MAHA search algorithm and go directly to a specific design point, answer Y to the question. Since MAHA acts differently depending on the answer, manual and automatic search are discussed separately.

B.2.1.1 MAHA Automatic Operation

When you specify automatic operation, MAHA will internally search for the best result. First, the user is prompted for the type of allocation and scheduling.

    Allocation: 0 = ASAP only, 1 = ASAP/ALAP :

MAHA has the capability to perform allocation both "as early as possible" and "as late as possible" and take the better of the two results. However, this could result in an execution time which is three times longer than ASAP allocation alone. When you first enter a graph or have a very large graph, it is recommended that you respond with 0. However, for small graphs, you may wish to try both despite the longer execution time. In this case, answer 1.

MAHA will optionally show all of its internal operations.

    Show freedoms and status (detailed information)? (Y/N):

If you answer Y to the above question, the "grungy" detail of MAHA operation is shown: bounding the time range to search, buying and sharing of modules, current cost, and percent completion.
If you wish to avoid this extensive detail, which is generally only useful for seeing the algorithm in action, enter N at the question.

    Show ONLY non-inferior solutions? (Y/N):

If you answer Y, MAHA will give a table of the best results IGNORING the constraints. However, MAHA will highlight the best solution. Answering N lists all designs MAHA produces. The best design meeting the constraints is again listed.

B.2.1.1.1 Manual Operation

If manual search has been specified, the user directs all operations.

    Enter partition count (positive #),
    or clock cycle time (as negative #),
    or RETURN to exit:

Manual control of MAHA allows either directly specifying the number of partitions (time slots) into which to cut the dataflow graph or entering the clock cycle time. (To distinguish a clock cycle time from a partition count, the clock cycle time is entered as a negative number.) A distinct feature of manual control of MAHA is that you are allowed to exceed the original constraints.

Manual Partitioning: MAHA will partition a dataflow graph into between 1 and n partitions, where n is the number of partitions realized when using the minimum clock cycle time discussed earlier. If the number of partitions is within this range, MAHA inquires

    Allocation: 0 = ASAP only, 1 = ASAP/ALAP :

MAHA has the capability to perform allocation both "as early as possible" and "as late as possible" and take the better of the two results. However, this could result in an execution time which is three times longer than ASAP allocation alone. When you first enter a graph or have a very large graph, it is recommended that you respond with 0. However, for small graphs, you may wish to try both despite the longer execution time. In this case, answer 1.

MAHA will optionally show all of its internal operations.
    Show freedoms and status (detailed information)? (Y/N):

If you answer Y to the above question, the "grungy" detail of MAHA operation is shown: bounding the time range to search, buying and sharing of modules, current cost, and percent completion. If you wish to avoid this extensive detail, which is generally only useful for seeing the algorithm in action, enter N at the question.

Manual Clock-cycle Entry: MAHA will partition a dataflow graph based on a preset clock cycle time. The only restriction is that the time must equal or exceed the minimum clock cycle time discussed earlier. If the clock cycle time is within this range, MAHA inquires

    Allocation: 0 = ASAP only, 1 = ASAP/ALAP :

MAHA has the capability to perform allocation both "as early as possible" and "as late as possible" and take the better of the two results. However, this could result in an execution time which is three times longer than ASAP allocation alone. When you first enter a graph or have a very large graph, it is recommended that you respond with 0. However, for small graphs, you may wish to try both despite the longer execution time. In this case, answer 1.

MAHA will optionally show all of its internal operations.

    Show freedoms and status (detailed information)? (Y/N):

If you answer Y to the above question, the "grungy" detail of MAHA operation is shown: bounding the time range to search, buying and sharing of modules, current cost, and percent completion. If you wish to avoid this extensive detail, which is generally only useful for seeing the algorithm in action, enter N at the question.

B.2.2 Command Line Execution of MAHA

MAHA can be executed directly from the command line with no user interaction.
The format is

    maha dfg modlib constraints outfile [regflg [muxflg]]

where dfg is the dataflow graph file, modlib is the module library file, constraints is the constraint file (if any), and outfile is the output filename. If there is no constraints file, then a dash ("-") should be entered. Regflg and muxflg are optional flags. In command line mode, MAHA does not normally consider register or multiplexer costs. However, if regflg is 1, then register costs will be considered. Likewise, if muxflg is 1, then multiplexer costs will be considered.

After scanning the command line and opening the output file, MAHA generates all possible designs. These are written in a computer-readable format as described in Section B.5.4. Unless a run-time error occurs, no messages will appear at the console. MAHA performs its tasks silently, in the same manner as the automatic search described in Section B.2.1.1; both ASAP and ALAP scheduling are attempted for every design and the better one is chosen.

B.3 MAHA Interactive Output

Once MAHA has completed allocation of the graph, it displays the final clock cycle time, cost, and total time for the graph.

    Show the hardware map? (Y/N):

If you wish to see the final allocated results, answer Y to the question. MAHA will output a table that looks like

    HARDWARE     NODE.SLOT
    add8         add1.000    add5.001
    r-shift10    div1.001
    add8         add3.001

The first column contains the list of all hardware purchased; only the function is actually listed. (Notice there is one r-shift10 and two add8s in the example.) The columns to the right of the hardware list all nodes which are bound to that piece of hardware and the time slot associated with each. In the example, add1 in the first partition (slot #0) and add5 in the second partition (slot #1) share the same hardware, add8.
Add3 and div1 were put into the second partition (slot #1) and do not share their hardware with any other operators.

B.4 An Interactive Example

This section walks through the example that was included in the MAHA paper, "MAHA: A Datapath Synthesis Program" by Alice Parker, Jorge Pizarro, and Mitchell Mlinar, ACM/IEEE 23rd Design Automation Conference, June, 1986. The dataflow graph used in this example is reproduced below.

[Figure: Dataflow Graph Example]

Each node is assigned a distinct name (not to exceed 15 characters) for use by MAHA. Below is a copy of the dataflow graph file, example.dfg, which is accessible from the maha directory. Since the paper was written, there have been some minor changes to MAHA, namely the separation of conditional and parallel branches. The example dataflow graph in the paper has unconditional branches.

    root     dummy    0
    outport  dummy    0
    add1     add      8
    add2     add      9
    add3     add      10
    add4     add      9
    add5     add      8
    sub1     sub      9
    sub2     sub      8
    sub3     sub      8
    div1     r-shift  10
    div2     r-shift  10
    cmp1     cmp      8
    cmp2     cmp      8
    cmp3     cmp      8
    and1     and      2
    inv1     inv      1
    out1     buf      1
    out2     buf      1
    out3     buf      1
    D5       parbeg   0
    D6       parbeg   0
    J5       parend   0
    J6       parend   0

    root     add1     8
    root     add5     8
    root     add1     8
    root     add4     8
    root     add2     8
    root     add4     8
    root     add3     8
    root     add5     8
    add1     add2     9
    add2     add3     10
    add3     div1     10
    add4     sub1     9
    add5     sub1     9
    div1     div2     9
    div2     D6       8
    D6       sub3     8
    sub1     cmp3     8
    D6       sub2     8
    sub3     cmp1     8
    sub2     cmp2     8
    cmp2     and1     1
    cmp3     and1     1
    and1     out3     1
    cmp1     D5       1
    D5       inv1     1
    D5       out2     1
    inv1     out1     1
    out1     J5       1
    out2     J5       1
    out3     J6       1
    J5       J6       1
    J6       outport  1

Notice how the above example follows the rules outlined previously.

• there is a single root node which starts the dataflow graph

• there is a single outport node which ends the dataflow graph

• the dummy-type nodes have a bitwidth of 0. Since the root and outport nodes are algorithmic conveniences, a bitwidth of 0 informs MAHA to ignore any cost and delay for this node.
• there is a blank line between the node list and the edge list

The associated module library for this dataflow graph, example.lib, is reproduced below:

    dummy      dummy    0   0    0    0  0
    dist       dummy    0   0    0    0  0
    join       join     0   0    0    0  0
    parbeg     parbeg   0   0    0    0  0
    parend     parend   0   0    0    0  0
    add2       add      2   40   80   0  0
    add4       add      4   72   120  0  0
    add8       add      8   120  180  0  0
    add12      add      12  150  220  0  0
    add16      add      16  200  300  0  0
    addn       add      0   20   45   0  0
    sub2       sub      2   50   90   0  0
    sub4       sub      4   84   130  0  0
    sub8       sub      8   140  200  0  0
    sub12      sub      12  225  250  0  0
    sub16      sub      16  240  360  0  0
    subn       sub      0   25   50   0  0
    mul2       mul      2   80   140  0  0
    mul4       mul      4   150  300  0  0
    mul8       mul      8   280  640  0  0
    mux2       mux      2   30   55   0  0
    mux4       mux      4   54   100  0  0
    cmp4       cmp      4   70   110  0  0
    cmp8       cmp      8   130  180  0  0
    cmp12      cmp      12  190  240  0  0
    and2       and      2   10   18   0  0
    and3       and      3   14   22   0  0
    r-shiftn   r-shift  0   44   88   0  0
    r-shift4   r-shift  4   44   250  0  0
    r-shift8   r-shift  8   44   400  0  0
    r-shift12  r-shift  12  44   510  0  0
    r-shift16  r-shift  16  44   600  0  0
    inv1       inv      1   8    14   0  0
    inv2       inv      1   8    25   0  0
    buf1       buf      1   10   14   0  0
    buf2       buf      1   30   100  0  0
    buf3       buf      1   50   150  0  0

The sample module library points out some of the features and restrictions described earlier:

• Each module has a unique name.

• The set of module operations is well defined: addition, subtract, multiply, cmp (compare), and, buffer driver, inverter, l-shift (left shift register), r-shift (right shift register/divider), distribute, and join.

• Even fictitious node operations such as parbeg, dummy, and parend MUST be declared in the module library. (Other fictitious nodes include dist and join, but are not used in this example.)

• All of the delay and cost values are non-negative integers.

Here is a sample run of MAHA using the example.

    eve[1] maha

    MAHA v7.01
    USC Design Automation Group
    Mitchell Mlinar, July 1988

    Dataflow graph filename? dataflow.dfg
    Reading in the nodelist.
    There are 24 nodes: roots = 1, outports = 1.
    Reading in edgelist.
    There are 32 edges.
    Checking for extra edges required.
    > 0 extra edges added.
    Coloring conditional paths ...
    Show node coloring? (Y/N): n
    Constraint filename?
    Module (mod2) library filename? example.lib
    There are 13 modules, minimum time is 240.
    Consider register cost & delay? (Y/N): n
    Consider mux cost & delay? (Y/N): n
    Input process time: 0.060 seconds.
    Echo to output file? (Y/N): n
    Print the node list? (Y/N): n
    Print the edge list? (Y/N): n
    Print the module library? (Y/N): y

    Module name  Width  Delay  Cost  Size  Nets
    dummy0       0      0      0     0     0
    add8         8      160    240   0     0
    add9         9      200    300   0     0
    add10        10     200    300   0     0
    sub9         9      240    360   0     0
    sub8         8      190    280   0     0
    r-shift10    10     44     600   0     0
    cmp8         8      130    180   0     0
    and2         2      12     20    0     0
    inv1         1      8      19    0     0
    buf1         1      30     88    0     0
    dist0        0      0      0     0     0
    join0        0      0      0     0     0
    Press RETURN to continue --

Notice how MAHA calculates the average of the module library. Each node in the dataflow graph will have a single module associated with it (but not allocated yet).

    Finding the nominal critical path.
    The nominal critical path has 13 nodes with a time of 1010.
    The minimum clock time is 240.
    Critical path process time: 0.000 seconds.
    The critical path is:
    root add1 add2 add3 div1 div2 D6 sub2 cmp2 and1 out3 J6 outport
    Enter maximum time (0 to minimize):

Now that the critical path has been found, we can try to perform the synthesis with a time constraint.

    Enter maximum time (0 to minimize): 3000
    Enter maximum cost (0 to minimize): 0
    Constraints:
    Time: 3000 Cost: minimize
    Do you wish to manually control the search? (Y/N) : n
    Automatic search ...
    Allocation: 0 = ASAP only, 1 = ASAP/ALAP : 1
    Show freedoms and status (detailed information)? (Y/N): n
    Show ONLY non-inferior solutions?
    (Y/N): n

    Partitions  Clock  Time  Cost  Regs  Muxes
    1           1010   1010  3707  3     0
    2           560    1120  3707  6     0
    3           362    1086  3707  8     0
    4           (Cannot reach this)
    5           (Cannot reach this)
    6           240    1440  3167  12    0
    Linking add5 between add1 and add2
    2           650    1300  3407  5     0
    3           444    1332  2627  7     0
    4           362    1448  3227  9     0
    5           (Cannot reach this)
    6           (Cannot reach this)
    7           240    1680  2987  13    0
    Tagging node div2 as start of new partition
    2           764    1528  2807  5     0
    3           444    1332  2627  7     0
    4           406    1624  2627  8     0
    5           320    1600  2627  11    0
    6           (Cannot reach this)
    7           240    1680  2387  13    0

    Best is time of 1680, clock of 240, cost of 2387 (13 regs, 0 muxes)
    Analysis time: 0.020 seconds.
    Recalculate the best case showing the hardware map? (Y/N): y

    HARDWARE     NODE.SLOT (+ for register)
    dummy0       root.000    outport.006
    add8         add1+000    add5+001
    add9         add4+000    add2+002
    add10        add3+003
    r-shift10    div1+004    div2.005
    dist0        D6.005      D5.006
    sub8         sub2+005    sub3+005
    cmp8         cmp3+003    cmp1.006    cmp2.006
    and2         and1.006
    buf1         out1+006    out2+006    out3+006
    join0        J6.006      J5.007
    inv1         inv1.006
    sub9         sub1+002

    Bye.
    eve[2]

Since cost was to be minimized rather than constrained, the solution with the lowest cost was selected.

B.5 File Formats

Included here is a brief summary of all file formats used by MAHA.

B.5.1 Dataflow Description File

The dataflow description file has both node and edge information in a one node (edge) per line format as follows:

    node-description-1
    node-description-2
    ...
    node-description-n

    edge-description-1
    edge-description-2
    ...
    edge-description-m

Note the blank line between the node and edge descriptions; this is a REQUIREMENT. Also, one can insert comments into the description. A comment line is indicated by the presence of a pound sign ('#') in the first column; comments may appear anywhere and are ignored by MAHA.

B.5.1.1 Node Description

A node description contains the node name, node type, and bitwidth as follows:
    node-name node-type bitwidth

Node-name is any name of 1 to 15 characters which is unique within the dataflow graph description. The node-type is also a name of up to 15 characters which specifies the function of the node; this node-type MUST match one or more module functions in the module library. Bitwidth is a non-negative integer with no fixed upper limit; a bitwidth of 0 informs MAHA that this node is an implied node (i.e., one that has no associated cost or delay). For example,

    add1 add 8

names an adder add1 which performs an add function and is of bitwidth 8. There must be at least one add in the module library. Keep in mind that case is important; hence, add and Add are NOT the same. Some valid node descriptions are shown below.

    c3            dummy     8
    x1_p1         bit-read  8
    storage1      d-flop    140
    add7          add       8
    subtraction3  sub       5
    no_op         dummy     0
    flag_check    cmp       5

B.5.1.2 Edge Description

The edge descriptions follow the node descriptions in a dataflow file after an intervening blank line. Each consists of a source node, destination node, and bitwidth.

    source-node destination-node bitwidth

Like the node description, source-node and destination-node are names up to 15 characters in length. The node names MUST match names previously included in the node description. Example edge descriptions using the previously listed node names are shown below.

    x1_p1         add7        8
    subtraction3  flag_check  5
    c3            add7        8

B.5.2 Module Description File

The second input data file is the module library, which consists of a list of individual functional modules and some of their physical parameters. The form of the module library is:

    module-description-1
    module-description-2
    ...
    module-description-p

Similar to the dataflow graph description, comments are allowed in the module library. Any line with a pound sign ('#') in the first column is ignored by MAHA.
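The dataflow description format of Section B.5.1 uses the same comment convention. As a minimal sketch of how such a file could be read, the following Python fragment splits a description into its node list and edge list. The names parse_dataflow, Node, and Edge are illustrative assumptions and are not part of MAHA; the sketch assumes that fields are separated by blanks and that the first blank line after the node list marks the start of the edge list.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str       # unique, 1 to 15 characters
    type: str       # must match a module function in the library
    bitwidth: int   # 0 marks a node with no cost or delay


class Edge(NamedTuple):
    source: str
    destination: str
    bitwidth: int


def parse_dataflow(text: str) -> tuple[list[Node], list[Edge]]:
    """Split a dataflow description into its node list and edge list."""
    nodes: list[Node] = []
    edges: list[Edge] = []
    in_edges = False
    for raw in text.splitlines():
        if raw.startswith("#"):          # comment: '#' in the first column
            continue
        fields = raw.split()
        if not fields:
            # The first blank line after at least one node ends the node list.
            in_edges = in_edges or bool(nodes)
            continue
        if len(fields) != 3:
            raise ValueError(f"expected three fields, got: {raw!r}")
        first, second, width = fields[0], fields[1], int(fields[2])
        if len(first) > 15 or len(second) > 15:
            raise ValueError(f"name longer than 15 characters: {raw!r}")
        if in_edges:
            edges.append(Edge(first, second, width))
        else:
            nodes.append(Node(first, second, width))
    names = {n.name for n in nodes}
    if len(names) != len(nodes):
        raise ValueError("node names must be unique")
    for e in edges:
        if e.source not in names or e.destination not in names:
            raise ValueError(f"edge refers to an undeclared node: {e}")
    return nodes, edges
```

Checking that every node-type has a matching module function would additionally require the module library and is left out of this sketch.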
Each module-description is of the form:

    module-name module-function bit-width prop-delay cost std-width nets

Module-name is a name of up to 15 characters which is unique within the module library. Module-function is also a name of up to 15 characters and describes the general function of the module. Some typical general functions are add, sub, or, and, and d-flop. The node-type of every node description MUST match the module-function of one or more modules.

Bit-width, prop-delay, and cost are integer values which describe the bit width of the module, propagation delay in some arbitrary units, and cost (usually area, although it could be power, etc.) in some arbitrary units. Finally, std-width and nets are any associated standard cell width and (internal) 2-wire nets; these are only used if one wishes to execute the wiring area estimation program PLEST upon completion of MAHA. Otherwise, the values can be set to zero as they are not used by MAHA for synthesis.

When read into MAHA, the module library is reduced to a list which is linked to the node list. Specifically, for each node-type named in the node list, the module library is scanned as follows:

1. If the node-type and module-function are the same, proceed to step 2, else proceed to step 7.

2. If the module bit-width is less than or equal to zero, proceed to step 3, else proceed to step 5.

3. If the module bit-width is zero, the actual cost and prop-delay are defined as the module cost times the node bit-width and module prop-delay times the node bit-width, respectively. Proceed to step 6.

4. If the module bit-width is less than zero, the actual cost is the module cost times the node bit-width whereas the prop-delay is simply the module prop-delay (no adjustment). Proceed to step 6.

5. If the module bit-width is less than the node bit-width, proceed to step 7.

6.
The cost and prop-delay information is added to the matching module list.

7. If there are more modules to check, proceed to step 1, else proceed to step 8.

8. An average module is created which has as its prop-delay (cost) the average of all prop-delays (costs) from the matching module list.

A pass over the module library is made for each node to create the average module which implements each node function. If two nodes generate the same matching module list, they will point to the same average module.

Since an average module is used by MAHA, a module library should contain identical functions with different prop-delays, costs, and bit-widths. If you wish to have direct control over the module library, then only a single compatible module should be included in the library for each node type.

B.5.3 Constraint Description File

The current version of MAHA optionally accepts a limited set of time constraints. These are contained in a constraint file. Each line is of the form

    node-time cmp-op node-time [± time-constant]

where node-time is a reference to a specific node time, cmp-op is a comparison operator, and time-constant is some positive integer number.

Node-time is a reference to a specific time of a given node. It consists of a letter, colon, and nodename.

    B:nodename    Beginning time of specific node
    E:nodename    End time of specific node
    D:nodename    Delay time of node operation

The letter specifies the time value to use associated with the node. The nodename MUST match a unique name which is present in the dataflow description file.

There is one exception to the generalized time constraint equation: if the delay of a node appears on the left-hand side, the right-hand side may only contain a single constant. This is necessary since the module library is chosen prior to synthesis; MAHA does not have the capability to reconstruct a module library on the fly.
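A constraint line of the form given above can be checked mechanically. The following Python fragment is a minimal sketch, not MAHA's own reader: the names Constraint and parse_constraint are hypothetical, and it assumes that the fields are separated by blanks, that the comparison operator is written as a single < or > character, and that the leading letter may be given in either case. It also enforces the delay exception just described.

```python
from typing import NamedTuple, Optional


class Constraint(NamedTuple):
    lhs: str      # node-time such as "B:mulc" (letter upper-cased)
    op: str       # "<" or ">"
    rhs: str      # node-time, or a bare constant when lhs is a delay
    offset: int   # signed time-constant, 0 when absent


def _node_time(token: str) -> str:
    """Validate 'B:name', 'E:name' or 'D:name' and upper-case the letter."""
    letter, colon, name = token.partition(":")
    if colon != ":" or letter.upper() not in ("B", "E", "D") or not name:
        raise ValueError(f"bad node-time: {token!r}")
    return f"{letter.upper()}:{name}"


def parse_constraint(line: str) -> Optional[Constraint]:
    """Parse one line; return None for comment lines and blank lines."""
    if line.startswith("#") or not line.strip():
        return None
    tokens = line.split()      # assumption: fields are blank-separated
    if len(tokens) not in (3, 5) or tokens[1] not in ("<", ">"):
        raise ValueError(f"bad constraint: {line!r}")
    lhs = _node_time(tokens[0])
    if lhs.startswith("D:"):
        # Exception from the text: a delay on the left-hand side may only
        # be compared against a single constant.
        if len(tokens) != 3 or not tokens[2].isdigit():
            raise ValueError(f"delay needs a single constant: {line!r}")
        return Constraint(lhs, tokens[1], tokens[2], 0)
    rhs = _node_time(tokens[2])
    offset = 0
    if len(tokens) == 5:
        if tokens[3] not in ("+", "-") or not tokens[4].isdigit():
            raise ValueError(f"bad time-constant: {line!r}")
        offset = int(tokens[4]) if tokens[3] == "+" else -int(tokens[4])
    return Constraint(lhs, tokens[1], rhs, offset)
```

Verifying that each nodename exists in the dataflow description would need the node list as well and is omitted here.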
The cmp-ops allowed are less than (<) or greater than (>); however, MAHA implicitly assumes that these operations are less-than-or-equal-to (≤) or greater-than-or-equal-to (≥).

Finally, the constraint file accepts comment lines. Any line with a pound sign ('#') in the first column is ignored. In addition, blank lines are also ignored. A sample file is included below.

    #
    # Constraints file for myexOa.dfg
    #
    # Comment lines start with #
    #
    # Blank lines are also allowed

    # Times that make sense (case of 1st char not important):
    # B:nodename    Beginning of op nodename
    # E:nodename    End of op nodename
    # D:nodename    Delay of op nodename
    # constant      just a number (always positive)
    #
    # Operations consist only of:
    #
    # time < time
    # time > time
    # time < time + constant
    # time < time - constant
    # time > time + constant
    # time > time - constant
    #
    b:mulc > e:add2
    #
    b:add4 < b:mulc + 1000
    b:mula > b:mulc - 7500
    d:add4 < 600

B.5.4 MAHA Output File

When executed using command line mode, MAHA writes a machine-readable synthesis file. The format of this file is as follows.

    design-description-1
    design-description-2
    ...
    design-description-n
     ***

Each design description is offset by a line containing a blank followed by three asterisks. A design description is two or more lines as shown.

    parts [ clock area width nets regs muxes (inserts) ]
    nodename-1 nodetype-1 bound partition regflag color
    nodename-2 nodetype-2 bound partition regflag color
    ...
    nodename-n nodetype-n bound partition regflag color

Parts is the number of partitions into which the dataflow graph has been cut, with an associated clock cycle time of clock. Area, width, and nets are the calculated area, standard-cell width, and number of 2-wire nets of the resulting design.
Regs and muxes are the maximum number of registers and multiplexers needed for this design. Finally, the number of delay nodes inserted into the dataflow graph is indicated by inserts. If a valid design has been reached, parts will be a positive integer; otherwise, parts will be negative and no other values will be present on the design header line. Also, if MAHA was executed without considering registers or multiplexers, then the values displayed for them are invalid and should be discarded.

Following the design header line, the results for each node in the dataflow graph are listed. Nodename-i is the unique node name as contained in the dataflow graph description file; nodetype-i is the associated module type and is often a name generated internally in MAHA. Bound is an integer index value; nodes which have the same index share the same hardware. Note that two different nodes having the same nodetype do not necessarily have the same bound index. Although they implement the same operation, each is accomplished in a different hardware module.

Partition is the partition index number where the node operation is performed. The value for partition ranges from zero to parts - 1. Regflag indicates whether the output value(s) from this node are needed in later clock cycles; a value of one (1) indicates that a register is needed to save its value. If not, then a zero (0) is present. Finally, the conditional coloring of the node is given in the form

    color1[:color2[:...]]

where the coloring identifier is one or more 2-digit hexadecimal numbers separated by colons. Colors which have a length greater than one indicate that this operation is located on some conditional path; the length specifies the depth at which this conditional exists.

A partial MAHA output file is shown below and matches the example described earlier.
    4 362 3227 0 180 9 0 (1)
    root     dummy0     0   0  0  01
    outport  dummy0     0   3  0  02
    add1     add8       1   0  0  01
    add2     add9       3   1  1  01
    add3     add10      4   2  0  01
    add4     add9       3   0  1  01
    add5     add8       2   0  1  01
    sub1     sub9       14  1  1  01
    sub2     sub8       8   3  0  02:02
    sub3     sub8       8   3  0  02:01
    div1     r-shift10  5   2  0  01
    div2     r-shift10  6   2  1  01
    cmp1     cmp8       9   3  0  02:01
    cmp2     cmp8       9   3  0  02:02
    cmp3     cmp8       9   2  1  01
    and1     and2       10  3  0  02:02
    inv1     inv1       13  3  0  02:01:01
    out1     buf1       11  3  1  02:01:01
    out2     buf1       11  3  1  02:01:02
    out3     buf1       11  3  1  02:02
    D5       dist0      7   3  0  02:01
    D6       dist0      7   2  0  02
    J5       join0      12  4  0  02:01
    J6       join0      12  3  0  02
     ***

Appendix C

PLA Loop Counter Area Evaluation

A counter used as part of the loop control can be implemented in a variety of ways. The two primary schemes involve building the counter as part of the PLA or attaching a separate piece of hardware. Four operational considerations are

1. initialization of the counter,

2. decrement of the counter,

3. testing for count completion, and

4. uniquely defining the state (for clocking the correct datapath register).

Initialization of the counter must be performed directly by the PLA for the internal counter. Although an external counter could have a prewired starting value, this is less attractive since it may preclude using the counter for handling other loops (if they have a different number of iterations). Hence, the external counter will be initialized by the PLA, which requires n additional output lines for an n-bit counter.

For the decrement loop operation, an external counter is supplied a single output trigger by the PLA. Conversely, the internal counter has input lines containing the previous value and product terms to determine the next value. The optimal number of product terms is the sum of the minimal covering for each state bit and is demonstrated in Figure C.1, where the bits are numbered from 1 up.
Karnaugh map coverage dictates that a given bit position, m, where 1 is the least-significant bit position, can be minimally covered with m terms. Hence, for a given n-bit counter, the total number of product terms is the summation from 1 to n, or $n(n+1)/2$, for a $2^n$ iteration loop. If the number of iterations for a given loop, $N(\alpha_i)$, falls in the range $2^{n-1} < N(\alpha_i) < 2^n$, then the minimal number of product terms is the lower bound,

    $p_{\mathrm{int\ cntr}} = \frac{n'(n'+1)}{2}$    (C.1)

where $n' = \lfloor \log_2 N(\alpha_i) \rfloor$.

[Figure C.1: Minimal Covering of product terms for Counter]

The termination test entails checking the counter for a value of zero. Since it is far cheaper to check for a zero in external hardware, given that the external counter is already present, a single PLA input status bit suffices. A product term is necessary for both internal and external counters to determine the next state for loop termination or continuation.

Finally, a unique value may be required for each loop to distinguish the register being controlled. (For example, a 16-bit multiply implemented using an 8-bit multiplier would store the intermediate values in separate registers in each of the four loops.) This also entails decoding the contents of the counter to determine the iteration number which, when combined with the loop state, yields a unique register reference. Decoding can be performed internal to the PLA, but external decoding is more cost effective, as shown in Table C.1. The large difference in areas would be further compounded by having larger state machines or by directly building a smaller decoder from CMOS transistors rather than from inverter and nand standard cells. Consequently, decoding of the unique
Consequently, decoding of th e unique 330 T able C .l: C om parison of In te rn a l versus E x tern a l D ecoding of Loop C o u n ter Iterations A PLA A rea 1 E xternal Decode A rea 2 2 153 3 4 355 2 1 8 574 50 16 809 115 A rea in m il2 1 T he PLA area was based upon the m inim al num ber of term s to achieve th e decode for th e sim plest 2-state PLA. 2 T he external area was based upon a standard decoder using inverters and nand gates from th e RCA CADDAS library. register value, if necessary, should always be perform ed external to the PLA itself. To com pare th e cost of each im plem entation, th e im pact on th e PLA param eters is sum m arized in Table C .2 . Collecting th e term s in Table C.2, assum ing th e num ber of iterations, is some positive integer power-of-2 , and ignoring th e decoding area (which is th e sam e for b o th th e internal and external realizations), th e m inim um area contri bution for an internal PLA counter can be derived by using th e area equation for th e PLA based upon its inputs (i), o utputs (p ), and pro d u ct term s (p ), A.C p Ci(2i + o ) + C2P + 0 3 ( 2 ? - f - o) + C 4 (C.2) w here ci, . .. , C 4 , are constants for a specific PLA tool and technology. 
Assuming the PLA exists and only the internal counter is being added, the incremental PLA area associated with this counter is

    $A_{\mathrm{int\ cntr}} \approx 3c_1 n + c_2\left(\frac{n'(n'+1)}{2} + 1\right) + 3c_3 n$    (C.3)

For an external counter, the incremental area of the PLA is

    $A_{\mathrm{ext\ cntr}} \approx c_1(n + 3) + c_2 + c_3(n + 3) + n A_{cntr}$    (C.4)

Table C.2: Summary of Counter Impact on PLA Parameters

    Type      Term                  Δ input   Δ output   Δ product terms
    Internal  initialize            -         n          -
              decrement             n         -          n'(n'+1)/2
              termination test      -         -          1
              unique reg op (1)     -         - (2^n)    - (2^n)
    External  initialize            -         n          -
              decrement             -         1          -
              termination test (2)  1 (n)     -          1
              unique reg op (1)     -         - (2^n)    - (2^n)

    $n' = \lfloor \log_2 N(\alpha_i) \rfloor$
    Values in the columns represent the incremental number of terms due to each loop operation.
    (1) The value in parentheses is included if register control signals must be unique for each iteration and the decoding is performed internal to the PLA.
    (2) The value in parentheses is included if there is no external signal which detects a count of zero.

Clearly, Equation C.3 grows more rapidly than Equation C.4 as n increases. Even for a value of $N(\alpha_i) = 2$, the PLA counter has an area 222 mil² larger than an external counter. Thus, counters used for the control of loops should be realized using external hardware controlled by the PLA rather than extending the PLA to include an internal counter.

C.1 Product Terms in the PLA for the Register Control Method

In Chapter 3, the assertion was made that the number of product terms needed for "register control" was the same as for the "stage control" method. That claim is justified in this appendix.

In a PLA having $\zeta$ states, the number of output lines, o, is determined by the number of state bits ($\lceil \log_2 \zeta \rceil$) which must be fed back to the input through a register, plus other required control lines.
These include the number of output lines used to control registers ($R_c$), multiplexers (M), auxiliary hardware such as ALUs (A), and the number of additional output lines associated with loop control ($o_{loop}$) and conditional branches ($o_{cond}$). Under the "register control" method, the number of outputs is

    $o = \lceil \log_2 \zeta \rceil + R_c + M + A + o_{loop} + o_{cond}$    (C.1.5)

With the "stage control" method, the number of product terms is fixed. This may not be true in the "register control" method if there are conditionals, loops, multiplexer and auxiliary control. Since the goal is to minimize the number of product terms for "register control", M, A, $o_{loop}$, and $o_{cond}$ will not be considered for this discussion.

The $R_c$ register control lines are individually labeled as $R_1, R_2, \ldots, R_{R_c}$. Operation of a given $R_y$ output line in a PLA is its list of product terms, $pt_a$ (where a is some unique identifier for each term), that are ORed together,

    $R_y = pt_a \cup pt_b \cup \ldots$    (C.1.6)

Similarly, each next state output line, $St_x$, is also constructed by ORing together one or more product terms. By the uniqueness requirement of the register output lines, where no two lines may produce identical control over all states,

    $R_x \neq R_y, \quad x \neq y$    (C.1.7)

In other words, $R_x$ and $R_y$ must differ by at least one product term or they are not unique.

In the "stage control" method, a product term is constructed for each datapath state. Hence, the number of product terms (p) is equal to the number of control states for a basic PLA, or

    $p = \zeta$    (C.1.8)

This is also clearly the maximum number of product terms needed for "register control". In order to reduce the number of product terms, two conditions must hold:

1. There exists at least one product term, $pt_a$, which can be combined with another product term, $pt_b$, where $a \neq b$.

2.
In any PLA output line where pt_a is one of the ORed terms shown in Equation C.1.6, pt_b must also be one of the ORed terms.

These conditions will now be examined.

1: Each product-term can be described by ANDed elements of one or more input lines (i) (or their inverse, \bar{i}) as

    pt_a = i_1 \cap i_2 \cap \dots    (C.1.9)

The "size" of the product-term, |pt_a|, is the total number of elements which are ANDed together. (An element is either some input line or its inverse.) Given two product terms, pt_i and pt_j, to combine them into a single product term, pt_k, two criteria must be met.

First, it must be feasible to combine the Boolean terms. If the "size" of one differs from the other, it is not possible to combine them. Thus,

    |pt_i| = |pt_j|    (C.1.10)

Every element in pt_i must also identically match some element in pt_j, excepting one-and-only-one element in pt_i and pt_j which do not match. This element in pt_i, i_a, and in pt_j, i_b, must be the inverse of one another:

    i_a = \bar{i_b}    (C.1.11)

If pt_i and pt_j match in all elements, then pt_i and pt_j are redundant, which is not allowed. If pt_i and pt_j have two or more elements which are inverses as defined in C.1.11, it is not possible to combine the two product terms.

Since the PLA as constructed in this thesis has sequential binary representations for each state, this implies that there are two states either 1, 2, or some power-of-2 states apart that can have their inputs combined into a single product term.

2: The second criterion for combining pt_i and pt_j entails that neither of these product terms is used individually. This exclusion must apply to both register control and state output lines. Hence, we have the following conditions.

1.
There exists no register control line R_a such that

    (pt_i \in R_a \wedge pt_j \notin R_a) \vee (pt_i \notin R_a \wedge pt_j \in R_a)    (C.1.12)

Thus, for every register control line where pt_i forms part of the control equation, pt_j is also part of this equation. This implies that R_a is asserted for some states that are exactly 1, 2, 4, or some power-of-2 number of states apart.

2. There also exists no state bit control line St_b such that

    (pt_i \in St_b \wedge pt_j \notin St_b) \vee (pt_i \notin St_b \wedge pt_j \in St_b)    (C.1.13)

Since state bits are sequential binary representations in PASTA, this condition will usually hold, unless the number of states in the PLA is not a power-of-2.

These severe conditions limit the circumstances where the number of product terms in the register control method can be reduced to less than the stage control method. Not only must a register be operated in specific states, but the PLA should have a power-of-2 number of states.

It is possible to construct artificial designs where the conditions described can be met. However, upon examining typical designs, it is clear that normal designs do not have the degree of control symmetry necessary. Registers, particularly those which have been optimally shared, have rarely exhibited any control ordering in the designs produced in this thesis. Only two cases have arisen experimentally where these conditions were met; each had one fewer product-term than expected.

Appendix D
PASTA Usage: Inputs and Outputs

PASTA is self-prompting for all parameters needed. However, it does rely on an input file and has a terse display format, both of which are explained here.

D.1 PASTA Input

The input file to PASTA contains as much information as the user can estimate (or the actual results) for the data path scheduling.
This file contains one or more descriptions of the data path to be controlled; each description is in the following form:

    = stages registers [multiplex]
    [(expression describing loops and conditionals)]

The first line of each description is started by an equal sign ('=') which signifies the start of a new PLA estimate. On the same line, in order, are:

• stages, which is the number of stages into which the data path is divided
• registers, which is the number of registers used in the data path (either an estimate or the actual value)
• multiplex, which is the number of additional multiplexer control lines needed, if any (optional). Normally, the state bits of the PLA are sufficient for controlling any arbitrary n:1 multiplexer in practical designs. However, not all designs can be controlled in this fashion efficiently, so this option has been provided.

If the data path being controlled has any conditional paths or loops, then an expression describing this additional control is necessary. (In the absence of this expression, it is assumed that the data path has no loops or conditionals.) The entire expression must be surrounded by parentheses '()' and can be up to 1024 characters over multiple lines. It has the following pseudo-BNF definition; explicit characters are in single quotes.

    expr     := '(' expr ')' | expr PATHOP expr | con-expr | loopv LOOPOP expr | constant
    loopv    := desig[:constant] | constant[:constant]
    con-expr := desig CONOP ' expr expr ...
]
    desig        := variable | variable '[' constant ']'
    variable     := alpha-string
    alpha-string := alpha-char | alpha-char string
    string       := alpha-char | digit
    constant     := digit | digit constant
    digit        := '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
    alpha-char   := 'A-Za-z'
    PATHOP       := '&' | '+'
    CONOP        := '|'
    LOOPOP       := '*'

The operators recognized and their meanings, listed from highest precedence to lowest, are:

• | for conditional paths (Ex: var|4,3 means 4 states OR 3 states are executed, depending upon the value of 'var'.)
• * for loop designator (Ex: 4*3 means 4 iterations of 3 states)
• & for parallel paths (Ex: 4&3 means 4 states in parallel with 3)
• + for serial paths (Ex: 4+3 means 4 states followed by 3 more)

Conditionals (and sometimes loops) refer to a variable which is used to make a conditional (loop) decision. This variable name may be up to 31 characters and should be the same name for all loops/conditionals which utilize that variable. The variable is optionally followed by a bitwidth; a default value of one bit is otherwise assumed. Note that the bitwidth designator can only be used at the first occurrence of the variable; otherwise an error condition is produced.

Each branch of a multi-way conditional is separated by a comma in the conditional expression. Regardless of whether the conditional variable has had its bitwidth declared, PASTA will increase its bitwidth (if necessary) so that it can accommodate the multi-way selection. Thus, the user is free to leave off the bitwidth for selection of multi-way conditionals and PASTA will automatically make it the minimum number possible.

There are two types of loops which can be specified: constant and variable. A constant loop will either be unwound or controlled via an external counter, depending upon the number of states involved and the loop count.
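The operator precedence and the automatic bitwidth rule described above can be illustrated with a short sketch. The Python below is not PASTA's parser. It handles only constant expressions built from '+', '&', '*' and parentheses, and it assumes (the text does not state this) that serial paths add their state counts, parallel paths take the longer branch, and a constant loop is fully unwound. The function names `worst_case_states` and `min_select_bits` are hypothetical.

```python
import re

# Binding strength following the precedence list: '*' > '&' > '+'.
PRECEDENCE = {"+": 1, "&": 2, "*": 3}


def tokenize(text):
    """Split a constant-only expression into numbers, operators, and parentheses."""
    compact = re.sub(r"\s+", "", text)
    tokens = re.findall(r"\d+|[()&*+]", compact)
    if "".join(tokens) != compact:
        raise ValueError(f"unsupported character in {text!r}")
    return tokens


def combine(op, left, right):
    """Assumed semantics: '+' adds, '&' takes the longer path, '*' unwinds a loop."""
    if op == "+":
        return left + right
    if op == "&":
        return max(left, right)
    return left * right


def parse(tokens, min_prec=1):
    """Precedence-climbing parser; operators are treated as left-associative."""
    if not tokens:
        raise ValueError("unexpected end of expression")
    tok = tokens.pop(0)
    if tok == "(":
        left = parse(tokens, 1)
        if not tokens or tokens.pop(0) != ")":
            raise ValueError("missing ')'")
    elif tok.isdigit():
        left = int(tok)
    else:
        raise ValueError(f"unexpected token {tok!r}")
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = combine(op, left, right)
    return left


def worst_case_states(text):
    """State count of a constant expression under the assumptions stated above."""
    tokens = tokenize(text)
    value = parse(tokens)
    if tokens:
        raise ValueError(f"unexpected token {tokens[0]!r}")
    return value


def min_select_bits(branches):
    """Smallest bitwidth able to select among `branches` conditional paths."""
    return max(1, (branches - 1).bit_length())
```

Under these assumptions, `worst_case_states("4*3+2&5")` groups as (4*3) + (2&5), and `min_select_bits(3)` gives the two bits needed for a three-way conditional.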
A variable loop uses 'desig' as the variable to test after each loop to determine continuation. In addition, there is a constant after the colon (':') which can be used to specify the degree of register reuse for each loop (see the paper for a description of this term, Kreg). This value can ONLY be between 0 and 100. A 0 indicates that all registers within the loop are reused during each iteration; 100 indicates that each iteration uses a different register.

D.2 PASTA Output

The output header and results appear as follows:

                                    Area (mil sq)
    Sta Reg Ty Ins Out Pts Fld Dnsity Unfold Folded

where
    'sta'    is the estimated number of PLA states
    'reg'    is the number of registers
    'ty'     is the PLA type:
               "St" is the stage control method PLA
               "Re" is the register control method PLA
    'ins'    is the number of inputs into the PLA (including state bits)
    'outs'   is the number of outputs from the PLA (including state bits)
    'pts'    is the number of product-terms in the PLA
    'fld'    is the expected number of output folds possible in the PLA
    'dnsity' is the cross-point density or the number of transistors
             divided by the number of intersections (output side ONLY)
    'unfold' is the estimated area of the unfolded PLA
    'folded' is the estimated area of the folded PLA

The area includes any external counters or other hardware needed by the controller itself.

Table D.1: Fixed Values in PASTA

    Name     Meaning
    c1       PLA aggregate block size for all "interior" blocks;
             these have in/out/pterms passing through them.
    c2       PLA block size of pterm pre-charge/load blocks
    c3       PLA block size of in/out pre-charge/load/driver blocks
    c4       PLA area offset constant not related to in/out/pterms
    a_cntr   Area of 1-bit cascadable counter with load/decr/zero detect
    a_demux  Area of 1-bit 1:2 demux
    a_and    Area of 2-input AND gate

D.3 Notes

The PASTA program is currently hardcoded for the Berkeley toolset as well as for the external counter plus associated hardware.
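As a rough illustration of how the fixed values in Table D.1 enter an area estimate, the sketch below evaluates the external-counter expression of Equation C.4 in Appendix C, read here as c1(n + 3) + c2 + c3(n + 3) + n·a_cntr. That reading of the leading coefficient as c1 is an assumption, the function name `external_counter_area` is hypothetical, and the coefficient values in the usage note are placeholders, not the Berkeley-toolset values compiled into PASTA.

```python
def external_counter_area(n, c1, c2, c3, a_cntr):
    """Incremental area for an n-bit external loop counter (assumed form of Eq. C.4).

    n      -- number of counter bits
    c1     -- interior PLA block size
    c2     -- product-term pre-charge/load block size
    c3     -- input/output pre-charge/load/driver block size
    a_cntr -- area of one cascadable counter bit
    """
    if n < 1:
        raise ValueError("counter width must be at least one bit")
    pla_growth = c1 * (n + 3) + c2 + c3 * (n + 3)  # extra PLA area
    counter_hw = n * a_cntr                        # external counter hardware
    return pla_growth + counter_hw
```

For example, `external_counter_area(2, c1=1.0, c2=2.0, c3=3.0, a_cntr=10.0)` sums 5 + 2 + 15 + 20; real estimates would substitute the coefficients actually compiled into the program.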
Static values in the file "pasta8.h" can be altered and PASTA recompiled with the new coefficients. These are the values listed in Table D.1.

D.4 PASTA Example

    ***************************************************************
    *                                                             *
    *   USC - Advanced Design Automation System (ADAM)            *
    *                                                             *
    *   PLA Control Area Estimation                               *
    *   for Pipelined and Non-Pipelined Designs                   *
    *                                                             *
    *   Mitchell Mlinar                                           *
    *                                                             *
    *   Version 2.73  Aug 1989                                    *
    *                                                             *
    ***************************************************************

    Enter PASTA input filename: testout.pasta

                                    Area (mil sq)
    Sta Reg Ty Ins Out Pts Fld Dnsity Unfold Folded
      8   1 Re   3   4   8   1 0.6250 212.06 212.06
     14   1 Re   4   5  14   1 0.4643 367.21 321.33
     14   1 Re   5   5  15   1 0.4062 446.57 396.57
     12   1 Re   5   5  13   1 0.3810 413.09 367.21
    396.57 446.57 1 0.4062 15 14
    713.41 646.95 2 0.4653 24 22

    CPU processing time: 0.040 seconds
    adam[2]

Appendix E
REG Usage: Inputs and Outputs

REG estimates the register count for pipelined and non-pipelined designs given the dataflow graph. For pipelined designs, the number of microcycles (partitions/initiation interval) must also be given.

E.1 REG Input

REG can be run interactively by entering:

    reg

The program will prompt for the parameters it needs. Upper case words signify program output.

ENTER P FOR PIPELINED OR N FOR NON_PIPELINED DESIGN:
Type P if register estimation of pipelined design is to be performed. For non-pipelined design style type N.

ENTER DATAFLOW GRAPH NAME:
Enter the name of the file containing the data flow graph. For the format of the dataflow graph, see the documentation on MAHA.

ENTER MODULE LIBRARY NAME:
Enter the name of the module library.
For the format of the module library, see the documentation on MAHA.

ENTER PARTITIONS AND LATENCY:
This question is only asked for pipelined design style. Type the partition count and the initiation interval (the latency), in the order the prompt gives them. From these two parameters the value of the microcycle is computed, which is used to estimate the register count for pipelined designs.

E.2 REG Output

Included here is a sample of REG output. Both non-pipelined and pipelined examples are shown.

E.2.1 Non-pipelined Register Estimation

    ***************************************************************
    **                                                           **
    **   USC - Advanced Design Automation System (ADAM)          **
    **                                                           **
    **   Register Area Estimation                                **
    **   for Pipelined and Non-Pipelined Designs                 **
    **                                                           **
    **   Mitchell Mlinar                                         **
    **                                                           **
    **   Version 1.92  Nov 1989                                  **
    **                                                           **
    ***************************************************************

    Enter P for pipelined or N for non-pipelined design: n
    Enter dataflow graph name: /usr/eve3/mlinar/mahac/ar.dfg
    Non-pipelined design with graph length of 8
    Minimum registers (partitions =1): 2 (32 regbits)
    Estimate registers for partitions >1: 5 (80 regbits)
    Runtime: 0.160 seconds
    adam[2]

E.2.2 Pipelined Register Estimation

    ***************************************************************
    **                                                           **
    **   USC - Advanced Design Automation System (ADAM)          **
    **                                                           **
    **   Register Area Estimation                                **
    **   for Pipelined and Non-Pipelined Designs                 **
    **                                                           **
    **   Mitchell Mlinar                                         **
    **                                                           **
    **   Version 1.92  Nov 1989                                  **
    **                                                           **
    ***************************************************************

    Enter P for pipelined or N for non-pipelined design: p
    Enter dataflow graph name: /usr/eve3/mlinar/mahac/ar.dfg
    Pipelined design with graph length of 8
    Setup runtime: 0.180 seconds
    Enter partitions and latency: 6 1
    RPmin = 22 RPmax = 72 Regs = 47 (752 regbits) 0.000 seconds
    Enter partitions and latency: 12 4
    RPmin = 16 RPmax = 36 Regs = 26 (416 regbits) 0.000 seconds
    Enter partitions and latency: 19 16
    RPmin = 14 RPmax = 24 Regs = 19 (304 regbits) 0.000 seconds
    Enter partitions and latency:
    adam[3]

Appendix F
MUX Usage: Inputs and Outputs

MUX estimates the multiplexer count and size (in bits) for pipelined and non-pipelined designs given the dataflow graph. For pipelined designs, the number of microcycles (partitions/initiation interval) must also be given.

F.1 MUX Input

MUX can be run interactively by entering:

    mux

The program will prompt for the parameters it needs. Upper case words signify program output.

ENTER P FOR PIPELINED OR N FOR NON_PIPELINED DESIGN:
Type P if multiplexer estimation of pipelined design is to be performed. For non-pipelined design style type N.

ENTER DATAFLOW GRAPH NAME:
Enter the name of the file containing the data flow graph. For the format of the dataflow graph, see the documentation on MAHA.

ENTER MODULE LIBRARY NAME:
Enter the name of the module library. For the format of the module library, see the documentation on MAHA.

ENTER PARTITIONS (i-j) AND NUMBER OF REGISTERS (k-l):
This is asked for non-pipelined design style. Enter the number of partitions, which must lie between integers i and j, and the number of registers, which must lie between k and l.
The number of registers can be obtained from register estimation or actual synthesis.

ENTER PARTITIONS (i-j) LATENCY (m-n) AND NUMBER OF REGISTERS (k-l):
This is the question asked for pipelined design style. Enter the number of partitions, which must lie between integers i and j; the value of the initiation interval (i.e. the number of clock cycles between two successive initiations of the pipeline), which must lie between m and n; and the number of registers, which must lie between k and l. The number of registers can be obtained from register estimation or synthesis.

ENTER NUMBER OF MODULES FOR (module-name):
Type the quantity of modules of type module-name.

F.2 MUX Output

Included here is an example showing MUX output.

    ***************************************************************
    ***                                                         ***
    ***   USC - Advanced Design Automation System (ADAM)        ***
    ***                                                         ***
    ***   Multiplexer Area Estimation                           ***
    ***   for Pipelined and Non-Pipelined Designs               ***
    ***                                                         ***
    ***   Mitchell Mlinar                                       ***
    ***                                                         ***
    ***   Version 1.42  Oct 1989                                ***
    ***                                                         ***
    ***************************************************************

    Enter P for pipelined or N for non-pipelined design: n
    Enter dataflow graph name: fir.dfg
    Enter module library name: lib.ff
    Values in graph: 47 Incoming: 24 Outgoing: 1
    Input processing time: 0.020
    Non-pipelined design with graph length of 9

    Enter partitions (1-25) and # of registers (1-23): 2 8
    Enter # of modules for add16 [maximum: 15, default: 8]: 8
    Enter # of modules for mul16 [maximum: 8, default: 4]: 7
    Op: 8 add16     Mux/pin: 6   Subtotal 2:1 = 192
    Op: 7 mul16     Mux/pin: 1   Subtotal 2:1 = 32
    Registers = 8   Muxes: 3     Subtotal 2:1 = 48
    Total 2:1 muxes = 272 in 2 stages [runtime of 0.080 sec.]
E nter p a r t i t i o n s (1 -2 5 ) and # o f r e g i s t e r s (1 -2 3 ) : 5 6 349 Enter # of modules for addl6 [maximum: 15, default: 3]: 4 Enter # of modules for mull6 [maximum: 8 , default: 2]: 2 Op: 4 addl6 Mux/pin: 9 Subtotal 2:1 = 288 Op: 2 mull6 Mux/pin: 5 Subtotal 2:1 = 160 Registers = 6 Muxes: 4 Subtotal 2:1 = 64 Total 2:1 muxes = 512 in 5 stages [runtime of 0.160 sec.] Enter partitions (1-25) and # of registers (1-23) : 15 5 Enter # of modules for addl6 [maximum: 15, default: 1]: Enter # of modules for mull6 [maximum: 8 , default: 1]: Op: 1 addl6 Mux/pin: 9 Subtotal 2:1 = 288 Op: 1 mull6 Mux/pin: 6 Subtotal 2:1 = 192 Registers = 5 Muxes: 1 Subtotal 2:1 = 16 Total 2:1 muxes = 496 in 15 stages [runtime of 0.160 sec.] Enter partitions (1-25) and # of registers (1-23) : adam[3] 350 Appendix G * Floating Point Coprocessor Description M ike M cFarland of Boston College and AT&T Bell Laboratories developed th e floating point coprocessor exam ple used in this thesis. B oth th e Value Trace and P ark N orm al Form descriptions are included. G .l Floating Point Coprocessor Value Trace v26 %v25 001001001000010000-US * * REPEAT1; 11 rl5 <8 > A.EXP 12 rl6 <63> T.FRACT 13 r ! 8 <8 > T.EXP xl +:US (v26.il:A.EX) (*.c2=l) pi * <9> * x2 <r> (xl.pl) (*.c3=0) pi rl5 <8 > A.EXP x3 SRO:US (v26.i2:T.FR) (*.cl=l) pi rl6 <63> T.FRACT x4 EQL:US (v26.i3:T.EX) (x2.pl:A.EX) pi * <1> * x5 SELECT (x4.pl) bl.x5 BRANCH [TRUE] bl.x5 ENDBR b2.x5 BRANCH [0] b2.x5 ENDBR 351 x5 ENDSEL x8 LEAVE @v25:PREN (x2.pl:A.EX) (x3.pl:T.FR) 01 rl5 <8 > A.EXP 02 rl6 <63> T.FRACT v27 * / . v 25 001001001000010000-US * * REPEAT2; 11 rl8 <8 > T.EXP 12 rl6 <63> T.FRACT 13 rl5 <8 > A.EXP xl +:US (v27.il:T.EX) (*.c2=l) pi * <9> * x2 <r> (xl.pl) (*.c3=0) pi rl8 <8 > T.EXP x3 SRO:US (v27.i2:T.FR) (*.cl=l) pi r ! 
6 <63> T.FRACT x4 EQL:US (x2.pl:T.EX) (v27.i3:A.EX) pi * <1> * x5 SELECT (x4.pl) bl.x5 BRANCH [TRUE] bl.x5 ENDBR b2.x5 BRANCH [0] b2.x5 ENDBR x5 ENDSEL x8 LEAVE @v25:PREN (x2.pl:T.EX) (x3.pl:T.FR) 01 rl8 <8 > T.EXP 02 rl6 <63> T.FRACT v28 %v24 001001000000010000-US * * NORMALIZE; 11 rl3 <63> A.FRACT 12 rl5 <8 > A.EXP 13 r30 <140> WAIT 352 14 r9 <5> FP.IR 15 rl6 <63> T.FRACT ±6 rl8 <8 > T.EXP ±7 rl7 <1> T.SIGN 18 rl4 <1> A.SIGN 19 rl9 <1> C 110 r4 <1> FP.ERROR 111 r20 <63> FP.REG 112 r21 <8 > EXP.REG 113 r22 <1> SIGN.REG 114 r6 <1> READY 115 r5 <16> DATA 116 r7 <1> Z 117 r8 <1> N 118 r2 <1> GO xl <r> (v28.il:A.FR) (*.c4=62) pi * <1> A.FRACT x2 EQL:US (xl.pl:A.FR) (*.c5=i) pi * <1> * x3 SELECT (x2.pl) bl.x3 BRANCH [TRUE] bl.x3 ENDBR b2.x3 BRANCH [0] b2.x3 ENDBR x3 ENDSEL x5 SLO:US (v28.il:A.FR) (*.cl=l) pi rl3 <63> A.FRACT x6 -:TC (v28.i2:A.EX) (*.c2=i) pi * <9> * x7 <r> (x6 .pl) (*.c3=0) pi rl5 <8 > A.EXP x8 LEAVE @v24:INST (x5.pl:A.FR) (x7.pl:A.EX) ol rl3 <63> A.FRACT 353 o2 r l 5 <8> A.EXP v29 %v24 001001000000010000-US * * IN.NORMALIZE; 11 rl3 <63> A.FRACT 12 rl5 <8 > A.EXP 13 r9 <5> FP.IR 14 rl6 <63> T.FRACT 15 rl8 <8 > T.EXP 16 rl7 <1> T.SIGN 17 rl4 <1> A.SIGN 18 rl9 <1> C 19 r4 <1> FP.ERROR 110 r20 <63> FP.REG 111 r21 <8 > EXP.REG 112 r22 <1> SIGN.REG 113 r6 <1> READY 114 r5 <16> DATA 115 r7 <1> Z 116 r8 <1> N 117 r2 <1> GO xl <r> (v29.il:A.FR) (*.c4=62) pi * <1> A.FRACT x2 EQL:US (xl.pl:A.FR) (*.c5=l) pi * <1> * x3 SELECT (x2.pl) bl.x3 BRANCH [TRUE] bl.x3 ENDBR b2.x3 BRANCH [0] b2.x3 ENDBR x3 ENDSEL x5 SLO:US (v29.il:A.FR) (*.cl=l) pi rl3 <63> A.FRACT 354 [ x6 -:TC (v29.i2:A.EX) (*.c2=l) ! pi * <9> * x7 <r> (x6 .pl) (*.c3=0) pi rl5 <8 > A.EXP x8 LEAVE <3v24:INST (x5.pl:A.FR) (x7.pl:A.EX) 01 rl3 <63> A.FRACT 02 rl5 <8> A.EXP v25 %v24 00100000000010000-US * * PRENORMALIZE; 11 rl5 <8 > A.EXP 12 r ! 
8 <8 > T.EXP 13 rl6 <63> T.FRACT xl NEQ:US (v25.il:A.EX) (v25.i2:T.EX) pi * <l> * x2 SELECT (xl.pl) bl.x2 BRANCH [TRUE] x3 GTR:US (v25.il:A.EX) (v25.i2:T.EX) pi * <l> * x4 SELECT (x3.pl) bl.x4 BRANCH [0] x5 CALL 0v26:REPE (v25.il:A.EX) (v25.i3:T.FR) (v25.i2:T.EX) pi rl5 <8 > A.EXP p2 rl6 <63> T.FRACT bl.x4 ENDBR (x5.p2:T.FR) (x5.pl:A.EX) (v25.i2:T.EX) b2.x4 BRANCH [l] x6 CALL <§v27 :REPE (v25 . i2 :T .EX) (v25 . i3 :T.FR) (v25 . il: A .EX) pi rl8 <8 > T.EXP p2 rl6 <63> T.FRACT b2.x4 ENDBR (x6.p2:T.FR) (v25.il:A.EX) (x6 .pl:T.EX) x4 ENDSEL pi rl6 <63> T.FRACT p2 rl5 <8> A.EXP 355 p3 rl8 <8 > T.EXP bl.x2 ENDBR (x4.p3:T.EX) (x4.p2:A.EX) (x4.pl:T.FR) b2.x2 BRANCH [0] b2.x2 ENDBR (v25.i2:T.EX) (v25.il:A.EX) (v25.i3:T.FR) x2 ENDSEL pi rl8 <8 > T.EXP p2 rl5 <8 > A.EXP p3 rl6 <63> T.FRACT x7 LEAVE <3v25:PREN (x2.pl:T.EX) (x2.p3:T.FR) (x2.p2:A.EX) 01 rl8 <8 > T.EXP 02 rl6 <63> T.FRACT 03 r!5 <8 > A.EXP v24 %v23 00100000010010000-US * * INST.LOOP; 11 r2 <1> GO 12 r3 <6> INST.IN 13 r20 <63> FP.REG 14 rl6 <63> T.FRACT 15 r21 <8 > EXP.REG 16 rl8 <8 > T.EXP 17 r22 <1> SIGN.REG 18 rl7 <1> T.SIGN 19 rl3 <63> A.FRACT 110 rl5 <8 > A.EXP 111 rl4 <1> A.SIGN 112 rl9 <1> C 113 r4 <1> FP.ERROR 114 r6 <1> READY 115 r5 <16> DATA 116 r7 <1> Z 117 r8 <1> N xl <r> (v24.il:G0) (*.c3=0) 356 pi r2 <1> GO x2 WAIT (xl.pl:GO) pi r30 <140> WAIT x3 <r> (v24.i2:INST) (*.c3=0) * pi r3 <6> INST.IN x4 <r> (x3.pl:INST) (*.c3=0) pi r9 <5> FP.IR x5 <r> (x4.pl:FP.I) (*.c6=3) pi * <2> OP x6 SELECT (x5.pl:OP) bl.x6 BRANCH [0] x7 <r> (x4.pl:FP.I) (*.c7=0) pi * <2> REG.NUM x8 [r] (v24.i3:FP.R) (x7.pl:REG) pi rl6 <63> T.FRACT x9 <r> (x4.pl:FP.I) (*.c7=0) pi * <2> REG.NUM xlO [r] (v24.i5:EXP) (x9.pl:REG) pi rl8 <8> T.EXP xll <r> (x4.pl:FP.I) (*.c7=0) pi * <2> REG.NUM xl2 [r] (v24.i7:SIGN) (xll.pl:REG) pi rl7 <1> T.SIGN xl3 EQL:US (x8 .pl:T.FR) (*.c8=0) pi * <l> * xl4 SELECT (xl3.pl) b1.x14 BRANCH [TRUE] b1.x14 ENDBR b2.xl4 BRANCH [0] b2.xl4 ENDBR xl4 ENDSEL xl6 EQL:US (v24.i9:A,FR) (*.c8=0) pi * <1> * 357 xl7 
SELECT (xl6 .pl) bl.xl7 BRANCH [TRUE] xl8 0 (xl0.pl:T.EX) (xl2.pl:T.SI) pi * <9> * xl9 © (xl8 .pl) (x8 .pl:T.FR) pi * <72> * x20 <r> (xl9.pl) (*.c3=0) pi rl3 <63> A.FRACT x21 <r> (xl9.pl) (*.c9=63) pi rl4 <1> A.SIGN x22 <r> (xl9.pl) (*.cl0=64) pi rl5 <8> A.EXP x23 <r> (x4.pl:FP.I) (*.cll=2) pi * <1> MODE x24 SELECT (x23.pl:MODE) bl.x24 BRANCH [TRUE] x25 NOT:US (x21.pi:A.SI) pi rl4 <1> A.SIGN bl.x24 ENDBR (x25.pi:A.SI) b2.x24 BRANCH [0] b2.x24 ENDBR (x21.pi:A.SI) x24 ENDSEL pi rl4 <1> A.SIGN bl.xl7 ENDBR (x24.pl:A.SI) (x22.pi:A.EX) (x20.pi:A.FR) b2.xl7 BRANCH [0] b2.xl7 ENDBR (v24.ill:A.SI) (v24.ilO:A.EX) (v24.i9:A.FR) xl7 ENDSEL pi rl4 <1> A.SIGN p2 rl5 <8 > A.EXP p3 rl3 <63> A.FRACT x27 CALL <§v25 :PREN (xl7 .p2 : A .EX) (xlO.pl :T.EX) (x8 .pl:T.FR) pi rl8 <8 > T.EXP p2 rl6 <63> T.FRACT 3.58 p3 rl5 <8 > A.EXP x28 <r> (x4.pl:FP.I) (*.cll=2) pi * <1> MODE x29 XOE:US (xl7.pi:A.SI) (xl2.pl:T.SI) pi * <1> * x30 XORiUS (x28.pl:MODE) (x29.pl) pi * <1> * x31 SELECT (x30.pl) b1.x31 BRANCH [0] x32 +:US (xl7.p3:A.FR) (x27.p2:T.FR) pi * <64> * x33 <r> (x32.pl) (*.c3=0) pi rl3 <63> A.FRACT x34 <r> (x32.pl) (*.c9=63) pi rl9 <1> C x35 SELECT (x34.pl:C) bl.x35 BRANCH [TRUE] x36 SRI:US (x33.pi:A.FR) (*.cl=l) pi rl3 <63> A.FRACT x37 +:US (x27.p3:A.EX) (*.c2=l) pi * <9> * x38 <r> (x37.pl) (*.c3=0) pi rl5 <8 > A.EXP bl.x35 ENDBR (x38.pi:A.EX) (x36.pi:A.FR) b2.x35 BRANCH [0] b2.x35 ENDBR (x27.p3:A.FR) (x33.pi:A.FR) x35 ENDSEL pi rl5 <8 > A.EXP p2 rl3 <63> A.FRACT bl.x31 ENDBR (x35.p2:A.FR) (x35.pi:A.EX) (x34.pl:C) (xi7.pl:A.SI) b2.x31 BRANCH [1] x39 -:TC (xl7.p3:A.FR) (x27.p2:T.FR) pi * <64> * 359 x40 <r> (x39.pl) (*.c3=0) pi rl3 <63> A.FRACT x41 <r> (x39.pl) (*.c9=63) pi rl9 <1> C x42 EQL:US (x40.pi:A.FR) (*.c8=0) pi * <1> * x43 SELECT (x42.pl) bl.x43 BRANCH [TRUE] bl.x43 ENDBR (*.c3=0) (*.cl2=0) b2.x43 BRANCH [0] b2.x43 ENDBR (xl7.pi:A.SI) (x27.p3:A.EX) x43 ENDSEL pi rl4 <1> A.SIGN p2 rl5 <8 > A.EXP x45 NOT:US (x41.pl:C) pi * <1> * x46 SELECT (x45.pl) bl.x46 BRANCH [TRUE] x47 — :TC 
(x40.pl:A.FR) pi * <64> * x48 <r> (x47.pl) (*.c3=0) pi rl3 <63> A.FRACT x49 NOT:US (x43.pl:A.SI) pi rl4 <1> A.SIGN bl.x46 ENDBR (x49.pi:A.SI) (x48.pi:A.FR) b2.x46 BRANCH [0] b2.x46 ENDBR (x43.pl:A.SI) (x40.pl:A.FR) x46 ENDSEL pi rl4 <1> A.SIGN p2 rl3 <63> A.FRACT x50 CALL @v28:N0RM (x46.p2:A.FR) (x43.p2:A.EX) (x2.pl:WAIT) (x4.pl:FP.I) (x27.p2:T.FR) (x27.pi:T.EX) (xl2.pi:T.SI) (x46.pl:A.SI) (x41.pl:C) (v24.il3:FP.E) (v24.i3:FP.R) 360 (v24.i5:EXP) (v24.i7:SIGN) (v24.i14:READ) (v24.il5:DATA) (v24.il6:Z) (v24.il7:N) (v24.il:G0) pi rl3 <63> A.FRACT p2 rl5 <8 > A.EXP b2.x31 ENDBR (x50.pi:A.FR) (x50.p2:A.EX) (x41.pl:C) (x46.pl:A.SI) x31 ENDSEL pi rl3 <63> A.FRACT p2 r!5 <8> A.EXP p3 rl9 <1> C p4 rl4 <1> A.SIGN bl.x6 ENDBR (x31.p4:A.Sl) (x31.p3:C) (x31.p2:A.EX) (x31.pl:A.FR) (x27.p2:T.FR) (x27.pl:T.EX) (xl2.pl:T.SI) (v24.il3:FP.E) (v24.i3:FP.R) (v24.i5:EXP) (v24.i7:SIGN) (v24.i!5:DATA) (x2.pl:WAIT) (v24.i14:READ) b2.x6 BRANCH [l] x51 <r> (x4.pl:FP.I) (*.c7=0) pi * <2> REG.NUM x52 [r] (v24.13:FP.R) (x51.pl:REG) pi rl6 <63> T.FRACT x53 <r> (x4.pl:FP.I) (*.c7=0) pi * <2> REG.NUM x54 [r] (v24.i5:EXP) (x53.pl:REG) pi rl6 <8> T.EXP x55 <r> (x4.pl:FP.I) (*.c7=0) pi * <2> REG.NUM x56 [r] (v24.i7:SIGN) (x55.pl:REG) pi rl7 <1> T.SIGN x57 <r> (x4.pl:FP.I) (*.cll=2) pi * <1> MODE x58 SELECT (x57.pi:MODE) bl.x58 BRANCH [0] x59 EQL:US (x52.pi:T.FR) (*.c8=0) pi * <1> * 361 x60 EQL:US (v24.i9:A.FR) (*.c8=0) pi * <1> * x61 DR:US (x59.pl) (x60.pl) pi * <1> * x62 SELECT (x61.pl) bl.x62 BRANCH [TRUE] bl.x62 ENDBR (*.c8=0) b2.x62 BRANCH [0] b2.x62 ENDBR (v24.i9:A.FR) j x62 ENDSEL | pi rl3 <63> A.FRACT x64 X0R:US (v24.ill:A.SI) (x56.pi:T.SI) pi rl4 <1> A.SIGN x65 +:US (v24.ilO:A.EX) (x54.pl:T.EX) pi * <9> * | x6 6 <r> (x65.pl) (*.c3=0) I pi * <8 > * x67 -:TC (x6 6 .pl) (*.cl3=128) pi * <9> * x6 8 <r> (x67.pl) (*.c3=0) pi rl5 <8> A.EXP x69 *:US (x62.pl:A.FR) (x52.pi:T.FR) pi * <126> * x70 <r> (x69.pl) (*.c4=62) pi * <62> * x71 PADO:US (x70.pl) pi rl3 <63> A.FRACT x72 <r> (x71.pi:A.FR) (*.c4=62) 
pi * <l> A.FRACT x73 EQL:US (x72.pl:A.FR) (*.c3=0) pi * <1> * x74 SELECT (x73.pl) bl.x74 BRANCH [TRUE] 362 x75 SLO:US (x71.pi:A.FR) (*.cl=l) pi rl3 <63> A.FRACT x76 -:TC (x6 8 .pi:A.EX) (*.c2=l) pi * <9> * x77 <r> (x76.pl) (*.c3=0) pi rl5 <8 > A.EXP bl.x74 ENDBR (x77.pl:A.EX) (x75.pi:A.FR) b2.x74 BRANCH [0] b2.x74 ENDBR (x6 8 .pl:A.EX) (x71.pi:A.FR) x74 ENDSEL pi rl5 <8 > A.EXP p2 rl3 <63> A.FRACT bl.x58 ENDBR (x74.p2:A.FR) (x74.pi:A.EX) (x64.pl:A.SI) (v24.il3:FP. b2.x58 BRANCH [1] x78 EQL:US (x52.pi:T.FR) (*.c8=0) pi * <1> * x79 SELECT (x78.pl) bl.x79 BRANCH [TRUE] x80 <w> (v24.113:FP.E) (*.c5=l) (*.c3=0) pi * <1> FP.ERROR bl.x79 ENDBR (x80.pi:FP.E) b2.x79 BRANCH [0] b2.x79 ENDBR (v24.il3:FP.E) x79 ENDSEL pi r4 <1> FP.ERROR x82 EQL:US (v24.i9:A.FR) (*.c8=0) pi * <1> * x83 SELECT (x82.pl) bl.x83 BRANCH [TRUE] bl.x83 ENDBR b2.x83 BRANCH [0] b2.x83 ENDBR x83 ENDSEL 363 x85 XOR:US (v24.ill:A.SI) (x56.pi:T.SI) pi rl4 <1> A.SIGN x8 6 GEQ:US (v24.i9:A.FR) (x52.pi:T.FR) pi * <1> * x87 SELECT (x8 6 .pl) bl.x87 BRANCH [TRUE] x8 8 SRO:US (v24.i9:A.FR) (*.cl=l) pi rl3 <63> A.FRACT x89 +:US (v24.ilO:A.EX) (*.c2=l) pi * <9> * x90 <r> (x89.pl) (*.c3=0) pi rl5 <8 > A.EXP bl.x87 ENDBR (x90.pi:A.EX) (x8 8 .pi:A.FR) b2.x87 BRANCH [0] b2.x87 ENDBR (v24.ilO:A.EX) (v24.i9:A.FR) x87 ENDSEL pi rl5 <8 > A.EXP p2 rl3 <63> A.FRACT x91 0 (x87.p2:A.FR) (*.cl4=0) pi * <125> * x92 /:US (x91.pl) (x52.pi:T.FR) pi * <125> * x93 <r> (x92.pl) (*.c3=0) pi rl3 <63> A.FRACT x94 -:TC (x87.pi:A.EX) (x54.pl:T.EX) pi * <9> * x95 <r> (x94.pl) (*.c3=0) pi * <8 > * x96 +:US (x95.pl) (*.cl3=128) pi * <9> * x97 <r> (x96.pl) (*.c3=0) pi rl5 <8 > A.EXP b2.x58 ENDBR (x93.pl:A.FR) (x97.pi:A.EX) (x85.pi:A.SI) 364 (x79.pl:FP.E) x58 ENDSEL pi rl3 <63> A.FRACT p2 rl5 <8> A.EXP p3 rl4 <1> A.SIGN p4 r4 <1> FP.ERROR b2.x6 ENDBR (x58.p3:A.SI) (v24.il2:C) (x58.p2:A.EX) (x58.pl:A.FR) (x52.pi:T.FR) (x54.pi:T.EX) (x56.pl:T.SI) (x58.p4:FP.E) (v24.i3:FP.R) (v24.i5:EXP) (v24.i7:SIGN) (v24.il5:DATA) (x2.pl-.WAIT) (v24. 
i 14:READ) b3.x6 BRANCH [2] x98 <r> (x4.pl:FP.I) (*.cll=2) pi * <1> MODE x99 SELECT (x98.pi:MODE) bl.x99 BRANCH [0] xlOO <r> (x4.pl:FP.I) (*.c7=0) pi * <2> REG.NUM xlOl [r] (v24.i3:FP.R) (xlOO.pi:REG) pi rl3 <63> A.FRACT xl02 <r> (x4.pl:FP.I) (*.c7=0) pi * <2> REG.NUM xl03 [r] (v24.i5:EXP) (xl02.pi:REG) pi rl5 <8 > A.EXP xl04 <r> (x4.pl:FP.I) (*.c7=0) pi * <2> REG.NUM xl05 [r] (v24.i7:SIGN) (xl04.pl:REG) pi rl4 <1> A.SIGN bl.x99 ENDBR (xl05.pl:A.SI) (xl03.pi:A.EX) (xlOl.pi:A.FR) (v24.i7:SIGN) (v24.i5:EXP) (v24.i3:FP.R) b2.x99 BRANCH [l] xl06 <r> (x4.pl:FP.I) (*.c7=0) pi * <2> REG.NUM 365 xl07 [w] (v24.i3:FP.R) (xl06.pi:REG) (v24.i9:A.FR) (*.c3=0) pi r20 <63> FP.REG xl08 <r> (x4.pl:FP.I) (*.c7=0) pi * <2> REG.NUM xl09 [w] (v24.i5:EXP) (xl08.pi:REG) (v24.ilO:A.EX) (*.c3=0) pi r21 <8> EXP.REG xllO <r> (x4.pl:FP.I) (*.c7=0) pi * <2> REG.NUM xlll [w] (v24.i7:SIGN) (xllO.pi:REG) (v24.ill:A.SI) (*.c3=0) pi r22 <1> SIGN.REG b2.x99 ENDBR (v24.ill:A.SI) (v24.ilO:A.EX) (v24.i9:A.FR) (xlll.pi:SIGN) (xl09.pi:EXP) (xl07.pi:FP.R) x99 ENDSEL pi rl4 <1> A.SIGN p2 rl5 <8> A.EXP p3 rl3 <63> A.FRACT p4 r22 <1> SIGN.REG p5 r21 <8> EXP.REG p6 r20 <63> FP.REG b3.x6 ENDBR (x99.pi:A.SI) (v24.il2:C) (x99,p2:A.EX) (x99.p3:A.FR) (v24.i4:T.FR) (v24.i6:T.EX) (v24.i8:T.SI) (v24.il3:FP.E) (x99.p6 :FP.R) (x99.p5:EXP) (x99.p4:SIGN) (v24.il5:DATA) (x2.pl:WAIT) (v24.i14:READ) b4.x6 BRANCH [3] xll2 <r> (x4.pl:FP.I) (*.cll=2) pi * <1> MODE xll3 SELECT (xll2.pl:MODE) bl.xll3 BRANCH [0] xll4 <r> (v24.i14:READ) (*.c3=0) pi r6 <1> READY xll5 WAIT (xll4.pl:READ) pi r30 <140> WAIT 366 xll6 <r> (v24.ii5:DATA) (*.c3=0) pi r5 <16> DATA xll7 <r> (xll6 .pl:DATA) (*.c3=0) pi * <7> * xll8 <w> (v24.i9:A.FR) (xll7.pl) (*.cl5=56) pi rl3 <63> A.FRACT xll9 <r> (xll6 .pl:DATA) (*.cl6=7) pi rl4 <1> A.SIGN xl20 <r> (xll6 .pl:DATA) (*.cl7=8) pi rl5 <8> A.EXP xl21 <w> (v24.i14:READ) (*.c3=0) (*.c3=0) pi * <1> READY xl22 <r> (xl21.pl:READ) (*.c3=0) pi r6 <1> READY xl23 WAIT (xl22.pl:READ) pi r30 <140> WAIT xl24 <r> 
(v24.i5:DATA) (*.c3=0)
    p1 r5 <16> DATA
x125 <w> (x118.p1:A.FR) (x124.p1:DATA) (*.c18=40)
    p1 r13 <63> A.FRACT
x126 <w> (x121.p1:READ) (*.c3=0) (*.c3=0)
    p1 * <1> READY
x127 <r> (x126.p1:READ) (*.c3=0)
    p1 r6 <1> READY
x128 WAIT (x127.p1:READ)
    p1 r30 <140> WAIT
x129 <r> (v24.i15:DATA) (*.c3=0)
    p1 r5 <16> DATA
x130 <w> (x125.p1:A.FR) (x129.p1:DATA) (*.c19=24)
    p1 r13 <63> A.FRACT
x131 <w> (x126.p1:READ) (*.c3=0) (*.c3=0)
    p1 r6 <1> READY
x132 <r> (x131.p1:READ) (*.c3=0)
    p1 r6 <1> READY
x133 WAIT (x132.p1:READ)
    p1 * <1> WAIT
x134 <r> (v24.i15:DATA) (*.c3=0)
    p1 r5 <16> DATA
x135 <w> (x130.p1:A.FR) (x134.p1:DATA) (*.c17=8)
    p1 r13 <63> A.FRACT
x136 <w> (x131.p1:READ) (*.c3=0) (*.c3=0)
    p1 * <1> READY
x137 <w> (x135.p1:A.FR) (*.c12=0) (*.c3=0)
    p1 * <63> A.FRACT
x138 CALL @v29:IN.N (x137.p1:A.FR) (x120.p1:A.EX) (x4.p1:FP.I) (v24.i4:T.FR) (v24.i6:T.EX) (v24.i8:T.SI) (x119.p1:A.SI) (v24.i12:C) (v24.i13:FP.E) (v24.i3:FP.R) (v24.i5:EXP) (v24.i7:SIGN) (x136.p1:READ) (v24.i15:DATA) (v24.i16:Z) (v24.i17:N) (v24.i1:GO)
    p1 r13 <63> A.FRACT
    p2 r15 <8> A.EXP
b1.x113 ENDBR (x138.p2:A.EX) (x138.p1:A.FR) (x136.p1:READ) (x133.p1:WAIT) (x119.p1:A.SI) (v24.i15:DATA)
b2.x113 BRANCH [1]
x139 <r> (v24.i14:READ) (*.c3=0)
    p1 r6 <1> READY
x140 WAIT (x139.p1:READ)
    p1 r30 <140> WAIT
x141 < 3 (v24.i10:A.EX) (v24.i11:A.SI)
    p1 * <9> *
x142 <r> (v24.i9:A.FR) (*.c15=56)
    p1 * <7> A.FRACT
x143 0 (x141.p1) (x142.p1:A.FR)
    p1 * <16> *
x144 <w> (v24.i15:DATA) (x143.p1) (*.c3=0)
    p1 r5 <16> DATA
x145 <w> (v24.i14:READ) (*.c3=0) (*.c3=0)
    p1 * <1> READY
x146 <r> (x145.p1:READ) (*.c3=0)
    p1 r6 <1> READY
x147 WAIT (x146.p1:READ)
    p1 r30 <140> WAIT
x148 <r> (v24.i9:A.FR) (*.c18=40)
    p1 r5 <16> A.FRACT
x149 <w> (x144.p1:DATA) (x148.p1:A.FR) (*.c3=0)
    p1 r5 <16> DATA
x150 <w> (x145.p1:READ) (*.c3=0) (*.c3=0)
    p1 * <1> READY
x151 <r> (x150.p1:READ) (*.c3=0)
    p1 r6 <1> READY
x152 WAIT (x151.p1:READ)
    p1 r30 <140> WAIT
x153 <r> (v24.i9:A.FR) (*.c20=24)
    p1 * <16> A.FRACT
x154 <w> (x149.p1:DATA) (x153.p1:A.FR) (*.c3=0)
    p1 r5 <16> DATA
x155 <w> (x150.p1:READ) (*.c3=0) (*.c3=0)
    p1 * <1> READY
x156 <r> (x155.p1:READ) (*.c3=0)
    p1 r6 <1> READY
x157 WAIT (x156.p1:READ)
    p1 r30 <140> WAIT
x158 <r> (v24.i9:A.FR) (*.c21=8)
    p1 * <16> A.FRACT
x159 <w> (x154.p1:DATA) (x158.p1:A.FR) (*.c3=0)
    p1 r5 <16> DATA
x160 <w> (x155.p1:READ) (*.c3=0) (*.c3=0)
    p1 * <1> READY
b2.x113 ENDBR (v24.i10:A.EX) (v24.i9:A.FR) (x160.p1:READ) (x157.p1:WAIT) (v24.i11:A.SI) (x159.p1:DATA)
x113 ENDSEL
    p1 r15 <8> A.EXP
    p2 r13 <63> A.FRACT
    p3 r6 <1> READY
    p4 r30 <140> WAIT
    p5 r14 <1> A.SIGN
    p6 r5 <16> DATA
b4.x6 ENDBR (x113.p5:A.SI) (v24.i12:C) (x113.p1:A.EX) (x113.p2:A.FR) (v24.i4:T.FR) (v24.i6:T.EX) (v24.i8:T.SI) (v24.i13:FP.E) (v24.i3:FP.R) (v24.i5:EXP) (v24.i7:SIGN) (x113.p6:DATA) (x113.p4:WAIT) (x113.p3:READ)
x6 ENDSEL
    p1 r14 <1> A.SIGN
    p2 r19 <1> C
    p3 r15 <8> A.EXP
    p4 r13 <63> A.FRACT
    p5 r16 <63> T.FRACT
    p6 r18 <8> T.EXP
    p7 r17 <1> T.SIGN
    p8 r4 <1> FP.ERROR
    p9 r20 <63> FP.REG
    p10 r21 <8> EXP.REG
    p11 r22 <1> SIGN.REG
    p12 r5 <16> DATA
    p13 r30 <140> WAIT
    p14 r6 <1> READY
x161 EQL:US (x6.p4:A.FR) (*.c8=0)
    p1 * <1> *
x162 <w> (v24.i16:Z) (x161.p1) (*.c3=0)
    p1 r7 <1> Z
x163 <w> (v24.i17:N) (x6.p1:A.SI) (*.c3=0)
    p1 r8 <1> N
x164 <w> (v24.i1:GO) (*.c3=0) (*.c3=0)
    p1 * <1> GO
x165 LEAVE @v24:INST (x6.p5:T.FR) (x6.p6:T.EX) (x6.p7:T.SI) (x6.p3:A.EX) (x6.p1:A.SI) (x6.p4:A.FR) (x6.p2:C) (x6.p8:FP.E) (x6.p9:FP.R) (x6.p10:EXP) (x6.p11:SIGN) (x6.p14:READ) (x6.p12:DATA) (x162.p1:Z) (x163.p1:N) (x164.p1:GO)
    o1 r16 <63> T.FRACT
    o2 r18 <8> T.EXP
    o3 r17 <1> T.SIGN
    o4 r15 <8> A.EXP
    o5 r14 <1> A.SIGN
    o6 r13 <63> A.FRACT
    o7 r19 <1> C
    o8 r4 <1> FP.ERROR
    o9 r20 <63> FP.REG
    o10 r21 <8> EXP.REG
    o11 r22 <1> SIGN.REG
    o12 r6 <1> READY
    o13 r5 <16> DATA
    o14 r7 <1> Z
    o15 r8 <1> N
    o16 r2 <1> GO
v23 %sl 001001000100010000-US * * START;
    i1 r2 <1> GO
    i2 r3 <6> INST.IN
    i3 r20 <63> FP.REG
    i4 r16 <63> T.FRACT
    i5 r21 <8> EXP.REG
    i6 r18 <8> T.EXP
    i7 r22 <1> SIGN.REG
    i8 r17 <1> T.SIGN
    i9 r13 <63> A.FRACT
    i10 r15 <8> A.EXP
    i11 r14 <1> A.SIGN
    i12 r19 <1> C
    i13 r4 <1> FP.ERROR
    i14 r6 <1> READY
    i15 r5 <16> DATA
    i16 r7 <1> Z
    i17 r8 <1> N
x1 CALL @v24:INST (v23.i1:GO) (v23.i2:INST) (v23.i3:FP.R) (v23.i4:T.FR) (v23.i5:EXP) (v23.i6:T.EX) (v23.i7:SIGN) (v23.i8:T.SI) (v23.i9:A.FR) (v23.i10:A.EX) (v23.i11:A.SI) (v23.i12:C) (v23.i13:FP.E) (v23.i14:READ) (v23.i15:DATA) (v23.i16:Z) (v23.i17:N)
    p1 r16 <63> T.FRACT
    p2 r18 <8> T.EXP
    p3 r17 <1> T.SIGN
    p4 r15 <8> A.EXP
    p5 r14 <1> A.SIGN
    p6 r13 <63> A.FRACT
    p7 r19 <1> C
    p8 r4 <1> FP.ERROR
    p9 r20 <63> FP.REG
    p10 r21 <8> EXP.REG
    p11 r22 <1> SIGN.REG
    p12 r6 <1> READY
    p13 r5 <16> DATA
    p14 r7 <1> Z
    p15 r8 <1> N
    p16 r2 <1> GO
x2 LEAVE @v23:STAR (x1.p8:FP.E) (x1.p12:READY) (x1.p13:DATA) (x1.p14:Z) (x1.p15:N) (x1.p16:GO)
    o1 r4 <1> FP.ERROR
    o2 r6 <1> READY
    o3 r5 <16> DATA
    o4 r7 <1> Z
    o5 r8 <1> N
    o6 r2 <1> GO

G.2 Floating Point Coprocessor Park Normal Form Description

# Description #
nn114 dummy 0
nn117 dummy 0
nn118 dummy 0
nn123 dummy 0
nn124 dummy 0
nn127 dummy 0
nn157 dummy 0
nn159 dummy 0
nn164 dummy 0
nn165 dummy 0
nn168 dummy 0
nn169 dummy 0
nn174 dummy 0
nn179 dummy 0
nn182 dummy 0
nn185 dummy 0
nn187 dummy 0
nn19 dummy 0
nn191 dummy 0
nn221 dummy 0
nn34 dummy 0
nn50 dummy 0
nn51 dummy 0
nn57 dummy 0
nn62 dummy 0
nn64 dummy 0
outport dummy 0
root dummy 0
v23_225 dummy 63
v23_226 dummy 63
v23_227 dummy 8
v23_228 dummy 8
v23_229 dummy 1
v23_230 dummy 1
v23_231 dummy 63
v23_232 dummy 8
v23_233 dummy 1
v23_234 dummy 1
v23_235 dummy 1
v23_236 dummy 1
v23_237 dummy 16
v23_240 bit-read 1
v23_241 dummy 1
v23_242 bit-read 6
v23_243 bit-read 6
v23_244 bit-read 6
v23_245 dist 0
v23_246 join 0
v23_249 bit-read 6
v23_250 read 63
v23_251 bit-read 6
v23_252 read 8
v23_253 bit-read 6
v23_254 read 6
v23_255 cmp 63
v23_256 dist 0
v23_257 join 0
v23_262 cmp 63
v23_263 dist 0
v23_264 join 0
v23_267 concat 8
v23_268 concat 63
v23_269 bit-read 63
v23_270 bit-read 63
v23_271 bit-read 63
v23_272 bit-read 6
v23_273 dist 0
v23_274 join 0
v23_277 inv 63
v23_282 dummy 8
v23_283 dummy 8
v23_285 cmp 8
v23_286 dist 0
v23_287 join 0
v23_290 cmp 8
v23_291 dist 0
v23_292 join 0
v23_298 add 8
v23_299 bit-read 8
v23_300 r-shift 63
v23_301 cmp 8
v23_302 dist 0
v23_303 join 0
v23_315 add 8
v23_316 bit-read 8
v23_317 r-shift 63
v23_318 cmp 8
v23_319 dist 0
v23_320 join 0
v23_329 dummy 8
v23_330 dummy 63
v23_332 bit-read 6
v23_333 xor 6
v23_334 xor 6
v23_335 dist 0
v23_336 join 0
v23_339 add 63
v23_340 bit-read 63
v23_341 bit-read 63
v23_342 dist 0
v23_343 join 0
v23_346 r-shiftl 63
v23_347 add 8
v23_348 bit-read 8
v23_353 sub 63
v23_354 bit-read 63
v23_355 bit-read 63
v23_356 cmp 63
v23_357 dist 0
v23_358 join 0
v23_363 inv 63
v23_364 dist 0
v23_365 join 0
v23_368 neg 63
v23_369 bit-read 63
v23_370 inv 0
v23_391 bit-read 63
v23_392 cmp 63
v23_393 dist 0
v23_394 join 0
v23_399 l-shift 63
v23_400 sub 8
v23_401 bit-read 8
v23_406 bit-read 6
v23_407 read 63
v23_408 bit-read 6
v23_409 read 8
v23_410 bit-read 6
v23_411 read 6
v23_412 bit-read 6
v23_413 dist 0
v23_414 join 0
v23_417 cmp 63
v23_418 cmp 63
v23_419 or 63
v23_420 dist 0
v23_421 join 0
v23_426 xor 6
v23_427 add 8
v23_428 bit-read 8
v23_429 sub 8
v23_430 bit-read 8
v23_431 mul 63
v23_432 bit-read 63
v23_434 bit-read 63
v23_435 cmp 63
v23_436 dist 0
v23_437 join 0
v23_440 l-shift 63
v23_441 sub 8
v23_442 bit-read 8
v23_447 cmp 63
v23_448 dist 0
v23_449 join 0
v23_452 bit-write 1
v23_455 cmp 63
v23_456 dist 0
v23_457 join 0
v23_462 xor 6
v23_463 cmp 63
v23_464 dist 0
v23_465 join 0
v23_468 r-shift 63
v23_469 add 8
v23_470 bit-read 8
v23_473 concat 0
v23_474 div 63
v23_475 bit-read 63
v23_476 sub 8
v23_477 bit-read 8
v23_478 add 8
v23_479 bit-read 8
v23_482 bit-read 6
v23_483 dist 0
v23_484 join 0
v23_487 bit-read 6
v23_488 read 63
v23_489 bit-read 6
v23_490 read 8
v23_491 bit-read 6
v23_492 read 6
v23_495 bit-read 6
v23_496 write 63
v23_497 bit-read 6
v23_498 write 8
v23_499 bit-read 6
v23_500 write 6
v23_503 bit-read 6
v23_504 dist 0
v23_505 join 0
v23_508 bit-read 1
v23_510 bit-read 16
v23_511 bit-read 16
v23_512 bit-write 63
v23_513 bit-read 16
v23_514 bit-read 16
v23_515 bit-write 1
v23_516 bit-read 1
v23_518 bit-read 8
v23_519 bit-write 63
v23_520 bit-write 1
v23_521 bit-read 1
v23_523 bit-read 16
v23_524 bit-write 63
v23_525 bit-write 1
v23_526 bit-read 1
v23_528 bit-read 16
v23_529 bit-write 63
v23_530 bit-write 1
v23_531 bit-write 63
v23_549 bit-read 63
v23_550 cmp 63
v23_551 dist 0
v23_552 join 0
v23_557 l-shift 63
v23_558 sub 8
v23_559 bit-read 8
v23_564 bit-read 1
v23_566 concat 8
v23_567 bit-read 63
v23_568 concat 63
v23_569 bit-write 63
v23_570 bit-write 1
v23_571 bit-read 1
v23_573 bit-read 63
v23_574 bit-write 63
v23_575 bit-write 1
v23_576 bit-read 1
v23_578 bit-read 63
v23_579 bit-write 63
v23_580 bit-write 1
v23_581 bit-read 1
v23_583 bit-read 63
v23_584 bit-write 63
v23_585 bit-write 1
v23_586 cmp 0
v23_587 bit-write 1
v23_588 bit-write 1
v23_589 bit-write 1
v23_i1 dummy 1
v23_i16 dummy 1
v23_i17 dummy 1

v23_240 v23_241 1
v23_242 v23_243 6
v23_243 v23_244 6
v23_244 v23_245 6
v23_245 nn51 0
v23_243 v23_249 6
v23_225 v23_250 63
v23_249 v23_250 6
v23_243 v23_251 6
v23_227 v23_252 8
v23_251 v23_252 6
v23_243 v23_253 6
v23_229 v23_254 1
v23_253 v23_254 6
v23_250 v23_255 63
v23_255 v23_256 63
v23_231 v23_262 63
v23_262 v23_263 63
v23_263 nn57 0
v23_252 v23_267 8
v23_254 v23_267 6
v23_267 v23_268 8
v23_250 v23_268 63
v23_268 v23_269 63
v23_268 v23_270 63
v23_268 v23_271 63
v23_243 v23_272 6
v23_272 v23_273 6
v23_270 v23_277 63
nn62 v23_274 0
v23_270 nn62 63
nn57 v23_267 0
nn57 v23_272 0
nn64 v23_264 0
v23_233 nn64 1
v23_232 nn64 8
v23_231 nn64 63
v23_282 v23_285 8
v23_283 v23_285 8
v23_285 v23_286 8
v23_282 v23_290 8
v23_283 v23_290 8
v23_290 v23_291 8
v23_291 nn19 0
v23_298 v23_299 8
v23_299 v23_301 8
v23_301 v23_302 8
v23_291 nn34 0
v23_315 v23_316 8
v23_316 v23_318 8
v23_318 v23_319 8
nn50 v23_287 0
v23_283 nn50 8
v23_282 nn50 8
v23_287 v23_329 0
v23_287 v23_330 0
v23_264 v23_282 8
v23_252 v23_283 8
v23_243 v23_332 6
v23_264 v23_333 0
v23_254 v23_333 6
v23_332 v23_334 6
v23_333 v23_334 6
v23_334 v23_335 6
v23_264 v23_339 0
v23_330 v23_339 63
v23_339 v23_340 63
v23_339 v23_341 63
v23_341 v23_342 63
v23_342 nn114 0
v23_340 v23_346 63
v23_347 v23_348 8
nn114 v23_346 0
nn114 v23_347 0
nn117 v23_343 0
v23_340 nn117 63
v23_335 nn118 0
v23_264 v23_353 0
v23_330 v23_353 63
v23_353 v23_354 63
v23_353 v23_355 63
v23_354 v23_356 63
v23_356 v23_357 63
nn123 v23_358 0
v23_264 nn123 0
v23_355 v23_363 63
v23_363 v23_364 63
v23_364 nn124 0
v23_354 v23_368 63
v23_368 v23_369 63
v23_358 v23_370 0
nn124 v23_368 0
nn124 v23_370 0
nn127 v23_365 0
v23_358 nn127 0
v23_354 nn127 63
v23_391 v23_392 63
v23_392 v23_393 63
v23_400 v23_401 8
nn118 v23_353 0
nn51 v23_249 0
nn51 v23_251 0
nn51 v23_253 0
nn51 v23_262 0
nn51 v23_332 0
v23_245 nn157 0
v23_243 v23_406 6
v23_225 v23_407 63
v23_406 v23_407 6
v23_243 v23_408 6
v23_227 v23_409 8
v23_408 v23_409 6
v23_243 v23_410 6
v23_229 v23_411 1
v23_410 v23_411 6
v23_243 v23_412 6
v23_412 v23_413 6
v23_413 nn159 0
v23_407 v23_417 63
v23_231 v23_418 63
v23_417 v23_419 63
v23_418 v23_419 63
v23_419 v23_420 63
nn164 v23_421 0
v23_231 nn164 63
v23_233 v23_426 1
v23_411 v23_426 6
v23_232 v23_427 8
v23_409 v23_427 8
v23_427 v23_428 8
v23_428 v23_429 8
v23_429 v23_430 8
v23_421 v23_431 0
v23_407 v23_431 63
v23_431 v23_432 63
v23_434 v23_435 63
v23_435 v23_436 63
v23_436 nn165 0
v23_430 v23_441 8
v23_441 v23_442 8
nn165 v23_440 0
nn165 v23_441 0
nn168 v23_437 0
v23_430 nn168 8
nn159 v23_417 0
nn159 v23_418 0
nn159 v23_426 0
nn159 v23_427 0
v23_413 nn169 0
v23_407 v23_447 63
v23_447 v23_448 63
v23_235 v23_452 1
nn174 v23_449 0
v23_235 nn174 1
v23_231 v23_455 63
v23_455 v23_456 63
v23_233 v23_462 1
v23_411 v23_462 6
v23_231 v23_463 63
v23_407 v23_463 63
v23_463 v23_464 63
v23_464 nn179 0
v23_231 v23_468 63
v23_232 v23_469 8
v23_469 v23_470 8
nn179 v23_468 0
nn179 v23_469 0
nn182 v23_465 0
v23_232 nn182 8
v23_231 nn182 63
v23_465 v23_473 0
v23_473 v23_474 0
v23_407 v23_474 63
v23_474 v23_475 63
v23_465 v23_476 0
v23_409 v23_476 8
v23_476 v23_477 8
v23_477 v23_478 8
v23_478 v23_479 8
nn169 v23_447 0
nn169 v23_455 0
nn169 v23_462 0
nn169 v23_463 0
nn157 v23_406 0
nn157 v23_408 0
nn157 v23_410 0
nn157 v23_412 0
v23_243 v23_482 6
v23_482 v23_483 6
v23_483 nn185 0
v23_243 v23_487 6
v23_225 v23_488 63
v23_487 v23_488 6
v23_243 v23_489 6
v23_227 v23_490 8
v23_489 v23_490 6
v23_243 v23_491 6
v23_229 v23_492 1
v23_491 v23_492 6
nn185 v23_487 0
nn185 v23_489 0
nn185 v23_491 0
v23_483 nn187 0
v23_243 v23_495 6
v23_225 v23_496 63
v23_495 v23_496 6
v23_231 v23_496 63
v23_243 v23_497 6
v23_227 v23_498 8
v23_497 v23_498 6
v23_232 v23_498 8
v23_243 v23_499 6
v23_229 v23_500 1
v23_499 v23_500 6
v23_233 v23_500 1
nn187 v23_495 0
nn187 v23_497 0
nn187 v23_499 0
v23_243 v23_503 6
v23_503 v23_504 6
v23_504 nn191 0
v23_236 v23_508 1
v23_237 v23_510 16
v23_510 v23_511 16
v23_231 v23_512 63
v23_511 v23_512 16
v23_510 v23_513 16
v23_510 v23_514 16
v23_236 v23_515 1
v23_515 v23_516 1
v23_227 v23_518 8
v23_512 v23_519 63
v23_518 v23_519 8
v23_515 v23_520 1
v23_520 v23_521 1
v23_237 v23_523 16
v23_519 v23_524 63
v23_523 v23_524 16
v23_520 v23_525
v23_525 v23_526 1
v23_237 v23_528 16
v23_524 v23_529 63
v23_528 v23_529 16
v23_525 v23_530 1
v23_529 v23_531 63
v23_549 v23_550 63
v23_550 v23_551 63
v23_558 v23_559 8
nn191 v23_508 0
nn191 v23_510 0
nn191 v23_515 0
nn191 v23_518 0
nn191 v23_523 0
nn191 v23_528 0
v23_504 nn221 0
v23_236 v23_564 1
v23_232 v23_566 8
v23_233 v23_566 1
v23_231 v23_567 63
v23_566 v23_568 8
v23_567 v23_568 63
v23_237 v23_569 16
v23_568 v23_569 63
v23_236 v23_570 1
v23_570 v23_571 1
v23_231 v23_573 63
v23_569 v23_574 63
v23_573 v23_574 63
v23_570 v23_575 1
v23_575 v23_576 1
v23_231 v23_578 63
v23_574 v23_579 63
v23_578 v23_579 63
v23_575 v23_580 1
v23_580 v23_581 1
v23_231 v23_583 63
v23_579 v23_584 63
v23_583 v23_584 63
v23_580 v23_585 1
nn221 v23_564 0
nn221 v23_566 0
nn221 v23_567 0
nn221 v23_570 0
nn221 v23_573 0
nn221 v23_578 0
nn221 v23_583 0
v23_246 v23_586 0
v23_586 v23_587 0
v23_246 v23_588 0
root v23_i1 1
root v23_i16 1
root v23_i17 1
v23_335 v23_339 0
v23_343 v23_336 0
v23_343 v23_336 0
v23_341 v23_336 63
v23_264 v23_336 0
v23_348 v23_343 8
v23_346 v23_343 63
v23_342 nn117 0
v23_355 v23_336 63
v23_365 v23_336 0
v23_393 v23_394 0
v23_357 v23_358 0
v23_357 nn123 0
v23_370 v23_365 0
v23_369 v23_365 63
v23_364 nn127 0
v23_551 v23_552 0
v23_414 v23_246 0
v23_234 v23_246 1
v23_414 v23_246 0
v23_414 v23_246 0
v23_407 v23_246 63
v23_409 v23_246 8
v23_411 v23_246 6
v23_414 v23_246 0
v23_225 v23_246 63
v23_227 v23_246 8
v23_229 v23_246 1
v23_237 v23_246 16
v23_241 v23_246 1
v23_236 v23_246 1
v23_551 v23_552 0
v23_437 v23_414 0
v23_437 v23_414 0
v23_426 v23_414 6
v23_235 v23_414 1
v23_420 v23_421 0
v23_420 nn164 0
v23_442 v23_437 8
v23_440 v23_437 63
v23_436 nn168 0
v23_286 v23_290 0
v23_475 v23_414 63
v23_479 v23_414 8
v23_462 v23_414 6
v23_449 v23_414 0
v23_457 v23_414 0
v23_448 v23_452 0
v23_452 v23_449 1
v23_448 nn174 0
v23_456 v23_457 0
v23_456 v23_457 0
v23_292 v23_287 0
v23_292 v23_287 0
v23_292 v23_287 0
v23_470 v23_465 8
v23_468 v23_465 63
v23_464 nn182 0
v23_245 v23_482 0
v23_484 v23_246 0
v23_234 v23_246 1
v23_484 v23_246 0
v23_484 v23_246 0
v23_226 v23_246 63
v23_228 v23_246 8
v23_230 v23_246 1
v23_235 v23_246 1
v23_484 v23_246 0
v23_484 v23_246 0
v23_484 v23_246 0
v23_237 v23_246 16
v23_241 v23_246 1
v23_236 v23_246 1
v23_492 v23_484 6
v23_490 v23_484 8
v23_488 v23_484 63
v23_229 v23_484 1
v23_227 v23_484 8
v23_225 v23_484 63
v23_233 v23_484 1
v23_232 v23_484 8
v23_231 v23_484 63
v23_500 v23_484 6
v23_498 v23_484 8
v23_496 v23_484 63
v23_245 v23_503 0
v23_505 v23_246 0
v23_234 v23_246 1
v23_505 v23_246 0
v23_505 v23_246 0
v23_226 v23_246 63
v23_228 v23_246 8
v23_230 v23_246 1
v23_235 v23_246 1
v23_225 v23_246 63
v23_227 v23_246 8
v23_229 v23_246 1
v23_505 v23_246 0
v23_505 v23_246 0
v23_505 v23_246 0
v23_530 v23_505 1
v23_513 v23_505 16
v23_237 v23_505 16
v23_302 v23_303 0
v23_283 v23_292 8
v23_232 v23_505 8
v23_231 v23_505 63
v23_585 v23_505 1
v23_233 v23_505 1
v23_584 v23_505 63
v23_282 v23_292 8
v23_302 v23_303 0
v23_286 nn50 0
v23_336 v23_246 0
v23_336 v23_246 0
v23_336 v23_246 0
v23_336 v23_246 0
v23_330 v23_246 63
v23_329 v23_246 8
v23_254 v23_246 6
v23_235 v23_246 1
v23_225 v23_246 63
v23_227 v23_246 8
v23_229 v23_246 1
v23_237 v23_246 16
v23_241 v23_246 1
v23_236 v23_246 1
v23_257 v23_246 0
v23_256 v23_257 0
v23_256 v23_257 0
v23_274 v23_264 0
v23_271 v23_264 63
v23_269 v23_264 63
v23_273 v23_277 0
v23_319 v23_320 0
v23_277 v23_274 63
v23_273 nn62 0
v23_263 nn64 0
v23_319 v23_320 0
v23_393 v23_394 0
v23_i1 v23_240 1
v23_i1 v23_589 1
v23_i16 v23_587 1
v23_i17 v23_588 1
v23_250 nn50 63
v23_282 v23_298 8
nn19 v23_298 0
nn19 v23_300 0
v23_250 v23_300 63
v23_283 v23_301 8
nn19 v23_301 0
v23_299 v23_292 8
v23_303 v23_292 0
v23_300 v23_292 63
v23_283 v23_315 8
nn34 v23_315 0
nn34 v23_317 0
v23_250 v23_317 63
v23_282 v23_318 8
nn34 v23_318 0
v23_316 v23_292 8
v23_320 v23_292 0
v23_317 v23_292 63
v23_287 v23_347 0
v23_287 nn117 0
v23_287 nn123 0
v23_365 v23_391 63
v23_365 v23_399 63
v23_358 v23_400 8
v23_399 v23_336 63
v23_394 v23_336 0
v23_241 v23_336 140
nn118 v23_336 0
v23_243 v23_336 5
nn118 v23_336 0
v23_330 v23_336 63
nn118 v23_336 0
v23_329 v23_336 8
nn118 v23_336 0
v23_254 v23_336 1
nn118 v23_336 0
v23_365 v23_336 1
v23_355 v23_336 1
v23_235 v23_336 1
nn118 v23_336 0
v23_225 v23_336 63
nn118 v23_336 0
v23_227 v23_336 8
nn118 v23_336 0
v23_229 v23_336 1
nn118 v23_336 0
v23_236 v23_336 1
nn118 v23_336 0
v23_237 v23_336 16
nn118 v23_336 0
nn118 v23_336 0
v23_i16 v23_336 1
nn118 v23_336 0
v23_i17 v23_336 1
nn118 v23_336 0
v23_i1 v23_336 1
v23_401 v23_336 8
v23_432 v23_434 63
v23_432 v23_440 63
v23_432 nn168 63
v23_508 v23_505 1
v23_516 v23_505 1
v23_521 v23_505 1
v23_526 v23_505 1
v23_531 v23_549 63
v23_531 v23_557 63
v23_514 v23_558 8
v23_557 v23_505 63
v23_552 v23_505 0
v23_243 v23_505 5
nn191 v23_505 0
v23_226 v23_505 63
nn191 v23_505 0
v23_228 v23_505 8
nn191 v23_505 0
v23_230 v23_505 1
nn191 v23_505 0
v23_513 v23_505 1
v23_234 v23_505 1
nn191 v23_505 0
v23_235 v23_505 1
nn191 v23_505 0
v23_225 v23_505 63
nn191 v23_505 0
v23_227 v23_505 8
nn191 v23_505 0
v23_229 v23_505 1
nn191 v23_505 0
v23_530 v23_505 1
v23_237 v23_505 16
nn191 v23_505 0
nn191 v23_505 0
v23_i16 v23_505 1
nn191 v23_505 0
v23_i17 v23_505 1
nn191 v23_505 0
v23_i1 v23_505 1
v23_559 v23_505 8
v23_564 v23_505 1
v23_571 v23_505 1
v23_576 v23_505 1
v23_581 v23_505 1
root v23_232 8
root v23_233 1
root v23_234 1
root v23_235 1
root v23_236 1
root v23_237 16
root v23_242 6
root v23_225 63
root v23_226 63
root v23_227 8
root v23_228 8
root v23_229 1
root v23_230 1
root v23_231 63
v23_246 outport 0
v23_246 outport 0
v23_246 outport 0
v23_246 outport 0
v23_246 outport 0
v23_587 outport 1
v23_588 outport 1
v23_589 outport 1
v23_246 outport 0
v23_246 outport 0
v23_246 outport 0
v23_246 outport 0
v23_246 outport 0
v23_246 outport 0
v23_246 outport 0
v23_246 outport 0
Asset Metadata
Core Title
00001.tif
Tag
OAI-PMH Harvest
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC11255665
Unique identifier
UC11255665
Legacy Identifier
DP22827