Consolidated logic and layout synthesis for interconnect-centric VLSI design
(USC Thesis Other)
INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer. The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion. Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6” x 9” black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

ProQuest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA, 800-521-0600

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

CONSOLIDATED LOGIC AND LAYOUT SYNTHESIS FOR INTERCONNECT-CENTRIC VLSI DESIGN

by

Amir H. Salek

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER ENGINEERING)

August 2000

Copyright 2000 Amir H. Salek
UMI Number: 3018123

UMI Microform 3018123. Copyright 2001 by Bell & Howell Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

Bell & Howell Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Amir H. Salek under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dean of Graduate Studies
Date: [illegible in source] 2000
DISSERTATION COMMITTEE
Chairperson

To Tiba, Hesum and Behjat

Acknowledgments

Throughout the course of my Ph.D. I made a whole slew of mental notes on whom to thank when it was all over. However, in writing this, probably the most commonly read section of any dissertation, I am sure that I will miss the names of many dear friends, colleagues, and family members who helped in making this a great experience. Thank you all!

The guidance of Professor Massoud Pedram, my thesis advisor, was vital to the making and completion of this thesis. He is a great mentor who taught me to recognize the important questions and search for their answers. This is a skill which I will carry with me from here on. Massoud has been my main source of motivation and new ideas and triggered in me the ability to create anew.
His continuous support and encouragement have been a great treasure on which I have always relied.

I must acknowledge the effort, attention, and help of my other thesis committee members, Professors Melvin Breuer, Jean-Luc Gaudiot, Deborah Estrin, and Peter Beerel. Their input was essential to the completion and presentation of this work in its current form.

Tiba, my wife, has been my great source of support throughout this project. While all else was and is changing, what has remained constant is her love, care, and support. She has experienced with me all the ups and downs of this amazing roller coaster ride over the past several years. I owe this work and all my other achievements to her.

I thank my parents and family. Their continuing dedication, love, and support were the enabling force behind realizing this goal. Their presence provided me with the strength to overcome some very difficult barriers and pursue this objective.

I would like to thank my former advisor, Professor Homayoun Hashemi, who showed me the ins and outs of going on to higher education and taught me what I deem some of my most valued skills.

Finally, I am indebted to all my friends and colleagues for their invaluable support. Thank you Al, Ali, Babak, Cheng-Ta, Chris, Hamid, Jae, Jinan, Kamran, Lucille, Massoud, Iman, Kiarash, Mana, Mandana, Mehdi, Mohsen, Niloofar, Noushin, Noyan, Peyman, Qing, Qinru, Ramin, Reza, Roshanak, Sasan, Soroosh, Tabassom, Vara, Vida.
Table of Contents

Dedication
Acknowledgments
Table of Contents
List of Tables
List of Figures
Abstract

Chapter 1 Introduction

Chapter 2 Simultaneous Fanout Optimization and Routing Tree Construction
2.1 Introduction
2.2 Prior Work
    Fanout Optimization
    Routing Tree Construction
    Other Related Works
2.3 Problem Formulation
2.4 Order-Dependent Hierarchical Buffered Routing Tree Construction
    Cα-Trees
    FANROUT
    Quality and Complexity Analysis
2.5 Experimental Results

Chapter 3 Semi Order-Independent Hierarchical Buffered Routing Tree Construction
3.1 Introduction
3.2 Local Order-Perturbation
3.3 Semi Order-Independent Hierarchical Buffered Routing Tree Construction
    BUBBLE_CONSTRUCT
    *PTREE
    MERLIN
    Quality and Complexity Analysis
3.4 Experimental Results
    Comparison on Individual Nets
    Comparison on Circuits

Chapter 4 Simultaneous Floorplanning, Technology Mapping and Gate Placement
4.1 Introduction
4.2 Background
    Technology Mapping
    Linear Placement
    SiMPA
    Floorplanning
    Acyclic Partitioning
4.3 Simultaneous Floorplanning, Technology Mapping and Gate Placement
    K-LCT: A K-Way Levelized Cluster Tree
    *SiMPA-E for Global Area Optimization
    *SiMPA-R for Eliminating Timing Violations
4.4 Experimental Results

Chapter 5 PEGASUS
5.1 Introduction
5.2 Design Methodologies
5.3 Pegasus
    Architecture of PEGASUS
    Hierarchical Database - Introduction
    Hierarchical Database - Example
    Dynamic Property Management
    Application Programs and The Design Flow
    Interface
    System Implementation
5.4 Experimental Results

Chapter 6 Conclusion

Chapter 7 References
List of Tables

Table 1: Conventional flows for single nets
Table 2: FANROUT versus the conventional methods for single nets
Table 3: FANROUT versus conventional flows for a set of benchmarks
Table 4: Total buffer area, delay, and runtime for a set of nets
Table 5: Post-layout area, delay, and runtime for a set of benchmark circuits
Table 6: Verifying the effectiveness of the simultaneous approach
Table 7: *SiMPA versus the conventional flow
Table 8: The runtimes
Table 9: Min-Delay versus Min-Area design modes

List of Figures

Figure 1: Gate and interconnect delay versus technology generation
Figure 2: An LT-Tree Type-I for a net with 9 sinks
Figure 3: An output of PTREE for the “dcba” order
Figure 4: A valid Cα-Tree for (s1, s2, ..., s9)
Figure 5: Optimal Cα-Tree construction
Figure 6: A three-dimensional solution curve
Figure 7: Pseudo-code for FANROUT
Figure 8: An illustration for the grouping steps
Figure 9: Employing existing sub-solutions to generate larger sub-solutions
Figure 10: Structure of MERLIN
Figure 11: Construction with order perturbation
Figure 12: Equivalence of W and N(P)
Figure 13: BDD of f
Figure 14: Grouping structures
Figure 15: Construction with perturbation
Figure 16: The pseudo-code for BUBBLE_CONSTRUCT
Figure 17: The pseudo-code for STRETCH
Figure 18: A legal grouping scenario for BUBBLE_CONSTRUCT
Figure 19: An illegal grouping case
Figure 20: The pseudo-code for SINK_SUBSET
Figure 21: A legal grouping scenario for *PTREE
Figure 22: The pseudo-code for *PTREE
Figure 23: Local neighborhood search in MERLIN
Figure 24: The pseudo-code for MERLIN
Figure 25: An illustration for the proof of Lemma 14
Figure 26: Clustered circuit
Figure 27: *SiMPA
Figure 28: A cycle
Figure 29: A circuit and its corresponding 4-LCT graph
Figure 30: The new 4-LCT
Figure 31: The area model
Figure 32: The flow of *SiMPA-E
Figure 33: Global shape function generation
Figure 34: The flow of *SiMPA-R
Figure 35: An example
Figure 36: The conventional design flow
Figure 37: The high-level architecture of PEGASUS
Figure 38: Design library model
Figure 39: Class hierarchy in PEGASUS
Figure 40: Compact view of a half adder
Figure 41: Compact view of the full adder
Figure 42: Application view of a full adder generated by an expansion scheme
Figure 43: The PegLib format
Figure 44: WING console
Figure 45: Library manager
Figure 46: Synthesis flow
Figure 47: Schematic view editor
Figure 48: Physical view editor

Abstract

Semiconductor process technology regularly provides newer techniques and recipes that allow fabricating smaller, cheaper, and faster transistors in integrated circuits (ICs). Nevertheless, it is up to IC design and optimization techniques to fully exploit the opportunity to build faster, more complex electronic systems at a reasonable cost. In recent years, computer-aided design (CAD) tools have been unable to effectively handle the new design issues and effects arising in fabrication technologies with deep-submicron (DSM) resolution.
Consequently, the semiconductor industry has experienced a productivity gap between its fabrication and design capabilities. DSM technologies are forcing the philosophy of design to shift focus from transistors to interconnects and, as a result, many CAD techniques have to be fully redesigned or significantly altered to tackle the new technology constraints. The increased importance of interconnect delay and area in IC synthesis means that design flows can no longer be divided into two artificially separated phases known as front-end and back-end.

MERLIN and *SiMPA are two new algorithms, each integrating a set of logic and layout synthesis steps into a unified dynamic programming based solution. MERLIN integrates fanout optimization and global routing, whereas *SiMPA integrates floorplanning, technology mapping, and gate placement.

The concerns regarding the implementation of these unification-based algorithms have been studied in PEGASUS, an environment for specifying, synthesizing, and optimizing large designs. PEGASUS provides the facility, i.e., a combined logical and physical database and utilities, for implementing application programs for integrated logic and layout synthesis. This CAD software package has been designed according to object-oriented programming principles and is intended to be efficient, programmable, and extensible.

Chapter 1 Introduction

The strong desire for ever-increasing levels of integration and higher performance often transcends the capabilities of traditional design tools and flows in the integrated circuit (IC) design arena.
In particular, new deep-submicron (DSM) design considerations, such as the dominance of interconnect delays and signal integrity issues, have forced IC designers to re-evaluate the existing design methodologies and techniques.

The semiconductor community faces increasingly difficult challenges as it moves into production at feature sizes approaching 100 nm and lower. Some of these challenges span the entire spectrum of technology and will require major initiatives to develop solutions. Other challenges emerge as the semiconductor community is forced to seek solutions using approaches that have no historical precedent. The National Technology Roadmap for Semiconductors (NTRS) [NTRS97], a widely referenced roadmap, sees the following as the grand challenges facing the semiconductor industry:

• the ability to continue affordable scaling,
• affordable lithography,
• new materials and structures,
• GHz frequency operation on- and off-chip,
• metrology and test.

The semiconductor industry has maintained its growth by achieving a 25% to 30% per-year cost reduction per function throughout its history. This productivity growth in integrated circuits came through design innovation, device shrinks, wafer size increases, yield improvement, and equipment utilization improvements. However, some of the historical contributions to increased performance are no longer available. For example, yields today are quite high, precluding appreciable contributions to productivity increases through yield increases. The largest contribution to productivity growth continues to be decreased feature size. This scaling not only increases the number of transistors per square centimeter in an integrated circuit but also increases the speed of the circuits.
A second complicating factor is that the complexity of the circuits, and the resulting manufacturing complexity, is escalating as feature sizes decrease. Today’s high-end microprocessors require six levels of interconnect. New materials, new technologies, and new approaches must be invented. Affordable scaling and these required inventions constitute the aforementioned grand challenges.

Device and circuit speeds using extensions of today’s chip and system architecture will soon reach fundamental limits. In the GHz frequency regime, circuit elements can no longer be treated as discrete, and transmission line approaches will be increasingly required. For example, electromagnetic signals require 0.1 nanoseconds to travel 3 centimeters (cm) in vacuum and will take longer in any medium with a dielectric constant greater than one. In addition, at a frequency of 10 GHz, the wavelength of 3 cm is comparable to the chip size.

Even for today’s high-performance microprocessors, full chip performance cannot be achieved in packaged parts after assembly because of limitations of current packaging technology. The challenge of getting signals in the GHz frequency range off-chip and into the system after packaging is perhaps even greater than the challenge of on-chip performance at this frequency. New approaches for system and chip architectures will be required. New circuit designs and algorithms that circumvent parasitic limitations, along with assembly and packaging approaches, will be needed at these high frequencies. Solutions will require approaching the overall system as a unit rather than treating design, integrated circuit, and packaging as separate entities.

Fig. 1 plots calculated gate and interconnect delay versus technology generation. It illustrates the dominance of interconnect delay over gate delay for aluminum metallization and silicon dioxide dielectrics as feature sizes approach 100 nm. Also shown is the decrease in interconnect delay and the improved overall performance expected for copper and low-κ dielectric insulators. Even in the latter case, interconnect delay contributes considerably to the overall delay of circuits.

[Figure: delay (ps) versus technology generation (650, 500, 350, 250, 180, 130, 100 nm). Curves: gate delay; interconnect delay, Al & SiO2; interconnect delay, Cu & low κ; sum of delays, Al & SiO2; sum of delays, Cu & low κ. Legend: Al 3.0 µΩ-cm; Cu 1.7 µΩ-cm; SiO2 κ = 4.0; low κ κ = 2.0; Al & Cu lines 0.8 µm thick, 43 µm long.]
Fig. 1. Gate and interconnect delay versus technology generation.

Physical design deals with aspects of chip implementation related to the spatial layout of devices and interconnects. In the past, the primary physical design objective was to arrange devices and interconnects to achieve a minimum-area layout. Now and in the future, physical design must trade off area against competing measures of design quality: speed, power, signal integrity, and manufacturing yield. This change is ultimately due to the different scaling of devices and interconnects. With successive process generations, devices are smaller, have less drive strength, and are more noise-sensitive. Interconnects are more resistive, have greater coupling capacitance, and traverse larger chip sizes, and thus become the key to chip performance. As the physical design solution increasingly determines overall design quality, there will be a need for far more accurate models of device/interconnect interactions.
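One simple first-order model of such a device/interconnect interaction is the Elmore delay of a gate driving a uniform RC wire into a lumped load. The sketch below is only an illustration and is not taken from this thesis: the function name `elmore_delay` and every parameter value are assumptions chosen to show how the wire term grows quadratically with length while the gate term grows only linearly.

```python
def elmore_delay(r_driver, r_per_um, c_per_um, length_um, c_load):
    """Elmore delay (seconds) of a driver, a uniform RC wire, and a lumped load.

    r_driver  : driver output resistance (ohms)
    r_per_um  : wire resistance per micron (ohms/um)
    c_per_um  : wire capacitance per micron (farads/um)
    length_um : wire length (um)
    c_load    : sink input capacitance (farads)
    """
    r_wire = r_per_um * length_um
    c_wire = c_per_um * length_um
    # The driver resistance sees the whole wire capacitance plus the load.
    gate_term = r_driver * (c_wire + c_load)
    # The distributed wire resistance sees half of its own capacitance plus the load.
    wire_term = r_wire * (c_wire / 2.0 + c_load)
    return gate_term + wire_term


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for length in (100.0, 1000.0, 5000.0):
        delay = elmore_delay(r_driver=1000.0, r_per_um=0.1,
                             c_per_um=0.2e-15, length_um=length, c_load=5e-15)
        print(f"{length:6.0f} um wire: {delay * 1e12:.2f} ps")
```

Under this model, halving the wire resistivity or the dielectric constant scales only the wire-dependent terms, which is consistent with the observation that copper and low-κ insulators reduce, but do not remove, the interconnect contribution.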
The global trade-off mentioned above, together with the need for better models, will require ever closer unification of physical design with system-level design, logic-level design, and circuit implementation. Additionally, new design methodologies need to be developed to support tight coupling of analysis, synthesis, and (re)specification activities across multiple levels of representation.

Several new requirements will emerge in the physical design context. Increased design complexity and reuse together imply a need for new block characterization tools, as well as new data management and hierarchy management tools. Increased adoption of area-array I/O implies a need for tools that address associated power/clock/test distribution, reliability, and noise issues. Increased clock frequencies imply a need for high-frequency and analog/mixed-signal design tools. Manufacturing variability implies a need for statistical design and design centering tools. Overall, future physical design tasks will entail interacting multi-level, multi-objective optimizations. Such optimizations will be increasingly constraint-dominated, and more degrees of freedom must be considered simultaneously to achieve a feasible solution. Rethinking sequential algorithms and tool interactions for symmetric multiprocessing will be important.

In summary, future physical design needs can be broadly classified as "unifying physical design" or as "new syntheses and analyses". Four unifications of design within the design flow are especially critical, as follows:

1. Analysis and synthesis must be unified to erase barriers between today’s disparate design flows. Constraints must drive synthesis, and estimated parasitics must drive analysis, in a closed-loop, "construct by correction" iterative improvement process. An example of a potential solution is a new analysis backplane that enables performance analysis integrating simulation traces with analyses of hardware description language hierarchy, clock structure, and device noise sensitivities.
2. Layout-level and system-level design must be unified. Modeling capabilities must be developed to enable forward estimation-driven syntheses and design exploration. New tools will be needed to support hierarchy and reuse across multiple implementation levels, as well as auto-interactive and iterative design paradigms.
3. Layout-level and logic-level design must be unified. In particular, system timing management, logic optimization, placement, and routing must coexist in a single environment within the next two process generations.
4. Layout-level and circuit-level design must be unified, e.g., for synthesis of dynamic CMOS circuits and for transistor-level layout within cell-based design.

In general, as future designs become increasingly interconnect-limited, physical design must carefully balance all of these factors.

This thesis focuses on the unification of layout and logic-level design in interconnect-centric VLSI design. In general, to address the increasing interaction between these design phases, one can either increase the look-ahead capability of high-level tools or develop new algorithms for solving larger portions of the overall design problem simultaneously. The latter, unification-based approach [Pe98] seems more promising. Indeed, the nature of IC design problems and the current state of CAD solutions have reached a point where it is both necessary and possible to combine some steps of the synthesis and physical design processes.
The unification-based algorithms are capable of capturing existing interactions among the ‘merged’ design steps and producing higher-quality implementations by systematically searching a much larger solution space (see [SLP98] and [LSP97]).

The rest of this thesis is organized as follows. Chapter 2 presents an optimal algorithm, FANROUT, for solving the problem of simultaneous fanout optimization and routing tree construction for an ordered set of critical sinks. The algorithm, which is based on dynamic programming, generates a rectilinear Steiner tree routing solution containing appropriately sized and placed buffers. The resulting solution inherits the detailed structure of P-Tree [LCLH96] and the topology of a new structure called Cα-Tree. The two variants of the problem, i.e., maximizing the driver required time subject to a total area constraint and minimizing the total area subject to a minimum driver required time constraint, are handled by propagating three-dimensional solution curves during the construction phase.

The solution generated by FANROUT is optimal with respect to a given sink order. However, it is not clear which order gives the best result. In Chapter 3, this problem is addressed by proposing a novel local order-perturbation technique which finds the locally optimal solution in an exponential-size solution subspace in polynomial time. The proposed approach can be applied to any dynamic programming based algorithm which is optimal only with respect to the given order of input objects.

Chapter 4 presents an integrated design flow which combines floorplanning, technology mapping, and placement using a dynamic programming algorithm.
The proposed design flow consists of five steps: maximum tree sub-structure formation, levelized cluster tree construction, minimum area implementation using 2-D shape functions, critical path identification, and repeated application of simultaneous floorplanning, technology mapping and gate placement along the timing critical paths.

The issues and concerns regarding the implementation of the above and any other integrated logical and physical design methods are addressed in Chapter 5. Generally, existing open-source CAD systems target either the logic synthesis or the physical design domain, creating an artificial separation between the two design phases. Chapter 5 presents PEGASUS, a new CAD system which provides a unified logical and physical database and tools that span different levels of the design hierarchy. PEGASUS is an environment for specifying, synthesizing, and optimizing large designs and provides the facility for implementing application programs for integrated logic and layout synthesis. PEGASUS has been designed according to object-oriented programming principles and is intended to be efficient, programmable, and extensible.

Chapter 2
Simultaneous Fanout Optimization and Routing Tree Construction

2.1. Introduction

Consideration of the effects of interconnect delay and area has become a crucial factor in the design of ultra-dense, high-speed integrated circuits. In an industry where higher performance design brings substantial advantages over the competition, more and more time and resources are being spent on making faster chips through careful optimization of many design aspects, especially interconnect planning and optimization.
In particular, the problem of constructing a buffered routing tree has emerged as a critical design problem.

This chapter presents a new algorithm, FANROUT, which simultaneously solves the fanout optimization and routing tree generation problems. Both of these design tasks are difficult optimization problems and have considerable effects on the circuit delay and area. Fanout optimization is effective by boosting the transmitted signal via insertion of sized buffers, whereas performance-driven routing generation is effective by generating interconnect structures which deliver the signal to critical sinks faster. In conventional design flows, these two tasks are often performed in a sequential manner. Consequently, a solution produced by one of these optimizations becomes the input for the other. That flow reduces the flexibility and impact of these operations. Solving the unified problem, i.e. generating a buffered routing tree for a set of sinks and a driver, helps capture the intrinsic interactions between the combined design steps and produces higher quality implementations by systematically searching a much larger solution space.

2.2. Prior Work

2.2.1. Fanout Optimization

Fanout optimization, an operation performed in the logic domain, addresses the problem of distributing an electrical signal to a set of sinks with known loads and required times so as to maximize the required time at the signal driver (the root of the net). Interconnect delays are ignored in this operation because the locations of the sinks are not known at this stage. The general form of this problem is NP-hard [To90]; however, its restriction to some special families of topologies is known to have polynomial complexity. Among many fanout optimization techniques - e.g. [Go76], [BCD89], [SS90], and [VP93] - the one proposed by [To90] has been proven to be very effective.
That algorithm introduces a special class of tree topologies, called LT-Trees, for which the fanout problem is solved optimally with respect to a given order of sinks using dynamic programming. An LT-Tree of type I is a tree that permits at most one internal node among the immediate children of its internal nodes and also does not allow any left sibling for the internal nodes.

Fig. 2. An LT-Tree Type-I for a net with 9 sinks (sinks drawn in the given order).

Touati proposed a dynamic programming-based algorithm for the fanout optimization problem where the buffer structure is restricted to the LT-Tree topology and sinks with larger required times are placed farther from the root of the tree. The algorithm first sorts the sinks in non-decreasing order of required time and then, starting from the least critical sink, enumerates all the leftmost groupings of the sinks to be driven by a buffer. Finally, for each grouping, it enumerates all possible ways of adding either zero or one buffer to drive the leftmost subset of the sinks. Touati gives sufficient conditions for the LT-Tree construction algorithm LTTREE to be optimal. For more details, see [To90].

Lemma 1: The LT-Tree construction algorithm has $O(n^2)$ complexity, where $n$ is the number of sink nodes [To90].

2.2.2. Routing Tree Construction

Performance-driven interconnect design, an operation performed in the physical domain, addresses the problem of connecting a signal driver to a set of sinks with known loads, required times and locations so as to maximize the required time at the driver. [CHKM96] gives a comprehensive review of the algorithms for solving this problem.
The inherent complexity of the problem has forced researchers to solve it heuristically or to impose constraints on the structure of the resulting interconnect.

Among the recent works in this area, the algorithm presented by Lillis et al. in [LCLH96] has been shown to be quite effective. The authors proposed the Permutation-Constrained Routing Tree, or P-Tree, structure and solved the above problem with respect to the P-Tree structure; see Fig. 3 for an example. This approach consists of two major phases: I) heuristically finding a proper order for the sinks, II) generating the routing structure based on the order. The second phase of the algorithm is referred to as PTREE throughout this thesis. Given an order for the sink nodes, PTREE finds the optimal embedding of the net into the Hanan grid¹ using a dynamic programming approach. In PTREE, the intermediate routing solutions are stored in the form of two-dimensional, non-dominated solution curves of total area versus required time for every Hanan point².

1. The Hanan grid of a net is defined as the grid formed by the intersection of horizontal and vertical lines running through the terminals of the net [Ha66].
2. Hanan points of a net are the vertices of the Hanan grid of the net.

Fig. 3. An output of PTREE for the "dcba" order.

Lemma 2: For a given order on the sinks and with the restriction that the Steiner points lie on the Hanan points, PTREE computes the set of all rectilinear Steiner trees with non-dominated required time and total capacitance [LCLH96].

Lemma 3: If the individual capacitive values of wires and gate inputs are polynomially bounded integers or can be mapped to such with sufficient precision, then PTREE has $O(n^5 q)$ pseudo-polynomial complexity (see [GJ79]), where $n$ is the number of sink nodes and $q$ is the maximum number of distinct load values [LCLH96].
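As a small illustration of the Hanan grid and Hanan points defined in the footnotes above, the following Python sketch enumerates the Hanan points of a net. The function name `hanan_points` and the sample coordinates are hypothetical and are not taken from PTREE itself.

```python
from itertools import product


def hanan_points(terminals):
    """Return the Hanan points of a net.

    The Hanan grid is formed by drawing one horizontal and one vertical
    line through every terminal; its vertices are all (x, y) pairs built
    from a terminal x-coordinate and a terminal y-coordinate.
    """
    xs = sorted({x for x, _ in terminals})
    ys = sorted({y for _, y in terminals})
    return list(product(xs, ys))


# Hypothetical net: one driver and two sinks (coordinates are made up).
terminals = [(0, 0), (2, 3), (5, 1)]
points = hanan_points(terminals)
```

For $n+1$ terminals with distinct coordinates this yields $(n+1)^2$ points, which is why the number of candidate locations grows quadratically with the number of pins when Hanan points are used.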
Corollary 1: If PTREE is called with $\alpha$ sinks and it uses $k$ candidate locations instead of Hanan points, its complexity is $O(k\alpha^3 q)$.

2.2.3. Other Related Works

Van Ginneken [Gi90] proposed an algorithm to insert buffers on appropriate internal nodes of a given routing tree in order to maximize the required time at the driver. Applying van Ginneken's method after constructing the routing tree is usually more effective than applying fanout optimization followed by routing tree generation [SLP98].

The first attempts to combine fanout optimization and routing generation were presented in [OC96a] and [LCL96]. In [OC96a], the authors proposed a combination of the A-Tree routing generation [CLZ93] and van Ginneken's buffer insertion [Gi90] methods. They later extended the work in [OC96b] to include wire sizing as well. Their algorithm takes the placement information of the source and the sinks in addition to the signal required times and heuristically generates a buffered routing structure that maximizes the required time at the source of the net. In these works, the subtrees are combined using a weighted addition function with a user-specified parameter to heuristically decide which two subtrees are to be merged. The algorithms in [OC96a] and [OC96b] have no guarantee of optimality.

In [LCL96], Lillis et al. introduced a new version of PTREE which systematically solves the integrated problem of buffering and routing. That algorithm, called B_PTREE in the rest of this thesis, uses a dynamic programming formulation and generates three-dimensional solution curves. B_PTREE, like PTREE, is optimal only with respect to a given sink order.

2.3. Problem Formulation

Given a net with $n+1$ pins, the problem is to drive the set of sink pins, $S=\{s_1, s_2, \ldots, s_n\}$, by the driver of the net, $s$, via a buffered routing structure that satisfies a combination of the maximum required time at the root and the minimum total area constraints. The area constraint can be stated in the form of total buffer area or total capacitance; the total capacitance is considered as a metric indicating the total buffer and interconnect area.

More specifically, the problem may be stated in two ways: I) maximize the required time subject to an area constraint, II) minimize the area subject to a required time constraint. The following information is provided as input:

1. The position of the source $s=(s^x, s^y)$, where $s^x$ and $s^y$ are the horizontal and vertical coordinates of $s$.
2. The properties of each sink node $s_i=(s_i^x, s_i^y, s_i^c, s_i^r)$ for $1 \le i \le n$, where $s_i^x$ and $s_i^y$ are the horizontal and vertical coordinates, $s_i^c$ is the capacitive load, and $s_i^r$ is the signal required time of node $s_i$.
3. A library of buffers $B=\{b_1, b_2, \ldots, b_m\}$ containing $m$ buffers with different strengths.
4. A set of $k$ candidate locations for placing the buffers, $P=\{p_1, p_2, \ldots, p_k\}$.
5. A linear ordering of the sinks, $\Pi=(s_1, s_2, \ldots, s_n)$.

There are many candidates for $P$; it can be the set of Hanan points [Ha66] (similar to what [LCLH96] has proposed) or a set of reserved buffer locations (identified as a result of the placement phase). Our experiments, in agreement with a conclusion made in [LCLH96], demonstrate that neither of the above choices alters the final result significantly as long as the following two conditions are satisfied: I) $k$ is large enough with respect to $n$, and II) the candidate locations are distributed within the bounding box of the net, with higher concentration in regions with a high density of sink pins.
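The five inputs listed above can be pictured as a small data model. The sketch below is only an illustration: the class names, the buffer fields `input_cap` and `area`, and the helper `candidates_inside_box` are assumptions made for this example and do not describe the thesis implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Sink:
    x: float         # horizontal coordinate
    y: float         # vertical coordinate
    load: float      # capacitive load
    req_time: float  # signal required time


@dataclass(frozen=True)
class Buffer:
    name: str
    input_cap: float  # assumed field: input capacitance of the buffer
    area: float       # assumed field: cell area of the buffer


@dataclass
class Net:
    source: Point            # position of the driver s
    sinks: List[Sink]        # stored in the given linear order Pi
    buffers: List[Buffer]    # buffer library B
    candidates: List[Point]  # candidate locations P

    def bounding_box(self) -> Tuple[float, float, float, float]:
        """Bounding box (xmin, ymin, xmax, ymax) of the driver and sinks."""
        xs = [self.source[0]] + [s.x for s in self.sinks]
        ys = [self.source[1]] + [s.y for s in self.sinks]
        return min(xs), min(ys), max(xs), max(ys)

    def candidates_inside_box(self) -> bool:
        """Check that every candidate location lies in the bounding box."""
        x0, y0, x1, y1 = self.bounding_box()
        return all(x0 <= x <= x1 and y0 <= y <= y1
                   for x, y in self.candidates)
```

Storing the sinks as an ordered list makes the ordering $\Pi$ part of the instance, which matches the fact that the result is defined relative to that order.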
2.4. Order-Dependent Hierarchical Buffered Routing Tree Construction

This section presents FANROUT, an algorithm for solving the problem of simultaneous fanout optimization and routing generation. The resulting buffered routing tree contains a logical hierarchy that captures the hierarchical sink groups used during the construction. The hierarchy tree has a specific structure which is formally defined below.

2.4.1. Cα-Trees

A desired property of a hierarchical algorithm is its independence from any specific class of hierarchy graph structure. However, in many cases the complexity is so high that there is no choice but to restrict the solution space to a family of hierarchies. In this case, the problem is to identify a set of structures which are consistent with the nature of the problem and which can be constructed with reasonable effort. In the following, a new class of trees, Cα-Trees (read as "C-alpha trees"), is introduced and used to capture the hierarchy in the buffered routing construction algorithm.

Definition 1: A tree is a degree-restricted alphabetic buffer chain tree (Cα-Tree) for a given order of sinks $\Pi=(s_1, s_2, \ldots, s_n)$ if and only if:
• every internal node has at most one internal node among its immediate children,
• there is a depth-first traversal that visits the sinks in the $(s_1, s_2, \ldots, s_n)$ order,
• the maximum branching factor is $\alpha$.

Fig. 4. A valid C4-Tree for $(s_1, s_2, \ldots, s_9)$ (sinks drawn in the given order).

Fig. 4 illustrates an example of a C4-Tree. In this figure the maximum branching factor is four and every internal node (shown by white circles) is connected to at most one other internal node while preserving the given order.
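The three conditions of Definition 1 can be checked mechanically. The sketch below assumes a hypothetical encoding in which an internal node is a Python list of its children (in left-to-right order) and a sink is any non-list value; neither the encoding nor the function name comes from the thesis.

```python
def is_c_alpha_tree(tree, alpha, order):
    """Check the three conditions of Definition 1 on a nested-list tree.

    tree  : internal nodes are lists of children, sinks are non-list labels
    alpha : maximum allowed branching factor
    order : the given sink order, e.g. [1, 2, ..., n]
    """
    leaves = []

    def visit(node):
        if not isinstance(node, list):      # a sink
            leaves.append(node)
            return True
        if len(node) > alpha:               # branching factor bound
            return False
        internal_children = sum(isinstance(c, list) for c in node)
        if internal_children > 1:           # at most one internal child
            return False
        return all(visit(child) for child in node)

    # The left-to-right depth-first traversal must visit sinks in order.
    return visit(tree) and leaves == list(order)


# A hypothetical tree over nine sinks with branching factor at most four.
example = [1, [2, 3, [4, 5, 6], 7], 8, 9]
valid = is_c_alpha_tree(example, alpha=4, order=range(1, 10))
```

Because each internal node has at most one internal child, the nested lists in a valid tree always form a single chain of internal nodes from the root downward.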
Lemma 4: In a Cα-Tree, the internal nodes form a unique path (chain).

Proof: This is an immediate conclusion from Definition 1. ■

In this application, every internal node is a buffer and, in the resulting buffer chain, a more critical sink (considering both timing and physical information) tends to be connected closer (in terms of the number of intermediate stages) to the root. Parameter $\alpha$ represents the maximum number of fanouts for every buffer or branching point. Our experience shows that even when no restriction is imposed on the maximum number of fanouts for each buffer, the maximum fanout count in the optimal buffer tree solution is usually bounded by a small number. That value depends on the characteristics of the cells (sink nodes) and the buffer library and not on the problem size (number of sinks). Notice that by eliminating the parameter $\alpha$ from the definition, the main structure and properties of Cα-Trees do not break down. In that case, the only disadvantage is the larger (still polynomial) runtime needed for optimally constructing such a structure.

Fig. 5. Optimal Cα-Tree construction (shown for L=6).

Although there is a large number of Cα-Trees for every sink order, the optimal Cα-Tree can be found in polynomial time using dynamic programming. Briefly, the optimal Cα-Tree for an ordered set of sinks is generated by starting from small values of $L$ and combining every $L$ neighboring sinks, until $L=n$. At every step, the best solutions for the sub-groups with length $l$ ($l < L$) are available - due to the bottom-up flow of the method - and are used to generate the solution for the length-$L$ sub-problem; see Fig. 5. Note that the final Cα-Tree structure satisfies the given sink order. This algorithm will be referred to as CαTREE in the rest of this thesis.

Lemma 5: LT-Tree Type-I [To90] is a special case of Cα-Tree where $\alpha = \infty$ and no internal node has a left sibling.
Proof: The proof directly follows from the definitions of LT-Tree Type-I and Cα-Tree. ■

Note that Cα-Trees can be relaxed with respect to the first property given in Definition 1, i.e. each internal node may have more than one internal node (but bounded by a certain parameter) among its immediate children. Although the optimal structure can still be achieved using dynamic programming, the complexity of the corresponding optimal construction algorithm is significantly larger.

2.4.2. FANROUT

FANROUT incorporates Cα-Tree and P-Tree construction techniques into a unified framework such that the resulting routing structure is both a Cα-Tree, in terms of the overall topology, and a P-Tree, in terms of the detailed physical structure. FANROUT requires an ordering of the sinks and guarantees the optimality of the solution with respect to that ordering only. In the following paragraphs, the details of FANROUT are given.

FANROUT (see Fig. 7) is called with a set of parameters: $s$, $P$, $B$, and $\Pi$. The parameters $s$ and $\Pi=(s_1, s_2, \ldots, s_n)$ represent the root and an ordered set of sinks of a subject net. The parameter $P=\{p_1, p_2, \ldots, p_k\}$ represents a set of candidate locations for the placement of buffers and Steiner points in the final buffered routing structure. Finally, $B=\{b_1, b_2, \ldots, b_m\}$ is a library of buffers.

Fig. 6. A three-dimensional solution curve (axes: input load, required time, area).

FANROUT operates on three-dimensional solution curves $\Gamma$ (see Fig. 6), each associated with a candidate buffer location $p$ and a sub-group of sinks identified by the variables $L$ and $R$. $L$ is the length of the sub-group and $R$ indicates the position of the rightmost sink of the sub-group in $\Pi$.
For example, if $\Pi=(s_1, s_2, \ldots, s_9)$, $\Gamma(3, 7, p_i)$ stores all the buffered routing structures that connect $p_i$ to $\{s_5, s_6, s_7\}$. In FANROUT all solution curves store only non-inferior solutions as defined below.

Definition 2: Suppose $\sigma_1$ and $\sigma_2$ are two different buffered routing structures that connect a candidate location to a set of sinks. $\sigma_2$ is said to be inferior to $\sigma_1$ iff $\mathrm{load}(\sigma_1) \le \mathrm{load}(\sigma_2)$, $\mathrm{reqTime}(\sigma_2) \le \mathrm{reqTime}(\sigma_1)$, and $\mathrm{area}(\sigma_1) \le \mathrm{area}(\sigma_2)$.

As shown in Fig. 7, FANROUT consists of three main sections: Initialization, Construction, and Extraction. The Initialization section deals with creating and initializing solution curves corresponding to sub-problems consisting of only one sink, i.e. $L=1$. FANROUT is a dynamic programming based technique and at each step it generates new curves by combining and manipulating already available curves for smaller sub-problems. In the Construction section, this bottom-up step is repeated until the solution curve for the main problem is found. Finally, in the Extraction section, the solution with the best trade-off between required time and total area is chosen from among the solutions of the final $\Gamma$. At the end, the corresponding structure is generated by tracing back the pointers of the constituting sub-problems.

algorithm FANROUT( $s$, $P$, $B$, $\Pi=(s_1, s_2, \ldots, s_n)$ )
INITIALIZATION
1.  for $r = n$ downto $1$
2.    foreach $p \in P$
3.      set $\Gamma(1, r, p)$ = {the set of all non-inferior paths extended from $p$ to $s_r$ and driven with or without a buffer}
CONSTRUCTION
4.  for $L = 2$ to $n$
5.    for $R = n$ downto $L$
6.      for $l = \max(1, L-\alpha+1)$ to $L$
7.        for $r = R$ downto $R-L+l$
8.          foreach $p \in P$
9.            foreach $\gamma \in \Gamma(l, r, p)$
10.             set $\Delta$ = PTREE($P$, $[s_{R-L+1}, \ldots, s_{r-l}, \gamma, s_{r+1}, \ldots, s_R]$)
11.             foreach $\delta \in \Delta$
12.               set $p'$ = location of the root of $\delta$
13.               foreach $b \in B$
14.                 set $\delta'$ = a buffered routing structure created by driving $\delta$ by $b$ located at $p'$
15.                 set $\langle c, t, a \rangle$ to the input capacitance, the input required time and the area of $\delta'$, respectively
16.                 if $\langle c, t, a \rangle$ is a non-inferior solution in $\Gamma(L, R, p')$
17.                   insert $\langle c, t, a \rangle$ in $\Gamma(L, R, p')$
EXTRACTION
18. find the solution $\rho$ in $\Gamma(n, n, s)$ which best satisfies the constraints
19. retrieve the buffered routing tree structure $\Re$ of $\rho$ by following the pointers stored during the generation of the solution curves
20. return $\Re$

Fig. 7. Pseudo-code for FANROUT.

1) Initialization: In FANROUT, before performing any operation, a set of solution curves is initialized in lines 1 through 3. In this part of the algorithm, sub-groups of length 1 are considered and the corresponding solution curves for every candidate buffer location and sink sub-group are initialized. These initial solutions consist of the minimum Manhattan distance paths from the candidate location $p$ to the sink $s_r$. At the root of these paths, both options of inserting and not inserting a buffer are examined.

2) Construction: FANROUT starts by working on the groups consisting of only one sink, i.e. $L=1$, and proceeds until $L=n$. At each step, it constructs buffered routing structures that connect $L$ neighboring sinks within $\Pi$. Line 4 enumerates all the possible values for $L$ and line 5 detects every legal sub-group $\Omega$ of $\Pi$ that contains $L$ sinks.

Every sub-group of sinks can potentially constitute an internal node in the final Cα-Tree structure; therefore, according to Definition 1, it can contain at most one other internal node (smaller sub-group) as its immediate child. Consequently, during the process of grouping a set of $L$ sinks, the cases must be considered in which a subset of the sinks (call it $\omega$) has already been grouped; see Fig. 7 and lines 6 and 7 in the pseudo-code. That way the Cα-Tree structure which captures the hierarchy of the design is generated and maintained.
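The index ranges in lines 4 through 7 of Fig. 7 can be checked with a short enumeration. This sketch assumes that $\Omega$ covers sinks $R-L+1$ to $R$, that $\omega$ covers sinks $r-l+1$ to $r$, and that the lower bound on $r$ is read as $R-L+l$ so that $\omega$ stays inside $\Omega$; the function name is hypothetical.

```python
def subgroup_pairs(n, alpha):
    """Yield (L, R, l, r) in the order of lines 4-7 of the pseudo-code.

    Omega is the sub-group of L sinks ending at position R;
    omega is the already-grouped sub-group of l sinks ending at position r.
    """
    for L in range(2, n + 1):                            # line 4
        for R in range(n, L - 1, -1):                    # line 5
            for l in range(max(1, L - alpha + 1), L + 1):    # line 6
                for r in range(R, R - L + l - 1, -1):    # line 7 (assumed bound)
                    yield L, R, l, r


pairs = list(subgroup_pairs(n=5, alpha=3))
```

Every tuple produced this way satisfies $L-l+1 \le \alpha$, i.e. the node for $\Omega$ drives at most $\alpha$ children (the $L-l$ remaining sinks plus the node for $\omega$), and the positions of $\omega$ never fall outside those of $\Omega$.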
In this context, the hierarchy implies that during the generation of a buffered routing structure, the sinks are not all processed at once; instead, at any time a subset of sinks is combined in agreement with the Cα-Tree structure. Later, each combination is treated as one node in the next level of the hierarchy.

Fig. 8. An illustration for the grouping steps (panels: a grouping situation; the corresponding Cα-Tree).

In line 6, the term $\max(1, L-\alpha+1)$ ensures that $\Omega$ does not drive more than $\alpha$ other internal and sink nodes, following the third property of Cα-Trees given in Definition 1. In line 7, the range from $R$ down to $R-L+l$ ensures that $\omega$ remains within $\Omega$. After line 7, it is known which sub-group $\omega$ is to be combined with which sink node(s) to generate the new sub-group $\Omega$. However, there are many solutions associated with each $\omega$; in fact, for every buffer candidate location $p$, there is a solution curve for $\omega$ which has to be considered in the merge operation; see Fig. 9. Line 8 enumerates all the candidate points, and line 9 retrieves the non-inferior solutions $\gamma$ from the solution curve of $p$ which corresponds to $\omega$.

In line 10, PTREE is called on the root of $\gamma$¹ and the rest of the sinks in $\Omega$ in order to combine them by routing structures whose roots can be located at any candidate location. PTREE returns a collection of solution curves and stores them in $\Delta$. Then, for every buffered routing structure in $\Delta$, all the buffers in the library are tried to drive its root and the non-inferior combinations are stored in the corresponding solution curves; see lines 11 through 17. Along with every solution, a set of pointers is also stored, which is used to reconstruct the best solution during the extraction phase.

1. PTREE sees the root of $\gamma$ as a pseudo-sink whose required time and load are those of $\gamma$.

Fig. 9. Employing existing sub-solutions to generate larger sub-solutions (example: the curves $\Gamma(2, 5, p_1)$ and $\Gamma(2, 5, p_2)$, with $l=2$, $r=5$, are used for the sub-group with $L=4$, $R=5$).

Note that the operations performed in lines 11 through 17 can, in fact, be performed internally by a modified PTREE with no change in the worst-case complexity of PTREE. Therefore, the complexity of that part of the code is not considered during the complexity analysis of FANROUT.

3) Extraction: The above bottom-up construction process continues until the solution curve for the whole problem, i.e. $L=n$, is generated. At that point, solutions of the problem are stored in $\Gamma(n, n, s)$ because they are rooted at $s$ and are connected to all the sinks. From among all the non-inferior solutions of $\Gamma(n, n, s)$, the one which best satisfies the input constraints is chosen. The buffered routing structure corresponding to that solution is retrieved in lines 18 and 19 by following the stored pointers. Finally, in line 20 the best buffered routing structure is returned.

2.4.3. Quality and Complexity Analysis

The proposed algorithm is an optimal polynomial algorithm under a set of assumptions, as explained before and in the following lemmas and theorems.

Theorem 1: The solution space of FANROUT is the product of those of PTREE and CαTREE.

Proof: A careful analysis of the pseudo-code shows that any P-Tree structure with inserted buffers such that no buffer immediately drives more than one other buffer is considered by FANROUT. Also, any Cα-Tree whose buffers' output nets are implemented using PTREE is considered by FANROUT. ■

Lemma 6: The following statements are true for any routing structure $\Re$ that connects a source to a set of sinks: I) By decreasing the load of any sink, the capacitance observed at the root of $\Re$ does not increase.
II) By increasing the required time of any sink, the required time at the root of $\Re$ does not decrease.

Proof: For case I, decreasing the load of a sink decreases the amount of charge needed to bring the voltage of $\Re$ to a certain level. Therefore, statement I is valid. For case II, if that particular sink is on the critical path, the statement is trivially true. Otherwise, the required time of the driver is determined by the required time of the other sinks and remains unchanged. ■

Lemma 7: PTREE is monotone with respect to the load and the required time of the sinks.

Proof: Suppose $\Re$ is a routing structure generated by PTREE. Reducing the capacitance and/or increasing the required time of a sink while preserving $\Re$ results in a decrease of the capacitance and an increase of the required time at the root of $\Re$. Therefore, if PTREE is run after changing the load and the required time of the sinks in this way, the resulting structure is non-inferior with respect to $\Re$ and PTREE stores it in the curve. ■

Lemma 8: The use of the prune operation by FANROUT does not result in the loss of any non-inferior solution.

Proof: Assume that $\sigma_2$ is inferior with respect to $\sigma_1$. By induction, if $\sigma_2$ is the whole net and its input is directly connected to the net driver, the required time does not decrease and the load does not increase by replacing $\sigma_2$ with $\sigma_1$. If $\sigma_2$ is a solution to a sub-problem, its input is driven by another internal node, call it $g$. Due to the monotone behavior of PTREE (cf. Lemma 7), at $g$ the required time and the input load of the implementation including $\sigma_2$ are guaranteed to be no better than those of the implementation containing $\sigma_1$. A similar argument is then valid for $g$ and the rest of the internal nodes down to the leaf nodes. ■
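The prune operation referred to in Lemma 8 can be sketched directly from Definition 2. The code below is a minimal illustration that stores a solution curve as a plain Python list; the names `Solution`, `is_inferior` and `insert_non_inferior` are hypothetical, and a real implementation would also keep the back-pointers used in the extraction phase.

```python
from typing import List, NamedTuple


class Solution(NamedTuple):
    load: float      # input capacitance seen at the root
    req_time: float  # required time at the root
    area: float      # total area of the structure


def is_inferior(a: Solution, b: Solution) -> bool:
    """True if a is inferior to b in the sense of Definition 2."""
    return (a != b and b.load <= a.load
            and b.req_time >= a.req_time and b.area <= a.area)


def insert_non_inferior(curve: List[Solution], cand: Solution) -> List[Solution]:
    """Return the curve after trying to insert cand, keeping it pruned."""
    if any(cand == s or is_inferior(cand, s) for s in curve):
        return curve                      # cand adds nothing new
    # Drop stored solutions that cand makes inferior, then add cand.
    return [s for s in curve if not is_inferior(s, cand)] + [cand]


curve: List[Solution] = []
for triple in [(2.0, 10.0, 5.0), (3.0, 9.0, 6.0), (1.0, 12.0, 4.0), (0.5, 8.0, 7.0)]:
    curve = insert_non_inferior(curve, Solution(*triple))
```

In this run the second candidate is rejected as inferior to the first, the third replaces the first, and the fourth is kept because it trades a lower load for a worse required time and area.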
Theorem 2: FANROUT is an optimal algorithm.

Proof: An examination of the dynamic programming structure of FANROUT shows that if no pruning is performed on the solution curves, all the possible solutions are considered. Therefore, to prove the optimality of the algorithm it is enough to prove that, for an optimal solution, replacing a non-inferior solution with an inferior solution cannot improve the whole implementation. This, however, was proved in Lemma 8. ■

Lemma 9: Depending on what metric is used for measuring the area, the number of solutions in a solution curve is bounded either polynomially or pseudo-polynomially.

Proof: The load of any solution is the input capacitance of the driving buffer. However, the number of distinct input capacitances of the buffers is bounded by the total number of available buffers in the library, $m$. In every solution the maximum number of inserted buffers is bounded by $O(n)$. Therefore, the number of distinct buffer areas is also bounded by $O(n)$, because the area of every buffer is smaller than a constant and the smallest difference between the areas of any two solutions is always greater than a constant. Both these limits are determined by the library and do not depend on the size of the problem.

If the total buffer area is the metric used for measuring the area, the number of non-inferior solutions in a solution curve is bounded by $O(mn)$. The reason is that the prune operation keeps at most one solution for each distinct pair of area and input load values. The bound becomes pseudo-polynomial when the total capacitance is the metric used for measuring the area. In that case, we assume the number of distinct capacitive loads is polynomially bounded ($q$) and is larger than the number of inserted buffers. As a result, the number of non-inferior solutions is pseudo-polynomially bounded by $O(mq)$.
■

For the sake of simplicity, in the following it is assumed that the total buffer area is used as the metric to measure the area cost of a solution.

Theorem 3: FANROUT has $O(kmn^3)$ memory complexity, where $k$, $m$, and $n$ are the numbers of candidate locations, buffers, and sinks, respectively.

Proof: There are $k$ candidate locations and, for each combination of $L$ and $R$ (a total of $n(n+1)/2$ combinations), there is a solution curve. Each solution curve stores $O(mn)$ solutions and, as a result, the claim is proved. ■

Theorem 4: FANROUT has $O(mqk^2\alpha^5n^3)$ runtime complexity, where $k$, $m$, and $n$ are the numbers of candidate locations, buffers, and sinks, respectively. Also, $\alpha$ is the maximum branching factor in the Cα-Tree and $q$ is a polynomially bounded number of distinct capacitive loads.

Proof: In Fig. 7, the number of iterations performed in lines 4 through 7 is $O(\alpha^2n^2)$. Lines 8 and 9 introduce $O(k)$ and $O(mn)$ complexity, respectively. Calling PTREE in line 10 costs $O(k\alpha^3q)$, because the number of sinks provided to PTREE is always less than $\alpha$; see Corollary 1. The complexity of FANROUT is the product of these factors, $O(\alpha^2n^2) \cdot O(k) \cdot O(mn) \cdot O(k\alpha^3q) = O(mqk^2\alpha^5n^3)$. ■

2.5. Experimental Results

In order to verify the effectiveness of FANROUT, a set of experimental results is reported here. As for the conventional flows, we do not impose any restrictions on the ordering of the sinks. In other words, each fanout optimization and routing tree generation method is free to choose its own appropriate ordering for the sinks (if one is needed).

In Tables 1 and 2, the results are presented for a set of nets taken from a number of benchmarks where the sinks are placed randomly. For these examples, two conventional flows are compared against FANROUT, where FANROUT has been applied on two different orderings:

1. Ordering with respect to the sink required times, REQ.
2. Ordering generated by solving the traveling salesman problem on the set of sinks, TSP.

The first conventional flow setup, conv-I, uses SIS [SSLM92] for fanout optimization, followed by PTREE for routing tree generation. For each net, the different fanout optimization methods available in SIS are tried and only the best result in terms of the required time is reported. The second conventional flow setup, conv-II, uses PTREE for routing tree generation followed by the buffer insertion method introduced in [Gi90]. Note that in Tables 1 and 2, "area", "req-time" and "wire length" stand for the sum of the area of buffers, the required time at the input of the driver, and the total wire length, respectively.

Our next set of experiments, in Table 3, compares the performance of the conventional design flows against our proposed simultaneous algorithm on a number of benchmarks using a CASCADE standard cell library (0.5 µm HP CMOS process). Gate and wire delays are calculated using a 4-parameter delay equation (similar to that in [LSP97]) and the Elmore delay model [El48], respectively. Also, FANROUT was run with TSP orderings for the experiments reported in Table 3. These experiments showed that the runtime of the fast FANROUT is on the order of a few minutes, which is comparable to the runtimes of the conventional flows. Note that the area and delay reported in this table are the total chip area and delay after detailed routing. These experiments were run in the SIS environment on an Ultra-2 Sun Sparc workstation (sahand.usc.edu) with 256 MB of memory.
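Since wire delays in these experiments are computed with the Elmore delay model, a minimal sketch of that model on an RC tree may help. It assumes a lumped model in which each node carries a single capacitance and each edge a single resistance, and it ignores the driver resistance; the function name, data layout and numeric values are hypothetical and are not taken from the experimental setup.

```python
def elmore_delays(parent, res, cap, root):
    """Elmore delay from the root to every node of an RC tree.

    parent[v] : parent of node v (the root has no entry)
    res[v]    : resistance of the wire from parent[v] to v
    cap[v]    : capacitance lumped at node v (wire plus pin load)
    """
    children = {v: [] for v in cap}
    for v, p in parent.items():
        children[p].append(v)

    down = {}  # total capacitance in the subtree rooted at each node

    def downstream(v):
        down[v] = cap[v] + sum(downstream(c) for c in children[v])
        return down[v]

    downstream(root)

    delay = {root: 0.0}
    stack = [root]
    while stack:
        v = stack.pop()
        for c in children[v]:
            # Each wire contributes its resistance times the capacitance
            # it has to charge, i.e. everything downstream of it.
            delay[c] = delay[v] + res[c] * down[c]
            stack.append(c)
    return delay


# Hypothetical tree: driver d -> a, then a -> b and a -> c.
parent = {"a": "d", "b": "a", "c": "a"}
res = {"a": 2.0, "b": 3.0, "c": 1.0}
cap = {"d": 0.0, "a": 1.0, "b": 2.0, "c": 4.0}
delays = elmore_delays(parent, res, cap, "d")
```

For this example the wire into `a` sees 7 units of downstream capacitance, giving delays of 14 at `a`, 20 at `b` and 18 at `c` in the corresponding resistance-capacitance units.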
                             Conventional
                       conv-I                          conv-II
Circuit  net    # of   req-time  area     wire     req-time  area     wire
                sinks                      length                      length
C432     net1   6      35.23     219.12   83.07    21.52     149.93   63.15
         net2   10     33.50     329.12   135.15   29.74     297.66   118.64
C1355    net3   8      36.23     329.12   101.28   25.12     297.66   87.09
         net4   9      34.23     382.58   116.94   30.33     268.18   116.58
C3540    net5   35     38.70     457.49   119.17   38.20     1147.30  152.62
         net6   73     59.44     836.99   535.58   59.78     836.99   549.36
C5315    net7   12     24.94     516.23   68.22    12.21     268.18   42.89
         net8   21     33.10     542.96   195.17   35.59     533.50   254.74
C6288    net9   16     48.33     516.23   144.30   43.75     415.58   168.95
         net10  20     62.49     436.04   146.61   95.96     238.70   175.93
C7552    net11  16     48.57     516.23   179.16   30.28     504.02   211.38
         net12  23     41.68     245.85   185.45   54.88     503.69   261.70

Table 1: Conventional flows for single nets.

                               FANROUT
                       REQ                             TSP
Circuit  net    # of   req-time  area         wire     req-time  area         wire
                sinks                          length                          length
C432     net1   6      21.83     165.66       63.41    17.08     165.66       55.49
         net2   10     28.05     [illegible]  93.22    24.46     [illegible]  113.11
C1355    net3   8      32.14     195.47       98.44    29.46     195.47       96.57
         net4   9      28.78     195.47       109.74   26.65     61.82        110.31
C3540    net5   35     32.16     270.38       122.82   32.01     270.38       132.27
         net6   73     54.75     649.88       549.30   54.69     649.88       583.05
C5315    net7   12     21.83     248.93       71.40    17.23     248.93       70.15
         net8   21     32.61     409.31       206.01   25.32     409.31       200.46
C6288    net9   16     40.38     [illegible]  157.68   28.96     [illegible]  160.35
         net10  20     51.67     222.20       136.42   42.86     222.20       145.90
C7552    net11  16     37.83     222.20       182.51   21.98     [illegible]  171.69
         net12  23     33.00     272.58       157.30   31.62     272.58       189.66
Conv-I:   [values illegible]
Conv-II:  [values illegible]

Table 2: FANROUT versus the conventional methods for single nets.
                      Conventional                          Ratios
            conv-I             conv-II            FANROUT/conv-I   FANROUT/conv-II
Circuit     Area       Delay   Area       Delay   Area    Delay    Area    Delay
C17         400.50     0.87    400.50     0.87    1.04    1.03     1.04    1.03
C1355       35539.54   10.39   35225.19   10.20   0.71    0.72     0.72    0.73
C1908       51936.70   16.34   48694.77   18.54   0.84    0.68     0.90    0.59
C432        21947.10   11.59   19179.60   13.54   1.01    1.00     1.16    0.86
C499        29203.65   9.27    29208.99   8.99    1.07    0.77     1.07    0.80
C5315       134504.94  19.31   127776.26  20.04   0.84    0.56     0.88    0.54
C880        29786.25   10.53   28626.21   10.20   0.70    0.95     0.73    0.98
alu2        30199.15   14.72   27942.48   17.23   0.78    0.72     0.84    0.61
alu4        50985.15   21.60   46912.67   23.89   1.02    0.80     1.10    0.73
apex6       44626.00   7.12    44514.75   6.67    0.89    0.74     0.89    0.79
cm151a      2042.32    2.88    1753.01    3.21    0.76    0.64     0.89    0.57
dalu        95323.54   23.65   53424.14   26.47   0.93    1.01     1.66    0.90
misex1      4015.55    4.25    3166.56    5.27    0.77    0.68     0.98    0.54
lal         5810.46    3.78    5931.42    4.10    0.85    0.74     0.83    0.68
frg1        6319.74    3.61    6425.50    3.54    0.87    0.75     0.85    0.76
pcle        4775.31    3.51    4644.03    3.53    0.87    0.54     0.90    0.54
rd73        3519.67    3.62    3594.87    3.50    1.04    1.01     1.02    1.05
vg2         5264.19    3.69    5334.03    3.62    0.97    0.68     0.95    0.70
Average Ratios: [values illegible]

Table 3: FANROUT versus conventional flows for a set of benchmarks.

Chapter 3

Semi Order-Independent Hierarchical Buffered Routing Tree Construction

3.1. Introduction

Similar to many other dynamic programming based algorithms, FANROUT is optimal only with respect to a given order on its input objects (in this case, the net sinks). This shortcoming is addressed in this chapter by introducing a new technique called local order-perturbation, which is used to enhance FANROUT. The resulting algorithm, called MERLIN, is less sensitive to the input sink order at the cost of a moderate increase in complexity.
The core optimization engine of MERLIN, called BUBBLE_CONSTRUCT, optimally solves the simultaneous routing and buffer insertion problem for a local neighborhood of an initial sink order. It exploits similar sub-solutions among the members of the neighborhood in order to maintain the polynomial complexity of the algorithm. Although a complete buffered routing structure is not generated for every member of the neighborhood, the sink order which results in the best buffered routing structure is automatically chosen from among the members of the neighborhood. The outer optimization part of MERLIN is an iterative technique based on a local neighborhood search strategy.

Fig. 10. Structure of MERLIN. (Figure labels: FANROUT, Local order-perturbation, Local search, BUBBLE_CONSTRUCT, MERLIN.)

Both FANROUT and BUBBLE_CONSTRUCT generate and propagate three-dimensional required time and load versus total area solution curves in a bottom-up fashion. In the three-dimensional solution curves, the load and the required time dimensions ensure the validity of the principle of dynamic programming [Be57] for solving the problem, whereas the total area allows the user to solve the problem for either one of the following two variants: I) minimizing the required time subject to an area constraint, II) minimizing the area subject to a required time constraint.

The technique presented in this chapter offers the following advantages compared to the existing methods:
• employment of a novel local order-perturbation technique (bubbling) which enables the optimization engine to find (in polynomial time) the best buffered routing tree structure in an exponential-size sink order neighborhood of the initial order,
• propagation of three-dimensional solution curves, which in turn allows the algorithm to trade off required time versus total area and vice versa,
• definition and use of *P-Tree structures which expand the power of the optimization algorithms, resulting in highly optimized solutions,
• employment of a local neighborhood search strategy along with the ability of the core optimization algorithm to find the best solution in the neighborhood of any sink order, which makes the proposed method much less sensitive to the initial order.

3.2. Local Order-Perturbation

This section presents a new technique that can enhance any order-dependent dynamic programming based algorithm, such as PTREE, CαTREE, and FANROUT, to generate optimal solutions with respect to a neighborhood of solutions.

Definition 3: An order $\Pi$ on n sinks is a one-to-one function $\Pi : \{1, 2, \ldots, n\} \to \{1, 2, \ldots, n\}$, and $j = \Pi(i)$ is called the position of $s_i$ in $\Pi$. Also, $\Pi^{-1}$ is the inverse function of $\Pi$, and $i = \Pi^{-1}(j)$ gives the sink index of the $j$'th element in $\Pi$.

Example 1: $\Pi = \{(1 \to 4), (2 \to 6), (3 \to 1), (4 \to 5), (5 \to 3), (6 \to 2), (7 \to 8), (8 \to 7), (9 \to 9)\}$, or equivalently $(s_3, s_6, s_5, s_1, s_4, s_2, s_8, s_7, s_9)$, is an order on $\{s_1, s_2, \ldots, s_9\}$. Also, $\Pi(3) = 1$ means $s_3$ is the first element in $\Pi$, and $\Pi^{-1}(2) = 6$ means the second element in $\Pi$ is $s_6$.

Although an algorithm that constructs an optimal structure for any given order is a useful tool, the main difficulty of the problem remains in how to come up with a "good" sink order.
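The order notation of Definition 3 can be made concrete with a small sketch. It encodes the order of Example 1 as a Python dictionary and recovers the equivalent sink sequence; the names `order`, `position`, `inverse`, and `as_sequence` are illustrative choices and do not come from the thesis.

```python
# Sketch of Definition 3 using the order of Example 1.
# All names below are hypothetical helpers introduced for illustration.

# Pi maps a sink index i to its position j = Pi(i); both are 1-based.
order = {1: 4, 2: 6, 3: 1, 4: 5, 5: 3, 6: 2, 7: 8, 8: 7, 9: 9}


def position(pi, i):
    """Return Pi(i), the position of sink s_i in the order."""
    return pi[i]


def inverse(pi):
    """Return Pi^{-1}, mapping a position j to the sink index at that position."""
    return {j: i for i, j in pi.items()}


def as_sequence(pi):
    """List the sink indices from the first position to the last."""
    inv = inverse(pi)
    return [inv[j] for j in range(1, len(pi) + 1)]


sequence = as_sequence(order)
```

Here `sequence` lists the sink indices in the same order as the tuple given in Example 1, and `inverse(order)[2]` is the index of the sink in the second position.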
In the problem of buffered routing generation, the required times, input loads, and physical locations of the sink nodes should all be considered for generating an appropriate order. How to incorporate those independent and sometimes opposing parameters in an order is not an easy problem. The exponential number of possible orders forces designers to use a heuristic approach to combine the effect of those parameters in an ad-hoc manner. In general, the limitation imposed by working with one order at a time is very undesirable.

Local order-perturbation is a technique that works in a neighborhood of sink orders. No matter how one has come up with an order, the semi-order-independent dynamic programming formulation performs a systematic search in the neighborhood of that order. If the initial order is not a local/global optimal order but is close to it, this method chooses the local/global optimal order automatically. The main advantage of this technique is its efficiency: it preserves optimality over the neighborhood while being exponentially faster than an exhaustive search. Its superiority primarily originates from its enhanced dynamic programming nature, which enables the method to take advantage of all similar sub-problems among all the neighboring orders and thus avoid recomputing the sub-solutions.

By allowing the bottom-up semi-order-independent algorithm to apply order perturbation operations, the sink order in the resulting solution can deviate from the initial order. A simple case is shown in Fig. 11, where the right-side border of a sub-group $\omega$ has been perturbed. Consequently, the order in the resulting group $\Omega$ is $(s_2, s_3, s_4, s_6, s_5, s_7)$ as opposed to the initial $(s_2, s_3, s_4, s_5, s_6, s_7)$ order; that is, in the new order $s_5$ has been swapped with $s_6$.

Fig. 11. Construction with order perturbation. (Figure labels: Bubble, Bubble Out.)

In Fig.
11, $s_5$ has been left out from $\omega$; it is called a bubble. When $\omega$ is used in a larger sub-group, it can be assumed that the bubble has been moved to the other side of the border of $\omega$ in the final structure. That operation is called Bubble Out, and it swaps two neighboring sinks.

Definition 4: For a set of sinks $\{s_1, s_2, \ldots, s_n\}$, the neighborhood of $\Pi$ is defined as:

$$N(\Pi) = \{\, \Pi' \mid \forall i,\ |\Pi(i) - \Pi'(i)| \le 1 \,\}.$$

In other words, the difference between the positions of every $s_i$ in $\Pi$ and $\Pi'$ is at most one.

Example 2: $\Pi' = (s_1, s_3, s_2, s_4, s_5, s_6, s_8, s_7, s_9)$ is in the neighborhood of $\Pi = (s_1, s_2, s_3, s_4, s_5, s_6, s_7, s_8, s_9)$, but $\Pi'' = (s_3, s_2, s_1, s_4, s_5, s_6, s_7, s_8, s_9)$ is not in the neighborhood of $\Pi$.

Definition 5: For $n > 1$, displacing the element $i$ ($1 \le i \le n-1$) of $\Pi$ (also referred to as the displacement operation in this thesis) is defined as swapping the location of $s_{\Pi^{-1}(i)}$ with the location of $s_{\Pi^{-1}(i+1)}$ in $\Pi$.

Example 3: Displacing the 4th element of $\Pi' = (s_1, s_3, s_2, s_4, s_5, s_6, s_8, s_7, s_9)$ results in $(s_1, s_3, s_2, s_5, s_4, s_6, s_8, s_7, s_9)$.

Lemma 10: Every $\Pi' \in N(\Pi)$ can be built from $\Pi$ using a series of non-overlapping displacement operations.

Proof: This is a proof by induction. Let us represent the sub-string of $\Pi$ which consists of the $i$ left-most elements of $\Pi$ by sub_string($\Pi$, i). For $i = 0$, it is trivial that sub_string($\Pi'$, 0) = sub_string($\Pi$, 0). Suppose that for $i = K-1$, sub_string($\Pi'$, K-1) can be obtained from sub_string($\Pi$, K-1) using a series of non-overlapping displacement operations. Let $j = \Pi^{-1}(K)$. Since $\Pi' \in N(\Pi)$, there are three cases based on Definition 4.
• $\Pi(j) = \Pi'(j)$: It means that the K'th elements in $\Pi$ and $\Pi'$ are the same. Therefore, the statement given in the above lemma holds for $i = K$ as well.
• $\Pi(j) = \Pi'(j) - 1$: It means that the (K+1)'th element in $\Pi'$ is the same as the K'th element in $\Pi$. Let $j' = \Pi'^{-1}(K)$.
In this case, we will have $\Pi(j') = \Pi'(j') + 1$; otherwise $\Pi(j') - \Pi'(j') > 1$ (in violation of Definition 4), because the $\Pi'(j') - 1$ and $\Pi'(j')$ slots have already been taken by sinks other than $s_{j'}$. Therefore, we have a displacement at the K'th element of $\Pi$, and the statement given in the above lemma holds for $i = K+1$ as well.
• $\Pi(j) = \Pi'(j) + 1$: This case cannot happen, since it implies that $s_j$ is the (K-1)'st element of $\Pi'$, which is in conflict with the above assumption. ■

Definition 6: Any arbitrary (n-1)-bit binary number w is called a non-overlapping displacement code if and only if it contains no two adjacent bits of 1. Also, W is defined as the set of all non-overlapping displacement codes.

Definition 7: The position of a bit b in a binary number w is defined as the number of bits on the left side of b plus one.

Lemma 11: There exists a one-to-one relationship between the members of W and $N(\Pi)$.

Proof: First, we prove that $\forall w \in W$ there exists a corresponding $\Pi' \in N(\Pi)$. For every 1 bit in w, set i to the position of that bit in w and displace the i'th element of $\Pi$; call the resulting order $\Pi'$. Before the first displacement operation, the inequality given in Definition 4 holds. Also, if the inequality between the initial order and the resulting order is valid after j displacement operations, it still holds after the (j+1)'st displacement operation. That is because every displacement operation changes the location of the two swapped elements by $\pm 1$ and keeps the location of the other elements unchanged. In addition, since w is a non-overlapping code, no element is displaced more than once. Consequently, using induction we conclude $\Pi' \in N(\Pi)$.

Now, we prove that $\forall \Pi' \in N(\Pi)$ there exists a corresponding $w \in W$. According to Lemma 10, $\Pi'$ can be generated from $\Pi$ using a unique set of non-overlapping displacements.
Those displacement operations can be coded in a non-overlapping displacement code which belongs to W. ■

Fig. 12. Equivalence of W and $N(\Pi)$.

Example 4: Fig. 12 illustrates the equivalence of W and $N(\Pi)$ for a simple example.

Theorem 5: For $n > 1$, the number of distinct orders in the neighborhood of a given order $\Pi$ is equal to: [equation not recoverable from the source; per the proof, it is the closed form of $\mathrm{Fib}(n+2)$]

Proof: According to Lemma 11 there is a one-to-one relation between $N(\Pi)$ and W. Therefore, these two sets have equal cardinality and we can equivalently prove the above equation for W. So, we have to find out how many binary numbers of the form $w = w_1 w_2 \ldots w_{n-1}$ exist which have no two adjacent 1's. The population of such numbers is equal to the off-set size of the following Boolean equation:

$$f = w_1 w_2 + w_2 w_3 + \ldots + w_{n-3} w_{n-2} + w_{n-2} w_{n-1}$$

Fig. 13. BDD of f. (Node labels in the figure: Fib(n), Fib(n-1), Fib(n-2), Fib(n-3).)

Fig. 13 is the BDD representation of the above equation. By induction, it can be verified that this structure is valid for $n > 0$. In this figure, the number under each BDD node gives the number of distinct paths that exist from that node to the root. Due to the symmetric structure of the equation and the corresponding BDD, the number of paths from a node to the root follows the Fibonacci number series. In the Fibonacci number series, the (k+1)'st number is the sum of the (k-1)'st and k'th numbers in the series. As shown in Fig. 13, the number of distinct paths from the leaf node zero to the root is $2 \times \mathrm{Fib}(n) + \mathrm{Fib}(n-1)$. Note that the factor of 2 appearing in the equation represents the fact that during the decomposition a zero sub-space has been reached
while the decomposition is not yet over with respect to the last variable. By a few simple manipulations we get the final result:

$$2 \times \mathrm{Fib}(n) + \mathrm{Fib}(n-1) = (\mathrm{Fib}(n) + \mathrm{Fib}(n-1)) + \mathrm{Fib}(n) = \mathrm{Fib}(n+1) + \mathrm{Fib}(n) = \mathrm{Fib}(n+2)$$

There is a direct closed form for calculating the (n+2)'nd number in the Fibonacci series. Using the formula given in [MCS], the equation in the theorem is derived. ■

The formula that returns the n'th Fibonacci number is an interesting one. It involves the square root of 5 (an irrational number), yet it gives an integer for all (integer) values of n [MCS].

Theorem 5 proves that the size of $N(\Pi)$ is an exponential function of the number of sinks. Consequently, finding the best order in that sub-space of orders is an exponential-complexity task if a simple enumeration-based technique is used. However, all the common sub-solutions of different orders can be shared in a dynamic programming algorithm that utilizes the aforementioned idea of bubbling. This in turn allows us to investigate the whole neighborhood in polynomial time.

In Fig. 11, notice that if having bubbles on the sides of sub-groups is allowed, the resulting sink order can be altered. Fig. 14 presents a set of abstract grouping structures $\{\chi_0, \chi_1, \chi_2, \chi_3\}$ which cover a whole neighborhood of orders. $\chi_0$ has no bubble on its sides, and $\chi_1$, $\chi_2$, and $\chi_3$ have bubbles on the right side, left side, and both sides, respectively. For instance, the grouping $\omega$ of Fig. 11 is a $\chi_1$-type structure. A full neighborhood would not be covered unless, at each level of dynamic programming and for each sub-group of sinks, all the grouping structures are generated from all the grouping structures of their internal sub-groups; Fig. 15 shows an example.

Fig. 14. Grouping structures. (Figure labels: $\chi_0$, $\chi_1$, $\chi_2$, $\chi_3$.)

Example 5: The example in Fig.
15 illustrates the use of a $\chi_3$ structure to generate a $\chi_1$-type solution for $\Omega$. In this case, the resulting order is $(s_3, s_2, s_4, s_5, s_7, s_6, s_8)$. This new sub-solution will be used to generate larger sub-solutions that contain it.

Fig. 15. Construction with perturbation.

The local order-perturbation technique can be extended to structures with more than one bubble on each side. Those structures in turn cover larger neighborhoods. However, in that case the number of grouping structures grows exponentially, which significantly slows down the corresponding construction algorithm.

3.3. Semi Order-Independent Hierarchical Buffered Routing Tree Construction

In this section, the local order-perturbation theory is applied to FANROUT and B_PTREE, and the resulting algorithm MERLIN is presented. In sub-sections 3.3.1 and 3.3.2, BUBBLE_CONSTRUCT and *PTREE are introduced and discussed. Those algorithms are employed by MERLIN, which is presented in sub-section 3.3.3.

3.3.1. BUBBLE_CONSTRUCT

The technique presented in section 3.2 is employed by the following algorithm, BUBBLE_CONSTRUCT, which generates hierarchical buffered routing trees in a neighborhood of orders. The resulting hierarchies are consistent with the Cα-Tree structure, and in addition the routing inside each layer of the Cα-Tree hierarchy is a P-Tree.

Some parts of the BUBBLE_CONSTRUCT code (Fig. 16) are similar to the code of FANROUT and are not discussed again. It is assumed the reader is familiar with the algorithm presented in section . BUBBLE_CONSTRUCT operates on three-dimensional solution curves $\Gamma$, each associated with a distinct set of values for l, r, p, and e. The first three variables are the same as the ones defined for FANROUT. The variable e encodes the grouping structure used to generate the solution curve.
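As a rough illustration of such a solution curve, the sketch below stores (required time, load, area) triples per (l, r, p, e) key and discards inferior triples. The dominance rule used here (larger required time, smaller load, and smaller area are better) and the names `Solution`, `dominates`, `prune`, and `curves` are assumptions made for illustration; the thesis's own data structure is not shown in this section.

```python
from typing import Dict, List, NamedTuple, Tuple


class Solution(NamedTuple):
    req_time: float  # required time at the sub-tree root (larger is better)
    load: float      # capacitive load seen at the root (smaller is better)
    area: float      # total buffer area (smaller is better)


def dominates(a: Solution, b: Solution) -> bool:
    """True if a is at least as good as b in every dimension and differs from b."""
    return (a != b and a.req_time >= b.req_time
            and a.load <= b.load and a.area <= b.area)


def prune(solutions: List[Solution]) -> List[Solution]:
    """Keep only non-dominated solutions; exact duplicates collapse to one."""
    unique = list(dict.fromkeys(solutions))
    return [s for s in unique if not any(dominates(o, s) for o in unique)]


# One curve per (l, r, p, e); p is a hypothetical candidate-location label.
CurveKey = Tuple[int, int, str, int]
curves: Dict[CurveKey, List[Solution]] = {}
curves[(3, 5, "p0", 1)] = prune([
    Solution(10.0, 2.0, 5.0),
    Solution(9.0, 2.0, 5.0),   # dominated by the first entry
    Solution(9.0, 1.0, 5.0),   # kept: it has a smaller load
])
```

Keeping the area dimension in each triple is what would let a final selection step pick either the best required time under an area budget or the smallest area under a required time bound.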
algorithm BUBBLE_CONSTRUCT( s, P, B, Π = (s1, s2, ..., sn) )
INITIALIZATION
1.  for e = 0 to 3
2.    // similar to the lines 1 through 3 in Fig. 7 except that
      // Γ( l, r, p ) should be replaced with Γ( l, r, p, e )
CONSTRUCTION
3.  for L = 1 to n
4.    for E = 0 to 3
5.      set L' = L + STRETCH(L, E)   // see Fig. 17
6.      for R = n downto L'
7.        set G = SINK_SUBSET( Π, R, L, E )   // see Fig. 20
8.        for l = max( 1, L-α+1 ) to L-1
9.          for e = 0 to 3
10.           set l' = l + STRETCH(l, e)   // see Fig. 17
11.           for r = R downto R-l'+1
12.             set g = SINK_SUBSET( Π, r, l, e )   // see Fig. 20
13.             if g-G ≠ {} continue
14.             foreach p ∈ P
15.               foreach γ ∈ Γ( l, r, p, e )
16.                 set Π' = REORDER( Π, G, g, γ )
17.                 set A = *PTREE( P, B, Π' )
18.                 // similar to the lines 11 through 17 in Fig. 7 except that
                    // Γ( L, R, p' ) should be replaced with Γ( L, R, p', E )
EXTRACTION
19. // similar to the lines 18 through 20 in Fig. 7 except that
    // Γ( n, n, s ) should be replaced with Γ( n, n, s, 0 )

Fig. 16. The pseudo-code for BUBBLE_CONSTRUCT.

1) Initialization: In this section, a set of solution curves is initialized. In that part of the algorithm, sub-groups of length 1 are considered, and the corresponding solution curves for every candidate buffer location, sink, and grouping structure are initialized. Note that for sub-groups with length 1, all four grouping structures ($\chi_0$, $\chi_1$, $\chi_2$, and $\chi_3$) are the same; however, for the sake of simplicity in the rest of the pseudo-code, separate (although similar) solution curves are generated for each case. A similar situation occurs for $\chi_1$ and $\chi_2$ where L=2.

2) Construction: The main difference between this algorithm and FANROUT is in the grouping phase, i.e. the lines 3 through 13 in Fig. 16.
BUBBLE_CONSTRUCT starts from L=1 and goes up to L=n. For each new sub-group of sinks, all possible grouping structures (coded by numbers 0 to 3) are enumerated in line 4. For the case of $\chi_0$ (E=0), the length of the sub-group is equal to L, but for the other cases the actual length of the sub-group is larger by one or two units, to capture the effect of inserting one or two bubbles on the sides. This new length is calculated and stored in L' (refer to line 5 and Fig. 17). In line 6, all the possible sub-strings of length L' are considered from the right to the left of $\Pi$. In fact, the variable R points to the right-most element of the sub-strings of L' elements.

Similar to FANROUT, during the process of grouping each set of L sinks, situations in which a sub-set of them has already been grouped should be considered. Fig. 18 illustrates an example where a sub-group of 5 sinks, $\Omega$, is being generated using a combination of an already generated sub-group of 3 sinks, $\omega$, and two other sinks, i.e. $s_2$ and $s_4$. Lines 8 through 11, similar to lines 3 through 6, investigate all possible sub-group lengths with different grouping structures and positions which fit inside the sub-group being constructed.

algorithm STRETCH( L, E )
1. set g = 0
2. if L=2 and E>0 set g=1
3. else if L>2 and E=1 set g=1
4. else if L>2 and E=2 set g=1
5. else if L>2 and E=3 set g=2
6. return g

Fig. 17. The pseudo-code for STRETCH.

Fig. 18. A legal grouping scenario for BUBBLE_CONSTRUCT. (Panels: a grouping situation; the corresponding Cα-Tree.)

It can be seen that in some cases $\Omega$ and $\omega$ are not compatible. As an example, consider the situation shown in Fig. 19, where the difference between the values of r and R is such that the grouping structure of $\omega$ does not fit in the grouping structure of $\Omega$. These cases are detected and skipped in line 13 of the pseudo-code.
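The STRETCH routine of Fig. 17 and the compatibility test of line 13 can be sketched in a few lines of Python. The function `stretch` is a direct transcription of Fig. 17; `fits_inside` is a hypothetical helper that expresses the `g - G` emptiness test on plain Python sets of sink indices, and the example sets are made up for illustration.

```python
def stretch(L, E):
    """Extra positions spanned by a sub-group of L sinks under grouping
    structure E (0: no bubble, 1: right, 2: left, 3: both), as in Fig. 17."""
    if L == 2 and E > 0:
        return 1
    if L > 2 and E in (1, 2):
        return 1
    if L > 2 and E == 3:
        return 2
    return 0


def fits_inside(g, G):
    """Compatibility test in the spirit of line 13 of Fig. 16: the inner
    sub-group g is usable only if it has no sink outside the outer group G."""
    return not (set(g) - set(G))


# Hypothetical sink-index sets: an outer group and two candidate inner groups.
outer = {2, 3, 4, 6}
compatible = fits_inside({3, 4}, outer)    # every sink of g is in G
incompatible = fits_inside({4, 5}, outer)  # sink 5 is not in G
```

With these definitions the stretched length of line 5 is simply `L + stretch(L, E)`, so a sub-group with no bubble keeps its length and a sub-group with bubbles on both sides (and more than two sinks) spans two extra positions.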
In that line, cases in which a sink node belongs to $\omega$ but does not belong to $\Omega$ are detected and skipped. Note that the sets G and g, calculated in lines 7 and 12, represent the sets of sinks included by $\Omega$ and $\omega$, respectively; also refer to Fig. 20.

Fig. 19. An illegal grouping case.

In line 16, REORDER updates $\Pi$ by replacing all the sinks that belong to g with a pseudo-sink which represents the root of $\gamma$ (a solution to $\omega$). The resulting order is called $\Pi'$. In line 17, an enhanced version of B_PTREE (called *PTREE) is called to generate a new set of solutions for all candidate locations. Every solution created by *PTREE shows the combination of $\omega$ with the rest of the sink nodes of $\Omega$. The details of *PTREE are presented in the following sub-section.

3) Extraction: In this section, a solution from $\Gamma(n, n, s, 0)$ which best satisfies the input constraints is selected and reconstructed by tracing back the stored pointers. Note that the order of sinks in the final solution may be different from the initial given order.

The quality and complexity of BUBBLE_CONSTRUCT are further discussed in sub-section 3.3.4.

algorithm SINK_SUBSET( Π=(s1, s2, ..., sn), R, L, E )
1.  if L=1 set G = { s_R }
2.  else if L=2
3.    switch E
4.      case 0 : set G = { s_{R-1}, s_R }
5.      case 1, 2, 3 : set G = { s_{R-2}, s_R }
6.  else
7.    set L' = L + STRETCH( L, E )
8.    switch E
9.      case 0 : set G = { s_{R-L'+1}, s_{R-L'+2}, s_{R-L'+3}, ..., s_{R-2}, s_{R-1}, s_R }
10.     case 1 : set G = { s_{R-L'+1}, s_{R-L'+2}, s_{R-L'+3}, ..., s_{R-2}, s_R }
11.     case 2 : set G = { s_{R-L'+1}, s_{R-L'+3}, ..., s_{R-2}, s_{R-1}, s_R }
12.     case 3 : set G = { s_{R-L'+1}, s_{R-L'+3}, ..., s_{R-2}, s_R }
13. return G

Fig. 20. The pseudo-code for SINK_SUBSET.

3.3.2. *PTREE

*PTREE is a solution to the problem of non-hierarchical buffered routing tree construction.
As mentioned earlier, *PTREE is called by BUBBLE_CONSTRUCT within every level of the Cα-Tree hierarchy. Consequently, *PTREE is responsible for conducting the order-perturbation task within each level of the hierarchy; otherwise BUBBLE_CONSTRUCT would not be optimal with respect to a whole neighborhood of orders.

The algorithm is an enhanced version of B_PTREE [LCL96] with two main differences: I) it uses the local order-perturbation technique and as a result it is optimal with respect to a neighborhood of orders, II) it takes as input a set of candidate buffer locations and, therefore, the location of buffers and Steiner points is not restricted to Hanan points.

At every step of dynamic programming in *PTREE, a sub-group of sinks $\Omega$ is solved using the existing solutions to two smaller sub-groups ($\omega_1$ and $\omega_2$) which partition $\Omega$ into two segments. Fig. 21 illustrates a legal grouping situation in which $\omega_1$ and $\omega_2$ properly partition $\Omega$. Three critical issues have been considered in the design of *PTREE:
• $\omega_1$ and $\omega_2$ do not share any sink,
• $\omega_1$ and $\omega_2$ cover all (and only) the sinks of $\Omega$,
• all the possible combinations of grouping structures and sizes must be considered for $\omega_1$ and $\omega_2$.

The pseudo-code of *PTREE is given in Fig. 22. Lines 2 through 5 generate all the combinations of size L, grouping structure E, and position R for $\Omega$. Then, $\omega_1$ is constructed by lines 7 through 10. As shown in the code, the rightmost element of $\omega_1$ is always the rightmost element of $\Omega$. Some generated combinations of $\Omega$ and $\omega_1$ may not be legal, which means that $\Omega$ does not contain all the elements of $\omega_1$. Although those cases could be avoided during the construction of $\omega_1$, it is easier for the sake of presentation to
prune those cases explicitly by checking the condition in line 12. In that line, any $\omega_1$ incompatible with $\Omega$ is detected and disregarded. In lines 13 through 22, $\omega_2$ is generated by considering $\Omega$ and $\omega_1$. The size, grouping structure, and position of $\omega_2$ are determined such that it contains all the sinks of $\Omega$ which are not in $\omega_1$. If $\omega_1$ has a bubble on its left side (grouping structures $\chi_2$, $\chi_3$), that bubble is the rightmost element of $\omega_2$; see line 19. Similarly, if $\Omega$ has a bubble on its left side, $\omega_2$ should have a grouping structure with a bubble on its left, and so on. For more details please refer to the pseudo-code given in Fig. 22. Again, some illegal cases may occur that are pruned in line 24. After line 19, $\omega_1$ and $\omega_2$ are known and their solution curves are combined in the same way that B_PTREE combines solution curves. Interested readers are referred to [LCL96] for the details of B_PTREE.

The quality and complexity of *PTREE are further discussed in sub-section 3.3.4.

Fig. 21. A legal grouping scenario for *PTREE. (Figure labels: $\omega_2$ with $l_2=3$, $e_2=3$, $r_2=5$; $\Omega$ with L=6, E=3; $\omega_1$ with $l_1=3$, $e_1=3$; the remaining label, "=8", is the right-most position shared by $\Omega$ and $\omega_1$.)

algorithm *PTREE( P, B, Π=(s1, s2, ..., sn) )
INITIALIZATION
1.  // the same as the one in BUBBLE_CONSTRUCT, see Fig. 16
CONSTRUCTION
2.  for L = 2 to n
3.    for E = 0 to 3
4.      set L' = L + STRETCH(L, E);   // see Fig. 17
5.      for R = n downto L'
6.        set G = SINK_SUBSET( Π, R, L, E );   // see Fig. 20
7.        for l1 = 1 to L-1
8.          for e1 = 0 to 3
9.            set l1' = l1 + STRETCH(l1, e1);   // see Fig. 17
10.           set r1 = R
11.           set g1 = SINK_SUBSET( Π, r1, l1, e1 );   // see Fig. 20
12.           if g1-G ≠ {} continue;
13.           set l2 = L - l1
14.           switch e1
15.             case 0, 1 : set r2 = r1 - l1'
16.               switch E
17.                 case 0, 1 : set e2 = 0
18.                 case 2, 3 : set e2 = 2
19.             case 2, 3 : set r2 = r1 - l1' + 2
20.               switch E
21.                 case 0, 1 : set e2 = 1
22.                 case 2, 3 : set e2 = 3
23.           set g2 = SINK_SUBSET( Π, r2, l2, e2 );   // see Fig.
20
24.           if g2-G ≠ {} or intersection(g1, g2) ≠ {} continue;
25.           // combine Γ( l1, r1, p, e1 ) and Γ( l2, r2, p, e2 ) like the way PTREE works

Fig. 22. The pseudo-code for *PTREE.

3.3.3. MERLIN

This sub-section presents MERLIN, a local neighborhood search algorithm which employs BUBBLE_CONSTRUCT to find a locally optimal sink order and the optimum buffered routing tree corresponding to that order.

Generally, an optimization problem has a set of solutions and a cost function that assigns a value to every solution. The goal is to find an optimal solution, i.e. one that has the minimum (or maximum) cost. Local neighborhood search, as a member of the iterative solution methods, is a widely used, general approach for solving optimization problems. To obtain a local search (LS) algorithm for solving an optimization problem, one superimposes a neighborhood structure on the solutions, i.e. for each solution a set of neighboring solutions is specified. The LS algorithm starts from some initial solution, which may be constructed by some other algorithm or generated randomly, and from then on keeps moving to a better neighboring solution, until finally it terminates at a locally optimal solution. This method has been applied in the context of both continuous and discrete optimization [Ya90]. In general, simulated annealing is a special case of local neighborhood search that allows uphill moves. Fig. 23 illustrates the behavior of local neighborhood search.

Fig. 23. Local neighborhood search in MERLIN. (Figure labels: sink order; the n-dimensional space of sink orders.)

Definition 8: A function $N : F \to 2^F$, which associates a subset $N(x)$ with each $x \in F$, is a neighborhood function over F iff $\forall x \in F$, $x \in N(x)$, and $\forall x, y \in F$, $x \in N(y) \Rightarrow y \in N(x)$.
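The two properties required by Definition 8 can be checked numerically for the neighborhood of Definition 4. The sketch below enumerates the non-overlapping displacement codes of Definition 6 and applies them as adjacent swaps; orders are written as tuples of sink indices, and the function names `displacement_codes`, `apply_code`, and `neighborhood` are illustrative, not taken from the thesis.

```python
from itertools import product


def displacement_codes(n):
    """All (n-1)-bit codes with no two adjacent 1 bits (Definition 6)."""
    return [w for w in product((0, 1), repeat=n - 1)
            if all(not (a and b) for a, b in zip(w, w[1:]))]


def apply_code(seq, w):
    """For every 1 bit at (1-based) position i of w, swap elements i and i+1.
    The swaps never overlap, so the order of application does not matter."""
    out = list(seq)
    for i, bit in enumerate(w):
        if bit:
            out[i], out[i + 1] = out[i + 1], out[i]
    return tuple(out)


def neighborhood(seq):
    """All orders reachable from seq by one non-overlapping displacement code."""
    return {apply_code(seq, w) for w in displacement_codes(len(seq))}


base = (1, 2, 3, 4, 5)
nbrs = neighborhood(base)
reflexive = base in nbrs                               # x is in N(x)
symmetric = all(base in neighborhood(x) for x in nbrs)  # x in N(y) implies y in N(x)
sizes = [len(neighborhood(tuple(range(1, n + 1)))) for n in range(2, 6)]
```

For n = 2, 3, 4, 5 the neighborhood sizes collected in `sizes` are 2, 3, 5, 8, which is the Fibonacci growth stated in Theorem 5. The sketch enumerates the neighborhood explicitly, so it is only practical for small n; the point of the bubbling technique is to avoid exactly this enumeration.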
BUBBLE_CONSTRUCT induces a well-defined neighborhood function in which it finds the best solution. The same definition is also used by MERLIN.

Lemma 12: The properties required by Definition 8 are consistent with the properties of the neighborhood introduced in Definition 4.

Proof: In Theorem 5, we proved that the size of the neighborhood, $N(\Pi)$, is always greater than 1 independent of the choice of $\Pi$. Also, for every $\Pi' \in N(\Pi)$ there is a unique non-overlapping displacement code, w, that transforms $\Pi$ to $\Pi'$. To prove that also $\Pi \in N(\Pi')$, we need to prove that there is a non-overlapping displacement code, w', that transforms $\Pi'$ to $\Pi$. It can be shown that in fact w'=w is the solution. ■

algorithm MERLIN( s, P, B, Π=(s1, s2, ..., sn) )
1. set Π' = Π
2. do {
3.   set Π = Π'
4.   set ℜ = BUBBLE_CONSTRUCT( s, P, B, Π )
5.   set Π' = SINK_ORDER( ℜ )
6. } while ( Π != Π' )
7. return ℜ

Fig. 24. The pseudo-code for MERLIN.

There exist at least two sink orders, i.e. $\Pi$ and $\Pi'$, in common between the neighborhoods of two consecutive iterations of MERLIN's local search. In fact, this overlap, OVERLAP($N(\Pi)$, $N(\Pi')$), is often relatively large. Intuitively, when the corresponding non-overlapping displacement code has more 1's, OVERLAP($N(\Pi)$, $N(\Pi')$) is smaller. The overlapping sub-space is considered twice, which is wasteful. However, this can be prevented by keeping the solution curves of the very last iteration: for sub-problems that are the same in the two iterations, the corresponding solution curve is simply copied. This speed-up is achieved at the cost of doubling the memory usage.

3.3.4. Quality and Complexity Analysis

The following statements formally describe the properties of the algorithms presented in this section.
Theorem 6: *PTREE executes in O(kα³q), where k is the total number of buffer candidate points, α is the number of sinks, and q is a polynomially bounded number of distinct capacitive loads.

Proof: In Fig. 22, lines 2, 5, and 7 each introduce O(α) complexity. Note that in the pseudo-code, n is the number of sinks, which is referred to as α in this theorem. The merge operation (line 25), which is the same operation as in B_PTREE, has O(kq) complexity [LCL96]. ■

Lemma 13: Any order generated by BUBBLE_CONSTRUCT is in the neighborhood of the initial order Π.

Proof: This is a proof by induction. The pseudo-code directly forces the grouping structures to cover each other like nested shells. Starting from the innermost shell, we analyze the effect of the grouping structures. Case i = 1: after the bubble-out step (see Fig. 11) for the innermost grouping structure, the order of all the sinks remains unchanged except for the two that are on the border of the bubbled sub-group. Consequently, the inequality relation in Definition 4 remains valid and the resulting order is within the neighborhood of the initial order. Case i = n: suppose that after the bubble-out step for the n−1 innermost grouping structures, the inequality of Definition 4 still holds. The order of the sinks on the border of the n'th grouping structure must still be unchanged because no overlap is allowed between the borders of two grouping structures. Therefore, even after the bubble-out step for the n'th grouping structure, the resulting order is within the neighborhood of the initial order. ■

Lemma 14: Any Π' ∈ N(Π) is considered by BUBBLE_CONSTRUCT.

Proof: BUBBLE_CONSTRUCT implicitly tries all the possible valid combinations of grouping structures on Π. Therefore, it is enough to prove that ∀Π' ∈ N(Π) there exists a combination of grouping structures which results in that order.
Suppose that w is the non-overlapping displacement code of Π' ∈ N(Π) (as given in Definition 6). Starting from the left-most 1-bit in w (j is the position of that bit in w), extend a χ1-type sub-group from the left-most sink to the (j+1)'st sink. After the bubble-out step, the resulting order is similar to Π' for the j left-most sinks and similar to Π for the rest of the sinks. Repeat this operation for the next 1-bit in w, in order from left to right. There are no two neighboring 1's in w; therefore, at each step the left portion of the resulting order resembles Π' and the other portion is like Π. At the last step, when there is no 1 left in w, we cover the initial order from left to right with a χ0-type sub-group. The resulting order, after the bubble-out step for all the sub-groups, is Π', and since it has a valid grouping structure it is considered by the pseudo-code of Fig. 16. ■

The proof of Lemma 14 is illustrated with an example in Fig. 25.

Lemma 15: Any identical sub-problem among the members of N(Π) is shared and processed only once.

Proof: Any sub-problem is uniquely identified by its l, e, and r values. ∀p ∈ P, T(l, e, r, p) is generated only once, no matter in which compatible and larger grouping structure it will be used later. Note that according to Lemma 13 and Lemma 14, BUBBLE_CONSTRUCT covers the whole space of N(Π). ■

Fig. 25. An illustration for the proof of Lemma 14. (Figure content: the sink orders Π and Π' with the displacement code w = 0 1 0 0 0 0 1 0.)

Theorem 7: The solution space of BUBBLE_CONSTRUCT is the product of the spaces of *P-Tree and Cα-Tree for the neighborhood of the initial given order.

Proof: ∀Π' ∈ N(Π), all the corresponding Cα-Trees with the boundary of their sub-groups on the displaced sinks' locations are visited, and for every one of them all the *P-Tree structures are considered by *PTREE. However, there are some Cα-Trees which correspond to Π' but whose displacements are not at the boundary of the sub-groups. For those, *PTREE considers all the necessary displacements inside one layer of that Cα-Tree. ■

Lemma 16: BUBBLE_CONSTRUCT is monotone with respect to required time, load, and area.

Proof: By considering that *PTREE is monotone with respect to the required time, load, and area, we can conclude that in a Cα-Tree, decreasing the load of an internal or sink node results in a decrease of the load in its immediate parent. A similar argument is valid for required time and total area. ■

Lemma 17: In BUBBLE_CONSTRUCT, the use of the pruning operation does not result in the loss of any non-inferior solution.

Proof: This immediately follows from Lemma 16 and Definition 2. ■

Theorem 8: Subject to the restrictions imposed by the *P-Tree and Cα-Tree structures, BUBBLE_CONSTRUCT finds all the non-inferior solutions with respect to required time and total area in the neighborhood of a given order.

Proof: If no pruning is performed, the whole space is explicitly constructed (see Theorem 7). Lemma 17 states that the prune operation drops only the sub-solutions that are used in inferior solutions. Therefore, all the non-inferior solutions remain in the final curves of BUBBLE_CONSTRUCT. ■

For the sake of simplicity, in the following statements it is assumed that the total buffer area is the metric used to measure the area of a solution.

Theorem 9: BUBBLE_CONSTRUCT has O(kmn³) memory complexity, where k, m, and n are the numbers of candidate locations, buffers, and sinks, respectively.

Proof: The proof is the same as the one given for Theorem 3. The only difference is that the number of solution curves is 4 times larger in BUBBLE_CONSTRUCT than in FANROUT, because one solution curve is stored for every grouping structure.
■

Theorem 10: BUBBLE_CONSTRUCT has O(mqk²α^s n³) runtime complexity, where k, m, and n are the numbers of candidate locations, buffers, and sinks, respectively. Also, α is the maximum branching factor in Cα-Trees and q is a polynomially bounded number of distinct capacitive loads.

Proof: The proof is similar to the one given for Theorem 4. Note that the effect of using 4 grouping structures on the complexity is constant. ■

Corollary 2: Assuming that m, q, and α are parameters that are independent of the problem size n and are determined by the library and technology, the effective worst-case complexity of BUBBLE_CONSTRUCT is O(k²n³).

Theorem 11: The cost associated with the order produced in each iteration of MERLIN is strictly decreasing except in the last iteration.

Proof: BUBBLE_CONSTRUCT always returns the best order in the neighborhood, so if a different order is returned it must correspond to a lower cost. In the last iteration, the cost of the given order is the best in the neighborhood, which is how the iteration terminates. ■

3.4. Experimental Results

In this section, three design flows are tested and compared on a set of benchmark circuits. The design flows are described below.

• Setup-I: For every net, fanout optimization using LTTREE is followed by a routing tree construction phase using PTREE. In LTTREE, the net sinks are sorted with respect to their required times. In PTREE, however, the net sinks are sorted by a solution to the TSP (Traveling Salesman Problem) using the method suggested in [LCLH96].
• Setup-II: Routing tree generation using PTREE is followed by buffer insertion using the method of [Gi90]. The sink order for PTREE is again the TSP order.
• Setup-III: Finally, hierarchical buffered routing generation is performed using MERLIN. The initial sink order is the TSP order.

All the experiments have been implemented and executed in SIS [SSLM92] on a dual-processor Sun Ultra-2 SPARC workstation with 256MB of memory. In these experiments, an industrial standard cell library (0.35μm CMOS process) consisting of 34 buffers has been used. Gate and wire delays are calculated using a 4-parameter delay equation and the Elmore delay model [El48], respectively.

The rest of this section is divided into two main sub-sections. In sub-section 3.4.1, the above three setups are tested on a set of single nets. In sub-section 3.4.2, the same techniques are employed in a full design flow and the results are reported for a set of circuits.

3.4.1. Comparison on Individual Nets

Table 4 reports the results of running the above three experimental setups on 18 individual nets randomly selected from a set of benchmark circuits. For every extracted net, the sink locations are determined randomly in a bounding box. The size of the box has been chosen such that the delay of a wire segment whose length is half the perimeter of the box is approximately equal to the delay of an average gate driving that wire. In addition, the load and required time data of the sinks have been selected randomly from a nominal range.

In Table 4, the reported area and delay values are the total buffer area and the required time at the root of the net in the resulting buffered routing structures. The runtimes are reported in seconds for every net and experimental setup. Note that the data of Setup-I is reported in absolute values, whereas for the other two experimental setups the results are scaled with respect to the corresponding data in Setup-I.

3.4.2. Comparison on Circuits

Table 5 reports the post-layout total area and delay values for a set of benchmark circuits. In these experiments the above three setups have been plugged into a full design flow that extends from logic synthesis all the way down to detailed routing. The resulting design flows are named Flow-I through Flow-III. Again, the data for Flow-I is absolute, and the rest are scaled to it.

Columns: "I" is Setup I (LTTREE + PTREE), reported in absolute values: area (x1000 λ²), delay (ns), CPU (s). "II" is Setup II (PTREE + Buffer Insertion) and "III" is Setup III (MERLIN), both reported as ratios over Flow I. "Loops" is the number of MERLIN loop iterations.

Circuit  Net    Sinks  I-Area  I-Delay  I-CPU  II-Area  II-Delay  II-CPU  III-Area  III-Delay  III-CPU  Loops
C432     net1   16     58      38.54    22     0.33     0.87      0.36    0.28      0.39       25.09    2
         net2   16     83      35.49    41     0.27     0.71      1.66    0.69      0.48       5.24     1
         net3   10     51      32.19    44     1.31     0.88      4.27    0.56      0.70       15.27    7
C1355    net4   9      35      26.69    16     0.64     0.88      1.88    0.82      0.57       3.00     4
         net5   9      16      23.42    15     0.80     0.95      0.86    3.80      0.47       2.33     5
         net6   13     29      25.42    14     0.33     0.95      3.43    0.56      0.30       78.00    6
C3540    net7   12     58      41.03    29     0.50     0.88      1.79    1.44      0.55       23.59    12
         net8   35     93      47.05    99     0.17     0.83      4.42    0.17      0.49       7.92     1
         net9   73     214     60.73    229    1.55     0.69      1.83    0.12      0.42       1.98     1
C5315    net10  49     70      40.29    302    0.64     0.78      2.34    0.36      0.33       6.09     2
         net11  21     80      38.20    111    1.12     0.66      1.02    0.40      0.26       4.32     4
         net12  50     128     58.79    829    0.65     0.53      0.64    0.20      0.27       13.20    9
C6288    net13  16     58      44.65    52     0.83     0.73      1.12    2.11      0.49       9.33     5
         net14  20     58      45.67    28     0.67     0.91      1.71    1.00      0.73       3.54     1
         net15  60     90      90.29    197    0.25     0.74      1.42    0.29      0.55       16.20    4
C7552    net16  12     54      32.20    26     1.35     0.90      3.00    1.18      0.54       12.38    2
         net17  16     58      31.35    54     0.94     0.86      1.11    1.56      0.39       9.72     5
         net18  23     54      38.38    43     0.35     0.91      2.16    0.29      0.39       5.70     1
Average:                                       0.71     0.81      1.95    0.88      0.46       13.49

Table 4: Total buffer area, delay, and runtime for a set of nets.
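As a quick arithmetic cross-check of the "Average" row of Table 4, the two delay-ratio columns can be averaged directly. The short Python sketch below only re-derives numbers already printed in the table; the variable and function names are hypothetical, and the lists are transcribed from the net1 through net18 rows.

```python
from statistics import mean

# Delay ratios transcribed from Table 4, net1 .. net18.
setup2_delay = [0.87, 0.71, 0.88, 0.88, 0.95, 0.95, 0.88, 0.83, 0.69,
                0.78, 0.66, 0.53, 0.73, 0.91, 0.74, 0.90, 0.86, 0.91]
merlin_delay = [0.39, 0.48, 0.70, 0.57, 0.47, 0.30, 0.55, 0.49, 0.42,
                0.33, 0.26, 0.27, 0.49, 0.73, 0.55, 0.54, 0.39, 0.39]


def column_average(ratios):
    """Arithmetic mean of one ratio column, rounded to two decimals as in the table."""
    return round(mean(ratios), 2)
```

The two averages obtained this way agree with the 0.81 and 0.46 entries of the "Average" row.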
Columns: "I" is Flow I (LTTREE + PTREE), reported in absolute values: area (x1000 λ²), delay (ns), CPU (s). "II" is Flow II (PTREE + Buffer Insertion) and "III" is Flow III (MERLIN), both reported as ratios over Flow I.

Circuit  I-Area  I-Delay  I-CPU  II-Area  II-Delay  II-CPU  III-Area  III-Delay  III-CPU
C1355    3630    8.18     1276   0.97     0.97      0.99    0.93      0.72       2.23
C1908    7768    14.47    2560   1.03     1.10      0.95    1.02      0.80       2.55
C2670    9428    12.40    1699   0.99     0.99      1.09    1.06      0.96       2.05
C3540    15762   22.17    5436   1.21     1.57      0.79    1.27      0.88       0.98
C432     3574    10.13    1382   1.16     1.06      0.79    1.57      1.00       1.17
C6288    28497   52.94    13547  0.96     1.03      0.88    1.00      0.90       1.00
C7552    35189   19.80    9250   0.78     1.06      0.95    0.85      0.74       1.36
Alu4     8191    15.69    2842   1.22     0.99      0.86    1.02      0.96       1.62
B9       1210    2.81     271    0.98     1.25      0.82    1.36      0.99       4.18
Dalu     10344   18.59    3465   0.73     0.88      0.66    0.88      0.67       1.74
Desa     32388   27.00    19427  1.12     1.12      0.75    1.19      0.82       0.83
Duke2    5499    9.00     2554   1.15     0.91      0.74    1.04      0.83       0.80
K2       22823   26.66    5831   0.85     0.75      1.73    0.93      0.63       2.56
Rot      8315    7.80     1572   0.91     1.02      0.83    1.00      0.81       3.40
T481     8917    10.12    5239   1.22     1.01      0.78    0.92      1.08       1.26
Average:                         1.02     1.05      0.91    1.07      0.85       1.85

Table 5: Post-layout area, delay, and runtime for a set of benchmark circuits.

Chapter 4
Simultaneous Floorplanning, Technology Mapping and Gate Placement

4.1. Introduction

This chapter introduces an integrated approach for simultaneous floorplanning, technology mapping, and detailed gate placement for row-based standard-cell layout styles. The unique contributions of this work are:

• A new design flow, *SiMPA, which simultaneously performs floorplanning, technology mapping, and placement.
• A new data structure, the k-way levelized cluster tree, which represents the hierarchy of the circuit for *SiMPA.
• A new global area optimizer, *SiMPA-E, which optimizes chip area via simultaneous floorplanning, technology mapping, and gate placement.
• A new critical path optimizer, *SiMPA-R, which for a given number of critical paths effectively trades area for delay while simultaneously considering all floorplanning, technology mapping, and gate placement solutions.

The rest of this chapter is organized as follows. In Section 4.2, the related background and prior work are presented. The proposed methodology for unifying floorplanning, technology mapping, and gate placement is introduced in Section 4.3. In Section 4.4, the experimental results are presented and discussed.

4.2. Background

4.2.1. Technology Mapping

It is well known that the general technology mapping (TM) problem is NP-hard [HS96]. Keutzer in [Ke87] approximated the general TM problem by a set of tree-covering sub-problems and optimally mapped each, with respect to gate area, by a polynomial-time dynamic programming (DP) based algorithm (KA). Later, Rudell in [Ru89] extended this idea to the minimum delay mapping problem. Subsequently, Touati et al. provided a mapper capable of minimizing area under delay constraints [TMBW90]. Chaudhary and Pedram presented a method for finding the set of all non-inferior mappings for a tree with different area versus arrival time trade-offs [CP92]. All of the above-mentioned methods assume the dominance of gate area and delay over interconnects.

4.2.2. Linear Placement

Linear placement (LP) is defined as the assignment of the vertices of a graph to the open slots located on a line so as to minimize a desired cost function. MINCUT is the LP problem in which the cost function is the maximum cutwidth. It has been proven that this problem is NP-complete for general graphs [GJ79] but that it is polynomial in cases where the subject graph is a tree.
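To make the MINCUT cost function concrete, the following Python sketch computes the cutwidth of a given linear arrangement: for every gap between two adjacent slots it counts the edges that cross the gap and reports the maximum. This is only an illustrative evaluation routine with hypothetical names; it is not the algorithm of [Le82] or [Ya85], and the brute-force minimizer is included solely to show the objective on a tiny example.

```python
from itertools import permutations
from typing import Hashable, Iterable, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def cutwidth(order: Sequence[Hashable], edges: Iterable[Edge]) -> int:
    """Maximum number of edges crossing any gap between adjacent slots."""
    pos = {v: i for i, v in enumerate(order)}
    edge_list = list(edges)
    worst = 0
    for gap in range(len(order) - 1):     # gap lies between slot gap and slot gap + 1
        crossing = sum(
            1 for u, v in edge_list
            if min(pos[u], pos[v]) <= gap < max(pos[u], pos[v])
        )
        worst = max(worst, crossing)
    return worst


def brute_force_mincut(vertices: Sequence[Hashable], edges: Iterable[Edge]) -> int:
    """Exhaustive minimum cutwidth; exponential, for tiny illustrative inputs only."""
    edge_list = list(edges)
    return min(cutwidth(p, edge_list) for p in permutations(vertices))
```

For a star-shaped tree with center a and leaves b, c, d, the arrangement (a, b, c, d) has cutwidth 3, while (b, a, c, d) achieves the minimum of 2 because the leaves are split across both sides of the center.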
Lengauer, in [Le82], developed an approximation algorithm (LA) for the tree MINCUT problem whose cutwidth is within a factor of two of the optimal value. Later, Yannakakis introduced a polynomial optimal algorithm (YA) for the tree MINCUT problem using the DP strategy [Ya85].

4.2.3. SiMPA

To address the high performance requirements of DSM designs, SiMPA (Simultaneous Technology Mapping and Linear Placement Algorithm) integrates technology mapping and linear placement by combining the DP-based YA/LA with KA. This algorithm allows TM to access accurate physical information and LP to guide logic optimization. SiMPA-E ('E' stands for exact total area) is a combination of YA and KA which optimally finds the minimum total (wiring + gate) area implementation of a given tree circuit [LSP97]. SiMPA-D ('D' stands for disjoint combination based) combines LA and KA and optimizes total area and total delay by generating three-dimensional trade-off curves for tree circuits [LSP98].

FPD-SiMPA (FPD stands for floorplan driven) is a design flow which incorporates SiMPA-D and SiMPA-E (SiMPA-D/E) for building two-dimensional implementations of DAG-structured circuits [LSP98]. The outline of this design flow is as follows:

1. Partition the initial DAG into a set of tree clusters.
2. Map and place each cluster.
3. Floorplan the clusters on a two-dimensional plane.
4. Perform global routing followed by timing analysis.
5. Trade off area for delay using SiMPA-D/E for each timing-critical or area-critical cluster.

4.2.4. Floorplanning

Floorplanning refers to the placement and sizing of flexible blocks and is a common feature in many hierarchical design environments. In general, floorplanning algorithms are based on one of the following methods: mathematical programming, rectangular dualization, and combinatorial methods [Sh93].
Recently, two novel floorplanning techniques based on sequence pair [MFNK96] and bounded slicing grid [NFMK96] structures have been proposed. In this thesis, a combinatorial floorplanning method, like that in [PK92], is combined with SiMPA through the use of an appropriately defined hierarchy.

4.2.5. Acyclic Partitioning

Acyclic partitioning prevents nets from creating directed cycles among the parts. A two-way acyclic partitioning algorithm was introduced in [IPFC93] and then extended to multi-way acyclic partitioning in [CLB94]. The latter is used in this work when delay optimization is added to the proposed algorithm (*SiMPA-R). The acyclic property is necessary because it allows simple handling of the timing dependencies among the parts by following a topological ordering.

4.3. Simultaneous Floorplanning, Technology Mapping and Gate Placement

SiMPA-D and SiMPA-E are only capable of manipulating tree-structured circuits. However, such a structure is atypical of most circuits, and therefore in FPD-SiMPA a DAG circuit is clustered into tree clusters and each cluster is manipulated by SiMPA-D/E separately. In other words, the design flow divides the problem into a set of smaller sub-problems and then combines the sub-solutions to build the final solution. This approach can be classified as a member of the divide-and-conquer family of algorithms and consequently inherits the intrinsic shortcomings associated with this family. Generally, FPD-SiMPA restricts SiMPA to work within the individual tree clusters only, and employs the floorplanner to take advantage of the possible inter-cluster optimization opportunities. The floorplanner, however, is subject to the optimization decisions already made within the clusters, which are not necessarily appropriate for the overall circuit optimization.
*SiMPA, the proposed algorithm in this work, addresses this deficiency by combining SiMPA with the floorplanning design step. Hence, the floorplanner controls all the intra-cluster and inter-cluster optimization opportunities, so the decisions made inside and outside the clusters are the best in the global sense.

Another important property of *SiMPA is that it eliminates delay budgeting (slack distribution) from the design flow. *SiMPA, due to its integrated nature, has no need for this step. It provides designers with a conceptually simpler and more precise design tool which outputs the global trade-off picture of the design using fewer heuristics.

Fig. 26. Clustered circuit. (Figure labels: TM and LP work inside clusters; FP works on the clustered DAG.)

The way FP is merged into SiMPA is different from the way TM and LP were merged in SiMPA. TM and LP both work at the same hierarchy level (inside the tree clusters), so it was possible to implement their combination via interleaved calls from one to the other at every step of the DP-based algorithm. FP, on the other hand, works at a higher level of hierarchy (among the tree clusters), and therefore its integration into SiMPA must be achieved by other means. The floorplanner uses SiMPA to capture the shape functions of all the tree clusters. These are subsequently used by the bottom-up floorplanner to calculate the global shape function for the whole circuit.

Fig. 27. *SiMPA.

*SiMPA comes in two flavors:

• *SiMPA-E, which targets the total minimum area by generating a set of non-inferior solutions (w.r.t. shape) for the whole circuit.
• *SiMPA-R, which removes the timing violations for a given implementation and design hierarchy by simultaneously replacing and remapping the clusters on timing critical paths.
This algorithm is capable of computing accurate shape/delay trade-off curves for the whole chip while speeding up one critical path of the circuit at a time.

In our flow, *SiMPA-E is first called to generate a minimum area implementation of the subject circuit; then *SiMPA-R is called to remove the timing violations one path at a time. In *SiMPA, a special data structure is used for the representation of the design hierarchy. The first sub-section below describes and analyzes this representation. Next, *SiMPA-E and *SiMPA-R are presented and discussed.

4.3.1. K-LCT: A K-Way Levelized Cluster Tree

Any bottom-up technique requires, at least implicitly, a hierarchy to work with. For the case of BearFP (a DP-based floorplanner), the hierarchy is represented by a cluster tree which shows a grouping of macro-cells into first-level clusters, first-level clusters into second-level super clusters, and so on, until the root of the cluster tree is reached [PK92]. *SiMPA, which includes a floorplanner similar to BearFP, likewise requires a data structure for the extraction and representation of the hierarchical organization of the tree clusters. In contrast to BearFP, which uses a simple k-way tree built by placing the most connected blocks as sibling nodes under the same parent node, *SiMPA requires a more elaborate hierarchy and a correspondingly suitable data structure to represent it. This novel data structure is called the k-way levelized cluster tree (k-LCT) and is introduced and discussed below.

Definition 9: For any given rooted tree T(V, E) and ∀v ∈ V, the following terms are defined:
• parent(v): the immediate parent of v.
• children(v): the set of the immediate children of v.
• descendents(v): the set of all nodes in the subtree rooted at v, including v itself.
• leaves(v) = { u | u ∈ descendents(v) ∧ u is a leaf node }.
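The terms of Definition 9 map directly onto a small tree data structure. The Python sketch below is a minimal illustration with hypothetical names (Node, descendents, leaves); it is not the data structure used inside *SiMPA, and it returns lists in depth-first order rather than sets, purely for readability.

```python
from typing import List, Optional, Sequence


class Node:
    """A node of a rooted tree; parent(v) and children(v) are stored explicitly."""

    def __init__(self, name: str, children: Sequence["Node"] = ()) -> None:
        self.name = name
        self.parent: Optional["Node"] = None
        self.children: List["Node"] = list(children)
        for child in self.children:
            child.parent = self


def descendents(v: Node) -> List[Node]:
    """All nodes in the subtree rooted at v, including v itself."""
    result = [v]
    for child in v.children:
        result.extend(descendents(child))
    return result


def leaves(v: Node) -> List[Node]:
    """Members of descendents(v) that have no children."""
    return [u for u in descendents(v) if not u.children]
```

For example, with a root r whose children are an internal node a (with leaf children x and y) and a leaf b, leaves(r) contains x, y, and b, while descendents(a) contains a, x, and y.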
Definition 10: Given a directed acyclic graph G(V', E'), a rooted tree C(V, E) with root r is a cluster tree for G if there exists a one-to-one function Γ: leaves(r) → V'. Members of E and E' are called cluster-edges and DAG-edges, respectively. Note that for a given circuit, the corresponding DAG does not include the primary inputs or outputs, which are handled later in a pad assignment phase.

Fig. 28. A cycle.

Definition 11: C(V, E), a rooted cluster tree of a given G(V', E'), is called acyclic if for any v ∈ V and 1 < i ≤ |children(v)| the following statement holds: ∀v1, ..., vi ∈ children(v) and ∀(u1, ..., ui) ∈ leaves(v1) × ... × leaves(vi), { (Γ(u1), Γ(u2)), ..., (Γ(u(i−1)), Γ(ui)), (Γ(ui), Γ(u1)) } ⊄ E'.

Definition 12: C(V, E), a rooted cluster tree of a given G(V', E'), is a k-LCT if:
I) ∀v ∈ V, |children(v)| ≤ k,
II) ∀(v ∈ V ∧ v ∉ leaves(r)), ∀(u ∈ leaves(r) ∧ u ∉ descendents(v)), ∀w ∈ leaves(v), (Γ(u), Γ(w)) ∉ E'.

In the following, C(V, E) (with root r) is assumed to be a k-LCT of a given circuit G(V', E') unless otherwise stated. Property II of Definition 12 states that in a k-LCT, all the DAG-edges connect nodes to those in the same or a higher level, hence guaranteeing that a k-LCT is an acyclic structure. This property empowers *SiMPA to calculate all needed wire lengths and signal arrival times while synthesizing the circuit; details will be discussed later in this thesis. Fig. 29, which shows a k-LCT for a simple circuit, can be used to verify the acyclicness property of k-LCTs.

Fig. 29. A circuit and its corresponding 4-LCT graph. (Panels: a circuit G(V', E'); a 4-LCT for G, C(V, E).)

Lemma 18: For any edge in G(V', E'), say (u, v) (let υ = Γ⁻¹(u) and ω = Γ⁻¹(v)), we have υ ∈ descendents(parent(ω)). ■

Theorem 12: C(V, E), a k-LCT of a given G(V', E'), is an acyclic cluster tree.

The following lemmas introduce transformations on k-LCTs for the generation of a new equivalent k-LCT with different properties, such as a more balanced k-LCT, which generally yields more efficient floorplans.

Lemma 19: In C, any leaf node can be raised as high as the level where its lowest-level fanout resides.

Lemma 20: In C, any leaf node can be lowered as low as the level where its highest-level fanin resides.

Lemma 21: In C, any two nodes which have the same parent node and whose total number of immediate children is less than k can be merged together.

Fig. 30. The new 4-LCT.

The 4-LCT in Fig. 30 is equivalent to that shown in Fig. 29 and is generated using a sequence of the transformations mentioned above. This tree is more desirable than that of Fig. 29, since a DP-based floorplanner tends to produce better results when the number of the internal nodes is log_k(n). Interested readers are referred to [SLP98] for the proofs of the theorems.

4.3.2. *SiMPA-E for Global Area Optimization

*SiMPA-E is a combination of SiMPA and floorplanning which targets total area optimization. If a bottom-up floorplanning scheme with shape propagation ability is used, *SiMPA-E is capable of finding a global height versus width trade-off curve for the whole network. This curve is then used to find the solution satisfying the user-specified aspect ratio requirement.

Fig. 31. The area model.

Total area for a one-dimensional standard-cell layout is calculated by A = W · (h + β · c), where W is the sum of the widths of all the cells in a row, h is the cell height, a constant determined by the ASIC library being used, β is the minimum distance between
the two adjacent wires' centers, and c is the maximum cutwidth (a.k.a. cut-density) of the design.

Lemma 22: Cutwidth versus gate-area (c versus g) curves are transformed into height versus width curves according to the following equations: height = h + c × β and W = g/h.

As shown in Fig. 32, *SiMPA-E first partitions the given decomposed circuit into a set of maximal tree sub-networks (clusters). Then, SiMPA-E produces the cutwidth/gate-area trade-off curves, which are subsequently translated to height/width curves (shape functions) using Lemma 22. During the third step, the set of all the shape functions is passed to a bottom-up floorplanner employed by the algorithm. The floorplanner takes these shape functions and generates the global 2-D trade-off curve. Finally, the best floorplan is picked from this curve. Note that the solutions for all the clusters are generated simultaneously from the best solution for the root of the cluster tree. Once the best global solution is found, the cluster sub-solutions which were used to construct the solution are traced back and assigned as the internal implementation of each cluster.

Fig. 32. The flow of *SiMPA-E. (Flow: decomposed circuit; partition into tree clusters; for each cluster, store the area trade-off curves generated by SiMPA-E; find the global area trade-off curve using the floorplanner; choose the best solution; to *SiMPA-R.)

A few observations relating to the operation of *SiMPA-E are:

Observation 13: In the shape function for a cluster, the detailed physical and logical implementations of every solution are known prior to executing the floorplanner. This information is then used throughout the floorplanning process for the exact area and delay calculations of the whole or part of the design.
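The area model and the curve translation of Lemma 22 amount to simple arithmetic, which the following Python sketch spells out. It is a minimal illustration under assumptions: the function names and the numeric values (h = 10, β = 2, and the sample curve) are hypothetical, and the pruning step simply keeps the points that are non-inferior in both height and width, which is one plausible reading of how a shape function is kept compact.

```python
from typing import List, Sequence, Tuple


def total_area(W: float, h: float, beta: float, c: float) -> float:
    """A = W * (h + beta * c): row width times (cell height plus routing height)."""
    return W * (h + beta * c)


def to_shape_function(curve: Sequence[Tuple[float, float]],
                      h: float, beta: float) -> List[Tuple[float, float]]:
    """Translate (cutwidth c, gate area g) points into (height, width) points (Lemma 22)."""
    return [(h + c * beta, g / h) for c, g in curve]


def prune_inferior(points: Sequence[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep only points for which no other point is at most as tall and at most as wide."""
    kept: List[Tuple[float, float]] = []
    for height, width in sorted(points):      # increasing height, then increasing width
        if not kept or width < kept[-1][1]:   # must be strictly narrower than all kept so far
            kept.append((height, width))
    return kept
```

With h = 10 and β = 2, the sample curve [(3, 400), (5, 300), (4, 450)] becomes the shapes (16, 40), (20, 30), and (18, 45); the last one is dropped because (16, 40) is both shorter and narrower. The area of the first shape, 16 × 40 = 640, equals total_area(40, 10, 2, 3), as expected from the area model.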
Observation 14: *SiMPA-E achieves global area optimization by taking into account detailed design information for all the design choices and comparing solution qualities based on exact physical and logical information. Further, *SiMPA-E does not need to use any area estimators, which are usually based on heuristic techniques.

Fig. 33. Global shape function generation.

Observation 15: Any floorplanner which can properly use the set of calculated shape functions can be employed in this flow. However, the global height versus width trade-off curve is built most effectively using a bottom-up floorplanner with shape propagation capability. The use of this type of floorplanner, as mentioned earlier, requires a hierarchy on the clusters, and this hierarchy must satisfy the acyclicness property if critical paths are to be resynthesized using *SiMPA-R. In this work, we employ k-LCTs (with k = 4) to satisfy all these requirements.

Observation 16: SiMPA-E calculates the shape function (including gate and wire areas) for every tree cluster, and therefore all optimum non-inferior points are included in the shape function. However, the area taken by inter-cluster wiring is an entirely different issue. Bottom-up floorplanning can be made to estimate (probabilistically or constructively) the inter-cluster (inter-block) connection lengths. These estimates are not exact, and hence claims of optimality for *SiMPA-E are subject to inaccuracies in estimating the inter-block connection lengths and area requirements. The choice of a k-LCT hierarchy in *SiMPA-E not only leads to advantages in *SiMPA-R (discussed later), but also restricts inter-cluster interconnects, which in turn reduces the impact of the inaccuracies suffered due to the use of a bottom-up approach.

Observation 17: For every leaf and internal node of the cluster tree (hierarchy), *SiMPA-E stores the calculated shape functions for future reference by *SiMPA-R. This information is needed by the resizing process of the clusters on the timing critical path. The process may open up space for the timing critical clusters by optimally resizing the non-critical ones in order to allow trading space for better timing behavior inside the critical clusters.

Theorem 18: *SiMPA-E is a polynomial algorithm.

4.3.3. *SiMPA-R for Eliminating Timing Violations

Although the global shape function generated by *SiMPA-E captures the best implementation in terms of area, the resulting implementation(s) are likely to violate user-specified timing requirements. In the *SiMPA flow, it is thus essential to employ a technique that eliminates the timing critical paths by resynthesizing and trading area for delay. After using *SiMPA-E and choosing an area-optimized implementation, *SiMPA-R is employed for the purpose of fixing the timing violations.

Fig. 34. The flow of *SiMPA-R. (Flow: from *SiMPA-E; identify the critical paths; [...] tree partitions; generate and propagate 3-D solution curves on the marked k-LCT; choose the best solution.)

1) Critical Clusters Initial Manipulation: The output of *SiMPA-E is a fully mapped, two-dimensionally placed implementation of the given circuit. *SiMPA-R first identifies all critical paths, selects the one with the most negative slack, and marks all the tree clusters on this critical path as critical.
For every critical leaf node in the k-LCT, the internal nodes on the unique path to the root of the k-LCT are also marked so that floorplanning, technology mapping, and placement can be redone. We refer to the resulting representation as a marked k-LCT. Fig. 35 demonstrates a simple example in which the critical path is shown by thicker arrows in the clustered circuit. In the 4-LCT, the critical leaves and nodes are marked by white boxes and black circles, respectively.

Lemma 23: In a marked k-LCT, there is at most one critical internal node and one critical leaf node among the children of every internal node.

Fig. 35. An example. (Panels: output of *SiMPA-E; the circuit decomposed into tree clusters; a 4-LCT for the circuit.)

2) 3-D Curve Generation and Propagation: Having marked the k-LCT, *SiMPA-R begins the generation and propagation of the 3-D (height, width, and the critical signal ready time at the root of a subtree) solution curves from the lowest level marked internal node up towards the root. At each step, the algorithm detects the immediate children of the current internal node, v, and retrieves or generates their corresponding 2-D or 3-D solution curves in order to examine the solution possibilities for all the floorplan templates.

An immediate child of v, say u, is either a non-critical internal/leaf node, a critical internal node, or a critical leaf node. In the first case (a non-critical internal/leaf node), there is no marked node among descendents(u), indicating the absence of a critical signal. Therefore, the corresponding solution curve is the 2-D curve already calculated and stored by *SiMPA-E. For the case of a critical internal node, a 3-D solution curve has already been calculated by previous calls to the same procedure.
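The marking step and the three-way classification of the children of v can be sketched as follows. This is only an illustration under stated assumptions: the `Node` class, the helper names `mark`, `classify`, and `satisfies_lemma_23`, and the small example tree are all hypothetical, and the tree is a generic cluster tree rather than an actual k-LCT built by the thesis's procedure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(eq=False)
class Node:
    """One node of a cluster tree; leaves stand for tree clusters (hypothetical type)."""
    name: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add(self, name: str) -> "Node":
        child = Node(name, parent=self)
        self.children.append(child)
        return child


def mark(critical_leaves: List[Node]) -> Set[str]:
    """Mark every critical leaf and all internal nodes on its path to the root."""
    marked: Set[str] = set()
    for leaf in critical_leaves:
        node: Optional[Node] = leaf
        # Stop early once an already-marked ancestor is reached.
        while node is not None and node.name not in marked:
            marked.add(node.name)
            node = node.parent
    return marked


def classify(u: Node, marked: Set[str]) -> str:
    """Return which of the three cases applies to a child u."""
    if u.name not in marked:
        return "non-critical"
    return "critical internal" if u.children else "critical leaf"


def satisfies_lemma_23(root: Node, marked: Set[str]) -> bool:
    """At most one critical internal and one critical leaf child per node."""
    stack = [root]
    while stack:
        v = stack.pop()
        kinds = [classify(u, marked) for u in v.children]
        if kinds.count("critical internal") > 1 or kinds.count("critical leaf") > 1:
            return False
        stack.extend(v.children)
    return True


# Assumed example: root r has an internal child i1 and a leaf b4;
# i1 has leaves b1, b2, b3. The critical path visits clusters b1 and b4.
r = Node("r")
i1 = r.add("i1")
b4 = r.add("b4")
b1, b2, b3 = i1.add("b1"), i1.add("b2"), i1.add("b3")

marked = mark([b1, b4])
kinds = {u.name: classify(u, marked) for u in (i1, b4, b1, b2)}
```

In this sketch a non-critical child would reuse its stored 2-D curve, a critical internal child would reuse the 3-D curve from an earlier step, and only a critical leaf would need a new 3-D curve.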
The situation is a bit different for a critical leaf node; in this case one of the incoming signals is the critical signal, which may be provided by its marked sibling’s descendants. Consequently, the arrival time of the signal depends on the implementation chosen for the sibling node. However, all the physical and logical implementation information up to that point is available, and therefore for every floorplan template and block-to-room assignment, the possible critical signal arrival times can be precisely calculated. Using these arrival times, SiMPA-D generates a 3-D solution curve for that critical cluster. These 3-D curves (for every template and assignment) are assembled into one curve and the inferior solutions are pruned out. The timing values on this curve are the critical signal’s ready times at the output of v for each possible implementation of the subtree. Note that according to Lemma 23, there exists at most one marked leaf node, and therefore SiMPA-D is run for at most one node during processing of the immediate children of every v. If v has no immediate marked leaf node, the timing values of the critical internal node are the critical signal’s ready times. As for the geometrical calculations, the height and width of the solutions of v are calculated from the solutions to the sub-problems in ways similar to *SiMPA-E’s method.

Now, a few remarks relating to *SiMPA-R:
• Every solution generated by *SiMPA keeps the pointers to its constituent sub-solutions in order to retrieve all the corresponding design information for the best solution selected in the final stage.
• For every template and room assignment, the relative locations of the blocks are known. Accordingly, interconnect routes within v are known (subject to using a bottom-up global routing algorithm).
  In addition, k-LCT provides *SiMPA with a proper structure which helps avoid signal cycles among children(v), and therefore all timing information can be calculated according to the topological order of children(v).
• This procedure proceeds recursively until it reaches the root in the k-LCT. The trade-off curve of the root reveals the global design options from which the best is selected and traced down using the stored pointers.

Theorem 19: *SiMPA-R is a polynomial time algorithm.

4.4. Experimental Results

The sequence of steps consisting of technology independent optimization, technology mapping, placement, global and detailed routing is the conventional design flow. To verify the effectiveness of the simultaneous approach, results are presented in Table 6 for two setups: I) Conventional flow. II) Technology mapping, maximal tree clustering, floorplanning (BearFP), linear placement of each cluster using YA, followed by TimberWolf and YACR. The presented results show only small differences between the two runs, which points to the fact that *SiMPA’s efficiency is due to its simultaneous approach to the design process rather than just improved mapping and placement algorithms.

The next set of experiments compares the performance of the conventional design flow with our proposed DSM methodology on a number of benchmarks using a CASCADE standard cell library (0.5u HP CMOS process). Gate and wire delays are calculated using a 4-parameter delay equation (similar to that in [LSP97]) and the Elmore delay model, respectively. These experiments were run in the SIS environment [SSLM92] on an Ultra-2 Sun Sparc workstation with 256MB memory. The area and delay reported here are total chip area and delay after detailed routing. Table 7 compares
the results of *SiMPA with the conventional flow, showing an average of 34% performance improvement with a small area penalty. In these experiments, the floorplanning was performed by our preliminary k-LCT floorplanner prototype, TimberWolf was used for global routing, and YACR was used for detailed routing. The runtimes for *SiMPA remained comparable to the ones for the conventional flow, as shown in Table 8 for a number of benchmark circuits.

                 Setup-I        Setup-II           Ratio
Circuit       Area   Delay    Area   Delay    Area   Delay
alu2          2459   13.13    2402   13.10    0.98    1.00
apex7         1254    6.45    1260    6.61    1.00    1.02
cm150a         226    3.22     221    3.18    0.98    0.99
cm151a         168    2.97     172    2.91    1.03    0.98
cm162a         229    2.50     234    2.57    1.02    1.03
duke2         3283   10.16    3241    9.75    0.99    0.96
k2a           7555   15.96    7570   19.51    1.00    1.22
rot           4748    8.91    4651   12.12    0.98    1.36
table3        5014   55.17    5030   66.34    1.00    1.20
C1908         4288   17.46    4477   16.67    1.04    0.95
C880          2673    9.53    2496   10.61    0.93    1.11
C1355         2452    8.84    2517    9.69    1.03    1.10
C3540        10101   27.63    9674   36.09    0.96    1.31
Average Ratios:

Table 6: Verifying the effectiveness of the simultaneous approach.
           Conventional Flow       *SiMPA            Ratios
Circuit        Area    Delay     Area    Delay     Area    Delay
alu2           2459    13.13     2017     8.60     0.82     0.65
alu4           4258    22.89     3725    11.89     0.87     0.52
apex6          4554    12.01     5106     6.82     1.12     0.57
apex7          1254     6.45     1317     4.31     1.05     0.67
b9              620     3.62      598     2.57     0.96     0.71
cm150a          226     3.22      215     1.42     0.95     0.44
cm151a          168     2.97       94     1.46     0.56     0.49
comp            641     2.84      544     2.22     0.85     0.78
con1             83     1.55       93     0.76     1.12     0.49
cordic          376     2.74      266     1.90     0.71     0.69
dalu           6356    19.45     5959    14.75     0.94     0.76
duke2          3283    10.16     2708     8.64     0.82     0.85
f51m            300     5.76      370     4.42     1.23     0.77
k2a            7555    15.96     9421    12.37     1.25     0.78
lal             560     4.09      537     2.60     0.96     0.64
misex3         3374    12.89     3559    10.73     1.05     0.83
mux             221     3.03      234     1.46     1.06     0.48
pcle            471     3.66      300     2.41     0.64     0.66
pcler8          602     3.90      570     3.00     0.95     0.77
ritex           337     2.16      299     1.84     0.89     0.85
ritexl           90     1.09       94     0.77     1.04     0.71
rot            4748     8.91     4864     8.20     1.02     0.92
term1           711     3.92      778     2.72     1.09     0.69
z4ml            267     3.45      182     1.66     0.68     0.48
5xp1            641     7.66      546     4.23     0.85     0.55
9sym            811     7.71     1019     5.62     1.26     0.73
Z5xp1           398     7.08      459     5.79     1.15     0.82
Z9sym           876     7.77     1068     5.14     1.22     0.66
b12             393     3.81      355     1.97     0.90     0.52
bw             1017     9.33      845     5.77     0.83     0.62
clip            689     7.85      768     3.61     1.11     0.46
e64            1602     8.29     1648     9.13     1.03     1.10
inc             587     7.78      577     3.45     0.98     0.44
misex1          293     5.90      313     2.61     1.07     0.44

Table 7: *SiMPA versus the conventional flow.
           Conventional Flow       *SiMPA            Ratios
Circuit        Area    Delay     Area    Delay     Area    Delay
misex2          517     4.03      514     2.48     0.99     0.62
misex3c        2629    19.50     2765     8.54     1.05     0.44
rd53            226     3.16      155     2.00     0.69     0.63
rd73            335     3.78      346     2.73     1.03     0.72
rd84            857     7.01      767     4.61     0.89     0.66
sao2            815     6.20      778     3.49     0.95     0.56
squar5          256     3.65      221     2.36     0.87     0.65
table3         5014    55.17     5356    53.51     1.07     0.97
vg2             543     3.66      462     2.67     0.85     0.73
xor5             96     1.56       92     1.10     0.96     0.71
C17              40     0.78       41     0.37     1.02     0.47
C432           2810    11.91     1329    12.28     0.47     1.03
C1908          4288    17.46     2538    14.17     0.59     0.81
C880           2673     9.53     2377    10.74     0.89     1.13
C1355          2452     8.84     2684     7.84     1.09     0.89
C499           2304     8.64     2702     6.55     1.17     0.76
C2670          4678     8.40     4597     7.17     0.98     0.85
C3540         10101    27.63     6985    22.20     0.69     0.80
C5315         11232    16.03    10201    16.22     0.91     1.01
C7552         12738    24.53    14158    11.77     1.11     0.48
Average Ratios:

Table 7 (continued): *SiMPA versus the conventional flow.

                Runtime (sec)
Circuit   Conventional    *SiMPA    Ratio
alu2               146       178     1.22
apex7               63        67     1.06
cm150a              15        22     1.47
cm151a              11        15     1.36
cm162a              16        17     1.06
duke2              146       152     1.04
k2a                519       753     1.45
rot                215       312     1.45
table3             243       265     1.09
C1908              255       291     1.14
C880               155       199     1.28
C1355              191       243     1.27
C3540              663       711     1.07

Table 8: The runtimes.

CHAPTER 5

PEGASUS

5.1. Introduction

The large size and sheer complexity of current designs have made computer-aided design (CAD) tools the main enabling instrument for the semiconductor industry. IC CAD tools are changing swiftly and substantially, and new techniques and algorithms are introduced every year. Historically, open CAD systems have significantly contributed to the VLSI design community by making it possible to prototype new ideas quickly and compare them against other techniques.
Due to the open nature of these systems, they are developed primarily at universities and are then made available to the public. In some cases, these tools have been used as a starting point for developing new industry tools and have been a source of significant advances in the field.

In previous chapters, we saw that to resolve the timing closure problem in deep submicron technologies, L&P design steps should be finely integrated in a unified system. However, none of the existing open CAD systems provides such an environment, since they have been specifically designed for either logic synthesis, e.g. SIS [SSLM92], or physical design, e.g. TimberWolf [Se88].

PEGASUS is a new design system which provides an integrated L&P database in which it is convenient to integrate logic and layout design and optimization techniques. It provides a unified timing engine which can analyze the design at any time during the L&P design steps. Building on the industry practices and frameworks as exemplified in [CHDS], PEGASUS supports logical and physical hierarchies for a design and reduces the memory usage for hierarchical designs. In addition, the system has been implemented in C++ and broadly exploits the properties of object-oriented programming, including encapsulation, polymorphism, and inheritance [Sc98].

The rest of this chapter is organized as follows. In section 5.2, the digital design methodologies and how they are affected by the DSM technologies are discussed. In section 5.3, the PEGASUS system is introduced and explained. Experimental results are presented in section 5.4.

5.2. Design Methodologies

The conventional ASIC design flow is divided into two main sections: RT-level and logic synthesis, which are referred to as the front-end; and floorplanning, placement, and routing, which are referred to as the back-end (see Fig. 36).
This self-imposed separation between the front-end and back-end parts has simplified the tools and methodology in previous technologies. As a result of that approach, front-end and back-end tools have been designed independently and often by different groups of people.

Fig. 36. The conventional design flow. (Stages shown: Behavioral Synthesis, RTL Code, Tech-Indep Synthesis, Tech-dep Synthesis, Netlist, Floorplanning, Placement, Routing.)

DSM design methodologies rely on one or a few general techniques such as: early partitioning, high-level floorplanning, top-level routing, simultaneous logic and layout design, high-speed interconnect design, and a centralized timing calculator. One common requirement of all the above techniques is that the CAD framework has to provide an integrated physical and logical database and environment. The conventional tools and frameworks fail to help in the DSM region because they have not been designed to cross the dictated boundary between logical and physical domains.

In this chapter, the main focus is on open-source CAD systems. SIS (Sequential Interactive Synthesis) [SSLM92] is a widely-used example of one such system for logic optimization. SIS provides a database and a set of application programs specifically designed for Boolean and gate level design and optimization (for a collection of those techniques, see [HS96]). Despite its popularity for logic synthesis applications, SIS lacks the structure to perform physical optimization or handle many of the DSM design optimizations. PEGASUS has been designed to provide an alternative system that eases the implementation work for a developer working on DSM design techniques.
The structure and the main features of PEGASUS are introduced in the next section.

5.3. PEGASUS

PEGASUS is intended to be a CAD environment and a suite of tools for technology mapping, fanout optimization, gate placement, gate sizing, and global routing with emphasis on the unification-based design algorithms. The input to the system is a technology-independent optimized circuit, whereas the output is a placed circuit generated after mapping and technology-dependent optimization steps.

5.3.1. Architecture of PEGASUS

The PEGASUS system is composed of two main components: the Database and the Application (see Fig. 37). These two parts are isolated from one another and communicate through an internal interface. The database is divided into two different sections: the design database, which stores the design libraries, and the primitive database, which stores the primitive libraries. A design library contains the complete representation of a circuit design, including the information regarding its components, the circuit connectivity, and the logical and physical information. On the other hand, a primitive library contains information specific to the primitive components that are instantiated in the design libraries, such as the gates of a standard cell library or a set of parameterized application-specific cells. Primitive libraries also contain detailed information about the electrical and physical specifications of the fabrication process from which the primitive objects are cast.

Fig. 37. The high-level architecture of PEGASUS. (Blocks shown: PEGASUS, Database, Application, Interface, Design Libs, Primitive Libs, Utilities, User Apps.)
The second part of the system, the Application part, contains the algorithms (or application programs) that perform synthesis, optimization, or analysis on design libraries using the interface provided by the database. These applications are further divided into two categories: utility applications that provide system utilities such as centralized delay and power calculators, and user applications which are user-defined tools, such as the fan-out optimization, placement, or floor-planning algorithms. The complete separation between the design database and the user applications enables end users to easily add their algorithms with no intervention in the database internals.

The object-oriented structure of PEGASUS reduces the effort of programming, debugging, and adding new objects and algorithms. As opposed to many other university tools, adding a new component to PEGASUS usually amounts to deriving a new object in the system hierarchy, with no conflict between the new code and the existing code.

5.3.2. Hierarchical Database - Introduction

The ability to handle large designs is vital to the success of any digital IC design system. This is especially important in deep-submicron regimes where designs are very large and contain many details. The database of PEGASUS takes advantage of the hierarchical representation of a design in order to save memory resources. Generally, in any design, parts of the data are shared among objects of the same type whereas some other context-dependent (specific) data cannot be shared. Let us consider, for simplicity, a design composed of 4 NAND2 gates. Here, information regarding the logic function is shared by all of the NAND2 gates. This fact suggests that only one copy of such information should be stored in the database.
On the other hand, other information such as placement coordinates must be stored for each individual gate. In order to represent a design in a very efficient way, a design library in the PEGASUS database stores the information in two different views: the Compact View and the Application View. The model representing the structure of a design library in the PEGASUS hierarchical database is shown in Fig. 38.

Fig. 38. Design library model. (Blocks shown: Compact View, Internal Interface, Application View, Shared Data, User Interface, Specific Data.)

As can be seen, while the compact view stores shared information, the application view holds information specific to the context in which the different design objects are instantiated. Therefore, exploiting the hierarchical structure of the circuit, the compact view maintains a reduced internal representation storing only the essential information about the circuit. On the other hand, the application view stores a full representation of the circuit, where the shared information is expanded from the compact view through an internal interface. While the database maintains both views at the same time, the internal interface operates behind the scenes to ensure consistency between these two views. However, only the application view is visible to the user and can therefore be edited through the user interface, while all the rest is out of the user’s scope. To save memory resources, the application view is not generated at the beginning but is expanded dynamically based on the requests made from the user applications through the user interface. Based on such requests, only a small portion of the design is expanded in the application view at one time and is subsequently made available to the user.
When there is no need to process/analyze a portion of the design, it is automatically released from the application view while the compact view is accordingly updated by the internal interface. We call this feature Expand-on-Demand. The Expand-on-Demand operation is transparent to the user applications, so they can always assume that the whole application view is available to them at all times.

Fig. 39. Class hierarchy in PEGASUS.

The object hierarchy of the PEGASUS database is depicted in Fig. 39. In this structure the class PObject is the root of the hierarchy tree and every other object is directly or indirectly derived from it. In particular, the derived classes DbObject and AppObject are the base classes for all the other derived classes in the compact view and in the application view, respectively. The Property class is a special class and it will be treated in detail in section 5.3.4.

In the compact view, every distinct design cell is stored as an object of type DbCell. Inside a DbCell, input and output terminals are stored as objects of type DbPort, and children cells are stored as objects of type DbCellInst. DbCellInst objects are instances of other DbCell objects. A DbCellInst object is empty except for the terminal information that is stored in terms of DbPortInst objects. In particular, for each DbPort in its reference DbCell, a DbPortInst object is instantiated in the DbCellInst. Finally, a DbCell also stores a set of DbNet objects that internally connect DbPort and DbPortInst objects.
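The compact-view objects described above can be sketched as plain data classes. This is an illustrative model only, not the PEGASUS C++ implementation: the field names, the methods `add_port`, `instantiate`, and `connect`, and the port names a, b, y are assumptions. The example builds the earlier design of 4 NAND2 gates and shows that every instance points at one shared definition while holding its own terminals.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass(eq=False)
class DbPort:
    name: str


@dataclass(eq=False)
class DbPortInst:
    ref: DbPort  # the DbPort of the reference cell that this terminal instantiates


@dataclass(eq=False)
class DbCellInst:
    name: str
    ref: "DbCell"  # pointer to the shared cell definition
    ports: Dict[str, DbPortInst] = field(default_factory=dict)


@dataclass(eq=False)
class DbNet:
    name: str
    pins: List[Union[DbPort, DbPortInst]] = field(default_factory=list)


@dataclass(eq=False)
class DbCell:
    name: str
    ports: Dict[str, DbPort] = field(default_factory=dict)
    insts: Dict[str, DbCellInst] = field(default_factory=dict)
    nets: Dict[str, DbNet] = field(default_factory=dict)

    def add_port(self, name: str) -> DbPort:
        self.ports[name] = DbPort(name)
        return self.ports[name]

    def instantiate(self, inst_name: str, ref: "DbCell") -> DbCellInst:
        # One DbPortInst per DbPort of the reference cell; nothing else is copied.
        terminals = {p: DbPortInst(port) for p, port in ref.ports.items()}
        inst = DbCellInst(inst_name, ref, terminals)
        self.insts[inst_name] = inst
        return inst

    def connect(self, net_name: str, *pins: Union[DbPort, DbPortInst]) -> DbNet:
        self.nets[net_name] = DbNet(net_name, list(pins))
        return self.nets[net_name]


# One shared NAND2 definition (port names are assumptions).
nand2 = DbCell("NAND2")
for p in ("a", "b", "y"):
    nand2.add_port(p)

# A design made of four NAND2 instances.
top = DbCell("top")
gates = [top.instantiate(f"g{i}", nand2) for i in range(1, 5)]
top.connect("n1", gates[0].ports["y"], gates[1].ports["a"])

shared_definitions = {id(g.ref) for g in gates}                  # one shared DbCell
port_insts = {id(pi) for g in gates for pi in g.ports.values()}  # 4 x 3 own terminals
```

Context-dependent data such as placement coordinates would not live in these objects; in the text it belongs to the application view.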
A DbCell object has no external connectivity and only serves to maintain a unique and shareable representation of the data of any distinct hierarchical entity of a design. Furthermore, different levels of hierarchy are permitted through the recursive mechanism of the instantiation of DbCell objects into DbCellInst objects inside other DbCell objects.

The design representation in the application view is fairly simple. Here, every DbCellInst object is expanded into an object of type Cell, every DbPortInst is expanded into a Port object, and every DbNet is expanded into a Net object. The exception to this rule is the top-level DbCell representing the whole design. In this case, the DbCell is expanded into a Network (a derived type of Cell), and its DbPort and DbNet objects are expanded into Port and Net objects, respectively. During the expansion, all the shared data in the compact view is reproduced as many times as there are instances of that data, while the hierarchical structure is preserved.

5.3.3. Hierarchical Database - Example

The internal operation of the PEGASUS database is better explained with the help of a small example. However, those who are not interested in the details can skip over this subsection without loss of continuity.

The following is an example of the representation of a half adder and a full adder in the PEGASUS database. The compact views of the half adder ha and the full adder fa are depicted in Fig. 40 and Fig. 41, respectively. In these figures, DbCell objects are represented as rectangles with white background, DbCellInst objects are rectangles with grey background, DbPort objects are small black squares, DbPortInst objects are small white squares, and DbNets are black lines. In the text, we use a hierarchical naming convention that interposes a dot in between different levels of hierarchy.
For example, a.b.c means that object c belongs to (or is in the scope of) object b which, in turn, belongs to object a. Also, for clarity, names are reported in text in bold face. We omit the use of this notation in the figures since the different hierarchical levels are already clearly identified in those contexts.

In Fig. 40.a the primitives of the gates xor, and, or used for the design of the half and full adders are also represented. All of these gates are represented as objects of type DbGate (a derived type of DbCell), and have three DbPort objects each: xor.a, xor.b, xor.c for xor; and.a, and.b, and.c for and; and or.a, or.b, or.c for or. Since such gates are primitive objects, they will be stored in a primitive library inside the PEGASUS database.

In Fig. 40.b, ha is stored as a DbCell object whose inputs and outputs are DbPort objects ha.a, ha.b, ha.s, ha.c. The DbGate objects xor and and are instantiated inside ha through the DbCellInst objects ha.xor1 and ha.and1. Each DbCellInst object is an empty object that only contains three DbPortInst objects, which are the instances of the DbPort objects of the corresponding DbCell reference (xor for ha.xor1, and and for ha.and1). In particular, DbCellInst ha.xor1 has DbPortInst ha.xor1.a, ha.xor1.b, ha.xor1.c, whereas DbCellInst ha.and1 has DbPortInst ha.and1.a, ha.and1.b, ha.and1.c. Inside ha, DbPort and DbPortInst objects are then connected together via DbNet objects. For example, DbNet ha.n4 connects DbPortInst ha.and1.c to DbPort ha.c.

Fig. 40. Compact view of a half adder: a) primitives; b) half adder.

Similarly, the compact view of the full adder fa, with inputs a, b, ci and outputs c, s, is represented in Fig. 41. Here, fa is composed of DbPort objects fa.a, fa.b, fa.ci, fa.c, fa.s, DbCellInst objects fa.ha1, fa.ha2 (the instances of the DbCell ha of Fig. 40), fa.or1 (the instance of the DbGate or), and eight different DbNet objects connecting together the DbPort and DbPortInst objects inside fa.

Fig. 41. Compact view of the full adder.

At this point the memory saving feature of the compact view becomes clear: the complete circuit information is arranged in the database so that no shareable information is stored more than once. To see this, let us consider the full adder example more closely. Inside fa we have stored components and connectivity information in terms of objects of type DbPort, DbNet, DbCellInst and DbPortInst. Nevertheless, in doing so, we have not wasted any memory replicating the information regarding the internals of fa.ha1, fa.ha2, and fa.or1. As a matter of fact, those DbCellInst objects only contain a set of DbPortInst objects that are needed to store their unique external connectivity and a pointer to their corresponding references (DbCell objects). The rest of the half adder information is shared information and it is stored in DbCell ha.

To complete our example, we show in Fig. 42 how the internal interface of the database automatically expands the compact view into the application view, based on the access requests made by the user through the user interface. There are different schemes governing the object expansion from compact view to application view in PEGASUS. Here, for the sake of simplicity, we use a simple expansion scheme. In Fig. 42, white rectangles are Cell objects, small black squares are Port objects, and black lines are Net objects.
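As a rough illustration of Expand-on-Demand on this example, the sketch below creates application-view cells only when their parent's children are first accessed. It is a simplification under stated assumptions: the `COMPACT` table, the `Cell` class with its `children`, `release`, and `materialised` members, and the top-level name fa1 are illustrative, ports and nets are left out, and releasing a cell here simply drops it without writing anything back to the compact view.

```python
from typing import Dict, List, Optional, Tuple

# Compact view reduced to the cell hierarchy only: for each distinct cell,
# the (instance name, reference cell) pairs it contains.
COMPACT: Dict[str, List[Tuple[str, str]]] = {
    "xor": [], "and": [], "or": [],
    "ha": [("xor1", "xor"), ("and1", "and")],
    "fa": [("ha1", "ha"), ("ha2", "ha"), ("or1", "or")],
}


class Cell:
    """Application-view cell whose children are created on first access."""

    def __init__(self, path: str, ref: str) -> None:
        self.path = path
        self.ref = ref
        self._children: Optional[Dict[str, "Cell"]] = None  # not expanded yet

    @property
    def children(self) -> Dict[str, "Cell"]:
        if self._children is None:  # expand this level only, on demand
            self._children = {
                inst: Cell(f"{self.path}.{inst}", ref)
                for inst, ref in COMPACT[self.ref]
            }
        return self._children

    def release(self) -> None:
        """Drop the expanded sub-hierarchy; the compact view still holds the data."""
        self._children = None

    def materialised(self) -> List[str]:
        """Paths of all cells that currently exist in the application view."""
        paths = [self.path]
        for child in (self._children or {}).values():
            paths.extend(child.materialised())
        return paths


fa1 = Cell("fa1", "fa")
before = fa1.materialised()   # only the top-level cell exists
ha1 = fa1.children["ha1"]     # first access expands the children of fa1
_ = ha1.children              # expands fa1.ha1 one level further
after = fa1.materialised()    # fa1.ha2 and fa1.or1 exist but are not expanded
```

A user application only ever calls `children`, so from its side the whole hierarchy appears to be available, which is the transparency property the text describes.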
At the beginning of the operation (a), the full adder fa is expanded in the application view as an empty object fa1 of type Cell (since fa1 is the top hierarchy cell in the application view, it is actually expanded into an object of type Network, which is a derived type of Cell). Also, each DbPort of fa is expanded in fa1 into a corresponding Port object. As soon as the user requests to access one of the nets of Cell fa1 (b), all the DbNet objects of fa are expanded in fa1, each one into a corresponding Net object. At this point (c), if there is an access to one of the components of fa1, all the DbCellInst objects of fa are expanded, each one into a corresponding Cell object, and then stored in fa1 as children. Also, for each DbCellInst, all DbPortInst objects are expanded as Port objects and stored in the corresponding Cell objects. For example, DbCellInst fa.ha1 is expanded into Cell fa1.ha1, whereas DbPortInst objects fa.ha1.a, fa.ha1.b, fa.ha1.c, fa.ha1.s are expanded into Port objects fa1.ha1.a, fa1.ha1.b, fa1.ha1.c, fa1.ha1.s, respectively. Finally (d), if the components of fa1.ha1 are accessed, the same expansion mechanism repeats itself for this level of hierarchy, generating Net objects fa1.ha1.n1, fa1.ha1.n2, fa1.ha1.n3, fa1.ha1.n4, Cell objects fa1.ha1.xor1, fa1.ha1.and1, and Port objects fa1.ha1.xor1.a, fa1.ha1.xor1.b, fa1.ha1.xor1.c, fa1.ha1.and1.a, fa1.ha1.and1.b, fa1.ha1.and1.c. Unless the user requests to access them, cells fa1.ha2 and fa1.or1 are not further expanded in the application view, so that no memory resource is wasted.

Fig. 42. Application view of a full adder generated by an expansion scheme: a) cell and ports expand; b) nets expand; c) children cells expand; d) sub-hierarchy fa1.ha1 expands completely.

5.3.4. Dynamic Property Management

The PEGASUS system is flexible enough to allow its users to include an arbitrary number of new applications. In general, each application will have to handle and store in the database different data structures to represent information in different categories, e.g. physical, logical, delay, power, or information related to groups of objects. For efficiency and ease of programmability, each application is independent of the others and of the central database. While the independence among the applications is provided by a common user interface, the independence between the applications and the database is accomplished through the use of a dynamic property allocation mechanism called Dynamic Property Management (DPM). With this feature, whenever a user application needs to attach a specific data structure to a design object in the application view (e.g. a global router may need to store the row number in each Cell object), that data structure is dynamically attached to the object as a property through a call to the Property Manager of that object. The property manager determines whether to attach the property in the application view (context-dependent information) or in the compact view (shared information), and then returns a unique property handle to the user application, so that the application is able to access the property later on by simply using that handle. Along with information on the stored data structure, the property handle also includes an ownership feature with which the application that first registers a property can grant or refuse writing permission to other applications. In PEGASUS, for
system extensibility, the DPM interface is automatically made part of the user interface of every class derived from classes DbObject and AppObject through the use of the special template class Property, which inherits from the top hierarchy class PObject. In particular, this feature is mainly accomplished by letting all the interested classes (the classes drawn over a grey shaded rectangle in Fig. 39) inherit from the Property class as well.

The use of DPM has several advantages. One of them is system programmability. Indeed, a user-application developer needs to know only the user interface (which includes the DPM interface), but nothing about the database internals. Another advantage is a substantial system memory saving. In fact, even if a property is registered for an entire family of objects (e.g. cells or ports), memory is allocated only for those objects of a family that need to store that particular property. Also, if a property is no longer used, it is dynamically removed from memory.

5.3.5. Application Programs and The Design Flow

In addition to the database and utilities, PEGASUS provides a set of application programs and a design methodology. The proposed methodology is floorplan-driven and performs early partitioning and floorplanning in order to decide and set the global physical structure of the design. Many of the algorithms employed in the methodology (to be introduced in the following) simultaneously perform logical and physical synthesis and heavily rely on the integrated nature of the PEGASUS database. The details of the design flow and a short description of each application program are given below.

1. Initialization and floorplanning: In this methodology, a technology-independently optimized circuit is read and clustered into logic blocks.
The size of each logic block is controlled within a range for which it can be assumed that the gate delay dominates the interconnect delay (see [SK98]). Each combinational cluster is decomposed into basic gates, e.g. NAND2 and INV gates, and the clusters are then floorplanned as soft blocks using a dynamic programming technique called Bear-FP [PK92].

2. Global wire planning: At this point, a rough, high-level placement of the design is known and is used to plan the global wires. MERLIN [SLP99a] is employed to perform simultaneous routing tree construction and buffer insertion/sizing for the global wires. Combining the two design steps is vital to achieving high-quality global wires because they often have long delays and large capacitive loads. If the two steps are applied sequentially, the first step has to sacrifice a lot of area to achieve a reasonably good delay, an overhead that can be avoided by letting the two steps work simultaneously. MERLIN works on one net at a time and maximizes the required time at the driver of the net by considering the effect of the load, location, and required time of each sink.

3. Cluster implementation: Every combinational cluster is implemented internally using SiMPA [SLP99b], which is a simultaneous technology mapping and gate placement technique. By using SiMPA, the inter-cluster wire congestion and the gate area are minimized while the delay satisfies a given constraint. SiMPA is a dynamic-programming-based algorithm that generates and propagates 2-dimensional (minimum area mode) and 3-dimensional (minimum delay with controlled area mode) solution curves. During the synthesis, SiMPA has access to both logical and physical information and, as a result, its area and delay calculations are exact.
SiMPA allows the user to choose the final implementation from a set of solutions that provide a range of trade-offs between area and delay or between height and width.

4. Cluster optimization-I: Due to the controlled size of the logic blocks, it can be assumed that inside each logic block the gate delay dominates the interconnect delay. Therefore, a fanout optimization technique based on Bipolar LT-Trees [CPPZ98] is employed to speed up the nodes with high capacitive load. The inserted buffers are then placed within the placement region of their corresponding clusters. The Bipolar LT-Tree is a generalized form of the LT-Tree [To90] structure. This fanout optimization algorithm employs a continuous time buffer sizing technique based on a transistor-level deep-submicron model.

5. Cluster optimization-II: Following the fanout optimization step, a simultaneous gate sizing and re-placement algorithm, called SCD [CHP99], is applied to each cluster to further optimize its internal implementation. SCD formulates the gate sizing and displacement problem in a mathematical programming form called Generalized Geometric Programming (GGP). To solve a GGP problem, the original GGP is transformed into a sequence of geometric programming (GP) problems by a process commonly referred to as condensation [ADP75]. [KXY96] provides a set of effective techniques to solve the GP problem.

6. Final optimization: After completing the internal implementation and optimization of all clusters, the global placement of the design is updated and any overlap or excessive dead area between gates is removed.

5.3.6. Interface

The Pegasus Library (PegLib) is a new gate library format used in the PEGASUS system to effectively characterize all the standard cell and fabrication process information (logical, physical, etc.) that is handled by the database and by the different application programs.
The main features of the PegLib format are modularity and extensibility. In this format, the information about gates and processes is structured in nested blocks of textual data surrounded by braces. In particular, while the data corresponding to each library entity is organized in sections, the blocks of information about that entity needed by different application programs are organized in sub-sections that are nested inside the corresponding entity section. Not all of the sub-sections are essential, and they need not all be present at all times. If a block of data for a user application or utility that is not currently in use is missing, the library file is still loaded into the system so that other applications are able to work. When an application requires the missing data to be present in the database, the complete library can be loaded again, or the information can be dynamically added to the database. The modularity of PegLib makes it very easy for the user to customize. Whenever a new application program that requires specific information that cannot be extracted from the available data is included in PEGASUS, a corresponding new sub-section can be added to PegLib to serve that purpose (at the cost of modifying the library parser to make it recognize the new sub-section).

An example of the syntax of the format is shown in Fig. 43. Here, an optional PROCESS section that contains electrical and layout information about the library fabrication process is followed by a list of CELL sections that store the information about the individual gates. For each gate, information about connectivity, logical and physical properties, and other application-specific information is inserted in nested sub-sections. In particular, the DELAY section reports the default delay model used in the PEGASUS system.
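As a rough illustration of the nested-block organization and of how a missing optional sub-section can be tolerated, the following Python sketch parses brace-delimited text into nested lists. It is not the PEGASUS library parser: the tokenization rules, the `//` comment handling, the helper names `tokenize`, `parse_block`, and `get_section`, and the sample library text are all assumptions made for this example.

```python
import re

def tokenize(text):
    """Split text into '{', '}' and whitespace-delimited words.
    Assumes '//' starts a comment that runs to the end of the line."""
    text = re.sub(r"//[^\n]*", "", text)
    return re.findall(r"[{}]|[^\s{}]+", text)

def parse_block(tokens, pos=0):
    """Parse tokens from pos up to the matching '}' (or end of input).
    Returns (items, next_pos); each item is a word or a nested list."""
    items = []
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == "{":
            child, pos = parse_block(tokens, pos + 1)
            items.append(child)
        elif tok == "}":
            return items, pos + 1
        else:
            items.append(tok)
            pos += 1
    return items, pos

def get_section(block, name):
    """Return the block that directly follows keyword `name`,
    or None when that optional sub-section is absent."""
    for i in range(len(block) - 1):
        if block[i] == name and isinstance(block[i + 1], list):
            return block[i + 1]
    return None

# Hypothetical library text; the cell name and values are made up.
SAMPLE = """
LIBRARY demo {
    CELL nand2 {
        INPUT a b
        OUTPUT y
        PHYSICAL { AREA 12.5 }   // no DELAY sub-section on purpose
    }
}
"""

top, _ = parse_block(tokenize(SAMPLE))
library = top[2]                       # block following 'LIBRARY demo'
cell = library[2]                      # block following 'CELL nand2'
physical = get_section(cell, "PHYSICAL")
delay = get_section(cell, "DELAY")     # None: the rest of the library still loads
```

In this sketch an application that needs the DELAY data would check for `None` and request a reload, mirroring the behavior described above; how PEGASUS actually signals missing data is not specified in the text.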
In this format, the delay of each input pin (against one output pin) of a cell is characterized by four subsets of 4 parameters each, modeling the rising and falling propagation times and the rising and falling transition times. Transition times are defined here as the difference between the times at which the rising and falling edges of a signal are at 10% and 90% of their total swing, respectively. Propagation times, on the other hand, are defined as the difference between the times at which the falling and rising edges of a signal are at 50% of their swing. The pin-to-pin delay model is as follows:

delay = (k1 + k2 · load) · transition_time + k3 · load + k4

Here, delay represents any of the propagation or transition delays for the output pin, load denotes the capacitive load of the cell, and transition_time refers to the falling or rising transition time of the input pin, as appropriate.

LIBRARY <name> {
    PROCESS {    // fabrication process information
        ELECTRIC { <electrical data> }
        LAYOUT { <layout data> }
    }
    CELL <cell name> {
        INPUT <input list>
        OUTPUT <output list>
        LOGICAL { <logic information> }
        PHYSICAL {
            AREA <cell area>
            PIN_LOCATION {
                <pin name> <pin location>
                <pin name> <pin location>
                // other pin locations
            }
            // other info
        }
        DELAY {
            <in pin name> <capacitive load> {
                <out pin name> <phase> <max load> <list of 20 parameters>
                // other out pin delay info
            }
            // other pin delay info
        }
        APPLICATION_ONE { <application specific data> }
        APPLICATION_TWO { <application specific data> }
        // other application sections
    }
    CELL <cell name> { }
    // other cell info
}

Fig. 43. The PegLib format.

The choice of such a format has
been dictated by the need for an accurate delay model which includes the effect of the slope of the voltage signals in the calculation of the delay.

5.3.7. System Implementation

The PEGASUS system is being developed in an object-oriented programming (OOP) fashion using the C++ language and its Standard Template Library (STL) [STL]. The OOP structure of the system provides powerful code encapsulation and, at the same time, reduces the effort of programming, debugging, and adding new objects and algorithms. STL, in turn, provides general-purpose, templatized classes and functions that implement many popular and commonly used algorithms and data structures in a very efficient and user-friendly manner. PEGASUS makes use of STL as a standard utility and also closely follows its coding convention throughout the design of its data structure interfaces. New tools and algorithms (application programs) are organized in separate and independent packages. In such a modular system, end users can easily add new code or locate and modify existing code while working independently of each other.

At the system level, a command-line user interface and a graphical user interface are implemented using the Tcl/Tk [Ou94] scripting language. In particular, the graphical user interface, called WING, provides a graphical environment for visualizing and editing design and primitive libraries and cell schematic and physical views, as well as a friendly interface to the PEGASUS synthesis flow. Furthermore, WING facilitates debugging by letting the programmer graphically visualize and interact with the details of the data stored in the database.
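The pin-to-pin delay model of Section 5.3.6 can be made concrete with a small Python sketch. It evaluates one subset of four parameters for a given load and input transition time. The class name `DelayCoefficients`, the method name `evaluate`, and the numeric coefficients are hypothetical and chosen only for illustration; units are assumed to be mutually consistent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DelayCoefficients:
    """One subset of four parameters (k1..k4) of the pin-to-pin model."""
    k1: float
    k2: float
    k3: float
    k4: float

    def evaluate(self, load, transition_time):
        """delay = (k1 + k2*load) * transition_time + k3*load + k4"""
        return ((self.k1 + self.k2 * load) * transition_time
                + self.k3 * load + self.k4)

# Hypothetical coefficients for one timing arc (illustrative values only).
rise_propagation = DelayCoefficients(k1=0.1, k2=0.02, k3=0.5, k4=0.05)

# (0.1 + 0.02*2.0) * 0.3 + 0.5*2.0 + 0.05 = 0.042 + 1.0 + 0.05 = 1.092
delay = rise_propagation.evaluate(load=2.0, transition_time=0.3)
```

Under this reading, a complete input-to-output arc would hold four such objects, one each for the rising and falling propagation times and the rising and falling transition times. The term proportional to the input transition time is what lets the model account for signal slope.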
[Figure: screenshot of the WING console ("Wing: The Pegasus Windowing User Interface") echoing readLibrary commands that load design libraries from the circuit files C3540.v and C5315.v.]

Fig. 44. WING console.

5.4. Experimental Results

The initial version of PEGASUS, including the database and a set of application programs, has been implemented, and results have been generated for a set of benchmark circuits. Fig. 44, Fig. 45, and Fig. 46 show the main user interface windows of PEGASUS. The WING console (see Fig. 44) echoes every command invoked through the WING user interface. By using the library manager window (Fig. 45), the design and primitive libraries can be read by PEGASUS. Note that, as shown in the screen snapshot, multiple libraries can be open in the system at the same time. In that window, the constituent elements of every design or library can be seen by clicking on its name, e.g. C1908 in Fig. 45. The schematic and physical views of a selected design (see Fig. 47 and Fig. 48) can be viewed by pressing the corresponding buttons in the library manager window.

[Figure: screenshot of the Pegasus Library Manager window, listing the open gate and design libraries and the cell instances of design C1908.]

Fig. 45. Library manager.

Fig. 46 illustrates the synthesis flow window, which is used to control the flow of synthesis and optimization. As shown, after each step a report appears on the right side of the window. The synthesis and optimization tools in PEGASUS can be tuned for minimum area, minimum delay, or a combination of the two. Table 9 compares the post-layout area and delay of three benchmark circuits for the minimum area and minimum delay modes.

[Figure: screenshot of the Pegasus Synthesis Flow window, with panels for Synthesis Input Data, Technology Mapping, Fan-Out Optimization, Cell Placement Optimization, and Gate Sizing and Cell Displacement (SCD), each followed by a statistics report.]

Fig. 46. Synthesis flow.

Fig. 47. Schematic view editor.

Fig. 48. Physical view editor.

           Min-Area Mode                                Min-Delay Mode
Circuit    Area (1000 λ^2)  Delay (ns)  Runtime (s)    Area (1000 λ^2)  Delay (ns)  Runtime (s)
C1908      926              10.07       195            1523             9.24        252
C3540      1780             14.97       314            3262             13.57       845
C5315      14660            15.38       213            15010            12.38       331

Table 9: Min-Delay versus Min-Area design modes.

Conclusion

The rising demand for complex, high-throughput, low-power, and cost-effective silicon chips and systems has made the employment of ultra-deep submicron technologies vital to the growth of the semiconductor industry. However, the currently available CAD tools and methodologies, equipped with deep submicron libraries, often fail to reach design completion due to numerous timing and congestion problems.
The root of the problem lies in the fact that those methodologies and tools are transistor-centric and artificially separate the logic and layout synthesis steps. In this dissertation, the problem is approached from three different viewpoints: methodology, tools, and design environment.

A design methodology suitable for deep submicron technologies must be interconnect-centric. The separation of logic and layout synthesis is no longer justified, since the delay, area, and power consumption of interconnects are by no means negligible compared to those of transistors. The best solution is to employ a hybrid top-down, bottom-up methodology in which every design step works homogeneously in both the logic and layout domains. Such a methodology requires a unified logic and layout database, timing engine, and estimation tools.

The tools and algorithms for deep submicron design should consolidate the logic and layout synthesis and optimization tasks. This dissertation introduces the FANROUT, MERLIN, and *SiMPA algorithms, all of which perform simultaneous logic and layout synthesis. FANROUT and MERLIN perform simultaneous technology mapping and placement using the new data structures, Ca-Tree and *P-Tree, and a new dynamic programming grouping scheme called Local Order Perturbation. *SiMPA performs simultaneous floorplanning, technology mapping, and placement using the new data structure k-LCT. Specifically, FANROUT and MERLIN address the problem of distributing a signal among a set of sinks with different placement, load, and required time values. The proposed technique generates a set of non-inferior buffered routing structures which provide different trade-offs between the required time at the root and the total buffer area.
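The notion of a non-inferior solution set can be illustrated with a short Python sketch. Assuming each candidate structure is summarized by a pair (required time at the root, total buffer area), a candidate is kept only if no other candidate has both a required time at least as large and an area at least as small. The function name `non_inferior` and the sample values are hypothetical, and this two-dimensional pruning only illustrates the concept; it is not the FANROUT or MERLIN implementation.

```python
def non_inferior(solutions):
    """Return the non-dominated (required_time, area) pairs.

    A larger required time is better; a smaller area is better.
    The result is sorted by increasing area.
    """
    # Sort by area; among equal areas, try the largest required time first.
    ordered = sorted(set(solutions), key=lambda s: (s[1], -s[0]))
    kept = []
    best_required_time = float("-inf")
    for required_time, area in ordered:
        # Every solution already kept has area <= this one, so this
        # solution survives only if it strictly improves the required time.
        if required_time > best_required_time:
            kept.append((required_time, area))
            best_required_time = required_time
    return kept

# Hypothetical candidates: (required time in ns, buffer area in arbitrary units)
candidates = [(5.0, 10), (6.0, 12), (5.5, 12), (4.0, 11), (6.0, 15)]
curve = non_inferior(candidates)   # [(5.0, 10), (6.0, 12)]
```

Each surviving pair represents a different trade-off point, so a designer (or an enclosing optimization step) can pick the cheapest structure that still meets the timing requirement.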
MERLIN consists of an iterative optimization block, which uses a local neighborhood search strategy, and a dynamic-programming-based optimization engine, which generates all the non-inferior structures in the neighborhood of a given sink order. This optimization engine generates and propagates three-dimensional solution curves and employs a novel local order-perturbation method to cover an exponential-size solution space in polynomial time. Experimental results show significant improvement in delay with little area penalty compared to conventional buffer and routing tree generation techniques.

*SiMPA performs simultaneous floorplanning, technology mapping, and linear placement. *SiMPA is a dynamic-programming-based algorithm which first computes the shape function of every tree cluster using SiMPA-E and then utilizes a bottom-up floorplanner to generate higher-level shapes based on a k-way levelized cluster tree. After the minimum area solution is chosen and the critical paths are identified, the clusters on each critical path are re-floorplanned, re-mapped, and re-placed simultaneously to achieve improved timing.

This dissertation also introduces PEGASUS, a comprehensive system for deep submicron VLSI design and optimization that targets the technology-dependent part of the design flow. PEGASUS provides an integrated logical and physical database, utilities, and an environment that can be used for prototyping technology-dependent VLSI-CAD algorithms and flows. The PEGASUS system has been designed in an object-oriented and modular fashion and is easily extensible. It accepts, preserves, and manipulates the design hierarchy and internally uses this information to increase the efficiency of the software package.
References

[ADP75] M. Avriel, R. Dembo and U. Passy, Solution of generalized geometric programming, SIAM International Journal for Numerical Methods in Engineering, 1975, pp. 217-255.
[Be57] R. Bellman, Dynamic Programming, Princeton University Press, Princeton, 1957.
[BCD89] C. L. Berman, J. L. Carter and K. F. Day, The fanout problem: From theory to practice, in Proceedings of Advanced Research on VLSI, 1989, pp. 69-99.
[CHDS] The Chip Hierarchical Design System Technical Data Standard - http://www.si2.org/CHDStd/.
[CHKM96] J. Cong, L. He, C. Koh and P. Madden, Performance optimization of VLSI interconnect layout, Integration, the VLSI Journal, 1996, pp. 1-94.
[CHP99] W. Chen, C-T. Hsieh and M. Pedram, Gate sizing with controlled displacement, in Proceedings of International Symposium on Physical Design, 1999, pp. 127-132.
[CLB94] J. Cong, Z. Li and R. Bagrodia, Acyclic multi-way partitioning of boolean networks, in Proceedings of Design Automation Conference, 1994, pp. 670-675.
[CLZ93] J. Cong, K. Leung and D. Zhou, Performance-driven interconnect design based on distributed RC delay model, in Proceedings of Design Automation Conference, 1993, pp. 606-611.
[CP92] K. Chaudhary and M. Pedram, A near-optimal algorithm for technology mapping minimizing area under delay constraints, in Proceedings of Design Automation Conference, 1992, pp. 492-498.
[CPPZ98] P. Cocchini, M. Pedram, G. Piccinini and M. Zamboni, Fanout Optimization under a Submicron Transistor-Level Delay Model, in Proceedings of International Conference on Computer-Aided Design, 1998, pp. 551-556.
[El48] W. C. Elmore, The transient response of damped linear network with particular regard to wideband amplifiers, in Journal of Applied Physics, 1948, pp. 55-63.
[Gi90] L.P.P.P. van Ginneken, Buffer placement in distributed RC-tree networks for minimal Elmore delay, in Proceedings of International Symposium on Circuits and Systems, 1990, pp. 865-868.
[GJ79] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, San Francisco, 1979.
[Go76] M. C. Golumbic, Combinatorial merging, IEEE Transactions on Computers, 1976, pp. 1164-1167.
[Gr92] L. K. Grover, Local search and the local structure of NP-complete problems, in Operations Research Letters, 1992, pp. 235-243.
[Ha66] M. Hanan, On Steiner's problem with rectilinear distance, SIAM Journal of Applied Mathematics, 1966, pp. 255-265.
[HS96] G. D. Hachtel and F. Somenzi, Logic Synthesis and Verification Algorithms, Kluwer Academic Publishers, Norwell, 1996.
[IPFC93] S. Iman, M. Pedram, C. Fabian and J. Cong, Finding uni-directional cuts based on physical partitioning and logic restructuring, in Proceedings of Physical Design Workshop, 1993, pp. 187-198.
[Ke87] K. Keutzer, DAGON: Technology mapping and local optimization, in Proceedings of Design Automation Conference, 1987, pp. 341-347.
[KXY96] K. O. Kortanek, X. Xu and Y. Ye, An infeasible interior-point algorithm for solving primal and dual geometric programs, in Journal of Mathematical Programming, 1996, pp. 112-120.
[LCL96] J. Lillis, C. K. Cheng and T. Y. Lin, Simultaneous routing and buffer insertion for high performance interconnect, in Proceedings of Physical Design Workshop, 1996, pp. 7-12.
[LCLH96] J. Lillis, C. K. Cheng, T. Y. Lin and C. Ho, New performance driven routing techniques with explicit area/delay tradeoff and simultaneous wire sizing, in Proceedings of Design Automation Conference, 1996, pp. 395-400.
[Le82] T. Lengauer, Upper and lower bounds on the complexity of the min-cut linear arrangement problem on trees, in SIAM Journal on Algebraic and Discrete Methods, 3(1982), pp. 99-113.
[LSP97] J. Lou, A. H. Salek and M. Pedram, An exact solution to simultaneous technology mapping and linear placement problem, in Proceedings of International Conference on Computer-Aided Design, 1997, pp. 671-675.
[LSP98] J. Lou, A. H. Salek and M. Pedram, An integrated flow for technology remapping and placement of sub-half-micron circuits, in Proceedings of Asia and South Pacific Design Automation Conference, 1998, pp. 295-300.
[MCS] http://www.mcs.surrey.ac.uk/Personal/R.Knott/Fibonacci/fibFormula.html.
[MFNK96] H. Murata, K. Fujiyoshi, S. Nakatake and Y. Kajitani, VLSI module placement based on rectangle packing by the sequence-pair, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 15(12), 1996, pp. 1518-1524.
[NFMK96] S. Nakatake, K. Fujiyoshi, H. Murata and Y. Kajitani, Module placement on BSG-structure and IC layout applications, in Proceedings of International Conference on Computer-Aided Design, 1996, pp. 484-491.
[NTRS97] http://notes.sematech.org/ntrs/Rdmpmem.nsf.
[OC96a] T. Okamoto and J. Cong, Buffered Steiner tree construction with wire sizing for interconnect layout optimization, in Proceedings of International Conference on Computer-Aided Design, 1996, pp. 44-49.
[OC96b] T. Okamoto and J. Cong, Interconnect layout optimization by simultaneous Steiner tree construction and buffer insertion, in Proceedings of Physical Design Workshop, 1996, pp. 1-6.
[Ou94] John K. Ousterhout, Tcl and the Tk Toolkit, Addison-Wesley Publishing, Boston, 1994.
[Pe98] M. Pedram, Logical-physical co-design for deep submicron circuits: challenges and solutions, in Proceedings of Asia and South Pacific Design Automation Conference, 1998, pp. 137-142.
[PK92] M. Pedram and E. Kuh, Bear-FP: A robust framework for floorplanning, in International Journal of High Speed Electronics, 3(1), 1992, pp. 137-170.
[Ru89] R. Rudell, Logic synthesis for VLSI design, Memorandum UCB/ERL M89/49, Ph.D. Dissertation, University of California at Berkeley, 1989.
[Sc98] H. Schildt, C++: The Complete Reference, Third Edition, Osborne McGraw-Hill, Berkeley, 1998.
[Se88] C. Sechen, VLSI Placement and Global Routing using Simulated Annealing, Kluwer Academic Publishers, Norwell, 1988.
[Sh93] N. Sherwani, Algorithms for VLSI Physical Design Automation, Kluwer Academic Publishers, Norwell, 1993.
[SK98] D. Sylvester and K. Keutzer, Getting to the bottom of deep submicron, in Proceedings of International Conference on Computer Aided Design, 1998, pp. 203-211.
[SLP98] A. H. Salek, J. Lou and M. Pedram, A simultaneous routing tree construction and fanout optimization algorithm, in Proceedings of International Conference on Computer-Aided Design, 1998, pp. 625-630.
[SLP99a] A. H. Salek, J. Lou and M. Pedram, MERLIN: Semi-order-independent hierarchical buffered routing tree generation using local neighborhood search, in Proceedings of Design Automation Conference, 1999, pp. 472-478.
[SLP99b] A. H. Salek, J. Lou and M. Pedram, An integrated logical and physical design flow for deep submicron circuits, in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(9), 1999, pp. 1305-1315.
[SS90] K. J. Singh and A. Sangiovanni-Vincentelli, A heuristic algorithm for the fanout problem, in Proceedings of Design Automation Conference, 1990, pp. 357-360.
[SSLM92] E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P. R. Stephan, R. K. Brayton and A. Sangiovanni-Vincentelli, SIS: A system for sequential circuit synthesis, Memorandum No. UCB/ERL M92/41, Electronics Research Laboratory, College of Engineering, University of California, Berkeley, CA 94720, May 1992.
[St83] L. Stockmeyer, Optimal orientation of cells in slicing floorplan designs, Journal of Information and Control, 57(1983), pp. 91-101.
[STL] The Standard Template Library - http://www.sgi.com/Technology/STL/index.html.
[TMBW90] H. J. Touati, C. W. Moon, R. K. Brayton and A. Wang, Performance oriented technology mapping, in Proceedings of Advanced Research on VLSI, 1990, pp. 79-97.
[To90] H. Touati, Performance-oriented technology mapping, Ph.D. thesis, University of California, Berkeley, Technical Report UCB/ERL M90/109, November 1990.
[VP93] H. Vaishnav and M. Pedram, Routability-driven fanout optimization, in Proceedings of Design Automation Conference, 1993, pp. 230-235.
[WM89] W. S. Wong and R. J. T. Morris, A new approach to choosing initial points in local search, in Information Processing Letters, 30(1989), pp. 67-72.
[Ya85] M. Yannakakis, A polynomial algorithm for the min-cut linear arrangement of trees, in Journal of the Association for Computing Machinery, 32(4), 1985, pp. 950-988.
[Ya90] M. Yannakakis, The analysis of local search problems and their heuristics, in Proceedings of Annual Symposium on Theoretical Aspects of Computer Science, 1990, pp. 298-311.
Asset Metadata
Creator
Salek, Amir H.
(author)
Core Title
Consolidated logic and layout synthesis for interconnect -centric VLSI design
School
Graduate School
Degree
Doctor of Philosophy
Degree Program
Computer Engineering
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
engineering, electronics and electrical,OAI-PMH Harvest
Language
English
Contributor
Digitized by ProQuest
(provenance)
Advisor
Pedram, Massoud (committee chair), Beerel, Peter (committee member), Breuer, Melvin A. (committee member), Estrin, Deborah (committee member), Gaudiot, Jean-Luc (committee member)
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c16-91916
Unique identifier
UC11327008
Identifier
3018123.pdf (filename),usctheses-c16-91916 (legacy record id)
Legacy Identifier
3018123.pdf
Dmrecord
91916
Document Type
Dissertation
Rights
Salek, Amir H.
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA