High Level Design for Yield via Redundancy in Low Yield Environments

by

Yue Gao

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2014

Copyright 2014 Yue Gao

DEDICATION

To my wife Vanessa, my mother Rose and my father Tom, for their unconditional love and support.

ACKNOWLEDGEMENTS

My academic journey as a Ph.D. student at the University of Southern California is one of the most important experiences of my life. I cannot imagine how it would have been possible for me to complete this journey without the people to whom I am deeply grateful.

First and foremost, I would like to thank my advisor, Professor Melvin Breuer, whom I met when I took his class as a Master's student and was immediately captivated by his lectures. Professor Breuer's unfathomable wisdom in the field of electrical engineering kindled my interest in research, revealed the long path of learning ahead of me, motivated me, and will continue to motivate me in my future endeavors. Through his patient guidance he has shaped me into the scholar I am today, and words fail me as I try to express my gratitude. So I will simply say: thank you, Professor, for being my mentor and my friend.

Next I would like to thank my co-advisor, Professor Sandeep Gupta, who has continued to enlighten me whenever I am lost in my journey. You could always pinpoint the drawbacks in my methodologies and reshape my perspective when I faced seemingly impenetrable roadblocks in my research. I would also like to thank my committee members, Professor Peter Beerel, Professor Aiichiro Nakano and Professor Paul Bogdan, as well as my colleagues Shuo Wang, Yang Zhang and Yanzhi Wang, who worked with me on several of the projects.

Last but certainly not least, I want to thank my family. My wife Vanessa, who at this time of writing is bearing our first child: your love and support are my air and my light. Having you and our future child in my life is my greatest blessing, and with you, I am infinitely strong. My mother Rose and my father Tom: I could not be more proud to carry your blood. You have been my pillars, my sun and my sky through every step of my life; I love you. To my uncle Leon and aunt Mary: you took such good care of me during my time away from my parents, thank you so much.

Table of Contents

Abstract
Chapter 1. Introduction
  1.1 Background Information
  1.2 Challenges in Efficient Yield Enhancement in Random Logic
  1.4 Thesis Outline
Chapter 2. Related Work - Yield Enhancement
  2.1 Core Level Redundancy
  2.2 Module Level Redundancy
  2.3 Circuit Level Redundancy
  2.4 Microarchitecture Level Yield Improvement
Chapter 3. Related Work - Floorplanning and Placement
  3.1 Standard Cell Placement
  3.2 Block Floorplanning
  3.3 Mixed Block "Floorplacement"
Chapter 4. Reduced Redundancy Insertion
  4.1 Introduction
  4.2 Problem Definition
  4.3 Maximization of E(P)/A
  4.4 Experimental Results
  4.5 Conclusions
Chapter 5. Hybrid Shared Redundancy Insertion
  5.1 Introduction
  5.2 Problem Definition
  5.3 Model Verification
  5.4 Maximization of E(P)/A
  5.5 Experimental Results
  5.6 Conclusions
Chapter 6. Unified Revenue and Yield Aware Design Optimization
  6.1 Introduction
  6.2 System Model
  6.3 Motivational Case Study
  6.4 Optimization Algorithm
  6.5 Experimental Results
  6.6 Conclusions
Chapter 7. Global Yield and Floorplan Aware Design Optimization
  7.1 Introduction
  7.2 System Model
  7.3 Motivational Case Study
  7.4 Theorems
  7.5 Experimental Results
  7.6 Conclusions
Conclusions and Future Research Directions
References

List of Figures

Figure 1. Example of yield
Figure 2. Redundancy insertion example
Figure 3. Example wafer maps with marketable and malfunctioning (i.e. defective) dice
Figure 4. Yield learning
Figure 5. Floorplan example on estimating the area overheads of yield enhancement
Figure 6. Core level spare core sharing
Figure 7. Redundant transistor illustrative example
Figure 8. Floorplanning/Placement classes: (a) standard cell; (b) block; (c) mixed block
Figure 9. Conventional and Reduced Redundancy design flow
Figure 10. A 2-module abstract microprocessor
Figure 11. Three design alternatives for M_1
Figure 12. E(P)/A values with respect to the area of the spare A'_1
Figure 13. The pipeline hardware model for measuring F_x(A'_x)
Figure 14. Maximizing E(P)/A for systems with the additive performance model
Figure 15. The increment of E(P)/A values for RRI with respect to the original yield
Figure 16. Design exploration under the pipelined performance model
Figure 17. Case study for hybrid redundancy
Figure 18. Percent improvement of the Version 3 design in terms of E(P)/A
Figure 19. Hybrid redundancy
Figure 20. The evaluation of worst-case performance
Figure 21. Steering logic complexity
Figure 22. The impact of steering logic complexities on stage delay
Figure 23. Steering logic complexity illustration
Figure 24. Noise attenuation concept
Figure 25. Improvement percentage of the HSR design for k=2, t=5
Figure 26. Traditional yield aware design flow and the design flow of URaY
Figure 27. Five stage pipeline example
Figure 28. Motivational case study
Figure 29. High level algorithm flow of URaY
Figure 30. Admissibility adjustment review
Figure 31. The basic concept behind quadrant bisection
Figure 32. Example system
Figure 33. Placing modules into Quadrant 1
Figure 34. Placing modules into Quadrant 2 and Quadrant 3
Figure 35. Seed generation example: Initial state
Figure 36. Seed generation example: Optimization
Figure 37. Graphical description of GROWTH
Figure 38. The calculation of transitive slack
Figure 39. The maximum number of spares allowed
Figure 40. Prelude to determining the intra-cluster module arrangements
Figure 41. Determining the intra-cluster module arrangements via exhaustive search
Figure 42. Steering/testing logic insertion
Figure 43. Creating the cluster for the baseline designs
Figure 44. Histogram of spare configurations for experiments 15.1 to 15.10
Figure 45. A quad-core microprocessor with on-die GPU and shared L3 cache
Figure 46. High level view of GlYFF
Figure 47. Motivational case study for GlYFF
Figure 48. System model for Theorem 3 and Theorem 4
Figure 49. Example yield-area function plot for Theorem 3 and Theorem 4
Figure 50. Prelude to theorems: yield-area function plot (Case 1)
Figure 51. Prelude to theorems: Y/A plot (Case 1)
Figure 52. Prelude to theorems: yield-area function plot (Case 2)
Figure 53. Prelude to theorems: Y/A plot (Case 2)
Figure 54. Utilization of Theorem 3 and Theorem 4
Figure 55. Explanation for Theorem 6
Figure 56. Y/A plot for the proof of Theorem 6
Figure 57. Baseline design floorplan
Figure 58. GlYFF design floorplan
Figure 59. Baseline design floorplan
Figure 60. GlYFF design floorplan
Figure 61. Utilizing non-redundancy based DFY techniques in the current frameworks
Figure 62. Utilizing redundancy for DFY and cross-layer soft error resiliency

List of Tables

Table I. Performance for different configurations of M_1 and M'_1
Table II. Area, yield and performance parameters
Table III. Experimental results for a 5-module system
Table IV. Effects of performance characteristics on RRI E(P)/A
Table V. Experimental results for a 7-module system
Table VI. Experimental results for a 10-module system
Table VII. Performance ratings for different module status combinations
Table VIII. E(P)/A comparisons and improvement of Version 3 design
Table IX. Steering logic complexity
Table X. System performance degradation
Table XI. Summary of the design before any redundancy insertion
Table XII. Summary of the design after internal redundancy insertion (n = 3)
Table XIII. Summary of the design after external redundancy insertion (n = 1, m = 4)
Table XIV. Summary of the design after hybrid redundancy insertion (n = 2, m = 2)
Table XV. System configuration for k=2, t=5
Table XVI. Percent improvement of the HSR design for k=4, t=10
Table XVII. Analysis of the three example designs
Table XVIII. Results showing the area, HPWL and revenue
Table XIX. Die yield and revenue details for experiment 20.7
Table XX. Baseline design details for experiment 20.7
Table XXI. URaY design details for experiment 20.7
Table XXII. Baseline design details for experiment 15.6
Table XXIII. URaY design details for experiment 15.6
Table XXIV. Case study design configuration
Table XXV. Prelude to theorems: area, yield and Y/A numbers (Case 1)
Table XXVI. Prelude to theorems: area, yield and Y/A numbers (Case 2)
Table XXVII. Y_limit(N, k) values
Table XXVIII. Redundancy configuration for the SoC
Table XXIX. Design evaluations
Table XXX. Redundancy configuration for the SoC
Table XXXI. Redundancy configuration for each CPU core
Table XXXII. Design evaluations

Abstract

For modern deep nano-scale integrated circuit manufacturers, constructing large and complex high performance systems with acceptable yield is a major concern. Future technologies are forecasted to experience extremely low yields due to process variations, noise and high defect densities [1][2]. To tackle the low yield issues in emerging technologies, researchers have advocated the notion of "Design for Yield (DFY)": addressing yield as a first order design objective in the early design cycle, along with area, performance and power.
DFY, a new member of the DFx family, can be interpreted as an extension of "Design for Manufacturing (DFM)", which addresses yield issues in the post-design stage with ad-hoc techniques such as redundant via insertion. This shift in design focus to incorporate yield concerns calls for a re-evaluation of the existing design flow. Tradeoffs between critical design metrics must be leveraged carefully in the early design stage, and done efficiently to facilitate a fast design convergence process. This is especially important in today's competitive electronics market, where, to maintain profitability, designs must be initiated, finalized, manufactured, tested and shipped in a short window of time.

This dissertation introduces several computer aided design (CAD) DFY frameworks that aim to enhance a system's yield during the design stage, while co-optimizing yield, (die) area and (system) performance to maximize the overall revenue of the IC manufacturer. Our frameworks include new design flows, system models, optimization algorithms and experimental results to assist in the transition towards a DFY design flow for future technologies plagued with low yields.

Chapter 1. Introduction

For modern deep nano-scale integrated circuit manufacturers, constructing large and complex high performance systems with acceptable yield is a major concern. Future technologies are forecasted to experience extremely low yields due to process variations, noise and high defect densities [1][2]. To tackle the low yield issues in emerging technologies, researchers have advocated the notion of "Design for Yield (DFY)": addressing yield as a first order design objective in the early design cycle, along with area, performance and power. DFY, a new member of the DFx family, can be interpreted as an extension of "Design for Manufacturing (DFM)", which addresses yield issues in the post-design stage with ad-hoc techniques such as redundant via insertion.

DFY methodologies have already been practiced for memory structures: spare row/column based hardware repair approaches and error-correcting codes (ECC) [4] have seen great commercial success. On the other hand, DFY for random logic (e.g. microprocessor cores), which is the main focus of this work, remains a difficult challenge. Fortunately, new advancements in system architecture and logic design have opened new windows for yield enhancement, including those inspired by classical explicit redundancy insertion and the more recent microarchitectural techniques that rely on graceful degradation.

This shift in design focus to incorporate yield concerns calls for a re-evaluation of the existing design flow. Tradeoffs between critical design metrics must be leveraged carefully in the early design stage, and done efficiently to facilitate a fast design convergence process. This is especially important in today's competitive electronics market, where, to maintain profitability, designs must be initiated, finalized, manufactured, tested and shipped in a short window of time.

This dissertation introduces several computer aided design (CAD) DFY frameworks that aim to enhance a system's yield during the design stage, while co-optimizing yield, (die) area and (system) performance to maximize the overall revenue of the IC manufacturer. Our frameworks include new design flows, system models, optimization algorithms and experimental results to assist in the transition towards a DFY design flow for future technologies plagued with low yields.
1.1 Background Information

Here we formally define the term "yield" (Y) and its extension, "yield per area" (Y/A), which will serve as the most important design evaluation metrics throughout this work.

Yield (Y)

For semiconductor circuits, once the design is completed and the lab-developed hardware prototypes are verified, manufacturing moves to production plants or foundries for large volume fabrication. In this process, the manufactured ICs are vulnerable to a variety of physical and electrical phenomena such as random defects, process variations and noise. As a result, some dice within a wafer may deviate from their expected physical attributes (e.g. open wires, shorted wires, stuck-open transistors, etc.). These malfunctioning dice (also referred to as defective dice) exhibit erroneous logical behavior or deteriorated operating points, such as decreased clock frequency or increased power consumption. Defective dice must be identified and discarded to avoid customer dissatisfaction and returns. To achieve this, every die must undergo a series of manufacturing tests to determine whether it has survived the manufacturing process and still functions as intended. Manufacturing yield is then defined as the fraction of "good" or "marketable" dice sampled from a large quantity of manufactured dice. For commercial and competitive reasons, manufacturers are typically secretive about their yields, but some news pieces suggest that for some ICs yield can be as low as 30%. As technology continues its aggressive scaling, yield problems can only become worse [1].

In the example shown in Figure 1, out of two sampled silicon wafers, a total of 48 dice were produced (24 dice per wafer), of which only 19 are marketable, the rest being defective. Hence the yield is approximately 40% (calculated as 19/48). Obviously, for a fixed design, the die area of the finalized design (after layout and floorplanning) is fixed, in which case a higher yield directly amounts to a larger number of marketable dice per wafer and greater revenue. Improving yield during fabrication is a primary concern for semiconductor manufacturers and foundries, and can be achieved by perfecting the manufacturing process (e.g. better equipment, more precise control of doping, humidity, etc.). However, this form of yield improvement is not the responsibility of designers.

Figure 1. Example of yield

By definition, yield values can only be observed after volume production. However, to combat low yields in future technologies and systems, researchers argue that yield issues need to be addressed in the design stage, prior to manufacturing. This yield aware design methodology aims to anticipate yield issues at the design stage and to design systems that can achieve high yields during fabrication. Yield aware design is orthogonal to the post-fabrication yield ramp-up process (e.g. adjusting manufacturing conditions during initial chip production).

To enhance yield during the design stage, prior to fabrication, yield needs to be estimated based on the system information and the manufacturing environment. Traditionally, pre-fabrication yield values are calculated analytically using yield models. A yield model defines the relationship between yield and other parameters regarding the design and the fabrication process, equipment and environment.
The most widely used yield model, the Poisson yield model, formulates the yield of a die (Y) in terms of its critical area (A_critical), which reflects the design's susceptibility to defects, and the defect density (d) [3]. The defect density depends on the technology and manufacturing maturity; a more mature manufacturing process has a lower defect density. The Poisson yield model is written as:

Y = e^(-A_critical · d)

The Poisson yield model is defect-oriented and has the memoryless property of the Poisson distribution, which significantly simplifies the calculation of yield and is favored by many researchers. Other yield models include the binomial yield model [3], which considers defect clustering.

If the yield model is sufficiently accurate, maximizing this projected yield during design optimization is equivalent to maximizing the actual manufacturing yield (i.e. observed yield). On the contrary, if the yield model is inaccurate, the manufacturing yield may differ greatly from the values projected by the yield model during the design stage. Consequently, optimizations based on an imprecise yield model may produce promising results during design exploration, but underwhelming or even negative results after fabrication. In our frameworks, we generally do not wish to be constrained by yield model accuracy or formulation, which is a standalone topic in itself. Instead, we simply take yield models as inputs, assume they are accurate enough, and ensure that our optimization algorithms are agnostic to the details of the yield model. Therefore, as far as we are concerned, maximizing the yield values derived from the yield model is equivalent to maximizing the post-fabrication observed yield, which directly impacts the number of marketable dice obtained from each wafer.
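As a concrete sketch (ours, not from the dissertation), the model can be evaluated in a few lines; the function name and the numeric inputs are illustrative assumptions only:

```python
import math

def poisson_yield(critical_area: float, defect_density: float) -> float:
    # Poisson yield model: Y = exp(-A_critical * d).
    return math.exp(-critical_area * defect_density)

# Hypothetical numbers: a die whose critical area is 1.2 cm^2, fabricated
# in a process with a defect density of 1.0 defects/cm^2.
print(poisson_yield(1.2, 1.0))   # ~0.30, i.e. roughly the 30% figure above
```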
Yield improvement for memory and cache structures in microprocessors has achieved academic success as well as industrial maturity through the use of spare rows/columns, error correcting codes [4] and selective disabling [6][7][22]. This is mostly due to the structural uniformity of memory. Unfortunately, yield improvement in random logic (e.g. microprocessor cores) remains a challenge. In our work we focus mainly on improving yield for random logic, and assume memory and cache structures are well protected by existing techniques.

Although post-design yield enhancement techniques that apply to the layout (e.g. redundant via insertion and lithography correction) or to the manufacturing process (e.g. tuning doping density) have been industrialized for quite some time, these ad-hoc approaches can only enhance yield without altering the design fundamentals (e.g. the number of spare modules, routing and floorplan). While perhaps adequate for current manufacturing technologies, where yield is still generally acceptable, they will no longer be sufficient if we are to continue technology scaling into near "end of Moore" eras.

Yield per Area (Y/A)

Aside from perfecting the manufacturing process, one can achieve high yield by using techniques such as triple modular redundancy (TMR) [5], but these techniques often have a negative effect on other metrics of concern, such as system performance and die area. Thus, when attempting to improve yield at the design stage, one must have an in-depth understanding of the benefits and costs of doing so. For example, explicit hardware redundancy insertion (a review of various forms of hardware redundancy is presented in Chapter 2) is the most well-known yield improvement technique for random logic. Evidently, hardware redundancy incurs high area overheads: TMR, for example, has at least a 200% area overhead due to the triplication of the entire system. For a redundancy augmented design, due to the increase in die size, each wafer will produce fewer dice than for a bare design without any redundancy, but each die will be able to tolerate defects to a certain degree. The increase in die area must be carefully weighed against the increase in yield. We illustrate this with the following example.

A module is a collection of logic gates, usually serving a particular function (e.g. an ALU or a decoder), together with its input and output terminals. Modules usually differ in functionality, individual yield and area. In some cases a system may have multiple identical modules, such as an out-of-order pipeline with multiple ALUs. Consider the module/block level schematic of a small-scale system as illustrated at the top of Figure 2. The total die area of this four-module system can be seen as the green square below the module/block level schematic in Figure 2. This is the same die featured in Figure 1, from which we can see that this particular die area resulted in 24 dice per wafer (a total of 48 dice for two wafers).

To improve the yield of this design in high defect density environments, we add appropriate redundancies to each of the modules, along with the necessary steering logic (forks, joins and switches) and DFT logic (scan flip-flops, not shown in Figure 2), and arrive at the design illustrated in the bottom half of Figure 2. This is an example of "module level redundancy" (redundancy instantiated as spare modules), reviewed in detail in Section 2.2. Each module and all of its spares are referred to as a "stage". After this improved design is manufactured, only one defect free module per stage needs to be activated (activated modules are indicated by solid black outlines), while the remaining, possibly defective, modules are deactivated (indicated by dashed gray outlines). The electrical interconnect between the activated modules is realized using fork and join circuits, usually referred to as steering logic. The activation of good modules and deactivation of defective modules is accomplished via clock gating, power gating or laser wire cuts. For now we ignore the hardware failures that may occur in the steering logic. The "good" and "bad" modules are identified through post-fabrication testing on ATE or socket boards, using scan-paths, BIST and software tools including ATPG.

Since redundancy is adopted, the raw yield value of each die (the probability that a particular die is not defective, i.e. is marketable) will increase, which matches intuition. However, the increase in yield must be weighed against the increase in die area, which occurs mainly for three reasons: (1) the inclusion of the additional spare modules, (2) the inclusion of the additional steering logic, and (3) the additional wiring used to interconnect the steering logic with the spare modules. This increase in die size is explicitly illustrated at the bottom of Figure 2.
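The die yield of such a staged design can be sketched as follows. This is our illustrative reconstruction, not the dissertation's code: it assumes the Poisson model per module, independent defects, and fault-free steering logic (the same simplification made above). The module areas and spare counts mirror the Figure 2 example; the defect density is a made-up number.

```python
import math

def module_yield(area: float, d: float) -> float:
    # Poisson model from Section 1.1, treating the whole module area as critical.
    return math.exp(-area * d)

def die_yield(stages, d: float) -> float:
    # stages: list of (module_area, num_spares). A stage survives if at least
    # one of its (1 + num_spares) copies is defect free; the die survives if
    # every stage does. Steering logic failures are ignored, as in the text.
    y = 1.0
    for area, spares in stages:
        ym = module_yield(area, d)
        y *= 1.0 - (1.0 - ym) ** (spares + 1)
    return y

# The four modules of Figure 2 (areas 3, 1, 2, 2); the redundant version adds
# 2, 0, 2 and 1 spares respectively. d = 0.3 defects/unit area is hypothetical.
print(die_yield([(3, 0), (1, 0), (2, 0), (2, 0)], d=0.3))   # bare design, ~0.09
print(die_yield([(3, 2), (1, 0), (2, 2), (2, 1)], d=0.3))   # with spares, ~0.42
```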
Even with seemingly very high area overheads, using algorithms that maximize Y/A [18][19][20], the ultimate financial gains in terms of marketable chips per wafer are considerable, as discussed in the example below. In addition, redundancy insertion is mostly independent of architecture, so it can be applied universally to any module block as long as the issues regarding routing, steering logic and testing logic are resolved.

Figure 2. Redundancy insertion example

For the sake of comparison, assume the exact same manufacturing process of Figure 1 was repeated for the new redundancy augmented design, producing identical defect spots on each of the wafers. The resulting wafer map is illustrated in Figure 3. Although the two wafers now produce only a total of 24 dice, as opposed to 48 dice in Figure 1, up to 21 dice are marketable because of the increase in die yield, as opposed to 19 marketable dice in Figure 1. Since total revenue is directly related to the average fraction of "good" or "marketable" dice identified on a wafer, this indicates a roughly 10% increase in potential revenue if all chips are sold. If we examine the 21 marketable dice, modules hit by manufacturing defects will still malfunction, but they can be deactivated and virtually replaced by spare modules. Replacements are made possible through the configuration of the steering logic, as any path that includes a defective module is deactivated. The "bad" dice are those in which every module of some stage is hit by defects. For chip manufacturers, the cost of a fixed number of silicon wafers is fixed, which means that if design and testing costs are ignored, maximizing revenue is equivalent to maximizing profit.

Figure 3. Example wafer maps with marketable and malfunctioning (i.e. defective) dice

Next we perform some simple calculations to introduce the concept of "yield per area". Assume A is the area of each die and Y is its die yield, calculated from yield models. Since the wafer area is a constant (A_w), A_w/A is the approximate number of dice produced from each wafer (we ignore the wafer edges, which cannot accommodate an entire die), and Y·A_w/A is the expected number of marketable dice produced from a wafer. Stripping away the constant A_w, we can then extend the traditional concept of yield (Y) to yield per area (Y/A), where the "area" is the area of each die (A). In this research we are interested in yield enhancement techniques that modify the original design; hence we focus on Y/A instead of conventional yield. The concept of yield per area is not particularly new; it has been used for evaluating yield enhancement approaches for SRAM memories [76], and was later extended to random logic [18][19][20].
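A small calculation along these lines (ours; the unit areas are chosen for illustration) reproduces the comparison between Figures 1 and 3:

```python
def marketable_dice_per_wafer(die_yield: float, die_area: float,
                              wafer_area: float) -> float:
    # Expected number of good dice per wafer: Y * (A_w / A),
    # ignoring wafer-edge losses.
    return die_yield * wafer_area / die_area

# Numbers taken from the Figure 1 / Figure 3 example: per wafer, the bare
# design gives 24 dice at Y = 19/48, while the redundancy augmented design
# (roughly double the die area) gives 12 dice at Y = 21/24. Let the wafer
# area be 24 units and the bare die area 1 unit.
print(marketable_dice_per_wafer(19 / 48, die_area=1.0, wafer_area=24.0))  # 9.5
print(marketable_dice_per_wafer(21 / 24, die_area=2.0, wafer_area=24.0))  # 10.5
# 10.5 vs 9.5 expected good dice per wafer: the ~10% revenue gain above.
```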
Yield Learning

During the early periods of fabrication, yield is typically very low due to an immature manufacturing process, especially when a new technology node is deployed. During this phase, production chips are typically used only non-commercially, for internal silicon debugging. As time progresses, yield gradually ramps up, driven by three elements [25]: (1) continuous process improvement, (2) the detection at probing test of unexpected yield drops caused by process abnormalities, and (3) the identification within the fabrication line of temporary yield drops due to sudden changes in process conditions. This yield ramp-up process is called "yield learning". The yield of a new product, when plotted over time, follows a trajectory called the "yield learning curve". Mass production and commercialization begin when yield reaches a certain target value, as illustrated in Figure 4a.

Figure 4. Yield learning: (a) yield (Y) versus time, showing the yield target and the start of mass production; (b) yield per area (Y/A) versus time for the original design and an improved design with redundancy, showing the Y/A target and the earlier start of mass production

Suppose an alternative, redundancy augmented design is manufactured on the same fabrication line within the same time window. Because the design and its die area have changed, comparisons between the original and the redundancy augmented design must be made on Y/A values, not just raw yield values. Figure 4a can be translated into a Y/A plot by scaling the entire vertical axis by the die area of the original design (A), and the yield target translates into a Y/A target. The improved, redundancy augmented design maximizes Y/A, which results in the Y/A curve indicated by dashed lines in Figure 4b. Since the Y/A target can be met sooner, mass production may begin earlier. In today's competitive electronics market, early product rollout is often crucial to capturing new markets and maximizing profits [25][26][27]. Therefore, maximizing Y/A has compound benefits: it not only maximizes the total number of marketable dice from each silicon wafer of fixed size, significantly outperforming the bare design for low yield technologies, but also shortens time-to-market (TTM).
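The TTM effect can be sketched numerically. The learning-curve shapes below are purely hypothetical (the dissertation does not specify any); the point is only that a design with higher die yield can cross the same Y/A target earlier despite its larger area:

```python
import math

def ya_crossing_time(yield_curve, die_area, ya_target, horizon=60):
    # First month at which Y(t)/A meets the Y/A target (None if never).
    for t in range(horizon):
        if yield_curve(t) / die_area >= ya_target:
            return t
    return None

# Hypothetical learning curves of the form 1 - exp(-t/tau): the redundancy
# augmented design doubles the die area but tolerates defects, so its die
# yield is assumed to ramp much faster.
bare_curve = lambda t: 1.0 - math.exp(-t / 18.0)
aug_curve = lambda t: 1.0 - math.exp(-t / 4.0)

ya_target = 0.40   # Y/A of the bare design (area 1) at 40% yield
print(ya_crossing_time(bare_curve, die_area=1.0, ya_target=ya_target))  # month 10
print(ya_crossing_time(aug_curve, die_area=2.0, ya_target=ya_target))   # month 7
```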
1.2 Challenges in Efficient Yield Enhancement in Random Logic

Considering yield enhancement during the design stage introduces new design paradigms with new tradeoffs, which designers must carefully analyze, model and evaluate so that the improvement in yield outweighs the associated overheads. We have already demonstrated that an increase in yield often comes at the cost of increased die area. Generally speaking, the overheads of yield enhancement include:

- Area overhead
- Performance overhead
- Design time overhead
- Testing cost overhead

Without carefully evaluating these overheads, the practicality of theoretical yield enhancement approaches is undermined. Unfortunately, these overheads are usually difficult to formulate or quantify, especially in the early stages of the design cycle. In this work we do not consider design time and testing cost (mostly reflected in testing time) overheads, as they can only be extracted from an extensive amount of sensitive industrial data. We believe that our cost metrics are flexible enough that design time and testing cost overheads can easily be included if the data is available.

In this subsection, as an illustrative example and as motivation for Chapter 6, we focus on evaluating the area overhead incurred by the yield enhancement process, namely: how should the die area (A) in Y/A be calculated before and after applying yield enhancement techniques? The total die area is determined by the smallest bounding box that contains all the logic components and wiring at the layout stage. Technically, this die area can only be known towards the end of the design cycle, when layout and floorplanning are completed. A floorplan is an arrangement of the logic components of the system in terms of rotation and geographical location. A review of floorplanning techniques is given in Chapter 3. Figure 5 shows example floorplans for the designs displayed in Figure 2. The modules in each design are enclosed in a blue rectangle that represents the outline of the die; hence the area of this blue rectangular bounding box is the die area. It is important to note that the die area (A) of a particular design is usually larger than the sum of its module areas, due to whitespace within the die that is not occupied by logic. Whitespace is either reserved intentionally for routing purposes, or created unintentionally by floorplanning constraints (as shown at the bottom of Figure 5, where the modules cannot form a perfect rectangle). Floorplanning algorithms typically aim to minimize total die area by moving or rotating modules and, in some cases, by changing the aspect ratios of soft modules.

When utilizing redundancy insertion based yield enhancement techniques, the increase in die area stems from (1) the spare modules, (2) steering/testing logic, (3) additional wiring, and (4) floorplanning constraints. While the area overhead of the first two elements can be estimated without a floorplan (spare module area is known as an input, and steering/testing logic area can be reserved according to the module I/O width), the area overhead of redundancy induced wiring and whitespace wastage can only be known after floorplanning. A simple die area estimation heuristic can avoid implementing a floorplanning procedure by simply summing the estimated area of every original and spare module, together with the estimated area of the steering/testing logic, to arrive at a back-of-the-envelope calculation of total die area. While adding a floorplanning procedure to the calculation of Y/A may seem straightforward, floorplanning is a well-known NP-hard problem, creating a predicament for yield enhancement: in order to optimize Y/A, different system configurations (e.g. the number of spare modules in each stage) need to be examined, and die areas need to be recalculated at each step. Using a brute force method that repeatedly triggers a floorplanning algorithm inside such a recursive optimization procedure could easily make the yield enhancement problem intractable for large-scale systems. To the best of our knowledge, current yield enhancement techniques (especially redundancy insertion based ones) do not consider floorplanning [18][19][20][23][24], and use the sum of the areas of the modules and steering logic as an approximation of die area. This not only results in inaccurate die area calculations and wiring estimations, but may also sacrifice solution quality. In Chapter 6 we address this issue by introducing a novel design framework that unifies yield enhancement with the floorplanning process.
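The back-of-the-envelope estimate described above might look as follows; this sketch is ours, and the I/O-width-based steering reservation and all numbers are illustrative assumptions:

```python
def estimated_die_area(modules, steering_area_per_io: float) -> float:
    # Sum the areas of all original and spare modules, plus a steering/testing
    # logic reservation scaled by each module's I/O width. Redundancy-induced
    # wiring and whitespace, which require a floorplan, are ignored.
    total = 0.0
    for area, num_spares, io_width in modules:
        total += area * (1 + num_spares)
        if num_spares > 0:
            total += steering_area_per_io * io_width
    return total

# The four modules of Figure 2 as (area, spares, hypothetical I/O width).
modules = [(3, 2, 32), (1, 0, 16), (2, 2, 32), (2, 1, 32)]
print(estimated_die_area(modules, steering_area_per_io=0.01))  # ~20.96 units
```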
The overheads of yield enhancement, such as the aforementioned area overhead, are the main obstacles keeping DFY from reaching maturity. In our frameworks, we model these overheads, such as area and performance, mathematically in a system model, and cautiously weigh them against the increase in yield when modifying the bare design (the original system without any form of yield enhancement). We demonstrate that when the bare design yield drops below a certain threshold, systematic design frameworks featuring efficient design optimization algorithms, such as those introduced in this work, make yield enhancement in the design stage beneficial even when its overheads are seemingly high.

Figure 5. Floorplan example on estimating the area overheads of yield enhancement

1.4 Thesis Outline

Chapter 2 summarizes related work on yield enhancement, categorized by the granularity at which the yield enhancement techniques are applied. Chapter 3 summarizes related work on floorplanning and placement, which will aid in the reading of Chapter 6. In Chapter 4 we introduce the first of our yield-centric CAD frameworks, which explicitly trades performance for increased system yield. Our second CAD framework is described in Chapter 5, where we focus on multicore CPU architectures and combine the spare sharing concept commonly found in SIMD architectures with traditional module level redundancy. In Chapter 6 we propose a unified CAD framework that addresses yield enhancement and floorplanning in a holistic fashion.

Chapter 2. Related Work - Yield Enhancement

2.1 Core Level Redundancy

In multicore architectures, redundancy can be inserted at the coarse granularity of individual cores. Core level redundancy treats each core as a black box and requires neither analysis nor modification of the logic inside the core. When a core fails, an available spare core can logically replace it. When the original cores are identical to each other, each spare core can potentially replace any defective original core. The basic idea is illustrated in Figure 6: the "replacements" are made virtually, by activating the corresponding wiring paths in the crossbars.

Figure 6. Core level spare core sharing: in one case no original core is defective and the spare cores are unused; in the other, three original cores are defective and are replaced by spares

In this architecture of symmetric cores, the spare cores are shared amongst all original cores, which greatly increases the flexibility in repair and yield improvement [28][29]. In concept, this is similar to spare rows/columns in memory structures. Such a "core sparing" technique is commonly seen, in both research and industrial practice, in SIMD architectures such as GPUs, where the system is massively parallel with many symmetrical cores. Cheng et al. [21] provide a review of the previous work in this field. Their algorithm computes the optimal redundancy configuration while taking into account the associated wiring area overheads.
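Under the usual independence assumptions, core level sparing reduces to a k-out-of-n survival calculation. The sketch below is ours (not from [21] or [28][29]) and ignores crossbar and steering failures:

```python
from math import comb

def core_level_yield(num_cores: int, num_spares: int, core_yield: float) -> float:
    # Probability that at least num_cores of the (num_cores + num_spares)
    # identical cores are defect free, so every defective original core can
    # be logically replaced through the crossbar.
    n = num_cores + num_spares
    return sum(comb(n, k) * core_yield**k * (1 - core_yield)**(n - k)
               for k in range(num_cores, n + 1))

# A hypothetical 8-core design at 70% per-core yield.
print(core_level_yield(8, 0, 0.7))   # ~0.06 without spares
print(core_level_yield(8, 2, 0.7))   # ~0.38 with two shared spares
```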
2.2 Module Level Redundancy

At a finer granularity, redundancy can be applied at the level of modules within a core. Architecture, schematic or layout engineers can specify a module. At the architecture level, a module typically performs a particular function, such as an adder, a FIFO queue, a branch predictor or a decoder. At the schematic or layout level, a module can be partitioned into several smaller modules; applying this approach recursively, these submodules become arbitrary collections of logic gates or transistors. Module level redundancy inserts redundant modules along with the necessary steering logic. We have already visited module level redundancy in Figure 2.

Module level redundancy is relatively easy to design and universally applicable without requiring extensive analysis of the system's functionality, but it requires extra wiring and steering logic, which contributes to the area and performance overheads. The improvement in yield should offset these overheads for this form of redundancy insertion to be beneficial. For manufacturing processes that are plagued with high defect densities, or that are progressing through the tedium of yield learning, module level redundancy can be very useful.

Mohammad et al. [18][19][20] give a detailed review of previous work in this field. Their approach, which lays out the fundamentals of module level redundancy, is as follows: take a block of sequential logic, and 1) partition it into modules; 2) replicate module M_i N_i times (N_i ≥ 0), where the N_i replicated modules are considered to be spares or redundant modules; 3) insert steering logic consisting of switches, forks and joins between stages; and 4) activate one "good" module in each stage, identified by the use of scan-paths, BIST and software tools including ATPG, and disable the other modules in that stage. The objective function is to maximize yield (Y) and/or yield per area (Y/A). Figure 2 depicts these concepts: after testing, activated modules and paths are indicated by dark solid outlines, and deactivated modules and paths by gray dashed outlines.

2.3 Circuit Level Redundancy

At the finest granularity, redundancy can be inserted at the circuit level, which can be categorized into three types: transistor, gate and wiring. Circuit level redundancy for transistors has been studied in [30], where redundant transistors are inserted into the circuit. It is important to note that the redundant transistors cannot be deactivated to save power, since the steering and testing logic for such fine-grained redundancy would be too complex. The advantage of this type of circuit level redundancy is that testing and reconfiguration are not needed to cope with faults.

Figure 7 is an illustrative example of the concept of redundant transistors. The left of Figure 7 shows a basic two-input (inputs X and Y) NAND gate, but with one redundant PMOS transistor in the pull-up path of input X (shown in green), which does not affect the functionality of the gate. When all transistors are defect free and X = 0 and Y = 1, the output should be 1 due to the parallel PMOS transistors. In this case there are two pull-up paths (indicated by bold blue arrows) because of the redundant PMOS transistor, as seen in the middle of Figure 7. On the right of Figure 7, the original PMOS transistor for input X has experienced a "stuck-open" fault; under the same input conditions, the output is still pulled up to VDD because of the redundant transistor. In fact, under all input conditions, this gate still functions as a two-input NAND gate. In other words, this transistor configuration can tolerate a stuck-open fault in either of the two parallel PMOS transistors for input X.

Figure 7. Redundant transistor illustrative example

When circuit level redundancy is applied to gates, the basic concepts remain the same [31]. For wiring, the two main approaches are redundant vias [32][33] and redundant paths [34].
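The fault tolerance argument for Figure 7 can be checked with a tiny switch-level simulation; this sketch is ours and abstracts the transistors to ideal switches:

```python
def nand_output(x: int, y: int, pmos_x_stuck_open: bool = False) -> int:
    # Figure 7 gate: two parallel PMOS pull-up devices for input X (original
    # plus redundant) and one for Y; the usual series NMOS pull-down. A
    # stuck-open fault on the original X PMOS removes only one of the two
    # parallel pull-up paths.
    pull_up_x_original = (x == 0) and not pmos_x_stuck_open
    pull_up_x_redundant = (x == 0)        # redundant device, in parallel
    pull_up_y = (y == 0)
    pull_up = pull_up_x_original or pull_up_x_redundant or pull_up_y
    return 1 if pull_up else 0            # otherwise the series NMOS pulls down

# The faulty gate still matches NAND behavior on every input combination.
for x in (0, 1):
    for y in (0, 1):
        assert nand_output(x, y, pmos_x_stuck_open=True) == (0 if x and y else 1)
```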
2.4 Microarchitecture Level Yield Improvement

The nature of some complex systems, such as microprocessors, opens new avenues for yield improvement without relying on explicit redundancy, usually at the cost of degraded performance. The branch predictor is a special case: defects in the branch predictor only degrade performance, without corrupting the overall correctness of program execution [35][36]. When the performance degradation caused by a defective branch predictor is acceptable, such defects can simply be ignored [37]. The memory-based structures in microprocessors can borrow concepts from the commercially successful defect tolerance techniques used in caches and memories [38], exploiting what the authors define as "microarchitectural redundancy". When upper compiler or software layers are involved, yield can be further improved by manipulating the compiler or the source code itself. The Relax framework [39] allows the hardware to cooperate with the ISA layer and the compiler. As a result, vulnerable instructions can be isolated prior to execution and receive special treatment to ensure program correctness.

Chapter 3. Related Work - Floorplanning and Placement

Floorplanning and placement are well-studied topics in the field of CAD, and in some respects they are similar in concept. The objective of floorplanning/placement is to arrange the logic blocks/modules in terms of physical location and orientation, to achieve practical objectives such as area minimization, wire length minimization and timing optimization. Naturally, this is performed after the design has been finalized, i.e., once the netlist of the entire design is fixed. As mentioned in [40], floorplanning/placement approaches can be categorized into three classes: standard cell placement, block floorplanning, and mixed block "floorplacement".
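As a concrete instance of the wire length objective mentioned above, placers commonly estimate a net's wirelength by the half-perimeter of its pins' bounding box (HPWL). A minimal sketch (ours):

```python
def hpwl(pins):
    # Half-perimeter wirelength of one net: the semi-perimeter of the
    # bounding box of its pin coordinates.
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_hpwl(nets):
    # The wirelength objective: sum of per-net HPWL.
    return sum(hpwl(net) for net in nets)

# A hypothetical 3-pin net and 2-pin net.
print(total_hpwl([[(0, 0), (4, 1), (2, 3)], [(1, 1), (1, 5)]]))  # 7 + 4 = 11
```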
3.1 Standard Cell Placement

Standard cell placement, illustrated in Figure 8a, is a classic CAD problem. Each cell contains the circuitry for a relatively simple logic function, and its functionality, wiring and layout are predetermined. Normally cells are rectangular and of similar height, so that they can be placed in a row with other cells without wasting space. Collectively these cells form a cell library; designers can then pull cell designs from this library without needing to be bothered with low level cell design. For a modern VLSI design, the number of cells typically ranges in the millions. The "standard cell" model, a cornerstone of modern design, allows different types of cells to have varying width but uniform height. Standard cell placement attempts to place rectilinear circuit elements (cells) into one or more horizontal rows. Well-known placement methods include simulated annealing [41], analytic methods [43][44] and recursive partitioning [45].

Simulated Annealing Based Standard Cell Placement

Both TimberWolf [41] and Dragon [42] are well-known placement tools based on simulated annealing. Given enough run time, these approaches can produce high quality results.

Analytical Standard Cell Placement

Analytical placement tools include GORDIAN [43], which uses quadratic programming, and Kraftwerk [44], which uses a force directed approach.

Recursive Partitioning Based Standard Cell Placement

Placement by recursive partitioning is one of the oldest approaches to this fundamental computer-aided design problem. First introduced by Breuer [45] and subsequently improved by Dunlop and Kernighan [46], the approach has proven popular and competitive when the standard cell assumption holds. Recent researchers have revisited the Fiduccia-Mattheyses (FM) partitioning algorithm to address modern CAD problems [47]. The original Feng Shui algorithm [48], introduced in 2001, starts with the traditional recursive bisection algorithm; however, the bisections are performed concurrently rather than sequentially, so that the partitioning decisions made inside one region can affect the partitioning decisions to be made for another region. Location assignments are performed via iterative deletion. The selection of cut sequences in Feng Shui [49][50] is a subject of its own, contributing significantly to the quality of the final placement result. The researchers involved in this project have shown that high quality cut sequences can be generated with relatively simple methods, and that the alternating cut directions and bisections suggested by Breuer [45] are in fact the best approach for many problems.

3.2 Block Floorplanning

Block floorplanning problems [51], illustrated in Figure 8b, involve a group of large rectangular or rectilinear macro blocks (typically in the hundreds for modern designs) of arbitrary size and aspect ratio. Each block might contain a large number of standard cells, smaller blocks, or a combination of both. In block floorplanning, the blocks no longer share the uniformity found in standard cell placement, which is a key enabler of recursive bisection and analytical methods. As a result, although the number of blocks is generally smaller than the number of standard cells in the standard cell placement framework, block floorplanning problems frequently exhibit a puzzle-like nature; finding a way to fit blocks together such that gaps are eliminated is an NP-complete problem, and is generally considered computationally more difficult than standard cell placement. So far, annealing methods have dominated floorplanning because of the difficulty of finding a good heuristic to guide optimization. With blocks of varying width and height, it has been essential to approach the problem as one of "packing". Placement representations such as sequence pairs [52], O-trees [53] and B*-trees [54] are widely used (a survey of floorplan representations is provided in [55]), with an annealing based framework searching the solution space. Other examples include the work of [56] and [57]. Annealing is flexible, but it is comparatively slow and does not scale well to large problems without imposing constraints on the problem.
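The annealing based floorplanners above share a common skeleton: perturb a floorplan representation, evaluate a cost such as weighted area plus wirelength, and accept or reject moves via the Metropolis rule. The sketch below is schematic and ours; the toy cost and perturbation stand in for a real representation such as a sequence pair:

```python
import math, random

def anneal(state, cost, perturb, t0=1.0, alpha=0.95, moves_per_t=100, t_min=1e-3):
    # Generic annealing loop: `state` would be a sequence pair, O-tree or
    # B*-tree; `cost` is typically a weighted sum of die area and wirelength.
    best = cur = state
    t = t0
    while t > t_min:
        for _ in range(moves_per_t):
            nxt = perturb(cur)
            delta = cost(nxt) - cost(cur)
            # Metropolis rule: take improvements, occasionally accept uphill
            # moves to escape local minima.
            if delta <= 0 or random.random() < math.exp(-delta / t):
                cur = nxt
                if cost(cur) < cost(best):
                    best = cur
        t *= alpha   # geometric cooling schedule
    return best

# Toy usage: the "floorplan" is a permutation of block ids; cost is 0 when sorted.
blocks = list(range(8))
random.shuffle(blocks)
def toy_cost(p): return sum(abs(v - i) for i, v in enumerate(p))
def toy_perturb(p):
    q = p[:]
    i, j = random.sample(range(len(q)), 2)
    q[i], q[j] = q[j], q[i]
    return q
print(toy_cost(anneal(blocks, toy_cost, toy_perturb)))   # approaches 0
```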
Then place the clustered standard cells using standard cell placement procedures. While this method reduces the problem size to the extent where floorplanning techniques can be applied, it suffers from the inherent drawback of almost any hierarchical algorithm: decreased solution quality for large-scale problem inputs. Examples include the Macro Block Placement program [59], which restricts the partitioned blocks to a rectangular shape, and the ARCHITECT floorplanner [60], which permits rectilinear (e.g. "L" shaped) blocks. Following the success of placement algorithms in the standard cell placement problem, it is only natural that many standard cell placement methods have been retrofitted to deal with a mixture of standard cells and macro blocks. The analytic placement tools APlace [61] and UPlace [62] perform well by first computing a non-legal abstract placement, and then performing placement legalization (legalization refers to the process of moving blocks to rid a layout of overlapping blocks) using a mixed-size Tetris legalization method. Mixed-Mode Placement (MMP) [63] uses a quadratic placement algorithm combined with a bottom-up two-level clustering strategy and slicing partitions to remove overlaps. MMP was demonstrated on industrial circuits with thousands of standard cells and not more than 10 macro blocks. The simulated annealing based multilevel optimization tool mPG [64] consists of a coarsening phase and a refinement phase. In the coarsening phase, both macro blocks and standard cells are recursively clustered together to build a hierarchy. In the refinement phase, large objects are gradually fixed in place, and any overlaps between them are gradually removed. The locations of smaller objects are determined during further refinement. In 2002-2003, a three-stage placement-floorplanning-placement flow [65][66] was presented to place designs with large numbers of macro blocks and standard cells. The flow utilizes the Capo standard cell placement tool [70] and the Parquet floorplanner [56]. In the first stage, all macro blocks are "shredded" into a number of smaller sub-cells connected by two-pin nets created to ensure that sub-cells are placed close to each other during the initial placement. A global placer is then used to obtain an initial placement. In the second stage, averaging the locations of cells created during the shredding process produces initial locations of macros. The standard cells are merged into soft blocks, and a fixed-outline floorplanner generates valid locations for the macro blocks and soft blocks of movable cells. In the final stage, the macro blocks are fixed into place, and cells in the soft blocks go through a detailed placement. In 2005, the hMetis [67] inspired Feng Shui algorithm [48] broke the confinement of the standard cell assumption to apply recursive bisection in a mixed block environment, which resulted in Feng Shui 2.5 [68]. Feng Shui 2.5 was later upgraded to Feng Shui 2.6 [40]. The basic idea is to treat large macro blocks like standard cells during placement (in sharp contrast to the simulated annealing based approaches, which treat standard cells as macro blocks), and to use legalization methods to eliminate overlaps. This methodology was pursued further: most recently, in 2010, [69] improved Feng Shui to handle soft blocks. It avoids the problem size explosion by exploiting the fact that uniform height blocks can be efficiently placed, and forces soft blocks to conform to a predetermined height to the extent possible.
The authors do not take advantage of the "soft" nature of many blocks during placement, believing that doing so complicates the problem without providing significant benefit. Capo [70] has also evolved over the years, transitioning from standard cell problems, to mixed size, and recently into floorplacement, through the integration of an annealing based floorplanner. Capo [70] uses bisection when circuit elements are small, but switches to floorplanning when a portion of the placement contains blocks that cannot be legalized easily. The placement tool PATOMA [71] also performs recursive bisection, but adds a fast "legality checker" to ensure that bisections will not result in a configuration that cannot be legalized easily.

Figure 8. Floorplanning/Placement classes: (a) standard cell; (b) block; (c) mixed block

Chapter 4. Reduced Redundancy Insertion

4.1 Introduction

Motivation
Although traditional module level redundancy insertion can achieve considerable Y/A benefits, it has the following drawbacks:
1) For immature technologies, modules with large area that cannot be easily partitioned have very low yield. Their spares naturally also have very low yield. To achieve acceptable overall Y/A, multiple spares are required, but this adds complexity to the steering logic, which reduces performance. Test time also increases, since more modules need to be tested.
2) After fabrication, as technology matures and yield improves beyond a certain threshold, the spares become wasted silicon space. To eliminate such wastage, new masks must be produced, which is a very expensive process.

In this chapter, we outline a new approach towards efficiently applying module level redundancy by trading off yield, area and performance. This new form of redundancy shows that there exist many promising avenues when addressing yield at the design stage. We first state two fundamental principles: one in VLSI design, namely that performance is positively correlated with area; and one in circuit manufacturing, namely that yield is negatively correlated with area. The first principle implies that chips or modules that consume a larger silicon area will generally enjoy better performance, which matches intuition, since the silicon area can be used to add performance enhancing features (e.g. using a carry look-ahead adder instead of a carry ripple adder), or simply used for upsizing the transistors. The second principle is rather straightforward as well: chips or modules with larger area have a higher probability of encountering defects. This negative correlation is reflected in all yield models, especially the Poisson yield model. According to these two principles, if a spare module were to decrease in size, the result would be a degradation in performance but an increase in yield. For example, a spare for a carry look-ahead adder can be a ripple carry adder. This is the starting point of Reduced Redundancy (RR). Compared to traditional spare modules, a Reduced Redundant Module (RRM) M'_x is a spare module for the original module M_x that has smaller area and lower performance, but no less functionality, than M_x. The process of inserting RRMs is referred to as Reduced Redundancy Insertion (RRI).
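A small numeric illustration of the two principles follows, assuming for concreteness the Poisson yield model Y = e^(-d·A) with a hypothetical defect constant d, and a Pollack-style performance P = sqrt(A); both modeling choices are introduced formally later in this chapter.

```python
# Illustrative only: as the spare area A' shrinks, yield rises and performance
# falls; the product Y'*P' captures the tension that RRI exploits.
import math

d = 0.1  # hypothetical defect-density constant
for area in [5.0, 4.0, 3.0, 2.0, 1.0]:
    yield_ = math.exp(-d * area)     # yield rises as area shrinks
    perf = math.sqrt(area)           # performance falls as area shrinks
    print(f"A'={area:.1f}  Y'={yield_:.3f}  P'={perf:.3f}  Y'*P'={yield_ * perf:.3f}")
```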
Reduced Redundancy Insertion Application Context
The multi-core era does not necessarily mean that each core will regress towards simplicity. Currently, each core in the Intel i7-3770 series still possesses powerful multithreading capabilities, and it is predicted that the complexity of each core will continue to increase to deliver enough processing power to satisfy growing program computation demands [15]. Hence, for modules in these complex cores, performance enhancement features can potentially be scaled back or abandoned to form RRMs. In this body of work we analyze the associated tradeoffs. One can make a valid claim that for some designs, using a module with degraded performance will jeopardize the operation of the entire system. For many situations, however, this is not the case. Below we list several ways in which a system can operate properly, with little or no reconfiguration, when some modules are replaced with their RR counterparts.

Pipelines
In a conventional pipeline, performance degradation of any stage can be coped with by lowering the clock frequency.

Asynchronous Logic/Handshaking Interfaces
In asynchronous circuits modules communicate with each other through handshake protocols, and thus each module can take an arbitrary amount of time to finish computation. RRI can be easily applied in such environments.

Commercial Microprocessors
The Instruction Scheduler Module
The instruction scheduler inside modern microprocessors includes single-cycle wake up, wide scheduling width and scheduling heuristics to boost performance [72]. A simpler scheduler can be used without reconfiguration due to the handshake policies with the execution units: the execution unit can simply announce its readiness to the instruction scheduler.

Execution Units
Execution units, such as ALUs, can be designed in different ways, balancing area and performance. The integer execution unit in the Pentium 4 [73][74] employs carry-merge trees and domino logic, and operates in the 2x frequency domain. These performance enhancing features can be removed to form reduced redundant modules.

The Entire Core
At the coarsest granularity, each core can have its own spare, and core sparing has been reviewed in the previous chapter. In comparison, a RR core will have smaller area and degraded performance. Such a system would resemble ARM's big.LITTLE architecture [16], where an A15 core is coupled with a lightweight A7. The big.LITTLE architecture is designed for power efficiency, not yield improvement, but it does prove the feasibility of having heterogeneous performance within the same system.

Intrinsic Error-Tolerant Systems
Some systems can inherently tolerate errors at the application level, such as video/audio decoding. Shin et al. [75] exploit these characteristics and explicitly reduce area in the design stage while sacrificing computation accuracy. These modules are perfect candidates for RRI.

In some cases, to use an RRM, system level or module level reconfiguration is needed. We integrate this reconfiguration overhead into the area of the RRM. Details regarding synchronization of the RRM with the entire system in terms of communication protocols and/or clock distribution are beyond the scope of this work.
Design Flow
RR calls for a revision of the conventional yield improvement design flow (Figure 9a). In the RRI design flow (Figure 9b) the design is passed on to RR analysis, where the algorithms presented in this work are executed, returning more ideal RR configurations. This information is then fed back to the circuit designers, who will attempt to redesign modules while satisfying the RR analysis requirements.

Figure 9. Conventional and Reduced Redundancy design flow

RR analysis and RR redesign can go through several iterations if the circuit designer cannot satisfy the requirements of the redundancy configurations suggested by the RR analyzer, which runs the design optimization algorithms presented in this chapter. After reaching some point of agreement, the design will be manufactured, tested and configured so that failed modules are replaced by their spares. Although some chips are sold as lowered performance versions, which stems from RRI and perhaps also speed binning, the total revenue per wafer achieved with RRI can often significantly outperform traditional redundancy insertion schemes.

4.2 Problem Definition

Case Study Prelude
We examine a simplified case study to serve as an introduction to RRI. Consider an abstract microprocessor that consists of a cache module M_0 and a core module M_1, illustrated in Figure 10, manufactured on a wafer of size A_w. The area and yield of M_i are A_i and Y_i, respectively. Yield models are used to estimate Y_i. By the use of spare rows/columns, ECC and recent software/OS level fault tolerant techniques [8], even in environments of high defect densities and process variations, the cache (M_0) can achieve a high yield. M_1, however, does not share the uniformity of the cache module. The average performance of M_1 is P_1, the value of which is estimated by executing architectural or post-synthesis benchmark simulations, and which will serve as the baseline performance value.

Figure 10. A 2-module abstract microprocessor

Over the initial years of production, when the technology is immature or progressing through the tedium of yield learning, a non-negligible fraction of the manufactured M_1 will produce errors due to defects, noise and process variations. This causes the entire chip to be discarded. Thus, M_1 needs to be more robust, and one way to accomplish this is via redundancy. This can be done using the coarse-grained core-level redundancy reviewed in the previous chapter, and we explore this solution as an introductory example. Assume only one spare core can be used for M_1; multiple spares are prohibited due to the complexity of the steering logic, testing logic and testing time. M_0 will not have a spare, since its yield Y_0 is high. Figure 11 displays three versions of the M_1 design to be examined: the bare design, the conventional redundancy design and the new RRI design. In this chapter, die area is approximated as the sum of all module areas, thus ignoring floorplanning constraints and wiring.

Figure 11. Three design alternatives for M_1

Version 1 Design (No Redundancy)
The first version has no redundancy. From this point on we assume that the yield of any module is independent of the yield of other modules (this is true for most defect oriented yield models, such as the Poisson yield model), and therefore the yield of the entire chip is Y_0·Y_1. If we use A_0+A_1 to estimate the area of the die (this formulation ignores wiring and floorplanning constraints), the Y/A of the chip is Y_0·Y_1/(A_0+A_1). For each silicon wafer of size A_w, the number of "good" dice with performance P_1 is approximately A_w·Y_0·Y_1/(A_0+A_1).
Version 2 Design (Traditional Redundancy, 1 Spare Only)
Version 2 has one spare for M_1 that is identical to M_1. Therefore, a chip will only fail if M_0 fails (which is unlikely due to the robust protection that the cache employs), or if both M_1 and its spare fail. The yield of the entire chip is Y_0·[1-(1-Y_1)^2], and the Y/A of this design is Y_0·[1-(1-Y_1)^2]/(A_0+2A_1). For each wafer of size A_w, approximately A_w/(A_0+2A_1) dice will be manufactured. Of these manufactured dice, the expected number of "good" dice with performance P_1 is A_w·Y_0·[1-(1-Y_1)^2]/(A_0+2A_1); the rest of the dice will be discarded.

Version 3 Design (Reduced Redundancy, 1 Spare Only)
The third design utilizes RRI. The RRM, denoted M'_1, has area A'_1, where A'_1 < A_1, performance P'_1 and yield Y'_1. M'_1 has all the functionality of M_1, but P'_1 < P_1, due primarily to the reduction in area. The decrease in area leads to an increase in yield, which holds for all well-known defect-oriented yield models. To summarize: A'_1 < A_1, P'_1 < P_1 and Y'_1 > Y_1. The overall performance can no longer be evaluated as a constant, since it depends on which parts are functional (see Table I).

Table I. Performance for different configurations of M_1 and M'_1
  M_1     M'_1 (RRM)    Probability (i.e. Yield)    Performance
  Good    Good          Y_1·Y'_1                    P_1
  Good    Bad           Y_1·(1-Y'_1)                P_1
  Bad     Good          (1-Y_1)·Y'_1                P'_1
  Bad     Bad           (1-Y_1)·(1-Y'_1)            0

For a wafer of size A_w, a total of A_w/(A_0+A_1+A'_1) dice will be manufactured. The expected number of dice per wafer that operate at the nominal performance P_1 is Y_0·Y_1·A_w/(A_0+A_1+A'_1). Similarly, the expected number of dice that will operate at the lowered performance P'_1 is Y_0·(1-Y_1)·Y'_1·A_w/(A_0+A_1+A'_1). The revenue each die generates depends on various hard to predict commercial factors. In our work we assume that price is linearly correlated with performance; thus, to maximize revenue we should "extract" the most performance out of each single wafer. We can approximate this objective function by the expression shown in Equation 1:

E_revenue = A_w·[Y_0·Y_1·P_1 + Y_0·(1-Y_1)·Y'_1·P'_1]/(A_0+A_1+A'_1)    (Equation 1)

To help justify the use of Equation 1, we introduce the concept of Expected Performance per Area (E(P)/A), which is the expected performance divided by total area. For the version 1 and 2 designs, performance is constant, and therefore the expectation of performance is constant; E(P)/A in this case is simply Y/A multiplied by the nominal performance P_1. With RR, the expectation of performance is defined as:

E(P) = Y_0·Y_1·P_1 + Y_0·(1-Y_1)·Y'_1·P'_1

For the version 3 design, the value of E(P)/A is:

E(P)/A = Y_0·[Y_1·P_1 + (1-Y_1)·Y'_1·P'_1]/(A_0+A_1+A'_1)

Now Equation 1 can be rewritten as A_w·E(P)/A. Since A_w is constant, maximizing E(P)/A will maximize total revenue per wafer. For other formulations of wafer revenue, Eq. (1) would need to be adjusted accordingly.
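As a quick sanity check of these expressions, the sketch below evaluates the three designs with the parameters that will be given in Table II (A_0 = 10, Y_0 = 0.9, A_1 = 5, Y_1 = 0.5, P_1 = sqrt(A_1)); calibrating the Poisson defect constant d from Y_1 is our assumption, consistent with the yield model adopted later in this chapter.

```python
import math

A0, Y0, A1, Y1 = 10.0, 0.9, 5.0, 0.5
P1 = math.sqrt(A1)
d = -math.log(Y1) / A1                      # calibrate d so that exp(-d*A1) = Y1

def epa_v1():                               # Version 1: no redundancy
    return Y0 * Y1 * P1 / (A0 + A1)

def epa_v2():                               # Version 2: one identical spare
    return Y0 * (1 - (1 - Y1) ** 2) * P1 / (A0 + 2 * A1)

def epa_v3(A1r):                            # Version 3: reduced spare of area A1r
    Y1r, P1r = math.exp(-d * A1r), math.sqrt(A1r)
    return Y0 * (Y1 * P1 + (1 - Y1) * Y1r * P1r) / (A0 + A1 + A1r)

best = max(epa_v3(a / 100.0) for a in range(20, 501))
print(epa_v1(), epa_v2(), best)
```

Sweeping A'_1 this way reproduces the non-monotonic E(P)/A curve of Figure 12, with the RR peak roughly 17% above the traditional-redundancy value.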
Formal Formulation
Extrapolation to a system with one simplex module (M_0) and n redundancy-applicable modules (M_1, M_2, ..., M_n) introduces some key complications that are analyzed next. We assume a generic system organized as follows: M_0 still has size A_0 and yield Y_0, while M_k (1 ≤ k ≤ n) has size A_k and yield Y_k. M_0 will not have a spare. For M_1-M_n, we assume the following:
1. Each module can have either one or no spare. The spare can be identical to the original module, as in traditional redundancy insertion, or smaller in area than the original module, as in RRI.
2. For each module, either the original or its spare, if it exists, must be fault free to ensure system operation.
3. All reconfiguration overhead required for utilizing the spare is incorporated in the area of the spare.
4. The sole spare for M_k, denoted M'_k, has size A'_k (A'_k ≤ A_k) and yield Y'_k (Y'_k ≥ Y_k). If A'_k = 0, the spare does not exist.

The golden performance of the system, P_gold, is the reference performance when M_0-M_n are all functional. Later, parameters related to M_0 will be integrated into the equation as a common term. The probability of P_gold being achieved is ∏_{k=1..n} Y_k. The chip will be discarded if there is a module where both the original and its spare have failed, or where the original has failed and there is no spare. The probability of stage k failing is denoted B_k. When M_k is not protected, i.e. A'_k = 0, then B_k = (1-Y_k). When M_k has one spare, i.e. 0 < A'_k ≤ A_k, then B_k = (1-Y_k)·(1-Y'_k). By inclusion-exclusion, the probability of a chip failing due to a failure in any stage is:

B = 1 - ∏_{k=1..n} (1 - B_k)

When the chip is still functional, i.e. no stage has failed, but P_gold cannot be achieved, the chip will operate at a certain degraded performance, with probability 1 - B - ∏_{k=1..n} Y_k (the chip is neither performing at P_gold nor completely defective). The value of the degraded performance (P_degrade) depends on which spare modules are in use, so in reality it cannot be fixed as a constant. However, to avoid the combinatorial sets of values needed to represent P_degrade, we take an extremely pessimistic approach: of all the P_degrade values that the system can achieve by using any combination of the spares, choose the minimum. If one or more RRMs are used in place of the originals, the chip is said to perform at P_degrade. This guarantees that we will not overestimate E(P)/A in later calculations. It also adheres to practical marketing considerations, where only two types of chips need to be priced: those with performance P_gold and those with the lowered performance P_degrade. The expected performance of the entire chip, including M_0, is:

E(P) = Y_0·[P_gold·∏_{k=1..n} Y_k + P_degrade·(1 - B - ∏_{k=1..n} Y_k)]

The E(P)/A of the entire chip is given by Equation 2, and serves as the primary objective function in this work:

E(P)/A = Y_0·[P_gold·∏_{k=1..n} Y_k + P_degrade·(1 - B - ∏_{k=1..n} Y_k)] / (A_0 + Σ_{i=1..n} (A_i + A'_i))    (Equation 2)

In Equation 2, when A'_i = A_i, traditional redundancy is deployed for M_i, and the system receives no performance penalty for utilizing that spare module. When 0 < A'_i < A_i, an RRM is used for M_i. A'_i = 0 implies that no redundancy is used for M_i, which means that if M_i fails, the entire chip will be discarded. The following sections of this chapter provide insight into the maximization of E(P)/A for this generic system.
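Equation 2 translates directly into code. The sketch below is a minimal evaluator (the function name and argument layout are our own); plugging in the 5-module configuration of Table III later in this chapter reproduces the reported EPA_RRI of roughly 0.020, up to rounding of the tabulated areas and yields.

```python
# A direct transcription of Equation 2. A_spare[i] = 0 encodes "no spare".
def epa(Y0, A0, A, Y, A_spare, Y_spare, P_gold, P_degrade):
    prod_Y = 1.0
    for y in Y:
        prod_Y *= y                          # Pr{all originals good} (excl. M0)
    prod_ok = 1.0
    for k in range(len(A)):
        B_k = (1 - Y[k]) if A_spare[k] == 0 else (1 - Y[k]) * (1 - Y_spare[k])
        prod_ok *= 1 - B_k                   # stage k survives
    B = 1 - prod_ok                          # Pr{chip discarded}
    expected_perf = Y0 * (P_gold * prod_Y + P_degrade * (1 - B - prod_Y))
    total_area = A0 + sum(A) + sum(A_spare)
    return expected_perf / total_area
```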
To show a quantitative example of the benefits of RR, we return to the example in Figure 11. Table II summarizes the relevant parameters of the system. The relationship between P'_1 and A'_1 is P'_1 = sqrt(A'_1), according to Pollack's Rule [9]. We plot the E(P)/A for the three designs of Figure 11 in Figure 12; as we decrease the area of the spare, the performance of the system degrades whenever the RRM is accessed.

Table II. Area, Yield and Performance parameters
  A_0    Y_0    A_1    Y_1    P_1                  A'_1          P'_1
  10     0.9    5      0.5    sqrt(A_1) = 2.23     Adjustable    sqrt(A'_1)

As seen in Figure 12, the E(P)/A function for the RR design is not monotonic, and at the peak point RR exhibits about a 17% increase in E(P)/A compared to traditional redundancy.

Figure 12. E(P)/A values with respect to the area of the spare A'_1

To augment the above example, we later present Theorem 1, based on the same system depicted in Figure 10.

Analysis of the E(P)/A Function
To maximize E(P)/A, we note that Equation 2 is a function of area (A), yield (Y, implied in the expectation) and performance (P). Intuitively, as the area of a module decreases, its yield should increase and its performance decrease. We refer to the relationship between performance and area as the P-A function, which is architecture dependent. The relationship between area and yield is determined by the yield model. The conclusions drawn from our work are valid regardless of these detailed relations, as long as they are well behaved. All well-known yield models can be used in our work, but for simplicity we will use the Poisson yield model:

Y = e^(-d·A)

where d is a constant dependent on the technology and defect density. The values of A_i, Y_i and P_gold are predetermined by the circuit design, layout and manufacturing process. By adjusting A'_1-A'_n, we can obtain different values of Y'_1-Y'_n. If A'_1-A'_n are set, B can be calculated. P_degrade is also dependent on A'_1-A'_n, but this relationship requires further investigation. Assuming this relationship is known, we only need to manipulate A'_1-A'_n to maximize E(P)/A. Section 4.3 lays out the construction of P_degrade and introduces the maximization procedure for E(P)/A.

Theorem 1: Consider the system shown in Figure 10 and the aforementioned Poisson yield model. If performance is modeled as a single-term polynomial function of area, i.e. P'_1 = K·(A'_1)^R where K > 0 and R > 0, RRI will enhance the E(P)/A value compared to traditional redundancy when A_1 > A'_1 > R/d.

Proof: Let EPA_Traditional and EPA_RR be the E(P)/A values of the design using only traditional redundancy and of the same design using RR, respectively:

EPA_Traditional = Y_0·[Y_1 + (1-Y_1)·Y_1]·P_1 / (A_0+2A_1)
EPA_RR = Y_0·[Y_1·P_1 + (1-Y_1)·Y'_1·P'_1] / (A_0+A_1+A'_1)

With the Poisson model, Y'_1·P'_1 = K·(A'_1)^R·e^(-d·A'_1). We know that K > 0, Y_0 > 0, e^(-d·A'_1) < 1 and A_0+2A_1 > 0; also, the function K·(A'_1)^R·e^(-d·A'_1) is monotonically decreasing when A'_1 > R/d, since its derivative, K·e^(-d·A'_1)·(A'_1)^(R-1)·(R - d·A'_1), is negative in that range. Hence, shrinking the spare from A_1 to any A'_1 with A_1 > A'_1 > R/d increases the term Y'_1·P'_1 above Y_1·P_1 while also reducing the die area below A_0+2A_1. Therefore EPA_Traditional < EPA_RR when A_1 > A'_1 > R/d.

Theorem 1 provides some fundamental insight regarding RRI: RRI is most beneficial when 1) the target module has a large area, 2) the defect density is high, and 3) the performance penalty from decreasing module area is low.

4.3 Maximization of E(P)/A

Models of Degraded Performance
The elusive relationship between P_degrade and A'_1-A'_n is defined by the function H in Equation 3, and it determines the maximization methodology of E(P)/A:

P_degrade = H(A'_1, A'_2, ..., A'_n)    (Equation 3)

In this work we explore two P_degrade formulations for two types of systems: serial and parallel. For other system configurations, P_degrade should be either approximated by one of our two models, or derived by extending or combining them.

Pipeline/Bottleneck Performance Model
We assume that M_1-M_n are organized in a linear uni-directional pipeline. We first define a function P_x = F_x(A'_x) for each M_x, called the P-A function.
F_x(A'_x) is the performance of the entire chip when only M_x is replaced by M'_x. Figure 13 illustrates the system configuration for measuring F_x(A'_x), with disabled modules shaded in grey. Our algorithm assumes F_x(A'_x) is given as input; in reality, F_x(A'_x) would be derived either from empirical estimations or from architectural simulations. By definition, P_gold is achieved when A'_x = A_x for all x, i.e. P_gold = F_x(A_x) for every x.

Figure 13. The pipeline hardware model for measuring F_x(A'_x)

The clock rate of a linear pipeline is bounded by its slowest stage. Thus we can reasonably assume that P_degrade is constrained by the RRM that degrades performance the most, i.e.

P_degrade = min over all x with A'_x > 0 of F_x(A'_x)

Additive Performance Model
For the second model, we assume that M_1-M_n operate in parallel and contribute independently to the overall system performance, which is evaluated by taking the sum of all individual module performance values. This performance model targets components such as the collection of ALUs in microprocessors. Again we define a P-A function G_x(A_x) for each module M_x, which represents the performance rating of M_x alone. By our definition of the parallel system, P_gold = Σ_{x=1..n} G_x(A_x). Alternatively, a weighted sum function can be used. The worst-case P_degrade is now defined in Equation 4; note that A'_x = 0 indicates that M_x does not have any spare:

P_degrade = Σ_{x: A'_x > 0} G_x(A'_x) + Σ_{x: A'_x = 0} G_x(A_x)    (Equation 4)

Maximizing E(P)/A

Pipeline/Bottleneck Performance Model
Our method for maximizing E(P)/A starts by finding the optimal values of A'_1, A'_2, ..., A'_n. It consists of an outer loop and a kernel optimization procedure. The outer loop finds the best-suited stages S_R ⊆ {1, 2, ..., n} where redundancy is appropriate. In other words, we have 0 < A'_i ≤ A_i when i ∈ S_R and A'_i = 0 when i ∉ S_R, for each i ∈ {1, 2, ..., n}. The kernel optimization solves the Simplified Redundancy Area (SRA) optimization problem, i.e. finding the optimal A'_i value for each i ∈ S_R when the set S_R is given. We first describe the kernel, which aims to find the optimal A'_i values in a |S_R|-dimensional space, where |S_R| denotes the cardinality of S_R. We effectively reduce the search space of A'_i values by exploiting the following Theorem 2.

Theorem 2: For stages in the set S_R, the optimal area values of the redundant spares satisfy F_i(A'_i) = F_j(A'_j) for all i, j ∈ S_R.

Proof: Suppose that ∃ i, j ∈ S_R such that F_i(A'_i) < F_j(A'_j) in the optimal solution of the SRA optimization problem. We will prove that by reducing A'_j until F_i(A'_i) = F_j(A'_j), the E(P)/A value calculated by (2) increases, contradicting the claim that F_i(A'_i) < F_j(A'_j) can exist in the optimal solution. While reducing the A'_j value (so that we still have F_i(A'_i) < F_j(A'_j)), P_degrade, which is at most F_i(A'_i), remains unchanged. Only the B value changes in the numerator of (2) during the reduction of A'_j, and we know that B_j decreases when A'_j is reduced. We calculate the derivative of B with respect to B_j:

∂B/∂B_j = ∏_{k ∈ {1,...,n}\{j}} (1 - B_k) = 1 - [1 - ∏_{k ∈ {1,...,n}\{j}} (1 - B_k)]

Notice that 1 - ∏_{k ∈ {1,...,n}\{j}} (1 - B_k) is essentially the probability that both the original module and its spare fail in at least one stage in the set of stages {1, 2, ..., n}\{j}. Since this probability is less than 1, ∂B/∂B_j is positive. In other words, when we reduce the A'_j value, B decreases, and therefore the numerator of (2) increases. On the other hand, the denominator of (2) decreases. Since the values of the numerator and denominator of (2) are greater than zero, it follows that the E(P)/A value from (2) will increase when A'_j decreases.

By exploiting the above observation, we successfully reduce the search space for finding the optimal A'_i values from the original |S_R|-dimensional space to a curve on which all the F_i(A'_i) values are identical. We then utilize the ternary search algorithm, an extension of the well-known binary search algorithm, to find the optimal common F_i(A'_i) value and, subsequently, the optimal A'_i values. An upper bound and a lower bound are required for performing the ternary search. For each i ∈ S_R, the lower bound of F_i(A'_i) is 0 (note that F_i(A'_i) > 0), while the upper bound of F_i(A'_i) is F_i(A_i).
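A sketch of this kernel is shown below. The names and the inverse-mapping interface are our own, and unimodality of E(P)/A along the equal-F curve is assumed, as ternary search requires. F_inv[i] maps a target stage performance p back to the spare area A'_i with F_i(A'_i) = p, and epa_of evaluates Equation 2 for a given assignment of spare areas.

```python
# Kernel optimization exploiting Theorem 2: the optimum lies on the curve
# where all F_i(A'_i) share one common value p, so we ternary-search over p.
def kernel(S_R, F_inv, epa_of, p_lo, p_hi, iters=100):
    """p_hi should be min over i in S_R of F_i(A_i); p_lo is a small
    positive value. Returns the best common value p and its E(P)/A."""
    def eval_p(p):
        return epa_of({i: F_inv[i](p) for i in S_R})
    for _ in range(iters):                 # shrink [p_lo, p_hi] by thirds
        m1 = p_lo + (p_hi - p_lo) / 3.0
        m2 = p_hi - (p_hi - p_lo) / 3.0
        if eval_p(m1) < eval_p(m2):
            p_lo = m1                      # the maximum lies right of m1
        else:
            p_hi = m2                      # the maximum lies left of m2
    p = (p_lo + p_hi) / 2.0
    return p, eval_p(p)
```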
We use EPA*(S_R) to denote the maximum E(P)/A value for a given S_R, which is achieved by the kernel procedure. The outer loop sequentially removes appropriate candidate elements from the set S_R to determine the optimal set S_R*. In summary, the complete method for maximizing E(P)/A under the pipeline performance model is outlined in Algorithm 1.

Algorithm 1: Near-Optimal Redundancy Insertion and Sizing Algorithm under the Pipeline/Bottleneck Performance Model.
  Initialize S_R = {1, 2, ..., n}.
  Perform the kernel optimization procedure to find the optimal A'_i value for each i ∈ S_R; record EPA*(S_R).
  For r from 0 to n-1:
    For each j ∈ S_R:
      Perform the kernel optimization procedure on S_R\{j} to find the optimal A'_i value for each i ∈ S_R\{j}.
    Find the j ∈ S_R with the maximum EPA*(S_R\{j}) value.
    Set S_R = S_R\{j}; record EPA*(S_R).
  Find the recorded S_R with the maximum EPA*(S_R) value, and report the corresponding A'_i values.

Additive Performance Model
The heuristic for maximizing E(P)/A under the additive performance model is rather simple (Figure 14). We start with the module M_i with the largest area not already visited and set A'_i = A_i. We gradually decrease A'_i until E(P)/A no longer increases, or until the lower bound has been reached. After the maximal point is reached, we check whether deleting the spare improves the overall E(P)/A; if so, we remove the spare by setting A'_i = 0. The module is then marked as visited, and the process repeats until all modules are visited.

Figure 14. Maximizing E(P)/A for systems with the additive performance model
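In code, the loop of Figure 14 looks roughly as follows. This is a paraphrase under stated assumptions: epa evaluates Equation 2 with the additive P_degrade of Equation 4 for a given spare-area assignment, the shrink step is an arbitrary granularity, and the A_i/5 lower bound anticipates the one used in the experiments below.

```python
# Greedy heuristic for the additive performance model (sketch of Figure 14).
def additive_greedy(A, epa, step=0.05):
    spare = {i: A[i] for i in range(len(A))}              # start with full spares
    for i in sorted(range(len(A)), key=lambda j: -A[j]):  # largest module first
        while spare[i] - step >= A[i] / 5:                # shrink while it helps
            trial = dict(spare)
            trial[i] = spare[i] - step
            if epa(trial) <= epa(spare):
                break
            spare = trial
        no_spare = dict(spare)
        no_spare[i] = 0.0                                 # try dropping the spare
        if epa(no_spare) > epa(spare):
            spare = no_spare
    return spare
```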
4.4 Experimental Results

Pipeline/Bottleneck Performance Model
For the pipeline model, we first present a 5-module system that mimics the AMD Bulldozer architecture [10] in terms of area composition. The Bulldozer dedicates roughly 5% of the chip layout area to instruction fetch with branch prediction (M_1), 11% to decode logic (M_2), 15% to execution units (M_3), 12% to memory access organization such as load/store queues (M_4), and the remaining 57% to cache and other functions (M_0). System configurations for the fixed original modules (A_Ori, Y_Ori) and the calculated RRMs (A_RRM, Y_RRM) are listed in Table III. In this example, the overall yield of the chip (Y) is Y_0×Y_1×Y_2×Y_3×Y_4 = 0.2. The P-A function parameters conform to Pollack's Rule, with offsets to ensure that the original module performances are equal, conforming to the performance model of the linear system.

Table III. Experimental results for a 5-module system
       A_Ori      Y_Ori (d=0.037)    A_RRM         Y_RRM         P-A Function
  M_0  A_0 = 57   Y_0 = 1*           -             -             -
  M_1  A_1 = 5    Y_1 = 0.83         A'_1 = 1.19   Y'_1 = 0.96   P_1 = A_1^0.5 + 1.64
  M_2  A_2 = 11   Y_2 = 0.66         A'_2 = 4.71   Y'_2 = 0.84   P_2 = A_2^0.5 + 0.56
  M_3  A_3 = 15   Y_3 = 0.57         A'_3 = 7.44   Y'_3 = 0.76   P_3 = A_3^0.5
  M_4  A_4 = 12   Y_4 = 0.64         A'_4 = 5.38   Y'_4 = 0.82   P_4 = A_4^0.5 + 0.41
  Y = 0.2; P_gold = 3.87; P_degrade = 2.64; EPA_Traditional = 0.0166; EPA_RRI = 0.0201 (20.57% increase)
  * Y_0 can be set to any number; it does not influence the increase percentage of EPA_RRI.

The RRM area values (A_RRM) are computed using Algorithm 1. We compare the RR design to a baseline design using optimal traditional redundancy; that is, while limiting each module to at most one identical spare, we choose the optimal configuration. The E(P)/A of the RRI design (EPA_RRI) is 20.57% higher than that of the baseline design (EPA_Traditional). In Figure 15 we plot the E(P)/A increase percentage of RRI designs over the baseline for different Y values. When Y is as low as 0.1, i.e. the defect density is high, RRI outperforms the baseline, which is itself already vastly superior to the original design, by over 50%. When technology matures and Y increases, the benefit of RRI diminishes, but RRI never underperforms the baseline design when Y < 0.8. Another important observation is that for Y > 0.67 the baseline design sheds all redundancy, i.e. traditional redundancy is no longer helpful, and the best design is the bare design. However, even when traditional redundancy cannot improve Y/A, RRI still has room for a 10% improvement in E(P)/A at Y = 0.7. This means that RRI remains advantageous even as yield learning occurs.

Figure 15. The increase in E(P)/A for RRI with respect to the original yield (baselines: traditional redundancy and no redundancy)

Clearly the effectiveness of RRI also depends heavily on the P-A functions. We compare the systems described on the left of Table IV with different P-A function parameters. In our formulation, the exponent m is the dominating factor in determining the relationship between area and performance; larger m indicates that the performance penalty from reduced area is greater. The plot on the right of Table IV (which sweeps m from 0.3 to 0.9 and reports the E(P)/A increase percentage over the baseline against the original yield) shows that regardless of the original yield, the benefit of RRI starts to diminish as the P-A function approaches linearity, which is achieved when m = 1.

Table IV. Effects of performance characteristics on RRI E(P)/A
       A_Ori      P-A Function
  M_0  A_0 = 57   -
  M_1  A_1 = 5    P_1 = A_1^m + j
  M_2  A_2 = 11   P_2 = A_2^m + k
  M_3  A_3 = 15   P_3 = A_3^m
  M_4  A_4 = 12   P_4 = A_4^m + n

Next we consider a 7-module system configuration (Table V) to demonstrate the effectiveness of the algorithm for larger problem sizes. The lower bound of A_RRM is set to A_Ori/5, which means that the RRM cannot have an area value less than 20% of the area of the original module. This bound ensures that we do not arrive at RRMs that are so small that they are impossible to design.
Table V. Experimental results for a 7-module system
       A_Ori      Y_Ori         A_RRM        Y_RRM         P-A Function
  M_0  A_0 = 15   Y_0 = 1       -            -             -
  M_1  A_1 = 6    Y_1 = 0.75    A'_1 = 4.4   Y'_1 = 0.81   P_1 = A_1^0.95
  M_2  A_2 = 5    Y_2 = 0.79    A'_2 = 1.0   Y'_2 = 0.95   P_2 = A_2^0.5 + 3.25
  M_3  A_3 = 5    Y_3 = 0.79    A'_3 = 1.5   Y'_3 = 0.93   P_3 = A_3^0.6 + 2.86
  M_4  A_4 = 4    Y_4 = 0.83    A'_4 = 0.9   Y'_4 = 0.96   P_4 = A_4^0.6 + 3.19
  M_5  A_5 = 4    Y_5 = 0.83    A'_5 = 1.4   Y'_5 = 0.93   P_5 = A_5^0.7 + 2.85
  M_6  A_6 = 4    Y_6 = 0.83    A'_6 = 3.6   Y'_6 = 0.84   P_6 = A_6^1.6 - 3.70
  Y = 0.3; P_gold = 5.49; P_degrade = 4.11; EPA_Traditional = 0.0602; EPA_RRI = 0.059263 (19.28% increase)

To detail the design space exploration process for the RRI configurations (A_RRM), Figure 16 plots the P-A functions for each module. The original designs occupy the leftmost point on each curve; moving right along the curves corresponds to employing RRMs with smaller area. The chosen RRI design degrades performance by roughly 25%, and its E(P)/A is 19% higher than that of the baseline.

Figure 16. Design exploration under the pipelined performance model

Additive Performance Model
We present a 10-module arbitrary system configuration (left side of Table VI) for the additive performance model. The lower bound of A_RRM is set to A_Ori/5. We plot the E(P)/A improvements over the baseline design (right side of Table VI, against the original yield of the entire system without redundancy), and observe that the conservatively estimated E(P)/A value of RRI outperforms the baseline by over 10% when Y < 0.6. Once again we see that when Y > 0.6, RRI can outperform the bare design even when the traditional redundancy design cannot.

Table VI. Experimental results for a 10-module system
       A_Ori      P-A Function
  M_0  A_0 = 35   -
  M_1  A_1 = 10   P_1 = A_1^0.6
  M_2  A_2 = 9    P_2 = A_2^0.5
  M_3  A_3 = 8    P_3 = A_3^0.55
  M_4  A_4 = 7    P_4 = A_4^0.4
  M_5  A_5 = 6    P_5 = A_5^0.6
  M_6  A_6 = 5    P_6 = A_6^0.6
  M_7  A_7 = 4    P_7 = A_7^0.8
  M_8  A_8 = 3    P_8 = A_8^0.7
  M_9  A_9 = 2    P_9 = A_9^0.9

4.5 Conclusions
In this chapter we presented a novel redundancy insertion technique termed Reduced Redundancy Insertion (RRI) that overcomes some of the drawbacks of conventional redundancy. Instead of using spares that are identical to the original modules, we appropriately choose spares that trade off smaller area and higher yield at the cost of degraded performance. Under such schemes, each wafer may produce dice that have different revenues, and we must co-optimize yield, area and performance to achieve maximum revenue per wafer. For this we introduced a new metric called Expected Performance per Area, along with algorithms to maximize it.

Chapter 5. Hybrid Shared Redundancy Insertion

5.1 Introduction
In this chapter we introduce another redundancy insertion variant, which also serves to enhance E(P)/A at the design stage. For this body of work we turn our attention to multi-core CPU architectures.
Modern CPUs have turned to integrating multiple cores onto a single die to increase performance through computational parallelism [11], an answer to the power consumption and heat dissipation issues obstructing the path to higher clock frequencies [12]. The Intel Core i3/i5/i7 series [13] typically includes two or four physical cores for mainstream products and up to eight cores for high-end markets. The latest lineup of A6 processors from Apple contains two ARMv7 based custom CPUs [14]. The multi-core era does not necessarily mean that each core in the CPU will regress towards simplicity. Currently, each core in the Intel i7-3770 series still possesses powerful multithreading capabilities, and it is predicted that the complexity of each core will continue to increase to deliver enough processing power to satisfy growing program computation demands [15]. Such new parallel architectures with complex individual processing units necessitate a new framework for yield enhancement and redundancy insertion. We have reviewed coarse grained core level redundancy for SIMD architectures and fine grained module redundancy, and in this chapter we marry the two concepts by 1) introducing a hybrid spare sharing redundancy insertion scheme that combines the advantages of the above two approaches while carefully leveraging the associated area and performance overheads, 2) presenting an extensively verified, systematic and scalable model to evaluate the quality of the final design in terms of projected revenue per wafer, and 3) developing a maximization algorithm to determine near optimal redundancy configurations during the design stage. We address a class of multi-core processors different from massively parallel architectures (e.g. GPUs), namely CPUs, where 1) each core contains a private cache, 2) the complexity of each core is high and 3) the number of cores is small. In such systems core level redundancy will often not be the best choice. More importantly, aside from area overheads, we also consider the other important side effect of redundancy insertion, namely performance degradation.

5.2 Problem Definition

Case Study Prelude
We first examine a simplified case study to serve as an introduction to Shared Redundancy (SR). Consider an abstract microprocessor that consists of two identical cores, C_1 and C_2. Each core has a cache module M_0 and an execution engine M_1, manufactured on a wafer of size A_w. The area and yield of M_i are A_i and Y_i, respectively. Yield models are used to estimate Y_i. In previous chapters, we established that M_0 can achieve a high yield (Y_0) without the aid of further redundancy. M_1, however, does not share the same characteristics as M_0. Since the failure of M_1 will cause the entire chip to be discarded, M_1 needs to be more robust. Figure 17 displays three versions of designs to be examined: (1) the bare design, (2) the traditional redundancy design and (3) the SR design, which in this small system is similar to existing core level redundancy insertion in SIMD architectures. The performance of each core is P_1, which is derived from architectural simulations and measured in the appropriate metrics, such as instructions per cycle (IPC). Here we evaluate the system performance by taking the sum of the performance values of each core, which in this case is 2P_1. In reality, system performance would be application dependent, but strongly correlated with the aforementioned sum.

Figure 17. Case study for hybrid redundancy
Version 1 Design (No Redundancy)
The first version design has no redundancy and is also referred to as the bare design. From this point on we assume that the yield of any module is independent of the yield of other modules (which holds for the Poisson yield model), and that the entire system will fail if any core fails. That is, a four core system cannot be sold as a three core system if one core is defective. Under these assumptions, the yield of the entire chip is Y_0^2·Y_1^2, and the Y/A of the chip is Y_0^2·Y_1^2/(2A_0+2A_1). For each silicon wafer of size A_w, the number of marketable dice that operate at a performance of 2P_1 is approximately the floor of A_w·Y_0^2·Y_1^2/(2A_0+2A_1).

Version 2 Design (Traditional Redundancy)
Version 2 has one spare for M_1 per core. The steering logic consists of forks and joins. Note that each spare is dedicated to its core and cannot be used by other cores. The yield of the entire chip is Y_0^2·[1-(1-Y_1)^2]^2. For each wafer of size A_w, neglecting the area of the steering logic for now (it is modeled in later sections), the number of marketable dice is A_w·Y_0^2·[1-(1-Y_1)^2]^2/(2A_0+4A_1). By definition, the performance of the entire system is 2P'_1, where the per core performance P'_1 is less than P_1 due to the penalties incurred by the steering logic. We refer to such designs as traditional redundancy designs. Naturally this design can easily be extended to include more than one spare M_1 per core.

Version 3 Design (Shared Redundancy)
The third design utilizes shared redundancy, where the two spares in Version 2 are moved to the "Common Resource Pool" (CRP) and become spares that can be shared by C_1 and C_2. The inclusion of the CRP requires additional steering logic, namely switches inside the CRP, which must provide one-to-one connections between all inputs and outputs. Cascaded forks and joins or simplified crossbars can be used to realize the steering logic. The overall performance can no longer be evaluated as a constant, since it depends on which modules are functional, as seen in Table VII. When all original modules are functional (second row in Table VII), per core performance is assumed to be P''_1, a suboptimal performance that is not necessarily equal to P'_1 due to the different steering logic designs; the spare M_1's in the CRP are not accessed, and the system performance is P''_1 + P''_1 = 2P''_1. When a core has to use a shared M_1 in the CRP in place of a failed original module, the core performance is P_D, where P_D < P''_1, since in this case the wiring length is longer and the shared M_1 is controlled by additional switches. We elaborate on these performance degradations in later sections. The system performance will be P''_1+P_D if the other core is defect free (third row in Table VII), or, in the worst case, 2P_D when both cores have to access the CRP (fourth row in Table VII). When the number of defect free spares in the CRP cannot cover the failed original M_1's, the chip will be discarded, i.e. performance is zero. This design is very robust.
As long as two out of the four M_1's are defect free, the die will be marketable, provided that all of the M_0's and the steering logic are also defect free. The die will need to be discarded if more than two M_1's are defective.

Table VII. Performance ratings for different module status combinations
  M_1 in C_1    M_1's in the CRP    M_1 in C_2    Probability/Yield (excluding Y_0)    System Performance
  Good          Irrelevant          Good          Y_1^2                                2P''_1
  Good          At least 1 good     Bad           Y_1·(1-Y_1)·[1-(1-Y_1)^2]            P''_1 + P_D
  Bad           At least 1 good     Good          Y_1·(1-Y_1)·[1-(1-Y_1)^2]            P''_1 + P_D
  Bad           Both good           Bad           Y_1^2·(1-Y_1)^2                      2P_D
  All other configurations                        1 - Σ(probabilities above)           0

For a wafer of size A_w, a total of A_w/(2A_0+4A_1) dice will be produced. The expected number of dice per wafer that operate at performance 2P''_1 is Y_0^2·Y_1^2·A_w/(2A_0+4A_1). Similarly, the expected numbers of dice that will operate at the lower performances of P''_1+P_D and 2P_D are 2·Y_0^2·Y_1·(1-Y_1)·[1-(1-Y_1)^2]·A_w/(2A_0+4A_1) and Y_0^2·Y_1^2·(1-Y_1)^2·A_w/(2A_0+4A_1), respectively. The marketing appeal of a particular chip depends on various non-deterministic commercial aspects. However, the selling price of a chip is often strongly correlated with its performance. Therefore, from a fixed size wafer, our desire is to manufacture chips that can collectively deliver a large amount of performance. As in the previous chapter on RRI, we use the E(P)/A metric to project wafer revenue. The E(P)/A equations for the three designs are listed in Table VIII.

Table VIII. E(P)/A comparisons and improvement of Version 3 design
  Version 1: E(P)/A = 2P_1·Y_0^2·Y_1^2 / (2A_0+2A_1)
  Version 2: E(P)/A = 2P'_1·Y_0^2·[1-(1-Y_1)^2]^2 / (2A_0+4A_1)
  Version 3: E(P)/A = Y_0^2·{2P''_1·Y_1^2 + 2(P''_1+P_D)·Y_1·(1-Y_1)·[1-(1-Y_1)^2] + 2P_D·Y_1^2·(1-Y_1)^2} / (2A_0+4A_1)

To illustrate the benefits of shared redundancy at the module level, we perform a quantitative analysis by setting P_1 = 1, P'_1 = P''_1 = 0.95, P_D = 0.85, A_1 = 5 and A_0 = 5. Y_1 is calculated using the Poisson yield model Y = e^(-d·A), where d is the defect density, which depends on the technology and manufacturing maturity. In Figure 18 we plot the percent improvement of the Version 3 design over the other two designs for different values of Y_1. As shown by the plot, for high defect densities, i.e. small Y_1 values, SR outperforms traditional redundancy by over 30%. For Y_1 > 0.75, redundancy is no longer helpful regardless of how it is applied.

Figure 18. Percent improvement of the Version 3 design in terms of E(P)/A
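The comparison behind Figure 18 can be reproduced directly from the Table VII/VIII expressions; the sketch below does so for a few Y_1 values (the function names are ours, and the constants are the ones quoted above).

```python
# E(P)/A for the three designs with P1=1, P'1=P''1=0.95, PD=0.85, A0=A1=5.
def epa_v1(Y0, Y1):
    return 2 * 1.00 * (Y0 * Y1) ** 2 / (2 * 5 + 2 * 5)

def epa_v2(Y0, Y1):
    return 2 * 0.95 * Y0 ** 2 * (1 - (1 - Y1) ** 2) ** 2 / (2 * 5 + 4 * 5)

def epa_v3(Y0, Y1):
    ok_crp = 1 - (1 - Y1) ** 2                             # >= 1 CRP spare good
    e_perf = (2 * 0.95 * Y1 ** 2                           # both cores internal
              + 2 * (0.95 + 0.85) * Y1 * (1 - Y1) * ok_crp # one core uses CRP
              + 2 * 0.85 * Y1 ** 2 * (1 - Y1) ** 2)        # both cores use CRP
    return Y0 ** 2 * e_perf / (2 * 5 + 4 * 5)

for Y1 in (0.2, 0.4, 0.6, 0.8):   # >30% gain over Version 2 at low yields
    print(Y1, epa_v3(1.0, Y1) / epa_v2(1.0, Y1) - 1,
              epa_v3(1.0, Y1) / epa_v1(1.0, Y1) - 1)
```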
Formal Definition
We now formally define the system architecture and the E(P)/A model for hybrid redundancy that has both shared modules and private modules associated with each core.

Figure 19. Hybrid redundancy

System Architecture
The system, illustrated in Figure 19, consists of k identical independent processing units, which we refer to as cores, denoted {C_1, C_2, ..., C_k}. The system can only operate when all k cores are functional. Each core C_i has t + 1 modules labeled M_0, M_1, ..., M_t, with yield values Y_0, Y_1, ..., Y_t, respectively, all of which except Y_0 are calculated by the same yield model. M_0 is a collection of hardware blocks that can achieve high yield (Y_0) without explicit redundancy (e.g. ECC protected cache). Stage j consists of the original M_j modules in all cores. We assume independence between module yields, which holds for all defect-oriented yield models. Within each core C_i, the original module at stage x (M_x) can include n_x - 1 dedicated spare modules that can only be utilized by C_i. M_x can also be replaced by any of the m_x shared spare modules located in the CRP; other cores can also access these m_x spares. Spares and steering logic located inside each core are viewed as internal, while components in the CRP are viewed as external. For large systems, our optimized SR design may include internal as well as external spares, and we refer to this as "Hybrid Shared Redundancy" (HSR) to differentiate our work from existing spare sharing schemes. Assume the reference performance for each core is P_gold, achieved when no redundancy is applied. When m_x = 0 for all x, the system corresponds to the traditional redundancy designs explored in [13][14][15], which we use as the baseline for comparison. Intuitively, if the performance loss and wiring overhead were overlooked, moving all internal spares into the CRP would always be advantageous. In the following sections, we use a scalable model based on the concept of E(P)/A to account for all the redundancy induced overheads.

E(P)/A Model and Analysis
The E(P)/A of the system in Figure 19 is defined in Equation 5:

E(P)/A = Y_0^k·[k·P_Sub·Pr{chip nominal} + α·Pr{chip functional but degraded}] / [k·(A_0 + Σ_{x=1..t} n_x·A_x) + Σ_{x=1..t} m_x·A_x + A_Overhead]    (Equation 5)

where Pr{chip nominal} is the probability that every core covers every stage with its internal resources, and Pr{chip functional but degraded} is the probability that the chip survives only by drawing on the CRP; both expand into products of per-stage terms over the k, m and n values. System designers and the manufacturing process determine the area (A), yield (Y) and golden core performance (P_gold) values. The objective of our HSR design is to find a redundancy configuration, characterized by the values of k, m and n (the m and n values are separate for each stage), that maximizes the E(P)/A of the entire chip, which in turn leads to maximized wafer revenue. Since Equation 5 is written in closed form, it is scalable to large systems. The denominator in Equation 5 is simply the area of the die, including the area of all the internal and external spares, as well as the area overhead (A_Overhead) from the steering logic and extra wiring. P_Sub and P_D are the degraded core performance values that depend on the values of k, m and n. P_Sub is the performance of a core when the CRP is included but not accessed, i.e. all original modules are fault free; by definition the system performance in this case is k·P_Sub. P_D is the worst-case performance of a core that has to access at least one shared spare in the CRP. The formulations of P_Sub, P_D and A_Overhead are verified in the following sections.
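One way to make the probability terms of Equation 5 concrete is sketched below, under stated assumptions: shared spares at a stage are interchangeable, have the same yield Y_x as the originals, any working spare can serve any core, and, pessimistically, every functional-but-degraded die is valued at k·P_D (the lower-bound pricing adopted later in this section). This is an illustration of the model, not the thesis's exact implementation.

```python
from math import comb

def stage_probs(Yx, n, m, k):
    """Per-stage probabilities for k cores, n internal copies (original plus
    n-1 dedicated spares) per core, and m shared spares in the CRP."""
    q = 1 - (1 - Yx) ** n                   # a core covers stage x internally
    nominal = q ** k                        # no core touches the CRP
    functional = 0.0
    for f in range(k + 1):                  # f cores fall back on the CRP
        p_f = comb(k, f) * (1 - q) ** f * q ** (k - f)
        # at least f of the m shared spares must be defect free
        covered = sum(comb(m, j) * Yx ** j * (1 - Yx) ** (m - j)
                      for j in range(f, m + 1))
        functional += p_f * covered
    return functional, nominal

def epa_hsr(Y0, A0, Y, A, n, m, k, P_sub, P_D, A_overhead):
    """Pessimistic Equation 5: degraded-but-functional chips priced at k*P_D."""
    functional = nominal = 1.0
    for x in range(len(Y)):
        f_x, n_x = stage_probs(Y[x], n[x], m[x], k)
        functional *= f_x
        nominal *= n_x
    area = (k * (A0 + sum(n[x] * A[x] for x in range(len(A))))
            + sum(m[x] * A[x] for x in range(len(A))) + A_overhead)
    e_perf = Y0 ** k * (k * P_sub * nominal + k * P_D * (functional - nominal))
    return e_perf / area
```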
α is the expected system performance when the CRP is accessed, which should be a weighted sum of P_Sub and P_D depending on which cores are accessing the CRP. Such variation in performance across cores, which can already be seen in ARM's big.LITTLE architecture [16], can be handled with tailored scheduling algorithms. To avoid the numerous sets of degraded performance representations, we take an extremely conservative approach, illustrated below. Assume the system has k = 2 cores and t = 2 modules excluding M_0, and that the CRP has one shared copy each of M_1 and M_2. In Figure 20a, M_1 and M_2 in C_2 have both failed (indicated by dashed gray lines), causing C_2 to employ two shared spares from the CRP. The performance of C_2 will degrade, but not that of C_1. Since this is the worst case for C_2, i.e., all of its stages access the CRP, we set its core performance to P_D; the system performance should therefore be P_Sub + P_D. For the system configuration in Figure 20b, both cores access the CRP, but neither exhibits the worst-case performance P_D. In Equation 5, however, for the sake of mathematical convergence in the formulation, the system performance of both scenarios is set to 2P_D. This formulation strategy extends to larger systems. These assumptions guarantee that we never overestimate the E(P)/A values of HSR based designs in later discussions. An even more pessimistic estimation, which we use in our analysis, is to set all degraded system performances to k·P_D. In this fashion, only two types of chips need to be marketed: those with nominal performance k·P_Sub and those with degraded performance k·P_D. The E(P)/A values of HSR designs calculated this way will be severely underestimated, and will serve as a lower bound on the actual E(P)/A value.

Figure 20. The evaluation of worst-case performance

5.3 Model Verification
In this section we show that our system model is grounded in simulation data and rigorous analysis. We understand that Equation 5 needs to be adjusted for each specific target design, but we believe that our abstractions provide 1) a starting platform for further accurate calculations and 2) important early estimations.

Performance Degradation Formulation Verification
The performance degradation introduced by the use of HSR cannot be precisely quantified, due to its strong dependence on architecture details, technology and even the application. Equation 5 does not serve to offer universally applicable, accurate numerical values for P_Sub or P_D, but rather to capture the trends in performance degradation with respect to changes in the redundancy configuration. This will help system designers clearly understand the tradeoffs when applying redundancy. Compared to the bare design, performance degradation stems from three sources: (1) the internal steering logic, (2) the external steering logic, and (3) the wiring delay to the CRP. Table IX shows the IO complexity of the steering logic with respect to the number of internal (n) and external (m) spares. Note that the internal steering logic negatively impacts performance in traditional redundancy designs as well; for technologies plagued with high defect densities, this performance overhead is inevitable. The internal steering logic complexity depends on the number of internal spares (n - 1) and on whether the number of shared spares for the stage is greater than zero. In Equation 5 we model the delay incurred by the internal steering logic as linearly correlated with the maximum of all n values (n_θ). The external steering logic only incurs delays when the CRP is accessed. We assume a linear correlation between this external delay and two parameters: 1) the maximum number of external spares (m_θ), and 2) the number of cores sharing the CRP (k).
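Such a linear model can be calibrated from a handful of post-layout measurements by ordinary least squares, as sketched below. The sample readings are hypothetical placeholders (not the thesis's measurements), and the fitted slopes play the role of the γ coefficients.

```python
# Illustrative fit of delay ~ c0 + c1*n + c2*m + c3*k to placeholder data.
import numpy as np

# columns: n, m, k, measured path delay (ns) -- hypothetical values
data = np.array([
    [1, 0, 2, 0.85], [2, 0, 2, 0.95], [3, 0, 2, 1.05],
    [2, 1, 2, 1.25], [2, 2, 2, 1.32], [2, 3, 2, 1.40],
    [2, 2, 4, 1.42], [2, 2, 6, 1.52],
])
X = np.hstack([np.ones((len(data), 1)), data[:, :3]])   # [1, n, m, k]
coef, *_ = np.linalg.lstsq(X, data[:, 3], rcond=None)
print("c0..c3 =", coef)   # the slopes correspond to the gamma coefficients
```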
Figure 21. Steering logic complexity

The linearity of the performance degradation with respect to different values of k, m_θ and n_θ is supported by post-layout delay measurements. The experiments were done with the 180nm tsmc02 technology library using the Cadence Virtuoso tool. Modules are simply built with a series of oversized buffers, and the steering logic components are built using multiplexers. We considered a wide range of steering logic complexities. For each stage, we measure the stage delay from the input of the fork to the output of the join, as indicated in Figure 21.

Table IX. Steering logic complexity
  Component          Inputs*     Outputs*
  Core: Fork         1           n[i]+1
  Core: Join         n[i]+1      1
  CRP: Switch 1      k           m[i]
  CRP: Switch 2      m[i]        k
  * The width of each unit input or output depends on the functionality of the module.

The data shown in Figures 22a and 22b are based on a system with k = 2 cores. Figure 22a illustrates that as the number of internal spares included per stage (n) increases, the stage delay increases linearly for all values of m. When m = 0 and n = 1, no redundancy is included, hence the sudden increase in stage delay when n > 1. The curve for m = 0 in Figure 22a also represents the delay of the internal path without propagation through the CRP. Similarly, in Figures 22b and 22c, a linear correlation between the stage delay and the values of m and k can be found.

Figure 22. The impact of steering logic complexities on stage delay: (a) path delay vs. n for m = 0-3; (b) path delay vs. m for n = 1-4; (c) path delay vs. k for several m/n configurations

The extra wiring delay will depend on the final layout of the design, but clearly, for large k the routing becomes more complex, leading to increased wire delays. We assume the wire delay increases linearly with respect to k, though a more optimistic logarithmic wire delay model can be used. Determining the exact values of the scaling coefficients (γ_1, γ_2, D_wire) requires detailed knowledge of the target design and technology that is beyond the scope of this work; for our experiments, we set these parameters to reasonable values. We would like to point out that the redundancy-induced delays in each stage do not necessarily translate into global system performance degradation. For example, in pipelines, if the delay from utilizing HSR in each stage falls within the timing slack, core performance will not degrade. If the slack cannot mask the delay, the clock frequency can be lowered, in which case P_D and P_Sub are directly linked to the complexity of the steering logic and the CRP related wiring. In modern complex systems, such as Out-of-Order (OoO) microprocessors, shared spares in the CRP can instead consume extra clock cycles to cope with the additional delay, rather than requiring the clock to be tuned. To provide further insight on this, we analyzed a particular module in an OoO processor. These simulations were done using the SimpleScalar [17] toolset on a subset of the SPEC2000 benchmarks. We will demonstrate that when employing HSR in large complex systems, core performance degradation can potentially be alleviated. In this case study, the core is an OoO processor that contains four integer functional units (IFUs). In case any of the four IFUs fails, it can be replaced by IFUs in the CRP. We conservatively assume that the elongated delays from the HSR setup (wiring and extra steering logic) exceed the timing slack.
Area Overhead Formulation Verification

We analyze the area overhead (A_Overhead) of the steering logic and wiring by studying the crossbar-based design shown in Figure 23. Our analysis can provide early estimations of the area overhead without the complete circuit layout. For multiplexer based steering logic designs, the area overhead would have to be approximated using the same method. We assume module widths are comparable. For each core at stage i, the height and width of the internal fork/join are O(1) and O(n_i) respectively. All k cores will have identical internal steering logic overheads. The external switches in the CRP, on the other hand, have height O(k) and width O(m_i). Given that all t stages can contain internal and/or shared spares, the total steering logic area overheads for the entire system will be O(t * k * n_theta) and O(t * k * m_theta) respectively. In Eq. 1 we scale these parameters with the coefficients rho_1 and rho_2.

Figure 23. Steering logic complexity illustration (internal fork/join of each core: height O(1), width O(n_i); external CRP switches serving the other cores: height O(k), width O(m_i))
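This first-order area estimate is simple enough to state as code. In the sketch below, the scaling coefficients rho1 and rho2 default to the values later used in Table XV; the function itself and the uniform treatment of stage widths are illustrative assumptions.

    # First-order steering-logic area estimate: O(t*k*n_theta) + O(t*k*m_theta),
    # scaled by rho1/rho2 as in the text (defaults taken from Table XV).
    def steering_area_overhead(t, k, n_theta, m_theta, rho1=0.4, rho2=0.5):
        internal = rho1 * t * k * n_theta   # fork/join logic in every core, every stage
        external = rho2 * t * k * m_theta   # CRP switches shared by the k cores
        return internal + external

    # Using the t = 5, k = 2 configuration of the later experiments:
    print(steering_area_overhead(t=5, k=2, n_theta=2, m_theta=1))  # -> 13.0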
Case Study: A HDL Prototype Design

We have designed a synthesizable Verilog prototype to substantiate our modeling assumptions. We are mainly interested in how the steering logic and associated wiring required for using shared redundancy impact area and performance after automated place and route. The design is composed of two ASIC processors that operate in parallel (i.e. independent of each other). Each ASIC processor performs noise attenuation on an 8x8 gray scale digital image in hardware, without compiler or instruction stream support. The image is stored as a two dimensional array of integers (representing the gray scale values) ranging from 0 to 4095 inside a block of memory.

Figure 24. Noise attenuation concept

The noise attenuation process works in the following fashion. The image is stored inside memory. The processor examines each pixel in the image individually, reads its gray scale value, and overwrites it with the average gray scale value of its four (north, south, east and west) neighboring pixels. The averaging procedure consists of a series of clocked sequential binary additions followed by a binary division, which means that each ASIC processor includes one adder, as part of an accumulator, and one divider. Figure 24 depicts the noise attenuation process: a dark spot of gray scale 100 is a noise pixel amongst a pure white image. When examining this pixel, the processor sums up the values of the four neighboring cells, takes the average, (255 + 255 + 255 + 255)/4 = 255, and overwrites it back into the location corresponding to the pixel under examination.
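In software, the per-pixel operation reduces to a few lines. The sketch below mirrors the described hardware behavior; whether updates happen in place or from a buffered copy, and how border pixels are treated, are left open by the text, so the buffered, border-skipping version here is an assumption.

    # Software mirror of the described hardware pass: each pixel is replaced by
    # the average of its four (N, S, E, W) neighbors. Buffered updates and
    # skipped borders are assumptions; the text leaves both unspecified.
    def attenuate(img):
        out = [row[:] for row in img]
        for r in range(1, len(img) - 1):
            for c in range(1, len(img[0]) - 1):
                out[r][c] = (img[r-1][c] + img[r+1][c] +
                             img[r][c-1] + img[r][c+1]) // 4
        return out

    img = [[255] * 3 for _ in range(3)]
    img[1][1] = 100                  # the noise pixel from Figure 24
    print(attenuate(img)[1][1])      # -> 255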
This case study focuses on the 16 bit adder in each of the two ASIC processors: we wish to observe the area and performance overhead of adding internal and external (shared) spares for this adder. In terms of area, the adder is approximately 3% of each ASIC processor. Synthesis and place and route are done with the Xilinx ISE Design Suite; the target device is a Xilinx Spartan 6, package CSG324. First, we report the performance and device utilization summary for the n = 1 bare design without any redundancy, as shown in Table XI. Since the automated synthesis, place and route procedure can be geared towards the optimization of area or performance, we report two sets of values for each redundancy configuration.

Table XI. Summary of the design before any redundancy insertion

  Optimization   Slice       Slice   Occupied   LUT Flip     Bonded   Minimum
  Objective      Registers   LUTs    Slices     Flop Pairs   IOBs     Period (ns)
  Area           2143        856     617        2400         51       15.686
  Performance    2246        864     647        2501         51       15.942

Next we add two spare adders, along with the necessary steering logic, to each of the two ASIC processors to improve yield. The performance and device utilization summary for this design with n = 3 is presented in Table XII. Compared to Table XI, we can see that adding internal redundancy incurs a performance overhead of 4.95% (when optimizing for area) or -1.11% (when optimizing for performance; this value is negative due to the stochastic nature of the synthesis, place and route procedure). The area overhead (in terms of occupied slices) is 4.05% (when optimizing for area) or 4.9% (when optimizing for performance).

Table XII. Summary of the design after internal redundancy insertion (n = 3)

  Optimization   Slice       Slice   Occupied   LUT Flip     Bonded   Minimum
  Objective      Registers   LUTs    Slices     Flop Pairs   IOBs     Period (ns)
  Area           2143        912     642        2493         53       16.462
  Performance    2328        1029    679        2638         53       15.764

Next, for each ASIC processor we "move" the two internal spare adders into the CRP to arrive at a design with the redundancy configuration n = 1 and m = 4, which increases the global system yield. The performance and device utilization summary for this design is presented in Table XIII. Compared to the n = 3 design in Table XII, the use of shared redundancy changed the area (in terms of occupied slices) by -0.47% or -0.88% (meaning the area actually decreased, due to the random nature of the synthesis, place and route procedure), and degraded performance by 4.81% or 15.3%.

Table XIII. Summary of the design after external redundancy insertion (n = 1, m = 4)

  Optimization   Slice       Slice   Occupied   LUT Flip     Bonded   Minimum
  Objective      Registers   LUTs    Slices     Flop Pairs   IOBs     Period (ns)
  Area           2143        878     639        2459         58       17.254
  Performance    2233        926     673        2583         58       18.156

A HSR approach would be to set n = 2 and m = 2, which allows two external shared spares and one internal spare for each ASIC processor. Theoretically, this HSR design should perform better than the n = 1, m = 4 design, since it sacrifices a certain degree of flexibility in spare replacement. The data in Table XIV suggests that performance did indeed increase, but the area in fact increased as well, most likely stemming from the inefficient placement of internal and external spares. Therefore, the aforementioned overheads should not be taken as definitive values. A custom design that is fully aware of the characteristics of internal/external spares may achieve much better results and converge to our mathematical models. Nevertheless, we have shown, in concept, that hybrid shared redundancy can be superimposed on traditional redundancy without severe area and performance overheads. This in fact motivates a yield enhancement aware CAD flow.

Table XIV. Summary of the design after hybrid redundancy insertion (n = 2, m = 2)

  Optimization   Slice       Slice   Occupied   LUT Flip     Bonded   Minimum
  Objective      Registers   LUTs    Slices     Flop Pairs   IOBs     Period (ns)
  Area           2143        927     648        2512         58       17.038
  Performance    2238        974     686        2581         58       17.902

5.4 Maximization of E(P)/A

The maximization procedure for E(P)/A is an iterative greedy algorithm, illustrated in Algorithm 2.

Algorithm 2: The E(P)/A maximization algorithm for HSR

  initialize m[i] = 0, n[i] = 1 for all stages, MAX[E(P)/A] = 0
  sort_by_area(all stages)
  while (MAX[E(P)/A] has not converged)
      for (i = 1; i <= t; i++)
          while (1)
              m[i] = m[i] + 1
              evaluate(E(P)/A)
              if (E(P)/A > MAX[E(P)/A])
                  MAX[E(P)/A] = E(P)/A
                  store_config()
              else
                  restore_saved_config()
                  break
      repeat the above loop for n
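A runnable rendering of Algorithm 2's control flow is given below. The function evaluate_epa is only a toy stand-in for the full E(P)/A evaluation of Equation 5 (a real run would plug in the actual yield, performance and area model); the loop structure, however, follows the pseudo code.

    # Runnable rendering of Algorithm 2; evaluate_epa() is a toy surrogate.
    def evaluate_epa(m, n):
        # Per-stage yield 1 - 0.5**(n_i + m_i) with a linear area penalty;
        # this stands in for the real Equation 5 model.
        y = 1.0
        for mi, ni in zip(m, n):
            y *= 1 - 0.5 ** (ni + mi)
        return y / (1 + 0.1 * (sum(m) + sum(n)))

    def maximize_epa(t):
        m, n = [0] * t, [1] * t
        best = evaluate_epa(m, n)
        improved = True
        while improved:                 # loop until MAX[E(P)/A] converges
            improved = False
            for vec in (m, n):          # sweep external spares, then internal
                for i in range(t):
                    while True:         # grow stage i while E(P)/A improves
                        vec[i] += 1
                        score = evaluate_epa(m, n)
                        if score > best:
                            best, improved = score, True    # store_config()
                        else:
                            vec[i] -= 1                     # restore_saved_config()
                            break
        return m, n, best

    print(maximize_epa(t=5))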
5.5 Experimental Results

We first examine a system with k = 2, t = 5. The system parameters are presented in Table XV. We vary the defect density (d) to represent different manufacturing situations. Since d dictates the total yield of the bare design, for clarity we focus on Y_Bare in place of d. We compute the optimized HSR design configuration for a range of Y_Bare values, corresponding to different defect densities. We then compare our HSR design with two baselines: 1) the optimal traditional redundancy design (OTRD) [18][19][20], which can potentially shed all redundancies for high Y_Bare values, and 2) designs that only use coarse-grained core level redundancy [21], namely adding one or two spare cores. The OTRD is derived via exhaustive search. For the second baseline design, D_0 will not be included in the spare cores, and A_Overhead = 0.

Table XV. System configuration for k = 2, t = 5

  System Design   D_0*   D_1   D_2   D_3   D_4   D_5
  Area            25     7     7     1     5     2

  Parameters and Coefficients   P_Gold   gamma_1   gamma_2   D_Wire   rho_1   rho_2
  Value                         0.5      0.01      0.02      0.02     0.4     0.5

*The yield of D_0 (Y_0) will not influence the improvement percentage of our HSR designs.

In Figure 25, we plot the percent improvement in terms of E(P)/A values of the HSR design over the optimized baseline designs. The solid lines correspond to the E(P)/A of the HSR designs calculated using Equation 5, while the dashed lines correspond to all degraded performances being set to kP_D, an overly pessimistic estimation.

[Figure 25: percent improvement of HSR (0% - 50%) versus the yield of the original design without redundancy (Y_Bare, 0 - 0.8); solid and dashed curves for each of OTRD, one spare core only, and two spare cores only]

Figure 25. Improvement percentage of the HSR design for k = 2, t = 5

The first observation is that for Y_Bare < 0.7, the HSR designs never underperform the baselines. When Y_Bare > 0.7, HSR is no longer beneficial because the best design is the bare design, i.e. redundancy is no longer useful. Second, compared to the OTRD, the improvement percentage of the HSR design is on average approximately 10% when 0.3 < Y_Bare < 0.7, and peaks at 15%. Notice that for Y_Bare > 0.6 the improvement percentage begins to decline. This is because traditional redundancy loses its advantage once the overall yield surpasses a certain threshold, and the OTRD switches over to the bare design. HSR can fill the gap between the regions in which bare designs still suffer from low yields but traditional redundancy can no longer provide aid. Finally, recall that all of our assumptions are conservative, thus the actual benefits of the HSR design can only be higher than what is reported here.

The above process is repeated for a larger system with k = 4 and t = 10. We select seven Y_Bare values to show the E(P)/A improvements of the HSR design over the baseline designs in Table XVI. For HSR designs with large k, the CRP can be shared by more cores, but the overheads are more prominent, and could dominate the improvement in yield if not for our optimizations. Here, core level redundancy is generally more favorable than OTRD, yet it is still consistently outperformed by our HSR design.

Table XVI. Percent improvement of the HSR design for k = 4, t = 10

  Y_Bare          0.17    0.24    0.37    0.41    0.51    0.58    0.64
  OTRD            16.4%   18.2%   24.6%   26.0%   28.4%   30.0%   27.4%
  1 Spare Core    26.7%   20.3%   16.9%   16.2%   15.0%   14.3%   13.9%
  2 Spare Cores   18.5%   17.6%   21.5%   22.5%   24.4%   25.7%   16.8%

5.6 Conclusions

In this chapter, we have outlined a new approach towards yield enhancement through "Hybrid Shared Redundancy" (HSR) in modern multi-core processors. HSR combines the concept of shared redundancy found in SIMD architectures with intra-core redundancy to reap greater improvements in yield while leveraging the performance and area overheads. We developed a scalable model that estimates the wafer revenue for redundancy augmented designs.
By adopting the framework presented in this chapter, system architects can explore different redundancy designs during the early design stage. To this end we have presented an algorithm to compute the near optimal HSR configuration for the target multi-core processor under a predicted defect density. Experimental results show that HSR designs significantly outperform traditional intra-core or spare sharing redundancy designs for almost the entire defect density spectrum.

Chapter 6. Unified Revenue and Yield Aware Design Optimization

6.1 Introduction

In order to incorporate yield as a primary objective in the design process for future technologies, an overhaul of the existing design flow and design software toolset is necessary. Most current methodologies for yield enhancement are ad-hoc and remain decoupled from the rest of the design flow. For example, when module level redundancy [18][19][20][77] needs to be applied, the traditional design process, depicted in Figure 26a, behaves as follows:

Step 1: Take the "bare design" produced without considering yield, and insert the appropriate number of spare modules without knowledge of the floorplan.

Step 2: Feed the redundancy augmented design to the place and route/floorplanning tool as a static input. The place and route/floorplanning tool simply operates as normal, unaware of the fact that certain modules in the input serve as spares to enhance the overall die yield.

In this chapter we introduce the "Unified Revenue and Yield Aware Design Optimization Framework (URaY)", a novel computer-aided design (CAD) optimization framework for microprocessor based systems. URaY only targets random logic modules (e.g. decoders, ALUs and instruction schedulers) since the on-die cache can achieve high native yields with well-known techniques due to its structural uniformity. The URaY design flow is illustrated in Figure 26b. Unlike traditional methods for yield enhancement, URaY co-optimizes yield, area and performance from a macroscopic system level to maximize overall revenue while considering (1) yield learning, (2) the chip floorplan, and (3) performance degradation due to global wiring. Unlike traditional methods for floorplanning, the target system in URaY is dynamic due to the integration of the yield enhancement process. Experimental results show that when compared to a non-holistic design flow using optimal redundancy configurations (HYPER [77]) and the most powerful block floorplanner available (DeFer [78]), URaY is able to achieve 18 - 20% improvement in overall revenue. URaY is also very flexible: it supports the plug-in of other yield, delay, testing cost and pricing models.

Figure 26. Traditional yield aware design flow and the design flow of URaY. (a) The traditional flow: yield enhancement (e.g. redundancy insertion) with steering/testing logic, followed by a separate place and route/floorplanning step. (b) The URaY flow: floorplan visualization, global wiring estimation, yield estimation and revenue calculation in one loop, with testing costs, pricing model, delay model and yield model as "plug-in" parameters.

6.2 System Model

When leveraging the tradeoff between yield, area and performance, the ultimate goal for URaY, and the manufacturer, is to maximize the overall revenue, calculated as:

Revenue = sum over t from t_1 to t_2 of [ W(t) * (A_wafer / A_Die) * Y_Die(t) * S(t) ]

where W(t) is the wafer production volume at time t (note that here we assume all chips manufactured from these W(t) wafers can be sold), S(t) is the projected selling price of the chip at time t, A_wafer and A_Die are the wafer and die areas, Y_Die(t) is the die yield at time t, and t_1 and t_2 are the times when mass production of the target system begins and ends, respectively.
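In code, this calculation is a direct summation over the discrete production timeline; the example arguments below (constant volume and price, a synthetic learning curve) are placeholders.

    # Discrete-time transcription of the revenue formula above.
    def revenue(W, S, Y_die, A_wafer, A_die, t1, t2):
        return sum(W(t) * (A_wafer / A_die) * Y_die(t) * S(t)
                   for t in range(t1, t2 + 1))

    # Placeholder inputs: constant volume and price, a synthetic learning curve.
    print(revenue(W=lambda t: 1e6, S=lambda t: 100.0,
                  Y_die=lambda t: min(0.9, 0.3 + 0.05 * t),
                  A_wafer=1e3, A_die=25.0, t1=0, t2=10))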
System Inputs and Outputs

Inputs to URaY include:

1) A bare system design, modeled as a collection of rotatable, hard rectangular blocks with arbitrary aspect ratios. These blocks can represent abstracted functional components such as decoders, ALUs or controllers. Microprocessor based systems always contain at least one cache block. Modules are connected to each other via buses. The system architects must specify the timing critical path, expressed as a directed path amongst modules. Note that our definition of the timing critical path differs from the traditional definition in pure combinational logic. We illustrate with the following example regarding the five stage pipeline of Figure 27a. The performance of this pipeline depends on many factors, such as the combinational delay of the slowest stage, the presence of feedback loops for data forwarding, branch prediction mechanisms, etc. In this work, since we are focused on high level CAD frameworks for DFY, we use a generic model to capture the relationship between the system specifications and the system performance without using SPICE simulations. In traditional static timing analysis, the critical path is the path through a series of gates that incurs the worst combinational delay. We extend this concept to a more abstract level, and define the timing critical path as a path through a series of modules that bottlenecks system performance. The timing critical path of the pipeline in Figure 27a is indicated in red in Figure 27b; it could represent the data movement of a "lw" (load word) instruction. Note that here the timing critical path may pass through several pipeline stage registers, because it is an evaluation at the system level and not the gate level. This system level timing critical path is defined by the system architects through architectural simulations, and given as an input to URaY.

Figure 27. Five stage pipeline example. (a) Architecture diagram: instruction memory, instruction decode/register fetch, execution, data memory and write back stages. (b) Chip diagram of the same five modules, with the timing critical path highlighted.

2) A delay model specifying the relationship between the half perimeter wire length of the timing critical path (henceforth referred to as HPWL) and wire delay. The Elmore delay model [79] states that wire delay is proportional to wire length squared. With appropriate wire sizing and repeater insertion, the global wire delay can be abstracted to a linear function of the HPWL [80]. The linear delay model states that, without detailed static timing analysis, delay is approximated as linear in the Manhattan distance between the source and destination, regardless of the logic in between. Therefore, the pipeline stage registers that intercept the timing critical path do not undermine our desire to minimize the HPWL.
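For concreteness, a sketch of how the HPWL of the timing critical path can be evaluated; treating each hop along the path as a two-pin connection and representing each module by its center point are assumptions, since the text does not fix pin locations.

    # HPWL of the timing critical path: each hop is treated as a two-pin
    # connection and each module as its center point (both assumptions).
    def hpwl(path, centers):
        total = 0.0
        for a, b in zip(path, path[1:]):
            (xa, ya), (xb, yb) = centers[a], centers[b]
            total += abs(xa - xb) + abs(ya - yb)
        return total

    centers = {"M0": (1.5, 1.5), "M1": (4.0, 1.0), "M2": (4.0, 3.0), "M3": (1.0, 4.0)}
    print(hpwl(["M0", "M1", "M2", "M3", "M0"], centers))  # -> 12.0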
Regarding the translation from delay to overall system performance, URaY will only focus on wire delays, for two reasons: (i) the optimizations carried out in URaY do not affect gate delays, and (ii) wire delays will, if not already, manifest as the primary source of system performance degradation [81]. This generic delay model can and should be retrofitted for specific individual target systems. The delay model, together with the timing critical path described above, is crucial to URaY, as they are used to estimate the system performance degradation caused by changes in the die floorplan while the design is being optimized for yield.

3) A yield model specifying the relationship between critical area and yield. URaY supports the "plug-in" of any yield model. For the experiments performed in this chapter, we use the Poisson yield model [3]. We would like to point out that since URaY constantly maintains the floorplan image during optimization, custom industrial yield models reflecting intra-die defect clustering and process variation can be used. This would not be possible in other yield enhancement methodologies. Unfortunately, due to lack of access to highly sensitive intra-die defect correlation and process variation data, we could not explore this aspect.

4) A yield learning curve representing the anticipated yield learning process. Yield learning data is also highly sensitive, but since it does not impact our optimization process and results, we used synthetic yield learning curves in our experiments.

5) A pricing model quantifying S(t). In our research we do not wish to capture the mercurial aspects of IT economics, hence we assume a linear correlation between performance and chip selling price [23]. More complex chip pricing models can be used in URaY if they are available. Under these conditions, S(t) can be replaced by the system performance, which is a function of the HPWL and not of t, scaled by some constant S'.

6) Wafer production volume data quantifying W(t). We set W(t) to a constant W. More accurate production data can be used in URaY if it is available (the URaY code does not have to be changed, as we have provided the parameter interface).

Outputs of URaY include:

1) The floorplan of the final optimized design. In this chapter we focus on fixed outline, non-slicing floorplans, since microprocessor dice are typically square and not constrained by slicing requirements. This floorplan will include the geometric locations of the modules, along with their spares, plus the inserted steering/testing logic. In our experimental results section, for the sake of clarity, we have clustered modules with their spares and the corresponding steering/testing logic. This will be explained in detail ahead.

2) The projected revenue, which can now be rewritten as:

Revenue = S' * Perf(HPWL) * (A_wafer / A_Die) * W * sum over t from t_1 to t_2 of Y_Die(t)

where Perf(HPWL) denotes the HPWL-dependent system performance.
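As an illustration of the yield-model plug-in from input 3 above, the Poisson model used in our experiments reduces to a one-liner; using the full module area as the critical area is a simplification.

    import math

    # Poisson yield model: Y = exp(-A * d), with the module area used as the
    # critical area (a simplification).
    def poisson_yield(area, defect_density):
        return math.exp(-area * defect_density)

    # The 2x2 modules of the case study below, at d ~ 0.17:
    print(poisson_yield(area=4, defect_density=0.17))  # ~0.51, i.e. the ~0.5 used there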
6.3 Motivational Case Study

Now that we have established the input parameters, output data and pertinent concepts of URaY, we examine a motivational case study that exposes the drawbacks of traditional yield aware design flows. Assume a system consists of four modules: M_0, M_1, M_2 and M_3. M_0 is a 3 x 3 square cache; M_1, M_2 and M_3 are all 2 x 2 square non-cache modules. M_0 has a yield of ~1, which is made possible through the use of spare rows/columns. Under a defect density of ~0.17, M_1, M_2 and M_3 have identical yield values of 0.5. The critical path of this system is M_0 - M_1 - M_2 - M_3 - M_0 (which could represent the data flow of an instruction performing a cache read, followed by computations on the fetched values, and finally a write back to the cache). Assume a total of 10^6 wafers (W = 10^6), each of size 10^3 (A_wafer = 10^3), is produced, during which yield learning does not occur, and S' = 10^3. Recall that S' is a constant scalar used to map the system performance to the selling price.

Figure 28. Motivational case study: (a) the bare design, (b) the design with one spare each for M_1, M_2 and M_3, and (c) the design with two spares each for M_1, M_2 and M_3.

Figure 28a illustrates a floorplan of the bare design, with the timing critical path shown in red lines. This floorplan is optimal in terms of the size of the smallest enclosing square. While compact in die area and minimal in HPWL, this bare design has a very low yield in high defect density environments. Without an integrated framework, the only way to tackle this low yield issue is to first arrive at the optimal redundancy configuration, then feed the new design to the floorplanner. This separated design methodology has two major drawbacks: (1) the optimal redundancy configuration is calculated without considering floorplanning or wiring induced performance degradation, and (2) addressing both die area and HPWL minimization concurrently is an extremely difficult problem. Figure 28c illustrates the optimal design floorplan if this design flow is adopted. M_1, M_2 and M_3 each have two spares; note that spares cannot form L shapes because most floorplanners cannot handle L shape primitives. Steering/testing logic is ignored in this case study for simplicity, but it will be included in the full scale experiments. As seen in Table XVII, the new design (c) indeed increased Y/A over the bare design (a) by over 100%. Even when floorplanning whitespace and HPWL are considered, the improvement in revenue is still near 15%.

Table XVII. Analysis of the three example designs

              Design (a)        Design (b)           Design (c)
  Die Area    5^2 = 25          7^2 = 49             8^2 = 64
  Yield       0.5^3 = 0.125     0.75^3 = 0.421875    0.875^3 = 0.669922
  Y/A         0.005             0.00861              0.010468
  HPWL        11                15                   20
  Revenue     455 million       574 million          523 million

*Origin of 0.75: 1 - (1 - 0.5)^2 = 0.75; origin of 0.875: 1 - (1 - 0.5)^3 = 0.875.

Although optimal choices were made for design (c) both at the redundancy augmentation step and at the floorplanning step, it is actually not optimal in revenue. Design (b) in Figure 28b employs a single spare for each of M_1, M_2 and M_3. Despite being smaller in Y/A compared to design (c), it outperforms design (c) in revenue by almost 10%. Clearly, even with the most powerful yield enhancement and floorplanning algorithms, there is still ample room for improvement in revenue.
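Every revenue entry in Table XVII can be reproduced end to end. The sketch below does so; that the pricing model evaluates to S'/HPWL (i.e. performance behaves as the reciprocal of the HPWL here) is inferred from the case-study numbers rather than stated explicitly.

    # Reproduces the revenue row of Table XVII. That performance enters as
    # S'/HPWL is inferred from the case-study values, not stated explicitly.
    W, A_WAFER, S_PRIME = 1e6, 1e3, 1e3

    def case_revenue(a_die, y_die, hpwl):
        return W * (A_WAFER / a_die) * y_die * (S_PRIME / hpwl)

    for name, a, y, h in (("(a)", 25, 0.125, 11),
                          ("(b)", 49, 0.421875, 15),
                          ("(c)", 64, 0.669922, 20)):
        print(name, f"{case_revenue(a, y, h) / 1e6:.0f} million")
    # -> (a) 455 million, (b) 574 million, (c) 523 million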
6.4 Optimization Algorithm

The optimization algorithm at the core of URaY is a two phase process illustrated in Figure 29. First, a floorplan "seed" based on the bare design system configuration is generated. This is identical to the classical floorplanning problem, but with a unique objective. The seed should be mainly optimized for HPWL minimization; having a larger than normal amount of whitespace is acceptable, as it will be filled when the seed "grows" in the next phase. Unfortunately, the true quality of a seed cannot be known a priori, since the overall revenue can only be evaluated after seed growth. A seed that is minimal in die area and/or HPWL does not guarantee that it will grow into the best solution. This "fuzzy" objective makes the seed generation process exhibit an even more complex puzzle-like nature than the standard floorplanning problem. The generated seed is fed to the second phase, in which the co-optimization of yield, area and performance is carried out. Redundancy is applied to select modules while carefully balancing the increase in area and HPWL. Conceptually, the seed floorplan grows in die size. We have incorporated HYPER, which calculates the optimal redundancy configuration to maximize Y/A, into URaY to provide bounds that trim algorithm runtime, although URaY is fully capable of operating without HYPER.

Figure 29. High level algorithm flow of URaY: stochastic seed generation with Quadrant Bisect and Simulated Annealing (QBaSA), one seed per section parameter value, followed by non-stochastic seed "growth" via the integrated co-optimization of yield, area and performance, one GROWTH run per margin parameter value, with HYPER providing reference spare counts Ref[i] per module.

The design space to be explored by URaY is essentially that of a traditional floorplanning problem compounded with that of a traditional yield enhancement problem. URaY tackles this design space explosion by facilitating mass parallelism in both the seed generation and seed growth processes, so that it can take advantage of chip-multiprocessor (CMP) architectures and cloud computing systems. More specifically, multiple seeds are generated independently of each other, and each seed can grow into different sizes independently. This is clearly shown in Figure 29, where different seeds are generated with different section parameter values in parallel, and, for each seed, several GROWTH processes operate in parallel with different values of the margin parameter. There are no intermediate merging operations; only at the last step is the design with the largest revenue selected as the solution.

For floorplan representation we use the O-tree, instantiated with linked lists. The original O-tree implementation forfeited the ability to inject nodes internally in order to facilitate fast tree transformations, but we have unlocked the full potential of O-trees by allowing arbitrary node removals, external/internal node injections and arbitrary node swaps through the manipulation of data structure pointers. The admissibility adjustment (i.e. floorplan compaction in both the horizontal and vertical directions) remains the same as in the original implementation [53]. We review the admissibility adjustment procedure in Figure 30. The original O-tree translates to a non-admissible floorplan, which is not compact. When the O-tree and the corresponding floorplan go through admissibility adjustment, M_2 is pushed to the left, and M_3 drops down and is then pushed to the left. The compacted floorplan is now admissible, with no further compaction possible in the leftward or downward direction. The O-tree is then reconstructed from the new floorplan, reflecting the changes in the locations of the modules.

Figure 30. Admissibility adjustment review
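A sketch of the pointer-based node type implied by this description; the field names and the Python rendering are assumptions, but the operations (arbitrary injection position, detachment, O(1) swap) are the ones listed above.

    # Sketch of a linked-list O-tree node; field names are assumptions.
    class OTreeNode:
        def __init__(self, module, width, height):
            self.module, self.width, self.height = module, width, height
            self.rotated = False
            self.parent = None
            self.children = []   # child order encodes the O-tree traversal

        def inject(self, child, index=None):
            # Internal or external injection: 'child' may be inserted at any
            # position among the children, not only appended at the end.
            child.parent = self
            if index is None:
                self.children.append(child)
            else:
                self.children.insert(index, child)

        def detach(self):
            # Arbitrary node removal by pointer manipulation (reattaching the
            # removed node's children is left to the caller in this sketch).
            self.parent.children.remove(self)
            self.parent = None

    def swap(a, b):
        # O(1) node swap: exchange module payloads instead of relinking subtrees.
        a.module, b.module = b.module, a.module
        a.width, b.width = b.width, a.width
        a.height, b.height = b.height, a.height

    root = OTreeNode("root", 0, 0)
    root.inject(OTreeNode("M1", 2, 2))   # external injection under the root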
Phase 1: Seed Generation with QBaSA

Microprocessor based systems contain one or more cache units, which typically take up 40% - 60% of the total die area. In the seed generation phase, URaY takes advantage of this specific characteristic to prune the search space. Our algorithm for seed generation is termed "Quadrant Bisection and Simulated Annealing (QBaSA)". For a system with one cache unit, depicted in the left of Figure 31a, URaY views the system as four quadrants, one of which is entirely occupied by the cache; the rest of the modules are partitioned into the remaining three quadrants using a heuristic based on the classical min-cut concept [45]. The min-cut procedure takes the timing critical path, breaks it into pairs of undirected connections between modules, and treats them as standard module interconnections. An extension of QBaSA to a two cache system is shown in Figure 31b. The only difference is that there are several options regarding the placement of the cache units, and there are more quadrants to work through.

Figure 31. The basic concept behind quadrant bisection. (a) One cache system: the cache M_0 (width W_cache, height H_cache) occupies one quadrant, the remaining modules fill Quadrants 1 - 3 whose dimensions are set by the section parameter x, and the corresponding O-tree holds one subtree per quadrant under the root. (b) Two cache system: the same idea with two cache quadrants and four module quadrants.

We explain how modules are placed into the different quadrants in more detail using the system depicted in Figure 32, which shows the system configuration prior to floorplanning (timing critical path shown in red).

Figure 32. Example system (cache M_0 and modules M_1 - M_7 on the timing critical path)

Since the timing critical path is a loop that begins with and ends at the cache (M_0), the min-cut based quadrant placement procedure can guarantee that one and only one cut exists between any two adjacent quadrants, and thus reduces to a bin-packing-like procedure. In Figure 33 we can clearly see that after M_1 and M_2 have been placed into Quadrant 1, there is no more room to accommodate M_3 in Quadrant 1; the cut between the cache quadrant and Quadrant 1 is 1, and the cut between Quadrant 1 and Quadrant 2 will also be 1 (the connection from M_2 to M_3). The min-cut based placement process continues in Figure 34. Modules placed in the last quadrant may exceed the area of that quadrant if the sizes of Quadrant 1 and Quadrant 2 are too small. Note that modules are simply assigned to quadrants in this step; their geometric locations are not yet determined, so the module positioning in Figure 33 and Figure 34 is only for illustration purposes. For a system with a more complex timing critical path, the main concept remains the same: attempt to let each quadrant accommodate as many modules as possible while keeping the cut between quadrants minimal.

Figure 33. Placing modules into Quadrant 1 (with W_cache = H_cache = 4 and section = 4, Quadrant 1 has area 4 x 4 = 16; placing M_1 leaves 16 - 8 = 8, placing M_2 leaves 16 - 8 - 4 = 4, and M_3 no longer fits)

Figure 34. Placing modules into Quadrant 2 and Quadrant 3 (placing M_3 leaves 16 - 8 = 8, placing M_4 leaves 16 - 8 - 8 = 0; the unplaced modules M_5, M_6 and M_7 go to Quadrant 3, possibly exceeding its area)

The modules inside the three quadrants will initially form three rootless subtrees, which will be part of the master tree previously shown in the right of Figure 31a.
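Because the critical path is a loop anchored at the cache, the quadrant step reduces to filling the quadrants with modules in path order until each area budget runs out. The sketch below uses the quadrant-area expressions from Algorithm 3 (given ahead); the module areas of M_1 - M_4 follow the Figure 33/34 walkthrough, while those of M_5 - M_7 are made up for illustration.

    # Bin-packing-like quadrant assignment: walk the critical-path loop in
    # order, filling each quadrant until its area budget runs out.
    # Quadrant budgets follow Algorithm 3's expressions.
    def assign_quadrants(path_modules, areas, w_cache, h_cache, x):
        budgets = [h_cache * x,
                   (w_cache + x - h_cache) * x,
                   (w_cache + x - h_cache) * w_cache]
        placement, quad, used = {}, 0, 0.0
        for m in path_modules:                 # the cache itself is excluded
            if quad < 2 and used + areas[m] > budgets[quad]:
                quad, used = quad + 1, 0.0     # the last quadrant may overflow
            placement[m] = quad + 1
            used += areas[m]
        return placement

    # M1-M4 areas from the Figure 33/34 walkthrough; M5-M7 areas assumed.
    areas = {"M1": 8, "M2": 4, "M3": 8, "M4": 8, "M5": 6, "M6": 6, "M7": 6}
    print(assign_quadrants(["M1", "M2", "M3", "M4", "M5", "M6", "M7"],
                           areas, w_cache=4, h_cache=4, x=4))
    # -> M1, M2 in Quadrant 1; M3, M4 in Quadrant 2; M5-M7 in Quadrant 3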
Next, the annealing procedure will limit itself to the predetermined quadrants, meaning that node injections (i.e. removing a node and placing it in another location in the O-tree) and node swaps are limited to two nodes in the same quadrant. However, modules may eventually move to other quadrants due to the admissibility adjustment if admissible = 1. In practice, URaY first calls QBaSA with admissible = 0 to ensure that modules stay within their respective quadrants, then calls QBaSA again with admissible = 1 so that the modules gain a small degree of freedom to move into other quadrants. We explain this procedure in detail using the same example from Figure 32 and the quadrant configuration from Figure 34. We begin with a bad initial floorplan, represented by the O-tree in the left of Figure 35. If this O-tree is translated to the actual floorplan, it produces the non-admissible floorplan shown in the right of Figure 35, which also violates the quadrant confinements. For example, M_2 lies outside of Quadrant 1 as defined by section = 4.

Figure 35. Seed generation example: initial state (an O-tree with one subtree per quadrant, and the corresponding non-admissible floorplan)

From the initial state in Figure 35, the simulated annealing based optimization procedure may employ the different types of moves depicted in Figure 36. For the sake of clarity, Figure 36 only focuses on Quadrant 1, and shows the following moves:

Node rotation: the O-tree structure does not change, but the module information stored within the nodes changes.

Node swap: this can be done in O(1) time by exchanging pointers.

Node rotation + node injection (admissible = 0): after manipulating the O-tree, the floorplan does not undergo the admissibility adjustment process previously described in Figure 30.

Node rotation + node injection (admissible = 1): after manipulating the O-tree, the floorplan undergoes the admissibility adjustment process previously described in Figure 30. Naturally, this move is more time consuming than when admissible = 0. Also, it may easily cause modules to move into other quadrants.

Figure 36. Seed generation example: optimization moves
The high level pseudo code for QBaSA is given in Algorithm 3. Since the seed should be mainly optimized for HPWL, we use the weighted sum of HPWL and sqrt(A_die) as the optimization objective, and let HPWL carry the higher weight.

Algorithm 3: Quadrant Bisect and Simulated Annealing (QBaSA)

  Algorithm QBaSA (int x, int admissible)
  begin
      for (Quad = 1; Quad <= 3; Quad++)
          j = 0;
          for (i = 1; i < N_MODULES; i++)
              k = min_cut_select();
              switch (Quad)
                  1: Quad_A = H_cache * x
                  2: Quad_A = (W_cache + x - H_cache) * x
                  3: Quad_A = (W_cache + x - H_cache) * W_cache
              if (j < Quad_A)
                  j = j + Width_k * Height_k;
                  Quadrant[k] = Quad;
      malloc(master_root);
      malloc(M_0);
      attach M_0 as the child of master_root;
      temp_node = M_0;
      for (i such that Quadrant[i] == 1)
          malloc(M_i);
          attach M_i as the child of temp_node;
          temp_node = M_i;
      temp_node = M_0;
      for (i such that Quadrant[i] == 2)
          malloc(M_i);
          attach M_i as the child of temp_node;
          temp_node = M_i;
      temp_node = master_root;
      for (i such that Quadrant[i] == 3)
          malloc(M_i);
          attach M_i as the child of temp_node;
          temp_node = M_i;
      for (temp = start_temp; temp > end_temp; temp = temp - step_temp)
          saved_root = copy_tree(master_root);
          quality = evaluate(master_root);
          node_1 = random_select(master_root);
          node_2 = random_select_from_quadrant(master_root, Quadrant[node_1]);
          k = #of_children(node_1) + 1;
          switch (rand_n(0, #of_children(node_1) + 1))
              0: swap node_1 and node_2
              1: inject node_2 between node_1 and the 1st child of node_1
              ...
              k: inject node_2 as a newly created child of node_1
          if (admissible == 1)
              convert_tree_to_admissible_tree(master_root);
          delta = evaluate(master_root) - quality;
          if (reject(delta, temp) == 1)
              master_root = copy_tree(saved_root);
          else
              quality = evaluate(master_root);
  end
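Algorithm 3 calls reject(delta, temp) without defining it. A standard Metropolis acceptance rule, consistent with a minimized objective (delta > 0 meaning the move made things worse), would be the following; the exact rule used is an assumption.

    import math, random

    # Assumed Metropolis rule for Algorithm 3's reject(delta, temp).
    def reject(delta, temp):
        if delta <= 0:
            return 0                     # improving or neutral moves always pass
        return 0 if random.random() < math.exp(-delta / temp) else 1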
Phase 2: Integrated Optimization with GROWTH

In the next phase, the modules placed in the seed will start to "expand" in order to increase die yield and overall revenue. Viable options for yield enhancement include: (i) adding spare modules, (ii) decreasing the critical area by upsizing transistors, widening wires or adding redundant vias/gates/transistors, and (iii) adding control logic to enable microarchitectural graceful degradation. Currently URaY only supports option (i). This is mainly because it is difficult to obtain a convincing baseline setup that supports options other than (i); if URaY used options (ii) and (iii) but the baseline could not, the comparison with the baseline designs would be unfair.

The algorithm used in this phase, GROWTH, is a non-stochastic procedure with a time complexity of O(n^2 * max(W_seed, H_seed)^2), where n is the total number of modules in the target system and W_seed/H_seed is the width/height of the seed floorplan. The high level pseudo code is shown in Algorithm 4. Figure 37 provides a graphical description of the main concepts. GROWTH constrains itself with the margin input: the seed cannot grow beyond the (max(W_seed, H_seed) + margin) x (max(W_seed, H_seed) + margin) bounding box. Independent GROWTH iterations can execute in parallel with different margin input parameters, as previously shown in Figure 29. This is necessary because the dimensions of the preferred solution cannot be known a priori. In our experiments, we derived empirical bounds for the margin based on Ref[i], the number of spares that HYPER specifies for M_i: the lower bound is 0, and the upper bound is the margin at which, empirically, all non-cache modules have incorporated two more spares than what is specified by HYPER, and the floorplan has zero whitespace.

Figure 37. Graphical description of GROWTH (a seed of width and height 5 growing under margin = 1 and margin = 2, once bound by HYPER and once unrestricted; the transitive right slack for M_3 is 2)

Algorithm 4: GROWTH

  Algorithm GROWTH (int margin, int bound)
  begin
      do
          delta_max = -1;
          j = -1;
          for (i = 1; i < N_MODULES; i++)
              system_yield = evaluate_system_yield();
              enhance_yield(Module[pick(i)]);
              if (bound == 1 && spares[i] > Ref[i])
                  system_yield' = 0;
              else
                  system_yield' = evaluate_system_yield();
              if (bound == 1)
                  delta = (system_yield' - system_yield) / Area[i];
              else
                  delta = system_yield' - system_yield;
              if (delta > delta_max)
                  j = i;
              restore_enhance_yield(Module[i]);
          if (j != -1)
              rt_slack = transitive_right_slack(master_tree, Module[i], margin);
              up_slack = transitive_up_slack(master_tree, Module[i], margin);
              if (rt_slack * Height[i] < up_slack * Width[i])
                  rt_slack = 0;
              else
                  up_slack = 0;
              enhance_yield(Module[i]);
              if (spares[i] + 1 >= 5)
                  m = fast_SA(Module[i], Width[i] + rt_slack, Height[i] + up_slack);
              else
                  m = exhaust(Module[i], Width[i] + rt_slack, Height[i] + up_slack);
              if (m == 0)
                  restore_enhance_yield(Module[i]);
      while (j != -1)
      for (i = 1; i < N_MODULES; i++)
          rt_slack = transitive_right_slack(master_tree, Module[i], margin);
          up_slack = transitive_up_slack(master_tree, Module[i], margin);
          add_testing_steering_logic(overhead_fraction, rt_slack, up_slack);
  end

GROWTH makes the following important decisions and calculations during optimization:

1) The order in which GROWTH processes all the modules: We allow GROWTH to make a local greedy selection at each iteration, that is, it picks the M_i that will achieve the largest improvement in Y_i/A_i (or, alternatively, in Y_i alone). Note that Y_i/A_i is local to M_i, meaning we do not have to perform any floorplanning procedure during the selection of M_i. GROWTH uses the former criterion (yield gain per unit area) for the restricted run, and the latter (raw yield gain) for the unrestricted run.

2) The slack for individual module expansion: When modules expand, they can push aside their adjacent modules. Therefore, the slack is transitive (shown in blue in the center diagram of Figure 37; we further illustrate this in Figure 38). GROWTH only needs the slack values for the upward and rightward directions, as the intermediate floorplans are always admissible.

Figure 38. The calculation of transitive slack (examples of how a module's right or up slack accumulates through its neighbors and is capped by the margin)

3) The maximum number of spares that can be added for each module: this quantity depends on two factors: (1) the slack values for each module, since the inserted redundant modules should not increase the die area beyond the bounds specified by margin, and (2) the outputs of HYPER, for the iteration that is bound by HYPER (Figure 37). For the restricted GROWTH iteration, once Module i has had Ref[i] spares inserted, no more spares will be inserted for Module i, even if it would be possible to do so within the confinements of margin.
For the unrestricted iteration of GROWTH, all modules are free to add as many spares as the current whitespace (determined by the transitive slack values) permits. We provide some graphic illustrations in Figure 39.

Figure 39. The maximum number of spares allowed (for example, with HYPER suggesting one spare for M_1, a transitive right slack of 0 and a transitive up slack of 5 under margin = 2, the restricted GROWTH run inserts one spare for M_1 while the unrestricted run inserts two)

4) The dimensions of the expanded cluster: After spares have been inserted, they form a cluster with the original module. The rotation and location of the modules then determine the dimensions of the cluster, which is easy to calculate when the number of spares is 1, but becomes a non-trivial problem when more spares are needed. For this local optimization, GROWTH constructs an O-tree consisting only of the original module and its spares. Then exhaustive search is performed to find the best intra-cluster module organization. For a total of n modules, including the original and all of its spares, there are 2^n rotation combinations, each of which can have up to (n!)^2 different O-tree structures, hence this search needs to examine up to 2^n * (n!)^2 trees. This is feasible when the number of spares for the original module is less than four (i.e. n < 5). Having more than four spares for any module is highly unlikely, but in case it happens, URaY will switch to a fast simulated annealing procedure.
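The size of this search space is easy to tabulate, which also shows why the exhaustive search is capped at n < 5.

    from math import factorial

    # Upper bound on intra-cluster arrangements: 2**n rotation combinations,
    # each with up to (n!)**2 O-tree structures.
    def cluster_arrangements(n):
        return 2 ** n * factorial(n) ** 2

    for n in range(2, 7):
        print(n, cluster_arrangements(n))
    # -> 16, 288, 9216, 460800, 33177600: past n = 4 the count explodes,
    #    matching the switch to fast simulated annealing.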
We use a graphical example to explain this local optimization procedure in Figure 40 and Figure 41. Beginning with Figure 40, for the sake of discussion, assume that Module 2 and Module 3, along with the cache module (Module 0), can all achieve high yield without redundancy, hence we only need to insert spares for Module 1. When the margin parameter is set to 2, we first realize that Module 1 can "grow" upwards to incorporate one spare and the necessary steering/testing logic (indicated by shaded blocks). After the insertion of the first spare module, the upward transitive slack reduces to 0 for Module 1, but the transitive right slack for Module 1 is still 2. So the question becomes whether Module 1 can incorporate two spares (along with the necessary steering logic) when considering the right slack. This problem is somewhat more difficult than before, when there was only one spare module in addition to the original module. In Figure 41 we illustrate the exhaustive search process by listing some of the options explored. Keep in mind that the algorithm will enumerate all possible cases in terms of the location and rotation of each module. In Figure 41, we first isolate a subtree that consists of only Module 1 and its spares. Without considering rotation (i.e. with the original module and all of its spares indistinguishable), there is a total of five different O-tree structures, which we have shown in Figure 41. For each O-tree structure, some modules can be rotated. As an example, we have shown only one rotation variation for each O-tree structure. Among those different arrangements of the cluster, two are feasible for the given margin confinements from Figure 40; they are indicated by dashed red boxes. As we can see in Figure 41, when there are at least two spares, there may be intra-cluster whitespace that could be utilized by the steering/testing logic. However, we do not take advantage of such opportunities, since it is unclear whether logic designers can successfully insert steering/testing logic in the gaps within the cluster; we always conservatively reserve space explicitly beyond the rectangular cluster for the insertion of steering/testing logic. Such scenarios can only appear when there are two or more spares for a particular module.

Figure 40. Prelude to determining the intra-cluster module arrangements (with margin = 2, Module 1 has a transitive up slack of 3 and a transitive right slack of 2; only one layout of the original Module 1, one spare and the testing/steering logic is possible)

Figure 41. Determining the intra-cluster module arrangements via exhaustive search (the five O-tree structures for Module 1 and its two spares, one rotation variation each, the two arrangements feasible for the margin, the intra-cluster whitespace that could be used for steering/testing logic, and the space explicitly allocated for it)

At the end of GROWTH, the steering/testing logic (shown as shaded blocks in Figure 37) will be added for module clusters that employ at least one spare. This procedure also uses the transitive slack values, as illustrated in Figure 42. For our experiments in the next section, we set the size of the steering/testing logic to a fraction (5%) of the sum of the areas of the original module and all of its spares (recall that intra-cluster whitespace is not considered in the steering/testing logic insertion process, although it could be utilized by the logic designers). The circuit level implementation of the steering/testing logic is beyond the scope of our research: we only reserve area for the steering/testing logic; the circuit and logic designers are responsible for determining the actual location and implementation (e.g. laser fuse or multiplexer based).

Figure 42. Steering/testing logic insertion (the logic grows into the direction with the larger remaining transitive slack; in the examples shown it takes 20% or 25% of the cluster area)
6.5 Experimental Results

Baseline

It is important that we fully disclose the baseline designs on which we base our comparisons. We have gathered the most powerful tools in their respective application contexts, and when stitching those tools together, we provided generous assistance to let the baseline achieve the best possible results. For yield enhancement, we use HYPER, which calculates the optimal redundancy configuration for any given system when floorplanning and HPWL are ignored. We extended HYPER to handle yield learning curves through exhaustion, so that it retains its optimality. That is, since HYPER takes a single defect density value as input, we do the following:

Feed the defect density at time t (extracted from the yield learning curve) to HYPER.

Execute HYPER and generate the redundancy configuration.

Floorplan the design, which may or may not be augmented with redundancy (depending on the output of HYPER), using the procedures described below.

Evaluate the generated revenue of the floorplanned design over the entire timeline (t_1 to t_2), not just at time t.

Repeat the above for all t (t_1 <= t <= t_2) and choose the design that generated the maximum revenue.

Floorplanning the redundancy augmented design produced by HYPER would result in an impractical system where modules are geographically separated from their spares, hence we ran a script to cluster modules with their respective spares, along with the required steering/testing logic. We always attempt to create clusters that resemble squares (Figure 43), seeing that floorplanners handle square blocks much better than rectangular blocks.

Figure 43. Creating the cluster for the baseline designs (a spare is inserted next to M_1; the square-like cluster layout is accepted and the elongated one is rejected)

Both the clustered design and the bare design are then fed to DeFer, the best block floorplanner to date. It not only produces floorplans with very little whitespace, but is also capable of addressing HPWL remarkably well, due to the initial partitioning step carried out by hMetis [67]. We delete the connections in the input that are not on the timing critical path so that hMetis will not be distracted by irrelevant cuts. DeFer relies on a manual parameter gamma that bounds the whitespace fraction. The authors of DeFer set gamma = 10% in their publication [78], which would sometimes cause DeFer to crash when running our benchmarks with large caches. So we sweep a range of gamma values between 10% and 30%, and run DeFer at least 100 times for each gamma value to obtain a large batch of results. Finally, all of these designs are evaluated over the target time window (t_1 to t_2) with our revenue calculation equation, and the best design is selected as the baseline.
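The baseline construction therefore amounts to the nested sweep sketched below; hyper, cluster, defer and evaluate_revenue are caller-supplied stand-ins for the external tools and evaluations described above, not real APIs.

    # Sketch of the baseline construction described above; the four callables
    # are stand-ins for HYPER, the clustering script, DeFer and the revenue
    # evaluation, passed in by the caller.
    def build_baseline(system, learning_curve, t1, t2,
                       hyper, cluster, defer, evaluate_revenue):
        best, best_revenue = None, float("-inf")
        for t in range(t1, t2 + 1):
            cfg = hyper(system, defect_density=learning_curve(t))
            clustered = cluster(cfg)            # square-ish module+spare clusters
            for gamma in (0.10, 0.15, 0.20, 0.25, 0.30):
                for _ in range(100):            # DeFer is run repeatedly per gamma
                    plan = defer(clustered, whitespace_bound=gamma)
                    r = evaluate_revenue(plan, learning_curve, t1, t2)
                    if r > best_revenue:
                        best, best_revenue = plan, r
        return best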
Results

Each system in our experiments contains 15 - 30 modules. We do not further disintegrate modules into smaller ones because module level redundancy cannot be applied at a very fine granularity due to steering and testing logic constraints. All inputs are based on the GSRC Hard [82] benchmarks, from which we extract modules and inflate one random module to represent the cache. The timing critical path is a loop that originates from and ends at the cache. We list four sets of results in Table XVIII. For experiments 15.1 - 15.10 the CPU contains 15 modules in total, for experiments 20.1 - 20.10 it contains 20 modules, for experiments 25.1 - 25.10 it contains 25 modules, and for experiments 30.1 - 30.10 it contains 30 modules. Within each set of ten experiments, the module dimensions differ, but the yield learning curve, W(t), t_1 and t_2 are identical. This is so that we do not deliberately create advantageous manufacturing conditions for each target system when executing URaY. In our experiments, we view time as discrete, so that the yield learning curve does not have to follow any particular mathematical function. We would like to stress that the exact formulations of Y(t), W(t), t_1 and t_2 do not undermine the effectiveness of URaY, as long as the majority of the wafer production is not concentrated in times where the bare system yield is unrealistically low or high. What is important is that URaY successfully leverages the intricate three-way tradeoff when it is difficult to do so using conventional approaches.

Table XVIII. Results showing the area, HPWL and revenue

  Set      Baseline Design              URaY Design                  Improve
           Area    HPWL   Revenue      Area    HPWL   Revenue
  15.1     214^2   973    59.8         181^2   750    67.2          12.4%
  15.2     208^2   990    61.9         179^2   654    71.8          15.9%
  15.3     212^2   1048   56.0         202^2   810    64.2          14.6%
  15.4     251^2   1048   40.4         230^2   866    48.9          21.0%
  15.5     278^2   1096   30.3         261^2   1034   36.4          20.1%
  15.6     240^2   1060   43.5         240^2   948    49.4          13.5%
  15.7     265^2   1104   32.5         246^2   796    45.2          39.1%
  15.8     270^2   1002   33.6         245^2   878    37.4          11.3%
  15.9     266^2   1440   24.1         261^2   1026   32.0          32.8%
  15.10    244^2   996    45.3         242^2   840    54.5          20.3%
  Average Revenue Improvement for Experiments 15.1 - 15.10: 20.1%
  20.1     266^2   1450   97.8         230^2   974    113.8         16.3%
  20.2     271^2   1386   99.1         224^2   1032   111.8         12.8%
  20.3     267^2   1494   93.3         237^2   1022   100.5         7.72%
  20.4     318^2   1564   62.4         230^2   1050   77.9          24.8%
  20.5     289^2   1552   77.8         236^2   990    92.4          18.7%
  20.6     252^2   1570   102.5        207^2   934    128.1         24.9%
  20.7     268^2   1494   94.3         236^2   1084   121.0         28.3%
  20.8     273^2   1128   121.4        201^2   840    145.6         19.9%
  20.9     254^2   1682   92.8         210^2   906    112.6         21.3%
  20.10    259^2   1194   128.6        206^2   850    147.1         14.4%
  Average Revenue Improvement for Experiments 20.1 - 20.10: 18.9%
  25.1     279^2   1860   72.1         230^2   1182   88.6          22.9%
  25.2     205^2   1088   125.6        208^2   1034   136.0         8.28%
  25.3     304^2   1990   56.3         246^2   1316   66.8          18.6%
  25.4     335^2   2210   41.3         249^2   1376   53.3          29.0%
  25.5     245^2   1284   61.6         242^2   1104   81.9          32.9%
  25.6     289^2   1842   67.8         258^2   1206   76.4          12.6%
  25.7     306^2   1756   63.2         251^2   1172   71.5          13.1%
  25.8     320^2   2068   48.4         290^2   1498   58.2          20.2%
  25.9     324^2   1782   55.4         251^2   1228   63.6          14.8%
  25.10    306^2   1742   63.8         261^2   1220   72.6          13.8%
  Average Revenue Improvement for Experiments 25.1 - 25.10: 18.6%
  30.1     265^2   1470   133.6        278^2   1528   152.4         14.1%
  30.2     266^2   1650   123.9        280^2   1616   151.2         22.0%
  30.3     317^2   2234   128.5        265^2   1330   154.3         20.0%
  30.4     344^2   2204   109.9        276^2   1428   127.1         15.6%
  30.5     345^2   2374   101.3        274^2   1524   120.7         19.1%
  30.6     314^2   2350   124.2        262^2   1332   165.7         33.4%
  30.7     342^2   2320   106.1        289^2   1494   129.3         21.8%
  30.8     261^2   1588   134.4        282^2   1514   164.7         22.5%
  30.9     315^2   2578   112.3        269^2   1598   130.9         16.5%
  30.10    323^2   2248   123.0        266^2   1446   146.3         18.9%
  Average Revenue Improvement for Experiments 30.1 - 30.10: 20.3%

Note: The units and constant scalars (e.g. S') are irrelevant for comparisons.
129.3 21.8% 30.8 261 2 1588 134.4 282 2 1514 164.7 22.5% 30.9 315 2 2578 112.3 269 2 1598 130.9 16.5% 30.10 323 2 2248 123.0 266 2 1446 146.3 18.9% Average Revenue Improvement for Experiments 30.1 - 30.10 20.3% Note: The units and constant scalars (e.g. S') are irrelevant for comparisons 122 When comparing the baseline design and the URaY design in Table XVIII, it can be seen that the URaY design is often smaller in both die area and HPWL. This is not because URaY achieved a more compact floorplan with the same system input, but in fact because URaY selected a redundancy configuration that employs less number of spares. Therefore, although not shown in Table XVIII, the URaY design actually sacrificed yield to reduce the area and performance overheads to maximize revenue. On the other end of the spectrum, for a few experiments namely 25.2, 25.5, 30.1, 30.2 and 30.8, URaY was able to cautiously insert spares for select modules when the baseline designs abandoned any form of redundancy altogether (i.e. the baseline design is the bare design). In these situations where yield enhancement is rather expensive, URaY is still able to improve revenue when HYPER is no longer effective. Detailed Analysis In this subsection we first perform an in-depth analysis for experiment 20.7. For experiments 20.1 - 20.10 there is a total of 11 discrete time points ranging from t 1 = 0 to t 2 = 10. At each time point, W = 10 6 silicon wafers are manufactured. The die yield and revenue at each time point is listed in Table XIX. As expected, the bare design suffers from low yields and consequently, disappointing revenue. By deploying HYPER and DeFer, the die yield increased drastically, accompanied by a moderate amount of increase in die area and HPWL. This resulted in a 13.3% increase in revenue, hence this redundancy augmented design is selected as the baseline. This baseline design is detailed in Table XX. The number of spare modules for each module index is given in the left. We can see that all modules except the cache (M 0 ) have exactly one spare assigned to it. This is not a coincidence, it is actually because the module areas are comparable, hence their yield values are comparable and they would employ the same number of spares to achieve the best Y/A for the 123 system. To the right of Table XX, the floorplan is plotted along with the timing critical path in red. Each light gray rectangle is a cluster that includes the original module, precisely one spare module and the corresponding steering/testing logic. There is only one single-module cluster (i.e. a cluster that contains only the original module) shown as a white rectangle, which is the cache (M 0 ). The URaY design is detailed in Table XXI. URaY selected an entirely different redundancy configuration compared to the baseline design. When margin = 3, there was no room in the die left to accommodate spares for M 9 , M 11 , M 12 , M 15 , M 17 and M 19 . These modules, along with M 0 , appear as solid white rectangles in Table XXI. Despite the sharp drop in die yield with respect to the baseline design at every time point seen in Table XIX, this design has a smaller die area and shorter HPWL. Overall, it achieves 28.3% improvement in revenue. Table XIX. 
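The revenue numbers in Table XIX below can be understood through the following sketch, which assumes, purely for illustration, that revenue at each time point is proportional to the number of good dies (the actual revenue calculation equation and its constant scalars, such as S', are defined in an earlier chapter):

def total_revenue(die_yield, wafers, dies_per_wafer, t1, t2):
    # die_yield(t): die yield at time t, following the yield learning curve.
    # wafers(t): W(t), the number of wafers manufactured at time t.
    # dies_per_wafer: fixed by the die area once the floorplan is known.
    return sum(wafers(t) * dies_per_wafer * die_yield(t)
               for t in range(t1, t2 + 1))

Under this view, a design with a lower yield but a smaller die (hence more dies per wafer) can out-earn a higher-yield design, which is exactly the tradeoff visible in Table XIX.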
Table XIX. Die yield and revenue details for experiment 20.7

      Bare Design           Baseline Design       URaY Design
t     Die Yield  Revenue    Die Yield  Revenue    Die Yield  Revenue
0     0.16841    3.4433     0.85164    7.6935     0.47727    7.9052
1     0.20105    4.1105     0.87568    7.9498     0.51846    8.5874
2     0.24000    4.9071     0.89799    8.1905     0.56210    9.3101
3     0.28652    5.8580     0.91833    8.4129     0.60817    10.0732
4     0.34204    6.9932     0.93651    8.6141     0.65663    10.8758
5     0.37371    7.6408     0.94472    8.7059     0.68173    11.2917
6     0.40832    8.3484     0.95232    8.7913     0.70739    11.7168
7     0.44613    9.1215     0.95928    8.8701     0.73361    12.1509
8     0.48745    9.9662     0.96558    8.9418     0.76035    12.5938
9     0.53259    10.889     0.97119    9.0062     0.78760    13.0452
10    0.58191    11.898     0.97611    9.0630     0.81532    13.5044
Total            83.18                 94.23                 121.05

Table XX. Baseline design details for experiment 20.7
Module 0 (cache): 0 spares; Modules 1 - 19: 1 spare each.

Table XXI. URaY design details for experiment 20.7
Module 0 (cache) and Modules 9, 11, 12, 15, 17 and 19: 0 spares; all other modules: 1 spare each.

Table XXII. Baseline design details for experiment 15.6
Module 0 (cache): 0 spares; Modules 1 - 14: 1 spare each.

Table XXIII. URaY design details for experiment 15.6
Module 0 (cache): 0 spares; Modules 4, 6, 8, 9, 10, 12 and 14: 2 spares each; Modules 1, 2, 3, 5, 7, 11 and 13: 1 spare each.

During the unrestricted optimization iteration in GROWTH, instead of scaling back the strength of redundancy insertion, URaY may allocate more spares for a particular module than HYPER suggests, and experiment 15.6 is one example. When we compare the baseline design (shown in Table XXII) and the URaY design (shown in Table XXIII), we can see that in the URaY design several clusters are shown in dark gray. These are the clusters that include the original module plus two spares instead of one. So compared to the baseline, URaY adopted a more aggressive stance towards incorporating redundancy to enhance yield, which is the opposite of what we previously saw in experiment 20.7 (Table XX and Table XXI).

In Figure 44 we plot the spare configuration histogram for experiments 15.1 to 15.10. Except for experiments 15.5 and 15.10, URaY always chose a different spare configuration to enhance yield than the baseline design. This shows that when floorplanning and global wiring are considered during yield enhancement, the ideal design point will deviate from the theoretical optimal design point given by HYPER. Indeed, addressing yield in the design stage requires more comprehensive analysis and optimization tools than what is currently available. URaY is envisioned as the first step towards a truly integrated yield aware design flow.

Figure 44. Histogram of spare configurations for experiments 15.1 to 15.10

6.6 Conclusions

In this chapter we outlined the "Unified Revenue and Yield Aware Design Optimization Framework (URaY)" to aid designers in efficiently converging to designs that will survive manufacturing processes plagued with low yields.
When a target system is optimized for yield, the increase in die area and the degradation in system performance must be appropriately evaluated and carefully controlled, or else yield enhancement may backfire and produce designs that are high in die yield but eventually generate less revenue. URaY leverages this three-way tradeoff between yield, area and performance by addressing yield enhancement, backend floorplanning and global wiring minimization in a holistic fashion. Compared to a conventional design flow employing the most powerful redundancy based yield enhancement algorithm and block floorplanning tool, we were able to improve manufacturer revenue by 18 - 20%.

Chapter 7. Global Yield and Floorplan Aware Design Optimization

7.1 Introduction

As previously discussed in Chapter 2, different DFY techniques have recently been developed for caches, GPUs and CPUs, many of which are inspired by the classical concept of hardware redundancy [5]. Modern microprocessors such as the Intel Ivy Bridge [83] or Haswell [84] product lines with Intel HD graphics integrate the shared L3 cache, GPU and multi-core CPU on a single die. Figure 45 recreates the high level floorplan of an Ivy Bridge die and is approximately to scale. This raises the question of whether DFY should treat the separate components independently, or adopt a more global perspective. We reveal that the solution quality (in terms of manufacturer revenue) achieved by the former, segregated DFY methodology will suffer for future technologies plagued with low yield issues.

Figure 45. A quad-core microprocessor with on-die GPU and shared L3 cache

In this chapter we introduce the "Global Yield and Floorplan Aware Design Optimization Framework (GlYFF)", a holistic DFY framework for SoCs/MPSoCs. Experimental results show that GlYFF can better capture the globally optimal design point and increase the overall revenue for the IC manufacturer by over 17%. Figure 46 provides a high level view of GlYFF, which sports the following features:

1) Supports the plug-in of almost any yield model and yield learning curve.
2) A central design optimization engine that optimizes the design from a global perspective.
3) Interfaces with detailed floorplanners to facilitate iterative floorplan aware design optimizations.

To the best of our knowledge, GlYFF is the first framework that systematically unifies floorplan aware yield enhancement for systems with multiple yield enhancement strategies. We acknowledge that there remain some advanced yield enhancement options currently not supported by GlYFF, such as thread migration [85] and graceful degradation [38].
Figure 46. High level view of GlYFF (system inputs: yield model, yield learning curve and yield enhancement options; yield aware design optimization: core level redundancy for the GPU, core/module/hybrid redundancy for the CPU, and spare rows/columns for the L3 cache; detailed floorplanners and design analysis tools: CACTI for the cache, a custom floorplanner for the GPU and DeFer for the CPU)

7.2 System Model

Similar to the Intel architectures depicted in Figure 45, our target system contains four major on-die components: (1) the L3 cache or, more generally, the last level cache (LLC), (2) the integrated on-die GPU, (3) the multi-core CPU with an optional private L1/L2 cache for each core, and finally (4) the miscellaneous logic block (MLB) containing system agents, I/O and memory controllers etc. Their yields (Y) and areas (A) are respectively denoted as Y_L3, Y_GPU, Y_CPU, Y_MLB, A_L3, A_GPU, A_CPU and A_MLB. The overall Y/A of the target system is the die yield (Y_die) divided by the die area (A_die). For yield calculations, we adopt the popular Poisson yield model [3]. Y_die can be replaced by the product of all component yields. If we assume there is zero whitespace when assembling the L3 cache, CPU, GPU and MLB onto the die (such as shown in Figure 45), then A_die can be calculated as A_L3 + A_GPU + A_CPU + A_MLB. Finally, we arrive at Equation 6 below, which will serve as the optimization objective when deciding the redundancy configurations for each system component.

Equation 6:
Y_die / A_die = (Y_L3 · Y_GPU · Y_CPU · Y_MLB) / (A_L3 + A_GPU + A_CPU + A_MLB)

Next, each component will be discussed in detail. While the component yields can be calculated with respect to the yield model, the component areas can only be known after detailed floorplanning. However, we have previously seen in our development of URaY that recursively triggering floorplanning routines while exploring the massive design space is computationally infeasible; therefore the design optimization algorithm must use reasonable approximations for the component areas when necessary.
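Expressed as code, the optimization objective is a one-line computation; the following minimal sketch assumes the zero-whitespace area estimate described above:

def yield_per_area(Y_L3, Y_GPU, Y_CPU, Y_MLB, A_L3, A_GPU, A_CPU, A_MLB):
    Y_die = Y_L3 * Y_GPU * Y_CPU * Y_MLB      # product of component yields
    A_die = A_L3 + A_GPU + A_CPU + A_MLB      # zero-whitespace area estimate
    return Y_die / A_die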
L3 Cache/Last Level Cache (LLC)

The L3 cache typically has a large capacity, and may be shared amongst the CPU and the GPU [83]. The L3 cache is characterized by its fixed capacity (e.g. 8MB) and its cache structure (e.g. number of rows/columns/subarrays). We do not allow GlYFF to change the cache capacity or alter the cache structure, so that the timing parameters (e.g. read/write response time) of the cache remain stable once the cache design is complete. GlYFF can, however, manipulate the number of spare rows (sp_row) and spare columns (sp_col), which represent the amount of extra hardware dedicated to yield enhancement. These parameters collectively determine the dimensions (width W_L3, height H_L3 and area A_L3) and yield (Y_L3) of the L3 cache. We calculate Y_L3 with a Poisson yield model tailored to caches augmented with spare rows/columns.

We use a modified version of CACTI, which can fully account for the interconnect and multiplexer area overheads associated with spare row/column insertion, to derive the area of the L3 cache with a single spare row and a single spare column, and use its value to approximate A_L3. This is because such a redundancy configuration allows the cache to achieve very high yields [86]. Towards the end of the floorplanning stage in GlYFF, the cache will opportunistically incorporate additional spare rows or columns. Both the yield model and the modified CACTI tool were developed by the authors in [8].

Integrated Graphics Processing Unit (GPU)

The integrated GPU contains C_GPU symmetric low-complexity cores, along with the surrounding non-core logic (e.g. controllers and schedulers). Each GPU core has a fixed width (W_GPU_CORE) and fixed height (H_GPU_CORE), thus its area A_GPU_CORE is also a constant. To support core level redundancy (previously discussed in Chapter 2 and Chapter 5), the GPU may contain C_GPU_SP spare cores, and each spare core may be used to virtually replace any defective core. The yield of each GPU core is Y_GPU_CORE, calculated as a function of A_GPU_CORE, d and the kill ratio of the GPU, k_GPU:

Y_GPU_CORE = e^(-k_GPU · d · A_GPU_CORE)

The non-core portion of the GPU is not optimized by GlYFF, and is assumed to be protected with traditional DFM techniques such as redundant wiring [34]. The yield of this portion of the GPU is therefore fixed to be Y_GPU_NONCORE. The yield of the entire GPU, Y_GPU, can then be calculated using Equation 7 (where C(n, i) denotes the binomial coefficient):

Equation 7:
Y_GPU = Y_GPU_NONCORE · Σ_{i=C_GPU}^{C_GPU+C_GPU_SP} C(C_GPU+C_GPU_SP, i) · (Y_GPU_CORE)^i · (1 - Y_GPU_CORE)^(C_GPU+C_GPU_SP-i)

The dimensions of the GPU (width W_GPU, height H_GPU and area A_GPU) are determined by the floorplan, which includes the C_GPU + C_GPU_SP GPU cores and the non-core portion, the latter of which is abstracted as a soft module with a fixed area of A_GPU_NONCORE. During optimization, we estimate A_GPU as follows:

A_GPU ≈ (C_GPU + C_GPU_SP) · A_GPU_CORE + A_GPU_NONCORE
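A minimal sketch of the GPU yield computation, combining the Poisson core yield with Equation 7 (the GPU survives when at least C_GPU of the C_GPU + C_GPU_SP cores are functional):

from math import comb, exp

def gpu_yield(C_GPU, C_GPU_SP, A_GPU_CORE, d, k_GPU, Y_GPU_NONCORE):
    Y_core = exp(-k_GPU * d * A_GPU_CORE)     # Poisson yield of one GPU core
    total = C_GPU + C_GPU_SP
    p_enough = sum(comb(total, i) * Y_core**i * (1 - Y_core)**(total - i)
                   for i in range(C_GPU, total + 1))
    return Y_GPU_NONCORE * p_enough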
Multi-Core CPU

The CPU consists of C_CPU high-complexity cores, where C_CPU is typically small (e.g. C_CPU = 4 or C_CPU = 8). Each CPU core contains n + 1 modules (M_0 - M_n, where M_i denotes Module i). All of the modules have fixed dimensions which cannot be modified by GlYFF, although GlYFF can rotate modules in the floorplanning process. Module 0 (M_0) is the private L1/L2 cache, while M_1 - M_n are the random logic modules. The area and yield of M_i are denoted as A_i and Y_i, respectively. Y_i can be calculated as follows with the kill ratio of the CPU, k_CPU:

Y_i = e^(-k_CPU · d · A_i)

The CPU may use the following three forms of redundancy. Regardless of the redundancy scheme used, all cores are assumed to be homogeneous. For example, if Module 2 of a particular CPU core contains one spare module, then all CPU cores have one and only one spare module for Module 2.

1) Fine Grained/Module Level: For each CPU core, each M_i (0 ≤ i ≤ n) may have C_i_SP (C_i_SP ≥ 0) spare modules. Since the logic modules are distinct, spares cannot be shared across modules. Since the size of the L1/L2 cache (M_0) is relatively small compared to the L3 cache, it can achieve a very high yield with a few spare rows/columns [86]. Therefore our framework will not manipulate the L1/L2 cache: we simply assume it has already incorporated spare rows/columns, and the area overhead has already been accounted for in A_0. In other words, redundancy is not needed for M_0 (i.e. C_0_SP = 0).

2) Coarse Grained/Core Level: Similar to the GPU, the CPU may employ C_CPU_SP spare cores that can virtually replace any defective core.

3) Hybrid: The CPU may also employ redundancy at both granularities. In this case, there is at least one spare CPU core, and all CPU cores have the same module level redundancy configuration.

A CPU core is functional when there exists no i (0 ≤ i ≤ n) such that M_i and all of its spares (if available) are defective, and the CPU is functional when there are at least C_CPU functional cores. This means that a system designed for four CPU cores cannot function, or cannot be sold, as a three CPU core system. The yield of each CPU core, Y_CPU_CORE, is calculated with Equation 8, and the yield of the CPU as a whole, Y_CPU, is calculated with Equation 9.

Equation 8:
Y_CPU_CORE = Π_{i=0}^{n} [1 - (1 - Y_i)^(C_i_SP + 1)]

Equation 9:
Y_CPU = Σ_{j=C_CPU}^{C_CPU+C_CPU_SP} C(C_CPU+C_CPU_SP, j) · (Y_CPU_CORE)^j · (1 - Y_CPU_CORE)^(C_CPU+C_CPU_SP-j)

For each CPU core, the dimensions (width W_CPU_CORE and height H_CPU_CORE) and area (A_CPU_CORE) are largely dominated by the sum of all the modules (including spare modules). The area of the CPU is the sum of all the core areas, therefore we estimate A_CPU with Equation 10.

Equation 10:
A_CPU ≈ (C_CPU + C_CPU_SP) · Σ_{i=0}^{n} (C_i_SP + 1) · A_i
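Similarly, a minimal sketch of Equations 8, 9 and 10 for the hybrid-redundancy CPU, with module areas A[0..n] and per-module spare counts C_SP[0..n] as inputs (Module 0 is the L1/L2 cache with C_SP[0] = 0):

from math import comb, exp

def cpu_core_yield(A, C_SP, d, k_CPU):
    Y_core = 1.0
    for A_i, c in zip(A, C_SP):
        Y_i = exp(-k_CPU * d * A_i)              # Poisson module yield (Equation above)
        Y_core *= 1 - (1 - Y_i) ** (c + 1)       # a stage survives if any copy works
    return Y_core                                # Equation 8

def cpu_yield(C_CPU, C_CPU_SP, A, C_SP, d, k_CPU):
    Y_core = cpu_core_yield(A, C_SP, d, k_CPU)
    total = C_CPU + C_CPU_SP
    return sum(comb(total, j) * Y_core**j * (1 - Y_core)**(total - j)
               for j in range(C_CPU, total + 1))  # Equation 9

def cpu_area_estimate(C_CPU, C_CPU_SP, A, C_SP):
    core_area = sum(a * (c + 1) for a, c in zip(A, C_SP))
    return (C_CPU + C_CPU_SP) * core_area         # Equation 10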
Miscellaneous Logic Block (MLB)

The MLB is abstracted as a soft module with a fixed area and fixed yield of A_MLB and Y_MLB, respectively. The MLB may have special logic built in to combat manufacturing defects, e.g. low overhead steering logic to direct memory requests away from defective memory controllers [87].

7.3 Motivational Case Study

In this section we examine a motivational case study that demonstrates the importance of a global approach towards yield enhancement in SoCs. We start with the bare design (a), which does not include any form of redundancy. The design configuration and component yield values are listed in the first row of Table XXIV, and the best achievable floorplan is shown in Figure 47a. Note that each CPU core contains a private L1/L2 cache of area 9 and seven distinct modules, each of area 1.

Using exhaustive Y/A design optimizations for each individual component, without any knowledge of the global floorplan, we arrive at the baseline design (b), which includes two spare GPU cores, one spare module for every module except the private L1/L2 cache inside each CPU core, and a spare row plus a spare column for the L3 cache. The floorplan is shown in Figure 47b, which obviously is not ideal. But even with this floorplan, the redundancy augmented system improved Y/A by 1140% (over 10x) compared to the bare design (a).

There is one simple but effective adjustment to design (b): rearranging the GPU cores to arrive at design (c), whose floorplan is shown in Figure 47c, while maintaining the same redundancy configuration as design (b). In other words, design (c) differs from design (b) only in the local floorplan of the GPU. Although this transition from design (b) to design (c) is very obvious in this small scale case study, if the GPU DFY tool were entirely decoupled from the rest of the system design process, the GPU would never have taken the core layout shown in Figure 47c due to the wasted whitespace.

Design (c) can be further optimized by utilizing the GPU whitespace to include one extra spare GPU core, thus bringing the number of spare GPU cores from two in designs (b)/(c) to three in design (d). There also remains some space between the CPU and the MLB, which can be filled if each CPU core contains one more spare module for every module except the L1/L2 cache. Doing so causes the combined width of the four CPU cores to exceed that of the L3 cache, but the cache can include one additional spare column to match the widened CPU. Through this series of floorplan and yield aware design optimizations, we finally arrive at design (e), which improves Y/A over design (b) by 34.5% and over design (c) by 18.5%.

Figure 47. Motivational case study for GlYFF: (a) bare design, Y/A: 1.5×10^-4; (b) baseline design, Y/A: 1.71×10^-3; (c) improved baseline design, Y/A: 1.94×10^-3; (d) further optimized design, Y/A: 2.16×10^-3; (e) final optimized design, Y/A: 2.3×10^-3

Table XXIV. Case study design configuration

Design   GPU (# of cores / core area / core yield)   CPU (# of cores / core area / core yield)   Module Areas              Cache (# of rows / # of columns / yield)
(a)      6 / 9 / 0.82                                4 / 16 / 0.4783                             9, 1, 1, 1, 1, 1, 1, 1    13 / 12 / 0.3
(b)      8 / 9 / 0.82                                4 / 23 / 0.9321                             9, 2, 2, 2, 2, 2, 2, 2    14 / 13 / 0.9
(c)      8 / 9 / 0.82                                4 / 23 / 0.9321                             9, 2, 2, 2, 2, 2, 2, 2    14 / 13 / 0.9
(d)      9 / 9 / 0.82                                4 / 23 / 0.9321                             9, 2, 2, 2, 2, 2, 2, 2    14 / 13 / 0.9
(e)      9 / 9 / 0.82                                4 / 30 / 0.9930                             9, 3, 3, 3, 3, 3, 3, 3    14 / 14 / 0.92

This case study demonstrates that applying DFY techniques without being aware of the other on-die system components or the global floorplan will result in a significant loss in overall Y/A.

7.4 Theorems

In this section we will present several theorems which not only provide guidelines for pruning the algorithm search space in GlYFF, but also provide insights for redundancy based DFY approaches in general.

First we set up the system model for Theorem 3 and Theorem 4. The (single core) system consists of N + 1 modules: M_0 and M_1 - M_N. M_0 is the cache, which already includes spare rows/columns. The area of M_0 is β, and the yield of M_0 is Y_0. Since M_0 does not need any other form of redundancy, both Y_0 and β are constant. The rest of the modules (M_1 - M_N) are equal sized, and the area of each module is α. Note that having identical areas does not necessarily mean that the logic inside the modules is the same. The designer may choose to include x spare modules in the system. These spare modules may be for any module except M_0, or even shared amongst multiple modules.

Let Y(x) be the die yield of a system with x spare modules, and A(x) be the estimated die area. Y(x) can be calculated using any yield model; A(x) is approximated as the sum of all module areas, i.e. A(x) = β + (x + N)α. We ignore the area of the additional steering/testing logic from spare module insertion. An example system (not a floorplan representation) is illustrated in Figure 48.

Figure 48. System model for Theorem 3 and Theorem 4

Assume the sequence of spare module insertion that will provide the maximum Y/A benefit is known (a sample insertion sequence: insert a spare for M_4, insert a spare for M_6, then insert a spare for M_9 ...). Since we are essentially trading area for yield, we can plot the increase in yield as the die area increases due to the insertion of spare modules. We refer to this as the yield-area function plot, as seen in Figure 49.
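For illustration, the following sketch enumerates the points of a yield-area function plot; the Poisson-style module yield and the even, non-shared distribution of spares are assumptions made purely for this example, since Y(x) may be computed with any yield model:

def yield_area_points(N, alpha, beta, Y0, Y_mod, max_spares):
    # Returns (A(x), Y(x)) pairs; A(x) = beta + (x + N) * alpha in all cases.
    points = []
    for x in range(max_spares + 1):
        k, r = divmod(x, N)        # r stages hold one extra spare beyond level k
        Y = (Y0
             * (1 - (1 - Y_mod) ** (k + 2)) ** r
             * (1 - (1 - Y_mod) ** (k + 1)) ** (N - r))
        points.append((beta + (x + N) * alpha, Y))
    return points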
Figure 49. Example yield-area function plot for Theorem 3 and Theorem 4 (system/die yield versus die area, from the bare design at β + Nα up to six spare modules at β + (N + 6)α)

Let us now examine a simple case study with concrete area, yield and Y/A numbers from Table XXV. The yield-area function plot is given in Figure 50, and we notice that this yield-area function is concave. Now if we look at the plot of Y/A in Figure 51, we can see that once the Y/A experiences its initial jump, it begins to monotonically decrease.

Table XXV. Prelude to theorems: area, yield and Y/A numbers (Case 1)

Area   Yield      Y/A
11     0.493575   0.04487
12     0.541942   0.045162
13     0.583374   0.044875
14     0.619613   0.044258
15     0.651816   0.043454
16     0.680793   0.04255
17     0.707132   0.041596
18     0.731273   0.040626
19     0.753556   0.039661
20     0.774245   0.038712

Figure 50. Prelude to theorems: yield-area function plot (Case 1)

Figure 51. Prelude to theorems: Y/A plot (Case 1)

If we switch to a different set of numbers in Table XXVI, where the yield-area function plot (Figure 52) is convex, we can see that the Y/A values will monotonically increase (Figure 53).

Table XXVI. Prelude to theorems: area, yield and Y/A numbers (Case 2)

Area   Yield      Y/A
11     0.085184   0.007744
12     0.110592   0.009216
13     0.140608   0.010816
14     0.175616   0.012544
15     0.216      0.0144
16     0.262144   0.016384
17     0.314432   0.018496
18     0.373248   0.020736
19     0.438976   0.023104
20     0.512      0.0256

Figure 52. Prelude to theorems: yield-area function plot (Case 2)

Figure 53. Prelude to theorems: Y/A plot (Case 2)

Now we are ready to present Theorem 3 and Theorem 4, which reveal how the Y/A values behave when certain characteristics of the yield-area function are known.

Theorem 3: If there exists x = x' such that the yield-area function plot is concave at and beyond x', and the Y/A of a system with x' + 1 spare modules is smaller than that of a system with x' spare modules, then the Y/A of the system will monotonically decrease for any x ≥ x'. Formally: if Y(x+2) - Y(x+1) ≤ Y(x+1) - Y(x) for all x ≥ x', and Y(x'+1)/A(x'+1) < Y(x')/A(x'), then Y(x+1)/A(x+1) < Y(x)/A(x) for all x ≥ x'.

Proof: Since A(x+1) = A(x) + α, the condition Y(x+1)/A(x+1) < Y(x)/A(x) is equivalent to [Y(x+1) - Y(x)]/α < Y(x)/A(x). By assumption this holds at x = x'. Because Y(x'+1)/A(x'+1) is the mediant of Y(x')/A(x') and [Y(x'+1) - Y(x')]/α, it lies strictly between them, so [Y(x'+1) - Y(x')]/α < Y(x'+1)/A(x'+1). Concavity then gives [Y(x'+2) - Y(x'+1)]/α ≤ [Y(x'+1) - Y(x')]/α < Y(x'+1)/A(x'+1), i.e. the Y/A also decreases from x' + 1 to x' + 2. Repeating this argument inductively for every x ≥ x' completes the proof.

Theorem 4: If there exists x = x' such that the yield-area function plot is convex at and beyond x', and the Y/A of a system with x' + 1 spare modules is larger than that of a system with x' spare modules, then the Y/A of the system will monotonically increase from x'. Formally: if Y(x+2) - Y(x+1) ≥ Y(x+1) - Y(x) for all x ≥ x', and Y(x'+1)/A(x'+1) > Y(x')/A(x'), then Y(x+1)/A(x+1) > Y(x)/A(x) for all x ≥ x'.

Proof: Symmetric to the proof of Theorem 3, with all inequalities reversed: Y/A increasing at x' means [Y(x'+1) - Y(x')]/α > Y(x')/A(x'), the mediant property gives Y(x'+1)/A(x'+1) < [Y(x'+1) - Y(x')]/α, and convexity gives [Y(x'+2) - Y(x'+1)]/α ≥ [Y(x'+1) - Y(x')]/α > Y(x'+1)/A(x'+1). Induction completes the proof.

We can see that Theorem 3 and Theorem 4 rely only on the concave or convex characteristics of the yield-area function plot. In other words, Theorem 3 and Theorem 4 are agnostic to yield models and to how spare modules are utilized. Using these two theorems, the DFY engineer can prune the design space more effectively. For example, as illustrated in Figure 54, by analyzing the yield-area function, which contains high level abstractions and approximations regarding the redundancy augmented design, the DFY engineer can select a subset of design points for detailed floorplanning, routing, timing simulation or other design validations. More specifically, since the yield-area function starts with a convex region followed by a concave region, the designer can do the following (a sketch of this procedure is given after the list):

1) Start with the bare design, then add the one redundant module which brings the highest theoretical Y/A gain, and evaluate/validate this design through floorplanning, timing simulation etc. to see if the actual Y/A is indeed greater than that of the bare design. If not, then redundancy is not needed at all, and the DFY process can terminate.

2) Jump to the design point where the concave region in the yield-area function plot starts, and keep adding redundancy, evaluating/validating each design until the observed Y/A starts to decrease.

3) Choose the most suitable design among all the examined designs.
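A minimal sketch of this pruning procedure, where evaluate is a hypothetical stand-in for the expensive evaluation/validation flow:

def prune_design_space(evaluate, concave_start, max_spares):
    # evaluate(x) runs the expensive flow (floorplanning, timing simulation,
    # ...) and returns the observed Y/A of the design with x spare modules.
    candidates = {0: evaluate(0)}                 # step 1: the bare design
    candidates[1] = evaluate(1)                   # one spare, highest theoretical gain
    if candidates[1] <= candidates[0]:
        return 0                                  # redundancy is not worthwhile at all
    x = concave_start                             # step 2: jump to the concave region
    candidates[x] = evaluate(x)
    while x < max_spares:
        nxt = evaluate(x + 1)
        if nxt < candidates[x]:                   # observed Y/A starts to decrease
            break
        x += 1
        candidates[x] = nxt
    return max(candidates, key=candidates.get)    # step 3: best examined design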
Essentially, instead of examining the entire spectrum of designs, only the sifted design points indicated by the red crosses in Figure 54 need to be examined.

Figure 54. Utilization of Theorem 3 and Theorem 4 (the yield-area function's convex region followed by its concave region, with the examined design points marked as seeing a Y/A improvement, seeing a Y/A decrease, or being potential solutions)

Next we add some limiting factors to the previous system model to set up a new, more restrictive system model for Theorem 5, Theorem 6, Theorem 7 and Theorem 8. First, we will no longer allow the sharing of spare modules (such as seen in the hybrid share redundancy scheme), which means a spare module can only be used to replace one particular module (such as seen in reduced redundancy insertion). A stage is defined as the original module plus all of its spares (same as the definition in Section 4.2). Since M_1 - M_N are equally sized, we use the Poisson yield model and let Y_1-N be the yield of any one of those modules based on their critical area. Following the Poisson yield model, the bare system yield can be calculated as Y_0(Y_1-N)^N. When inserting spares, the Universal-k scheme means that k spare module(s) is/are included for each of M_1, M_2, ..., and M_N.

Theorem 5: When adding a spare module, adding it to the stage with the least number of spares will result in the largest increase in Y/A.

Proof: Assume stage i has the least number of spares, and the number of modules (including the spare modules and the original module) in this stage is k_i. Choose any other stage j with k_j > k_i. We only need to compare the yield of these two stages (stage i and stage j), since the rest of the stages do not change and the increase in area is fixed to be α when adding a single spare module. If we choose to add a spare to stage i, then the yield of the two stages becomes:

[1 - (1 - Y_1-N)^(k_i+1)] · [1 - (1 - Y_1-N)^(k_j)]

If we choose to add a spare to stage j, then the yield of the two stages becomes:

[1 - (1 - Y_1-N)^(k_i)] · [1 - (1 - Y_1-N)^(k_j+1)]

We wish to prove that

[1 - (1 - Y_1-N)^(k_i+1)] · [1 - (1 - Y_1-N)^(k_j)] > [1 - (1 - Y_1-N)^(k_i)] · [1 - (1 - Y_1-N)^(k_j+1)]

Expanding both sides and cancelling the common terms, this reduces to (1 - Y_1-N)^(k_i) > (1 - Y_1-N)^(k_j); since k_j > k_i and 1 - Y_1-N < 1, the above inequality holds.
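The allocation order implied by Theorem 5 can be sketched as follows:

def allocate_spares(N, num_spares):
    # Returns spares[i], the number of spares for stage i (stages M_1 .. M_N).
    spares = [0] * N
    for _ in range(num_spares):
        i = spares.index(min(spares))   # the stage with the least spares (Theorem 5)
        spares[i] += 1
    return spares

# e.g. allocate_spares(3, 5) returns [2, 2, 1]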
Theorem 6: There exists a non-negative integer k such that the Universal-k scheme results in the optimal design with maximal Y/A.

Proof: For N < 3 the proof is trivial, so we consider N ≥ 3. From Theorem 5, we conclude that the order of spare insertion must evenly distribute the inserted spares, instead of inserting spares for one stage repeatedly. Since the modules are equally sized (except M_0), we determine the spare insertion sequence as follows: insert one spare for M_1, then M_2, then M_3 ... and finally M_N, and repeat this pattern as further spares are inserted. Let x be the total number of spares inserted, let y(x) be the yield of the stage with the least number of spares when x spares have been inserted in total (naturally y(x+1) > y(x)), and let Y_SYSTEM(x) be the yield of the entire system when x spares have been inserted in total.

We will now prove that the Y/A function plot under this spare insertion order is segment-wise convex (Figure 56), i.e. convex within each segment. A segment contains the consecutive points from any x such that x mod N = 0 to x + N. Note that when x = 0, the design is the bare design, and of course x mod N = 0. When x = 1, x mod N = 1, and all stages (recall that a stage contains the original module plus all of its spare modules) have the same number of spares (i.e. zero) except for one, which has one more spare module than the rest. As x increases, the design gradually adds spares to the stages that currently lack the extra spare, until all stages again have the same number of spares, i.e. x mod N = 0. This pattern repeats, as illustrated in Figure 55 for a system where N = 3.

Figure 55. Explanation for Theorem 6 (for N = 3 and x = 0 to 8: each time x mod N returns to 0, all stages again hold the same number of spares)

Figure 56. Y/A plot for the proof of Theorem 6 (Y/A versus area, segment-wise convex, with segment boundaries at the Universal-0, Universal-1, Universal-2 and Universal-3 design points)

Within each segment, we have:

Y_SYSTEM(x+1) = Y_SYSTEM(x) · [y(x+1)/y(x)]
Y_SYSTEM(x+2) = Y_SYSTEM(x) · [y(x+1)/y(x)] · [y(x+1)/y(x)]

therefore:

Y_SYSTEM(x+2) - Y_SYSTEM(x+1) = Y_SYSTEM(x) · [y(x+1)/y(x)] · [y(x+1)/y(x) - 1]
Y_SYSTEM(x+1) - Y_SYSTEM(x) = Y_SYSTEM(x) · [y(x+1)/y(x) - 1]

Since y(x+1) > y(x), we have Y_SYSTEM(x+2) - Y_SYSTEM(x+1) > Y_SYSTEM(x+1) - Y_SYSTEM(x).

Theorem 7: For the Universal-1 scheme to bring an improvement in Y/A over the bare design (i.e. Universal-0), Y_1-N must satisfy:

Y_1-N < 2 - [(β + 2Nα)/(β + Nα)]^(1/N)

In other words, the system yield Y_0(Y_1-N)^N must satisfy:

Y_0(Y_1-N)^N < Y_0 · (2 - [(β + 2Nα)/(β + Nα)]^(1/N))^N

Proof: For

Y_0[1 - (1 - Y_1-N)^2]^N / (β + 2Nα) > Y_0(Y_1-N)^N / (β + Nα)

to be true, noting that 1 - (1 - Y_1-N)^2 = Y_1-N(2 - Y_1-N), we must have:

(2 - Y_1-N)^N > (β + 2Nα)/(β + Nα), i.e. Y_1-N < 2 - [(β + 2Nα)/(β + Nα)]^(1/N)

Theorem 8: For the Universal-2 scheme to bring an improvement in Y/A over the bare design, Y_1-N must satisfy:

Y_1-N < (3 - sqrt(4[(β + 3Nα)/(β + Nα)]^(1/N) - 3)) / 2

In other words, the system yield Y_0(Y_1-N)^N must satisfy:

Y_0(Y_1-N)^N < Y_0 · [(3 - sqrt(4[(β + 3Nα)/(β + Nα)]^(1/N) - 3)) / 2]^N

Proof: For

Y_0[1 - (1 - Y_1-N)^3]^N / (β + 3Nα) > Y_0(Y_1-N)^N / (β + Nα)

to be true, noting that 1 - (1 - Y_1-N)^3 = Y_1-N(Y_1-N^2 - 3Y_1-N + 3), we must have:

Y_1-N^2 - 3Y_1-N + 3 > [(β + 3Nα)/(β + Nα)]^(1/N)

and solving this quadratic inequality for Y_1-N < 1 yields the bound above.

For the last theorem, we focus on multi-core systems. The system consists of N symmetric cores. Assume the yield of each core is Y and the area of each core is A. Note that since the cores are identical, we assume any inserted spare core can be used to replace any defective original core, provided that the spare core is not defective itself. The system is functional (i.e. the die is marketable) when there are at least N functional cores.

Theorem 9: If a single shared spare core is inserted, it will only increase Y/A if:

Y < 1 - 1/N^2

Proof: Let Y_system(k) be the yield of the entire multi-core system that contains k spare cores, so that Y_system(0) = Y^N and Y_system(1) = Y^(N+1) + (N+1)Y^N(1 - Y). For

Y_system(1) / ((N+1)A) > Y_system(0) / (NA)

to be true, we must have:

N[Y + (N+1)(1 - Y)] > N + 1, i.e. N^2(1 - Y) > 1, i.e. Y < 1 - 1/N^2

We can extrapolate from Theorem 9 to arrive at the following. If k (k > 1) spare cores are inserted for an N core system, this will only increase Y/A if Y is smaller than a certain threshold Y_limit(N, k), where Y_limit(N, k) acts as a tight upper bound for Y. The inequality Y < Y_limit(N, k) is equivalent to:

N · Σ_{i=N}^{N+k} C(N+k, i) · Y^(i-N) · (1 - Y)^(N+k-i) > N + k

We hypothesize that, for a fixed N, Y_limit(N, k) is monotonically decreasing in k. This is intuitively correct, since for the insertion of many spare cores to be effective, the original core yield must be very low.
However, this straightforward hypothesis is rather difficult to formally prove due to the complexity of the above inequality, thus we use the look-up table presented below to serve as an informal proof.

Table XXVII. Y_limit(N, k) values

k \ N    2      3      4      5      6      7      8
1        0.75   0.88   0.93   0.96   0.97   0.97   0.98
2        0.66   0.83   0.89   0.93   0.95   0.96   0.97
3        0.6    0.78   0.86   0.9    0.93   0.95   0.96
4        0.55   0.75   0.84   0.88   0.91   0.93   0.95
5        0.52   0.71   0.81   0.87   0.9    0.92   0.94
6        0.48   0.69   0.79   0.85   0.89   0.91   0.93
7        0.46   0.66   0.77   0.83   0.87   0.9    0.92
8        0.44   0.64   0.75   0.82   0.86   0.89   0.91

In Table XXVII we list the Y_limit(N, k) values for different combinations of N and k. If we focus on any particular column (i.e. fix N), we can see that Y_limit(N, k) decreases as we go down the table (i.e. increasing k), which is consistent with our hypothesis. On the other hand, if we focus on a single row, we can see that Y_limit(N, k) increases as we move towards the right of the table (i.e. increasing N), which is consistent with the well known fact that core level redundancy is most effective for many-core architectures (e.g. GPUs and other SIMD systems). This look-up table only needs to be constructed once, and can be reused in the DFY process.
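A minimal sketch of how the Y_limit(N, k) entries can be computed, using bisection on the inequality above (the monotonicity of the inequality in Y is assumed, per the hypothesis):

from math import comb

def improves(Y, N, k):
    # True when k spare cores still improve Y/A at core yield Y,
    # i.e. N * sum(C(N+k, i) * Y^(i-N) * (1-Y)^(N+k-i)) > N + k.
    total = N + k
    lhs = N * sum(comb(total, i) * Y**(i - N) * (1 - Y)**(total - i)
                  for i in range(N, total + 1))
    return lhs > total

def y_limit(N, k, eps=1e-6):
    lo, hi = 0.0, 1.0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if improves(mid, N, k) else (lo, mid)
    return lo

# e.g. y_limit(2, 1) is approximately 0.75, matching Table XXVII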
7.5 Experimental Results

The system configuration of the bare SoC design is shown in the first row of Table XXVIII. There are a total of 24 cores in the GPU, four cores in the CPU and 31 modules in each CPU core, one of which is the private L1/L2 cache. There are no spare cores or spare modules in the bare SoC design, which is indicated by "+ 0" in the first row of Table XXVIII and Table XXX. The cache does not have any spare rows or columns.

Under a moderate defect density of d = 4.9 × 10^-6, the bare design must be fortified against manufacturing defects using redundancy. Using existing algorithms and toolsets (exhaustive search for the GPU and L3 cache plus HYPER [77] for the CPU), we arrive at the baseline design. The redundancy configuration of the baseline design is shown in the second row of Table XXVIII: 10 spare cores for the GPU (indicated by "+ 10"), no spare core for the CPU (indicated by "+ 0"), 30 spare modules for each CPU core (indicated by "+ 30"; there is one spare for each module except the L1/L2 cache) and one spare row plus one spare column for the L3 cache. The best achievable floorplan for this baseline design is shown in Figure 57, with the design evaluations shown in the first row of Table XXIX.

The GlYFF design is quite different from the baseline design. It has, as seen in the third row of Table XXVIII, 12 spare cores for the GPU (indicated by "+ 12", two more than the baseline design), one spare core for the CPU (indicated by "+ 1"), 30 spare modules for each CPU core (indicated by "+ 30") and two spare rows plus one spare column for the L3 cache (which increased the yield of the L3 cache from 0.77 to 0.89, but also significantly increased the height of the cache due to the extra spare row and the accompanying steering logic and interconnects). The module level redundancy configuration for each CPU core remained the same as the baseline design. Due to the spare CPU core, the GlYFF design actually utilized hybrid redundancy and not pure fine grained module level redundancy. The floorplan of this GlYFF design is shown in Figure 58. From Table XXIX we can see that compared to the baseline design, the GlYFF design improved the Y/A value of the system by 17.91%.

Table XXVIII. Redundancy configuration for the SoC

Design      GPU # of Cores   CPU # of Cores   Total # of Modules per CPU Core   L3 # of Spare Rows   L3 # of Spare Columns
Bare        24 + 0           4 + 0            31 + 0                            0                    0
Baseline    24 + 10          4 + 0            31 + 30                           1                    1
Optimized   24 + 12          4 + 1            31 + 30                           2                    1

Figure 57. Baseline design floorplan

Figure 58. GlYFF design floorplan

Table XXIX. Design evaluations

Design      System Yield   Die Area (Post Floorplan)    Yield per Area (Y/A)
Baseline    0.6318         4232 × 1400 = 5924800        1.0664×10^-7
Optimized   0.8700         3844 × 1800 = 6919200        1.2574×10^-7

Now we use the same design, but a much higher defect density of d = 1.4 × 10^-5, and see how GlYFF reacts. The baseline design uses 24 spare cores for the GPU (indicated by "+ 24"), no spare core for the CPU (indicated by "+ 0"), 60 spare modules for each CPU core (indicated by "+ 60") and one spare row plus one spare column for the L3 cache. This configuration is shown in the second row of Table XXX. The best achievable floorplan for this baseline design is shown in Figure 59, with the design evaluations in the first row of Table XXXII.

The GlYFF design is much more aggressive in terms of redundancy insertion under the high defect density environment. From the third row of Table XXX we see that the GlYFF design has 26 spare cores for the GPU (indicated by "+ 26", two more than the baseline design), no spare core for the CPU (indicated by "+ 0"), 71 spare modules for each CPU core (indicated by "+ 71") and two spare rows plus one spare column for the L3 cache. The details of the module level redundancy configuration for each CPU core are presented in Table XXXI; the GlYFF design differs from the baseline design in the spare counts of Modules 1, 4, 8, 10, 15, 17, 18, 20, 27, 28 and 29. The floorplan of this design is shown in Figure 60. From Table XXXII we can see that here the Y/A improvement is 23.08%.

Table XXX. Redundancy configuration for the SoC

Design      GPU # of Cores   CPU # of Cores   Total # of Modules per CPU Core   L3 # of Spare Rows   L3 # of Spare Columns
Bare        24 + 0           4 + 0            31 + 0                            0                    0
Baseline    24 + 24          4 + 0            31 + 60                           1                    1
Optimized   24 + 26          4 + 0            31 + 71                           2                    1

Table XXXI. Redundancy configuration for each CPU core (number of spare modules, excluding the original module)

Module 0 (L1/L2 cache): 0 spares in both designs.
Baseline design: 2 spares for every logic module, except Module 21 (3 spares) and Module 28 (1 spare), for 60 spares in total.
Optimized design: 3 spares for Modules 1, 4, 8, 10, 15, 17, 18, 20, 21, 27 and 29, and 2 spares for all other logic modules, for 71 spares in total.

Figure 59. Baseline design floorplan

Figure 60. GlYFF design floorplan

Table XXXII. Design evaluations

Design      System Yield   Die Area (Post Floorplan)    Yield per Area (Y/A)
Baseline    0.5867         4574 × 1661 = 7597414        7.7222×10^-8
Optimized   0.7828         4071 × 2023 = 8235633        9.5044×10^-8

7.6 Conclusions

We previously witnessed the complexity of floorplan aware yield enhancement in single core CPUs during the development of URaY in Chapter 6. For large SoCs, the system is even more complex, with different components having different suitable yield enhancement techniques. When redundancy based DFY is used to enhance yield or Y/A in low yield environments, the design optimization framework must be constantly aware of the changes in the local and global floorplan so that it can accurately evaluate the final die yield and area.
GlYFF is a holistic computer-aided DFY framework that efficiently explores the massive design space to help designers converge to the solution design. GlYFF can not only determine the redundancy configuration for each system component (e.g. CPU, GPU or cache), but also establish the detailed floorplan for each system component as well as the overall die. With the assistance of GlYFF, the designer can perform more in-depth system validations, such as layout level timing simulations, if necessary. More importantly, like URaY, GlYFF revealed the insufficiencies of existing theoretical redundancy insertion algorithms: the preferred design point suggested by GlYFF typically deviates from the design point chosen by existing DFY algorithms.

Conclusions and Future Research Directions

As technology continues to evolve, aggressively pushing for higher circuit density, higher performance and lower power consumption, manufacturing yield is becoming a major concern for IC manufacturers and designers. Various post-design "Design for Manufacturing (DFM)" techniques (e.g. redundant via insertion, OPC and CMP) have already been deployed in industry. However, DFM may not be enough as yield continues to drop. "Design for Yield (DFY)" is a more proactive solution to diminishing yield: it requires the consideration of yield issues at the design stage, which will overturn the established design paradigm by ushering in yield as a new primary design objective alongside the current ones, namely area, performance and power. DFY can (and should) be used in tandem with DFM techniques.

In this work, we focus on DFY approaches that originate from the use of hardware redundancy. While the use of redundancy has been studied in theory and demonstrated to be very effective in low yield environments despite the seemingly high area overheads, many other important design aspects are still not considered in those theoretical analyses. We go beyond the known theories in hardware redundancy insertion, and introduce new computer-aided DFY frameworks for the new design paradigm.

We first introduced new ways of applying hardware redundancy along with the corresponding design optimization algorithms. "Reduced Redundancy" proposed the inclusion of spare modules that are smaller in area and lower in performance but higher in yield. "Hybrid Shared Redundancy" combined fine grained module level redundancy with coarse grained core level redundancy in multi-core CPU architectures. URaY is a powerful CAD tool that co-optimizes yield, area and performance while being aware of the chip floorplan and global wiring. Unlike the theoretical analyses in past literature, it is capable of producing full floorplanning images with global wiring estimations to guide designers in converging to the final solution. We showed that when practical issues such as floorplanning and wiring are involved, the preferred design point usually deviates from the theoretical optimal design point, with the former achieving a 10%-20% increase in overall revenue. GlYFF is a DFY framework for large scale SoCs where different system components have their own suitable yield enhancement strategies, and again we have shown that with this holistic yield aware and floorplan aware design optimization framework, solution quality improves significantly over the theoretical optimal design point.
With these new design optimization frameworks, we believe that designers can better understand the impact of yield, and arrive at the optimized design much more efficiently.

DFY for random logic is still a relatively new field, with many more challenges and exciting opportunities ahead. For future work, a natural extension to URaY or GlYFF is to incorporate non-redundancy based DFY techniques such as redundant via insertion [32][33], redundant wiring [34] or microarchitectural graceful degradation [38]. We can see the potential benefits of having additional yield enhancement options in the DFY process in Figure 61. The bare design contains the cache, Module 1 (M_1) and Module 2 (M_2). To enhance the yield and Y/A of the system, a spare module for M_1 is inserted. At this point (the middle design in Figure 61), adding another spare module for M_2 would cause the global floorplan to have large amounts of whitespace. The designer may choose to use non-redundancy based DFY techniques for M_2, perhaps to avoid the floorplanning whitespace and the testing time costs for M_1 plus M_2. Clearly, M_2 can "expand" (i.e. trade area for enhanced yield) while keeping the die area unchanged. The final design may become the rightmost design in Figure 61, which uses a mix of different DFY techniques.

Figure 61. Utilizing non-redundancy based DFY techniques in the current frameworks

We can also expand our vision beyond permanent faults caused by manufacturing defects. Redundancy can also be used to combat soft errors and in-field reliability issues. We illustrate with a simple example in Figure 62. The bare design requires all modules plus the cache to be functional to perform its functionality: a cache read, followed by processing by M_1 and M_2, then finally writing the processed data back into the cache. Assume a spare module is inserted for both M_1 and M_2 to maximize Y/A. If all modules within any stage are defective, then the chip must be discarded. When one M_1 and one M_2 are defective, the defective modules are deactivated and the system functions as if no spare modules had been inserted. This is shown in the middle of Figure 62, where defective modules are shown in red.

This hardware actually has potential for soft error resiliency, aside from its ability to tolerate manufacturing defects. In the right of Figure 62, we see that one M_2 is defective, hence the data processing intended for M_2 must be performed by the other M_2, and the defective M_2 must be permanently disabled. Neither M_1 is defective, hence it is possible for both M_1's to mirror each other's operation. The purpose of doing so is to enable comparison based checking of the output data. In case of an output mismatch, at least one M_1 was affected by soft errors, and the upper software layer or scheduler must take the appropriate actions. Such hardware assistance will enable cross-layer soft error resiliency and can potentially be much more efficient than addressing fault tolerant scheduling from the software side alone [88][89][90].
Figure 62. Utilizing redundancy for DFY and cross-layer soft error resiliency

References

[1] R. C. Leachman and C. N. Berglund, "Systematic mechanisms limited yield (SMLY) study," Int'l. SEMATECH, DOC #03034383A, March 2003.
[2] J. Srinivasan et al., "The impact of technology scaling on lifetime reliability," Int'l Conf. on Dependable Systems and Networks, 2004.
[3] J. A. Cunningham, "The use and evaluation of yield models in integrated circuit manufacturing," IEEE Trans. on Semiconductor Manufacturing, 3(2): 60-71, 1990.
[4] H. Sun, N. Zheng, and T. Zheng, "Realization of L2 Cache Defect Tolerance Using Multi-bit ECC," IEEE Int'l Symp. on Defect and Fault Tolerance of VLSI Systems, 2008.
[5] R. E. Lyons et al., "The use of triple modular redundancy to improve computer reliability," IBM Journal of Research and Development, 7(2): 200-209, 1962.
[6] A. Agarwal et al., "A process-tolerant cache architecture for improved yield in nanoscale technologies," IEEE Trans. on Very Large Scale Integration Systems, 13(1), 2005.
[7] P. P. Shirvani and E. J. McCluskey, "PADded cache: a new fault-tolerance technique for cache memories," VLSI Test Symp., 1999.
[8] H. Hsuing et al., "Salvaging chips with caches beyond repair," DATE, 2012.
[9] S. Borkar, "Thousand core chips: a technology perspective," Design Automation Conference, 2007.
[10] T. Fischer et al., "Design Solutions for the Bulldozer 32-nm SOI 2-Core Processor Module in an 8-Core CPU," Int'l. Solid State Circuits Conf., 2011.
[11] R. Kumar and V. Zyuban, "Interconnections in multi-core architectures: Understanding mechanisms, overheads and scaling," ISCA, 2005.
[12] S. Borkar, "Thousand core chips-A technology perspective," DAC, 2007.
[13] The battle of the bridges, http://semiaccurate.com/2012/04/24/the-battle-of-the-bridges
[14] Teardown of Apple's A6 processor, http://appleinsider.com/articles/12/09/25/teardown_of_apples_a6_processor_finds_1gb_ram_2_cpu_3_gpu_cores
[15] M. D. Hill and M. R. Marty, "Amdahl's law in the multicore era," IEEE Trans. on Computers, 41(7): 33-38, 2008.
[16] P. Greenhalgh, "Big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7," ARM White Paper, 2011.
[17] T. Austin et al., "SimpleScalar: An infrastructure for computer system modeling," IEEE Trans. on Computers, 35(2): 59-67, 2002.
[18] M. Mirza-Aghatabar et al., "Algorithms to maximize yield and enhance yield/area of pipeline circuitry by insertion of switches and redundant modules," DATE, 2010.
[19] M. Mirza-Aghatabar et al., "SIRUP: switch insertion in redundant pipeline structures for yield and yield/area improvement," ATS, 2009.
[20] M. Mirza-Aghatabar et al., "Theory of logical partitioning of yield/area maximization using redundancy," IEEE Workshop on DFM & Y, 2011.
[21] D. Cheng and S. Gupta, "A systematic methodology to improve yield per area of highly-parallel CMPs," IEEE Int'l Symp. on Defect and Fault Tolerance in VLSI Systems, 2012.
[22] Y. Ooi et al., "Fault-Tolerant Architecture in a Cache Memory Control LSI," IEEE Journal of Solid-State Circuits, 27(4): 507-514, 1992.
[23] Y. Gao et al., "A New Paradigm for Trading Off Yield, Area and Performance to Enhance Performance per Wafer," DATE, 2013.
[24] Y. Gao et al., "Trading Off Area, Yield and Performance via Hybrid Redundancy in Multi-Core Architectures," VLSI Test Symposium, 2013.
[25] K. Nemoto et al., "Quantifying the benefits of cycle time reduction in semiconductor wafer fabrication," IEEE Transactions on Electronics Packaging Manufacturing, 23(1): 39-47, 2000.
[26] C. Weber, "Yield learning and the sources of profitability in semiconductor manufacturing and process development," IEEE Transactions on Semiconductor Manufacturing, 17(4): 590-596, 2004.
[27] R. E. Bohn and C. Terwiesch, "The economics of yield-driven processes," Journal of Operations Management, 18(1): 41-59, 1999.
[28] L. Zhang et al., "Defect tolerance in homogeneous manycore processors using core-level redundancy with unified topology," DATE, 2008.
[29] N. Aggarwal et al., "Configurable isolation: building high availability systems with commodity multi-core processors," ISCA, 2007.
[30] N. Sirisantana et al., "Enhancing Yield at the End of the Technology Roadmap," Design & Test of Computers, 2004.
[31] R. Kothe et al., "Embedded self repair by transistor and gate level reconfiguration," Design and Diagnostics of Electronic Circuits and Systems, 2006.
[32] N. Harrison, "A Simple Via Duplication Tool for Yield Enhancement," IEEE Int'l Symp. on Defect and Fault Tolerance in VLSI Systems, 2001.
[33] J. Bickford et al., "Yield improvement by local wiring redundancy," ISQED, 2006.
[34] A. B. Kahng, "Non-tree routing for reliability and yield improvement," Int'l Conference on Computer-Aided Design, 2002.
[35] S. Almukhaizim et al., "Cost-effective graceful degradation in speculative processor subsystems: the branch prediction case," Int'l Conference on Computer Design, 2003.
[36] S. Almukhaizim et al., "Faults in processor control subsystems: testing correctness and performance faults in the data prefetching unit," Asian Test Symposium, 2001.
[37] T.-Y. Hsieh et al., "Tolerance of performance degrading faults for effective yield improvement," Int'l Test Conference, 2009.
[38] P. Shivakumar et al., "Exploiting microarchitectural redundancy for defect tolerance," Int'l Conference on Computer Design, 2003.
[39] M. de Kruijf et al., "Relax: An architecture framework for software recovery of hardware faults," ISCA, 2010.
[40] A. R. Agnihotri et al., "Mixed block placement via fractional cut recursive bisection," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 24(5): 748-761, 2005.
[41] W. Swartz and C. Sechen, "Timing driven placement for large standard cell circuits," Design Automation Conf., 1995.
[42] M. Wang, X. Yang, and M. Sarrafzadeh, "Dragon2000: standard-cell placement tool for large industry circuits," Int'l. Conf. Computer-Aided Design, 2000.
[43] J. Kleinhans, G. Sigl, F. Johannes, and K. Antreich, "GORDIAN: VLSI placement by quadratic programming and slicing optimization," IEEE Trans. Computer-Aided Design Integrated Circuits, 1991.
[44] H. Eisenmann and F. M. Johannes, "Generic global placement and floorplanning," DAC, 1998.
[45] M. A. Breuer, "A class of min-cut placement algorithms," DAC, 1977.
[46] A. E. Dunlop and B. W. Kernighan, "A procedure for placement of standard cell VLSI circuits," IEEE Transactions on Computer-Aided Design, 4(1): 92-98, 1985.
[47] A. E. Caldwell, A. B. Kahng, and I. L. Markov, "Optimal partitioners and end-case placers for standard-cell layout," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 19(11): 1304-1313, 2000.
[48] M. C. Yildiz and P. H. Madden, "Global objectives for standard cell placement," Great Lakes Symposium on VLSI, 2001.
[49] A. Agnihotri, M. C. Yildiz, A. Khatkhate, A. Mathur, S. Ono, and P. H. Madden, "Fractional cut: Improved recursive bisection placement," Int'l. Conf. Computer-Aided Design, 2003.
[50] M. C. Yildiz and P. H. Madden, "Improved cut sequences for partitioning based placement," DAC, 2001.
[51] R. H. J. M. Otten, "What is a floorplan?," Int'l. Symp. on Physical Design, 2000.
[52] H. Murata, K. Fujiyoshi, S. Nakatake, and Y. Kajitani, "VLSI module placement based on rectangle-packing by the sequence pair," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 15(12): 1518-1524, 1996.
[53] Y. Pang, F. Balasa, K. Lampaert, and C.-K. Cheng, "Block placement with symmetry constraints based on the O-tree nonslicing representation," DAC, 2000.
[54] Y.-C. Chang, Y.-W. Chang, G.-M. Wu, and S.-W. Wu, "B*-trees: A new representation for non-slicing floorplans," DAC, 2000.
[55] B. Yao et al., "Revisiting floorplan representations," Int'l Symp. on Physical Design, 2001.
[56] S. N. Adya and I. L. Markov, "Fixed-outline floorplanning through better local search," Int'l. Conf. Computer Design, 2001.
[57] J.-M. Lin and Y.-W. Chang, "TCG: A transitive closure graph based representation for nonslicing floorplans," DAC, 2011.
[58] C. Sechen, "Chip-planning, placement, and global routing of macro/custom cell integrated circuits using simulated annealing," DAC, 1988.
[59] M. Upton, K. Samii, and S. Sugiyama, "Integrated placement for mixed macro cell and standard cell designs," DAC, 1990.
[60] A. Shanbhag, S. Danda, and N. Sherwani, "Floorplanning for mixed macro block and standard cell designs," Great Lakes Symp. on VLSI, 1994.
[61] A. B. Kahng, S. Reda, and Q. Wang, "APlace: a general analytic placement framework," Int'l. Symp. on Physical Design, 2005.
[62] U. Brenner and M. Struzyna, "Faster and better global placement by a new transportation algorithm," DAC, 2005.
[63] H. Yu, X. Hong, and Y. Cai, "MMP: a novel placement algorithm for combined macro block and standard cell layout design," Asia South Pacific Design Automation Conf., 2000.
[64] C. C. Chang, J. Cong, and X. Yuan, "Multi-level placement for large scale mixed-size IC designs," Asia South Pacific Design Automation Conf., 2003.
[65] S. N. Adya and I. L. Markov, "Consistent placement of macroblocks using floorplanning and standard-cell placement," Int'l Symp. on Physical Design, 2002.
[66] S. N. Adya, I. L. Markov, and P. G. Villarrubia, "On whitespace in mixed-size placement and physical synthesis," Int'l. Conf. on Computer Aided Design, 2003.
[67] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, "Multilevel hypergraph partitioning: Application in VLSI domain," DAC, 1997.
[68] A. Khatkhate, C. Li, A. R. Agnihotri, M. C. Yildiz, S. Ono, C.-K. Koh, and P. H. Madden, "Recursive bisection based mixed block placement," Int'l Symp. on Physical Design, 2004.
[69] A. R. Agnihotri et al., "An effective approach for large scale floorplanning," Great Lakes Symposium on VLSI, 2010.
[70] S. N. Adya, S. Chaturvedi, J. A. Roy, D. A. Papa, and I. L. Markov, "Unification of partitioning, placement, and floorplanning," Int'l. Conf. on Computer Aided Design, 2004.
[71] J. Cong, M. Romesis, and J. Shinnerl, "Fast floorplanning by look-ahead enabled recursive bipartitioning," Asia South Pacific Design Automation Conf., 2005.
Anders et al., "A 6.5 GHz 130 nm single-ended dynamic ALU and instruction scheduler loop," ISSCC Dig. Tech. Papers, 410–411, 2002. [73] D. Boggs et al., "The microarchitecture of the Intel Pentium 4 processor on 90nm technology," Intel Technology Journal, 8(01), 2004. [74] S. Wijerantne et al., "A 9GHz 65nm Intel Pentium 4 Processor Integer Execution Core," IEEE Solid-State Circuits Conf., 2006. [75] D. Shin and S. Gupta, "Approximate logic synthesis for error tolerant applications," DATE, 2012. [76] J. C. Cha and S. K. Gupta, "Characterization of granularity and redundancy for SRAMs for optimal yield-per-area," IEEE Int'l Conference on Computer Design, 2008. 172 [77] M. Mirza-Aghatabar et al., "HYPER: a Heuristic for Yield/area imProvEment using Redundancy in SoC," Asian Test Symp., 2010. [78] J. Z. Yan, and C. Chu. "DeFer: deferred decision making enabled fixed-outline floorplanner," DAC, 2008. [79] W. C. Elmore, "The transient response of damped linear networks," Journal of applied physics, 19(1): 55-63, 1948. [80] A. B. Kahng and G. Robins, On optimal interconnections for VLSI, Kluwer Academic, 1995, pp. 67-69. [81] B. M. Beckmann and D. A. Wood, "Managing Wire Delay in Large Chip-Multiprocessor Caches," Int'l Symp. on Microarchitecture, 2004. [82] http://vlsicad.eecs.umich.edu/BK/GSRCbench/ [83] http://en.wikipedia.org/wiki/Ivy_Bridge_(microarchitecture) [84] http://en.wikipedia.org/wiki/Haswell_(microarchitecture) [85] M. D. Powell et al., "Architectural core salvaging in a multi-core processor for hard-error tolerance", ISCA, 2009. [86] Y. Kang et al., "FlexRAM: Toward an Advanced Intelligent Memory System," Int'l Conf. on Computer Design, 1999. [87] Y. Li et al., "Self-repair of uncore components in robust system-on-chips: An OpenSPARC T2 case study," Int'l Test Conf., 2013. [88] Y. Gao et al., "Using Explicit Output Comparisons for Fault Tolerant Scheduling on Modern High-Performance Processors," DATE, 2013. [89] Y. Gao et al., "An Energy and Deadline Aware Resource Provisioning, Scheduling and Optimization Framework for Cloud Systems," International Conference on Hardware/Software Codesign and System Synthesis, 2013. [90] Y. Gao et al., "An Energy-Aware Fault Tolerant Scheduling Framework for Soft Error Resilient Cloud Computing Systems," DATE, 2014.
Abstract
For modern deep nano-scale integrated circuit manufacturers, constructing large, complex, high-performance systems with acceptable yield is a major concern. Future technologies are forecast to suffer extremely low yields due to process variations, noise, and high defect densities. To tackle the low-yield issues in emerging technologies, researchers have advocated the notion of "Design for Yield (DFY)": treating yield as a first-order design objective early in the design cycle, alongside area, performance, and power. DFY, a new member of the DFx family, can be viewed as an extension of "Design for Manufacturing (DFM)", which addresses yield in the post-design stage with ad hoc techniques such as redundant via insertion. This shift in design focus to incorporate yield concerns calls for a re-evaluation of the existing design flow. Tradeoffs between critical design metrics must be weighed carefully, and efficiently, in the early design stage to enable fast design convergence. This is especially important in today's competitive electronics market, where, to remain profitable, designs must be initiated, finalized, manufactured, tested, and shipped within a short window of time.

This dissertation introduces several computer-aided design (CAD) DFY frameworks that aim to enhance a system's yield during the design stage while co-optimizing yield, (die) area, and (system) performance to maximize the overall revenue for the IC manufacturer. Our frameworks include new design flows, system models, optimization algorithms, and experimental results to assist in the transition toward a DFY design flow for future technologies plagued by low yields.
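To make the yield/area/revenue tradeoff concrete, the following Python sketch estimates good dies per wafer before and after adding redundant area. This is only an illustration using the textbook negative-binomial yield model and a first-order dies-per-wafer estimate, not the dissertation's actual framework; the defect density, clustering parameter, wafer size, redundancy overhead, and yield-gain figures are all assumed values chosen for the example.

import math

def yield_nb(area_cm2, d0=1.0, alpha=2.0):
    # Negative-binomial yield model: Y = (1 + A*D0/alpha)^(-alpha)
    # d0 = defect density (defects/cm^2), alpha = clustering parameter (assumed).
    return (1.0 + area_cm2 * d0 / alpha) ** (-alpha)

def dies_per_wafer(area_cm2, wafer_diam_cm=30.0):
    # First-order dies-per-wafer estimate with an edge-loss correction term.
    r = wafer_diam_cm / 2.0
    return math.pi * r * r / area_cm2 - math.pi * wafer_diam_cm / math.sqrt(2.0 * area_cm2)

def good_dies(area_cm2, redundancy_overhead=0.0, yield_gain=1.0):
    # Good dies per wafer: redundancy enlarges the die (fewer dies fit)
    # but can raise effective yield by tolerating some defects.
    a = area_cm2 * (1.0 + redundancy_overhead)
    return dies_per_wafer(a) * min(1.0, yield_nb(a) * yield_gain)

# Baseline 1 cm^2 die vs. a hypothetical design with 15% redundant area
# that doubles effective yield (both numbers are assumptions).
print(good_dies(1.0))                                          # ~284 good dies
print(good_dies(1.0, redundancy_overhead=0.15, yield_gain=2.0))  # ~445 good dies

Under these assumed numbers, the redundant design produces more good dies per wafer despite its larger die, which is the kind of tradeoff a DFY flow must evaluate systematically rather than ad hoc.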
Asset Metadata
Creator: Gao, Yue (author)
Core Title: High level design for yield via redundancy in low yield environments
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Defense Date: 07/30/2014
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tags: defect tolerance, design for yield, fault tolerance, OAI-PMH Harvest, redundancy
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Breuer, Melvin A. (committee chair), Gupta, Sandeep K. (committee chair), Nakano, Aiichiro (committee member)
Creator Email: samixsamix@hotmail.com, yuegao@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-472309
Unique Identifier: UC11286820
Identifier: etd-GaoYue-2884.pdf (filename); usctheses-c3-472309 (legacy record id)
Legacy Identifier: etd-GaoYue-2884.pdf
Dmrecord: 472309
Document Type: Dissertation
Rights: Gao, Yue
Type: texts
Source: University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA