LIFETIME RELIABILITY STUDIES FOR MICROPROCESSOR CHIP ARCHITECTURE

by

Jeonghee Shin

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)

August 2008

Copyright 2008 Jeonghee Shin

Dedication

To my parents and my sisters

Acknowledgments

First and foremost, I would like to thank my advisor and the Chair of my dissertation committee, Dr. Timothy M. Pinkston, for his inspiring guidance, encouragement, patience and understanding. His enthusiasm and support made it possible for me to complete this dissertation. I would also like to thank the other members of my committee: Dr. Michel Dubois, Dr. Mary Hall, Dr. Peter Beerel and Dr. Sandeep Gupta. Their short but insightful discussions shaped my dissertation topic and encouraged me to believe in myself as a researcher.

I would not be as satisfied with my dissertation as I am had it not been for two long-term internships at the IBM T. J. Watson Research Center. Not only was a significant part of my dissertation completed during the internships, but my research philosophy and attitude were also rebuilt during that time. I would like to thank the IBM researchers who inspired me and helped me with my dissertation work: Dr. Victor Zyuban, Dr. Pradip Bose, Dr. Zhigang Hu, Dr. Jude A. Rivers, Dr. Jaime Moreno, Dr. C.-K. Hu, Dr. Sufi Zafar, Dr. Lixin Zhang, and many other researchers and summer interns.

I would like to specially thank my mentor, Dr. Victor Zyuban. I was always inspired and overwhelmed by his knowledge, genius and impeccable attitude towards research. He is truly an exemplary role model for me.

My special thanks also go to a member of my dissertation committee and the manager for my internships, Dr. Pradip Bose. His support and bright ideas enabled my internships to be fruitful and my wish of becoming an IBM researcher to come true. I also very much appreciate his coast-to-coast travel for my defense.

I was blessed to have wonderful and outstanding colleagues and friends at USC. Especially, I would like to thank the SMART group members, Dr. Yong Ho Song, Dr. Wai Hong Ho and Bilal Zafar, and a visiting scholar, Dr. Kyung Geun Lee, for their discussions, feedback and encouragement. I am also very thankful to Tim Boston, Diane Demetras and Rosine Sarafian for their caring and help with the administrative work.

I was very lucky to have my best and lifelong friend, Dr. Jiyun Byun, who has been my friend, my counselor, my supporter, my family and my inspiration for the past sixteen years. My life would not be the same without her.

Finally, I dedicate this dissertation to my parents and my sisters. Their belief in me enabled me to start this journey bravely and to complete it successfully. Thank you and I love you all very much!

Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

Chapter 1  Introduction
  1.1  Motivation and Research Approach
  1.2  Research Contributions
  1.3  Organization of the Dissertation

Chapter 2  Background and Related Work
  2.1  Wearout Failure Mechanisms
    2.1.1  Electromigration
    2.1.2  Negative Bias Temperature Instability (NBTI)
    2.1.3  Time Dependent Dielectric Breakdown (TDDB)
  2.2  Previous Studies for Modeling Chip Lifetime Reliability
  2.3  Previous Studies for Enhancing Chip Lifetime Reliability

Chapter 3  Proposed Framework for Architecture-level Lifetime Reliability Modeling
  3.1  Our Approach: Technology-Independent Failure Modeling and Analysis
  3.2  FIT of Reference Circuit (FORC)
    3.2.1  FORC for Electromigration
    3.2.2  FORC for NBTI
    3.2.3  FORC for TDDB
  3.3  Estimating the Failure Rate of Microarchitecture Structures Based on FORC
    3.3.1  Failure Rate of Microarchitecture Structures due to Electromigration
    3.3.2  Failure Rate of Microarchitecture Structures due to NBTI
    3.3.3  Failure Rate of Microarchitecture Structures due to TDDB
  3.4  Summary

Chapter 4  Microarchitecture Lifetime Reliability Analysis Using FORC
  4.1  Evaluation Methodology
  4.2  Multicore Processor Microarchitecture Reliability Analysis
    4.2.1  Core-Level Reliability Analysis
    4.2.2  Chip-Level Reliability Analysis
  4.3  Discussion
  4.4  Summary

Chapter 5  Lifetime Reliability Evaluation Framework for Redundant Systems
  5.1  Lifetime Reliability Models for Generic Redundant Systems
    5.1.1  Warm k-out-of-n Systems
    5.1.2  Cold k-out-of-n Systems
    5.1.3  Impact of Component Lifetime Distributions on System Lifetime
  5.2  Evaluating the Lifetime Reliability of Redundant SRAM Arrays
    5.2.1  NBTI-Induced SRAM Cell Lifetime Distributions
    5.2.2  Lifetime Reliability of SRAM Arrays with Redundancy
  5.3  Summary

Chapter 6  Proactive Wearout Recovery Approach for Exploiting Microarchitectural Redundancy to Extend Chip Lifetime
  6.1  Proactive Use of Redundancy
  6.2  Wearout Recovery
    6.2.1  Implementation of Wearout Recovery Mode
    6.2.2  Wearout Recovery Applied to Inverter Chains
    6.2.3  Wearout Recovery Applied to SRAM Arrays
  6.3  Proactive Wearout Recovery Approach for Extending Cache SRAM Lifetime
    6.3.1  Architecture Design Considerations
    6.3.2  Implementing Proactive Wearout Recovery in Cache SRAM
    6.3.3  Impact on Performance and Area
  6.4  Summary

Chapter 7  Redundant Cache SRAM Lifetime Reliability Analysis
  7.1  Evaluation Methodology
    7.1.1  Evaluating Impact on Performance
    7.1.2  Evaluating Impact on Lifetime Reliability
  7.2  Exploiting Redundancy for Extending Cache SRAM Lifetime
    7.2.1  Reactive Use of Redundancy
    7.2.2  Proactive Use of Redundancy
    7.2.3  Comparison of Proactive and Reactive Use of Redundancy
  7.3  Summary

Chapter 8  Conclusions and Future Work
  8.1  Conclusions
  8.2  Future Work

References
List of Tables

3.1  The number of effective defects (EDs) and duty cycle for modeling the failure rate of various microarchitecture structures due to NBTI. The devices in the table are indexed in Figure 3.6. T_0 and T_1 indicate the ratio of time during which SRAM cells, latches, and repeated wires hold 0 and 1, respectively. P_fatal is the percentage of devices along the critical path that cause circuit failure if they fail. Note that the failure of precharge transistors and of the PFET devices of the feedback circuit of latches does not cause the circuit to fail, because these devices are not along the critical path.

3.2  The number of effective defects (EDs) and duty cycle for TDDB modeling for various microarchitecture structures. The devices in the table are indexed in Figure 3.6. T_0 and T_1 indicate the ratio of time during which SRAM cells, latches, and repeated wires hold 0 and 1, respectively. P_fatal is the percentage of devices along the critical path that cause circuit failure if they fail. The fatality of breakdown (NF: non-fatal; F: fatal) is given for the source (Src) and drain (Drn) area. We assume that breakdown at the source and at the drain area are independent, thus counting each separately as a failure if it is fatal.

5.1  Exponential and lognormal distribution models. The failure rate h(t) is defined as the probability that the survivors until time t fail during the next instant of time Δt: h(t) = f(t)/R(t). In the CDF of the lognormal, Φ denotes the CDF of the standard normal distribution.

5.2  Lifetime distribution models for cache memory systems employing redundancy techniques. In the table, N_cells denotes the cache memory size in bits; N_arrays and N_ways denote the number of SRAM arrays and associative ways composing the cache memory, respectively; N_cols and N_rows denote the number of columns and rows per array, respectively. In addition, R_unit(t) and R_cell(t) are the reliability functions of data units and SRAM cells, respectively.

7.1  Configuration of the simulated processor chip and L2 cache structure. In the L2 cache configuration, RCQ, COQ and SNPQ indicate queues holding transactions for cache line reloads, castouts (i.e., write-backs) and snooping, respectively.

List of Figures

2.1  NBTI-induced threshold voltage increase over time, in arbitrary units, with different duty cycles, d.

2.2  Four possible cases of TDDB. Post-breakdown behavior is modeled as 10kΩ of resistance to determine the fatality of breakdowns [47][63].

3.1  The reference circuit chosen for electromigration. The outputs (i.e., drains) of the NFET and PFET devices are connected through an M2 line segment. As a result, the v_up and v_down vias abut the M1 metal lines to M2. Upon the one-to-zero transition of the clock, the PFET device conducts, and current flows through v_up upward from M1 to M2 in order to charge the wire capacitance of the M2 line, C_ref. On the zero-to-one transition of the clock, the NFET device conducts, and current flows through v_down downward from M2 to M1 in order to discharge C_ref. Therefore, v_up and v_down always carry unidirectional current, causing the electromigration effect.

3.2  The reference circuit chosen for NBTI. It consists of a series of inverters between two latches. The input of one latch should propagate through the inverter chain and be latched into the other within one clock period. Because the value of the signal alternates between V_dd and 0V in passing through each inverter, the PFET device in every other inverter is stressed.

3.3  Multi-port register file layout with current directions causing failures due to electromigration. Because read bitlines are always precharged prior to cells being read, via v_sel has current flow from bitline bl0_k toward the pass transistor upon reading out 1, to discharge the precharged capacitance of the bitline, C_bitline, but no current flows while reading 0. In (b), an example layout is depicted for bitline bl0_k implemented on the M2 metal layer and the pass transistors of cells Cell_ik and Cell_jk, both of which are connected to bl0_k through v_sel and v_unsel, respectively. The arrows indicate the current direction on bl0_k when Cell_ik stores 1 and is being selected by asserting wordline wl0_i. As shown in (b), current flows from the bitline to v_sel, while little current flows through v_unsel.

3.4  Array structure layout with current directions causing failures due to electromigration. The direction of current flowing through vias connecting cells to bitlines is similar to that for register files shown in Figure 3.3, except that vias on bitlines (e.g., bl0_k) and on complementary bitlines have current while reading 0 and 1, respectively. In (b), an example layout is depicted for bl0_k implemented on the M2 metal layer and the pass transistors of cells Cell_ik and Cell_jk, both of which are connected to bl0_k through v_sel and v_unsel, respectively. The arrows indicate the current direction on bl0_k when Cell_ik stores 0 and is being selected by asserting wordline wl0_i.

3.5  Logic structure layout with current directions possibly causing failures due to electromigration. The layout shows an example of a NAND gate with inputs A, B and C, and output Out. The M1 lines connecting the drains of the three PFET devices and the upper NFET device have unidirectional current flow regardless of the value of the NAND gate output, Out. However, the via connecting the M1 lines to M2 has bidirectional current, depending on the value of Out.

3.6  The PFET and NFET devices in various microarchitecture structures. The number of effective defects and the duty cycle of the devices for NBTI and TDDB are given in Tables 3.1 and 3.2, respectively.

4.1  Simulation environment for estimating performance, power dissipation, temperature and chip lifetime, built around Mambo [54]. The activity and value statistics are collected by Mambo and fed into the power and reliability models. Temperature is estimated by using the estimated power dissipation and a thermal resistance matrix [16] and, if needed, may be fed into the reliability model.

4.2  Simulated 15mm×15mm quad-core processor floorplan. It consists of 0.5mm×0.5mm cells on the grid (i.e., 30×30 cells), to each of which the functions of the chip are assigned.

4.3  FIT of EM, NBTI, and TDDB of the master core that initiates the application and distributes tasks to the other three slave cores, while running the first 10 msec of Barnes, shown in (a), (b), and (c), respectively. While the FIT of EM is contributed only by vias in arrays and register files, the FIT of NBTI and TDDB is broken down into that for array and register file structures and that for logic and wire structures. The area breakdown of the units composing the core is given for these structures in (d).

4.4  FIT of electromigration (FIT_EM) at the unit level. IFAR: instruction fetch address register, IAT: instruction address translation, BHT: branch history table, ICACHE: I-cache data arrays, IDIR: I-cache directory, BPL: branch prediction logic, IRLD: cache line reload logic, IBUF: instruction buffer, ILST: link stack register file, DISP: dispatch logic, GPMP, FPMP, and CMP: GPR, FPR, and CTR mapper, respectively, FPSH, GPSH, and CSH: shadow arrays of GPMP, FPMP, and CMP, respectively, FXQ, FPQ, BRQ, and CRQ: FXU, FPU, BRU, and CRU issue queues, respectively, CMPL: completion table, ADDR: address adder, DCACHE: D-cache data arrays, DAT: data address translation, DDIR: D-cache directory, FRMT: load format, STQ: store queue, LRQ: load reorder queue, SLB: segment look-aside buffer, TLB: address translation look-aside buffer, CTRL: control logic, DPRF: data prefetch, BRQ: branch misc. queues, BREX: branch execution logic, BRR: count and link register, GPR: general-purpose registers, FXU: fixed-point units, FPR: floating-point registers, FPU: floating-point units.

4.5  FIT of NBTI (FIT_NBTI) at the unit level, broken down into array and register file structures versus logic and wire structures. See the caption of Figure 4.4 for the names of the functions composing each unit.

4.6  FIT of TDDB (FIT_TDDB) at the unit level, broken down into array and register file structures versus logic and wire structures. See the caption of Figure 4.4 for the names of the functions composing each unit.

4.7  FIT_EM density distribution over the simulated quad-core processor chip for the first 10 msec of running Barnes; that is, FITs per 0.5mm×0.5mm cell on the grid of the floorplan in Figure 4.2. The x- and y-axes are aligned to the floorplan: (0,0) is the upper left corner and (15,15) is the lower right corner of the floorplan.

4.8  FIT_NBTI density distribution over the simulated quad-core processor chip for the first 10 msec of running Barnes; that is, FITs per 0.5mm×0.5mm cell on the grid of the floorplan in Figure 4.2. The x- and y-axes are aligned to the floorplan: (0,0) is the upper left corner and (15,15) is the lower right corner of the floorplan.

4.9  FIT_TDDB density distribution over the simulated quad-core processor chip for the first 10 msec of running Barnes; that is, FITs per 0.5mm×0.5mm cell on the grid of the floorplan in Figure 4.2. The x- and y-axes are aligned to the floorplan: (0,0) is the upper left corner and (15,15) is the lower right corner of the floorplan.

4.10  FIT density distribution of the simulated quad-core processor chip running Barnes, captured in 10 msec time intervals. The x- and y-axes are aligned to the floorplan: (0,0) is the upper left corner and (15,15) is the lower right corner of the floorplan.

4.11  FIT density distribution of the simulated quad-core processor chip running SPLASH benchmark programs such as Cholesky, FFT and Ocean for the time interval of the fifth 10 msec. The x- and y-axes are aligned to the floorplan: (0,0) is the upper left corner and (15,15) is the lower right corner of the floorplan.
5.1  An example of the cold 5-out-of-7 model, in which a system consists of seven components and two of them are spares in standby mode. The system survives if no more than two spares are needed to replace faulty components (darker shaded box).

5.2  Recursive tree representing possible combinations of exactly two spares being used in the cold 5-out-of-7 model. W(i, j) denotes P{W_i(t) = j}, the probability that exactly j cold spares are used for one or more of the positions 1 to i by time t. N(j) denotes P{N(t) = j}, the probability that j cold spares are used for a certain component position by time t. Since the branches represent the probabilities of non-overlapping events, P{W_i(t) = j} is the sum of the products along the branches.

5.3  Cumulative exponential and lognormal (shape parameter values of 0.5, 0.2 and 0.01) distribution functions in terms of time t in an arbitrary unit. The four functions have the same mean.

5.4  Impact of lifetime distribution models on system lifetime reliability. In (a), system size increases from a single component to ten, none of which are redundant. In (b), graceful performance degradation enables the system to operate even if some components fail (i.e., the warm 10-out-of-n model, where 10 ≤ n ≤ 20).

5.5  Impact of lifetime distribution models on system lifetime reliability. In (a) and (b), up to ten cold or warm spares are employed along with ten original components (i.e., the cold or warm 10-out-of-n model, where 10 ≤ n ≤ 20).

5.6  6T SRAM cell consisting of two PFET (P_L and P_R) and two NFET devices holding the cell state, and two NFET pass transistors.

5.7  Derived lifetime functions of an SRAM cell, a 4KB array and a 256KB cache with respect to NBTI.

5.8  Evaluated 256KB cache memory consisting of 64 SRAM arrays. Each 4KB array has 128 columns and 256 rows. Each cache associative way consists of eight arrays of one row, and four cache lines (different gray levels) are interleaved among the eight arrays.

6.1  Implementation of normal (NO), power gating (PG), and wearout gating (WG) modes of operation. For wearout gating in (b), virtual ground is charged to V_dd to remove the electric field across the shaded circuit, which stimulates NBTI wearout recovery of the PFET devices.

6.2  Change of the voltage level of virtual ground and of the leakage ratio of the footer and circuits over time during the power gating and wearout gating modes of operation of the circuits illustrated in Figure 6.1. In wearout gating mode, virtual ground is charged to V_dd, which stimulates wearout recovery but increases footer leakage power.

6.3  Implementation of wearout recovery mode for an inverter chain. Input combinations NO, PG, WG, IR_odd and IR_even are those needed for normal, power gating, wearout gating, and intense recovery (for odd and even PFETs, respectively) modes of operation.

6.4  Implementation of wearout recovery mode for SRAM arrays. Input combinations NO, PG, WG, IR_L and IR_R are those needed for normal, power gating, wearout gating, and intense recovery (of the left and right PFETs, respectively) modes of operation.

6.5  Cache SRAM configured to support proactive use of array-level redundancy for wearout recovery.
7.1  Simulated L2 cache structure, which consists of four main control logic circuits and queues as well as directory and data arrays: reload, castout, snoop and recovery/drain machines.

7.2  Lifetime reliability enhancement and area overhead of redundancy techniques for the evaluated 256KB cache memory. The error bars of the derived model indicate the difference in the normalized MTTF of the cache memory evaluated using a range for the shape parameter between 0.4 and 0.7, as discussed in Section 5.2.1, compared to that using a shape parameter value of 0.5 shown with the data bars.

7.3  Lifetime reliability enhancement and area overhead of redundancy techniques for the evaluated 256KB cache memory. The error bars of the derived model indicate the difference in the normalized MTTF of the cache memory evaluated using a range for the shape parameter between 0.4 and 0.7, as discussed in Section 5.2.1, compared to that using a shape parameter value of 0.5 shown with the data bars.

7.4  (a) The duty cycle distributions and lifetime reliability of the evaluated 256KB L2 cache memory. (b) The MTTF (shown on the right y-axis) of the baseline cache configuration (i.e., neither redundancy nor balanced duty cycle) for each application, normalized to that of the baseline running Volrend. The lifetime reliability enhancement (shown on the left y-axis) of the cache configurations with a cell duty cycle of 50% (balanced duty cycle) and/or with one redundant array used reactively or proactively for each application is also shown.

7.5  IPC during the drain process. The x-axis begins at the time the drain process starts and ends at the time the process is completed. For the locking scheme covering entire arrays, "blocked load" causes an IPC drop by blocking the memory transactions that follow. The individual locking scheme reduces the blocking time, thus eliminating the IPC drop.

7.6  Impact of the drain process on IPC (instructions committed per cycle) loss for DAXPY and pointer chasing, with various time periods between two successive drain processes.

7.7  IPC loss of the drain process averaged across the SPLASH benchmark programs. The drain process is scheduled once every million cycles, and overall IPC loss is divided by the total number of drain processes.

7.8  Lifetime reliability enhancement vs. performance/area overhead of the evaluated 256KB L2 cache memory with various redundancy techniques. In (a), for ECC, data points (left to right) indicate sector (32-byte), quad-word, double-word, word (4-byte), double-byte, and byte error correction codes; for column spares, data points (left to right) indicate two or four spares per 128, 64, 32, 16 and 8 columns. In (b), the lifetime reliability enhancement of each technique is shown for the best case, except for graceful performance degradation (GPD), which shows four data points indicating the cache enabled to operate with at most one disabled way (left) to four disabled ways (right).

Abstract

Deep submicron semiconductor technologies enable greater degrees of device integration and performance, but they also pose many new microprocessor design challenges. Chip lifetime reliability as affected by wearout-related failures, for one, has become a major concern.
Atomic-range dimensions, escalating power densities, process/operational variation and other consequences of extreme scaling all contribute to this concern. Much recent research has been conducted to understand and model the effects of wearout failure mechanisms such as negative bias temperature instability (NBTI), electromigration and gate oxide breakdown on chip lifetime reliability. Circuit and architectural techniques for mitigating and/or tolerating such wearout failures are also being explored for extending chip lifetime. Nonetheless, the challenge of modeling and mitigating the effects of low-level failures at the architecture level continues to be a rather daunting one.

This research tackles the issue of modeling chip lifetime reliability at the architecture level. We propose a new and robust structure-aware lifetime reliability model at the architecture level, in which only the devices vulnerable to wearout failure mechanisms, and the effective stress conditions of these devices, are taken into account for the failure rate of microarchitecture structures. In formulating the proposed model, we separate architecture-level factors from technology-dependent parameters, which are encapsulated into a newly proposed technology-independent unit of reliability called the FIT of reference circuit (FORC). This allows architects to abstract processor reliability analysis from technology-level effects. In addition, the proposed model is extended for processor systems employing microarchitectural redundancy, which has been used as a means of improving chip lifetime reliability.

Microarchitectural redundancy is typically used in a reactive way, allowing chips to maintain operability in the presence of failures by detecting and isolating, correcting and/or replacing components on a first-come, first-served basis only after they become faulty. In this research, we explore an alternative, preferable method of exploiting microarchitectural redundancy to enhance chip lifetime reliability. In our approach, redundancy is used proactively to allow non-faulty microarchitecture components to be temporarily deactivated (i.e., placed in recovery mode) on a rotating basis, to suspend and/or recover from certain wearout effects. This approach improves chip lifetime reliability by warding off the onset of wearout failures as opposed to reacting to them after the fact. To make our proactive approach more effective, we also propose circuit-level techniques to exploit the recovery effect of wearout failure mechanisms such as NBTI while components operate in recovery mode. Finally, the proposed approach is applied to cache SRAM susceptible to failure caused by NBTI, exploiting microarchitectural redundancy to enhance cache SRAM lifetime reliability.

Chapter 1  Introduction

Deep submicron semiconductor technologies enable greater degrees of device integration and performance, but they also pose many new microprocessor design challenges. Chip reliability as affected by wearout-related failures, for one, has become a major concern [7]. Atomic-range dimensions, escalating power densities, process/operational variation and other consequences of extreme scaling all contribute to this concern.

The wearout of integrated circuit devices generally appears as the degradation of circuit speed or memory cell stability over time.
When the degree of wearout grows beyond what can be tolerated by the circuits employing the devices, the circuits fail due to timing violations or bit flips, and if no mechanism is implemented to contain these wearout failures, the chip's lifetime is over.

While soft errors caused by alpha particles or cosmic rays are random in time and transient, wearout failures are generally associated with failure mechanisms such as negative bias temperature instability (NBTI), electromigration and gate oxide breakdown, and remain permanent.(1) Therefore, a methodology for estimating chip lifetime should be developed based on a detailed understanding both of the physical phenomena of wearout failure mechanisms, which makes it possible to identify the vulnerability of devices to certain failure mechanisms, and of the circuit implementation, which makes it possible to analyze the criticality/fatality of the affected devices for the reliability of the circuits they belong to. For extending chip lifetime, it is necessary to have mechanisms which permanently contain wearout failures or which slow down the wearout of devices before they fail.

(1) Some failure mechanisms such as NBTI have recovery effects. However, the recovery generally covers only part of the wearout, so the wearout still progresses over time.

Unlike manufacturing defects, another permanent type of failure, wearout failures manifest during the chip's lifetime. While chips affected by manufacturing defects can be screened out before deployment by testing and burn-in processes, it is more difficult to predict the lifetime of chips affected by wearout failures. This makes it important to have lifetime reliability-aware designs in which techniques for extending chip lifetime can be activated in the field at low overhead, along with on-line failure detection mechanisms. It is also important to have an accurate methodology for analyzing the lifetime reliability of chips with lifetime extension techniques, in order neither to underdesign nor to overdesign the chips in meeting target lifetimes, especially by taking into account the operating conditions of devices caused by applied workloads.

The necessity of such lifetime reliability studies grows dramatically with deep submicron technologies, as technology scaling unfavorably impacts chip lifetime reliability owing to reduced device size, non-ideal supply voltage scaling, increased power densities, significant process/operational variations and the increased number of integrated devices. This research proposes new and robust frameworks for analyzing chip lifetime reliability, and a preferable approach for using microarchitectural redundancy to extend chip lifetime, by effectively exploiting low-level effects such as the physical characteristics of failure mechanisms and circuit implementations, and architecture-level effects such as microarchitectural configurations and applied workloads.

Our research approach, as well as the motivation of this work, is presented in Section 1.1. This is followed by a summary of the contributions of this research in Section 1.2. Finally, the organization of this dissertation is provided in Section 1.3.

1.1 Motivation and Research Approach

Much recent research has been conducted at different levels to understand and model the effects of wearout failure mechanisms, and to develop techniques for mitigating and/or tolerating such wearout failures to extend chip lifetime.
Reliability studies at the device level observe emerging wearout failure mechanisms, propose lifetime reliability models of devices vulnerable to specific mechanisms based on hypotheses about their physical phenomena, and empirically validate the proposed models. Such device-level reliability models are extended to the circuit level by considering the effect of device wearout on circuit performance or stability. Thus far, many lifetime reliability models at the device and circuit levels have been proposed and empirically validated by academia and industry. As a result, the basic mechanisms of wearout failures at these low levels are fairly well understood, and the models at these levels have gained widespread acceptance [6]. However, these low-level models are application-oblivious, generally assuming worst-case operating conditions, which causes processor chips to be overdesigned.

Compared to the enormous amount of reliability research at the low levels, only a few studies have been conducted at the architecture level. These previous architecture-level studies successfully brought the significance of applied workloads for chip lifetime reliability to much attention. However, the great degree of abstraction of low-level details limits the use, or diminishes the credibility, of their findings.

Srinivasan, et al., [61] proposed a first architectural lifetime reliability model for use with single-core architecture-level, cycle-accurate simulators. A closer examination of how they put low-level reliability models together at the architecture level reveals a number of key assumptions that enable plausible abstractions at the architecture level. For example, the baseline (target) total failure rate measured in FITs(2) is assumed to be evenly distributed across all the considered failure mechanisms. This is clearly a somewhat arbitrary axiom, since some failure mechanisms can be more severe than others, and technology scaling affects the failure mechanisms in different ways and degrees. In addition, a uniform device density across the chip and an identical vulnerability of devices to failure mechanisms are assumed. As a result, the failure rates estimated by their model tend to be proportional to chip area, regardless of the exact component mix within that area. However, an examination of the floorplan or photomicrograph of any modern multicore chip clearly shows heterogeneity across the die area, and hence the limitations of such an assumption.

(2) The standard method of reporting constant failure rates for semiconductor components is failures in time (FITs), the number of failures seen in one billion hours. The mean time to failure (MTTF) of a component is inversely related to this constant failure rate, i.e., MTTF = 10^9/FITs.

For accurate lifetime reliability estimation, basic axioms such as those above, adopted by prior architecture-level reliability models, need to be improved based on a detailed understanding of the implementation of modern processor microarchitecture components and the characteristics of wearout failure mechanisms. In this research, we propose an architectural lifetime reliability model that considers the vulnerability of the basic structures of the microarchitecture (e.g., the arrays, register files, latches, and logic composing operational units across the chip) to different types of failure.
To do so, we analyze the devices truly affected by specific failure mechanisms, taking into account the effective stress condition of the devices caused by applied workloads. Furthermore, we propose a technology/environment-independent unit of reliability, called the FIT of reference circuit or FORC, through which the failure rate of a microarchitecture structure can be expressed for each type of failure mechanism. This efficient and portable framework separates architecture-level factors from technology-dependent parameters so as to allow architects to abstract processor reliability analysis from technology-level effects.

In addition, it is important to accurately evaluate the impact of lifetime extension techniques, in order neither to underdesign nor to overdesign processor systems in meeting target lifetimes, since such techniques can incur significant design overhead in area, complexity, power and performance. Microarchitectural redundancy has been a commonly used technique for improving the lifetime reliability of microprocessor chips. When it is applied, chips can maintain operability in the presence of failures by detecting and isolating, correcting and/or replacing microarchitecture components.

Most previous reliability models are based on assumptions such as an exponential distribution for representing system component lifetimes, for simplicity. The exponential lifetime distribution provides relatively straightforward analysis owing to its constant failure rate over time, but it is not always representative of the underlying phenomena of chip lifetime reliability [1][44]. Our proposed framework is based on a well-accepted reliability model for generic redundant systems [1][44][57] and can easily be customized for specific processor subsystems. As an example, this research demonstrates a methodology for analyzing cache SRAM with conventional redundancy techniques such as error correction codes (ECC), component sparing and graceful performance degradation, by customizing the proposed framework. We also propose a methodology for deriving a more representative lifetime distribution model specifically for SRAM arrays, based on a detailed understanding of how SRAM cells wear out over time and eventually fail due to NBTI. Using the proposed framework, chip lifetime reliability analysis is conducted in a comprehensive manner across different levels rather than at a single level, effectively combining low-level effects, such as the characteristics of failure mechanisms and implemented circuits, with architecture-level effects, such as microarchitecture configurations and applied workloads.

Microarchitectural redundancy is typically used in a reactive way, allowing chips to maintain operability in the presence of failures by detecting and isolating, correcting, and/or replacing components on a first-come, first-served basis only after they become faulty. This research explores an alternative, preferable method of exploiting microarchitectural redundancy to enhance chip lifetime reliability. In our proposed approach, redundancy is used proactively to allow non-faulty microarchitecture components to be temporarily deactivated, on a rotating basis, to suspend and/or recover from certain wearout effects. This approach improves chip lifetime reliability by warding off the onset of wearout failures as opposed to reacting to them after the fact.
We propose circuit-level techniques that can be exploited at the architecture level for operating components in recovery mode as a means of mitigating the effects of NBTI-induced wearout. Our proposed approach of using microarchitectural redundancy to exploit the characteristics of failure mechanisms, by developing both circuit and architectural techniques, can achieve the maximal lifetime reliability enhancement.

1.2 Research Contributions

The main contributions of this dissertation focus on specifying architectural effects on chip lifetime reliability and, furthermore, on the importance of comprehensive lifetime reliability studies across different levels. The specific contributions made in this dissertation are as follows:

• This research proposes a framework for architecture-level lifetime reliability modeling in which only devices vulnerable to failure mechanisms, and their effective stress conditions, are taken into account for the failure rate of microarchitecture structures. The proposed framework not only effectively captures low-level details, but also specifies explicit architectural parameters affecting chip lifetime reliability, such as microarchitecture configurations and activity factors, rather than implicit ones such as temperature, as done in previous architectural models. This is achieved by proposing a new unit of reliability, called the FIT of reference circuits or FORC, which encapsulates technology-dependent parameters. With this new reliability modeling framework, computer architects are empowered to proceed with architecture-level reliability analysis independent of technological parameters.

• Furthermore, this research proposes a more comprehensive and flexible framework for analyzing the lifetime reliability of redundant systems by addressing the limitations of commonly used evaluation methodologies, such as the sum-of-failure-rates model and Monte Carlo simulation, and by effectively combining low-level and architecture-level effects. This research studies the fundamentals of the lifetime reliability of generic redundant systems and proposes a methodology for deriving a more representative lifetime distribution of system components, a critical input for accurate reliability analysis. In particular, the proposed framework is applied to redundant cache SRAM systems with respect to NBTI failure. In deriving the lifetime distribution of SRAM cells, the basic components of cache SRAM arrays, the physical phenomena of the NBTI failure mechanism, the circuit implementation of 6T SRAM cell arrays, and cache configuration and utilization are effectively put together. Using the derived lifetime distribution, the lifetime reliability model of redundant systems is applied to cache SRAM employing conventional redundancy techniques such as sparing, ECC and graceful performance degradation.

• While microarchitectural redundancy has typically been used in reactive ways, this research explores an alternative approach of exploiting redundancy in a proactive manner to enhance chip lifetime reliability. This research shows that, with arguably similar area and delay overhead, proactive use of redundancy improves chip lifetime reliability to a much higher degree than traditional reactive use, especially when the proposed circuit-level techniques are used to exploit the wearout recovery properties of failure mechanisms such as NBTI.
The evaluation of the proposed proactive wearout recovery approach shows that design techniques exploiting low-level and architecture-level characteristics together can significantly improve chip lifetime reliability, compared to those limited to the effects of a single level.

1.3 Organization of the Dissertation

The remainder of this dissertation is structured as follows. In Chapter 2, the physical phenomena of the wearout failure mechanisms under study in this dissertation, namely electromigration, NBTI and TDDB, are described, along with well-accepted device-level models of these failure mechanisms. This is followed by the techniques previously proposed for modeling and enhancing chip lifetime reliability. In Chapter 3, an efficient and portable framework for architectural lifetime reliability modeling is proposed. The reliability analysis of a multicore processor chip using the proposed framework is demonstrated in Chapter 4. This framework is extended for effectively analyzing the lifetime reliability of redundant systems in Chapter 5. In Chapter 6, a proactive wearout recovery approach for enhancing chip lifetime reliability is proposed, with circuit and architectural techniques to exploit microarchitectural redundancy and the wearout recovery properties of failure mechanisms. The proposed approach applied to cache SRAM arrays is analyzed in detail in Chapter 7 and compared to the reactive approach in terms of lifetime reliability enhancement, performance and area impact. Finally, this dissertation concludes with a discussion of important results and directions for future work in Chapter 8.

Chapter 2  Background and Related Work

To help understand the motivation of our proposed lifetime reliability models and techniques for enhancing chip lifetime reliability, we describe the physical phenomena behind the wearout failure mechanisms under study in this dissertation. This is followed by a device-level reliability model for each failure mechanism, which is used in the remainder of this dissertation. In addition, we discuss previous studies on modeling and improving chip lifetime reliability.

2.1 Wearout Failure Mechanisms

2.1.1 Electromigration

Electromigration is a well-known and well-studied failure phenomenon that can occur on conductor lines due to the mass transport of conductor metal atoms [6][20][40][46]. Conducting electrons transfer some of their momentum to the metal atoms of the lines. This creates a net flow of the metal atoms in the direction of electron flow. As the atoms migrate, metal atoms are depleted in one region and pile up in other regions. This causes increased resistance in the depleted region, eventually leading to open circuits and attendant failures. The mean time to failure (MTTF) due to electromigration of a metal line can be modeled using Black's equation [6]:

MTTF_EM = A_EM · J^(−n) · e^(E_α_EM / (k·T)),    (2.1)

where A_EM and n are empirical constants, J is the current density of the metal line, E_α_EM is the activation energy for electromigration, k is Boltzmann's constant, and T is the absolute temperature in degrees Kelvin. The constants n and E_α_EM depend on the conductor metal material used [61][20].

The electromigration effect is significant if high current densities occur unidirectionally, as bidirectional current flow introduces a recovery effect: the movement of metal atoms in one direction is subsequently balanced by an equivalent movement of atoms in the opposite direction of current flow [4][46][39].
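As a concrete illustration of Equation 2.1, the short Python sketch below evaluates Black's equation for a single via and converts the resulting MTTF into a FIT rate using the relation MTTF = 10^9/FITs noted in Chapter 1. All constants here (a_em, n, ea_em, and the example current density) are illustrative placeholders, not calibrated values for any particular technology.

    import math

    K_BOLTZMANN = 8.617e-5  # Boltzmann's constant in eV/K

    def mttf_em(j, temp_k, a_em=1.0e5, n=2.0, ea_em=0.9):
        """Black's equation (2.1): MTTF_EM = A_EM * J^(-n) * exp(Ea / (k*T)).

        j: current density of the metal line; temp_k: absolute temperature (K);
        a_em, n, ea_em: empirical constants (placeholder values; n and Ea
        depend on the conductor metal material used).
        """
        return a_em * j ** (-n) * math.exp(ea_em / (K_BOLTZMANN * temp_k))

    def fits_from_mttf(mttf_hours):
        # FITs = failures per 10^9 device-hours, so FITs = 1e9 / MTTF.
        return 1e9 / mttf_hours

    # Example: the same via under identical current at two temperatures.
    for t in (345.0, 373.15):
        m = mttf_em(j=1.0e6, temp_k=t)
        print(f"T = {t:.2f} K: MTTF = {m:.3g} h, FIT = {fits_from_mttf(m):.3g}")

Because constant per-via failure rates expressed this way simply add across independent vias, such rates can be accumulated over the vulnerable vias of a structure, in the spirit of the effective-defect counting developed in Chapter 3.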
In addition, the portions of conductor lines most vulnerable to electromigration are the vias abutting metal lines in different metal layers [21]. Therefore, we consider vias experiencing unidirectional current flow when modeling chip lifetime reliability affected by electromigration in Chapter 3. For simplicity, we assume that vias with upward (i.e., from a lower metal layer to an upper metal layer) and downward (i.e., from an upper metal layer to a lower metal layer) directional current have the same electromigration effect. In addition, vias in power and ground networks are not modeled in Chapter 3 since, in most cases, there is sufficient area to duplicate them to improve reliability.

2.1.2 Negative Bias Temperature Instability (NBTI)

Negative bias temperature instability (NBTI) is a critical failure mechanism affecting deep submicron technologies [75]. NBTI occurs in PFET devices stressed with negative gate-source bias (i.e., V_gs = −V_dd) at elevated temperature. After silicon oxidation, most Si atoms bond to oxygen at the interface of the silicon and the gate oxide, but some Si atoms bond to hydrogen, forming hydrogen-terminated trivalent silicon bonds (Si3–Si–H). According to the hydrogen reaction-diffusion model [75], these bonds are dissociated under stress conditions such as a high electric field and/or elevated temperature. As a result, dangling bonds (Si3–Si·) create traps at the interface, and hydrogen atoms diffused from the interface create traps in the gate oxide. These positively charged traps result in an undesired threshold voltage increase. The shift in threshold voltage causes degradation in circuit speed and noise margin, eventually leading to circuit failures due to timing violations or to array cell state instability or destruction [30][68].

Recovery from NBTI-induced threshold voltage shift can occur during periods when no stress is applied on the gate (i.e., V_gs = 0), as hydrogen atoms diffused during NBTI stress return to the interface to mend the dangling bonds, and electrons injected from the substrate neutralize oxide traps created by NBTI stress [38][75]. This naturally occurring recovery effect of NBTI-induced wearout is intensified (i.e., made faster and more pronounced) when PFET devices are reverse biased (i.e., V_gs = V_dd), as hydrogen atoms are more effectively attracted to the interface and electron injection is more active [38][66][77].

In this work, we use the predictive hydrogen reaction-diffusion model proposed in [68] to quantify NBTI-induced threshold voltage increase over time. Assuming V_ds = 0 and the existence of no other traps due to non-hydrogen-based mechanisms, the increase in threshold voltage (ΔV_T) caused by NBTI stress conditions over time t is given by the following:

|ΔV_T| = A_NBTI · t_ox · √(C_ox · |V_gs − V_T|) · e^(E_ox/E_0) · e^(−E_α_NBTI/(k·T)) · t^n.    (2.2)

In Equation 2.2, A_NBTI and E_0 are empirical constants; t_ox, C_ox and E_ox are the oxide thickness, the capacitance per unit area, and the electric field, respectively; k, T and E_α_NBTI are Boltzmann's constant, the absolute temperature in degrees Kelvin, and the activation energy for NBTI, respectively; V_T is the original threshold voltage; and n is the slope of a log ΔV_T versus log t graph, determined by the diffusing species.

During the time that no stress is applied on the gate (i.e., V_gs = 0), it has been reported that the threshold voltage shift is not fully recoverable [45][75][77].
Taking this effect into account, we modify the recovery model given in [68] as follows:

|ΔV_T| = ΔV_t0 · (1 + r_NBTI · (e^(−(t − t_0)/τ) − 1)),    (2.3)

where ΔV_t0 and t_0 are the threshold voltage shift and the time at which stress conditions are removed, respectively; τ is the recovery speed; and r_NBTI is the ratio of the recoverable part of the NBTI-induced ΔV_T under no or reverse bias. We assume that at most 50% of the NBTI-induced ΔV_T is recoverable under no bias (i.e., r_NBTI = 0.5). As there has been no conclusive study to date indicating how much ΔV_T is recoverable under reverse bias, we assume various recovery conditions and compare them in Chapter 6.

Figure 2.1 shows an example of NBTI-induced threshold voltage increase over time, assuming the following empirical constants and technology parameters for Equations 2.2 and 2.3 [68]: A_NBTI = 1.8 mV/nm/C^0.5, E_0 = 2 MV/cm, t_ox = 1.3 nm, T = 373.15 K, E_α_NBTI = 0.13 eV, V_T = 0.2 V, V_dd = 1 V, n = 0.25, r_NBTI = 0.5, and τ = 10^2.

[Figure 2.1: NBTI-induced threshold voltage increase over time, in arbitrary units, with different duty cycles, d.]

In the figure, the duty cycle d is the ratio of stress time, during which the devices are negatively biased, over a given period of time. Keeping track of ΔV_T based on Equations 2.2 and 2.3, so as to take this NBTI recovery effect into account, requires tremendous profiling effort and causes severe slowdown of simulation speed. Instead, in the remainder of this dissertation, we use Equations 2.2 and 2.3 combined as a function of time t and duty cycle d, f_dVT(d, t), obtained by connecting the peak points of ΔV_T at which stress time ends and recovery time begins [42][55][69].
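To illustrate how Equations 2.2 and 2.3 combine into the saw-tooth trajectory of Figure 2.1, the following Python sketch alternates stress and recovery phases for a given duty cycle d and records the peak |ΔV_T| at the end of each stress phase, much as f_dVT(d, t) is built by connecting those peaks. The constant k_stress lumps all technology-dependent factors of Equation 2.2 into a single number, so the outputs are in arbitrary units; the function and its parameter values are illustrative assumptions, not the dissertation's calibrated model.

    import math

    def nbti_peaks(duty, period, cycles, k_stress=1.0, n=0.25,
                   r_nbti=0.5, tau=100.0):
        # Stress (Eq. 2.2 with technology terms lumped into k_stress):
        #     |dVt| = k_stress * t_stress^n
        # Recovery (Eq. 2.3):
        #     |dVt| = dVt0 * (1 + r_nbti * (exp(-(t - t0)/tau) - 1))
        # duty is the fraction of each period spent under stress (Vgs = -Vdd).
        dv, peaks = 0.0, []
        for _ in range(cycles):
            # Resume stress from the equivalent stress time that would have
            # produced the current (partially recovered) shift.
            t_eq = (dv / k_stress) ** (1.0 / n) if dv > 0 else 0.0
            dv = k_stress * (t_eq + duty * period) ** n
            peaks.append(dv)  # peak at the end of the stress phase
            # Recovery phase: only the fraction r_nbti is recoverable.
            dv *= 1.0 + r_nbti * (math.exp(-(1.0 - duty) * period / tau) - 1.0)
        return peaks

    # A higher duty cycle leaves a larger residual shift, as in Figure 2.1.
    for d in (0.2, 0.6, 1.0):
        print(d, round(nbti_peaks(d, period=10.0, cycles=50)[-1], 3))

Note that with d = 1 the recovery factor is exactly 1, so the shift grows as t^n without interruption, matching the uppermost curve of Figure 2.1.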
2.1.3 Time Dependent Dielectric Breakdown (TDDB)

Time dependent dielectric breakdown (TDDB) is a wearout failure mechanism that forms a conductive path in the gate oxide due to a gradual pile-up of defects such as electron traps, interface states, and positively charged donor-like states [64][63]. The formed conductive path causes a sudden increase in oxide conductance or gate leakage current, which opposes the current of the logic stage driving the affected devices. As a result, zero-to-one or one-to-zero transitions of the devices become slow [47][25]. We model post-breakdown behavior as 10kΩ of resistance [47][63] to examine for fatality, although the magnitude of post-breakdown conductivity, or the hardness of breakdown, can vary depending on the current density applied to the gate oxide under stress [47][63][34][25]. In addition, we assume that a single device failure in any circuit along the critical path is sufficient to lead to a timing violation.

[Figure 2.2: Four possible cases of TDDB (breakdown at the PFET source, the PFET drain, the NFET source, and the NFET drain). Post-breakdown behavior is modeled as 10kΩ of resistance to determine the fatality of breakdowns [47][63].]

Depending on the location of breakdown, there can be four types of gate oxide breakdown, as illustrated in Figure 2.2: breakdown at the PFET source, the PFET drain, the NFET source and the NFET drain. We assume that breakdown at the source and at the drain area are independent, thus counting each separately as a failure if it is fatal. The mean time to failure due to TDDB given in [72] is applicable to all four types of breakdown, assuming a duty cycle of d:

MTTF_TDDB = (1/d) · A_TDDB · V_gs^(−a+b·T) · e^((X + Y/T + Z·T)/(k·T)),    (2.4)

where A_TDDB, a, b, X, Y and Z are fitting parameters derived empirically [61].
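Equation 2.4 is equally direct to evaluate once its fitting parameters are fixed. The Python sketch below makes the duty-cycle dependence explicit: halving d doubles the projected MTTF. The parameter values are placeholders standing in for the empirically derived constants of [61].

    import math

    K_BOLTZMANN = 8.617e-5  # eV/K

    def mttf_tddb(duty, v_gs, temp_k, a_tddb=1.0,
                  a=78.0, b=-0.081, x=0.759, y=-66.8, z=-8.37e-4):
        # Equation 2.4:
        #   MTTF_TDDB = (1/d) * A * Vgs^(-a + b*T) * exp((X + Y/T + Z*T)/(k*T))
        # All fitting parameters here are illustrative placeholders.
        exponent = -a + b * temp_k
        arrhenius = math.exp((x + y / temp_k + z * temp_k)
                             / (K_BOLTZMANN * temp_k))
        return (a_tddb / duty) * v_gs ** exponent * arrhenius

    # Halving the duty cycle doubles the projected MTTF.
    print(mttf_tddb(0.5, 1.0, 373.15) / mttf_tddb(1.0, 1.0, 373.15))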
2.2 Previous Studies for Modeling Chip Lifetime Reliability

There has been enormous research into modeling failure mechanisms at the device level, including those described in the previous section. For electromigration, J. R. Black proposed a mean-time-to-failure model (often called Black's equation) which has been widely used to predict the lifetime of conductor lines [6]. In [21], an electromigration model specifically for copper conductor lines and low-κ dielectric material is proposed, considering various factors that may affect conductor line lifetime due to electromigration, such as conductor line microstructures, line length, liner material types, and current direction. The effect of bidirectional current flow is studied experimentally and theoretically in [46] and [23], respectively. The reliability models of the NBTI and TDDB failure mechanisms are extensively summarized in [52] and [63], respectively, so we do not discuss them further here.

There has also been much work to incorporate these device-level reliability models into the circuit level. UC Berkeley BERT [67] and Cadence Virtuoso UltraSim [12] are the most well-known lifetime reliability circuit simulators. They are capable of predicting and validating the timing of circuits affected by wearout failure mechanisms such as electromigration, TDDB and hot carrier injection, and are compatible with the SPICE or FastSPICE circuit simulators.

Compared to that at the device and circuit levels, lifetime reliability modeling at the architecture level remains rather daunting. The RAMP lifetime reliability model [61] was developed for architecture-level reliability analysis. Some of its underlying assumptions, such as uniform device density and identical device vulnerability, prevent RAMP from being extended to cover the entire chip. In [35], dynamic thermal and current stress on conductor lines is modeled for electromigration. While this electromigration model attempts to embrace the impact of discontinuous stress on the lifetime reliability of conductor lines, as do those proposed in Chapter 3, it still relies on the maximum temperature across the chip and on the worst-case current density specified at design time.

These architectural lifetime reliability models made efforts to capture the impact of the applications running on processor chips on chip lifetime reliability, which distinguishes them from the device- and circuit-level reliability models mentioned above. However, abstraction of the characteristics of failure mechanisms and of the circuit-level implementation of processor chips causes inaccuracy in these models. Moreover, the impact of applied workloads is considered rather indirectly, through temperature, rather than through the usage patterns affecting the wearout of devices. The architectural reliability model proposed in this dissertation improves upon these limitations of previous work by considering only the devices vulnerable to certain failure mechanisms and their effective stress conditions. Furthermore, the number of effective defects and the effective stress conditions are formulated with microarchitectural parameters so as to identify how computer architects can impact chip lifetime reliability.

While many studies have been conducted on the lifetime reliability analysis of redundant systems, most of them use Monte Carlo simulations assuming a certain lifetime distribution model, in contrast to what is done here. This research models lifetime distributions specifically for SRAM cells by considering device- and circuit-level parameters, such as NBTI-induced threshold voltage shift and cell stability, and by incorporating architecture-level parameters, such as duty cycle distributions. Regarding SRAM lifetime reliability under NBTI failures specifically, many researchers have studied the impact of NBTI on SRAM cell stability [13][28][29][48]. Among them, a few studies also attempt to model SRAM lifetime distributions with respect to NBTI, as is done in this research. In [26], SRAM failure probability is derived for a given degree of threshold voltage deviation arising from process variation and a given duty cycle of the PFET devices of SRAM cells. While the methodology we propose in Chapter 5 is more comprehensive, taking into account both the distribution of threshold voltage variations and that of duty cycles, the detailed low-level cell stability model used in [26] can be integrated into our model. In [17], the probability of SRAM failures due to NBTI is modeled especially for SiON and high-κ gate dielectrics, although the focus there is at the device level.

2.3 Previous Studies for Enhancing Chip Lifetime Reliability

In this section, we review prior work on extending chip lifetime. Some techniques are based on some form of redundancy (e.g., component sparing) used reactively to tolerate the effects of wearout [8][62]. Other techniques are based on adjusting operational characteristics (e.g., supply voltage, frequency, threshold voltage, or duty cycle) to reduce, or to recover from, the wearout stress conditions of failure mechanisms [2][61][30].

Much research has been conducted on enabling microarchitectures to withstand wearout failures so as to enhance the lifetime reliability of processor chips. Besides traditional forward error recovery (FER) techniques based on some form of triple modular redundancy (TMR) [15], a number of less costly architectural techniques (albeit still requiring substantial cost) have recently been proposed for fault isolation and replacement in both logic and memory structures. Some self-repair techniques are based on duplication and component redundancy (i.e., sparing) at the bit-slice, memory cell, register entry, functional unit, array structure, tile and/or processor core levels [8, 33, 9, 70, 10, 19], while others exploit the multiplicity of identical structures already existing in the architecture (i.e., graceful performance degradation), such as the execution units, fetch and issue queues, cache ways, etc., of multiple-issue out-of-order pipelines [61, 18, 53, 62, 56, 41, 36]. The Intel Itanium 2 processor, for example, implements built-in self test (BIST) and built-in self repair (BISR) that allow the on-chip L3 cache to survive faults not only in memory cells, through redundant cell sparing, but also in decoders and control logic within the cache, by supplying a redundant cache subarray in addition to spare memory cells [70]. In addition, microarchitectures are protected with parity bits or ECC to detect and/or correct errors.

In contrast to our proactive approach proposed in Chapter 6, these techniques use microarchitectural redundancy reactively, preventing components from being able to suspend or recover from wearout.
Another study evaluates the effectiveness of dynamic reliability management techniques in which the processor architecture can self-adapt its operational characteristics (i.e., voltage and frequency and, thus, power and temperature output) in response to changing application behavior to meet its lifetime reliability target [61]. In [74], the effectiveness of power gating and voltage scaling techniques is also evaluated for the lifetime reliability of logic circuits. While these techniques degrade the performance of microprocessors, our approach maintains the same performance by using redundancy, except during the transition time between operating modes. In addition, our proposed circuit-level techniques for recovery mode completely remove or even reverse the stress conditions.

There have been many studies on improving the lifetime reliability of logic and SRAM arrays vulnerable to NBTI failure. The focus of most proposed solutions is to tolerate the estimated circuit delay degradation caused by NBTI by tuning circuit delay parameters such as gate size, supply voltage and threshold voltage [42][68][73]. While circuit delay degradation can be handled by providing sufficient delay margin at design time, the degradation of the cell stability of SRAM arrays is not as straightforward to handle as circuit delay. In [30] and [2], techniques of flipping the value of array cells are proposed to balance the duty cycle of the two PFETs of array cells. While these techniques can improve lifetime reliability by proactively delaying the onset of wearout failures, they neither exploit microarchitectural redundancy nor exploit intensified wearout recovery effects as proposed in this dissertation.

Chapter 3
Proposed Framework for Architecture-level Lifetime Reliability Modeling

This chapter tackles the issue of modeling chip lifetime reliability at the architecture level. Compared to modeling at the device and circuit levels, lifetime reliability modeling at the architecture level, in the context of microarchitectural configurations and applied workloads, is a rather daunting task. The prior architecture-level lifetime reliability model proposed in [61] is based on assumptions, such as uniform device density and identical device vulnerability, which make the model less accurate, especially for multicore processor chips. For accurate reliability analysis, these assumptions need to be improved based on a detailed understanding of the implementation of modern microarchitecture components and the characteristics of failure mechanisms. In this chapter, we propose a new and robust structure-aware lifetime reliability model at the architecture level, where only the devices vulnerable to wearout failure mechanisms and the effective stress condition of these devices are taken into account for the failure rate of microarchitecture structures. In formulating the proposed model, we separate architecture-level factors from technology/environment-dependent parameters, which are encapsulated into a newly proposed technology-independent unit of reliability. This allows architects to abstract processor reliability analysis from technology-level effects. In what follows, the approach and details of the proposed methodology of lifetime reliability modeling are provided in Sections 3.1-3.3. In Section 3.4, a summary of this chapter is given.

3.1 Our Approach: Technology-Independent Failure Modeling and Analysis

The various failure mechanisms responsible for lifetime degradation do not contribute equally to processor core or chip failures.
Moreover, the impact of failure mechanisms on different parts of the chip may vary dramatically, as on-chip devices are not equally nor necessarily vulnerable even to the same failure mechanism. As a result, it is incorrect to assume a uniform device density over a chip or a subpart of the chip and an identical vulnerability of the devices to failure mechanisms, regardless of what is actually implemented over the chip area. In other words, an accurate architectural lifetime reliability model should carefully consider the vulnerability of the basic structures of the microarchitecture (e.g., arrays, register files, latches, logic, etc., composing operational units across the chip) to different types of failures by analyzing their effective defect density, taking into account their effective stress condition for specific failure mechanisms.

Following the above approach, we define effective defect density as the number of devices vulnerable to a certain type of failure mechanism per unit area of a structure, where the term "device" denotes the primitive physical element on which the failure can occur. For instance, for failures due to electromigration, the effective defect density is given by the number of vias per unit area that have unidirectional current flow, as discussed in Section 2.1.1. Vias are the interconnect abutments between metal lines in different layers, e.g., between M1 and M2, M2 and M3, and so on. Vias constitute the weakest part of metal lines [21], but not all are vulnerable to electromigration; vias that experience bidirectional current flow are generally able to recover from deleterious electromigration effects, as the movement of metal atoms in one direction is subsequently balanced by an equivalent movement of atoms in the opposite direction of current flow [46][23]. Thus, only vias with unidirectional current flow are counted in the effective defect density for failures due to electromigration. Similarly, for NBTI, the effective defect density is given by the number of PFET devices per unit area along the critical path, as NBTI occurs on PFET devices and causes increased device delay, making only devices along the critical path vulnerable to timing violations [52]. For TDDB, the effective defect density is given by the number of PFET and NFET devices per unit area having a leakage current through the gate oxide exceeding that which can be tolerated by the logic driving the devices [47].

Once the effective defect density for a certain failure mechanism is found for a given structure within the microarchitecture, an appropriate reliability model can be applied to find the failure rate for that structure and failure mechanism. In order not to overestimate the failure rate, the effective stress condition of the failure mechanism needs to be taken into account in the reliability model, as most CMOS devices experience discontinuous stress rather than constant stress. For instance, an electromigration stress condition of vias occurs only during a one-to-zero (alternatively, zero-to-one) value transition of metal lines, generating unidirectional current flow through the vias. For NBTI, PFET devices are under stress only while their gate is low and their source is high. Similarly, for TDDB, PFET (alternatively, NFET) devices undergo a stress condition only while their gate is low (alternatively, high) and their source is high (alternatively, low).
We account for the effective stress condition using the activity factor and/or duty cycle causing stress for the devices vulnerable to failure mechanisms. During times other than stress periods, devices are either recovering from or unaffected by the failure mechanism [46][52][49]. In Section 3.3, both the effective defect density and the effective stress condition for structures and failure mechanisms are formulated using architectural parameters.

Chip lifetime reliability is affected not only by the architectural factors representing the number of effective defects and the effective stress condition of various microarchitecture structures, but also by many technological and environmental parameters that are difficult to abstract at the architecture level for the general case. In particular, there are implementation technology differences such as device pitch, semiconductor material (bulk silicon versus SOI), manufacturing process, etc., all of which may vary from one chip maker or generation to another and strongly influence chip lifetime. For studying lifetime reliability trends among chips in the same technology and between chips of different technologies, a more efficient and portable framework is needed that separates architecture-level factors from technology/environment-dependent parameters.

Toward this end, we propose a technology/environment-independent unit of reliability, called the FIT of reference circuit or FORC, through which the failure rate of a microarchitecture structure can be expressed for each type of failure mechanism. That is, the appropriate reliability model for a specific failure mechanism is applied to the microarchitecture structure of interest, and the structure's failure rate is described relative to the corresponding reference circuit for that failure mechanism. By encapsulating technology/environment-dependent parameters into the FIT of the reference circuit, reliability analysis can be abstracted at the architecture level.

In our approach, the impact of various microarchitecture designs on lifetime reliability can be studied rather straightforwardly by parameterizing the configurations (e.g., number of entries in the register files or arrays, number of ports, etc.) and quantifying activity factors (e.g., number of accesses, number of value transitions, etc.) as determined empirically through architecture-level cycle-accurate simulations on applied workloads. From this, effective defect densities and effective stress conditions for certain failure mechanisms can be found, to which the appropriate FORC-based reliability model can be applied to find the failure rate in terms of FORC for the microarchitecture designs. Furthermore, to understand the impact of technology (e.g., scaling) and environment (e.g., temperature) on the lifetime reliability of a given microarchitecture, one need only estimate how the FORC for a given failure mechanism would scale or change and then apply this new FORC value to the architecture-level FIT expression derived relative to the reference circuit.

3.2 FIT of Reference Circuit (FORC)

In the following subsections, the representative reference circuits for the three major failure mechanisms (electromigration, NBTI and TDDB) are described, and FORC expressions for each are derived. The same methodology can be applied to formulate FORC primitives for other failure mechanisms.

3.2.1 FORC for Electromigration

Figure 3.1 shows an example reference circuit vulnerable to electromigration.
The outputs (i.e., drains) of the NFET and PFET devices are connected through an M2 line segment, as shown in Figure 3.1(b). As a result, vias v_up and v_down abut the M1 metal lines to M2, connecting the outputs of the PFET and the NFET devices. The length of the M2 line is assumed to be equal to the typical length of the metal segment between two successive wire repeaters, which is about 300 µm in 65nm technology. This reference circuit may not be a typical circuit in well-designed CMOS gates, where the outputs of PFETs and NFETs are connected with M1 lines. However, it is used in places having an M1 blockage between PFET and NFET devices, such as decoders, where connecting the NFET and PFET devices with M1 lines is impractical or area inefficient.

Figure 3.1: The reference circuit chosen for electromigration, with (a) the reference circuit and (b) an example layout. The outputs (i.e., drains) of the NFET and PFET devices are connected through an M2 line segment. As a result, v_up and v_down vias abut the M1 metal lines to M2. Upon the one-to-zero transition of the clock, the PFET device conducts, and current flows through v_up upward from M1 to M2 in order to charge the wire capacitance of the M2 line, C_ref. On the zero-to-one transition of the clock, the NFET device conducts, and current flows through v_down downward from M2 to M1 in order to discharge C_ref. Therefore, v_up and v_down always carry unidirectional current, causing the electromigration effect.

When the clock transitions from one to zero, the PFET device conducts, and current flows through v_up upward from M1 to M2 in order to charge the wire capacitance of the M2 line, given by C_ref in Figure 3.1(a). Conversely, on the zero-to-one transition of the clock, the NFET device conducts, and current flows through v_down downward from M2 to M1 in order to discharge C_ref. We ignore the small current through these vias that charges the drain capacitance of the non-conducting PFET or NFET device on the transition of the clock value. As a result, v_up and v_down are subject to an average unidirectional current of (C_ref · V_dd)/t, where t is the clock period, causing the vias to be vulnerable to electromigration effects.

Figure 3.2: The reference circuit chosen for NBTI. It consists of a series of inverters between two latches. The input of one latch should propagate through the inverter chain and be latched into the other within one clock period. Because the value of the signal changes between V_dd and 0V in passing through each inverter, the PFET device in every other inverter is stressed.

Based on Black's equation in Equation 2.1, the failure rate in FIT of the reference circuit (the vias in this case) due to electromigration is described by the following:

    FORC_EM = (10^9 / A_EM) · ((C_ref · V_dd) / t)^n · e^(−E_α_EM/(kT)).    (3.1)

Note that FIT (Failures in Time, i.e., the number of failures seen in 10^9 hours) is inversely related to MTTF (mean time to failure), i.e., FIT = 10^9/MTTF.
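To make the arithmetic concrete, the following Python sketch evaluates Equation 3.1 directly. All numbers in the example call are illustrative placeholders: A_EM, n, E_α_EM, C_ref and the clock period would come from technology characterization, not from this dissertation.

    import math

    K_BOLTZMANN = 8.617e-5  # Boltzmann constant in eV/K

    def forc_em(a_em, c_ref, v_dd, t_clk, n, e_act, temp_k):
        """FORC for electromigration per Equation 3.1:
        FORC_EM = (1e9 / A_EM) * ((C_ref * V_dd) / t)^n * exp(-E_a / (k * T)).
        """
        current = (c_ref * v_dd) / t_clk  # average unidirectional via current
        return (1e9 / a_em) * current**n * math.exp(-e_act / (K_BOLTZMANN * temp_k))

    # Illustrative placeholder values only (not characterized data):
    print(forc_em(a_em=1e6, c_ref=100e-15, v_dd=1.0, t_clk=0.5e-9,
                  n=2.0, e_act=0.9, temp_k=358.0))

Because all technology and environment terms are folded into this one number, the architecture-level failure rate expressions derived later only ever scale FORC_EM rather than re-evaluating the low-level model.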
Using this notion of FORC, we can express the failure rates of microarchitectural components due to electromigration in relative terms of FORC_EM, as described in Section 3.3.1, in order to isolate the architecture from the low-level peculiarities associated with technological and environmental parameters such as A_EM, V_dd, t, E_α_EM, and T.

3.2.2 FORC for NBTI

We chose a reference circuit for NBTI that includes PFET devices under stress and limits the allowable gate delay increase before a timing violation occurs. As shown in Figure 3.2, the reference circuit consists of a series of N_inv inverters between two latches. The input of one latch should propagate through the inverter chain and be latched into the other within one clock period. Because the value of the signal changes between V_dd and 0V in passing through each inverter, the PFET device in every other inverter is stressed and becomes slower over time due to the NBTI-induced threshold voltage increase. This eventually can lead to a violation of the latch setup time and, ultimately, the capturing of a wrong value in the latch.

Suppose that microprocessors are built with a 1% timing margin. In other words, the inverter chain shown in Figure 3.2 can tolerate up to a 1% increase in the total delay of the inverters before causing a setup time violation. This delay margin can be converted to the maximum allowable V_T increase for the reference circuit, ∆V_T_ref, by using the alpha power law model [51]:

    T = k · V_dd / (V_dd − V_T)^α
    ∆T/∆V_T = k · α · V_dd / (V_dd − V_T)^(α+1) = α · T / (V_dd − V_T)
    ∆T/T = α · ∆V_T / (V_dd − V_T)
    ∆T/T = α · ∆V_T_ref / (V_dd − V_T) = 0.01 · N_inv
    ∆V_T_ref = 0.01 · N_inv · (V_dd − V_T) / α,

where α is technology-dependent, but for PFETs, the range of α is typically between 1.5 and 1.7 [51]. Here, T is the delay of a single inverter, so a 1% margin on the total delay of the N_inv-stage chain allows the delay of a single degraded inverter to grow by up to a fraction 0.01·N_inv of its own delay. That is, a V_T shift greater than ∆V_T_ref can cause failure of the reference circuit. Using Equation 2.2 and the relation between lifetime and FIT rates, i.e., FIT = 10^9/MTTF, the failure rate in FITs of the reference circuit can be derived as follows:

    FORC_NBTI = 10^9 · (K / ∆V_T_ref)^(1/n),    (3.2)

where K = A_NBTI · t_ox · sqrt(C_ox · |V_gs − V_T|) · e^(E_ox/E_0) · e^(−E_α_NBTI/(kT)). In Section 3.3.2, we describe how to derive the failure rate of basic microarchitecture structures using FORC_NBTI given in Equation 3.2, where technological and environmental parameters, such as A_NBTI, t_ox, C_ox, V_gs, V_T, E_ox, E_0, E_α_NBTI, and T, are encapsulated into FORC_NBTI.

3.2.3 FORC for TDDB

The PFET or NFET devices along the critical path with a 100% duty cycle can serve as reference circuits for TDDB. Using Equation 2.4, the FORC for TDDB is given by

    FORC_TDDB = (10^9 / A_TDDB) · V_gs^(a−bT) · e^(−(X + Y/T + Z·T)/(kT)).    (3.3)

Using this FORC_TDDB, the way to derive the failure rate of microarchitecture structures is given in Section 3.3.3. In the derived formula, technological and environmental parameters such as A_TDDB, a, b, X, Y, Z, V_gs, and T in Equation 2.4 are encapsulated into FORC_TDDB.

3.3 Estimating the Failure Rate of Microarchitecture Structures Based on FORC

In the subsections below, we derive expressions for the failure rates of several basic and widely used microarchitecture structures, such as arrays, register files, latches, logic gates, multiplexers, and wire repeaters, for the three failure mechanisms described thus far. Using these expressions, the lifetime reliability of multicore microarchitectures is analyzed via architecture-level cycle-accurate simulations in Chapter 4.
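Before turning to individual structures, the sketch below shows how the ∆V_T_ref derivation and Equation 3.2 fit together in code. Every parameter value in the example call is an illustrative placeholder (the time exponent n, activation energies, and oxide constants would come from device measurements); the defaults for α and the margin follow the text above.

    import math

    K_BOLTZMANN = 8.617e-5  # eV/K

    def delta_vt_ref(n_inv, v_dd, v_t, alpha=1.6, margin=0.01):
        """Maximum tolerable NBTI-induced V_T shift for the N_inv-stage
        reference chain, from the alpha power law derivation above."""
        return margin * n_inv * (v_dd - v_t) / alpha

    def forc_nbti(a_nbti, t_ox, c_ox, v_gs, v_t, e_ox, e_0, e_act,
                  temp_k, n, dvt_ref):
        """FORC for NBTI per Equation 3.2: FORC = 1e9 * (K / dVt_ref)^(1/n)."""
        k_term = (a_nbti * t_ox * math.sqrt(c_ox * abs(v_gs - v_t))
                  * math.exp(e_ox / e_0)
                  * math.exp(-e_act / (K_BOLTZMANN * temp_k)))
        return 1e9 * (k_term / dvt_ref) ** (1.0 / n)

    # Illustrative placeholder numbers only:
    dvt = delta_vt_ref(n_inv=20, v_dd=1.0, v_t=0.3)
    print(forc_nbti(a_nbti=1e-3, t_ox=1.2e-9, c_ox=1e-2, v_gs=1.0, v_t=0.3,
                    e_ox=1.0, e_0=2.0, e_act=0.15, temp_k=358.0, n=0.25,
                    dvt_ref=dvt))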
3.3.1 Failure Rate of Microarchitecture Structures due to Electromigration

Register File Structures

Using FORC_EM in Equation 3.1, we are able to estimate the failure rate of a multi-port register file, such as that shown in Figure 3.3. In register files, the vias having only unidirectional current are those between bitlines and pass transistors (i.e., NFETs gated by wordlines). Generally, bitlines are implemented on the M2 or upper metal layers, thus requiring vias to connect the bitlines to the pass transistors. Because read bitlines are always precharged prior to cells being read, these vias (e.g., v_sel) have current flow from the bitlines toward the pass transistors upon reading out a 1, in order to discharge the precharged capacitance of the bitlines, C_bitline, but no current flows while reading a 0. The drain current of the pass transistors of unselected cells through vias such as v_unsel is ignored, since this drain current is much smaller than the current due to the discharge of C_bitline when reading out a 1. Figure 3.3(b) depicts the current direction on read bitline bl_0_k when cell Cell_ik stores 1 and is being selected by asserting wordline wl_0_i.

Figure 3.3: Multi-port register file layout with current directions causing failures due to electromigration, with (a) a register file with N_readports read ports and (b) an example layout of read bitline bl_0_k. Because read bitlines are always precharged prior to cells being read, via v_sel has current flow from bitline bl_0_k toward the pass transistor upon reading out a 1, discharging the precharged capacitance of the bitline, C_bitline, but no current flows while reading a 0. In (b), an example layout is depicted for bitline bl_0_k implemented on the M2 metal layer and the pass transistors of cells Cell_ik and Cell_jk, both of which are connected to bl_0_k through v_sel and v_unsel, respectively. The arrows indicate the current direction on bl_0_k when Cell_ik stores 1 and is being selected by asserting wordline wl_0_i. As shown in (b), current flows from the bitline to v_sel, while little current flows through v_unsel.

The effective defect density of register files is given by the number of vias between bitlines and pass transistors over the area of the structure. Each cell has one via between the bitline and pass transistor per read port, totaling N_cells · N_readports vias across the register file. In order to express the failure rate of the register file relative to the failure rate of the reference circuit, i.e., FORC, we must also determine the current through the vulnerable vias, which is given by (C_bitline · V_dd)/t, i.e., the amount of capacitance discharged through the vias. Here, C_bitline = (N_entries · C_drain) + C_wire. This current flows through the vias only while reading out a 1, i.e., on average, [N_reads/(N_entries · N_readports)] · P_1 of the time, where N_reads and N_entries are the number of reads of the register file and the number of physical registers, respectively, and P_1 is the probability of the cell storing a 1. Thus, the product of the two expressions gives the effective current of the vias.
By using Equations 2.1 and 3.1, the following expression gives the failure rate of register files due to electromigration in terms of FORC_EM:

    FIT_EM_regfile = N_cells · N_readports · (10^9 / A_EM)
                     · ((C_bitline · V_dd / t) · (N_reads / (N_entries · N_readports)) · P_1)^n · e^(−E_α_EM/(kT))
                   = N_cells · N_readports · ((C_bitline / C_ref) · (N_reads / (N_entries · N_readports)) · P_1)^n · FORC_EM.    (3.4)

In Equation 3.4, N_cells, N_readports, and N_entries are architectural configuration parameters, N_reads is an activity statistic, and P_1 is a value statistic. The activity and value statistics can be obtained from a cycle-accurate microarchitecture simulator. While C_bitline and C_ref are circuit-dependent parameters, the ratio of C_bitline to C_ref is generally known at an early stage of microprocessor design.

In addition to the vias on (local) read bitlines described above, the write bitlines and global read bitlines that are not shown in Figure 3.3 are also affected by electromigration and can be modeled as local read bitlines. However, for write bitlines, the electromigration effect becomes insignificant if the writing of 1s and 0s is balanced for the cells, due to the recovery effect. In other words, the electromigration effect due to the writing of a 1, which causes current from the bitlines toward the cells, is canceled by that due to the writing of a 0, which causes the same current but in the opposite direction, i.e., from the cells toward the bitlines.
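As an illustration of how these terms combine, the sketch below evaluates Equation 3.4 from simulator statistics. The numbers in the example call are placeholders rather than measured statistics, and the capacitance ratio is passed in directly since only C_bitline/C_ref matters at this level.

    def fit_em_regfile(n_cells, n_readports, n_entries,
                       n_reads, p_one, c_bitline_over_c_ref, n, forc_em):
        """Failure rate of a multi-port register file due to electromigration,
        in units of FORC_EM (Equation 3.4)."""
        # Reads spread over entries and read ports; current flows only
        # when a 1 is read out.
        per_via_activity = n_reads / (n_entries * n_readports) * p_one
        return (n_cells * n_readports
                * (c_bitline_over_c_ref * per_via_activity) ** n * forc_em)

    # Placeholder example: 64-entry, 2-read-port register file of 64-bit cells.
    print(fit_em_regfile(n_cells=64 * 64, n_readports=2, n_entries=64,
                         n_reads=0.8, p_one=0.4, c_bitline_over_c_ref=0.6,
                         n=2.0, forc_em=1.0))

The array expression in Equation 3.5 below extends this same pattern, adding the paired complementary bitlines and the write traffic term.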
Figure 3.4: Array structure layout with current directions causing failures due to electromigration, with (a) an array with N_ports ports and (b) an example layout of bitline bl_0_k. The direction of the current flowing through the vias connecting cells to bitlines is similar to that for register files shown in Figure 3.3, except that vias on the bitlines (e.g., bl_0_k) and on their complementary bitlines carry current while reading 0 and 1, respectively. In (b), an example layout is depicted for bl_0_k implemented on the M2 metal layer and the pass transistors of cells Cell_ik and Cell_jk, both of which are connected to bl_0_k through v_sel and v_unsel, respectively. The arrows indicate the current direction on bl_0_k when Cell_ik stores 0 and is being selected by asserting wordline wl_0_i.

Array Structures

Arrays are similar to register files, except that the same bitlines are used for both reads and writes, and they are paired for the cells, as illustrated in Figure 3.4. The paired bitlines make two vias per cell, and the sharing of bitlines for reads and writes makes the computation of the average current density more complicated. For reads, the current density is the same as that computed for register files: (C_bitline · V_dd)/t. When cells hold 0 and are being selected by asserting the wordline, this current flows through vias on bitlines such as bl_0_k in Figure 3.4. Likewise, the same current flows through vias on the complementary bitlines while reading a 1 from the cells. For writes, current flows on the bitlines only if the writes cause the value of the cells to change. To overwrite cells holding 1 with 0, the bitline write-input drivers pull down transistors in the cells, causing current flow from the cells to the bitlines. However, this current is relatively small and can be ignored in arrays, since the PFET transistors in array cells are generally designed to be very weak. While cells holding 0 are overwritten with 1, the write-input drivers charge transistors in the cells as well as the capacitance of the bitlines, causing current similar to that of reads, but only for the time needed to pull up transistors in the cells, giving (C_bitline · V_dd)/(γ · t). Here, γ is the duty cycle to pull up transistors in the cells. The complementary bitlines have the same mechanism, except that current is generated while cells holding 1 are overwritten with 0.

The effective defect density of arrays with N_cells cells and N_ports read/write ports is N_cells · N_ports vias on the bitlines plus the same number of vias on the complementary bitlines over the area of the structure, and the average current through these vias is the sum of the currents due to reads and writes:

for vias on the bitlines,

    (C_bitline · V_dd / t) · (N_reads / (N_rows · N_ports)) · P_0 + (C_bitline · V_dd / (γ·t)) · (N_writes / (N_rows · N_ports)) · (1/2)·P_flip,

and for vias on the complementary bitlines,

    (C_bitline · V_dd / t) · (N_reads / (N_rows · N_ports)) · P_1 + (C_bitline · V_dd / (γ·t)) · (N_writes / (N_rows · N_ports)) · (1/2)·P_flip,

where P_1 and P_0 are the probabilities of the cell holding 1 and 0, respectively, and P_flip is the probability of flipping the value of the cell due to writes, either 1 to 0 or 0 to 1. Therefore, the failure rate of the array structure due to electromigration in terms of FORC_EM is given by the following:

    FIT_EM_array = N_cells · N_ports · (10^9 / A_EM) · (C_bitline · V_dd / (t · N_rows · N_ports))^n
                   · [ (N_reads · P_0 + (1/(2γ)) · N_writes · P_flip)^n + (N_reads · P_1 + (1/(2γ)) · N_writes · P_flip)^n ] · e^(−E_α_EM/(kT))
                 = N_cells · N_ports · (C_bitline / (C_ref · N_rows · N_ports))^n
                   · [ (N_reads · P_0 + (1/(2γ)) · N_writes · P_flip)^n + (N_reads · P_1 + (1/(2γ)) · N_writes · P_flip)^n ] · FORC_EM.    (3.5)

In Equation 3.5, P_1, P_0, and P_flip are value probabilities; N_reads and N_writes are activity statistics; and N_cells, N_ports, and N_rows are architectural configuration parameters.

Other Structures

Similar to arrays and register files, data paths have vias attaching nodes (e.g., functional units) to the bitlines composing the data paths. These vias may also be affected by electromigration if nodes load either 0 or 1 predominantly onto the bitlines. For example, most of the output bits of the count leading zero (CLZ) operation are zero, causing current flow from
the bitlines toward the functional unit through the via to discharge the capacitance of the bitlines, if the bitlines are charged prior to the access by CLZ. The effective defect density of the data path is N_nodes · N_bitlines · N_bus, because vias are generated where each node is connected to the N_bus bus(es) whose width is N_bitlines. In Chapter 4, these vias are assumed to be unaffected by electromigration, as data paths are assumed to present a balanced number of 1s and 0s.

Vias in other structures such as latches, logic gates, multiplexers, or wire repeaters are rarely affected by electromigration, due to the balanced number of one-to-zero and zero-to-one transitions. However, the metal line segments connecting the diffusions of PFET and NFET devices always have net unidirectional current flow, making these line segments vulnerable to electromigration. Figure 3.5 shows an example layout of a three-input NAND gate exhibiting this behavior. The M1 lines connecting the drains of the three PFET devices and the upper NFET device have unidirectional current flow regardless of the value of the NAND gate output, Out.

Figure 3.5: Logic structure layout with current directions possibly causing failures due to electromigration. The layout shows an example of a NAND gate with inputs A, B, and C, and output Out. The M1 lines connecting the drains of the three PFET devices and the upper NFET device have unidirectional current flow regardless of the value of the NAND gate output, Out. However, the via connecting the M1 lines to M2 has bidirectional current, depending on the value of Out.

According to [21], metal lines shorter than a critical line length are subject to Blech effects, which offset the electromigration effect. The Blech effects result in a flow of metal atoms back toward the cathode (i.e., in the opposite direction of electron flow), because the drifted atoms accumulate at the anode end and cause tensile stress, due to an increase in the atomic density, along with compressive stress. In Chapter 4, we assume that the metal line segments connecting the diffusions of devices are short and, thus, that mass transport due to electromigration is suppressed by the Blech effect.

3.3.2 Failure Rate of Microarchitecture Structures due to NBTI

All we need for modeling the failure rate of structures affected by NBTI is to find the number of effective defects per unit area (i.e., the PFETs of the structure that lie along critical paths) and the duty cycle for the microarchitecture structure of interest. Assuming the sum-of-failure-rates (SOFR) model [60], the failure rate in FIT of the structures is straightforwardly computed as the sum of the FITs of the PFET devices belonging to those structures.

Table 3.1 lists the number of effective defects (EDs) and the duty cycle of the devices over the area of the various structures composing the microarchitecture operational units illustrated in Figure 3.6, from which the failure rate can be found. In the table, T_0 and T_1 indicate the fraction of time during which cells, latches, and repeated wires present 0 and 1, respectively. P_fatal is the percentage of devices along the critical path, the failure of which is fatal, i.e., leads to circuit failure.

Figure 3.6: The PFET and NFET devices in various microarchitecture structures: (a) array, (b) register file, (c) wire repeaters, (d) transmission gate in multiplexers, and (e) latch. The number of effective defects and the duty cycle of the devices for NBTI and TDDB are given in Tables 3.1 and 3.2, respectively.
Table 3.1: The number of effective defects (EDs) and duty cycle for modeling the failure rate of various microarchitecture structures due to NBTI. The devices in the table are indexed in Figure 3.6. T_0 and T_1 indicate the fraction of time during which SRAM cells, latches, and repeated wires present 0 and 1, respectively. P_fatal is the percentage of devices along the critical path, causing circuit failure if they fail. Note that the failure of the precharge transistors and of the PFET devices of the feedback circuit of latches does not cause the circuit to fail, because these devices are not along the critical path.

    Structure               Index   Number of EDs          Duty cycle
    Array & register file   a       Non-fatal              -
                            b       N_cells                T_0
                            c       N_cells                T_1
    Latch                   m, q    Non-fatal              -
                            l       N_latches              0.25 †
                            o       N_latches              T_0
    Wire repeater           s, u    0.5 · N_repeaters      T_0 or T_1 ‡
    Multiplexer             w       N_muxes · N_inputs     0.5/N_inputs †
    Logic gate              -       P_fatal · N_pFETs      0.5

    † Since NBTI occurs only on negatively biased PFET devices, the PFETs of the transmission gates in latches and multiplexers are stressed only if the gate of the PFET is 0 and the transmission gate passes 1 (with a probability of 0.5 assumed).
    ‡ For wire repeaters, the PFETs in every other repeater are under stress at any given time.

Note that the failure of the precharge transistors (denoted by a in Figure 3.6) and of the PFET devices in the feedback circuit of latches (m and q) is not fatal, since these devices are not along the critical path; they are thus not considered for the failure rate. In the cell of array and register file structures, there are two PFET devices (b and c). While one (i.e., b) is under stress when the cell stores 0, the other (i.e., c) is under stress in the opposite case, i.e., when the cell stores 1. Similarly, for repeated wires, the PFETs in every other repeater (e.g., s or u) are under stress at any given time. In addition, the PFET device of the transmission gate in latches and multiplexers (l and w) is stressed only if the gate of the PFET is 0 and the transmission gate passes 1, because NBTI occurs only on negatively biased PFET devices. In the table, the probability of the occurrence of this stress condition is assumed to be 0.5.

There are several microarchitecture components with high duty cycles, resulting in high failure rates due to NBTI. SRAM arrays constitute one of the components suffering from NBTI, since the PFET devices of SRAM cells are vulnerable to NBTI and may have a high duty cycle if the contents of the cells are flipped only occasionally, due to many reads but few writes [30]. A more detailed methodology for modeling SRAM lifetime reliability is given in Chapter 5. Other microarchitecture components likely to have high failure rates due to NBTI are the latches that follow the decoders for arrays and register files, such as the register mapper, caches, TLB, load/store reorder queues, etc. For instance, if latches hold the result of the row address decoder of a cache, the value of these latches changes only when the wordline driven by the latches is asserted. As a result, the average duty cycle of the latches increases as the number of rows increases (if cache lines are evenly accessed) and as cache associativity decreases. Clock distribution networks can also be affected by NBTI, resulting in clock skew. The failure rate of clock distribution networks can be computed in the same way as that of wire repeaters in Table 3.1.
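Under the SOFR assumption, the per-structure NBTI FIT reduces to a sum of (number of EDs) × (per-device FIT at the given duty cycle) terms over the rows of Table 3.1. The sketch below illustrates only this bookkeeping; the mapping from duty cycle to per-PFET FIT is passed in as a function, and the linear weighting used in the example is purely a placeholder, since the actual duty-cycle dependence comes from the underlying NBTI model (Equation 2.2), not from this sketch.

    def fit_nbti_structure(ed_counts, duty_cycles, fit_per_device):
        """SOFR bookkeeping for NBTI (Section 3.3.2):
        total FIT = sum over device classes of (#EDs) * FIT(duty cycle).
        `fit_per_device` maps a duty cycle to a per-PFET FIT in FORC_NBTI
        units; its exact shape is a modeling assumption, not given here.
        """
        return sum(n * fit_per_device(d) for n, d in zip(ed_counts, duty_cycles))

    # Example for an array per Table 3.1: devices b (duty T_0) and c (duty T_1);
    # the lambda below is an illustrative linear placeholder weighting.
    n_cells, t0, t1 = 4096, 0.7, 0.3
    print(fit_nbti_structure(ed_counts=[n_cells, n_cells],
                             duty_cycles=[t0, t1],
                             fit_per_device=lambda d: d))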
3.3.3 Failure Rate of Microarchitecture Structures due to TDDB

As most CMOS devices experience discontinuous stress modes, we account for this using the duty cycle of stress for the devices, in order not to underestimate their lifetime. Unlike electromigration and NBTI, TDDB has no recovery effect in digital circuits; however, removing the stress simply suspends gate oxide breakdown [49]. Taking this into account, the failure rate of each PFET or NFET is FORC_TDDB multiplied by its duty cycle d, as the reference circuits for TDDB have constant stress (i.e., a 100% duty cycle):

    FIT_TDDB_per_FET = d · FORC_TDDB,    (3.6)

where the duty cycle is the fraction of time under stress, i.e., for NFET devices, it is the fraction of time when the devices are positively biased; for PFETs, it is the fraction of time when the devices are negatively biased.

Similar to NBTI, we can express the failure rates of microarchitectural components vulnerable to TDDB in terms of FORC_TDDB by finding the number of effective defects over the area of the components and their duty cycle. That is, we need to find the number of PFETs and NFETs over the area of the structure that lie along critical paths, and their duty cycle. In addition, we assume that breakdowns at the source area and at the drain area are independent, thus counting each separately as a failure if it is fatal. Assuming the SOFR model, the failure rate of basic structures is straightforwardly computed as the sum of the FITs of the PFET and NFET devices belonging to those structures. Table 3.2 lists the number of effective defects (EDs), the duty cycle of the devices, and the fatality of breakdown (NF: non-fatal; F: fatal) over the area of the various structures illustrated in Figure 3.6, from which the FIT can be found.

Table 3.2: The number of effective defects (EDs) and duty cycle for TDDB modeling for various microarchitecture structures. The devices in the table are indexed in Figure 3.6. T_0 and T_1 indicate the fraction of time during which SRAM cells, latches, and repeated wires present 0 and 1, respectively. P_fatal is the percentage of devices along the critical path, causing circuit failure if they fail. The fatality of breakdown (NF: non-fatal; F: fatal) is given for the source (Src) and drain (Drn) areas. We assume that breakdown at the source and at the drain area are independent, thus counting each separately as a failure if it is fatal.

    Structure      Index   Src    Drn   Number of EDs           Duty cycle
    Array          a       NF †   F     2·N_ports·N_lbls        1 − (N_rds/cycle + N_wts/cycle)/(2·N_lbls·N_ports)
                   b       NF     F     N_cells                 T_0
                   c       NF     F     N_cells                 T_1
                   d       F      F     2·N_cells               T_0
                   e       F      F     2·N_cells               T_1
                   f       F      F     4·N_cells·N_ports       (N_rds/cycle + N_wts/cycle)/(2·N_rows·N_ports)
    Register file  g       NF †   F     N_rdports·N_lbls        1 − N_rds/cycle/(2·N_lbls·N_rdports)
                   h       F      F     2·N_cells·N_rdports     N_rds/cycle/(2·N_entries·N_rdports)
                   i       NF     F     N_cells·N_rdports       T_1
                   j       NF     F     2·N_cells·N_wtports     N_wts/cycle/(2·N_entries·N_wtports)
    Latch          k, l    F      F     2·N_latches             0.25
                   m, n    NF †   F     N_latches               0.5
                   o       F      F     2·N_latches             T_1
                   p       F      F     2·N_latches             T_0
                   q       F      NF    N_latches               T_1
                   r       F      NF    N_latches               T_0
    Wire repeater  s       F      F     2·N_repeaters           T_1
                   u       F      F     2·N_repeaters           T_0
                   t       F      F     2·N_repeaters           T_1
                   v       F      F     2·N_repeaters           T_0
    Multiplexers   w       F      F     2·N_muxes·N_inputs      0.5/N_inputs
                   x       F      F     2·N_muxes·N_inputs      0.5/N_inputs
    Logic gate     -       F ‡    F ‡   P_fatal·N_FETs          0.5

    † Simultaneous multiple breakdowns may cause circuit failure.
    ‡ The fatality is determined by the circuits to which the devices belong.

We assume that precharge transistors are always stressed except during the time when the corresponding bitlines are accessed, which occurs with an average probability of (N_reads/cycle + N_writes/cycle)/(2·N_lbls·N_ports) for arrays and N_reads/cycle/(2·N_lbls·N_readports) for register files. Here, precharge takes half of the clock cycle, and bitline access for reads takes the other half of the cycle. On the other hand, pass transistors are under stress only when the corresponding bitlines are accessed.
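Since Equation 3.6 is linear in the duty cycle, the SOFR sum over the rows of Table 3.2 is simple bookkeeping, illustrated below. The example describes a latch using indices k, l, o and p; the T_0/T_1 values and device count are placeholders.

    def fit_tddb_structure(device_classes, forc_tddb):
        """TDDB FIT of a structure under the SOFR model: each
        (count, duty, fatal) class contributes count * duty * FORC_TDDB
        if its breakdown is fatal (Equation 3.6); non-fatal classes
        contribute nothing.
        """
        return sum(count * duty * forc_tddb
                   for count, duty, fatal in device_classes if fatal)

    # Placeholder example for a latch per Table 3.2: source and drain
    # breakdowns are counted separately, hence the factor of 2 in the
    # ED counts of the fatal classes.
    n_latches, t0, t1 = 1000, 0.6, 0.4
    classes = [(2 * n_latches, 0.25, True),   # k, l
               (2 * n_latches, t1,   True),   # o
               (2 * n_latches, t0,   True)]   # p
    print(fit_tddb_structure(classes, forc_tddb=1.0))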
3.4 Summary

In this chapter, we address the issues of modeling chip lifetime reliability at the architecture level. We propose a framework for architecture-level lifetime reliability modeling that effectively captures the impact of microarchitectural configurations and applied workloads. We present a new concept, called the FIT of reference circuit (FORC), that allows architects to quantify failure rates without having to deal with the circuit- and technology-specific details of the implemented architecture. As a result, the FORC-based approach allows relative performance-reliability trade-offs to be evaluated in making design decisions, especially at early design stages. The proposed framework, along with a cycle-accurate architecture simulator, allows accurate estimation of the failure rate of various types of microprocessor architectures. This is demonstrated in the next chapter.

Chapter 4
Microarchitecture Lifetime Reliability Analysis Using FORC

The methodology of simulation-based reliability analysis using the model proposed in the previous chapter is described in this chapter. To demonstrate the methodology, the lifetime reliability of a quad-core processor chip running SPLASH benchmark programs is analyzed.

4.1 Evaluation Methodology

Our reliability modeling framework described in Sections 3.2 and 3.3 requires only microarchitecture configuration parameters, activity statistics, and value probabilities for failure rate analysis in terms of FORC. Thus, it can be implemented using any cycle-accurate microarchitecture simulator, similar to what is done in RAMP [61]. In this dissertation, we use Mambo [54], an IBM proprietary full-system simulation toolset for the
51 For floorplanning, all functions of the chip are assigned to 0.5mm×0.5mm cells on the grid. Figure 4.2 shows the 15mm×15mm quad-core chip floorplan used. The floor- plan of the chip is designed systematically as the number of cores is changed, keeping in mind standard latency and wireability considerations. Several iterations were done to ensure that area utilization and power density does not exceed predefined limits for any ofthecells. However,itisnot reflective of any real processor product design. We simulate the thermal characteristics of our chip with a thermal RC network, based on the duality between thermal and electrical systems. A more detailed description of architecture-level thermal modeling can be found in [59]. Specifically, since our focus in this work is on long-term reliability, we model only the stable-state temperature and ignore more transient temperature variations. Therefore, we obtain temperature simply by multiplying power with a thermal resistance matrix, as given by A· P = T . Here, A is the thermal resistance matrix that can be either calculated using analytical tools such as computational fluid dynamics (CFD) simulations or measured using infra-red (IR) thermal imaging equipment [16]. P is the power matrix which specifies the power consumption of each cell in a 30×30 grid, and T is the resulting temperature matrix. 52 M icroarchitecture configurations Power dissipation Temperature Chip lifetime Workloads IPC Technology & implementation param eters Activity statistics Mambo Power library Thermal library Reliability library Activity & value statistics Power matrix Figure 4.1: Simulation environment for estimating performance, power dissipation, tem- perature and chip lifetime, which is built around Mambo [54]. The activity and value statistics are collected by Mambo and fed into the power and reliability model. Tem- perature is measured by using the estimated power dissipation and a thermal resistance matrix [16] and, if needed, it may befedintothereliabilitymodel. L2 L2 L2 L2 FPU FPU ISU ISU ISU ISU FPU BRU FPU FXU FXU FXU FXU LSU LSU LSU LSU L2C BRU BRU BRU IFU IFU IFU IFU L2C L2C L2C NCU NCU L3DIR L3DIR L3DIR L3DIR MC GX FBC NCU NCU Figure 4.2: Simulated 15mm×15mm quad-core processor floorplan. It consists of 0.5mm×0.5mm cells on the grid (i.e., 30×30 cells), each to which the functions of the chip are properly assigned. 53 4.2 MulticoreProcessorMicroarchitectureReliability Analysis In this section, we analyze the impact of the three failure mechanisms on different mi- croarchitecture structures in a quad-core processor chip using the FORC reliability mod- eling framework. Failure rates due to the failure mechanisms are given relative to the FORC at ambient temperature of the corresponding failure mechanism. 4.2.1 Core-LevelReliabilityAnalysis Figure 4.3 shows the failure rate analysis of the core (master core) that initiates the appli- cation and distributes tasks to the other three cores (slave cores), while running Barnes. The area breakdown of the core is also given in thefigure. Note that failures due to elec- tromigration are caused by vias only in arrays and register files. Figures 4.4-4.6 show a more detailed failure rate analysis of the core, at the unit-level. In Figures 4.3, 4.5 and 4.6, FIT NBTI and FIT T DDB are broken down into failures occurring in array and register file structures versus those occurring in logic and wire structures including logic gates, latches, multiplexers and wire repeaters. 
The failure rateofelectromigration(FIT EM ) does not necessarily follow the area trend; instead, it is mainly affected by activity fac- tors as well as effective defect density. For example, in Figure 4.3(a), IFU has higher FIT rate than LSU despite being half theareaofthe arrayand register file structures, and ISU has much lower FIT rate than 54 0 5 10 15 20 IFU BR U ISU LSU FXU FP U x10 3 FORC EM 0 30 60 90 IFU BRU ISU LSU FXU FPU x10 6 F O RC NBTI logic & w ires array s & regfiles (a) FIT of EM (b) FIT of NBTI 0 2 4 6 8 10 IFU BR U ISU LSU FXU FP U x10 6 FORC TDDB logic & w ires arrays & regfiles 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 IFUBRU ISU LSU FXUFPU Area (mm 2 ) logic & w ires arrays & regfiles (c) FIT of TDDB (d) Area of the units Figure 4.3: FIT of EM, NBTI, and TDDB of the master core that initiates the application and distributes tasks to the other three slave cores, while running the first 10 msec of Barnes are shown in (a), (b), and (c), respectively. While the FIT of EM is contributed only byvias in arrays and registerfiles, the FIT of NBTI and TDDB is broken down into that for array and register file structures, and that for logic and wire structures. The area breakdown of the units composing the core is given for these structures in (d). 55 FXU or FPU due to having similar area. This can be explained by the FIT rates of functions composing these units in Figure 4.4. In the IFU, several functions implemented as arrays and registerfiles, such as I-cache, address translation table (IAT), branch history table (BHT), and instruction buffer (IBUF), contribute to high FIT rate. This is because the IFU has a higher number of effective defects and it is accessed almost every cycle, resulting in high current density throughvias caused by high N reads as eight instructions are fetched almost every cycle and scanned/predicted for branches, even if the IBUF is full. For the ISU, however, only the register map shadows and the completion table are implemented as register files, and they have relatively low activity factors compared to other units, causing low current density. The failure rate of NBTI (FIT NBTI ) is mainly affected by ∆ V c anddutycycle as well as effective defect density. In Figure 4.3(b), IFU and LSU have much higher FIT NBTI than other units due to high FIT rate in the arrays and register files. In particular, arrays have 15-20 times higher cell density and lower ∆ V c (50mV for arrays versus 80mV for register files assumed) than register files due to their customized design. As shown in Figure 4.5, over 50% of the FIT NBTI of IFU and LSU occurs in arrays. Other units have lower FITs, contributed mostly by register files that implement tables and queues. Note that IBUF and ILST of IFU are also implemented as registerfiles, thus having lower FIT rate, as shown in Figure 4.5(a). Compared to logic gates, latches have lower effective defect density and multiplexers have smaller duty cycle, resulting in lower FIT NBTI .As a result, ISU queues consisting mostly of latches and multiplexers contribute to smaller 56 0 2 4 6 8 10 IFAR IAT BHT ICACHE IDIR BPL IRLD IBUF ILST x10 3 FORC EM 0 2 4 6 8 DISP GPMP FPMP CMP FPSH GPSH CSH FXQ FPQ BRQ CRQ CMPL SPR FORC EM (a) IFU (b) ISU (c) LSU (d) BRU, FXU and FPU 0 2 4 6 8 ADDR DCACHE DAT DDIR FRMT STQ LRQ SLB TLB CTRL DPRF x10 3 FORC EM 0.0 0.5 1.0 1.5 2.0 BR Q BR EX BR R G PR F XU F PR F PU x10 3 FO RC E M Figure 4.4: FIT of electromigration (FIT EM ) at the unit-level. 
IFAR: instruction fetch ad- dress register, IAT: instruction address translation, BHT: branch history table, ICACHE: I-cache data arrays, IDIR: I-cache directory, BPL: branch prediction logic, IRLD: cache line reload logic, IBUF: instruction buffer, ILST: link stack register file, DISP: dispatch logic, GPMP, FPMP, and CMP: GPR, FPR, and CTR mapper, respectively, FPSH, GPSH, and CSH: shadow arrays of GPMP, FPMP, and CMP, respectively, FXQ, FPQ, BRQ, and CRQ: FXU, FPU, BRU, and CRU issue queues, respectively, CMPL: completion table, ADDR: address adder, DCACHE: D-cache data arrays, DAT: data address translation, DDIR: D-cache directory, FRMT: Id format, STQ: store queue, LRQ: load reorder queue, SLB: segment look-aside buffer, TLB: address translation look-aside buffer, CTRL: con- trol logic, DPRF: data prefetch, BRQ: branch misc. queues, BREX: branch execution logic, BRR: count and link register, GPR: general-purpose registers, FXU: fixed-point units, FPR: floating-point registers, FPU: floating-point units. 57 0 10 20 30 40 50 IFAR IAT BHT ICACHE IDIR BPL IRLD IBUF ILST x10 6 FORC NBTI logic & w ires arrays & regf iles 0 1 2 3 DISP GPMP FPMP CMP FPSH GPSH CSH FXQ FPQ BRQ CRQ CMPL SPR x10 6 FORC NBTI logic & w ires arrays & regfiles (a) IFU (b) ISU ( c) LSU (d) BRU, FXU and FPU 0 10 20 30 40 ADDR DCACHE DAT DDIR FRMT STQ LRQ SLB TLB CTRL DPRF x10 6 FORC NBTI logic & w ires arrays & regfiles 0 5 10 15 20 25 BRQ BREX BRR GPR FXU FPR FPU x10 6 FORC NBTI logic & w ires arrays & regfiles Figure 4.5: FIT of NBTI (FIT NBT I ) at the unit-level, broken down into array and register file structures versus logic and wire structures. See the caption of Figure 4.4 for the name of the functions composing each unit. FIT NBTI . Similarly, FPU has higher FIT NBTI than ISU or FXU due to its large number of gates along the timing critical paths. The FIT of TDDB (FIT TDDB ) is mostly determined by effective defect density rather than other factors such as duty cycle because many of the NFET and PFET devices are paired so that only one of them is under stress at any time, leading to the duty cycle having less impact. As a result, FIT TDDB follows the area trend relatively well as shown 58 in Figure 4.3(c). As for NBTI, arrays have higher FIT TDDB than register files due to higher cell density. This is the main reason that array structures cause around 70% of the FIT TDDB of IFU and LSU, whereas other units have lower FIT TDDB , contributed mostly by register files. Despite large logic and wire area (more than two times larger than that of other units as shown in Figure 4.3(d)), ISU has lower FIT TDDB because it is implemented with a large number of latches that have lower effective defect density than logic gates. A more detailed analysis of each unit is given in Figure 4.6. 4.2.2 Chip-LevelReliabilityAnalysis Figures 4.7-4.9 show the FIT density distribution (i.e., FITs per 0.5mm×0.5mm cell on the grid of the floorplaninFigure4.2)overthe simulated quad-core processor chip for the first 10 msec of running Barnes for the three failure mechanisms. The x- and y-axes are aligned to the floorplan: (0,0) is the upper left corner and (15,15) is the lower right corner of the floorplan. Figure 4.7 shows that FIT EM is strongly affected by activity factor. Thus, the master core (the lower right core in Figure 4.2) has a much higher failure rate due to EM than the other three slave cores while initiating and dispatching tasks to the slave cores. 
In Figure 4.8, while the master core has a higher FIT_NBTI than the slave cores, the difference is much smaller than that for FIT_EM, because FIT_NBTI is affected by duty cycle rather than by activity factor. For the same reason, the L2 caches have a higher FIT_NBTI despite having lower activity. The FIT_TDDB density distribution in Figure 4.9 is almost symmetric across the four cores, although the activities of the cores are very different on the chosen interval of the benchmark. This is because the activity factor or duty cycle is insignificant for FIT_TDDB, as discussed above. Instead, FIT_TDDB is mainly affected by the effective defect density. This is the main reason for the higher FIT_TDDB rate in the L2 caches.

Figure 4.6: FIT of TDDB (FIT_TDDB) at the unit level, broken down into array and register file structures versus logic and wire structures: (a) IFU, (b) ISU, (c) LSU, and (d) BRU, FXU and FPU. See the caption of Figure 4.4 for the names of the functions composing each unit.

Figure 4.7: FIT_EM density distribution over the simulated quad-core processor chip for the first 10 msec of running Barnes, i.e., FITs per 0.5mm×0.5mm cell on the grid of the floorplan in Figure 4.2. The x- and y-axes are aligned to the floorplan: (0,0) is the upper left corner and (15,15) is the lower right corner of the floorplan.

Figure 4.10(a) shows that only the arrays and register files of the master core are initially affected by EM. As we proceed to the next set of intervals, as shown in Figure 4.10(b)-(d), the other three cores have greater activity, and FIT_EM becomes similar across the four cores. Despite less difference among the time intervals, failure rates due to NBTI or TDDB also even out across the four cores as the application proceeds. Finally, the FIT rate of the quad-core processor chip running other SPLASH benchmark programs, such as Cholesky, FFT and Ocean, is given in Figure 4.11 for the time interval of the fifth 10 msec.

Figure 4.8: FIT_NBTI density distribution over the simulated quad-core processor chip for the first 10 msec of running Barnes, shown as (a) total FITs, (b) array and register file structures, and (c) logic and wire structures. That is, FITs per 0.5mm×0.5mm cell on the grid of the floorplan in Figure 4.2. The x- and y-axes are aligned to the floorplan: (0,0) is the upper left corner and (15,15) is the lower right corner of the floorplan.

Figure 4.9: FIT_TDDB density distribution over the simulated quad-core processor chip for the first 10 msec of running Barnes, shown as (a) total FITs, (b) array and register file structures, and (c) logic and wire structures. That is, FITs per 0.5mm×0.5mm cell on the grid of the floorplan in Figure 4.2.
The x- and y-axes are aligned to the floorplan: (0,0) is the upper left corner and (15,15) is the lower right corner of the floorplan.

Figure 4.10: FIT density distribution of the simulated quad-core processor chip running Barnes, captured in 10 msec time intervals: (a) the second 10 msec, (b) the third 10 msec, and (c) the fourth 10 msec, each showing total FIT_EM, total FIT_NBTI, and total FIT_TDDB. The x- and y-axes are aligned to the floorplan: (0,0) is the upper left corner and (15,15) is the lower right corner of the floorplan.

Figure 4.11: FIT density distribution of the simulated quad-core processor chip running SPLASH benchmark programs, namely (a) Cholesky, (b) FFT, and (c) Ocean, for the time interval of the fifth 10 msec, each showing total FIT_EM, total FIT_NBTI, and total FIT_TDDB. The x- and y-axes are aligned to the floorplan: (0,0) is the upper left corner and (15,15) is the lower right corner of the floorplan.

4.3 Discussion

We defined FORC for the three failure mechanisms in Section 3.2 and derived the failure rates of structures in terms of FORC in Section 3.3. In this section, we discuss issues that arise when computing the total failure rate of a microarchitecture composed of multiple structures and across various failure mechanisms.

The proposed FORC-based reliability modeling framework allows reliability analysis for microarchitectures to be conducted at any stage of the design and development process. At the early stages of design, detailed technology/environmental parameters are unavailable. Using the FORC concept, we can estimate the lifetime reliability of microarchitecture alternatives independent of lower-level parameters. As more parameters become available, analysis using the FORC concept becomes more accurate.

One straightforward method to compute the total FIT rate is to sum the FITs of the structures composing the microarchitecture in terms of FORC. Since technology/environment-dependent parameters are encapsulated into the FORC primitive, this method is especially useful at the early stage of processor development, when such parameters are typically unavailable. However, using the same value of FORC across all the evaluated structures implies that the structures have the same temperature, since FORC is a function of temperature, as shown in Equations 3.1, 3.2, and 3.3.
As more technological and environmental parameters, especially temperature, become available, the FORC value can be differentiated among the structures. A more accurate total FIT rate can then be computed by adding the FITs of the structures in terms of FORC and multiplying them by the values of FORC calculated at the corresponding temperatures. It can be convenient to normalize this total FIT rate with FORC at ambient temperature (25°C), as all parameters other than temperature in the FORC equations can be encapsulated into FORC at ambient temperature.

To combine failure rates across different failure mechanisms, a one-time quantification of the FORCs for the failure mechanisms needs to be done for a given technology and implementation style. Total failure rates can then be computed by adding the FITs of the microarchitecture in terms of FORC for each failure mechanism as described above, and multiplying them by the value of FORC for the corresponding failure mechanism, assuming the SOFR model.

In addition, our reliability model has yet to account for the impact of microarchitectural features that improve reliability, such as redundancy, due to the limitations of the SOFR model used. The SOFR model is widely used to compute the failure rate in FIT of structures composed of multiple components because of its simplicity, where components can be devices, microarchitecture structures, units or chips. The SOFR model is used in this chapter to combine the failure rates of individual devices in the formulas for the failure rate of microarchitecture structures in Section 3.3. However, it has a few underlying assumptions: 1) the components composing the microarchitecture fail independently of each other, 2) the first component failure causes the entire microarchitecture to fail, and 3) the failure rate of the components is constant over time, i.e., their lifetime or time-to-failure distributions follow an exponential function. These assumptions may not be true in reality. Microarchitectures employing redundancy, such as sparing and error correction coding, are able to operate in the presence of failures, violating the second assumption. Finally, the exponential distribution does not represent the wearout period of chip lifetime well, since the failure rate during this time increases over time. In the following chapter, we no longer assume the SOFR model and, instead, use a model that more realistically represents redundant systems. Using it, we analyze the impact of redundancy on the lifetime reliability of cache memory architectures, which are observed in Section 4.2 to be the structures most vulnerable across the chip to the failure mechanisms under study.

4.4 Summary

In this chapter, we demonstrate the methodology of analyzing processor chip reliability by using our FORC-based framework, along with a cycle-accurate architecture simulator. The impact of microarchitectural features that enhance chip lifetime reliability, such as redundancy, needs to be carefully modeled to allow the exploration of area, power and performance trade-offs. To do so, the underlying assumptions imposed by the SOFR model need to be relaxed in such a way that the failure rates of redundant components can be effectively combined. In the following chapter, we describe the details of modeling the lifetime reliability of redundant systems. Finally, the impact of technology scaling on chip lifetime reliability can be revisited by using our FORC-based reliability model.
Chapter 5

Lifetime Reliability Evaluation Framework for Redundant Systems

Microarchitectural redundancy has been a commonly used technique for improving the lifetime reliability as well as the yield of processor systems. When applied to microprocessors, chips can maintain operability in the presence of defects or failures by detecting and isolating, correcting and/or replacing microarchitecture components. In this chapter, we propose a comprehensive framework for analyzing the lifetime reliability of redundant systems, which effectively puts low-level effects and architecture-level effects together. In Section 5.1, we describe lifetime reliability models for systems employing redundancy in different ways. We also present the criticality of the underlying lifetime distribution models of system components in estimating system lifetime reliability, considering various amounts and types of redundancy. Then, we apply the framework to SRAM architectures in Section 5.2. For accurate reliability analysis, we propose a new methodology for modeling the distribution of SRAM cell lifetimes with respect to NBTI. This is followed by the analysis methodology for the lifetime reliability of cache SRAM memory systems with various redundancy techniques. Finally, we summarize the chapter in Section 5.3.

5.1 Lifetime Reliability Models for Generic Redundant Systems

In this section, we review basic concepts and terms used for lifetime reliability and describe lifetime reliability models for systems employing redundancy.

Lifetime distribution models are analytical tools used to describe the collection of lifetimes, or times to failure, for a random sample of components manufactured within particular design, material and process parameters [1]. These models are generally described as probability density functions, f(t), defined over a range of time t from 0 to infinity. The corresponding cumulative distribution functions, F(t), are also commonly used lifetime functions, giving the probability that a randomly selected component fails by time t.

Lifetime distribution models can alternatively be described by a survival or reliability function, R(t), which is the probability that a component survives beyond time t. Since a component either fails or survives, the lifetime function and the reliability function are complementary: R(t) = 1 - F(t). By integrating the reliability function, lifetime distribution models can be quantified in terms of mean time to failure (MTTF): $MTTF = \int_0^{\infty} R(t)\,dt$ [57].

The lifetime distribution model of a system consisting of multiple components can be derived by properly combining the lifetime or reliability functions of the various components. The reliability function of a system consisting of n components composed in series (i.e., no redundancy) is

$R_{series}(t) = \prod_{i=1}^{n} R_i(t)$,  (5.1)

assuming that the components fail independently of one another and that R_i(t) is the reliability function of the ith component [1]. This series model implies that all of the system components must survive for the system to survive. If all components have the same reliability function R(t), Equation 5.1 simplifies to R_series(t) = (R(t))^n.

While systems having no redundancy fail once the first component fails, systems having redundancy can survive in the presence of some number of failures by deactivating faulty components and/or replacing faulty components with non-faulty ones. This can significantly extend system lifetime [8][62].
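As a minimal numerical sketch of these definitions, the fragment below integrates an assumed lognormal reliability function to obtain MTTF and composes a ten-component series system per Equation 5.1; the distribution parameters are illustrative only:

```python
import numpy as np
from scipy.stats import lognorm

# Illustrative lognormal component lifetime (scale and shape assumed).
t = np.linspace(1e-3, 1e3, 100_000)          # time grid, arbitrary units
R = lognorm.sf(t, s=0.5, scale=100.0)        # R(t) = 1 - F(t)

mttf_component = np.trapz(R, t)              # MTTF = integral of R(t) dt

# Series system of n = 10 identical, independent components (Eq. 5.1).
R_series = R ** 10
mttf_series = np.trapz(R_series, t)
```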
Given these benefits, most microprocessor systems employ various redundancy techniques to improve lifetime reliability [32][37][70].

In general, the lifetime distribution model of systems employing redundancy is derived using k-out-of-n models, in which the system consists of n components and survives as long as at least k components are non-faulty [1][57]. Furthermore, redundant components are distinguished as warm or cold, depending on when they are powered on and start aging. Warm redundant components are powered on at system deployment and are thus subject to wearout in the same way as the non-redundant (original) set of components. Cold redundant components are powered off or power gated at system deployment and thus suffer no wearout effects until put into use.

There are several ways to implement k-out-of-n redundant systems, which can tolerate n - k failures. One is to replace faulty components with n - k redundant (or spare) components such that the effective system size remains the same over time until the system fails. If the wearout of spares is suspended by putting them in a special standby mode until activated (i.e., cold sparing), the lifetime distribution of the system can be derived using the cold k-out-of-n model discussed below. While cold sparing can extend the lifetime of spare components, implementing the standby mode (e.g., with power gating) is not always affordable due to increased area overhead and design complexity. In this case, spares age even before being put to use. The lifetime distribution of a system employing warm sparing can be derived using the warm k-out-of-n model discussed below. In the cold or warm k-out-of-n model for sparing, k is determined by the effective system size and n - k is the number of supplied spares.

In addition to sparing, k-out-of-n redundant systems can tolerate failures by deactivating at most n - k components that fail, a technique called graceful performance degradation. With graceful performance degradation, the effective system size diminishes over time until the system fails. Since all components are activated at system deployment, the lifetime distribution of systems employing graceful performance degradation can be derived using the warm k-out-of-n model discussed below, in which k is the minimum number of components needed for operation and n is determined by the initial system size.

In the following subsections, we describe the cold and warm k-out-of-n models, which are applicable to any type of component or system with lifetime described by a lifetime function or reliability function. For simplicity, components are assumed to have identical lifetime distributions, and component failures are assumed to occur independently of one another.

5.1.1 Warm k-out-of-n Systems

As discussed above, due to redundancy, k-out-of-n systems survive as long as at least k components are non-faulty. For warm k-out-of-n systems, the reliability function can be described simply by the following, as all components remain active from the beginning until failing:

$R_{warm}(t) = \sum_{i=k}^{n} \binom{n}{i} (R(t))^i (1 - R(t))^{n-i}$,  (5.2)

where R(t) is the reliability function of the system components. If k = 1, Equation 5.2 becomes F_warm(t) = (F(t))^n, i.e., the redundant system fails only if all components fail.
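Equation 5.2 translates directly into code. The sketch below evaluates the warm k-out-of-n reliability on a time grid, again with an assumed lognormal component lifetime:

```python
import numpy as np
from math import comb
from scipy.stats import lognorm

t = np.linspace(1e-3, 1e3, 100_000)
R = lognorm.sf(t, s=0.5, scale=100.0)        # component reliability R(t)

def r_warm(R, k, n):
    # Equation 5.2: at least k of n always-active components survive.
    return sum(comb(n, i) * R**i * (1.0 - R)**(n - i) for i in range(k, n + 1))

# Example: ten original components plus two warm spares.
mttf_warm = np.trapz(r_warm(R, k=10, n=12), t)
```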
5.1.2 Cold k-out-of-n Systems

The derivation of the lifetime distribution model of cold k-out-of-n systems is more complicated, as spares are in standby mode and can be used in various ways. Figure 5.1 illustrates a few examples of spares being used in the cold 5-out-of-7 model. Here, the system consists of seven components, two of which are cold spares. The system survives if no more than two spares are needed to replace faulty ones (darker shaded in the figure), which can be any of the five original components or even one of the spares activated previously to replace an original component.

Figure 5.1: An example of the cold 5-out-of-7 model in which a system consists of seven components and two of them are spares in standby mode. The system survives if no more than two spares are needed to replace faulty components (darker shaded box).

Figure 5.2: Recursive tree representing the possible combinations of exactly two spares being used in the cold 5-out-of-7 model. W(i,j) denotes P{W_i(t) = j}, the probability that exactly j cold spares are used for one or more of the positions 1 to i by time t. N(j) denotes P{N(t) = j}, the probability that j cold spares are used for a certain component position by time t. Since the branches represent the probabilities of non-overlapping events, P{W_i(t) = j} is the sum of the products along the branches.

For cold k-out-of-n systems, all possible combinations of the use of spares to replace faulty components must be taken into account when deriving the reliability function. This can be done by using a recursive algorithm [65]. First, we number the "positions" of the five original components, as shown in Figure 5.1. Then, let P{W_i(t) = j} be the probability that exactly j cold spares are used for one or more of the positions 1 to i by time t, where 1 ≤ i ≤ k and 0 ≤ j ≤ n - k. In addition, let P{N(t) = j} be the probability that j cold spares are used for a certain component position due to failure by time t, which implies P{N(t) = 0} = R(t) and P{N(t) = j} = f(t) ⊗ P{N(t) = j-1}, where 1 ≤ j ≤ n - k. As a result, P{W_i(t) = 0} = [P{N(t) = 0}]^i and P{W_1(t) = j} = P{N(t) = j}. If 1 < i ≤ k and 0 < j ≤ n - k, P{W_i(t) = j} can be recursively derived as illustrated in Figure 5.2.

In Figure 5.2, the branches of the recursive tree represent the non-overlapping cases of P{W_5(t) = 2}, i.e., exactly two spares used for the five positions, and a branch ends if i = 1 or j = 0 in P{W_i(t) = j}. The two spares can be used in the following ways: 1) none for positions 1 to 4 and both for position 5 (the left uppermost branch); 2) one for positions 1 to 4 and the other for position 5 (the middle uppermost branch); and 3) both for positions 1 to 4 and none for position 5 (the right uppermost branch). The left uppermost branch ends because j = 0, while the other uppermost branches, P{W_4(t) = 1} and P{W_4(t) = 2}, continue to split in a similar manner. When the recursive tree is completed, P{W_i(t) = j} is the sum of the products along the branches:

$P\{W_i(t) = j\} = \sum_{r=0}^{j} P\{W_{i-1}(t) = r\} \cdot P\{N(t) = j - r\}$.

The probability that the cold k-out-of-n system survives by time t is the same as the probability that no more than n - k spares are needed for the positions 1 to k by time t.
For instance, a 5-out-of-7 redundant system survives if zero, one or two spares are needed to replace components in positions 1 to 5 due to failure. Since these three events are non-overlapping, the probability that the system survives by time t is the sum of the probabilities P{W_5(t) = 0}, P{W_5(t) = 1} and P{W_5(t) = 2}. As a result, the reliability function of cold k-out-of-n redundant systems is

$R_{cold}(t) = \sum_{r=0}^{n-k} P\{W_k(t) = r\}$.  (5.3)

In the simple case of a system having only one component and n-1 spares (i.e., the cold 1-out-of-n model), system lifetime is the sum of n identically distributed random lifetimes, say F(t), which can be computed using convolution formulas. That is, the lifetime function of a cold 1-out-of-n system, F_n(t), can be described by the following:

$F_n(t) = f(t) \otimes F_{n-1}(t)$,  (5.4)

where $x(t) \otimes y(t) = \int_0^t x(t-u) \cdot y(u)\,du$ and F_1(t) = F(t).

5.1.3 Impact of Component Lifetime Distributions on System Lifetime

Two commonly used lifetime distribution models are the exponential and lognormal distribution models. Table 5.1 lists the probability density function (PDF), cumulative distribution function (CDF), mean, and failure rate of the exponential and lognormal distribution models. The failure rate, h(t), is defined as the conditional probability that the survivors until time t fail during the next instant of time ∆t: h(t) = f(t)/R(t).

Table 5.1: Exponential and lognormal distribution models. The failure rate h(t) is defined as the probability that the survivors until time t fail during the next instant of time ∆t: h(t) = f(t)/R(t). In the CDF of the lognormal, Φ denotes the CDF of the standard normal distribution.

                     Exponential                  Lognormal
PDF f(t)             $\lambda e^{-\lambda t}$     $\frac{1}{t\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{\ln t - \mu}{\sigma}\right)^2}$
CDF F(t)             $1 - e^{-\lambda t}$         $\Phi\left(\frac{\ln t - \mu}{\sigma}\right)$
Mean                 $1/\lambda$                  $e^{\mu + \sigma^2/2}$
Failure rate h(t)    $\lambda$                    $\frac{f(t)}{1 - \Phi\left(\frac{\ln t - \mu}{\sigma}\right)}$

Figure 5.3: Cumulative exponential and lognormal (shape parameter values of 0.5, 0.2 and 0.01) distribution functions in terms of time t in arbitrary units. The four functions have the same mean.

The exponential distribution model is a memoryless random distribution which models the time between independent events occurring at a constant average rate [57]. When it represents the distribution of component lifetimes, system components have a constant failure rate λ (the reciprocal of the mean) over time, and the MTTF of the system is the reciprocal of the sum of the components' failure rates. This simplicity results in the exponential model being widely used for lifetime reliability analysis [1][44]. However, the constant failure rate limits the accuracy of the exponential model in representing realistic component lifetime distributions, as the failure rate due to wearout tends to increase over time [1][44].

The lognormal distribution model represents component lifetimes whose natural logarithm has a normal distribution. The parameters µ and σ, the median and standard deviation of the underlying normal distribution, are also called the scale and shape parameters, respectively. By using these two parameters, the lognormal model can fit a wide range of lifetime distributions, in particular, empirical data of many semiconductor wearout processes such as corrosion, diffusion, migration, crack growth and chemical reactions [1].
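The contrast between the two models in Table 5.1 is easy to see numerically: matched to the same mean, the exponential failure rate is flat while the lognormal failure rate climbs through the wearout region. A short sketch with illustrative parameters:

```python
import numpy as np
from scipy.stats import expon, lognorm

mean, sigma = 100.0, 0.5
scale = mean / np.exp(sigma**2 / 2)      # choose scale so the lognormal mean matches

t = np.linspace(1.0, 400.0, 400)
h_exp = expon.pdf(t, scale=mean) / expon.sf(t, scale=mean)
h_log = lognorm.pdf(t, s=sigma, scale=scale) / lognorm.sf(t, s=sigma, scale=scale)
# h_exp equals 1/mean everywhere; h_log is non-constant and rises
# steeply through the wearout region of interest.
```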
Figure 5.3 illustrates the cumulative distribution function of an exponential distribution and three lognormal distributions representing various distributions of component lifetimes. Component lifetimes are more widely distributed in the exponential model than in the lognormal models. Among the lognormal models, smaller shape parameters give a higher likelihood that, once one component fails, the others will shortly fail. In the remainder of this subsection, we discuss how this affects lifetime reliability analysis.

Figures 5.4 and 5.5 show the impact of lifetime distribution models on estimating the lifetime reliability of systems composed of different numbers of components and various types of redundancy. In quantifying system lifetime reliability estimates for comparison, we derive the reliability function of the systems using Equations 5.1-5.3, and then calculate MTTF by integrating the reliability function over time from 0 to infinity. Results are shown in Figure 5.4, where the MTTF of the systems is normalized to that of the system composed of one component in (a) and to that of the system composed of ten components with no redundancy in (b).

Figure 5.4: Impact of lifetime distribution models on system lifetime reliability. In (a), system size increases from a single component to ten, none of which are redundant. In (b), graceful performance degradation enables the system to operate even if some components fail (i.e., the warm 10-out-of-n model, where 10 ≤ n ≤ 20).

Figure 5.5: Impact of lifetime distribution models on system lifetime reliability. In (a) and (b), up to ten cold or warm spares are employed along with ten original components (i.e., the cold or warm 10-out-of-n model, where 10 ≤ n ≤ 20).

Figure 5.4(a) shows the impact of system size on lifetime reliability estimates for a system that does not employ redundancy. Overall, system lifetime reliability diminishes as the number of components increases. However, the estimated degree of the degradation varies, depending on the underlying lifetime distribution model. Using the exponential model, system lifetime in MTTF is proportional to the reciprocal of the number of components due to the constant failure rate, which causes reliability analysis to be unrealistically pessimistic. For instance, the MTTF of a system consisting of ten components is estimated to be one tenth that of a system consisting of a single component, which is overly pessimistic in comparison with the lognormal cases.

Using the lognormal model with a shape parameter value of 0.01, the MTTF of the ten-component system decreases by only 1.3% compared to that of the single-component system. This is because components tend to fail soon after other similar components start to fail.
In the lognormal model with a shape parameter value of 0.5 or 0.2, the MTTF of the ten-component system decreases by about 58% or 28%, respectively, compared to the single-component system. While the lognormal models may represent more realistic lifetime distributions, there are still significant differences in reliability estimates when using different shape parameter values. If the wrong value is used, estimates can be highly inaccurate, as shown in Section 5.2.

Figures 5.5(a) and (b) show the impact of the cold and warm sparing techniques on system lifetime reliability, assuming a redundant system consisting of ten original components and zero to ten spares (i.e., the 10-out-of-n model, where 10 ≤ n ≤ 20). For the exponential model, system lifetime in MTTF increases almost linearly with the number of cold spares, while warm sparing improves system lifetime reliability sublinearly, from 90% with one spare to a factor of 7.7 with ten spares. This is because of the premature wearout of spares before they are put into use to replace faulty original components. Here, estimates using the exponential model for cold and warm sparing systems are unrealistically optimistic.

Using the lognormal model with a shape parameter value of 0.01, estimates for cold and warm sparing hardly show improved system lifetime. Only in the case of a full set of cold spares is an improvement of about 100% in system lifetime estimated. Using the lognormal model with a shape parameter value of 0.5, cold and warm sparing with one to ten spares are estimated to improve system lifetime by about 30% to a factor of 3.8 and by 25% to a factor of 2.2, respectively. Using the lognormal model with a shape parameter value of 0.2, cold and warm sparing with one to ten spares are estimated to improve system lifetime by about 12% to a factor of 2.3 and by 10% to 37%, respectively. As for the reliability analysis of systems having no redundancy, the lifetime enhancement owing to the sparing techniques varies depending on the value used for the shape parameter.

Similar observations can be made in Figure 5.4(b), which shows system lifetime enhancement owing to graceful performance degradation. Here, the system is assumed to consist of twenty components initially activated and survives if no more than ten are faulty (i.e., the warm 10-out-of-n model, where 10 ≤ n ≤ 20). Using the exponential model, graceful performance degradation improves system lifetime by a factor of 28 if the system is able to operate with at least one non-faulty component. Using the lognormal model with a shape parameter value of 0.01, graceful performance degradation hardly improves system lifetime. Using the lognormal model with a shape parameter value of 0.5 or 0.2, it improves system lifetime by 30% or 11% if at most one failure in the system can be tolerated, and by a factor of 4.7 or by 85% if the system can operate with at least ten non-faulty components, respectively.

As discussed thus far, the underlying lifetime distribution model causes inaccuracy or significant differences in lifetime reliability analysis results. This is observed even more clearly in the reliability analysis of larger systems, as discussed in Section 5.2. Thus, it is important to develop lifetime distribution models which more effectively represent the distribution of component lifetimes due to wearout failure mechanisms.
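The sparing trends above can be reproduced with the recursion of Section 5.1.2. The sketch below evaluates P{N(t) = j} by discrete convolution on a uniform time grid, applies the recursion for P{W_i(t) = j}, and integrates Equation 5.3 for MTTF; the lognormal parameters are illustrative:

```python
import numpy as np
from scipy.stats import lognorm

t = np.linspace(0.0, 1000.0, 4_000)
dt = t[1] - t[0]
f = lognorm.pdf(t, s=0.5, scale=100.0)       # component PDF f(t)
R = lognorm.sf(t, s=0.5, scale=100.0)        # component reliability R(t)

def conv(x, y):
    # x(t) (x) y(t) = integral from 0 to t of x(t-u) y(u) du on the grid
    return np.convolve(x, y)[: len(t)] * dt

def r_cold(k, n):
    N = [R]                                  # P{N(t)=j}: j spares used at one position
    for _ in range(n - k):
        N.append(conv(f, N[-1]))
    W = list(N)                              # i = 1: P{W_1(t)=j} = P{N(t)=j}
    for _ in range(k - 1):                   # recurse over positions i = 2..k
        W = [sum(W[r] * N[j - r] for r in range(j + 1)) for j in range(n - k + 1)]
    return sum(W)                            # Equation 5.3

# Ten original components with two cold spares vs. no redundancy.
mttf_cold = np.trapz(r_cold(10, 12), t)
mttf_none = np.trapz(R ** 10, t)
```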
5.2 Evaluating the Lifetime Reliability of Redundant SRAM Arrays

In this section, we analyze the impact of various redundancy techniques, such as ECC (error correction code), component sparing and graceful performance degradation, on the lifetime reliability of cache memory systems using the redundancy models given in Section 5.1. To do so, we derive the lifetime distribution of SRAM cells (the basic component of SRAM arrays) with respect to NBTI and that of SRAM arrays employing the redundancy techniques. This is followed by quantifying the lifetime reliability enhancement provided by the redundancy techniques.

Figure 5.6: 6T SRAM cell consisting of two PFET (P_L and P_R) and two NFET devices holding the cell state, and two NFET pass transistors.

5.2.1 NBTI-Induced SRAM Cell Lifetime Distributions

The threshold voltage of the devices composing SRAM cells is one of the parameters that determine the robustness of the cells during read and write operations, often called cell stability [24]. As the threshold voltage of the PFETs of SRAM cells increases due to NBTI, the strength of the devices of the cells becomes unbalanced, which causes cell stability to diminish over time and, eventually, the cell state to flip during read or write operations [3][24][27]. As a result, we take into account the degradation of cell stability due to NBTI over time in deriving the lifetime distribution model of SRAM cells.

Illustrated in Figure 5.6 is a 6T SRAM cell which consists of two PFET (P_L and P_R) and two NFET devices holding the cell state, and two NFET pass transistors. Depending on the cell state, either of the two PFET devices is subject to NBTI stress and the other is subject to NBTI recovery. If the cell stores 0, P_L is under stress and P_R is not; otherwise, if the cell stores 1, P_R is under stress and P_L is not. Thus, the amount of increase in the threshold voltage of the PFET devices is strongly dependent on the pattern of the state which the SRAM cells store over time.

Figure 5.7: Derived lifetime function of an SRAM cell, a 4KB array and a 256KB cache with respect to NBTI.

Let r_h (0 ≤ r_h ≤ 1) be the ratio of time that a cell stores 1 over a given period of time, and let f_d(r_h) be the probability density function of the distribution of r_h. Note that f_d(r_h) may be affected by higher-level array architectures and applied workloads. Then, the duty cycle of P_R due to NBTI is r_h and that of P_L is 1 - r_h. Given the duty cycle of the PFETs, we can compute the NBTI-induced threshold voltage increase by using f_dVT(d,t). Since the PFET most affected by NBTI causes failure of a cell, the increased threshold voltage affecting cell lifetime is f_dVT(max(r_h, 1 - r_h), t).

In quantifying the reduced cell stability due to the increased threshold voltage of the PFETs of cells, we use Rambo, an IBM Monte Carlo simulation-based tool that measures cell stability against parameter variations such as channel length, width and threshold voltage [27]. Let f_dσ(∆V_T) be the function which converts a threshold voltage shift to a degradation of cell stability in σ. Then, the degradation in cell stability due to NBTI until time t with given r_h is f_dσ_NBTI(r_h, t) = f_dσ(f_dVT(max(r_h, 1 - r_h), t)).
Next, we combine the NBTI-induced cell stability loss f_dσ_NBTI(r_h, t) with the deviation of cell stability due to process variation, which is generally represented with a standard normal distribution [3][13][27]. The center of the cell stability distribution corresponds to SRAM cells whose six devices have balanced strength. The further away from the center of the distribution, the more the balance of the devices' strength is reduced, causing the cell state to flip at a lower increase in threshold voltage. Let F_cs(s_loss, s_lb) be the cumulative distribution function of the probability that the state of an SRAM cell flips due to no greater than s_loss cell stability loss, given that SRAM cells whose stability is further than s_lb from the center of the cell stability distribution are weeded out during testing [24][27]. That is, s_lb indicates the quality of the deployed SRAM cells. Then, the probability that cells fail due to NBTI until time t with given r_h and s_lb is F_cs(f_dσ_NBTI(r_h, t), s_lb). Taking into account the distribution of r_h, the cumulative lifetime function of SRAM cells given s_lb is the following:

$F_{cell}(t) = \int_0^1 f_d(r_h) \cdot F_{cs}(f_{d\sigma\_NBTI}(r_h, t), s_{lb}) \, dr_h$.  (5.5)

Since an SRAM array makes up a series system consisting of array cells, the reliability function of the array is R_array(t) = (R_cell(t))^N_cells, where R_cell(t) = 1 - F_cell(t) and N_cells is the array size in bits. If a cache memory system consists of multiple SRAM arrays, the reliability function of the cache memory is R_cache(t) = (R_array(t))^N_arrays, where N_arrays is the number of arrays composing the cache memory.

Shown in Figure 5.7 is the derived lifetime function of an SRAM cell, a 4KB array and a 256KB cache (consisting of 64 arrays), assuming s_lb = 5σ, the f_dVT(d,t) given in Figure 2.1, and an f_d(r_h) of a normal distribution with a mean of 0.5 and a standard deviation of 0.1. Our exhaustive derivations with various parameters of Equation 5.5 show that the lifetime distribution of SRAM cells is well fitted by a lognormal distribution with a shape parameter value between 0.4 and 0.6 if s_lb = 5σ, and between 0.5 and 0.7 if s_lb = 4σ.
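The structure of this computation can be sketched as follows. The functions f_dvt, f_dsigma and F_cs below are crude stand-ins for the measured NBTI data (Figure 2.1) and the Rambo-derived stability curves, which are not reproduced here; only the shape of the calculation, Equation 5.5 followed by the series compositions, is meaningful:

```python
import numpy as np
from scipy.stats import norm

def f_dvt(duty, t):
    # Stand-in threshold-voltage shift: grows with duty cycle and time.
    return 0.05 * duty**0.3 * (t / 1e8) ** 0.2

def f_dsigma(dvt):
    # Stand-in conversion of a V_T shift to cell-stability loss in sigma.
    return 25.0 * dvt

def F_cs(loss, s_lb=5.0):
    # Stand-in: screened cells start with roughly s_lb sigma of margin;
    # approximate the flipped fraction by the tail beyond the remaining margin.
    return norm.sf(s_lb - loss)

r_h = np.linspace(0.0, 1.0, 201)
f_d = norm.pdf(r_h, loc=0.5, scale=0.1)      # f_d(r_h), as assumed for Figure 5.7

def F_cell(t, s_lb=5.0):
    loss = f_dsigma(f_dvt(np.maximum(r_h, 1.0 - r_h), t))
    return np.trapz(f_d * F_cs(loss, s_lb), r_h)     # Equation 5.5

R_cell = 1.0 - F_cell(1e9)                   # one time point, arbitrary units
R_array = R_cell ** (4 * 1024 * 8)           # 4KB array, N_cells in bits
R_cache = R_array ** 64                      # 256KB cache of 64 arrays
```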
5.2.2 Lifetime Reliability of SRAM Arrays with Redundancy

Illustrated in Figure 5.8 is the evaluated 256KB cache memory consisting of 64 SRAM arrays. Each 4KB array has 128 columns and 256 rows. Each cache associative way consists of eight arrays of one row, and four cache lines (distinguished by gray levels) are interleaved among the eight arrays. We evaluate the lifetime reliability of the cache memory with ECC, warm or cold sparing, and graceful performance degradation by using the redundancy models given in Table 5.2.

Figure 5.8: Evaluated 256KB cache memory consisting of 64 SRAM arrays. Each 4KB array has 128 columns and 256 rows. Each cache associative way consists of eight arrays of one row, and four cache lines (different gray levels) are interleaved among the eight arrays.

ECC is mainly used to detect and correct soft errors or transient failures [37]. Since the error correction capability may also be used to tolerate permanent failures, we evaluate the lifetime enhancement due to ECC techniques in this section. Since there is no distinction between ECC bits and data bits, the lifetime distribution model of cache memory systems employing ECC can be derived using the warm k-out-of-n model, where n is the number of ECC and data bits and k is n minus the number of correctable bits. For example, the unit of byte single-error-correction (SEC) ECC consists of 8 data bits and 3 ECC bits such that one-bit errors are correctable. Thus, the lifetime distribution model of the byte SEC ECC units can be derived using the warm 10-out-of-11 model. The unit of byte double-error-correction (DEC) ECC consists of 8 data bits and 5 ECC bits such that up to two bits of error are correctable. Thus, the lifetime distribution model of the byte DEC ECC units can be derived using the warm 11-out-of-13 model.

Table 5.2: Lifetime distribution models for cache memory systems employing redundancy techniques. In the table, N_cells denotes the cache memory size in bits; N_arrays and N_ways denote the number of SRAM arrays and associative ways composing the cache memory, respectively; N_cols and N_rows denote the number of columns and rows per array, respectively. In addition, R_unit(t) and R_cell(t) are the reliability functions of data units and SRAM cells, respectively.

Redundancy technique               Data unit   Type   k             n               R(t)
SEC ECC                            x-bit       W      x+2           x+3             R_cell(t)
DEC ECC                            x-bit       W      x+3           x+5             R_cell(t)
Column sparing (s spares)          x-col       W/C    x             x+s             (R_cell(t))^N_rows
Row sparing (a group of s spares)  array       W/C    N_rows/s      N_rows/s + 1    (R_cell(t))^(N_cols·s)
Array sparing (s spares)           cache       W/C    N_arrays      N_arrays + s    (R_cell(t))^(N_cols·N_rows)
GPD (s deactivated ways)           cache       W      N_ways - s    N_ways          (R_cell(t))^(N_cells/N_ways)

Since the cache memory comprises a series system of ECC data units (e.g., the 256KB cache consists of 256K byte-ECC data units), the reliability function of the cache memory is

$R_{cache}(t) = (R_{unit}(t))^{N_{units}}$,  (5.6)

where R_unit(t) is the reliability function of the data units and N_units is the number of units composing the cache.

For sparing techniques, the lifetime function of units sharing spares can be derived using the cold or warm k-out-of-n model, where n - k is the number of cold or warm spares. For instance, the lifetime distribution of a group of x columns with s spares can be derived with the cold x-out-of-(x+s) model if the wearout of spares (i.e., in standby mode) is effectively suspended until they are activated; otherwise, the warm x-out-of-(x+s) model can be used. Thus, the reliability function of cache memory employing column sparing can be computed using Equation 5.6, where N_units = N_arrays · N_cols / x.

While spare columns can be used individually, spare rows are assumed to be used all at once, as individual row replacement is highly impractical. Thus, the lifetime distribution of an array consisting of N_rows rows and a group of s spare rows can be derived using the cold (N_rows/s)-out-of-(N_rows/s + 1) model if the wearout of spares is effectively suspended until activated; otherwise, the warm (N_rows/s)-out-of-(N_rows/s + 1) model can be used. Thus, the reliability function of cache memory employing row sparing can be computed using Equation 5.6, where N_units = N_arrays. In a similar way, the reliability function of cache memory having s spare arrays can be derived using N_arrays-out-of-(N_arrays + s) models, cold or warm.

For graceful performance degradation (GPD), we assume that at least one failure among the eight arrays composing an associative way causes the entire way to be deactivated, because of cache line interleaving. Thus, the lifetime distribution of a cache memory consisting of N_ways associative ways can be derived using the warm (N_ways - s)-out-of-N_ways model if at most s ways are allowed to be deactivated due to failure.
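Applying these models is mechanical once R_cell(t) is known. The sketch below uses the lognormal fit reported in Section 5.2.1 (shape of about 0.5; the scale is illustrative) to compare the unprotected 256KB cache against byte SEC ECC, i.e., warm 10-out-of-11 units combined through Equation 5.6:

```python
import numpy as np
from math import comb
from scipy.stats import lognorm

t = np.linspace(1e-3, 1e3, 50_000)
R_cell = lognorm.sf(t, s=0.5, scale=100.0)   # lognormal fit, illustrative scale

def r_warm(R, k, n):
    return sum(comb(n, i) * R**i * (1.0 - R)**(n - i) for i in range(k, n + 1))

n_bits = 256 * 1024 * 8                      # 256KB cache in data bits
R_plain = R_cell ** n_bits                   # pure series system, no redundancy

R_unit = r_warm(R_cell, k=10, n=11)          # byte SEC ECC unit (8 data + 3 ECC bits)
R_sec = R_unit ** (256 * 1024)               # Equation 5.6 over 256K byte units

gain = np.trapz(R_sec, t) / np.trapz(R_plain, t)   # MTTF improvement from SEC ECC
```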
SRAM arrays generally take up a large portion of the chip and thus have a large number of devices vulnerable to NBTI. Array cells tend to hold the same value over long periods of time, which causes some devices to be under NBTI stress for a large portion of the time (i.e., to have a high duty cycle). Moreover, the degradation of cell stability caused by NBTI cannot be mitigated simply by providing sufficient delay margin at design time, as is done for the degradation of logic circuit speed. Cache SRAM arrays and NBTI-induced wearout are, thus, the focus of this work, though we believe proactive wearout recovery can be applied to other microarchitectural components and failure mechanisms.

6.1 Proactive Use of Redundancy

Redundancy is a commonly used technique for improving the lifetime reliability as well as the yield of processor systems [1][8][62]. When applied to microprocessors, chips can maintain operability in the presence of defects or failures by detecting and isolating, correcting, and/or replacing microarchitecture components reactively, on a first-come, first-served basis, after components become faulty. We refer to this as reactive use of microarchitectural redundancy for extending chip lifetime. Reactive use of redundancy allows as many failures to be tolerated as there are non-faulty redundant components. With this, non-faulty components operate either in active or standby mode. Lifetime can be extended by graceful performance degradation of the system, in which all components (including redundant ones) initially operate in active mode until failing, or by swapping into the system redundant spares that transition from standby mode to active mode when failures occur.

An alternative approach, proposed in this chapter, for extending chip lifetime is to use redundancy for the purpose of allowing components to suspend or recover from wearout well before any of them fail. We refer to this as proactive use of microarchitectural redundancy. Redundancy used proactively allows non-faulty microarchitecture components to be temporarily deactivated and later reactivated on a rotating basis to suspend and/or reverse the effects of wearout. With this, non-faulty components (including redundant ones) operate either in active mode or in recovery mode, periodically transitioning between the two modes according to a recovery schedule. This enables chip lifetime reliability to be improved by warding off the onset of wearout failures as opposed to reacting to them after the fact.

While both approaches have similar area and delay overhead for implementing redundancy, proactive use of redundancy for extending chip lifetime has several advantages over reactive use. The number of failures occurring over a given period of time (i.e., the failure rate) tends to increase rapidly over time after a certain amount of component wearout [1]. Prolonging the time before components reach this point of wearout by suspending their use can extend lifetime. For some failure mechanisms such as NBTI, the effects of wearout can be reversed during the suspended period in which stress conditions are removed (i.e., during recovery mode) [75]. Thus, proactive use of even a limited amount of redundancy can suspend or reverse component wearout. Reactive use of redundancy provides no such benefits to component wearout but, instead, provides only for as many wearout failures to be tolerated as there are redundant components, which typically is very limited.
This technique hereafter called wearout gating removes the NBTI stress condition, thus enabling the circuits to undergo wearout recovery and chip lifetime reliability to enhance. Wearout gating can also reduce leakage power, but it is less preferable to power gating for the purpose of reducing power as it has increased leakage, increased wake-up latency and original “memory” loss relative to power gating. Instead, the wearout gating circuitry in Figure 6.1(b) can be configured for power gating by turning off the PFET and footer device. As wearout gating requires only a subtle change to conventional power gating circuitry, implementing it is relatively straightforward. Design and verification tools for power gating can be used which support splitting of the power/ground networks and functionality/timing verification of mode transitions [5][14]. IntenseRecovery Recovery from NBTI-induced threshold voltage shift is faster and more effective when PFET devices are positively biased (i.e., V gs =+V dd ) rather than simply not biased (i.e., V gs =0) [29][77]. To accelerate wearout recovery, we propose a technique called intense 98 ... P ev en N ev en P odd N odd IR odd IR even WG PG NO 0V V dd V dd V dd 0V 0V V dd V dd 0V 0V V dd 0V V dd V dd 0V V dd 0V Vdd 0V 0V IR odd IR even WG PG NO Input V dd 0V - - - P even N even P odd N odd Input P ower/ wearout-gated devices V irtual V dd of ev en s tages V irtual V dd of o dd s tages Figure 6.3: Implementation of wearout recovery mode for an inverter chain. Input com- binations NO, PG, WG, IR odd and IR even are those needed for normal, power gating, wearout gating, and intense recovery (for odd and even PFETs, respectively) modes of operation. recovery in which the gate of PFET devices is charged by their driver and the source is discharged through a virtual V dd rail in order to create positive bias in the gate oxide. In the following section, the implementation of the proposed intense recovery mode is described in detail for inverter chains and SRAM arrays. 6.2.2 WearoutRecoveryAppliedtoInverterChains In this section, the proposed techniques applied to inverter chains are described with a focus of recovery mode implemented with the intense recovery technique. Figure 6.3 illustrates an implementation of the proposed intense recovery technique applied to an 99 inverter chain, where the sources of the PFETs in odd and even stages are tied to separate virtual V dd rails to provide proper voltage levels. For intense recovery mode of PFET devices in odd stages, V dd is applied at the input of the inverter chain, and the virtual V dd of even stages remains charged in order to charge the gate of the devices in odd stages. Meanwhile, the virtual V dd of odd stages is discharged through the NFET device (N odd ) to create positive bias in the gate oxide. Conversely, for intense recovery mode of PFET devices in even stages, 0V is applied to the input of the inverter chain, and the virtual V dd of odd stages remains charged in order to charge the gate of the PFETs in even stages. Meanwhile, the virtual V dd of even stages is discharged through the NFET device (N even ) to create positive bias in the gate oxide. The intense recovery technique requires virtual V dd rails to be divided in the same macro block—one for devices undergoing wearout recovery and the other for the drivers of the recovering devices—and the input of the drivers to be signaled properly. 
If these design requirements are not affordable in circuits, the wearout gating technique proposed above can be used to improve chip lifetime reliability or a sufficient delay margin must be preserved at the design phase in ordertomeettargetlifetime[31]. 6.2.3 WearoutRecoveryAppliedtoSRAMArrays In this section, the proposed techniques applied to SRAM arrays are described with a focus of recovery mode implemented with the intense recovery technique. As shown in Figure 6.4, reverse biasing is created by charging the gate of the PFET devices and 100 ... ... ... P L N L P R N R P L N L P R N R IR L IR R WG PG NO V dd 0V V dd V dd 0V V dd 0V V dd 0V 0V 0V V dd V dd V dd 0V 0V V dd V dd 0V 0V IR L IR R WG PG NO C ells 1 0 - - - bl bl bl bl C L C R C L C R V irt ual V dd of the left P FE Ts V irtual V dd of the right P FE Ts Figure 6.4: Implementation of wearout recovery mode for SRAM arrays. Input combi- nations NO, PG, WG, IR L and IR R are those needed for normal, power gating, wearout gating and intense recovery (of the left and right PFETs, respectively) modes of opera- tion. 101 discharging the source of the devices. Since each SRAM cell has cross-coupled inverters, each inverter is used to charge the gate of the PFET device of the other by storing the proper value to the cell before transitioning to recovery mode. That is, a “1” needs to be stored for recovery of the left PFET devices of the cells (C L ) while a “0” needs to be stored for recovery of the right PFET devices of the cells (C R ). The sources of the left and right PFETs need to be tied to separate virtual V dd rails in order to provide the proper voltage level. The virtual V dd of PFETs under recovery mode is discharged through the NFET device (N L or N R ) and that of the other side is charged through the PFET device (P L or P R ). In addition, discharging the virtual V dd of PFET devices of array cells results in the cell value which causes the gate of the devices to charge. The regular structure of SRAM arrays enables the separation of virtual V dd for the left and right PFETs of the cells to be relatively straightforward. The power line running vertically across array cells to supply V dd to both sides of the PFETs of cells [43] can be divided into two, one for each side, each connected to different virtual power rails as illustrated in Figure 6.4. Since most SRAM array designs already have virtual power or ground rails for cell biasing [5], this implementation of wearout recovery mode costs negligible additional area overhead. If the intense recovery mode is preferable to be conducted at finer granularities such as selective columns, virtual V dd can be further divided so that the cells in each column can enter intense recovery mode independently from those in other columns. 102 6.3 ProactiveWearoutRecoveryApproachfor ExtendingCacheSRAMLifetime In this section, we discuss modifications to cache architectures for proactive use of array redundancy, assuming round-robin recovery scheduling. The performance and lifetime reliability of the proposed cache architectures is evaluated in Chapter 7. 6.3.1 ArchitectureDesignConsiderations Proactive wearout recovery can be implemented in cache SRAM arrays in various ways based on the granularity of redundancy, recovery scheduling, and the way in which tran- sitions are made between modes of operation. It can be applied at the array level as depicted in Figure 6.4, at coarser granularities such as associative ways, or at finer gran- ularities such as selective rows or columns. 
Recovery can occur at regular time intervals, during idle times of system use, or upon alerts triggered by the user or system to optimize lifetime reliability within certain cost constraints. The recovery mode can be scheduled in an oblivious way—such as round-robin over regular time intervals—or in a way that allows more heavily stressed components which would otherwise wearout faster to undergo recovery more frequently. The latter method requires lifetime prediction by monitoring threshold voltage increase in PFET devices [26] or by using cell value or duty cycle analysis of applied workloads. In any case, recovery characteristics of the failure mechanism are factors which can 103 determine how long components should remain in recovery mode and how frequently they should transition between active and recovery modes of operation. Maintaining correctness amid transitions between active and recovery modes by com- ponents used proactively is critical, just as it is with transitioning between standby and active modes (likewise, active and faulty modes) in the reactive case. This is especially true of caches as their contents are lost once the recovery mode described in Section?? is entered into. Cache lines in shared, dirty or exclusive states must be drained properly before the cache arrays enter into recovery mode. We explore two ways of draining the arrays: cache lineinvalidation and cache linemigration. The invalidation drain mechanism is similar to cache line replacement schemes for reloads or multiprocessor cache coherence schemes for updates. For arrays entering into the recovery mode, the cache lines in dirty and/or exclusive state are forced to be written back to a lower level of the memory hierarchy using the existing write-back logic of the cache memory. In addition, invalidation requests are sent to the upper level of the hierarchy so that the upper-level cache memory invalidates the cache lines, if any, to hold the inclusion property. During the drain process, the cache lines must be locked properly to avoid race conditions with regular cache access requests, while all other cache lines are still accessible. When the drain process is completed, the arrays can enter into recovery mode and cache access requests to the invalidated lines are handled as cache misses. Alternatively, the cache lines of arrays entering into recovery mode can be migrated to some newly allocated array. For migration to occur, cache lines are locked properly, read 104 from theoriginalarray andwrittentothe newlyallocated arrayvia migrationchannels such as the cache read/write bus or dedicated links. While cache requests requiring data array access such as reads and writes should be rejected during the drain process, those requiring only directory access such as invalidation or snoop requests can be serviced, in addition to those to all other cache lines unaffectedbythedrainprocess. Whenmigration completes, the array enters into recovery mode. Cache lines are unlocked and accessed from the newly allocated array. Unlike the invalidation mechanism, the state of the cache lines remains unchanged after the drain process, causingnoextramisses. 6.3.2 ImplementingProactiveWearoutRecoveryinCacheSRAM Figure 6.5 depicts an 8-way set-associative cache consisting of 64 arrays, eight of which compose one associative way (each row in the figure) and each of which is implemented as illustrated in Figure 6.4 to enable NBTI wearout recovery. 
One additional “spare” ar- rayisalsoimplementedandusedproactively to allow any one of the 65 arrays to operate in recovery mode at any given time. This is as opposed to it being used reactively to re- place one of the 64 remaining arrays upon failure. What results is an effective cache size of 64×array size. Cache lines are assumed to be interleaved among the arrays composing an associative way (i.e., row) since compact SRAM cell design limits the bandwidth of theread/writebus. Whenaread request is received, the addressed lines of all eight ways are read and the requested line is selected by way-selecting multiplexers based on tag matching results from the directory, followed by array-selecting multiplexers to choose 105 a 18 a 17 a 12 a 11 a 28 a 27 a 22 a 21 a 78 a 77 a 72 a 71 a 88 a 87 a 82 a 81 a 0 ... ... ... ... 8 assoc iat ive ways a 8 array s per w aya W ay selec t Ar r a y s elec t Spare array ... ... Wa y 1 Wa y 2 Wa y 7 Wa y 8 a 18 a 17 a 12 a 11 a 28 a 27 a 22 a 21 a78 a77 a72 a71 a88 a87 a82 a81 a 0 ... ... ... ... 8 ass ociat ive way sa 8 array s per w aya W a y se le c t A r r a y sele c t Spare array ... ... Wa y 1 Wa y 2 Wa y 7 Wa y 8 operating in rec ov er y mode (a) Invalidation -based cache SRAM (b) Migration -based cache SRAM Figure 6.5: Cache SRAM configured to support proactive use of array-level redundancy for wearout recovery. 106 data from arrays in active mode. If array-selecting multiplexers exist for each way, active arrays are chosen before the way selection. We assume that the recovery mode of arrays is scheduled round-robin, starting from the right to the left and from the top to the bottom of Figure 6.5, i.e., a 11 , a 12 , ..., a 21 , a 22 , ..., a 87 , a 88 , a 0 . In order for a 11 to enter into recovery mode, all the valid cache lines of the array must be invalidated or migrated to a 0 via dedicated links or the read/write bus of the cache (neither shown in the figure). This causes the entire way (i.e., a 11 ,..., a 18 ) to be affected by the drain process due to cache line interleaving among the arrays. Once the invalidations or migrations are completed, a 11 enters into recovery mode and is replaced by a 0 . From this point, when the cache memory receives cache access requests, a 0 , a 12 , ..., and a 18 must be accessed for way 1. If the cache line allocated to way 1 is hit, the array-selecting multiplexers map out a 11 (the one in recovery mode) and map in a 0 by selecting data straight from above for those on the left side of the array in recovery mode (i.e., a 12 , ..., a 18 , in this case) and data from the right for those on the right side (i.e., a 0 , in this case). If a cache line of other associative ways is requested, the array-selecting multiplexers select data straight from above. In order for the next array, a 12 , to enter into recovery mode, a 11 transitions out of recovery mode back into active mode. As was done with a 11 previously, all of the valid cache lines of a 12 must be invalidated or migrated to a 11 . This results in a 0 , a 11 , a 13 , ..., and a 18 holding cache lines allocated to way 1. The drain process for the rest of the 107 arrays in the first row is done in a similar way. After all the arrays of the first row have undergone recovery once, a 0 , a 11 , a 12 , ..., and a 17 hold cache lines allocated to way 1 and a 18 transitions from recovery mode back to active mode for the next array, a 21 ,to undergo recovery. 
Since a0 will replace a21 and hold cache lines allocated to way 2 (i.e., the second row) from now on, cache lines allocated to way 1 either have to be invalidated or migrated back to a11, ..., a18 to drain a0. In addition, valid cache lines of array a21 allocated to way 2 must either be invalidated or migrated to a0. Thus, transitioning to recovery mode for the first array in the next row causes cache lines allocated to two ways to be invalidated or migrated. In a similar manner, the rest of the arrays transition between modes to enable wearout recovery of all array components comprising the cache.

6.3.3 Impact on Performance and Area

The performance of cache memory architectures which proactively use redundancy for wearout recovery is affected by the drain process needed to transition components between active and recovery modes. Accesses to arrays entering recovery mode are delayed throughout the drain process. There may also be additional delay due to contention on cache resources such as cache read/write ports, cache bus bandwidth, and/or write-back related queues and logic to implement the invalidations or migrations. In addition to these delays, extra cache misses occur if invalidated lines are accessed after the drain process completes. As the time over which components transition between active and recovery modes is negligible compared to the time components stay in a mode, the negative impact of mode transitions on overall performance is nominal, as presented in Chapter 7.

For the invalidation drain mechanism, handling write-back or invalidation requests does not require additional hardware. The migration drain mechanism with dedicated links requires additional wiring between adjacent arrays and another level of multiplexing at write port data inputs. The fewer the number of bits read from an array, the smaller the overhead of implementing the links. However, lower bandwidth requires more cycles to transfer data between arrays, causing additional latency for the drain process.

The active array selection logic can reuse the steering logic and multiplexers used to isolate defective arrays for yield enhancement that exist in most modern microprocessors, thus incurring negligible additional area overhead [70]. While the array selection logic for yield enhancement permanently maps out defective arrays, ours dynamically maps arrays in and out based on the recovery schedule.

Whether the array selection logic is on the critical path or not is determined by the location of the logic, especially for caches with late select. If array selection logic exists for each associative way, the selection logic can be located before the way selection multiplexers, causing it not to lie on the critical path, as active arrays are selected for each way before tag match results arrive. If associative ways share the array selection logic, the selection logic follows the way selection multiplexers, thus lying on the critical path, as control of the array selection multiplexers depends on both tag match results and the location of the arrays in recovery mode.

6.4 Summary

We propose a proactive approach for exploiting microarchitectural redundancy in which redundancy is used to allow non-faulty components intermittently to be temporarily deactivated in order to recover from wearout well before they fail. For higher effectiveness, we also propose circuit-level techniques to implement recovery mode which exploit the recovery effect inherent to wearout failure mechanisms such as NBTI.
In the proposed wearout gating technique, an incremental change to typical power gating circuitry enables semiconductor devices to undergo wearout recovery, improving lifetime reliability by up to about 12 times and 5.5 times over power gating for PFET devices and 4KB SRAM arrays, respectively. For further lifetime extension, the proposed intense recovery technique accelerates the recovery effect by reversing the stress condition of the NBTI failure mechanism. In addition, we apply the proposed proactive wearout recovery approach to cache SRAM architectures. This is evaluated in detail in terms of lifetime reliability enhancement, performance impact and area overhead in the following chapter.

As NBTI is observed not only in silicon dioxide but also in alternative gate oxide materials such as high-κ dielectrics [45][77], the degradation of lifetime reliability due to NBTI will continue over near-future technology generations. Thus, these lifetime extension techniques are anticipated to continue to be beneficial. In addition, the fundamental ideas of the proposed techniques and designs are effective for any failure mechanism capable of wearout recovery, such as PBTI (positive bias temperature instability), although we focus on NBTI in this chapter to describe and demonstrate the ideas.

Chapter 7

Redundant Cache SRAM Lifetime Reliability Analysis

In this chapter, we evaluate the performance and lifetime reliability of the cache SRAM architecture modified to exploit microarchitectural redundancy in a reactive or proactive approach, as described in Chapters 5 and 6. We also analyze the trade-offs of lifetime reliability enhancement versus performance/area overhead and compare our proactive wearout recovery approach against conventional reactive approaches for using redundancy.

7.1 Evaluation Methodology

7.1.1 Evaluating Impact on Performance

We assume a POWER5-like processor chip [58] with an L2 cache which supports our proactive approach as described in previous sections. We use Mambo [54], an IBM proprietary full-system simulation toolset for the PowerPC architecture, to analyze the impact of the proposed proactive approach on overall system performance. We configure Mambo in such a way that the cores are similar to those used in POWER5 [58], and the L2 cache is private to each core, at 256KB per core. L2 caches, memory and I/O controllers are interconnected through a bus similar to the fabric bus controller of POWER5. The configuration of the simulated processor chip and L2 cache is given in Table 7.1.

Table 7.1: Configuration of the simulated processor chip and L2 cache structure. In the L2 cache configuration, RCQ, COQ and SNPQ indicate queues holding transactions for cache line reloads, castouts (i.e., write-backs) and snooping, respectively.

Chip configuration
  Number of POWER5-like cores: 2
  Number of threads per core: 2
  L1-I/D cache size: 32KBytes
  L1-I/D cache associativity: 4/8-way
  L3 cache size (victim): 2MBytes
  Cache line size: 128Bytes
  Main memory size: 1GBytes

L2 cache configuration
  Cache size (per core): 256KBytes
  Set associativity: 8-way
  Number of read/write ports: 2
  Bus bandwidth: 32Bytes/cycle
  Replacement policy: Tree LRU
  RCQ/COQ/SNPQ size: 8/8/4 entries
  Castout policy: Valid lines

Figure 7.1 illustrates the simulated L2 cache architecture. The L2 cache consists of four main control logic circuits and queues as well as cache arrays: reload, castout, snoop and recovery/drain (RD) machines.
The L2 cache access requests from the two processor cores are queued in the load queue (LDQ) or store queue (STQ) and serviced first-come, first-served by the reload machine. The reload machine accesses the L2 directory and data arrays and replies to the processor cores accordingly. If a cache miss occurs, a cache miss request goes to the L3 cache or FBC and replaces a cache line. If the replaced cache line has been updated, it is queued in the castout queue (COQ) to be written back, or cast out, to the L3 cache, the peer L2 cache, or main memory by the castout machine. The snoop machine handles cache access requests from the peer L2 cache via the FBC. Upon receipt of a request, the snoop machine looks up the L2 directory and, if needed, updates the cache line state and/or replies with data; such transactions are queued in the snoop queue (SNPQ). In addition to the POWER5-like L2 cache structure, we implement the recovery/drain (RD) machine to schedule recovery mode and the drain process, execute the drain mechanism, and reconfigure the array selection logic accordingly, as described in Section 6.3.

Figure 7.1: Simulated L2 cache structure, which consists of four main control logic circuits and queues as well as directory and data arrays: reload, castout, snoop and recovery/drain machines.

For workloads, we run memory-intensive macro-benchmark programs such as DAXPY and pointer chasing to stress the impact of the drain process. The working data set of the programs is as large as the L2 cache size, i.e., 256KB. We also run a subset of the SPLASH suite [71] in full-system mode of Mambo, where OS code (Linux 2.6.7 with OpenPIC 1.0) is included as part of the SPLASH workload simulated.

7.1.2 Evaluating Impact on Lifetime Reliability

In this analysis, the 256KB L2 cache with no redundancy is considered the baseline cache configuration, in which the cache consists of 64 arrays and each array has 128 columns and 256 rows. Depending on the type of redundancy evaluated, redundant components such as cell bits, columns, rows or arrays are added to the baseline configuration. For accurate evaluation of the benefits of various types of redundancy, the lifetime distribution function, a probability function describing the collection of lifetimes or times-to-failure of components, is derived for the SRAM cells and the cache memory for each reactive use of redundancy, as described in Section 5.2. For proactive use of redundancy, the lifetime distribution function can be derived using the warm k-out-of-n model. However, the wearout pattern of components changes as some of them fail, because the failed components may no longer be able to enter recovery mode. For simplicity, we conservatively assume that the cache SRAM consisting of 65 arrays fails if any one of the 65 arrays fails (i.e., a series system), although the cache could continue to operate with only 64 arrays but without wearout recovery. Once lifetime distribution functions are obtained, the lifetime reliability of the cache SRAM is quantified in terms of mean-time-to-failure (MTTF) for comparison [57].
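The series-system assumption can be illustrated with a small Monte Carlo sketch: the cache lifetime is the minimum of the 65 array lifetimes. The lognormal shape value follows Section 5.2.1, but the time scale, trial count and the direct sampling of array (rather than per-cell) lifetimes are placeholder assumptions for illustration, not the dissertation's calibrated model.

    import math
    import random

    def series_mttf(n_arrays=65, sigma=0.5, median=1.0, trials=20_000):
        """MTTF of a series system of n_arrays i.i.d. lognormal lifetimes."""
        total = 0.0
        for _ in range(trials):
            total += min(random.lognormvariate(math.log(median), sigma)
                         for _ in range(n_arrays))
        return total / trials

    random.seed(0)
    print(f"one array:        {series_mttf(n_arrays=1):.3f}")   # ~ the array MTTF
    print(f"65-array series:  {series_mttf(n_arrays=65):.3f}")  # noticeably lower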
7.2 Exploiting Redundancy for Extending Cache SRAM Lifetime

7.2.1 Reactive Use of Redundancy

Figures 7.2 and 7.3 show the impact of the redundancy techniques on cache lifetime reliability, along with area overhead measured in the number of array cells. The error bars of the derived SRAM cell lifetime distribution model indicate the difference in the MTTF of the cache memory when fitting it to a lognormal distribution with a shape parameter value in the range 0.4 to 0.7, as discussed in Section 5.2.1, compared to that with a shape parameter value of 0.5 shown with data bars. The data of the exponential distribution model are obtained by using an exponential distribution to represent SRAM cell lifetimes with the same mean as the derived ones.

Figure 7.2: Lifetime reliability enhancement and area overhead of redundancy techniques for the evaluated 256KB cache memory: (a) ECC; (b) graceful performance degradation. The error bars of the derived model indicate the difference in the normalized MTTF of the cache memory evaluated using a range for the shape parameter between 0.4 and 0.7, as discussed in Section 5.2.1, compared to that using a shape parameter value of 0.5 shown with data bars.

Figure 7.3: Lifetime reliability enhancement and area overhead of redundancy techniques for the evaluated 256KB cache memory: (a) column sparing; (b) row and array sparing. The error bars are derived as in Figure 7.2.

Figure 7.2(a) shows that cache lifetime enhancement due to ECC techniques saturates around 2× for the SEC ECC and around 3× for the DEC ECC as area overhead increases. The SEC ECC improves cache memory lifetime by a factor of 1.7 to 2.1 at a cost of 3% to 38% area overhead, while the DEC ECC improves cache memory lifetime by a factor of 2.2 to 2.9 at a cost of 3% to 63% area overhead. The error range due to the parameter variation of the failure mechanism and deployed cell quality is -14% to +29% for the SEC ECC and -20% to +45% for the DEC ECC, on average. The estimated cache lifetime enhancement due to ECC is unrealistically high, i.e., a factor of hundreds or thousands, if the exponential distribution model is used as the underlying SRAM cell lifetime distribution model.
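The over-optimism of the exponential model can be reproduced with a small Monte Carlo experiment: with SEC ECC, a group of cells survives its first cell failure and dies at the second. Tolerating that first failure gains a lot when failures spread out from time zero (exponential) and little when they cluster around the median (wearout-like lognormal). The group size, shape value and time scale below are illustrative assumptions; at the scale of the full cache, where the unprotected lifetime is the minimum over millions of cells, this per-group discrepancy compounds into the factors of hundreds or thousands noted above.

    import math
    import random

    def mttf(sample, trials=20_000):
        return sum(sample() for _ in range(trials)) / trials

    def kth_failure(draw, n, k):
        """Time of the k-th failure among n i.i.d. component lifetimes."""
        return sorted(draw() for _ in range(n))[k - 1]

    random.seed(1)
    n = 64                                   # cells covered by one SEC group
    for name, draw in [
        ("lognormal", lambda: random.lognormvariate(0.0, 0.5)),
        ("exponential", lambda: random.expovariate(1.0)),
    ]:
        no_ecc = mttf(lambda: kth_failure(draw, n, 1))   # first failure kills
        sec = mttf(lambda: kth_failure(draw, n, 2))      # second failure kills
        print(f"{name:12s} SEC gain: {sec / no_ecc:.1f}x")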
Figure 7.3(a) and (b) show the estimated cache lifetime enhancement due to sparing techniques applied at the column, row or array level using the various lifetime distribution models. Cold column sparing improves cache lifetime by 45% to 90% at a cost of 1.6% to 25% area overhead; cold row sparing improves cache lifetime by about 32% to 33% at a cost of 6% to 13% area overhead; and cold array sparing improves cache lifetime by about 9% to 10% at a cost of 1.6% to 13% area overhead. That is, the finer-grain sparing techniques improve cache lifetime to a higher degree but have higher area overhead and, possibly, design complexity. Warm sparing improves cache lifetime to a slightly lower degree than cold sparing. The error range due to the parameter variation of the failure mechanism and deployed cell quality is, on average, -12% to +25% for warm column sparing and -14% to +31% for cold column sparing; -5% to +11% for warm row sparing and -6% to +12% for cold row sparing; and -2% to +3% for warm array sparing and -2% to +4% for cold array sparing. As with ECC, the use of the exponential model results in unrealistically high lifetime enhancement estimates for the sparing techniques, i.e., a factor of tens to hundreds for column sparing, a factor of ten for row sparing, and 90% for array sparing.

Graceful performance degradation improves cache lifetime by 10% to 27%, allowing one eighth to one fourth of the cache size to be deactivated due to failure, as shown in Figure 7.2(b). While finer-grain graceful performance degradation techniques can improve cache lifetime further, they come with higher design complexity to implement the necessary reconfiguration. The error range due to the parameter variation of the failure mechanism and deployed cell quality is -3% to +7% on average. Using the exponential model, the cache lifetime reliability improvement due to graceful performance degradation is estimated at a factor of 2 to 6.

Several observations can be made regarding these results. First, the evaluated ECC techniques generally improve cache lifetime reliability to a higher degree than the evaluated sparing or graceful performance degradation techniques, regardless of the underlying lifetime distribution model used in the analysis. However, using ECC for hard errors due to wearout failures may not be preferable, as this diminishes the ability of the system to handle soft error correction and detection. Second, the exponential distribution model causes lifetime reliability analysis to be unrealistically optimistic. Finally, the characteristics of wearout failure mechanisms and the quality of deployed SRAM cells affect the lifetime distribution model of SRAM cells and, eventually, the lifetime of cache memory systems employing redundancy techniques. Thus, it is important to take into account the quality of deployed processor components and the characteristics of wearout mechanisms in deriving the lifetime distribution model of components for accurate reliability analysis.

7.2.2 Proactive Use of Redundancy

Lifetime Reliability Analysis

We first evaluate the lifetime reliability of the baseline cache configuration against which others are compared below. Because of NBTI recovery effects, the duty cycle of SRAM cells is one of the critical factors affecting lifetime reliability. As shown in Figure 7.4(a), the duty cycle is determined by the applied workload.
Since the lifetime of SRAM cells is limited by the PFETs with the higher duty cycle (i.e., the more worn-out ones), the lifetime reliability in MTTF of the cache increases as the wearout of the two PFET devices in the cells becomes more balanced, i.e., as the center of the duty cycle distributions moves toward 50% and the tails of the distributions become shorter.

Figure 7.4: (a) Duty cycle distributions of SRAM cells of the evaluated 256KB L2 cache memory. (b) Lifetime reliability of the evaluated cache memory: the MTTF (shown on the right y-axis) of the baseline cache configuration (i.e., neither redundancy nor balanced duty cycle) for each application is normalized to that of the baseline running Volrend; the lifetime reliability enhancement (shown on the left y-axis) of the cache configurations with a cell duty cycle of 50% (balanced duty cycle) and/or with one redundant array used reactively or proactively for each application is also shown.

The curve in Figure 7.4(b) plots the MTTF (shown on the right y-axis) of the cache SRAM for each application, normalized to that of Volrend. As shown, the MTTF of the cache running "LU" is about 2.6 times higher than that of the cache running "Volrend," as the cell duty cycle for "LU" is more balanced than that of "Volrend." Also shown in Figure 7.4(b) is the lifetime enhancement (shown on the left y-axis, normalized to the baseline cache configuration) for the ideal case of a balanced (50%) duty cycle applied to all applications, which is the aim of wearout mitigating techniques such as cell value flipping [2][30]. As shown, if the SRAM cell duty cycle can be ideally balanced with such techniques, about a 3× improvement in lifetime reliability can be gained for applications having poor SRAM cell duty cycle distributions, such as "Volrend." Adding the reactive use of one redundant array (i.e., "balanced duty cycle + reactive") gains only a subtle increase in lifetime reliability enhancement over balancing the cell duty cycle alone.

With one redundant array used proactively for intensive wearout recovery, in which 70% of the wearout can be recovered during the recovery mode [76], the lifetime reliability of the cache is improved by about 5.5× to 10.2× compared to the baseline case, as shown in Figure 7.4(b) (i.e., "proactive wearout recovery approach"). Across the applications, this is about three to five times higher than the balanced (50%) duty cycle cases, both with no use of redundancy and with reactive use of redundancy. When proactive wearout recovery is used in conjunction with a technique for balancing the duty cycle of SRAM cells (i.e., "balanced duty cycle + proactive"), the lifetime reliability of the cache is improved by about 6× to 16.5× compared to the baseline case and by up to 5× over the balanced duty cycle case. We observe that our proactive wearout recovery approach significantly improves the lifetime reliability of the cache SRAM across the various cell duty cycle distributions caused by applied workloads.
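The duty-cycle dependence can be made concrete with a toy model. In a 6T cell the two PFETs see complementary stress, so the cell is limited by max(d, 1 - d), where d is the fraction of time the cell stores one value. The linear-wearout assumption and the function name are ours, purely for illustration; the model deliberately ignores recovery effects and distribution tails.

    def relative_cell_lifetime(d):
        """Relative lifetime of a cell whose stored-value duty cycle is d,
        under the (illustrative) assumption that wearout grows linearly
        with the stress time of the more stressed of the two PFETs."""
        return 1.0 / max(d, 1.0 - d)

    # A balanced cell (d = 0.5) outlives a heavily skewed one (d = 0.9):
    print(relative_cell_lifetime(0.5) / relative_cell_lifetime(0.9))  # -> 1.8

This is why the ideally balanced (50%) duty cycle cases above sit well above the baseline: balancing pulls every cell toward the d = 0.5 optimum of this curve.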
In addition, the proactive approach has greater impact on improving lifetime reliability when component wearout is balanced. Thus, scheduling recovery mode to balance wearout can make the proactive approach more effective in enhancing lifetime reliability. Since scheduling recovery mode intelligently may increase design complexity, a careful analysis of the trade-offs is needed.

Analysis of Performance Impact

During the drain process, cache lines need to be locked to avoid race conditions. If cache access requests need the locked lines, the requests remain in the LDQ or STQ to be handled by the reload machine when the lines are unlocked. This causes the requests following them to be blocked and the queues to fill, which eventually results in all memory transactions being blocked and no instructions being committed. In Figure 7.5, the IPC (the number of committed instructions per cycle) of the simulated processor running DAXPY drops to zero during the drain process, more specifically during the time that a load transaction waits for one of the locked lines to be released. For the invalidation mechanism, the blocked transaction is released before the drain process is completed, because the requested cache line is invalidated and the blocked transaction is handled as an L2 cache miss. This results in an IPC between 1.1 and 2.5 for the rest of the drain process, as shown in Figure 7.5(a). On the other hand, the state of cache lines does not change in the migration mechanisms, which causes the load transaction to be blocked until the completion of the drain process, as shown in Figures 7.5(b) and (c).

Figure 7.5: IPC during the drain process for (a) the invalidation drain mechanism, (b) the migration drain mechanism with dedicated links, and (c) the migration drain mechanism with the read/write bus. The x-axis spans from the time the drain process starts to the time it completes. Under the scheme that locks entire arrays, a blocked load causes an IPC drop by blocking the memory transactions that follow; the individual locking scheme reduces the blocking time, thus eliminating the drop of IPC to zero.

To mitigate the performance penalty due to blocked transactions, we implement an individual locking scheme in which cache lines are individually locked and unlocked before and after they are drained, without waiting until the completion of the drain of the entire arrays. The individual locking scheme successfully prevents the IPC drop for both invalidation and migration mechanisms, as shown in Figure 7.5.

Figure 7.6 shows the impact of the drain process on the overall performance of two memory-intensive workloads with various time periods between two drain processes. For the invalidation drain mechanism, the blocked request is released before the drain process completes, as the requested cache line is invalidated and the blocked request is handled as an L2 cache miss. However, the state of cache lines does not change in the migration drain mechanisms, which causes the request to be blocked until the completion of the drain process.
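The contrast between the two locking schemes can be sketched as below, reusing the CacheLine objects of the earlier drain sketch; drain_one_line stands for either drain mechanism and is an assumed callback, and the scheme names are ours.

    def drain_all_at_once(array, drain_one_line):
        """Whole-array locking: every line stays locked for the full drain,
        so a request to any line of the array stalls until the drain ends,
        which is the source of the IPC drop to zero in Figure 7.5."""
        for line in array:
            line.locked = True
        for line in array:
            drain_one_line(line)
        for line in array:
            line.locked = False

    def drain_with_individual_locks(array, drain_one_line):
        """Per-line locking: a line is locked only while it is being
        drained, so at most one line is ever unavailable and requests to
        the rest of the array proceed, eliminating the IPC drop."""
        for line in array:
            line.locked = True
            drain_one_line(line)
            line.locked = False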
For DAXPY, the invalidation mechanism has less IPC loss than the migration mechanisms, as shown in Figure 7.6(a), because the blocking time during the drain process is shorter and the extra L2 cache misses caused by invalidations are mitigated by efficient prefetching of the regular memory access patterns. The migration mechanism with dedicated links has less IPC loss than the migration mechanism using the cache read/write bus, as there are fewer resource conflicts. For pointer chasing, the migration mechanisms have less IPC loss than the invalidation mechanism, as shown in Figure 7.6(b), because unpredictable memory access patterns make prefetching less beneficial. More importantly, however, the performance loss in terms of the percentage of IPC is negligible (well below 1%) for these memory-intensive applications and becomes even more insignificant as the frequency of the drain process decreases, regardless of the mechanism.

Figure 7.6: Impact of the drain process on IPC (instructions committed per cycle) loss for (a) DAXPY and (b) pointer chasing, with various time periods between two successive drain processes.

Figure 7.7 shows the impact of the drain process on system performance averaged across the SPLASH applications, where the drain process is scheduled once every million cycles and the overall performance penalty is divided by the total number of drain processes. The invalidation mechanism has more IPC loss than migration with dedicated links, but less IPC loss than the migration mechanism using the cache read/write bus. Similar to the memory-intensive workloads, the performance loss of both drain mechanisms is negligible, well below 0.001% per drain.

Figure 7.7: IPC loss of the drain process averaged across the SPLASH benchmark programs. The drain process is scheduled once every million cycles and the overall IPC loss is divided by the total number of drain processes.
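As a back-of-envelope check on why the loss is so small, the drain cost amortizes over the drain period. The per-drain cost below is a purely illustrative assumption, chosen to be consistent with the sub-0.001%-per-drain figures of Figure 7.7; halving the drain frequency roughly halves the steady-state loss.

    def steady_state_ipc_loss(cost_cycles, period_cycles):
        """Fraction of commit bandwidth lost to draining, amortized."""
        return cost_cycles / period_cycles

    # Assuming ~5 lost commit slots per drain, one drain per million cycles:
    print(f"{steady_state_ipc_loss(5, 1_000_000):.6%}")   # -> 0.000500%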
7.2.3 Comparison of Proactive and Reactive Use of Redundancy

We compare our proposed proactive approach with conventional reactive approaches for using redundancy, such as ECC, sparing, and graceful performance degradation, to quantify the benefits in terms of lifetime reliability enhancement and performance/area overhead. While ECC techniques are typically used to tolerate transient errors, we consider them as a baseline for the comparisons due to their low complexity and capability of tolerating permanent (wearout-related) failures as well. When ECC is used for transient errors, cache data can be sent to the processor(s) before error detection/correction. However, cache data having permanent errors must be corrected before being sent, which causes the ECC logic to lie on the critical path. In this analysis, one cycle is added to the cache access time for ECC logic. For sparing, the latency of reconfiguration logic is included in cache access latency at design time, thus incurring no performance penalty. For the proactive use of the redundant array, the drain process of the invalidation mechanism is scheduled once every million cycles. The measured performance penalty for the above is averaged across the simulated SPLASH applications.

Figure 7.8(a) shows the lifetime reliability enhancement versus area overhead measured in the number of SRAM cells. For ECC, data points (left to right) indicate sector (32-byte), quad-word, double-word, word (4-byte), double-byte and byte error correction codes. For column spares, data points (left to right) indicate two or four spares for every 128, 64, 32 and 16 columns. For row and array spares, data points indicate 16 (left) and 32 (right) spare rows per array, and one (left) and eight (right) spare arrays, respectively. The lifetime reliability enhancement using the various ECC and reactive uses of redundancy saturates below a factor of three even with increased area overhead. One redundant array used proactively improves the lifetime reliability of the cache SRAM by about a factor of seven, on average, for the SPLASH workloads, which is about five times higher than the lifetime reliability enhancement provided by reactive use of redundancy with the same area overhead, about 1.6% (two spare columns per 128 columns or one spare array).

Figure 7.8: Lifetime reliability enhancement vs. performance/area overhead of the evaluated 256KB L2 cache memory with various redundancy techniques: (a) lifetime enhancement versus area overhead; (b) lifetime enhancement versus performance overhead. In (a), for ECC, data points (left to right) indicate sector (32-byte), quad-word, double-word, word (4-byte), double-byte, and byte error correction codes; for column spares, data points (left to right) indicate two or four spares per 128, 64, 32, 16 and 8 columns. In (b), the lifetime reliability enhancement of each technique is shown for the best case, except for graceful performance degradation (GPD), for which the four data points indicate the cache enabled to operate with at most one disabled way (left) to four disabled ways (right).

Figure 7.8(b) shows the lifetime reliability enhancement versus IPC loss, where the lifetime enhancement of each technique is presented for the best case, except for graceful performance degradation, which shows the four cases of the cache enabled to operate with at most one disabled way (left) to four disabled ways (right). As shown, graceful performance degradation improves the lifetime reliability of the cache SRAM by about 10% to 27%, but suffers a performance loss of 6% to 13%. ECC and the proactive approach suffer less than 1% performance loss while providing lifetime reliability enhancements of up to about a factor of three and seven, respectively.

In summary, our proposed proactive wearout recovery approach for using redundancy has significantly better lifetime-performance and lifetime-area trade-offs than the compared reactive approaches.
The proactive approach is compared to reactive approaches above to quantify relative benefits. However, it should be noted that proactive and reactive approaches are orthogonal and may be used in combination to improve lifetime reliability further.

7.3 Summary

We propose a proactive approach for exploiting microarchitectural redundancy to extend cache SRAM lifetime. Our proactive approach improves the lifetime reliability of cache SRAM susceptible to NBTI failure by approximately a factor of five over conventional reactive uses of redundancy with the same area overhead. While our proactive approach with even simple recovery scheduling (i.e., round-robin over regular time intervals) significantly improves cache lifetime reliability, more sophisticated recovery scheduling can be studied to further enhance lifetime reliability. As future work, our proactive wearout recovery approach for using microarchitectural redundancy can be explored for other structures within microprocessors and other chip failure mechanisms.

Chapter 8

Conclusions and Future Work

8.1 Conclusions

This research addresses the issue of modeling chip lifetime reliability at the architecture level. We propose a framework for architecture-level lifetime reliability models and present a new concept called FIT of a reference circuit (FORC) that allows architects to quantify failure rates without having to deal with circuit- and technology-specific details of the implemented architecture. The proposed framework, along with a cycle-accurate architecture simulator, allows an accurate estimation of the failure rate of various types of microprocessor chips. Our results show that the failure rate of a quad-core processor chip is mainly contributed by array and register file structures due to the large number of effective defects. In addition, the FORC-based approach allows relative performance-reliability trade-offs to be evaluated for design decisions, especially at the early design stage.

In addition, the impact of typical microarchitectural features to enhance chip lifetime reliability is modeled by considering various amounts and types of redundancy. Our study shows that lifetime reliability analysis is strongly affected by the choice of lifetime distribution models of system components. Therefore, we propose a methodology to derive a more representative lifetime distribution model, specifically for SRAM cells and cache SRAM memory systems with conventional redundancy techniques such as ECC, sparing and graceful performance degradation, with respect to NBTI. Our evaluation results show that ECC techniques generally improve cache lifetime reliability to a higher degree than the evaluated sparing or graceful performance degradation techniques.

This research also proposes a proactive approach for exploiting microarchitectural redundancy in which redundancy is used to allow non-faulty components intermittently to be temporarily deactivated in order to recover from wearout well before they fail. In addition, we propose circuit-level techniques which exploit the recovery effect inherent to wearout failure mechanisms such as NBTI while operating components in recovery mode. Applied to cache SRAM, our proactive wearout recovery approach improves the lifetime reliability of cache SRAM susceptible to NBTI by approximately a factor of five over conventional reactive use of redundancy with the same area overhead.

8.2 Future Work

There are several related research issues that remain unexplored.
They are summarized as follows:

• While we demonstrate the lifetime reliability of one type of multicore processor in this dissertation, a wider range of microarchitectures, in terms of the number of cores, cache size and hierarchy, on-chip interconnection networks, etc., needs to be explored to find the optimal microarchitecture design for multicore processors. In addition, the impact on lifetime reliability of microarchitectural features to enhance chip lifetime, such as redundancy, needs to be carefully evaluated to find an area-, power- and performance-efficient design.

• While we focus on NBTI in this dissertation to describe and demonstrate the fundamental ideas of the proposed wearout gating and intense recovery techniques, these ideas can be further explored for other failure mechanisms which have wearout recovery effects, such as PBTI (positive bias temperature instability).

• While our proactive approach with even simple recovery scheduling (i.e., round-robin over regular time intervals) significantly improves cache lifetime reliability, more sophisticated recovery scheduling can be studied to further enhance lifetime reliability. In addition, our proactive approach for using microarchitectural redundancy can be explored for other structures within microprocessors and other wearout failure mechanisms.

• The combined impact of various failure mechanisms on chip lifetime reliability also remains to be studied.

References

[1] NIST/SEMATECH e-Handbook of Statistical Methods. http://www.itl.nist.gov/div898/handbook/, 2003.

[2] Jaume Abella, Xavier Vera, and Antonio González. Penelope: The NBTI-Aware Processor. In Proceedings of the International Symposium on Microarchitecture, pages 85–96, November 2007.

[3] Kanak Agarwal and Sani Nassif. Statistical Analysis of SRAM Cell Stability. In Proceedings of the Conference on Design Automation, pages 57–62, July 2006.

[4] Kaustav Banerjee and Amit Mehrotra. Coupled Analysis of Electromigration Reliability and Performance in ULSI Signal Nets. In Proceedings of the International Conference on Computer-Aided Design, pages 158–164, 2001.

[5] Azeez J. Bhavnagarwala, Stephen V. Kosonocky, Michael Immediato, Dan Knebel, and Anne-Marie Haen. A Pico-Joule Class, 1GHz, 32KByte x64b DSP SRAM with Self Reverse Bias. In Proceedings of the Symposium on VLSI Circuits Digest of Technical Papers, pages 251–252, June 2003.

[6] J. R. Black. Electromigration - A brief survey and some recent results. IEEE Transactions on Electron Devices, 16(4):338–347, April 1969.

[7] Shekhar Borkar. Designing reliable systems from unreliable components: the challenges of transistor variability and degradation. IEEE Micro, 25(6):10–16, November 2005.

[8] Douglas C. Bossen, Alongkorn Kitamorn, Kevin F. Reick, and Michael S. Floyd. Fault-tolerant design of the IBM pSeries 690 system using POWER4 processor technology. IBM Journal of Research and Development, 46(1), January 2002.

[9] Fred A. Bower, Paul G. Shealy, Sule Ozev, and Daniel J. Sorin. Tolerating Hard Faults in Microprocessor Array Structures. In International Conference on Dependable Systems and Networks, pages 51–60, June/July 2004.

[10] M. Broglia, G. Buonanno, M.G. Sami, and M. Selvini. Designing for Yield: a Defect-Tolerant Approach to High-Level Synthesis. In International Symposium on Defect and Fault Tolerance in VLSI Systems, pages 312–317, November 1998.

[11] D. Brooks, P. Bose, V. Srinivasan, M. K. Gschwind, P. G. Emma, and M. G. Rosenfield. New methodology for early-stage, microarchitecture-level power-performance analysis of microprocessors. IBM Journal of Research and Development, 47(5/6):653–670, 2003.
[12] Cadence Design Systems, Inc. Reliability Simulation in Integrated Circuit Design. Technical report, http://www.cadence.com.

[13] Andrew Carlson. Mechanism of Increase in SRAM Vmin Due to Negative Bias Temperature Instability. To appear in IEEE Transactions on Device and Materials Reliability.

[14] Jonathan Chang, Ming Huang, Jonathan Shoemaker, John Benoit, Szu-Liang Chen, Wei Chen, Siufu Chiu, Raghuraman Ganesan, Gloria Leong, Venkata Lukka, Stefan Rusu, and Durgesh Srivastava. The 65nm 16MB On-die L3 Cache for a Dual Core Multi-Threaded Xeon Processor. In Proceedings of the Symposium on VLSI Circuits Digest of Technical Papers, pages 126–127, 2006.

[15] P. G. Depledge et al. Fault-Tolerant Computer Systems. IEE Proceedings, 128(4):257–272, May 1981.

[16] H. Hamann et al. Hotspot-limited microprocessors: direct temperature and power distribution measurements. To appear in the IEEE Journal of Solid-State Circuits, February 2007.

[17] A. Haggag, S. Kalpat, M. Moosa, N. Liu, M. Kuffler, H.-H. Tseng, T.-Y. Luo, J. Schaeffer, D. Gilmer, S. Samavedam, R. Hegde, B. E. White Jr., and P. J. Tobin. Generalized Models for Optimization of BTI in SiON and High-K Dielectrics. In Proceedings of the International Reliability Physics Symposium, pages 665–666, March 2006.

[18] James R. Heath, Philip J. Kuekes, Gregory S. Snider, and R. Stanley Williams. A Defect-Tolerant Computer Architecture: Opportunities for Nanotechnology. Science, 280:1716–1721, 12 June 1998.

[19] H. Peter Hofstee. Power efficient processor architecture and the cell processor. In Proceedings of the Eleventh International Symposium on High-Performance Computer Architecture (HPCA-11), February 2005.

[20] C.-K. Hu, D. Canaperi, S.T. Chen, L.M. Gignac, B. Herbst, S. Kaldor, M. Krishnan, E. Liniger, D.L. Rath, D. Restaino, R. Rosenberg, J. Rubino, S.-C. Seo, A. Simon, S. Smith, and W.-T. Tseng. Effects of overlayers on electromigration reliability improvement for Cu/low K interconnects. In Proceedings of the International Reliability Physics Symposium, pages 222–228, April 2004.

[21] C.-K. Hu, L. Gignac, and R. Rosenberg. Electromigration of Cu/low dielectric constant interconnects. Microelectronics and Reliability, 46(2-4):213–231, February-April 2006.

[22] Zhigang Hu, Alper Buyuktosunoglu, Viji Srinivasan, Victor Zyuban, Hans Jacobson, and Pradip Bose. Microarchitectural techniques for power gating of execution units. In Proceedings of the 2004 International Symposium on Low Power Electronics and Design, pages 32–37, 2004.

[23] William R. Hunter. Self-consistent solutions for allowed interconnect current density - Part II: Application to design guidelines. IEEE Transactions on Electron Devices, 44(2):310–316, February 1997.

[24] Rajiv V. Joshi, Saibal Mukhopadhyay, Donald W. Plass, Yuen H. Chan, Ching-Te Chuang, and A. Devgan. Variability analysis for sub-100 nm PD/SOI CMOS SRAM cell. In Proceedings of the European Solid-State Circuits Conference, pages 211–214, September 2004.

[25] B. Kaczer, R. Degraeve, G. Groeseneken, M. Rasras, S. Kubicek, E. Vandamme, and G. Badenes. Impact of MOSFET oxide breakdown on digital circuit operation and reliability. In Proceedings of the International Electron Devices Meeting, pages 553–556, December 2000.

[26] Kunhyuk Kang, Haldun Kufluoglu, Kaushik Roy, and Muhammad Ashraful Alam. Impact of Negative-Bias Temperature Instability in Nanoscale SRAM Array: Modeling and Analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(10):1770–1781, October 2007.
[27] Rouwaida Kanj, Rajiv V. Joshi, and Sani R. Nassif. Mixture importance sampling and its application to the analysis of SRAM designs in the presence of rare failure events. In Proceedings of the Design Automation Conference, pages 69–72, July 2006.

[28] R. Kapre, K. Shakeri, H. Puchner, J. Tandigan, T. Nigam, K. Jang, M.V.R. Reddy, S. Lakshminarayanan, D. Sajoto, and M. Whately. SRAM variability and supply voltage scaling challenges. In Proceedings of the International Reliability Physics Symposium, pages 23–28, 2007.

[29] A. T. Krishnan, V. Reddy, D. Aldrich, J. Raval, K. Christensen, J. Rosal, C. O'Brien, R. Khamankar, A. Marshall, W-K. Loh, R. McKee, and S. Krishnan. SRAM Cell Static Noise Margin and VMIN Sensitivity to Transistor Degradation. In Proceedings of the International Electron Devices Meeting, pages 1–4, December 2006.

[30] Sanjay V. Kumar, Chris H. Kim, and Sachin S. Sapatnekar. Impact of NBTI on SRAM Read Stability and Design for Reliability. In Proceedings of the 7th International Symposium on Quality Electronic Design, pages 210–218, 2006.

[31] Yung-Huei Lee, William McMahon, Neal Mielke, Yin-Lung Ryan Lu, and Steve Walstra. Managing Bias-Temperature Instability for Product Reliability. In Proceedings of the International Symposium on VLSI Technology, Systems and Applications, pages 1–2, April 2007.

[32] Ana Sonia Leon, Brian Langley, and Jinuk Luke Shin. The UltraSPARC T1 Processor: CMT Reliability. In Proceedings of the IEEE Custom Integrated Circuits Conference, pages 555–562, 2006.

[33] R. Leveugle, Z. Koren, I. Koren, G. Saucier, and N. Wehn. The Hyeti defect tolerant microprocessor: a practical experiment and its cost-effectiveness analysis. IEEE Transactions on Computers, 43(12):1398–1406, December 1994.

[34] B. P. Linder, J. H. Stathis, R. A. Wachnik, Ernest Wu, S. A. Cohen, A. Ray, and A. Vayshenker. Gate oxide breakdown under Current Limited Constant Voltage Stress. In Proceedings of the IEEE Symposium on VLSI Technology Digest of Technical Papers, pages 214–215, June 2000.

[35] Zhijian Lu, John Lach, Mircea R. Stan, and Kevin Skadron. Temperature-aware modeling and banking of IC lifetime reliability. IEEE Micro, 25(6):40–49, November 2005.

[36] Michael A. Lucente, Clifford H. Harris, and Robert M. Muir. Memory system reliability improvement through associative cache redundancy. IEEE Journal of Solid-State Circuits, 26(3):404–409, March 1991.

[37] Michael J. Mack, Wolfram M. Sauer, Scott B. Swaney, and Bruce G. Mealey. IBM POWER6 reliability. IBM Journal of Research and Development, 51(6):763–774, November 2007.

[38] S. Mahapatra, M.A. Alam, P. Bharath Kumar, T.R. Dalei, and D. Sana. Mechanism of negative bias temperature instability in CMOS devices: degradation, recovery and impact of nitrogen. In Proceedings of the International Electron Devices Meeting, pages 105–108, December 2004.

[39] J.A. Maiz. Characterization of electromigration under bidirectional (BC) and pulsed unidirectional (PDC) currents. In Proceedings of the 27th Annual International Reliability Physics Symposium, pages 220–228, April 1989.

[40] Ennis T. Ogawa, Ko-Don Lee, Volker A. Blaschke, and Paul S. Ho. Electromigration Reliability Issues in Dual-Damascene Cu Interconnections. IEEE Transactions on Reliability, 51(4):403–419, December 2002.

[41] Ishwar Parulkar, Thomas Ziaja, Rajesh Pendurkar, Anand D'Souza, and Amitava Majumdar. A scalable, low cost design-for-test architecture for UltraSPARC chip multi-processors. In International Test Conference, pages 726–735, October 2002.
[42] Bipul C. Paul, Kunhyuk Kang, Haldun Kufluoglu, Muhammad A. Alam, and Kaushik Roy. Impact of NBTI on the Temporal Performance Degradation of Digital Circuits. IEEE Electron Device Letters, 26(8):560–562, August 2005.

[43] Donald W. Plass and Yuen H. Chan. IBM POWER6 SRAM arrays. IBM Journal of Research and Development, 51(6):747–756, November 2007.

[44] ReliaSoft Corporation. Limitations of the Exponential Distribution for Reliability Analysis. Reliability Edge, 2(3):1–3, 2001.

[45] G. Ribes, J. Mitard, M. Denais, S. Bruyere, F. Monsieur, C. Parthasarathy, E. Vincent, and G. Ghibaudo. Review on High-k Dielectrics Reliability Issues. IEEE Transactions on Device and Materials Reliability, 5(1):5–19, March 2005.

[46] K. P. Rodbell, A. J. Castellano, and R. I. Kaufman. AC electromigration (10MHz - 1GHz) in Al metallization. In Proceedings of the Fourth International Workshop on Stress Induced Phenomena in Metallization, pages 212–223, January 1998.

[47] R. Rodriguez, J. H. Stathis, B. P. Linder, S. Kowalczyk, C. T. Chuang, R. V. Joshi, G. Northrop, K. Bernstein, A. J. Bhavnagarwala, and S. Lombardo. The impact of gate-oxide breakdown on SRAM stability. IEEE Electron Device Letters, 23(9):559–561, September 2002.

[48] Giuseppe La Rosa, Wee Loon Ng, Stewart Rauch, Robert Wong, and John Sudijono. Impact of NBTI Induced Statistical Variation to SRAM Cell Stability. In Proceedings of the International Reliability Physics Symposium, pages 274–282, March 2006.

[49] E. Rosenbaum, Z. Liu, and C. Hu. Silicon dioxide breakdown lifetime enhancement under bipolar bias conditions. IEEE Transactions on Electron Devices, 40(12):2287–2295, December 1993.

[50] Mendel Rosenblum, Edouard Bugnion, Scott Devine, and Stephen Alan Herrod. Using the SimOS Machine Simulator to Study Complex Computer Systems. ACM Transactions on Modeling and Computer Simulation, 7(1):78–103, January 1997.

[51] T. Sakurai and A. R. Newton. Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas. IEEE Journal of Solid-State Circuits, 25(2):584–594, April 1990.

[52] Dieter K. Schroder and Jeff A. Babcock. Negative bias temperature instability: Road to cross in deep submicron silicon semiconductor manufacturing. Journal of Applied Physics, 94(1), July 2003.

[53] Ethan Schuchman and T. N. Vijaykumar. Rescue: A Microarchitecture for Testability and Defect Tolerance. In International Symposium on Computer Architecture, June 2005.

[54] H. Shafi, P. J. Bohrer, J. Phelan, C. A. Rusu, and J. L. Peterson. Design and validation of a performance and power simulator for PowerPC systems. IBM Journal of Research and Development, 47(5/6):641–652, 2003.

[55] Jeonghee Shin, Victor Zyuban, Zhigang Hu, Jude A. Rivers, and Pradip Bose. A Framework for Architecture-Level Lifetime Reliability Modeling. In Proceedings of the International Conference on Dependable Systems and Networks, pages 534–543, June 2007.

[56] Premkishore Shivakumar, Stephen W. Keckler, Charles R. Moore, and Doug Burger. Exploiting Microarchitectural Redundancy For Defect Tolerance. In International Conference on Computer Design, pages 481–488, October 2003.

[57] Martin L. Shooman. Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design. Wiley-Interscience, 2001.

[58] B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner. POWER5 system microarchitecture. IBM Journal of Research and Development, 49:505–521, 2005.

[59] Kevin Skadron, Mircea R. Stan, Karthik Sankaranarayanan, Wei Huang, Sivakumar Velusamy, and David Tarjan. Temperature-aware microarchitecture: modeling and implementation. ACM Transactions on Architecture and Code Optimization, 1(1):94–125, March 2004.
[60] Jayanth Srinivasan, Sarita V. Adve, Pradip Bose, Jude Rivers, and Chao-Kun Hu. RAMP: A Model for Reliability Aware Microprocessor Design. Technical Report RC 23048, IBM Research Report, December 2003.

[61] Jayanth Srinivasan, Sarita V. Adve, Pradip Bose, and Jude A. Rivers. The Case for Lifetime Reliability-Aware Microprocessors. In Proceedings of the 31st Annual International Symposium on Computer Architecture, pages 276–287, June 2004.

[62] Jayanth Srinivasan, Sarita V. Adve, Pradip Bose, and Jude A. Rivers. Exploiting Structural Duplication for Lifetime Reliability Enhancement. In International Symposium on Computer Architecture, June 2005.

[63] J. H. Stathis. Reliability limits for the gate insulator in CMOS technology. IBM Journal of Research and Development, 46(2/3):265–286, March/May 2002.

[64] J. H. Stathis and D. J. DiMaria. Reliability projection for ultra-thin oxides at low voltage. In Proceedings of the International Electron Devices Meeting, pages 167–170, December 1998.

[65] M. Tortorella and W. B. Frakes. A Computer Implementation of the Separate Maintenance Model for Complex-system Reliability. Quality and Reliability Engineering International, 22(7):757–770, December 2005.

[66] Shimpei Tsujikawa, Kikuo Watanabe, Ryuta Tsuchiya, Kazuhiro Ohnishi, and Jiro Yugami. Experimental evidence for the generation of bulk traps by negative bias temperature stress and their impact on the integrity of direct-tunneling gate dielectrics. In Proceedings of the Symposium on VLSI Technology Digest of Technical Papers, pages 139–140, June 2003.

[67] Robert H. Tu, Elyse Rosenbaum, Wilson Y. Chan, Chester C. Li, Eric Minami, Khandker Quader, Ping Keung Ko, and Chenming Hu. Berkeley reliability tools - BERT. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 12(10):1524–1534, October 1993.

[68] Rakesh Vattikonda, Wenping Wang, and Yu Cao. Modeling and Minimization of PMOS NBTI effect for Robust Nanometer Design. In Proceedings of the 43rd Annual Conference on Design Automation, pages 1047–1052, July 2006.

[69] Wenping Wang, Shengqi Yang, Sarvesh Bhardwaj, Rakesh Vattikonda, Sarma Vrudhula, Frank Liu, and Yu Cao. The Impact of NBTI on the Performance of Combinational and Sequential Circuits. In Proceedings of the 44th Conference on Design Automation, pages 364–369, June 2007.

[70] Don Weiss, John J. Wuu, and Victor Chin. The On-Chip 3-MB Subarray-Based Third-Level Cache on an Itanium Microprocessor. IEEE Journal of Solid-State Circuits, 37(11), November 2002.

[71] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: characterization and methodological considerations. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 24–36, June 1995.

[72] E. Y. Wu, E. J. Nowak, A. Vayshenker, W. L. Lai, and D. L. Harmon. CMOS scaling beyond the 100-nm node with silicon-dioxide-based gate dielectrics. IBM Journal of Research and Development, 46(2/3):287–298, 2002.

[73] Xiangning Yang and Kewal Saluja. Combating NBTI Degradation via Gate Sizing. In Proceedings of the International Symposium on Quality Electronic Design, pages 47–52, 2007.

[74] Xiangning Yang, Eric Weglarz, and Kewal Saluja. On NBTI Degradation Process in Digital Logic Circuits. In Proceedings of the International Conference on VLSI Design, pages 723–730, 2007.
[75] Sufi Zafar. Statistical mechanics based model for negative bias temperature instability induced degradation. Journal of Applied Physics, 97(10), May 2005.

[76] Sufi Zafar. A Tutorial on Negative Bias Temperature Instability (NBTI) in MOSFETs. In Proceedings of the Integrated Reliability Workshop, October 2006.

[77] Sufi Zafar, Arvind Kumar, Evgeni Gusev, and E. Cartier. Threshold Voltage Instabilities in High-k Gate Dielectric Stacks. IEEE Transactions on Device and Materials Reliability, 5(1):45–64, March 2005.
Abstract
Deep submicron semiconductor technologies enable greater degrees of device integration and performance, but they also pose many new microprocessor design challenges. Chip lifetime reliability as affected by wearout-related failures, for one, has become a major concern. Atomic-range dimensions, escalating power densities, process/operational variation and other consequences of extreme scaling all contribute to this concern. Much recent research has been conducted to understand and model the effects of wearout failure mechanisms such as negative bias temperature instability (NBTI), electromigration, gate oxide breakdown, etc., on chip lifetime reliability. Circuit and architectural techniques for mitigating and/or tolerating such wearout failures are also being explored for extending chip lifetime. Nonetheless, the challenge of modeling and improving the effects of low-level failures at the architecture-level continues to be a rather daunting one.