TOWARDS A CROSS-LAYER FRAMEWORK FOR WEAROUT MONITORING AND MITIGATION

by

Bardia Zandian

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2012

Copyright 2012 Bardia Zandian

Dedication

To my beloved parents and my sister, for their constant support.

Acknowledgements

First, I would like to thank my advisor, Professor Murali Annavaram, who has provided me with mentorship and guidance in every single step of my doctoral studies. Thank you for your feedback, support, and inspiration during our many long meetings over the past few years. I would also like to thank the members of my defense and qualification exam committee, Prof. Michel Dubois, Prof. Timothy Pinkston, and Prof. Jeff Draper, for their very constructive feedback, which helped shape this dissertation. I would like to thank Prof. Stephen Cronin, who inspired and encouraged me early in my graduate studies at USC and introduced me to scientific research.

I would like to thank the following people at the IBM T. J. Watson Research Center for their insightful feedback and support of my research: Dr. Pradip Bose, Dr. Alper Buyuktosunoglu, and Dr. Jude Rivers. I would like to thank the Ming Hsieh Institute leadership team for their mentorship and support during my final years at USC; specifically, Prof. Hossein Hashemi, Prof. Bhaskar Krishnamachari, and Prof. Shrikanth Narayanan. I would like to acknowledge the following members of the USC Information Sciences Institute (ISI) for their collaboration and support: Jonathan Ahlbin, Michael Bajura, Gregory Boverman, and Michael Fritze. I would like to thank the engineering team at Qualcomm, who helped expand both the depth and breadth of my technical skills during my summer internship. I would also like to thank the following members of the Ming Hsieh Department of Electrical Engineering for all their help during my graduate studies: Tim Boston, Diane Demetras, Christina Fontenot, Danielle Hamra, Estela Lopez, and Janice Thompson.

I would like to thank the graduate students in the research groups I worked in for their collaboration and feedback on the projects I worked on. I would like to thank the following members of the Super Computing in Pocket (SCIP) research group: Mohammad Abdel-Majeed, Lakshmi Kumar Dabbiru, Melina Demertzi, Waleed Dweik, Sabyasachi Ghosh, Hyeran Jeon, Gunjae Koo, Suk Hun Kang, Sangwon Lee, Kimish Patel, Jinho Suh, Daniel Wong, and Qiumin Xu. I would like to thank the following members of the Cronin Research Lab for their support in my earlier years of research at USC: Mehmet Aykol, Adam Bushmaker, I-Kai Hsu, Wei-Hsuan Hung, Rajay Kumar, and Jesse Theiss. I would also like to acknowledge two very motivated and bright undergraduate interns, Thomas Punihaole and Ricardo Rojas, who worked with me during parts of my doctoral research.

During my years as a graduate student at USC I had the great fortune of meeting good friends. I would like to thank members of the USC Ski and Snowboard Team, USC Outdoors Club, USC Surf Club, and many new and old friends in Los Angeles who have provided me with a chance to explore the beautiful mountains, deserts, and beaches of California and to maintain a balanced life outside of school.
I also had the privilege to serve with some of the best student leaders at USC on the Graduate and Professional Student Senate (now Graduate Student Government), the Engineering Graduate Student Association (now Viterbi Graduate Student Association), and the IEEE student branch at USC. I want to thank the 2007-2008 and 2008-2009 executive boards and senators of the above student organizations for providing me with the opportunity to learn and practice many leadership and communication skills complementary to the technical research skills I was learning as a graduate student.

Funding for my research was made possible by National Science Foundation grants CAREER-0954211, CCF-0834798, and CCF-0834799, an IBM Faculty Fellowship, a Ming Hsieh Institute scholarship, and USC ISI.

Finally, I would like to thank my parents and all who have at some point in my life taught me something new and inspired me to explore and learn. My passion for science, engineering, exploration, and discovery has been a result of their teachings and of the opportunities they provided me for intellectual growth.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1  Introduction
  1.1  An Overview of Wearout and Guardbands

Chapter 2  Reliability Monitoring Using Adaptive Critical Path Testing
  2.1  Introduction
    2.1.1  Wearout Monitoring
    2.1.2  Wearout Monitoring Applications
  2.2  WearMon Reliability Monitoring Framework
    2.2.1  Architecture of the Monitoring Unit
    2.2.2  Test Vector Selection
    2.2.3  Test Frequency Selection
    2.2.4  DTC and Opportunistic Tests
    2.2.5  Dynamic Path Delay Variations
  2.3  Hierarchical Reliability Management
    2.3.1  Distributed vs. Central RMU Control
    2.3.2  Error Avoidance Implementation Issues
  2.4  Experimental Methodology
    2.4.1  FPGA Emulation Setup
    2.4.2  Three Scenarios for Monitoring
    2.4.3  Dynamic Adaptation of RMU
  2.5  Evaluation Results
    2.5.1  Area Overhead
    2.5.2  Monitoring Overhead
    2.5.3  Opportunistic Testing
  2.6  Related Work
  2.7  Summary and Conclusions

Chapter 3  WAT: A Cross-layer Wearout Analysis Tool
  3.1  Introduction
  3.2  Implementation Details of WAT
    3.2.1  Stage 1: FPGA-based Emulation and FUB Input Trace Collection
    3.2.2  Stage 2: ASIC Synthesis and Static Timing Analysis
    3.2.3  Stage 3: Gate-level Simulation and Switching Activity Data Collection
    3.2.4  Stage 4: Translation of Gate Port Activity to Transistor Activity
  3.3  An Application of WAT: Accurate Wearout Simulation for Chip Lifespan Prediction
    3.3.1  Building Blocks of a Lifespan Prediction Framework
    3.3.2  Gate-level Activity Statistics
    3.3.3  Probability of Logic State
    3.3.4  Toggle Probability
  3.4  Related Works
  3.5  Other Applications of WAT
    3.5.1  Circuit Design Optimization for Improved Reliability
    3.5.2  Instruction Set Architecture Reliability Benchmarking
    3.5.3  Device Vulnerability Identification and Selective Hardening
    3.5.4  Automatic Test Vector Generation for Post-fabrication Testing
  3.6  Summary and Conclusions

Chapter 4  Cross-layer Wearout Aware Design Flow
  4.1  Introduction
  4.2  Cross-Layer Design Flow
    4.2.1  Step 1: Selection of the Analysis Group
    4.2.2  Step 2: Utilization Based Path Prioritization
    4.2.3  Step 3: Approaches for Selecting Monitored Paths
      4.2.3.1  Approach 1: Monitor Least Reliable
      4.2.3.2  Approach 2: Two Monitoring Groups
      4.2.3.3  Approach 3: Virtual Critical Paths
      4.2.3.4  Approach 4: Two Monitoring Domains
    4.2.4  Summary of Approaches
  4.3  Evaluation Methodology
  4.4  Evaluation Results
    4.4.1  Detailed Results from sparc_ifu_dec
    4.4.2  Path Utilization Profile Analysis of All FUBs
    4.4.3  CLDF Overheads
  4.5  Related Work
  4.6  Summary and Conclusions

Chapter 5  Wearout-aware Runtime Use of Redundancy to Improve Lifespan
  5.1  Introduction
  5.2  Background
    5.2.1  Redundancy
    5.2.2  Timing Margin Heterogeneity
      5.2.2.1  Reasons for Timing Margin Heterogeneity
  5.3  Wearout-Aware Scheduling
    5.3.1  An Illustration of WAS
    5.3.2  Microarchitectural Wearout-Aware Scheduling
      5.3.2.1  Implementation of MWAS Control Unit
      5.3.2.2  MWAS Enhancements
    5.3.3  Architectural Wearout-Aware Scheduling
      5.3.3.1  AWAS Implementation
      5.3.3.2  AWAS Enhancements
    5.3.4  Wearout-dependent M_diff_th
    5.3.5  Comparison of MWAS with AWAS
    5.3.6  Hybrid Wearout-Aware Scheduling
      5.3.6.1  HWAS Implementation
  5.4  Evaluation Setup
    5.4.1  Device Level Wearout Models
      5.4.1.1  NBTI
      5.4.1.2  HCI
    5.4.2  Critical Path Model
    5.4.3  Wearout of Microarchitectural Structures
    5.4.4  Model Parameters
    5.4.5  Architectural Simulation
    5.4.6  Workload Diversity
  5.5  Results
    5.5.1  Lifespan Improvement
    5.5.2  Performance Impact
    5.5.3  Energy Impact
    5.5.4  Sensitivity Analyses
  5.6  Related Works
  5.7  Summary and Conclusions

Chapter 6  Conclusions and Future Work

References

List of Tables

Table 2.1: Effects of CUT size on RMU and EAU implementation.
Table 3.1: List of circuit blocks from the OpenSPARC T1 processor which are used in the cross-layer analysis.
Table 4.1: Comparison of approaches.
Table 4.2: Comparison of the four approaches for different FUBs.
Table 5.1: Non-uniform utilization at different levels.
Table 5.2: Comparison of Dynamic Power/Thermal Management (DPM/DTM) and Reactive/Proactive Dynamic Reliability Management (R-DRM/P-DRM) frameworks.

List of Figures

Figure 1.1: Manifestation of wearout at different levels in the computer system stack.
Figure 1.2: Wearout-induced timing margin degradation in circuit paths.
Figure 2.1: Overview of the reliability monitoring unit (RMU) and its interface with the CUT.
Figure 2.2: Testing critical paths with multiple test frequencies at three different wearout levels during the lifetime of the circuit.
Figure 2.3: Reduced guardband.
Figure 2.4: Central DTC for multiple CUTs.
Figure 2.5: Autonomous distributed RMUs and their interaction with PEAs.
Figure 2.6: Timing margin degradation.
Figure 2.7: Dynamic adaptation of RMU.
Figure 2.8: Overhead of (a) linear and (b) exponential schemes.
Figure 2.9: Test fail rates for (a) linear and (b) exponential schemes.
Figure 2.10: Distribution of (a) opportunity duration and (b) distance between opportunities.
Figure 3.1: An overview of the cross-layer wearout analysis tool (WAT). WAT uses FPGA emulation coupled with gate-level simulation for understanding how software impacts device wearout.
Figure 3.2: Six FPGAs mounted on evaluation boards used in the first stage of the cross-layer analysis setup.
Figure 3.3: Transistor-level design of 2-input (a) NAND and (b) AND gates, highlighting how switching activity and logic state on the inputs and outputs of logic gates translate into switching activity and logic state on the gates of PMOS and NMOS transistors in the design.
Figure 3.4: CDF of the probability of logic value 1 on the inputs of different FUBs.
Figure 3.5: CDF of the probability of logic value 1 on the internal nodes of different FUBs.
Figure 3.6: Distribution of the probability of logic value 1 on the input ports of 12 FUBs.
Figure 3.7: Distribution of the probability of logic value 1 on the internal nodes of 12 FUBs.
Figure 3.8: Toggle probability distribution for the input ports of 12 FUBs.
Figure 3.9: Toggle probability distribution for the internal nodes of 12 FUBs.
Figure 3.10: CDF of the toggle probability distribution of the internal nodes of 12 FUBs compared.
Figure 3.11: Comparison of the CDF of the toggle probability distribution for (a) input ports and (b) internal nodes of the 12 FUBs.
Figure 4.1: Design time and runtime cross-layer interaction.
Figure 4.2: Path delay distribution (a) before optimization and after (b) Approach 1, (c) Approach 2, (d) Approach 3, and (e) Approach 4.
Figure 4.3: Flow chart of evaluation methodology.
Figure 4.4: (a) Path timing and (b) path utilization profile of sparc_ifu_dec.
Figure 4.5: Slack distribution before (dotted line) and after CLDF optimizations on (a) sparc_ifu_dec, (b) sparc_exu_ecl, (c) lsu_stb_ctl, and (d) sparc_exu_rml.
Figure 4.6: Path utilization profile for (a) sparc_ifu_dec, (b) sparc_exu_ecl, (c) lsu_stb_ctl, and (d) sparc_exu_rml (vertical axis has logarithmic scale).
Figure 5.1: Reasons for timing margin heterogeneity in a processor.
Figure 5.2: Wearout of different microarchitectural structures.
Figure 5.3: Wearout control to extend lifespan.
Figure 5.4: Busy control bit.
Figure 5.5: MWAS control unit and its inputs and outputs.
Figure 5.6: Flow chart of HWAS.
Figure 5.7: Lifespan improvement due to WAS techniques.
Figure 5.8: Impact of WAS on when half the cores have failed.
Figure 5.9: Performance impact of MWAS and HWAS.
Figure 5.10: Effect of timing margin sensor inaccuracy.
Figure 5.11: Lifespan distribution for different guardbands.
Figure 5.12: Sensitivity analyses to opportunities, imbalance, NBTI/HCI impact, and PMOS stress.

Abstract

CMOS scaling has enabled a greater degree of integration and higher performance, but has the undesirable consequence of decreased circuit reliability due to rapid wearout. Accelerated processor wearout and the consequent degradation in lifetime have become a first-order design constraint. This dissertation tackles these challenges by developing new tools to accurately quantify wearout, providing novel methods to quantify the wearout impact of software interactions on hardware. The dissertation then demonstrates the usage of these tools by developing a wearout-aware scheduling approach that achieves wear leveling within a processor.

This dissertation first presents WearMon, an adaptive critical path monitoring architecture which provides an accurate and real-time measure of a processor's wearout-induced timing margin degradation. Special test patterns are used to check a set of critical paths in the circuit-under-test. By activating the actual devices and signal paths used in normal operation of the chip, each test captures the up-to-date timing margin of these paths. This monitoring framework dynamically adapts testing interval and complexity based on analyses of prior test results, which increases the efficiency and accuracy of monitoring. Monitoring overhead can be completely eliminated by scheduling tests only when the circuit is idle. This wearout detection mechanism is a key building block of a hierarchical runtime reliability management system where multiple wearout monitoring units can co-operatively engage preemptive error avoidance schemes.
Our experimental results based on an FPGA implementation show that the proposed monitoring framework can be easily integrated into existing designs and operates with minimal overhead.

WearMon overhead can become a hurdle when a circuit block has a steep critical path timing wall. Many prior research studies intuitively argued that only a few paths within a steep critical path timing wall are actually utilized by application software, but there has been a dearth of tools that enable designers to understand how software impacts the utilization of critical paths in a circuit. The next part of this dissertation develops a tool for cross-layer analysis of wearout, called WAT. WAT uses FPGA emulation closely coupled with software simulation to provide accurate insight into device switching activity and runtime path utilization. We demonstrate the utility of WAT by providing accurate gate-level switching activity statistics as inputs to a lifetime wearout simulation tool, which uses accurate device-level models of the electro-physical phenomena causing wearout. Accurate switching statistics from WAT can significantly improve lifetime prediction accuracy.

WAT is also used to address the concern regarding WearMon overhead in the presence of steep critical path timing walls. A new design-for-reliability approach is developed that reshapes a critical path wall to make a circuit more amenable to wearout monitoring. This design flow methodology uses a path utilization profile to select only a few paths to be monitored for wearout. We propose and evaluate four novel algorithms for selecting paths to be monitored. These four approaches allow designers to select the best group of paths to be monitored under varying power, area, and monitoring budget constraints.

Finally, we demonstrate the impact of runtime wearout management with a proactive runtime wearout-aware scheduling approach, WAS. Processor failure can occur due to wearout of a single structure even if the vast majority of the chip is still operational. WAS strives for uniform wearout of processor structures, thereby preventing a single structure from becoming an early point of failure. The fine-grained microarchitectural-level chip wearout control policies use feedback from a network of timing margin monitoring sensors to identify the most degraded structures. Our evaluation shows WAS can result in a 15% to 30% improvement in the lifespan of a multi-core processor chip with negligible performance and energy consumption impact.

Chapter 1
Introduction

Our reliance on electronic circuits is rapidly growing as these circuits are used in almost every aspect of our lives. A long and reliable lifespan is becoming an even more stringent requirement of electronic systems as these systems are deployed in more mission-critical areas. The International Technology Roadmap for Semiconductors (ITRS) [32] has predicted increasing reliability challenges with future generations of circuits fabricated using deep submicron semiconductor technologies. Specifically, the issue of accelerated wearout is starting to affect how we design and implement circuits.
The following trends show how wearout problems are becoming increasingly challenging and point to the need for a more robust and efficient framework for dealing with the wearout problem in future generations of circuits:

1) CMOS Scaling Impact: Wearout is one of the undesirable consequences of CMOS scaling [15, 16, 31], since device scaling leads to (1) higher current density, (2) higher electric field, (3) higher operating temperature, and (4) increased process variations due to atomic-range device dimensions. Negative Bias Temperature Instability (NBTI), Hot Carrier Injection (HCI), Time-Dependent Dielectric Breakdown (TDDB), and electromigration [4, 14, 15] are some of the results of the above trends which lead to circuit wearout. These phenomena occur at a higher rate in more scaled CMOS technologies [31]. Their net result is gradual timing degradation and eventual breakdown of circuits [52, 57, 64], which is referred to as wearout or aging. The current trend of accelerated wearout makes simple guardbanding a highly inefficient solution due to the need for excessively large guardbands. Furthermore, since the severity of the wearout problem is increasing, even circuits which previously did not require expensive resiliency enhancements, such as redundancy, might require these enhancements just to achieve an acceptable lifespan and sufficient reliability during that lifespan.

2) Increased Variations: Both fabrication process variation and runtime operation variation are increasing, which further aggravates the reliability problem. The former is a side effect of smaller device sizes and pushing integrated circuit fabrication technology close to the physical limits of what is possible. The latter is due to the use of circuits with a large number of components, more diverse workloads, and the many architectural and microarchitectural enhancements which dynamically change the operation of the circuit at runtime. An example of this type of dynamic runtime variation is the implementation of thermal and power management techniques, such as power gating and dynamic voltage and frequency scaling (DVFS), which result in deliberate runtime changes in the operating point of the circuit. A direct result of these variations is a large difference between the wearout rates of devices on a chip. Hence, designing for the worst-case variations comes at the cost of a large amount of lost opportunity for higher performance or better power efficiency.

3) Increased Demand for Longer Circuit Lifetimes: This trend is due to the emergence of more application domains that are mission critical and demand stringent guarantees regarding circuit reliability. Another major factor contributing to this trend is the reduced rate of planned obsolescence. Planned obsolescence is a manufacturing practice employed by many system vendors in the semiconductor industry: these vendors introduce the next generation of a product, which is superior to its predecessor (e.g., faster, more power efficient, and/or smaller), before the useful life of the older generation of the product has ended. This practice forces the older products to become obsolete before they fail due to wearout. In the early decades of the semiconductor industry, process technology improved at a rapid pace and architectural enhancements significantly improved power and performance. Hence, planned obsolescence was in fact an economical solution, since the advantages of new processors outweighed the costs of early replacement.
However, as technology scaling hits new walls (especially since 2004), it is increasingly difficult to convince end-users that planned obsolescence is a cost-effective solution. The end result is a longer expected in-field lifespan for most integrated circuits.

In the early decades of modern semiconductor technology evolution, wearout-induced timing degradation was relatively small, both in rate and in magnitude. Hence, designers overwhelmingly relied on guardbanding to tolerate wearout-induced degradation. When applications demanded higher reliability standards than the guardband alone could provide, reliability was achieved through error detection and correction mechanisms and component-level redundancy [48, 58, 59].

1.1 An Overview of Wearout and Guardbands

We first provide a brief overview of how wearout manifests at various levels of the system hierarchy and show how guardbands are used to tackle wearout. Figure 1.1 shows manifestations of wearout at different levels in the computer systems stack. The root causes of wearout are electro-physical phenomena which happen at the device level; their manifestations are in the form of timing degradation of transistors and interconnects and their eventual failure. At the circuit level these translate into gate timing degradation, and finally, at the microarchitecture level, wearout results in signal path timing degradation and eventually path timing violations. Wearout manifests as execution errors at the higher levels in the computer systems stack. In this dissertation we primarily focus on wearout impact at the circuit and microarchitecture layers, which causes timing violations.

Figure 1.1: Manifestation of wearout at different levels in the computer system stack.

Hardware designers estimate the expected wearout during the lifetime of a processor and use guardbands to proactively reduce the clock frequency (and increase supply voltage) to account for worst-case wearout. Figure 1.2 shows the impact of wearout-induced timing degradation on the signal paths within the circuit. The clock frequency of a processor (f) is determined by the delay of its slowest critical path (D_crit). The clock period of a processor (T = 1/f) is calculated by adding a timing guardband (G) to D_crit, hence T = D_crit + G. This timing guardband ensures that variations in the delay of signal paths do not cause timing violations and errors during the lifetime of the processor. The timing margin of each circuit path is the difference between the delay of that path and the clock period. In Figure 1.2 the timing margins of the critical path (M_crit) and a non-critical path (M_non-crit) are shown at t_0, when the chip is new, as well as at t_1, after the chip has been used for some time and has suffered from some amount of path timing degradation. This timing degradation results in an increase in path delays and a reduction in the timing margin of paths. This is shown as M_crit(t_0) > M_crit(t_1) and M_non-crit(t_0) > M_non-crit(t_1). M_crit(t_0) is equal to the guardband, G, set by the designers.

Figure 1.2: Wearout-induced timing margin degradation in circuit paths.

Design-stage wearout prediction is becoming increasingly challenging as process variations lead to random device characteristics both within and across chips. Dynamically changing environmental conditions and workload-dependent circuit path utilization further exacerbate the problem of wearout estimation.
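The relations above lend themselves to a short numerical illustration. The following Python sketch (all delay values and the amount of degradation are hypothetical, not taken from this dissertation) computes the clock period from the design-time critical path delay plus a guardband and shows how a path's timing margin shrinks as its delay degrades:

```python
# Minimal sketch of the guardband/timing-margin relations (illustrative values only).

D_CRIT = 1.00      # design-time critical path delay D_crit, ns (hypothetical)
GUARDBAND = 0.10   # timing guardband G, ns (hypothetical)
T_CLK = D_CRIT + GUARDBAND  # clock period: T = D_crit + G

def timing_margin(path_delay_ns: float) -> float:
    """Margin of a path = clock period minus the path's current delay."""
    return T_CLK - path_delay_ns

# A path's delay grows as the chip wears out, so its margin shrinks.
delay_t0 = 1.00   # fresh chip: M_crit(t0) equals the guardband
delay_t1 = 1.06   # after some in-field use (hypothetical degradation)

for label, d in [("t0", delay_t0), ("t1", delay_t1)]:
    m = timing_margin(d)
    print(f"{label}: delay={d:.2f} ns, margin={m:.2f} ns, "
          f"{'OK' if m > 0 else 'TIMING VIOLATION'}")
```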
While these uncertainties existed in the past, the severity of their impact increases as devices are scaled [31, 37], due to the trends described earlier.

An advantage of advances in semiconductor design and fabrication in the past couple of decades has been the availability of a large transistor budget and runtime compute power at the discretion of the computer system architect and designer. In the dark silicon era, where chip power and temperature constraints restrict the operation of high-performance processors, most systems cannot sustain operation with all hardware resources active and operating at maximum performance. Hence, using a tiny fraction of this chip real estate for reliability management is now becoming an attractive option. This dissertation in fact presents multiple approaches that exploit a tiny fraction of the chip area to improve overall processor reliability.

Motivated by these challenges and opportunities, this dissertation makes the following contributions:

(1) This dissertation first presents WearMon, an adaptive critical path monitoring architecture which provides an accurate and real-time measure of a processor's wearout-induced timing margin degradation. The WearMon description and evaluations are presented in Chapter 2.

(2) The next part of this dissertation develops a tool for cross-layer analysis of wearout, WAT. In particular, WAT enables designers to accurately visualize how software execution impacts device switching activity and runtime path utilization in a circuit. This tool is described in Chapter 3.

(3) We exploit the capabilities of WAT to propose design-time enhancements that make a circuit amenable to wearout monitoring, even in the presence of a steep critical path timing wall. We propose four novel algorithms that allow designers to select the best group of paths to be monitored under varying power, area, and monitoring budget constraints. Chapter 4 describes the proposed solution.

(4) In Chapter 5 we demonstrate the usage of our design tools and wearout monitoring framework through a proactive runtime wearout-aware scheduling approach. This scheduling framework uses real-time feedback from a distributed network of timing margin sensors on the chip to control the wearout of microarchitectural blocks of the system.

Finally, in Chapter 6 we conclude with a summary of the contributions of this dissertation and highlight some open problems, which indicate possible future exploration directions in the area of design-for-reliability.

Chapter 2
Reliability Monitoring Using Adaptive Critical Path Testing

In this chapter we present WearMon [70, 71], an adaptive critical path monitoring architecture which provides an accurate and real-time measure of a processor's wearout-induced timing margin degradation. Special test patterns check a set of critical paths in the circuit-under-test. By activating the actual devices and signal paths used in normal operation of the chip, each test captures the up-to-date timing margin of these paths. The monitoring architecture dynamically adapts testing interval and complexity based on analysis of prior test results, which increases the efficiency and accuracy of monitoring. Experimental results based on an FPGA implementation show that the proposed monitoring unit can be easily integrated into any design. Monitoring overhead can be reduced to zero by scheduling tests only when a unit is idle.

2.1 Introduction

Wearout-induced timing degradation is hard to predict or accurately model early in the design stage.
This is due to process variations and runtime utilization variations, which are not known at the design stage. Most commercial products solve the timing degradation problem by inserting a guardband at design and fabrication time. As highlighted earlier, guardbands reduce the performance of a chip during its entire lifetime just to ensure correct functionality during a small fraction of time near the end of the chip's lifetime. Our inability to precisely and continuously monitor timing degradation is one of the primary reasons for over-provisioning of resources. Without an accurate and real-time measure of timing margin, designers are forced to use conservative guardbands.

Processors currently provide performance, power, and thermal monitoring capabilities. In this chapter, we argue that providing reliability monitoring to improve visibility into the timing degradation process is equally important for future processors. Such monitoring capability enables just-in-time activation of error detection and recovery methods, such as those proposed in [2, 25, 39, 41, 62].

2.1.1 Wearout Monitoring

For monitoring to be effective, we believe it must satisfy the following three criteria:

(1) Continuous Monitoring: Unlike performance and power, reliability must be monitored continuously over extended periods of time, possibly many years.

(2) Adaptive Monitoring: Monitoring must dynamically adapt to changing operating conditions. Due to differences in device activation factors and device variability, the timing degradation rate may differ from one circuit to another. Even a chip in the early stages of its expected lifetime can become vulnerable due to aggressive runtime power and performance optimizations, such as operation at near-threshold voltages and higher-frequency operation [41, 62].

(3) Low Overhead Monitoring: Since monitoring must be continuous, the monitoring architecture should have low performance overhead. Furthermore, the monitoring framework should be implementable with minimal modifications to existing processor structures and with limited area overhead.

2.1.2 Wearout Monitoring Applications

WearMon is designed to satisfy all of the above criteria. With continuous, adaptive, and low-overhead monitoring, conservative preset guardbands can be tightened. The processor can deploy preemptive error correction measures during in-field operation only when the measured timing margin of the circuit is small enough to affect its functionality. The unprecedented view of timing degradation provided by WearMon will enable designers to correlate predicted behavior from analytical models with in-field behavior and use these observations to make appropriate design changes for improving the reliability of future processors. Wearout monitoring information can be used for triggering automated activation of cold spares or requests for replacement before any failures occur. If the chip is also enhanced with adaptive reliability, performance, and power management mechanisms, then real-time knowledge regarding the wearout state of the circuit can be used directly by these management techniques to trade off reliability, performance, and power consumption. Furthermore, wearout monitoring allows designers to build circuits which keep the same level of reliability over their lifetime even as the underlying components wear out. By monitoring wearout, the designer can compensate for decreasing margins (and the resultant reliability reduction) by reducing performance and/or power efficiency.
The end result would be circuits which wearout (lose performance, and power efficiency) as they are utilized, but maintain a desired level of reliably. This approach of trading off performance for a preset failure rate is going to be an effective solution in many categories of circuits where reliability gets precedence over performance or power efficiency. This chapter is organized as follows: Section 2.2 explains details of the WearMon framework. In Section 2.3 hierarchical reliability management issues are discussed. Section 2.4 shows the experimental setup used to evaluate the effectiveness of WearMon. Results from our evaluations are discussed in Section 2.5. Section 2.6 compares WearMon to related works. We summarize and conclude Chapter 2 in Section 2.7. 2.2 WearMon Reliability Monitoring Framework WearMon is based on the notion of critical path tests. Specially designed test vectors are stored in an on-chip repository and are selected for injection into a circuit under test (CUT) at specified time intervals. The current timing margin of the CUT is measured using outcomes from these tests. In the following subsections we describe the architecture of WearMon. In particular, we describe one specific implementation of the WearMon framework in the form of a Reliability Monitoring Unit (RMU). 2.2.1 Architecture of the Monitoring Unit Figure 2.1 shows an overview of RMU and how it interfaces with the CUT. In our current design we assume that a CUT may contain any number of data or control signal paths that end in flip-flops, but it should not contain intermediate storage elements. 12 Figure 2.1: Overview of the reliability monitoring unit (RMU) and its interface with the CUT. The four shaded boxes in Figure 2.1 are the key components of RMU. Test Vector Repository (TVR) holds a set of test patterns and the expected correct outputs when these test patterns are injected into the CUT. TVR will be filled once with CUT-specific test vectors at post-fabrication phase. We describe the process for test vector selection in Section 2.2.2. Multiplexer, MUX1, is used to select either the regular operating frequency of the CUT or one test frequency from a small set of testing frequencies. Test frequency selection will be described in Section 2.2.3. Multiplexer, MUX2, on the input path of the CUT allows the CUT to receive inputs either from normal execution trace or from TVR. MUX1 input selection is controlled by the Freq. Select signal and MUX2 input selection is controlled by the Test Enable signal. Both these signals are generated by the Dynamic Test Control (DTC) unit. Section 2.2.4 will describe DTC operation. 13 DTC selects a set of test vectors from TVR to inject into the CUT and the test frequency at which to test the CUT. After each test vector injection the CUT output will be compared with the expected correct output and a test pass/fail signal is generated. For every test vector injection an entry is filled in the Reliability History Table (RHT). Each RHT entry stores a time stamp of when the test is conducted, test vector, testing frequency, pass/fail result, and CUT temperature. RHT is implemented as a two level structure where the first level (RHT-L1) only stores the most recent test injection results on an on-die SRAM structure. The second level RHT (RHT-L2) is implemented on a flash memory that can store test history information over multiple years. 
While RHT-L1 stores a complete set of prior test injection results within a small time window, RHT-L2 stores only interesting events, such as test failures and excessive thermal gradients, over the entire lifetime of the chip. DTC reads RHT-L1 data to determine when to perform the next test as well as how many circuit paths to test in the next test phase.

2.2.2 Test Vector Selection

Prior studies using industrial CPU designs, such as the Intel Core 2 Duo processor [5], show that microarchitectural circuit blocks often have three groups of circuit paths. The first group contains a few paths (<1%) with zero timing margin; the second group contains several paths (about 10%) with less than 10% timing margin; and the vast majority of paths (about 90%) have a larger timing margin. Test and verification engineers spend a significant amount of effort analyzing the first two sets of critical paths and generating test vectors that activate these paths for pre- and post-fabrication testing purposes. These test vectors are ideal candidates for filling up the TVR. These paths are the ones with the least amount of timing margin and are most susceptible to timing failures. Even in the absence of manual effort to identify these paths, CAD tools such as PrimeTime from Synopsys can classify the paths into categories based on their timing margin and then generate test vectors for activating them. We propose to exploit the effort already spent either by test designers or by CAD tools to create critical path test vectors. The TVR is thus initially filled with test vectors that test paths with less than 10% timing margin. Based on the results shown in [5], even for very large and complex CUTs such as cache controllers, the TVR may store on the order of 50-100 test vectors. In our current implementation, the TVR stores vectors in sorted order of their timing criticality, set once during the post-fabrication phase. Compression techniques can be used to further minimize the storage needs of the TVR.

2.2.3 Test Frequency Selection

In order to accurately monitor the remaining timing margin of paths in the CUT, they are tested at multiple test frequencies above the nominal operating frequency of the CUT. The difference between the highest frequency at which a path passes a test and the nominal operating frequency determines the current remaining timing margin of that path. The test frequency range is selected between the nominal operating frequency and the frequency without a guardband. This range is then divided into multiple frequency steps. Figure 2.2 shows these testing clock frequencies. Multiple clock frequencies for testing, F_test(i), can be selected between F_test(max) and F_test(min). The highest clock frequency which can be used for a test is equal to the operating frequency at which no guardband is used (i.e., F_test(max) = 1/ΔD_init), where ΔD_init is the initial delay of the slowest paths in the CUT. The slowest clock frequency used for testing, F_test(min), is equal to the nominal operating frequency of the CUT, which is 1/T_clk. The nominal clock period of the circuit (T_clk) is conventionally defined at design time as the delay of the slowest path in the CUT, ΔD_init, plus a guardband. The number of test frequencies used, i, will depend on the area and power budget allocated for reliability monitoring enhancements. A larger number of test frequency steps would increase the precision of the detected timing margin degradation but would incur a higher implementation overhead.
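As a concrete illustration of this frequency sweep, the Python sketch below (with hypothetical delay values and step count) builds the set of test frequencies between F_test(min) = 1/T_clk and F_test(max) = 1/ΔD_init and infers the remaining timing margin from the highest frequency at which a path still passes; the precision of the estimate is limited by the number of steps, as noted above:

```python
# Minimal sketch of test frequency selection and margin inference (hypothetical values).

D_INIT = 1.00       # delta-D_init: initial delay of the slowest path, ns (assumed)
GUARDBAND = 0.10    # design-time guardband, ns (assumed)
T_CLK = D_INIT + GUARDBAND
NUM_STEPS = 8       # number of test frequencies; limited by area/power budget

# Test frequencies span 1/T_clk (nominal) up to 1/delta-D_init (no guardband), in GHz.
f_min, f_max = 1.0 / T_CLK, 1.0 / D_INIT
test_freqs = [f_min + k * (f_max - f_min) / (NUM_STEPS - 1) for k in range(NUM_STEPS)]

def remaining_margin(current_path_delay_ns: float) -> float:
    """Margin inferred from the highest test frequency the path still passes."""
    passing = [f for f in test_freqs if 1.0 / f >= current_path_delay_ns]
    if not passing:
        return 0.0  # path fails even at the nominal frequency
    highest_pass = max(passing)
    return T_CLK - 1.0 / highest_pass  # quantized estimate of the true margin

print([round(f, 3) for f in test_freqs])
print(round(remaining_margin(1.04), 3))  # a partially degraded path (hypothetical delay)
```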
Figure 2.2: Testing critical paths with multiple test frequencies at three different wearout levels during the lifetime of the circuit.

Figure 2.2 shows the pass (P) and fail (F) results for tests conducted on a path at different stages of the CUT lifetime. In the scenarios illustrated in Figure 2.2, path delays ΔD_1, ΔD_2, and ΔD_3 represent the increased path delay due to wearout at different stages of the circuit's lifetime. Initially all tests at all test frequencies pass because the delay of the path has not degraded beyond ΔD_init; this is labeled Early in Figure 2.2. In this scenario ΔD_1 is smaller than or equal to ΔD_init. The scenarios labeled Mid and Late illustrate the timing degradation of the path as it gradually suffers from wearout. In the Mid scenario only four tests, conducted at the high end of the test frequency range, fail and the rest of the tests pass. In the Late scenario only one test, conducted at the low end of the test frequency range, passes and the rest of the tests fail. In all of these scenarios the highest test frequency at which a test passes indicates the remaining timing margin of the path tested.

Figure 2.3: Reduced guardband.

It is worth noting that when using WearMon, it would be possible to reduce the default guardband to a smaller value while still meeting the same reliability goals. This is shown in Figure 2.3, where (T_clk1 = 1/f_clk1) > (T_clk2 = 1/f_clk2) > (ΔD_init). T_clk1 is the original clock period when using the default guardband and T_clk2 is the clock period with a reduced guardband (RGB). Note that in either case the initial path delay of the CUT, ΔD_init, is the same. The purpose of WearMon is to continuously monitor the CUT and check whether CUT timing is encroaching into the reduced guardband. RGB is a performance improvement made possible by the RMU. In addition to the performance enhancement, using RGB also results in a smaller test frequency range and hence can reduce monitoring overhead.

2.2.4 DTC and Opportunistic Tests

DTC is the critical decision-making unit that determines the interval between tests (called the interval throughout the rest of the chapter) and the number of test vectors to inject during each test phase (called the complexity). These two choices exploit the tradeoff between increased accuracy and decreased performance due to testing overhead. DTC reads the most recent RHT entries to decide the interval and complexity of future testing phases. The most recent RHT entries are analyzed by DTC to see if any tests have failed during the testing phase. Each failed test entry indicates which test vector did not produce the expected output and at what testing frequency. DTC then selects the union of all the failed test vectors to be included in the next phase of testing. If no prior tests have failed, DTC simply selects a set of test vectors from the top of the TVR. We explore two different choices for the number of vectors selected in our results section. Note that the minimum number of test vectors needed for one path to be tested for a rising or a falling transition is two. Instead of using just two input combinations that sensitize only the most critical path in the CUT, multiple test vectors that exercise a group of critical paths in the CUT are used in each test phase. Thus, the test vectors used in each test phase are a small subset of the vectors stored in the TVR. This subset is dynamically selected by DTC based on the history of test results and the CUT's operating condition.
Initially DTC selects test vectors in the order from the most critical (slowest) 18 path at design time to less critical paths. As path criticality changes during the lifetime of the chip, cases might be observed where paths that were initially thought to be faster are failing while the expected slower paths are not. Hence, the order of the critical paths tested can be dynamically updated by the DTC by moving the failing input patterns to the top of the test list. To account for the unpredictability in device variations, DTC also randomly selects additional test vectors from TVR during each test phase making sure that all the paths in the TVR are checked frequently enough. This multi-vector test approach allows more robust testing since critical paths may change over time due to different usage patterns, device variability, and difference in the devices present on each signal path (e.g. different number of PMOS and NMOS transistors on different signal paths result in different susceptibility to NBTI or HCI). Once the test complexity has been determined, then DTC selects the test interval. There are two different approaches for selecting when to initiate the next test phase. In the first approach DTC initially selects a large test interval, say 1 million cycles between two test phases and then DTC dynamically alters the test interval to be inversely proportional to the number of failures seen in the past few test phases. For instance, if two failures were noticed in the last eight test phases then DTC decreases the new test interval to be half of the current test interval. An alternate approach to determine test interval is opportunistic testing. In this approach DTC initiates a test injection only when the CUT is idle thereby resulting in zero performance overhead. Current microprocessors provide multiple such opportunities for testing a CUT. For example, on a branch misprediction the entire pipeline is flushed 19 and instructions from the correct execution path are fetched into the pipeline. Execution, writeback, and retirement stages of the pipeline are idle waiting for new instructions since the newly fetched instructions take multiple cycles to reach the backend. When a long latency operation such as a L2 cache miss is encountered, even aggressive out-of-order processors are unable to hide the entire miss latency thereby stalling the pipeline. Finally, computer system utilization rarely reaches 100% and the idle time between two utilization phases provides an excellent opportunity to test any CUT within the system. We quantify the distance between idle times and their duration for a select set of microarchitectural blocks in Section 2.5.3. DTC can automatically adapt to the reliability needs of the system. For a CUT which is unlikely to have failures during the early stages of its in-field operation, test interval is increased and test complexity is reduced. As the CUT ages or when the CUT is vulnerable due to low power settings, DTC can increase testing. Note that the time scale for test interval is extremely long. For instance, NBTI related timing degradation occurs only after many hours of intense activity in the CUT. Hence testing interval will be in the order of seconds even in the worst case. We will explore performance penalties for different selections of test intervals and test complexities in the experimental results section. We should emphasize that testing the CUT does not in itself lead to noticeable increase in aging of the CUT. 
We should emphasize that testing the CUT does not in itself lead to a noticeable increase in aging of the CUT. The percentage of time a CUT is tested is negligible compared to the normal usage time. We would also like to note that there are design alternatives to several of the RMU components described. For implementing variable test frequencies, there are existing infrastructures for supporting multiple clocks within a chip. For instance, dynamic voltage and frequency scaling (DVFS) is supported on most processors for power and thermal management. While the current granularity of frequency scaling may be too coarse, it is possible to create much finer clock scaling capabilities, as described in [67]. Alternatively, an aging-resilient delay generator such as the one proposed in [2] can be used with minimal area and power overhead.

2.2.5 Dynamic Path Delay Variations

Due to the limited size of the TVR, it is possible that for some circuits the number of critical paths that need to be tested exceeds the size of the TVR. Hence, designers may be forced to select only a subset of the critical paths to be tested. Furthermore, finding all the critical paths can be a challenging task, for the following reasons: (1) Critical paths found at the design stage using static timing analysis of the circuit might not be the slowest paths in the manufactured circuit, due to variations in the fabrication process. (2) The amount of wearout that devices in a circuit suffer from varies depending on how they are utilized and on their operating conditions. Accurate design-time predictions regarding in-field runtime utilization of circuit paths and variations in operating environment conditions, such as temperature, can be difficult. These runtime effects result in non-uniform wearout of devices and circuit paths. As a result, the critical paths can change at runtime.

To overcome this impediment we propose to enhance the monitoring framework to dynamically update the TVR contents during in-field operation of the circuit. An auxiliary TVR with higher capacity than the main TVR can be implemented in off-chip flash memory. The auxiliary TVR stores test vectors for the entire cluster of critical paths. WearMon periodically conducts an extended test phase in which it uses the test vectors in the auxiliary TVR to check the circuit with larger coverage. After the extended test phase all the paths are sorted based on the newly observed timing margin. The top N critical paths, where N is the size of the main TVR, are then selected. Test vectors that test these N critical paths are used to update the contents of the main TVR. This approach ensures that the main TVR always holds the test vectors required for testing the most critical paths in the circuit.

For example, assume we have a CUT which has 10,000 paths, of which 4000 paths are marked as critical at design time using static timing analysis. Assume that the main TVR is designed to store test vectors for checking the top 1000 slowest paths. At design time we fill the TVR with test vectors for the top 1000 slowest paths in the CUT. In addition, we also store test vectors to test all 4000 critical paths in the auxiliary TVR. During the normal test phases, a subset of the vectors stored in the TVR are injected into the CUT and the slow paths in the circuit get checked routinely. However, WearMon occasionally conducts an extended test phase using the larger auxiliary test vector group. At specific time intervals (e.g., every month), which are much larger than the normal test intervals, the test vectors stored in the auxiliary TVR are brought into the chip for conducting tests.
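The extended-test-phase refresh just described can be summarized in a few lines of Python. In this sketch, measure_margin() stands in for an actual test injection and is a hypothetical helper; the function simply keeps the vectors covering the N slowest paths observed in the extended phase:

```python
# Minimal sketch of the extended-test-phase TVR refresh described above.
def refresh_main_tvr(auxiliary_tvr, measure_margin, main_tvr_size=1000):
    """Run the extended test phase over every vector in the auxiliary TVR, sort the
    covered paths by their newly observed timing margin, and keep the test vectors
    for the N slowest paths (N = main TVR capacity) as the new main TVR contents."""
    observed = []
    for vector, path_id in auxiliary_tvr:       # e.g., 4000 (vector, path) pairs
        margin = measure_margin(vector)         # remaining margin seen in this phase
        observed.append((margin, vector, path_id))
    observed.sort(key=lambda x: x[0])           # smallest margin = most critical
    return [(vector, path_id) for _, vector, path_id in observed[:main_tvr_size]]
```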
Result of this infrequent testing with larger coverage would then be used to refine the selection of the 1000 paths which are going to be monitored during the next month of utilization. In other words, the top 1000 slowest paths in the 4000 paths 22 tested will replace the old 1000 paths in the TVR for the next month of operation. These infrequent but more robust test cycles will provide feedback from the actual runtime timing degradation of extended group of circuit path and will ensure that that the TVR always stores the test vectors for testing the slowest paths in the circuit. While our approach ensures that dynamic variations of critical paths are accounted for, the fundamental assumption is that a non-critical path that is not tested by vectors either in the main TVR or the auxiliary TVR will not fail while all the critical paths that are tested are still operational. It is important to note that due to the physical nature of the phenomena causing aging related timing degradation, the probability of a path which is not in the TVR failing before any of the circuit paths that are being tested is extremely low. Studies on 65nm devices show that variations in threshold voltage due to NBTI are very small and gradual over time. In particular, the probability of a sudden large variation in threshold voltage which results in large path delay changes is nearly zero [27, 49]. Thus the probability of an untested path failing is the same as the probability of a path with timing margin greater than the guardband suddenly deteriorating to the extent that it violates timing. Chip designers routinely make the assumption that such sudden deterioration is highly unlikely while selecting the guardband. Hence our assumption is acceptable by industrial standards. 2.3 Hierarchical Reliability Management We expect that WearMon will be coupled with preemptive error avoidance mechanisms to improve circuit reliability. Error avoidance mechanisms operate in two ways: 23 (1) Circuit operation point adjustment: In this approach errors are avoided by changing the circuit’s operating parameters, such as reducing frequency or increasing voltage. This approach trades performance and/or power efficiency for reliability. (2) Using planned redundancy: Cold or hot spares can be used to replace or temporarily disable unreliable CUTs. Redundancy in the form of checker units [9] can also be engaged or disengaged based on the reliability state of the circuit. Error avoidance mechanisms can be implemented at different granularities. For example, operation point adjustment or placement of cold spares can be at the pipeline stage level or at the core level in the other extreme. Error avoidance does not necessarily need to be at the same granularity as monitoring. For example, if the CUT being monitored is an ALU then error avoidance can be done at the ALU level or can be at the execution stage of a processor pipeline which encompasses the ALU that is being monitored. For reasons which will be explained in Section 2.3.2, the latter option is going to be more feasible in most large circuits. 2.3.1 Distributed Vs. Central RMU Control WearMon monitoring infrastructure presented in Section 2.2 uses distributed autonomous RMUs where one RMU is attached to each circuit block tested. Each RMU is capable of monitoring timing margin of the CUT it is attached to and hence its wearout state. These distributed RMUs have localized control (implemented within DTC) to adjust the test complexity and test interval. 
24 RMU efficiency is reduced when the CUT size is large. This inefficiency is a result of increased number critical paths which need to be monitored which not only requires more TVR capacity but also would require more testing time to cover the large set of critical paths which need to be tested. Monitor large CUTs is better implemented using multiple RMUs, each monitoring a smaller sub-circuit of the large CUT. However, multiple RMUs increase the area overhead. To reduce this overhead multiple RMUs share a single centralized DTC unit. This organization is shown in Figure 2.4. Each of the DTC i units shown in this figure control a group of CUTs. Centralized DTC has the following benefits: (1) Centralized DTCs will have information about a larger section of the circuit and hence wearout control policies can be enhanced to take advantage of this additional information. (2) The logic required for implementation of the DTC policies can be time shared between multiple CUTs. Gradual nature of wearout allows for such sharing of resources without any loss of monitoring effectiveness. (3) As will be explained in the next section, since error avoidance would work more effectively with larger CUTs, the size of the circuit block which has the central DTC can even be increased enough to match the error avoidance implementation granularity. 25 Figure 2.4: Central DTC for multiple CUTs. The centralized DTC unit receives timing margin information from each of the smaller CUTs that it is monitoring within the large CUT. The worst timing margin from all the smaller CUTs is treated as the overall timing margin of the larger CUT. We illustrate the benefits of a centralized DTC with an example. Assume that a centralized DTC 2 , shown on Figure 2.4, monitors CUT 2-1 to CUT 2-u . At a given instance, DTC 2 identifies that CUT 2-1 has not suffered from timing degradation and the slowest path in this CUT has a delay of D 1 . On the other hand, CUT 2-2 has suffered from some wearout induced timing degradation and hence the slowest path in this CUT has a delay of D 2 > D 1 . CUT 2-3 has degraded even more and its slowest path delay is D 3 > D 2 > D 1 . Assume the rest of the CUTs attached to DTC 2 (i.e. CUT 2-4 to CUT 2-u ) all have a critical path with a delay smaller than D 1 . In the scenario described above CUT 2-3 has the least amount of timing margin left. The uneven wearout of CUTs could be due to many reasons, such as higher utilization of a CUT, presence of hot spots near the CUT. CUT 2-3 dictates the overall timing margin of the larger CUT that is monitored by DTC 2. DTC 2 exploits this global knowledge to 26 reduce the test frequency range for checking all the CUTs it is controlling by reducing maximum test clock frequency for testing to 1/D 3 . Thus, the delay of the paths in none of the CUTs controlled by DTC 2 is going to be checked below D 3 . Even if some CUTs in this group have a delay less than D 3 , this lower delay would not change the output of DTC 2 which is reporting the largest path delay in the group. By reducing the test frequency range DTC 2 prevents redundant testing. This results in reduced testing overhead. Algorithm 2.1: Test frequency range adjustment for multiple CUTs sharing one DTC unit. 
Procedure AdjustTestFreqRange(w, D_Max_init_i-j)
  Inputs:  Number of CUTs (w) attached to DTC_i and the initial delay of the
           slowest path in each of these CUTs (D_Max_init_i-j)
  Output:  Maximum test frequency to be used for all CUTs controlled by DTC_i
           (F_test_Max_i)

  D_Max_i = D_Max_init_i-1;
  for each CUT_i-j with 1 < j <= w
      if D_Max_init_i-j > D_Max_i then D_Max_i = D_Max_init_i-j;
  end
  for each CUT_i-j with 1 <= j <= w
      CritPathTest(j, 1/D_Max_init_i-j, D_Max_i-j);
      if D_Max_i-j > D_Max_i then D_Max_i = D_Max_i-j;
  end
  F_test_Max_i = 1/D_Max_i;

Procedure CritPathTest(j, F_test_Max_i, D_Max_i-j)
  Inputs:  Index of the CUT to be checked (j) and the maximum clock frequency
           to be used for testing (F_test_Max_i)
  Output:  Delay of the slowest path in CUT_i-j (D_Max_i-j)

  Conduct critical path testing for CUT_i-j using the methodology described in
  Section 2.2, with a test frequency range between F_test_Max_i and 1/T_clk, and
  report D_Max_i-j;

Algorithm 2.1 shows the general steps by which a DTC with central knowledge of multiple CUTs uses the monitoring data to update the test clock frequency range used for monitoring. The test frequency range adjustment is executed independently for each DTC_i. During the adjustment procedure, the critical path testing step uses the initial slowest-path delay of each CUT to define the maximum test frequency. This temporarily widens the test frequency range during the tests conducted for the adjustment; normal test phases use the adjusted range. The benefit of the widened range during the adjustment procedure is the ability to detect recovery from wearout and to adjust the test frequency range accordingly.

2.3.2 Error Avoidance Implementation Issues

The CUT size for monitoring should be small enough to keep the RMU's operation efficient. Error avoidance, on the other hand, should be provided at a larger granularity in order to reduce implementation cost and increase its operational efficiency. Error avoidance mechanisms, such as changing voltage/frequency, are more efficient when applied at coarse granularity. For example, the implementation cost of multiple voltage or frequency domains used to change the CUT operational point would increase significantly if the CUT size were small.

Table 2.1: Effects of CUT size on RMU and EAU implementation.

  Implementation Overhead   Small CUT (e.g. FUB)   Large CUT (e.g. Cores)
  RMU                       Low                    High
  EAU                       High                   Low

To address these conflicting goals, a hierarchical design is used. The circuit is divided into multiple error avoidance domains. Each domain's wearout is monitored by multiple distributed autonomous RMUs, and each domain has an error avoidance unit (EAU) which is responsible for taking action to prevent errors in that domain. Each EAU can change the circuit operational point of its designated error avoidance domain, which can comprise multiple CUTs. The difference in implementation overhead between the RMU and the EAU is summarized in Table 2.1. Figure 2.5 shows a circuit which is divided into i error avoidance domains, each with one EAU; hence there are i error avoidance units indexed EAU_1 to EAU_i. There are multiple circuit blocks in each error avoidance domain and each of these blocks is monitored by an independent RMU. The index used for each RMU is of the form i-j, where i represents the EAU number (which is also the index of the error avoidance domain) and j is the RMU number within that domain.
In Figure 2.5 illustration we assume that that the number of RMUs in each error avoidance domain need not be the same for different domains. For example in EAU 1 has p RMUs index as RMU 1-1 to RMU 1-p while EAU 2 has q RMUs. Each RMU shown in Figure 2.5 can be attached to multiple CUTs with a central DTC unit (as described in Section 2.3.1). This is shown for RMU i-2 as a central DTC i-2 attached to u CUTs indexed CUT i-2-1 to CUT i-2-u . Figure 2.5: Autonomous distributed RMUs and their interaction with PEAs. Circuits produced from unreliable components are used in a wide range of applications. There are circuits that can tolerate a certain amount of unreliability while 29 there are other circuits which are required to operate reliably during a long lifetime. In systems with high priority of lifetime reliability, such as in circuits used in medical devices or server class processors dealing with financial transactions, the cost of unreliable operation or unpredictable failure is extremely high. In such systems the budget allocated for reliability monitoring is higher and hence WearMon can be implemented differently compared to other systems where reliable operation does not get such high priority. The proposed hierarchical design provides a highly configurable and customizable methodology for implementation of the WearMon framework which can fit reliability monitoring needs of different systems. 2.4 Experimental Methodology WearMon is not an error correcting mechanism. It provides monitoring capabilities which may in-turn be used to apply error detection or correction mechanisms more effectively. As such, monitoring by itself does not improve reliability. Hence, the aims of our experimental design in this chapter are as follows: First, area overhead and design complexity of RMU are measured using close-to-industrial design implementation of the RMU and all the associated components. Second, we explored test intervals and test complexity state space. Third, performance overhead of monitoring is measured. As described earlier, the overhead of testing can be reduced to zero if testing is done opportunistically. We use software simulation to measure the duration of each opportunity and the distance between two opportunities for zero-overhead testing. In all our experiments effects of timing degradation are deliberately accelerated to measure the worst case overhead of monitoring. 30 2.4.1 FPGA Emulation Setup We implemented the RMU design, including TVR, DTC and RHT using Verilog. We have selected a double-precision Floating Point Multiplier Unit (FPU) as the CUT to be monitored. This FPU implements the 64-bit IEEE 754 floating point standard. The RMU and the FPU CUT are mapped onto a Virtex 5 XC5VLX100T FPGA chip using Xilinx ISE Design Suite 10.1. RHT-L1 is implemented as a 256 entry table using FPGA SRAMs. During each test phase the test vector being injected, testing frequency, test result and FPU temperature are stored as a single entry in RHT-L1. When a test fails the test result is also sent to the RHT-L2, which is implemented on CompactFlash memory, and its size is limited only by size of the available CompactFlash. When RHT-L1 is full the oldest entry in the RHT-L1 will be overwritten, hence, DTC can only observe at most the past 256 tests for making decisions on selecting the test interval. In our implementation DTC reading results from RHT does not impact the clock period of the CUT and it does not interfere with the normal operation of the CUT. 
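A sketch of the RHT-L1 organization used in this setup, written as a simple circular buffer (field names are illustrative; the spill to the CompactFlash-backed RHT-L2 is left as a stub):

    from collections import deque

    class RHTLevel1:
        def __init__(self, capacity=256):
            self.entries = deque(maxlen=capacity)    # oldest entry is overwritten when full

        def log(self, vector, test_freq_mhz, passed, temperature):
            entry = (vector, test_freq_mhz, passed, temperature)
            self.entries.append(entry)
            if not passed:
                self.spill_to_l2(entry)              # failing tests are also copied to RHT-L2

        def spill_to_l2(self, entry):
            pass                                     # RHT-L2 resides on CompactFlash in the FPGA setup

        def failures_in_last(self, window=8):
            recent = list(self.entries)[-window:]
            return sum(1 for (_, _, passed, _) in recent if not passed)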
Furthermore, reading RHT can be done simultaneously while test injection results are being written to RHT. One important issue that needs to be addressed is the difference that exists between FPU's implementation on FPGA as compared to ASIC (Application Specific Integrated Circuit) synthesis. The critical paths in the FPU would be different between FPGA and ASIC implementations. These differences would also be observed between instances of ASIC implementations of the same unit when different CAD tools are used for synthesis and place and routing, or between different fabrication technology nodes. 31 However, to maximize implementation similarity to ASIC designs, the FPU is implemented using fifteen DSP48E Slices which exist on the Virtex 5 XC5VLX100T FPGA chip. These DSP slices use dedicated combinational logic on the FPGA chip rather than the traditional SRAM LUTs used in conventional FPGA implementations. Using these DSP slices increases the efficiency of the FP multiplier and would make its layout and timing characteristics more similar to industrial ASIC implementation. Section 2.2.2 described an approach for selecting test vectors to fill TVR by exploiting designer's efforts to characterize and test critical paths. For the FPU selected in our experiments we did not have access to any existing test vector data. Hence, we devised an approach to generate test vectors using benchmark driven input vector generation. We selected a set of five SPEC CPU2000 floating point benchmarks, Applu, Apsi, Mgrid, Swim, Wupwise, and generated a trace of floating point multiplies from these benchmarks by running them on Simplescalar [7] integrated with Wattch [18] and Hotspot [55]. The simulator is configured to run as a 4-way issue out-of-order processor with Pentium-4 processor layout. The first 300 million instructions in each benchmark were skipped and then the floating point multiplication operations in the next 100 million instructions were recorded. The number of FP multiplies recorded for each trace ranges from 4.08 million (Wupwise) to 24.39 million (Applu). Each trace record contains the input operand pair, the expected correct output value, and the FPU temperature. These traces are stored on a CompactFlash memory accessible to the FPU. Initially the FPU is clocked at its nominal operating frequency reported by the synthesis tool and all input traces produce correct outputs when the FPU runs at this 32 frequency. The FPU is then progressively over-clocked in incremental steps to emulate the gradual timing degradation that the FPU experiences during its lifetime. We then feed all input traces to the FPU at each of the incremental overclocked step. At the first overclocking step, above the nominal operating frequency, input operand pairs that exercise the longest paths in the FPU will fail to meet timing. At the next overclocking step more input operand pairs that exercise the next set of critical paths will fail, in addition to the first set of failed input operand pairs. In our design, FPU's nominal clock period is 60.24 nanoseconds (16.6 MHz) where all the input operand pairs generate the correct output. We then overclocked the FPU from 60.24 nanosecond clock period to 40 nanosecond clock period in equal decrements of 6.66 nanoseconds. These correspond to 4 frequencies: 16.6 MHz, 18.75 MHz, 21.42 MHz, and 25 MHz. The clock period reduction in each step was the smallest decrement we could achieve on the FPGA board. 
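A sketch of this benchmark-driven characterization, assuming a hypothetical run_at() hook that replays one operand pair on the overclocked FPU and returns its output (the selection of the top failing pairs is discussed next):

    def rank_vectors_by_margin(trace, run_at, periods_ns, tvr_size=1000):
        # trace: (operand_a, operand_b, expected_result) records from the benchmark runs.
        # periods_ns: clock periods tried from the nominal period downwards, e.g. 60.24 ns to 40 ns.
        failing_order, seen = [], set()
        for period in sorted(periods_ns, reverse=True):          # mildest overclocking first
            for a, b, expected in trace:
                if (a, b) in seen:
                    continue
                if run_at(period, a, b) != expected:
                    failing_order.append((a, b))                 # fails earliest => exercises the slowest paths
                    seen.add((a, b))
        return failing_order[:tvr_size]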
Percentage of the FP multiply instructions that failed as the test clock frequency was increased are shown for multiple benchmarks in Figure 2.6. When the CUT is initially overclocked by a small amount the number of input vectors that fail is relatively small. These are the vectors that activate the critical paths with the smallest timing margin. We selected the top 1000 failing input operand pairs as the test vectors to fill the TVR. As CUT is overclocked beyond 25 MHz (simulating a large timing degradation), almost all input vectors fail across all benchmarks. 33 Figure 2.6: Timing margin degradation. 2.4.2 Three Scenarios for Monitoring We emulated three wearout scenarios for measuring the monitoring overhead. The three scenarios are: Early-stage, Mid-stage, and Late-stage monitoring. Early-stage monitoring: In early-stage monitoring we emulated the condition of a chip which has just started its in-field operations, for instance, the first year of chip's operation. Since the amount of timing degradation is relatively small, it is expected that the CUT will pass all the tests conducted by RMU. To emulate this condition, RMU tests the CUT at only the nominal frequency, which is 16.6 MHz in our experimental setup. The test vectors injected into the CUT do not produce any errors and hence DTC does not change either test interval or test complexity. We explored two different test intervals, where tests are conducted at intervals of 100,000 cycles and 1,000,000 cycles. At each test injection phase we also explored two different test complexity settings. We used either a test complexity of 5 test vectors or 20 test vectors for each test phase. The early- stage monitoring allows us to measure the overhead of WearMon in the common case when no errors are encountered. 34 Mid-stage monitoring: In mid-stage monitoring we emulated the conditions when the chip's timing margin has degraded just enough so that some of the circuit paths that encroach into the guardband of the CUT start to fail. These failures are detected by RMU when it tests the CUT with a test frequency near the high end of the test frequency range (closer to frequency without any guardband). We emulated this failure behavior by selecting the frequency for testing the CUT to be higher than the nominal operation frequency of the CUT. In our experiments we used 18.75 MHz for testing, which is 12.5% higher than the nominal frequency. Since this testing frequency mimics a timing degradation of 12.5% some test vectors are bound to fail during CUT testing. In mid-life monitoring DTC's adaptive testing is activated where it dynamically changes the test interval. In our current implementation DTC uses the number of failed tests seen in the last 8 test injections to determine when to activate the next test phase. We explored two different test interval selection schemes. The first scheme uses linear decrease in the test interval as the fail rate of the tests increase. The maximum (initial) test interval selected for the emulation purpose is 100,000 cycles and this will be reduced in equal steps down to 10,000 cycles when the fail rate is detected as being 100%. For instance, when the number of test failures in the last eight tests is zero then DTC makes a decision to initiate the next test after 100,000 cycles. If one out of the eight previous tests have failed then DTC selects 88,750 cycles as the test interval (100,000-(90,000/8)). 
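This linear rule reduces to a one-line formula; a minimal sketch (parameters follow the emulated setting of 100,000 cycles decreasing to 10,000 cycles over a window of eight tests):

    def linear_test_interval(fail_count, window=8, max_interval=100_000, min_interval=10_000):
        # The interval shrinks by an equal step for every failure in the last `window` tests:
        # one failure out of eight gives 100,000 - 90,000/8 = 88,750 cycles.
        step = (max_interval - min_interval) / window
        return int(max_interval - fail_count * step)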
An alternative scheme uses initial test interval of 1 million cycles and then as the error rate increases the test interval is reduced in eight steps by dividing the interval to half for each step. For instance, when the number of test failures in the last eight tests is zero then DTC makes a 35 decision to initiate the next test after 1,000,000 cycles. If two out of the eight previous tests have failed then DTC selects 250,000 cycles as the test interval (1,000,000/2 2 ). The exponential scheme uses more aggressive testing when the test fail rate is above 50% but it would do significantly more relaxed testing when the fail rate is below 50%. These two schemes provide us an opportunity to measure how different DTC adaptation policies will affect testing overheads. Late-stage monitoring: In the late-stage monitoring scenario the chip has aged considerably. Timing margin has been reduced significantly with many paths in the CUT operating with limited timing margin. In this scenario, path failures detected by RMU are not only dependent on the test frequency but are also dependent on the surrounding operating conditions. For instance, a path that has tested successfully across the range of test frequencies during a specific operating condition (e.g. supply voltage and temperature) may fail when the test is conducted under a different operating condition. The reason for the prevalence of such a behavior is due to non-uniform sensitivity of the paths to effects such as Electromigration and NBTI that occur over long time scales. The goal of our late-stage monitoring emulation is to mimic the pattern of test failures as discussed above. One reasonable approximation to emulate late-stage test failure behavior is to use temperature of the CUT at the time of testing as a proxy for the operating condition. Hence, at every test phase we read the CUT temperature which has been collected as part of our trace to emulate delay changes. For every 1.25 C raise in temperature the baseline test clock period is reduced by 6.66 nanoseconds. We recognize that this is an unrealistic increase in test clock frequency with a small temperature 36 increase. However, as mentioned earlier our FPGA emulation setup restricts each clock period change to a minimum granularity of 6.66 nanoseconds. This assumption forces the RMU to test the CUT much more frequently than would be in a real world scenario and hence the adaptive testing mechanism of DTC is much more actively invoked in the late- stage monitoring scenario. 2.4.3 Dynamic Adaptation of RMU Figure 2.7 shows how DTC dynamically adapts the test interval to varying wearout conditions in Late-stage monitoring scenario with test complexity of 20 and using the linear test interval adaption scheme. The data in this figure represents a narrow execution slice from Apsi running on FPGA implementation of FPU. The horizontal axis shows the trace record number, which represents the progression of execution time. The first plot from the bottom shows the test fail rate seen by DTC. The highlighted oval area shows a dramatic increase the test fail rate and the middle plot shows how DTC reacts by dynamically changing the interval between Test Enable signals. Focusing again on the highlighted oval area in the middle plot, it is clear that the test interval has been dramatically reduced. The top plot zooms in on one test phase to show 20 back-to-back tests corresponding to the testing complexity 20. 37 Figure 2.7: Dynamic adaptation of RMU. 
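The exponential interval-adjustment scheme from Section 2.4.2 can be sketched in the same way as the linear rule:

    def exponential_test_interval(fail_count, max_interval=1_000_000):
        # The interval halves for every failure in the last eight tests:
        # two failures give 1,000,000 / 2**2 = 250,000 cycles.
        return max_interval // (2 ** fail_count)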
2.5 Evaluation Results In this section results related to area and performance overhead of the WearMon framework are presented. Opportunities to perform tests without interrupting the normal operation of the processor are studied in Section 2.5.3. 2.5.1 Area Overhead RMU and FPU implemented on FPGA utilize 4994 FPGA slices out of which 4267 are used by the FPU and only 727 slices are used for the RMU implementation. Out of the 8818 SRAM LUTs used in our design the RMU consumes only 953 LUTs while the remaining 7865 LUTs are used by the FPU. The FPU also uses fifteen dedicated DSP48E slices for building the double precision FP multiplier, while only one DSP48E slice is used by RMU logic. This shows the very low area overhead of the RMU compared to the area of the CUT it is monitoring. Majority of the FPU is implemented in the large dedicated DSP48E blocks and we estimate the area overhead of the RMU in an efficient ASIC implementation to be far below 3% of the block it is monitoring. 38 2.5.2 Monitoring Overhead Figure 2.8(a) shows the execution time overhead of testing compared to the total execution time of the benchmark traces. The horizontal axis shows the three monitoring scenarios, Early-stage (labeled Early), Mid-stage (Mid) and Late-stage (Late) monitoring. Vertical axis shows the number of test injections as a percentage of the total trace length collected for each benchmark. DTC uses linear decrease in test interval from 100,000 cycles to 10,000 cycles depending on the test fail rates. Results with test complexity (TC) of 5 and 20 test vectors per test phase have been shown. The Early-stage monitoring overhead is fixed for all the benchmarks and depends only on the test interval and complexity. Testing overhead varies per benchmark during Mid-stage and Late-stage testing. The reason for this behavior is that benchmarks which utilize the CUT more frequently, i.e. benchmarks that have more FP multiplications, will increase CUT activity factor which in turn would accelerate CUT degradation. Degradation would be detected by DTC and it will increase the monitoring overhead to test the CUT more frequently. The worst case overhead using Late-stage monitoring scenario is only 0.07%. (a) (b) Figure 2.8: Overhead of (a) linear (b) exponential schemes. 39 (a) (b) Figure 2.9: Test fail rates for (a) linear (b) exponential schemes. Figure 2.8(b) shows the same results when DTC uses exponential test intervals. Note that the vertical axis scale range is different between Figure 2.8 parts (a) and (b). Comparing the percentage of time spent in testing between the linear and exponential interval adjustment schemes, it is clear that exponential method results in less testing overhead in almost every one of the emulated scenarios. This observation is a direct result of the fact that exponential test intervals start with a higher initial test interval setting and DTC only decreases the testing interval (to conduct tests more frequently) when test failures are detected. Figure 2.9 shows the percentage of the tests that have failed in each of the emulation schemes described earlier for Figure 2.8. The vertical axis shows only Mid-stage and Late-stage monitoring schemes since Early-stage does not generate any test failures. On Figure 2.9(a) linear test interval scheme results are shown for both the test complexities emulated, TC=5 and TC=20. Figure 2.9(b) shows similar results for the exponential test interval scheme. 
Test fail rates increase dramatically from Mid-stage to Late-stage scenario, as expected during in-field operation. Benchmarks such as Apsi and Wupwise have zero test failures in the mid stage because they do not stress 40 the FPU as much as the other benchmarks and hence the emulated timing degradation of FPU is small. However, in the Late-stage emulations FPU's timing degrades rapidly and hence the test fail rates dramatically increase. There is direct correlation between the fail rate observed in Figure 2.9 and the test overhead reported in Figure 2.8 which is a result of DTC's adaptive design. 2.5.3 Opportunistic Testing Testing overhead can be reduced to zero if tests are performed opportunistically when a CUT is idle. Two important characteristics of testing opportunities have been studied in this section: 1) The duration of an opportunity 2) The distance between opportunities. For generating these results we used Simplescalar [7] simulator configured as a 4-wide issue out-of-order processor with 128 in-flight instructions and 16KB L1 I/D cache. Since these opportunities depend on the application running on the processor, we run 10 benchmarks from SPEC CPU2000 benchmark suite, each for one billion cycles. The benchmarks used are Gzip, Crafty, Bzip, Mcf, and Parser in addition to the five used in our FPGA emulations. There are two types of opportunities studied: local and global. Local opportunities: These exist when only a particular CUT in the processor is idle. Difference in utilization of functional units caused by unequal distribution of different instruction types in the execution trace and variable execution latencies for different units in the processor will result in local opportunities. Our experiments are carried out using execution-driven simulations with a detailed processor model which has one floating point multiplier/divider (FPMD), four integer ALU's, one floating point ALU (FPU), and one integer multiplier/divider unit (IMD). We measure the duration of 41 idle periods and distance between successive idle periods for the above units to quantify local opportunities which exist for testing them. Global opportunities: These opportunities exist when multiple CUTs in the processor are all idle at the same time. For instance, after a branch misprediction the entire pipeline is flushed leaving most of the CUTs in the processor idle. Cache misses, branch mispredictions, and exceptions will result in global opportunities. To quantify the global opportunities we measure duration and distance between idle periods at the pipeline stage level for instruction fetch (IF), integer execution (EINT), and the retirement stage (RE). Furthermore, distance between L1 data cache misses (L1DCM), branch mispredictions (BMP), L2 cache misses (L2CM), and L1 instruction cache misses (L1ICM) have also been measured. Figure 2.10(a) shows the distribution of opportunity durations in terms of percentage of total execution cycles for the above mentioned local and global opportunities. Data collected for the four ALUs has been averaged and shown as one data set. Distribution of distance between consecutive opportunities has been shown in Figure 2.10(b). Please note that the vertical axis for both Figure 2.10(a) and (b) are in logarithmic scale. Rightmost cluster of columns in these figures shows the cumulative value of all the opportunities with duration/distance above 1000 cycles. 
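The duration and distance statistics reported in Figure 2.10 can be extracted from a per-cycle busy/idle trace of each CUT; a minimal sketch:

    def opportunity_stats(busy_trace):
        # busy_trace: one boolean per cycle, True while the CUT is busy, False while idle.
        durations, distances = [], []
        idle_len, busy_len = 0, 0
        for busy in busy_trace:
            if busy:
                if idle_len:                          # an idle window just ended
                    durations.append(idle_len)
                    idle_len = 0
                busy_len += 1
            else:
                if idle_len == 0 and durations:       # a new idle window starts
                    distances.append(busy_len)        # busy cycles since the previous window
                busy_len = 0
                idle_len += 1
        if idle_len:
            durations.append(idle_len)
        return durations, distances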
Data on integer multiplier/divider unit has not been shown on Figure 2.10(b) because all opportunities for this unit (0.2% of execution cycles) where within 100 cycles of each other. 42 (a) (b) Figure 2.10: Distribution of (a) opportunity duration (b) distance between opportunities. Most local and global opportunities studied have duration of 10-100 cycles, which is sufficient time for testing with multiple test vectors. More detailed breakdown of the low duration opportunities, below 100 cycles, indicates that for the eight units studied on average, cases with duration lower than 10 clock cycles account for 0.8% of the total simulated execution cycles. Opportunities of duration 10-50 cycles account for 0.4% of the execution cycles and then opportunities of 50-100 cycles duration account for 0.01%. 43 Since these opportunities occur very close to each other it is even possible to spread a single test phase across two consecutive opportunity windows. The average percentage of local and global opportunities of each duration group is shown on Figure 2.10(a) as trend curves. Percentage of global opportunities with high duration rapidly drops for opportunity durations above 400 cycles. This behavior indicates that global opportunities must be used for shorter test phases. Local opportunities, on the other hand, have long durations making them ideal for longer test phases. Based on average trends shown in Figure 2.10(b), local opportunities are prevalent when the distance between opportunities is small. However, global opportunities are prevalent both at short distances as well as long distances. While Figure 2.10(a) provides us with valuable information on the duration of opportunities available for testing, Figure 2.10(b) shows if these opportunities happen often enough to be useful for our monitoring goals. These results show that there are more test opportunities than the number of tests needed in our monitoring methodology and hence only some of the opportunities are going to be taken advantage of when the monitoring unit schedules a test. To better utilize the idle cycles, DTC would specify a future time window in which a test must be conducted, rather than an exact time to conduct a test. When a CUT is idle within the target time window the test is conducted. In rare occasions when there are insufficient opportunities for testing, DTC can still force a test by interrupting the normal CUT operation. As shown in Section 2.5.2 even in such a scenario performance overhead of WearMon is low. 44 Monitoring multiple small CUTs instead of one large CUT would provide the possibility to take advantage of more fine grain local opportunities while other parts of the processor continue normal operation. Monitoring small CUTs is beneficial even if testing is not performed opportunistically because even if a test is forced and the normal operation of a CUT is interrupted, other parts of the processor can still continue functioning. 2.6 Related Work There have been prior works on modeling, detection, correction, and prediction of wearout faults. Modeling chip lifetime reliability using device failure models has been studied in [52, 57]. These models provide an architecture-level reliability analysis framework which is beneficial for prediction of circuit wearout at design time but they do not address actual in-field wearout of the circuit which is highly dependent on dynamically changing operation conditions. Others have studied methods for error detection and correction. 
The mechanism proposed by Bower et al. [17] uses a DIVA [9] checker to detect hard errors and then the units causing the fault are deconfigured to prevent the fault from happening again. Shyam et al. [54] explored a defect protection method to detect permanent faults using Built-in Self Test (BIST) for extensively checking the circuit nodes of a VLIW processor. BulletProof [21] focuses on comparison of the different defect-tolerant CMPs. Double sampling latches which are used in Razor [25] for voltage scaling can also be used for detecting timing degradation due to aging. A fault prediction mechanism for detecting NBTI-induced PMOS transistor timing violations have been studied in [2]. This failure prediction is done at runtime by 45 analyzing data collected from sensors that are inserted at various locations in the circuit. It has also been shown that some of device timing degradation, such as those caused by NBTI, can be recovered from by reducing the device activity for sufficient amount of time [4]. Blome et al. [13] use a wearout detection unit that performs online timing analysis to predict imminent timing failures. FIRST (Fingerprints In Reliability and Self Test) [56], proposes using the existing scan chains on the chip and performing periodic tests under reduced guardbands to detect wearout. With WearMon, the optimal time for using many of the error detection and correction methods studied in the works mentioned above can be selected based on the information provided by the monitoring unit. [2, 13] have suggested methods which use separate test structures that model the CUT such as buffer chains or sensing flip flops and main advantage of our method compared to these previous approaches is that our test vectors activate the actual devices and paths that are used at runtime, hence each test will capture the most up-to-date condition of the devices taking into account overall lifetime wearout of each device. Monitoring sufficient number of circuit paths using many of mentioned approaches [17, 25, 54, 56] requires significant extra hardware and also lacks the flexibility and adaptability of the mechanism proposed in this work. WearMon not only has the adaptability advantage to reduce performance overheads due to testing (specially during the early stages of the processors lifetime) but also is more scalable and customizable for increased coverage without incurring significant modification to the circuit; monitoring of additional paths can simply be done by increasing the size of the 46 TVR. Furthermore, the capability to dynamically select the circuit paths to test, results in a more targeted testing which reduces the number of performed tests. 2.7 Summary and Conclusions As processor reliability becomes a first order design constraint for low-end to high-end computing platforms there is a need to provide continuous monitoring of the circuit reliability. Reliability monitoring will enable more efficient and just-in-time activation of error detection and correction mechanisms. In this chapter, we presented WearMon, a low cost architecture for monitoring a circuit using adaptive critical path tests. WearMon dynamically adjusts the monitoring overhead based on current operating conditions of the circuit. We showed that runtime adaptability is essential for robust monitoring in the presence of variations in operating conditions. 
Furthermore, WearMon framework can be configured to work with preemptive error avoidance mechanisms to increase circuit reliability and prolong the lifetime of the circuits. We showed that the proposed design is feasible with minimal area overhead and design complexity. FPGA emulation results show that even in the worst case wearout scenarios the adaptive methodology incurs negligible performance penalty. We also showed that numerous opportunities which are suitable for multi-path testing exist when different parts of the processor are not being fully utilized. 47 Chapter 3 WAT: A Cross-layer Wearout Analysis Tool The first manifestations of wearout occur at the device level in the form of transistor timing degradation. Slower transistor switching time then manifests at the circuit and microarchitecture layers in the computer system stack in the form of logic gate timing degradation and signal path timing degradation. These path timing degradations eventually results in timing violations and execution failures at the software level. Device usage conditions such as device switching activity rate, state of devices (i.e. probability of a transistor being in “on” or “off” state), as well as the operation voltage and device temperature affect which transistors in the circuit wearout and how much they degrade. Device usage characteristics are impacted primarily in response to workload execution demands. Understanding how workload execution impacts device utilization is critical in developing reliability solutions that span multiple layers of a computer system stack. Cross-layer solutions must analyze the impact of software execution on the usage of 48 transistors and interconnects within the circuit. But there has been a dearth of tools that enable designers to understand how software impacts the device utilization. The next part of this dissertation tackles this important problem. This chapter describes our novel cross-layer wearout analysis tool (WAT) that combines FPGA-based emulations with software simulation. FPGA emulation coupled with gate-level simulations provides fast and accurate characterizations of the runtime behavior of devices in the circuit while real workloads are running on the system. 3.1 Introduction We will start this section by presenting an overview of our cross-layer analysis tool. Figure 3.1 shows different stages of this tool and how they interact. WAT is comprised of four stages. In the first stage the RTL code of a processor as well as a group of benchmarks that can run on the processor are used as inputs to WAT. The output of this stage is a trace of the signal activity on the inputs of a function unit block (FUB) within the processor. This FUB is the primary target of WAT’s wearout analysis. The second stage synthesizes the target FUB (i.e. a sub block of the circuit emulated on FPGA) whose input signal activity was captured in the first stage. This synthesis produces an ASIC implementation of the FUB using logic gates from a standard cell library. Circuit structure of this ASIC implementation (i.e. logic gates used, their sizes, and how they are connected to form signal paths) closely match an industrial strength implementation of the FUB. Static timing analysis is also performed during stage two which generates the timing profile of the synthesized FUB. The third stage conducts gate- level simulation of the target FUB. Gate-level simulation uses two of the outputs that 49 were generated in the previous two stages. 
First input to stage three is the gate-level netlist of the FUB which was generated in stage two. The second input to this stage is the input activity trace collected during the FPGA emulation. The output of the third stage is switching activity statistics (e.g. number toggle, number of cycles at logic value 0 or 1, etc.) for all the internal node of the FUB’s gate-level implementation (i.e. inputs and outputs pins for the logic gates in the FUB). Then in stage four the standard cell library used for synthesis is used translate activity at the inputs and outputs of the logic gates in the design to activity and state of the transistors within each gate. Figure 3.1: An overview of the cross-layer wearout analysis tool (WAT). WAT uses FPGA emulations coupled with gate-level simulation for understanding how software impacts device wearout. 50 Section 3.2 presents details of the above four stages of the framework. Section 3.3 presents an application of WAT in pre-fabrication wearout simulation and we will present result obtained by using WAT. Section 3.4 compares WAT with alternative tools for cross-layer wearout analysis and highlights its advantages. Section 3.5 describes a few other applications of WAT. Section 3.6 has the summary and conclusions of this chapter. 3.2 Implementation Details of WAT 3.2.1 Stage 1: FPGA-based Emulation and FUB Input Trace Collection In this stage, the RTL level design of processor whose reliability behavior is being analyzed is mapped onto an FPGA. The current implementation of WAT uses OpenSPARC T1 [45] processor for this purpose. RTL design of OpenSPARC T1 will be synthesized and mapped a Xilinx ML509 FPGA development board [68]. WAT assumes that the processor architect can approximately prioritize critical FUBs within the processor design that can significantly impact the processor’s overall wearout. These FUBs are then selected for cross-layer analysis. In our current study we selected a wide range of FUBs to study their wearout behavior. The inputs ports of a selected FUBs are probed for signal trace collection during FPGA emulation. The trace consists of a collection of input vectors that were sent into the FUB while running an application on the emulated processor on the FPGA. WAT currently can boot Solaris operating system and can run a SPARC ISA binary. We have used Xilinx ChipScope Pro Integrated Logic Analyzer IP [68] which allows probing and monitoring of the input ports of the target FUB with minimal intrusion in the design. Trace of inputs to the target FUBs can be collected in the form of a complete time trace 51 or in form of a value change dump (VCD) files which only store time stamped transitions of input signals. Note that the amount of information collected using either of the mentioned trace formats would be the same but VCDs are more compact. Figure 3.2 shows the six ML509 evaluation boards that we used in current WAT setup. Use of multiple FPGA boards is not a requirement but having them allows concurrent execution of multiple workloads and concurrent collection of multiple input traces for multiple FUBs. Figure 3.2: Six FPGAs mounted on evaluation boards used in the first stage of cross-layer analysis setup. 3.2.2 Stage 2: ASIC Synthesis and Static Timing Analysis The FPGA emulations in the previous stage enable WAT to emulate the entire processor RTL and run a complete system software stack. The goal of stage two in WAT is to create an ASIC implementation of the FUBs selected in stage one. 
This implementation would have logic gates, signal paths, and an ASIC circuit structure 52 similar to what would eventually be fabricated on silicon. FPGA mapping of a processor provides only a functionally equivalent view of the ASIC implementation of the processor on silicon. But the underlying circuit structure (e.g. gates, paths, etc.) differ significantly between FPGA and ASIC implementation. Hence, in order to perform gate- level utilization analysis, WAT requires an ASIC implementation of the selected FUBs. The outputs of such implementation would be a gate-level netlist of each selected FUB that closely follows the actual implementation on silicon and hence gate and path-level analyses can be accurately done. Any design synthesis and timing analysis tool such as those provide by commercial vendors like Synopsys and Cadence can be used in stage two. Any design cell library can also be used for the synthesis. WAT currently uses Synopsys Design Complier for synthesis using a 90nm standard cell library and Synopsys PrimeTime for static timing analysis. 3.2.3 Stage 3: Gate-level Simulation and Switching Activity Data Collection In stage three the ASIC implementation of each selected FUB is simulated using gate-level simulation. Complete ASIC implementation of the whole processor is possible but the resulting gate-level netlist will be large and simulating that entire processor will be very slow. Hence, WAT’s approach of targeting only a few select FUBs and analyzing them independently using ASIC implementation is a much faster approach without compromising accuracy. WAT simulates each FUB using input traces collected during stage one. Depending on the designer’s desire, stage one can use multiple application runs to collect 53 a large number of input traces that correspond to various likely usage scenarios in field. During stage three the trace collected from each application run on the FPGA is used as the test bench to drive gate-level simulation of the selected FUB. This gate-level simulation can be done using tools such as those provided by Cadence, Mentor Graphics, or Synopsys. In WAT we use the ncsim tool provided by Cadence to perform gate-level simulation. During gate-level simulation a Switching Activity Interchange Format (SAIF) file is generated for each simulation run which tell us the exact number toggles for each of the nets in the circuit simulated. Furthermore, data regarding the state of each of these nets (e.g. number of cycles at logic value 0 or 1) are also reported. 3.2.4 Stage 4: Translation of Gate Port Activity to Transistor Activity There are many gates in a digital CMOS circuit which are each built from a group of PMOS and NMOS transistors. Since wearout degrades transistors, the most accurate wearout models also describe the degradation of individual transistors. The inputs to these wearout models are usage information of the transistor as well as its operation conditions (i.e. voltage and temperature) and their output is the amount wearout-induced timing degradation the transistor has suffered from. We will use an example AND and NAND gates to clearly highlight how the gate-level switching and state activity data collected in stage three of WAT is translate into switching and state activity data for transistors in of the circuit. Figure 3.3 shows the transistor level design of a 2 input NAND gate and a 2 input AND gate. 
Figure 3.3(a) highlights how the activity of inputs "A" and "B" of a NAND gate translates directly into the activity of the two PMOS and two NMOS transistors in this design, because these inputs are wired directly to the gates of those transistors. Figure 3.3(b) shows how an AND gate uses more transistors at the transistor level: it is built from a NAND gate followed by an inverter (INV gate). In the AND gate, the switching activity on inputs "A" and "B" again translates directly into the switching activity of the four transistors connected to these two inputs (the four transistors forming the first-level NAND function). However, the switching activity of the two transistors forming the inverter that drives output "OUT" is not directly represented by the switching activity of inputs "A" and "B", since there is no direct wire connection. To find the switching activity of these two inverter transistors, the activity on node "C", which is connected to the gates of the inverter's PMOS and NMOS transistors, is needed. This activity can be inferred from the switching and state data reported on the "OUT" port of the gate, and the same information can be used for wearout calculation of any two-level gate such as the AND gate. For example, the probability of logic value 1 on node "C" of the AND gate in Figure 3.3(b) equals one minus the probability of logic value 1 on the "OUT" port, and the toggle probabilities of nodes "C" and "OUT" are equal. The same systematic analysis of the transistor-level design of each gate can be used to infer the switching activity and state probability of every transistor in every gate of the target FUB. The standard cell library used for synthesis contains the transistor-level design of all the gates and serves as the reference design during this translation stage.

Figure 3.3: Transistor level design of 2 input (a) NAND and (b) AND gates, highlighting how switching activity and logic state on the inputs and outputs of logic gates translate into switching activity and logic state on the gates of the PMOS and NMOS transistors in the design.

To summarize, WAT comprises four stages. An application is executed on a full-system FPGA emulation of the target circuit. During this emulation the input ports of a target FUB are monitored and time traces of the input transitions are collected. These input transition traces are then fed into a gate-level simulation of the same FUB. At the end of the gate-level simulation, switching activity and logic state data are generated for all internal nodes (i.e. inputs and outputs of logic gates) in the gate-level implementation of the target FUB. Finally, in stage four, the switching and state activity data on the input/output ports of the logic gates are translated into transistor activity and state information by referencing the standard cell library used for synthesis of the FUB. The next section presents an application of this information in wearout and reliability analysis.

3.3 An Application of WAT: Accurate Wearout Simulation for Chip Lifespan Prediction

One of the most important challenges in dealing with increasing circuit wearout is the lack of robust and accurate long-term wearout simulation tools.
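Before turning to that application, the stage four translation rule for the AND gate of Figure 3.3(b) can be written down directly; a brief sketch (names follow the figure):

    def and_gate_internal_node_stats(p_out_one, toggle_prob_out):
        # Node C drives the gates of the inverter's PMOS and NMOS transistors and is the
        # logical inverse of the OUT port, so its statistics follow from the OUT statistics.
        p_c_one = 1.0 - p_out_one          # C = NOT(OUT)
        toggle_prob_c = toggle_prob_out    # C toggles exactly when OUT toggles
        return p_c_one, toggle_prob_c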
Accurate modeling of lifetime degradation due to wearout and prediction of a circuit's lifespan are critical in several application domains. For instance, in mission critical systems, quantifying the chip's expected lifespan accurately can help system designers build appropriate error tolerance mechanisms at design time. Similarly, knowing a chip's remaining lifetime in a mission critical system can be used to determine when to preemptively replace failing components and avoid surprise failures. Many of the electrophysical phenomena that cause wearout, such as HCI, depend on the activity factor of a transistor: degradation occurs when transistors switch. Other electrophysical wearout phenomena, such as NBTI, depend on the state of the transistor (e.g. logic value 0 on the gate of a PMOS transistor is the stress condition for NBTI). Hence, accurate lifetime modeling requires accurate quantification of device-level switching and state statistics. Conventional lifetime wearout estimation methodologies (such as those used in [29, 62, 65]) use fixed values or value ranges for the probability of transistor activity or state. Many of these lifetime prediction models simply assume that inputs to a circuit are equally likely to be either a 0 or a 1. These tools then propagate these inputs through the circuit to measure how various gates within the circuit are stressed. However, it is intuitively clear that not all transistors in a circuit block have equal switching or state probabilities. When lifetime estimations rely on the simplified assumptions highlighted above, they may overestimate, or worse, underestimate the amount of wearout and the failure probability. Wearout is inherently a multi-layer issue where application execution directly impacts how devices wear out. We will present how the cross-layer analysis data collected through WAT can be used to quantify the activity factor (i.e. the number of toggles during a simulated period) of each internal node of a group of circuits, and to quantify the state probability (i.e. the number of clock cycles each node spent at logic value 1 or 0) of each of these nodes.

3.3.1 Building Blocks of a Lifespan Prediction Framework

A framework for accurate lifespan prediction has multiple stages. First, accurate models of how the devices in the circuit wear out during usage are needed. Next, models that represent the structure of the circuit built from those devices are needed. Once this platform is available, the most challenging task is to provide inputs to the wearout models so that they generate accurate timing degradation values for the wearout caused by those inputs. In order to accurately predict the lifespan of a chip we need to be able to estimate the workload the circuit is going to execute and, more importantly, to observe the impact of that execution on each and every device in the circuit; this is where WAT is used. The lifetime wearout simulation referred to in this section can be divided into two stages: frontend and backend. The frontend of the simulation framework is where static timing analysis as well as device-level activity statistics are generated using WAT. The backend of the framework uses the information gathered in the frontend as inputs to wearout models, which then quantify the degradation of devices. The backend of the framework is outside the scope of the contributions of this dissertation.
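A minimal sketch of this frontend-to-backend hand-off is shown below, assuming the per-net T0/T1/TC counters that a SAIF dump typically provides; the stress expressions are simplified proxies used only for illustration, not calibrated wearout models (those belong to the backend, which is outside the scope of this dissertation):

    def node_probabilities(t0, t1, toggle_count, total_time):
        # Per-node statistics from stage three: time at logic 0, time at logic 1, toggle count.
        p_one = t1 / total_time
        p_zero = t0 / total_time
        p_toggle = toggle_count / total_time
        return p_one, p_zero, p_toggle

    def stress_proxies(p_one, p_toggle, transistor_type):
        # NBTI stresses a PMOS transistor while its gate is at logic value 0;
        # HCI-style degradation grows with the switching activity of the device.
        nbti_stress_fraction = (1.0 - p_one) if transistor_type == "PMOS" else 0.0
        hci_stress_proxy = p_toggle
        return nbti_stress_fraction, hci_stress_proxy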
3.3.2 Gate-level Activity Statistics

To generate accurate device-level usage statistics we use WAT to analyze the device utilization of multiple circuit blocks of the OpenSPARC T1 processor. The RTL design of T1 is mapped onto the FPGA and 10 benchmarks from SPEC CPU2000 are executed as part of Stage 1 of WAT. The following 10 benchmarks were used: Ammp, Bzip2, Crafty, Equake, Gzip, Mcf, Parser, Perlbmk, Twolf, and Vpr. The activity of every input port and internal node of 12 different FUBs of the OpenSPARC T1 processor is logged. Table 3.1 lists the 12 FUBs used in this study. Two types of statistics are analyzed: probability of logic state (i.e. the probability of having a 0 or a 1) and toggle probability. Note that the internal-node statistics presented in this chapter report data only on the inputs to gates and not on their outputs. The switching activity statistics on the output ports of gates are not used in order to avoid double counting internal nodes in the results reported for the circuit: the output of every gate in the circuit is connected to at least one input of another gate or to an output of the FUB, and data on those nodes is already reported. The results presented in the remainder of this section are in the form of distributions and cumulative distribution functions (CDFs), which summarize the thousands of data points into easily readable plots. Note, however, that the backend of the wearout prediction framework will use the raw statistics gathered for each of the hundreds of input ports and thousands of internal nodes in the circuit.

Table 3.1: List of circuit blocks from the OpenSPARC T1 processor which are used in the cross-layer analysis.

  FUB Name      Description                                                      Internal Node Count   Input Port Count
  IFU_DEC       Instruction decoder of instruction fetch unit                    1668                  90
  EXU_ECL       Execution control logic of the execution unit                    4813                  221
  LSU_STB_CTL   Control logic for store buffer of load/store unit                1823                  67
  EXU_RML       Register management logic of execution unit                      3383                  50
  EXU_ALU       Arithmetic and logic unit of execution unit                      5228                  335
  EXU_DIV       Divider of execution unit                                        9435                  277
  EXU_ECC       Error checking and correction logic of execution unit            3081                  228
  EXU_SHFT      Right and left shifting logic of execution unit                  4581                  81
  IFU_DCL       Decode control logic of instruction fetch unit                   976                   64
  IFU_ERRCTL    Error control logic for instruction fetch unit                   3424                  128
  IFU_IFQCTL    Instruction fetch queue control logic of instruction fetch unit  2835                  101
  LSU_QCTL1     Queue control for load/store unit                                4574                  170

3.3.3 Probability of Logic State

First, we present data on the distribution of the probability of having logic value 1 on the input ports of the circuit blocks. The probability of logic value 0 is simply one minus the probability of logic value 1. Figure 3.4 shows the CDF of the probability of having a 1 on the inputs of the 12 different FUBs. The steep start of the CDFs in this plot highlights that for a large percentage of inputs the probability of having a logic value 1 is smaller than 0.25 (25%). Figure 3.5 shows the CDF of the probability of having a 1 on the internal nodes of the FUBs. The data in Figure 3.4 captures the state probability of hundreds of input ports for each FUB, while the data in Figure 3.5 shows the state probability of thousands of internal nodes of those FUBs. The number of inputs and internal nodes for each FUB is listed in Table 3.1.

Figure 3.4: CDF of probability of logic value 1 on inputs of different FUBs.
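The CDFs shown in this section are built from the raw per-node probabilities in the usual way; a minimal sketch:

    def empirical_cdf(values):
        # values: e.g. P(node = 1) for every internal node of one FUB.
        ordered = sorted(values)
        n = len(ordered)
        return [(v, (i + 1) / n) for i, v in enumerate(ordered)]   # (x, fraction of nodes <= x)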
Figure 3.5: CDF of probability of logic value 1 on internal nodes of different FUBs.

The distribution in Figure 3.4 highlights that different input ports of a FUB have very different probabilities of having logic value 1. This shows the inaccuracy of the commonly used assumption of a 0.5 probability of logic value 1 on all inputs to a FUB. Such an assumption would correspond to a step CDF that stays at zero percent until 0.5 on the horizontal axis and then jumps to 100% at 0.5. The probability of state not only varies between inputs of the same FUB, it also differs between the inputs and the internal nodes of the same circuit. Hence, merely observing the probability of state on the input ports of a FUB is not sufficient for making assumptions regarding the state probability distribution of the internal nodes of that FUB. It is worth noting the small step up at 0.5 in the CDF of internal nodes. This step up is caused by the clock signal, which is connected to multiple internal nodes. The clock signal in this study had a duty cycle of 50%, resulting in this small jump in the CDF. There is only one input port in each circuit block which is connected to the clock signal, but this signal is distributed to multiple flip flops inside the circuit.

The overall vulnerability of a circuit block to NBTI depends on the number of PMOS transistors it has and the probability of those transistors being in the stress state (having a 0 on their gate). Figure 3.6 and Figure 3.7 show the distribution of the probability of having logic value 1 on the inputs of circuit blocks and on their internal nodes within the T1 processor. PMOS transistors connected to nodes with a low probability of logic value 1 (probability lower than 0.04, or 4%) are those which spend more than 96% of their time in the NBTI stress condition. The group of nodes with this low probability of logic state 1 accounts for 46% of the internal nodes in all FUBs studied. The data also shows that 68% of FUB inputs spend less than 4% of their time in logic state 1.

Figure 3.6: Distribution of the probability of logic value 1 on the input ports for 12 FUBs.

Figure 3.7: Distribution of probability of logic value 1 on the internal nodes of 12 FUBs.

3.3.4 Toggle Probability

The toggle probability distributions of all the input ports as well as the internal nodes for the 12 different FUBs are shown in Figure 3.8 and Figure 3.9 respectively.

Figure 3.8: Toggle probability distribution for input ports of 12 FUBs.

Figure 3.9: Toggle probability distribution for internal nodes of 12 FUBs.

Figure 3.10: CDF of toggle probability distribution of the internal nodes of 12 FUBs compared.

The toggle probability distribution for both the input ports and the internal nodes shows a larger difference between FUBs than the probability of logic state does. Most of the nodes and inputs have a toggle probability far below 0.5 (less than one toggle every other clock cycle). The CDF in Figure 3.10 takes a closer look at the differences between FUBs with respect to the toggle probability distribution of their internal nodes. One interesting observation is that FUBs with a pipeline control function, such as EXU_RML, LSU_QCTL1, LSU_STB_CTL, IFU_ERRCTL, and IFU_IFQCTL, have over 90% of their internal nodes with less than 0.4% toggle probability, while FUBs such as EXU_ALU or IFU_DEC, which are in the data path, have a much larger percentage of their internal nodes with higher toggle probability.
Since many wearout phenomena such as HCI happen due to switching activity of transistors these results indicate that FUB in the data path are more susceptible to HCI compare to FUBs with pipeline control 65 function. This can not only be used in accurate simulation of the wearout of these different blocks but also can be used for identification of FUBs which can be selectively hardened for HCI. CDFs in Figure 3.11 show the toggle distribution on (a) input (b) internal nodes of all the FUBs in the study. The error bar on this figure show the difference range between different FUBs in the study. The large difference between the toggle probability distribution between different FUBs further highlights the need to use accurate node toggle probabilities for different FUBs in order to accurately simulate the amount wearout by phenomena such as HCI. (a) (b) Figure 3.11: Comparison of the CDF of toggle probability distribution for (a) input ports (b) internal nodes of the 12 FUBs. The results presented in this section provide a sample of the type of information WAT can provide. In this section we focused on a one application of such data but there 66 are many other applications which can benefit from the cross-layer insight provided by WAT. 3.4 Related Works To the best of our knowledge we are not aware of any existing published or commercial cross-layer analysis tools capable of providing the device-level usage statistics that are provided by WAT. There have been software-based frameworks developed for academic cross-layer reliability studies such as the one used in [28, 34, 35]. These alternatives to WAT for cross-layer analysis are based on hierarchical simulation which uses multiple software simulators each capable of simulating one layer in the computer system stack. One example implementation of a hierarchical simulation that provides coarse-level cross-layer insight is the following. A full system simulator will be used to boot an operating systems and run applications to collect broad statistics, such as a trace of the instruction sequences executed. Then a functional simulator of the processor would be used for simulating a processor. This simulator is fed the instruction sequences collected from system level simulator. The processor simulator may collect coarse FUB-level statistics such as temperature, number of times that FUB was used during one time epoch. Finally a FUB within the processor is simulated with gate-level accuracy. During this simulation the captured temperature and utilization statistics may be used to derive the FUB wearout. An example of hierarchical simulation is the framework used in [34] where Simics full system level simulation [42] is followed by use of CMU transplant tool [20] to connect the state of the full system simulator to a gate- level simulation of a FUB. 67 There are three major challenges in hierarchical simulation which is needed for cross-layer wearout analysis. First, wearout analyses require simulation of long running workload in order to capture trends and behaviors which can affect wearout. Second, accurate communication between different simulators is necessary. Third, if these cross- layer analysis tools are to be used broadly they need to be capable of simulating different systems and different architectures. WAT has advantages compared to hierarchical simulation in dealing with all of the above three challenges. 
These advantages are mainly attributed to use of FPGA emulation of the RTL of the processor rather than using a full system simulator and a transplant tool or a functional simulator to go from full system simulation to gate-level simulation. The main advantages of WAT are: high speed, high accuracy, and flexibility to changes in architecture and microarchitecture. These advantages are all due to the fact that FPGA synthesis and place and route essentially build a tailored full system emulator of the RTL design with functional level accuracy. However, one impediment to using WAT is the requirement of a fully working RTL-level description of a processor. However, the hierarchical simulation tools may only need RTL description of just a few critical FUBs. Hence, WAT is more suitable for detailed design studies during the late stages of a processor design. While hierarchical simulation may be used for coarse level design exploration studies during the early stage of a processor design. High Speed: By using FPGA emulation we can replace full system software simulation with FPGA emulation. Not only emulation of large and complex circuits on FPGA is faster but also reducing one stage in the cross-layer methodology (i.e. the stage 68 between full system simulation and gate-level simulation) reduces the time overhead of the framework. High Accuracy: Higher accuracy of WAT in cross-layer analyses comes from using the exact same RTL design code for both FPGA and ASIC synthesis of the target FUB resulting in exact match for the functional behavior of the circuit between FPGA emulation and gate-level simulation. This means that all the data and control signal which are inputs and outputs different FUBs are emulated on the FPGA during full system emulation. Flexibility with Different Designs: Any circuit design can be simulated as long as we have the RTL code. It should be noted that the RTL design of the circuit is needed for any cross-layer analyses, even using hierarchical simulation, which involves gate- level accuracy. Once the RTL code is available then a wide range of wearout studies can be conducted with relative ease. Hierarchical simulation not only requires the RTL design for gate-level simulation it also requires redesigning or reconfiguring the full system simulator and functional simulator or transplant tool for every new target circuit. 3.5 Other Applications of WAT In this section we highlight a few different applications in which WAT can be used. These applications include: circuit design optimizations, ISA vulnerability analyses, selective device hardening, and test generation for post-fabrication testing. 3.5.1 Circuit Design Optimization for Improved Reliability In Chapter 4 we use WAT to obtain application-driven path utilization profiles of circuit blocks. Results of these analyses are then used to optimized and redesign circuits 69 to be more suitable for wearout monitoring and have more resilience to wearout-induced timing degradation. Details of this design optimization framework [72] and how it uses the cross-layer information from WAT is presented in Chapter 4 of this dissertation. 3.5.2 Instruction Set Architecture Reliability Benchmarking Other researchers have also used WAT. WAT was used in a framework for systematic tracking of how each instruction in an ISA stresses the gates during its execution through the processor pipeline [23]. This study quantifies the number of devices each instruction activates during its execution. 
This methodology was used for benchmarking the vulnerability of an instruction set architecture (ISA) to wearout. The results of such cross-layer evaluations can be used to improve reliability by making enhancements to the ISA and/or compilers.

3.5.3 Device Vulnerability Identification and Selective Hardening

The NBTI stress condition occurs when the gate of a PMOS transistor has logic value 0. Hence, devices whose gates have a higher probability of having a logic value 0 are at higher risk of NBTI. Furthermore, transistors which have a higher toggle probability on their gates are more at risk of degradation caused by phenomena such as HCI. Identification of such devices in the circuit by use of cross-layer analysis during the design phase can enable selective hardening (i.e. using design and fabrication techniques to reduce susceptibility to NBTI, HCI, etc.) of those transistors at higher risk. It should be noted that reducing the susceptibility of at-risk transistors through techniques such as gate sizing and/or selective voltage control comes at a power, performance, and area cost; hence accurate design-stage cross-layer studies can help identify the most vulnerable transistors on which to spend a limited hardening budget.

3.5.4 Automatic Test Vector Generation for Post-fabrication Testing

Another interesting usage of the activity statistics on the internal nodes as well as the input nodes of different FUBs is in automatic test vector generation for post-fabrication testing [73]. For a circuit with a large number of input ports, random generation of test vectors based on equal and independent probabilities of 0 and 1 on different inputs can result in test and/or stress input vectors which do not frequently occur in normal use of the target circuit. This can result in incorrectly testing or stressing devices and in an inability to measure wearout accurately. The activity statistics collected on the input nodes of the circuit blocks can be used during generation of input vector sets to produce state probabilities and toggle rates which are similar to what the circuit would observe during normal use. For example, an input bit to the circuit block might have a low probability of having logic value 1 or a low probability of toggling. This behavior can be accounted for by tuning the test vector generation so that the generated test vector set results in a low probability of logic value 1 or a low toggle probability on that bit. The end result is automatic generation of a test vector set which tests the signal paths that are commonly used during normal operation of the circuit.

The results presented in Section 3.3 clearly highlight the diversity in the activity and state probability of different inputs and internal nodes of a FUB. For example, consider the ALU of OpenSPARC T1 (i.e. EXU_ALU), which has 335 inputs, input_port(1) to input_port(n). WAT provides the logic state and toggle probability for all of these inputs. For example, WAT's reported data for input_port(i) and input_port(j) show that the probability of logic value 1 on input_port(i) is 80% while the probability of input_port(j) having logic value 1 is only 5%. This difference can be a consequence of the function, microarchitecture, or typical workload of the circuit (e.g. input_port(j) is connected to a rarely used active-high ALU control signal).
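A minimal sketch of such probability-weighted vector generation is given below; the per-port probabilities and names are illustrative assumptions taken from the example above, not actual WAT output.

```python
import random

# Sketch: generate test vectors whose per-bit probability of logic 1 matches the
# state probabilities WAT observed for each input port (values below are illustrative).
p_one = {"input_port_i": 0.80, "input_port_j": 0.05}  # observed probability of logic 1 per input

def weighted_vector(p_one):
    """Return one input vector, drawing each bit with its observed probability of being 1."""
    return {port: (1 if random.random() < p else 0) for port, p in p_one.items()}

vectors = [weighted_vector(p_one) for _ in range(1000)]
# On average ~80% of vectors drive input_port_i high and only ~5% drive input_port_j high,
# so the stress applied during testing resembles the circuit's in-field activity.
```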
Having accurate activity statistics can guide the automatic selection of the test/stress vectors so that 80% and 5% probabilities are used in picking logic value one for input ports i and j respectively and hence having a much more representative stress or test result. 3.6 Summary and Conclusions In this chapter we presented detail of WAT which is cross-layer wearout analysis tool that combines FPGA-based emulations with software simulation. WAT provides fast and accurate characterizations of the runtime behavior of devices in the circuit while real workloads are running on the system. WAT is useful during pre-fabrication phase of chip development where cross-layer analysis provides precise understanding of how a FUB is likely to be stressed by applications during in-field operation. We then presented a number of application of WAT as well as a detailed experimental results section which highlight the insight WAT provides regarding the switching activity and state probability of devices within different FUBs. Note that while WAT is currently being used for wearout studies and all application highlighted earlier in Section 3.3 are in the design-for-reliability domain, it is possible to adapt WAT for conducting cross-layer power and performance studies that require gate-level activity data while running applications at the chip level. 72 Chapter 4 Cross-layer Wearout Aware Design Flow The WearMon approach described in Chapter 2 relies on checking the most critical circuit paths to detect timing degradation. However, high-volume industrial chips that are optimized for power and area efficiency have a steep critical path wall making it difficult to select just a few paths for wearout monitoring. Furthermore, wearout depends on dynamic conditions, such as processor’s operating environment, and application- specific path utilization profile. The dynamic nature of wearout coupled with steep critical path walls may result in excessive number of paths that need to be monitored thereby reducing the effectiveness of WearMon. In this chapter we present a novel cross- layer circuit design flow [72] that uses path timing information and runtime path utilization data to significantly enhance monitoring efficiency. The proposed methodology uses application-driven path utilization profile to select only a few paths to be monitored for wearout. We propose and evaluate four novel algorithms for selecting paths to be monitored. These four approaches allow designers to select the best group of paths under varying power, area and monitoring budget constraints. 73 4.1 Introduction As discussed in Chapter 2, WearMon does wearout monitoring based on checking the signal paths in the circuit under test. In WearMon, stored test vectors that are specifically selected to sensitize the most critical paths of the circuit are used for runtime tests that capture the timing margin (also called slack) of these paths. This approach of wearout monitoring falls under a broad category called in situ circuit checking. Another in situ circuit checking method [54] uses Built-in Self Test (BIST) mechanism to perform runtime circuit tests. Some techniques use “canary” circuits, as opposed to in situ circuit checking. Canary circuits are proxy circuits that are designed to fail before the actual circuit [19]. Canary circuits do not test the actual signal paths in the circuit; rather they only act as proxies for the primary circuit wearout. 
Other techniques use sensors inserted into the circuit at design time which are capable of detecting wearout by sensing increased circuit delay [2, 13] or changes in other parameters, such as threshold voltage (V th ) [36]. The main advantage of in situ circuit checking is that they capture the effects of actual circuit lifetime utilization at a lower cost and with higher flexibility for online adaptation. Most wearout monitoring mechanisms, including WearMon, generally make the following two basic assumptions: (1) In any given circuit there are only a few circuit paths that have critical timing margins. Hence, to accurately predict imminent timing failures only a few circuit paths with the least timing margin need to be monitored. (2) Circuit paths with least timing margin have a higher probability of being among the first to violate timing. Hence, monitoring prioritizes paths purely based on the timing margin 74 measured at design time. The first assumption indicates that selection of only a few paths for monitoring would be sufficient for robust monitoring. This assumption may hold well in some designs that use automatic design tools to synthesize, place and route the design. In the absence of knowledgeable designer’s input these tools typically do not create steep critical path walls [35, 46], where a large number of paths have small timing margin. However, custom design optimizations for maximizing power and area efficiency, particularly employed in high performance processors, may result in the creation of a steep critical path wall in several circuit blocks. In the presence of a steep critical path wall the number of paths that need to be monitored can be very large, thereby increasing the monitoring overhead. The second assumption made by in situ approaches results in the selection of paths purely based on design stage timing margin. However, it has been shown that wearout depends on dynamic runtime utilization of the processor and many of the causes of wearout get exacerbated with increased circuit utilization [31, 37]. Path selection purely based on timing margin neglects this important dependence. Thus the robustness of the monitoring approach that relies purely on static timing margin can be compromised due to the dynamic nature of path utilization. Hence, we conclude that in order for monitoring approaches to be more broadly applied (beyond low-cost computing segment) there is a need for a symbiotic interaction between circuit design tools, monitoring hardware and the high-level application software. Only through such an interaction it is possible to identify circuit paths which are the slowest at design time and also have higher lifetime utilization resulting in most wearout induced timing degradation. 75 Figure 4.1: Design time and runtime cross-layer interaction. In this chapter we present a novel cross-layer circuit design flow methodology [72] that combines use of static path timing information with runtime path utilization data to significantly enhance monitoring efficiency and robustness. Figure 4.1 shows the layered framework consisting of two phases: (1) Cross-layer design flow (CLDF) phase: This phase (marked as “Design Time”) uses representative application inputs to derive circuit path utilization profile. The microarchitecture specification provides monitoring budget, such as the amount of chip area or the power consumption allocated for wearout monitoring. CLDF also derives timing profile from static timing analysis of circuit’s design. 
The wearout aware algorithm then combines information from software, microarchitecture and circuit layers to drive circuit design optimizations with the explicit goal of making a circuit amenable for robust and efficient monitoring. The algorithm selects a refined group of paths along with a robust set of input vectors for wearout monitoring. (2) Wearout monitoring phase: A runtime wearout monitoring phase, similar to that proposed in WearMon [70], continuously monitors the paths selected from the CLDF 76 phase. The information about the circuit paths which need to be monitored, obtained from the CLDF phase, is used in the runtime phase for wearout detection. The focus of this chapter is to develop the CLDF framework. As such, we assume that a wearout monitoring mechanism similar to WearMon which was presented in Chapter 2 exists in the underlying microarchitecture. CLDF significantly enhances the applicability of existing runtime monitoring approaches. For example, where wearout sensors or canary circuits are used for monitoring, CLDF will identify circuit paths that are most susceptible to failure thereby allowing the designer to select the most appropriate location of the wearout sensors or canary circuitry. When in situ monitoring approaches are used [10, 40, 54, 70] only the most susceptible circuit paths reported by the CLDF framework are monitored. It should be noted that although the CLDF framework can be used with all the above mentioned reliability monitoring approaches, throughout this chapter we assume that the underlying microarchitecture uses an in situ monitoring approach similar to WearMon to illustrate how our design phase optimizations can enhance runtime monitoring efficiency. The main contributions of the work presented in this chapter are: 1. We design and implement a novel cross-layer circuit design flow methodology that combines use of static path timing information with runtime path utilization data to significantly enhance monitoring efficiency. This framework uses path utilization profile, path delay characteristics, and number of devices in near critical paths to optimize the circuit using selective path constraint adjustments (i.e. increasing the timing margin of selected group of paths). This optimization results in a new implementation of the circuit 77 which is more amenable for low overhead monitoring of wearout-induced timing degradation. 2. We propose four algorithms for selecting the best group of paths to be observed as early indicators of wearout induced timing failures. Each of these algorithms allows the designer to tradeoff area and power overhead of monitoring with robustness and efficiency of monitoring. In this study we embed the methodology presented in Chapter 3 which is a hybrid hierarchical emulation/simulation infrastructure into the design flow of a circuit to study the effects of application level events on gate-level utilization profile. This setup provides a fast and accurate framework to study system utilization across multiple layers of the system stack using a combination of FPGA emulation and gate-level simulation. In an era when computers are built from increasing number of components with decreasing reliability, multi-layer resiliency is becoming a requirement for all computer systems. In this chapter we present details of a low cost and scalable solution in which different layers of the computer system stack can communicate and adapt both at design phase and during the runtime of the system. 
Our cross-layer design flow approach is discussed in Section 4.2. Section 4.3 shows our hybrid cross-layer evaluation infrastructure, followed by the evaluation results in Section 4.4. Section 4.5 describes the most relevant prior work, followed by the chapter summary and conclusions in Section 4.6.

4.2 Cross-Layer Design Flow

In this section we describe the cross-layer circuit design flow (CLDF) methodology. At the core of CLDF is a novel approach that modifies the distribution of path timing margins so as to create a group of critical paths that are more likely to fail before any other paths fail. The paths that are likely to fail first are referred to as wearout-critical paths. Wearout-critical paths are ideal candidates for being monitored as early indicators of wearout. CLDF receives a monitoring budget, in terms of the area and power overhead allowed for monitoring, as input from the designer. CLDF uses three characteristics of the circuit, namely path timing, path utilization profile, and the number of devices on the path, to select a limited number of wearout-critical paths that satisfy the monitoring budget constraints specified by the designer. Paths which are selected to be monitored at runtime are checked regularly using approaches like [10, 25, 40, 54, 70]. We assume that WearMon is the runtime monitoring technique used for actual wearout detection.

We provide a quick recap of WearMon here; it is described in detail in Chapter 2. WearMon tests the circuit at a test frequency, f_test, which is higher than the normal operation frequency, f_0+GB = 1/T_0+GB. T_0 is the delay of the slowest paths in the circuit at design time, and hence ideally the circuit can operate at that clock period at fabrication time. Designers add a guardband (increase the clock period) to deal with wearout. T_0+GB is the clock period of the system with the added guardband, which is the usual operational clock period of the circuit. If multiple tests, each at a clock period that falls within the T_0 to T_0+GB range (1/T_0+GB < f_test < 1/T_0), are performed, the test results provide information about the exact amount of timing degradation in the paths tested.

We first provide an overview of the algorithmic steps of the proposed CLDF approach. A detailed description of the key steps follows immediately.

Step 1. The circuit that is being monitored is first synthesized using a conventional design flow. Performance, power, and area constraints are provided as inputs to the synthesis tool. The synthesis tool generates the implementation of the design and an initial static timing report that shows the timing margin of each circuit path. The first step in CLDF takes this synthesized design as input and sorts all the circuit paths in the timing report based on their timing margin. It then selects some number of paths, say nLong, with the least timing margin. These nLong paths are further analyzed in the rest of the steps.

Step 2. The second step is where the cross-layer aspect of the design flow comes into effect. In this step, CLDF selects a representative set of workloads and runs them on the synthesized design. The utilization profile of the nLong paths selected in Step 1 is collected, which provides information regarding how frequently each of these paths has been exercised during the execution of the selected workloads.

Step 3. We use one of the four approaches discussed in Section 4.2.3 to select the following two path groups from the nLong paths: a) Paths to be optimized further.
b) Paths to be monitored at runtime. Step 4. Paths selected in group 3(a) are optimized to be faster which results in more timing margin for these paths. The goal of this optimization is to make the paths in group 3(b) wearout-critical paths that allow for robust monitoring. It should be noted that groups 3(a) and (b) are not mutually exclusive and depending on the approach selected by the CLDF framework there might be paths which are in both groups and are optimized and also selected for being monitored. 80 Step 5. This step collects necessary data to enable robust runtime monitoring of paths in group 3(b). This step is dependent on the monitoring framework used. For example if a runtime wearout monitoring such as WearMon is used the input vectors that would sensitize the paths in group 3(b) are created in this step. These inputs are then stored in a test vector repository to enable runtime monitoring. If approaches like [19, 25] are used for runtime monitoring then location of the paths in group 3(b) and their structure should be stored so that canary circuits can be designed for them or wearout sensors can be inserted at appropriate locations. As stated earlier, in this chapter we assume a monitoring approach similar to WearMon based on test vector injection for path testing is used. 4.2.1 Step 1: Selection of the Analysis Group The first step in CLDF is to use a traditional synthesis tool to synthesize the design and perform static timing analysis. The hardware description language (HDL) code for the design in addition to performance, area, and power constraints are provided as inputs to the synthesis tool. The output of this initial synthesis will be the gate-level implementation and a timing report that indicates the amount of timing margin for each circuit path. CLDF then generates a sorted path list based on timing margin and selects a group of nLong longest paths (paths with the least timing margin). These paths are considered for optimization and/or runtime monitoring as we will describe later. The selection of nLong paths is done as follows. CLDF selects nLong paths based on an initial cut-off criteria (InitCutoff) given as input to the algorithm. CLDF selects only those paths whose delay is larger than InitCutoff percentage of the maximum path delay. For example if the delay of the longest 81 path in the circuit is 10ns and if InitCutoff is selected as 75% then CLDF picks all paths with delay of 7.5ns or higher thereby ensuring all paths that are within 75% of the worst- case delay are selected for analysis. The cutoff parameter is selected by the designer based on the worst-case wearout expected in a design within the typical lifetime of the processor. It has been shown in prior studies that all wearout causing phenomena, such as NBTI, and Electromigration, reach a maximum wearout level beyond which they cause device failure [31, 37]. In fact, this knowledge is what is used by conservative circuit design approaches for selecting a guardband to prevent premature failures; when a designer selects a 10% guardband the assumption is that no path with more than 10% timing margin will fail before the end of expected lifetime of the processor. Hence, InitCutoff is simply the conservative guardband that has already been estimated at design time. It is worth noting that for circuits with steep critical path timing walls using InitCutoff may result in selection of a large group of paths for further analysis, thereby making nLong a very large number. 
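The following is a minimal sketch of the Step 1 selection just described, assuming path delays have already been extracted from the static timing report; the record layout and function name are illustrative, not the actual CLDF implementation.

```python
# Sketch of CLDF Step 1: select the nLong analysis group from static timing results.
# Each path record is assumed to carry its total delay in nanoseconds.

def select_nlong(paths, init_cutoff=0.75):
    """Keep every path whose delay is within init_cutoff of the worst-case delay."""
    max_delay = max(p["delay_ns"] for p in paths)
    threshold = init_cutoff * max_delay          # e.g. 0.75 * 10 ns = 7.5 ns
    nlong = [p for p in paths if p["delay_ns"] >= threshold]
    return sorted(nlong, key=lambda p: p["delay_ns"], reverse=True)  # least margin first

paths = [{"name": "p0", "delay_ns": 10.0}, {"name": "p1", "delay_ns": 7.6},
         {"name": "p2", "delay_ns": 5.0}]
print([p["name"] for p in select_nlong(paths)])  # -> ['p0', 'p1']
```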
Large nLong values do not create any impediment in the next steps of CLDF algorithm. Similarly, for circuits with shallow critical path timing walls nLong may be small. If nLong is too small (smaller than the number of paths which can be monitored efficiently), then there is no need to even conduct further analysis since the circuit does not have many critical paths and it may be possible to monitor all critical paths without further analysis or need for CLDF. As mentioned earlier, the main goals of CLDF is to make circuits with steep critical path timing walls (large nLong values) still amenable for runtime monitoring. 82 4.2.2 Step 2: Utilization Based Path Prioritization Step 2 generates utilization profile of nLong paths. The utilization data is collected while executing a representative set of applications that are expected to run on the design. During execution of representative applications the number of times each of the nLong paths is utilized is logged. Then nLong paths are sorted based on the cumulative number of times each path was utilized during profile runs; we call this sorted list the utilization profile. CLDF uses HighUtilCutoff parameter given as input to CLDF to identify paths that have utilization greater than HighUtilCutoff percent of the maximum utilization reported for the nLong paths. These paths are demarcated as high utilization paths. CLDF also uses a LowUtilCutoff parameter and any path with utilization lower than this cutoff is demarcated as a low utilization path. The rationale behind using two cutoffs is to create two distinct groups of paths with very different utilization levels. As explained shortly, this clear separation between high and low utilization is used to create robust and efficient monitoring mechanisms. Timing degradation of a circuit path is a sum of the degradation of all the devices on that path. Hence, if all other parameters are the same, more devices on a path result in more susceptibility to wearout induced timing degradation. As such CLDF uses device counts on a path to further differentiate between paths. CLDF uses a single input parameter called DevCutoff to demarcate paths with high or low device counts. After gathering the utilization profile, CLDF divides nLong paths into three categories based on HighUtilCutoff, LowUtilCutoff, and DevCutoff. Timing margin of the 83 first category of paths will be increased; we refer to these as the optimized group. The second category contains those paths that are monitored for wearout at runtime, which is referred to as the monitored group. The third category contains paths that are neither optimized nor monitored. We have explored four path categorization algorithms in this research. These algorithms provide different tradeoffs between performance, power, area, and reliability. Illustrative example: While describing the four algorithms, we will use an illustrative example to show how path categorization is done. For this purpose in Figure 4.2(a) we show the initial delay distribution of a sample circuit taken from OpenSPARC T1 processor [45]. This sample circuit is the instruction decode block of the instruction fetch unit (sparc_ifu_dec). Section 4.4 presents more quantitative details for this circuit but they are not necessary here for understanding the algorithms. The timing constraint used for synthesis is 0.95ns (T 0 or zero timing margin path delay). We assume there is a 0.09ns timing guardband added by the designer to deal with wearout. 
Hence, the resulting system clock period is 1.04ns (T 0+GB ). In this discussion we assume that we use 90% as the InitCutoff value. Hence, we select nLong paths that are within 90% of the longest timing paths. All paths in the right most five columns of Figure 4.2(a) form the nLong paths. There are three types of paths highlighted with shades of black in Figure 4.2(a): high utilization & high device count, low utilization & low device count, and all other paths. The group marked high utilization & high device count are the paths that have utilization that exceeds the HighUtilCutoff and device count that exceeds the DevCutoff parameters. Similarly, low utilization & low device count are the paths that have 84 utilization that is below the LowUtilCutoff and device count that is below the DevCutoff parameter. Intuitively, the separation of paths into three types based on utilization and device count provides an opportunity to shift steep critical timing walls by not treating all paths with the same timing margin as equally important. Instead we create path heterogeneity with device count and utilization information derived from application level information. By exploiting this crucial runtime information through design time utilization analysis we can avoid monitoring all the paths even in the presence of critical path timing walls, as we will show in the next section. 4.2.3 Step 3: Approaches for Selecting Monitored Paths The output from this step is the identification of paths that are used for monitoring. We assume that a designer has a fixed budget to monitor only nMonitor paths (based on the area, power, and performance budget allocated for monitoring). Hence, the goal is to select a total of nMonitor paths. In this section we describe four approaches that we designed for path selection. 85 Figure 4.2: Path delay distribution (a) before optimization and after (b) Approach 1 (c) Approach 2 (d) Approach 3 (e) Approach 4. 4.2.3.1 Approach 1: Monitor Least Reliable The goal of this approach is to create a distinct group of paths which, with high probability, are the paths that are going to have wearout induced timing failure before the rest of the paths. These paths will be monitored and used as predictors of imminent timing violations. Approach 1 achieves this goal by reshaping the path delay distribution 86 of the circuit as follows. A group of paths that are most susceptible to wearout are selected for monitoring. Concurrently, all the paths that are not monitored are removed from the critical path wall by increasing the timing margin of these paths. Since paths that are not monitored have higher timing margin the probability of path not monitored failing before the monitored group is reduced. Figure 4.2(a) shows the distribution of a sample circuit before using Approach 1 and Figure 4.2(b) shows the redistribution of the paths after applying Approach 1. The paths with the most delay in the redistributed plot, highlighted in black on Figure 4.2(b), are the group left for monitoring while all other paths are moved away from the critical path wall. Detailed description of Approach 1 is given below. Paths optimized: This approach starts with the utilization profile generated in Step 2 of the algorithm, which sorts nLong paths based on path utilization. The HighUtilCutoff parameter is used to select paths with high utilization, i.e. paths with utilization greater than the cutoff parameter. We then sort the high utilization paths based on the number of devices on each path. 
We further divide this newly sorted list by using DevCutoff parameter and identify the high device count and low device count paths. At the end of this process we end up with three sets of paths: high utilization & high device count, high utilization & low device count, and the remaining paths without any concern for their device count. We then separate high utilization & high device count paths from the nLong paths. The remaining paths (nLong paths excluding high utilization & high device count paths) are optimized to have a larger timing margin. The increase in the margin is equal to the initial circuit guardband. Path optimization is done by 87 resynthesizing the design using stricter timing constraint for the paths selected. The delay of the optimized paths can be reduced, for instance, by increasing the size of devices used on these paths. Since the optimized paths have more timing margin they are also significantly less likely to cause timing violations. Paths monitored: The high utilization & high device count paths which are not optimized (black bars in Figure 4.2(b)) will form the set of paths which are going to be continuously monitored for wearout. These paths have a higher probability of suffering the most wearout. These paths are utilized more frequently and utilization has a first order effect on many of the wearout causing phenomena. These paths also have more devices on them and are more susceptible to timing degradation caused by wearout of their devices. Runtime monitoring would check the path delay degradation of these paths between T 0 and T 0+GB and will alert the system if any monitored path delay gets critically close to T 0+GB . Discussion of Approach 1: The goal is to select a total of nMonitor paths where all paths have the characteristic of high utilization & high device count. Our main motivation for using HighUtilCutoff selection criteria is to pick a subset of nLong paths with a distinctly higher utilization compared to the rest of the nLong paths in that circuit. To satisfy this goal HighUtilCutoff can be selected in the range of 75% to 85% of the maximum utilization in the nLong path group. If a smaller percentage is selected, the relative utilization difference between the paths selected and the ones not selected would become smaller and hence the goal of leveraging utilization differences between paths will not be satisfied. 88 A few special cases are worth mentioning. First, if the number of paths in the high utilization & high device count category are more than the monitoring budget we simply select the most utilized nMonitor paths from this category and optimize the remaining paths even in this category. On the other hand, in some circuits the number of paths categorized as high utilization & high device count, after applying HighUtilCutoff and DevCutoff, may be fewer than nMonitor. In this case we fill the remaining paths for monitoring from high utilization & low device count category as well thereby removing these paths from further optimization. It should be noted that the goal of this work is to deal with circuits which have many more paths than the nMonitor. If the paths selected to be in the nLong path group are fewer than nMonitor paths, then it is not necessary to use the CLDF approach and all paths in the nLong group can simply be monitored. The value used for nMonitor has a direct impact on the area overhead of Approach 1. 
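A minimal sketch of the Approach 1 selection policy described above, including the fallback used when the high utilization & high device count category is smaller than the monitoring budget, is shown below; the record fields and default cutoffs are illustrative assumptions.

```python
# Sketch of Approach 1: pick nMonitor wearout-critical paths, optimize the rest of nLong.
# Each path record is assumed to carry a profiled utilization count and a device count.

def approach1(nlong, n_monitor, high_util_cutoff=0.8, dev_cutoff=0.8):
    max_util = max(p["util"] for p in nlong)
    max_dev = max(p["devices"] for p in nlong)
    high_util = [p for p in nlong if p["util"] >= high_util_cutoff * max_util]
    hu_hd = [p for p in high_util if p["devices"] >= dev_cutoff * max_dev]

    # Monitor the most utilized high utilization & high device count paths first.
    monitored = sorted(hu_hd, key=lambda p: p["util"], reverse=True)[:n_monitor]
    if len(monitored) < n_monitor:
        # Special case: fewer candidates than the budget -> fill with the most utilized leftovers.
        rest = sorted((p for p in nlong if p not in monitored),
                      key=lambda p: p["util"], reverse=True)
        monitored += rest[:n_monitor - len(monitored)]

    optimized = [p for p in nlong if p not in monitored]  # re-synthesized with tighter timing
    return monitored, optimized
```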
If nMonitor is small then the number of paths which are not monitored will be large and hence the area overhead of the optimization is going to increase. Recall that all the paths in nLong group that are not monitored will be optimized, which usually requires increasing device sizes. Furthermore, paths optimized with larger device sizes also lead to higher dynamic power consumption whenever these paths are exercised. These overheads can be reduced if the monitoring overhead is increased, by selecting larger nMonitor. Of course there is the tradeoff that more paths being monitored would mean more overhead for the monitoring setup. 89 One advantage of Approach 1 is that it does not perturb paths with high utilization & high device count which typically are the most power hungry paths in a circuit. On the other hand, since it does not perturb the high utilization & high device count paths the optimization effort and the resulting area overhead would not improve circuit’s susceptibility to timing failures since the high utilization & high device count paths still have a small timing margin. In other words, this approach only has the benefit of making any circuit with any path distribution suitable for monitoring and will increase robustness and effectiveness of monitoring but it does not change the fundamental wearout behavior of the circuit. It should be noted that for designs which have stringent power and area constraints but can tolerate some performance degradation (e.g. mobile device chips that are more constrained by power and area than maximum frequency) Approach 1 can be implemented in an alternative way. Paths which are selected for monitoring can be deoptimized while keeping all other paths the same. In other words, we increase the clock period and reduce the speed of high utilization & high device count paths to match the lower timing demands. All the other paths remain untouched and hence they will all gain additional timing margin while the high utilization & high device count paths will be the wearout critical paths used in monitoring. 4.2.3.2 Approach 2: Two Monitoring Groups Goals of this approach are twofold: (1) monitor the paths which are most susceptible to wearout but also make sure that the optimization effort of CLDF results in a longer lifetime of the circuit in presence of wearout. (2) Increase robustness of 90 monitoring even if the path utilization during in-field operation varies from the utilization profile collected from representative applications. In order to achieve the first goal of improving reliability, first the paths most susceptible to wearout are monitored as in Approach 1. In addition these monitored paths are also optimized to improve the lifetime of the circuit. To achieve the second goal we also monitor a second subset of paths that are not necessarily the most wearout susceptible during the profile run. The redistribution is shown on Figure 4.2(c). Paths monitored: The monitoring budget is split equally into two groups. First group of paths to be monitored, called Group 1, is the same as those in Approach 1, namely, high utilization & high device count paths selected using the selection polices described in Approach 1, except that we only select nMonitor/2 paths. The paths selected for monitoring are removed from the nLong paths. The remaining paths are then sorted in descending order based only on utilization without any constraint on device count. 
The second half of the monitored paths, called Group 2, is selected from the top of this newly sorted list. Group 2 increases the robustness of monitoring since it selects half the paths from those categorized as less susceptible during the profile run. Hence, the reliance on profile data accuracy is reduced.

Paths optimized: All paths except those in Group 2 are optimized. As a result of optimizing the paths in Group 1, the most susceptible paths will have more timing margin and the overall circuit lifetime is increased. By not optimizing the paths in Group 2, which are not as susceptible to wearout, we create a distinct group of paths with a very different timing margin profile that is simultaneously monitored, thereby further improving monitoring robustness.

Discussion of Approach 2: In Approach 2 every path in the nLong group is either monitored or optimized or both. In particular, there are no paths that are neither optimized nor monitored. This approach is particularly suitable for designs with a larger monitoring budget (nMonitor) and for circuits in which a large number of paths cluster in the low utilization & low device count category.

4.2.3.3 Approach 3: Virtual Critical Paths

In the first two approaches there is no limit on the number of paths optimized, which may lead to unacceptable area and/or power overheads for some circuits. Approach 3 focuses on limiting the area and power overheads from optimization while still retaining the monitoring efficiency of the prior approaches. The approach relies on a small change to the monitoring process itself to achieve its goal. Monitoring is done using a higher testing frequency than in the previous two approaches.

Paths monitored and optimized: We use Approach 1 to select nMonitor paths for monitoring. We then optimize only the paths selected for monitoring and leave all other paths untouched. It is worth noting that in Approach 1 we optimized all other paths that are not monitored, whereas in Approach 3 we limit the number of paths optimized to be exactly the same as the number of paths monitored. Thus the area and/or power overhead associated with optimization remains fixed (based on the nMonitor paths), independent of the number of nLong paths.

Modifications to monitoring hardware: Testing of the critical paths selected for monitoring needs to be done using a different testing clock frequency, f_test, than the one described earlier in Section 4.2, which is 1/T_0+GB < f_test < 1/T_0. When Approach 3 is employed, a test clock period that is shorter than the actual clock period of the system is used for monitoring; we refer to this as a virtual test clock. Note that the paths monitored are also the paths optimized. Hence, monitored paths no longer have the smallest amount of timing margin. For monitoring purposes, however, these paths are treated as if they are still the most critical paths (virtually critical). Note that the monitored paths are in the high utilization & high device count category even though they are optimized. These paths will suffer the most wearout during in-field operation. Thus Approach 3 still relies on the most wearout susceptible paths. Since these paths are also optimized they have more timing margin, and hence they do not threaten the system's performance or lifetime. Figure 4.2(d) illustrates the new test clock range in which these paths are monitored. The new test clock period is between T_t and T_t+GB instead of between T_0 and T_0+GB. T_t is the delay of the slowest optimized path and T_t+GB is T_t plus the same guardband.
The paths highlighted in black are the ones most susceptible to wearout; they have been optimized and are also monitored.

4.2.3.4 Approach 4: Two Monitoring Domains

Approach 3 creates a set of paths that are monitored at an elevated test clock frequency, with the assumption that monitoring the most utilized paths, which are also optimized, is sufficient to detect wearout. After the path redistribution of Approach 3 there will be a new set of critical paths which are not going to be monitored. These are the paths which have a delay larger than T_t, as shown in Figure 4.2(d). Note that these paths have a lower predicted activity factor than the monitored paths, and hence the assumption is that they are less susceptible to wearout. However, if the utilization during in-field operation varies from the utilization profile collected from representative applications, this prediction may not be accurate. In that case, paths that have a smaller timing margin may become susceptible to failure. Approach 4 eliminates this susceptibility by additionally monitoring paths from this smaller timing margin group.

Paths optimized and monitored: This approach monitors two groups of paths. Paths in Group 1 are selected the same way as in Approach 3. However, Approach 4 selects half the number of paths (nMonitor/2) to monitor using the virtual test clock (between T_t and T_t+GB). The Group 1 paths selected for monitoring are also optimized as in Approach 3 (the number of paths optimized is half of the nMonitor paths). Paths selected in Group 1 are removed from the nLong paths. The second half of the monitored paths, called Group 2, is selected from the paths remaining in the nLong group. We sort the remaining paths in descending order based on their utilization and select the top nMonitor/2 paths. Group 2 paths are monitored using a test clock with a period between T_0 and T_0+GB (the original test frequency range used by Approaches 1 and 2). Thus Approach 4 uses two monitoring test frequency ranges. Group 2 paths are the ones with the least timing margin after the Group 1 paths are optimized; they have a delay above T_t as shown in Figure 4.2(e).

Modifications to monitoring hardware: The additional monitoring cost incurred in this approach is due to the need for additional control hardware to enable monitoring at two different test frequency ranges. This slightly more complex monitoring hardware reduces the sensitivity of monitoring to path utilization profiling accuracy since two distinctly different sets of paths are monitored.

4.2.4 Summary of Approaches

Table 4.1 summarizes the four approaches discussed. As shown in Table 4.1, the nLong critical circuit paths in the analysis set can be grouped into four categories based on utilization and device count: (1) high utilization & high device count, (2) high utilization & low device count, (3) low utilization & high device count, and (4) low utilization & low device count. Each of the four approaches (labeled App. 1 to 4) selects a subset of each of the above four path categories to be optimized (Opt.) and/or monitored (Mon.).

Table 4.1: Comparison of approaches.

Path Utilization | Path Device Count | App. 1 Opt./Mon. | App. 2 Opt./Mon. | App. 3 Opt./Mon. | App. 4 Opt./Mon.
High             | High (1)          | No / Yes         | Yes / Yes        | Yes / Yes        | Yes / Yes
High             | Low (2)           | Yes / No         | No / Yes         | No / No          | No / Yes
Low              | High (3)          | Yes / No         | No / Yes         | No / No          | No / No
Low              | Low (4)           | Yes / No         | Yes / No         | No / No          | No / No
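The policy summarized in Table 4.1 can also be written down as a small lookup table that a script implementing the four approaches might consult; the encoding below is hypothetical and simply mirrors the table, where each entry states whether paths from a category may be optimized and/or monitored.

```python
# Hypothetical encoding of Table 4.1: for each (utilization, device count) category,
# whether paths from that category are optimized and/or monitored under Approaches 1-4.
# Key: (utilization, device_count); value per approach: (optimize, monitor).
POLICY = {
    ("high", "high"): {1: (False, True), 2: (True, True),  3: (True, True),   4: (True, True)},
    ("high", "low"):  {1: (True, False), 2: (False, True), 3: (False, False), 4: (False, True)},
    ("low", "high"):  {1: (True, False), 2: (False, True), 3: (False, False), 4: (False, False)},
    ("low", "low"):   {1: (True, False), 2: (True, False), 3: (False, False), 4: (False, False)},
}

optimize, monitor = POLICY[("high", "high")][1]  # Approach 1 monitors these paths but does not optimize them
```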
4.3 Evaluation Methodology

In order to implement CLDF and evaluate the various approaches discussed in the previous section, the evaluation setup must have the following capabilities. First, it should synthesize the circuit design to generate the timing report used in Step 1 discussed in the last section. Any traditional ASIC synthesis tool has this capability. Second, the evaluation setup must take the nLong paths obtained from Step 1 and collect utilization profiles of these paths while running representative applications of interest. Running applications, particularly complex applications with system interactions, requires full system simulation capability with support for running an operating system. Hence the circuit being analyzed must be part of a complete processor design. Full system simulation can then be run on top of the processor design. Conducting a full system simulation on top of a gate-level processor design is an extremely slow process. Hence, collecting utilization profiles from application runs on a gate-level full system simulator is impractical. Finally, the evaluation setup must be capable of using the utilization profile data to generate a list of paths for further optimization and a list of paths for monitoring.

Taking into consideration these complex requirements, we rely on the WAT cross-layer simulation/emulation tool presented in Chapter 3. Figure 4.3 shows the flow chart of our evaluation setup with the specific implementation details required by CLDF. There are three inputs: the design code for a full processor, the representative benchmarks that should be used for collecting path utilization data, and the monitoring budget available in terms of the number of paths which can be monitored. There are two outputs from the design flow. The first output is a set of paths that need to be optimized such that the circuit is better suited for wearout monitoring. The second is a list of critical circuit paths that can be used in the monitoring hardware. Next, the various blocks in the flowchart are explained in detail.

Figure 4.3: Flow chart of evaluation methodology.

ASIC Synthesis: WAT takes as input the complete Verilog HDL design of a processor. In our implementation, we used the OpenSPARC T1 processor HDL code [45]. The designer then identifies a select few functional unit blocks (FUBs) of this processor and marks them as candidates for wearout monitoring. The Verilog HDL code of these FUBs is then extracted from the processor HDL. In our implementation, we selected four different FUBs from the OpenSPARC T1 processor: (1) the instruction decoder (sparc_ifu_dec), (2) the execution control logic (sparc_exu_ecl), (3) the store buffer controller (lsu_stb_ctl), and (4) the register management logic (sparc_exu_rml). The extracted FUBs are then synthesized using the ASIC design flow to generate two outputs: the gate-level circuit implementation and the path timing characterization. We use Synopsys Design Compiler to generate the gate-level circuit implementation. An IBM 90nm design library is used to synthesize the gate-level netlist of the circuits used in this study. We also use the Synopsys PrimeTime timing analysis tool to collect the timing characterization of each studied circuit. FPGA Emulation: The full processor HDL is mapped onto a Xilinx ML509 FPGA evaluation platform that uses a Virtex 5 XC5VLX110T FPGA chip.
ML509 is specifically designed with enough resources so that the OpenSPARC T1 processor (1 core, 4 threads) can be fully implemented on it. OpenSolaris 11 operating system is booted on top of OpenSPARC T1 on the ML509 platform. This FPGA-based full system emulation can execute unmodified SPARC binaries of most workloads. The second input to WAT is a collection of representative benchmarks to run. We selected ten unmodified SPEC CPU 2000 benchmarks. As explained in Chapter 3, WAT generates utilization profiles of the selected FUBs while executing the selected benchmarks. WAT uses the Xilinx ChipScope Pro Integrated Logic Analyzer (ILA) IP cores [68] to probe input signals to the FUBs. ChipScope probes the input signals to the selected FUBs while the benchmarks are executing on the FPGA based full system emulation. These FUB inputs are then saved as in form of value change dump (VCD) files. Gate-level Simulation: The next step is to collect path utilization within the selected FUBs while running the benchmarks. WAT performs gate-level simulation of the selected FUBs using VCD inputs collected from the FPGA emulation. By using VCDs we skip gate-level simulations during those cycles when the inputs do not change. 98 It should be noted that skipping these uninteresting events provides reduced gate-level simulation time without any loss of accuracy in the collected utilization profile. Gate- level simulation provides detail path utilization information from the ASIC implementation of the FUB. During gate-level simulation a Switching Activity Interchange Format (SAIF) file is generated for each simulation run which counts the exact number toggles for each of the gates in the simulated circuit. The utilization of each circuit path is the number of toggles of a node on the path which had the minimum toggles. This approach takes into consideration the fact that many gates are shared among multiple paths and ensures that when utilization of a path is calculated, the toggles for gates which are shared among multiple paths are accounted for accurately. Cadence NC-Verilog tool suite is used for gate-level simulation of the synthesized netlist. The input trace to this simulation is from FPGA emulations and the gate-level netlist is generated using the Synopsys Design Compiler circuit synthesis. The output from this simulation is a utilization profile of all the nodes in the netlist. Wearout Aware Algorithms: The last step in the flowchart shown in Figure 4.3 is to execute wearout aware algorithms. This step implements the four algorithms described in Section 4.2.3. This step takes the following inputs: path timing profile collected from ASIC synthesis, path utilization data obtained from gate-level simulation, and designer specified constraints such as the number of paths to be monitored. This output of this step identifies paths that need to be optimized and paths that need to be monitored based on each of the four algorithmic approaches. 99 4.4 Evaluation Results In this section, we demonstrate the effectiveness of the CLDF framework using the four selected FUBs from OpenSPARC T1 processor. CLDF provides the designer with the opportunity to tradeoff area and power for reliability in a fully-automated design flow chain. All the designer has to do is to provide benchmarks of interest, a monitoring budget (in terms of number of paths to be monitored) and some guidelines on setting parameter such as HighUtilCutoff, LowUtilCutoff, and DevCutoff. 
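Returning to the gate-level simulation step described above, the path utilization profile consumed by the wearout aware algorithms is derived from per-gate toggle counts; the sketch below assumes the SAIF output has already been parsed into a dictionary of toggle counts, and all names are illustrative.

```python
# Sketch: derive per-path utilization from gate-level toggle counts (e.g. parsed from SAIF).
# A path's utilization is the toggle count of its least-toggled node, so gates shared by
# many paths are not over-counted for any single path.

toggle_counts = {"n1": 5000, "n2": 120, "n3": 4800}       # per-node toggles (illustrative)
paths = {"pathA": ["n1", "n2", "n3"], "pathB": ["n1", "n3"]}

def path_utilization(paths, toggle_counts):
    return {name: min(toggle_counts[n] for n in nodes) for name, nodes in paths.items()}

print(path_utilization(paths, toggle_counts))  # -> {'pathA': 120, 'pathB': 4800}
```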
For the results presented in this chapter the monitoring budget (nMonitor parameter) is set to 50 paths which can be monitored. We will compare the four approaches to a baseline which simply monitors the 50 slowest paths in the FUB regardless of the utilization profile. The baseline approach uses conventional design flow and only uses the statically determined path delay distribution for selecting the paths to be monitored. As mentioned in the previous section the FUBs are synthesized using Synopsys Design Compiler and then Synopsys PrimeTime is used for timing characterization the paths in the FUBs. These FUBs are synthesized for a 1.25ns system clock with OpenSPARC’s default 20% guardband added to the clock period. As explained earlier, we use InitCutoff parameter to be the same as the clock period guardband which is set to 20%. In other words, paths with less than 20% timing margin are selected to form the nLong path group. All three selection parameters HighUtilCutoff, LowUtilCutoff, and DevCutoff are also set to 20% for these evaluations. 100 4.4.1 Detailed Results from sparc_ifu_dec We first present results and analyses for one FUB from OpenSPARC T1 processor, namely sparc_ifu_dec. We will then show results for all the remaining FUBs. Figure 4.4(a) shows the initial timing margin distribution for all paths in the sparc_ifu_dec FUB. Total number of paths in the FUB is 5988. These paths are divided into four groups based on their timing margin (s): (A) s<20% (B) 20%≤s<40% (C) 40%≤s<60% (D) 60%≤s<80%. Using 20% InitCutoff value resulted in 2974 paths as nLong paths which are shown in group (A). There were 2274 paths in group (B), 537 in group (C), and 203 in group (D). The primary observation from this figure is that there is a relatively steep critical path wall; nearly 50% of all paths have less than 20% margin. Figure 4.4: (a) Path timing and (b) path utilization profile of sparc_ifu_dec. 101 In order to consider the effect of path utilization on their wearout, as described earlier, it is necessary to generate the utilization profile. Utilization profile of all paths is shown in Figure 4.4(b). The horizontal axis is divided into three categories. Paths with utilization less than 20% are in low utilization group and paths with utilization within 20% of the maximum utilization of any path in this FUB (which is 55% utilization rate in this case) are in the high utilization group and finally all paths with utilizations in between the above cutoff values are classified as medium utilization. On the vertical access the distribution of the paths from each of the four groups A, B, C, and D is shown for each utilization category. Figure 4.4(b) shows that most of the paths in the group A have a low utilization while the highest utilized paths are mostly in groups B, C, and D. This result shows that for this FUB only a small group of the critical paths (less than 20% margin) have high utilization. This group of paths is more susceptible to wearout due the high utilization and hence is going to be selected by our algorithms to be optimized and/or monitored. Next, we will look at how each one of the four approaches presented in this research will modify the implementation of the above FUB and which paths are selected for monitoring. Approach 1: Initially all paths in the nLong group are sorted based on their amount of utilization. As clearly seen in the path utilization profile Figure 4.4(b), only a small subset of critical paths has high utilization. 
Hence when we select paths within the top 20% (HighUtilCutoff) of the maximum utilization, the selection results in just 31 paths with utilization between 35.2% and 44.6%. Recall that Approach 1 selects high utilization & high device count paths for monitoring while all the remaining paths in the nLong group are optimized. In this FUB, applying the HighUtilCutoff parameter alone resulted in the selection of too few paths, even before the DevCutoff parameter was applied. Since the monitoring budget allows 50 paths to be selected for monitoring, the DevCutoff parameter is not applied. In fact, this approach ended up selecting 31 high utilization paths. The remaining 19 paths selected are simply the most utilized paths from the nLong paths after the above 31 paths are excluded. These 19 paths are selected even though they are not as prone to wearout as the first 31 selected. The remaining 2924 of the nLong paths (2974 paths from group A minus 50 monitored paths) are optimized. The top graph in Figure 4.5(a), labeled "App. 1", shows the timing margin distribution of the circuit paths before and after the above selected paths are optimized.

Approach 2: This approach divides the 50-path monitoring budget into two equal parts of 25 paths each. The first group of paths for monitoring is selected using the same policy used for Approach 1. The approach picks the top 25 out of the 31 paths that are in the high utilization & high device count category. The 25 selected paths are removed from the nLong paths and the remaining paths are sorted again purely based on utilization. The top 25 paths from this newly sorted list are selected as the second group for monitoring. All paths are optimized except for the second group of paths in this approach. The first 25 paths plus the 2924 paths above form a group of 2949 paths which get optimized in this approach.

Approach 3: Approach 3 selects the same 50 paths as Approach 1 for monitoring. But it also optimizes the same 50 paths to have 20% more timing margin, as shown in Figure 4.5(a). The same 50 paths that are optimized are also monitored, but using an elevated test frequency range.

Approach 4: Approach 4 selects the first group of 25 paths to optimize using the same policy as in Approach 1, namely high utilization & high device count paths. These 25 paths are also monitored in Approach 4. However, the second group of 25 paths that need to be monitored are selected purely based on high utilization, but this second group is not optimized. As described earlier, Approach 4 uses two different test frequency ranges for monitoring. In total, Approach 4 selects 25 paths for optimization.

Figure 4.5: Slack distribution before (dotted line) and after CLDF optimizations on (a) sparc_ifu_dec (b) sparc_exu_ecl (c) lsu_stb_ctl (d) sparc_exu_rml.

4.4.2 Path Utilization Profile Analysis of All FUBs

Figure 4.6 shows the path utilization profile of the four FUBs from the OpenSPARC T1 processor. It should be noted that the vertical axis is in logarithmic scale. The first observation from a comparison of these plots is that the utilization profile of the circuit paths in each FUB is quite different. For example, there are paths in sparc_ifu_dec which are utilized almost half of the time (the columns around 50% in Figure 4.6(a)). On the other hand, for sparc_exu_rml, shown in Figure 4.6(d), there are no paths which are utilized more than 1% of the time.
This extreme difference in the utilization profile between FUBs further reinforces the need to consider the application-driven utilization profile when selecting paths for optimization and monitoring. Furthermore, these utilization profiles also motivate the need for multiple path-selection algorithms. Accordingly, different FUBs also benefit from different approaches for robust monitoring.

Figure 4.6: Path utilization profile for (a) sparc_ifu_dec (b) sparc_exu_ecl (c) lsu_stb_ctl (d) sparc_exu_rml (vertical axis has logarithmic scale).

The utilization distribution of the subgroup of paths with less than 20% margin follows almost the same general trend as all the paths in the circuit, particularly at low utilization levels (shown using black columns in Figure 4.6). In other words, if we consider all the paths in the circuit and observe that the number of paths with 1% utilization is much larger than the number of paths with 10% utilization, then the same observation holds if we consider only the paths with less than 20% timing margin. In the FUBs studied, there is a limited group of paths with high utilization, and many of these paths have more than 20% timing margin. Figure 4.6 shows that there is a notable clustering of paths with low utilization. It is not surprising that even though a circuit has many paths, the majority of paths in these FUBs have low utilization. In other words, only a limited number of paths are frequently utilized. Since utilization is the primary driver for wearout, identifying these paths will significantly enhance monitoring efficiency without decreasing the confidence in monitoring.

4.4.3 CLDF Overheads

Table 4.2 shows the number of paths selected for optimization (labeled Opt paths) by each of the four approaches and the area overhead of the optimizations. Each approach leads to different area overheads depending on the structural, timing, and utilization characteristics of the circuit. The sparc_exu_ecl and lsu_stb_ctl FUBs do not show a steep critical path wall. Hence, as expected, the area overhead of all approaches for these FUBs is almost zero. In essence, our approach does not introduce any additional overhead when the underlying FUB is already well suited for monitoring. On the other hand, sparc_ifu_dec and sparc_exu_rml have a steep critical path wall, as shown in Figure 4.5(a) and (d). As a result there are overheads associated with all four approaches for these FUBs. The area overhead of Approach 1 is always higher for these FUBs since the vast majority of paths are simply optimized. In Approach 2, the increased correlation between the paths monitored and the paths optimized results in more robust monitoring that has less reliance on the path utilization profile. However, it increases the area overhead slightly since 25 additional paths are optimized. To increase monitoring robustness, Approaches 1 and 2 optimize more paths to give them extra timing margin. As a result these two approaches can dramatically alter the initial timing distribution. This shift is clearly seen in Figure 4.5(a) and (d) for Approaches 1 and 2, where the timing redistribution looks significantly different.

Table 4.2: Comparison of the four approaches for different FUBs.

                                                                    Approach 1         Approach 2         Approach 3         Approach 4
  Circuit block   Description            Area (μm^2)  nLong   Opt paths  % Area   Opt paths  % Area   Opt paths  % Area   Opt paths  % Area
                                                                          overhead            overhead            overhead            overhead
  sparc_ifu_dec   Instruction decode     87352        2974    2924       15       2949       25       50         3        25         2
  sparc_exu_ecl   Exec. control logic    255900       449     399        0        424        0        50         0        25         0
  lsu_stb_ctl     ST buffer control      95131        533     483        0        508        0        50         0        25         0
  sparc_exu_rml   Reg. management logic  175666       3449    3399       7        3424       7        50         2        25         1

Approaches 3 and 4 limit the number of paths to be optimized and hence the area overhead is significantly reduced. Approaches 3 and 4 enhance monitoring robustness by jointly optimizing the paths and relying on an elevated test frequency to control the area overhead. Since Approaches 3 and 4 only change the timing distribution of a fixed number of paths (nMonitor paths), they do not fundamentally alter the initial timing distribution. This can be clearly seen in Figure 4.5, where the path distributions after Approaches 3 and 4 are very similar to the initial distribution. In fact these approaches select less than 10% of the paths with timing margin less than 20% for optimization.

In summary, it has been shown that different FUBs benefit from different approaches. These differences are due to the differences in the timing margin distribution and utilization profile of FUBs. Aggressive optimization of the FUB is not always needed, and in scenarios where a FUB does not have a steep critical path wall, the ideal group of paths to be monitored can be selected without adding any optimization overhead. One interesting aspect of these four approaches is that their overheads depend on the initial timing and utilization profile of the circuit, and hence each approach is more suitable for a category of circuits with specific initial timing characteristics.

4.5 Related Work

Predictions by the International Technology Roadmap for Semiconductors (ITRS) [32] of more severe wearout in future technology generations have resulted in increased research efforts in modeling, detecting, and predicting wearout. Some methods have been specifically developed for prediction of wearout-related timing failures [2, 13, 70]. These methods tackle the problem of wearout at the microarchitecture level or by making circuit-level enhancements. While these approaches focus on only one layer of the system design, in this work we presented a novel approach which correlates microarchitectural wearout prediction techniques with the circuit design implementation. The resulting circuit design is aware of the presence of wearout monitoring and hence can make monitoring more robust and efficient. Our work uses the utilization of the circuit paths, driven by application-level information, to change the circuit implementation.

Research in using runtime behavior during circuit design time has expanded in recent years. Many of these efforts target improved power consumption or operation at reduced error rates. Design-time error rate analysis has been used for improving reliability in the presence of variations [50]. Circuit modifications proposed in [50] make the implementation of the circuit more suitable for timing speculation [8]. In Blueshift [28], targeted acceleration of frequently exercised paths is used to change the circuit implementation with the goal of improving the performance of timing speculation even in the presence of a critical path wall. In [34], the authors presented a design-time optimization with the goal of reducing error rates even when the circuit is operating at a reduced voltage. These approaches allow for more aggressive voltage scaling and increased power savings without impacting reliability.
In many of these prior studies the circuit is intentionally operated at a higher-than-nominal clock frequency, resulting in some circuit paths not meeting the timing constraint. In contrast, our approach does not do timing speculation. Rather, the goal is to continuously monitor circuit wearout efficiently. Hence, the design changes necessary for wearout monitoring are quite different from those necessary for timing speculation. We have exploited runtime path utilization information, as was done in [28], which has the different end goal of improving timing speculation. We use runtime information about how the design is used to reduce the number of paths needed for monitoring. We have taken advantage of the gradual nature of wearout and its dependence on utilization to correlate design-time optimization efforts with runtime wearout monitoring enhancements. The resulting cross-layer resiliency framework improves the effectiveness and efficiency of in situ circuit monitoring techniques.

4.6 Summary and Conclusions

As device sizes in a processor continue to shrink with each new process technology, there is a growing concern for reliability. While reliability issues can take different forms, wearout is a prevalent degradation where circuit timing margins gradually decrease over the lifetime of the circuit. Continuously monitoring wearout will become critical. By monitoring the amount of timing margin left in a circuit it is possible to enable just-in-time error detection and correction solutions. Since monitoring itself will be done continuously, it is necessary to improve monitoring efficiency. Selecting only the most critical paths in a circuit to monitor can reduce monitoring overhead. But in the presence of a critical path timing wall, monitoring overhead can be significant due to the need to monitor many paths.

This chapter addresses this serious bottleneck to monitoring in the presence of critical path timing walls. We present a cross-layer design flow that uses application knowledge to separate the more frequently used critical paths from the ones with low utilization. Since wearout is a function of utilization, using application-level information to derive path utilization provides new ways to improve monitoring efficiency and robustness. We describe four approaches to redistribute the timing of circuit paths that take advantage of this cross-layer utilization information. All of these approaches provide the designer the ability to trade off monitoring robustness against power and area overheads.

The proposed design is implemented in a novel evaluation framework that allows application-level information and circuit design tools to interact and exchange information. Our evaluation framework provides an automated mechanism to generate the best set of paths that need to be monitored given the design constraints. Using OpenSPARC T1 processor FUBs we evaluated the four proposed approaches. Our results show that all four approaches have unique capabilities that allow them to be applied to FUBs with different initial timing characteristics.

Chapter 5

Wearout-aware Runtime Use of Redundancy to Improve Lifespan

Wearout of processors during their service life results in gradual timing degradation and their eventual failure. Processor failure can occur due to wearout of a single structure even if the vast majority of the chip is still operational.
This chapter presents proactive runtime wearout-aware scheduling, WAS[74], polices at different levels of processor hierarchy which increase chip lifespan as well as improve reliability during that lifespan. WAS strives for uniform wearout of processor structures thereby preventing a single structure from becoming an early point of failure. The fine-grained microarchitectural level chip wearout control polices use feedback from a small network of timing margin monitoring sensors to identify the most wornout structures. Our evaluation shows that WAS can result in 15% to 30% improvement in lifespan of a multi- core processor chip with negligible performance and energy consumption impact. 5.1 Introduction Many reactive dynamic reliability management (R-DRM) frameworks have been proposed [6, 26, 48, 58, 59, 62] with the goal of improving processor performance by 113 reducing conservative guardbands and increasing chip lifespan. These techniques detect hard errors and disable faulty structure (e.g. cores [3, 59], microarchitectural structures [51, 58], or small circuit blocks [43]). R-DRM solutions react to wearout-induced failures but they do not prevent or control the degradation. In this chapter we present a framework to monitor the wearout state of various circuits within a core at runtime and proactively apply error avoidance techniques to postpone onset of these failures. In this work, remaining timing margin of the critical paths in different circuit blocks of the processor are monitored at runtime. Then a proactive dynamic reliability management (P-DRM) unit on the chip would control the wearout and utilization of the most vulnerable microarchitectural structures based on the amount of remaining timing margin detected. Rather than the reactive approach of diagnosing source of failure and dealing with it after the occurrence of an error, our proposed solution will postpone the onset of failures. Only after all preventive measures to postpone wearout-induced failures have been exhausted, the system falls back on error tolerance. This chapter makes the following contributions: We present a classification of non-uniform circuit wearout and its causes. We describe the reasons for timing margin heterogeneity and their effect on the lifespan of a processor. A wearout-aware scheduling framework is proposed to prevent premature failure of a circuit. This framework effectively utilizes existing redundancy at the circuit block level to improve lifespan with little to no performance or power overhead. 114 A hybrid architectural and microarchitectural wearout control approach based on correlated scheduling policies at different granularity (both at the core level and at the microarchitectural structure level) is implemented. Rest of this chapter is organized as follows. Section 5.2 provides the background information regarding hardware redundancy and timing margin profile of processors. Section 5.3 presents the details of our wearout-aware scheduling framework. In Section 5.4 details of the evaluation setup and lifetime wearout modeling is presented. Evaluation results are in Section 5.5. We discuss the related works in Section 5.6 and summarize and conclude this chapter in Section 5.7. 5.2 Background We first discuss different types of circuit redundancy in modern processor. Then reasons for manifestation of different amounts of timing margin degradation in different parts of a chip are highlighted. 
This timing margin heterogeneity can result in earlier than expected failure of a chip due to a timing violation caused by the circuit block with the least timing margin within the chip. Our proposed solution will exploit the abundantly available redundancy in current processors to reduce the occurrence of timing margin heterogeneity and extend processor’s lifespan. 5.2.1 Redundancy Presence of replicated structures in the processor which perform the same function, called natural redundancy, is inherent to most high performance processor designs. Natural redundancy is widely used for exploiting parallelism and improving performance. Study of an Intel ® Core-2 TM like CPU shows prevalence of both inter-core 115 and intra-core natural redundancy [48]. It is also shown that only 17% of the processor’s execution units are not naturally redundant. Processors also use artificial redundancy which is explicitly added with the goal of increasing system lifetime and not for improving performance. Examples of artificial redundancy include additional columns and rows in caches. Most of today’s high performance processors use a combination of natural and artificial redundancy to increase both fabrication yield and lifetime reliability. Redundancy is implemented at different granularities and redundant structures can be whole cores, referred to as Architectural redundancy, or can be smaller circuit building blocks of a core, referred to as Microarchitectural redundancy. Some structures, such as the clock tree or the interconnection network, typically have neither architectural nor microarchitectural redundancy. We assume non-redundant structures are protected using robust but expensive design techniques, such as gate sizing. Protecting non-redundant structures is outside the scope of the framework presented in this chapter. 5.2.2 Timing Margin Heterogeneity Clock frequency of a processor (f) is determined by the delay of its slowest path (D crit ), which is called a critical path. Clock period of a processor (T=1/f) is calculated by adding a timing guardband (G) to D crit , hence T= D crit +G. This timing guardband ensures that variation in the delay of signal paths does not cause timing violations. Delay variation in a circuit path can be due to process variations, operation conditions, and wearout. Timing margin is defined as the difference between the delay of a circuit path 116 and the clock period. Each processor core has many microarchitectural structures and each of these structures have their own critical path. In this chapter we will refer to timing margin of slowest path in each microarchitectural structure as timing margin of that structure. The structure with the least amount of timing margin at each point in time determines processor’s remaining life. 5.2.2.1 Reasons for Timing Margin Heterogeneity Figure 5.1 shows a classification of various reasons for timing margin heterogeneity. These reasons are described below: Design Time: Even during the processor design phase there exists a diversity of timing critical paths across different processor structures [47]. The main reason for this initial diversity is the difference between function and complexity of different circuit blocks. Furthermore, within-die process variations can result in different timing margins even for replicated microarchitectural structures with identical circuit design. 
For example, analysis of a chip multi-processor (CMP) fabricated in Intel's 65nm technology shows a 22% difference in the maximum achievable operating frequency between cores on the same chip due to within-die process variations [24].

Figure 5.1: Reasons for timing margin heterogeneity in a processor.

Runtime: Timing margins degrade non-uniformly across processor structures during the chip's lifetime. To the first order, the primary reason for this non-uniformity is variation in the utilization of different signal paths in a processor. Biased use of different system components (i.e. chips, cores, and microarchitectural structures which have the same function and design) can result in non-uniform timing margin degradation. It is interesting to note that even using replicas of the same unit for the same amount of time does not ensure homogeneous wearout. For example, using two different cores in a CMP for exactly the same amount of time does not ensure that they have the same timing margin degradation. This uneven wearout occurs because jobs running on these two cores might have different instruction mixes (e.g. different numbers of integer and floating point instructions), and each instruction would use a different microarchitectural structure more heavily compared to others. Table 5.1 summarizes the reasons for non-uniform utilization of identical chips, cores, microarchitectural structures, and signal paths.

Table 5.1: Non-uniform utilization at different levels.

  Level                           Reason for non-uniform use
  Chip                            Biased job scheduling in multi-chip systems
  Core                            Biased thread/job scheduling in multi-core chips
  Microarchitectural structures   Biased hardware usage of redundancy
  Signal paths                    Differences in input operand distribution

Runtime temperature and voltage variations resulting from differences in utilization cause further differences in the amount of timing degradation across different structures in the processor [29]. The same process variation phenomena that cause timing margin heterogeneity at design and fabrication time also play a role in runtime timing margin heterogeneity [61]. Differences in circuit path structure, such as different numbers and types of gates, can lead to different wearout rates which cause timing margin heterogeneity (e.g. more PMOS transistors than NMOS on a path would make it more susceptible to NBTI than HCI).

Due to all the variations highlighted above, design-time prediction of heterogeneous timing margin degradation is impractical. Hence, runtime monitoring of the remaining timing margin of the most critical paths in the circuit is necessary for accurately synthesizing a picture of the processor's wearout state at each point during its life. Many of today's chips are enhanced with built-in timing margin sensors which are used for post-fabrication testing and self-tuning. Many accurate and low power runtime wearout sensing frameworks have been proposed in recent years [2, 19, 36, 54, 70]. In this work, we will assume the presence of wearout sensors to monitor the timing margin of different microarchitectural structures in the processor at runtime.

In the dark silicon era, where chip power and temperature constraints restrict the operation of high performance processors and system-on-chip (SoC) products, uncontrolled wearout can result in a significant amount of timing margin heterogeneity. Most high performance systems cannot sustain operation with all hardware resources active and operating at maximum performance.
Dynamic power and temperature management techniques which use aggressive power and clock gating as well as Dynamic Voltage and Frequency Scaling (DVFS) can further increase the amount of timing margin heterogeneity. Proactive wearout leveling controlled by feedback from timing margin sensors on the chip would be critical in order to efficiently use the large number of hardware resources on this chip in a balanced way and exploit maximum possible systems lifetime. Figure 5.2 shows how a workload running for 10 years results in different amounts of wearout for a number of microarchitectural structures of a core. These results are collected from our simulation study (details are in Section 5.5). Wearout of the control logic for the reorder buffer (ROB), load queue (LDQ1, LDQ2), integer register file (IntREG), integer register alias table (IntRAT), and integer ALU (IntUnitALU1, IntUnitALU2) have been shown. IntUnitALU and LDQ both have 2 replicas each. These results show how different structures have different timing margin degradation and even two identical structures can differ in their timing margin. 120 Figure 5.2: Wearout of different microarchitectural structures. 5.3 Wearout-Aware Scheduling We propose Wearout-Aware Scheduling (WAS) which will use runtime timing margin feedback to change the utilization pattern and wearout rate of circuit structures of a processor. The goal of WAS is to create uniform wearout for all targeted structures thereby postponing early failures of a chip due to timing failure of a single structure. This goal is achieved by using existing redundancy in the processor. Wearout of structures is controlled by adjusting their usage and controlling their wearout recovery time. Selection of microarchitectural structures to monitor and control with WAS is a design decision for a given implementation of the processor and also depends on the location of redundancy. 5.3.1 An Illustration of WAS The example in Figure 5.3 highlights how differences in wearout rate can result in early failure of a structure while a replica of the same structure still has some remaining timing margin. In this example two replicas of a microarchitectural structure under different lifetime usage scenarios are compared. The two structures are two identical copies of the integer ALU unit in a 4-issue out-of-order core with the same initial critical 121 path delay, D crit (t 0 ). In the first scenario one of the replicas has suffered from higher wearout than the other one due to one or more of the reason discussed in Section 5.2. As a result of this difference in the wearout level, delay of the critical path in the structure with high level of wearout, curve labeled High Wearout, will exceed the allocated clock period (which includes a timing guardband) after just Y H years of use. While the replica of the structure which has low wearout, curve labeled Low Wearout, could have been used till Y L years, where Y L >Y H . As a result of the early failure of one structure at Y H , a whole core or even the CMP chip would be rendered unusable. WAS would balance the remaining timing margin of both replicas by controlling their usage and they both would follow the degradation curve labeled Controlled Wearout. As a result both replicas would fail at roughly the same time, Y C , and lifespan of the core is extended from Y H to Y C . This lifespan extension is a result of using, otherwise wasted, lifetime of the replica with lower wearout. Figure 5.3: Wearout control to extend lifespan. 
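The effect illustrated in Figure 5.3 can be reproduced with a toy model. The sketch below assumes a simple power-law delay growth for each replica, D(t) = D_0 · (1 + A · (u · t)^0.25), where u is the fraction of time the replica is used; the constant A and the 10% guardband are illustrative values chosen for this example, not the models or parameters of Section 5.4.

  # Toy illustration of Figure 5.3: balancing usage between two replicas
  # postpones the first timing failure. The delay-growth law and constants
  # below are illustrative assumptions, not the models of Section 5.4.

  D0, GUARDBAND, A = 1.0, 0.10, 0.06     # initial delay, 10% margin, wearout scale
  CLOCK = D0 * (1.0 + GUARDBAND)         # clock period = initial delay + guardband

  def delay(t_years, usage_share):
      """Critical path delay after t years at a given usage share (toy power law)."""
      return D0 * (1.0 + A * (usage_share * t_years) ** 0.25)

  def years_to_failure(usage_share, step=0.01):
      t = 0.0
      while delay(t, usage_share) <= CLOCK:
          t += step
      return t

  # Uncontrolled: one replica absorbs 80% of the work, its twin only 20%.
  y_high = years_to_failure(0.8)         # first failure (Y_H in Figure 5.3)
  # Controlled wearout: WAS balances the two replicas at 50% each.
  y_ctrl = years_to_failure(0.5)         # both replicas last until Y_C

  print(f"unbalanced: first replica fails after {y_high:.1f} years")
  print(f"balanced:   both replicas last       {y_ctrl:.1f} years")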
5.3.2 Microarchitectural Wearout-Aware Scheduling

WAS can be implemented at either the microarchitectural level or the architectural level. Microarchitectural Wearout-Aware Scheduling (MWAS) uses intra-core redundancy to control the wearout of replicated structures within a core. The objective of MWAS is to prevent large timing margin differences between replicas of the same structure within a core. MWAS is transparent to software and is a purely hardware-based approach to alter the utilization patterns of structures.

MWAS targets structures with intra-core redundancy, such as an integer ALU, which typically have two or more replicas per core. The timing margin of each target structure and its replicas is monitored with a timing margin sensor. The margin is checked by MWAS every T_MWAS_check clock cycles, and if the timing margin of a structure goes below a preset threshold, M_MWAS_activ_th, then the MWAS scheduling control policies described next are activated. When the difference between the remaining timing margin of the structure with the least timing margin and that of its within-core replica with the most timing margin exceeds a preset threshold, M_diff_th, MWAS will attempt to balance the timing margins by reducing the utilization, and hence the wearout rate, of the replica with the least remaining timing margin. M_MWAS_activ_th is an initial threshold that is used to avoid unnecessarily activating MWAS during the chip's initial life stage. The M_diff_th threshold is used to activate MWAS timing margin balancing techniques only when the timing margins of replicas diverge above this threshold.

A MWAS target structure is marked not to be used for a period of time, T_MWAS, by setting a busy bit (shown in Figure 5.4) which prevents use of that structure. A microarchitectural-level busy bit is already implemented for most structures with natural redundancy in order to assist the scheduling logic. MWAS relies on setting this busy bit without the need for modifications to the local scheduling hardware. For structures without a busy bit in the baseline implementation, a single MWAS_Busy control bit can be added for each MWAS target structure. This bit is checked by the local scheduler before sending operations to the structure, and if it is set, the resource is not used. A structure marked busy by MWAS can simply be treated as being busy performing an operation, and no new operations are assigned to that structure. After the T_MWAS period the target structure's busy bit is reset and the structure becomes available again.

Figure 5.4: Busy control bit.

During the T_MWAS period MWAS can initiate power/clock gating, which slows down wearout or stops its progress. MWAS can also put structures into a wearout recovery state (by using frameworks like [1]). This approach will prevent the creation of large timing margin differences between replicas, which is the main objective of MWAS, but it also selectively slows down, reverses, or prevents wearout, which by itself helps extend the lifespan of these structures. Power gated structures do not suffer from any wearout. Clock gating stops wearout phenomena which depend on the switching activity of transistors (e.g. HCI and electromigration). Gating also slows NBTI and PBTI because of the reduction in temperature. Use of special input vectors for NBTI recovery during clock gating is also an effective method to recover some of the timing degradation due to this phenomenon. Any combination of the above methods can be used to achieve wearout leveling.
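As a rough illustration of the bookkeeping this implies, the sketch below models one entry of the per-structure state that a MWAS-style controller would keep and the action applied for a T_MWAS window once a structure is selected. The field names and the set of gating actions are hypothetical and used only for illustration; Algorithm 5.1 in the next subsection gives the actual selection procedure.

  # Sketch of per-structure MWAS state and the usage-throttling action applied
  # for one T_MWAS window. Field names and action set are illustrative only.
  from dataclasses import dataclass

  @dataclass
  class FubState:
      group_id: int          # replicas of the same structure share a Group ID
      margin: float          # remaining timing margin from the wearout sensor (fraction)
      busy: bool = False     # MWAS_Busy bit checked by the local scheduler
      action: str = "none"   # "clock_gate", "power_gate", or "nbti_recovery"

  def throttle(fub: FubState, action: str = "clock_gate"):
      """Mark the worn replica unavailable for T_MWAS cycles and start gating/recovery."""
      fub.busy = True        # local scheduler now treats the unit as occupied
      fub.action = action    # slow, stop, or partially reverse wearout meanwhile

  def release(fub: FubState):
      """At the end of the T_MWAS window the structure becomes available again."""
      fub.busy = False
      fub.action = "none"

  # Example: the integer ALU replica with the smaller margin is rested for one window.
  alu0 = FubState(group_id=3, margin=0.04)
  alu1 = FubState(group_id=3, margin=0.07)
  worn = min((alu0, alu1), key=lambda f: f.margin)
  throttle(worn, "nbti_recovery")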
5.3.2.1 Implementation of MWAS Control Unit

The MWAS control unit, shown in Figure 5.5, is the hardware controller which analyzes the timing margin data from the distributed network of wearout sensors on the processor and uses the MWAS control policies, described in Algorithm 5.1, to control the wearout of different structures. Figure 5.5 shows the MWAS control unit and its inputs and outputs. This unit can be placed alongside the dynamic power and thermal management (DPM/DTM) control units present in most modern processors. An alternative would be to embed the control policies of MWAS within the existing DPM or DTM controllers. In order to clearly highlight the MWAS implementation, in this work we will assume the MWAS control unit is implemented as a standalone hardware control unit.

Figure 5.5: MWAS control unit and its inputs and outputs.

The MWAS control unit maintains the MWAS status table, which is a directory containing the wearout and operation status of all FUBs controlled by the MWAS control unit. Timing margins of the FUBs are read from a distributed network of wearout sensors. The operation status bits control the busy bit for each FUB as well as enable or disable power/clock gating or NBTI recovery for the FUB. The MWAS status table also provides the MWAS control policies with structural information regarding the FUBs using the "Group ID" field. All replicated FUBs (i.e. target structures for MWAS which have natural redundancy) have the same "Group ID".

Algorithm 5.1 shows how the MWAS control unit identifies an MWAS target structure. To summarize, the algorithm reads the timing margin of all redundant structures within a core once every T_MWAS_check cycles (i.e. the refresh rate of the timing margin, M_i, in the MWAS status table). It then sorts all the structures based on the amount of remaining timing margin. The algorithm then selects all structures that have a timing margin less than M_MWAS_activ_th. It then looks at the corresponding replica's timing margin to measure the gap in timing margin between the two replicas. These replicas are those which have the same "Group ID" in the MWAS status table. If the gap is larger than M_diff_th, it identifies the replicated structures as MWAS target structures whose utilization will be changed to reduce the timing margin difference. Utilization adjustments are achieved by setting the busy bit and/or initiating power/clock gating or NBTI recovery by setting the appropriate status bits in the MWAS status table.

Algorithm 5.1: MWAS, used for each core every T_MWAS_check cycles.

  Procedure MWAS(M_i(t), G_j, M_MWAS_activ_th, M_diff_th)
  Inputs:
    M_i(t)             Timing margin of all i microarchitectural structures monitored within the core at time t.
    G_j = {n, m, ...}  List of indexes of replicas of structure of type j within a core.
    M_MWAS_activ_th    MWAS activation timing margin threshold.
    M_diff_th          Activation timing margin difference threshold between structures of the same type.
  Output:
    MWAS_Busy_Structure   Index of the structure which requires MWAS wearout control.

  for each G_j
    [-, M_Max_j(t)] = Max(M_i(t) for i in G_j);
    [I, M_Min_j(t)] = Min(M_i(t) for i in G_j);
    M_Diff_j(t) = M_Max_j(t) - M_Min_j(t);
    if ((M_Min_j(t) < M_MWAS_activ_th) AND (M_Diff_j(t) > M_diff_th)) then
      MWAS_Busy_Structure = I;
    end
  end

  Procedure Max(M_i(t) for i in G_j) / Min(M_i(t) for i in G_j)
  Inputs:
    M_i(t) for i in G_j   List of timing margins for a group of i structures of type j.
  Outputs:
    (I, M_Max_j(t) / M_Min_j(t))   The index, I, of the structure with the most/least timing margin in group j and its timing margin value.

Selection of the different WAS thresholds is based on the available timing guardband as well as the wearout rates expected in the technology used for fabrication of the chip. More conservative guardbands (i.e. higher timing margins) as well as low expected wearout can be taken advantage of to use more relaxed thresholds, which result in less frequent use of the WAS policies. In MWAS it is possible that the two copies of a replicated structure are both worn out equally, i.e. their timing margins differ by less than M_diff_th. If such a structure is critical to the health of the overall chip, we rely on the Architectural Wearout-Aware Scheduling (AWAS) algorithm, described later, to protect this structure.

5.3.2.2 MWAS Enhancements

Opportunistic MWAS: Since wearout happens slowly over a long time, changing usage policies to control wearout does not have timing urgency. In other words, even if MWAS identifies a target structure which needs timing margin balancing, this does not require an immediate scheduling reaction. Opportunistic MWAS exploits this flexibility to allow more relaxed scheduling control policies. If all the replicated structures are needed to exploit parallelism and to be able to respond to operation periods with high resource demand, then MWAS allows the replicated units to be fully utilized for a short time window. Only when the system is not fully utilized can MWAS selectively disable a structure. We refer to this policy as opportunistic MWAS. Studies, such as [58], show that the natural intra-core redundancy present in current high performance processors, which are over-provisioned for worst-case workloads, is not used frequently. By using the more relaxed opportunistic MWAS approach, the performance degradation due to disabling a replica can be significantly reduced.

Resource Management: In order to further reduce the performance impact of MWAS, the instruction level parallelism (ILP) of all threads running on a CMP can be monitored, and when a thread has high resource demand due to high inherent ILP it can be assigned to run on a core which has the most available resources. Cores with limited resource availability, due to enforcement of usage control by MWAS, are selected for less resource-demanding threads.

5.3.3 Architectural Wearout-Aware Scheduling

Architectural Wearout-Aware Scheduling (AWAS) uses inter-core redundancy to control the utilization of microarchitectural structures in different cores by controlling thread assignment to cores. AWAS receives the timing margin of different microarchitectural structures inside all of the cores in the processor once every T_AWAS_check clock cycles. AWAS selects the structure with the least timing margin among all the cores as the AWAS target structure. If the timing margin of the AWAS target structure is below M_AWAS_activ_th, then the timing margin of this structure is compared with all of its replicas across different cores. In a homogeneous CMP with N identical cores, there are going to be at least N replicas of the target structure.
If the timing margin difference between the AWAS target structure and its cross-core replica with the most remaining timing margin exceeds a preset threshold, M_diff_th, then core-level thread assignment is altered by AWAS in order to change wearout rates and balance the timing margin of the two replicas across the two different cores. AWAS attempts to balance the cross-core timing margin heterogeneity of the AWAS target structure by assigning threads which use the AWAS target structure less to the core containing that structure. For example, in a quad-core CMP, if the integer ALU unit of core number 1 is selected as the AWAS target structure, the wearout rate of this structure in core 1 is slowed down to prevent it from jeopardizing the functionally correct operation of this core. This wearout slowdown is achieved by assigning threads that use the integer ALU unit less to this core.

AWAS relies on structure utilization counters to guide core-level thread scheduling. A utilization counter is attached to each microarchitectural structure for which the timing margin is monitored. Each time the structure is used, the counter for that structure is incremented. The counter values, indicating the number of times the structure was used during the T_AWAS_check period, are sent to the AWAS controller along with the timing margin information.

5.3.3.1 AWAS Implementation

AWAS implementation requires modification to the thread scheduling software. The AWAS control policy procedure which enhances the thread scheduling software is shown in Algorithm 5.2. AWAS requires hardware support to maintain a status table similar to the one presented for MWAS. This table, referred to as the AWAS status table, maintains the same information as the MWAS status table shown in Figure 5.5, for FUBs which span different cores. The AWAS status table is accessed by the AWAS control procedure running as part of the thread scheduling software. The access mechanism is identical to how performance counters on processors are accessed by scheduling software.

Algorithm 5.2: AWAS, executed every T_AWAS_check cycles.

  Procedure AWAS(M_i(t,k), G_j, M_AWAS_activ_th, M_diff_th)
  Inputs:
    M_i(t,k)           Timing margin of all i microarchitectural structures monitored within all k cores at time t.
    G_j = {n, m, ...}  List of the indexes of replicas of structure of type j within any of the CMP cores.
    M_AWAS_activ_th    AWAS activation timing margin threshold.
    M_diff_th          Activation timing margin difference threshold between structures of the same type.
  Outputs:
    AWAS_Structure     Index of the structure which requires AWAS wearout control.
    AWAS_Core          Index of the core which contains the AWAS_Structure.
    AWAS_Type          Type of the AWAS_Structure.

  [AWAS_Core, AWAS_Type, I, M_Min(t)] = Min(M_i(t,k) for i,k in CMP);
  [-, -, -, M_Max(t)] = Max(M_i(t,k) for i in G_AWAS_Type(k) and k != AWAS_Core);
  M_Diff(t) = M_Max(t) - M_Min(t);
  if ((M_Min(t) < M_AWAS_activ_th) AND (M_Diff(t) > M_diff_th)) then
    AWAS_Structure = I;
  end

  Procedure Max(M_i(t) for i,k) / Min(M_i(t) for i,k)
  Inputs:
    M_i(t) for i,k     List of timing margins of i structures in k cores of the CMP at time t.
  Outputs:
    (AWAS_Type, AWAS_Core, I, M_Max(t) / M_Min(t))   Type of the structure with the maximum/minimum timing margin, the index of the core containing it, the index of the structure, and the maximum/minimum timing margin value.
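A minimal sketch of the thread-steering side of Algorithm 5.2 is shown below: given the cross-core target identified by the algorithm and the per-thread utilization counters described above, the scheduler places the thread that exercises the target structure type the least on the core that holds the worn replica. The data layout (nested dictionaries keyed by core, structure and thread) is an assumption made only for illustration.

  # Sketch of AWAS thread steering. margins[core][struct] holds sensor readings,
  # usage[thread][struct] holds utilization counters from the last T_AWAS_check
  # window. Data layout is an illustrative assumption.

  def pick_awas_target(margins, m_activ_th, m_diff_th):
      """Return (core, struct) needing AWAS control, or None."""
      core, struct = min(((c, s) for c in margins for s in margins[c]),
                         key=lambda cs: margins[cs[0]][cs[1]])
      m_min = margins[core][struct]
      # most-margin replica of the same structure type in the other cores
      m_max = max(margins[c][struct] for c in margins if c != core)
      if m_min < m_activ_th and (m_max - m_min) > m_diff_th:
          return core, struct
      return None

  def assign_threads(threads, cores, usage, target):
      """Give the core holding the worn structure the thread that uses it least."""
      assignment = {}
      if target is not None:
          worn_core, struct = target
          gentle = min(threads, key=lambda t: usage[t][struct])
          assignment[gentle] = worn_core
          threads = [t for t in threads if t != gentle]
          cores = [c for c in cores if c != worn_core]
      for t, c in zip(threads, cores):     # remaining threads: any default policy
          assignment[t] = c
      return assignment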
Since AWAS uses past utilization profile of structures as a predictor for future utilization, the effectiveness of AWAS technique would depend the workload and how persistent the utilization of the target structure is in a thread. Algorithm 5.2 shows an implementation of AWAS procedure. Thread swapping to control wearout is necessary only when all the cores are active but processors are typically underutilized [11]. Rather than swapping active threads between cores AWAS can be done by not assigning any active thread to the core with the wornout structure. 5.3.3.2 AWAS Enhancements Opportunistic AWAS: Gradual nature of wearout provides DRM techniques with a long window of time, possibly many days, to attempt balancing timing margins. During this long period, AWAS policies would initially attempt to control wearout opportunistically (i.e. using low utilization periods) without any loss of performance. In this opportunistic approach no thread swapping would take place, instead when number of threads in the system is less than number of resources available to execute them, the core containing the AWAS target structure would be selected to be left idle or power/clock gated. On-demand Utilization Monitoring: Utilization monitoring is not always needed and in order to conserve power it only gets activated when AWAS control is needed (i.e. that is when a structure meets M AWAS_activ_th and M diff_th conditions). Only when such a structure is selected then for the next T AWAS_check period the utilization 131 counters would be activated and would report the utilization profile of the only cross-core replicas of the AWAS target structure. 5.3.4 Wearout-dependent M diff_th Both MWAS and AWAS use an important characteristic of the wearout induced timing degradation and that is the rapid initial timing degradation rate which slows down in the later stages in the lifetime of the processor. This can be clearly observed in the Figure 5.2 and Figure 5.3 curves in form of steep increase in the path delay followed by a decrease in the slope during the later stage of the chip’s life. A direct result of this observation is that wearout control in early stages of the processors lifetime will have larger impact in extending lifespan. As a result both MWAS and AWAS techniques use a timing margin difference threshold which is a function of Wearout_level of the processor, M diff_th (Wearout_level). Wearout_level is the smallest timing margin of any monitored structure across all the cores in the CMP. This varying threshold will be lower in early life to promoting more frequent usage of WAS techniques when they can have the most lifespan impact. Then as the chip suffers from more wearout and the wearout level balancing effect of WAS techniques is reduced they are also used less often. It should be noted that varying M diff_th (Wearout_level) impacts only the usage frequency of WAS techniques but it does not control whether to use opportunistic WAS or WAS which can cause performance degradation. The amount of remaining timing margin of the structures, rather than the difference, is what dictates whether opportunistic WAS with near-zero performance overhead can be used or aggressive WAS with some negative performance impact needs to be enforced. Hence, even with the constant M diff_th , less 132 aggressive opportunistic techniques are used in early life due to higher overall remaining timing margin of structures. 
5.3.5 Comparison of MWAS with AWAS The main difference between MWAS and AWAS frameworks is that in AWAS higher level job assignment (i.e. core level) is used to the control lower level microarchitectural structure wearout rate. But in MWAS microarchitectural level utilization adjustments, by disabling structures, is used to control their wearout rate. In both approaches the timing margin is monitored at the microarchitectural structure level. A key difference between the two WAS techniques is in when and where they are most effective. MWAS Application Domain: MWAS is a great option for structures which have multiple replicas within a core, specially when this replication is beyond what is needed for performance improvement for typical expected workloads of the processor. When there is limited ILP in the execution trace MWAS can be used without any performance impact. Another suitable time for MWAS is when within core wearout level heterogeneity is high. Furthermore, wearout control for structures not commonly used by all instructions can be done using MWAS without any performance impact. AWAS Application Domain: AWAS is more suitable method of wearout control for microarchitectural structures which do not have intra-core replicas. Altering utilization of these structures can only be done by transferring their workload to their corresponding replicas in other cores using AWAS. An example of systems which would mostly benefit from AWAS is CMP systems with simple in-order cores. Such core would 133 not have much intra-core natural redundancy (e.g. OpenSPARC T1, and most GPUs), and hence are not good candidates for MWAS. On the other hand, AWAS exploits inter-core natural redundancy, which is abundant in this category of processors. Another group of structures which would also benefit more from AWAS are structures commonly used by most threads and most instructions. Structures which have natural redundancy but are critical for exploitation of ILP within a core would also be good candidates for use of AWAS since use of MWAS can result in loss of core performance and incapability to fully exploit ILP. An example of structure of this type would be frontend units of processors such as instruction decode. 5.3.6 Hybrid Wearout-Aware Scheduling Since MWAS and AWAS are useful under different operating conditions, it is ideal to combine both these approaches into a single scheme, which we call Hybrid Wearout-Aware Scheduling (HWAS) framework. Figure 5.6 shows how the HWAS framework works. Once every T HWAS cycles every structure I within a core K is selected for WAS analysis. HWAS always gives priority to MWAS when possible given the more targeted and stronger effect of MWAS. Threshold conditions of Algorithm 5.1 are checked to see if MWAS is needed. If Algorithm 5.1 identifies structure I as a target for MWAS then MWAS is activated on structure I. Vast majority of time MWAS will select opportunistic MWAS which has zero performance overhead. In the rare event that MWAS results in performance loss because structure I is being actively used then the HWAS selector disables MWAS for structure I. For this purpose HWAS does a short pilot run, T WAS_pilot << T HWAS , and if the performance impact, in term of instructions per 134 cycle (IPC) reduction, is below an acceptable threshold, IPC diff_th , then the MWAS technique would be continued. Otherwise MWAS is disabled and HWAS then tries to use AWAS as the next desirable approach to improve the reliability of structure I. 
In this case Algorithm 5.2 is checked to see if AWAS is needed. If needed, AWAS moves the thread running on core K to a different core. Again vast majority of time AWAS could be done opportunistically with negligible performance overhead. In the rare case the performance overhead due to AWAS is high, AWAS is disabled and the execution is continued on the original core. 135 Figure 5.6: Flow chart of HWAS. 5.3.6.1 HWAS Implementation HWAS control is implemented within the thread scheduler, similar to the how AWAS is implemented. The main addition to HWAS is that it would also read the MWAS status tables and would be able to override MWAS decisions made in the MWAS 136 control unit. Disabling MWAS or AWAS simply means that wearout-leveling benefits from WAS are not gained during a period in the operation of the processor when their use would result in unacceptable performance loss (the level acceptable performance loss is indicated by IPC diff_th which is a parameter decided by the design team based on the amount priority wearout-leveling is given relative to performance). The duration which MWAS or AWAS are disabled can rage anywhere between a few minutes to a few days. It should be noted that most high performance processors which are the target chips for implementation of WAS have hardware performance monitoring enhancement to report the IPC of different cores. The IPC reported by these performance monitors is used by HWAS to gauge the impact of resource reduction by MWAS during the above described T WAS_pilot . If WAS is to be implemented on a processor without such performance monitors they can easily be added. It should be noted that whenever MWAS is possible (i.e. for structure with within core redundancy) it is done rather than AWAS due to higher effectiveness of MWAS (i.e. because microarchitectural power/clock gating would be possible and workload behavior prediction is not needed). There exist scenarios where MWAS is not desirable but AWAS is and vise versa (details in Section 5.3.5). This is due to the fact that both opportunities as well as the impact on performance at a given time during lifetime differ significantly for MWAS and AWAS. For example, MWAS can be used with minimal performance impact even when there are threads running on all cores of the CMP but one or more of the threads lack sufficient ILP and are not fully using the natural redundancy within a core. On the other hand, the above scenario is not an opportunity for AWAS. Opportunity 137 to do AWAS is when the number of threads on CMP is less than the number of cores and ILP within in each thread does not matter. Benefits from HWAS are a result of capability to achieve WAS goals utilizing both of the above opportunities. Furthermore HWAS can balance the timing margin for structures with intra core redundancy as well as the ones without it. 5.4 Evaluation Setup In order to evaluate the wearout-aware scheduling frameworks proposed and measure their impact on performance, energy consumption, and lifespan of processors we rely on accurate wearout models which are implemented in our simulation framework. 5.4.1 Device Level Wearout Models We use device level models for two dominant electro-physical phenomena causing wearout, NBTI and HCI. Both phenomena result in gradual increase of transistor threshold voltage (V th ). NBTI which is more dominant [64] impacts PMOS and HCI impacts NMOS transistors. 
A direct result of this gradual increase in V_th is an increase in the switching delay (D_s) of the transistors affected:

    D_s ∝ (V_dd · L_eff) / (μ · (V_dd − V_th)^α)                                                        (5.1)

Switching delay is computed using the alpha power law, where α ≈ 1.3. We use the supply voltage (V_dd), effective channel length (L_eff), and mobility (μ) for the 32nm technology. Threshold voltage increases due to NBTI and HCI can be calculated using the formulas in the following subsections.

5.4.1.1 NBTI

The Reaction-Diffusion model [44] explains the NBTI phenomenon, which happens when the gate of a PMOS transistor has a logic 0 value and V_gs = -V_dd. As a result of the holes present in the channel, the Si-H bonds are broken at the interface between the channel and the gate oxide. Positive traps due to Si+ at the interface will increase V_th after the H diffuses away [38]. The phase in which a PMOS has a logic 0 value at its gate and suffers from the above phenomenon is called the NBTI stress phase. The V_th increase during the stress phase depends on V_dd and temperature (T) and is calculated using [63]:

    ΔV_th_stress = A_NBTI · t_ox · sqrt(C_ox · (V_dd − V_th)) · exp((V_dd − V_th) / (t_ox · E_0)) · exp(−E_a / (k · T)) · t_stress^0.25      (5.2)

E_0, E_a, and k are constants equal to 0.2 V/nm, 0.13 eV, and 8.6174×10^-5 eV/K, respectively. C_ox is the gate capacitance per unit area, and t_ox is the oxide thickness; these have values 4.6×10^-20 F/nm^2 and 0.65nm, respectively. t_stress is the amount of time the PMOS is in the stress phase and A_NBTI is a constant which depends on the rate of wearout. When the gate of the PMOS has a logic value 1 and V_gs = 0, the transistor is turned off and H atoms diffuse back, resulting in the elimination of some of the traps created in the stress phase. This is called the NBTI recovery phase. The overall impact of NBTI on V_th after both phases is calculated using [63]:

    ΔV_th = ΔV_th_stress · (1 − sqrt(η · t_rec / (t_stress + t_rec)))                                    (5.3)

η is a constant equal to 0.35 and t_rec is the amount of time the PMOS has a logic 1 value at its gate.

5.4.1.2 HCI

HCI affects NMOS transistors and happens when high energy electrons, referred to as hot electrons, are accelerated in the electric field of the channel and collide with the interface with the gate oxide. The result of this collision is the creation of electron-hole pairs, and the hot electrons get trapped in the gate oxide, increasing the V_th of the transistor. This phenomenon happens during logic transitions and is proportional to the switching frequency of transistors [60]:

    ΔV_th_HCI = A_HCI · α · f · exp((V_dd − V_th) / (t_ox · E_1)) · sqrt(t)                               (5.4)

t_ox is the oxide thickness and E_1 is a constant, with values equal to 0.65nm and 0.8 V/nm [66], respectively. α is the activity factor and f is the operating clock frequency. A_HCI is a constant for the wearout rate. These NBTI and HCI models have been validated by [38, 44, 60, 63].

In summary, the V_th degradation caused by both NBTI and HCI follows a power law relationship with time, with exponents 0.25 and 0.5 respectively. Threshold voltage, temperature, amount of NBTI stress time (PMOS), and activity factor (NMOS) are dynamic parameters that affect the amount of wearout-induced timing degradation transistors suffer from.
5.4.2 Critical Path Model

The wearout level of each logic circuit block is dictated by the wearout of the slowest, most critical, path in that block. We model the wearout of microarchitectural structures by modeling the wearout of their critical paths. The critical path of each microarchitectural structure within a processor is modeled as a chain of gates. For simplicity of description, let us assume this critical path consists only of inverters. The delay of a critical path consisting of N inverters is the delay of N/2 PMOS transistors plus N/2 NMOS transistors. This is because for each output transition, either 0-to-1 or 1-to-0, only half of the PMOS and half of the NMOS transistors charge or discharge their outputs. Equations 5.3 and 5.4 are used to calculate the V_th changes due to NBTI and HCI for each of the PMOS and NMOS transistors, respectively. Then Equation 5.1 is used to find the wearout-induced switching delay increase for each transistor. The sum of the wearout-affected delays of these N/2 PMOS and N/2 NMOS transistors is the delay of the critical path after wearout. Critical paths in different microarchitectural structures can contain gates other than inverters and hence have a different ratio of PMOS to NMOS transistors. We model this difference and show its impact on WAS lifespan improvements.
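Continuing the previous sketch (same file), the fragment below applies the per-device models to such an N-gate chain. The path length, guardband, and prefactor defaults are illustrative assumptions chosen only so the example produces a visible delay shift.

```python
N_GATES = 10          # assumed default critical path length (the text reports 7-17 gates)
GUARDBAND = 0.10      # default 10% timing guardband

def path_delay_after_wearout(age_seconds, stress_frac=0.5, activity=0.1,
                             freq_hz=3e9, temp_k=350.0,
                             a_nbti=5e6, a_hci=1e-15):
    """Return (aged_delay, violated) for one critical path after age_seconds."""
    t_stress = stress_frac * age_seconds          # PMOS time spent in NBTI stress
    t_rec = age_seconds - t_stress
    vth_pmos = VTH0 + dvth_nbti(VTH0, temp_k, t_stress, t_rec, a_nbti)
    vth_nmos = VTH0 + dvth_hci(VTH0, activity, freq_hz, age_seconds, a_hci)
    # Half the gates charge through PMOS devices, half discharge through NMOS devices.
    aged = (N_GATES / 2) * (switching_delay(vth_pmos) + switching_delay(vth_nmos))
    fresh = N_GATES * switching_delay(VTH0)
    violated = aged > fresh * (1.0 + GUARDBAND)   # timing fails once the guardband is consumed
    return aged, violated

# Example: remaining-margin check after roughly seven years of operation.
delay, failed = path_delay_after_wearout(7 * 365 * 24 * 3600)
```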
5.4.3 Wearout of Microarchitectural Structures

In order to do dynamic runtime wearout-aware scheduling, a distributed network of wearout sensors is needed on the chip. We assume there is a wearout sensor for each microarchitectural structure of the processor, which measures the timing margin of the most critical path in that structure during the lifetime of the processor. For the processor we modeled in our evaluations, we divide each processor core into 30 microarchitectural structures (e.g., integer ALUs, floating point multipliers, ROB control logic, etc.). We use the wearout models described above to simulate the lifetime wearout reported by these sensors. Note that the wearout of memory structures is modeled as the wearout of their critical path, which goes from the inputs through the decoder and eventually to the output of the sense amplifier.

5.4.4 Model Parameters

Due to the high degree of dependence on the fabrication technology and lifetime workload, we use a range of values for some of the parameters in our model. The default values and the ranges used are selected to be similar to those assumed in the literature. The sensitivity analysis in the results section highlights the impact of these variations. One of the parameters that affects the wearout of PMOS transistors due to NBTI is the average fraction of time the PMOS is in stress mode. A range between 30% and 70% (with a default value of 50%) is used. Another parameter is the relative impact of NBTI and HCI on wearout. A range of 1 to 10 with a default of 3 (NBTI-induced timing degradation is 3 times that of HCI), as suggested in [12, 44], is used. As highlighted earlier, in reality a critical path can consist of gates other than inverters, and hence the ratio of NMOS to PMOS transistors in a critical path can vary. We use a range of 0.1 to 10 with a default value of 1 for the ratio of NMOS to PMOS transistors on a path. The default value of 1 corresponds to a path that consists of a chain of inverters, or a chain of gates that results in an equal number of NMOS and PMOS transistors on the path. Our study of the OpenSPARC T1 [45] processor core indicates that the most critical paths of the microarchitectural structures of this processor have between 7 and 17 gates on them; we use 10 as the default.

5.4.5 Architectural Simulation

We use the SESC [33] execution driven simulator. A CMP running at a frequency of 3 GHz is simulated. We evaluate the impact of our wearout-aware scheduling techniques for CMPs with 4, 8, 16, 32, 64, and 128 cores. The processor has homogeneous out-of-order cores similar to the Alpha 21264 with private L1 and L2 caches. We use Wattch [18] and HotSpot [55] for power and temperature modeling, respectively. The wearout of 30 microarchitectural structures per core is monitored. These 30 structures cover all the major functional blocks of the core. 14 structures do not have natural redundancy at the core level (e.g., there is only one branch predictor per core) but have redundancy across cores. The remaining structures consist of 8 functionally unique structures that each have one naturally redundant unit within the core (e.g., the integer ALU and the load queue each have two replicas per core). In order to simulate the difference in the timing margin of replicas, due to the reasons highlighted in Section 5.2.2.1, we simulate a usage difference between replicas of the same structure. This difference in utilization affects the wearout rates and accounts for the fact that the critical path of a structure is not always sensitized when the structure is utilized. It also accounts for the bias present in many local schedulers (e.g., round robin scheduling with biased use of the units with lower index). We use a range from 20% to 80% with a default of 40% structure usage imbalance.

5.4.6 Workload Diversity

The workload used consists of 36 integer and floating point benchmarks: 15 benchmarks from SPEC2000, 9 from SPEC2006, and 12 from the SPLASH benchmark suites. 329 SimPoints [30] of 100 million instructions each from SPEC2000 and 227 SimPoints of 100 million instructions each from SPEC2006 are used. The SPLASH benchmarks are executed to completion, with a total of 35 billion instructions. These 36 benchmarks are put into random order and executed consecutively. This 36-application workload prevails for one week of lifetime, and then the order of execution is shifted by one benchmark each week. After 36 weeks a new random ordering of the benchmarks is generated, and the weekly reordering continues for another 36 weeks as before. To model a 10 year lifespan, a total of 540 different week-long workloads are used. Different workloads are scheduled on each of the cores of the CMPs simulated (e.g., for a 4-core CMP a total of 2160 different 36-application workloads are used). The temperature trace and utilization profile of each of the 30 microarchitectural structures within each core are used by the wearout models described earlier to calculate the wearout level of each structure. The WAS algorithms use 306,180 wearout measurements reported per structure during the 10 year simulated period (about 3 wearout level readings per hour of the chip's life).

5.5 Results

5.5.1 Lifespan Improvement

The lifespan of a processor is the time until the first failure in the processor. The objective of WAS is to use wearout-leveling to extend the lifespan of the processor and postpone this first failure. Figure 5.7 shows the percentage of lifespan increase achieved using the WAS techniques relative to a baseline without them. Results are shown for 4 to 128 core CMPs, for the default timing guardband value of 10% and 40% usage imbalance between naturally redundant units. All the cores have a workload that keeps them busy 90% of the time. This high utilization rate is selected to highlight how our adaptive framework can have significant benefits even when idle opportunities are scarce; WAS lifespan extension under lower utilization rates is higher than the conservative values reported (this is shown in Figure 5.12). The WAS algorithms implemented for these evaluations use M_MWAS_activ_th = M_AWAS_activ_th = 6% and M_Diff_th = 0.5% (i.e., MWAS and AWAS scheduling decisions are activated when the timing margin of a structure is below 6% and the difference in timing margin between two structures is above 0.5%).
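A minimal sketch of this activation check is shown below; the threshold values come from the text, while the function name and margin encoding are assumptions made for illustration.

```python
M_ACTIV_TH = 0.06   # activate WAS once remaining timing margin drops below 6%
M_DIFF_TH = 0.005   # and a replica's margin differs by more than 0.5%

def should_level(margin_critical, margin_replica):
    """True when wearout-leveling between the two units is worth triggering."""
    return (margin_critical < M_ACTIV_TH
            and (margin_replica - margin_critical) > M_DIFF_TH)

# Example: a structure at 5.4% margin next to a replica at 6.3% triggers leveling.
assert should_level(0.054, 0.063)
assert not should_level(0.070, 0.090)   # still above the activation threshold
```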
MWAS is limited to controlling the wearout of only those structures with intra-core natural redundancy. AWAS is limited by the amount of low core utilization opportunities available in the workloads and by the accuracy of predicting the structural utilization of the workloads. HWAS exploits the wearout control opportunities of both MWAS and AWAS by dynamically selecting the best approach; hence, HWAS gives the best lifespan improvement. An average lifespan improvement of 17%, 8%, and 23% is observed for MWAS, AWAS, and HWAS, respectively. All the cores simulated have different workloads to mimic realistic usage scenarios. AWAS requires idle or under-utilized cores in order to switch active threads away from vulnerable cores; as a result, AWAS benefits are workload dependent. MWAS, in contrast, does not rely on either core or FUB underutilization and will disable MWAS target structures even if there is a negative performance impact. This performance impact is quantified in Section 5.5.2.

Figure 5.7: Lifespan improvement due to WAS techniques.

Figure 5.8: Impact of WAS on when half the cores have failed.

The lifespan reported in Figure 5.7 is based on the failure of the first core in the CMP due to a timing violation caused by its most critical structure. A CMP enhanced with a DRM framework would fall back on graceful performance degradation after the failure of its first structure/core, and hence the CMP can continue operation after one or more of its cores are decommissioned. The results in Figure 5.8 show the lifespan as the time when half the cores in the CMP fail, called the half-life of the CMP. Although the primary objective of WAS is to postpone the first failure in a processor, we also look at an additional benefit of WAS, which is the extension of the graceful performance degradation phase of the processor's lifespan (for processors that can tolerate the failure of some of their cores). The MWAS technique improves the lifespan of the most critical structure within each core, and hence the lifespans of all the cores in the system are improved. As a result, in addition to postponing the first core failure, MWAS increases the half-life of the system by an average of 6.2%. AWAS, on the other hand, balances utilization at the granularity of a core. Hence, a utilization reduction for the most critical core can result in higher utilization of other, less critical, cores. As a result, AWAS can postpone the failure of the first core, but this can lead to a small reduction in half-life (an average of 0.7%), particularly for CMPs with fewer cores where the options for switching workloads are limited. AWAS always increases the lifespan without any core failures, as shown in Figure 5.7. HWAS combines the balancing benefits of MWAS and AWAS and results in an average half-life increase of 4.6%.

5.5.2 Performance Impact

Since MWAS infrequently alters the availability of processor resources, our measurements of the average IPC of each core and of the CMP show negligible (<0.3%) degradation. But during the infrequent windows of time when MWAS is disabling a structure within a core, it temporarily reduces the IPC of that core. Figure 5.9 shows the IPC degradation during these short periods when MWAS is active. The average IPC degradation for all the CMPs simulated is 10%, during the switching period only. The performance overhead of AWAS is a result of thread migrations. Previous research [22] shows that when the interval between thread migrations is kept above 160 thousand cycles the average performance loss is less than 1%.
In our framework the minimum number of cycles between AWAS-initiated thread migrations is 3,000 billion cycles, which is much longer than the above suggested interval. As a result, there is no measurable performance degradation observed when AWAS is used. HWAS has almost the same overhead as MWAS, as shown in Figure 5.9.

Figure 5.9: Performance impact of MWAS and HWAS.

5.5.3 Energy Impact

AWAS, MWAS, and HWAS have a negligibly small energy consumption impact relative to the baseline. This is because all these techniques use already existing natural redundancy, and when a structure is not used it is power gated. This not only helps with wearout control but also conserves energy. The distributed wearout sensors and the WAS control logic would have a small power consumption similar to conventional dynamic power/thermal management controllers.

5.5.4 Sensitivity Analyses

Figure 5.10 shows the effect of sensor inaccuracy on the lifespan extension achievable by MWAS. The horizontal axis shows the percentage of wearout sensor inaccuracy and the vertical axis shows the percentage of WAS lifespan extension relative to the WAS lifespan extension possible with a sensor with no inaccuracy. The curve labeled Average shows the lifespan extension averaged over the 280 workloads simulated. The curves labeled Best and Worst show the sensitivity to sensor accuracy for a core executing the workloads which resulted in the least and most wearout, respectively (e.g., the Worst workload is the one which resulted in the most wearout and the shortest lifespan). Figure 5.10 shows that wearout sensor accuracy of 20% or better is needed in order to achieve the benefits of the WAS framework. Most available timing margin sensors have accuracy better than this threshold.

Figure 5.10: Effect of timing margin sensor inaccuracy.

Figure 5.11 shows the lifespan distribution of 280 cores with different workloads. From left to right the distributions are for guardband values of 8.5%, 10%, and 11.5% (as a percentage of D_crit(t_0)). Each column shows the number of cores which had a lifespan of the same number of years. The columns in red mark the cores which had a lifespan equal to the average lifespan of the 280-core cluster with a specific guardband. Two important observations can be made from these results. First, the width of the lifespan distribution increases as the amount of guardband is increased. Second, and more importantly, the shape of the spread and the relative location of the average in the distribution also change. For larger guardband values the average lifespan moves farther from the lifespan of the core with the longest lifespan. The main reason is that for larger guardbands, and as a result longer lifetimes, the wearout effects of different workloads can produce more diverse timing margins between different core structures. Hence, the difference between the lifespan of the core with the shortest life and the one with the longest increases. This shows that for cores with longer lifespans, timing margin heterogeneity is going to be higher and the benefits from WAS increase.

Figure 5.11: Lifespan distribution for different guardbands.

It should be noted that process variation effects and structural differences between the critical paths of different microarchitectural structures have been explicitly excluded from the experiments in the above discussion in order to isolate the wearout heterogeneity due to lifetime workload differences.
But if these variations are considered, the observed heterogeneity will be greater and the use of WAS techniques would be even more beneficial.

Figure 5.12 shows the impact of changes in three parameters of the wearout model on the lifespan improvement observed for each of the WAS techniques. Note that the scale of the vertical axis is different for the three plots in Figure 5.12.

Figure 5.12: Sensitivity analyses to opportunities, imbalance, NBTI/HCI impact, PMOS stress.

Opportunities: As shown in the leftmost set of three bars in Figure 5.12, decreasing the percentage of core utilization from 95% to 85% (i.e., increasing the opportunities for WAS from 5% to 15%) results in increasingly better performance of MWAS and HWAS. AWAS does not follow the same trend as MWAS and HWAS, for the following reason. An opportunistic AWAS policy is used in this study, and wearout leveling is only achieved by shifting the workload of a core with a critically worn-out structure to a core which is idle (i.e., AWAS is not done at the cost of performance loss if idle cores are not available). This policy always gives priority to performance and hence is only effective when idle cores are available at the time when AWAS needs to switch the workload of the critical core. This is why simply adding more underutilization opportunities is not sufficient for observing increased lifespan due to AWAS. The benefits from opportunistic AWAS increase only if the idleness of cores happens when AWAS needs to switch workloads.

Imbalance: As shown in the second group of three bars from the left in Figure 5.12, as the imbalance between replicated structures is increased from 20% to 60% the benefits of MWAS and HWAS increase, but AWAS does not show the same trend. This is because AWAS does not control the wearout heterogeneity of replicated structures within a core; rather, it performs wearout-leveling across different cores, which is not impacted by changing this parameter.

NBTI/HCI Impact: Increasing the impact of NBTI on timing degradation relative to HCI from 50% to 99% (third group of bars from the left in Figure 5.12), and increasing the percentage of time the PMOS transistor is under stress (the last group of bars in Figure 5.12), result in increased benefits from MWAS and HWAS due to the higher levels of wearout observed. Changes in path length and in the relative number of PMOS/NMOS transistors had a negligible impact on the lifespan changes due to WAS and are not shown in Figure 5.12.

Higher wearout and higher timing margin heterogeneity increase the lifespan improvement achievable using WAS. More opportunities for balancing timing margin also make the WAS techniques more effective. The average workload of most processors shows a high amount of underutilization, hence the benefits from WAS are going to be even higher than the conservative values reported.

5.6 Related Works

There has been much recent research in the design of dynamic reliability management (DRM) frameworks. ElastIC [59] presents an overview of DRM. Most efforts have been focused on reactive techniques (i.e., R-DRM) which detect runtime failures caused by wearout and then control graceful performance degradation [3, 6, 43, 48, 51, 53, 58]. For example, [43] is an R-DRM framework which uses fine-grain artificial redundancy to mitigate latent defects or failures due to wearout.
WAS is a proactive DRM (P-DRM) framework which uses runtime measurements of the remaining timing margin of different microarchitectural blocks of the processor well before the occurrence of a failure, and uses this runtime feedback to balance the timing margin of different blocks of the processor. The P-DRM works most relevant to this work are [26, 29, 62]. Table 5.2 summarizes a comparison of DPM, DTM, R-DRM, and the above mentioned P-DRM frameworks with WAS.

Table 5.2: Comparison of Dynamic Power/Thermal Management (DPM/DTM) and Reactive/Proactive Dynamic Reliability Management (R-DRM/P-DRM) frameworks.

HW Feedback -- DPM/DTM: T, P, U; R-DRM: ED; WAS: TM; Facelift: T; Maestro: TM; Colt: -
(T: Temperature, P: Power, U: Utilization Rate, ED: Error Detection, TM: Timing Margin)
Mitigation Knobs -- DPM/DTM: PG, CG, DVFS; R-DRM: DR; WAS: PG, CG, DVFS; Facelift: VS, JS; Maestro: JS; Colt: HWC
(PG: Power Gating, CG: Clock Gating, DVFS: Dynamic Voltage and Frequency Scaling, DR: Decommissioning Resources, VS: Voltage Scaling, JS: Job Scheduling to cores, HWC: Hardware Change)
Solution Level -- DPM/DTM: MA, C; R-DRM: MA, C; WAS: MA, C; Facelift: C; Maestro: C; Colt: MA
(MA: Microarchitecture Level, C: Core Level)
Impact on Wearout -- DPM/DTM: UR; R-DRM: NE; WAS: CR, WL; Facelift: UR; Maestro: UR, WL; Colt: UR
(UR: Unmonitored Reduction, NE: No Effect, CR: Controlled Reduction, WL: Wearout-leveling)

Facelift [62] uses voltage scaling techniques at the granularity of cores to slow down wearout, at the cost of operating cores with a lower timing margin or at lower performance. The time to apply voltage scaling at the core level is selected by the chip manufacturer through a one-time, non-linear optimization based on wearout formulas and lifetime workload estimation. Microarchitectural-level wearout heterogeneity within cores, and wearout leveling at this granularity, are not considered. Maestro [26] uses information regarding the wearout state of a multicore processor to assign jobs to cores based on their wearout state and application temperature profiling information. The benefits of this framework are limited to what is achieved with AWAS alone, since only core-level job assignment is used for wearout leveling. Colt [29] is a scheme to reduce the PMOS stress time for NBTI and can recapture 27% of the timing margin degradation of a microarchitectural block over a 7 year lifespan. Although the benefits of this framework are similar to WAS, Colt does not account for wearout heterogeneity and applies techniques which require significant hardware modifications. Furthermore, Colt does not take into account the non-uniform wearout state of the system and hence applies the NBTI stress reduction techniques uniformly to all blocks. There are previous studies which focused on frameworks that use redundancy to improve yield and/or ensure graceful degradation of the processor's performance when parts of the chip start to fail due to wearout [6, 48, 51, 58]. To the best of our knowledge, WAS is the first framework which utilizes existing redundancy at the circuit block level to improve lifetime using fine-grain wearout-leveling based on feedback from distributed wearout sensors.

5.7 Summary and Conclusions

In this chapter, we developed wearout control policies which use feedback from a network of wearout monitors to maximize the lifespan of the processor. We quantified how the WAS technique can not only postpone the start of graceful performance degradation but also prolong the lifespan of the processor during the graceful performance degradation phase.
In the dark silicon era, where a large portion of the chip has to be power or clock gated during the majority of the chip's lifetime, intelligent microarchitectural-level wearout leveling based on accurate timing margin feedback is the most effective way to efficiently utilize the lifespan of every microarchitectural structure within the processor. The importance of P-DRM techniques will grow in asymmetric multicore systems and heterogeneous CMPs. WAS would add wearout awareness to such systems and can guide many power and clock gating decisions which are otherwise made without consideration for reliability. The use of feedback from the runtime timing margin of the circuit in guiding the WAS wearout mitigation framework makes our framework broadly applicable to many different microarchitectures fabricated using emerging technologies.

Chapter 6 Conclusions and Future Work

In this dissertation we first described, in Chapter 2, the design and evaluation of WearMon, a novel approach for runtime wearout monitoring. Then in Chapter 3 we presented WAT, a fast and accurate tool for cross-layer analyses of wearout. In Chapter 4 we used this cross-layer analysis framework to design and evaluate a cross-layer design flow methodology that significantly broadens the applicability of WearMon, even to circuits with steep critical path timing walls. In particular, we proposed four algorithms that use application-specific path utilization profiles to select only a few paths to be monitored for wearout. Our results show that wearout monitoring can in fact be done at ultra-low cost, and that monitoring overhead can be traded off with area, power, and performance overhead. Then in Chapter 5 we described a cross-layer framework to mitigate wearout at runtime and extend chip lifetime using wearout leveling. The runtime wearout mitigation framework uses real-time timing margin information from a network of wearout sensors (e.g., WearMon) to dynamically adjust microarchitectural block level scheduling and extend the lifespan of the circuit using feedback-based wearout leveling.

The primary focus of the research presented in this dissertation has been on digital circuits, and mainly on high performance processors. But the reliability challenges highlighted in this dissertation and the problem of circuit wearout also affect analog and mixed signal circuits, and might also be a challenge for emerging technologies such as carbon nanotubes [69]. Hence the study of design-for-reliability solutions tailored for analog and mixed signal circuits and emerging technologies is an interesting open area of research for future exploration. The emergence of heterogeneous and asymmetric processors, as well as the large amount of integration in System-on-Chip products, poses new challenges in the area of design-for-reliability which need to be addressed. Many modern chips have in-field configurability with the goal of increasing performance, power, and thermal efficiency, and many dynamic operation control policies for DTM, DPM, and DRM are controlled independently. Since all these solutions use very similar control knobs, such as power/clock gating and DVFS, it would be highly effective to combine these control policies. Exploration of the interaction of the sometimes conflicting objectives of dynamic power/thermal/reliability management policies would be an interesting area of future research.
The high level of integration made possible by the advances in the semiconductor industry in the past few decades will require hardware enhancements with intelligent built-in and autonomous mechanisms to control performance, power, and reliability. Circuits which not only have power, performance, and temperature monitoring enhancements but also have built-in reliability monitoring that enables dynamic runtime control of wearout will have a significantly extended lifespan. The cross-layer frameworks and methodologies for wearout monitoring, mitigation, and analyses presented in this dissertation serve as a valuable building block for creating the above mentioned reliability enhancements.

References

[1] J. Abella, X. Vera, and A. Gonzalez, "Penelope: The NBTI-Aware processor," in International Symposium on Microarchitecture, 2007, pp. 85-96.
[2] M. Agarwal, B. C. Paul, M. Zhang, and S. Mitra, "Circuit failure prediction and its application to transistor aging," in IEEE VLSI Test Symposium, 2007, pp. 277-286.
[3] N. Aggarwal, P. Ranganathan, N. P. Jouppi, and J. E. Smith, "Configurable Isolation: Building high availability systems with commodity multi-core processors," in International Symposium on Computer Architecture, 2007, pp. 470-481.
[4] M. A. Alam and S. Mahapatra, "A comprehensive model of PMOS NBTI degradation," Microelectronics Reliability, vol. 45, pp. 71-81, Jan 2005.
[5] M. Annavaram, E. Grochowski, and P. Reed, "Implications of device timing variability on full chip timing," in 13th International Symposium on High-Performance Computer Architecture, 2007, pp. 37-45.
[6] A. Ansari, S. G. Feng, S. Gupta, and S. Mahlke, "Necromancer: Enhancing System Throughput by Animating Dead Cores," in International Symposium on Computer Architecture, 2010, pp. 473-484.
[7] T. Austin, E. Larson, and D. Ernst, "SimpleScalar: An infrastructure for computer system modeling," Computer, vol. 35, pp. 59-67, Feb 2002.
[8] T. Austin, V. Bertacco, D. Blaauw, and T. Mudge, "Opportunities and challenges for better than worst-case design," in Asia and South Pacific Design Automation Conference, 2005, pp. 2-7.
[9] T. M. Austin, "DIVA: A reliable substrate for deep submicron microarchitecture design," in 32nd Annual International Symposium on Microarchitecture, 1999, pp. 196-207.
[10] A. H. Baba and S. Mitra, "Testing for transistor aging," in VLSI Test Symposium, 2009, pp. 215-220.
[11] L. A. Barroso and U. Holzle, "The case for energy-proportional computing," Computer, vol. 40, pp. 33-37, Dec 2007.
[12] K. Bernstein, D. J. Frank, A. E. Gattiker, W. Haensch, B. L. Ji, S. R. Nassif, E. J. Nowak, D. J. Pearson, and N. J. Rohrer, "High-performance CMOS variability in the 65-nm regime and beyond," IBM Journal of Research and Development, vol. 50, pp. 433-449, Jul-Sep 2006.
[13] J. Blome, S. Feng, S. Gupta, and S. Mahlke, "Self-calibrating online wearout detection," in International Symposium on Microarchitecture, 2007, pp. 109-122.
[14] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, "Parameter variations and impact on circuits and microarchitecture," in 40th Design Automation Conference, 2003, pp. 338-342.
[15] S. Borkar, "Designing reliable systems from unreliable components: The challenges of transistor variability and degradation," IEEE Micro, vol. 25, pp. 10-16, 2005.
[16] S. Borkar, "Electronics beyond nano-scale CMOS," in Design Automation Conference, 2006, pp. 807-808.
[17] F. A. Bower, D. J. Sorin, and S. Ozev, "A mechanism for online diagnosis of hard faults in microprocessors," in 38th Annual IEEE/ACM International Symposium on Microarchitecture, 2005, pp. 197-208.
[18] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A framework for architectural-level power analysis and optimizations," in International Symposium on Computer Architecture, 2000, pp. 83-94.
[19] A. C. Cabe, Z. Y. Qi, S. N. Wooters, T. N. Blalock, and M. R. Stan, "Small embeddable NBTI sensors (SENS) for tracking on-chip performance decay," in International Symposium on Quality Electronic Design, 2009, pp. 1-6.
[20] E. Chung and J. Smolens, "OpenSPARC T1: Architectural Transplants," 2007.
[21] K. Constantinides, S. Plaza, J. Blome, B. Zhang, V. Bertacco, S. Mahlke, T. Austin, and M. Orshansky, "BulletProof: A defect-tolerant CMP switch architecture," in 12th International Symposium on High-Performance Computer Architecture, 2006, pp. 3-14.
[22] T. Constantinou, Y. Sazeides, P. Michaud, D. Fetis, and A. Seznec, "Performance implications of single thread migration on a chip multi-core," SIGARCH Comput. Archit. News, vol. 33, pp. 80-91, 2005.
[23] M. Demertzi, B. Zandian, R. Rojas, and M. Annavaram, "Benchmarking instruction reliability to intermittent errors," in IEEE International Symposium on Workload Characterization (IISWC), San Diego, California, 2012.
[24] S. Dighe, S. R. Vangal, P. Aseron, S. Kumar, T. Jacob, K. A. Bowman, J. Howard, J. Tschanz, V. Erraguntla, N. Borkar, V. K. De, and S. Borkar, "Within-die variation-aware dynamic-voltage-frequency-scaling with optimal core allocation and thread hopping for the 80-Core teraFLOPS processor," IEEE Journal of Solid-State Circuits, vol. 46, pp. 184-193, Jan 2011.
[25] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge, "Razor: A low-power pipeline based on circuit-level timing speculation," in International Symposium on Microarchitecture, 2003, pp. 7-18.
[26] S. Feng, S. Gupta, A. Ansari, and S. Mahlke, "Maestro: Orchestrating lifetime reliability in chip multiprocessors," in High Performance Embedded Architectures and Compilers, 2010, pp. 186-200.
[27] T. Fischer, E. Amirante, P. Huber, K. Hofmann, M. Ostermayr, and D. Schmitt-Landsiedel, "A 65 nm test structure for SRAM device variability and NBTI statistics," Solid-State Electronics, vol. 53, pp. 773-778, Jul 2009.
[28] B. Greskamp, L. Wan, U. R. Karpuzcu, J. J. Cook, J. Torrellas, D. M. Chen, and C. Zilles, "BlueShift: Designing processors for timing speculation from the ground up," in High Performance Computer Architecture, 2009, pp. 213-224.
[29] E. Gunadi, A. A. Sinkar, K. Nam Sung, and M. H. Lipasti, "Combating aging with the Colt duty cycle equalizer," in Microarchitecture, 2010, pp. 103-114.
[30] G. Hamerly, E. Perelman, J. Lau, and B. Calder, "Simpoint 3.0: Faster and more flexible program phase analysis," Journal of Instruction Level Parallelism, vol. 7, pp. 1-28, 2005.
[31] J. Hicks, D. Bergstrom, M. Hattendorf, J. Jopling, and J. Maiz, "45nm transistor reliability," Intel Technology Journal, vol. 12, pp. 131-144, 2008.
[32] International Technology Roadmap for Semiconductors. Available: http://www.itrs.net/
[33] J. Renau, B. Fraguela, J. Tuck, W. Liu, M. Prvulovic, L. Ceze, S. Sarangi, P. Sack, K. Strauss, and P. Montesinos. (2005). SESC simulator. Available: http://sesc.sourceforge.net
[34] A. Kahng, S. Kang, R. Kumar, and J. Sartori, "Designing processors from the ground up to allow voltage/reliability tradeoffs," in High Performance Computer Architecture, 2010, pp. 1-11.
[35] A. B. Kahng, S. Kang, R. Kumar, and J. Sartori, "Slack redistribution for graceful degradation under voltage overscaling," in Asia and South Pacific Design Automation Conference, 2010, pp. 825-831.
[36] E. Karl, P. Singh, D. Blaauw, and D. Sylvester, "Compact in situ sensors for monitoring negative-bias-temperature-instability effect and oxide degradation," in Solid State Circuits Conference, 2008, pp. 410-411.
[37] C. Kenyon, A. Kornfeld, K. Kuhn, M. Liu, and A. Maheshwari, "Managing process variation in Intel's 45nm CMOS technology," Intel Technology Journal, vol. 12, pp. 93-109, 2008.
[38] N. Kimizuka, T. Yamamoto, T. Mogami, K. Yamaguchi, K. Imai, and T. Horiuchi, "The impact of bias temperature instability for direct-tunneling ultra-thin gate oxide on MOSFET scaling," in VLSI Technology, 1999, pp. 73-74.
[39] M. L. Li, P. Ramachandran, U. R. Karpuzcu, S. K. S. Hari, and S. V. Adve, "Accurate microarchitecture-level fault modeling for studying hardware faults," in 15th International Symposium on High-Performance Computer Architecture, 2009, pp. 105-116.
[40] Y. J. Li, S. Makar, and S. Mitra, "CASP: Concurrent Autonomous chip self-test using Stored test Patterns," in Design, Automation & Test in Europe, 2008, pp. 885-890.
[41] X. Y. Liang, G. Y. Wei, and D. Brooks, "ReVIVAL: A variation-tolerant architecture using voltage interpolation and variable latency," in 35th International Symposium on Computer Architecture, 2008, pp. 191-202.
[42] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A full system simulation platform," Computer, vol. 35, pp. 50-58, 2002.
[43] T. Nakura, K. Nose, and M. Mizuno, "Fine-grain redundant logic using defect-prediction Flip-Flops," in Solid-State Circuits Conference, 2007, pp. 402-611.
[44] S. Ogawa and N. Shiono, "Generalized diffusion-reaction model for the low-field charge-buildup instability at the Si-SiO2 interface," Physical Review B, vol. 51, pp. 4218-4230, Feb 15 1995.
[45] OpenSPARC T1 Processor. Available: http://www.opensparc.net/opensparc-t1/index.html
[46] J. Patel, "CMOS process variations: A critical operation point hypothesis," Online Presentation, 2008.
[47] F. Paterna, L. Benini, A. Acquaviva, F. Papariello, G. Desoli, and M. Olivieri, "Adaptive idleness distribution for non-uniform aging tolerance in multiprocessor Systems-on-Chip," in Design, Automation & Test in Europe, 2009, pp. 906-909.
[48] M. D. Powell, A. Biswas, S. Gupta, and S. S. Mukherjee, "Architectural core salvaging in a multi-core processor for hard-error tolerance," in International Symposium on Computer Architecture, 2009, pp. 93-104.
[49] S. E. Rauch, "Review and reexamination of reliability effects related to NBTI-induced statistical variations," IEEE Transactions on Device and Materials Reliability, vol. 7, pp. 524-530, Dec 2007.
[50] S. Sarangi, B. Greskamp, A. Tiwari, and J. Torrellas, "EVAL: Utilizing processors with variation-induced timing errors," in International Symposium on Microarchitecture, 2008, pp. 423-434.
[51] E. Schuchman and T. N. Vijaykumar, "Rescue: A microarchitecture for testability and defect tolerance," in International Symposium on Computer Architecture, 2005, pp. 160-171.
[52] J. Shin, V. Zyuban, Z. Hu, J. A. Rivers, and P. Bose, "A framework for architecture-level lifetime reliability modeling," in IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2007, pp. 534-543.
[53] P. Shivakumar, S. W. Keckler, C. R. Moore, and D. Burger, "Exploiting microarchitectural redundancy for defect tolerance," in International Conference on Computer Design, 2003, pp. 481-488.
[54] S. Shyam, K. Constantinides, S. Phadke, V. Bertacco, and T. Austin, "Ultra low-cost defect protection for microprocessor pipelines," ACM Sigplan Notices, vol. 41, pp. 73-82, Nov 2006.
[55] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, "Temperature-aware microarchitecture," in International Symposium on Computer Architecture, 2003, pp. 2-13.
[56] J. Smolens, B. T. Gold, J. C. Hoe, B. Falsafi, and K. Mai, "Detecting emerging wearout faults," in IEEE Workshop on Silicon Errors in Logic - System Effects, 2007.
[57] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, "The case for lifetime reliability-aware microprocessors," in International Symposium on Computer Architecture, 2004, pp. 276-287.
[58] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, "Exploiting structural duplication for lifetime reliability enhancement," in International Symposium on Computer Architecture, 2005, pp. 520-531.
[59] D. Sylvester, D. Blaauw, and E. Karl, "ElastIC: An adaptive self-healing architecture for unpredictable silicon," IEEE Design & Test of Computers, vol. 23, pp. 484-490, Nov-Dec 2006.
[60] E. Takeda, C. Y. W. Yang, and A. Miura-Hamada, Hot-Carrier Effects in MOS Devices, vol. 73: Academic Press, 1995.
[61] A. Tiwari, S. R. Sarangi, and J. Torrellas, "ReCycle: Pipeline adaptation to tolerate process variation," in International Symposium on Computer Architecture, 2007, pp. 323-334.
[62] A. Tiwari and J. Torrellas, "Facelift: Hiding and slowing down aging in multicores," in International Symposium on Microarchitecture, 2008, pp. 129-140.
[63] R. Vattikonda, W. P. Wang, and Y. Cao, "Modeling and minimization of PMOS NBTI effect for robust nanometer design," in Design Automation Conference, 2006, pp. 1047-1052.
[64] W. P. Wang, V. Reddy, A. T. Krishnan, R. Vattikonda, S. Krishnan, and Y. Cao, "Compact modeling and simulation of circuit reliability for 65-nm CMOS technology," IEEE Transactions on Device and Materials Reliability, vol. 7, pp. 509-517, 2007.
[65] W. P. Wang, S. Q. Yang, S. Bhardwaj, R. Vattikonda, S. Vrudhula, F. Liu, and Y. Cao, "The impact of NBTI on the performance of combinational and sequential circuits," in ACM/IEEE Design Automation Conference, 2007, pp. 364-369.
[66] W. Wenping, V. Reddy, A. T. Krishnan, R. Vattikonda, S. Krishnan, and C. Yu, "Compact Modeling and Simulation of Circuit Reliability for 65-nm CMOS Technology," IEEE Transactions on Device and Materials Reliability, vol. 7, pp. 509-517, 2007.
[67] Q. Wu, P. Juang, M. Martonosi, and D. W. Clark, "Voltage and frequency control with adaptive reaction time in multiple-clock-domain processors," in 11th International Symposium on High-Performance Computer Architecture, 2005, pp. 178-189.
[68] Xilinx. Available: http://www.xilinx.com
[69] B. Zandian, R. Kumar, J. Theiss, A. Bushmaker, and S. B. Cronin, "Selective destruction of individual single walled carbon nanotubes by laser irradiation," Carbon, vol. 47, pp. 1292-1296, Apr 2009.
[70] B. Zandian, W. Dweik, S. H. Kang, T. Punihaole, and M.
Annavaram, "WearMon: Reliability monitoring using adaptive critical path testing," in IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2010, pp. 151-160. [71] B. Zandian and M. Annavaram, "Automated detection of and compensation for guardband degradation during operation of clocked data processing circuit," US Patent App. 13/327,561, 2011. [72] B. Zandian and M. Annavaram, "Cross-layer resilience using wearout aware design flow," in IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2011, pp. 279-290. [73] B. Zandian and M. Annavaram, "Software-based infield wearout monitoring for synchronous digital chips," in IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) Fast Abstracts, 2012. [74] B. Zandian and M. Annavaram, "WAS: Wearout-aware runtime use of redundancy to improve processor lifespan," in International Symposium on High Performance Computer Architecture, 2013 (Under Review).
Abstract
CMOS scaling has enabled greater degree of integration and higher performance but has the undesirable consequence of decreased circuit reliability due to rapid wearout. Accelerated processor wearout and the consequent degradation in lifetime have become a first order design constraint. This dissertation tackles these challenges by developing new tools to accurately quantify wearout, providing novel methods to quantify the wearout impact due to software interactions on hardware. The dissertation then demonstrates the usage of these tools by developing a wearout-aware scheduling approach that achieves wear leveling within a processor. ❧ This dissertation first presents WearMon, an adaptive critical path monitoring architecture which provides accurate and real-time measure of a processor's wearout-induced timing margin degradation. Special test patterns are used to check a set of critical paths in the circuit-under-test. By activating the actual devices and signal paths used in normal operation of the chip, each test will capture up-to-date timing margin of these paths. This monitoring framework dynamically adapts testing interval and complexity based on analyses of prior test results, which increases efficiency and accuracy of monitoring. Monitoring overhead can be completely eliminated by scheduling tests only when the circuit is idle. This wearout detection mechanism is a key building block of a hierarchical runtime reliability management system where multiple wearout monitoring units can co-operatively engage preemptive error avoidance schemes. Our experimental results based on an FPGA implementation show that the proposed monitoring framework can be easily integrated into existing designs and operate with minimal overhead. ❧ WearMon overhead can become a hurdle when a circuit block has a steep critical path timing wall. Many prior research studies intuitively argued that only a few paths within a steep critical path timing wall are actually utilized by application software. But there has been a dearth of tools that enable designers to understand how software impacts the utilization of critical paths in a circuit. The next part of this dissertation develops a tool for cross-layer analysis of wearout, called WAT. WAT uses FPGA emulation closely coupled with software simulation to provide accurate insight into device switching activity and runtime path utilization. We demonstrate the utility of WAT by providing accurate gate-level switching activity statistics as inputs to a lifetime wearout simulation tool. The switching activity statistics are used as inputs to the lifetime prediction tool which uses accurate device level models for the electrophysical phenomena causing wearout. Accurate switching statistics from WAT can significantly improve the lifetime prediction accuracy. ❧ WAT is also used to address the concern regarding WearMon overhead in the presence of steep critical path timing walls. A new design-for-reliability approach is developed that reshapes a critical path wall to make a circuit more amenable for wearout monitoring. This design flow methodology uses path utilization profile to select only a few paths to be monitored for wearout. We propose and evaluate four novel algorithms for selecting paths to be monitored. These four approaches allow designers to select the best group of paths to be monitored under varying power, area and monitoring budget constraints. 
❧ Finally we demonstrate the impact of runtime wearout management in a proactive runtime wearout-aware scheduling approach, WAS. Processor failure can occur due to wearout of a single structure even if the vast majority of the chip is still operational. WAS strives for uniform wearout of processor structures, thereby preventing a single structure from becoming an early point of failure. The fine-grained microarchitectural level chip wearout control policies use feedback from a network of timing margin monitoring sensors to identify the most degraded structures. Our evaluation shows WAS can result in 15% to 30% improvement in the lifespan of a multi-core processor chip with negligible performance and energy consumption impact.