Design and testing of SRAMs resilient to bias temperature instability (BTI) aging
(USC Thesis Other)
DESIGN AND TESTING OF SRAMS RESILIENT TO BIAS TEMPERATURE INSTABILITY (BTI) AGING
by Xuan Zuo
A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)
May 2020
Copyright 2020 Xuan Zuo

Acknowledgements

First and foremost, I would like to thank my advisor, Prof. Sandeep K. Gupta, for all his help and guidance throughout my Ph.D. life. While doing research with him, I was impressed by his strong ability to draw insight from complicated problems and details. He taught me all the important skills needed to conduct research, such as finding meaningful topics, thinking about the high-level story and grasping the key point, and presenting ideas and results systematically and clearly. Whenever I came to his office feeling uncertain about what to do next, his wisdom and kindness never ceased to amaze me. I have learned so much from him, and he will continue to be my role model in my future life.

Next, many thanks go to the other members of my defense and qualifying committees: Prof. William Halfond, Prof. Paul Bogdan, Prof. Alice Parker, and Prof. Pierluigi Nuzzo, for their insightful and helpful feedback on my research. I thank Prof. William Halfond and Prof. Paul Bogdan for sitting on both my qualifying exam and defense committees. I appreciate their valuable comments and suggestions, which helped me in many aspects to make this dissertation complete. I acknowledge the support from the NSF grant.

My sincere thanks also go to Prof. Shahin Nazarian, for my unforgettable and meaningful TA experience. I have been a teaching assistant for him every semester since spring 2015 for many courses. I learned much useful knowledge and many skills from him. His dedication and hard work always inspired me.

I also want to thank all my colleagues and friends at USC for their valuable advice and help.
Finally, I would like to dedicate this dissertation to my family. Words cannot express how grateful I am to have unconditional love and support from my family even though we are far apart. Many thanks to my mother, who is always proud of my accomplishments. I would like to express my profound gratitude to my husband, Hao Yu, who encouraged me to pursue a Ph.D. His love and support are among the most important reasons that I successfully finished my Ph.D.

Contents

Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 SRAMs
  1.2 Process variations
  1.3 Aging
  1.4 Scope of this dissertation
    1.4.1 Process variation-induced delay test of SRAMs
    1.4.2 SRAM design against BTI aging
2 Process variation-induced delay test of SRAMs
  2.1 Background
  2.2 Problem development and approach
  2.3 Related research
  2.4 Failure Mechanism
  2.5 Analysis of key circuit components
    2.5.1 Analysis of sense amplifiers
    2.5.2 Analysis of address decoder
      2.5.2.1 NAND gate delay analysis
      2.5.2.2 Address sequence selection for 3-bit decoder
      2.5.2.3 Address sequence selection for large address decoders with pre-decoding
  2.6 Generation of new tests
    2.6.1 Test generation for VIDF
    2.6.2 Test generation for both VIDF and DIDF
  2.7 Evaluation of new tests
    2.7.1 Experimental results for SRAM with 3-bit address decoder
    2.7.2 Results for SRAM with pre-decoding address decoder
  2.8 Conclusion
3 Low-cost SRAM redesign for effectively combating aging in systems with long lifetime and tight power requirements
  3.1 Introduction
  3.2 Background
    3.2.1 BTI aging
    3.2.2 The impact of BTI aging on SRAMs' stability
  3.3 Problem definition
    3.3.1 Characterize the workload
    3.3.2 Design objective
  3.4 Design approach
    3.4.1 SRAM cell sizing approach for fixed workload
      3.4.1.1 6T planar CMOS SRAM cell
      3.4.1.2 6T FinFET SRAM cell
      3.4.1.3 10T ST SRAM cell
    3.4.2 Design flow for any given workload
      3.4.2.1 Key ideas
      3.4.2.2 SRAM cell sizing approach for any given workload
  3.5 Experimental results
    3.5.1 Comparison of DPPM and lifetime yield-per-area for 6T SRAM cell
    3.5.2 Comparison of DPPM and lifetime yield-per-area for 10T ST SRAM cell
    3.5.3 Power overhead analysis of the classical approach
  3.6 Conclusion
4 Delay degradation caused by aging in SRAM peripheral circuitry
  4.1 Background and Motivation
  4.2 Aging analysis for SRAM peripheral circuit
    4.2.1 Address decoder
    4.2.2 Precharge and write circuit
    4.2.3 Sense amplifier
  4.3 Quantify delay component
  4.4 Process variation-induced delay test of SRAMs with aging
    4.4.1 Address decoder delay analysis
      4.4.1.1 NAND gate delay analysis
      4.4.1.2 Address sequence selection for small decoders
      4.4.1.3 Address sequence selection for large pre-decoded decoders
    4.4.2 Test generation for VIDF
  4.5 Peripheral circuits redesign to combat aging degradation
    4.5.1 Power overhead analysis for the classical approach for peripheral circuits
    4.5.2 Sizing approach for decoder to combat aging delay degradation
    4.5.3 Sizing approach for sense amplifier to combat aging
  4.6 Conclusion
5 Aging-resilient SRAM design: an end-to-end framework
  5.1 Background and Motivation
  5.2 Using ECC to repair aging failures
    5.2.1 ECC background
    5.2.2 DPPM and lifetime yield estimation with ECC
    5.2.3 Calculation of soft error resilience when using ECC to repair aging failures
    5.2.4 Characterize the ECC implementation overheads
  5.3 Design approach
    5.3.1 An end-to-end SRAM design framework for lifetime yield-per-area optimization
  5.4 Experiment results
  5.5 Conclusion
6 Contributions
  6.1 Tests for variation-induced delay faults in SRAMs
  6.2 Tests for aging- and variation-induced delay faults in SRAMs
  6.3 Sizing for aging resilient SRAM design
  6.4 End-to-end SRAM design framework for aging resilience
Reference List

List of Tables

2.1 Possible address dependent delay faults and the corresponding sufficient conditions
2.2 Characterizing two types of SAs in terms of parameters of fault 2 and fault 3
2.3 Input transitions triggering the worst-case T_PHL and T_PLH for NAND gates with different PMOS and NMOS transistor size ratios for an industrial 65nm CMOS process
2.4 Address transitions invoking the worst-case delay for 3-bit decoders with different transistor size ratios in NAND gates
2.5 Test generation procedure for 4 different SRAM designs under case 1 timing control
2.6 New proposed test algorithms for different designs targeting variation-induced delay faults, WT, WCGD and GALPAT for an industrial 65nm CMOS process
2.7 Number of failing chip instances captured by different tests for SRAMs with 3-bit address decoder
2.8 Number of defect induced delay faults captured by different tests
2.9 Number of failing chip instances captured by different tests for SRAMs with pre-decoding address decoder
3.1 All the transistors whose sizing impact RNMb of a 6T planar SRAM cell
3.2 All the transistors whose design parameters impact RNMb of a 6T FinFET SRAM cell
3.3 Layout design rule for FinFET
3.4 All the transistors whose sizing impact RNMb of a 10T ST SRAM cell
3.5 Optimal 6T SRAM cell designs for various workloads
3.6 Design parameters for optimal 10T ST SRAM cells for various workloads
3.7 Lifetime yield-per-area and DPPM comparison for different 10T ST SRAM cells under four different workloads
4.1 Aging degradation under different input patterns for 2-input, 3-input NAND gates
4.2 Aging degradation under different input patterns for 4-input NAND gates
4.3 Aging degradation for SA1 under various control signals
4.4 Aging degradation for SA2 under various control signals
4.5 Delay components of 3-bit SRAM read operation before and after 60 months aging
4.6 Input transitions triggering the worst-case T_PHL and T_PLH for various NAND gates without and with aging
4.7 Address transitions invoking the worst-case delay for 3-bit decoders with different transistor size ratios in NAND gates before and after 60m aging
4.8 New proposed test algorithms for different designs targeting variation-induced delay faults
4.9 Number of failing chip instances captured by different tests for SRAMs with 3-bit address decoder before and after aging
4.10 Number of failing chip instances captured by different tests for SRAMs with pre-decoded address decoder before and after aging
4.11 Delay of original NAND3 gate and the NAND3 gate resized for reducing critical path delay after aging to meet clock constraint
5.1 Lifetime yield and DPPM comparison for SRAMs using a cell optimized for yield-per-area at the time of fabrication (D0) under four different workloads with different ECC schemes
5.2 Number of check bits and area overhead (compared to the total area of data cells in SRAM array) for SEC and DEC for different correctable data lengths
5.3 Area overhead (relative to entire SRAM, computed using modified CACTI) comparison for sizing and ECC approach under four different workloads
5.4 Design results of 2MB SRAMs with 6T cells under four different workloads

List of Figures

1.1 6T, 8T and 10T Schmitt Trigger SRAM cell structures
2.1 SRAM general structure diagram
2.2 (a) Timing diagram of SRAM read operation and (b)-(d) three possible ASDDFs
2.3 Two commonly used Sense Amplifiers
2.4 Address decoder structures
2.5 (a) 2-input, (b) 3-input and (c) 4-input NAND gates
2.6 (a) T_PHL and (b) T_PLH distribution of 3-input NAND gate among 1000 Monte Carlo instances with process variation
2.7 Address sequences invoking all WDeactDs and WActDs for address decoders with (a) Wp/Wn = 2, (b) Wp/Wn = 0.67 (balanced), (c) Wp/Wn = 0.5
3.1 Stress state for nMOS transistor and pMOS transistor
3.2 6T SRAM cell's stress and recovery status when value "0" is stored
3.3 The curves for a 6T SRAM cell indicating RNM before and after aging
3.4 The curves for a 6T SRAM cell indicating WNM before and after aging
3.5 Read noise margins through lifetime for a 6T SRAM cell when value "0" is stored in the cell for 75% of the time
3.6 Write noise margins through lifetime for a 6T SRAM cell when value "0" is stored in the cell for 75% of the time
3.7 (a) Read noise margin and (b) write noise margin changes after 60 months of usage for a 6T SRAM cell under different signal probabilities
3.8 The signal probability distribution in data caches extracted from [1]
3.9 Noise margins of the 6T planar SRAM cell versus the sizes of (a) AXL (b) PDL (c) PUL (d) AXR (e) PDR (f) PUR
3.10 Access time of the 6T planar SRAM cell versus the transistor sizes (a) for read-0 and (b) for read-1
3.11 Optimal sizes of PDL
3.12 Layout of 6T planar CMOS SRAM cell
3.13 Read noise margin changes after 60 months usage for 1000 Monte Carlo 6T planar CMOS SRAM cell instances with process variations
3.14 (a) Lifetime yield-per-area and (b) DPPM comparison for 2MB SRAM arrays for all cells with P_signal = 0.25
3.15 Noise margins of a 6T FinFET SRAM cell versus gate length and Fin thickness of all transistors
3.16 6T FinFET SRAM cell layout
3.17 (a) Lifetime yield per area and (b) DPPM comparison for 2MB FinFET 6T SRAM arrays for all cells with P_signal = 0.25
3.18 10T Schmitt Trigger SRAM cell
3.19 (a) Noise margins of the 10T ST SRAM cell versus the sizes of NFR. (b) Access time of the 10T ST SRAM cell versus the transistor sizes
3.20 Optimal sizes of NFR in 10T ST cell
3.21 Layout of 10T ST SRAM cell
3.22 (a) Lifetime yield per area and (b) DPPM comparison for 2MB SRAM arrays using 10T ST cells for all cells with P_signal = 0.25
3.23 Lifetime yield-per-area comparison for different 6T SRAM cells under four different workloads
3.24 DPPM comparison for different 6T SRAM cells under four different workloads
3.25 (a) VDD and (b) power overhead in percentage for the design optimized for yield-per-area (D0) to ensure DPPM <= 50 for 6T SRAMs under four workloads
3.26 (a) VDD and (b) power overhead in percentage for the design optimized for yield-per-area (D0) to ensure DPPM <= 50 for 10T ST SRAMs under four workloads
4.1 (a) 2-input, (b) 3-input and (c) 4-input NAND gates
4.2 (a) Precharge circuit and (b) write circuit
4.3 Two commonly used Sense Amplifiers
4.4 Timing diagram of SRAM read operation
4.5 Read delay degradation of an SRAM with 8-bit address decoder caused by aging through lifetime
4.6 T_PHL and T_PLH distributions of (a)-(b) 2-input NAND gates, (c)-(d) 3-input NAND gates, and (e)-(f) 4-input NAND gates without aging
4.7 T_PHL and T_PLH distributions of (a)-(b) 2-input NAND gates, (c)-(d) 3-input NAND gates, and (e)-(f) 4-input NAND gates after 60 months aging
4.8 The distributions of (a) activation delay and (b) deactivation delay of 3-bit address decoder with balanced NAND3 before aging
4.9 The distributions of (a) activation delay and (b) deactivation delay of 3-bit address decoder with Wp/Wn = 2 NAND3 before aging
4.10 (a) Address sequences invoking WDeactDs for address decoders with Wp/Wn = 0.67 (balanced), (b) Address sequences invoking WDeactDs and WActDs for address decoders with Wp/Wn = 2
4.11 A pre-decoded 16-bit address decoder
4.12 The probability that a fault can be captured by G_2 but cannot be captured by G_1
4.13 The sum of (a) T_PHL and (b) T_PLH distributions of NAND2 and NAND3
4.14 The sum of (a) T_PHL and (b) T_PLH distributions of two NAND4 gates
4.15 The distributions of deactivation delay of 6-bit address decoder with balanced NAND3 and balanced NAND2 (a) before aging and (b) after 60m aging
4.16 (a) VDD and (b) power overhead in percentage for SRAMs with 3-bit, 8-bit and 12-bit address decoder to ensure read delay meet clock constraint
4.17 The distributions of (a) activation delay and (b) deactivation delay of resized 3-bit address decoder to reduce critical path delay after aging
4.18 SA1 delay after aging under different sizes of P1, P2, N1 and N2

Abstract

Process variation and aging are the two major causes of circuit error as well as performance and robustness degradation. Further, with continued technology scaling, both these causes are becoming increasingly important. We focus on static random-access memories (SRAMs) since they are widely used. SRAMs are very susceptible to process variations and aging because of their small transistor sizes and dense layouts.

In the first part of this dissertation, we propose a general approach to generate O(n) tests to cover all address dependent process variation-induced delay faults (VIDFs) in arbitrary SRAM designs. The upper bound of test length is 9.5n for arbitrary designs with n locations. We focus on address dependent delay faults since these are the most likely to escape traditional linear tests. Most previous memory tests that were developed for address decoder delay faults, including GALPAT and the Worst Case Gate Delay (WCGD), focus on defects. Although the test length of WCGD is O(n log n), which is necessary to detect address decoder delay faults caused by defects, it is not effective for address decoder delay faults induced by process variations, since it does not cover all the two-pattern address sequences required for VIDFs. We show that a different test strategy is necessary for VIDFs because VIDFs are multiple, widespread, and have small delay values that are correlated. In particular, we identify all the address dependent variation-induced failure mechanisms along with sufficient conditions for their detection. We model the delay information of address decoders under variation and identify the address transitions that invoke the concerned delays for capturing the target VIDFs. We then generate the shortest linear test that covers all these two-pattern address sequences to detect all target VIDFs.
We use our test generation approach to generate O(n) tests and use extensive simulations to demonstrate that our new tests achieve nearly perfect coverage of VIDFs for arbitrary SRAM designs. Then we efficiently integrate our new tests for variations with tests for delay defects and demonstrate the efficiency and effectiveness of our new combined memory tests compared to previously known memory tests for delay faults.

We analyze the effect of bias temperature instability (BTI) aging on SRAM peripheral circuitry, including the address decoder, precharge circuit, write circuit, and sense amplifiers. We find that aging causes delay degradation in peripheral circuitry, especially in the address decoder. We augment our above test generation method for VIDFs to consider aging. Delay degradations caused by aging do not affect the input transitions invoking the worst-case delay. Thus aging degradation does not affect the path selection in test generation, and the test generated for the design before aging can also be used for the design after aging. Our experimental results show that our new tests can achieve close to 100% coverage for VIDFs in arbitrary SRAM designs before and after aging with reduced test length.

In the second part of this dissertation, we propose a design approach for combating aging in SRAM cells by sizing the transistors in a manner that dramatically reduces the aging quality loss with no power overhead for any given workload. The performance of transistors degrades due to aging. Aging degradation causes lifetime failures and lowers the quality of shipped chips. Our new approach for combating aging is especially suitable for IoT components and other embedded systems that have tight constraints on power and require a long lifetime with low aging quality loss. We focus on aging in SRAMs since these need to retain state and hence remain under stress even when logic
The magnitude of aging degradation in SRAMs depends on the workload applied to the cells, measured by the duration for which various values (0’s and 1’s) are stored. Some single-purpose IoT systems have a specific workload, while more general-purpose systems have a broader range of workloads. Hence, we study a wide range of workloads. Previous design methods for SRAMs against aging require signifi- cant changes at architecture-level or expensive changes at the cell-level. In contrast, our method simply sizes the transistors in SRAM cells to optimize the lifetime yield-per- area under the tight constraints on aging quality loss (measured by DPPM) and power. We demonstrate the effectiveness of our approach for 6T SRAM cell and 10T Schmitt Trigger SRAM cell in planar CMOS technology as well as 6T SRAM cell in FinFET technology. Our results show that transistor sizing is surprisingly effective at combating aging for a wide range of workloads. Specifically, it reduces the DPPM by orders of magnitude and increases the lifetime yield-per-area under tight power constraints. We quantify the delay along the critical paths of SRAMs and estimate the amount of delay degradation caused by aging for each component. To avoid failure caused by delay degradation, we can leave a sufficient margin in the timing control. If the timing constraint is tight, we can increase VDD to compensate for the delay degradation at a power overhead. We can also resize the address decoder and sense amplifier to ensure the critical path delay after aging is not larger than that before aging for the original design to meet the clock constraint. Our methods allow designers to choose the proper design based on their timing, power and area constraints. Furthermore, we develop an end-to-end SRAM design framework to maximize the aging resilience under the given constraints. 
We explore the efficiency of error-correcting codes (ECC) in combating aging by quantifying the area and delay overheads of ECC and estimating the lifetime yield and DPPM of SRAMs with ECC. We also calculate the soft error resilience when ECC is used to repair aging failures. We find that ECC is efficient for repairing aging failures for workloads with small aging failure rates without sacrificing the soft error resilience. After comparing approaches based on cell sizing and ECC in terms of overheads, lifetime yield and DPPM, we can choose one or a combination of these approaches to identify the optimal design against aging under the given constraints. To provide the end-to-end capability to designers, we integrate our cell sizing approach and our ECC approach into an existing SRAM compiler, CACTI. Our new compiler provides the design with the optimal lifetime yield-per-area under given constraints.

Chapter 1 Introduction

This dissertation focuses on delay testing of SRAMs and on designing the optimal SRAM in terms of lifetime yield-per-area under process variation and aging. Static Random Access Memories (SRAMs) are among the most commonly used memories and are widely used as caches in state-of-the-art microprocessors. Due to their aggressive transistor sizing, SRAMs are very sensitive to process variation and aging. Hence, SRAMs are more likely to have aging-induced failures and, due to their wide use, likely to contribute significantly to the overall aging-induced failure rates for chips.

1.1 SRAMs

SRAM is a type of memory that can hold data as long as power is supplied. Various SRAM cell structures have been proposed over several decades. The typical structures are the 6T SRAM cell, the 8T SRAM cell [2], and the 10T SRAM cell [3], as shown in Fig. 1.1. The 6T SRAM cell is the most commonly used SRAM design in practice. Thus we will focus on the design of SRAMs using 6T cells.
The 6T cell consists of two cross-coupled inverters (PUL and PDL, PUR and PDR) and two access transistors (AXL and AXR). The gates of the access transistors are connected to a common word line. When the word line is high, the two access transistors are turned on, and we can read the data stored in the internal nodes (VL and VR) of the SRAM cell or write a new value into the cell by driving BL and BLB. When the word line is low, the cell holds the value in the internal nodes. The 6T SRAM cell is sensitive to noise, process variation, and aging. To ensure correct operation, SRAM cells should be designed to have enough noise margins.

There are four types of cell failure modes [4], i.e., read failure, write failure, access time failure and hold failure. Read failure is also called a destructive read. During a read operation, the voltage of the internal node that stores "0" increases and exceeds the trip point of the other inverter, so the data stored in the cell is flipped. Write failure happens when the voltage of the internal node to be written to "0" cannot be reduced below the trip point of the other inverter. Hold failure occurs when the stored value in a cell is corrupted in standby mode, typically when the SRAM is put into a low voltage mode. These three types of failures are static failures. We consider that an SRAM has a stability issue when one of these failures occurs. The probability of these failures can be reduced by increasing the corresponding noise margin via various design approaches, such as sizing the transistors appropriately, increasing the supply voltage, adding assist circuits, using more robust cell structures, and so on. Access time failure happens when the time allocated for discharging the bit line is not sufficient during a read operation. In this case, the voltage difference between BL and BLB is too small for the sense amplifier to sense correctly. Access time failure can be avoided through timing control.
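The four failure conditions described above can be written down as simple threshold checks. The following is only a minimal sketch under stated assumptions: the class `CellObservation`, its field names, and the example voltages are hypothetical and are not taken from the dissertation, which evaluates these conditions through circuit simulation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CellObservation:
    """Hypothetical simulated quantities for one 6T cell (voltages in volts)."""
    v_trip: float             # trip point of the opposite inverter
    v_zero_node_read: float   # peak voltage of the node storing "0" during a read
    v_zero_node_write: float  # final voltage of the node being written to "0"
    bitline_diff: float       # |BL - BLB| when the sense amplifier fires
    sense_min_diff: float     # smallest differential the sense amplifier resolves
    hold_value_kept: bool     # True if the stored value survives standby


def classify_failures(obs: CellObservation) -> List[str]:
    """Return the failure modes, as described in the text, that this cell exhibits."""
    failures = []
    # Read failure: the "0" node rises above the other inverter's trip point.
    if obs.v_zero_node_read > obs.v_trip:
        failures.append("read")
    # Write failure: the node cannot be pulled below the trip point.
    if obs.v_zero_node_write >= obs.v_trip:
        failures.append("write")
    # Access time failure: bit-line differential too small to sense.
    if obs.bitline_diff < obs.sense_min_diff:
        failures.append("access_time")
    # Hold failure: stored value corrupted in standby.
    if not obs.hold_value_kept:
        failures.append("hold")
    return failures


# Illustrative values only (assumed, not measured).
healthy = CellObservation(0.5, 0.2, 0.1, 0.15, 0.1, True)
weak = CellObservation(0.5, 0.55, 0.1, 0.05, 0.1, True)
print(classify_failures(healthy))
print(classify_failures(weak))
```

Read, write and hold checks compare against noise-margin-related thresholds (the static failures), whereas the access time check depends on when the sense amplifier is fired, which is why it is handled through timing control.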
More efficient sense amplifiers can also be adopted to reduce the probability of access time failures.

Compared to the traditional 6T SRAM cell, the 8T SRAM cell has two extra transistors to separate read and write operations, which leaves more freedom for transistor sizing. The structure of the 8T cell is shown in Fig. 1.1(b). The write noise margin of the 8T cell is the same as that of the 6T cell. The read operation is more robust than in the 6T cell, since the read noise margin of the 8T cell is the same as the hold noise margin of the 6T cell. However, the half-selected static noise margin of the 8T cell is the same as the read noise margin of the 6T cell, which is the stability bottleneck for the 8T cell. (A cell is half-selected when its word line is active while the bit lines are not selected. The cell is not accessed for read or write.) Also, the transistors in the 8T cell suffer the same aging degradation as in the 6T cell, that is, the four transistors of the cross-coupled inverters are periodically under aging stress. The two extra transistors M1 and M2 have negligible aging degradation due to their short stress time. Due to these similarities and the lack of obvious advantages over the 6T cell, we will not focus on the 8T cell in our research.

Figure 1.1: 6T, 8T and 10T Schmitt Trigger SRAM cell structures, panels (a)-(c)

Several 10T SRAM cell structures have been proposed. The 10T Schmitt Trigger (ST) cells stand out because they have high read stability and high tolerance to process variations. They can achieve low failure probability for ultra-low power supply operation and do not require any changes to the conventional SRAM architecture used for 6T cells. Thus we also focus on the design of SRAMs using 10T ST cells in our research.

1.2 Process variations

In this dissertation, one of our main concerns is that variations are becoming an increasingly important cause of incorrect circuit operation (i.e., reduction in yield), degraded operation (e.g., increase in power or delay), and decrease in robustness to noise [5, 6].
For SRAMs, process variation is an important cause of failures due to aggressive transistor sizing and timing.

Process variations have been extensively studied in the past decade [5, 6, 7]. Process variations refer to the variations of transistors' key attributes, including mobility, threshold voltage, and gate size. They are caused by the randomness and imperfections of the semiconductor manufacturing process. Major sources of process variations include random dopant fluctuation (RDF) [8], line edge roughness (LER) [9], dielectric thickness variation, proximity patterning effect, and polishing [10]. Among these sources, the impact of RDF on SRAMs is particularly significant due to the small transistor sizes in SRAMs, which lead to significant variations in the Vth of transistors [8].

1.3 Aging

With the continued reduction in feature sizes of devices, the rate of aging of ICs is increasing [11]. The major sources of transistor aging include bias temperature instability (BTI), hot carrier injection (HCI), and time-dependent dielectric breakdown (TDDB) [10, 12, 13]. Among all these aging mechanisms, the BTI effect is widely regarded as the most prominent source of transistor aging [12]. Hence, our research focuses on the BTI aging effect.

There are two types of BTI effects: negative bias temperature instability (NBTI) and positive bias temperature instability (PBTI). NBTI causes threshold voltage (Vth) degradation for a PMOS transistor when a negative bias voltage is applied. PBTI increases the absolute value of Vth of an NMOS transistor over time when a positive bias voltage is applied.

It is commonly considered that threshold degradation is caused by the traps generated at the gate dielectric interface when the transistor is biased in the inversion region
Various aging models have been proposed (e.g., [15, 16, 14, 17, 18]), and all capture the degradation of threshold voltage as a function of initial threshold voltage, supply voltage, duty cycle, and temperature. A compact model proposed in [15] to predict the NBTI effect has been widely used.

The impact of aging on logic circuits and SRAMs has been widely studied [19, 20, 21]. BTI aging causes timing degradation in logic circuits, and both timing and stability degradation in SRAM cells [22, 20]. Aging can cause failures and lower the quality of shipped chips during their operational lifetimes due to delay problems, unacceptably low static noise margins [21, 23], and so on. Several key attributes of SRAM aging have been ignored by researchers studying the impact of aging on SRAMs, for example that different transistors in an SRAM cell age differently and that the values stored in SRAM cells differ from cell to cell. All these differences lead to various levels of differential aging and introduce new challenges for SRAM design. Our research explores how to design SRAM cells against aging considering these properties.

1.4 Scope of this dissertation

1.4.1 Process variation-induced delay test of SRAMs

With continuing technology scaling, process variation is increasing and becoming an important cause of memory failure [6]. We focus on address decoder delay faults since these are the most likely to escape traditional linear tests (March tests). Most previous memory tests that were developed for address decoder delay faults, including GALPAT [24, 25, 26] and the Worst Case Gate Delay (WCGD) algorithm [25], focus on defects. To detect defect-induced delay faults (DIDFs) in the address decoder, an O(n log n) test is necessary, since defects located in the parallel transistors of the decoder require all pairs of address transitions with Hamming distance H=1 for detection.
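As a rough illustration of where the O(n log n) length comes from, the sketch below enumerates every ordered address transition with Hamming distance 1. For n addresses there are log2 n address bits, so each address has log2 n such neighbours and the total is n log2 n transitions. The function name is hypothetical, and the sketch only counts address pairs; it is not the WCGD algorithm itself.

```python
from math import log2


def hamming1_transitions(address_bits):
    """Return all ordered (source, destination) address pairs that differ in exactly one bit."""
    n = 1 << address_bits  # number of distinct addresses
    pairs = []
    for src in range(n):
        for bit in range(address_bits):
            # Flipping a single bit gives a destination at Hamming distance 1.
            pairs.append((src, src ^ (1 << bit)))
    return pairs


if __name__ == "__main__":
    bits = 3
    n = 1 << bits
    pairs = hamming1_transitions(bits)
    print(f"n = {n}: {len(pairs)} transitions, n*log2(n) = {n * int(log2(n))}")
```

Any test that must apply each of these transitions at least once therefore grows as n log2 n rather than linearly in n.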
However, an O(n log n) test specifically designed for address decoder delay faults induced by defects, e.g., WCGD, may not be effective for address decoder delay faults induced by process variations, since it does not cover all the two-pattern address sequences required for variation-induced delay faults (VIDFs). Also, the O(n log n) test length limits its practical use.

The effects of process variation-induced delay are different from those of defect-induced delay. For multiple-input gates, different input patterns trigger different delays in the nominal case. Variation converts each of these nominal delays into a distribution. While DIDFs may have arbitrary locations and delay values, VIDFs are multiple and widespread, have small delay values that are correlated, and can be modeled by probability density functions (PDFs). Hence, a different test strategy is necessary for process variation-induced delay faults. We focus on developing effective linear tests for VIDFs. In particular, we identify all the address-dependent variation-induced failure mechanisms along with sufficient conditions for their detection.

Via a study of the structures of typical address decoders, we model the delay information of address decoders under variation and identify the address transitions that invoke the worst-case delays for various types of address decoders. Large address decoders usually use pre-decoding structures to reduce the transistor count and critical path delay. Large decoders are built with NAND gates (2-, 3-, and 4-input) and inverters for delay efficiency. Only the NAND delay is pattern dependent in the address decoder. Address transitions that trigger the maximum gate delay on each gate on the critical path invoke the maximum path delay, since the delay is additive along the path. The delay value for different input patterns depends on the amount of charge stored on the internal node capacitances.
The value of the internal capacitance is discrete for different input patterns. Thus the delay values for different input patterns differ by certain amounts. We need to identify the input patterns triggering the worst-case delay for the NAND gate. Then we can derive the address sequence triggering the worst-case delay for decoders, including pre-decoded decoders. The delays in the nominal case depend on the size of the gates. The distribution of delays depends on the variation. Based on the distribution, we can decide which delays are of concern and need to be covered in tests. We can then generate the shortest linear test that covers all the above two-pattern address sequences and hence all target VIDFs.

In Chapter 2, we propose a general approach to generate O(n) tests to cover all address-dependent VIDFs in arbitrary SRAM designs. The test length for VIDFs is bounded by 9.5n for arbitrary designs. We use extensive simulations to demonstrate that our new tests achieve nearly perfect coverage of VIDFs in arbitrary SRAM designs with reduced test length. Then we efficiently integrate our new tests for variations with tests for delay defects and demonstrate the efficiency and effectiveness of our new combined memory tests compared to previously known memory tests for delay faults.

In Chapter 4, we analyze the effect of BTI aging on SRAM peripheral circuitry, including the address decoder, precharge circuit, write circuit and sense amplifiers. We find that aging causes delay degradation in peripheral circuitry, especially in the address decoder. Hence we expand our test generation method for VIDFs to consider aging. We identify the address transitions that trigger the worst-case deactivation and activation delays at decoder outputs for a 32nm process before and after aging. We find that delay degradations caused by aging do not affect the input transitions invoking the worst-case delay. Thus aging degradation does not affect the path selection in test generation.
The test generated for the design before aging can also be used for the design after aging. Our experimental results show that our new tests can achieve nearly perfect coverage of VIDFs for arbitrary designs before as well as after aging.

1.4.2 SRAM design against BTI aging

The performance of transistors degrades due to aging. BTI is the most prominent aging mechanism in nano-scale CMOS technologies. Aging degradation causes lifetime failures and lowers the quality of shipped chips. Chips used in IoT and other embedded systems have tight constraints on power and require a long lifetime with low aging quality loss, where the aging quality loss captures the number of SRAM arrays likely to fail over the expected lifetime of the chip. We focus on aging in SRAMs since these need to retain state and hence remain under stress even when logic is put in sleep mode, typically under low voltage.

BTI aging causes significant stability issues for SRAMs. The authors in [27, 28, 29, 30, 31] propose several SRAM design methods at the architecture and circuit levels. All these methods require significant changes at the architecture level or expensive changes at the cell level. Also, the cell-level methods ignore the differences in stress conditions across the transistors in an SRAM cell and the impact of workload on aging degradation. The magnitude of aging degradation in SRAMs depends on the workload applied to the cells, measured by the duration for which various values (0’s and 1’s) are stored. Some single-purpose IoT systems have a specific workload, while more general-purpose systems have a broader range of workloads. Hence, we study a wide range of workloads. Our research explores a significantly less expensive design approach for combating aging in SRAMs.
We propose a design approach to combat aging in SRAMs by sizing the transistors in a manner that dramatically reduces the aging quality loss with no power overhead and very low area overhead for a wide range of workloads.

We study the impact of aging on the stability of SRAM cells under different workloads in combination with process variations. We explore how cell sizing affects resilience to BTI aging and how to size the transistors in cells to increase the lifetime of SRAMs and reduce aging quality loss. The read noise margin degrades after aging. We use sizing as a tool to optimize the noise margins after aging with a minimum area overhead. We identify which noise margin needs to improve and how to find the optimal size of the transistors, including the associated costs and benefits. We demonstrate that transistor sizing is surprisingly effective at combating aging for arbitrary workloads.

In Chapter 3, we propose a systematic design flow for SRAM arrays to combat BTI aging for any given workload. Our SRAM design is optimized for lifetime yield-per-area under a given aging quality loss target and tight constraints on power. We demonstrate the effectiveness of our sizing approach via extensive simulations for both 6T SRAM cells and 10T ST SRAM cells. The experimental results show that our design approach generates optimal cells that dramatically decrease aging quality loss while simultaneously improving (in most cases) the lifetime yield-per-area for various workloads without extra power overhead.

Aging causes delay degradation in peripheral circuitry. We therefore also need to explore the aging effect in SRAM peripheral circuitry and study design approaches for peripheral circuits that mitigate aging delay degradation. In Chapter 4, we quantify the delay along the critical paths of SRAMs and estimate the amount of delay degradation caused by aging for each component.
We analyze the power overhead of the classical approach to combating aging delay degradation in peripheral circuitry. We explore sizing of the address decoder and sense amplifier to ensure that the worst-case delays after aging are not larger than those of the original design before aging, so that the clock constraint is met. To avoid failures caused by delay degradation, we can leave a sufficient margin in the timing control. If the timing constraint is tight, we can increase VDD to compensate for the delay degradation at the cost of a power overhead. We can also resize the address decoder and sense amplifiers. Our methods allow designers to choose the proper design based on their timing, power and area constraints.

In Chapter 5, we develop an end-to-end SRAM design framework to maximize aging resilience under the given constraints. The design objective is to optimize the lifetime yield per area of the SRAM while also satisfying a given target DPPM. Error-correcting codes (ECC), classically intended to repair cell failures due to soft errors, may also be used to increase aging resilience. In addition to transistor sizing, we study the use of ECC to improve aging resilience. By quantifying the area and delay overheads of ECC and estimating the lifetime yield and DPPM of SRAMs with ECC, we explore the efficiency of ECC in combating aging. We find that ECC is efficient at repairing aging failures for workloads with small aging failure rates without sacrificing soft error resilience. After comparing approaches based on transistor sizing in SRAM cells and ECC in terms of overheads, lifetime yield and DPPM, we can choose either one or a combination of these approaches to identify the optimal design against aging under the given constraints. We integrate our methods into an existing SRAM compiler, CACTI [32], to provide the end-to-end capability to designers.
Chapter 2 Process variation-induced delay test of SRAMs

2.1 Background

Process variation is increasing with the continued reduction in technology feature size. Variation is becoming an important cause of circuit errors as well as performance and robustness degradation in nano-scale technologies [6]. Especially for memories, process variation is an important cause of failures due to aggressive transistor sizing and timing. Traditionally, memories have been studied first for each new technology because of their structural regularity, since the regularity simplifies certain aspects of the problem and hence allows for a deeper examination of the newly emerging complications.

SRAMs are our target since they are widely used (e.g., they occupy a very large fraction of chip area in all large processors) and are very susceptible to process variations, especially random dopant fluctuations exacerbated by scaling [8]. Also, SRAMs are used as the highest-speed memories, so aggressive timing is essential. SRAM read and write operations involve several timing control signals for precharge control, word line enable, sense amplifier enable, and so on [33].

Although memory tests have been developed and refined for several decades, most test algorithms target fabrication-induced defects [24, 26]. As we demonstrate ahead, the effects of process variation-induced delay are different from those of defect-induced delay. Hence, we study the impact of process variation on memories and identify the new faults induced by process variation that escape existing test algorithms. Our focus is on address sequence dependent delay faults (ASDDFs), since these are the most likely to escape traditional linear tests (March tests). By analyzing the timing of the SRAM read operation, we identified that the delays associated with decoders are the key determinants of ASDDFs.
In this chapter, by analyzing the timing and peripheral circuits of SRAMs, especially address decoders and sense amplifiers (SAs), we identify delay faults induced by process variation and propose a new general approach for developing tests for delay faults for various SRAM designs.

Key ideas: This chapter studies process variation-induced delay faults. By investigating the timing of the SRAM read operation (as it is more stringent than that of the write operation), we identify all failure mechanisms caused by process variation. In particular, we identify three types of VIDFs along with sufficient conditions for their detection. By analyzing the critical paths in SRAMs and the timing constraints of read operations, we identify that different address decoder and SA designs lead to different probabilities of VIDFs and significantly change the test strategy. Via a study of the structures of typical address decoders, we model the delay information of address decoders under variation and identify the address transitions that invoke the worst case for various types of address decoders. We also study the impact of different types of SAs on failure conditions. We propose a test generation method that targets VIDFs. Our new linear test provides coverage of VIDFs that is close to 100% for various SRAM designs. We use our test generation method to generate tests that cover both VIDFs and defect-induced delay faults (DIDFs), with a test length of 10n + 4n log2 n, 100% coverage for DIDFs and almost 100% coverage for VIDFs.

The rest of this chapter is organized as follows. Section 2.2 develops the problem and our general approach. Section 2.3 examines the previous memory address decoder delay tests. Failure mechanisms for VIDFs are identified in Section 2.4. Section 2.5 analyzes sense amplifiers and address decoders. Section 2.6 describes our new test generation for VIDFs.
New tests targeting VIDFs and DIDFs are presented in Section 2.6 and evaluated experimentally in Section 2.7. Finally, we present our conclusions in Section 2.8.

2.2 Problem development and approach

The most commonly used memory tests are march tests. Since all traditional march tests cover the entire memory address space, we focus on the delay faults that require specific two-pattern address sequences, i.e., ASDDFs. Fig. 2.1 shows the general structure of a single-column SRAM, including the address decoder, SRAM cell, precharge circuit, write circuit, and sense amplifier. In SRAMs, the critical path is the read delay, which is divided into address decoder delay, word line delay, cell delay for discharging the bit line, bit line delay, and SA delay.

Normally, delay tests must answer two questions: 1) which path is selected, and 2) how to select the vectors that invoke the worst-case delay of the target path. In terms of path selection in memory, all paths are equally likely to be critical because of symmetry, so we select all of them. For vector selection, among all the delay components in the read delay, the address decoder delay is the only one that can be controlled by vectors. All other delay components are determined by the circuits and delay faults, and are independent of address sequences. Therefore, we focus on analyzing address decoder delays.

Our general approach to generating efficient tests for VIDFs is as follows:

Step 1. Identify the target faults. We focus on ASDDFs that may escape traditional march tests. Via a study of SRAM timing conditions, we identify all failure mechanisms in this category and derive the delay condition that must be satisfied to detect each of these faults.

Step 2. Characterize the SAs to identify which subset of the above faults is more critical and should be targeted.

Step 3. Characterize the relationship between the two-pattern address sequences and delays. We first analyze the propagation delay of each primitive gate in address decoders and derive the probability density functions (PDFs) of the worst-case deactivation delay and activation delay for decoders. We then use the structures of decoders (pre-decoding for large SRAMs) to identify address sequences that maximize the probability of detection of all types of target faults.

Step 4. Find the shortest test that covers all the above two-pattern address sequences and hence all target faults.

Step 5. Comprehensively evaluate the new tests.

2.3 Related research

Most previous memory tests focus on defects, including the tests that target delay faults [24, 25, 26]. Also, all tests designed for address decoder delay faults, except WT [34], target delay faults caused by defects. A DIDF may have an arbitrary location and delay value. VIDFs, in contrast, are multiple and widespread, have small delay values that are correlated, and can be modeled by PDFs. For these reasons, the test strategies for DIDFs and VIDFs are different.

GALPAT [24, 25] and WCGD [25] are test algorithms that can detect address decoder delay faults. GALPAT is a comprehensive test, as it exhausts all pairs of address transitions, which is sufficient for DIDFs as well as VIDFs. For an SRAM with n distinct address values, GALPAT requires a sequence of 4n^2 + 6n vectors [25]. This test length is prohibitive even for medium-size SRAMs embedded in processors and SOCs. Therefore, despite its effectiveness, GALPAT is seldom used in practice. However, in this chapter, we use GALPAT to determine the total number of chips that have faults induced by process variations among Monte Carlo SRAM instances, and use this as the golden result against which to compare the effectiveness of other tests.
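To put these test lengths in perspective, the sketch below evaluates the GALPAT length of 4n^2 + 6n quoted above next to the 9.5n bound for VIDF tests and the 10n + 4n log2 n length of the combined VIDF and DIDF test stated earlier in this dissertation. The function names and the choice of n = 4096 addresses are illustrative assumptions.

```python
from math import log2


def galpat_length(n):
    """GALPAT test length for n distinct addresses: 4n^2 + 6n vectors."""
    return 4 * n * n + 6 * n


def vidf_length_bound(n):
    """Upper bound on the linear VIDF test length: 9.5n."""
    return 9.5 * n


def combined_length(n):
    """Length of the combined VIDF + DIDF test: 10n + 4n*log2(n)."""
    return 10 * n + 4 * n * log2(n)


if __name__ == "__main__":
    n = 4096  # hypothetical SRAM with a 12-bit address
    for name, length in [
        ("GALPAT", galpat_length(n)),
        ("VIDF test (bound)", vidf_length_bound(n)),
        ("combined VIDF + DIDF test", combined_length(n)),
    ]:
        print(f"{name:>26}: {length:,.0f} vectors")
```

For this example size, the quadratic GALPAT length is in the tens of millions of vectors, while the linear and n log n tests stay in the tens to hundreds of thousands.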
Like other classical approaches targeting address decoder delay faults induced by defects, WCGD consists of all address pairs with Hamming distance H=1, which is the necessary condition to cover all address decoder delay faults induced by defects. Although WCGD does not require exhausting every address transition, its test length of 6n(log2 n + 1) still inhibits its practical use, and it is not effective for address-dependent VIDFs, since it does not cover all the two-pattern address sequences required for detecting VIDFs.

Recently, for the first time, a test focusing on process variation, namely WT [34], was developed. WT is designed by identifying the address transitions that invoke the worst-case deactivation delay (WDeactD) of address decoder outputs. WT generates only those address transitions that expose VIDFs with the highest probability. WT is also the first test with linear test length (8n) that targets variation-induced faults in decoders. However, WT was developed for a specific SRAM decoder design, where the deactivation delay of the decoder output is always larger than its activation delay. For other SRAM designs, its test effectiveness is much lower than that of GALPAT. In this chapter, we generalize the approach taken by WT to generate O(n) tests for various types of decoder designs.

2.4 Failure Mechanism

We identify all the address-dependent variation-induced failure mechanisms in this section. Fig. 2.2(a) shows the SRAM timing diagram for read operations, where s and d denote the source and destination addresses in a two-pattern test sequence. The important time events are labeled on the diagram. The deactivation time of the source decoder output is denoted by t1, and t3 is the activation time of the destination decoder output. t2 and t4 represent the activation and deactivation times of the word line enable signal WL_EN. WL_EN is added to generate a pulse that is ANDed with the output of the decoder to filter the glitches produced during address transitions.
t5 denotes the activation time of the Read signal, which allows the sense amplifier to start generating its output by amplifying the voltage difference between the bit line (BL) and the complementary bit line (BL_bar).

For a correct read operation, the timing events must satisfy certain constraints. After carefully examining all possible timing conditions for SRAM read operations, we have identified three failure mechanisms, listed in Table 2.1 together with their effects and the sufficient conditions for a test to detect each fault. The last column shows the highest-probability conditions for these faults to occur. These conditions guide our approach to generating the most efficient tests for targeting all ASDDFs.

Fig. 2.2(b)-(d) show example timing diagrams for these three faults. In Fig. 2.2(b), deactivation of Dec_s is delayed and the source word line stays active for a certain time (≥ T1) when it is supposed to turn off. Assume that “1” is stored in the source cell and “0” is stored in the destination cell. After precharge, the source cell is driving BL and BL_bar since its word line is still active. BL_bar discharges when “1” is stored in the cell. When the destination word line turns on, the voltage of BL_bar is lower than that of BL. It is impossible to charge BL_bar again since the precharge circuit has turned off. Although the source and destination cells are not connected directly, the indirect impact through BL or BL_bar may cause the destination cell value to flip. This is the failure mechanism for fault 1. For a given timing control, a larger decoder deactivation delay increases the probability of occurrence of fault 1. To capture fault 1, the test must generate address transitions that trigger the worst-case source decoder output deactivation delay.

Fig. 2.2(c) shows the timing diagram of fault 2, where Dec_s deactivates late, causing multiple word lines to be high for some duration.
First, we examined the possibility of the cells’ values flipping when the source and destination word lines’ activation times overlap, i.e., when two cells are shorted via BL and BL_bar. Extensive Spectre simulations show that the probability that either cell’s value will flip is low, since the overlap between the activation times of the two word lines is relatively small. However, simulations also show that the presence of fault 2 leads to insufficient time for the SA. Late source word line deactivation causes BL_bar to discharge by a certain amount (≥ ΔV) when “1” is stored in the source cell. This slows down the SA during the read of the destination cell. To capture fault 2, address transitions invoking the WDeactD of the source address decoder output must be used in the test.

Fault 3 is exemplified in Fig. 2.2(d), where Dec_s is deactivated on time but Dec_d is activated late. Another cause of fault 3 is when both Dec_s deactivation and Dec_d activation are on time, but WL_EN is deactivated too soon or the Read signal is asserted early. All these situations can cause insufficient sense time. When the SA starts sensing, the voltage difference between BL and BL_bar is smaller than the required value (ΔVcritical), which makes it difficult for the SA to latch the correct value within the available time. With process variation and cross-coupling between wires, the SA may also latch a wrong value. Although the deactivation time of WL_EN and the arrival time of the Read signal are not address-dependent, the actual sense time depends on the destination word line activation time. Thus, to capture fault 3, the address transitions invoking the worst-case activation delay (WActD) of the destination address decoder output must be used in the test.

All three faults are caused by violating the required timing constraints. Timing events are determined collectively by the SRAM design and the timing control. Different timing controls affect the probability of occurrence of the faults.
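The sufficient conditions listed in Table 2.1 can be read as three simple inequalities on the time events t1 to t5. The following is a minimal sketch of such a check; the function name and all numeric values (event times and the thresholds T1, T2 and T3, in arbitrary time units) are hypothetical and chosen only to exercise each condition.

```python
def detect_faults(t1, t2, t3, t4, t5, T1, T2, T3):
    """Return the set of faults whose Table 2.1 sufficient condition holds.

    t1: source decoder output deactivation time
    t2, t4: WL_EN activation and deactivation times
    t3: destination decoder output activation time
    t5: Read signal activation time
    """
    faults = set()
    if t1 - t2 > T1:                    # f1: cell value flip
        faults.add("f1")
    if min(t1 - t3, t1 - t2) > T2:      # f2: insufficient amplification time
        faults.add("f2")
    if min(t5, t4) - t3 < T3:           # f3: insufficient sense time
        faults.add("f3")
    return faults


if __name__ == "__main__":
    thresholds = dict(T1=2.0, T2=1.1, T3=3.5)            # hypothetical values
    nominal = dict(t1=0.0, t2=5.0, t3=4.0, t4=30.0, t5=20.0)
    late_deactivation = {**nominal, "t1": 9.0}           # Dec_s turns off late
    late_activation = {**nominal, "t3": 18.0}            # Dec_d turns on late
    for label, events in [("nominal", nominal),
                          ("late deactivation", late_deactivation),
                          ("late activation", late_activation)]:
        print(label, sorted(detect_faults(**events, **thresholds)))
```

With these made-up numbers, delaying only t1 triggers the f1 and f2 conditions, while delaying only t3 triggers the f3 condition, which mirrors the roles of WDeactD and WActD described above.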
If WL_EN arrives early (small t2), faults 1 and 2 are more likely to occur. Postponing the assertion of WL_EN decreases the likelihood of faults 1 and 2 but increases the probability of occurrence of fault 3. If WL_EN deactivates early (small t4) or the Read signal activates early (small t5), the probability of occurrence of fault 3 increases. If a longer amplification time is allowed, fault 2 can be avoided. To avoid all three faults, a conservative timing control can be chosen, i.e., a large WL_EN arrival time, a wide WL_EN pulse, and a late Read signal arrival time. Although this decreases the possibility of VIDFs, overall performance is sacrificed. It is the SRAM designer’s responsibility to balance a high probability of correctness against performance and to choose a proper timing control.

Figure 2.1: SRAM general structure diagram

2.5 Analysis of key circuit components

According to the above analysis, the address sequence is critical to developing tests that are efficient in terms of coverage and test length. Address decoders need to be analyzed to identify the address transitions that trigger WDeactDs and WActDs at decoder outputs. SAs also need to be studied, since the amplification time and ΔVcritical affect the probabilities of fault 2 and fault 3.

Figure 2.2: (a) Timing diagram of SRAM read operation and (b)-(d) three possible ASDDFs

2.5.1 Analysis of sense amplifiers

The sense amplifier is on the critical path of the SRAM read operation. Different types of sense amplifiers have different working principles and features, which affect the test strategy. In this chapter, we apply our test generation approach to two commonly used SAs, a current controlled latch sense amplifier [35] (SA1) and a latch based amplifier with pass transistors [33] (SA2).

Fig. 2.3(a) shows the current controlled latch sense amplifier [35] (SA1). BL and BL_bar are connected to the gates of N3 and N4 respectively, which are decoupled from the outputs through high input impedance.
Before sensing starts, reset transistors P3 and P4 hold both Data_out and Data_out_bar at “1” and clear the previously latched value. When the Read signal is asserted, P3 and P4 turn off and the current source N5 turns on, so that equal current flows through each half of the amplifier. The side with the discharging bit line has a smaller gate voltage on its input transistor. The current flowing through that side drops correspondingly and causes the voltage at the output node to increase, which causes the amplifier to latch and a valid value to appear at the output.

Table 2.1: Possible address dependent delay faults and the corresponding sufficient conditions. (The explanations assume “1” is stored in the source cell and “0” is stored in the destination cell.)

| # | Sufficient condition for fault to occur | Effect | Explanation | Highest probability condition |
|---|---|---|---|---|
| f1 | t1 - t2 > T1 | Cell value flip | BL_bar discharges, so its voltage is lower than that of BL. When the word line of the destination cell is activated, the cell’s value flips. | Maximum source decoder output deactivation delay (max t1) |
| f2 | Min{t1 - t3, t1 - t2} > T2 | Insufficient amplification time | BL_bar discharges by ΔV, which makes the SA slower than under normal conditions. | Maximum source decoder output deactivation delay (max t1) |
| f3 | Min{t5, t4} - t3 < T3 | Insufficient sense time | When the SA starts sensing, the voltage difference between BL and BL_bar is smaller than ΔVcritical. | Maximum destination decoder output activation delay (max t3) |

Figure 2.3: Two commonly used sense amplifiers: (a) current controlled latch sense amplifier, (b) latch based amplifier with pass transistors

Recall that fault 2 occurs because one of the bit lines is unintentionally discharged. The current flowing through that side is smaller than in normal operation, which increases the delay of the SA. The SA can still latch a correct value if a longer time is allowed. Thus fault 2 only appears during “Read 0”, since Data_out is held at “1” by the reset transistor before the Read signal arrives.
This feature can be used to reduce test length during test generation.

Another commonly used SA, called the latch based amplifier with pass transistors, is shown in Fig. 2.3(b). The inputs, BL and BL_bar, connect to the outputs through two pass transistors, which causes feedback between the outputs and the bit lines. This feature is good for power but decreases the stability of the sense amplifier.

SAs have an impact on the sufficient conditions for faults 2 and 3, i.e., the values of T2, ΔVcritical, and T3 defined in Section 2.4. For a given timing control, a large amplification time increases the likelihood of fault 2, and a large ΔVcritical increases the likelihood of fault 3. Their working principles also affect the test strategy. For test generation, SAs can be characterized in terms of T2, ΔVcritical and the operations required for capturing fault 2. As an example, these parameters for SA1 and SA2, obtained from simulation results for nominal designs, are shown in Table 2.2. As shown in Table 2.2, only the “Read 0” operation is needed to capture fault 2 for SA1, while both “Read 0” and “Read 1” can detect fault 2 for SA2. For a given timing control, fault 2 is more likely for SA1 and fault 3 is more likely for SA2.

Table 2.2: Characterizing two types of SAs in terms of the parameters of fault 2 and fault 3

| SA | T2 (Fault 2) | ΔVcritical (Fault 3) | T3 (Fault 3) | Operations that can detect fault 2 |
|---|---|---|---|---|
| SA1 | 1.1 ps | 204 mV | 3.5 ps | Read 0 |
| SA2 | 1.2 ps | 240 mV | 8.4 ps | Read 0 and Read 1 |

2.5.2 Analysis of address decoder

According to the analysis of failure mechanisms, the address sequence is a critical factor in developing tests for address sequence dependent delay faults that are efficient in terms of coverage and test length. Address decoders need to be analyzed to identify the address transitions that trigger the worst-case deactivation and activation delays at decoder outputs. Large address decoders usually use a pre-decoding structure to reduce the transistor count and critical path delay [33].
Thus large address decoders are actually built with primitives such as 3-input NAND gates, 4-input NAND gates, and so on. Combinations of NAND gates and inverters are usually used to achieve minimum delays, as these are delay-efficient compared to other primitive gates. Fig. 2.4(a) shows a 12-bit address decoder designed using 3-input and 4-input NAND gates and inverters. The 12-bit address is divided into three groups. Each set of four primary inputs is pre-decoded and then combined through 3-input NAND gates. The transistors on the 3-4 forks are sized properly to achieve equal delay on each fork. The structure of a small address decoder is more straightforward. Fig. 2.4(b) shows a 3-bit address decoder where the transistors on the 1-2 forks are also optimized to achieve equal delay on each fork.

Figure 2.4: Address decoder structures: (a) a 12-bit address decoder structure with pre-decoding, (b) a 3-bit address decoder structure

The critical path delay is the sum of the delays of the primitive gates on the path. According to the decoder structure, the maximum deactivation and activation delays of decoder outputs are determined by the NAND gate delays, since only NAND gates have different pull-up and pull-down delays for different sequences [36]. For the decoders shown in Fig. 2.4, the decoder output activation delay corresponds to the high-to-low propagation delay (T_PHL) of the NAND gate, while the deactivation delay of the decoder output corresponds to the low-to-high propagation delay (T_PLH) of the NAND gate. Address transitions that trigger the maximum gate delay on each gate on the critical path invoke the maximum path delay, since the delay is additive along the path.

2.5.2.1 NAND gate delay analysis

To identify the address transitions that invoke the worst-case path delay, we need to examine the worst-case propagation delay of NAND gates.
2-input, 3-input, and 4-input NAND gates are commonly used in decoder design, since primitive gates with more than 4 inputs are typically delay-inefficient. Fig. 2.5 shows the NAND gates. The input transitions triggering the maximum $T_{PHL}$ and $T_{PLH}$ under various transistor sizes are listed in Table 2.3. Cdn, Csn and Cdp represent the parasitic capacitances of NMOS and PMOS. The worst-case $T_{PLH}$ is always invoked by flipping the single input that drives the NMOS transistor closest to GND (X2, X3, X4 for the 2-input, 3-input and 4-input NAND gate, respectively), where only one PMOS transistor turns on and charges the load capacitance and all of the internal capacitances.

The situation is more complicated for $T_{PHL}$. Based on theoretical analysis and simulation results, we found that the Miller effect, internal node charge, load capacitance, and pull-down current values have different effects on propagation delay. The difference in pull-down current under different input transitions is subtle and is mainly caused by the body effects of the serial NMOS transistors. The Miller effect is more significant when more inputs flip concurrently. On the other hand, if the inputs of transistors close to the output remain unchanged, more internal capacitances need to be discharged. These two factors have opposite effects on $T_{PHL}$, and the relative values of the internal node capacitances and the output capacitance determine which factor plays the leading role. The transistor size ratio of PMOS and NMOS affects the input transition invoking the worst-case $T_{PHL}$, as shown in Table 2.3. The larger the PMOS size, the more transistor inputs need to be flipped to invoke the maximum $T_{PHL}$. Fig. 2.6 shows the $T_{PHL}$ and $T_{PLH}$ distributions of a 3-input balanced NAND gate among 1000 Monte Carlo instances with process variations.
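The worst-case transitions discussed above can be captured in a small lookup, which may help when reading Table 2.3. The following is only an illustrative sketch: the dictionary transcribes the $T_{PHL}$ columns of Table 2.3, while the names `WORST_TPHL`, `worst_tplh`, and `flipped_inputs` are hypothetical helpers introduced here and are not part of the original work.

```python
# Worst-case T_PHL input transitions transcribed from Table 2.3.
# Outer key: number of NAND inputs; inner key: PMOS/NMOS size ratio.
# Each string reads "before-after" with the leftmost bit being the
# input closest to GND (X2, X3 or X4).
WORST_TPHL = {
    2: {2.25: "00-11", 2: "00-11", 1.33: "00-11",
        1: "01-11", 0.67: "01-11", 0.5: "01-11"},
    3: {2.25: "000-111", 2: "000-111", 1.33: "001-111",
        1: "001-111", 0.67: "001-111", 0.5: "011-111"},
    4: {2.25: "0000-1111", 2: "0001-1111", 1.33: "0001-1111",
        1: "0001-1111", 0.67: "0011-1111", 0.5: "0011-1111"},
}


def worst_tplh(num_inputs):
    """Worst-case T_PLH transition: only the input closest to GND falls."""
    return "1" * num_inputs + "-" + "0" + "1" * (num_inputs - 1)


def flipped_inputs(transition):
    """List the inputs (X_k ... X1) whose value changes in a transition."""
    before, after = transition.split("-")
    k = len(before)
    return [f"X{k - i}" for i in range(k) if before[i] != after[i]]


if __name__ == "__main__":
    # Balanced gates (ratio 0.67) are the ones used in the experiments.
    print(flipped_inputs(WORST_TPHL[4][0.67]))
    print(flipped_inputs(worst_tplh(3)))
```

Keeping the transitions as strings mirrors the table layout, so an entry can be checked against Table 2.3 by eye.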
2.5.2.2 Address sequence selection for 3-bit decoder

Based on the decoder structure and primitive gate delays, the address transitions triggering the WDeactDs and WActDs of decoder outputs can be identified; the results for a 3-bit address decoder are shown in Table 2.4. Without any loss of generality, in all decoder designs in this chapter, we consider that the most significant address bit (MSB) connects to the NMOS transistor closest to GND in the NAND gate. For decoders whose least significant address bit drives the NMOS closest to GND, the concerned address transitions will change correspondingly.

Figure 2.5: (a) 2-input, (b) 3-input and (c) 4-input NAND gates

Table 2.3: Input transitions triggering the worst-case $T_{PHL}$ and $T_{PLH}$ for NAND gates with different PMOS and NMOS transistor size ratios for an industrial 65nm CMOS process

Gate (inputs) | $T_{PHL}$, PMOS/NMOS size = 2.25 | 2 | 1.33 | 1 | 0.67 | 0.5 | $T_{PLH}$, any size
NAND2 (X2X1) | 00-11 | 00-11 | 00-11 | 01-11 | 01-11 | 01-11 | 11-01
NAND3 (X3X2X1) | 000-111 | 000-111 | 001-111 | 001-111 | 001-111 | 011-111 | 111-011
NAND4 (X4X3X2X1) | 0000-1111 | 0001-1111 | 0001-1111 | 0001-1111 | 0011-1111 | 0011-1111 | 1111-0111

As shown in Table 2.4, WDeactD is always triggered by the address transitions with the MSB flipped, independent of the transistor size. The address transitions invoking WActD depend on the size ratio of PMOS and NMOS in the NAND gates. For a decoder whose PMOS and NMOS size ratio equals 2, the address transitions with all three address bits flipped invoke the worst-case activation delay. The address transitions with the two most significant bits flipped trigger WActD for a decoder with balanced NAND gates. We use balanced NAND gates for demonstration in all of the experiments reported ahead.

Figure 2.6: (a) $T_{PHL}$ and (b) $T_{PLH}$ distribution of a 3-input NAND gate among 1000 Monte Carlo instances with process variation.

Table 2.4: Address transitions invoking the worst-case delay for 3-bit decoders with different transistor size ratios in NAND gates (all entries other than the source address are destination addresses)

Source address | Wp/Wn=2: WDeactD | Wp/Wn=2: WActD | Wp/Wn=0.67 (balanced): WDeactD | Wp/Wn=0.67 (balanced): WActD | Wp/Wn=0.5: WDeactD | Wp/Wn=0.5: WActD
000 | 100 | 111 | 100 | 110 | 100 | 100
001 | 101 | 110 | 101 | 111 | 101 | 101
010 | 110 | 101 | 110 | 100 | 110 | 110
011 | 111 | 100 | 111 | 101 | 111 | 111
100 | 000 | 011 | 000 | 010 | 000 | 000
101 | 001 | 010 | 001 | 011 | 001 | 001
110 | 010 | 001 | 010 | 000 | 010 | 010
111 | 011 | 000 | 011 | 001 | 011 | 011

Fig. 2.7 shows the shortest address sequences covering all address transitions triggering WDeactDs and WActDs for the different address decoders in Table 2.4. The address transitions labeled by blue arrows invoke WActD, and the transitions labeled by black arrows invoke WDeactD. The transitions labeled by grey arrows are only used for connection. The numbers marked on the arrows denote the order of the address transitions. These address sequences can be used to generate efficient tests for VIDFs.

Figure 2.7: Address sequences invoking all WDeactDs and WActDs for address decoders with (a) Wp/Wn = 2, (b) Wp/Wn = 0.67 (balanced), (c) Wp/Wn = 0.5

2.5.2.3 Address sequence selection for large address decoders with pre-decoding

Recall that the delay is additive along the path. Therefore, we can derive the PDFs of the deactivation and activation delays for large address decoders from the PDFs of the primitive gate delays. To find the address transitions invoking the worst-case path delay for large address decoders with pre-decoding, the address bits should be divided into groups according to the pre-decoding structure, and the worst-case delay in each stage should be identified. Take the 12-bit address decoder shown in Fig. 2.4(a) as an example. It uses two-stage pre-decoding.
Each set of four address bits is connected to a 4-input NAND gate and then combined through 3-input NAND gates. As shown in Table 2.3, for a balanced 4-input NAND gate, the worst-case $T_{PHL}$ is invoked by flipping X4X3, and the worst-case $T_{PLH}$ is invoked by flipping X4. For a balanced 3-input NAND gate, flipping X3X2 invokes the worst-case $T_{PHL}$ and flipping only X3 invokes the worst-case $T_{PLH}$. Thus in stage 2, group 2 (A[7]-A[4]) and group 3 (A[11]-A[8]) are selected, and in stage 1, the two MSBs in each group are flipped to invoke WActD of the path, i.e., A[11]A[10] and A[7]A[6]. To invoke WDeactD of the path, group 3 is selected and, within group 3, the MSB is flipped, i.e., A[11].

Another important feature of an address decoder is the relative values of the activation delay and deactivation delay of the decoder outputs. It affects the likelihood of different faults and thus the test strategy. If the deactivation delay is smaller than the activation delay, then fault 2 is less likely. In contrast, the probability of occurrence of fault 2 increases when the deactivation delay is larger than the activation delay. For a decoder with a larger activation delay, the probability of occurrence of fault 3 increases. The relative values of the activation delay and deactivation delay are determined by the pull-up and pull-down network strengths. Based on the relative values of the delays, decoders can be classified into two types. A type 1 decoder's activation delay is larger than its deactivation delay, and a type 2 decoder's deactivation delay is larger than its activation delay. In this chapter, we use one decoder from each type to demonstrate the test generation.

2.6 Generation of new tests

2.6.1 Test generation for VIDF

We demonstrate how to generate efficient tests that target VIDF, given an SRAM design with its address decoder and SA characteristics. As mentioned in Section 2.4, timing control plays an important role in test generation, since it affects the probability of occurrence of different types of delay faults.
In this chapter, two typical timing controls and four different SRAM designs, with two types of decoders and two types of SAs, are used to demonstrate the generality of our test generation method. All experiments conducted in this chapter are based on an industrial 65nm CMOS process.

Case 1: By setting the arrival time of WL_EN relatively large, fault 1 can be ignored. Fault 1 requires a large overlap between the WL_EN arrival time t2 and the deactivation time of the decoder output t1, which rarely occurs in practical timing controls. Table 2.5 presents the test generation for all four designs. The features of decoders and SAs related to test generation are listed in this table. Based on this information, we can decide which address transitions and operations are used in the test. For example, for design 1, i.e., an SRAM using decoder 1 (AD1), whose activation delay is larger than its deactivation delay, and SA1, fault 2 and fault 3 are both likely to occur. Thus, the test must contain all address transitions invoking WDeactD and WActD, as shown by variation test 1 (VT1) in Table 2.6. The next step is to find the shortest address sequence that covers all the required address transitions. According to the address decoder analysis, the Hamming distances for all the address transitions invoking the WDeactDs and WActDs are larger than n/2. Thus in VT1, we use different data backgrounds for cells to reduce the test length. During initialization, we write “0” in the first half of the cells and “1” in the second half of the cells. Then the address is traversed as shown in Fig. 2.7(b) to invoke WActD and WDeactD, one after another. The second half of the test is used to change the data background, since “Read 1” and “Read 0” are not totally symmetric under process variation.

In design 2, address decoder 2 (AD2), which has a larger deactivation delay, and SA1 are used. The probability of occurrence of fault 3 is much smaller than that of fault 2, so the address transitions invoking WDeactD are of interest.
Only “Read 0” is required to detect fault 2 for SA1. Thus VT2 is generated for design 2, as shown in Table 2.6. Different data backgrounds are used to initialize the cells as in VT1, and the address transitions with the MSB flipped are traversed to invoke WDeactD. Using the same principle, the tests for designs 3 and 4 are generated and shown in Table 2.6.

Case 2: For the nominal design, we set the WL_EN assertion time to the minimum and set the arrival time of WL_EN so that multiple word lines are never active at once; this arrival time is larger than that in case 1. The dominant fault for designs using AD1 is fault 3, and the dominant faults for designs using AD2 are fault 2 and fault 3. VT1 is used for designs 2 and 4 to detect fault 2 and fault 3. For designs 1 and 3, to achieve better fault coverage, VT1 is used for fault 3, since it includes all address transitions invoking WActD as well as those invoking the second worst-case activation delay, which happen to be the address transitions invoking WDeactD. For other timing controls with no constraints, VT1 can always be used to detect all three faults.

Table 2.5: Test generation procedure for 4 different SRAM designs under case 1 timing control

# | SRAM design | Decoder strength | Address transition invoking WActD | Address transition invoking WDeactD | Operations capturing target faults (SA feature) | Target faults | Concerned delay | Test
1 | AD1 + SA1 | Activation > Deactivation | A[m, m-1] flip | A[m] flip | Read 0 and Read 1 | Fault 2 and 3 | WActD and WDeactD | VT1
2 | AD2 + SA1 | Deactivation > Activation | A[m, m-1] flip | A[m] flip | Read 0 | Fault 2 | WDeactD | VT2
3 | AD1 + SA2 | Activation > Deactivation | A[m, m-1] flip | A[m] flip | Read 0 and Read 1 | Fault 2 and 3 | WActD and WDeactD | VT1
4 | AD2 + SA2 | Deactivation > Activation | A[m, m-1] flip | A[m] flip | Read 0 and Read 1 | Fault 2 | WDeactD | VT3

2.6.2 Test generation for both VIDF and DIDF

DIDFs are totally arbitrary in terms of values and locations.
Detecting the defects located in the parallel transistors in the decoder requires all pairs of address transitions with Hamming distance H=1 [37]. These address transitions can also detect other DIDFs in the decoder. Using the same approach and combining all address transitions with H=1 with the address transitions invoking WDeactD and WActD, we can generate an effective and efficient new test, V+DT, for both defect-induced and variation-induced delay faults, as shown in Table 2.6.

2.7 Evaluation of new tests

In this section, our new tests (VT1 to VT3 and V+DT), WT, WCGD, and GALPAT are evaluated in terms of test length and coverage for both VIDF and DIDF. 1000 Monte Carlo instances are generated, and each instance represents an SRAM module with intra-die (mismatch) and inter-die (variation) process variation. First, the simulation results of SRAMs with a 3-bit address decoder are compared in detail for various SRAM designs. Then simulation results for large SRAMs are provided.

2.7.1 Experimental results for SRAM with 3-bit address decoder

The numbers of SRAM instances that have VIDFs and are detected by each test, for various designs under the two timing control cases introduced in the test generation section, are shown in Table 2.7.

Case 1: We set the arrival time of WL_EN to 0.08 ns. The deactivation time of WL_EN and the Read signal arrival time are equal and set to 0.13 ns.

Case 2: We set the arrival time of WL_EN to 0.10 ns. The deactivation time of WL_EN and the Read signal arrival time are set to 0.13 ns.

As shown in Table 2.7, our new linear tests for VIDF achieve nearly perfect fault coverage, i.e., coverage close to the golden coverage provided by GALPAT. The WT test can provide good coverage when only fault 2 exists. For fault 3, however, its coverage is unacceptable, as WT only considers WDeactD. The new tests have much better coverage than WT, especially when fault 3 is dominant. The test lengths of all our new tests are lower than the test length of WT. The maximum test length for the new tests is 6.5n.
WCGD can only achieve good fault coverage when only fault 2 exists. It is not effective for fault 3 even though its test length is O(n log n). Our new test for both variation and defects can achieve perfect fault coverage for DIDF and almost 100% fault coverage for VIDF. Its test length is smaller than that of WCGD when the address decoder size is larger than 4-bit. The simulation results for DIDF are shown in Table 2.8. A single resistive defect is inserted in every net in the address decoder.

2.7.2 Results for SRAM with pre-decoding address decoder

We have designed 6-bit, 8-bit, and 12-bit address decoders using 2-stage pre-decoding. For the 6-bit decoder, 3-input NAND gates are used in the pre-decoder, and 2-input NAND gates are used in the second stage. For the 8-bit and 12-bit address decoders, 4-input NAND gates are used for pre-decoding, and 2-input and 3-input NAND gates are used in the second stage, respectively. All NAND gates are balanced. Based on the decoder structure and simulation results, we are able to obtain the decoder delay information and the address transitions invoking WDeactD and WActD for each decoder. WDeactD is invoked by the address transitions with the MSB flipped, and WActD is triggered by the address transitions with the two most significant bits flipped for the 6-bit and 8-bit decoders, and with A[11]A[10]A[7]A[6] flipped for the 12-bit decoder. For these decoders, the output activation delays are larger than the deactivation delays. A current-controlled latch sense amplifier is used. Based on these characteristics, we can generate the same tests for these SRAMs as for the 3-bit design 1 shown in the test generation section. The arrival time of WL_EN is set to prevent multiple word lines from being high at the same time, and the WL_EN assertion time is set to the minimum in the nominal case. VT1 is used to capture all three types of faults. The simulation results are shown in Table 2.9. Non-linear test simulations cannot be completed within practical time for the 12-bit decoder.
As can be seen, the coverage of the new linear test is nearly perfect for VIDF, which is more than twice that of WT, and the test length is 13/16 of that of WT. The test length of the new test for defects and variation is 90% of that of WCGD, with perfect coverage for both DIDF and VIDF.

2.8 Conclusion

In this chapter, we study the process-variation-induced delay faults during SRAM read operation. We have investigated the timing constraints of the read operation and the failure mechanisms, and identified three types of address-dependent VIDF, which may escape traditional march tests. We have conducted a systematic study of address decoders to identify the required address sequences. We note that the characteristics of address decoders and sense amplifiers affect the probabilities of different types of faults. Based on the characteristics of a given SRAM design, including the address decoder, the sense amplifier, and the timing control, we generate new test algorithms that target VIDF. Using the same approach, we effectively integrated our new tests for variation with tests for delay defects. We evaluate our new tests, along with WT, WCGD, and GALPAT, by simulating a large number of SRAM instances with process variation. Simulation results show that our new tests achieve high coverage with reduced test length compared with WT, WCGD, and GALPAT.
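To make the structure of VT1 concrete, the sketch below expands its march elements from Table 2.6 into an explicit list of operations and replays them on an ideal memory. It is a minimal illustration under stated assumptions, not the test generator used in this work: the function names are hypothetical, the flip written as 2^m in Table 2.6 is assumed to mean the most significant address bit (bit m-1 when counting from zero), 2^(m-1) is assumed to mean the second most significant bit, and the combined flip is assumed to toggle both bits.

```python
def vt1_operations(m):
    """Build VT1 as a list of (op, address, value) tuples for an m-bit row address.

    op is "w" (write value) or "r" (read, expecting value).
    """
    if m < 2:
        raise ValueError("VT1 needs at least two address bits")
    n = 1 << m               # number of word lines
    msb = 1 << (m - 1)       # assumed meaning of the flip written as 2^m
    second = 1 << (m - 2)    # assumed meaning of the flip written as 2^(m-1)
    both = msb | second
    # Address offsets visited by the nine reads of one march element.
    masks = (0, both, second, msb, 0, msb, second, both, 0)

    ops = []
    # First background: "1" in the upper half, "0" in the lower half (descending).
    ops += [("w", a, 1) for a in range(n - 1, n // 2 - 1, -1)]
    ops += [("w", a, 0) for a in range(n // 2 - 1, -1, -1)]
    for v in range(n // 4):                      # ascending v
        ops += [("r", v ^ k, 1 if (v ^ k) & msb else 0) for k in masks]
    # Second background: inverted data (ascending writes).
    ops += [("w", a, 1) for a in range(n // 2)]
    ops += [("w", a, 0) for a in range(n // 2, n)]
    for v in range(n // 4 - 1, -1, -1):          # descending v
        ops += [("r", v ^ k, 0 if (v ^ k) & msb else 1) for k in masks]
    return ops


def passes_on_fault_free_memory(ops, n):
    """Replay the operations on an ideal memory; True if every read matches."""
    memory = [None] * n
    for op, address, value in ops:
        if op == "w":
            memory[address] = value
        elif memory[address] != value:
            return False
    return True


if __name__ == "__main__":
    ops = vt1_operations(3)
    print(len(ops), passes_on_fault_free_memory(ops, 8))
```

For a 3-bit decoder (n = 8) the sketch yields 52 operations, which matches the 6.5n test length listed for VT1, and the reads for v = 0 visit addresses 000, 110, 010, 100, 000, 100, 010, 110, 000, alternating between transitions that flip the two most significant bits and transitions that flip only the MSB.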
Table 2.6: New proposed test algorithms for different designs targeting variation-induced delay faults, together with WT, WCGD and GALPAT, for an industrial 65nm CMOS process (m is the number of address bits and n is the number of word lines of the memory)

VT1* | Test length: 6.5n | Targeted faults: all three VIDFs
$\{\Downarrow_{n/2}^{n-1}(w1);\ \Downarrow_{0}^{n/2-1}(w0);\ \Uparrow_{v=0}^{n/4-1}(r0_{v}, r1_{v\oplus 2^{m}\oplus 2^{m-1}}, r0_{v\oplus 2^{m-1}}, r1_{v\oplus 2^{m}}, r0_{v}, r1_{v\oplus 2^{m}}, r0_{v\oplus 2^{m-1}}, r1_{v\oplus 2^{m}\oplus 2^{m-1}}, r0_{v});\ \Uparrow_{0}^{n/2-1}(w1);\ \Uparrow_{n/2}^{n-1}(w0);\ \Downarrow_{v=0}^{n/4-1}(r1_{v}, r0_{v\oplus 2^{m}\oplus 2^{m-1}}, r1_{v\oplus 2^{m-1}}, r0_{v\oplus 2^{m}}, r1_{v}, r0_{v\oplus 2^{m}}, r1_{v\oplus 2^{m-1}}, r0_{v\oplus 2^{m}\oplus 2^{m-1}}, r1_{v})\}$

VT2 | Test length: 3n | Targeted faults: fault 2 for SA1
$\{\Downarrow_{n/2}^{n-1}(w1);\ \Downarrow_{0}^{n/2-1}(w0);\ \Uparrow_{v=0}^{n/2-1}(r0_{v}, r1_{v\oplus 2^{m}});\ \Uparrow_{v=n/2}^{n-1}(r1_{v}, r0_{v\oplus 2^{m}})\}$

VT3 | Test length: 5n | Targeted faults: fault 1 and fault 2
$\{\Downarrow_{n/2}^{n-1}(w1);\ \Downarrow_{0}^{n/2-1}(w0);\ \Uparrow_{v=0}^{n/2-1}(r0_{v}, r1_{v\oplus 2^{m}}, r0_{v});\ \Uparrow_{0}^{n/2-1}(w1);\ \Uparrow_{n/2}^{n-1}(w0);\ \Downarrow_{v=0}^{n/2-1}(r1_{v}, r0_{v\oplus 2^{m}}, r1_{v})\}$

V+DT* | Test length: $10n + 4n\log_2 n$ | Targeted faults: all three VIDFs and DIDFs
$\{\Uparrow(w0);\ \Uparrow_{v}(w1_{v}, \Uparrow_{i=0}^{m-1}(r1_{v}, r0_{v\oplus 2^{i}}), r1_{v}, r0_{v\oplus 2^{m}\oplus 2^{m-1}}, w0_{v});\ \Uparrow(w1);\ \Uparrow_{v}(w0_{v}, \Uparrow_{i=0}^{m-1}(r0_{v}, r1_{v\oplus 2^{i}}), r0_{v}, r1_{v\oplus 2^{m}\oplus 2^{m-1}}, w1_{v})\}$

WT | Test length: 8n | Targeted faults: fault 1 and fault 2
$\{\Downarrow(w0);\ \Uparrow(w1_{v}, r0_{v\oplus 2^{m}}, w0_{v});\ \Uparrow(w1);\ \Downarrow(w0_{v}, r1_{v\oplus 2^{m}}, w1_{v})\}$

WCGD | Test length: $6n(1+\log_2 n)$ | Targeted faults: fault 1, fault 2, and DIDFs
$\{\Uparrow(w0);\ \Uparrow_{v}(w1_{v}, \Uparrow_{i=0}^{m-1}(r0_{v\oplus 2^{i}}, r1_{v}, r0_{v\oplus 2^{i}}), w0_{v});\ \Uparrow(w1);\ \Uparrow_{v}(w0_{v}, \Uparrow_{i=0}^{m-1}(r1_{v\oplus 2^{i}}, r0_{v}, r1_{v\oplus 2^{i}}), w1_{v})\}$

GALPAT | Test length: $4n^2 + 6n$ | Targeted faults: all three VIDFs and DIDFs
$\{\Updownarrow(w0);\ \Uparrow_{v}(w1_{v}, \Uparrow_{-v}(r0, r1_{v}), w0_{v});\ \Updownarrow(w1);\ \Uparrow_{v}(w0_{v}, \Uparrow_{-v}(r1, r0_{v}), w1_{v})\}$

* For SRAMs with pre-decoding address decoders, the address transition selections are based on the pre-decoding structures, as mentioned in Section 2.5.2.3.

Table 2.7: Number of failing chip instances captured by different tests for SRAMs with 3-bit address decoder

Case | SRAM design | New tests for VIDF | V+DT | WT | WCGD | GALPAT
Case 1 | D1+SA1 | 452 (VT1) | 453 | 301 | 336 | 453
Case 1 | D2+SA1 | 395 (VT2) | 397 | 395 | 396 | 397
Case 1 | D1+SA2 | 465 (VT1) | 466 | 309 | 317 | 467
Case 1 | D2+SA2 | 367 (VT3) | 369 | 367 | 368 | 369
Case 2 | D1+SA1 | 273 (VT1) | 274 | 143 | 149 | 274
Case 2 | D2+SA1 | 147 (VT1) | 149 | 139 | 141 | 149
Case 2 | D1+SA2 | 334 (VT1) | 336 | 154 | 167 | 336
Case 2 | D2+SA2 | 149 (VT1) | 149 | 141 | 142 | 149

Table 2.8: Number of defect-induced delay faults captured by different tests

Memory size (row address bits) | V+DT | WCGD | GALPAT | VT1 | WT
3 bit | 309 | 309 | 309 | 266 | 177
8 bit | 6838 | 6838 | 6838 | 4816 | 3120

Table 2.9: Number of failing chip instances captured by different tests for SRAMs with pre-decoding address decoder

Memory size (row address bits) | VT1 | V+DT | WT | WCGD | GALPAT
6 bit | 359 | 361 | 127 | 134 | 361
8 bit | 351 | 354 | 112 | 115 | 355
12 bit | 314 | - | 108 | - | -

Chapter 3
Low-cost SRAM redesign for effectively combating aging in systems with long lifetime and tight power requirements

3.1 Introduction

With the continued reduction in the feature sizes of devices, the rate of aging of ICs is increasing [11]. Transistor performance degrades due to aging. Negative bias temperature instability (NBTI) is considered to be the major reliability hazard in nano-scale CMOS and causes threshold voltage (Vth) degradation in a pMOS transistor when a negative bias voltage is applied. Positive bias temperature instability (PBTI), which increases the Vth of nMOS transistors, is considered a second-order effect in poly technology but is a prominent aging mechanism in high-k/metal technologies [38, 20, 39, 40]. Various aging models have been proposed (e.g., [15, 16, 14]). The impact of aging on logic circuits and SRAMs has been widely studied [19, 20, 21]. NBTI and PBTI cause timing degradation in logic circuits, and timing as well as stability degradation in SRAM cells [22, 20]. Aging can cause failures and lower the quality (typically described in terms of defective parts per million (DPPM)) of shipped chips during their operational lifetimes due to delay problems, unacceptably low static noise margins [21, 23], and so on.
Aging quality loss, or aging failure rate, is a measure of the fraction of the chips that fail during the expected lifetime of a chip; it must meet stringent industry DPPM standards, typically 10-100 ppm.

To reduce aging quality loss, one can develop new test methods to identify and discard, before shipping chips to customers, the SRAMs that will fail due to future aging over the desired lifetime. However, this will decrease yield, as more chips will be discarded. A much more appealing solution is to design the SRAM to be resilient to aging, which reduces aging quality loss and hence increases its lifetime.

There are some straightforward methods to avoid aging failures, e.g., changing the operating conditions of SRAMs. In particular, to avoid failures due to aging-induced delay problems, we can decrease the clock frequency and/or increase the power supply voltage. To avoid failures due to aging-induced cell stability problems, we can increase the power supply voltage. Hence, at the chip level, the aging failure rate can be decreased at the cost of lower performance and/or higher power.

Since many of today’s chips operate under stringent power constraints and performance requirements, it is imperative to develop more intelligent approaches to combat aging without power and performance loss. In particular, chips used in IoT and other embedded systems are often distributed in the field and must operate for years on a tiny battery and/or by harvesting small amounts of energy, such as small amounts of solar power. Such chips have especially tight power consumption constraints and require a long lifetime with low aging quality loss. We focus on aging in SRAMs, since these need to retain state and hence are likely to remain under stress even when the logic blocks on a chip are put in sleep mode. BTI aging degrades the stability of SRAM cells over time [21, 23].
To reduce the effects of aging on the correct operation of SRAM cells, several SRAM design methods have been proposed at the architecture and circuit levels. These methods require significant changes at the architecture level or expensive changes at the cell level. At the architecture level, the authors in [27] propose a design method to balance the workload across cores to reduce the aging degradation. In [28], the authors propose proactive use of spares in SRAMs to mitigate aging. To reduce the asymmetry of BTI aging in SRAM cells, the authors in [29] propose a method that periodically changes the data encoding and flips the data bits stored in SRAM cells. Shutting down idle cache blocks to mitigate aging is proposed in [31]. At the circuit level, the authors in [30] propose to use adaptive body bias to compensate for negative bias temperature instability (NBTI) aging. This requires a standalone threshold voltage sensing circuit to estimate aging degradation and to generate the required body bias voltage; it also requires significant additional layout area in each SRAM cell to enable adjustments to the body bias voltages of individual transistors. Further, only NBTI is considered, and the body bias voltage is generated based on the threshold degradation of a pMOS device under the full stress condition. This approach also ignores the stress condition deviations across different transistors in SRAM cells and the impact of workload on aging degradation.

In reality, the magnitude of aging degradation in an SRAM cell depends on the workload for the cell, measured by the duration for which various values (0’s and 1’s) are stored. In practice, the workload may differ for different SRAMs, and also for different cells in a single SRAM array. All these differences lead to various levels of differential aging and introduce new challenges for SRAM design. Single-purpose IoT systems have specific workloads. More general-purpose systems have a broader range of workloads.
Hence, in this chapter, we study a wide range of workloads.

In contrast to the design methods at the architecture level and the high-overhead methods at the circuit level, such as body bias, our method simply sizes the transistors in SRAM cells to optimize the lifetime yield-per-area under tight constraints on aging quality loss (measured by DPPM) and power. We find that transistor sizing is surprisingly effective at combating aging for a wide range of workloads, including the worst-case workload. The prior work [41] mentions sizing to mitigate aging in SRAM; however, the approach used is not described.

In this chapter, we use sizing as a tool to maximize noise margins after aging. We explore how sizing the transistors in an SRAM cell affects its resilience to BTI aging, including which noise margin to increase, how to size the transistors, and the associated costs and benefits. We develop a systematic approach to understand the impact of aging on each transistor in an SRAM cell and to size the transistors in a manner that dramatically reduces the aging quality loss, with no power overhead and low area overhead, for a wide range of workloads. Our SRAM design is optimized for lifetime yield-per-area under a given target aging quality loss, where the aging quality loss captures the number of SRAM arrays likely to fail over the expected lifetime of the chip.

The rest of this chapter is organized as follows. In Section 3.2, we introduce the background on BTI aging, differential aging, and the impact of BTI aging on the stability of SRAMs. In Section 3.3, we present models of workloads for different classes of applications and then introduce our design objective. In Section 3.4, we start by introducing our sizing approach to combat BTI aging for a fixed workload and then present our design approach for SRAM cells to combat BTI aging for any given workload.
In Section 3.5, we demonstrate the effectiveness of our new designs through extensive experimental evaluations. Finally, we present our conclusions in Section 3.6.

3.2 Background

3.2.1 BTI aging

BTI is the major reliability hazard in nano-scale CMOS. NBTI causes threshold voltage (Vth) degradation of a pMOS transistor when a negative bias voltage is applied to the transistor’s gate. In high-k processes, PBTI and NBTI degrade the Vth of stressed nMOS and pMOS transistors, respectively [39, 20]. Many BTI aging models have been proposed (e.g., [15, 14]) to predict BTI aging. We adopt the aging model proposed in [15] for use in our experimental evaluations.

According to the physical understanding of the BTI effect, when a transistor is stressed, BTI degrades its strength. The stress patterns for nMOS and pMOS transistors are shown in Fig. 3.1. The transistor partially recovers from the degradation after the stress is removed. Hence, in the long term, the threshold degradation caused by BTI aging depends strongly on the percentage of time that a transistor is stressed, i.e., the duty cycle. Due to different duty cycles, different transistors in a circuit age differently. We refer to this effect as differential aging.

Various transistors in an SRAM cell face different stress conditions. In the typical 6T SRAM cell shown in Fig. 3.2, depending on the value stored in the cell, the four transistors of the cross-coupled inverters (namely, PUL, PDL, PUR and PDR) are periodically under stress. When a value “0” is stored in the cell, PDL and PUR are stressed and suffer aging degradation, while PUL and PDR are in the recovery phase. When a value “1” is stored in the cell, PUL and PDR are stressed, while PDL and PUR are in the recovery phase. The access transistors (AXL and AXR) suffer negligible BTI degradation because their stress time over the SRAM’s lifetime is short: they are only under stress during write operations, and only if the corresponding word line is selected.
(During a read operation, the access transistors are turned on, but the voltage at BL or BLB is allowed to drop only by a small amount from VDD.)

Figure 3.1: Stress state for nMOS transistor and pMOS transistor

Logic values (0’s and 1’s) are typically stored in an SRAM array with different probabilities. Various transistors in an SRAM cell face different stress conditions, and thus suffer different amounts of Vth degradation. If an SRAM cell stores the value “0” with a higher probability over its lifetime, as is the case for most of the SRAM cells in a data cache [31, 42], PDL and PUR suffer more significant aging degradation than PUL and PDR. In general, the magnitude of an SRAM cell’s aging degradation depends on the workload. The workload may differ for different SRAM arrays, and also for different cells in one SRAM array. All this workload heterogeneity leads to differential aging and introduces new challenges for aging-resistant SRAM design.

3.2.2 The impact of BTI aging on SRAMs’ stability

To maximize the noise margins after aging, we need to understand the impact of BTI aging on all noise margins and identify which noise margin needs to be increased via sizing. SRAMs’ read and write stability can be measured by the read static noise margin (RNM) and the write noise margin (WNM), respectively. As shown in Fig. 3.3, RNMa and RNMb denote the read noise margins of an SRAM cell, where RNMa is the side of the largest square that can fit into the upper opening of the butterfly curve, and RNMb is the side of the largest square that can fit into the lower opening of the butterfly curve. RNM is the minimum of RNMa and RNMb.

Figure 3.2: 6T SRAM cell’s stress and recovery status when value “0” is stored.

WNM is used as a metric of write stability. It is defined as the width of the smallest embedded square between the two DC transmission curves of the two inverters of an SRAM cell, as shown in Fig. 3.4.
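The relation between the stored value and the stressed transistors, and the definition of RNM as the smaller of the two read margins, can be summarized in a few lines. This is only an illustrative sketch: the function names and the parameter `p_zero` (the fraction of the lifetime for which the cell stores “0”) are hypothetical names introduced here, and the access transistors are omitted because their BTI stress is described above as negligible.

```python
def stress_duty_cycles(p_zero):
    """Fraction of the lifetime each inverter transistor spends under BTI stress.

    p_zero is the fraction of time the cell stores "0".
    Storing "0" stresses PDL and PUR; storing "1" stresses PUL and PDR.
    """
    if not 0.0 <= p_zero <= 1.0:
        raise ValueError("p_zero must lie in [0, 1]")
    p_one = 1.0 - p_zero
    return {"PDL": p_zero, "PUR": p_zero, "PUL": p_one, "PDR": p_one}


def read_noise_margin(rnm_a, rnm_b):
    """RNM is the minimum of the two read noise margins."""
    return min(rnm_a, rnm_b)


if __name__ == "__main__":
    # A cell that stores "0" for 75% of its lifetime.
    print(stress_duty_cycles(0.75))
```

A cell that stores “0” for 75% of its lifetime therefore stresses PDL and PUR three times as long as PUL and PDR, which is the source of the differential aging discussed above.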
To design an SRAM cell resistant to BTI aging, the impact of aging on the SRAM’s read and write noise margins needs to be carefully studied. In this chapter, all experiments are conducted using the Predictive Technology Model (PTM) 32nm library [43]. The NBTI model proposed in [15] is adopted. For a 32nm high-k process, as reported in [20, 39], PBTI causes the same magnitude of Vth shift in nMOS as NBTI does in pMOS. As stated earlier, the logic values stored in an SRAM array are not symmetric. Various transistors in an SRAM cell face different stress conditions, and thus suffer different amounts of Vth degradation. Differential aging is reflected in the noise margins of SRAM cells. Without loss of generality, we assume the desired lifetime is 5 years, i.e., M = 60 months.

Figure 3.3: The curves for a 6T SRAM cell indicating RNM before and after aging.

Before aging, RNMa and RNMb are equal because of the symmetric strengths of the key transistors in SRAM cells, as shown by the blue dashed butterfly curves in Fig. 3.3. However, when value “0” is always stored in the cell, PDL and PUR suffer aging degradation. After aging, RNMb decreases significantly, while RNMa improves, as shown by the red solid butterfly curves in Fig. 3.3. RNM is the minimum of RNMa and RNMb. Thus, RNM decreases after aging.

Fig. 3.5 shows the read noise margin of a 6T SRAM cell through its desired lifetime when value “0” is stored in the cell for 75% of its lifetime. We illustrate this case since it has been widely reported that the dominant logic bit value “0” is stored approximately 75% of the time in cache [31, 42]. We can see that after 60 months of usage, RNMb degrades severely while RNMa improves slightly. This is easy to explain, since when value “0” is stored in the cell for more time than value “1”, PDL and PUR suffer more degradation than PDR and PUL.
Thus the pull down is weaker in the left part of the cell, and the relative strength of the pull down network with respect to the access transistor decreases more than in the right part. Pull up becomes weaker in the right part of the SRAM cell. During a read operation, the voltage division between the access transistor and the pull down transistor determines $V_{OL}$. The switching threshold voltage (VM) of an inverter is determined by the relative strengths of the pull up and pull down networks. Therefore, VM of the left inverter of the cell increases, while VM of the right inverter decreases. $V_{OL}$ of the left inverter increases more than that of the right inverter. The decrease of VM of the right inverter and the increase of $V_{OL}$ of the left inverter cause RNMb to decrease significantly, and make it easier to corrupt the stored value in the cell during a read operation. The relative amount of shift of VM of the left inverter and $V_{OL}$ of the right inverter determines whether RNMa increases or decreases. Clearly, if value "0" is stored for more time than value "1" in the SRAM cell, RNMb will degrade more significantly than RNMa. Similarly, if value "1" is stored in the cell for more time than value "0", RNMa will degrade more significantly than RNMb.

Fig. 3.6 shows the write noise margin of a 6T SRAM cell through its desired lifetime in the case where value "0" is stored in the cell for 75% of its lifetime. Write noise margins increase since both pull up transistors become weaker after aging. The new value is easier to write into the cell because the strength of the access transistor relative to the pull up transistor increases.

Because of the varying workload for different cells, noise margin degradation differs from cell to cell. We define signal probability ($P_{signal}$) as the probability that an SRAM cell stores the value "1" over its lifetime. Fig. 3.7(a) shows RNM degradation of a 6T SRAM cell under different $P_{signal}$ after 60 months of usage.
We can see that RNMb degrades severely while RNMa improves when the value "0" is stored in the SRAM cell for a higher fraction of the time than the value "1", i.e., when $P_{signal} < 0.5$, because PDL and PUR suffer more aging degradation than PDR and PUL. Similarly, RNMa will degrade more significantly than RNMb for $P_{signal} > 0.5$. When $P_{signal}$ equals 0.5, RNMa and RNMb degrade by equal amounts, since the transistors in the left inverter and right inverter undergo similar stress conditions. Finally, since RNM is the minimum of RNMa and RNMb, RNM degradation is smallest when values "0" and "1" are stored in a cell with equal probability.

Figure 3.4: The curves for a 6T SRAM cell indicating WNM before and after aging

Figure 3.5: Read noise margins through lifetime for a 6T SRAM cell when value "0" is stored in the cell for 75% of the time

Figure 3.6: Write noise margins through lifetime for a 6T SRAM cell when value "0" is stored in the cell for 75% of the time

Fig. 3.7(b) shows the change in WNM of a 6T SRAM cell after 60 months of usage under different $P_{signal}$. In all cases, write noise margins increase (note that the ΔW1NM and ΔW0NM values are always positive) since both pull up transistors become weaker after aging.

According to the above analysis, RNMa and RNMb degrade by different amounts when value "0" and value "1" are stored in the cell with different probabilities. The SRAM cell must be designed to maximize the noise margin that is more likely to cause failure after M months of usage.

3.3 Problem definition

The previous section illustrates that the aging degradation of SRAM cells depends on their workload. For different workloads, different cells have distinct signal probabilities and hence different noise margin degradation. To better present our design approach, we first characterize various workloads and then present our design objective.
3.3.1 Characterize the workload

The workload can be characterized based on the type of application. The more application-specific a chip is, the better we can characterize its workload, e.g., hearing aids and sensors monitoring heart rate, vibration, or light. In contrast, for chips with general purpose applications, there is limited information on the workload. For such cases, we develop an unknown workload model, since a more specific workload distribution cannot be given. The typical workloads studied in this chapter are as follows:

Fixed workload: All cells in an SRAM array have the same signal probability.
Gaussian 1: A Gaussian distribution for the signal probability with mean = 0.5 and standard deviation = 0.1.
Gaussian 2: A Gaussian distribution for the signal probability with mean = 0.5 and a larger standard deviation = 0.3.
Skew distribution: The signal probability distribution in data caches extracted from [1], shown in Fig. 3.8.
Unknown: The workload distribution is unknown and can be arbitrary in general.

Figure 3.7: (a) Read noise margin and (b) write noise margin changes after 60 months of usage for a 6T SRAM cell under different signal probabilities.

Figure 3.8: The signal probability distribution in data caches extracted from [1].

We will show that our sizing approach is effective at combating aging for a wide range of workloads, including the unknown workload.

3.3.2 Design objective

The conventional design goal for an SRAM cell is to achieve maximum yield-per-area by increasing the read and write stability at the time of fabrication, i.e., at m = 0 months. However, BTI causes stability degradation of the SRAM cell [21, 23]. This leads to a high aging quality loss if, during SRAM design, we only consider the noise margins and yield at the time of fabrication. In practice, aging quality loss is extremely critical and is measured in terms of defective parts per million (DPPM).
To ensure customer satisfaction, DPPM must be below a small value, which we call the target DPPM. Usually, the target DPPM is 50 [44], i.e., no more than 50 chips per million can fail due to aging during the expected lifetime once the chips are shipped to customers.

For a design optimized for yield at fabrication, if only 1% of the SRAM cells in a 2MB array store 0 all the time, DPPM = 1403 for 5 years of usage, even if we assume that all the remaining 99% of the SRAM cells are functional during the entire lifetime. While we can increase the power supply voltage to reduce aging quality loss, this leads to unacceptably high power overheads. Thus we need to develop a more intelligent approach to combat aging without increasing power.

To avoid any unnecessary area overhead in the SRAM redesign, we include the area information in the design objective. Our design objective is to optimize the lifetime yield-per-area of an SRAM array under a given DPPM target and tight constraints on power.

The terms used in the chapter are defined as follows. Let us use m to denote time and M to denote lifetime. Hence, m = 0 indicates fabrication time, and m = M indicates the end of the desired life.

Failure rate of an SRAM cell ($P^{cell}_{fail}$): the probability that an SRAM cell fails at fabrication.
Aging failure rate of an SRAM cell ($P^{cell}_{fail,aging}$): the probability that an SRAM cell functions properly at fabrication but fails during its desired lifetime due to aging.
Yield of an SRAM array: the probability that an SRAM array functions correctly after fabrication.
Lifetime yield of an SRAM array: the probability that an SRAM array is able to function correctly throughout its desired lifetime of M months.
Aging quality loss of an SRAM array: the probability that an SRAM array functions correctly at fabrication but fails during its desired lifetime due to aging.
DPPM of the SRAM array: $\text{DPPM} = \text{Aging quality loss} \times 10^{6}$.
Then for an SRAM array consisting of N cells,

$$\text{Lifetime yield} = \prod_{i=1}^{N} \left(1 - P^{cell,i}_{fail} - P^{cell,i}_{fail,aging}\right)$$

$$\text{Aging quality loss} = 1 - \prod_{i=1}^{N} \left(1 - P^{cell,i}_{fail,aging}\right) \tag{3.1}$$

3.4 Design approach

We start by exploring how cell sizing affects resilience to BTI aging, motivated by the asymmetric fixed workload. This case is quite common, since many researchers studying SRAM aging [29, 30, 31] assume that all the cells in an SRAM array have the same signal probability. We also start with this easy case and propose an asymmetric sizing method to design SRAM cells resistant to BTI aging. This suggests that cell sizing may also be an effective approach for arbitrary workloads. We then propose our new approach for the design of SRAM cells that effectively combat aging for any given workload.

3.4.1 SRAM cell sizing approach for fixed workload

Our general approach for designing SRAM cells resistant to BTI aging for the fixed workload is as follows:

Step 1. Calculate the threshold degradation for each transistor in the SRAM cell through the expected lifetime based on the aging model.
Step 2. Identify the transistors to size and the associated trade-offs in terms of noise margins and delay. Specifically, a) identify the impact of aging on the SRAM's stability, b) explore all the design parameters to characterize the impact of the parameter values for each transistor on the noise margins, and c) study the relationship between the design parameter values of each transistor and the access time.
Step 3. Study the impact of sizing on area. Develop a formula for cell area based on a parametric layout of the SRAM cell.
Step 4. Analyze the impact of aging on noise margins for SRAM cells under process variations.
Step 5. Based on the analysis of steps 2-4, identify the optimal values of design parameters for all transistors to maximize the minimum noise margin after aging. Evaluate $P^{cell}_{fail,aging}$. Repeat Step 5 until $P^{cell}_{fail,aging}$ achieves the target value.
Step 6.
Carry out a comprehensive evaluation of the new design.

6T CMOS is the most commonly used SRAM design in practice. We first develop our design approach on the 6T planar CMOS SRAM cell. Then we extend our design approach to FinFET SRAM design, since FinFET has been proven to be a better alternative to the planar device in nanoscale technology for improved stability, particularly for sub-20nm technology [45, 46]. To achieve ultra low power consumption, our design approach is also applied to 10T Schmitt Trigger CMOS SRAM cells, since they can achieve low failure probability for ultra-low power supply operation and do not require any changes over the conventional SRAM architecture used for 6T cells.

3.4.1.1 6T planar CMOS SRAM cell

The relationship between noise margins and the transistor sizes in the nominal case

To fully understand how to size the transistors to optimize the lifetime yield under the given DPPM, we study the relationship between the noise margins and the size of each transistor. Fig. 3.9 shows the noise margins when only the size of a single transistor in the 6T planar SRAM cell is changed. To reduce aging quality loss, we need to minimize the stability degradation caused by aging. For illustration, let us first consider the workload where $P_{signal}$ for all cells is 0.25, since the dominant logic value "0" is stored 75% of the time (on average) in caches [31]. RNMb degrades while RNMa increases over time for SRAM cells with $P_{signal}$ of 0.25, as shown in Section 3.2. Write noise margins increase over the lifetime, while the read noise margin decreases, especially RNMb. Thus, to prevent the noise margin degradation caused by aging, we focus on increasing RNMb to meet the DPPM requirement while maximizing yield-per-area and minimizing performance overhead. From Fig. 3.9, we can see that the sizes of AXL, PDL, PDR and PUR have an impact on RNMb of the cell. Fig.
3.10(a) and 3.10(b) show the relationships between the access times and the transistor sizes in the SRAM cell to help decide the optimal size of each transistor, where TR0 denotes the access time for read 0 and TR1 denotes the access time for read 1. The access time is determined by the time required for the cell to achieve the minimum voltage drop on the bit lines required by the sense amplifier.

Our design objective is to increase RNMb to reduce the noise margin degradation caused by aging. In Table 3.1, we list all the transistors whose sizing impacts RNMb of the SRAM cell, as well as how to increase RNMb via sizing. The reasons why RNMb can be improved through sizing of each transistor are also presented in the table, along with the negative side effects. The curves that provide the required sizing information are cited in parentheses in each table entry.

As shown in Table 3.1, RNMb can be improved by increasing the size of PDL and PUR and/or decreasing the size of AXL and PDR. However, each approach has some negative effect on write noise margins or access times, except increasing the size of PDL. For example, write noise margins and access time are sensitive to the size of the access transistor, as shown in Fig. 3.9(a) and 3.10(a) respectively. Decreasing the size of AXL will decrease W0NM and increase access time dramatically. Although decreasing the size of PDR will increase RNMb, RNMa will decrease significantly and the access time for read 1 will increase. Sizing up the pull up transistor PUR will decrease W1NM dramatically. In contrast, increasing the size of PDL has no negative impact on write noise margins based on Fig. 3.9(b) and no negative impact on access times based on Fig. 3.10(a) and 3.10(b). Thus we choose to increase the size of PDL. Similarly, for a different workload where $P_{signal} > 0.5$, RNMa will degrade more significantly than RNMb. Thus we focus on optimizing RNMa after aging and increase the size of PDR if $P_{signal} > 0.5$.
When $P_{signal} = 0.5$, RNMa and RNMb degrade by equal amounts. We need to increase both RNMa and RNMb by increasing the size of both pull down transistors in the 6T planar cell to achieve sufficient noise margin after aging.

Generally, there are two ways to size the transistors, namely skew-size and up-size. Skew-size increases the size of a single transistor and is used when the workload is asymmetric. For example, when $P_{signal} = 0.25$, RNMb degrades more. We focus on increasing RNMb, thus we only increase the size of PDL until RNMb after the desired lifetime is equal to RNMa at fabrication. Up-size increases the size of both pull down transistors to increase both RNMa and RNMb. Up-size is used when the workload is symmetric or when the target DPPM cannot be achieved via skew-size alone for an asymmetric workload. We use skew-size first for an asymmetric workload and then evaluate the DPPM. If the target DPPM cannot be achieved, then starting with the symmetric cell, we size up both pull down transistors by the minimum step to increase both read noise margins and then use skew-size until we achieve the target DPPM.

The optimal sizes of transistors in SRAM cell for the nominal case

We first identify the optimal sizes of transistors in the SRAM cell for the nominal case, where we ignore process variations. To ensure a very low aging quality loss, our goal is to design the SRAM cell with no stability degradation caused by aging throughout the lifetime. According to the previous analysis, the differential aging typically observed in data caches [31, 42] causes RNMb degradation, and sizing up the single transistor PDL can improve RNMb without a negative effect on write noise margins and access time. However, increasing the size of PDL will decrease RNMa, hence we need to properly size PDL to achieve optimal RNM through the lifetime.

Figure 3.9: Noise margins of the 6T planar SRAM cell versus the sizes of (a) AXL (b) PDL (c) PUL (d) AXR (e) PDR (f) PUR
First, we need to ensure that the degraded RNMb after the desired lifetime is no less than the minimum noise margin at fabrication, i.e., RNMa, to prevent RNM from reducing through the lifetime. Then, under this constraint, we need to maximize RNM at fabrication. We need a starting point to optimize the noise margins after aging. Thus we choose the original design (the SRAM cell optimized for yield at fabrication per area) as the base design. We start with the base design and properly size PDL to avoid stability degradation. Fig. 3.11 shows the relationship between the read noise margins and the size of PDL, for different lifetimes. The optimal sizes of PDL for different lifetimes are indicated by the dashed circles, where the optimal size of PDL is the size at which RNMb after the desired lifetime is equal to RNMa at fabrication, as indicated in Fig. 3.11.

Figure 3.10: Access time of the 6T planar SRAM cell versus the transistor sizes (a) for read-0 and (b) for read-1.

Table 3.1: All the transistors whose sizing impacts RNMb of a 6T planar SRAM cell

| Transistor having impact on RNMb | How to increase RNMb | Reason | Side effect (except area overhead) |
| AXL | Decrease size (Fig. 3.9(a)) | Decrease $V_{OL}$ of the left inverter | Decrease W0NM and increase TR0 dramatically (Fig. 3.9(a) and 3.10(a)) |
| PDL | Increase size (Fig. 3.9(b)) | Decrease $V_{OL}$ of the left inverter | Decrease RNMa (Fig. 3.9(b)) |
| PDR | Decrease size (Fig. 3.9(e)) | Increase VM of the right inverter | Decrease RNMa and increase TR1 significantly (Fig. 3.9(e) and 3.10(b)) |
| PUR | Increase size (Fig. 3.9(f)) | Increase VM of the right inverter | Decrease W1NM dramatically (Fig. 3.9(f)) |

Figure 3.11: Optimal sizes of PDL

Figure 3.12: Layout of 6T planar CMOS SRAM cell

6T planar SRAM cell layout

Our design goal is to design an SRAM with optimal lifetime yield per area under the given DPPM target. To compare the lifetime yield per area of an SRAM array, the area of an SRAM cell must be estimated. Fig.
3.12 shows the layout of a 6T planar CMOS SRAM cell. Based on the layout and the design rules, we can derive a formula for area in terms of transistor sizes. We set the area of the default symmetric cell to 1 and then normalize the area of other cells to calculate the lifetime yield per area.

Figure 3.13: Read noise margin changes after 60 months usage for 1000 Monte Carlo 6T planar CMOS SRAM cell instances with process variations

The impact of aging on RNM under process variations

In Section 3.4.1.1, we have identified the optimal size of the transistors in the SRAM cell for the nominal case. We need to study the impact of aging on noise margins for SRAM cells under process variations to check whether the optimal size for the nominal case can achieve the optimal lifetime yield. We generate 1000 Monte Carlo SRAM cell instances with process variations. The RNM of the 1000 instances at fabrication and after 60 months of usage are extracted. Fig. 3.13 shows the changes of RNMa and RNMb for all 1000 instances, indicated by ΔRNMa and ΔRNMb respectively. We can see that the noise margin changes caused by aging are almost identical for different instances and are almost equal to those in the nominal case. Thus the optimal size for the nominal case can be used as the optimal size for all instances with process variations.

Aging failure rate evaluation

Our objective is to design SRAMs with optimal lifetime yield-per-area under a given target DPPM and constraints on power for the fixed workload. Here we assume all cells in an SRAM array have the same $P_{signal}$. Thus our target value of $P^{cell}_{fail,aging}$ of an SRAM cell can be calculated using $1 - (1 - \text{DPPM} \times 10^{-6})^{1/N}$. We have identified the optimal values of sizing for all transistors to maximize the minimum noise margin after aging. We need to estimate $P^{cell}_{fail,aging}$.
If our design achieves the target value of $P^{cell}_{fail,aging}$, it should be the optimal design in terms of lifetime yield-per-area, since we achieve our target with the smallest area overhead. Otherwise, we need to further increase the size of the appropriate transistors to increase the read noise margin and repeat the previous analysis step until we achieve the target value of $P^{cell}_{fail,aging}$.

To demonstrate the effectiveness of our design approach, we compare the lifetime yield-per-area and DPPM of SRAMs using our 6T asymmetric cells with SRAMs using the cells optimized for yield at fabrication per area.

We use a 32nm high-k metal gate PTM library for our simulations. The desired lifetime for the SRAM is assumed to be 5 years, i.e., M = 60 months. The given target DPPM is 50. Thus the target value of $P^{cell}_{fail,aging}$ is 2.98E-12 for a 2MB SRAM array. $P_{signal}$ for all cells is 0.25. The probability collective method, a variant of the importance sampling method proposed in [47], is adopted to estimate the failure rate and aging failure rate. To model process variations, we assume that the Vth of each transistor follows an independent and identical Gaussian distribution with a standard deviation approximately equal to 10% of the nominal Vth value.

In Fig. 3.14, we compare the lifetime yield-per-area and DPPM of a 2MB SRAM using our optimal 6T SRAM cells with the base design. The base design is optimized for yield at the time of fabrication per area. Our experimental results show that, if all cells have $P_{signal} = 0.25$, the base design has an unacceptably high aging quality loss, i.e., DPPM > 2000. Our new design achieves our goal with a higher lifetime yield-per-area and an extremely low aging quality loss, namely DPPM < 10.

3.4.1.2 6T FinFET SRAM cell

In the previous section, we demonstrated the 6T SRAM cell design in planar CMOS. As in planar CMOS, BTI aging is considered the major aging mechanism for FinFET.
FinFET SRAM cells are more vulnerable to BTI than planar CMOS cells [48]. We will extend our design approach to combat BTI aging in FinFET SRAM cells. By studying the impact of BTI aging on FinFET SRAM stability in combination with process variations and exploring all the design parameters for each FinFET transistor in SRAM cells, we have identified the optimal designs against BTI aging. We evaluate our new designs along with the base design in terms of lifetime yield per area and aging quality loss. We demonstrate that our design approach provides an optimal FinFET SRAM cell design with a dramatically lower aging quality loss and optimal lifetime yield per area.

Figure 3.14: (a) Lifetime yield-per-area and (b) DPPM comparison for 2MB SRAM arrays for all cells with $P_{signal} = 0.25$.

Figure 3.15: Noise margins of a 6T FinFET SRAM cell versus gate length and Fin thickness of all transistors

The relationship between noise margins and the design parameters of FinFET transistors in the nominal case

The design approach is the same for the FinFET cell. Compared to planar CMOS, however, FinFETs have many more design parameters, including gate length (Lg), Fin thickness (Tfin), Fin pitch (fpitch), length of the source/drain (lrsd), and Fin number (Nfin). Thus the search space for the design of FinFET cells is significantly larger. Fin number is discrete, which adds constraints to our design space. Furthermore, the area impact of increasing the strength of a FinFET device is also higher.

To develop our general approach for designing FinFET SRAM cells resistant to BTI aging, we first explored various values of all the design parameters of FinFETs to find the impact of those parameters for each transistor on the noise margins. We find that increasing Lg makes the FinFET weaker, but the butterfly curve of SRAM cells becomes steeper and SNM increases, as shown in Fig. 3.15(a).
Increasing fpitch, lrsd, and/or Nfin will increase the drive strength of a FinFET device. The effects of fpitch and lrsd are minor. Although increasing Tfin increases the effective width of the FinFET (2*Hfin+Tfin), the butterfly curve becomes flatter, thus SNM decreases when the Fin thicknesses of all transistors increase, as shown in Fig. 3.15(b).

As in the 6T planar CMOS SRAM cell, the amount of noise margin degradation depends on signal probability. For all the cells with the signal probability of 0.25, RNMb degrades more severely than RNMa. In this case, our design objective is to improve the read noise margin, especially RNMb. In Table 3.2, we list all the transistors whose design parameters impact RNMb of the 6T FinFET SRAM cell, as well as how to increase RNMb. The reasons why RNMb can be improved by each approach are also presented in the table, along with the side effects. Although increasing fpitch and lrsd of PDL has no side effects, the influence of these two parameters is small, so the associated area overhead is not justified. We choose to tune the gate length of all 6 transistors and the fin number of the pull down transistors to achieve the optimal design. After considering the cell area, we identify the optimal sizes of transistors for BTI resistant FinFET cells in the nominal case.

Figure 3.16: 6T FinFET SRAM cell layout

6T FinFET SRAM cell layout

To estimate the area of the SRAM cell, we develop a formula for cell area based on a parametrized layout of the FinFET SRAM cell. The design parameters and layout design rules for FinFET cells are shown in Table 3.3 [49, 50]. The 14nm PTM-MG library is used for our design. Lg, Tfin, fpitch, lrsd, and Nfin are design parameters, while Sc, Sm2m, and Sg2c are fixed design rules. The layout of the 6T FinFET cell is shown in Fig. 3.16.

Figure 3.17: (a) Lifetime yield per area and (b) DPPM comparison for 2MB FinFET 6T SRAM arrays for all cells with $P_{signal} = 0.25$.
Experimental evaluation

To demonstrate the effectiveness of our design approach, the lifetime yield per area and DPPM of a 2MB SRAM using our optimal FinFET SRAM cells are compared with the base design. The base design is optimized for yield at fabrication per area. The basic experimental setup is the same as for the 6T planar CMOS SRAM. The 14nm PTM-MG library is used for our simulations. Our experimental results show that the base design has an unacceptably high aging quality loss (DPPM > 100,000). Our new design achieves a much higher lifetime yield-per-area with a DPPM < 1.

3.4.1.3 10T ST SRAM cell

With the continued scaling in the feature sizes of devices, the increased density and leakage necessitate ultra low power supply operation for SRAMs to achieve low power consumption. After carefully studying the conventional 6T cell, 8T cell, and 10T cell designs and comparing their various metrics, we found that the 10T Schmitt Trigger (ST) SRAM cells [3] have high SNMs and high tolerance to process variations. These cells can achieve the lowest failure probability for ultra-low power supply operation. Also, they do not require any changes over the conventional SRAM architecture used for 6T cells. Thus we also choose the 10T ST cell structure as one of our baseline designs and extend our design approach to the 10T ST cell to achieve an SRAM design with optimal lifetime yield per area in the ultra low power domain.

Figure 3.18: 10T Schmitt Trigger SRAM cell

In a typical 10T ST SRAM cell, as shown in Fig. 3.18, the six transistors of the cross-coupled inverters (NL1, NL2, NR1, NR2, PL and PR) are periodically under stress depending on the value stored in the SRAM cell. As in the 6T cell, the access transistors experience negligible BTI degradation because of their short stress time through the SRAM's lifetime. Feedback transistors NFL and NFR are NMOS transistors and their drain terminals are always connected to VDD. Thus they do not suffer BTI degradation.
For all the cells with the signal probability of 0.25, NL1, NL2, and PR suffer more significant aging degradation than NR1, NR2, and PL. RNMb degrades more severely than RNMa. Thus, to prevent the noise margin degradation caused by aging, we focus on maximizing RNMb.

The optimal sizes of transistors in SRAM cell in the nominal case

Compared to the 6T cell, the DC transmission curves of the 10T ST cell are steeper due to the existence of the feedback transistors. We focus on the feedback transistor analysis. As for the 6T cell, in Table 3.4 we list all the transistors whose sizing impacts RNMb of the SRAM cell, how to increase RNMb via sizing, the reasons, and the side effects. As shown in Table 3.4, RNMb can be increased by increasing the size of NFR, NL1, NL2, and PR and/or decreasing the size of XL, NR1, and NR2. However, each approach has some negative effect on stability or access time, except increasing the size of NFR. For example, write noise margins and access time are sensitive to the size of the access transistor. Decreasing the size of XL will decrease W0NM and increase access time dramatically. Increasing the size of pull down transistors NL1 and NL2 will decrease RNMa, and thus reduce the yield at fabrication. Although decreasing the size of the pull down transistors of the right inverter, NR1 and NR2, will increase RNMb, RNMa will decrease significantly and the access time for read 1 will increase. Sizing up the pull up transistor PR will decrease W1NM dramatically. In contrast, increasing the size of NFR has no negative impact on other noise margins based on Fig. 3.19(a) or on access time based on Fig. 3.19(b). Thus we choose to increase the size of NFR.

We can properly size the NFR transistor to ensure that RNM does not reduce through the lifetime. Fig. 3.20 shows the relationship between the read noise margins and the size of NFR, for different lifetimes. The optimal sizes of NFR for different lifetimes are indicated by the dashed circle.
Although sizing up NFR has no negative effect on RNMa, in terms of area efficiency, NFR should not be oversized. The optimal size of NFR is the size at which RNMb after the desired lifetime is equal to RNMa at fabrication, as indicated in Fig. 3.20.

Figure 3.19: (a) Noise margins of the 10T ST SRAM cell versus the sizes of NFR. (b) Access time of the 10T ST SRAM cell versus the transistor sizes

Figure 3.20: Optimal sizes of NFR in 10T ST cell

10T ST SRAM cell layout

As in the previous section, we also derive a formula for the cell area based on the layout and the design rules to compare the lifetime yield per area of the SRAM array. Fig. 3.21 shows the layout of the 10T ST SRAM cell [3].

Figure 3.21: Layout of 10T ST SRAM cell

Experimental evaluation

The lifetime yield per area of the SRAM using our optimized 10T ST asymmetric cell and that of the base design are plotted in Fig. 3.22. The base design is optimized for yield at fabrication per area. We can see that the lifetime yield of the base design decreases significantly due to the cell failures caused by aging. As shown in Fig. 3.22, the default design has a very large DPPM value, larger than 10,000, which will cause significant customer dissatisfaction. Our proposed asymmetric design has a very low aging quality loss with a small area overhead. In particular, our design's DPPM is less than 10 for a 2MB SRAM array with a signal probability of 0.25 for all the cells. Also, the lifetime yield per area of our design is significantly higher than that of the base design.

Figure 3.22: (a) Lifetime yield per area and (b) DPPM comparison for 2MB SRAM arrays using 10T ST cells for all cells with $P_{signal} = 0.25$.

One advantage of the 10T ST cell is that sizing up NFR has no negative effect on stability and access time, apart from a small area overhead. Thus, even without the fixed workload assumption, the SRAM array using asymmetric cells will always lead to better DPPM, since the dominant logic value in caches is "0".
With accurate information about the workload, we can find the optimal size for various SRAM cells against aging using our design approach to achieve the target DPPM and optimal lifetime yield per area. We will present the design approach for any given workload in the next section.

3.4.2 Design flow for any given workload

In Section 3.4.1, we showed one sizing approach for the fixed workload. However, the workload may be different for different SRAM arrays and for different cells in the same SRAM array. In Section 3.3, we saw that the workload depends on the applications, and we presented some typical workloads and our design objective for SRAMs.

In this section, we present our design flow for SRAMs to optimize the lifetime yield-per-area under tight constraints on DPPM and power for any given workload. We study planar CMOS in this section.

3.4.2.1 Key ideas

A given workload can be described using a probability density function (PDF), where the y-axis is the percentage of the cells that are likely to have the identical signal probability and the x-axis is the corresponding signal probability. Hence, for each point in the PDF, we can use the aging model to estimate the aging degradation in the values of the various noise margins that the corresponding set of cells is likely to experience. From the noise margin analysis and the experimental evaluation of the aging failure rate for the SRAM cell, we are able to estimate the amount of the respective noise margin that will be "used" to tackle the effect of the estimated level of aging for this set (percentage) of cells. We can also estimate the "unused" noise margin as 1 - "used" for each such set. Thus, for a given DPPM, we can select a sufficient number of the above sets such that the probabilities of the sets add up to at least $1 - \text{DPPM} \times 10^{-6}$. Define U as the minimum of the "unused" noise margins over all the sets selected above. Our sizing constraint is to ensure U > 0.
We can also define our problem as: maximize U, subject to delay, area, and VDD constraints.

3.4.2.2 SRAM cell sizing approach for any given workload
From Section 3.2, we know that the magnitude of noise margin degradation depends on $P_{\text{signal}}$. Thus the aging failure rate of an SRAM cell also depends on $P_{\text{signal}}$. Unlike in the fixed-workload case, the target value of $P^{\text{cell}}_{\text{fail,aging}}$ cannot be calculated directly. Because it is impossible to derive $P^{\text{cell}}_{\text{fail,aging}}$ as a function of $P_{\text{signal}}$ without an existing design, we only know the workload distribution and the expectation of $P^{\text{cell}}_{\text{fail,aging}}$ from the target DPPM. We therefore develop a new optimization method and propose the following design flow for SRAMs for a given workload and target DPPM ($DPPM_{\text{target}}$):

Step 1: a) Compute the average $P_{\text{signal}}$ for the given workload, denoted as $P_{\text{signal-ave}}$; b) calculate the target value of the aging failure rate using

$$P^{\text{cell}}_{\text{fail,aging-target}} = 1 - \left(1 - DPPM_{\text{target}} \times 10^{-6}\right)^{1/N}$$

Step 2: Use the above design approach for fixed workload to design an SRAM cell, called the initial cell, using $P_{\text{signal-ave}}$ and $P^{\text{cell}}_{\text{fail,aging-target}}$. The base design is the cell optimized for the yield-per-area at fabrication.

Step 3: Evaluate the initial cell using the given workload. Specifically, a) calculate the Vth degradation caused by aging for different $P_{\text{signal}}$ using the aging model; b) estimate the aging failure rate for the initial cell with the given workload; c) calculate DPPM using

$$DPPM = \left(1 - \prod_{i=1}^{N}\left(1 - P^{\text{cell},i}_{\text{fail,aging}}\right)\right) \times 10^{6}$$

d) if $DPPM \le DPPM_{\text{target}}$, the initial cell is the optimal design; exit the process.

Step 4: Adjust the target value of the aging failure rate. The aging failure rate for $P_{\text{signal-ave}}$ is estimated in step 1, denoted as $P^{\text{cell}}_{\text{fail,aging-obtain}}$. The average aging failure rate for the given workload can be computed as

$$P^{\text{cell-ave}}_{\text{fail,aging}} = \int P^{\text{cell}}_{\text{fail,aging}}(P_{\text{signal}})\, f_{\text{signal}}(P_{\text{signal}})\, dP_{\text{signal}},$$

where $f_{\text{signal}}$ is the probability density function of $P_{\text{signal}}$.
The new target value of the aging failure rate is calculated by

$$P^{\text{cell-new}}_{\text{fail,aging-target}} = P^{\text{cell}}_{\text{fail,aging-obtain}} \times P^{\text{cell}}_{\text{fail,aging-target}} \,/\, P^{\text{cell-ave}}_{\text{fail,aging}}$$

and the target is then updated as

$$P^{\text{cell}}_{\text{fail,aging-target}} = P^{\text{cell-new}}_{\text{fail,aging-target}}$$

Step 5: Using the cell designed above as the base design, design a new SRAM cell using the design approach for fixed workload with $P_{\text{signal-ave}}$ and the new target value of the aging failure rate from step 4.

Step 6: Evaluate the cell from step 5 with the given workload. If $DPPM \le DPPM_{\text{target}}$, the cell is the optimal design; exit the process. Otherwise, repeat step 4.

The average signal probability of the given workload determines the symmetry of the optimal cell. For example, if $P_{\text{signal-ave}}$ = 0.5, the optimal cell is symmetric and both RNMa and RNMb are optimized after aging. If $P_{\text{signal-ave}}$ = 0.1, we tend to increase RNMb and the optimal cell is sized asymmetrically. Through a large number of simulations, we find that the ratio between the aging failure rates at different signal probabilities does not vary much once the cell symmetry is determined. Thus our design process converges rapidly when our method (step 4) is used to adjust the target aging failure rate.

If the workload is unknown, we must consider the worst-case workload distribution for aging to guarantee the target DPPM. The worst case is where $P_{\text{signal}}$ is 0 for half of the cells and 1 for the other half, and the locations of the cells with each $P_{\text{signal}}$ value are arbitrary. RNMb degrades most when $P_{\text{signal}}$ = 0 and RNMa degrades most when $P_{\text{signal}}$ = 1. Because the cell locations are arbitrary, we need to consider both cases ($P_{\text{signal}}$ = 0 and $P_{\text{signal}}$ = 1) simultaneously. Skewed sizing is therefore not effective, and the more expensive up-sizing is needed. This is the worst case from the perspective of sizing.
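The arithmetic in steps 1, 3 and 4 of the design flow can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the function names are hypothetical, N is assumed to be the number of cells in the array, the integral in step 4 is approximated with a trapezoidal rule over sampled signal probabilities, and the cell sizing itself (steps 2 and 5) is left out because it requires circuit simulation.

```python
import math


def target_cell_failure_rate(dppm_target, n_cells):
    """Step 1b: per-cell aging failure rate implied by the array-level DPPM target.

    Computes 1 - (1 - DPPM_target * 1e-6) ** (1 / N). log1p/expm1 are used
    because the result is usually many orders of magnitude below 1.
    """
    return -math.expm1(math.log1p(-dppm_target * 1e-6) / n_cells)


def array_dppm(cell_fail_rates):
    """Step 3c: DPPM = (1 - prod_i (1 - P_i)) * 1e6 over all cells of the array."""
    log_all_cells_pass = sum(math.log1p(-p) for p in cell_fail_rates)
    return -math.expm1(log_all_cells_pass) * 1e6


def average_failure_rate(signal_probs, fail_rates, pdf_values):
    """Step 4: approximate the integral of P_fail(P_signal) * f_signal(P_signal).

    Assumption: the workload PDF and the failure rate are sampled at the same
    increasing signal probabilities; the trapezoidal rule is used in between.
    """
    total = 0.0
    for k in range(1, len(signal_probs)):
        width = signal_probs[k] - signal_probs[k - 1]
        left = fail_rates[k - 1] * pdf_values[k - 1]
        right = fail_rates[k] * pdf_values[k]
        total += 0.5 * (left + right) * width
    return total


def adjusted_target(p_obtain, p_target, p_ave):
    """Step 4: new target = P_obtain * P_target / P_ave."""
    return p_obtain * p_target / p_ave
```

In an iteration of the flow, `array_dppm` would be compared against the target DPPM after each redesign, and `adjusted_target` would supply the failure-rate target for the next call to the fixed-workload sizing procedure.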
For this extreme workload, we simplify the above design flow into the design approach for fixed workload with $P_{\text{signal}}$ = 0, with the added constraints that the cell must be sized symmetrically and that both RNMa and RNMb must be optimized after M months of usage.

3.5 Experimental results
To demonstrate the effectiveness of our new design flow, we evaluate the optimal designs for four different workloads: Gaussian 1, Gaussian 2, skew distribution, and unknown, as described in Section 3.3.1. As stated in Section 3.4.2, we consider the extreme case for the unknown workload. The 6T cell is the most commonly used SRAM design in practice. To serve designers targeting ultra low power consumption, we also apply our design flow to 10T Schmitt Trigger (10T ST) SRAM cells, shown in Fig. 3.18, which provide low failure probability for ultra-low power supply operation without requiring any changes to the conventional SRAM architecture for 6T cells. The stress and recovery status when value “0” is stored in the cell is also shown in Fig. 3.18. The evaluation is conducted for both the 6T SRAM cell and the 10T ST SRAM cell under ultra low power supply voltage operation.

3.5.1 Comparison of DPPM and lifetime yield-per-area for 6T SRAM cell
The evaluation results for all four workloads for 6T SRAM cells are shown in Fig. 3.23 and Fig. 3.24. The design parameters and the area overheads for the different designs are listed in Table 3.5. D0 is the design optimized for yield-per-area at fabrication. However, as shown in Fig. 3.24, D0 cannot achieve the DPPM target for any of the four workloads. This leads to unacceptably high customer dissatisfaction. (In Fig. 3.24, the first two bars are for one workload, the next two are for the second workload, etc.) For each workload, we use our design flow to generate the corresponding design that achieves the optimal lifetime yield-per-area under the given constraint on the target DPPM.
For the Gaussian distribution workloads, the optimal cells (D1 and D2) are symmetric. Compared to D0, the optimal cell D1 has a slightly lower lifetime yield-per-area for Gaussian 1 due to its area overhead, but it has an extremely low DPPM (< 1), while D0 cannot achieve the DPPM target. For the skew distribution in data caches obtained from [1], an asymmetric cell design (D3) achieves the design goal. For the unknown workload, the cell (D4) is designed and evaluated for the extreme case (i.e., the worst case). Thus the cell may be over-optimized for aging, but it guarantees customer satisfaction for any workload.

Our new design flow generates optimal cells that dramatically decrease aging quality loss, as shown in Fig. 3.24, while simultaneously increasing (in most cases) the lifetime yield-per-area for various workloads, as shown in Fig. 3.23. Remarkably, even for the unknown workload, under worst-case assumptions, our new design (D4) not only achieves a DPPM of 31.52 but also achieves a lifetime yield per area of 89.99%, which is much higher than that of the original SRAM cell design (D0).

Figure 3.23: Lifetime yield-per-area comparison for different 6T SRAM cells under four different workloads.

3.5.2 Comparison of DPPM and lifetime yield-per-area for 10T ST SRAM cell
The 10T ST cell netlist is shown in Fig. 3.18. The 10T ST cell adopts the same memory-array architecture as the conventional 6T SRAM cell and can achieve ultra low voltage operation. The evaluation results of our designed cells for all four workloads are listed in Table 3.7. Here S1, S2, S3, and S4 are the cells designed for the Gaussian 1, Gaussian 2, skew distribution, and unknown workloads, respectively. The design parameters for each are shown in Table 3.6. The results for the 10T ST cell are similar to those for the 6T SRAM cell. The design optimized for yield at fabrication per area (S0) has unacceptably large aging quality loss for all four workloads.
The effectiveness of our design flow is demonstrated by the fact that it provides dramatically better designs, i.e., significantly lower DPPM accompanied by nearly identical or larger lifetime yield-per-area for various workloads, as shown in Table 3.7.

3.5.3 Power overhead analysis of the classical approach
As shown in Sections 3.5.1 and 3.5.2, stability degradation causes unacceptably high aging quality loss for the design optimized for yield-per-area at fabrication. A classical approach to reducing the aging failure rate is to increase the power supply voltage. We can either use a high VDD from the beginning of operation or increase VDD gradually in the field to ensure that the aging quality loss does not exceed the target DPPM. However, this approach leads to unacceptably high power overheads, especially for IoT components and other embedded systems that have tight constraints on power.

Figure 3.24: DPPM comparison for different 6T SRAM cells under four different workloads.

Fig. 3.25(a) shows how the VDD of the 6T SRAM design optimized for yield-per-area at fabrication is raised gradually in the field, based on the aging degradation under different workloads, to ensure the target DPPM ≤ 50. The original VDD is 0.7 V. Aging degradation appears after a short period of usage, so VDD needs to be raised after this short time. As shown by the curve with purple triangles in Fig. 3.25(a), VDD for the design optimized for yield-per-area at fabrication (D0) under the Gaussian 1 workload is raised to 0.72 V after 2 months of usage to ensure low aging quality loss. For the other three workloads, the power supply needs to be raised even higher. As shown by the curve with green crosses in Fig. 3.25(a), VDD for D0 under Gaussian 2 is raised gradually until it reaches 0.83 V. Similarly, for the skew distribution, VDD for D0 is raised to 0.82 V to ensure the target DPPM, as shown by the curve with red circles in Fig. 3.25(a).
The unknown workload is the worst case, and VDD for D0 is raised to 0.85 V after 33 months of usage. In contrast, as shown by the orange line in Fig. 3.25(a), we do not need to raise VDD to achieve the target DPPM for the designs optimized for lifetime yield-per-area, since our cells are designed from the outset under the DPPM constraint to operate at the original VDD (0.7 V) throughout the lifetime.

Fig. 3.25(b) shows the corresponding power overhead, in percent, of D0 under different workloads to ensure the target DPPM. The power overhead for D0 under Gaussian 1 is shown by the curve with purple triangles in Fig. 3.25(b). The average power overhead over the 60-month lifetime is 5.75% for Gaussian 1. The curves with green crosses, red circles, and blue pluses in Fig. 3.25(b) denote the power overhead through the lifetime for the original cell design (D0) under the Gaussian 2, skew distribution, and unknown workloads, respectively, when the power supply is raised gradually in the field to combat aging. For these three workloads the power overhead is much higher: the average power overhead is 34.31% for Gaussian 2, 31.18% for the skew distribution, and 44.42% for the unknown distribution. In contrast, our design approach does not cause any power overhead, as shown by the orange line in Fig. 3.25(b).

Power overhead control is important for the 10T ST SRAM, since such cells are usually used in designs that require ultra low power consumption. Fig. 3.26(a) shows how the VDD of the 10T ST SRAM design optimized for yield-per-area at fabrication (S0) is increased gradually in the field, based on the aging degradation under different workloads, to ensure the target DPPM ≤ 50. The corresponding power overhead in percent is shown in Fig. 3.26(b). The original VDD is 0.5 V. In Fig. 3.26(a), we can see that VDD for S0 is increased to 0.54 V, 0.73 V, 0.71 V, and 0.78 V to ensure the target DPPM for Gaussian 1 (curve with purple triangles), Gaussian 2 (curve with green crosses), skew distribution (curve with red circles), and the unknown workload (curve with blue pluses), respectively. The average power overheads over the 60-month lifetime for the Gaussian 1, Gaussian 2, skew distribution, and unknown workloads are 14.64%, 105.75%, 95.93%, and 139.44%, respectively.

Figure 3.25: (a) VDD and (b) power overhead in percentage for the design optimized for yield-per-area (D0) to ensure DPPM ≤ 50 for 6T SRAMs under four workloads.

If we use a high VDD from the beginning of operation, the power overhead is even higher. We can see that our transistor sizing approach is surprisingly effective at combating aging for various workloads under tight constraints on DPPM and power.

Figure 3.26: (a) VDD and (b) power overhead in percentage for the design optimized for yield-per-area (S0) to ensure DPPM ≤ 50 for 10T ST SRAMs under four workloads.

3.6 Conclusion
In this chapter, we develop a sizing method to increase the aging resilience of SRAMs. We study the impact of aging on the stability of SRAM cells under different workloads. We identify which noise margin to increase and how to size the transistors. We start by exploring how cell sizing affects resilience to BTI aging and motivate our approach using the asymmetric fixed workload. Then we develop one method for all workloads (including the most general one, namely the unknown workload) and show that transistor sizing is surprisingly effective at combating aging.

Specifically, we propose a systematic design flow for SRAMs to combat BTI aging for any given workload. Our SRAM design is optimized for lifetime yield-per-area under a given aging quality loss target and tight constraints on power.
We demonstrate the effectiveness of our new design via extensive simulations for both 6T SRAM cells and 10T ST SRAM cells. We show that the design optimized for yield-per-area at fabrication has unacceptably high aging quality loss for all studied workloads. Our new design flow generates optimal cells that dramatically decrease aging quality loss while simultaneously improving the lifetime yield-per-area for various workloads without extra power overhead.

Table 3.2: All the transistors whose design parameters impact RNMb of a 6T FinFET SRAM cell

| Transistors having impact on RNMb | How to increase RNMb | Reason | Side effect except area overhead |
|---|---|---|---|
| AXL | Increase gate length | Increases VM of the right inverter and VTC becomes steeper | Decreases W0NM dramatically and increases access time |
| AXL | Decrease fin number | Decreases VOL of the left inverter | Decreases both WNM dramatically and increases access time significantly |
| PDL | Decrease gate length | Decreases VOL of the left inverter | Decreases RNMa |
| PDL | Increase fin number | Decreases VOL of the left inverter | Decreases W1NM |
| PDL | Increase fin thickness | Decreases VOL of the left inverter | Decreases RNMa dramatically |
| PDL | Increase fpitch | Decreases VOL of the left inverter | |
| PDL | Increase lrsd | Decreases VOL of the left inverter | |
| PDR | Increase gate length | Increases VM of the right inverter and VTC becomes steeper | Decreases RNMa and increases access time |
| PDR | Decrease fin number | Increases VM of the right inverter | Decreases RNMa dramatically and increases access time significantly |
| PUR | Decrease gate length | Increases VM of the right inverter | Decreases W1NM |
| PUR | Increase fin number | Increases VM of the right inverter | Decreases W1NM dramatically |

Table 3.3: Layout design rule for FinFET

| Parameter | 14nm PTM FinFET (nm) | Comment |
|---|---|---|
| Lg | 18 = 2λ | Gate length |
| Tfin | 10 | Body (fin) thickness |
| fpitch | 32 | Fin pitch |
| lrsd | 30 | Length of the source/drain |
| Sc | 36 = 4λ | Minimum contact size |
| Sm2m | 27 = 3λ | Minimum space between metal wires |
| Sg2c | 18 = 2λ | Minimum space between gate and contact |

Table 3.4: All the transistors whose sizing impacts RNMb of a 10T ST SRAM cell

| Transistors having impact on RNMb | How to increase RNMb | Reason | Side effect except area overhead |
|---|---|---|---|
| XL | Decrease size | Decrease VOL of the left inverter | Decrease W0NM and increase TR0 dramatically |
| NFR | Increase size | Increase VM of the right inverter | None |
| NL1 | Increase size | Decrease VOL of the left inverter | Decrease RSNM1 |
| NL2 | Increase size | Decrease VOL of the left inverter | Decrease RSNM1 |
| NR1 | Decrease size | Increase VM of the right inverter | Decrease RSNM1 and increase TR1 significantly |
| NR2 | Decrease size | Increase VM of the right inverter | Decrease RSNM1 and increase TR1 significantly |
| PR | Increase size | Increase VM of the right inverter | Decrease W1NM dramatically |

Table 3.5: Optimal 6T SRAM cell designs for various workloads

| Size for each transistor | D0 | D1 | D2 | D3 | D4 |
|---|---|---|---|---|---|
| AXL | 2λ/2λ | 2λ/2λ | 2λ/2λ | 2λ/2λ | 2λ/2λ |
| PDL | 5λ/2λ | 6λ/2λ | 7λ/2λ | 7λ/2λ | 8λ/2λ |
| PUL | 2λ/3λ | 2λ/3λ | 2λ/3λ | 2λ/3λ | 2λ/3λ |
| AXR | 2λ/2λ | 2λ/2λ | 2λ/2λ | 2λ/2λ | 2λ/2λ |
| PDR | 5λ/2λ | 6λ/2λ | 7λ/2λ | 6λ/2λ | 8λ/2λ |
| PUR | 2λ/3λ | 2λ/3λ | 2λ/3λ | 2λ/3λ | 2λ/3λ |
| Area overhead | 1 | 1.037 | 1.074 | 1.055 | 1.111 |

Table 3.6: Design parameters for optimal 10T ST SRAM cells for various workloads

| Size for each transistor | S0 | S1 | S2 | S3 | S4 |
|---|---|---|---|---|---|
| AXL | 2λ/2λ | 2λ/2λ | 2λ/2λ | 2λ/2λ | 2λ/2λ |
| NFL | 2λ/2λ | 2λ/2λ | 3λ/2λ | 2λ/2λ | 4λ/2λ |
| NL1 | 4λ/2λ | 5λ/2λ | 5λ/2λ | 5λ/2λ | 6λ/2λ |
| NL2 | 4λ/2λ | 5λ/2λ | 5λ/2λ | 5λ/2λ | 6λ/2λ |
| PUL | 2λ/4λ | 2λ/4λ | 2λ/4λ | 2λ/4λ | 2λ/4λ |
| AXR | 2λ/2λ | 2λ/2λ | 2λ/2λ | 2λ/2λ | 2λ/2λ |
| NFR | 2λ/2λ | 2λ/2λ | 3λ/2λ | 3λ/2λ | 4λ/2λ |
| NR1 | 4λ/2λ | 5λ/2λ | 5λ/2λ | 4λ/2λ | 6λ/2λ |
| NR2 | 4λ/2λ | 5λ/2λ | 5λ/2λ | 4λ/2λ | 6λ/2λ |
| PUR | 2λ/4λ | 2λ/4λ | 2λ/4λ | 2λ/4λ | 2λ/4λ |
| Area overhead | 1 | 1.027 | 1.055 | 1.027 | 1.111 |

Table 3.7: Lifetime yield-per-area and DPPM comparison for different 10T ST SRAM cells under four different workloads

| Workload | Design | Yield/area | Lifetime yield/area | DPPM | Area overhead |
|---|---|---|---|---|---|
| Gaussian 1 | S0 | 0.98196 | 0.98059 | 1393.61 | 1 |
| Gaussian 1 | S1 | 0.96900 | 0.96900 | 7.94 | 1.027 |
| Gaussian 2 | S0 | 0.98196 | 0.92454 | 58468.65 | 1 |
| Gaussian 2 | S2 | 0.94699 | 0.94698 | 10.92 | 1.055 |
| Skew distribution | S0 | 0.98196 | 0.95961 | 22757.76 | 1 |
| Skew distribution | S3 | 0.95417 | 0.95414 | 26.91 | 1.027 |
| Unknown | S0 | 0.98196 | 0.69460 | 292636.03 | 1 |
| Unknown | S4 | 0.89997 | 0.89994 | 24.71 | 1.111 |

Chapter 4
Delay degradation caused by aging in SRAM peripheral circuitry

4.1 Background and Motivation
In Chapter 3, we studied the effect of aging on SRAM cells and showed that aging degrades cell stability. We then developed an SRAM cell sizing approach to combat aging. In this chapter, we study the effect of aging on SRAM peripheral circuitry. We develop methods to design peripheral circuitry which, in conjunction with the SRAM cell design approaches in Chapter 3, maximize the aging resilience of entire SRAMs under power and performance constraints.

In Chapter 2, we proposed a general test generation approach for VIDFs in arbitrary SRAM designs. In this chapter, we consider aging during test generation. Aging causes delay degradation in the address decoder. This degradation may affect the probabilities of different types of faults and the delays of the address decoder caused by various address transitions.

Theoretically, the transistors in a circuit may have different Vth values after fabrication due to different levels of process variation. Transistors with low initial Vth age faster than those with high initial Vth, as shown in the aging model [15]. After aging, the standard deviation of the delay distributions decreases. Aging also causes delay degradation and hence increases the mean of the delay distributions. The delay of each multi-input gate for different input patterns depends on the charge stored in its internal node capacitances. Aging does not affect the value of the internal capacitances. Thus, aging may not affect test generation for VIDFs.

In this chapter, we study the delay degradation caused by aging in peripheral circuitry to determine the level of delay degradation in each part of the peripheral circuitry and to identify the key characteristics of these degradations.
We develop methods to test the entire SRAM, including peripheral circuitry, for the combination of process variations and aging-induced delay degradation. We use extensive simulations to verify that our test generation approach is effective for arbitrary SRAM designs before and after aging.

The rest of this chapter is organized as follows. In Section 4.2, we analyze the effect of BTI aging on SRAM peripheral circuitry, including the address decoder, precharge circuit, write circuit, and sense amplifiers. In Section 4.3, we quantify the delay along the critical paths of SRAMs and estimate the amount of delay degradation caused by aging for each component. In Section 4.4, we demonstrate that our test generation approach for VIDFs is effective for arbitrary SRAM designs. The upper bound on the test length is 9.5n for arbitrary designs. In Section 4.5, we analyze the power overhead of the classical approach to combating aging-induced delay degradation in peripheral circuitry. We explore a sizing approach for the address decoder and sense amplifier that ensures that the worst-case delays after aging are no larger than those of the original design before aging, so that the clock constraint is met. Finally, we present our conclusions in Section 4.6.

4.2 Aging analysis for SRAM peripheral circuit
This section presents a qualitative analysis of aging-induced degradation in peripheral circuitry. In particular, it identifies the effects of aging on delay and stability and categorizes input values in terms of their aging and healing effects. The next section quantifies the levels of delay degradation to identify the dominant components and the characteristics that can be harnessed to develop efficient test and design approaches for aging.

4.2.1 Address decoder
As mentioned in Chapter 2, large address decoders are realized using primitives such as 3-input NAND gates, 4-input NAND gates, and so on (even in large decoders, the number of inputs of a NAND gate is limited due to delay constraints).
Combinations of NAND gates and inverters are usually used to achieve minimum delays, as these are delay-efficient compared to other primitive gates. The threshold voltage of the transistors in a gate degrades under stress conditions. The overall aging degradation of a gate depends on the input patterns. For an inverter, when the input equals “0”, the PMOS is stressed and suffers aging degradation, while the NMOS is in the recovery phase. When the input equals “1”, the NMOS is stressed, while the PMOS recovers. Similarly, different input patterns cause aging degradation in different transistors of NAND gates, as shown in Table 4.1 and Table 4.2. The corresponding 2-input, 3-input, and 4-input gate structures are shown in Fig. 4.1.

As shown in Table 4.1, in NAND gates, PMOS transistors suffer more aging degradation than NMOS transistors when every input pattern has an equal probability of occurrence. Each PMOS transistor has the same probability of being under stress. Among the NMOS transistors, those closer to GND suffer more aging degradation. For example, in 3-input NAND gates, P1, P2, P3, and N3 each have a 50% probability of being under stress, N2 has a 25% probability of being under stress, and N1 has a 12.5% probability of being under stress, assuming all 8 input patterns occur with equal probability. Aging degradation in primitive gates increases the delay of the address decoder. This delay degradation is pattern dependent; hence we can estimate the delay degradation caused by aging from the address information.

Figure 4.1: (a) 2-input, (b) 3-input and (c) 4-input NAND gates

Table 4.1: Aging degradation under different input patterns for 2-input and 3-input NAND gates

| NAND2 pattern (X1X2) | Aging transistors | NAND3 pattern (X1X2X3) | Aging transistors | NAND3 pattern (X1X2X3) | Aging transistors |
|---|---|---|---|---|---|
| 00 | P1, P2 | 000 | P1, P2, P3 | 100 | P2, P3 |
| 01 | P1, N2 | 001 | P1, P2, N3 | 101 | P2, N3 |
| 10 | P2 | 010 | P1, P3 | 110 | P3 |
| 11 | N1, N2 | 011 | P1, N2, N3 | 111 | N1, N2, N3 |

Table 4.2: Aging degradation under different input patterns for 4-input NAND gates

| Pattern (X1X2X3X4) | Aging transistors | Pattern (X1X2X3X4) | Aging transistors |
|---|---|---|---|
| 0000 | P1, P2, P3, P4 | 1000 | P2, P3, P4 |
| 0001 | P1, P2, P3, N4 | 1001 | P2, P3, N4 |
| 0010 | P1, P2, P4 | 1010 | P2, P4 |
| 0011 | P1, P2, N3, N4 | 1011 | P2, N3, N4 |
| 0100 | P1, P3, P4 | 1100 | P3, P4 |
| 0101 | P1, P3, N4 | 1101 | P3, N4 |
| 0110 | P1, P4 | 1110 | P4 |
| 0111 | P1, N2, N3, N4 | 1111 | N1, N2, N3, N4 |

4.2.2 Precharge and write circuit
Fig. 4.2(a) shows a commonly used precharge circuit. When the precharge signal equals “0”, transistors M1, M2, and M3 are under stress and suffer aging degradation. The aging degradation of M1, M2, and M3 depends on the ratio of the precharge time to the read cycle. Aging degradation increases the delay of the precharge circuit. We can estimate this degradation using information about the SRAM timing control signals.

Figure 4.2: (a) Precharge circuit and (b) write circuit

A simple write circuit is shown in Fig. 4.2(b). Different transistors suffer different amounts of aging degradation, depending on the timing control signals. When Write equals “1” and Data equals “1”, N1, N2, and N4 are stressed. When Write equals “1” and Data equals “0”, N1, N2, and N3 are stressed. When Write equals “0” and Data equals “1”, N4 is stressed. When Write equals “0” and Data equals “0”, N3 is stressed. However, the asymmetric aging of N3 and N4 is not a concern, since the Data signal arrives earlier than the Write signal during write operations and hence is not on a critical path. Aging degrades the delay of the write circuit, but this degradation is not important since the write circuit delay is usually not on the critical path.
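The stress conditions listed in Tables 4.1 and 4.2 appear to follow a simple rule, which the Python sketch below makes explicit. The rule is inferred from the tables rather than stated in the text, and the function names are hypothetical: PMOS Pi is taken to be stressed whenever Xi = 0, and NMOS Ni (with Nn assumed to be the transistor next to GND) is taken to be stressed only when Xi and all inputs of the transistors below it are 1. Under the equal-probability assumption used above, enumerating all patterns yields the per-transistor stress probabilities.

```python
from itertools import product


def stressed_transistors(bits):
    """Stressed transistors of an n-input NAND gate for one input pattern.

    bits[0] is X1 and bits[-1] is Xn. Assumed rule, inferred from
    Tables 4.1 and 4.2: Pi is stressed when Xi = 0; Ni is stressed when
    Xi = 1 and every transistor between Ni and GND is also on.
    """
    n = len(bits)
    stressed = set()
    for i, bit in enumerate(bits, start=1):
        if bit == 0:
            stressed.add(f"P{i}")
    # Walk up the NMOS stack from GND; stop at the first transistor that is off.
    for i in range(n, 0, -1):
        if bits[i - 1] != 1:
            break
        stressed.add(f"N{i}")
    return stressed


def stress_probability(n_inputs):
    """Probability that each transistor is stressed when all patterns are equally likely."""
    patterns = list(product((0, 1), repeat=n_inputs))
    counts = {}
    for bits in patterns:
        for name in stressed_transistors(bits):
            counts[name] = counts.get(name, 0) + 1
    return {name: count / len(patterns) for name, count in counts.items()}
```

For a 3-input gate, `stress_probability(3)` gives 0.5 for P1, P2, P3 and N3, 0.25 for N2, and 0.125 for N1, which matches the percentages quoted in Section 4.2.1. Non-uniform address distributions would need pattern weights instead of the plain count.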
Figure 4.3: Two commonly used sense amplifiers. (a) Current controlled latch sense amplifier. (b) Latch based amplifier with pass transistors.

4.2.3 Sense amplifier
As stated in Chapter 2, we study two commonly used sense amplifiers: a current controlled latch sense amplifier [35] (SA1), shown in Fig. 4.3(a), and a latch based amplifier with pass transistors [33] (SA2), shown in Fig. 4.3(b). Different transistors in an SA suffer different amounts of aging degradation, depending on the input signals. The aging degradation conditions for SA1 are shown in Table 4.3. Table 4.4 shows the aging degradation of SA2 under various control signals.

Table 4.3: Aging degradation for SA1 under various control signals

| Read | Bit lines | Outputs | Aging transistors |
|---|---|---|---|
| Read = 0 | | Data_out = 1, Data_out_bar = 1 | P3, P4 |
| Read = 1 | BL = 0, BL_bar = 1 | Data_out = 0, Data_out_bar = 1 | N5, N4, N2, P1 |
| Read = 1 | BL = 1, BL_bar = 0 | Data_out = 1, Data_out_bar = 0 | N5, N3, N1, P2 |

Table 4.4: Aging degradation for SA2 under various control signals

| Phase | Read | Bit lines | Outputs | Aging transistors |
|---|---|---|---|---|
| Precharge and sense phase: precharge phase | Read = 0 | BL = 1, BL_bar = 1 | Data_out = 1, Data_out_bar = 1 | P3, P4 |
| Precharge and sense phase: sense phase | | BL = 1, BL_bar = 0 | Data_out = 1, Data_out_bar = 0 | P3, P1, N2 |
| Precharge and sense phase: sense phase | | BL = 0, BL_bar = 1 | Data_out = 0, Data_out_bar = 1 | P4, P2, N1 |
| Read phase | Read = 1 | | Data_out = 1, Data_out_bar = 0 | N3, P1, N2 |
| Read phase | Read = 1 | | Data_out = 0, Data_out_bar = 1 | N3, P2, N1 |

From Table 4.4, we can see that if SA2 outputs “0” with higher probability, P4, P2, and N1 suffer more aging degradation than P3, P1, and N2. However, compared to the SRAM cell, the differential aging is not significant even under the extreme probability. This is because data is not stored at the output node of SA2 for long periods of time and the transistors have longer recovery phases. In SA2, aging causes both stability and delay degradation. For SA1, as shown in Table 4.3, if the output of SA1 equals “0” with a higher probability, N4, N2, and P1 suffer more aging degradation than N3, N1, and P2.
However, due to the current control property, aging only increases the sense delay and causes no stability issue in SA1. Compared to SA2, SA1 suffers less aging degradation because the Read signal equals “0” most of the time. Only the reset transistors P3 and P4 are under stress when Read equals “0”, and aging in the reset transistors does not affect the stability or performance of SA1. Because the stress periods of the other transistors are short, SA1 suffers lower aging degradation than SA2. We can estimate the SA delay degradation caused by aging given timing information. The aging duty cycle depends on the signal probability and the ratio of the read time to the read cycle.

From the previous analysis, we can see that aging causes delay degradation in the AD, write circuit, precharge circuit, and SA1. Aging causes both delay degradation and stability degradation in SA2. The delay degradation in the write circuit is not a major concern. A sizing approach similar to that used for SRAM cells can be used to mitigate the stability degradation in SA2. However, we can simply choose SA1, or another sense amplifier that is immune to stability degradation, to avoid the stability degradation caused by aging.

4.3 Quantify delay component
Aging in the peripheral circuit causes delay degradation. We can quantify the delay of the SRAM and estimate the delay degradation of each component. To avoid failures caused by delay degradation, we can leave a significant margin in the timing control or increase VDD. We can also redesign the peripheral circuit if the timing constraint is tight. In SRAMs, the critical path is the read delay, which is divided into the AD delay, the word line delay, the cell delay for discharging the bit line, and the SA delay. Fig. 4.4 shows a timing diagram of the SRAM read operation, with all timing events listed in the figure. The word line delay is an RC delay and is not affected by BTI aging. Table 4.5 lists all the delay components that are increased by aging for an SRAM with a 3-bit address decoder.
The experiments in this chapter are conducted using the PTM 32nm library with a supply voltage VDD of 0.7 V. We can estimate the delay degradation of each component given information about the control signals, since the stress duty cycle depends on the duration of the control signals. Without loss of generality, in our experiments we assume that all addresses have equal probability. In simulation, the ratio of the precharge time to the read cycle is 25% and the ratio of the read time to the read cycle is 50%. The signal probability for the SRAM cell is 0.5. The precharge delay is small and overlaps with the AD delay. The overall read delay degradation is the sum of the AD delay degradation, the cell discharge delay degradation, and the SA delay degradation. The cell delay degradation caused by aging is small compared to that of the peripheral circuit. The cell delay is highly dependent on the strength of the access transistors in the SRAM cell, which suffer negligible aging degradation. Hence the cell delay degradation is small.

Table 4.5: Delay components of 3-bit SRAM read operation before and after 60 months aging

| 32nm PTM (0.7V) | Before aging (ns) | After aging (ns) | Increase |
|---|---|---|---|
| Precharge delay (t1-t0) | 0.0045 | 0.0051 | 13.33% |
| 3-bit AD delay (t2-t0) | 0.0522 | 0.0615 | 17.82% |
| Cell delay for discharging the bit line (t4-t3) | 0.0188 | 0.0190 | 1.05% |
| SA delay (t7-t6) | 0.0151 | 0.0165 | 9.27% |
| Overall delay affected by aging in critical path (AD, discharge, SA delay) | 0.0861 | 0.0970 | 12.66% |

Among all the delay components in the peripheral circuits, the address decoder delay dominates, even for a small decoder such as a 3-bit decoder. The address decoder also suffers the most delay degradation. Hence the decoder delay degradation is the major component of the peripheral circuitry delay degradation. In SRAMs with a 3-bit or an 8-bit address decoder, the address decoder delay and its degradation are larger than the cell delay and the corresponding cell degradation. For an SRAM with a large 12-bit decoder, the cell delay is larger than the decoder delay. However, the absolute value of the cell delay degradation is still smaller than the address decoder delay degradation, since the cell delay degradation is small. Hence the decoder delay degradation is the major component of the overall read delay degradation in SRAMs of typical sizes. Fig. 4.5 shows the aging-induced read delay degradation of an SRAM with an 8-bit address decoder through its lifetime. If the timing constraint is tight, we can redesign the address decoder to reduce the delay. We can also either use a higher VDD from the beginning of operation or increase VDD gradually over time in the field to avoid failures caused by delay degradation.

Figure 4.4: Timing diagram of SRAM read operation

4.4 Process variation-induced delay test of SRAMs with aging
In this section, we consider aging during test generation. From the above aging analysis, we know that aging causes delay degradation and may cause failures in SRAMs. It is necessary to apply tests throughout the lifetime of an SRAM to decide whether aging-induced delay degradation is causing erroneous operation, so as to either take remedial action (e.g., increase VDD) or stop using the chip.

Figure 4.5: Read delay degradation of an SRAM with 8-bit address decoder caused by aging through lifetime

Delay degradation in the SRAM critical path is dominated by the degradation of the address decoder delay in SRAMs of typical sizes. The aging-induced delay degradation in the decoder is pattern dependent, while the aging-induced delay degradation in other peripheral circuits is not. Non-pattern-dependent delay degradation will be detected by existing tests; hence we focus on the degradation in decoders. Delay degradation in the decoder may affect the probabilities of different types of faults and the delays of the address decoder caused by various address transitions.
However, without aging, the differences between the delays of different paths depend strongly on the charge stored on the internal node capacitances. Aging does not affect the values of the internal capacitances, so the differences between the delays of different paths after aging may be close to those without aging. Thus, aging may not affect test generation for variation-induced delay faults (VIDFs). In this section, we analyze the decoder delay for different paths after aging to verify this speculation.

We observe that the aging-induced delay degradation in the decoder does not change the path selection. Our method for generating O(n) delay tests for SRAMs in Chapter 2 can therefore be used directly to test SRAMs with aging. In Chapter 2, all experiments were conducted on an industrial 65nm process. Our test generation approach is general and can be applied to various technologies. In this section, we develop linear tests targeting VIDFs for the 32nm PTM library. Because the circuit delay and the level of process variation depend on the technology, a different technology may result in different tests.

Different input patterns trigger different address decoder delays. The worst-case delay is invoked by specific two-pattern address sequences and can be characterized based on the decoder structure. The key step in the test generation presented in Chapter 2 is the characterization of the relationship between two-pattern address sequences and delays, and the identification of the address transitions that trigger the worst-case deactivation and activation delays at the decoder outputs. We can then generate the shortest test that covers all these address transitions and hence all target VIDFs. The address sequences that trigger the worst-case delay may differ between technologies. We speculate that aging degradation does not affect the address sequences that trigger the worst-case delay.
4.4.1 Address decoder delay analysis

To verify that aging-induced delay degradation in the address decoder does not change the path selection in test generation, we first analyze the address decoder delay for the 32nm process before and after aging.

As mentioned in Section 2.5.2, large address decoders almost always use a pre-decoding structure to reduce the transistor count and critical path delay [33]. Thus, in this dissertation, we focus on large address decoders that are pre-decoded. For delay efficiency, we use inverters and 2-input, 3-input, and 4-input NAND gates to build large decoders. For example, a 12-bit address decoder is typically designed using 3-input and 4-input NAND gates and inverters. The 12-bit address is divided into three groups. Each set of four primary inputs is pre-decoded, and the results are then combined through 3-input NAND gates. In general, for n-bit decoders (n ≤ 16), we use two-stage pre-decoding structures, i.e., each set of k (k ≤ 4) primary inputs is pre-decoded and then combined through m-input NAND gates (m = n/k, m ≤ 4). For decoders with more than 16 inputs, a pre-decoding structure with more than two stages should be used for delay efficiency. However, decoders with more than 16 inputs are not commonly used [51]. Thus we focus on the analysis of decoders with no more than two stages.

Address transitions that trigger the maximum gate delay at every gate on the critical path invoke the maximum path delay, since delay is additive along the path. Along the path, the worst-case gate delay must be considered only for gates that have two or more inputs, i.e., the NAND gates in decoders. The delays of NAND gates depend strongly on the input transitions. To identify the address transitions that invoke the worst-case path delay, we need to examine the worst-case propagation delay of NAND gates.

4.4.1.1 NAND gate delay analysis

In Section 2.5.2.1, we presented the delay analysis for 2-input, 3-input, and 4-input NAND gates for the 65nm process without aging.
In this section, we revisit this analysis for NAND gates in the 32nm process before and after aging. The NAND gate structures are the same as in Section 2.5.2.1 and are shown in Fig. 4.1. Fig. 4.6 shows the T_PHL and T_PLH probability density functions (PDFs) of balanced 2-input, 3-input, and 4-input NAND gates for the 32nm process without aging. All PDFs in this chapter are obtained from simulations of 1000 Monte Carlo instances. Fig. 4.7 shows the T_PHL and T_PLH PDFs of balanced 2-input, 3-input, and 4-input NAND gates for the 32nm process with aging. Comparing Fig. 4.6 and Fig. 4.7, we see that aging degradation does not affect the path selection in NAND gates. For NAND gates with aging degradation, the worst-case T_PHL and T_PLH are triggered by the same input transitions as in the case without aging. The delay distribution with aging is very similar to that without aging, but shifted to the right, i.e., toward larger delay values.

The input transitions triggering the maximum T_PHL and T_PLH under various transistor sizes are listed in Table 4.6. The worst-case T_PLH is always invoked by flipping the single input that drives the NMOS transistor closest to GND, in which case only one PMOS transistor turns on and charges the load capacitance along with all of the internal capacitances. The situation is more complex for T_PHL; the reason is analyzed in Section 2.5.2.1. Briefly, the Miller effect and the discharge of the internal capacitances have opposite effects on T_PHL. The larger the PMOS size, the more inputs need to be flipped to invoke the maximum T_PHL. For certain designs, e.g., NAND2 with a PMOS/NMOS size ratio of 1.5, the worst-case and the second worst-case T_PHL are close, and we select the input transitions that trigger both delays.

4.4.1.2 Address sequence selection for small decoders

The structure of a small address decoder is straightforward. Fig. 2.4(b) shows a 3-bit address decoder.
The critical path delay is the sum of the worst-case delays of the primitive gates on the path. For the 3-bit decoder, only the NAND3 gates on the path have a worst-case delay. Address transitions triggering the worst-case deactivation delay and the worst-case activation delay of the decoder outputs correspond to T_PLH and T_PHL of NAND3, respectively.

Figure 4.6: T_PHL and T_PLH distributions of (a)-(b) 2-input NAND gates, (c)-(d) 3-input NAND gates, and (e)-(f) 4-input NAND gates without aging

Figure 4.7: T_PHL and T_PLH distributions of (a)-(b) 2-input NAND gates, (c)-(d) 3-input NAND gates, and (e)-(f) 4-input NAND gates after 60 months aging

From Section 4.4.1.1, we know that the worst-case path selection in NAND gates is not affected by aging degradation. Thus the address transitions triggering the worst-case delay in decoders are also not affected by aging, and we can derive the address transitions triggering WDeactD and WActD (recall that WDeactD and WActD are the worst-case deactivation delay and the worst-case activation delay, respectively) from the input transitions of NAND3. Table 4.7 shows the address transitions invoking the worst-case delay for the 3-bit decoders shown in Fig. 2.4(b) before and after aging. The gates in the decoder are sized to achieve minimum delay using logical effort. The inverters are balanced in practice. Two NAND gate size ratios are shown. Balanced NAND gates are commonly used in practice. We also use another case, Wp/Wn = 2, to demonstrate test generation; this case is more complicated because the worst-case and the second worst-case activation delays are close and must be considered at the same time.
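The derivation of decoder address transitions from NAND3 input transitions can be sketched as bit flips on the 3-bit address. The sketch below assumes that X3 corresponds to the address MSB, so that the NAND3 transition 111-011 becomes a flip of the MSB, 001-111 a flip of the two most significant bits, and 000-111 a flip of all three bits. The dictionary keys and function names are illustrative.

```python
WIDTH = 3  # number of address bits of the small decoder

# Bits to flip (as XOR masks) for each worst-case delay, per NAND3 sizing.
# "balanced" is the Wp/Wn = 0.67 design, "wp_wn_2" is the Wp/Wn = 2 design.
FLIP_MASKS = {
    "balanced": {"WDeactD": [0b100], "WActD": [0b110]},
    "wp_wn_2": {"WDeactD": [0b100], "WActD": [0b110, 0b111]},
}


def destination_addresses(source, masks, width=WIDTH):
    """Destination addresses reached from `source` by flipping each mask."""
    return [format(source ^ mask, f"0{width}b") for mask in masks]


def transition_table(design, width=WIDTH):
    """Map every source address to its destination addresses per delay type."""
    table = {}
    for source in range(1 << width):
        key = format(source, f"0{width}b")
        table[key] = {
            delay: destination_addresses(source, masks, width)
            for delay, masks in FLIP_MASKS[design].items()
        }
    return table


for source, row in transition_table("wp_wn_2").items():
    print(source, "WDeactD ->", row["WDeactD"], "WActD ->", row["WActD"])
```

For source address 000 and the Wp/Wn = 2 design, this yields destination 100 for WDeactD and destinations 110 and 111 for WActD.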
Table 4.6: Input transitions triggering the worst-case T_PHL and T_PLH for various NAND gates without and with aging (T_PHL columns are labeled by PMOS/NMOS size ratio)

| Gate | T_PHL (2.25) | T_PHL (2) | T_PHL (1.5) | T_PHL (1) | T_PHL (0.67) | T_PHL (0.5) | T_PLH (any size) |
|---|---|---|---|---|---|---|---|
| NAND2 (X2X1) | 00-11 | 00-11 | 00-11 & 01-11 | 01-11 | 01-11 | 01-11 | 11-01 |
| NAND3 (X3X2X1) | 001-111 & 000-111 | 001-111 & 000-111 | 001-111 | 001-111 | 001-111 | 001-111 | 111-011 |
| NAND4 (X4X3X2X1) | 0001-1111 | 0001-1111 | 0001-1111 | 0011-1111 & 0001-1111 | 0011-1111 | 0011-1111 | 1111-0111 |

Fig. 4.8 shows the PDFs of the activation delay and the deactivation delay of the 3-bit address decoder, which support the claim that the critical path of the decoder can be derived from the critical paths of the NAND gates. For a 3-bit address decoder with balanced NAND3 (decoder 1), the worst-case deactivation delay is larger than the activation delay. The address transitions invoking WDeactD are of interest in both cases, i.e., before and after aging. Thus new tests for decoder 1 should cover all address transitions with the MSB flipped. For a 3-bit address decoder with Wp/Wn = 2 NAND3 (decoder 2), the worst-case activation delay is larger than the deactivation delay. The address transitions invoking both WActD and WDeactD are of interest before and after aging. The address transitions with all three bits flipped and with the two most significant bits flipped trigger WActD and the second WActD for decoder 2, and these two delays are very close. Thus new tests for decoder 2 should cover all address transitions with the MSB flipped, the two most significant bits flipped, and all three bits flipped. Fig. 4.10 shows the shortest address sequences (follow the edges in the order of their label values) covering all address transitions required for 3-bit address decoders with balanced NAND3 and with Wp/Wn = 2 NAND3. The address transitions labeled by blue numbers invoke WActD, and the transitions labeled by black numbers invoke WDeactD. The transitions labeled by grey numbers are used only for connection. New tests to detect VIDFs use these address sequences for efficiency.

Table 4.7: Address transitions invoking the worst-case delay for 3-bit decoders with different transistor size ratios in NAND gates before and after 60 months aging (entries are destination addresses)

| Source address | Wp/Wn=2: WDeactD | Wp/Wn=2: WActD | Wp/Wn=0.67 (balanced): WDeactD | Wp/Wn=0.67 (balanced): WActD |
|---|---|---|---|---|
| 000 | 100 | 110, 111 | 100 | 110 |
| 001 | 101 | 111, 110 | 101 | 111 |
| 010 | 110 | 100, 101 | 110 | 100 |
| 011 | 111 | 101, 100 | 111 | 101 |
| 100 | 000 | 010, 011 | 000 | 010 |
| 101 | 001 | 011, 010 | 001 | 011 |
| 110 | 010 | 000, 001 | 010 | 000 |
| 111 | 011 | 001, 000 | 011 | 001 |

Figure 4.8: The distributions of (a) activation delay and (b) deactivation delay of 3-bit address decoder with balanced NAND3 before aging

Figure 4.9: The distributions of (a) activation delay and (b) deactivation delay of 3-bit address decoder with Wp/Wn=2 NAND3 before aging

4.4.1.3 Address sequence selection for large pre-decoded decoders

As stated in Section 2.5.2, large address decoders are pre-decoded for delay efficiency. For n-bit decoders (n ≤ 16), we use two-stage pre-decoding structures, i.e., each set of k (k ≤ 4) primary inputs is pre-decoded and then combined through m-input NAND gates (m = n/k, m ≤ 4). Address transitions that trigger the maximum gate delay at every gate on the critical path invoke the maximum path delay, since delay is additive along the path. The PDFs of the worst-case deactivation and activation delays of larger pre-decoded decoders can also be derived from the PDFs of the delays of the primitive gates along the path. Further, we can identify the address sequences triggering the worst-case delay.

Take the 16-bit address decoder shown in Fig. 4.11 as an example. It uses a two-stage pre-decoded structure. Each set of four address bits is connected to a 4-input NAND gate, and the results are then combined through 4-input NAND gates. The worst-case T_PHL is invoked by flipping X4X3, and the worst-case T_PLH is invoked by flipping X4.
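For a two-stage pre-decoded decoder, the address bits to flip follow from composing the worst-case input selection of the second-stage NAND gate with that of the first-stage NAND gates. The following is a minimal sketch for the 16-bit example. It assumes that the most significant address group drives input X4 of the second-stage gate and that the most significant bit of each group drives X4 of its first-stage gate; the function names and parameters are illustrative.

```python
def flip_mask(n_bits, group_size, groups_selected, bits_per_group):
    """Return the address bits to flip as an integer mask.

    groups_selected: number of address groups, counted from the most
        significant group, that feed the flipped second-stage inputs.
    bits_per_group: number of bits, counted from the MSB of each selected
        group, that are flipped at the first-stage gate.
    """
    if n_bits % group_size != 0:
        raise ValueError("n_bits must be a multiple of group_size")
    n_groups = n_bits // group_size
    if not 0 < groups_selected <= n_groups:
        raise ValueError("groups_selected out of range")
    if not 0 < bits_per_group <= group_size:
        raise ValueError("bits_per_group out of range")

    mask = 0
    for g in range(groups_selected):
        group_msb = n_bits - 1 - g * group_size
        for b in range(bits_per_group):
            mask |= 1 << (group_msb - b)
    return mask


def flipped_bits(mask):
    """List the flipped address bits by name, most significant first."""
    return [f"A[{i}]" for i in reversed(range(mask.bit_length())) if (mask >> i) & 1]


# Worst-case T_PHL flips X4X3 (two inputs); worst-case T_PLH flips X4 (one input).
activation_mask = flip_mask(16, 4, groups_selected=2, bits_per_group=2)
deactivation_mask = flip_mask(16, 4, groups_selected=1, bits_per_group=1)

print(flipped_bits(activation_mask))
print(flipped_bits(deactivation_mask))
```

Under these assumptions the activation mask selects A[15], A[14], A[11], and A[10], and the deactivation mask selects A[15] only.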
Thus, to invoke the worst-case activation delay of the decoder, in stage 2 the two groups connected to X4 and X3 of the second-stage 4-input NAND gates are selected, i.e., group 3 (A[11]-A[8]) and group 4 (A[15]-A[12]), and in stage 1 the MSB and the second MSB in each of these groups are flipped, i.e., A[15]A[14] and A[11]A[10]. To invoke WDeactD of the path, group 4 is selected, and in group 4 the MSB, i.e., A[15], is flipped.

Figure 4.10: (a) Address sequences invoking WDeactDs for address decoders with Wp/Wn = 0.67 (balanced), (b) Address sequences invoking WDeactDs and WActDs for address decoders with Wp/Wn = 2

Figure 4.11: A pre-decoded 16-bit address decoder. (a) A pre-decoded 16-bit address decoder, (b) pre-decoded 16-bit address decoder block diagram

The distributions of the delay values of primitive gates can be modeled by Gaussian distributions. We assume that all delay distributions are independent. Thus the sum of two delays is again Gaussian, with its mean equal to the sum of the two means and its variance equal to the sum of the two variances, i.e., $\mu = \mu_1 + \mu_2$ and $\sigma^2 = \sigma_1^2 + \sigma_2^2$. Fig. 4.13 shows the sums of the delay distributions of NAND2 and NAND3. In Fig. 4.13(a), NAND2_1 denotes the maximum T_PHL of NAND2, triggered by MSB flipping, and NAND2_2 denotes the second worst-case T_PHL of NAND2, triggered by X2X1 flipping. The blue line labeled NAND2_1+NAND3_1 is the sum of the worst-case T_PHL of NAND2 and the worst-case T_PHL of NAND3, which is also the worst-case T_PHL after summation (the mean of this distribution is μ_max and its standard deviation (Std) is σ). The green line labeled NAND2_1+NAND3_2 is the sum of the worst-case T_PHL of NAND2 and the second worst-case T_PHL of NAND3. The red line labeled NAND2_2+NAND3_1 is the sum of the second worst-case T_PHL of NAND2 and the worst-case T_PHL of NAND3.
The purple line labeled NAND2_2+NAND3_2 is the sum of the second worst-case T_PHL of NAND2 and the second worst-case T_PHL of NAND3. The dashed black vertical bar indicates the arrival time of WL_EN, i.e., μ_max + 3σ.

Figure 4.12: The probability that a fault can be captured by G2 but cannot be captured by G1

We first analyze the probability of detecting the target faults, i.e., the fault coverage, using the address sequences triggering the worst-case delay. G1 and G2 are two Gaussian distributions, where the mean μ1 of G1 is larger than the mean μ2 of G2, as shown in Fig. 4.12. Let T be the arrival time of the WL_EN pulse. The width of the WL_EN pulse is equal to the minimum sense time that ensures a correct read operation. Any delay larger than the arrival time of WL_EN causes a fault. The probability that a fault can be captured by G2 but cannot be captured by G1 is

$$P(G_1 < T < G_2) = \int_{T}^{\infty} f_{G_2}(t)\,dt \cdot \int_{-\infty}^{T} f_{G_1}(t)\,dt,$$

where $f_{G_1}(t)$ is the PDF of the worst-case delay whose address transitions are covered in the test, and $f_{G_2}(t)$ is the PDF of one of the remaining delays whose address transitions are not covered in the test. Thus, for an n-bit address decoder, the fault coverage can be computed as

$$P_{\text{fault-coverage}} = \prod_{i=2}^{N} \left[1 - P(G_1 < T < G_i)\right],$$

where $N = 2^n$, the address transitions invoking G1 are covered in the test, and the address transitions invoking G2 to GN are not covered.

In Fig. 4.13(a), we can see that the mean difference between the blue line and the green line is equal to the mean difference between the worst-case T_PHL and the second worst-case T_PHL of NAND3. The standard deviations of both lines increase slightly. The increases are close and are dominated by the larger deviation ($\sigma = \sqrt{\sigma_1^2 + \sigma_2^2}$). The larger delay tends to have a larger deviation. If we set the arrival time of WL_EN to μ_max + 3σ, the fault coverage is very close to 100%, as shown in Fig. 4.13.
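The Gaussian path-delay sum and the fault coverage expression can be evaluated numerically as in the sketch below. The gate delay means and standard deviations are hypothetical placeholders, not simulation results from this chapter, and only three uncovered paths are included for brevity; the function names are illustrative.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def add_independent(*gaussians):
    """Sum of independent Gaussians given as (mean, std): means and variances add."""
    mean = sum(m for m, _ in gaussians)
    std = math.sqrt(sum(s * s for _, s in gaussians))
    return mean, std


def escape_probability(covered, uncovered, t):
    """P(G1 < T < G2): the uncovered delay exceeds T while the covered one does not."""
    p_uncovered_fails = 1.0 - normal_cdf(t, *uncovered)
    p_covered_passes = normal_cdf(t, *covered)
    return p_uncovered_fails * p_covered_passes


def fault_coverage(covered, uncovered_paths, t):
    """Product over uncovered paths of [1 - P(G1 < T < Gi)]."""
    coverage = 1.0
    for path in uncovered_paths:
        coverage *= 1.0 - escape_probability(covered, path, t)
    return coverage


# Hypothetical (mean, std) gate delays in ps, chosen only for illustration.
NAND2_1, NAND2_2 = (10.0, 0.50), (9.0, 0.45)   # worst and second worst T_PHL
NAND3_1, NAND3_2 = (14.0, 0.70), (13.0, 0.65)  # worst and second worst T_PHL

worst_path = add_independent(NAND2_1, NAND3_1)
other_paths = [
    add_independent(NAND2_1, NAND3_2),
    add_independent(NAND2_2, NAND3_1),
    add_independent(NAND2_2, NAND3_2),
]

# WL_EN arrival time set to mu_max + 3 * sigma of the worst-case path.
wl_en_arrival = worst_path[0] + 3.0 * worst_path[1]
coverage = fault_coverage(worst_path, other_paths, wl_en_arrival)
print(f"T = {wl_en_arrival:.3f} ps, fault coverage = {coverage:.6f}")
```

With these placeholder values the uncovered paths sit more than four standard deviations below T, so the computed coverage is very close to 1, which mirrors the qualitative argument in the text.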
For large decoders, the PDF of the decoder delay can be derived from the sum of the PDFs of the primitive gates along the path, i.e., inverters and NAND gates. The PDF of the inverter delay is the same for all paths. For the sum of the delays, the mean difference between the worst-case delay and the second worst-case delay equals the smallest mean difference among the NAND gates on the path, while the standard deviation of the sum is larger than that of a single gate. The probability that a fault can be captured by G2 but cannot be captured by G1 is very close to 0 for every NAND gate on the path; thus this probability is also close to 0 for the decoder if we set a proper WL_EN arrival time, where G1 corresponds to the address transitions covered in the tests and G2 corresponds to one of the remaining delays whose address transitions are not covered. Only the tail of the PDF affects the fault coverage (T→∞). Fig. 4.14 shows the sums of the delay distributions of two 4-input NAND gates to support this analysis, since 16-bit decoders (the largest two-stage pre-decoded decoders) use 4-input NAND gates in both stages. The fault coverage is also very close to 100%, as shown in Fig. 4.14, when we set the arrival time of WL_EN to μ_max + 3σ.

To support the above analysis, Fig. 4.15 shows the PDFs of the deactivation delay of a 6-bit address decoder, generated directly from Monte Carlo simulations before and after aging. The worst-case deactivation delay is triggered by the address transitions with only the MSB flipped. The delay distribution after aging is similar to that before aging, with a shift to the right. The test containing the address transitions with the MSB flipped can achieve close to 100% fault coverage for deactivation-related faults both before and after aging when we set a proper arrival time of WL_EN.

Figure 4.13: The sum of (a) T_PHL and (b) T_PLH distributions of NAND2 and NAND3.

Figure 4.14: The sum of (a) T_PHL and (b) T_PLH distributions of two NAND4 gates.
4.4.2 Test generation for VIDF

We demonstrate how to generate efficient tests for VIDFs given an SRAM design. After analyzing the delay of each primitive gate in the address decoder and deriving the PDFs of the worst-case delays, we identify the address sequences that maximize the probability of detecting all types of target faults. We then find the shortest path that covers all required two-pattern address transitions. All experiments in this chapter are based on the 32nm PTM library. We set a proper arrival time of WL_EN for each decoder; for the case with aging, the arrival time of WL_EN is increased based on the aging degradation value.

Figure 4.15: The distributions of deactivation delay of 6-bit address decoder with balanced NAND3 and balanced NAND2 (a) before aging and (b) after 60 months aging.

Table 4.8 shows the new tests generated by our approach for VIDFs for decoder 1 and decoder 2 described in Section 4.4.1.2. For decoder 1, the deactivation delay is larger than the activation delay. Thus, the test must contain all address transitions invoking WDeactD, as shown in Table 4.8, variation test 2 (VTest2).
For decoder 2, the activation delay is larger than the deactivation delay. The worst-case activation delay and the second worst-case activation delay are close and need to be captured together. Thus, the test must contain all address transitions invoking WDeactD and WActD, as shown in Table 4.8, variation test 1 (VTest1).

Table 4.8: New proposed test algorithms for different designs targeting variation-induced delay faults (m is the number of address bits and n is the number of word lines of a memory)

| # | Test length | Targeting faults | Description |
|---|---|---|---|
| VTest1 | 9.5n | All three VIDFs | $\{\Downarrow_{n/2}^{n-1}(w1);\ \Downarrow_{0}^{n/2-1}(w0);\ \Uparrow_{v=0}^{n/4-1}(r0_{v}, r1_{v\oplus 2^{m}\oplus 2^{m-1}}, r0_{v\oplus 2^{m-1}}, r1_{v\oplus 2^{m}}, r0_{v}, r1_{v\oplus 2^{m}}, r0_{v\oplus 2^{m-1}}, r1_{v\oplus 2^{m}\oplus 2^{m-1}}, r0_{v});\ \Uparrow_{v=0}^{n/2-1}(r0_{v}, r1_{v\oplus 2^{m}\oplus 2^{m-1}\oplus 2^{m-2}}, r0_{v});\ \Uparrow_{0}^{n/2-1}(w1);\ \Uparrow_{n/2}^{n-1}(w0);\ \Downarrow_{v=0}^{n/4-1}(r1_{v}, r0_{v\oplus 2^{m}\oplus 2^{m-1}}, r1_{v\oplus 2^{m-1}}, r0_{v\oplus 2^{m}}, r1_{v}, r0_{v\oplus 2^{m}}, r1_{v\oplus 2^{m-1}}, r0_{v\oplus 2^{m}\oplus 2^{m-1}}, r1_{v});\ \Downarrow_{v=0}^{n/2-1}(r1_{v}, r0_{v\oplus 2^{m}\oplus 2^{m-1}\oplus 2^{m-2}}, r1_{v})\}$ |
| VTest2 | 5n | Fault 1 and Fault 2 | $\{\Downarrow_{n/2}^{n-1}(w1);\ \Downarrow_{0}^{n/2-1}(w0);\ \Uparrow_{v=0}^{n/2-1}(r0_{v}, r1_{v\oplus 2^{m}}, r0_{v});\ \Uparrow_{0}^{n/2-1}(w1);\ \Uparrow_{n/2}^{n-1}(w0);\ \Downarrow_{v=0}^{n/2-1}(r1_{v}, r0_{v\oplus 2^{m}}, r1_{v})\}$ |

* For SRAMs with pre-decoding address decoders, the address transition selections are based on the pre-decoding structures as mentioned in Section 4.4.1.3.

Our new tests (VTest1 and VTest2), WT, WCGD, and GALPAT are evaluated in terms of test length and coverage for VIDFs. 1000 Monte Carlo instances are generated, and each instance represents an SRAM module with intra-die and inter-die process variation. For the cases with aging, the Vth degradation from the aging model [15] is added. In our simulation, we use the same initial Vth for all transistors when estimating aging degradation; thus our results are conservative. Table 4.9 shows the number of SRAM instances that have VIDFs and are captured by each test for SRAMs with a 3-bit address decoder before and after aging.
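The structure of VTest2 can be illustrated by generating its operation sequence and checking its length. The sketch below is one possible reading of the march notation in Table 4.8: it assumes that n is a power of two and that the address offset in the read elements flips the MSB of the word-line address (implemented as XOR with n/2), consistent with the requirement that the test cover all address transitions with the MSB flipped. The function names are illustrative.

```python
def vtest2(n):
    """Return VTest2 as a list of (operation, address) pairs for n word lines."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    msb = n // 2  # XOR with this value flips the address MSB (assumption)
    ops = []
    # Initialization: upper half holds 1, lower half holds 0.
    ops += [("w1", a) for a in range(n - 1, msb - 1, -1)]
    ops += [("w0", a) for a in range(msb - 1, -1, -1)]
    # v -> v^MSB -> v exercises the MSB flip in both directions.
    for v in range(msb):
        ops += [("r0", v), ("r1", v ^ msb), ("r0", v)]
    # Complement the background and repeat in descending order.
    ops += [("w1", a) for a in range(msb)]
    ops += [("w0", a) for a in range(msb, n)]
    for v in range(msb - 1, -1, -1):
        ops += [("r1", v), ("r0", v ^ msb), ("r1", v)]
    return ops


def reads_match_fault_free_memory(ops, n):
    """Check that every read expects the value a fault-free memory would hold."""
    memory = [None] * n
    for op, address in ops:
        kind, value = op[0], int(op[1])
        if kind == "w":
            memory[address] = value
        elif memory[address] != value:
            return False
    return True


def msb_flip_transitions_covered(ops, n):
    """Check that every address a is followed by a^MSB in back-to-back reads."""
    msb = n // 2
    seen = set()
    for (op_a, a), (op_b, b) in zip(ops, ops[1:]):
        if op_a[0] == "r" and op_b[0] == "r" and a ^ b == msb:
            seen.add((a, b))
    return all((a, a ^ msb) in seen for a in range(n))


sequence = vtest2(8)
print(len(sequence), "operations for n = 8")
```

For n = 8 the sequence has 40 operations, i.e., 5n, and every one of the 8 addresses appears as the source of an MSB-flipping read transition.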
The simulation results for 6-bit and 8-bit address decoders are shown in Table 4.10. We present the decoder designs whose output activation delays are larger than their deactivation delays. For the 6-bit and 8-bit decoders, 3-input NAND gates and 4-input NAND gates are used in the pre-decoder, respectively, and 2-input NAND gates are used in the second stage. The address transitions with the MSB flipped invoke WDeactD. The address transitions with the two MSBs flipped and with the three MSBs flipped invoke WActD. VTest1 is used to capture all three faults.

Our new linear tests for VIDFs achieve nearly perfect fault coverage before and after aging, i.e., nearly the golden coverage provided by GALPAT. WT and WCGD are effective when deactivation-delay-related faults, i.e., fault 1 and fault 2, are dominant. However, for designs with large activation delays, WT and WCGD cannot achieve good fault coverage. Our new tests, with a maximum length of 9.5n, can achieve nearly 100% fault coverage for VIDFs. For designs that contain only faults related to deactivation delays, the test length is bounded by 5n. For designs that contain faults related to both activation and deactivation delays, the test length is bounded by 9.5n.

T_PHL of a NAND gate is affected by the Miller effect and the internal node charge. These two factors have opposite effects on T_PHL, and the relative values of the internal node capacitances and the output capacitance determine which factor dominates. The larger the PMOS size, the more inputs need to be flipped to invoke the maximum T_PHL. Thus, in some cases we need to capture both the worst-case and the second worst-case delays if the two are close; however, the third worst-case delay is always far enough from these two.
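The coverage comparison can be made concrete by expressing the counts reported in Tables 4.9 and 4.10 relative to GALPAT, which serves as the golden reference. The following is a minimal sketch; the dictionary layout and function name are illustrative.

```python
# Number of faulty chip instances captured, from Tables 4.9 and 4.10.
# Key: (design, condition); value: counts for the new test, WT, WCGD, GALPAT.
CAPTURED = {
    ("Decoder1", "before"): {"new": 157, "WT": 157, "WCGD": 157, "GALPAT": 158},
    ("Decoder1", "after"): {"new": 212, "WT": 212, "WCGD": 213, "GALPAT": 213},
    ("Decoder2", "before"): {"new": 297, "WT": 183, "WCGD": 194, "GALPAT": 297},
    ("Decoder2", "after"): {"new": 332, "WT": 201, "WCGD": 217, "GALPAT": 333},
    ("6 bit", "before"): {"new": 229, "WT": 163, "WCGD": 169, "GALPAT": 230},
    ("6 bit", "after"): {"new": 215, "WT": 157, "WCGD": 166, "GALPAT": 216},
    ("8 bit", "before"): {"new": 198, "WT": 139, "WCGD": 145, "GALPAT": 198},
    ("8 bit", "after"): {"new": 234, "WT": 178, "WCGD": 183, "GALPAT": 235},
}


def relative_coverage(counts):
    """Coverage of each test in percent of the faults captured by GALPAT."""
    golden = counts["GALPAT"]
    return {test: 100.0 * n / golden for test, n in counts.items() if test != "GALPAT"}


RELATIVE = {key: relative_coverage(counts) for key, counts in CAPTURED.items()}

for (design, condition), coverage in RELATIVE.items():
    summary = ", ".join(f"{test} {value:.1f}%" for test, value in coverage.items())
    print(f"{design:9s} {condition:6s}: {summary}")

lowest_new_test_coverage = min(cov["new"] for cov in RELATIVE.values())
```

On these counts the new tests stay above 99% of the GALPAT coverage in every case, while WT captures about 62% of the GALPAT-detected faults for Decoder2 before aging.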
The worst situation for VIDF test generation is to capture the worst-case deactivation delay, the worst-case activation delay, and the second worst-case activation delay when these delays are triggered by different two-pattern address sequences. The test length for VIDFs is therefore bounded by 9.5n for arbitrary designs.

Table 4.9: Number of failing chip instances captured by different tests for SRAMs with 3-bit address decoder before and after aging

| SRAMs | Before aging: New tests for VIDF | Before aging: WT | Before aging: WCGD | Before aging: GALPAT | After aging: New tests for VIDF | After aging: WT | After aging: WCGD | After aging: GALPAT |
|---|---|---|---|---|---|---|---|---|
| Decoder1 | 157 (VTest2) | 157 | 157 | 158 | 212 (VTest2) | 212 | 213 | 213 |
| Decoder2 | 297 (VTest1) | 183 | 194 | 297 | 332 (VTest1) | 201 | 217 | 333 |

Table 4.10: Number of failing chip instances captured by different tests for SRAMs with pre-decoded address decoder before and after aging

| Memory size in terms of row address bits | Before aging: VTest1 | Before aging: WT | Before aging: WCGD | Before aging: GALPAT | After aging: VTest1 | After aging: WT | After aging: WCGD | After aging: GALPAT |
|---|---|---|---|---|---|---|---|---|
| 6 bit | 229 | 163 | 169 | 230 | 215 | 157 | 166 | 216 |
| 8 bit | 198 | 139 | 145 | 198 | 234 | 178 | 183 | 235 |

4.5 Peripheral circuits redesign to combat aging degradation

In this section, we explore design methods for the peripheral circuitry that maximize the aging resilience of the entire SRAM under the given constraints. We analyze the power overhead of the classical approach to combating aging-induced delay degradation in the peripheral circuitry. We explore a sizing approach for the address decoder and sense amplifier to ensure that the worst-case delays after aging are no larger than those of the original design before aging, so that the clock constraint is met.

4.5.1 Power overhead analysis for the classical approach for peripheral circuits

As shown earlier in this chapter, aging causes delay degradation in the peripheral circuitry of SRAM.
For a tight clock constraint, we can either use a higher VDD from the beginning of the operational life of the chip or increase VDD gradually in the field to avoid failures caused by delay degradation. However, in either case, this approach leads to power overhead.

Fig. 4.16(a) shows how VDD of an SRAM with a 3-bit address decoder, an SRAM with an 8-bit address decoder, and an SRAM with a 12-bit address decoder is raised gradually in the field, based on the delay degradation caused by aging, to ensure that the critical path delay meets the clock constraint. The original VDD is 0.7V. VDD needs to be raised after a short period of use due to aging degradation. The delay of a large decoder is higher than that of a small decoder. The cell delay for discharging the bit line also increases with the number of word lines, i.e., the number of decoder outputs, since more cells are connected to a bit line and more capacitance must be discharged. The cell delay increases much faster than the decoder delay. Thus, compared to an SRAM with a small decoder, the decoder delay constitutes a smaller percentage of the overall read delay in an SRAM with a large decoder. The cell delay degradation caused by aging is small, and the decoder delay degradation is the major component of the overall read delay degradation. Therefore, an SRAM with a large decoder suffers less read delay degradation due to aging, and an SRAM with a small decoder requires a higher VDD to satisfy the delay constraint. As shown in Fig. 4.16(a), over 60 months of aging, VDD for the 3-bit SRAM is raised gradually to 0.78V (curve with blue circles), VDD for the 8-bit SRAM is raised gradually to 0.74V (curve with red pluses), and VDD for the 12-bit SRAM is raised to 0.72V (curve with green triangles).

Fig. 4.16(b) shows the corresponding power overhead, as a percentage of the total power, for SRAMs with a 3-bit, an 8-bit, and a 12-bit address decoder to meet the clock constraint. The curve with blue circles shows the power overhead for the SRAM with a 3-bit address decoder; its average over the 60-month lifetime is 18.68%. The curve with red pluses shows the power overhead over the lifetime for the SRAM with an 8-bit address decoder; its average is 12.13%. The curve with green triangles shows the power overhead for the SRAM with a 12-bit address decoder; its average over the 60-month lifetime is 5.37%.

4.5.2 Sizing approach for decoder to combat aging delay degradation

Aging causes delay degradation in SRAM peripheral circuits, especially in address decoders. For tight timing constraints, we can raise VDD to compensate for delay degradation. However, as shown above, this causes power overhead. For designs with both tight timing and power constraints, we can alternatively resize the decoder to reduce the critical path delay after aging and meet the timing constraints. This leads to a small area overhead. Thus it is the designer's choice to select the desired trade-off between power, timing, and area.

From Section 4.2.1, we know the aging degradation of each transistor in all the gates in decoders. The PMOS and NMOS transistors in inverters suffer the same stress conditions, and their threshold voltages degrade by similar amounts. In NAND gates, PMOS transistors suffer more degradation than NMOS transistors. Therefore T_PLH degrades more than T_PHL. Only the NAND gates in the decoder have a worst-case delay.
Thus we focus on resizing the NAND gates to reduce their worst-case delay after aging and thereby reduce the overall critical path delay.

Figure 4.16: (a) VDD and (b) power overhead in percentage for SRAMs with 3-bit, 8-bit and 12-bit address decoder to ensure read delay meets the clock constraint

All gates in the original decoder are sized using the logical effort approach to achieve the minimum delay. Transistor sizes are discrete, and the step size is λ, i.e., half of the feature size of the process. We start by resizing the NAND gates on the critical path of the decoder. Table 4.11 shows the delays of the original NAND3 gate and of the NAND3 gate resized to reduce the critical path delay after aging to meet the clock constraint. The 3-input NAND gate drives the same load as it does in the decoder. We can see that the worst-case delay of the original NAND3, both before and after aging, is T_PLH triggered by the input transition 111-011. The aging degradation is worse in T_PLH than in T_PHL. Thus, we can increase the PMOS size gradually to reduce T_PLH. We can also increase the size of the NMOS transistor closest to ground (N3) to compensate for the aging degradation of the NMOS transistors, i.e., use progressive sizing. Because transistor sizes are discrete, we perform a binary search across the sizes. As shown in Table 4.11, the worst-case delays after aging of the resized NAND3 are T_PLH triggered by the input transition 111-011 and T_PHL triggered by the input transition 001-111. The worst-case delay after aging of the resized NAND3 is less than the worst-case delay before aging of the original NAND3. Hence the critical path delay after aging with the resized NAND3 can meet the clock constraint. The area overhead of the resized NAND3 compared to the original NAND3 is 13.3%.

The flow for resizing the address decoder to reduce the critical path delay is as follows:

Step 1: Balance the deactivation delay and activation delay before aging.

Step 2: Increase the PMOS size gradually in NAND gates to compensate for the aging degradation. Perform a binary search across the transistor sizes in steps of λ.

Step 3: Increase the size of the NMOS transistor closest to GND in NAND gates, depending on the relative values of the delays.

Step 4: Increase the size of all inverters except the inverters connected to primary inputs by the minimum step (λ) to compensate for the aging degradation.

In the above flow, after each step, if the worst-case delay of the decoder after aging is less than or equal to the worst-case delay of the original decoder design before aging, we stop the sizing process. We start with the most efficient step and increase the size by the minimum amount (λ) in each iteration. Thus we can meet the clock constraint with the minimum area overhead. During resizing, the NAND gates become skewed. However, the changes in the overall size of each gate in the decoder are small. Hence we do not need to apply the logical effort approach to the path again. Fig. 4.17 shows the distributions of the activation delay and deactivation delay of the 3-bit address decoder resized to reduce the critical path delay after aging. The worst-case delay after aging of the resized decoder is 7.71E-11s (μ_max + 6σ), which is less than that of the original decoder before aging, i.e., 7.87E-11s, as shown in Fig. 4.8.

Figure 4.17: The distributions of (a) activation delay and (b) deactivation delay of resized 3-bit address decoder to reduce critical path delay after aging

4.5.3 Sizing approach for sense amplifier to combat aging

We study two commonly used sense amplifiers, i.e., a current-controlled latch sense amplifier [35] (SA1) and a latch-based amplifier with pass transistors [33] (SA2), as shown in Fig. 4.3.

In SA1, the threshold voltages of the transistors increase due to aging. The current flowing through both sides of the amplifier decreases, which increases the delay of the SA. The cross-coupled inverters P1, N1 and P2, N2 determine the noise margin.
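Returning to the decoder resizing flow of Section 4.5.2, the size search in Step 2 can be sketched as a binary search over discrete widths that stops at the smallest width whose aged delay meets the pre-aging target. The delay model below (delay inversely proportional to PMOS width) is a deliberately simple stand-in and not the simulation used in this chapter; the 16nm step corresponds to λ for a 32nm process, and the starting width and delays are the T_PLH values of the original NAND3 in Table 4.11. The function names are illustrative.

```python
def min_width_meeting_target(aged_delay, start_nm, target_s, step_nm=16, max_steps=16):
    """Smallest width on the lambda grid whose aged delay is at most target_s.

    aged_delay: callable mapping width (nm) to delay after aging (s); it is
        assumed to be non-increasing in width.
    Returns None if even the largest allowed width misses the target.
    """
    if aged_delay(start_nm) <= target_s:
        return start_nm
    low, high = 0, max_steps  # invariant: `low` steps fail, `high` steps pass
    if aged_delay(start_nm + high * step_nm) > target_s:
        return None
    while high - low > 1:
        mid = (low + high) // 2
        if aged_delay(start_nm + mid * step_nm) <= target_s:
            high = mid
        else:
            low = mid
    return start_nm + high * step_nm


# Values for the original NAND3 (Wp = 64n), transition 111-011, from Table 4.11.
ORIGINAL_WP_NM = 64
TPLH_BEFORE_AGING_S = 1.66e-11
TPLH_AFTER_AGING_S = 1.95e-11


def toy_aged_tplh(wp_nm):
    """Hypothetical model: aged T_PLH scales inversely with PMOS width."""
    return TPLH_AFTER_AGING_S * ORIGINAL_WP_NM / wp_nm


resized_wp_nm = min_width_meeting_target(
    toy_aged_tplh, ORIGINAL_WP_NM, TPLH_BEFORE_AGING_S
)
print(f"resized Wp = {resized_wp_nm} nm")
```

Because the search only assumes that delay does not increase with width, the toy model can be replaced by any delay estimate, such as a circuit simulation call, without changing the search itself.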
Reducing the size of the pull-up transistors decreases the read-0 delay. In SA1, the two output nodes are held at VDD by reset transistors before the Read signal arrives; hence we need to consider only the read-0 delay. Thus in SA1, N1 and N2 should be stronger than P1 and P2 to reduce delay. However, the PMOS/NMOS size ratio cannot be too small. If P1 and P2 are weaker than N1 and N2, the SA tends to latch “0” more easily than “1”. Using weak PMOS transistors increases the chance of latching a wrong value when there is a mismatch between the input transistors and the cross-coupled inverters, especially when N3 is weaker than N4 due to process variation. Thus we should maintain a certain ratio between the sizes of the pull-up and pull-down transistors in the cross-coupled inverters.

The relationship between the sizes of the four transistors in the cross-coupled inverters and the SA1 delay is not monotonic. This is due to the combined effects of the slope of the voltage transfer characteristic (VTC) and the capacitance. The VTC is steeper for large inverters, so the cross-coupled inverters can leave metastability more easily, which leads to a smaller delay. However, increasing the size of the inverters increases the capacitance and hence the delay. Fig. 4.18 shows the SA1 delay after aging under different sizes of P1, P2, N1, and N2. We can choose the size of the cross-coupled inverters considering both noise margin and delay when designing SA1.

Increasing the size of the input transistors N3 and N4 increases the current and hence decreases the delay of SA1. The delay decreases faster with an increase in the size of the current source N5: the larger N5 is, the larger the current flowing through the current source and the smaller the SA1 delay. Thus increasing the size of N5 to compensate for aging-induced delay degradation is area-efficient, and we choose to increase the size of N5 in our design. The size of N5 in the original SA1 design is 0.64μm. To ensure that the SA1 delay after aging is not more than the original delay before aging, the new size of N5 is 0.96μm.
The area overhead for the new sense amplifier design compared to the original SA1 is 8%.

SA2 is shown in Fig. 4.3(b). When Read = 0, the PMOS pass transistors are turned on and the voltages of BL and BL_bar are passed to the internal nodes of the amplifier. When signal Read is brought high, the PMOS pass transistors are turned off and the NMOS current source (N3) is turned on; the SA then latches the correct value at the output. Aging degrades the strength of the PMOS pass transistors and thus increases the sense time. To compensate for the degradation of sense time, we can size up both PMOS pass transistors. We can use the same sizing approach as for the SRAM cell presented in Chapter 3 to mitigate the stability degradation of the latch inverters in SA2. We can size up N3 to compensate for the aging delay degradation of SA2.

Figure 4.18: SA1 delay after aging under different sizes of P1, P2, N1 and N2.

4.6 Conclusion

In this chapter, we analyze the effect of BTI aging on SRAM peripheral circuitry, including address decoder, precharge circuit, write circuit and sense amplifiers. We find that aging causes delay degradation in peripheral circuitry, especially in the address decoder. We consider aging in the test generation method for variation-induced delay faults (VIDFs). Delay degradations caused by aging do not affect the input transitions invoking the worst-case delay. Thus aging degradation does not affect the path selection in test generation. The test generated for the design before aging can also be used for the design after aging. Our experimental results show that our new tests can achieve nearly perfect coverage for 32nm process before and after aging with reduced test length. The test length for VIDFs is bounded by 9.5n for arbitrary SRAM designs. We quantify the delay along the critical paths of SRAMs and estimate the amount of delay degradation caused by aging for each component.
We analyze the power overhead for the classical approach to combat delay aging degradation in peripheral circuitry. We explore the sizing approach of address decoder and sense amplifier to ensure that the worst-case delays after aging are not larger than those before aging of the original design to meet the clock constraint. To avoid failure caused by delay degradation, we can leave a sufficient margin in the timing control. If the timing constraint is tight, we increase VDD to compensate for the delay degradation at a power overhead. Alternatively, to avoid power overheads, we resize the address decoder and sense amplifiers. Our methods allow designers to choose the proper design based on their timing, power and area constraints.

Table 4.11: Delay of original NAND3 gate and the NAND3 gate resized for reducing critical path delay after aging to meet clock constraint

NAND3 with load (Wn=96n, Wp=64n)

| Delay type | Input transitions | Before aging (s) | After aging (s) | Delay degradation |
|---|---|---|---|---|
| T_PLH | 111-110 | 8.33E-12 | 9.82E-12 | 17.82% |
| T_PLH | 111-101 | 1.30E-11 | 1.53E-11 | 17.47% |
| T_PLH | 111-100 | 5.04E-12 | 5.88E-12 | 16.58% |
| T_PLH | 111-011 | 1.66E-11 | 1.95E-11 | 17.22% |
| T_PLH | 111-010 | 4.62E-12 | 5.41E-12 | 17.10% |
| T_PLH | 111-001 | 7.09E-12 | 8.30E-12 | 17.06% |
| T_PLH | 111-000 | 3.79E-12 | 4.36E-12 | 15.05% |
| T_PHL | 110-111 | 8.12E-12 | 9.18E-12 | 13.05% |
| T_PHL | 101-111 | 1.20E-11 | 1.36E-11 | 13.37% |
| T_PHL | 100-111 | 1.16E-11 | 1.29E-11 | 11.81% |
| T_PHL | 011-111 | 1.34E-11 | 1.51E-11 | 12.59% |
| T_PHL | 010-111 | 1.01E-11 | 1.11E-11 | 9.90% |
| T_PHL | 001-111 | 1.41E-11 | 1.55E-11 | 10.00% |
| T_PHL | 000-111 | 1.33E-11 | 1.47E-11 | 10.35% |

NAND3 with load (WN1=WN2=96n, WN3=112n, Wp=80n)

| Delay type | Input transitions | Before aging (s) | After aging (s) | Delay degradation |
|---|---|---|---|---|
| T_PLH | 111-110 | 7.46E-12 | 8.77E-12 | 17.53% |
| T_PLH | 111-101 | 1.13E-11 | 1.32E-11 | 16.79% |
| T_PLH | 111-100 | 4.52E-12 | 5.22E-12 | 15.59% |
| T_PLH | 111-011 | 1.41E-11 | 1.65E-11 | 16.91% |
| T_PLH | 111-010 | 4.23E-12 | 4.88E-12 | 15.49% |
| T_PLH | 111-001 | 6.18E-12 | 7.18E-12 | 16.13% |
| T_PLH | 111-000 | 3.38E-12 | 3.91E-12 | 15.72% |
| T_PHL | 110-111 | 8.85E-12 | 9.94E-12 | 12.27% |
| T_PHL | 101-111 | 1.27E-11 | 1.43E-11 | 12.21% |
| T_PHL | 100-111 | 1.25E-11 | 1.39E-11 | 10.87% |
| T_PHL | 011-111 | 1.41E-11 | 1.58E-11 | 11.74% |
| T_PHL | 010-111 | 1.11E-11 | 1.21E-11 | 9.36% |
| T_PHL | 001-111 | 1.50E-11 | 1.65E-11 | 10.52% |
| T_PHL | 000-111 | 1.44E-11 | 1.60E-11 | 10.87% |

Chapter 5

Aging-resilient SRAM design: an end-to-end framework

5.1 Background and Motivation

We develop a design approach for SRAM cells against BTI aging in Chapter 3. We study the impact of aging on each transistor in an SRAM cell and size the transistors in a manner that dramatically reduces the aging quality loss of the SRAM array. The SRAM cell sizing approach is effective at improving the aging resilience for various workloads without power overhead. In Chapter 4, we study the impact of aging in SRAM peripheral circuits, including address decoder, pre-charge circuit, write circuit, and sense amplifier. Our SRAM aging analysis has found that aging causes delay degradation in peripheral circuits and stability degradation in cells. We present several approaches to mitigate aging delay degradation in peripheral circuitry.

Further, traditional approaches to combat SRAM failures, such as ECC, may also be used to increase aging resilience. ECC protection schemes are classically intended to repair cell failures due to soft errors. Using ECC incurs area and performance overheads. To achieve optimal SRAM design against aging under the given constraints, several aspects need to be addressed. We need to quantify the overheads of ECC. The lifetime yield and DPPM need to be estimated when using ECC. We also need to consider the soft error resilience when using ECC to correct aging failures.

In this chapter, we develop an end-to-end SRAM design framework to maximize the aging resilience under the given constraints to provide designers the capability for optimal design of aging-resilient SRAMs. In addition to transistor sizing, we study the use of error-correcting codes (ECC) to improve aging resilience. By quantifying the area and delay overheads of ECC and estimating the lifetime yield and DPPM of SRAMs with ECC, we explore the efficiency of ECC to combat aging.
After comparing approaches based on transistor sizing in SRAM cells and ECC in terms of overheads, lifetime yield and DPPM, we can choose either one or a combination of these approaches to identify the optimal design against aging under the given constraints.

The rest of this chapter is organized as follows. In Section 5.2, we introduce the ECC background, estimate the lifetime yield and DPPM of SRAMs using different ECC schemes, and calculate the soft error resilience when using ECC to repair aging failures. We also quantify the ECC implementation overheads in Section 5.2. In Section 5.3, we present our end-to-end SRAM design framework for lifetime yield-per-area optimization. We present design results for two example cases in Section 5.4. Finally, we present our conclusions in Section 5.5.

5.2 Using ECC to repair aging failures

5.2.1 ECC background

ECC is a powerful technique used in memories to repair failures in a limited number of cells in arbitrary locations. In particular, Single-Error-Correction (SEC) code is one of the most popular codes used in memories but can only correct a single bit error [52, 53, 54]. Bose-Chaudhuri-Hocquenghem Double-Error Correction (BCH DEC) [52] can correct two-bit errors. Conventionally, ECC is used to recover from soft errors. In this section, we explore the capability of ECC to repair both soft errors and aging failures. Adding ECC to memory incurs area, latency, and power overheads. To analyze the efficiency of ECC for combating aging, we quantify the overheads of ECC and calculate the DPPM and lifetime yield of SRAM using different ECC schemes. We also estimate the soft error resilience (SER) to ensure that the designs using ECC to combat aging also meet the soft error resilience constraint.

5.2.2 DPPM and lifetime yield estimation with ECC

Our design objective is to optimize the lifetime yield-per-area of the SRAM array while also satisfying a given DPPM target.
Thus, we first compute DPPM and lifetime yield when ECC is used to correct aging failures. We define D as the length of the data (in bits) corrected by ECC. For an SRAM array consisting of N cells, the number of D-bit blocks is R = N/D. We assume the failures of SRAM cells caused by aging are independent from cell to cell, and this assumption is valid for all the cases (workloads) studied in this chapter. If we use SEC code to handle aging failures, the aging quality loss of an SRAM array ($Q_{\text{ag-SEC}}$) can be calculated as follows:

$$Q_{\text{ag-SEC}} = 1 - \prod_{m=1}^{R}\left[\prod_{i=1}^{D}\left(1-P_{f,ag}^{c,m,i}\right) + \sum_{i=1}^{D} P_{f,ag}^{c,m,i}\prod_{j=1,\,j\neq i}^{D}\left(1-P_{f,ag}^{c,m,j}\right)\right]$$

where $P_{f,ag}^{c}$ is the aging failure rate of an SRAM cell, i.e., the probability that an SRAM cell functions properly at fabrication but fails during its desired lifetime due to aging; the superscripts m and i index the block and the cell within the block. This equation calculates the probability of an SRAM array with at least one block (D-bit word) having two or more cell failures caused by aging. SEC code can correct one failure in each block. Thus, a chip will function correctly if every block contains either zero or one failure. If any of the blocks contains more than one aging failure during its desired lifetime, the SRAM array fails due to aging. DPPM of the SRAM array = Aging quality loss × 10^6.

The lifetime yield of an SRAM array ($Y_{\text{life-SEC}}$) can be calculated as follows when we use SEC code to correct aging failures:

$$Y_{\text{life-SEC}} = \prod_{i=1}^{N}\left(1-P_{f}^{c,i}\right)\prod_{m=1}^{R}\left[\prod_{i=1}^{D}\left(1-P_{f,ag}^{c,m,i}\right) + \sum_{i=1}^{D} P_{f,ag}^{c,m,i}\prod_{j=1,\,j\neq i}^{D}\left(1-P_{f,ag}^{c,m,j}\right)\right]$$

where $P_{f}^{c}$ is the failure rate of an SRAM cell and is used in the experiment to take into account the yield at the time of fabrication.

DEC code can correct two aging failures in each block. Thus, the SRAM fails due to aging only when any of the blocks contains more than two aging failures during its desired lifetime.
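As a minimal sketch of how these expressions can be evaluated numerically: both the SEC expression above and the DEC case just described require that every D-bit block contain at most k aging failures (k = 1 for SEC, k = 2 for DEC). Under the independence assumption stated above, the per-block probability can be accumulated with a small dynamic program instead of expanding the sums and products explicitly. The function names and the example failure probabilities below are illustrative assumptions, not values from the experiments.

```python
def prob_at_most_k_failures(cell_fail_probs, k):
    """Probability that at most k cells of one block fail (independent cells)."""
    # dist[f] = probability of exactly f failures among the cells seen so far.
    # Counts above k are dropped because they are never needed.
    dist = [1.0] + [0.0] * k
    for p in cell_fail_probs:
        nxt = [0.0] * (k + 1)
        for f, prob in enumerate(dist):
            nxt[f] += prob * (1.0 - p)      # this cell does not fail
            if f + 1 <= k:
                nxt[f + 1] += prob * p      # this cell fails
        dist = nxt
    return sum(dist)


def aging_quality_loss(blocks, k):
    """1 minus the product over blocks of P(at most k aging failures)."""
    good = 1.0
    for block in blocks:
        good *= prob_at_most_k_failures(block, k)
    return 1.0 - good


def dppm(blocks, k):
    """DPPM = aging quality loss x 10^6."""
    return aging_quality_loss(blocks, k) * 1e6


# Hypothetical example: 4 blocks of 8 cells, each cell failing with probability 1e-3.
example_blocks = [[1e-3] * 8 for _ in range(4)]
q_none = aging_quality_loss(example_blocks, k=0)  # no correction
q_sec = aging_quality_loss(example_blocks, k=1)   # SEC: one failure per block tolerated
q_dec = aging_quality_loss(example_blocks, k=2)   # DEC: two failures per block tolerated
```

The lifetime yield follows by multiplying the probability that all blocks are good by the fabrication-time yield term. For the very small failure rates of a real array, a numerically more careful formulation (for example working with logarithms) may be needed.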
The aging quality loss of an SRAM array with DEC code ($Q_{\text{ag-DEC}}$) can be calculated as follows:

$$Q_{\text{ag-DEC}} = 1 - \prod_{m=1}^{R}\left[\prod_{i=1}^{D}\left(1-P_{f,ag}^{c,m,i}\right) + \sum_{i=1}^{D} P_{f,ag}^{c,m,i}\prod_{j=1,\,j\neq i}^{D}\left(1-P_{f,ag}^{c,m,j}\right) + \frac{1}{2}\sum_{i=1}^{D}\sum_{j=1,\,j\neq i}^{D} P_{f,ag}^{c,m,i}\,P_{f,ag}^{c,m,j}\prod_{k=1,\,k\neq i,\,k\neq j}^{D}\left(1-P_{f,ag}^{c,m,k}\right)\right]$$

The lifetime yield of an SRAM array when we use DEC code to correct aging failures ($Y_{\text{life-DEC}}$) is calculated as follows:

$$Y_{\text{life-DEC}} = \prod_{i=1}^{N}\left(1-P_{f}^{c,i}\right)\prod_{m=1}^{R}\left[\prod_{i=1}^{D}\left(1-P_{f,ag}^{c,m,i}\right) + \sum_{i=1}^{D} P_{f,ag}^{c,m,i}\prod_{j=1,\,j\neq i}^{D}\left(1-P_{f,ag}^{c,m,j}\right) + \frac{1}{2}\sum_{i=1}^{D}\sum_{j=1,\,j\neq i}^{D} P_{f,ag}^{c,m,i}\,P_{f,ag}^{c,m,j}\prod_{k=1,\,k\neq i,\,k\neq j}^{D}\left(1-P_{f,ag}^{c,m,k}\right)\right]$$

5.2.3 Calculation of soft error resilience when using ECC to repair aging failures

The probability that a soft error occurs at a single cell during the time interval [0, t] ($P_{sf}$) is calculated as follows:

$$P_{sf}(t) = 1 - \exp(-\nabla(t))$$

where

$$\nabla(t) = F_b \times \frac{t \times 24\,\text{hrs}}{10^{9}\,\text{hrs}}$$

and $F_b$ is the FIT per bit. When ECC is used to repair both aging failures and soft errors, the bit error (to be corrected by ECC) occurs during the time interval [0, t] with probability

$$P_{er}(t) = 1 - (1-P_{ag})(1-P_{sf}(t))$$

where $P_{ag}$ is the error probability due to aging and $P_{sf}$ is the error probability due to soft errors. Obviously, $P_{er}(t) = P_{ag} + P_{sf}(t) - P_{ag}P_{sf}(t) > P_{sf}(t)$. That is, when ECC is used to repair both aging failures and soft errors, ECC essentially deals with a more error-prone situation even if $P_{sf}(t)$ is assumed unchanged. Equivalently, because ECC is used to repair aging failures together with soft errors, the same ECC, in general, has a smaller chance to recover from soft errors compared with conventional scenarios where it is solely used to correct soft errors. To characterize the potential degradation, we study the soft error resilience of ECC after using ECC to repair aging failures, which is the probability that both soft errors and aging failures can be repaired.
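The two probabilities above can be evaluated directly. The sketch below assumes that t is expressed in days (as suggested by the factor of 24 hrs) and that F_b is given in FIT per bit; the function names and the numeric inputs are illustrative assumptions only.

```python
import math

HOURS_PER_DAY = 24.0
FIT_HOURS = 1e9  # one FIT corresponds to one failure per 10^9 device-hours


def soft_error_prob(fit_per_bit, t_days):
    """P_sf(t) = 1 - exp(-F_b * t * 24 / 1e9) for a single cell."""
    exponent = fit_per_bit * t_days * HOURS_PER_DAY / FIT_HOURS
    return -math.expm1(-exponent)  # numerically stable form of 1 - exp(-x)


def bit_error_prob(p_aging, fit_per_bit, t_days):
    """P_er(t) = 1 - (1 - P_ag)(1 - P_sf(t)), written in expanded form."""
    p_sf = soft_error_prob(fit_per_bit, t_days)
    return p_aging + p_sf - p_aging * p_sf


# Hypothetical inputs: 1e-3 FIT per bit, a 60-month lifetime taken as 5 * 365 days,
# and an aging failure probability of 1e-6 per cell.
p_sf = soft_error_prob(1e-3, 5 * 365)
p_er = bit_error_prob(1e-6, 1e-3, 5 * 365)
```

With any nonzero aging failure probability, `p_er` exceeds `p_sf`, which is the more error-prone situation described above.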
The soft error resilience of each word for SEC after aging repair can be computed as

$$P_{\text{SEC-resilience}}(t) = \prod_{i=1}^{D}\left(1-P_{f,ag}^{c,i}\right)\exp(-\nabla(t)) + \sum_{i=1}^{D}\left[1-\exp(-\nabla(t)) + P_{f,ag}^{c,i}\exp(-\nabla(t))\right]\prod_{j=1,\,j\neq i}^{D}\left(1-P_{f,ag}^{c,j}\right)\exp(-\nabla(t))$$

The soft error resilience of each word for DEC after aging repair can be computed as

$$P_{\text{DEC-resilience}}(t) = \prod_{i=1}^{D}\left(1-P_{f,ag}^{c,i}\right)\exp(-\nabla(t)) + \sum_{i=1}^{D}\left[1-\exp(-\nabla(t)) + P_{f,ag}^{c,i}\exp(-\nabla(t))\right]\prod_{j=1,\,j\neq i}^{D}\left(1-P_{f,ag}^{c,j}\right)\exp(-\nabla(t)) + \frac{1}{2}\sum_{i=1}^{D}\sum_{j=1,\,j\neq i}^{D}\left[1-\exp(-\nabla(t)) + P_{f,ag}^{c,i}\exp(-\nabla(t))\right]\left[1-\exp(-\nabla(t)) + P_{f,ag}^{c,j}\exp(-\nabla(t))\right]\prod_{k=1,\,k\neq i,\,k\neq j}^{D}\left(1-P_{f,ag}^{c,k}\right)\exp(-\nabla(t))$$

Table 5.1 lists the DPPM and lifetime yield for SRAMs using D0 with different ECC schemes. D0 is the 6T SRAM cell optimized for the yield-per-area at the time of fabrication. In the table, the number associated with each ECC, e.g. SEC-512, is the length of the data (in bits) corrected by ECC. For G1 and Skew workloads, any of the ECC schemes in the table can achieve the target DPPM value of 50. When the DPPM is small, the lifetime yield is close to the yield at fabrication. Thus, the lifetime yield-per-area is determined by the area overhead. For simplicity, we first ignore the aging in the ECC circuit. For G2 workload, SEC-64 or SEC with smaller code length or any DEC can achieve the target DPPM. For Unknown workload, DEC-512 or DEC with smaller code length can be used to achieve the target DPPM.
Table 5.1: Lifetime yield and DPPM comparison for SRAMs using a cell optimized for yield-per-area at the time of fabrication (D0) under four different workloads with different ECC schemes

| Workload | DPPM (SEC-512) | DPPM (SEC-64) | DPPM (SEC-16) | DPPM (DEC-1024) | DPPM (DEC-512) | Lifetime yield (SEC-512) | Lifetime yield (DEC-512) |
|---|---|---|---|---|---|---|---|
| Gaussian 1 | 0.00067 | 0.00091 | 0.00069 | 0.00067 | 0.00067 | 0.96643 | 0.96643 |
| Gaussian 2 | 57.90 | 7.11 | 0.552 | 0.0010 | 0.00039 | 0.96637 | 0.96643 |
| Skew | 2.07 | 0.26 | 0.019 | 0.00018 | 0.00021 | 0.96643 | 0.96643 |
| Unknown | 18330.4 | 2283.6 | 544.3 | 56.1 | 6.53 | 0.94872 | 0.96642 |

We can add extra ECC only for aging resilience to avoid sacrificing the soft error resilience of the design. The better approach for real design is to use the existing ECC for soft errors directly to handle aging failures, especially for the cases with small aging failure rates (such as for G1 workload). This results in infinitesimal degradation of soft error resilience. However, for cases with large aging failure rates, the probability that multiple blocks have at least one aging failure in each block is high. In such cases, using the existing ECC for soft errors to correct aging failures decreases the soft error resilience to unacceptably low levels. Thus, we also estimate the soft error resilience when using ECC to combat aging. If the existing ECC for soft errors cannot meet the soft error or aging constraint, a stronger ECC circuit needs to be used to handle the aging failures. Table 5.3 shows the ECC schemes considering both aging repair and soft error resilience for different workloads.

5.2.4 Characterize the ECC implementation overheads

To optimize the design in terms of lifetime yield-per-area, it is necessary to estimate the area overhead of ECC. The major components of ECC implementation are encoder and decoder logic, and storage for check bits. For example, SEC-512 requires 11 check bits for every 512 data bits. DEC almost doubles the number of check bits as compared with SEC.
Table 5.2 shows the number of check bits needed to implement SEC and DEC and the corresponding area overheads at different correctable data lengths. The encoder and decoder of SEC are constructed as XOR trees. BCH DEC codes are cyclic codes and are usually implemented by multi-bit Linear Feedback Shift Registers (LFSR). Thus, their delay and area overheads are larger than those of SEC. A single-cycle implementation of DEC decoders incurs a 55% to 69% latency penalty compared to SEC codes. Thus, for similar area overhead, we first choose SEC.

We consider that the extra area of ECC implementation is caused by encoder and decoder circuits, and the memory cells required to store the check bits. In Table 5.2, the area overhead of storage of check bits is calculated as the ratio of the area of the memory cells needed to store check bits to the area of correctable data cells. The encoder and decoder area overheads are the ratio of the area of encoder and decoder circuits to the total area of data cells in the SRAM array. For a fixed error-correcting capability, encoder and decoder complexities increase with the length of the data corrected by ECC, while the check bits area overheads decrease dramatically. For SEC, the decoder and encoder logic occupy a much smaller area compared to the storage of check bits. The major area overhead for SEC is the storage bits. For a fixed code length, the complexities of encoder/decoder logic and check bit array both increase with the error-correcting capability. However, the encoder/decoder logic complexities increase much faster than the check bit array. For example, when moving from SEC to DEC at a data length of 512, the area overhead of the check bits array grows from 2.15% to 3.91%, while the area overhead of encoder/decoder logic grows from 0.05% to 4.09%. For DEC-512, the encoder and decoder logic dominate the area.
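The check-bit storage overhead in Table 5.2 is simply the ratio of check bits to data bits. As a small sketch, the snippet below recomputes that column. The closed-form check-bit counts (log2(D) + 2 for SEC and 2·log2(D) + 2 for DEC) are an assumption chosen only because they reproduce the counts listed in Table 5.2 for power-of-two data lengths; they are not presented as a general formula for these codes, and the function names are hypothetical.

```python
def check_bits(data_bits, correctable_errors):
    """Check-bit count reproducing Table 5.2 (data_bits must be a power of two)."""
    log2_d = data_bits.bit_length() - 1
    if data_bits < 2 or (1 << log2_d) != data_bits:
        raise ValueError("data_bits must be a power of two")
    return correctable_errors * log2_d + 2


def storage_overhead_percent(data_bits, correctable_errors):
    """Area of check-bit cells relative to the area of the data cells, in percent."""
    return 100.0 * check_bits(data_bits, correctable_errors) / data_bits


for d in (16, 32, 64, 128, 256, 512):
    print(d,
          check_bits(d, 1), round(storage_overhead_percent(d, 1), 2),   # SEC
          check_bits(d, 2), round(storage_overhead_percent(d, 2), 2))   # DEC
```

The printed overheads agree with the storage columns of Table 5.2 up to rounding in the last digit. The encoder and decoder logic overheads come from synthesis and are not modeled here.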
We synthesize the ECC encoder and decoder circuits using Design Compiler with a 45nm PDK library and report the area of encoder and decoder circuits for various correctable data lengths and error correction capabilities. Then we characterize the area of ECC decoder and encoder circuits in terms of correctable data length and error correction capability. The soft error rate is 5 × 10^6 FIT/Mb. The cell area is calculated based on a parametric layout of an SRAM cell. We modify CACTI [32] to add the bit cell area as a function of transistor size and to add ECC area including check bits storage, and decoder and encoder circuits. The peripheral circuits and interconnect area can be estimated using CACTI. Thus, the modified CACTI is able to estimate the overall area of SRAM including ECC implementation.

Table 5.3 shows the area overhead comparison of the ECC approach and the sizing approach for different workloads. For calculating the area overhead in the second and third columns in Table 5.3, the base is the area of the entire SRAM using D0 without ECC. Thus the area overhead for the sizing approach is the ratio of the area of the entire SRAM using cells optimized through the sizing approach without ECC to the base. The area overhead in the third column is the ratio of the area of the entire SRAM using D0 with appropriate ECC schemes to the base. The base is the area of the entire SRAM using D0 with SEC-512 for the fourth column and using DEC-256 for the fifth column. We assume the original ECC is replaced by the new ECC scheme for area overhead calculation. From Table 5.3, we see that the area overhead of the ECC approach is lower than that of the cell sizing approach for G1, Skew, and Unknown workloads. SRAM design under G1 and Skew workloads can use SEC-512 to achieve the target DPPM. The overall area overhead is only 1.81%. However, ECC encoder and decoder circuits cause delay overheads.
From synthesis results, encoders for SEC incur a latency penalty of 0.3 ns to 0.8 ns for various correctable data lengths. The delay overhead of decoders for SEC is more than twice that of encoders. The delay penalty for multi-cycle implementation of BCH DEC code is dramatically high. Thus, under a tight timing constraint, the sizing approach is chosen over the ECC approach to combat aging.

Table 5.2: Number of check bits and area overhead (compared to the total area of data cells in SRAM array) for SEC and DEC for different correctable data lengths

| Length of data corrected by ECC (bits) | SEC: # of check bits | SEC: storage of check bits area overhead (%) | SEC: encoder + decoder area overhead (%) | BCH DEC: # of check bits | BCH DEC: storage of check bits area overhead (%) | BCH DEC: encoder + decoder area overhead (%) |
|---|---|---|---|---|---|---|
| 16 | 6 | 37.5 | 0.0012 | 10 | 62.5 | 0.0048 |
| 32 | 7 | 21.87 | 0.0025 | 12 | 37.5 | 0.0174 |
| 64 | 8 | 12.5 | 0.0052 | 14 | 21.87 | 0.066 |
| 128 | 9 | 7.03 | 0.011 | 16 | 12.5 | 0.26 |
| 256 | 10 | 3.91 | 0.024 | 18 | 7.03 | 1.028 |
| 512 | 11 | 2.15 | 0.050 | 20 | 3.91 | 4.09 |

Table 5.3: Area overhead (relative to entire SRAM, computed using modified CACTI) comparison for sizing and ECC approach under four different workloads

| Workload | Area overhead for sizing approach | Area overhead for ECC used only for aging | SEC-512 was already available (SER>0.9) | DEC-256 was already available (SER>0.9) |
|---|---|---|---|---|
| Gaussian 1 (G1) | 1.0222 | 1.0181 (SEC-512) | 1 | 1 |
| Gaussian 2 (G2) | 1.0443 | 1.0659 (DEC-512) | 1.0470 (DEC-512) | 1 |
| Skew | 1.0329 | 1.0181 (SEC-512) | 1 | 1 |
| Unknown | 1.0665 | 1.0659 (DEC-512) | 1.0470 (DEC-512) | 1 |

5.3 Design approach

5.3.1 An end-to-end SRAM design framework for lifetime yield-per-area optimization

We have studied cell sizing and ECC approaches to increase aging resilience. Each approach has different overheads. According to the constraints given by customers, we can achieve the optimal SRAM design to maximize the lifetime yield-per-area under the DPPM constraint via one or a combination of the two approaches.
We propose an end-to-end SRAM design framework to optimize the lifetime yield-per-area under the given constraints as follows:

Given access time, power, area, DPPM, and soft error resilience constraints.

Assume ECC exists for soft error.

Step 1: Use CACTI to find candidate designs under the given constraints on access time, power, and area. The base cell D0 is the cell optimized for the yield-per-area at the time of fabrication.
Step 2: Estimate the DPPM and SER of SRAM with the existing ECC scheme for various workloads depending on the application. If both DPPM and SER meet the target, report the design. If not, go to step 3.
Step 3: For the given DPPM target, use the SRAM sizing approach to achieve the DPPM constraint while optimizing lifetime yield-per-area. The redesigned cell is called the new cell. DPPM is estimated without ECC. Perform binary search across cell designs between the original cell and the new cell. Estimate SER and DPPM with ECC. Choose the cell with the smallest area overhead that satisfies the DPPM and SER constraints. Report the area overhead. This area overhead bounds the ECC approach.
Step 4: Explore the ECC approach to achieve the target DPPM. The SRAM cell is the base cell D0 optimized for the yield-per-area at the time of fabrication.
a) Starting with the original ECC, reduce the data length to half. If the starting ECC is SEC, simply calculate the area overhead of check bits. If the check bit area overhead of SEC with half-size data blocks is larger than that of DEC with the largest data length, we move to a code with higher error correction capability (DEC).
b) Estimate the DPPM and SER of SRAM with the new ECC scheme. Repeat 4a) if DPPM or SER does not satisfy the target. We choose SEC over DEC for similar area overhead when both can satisfy the target DPPM because SEC code results in a lower delay penalty. Report the area overhead of ECC using CACTI. Report the access time and power of the design with the new ECC.
If the access time or power exceeds the constraints, report the design in step 3 as the optimal design. End the process. We need to add the delay degradation caused by aging to the access time when comparing with the user access time constraint. Compare the area overhead with the area overhead in step 3. If the area overhead of ECC is smaller than that of the cell sizing approach, we choose this ECC scheme. Report the design as the optimal design. Otherwise, report the design in step 3 as the optimal design. End the process.

Assume ECC does not exist for soft error.

Step 1: Use CACTI to find a candidate design under the given constraints on access time, power, and area.
Step 2: For the given DPPM target, use the SRAM sizing approach to achieve the given DPPM and optimal lifetime yield-per-area. Report the area overhead. This area overhead bounds the ECC approach.
Step 3: Explore the ECC approach to achieve the target DPPM. The SRAM cell is the base cell D0 optimized for the yield-per-area at the time of fabrication.
a) Starting with the largest possible correctable data length (the size of a cache line) and the lowest error correction capability (SEC), estimate the DPPM of the design.
b) If the DPPM is larger than the target DPPM, reduce the data length to half. Calculate the area overhead of check bits of various ECC schemes. If the check bit area overhead of SEC with small data blocks is larger than that of DEC with the largest data length, we move to a code with higher error correction capability (DEC). Estimate the DPPM. If DPPM is still larger than the target DPPM, repeat step 3b). If DPPM is no more than the target DPPM, report the area overhead of ECC using CACTI. Report the access time and power of the design with ECC. If the access time or power exceeds the constraints, report the design in step 2 as the optimal design. End the process. Otherwise, compare the area overhead with the area overhead in step 2.
If the area overhead of ECC is smaller than that of the redesigned cell, we choose this ECC scheme. Report the design as the optimal design. Otherwise, report the design in step 2 as the optimal design. End the process.

5.4 Experiment results

For all the experimental evaluations, we use the aging model proposed in [15]. The desired lifetime in the experiments is 60 months. We adopt the probability collective method proposed in [47] to estimate the failure rate and aging failure rate. Our design objective is to maximize the lifetime yield-per-area of a 2MB SRAM with 6T SRAM cells under the target DPPM (i.e., 50). The block size is 64B.

Table 5.4: Design results of 2MB SRAMs with 6T cells under four different workloads

Case 1: Access time < 5ns, SEC-512 exists, Soft error resilience > 0.9

| Workload | Cell | Add ECC | DPPM | Area overhead |
|---|---|---|---|---|
| G1 | D0 | No | 0.00067 | 1 |
| G2 | D2 | No | 48.78 | 1.0435 |
| Skew | D0 | No | 2.07 | 1 |
| Unknown | D4 | No | 31.52 | 1.0653 |

Case 2: Access time < 10ns, no ECC exists

| Workload | Cell | Add ECC | DPPM | Area overhead |
|---|---|---|---|---|
| G1 | D0 | SEC-512 | 0.00067 | 1.0181 |
| G2 | D2 | No | 48.78 | 1.0443 |
| Skew | D0 | SEC-512 | 2.07 | 1.0181 |
| Unknown | D0 | DEC-512 | 6.53 | 1.0659 |

Table 5.4 shows the design results for two study cases. Case 1 has a small access time specification and DEC cannot be adopted in this case. In case 1, we assume SEC-512 already exists for soft errors and we need to consider soft error resilience. D0 is the SRAM cell design optimized for yield-per-area at the time of fabrication. For area overhead estimation, the base is the area of the original design, i.e., SRAM using D0 with SEC-512. D2 and D4 are the cell designs produced by the transistor sizing approach for G2 and Unknown workloads. For G1 and Skew, the existing ECC can achieve the target DPPM and soft error resilience. For G2, cell design D2 with up-sized pull-down transistors is used. Both target DPPM and soft error resilience are satisfied with a minimum area overhead. For Unknown, D4 (larger pull-down transistors compared with D2) is adopted, since DEC cannot be used due to the access time constraint.
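To illustrate the final comparison step of the framework when no ECC exists, the sketch below applies it to the area overheads in the second and third columns of Table 5.3. The function name and data layout are hypothetical, and the access time and power checks are reduced to a single flag that the caller is assumed to have evaluated beforehand (for example with CACTI). This is a simplified illustration, not the compiler implementation.

```python
def choose_design(sizing_overhead, ecc_scheme, ecc_overhead,
                  ecc_meets_timing_and_power=True):
    """Pick the approach with the smaller area overhead.

    ECC is only eligible if the design with ECC meets the access time and
    power constraints; otherwise the cell sizing design is reported.
    """
    if ecc_meets_timing_and_power and ecc_overhead < sizing_overhead:
        return ("ECC " + ecc_scheme, ecc_overhead)
    return ("cell sizing", sizing_overhead)


# Area overheads relative to the SRAM using D0 without ECC (Table 5.3):
# workload -> (sizing overhead, ECC scheme meeting the DPPM target, ECC overhead)
table_5_3 = {
    "G1": (1.0222, "SEC-512", 1.0181),
    "G2": (1.0443, "DEC-512", 1.0659),
    "Skew": (1.0329, "SEC-512", 1.0181),
    "Unknown": (1.0665, "DEC-512", 1.0659),
}

choices = {workload: choose_design(*row) for workload, row in table_5_3.items()}
for workload, (approach, overhead) in choices.items():
    print(workload, approach, overhead)
```

With these inputs the sketch selects SEC-512 for G1 and Skew, cell sizing for G2, and DEC-512 for Unknown, which agrees with Case 2 of Table 5.4. Passing `ecc_meets_timing_and_power=False` forces the cell sizing design, which mirrors the situation where DEC cannot be adopted under a tight access time constraint.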
We assume a larger access time limitation for case 2. We do not need to evaluate soft error resilience since we assume the original ECC does not exist because soft errors are not important. DEC-512 is adopted for Unknown workload to achieve minimum area overhead without exceeding the access time constraint.

5.5 Conclusion

We developed an end-to-end SRAM design framework to maximize the aging resilience under the given constraints. We explored the efficiency of ECC to combat aging by quantifying the area and delay overheads of ECC and estimating the lifetime yield and DPPM of SRAMs with ECC. We also calculated the soft error resilience when ECC is used to repair aging failures. We found that ECC is efficient for repairing aging failures for workloads with small aging failure rates without sacrificing the soft error resilience. After comparing approaches based on cell sizing and ECC in terms of overheads, lifetime yield and DPPM, we can choose one or a combination of these approaches to identify the optimal design against aging under the given constraints. To provide the end-to-end capability to designers, we integrated our cell sizing approach and our ECC approach into an existing SRAM compiler, CACTI. Our new compiler provides the design with the optimal lifetime yield-per-area under given constraints.

Chapter 6

Contributions

6.1 Tests for variation-induced delay faults in SRAMs

To address the increasing importance of process variations, we developed a general approach to generate O(n) tests to cover all address dependent process variation-induced delay faults (VIDFs) in arbitrary SRAM designs with n locations. The upper bound of test length is 9.5n for arbitrary designs.

Our approach generates the shortest linear test that covers all the required two-pattern address sequences to detect all target VIDFs.
The worst situation for VIDF test generation is when the test must capture the worst-case deactivation delay, the worst-case activation delay, and the second worst-case activation delay, and these delays are triggered by different two-pattern address sequences. The test length for this situation is 9.5n, which is the upper bound for all the new tests generated by our approach for arbitrary SRAM designs.

We evaluated our new O(n) tests, along with tests with much higher complexity (from O(n log n) to O(n^2)), namely WT, WCGD, and GALPAT, using extensive simulations to demonstrate that our new tests achieve nearly perfect coverage of VIDFs for arbitrary SRAM designs for both 65nm process and 32nm process. In contrast, WT and WCGD cannot achieve good fault coverage for designs with large activation delays.

Then we efficiently integrated our new tests for variations with tests for delay defects and demonstrated the efficiency and effectiveness of our new combined memory tests compared to previously known memory tests for delay faults.

6.2 Tests for aging- and variation-induced delay faults in SRAMs

We analyzed the effect of BTI aging on SRAM peripheral circuitry, including address decoder, precharge circuit, write circuit and sense amplifiers. We found that aging causes delay degradation in peripheral circuitry, especially in the address decoder.

We considered aging in the test generation method for variation-induced delay faults. We identified the address transitions that trigger the worst-case delays at decoder outputs for 32nm process before and after aging. We found that delay degradations caused by aging do not affect the two-pattern address sequences that invoke the concerned delay to detect the target VIDFs. Thus aging degradation does not affect the test generation. We demonstrated that our above O(n) tests for VIDFs can also be used for the design after aging.
Our experimental results showed that our above new tests achieve close to 100% coverage for VIDFs in arbitrary decoder designs before as well as after aging at much lower test lengths compared to any known tests.

6.3 Sizing for aging resilient SRAM design

We developed the first approach for combating aging in SRAM cells by sizing the transistors in a manner that dramatically reduces the aging quality loss with no power overhead and very low area overhead for a wide range of workloads.

We demonstrated the effectiveness of our approach for 6T SRAM cell and 10T Schmitt Trigger SRAM cell in planar CMOS technology as well as 6T SRAM cell in FinFET technology. Our results show that transistor sizing is surprisingly effective at combating aging for a wide range of workloads. Specifically, it reduces the DPPM by orders of magnitude and increases the lifetime yield-per-area without extra power overhead.

We also quantified delays along the critical paths of SRAMs and estimated the amount of delay degradation caused by aging for each component. To avoid failure caused by aging-induced delay degradation, one alternative is to leave sufficient margin in the timing control. If the timing constraint is tight, we can either increase VDD to compensate for the delay degradation at a power overhead or resize the address decoder and sense amplifier to ensure the critical path delay after aging is not larger than that before aging for the original design to meet the clock constraint. Our methods allow designers to choose the proper design based on their timing, power and area constraints.

6.4 End-to-end SRAM design framework for aging resilience

Finally, we developed the first end-to-end SRAM design framework to maximize the aging resilience under given constraints on area, power, delay, and aging quality loss.
We explore the efficiency of error-correcting codes (ECC) to combat aging by quantifying the area and delay overheads of ECC and estimating the lifetime yield and DPPM of SRAMs with ECC. We also calculate the soft error resilience when ECC is used to repair aging failures. We find that ECC is efficient for repairing aging failures for workloads with small aging failure rates without sacrificing the soft error resilience. After comparing approaches based on cell sizing and ECC in terms of overheads, lifetime yield, and DPPM, we can choose one or a combination of these approaches to identify the optimal design against aging under the given constraints. To provide the end-to-end capability to designers, we integrate our cell sizing approach and our ECC approach into an existing SRAM compiler, CACTI. Our new compiler provides the design with the optimal lifetime yield-per-area under given constraints.
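As a rough illustration of the ECC storage overhead discussed in the conclusions, the following minimal Python sketch counts check bits for a Hamming-based single-error-correcting, double-error-detecting (SECDED) code. SECDED is an assumption made here for illustration; this section does not state which code the dissertation evaluates, and the function names are hypothetical. Area and delay overheads of the encoder and decoder logic are not modeled.

```python
def hamming_parity_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    if data_bits < 1:
        raise ValueError("data_bits must be a positive integer")
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Hamming parity bits plus one overall parity bit for double-error detection."""
    return hamming_parity_bits(data_bits) + 1


def storage_overhead(data_bits: int) -> float:
    """Check bits per data bit: a first-order proxy for array area overhead only."""
    return secded_check_bits(data_bits) / data_bits


if __name__ == "__main__":
    # Word widths below are illustrative assumptions, not values from the source.
    for k in (8, 16, 32, 64, 128):
        print(f"{k:4d} data bits: {secded_check_bits(k)} check bits, "
              f"overhead {storage_overhead(k):.1%}")
```

The relative overhead shrinks as the word gets wider, which is one reason the word width matters when ECC is weighed against cell sizing under an area constraint.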
Abstract
Process variation and aging are the two major causes of circuit error as well as performance and robustness degradation. Further, with continued technology scaling, both these causes are becoming increasingly important. We focus on static random-access memories (SRAMs) since they are widely used. SRAMs are very susceptible to process variations and aging because of their small transistor sizes and dense layouts. ❧ In the first part of this dissertation, we propose a general approach to generate O(n) tests that cover all address-dependent process variation-induced delay faults (VIDFs) in arbitrary SRAM designs. The upper bound on test length is 9.5n for arbitrary designs with n locations. We focus on address-dependent delay faults since these are the most likely to escape traditional linear tests. Most previous memory tests developed for address decoder delay faults, including GALPAT and the Worst Case Gate Delay (WCGD) test, focus on defects. Although the test length of WCGD is O(n log n), which is necessary to detect address decoder delay faults caused by defects, it is not effective for address decoder delay faults induced by process variations, since it does not cover all the two-pattern address sequences required for VIDFs. We show that a different test strategy is necessary for VIDFs because VIDFs are multiple, widespread, and have small delay values that are correlated. In particular, we identify all the address-dependent variation-induced failure mechanisms along with sufficient conditions for their detection. We model the delay information of address decoders under variation and identify the address transitions that invoke the delays of concern for capturing the target VIDFs. We then generate the shortest linear test that covers all these two-pattern address sequences to detect all target VIDFs.
We use our test generation approach to generate O(n) tests and use extensive simulations to demonstrate that our new tests achieve nearly perfect coverage of VIDFs for arbitrary SRAM designs. Then we efficiently integrate our new tests for variations with tests for delay defects and demonstrate the efficiency and effectiveness of our new combined memory tests compared to previously known memory tests for delay faults. ❧ We analyze the effect of bias temperature instability (BTI) aging on SRAM peripheral circuitry, including the address decoder, precharge circuit, write circuit, and sense amplifiers. We find that aging causes delay degradation in peripheral circuitry, especially in the address decoder. We augment our test generation method for VIDFs to consider aging. The delay degradation caused by aging does not affect the input transitions that invoke the worst-case delay. Thus aging degradation does not affect the path selection in test generation, and the test generated for the design before aging can also be used for the design after aging. Our experimental results show that our new tests can achieve close to 100% coverage of VIDFs in arbitrary SRAM designs before and after aging with reduced test length. ❧ In the second part of this dissertation, we propose a design approach for combating aging in SRAM cells by sizing the transistors in a manner that dramatically reduces the aging quality loss with no power overhead for any given workload. The performance of transistors degrades due to aging. Aging degradation causes lifetime failures and lowers the quality of shipped chips. Our new approach for combating aging is especially suitable for IoT components and other embedded systems that have tight constraints on power and require a long lifetime with low aging quality loss. We focus on aging in SRAMs since these need to retain state and hence remain under stress even when logic is put in the sleep mode.
The magnitude of aging degradation in SRAMs depends on the workload applied to the cells, measured by the duration for which various values (0's and 1's) are stored. Some single-purpose IoT systems have a specific workload, while more general-purpose systems have a broader range of workloads. Hence, we study a wide range of workloads. Previous methods for designing SRAMs against aging require significant changes at the architecture level or expensive changes at the cell level. In contrast, our method simply sizes the transistors in SRAM cells to optimize the lifetime yield-per-area under tight constraints on aging quality loss (measured by DPPM) and power. We demonstrate the effectiveness of our approach for the 6T SRAM cell and the 10T Schmitt Trigger SRAM cell in planar CMOS technology, as well as the 6T SRAM cell in FinFET technology. Our results show that transistor sizing is surprisingly effective at combating aging for a wide range of workloads. Specifically, it reduces the DPPM by orders of magnitude and increases the lifetime yield-per-area under tight power constraints. ❧ We quantify the delay along the critical paths of SRAMs and estimate the amount of delay degradation caused by aging for each component. To avoid failure caused by delay degradation, we can leave a sufficient margin in the timing control. If the timing constraint is tight, we can increase VDD to compensate for the delay degradation at a power overhead. We can also resize the address decoder and sense amplifier so that the critical path delay after aging is no larger than the delay before aging in the original design, and the clock constraint is still met. Our methods allow designers to choose the proper design based on their timing, power, and area constraints. ❧ Furthermore, we develop an end-to-end SRAM design framework to maximize the aging resilience under the given constraints.
We explore the efficiency of error-correcting codes (ECC) to combat aging by quantifying the area and delay overheads of ECC and estimating the lifetime yield and DPPM of SRAMs with ECC, respectively. We also calculate the soft error resilience when ECC is used to repair aging failures. We find that ECC is efficient for repairing aging failures for workloads with small aging failure rates without sacrificing the soft error resilience. After comparing approaches based on cell sizing and ECC in terms of overheads, lifetime yield and DPPM, we can choose one or a combination of these approaches to identify the optimal design against aging under the given constraints. To provide the end-to-end capability to designers, we integrate our cell sizing approach and our ECC approach into an existing SRAM compiler, CACTI. Our new compiler provides the design with the optimal lifetime yield-per-area under given constraints.
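To make the 9.5n bound on test length stated in the abstract concrete, the sketch below compares it against the asymptotic growth of O(n log n) and O(n^2) tests. The constant factors for the n log n and n^2 curves are assumed to be 1 purely for illustration; the actual operation counts of WT, WCGD, and GALPAT are not given in this section, and the function names are hypothetical.

```python
import math


def linear_test_upper_bound(n: int) -> float:
    """Upper bound on the length of the proposed O(n) tests: 9.5n (from the abstract)."""
    return 9.5 * n


def growth_nlogn(n: int) -> float:
    """n * log2(n) with an assumed unit constant (illustrative only)."""
    return n * math.log2(n)


def growth_quadratic(n: int) -> float:
    """n^2 with an assumed unit constant (illustrative only)."""
    return float(n * n)


if __name__ == "__main__":
    # n is the number of memory locations; the sizes below are arbitrary examples.
    for n in (2 ** 10, 2 ** 16, 2 ** 20):
        print(f"n = {n:>8d}: 9.5n = {linear_test_upper_bound(n):.3g}, "
              f"n log2 n = {growth_nlogn(n):.3g}, n^2 = {growth_quadratic(n):.3g}")
```

With unit constants, the n log n curve passes 9.5n once log2(n) exceeds 9.5, that is, for memories with more than roughly 724 locations, and the gap to n^2 widens much faster.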
Asset Metadata
Creator
Zuo, Xuan (author)
Core Title
Design and testing of SRAMs resilient to bias temperature instability (BTI) aging
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
02/13/2020
Defense Date
12/09/2019
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
aging quality loss,aging-resilient design,BTI aging,memory testing,OAI-PMH Harvest,process variation,SRAM design
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Gupta, Sandeep (committee chair), Bogdan, Paul (committee member), Halfond, William Guillermo (committee member)
Creator Email
xzuo@usc.edu,zuoxuan2011@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-268690
Unique identifier
UC11674980
Identifier
etd-ZuoXuan-8165.pdf (filename),usctheses-c89-268690 (legacy record id)
Legacy Identifier
etd-ZuoXuan-8165.pdf
Dmrecord
268690
Document Type
Dissertation
Rights
Zuo, Xuan
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA