A FRAMEWORK FOR SOFT ERROR TOLERANT SRAM DESIGN
by
Riaz Naseer
________________________________________
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2008
Copyright 2008 Riaz Naseer
Dedication
To Amma ń Fati (Ghulam Fatima – my maternal grandmother), who is the light of my
eyes and beat of my heart. She taught me simplicity, sincerity and humility. She taught
me the meaning of unconditional love and to this day provides me the warmth of her
love.
To my parents (Muhammad Shamshad and Balqis Khanum) for being who I am.
Acknowledgements
Like any other thesis and PhD research, this endeavor is not carried out in solitude.
My greatest gratitude goes to my advisor – Jeff Draper, who not only mentored me
throughout this research but also inspired me as a professional manager and good
human being. This thesis would not have been possible without his continuous support
and encouragement. Along the way, I learned many lessons from him including
technical, professional and social aspects of one’s life.
I also thank my PhD guidance and dissertation committee members (Keith Chugg,
Mary Hall, Sandeep Gupta, Aiichiro Nakano, Monte Ung) for being there in an hour
of need and providing me technical feedback.
Special thanks to my colleague, Younes Boulghassoul, for his guidance and technical
support through various phases of my work. I learned valuable information about
radiation effects characterization and mitigation from him. Thanks also to Michael
Bajura, Tim Barrett, Young Hoon Kang, Taek-Jun Kwon, and Jeff Sondeen for their
support and valuable comments.
I received helpful comments and suggestions from my fellow students throughout this
research. Especially, I want to thank Rashed Bhatti, Nasir Mohyuddin and Bilal Zafar
for long discussions during numerous presentations and through compilation of this
thesis.
I have learned many things from my numerous teachers throughout my academic
career. It is hard to name all of them here, so I thank them all. Particularly, I thank my
primary school teacher – Ahmad Ali, and USC professors: Won Namgoong, Alice
Parker and Viktor Prasanna.
The departmental staff at the Ming Hsieh Department of Electrical Engineering and at the Office
of International Students have provided me with continuous support and an admirable
professional attitude throughout my student career at USC. I want to thank all of them,
and especially Diane Demetras for advising and accommodating my student needs.
Lastly, I want to thank my wife – Rahat, and my children – Shayaan and Eman, for
bearing with me through highs and lows for a long duration of graduate studies at
USC. Especially, Rahat’s support was crucial in keeping up with the unwieldy
schedule of a graduate student’s life.
Riaz Naseer
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables vii
List of Figures viii
Abstract xi
Chapter 1 Introduction 1
1.1 Background 3
1.2 The Framework 8
1.3 Scope 16
1.4 The Roadmap 16
Chapter 2 Radiation Effects on Micro Electronics 19
2.1 The Radiation Sources 20
2.2 Radiation Effects 24
2.3 Radiation Hardening 31
Chapter 3 Models for Soft Error Mitigation of Memory Systems 34
3.1 Reliability Analysis for Spatially Protected Memories 35
3.2 Reliability Analysis for Spatially and Temporally Protected
Memories 43
3.3 Spatial and Temporal Redundancy Bounds for Target
Reliability 51
3.4 ECC Model Assumptions and Relevant Factors 56
Chapter 4 Soft Error Rate Characterization and Scaling Trends 60
4.1 SEU Characterization 62
4.2 SET Characterization 75
4.3 Deductions 88
Chapter 5 Radiation Hardening of Combinatorial Logic 90
5.1 Existing SET Mitigation Techniques 91
5.2 The Delay Filtered DICE (DF-DICE) 96
5.3 Deductions 103
Chapter 6 Implementation Architectures for ECC Model-based
Radiation Hardening of Memories 105
6.1 Survey of Existing ECC Implementations 106
6.2 DEC BCH Code Design for Memory Word Sizes 108
6.3 Implementation Approach 110
6.4 Implementation Results 114
6.5 Deductions 118
Chapter 7 Implementation Results 120
7.1 Test IC Design Description 120
7.2 Radiation Testing 124
7.3 Experimental Validation of the Effective BER Model 138
Chapter 8 Conclusions 144
8.1 Summary 145
8.2 Limitations 147
8.3 Extensions 148
Bibliography 150
List of Tables
Table 1 Spatial redundancy for some linear block codes aligned for
typical memory word sizes 52
Table 2 Q_crit values using different models 69
Table 3 Bit Error Rates (in errors/bit-day) using Q_crit 71
Table 4 Technology scaling effects on Q_crit and predicted Bit Error Rates 75
Table 5 Fan-out variations on SET pulse-widths (ps) for different gates
and LET 85
Table 6 Area comparison for DF-DICE 99
Table 7 Area comparison against DICE 103
Table 8 Required redundancy for ECC codes 115
Table 9 ECC encoder latency and area results 115
Table 10 ECC decoder latency and area results 116
Table 11 10 MeV cocktail ions used during the SEE tests 124
Table 12 Estimated BER from CREME96 for different radiation
environments 132
Table 13 Observed errors and errors predicted by our model 140
Table 14 ECC strength vs scrubbing rate trade-offs for LP Hardened
SRAMs 142
Table 15 ECC strength vs scrubbing rate trade-offs for SF Hardened
SRAMs 142
List of Figures
Figure 1 Block diagram of the Framework 10
Figure 2 Maximum GCR spectrum for geosynchronous orbit 21
Figure 3 Radiation environment around earth 22
Figure 4 Secondary particles generated by the interaction of cosmic rays
with atmospheric atoms 23
Figure 5 Ion strike on drain of an off-NMOS transistor 25
Figure 6 Integral GCR flux versus LET for geosynchronous orbit 27
Figure 7 Trapped charge in gate-oxide 29
Figure 8 Edgeless or annular transistor layout 33
Figure 9 32 bit word-error probability comparison curves 39
Figure 10 Word error probability for different ECC codes versus raw BER 42
Figure 11 Codeword failure probability versus ratio of scrub rate to
physical bit error rate for a 16-bit information word protected by
various ECC 46
Figure 12 Effective BER reduction factor versus SR/BER for various
strength ECC 48
Figure 13 Scrub rate versus physical BER needed to achieve 10^-10 effective BER for different ECC schemes applied on 16-bit data word 50
Figure 14 Spatial overhead vs block size for different ECCs 53
Figure 15 Scrub rate vs. spatial overhead trade-offs for desired effective BER of 10^-10 errors/bit-day 54
Figure 16 Spatial-temporal redundancy curves for different physical BER for a 16-bit data word to achieve 10^-10 effective BER 55
Figure 17 3D simulation current pulses for LET < 1 MeV-cm^2/mg 64
Figure 18 SRAM cell with off-nmos strike model 66
Figure 19 Ion current pulse profiles for different models 68
Figure 20 NAND gate off NMOS ion strike model 78
Figure 21 Minimum charges required to observe SET 81
Figure 22 SET pulse widths for large LET strikes on ASIC gates 82
Figure 23 SET pulse widths with process corner variations 84
Figure 24 Struck node voltage for large LET ion strikes 86
Figure 25 SET pulse width comparison 88
Figure 26 Triple Modular Redundancy applied on logic blocks 92
Figure 27 Temporal Sampling Latch 94
Figure 28 Delay Filter operation 95
Figure 29 Transient appearing at the D_in somewhere between T_critical 95
Figure 30 Delay Filtered DICE Latch 97
Figure 31 Layout of DF-DICE latch 98
Figure 32 DF-DICE speed penalty versus clock frequency 100
Figure 33 Overall area comparison for scalable DF-DICE 103
Figure 34 SEC decoder 106
Figure 35 Systematic generator matrix of DEC (31, 21) & (26, 16) 110
Figure 36 Encoder circuit for DEC (26, 16) 111
Figure 37 Block diagram of BCH decoder 113
Figure 38 Minimizing decoder latency overhead on system performance 118
Figure 39 Top-level layout of the LP SRAM Chip 122
Figure 40 Top-level layout of the SF SRAM Chip 123
Figure 41 FPGA tester card and cube of daughter cards 125
Figure 42 Test chip mounted at the daughter card for radiation beam 126
Figure 43 Baseline SRAM raw SEU cross-section versus LET for LP chip. 128
Figure 44 Hardened SRAM raw SEU cross-section versus LET for LP
chip 128
Figure 45 Baseline SRAM raw SEU cross-section versus LET for SF Chip 129
Figure 46 Hardened SRAM raw SEU cross-section versus LET for SF
chip 129
Figure 47 Cross-section versus LET for LP and SF chips 132
Figure 48 Single and Multi Bit Upset distributions versus Effective LET
(a) LP SRAM (b) SF SRAM 134
Figure 49 TID responses of the LP and SF SRAMs, showing the
normalized leakage current per bit, as a function of cumulated
dose. 138
Abstract
With aggressive technology scaling, radiation-induced soft errors have become
a major threat to microelectronics reliability. SRAM cells in deep sub-micron
technologies particularly suffer most from these errors, since they are designed with
minimum geometry devices to increase density and performance, resulting in a cell
design that can be upset easily with a reduced critical charge (Q_crit). Though single-bit
upsets were long recognized as a reliability concern, the contribution of multi-bit
upsets (MBU) to overall soft error rate, resulting from single particle strikes, is also
increasing. The problem gets exacerbated for space electronics where galactic cosmic
rays carry particles with much higher linear energy transfer characteristics than
terrestrial radiation sources, predictably inducing even larger multi-bit upsets. In
addition, single-event transients are becoming an increased concern due to high
operational frequencies in scaled technologies. The conventional radiation hardening
approach of using specialized processes faces serious challenges due to its low-volume
market and lagging performance compared to commercial counterparts. Alternatively,
circuit-based radiation hardening by design (RHBD) approaches, where both memory
cells and control logic are hardened, may incur large area, power and speed penalties if
careful design techniques are not applied.
We develop a framework to harden SRAMs against these soft errors as
technologies scale further in sub-100nm regime. This framework is based on a hybrid
approach that combines spatial and temporal redundancy in a mathematical model to
efficiently mitigate the effects of these transient errors. In particular, the SRAM cells
are left intact to exploit the benefits of state-of-the-art commercial processes, while
error correcting codes (ECC) and periodic memory scrubbing are used to obtain an effective
bit error rate (BER) that improves on the intrinsic physical bit error rate of the process. Only the
peripheral circuitry is hardened using RHBD techniques, incurring an overall minimal
penalty. This model-based hardening requires a careful characterization for
establishing a clear problem domain. Therefore, we also develop simulation based
methodologies to quickly estimate and characterize these transient effects. Using the
guidelines provided by this characterization, two prototype SRAM ICs have been
designed and fabricated. The effectiveness of the proposed model is demonstrated by
the results of performing radiation tests on the fabricated prototype SRAM ICs.
Chapter 1
Introduction
With technology scaling, static random access memory, or SRAM, has seen
exponential growth in its usage due to its high-performance characteristics, which
directly enhance the performance of the computing paradigm in which it is
incorporated, such as micro-processors/micro-controllers, digital signal processors
(DSP) or field programmable gate arrays (FPGA). SRAM cells are designed with
minimum geometry devices to increase packing density and performance. On the other
hand, these reduced cell dimensions coupled with lower supply voltages adversely
affect SRAM reliability as technology scales. With increasing usage of SRAM in a
system, the reliability of the computation system heavily depends on the reliability of
the SRAM sub-system, since a memory failure can result in a critical system failure.
Ionizing radiation, in terrestrial and more particularly in space environments,
poses new threats to micro-electronics system reliability. The two major radiation
effects on micro-electronics are total ionizing dose (TID) and single-event effects
(SEE). Total ionizing dose represents the cumulative effect of ionizing radiation, which
results in device parametric variations and may eventually cause the device to cease
functioning after extended exposure. Nevertheless, the TID tolerance of semiconductor
devices is improving with technology scaling, and TID may be further mitigated by
screening and shielding; it is therefore not addressed in this work.
Alternatively, the sensitivity of semiconductor devices to single-event effects (which
are a direct consequence of a single strike by high-energy particles, e.g., protons,
neutrons, alpha particles or other heavy ions), especially single-event upsets (SEU)
and more recently single-event transients (SET), is increasing and results in soft errors
(transient errors that can be corrected by re-writing the data): a major concern for
memory reliability and its longevity.
Single-event effects can generally be addressed by two approaches: by using rad-
hard devices manufactured through specialized processes or by employing fault-
tolerant techniques. Chips fabricated through rad-hard processes are costly due to the
specialized nature of these processes. They also greatly suffer from poor performance
and low packaging densities as these specialized processes lag their commercial
counter-parts by at least two technology generations. Fault-tolerance mitigation
techniques potentially allow the use of state-of-the-art high-performance technologies.
This approach offers several benefits, including high performance, low cost, and the
availability of a wide range of industry-standard tools and techniques, resulting in
reduced design-development time. Within the aerospace industry, the drive to
reduce the cost of high-performance spacecraft under budgetary constraints makes the
application of commercial micro-electronics more attractive. Furthermore, the fear that
the low-volume aerospace industry will fail to advance its rad-hard processes at the pace of
technology scaling, and that commercial processes will be the only lasting processes in
the future, makes it more urgent to investigate efficient fault-tolerant or radiation
hardening by design (RHBD) techniques to mitigate these soft errors. The goal of this
research is to devise efficient soft error mitigation techniques for SRAMs using
commercial processes with as minimum area, timing and power overheads as possible.
Particularly, this research characterizes various aspects of the soft error phenomenon
and develops a framework for mitigating soft error effects on system reliability
without requiring any changes in the underlying integrated circuit fabrication
processes.
1.1 Background
Interestingly enough, the first paper to discuss the issue of single-event upsets
in microelectronics was written in 1962 to assess the effects of
technology scaling in terrestrial microelectronics [114]. The authors predicted that
scaling would limit the minimum feature size of integrated devices to about 10 μm due
to these upsets! The first confirmed cosmic-ray induced upsets in space electronics
were reported in 1975 by Binder, et al [13], where 4 upsets were observed in bipolar J-
K flip-flops during 17 years of operation of a communication satellite. Soon after, in
1978, May and Woods (Intel Corp.) reported alpha-particle-induced soft errors in
dynamic memories (DRAM) in terrestrial environments [71]. The causes of these soft
errors were traced to alpha particle contaminants in the chip packaging material
manufactured at a factory built downstream from an abandoned uranium mine [121].
The problem was contained by using on-chip shielding and low-activity materials for
fabrication [70], and this solution subdued the issue of terrestrial soft errors for quite
some time.
In over five decades of aggressive scaling, CMOS transistor technology has
rapidly progressed towards nanotechnology-scale feature sizes. However, this
continuous shrinking of device dimensions has also strongly affected device and
circuit radiation sensitivity [107]. SRAM circuits in particular have experienced a
steady decrease in cell nodal capacitances and operating voltage margins [6]. This
trend has reduced both the minimal critical charge (Q_crit) [27] and corresponding linear
energy transfer (LET) threshold [29] capable of corrupting a stored bit. This
characteristic has drastically increased the candidate population of ionizing particles
which can affect memory cells, both directly and indirectly [4], and therefore also the
likelihood of bit upsets. The frequency of single-bit upsets, and more recently multi-bit
upsets, is now a major reliability concern in commercial electronics as observed in
[107] (Sun Microsystems), [6] (Texas Instruments), [27] (Virage Logic Corporation),
[109] and [68] (Intel), [96] (Cypress Semiconductor), [72] (IBM).
The increased integration density of modern chips raises the likelihood of
having a large number of soft errors per chip. For certain ground-level applications,
e.g., high-end servers, networking switches and database applications, critical life
support system and industrial control, the reliability of the computing machine is as
important as its performance [76][87]. The problem becomes even more acute for
aircraft and space electronics where high-energy neutrons (at high altitudes), as well as
protons and heavy ions (in space) are more abundant [88]. FPGAs are becoming
popular in space electronics for their reconfigurable architecture yet these FPGAs
suffer greatly from soft errors because they are mostly composed of soft error
susceptible SRAMs [95][55]. Technology scaling has also exacerbated the reliability
problem even for terrestrial applications. For instance, Sun Microsystems learned a
lesson the hard way when its enterprise servers failed in the field due to soft errors in
cache memory, and it faced the consequences of major customer dissatisfaction
[22][110]. It is now believed that the threat to a system’s reliability from soft errors is
more than all other failure mechanisms combined [7]. Due to the increasing severity of
the soft error problem, there is a growing trend in the community to adopt soft error
rate as a design parameter along with the more common power, area and speed
parameters [56].
During the cold war era, the space community enjoyed considerable financial
support from government funding, and so radiation hardening was accomplished by
the use of specialized processes at hardening-by-process (HBP) foundries. These
processes provided hardness both for total dose as well as single-event effects.
However, these processes are extremely costly as compared to commercial processes,
both in terms of dollar amount per function and performance and integration density
per package. These penalties result from the complexity of the process as well as
conservative design rules. Given that the aerospace market is a low-volume market,
the cost of shifting to a new process becomes exorbitant as commercial technology
scales according to Moore’s law. As a result, hardening-by-process technologies lag
commercial technologies generally by two generations, and there is a fear, especially
in the space electronics community, that commercial processes will become the only
available process choice in the future. These considerations gradually led to the use of
more commercial-off-the-shelf components in space and defense electronics [32].
The convergence of increased reliability issues in scaled technologies with
economic and state-of-the-art technology constraints has attracted the interest of both the
mainstream circuit design community and the space and defense communities. Therefore,
numerous circuit-level and architecture-level solutions have been proposed over time.
Initial hardening approaches considered adding resistive and/or capacitive components
to slow the feedback from upset strikes [3][103][89][99][91]. Since the upset process
is similar to a write process for memory elements, these techniques eventually affect
the performance of the memory circuit. The trade-offs associated with these
techniques (accurately controlling the polysilicon resistor value, its temperature
dependence, and the associated decrease in speed) have inhibited their
usefulness. Circuit-level hardening approaches rely on redundant circuit elements
(usually 12-16 transistors per 6T standard SRAM cell) to mitigate upsets
[105][20][98][116][65][118][113]. Among these solutions, the dual interlocked
storage cell (DICE, [20]) has found considerable usage in many applications.
Recently, Intel has applied a modified DICE latch to harden its core logic in their
latest quad-core processors [59]. Since these storage cells use a higher number of
transistors per cell, the result is an increased area and hence power consumption.
Furthermore, due to reduced dimensions in scaled technologies, the separation
between the original transistors and redundant transistors within a cell is much
smaller, and the redundant nodes may also be vulnerable to upset from the same strike
if care is not taken with respect to the separation distance between redundant nodes,
reducing the effectiveness of these techniques.
At the architectural front, designers adopted the use of error detection and
correction (EDAC) schemes for mitigating soft errors from memories
[74][73][107][93][92][16][23]. As a result, the use of a simple parity check became
prevalent in most memories. Furthermore, error correction schemes are now being
used where reliability of a system is critical [107]. In some cases a single-error-
correcting (SEC) error-correcting code (ECC) is also used to mitigate the effect of a
complete chip failure when the main memory is organized in 1-bit wide banks of
memory chips [25]. Here the complete failure of a chip results in a single-bit failure
for each word, and hence, it can be corrected by a SEC ECC scheme. On the other
hand, the use of error-correction schemes for on-chip memories is limited mostly to
parity check and in a few cases to single error correcting ECC due to the extra
overheads in increased spatial redundancy and decoding latency required for powerful
error correction schemes [109]. For highly critical scenarios, the use of triple modular
redundancy is another favorite technique among system designers, though it results in
at least 200% area and power overheads. Periodic memory scrubbing, where memory
contents are read periodically and corrected for any induced errors between scrub
cycles before being written back, is another technique to reduce the accumulation of
errors within memory. There are few attempts to model the reliability improvement of
DRAM memories by applying ECC, and in some cases scrubbing, using Markov
chains and other complex mathematical models [38][100][106][112]. Due to the
complexity of these models, the trade-offs for obtaining a desired reliability target
become intractable. Therefore, these models have not been generally applied to arrive at
a solution for mitigating soft errors in SRAMs. Another factor inhibiting the
development of a comprehensive model for SRAM soft error mitigation was the
observation that the probability of having multiple errors is extremely small, and error-
correcting techniques applied so far already cater to single errors [109]. However, due
to increased integration densities and smaller geometries of SRAM cells, the error
rates for memories are so high in scaled technologies that they cannot be effectively
mitigated by single-error correction techniques [5] alone. Therefore, this work focuses
on developing a comprehensive framework for mitigating soft error effects on system
reliability with a given requirement that all solutions considered require no changes in
the underlying integrated circuit fabrication processes. This framework makes it
possible to characterize trade-offs between spatial and temporal redundancies for
achieving desired reliability levels for memory subsystems for different application
scenarios. The following section introduces this framework.
1.2 The Framework
The usage of commercial electronic components offers a low-cost, high-
performance and high-integration density solution, which is very attractive for space
applications. This research focuses on radiation-hardening-by-design techniques that
allow commercial SRAMs to be used cost-effectively in space applications. The
framework to efficiently mitigate soft errors for SRAMs spans the whole gamut from
error rate estimation in a particular environment and identification of an efficient
solution to the practical and cost-effective implementation of the identified solution.
The diagram in Figure 1 shows the major building blocks of this framework. In
particular, the sensitivity of the design is identified through rigorous SPICE
simulations employing some empirical models for currents resulting from ionizing
particle strikes. The selection of an empirical model is corroborated with physics-
based 3D device simulations as well as supporting data from prior published research
results. This sensitivity data along with device physical geometries and environment
models is fed to a soft error rate estimation tool. This tool also models the effects of
commonly employed shielding mechanisms and generates a raw bit error rate (BER)
for the basic design cell. This raw BER serves as a starting point for the selection of
mitigation techniques. An ECC and scrubbing based model is developed that relates
the desired effective bit error rate to this raw BER. This model provides trade-offs
between the power of the ECC scheme used and the scrubbing rate to be utilized to
alleviate soft errors. The power of the ECC scheme, in turn, can be related to the
redundancy required to implement that particular ECC scheme, and hence the spatial
(area) overhead for the adopted mitigation technique. In addition, the temporal
overhead for the mitigation model can be implied by observing that when a memory
location is being scrubbed, it is not available for normal operation. Based on the
overall system requirements, these trade-offs result in the selection of an efficient solution.
Figure 1 Block diagram of the Framework
Approaches to implement the proposed solution are presented, and various
trade-offs are analyzed through case study implementations of these mitigation
techniques for most practical scenarios. In particular, the key contributions of this
work are:
A simulation based methodology to predict SEU rates for SRAMs
SEU rates are generally predicted by subjecting SRAMs to accelerated
radiation testing at ground level. This method of predicting SEU rates is quite costly as
it involves the actual manufacturing of parts and then radiation testing at scarcely
available test facilities such as [18][15]. To design commercial SRAMs that qualify for
space applications, it is desirable to know the expected SEU rates for the SRAM in the
technology of interest a priori. The cost and time efficiency of a simulation based
methodology to predict SEU makes it an attractive complement to ground-level
testing.
The major parameter defining a circuit’s response to single-event strikes is its
ability to maintain its current state. This ability of the device is overcome when
ionizing radiation deposits significant charge above a certain threshold to change the
state of the device [32]. This threshold charge is termed as critical charge (Q
crit
) and is
a key parameter for determining the error rate of a memory structure in any radiation
environment. Therefore, different ways to compute this critical charge and identify
fast and efficient circuit simulation methods for its estimation are explored [85]. SEU
rate prediction for space environments requires a space radiation environment
modeling tool. For this purpose we use CREME96 [24], a web based tool available
from the Naval Research Laboratory, to model space environments and radiation
effects on micro-electronics. Using these simulation tools, SEU rates can be predicted
before designing actual SRAM circuits and selecting a soft error mitigation solution,
making designs cost-effective and time-efficient. Using this simulation methodology,
soft error rates for predictive future technologies can also be estimated.
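As an illustration of how such a simulation-based critical-charge estimate can be organized, the sketch below bisects on the injected charge to find the smallest value that flips the cell. It is only a schematic outline: the cell_upsets callback is a hypothetical stand-in for a circuit (SPICE-level) simulation that injects a current pulse carrying the given charge at the struck node and reports whether the stored value changes; it is not the specific methodology or tool flow used in this work.

```python
def estimate_qcrit_pc(cell_upsets, q_low_pc=0.0, q_high_pc=1.0, tol_pc=1e-3):
    """Bisection search for the critical charge (in pC).

    `cell_upsets(q)` is assumed to wrap a circuit simulation that injects an
    ion-strike current pulse carrying total charge q (pC) at the sensitive node
    and returns True if the SRAM cell flips. The bracket [q_low_pc, q_high_pc]
    must straddle the upset threshold.
    """
    assert not cell_upsets(q_low_pc) and cell_upsets(q_high_pc)
    while q_high_pc - q_low_pc > tol_pc:
        q_mid = 0.5 * (q_low_pc + q_high_pc)
        if cell_upsets(q_mid):
            q_high_pc = q_mid   # upset observed: Q_crit is at or below q_mid
        else:
            q_low_pc = q_mid    # no upset: Q_crit is above q_mid
    return q_high_pc            # smallest charge found to cause an upset
```

The resulting Q_crit estimate, together with the device geometry, is the kind of sensitivity input that a rate-prediction tool such as CREME96 consumes.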
A hybrid simulation approach to characterize SET
Single-event transients (SET) appearing in combinational logic due to ionizing
radiation are becoming a major detractor for system reliability [34]. A direct
characterization approach of SET through accelerated testing remains considerably
challenging, as it must address issues such as isolating probe capacitance from DUT
(device under test) measurements and the high oscilloscope sampling rates required to
capture very small transients. Therefore, only a recent effort has been able to
directly measure transient currents on relatively large devices [37] compared to
numerous indirect measurements [11][34][42]. As an alternative to experimentation,
3D mixed-mode simulations have been shown to accurately predict transient effects
from single event strikes [33]. Yet these mixed-mode simulations are computationally
intensive and can practically simulate only very small circuits. Our proposed hybrid
simulation approach [83] combines the benefits of 3D device simulations and SPICE
circuit simulation. This hybrid approach accurately models the localized ion strikes on
circuits using 3D device simulations. The characterized ion current pulse from a single
strike is then used in SPICE simulations for detailed SET characterization in numerous
scenarios. This approach results in significant savings in cost and characterizes SET
phenomena time-efficiently.
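A common empirical waveform for injecting a characterized strike into SPICE-level simulations is a double-exponential current pulse. The snippet below is a generic sketch of that waveform only; the default time constants are illustrative placeholders, not values extracted from the 3D device simulations described here.

```python
import math

def ion_strike_current_amps(t_ps, q_total_pc, tau_rise_ps=5.0, tau_fall_ps=150.0):
    """Double-exponential ion-strike current (in amperes) at time t_ps (picoseconds).

    The waveform integrates to q_total_pc picocoulombs of collected charge.
    The rise/fall time constants are placeholders for illustration.
    """
    if t_ps < 0:
        return 0.0
    i_scale = q_total_pc / (tau_fall_ps - tau_rise_ps)  # pC/ps is numerically amperes
    return i_scale * (math.exp(-t_ps / tau_fall_ps) - math.exp(-t_ps / tau_rise_ps))
```

In the hybrid flow described above, the pulse shape and collected charge would instead be taken from the 3D device simulation of the strike and then replayed in SPICE across the circuit scenarios of interest.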
Radiation hardening by design techniques to mitigate SET from combinational
logic
Soft errors in combinational logic can be particularly catastrophic as they can
result in what is known as a single-event functional interrupt (SEFI). A SEFI represents a
situation where some critical system control signal is erroneously activated due to soft errors. In
addition, an upset in a flip-flop or latch may spread through the system and result in
data corruption, which may go unnoticed initially. We have developed cells (DF-
DICE [82][79]) that mitigate soft errors from combinational logic. This mitigation is
also important from the standpoint of SRAM radiation hardening because every
memory system has peripheral logic, which consists of combinational logic in general
and some sequential logic if the periphery is pipelined. Transient errors in SRAM
periphery logic may result in correlated errors in the memory array, which may not be
overcome by simple error protection schemes applied to memory arrays. In addition,
the ECC encoder and decoder are also subject to these transient errors. This effect may
jeopardize the benefits of ECC-model-based hardening if proper mitigation measures
for these transients are not adopted.
A model for characterizing spatial and temporal redundancy trade-offs for
meeting target reliability in SRAMs
Commercial SRAM technology offers high-performance and high-density
memory. Any technique that tries to improve the radiation hardness of SRAMs by
modifying the SRAM cell structure potentially cancels out the benefits of using
commercial off-the-shelf (COTS) components. This is because SRAM cells are
typically designed with a layout that uses minimally sized devices requiring special
design rule waivers from the foundry. Any modification to the cell structure loses the
ability to exploit these design rule waivers, increasing the cell area drastically due to
the application of conservative logic rules. Furthermore, this per-cell increase in area
potentially results in a huge increase in area for the memory array. Therefore, the
proposed framework for making radiation-tolerant SRAMs leaves the foundry-
distributed SRAM cell design intact. Rather than changing the SRAM cell design to
mitigate errors, it employs a combination of error correcting codes and a scrubbing
technique to achieve a desired reliability [5]. Further improvement in overhead
reduction is accomplished by converting the multi-bit upsets from a single strike to
single-bit upsets by a bit-interleaving technique. Bit interleaving maps logical bits in a
word to physically separated bits in a row, implying that physically adjacent cells
belong to different logical words. A model that relates temporal and spatial
redundancy trade-offs required for making commercial SRAMs radiation-tolerant is
developed. The inputs to this model are the physical/raw bit error rate and desired
effective bit error rate for a specific application. This model identifies various
temporal and spatial redundancy trade-offs possible to achieve a desired reliability.
The guidelines provided by the model result in an efficient selection of temporal and
spatial redundancy combinations for meeting target reliability goals.
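As a concrete picture of the bit-interleaving idea mentioned above, the toy mapping below places each bit of a logical word several columns away from its neighbors, so that physically adjacent cells belong to different logical words. The interleave factor and the mapping itself are illustrative choices, not the layout used in the fabricated chips.

```python
def physical_column(word_in_row, bit_index, interleave_factor=4):
    """Map bit `bit_index` of logical word `word_in_row` (0..interleave_factor-1)
    within a row to a physical column index, so that adjacent columns hold bits
    of different logical words."""
    return bit_index * interleave_factor + word_in_row

# With 4-way interleaving, bit 0 of words 0..3 sits in columns 0..3, bit 1 in
# columns 4..7, and so on. A multi-bit upset that spans a few adjacent columns
# therefore corrupts at most one bit in any single logical word, which a
# single- or double-error-correcting code can then repair.
```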
A fully parallel implementation technique for a double error correcting BCH
code suitable for SRAM memories
Exacerbated SRAM reliability issues, due to soft errors and increased process
variations in sub-100nm technologies, limit the efficacy of conventionally used ECC
schemes. The framework invites the application of more powerful ECC schemes to
meet target reliability goals. Double error correcting (DEC) BCH (Bose-Chaudhuri-
Hocquenghem) codes offer a minimum redundancy choice for such powerful codes.
On the other hand, the large multi-cycle latency of conventional iterative decoding
algorithms for these DEC BCH codes has precluded their use in bandwidth-hungry
memory applications. Therefore, a new parallel implementation approach for DEC
decoders is developed that operates on complete memory words in a single cycle. The
practicality of this approach is demonstrated through ASIC (application specific
integrated circuit) implementations of DEC decoders for typical memory word sizes.
To evaluate the effectiveness and trade-offs, a comparative analysis between
traditional ECC and DEC codes for reliability gains and costs incurred has also been
performed. This implementation architecture enables radiation hardening for SRAM
memories by simply employing the proposed ECC model.
Implementation and Evaluation
Two test SRAM ICs (integrated circuits) have been designed using the
techniques suggested by the framework. To evaluate the effectiveness of the
mitigation model, each SRAM IC consists of two modules. One module is a baseline
SRAM without any mitigation for radiation effects and serves as a basis to evaluate
the radiation response of the process. The second module within the same IC is a
hardened SRAM that employs an ECC model based hardening approach. The ICs
have been fabricated in two different processes in IBM 90nm technology. One IC used
a single error correcting code while the other employed a double error correcting code.
Due to different ECC schemes employed, each IC provided a different data point for
validating the model with different scrubbing rate requirements. The radiation tests
performed on the manufactured devices validated the effectiveness of the developed
model [86].
1.3 Scope
This work addresses the reliability concerns associated with single-event
upsets and single-event transients of commercial microelectronics. Although the
techniques presented in this work are generic in nature and can be applied to mitigate
faults in other domains, the work focuses only on radiation hardening of SRAM
designs in particular. Although we present radiation test results for our manufactured
SRAM devices, including some single-event latchup and total dose results, this work
does not specifically address the reliability concerns associated with total dose effects,
single-event latchup effects, and other reliability issues arising from various
manufacturing defects and/or operational and aging faults.
1.4 The Roadmap
The dissertation starts with a brief description of radiation effects on
microelectronics in chapter 2, introduces some of the common terminology used
throughout this dissertation, and briefly summarizes existing solutions used in the
literature. Probability analysis from the theory of error correcting codes plays an
important role in the development of this framework. This analysis, performed in
chapter 3, indicates that simple error protection schemes commonly applied in today’s
design practices are not sufficient to meet desired reliability levels, in particular, for
space applications. This leads to the development of a model to apply temporal
redundancy to obtain desired effective error rates. Therefore, chapter 3 also presents a
model for the efficient mitigation of soft errors that is developed for this framework.
This model is based on a comprehensive probabilistic analysis and employs spatial
redundancy in the form of extra bits for an ECC scheme and temporal redundancy in
the form of periodic scrubbing. Simulation methodologies and modeling techniques to
characterize the severity of the single-event transient and single-event upset
phenomena are described in chapter 4. Using these simulation methodologies and
modeling techniques, error rates and single-event transient pulse widths are
characterized for current and scaling technologies. This characterization clearly
establishes a problem domain and provides guidelines for validating the effectiveness
of various solutions. Chapter 5 presents radiation-hardening-by-design techniques to
mitigate transient errors from combinational and sequential logic as well as analyzes
their cost-benefit trade-offs.
Chapter 6 presents architectures for efficient implementations of various ECC
decoders. It discusses implementation architectures for existing single error-correcting
Hamming and Hsiao codes. In addition, it also presents a new parallel implementation
approach for double error-correcting BCH codes, which is suitable for SRAM memory
applications. ASIC implementation results are presented to demonstrate the
practicality of this approach for typical memory word sizes. Based on the developed
framework, two SRAM chips were designed. The designed ICs were manufactured
and tested for radiation performance, verifying the effectiveness of the proposed
model. Chapter 7 describes the details of these designs and presents radiation test
results. Chapter 8 presents conclusions and discusses limitations and extensions of the
current research.
Chapter 2
Radiation Effects on Micro Electronics
The radiation sensitivity of MOS (metal oxide semiconductor) devices was
initially discovered as early as the 1960s at the Naval Research Laboratory (NRL) [53],
and it was predicted in 1962 that device scaling would be limited due to upsets from
terrestrial cosmic rays [114]. Over time, it has been observed that the radiation environment
in space significantly affects the performance and functionality of space-borne
electronic devices. In deep sub-micron technologies, even alpha particles from
packaging material and terrestrial cosmic-rays (consisting of neutrons) may result in
soft errors [8]. This chapter describes ionizing particle sources that are responsible for
causing errors in computing systems and malfunctioning of electronic devices,
particularly in spacecrafts. The chapter also discusses various effects resulting from
these ionizing particle sources and defines key terminology that is used throughout this
dissertation. It also briefly summarizes various solutions adopted to mitigate these
effects.
2.1 The Radiation Sources
The space radiation environment consists of high energy ion sources,
magnetic fields and plasma interactions. The terrestrial radiation sources include
alpha particles from chip packaging materials, terrestrial cosmic rays and thermal
neutrons. This section provides an overview of these sources and environments
and further details can be found in [50][8].
Galactic Cosmic Rays
The major source of radiation particles in space is known as galactic cosmic rays
(GCR). Though the actual origin of GCR is still unknown, it is believed that they are
generated somewhere outside the solar system (hence the name galactic) and are
accelerated by the shockwaves of novae. High-energy protons account for almost 90%
of galactic cosmic rays, followed by 9% alpha particles; the remainder consists of
other heavy ions (Z > 1) and electrons. GCR provides a continuous and isotropically
distributed low-flux spectrum of ionizing radiation. Figure 2 (derived using the
CREME96 tool) shows a typical spectrum of flux versus ion kinetic energy for
different ion species under solar-quiet conditions, when GCR is the maximum
contributor to the space radiation flux. The constituent ions of GCR have very high energies,
on the order of 100's to 1000's of MeV/amu (atomic mass unit) [67]. Therefore they
penetrate very deeply into target materials, and secondary collisions can
create showers of other ionizing particles. The low-energy part of the spectrum can be
attenuated by shielding but no practical amount of shielding can stop high-energy
particles.
Figure 2 Maximum GCR spectrum for geosynchronous orbit
Solar Cosmic Rays
Solar cosmic rays are produced by the sun. They are similar to GCR but have
lower energies, and their mass distribution is skewed toward lower-mass ions. Large
fluences are experienced during solar proton events, which result from coronal mass
ejections (CME). Periods of high solar activity alter the energy and flux of
these particles, resulting in solar flare conditions. Because of their relatively low
energy spectrum, they can be significantly attenuated by applied shielding.
Van Allen Belts
Earth's magnetic field acts as a shield against incoming cosmic rays. This
magnetic field traps charged particles along its field lines, resulting
in trapped radiation belts, also known as the Van Allen belts. Satellites operating in
low earth orbits are subject to these belts, and therefore careful analysis must be
performed to assess the severity of radiation for electronics used in these satellites. In
addition, a portion of the trapped radiation belts extends inward toward earth due to the
dipole magnetic geometry and is known as the South Atlantic Anomaly. In this region, the
trapped belts reach their lowest altitude, and satellites passing through it
encounter an increased flux of protons. Figure 3 (source: NASA Marshall Space
Flight Center) shows the earth's magnetosphere, highlighting the trapped radiation belts.
Furthermore, other planets also have magnetic fields, and trapped radiation belts
form around their field lines as well.
Solar Wind
The solar wind provides a constant flux of relatively low-energy electrons and
protons. These have very little penetrating ability and thus affect only exposed
surfaces of spacecraft (e.g., thermal protection, solar panel cover glasses, and optical
elements).
Figure 3 Radiation environment around earth
Atmospheric Showers
When galactic cosmic rays enter the earth's atmosphere, they interact with oxygen
and nitrogen atoms, generating daughter particles. The main products of
this interaction are high-energy protons and neutrons, though other particles such as
pions, muons, and gamma rays are also generated, as shown in Figure 4 [121]. These are
also known as secondary cosmic rays or terrestrial cosmic rays. The flux of these
particles varies with altitude, as the atmosphere absorbs most of the particles.
Figure 4 Secondary particles generated by the interaction of cosmic rays
with atmospheric atoms
The flux of these cosmic-ray-induced neutrons is at its maximum at an altitude of about 60,000 feet
and at its minimum at sea level. These neutrons are the main source of single-event
effects in terrestrial and aircraft applications.
Impurities in the manufacturing materials
Certain impurities in manufacturing materials of integrated circuits also result
in ionizing radiation particles. The chief among these are alpha particles from
soldering materials, especially in flip-chip packaging. The alpha particle consists of
two neutrons and two protons, i.e., a doubly charged helium ion (⁴He²⁺). The alpha
particles released from packaging impurities generally contain kinetic energy in the
range of 4-9 MeV [8]. In silicon, an average energy of 3.6 eV is required to generate a
single electron-hole pair. Therefore, an alpha particle passing through sensitive circuit
nodes can generate a significant amount of charge that can potentially upset the stored
value on the node. In addition, the interaction of neutrons from terrestrial cosmic rays
with boron atoms also creates secondary radiation particles that can affect electronics.
In semiconductor electronics, boron is found in an insulating glass used in fabrication
(BPSG - Borophosphosilicate glass).
2.2 Radiation Effects
The impact of radiation has historically been categorized into two groups: one
reflects the effects over a long period of time, termed total ionizing dose (TID), and
the other, known as single-event effects (SEE), is the immediate result of a single
energetic charged particle, whether from heavy ions in space or alpha particles from
packaging material. This section provides definitions and acronyms for some of the
basic radiation effects.
Figure 5 (taken from [101]) shows an ion strike passing through the drain of an
off-NMOS transistor, creating a track of electron-hole pairs. When this track of
electron-hole pairs passes through a reverse-biased depletion region, it creates a short
between the drain of the off-NMOS transistor, which holds a logic value '1', and the
p-well/substrate, which is at ground potential. This short circuit momentarily discharges the drain node
resulting in a transient spike at that node. Note that a funnel region is also created
along the path of the track, which results in extending the charge collection region.
Similar effects happen when a strike passes through OFF-PMOS and the
corresponding n-well/substrate region. When a similar strike happens in a non-sensitive
region, such as the substrate, the generated electron-hole pairs recombine without
any noticeable effect on the circuit.
Figure 5 Ion strike on drain of an off-NMOS transistor
A term of significant importance is linear energy
transfer (LET) which will be used throughout this dissertation. Therefore, we first
describe this term before providing definitions for radiation effects.
Linear Energy Transfer
Space radiation environments are typically plotted as a spectrum, which is a
function of ion energy and species, as was shown in Figure 2. For single-event effect
studies, the individual fluxes of each species are combined into an integral flux, and it
is plotted against the LET (linear energy transfer) of the ions as shown in Figure 6
(derived using CREME96 tool). LET expresses the relevant characteristic of a
particle’s passage through material: a particle gives up energy as a function of the
distance it travels through the material and the density of the material. The respective
units for this definition are energy lost per unit track length per unit mass density.
Therefore LET is usually expressed in MeV-cm^2/mg [(MeV/cm)/(mg/cm^3)].
Alternatively, LET is also defined as the amount of charge that is deposited/created
per unit length by an ion, and the relevant units for this use are pC/um. LET units of
MeV-cm^2/mg can be converted to equivalent charge deposited per unit length (pC/um) as follows.
It takes 3.6 eV to generate one electron-hole pair in silicon. The charge on an
electron is 1.6 * 10^-19 C and the silicon density is 2.33 g/cm^3.
1 MeV of energy loss generates charge = (10^6 / 3.6) * 1.6 * 10^-19 C = 0.444 * 10^-13 C
1 MeV energy loss per unit cm generates charge = 0.444 * 10^-13 C/cm
1 MeV/cm energy loss through unit mass density of silicon, i.e., (1 MeV/cm)/(mg/cm^3), generates charge = 0.444 * 10^-13 * 2.33 * 10^3 C/cm
Hence an LET of 1 MeV-cm^2/mg generates charge = 1.036 * 10^-10 C/cm = 1.036 * 10^-14 C/um = 0.01036 pC/um, or ~10 fC/um.
Therefore, the charge generated in silicon over a length "L" in microns by an ion of a given LET (MeV-cm^2/mg) can be expressed as:
dQ (pC) = 0.01036 * L (micron) * LET (MeV-cm^2/mg) (1)
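Equation (1) is simple enough to capture in a few lines; the helper below is a minimal sketch (the function name and the example values are ours, chosen only to check the arithmetic).

```python
def deposited_charge_pc(path_length_um, let_mev_cm2_per_mg):
    """Charge (pC) deposited in silicon over `path_length_um` microns of track
    by an ion with the given LET (MeV-cm^2/mg), per equation (1)."""
    return 0.01036 * path_length_um * let_mev_cm2_per_mg

# Example: an ion with LET = 10 MeV-cm^2/mg traversing 1 um of silicon deposits
# about 0.01036 * 1 * 10 ~= 0.10 pC (roughly 100 fC), consistent with the
# ~10 fC/um figure derived above.
print(deposited_charge_pc(1.0, 10.0))
```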
Figure 6 Integral GCR flux versus LET for geosynchronous orbit
2.2.1 Total Ionizing Dose (TID)
When an energized particle impacts a semiconductor, it deposits a small
electrical charge in the silicon. The charge often migrates off the semiconductor,
but over time it builds up on the device. Once the part has accumulated
enough excess charge, it will cease to function. Total ionizing dose is the amount of
radiation or energy that a semiconductor can absorb over time until it ceases to function.
TID is measured in rad(Si), i.e., radiation absorbed dose relative to silicon. Typical
dose rates seen by an electronic component in a space environment range from less than
one krad(Si)/yr (equatorial LEO with reasonable shielding or interplanetary at solar
minimum), to 5 krad(Si)/yr (LEO with minimal shielding or interplanetary at solar
maximum), and 20 krad(Si)/yr (GEO).
The primary degradation mechanism for TID is the trapped charge in the
insulating layers, both gate and field oxides (SiO2). Oxide-trapped charge arises as
impurities in the oxide act as hole-traps, allowing electrons to be swept to an electrode
but generating a charge buildup as holes remain behind in the lattice. Interface-trapped
charge is the charge buildup at the interface between active silicon and the insulating
oxide layers. These trapped charges change transistor thresholds and lead to increased
leakage currents, both in individual transistors and in the bulk semiconductor covered
by field oxide. As TID increases, it causes device threshold shifts, increased device
leakage and power consumption, and performance parameter changes.
Figure 7 ([101]) shows trapped charges in the gate oxide of a MOS transistor. The
accumulated holes in the gate oxide act as an applied voltage and therefore shift the
threshold of the device. Once enough charge is accumulated, it acts like a constant
voltage applied at the gate and turns on the NMOS device permanently. On the other hand,
accumulated positive charge in the gate oxide of a PMOS device will permanently turn
it off. In this way, the input/output of the gate becomes stuck at a particular value,
resulting in functional failure of the device.
Figure 7 Trapped charge in gate-oxide
2.2.2 Single-Event Effects (SEE)
A change of state or a transient induced in a circuit by a single energetic particle
strike is termed a single-event effect (SEE). There are numerous types of single-event effects,
depending on the ion strike location and the manifestation of the circuit response.
Single Event Upset (SEU)
A single-event upset occurs when a single ionizing radiation event (such as a
cosmic ray or proton) produces a burst of electron-hole pairs in a digital
microelectronic circuit that is large enough to cause the circuit to change state. SEU is
generally defined with respect to storage elements, i.e., memory cells, registers,
latches, and flip-flops.
Single Event Transient (SET)
A single-event transient is a current transient induced by the passage of a
particle through a combinational circuit. The current transient generates a
voltage transient at the output node of a gate, which can propagate to cause an output
error in combinational logic.
Single Event Latchup (SEL)
A single-event latchup is similar to the latchup phenomenon found in CMOS
(complementary metal oxide semiconductor) circuits, except that it is triggered by
the action of an ionizing particle. A single-event latchup is a potentially destructive
condition involving parasitic circuit elements forming a silicon controlled rectifier
(SCR). In traditional SEL, the resulting current may destroy the device if it is not current
limited and removed in time. A "micro-latch" is a subset of SEL in which the device
current remains below the maximum specified for the device. Removal of power to
the device is required in all non-catastrophic SEL conditions in order to recover
normal device operation.
Single-Event Functional Interrupt (SEFI)
A single-event functional interrupt occurs when an SEE results in a device
ceasing its proper operation or triggering some critical control mechanism. One such
effect described in the literature is the triggering of built-in test functions, which take
the device off-line.
Single Hard Error (SHE)
A single hard error is a single-event upset that causes a permanent change to
the operation of a device. An example is a permanent stuck bit in a memory device.
Single-Event Burnout (SEB)
Single-event burnout causes localized burnout due to a high-current state in
a power transistor. SEB is a destructive condition. Since power transistors are very
large, providing a much bigger area for charge collection, this
phenomenon is not expected to occur in standard-size ASIC gates.
Single-Event Gate Rupture (SEGR)
This condition occurs due to heavy ion hits in power MOSFETs (metal
oxide semiconductor field effect transistor) when a large bias is applied to the gate,
leading to thermal breakdown and destruction of the gate oxide. SEGR is a destructive
condition.
Multiple Bit Upset (MBU)
Multiple bit upsets represent the situation when several memory elements
experience state changes due to the passage of the same particle. Due to decreased
dimensions of memory cells in scaled technologies, the strike diameter can potentially
cover many memory cells.
2.3 Radiation Hardening
2.3.1 Radiation Hardening by Process
Traditionally, space electronics, where radiation hardening is essential for
reliable operation, were manufactured at designated foundries employing specialized
processes to improve the radiation tolerance of electronic devices. This method of
attaining system reliability has been known as Radiation Hardening by Process (RHP).
The major effects of radiation result from charge generation and collection [32].
Therefore, these hardening techniques mainly focus on reducing charge generation and
collection through careful use of materials and processing steps. In particular, the use of
silicon-on-sapphire (SOS) is an example of a silicon-on-insulator (SOI) hardening-by-
process technique. The insulator greatly reduces charge collection from the
substrate, since the active silicon layer is very thin compared to normal bulk-CMOS processes. Similarly,
stress-free ion implantation helps in reducing these effects [54]. Extra doping layers
can also be used to reduce substrate charge collection [40]. Along these lines, the use of
triple-well [19] and even quadruple-well [45] structures has been proposed to reduce single-event
upset sensitivity. In these cases, strikes behave like strikes inside the well, and
therefore most of the charge recombines within the buried well. Similarly, the
use of an epitaxial layer also reduces charge collection compared to a bulk substrate [35].
With the changing economy of the space electronics market, these processes began
losing ground due to their very high cost and poor performance because they usually
lag their commercial counterparts by at least two technology generations.
2.3.2 Radiation Hardening by Design
The heavy costs of radiation hardening by process, both in terms of economy
and performance, have led to the development of Radiation Hardening by Design
(RHBD), in which architecture, circuit, and layout techniques are used to mitigate
radiation effects. Special layout structures, like edgeless transistors (shown in Figure
8) and guard-bands [2][108], are used to minimize the inter-device leakage currents
and improve TID tolerance. On the other hand, spatial and temporal redundancy is
employed to mitigate SEU and Single Event Transient (SET) effects
[49][12][105][20][69]. Triple modular redundancy is a simple technique that has been
used from the device level to the system level for mitigating SEU and SET [57]. For
memory systems, error detecting and correcting (EDAC) codes are by far the most
effective technique to guard against soft errors [6][107][100]. Thus RHBD trades off
circuit performance (area, speed and power) for increased system reliability. The ever-
decreasing thickness of gate and field oxides minimizes the trapped charge storage and
results in an improving TID tolerance of commercial devices as technology scales.
Lower supply voltages have also minimized latchup effects. Therefore,
radiation-hardening-by-design techniques now focus on improving circuit response
with respect to SET and SEU.
Figure 8 Edgeless or annular transistor layout
Chapter 3
Models for Soft Error Mitigation of Memory Systems
This chapter presents the theoretical substratum for our soft error mitigation
framework. The mitigation of soft errors from memories requires some form of
redundancy that can be either spatial or temporal or a combination of both. In a circuit-
based hardening approach, both the individual memory cell and peripheral control
circuits of a memory array are hardened through the use of radiation-hardening-by-
design techniques (RHBD) [51][1]. The overall error rate of the SRAMs designed
using these RHBD techniques is dominated by the raw bit error rate (raw BER) of the
individual memory cell. However, the associated area, power and speed penalties are
large and potentially offset the benefits of using state-of-the-art processes [20][14]. A
more robust system-based hardening approach improves the effective error rate of the
memory system over its raw BER by employing error correcting codes (ECC) and/or
periodic memory scrubbing (a process in which memory contents are periodically
checked for errors and written back after ECC correction to reduce the accumulation
of errors). In this case, the effective bit error rate (BER) of the SRAM is dictated by both the
physical (individual-cell) BER of the memory cell and the system design. This approach
requires finding a careful balance among the physical cell BER, SEU-inducing
mechanisms, ECC redundancy (which requires extra memory cells, implying spatial
redundancy), and scrubbing rate (implying the usage of temporal redundancy).
Therefore, this chapter develops a model to achieve this balance between spatial and
temporal redundancies in the form of ECC and periodic scrubbing.
Section 3.1 starts the development of this model beginning with reliability
analysis for memories protected by ECC alone. Section 3.2 extends this analysis to
incorporate temporal redundancy in the form of periodic scrubbing. Section 3.3
identifies the redundancy bounds required by the proposed soft error mitigation
techniques to achieve target effective bit error rates. The underlying assumptions for this model and
effects of relevant factors are discussed in Section 3.4.
3.1 Reliability Analysis for Spatially Protected Memories
Since we use spatial redundancy in the form of the extra bits required to
implement an error correcting code, we briefly review how an ECC is
defined. A block code consists of a set of fixed-length vectors called codewords. A
binary (n, k) linear block code is a k-dimensional subspace of a binary n-dimensional
vector space. Thus, an n-bit codeword contains k bits of data and r = n − k check bits,
i.e., the amount of redundancy. Sometimes an ECC is also denoted by a triplet (n, k, d),
where n is the total number of bits used in a codeword, k is the number of information
bits, and d is the minimum number of bit-differences, or minimum Hamming distance,
between valid codewords within a code. The following error analysis is performed
considering that each bit is equally likely to be upset and that the errors are
uncorrelated. This assumption makes the error analysis a sequence of independent,
identically distributed Bernoulli trials. The implications of this assumption and
techniques to uphold it are discussed in Section 3.4.
Representing the error probability of a bit being upset by BER, the probability
of error occurrence in a k-bit wide word is given by an independent, identically
distributed (iid) sequence of k Bernoulli random variables:

P_word-unprotected = 1 − (1 − BER)^k    (1)

where (1 − BER) is the probability that a single bit has no error and, therefore,
(1 − BER)^k is the probability that the k-bit word has no error. Let us now consider
the case when ECC protection is applied.
3.1.1 Reliability Analysis for Single Error Correcting ECC Protected Memories
Single error correcting (SEC) codes are the most widely used error correcting codes
to protect memories against soft errors as well as hard manufacturing defects. Two
well-known SEC codes are triple modular redundancy (TMR) and the SEC
Hamming code. Let us examine each of these and also explore the effect of using
different block sizes.
Error Probability of a word protected by TMR
Triple modular redundancy (TMR) is a simple code in which three identical
copies of the data are stored and a majority voter is used to overrule an error in any one of the
copies. If a single bit is protected by TMR, then its error probability can be
computed by applying the maximum a-posteriori probability (MAP) rule [63].
Considering the equally likely case of a bit being upset from '0' to '1' and vice
versa, MAP decoding reduces to the maximum likelihood (ML) rule. In particular, an
error will occur in the 3-bit codeword in the following cases:
• Any two of the bits are in error
• All three bits are in error
There are C(3, 2) = 3 ways in which two bits could be in error and only one way
for all three bits to be in error. Representing the error probability of TMR applied to
a single bit by BER_TMR, the following equation gives this error probability:

BER_TMR = 3 BER^2 (1 − BER) + BER^3    (2)
Now placing this value of BER_TMR in equation (1), we can get the error probability of
a k-bit wide word protected by TMR, as shown in equation (3):

P_word-TMR = 1 − (1 − [3 BER^2 (1 − BER) + BER^3])^k    (3)
Error Probability of a word protected by SEC Hamming code
Hamming code is a well-known and simple single error correcting code and
can be applied to various block sizes. Larger block sizes require fewer redundant bits
and save the area of the required additional memory cells, at the cost of a reduction in fault
tolerance. We represent Hamming code applied to various block sizes as Ham(n, k), where
k is the number of information bits and n is the total number of bits in the codeword.
The parameter d for a SEC Hamming code is always 3 and is therefore omitted from
the 3-tuple. The relationship between n and k for a single error correcting Hamming
code can be expressed by: n = k + log_2(k) + 1.
With the single-bit error probability represented by BER as before, and
considering that a Hamming code can recover a single bit error in an n-bit block, the
probability of having 2 or more errors in a codeword of size n bits is given by:

P_cw = 1 − {(1 − BER)^n + n BER (1 − BER)^(n-1)}    (4)

Given this, the overall error probability of a w-bit word, i.e., the error probability in
w/k blocks with SEC Hamming code, can be found as:

P_word-HAM = 1 − {(1 − BER)^n + n BER (1 − BER)^(n-1)}^(w/k)    (5)
Equations (3) and (5) can be used for comparing achievable fault tolerance provided
by TMR and Hamming code.
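As a concrete illustration, the following Python sketch evaluates equations (1), (3) and (5) numerically. It is a minimal sketch only; the BER value and the block choices used in the example are illustrative assumptions, not data from this work.

def p_word_unprotected(ber, k):
    # Equation (1): probability of at least one upset in a k-bit word
    return 1.0 - (1.0 - ber) ** k

def p_word_tmr(ber, k):
    # Equations (2) and (3): each bit fails only if 2 or 3 of its copies flip
    ber_tmr = 3.0 * ber**2 * (1.0 - ber) + ber**3
    return 1.0 - (1.0 - ber_tmr) ** k

def p_word_hamming(ber, w, n, k):
    # Equation (5): a w-bit word stored as w/k blocks of Ham(n, k);
    # each block fails when it collects 2 or more upsets
    p_block_ok = (1.0 - ber) ** n + n * ber * (1.0 - ber) ** (n - 1)
    return 1.0 - p_block_ok ** (w / k)

if __name__ == "__main__":
    ber = 1e-4                                  # illustrative physical bit error probability
    w = 32
    print(p_word_unprotected(ber, w))           # unprotected 32-bit word
    print(p_word_tmr(ber, w))                   # TMR applied to each bit
    print(p_word_hamming(ber, w, 7, 4))         # Ham(7,4) blocks
    print(p_word_hamming(ber, w, 38, 32))       # single Ham(38,32) block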
Effect of block size
To observe the relative codeword error probability trends and evaluate the effect
of block size, Figure 9 plots the word error probabilities against typical bit error
probability ranges for an information word size of 32 bits. Notice that there is a huge
differential as we go from unprotected to any protection scheme, whereas there is only a
small difference in word error probability between TMR and Ham(7, 4). Also
notice that as the bit error probability worsens beyond a certain threshold (0.01), the
gain provided by ECC codes decreases.
One interesting conclusion can be drawn from this analysis: even though
there is a difference in the absolute magnitude of the reliability gain when the code is applied
at various block sizes, the respective slopes stay the same. This means that the
asymptotic gain is independent of the block size to which the code is applied. Therefore,
in the following discussion we can restrict our analysis to a single typical block size,
for example 16 or 32 bits.
[Figure: 32-bit word-error probability vs. single-bit error probability for Un-Protected, TMR, Ham(7,4), Ham(12,8), Ham(21,16), and Ham(38,32).]
Figure 9 32 bit word-error probability comparison curves
3.1.2 Error Probability Analysis for Double Error Correcting Code
In order to find the probability of error in a codeword protected by a double error
correcting (DEC) code, we first find the probability that there is no error in the
codeword protected by DEC. Assuming the bit error probability BER as before, the
probability of success of DEC can be represented by the following cases:
• No error occurred in the codeword
• Any single bit is in error in the codeword
• Any two bits are in error in the codeword
The probability that no error occurred in an n-bit codeword can be
expressed as (1 − BER)^n. A single bit error can happen in any of C(n,1) ways while the
rest of the (n − 1) bits are error free. For the two-bit error case, since the order of the
bit errors is not important, C(n,2) represents the number of ways 2-bit errors can happen.
Adding all of these, the probability of success for DEC can be expressed by:

P_DEC-success = (1 − BER)^n + C(n,1) BER (1 − BER)^(n-1) + C(n,2) BER^2 (1 − BER)^(n-2)
P_DEC-success = Σ_{i=0}^{2} C(n,i) BER^i (1 − BER)^(n-i)    (6)

Therefore, the probability of codeword error for DEC would be:

P_DEC = 1 − P_DEC-success
P_DEC = 1 − Σ_{i=0}^{2} C(n,i) BER^i (1 − BER)^(n-i)    (7)
3.1.3 Error Probability Analysis for Triple Error Correcting Code
The codeword error probability for a triple error correcting (TEC) code can also
be computed in a similar fashion as for DEC. In other words, we first find the probability
that there is no error in the codeword protected by TEC. Assuming the bit error
probability BER as before, the probability of success of TEC can be represented by the
following cases:
• No error occurred in the codeword
• Any single bit is in error in the codeword
• Any two bits are in error in the codeword
• Any three bits are in error in the codeword
The probability that no error occurred in the codeword can be
expressed as (1 − BER)^n. A single bit error can happen in any of C(n,1) ways while the
rest of the (n − 1) bits are error free. For the two-bit error case, since the order of the
bit errors is not important, C(n,2) represents the number of ways 2-bit errors can happen.
Similarly, for the three-bit error case, C(n,3) represents the number of ways 3-bit errors
can happen. Adding all of these, the probability of success for TEC can be expressed
by:

P_TEC-success = (1 − BER)^n + C(n,1) BER (1 − BER)^(n-1) + C(n,2) BER^2 (1 − BER)^(n-2) + C(n,3) BER^3 (1 − BER)^(n-3)
P_TEC-success = Σ_{i=0}^{3} C(n,i) BER^i (1 − BER)^(n-i)    (8)

Therefore, the probability of codeword error for TEC would be:

P_TEC = 1 − P_TEC-success
P_TEC = 1 − Σ_{i=0}^{3} C(n,i) BER^i (1 − BER)^(n-i)    (9)
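The general form behind equations (6)-(9) can be evaluated numerically. The sketch below computes the codeword error probability for any t_c-bit-correcting (n, k) block code under the uncorrelated-upset assumption of this section; the BER value and the (n, t) pairs in the example calls are illustrative.

from math import comb

def p_codeword_error(ber, n, t):
    # Codeword fails when more than t of its n bits are upset
    p_success = sum(comb(n, i) * ber**i * (1.0 - ber) ** (n - i)
                    for i in range(t + 1))
    return 1.0 - p_success

if __name__ == "__main__":
    ber = 1e-3
    print(p_codeword_error(ber, 21, 1))  # SEC(21,16)
    print(p_codeword_error(ber, 26, 2))  # DEC(26,16)
    print(p_codeword_error(ber, 30, 3))  # TEC(30,16)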
Following a similar argument, we can find codeword error probability
bounds for any t_c-bit error correcting code. To observe how the codeword failure
probabilities change with changing BER, we need to assume some value of n for the DEC
and TEC codes without yet establishing the existence of such codes. For now, let us use n = 26
for DEC and n = 30 for TEC. In a later section, we will establish the values of n for
given k and d. Figure 10 plots these error probabilities for a 16-bit information word
in all of the above cases over the range of interest of physical/raw bit error rates.
[Figure: 16-bit word-error probability vs. physical BER for Unprotected, SEC(21,16), DEC(26,16), and TEC(30,16).]
Figure 10 Word error probability for different ECC codes versus raw BER
It is interesting to note that even though these ECC codes are more powerful, if the
desired effective bit error rate is around 10^-10 (a common goal for most space
applications), these codes fail to meet this target in severe environment cases such as the solar
flare worst week and the proton belts, where the bit error probability can be as bad as
10^-3 or even 10^-2 [5]. This makes it necessary to find other ways to mitigate these
errors. Therefore, in the next section, we introduce the concept of temporal redundancy, in
the form of periodic scrubbing, and perform the codeword failure analysis to meet
target reliability.
3.2 Reliability Analysis for Spatially and Temporally Protected
Memories
The analysis so far has not considered any temporal aspects of the error
correcting codes. In this section, we introduce a temporal factor into the error probability
analysis. This temporal notion can be implemented in the form of a scrub controller
for large memory structures. The operation of a scrub controller is similar to that of
a refresh controller in DRAMs. In particular, a scrub controller periodically
reads words from memory, passes the read words through the ECC decoder, and
writes back the result after passing it through the ECC encoder. In this way, any
transient errors introduced in the storage are corrected if they are within the error
correction capability of the ECC being employed. In other words, this replicates the
error-free data in time and avoids the accumulation of errors that would otherwise result
from not scrubbing the stored data.
As before, we assume that the single bit failure rate is represented by BER (the
physical bit error rate). Assuming random bit upsets in a memory characterized
by an average upset rate per physical bit, the probability of a physical bit not upsetting
during a period of time is commonly modeled by the 0th-order Poisson distribution:

P (physical bit does not upset over a period of time) = e^(−upset rate × time)

Equivalently, in terms of the physical BER of a memory cell and the system scrub rate
(SR), the probability of a physical bit not upsetting between scrub cycles is:

P (physical bit does not upset between scrub cycles) = e^(−BER/SR)    (10)

From this, the probability that a physical bit does upset between scrub cycles is given
by

P (physical bit does upset between scrub cycles) = 1 − P (physical bit does not upset between scrub cycles)    (11)

Considering a multi-bit word, from probability theory, the probability that m physical
bits upset within an n-bit word between scrub cycles is:

P (m physical bits upset within an n-bit word between scrub cycles) = C(n,m) p^m (1 − p)^(n−m)    (12)

where p is the probability of a physical bit upset given by (11).
Using (12), we can write the probability of an ECC failure for an m-bit-correcting
ECC as the probability that (m + 1) or more bits are upset within the same physical n-bit
word between consecutive scrub cycles:

P (an uncorrectable memory upset during a scrub cycle in an n-bit word with an m-bit-correcting ECC)
= P (m + 1 or more physical bits upset within an n-bit word between scrub cycles)
= Σ_{i=m+1}^{n} P (i physical bits upset in an n-bit word within a scrub cycle)

which, using the binomial identity, is:

= 1 − Σ_{i=0}^{m} P (i physical bits upset in an n-bit word within a scrub cycle)

which, for SR >> BER, is asymptotically

~= C(n, m+1) (BER/SR)^(m+1) + O((BER/SR)^(m+2))    (13)

Note that (13) is only a function of the ratio of BER and SR, i.e., f(x) = f(BER/SR).
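A minimal numerical sketch of equation (13) follows, using the per-bit upset probability between scrubs from equations (10)-(11). The BER and SR values in the example are illustrative assumptions only.

from math import comb, exp

def p_ecc_failure_per_scrub(ber, sr, n, m):
    p = 1.0 - exp(-ber / sr)            # equations (10)-(11): per-bit upset probability
    p_correctable = sum(comb(n, i) * p**i * (1.0 - p) ** (n - i)
                        for i in range(m + 1))
    return 1.0 - p_correctable          # exact form of equation (13)

def p_ecc_failure_asymptotic(ber, sr, n, m):
    # leading term of (13) for SR >> BER
    return comb(n, m + 1) * (ber / sr) ** (m + 1)

if __name__ == "__main__":
    ber, sr = 1e-5, 1.0                 # e.g. upsets/bit-day vs. scrubs/day
    for name, n, m in [("SEC(21,16)", 21, 1), ("DEC(26,16)", 26, 2),
                       ("TEC(30,16)", 30, 3), ("QEC(34,16)", 34, 4)]:
        print(name, p_ecc_failure_per_scrub(ber, sr, n, m),
              p_ecc_failure_asymptotic(ber, sr, n, m))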
Figure 11 illustrates (13), showing the probability of encountering an ECC
failure on a scrub cycle for various m-bit ECCs applied to a k-bit information word as
the ratio between the intrinsic memory cell BER and the system scrub rate is varied.
Curves for single-bit-, double-bit-, triple-bit-, and quadruple-bit-correcting
ECCs with a 16-bit word are denoted by their respective code names, i.e., SEC, DEC,
TEC and QEC. The slopes and offsets of the curves are indicative of their asymptotic
powers and constant offsets. The curve for an m-bit-correcting ECC has an asymptotic
slope of (m + 1) in log-log representation, corresponding to, for example, an inverse
quadratic reduction in the probability of an ECC failure per scrub cycle as the scrub
rate is increased when using a single-bit-correcting ECC.
[Figure: codeword failure probability per scrub cycle vs. scrub rate/BER for SEC(21,16), DEC(26,16), TEC(30,16), and QEC(34,16).]
Figure 11 Codeword failure probability versus ratio of scrub rate to physical bit error rate for a 16-bit information word protected by various ECC
3.2.1 Effective BER Reduction Factor
The effective BER achieved by the combination of ECC codes and scrubbing
is given as the product of the scrub rate (SR) times the probability (in (13)) of getting
an uncorrectable error on each scrub cycle, normalized to the number of data bits in the
codeword, i.e., the information block size:

Effective BER = SR × P (uncorrectable memory upset per scrub cycle) / block size

The reduction factor relating a system's effective BER to its physical BER after
applying scrubbing and ECC is thus:

BER reduction factor from ECC and scrubbing = (effective BER / physical BER)
= SR/BER × P (uncorrectable memory upset per scrub cycle) / block size    (14a)
= SR/(BER × block size) × equation (13)    (14b)

which, for SR >> BER, is asymptotically

~= C(n, m+1) (BER/SR)^m + O((BER/SR)^(m+1))    (14c)

Note that (14) is also a function of only the ratio between BER and SR: f(x) = 1/(BER/SR) × f(BER/SR).
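The sketch below follows the definitions in (14a)-(14b) directly, reusing the exact form of equation (13); the BER value and the SR/BER ratios used in the example are illustrative assumptions.

from math import comb, exp

def ber_reduction_factor(ber, sr, n, k, m):
    p = 1.0 - exp(-ber / sr)                               # per-bit upset between scrubs
    p_fail = 1.0 - sum(comb(n, i) * p**i * (1.0 - p) ** (n - i)
                       for i in range(m + 1))              # equation (13)
    effective_ber = sr * p_fail / k                        # effective BER per data bit
    return effective_ber / ber                             # equation (14a)

if __name__ == "__main__":
    ber = 1e-5
    for ratio in (1e2, 1e4, 1e6):                          # scrub-rate-to-BER ratios
        print(ratio, ber_reduction_factor(ber, ratio * ber, 21, 16, 1))  # SEC(21,16)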
Figure 12 illustrates (14), showing the relative reduction in the system BER as
the ratio between the physical BER and the system scrub rate SR is varied. The curves
shown are for the same combinations of k-bit information word and m-bit ECC shown
in Figure 11. The slopes and offsets of the curves are indicative of their asymptotic
powers and constant offsets. The curve for an m-bit-correcting ECC has an asymptotic
slope of m in log-log representation, corresponding to, for example, an inverse linear
reduction in the effective BER for a system using a single-bit ECC as the scrub rate is
increased. Note that the asymptotic gain trend does not begin until the scrubbing rate
exceeds the physical BER by a certain threshold. Below this threshold, which is a
function of the number of bits in an ECC word, the effective BER is worse than the physical
BER of the memory cells. This region of "no improvement" is due to the increased
probability of a bit becoming invalidated because of other bit flips within an ECC codeword.
[Figure: effective BER reduction factor vs. scrub rate/BER for SEC(21,16), DEC(26,16), TEC(30,16), and QEC(34,16).]
Figure 12 Effective BER reduction factor versus SR/BER for various strength ECC
The curve SEC(21, 16) shows that the scrub rate must be over 100 times the
intrinsic BER before the effective BER is improved when using a single-bit-correcting
ECC code based on 16-bit data words. As an example, interpreting Figure 12 to reduce
the effective BER to 10^-4 of the physical BER, by looking for the intersections of
the ECC curves with the reduction factor 10^-4, we read that scrubbing must occur at
rates approximately 2×10^6, 5×10^3, 6×10^2, and 2×10^2 times the physical BER when using the
ECCs represented by SEC, DEC, TEC and QEC, respectively. We note that the stronger 4-bit
error correcting QEC code requires a scrubbing rate four orders of magnitude lower
than the 1-bit correcting SEC ECC. Therefore, the key point from Figure 12 is that
stronger codes, in terms of the number of bits corrected, have steeper slopes on log-log
graphs, meaning that to achieve a given level of effective system error rate
improvement, stronger codes can require significantly lower scrubbing rates.
3.2.2 Scrub Rates required to achieve an effective BER
We can use Figure 12 to derive the scrubbing rates required to achieve
an effective BER with a given ECC and raw physical BER. This relationship will enable
us to identify the required scrubbing rates for meeting the desired reliability targets.
This derivation can be accomplished by inspection of Figure 12. First, the
intersections of the asymptotic lines for SR >> BER with the line of "no improvement"
(BER reduction factor = 10^0) are noted. These points are where increasing the
scrubbing rate just starts to reduce the effective BER from the physical BER, or where
the effective BER equals the physical BER. These points are plotted in Figure 13 at their
corresponding physical BER coordinates. Next, the slopes of the curves are derived
from the asymptotic slopes in Figure 12. Considering the single-bit ECC, for example,
for every decade increase in physical BER, the scrubbing rate must be increased by
one decade just to achieve the same BER reduction factor, plus another decade to
improve the BER reduction factor to achieve the target effective BER as before.
This corresponds to a slope of 2/1 on the log-log graph of Figure 13. Similarly, the slopes for
double- and triple-bit-correcting ECCs are 3/2 and 4/3. From this trend, it can be seen
that even the strongest ECC will still require a linear increase ([n+1]/n ~= 1) in
scrubbing rate for every increase in physical BER to maintain a target effective system
BER. Figure 13 shows the required scrub rate to achieve a 10^-10 effective BER for a given
physical BER using different ECC schemes applied to 16-bit data words, ranging
from single error correction (SEC) ECC to quadruple error correction (QEC) ECC.
[Figure: required scrub rate vs. physical BER for SEC(21,16), DEC(26,16), TEC(30,16), and QEC(34,16).]
Figure 13 Scrub rate versus physical BER needed to achieve 10^-10 effective BER for different ECC schemes applied on 16-bit data word
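As an illustration of this relationship, the following sketch solves the leading asymptotic term of equations (13)-(14), with the block-size normalization of (14a)-(14b), for the scrub rate needed to reach a target effective BER. It is only an approximation under the SR >> BER assumption; the physical BER and target values in the example are illustrative.

from math import comb

def required_scrub_rate(ber, target_eff_ber, n, k, m):
    # From the leading term: effective BER ~= C(n, m+1) * BER**(m+1) / (k * SR**m);
    # solving for SR gives the estimate below (scrubs per day if BER is in upsets/bit-day)
    return (comb(n, m + 1) * ber ** (m + 1) / (k * target_eff_ber)) ** (1.0 / m)

if __name__ == "__main__":
    target = 1e-10                       # errors/bit-day, the goal used in the text
    for name, n, m in [("SEC(21,16)", 21, 1), ("DEC(26,16)", 26, 2),
                       ("TEC(30,16)", 30, 3), ("QEC(34,16)", 34, 4)]:
        print(name, required_scrub_rate(1e-5, target, n, 16, m), "scrubs/day")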
3.3 Spatial and Temporal Redundancy Bounds for Target
Reliability
Having related the required scrub rates to obtain an effective BER with a given
ECC, we can now proceed to derive the spatial and temporal redundancy
bounds for achieving a target effective BER with a given physical BER.
Due to the organization and utilization of memories by word accesses, only
linear block codes are considered for ECC protection of memories. In order to evaluate
how much spatial redundancy would be required by a t_c-bit error correcting code, we
can use the distance properties of the code. In particular, for a t_c-bit error correcting
code, the minimum distance d of the code must satisfy:

d >= 2 t_c + 1

Though this tells us the minimum distance of a code needed to correct a certain number of errors, it does
not tell us how many check bits would be required to achieve this minimum distance.
From coding theory, it is known that the Gilbert-Varshamov bound [64] establishes the
existence of a code with given parameters n, k and d.
This bound can be represented in mathematical form as follows:

q^(n−k) ≥ Σ_{i=0}^{d−2} C(n,i) (q − 1)^i

where q represents the q-ary base. Therefore, for binary block codes, where q is 2, this
bound reduces to:

2^(n−k) ≥ Σ_{i=0}^{d−2} C(n,i)

which can be further approximated to:

n − k ≤ n H((d − 2)/n)

where H is the entropy function given by:

H(x) = −x log x − (1 − x) log (1 − x).
This relation can now be used to derive the redundancy required for a code with a particular
minimum distance d. In addition, these bounds can be evaluated using the code tables
provided by [43]. The following table lists the parameters of some codes (those used
in our earlier graphs) using the bounds derived from [43].
TABLE 1 Spatial redundancy for some linear block codes aligned for typical memory word sizes

        SEC (d = 3)        DEC (d = 5)        TEC (d = 7)        QEC (d = 9)
  k     n    Check bits    n    Check bits    n    Check bits    n    Check bits
  8     12       4         16       8         19      11         25      17
 16     21       5         26      10         30      14         34      18
 32     38       6         43      11         49      17         52      20
 64     71       7         77      13         83      19         86      22
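The binary bound stated above can also be evaluated numerically. The sketch below searches for the smallest n satisfying it for a given k and minimum distance d; note that this is only an existence guarantee, and the code tables in [43] (used for Table 1) generally give equal or smaller n than the bound requires.

from math import comb

def smallest_n(k, d):
    # smallest n with 2**(n - k) >= sum_{i=0}^{d-2} C(n, i)
    n = k
    while 2 ** (n - k) < sum(comb(n, i) for i in range(d - 1)):
        n += 1
    return n

if __name__ == "__main__":
    for label, d in [("SEC", 3), ("DEC", 5), ("TEC", 7), ("QEC", 9)]:
        print(label, "k=16:", smallest_n(16, d), "total bits")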
The method in [43] not only provides the bound on check bits for a certain d, but
also provides the construction of the code itself for that minimum distance.
Therefore, it can be used for more practical bounds estimation for redundancy. We
have used data from these tables to arrive at the spatial and temporal redundancy graphs
for achieving an effective BER of 10^-10 for the various physical BERs anticipated in
different radiation environments.
Figure 14 shows the spatial overhead bounds against varying block sizes for
different ECCs, obtained using the data in Table 1. It can be observed that as the block size
increases, the required spatial overhead for applying any ECC decreases. Furthermore,
for larger block sizes such as 128 bits, the relative spatial overhead of applying more
powerful codes is small.

[Figure: %age spatial overhead vs. block size (bits) for SEC, DEC, TEC and QEC.]
Figure 14 Spatial overhead vs block size for different ECCs
Using these spatial overhead bounds and our previously developed relation
for the scrub rates required to convert a physical BER into a desired effective BER of 10^-10
errors/bit-day with a particular ECC, we can compute the trade-offs between scrub
rates and spatial overhead.
Figure 15 shows these trade-offs for three different values of physical BER for
a data word size of 16 bits. Note that the scrub rate axis is in log scale. Notice that there
are exponential reductions in the required scrub rates as the spatial overhead is initially
increased, but soon these gains start to saturate.

[Figure: scrub rate vs. %age spatial overhead for physical BER = 1E-6, 1E-5 and 1E-4.]
Figure 15 Scrub rate vs. spatial overhead trade-offs for desired effective BER of 10^-10 errors/bit-day
We can combine these curves to form a 3D plot representing the solution domain
between spatial and temporal redundancy against the range of physical BER for a target
effective BER. Figure 16 shows the temporal-spatial redundancy surface for a 16-bit
data word to obtain target reliability over a range of physical bit error rates. Note that
the x-axis (physical BER) and z-axis (scrub rate) are in log scale, and hence the
straight lines actually represent an exponential relationship. Similar surfaces can be
drawn for data words of different widths. Larger block sizes will shift the
surface towards the origin on the y-axis (%age spatial redundancy) because a larger block size
needs a smaller percentage redundancy overhead for the same ECC scheme. As observed in
previous sections, a larger block size results in slightly decreasing reliability gains, and
therefore the required scrub rates will increase accordingly. This has the effect of
moving the surface slightly upwards on the z-axis.

[Figure: scrub rate surface vs. physical BER and %age spatial redundancy.]
Figure 16 Spatial~Temporal redundancy curves for different physical BER for a 16-bit data word to achieve 10^-10 effective BER
With these bounds, a soft error mitigation solution can be chosen depending on
the application requirements for spatial and temporal redundancies. We can say that
this surface divides the solution domain: any point on or above the surface is a
solution point, while the region below the surface represents where no solution is
possible. Notice that, due to the discrete nature of the block codes, the solution is not
continuous; only the intersection points of the lines on the y-axis (%age spatial
redundancy) are selectable for a solution.
3.4 ECC Model Assumptions and Relevant Factors
The effectiveness of an ECC code in a memory system is evaluated by
calculating the probability of getting an uncorrectable error within a word. An
uncorrectable error occurs when the number of upset bits exceeds the number that can be corrected
by the ECC between consecutive memory scrubbing or other access cycles. Using this
expression, a system's worst-case effective BER is calculated as the product of the
probability of getting an uncorrectable error between scrubbing cycles times the
scrubbing cycle rate. The input parameters needed are the physical cell BER, the type
of ECC code used, and the scrubbing cycle rate, or scrubbing rate. The assumption of
uncorrelated bit errors, on which this analysis is based, is invalidated by the
occurrence of multi-bit upsets (MBUs) within the same ECC word. Such events are
most often caused by a large area of photo-generated charge following the impact of a
high-mass ion [122]; by an ion strike at a grazing angle affecting multiple memory
cells [58]; by byproducts of a nuclear spallation occurring in overlaying
materials [115]; or by strikes on the memory control circuitry that result in the
corruption of an entire word line [61].
A commonly-accepted approach to mitigating MBU memory strikes is bit
interleaving [107][6][96], whereby successive bits in the same logical ECC word are
physically separated by interposing bits which are mapped to other logical ECC
words. In this way, every MBU of physically adjacent memory cells is transformed
into multiple single bit upsets (SBU) in different ECC-protected memory words. These
SBUs appear to be uncorrelated events relative to the ECC algorithm used. As
memory cell sizes have been shrinking, the number of adjacent cells affected by
MBUs has been growing, and thus, too, the minimum interleaving distance, measured
in memory cell spacing, needed to mitigate MBU effects. Baumann [6] generally
recommends a minimum of four to eight bits of interleaving, while Radaelli [96]
reports a minimum of six cells in a 150nm SRAM technology, and Maiz [68] suggests
six or more in 130nm and 90nm technologies. However, the aspect ratio of the memory
array and the column multiplexer design may restrict the interleaving factor to
8 or 16. Therefore, interleaving factors are not going to scale without
some redesign of the memory architecture, e.g., spreading a logical word over multiple
banks. On the other hand, if the increasing MBU trend continues with technology
scaling as expected, it will become necessary to utilize more powerful codes, such as
double error correcting ECC, to keep the uncorrelated error assumption valid.
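A minimal sketch of the interleaving idea follows; it is not the implementation used in this work, and the interleaving factor and codeword length are illustrative. With an interleaving factor IL, physically adjacent columns in a row belong to IL different logical ECC words, so an MBU spanning up to IL adjacent cells produces at most one upset per codeword.

IL = 8          # interleaving factor (logical words sharing one physical row)
N  = 21         # codeword length, e.g. SEC(21,16)

def physical_column(word_index, bit_index, il=IL):
    # bit 'bit_index' of logical word 'word_index' is stored in this physical column
    return bit_index * il + word_index

def words_hit_by_mbu(first_column, span, il=IL):
    # set of logical words touched by an MBU covering 'span' adjacent columns
    return {(first_column + offset) % il for offset in range(span)}

if __name__ == "__main__":
    print(physical_column(word_index=0, bit_index=0),
          physical_column(word_index=0, bit_index=1))   # columns 0 and 8: IL cells apart
    print(words_hit_by_mbu(first_column=13, span=4))    # a 4-cell MBU hits 4 different words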
Depending on memory layout and ECC mapping in structures such as SRAM
caches, SET strikes on memory control circuitry can affect an entire word line and
potentially cause simultaneous MBUs within the same ECC-protected memory word
[61]. In these cases, memory control circuitry must be hardened sufficiently to ensure
that the BER associated with the memory control circuitry is lower than the desired
effective BER of the memory system. Similarly, the circuitry associated with ECC
encoding and decoding must be sufficiently hardened so that it also does not limit
system performance. The exact cross section of various structures (defined as the ratio
of number of errors observed to the integrated particle flux to which the structure was
exposed) can be dependent on both operating frequency and SRAM architecture.
Common control circuitry, such as ECC encode and decode circuits, would likely
show a cross-sectional frequency dependence, but not a device size dependence, since
SEU strikes only affect the current operations of these circuits. Structures such as
word drivers, in which SEE could affect many bits and memory words simultaneously,
could need stronger protection to reduce their cross-sections, which normally scale
according to the number of bits affected. However, in an architecture which
interleaves ECC codewords across multiple banks, the cross section of a word-driver
would not directly scale with the number of bits affected since a failure would not, by
itself, defeat the ECC.
3.4.1 Limiting Factors for Temporal Redundancy
The framework utilizes temporal redundancy in the form of a scrub controller.
The operation of the scrub controller can be described as: 1) periodically reading each
memory location, 2) checking the read codeword for possible errors and correcting
those errors if they are within the ECC correction capability (otherwise notifying the
system), and 3) writing back the corrected codeword by passing it through the ECC
encoder. In some sense, this operation is analogous to DRAM refresh, where the
memory contents are read periodically and the stored value is recharged by writing it
back. On the other hand, this scrub operation differs from the DRAM refresh
operation in that the latter is dictated by the ability of the DRAM cell to
retain charge, while the former is dictated by the desired trade-off in spatial
redundancy, which in turn depends on the prevailing physical bit error rates in a
technology and the desired bit error rates for the application. Therefore, the limiting
factors for a scrub controller are driven by the technology as well as the implementation
architecture of the controller. The memory size also plays a role in determining the
architecture for a scrub controller implementation and the achievable scrub rates. For
example, assuming a single bank of 1G-word SRAM and that 1ns is the time required
to read, decode and write back a codeword, the fastest possible scrub of the entire bank
takes on the order of one second, and scrubbing at that rate with this simplistic
architecture would not allow any spare time for the actual memory operation of
system reads and writes. With most practical memory architectures, the scrubbing
overhead can be hidden by scrubbing memory banks that are not currently being
accessed by the system.
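The three-step operation described above can be summarized by the schematic sketch below. The memory, decoder and encoder interfaces (ecc_decode, ecc_encode, notify_uncorrectable) are hypothetical placeholders, not the controller implemented in this work.

def scrub_pass(memory, ecc_decode, ecc_encode, notify_uncorrectable):
    for address in range(len(memory)):
        codeword = memory[address]                     # 1) read the stored codeword
        data, correctable = ecc_decode(codeword)       # 2) check and correct if possible
        if not correctable:
            notify_uncorrectable(address)              #    report errors beyond ECC capability
            continue
        memory[address] = ecc_encode(data)             # 3) write back the re-encoded word

if __name__ == "__main__":
    # toy demo with an identity "code" just to exercise the control flow
    mem = [0b1010, 0b0110]
    scrub_pass(mem,
               ecc_decode=lambda cw: (cw, True),
               ecc_encode=lambda d: d,
               notify_uncorrectable=lambda addr: print("uncorrectable at", addr))
    print(mem)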
Chapter 4
Soft Error Rate Characterization and Scaling Trends
Due to continuous technology scaling, the reduction of nodal capacitances and
the lowering of power supply voltages result in an ever-decreasing minimum charge
capable of upsetting the logic state of memory circuits (a.k.a. critical charge or Q_crit).
On the other hand, the reduced dimensions of transistors in scaled technologies effectively
minimize the sensitive area for charge collection, which counters the effect of the reduced
critical charge. The first half of this chapter describes our efforts to quantify these
effects through a simulation-based methodology to predict bit error rates for SRAMs
[85]. The key parameter for the estimation of error rate is the critical charge, and
therefore we first characterize this critical charge for the standard 6T SRAM cell
through a state-of-the-art Q_crit characterization method, i.e., 3D device simulations. We
analyze various simple models to estimate Q_crit and then determine parameters for a
calibrated model fitted to our 3D device simulation results. This analysis leads to the
identification of a simpler model that estimates critical charges close to the Q_crit
values computed using 3D simulations but with much reduced computational
complexity. Applying this identified model, we characterize physical bit error rates
(BER or SER/bit) for different space radiation environments using predictive
technology models from [94]. These estimated bit error rates serve as a basis for
finding an efficient mitigation solution using the theoretical models presented in
chapter 3.
The latter half of this chapter describes our hybrid simulation approach to
characterize single event transients [83]. This methodology is used to characterize SET
pulse widths in a commercial CMOS technology. We also utilize this approach to
study the effects of SET modulating factors, especially process corner and fan-out
variations. Using the hybrid simulation approach, it is shown that charges as small as
3.5fC can introduce transients in commercial 90nm CMOS technology, hence
increasing the likelihood of SET-induced soft errors. It is observed that process corner
variations and fan-out variations have a very significant effect on SET pulse widths.
The simulation results suggest that the selection of mitigation techniques for SET
radiation-hardened circuits cannot exclusively rely on baseline process analyses, as
they might grossly underestimate the true SET risk to the design. We also analyze
various methods that have been used to characterize SET pulse widths. The scaling
trends for SET pulse widths have been investigated by collecting published data for
different technologies and our simulation results. The scaling trends show that SET
pulse widths are getting relatively smaller with each new technology generation. This
has implications for delay filtering based SET mitigation techniques, such as those
described in chapter 5.
4.1 SEU Characterization
This section describes heavy-ion induced SEU characterization for SRAM
cells and bit error rate scaling trends down to the 22nm technology node.
4.1.1 Critical Charge Characterization for SRAM cells
Critical charge (Q_crit) is generally defined as the minimum amount of charge
that must be collected by a circuit node, following the strike of an ionizing particle at a
sensitive node, in order to change the state of the circuit [31]. Technology Computer
Aided Design (TCAD) tools provide great help in modeling the effects of radiation
on circuits at various technology nodes. Device-level 3D simulations, in particular, are
very helpful for accurately predicting the behavior of those devices, since they have the
clear advantage of modeling the actual geometry of the target device as well as
incorporating the underlying physical mechanisms (e.g., charge deposition, charge
collection efficiency) when generating each current pulse stimulus.
The space electronics community generally uses TCAD tools for Q_crit
characterization and soft error rate estimation when no test data are available, but the
circuit community usually relies on various SPICE-level current models [56] to
compute the critical charge metric [75][48][119][46]. The shape and amplitude of the
current model have a strong effect on the computation of Q_crit. Therefore, it is
important to calibrate these current models to 3D device simulations for every
technology generation. Here we use 3D device simulations to characterize the Q_crit of a
6T SRAM cell designed in a 90nm technology, and determine the double-exponential
current model that best fits the stimulus from the 3D device simulation. This
calibrated model and other models available in the literature are then used to characterize
Q_crit for the same SRAM cell.
4.1.1.1 Current Profiles from 3D Device Simulations
3D device and mixed-mode simulations are computation intensive and require
a fairly long development and calibration phase in comparison to SPICE-type circuit
simulations. However, they are also able to model complex physical mechanisms and
geometry-related phenomena that cannot be captured by circuit simulations, and that
can potentially affect parameters relevant to the calculation of Q_crit (e.g., charge
collection depth, charge collection area and charge collection efficiency).
The 3D device simulations were performed using ISE-TCAD 10.0 DESSIS.
The transistors' 3D models were calibrated for both high and low V_DS over a range of
0 to 1.2 V to reproduce the DC transistor characteristics as defined in the
manufacturer's process design kit (PDK) within a 5% error margin. In the initial set of
simulations, the modeled 90nm SRAM cell was subjected to ion strikes with LETs
(linear energy transfer) ranging from 1 to 20 MeV-cm^2/mg. The SRAM cell exhibited
a very high sensitivity to radiation, as it was upset for an LET of 1 MeV-cm^2/mg. A
second set of simulations with LETs ranging from 1 down to 0.1 MeV-cm^2/mg fine-tuned
the investigation of the LET threshold and determined it to be very close to 0.25 MeV-cm^2/mg.
The resulting critical charge is 1.4fC. Figure 17 shows current profiles
resulting from the simulation of ion strikes on the off NMOS drain, over LETs ranging
from 0.1 to 1 MeV-cm^2/mg.
It is interesting to note that photocurrent pulses from low-LET heavy ions (~1
MeV-cm^2/mg or less) have profiles that can be quite closely described by a double
exponential pulse. It should also be mentioned that this property is no longer
valid for pulses from higher-LET ions, such as 10 and 20 MeV-cm^2/mg, where the tail
portion of the radiation-induced photocurrent shows distortions from extended charge
collection (deep carrier diffusion) and hence cannot be described by a simple decaying
exponential function. Therefore, a double exponential pulse cannot correctly model all
instances of heavy ion strikes and should be used with caution.
[Figure: current (uA) vs. time (ps) pulses from 3D simulations for LET = 0.1, 0.20, 0.25, 0.5 and 1 MeV-cm^2/mg, together with the best-fit upset-inducing pulse.]
Figure 17 3D simulation current pulses for LET < 1 MeV-cm^2/mg
For LET values below 1 MeV-cm^2/mg, the characteristic parameters of the TCAD
photocurrents (rise time, fall time, peak amplitude) were used to define double
exponential current pulses for ion strike simulations. The amplitude of the fitted
double-exponential pulse was modulated to find the upset/no-upset boundary for the
SRAM cell. Figure 17 shows a double-exponential pulse with time characteristic
parameters matched to the 3D simulation results, and with amplitude modulated to find
the smallest pulse capable of inducing cell upset.
4.1.1.2 Ion Current Models for Q_CRIT Characterization
The radiation-induced photocurrent resulting from an ion strike is
characterized by two collection phases: a first phase of electric-field accelerated free
carrier motion (fast drift current), followed by a second phase of charge collection due
to free carrier density gradients (slow diffusion current). The regions most sensitive to
ion strikes are reverse-biased junctions, because the electric field present across
their large space-charge region efficiently collects any charge generated in
their vicinity. In modern 6T SRAM designs, the off-NMOS drain is the most sensitive
region of the cell (in addition to the off-PMOS and high-Z NMOS drain regions) due
to the combined effects of increased electron mobility and design practices favoring
the speed at which a cell can flip states.
3D TCAD tools can achieve a great level of accuracy, but consequently, also
suffer from large computation times as well as limited capabilities to simulate circuits
exceeding relatively small sizes. Therefore, it is desirable to model radiation strikes as
current sources that can be easily injected in the target circuit for fast SPICE
simulations. Figure 18 shows an SRAM cell and its corresponding current source,
modeling a heavy-ion strike on the off-NMOS drain, applied across the drain and
grounded source terminals.
A number of current models have been proposed in the literature over the years,
and they are used by the circuit community to characterize Q_crit through SPICE
simulations. A simplified model was proposed by Roche et al. in [97], using 3D device
simulations in a 0.35um technology. According to that definition, Q_crit can be found
using:

Q_crit = C_N V_DD + I_DP T_F    (1)

In equation (1), C_N is the node capacitance, V_DD is the supply voltage, I_DP is
the maximum drain conduction current of the PMOS (provided the strike is made at
the OFF NMOS drain) and T_F is the flipping time of the cell. The computation of T_F
involves 3D device simulation, where the transient voltage of the opposite node (not
struck by the incident ion) in the SRAM cell is observed during the flipping of the cell.

Figure 18 SRAM cell with off-nmos strike model

In order to characterize Q_crit by SPICE simulations only, it was proposed that the
I_DP·T_F term can be ignored from equation (1) and Q_crit can be found by integrating an
exponentially decaying current (I_0·exp(−t/τ)) with small time constants of less than
20ps. The target values of the magnitude of the current (I_0) and the time constant
of the exponential (τ) are achieved when the product I_0·τ is minimized. The resulting
Q_crit by this method will be slightly under-estimated because of the neglected additive
term.
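A small numerical sketch of the Roche et al. model in equation (1) follows. The parameter values below are illustrative placeholders chosen only to show the units and magnitude of the terms; they are not extracted from the 90nm cell studied in this chapter.

def qcrit_roche(c_node_f, vdd_v, i_dp_a, t_flip_s):
    # Equation (1): Q_crit = C_N * V_DD + I_DP * T_F, in coulombs
    return c_node_f * vdd_v + i_dp_a * t_flip_s

if __name__ == "__main__":
    c_node = 1.0e-15      # node capacitance, 1 fF (illustrative)
    vdd = 1.0             # supply voltage, V
    i_dp = 20e-6          # PMOS drain conduction current, A (illustrative)
    t_flip = 10e-12       # cell flipping time, s (illustrative)
    q = qcrit_roche(c_node, vdd, i_dp, t_flip)
    print(q * 1e15, "fC")  # about 1.2 fC with these placeholder numbers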
Another current model, proposed by Freeman [39] to compute Q_crit for bipolar
memories, has also been used for finding Q_crit for a CMOS SRAM cell, as in [48][46].
Freeman's model is described by equation (2) and defines the current in terms of the total
charge deposited (Q) by the ion and a single timing parameter τ:

I(t) = (2Q / (τ √π)) (t/τ)^(1/2) exp(−t/τ)    (2)

A diffusion collection model, given by equation (3), has been used in [75][48]
and is considered more suitable for modeling neutron strikes. The time t_max represents
the instant when the maximum value of the current, I_max, is reached:

I(t) = I_max e [t/t_max]^(3/2) exp[−(t/t_max)^(3/2)]    (3)

Finally, there is the model most commonly used by the circuit community
(given by equation (4)), which is a double exponential pulse with two timing parameters,
τ_r and τ_f, representing the rising and falling time constants of the exponentials. This
model has been widely used in the literature to find not only Q_crit but also the single
event transients (SET) introduced by ion strikes in combinational logic [48]:

I(t) = [Q / (τ_f − τ_r)] [exp(−t/τ_f) − exp(−t/τ_r)]    (4)
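The double-exponential pulse of equation (4) is easy to generate for SPICE-level strike injection. The minimal sketch below evaluates it and numerically checks that the pulse integrates to the deposited charge Q; the Q, τ_r and τ_f values are illustrative and not calibrated to any particular technology.

import math

def double_exp_current(t, q, tau_r, tau_f):
    # Equation (4): I(t) = Q/(tau_f - tau_r) * (exp(-t/tau_f) - exp(-t/tau_r))
    return q / (tau_f - tau_r) * (math.exp(-t / tau_f) - math.exp(-t / tau_r))

if __name__ == "__main__":
    q, tau_r, tau_f = 10e-15, 33e-12, 161e-12      # 10 fC, 33 ps, 161 ps (illustrative)
    dt, t_end = 0.1e-12, 5e-9
    samples = [double_exp_current(i * dt, q, tau_r, tau_f) for i in range(int(t_end / dt))]
    print(sum(samples) * dt * 1e15, "fC collected")   # ~10 fC: the pulse integrates to Q
    print(max(samples) * 1e6, "uA peak")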
The current pulse rise and fall times, and their full-width at half-maximum
(FWHM), strongly affect the characterization of Q_crit, to the point where each pulse
model results in its own Q_crit value. Hence, the characteristic timing parameters of the
above models for one technology node cannot be extrapolated directly to the next
technology node. Some empirical approximations have been used in the literature to
select values for these parameters. Figure 19 shows the current pulses derived using
such empirical approximations, where the FWHM is 161ps for a total charge of 10fC,
as considered in [48]. On the other hand, the Roche model does not require a
specific FWHM, and we see in Figure 19 that it has a very fast fall time
with a relatively large peak amplitude compared to the other models.
[Figure: current (uA) vs. time (ps) pulse profiles for the Diffusion, Freeman, Double Exponential and Roche models.]
Figure 19 Ion current pulse profiles for different models
4.1.1.3 Estimated Critical Charge for 6T SRAM Cell
Using the methods described in previous subsections, we characterized critical
charge values for a commercial 6T SRAM cell. The simulated SRAM cell was
designed using foundry-provided high-density layout rules from a commercial 90nm
process. To simulate the SRAM cell, a net-list was first extracted using Cadence’s
Virtuoso Layout editor and converted to equivalent sub-circuit calls in order to use the
appropriate transistor models provided in the manufacturer’s PDK. Charge collection
from ion strikes was modeled by applying respective current sources across the
terminals of the “hit” transistor in the memory cell (Figure 18). HSPICE was used for
circuit simulation of the SRAM’s radiation response, with a 1.0V power supply bias
and under standard room temperature conditions (25°C). Table 2 presents the
simulated Q
crit
values for all the models investigated in this study.
Table 2 Q
crit
values using different models
Model Timing Parameters (ps) Q
crit
(fC)
3D TCAD Simulation (LET=0.25MeV-cm
2
/mg) 1.40
Double exp fit to 3D τ
r
= 2.5, τ
f
= 5.5 1.15
Double Exponential τ
r
= 33, τ
f
= 161 4.07
Double Exponential τ
r
= 16, τ
f
= 161 4.09
Roche et al τ = 2 0.95
Freeman τ = 90 3.90
Diffusion t
max
= 60 3.58
The results in Table 2 reveal that the Q_crit values computed from the Roche et al.
and the best-fit double exponential models are closer to the Q_crit value derived from
3D simulations. All other models over-estimate Q_crit by almost a factor of
three. The reason for the best-fit double-exponential model resulting in a slightly
lower Q_crit is the timing characteristics of the current, emphasizing that even small
variations in current profiles may result in Q_crit inaccuracies. As for the Roche model,
the under-estimation of Q_crit can be partially attributed to the ignored I_DP·T_F term in
expression (1), which, if properly accounted for, will slightly increase Q_crit. In general,
the double exponential model best-fitted to the 3D pulse and the Roche model will
result in Q_crit under-estimations with reasonable error margins. All other models will
result in large Q_crit over-estimations unless calibrated with 3D simulation on the
technology node of interest.
4.1.2 SEU Rate Prediction
Based on the Q_crit values presented in Table 2 and the sensitive-area cross-section
extracted from the cell layout, we estimated the bit error rate (BER) of the SRAM
under consideration using the CREME96 radiation environment modeling tool [24].
Table 3 shows the BER in errors/bit-day for a geosynchronous orbit at the solar
minimum (maximum galactic cosmic ray) condition (GEO Orbit – MAX GCR). It can
be seen that the spread in BER across the various models is more than an order of magnitude.
It can be observed from Table 3 that the double-exponential models with
empirical parameters, the Freeman model, and the Diffusion model all under-estimate
the BER by roughly a factor of three, compared to the BER calculated from TCAD
simulations.

Table 3 Bit Error Rates (in errors/bit-day) using Q_crit

Model                    GEO Orbit (Max GCR)
3D TCAD Simulation       1.31e-6
Double exp fit to 3D     1.66e-6
Double Exponential       3.61e-7
Double Exponential       3.59e-7
Roche et al              2.08e-6
Freeman                  3.93e-7
Diffusion                4.43e-7

When compared to the 3D TCAD simulation, the Roche model results in an over-estimation
of the BER by around 1.6 times, while the best-fit double exponential current
results in comparable BERs. The best-fit double-exponential and Roche models result
in a conservative over-estimation of the BER without compromising reliability.
Therefore, these models can confidently be used for simulating a circuit's radiation
response when TCAD 3D simulation data is not readily available.
4.1.3 SEU Scaling Trends
Having identified simulation techniques to compute critical charge for SRAM,
we now use that methodology to predict soft error rates in scaling technologies.
Heavy-ion induced soft errors are a major threat to the reliability of satellites and
spacecraft systems. This concern becomes even more important with the increasing
usage of commercial-off-the-shelf (COTS) micro-electronic components in the space
industry. Although much work has been done for neutron-induced soft error rates
(SER) in terrestrial environments [27][62][102][47], there is no data available for
heavy-ion induced soft error rates in geosynchronous/interplanetary and/or low-earth
(earth’s magnetosphere) orbits. We present physical bit error rates (BER or SER/bit)
for radiation environments in space for predictive technology models from 130nm to
22nm technology nodes. We find that even though the BER is decreasing with
technology scaling, the over-all SER increases significantly, even for a constant area
scaling scenario. Thus, this characteristic emphasizes the need for more powerful soft
error mitigation techniques to be applied for maintaining desired reliability levels.
Key parameters affecting the bit error rate for memory elements in a specific
environment are: critical charge (Q_crit), charge collection area (or sensitive volume),
and charge collection efficiency. Q_crit depends on nodal capacitances and supply
voltages and decreases with technology scaling, as discussed previously. However,
the charge collection area also decreases with technology scaling, at a rate that more
than compensates for the effect of decreasing Q_crit and results in reduced bit error rates.
Charge collection efficiency depends on the doping profiles in a process and is therefore
very sensitive to process variations. We employ a methodology to compute soft error
rates that is kept uniform for all technology nodes under consideration, thus
alleviating mispredictions due to extrapolation of results. Furthermore, we maintain
a sizing ratio for pull-up and pull-down transistors (in a standard 6T SRAM
cell) similar to that employed in today's commercial high-density SRAM arrays. In particular,
we use an asymmetrical SRAM cell where a ratio of 3/5 for PMOS to NMOS sizing is
maintained to achieve high density and fast access times.
We model space environments using the CREME96 [24] suite of routines and
predict soft error rates. With CREME96, bit error rates can be computed using a
Weibull function if device cross-section versus effective LET data is available from
accelerated tests. Alternatively, a critical charge method can be used to compute BER,
where the device cross-section is treated as a step function of Q_crit. For future predictive
technologies, no accelerated test data is available; therefore, we use the critical charge
method to compute BER. It should also be noted that even though the critical charge
method provides relatively less accurate results, the absolute value of the results is not
important for observing scaling trends. We have previously shown that the Roche
model computes the Q_crit of SRAM cells using SPICE simulations alone, with results that
are in good agreement with 3D mixed-mode simulations. Therefore, we compute Q_crit by
SPICE simulations using the Roche model. For radiation environments, we model
geosynchronous/interplanetary orbits as well as trapped radiation belts in the earth's
magnetosphere. Two conditions are modeled for geosynchronous orbits, namely: the
mostly prevailing solar quiet condition representing maximum galactic cosmic ray
activity (MAXGCR), and the solar flare worst week (SFWW) representing stormy
conditions. Trapped radiation belts (Proton Belts) are modeled at 3000km
apogee/perigee with 0 degree inclination and averaged over 200 orbits with quiet
magnetic weather conditions, employing the AP8MIN trapped proton model. A
thickness of 100 mils of aluminum shielding is considered in all of the above
environments.
For predictive technology models with their respective nominal voltages, Table 4
shows computed Q_crit values for various technologies and the estimated bit error rates for
the modeled environments. The bit error rates are computed assuming a constant charge
collection depth of 1 μm for each technology. This charge collection depth is consistent
with the test data from 90nm CMOS technology. It can be observed from the results
that even though Q_crit is decreasing by a technology scale factor of almost 0.7, the
physical bit error rates are also decreasing. This trend is similar to the recently
observed decreasing neutron-induced SER trend [21][62][102]. This decrease in
predicted BER can be mainly attributed to the decrease in charge collection areas, since
charge collection efficiency has been modeled at 100%. However, this decrease in physical
BER does not translate into an overall decreasing system soft error rate. In fact, the
current bit error rates are already in a range where they result in around 8 upsets per day
in the MAXGCR environment and 3000 upsets per day in the SFWW or Proton Belt
environments for a 10Mb memory in 90nm technology. Furthermore, for a constant
area constraint, overall SRAM error rates are increasing, as shown in the rightmost
column of Table 4. In addition, increasing integration is likely to increase these soft
error rates even further. This increase in the system-level SER dictates that more
effective soft error mitigation techniques must be developed.

Table 4 Technology scaling effects on Q_crit and predicted Bit Error Rates

Technology   Vdd   Q_crit   GEO MAXGCR        GEO SFWW          Proton Belts      Const. Area
Node         (V)   (fC)     (errors/bit-day)  (errors/bit-day)  (errors/bit-day)  SER (a.u.)
130nm        1.3   4.62     9.89E-07          2.66E-04          1.48E-04          10
90nm         1.2   3.1      7.23E-07          2.70E-04          2.85E-04          15
65nm         1.1   2.26     5.41E-07          2.44E-04          3.18E-04          23
45nm         1     1.64     3.62E-07          1.78E-04          2.55E-04          30
32nm         0.9   1.22     2.35E-07          1.16E-04          1.72E-04          39
22nm         0.8   0.98     1.34E-07          5.54E-05          7.80E-05          45
4.2 SET Characterization
In this section, we summarize SET pulse width characterization methods that
have been employed in prior research and present our hybrid simulation methodology
to characterize SET pulse widths. We also study the effects of strong modulating
factors for transients such as fan out and process corner variations using our proposed
hybrid simulation technique.
A pseudo-direct measurement of SET pulse widths was adopted by Eaton
et al. [34]. In their experiment, a test chip was built using adjustable-delay temporal
latches in a 0.18um technology and irradiated using a range of heavy ions. For a
particular LET, the SET pulse width was characterized by finding a pass/no-pass
boundary for capturing a transient, by varying the adjustable delay in the temporal
latch. With this methodology, the approximate widths of transients were estimated at a
resolution of 50ps. They report SET pulse widths ranging from 350ps to 1.3ns for LET
levels ranging from 11.5 MeV-cm^2/mg to 64 MeV-cm^2/mg. It was observed that SET
pulse widths are directly proportional to the LET of the incident ions. However, an ideal
square-wave shape is assumed for the transients, and it is not possible to actually observe
the profile of the transient.
Gadlage et al [42] built a test chip using DICE (dual interlocked storage cell)
latch test structures. By using DICE latches, SEU errors were avoided, so the errors
caused in the test chip were attributed to SET only. They tested their chip for a range
of heavy ion irradiations and reported error cross-sections with respect to LET as well
as frequency of operation. The frequency effect was investigated because faster
operation is supposed to capture more transients generated in combinational logic.
They later used double exponential pulses to randomly inject transients using SPICE
simulations and matched the error rate with test data. Based on this correlation, they
report SET pulse widths as large as 2ns for an LET of 100 MeV-cm²/mg for 0.25 μm and 0.18 μm technologies. It may be noticed that the correlation between simulation error rate and test chip data does not necessarily provide direct insight into transient profiles.
SET pulse widths have also been estimated using 3D mixed-mode simulations
by Dodd et al. in [33]. They have modeled technology nodes from 0.25um to 100nm
with a minimum size inverter as the target device. They report SET pulse widths ranging up to 1ns for bulk-CMOS processes for an LET of 50 MeV-cm²/mg. With technology scaling, a decreasing SET pulse width trend is observed for SOI technology, but there is no such trend for bulk-CMOS processes.
Direct measurements of transient current pulses using high-speed scopes have
been carried out in [37]. The test chips, designed in 0.25um and 50nm technology
nodes, have been irradiated by laser beam as well as heavy ions. Although the size of
the test transistors is quite large (20um for 0.25um technology and 10um for 50nm
technology), a decreasing transient width trend is observed. The total duration of the
transient current measured at 10% of the peak height for an LET of 16 MeV-cm²/mg is
750ps for 0.25um technology and only 110ps for 50nm technology. Since the devices
used are quite large, the data cannot be used to estimate transient widths for regular
gates in a commercial ASIC library. Although a 3D device simulation analysis for SOI
devices is performed to correlate with irradiation results, similar analysis for bulk
devices is not available.
4.2.1 Hybrid Simulation Approach
In contrast to previous work, we propose a hybrid simulation approach to
characterize the single-event transient phenomenon [83]. This approach consists of
utilizing 3D device simulations to model ion strikes on target devices and SPICE
circuit simulations to study SET generation and propagation through combinational
logic. This hybrid approach speeds up the SET characterization process by orders of
magnitude as compared to traditional 3D mixed-mode simulations. Furthermore, it
provides the flexibility to inject transient current pulses anywhere in a large circuit and
directly observe the propagation of the resulting transient through a large
combinational network. In other words, this approach combines the proven accuracy
in ion strike modeling of 3D device simulations [30] and the efficiency of commercial
circuit simulators such as SPICE.
3D device simulations are used to generate current pulses that are a function of
the incident ion’s LET and the target device’s dimensions. The resulting current pulses
are applied in SPICE circuit simulations to model ion strikes as shown in Figure 20. If
the resulting pulses from 3D simulations can be fit to an analytical model with
appropriate characteristic parameters, such as a double exponential pulse, the
computational time for modeling various LET ion strikes is further reduced at the cost
of a slight decrease in accuracy.
Figure 20 NAND gate off NMOS ion strike model
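For reference, a minimal Python sketch of the double exponential current pulse commonly used to approximate such photocurrents is given below. The peak current and time constants are arbitrary placeholders rather than the characteristic parameters extracted from the TCAD waveforms.

import numpy as np

def double_exponential(t, i_peak, tau_rise, tau_fall):
    # Classic double exponential approximation of an SET photocurrent:
    # I(t) ~ exp(-t/tau_fall) - exp(-t/tau_rise), scaled so its peak equals i_peak.
    raw = np.exp(-t / tau_fall) - np.exp(-t / tau_rise)
    return i_peak * raw / raw.max()

# Placeholder parameters (not fitted TCAD values)
t = np.linspace(0.0, 1e-9, 1000)                        # 1 ns window
i = double_exponential(t, i_peak=200e-6,                # 200 uA example peak
                       tau_rise=5e-12, tau_fall=100e-12)
dt = t[1] - t[0]
print(f"collected charge ~ {np.sum(i) * dt * 1e15:.1f} fC")   # integral of I dt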
For this study, 3D device simulations have been carried out on minimum size
NMOS and PMOS transistors. The resulting current pulses have been injected at
sensitive nodes in SPICE simulations to characterize SET pulse widths for standard
ASIC gates i.e. INV, NAND and NOR. An inverter chain was additionally placed at
the output of the struck gate not only to observe the propagation of generated transients, but also to witness any possible electrical masking or stretching phenomena that might
affect the SET pulse widths. The use of an inverter chain at the output of a struck node
precludes the logical masking phenomenon in the propagation of SET pulses, and
simplifies the error analysis due to SET propagation. Due to these reasons, inverter
chains have also been used in previous studies to investigate propagation of SET
pulses ([11][33][34][42]).
A set of simulations (for minimum size transistors) with ion strikes having
LETs ranging from 1 to 60 MeV-cm²/mg was carried out. It was observed that photocurrent pulses from low LET heavy ions (~1 MeV-cm²/mg or less) have profiles that can quite closely be described by a double exponential pulse (Figure 17), while such a property is no longer valid for pulses from higher LET ions, such as 10 to 60 MeV-cm²/mg. Therefore, for LET values close to 1 MeV-cm²/mg, the characteristic
parameters of TCAD photocurrents (rise time, fall time, peak amplitude) were used to
define double exponential current pulses for ion strike simulations. The amplitude of
the fitted double-exponential pulse was modulated to find the transient inducing
boundary for the respective gates.
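The amplitude-modulation step can be viewed as a simple bisection over the injected charge, as sketched below. The induces_transient predicate is a placeholder standing in for a SPICE run that reports whether an observable transient appears at the chosen observation stage.

def minimal_inducing_charge(induces_transient, q_low_fc=0.0, q_high_fc=20.0,
                            tol_fc=0.05):
    # Bisect over the injected charge (i.e., the amplitude of the fitted
    # double exponential pulse) to find the smallest charge that still
    # produces an observable transient. induces_transient(q) is assumed
    # to wrap a circuit simulation and return True or False.
    while q_high_fc - q_low_fc > tol_fc:
        q_mid = 0.5 * (q_low_fc + q_high_fc)
        if induces_transient(q_mid):
            q_high_fc = q_mid        # transient observed: try a smaller charge
        else:
            q_low_fc = q_mid         # no transient: more charge is needed
    return q_high_fc

# Toy usage (a real flow would launch SPICE inside the predicate)
print(minimal_inducing_charge(lambda q: q >= 3.5))       # converges near 3.5 fC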
4.2.1.1 Critical Charge Concepts for Combinational Gates
Although it’s easy to define critical charge (Q
crit
) for memory elements, there is
no unique definition of critical charge for combinational logic circuits. If an ion strike
happens on a gate just before the storage element, then this transient need not be
propagated through many stages to potentially cause an error. On the other hand, for
an ion strike happening at a gate deep inside a combinational network, a resulting SET
will need to propagate electrically (as well as logically) through various stages until
the storage element or primary output to cause an error. Since we are concerned only
with the minimum charge values which may cause an SET to be propagated through
various stages, we investigated these charge values for different cases: 1) observation
of the SET pulse at the place of strike, and 2) observation of the SET pulse as
propagated one stage ahead, two stages ahead, four stages ahead and so on.
A number of factors that make the location of an ion strike critical within a
gate can be identified qualitatively. First, the worst locations for single event ion
strikes are the reverse-biased junction regions such as drain regions of off-nmos and
off-pmos devices. Secondly, the Qcrit depends on the nodal capacitance of the struck node affected by parasitic capacitance and fan-out. For a multi-input gate (such as NAND or NOR), a third factor affecting its Qcrit is the input pattern where transistors
are connected in a series or parallel stack. A fourth factor is the relative location of an
ion strike within the gate itself. A strike on an internal node of a gate is less likely to
effectively propagate the generated transient to the gate’s output than a strike directly
at the gate’s output. Therefore for this work, we restrict ourselves to strikes occurring
at the gate output only. Considering all of the above mentioned scenarios, the worst
case for characterizing Qcrit for an ASIC library dictates that we use standard 1x
strength gates with minimal fan-out of a standard 1x strength inverter.
Using the ion strike current pulse profiles from 3D device simulations and
modulating the best fit double exponential current model’s amplitude, we have
computed Qcrit for some standard cell ASIC gates. Figure 21 shows Qcrit values for 1x strength standard INV, and 2-input NAND and NOR gates. Notice that the NOR gate can be upset by a small charge of 3.5fC. The pulse width for all cases, measured at half maximum before vanishing, is almost constant at around 30ps. The difference in Qcrit values for an SET observed at the site of a strike versus a strike which can propagate an SET to the 8th inverter ahead is 26% for an INV, 21% for a NAND and 16% for a NOR. These small values of transient-inducing charges and the small differences in the minimum charges required to propagate SETs to various stages have implications for SET mitigation techniques which rely on gate sizing to improve reliability, such as [26], [28], and [120].
Figure 21 Minimum charges required to observe SET (charge in fC vs. observation stage: at the strike site, and at 1, 2, 4, 6 and 8 inverters ahead, for the NOR2, INV and NAND2 gates)
4.2.1.2 SET Pulse Widths for Larger LETs
Photocurrents from high LET ion strikes have been directly applied at relevant
circuit nodes on the investigated ASIC gates since they were not fit to any generic
current model. The resulting transients from these large LET ion strikes have a quick
charge collection phase. This quick response time causes the drain-substrate junction
to forward-bias and the voltage at the struck node to undershoot significantly. This
effect is typical for bulk-CMOS processes and has also been observed in [33]. Another
observation for SETs generated from these strikes is that the resulting transients quickly settle down (just after the 2nd inverter) to a perfect square wave, and their pulse widths after the 3rd inverter are constant. Figure 22 provides the widths of SETs from large LET ions. These values have been obtained using a minimum strength inverter chain at the output of the struck gate with a fan-out of 1.
Figure 22 SET pulse widths for large LET strikes on ASIC gates (pulse width in ps vs. LET in MeV-cm²/mg for the INV, NAND2 and NOR2 gates)
Notice that the parallel stack of reverse-biased regions exhibits the highest
vulnerability to ion strikes. This fact is evident from the sensitivity of the NOR gate
for these off-nmos strikes as compared to INV and NAND gates. Results show that
generated SETs can be as large as 653ps for an LET of 60 MeV-cm²/mg for NOR
gates. As expected, this also confirms that larger LETs result in larger pulse widths. It
is interesting to note that the SET pulse widths predicted from these simulations are
relatively smaller than those observed for earlier technology nodes in [11][34][42].
Although a direct comparison with transient current measurements reported in [37]
cannot be made because of very large size devices under test and a different
technology used in that work (50nm), our simulations agree with the trend that SET
pulse widths are getting smaller in sub-100nm technologies. This has strong
implications for SET mitigation techniques which require temporal filtering such as
DF-DICE [82] and Temporal Sampling Latch [69]. As the performance penalties of
these techniques are directly proportional to the SET pulse widths to be filtered, these
techniques can be applied with reasonable penalties in modern technologies.
4.2.1.3 Process Corner Variations
In order to study the effects of process corner variations, we have modeled the
following process corners for this study: FF (Fast NMOS, Fast PMOS), FS (Fast
NMOS, Slow PMOS), SF (Slow NMOS, Fast PMOS), SS (Slow NMOS, Slow
PMOS) and TT (Typical NMOS, Typical PMOS). Figure 23 shows the spread in SET
pulse widths around the TT process corner. The SS corner results in the largest pulse
width while the FF corner yields the smallest for these off-nmos strikes. The pulse
widths reported are the settled pulse widths at the output of the 3rd inverter from the
location of the strike.
From Figure 23, we see that the worst case SET pulse width at an LET of
60 MeV-cm²/mg could be as large as 885ps. The worst case percentage variations between the extreme process corners for a particular gate are in the range of 42% to 75%. For example, the percentage variation of SET pulse width for the inverter is 42% for an LET of 10 MeV-cm²/mg and 75% for an LET of 60 MeV-cm²/mg. The NOR gate, on
the other hand, shows a variation of 50% to 70% for the same cases. Considering this
range of up to 75% variation in the SET pulse widths for different process corners, it
can be concluded that SET pulse widths are very sensitive to process corners. This quite large variation needs to be taken into account while analyzing the performance penalties of SET mitigation techniques.

Figure 23 SET pulse widths with process corner variations (pulse width in ps vs. LET in MeV-cm²/mg for the INV, NAND and NOR gates)
4.2.1.4 Fan-out Variations
Here we analyze the effects of fan-out variations on SET pulse widths. Table 5
shows the simulation results with fan-out variations ranging from 1 to 6. The upper
limit of 6 on fan-out is chosen due to the drive capability of minimum size gates. An
interesting observation can be made from these results: a larger fan-out results in larger SET pulse widths. This anomaly is best explained by a close look at the waveforms of the struck node, shown in Figure 24. These large LET ion strikes deposit significant charge in a very short time, which results in undershoots that attain voltages as low as the negative of the supply rail. A large nodal capacitance in this scenario actually slows down the charge collection phase and results in a wider SET pulse while reducing the amplitude of the undershoots. From the results it can be observed that fan-out has a significant effect on SET pulse widths and can vary the SET pulse width by as much as 84% for the NOR gate.
Table 5 Fan-out variations on SET pulse widths (ps) for different gates and LETs

       LET = 10 MeV-cm²/mg     LET = 20 MeV-cm²/mg     LET = 40 MeV-cm²/mg
FO     INV    NAND   NOR       INV    NAND   NOR       INV    NAND   NOR
1      120    125    201       151    157    271       225    244    496
2      142    146    240       176    182    318       260    268    559
4      176    179    310       217    222    400       316    324    664
6      180    180    370       239    240    469       358    363    752

       LET = 50 MeV-cm²/mg     LET = 60 MeV-cm²/mg
FO     INV    NAND   NOR       INV    NAND   NOR
1      254    263    571       286    295    654
2      291    299    638       324    333    725
4      350    357    750       387    395    843
6      395    400    843       435    442    942
Figure 24 Struck node voltage for large LET ion strikes (node voltage in V vs. time in ns for fan-outs of 1, 2, 4 and 6)
Since the fan-out configuration of a given gate differs across a synthesized logic network, it is necessary to characterize transients with respect to such SET-affecting parameters.
4.2.2 SET Scaling Trends
Here we correlate our simulation results with SET data available from
irradiation testing as well as 3D mixed-mode simulations presented in prior research.
This correlation also helps us in confirming the scaling trends for SET pulse widths as
technologies scale. Due to the fact that irradiation testing and simulations were
conducted on different technology nodes and used different device sizes, it is difficult
to make direct comparisons with published data. Yet, we can identify relevant general
trends, allowing some grounds for joint analyses and discussions.
The graph in Figure 25 plots the SET pulse width versus LET for different
experiments including our simulation results for INV and NOR gates. The NAND
results are similar to those of the inverter and hence are not plotted to improve the
graph’s readability. It should be noticed that all previous studies have used inverter as
the target device. As can be seen from the graph, our hybrid simulation results for the
inverter case relate well to the 3D mixed-mode simulation results of Dodd et al for
100nm modeled technology node [33]. More importantly, the trend showing a SET
pulse width decrease with technology scaling, observed by Dodd et al and later
confirmed through direct measurement of transient currents by Ferlet-Cavrois et al. [37], is similarly present in our results, further validating the applicability of the hybrid simulation approach presented here. SET pulse widths for a NOR gate
also follow a comparable trend, but as mentioned earlier, the NOR gate is much more
sensitive to radiation and its resulting transients are much wider.
Figure 25 SET pulse width comparison (pulse width in ps vs. LET in MeV-cm²/mg for Gadlage 0.18um, Dodd 0.10um, Eaton 0.18um, and the simulated 90nm INV and NOR gates)
4.3 Deductions
A simulation methodology to characterize soft error rates has been presented.
This simulation methodology uses critical charge and device geometries to estimate
error rates. Critical charges for 6T SRAM memory cells have been characterized using
3D TCAD simulations as well as with various ion strike current models. A current
model has been identified that results in critical charge estimation close to 3D TCAD
simulations. Using the identified current model, critical charges have been estimated
for predictive technology models in the nanometer regime. The scaling trends for bit
error rates have been investigated for different space radiation environments by
utilizing the estimated critical charges. These trends suggest that even though the
physical bit error rates are decreasing to some extent, the overall soft error rates are increasing even assuming a constant area scaling scenario. These increased soft error rates dictate the need for effective soft error mitigation techniques not only for
space environments but also for terrestrial applications sensitive to reliability.
In addition, an efficient hybrid simulation approach is proposed to model
single-event transients in combinational networks. This approach combines the
accuracy benefits of 3D device simulations in modeling ion strikes, and computation
time efficiency and ability to simulate large circuits of commercial SPICE-type circuit
simulators. Minimal charges that can induce single-event transients on some
commercial 90nm ASIC library cells have been computed using this simulation
approach. Current nanometer technologies are shown to be very sensitive to radiation,
where charges as small as 3.5fC are capable of inducing a transient, increasing the
likelihood of soft errors. Compared to earlier work on 0.25um and 0.18um
technologies, the SET pulse widths predicted here are relatively smaller. It should be
noted that the scaling-driven reduction in SET pulse width does not necessarily
translate into a decrease in the error contribution of SETs, since higher operating
frequencies in scaled technologies make it more likely for these SETs to cause
perturbations that can be latched in storage elements. However, the relative decrease in
transient pulse widths may revive the interest in adopting transient filtering solutions
as part of the radiation hardening techniques. The associated performance penalties for
SET mitigation, thus far significantly large, may become less intrusive to a circuit’s
optimal operations. The mitigation of these transients is explored in the following
chapter.
Chapter 5
Radiation Hardening of Combinatorial Logic
Having discussed the scaling trends for single-event transients and single-event
upsets in the previous chapter, we now focus on the mitigation solutions for these
effects. In particular, we observed that with increasing circuit speeds and lower driving
currents, single-event transients (SET) occurring in combinational logic are becoming
more important detractors for system reliability. Furthermore, we observed that the
over-all single-event upset rate is also increasing with technology scaling, promoting
the need for effective soft error mitigation solutions. SEU mitigation in memories is
the main topic of the following chapter while this chapter addresses the SET and SEU
phenomena in combinational logic. Particularly, this chapter discusses techniques to
mitigate single-event transients as well as single-event upsets from sequential elements
distributed around combinational logic [82][79]. These transients are able to propagate
through several gates and can be latched incorrectly in a memory element, resulting in a soft error. Moreover, these transients can result in the malfunctioning of the system if they reach a primary output triggering some critical control operation or routine. This faulty behavior is generally termed a single-event functional interrupt (SEFI). Furthermore, these transients may happen on clock lines and global control signals such as preset and clear. Therefore, they may result in an erroneous clearing or setting of the state of storage elements if they appear on global control lines. If these
transients occur in the control circuitry of large memory structures, they may result in
upsetting whole words rather than a single bit.
First we briefly summarize the existing solutions from literature to mitigate
transient errors in combinational logic. Then we present our proposed design of the
delay filtered dual inter-locked storage cell (DF-DICE) [82], which is immune to
single event transients on any of the inputs and single event upsets within the storage
cell. A random logic block, consisting of arbitrary combinational logic and sequential
elements, can easily be converted to a soft-error tolerant design by simply replacing all
the sequential elements with the proposed cells. We also discuss the scalability of the
design and show that the proposed cells are highly scalable for different radiation
environments. This scalability of the design results from the characteristic that
the increases in area and speed penalty of a design employing the proposed cells are proportional
to the targeted single-event transient pulse-width in a particular radiation environment.
5.1 Existing SET Mitigation Techniques
This section describes existing mitigation techniques for single-event transients
in combinational logic and single-event upsets in random sequential logic, i.e., latches
and flip-flops. These techniques can be applied to harden any logic block or peripheral
logic for regular memory structures like SRAM/DRAM arrays. These soft error
mitigation techniques require redundancy that may be spatial or temporal or a
combination of both.
5.1.1 Spatial Redundancy Techniques
The simplest spatial redundancy method, Triple Modular Redundancy (TMR), is
easy to understand and implement. This method replicates the whole logic block three
times, and then a majority vote is used to pass the correct value. If hardening is
targeted only for sequential elements, then TMR can be applied to only replicate
critical circuit nodes (latches and flip-flops) three times, and then a majority vote is
used to ignore any corrupt value. A block diagram of TMR is shown in Figure 26. As
evident, this method suffers from large area and power penalties, which are at least
two times those of the original hardware. On the other hand, TMR offers minimal
speed degradation. Only the voter circuit affects the critical path for speed.
Figure 26 Triple Modular Redundancy applied on logic blocks
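The majority vote at the heart of TMR is simple enough to state in one line; the bitwise form below is a generic behavioral sketch, not the voter circuit used later on the test chips.

def majority_vote(a: int, b: int, c: int) -> int:
    # Bitwise 2-of-3 majority: each output bit follows at least two of the
    # three replicated copies, so a single corrupted copy is out-voted.
    return (a & b) | (b & c) | (a & c)

# A bit flip in one replica does not reach the output
assert majority_vote(0b1011, 0b1011, 0b0011) == 0b1011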
The second class of spatial redundancy mitigation techniques for this purpose
replicates critical nodes (latches and flip-flops) and uses feedback to recover the
correct value after an upset. The most commonly used SEU immune cells based on
these techniques include the Heavy Ion Tolerant (HIT) cell [12], the Single-Event
Resistant Topology (SERT) [105], and the Dual Interlocked storage Cell (DICE) [20].
The HIT cell suffers from the drawback that the transistor sizes are critical in restoring
the correct value after a single-event upset. The DICE and SERT cells do not depend
on optimal transistor sizing. Both DICE and SERT offer a comparable area solution,
yet the DICE cell has been widely used in industry, and sufficient data is available in
the literature about the SEU immunity of the DICE cell [11].
5.1.2 Temporal Redundancy Techniques
Temporal redundancy techniques are characterized by sampling the input data
at time intervals spaced greater than the width of the transients to be tolerated and then
voting the corrupt value out. The temporal latch [69] solution is based on this
principle. It uses four clocks to sample data at different time intervals. It employs
TMR to store temporally sampled values in three different latches and then uses a
majority vote to forward the correct value. A block diagram of the temporal sampling latch is shown in Figure 27. This design offers mitigation not only for SEU but also for SET on the data input. Although a modified version of this latch can mitigate SET on clock inputs, this technique fails in the case of transients on asynchronous control signals like
clear and preset. This design incurs a speed penalty which is at least equal to twice the
targeted SET pulse width. Furthermore, tripling the hardware and including the voter
circuit significantly increases the area and power overheads.
Figure 27 Temporal Sampling Latch
5.1.2.1 Delay Filtering
A novel single event transient filtering technique for a data input signal has
been presented in [77]. The basic principle is the same as temporal filtering, but the
filtering action is achieved using a C-element as shown in Figure 28. The C-element is
a state holding element, and it has the basic property that it changes its output only
when all the inputs are of identical logic value. The incoming data signal is fed to the
C-element using two paths; one with zero delay and the other with a critical delay of
Tcritical (equal to the SET pulse width to be suppressed). Hence a transient of Tcritical width will not be able to change the output of the C-element if it has settled to its final output.
Though the latency penalty projected in [77] is said to be slightly larger than Tcritical, it actually depends on the initial state of the circuit. A careful analysis reveals that the actual latency penalty of this technique for mitigating transients of width less than Tcritical is three times Tcritical. This overhead in latency penalty can best be explained by the waveforms shown in Figure 29.

Figure 28 Delay Filter operation

Figure 29 Transient appearing at Din somewhere within Tcritical
If the transient appears at the beginning of time when the Din was supposed to
settle to its final value, then the penalty will be proportional to two times the transient width. On the other hand, if the transient appears at the end of Tcritical, after Din has initially settled, then the latency penalty will be proportional to three times the required transient mitigation threshold.
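The filtering action of the C-element can be illustrated with a simple discrete-time model: the output updates only when the direct and delayed copies of the input agree, so a glitch shorter than the delay never satisfies both paths simultaneously. The sketch below is a behavioral illustration with arbitrary time steps (and it ignores the inversion of the actual C-element), not a timing-accurate model of the circuit in [77].

def c_element_filter(signal, delay_steps):
    # Behavioral C-element delay filter: the output changes only when the
    # input and its delayed copy agree; otherwise it holds its last value.
    # `signal` is a list of 0/1 samples; `delay_steps` models T_critical.
    out, prev = [], 0                      # assume the output starts at 0
    for i, direct in enumerate(signal):
        delayed = signal[i - delay_steps] if i >= delay_steps else 0
        prev = direct if direct == delayed else prev   # hold state on disagreement
        out.append(prev)
    return out

clean   = [0]*4 + [1]*12                   # a real transition
glitchy = [0]*4 + [1]*2 + [0]*10           # a 2-sample transient (SET)
print(c_element_filter(clean, 3))          # the transition passes, delayed by ~3 steps
print(c_element_filter(glitchy, 3))        # the short transient is filtered out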
5.2 The Delay Filtered DICE (DF-DICE)
We combined the above mentioned delay filtering technique [77] with Dual
Inter-locked Storage Cell (DICE) [20]. We apply this filtering technique to all the
inputs of the sequential elements. This design results in a highly robust digital system
which is tolerant to both SETs on any of the inputs of a sequential element, including
Clock, Preset, Clear and Data, and SEUs within the sequential elements. We call this
design the “Delay Filtered DICE (DF-DICE)” [82]. Since transients can occur on any
of the inputs to a storage cell and may result in SEU, all inputs of the storage cell are
identified as critical to SET. Therefore, the delay filtering technique to mitigate SET is
used for every input to the storage cell. We have used a clocked inverter DICE latch
and flip-flop having preset and clear control signals. Figure 30 shows only a latch
version of the proposed cell. There are four delay filters on the left side, each for
Clock, Data, Preset and Clear inputs. On the right side is the DICE storage latch. Since
the C-element in the delay filter is inverting, an inverter is used at the output of each
delay filter to restore the polarity of the input. The operation of the circuit is as
follows: The incoming clock and data signals (as well as the preset and clear control
signals) pass through their respective delay filters. The output of the C-element will
not change unless both inputs have the same logic value. Hence data and clock signals
will need to be held stable for a period at least equal to the delay of the inverter chain.
Figure 30 Delay Filtered DICE Latch
As the output of the C-element is a dynamic node, it can lose its state if left
floating for longer times. To overcome this problem, a weak inverter (not shown)
which provides feedback and maintains the output state can be used at the output node.
If the delays of the clock and data signal through the delay filters are matched,
then the setup and hold times of the internal storage cell (DICE Latch/flip-flop) will
remain unchanged. However, due to the property of the delay filter, the clock period
needs to be extended in proportion to three times Tcritical. Therefore, the maximum achievable clock frequency is reduced in proportion to three times the targeted SET
pulse width. In the latch portion, CLKB1 and CLKB2 are the inverted version of the
CLK signal. Nodes I1, I2, I3, and I4 have the same value stored in the normal latching
mode. Consider a SEU, in which node I1 gets flipped from 1 to 0. The correct value at
I3 will restore I1 via I4 feedback. The stored value of 1 at I3 does not allow the value
of I2 to be changed either. So in the case of a SEU at a single node, the spatial
redundancy and the feedback at two points restore the uncorrupted value on all the
latch nodes.
5.2.1 DF-DICE Implementation
We have implemented the four cells, namely: DICE latch, DICE flip-flop, DF-
DICE latch, and DF-DICE flip-flop for mitigating 800ps wide SETs. All the cells were
laid out using the Magic layout editor. The layout uses MOSIS SCMOS rules for 5-metal single poly TSMC 0.25-micron technology. The delay chain was built using double length inverters to attain an area-efficient delay realization. A sample layout of the DF-DICE latch is shown in Figure 31. Table 6 shows the respective areas for the cells.

Figure 31 Layout of DF-DICE latch
Table 6 Area comparison for DF-DICE
             DICE Cell (λ²)   DF-DICE (λ²)   %age increase
Latch        248x266          248x470        76%
Flip-Flop    264x444          264x646        45%
As shown in the results, the area penalty just considering the cells varies from 45% to
76%. Considering a practical micro-architecture design, this penalty may not result in
a large overall penalty, depending on the number of storage elements in the system.
For example, if an ASIC has 30% of its area occupied by flip-flops, the overall chip
area penalty will be less than 13.5%. HSPICE simulation results show that pulse
widths of up to 825ps are actually mitigated. The speed penalty for 800ps targeted
SET mitigation for various clock speeds is shown in Figure 32.
Figure 32 DF-DICE speed penalty versus clock frequency (percentage penalty for clock frequencies from 50 MHz to 300 MHz)
5.2.2 Scalability of DF-DICE
An important feature of the DF-DICE cell is that it can be scaled to mitigate
transients of various pulse widths by changing the number of inverters in the delay
chain [79]. A direct on-chip measurement of SET pulse widths shows a linear
relationship between the LET (Linear Energy Transfer) of incident ions and the pulse
width of the generated SETs [34]. As the width of transients generated is different in
different environments, a particular delay filter can be used to mitigate all the
transients generated in a specific environment.
In terrestrial environments the main sources of radiation are alpha particles and
by-products of neutron interaction with silicon. The LET of alpha particles is less than
2 MeV-cm²/mg, and so the expected SET pulse width from alpha particle radiation is
less than 200ps [34]. The cosmic ray neutrons result in Si-recoil reactions in the
terrestrial environments. The probability of recoil reactions with energies greater than
15 MeV is negligibly small [8], so according to the results in [34] the expected SET
pulse width for these environments is less than 450ps. Therefore DF-DICE cells
targeted for 450ps will suffice for all terrestrial applications either at sea level or at
higher altitudes. For space environments the flux of ions drastically decreases after an
LET of around 60 MeV-cm²/mg. Based on the minimum value of flux for high LET
values, we can assume that the probability of transients with pulse widths greater than
1.2ns is negligible.
Keeping in view this range of pulse widths for the transients, we have targeted
DF-DICE cells for five different SET thresholds. This scalability of DF-DICE cells for
various SET thresholds is achieved by varying the number of inverters in the delay
chain. A DF-DICE circuit with particular SET threshold mitigates any transients of
pulse widths less than or equal to its characteristic SET threshold. Notice that if two
nodes in the storage cell are hit at the same time, the SEU immunity of the cell may no
longer be effective. Also if the output node of the filter is directly hit by the incident
particle and it is the primary output of the circuit, then the SET immunity may be lost.
In order to overcome these problems, the layout of the DICE cells should be done
carefully and the filter can be implemented with some resistive hardening technique
like the one discussed in [10]. On the other hand, if this is a stage in a long
combinational network, the transient appearing at the output of the filter will be
mitigated in the next stage.
5.2.3 Cost Analysis for Scalable DF-DICE
For cost analysis purposes, we implemented the above mentioned DF-DICE
cells for five different SET thresholds. All the cells were laid out using the magic
layout editor. The layout uses MOSIS SCMOS rules for 6-metal single poly TSMC
0.18-micron technology. The delay chain has been built using double length inverters
to attain an area-efficient delay realization. The relative increase in area of DF-DICE
cells against the DICE only cells has been shown in Table 7 for five different SET
thresholds. The 3rd column shows the per-cell increase in area for DF-DICE latch and
the 4th column for DF-DICE flip-flop.
The per-cell results show that the increase in area can be up to 101% for 1.2ns
SET threshold in a DF-DICE flip-flop. But for practical designs where sequential
elements are only a part of the total area, the overall area penalty will depend on the
percentage of the area occupied by sequential elements. For example, considering a
practical micro-architecture design where sequential elements account for 30% of the
total area, the relative increase in area ranges from 9.5% to 30.4% for flip-flop based designs. For the latch based circuits, where the area penalty per cell is higher, the area penalty ranges from 16% to 51%. The relative increase in overall area as a function of SET threshold for the above case is shown in Figure 33. The speed penalty will depend on the operating speed of the design and will be similar to that shown in Figure 32. These
results clearly show a linear relationship between the cost of employing DF-DICE
cells and the tolerance required which in turn depends on the intended environment of
operation.
Table 7 Area comparison against DICE

Transient     DF-DICE Latch   Increase     DF-DICE Flip-flop   Increase per
Threshold     Area (μm²)      per latch    Area (μm²)          flip-flop
250ps         82160           54%          117515              31.7%
450ps         94413           77%          129768              45.5%
650ps         131936          100%         142424              59.7%
850ps         119925          124%         154878              73.5%
1200ps        145638          172%         179787              101.5%
Figure 33 Overall area comparison for scalable DF-DICE (percentage increase in area vs. SET pulse width in ps for the DF-DICE latch and flip-flop)
5.3 Deductions
This chapter has proposed and evaluated latch and flip-flop designs that are
immune not only to transients on every signal (i.e. clock, data, preset and clear) but
also to upsets within the storage cell. The presented cells are scalable to various
radiation environments. This design technique offers a high degree of automation
where all the storage elements in an existing design can be replaced with these standard
cells. An effective estimation of the transient pulse widths in a particular environment
is necessary before finalizing the design decisions. An SET characterization
methodology, presented in the previous chapter, can be employed for estimating these
transient widths in any particular environment. However, the delay filter itself is prone
to SETs as is the case with the TMR voter circuit. Yet a significant improvement in
SET mitigation arises because a large number of radiation-susceptible target nodes have been replaced by far fewer such nodes.
Chapter 6
Implementation Architectures for ECC Model-based
Radiation Hardening of Memories
We have earlier presented theoretical models to mitigate soft errors for
memory systems in chapter 3. This proposed ECC model-based radiation hardening of
memories requires the implementation of various ECC encoders and decoders and a
scrub controller. This chapter explores existing ECC implementation architectures and
presents a new approach for implementing more powerful codes such
as double error correcting (DEC) codes, which is suitable for memory applications
[78]. This implementation methodology helps in achieving the targeted solution for a
desired reliability predicted by the model. Section 6.1 presents a literature survey of
existing ECC implementation architectures. Section 6.2 presents the design of higher
strength ECC schemes that are aligned to typical memory word sizes. Our proposed
implementation approach for these higher strength ECCs is described in section 6.3.
The practicality of the proposed approach is demonstrated by implementation results
in section 6.4.
6.1 Survey of Existing ECC Implementations
The commonly used error correcting code to correct soft errors is single error
correcting (SEC) Hamming code [44]. The SEC Hamming code uses different parity
bits for various combinations of data bits. The parity bits are generated using modulo-
2 addition. For example, if four data bits are D1 through D4, then the required three
parity bits for a (7, 4) Hamming code can be generated as follows:
P1 = D1+D2+D4 (6)
P2 = D1+D3+D4 (7)
P4 = D2+D3+D4 (8)
The parity bits are labeled 1, 2 and 4 because they are placed at the power of 2
locations in the output word while data bits are spread at the remaining locations. The
implementation for this encoder is simple and can be accomplished by using XOR gates for the parity equations. The decoder circuit, on the other hand, is more complex. The decoder circuit for this example is shown in Figure 34.

Figure 34 SEC decoder
The first part of the decoder circuit is similar to the encoder where a syndrome
is calculated, which is a modulo-2 sum of the received parity bits and computed parity
bits. If this syndrome is zero then it means that no error has occurred. If the syndrome
is not zero, it indicates the location of the erroneous bit. A syndrome decoder is needed to decode the error location; the erroneous bit can be corrected by performing a mod-2 sum of every bit with the output of the syndrome decoder. In this way the correction circuit can be made from 2-input XOR gates; one input of each XOR gate is the read data bit while the other is the output of the error location (syndrome) decoder. Note that
the parity bits are not forwarded to the output and there is no need to correct if the
parity bits are wrong, so not all outputs of the syndrome decoder are used and its
circuit can be optimized.
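A bit-level sketch of this (7, 4) SEC flow, using the parity equations (6)-(8) and the power-of-2 parity placement described above, is shown below.

def hamming74_encode(d):
    # d = [D1, D2, D3, D4]. Returns the 7-bit codeword with parity bits
    # P1, P2, P4 at positions 1, 2 and 4 (1-indexed), per equations (6)-(8).
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]          # codeword positions 1..7

def hamming74_decode(c):
    # Recompute the parities, form the syndrome, and flip the bit at the
    # position the syndrome points to (a zero syndrome means no error).
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]               # checks positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]               # checks positions 2,3,6,7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]               # checks positions 4,5,6,7
    err_pos = s4 * 4 + s2 * 2 + s1               # 1-indexed error location
    if err_pos:
        c[err_pos - 1] ^= 1                      # correct the flipped bit
    return [c[2], c[4], c[5], c[6]]              # extract D1..D4

cw = hamming74_encode([1, 0, 1, 1])
cw[5] ^= 1                                       # inject a single-bit error
assert hamming74_decode(cw) == [1, 0, 1, 1]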
A simple extension to the SEC Hamming code is the addition of an overall
parity bit. By adding this overall parity bit, the code can distinguish between single
and double errors. If the computed overall parity is the same as the received overall parity and there is a non-zero syndrome, it means that a double error has occurred. On the other hand, if the computed overall parity bit is different from the received parity, then a single error has occurred and can be corrected by the decoder. Note that if more than two bit errors have occurred, the code will fail and wrongly decode the codeword. This single-error-correcting and double-error-detecting (SEC-DED) code is called the extended Hamming code.
In contrast to extended-Hamming code, Hsiao codes offer an optimal
implementation for SEC-DED codes [52]. The computation of the overall parity bit in decoding the extended Hamming code lies in the critical path and increases the latency
of the decoder. The Hsiao decoders avoid this latency by designing the code in such a
way that all single bit errors result in odd-weight syndromes while double bit errors
result in even-weight syndromes. The implementation architectures for Hsiao codes
are described in [52], and therefore not repeated here.
BCH (Bose-Chaudhuri-Hocquenghem) codes are a class of powerful random
error-correcting cyclic codes [64]. Although BCH codes have been used in
communication systems, they are typically not applied in high-speed memory
applications due to their relatively large redundancy requirements and decoding
complexity [107]. Commonly employed iterative BCH decoding such as Berlekamp-
Massey, Euclidean and Minimum Weight Decoding algorithms in communication
systems require a multi-cycle decoding latency [64][117][66][36]. Given the
dependence of microprocessor performance on memory latency and bandwidth, this
multi-cycle decoding latency is not tolerable for memory systems. Another
impediment is the block size of primitive BCH codes, which does not align with
typical memory word sizes [64].
6.2 DEC BCH Code Design for Memory Word Sizes
A binary (n, k) linear block code is a k-dimensional subspace of a binary n-
dimensional vector space. Thus, an n-bit codeword contains k bits of data and r (= n - k) check bits. An r x n parity check matrix H, or alternatively a k x n generator matrix
G, is used to describe the code [64]. Due to the cyclic property of BCH codes, a
systematic generator matrix for primitive BCH codes of the form G_{k,n} = [I_{k,k} | P_{k,r}] can be generated by combining two sub-matrices. I_{k,k} is an identity matrix of dimension k, and P_{k,r} is a parity sub-matrix consisting of the coefficients of k parity polynomials of degree less than r. The k parity polynomials can be obtained from a polynomial division involving the generator polynomial g(x) of the BCH code as in (1):

P_i = remainder[ X^(n-k+i) / g(X) ] mod 2,   i = 0, 1, ..., k-1        (1)
For example, a systematic generator matrix G for a primitive DEC (double
error correcting) BCH (31, 21), shown in Figure 35, is computed using this method.
Notice that the described code derivation procedure is based on the cyclic property of
BCH codes and is different from the normally used procedure in [64] and other texts.
As evident from the code parameters, the information word size of 21 bits is not a
common memory word size. We adopt a code shortening procedure to align it to a
memory word size, in this case 16 bits. Starting with row 17 of the generator matrix G,
the column containing that row's 1 entry in the identity sub-matrix section is identified, and this column and the respective row are deleted from G; the step is repeated through row 21. In this way, a
reduced generator matrix G for DEC (26, 16) is obtained, also shown in Figure 35.
Figure 35 Systematic generator matrix of DEC (31, 21) & (26, 16)
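The matrix construction of (1) and the shortening step can be reproduced with a few lines of GF(2) arithmetic, as sketched below. The sketch assumes the generator polynomial g(x) = x^10 + x^9 + x^8 + x^6 + x^5 + x^3 + 1 commonly tabulated for the primitive (31, 21) DEC BCH code; the exact polynomial underlying Figure 35 should be taken from the code design itself.

import numpy as np

# Assumed generator polynomial for the primitive DEC BCH(31, 21) code,
# g(x) = x^10 + x^9 + x^8 + x^6 + x^5 + x^3 + 1 (coefficients low to high).
G_POLY = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]

def parity_row(i, n=31, k=21, g=G_POLY):
    # Coefficients of remainder(x^(n-k+i) / g(x)) mod 2, as in equation (1).
    r = n - k
    rem = [0] * (r + i) + [1]                 # x^(r+i), low-order first
    for d in range(len(rem) - 1, r - 1, -1):  # long division over GF(2)
        if rem[d]:
            for j, gj in enumerate(g):
                rem[d - r + j] ^= gj          # subtract (XOR) g(x) * x^(d-r)
    return rem[:r]                            # remainder has degree < r

def systematic_G(n=31, k=21):
    # G = [I_k | P], with each parity row computed from (1).
    P = np.array([parity_row(i, n, k) for i in range(k)], dtype=int)
    return np.hstack([np.eye(k, dtype=int), P])

def shorten(G, drop=5):
    # Delete the last `drop` message rows together with the columns holding
    # their identity-1 entries, giving DEC(26, 16) here.
    k = G.shape[0]
    keep_rows = list(range(k - drop))
    keep_cols = keep_rows + list(range(k, G.shape[1]))
    return G[np.ix_(keep_rows, keep_cols)]

G31_21 = systematic_G()
G26_16 = shorten(G31_21, drop=5)
print(G26_16.shape)                           # (16, 26)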
6.3 Implementation Approach
We propose a pure combinational logic approach to implement more powerful
codes such as DEC BCH codes [78]. This approach is constructed on a standard array
based syndrome decoding procedure [64]. In this procedure, a set of syndromes is pre-
computed corresponding to correctable error patterns and stored in a ROM based
lookup table (LUT). The resources to store and access this LUT increase exponentially
as the block size of the code increases (which is generally the case in communication systems), thereby inhibiting the usefulness of this procedure. With the relatively smaller block sizes of typical memory words, we adopted and modified this procedure to decode DEC BCH codes. Instead of using a ROM-based lookup table (LUT), error
correction bits are set according to Boolean function mapping of syndrome patterns.
This allows Boolean function implementation using standard cell ASIC design
methodology. In the following, we describe the encoder and decoder circuit
implementation in detail.
6.3.1 DEC Encoder
The encoding process converts a data word (row vector b) into a codeword
(row vector c) by multiplying it with the generator matrix using modulo-2 arithmetic
i.e. c = b * G. With systematic generator matrix G, data bits are passed as-is in the
encoding process and only the check bits need to be computed. The computation of
check bits is accomplished through XOR trees as shown in Figure 36 for DEC (26,
16). The inputs to each XOR tree are data bits chosen according to the non-zero entries in the respective columns of the parity sub-matrix, as was shown in Figure 35. The depth of each XOR tree has an upper bound of log2(k) if implemented using 2-input XOR gates. The generated codeword is then stored in memory along with the appended check bits.

Figure 36 Encoder circuit for DEC (26, 16)
6.3.2 DEC-TED Encoder
A DEC-TED (double error correcting – triple error detecting) encoder is
similar to a DEC encoder except that an additional overall parity bit is added. Since
this is an overall parity bit covering the data as well as check bits, it can only be
computed after all other check bits have been computed. Therefore, this bit becomes
the critical path in a DEC-TED encoder and can considerably increase the latency of
the encoder.
6.3.3 DEC Decoder
For decoding purposes, a parity check matrix H of the form H_{r,n} = [P^t_{r,k} | I_{r,r}] is required, where P^t is the transpose of the parity sub-matrix in the systematic G and I is the identity matrix. The input to the decoder is the read codeword vector v, which may contain errors in data or check bit locations. A syndrome s is computed, using modulo-2 arithmetic, by multiplying H with v^t (the transpose of the read codeword v), i.e. s = H * v^t. A non-zero syndrome implies the presence of errors, in which case corresponding
bits in the error location vector e are set by the error pattern decoder. The error
location vector e is added with the received codeword v to get the corrected data. The
error pattern decoder and error corrector circuits can be optimized to correct errors
only in data bit locations.
A block diagram of the decoder is shown in Figure 37 (ignore the TED portion
for now) containing the three main parts: 1) Syndrome Generator, 2) Error Pattern
Decoder and 3) Error Corrector. The circuit for the syndrome generator is similar to
the encoder circuit. Essentially, it re-computes the check bits and compares those with
the received check bits. In the case of no error, a zero syndrome is generated, which can be used to drive the error flag accordingly. Alternatively, a non-zero syndrome is generated if the computed check bits do not match the received check bits. An error pattern
decoder circuit is implemented using combinational logic that maps the syndromes for
correctable error patterns. This mapping is pre-computed by multiplying all
correctable error patterns with the parity check matrix H. For binary vectors, an
erroneous bit is corrected merely by complementing it; therefore, the error corrector
circuit is simply a stack of XOR gates.
Figure 37 Block diagram of BCH decoder
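The standard-array idea behind the parallel decoder (precompute the syndrome of every correctable error pattern, then map syndromes back to error vectors) can be sketched as follows. The H matrix is derived from a systematic G such as the shortened G26_16 of the earlier sketch, so this is an illustration of the procedure under those assumptions rather than the synthesized circuit.

from itertools import combinations
import numpy as np

def parity_check_from_G(G):
    # H = [P^t | I_r] for a systematic G = [I_k | P].
    k, n = G.shape
    P = G[:, k:]
    return np.hstack([P.T, np.eye(n - k, dtype=int)])

def build_syndrome_table(H, t=2):
    # Map the syndrome of every weight-<=t error pattern to that pattern;
    # the combinational "error pattern decoder" is the hardware analogue.
    r, n = H.shape
    table = {}
    for w in range(1, t + 1):
        for pos in combinations(range(n), w):
            e = np.zeros(n, dtype=int)
            e[list(pos)] = 1
            table[tuple(H.dot(e) % 2)] = e
    return table

def decode(v, H, table):
    # Single-pass decode: syndrome, table lookup, XOR correction.
    # v is a 0/1 numpy array holding the read codeword.
    syn = tuple(H.dot(v) % 2)
    if not any(syn):
        return v                                   # zero syndrome: assume no error
    return (v ^ table[syn]) if syn in table else v # uncorrectable: pass through

# Usage with the shortened DEC(26, 16) code of the earlier sketch:
# H = parity_check_from_G(G26_16); table = build_syndrome_table(H, t=2)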
6.3.4 DEC-TED Decoder
The decoder for a DEC-TED code is similar to the decoder for DEC with
modifications necessary to handle triple-bit error detection, as shown in Figure 37. In
particular, an all-0 column and an all-1 row are added to the DEC H matrix to obtain
the parity check matrix for DEC-TED. This increases the syndrome vector by 1-bit,
doubling the number of syndromes. The error location decoder then maps the
syndromes for 3-bit errors to a sentinel pattern. A simple sentinel value, such as the three least significant bits of the error pattern e all being set, can be ANDed together to flag triple-error detection.
6.4 Implementation Results
For demonstrating the practicality of the parallel implementation approach,
DEC and DEC-TED encoder and decoder circuits have been implemented for typical
memory word sizes of 16, 32 and 64 bits [81]. For analyzing trade-offs with
conventionally used ECC schemes, SEC and SEC-DED codes have also been
implemented. Since the Hsiao code is an optimal SEC-DED code, we have included
only the Hsiao code for SEC-DED ECC for our comparisons. Synopsys Design
Compiler (DC) has been used for synthesizing all encoder and decoder circuits
targeted to an IBM 90nm standard cell library. Note that the major area overhead for
any ECC scheme comes from the redundancy required for its operation and is a
function of the error detection and correction capability of the code and block size.
With increasing block size, the required redundancy overhead decreases for all linear
block codes. Table 8 shows the redundancy overhead for SEC and DEC codes, while SEC-DED and DEC-TED codes each add one additional overall parity bit. It can be observed from the table that DEC BCH codes incur twice the check-bit overhead of SEC codes. Therefore, in the following we restrict ourselves to the overhead discussion of the encoder and decoder circuits only.
Table 8 Required redundancy for ECC codes

             Hamming SEC                 BCH DEC
Data bits    Check bits   % overhead     Check bits   % overhead
16           5            31.2           10           62.5
32           6            18.7           12           37.5
64           7            10.9           14           21.8
128          8            6.2            16           12.5
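The check-bit counts in Table 8 follow from the standard code-length relations: a SEC Hamming code needs the smallest r with 2^r >= k + r + 1, and a shortened DEC BCH code over GF(2^m) needs roughly 2m check bits with 2^m - 1 >= k + 2m. A small sketch reproducing the table's counts under these relations:

def sec_check_bits(k):
    # Smallest r with 2**r >= k + r + 1 (Hamming bound for SEC).
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r

def dec_check_bits(k):
    # ~2m check bits for a shortened DEC BCH code over GF(2^m),
    # with the block length constraint 2**m - 1 >= k + 2m.
    m = 2
    while 2 ** m - 1 < k + 2 * m:
        m += 1
    return 2 * m

for k in (16, 32, 64, 128):
    print(k, sec_check_bits(k), dec_check_bits(k))
# expected (per Table 8): 16 -> 5/10, 32 -> 6/12, 64 -> 7/14, 128 -> 8/16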
Table 9 shows the latency and area results for post-synthesis encoder circuits
while Table 10 shows the latency and area results for decoder circuits. A major
inference from the synthesis results is that the decoding latency for the DEC codes is
reasonably small and much better than that of the multi-cycle shift register based
decoders used in communication systems. Therefore, this parallel implementation of
the DEC codes makes it feasible to utilize DEC ECC for memory applications.
Table 9 ECC encoder latency and area results
Data     Ham. SEC            Hsiao SEC-DED       DEC                 DEC-TED
Width    Latency   Area      Latency   Area      Latency   Area      Latency   Area
         (ns)      (μm²)     (ns)      (μm²)     (ns)      (μm²)     (ns)      (μm²)
16       0.4       296       0.4       291       0.5       496       0.9       786
32       0.5       598       0.5       605       0.6       1250      1.1       1424
64       0.65      1302      0.7       1168      0.7       2335      1.3       2546
Table 10 ECC decoder latency and area results
Data     Ham. SEC            Hsiao SEC-DED       DEC                 DEC-TED
Width    Latency   Area      Latency   Area      Latency   Area      Latency   Area
         (ns)      (μm²)     (ns)      (μm²)     (ns)      (μm²)     (ns)      (μm²)
16       0.9       576       0.9       935       1.4       4288      1.7       5432
32       1.1       1303      1.1       1376      1.8       11735     2.2       13757
64       1.3       2412      1.3       2681      2.2       37279     3         42976
As expected, the overall parity bit results in increasing the latency penalty of
the DEC-TED encoder by 80% to 85% as compared to the DEC encoder. Therefore,
DEC is preferred wherever possible as compared to DEC-TED. Looking at the
decoder results, we see that the latency of SEC and Hsiao SEC-DED decoders is
identical. On the other hand, the decoding latency for DEC and DEC-TED varies
significantly, by 21% to 36% for 16- and 64-bit decoders respectively. The
latency increases between DEC and DEC-TED because the corresponding syndrome
and correctable & detectable error patterns are doubled by adding the overall parity
bit. For DEC-TED, the computed overall parity cannot simply be compared with the
received overall parity to infer triple bit error detection since the computed overall
parity will be the same for single- and triple-bit errors, but single-bit errors are
correctable while triple-bit errors are not. In contrast, in the Hsiao SEC-DED case, the
single and double error distinction is based on the weight of syndrome (odd/even). If
we compare the latency penalty between Hsiao SEC-DED and DEC-TED, it is almost
double both for encoder and decoder. However, in comparing SEC-DED to DEC, the
percentage decoder latency penalty varies by 55% to 69% for different block sizes,
implying again that DEC is strongly preferred over DEC-TED unless there is strong
evidence that the target application requires triple-error detection.
Another important implication can be made from the almost identical latencies
of the Hsiao SEC-DED and DEC encoders. Since the syndrome generator circuit is
similar to the encoder circuit, error detection can be accomplished with quite similar
latency for both SEC-DED and DEC codes. The ECC implementation architecture can
benefit remarkably from this observation. In particular, as most memory accesses
would be error-free, data can be passed for processing to the next stage without the full
decoder delay as shown in Figure 38. In erroneous cases, when a non-zero syndrome is
detected, only then is the full decoder latency needed to correct the errors. For these
cases only, the next processing stage can be stalled for a cycle or two depending on the
speed of the processing stage, minimizing the overall performance penalty.
The spread of the area results for the encoder and decoder circuits is quite large
for various ECC schemes, but the overall area for each is very small compared to
typical ASIC sizes. Thus, these ECC encoders and decoders can be implemented
without a significant area impact. As derived in chapter 3, the reliability gain provided
by the double error correcting code is four orders of magnitude superior to single error
correcting codes for typical bit error rates. Therefore, applications requiring high
reliability, especially against soft errors or low-power standby SRAMs with extremely
low data retention voltages [60], can use DEC codes with relatively reasonable
performance overheads.
Figure 38 Minimizing decoder latency overhead on system performance
6.5 Deductions
Double error correcting BCH code designs, which are aligned to typical
memory word sizes, have been presented. A parallel approach for implementing these
double error correcting BCH codes for memory applications is described. This
approach enables a simple single-cycle implementation of DEC decoders, which
compares favorably to the iterative multi-cycle decoding used in communication
systems. ECC encoder and decoder circuits have been synthesized using standard cell
IBM 90nm technology for typical memory word sizes. Though the DEC decoders
incur more latency compared to optimized SEC-DED codes (55% more for a 16-bit and 69% more for a 64-bit word), a careful ECC implementation architecture can minimize the effects
of this penalty on overall system performance. In particular, data can be forwarded to
following stages after approximately equal error detection latency for both SEC-DED
and DEC codes, incurring the full decoder latency in the critical path only in erroneous
cases. The DEC code offers four orders of magnitude better reliability compared to
conventional SEC-DED codes for typical soft error rates. Therefore, it is especially
helpful for memories severely affected by increased error rates in scaled technologies,
either due to soft errors or errors occurring due to process variations. Furthermore,
low-power techniques employing extremely low data retention voltages during
standby modes can also benefit from these DEC codes.
Chapter 7
Implementation Results
To validate our modeling and mitigation strategies, we designed two SRAM
chips. Both chips were designed in IBM 90nm CMOS technology, one with a low
power (LP) process and the other with a standard fabrication (SF) high-performance
process. The test chips were designed to attain the required effective bit error rates while still meeting high packaging density and low speed penalty goals. In particular, the
SRAM arrays were built using the foundry provided 6T cell with special design rule
waivers targeted for high density. Error correcting codes in conjunction with scrub
rates were used to reduce the effective error rate. Radiation testing of the test chips has
been performed at Boeing’s Radiation-Effects Laboratory and at Lawrence Berkeley
National Laboratory (LBNL). This chapter describes our test chip designs and
irradiation testing results [86][17][81][80].
7.1 Test IC Design Description
This section summarizes the design highlights of the test SRAM ICs.
7.1.1 Low Power SRAM Chip
The test chip consists of two memory modules: one called baseline SRAM and
the other called hardened SRAM. The baseline SRAM consists of 64k bits organized
as 256 rows and 128 columns with a word size of 16-bits. It is called baseline SRAM
since no radiation hardening technique, either in peripheral logic or in the memory
array itself, is applied. This module serves as the reference vehicle to estimate and
compare the process’s inherent error rate with the error rate of the hardened SRAM.
The hardened SRAM utilizes the same memory cell as the baseline SRAM which is
designed for high density purposes in the low power commercial 90nm process. It uses
single error correction – double error detection (SEC-DED) Hsiao code (22, 16) for
mitigating errors occurring in the array. The hardened array is organized as 256
rows and 176 columns to account for extra redundancy bits. This array also uses bit-
interleaving of 8X (i.e. each consecutive bit in a word is separated by 8 physical bits
from other words) so that multiple bit errors resulting from a single ion strike in a
word appear as single bit errors in multiple words and hence be correctable by applied
SEC-DED ECC. The peripheral logic has been implemented using TMR, applied on
address decoder, ECC encoder and decoder and control logic, to harden against single-
event transients. This TMR protection potentially prevents single event transients in
the peripheral logic to introduce whole word errors. The Figure 39 shows the overview
of the chip. The scrubbing mechanism for this chip is handled by an off-chip scrub
controller.
Figure 39 Top-level layout of the LP SRAM Chip
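As a concrete illustration of the 8X interleaving described above, the sketch below shows one plausible logical-to-physical bit mapping for a 176-column row holding eight interleaved (22, 16) codewords. The mapping function and column ordering are assumptions for illustration only, not the exact column ordering used on the chip.

    # Illustrative sketch (assumed mapping, not the chip's exact column order):
    # with 8X interleaving, a 176-column row holds 8 interleaved 22-bit codewords,
    # so bit j of logical word w is placed at physical column j*8 + w.
    INTERLEAVE = 8       # number of words interleaved within one row
    CODEWORD_BITS = 22   # (22,16) SEC-DED codeword width

    def physical_column(word_index: int, bit_index: int) -> int:
        """Map (logical word, bit position) to a physical column in the row."""
        assert 0 <= word_index < INTERLEAVE and 0 <= bit_index < CODEWORD_BITS
        return bit_index * INTERLEAVE + word_index

    # Any cluster of up to 8 physically adjacent upset cells lands in 8 distinct
    # logical words, so each word sees at most one error, which SEC-DED corrects.
    cluster = range(40, 48)                                # an example 8-cell upset cluster
    print(sorted({col % INTERLEAVE for col in cluster}))   # -> [0, 1, 2, 3, 4, 5, 6, 7]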
7.1.2 High Performance SRAM Chip
The high performance chip also consists of two modules, a baseline SRAM and a hardened SRAM, for reasons similar to those described for the LP SRAM. The basic word size for this chip is 7 bits, chosen to match the double error correcting (15, 7) BCH code applied to the hardened SRAM. The baseline SRAM is organized in 512 rows and 128 columns. Again, the hardened SRAM uses the same basic memory cell structure for this process, but there are more cells per word due to the redundancy required by the DEC BCH code. In particular, the hardened SRAM is organized in 512 rows and 256 columns. A bit interleaving of
16X is implemented for this chip. Peripheral logic for baseline SRAM uses standard
cells while peripheral logic for hardened SRAM uses TMR to mitigate the effects of
single-event transients. An on-chip scrub controller has been implemented for this
chip. The scrub controller sequentially updates a word at the end of each scrub cycle.
The top-level layout of the SF chip is shown in Figure 40.
Figure 40 Top-level layout of the SF SRAM Chip
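The following is a minimal sketch of the sequential scrubbing behavior described above, in which one word is read, run through the ECC decoder, and written back corrected before the address advances. The function names and decoder interface are assumptions for illustration; they do not reflect the actual on-chip RTL.

    # Illustrative sketch (assumed interface, not the actual on-chip RTL): the
    # scrub controller touches one word per scrub cycle, correcting it through
    # the ECC decoder and writing the clean codeword back before moving on.
    def scrub_one_word(memory, decode, encode, addr):
        """Scrub the word at addr and return the next sequential address."""
        data, error_detected = decode(memory[addr])   # DEC BCH decode (assumed API)
        if error_detected:
            memory[addr] = encode(data)               # restore a clean codeword
        return (addr + 1) % len(memory)               # wrap around at the array end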
7.2 Radiation Testing
This section describes the testing mechanism and heavy-ion induced radiation
test results, which we have also reported in [86][17].
7.2.1 Test Facility
Radiation testing of single-event effects for both LP and SF ICs has been
performed at Lawrence Berkeley National Laboratory's (LBNL) 88-Inch Cyclotron, for a total beam time of 32 hours, in accordance with the EIA/JEDEC standard number 57-96 document [90]. The heavy ion SEE characterization used the 10 MeV cocktail, as it has the optimal ion energies, LETs and ranges (penetration depths) for this experiment. Table 11 presents the energies and LETs of the calibrated ions within the 10 MeV cocktail option that were used in all of our SEE experiments. Note that the two highest "effective" LETs were obtained by performing tests under angled-strike conditions.
Table 11 10 MeV cocktail ions used during the SEE tests
  Ion               Energy (MeV)   LET (MeV-cm^2/mg)
  B                 108            0.89
  O                 184            2.19
  Ne                216            3.49
  Ar                400            9.74
  Mn                541            16.88
  Cu                659            21.03
  Kr                886            30.85
  Xe (normal)       1330           59.08
  Xe (30° strike)   1330           67.8
  Xe (60° strike)   1330           117.44
All tests were run to a total fluence of 1x10^7 particles/cm^2, until at least 200 errors were recorded or until a destructive or functional perturbation in the DUT occurred. The primary goal of the SEU testing was to determine the parameters relevant to BER calculations, i.e., the threshold LET and the sensitive cross-section of each SRAM device. Considering the worst-case scenario, SRAMs were tested in vacuum at the -10% nominal voltage condition, i.e., 1.08V for LP and 0.9V for SF.
7.2.2 Testing Setup
During SEU testing, the device under test (DUT) was mounted on a custom
daughter card, in turn connected to a Xilinx Spartan-3 FPGA Avnet board, shown in
Figure 41.
Figure 41 FPGA tester card and cube of daughter cards
The entire FPGA-based tester was then placed inside the vacuum chamber, and mounted to its circular rotating plate (Figure 42). Because of the beam's dimensions at the DUT's surface (~1 quarter in size), the risk of having straggling ions corrupting the FPGA tester is negligible. Additionally, such proximity between the DUT and the electronics monitoring its response to radiation (the Avnet board) is highly desirable, because it reduces speed limitations due to loading delays usually present in regular BNC cables.
The speed increase was made possible by implementing the test software on the tester board itself. The code controlled the timing between signals necessary to properly operate the SRAMs. It allowed the loading of predetermined bit patterns (all "0", all "1", checkerboard) on the SRAM chips, as well as executing the various combinations of read and/or write functions defined by the testing algorithms. Execution commands were fed to the Avnet board, through the vacuum chamber wall interface, from a host PC via an RS232 connection, while test data was periodically sent from the tester to the host PC for storage.
Figure 42 Test chip mounted on the daughter card for radiation beam exposure
This fully integrated test setup allowed us to exercise the DUT at a clock
frequency of 100 MHz, almost a 2-times speed improvement when compared to more
traditional testing methods used in the community so far. The gain in testing speed is
critical to accurately characterize the SRAM’s SEU response in near real-time
operating conditions.
7.2.3 Test Results
Figure 43 and Figure 44 present the raw SEU cross-sections of the LP baseline
and hardened SRAMs, respectively, showing the radiation response of each SRAM before any error correction (if applicable) is performed. Similarly, Figure 45 and
Figure 46 present the raw SEU cross-sections of the SF baseline and hardened SRAMs
respectively. Each data point plotted in the figures represents the ratio between 1) the
number of errors counted at the uncorrected output of the tested SRAM and 2) the
total number of particles (total fluence) to which the SRAM was exposed. The ratio is
plotted for each particle listed in Table 11, normalized to the bit size of each SRAM.
For test completeness, the SRAMs were exposed under a variety of loaded bit
patterns, to investigate any possible pattern-sensitive dependence. The patterns used
are: “all 0” or 00 (so that any 2 adjacent bits are loaded with “0”), “all 1” or 11 (any 2
adjacent bits are loaded with “1”), and “10” or checkerboard pattern (where the held
memory value alternates for any two neighboring cells, across any two successive
rows and columns). Additionally, tests were conducted on the SRAMs for two
different operating modes: a static test mode, where the memory content was not
accessed during irradiation, hence leaving the peripheral/control circuitry of the core
SRAM array inactive (i.e. row/column decoders, sense amplifiers, word line drivers);
a dynamic mode, where the memory content was continuously read/written back
during irradiation, therefore continuously activating the same peripheral/control
circuitry previously left inactive in the static test.
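The per-bit cross-section computation described above reduces to a simple ratio, sketched below with placeholder numbers; the error count, fluence and array size used here are illustrative only and are not measured values from these experiments.

    # Minimal worked sketch of the per-bit cross-section described above;
    # the error count, fluence, and memory size are placeholders only.
    def per_bit_cross_section(errors: int, fluence_cm2: float, bits: int) -> float:
        """sigma_bit = N_errors / (fluence * N_bits), in cm^2 per bit."""
        return errors / (fluence_cm2 * bits)

    sigma = per_bit_cross_section(errors=250, fluence_cm2=1e7, bits=64 * 1024)
    print(f"{sigma:.2e} cm^2/bit")   # -> 3.81e-10 cm^2/bit for these placeholders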
Figure 43 Baseline SRAM raw SEU cross-section versus LET for LP chip.
Figure 44 Hardened SRAM raw SEU cross-section versus LET for LP chip
Figure 45 Baseline SRAM raw SEU cross-section versus LET for SF Chip
Figure 46 Hardened SRAM raw SEU cross-section versus LET for SF chip
(Each data set specifies the data pattern loaded and the operating mode: static (s) or
dynamic (d))
Any noticeable cross-section difference between the two operating modes would indicate a sensitivity of the peripheral/control circuitry to single-event transients (SETs), as it is mostly composed of combinational logic elements. The following key observations can be made from these results:
• All SEU cross-sections (LP and SF, baseline and hardened) showed a very low onset LET. Upsets were always observed for the lightest ion available in the 10 MeV beam cocktail: Boron, with an LET of 0.89 MeV-cm^2/mg. Using the common definition of "threshold LET" (LET at 10% of the saturating cross-section), we determined that LET_th for these test chips is ~3-5 MeV-cm^2/mg.
• The analysis of the LP and SF cross-section data did not show any clear pattern dependence of the SRAMs' cross-sections, as data points from the various loaded patterns overlapped across the range of tested ions and LETs.
• Similarly, no appreciable differences between static and dynamic testing conditions were observed, indicating that the circuitry peripheral to the memory core has a very low error cross-section: the total error cross-section of each SRAM is dominated by the cross-section of its memory core.
For further analyses and bit error rate (BER) calculations, a Weibull function
was fitted to each of the SRAM experimental SEU datasets (Figure 47). The Weibull
parameters were optimized using a log-weighted least-squares method to achieve a best fit to the data set distributions. The fitted parameters were then entered into CREME96's SEE rate-calculation routines [24]. The functional form of the Weibull function is F(x) = A (1 - exp{-[(x - x0)/W]^s}), with:
- A = limiting or saturating cross-section
- x0 = onset parameter, such that F(x) = 0 for x < x0 (onset LET)
- W = width parameter
- s = a dimensionless exponent
The Weibull parameters for the LP and SF SRAMs are shown in Table 12.
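A minimal sketch of such a Weibull fit is shown below, using SciPy's curve_fit with relative-error weighting (sigma proportional to the measured cross-sections) as a stand-in for the log-weighted least-squares objective used in this work; the data arrays, parameter bounds and initial guesses are placeholders, not the measured LP or SF datasets.

    # Minimal sketch of the Weibull cross-section fit; placeholder data, and
    # relative-error weighting (sigma=xs) as a stand-in for log weighting.
    import numpy as np
    from scipy.optimize import curve_fit

    def weibull_xs(let, A, x0, W, s):
        """F(x) = A * (1 - exp(-((x - x0)/W)^s)), clamped to 0 below onset x0."""
        x = np.maximum(let - x0, 0.0)
        return A * (1.0 - np.exp(-(x / W) ** s))

    # Placeholder data: LET (MeV-cm^2/mg) vs. per-bit cross-section (cm^2)
    let = np.array([0.89, 2.19, 3.49, 9.74, 16.88, 21.03, 30.85, 59.08])
    xs  = np.array([1e-10, 8e-10, 3e-9, 8e-9, 1.2e-8, 1.4e-8, 1.6e-8, 1.8e-8])

    popt, _ = curve_fit(weibull_xs, let, xs, p0=[2e-8, 0.5, 30.0, 1.0],
                        bounds=([0, 0, 1e-3, 1e-3], [1e-6, 5.0, 200.0, 10.0]),
                        sigma=xs)
    print(dict(zip(["A", "x0", "W", "s"], popt)))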
From the combined cross-section test results of Figure 47, a few important
observations can be made. First of all, there is very little difference between the Weibull fits for the LP baseline and hardened SRAMs, as well as between those for the SF baseline and hardened SRAMs. This is a clear indication that, within a similar process, the radiation response of an SRAM is dominated by the sensitivity of its unhardened memory core. Furthermore, the SF cross-sections are higher than the LP cross-sections across the range of tested LETs (about twice as sensitive at saturation). Such a difference in sensitivities may be related to the lower operational Vdd of the SF technology (1.0V for SF vs. 1.2V for LP), in addition to the -10% Vdd worst-case SEU test conditions. The small difference between the baseline and hardened SRAMs suggests a minor impact of the TMR-protected control circuitry. It means that either transients were not generated at critical instants of time or their widths were too small to cause significant effects at the final output.
7.2.3.1 Raw BER Calculation
Using the CREME96 tool, several orbital scenarios were investigated and the
SEU response of the SRAMs in these environments was characterized. Using the
previously defined Weibull parameters, each SRAM’s raw BER was computed for a
GEO orbit under solar quiet conditions at solar minimum and solar maximum
(maximum and minimum galactic cosmic rays (GCRs) respectively), and in a GEO
orbit under solar flare conditions using the worst week model ("99% worst case" environment). Also, the raw BER was computed for the harsh proton radiation belt environment of an equatorial orbit at 3000 km (apogee and perigee), using the AP8MIN trapped proton model and quiet magnetic weather conditions. All BER calculations assume 100 mils of aluminum shielding and do not include nuclear reactions from protons. Table 12 provides these estimated error rates before applying any ECC protection in conjunction with scrubbing.
Figure 47 Cross-section versus LET for LP and SF chips
Table 12 Per-bit Weibull cross-section parameters and estimated raw BER from CREME96 for different radiation environments (raw upsets/bit-day, before ECC and scrubbing)
  Device configuration   Weibull parameters (A, x0, W, s)   GEO orbit, max GCR   GEO orbit, min GCR   GEO orbit, solar flares worst week   Eq. orbit 3,000 km (proton belts)
  SFb (SF baseline)      3.3e-8, 0.30, 23, 1.3              2.3e-7               7.2e-8               1.4e-4                               2.4e-4
  SFh (SF hardened)      3.3e-8, 0.43, 28, 1.2              2.0e-7               6.2e-8               8.6e-5                               1.1e-4
  LPb (LP baseline)      1.8e-8, 0.73, 36, 0.91             1.2e-7               3.8e-8               2.4e-5                               2.0e-6
  LPh (LP hardened)      1.8e-8, 0.40, 34, 1.1              1.1e-7               3.3e-8               3.7e-5                               3.8e-5
From the various radiation environments simulated, the GEO orbit under solar flare conditions as well as the LEO equatorial orbit exhibit the worst BER across all SRAMs. As expected from their higher cross-sections, the SF SRAMs have a raw BER consistently greater than that of the LP SRAMs. However, the BER increase for SF is much more pronounced for proton-rich scenarios (LEO equatorial orbit).
7.2.3.2 SBU/MBU Distribution
Figure 48 shows the distribution of upsets for LP and SF SRAMs versus ion LET
value. Near the onset LET, single-bit upsets (SBU) dominate the total soft error distribution; however, MBUs quickly become the main contributor as the LET increases. A wide difference in MBU distribution can be noted for the two ICs, in addition to observing that the largest MBU is 9 bits for LP versus 13 bits for SF. The difference in distribution can be understood by investigating the critical charge of the 6T SRAM cell in each process. Our simulations show that the Qcrit of the LP cell is slightly higher than that of the SF cell (1.58fC vs. 1.23fC), due to variations in cell design and nominal operating voltages (1.2V and 1.0V, respectively). These MBU distributions potentially suggest that much larger interleaving factors should be implemented in order for the errors to appear random and be effectively handled by an ECC model. If the increasing MBU trend continues with technology scaling, it will become necessary to adopt more powerful codes such as DEC ECC.
Figure 48 Single- and multi-bit upset distributions versus effective LET (MeV-cm^2/mg): (a) LP SRAM, showing the percentage contribution of 1-bit through 9-bit upsets; (b) SF SRAM, showing the percentage contribution of 1-bit through 13-bit upsets
7.2.3.3 Single-Event Latchup Sensitivity
To completely characterize the SEE response of the devices, latchup experiments were also conducted on the LP and SF SRAMs. Since commercial SRAM core memory violates standard logic design rules to achieve commercial density, it is even more susceptible to latchup. In particular, the decreased spacing between wells and diffusions of interest, in addition to the elimination of P+ taps (substrate grounds), has the potential to be very harmful to the latchup resilience of the SRAMs.
Considering the worst-case scenario, the devices were tested at +10% Vdd and at a temperature of 125˚C. The SF SRAM exhibited total resilience to latchup for all tests up to an effective LET of 118 MeV-cm^2/mg, for a total fluence of 2x10^7 particles/cm^2, without any effect on the SRAM operation (baseline and hardened). Under worst-case test conditions, the LP SRAMs exhibited a very high sensitivity to latchup. Latchup was triggered in the LP SRAMs almost instantaneously, for LETs as low as 2.18 MeV-cm^2/mg. The identification of such a low latchup threshold prompted additional tests, revealing some interesting observations. Firstly, the trigger and release voltages for the latchup phenomenon were identified around 1.10 - 1.14 V. This means that if Vdd was kept at 1.1V or below, no latchup was triggered up to an LET of 118 MeV-cm^2/mg at 125˚C. Secondly, temperature was also shown to play a crucial role in triggering latchup. We tested the SRAM up to the maximum LET (118 MeV-cm^2/mg) at +10% Vdd, but at room temperature (25˚C), without registering a single latchup event. Finally, none of the SEL events was destructive: all the parts worked perfectly after power-cycling to remove the latchup state. These results indicate that in current and future technologies where supply voltages are 1.1V or below, the processes are intrinsically immune to single-event latchup. This represents a positive outcome of technology scaling with respect to single-event effects and potentially enhances the appeal of using commercial off-the-shelf (COTS) electronics for space applications.
7.2.3.4 Total Ionizing Dose (TID) Test Results
For characterizing the complete radiation response of the test devices, the TID
tests were also performed at the Boeing Radiation Effects Laboratory (BREL). Both
LP and SF SRAM chips were tested according to the MIL-STD 883E, method 1019.6
[111]: the irradiations were performed at a constant dose rate of 200rad/s, the
temperature of the Cobalt-60 gamma-ray chamber was cooled below 30˚C and the test
time between two successive irradiation steps was kept under 1 hour (actual test time
per TID level ~15 minutes). The dimensions of the cobalt-60 chamber limited the
number of simultaneously irradiated devices under test (DUT) to 4 per process.
Therefore, all the TID results presented are the averaged TID response of 4 DUTs. To observe functional failure under extreme conditions, TID tests were conducted up to a large total dose of 2 Mrad. The memory cores of the LP SRAM ICs showed a ~400X increase in static leakage current at 1 Mrad and a ~750X increase at 2 Mrad. All LP SRAMs exhibited functional
failure (read/write process corrupted) between the 1 Mrad and 1.3 Mrad irradiation steps. However, all chips were functional again when tested after 7 days of room-temperature anneal. The SF memory core leakage current experienced a 20X increase from pre-rad conditions at 1 Mrad. However, this relatively small increase is misleading as the SF
pre-rad leakage is already much higher than the pre-rad LP leakage. All SF SRAMs
exhibited functional failure (read/write process corrupted) between the 600krad and
1Mrad irradiation step. After a 7 day room temperature anneal, only 2 of the 4 ICs
regained full functionality. We had used hardened I/O pads for the LP IC and they did
not show any leakage increase due to TID. On the other hand, the commercial I/O
pads used in SF IC showed a considerable increase in leakage current: ~500X at
1Mrad. To eliminate the impact of memory sizes on the TID data analysis, Figure 49
shows the normalized leakage current per bit for the LP and SF ICs. It can be seen that the leakage current increases only about 3X at 300 krad of irradiation, implying that this technology is intrinsically hard enough for most space missions requiring total dose hardness between 150 krad and 250 krad. The normalized leakage current per bit reveals that the
SF static leakage per cell is approximately 10X higher than the LP pre-rad leakage.
This intrinsic increase in leakage has to be acknowledged for the SF technology, as the
total degradation from TID may appear to be lower (at 1Mrad: 20X for SF vs. 400X
for LP). The SF chips present a stronger degradation than the LP: the leakage current
of 0.1uA/bit (after which ICs show functional failure) is reached at 600krad for SF vs.
1Mrad for LP.
Figure 49 TID responses of the LP and SF SRAMs, showing the normalized leakage current per bit (A) as a function of cumulated dose (krad), for the LP SRAM at 1.2V and the SF SRAM at 1.0V
7.3 Experimental Validation of the Effective BER Model
We presented an ECC and scrubbing model in chapter 3 that translates the raw BER into an effective BER based on the ECC characteristics (number of bits corrected) and memory scrubbing operations. The theoretical model relied on several assumptions, which needed to be experimentally verified before the model could be fully validated.
• Assumption 1: All bit errors are random and uncorrelated.
Observations on the experimental data validated this assumption. Particularly:
the cross-section data showed little dependence on stored memory patterns or on
operational test conditions (static or dynamic), hence asserting the randomness of SEU
events. Additionally, error data analysis showed that most of the errors did not occur
in the control circuitry. No error bursts were observed, which would be characteristic
of correlated errors in the memory array from errors initiated in the control circuitry.
• Assumption 2: The ECC and peripheral circuitry must be natively resilient, or hardened at minimum to the desired BER; i.e., the circuitry that is supposed to help improve the BER of the memory array must have a cross-section that is negligible compared to that of the main memory array.
The test chips were not designed to measure this specific data; however, no failures were observed in the ECC circuitry. Thus, the assumption was validated by inference.
With the main model assumptions verified, the validation of the ECC and
scrubbing rate model can lead to a reliable estimation of the effective BER anticipated
in the memory devices. Using equations developed in the theoretical model, we
compared the model’s effective BER predictions to experimental data measurements.
The model's excellent prediction capability is corroborated in Table 13, where we compare in the last two rows 1) the total number of errors still observed after the ECC correction attempt with 2) the total errors predicted by the model that would not be corrected, for the given raw BER, scrub rate and ECC.
Table 13 Observed errors and errors predicted by our model
                                             LPH static     SFH static     LPH dynamic      SFH dynamic
  Scrub rate per bit                         1 / run        1 / run        2.22 kHz         0.718 kHz
  Raw BER                                    errors / run / memory size    errors / run / time (sec) / memory size
  Effective BER                              Equation A     Equation B     Equation A       Equation B
  Total SBU (all SEU runs)                   13,004         12,117         9,901            9,372
  Total bit errors observed
  (not corrected by ECC)                     7,305          195            NO ERRORS        NO ERRORS
  Total bit errors predicted by the model
  for the given scrub rate, raw BER and ECC  7,633          187            6.7 (< 1 word)   0.0002
Estimated errors with 1-bit ECC ~= BER / [1 + (SR/BER)/300]     (A)
Estimated errors with 2-bit ECC ~= BER / [1 + (SR/BER)/15]^2    (B)
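The two expressions above can be exercised directly; the sketch below evaluates equations (A) and (B) for placeholder values of the raw BER and scrub rate, which are illustrative only and are not the measured test values reported in Table 13.

    # Minimal sketch of equations (A) and (B) above; the raw BER and scrub rate
    # below are placeholders (both must be in the same units, e.g. per bit-day),
    # not the measured test values.
    def effective_rate_1bit_ecc(raw_ber: float, scrub_rate: float) -> float:
        """Equation (A): residual error rate with a 1-bit (SEC-DED) ECC."""
        return raw_ber / (1.0 + (scrub_rate / raw_ber) / 300.0)

    def effective_rate_2bit_ecc(raw_ber: float, scrub_rate: float) -> float:
        """Equation (B): residual error rate with a 2-bit (DEC) ECC."""
        return raw_ber / (1.0 + (scrub_rate / raw_ber) / 15.0) ** 2

    raw_ber, scrub_rate = 1e-4, 1.0      # assumed raw upsets and scrubs per bit-day
    print(effective_rate_1bit_ecc(raw_ber, scrub_rate))   # ~ 2.9e-6
    print(effective_rate_2bit_ecc(raw_ber, scrub_rate))   # ~ 2.2e-10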
These experimental results show that:
• For the static test condition, experimental data shows that for the LP hardened SRAM, the 1-bit ECC reduced the error count by ~44%. In the SF hardened SRAM case, using a 2-bit ECC, the error count was reduced by ~98.5%.
• For the dynamic test condition, where the memory was periodically
scrubbed, the LP and SF hardened SRAMs did not register a single
uncorrected error during the experiment. The ECCs were always able to
output a correct codeword, regardless of the faulty bits in the memory array.
• The theoretical model's prediction capability is excellent: the model overestimates the number of uncorrected errors in the static operating mode by only ~4.5%. The agreement with the dynamic test results is even better: the difference is less than 0.1%. The model can be confidently used to assess the SRAMs' resilience in various space environments and benchmark their performance against the program's SEU requirements.
7.3.1 Achieving the target BER of 10^-10 errors/bit-day
Now that we have characterized the raw BER of each SRAM using experimental data, and have validated our theoretical model for effective BER prediction, we can confidently investigate the rate at which the SRAM memories need to be scrubbed to achieve the required target BER of 10^-10 errors/bit-day. This estimation of the required scrubbing rate is very important for system design considerations, as excessive scrubbing operations can considerably impact the timing budget and limit the speed of SRAM operations. Our model results (Table 14 and Table 15) can be used to guide the system designer in assessing the trade-offs between 1) increasing the scrubbing rate of a memory while using a specific ECC vs. 2) adding to the design's complexity by using a stronger ECC protection with lower scrubbing rate requirements.
Table 14 ECC strength vs. scrubbing rate trade-offs for the LP hardened SRAM
  Typical orbit                              Raw BER (errors/bit-day)   Scrub once every (single-bit ECC)   Scrub once every (double-bit ECC)
  Max GCR                                    1.1e-7                     50 days                             30 years
  Min GCR                                    3.3e-8                     1 year                              300 years
  GEO-synchronous worst week (solar flare)   3.7e-5                     1 minute                            1 day
  LEO 3000 km equatorial                     3.8e-5                     1 minute                            1 day
Table 15 ECC strength vs. scrubbing rate trade-offs for the SF hardened SRAM
  Typical orbit                              Raw BER (errors/bit-day)   Scrub once every (single-bit ECC)   Scrub once every (double-bit ECC)
  Max GCR                                    2.0e-7                     10 days                             10 years
  Min GCR                                    6.2e-8                     100 days                            90 years
  GEO-synchronous worst week (solar flare)   8.6e-5                     10 sec                              10 hours
  LEO 3000 km equatorial                     1.1e-4                     4 sec                               3 hours
The LP hardened SRAM is able to achieve the required 10^-10 BER by combining a 1-bit ECC with moderate scrubbing for all orbital scenarios (< 1 scrub/min). The SF hardened SRAM can also achieve the required BER goal with a 1-bit ECC, but for some environments the scrubbing rate becomes too taxing (once every 10 sec and once every 4 sec for GEO solar flare and LEO equatorial orbit conditions, respectively). This is an indication that the 1-bit ECC is being operated at the limit of its efficiency and that a stronger code should be considered. The use of a 2-bit ECC for the LP and SF SRAMs considerably relaxes the scrubbing frequency needed to achieve the 10^-10 goal BER for all orbital scenarios. In conclusion, the irradiation results from the test SRAM chips confirm that the model assumptions are valid. The test results further confirm the effectiveness of the model, implying that we can effectively trade off spatial and temporal redundancy.
Chapter 8
Conclusions
This dissertation addressed the issue of soft errors in commercial CMOS
technologies (including both single-event transients and single-event upsets), by
establishing the problem domain through rigorous characterization, developing an
efficient solution and validating the developed solution through a practical
implementation. The findings of this work contributed towards the completion of
major milestones for a multi-agency research program, sponsored by The Defense
Advanced Research Projects Agency (DARPA), which is trying to build a radiation-
hardened-by-design microprocessor using state-of-the-art commercial technologies.
The soft error issue is not only important for space applications, but it is also
becoming increasingly important for terrestrial applications. Particularly, reduced
dimensions and lower supply voltages in future technologies are going to reduce the
critical charge per node and increase the number of nodes per chip even further. This
scaling trend is expected to exacerbate the soft error problem for many terrestrial
applications involving banking, networking, life-support systems, industrial control,
avionics and critical defense applications. There are already numerous examples of failures resulting from soft errors in terrestrial applications, ranging from unprotected cache memories in Sun Microsystems' enterprise servers [110][22] to aircraft and implantable medical devices [88][87][32].
8.1 Summary
This work developed simulation-based methodologies to characterize single-
event transients and single-event upset phenomena. These simulation methodologies
offer cheaper and more computationally efficient estimation of these transient phenomena compared to complex 3D mixed-mode simulations and/or actual irradiation testing.
This characterization helps in establishing a clear problem domain, which in turn
identifies the solution space. Equipped with the characterization of the problem
domain, this work developed mitigation solutions both for SET and SEU. The
proposed DF-DICE cell offers an efficient solution to mitigate SET from
combinational logic with an ease of implementation where sequential logic elements
(latches and flip-flops) are merely replaced by respective DF-DICE cells.
The error correcting code and scrubbing based mathematical model for
memory upset mitigation is shown to be an efficient and much simpler solution
compared to other circuit-based hardening techniques. This model helps in achieving
the target effective reliability without altering the basic SRAM cell design and hence
allows exploiting the full benefits of high density and increased performance of state-
of-the-art commercial processes. The developed relation for temporal and spatial
redundancy trade-offs (described in chapter 3) is another interesting contribution of
this work. The model-based mitigation of soft errors and temporal and spatial
redundancy trade-offs require the development of implementation architectures for
more powerful error correcting codes which are suitable for SRAM applications.
Therefore, a fully parallel implementation approach for more powerful codes, such as
DEC BCH codes, was also developed. In the face of increasing multi-bit upset
challenges, this parallel implementation approach is another milestone in effectively
mitigating these errors and achieving a desired reliability. The practicality of this
approach was demonstrated through ASIC implementations of DEC BCH decoders for
typical memory word sizes.
The effectiveness of the proposed framework was validated by designing two
prototype SRAM ICs in commercial 90nm CMOS technology with two different ECC
model-based techniques guided by the framework. Particularly, one SRAM IC was
designed using a 90nm low-power process with a Hsiao SEC-DED code. The second
prototype SRAM IC was manufactured in a 90nm high-performance standard
fabrication process with a double error correcting BCH code following the parallel
implementation approach of chapter 6. The irradiation tests characterized the physical
raw bit error rate intrinsic to the process and cell design for the manufactured SRAM
ICs without activating the mitigation techniques. With the activation of mitigation
techniques for both ICs, no errors were observed implying that the model achieved the
desired effective bit error rate. These tests practically validated the model for at least
two different data points.
8.2 Limitations
This research has mainly focused on mitigating soft errors in SRAMs using an
ECC and scrubbing model to achieve an effective bit error rate with a given raw bit
error rate for a particular design. Like most other research, this work has solved some
problems while raising some new questions. The framework of this research has some
limitations that should be noted when applying the techniques to general design
scenarios.
The simulation methodologies for SET and SEU characterization were developed in view of models and data available in the literature and our experimentation with state-of-the-art 3D device simulations in current technologies. The projected
trends for SET and SEU are expected to continue in the future, but if there are
significant modifications in the underlying CMOS processes, these models would need
to be reconciled with those future technologies.
The DF-DICE cells, which are proposed to mitigate SET from combinational
logic, incur a latency penalty that is proportional to the width of the transient to be
mitigated. As technology scaling leads to higher-speed circuits, the latency penalty associated with these cells may limit their practicality. This will likely prompt the
development of more efficient transient mitigation solutions and may potentially
revive interest in self-checking circuits [41].
The ECC and scrubbing model presents a general approach to trade off spatial
and temporal redundancy. In this model, the overheads arising from the ECC encoder
and decoders are ignored. Though these assumptions are valid for single error and
double error correcting codes, the decoding complexity for more powerful codes can
affect these assumptions. It should be emphasized that with the predicted scaling
trends, though, it is likely that the presented ECC schemes would be sufficient to meet
the desired reliability levels, at least in the near future.
The ECC and scrubbing model is based on the assumption that the errors are independently and identically distributed, meaning that each bit is equally likely to be upset. For the multi-bit upset case, bit interleaving is used to make them appear as single-bit upsets in logically disjoint words. Though
interleaving factors of 8 or 16 can be implemented without a noticeable overall impact
on performance, it cannot be scaled to much larger interleaving factors without
modifying the memory architecture. Double error correcting codes will become
necessary if multi-bit upsets greater than the 16-bit range start to occur frequently. The
double error correcting ECC would be able to handle MBU as large as 32-bits with an
interleaving factor of 16. In the near future, the frequency of the range of MBU
surpassing 32-bits is likely to remain very low, especially for terrestrial and aircraft
applications. However, if such large MBUs do become the trend, then new solutions
will need to be explored for mitigating those large ranges of MBUs, potentially
involving architectural changes.
8.3 Extensions
Although this work has dealt with the issue of soft errors particularly targeted
for SRAMs, the approach and techniques developed in this work can be extended to
other domains. The ECC and scrubbing model that relates raw bit error rate to an
effective bit error rate is generic in nature and is based on probabilistic analysis. As
long as the raw bit error rates can be characterized for other memory technologies, this
model can be used effectively to attain desired reliability levels. For example, this
model can be easily adapted to mitigate soft errors in DRAMs. Additionally, errors
resulting from other phenomena, such as low-power standby retention modes (or aging
and manufacturing defects), can be mitigated with these models as long as the error
distribution from these phenomena can be categorized as independently, identically
distributed random errors. Similarly, any storage media that can be characterized with
similar parameters is a good candidate for application of this research. The DF-DICE
cells, which are proposed to mitigate SET from peripheral logic, can also be used to
harden any combinational logic in general. The increasing SET trends may eventually
lead back to the subject of “designing reliable systems out of unreliable components”
as was explored in the early days of vacuum-tube computing.
Bibliography
[1] Aeroflex Rad-Hard-by-Design SRAMs. Aeroflex, Colorado Springs, CO, Aug.
2006 [Online]. Available: http://ams.aeroflex.com/Product-
Pages/RH_4Msram.cfm
[2] D. R. Alexander, “Design issues for radiation tolerant microcircuits for space”,
short course presented at the 1996 NSREC conference, Indian Wells, Ca. July
15-19, 1996
[3] J. L. Andrews, J. E. Schroeder, B. L. Gingerich, W. A. Kolasinski, R. Koga, S.
E. Diehl, “Single event error immune CMOS RAM,” IEEE Trans. Nucl. Sci.,
vol 29, pp. 2040–2043, Dec 1982
[4] J. Baggio, V. Ferlet-Cavrois, H. Duarte, O. F. Lament, “Analysis of
proton/neutron SEU sensitivity of commercial SRAMs - application to the
terrestrial environment test method,” IEEE Transactions on Nuclear Science,
vol. 51 (6), pp. 3420–3426, Dec 2004
[5] M. A. Bajura, Y. Boulghassoul, R. Naseer, S. DasGupta, A. Witulski, J.
Sondeen, S. Stansberry, J. Draper, L. Massengill, J. Damoulakis, “Models and
algorithmic limits for an ECC-based approach to hardening sub-100nm
SRAMs”, Radiation Effects on Components and Systems (RADECS), Sep 27-30, 2006
[6] R. C. Baumann, “Radiation-induced soft errors in advanced semiconductor
technologies,” IEEE Transactions on Device Materials and Reliability, vol. 5
(3), pp. 305–316, Sep 2005
[7] R. C. Baumann, “Single Event Effects in Advanced CMOS Technology”,
NSREC Short course, 2005
[8] R. C. Baumann, “Soft errors in advanced semiconductor devices – Part 1: The
three radiation sources”. IEEE Transactions on Device and Materials
Reliability, vol 1 (1), pp. 17-22, March 2001
[9] R. C. Baumann, “The impact of technology scaling on soft error rate
performance and limits to the efficacy of error correction”, Digest of
International Electron Devices Meeting (IEDM '02). pp. 329-332, Dec 8-11,
2002
[10] M. P. Baze, S.P. Buchner, D. McMorrow. “A Digital CMOS design technique
for SEU hardening”. IEEE Transactions on Nuclear Science, vol 47 (6), pp.
2603 – 2608, Dec. 2000
[11] J. Benedetto, P. Eaton, K. Avery, D. Mavis, M. Gadlage, T. Turflinger, P. E.
Dodd, G. Vizkelethyd, “Heavy ion-induced digital single-event transients in
deep submicron Processes”, IEEE Transactions on Nuclear Science, vol. 51
(6), pp. 3480 – 3485, Dec 2004
[12] D. Bessot, R. Velazco, “Design of SEU-hardened CMOS memory cells: the
HIT cell”. Second European Conference on Radiation and its Effects on
Components and Systems (RADECS), pp. 563 – 570, 13-16 Sept. 1993
[13] D. Binder, E. C. Smith, A. B. Holman, “Satellite anomalies from galactic
cosmic rays”, IEEE Trans. on Nucl. Sci., vol 22, pp. 2675-2680, Dec 1975
[14] J. D. Black, A. L. Sternberg, M. L. Alles, A. F.Witulski, B. L. Bhuva, L. W.
Massengill, J. M. Benedetto, M. P. Baze, J. L.Wert, M. G. Hubert, “HBD
layout isolation techniques for multiple node charge collection mitigation,”
IEEE Trans. Nucl. Sci., vol. 52 (6), pp. 2536–2541, Dec 2005
[15] Boeing Radiation Effects Laboratory
http://www.boeing.com/assocproducts/radiationlab/index.htm
[16] D. C. Bossen, M. Y. Hsiao, “A system solution to the memory soft error
problem,” IBM J. Res. Develop., vol. 24, pp. 390–397, Mar 1980
[17] Y. Boulghassoul, M. Bajura, S. Stansberry, J. Draper, R. Naseer, J. Sondeen,
“TID Damage and Annealing Response of 90 nm Commercial-Density
SRAMs”, Radiation effects on components and systems (RADECS) workshop,
2008
[18] Brookhaven National Laboratory Accelerator Facility
http://www.tvdg.bnl.gov/
[19] D. Burnett, C. Lage, A. Bormann, “Soft-error-rate improvement in advanced
BiCMOS SRAMs,” in Proc. IEEE Int. Reliability Phys. Symp., pp. 156–160,
1993
[20] T. Calin, M. Nicolaidis, R. Velazco, “Upset hardened memory design for
submicron CMOS technology”. Nuclear Science, IEEE Transactions on, vol 43
(6), pp. 2874-2878, Dec. 1996
[21] E. H. Cannon, D. D. Reunhardt, M. S. Gordon, P. S. Makowensky, “SRAM
SER in 90, 130 and 180 nm bulk and SOI technologies”, IEEE Int. Rel. Phy.
Syp., pp. 300-304. 2004
[22] Cataldo, A. Cataldo, “SRAM soft errors cause hard network problems”, EE
Times, August 2001, [on line]
http://www.eetimes.com/story/OEG20010817S0073
[23] C. L. Chen, M. Y. Hsiao, “Error-correcting codes for semiconductor memory
applications: A state-of-the-art review”, IBM J. Res. Develop., vol. 28 (2), pp.
124–134, 1984
[24] Cosmic Ray Effects on Micro Electronics, 1996 (CREME96),
https://creme96.nrl.navy.mil/
[25] T. J. Dell, “A white paper on the benefits of chipkill-correct ECC for PC server
main memory”, IBM Microelectronics division, 1997
[26] H. S. Deogun, D. Sylvester, D. Blaauw, “Gate-level mitigation techniques for
neutron-induced soft error rate”, International Symposium on Quality
Electronics Design (ISQED), pp. 175 – 180, 2005
[27] N. Deracobian, V. Vardanian, Y. Zorian, “Embedded memory reliability: the
SER challenge,” IEEE International Workshop on Memory Technology,
Design and Testing, pp. 104–110, Aug. 2004
[28] Y. S. Dhillon, A. U. Diril, A. Chatterjee, “Soft-error tolerance analysis and
optimization of nanometer circuits”, Proceedings of Design and Test in Europe
(DATE), pp. 288-293, 2005
[29] P. E. Dodd, “Physics-based simulation of single-event effects”, IEEE Trans. on
Device and Materials Reliability, vol. 5 (3) pp. 343–357, Sept. 2005
[30] P. E. Dodd, A. R. Shaneyfelt, K. M. Horn, D. S. Walsh, G. L. Hash, T. A. Hill,
B. L. Draper, J. R. Schwank, F. W. Sexton, P. S. Winokur, “SEU-sensitive
volumes in bulk and SOI SRAMs from first-principles calculations and
experiments”, IEEE Trans. on Nuc. Sci., vol. 48, pp. 1893 – 1903, Dec. 2001
[31] P. E. Dodd, F. W. Sexton, “Critical charge concepts for CMOS SRAMs”,
IEEE Transactions on Nuclear Science, vol 42 (6), pp. 1764 – 1771, Dec. 1995
[32] P. E. Dodd, L. W. Massengill, “Basic mechanisms and modeling of single-
event upset in digital microelectronics”, IEEE Transactions on Nuclear
Science, vol. 50(3), pp. 583–602, June 2003
[33] P. E. Dodd, M. R. Shaneyfelt, J. A. Felix, J. R. Schwank, “Production and
propagation of single-event transients in high-speed digital logic ICs”, IEEE
Trans. on Nuc Sci., vol 51 (6), pp. 3278 – 3284, Dec. 2004
[34] P. Eaton, J. Benedetto, D. Mavis, K. Avery, M. Sibley, M. Gadlage, T.
Turflinger, “Single event transient pulsewidth measurements using a variable
temporal latch technique”, IEEE Transactions on Nuclear Science, vol 51 (6),
pp. 3365 – 3368, Dec. 2004
[35] L. D. Edmonds, “Electric currents through ion tracks in silicon devices,” IEEE
Trans. Nucl. Sci., vol. 45, pp. 3153–3164, Dec 1998
[36] W. M. El-Medany,, C. G. Harrison, P. G. Farrel, C. J. Hardy, “VHDL
Implementation of a BCH Minimum Weight Decoder for Double Error”, Proc.
of the 18th Radio Science Conf., vol. 2, pp. 361-368, 2001
[37] V. Ferlet-Cavrois, P. Paillet, D. McMorrow, A. Torres, M. Gaillardin, J. S.
Melinger, A. R. Knudson, A. B. Campbell, J. R. Schwank, G. Vizkelethy, M.
R. Shaneyfelt, K. Hirose, O. Faynot, C. Jahan, L. Tosti, “Direct measurement
of transient pulses induced by laser and heavy ion irradiation in deca-
nanometer devices”, IEEE Trans. on Nuc. Sci. vol.52, Dec. 2005
[38] A. M. Finn, “System effects of single event upsets”, Computers in Aerospace
VII Conference, pp. 994-1002, Oct. 3-5, 1989
[39] L. B. Freeman, “Critical charge calculations for a bipolar SRAM array”, IBM
J. Res. and Dev., Vol. 40 (1), pp. 77-89, Jan 1996
[40] S. W. Fu, A. M. Mohsen, T. C. May, “Alpha-particle-induced charge collection
measurements and the effectiveness of a novel p-well protection barrier on
VLSI memories,” IEEE Trans. Electron. Devices, vol. 32, pp. 49–54, Feb 1985
[41] E. Fujiwara, “Code Design for Dependable Systems: Theory and Practical
Applications”, Wiley, ISBN: 978-0-471-75618-7, July 2006
[42] M. J. Gadlage, R. D. Schrimpf, J. M. Benedetto, P. H. Eaton, D. G. Mavis, M.
Sibley, K. Avery, T. L. Turflinger,“Single event transient pulse widths in
digital microcircuits”, IEEE Trans. on Nuc Sci vol. 51 (6), Part 2, pp. 3285 –
3290, Dec 2004
[43] M. Grassl, "Bounds on the minimum distance of linear codes." Online
available at http://www.codetables.de
[44] R. W. Hamming, “Error Correcting and Error Detecting Codes”, Bell Sys.
Tech. Journal, vol 29, pp. 147-160, April 1950
[45] J. D. Hayden, R. C. Taft, P. Kenkare, C. Mazur, C. Gunderson, B. Y. Nguyen,
M. Woo, C. Lage, B. J. Roman, S. Radhakrishna, R. Subrahmanyan, A. R.
Sitaram, P. Pelley, J. H. Lin, K. Kemp, H. Kirsch, “A quadruple well,
quadruple polysilicon BiCMOS process for fast 16 Mb SRAMs,” IEEE Trans.
Electron. Devices, vol. 41, pp. 2318–2325, Dec 1994
[46] P. Hazucha, C. Svensson, “Impact of CMOS technology scaling on the
atmospheric neutron soft error rate”. IEEE Transactions on Nuclear Science,
vol 47 (6), Part 3, pp. 2586 – 2594, Dec. 2000
[47] P. Hazucha, T. Karnik, J. Maiz, S. Walstra, B. Bloechel, J. Tschanz, G.
Dermer, S. Hareland, P. Armstrong, S. Borkar, “Neutron soft error rate
measurements in a 90-nm CMOS process and scaling trends in SRAM from
0.25-um to 90-nm generation”, IEEE IEDM, pp. 523-526, 2003
[48] T. Heijmen, D. Giot, P. Roche, “Factors That Impact the Critical Charge of
Memory Elements”. 12th IEEE International On-Line Testing Symposium, pp.
57-62, July 2006
[49] R. Hentschke, F. Marques, F. Lima, L. Carro, A. Susin, R. Reis, “Analyzing
area and performance penalty of protecting different digital modules with
Hamming code and triple modular redundancy”, 15th Symposium on
Integrated Circuits and Systems Design, pp. 95-100, Sep 2002
[50] A. Holmes-Siedle, L. Adams, “Handbook of Radiation Effects”, Oxford
University Press, USA, 2nd edition, ISBN-13: 978-0198507338, March 28,
2002
[51] Honeywell SRAM, MRAM, and FIFO Rad Hard Memories. Honeywell, Solid
State Electronics Ctr., Plymouth, MN, Aug. 2006 [Online]. Available:
http://www.ssec.honeywell.com/aerospace/radhard/memory.html
[52] M. Y. Hsiao, “A Class of Optimal Minimum Odd-weight-column SEC-DED
Codes”, IBM Journal of R & D, vol. 14, pp. 395-401, July 1970
[53] H. L. Hughes, R. R. Giroux, “Space radiation affects MOSFET’s”, Electronics,
vol 37, p. 58, 1964
[54] H. L. Hughes, J. M. Benedetto, “Radiation Effects and Hardening of MOS
Technology: Devices and Circuits”, IEEE Trans. On Nuclear Science, vol 50
(3), pp. 500-521, Jun 2003
[55] D. E. Johnson, K. S. Morgan, M. J. Wirthlin, M. P. Caffrey, P. S. Graham,
“Detection of Configuration Memory Upsets Causing Persistent Errors in
SRAM-based FPGAs”, 7th Annual International Conference on Military and
Aerospace Programmable Logic Devices (MAPLD), September 2004
[56] T. Karnik, P. Hazucha, “Characterization of soft errors caused by single event
upsets in CMOS processes”, IEEE Trans on Dependable and Secure
Computing, vol 1 (2), April-June 2004
[57] F. L. Kastensmidt, L. Carro, R. Reis, “Fault-tolerant techniques for SRAM-
based FPGAs”, 2006, Springer , The Netherlands.
[58] R. Koga, S. D. Pingerton, S. C. Moss, D. C. Mayer, S. LaLumondiere, S. J.
Hansel, K. B. Crawford, W. R. Crain, “Observation of single event upsets in
analog microcircuits,” IEEE Trans. Nucl. Sci., vol. 40 (6), pp. 1838–1844, Dec
1993
[59] D. Krueger, E. Francom, J. Langsdorf, “Circuit Design for Voltage Scaling and
SER Immunity on a Quad-Core Itanium® Processor”, ISSCC Tech. Digest of
papers, 2008
[60] A. Kumar, H. Qin, P. Ishwar, J. Rabaey, and K. Ramchandran, “Fundamental
Bounds on Power Reduction during Data-Retention in Standby SRAM”, IEEE
International Symposium on Circuits and Systems, pp. 1867-1870, 2007
[61] D. Y. Lam, J. Lan, L. McMurchie, C. Sechen, “ SEE-hardened-by-design area-
efficient SRAMs,” in Proc. IEEE Aerospace Conf., pp. 1–7, Mar. 2005
[62] D. Lambert, J. Baggio, V. Ferlet-Cavrois, O. Flament, F. Saigne, B. Sagnes, N.
Buard, T. Carriere, “Neutron-induced SEU in bulk SRAMs in terrestrial
environment: simulations and experiments”, IEEE Trans. On Nucl. Sci., vol.
51(6), part 2, pp. 3435-3441, Dec 2004
[63] Alberto Leon-Garcia, “Probability and random processes for electrical
engineering”, 2nd edition 1994, Addison Wesley Publishers
[64] S. Lin, D.J. Costello, “Error control coding”, 2nd edition, pp. 78-82, Prentice
Hall, 2004
[65] M. N. Liu, S. Whitaker, “Low power SEU immune CMOS memory circuits,”
IEEE Trans. Nucl. Sci., vol 39, pp. 1679–1684, Dec 1992
[66] E.-H. Lu, T. Chang, “New decoder for double-error-correcting binary BCH
codes”, IEE Communications Proc., vol. 143 (3), pp. 129-132, June 1996
[67] G. Lum, “Hardness assurance for space systems”, NSREC Short Course, 2004
[68] J. Maiz, S. Hareland, K. Zhang, P. Armstrong, “Characterization of multi-bit
soft error events in advanced SRAMs”, IEEE International Electron Devices
Meeting (IEDM) Technical Digest, pp. 21.4.1–21.4.4., Dec 2003
[69] D. G. Mavis, P.H. Eaton, “Soft error rate mitigation techniques for modern
microcircuits”. 40th Annual Reliability Physics Symposium Proceedings, pp.
216 – 225, 7-11 April 2002
[70] T. C. May, “Soft errors in VLSI: Present and future,” IEEE Trans.
Components, Hybrids, Manuf. Tech., vol 2, pp. 377–387, Dec. 1979
[71] T. C. May, M. H. Woods, “Alpha-particle-induced soft errors in dynamic
memories”, IEEE Trans. Electron. Devices, vol 26, pp. 2-9, Feb 1979
[72] P. J. Meaney, S. B. Swaney, P. N. Sanda, L. Spainhower, “IBM z990 soft error
detection and recovery,” IEEE Transactions on Device and Materials
Reliability, vol. 5 (3), pp. 419–427, Sep. 2005
[73] G. Memik, M.H. Chowdhury, A. Mallik, Y.I. Ismail, “Engineering over-
clocking: Reliability-performance trade-offs for high-performance register
files”, IEEE International Conference on Dependable Systems and Networks,
2005
[74] G. Memik, M.T. Kandemir, O. Ozturk, “Increasing register file immunity to
transient errors”. Proceedings of Design, Automation and Test in Europe, vol.
1, pp. 586 - 591, 2005
[75] T. Mérelle, H. Chabane, J. M. Palau, K. Castellani-Coulie, F. Wrobel, F.
Saigne, B. Sagnes, J. Boch, J. R. Vaille, G. Gasiot, P. Roche, M. C. Palau, T.
Carriere, “Criterion for SEU occurrence in SRAM deduced from circuit and
device simulations in case of neutron-induced SER”, IEEE Trans. Nucl. Sci.,
Vol. 52 (4), pp. 1148-1155, Aug. 2005
[76] S. Mitra, S. Norbert, M. Zhang, Q. Shi, K. S. Kim, “Robust System Design
with Built-In Soft-Error Resilience”, Computer, vol 38 (2), pp. 43-52, Feb.
2005
[77] P. Mongkolkachit, B. Bhuva, “Design technique for mitigation of alpha-
particle-induced single-event transients in combinational logic”. IEEE
Transactions on Device and Materials Reliability, vol 3 (3), pp. 89 – 92, Sep.
2003
[78] R. Naseer, J. Draper, “DEC ECC Design to Improve Memory Reliability in
Sub-100nm Technologies”, IEEE International Conference on Electronics,
Circuits, and Systems (ICECS), Aug-Sep 2008
[79] R. Naseer, J. Draper, “DF-DICE: a scalable solution for soft error tolerant
circuit design”. International Symposium on Circuits and Systems (ISCAS),
pp. 3890 – 3893, May 21-24, 2006
[80] R. Naseer, J. Draper, “Heavy-ion-induced Multi-bit Upset Characterization and
Mitigation in SRAMs”, International Solid-State Circuits Conference – Student
Forum, ISSCC, Feb 2008
[81] R. Naseer, J. Draper, “Parallel Double Error Correcting Code Design to
Mitigate Multi-Bit Upsets in SRAMs”, 34th European Solid-State Circuits
Conference (ESSCIRC), Sep 2008
[82] R. Naseer, J. Draper, “The DF-DICE storage element for immunity to soft
errors ”. Circuits and Systems, 2005. 48th Midwest Symposium on , pp. 303 –
306, August 7-10, 2005
[83] R. Naseer, J. Draper, Y. Boulghassoul, S. DasGupta, A. Witulski, “Critical
charge and SET pulse widths for combinational logic in commercial 90nm
technology”, Proceedings of the 17th great lakes symposium on VLSI
(GLSVLSI), pp. 227-230, March 2007
[84] R. Naseer, R.Z. Bhatti, J. Draper, “Analysis of soft error mitigation techniques
for register files in IBM Cu-08 90nm Technology”. MidWest Symposium on
Circuits and Systems (MWSCAS), Aug 09-11, 2006
[85] R. Naseer, Y. Boulghassoul, J. Draper, S. DasGupta, A. Witulski, “Critical
charge characterization for soft error rate modeling in 90nm SRAM”,
International Symposium on Circuits and Systems (ISCAS), May 28-30, 2007
[86] R. Naseer, Y. Boulghassoul, M. Bajura, J. Sondeen, S. Stansberry, J. Draper,
“Single-Event Effects Characterization and Soft Error Mitigation in 90nm
Commercial-Density SRAMs”, International Association of Science and
Technology for Development (IASTED) Circuits and Systems Conference,
Aug 2008
[87] E. Normand, “Single event upset at ground level”, IEEE Transactions on
Nuclear Science, vol 43 (6), Part 1, pp. 2742-2750
[88] E. Normand, “Single-event effects in avionics”, IEEE Transactions on Nuclear
Science, vol 43 (2), Part 1, pp. 461-474
[89] A. Ochoa, Jr., C. L. Axness, H. T. Weaver, J. S. Fu, “A proposed new structure
for SEU immunity in SRAM employing drain resistance,” IEEE Electron.
Device Letters, vol 8, pp. 537–539, Nov 1987
[90] Online available from : http://www.jedec.org/
[91] F. Ootsuka, M. Nakamura, T. Miyake, S. Iwahashi, Y. Ohira, T. Tamaru, K.
Kikushima, K. Yamaguchi, “A novel 0.20 μm full CMOS SRAM cell using
stacked cross couple with enhanced soft error immunity,” in IEDM Tech. Dig.,
pp. 205–208, 1998
[92] F. I. Osman, “Error-correction technique for random-access memories”, IEEE
Journal of Solid-State Circuits, vol 17 (5), pp. 877–881, Oct 1982
[93] M. Pflanz, K. Walther, C. Galke, H. T. Vierhaus, “On-line error detection and
correction in storage elements with cross-parity check”, Proceedings of the
Eighth IEEE International On-Line Testing Workshop, pp. 69 – 73, 8-10 July
2002
[94] Predictive Technology Models, http://www.eas.asu.edu/~ptm/
[95] H. Quinn, P. Graham, J. Krone, M. Caffrey, S. Rezgui, “Radiation-induced
multi-bit upsets in SRAM-based FPGAs”, IEEE Transactions on Nuclear
Science, vol. 52 (6), Part 1, pp. 2455 – 2461, Dec. 2005
[96] D. Radaelli, H. Puchner, S. Wong, S. Daniel, “Investigation of multi-bit upsets
in a 150 nm technology SRAM device,” IEEE Trans. on Nuclear Science, vol.
52, no. 6, pp. 2433–2437, Dec. 2005
[97] P. Roche, J. M. Palau, G. Bruguier, C. Tavernier, R. Ecoffet, J. Gasiot,
“Determination of key parameters for SEU occurrence using 3-D full cell
SRAM simulations”, IEEE Transactions on Nuclear Science, vol 46 (6),
pp:1354 – 1362, Dec. 1999
[98] L. R. Rockett, Jr., “An SEU hardened CMOS data latch design,” IEEE Trans.
Nucl. Sci., vol 35, pp. 1682–1687, Dec 1988
[99] L. R. Rockett, Jr., “Simulated SEU hardened scaled CMOS SRAM cell design
using gated resistors,” IEEE Trans. Nucl. Sci., vol 39, pp. 1532–1541, Oct
1992
[100] A. M. Saleh, J. J. Serrano, J. H. Patel, “Reliability of scrubbing recovery-
techniques for memory systems,” IEEE Trans. Reliab., vol 39 (1), pp. 114–
122, Apr 1990
[101] J. Scarpulla, A. Yarbrough, “What Could Go Wrong? The Effects of Ionizing
Radiation on Space Electronics”, Crosslink, vol. 4 (2), Summer 2003
[102] N. Seifert, P. Slankard, M. Kirsch, B. Narasimham, V. Zia, C. Brookreson, A.
Vo, S. Mitra, B. Gill, J. Maiz, “Radiation-induced soft error rates of advanced
CMOS bulk devices”, Int. Rel. Phy. Sym. pp. 217-225, 2006
[103] F. W. Sexton, W. T. Corbett, R. K. Treece, K. J. Hass, K. L. Hughes, C. L.
Axness, G. L. Hash, M. R. Shaneyfelt, T. F. Wunsch, “SEU simulation and
testing of resistor hardened D-latches in the SA3300 microprocessor,” IEEE
Trans. Nucl. Sci., vol 38, pp. 1521–1528, Dec. 1991
[104] C. E. Shannon, “A mathematical theory of communications”, Bell Syst. Tech.
J., pp. 379-423 (Part 1); pp. 623-656 (Part 2), July 1948
[105] Q. Shi, G. Maki, ”New Design Techniques for SEU Immune Circuits,“ NASA
Symposium on VLSI Design, pp 7.4.1-7.4.11, Nov. 2000
[106] M. L. Shooman, “The reliability of error correcting code implementation”,
Proceedings of Annual Reliability and Maintainability Symposium, pp. 148-
155, 1996
[107] C. W. Slayman, “Cache and memory error detection, correction, and reduction
techniques for terrestrial servers and workstations,” IEEE Transactions on
Device and Materials Reliability, vol. 5, no. 3, pp. 397–404, Sept. 2005
[108] W. J. Snoeys, T. A. P. Gutierrez, G. Anelli, “A new NMOS layout structure for
radiation tolerance”. IEEE Transactions on Nuclear Science, vol 49 (4), pp.
1829-1833, Aug. 2002
[109] M. Spica, T. M. Mak, “Do we need anything more than single bit error
correction (ECC)?”. International Workshop on Memory Technology, Design
and Testing, pp. 111-116, Aug. 9-10, 2004
[110] Sun Microsystems, “Soft memory errors and their effect on Sun Fire systems”, 2002.
[111] Test Method Standard, Microcircuits. Feb 28, 2006 (http://www.dscc.dla.mil)
[112] N. H. Vaidya, “Comparison of duplex and triplex memory reliability”, IEEE
Transactions on Computer, Vol. 45 (4), pp. 503-507, April 1996
[113] R. Velazco, D. Bessot, S. Duzellier, R. Ecoffet, R. Koga, “Two CMOS
memory cells suitable for the design of SEU-tolerant VLSI circuits,” IEEE
Trans. Nucl. Sci., vol 41, pp. 2229–2234, Dec. 1994
[114] J. T.Wallmark, S. M. Marcus, “Minimum size and maximum packing density
of nonredundant semiconductor devices,” Proc. IRE, vol 50, pp.286–298, 1962
[115] K. M. Warren, R. A. Weller, M. H. Mendenhall, R. A. Reed, D. R. Ball, C. L.
Howe, B. D. Olson, M. L. Alles, L. W. Massengill, R. D. Schrimpf, N. F.
Haddad, S. E. Doyle, D. McMorrow, J. S. Melinger, W. T. Lotshaw, “The
contribution of nuclear reactions to heavy ion single event upset cross-section
measurements in a high-density SEU hardened SRAM,” IEEE Trans. Nucl.
Sci., vol. 52 (6), pp. 2125–2131, Dec. 2005
[116] H. T. Weaver, W. T. Corbett, J. M. Pimbley, “Soft error protection using
asymmetric response latches,” IEEE Trans. Electron Dev., vol 38, pp. 1555–
1557, Dec. 1991
[117] Shyue-Win Wei, Che-Ho Wei, “High-speed hardware decoder for double-
error-correcting binary BCH codes”, IEE Proceedings Communications,
Speech and Vision, vol. 136 (3), pp. 227- 231, Jun. 1989
[118] D. Wiseman, J. Canaris, S. Whitaker, J. Venbrux, K. Cameron, K. Arave, L.
Arave, M. N. Liu, K. Liu, “Design and testing of SEU/SEL immune memory
and logic circuits in a commercial CMOS process,” in IEEE NSREC Data
Workshop Rec., pp. 51–55, 1993
[119] Q. Zhou, K. Mohanram, “Gate sizing to radiation harden combinational logic”,
IEEE Transactions on CAD of Integrated Circuits and Systems, vol 25 (1), pp.
155 – 166, Jan. 2006
[120] Q. Zhou, K. Mohanram, “Transistor sizing for radiation hardening”,
Proceedings of Reliability Physics Symposium, pp. 310 – 315, 2004
[121] J. F. Ziegler, H. W. Curtis, F. P. Muhlfeld, C. J. Montrose, B. Chin, M.
Nicewicz, C. A. Russell, W. Y. Wang, L. B. Freeman, P. Hosier, L. E. LaFave,
J. L. Walsh, J. M. Orro, G. J. Unger, J. M. Ross, T. J. O'Gorman, B. Messina,
T. D. Sullivan, A. J. Sykes, H. Yourke, T. A. Enger, V. Tolat, T. S. Scott, A.
H. Taber, R. J. Sussman, W. A. Klein, C. W. Wahaus, “IBM experiments in
soft fail in computer electronics (1978-1994)”, IBM J. of Res and Devlop. Vol
40, pp:3-18, 1996.
[122] J. A. Zoutendyk, L. D. Edmonds, L. S. Smith, “Characterization of multiple-bit
errors from single-ion tracks in integrated circuits,” IEEE Trans. Nucl. Sci.,
vol. 36 (6), pp. 2267–2274, Dec. 1989
Abstract
With aggressive technology scaling, radiation-induced soft errors have become a major threat to microelectronics reliability. SRAM cells in deep sub-micron technologies particularly suffer most from these errors, since they are designed with minimum geometry devices to increase density and performance, resulting in a cell design that can be upset easily with a reduced critical charge (Qcrit). Though single-bit upsets were long recognized as a reliability concern, the contribution of multi-bit upsets (MBU) to overall soft error rate, resulting from single particle strikes, is also increasing. The problem gets exacerbated for space electronics where galactic cosmic rays carry particles with much higher linear energy transfer characteristics than terrestrial radiation sources, predictably inducing even larger multi-bit upsets. In addition, single-event transients are becoming an increased concern due to high operational frequencies in scaled technologies. The conventional radiation hardening approach of using specialized processes faces serious challenges due to its low-volume market and lagging performance compared to commercial counterparts. Alternatively, circuit-based radiation hardening by design (RHBD) approaches, where both memory cells and control logic are hardened, may incur large area, power and speed penalties if careful design techniques are not applied.