RELIABLE CACHE MEMORIES
by
Mehrtash Manoochehri
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)
May 2015
Copyright 2015 Mehrtash Manoochehri
To my mother and my brother for their help and support.
Acknowledgements
Getting a Ph.D. is a very difficult task. I was lucky to have the sincere support of my family, my professors and my friends. Otherwise, I do not think I could have completed my Ph.D. studies.
First, I would like to thank my mother. Her continuous support was very important.
Whenever I was sad and disappointed, she was the only one I could rely on and I could
not survive without her.
I appreciate the help of Professor Murali Annavaram during my Ph.D. studies. He helped me in many ways, and specifically in finding a job, which was a serious challenge due to my Iranian nationality.
I would like to thank Professor Kai Hwang, Professor Jeffery Draper, Professor
Sandeep Gupta, Professor Timothy Pinkston, Professor Aiichiro Nakano, Professor Mark
Redekop, Professor Gandhi Puvvada and Professor Mehrnoosh Eshaghian for their guid-
ance and sincere help.
Diane Demetras helped me a lot as she helps all students. I really feel that she loves
students like her own children and I wish the best for her.
I would like to thank my friend Jinho Suh, who helped me significantly after I joined USC. I am grateful to Daniel Wong, who always taught me English and solved my problems in different aspects of life in the United States. I am grateful to Lakshmi Kumar Dabirru, who spent a lot of time helping me in the Compiler Design course.
I also appreciate the help and support of my brother and other relatives during my Ph.D. studies.
Contents
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
2 Correctable Parity Protected Cache (CPPC)
2.1 Existing cache error protection schemes
2.2 Basic L1 CPPC
2.2.1 Normal operations of basic L1 CPPC (no fault)
2.2.2 Recovery in basic CPPC
2.2.3 Enhancements to basic CPPC
2.3 Basic L2 CPPC
2.3.1 Normal operations of basic L2 CPPC (no fault)
2.3.2 L2 CPPC error recovery
2.3.3 CPPC in exclusive cache hierarchies
2.4 Spatial multi-bit fault tolerant CPPC
2.4.1 Error detection with interleaved parity bits
2.4.2 Vertical spatial multi-bit errors
2.4.3 Byte-shifting operation
2.4.4 Dirty data error recovery
2.4.5 Fault locator
2.4.6 Spatial multi-bit error coverage
2.4.7 Incorrect correction of temporal multi-bit errors
2.4.8 Barrel shifter implementation
2.4.9 Byte-shifting advantages
2.4.10 Spatial multi-bit error correction with more pairs of correction registers instead of byte-shifting
2.5 Evaluations
2.5.1 Performance
2.5.2 Energy consumption
2.5.3 Reliability
2.6 Conclusions
3 Chip Independent Error Correction (CIEC) in Main Memory
3.1 Background and related works
3.2 CIEC Soft error detection
3.2.1 Assignment of parity bits
3.2.2 Accesses to parity bits
3.2.3 Multi-bit faults coverage
3.3 Soft error correction
3.4 CIEC error recovery
3.5 Page swap-out and DMA accesses
3.6 Implementing read-before-writes in various cache levels
4 PARMA+
4.1 Background
4.2 Previous cache reliability models
4.3 Basic assumptions and equations of PARMA+
4.3.1 Illustrative example
4.3.2 SEU rate in one protection domain (SEU rate)
4.4 Failure of a domain independently of other domains
4.4.1 Probability of c DSEUs
4.4.2 Probability of access failure given a single DSEU in the domain
4.4.3 Probability of access failure given two or more DSEUs in the domain
4.5 Failure dependencies with neighboring domains
4.6 Model extensions
4.7 Simulations
4.8 PARMA+ floating point precision
4.9 PARMA+ tool
4.10 New empirical model for multi-bit faults
5 Conclusions and Future Research Directions
Reference List
List of Tables
2.1 Evaluation parameters
2.2 Parameters used in reliability evaluations
2.3 MTTF of the caches (temporal MBE faults)
4.1 Failure condition for different protection codes
4.2 Evaluation parameters
4.3 PARMA+ and MACAU deviations under fault patterns of Figure 4.9 (accelerated SEU rate)
4.4 Fault injection results of Table 4.3 and their 95% confidence interval
4.5 PARMA+ and PARMA+_light deviations under fault patterns of Figures 4.9 and 4.10 (accelerated SEU rate)
4.6 95% confidence interval for fault injection experiments of Table 4.5
4.7 PARMA+ and PARMA+_light deviations in bit-interleaved caches (accelerated SEU rate)
4.8 95% confidence interval for fault injection experiments of Table 4.7
4.9 PARMA+ and PARMA+_light FIT rates under fault patterns of Figures 4.9 and 4.10 (actual SEU rate)
4.10 The empirical model results and deviations from PARMA+
List of Figures
2.1 L1 CPPC Structure
2.2 Write operation in L1 CPPC
2.3 Extra hardware in L1 cache to support L2 CPPC
2.4 Read-before-writes in L2 CPPC
2.5 L1 CPPC with barrel shifter
2.6 Byte-shifting in an 8×8 array
2.7 Faulty sets in Example 1
2.8 Reduced faulty sets with bytes 0 and 1 in Example 1
2.9 CPI of CPPC and 2D parity in the L1 cache normalized to 1D parity
2.10 Percentage of write operations in dirty blocks of L2 CPPC that cause read-before-writes
2.11 Number of read-before-writes in L1 supporting an L2 CPPC normalized to the total number of L1 read hits
2.12 Energy consumption of different schemes in the L1 cache normalized to energy of parity
2.13 Energy consumption of different schemes in the L2 cache normalized to energy of parity
3.1 Structure of a memory rank
3.2 Fields of each parity cache entry
3.3 Location of parities in a DIMM with two ranks
3.4 Reading data overwritten in CIEC
4.1 A spatial multi-bit fault can flip a different number of bits in a protection domain depending on its location
4.2 Failure of four words can be dependent on each other
4.3 Two multi-bit fault patterns can add or subtract to each other
4.4 The running example of Section 4
4.5 Bits contributing to N^i_DSEU for each of the two fault patterns
4.6 Bits in N^2_DSEU that cause no failure in Word 7
4.7 Failures of domains can be dependent on each other
4.8 Accesses to word 7 and its neighbors
4.9 Small fault patterns (black squares show faults)
4.10 Large fault patterns (black squares show faulty bits)
Abstract
Due to shrinking feature sizes, cache memories have become highly vulnerable to soft
errors. In this thesis, reliability of caches is studied in two ways:
First, a reliable error protection scheme called Correctable Parity Protected Cache
(CPPC) is proposed, which adds error correction capability to a parity-protected cache.
In CPPC, parity bits detect errors and the XOR of all data written into the cache is kept
to recover from detected errors in dirty data. Detected errors in clean data are simply
recovered by reading the correct data from the lower level memory. Our simulation
data shows that CPPC provides high reliability while its overheads are smaller than those of competing schemes, especially in L2 or lower-level caches. The CPPC idea can also be extended to main memory, as explained later in detail.
Second, this thesis proposes efficient approaches to measure the reliability of caches. To
select an appropriate error protection scheme in caches, the impact of different schemes
on the FIT (Failure In Time) rate should be measured. An accurate and versatile cache
reliability model called PARMA+ is proposed which targets both single-bit and spatial
multi-bit soft errors. Unlike all previous models, PARMA+ is applicable to any multi-bit fault pattern and cache configuration, and it is highly accurate. To validate
PARMA+, we compare the measurements obtained with it to the results produced from
accelerated fault-injection simulations. This comparison shows a very small deviation
which demonstrates the high accuracy of PARMA+.
Chapter 1
Introduction
Reliability is one of the most important considerations in the design of microproces-
sors. In today’s microprocessors, about 60% of the on-chip area is occupied by caches.
Because of this, caches have a considerable impact on microprocessor reliability, area
and energy consumption.
Most caches are write-back caches, in which modified (dirty) data is not immediately prop-
agated to lower levels of the memory hierarchy. Since dirty data has no back-up copies
in other levels of the memory hierarchy, write-back caches are especially vulnerable to
soft errors. Two protection codes are common in write-back caches:
1- Single Error Correction Double Error Detection (SECDED) code: This code
is often used in commercial processors [14, 22]. The area overhead of SECDED is high:
SECDED needs 8 bits to protect a 64-bit word, a 12.5% area overhead. Since more bits
are read at each access, SECDED increases energy consumption as well.
2- Parity: Because of the high overheads of Error Correcting Codes (ECCs), some
level-one (L1) caches are protected by parity bits only. Parity bits are very effective in
L1 write-through caches because all detected faults are recoverable from the L2 cache,
but they do not provide any error correction capability for dirty data in L1 write-back
caches. In this situation, even a single-bit fault in dirty data can cause a failure [19].
In this situation, a low-cost and reliable error protection scheme for caches is sorely
needed. This thesis introduces a reliable write-back cache called Correctable Parity
Protected Cache (CPPC), which provides error correction capability on top of parity
protection at low cost. In CPPC, faults are detected by parity bits. Faults detected in
unmodified (clean) data are corrected by re-fetching the data from the next level of the
memory hierarchy. To correct faults in dirty data, CPPC keeps track of the bit-wise XOR
of all dirty data added to or removed from the cache in two (for L1 CPPC) or three (for
L2 or lower-level CPPCs) special registers. These special registers are called correction
registers. When a fault is detected in dirty data by the parity bits, CPPC recovers the
correct data by bit-wise XORing the content of the correction registers with all current
dirty data of the cache except for the faulty data.
In CPPC, the level of reliability can be tuned finely. The reliability level for single-bit
error correction can be enhanced by simply increasing the number of correction regis-
ters. The level of protection against spatial multi-bit errors in CPPC can be fine-tuned
by combining more correction registers with a byte-shifting scheme explained later.
Through simulations of sampled SPEC2000 benchmarks we show that an L1 CPPC
increases the Mean-Time-To-Failure (MTTF) by 18 orders of magnitude as compared
to a parity-protected cache while its performance and dynamic energy overheads are
on average 0.3% and 14%, respectively. CPPC is much more energy efficient in lower levels of the
memory hierarchy. Our simulation results show that in the second level of the cache
hierarchy, the dynamic energy overhead of CPPC is only 0.6% over the parity-protected
cache, while its MTTF is 14 orders of magnitude longer. We also compare CPPC
to caches protected by SECDED combined with physical bit interleaving and by two-
dimensional parity. Our simulation results show that CPPC saves a significant amount
of energy compared to these two schemes. The dynamic energy consumed by an L2
CPPC is lower by 68% and 71% as compared to L2 caches protected by SECDED with
bit interleaving and two-dimensional parity, respectively.
Besides these energy advantages and low performance costs, CPPC provides the
following unique features.
1. CPPC enlarges the protection domain efficiently. Existing correction codes
are attached to cache blocks or words. By protecting caches at larger granularities such
as multiple blocks or even the entire cache, CPPC saves area and other resources. With
current methods, the granularity of protection domains cannot be increased efficiently.
For example, if a cache block is protected by ECC, a read-modify-write operation must
be performed for every partial write to the block, at the cost of added energy and perfor-
mance overheads. Protecting more than one cache block in the data array with ECC is
even more complex.
2. CPPC finely adjusts the degree of reliability. Traditional error protection
schemes such as Hamming codes change the protection capability coarsely. For exam-
ple, going from SECDED to DECTED (Double Error Correction Triple Error Detection)
improves the correction and detection capabilities by one bit for all words (or blocks).
However, DECTED needs 15 bits to protect a 64-bit word. Thus, its incremental over-
head over SECDED is very high. Since the protection-level change is so coarse, codes
like DECTED are not currently used in L1 or L2 caches.
3. CPPC uses software for error recovery. It is reasonable to avoid extra hardware
as much as possible for extremely rare events such as error recovery and handle them in
software instead.
We also extend CPPC to main memory; we call this extension Chip Independent Error Correction (CIEC) because it provides error correction capability independently of the memory chip width. Main memory is highly susceptible to soft errors because it has billions of bits and the failure of any bit can crash the entire system. Main memory usually uses SECDED codes to improve reliability. As explained above, SECDED incurs a high 12.5% area overhead. Besides this area overhead, conventional ECCs are not applicable
to new memory technologies which use wide devices. For example, stacked memories
employ wide chips such as 32-bit wide chips [28] while conventional ECC Dual Inline
Memory Modules (DIMMs) have employed narrow chips with widths of 4 or 8 bits [38].
In addition, mobile DRAMs employ wide chips [18]. One of the advantages of wide
devices is to improve energy efficiency because fewer chips are activated on each mem-
ory access. Applying ECC to DIMMs with wide devices is very challenging [28, 18, 38]
for two reasons. First, with fewer chips on a DIMM, which is one of the advantages of
using wide chips, having an extra chip for ECC (say four data chips and one ECC chip)
increases the area overhead significantly (25% instead of 12.5%). Second, the standard
bus width must be divided between data and ECC chips which may not be possible. For
example, the standard bus width (72 bits for ECC DIMMs) cannot be divided among
x16 chips. Note that DIMMs use the same kind of chips for both ECC and data and a
DIMM with heterogeneous chips is not an acceptable solution to this problem [28].
Because main memory contributes to a large amount of the energy consumption of
today’s servers, the energy overhead caused by ECCs (especially by preventing usage
of low power memories) can have a huge negative impact on the energy consumption of
the entire system. In order to provide a high level of reliability at very low cost, CPPC can be applied to main memory, as will be described later.
This thesis also provides reliability models that efficiently measure cache Failure
In Time (FIT) rate. As technology evolves, the fault patterns of soft errors in caches
are rapidly changing and it is predicted that all faults will be multi-bit faults (i.e., a
single particle hit flips more than one bit) by 2016 [1]. To select proper error protection
schemes for caches, we must be able to estimate and compare the reliability of different
protection schemes in addition to their energy, performance and area overheads. While
many simulators exist to measure the performance of caches and rigorous models such
as CACTI [36] are able to estimate cache area and energy consumption, no rigorous
model exists to estimate the FIT rate of caches in the presence of spatial multi-bit faults.
Computing the cache FIT rate under spatial multi-bit faults is very complex because
a particle hit can flip bits in different, adjacent protection domains. In some cases a fault
that falls in one domain is masked in that domain but may cause the failure of another domain; a fault may also cause the failure of both domains. Furthermore, dealing with multiple consecutive multi-bit faults and their astronomically large number of possible
combinations of fault patterns is very complex.
In addition to the lack of a strong, general model for multi-bit faults, most previ-
ous models cannot measure the FIT rates of caches protected by ECCs, even though reliability remains important in systems whose caches are protected by ECC or other methods.
To address the need for an all-inclusive cache reliability model which computes the
FIT (Failure In Time) rate of an ECC-protected cache for any set of overlapping spatial
multi-bit and single-bit faults, we introduce a new model called PARMA+. To the best
of our knowledge, this is the first model which is able to compute the FIT rate of a
cache under any possible sequence of multi-bit faults and any protection domain size
and topology.
Our experiments comparing PARMA+ with fault-injection simulations show that
PARMA+ is highly accurate. We also compare PARMA+ with MACAU [30], another
model developed for spatial multi-bit faults. By contrast with PARMA+, MACAU is
restricted to specific types of faults and specific domain sizes. The accuracy of MACAU is lower than that of PARMA+ even for the very limited configurations for which MACAU is
applicable.
The rest of the thesis is organized as follows. Chapter 2 explains CPPC. Chapter 3
describes the CPPC extension to main memory. Chapter 4 describes PARMA+, and Chapter
5 concludes the thesis.
Chapter 2
Correctable Parity Protected Cache
(CPPC)
This chapter explains CPPC, a very low-cost cache error protection scheme. As explained above, CPPC adds error correction to a parity-protected cache; the major goal is a scheme whose costs are close to those of parity while its reliability is close to that of ECC.
This chapter first reviews existing cache error protection schemes and then describes the CPPC structure. Because of the complexity of CPPC, we first explain a basic version that deals only with single-bit faults; the multi-bit fault tolerant CPPC is described afterwards. At the end of the chapter, CPPC is evaluated and compared with other schemes.
2.1 Existing cache error protection schemes
In addition to ECC and parity, which are used in many commercial processors, other cache protection approaches have been proposed in academic papers. These schemes are explained below.
Two-dimensional parity for cache memories was proposed in [12]. In this approach
the horizontal parity (along a row) detects faults and the vertical parity (along a column)
corrects them. Since the vertical parity of the cache changes on every write, a new
vertical parity must be computed by reading the old data from the cache and XORing
it with the old vertical parity. This operation is called read-before-write and is done on
every write and every cache miss. Thus, the energy overhead of two-dimensional parity
is high.
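As a small illustration of the read-before-write, the following Python sketch (not from [12]; the word width and values are arbitrary) updates a column parity when one word of the column is overwritten:

# The new column (vertical) parity is the old parity with the old word XORed
# out and the new word XORed in; fetching old_word is the read-before-write.
def update_vertical_parity(old_vparity, old_word, new_word):
    return old_vparity ^ old_word ^ new_word

# Example with three 8-bit words forming one column.
column = [0b1010_0001, 0b0110_1100, 0b0001_0111]
vparity = column[0] ^ column[1] ^ column[2]
vparity = update_vertical_parity(vparity, column[1], 0b1111_0000)
column[1] = 0b1111_0000
assert vparity == column[0] ^ column[1] ^ column[2]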
Physical bit interleaving, in which bits of different words are stored in adjacent cells
and protected by separate codes, is a common technique to detect and correct spatial
multi-bit errors since it converts a spatial multi-bit error into several single-bit errors
in different words. Bit interleaving tolerates spatial multi-bit faults appropriately but
increases the energy consumption of the cache significantly [12].
To mitigate the area overhead of error protection codes in Last-Level Caches (LLCs),
the codes can be saved in main memory [37]. In this scheme, a lightweight code is kept
in the cache, and a stronger code is stored in main memory and cached in the LLC itself.
This scheme facilitates the efficient implementation of very strong protection codes.
In ICR (In-Cache Replication) [40], cache lines that have not been accessed for a
long time are allocated to replicas of dirty blocks. ICR trades off reduced effective cache
size for better reliability. Thus, the miss rate of the cache may be higher or, alternatively,
dirty blocks may be left unprotected. In [39] a variant of ICR in which a small cache
keeps a redundant copy of dirty blocks is proposed, but this scheme is not area-efficient
for large caches. Some papers [15, 2] advocate early write-backs. By reducing the
number of dirty blocks in the cache, these schemes reduce the vulnerability to soft
errors, but energy consumption is high, especially when data locality is low and the
number of write-backs is large.
To improve the reliability of clean data in L1 caches, L1 cache data can be refreshed
by re-fetching the data from the L2 cache when the protection code of the L2 cache is
strong [29]. This scheme increases energy consumption substantially because each access to L2 consumes several times more energy than an access to the L1 cache.
Caches protected by error correction codes may be periodically scrubbed in order to
improve reliability [20]. Scrubbing enhances the reliability of a cache against temporal
multi-bit faults at the cost of higher energy consumption.
In [13], it is proposed to decouple clean and dirty data in L2 and L3 caches by
protecting dirty lines with SECDED codes and clean lines with parity bits. Protection
codes are saved in a separate array and the maximum number of dirty blocks is enforced
by early write-backs. This idea saves area but it adds energy and performance overheads.
It has also been suggested to turn off cache lines [6] not accessed for a long time to
improve reliability by masking faults in them. This approach creates more cache misses.
In [15] the authors propose to decouple the protection of clean and dirty data. Dirty
blocks are protected with ECC, but, when a dirty block becomes clean, some ECC bits
are gated off and ECC is converted to parity. This approach does not improve area and
a performance penalty is incurred to convert back from parity to ECC.
In [24] error detection is decoupled from error correction in L1 data caches by com-
bining fast error detection codes in the data array and ECC in a separate array. This
scheme improves performance and energy consumption, but it detects only single-bit faults while needing more bits than SECDED, which detects double-bit faults.
In summary, previous research and implementations have shown that the following
properties of a cache error protection code are highly desirable.
1. Decoupling error detection from error correction. Decoupling detection from
correction is efficient because a lightweight detection code is checked at loads and a
complex correction code is invoked only at the time of fault recovery, which is a very
rare event.
2. Decoupling the protection of clean data and dirty data. A large amount of
cache data is clean and does not need error correction capability because the correct
data values can be recovered from lower levels of the memory hierarchy. Decoupling
the protections of clean and dirty data saves energy and area.
3. Protecting against spatial multi-bit faults. The number of spatial multi-bit
faults is bound to rise in future with higher levels of integration [17, 1], because an
energetic particle can flip more bits as the area occupied by each bit shrinks. Thus, a
reliable cache protection scheme must be able to correct spatial multi-bit faults.
CPPC efficiently and seamlessly encapsulates all these properties within a simple
reliability algorithm. In particular, CPPC deals with dirty data more efficiently than
previous proposals.
2.2 Basic L1 CPPC
In this section, we describe the architecture of a basic L1 CPPC. Basic CPPC can recover
from single-bit faults but not from spatial multi-bit faults.
Figure 2.1 shows the components of a basic L1 CPPC. L1 CPPC maintains one dirty
bit per word in the cache directory. In addition, it has at least one parity bit per word
and two correction registers, R1 and R2, each the size of a word. Register R1 keeps
the XOR of all words stored into the cache and is updated on every processor store.
Register R2 keeps the XOR of all dirty words removed from the cache. Dirty words
are removed from the cache in two cases: 1) a store to a dirty word in the cache and 2)
the replacement of a dirty block. In both cases the dirty words that are overwritten or
written-back are XORed into R2.
At all times, the XOR of R1 and R2 is equal to the XOR of all dirty words currently
in the cache. To show why, we take a simple example that can be generalized.
Figure 2.1: L1 CPPC Structure
Let Di be a word stored into the cache where i is the sequence number of the stores. Assume
that a program has executed n writes to L1 cache words from its beginning. Then R1
contains:
(R1) = D1 ⊕ D2 ⊕ D3 ⊕ ... ⊕ Dn (2.1)
⊕ is the bitwise XOR between two binary data. We assume next that the first 100
dirty words stored into the cache are removed from the cache. After these removals the content of R2 is:
(R2) = D1 ⊕ D2 ⊕ D3 ⊕ ... ⊕ D100 (2.2)
The XOR of R1 and R2 after the removal of the stored values is equal to ”0 ⊕ D101 ⊕ D102 ⊕ D103 ⊕ ... ⊕ Dn”, because ”D1 ⊕ D2 ⊕ D3 ⊕ ... ⊕ D100” was XORed in both R1 and R2, and the XOR of a value with itself is zero.
In addition, the XOR of a value with zero is equal to the value itself. Therefore, ”0 ⊕ D101 ⊕ D102 ⊕ D103 ⊕ ... ⊕ Dn” is equal to ”D101 ⊕ D102 ⊕ D103 ⊕ ... ⊕ Dn”, which is the XOR of all dirty words remaining in the cache.
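This cancellation can be checked with a minimal Python sketch (illustrative only; it assumes the stores go to distinct words, as in the example above):

import functools, operator, random

random.seed(0)
writes = [random.getrandbits(64) for _ in range(10)]   # D1 ... D10
R1 = functools.reduce(operator.xor, writes)            # equation (2.1)
R2 = functools.reduce(operator.xor, writes[:4])        # first 4 dirty words removed, equation (2.2)
# R1 XOR R2 equals the XOR of the dirty words still in the cache.
assert R1 ^ R2 == functools.reduce(operator.xor, writes[4:])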
2.2.1 Normal operations of basic L1 CPPC (no fault)
In basic L1 CPPC, on every load of a word, the parity bit(s) associated with the word is
(are) checked. When a fault is detected, a recovery mechanism is activated. On every
store, the new data is XORed into R1 in parallel with the cache update. When a store
updates an already dirty word, it is necessary to first read the old dirty word before
updating the cache and then to XOR it into R2. To implement this read-before-write,
the write is executed slightly differently than in a regular cache. On every write, the per
word dirty bits are fetched with the cache tag and then checked in conjunction with the
tag comparison. If the dirty bit is set, the old word is read from the cache and XORed
into R2. If the dirty bit is not set, the write proceeds normally. The dirty bit is part
of the cache directory and every write to the cache must first check the cache directory to
confirm a hit before updating the cache. Figure 2.2 shows the write operation in a basic
L1 CPPC.
On each cache write-back, all dirty words of the evicted block must be XORed
into R2. Since write-back caches typically process write-backs through a victim buffer,
this operation can be done off the critical path and without any significant performance
overhead.
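The store and write-back handling just described can be summarized by the following hedged Python sketch; the dictionary-based cache model and the function names are illustrative, not the hardware implementation:

# `cache` maps a word address to a (data, dirty) pair; R holds the two
# correction registers of basic L1 CPPC.
def cppc_store(cache, R, addr, new_data):
    old_data, dirty = cache.get(addr, (0, False))
    if dirty:
        R["R2"] ^= old_data        # read-before-write: old dirty word leaves the signature
    R["R1"] ^= new_data            # every stored word enters R1
    cache[addr] = (new_data, True)

def cppc_writeback(cache, R, block_addrs):
    # On eviction, every dirty word of the block is XORed into R2
    # (off the critical path, e.g. from the victim buffer).
    for addr in block_addrs:
        data, dirty = cache.get(addr, (0, False))
        if dirty:
            R["R2"] ^= data
            cache[addr] = (data, False)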
2.2.2 Recovery in basic CPPC
The normal operation of basic L1 CPPC does not add significant overhead to the oper-
ation of a regular cache, but, at the detection of a fault, a complex but rare recovery
procedure is activated.
When a fault is detected in a clean word, the cache block is re-fetched from the next
level of the memory hierarchy.
Figure 2.2: Write operation in L1 CPPC
To recover from a fault detected in a dirty word, the recovery algorithm first XORs
R1 and R2 together. The XOR of R1 and R2 is then XORed with all dirty words cur-
rently in the cache except the faulty word. This operation removes the values of all dirty
words except the faulty dirty word from the XOR of R1 and R2 and yields the corrected
value of the faulty dirty word.
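A hedged sketch of this recovery step, reusing the illustrative model above:

# Recover one faulty dirty word: XOR R1, R2 and every other dirty word.
def cppc_recover(cache, R, faulty_addr):
    value = R["R1"] ^ R["R2"]
    for addr, (data, dirty) in cache.items():
        if dirty and addr != faulty_addr:
            value ^= data
    return value    # corrected content of the faulty dirty word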
2.2.3 Enhancements to basic CPPC
With one parity bit per word, basic L1 CPPC detects and corrects all odd numbers of
faults in any clean word. Additionally, by adding two 64-bit correction registers (R1 and
R2) to protect dirty data, basic L1 CPPC also detects and corrects all odd numbers of
faults in one dirty word provided there are no faults in other dirty words of the cache.
We again take a simple example to demonstrate this. Let R1 ⊕ R2 be equal to D1 ⊕ D2 ⊕ D3 ⊕ ... ⊕ Dj, where D1, D2, ..., Dj are dirty words in the cache. Assume that a fault has been detected only in D1 so that all other dirty words are fault free. To recover, all other dirty words are XORed with R1 and R2. The result is the correct value of D1 because the XOR of D2 ⊕ D3 ⊕ ... ⊕ Dj with R1 and R2 is D1. However, if more than one word is faulty (such as D1 and D2), XORing the correct dirty words with R1 and R2 results in the XOR of the correct values of the two faulty words (D1 ⊕ D2), not the correct value of each faulty word separately.
The error correction capability of basic CPPC for dirty words can be scaled up in two
different ways. First, one can increase the number of parity bits per word. For instance,
with eight parity bits per word, the detection and correction capability is enhanced eight
times as compared to one parity bit per word. A possibility is to protect each byte of
a word with a parity bit, so that the protection domain associated with each parity bit
is one byte. In this case, the cache can detect an odd number of faults in any byte.
Moreover it is also possible to correct multiple faulty words when an odd number of
faults happen in different bytes of the faulty words. For example, three bit faults in byte
1 of a dirty word and five bit faults in byte 2 of another dirty word can be detected and
corrected. In essence, with one parity bit per byte instead of per word, the granularity
of correction is upgraded from words to bytes.
Second, more correction registers can be added. With one pair of correction reg-
isters, the protection domain of CPPC is the entire cache. However, the XOR of dirty
words can be maintained in smaller granularities than the entire cache. For example, one
can add one pair of correction registers (R3, R4) so that (R1, R2) maintains the XOR
of dirty words in one half of the cache and (R3, R4) maintains the XOR of dirty words
in the other half of the cache. The protection domain is cut in half and the correction capability is improved by a factor of two. By adding more correction register pairs, the
correction capability of CPPC can be scaled up gradually and at low cost.
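For instance, with two register pairs, one set-index bit could select the pair; the sketch below is only illustrative, since the thesis does not prescribe a particular mapping:

# Pair 0 protects one half of the cache sets, pair 1 the other half.
def select_pair(set_index, num_pairs=2):
    return set_index % num_pairs

pairs = [{"R1": 0, "R2": 0} for _ in range(2)]
pairs[select_pair(13)]["R1"] ^= 0xDEAD_BEEF_0000_0001   # store mapped to the second pair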
2.3 Basic L2 CPPC
L2 CPPC is similar to L1 CPPC in that both of them recover from errors in dirty data by
keeping the XOR of all dirty data of the cache in correction registers. In this section, we
focus on differences between L2 CPPC and L1 CPPC and avoid repeating their common
characteristics.
L2 CPPC keeps a dirty bit with every chunk of data of the size of an L1 cache line
(called sub-block) because the unit of writing into the L2 cache is an L1 block.
In L1 CPPC, a read-before-write is required for every store to a dirty word. In L2
CPPC, overwritten dirty data must also be XORed into the correction registers,
causing a read-before-write for every write into a dirty sub-block in L2. However, the
energy cost of these read-before-writes can be cut considerably, as we will show shortly
in Section 2.3.1. This is because write-backs to the L2 cache or to lower levels of
memory can be known in advance by tracking writes in higher-level memories. To have
a write into an L2 block, the copy of the block in the L1 cache must first become dirty.
This observation provides opportunities to avoid read-before-writes in the L2 CPPC
in two ways. First, the read-before-writes caused by write misses from L1 to dirty
sub-blocks of L2 can be completely eliminated in L2 CPPC. Second read-before-writes
following a read miss to a dirty L2 sub-block can be executed in L1 instead of L2 at a
much lower energy cost.
To transfer read-before-writes from L2 to L1, L2 CPPC is supported by an extra
correction register located close to the L1 cache and called Upper Level Register (ULR).
Hence, L2 CPPC needs three correction registers of the size of an L1 line.
Correction registers R1 and R2 are collocated with L2. R1 keeps the XOR of all
blocks written back by L1 to the L2 cache. Some of the dirty data removed from L2
is XORed into R2. The rest of the dirty data removed from L2 is XORed into ULR
collocated with L1.
Figure 2.3: Extra hardware in L1 cache to support L2 CPPC
Figure 2.3 shows the components added to the L1 cache to support the L2 CPPC
(shaded).
ULR: ULR is a register of the size of an L1 line and tracks the signature (XORs) of
some overwritten L2 dirty blocks.
L2D bit: L2D is added to the state bits of every L1 cache line. L2D is a copy of the
dirty bit of the block in the L2 cache obtained at the time of the L1 miss.
2.3.1 Normal operations of basic L2 CPPC (no fault)
L2 CPPC is very energy efficient because it performs no read-before-writes in the L2 cache itself. On a write
into the L2 cache (due to a write-back from L1 to L2), the new data is XORed into R1 as
is done in L1 CPPC. On a write-back from the L2 cache to the lower-level memory, the
removed dirty sub-blocks are XORed into R2. Like in L1 CPPC, the old dirty data must
also be removed from the signature when it is overwritten. This is done more efficiently
in L2 CPPC than in L1 CPPC.
When an L1 write miss hits on a dirty L2 sub-block, a read-before-write in the dirty
L2 sub-block is bound to happen later. This is because the sub-block fetched from L2
becomes immediately dirty in L1 and will be inevitably written-back to L2 later on when
it is replaced. In order to avoid the inevitable read-before-write later in L2, the sub-block
is anticipatively XORed into R2 at the time when it is sent to L1. Thus, the old dirty
data is XORed into R2 without any additional cache access and energy cost.
For L1 read misses hitting on dirty L2 sub-blocks, there may or may not be a read-
before-write in the L2 cache later depending on whether the L1 copy is modified after
the read miss. Thus, the read-before-write cannot be avoided like the case of L1 write
misses. However, the read-before-write in L2 can be converted into a read-before-write
in L1 because the old value of the L2 sub-block is in the L1 cache at the time the L1
block becomes dirty. Since the energy consumption of an access to L1 is several times
less than to L2, performing the read-before-write in L1 saves a significant amount of
energy. A read-before-write is performed in the L1 cache whenever a store hits in L1 on
a clean block whose L2 copy is dirty. In this case, the old clean block in L1 is read and
XORed into ULR before storing the new data into L1.
In order to know whether the copy of an L1 block in the L2 cache is dirty, the dirty
bit of a sub-block in L2 is copied to L1 on an L1 read miss which hits in the L2 cache.
The dirty bit of the L2 copy is kept in the L2D bit shown in Figure 2.3. If the L1 miss is
a write miss, L2D is reset.
With this approach, when an L1 block is written-back to a dirty L2 sub-block,
the read-before-write has already been performed and the old dirty L2 data is already
XORed into either R2 or ULR. Hence, no extra action is needed in the L2 cache. Figure
2.4 illustrates how read-before-writes are performed in L2 CPPC.
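The decision logic of Figure 2.4 can be sketched as follows in Python; the L1Line structure and function names are hypothetical, and R is assumed to hold R1, R2 and ULR:

from dataclasses import dataclass

@dataclass
class L1Line:
    data: int = 0
    dirty: bool = False
    L2D: bool = False    # copy of the L2 sub-block's dirty bit (Figure 2.3)

def on_l1_miss_fill(is_store, l2_dirty, subblock_data, R, line):
    # Fill an L1 line from L2. A write miss to a dirty L2 sub-block will
    # inevitably be written back, so XOR the old dirty data into R2 now.
    line.data = subblock_data
    if is_store and l2_dirty:
        R["R2"] ^= subblock_data
        line.L2D = False
    else:
        line.L2D = l2_dirty

def on_l1_store_hit(line, new_data, R):
    if (not line.dirty) and line.L2D:
        # Clean L1 copy of a dirty L2 sub-block: perform the read-before-write
        # in L1 by XORing the old block into ULR before the store.
        R["ULR"] ^= line.data
    line.data, line.dirty = new_data, True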
2.3.2 L2 CPPC error recovery
To recover from a fault detected in a dirty L2 sub-block, R1, R2 and ULR are first
XORed. Then the L2 cache is scanned sub-block by sub-block. If an L2 sub-block is
dirty and does not have a copy in L1 (this is checked either by inclusion bits in L2 or
by accessing the tag array of the L1 cache), the sub-block is XORed into the XOR of
R1, R2 and ULR. If the L2 sub-block is dirty and has a copy in the L1 cache and the L1
copy is dirty, the sub-block is already XORed into ULR; thus, it should not be XORed
again at the time of recovery. If the L1 copy is clean, the L2 sub-block must be XORed
into the XOR of R1, R2 and ULR.
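A hedged sketch of this recovery scan; the data structures and the l1_lookup helper are hypothetical:

# Recover one faulty dirty L2 sub-block by scanning the whole L2 cache.
def l2_cppc_recover(l2_subblocks, l1_lookup, R, faulty_id):
    value = R["R1"] ^ R["R2"] ^ R["ULR"]
    for sb_id, (data, dirty) in l2_subblocks.items():
        if not dirty or sb_id == faulty_id:
            continue
        if l1_lookup(sb_id) == "dirty":
            continue      # already accounted for in ULR; do not XOR again
        value ^= data     # no L1 copy, or a clean L1 copy
    return value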
2.3.3 CPPC in exclusive cache hierarchies
In the context of exclusive cache hierarchies, the L2 cache does not hold any data present in the L1 cache. In this case, L2 CPPC does not require any read-before-write, because no write-back from L1 modifies a dirty L2 block. L2 CPPC then needs only correction registers R1 and R2, so its energy costs are extremely low.
The behavior of L1 CPPC is the same for both inclusive and exclusive hierarchies
and read-before-writes are needed in L1 CPPC in both cases.
2.4 Spatial multi-bit fault tolerant CPPC
This section is applicable to both L1 and L2 CPPCs. We use L1 CPPC to explain
the multi-bit fault-tolerant CPPC. The same operations apply to L2 CPPC, with some
obvious modifications. For example, in the byte-shifting technique explained in the
context of an L1 cache, a word is rotated before XORing it into R1 or R2. In L2 CPPC,
a sub-block is rotated before XORing it into R1, R2 or ULR.
2.4.1 Error detection with interleaved parity bits
With the soon-expected predominance of spatial Multi-Bit Errors (MBEs), basic CPPC
must be upgraded with interleaved parities to detect faults where a single strike flips
multiple adjacent bits. Regular parity only detects an odd number of faults and cannot
detect an even number of faults as in spatial two-bit faults where a single strike flips two
horizontally adjacent bits. Interleaved parity bits are the XOR of non-adjacent bits. For
example, the 8-way interleaved parity bits of a 64-bit word are the XOR of bits separated
by eight bits (Parity[i] = data bit [i] data bit [i+8]..., data bit [i+56]). Data bits i,
i+8,, i+56 form the protection domain are associated with Parity [i], i=0,...,7. N-way
interleaved parity can detect up to N faults in horizontally adjacent bits of the same
word because the bits belong to the protection domain of different parity bits.
Basic CPPC augmented with N-way interleaved parity detects and corrects horizon-
tal MBEs which flip up to N horizontally adjacent bits. This is because a horizontal
multi-bit fault in a row occurs either in one word or across the boundary of two words.
If only one word is faulty, it can be recovered by the XOR of R1, R2 and all non-faulty
dirty words. If the fault occurs across the boundary of two words, it is also correctable
because different parity bits are affected by the fault. For instance, in the case of 8-way
interleaved parity bits protecting a 64-bit word, if a 7-bit horizontal spatial fault straddles
the boundary of two words such as bits 62-63 of the left-side word and bits 0-4 of the
right-side word, parity bits P6-P7 of the left-side word and parity bits P0-P4 of the right-
side word detect the fault and since faults are in different parity protection domains, this
fault involving 7 bits can be corrected independently in each parity protection domain.
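A minimal Python sketch of 8-way interleaved parity for a 64-bit word (the bit numbering and the example value are assumptions for illustration):

# Parity[i] is the XOR of data bits i, i+8, ..., i+56.
def interleaved_parity(word, ways=8, width=64):
    parity = 0
    for i in range(ways):
        p = 0
        for j in range(i, width, ways):
            p ^= (word >> j) & 1
        parity |= p << i
    return parity

# A horizontal flip of up to 8 adjacent bits touches at most one bit per
# protection domain, so it is always detected.
w = 0x0123_4567_89AB_CDEF
assert interleaved_parity(w) != interleaved_parity(w ^ (0b1111111 << 3))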
2.4.2 Vertical spatial multi-bit errors
Although basic CPPC with interleaved parities can correct horizontal spatial multi-bit
errors, it cannot correct vertical multi-bit errors. For example, a two-bit vertical fault
which flips the first bits (bit 0) of two vertically adjacent dirty words cannot be corrected
in basic CPPC with interleaved parity because these two bits are XORed into the same
bit 0 of R1. Thus, the original values of these two faulty bits cannot be recovered using
R1 and R2.
In sections 2.4.3 to 2.4.10, we propose two enhancements to basic CPPC in order to
provide spatial MBE correction capability in both dimensions. The two enhancements
are 1) byte-shifting and 2) adding correction register pairs. We first expound on
the byte-shifting technique, and then, in Section 2.4.10, we explain how byte-shifting
operations can be eliminated by adding correction register pairs.
Again, to simplify the presentation, we assume in the following that the write gran-
ularity is a 64-bit word (as in L1 CPPC) to show how the technique works. We also
only consider spatial MBEs contained in an 8-by-8 bit square. The same designs with
obvious modifications can be deployed in general to correct spatial MBEs contained in
an N-by-N bit square.
Figure 2.5: L1 CPPC with barrel shifter
2.4.3 Byte-shifting operation
If CPPC were to XOR vertically adjacent bits into different bits of correction registers
R1 and R2 instead of the same bits, CPPC could potentially recover from vertical multi-
bit errors. This is the major intuition leading to our byte-shifting technique. In this
approach, data is rotated by different amounts in adjacent rows, before XORing the data
into R1 and R2. The data stored in the cache is NOT rotated.
Figure 2.5 shows a data array in which every 64-bit word is augmented with 8-way
interleaved parity bits that can detect up to an 8-bit horizontal fault in each word. Each
row of the array contains two words and their interleaved parity bits. To rotate data
before XORing it into R1, a byte-wise barrel shifter is added to the basic CPPC hard-
ware. The same structure exists for R2 but is omitted in the figure. Three bits of the
word address are connected to the control input of the barrel shifter in order to provide
eight different amounts of rotation in eight adjacent rows. All data array words that are
rotated by the same amount form a rotation class. With 8-way interleaved parity and
eight rotation classes practically all spatial multi-bit faults contained in an 8×8 square
can be corrected.
In this section, we assume that faults happen only in data to simplify the discussion.
However, faults can also happen in parities. In order to protect both parities and data,
one could upsize R1 and R2 to 72 bits so that the eight parity bits are rotated with the
word.
Figure 2.6 shows two data arrays. The array on the left side shows the arrangement
of bytes in the cache and the array on the right side shows how words are rotated in
different rows before they are XORed into R1 and R2. Figure 2.6 shows that every
multi-bit fault contained within an 8×8 square in the left-side array affects different
bytes of R1 and R2 after rotation as illustrated in the right-side array. For example, if a
3-bit vertical fault (a 3×1 fault) occurs in the first bit of byte 0 of the first three rows
(left array), the bit faults in each row are aligned with bit 0 of byte 0, bit 0 of byte 6 and
bit 0 of byte 7 of R1 and R2 after rotation (right array). Since basic CPPC with eight
interleaved parity bits per word can correct faults in different bytes of different words, it
can also correct spatial MBEs contained in an 8×8 square by rotating the words XORed
into R1 and R2. The additional hardware needed beyond basic CPPC is two byte-wise
barrel shifters, one for R1 and one for R2.
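A hedged Python sketch of the rotation applied before a word is accumulated; using the three low word-address bits is an assumption matching the eight rotation classes:

MASK64 = 0xFFFF_FFFF_FFFF_FFFF

def byte_rotate(word, rotation_class):
    # Rotate the 64-bit word left by (rotation_class mod 8) bytes.
    shift = 8 * (rotation_class % 8)
    return ((word << shift) & MASK64) | (word >> (64 - shift))

def xor_rotated(R, reg, word_addr, word):
    cls = word_addr & 0b111    # low word-address bits pick the rotation class
    R[reg] ^= byte_rotate(word, cls)

# A bit-0 flip in two vertically adjacent words (classes 0 and 1) lands in
# different bytes of the correction register, so both flips stay recoverable.
assert byte_rotate(1, 0) != byte_rotate(1, 1)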
2.4.4 Dirty data error recovery
The differences between basic CPPC described in Section 2.2 and the enhanced CPPC
considered in this section are 1) the bits protected by each parity bit are interleaved
while they were contiguous in basic CPPC and 2) the data accumulated in correction
registers R1 and R2 have been rotated by various amounts, depending on their rotation class. These two differences must be incorporated in the recovery algorithm.
Figure 2.6: Byte-shifting in an 8×8 array
CPPC can correct detected faults in a single faulty dirty word and can also correct detected faults in different dirty words, provided the faulty bits in different faulty words
are not in the protection domain of the same parity bit. In both cases, CPPC can recover
by correcting one faulty dirty word at a time, as was done in basic CPPC: To correct
each faulty dirty word, the recovery algorithm XORs all other dirty words after rotation
and then XORs the result with the contents of R1 and R2. The result is then rotated in
reverse to obtain the corrected value of each faulty dirty word.
Next, if the same parity bit in different faulty dirty words detects bit faults, the faults
are in the protection domain of the same parity bit. When this situation is detected,
CPPC restricts recovery to spatial multi-bit faults contained in an N×N bit square. With a CPPC capable of correcting spatial faults falling within an N×N square, a fault pattern in which the distance between any two faulty rows is more than N is deemed irrecoverable. However, if the faulty dirty words are in N adjacent rows, CPPC considers it a spatial multi-bit fault contained in an N×N square and starts recovery. To recover from these faults, CPPC launches a fault locator algorithm, which finds the location of faulty bits in each word of the N vertically adjacent faulty rows. Once the locations of
faulty bits are found, they are flipped back in each faulty dirty word. The fault locator is
now described.
2.4.5 Fault locator
A parity bit only indicates whether an odd number of bits are faulty in its protection
domain but it does not point to the positions of the faulty bits. The fault locator finds the
locations of bit faults in adjacent rows once faults have been detected. A fault contained
in an 8×8 square occurs either in the same byte of adjacent rows or across the boundary of the same pair of horizontally adjacent bytes of adjacent rows. Provided the single faulty byte or the fault range in the two adjacent faulty bytes is located, the error is correctable
because the interleaved parity bits in each faulty word point to the faults. For example,
if the fault locator determines that bytes 0 of several vertically adjacent words contain
the bit faults, the faults are located because interleaved parities of each word show the
exact location of the bit faults in the bytes.
The XOR of each bit of R1 and R2 is equal to the XOR of one bit of every dirty
word currently in the cache. These two correction registers are key to finding the location
of faulty bits.
The fault locator first XORs all dirty words in the cache after proper rotation and
then it XORs this result with R1 and R2. This 64-bit result is called R3 in this section,
although R3 is not necessarily a register. In effect, R3 is the XOR of a value with itself.
In the absence of faults, all bits of R3 are zero because the XOR of a value with itself
is equal to zero. However, in the presence of faults, some bits of R3 are set in the bit
locations where faulty bits are XORed. For example, if there is a two-bit vertical fault
in bit 0 of the first two rows of the array in the left side of Figure 2.6, R3 has two 1s, one
in bit 0 and one in bit 56 and all other bits are 0 because bit 0 of the first row and bit 0 of
the second row have been rotated to bit 0 and bit 56 of R1 and R2 as is shown in Figure
2.6. Hence, the locations of the faulty bits XORed into R1 and R2 are designated by 1s
in R3.
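A hedged Python sketch of computing R3 and its faulty bytes, reusing the byte_rotate sketch above (the dirty-word map is a hypothetical model of the cache scan):

def compute_r3(R, dirty_words):
    # R3 = R1 XOR R2 XOR (all dirty words after rotation); in the absence of
    # faults it is zero, otherwise its set bits mark rotated fault positions.
    r3 = R["R1"] ^ R["R2"]
    for word_addr, word in dirty_words.items():
        r3 ^= byte_rotate(word, word_addr & 0b111)
    return r3

def faulty_bytes(r3):
    # Bytes of R3 that contain at least one set bit.
    return [b for b in range(8) if (r3 >> (8 * b)) & 0xFF]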
When the CPPC fault locator is invoked, it knows 1) the parity bits which detected
the faults, 2) the rotation classes of the faulty words, and 3) the locations of the bits in
R3 whose values are 1. These three critical pieces of information are then exploited to
find the location of faulty bits. The fault locator first finds the common location of faults
in different words. Then it uses faulty parities to find the exact location of faults.
The procedure to locate faults is first illustrated in an example. In the process, we
introduce some definitions. The fault locator algorithm then follows.
Example 1. Assume that all parity bits P0-P7 of the first four words belonging to
the first four rotation classes 0-3 in Figure 2.6 have detected faults (they are all set).
Furthermore, bits 0-12 and 45-63 of R3 are set and all other bits of R3 are reset. We call
a byte of R3 with at least one bit set to 1 a faulty byte. Thus, bytes 0, 1, 5, 6 and 7 of
R3 are faulty bytes. The first task is to identify the bytes of words 0-3 that map to each
R3 faulty byte after rotation. The set of bytes of faulty words that map to each R3 faulty
byte is called the faulty set of that byte.
The faulty sets are shown in Figure 2.7. Since the locator only searches for 8×8
spatial faults, the faulty bits of faulty words must be contained in the same byte or in
the same pair of neighboring bytes. If the faulty sets do not contain the same byte or
do not contain one byte of a common pair of neighboring bytes, the fault is deemed
uncorrectable and a DUE (fatal) exception is raised. If all faulty sets contain one single
byte of all faulty words, then the fault must be contained in the same byte of all faulty
words. The parity bits of each faulty word point to the faulty bits in that byte. In the
example of Figure 2.7, this is not the case. However, at least one byte of two adjacent
bytes (bytes 0 and 1) is present in every faulty set. Thus the faulty bits must be contained
in these two bytes of the four faulty words.
Figure 2.7: Faulty sets in Example 1
Figure 2.8 shows the reduced faulty sets after removing all bytes other than bytes 0 and 1.
Consider first Word 0 in Figure 2.8. It has bit faults in the faulty sets of R3 bytes 0
and 1. Byte 1 of Word 0 is the only byte in the faulty set of R3 byte 1. Thus all bit faults
signaled by the bits set in R3 byte 1 must be the bit faults in Word 0 byte 1. The rest of
the faulty bits in Word 0 must be in byte 0 and they must be bits 5-7 and Word 0 can be
corrected. The source of faulty bits 5-12 in R3 has been identified and therefore these
R3 bits are reset.
The procedure is now repeated. From Figure 2.8 we remove faulty Word 0 and the
faulty set associated with R3 byte 1. The faulty set associated with R3 byte 0 is kept
because R3 bits 0-4 are still set. Consider now Word 1. Its byte 1 is the only one left
contributing to faults in R3 byte 0 in the faulty set of R3 byte 0. Thus the bit faults in
R3 byte 0 must come from byte 1 of Word 1 and are bits 0-4. The other three faulty
bits in Word 1 must be bits 5-7 of byte 0 (pointed to by R3 bits 61-63). We now remove
Word 1 and the faulty set of byte 0.
Figure 2.8: Reduced faulty sets with bytes 0 and 1 in Example 1
Word 2 is the only word left contributing to faults in
byte 7 of R3, thus bits 8-12 (first 5 bits of byte 1 of Word 2) are faulty. The other three
bits come from byte 0 of Word 2, bits 5-7. Finally Word 3 is left and its faulty bits are
pointed to by the rest of the faulty bits in R3.
We now specify the fault locator algorithm. It has five steps.
Fault locator algorithm.
Step 1. Identify the faulty bytes of R3. For each R3 faulty byte, identify its faulty
set.
Step 2. Inspect all faulty sets. If one and only one byte is common to all faulty sets
and no neighbor of this byte is in one of the faulty sets, then the faulty bits are contained
within the byte and parity bits in each faulty word point to the bit faults. Go to step 5.
Step 3. If there is one and only one pair of adjacent bytes with at least one byte in
every faulty set, then the fault has happened in these two adjacent bytes. If this is not the case, the fault is not correctable; go to step 5.
Remove all bytes from R3 faulty sets that are not one of the two adjacent bytes.
Step 4. Check the following two conditions. If at any time neither of the following two conditions is true, the fault is not locatable and the algorithm exits to step 5.
1) If there is an R3 faulty byte whose faulty set includes only one of the two bytes,
the bits set to 1 in R3 point to the exact location of faulty bits in the byte which is the
only member of that faulty set. Thus, the location of bit faults in one byte of one of the
faulty words is found and other bit faults in that faulty word occurred in the other byte.
As a result, bit faults in one word are located. Remove the bytes of that corrected word
from the faulty sets and reset the bits of R3 which are set due to this faulty word.
2) If there is a faulty word which has only one byte left in the faulty sets, faulty bits
in the faulty byte of the word are located by its parities. Remove that byte from the
faulty set and reset the bits of R3 which are set due to this faulty byte.
If all bits of R3 are zero, the fault is located. Otherwise, go back to the beginning of
step 4.
Step 5. If bit faults were located, correct them. Otherwise, halt the program and
raise a machine-check exception (DUE).
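To make the structure of step 1 concrete, the following Python sketch (ours, not part of the thesis hardware) derives the faulty set of each faulty R3 byte from the rotation classes of the faulty words. It assumes the byte rotation of Figure 2.7, in which a word of class c contributes its byte ((b + c) mod 8) to byte b of R3; the R3 bit pattern used below is only illustrative.

```python
# Minimal sketch of Step 1 of the fault locator (illustrative only).
# Assumption: a dirty word of rotation class c has its byte ((b + c) mod 8)
# XORed into byte b of correction register R3, as in Figure 2.7.

def faulty_r3_bytes(r3):
    """Return the indices of R3 bytes that contain at least one set bit."""
    return [b for b in range(8) if (r3 >> (8 * b)) & 0xFF]

def faulty_set(b, faulty_word_classes):
    """For R3 byte b, list the (word, byte) pairs that map into it."""
    return [(w, (b + c) % 8) for w, c in faulty_word_classes]

# Example 1: the four faulty words are Words 0-3 with rotation classes 0-3,
# and (for illustration) R3 bytes 0, 1, 5, 6 and 7 contain set bits.
classes = [(0, 0), (1, 1), (2, 2), (3, 3)]
r3 = sum(0xFF << (8 * b) for b in (0, 1, 5, 6, 7))
for b in faulty_r3_bytes(r3):
    print("R3 byte", b, "->", faulty_set(b, classes))
# R3 byte 0 -> [(0, 0), (1, 1), (2, 2), (3, 3)], matching the text.
```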
Example 1 illustrated case 1 in step 4. Example 2 illustrates case 2 in step 4.
Example 2. Consider the case that faulty sets of bytes 1 and 5 are removed from
Figures 2.7 and 2.8 so that bits 0-7 and 48-63 of R3 are set to 1 and all other bits of
R3 are reset. Furthermore, assume that the following parity bits point to bit faults: bits
P5-P7 of Word 0, bits P0-P7 of Word 1 and Word 2, and bits P0-P4 of Word 3. The fault
locator algorithm takes the following steps.
Step 1. Bytes 0, 6 and 7 of R3 are faulty. The faulty set of byte 0 of R3 includes
bytes 0, 1, 2, and 3. The faulty set of byte 6 of R3 includes bytes 6, 7, 0 and 1. The
faulty set of byte 7 of R3 contains bytes 7, 0, 1, and 2.
Step 2. Skip.
Step 3. All faulty sets contain either byte 0 or 1 and there is no other pair of adjacent
bytes with at least one byte in all faulty sets. Thus, the fault straddles the boundary of
bytes 0 and 1.
After removing all bytes except bytes 0 and 1 from all faulty sets, the faulty set of
byte 0 of R3 contains byte 0 of Word 0 and byte 1 of Word 1. The faulty set of byte 6
of R3 contains byte 0 of Word 2 and byte 1 of Word 3. The faulty set of byte 7 of R3
contains byte 0 of Word 1 and byte 1 of Word 2.
Step 4. All faulty sets contain bytes 0 and 1. However, Words 0 and 3 each have only one byte left in the faulty sets, so their faulty bits can be located by their parity bits. They are bits 5-7 of Word 0 and bits 8-12 of Word 3. The corresponding bits are reset in R3.
Now, the reduced faulty set of byte 0 of R3 contains only byte 1 of Word 1. The faulty set of byte 6 of R3 contains only byte 0 of Word 2. The faulty set of byte 7 of R3 contains byte 0 of Word 1 and byte 1 of Word 2.
Bit faults are then located in byte 1 of Word 1 and byte 0 of Word 2, as each is the only member of its R3 faulty set. By continuing this procedure, the remaining faulty bits, indicated by byte 7 of R3, are also located.
Step 5. The locator concludes that the spatial multi-bit fault occurred in bits 5-7 of Word 0, bits 8-12 of Word 3 and bits 5-12 of Words 1 and 2. Thus, the spatial multi-bit fault is located and corrected.
An important question is whether the fault locator can find all errors in the cor-
rectable range. More specifically, can several separate multi-bit faults have the same
faulty parities, the same classes of faulty words and the same R3 pattern?
2.4.6 Spatial multi-bit error coverage
With correction registers R1 and R2, all spatial multi-bit faults are locatable by the fault
locator, except for some special cases.
For instance, consider the case of an 8×8 fault, where all faulty words contain eight
faulty bits. In this case, all bytes of R3 are faulty and the locator algorithm fails because
it is not possible to find one unique byte with no adjacent byte in all faulty sets (step 2)
or one unique pair of adjacent bytes intersecting with all faulty sets (step 3).
Another difficult situation is when a spatial multi-bit fault only affects rows sepa-
rated by four rows from each other. For example, a fault which flips bits in byte 0 of a class 0 word and byte 0 of a class 4 word, and a fault which flips bits in byte 4 of a class 0 word and byte 4 of a class 4 word, cannot be located. This is because the content of
fault has occurred in byte 0 or in byte 4 of the two faulty words. The locator algorithm
fails steps 2 and 3 and these errors remain DUEs.
With one register pair, all 4×8 faults are locatable. However, the fault locator locates only 99.997% of 5×8 faults, as our software simulation shows. If the number of faulty rows increases, the coverage decreases further.
To increase the fault locator coverage, we can add another pair of correction registers
so that the first four rotation classes (classes 0-3) are protected by one pair and the other
four rotation classes (classes 4-7) are protected by the other pair. By using two register
pairs, an 8×8 error is converted to two separate 4×8 errors protected by different
register pairs and they are correctable.
2.4.7 Incorrect correction of temporal multi-bit errors
The byte-shifting scheme of CPPC assumes that if there are several faults in adjacent rows within the expected correctable range, they are part of a spatial multi-bit fault. However, it is possible, although extremely unlikely, to have several temporal single-bit faults in adjacent rows. If there are several temporal faults in adjacent rows, they
might be incorrectly detected as a spatial multi-bit fault which causes the fault locator
to produce an incorrect output. When the fault locator produces a wrong result, some
correct bits are erroneously flipped. For example, in Figure 2.6, if we have two temporal
bit faults in bit 56 of the class 0 word and in bit 8 of the class 1 word, the fault locator
decides incorrectly that bits 0 of both words are faulty. Instead of a two-bit error, we
end up with a 4-bit error! Furthermore, to make matters worse, a 2-bit DUE is converted
into a 4-bit SDC (Silent Data Corruption). In this case, bits 0 and 56 of the class 0 word
and bits 0 and 8 of the class 1 word are faulty after correction.
For this scenario to happen, after the first fault occurs, the second fault must hit one of only seven specific bits out of all the bits in the cache, and it must do so in a very short period of time, i.e., before the first fault is corrected. For instance, a fault in bit 56 of a class 0 word must be
followed by a second fault in bit 0 of a class 1 word or bit 8 of a class 2 word or bit 16
of a class 3 word or bit 24 of a class 4 word or bit 32 of a class 5 word or bit 40 of a
class 6 word or bit 48 of a class 7 word, in order for the locator to confuse the temporal
MBE for a spatial MBE.
This problem can be mitigated by increasing the number of correction register pairs.
With two pairs of correction registers in which each pair is responsible for four separate
rows, the probability of correcting a temporal multi-bit fault as a spatial multi-bit fault
is decreased by half (if the first fault is in a class 0 word, the second must now be in
classes 1-3). Therefore, after a single-bit fault, the second one must occur in one of
three bits out of all the bits of the cache in a short period of time. With four pairs of
registers so that each pair protects two classes of words, this problem is mitigated further
so that after the first fault, the second fault must occur in one specific bit. Finally with
eight pairs of registers, this problem is completely eliminated. In this case, byte-shifting
is also eliminated, and all multi-bit faults can be corrected. This solution is explained
further in Section 2.4.10.
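As a rough back-of-the-envelope illustration (ours, not from the thesis), the number of cache bit positions that can alias a first temporal fault into a phantom spatial MBE shrinks with the number of correction register pairs, following the counts given above (7, 3, 1 and 0 for 1, 2, 4 and 8 pairs):

```python
# Illustrative sketch: critical (aliasing) bit positions per first fault,
# as a function of the number of correction-register pairs.
def aliasing_positions(register_pairs, rotation_classes=8):
    # each pair protects rotation_classes / register_pairs rotation classes;
    # the first fault can only be confused with a fault in one of the other
    # classes protected by the same pair
    return rotation_classes // register_pairs - 1

for pairs in (1, 2, 4, 8):
    print(pairs, "pair(s):", aliasing_positions(pairs), "critical bit positions")
# prints 7, 3, 1 and 0, matching the discussion above
```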
2.4.8 Barrel shifter implementation
Our basic byte-shifting technique uses two barrel shifters to rotate the bytes before
XORing words into correction registers R1 and R2, in order to provide spatial multi-
bit error correction capability. In general, a barrel shifter is made of $\log_2 n$ levels of multiplexers to rotate data, where $n$ is the number of input bits. Fortunately, the barrel shifters in CPPC rotate by multiples of bytes only. Therefore, the barrel shifters of a CPPC only need $(n/8)\log_2(n/8)$ multiplexers and $\log_2(n/8)$ stages, which is significantly less than the $n\log_2 n$ multiplexers and $\log_2 n$ stages of barrel shifters that must rotate by any number of bits.
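As a quick worked example (ours), for a 64-bit word the byte-granularity shifter needs 8·log2(8) = 24 multiplexers in 3 stages, versus 64·log2(64) = 384 multiplexers in 6 stages for a bit-granularity shifter:

```python
# Illustrative comparison of barrel shifter costs (multiplexers, stages).
from math import log2

def byte_shifter_cost(n):
    k = n // 8                           # rotate in byte steps only
    return k * int(log2(k)), int(log2(k))

def bit_shifter_cost(n):
    return n * int(log2(n)), int(log2(n))

print(byte_shifter_cost(64))   # (24, 3)
print(bit_shifter_cost(64))    # (384, 6)
```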
The time and energy required to rotate a 32-bit word in a barrel shifter was computed
in [10] for a 90nm technology. The delay and energy consumed by the rotation of 32
bits are reported to be less than 0.4 ns and about 1.5 pJ, respectively. We used CACTI 5.3
to estimate the access latency of an 8KB direct-mapped cache. CACTI estimates 0.78ns
for the cache access time, which is much longer than the delay of the barrel shifter.
Thus, the barrel shifter operation is not on the critical path.
2.4.9 Byte-shifting advantages
The main advantage of the byte-shifting technique is that it is area efficient: spatial multi-bit error correction is provided by adding only two byte-wise barrel shifters, which are made of simple multiplexers.
As in Section 2.2.3, where we traded off area against reliability to cover temporal single-bit faults, here the trade-off is between area and reliability to cover spatial multi-bit faults. The two correction registers take minimum area and are able to correct all spatial multi-bit errors except for some special cases. To increase the reliability against both spatial and temporal faults, more correction register pairs can be added.
2.4.10 Spatial multi-bit error correction with more pairs of correc-
tion registers instead of byte-shifting
Another method to correct spatial multi-bit errors in CPPC is to add more pairs of correction registers such that the protection of adjacent rows is interleaved among different pairs of registers. For example, to correct all 8×8 errors in the cache of Figure 2.6, eight pairs of correction registers can be used such that rows at a distance of eight from each other are protected by the same correction register pair, while adjacent rows are protected by different pairs. In this case, the dirty words in every class are protected by a different register pair, and there is no need to rotate bytes, so barrel shifters are unnecessary.
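A minimal sketch (ours) of this interleaving: with eight register pairs, the rotation class of a dirty word directly selects its pair, so adjacent rows never share a pair and no byte rotation is required.

```python
# Illustrative mapping of cache rows to correction-register pairs.
def register_pair(row_index, num_pairs=8):
    return row_index % num_pairs

# adjacent rows 0..7 use pairs 0..7; row 8 reuses pair 0
print([register_pair(r) for r in range(10)])   # [0, 1, 2, ..., 7, 0, 1]
```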
2.5 Evaluations
In this section, we compare four caches through detailed simulations. In our simulations,
the block size is the same in L1 and L2.
CPPC: As an L1 cache, CPPC has two correction registers, R1 and R2, and eight interleaved parity bits per word, and it relies on the byte-shifting technique. As an L2
cache, CPPC has eight interleaved parity bits per block and three correction registers of
the size of an L1 block and also relies on the byte-shifting technique.
Table 2.1: Evaluation parameters
Parameter value
Functional Units 4 integer ALUs, 1 integer multiplier/divider, 4 FP ALUs, 1 FP
multiplier/divider
LSQ Size / RUU Size 16 Instructions / 64 Instructions
Issue Width 4 instructions / cycle
Frequency 3 GHz
L1 data cache 32KB, 2-way, 32 byte lines, 2 cycles latency
L2 cache 1MB unified, 4-way, 32 byte lines, 8 cycles latency
L1 instruction cache 16KB, 1-way, 32 byte lines, 1 cycle latency
Feature Size 32nm
One-dimensional parity cache: As an L1 cache, this cache is protected by eight
parity bits per word and does not correct dirty data. As an L2 cache, each block is
protected by eight parity bits.
SECDED cache: As an L1 cache, a word-level SECDED code is combined with
8-way data bit interleaving. As an L2 cache, a SECDED code is attached to a block
instead of a word and bits are interleaved as well.
Two-dimensional parity cache: This cache is protected by 8-way horizontal inter-
leaved parity bits per word and 8-way horizontal interleaved parity bits per block for L1
and L2 caches (respectively), and it has one row of vertical parity bits to correct errors.
These configurations have been chosen in such a way that they have similar area and
spatial multi-bit error correction capabilities, except for the two-dimensional parity pro-
tected cache. In our simulations of the two-dimensional parity cache only one vertical
parity row is implemented for the entire cache so that it has similar hardware require-
ments as the CPPC configuration. With a single vertical parity row, the two-dimensional
parity scheme loses its correction capability against multi-bit errors since it needs eight
vertical parity rows to correct errors contained in an 8×8 square. Given this, for a fair comparison, we do not compare the reliability of the two approaches. However, the reliability of both approaches is practically the same because they can have a large
protection domain of the same size by configuring the number of correction registers
in CPPC and vertical parity rows in the two-dimensional parity cache. Thus, we only
compare their performance and energy consumption. We compare the four caches under
various criteria using analytical models, SimpleScalar [4] and CACTI [36] simulations.
We execute 100-million-instruction SimPoints [27] obtained from the SPEC2000 benchmarks compiled for the Alpha ISA. Table 2.1 lists the parameters of the simulations.
2.5.1 Performance
In our simulations, we choose the same access latency of two cycles for both one-
dimensional parity and SECDED caches. That is, we assume that the decoding latency of SECDED is not on the critical path, as if data could be read without waiting for the protection code to be checked.
Figure 2.9 compares the CPI of the processor with L1 CPPC and two-dimensional
parity L1 cache normalized to that of the basic one-dimensional parity L1 cache. Port
contention and access latency of the caches are simulated. As is shown in Figure 2.9, the
performance overhead of L1 CPPC over the basic one-dimensional parity L1 cache is
0.3% on the average and at most 1% across all benchmarks. The performance overhead
of two-dimensional parity is on average 1.7% and is 6.9% in the worst case.
In L2 caches, the performance differences between all schemes are negligible and
are not shown. In L2 CPPC, additional L1 port contentions are caused by the transfer
of read-before-writes from L2 to L1, but the number of these accesses is negligible
compared to the total number of L1 cache accesses. Thus, the performance impact is
negligible for L2 CPPC as well. Figure 2.10 shows the percentage of stores to dirty data
in the L2 CPPC that need an access to the L1 cache. It shows that 52% of stores to data
with dirty copies in the L2 cache need a read-before-write in the L1 cache and the rest
Figure 2.9: CPI of CPPC and 2D parity in the L1 cache normalized to 1D parity
Figure 2.10: Percentage of write operations in dirty blocks of L2 CPPC that cause read-
before-writes
do not need any extra access. In addition, as Figure 2.11 shows, the number of read-
before-writes in L1 is on average 0.5% of the total number of L1 read hits. We compare
the number of read-before-writes to the number of read hits because they both access the
read port. Thus, the contention for the read port of L1 due to added read-before-writes
is not an issue and the performance overhead of L2 CPPC is negligible (on average it is
0.04% based on our evaluation).
2.5.2 Energy consumption
To compute the dynamic energy consumption of all caches, we count the number of
read hits, write hits, and read-before-writes in the various caches. The dynamic energy
consumption of each access is estimated by CACTI. The extra energy consumption of
L2 CPPC is the energy consumption of the read-before-writes in the L1 cache.
Figure 2.11: Number of read-before-writes in L1 supporting a L2 CPPC normalized to the
total number of L1 read hits
For interleaved SECDED, we multiply the energy consumption of the bitlines by eight and add it to the energy consumption of the cache, since interleaving increases the number of precharged bitlines by a factor of eight [12].
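The following Python sketch illustrates this bookkeeping under assumed per-access energies; the constants are placeholders standing in for CACTI estimates, not values from the thesis.

```python
# Illustrative dynamic-energy bookkeeping (placeholder energies, arbitrary units).
E_L1_ACCESS = 1.0       # hypothetical energy of one L1 access
E_L2_ACCESS = 6.7       # hypothetical: an L1 access costs ~15% of an L2 access
E_L1_BITLINES = 0.3     # hypothetical bitline share of one L1 access

def l1_cppc_energy(read_hits, write_hits, read_before_writes):
    # CPPC adds a read-before-write for every store to a dirty word
    return (read_hits + write_hits + read_before_writes) * E_L1_ACCESS

def l1_secded_interleaved_energy(read_hits, write_hits):
    # per the text, the bitline energy is multiplied by eight and added
    return (read_hits + write_hits) * (E_L1_ACCESS + 8 * E_L1_BITLINES)

def l2_cppc_energy(l2_accesses, l1_read_before_writes):
    # the extra cost of L2 CPPC is the read-before-writes performed in L1
    return l2_accesses * E_L2_ACCESS + l1_read_before_writes * E_L1_ACCESS
```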
L1 CPPC needs a read-before-write on every store to a dirty word but the two-
dimensional parity cache needs a read-before-write on all stores and on all misses filling
clean cache lines. Replaced dirty blocks are read in all caches (write-back) regardless
of the protection scheme. Thus, write-backs are not counted as read-before-writes in
our evaluations. Figure 2.12 shows the dynamic energy consumed by the various L1
caches normalized to the one-dimensional parity cache. As is shown in Figure 2.12,
the energy consumption of CPPC is on average 14% higher than the one-dimensional
parity cache. However, the energy consumed in other schemes is much higher. Two-
dimensional parity increases energy consumption by an average of 70%, and SECDED
with interleaved word bits increases energy consumption by an average of 42%.
We also compare CPPC, SECDED, two-dimensional parity and one-dimensional
parity in the context of an L2 cache. Figure 2.13 shows the energy consumed by the
different L2 caches normalized to the energy consumed by the one-dimensional par-
ity cache. L2 CPPC consumes only 0.6% more energy than the one-dimensional par-
ity cache. By comparison, the dynamic energy consumptions of SECDED and two-
dimensional parity L2 caches are 68% and 75% higher than the one-dimensional parity
Figure 2.12: Energy consumption of different schemes in the L1 cache normalized to
energy of parity
Figure 2.13: Energy consumption of different schemes in the L2 cache normalized to
energy of parity
L2 cache. The energy consumption of the two-dimensional parity cache is particularly
high for mcf because mcf experiences many cache misses. The L2 miss rate of mcf is
about 80% in our experiments.
L2 CPPC is more energy efficient than L1 CPPC because L2 CPPC experiences fewer writes to dirty blocks (around 7% of all accesses vs. 14% in L1 CPPC). Furthermore,
L2 CPPC obviates the need for read-before-writes for around half of writes to dirty data,
and read-before-writes are done in L1 where the energy of each access to the cache is
15% of the energy of each access to the L2 cache.
2.5.3 Reliability
In this section, we compare the reliability of the various cache options against temporal
multi-bit errors. All the compared caches except for the one-dimensional parity cache
Table 2.2: Parameters used in reliability evaluations
Cache L1 L2
Percentage of dirty data (%) 16 35
Average Tavg of all benchmarks (cycles) 1828 378997
Table 2.3: MTTF of the caches (temporal MBE faults)
Cache MTTF of L1 caches MTTF of L2 caches
One-dimensional parity 4490 years 64 years
CPPC 8.02 × 10^21 years 8.07 × 10^15 years
SECDED 6.2 × 10^23 years 1.1 × 10^19 years
practically tolerate the same amount of spatial multi-bit faults, thus we do not consider
spatial multi-bit errors in this section. Rather we focus on the impact of various protec-
tion schemes on temporal MBEs.
One of the properties of CPPC is that it enlarges the protection domain. A CPPC
with eight parity bits and one set of correction registers in effect has eight protection
domains whose size is 1/8th of the entire set of dirty data because the dirty data protected
by each parity bit forms a protection domain. By contrast, the protection domain of a
SECDED-protected cache is a word or a cache block. Our goal in this section is to show
that, since the SEU rate is extremely small, increasing the size of the protection domain
has very little effect on the reliability against temporal multi-bit errors if the scheme has
correction capability like CPPC.
We compute the average percentage of dirty data in L1 and L2 caches by simulating
benchmarks. Table 2.2 shows the average percentage of dirty data across all 15 bench-
marks used in Section 2.5. In addition to the percentage of dirty data, our reliability model needs
the average time between two consecutive accesses to a dirty word in L1 or to a dirty
block in L2. We denote this average time between two accesses to dirty data by Tavg.
Table 2.2 gives Tavg for both L1 and L2 caches across the 15 benchmarks.
The MTTF of the one-dimensional parity cache is calculated as the expected time
until a fault in the dirty data of the cache occurs, divided by the Architectural Vulnerability Factor (AVF). AVF is the probability that a fault will affect the result of the program and it is computed as explained in [5]. We use the average percentage of dirty
data across all benchmarks (shown in Table 2.2) to compute the MTTF. The fault rate in the dirty data of the one-dimensional parity cache is computed by scaling the ITRS FIT rate (given in FIT/Mbit) to the size of the dirty data in the cache; the expected time until the first fault in dirty data then follows directly from this fault rate.
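A small sketch (ours) of this calculation; the AVF value below is a placeholder for illustration, not a number from the thesis.

```python
# Illustrative MTTF estimate for the one-dimensional parity cache.
def parity_mttf_years(fit_per_bit, cache_bits, dirty_fraction, avf):
    faults_per_hour = fit_per_bit * cache_bits * dirty_fraction / 1e9
    mean_hours_to_first_fault = 1.0 / faults_per_hour
    return mean_hours_to_first_fault / avf / (24 * 365)

# e.g. 0.001 FIT/bit, a 32KB L1 cache, 16% dirty data (Table 2.2), placeholder AVF = 0.5
print(parity_mttf_years(0.001, 32 * 1024 * 8, 0.16, 0.5))
```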
Reliability estimation of CPPC is very challenging because the vulnerability of a bit
depends not only on the interval between two accesses to it, but also on accesses to other
dirty words. Hence, it requires a new approach to estimate reliability. Moreover, CPPC
has error correction capability which must be factored into the reliability evaluations.
To measure the reliability of CPPC, we have developed a new model as follows. In
CPPC, a protection domain can fail due to either two faults in the eight bits protected
by the same parity or two faults in two bits of different dirty words, which are in the
same protection domain (they are protected by similar parities in different words). The
second case has a much higher probability, as it means that two faults should happen anywhere in 1/8th of the entire dirty data, while the first case requires two faults in two bits out of the same 8 bits of a single word. To compute the MTTF of CPPC, we
measure the expected time that two faults happen in 1/8th of the entire dirty data. The
two faults should happen in a short interval, i.e. the second fault should occur before the
first fault is recovered. Thus, they should occur in the interval between two accesses to
a word, before accessing the domain which is hit by the first fault. We use Tavg as the
length of the interval between two accesses to a word. Thus, we need to calculate the
probability of having two faults during Tavg in 1/8th of dirty data of the cache which we
call $P_{domain}$. The probability of the cache failure during Tavg is denoted $P_{cache}$ and is equal to $1-(1-P_{domain})^{8}$, which refers to a failure in at least one of the 8 domains. The MTTF is equal to the expected number ($1/P_{cache}$) of Tavg intervals that must be repeated to have two consecutive faults.
$P_{domain}$ is computed by a binomial equation as done in [31]. The binomial equation computes the probability that faults occur in two cycles out of Tavg. In this binomial equation, the number of experiments is Tavg while the number of successes is two. The probability of having a fault in one cycle in the large protection domain (the probability of success in the binomial equation) can simply be computed by downscaling the ITRS FIT rate from failures in a 1 Mbit array in 1 billion hours to the domain size in one cycle.
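A minimal sketch (ours) of this model; the clock frequency and AVF below are assumptions for illustration, not parameters taken from the thesis.

```python
# Illustrative CPPC MTTF model based on the binomial described above.
from math import comb

def per_cycle_fault_prob(fit_per_bit, domain_bits, freq_hz):
    # downscale the FIT rate (failures per 10^9 hours) to one clock cycle
    return fit_per_bit * domain_bits / 1e9 / (3600.0 * freq_hz)

def p_domain(tavg_cycles, p_cycle):
    # probability of faults in exactly two cycles out of Tavg (binomial)
    return comb(tavg_cycles, 2) * p_cycle**2 * (1 - p_cycle)**(tavg_cycles - 2)

def cppc_mttf_years(tavg_cycles, p_cycle, freq_hz, avf, num_domains=8):
    p_cache = 1 - (1 - p_domain(tavg_cycles, p_cycle))**num_domains
    expected_intervals = 1.0 / p_cache             # Tavg intervals until failure
    seconds = expected_intervals * tavg_cycles / freq_hz
    return seconds / avf / (3600 * 24 * 365)

# e.g. L1: Tavg = 1828 cycles (Table 2.2), one domain = 1/8th of the 16% dirty
# data of a 32KB cache, 3 GHz clock, placeholder AVF of 0.5
domain_bits = 32 * 1024 * 8 * 0.16 / 8
p = per_cycle_fault_prob(0.001, domain_bits, 3e9)
print(cppc_mttf_years(1828, p, 3e9, 0.5))
```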
To be consistent, we use the same model for SECDED. For the SECDED-protected
cache, the MTTF is the time before the occurrence of two faults in one dirty word in the
L1 cache or in one dirty block in the L2 cache.
Based on this model, Table 2.3 shows the MTTF of different cache options for tem-
poral MBEs. The SEU rate assumed in Table 2.3 is 0.001 FIT/bit, where one FIT is equal to one failure per one billion hours (here, a failure means a bit flip).
The reliability of the one-dimensional parity cache drops dramatically as the size
of the cache increases, and thus parity without correction is not acceptable in large
caches. As an L1 cache, CPPC improves the MTTF dramatically compared to the one-dimensional parity cache and provides a very high level of resiliency to temporal multi-bit faults. As Table 2.3 shows, CPPC is highly reliable as an L2 cache as well.
Although the reliability of SECDED is better than CPPC, the reliability of CPPC is
already so high that it can safely be deployed in critical systems.
This section demonstrates that error correction can be implemented more efficiently
by enlarging the protection domain (as CPPC does) because reliability against temporal
MBEs does not suffer much.
2.6 Conclusions
CPPC is a new reliable write-back cache which can gradually increase the granularity
of error correction from words or blocks to multiple blocks and up to the entire cache
in order to save vital resources such as area and energy. In CPPC, the granularity of
protection can be gradually adapted to a target reliability level. Moreover, CPPC seam-
lessly decouples the protection of clean data and dirty data and assigns fewer resources
to clean data in the cache.
To tolerate spatial multi-bit faults, CPPC does not require physical bit interleaving
which greatly increases cache energy consumption. The byte-shifting scheme proposed
in this thesis is added to basic CPPC to tolerate spatial multi-bit errors. Instead of
the byte-shifting technique, CPPCs can also be built with more correction registers to
correct spatial multi-bit errors with negligible overheads.
CPPC is efficient in L1 caches but much more so in lower-level caches. Based on
our simulation results, CPPC adds only 0.6% energy overhead and practically no other
overhead to an L2 parity-protected cache while it provides both single-bit and spatial
multi-bit error correction capabilities. Thus, with simple parity protection, CPPC can
raise the reliability of caches to a level similar to ECC at virtually no cost. This is the
main contribution and conclusion of this chapter.
Chapter 3
Chip Independent Error Correction
(CIEC) in Main Memory
This chapter presents some ideas to provide low-cost error protection in main memory. This low-cost approach is called CIEC; it is a combination of CPPC and another scheme called Virtualized ECC [38]. CIEC's overheads are independent of the width of the chips on a DIMM, so it can be very beneficial in DIMMs with wide chips.
At the beginning of this chapter, we explain related work, and then we describe how CIEC deals with soft errors.
3.1 Background and related works
In this section, we first review some basic concepts of main memories, starting with
relevant DIMM architecture concepts. In a DIMM, a set of chips accessed together is
called a rank as illustrated in Figure 3.1. The number of chips in a rank has a large
impact on the energy consumption of a DIMM. With a small number of chips, energy
consumption is lower as fewer chips are activated in each memory access. The number
of chips in a rank is equal to the bus width divided by chip width. For example, if the
bus width is 64 bits and the chip width is 8, there should be 8 chips per rank.
Because the Last Level Cache (LLC) block size is more than the bus width, memo-
ries access data in bursts. Different types of DIMMs have their own burst length: eight
for DDR3 and four for DDR2. In this chapter, we refer to the amount of data accessed
Figure 3.1: Structure of a memory rank
per read or write to memory (in bytes) as the Minimum Access Length (MAL). MAL is
equal to the bus width multiplied by the burst length of a DIMM. For example, in DDR3
with a 64-bit bus, MAL is 64 bytes.
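A small sketch (ours) of this arithmetic, using the example values from the text:

```python
# Illustrative rank and MAL arithmetic.
def chips_per_rank(bus_width_bits, chip_width_bits):
    return bus_width_bits // chip_width_bits

def mal_bytes(bus_width_bits, burst_length):
    # Minimum Access Length = bus width (in bytes) x burst length
    return (bus_width_bits // 8) * burst_length

print(chips_per_rank(64, 8))   # 8 chips per rank for a 64-bit bus with x8 chips
print(mal_bytes(64, 8))        # 64 bytes for DDR3 (burst length 8)
```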
The rest of this subsection focuses on schemes which have been proposed to improve
main memory reliability. Single Error Correction Double Error Detection (SECDED)
code is frequently used in DIMMs [33, 38] and greatly improves reliability. However, it
has high area and energy overheads. Parity has also been used in DIMMs; parity detects
errors but cannot correct them. Hence, the reliability of parity is much less than the
reliability of ECC.
Highly reliable systems adopt chip-kill error correction. Chip-kill DIMMs can tol-
erate the failure of an entire chip but their energy consumption is high because chip-kill
imposes strict constraints on the DIMM architecture [33, 38]. Commercial chip-kill
memories are mostly based on x4 or x8 devices rather than wider devices. Thus, their energy consumption is high compared to DIMMs with wide devices.
LOT-ECC [33] provides chip-kill capability with x8 chips and thus saves energy as
compared to commercial x4 chip-kill memories. LOT-ECC is a three-tiered protection
scheme. The first tier detects faults using parity bits. The next two tiers correct faults
and provide chip-kill reliability.
Virtualized ECC [38] is a two-tiered scheme that decouples error detection and cor-
rection. The error detection code is stored in separate chips and is accessed in parallel
with data. Error correction codes are stored in the memory data space and used only
when a fault is detected. Virtualized ECC provides chip-kill reliability with x8 memory
chips. In the same paper [38], the authors also propose to add error correction to non-
ECC DIMMs. In this scheme, ECCs are stored in the memory data space (there is no
separate detection code) and are cached in the LLC.
To reduce the energy consumption of chip-kill memories, [34] proposes a structure
similar to RAID-5 (Redundant Array of Inexpensive Disks) in which error detection
codes are stored in the same chip as data while error correction codes are kept in a
separate chip. This scheme is only applicable to special memory structures called Single
Subarray Access (SSA) also proposed in [34].
In order to provide a low-cost error protection scheme in main memory, we apply CPPC to main memory and call the result Chip Independent Error Correction (CIEC), because its cost is independent of the chip width of the DIMM. Since the overhead of a parity DIMM is high (usually 1 parity chip for 8 data chips), we store parities in the memory data space as is done in [38]. In this way, CIEC can be applied to any kind of DIMM. The rest of this chapter describes CIEC.
3.2 CIEC Soft error detection
The first line of defense against soft errors in CIEC is detection. The second line of
defense is error correction, which will be covered in Section 3.3.
3.2.1 Assignment of parity bits
CIEC detects faults by using parity bits. CIEC can be applied to non-ECC DIMMs and
does not require ECC DIMMs. In CIEC, parity bits are stored in the DIMM data space,
in page frames reserved for this purpose. The Memory Management Unit (MMU) is
aware of these reserved page frames and does not allocate data to them.
Since parity bits are not kept in separate chips (non-ECC DIMMs), they may not
be accessible in parallel with data. Hence, it is beneficial in terms of performance and
energy efficiency to limit the number of accesses to parity bits in main memory. One
way to achieve this goal is to cache parity bits on the processor chip.
To be effective, parity caching must have two characteristics. First, the parities of a
large number of pages must be cached. To reach this goal, the number of parity bits per
page should be small. Second, the parity cache should hold the parity bits associated
with the most likely data accessed next. Since the locality of accesses to one page is
high, CIEC brings the parity bits of an entire page to the parity cache when the page
is first accessed (a kind of parity prefetching). In order to read all parity bits of one
page with no more than one extra access to memory, we propose to allocate a number
of parity bits per page equal to the MAL of the DIMM. For example, if the bus width is
64 bits and the burst length is 8, MAL is 64 bytes and the amount of parity bits per page
is 64 bytes. If for any reason the size of parities of a page has to be bigger than MAL,
only a portion of parities is read and cached with one access (parities of the rest of the
page are not prefetched). In this chapter, we assume that the size of parities of a page is
equal to MAL.
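A minimal sketch (ours) of the resulting sizing for the example discussed next (4KB pages and a 64-byte MAL); the variable names are ours.

```python
# Illustrative parity sizing: parity per page is capped at one MAL so that a
# whole page's parity can be fetched with a single extra memory access.
PAGE_BYTES = 4096      # 4KB page, as in the example below
MAL_BYTES = 64         # DDR3 with a 64-bit bus

parity_bytes_per_page = MAL_BYTES
data_bits_per_parity_bit = (PAGE_BYTES * 8) // (parity_bytes_per_page * 8)
print(data_bits_per_parity_bit)   # 64: one parity bit covers 64 bits of data
```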
Based on the size of a page and its parities, the parity bits are associated with mem-
ory data. For instance, if the page size is 4KB (6464 bytes) and MAL is 64 bytes, each
page has 64 bytes of parity and one parity bit is assigned to 64-bit data. The location of
the parities of each page in the DIMM is known by the memory controller and parities
of 64 contiguous pages are stored in one parity page frame.
The parity cache may have several hundred entries and each entry of the parity cache
keeps the parity bits of an entire page. The fields in each entry of the parity cache are
the valid and dirty bits, the Physical Page Frame Number (PPFN) and the parity bits of
the PPFN, as illustrated in Figure 3.2.
Figure 3.2: Fields of each parity cache entry
3.2.2 Accesses to parity bits
On a page fault, a new page is transferred from disk to main memory, and its parities are
computed and are written both in main memory and in the parity cache.
When there is a write into main memory, parity bits are written only to the parity
cache and the dirty bit of the parity cache entry is set. The dirty entries of the parity
cache will be written back to memory when they are replaced.
For memory loads, the memory controller checks the parity cache in parallel with
main memory or even in advance, as soon as the load enters the memory controller
transaction queue. Since the delay of an access to the parity cache is much shorter than
to main memory, a hit in the parity cache is determined early. If the parity cache access
hits, the parities are read from the cache and there is no need for any extra access to
main memory. Otherwise, the memory controller reads the parity bits of the page from
main memory and the parity bits are installed in the parity cache.
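The following Python sketch mimics the parity-cache behavior described above. It is only an illustration under simplifying assumptions: the structure is fully associative with FIFO replacement, a whole page's parity bits are updated at once, and the class, method and callback names are ours rather than part of CIEC.
    from collections import OrderedDict

    class ParityCache:
        # Each entry maps a PPFN to (dirty bit, parity bits of the page); validity is implied by presence.
        def __init__(self, num_entries, read_parities_from_memory, write_parities_to_memory):
            self.entries = OrderedDict()
            self.num_entries = num_entries
            self.read_mem = read_parities_from_memory    # used on a parity-cache miss
            self.write_mem = write_parities_to_memory    # used when a dirty entry is evicted

        def _install(self, ppfn, parity_bits, dirty):
            if len(self.entries) >= self.num_entries:    # evict the oldest entry (FIFO stand-in)
                victim_ppfn, (victim_dirty, victim_bits) = self.entries.popitem(last=False)
                if victim_dirty:
                    self.write_mem(victim_ppfn, victim_bits)
            self.entries[ppfn] = (dirty, parity_bits)

        def on_load(self, ppfn):
            # Memory load: return the page's parities, fetching them from memory only on a miss.
            if ppfn not in self.entries:
                self._install(ppfn, self.read_mem(ppfn), dirty=False)
            return self.entries[ppfn][1]

        def on_store(self, ppfn, new_parity_bits):
            # Memory store: parities are written only to the parity cache and the entry is marked dirty.
            if ppfn not in self.entries:
                self._install(ppfn, self.read_mem(ppfn), dirty=False)
            self.entries[ppfn] = (True, new_parity_bits)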
In order to access the parities of a page in parallel with data (when parities are not
in the parity cache), it is best if the parities are kept in another memory channel, which
can be accessed in parallel with data. If there is only one memory channel, the parities
of each page can be located in a different rank than the page, in order to access parities
and data simultaneously. For example, if there are two ranks, the parities of data in each
rank are stored in the other rank. Figure 3.3 shows how parities are stored in this case.
Figure 3.3: Location of parities in a DIMM with two ranks
If there is only one rank and several banks, the parities of each bank can be stored in
another bank to increase the parallelism between accesses to data and its parities.
3.2.3 Multi-bit fault coverage
Multi-bit faults can be either spatial or temporal. Temporal multi-bit faults are rare [11]
because the probability is high that the first fault is recovered before the occurrence of
the second one.
Spatial multi-bit faults may happen in main memories and can be detected by pari-
ties. When 64-bit data comes on the bus, its parity is checked. Since each pin of a chip
is usually connected to a separate memory array [34], a spatial multi-bit fault occurring
in one memory array flips only one bit of the 64-bit data. Hence, the fault is detectable.
3.3 Soft error correction
In this thesis, a chunk of memory data of the size of an LLC block in a page is called
a sub-page and it is assumed that there is a dirty bit per sub-page in the page table in
addition to the dirty bit of the overall page.
If a fault is detected in a clean page, the correct page is re-loaded from disk. In order
to recover from errors in dirty sub-pages, CIEC keeps the XOR of all dirty sub-pages
in one or more correction registers. Correction registers are located close to the LLC
on the processor chip. The operation of the correction registers in CIEC is the same as
that of the correction registers in L2 CPPC.
The two flowcharts in Figure 3.4 summarize the techniques used to avoid read-before-
write accesses to DRAM in CIEC, as was done in L2 CPPC. When an LLC store
miss hits in a sub-page that is dirty in main memory, the fetched sub-page is XORed into
the correction register in parallel with the transfer of data to the cache hierarchy.
Therefore, the dirty sub-page that will be overwritten later is XORed with the correction
register in advance and without any extra access to main memory (Figure 3.4(a)).
Whenever a write hits on a clean block in the LLC whose memory copy is dirty, the LLC
block is read and XORed into the correction register before executing the write operation
(Figure 3.4(b)).
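A minimal sketch of these two rules is given below, assuming a single correction register and representing a sub-page as an integer so that the integer XOR stands in for the byte-wise XOR of a real implementation; the function names and the dirty-bit arguments are ours.
    correction_register = 0   # XOR of all dirty memory sub-pages (a single register assumed for simplicity)

    def on_llc_store_miss(fetched_subpage, subpage_dirty_in_memory):
        # Figure 3.4(a): the fetched sub-page is dirty in memory and will later be overwritten,
        # so it is XORed into the correction register while it is transferred into the cache hierarchy.
        global correction_register
        if subpage_dirty_in_memory:
            correction_register ^= fetched_subpage

    def on_llc_write_hit(llc_block, block_dirty_in_llc, memory_copy_dirty):
        # Figure 3.4(b): the LLC block is clean but its memory copy is dirty, so the block is read
        # and XORed into the correction register before the store overwrites it.
        global correction_register
        if (not block_dirty_in_llc) and memory_copy_dirty:
            correction_register ^= llc_block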
To implement these two techniques efficiently, the dirty bits of sub-pages are kept in
a cache called the dirty-bit cache on the processor chip. The dirty-bit cache keeps the
dirty bits of sub-pages of pages whose translation is present in the TLB. For example,
if the TLB has 512 entries and each physical page has 64 sub-pages, the dirty-bit cache
has 512 entries of size 64 bits. Whenever a TLB entry is replaced or its dirty bit is
written-back to the page table, its counterpart in the dirty-bit cache is also replaced or
is written-back to the page table. CIEC keeps the dirty bits of sub-pages in a separate
cache, rather than in the TLB, because they are not accessed on every memory access like
TLB entries are. For example, when a load hits in the cache hierarchy, the dirty-bit cache is
not accessed. By having a separate dirty-bit cache, TLB access latency is not affected.
In order to know whether the memory copy of an LLC block is dirty, the LLC line
maintains both the dirty bit of its own copy and the dirty bit of the memory copy. On a
read miss in the LLC, the dirty bit of the sub-page is copied from the dirty-bit cache to
the LLC tag. Because the access is an LLC read miss (not a TLB miss), the translation
is already in the TLB and the dirty bits are in the dirty-bit cache. On a store miss in the
LLC, the dirty bit copy of the sub-page in LLC is simply reset.
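For illustration, a dirty-bit cache entry can be represented as a 64-bit vector indexed by the sub-page number within the page. The sketch below (names and sizes are ours, following the 512-entry TLB example above) shows the two operations the text relies on.
    dirty_bit_cache = {}   # physical page frame number -> 64-bit vector, one bit per sub-page

    def memory_copy_is_dirty(ppfn, subpage_index):
        # Used on an LLC read miss: the bit is copied into the LLC tag of the fetched block.
        # The translation is already in the TLB, so the vector is guaranteed to be present.
        return (dirty_bit_cache[ppfn] >> subpage_index) & 1 == 1

    def mark_subpage_dirty(ppfn, subpage_index):
        # Called when a dirty sub-page is written back to main memory.
        dirty_bit_cache[ppfn] = dirty_bit_cache.get(ppfn, 0) | (1 << subpage_index)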
To enhance energy efficiency further, read-before-write operations can be executed
in caches at a higher level than the LLC. The extra reads can even be executed in the L1
cache. To do this, the sub-pages must have the size of an L1 line and a dirty bit must
be kept for each chunk of data of the size of an L1 cache line in the page table and in
the dirty-bit cache. In addition, the correction register must have the same size as an L1
cache line. The dirty bit of a memory sub-page must also be transferred to the L1 cache
on an LLC read miss found in main memory. All other cache levels must also have a
dirty bit for every chunk of data of the size of an L1 line in cases where the L1 line size
is different from the line size in other caches. Implementation of read-before-writes in
different levels of the cache hierarchy is explained in more detail in Section 3.6.
3.4 CIEC error recovery
If a clean sub-page is faulty, the fault is recovered by reading the page from disk. To
recover from faults in a dirty sub-page, all dirty sub-pages must be accessed when only
one correction register is available. With many correction registers such as 2048 regis-
ters, the amount of data protected by one correction register is a small fraction of the
entire memory (1/2,048th) and a small fraction of main memory is scanned in every
error recovery.
Figure 3.4: Reading data overwritten in CIEC
For error recovery, there is an extra register of the size of each correction register,
called the recovery register. The number of recovery registers can be scaled up to recover
simultaneously from errors in protection domains of different correction registers.
When an error is detected in a dirty sub-page, the correction register associated with
the dirty sub-page is first copied into a recovery register. Then all the dirty memory
sub-pages in the same protection domain as the faulty sub-page are scanned one by one
and are XORed with the recovery register, as was explained for L2 CPPC.
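The recovery step can be sketched as follows, again with integer-valued sub-pages and helper names of our own choosing; the dictionary stands in for the scan of dirty sub-pages in the faulty sub-page's protection domain.
    def recover_faulty_subpage(correction_register, dirty_subpages, faulty_address):
        # dirty_subpages: address -> data of every dirty sub-page in the same protection
        # domain as the faulty sub-page (the faulty entry itself is skipped below).
        recovery_register = correction_register        # step 1: copy the correction register
        for address, data in dirty_subpages.items():   # step 2: scan the other dirty sub-pages
            if address != faulty_address:
                recovery_register ^= data              # remove their signatures by XOR
        return recovery_register                       # the remainder is the lost sub-page value

    # Small check: the correction register holds the XOR of three dirty sub-pages.
    subpages = {0x1000: 0xAA, 0x2000: 0x3C, 0x3000: 0x77}
    reg = 0
    for value in subpages.values():
        reg ^= value
    assert recover_faulty_subpage(reg, subpages, 0x2000) == 0x3C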
Since main memory is a critical part of a system, its availability is important. CIEC
maintains high memory availability and memory continues its regular operations during
recovery. During error recovery, all memory data not protected by the correction register
of the faulty sub-page can be accessed normally (this is 2,047/2,048 of main memory
with 2,048 correction registers). In memory pages protected by the same correction
register as the faulty page, we can read from all clean pages and all dirty pages normally
except for the faulty one.
To write into a sub-page which is in the protection domain of the faulty sub-page,
CIEC checks whether that sub-page is already scanned or not. If it is scanned, the write
completes normally. Otherwise, the new data is XORed with the recovery register in
addition to the correction register. When the error recovery process reaches that sub-
page, the sub-page is read and XORed with the recovery register in order to remove its
signature. Hence, writes to non-faulty data which is in the same protection domain as
the faulty data can be done normally. If there is an access (read or write) to the faulty
sub-page, the access and all of its dependents must be stalled until the correct value of
the sub-page is retrieved.
To estimate the delay of error recovery, assume that the memory size is 4GB and
memory is protected by 2,048 correction registers, so that each register protects 2MB
of memory. Assume further that 12.5% of this 2MB is dirty on average. Thus, each
correction register protects 256KB of dirty data. The number of dirty sub-pages (of
size 64B) is 4,096 (256KB/64=4,096). If memory access latency is 100ns (this is a
conservative number), the total delay of accessing 4,096 sub-pages is 0.4ms (4,096 ×
100ns = 409,600ns). The dirty bits of sub-pages must also be read. Since the size of the
memory chunk protected by the correction register is 2MB, the number of dirty bits is
32Kbits or 4KB. The time to read the dirty bits is therefore negligible. Consequently,
the error recovery of CIEC is expected to take about 0.4ms or less (many accesses can
be parallelized and the access latency can be much less than 100ns), which is less than
the latency of a disk access. Thus, the recovery time of CIEC is of the order of a page
fault or less. The processor waiting for the correct value of the faulty sub-page can
simply switch context as in the case of a page fault. Given the rarity of error recovery,
CIEC is not likely to impact memory availability and the effects of CIEC's complex error
recovery process on performance and energy consumption can simply be ignored.
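The recovery-time estimate above follows directly from the stated assumptions, as the short back-of-the-envelope calculation below simply restates.
    memory_size = 4 * 2**30            # 4GB of main memory
    num_correction_registers = 2048
    dirty_fraction = 0.125             # 12.5% of the protected data is dirty on average
    subpage_size = 64                  # bytes
    access_latency_ns = 100            # conservative memory access latency

    protected_per_register = memory_size // num_correction_registers              # 2MB
    dirty_subpages = int(protected_per_register * dirty_fraction) // subpage_size # 4,096
    scan_time_ms = dirty_subpages * access_latency_ns / 1e6                       # about 0.41ms
    print(protected_per_register, dirty_subpages, scan_time_ms)                   # 2097152 4096 0.4096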
3.5 Page swap-out and DMA accesses
Whenever a dirty memory page is replaced and is written back to disk, all of its sub-
page copies in higher-level caches are invalidated. During swap-out, if a dirty sub-page
does not have a dirty copy in the LLC, the dirty sub-page must be XORed with the
correction register. However, if it has a dirty copy in the LLC, the data of the sub-page
is already XORed into the correction register and the sub-page should not be XORed
into the correction register.
Main memory can be accessed by other devices besides the processor, such as
Direct Memory Access (DMA) devices. If an I/O DMA device writes into a dirty mem-
ory sub-page without a dirty copy in caches, CIEC must read the old dirty data from
main memory (not the cache hierarchy) and XOR that with the correction register. The
energy overhead of this operation is much higher than accesses to any cache. Fortu-
nately, the rate of DMA accesses as compared to CPU accesses is very small and most
I/O devices such as disks do not write into dirty pages. Tracing results including both
DMA and CPU accesses reported in [21] show that the number of CPU accesses to main
memory is at least 100 times more than the number of DMA accesses.
3.6 Implementing read-before-writes in various cache
levels
As we have seen, read-before-write operations in CIEC can be transferred from mem-
ory to any level of the cache hierarchy, including L1. If read-before-write operations are
done in the LLC, the energy overhead is higher than in L1. However, this energy over-
head is still small because these operations are triggered only for some memory writes
and the energy consumption of each LLC access is lower than a DRAM access.
When read-before-write operations are moved to higher levels of the cache hierar-
chy, the implementation of CIEC becomes more complex, especially in CMPs. In a
high-level cache (higher than the LLC) a block can become dirty more than once before
it is written back to main memory. This is because a block can be evicted from the cache
and again be read from the lower level cache (before it goes back to memory). In this
situation, CIEC needs to make sure that the read-before-write operation is only done
once in the cache hierarchy. To do this, an extra bit is added to each cache block indi-
cating whether the read-before-write was already done on the block and this bit should
move through the cache hierarchy. This extra bit is reset when a block is read from main
memory. When a dirty block that had a read-before-write (the extra bit is already set
due to the read-before-write) is written back to the lower level cache, the extra bit should
also be written to the lower level cache. When data is read from the lower level cache,
this extra bit is also transmitted.
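The sketch below illustrates how such an extra bit could travel with a block (a simplified, single-block view with field and function names of our own choosing; it is not a description of a particular cache implementation).
    class CacheBlock:
        def __init__(self, data):
            self.data = data
            self.dirty = False
            self.rbw_done = False    # extra bit: read-before-write already performed for this block

    def fill_from_memory(data):
        block = CacheBlock(data)
        block.rbw_done = False       # the extra bit is reset when a block is read from main memory
        return block

    def write_back_to_lower_level(upper_block, lower_block):
        # A dirty block carries the extra bit down so the read-before-write is not repeated.
        lower_block.data = upper_block.data
        lower_block.dirty = True
        lower_block.rbw_done = upper_block.rbw_done

    def read_from_lower_level(lower_block):
        # The extra bit also travels up when the block is read again by a higher-level cache.
        upper_block = CacheBlock(lower_block.data)
        upper_block.rbw_done = lower_block.rbw_done
        return upper_block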
If read-before-write operations are implemented in higher-level caches private to
cores such as L1 caches, the complexity increases even further due to coherency in
CMPs. In multicores, the coherence protocol must deal with read-before-writes exe-
cuted in private L1s. In addition, the extra bit mentioned in the above paragraph must
be exchanged among L1 caches along with modified data in order to avoid more than
one read-before-write operation in the cache hierarchy.
Chapter 4
PARMA+
To select an appropriate combination of error protection schemes in caches, we need
an accurate model to measure the cache reliability against soft errors in the presence of
different error protection schemes. To address this need, this chapter proposes PARMA+,
an accurate cache reliability model that measures the cache FIT rate in the presence
of any number of single-bit and spatial multi-bit fault patterns and almost all error
protection schemes used in commercial processors.
This chapter first provides background on reliability modeling, then explains the
PARMA+ model and evaluates its accuracy. In this thesis, PARMA+
is applied only to the data array of the cache and not to the tags. The equations of
PARMA+ can be applied to cache directories in a similar way.
4.1 Background
One critical premise of any reliability study is the definition of the failure model. A
cache failure happens when the error protection code of a protection domain cannot
correct errors in the domain at the time of an access. Error codes fail with different
numbers of bit faults, depending on their strength and on the state of the data, dirty or
clean. Table 4.1 shows the number of faulty bits which fail a protection domain protected
by common error codes. Clean data can tolerate more faults in general because a backup
copy is available at the next level and can be re-fetched.
Table 4.1: Failure condition for different protection codes
Protection code | Number of faulty bits causing failure in dirty blocks | Number of faulty bits causing failure in clean blocks
Parity | Any number of faulty bits | Even number of faulty bits
SECDED | More than 1 faulty bit | More than 2 faulty bits
DECTED | More than 2 faulty bits | More than 3 faulty bits
PARMA+ computes the FIT rate of an application. During our computations, we assume
that the hardware is working properly and we focus only on application failures. When
we compute the FIT rate of an application, we implicitly assume that the execution of
the application is repeated until it fails. Unlike the case of hardware failures, which
may be repaired, there is no repair for an application failure. The equations used
in this section follow this concept.
Transient faults can cause either Detected Unrecoverable Errors (DUEs) or Silent
Data Corruption (SDC) errors. When a DUE is detected, execution is terminated. SDC
errors are more subtle because they may or may not cause a failure when they propagate
outside the cache. Whether an SDC error propagated outside the cache will crash the
application or not depends on many factors such as the exact location of the fault and
the exact time at which the fault happened. Since PARMA+ in its current form only
tracks the propagation of faults in the cache and not in other system components, we
count an SDC as a terminating failure like a DUE when it is propagated outside the
cache. Therefore, in PARMA+, all errors (DUEs or SDCs) that are propagated outside
the target cache are considered terminal failures.
The average discrete failure rate in a time interval [t_1, t_2] is computed by well-known
equation (4.1) [9, 7]. We use the same notations as in these sources. In equation (4.1),
R(t) denotes the probability that the system has survived until time t. For the case of
an application execution, t_1 is the cycle when the program starts (cycle 0) and t_2 is the
cycle when it ends (t_2 is provided by the simulator).
Average failure rate = \frac{R(t_1) - R(t_2)}{R(t_1)(t_2 - t_1)}    (4.1)
R(0) equals 1 because the cache has no data at t = 0 and cannot be affected by soft errors.
Therefore, to compute equation (4.1), we need to calculate R(t_2). R(t_2) is the probability
that the cache does not fail during the execution of the entire program. Since reads
and write-backs are the only cache operations that can result in cache failure, R(t_2) is
the probability that the cache does not fail at any read or write-back between cycles 0
and t_2 (end of the program). R(t_2) is computed by equation (4.2). In this equation, P_j is
the probability of failure at access j and max is the total number of reads and write-backs
in the program execution.
R(t_2) = \prod_{j=1}^{max} (1 - P_j)    (4.2)
The average failure rate shown in equation (4.1) can be converted to equation (4.3),
which computes the average failure rate in one CPU cycle. T_exe is the number of cycles
of a program execution (t_2 - t_1).
Average failure rate = \frac{1 - \prod_{j=1}^{max} (1 - P_j)}{T_{exe}}    (4.3)
To compute the FIT rate, we scale the failure rate of equation (4.3) to one billion
hours as shown in equation (4.4).
FIT rate = \frac{(1 - \prod_{j=1}^{max} (1 - P_j)) × 3600 × 10^9}{T_{exe} × Cycle Period}    (4.4)
Cycle Period is the cycle time in seconds.
To compute P_j, we need to know the raw FIT rate and the distribution of fault patterns.
The raw FIT rate is the expected number of faults in one billion hours and is given
in the International Technology Roadmap for Semiconductors (ITRS) [1]. To compute
the SEU rate per bit in one cycle, the ITRS raw FIT rate is scaled down as shown in
equation (4.5). In equation (4.5), R(SEU) is the SEU rate per bit per cycle, FIT_ITRS is
the ITRS FIT rate for a 1Mbit SRAM array, and f is the processor frequency.
R(SEU) = \frac{FIT_{ITRS}}{10^6 × 3600 × f × 10^9}    (4.5)
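For concreteness, equations (4.2), (4.4) and (4.5) can be combined as in the Python sketch below; the per-access failure probabilities and the numeric inputs are placeholders of our own, since the real values come from the rest of the model and from ITRS.
    import math

    def fit_rate(p_j_list, t_exe_cycles, frequency_hz):
        # Equation (4.2): probability that no read or write-back fails during the whole run.
        r_t2 = math.prod(1.0 - p for p in p_j_list)
        cycle_period = 1.0 / frequency_hz
        # Equation (4.4): scale the per-execution failure probability to failures per 10^9 hours.
        return (1.0 - r_t2) * 3600.0 * 1e9 / (t_exe_cycles * cycle_period)

    def r_seu_per_bit_per_cycle(fit_itrs_per_mbit, frequency_hz):
        # Equation (4.5): raw ITRS FIT rate (per Mbit, per 10^9 hours) scaled to one bit and one cycle.
        return fit_itrs_per_mbit / (1e6 * 3600.0 * frequency_hz * 1e9)

    # Illustrative values only: a 2GHz clock, a 10^10-cycle run and tiny per-access probabilities.
    print(fit_rate([1e-15] * 1000, t_exe_cycles=1e10, frequency_hz=2e9))
    print(r_seu_per_bit_per_cycle(fit_itrs_per_mbit=1000.0, frequency_hz=2e9))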
Because a large number of faults are masked, the raw FIT rate grossly overestimates
the cache error rate. For example, if a bit fault happens in a cache block and then the
block is overwritten, the fault is masked and cannot cause a failure. Moreover, errors
can be erased by error correction. The contribution of PARMA+ is to estimate these
masking effects on the FIT rate for a given program and a given protection scheme.
The distribution of fault patterns can be obtained by beam injection experiments [17,
32]. In these experiments, the occurrence rates of various fault patterns are observed on
a real chip and the probabilities of occurrence of every pattern in one SEU are estimated.
These probabilities depend on the feature size, layout and other characteristics of a chip.
They can also be obtained by simulations of physical devices. The probability that
pattern i is observed when an SEU occurs is denoted by Q_i. If there are N possible fault
patterns, we will have the following equation.
\sum_{i=1}^{N} Q_i = 1    (4.6)
Given the raw SEU rate and the distribution of fault patterns, a model must assess the
probability that the number of flipped bits since the last access exceeds the correction
capability of the error protection code. Three challenges must be met to do this.
Figure 4.1: A spatial multi-bit fault can flip different number of bits in a protection domain
depending on its location
Figure 4.2: Failure of four words can be dependent on each other
First, the number of bit faults due to an SEU varies with the fault pattern and its
location, as illustrated in Figure 4.1. Thus, the model must take into account the shape
of the pattern and its location to evaluate the failure rate.
Second, there are dependencies between failures in adjacent domains. One large
fault may cause failures in multiple domains. For example, Figure 4.2 shows a fault
pattern that flips two bits in four vertically adjacent words. If all these words are dirty
and protected by SECDED, they will all fail at the next access to them. If words 1, 2,
3 and 4 are accessed in that order, the probability that word 1 failed must be discounted
from the probabilities of words 2, 3 and 4 failing, because if word 1 fails, then failing
accesses to words 2, 3, and 4 will not happen. This is the most complex part of the
PARMA+ model, addressed in Section 4.5.
Third, if more than one SEU strike a domain between two consecutive accesses,
the effects of the multiple multi-bit faults accumulate and may increase or decrease the
probability of an access failure. If multiple multi-bit faults flip bits of different parts of a
protection domain, the number of bit faults increases, but, if two SEUs overlap, some bit
faults may cancel out. These two cases are illustrated in Figure 4.3. In the top of Figure
4.3, two multi-bit faults cause the total number of faults in word 1 to become four while
in the bottom, the total number of bit faults in word 1 is two.
Figure 4.3: Two multi-bit fault patterns can add or subtract to each other
4.2 Previous cache reliability models
In order to compute the FIT rate of a cache for a program execution, the raw FIT rate is
typically multiplied by the Architectural Vulnerability Factor (AVF). AVF is the prob-
ability that a bit fault is converted into a systemic failure. The AVF of a cache for
single-bit faults is computed by the following procedure in [5]: Compute the total time
between all reads or write-backs of words and divide that time by the simulation time
multiplied by the number of cache words. An upper bound for AVF is called Temporal
Vulnerability Factor (TVF) [35]. TVF is the fraction of data present and vulnerable
in the cache among all data held in the cache. Like AVF, TVF is limited to single-bit
faults and is not applicable to multi-bit faults and caches with error correction codes. In
[29, 3], the reliability of a cache is computed in a way similar to [5] by multiplying the
raw FIT rate by the fraction of cache data vulnerable during the execution of a program.
All these approaches are limited to single-bit faults and caches without error correction
codes.
SoftArch [16] computes the cache MTTF as if the same program execution is
repeated until the first failure occurs. This model only applies to single-bit faults like in
[5] and cannot model the effects of error correction codes.
Approximate analytical models [20, 26] have been proposed to estimate the expected
time for two-bit faults in a word in order to set the cache scrubbing rate. These approxi-
mate models are not specific to programs. Furthermore, they are limited to two temporal
single-bit faults and cannot deal with spatial multi-bit faults.
A compound Poisson process is proposed in [23] to set the interleaving distance
of SECDED codes to correct spatial multi-bit errors. However, this approach does not
benchmark the failure rates for given programs.
PARMA [31] is a model that estimates the cache FIT rate in the presence of error
protection codes. During a program execution, PARMA calculates the probability that
each cache access fails. PARMA tracks a block in and out of main memory and tracks
accesses into the processor, so that SDCs and true/false DUEs are counted separately.
However, PARMA is limited to single-bit SEUs which will not be seen in future tech-
nologies as all soft errors will come from spatial multi-bit faults.
In MACAU [30], the interval between two accesses to a word is modeled by a
Markov chain. The Markov chain states track the number of bit faults in a word. For
example, if a word has 32 bits, there can be 0 to 32 faults in a word and the Markov chain
has 33 states. At each access, the probability that the cache has any number of faults can
be computed by the Markov chain. MACAU can estimate the cache FIT rate only for a
few specific spatial multi-bit fault patterns which were observed in a 65nm technology.
MACAU only applies to protection domains of one word. It needs many large matrix
multiplications, which are very time-consuming. Another shortcoming of MACAU is
that the evaluation of the failure rate of a protection domain is done independently of its
neighbors.
Fault-injection experiments, either real-life or simulated, are the ultimate approach to
estimating the MTTF or FIT rate. However, fault-injection experiments are costly and
extremely time-consuming because of the large number of simulation runs needed to
obtain a significant estimate, especially when events are rare. Because actual fault rates
are extremely low, fault-injection simulations cannot be applied to actual, observed fault
rates and can only be run for extremely high fault rates, far from reality.
4.3 Basic assumptions and equations of PARMA+
PARMA+ computes the FIT rate based on several inputs. One of the inputs is the pro-
tection domain. A protection domain can be a block, a word, a byte or any set of bits in
a cache. Another input is the protection code in each protection domain. SECDED and
parity are typical but stronger codes such as DECTED (Double Error Correction Triple
Error Detection) are also possible.
In this thesis, the interval between two consecutive accesses to a protection domain
is called a vulnerability interval because it is the time during which a domain is exposed
to faults before any error correcting mechanism can be applied. For example, if the last
access to a word was at cycle 1000 and a read happens at cycle 2000, the vulnerability
interval is [1000, 2000]. The length of the vulnerability interval in cycles is denoted L.
In this example, L is equal to 2000-1000=1000.
We assume that at most one SEU can hit a protection domain in any one cycle (a
fraction of a nanosecond). Thus, if the protection domain is one word, there can only
be one SEU in a word in one cycle. The model could be refined to accept multiple
SEUs in the same cycle, but, given current technology trends in which the FIT rate per
megabit of SRAM is expected to stay around 1000 [1] for the foreseeable future, this
added complexity would be futile. SEUs that happen in the same protection domain are
referred to as ”Domain SEUs” (DSEUs). The DSEU rate is larger than the SEU rate
because a domain contains multiple bits.
The PARMA+ model can be applied to any memory array such as L1 cache, L2
cache, or even main memory. However, in this section, we focus on the reliability of an
L2 cache, and a failure is caused by DUEs or SDC errors that happen when a block is
read from L2 or written back by L2. In Sections 4.4 and 4.5, it is implicitly assumed
that the protection domain is one block and the granularity of cache accesses is equal to
the size of a protection domain. In Section 4.6, we will explain how PARMA+ models
cache accesses which read different numbers of protection domains.
4.3.1 Illustrative example
To illustrate the equations of PARMA+, we use the following running example in this
section. In this example, the cache contains 15 words as shown in Figure 4.4. Every
word has 32 bits and is protected by SECDED. Moreover, all words are dirty. Thus, any
word fails with two or more faulty bits.
In this example, an SEU has one of two bit fault patterns. The first fault pattern is
a single-bit fault and the second one is a 2×2 fault in which 4 bits are flipped. The
probabilities of the fault patterns in an SEU are both 0.5.
4.3.2 SEU rate in one protection domain (SEU rate)
The rate per bit and per cycle of each fault pattern i is equal to R(SEU) × Q_i, i.e.,
the SEU rate per bit multiplied by the probability that the SEU has fault pattern i. We
say that a DSEU occurs in a domain when at least one bit in the domain is flipped by
the fault. To compute the DSEU rate, we first compute the rate at which each fault
pattern i occurs in the domain. For the case of a single-bit fault pattern, the rate is
simply R(SEU) × Q_i × B, where B is the number of bits in the protection domain. For
multi-bit fault patterns, the calculation is a bit more complex because multi-bit faults
may straddle multiple protection domains.
Figure 4.4: The running example of Section 4
The footprint of a fault pattern is the smallest rectangle that includes the pattern. For
example, the footprint of fault pattern 2 of Figure 4.4 is a 2×2 square. We pick the
North-West (N-W) bit of the footprint of the pattern to locate the fault in the cache array,
although any other bit could be chosen. In Figure 4.4, the N-W bit of fault pattern 2 is
shaded.
We define N^i_DSEU as the number of bits in the cache (inside or outside the domain)
such that if the N-W bit of fault pattern i is pinned to one of these bits, at least 1 bit is
flipped inside the domain. In order to compute N^i_DSEU, the PARMA+ tool pins the N-W
bit of pattern i to different bits inside and around the domain and counts the number of
cases when at least 1 bit is flipped inside the domain by the fault pattern. For example, if
the protection domain is contained between rows G and F and between columns R and S
of the cache array, and if all multi-bit fault patterns are confined to an N×M footprint,
the following algorithm computes N^i_DSEU.
N^i_DSEU = 0;
for (l = G - N; l <= F; l++)
  for (n = R - M; n <= S; n++)
    if (fault pattern i located at bit position (l, n) flips at least one bit of the domain)
      N^i_DSEU++;
For the example of Section 4.3.1, Figure 4.5 shows in gray the bits of the cache array
such that if the N-W bit of any fault pattern is pinned to any of them, the fault occurs
in Word 7, i.e., at least one bit is flipped in Word 7. N^1_DSEU (for fault pattern 1) is equal
to 32 because if fault pattern 1 is pinned to any one of the 32 bits of Word 7, the fault
occurs in the domain. Fault pattern 2 occurs in Word 7 if its N-W bit is pinned to any
bit of Word 7 (32 bits total) or any bit of Word 4 (32 bits total) or bit 31 of Word 3 or bit
31 of Word 6 (a total of 66 bits), because at least one bit inside of Word 7 is flipped in
all these cases. Consequently, N^2_DSEU is equal to 66.
The mean of N^i_DSEU over all possible fault patterns is called N_DSEU and is computed
by equation (4.7) in which N is the number of fault patterns.
N_{DSEU} = \sum_{i=1}^{N} N^i_{DSEU} × Q_i    (4.7)
For our example, N_DSEU is equal to 0.5 × 66 + 0.5 × 32 = 49.
Figure 4.5: Bits contributing to N^i_DSEU for each of the two fault patterns
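These counts can be reproduced mechanically. The sketch below lays the 15-word example out as 5 rows of three 32-bit words (Word 7 on row 2, bit columns 32-63, as in Figure 4.4), pins each pattern at every candidate position, and recovers N^1_DSEU = 32, N^2_DSEU = 66 and N_DSEU = 49; it is our own small re-implementation of the counting loop for this example only.
    # Word 7 occupies row 2 and bit columns 32..63 of a 5-row x 96-column bit array.
    DOMAIN_ROWS = {2}
    DOMAIN_COLS = set(range(32, 64))

    # Fault patterns as (row, column) offsets from the N-W bit of their footprint.
    PATTERNS = {1: [(0, 0)],                           # single-bit fault, Q1 = 0.5
                2: [(0, 0), (0, 1), (1, 0), (1, 1)]}   # 2x2 fault, Q2 = 0.5
    Q = {1: 0.5, 2: 0.5}

    def n_dseu(pattern):
        # Count the N-W pin positions for which at least one flipped bit lands inside Word 7.
        count = 0
        for row in range(-1, 5):                       # allow pin positions above and left of the domain
            for col in range(-1, 96):
                if any((row + dr) in DOMAIN_ROWS and (col + dc) in DOMAIN_COLS for dr, dc in pattern):
                    count += 1
        return count

    n1, n2 = n_dseu(PATTERNS[1]), n_dseu(PATTERNS[2])
    print(n1, n2, Q[1] * n1 + Q[2] * n2)               # 32 66 49.0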
The DSEU rate in a protection domain is obtained by multiplying the SEU rate per
bit by the average number of bit locations in the array such that, if an energetic particle
hits that bit, a DSEU occurs.
R(DSEU) = R(SEU) × N_{DSEU}    (4.8)
4.4 Failure of a domain independently of other domains
In this section, we ignore the failure dependencies that may exist between neighbor-
ing domains in the failure rate computation. Section 4.5 will take into account these
dependencies.
Every access to a protection domain may result in domain failure. Accesses are
reads and write-backs. The probability of failure due to an access to a domain is given
by equation (4.9). The probability of access failure is the sum, over the possible numbers
of DSEUs c, of the probability of having c DSEUs in the domain multiplied by the
probability that those c DSEUs cause a failure in the accessed domain. In equation
(4.9), L is the length of the vulnerability interval between two accesses in cycles and j is
the access number. Since the model assumes at most 1 DSEU per cycle, the maximum
number of DSEUs between two accesses to the domain is L.
P_j = \sum_{c=1}^{L} P(c DSEUs) × P(access j fails | c DSEUs)    (4.9)
The computation of P(c DSEUs) is done in Section 4.4.1. P(access j fails | 1 DSEU) and
P(access j fails | 2 DSEUs) independently of other domains are computed in Sections
4.4.2 and 4.4.3. Extensions to several DSEUs in a vulnerability interval are straightforward.
4.4.1 Probability of c DSEUs
The probability that one DSEU occurs in one cycle is denoted P_DSEU and is modeled by
a Poisson process:
P_{DSEU} = R(DSEU) × e^{-R(DSEU)}    (4.10)
The probability of c DSEUs occurring during L cycles is modeled by a Binomial distri-
bution [31]:
P(c DSEUs) = \binom{L}{c} (P_{DSEU})^c (1 - P_{DSEU})^{L-c}    (4.11)
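Putting equations (4.9)-(4.11) together, and truncating the sum of equation (4.9) at two DSEUs as argued later in this chapter, a per-access failure probability can be computed as in the sketch below; the conditional failure probabilities passed in are placeholders that would come from Sections 4.4.2 and 4.4.3.
    import math

    def p_dseu(r_dseu):
        # Equation (4.10): probability of exactly one DSEU in one cycle (Poisson).
        return r_dseu * math.exp(-r_dseu)

    def p_c_dseus(c, L, r_dseu):
        # Equation (4.11): probability of c DSEUs during a vulnerability interval of L cycles (Binomial).
        p = p_dseu(r_dseu)
        return math.comb(L, c) * p**c * (1.0 - p)**(L - c)

    def p_access_fails(L, r_dseu, p_fail_given_1, p_fail_given_2):
        # Equation (4.9), truncated at c = 2: more DSEUs per interval are vanishingly unlikely.
        return (p_c_dseus(1, L, r_dseu) * p_fail_given_1 +
                p_c_dseus(2, L, r_dseu) * p_fail_given_2)

    # Illustrative inputs: a 1000-cycle vulnerability interval and made-up conditional probabilities.
    print(p_access_fails(L=1000, r_dseu=1e-18, p_fail_given_1=0.63, p_fail_given_2=0.8))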
4.4.2 Probability of access failure given a single DSEU in the
domain
A failure happens in a domain when a certain number of bits are faulty. Table 4.1 shows
the failure condition for several well-known error protection codes. We define N^i_Fail as
the number of bits in the cache array such that if the N-W bit of fault pattern i is pinned
to any one of them, a failure will happen when the domain is accessed next. N^i_Fail is
always less than or equal to N^i_DSEU and is different in dirty and clean blocks.
We consider a protection domain contained between rows G and F and between
columns R and S of the cache array and multi-bit fault patterns included in an N×M
footprint. N^i_Fail is computed as follows.
N^i_Fail = 0;
for (l = G - N; l <= F; l++)
  for (n = R - M; n <= S; n++)
    if (pinning the N-W bit of fault pattern i at bit location (l, n) causes domain failure)
      N^i_Fail++;
The mean number of fault locations that cause a failure across all patterns is denoted
N_Fail:
N_{Fail} = \sum_{i=1}^{N} Q_i × N^i_{Fail}    (4.12)
Figure 4.6: Bits in N^2_DSEU that cause no failure in Word 7
P(access j fails | 1 DSEU) is the fraction of bits counted in N_DSEU which cause a failure.
P(access j fails | 1 DSEU) = \frac{N_{Fail}}{N_{DSEU}}    (4.13)
In equation (4.13), N_Fail is given by equation (4.12) and N_DSEU is given by equation
(4.7).
In the case of the example of Section 4.3.1, it takes more than one faulty bit to cause
a failure, and therefore N^1_Fail is equal to 0 since pattern 1 is a single-bit fault and cannot
cause more than one bit fault in a word. If the N-W bit of pattern 2 is pinned to any bit
0-30 of Words 4 or 7, there will be more than one bit fault in the domain. Hence, N^2_Fail
is equal to 62. Figure 4.6 shows (in gray) the bits that contribute to N^2_DSEU but do not
contribute to N^2_Fail. These bits are bit 31 of Words 3, 4, 6 and 7. Therefore we have:
N_Fail / N_DSEU = (0.5 × 0 + 0.5 × 62) / (0.5 × 32 + 0.5 × 66) = 0.63
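Extending the counting sketch given for N^i_DSEU, the same enumeration with the SECDED failure condition for a dirty word (two or more flipped bits inside the domain) reproduces N^2_Fail = 62 and the ratio 0.63; again, this is our own illustration of the counting loop, not the PARMA+ tool itself.
    DOMAIN_ROWS = {2}
    DOMAIN_COLS = set(range(32, 64))                 # Word 7: row 2, bit columns 32..63
    PATTERNS = {1: [(0, 0)], 2: [(0, 0), (0, 1), (1, 0), (1, 1)]}
    Q = {1: 0.5, 2: 0.5}

    def count_pins(pattern, min_flipped):
        # Count the N-W pin positions that flip at least min_flipped bits inside Word 7.
        count = 0
        for row in range(-1, 5):
            for col in range(-1, 96):
                flipped_inside = sum(1 for dr, dc in pattern
                                     if (row + dr) in DOMAIN_ROWS and (col + dc) in DOMAIN_COLS)
                if flipped_inside >= min_flipped:
                    count += 1
        return count

    n_dseu = {i: count_pins(p, 1) for i, p in PATTERNS.items()}   # {1: 32, 2: 66}
    n_fail = {i: count_pins(p, 2) for i, p in PATTERNS.items()}   # {1: 0, 2: 62} for a dirty SECDED word
    ratio = sum(Q[i] * n_fail[i] for i in Q) / sum(Q[i] * n_dseu[i] for i in Q)
    print(n_fail, round(ratio, 2))                                # {1: 0, 2: 62} 0.63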
4.4.3 Probability of access failure given two or more DSEUs in the
domain
Computing the probability of domain failure given two DSEUs in a vulnerability interval
is similar to the case of one DSEU. Let N^{i,m}_Fail be the number of cases in which patterns
i and m occur in the domain and their superimposition causes domain failure. N^{i,m}_Fail is
computed as follows:
N^{i,m}_Fail = 0;
for (j = 0; j < N^i_DSEU; j++)
  for (l = 0; l < N^m_DSEU; l++)
    if (superimposition of patterns i and m causes failure)
      N^{i,m}_Fail++;
Note that N^{i,m}_Fail is always less than N^i_DSEU × N^m_DSEU as multi-bit faults can cancel each
other as shown in Figure 4.3. N_Fail for two DSEUs is now given by equation (4.14).
N_{Fail} = \sum_{i=1}^{N} \sum_{m=1}^{N} Q_i × Q_m × N^{i,m}_{Fail}    (4.14)
P(access j fails | 2 DSEUs) is the fraction of combinations of two DSEUs occurring in
the domain whose superimposition causes a domain failure:
P(access j fails | 2 DSEUs) = \frac{N_{Fail}}{(N_{DSEU})^2}    (4.15)
In equation (4.15), N_Fail is given by equation (4.14) and N_DSEU is given by equation
(4.7).
For the cases of several DSEUs (three or more) in a vulnerability interval of L cycles,
the procedure is conceptually similar. For example, for three DSEUs, equation (4.14)
would have three summation signs over patterns i, m and k. However, given the state
of technology now and for the foreseeable future (until 2024) [1], it is futile to consider
more than two DSEUs during a vulnerability interval because the probability of a single
DSEU during a vulnerability interval is already extremely small (less than 10^{-15}).
4.5 Failure dependencies with neighboring domains
In Section 4.4, the probability of failure (equation (4.9)) was computed for one pro-
tection domain, independently of failures in other domains. However, because spatial
multi-bit faults can fail more than one protection domain, the probability that an access
in a domain fails is dependent on prior failures in other domains. To illustrate this fur-
ther, Figure 4.7 shows an example in which a 2×2 fault occurs in Words 4 and 7 of
Figure 4.4. In this example, when Word 4 is accessed at cycle 1000, the cache fails, rais-
ing a DUE and terminating execution. Hence, the access to Word 7 which was supposed
to happen at cycle 1500 will not happen.
For the accesses of Figure 4.7, PARMA+ first computes P_j for the access to Word 4
using equations from Section 4.4. This probability takes into account that the N-W bit
of the 2×2 fault pattern can hit any bit 0-30 of Word 4 or Word 1 (Word 1 is not part of
Figure 4.7) before 1000. When the probability of access failure to Word 7 is calculated
at cycle 1500, the failures caused by pinning the N-W bit of the 2×2 fault pattern to
bits 0-30 of Word 4 during [0,1000] should be ignored. This is because if the 2×2 fault
hits bits 0-30 of Word 4 during [0,1000], it would cause the failure of Word 4 but not
Word 7 and the execution would terminate at the access to Word 4.
Figure 4.7: Failures of domains can be dependent on each other
To compute the probability of failure in Word 7 at cycle 1500, the vulnerability
interval [0,1500] is divided into two subintervals: first, subinterval [0,1000] in which
overlapping fault patterns causing failures in both Words 4 and 7 are not counted and,
second, subinterval [1001,1500] in which all fault patterns causing failures in Word 7
are counted. Therefore N^i_Fail is computed differently in each of these subintervals.
For a spatial multi-bit fault to cause a failure dependency between two accesses in
different domains, all three following conditions must be met:
1. The accesses should be either reads, write-backs or read-modify-writes (or read-
before-writes) triggered by write operations. Writes that do not trigger a read operation
cannot cause failures.
2. The fault pattern must cause a failure in both domains.
3. One of the accesses must be made during the vulnerability interval of the other
access.
Before quantifying failure dependencies, we first define the concept of neighboring
domains. For the case of one DSEU in a vulnerability interval, a protection domain is
a neighbor of another domain if at least one of the fault patterns can cause a failure in
both domains. In the example of Figure 4.4, Words 4 and 10 are neighbors of Word 7
because fault pattern 2 can cause the failure of Word 7 and of Words 4 or 10. For the case
of two DSEUs in a vulnerability interval, a protection domain is a neighbor of another
domain if the superimposition of any pair of fault patterns which occur in one domain
can cause the failure of both domains. For three and more DSEUs, neighboring domains
can be defined similarly. However, a model for more than two DSEUs in a vulnerability
interval is not currently needed due to the extremely low probability of such occurrence.
Figure 4.8: Accesses to word 7 and its neighbors
During a PARMA+ simulation, every protection domain keeps track of accesses to
its neighboring domains. In order for a failure dependency to affect the computation
of P_j at access j, there should be a prior access to one of the neighbors of the domain
satisfying the three failure dependency conditions listed above.
PARMA+ divides a vulnerability interval into subintervals delimited by accesses to
neighboring domains. For example, the vulnerability interval [1000, 2000] of the last
access in Figure 4.8 is divided into subinterval 1 [1000, 1400], subinterval 2 [1400,
1600] and subinterval 3 [1600, 2000]. Faults that cause a failure of Word 7 and of
Words 4 or 10 during subinterval 1 must be ignored at cycle 2000 because if Word 4
fails at cycle 1400 or Word 10 fails at cycle 1600, then the access to Word 7 at 2000
will not happen, as the application would have crashed before cycle 2000. For the
same reason, all faults that cause a failure in both Words 7 and 10 must be ignored in
subinterval 2. Finally, all faults must be counted in subinterval 3 because there is no
dependency with neighbors.
DSEUs can be distributed among subintervals in different ways. In the example, a
single DSEU can occur in one of the three subintervals, and two DSEUs can be dis-
tributed in six ways among the three subintervals: 2-0-0, 0-2-0, 0-0-2, 1-1-0, 1-0-1, and
0-1-1. The number of ways that c DSEUs can be distributed among U subintervals is
denoted S. S is equal to the number of ways that c balls can be distributed into U bins
and is computed by equation (4.16).
S = \binom{U + c - 1}{c}    (4.16)
The probability of distribution k among the S possible DSEU distributions is computed
by equation (4.17). In this equation, L_t is the length of subinterval t in cycles and c_t is the
number of DSEUs in subinterval t in distribution k. The denominator of equation (4.17)
is the number of ways that c DSEUs can be distributed in L cycles and the numerator is
the number of ways that the c DSEUs can be placed according to distribution k, given the
length of each subinterval.
P(Distribution k) = \frac{\binom{L_1}{c_1} \binom{L_2}{c_2} ... \binom{L_U}{c_U}}{\binom{L}{c}}    (4.17)
c = c_1 + c_2 + c_3 + ... + c_U and L = L_1 + L_2 + ... + L_U.
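The subinterval bookkeeping of equations (4.16) and (4.17) can be checked with the short sketch below, which recovers the distribution probabilities 0.4, 0.2 and 0.4 quoted later in this section for the single-DSEU example of Figure 4.8 (subinterval lengths of 400, 200 and 400 cycles); the function names are ours.
    import math
    from itertools import product

    def num_distributions(U, c):
        # Equation (4.16): ways to place c DSEUs (balls) into U subintervals (bins).
        return math.comb(U + c - 1, c)

    def p_distribution(counts, lengths):
        # Equation (4.17): probability of one particular placement of the DSEUs among the subintervals.
        c, L = sum(counts), sum(lengths)
        numerator = math.prod(math.comb(l, k) for l, k in zip(lengths, counts))
        return numerator / math.comb(L, c)

    lengths = [400, 200, 400]              # subintervals of the [1000, 2000] interval of Figure 4.8
    print(num_distributions(U=3, c=1))     # 3
    print([p_distribution(d, lengths) for d in ([1, 0, 0], [0, 1, 0], [0, 0, 1])])   # [0.4, 0.2, 0.4]
    # For two DSEUs there are six placements, and their probabilities sum to 1 as expected.
    two_dseus = [d for d in product(range(3), repeat=3) if sum(d) == 2]
    print(len(two_dseus), round(sum(p_distribution(d, lengths) for d in two_dseus), 6))   # 6 1.0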
In order to account for failures in neighboring domains, P(access j fails | c DSEUs)
in equation (4.9) must be computed by equation (4.18).
P(access j fails | c DSEUs) = \sum_{k=1}^{S} P(Distribution k | c DSEUs) ×
P(Domain failure with no failure in neighbors | c DSEUs and Distribution k)    (4.18)
P(Distribution k | c DSEUs) is computed by equation (4.17). Thus, we need to compute
the second term on the right side of equation (4.18). We first explain how to compute
P(Domain failure with no failure in neighbors | 1 DSEU and Distribution k).
(c=1) We define N^i_SubFailk in a way similar to N^i_Fail. N^i_SubFailk is the number of bits
in the cache array such that if the N-W bit of fault pattern i is pinned to any of them, the
domain fails but none of its neighbors fail given distribution k. The mean of N^i_SubFailk
over all patterns is:
N_{SubFailk} = \sum_{i=1}^{N} N^i_{SubFailk} × Q_i    (4.19)
Then:
P(Domain failure with no failure in neighbors | 1 DSEU and Distribution k) = \frac{N_{SubFailk}}{N_{DSEU}}    (4.20)
In the example of Figure 4.8 (the access at cycle 2000) with patterns of Figure 4.4,
fault pattern 1 (single-bit fault) cannot fail any domain. Thus, N^1_Fail and N^1_SubFailk are
both equal to 0. N^2_Fail is 62. Among the 62 locations counted in N^2_Fail, 31 of them cause a
failure in Word 4 and 31 of them cause a failure in Word 10. In subinterval 1, which ends
with the access to Word 4, faults in Word 7 that cause a failure in Word 4 or Word 10
must be ignored; thus all faults must be ignored in subinterval 1 and N^2_SubFail1 is equal to
zero. In subinterval 2, which ends with the access to Word 10, any fault in Word 7 that
causes a failure in Word 10 must be ignored because the application must have survived
the access to Word 10 at cycle 1600. Hence, N^2_SubFail2 is 31. Finally, in subinterval 3,
after cycle 1600, there is no failure dependency with neighbors and no failure of Word
7 can be ignored. N^2_SubFail3 is equal to 62.
Thus, we have:
N_SubFail1 / N_DSEU = (0 × 0.5 + 0 × 0.5) / 49 = 0
N_SubFail2 / N_DSEU = (0 × 0.5 + 31 × 0.5) / 49 = 0.31
N_SubFail3 / N_DSEU = (0 × 0.5 + 62 × 0.5) / 49 = 0.63
The DSEU can happen in any one of the three subintervals. Using equation (4.17), the
probabilities of distributions 1, 2, and 3 are 0.4, 0.2 and 0.4, respectively. Hence, the
probability of access failure computed by equation (4.18) is 0.4 × 0 + 0.2 × 0.31 + 0.4 ×
0.63 = 0.31.
We now consider the case of two DSEUs in a vulnerability interval.
(c=2) For two DSEUs in a vulnerability interval, the process is similar to the case
of one DSEU. We define N^{i,m}_SubFailk as the number of bits in the cache array such that
if fault patterns i and m occur in the domain and their superimposition causes domain
failure, but none of its neighboring domains fail given distribution k. The following
algorithm computes N^{i,m}_SubFailk.
    N^{i,m}_SubFailk = 0;
    For (l = 0; l < N^i_DSEU; l++)
        For (n = 0; n < N^m_DSEU; n++)
            If (the superimposition of patterns i and m, with their N-W bits pinned
                at the l-th and n-th candidate locations, causes a failure of the domain
                but no neighboring domain fails given distribution k)
                N^{i,m}_SubFailk++;
N_SubFailk is the mean of N^{i,m}_SubFailk over all pairs of patterns i and m.
Then:

P(\text{Domain failure with no failures in neighbors} \mid 2 \text{ DSEUs and Distribution } k) = \frac{N_{SubFail_k}}{(N_{DSEU})^2}    (4.21)

This procedure can be generalized to compute equation (4.18) for more than two
DSEUs.
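A compact C++ sketch of the counting loop above is shown next. The failure predicate is passed in as a callback because it depends on the cache layout and on the accesses recorded for the neighboring domains; the function and parameter names are illustrative and are not the PARMA+ tool's actual interface.

    #include <cstdint>
    #include <functional>

    // Counts the placements of fault patterns i and m whose superimposition fails
    // the protection domain while no neighboring domain fails under distribution k.
    // fails_only_this_domain(l, n) encapsulates the layout-dependent check; it is a
    // placeholder for the cache-model logic described in the text.
    uint64_t count_subfail_pairs(
            uint64_t n_dseu_i,   // N^i_DSEU
            uint64_t n_dseu_m,   // N^m_DSEU
            const std::function<bool(uint64_t, uint64_t)>& fails_only_this_domain) {
        uint64_t count = 0;      // N^{i,m}_SubFail_k
        for (uint64_t l = 0; l < n_dseu_i; ++l)
            for (uint64_t n = 0; n < n_dseu_m; ++n)
                if (fails_only_this_domain(l, n))
                    ++count;
        return count;
    }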
4.6 Model extensions
So far, the model has been developed in the context of a standard cache array under
nominal voltage and an error code protecting a contiguous domain. However, the model
is applicable to other environments such as:
Bit-interleaving: PARMA+ is applicable to bit-interleaved cache arrays or arrays
with interleaved protection codes. In both cases the only difference is that the protection
domain is not made of contiguous bits in the array, and N^i_DSEU, N^i_Fail and other such
variables are computed in an interleaved array. No equation of PARMA+ is changed.
We will evaluate the accuracy of PARMA+ in bit-interleaved caches in Section 4.7.
Early write-back and cache scrubbing: PARMA+ is also applicable to early write-
back and/or cache scrubbing. In these cases, the model simulates the additional accesses.
In the case of early write-backs this means extra write-backs. In the case of scrubbing
this means extra reads at periodic times.
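As a rough illustration of how scrubbing could be folded into an access-driven simulation, the sketch below issues a synthetic read to every protection domain once per scrub period; the types and the per-access callback are placeholders, not the tool's real data structures.

    #include <cstdint>
    #include <vector>

    // Sketch: modeling cache scrubbing as extra reads issued at periodic times.
    // DomainState and process_read() stand in for the per-domain state and the
    // per-access failure-probability update described in the text.
    struct DomainState { uint64_t last_access_cycle = 0; };

    void maybe_scrub(std::vector<DomainState>& domains,
                     uint64_t current_cycle,
                     uint64_t scrub_period_cycles,
                     void (*process_read)(DomainState&, uint64_t)) {
        // At every multiple of the scrub period, each protection domain receives a
        // synthetic read, which closes its current vulnerability interval.
        if (scrub_period_cycles == 0 || current_cycle % scrub_period_cycles != 0)
            return;
        for (DomainState& d : domains)
            process_read(d, current_cycle);
    }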
Different sizes of protection domain and access granularity: If the protection
domain is smaller than the cache access granularity, such as word-level protection in L2,
each L2 write-back or read accesses several protection domains. To model this, N^i_Fail is
counted as the number of bits in the cache array such that if the N-W bit of fault pattern i
is pinned to any one of them, at least one of the protection domains fails. N^{i,m}_Fail and other
such variables (like N^i_SubFailk) are also computed similarly, while other equations do not
change. If the protection domain is larger than the granularity of a cache access, the
entire domain will be accessed and the protection code will be checked. Hence, if the
access granularity is smaller than the protection domain size, the tool always accesses
the entire domain, like the case in which the access granularity is equal to the protection
domain size.
Dynamic Voltage and Frequency Scaling (DVFS): To reduce energy consumption,
modern processors change their voltage and frequency dynamically. At lower voltages,
the soft error rate increases because a smaller charge is stored in each SRAM cell [8].
This application of the model is more challenging. For each voltage level of the DVFS
scheme, different inputs must be provided to the tool. First, the ITRS FIT rate in equa-
tion (4.5) must be replaced by the raw FIT rate for the various non-nominal voltages
employed by the DVFS scheme. Second, the fault patterns and their probabilities (Q_i of
equation (4.6)) must be determined for the non-nominal voltages as well.
4.7 Simulations
In this section, we compare the PARMA+ model with fault-injection simulations. Since
the actual SEU rate is extremely small, fault-injection rates must be raised drastically so
that fault-injection simulations can complete in a reasonable amount of time, especially
given that a large number of them must be done for each design point. This is one of the
reasons why formal models such as PARMA+ are useful: they make it feasible to estimate
FIT rates at actual SEU rates.
The system configuration in all the simulations reported here is shown in Table 4.2.
We simulate 13 SPEC 2000 benchmarks with SimpleScalar [4]. Each benchmark is fast-
forwarded for 100 million instructions and is then run in detail for an additional 100 million
instructions. We could have used SimPoint and run 100 million instructions after fast-
forwarding to a simulation point, but we did not, for two reasons. First, we would have
needed to fast-forward tens of billions of instructions, which would increase the simulation
time significantly and prevent us from running many fault-injection experiments. Second,
SimPoint matters for performance evaluations because it tries to reproduce the CPI and
cache miss rate of the entire benchmark; for our reliability evaluation, the CPI and cache
miss rate are not the quantities of interest. We perform reliability simulation experiments
on an L2 cache protected by SECDED, and we call an error a failure when the error is
propagated outside the L2 cache by either a read or a write-back operation. PARMA+'s
results depend on the layout of the cache because it considers the dependencies between
different domains, and these dependencies vary across layouts. We assume a typical
layout for the cache and implement all dependency equations accordingly. In PARMA+,
it is assumed that the size of a cache row is equal to the cache line size.

Table 4.2: Evaluation parameters

    Parameter            | Value
    Functional units     | 4 integer ALUs, 1 integer multiplier/divider, 4 FP ALUs, 1 FP multiplier/divider
    LSQ size / RUU size  | 16 instructions / 64 instructions
    Issue width          | 4 instructions / cycle
    Frequency            | 3 GHz
    L1 data cache        | 64KB, 4-way, 32-byte lines, 2-cycle latency
    L2 cache             | 1MB unified, 8-way, 32-byte lines, 8-cycle latency
    L1 instruction cache | 16KB, 1-way, 32-byte lines, 1-cycle latency
    Feature size         | 32nm
In our simulations, the probability of failure at access j is computed as

P_j = P(1 \text{ DSEU}) \cdot P(\text{access } j \text{ fails} \mid 1 \text{ DSEU}) + P(2 \text{ DSEUs}) \cdot P(\text{access } j \text{ fails} \mid 2 \text{ DSEUs})

This is equation (4.9) where we neglect the probability of having more than 2 DSEUs,
because the probability of having several DSEUs in a vulnerability interval is extremely
small.
In PARMA+, the failure rate is computed in a single simulation run for each bench-
mark by equation (4.3). Since the fault patterns used in this section cause a failure of the
cache with a single DSEU, the probability of failure due to 2 DSEUs does not impact
the FIT rate. Thus, in order to make the simulation faster, we do not consider depen-
dencies for the case of two DSEUs: the number of combinations computed by
equation (4.16) is very high and accounting for them would increase the simulation time.
In PARMA+ simulations, P(access j fails | 1 DSEU) is computed by equations (4.18),
(4.19) and (4.20), and P(access j fails | 2 DSEUs) is computed by equations (4.14) and
(4.15).
Fault-injection simulations are run at least 10,000 times for each benchmark, and the
failure rate is estimated as the number of simulation runs in which the application fails
because of transient faults in the L2 cache, divided by the total number of simulations.
For some benchmarks with a very low rate of significant fault events, we had to increase
the number of fault-injection simulations to 30,000.
In order to understand the importance of cross-domain failure dependencies, we also
compare the results of fault injections with a version of PARMA+ that does not take into
account failure dependencies across domains. Another reason for this comparison is that
the running times of the simulations with and without dependencies are different. The
PARMA+ model of Section 4.4, which omits failure dependencies, runs at about the same
speed as performance simulations on SimpleScalar. By contrast, the simulations includ-
ing the failure dependencies of Section 4.5 for 1 DSEU run 10 times slower. We refer
to the model of Section 4.4 (PARMA+ without cross-domain failure dependencies) as
PARMA+_light. For PARMA+_light, P(access j fails | 1 DSEU) is computed by equations
(4.12) and (4.13), and P(access j fails | 2 DSEUs) is computed by equations (4.14) and
(4.15). Note that in the simulations of this section, PARMA+ and PARMA+_light use the
same equations for two faults, but the contribution of two DSEUs to the FIT rates shown
in this section is negligible.
We also compare PARMA+ with MACAU [30]. To the best of our knowledge,
MACAU is the only existing model which can compute failure rates in the presence of
spatial multi-bit faults. However, it is only applicable to the faults shown in Figure 4.9,
it is limited to protection domain sizes of one word, and it cannot be applied to bit-
interleaved caches or caches with an interleaved error code. Moreover, the computational
complexity of MACAU is very high.

Figure 4.9: Small fault patterns (black squares show faults); occurrence probabilities
Q1=0.89, Q2=0.01, Q3=0.06, Q4=0.015, Q5=0.015, Q6=0.01.
The deviation between the estimates given by formal models such as PARMA+,
PARMA+_light and MACAU, and fault-injection simulations is computed as:

\text{Deviation} = \left| 1 - \frac{\text{Model failure rate}}{\text{Fault injection failure rate}} \right| \times 100

If the failure rate of a benchmark is so small that the limited number of fault injections
cannot produce any failure, the deviation given by the above formula would be infinite. In this
case, we use the absolute value of the model's result (PARMA+, PARMA+_light or MACAU)
as the deviation.
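The deviation metric, including the fallback used when fault injection observes no failures, can be written as a small helper; this is a sketch following the convention just described, with a function name of our choosing.

    #include <cmath>

    // Percentage deviation of a model estimate from the fault-injection estimate.
    // When fault injection observed no failures, the convention above is to report
    // the absolute value of the model's own failure rate as the deviation.
    double deviation_percent(double model_rate, double fault_injection_rate) {
        if (fault_injection_rate == 0.0)
            return std::fabs(model_rate);
        return std::fabs(1.0 - model_rate / fault_injection_rate) * 100.0;
    }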
We report on five sets of simulations. First, we compare PARMA+ and MACAU with
fault-injection simulations using the fault patterns shown in Figure 4.9, which were observed
in previous physical beam-injection experiments in a 65nm technology [32]. In Figure 4.9,
faulty bits are shown in a 2×3 fault footprint (black bits are faulty). In this set of experiments,
the raw SEU rate is 8.04 × 10^14 FIT/Mbit and SECDED is applied at the word level,
since MACAU is only applicable to word-level protection domains.
Table 4.3 shows that the deviation of PARMA+ is on average 2.0% while the deviation
of MACAU is 15.1%. Hence, the accuracy of PARMA+ is better than that of MACAU for the
fault patterns to which MACAU is applicable. Table 4.4 shows the fault injection results
(their absolute values) used in Table 4.3 and their 95% confidence intervals.
Table 4.3: PARMA+ and MACAU deviations under fault patterns of Figure 4.9 (accelerated SEU rate)

    Benchmark | PARMA+ deviation (%) | MACAU deviation (%)
    gcc       | 0.6                  | 3.9
    mesa      | 6.6                  | 89.1
    gzip      | 4.3                  | 10.5
    applu     | 0.4                  | 2.2
    twolf     | 6.9                  | 23.2
    mcf       | 0.7                  | 9.9
    crafty    | 1.1                  | 15.3
    equake    | 2.6                  | 11.2
    lucas     | 0.1                  | 9.1
    galgel    | 0.9                  | 6.4
    ammp      | 0.2                  | 5.5
    mgrid     | 1.3                  | 8.5
    swim      | 0.6                  | 3.2
    Average   | 2.0                  | 15.1
Second, we compare PARMA+ and PARMA+_light to fault injection with the patterns of
Figure 4.9. In this set of experiments, the raw SEU rate is 8.04 × 10^14 FIT/Mbit and
each cache block is protected by SECDED. We ran 30,000 fault-injection simulations
for mesa and equake to obtain statistically acceptable results. For this set of patterns,
the PARMA+ model is highly accurate and its deviation from fault-injection simulations
is on average 1.9%. By comparison, PARMA+_light is less accurate, with an average
deviation of 8.4%. Since a fault pattern can flip bits in up to two domains, PARMA+_light
may count a fault twice in two neighboring domains and overestimate the failure rate.
The 95% confidence intervals for the fault injection experiments used in Table 4.5 are shown
in Table 4.6.
Table 4.4: Fault injection results of Table 4.3 and their 95% confidence intervals

    Benchmark | Fault injection results | 95% confidence interval
    gcc       | 0.874                   | [0.871, 0.876]
    mesa      | 0.002                   | [0.0011, 0.0028]
    gzip      | 0.557                   | [0.550, 0.563]
    applu     | 0.9892                  | [0.9889, 0.9894]
    twolf     | 0.021                   | [0.0193, 0.0226]
    mcf       | 0.8125                  | [0.809, 0.815]
    crafty    | 0.449                   | [0.441, 0.456]
    equake    | 0.0034                  | [0.0027, 0.0040]
    lucas     | 0.781                   | [0.777, 0.784]
    galgel    | 0.076                   | [0.071, 0.080]
    ammp      | 0.9932                  | [0.9930, 0.9933]
    mgrid     | 0.93                    | [0.928, 0.931]
    swim      | 0.882                   | [0.879, 0.884]

In the third set of simulations, we compare PARMA+ and PARMA+_light with fault-
injection simulations for a set of large fault patterns, in order to stress the models. Figure
4.10 shows the fault patterns with their probabilities of occurrence. In Figure 4.10, black
squares show faulty bits in fault patterns with up to an 8×8 fault footprint. Bit faults are
in adjacent or non-adjacent bits, and the fault patterns are much more complex than the faults
in Figure 4.9. These fault patterns were generated randomly with the goal of making the
faults as complex as possible. In this set of experiments, the raw error rate
is 8.04 × 10^12 FIT/Mb and each cache block is protected by SECDED. Table 4.5
shows that the deviations of PARMA+ and PARMA+_light with respect to fault-injection
simulations are 4.3% and 164.2%, respectively. Therefore, the accuracy of PARMA+ is
very good in both sets of experiments. PARMA+_light grossly overestimates the failure
rate with the fault patterns of Figure 4.10 because these patterns flip bits in up to eight
consecutive rows and the model may count the same fault eight times.
In the fourth set of simulations, the bits in the L2 cache array are 2-way or 4-way
interleaved. 64-bit words are protected by SECDED in blocks of 256 bits. In the case of
the 2-way interleaved array, the bits of the first 64-bit word are placed in bit locations 0, 2, 4, ..., 126
and the third word is stored in bit locations 128, 130, 132, ..., 254. In the case of 4-way
interleaving, the bits of the first 64-bit word are placed in bit locations 0, 4, 8, ..., 252 and the third
64-bit word is placed in bit locations 2, 6, 10, ..., 254. In this set of simulations, the raw error
rate is 8.04 × 10^12 FIT/Mb and we use the fault patterns of Figure 4.10.
Table 4.5: PARMA+ and PARMA+_light deviations under fault patterns of Figures 4.9 and 4.10 (accelerated SEU rate)

    Benchmark | Fig. 4.9: PARMA+ (%) | Fig. 4.9: PARMA+_light (%) | Fig. 4.10: PARMA+ (%) | Fig. 4.10: PARMA+_light (%)
    gcc       | 1.8  | 5.5  | 2.9 | 268.1
    mesa      | 5.0  | 19.2 | 9.3 | 69.0
    gzip      | 2.5  | 15.8 | 4.9 | 325.7
    applu     | 0.03 | 0.5  | 3.8 | 166.4
    twolf     | 1.7  | 14.8 | 2.8 | 82.2
    mcf       | 0.7  | 7.3  | 0.9 | 235.9
    crafty    | 4.0  | 12.7 | 4.2 | 103.5
    equake    | 3.0  | 9.8  | 8.7 | 65.0
    lucas     | 2.3  | 10.7 | 4.5 | 364.4
    galgel    | 1.6  | 4.9  | 4.0 | 22.7
    ammp      | 0.08 | 0.3  | 2.4 | 50.1
    mgrid     | 1.4  | 3.9  | 3.9 | 133.9
    swim      | 1.1  | 5.5  | 5.1 | 178.2
    Average   | 1.9  | 8.4  | 4.3 | 164.2
Table 4.6: 95% confidence intervals for the fault injection experiments of Table 4.5

    Benchmark | Fault injection results (Fig. 4.9) | 95% confidence interval | Fault injection results (Fig. 4.10) | 95% confidence interval
    gcc       | 0.882111 | [0.879, 0.884]     | 0.253  | [0.245, 0.262]
    mesa      | 0.0021   | [0.00158, 0.00262] | 0.0013 | [0.00089, 0.00171]
    gzip      | 0.5727   | [0.566, 0.579]     | 0.083  | [0.077, 0.088]
    applu     | 0.9922   | [0.9920, 0.9923]   | 0.34   | [0.33, 0.35]
    twolf     | 0.0242   | [0.021, 0.027]     | 0.0104 | [0.008, 0.012]
    mcf       | 0.8165   | [0.813, 0.819]     | 0.167  | [0.160, 0.174]
    crafty    | 0.4583   | [0.451, 0.465]     | 0.165  | [0.157, 0.172]
    equake    | 0.00356  | [0.0028, 0.0042]   | 0.0023 | [0.00176, 0.00274]
    lucas     | 0.7846   | [0.780, 0.788]     | 0.1186 | [0.112, 0.124]
    galgel    | 0.0792   | [0.074, 0.084]     | 0.0187 | [0.0160, 0.0213]
    ammp      | 0.9942   | [0.9940, 0.9943]   | 0.611  | [0.605, 0.616]
    mgrid     | 0.927    | [0.926, 0.928]     | 0.332  | [0.324, 0.339]
    swim      | 0.897    | [0.895, 0.898]     | 0.1816 | [0.174, 0.188]
Figure 4.10: Large fault patterns (black squares show faulty bits); occurrence probabilities
Q1=0.2, Q2=0.1, Q3=0.15, Q4=0.05, Q5=0.15, Q6=0.11, Q7=0.1, Q8=0.14.
Table 4.7 shows that
the deviations of PARMA+ and PARMA+_light compared to fault-injection simulations
in the 2-way interleaved cache array are on average 4.8% and 121.4%, respectively.
In the 4-way interleaved cache array, the deviations are 3.4% and 85.6% on average.
Table 4.8 shows the 95% confidence intervals for the fault injection experiments used in
Table 4.7. Hence, PARMA+ is also very accurate in the case of bit-interleaving. These
results demonstrate once more that cross-domain dependencies must be accounted for
in general.
In the last set of simulations, we compare the FIT rate predictions of PARMA+ and
PARMA+_light at a fault rate of 1150 FIT/Mb, which is the raw soft error rate according
to ITRS. At this error rate, fault-injection simulations are not feasible, so we cannot
compare against them. Table 4.9 compares the FIT rates of PARMA+ and PARMA+_light under
the fault patterns of Figures 4.9 and 4.10. Each cache block is protected by SECDED.
Table 4.7: PARMA+ and PARMA+_light deviations in bit-interleaved caches (accelerated SEU rate)

    Benchmark | 2-way: PARMA+ (%) | 2-way: PARMA+_light (%) | 4-way: PARMA+ (%) | 4-way: PARMA+_light (%)
    gcc       | 2.6 | 126.8 | 7.3     | 88.9
    mesa      | 3.4 | 23.3  | 0.001   | 0.002
    gzip      | 5.2 | 272.5 | 10.3    | 280.2
    applu     | 4.3 | 147.9 | 0.8     | 106.3
    twolf     | 8.7 | 44.0  | 1.5     | 25.0
    mcf       | 6.5 | 205.1 | 1.9     | 125.3
    crafty    | 1.6 | 49.3  | 0.7     | 19.9
    equake    | 6.4 | 14.5  | 0.00005 | 0.00007
    lucas     | 7.3 | 289.2 | 7.8     | 185.6
    galgel    | 3.3 | 37.7  | 4.0     | 22.4
    ammp      | 2.8 | 46.4  | 4.1     | 39.8
    mgrid     | 7.2 | 116.2 | 5.2     | 90.0
    swim      | 3.7 | 217.1 | 1.0     | 137.5
    Average   | 4.8 | 121.4 | 3.4     | 85.6
The differences between PARMA+ and PARMA+_light are much larger with the large fault
patterns of Figure 4.10 because a fault pattern affects more domains.
The simulation time of PARMA+ is reasonable. Each benchmark simulation finished
in at most 3 hours on our server. The most time-consuming part of PARMA+ simulations
is the evaluation of the equations of Section 4.5. The simulation time of PARMA+ increases when fault
patterns affect more rows, as in Figure 4.10. In our simulations, PARMA+ increases the
simulation time on average by around one order of magnitude (10 times) compared to
native SimpleScalar. The simulation time of PARMA+_light is almost the same as that of perfor-
mance simulations without PARMA+_light, because all equations except equations (4.9),
(4.11) and (4.4) are independent of the benchmark program and can be computed at
the beginning of the simulation. Equations (4.9), (4.11) and (4.4) are also very simple
to compute.
Table 4.8: 95% confidence intervals for the fault injection experiments of Table 4.7

    Benchmark | 2-way interleaved: fault injection results | 95% confidence interval | 4-way interleaved: fault injection results | 95% confidence interval
    gcc       | 0.238  | [0.230, 0.245]     | 0.1895 | [0.182, 0.196]
    mesa      | 0.0009 | [0.00056, 0.00124] | 0      | [0, 0]
    gzip      | 0.0728 | [0.067, 0.077]     | 0.0619 | [0.055, 0.066]
    applu     | 0.3348 | [0.327, 0.342]     | 0.3193 | [0.311, 0.326]
    twolf     | 0.0056 | [0.0041, 0.0070]   | 0.0018 | [0.0009, 0.0026]
    mcf       | 0.149  | [0.142, 0.155]     | 0.1368 | [0.130, 0.143]
    crafty    | 0.1267 | [0.120, 0.132]     | 0.0666 | [0.061, 0.071]
    equake    | 0.0013 | [0.00089, 0.00171] | 0      | [0, 0]
    lucas     | 0.1145 | [0.108, 0.120]     | 0.1057 | [0.100, 0.111]
    galgel    | 0.0181 | [0.015, 0.020]     | 0.0123 | [0.010, 0.014]
    ammp      | 0.5727 | [0.566, 0.579]     | 0.4755 | [0.468, 0.482]
    mgrid     | 0.2862 | [0.278, 0.293]     | 0.2123 | [0.205, 0.219]
    swim      | 0.1809 | [0.174, 0.187]     | 0.1709 | [0.164, 0.177]
Table 4.9: PARMA+ and PARMA+_light FIT rates under fault patterns of Figures 4.9 and 4.10 (actual SEU rate)

    Benchmark | Fig. 4.9: PARMA+ | Fig. 4.9: PARMA+_light | Fig. 4.10: PARMA+ | Fig. 4.10: PARMA+_light
    gcc       | 497.9 | 582.8  | 6606.6 | 24902.5
    mesa      | 0.6   | 0.8    | 31.7   | 70.6
    gzip      | 24.8  | 305.5  | 2450.0 | 12250.1
    applu     | 822.5 | 1018.1 | 7320.2 | 40556.3
    twolf     | 8.4   | 9.8    | 376.6  | 691.1
    mcf       | 561.3 | 679.5  | 5900.1 | 26492.5
    crafty    | 131.8 | 147.9  | 3856.2 | 8321.6
    equake    | 1.0   | 1.2    | 60.9   | 108.8
    lucas     | 680.1 | 849.9  | 5558.8 | 33601.6
    galgel    | 34.9  | 37.0   | 991.0  | 1468.8
    ammp      | 165.1 | 188.1  | 2958.6 | 7562.1
    mgrid     | 244.9 | 288.6  | 7239.4 | 13031.1
    swim      | 790.9 | 977.1  | 7061.0 | 38641.9
    Average   | 299.1 | 363.3  | 3600.8 | 14867.8
4.8 PARMA+ floating point precision
The mathematical model of PARMA+ is fairly simple. However, we may need to modify
equation (4.2) in order to implement it on a practical machine.
In C and C++, the longest floating-point type is long double, which has 80 bits in
some implementations and 128 bits in others [25]. The number of decimal digits in the
mantissa of these two representations is 19 and 31, respectively. Because of rounding
of the mantissa, 1 - P_j in equation (4.2) will be equal to 1 if P_j is less than 10^{-19} (80-
bit representation) or 10^{-31} (128-bit representation). If all P_j's of a program execution
cause this rounding error, R(t2) in equation (4.2) will become equal to 1 and the FIT
rate estimate in equation (4.4) will become zero.
The criticality of the rounding problem depends on two factors:
1- The value of P_j.
2- The precision of the floating-point library.
One can determine at the beginning of the simulation whether there will be a round-
ing problem given a particular C compiler. P_j depends on several parameters: the pro-
tection domain size, the length of the vulnerability interval, the protection code, the
fault patterns and the raw error rate per bit per cycle. For a specific configuration, all of
these parameters are constant except for the length of the vulnerability interval, which
varies during a benchmark execution. For the simple approximate computations of this
section we do not consider domain dependencies. Let P be an approximation of the aver-
age of the P_j's. P can be approximated, for example, by considering the average vulnerability
interval lengths observed in Section 2.5.3 (Tavg). If P is much greater than 10^{-19} or
10^{-31} (depending on the floating-point library), the FIT rate obtained by PARMA+ sim-
ulations is reliable. Otherwise, it may not be reliable because the probability of failure
at many accesses may be ignored due to rounding errors.
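A minimal C++ check of this rounding hazard, which could be run once at the start of a simulation, is sketched below; the estimate of P is a user-supplied assumption (derived, for instance, from Tavg), and the program is not part of PARMA+ itself.

    #include <cfloat>
    #include <cstdio>

    int main() {
        // Estimated average per-access failure probability (an input assumption).
        const long double P_estimate = 1.0e-25L;

        // If 1 - P rounds back to exactly 1, every access contributes nothing to
        // R(t2) and the FIT rate estimate of equation (4.4) collapses to zero.
        const long double one_minus_p = 1.0L - P_estimate;
        if (one_minus_p == 1.0L)
            std::printf("Warning: P (%Le) is below the long double precision "
                        "(LDBL_EPSILON = %Le); use the summation form instead.\n",
                        P_estimate, LDBL_EPSILON);
        else
            std::printf("Product form of equation (4.2) is safe for P = %Le.\n",
                        P_estimate);
        return 0;
    }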
If the floating-point library does not provide enough precision for a given configura-
tion, we rewrite 1 - \prod_{j=1}^{max}(1 - P_j) in equation (4.4) as follows, to get rid of the terms 1 - P_j:

1 - \prod_{j=1}^{max}(1 - P_j) = \sum_{j=1}^{max} P_j - \sum_{j=1}^{max} \sum_{i=j+1}^{max} P_j P_i + \sum_{j=1}^{max} \sum_{i=j+1}^{max} \sum_{k=i+1}^{max} P_j P_i P_k - \cdots    (4.22)
Equation (4.22) is also difficult to implement, as we would need to save all P_j's from the
beginning of the program. If a program is large, saving all failure probabilities can fill
the entire memory space. Thus, we need an alternative solution. One simple
alternative is to approximate the right side of equation (4.22) by its first term, \sum_{j=1}^{max} P_j.
If the value of \sum_{j=1}^{max} P_j is much larger than the other terms on the right side of equation
(4.22), this approximation is valid. If P is 10^{-19} (for the 80-bit long double implementa-
tion) and the program executes 1 trillion reads and write-backs in the simulated cache
(arguably a number way beyond the reach of simulators in the foreseeable future), the
value of \sum_{j=1}^{max} P_j is around 10^{-10}, while the value of \sum_{j=1}^{max} \sum_{i=j+1}^{max} P_j P_i is around
10^{-20}. Therefore, \sum_{j=1}^{max} P_j is much larger than all the other terms and is a reliable approxi-
mation to equation (4.22). Note that if P is less than 10^{-19}, the approximation would be
even more accurate.
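The two accumulation strategies can be sketched as follows in C++; which one is used would be chosen once at initialization, as described above. The class and member names are ours, not the PARMA+ tool's actual interface.

    #include <cstdio>

    // Accumulates the program failure probability either as the product form of
    // equation (4.2) or as the first-term approximation of equation (4.22),
    // depending on whether long double precision can represent 1 - P_j.
    class FailureAccumulator {
    public:
        explicit FailureAccumulator(bool use_sum_approximation)
            : use_sum_(use_sum_approximation) {}

        void add_access(long double p_j) {
            if (use_sum_)
                sum_ += p_j;                  // running sum of P_j
            else
                reliability_ *= (1.0L - p_j); // product form of equation (4.2)
        }

        // Probability that at least one access failed during the run.
        long double failure_probability() const {
            return use_sum_ ? sum_ : (1.0L - reliability_);
        }

    private:
        bool use_sum_;
        long double sum_ = 0.0L;
        long double reliability_ = 1.0L;
    };

    int main() {
        FailureAccumulator acc(/*use_sum_approximation=*/true);
        for (int j = 0; j < 1000000; ++j)
            acc.add_access(1.0e-25L);         // far below long double precision
        std::printf("%Le\n", acc.failure_probability());  // ~1e-19, not zero
        return 0;
    }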
4.9 PARMA+ tool
This section explains the structure of the PARMA+ tool. The tool is implemented on
top of SimpleScalar and it measures the FIT rate of the L2 cache. In the PARMA+ tool, it
is assumed that the size of a cache row is equal to the cache line size.
The tool has an interface which receives several inputs, listed below; a sketch of how
these inputs might be gathered is given after the list.
1- Raw FIT rate (FIT/Mbit): This is the rate at which cache bits are flipped by
energetic particles.
2- Processor frequency (GHz): The frequency of the processor is fed into the tool.
3- Access granularity (bits): The number of bits read in each cache access.
4- Domain size (bits): The number of bits protected by an error protection code.
5- Protection code: There are three options in the tool: 2 refers to SECDED, 1 is
parity, and 0 is no protection.
6- Mode: The PARMA+ tool has 4 operating modes. Mode 1 refers to the case in
which up to two DSEUs are considered in each vulnerability interval and the dependency
is considered for both 1 DSEU and 2 DSEUs. In mode 1, P_j is computed as:

P_j = P(1 \text{ DSEU}) \sum_{k=1}^{S} P(\text{Distribution } k \mid 1 \text{ DSEU}) \cdot P(\text{Domain failure with no failure in neighbors} \mid 1 \text{ DSEU and Distribution } k)
    + P(2 \text{ DSEUs}) \sum_{k=1}^{S} P(\text{Distribution } k \mid 2 \text{ DSEUs}) \cdot P(\text{Domain failure with no failure in neighbors} \mid 2 \text{ DSEUs and Distribution } k)

Mode 2 refers to the case in which there is at most one DSEU in a vulnerability interval
and the dependency is computed. In mode 2, P_j is computed as:

P_j = P(1 \text{ DSEU}) \sum_{k=1}^{S} P(\text{Distribution } k \mid 1 \text{ DSEU}) \cdot P(\text{Domain failure with no failure in neighbors} \mid 1 \text{ DSEU and Distribution } k)

Mode 3 refers to the case in which one DSEU is considered and the dependency is ignored.
In mode 3, P_j is equal to:

P_j = P(1 \text{ DSEU}) \frac{N_{Fail}}{N_{DSEU}}

Mode 4 refers to the case in which up to 2 DSEUs are considered but the dependency is
ignored for both the case of 1 DSEU and the case of 2 DSEUs. In this mode, P_j is:

P_j = P(1 \text{ DSEU}) \frac{N_{Fail}}{N_{DSEU}} + P(2 \text{ DSEUs}) \frac{N_{Fail}}{(N_{DSEU})^2}
7- Interleaving degree: This determines the degree of bit-interleaving.
8- Fault patterns and their probabilities: Currently, the tool receives up to 10 fault
patterns, each contained in an 8×8 square.
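To make the interface concrete, the sketch below gathers the inputs above into a hypothetical C++ structure with an example configuration; the field names, the 8×8 bitmap encoding of fault patterns and the example values are our assumptions for illustration, not the tool's actual input format.

    #include <cstdint>
    #include <vector>

    // Hypothetical container for the PARMA+ tool inputs listed above.
    struct ParmaPlusInputs {
        double raw_fit_per_mbit;          // 1- raw FIT rate (FIT/Mbit)
        double frequency_ghz;             // 2- processor frequency (GHz)
        uint32_t access_granularity_bits; // 3- bits read per cache access
        uint32_t domain_size_bits;        // 4- bits per protection domain
        int protection_code;              // 5- 2 = SECDED, 1 = parity, 0 = none
        int mode;                         // 6- operating mode, 1..4
        int interleaving_degree;          // 7- degree of bit-interleaving
        struct FaultPattern {             // 8- up to 10 patterns in an 8x8 square
            uint8_t rows[8];              //    one bit per cell, row-major
            double probability;           //    Q_i
        };
        std::vector<FaultPattern> patterns;
    };

    // Example: word-level SECDED at the actual ITRS SEU rate, mode 1,
    // no interleaving, and a single-bit fault pattern with probability 1.0.
    const ParmaPlusInputs example = {
        1150.0, 3.0, 256, 64, 2, 1, 1,
        { { {0x80, 0, 0, 0, 0, 0, 0, 0}, 1.0 } }
    };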
To compute the FIT rate, the tool first runs an initialization function which is executed only
once at the beginning of the program. The initialization function first reads the input
file and saves the inputs in variables. Then it computes N^i_DSEU, N^i_Fail and N^{i,m}_Fail (if the
mode is 1 or 4). Furthermore, the initialization function decides how the tool deals with
the rounding problem based on the input parameters.
At the time of each read or write-back, the probability that the access fails is com-
puted. This probability is computed based on the operating mode. When dependency is
considered, N^i_SubFailk and/or N^{i,m}_SubFailk is computed for each access. This step is time-
consuming and increases the simulation time. When an access happens to a protection
domain, the neighbors of that protection domain save the time and the type of the access,
as they need this information later when they consider dependencies.
After computing the probability of an access failure, we use it either in \sum_{j=1}^{max} P_j
or in equation (4.2). At the end of the program, PARMA+ prints the average failure rate
and the FIT rate.
4.10 New empirical model for multi-bit faults
In this section, we propose a very simple model to measure the FIT rate of a cache
under spatial multi-bit faults for a single DSEU. If simulation time is a concern, this
model may be used. In this model, we define several variables as follows. Cache_Clean
is the average percentage of clean data in the cache, while Cache_Dirty is the average
percentage of dirty data in the cache over a benchmark. P_Mask is the total number of
writes in the cache plus the misses in clean blocks of the target cache in a program, divided
by the total number of accesses to the cache. P_Mask is the probability that faults in a
block would be masked by writes or by misses in clean blocks. We define CleanFail_i as
the number of rows which fail with fault pattern i if all of these rows are clean. We
define DirtyFail_i as the number of rows which fail with fault pattern i if all of the rows
are dirty. It is assumed that different rows contain different domains.
The FIT rate of the cache can be approximated by the following formula. In effect, we
are multiplying the raw FIT rate by an approximation of the AVF: in the formula, the raw
FIT rate is multiplied by the probability that an SEU that occurs will cause a failure. To
compute this probability, we compute the probability that an SEU occurs in dirty or
clean data. If it occurs in clean data, Cache_Clean \sum_{i=1}^{N} Q_i (1 - (P_Mask)^{CleanFail_i}) refers
to the probability that the produced fault pattern will not be masked in the clean blocks,
which means that there would not be masking (there would be a read or write-back) in
at least one of the failed rows, which would cause a failure of the cache. If the SEU occurs
in dirty data, Cache_Dirty \sum_{i=1}^{N} Q_i (1 - (P_Mask)^{DirtyFail_i}) refers to the probability
that there would not be masking (there would be a read or write-back) in at least one of
the dirty rows, which would cause a failure of the entire cache.

Table 4.10: The empirical model results and deviations from PARMA+

    Benchmark | Fig. 4.9: FIT rate | Fig. 4.9: Deviation from PARMA+ (%) | Fig. 4.10: FIT rate | Fig. 4.10: Deviation from PARMA+ (%)
    gcc       | 573.3 | 15.1  | 7357.0 | 11.3
    mesa      | 249.9 | 41550 | 7359.9 | 13210
    gzip      | 448.6 | 1700  | 7337.2 | 199
    applu     | 761.5 | 7.5   | 7357.5 | 0.5
    twolf     | 36.3  | 2710  | 7357.9 | 1853
    mcf       | 747.2 | 33    | 7358.3 | 24.7
    crafty    | 399.7 | 203   | 7359.8 | 90.8
    equake    | 310.9 | 30990 | 7359.7 | 11980
    lucas     | 346.0 | 49.2  | 6689.7 | 20.3
    galgel    | 166.1 | 475   | 6099.1 | 515
    ammp      | 52.9  | 68    | 3250.9 | 9.8
    mgrid     | 175.4 | 38.4  | 6212.2 | 14.2
    swim      | 416.5 | 47.4  | 6975.6 | 1.3
\text{FIT rate} = \text{rawFIT} \cdot \Big( Cache_{Clean} \sum_{i=1}^{N} Q_i \big(1 - (P_{Mask})^{CleanFail_i}\big) + Cache_{Dirty} \sum_{i=1}^{N} Q_i \big(1 - (P_{Mask})^{DirtyFail_i}\big) \Big)    (4.23)
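Equation (4.23) translates directly into a short routine; in the sketch below the function name is ours, Cache_Clean and Cache_Dirty are assumed to be fractions in [0, 1], and all inputs would come from a profiling run of the benchmark.

    #include <cmath>
    #include <vector>

    // Equation (4.23): approximate FIT rate from profiled quantities.
    // q[i] is Q_i; clean_fail[i] and dirty_fail[i] are CleanFail_i and DirtyFail_i.
    double empirical_fit_rate(double raw_fit,
                              double cache_clean, double cache_dirty,  // fractions in [0, 1]
                              double p_mask,
                              const std::vector<double>& q,
                              const std::vector<int>& clean_fail,
                              const std::vector<int>& dirty_fail) {
        double clean_term = 0.0, dirty_term = 0.0;
        for (size_t i = 0; i < q.size(); ++i) {
            clean_term += q[i] * (1.0 - std::pow(p_mask, clean_fail[i]));
            dirty_term += q[i] * (1.0 - std::pow(p_mask, dirty_fail[i]));
        }
        return raw_fit * (cache_clean * clean_term + cache_dirty * dirty_term);
    }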
Table 4.10 shows the FIT rates produced by the approximate model and its deviations
from PARMA+. As shown, for some benchmarks the results are very close, while for
others they differ widely. This is because the approximate model ignores
many complexities, such as dependencies.
Chapter 5
Conclusions and Future Research
Directions
Reliability is an important concern in computer systems. Memories are one of the most
vulnerable parts of a computer system due to their structure and their large size. The
most popular way to improve memory reliability is through information redundancy
which decreases the effective size of the memory and it can also cause energy and per-
formance overheads. These overheads are more critical in high-level memories such as
cache memories for several reasons. First, caches are on the processor chip and they
have a limited size. Thus, they are very expensive. Second, caches significantly impact
the system performance as they are very close to the processor. Therefore, applying
redundant information to the cache is a critical task and designers try to increase the
cache reliability while decreasing area, performance and energy consumption overheads
as much as possible.
This thesis studies the reliability of cache memories against soft errors. Soft errors are
produced by energetic particles which hit SRAM cells and change the values of bits
from 0 to 1 and vice versa. There are different approaches to provide high reliability in
caches. However, they all have high costs. Hence, the main challenge in having reliable
caches is providing reliability at low costs.
This thesis aims at reaching this goal of high reliability at low cost in two ways.
First, it provides a very low cost error protection scheme for caches called CPPC. In CPPC,
error correction is added to a parity-protected cache. Errors in clean data are recovered
from the lower level memory, while errors in dirty data are corrected using the XOR of
all dirty blocks of the cache, which is maintained by CPPC during normal cache operations.
CPPC is efficient in the L1 cache but it is even more efficient in lower level caches.
CPPC adds only 0.6% energy overhead and practically no other overheads to an inclusive
parity-protected L2 cache, while it provides both single-bit and spatial multi-bit error
correction capabilities. The cost of CPPC is almost the same as that of parity in an
exclusive L2 cache. Moreover, CPPC can finely adjust the level of reliability against both
single-bit and multi-bit faults, which is a distinctive feature.
This thesis also provides ideas to extend CPPC to main memory in a scheme called Chip
Independent Error Correction (CIEC). The cost of CIEC is independent of the DIMM,
so the scheme can be applied to different DIMMs. Furthermore, it has very low area
overhead compared to other existing schemes.
Second, this thesis proposes a rigorous model to measure the cache FIT rate. To select
a suitable combination of error protection schemes in the cache, we need to measure
the FIT rate of the cache in addition to its energy consumption and performance. While
there are effective tools to evaluate energy consumption and performance, there is no
rigorous tool to measure the cache FIT rate. Due to shrinking feature sizes, soft error
patterns have changed, and all soft errors will become spatial multi-bit by 2016, as
ITRS predicts. Modeling spatial multi-bit faults is much more difficult than modeling single-bit
faults. To measure the FIT rate of caches in the presence of spatial multi-bit faults, this thesis
proposes a model called PARMA+. PARMA+ is applicable to any number of single-bit
and multi-bit faults, different error protection codes, and schemes such as bit-interleaving
and cache scrubbing. Furthermore, PARMA+ has high accuracy, as shown
through extensive experiments. By precisely measuring the FIT rate, PARMA+ greatly
helps designers select an optimal error protection scheme for caches.
This work can be continued by extending PARMA+ to other cache error protection
schemes. Specifically, we are interested in having PARMA+ measure the FIT rate of CPPC.
We would then be able to compare the FIT rate of CPPC with that of SECDED very
accurately.
Reference List
[1] International Technology Roadmap for Semiconductors (ITRS), 2011 Edition.
[2] G. Asadi, V. Sridharan, M. Tahoori, and D. Kaeli. Balancing Performance and
Reliability in the Memory Hierarchy. In Proc. of International Symposium on
Performance Analysis of Systems and Software, pages 269–279, 2005.
[3] H. Asadi, V. Sridharan, M. Tahoori, and D. Kaeli. Vulnerability analysis of L2
cache elements to single event upsets. In Proc. of Design, Automation and Test
in Europe (DATE), pages 1276–1281, 2006.
[4] T. Austin, E. Larson, and D. Ernst. Simplescalar: An Infrastructure for Computer
System Modeling. Computer, 35(2):59–67, 2002.
[5] A. Biswas, P. Racunas, R. Cheveresan, J. Emer, S. Mukherjee, and R. Rangan.
Computing Architectural Vulnerability Factors for Address-based Structures. In
Proc. of International Symposium on Computer Architecture (ISCA), pages 532–
543, 2005.
[6] V. Degalahal, L. Li, V. Narayanan, M. Kandemir, and M. J. Irwin. Soft errors
issues in low-power caches. IEEE Transactions on VLSI Systems, 10(13):1157–
1165, 2005.
[7] M. Finkelstein. Failure Rate Modeling for Reliability and Risk. Springer, 2008.
[8] K. Flautner, N. S. Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy caches: sim-
ple techniques for reducing leakage power. In Proc. of International Symposium
on Computer Architecture (ISCA), pages 148–157, 2002.
[9] P. Hoang. Handbook of Engineering Statistics. Springer, 2006.
[10] S. Huntzicker, M. Dayringer, J. Soprano, A. Weerasinghe, D. Harris, and D. Patil.
Energy-delay tradeoffs in 32-bit static shifter designs. In Proc. of International
Conference on Computer Design (ICCD), pages 626–632, 2008.
[11] A. Hwang, I. Stefanovici, and B. Schroeder. Cosmic Rays Don't Strike Twice:
Understanding the Nature of DRAM Errors and the Implications for System
Design. In Proc. of International Conference on Architectural Support for Pro-
gramming Languages and Operating Systems (ASPLOS), pages 111–122, 2012.
[12] J. Kim, N. Hardavellas, K. Mai, B. Falsafi, and J. C. Hoe. Multi-bit Error Tolerant
Caches Using Two Dimensional Error Coding. In Proc. of International Sympo-
sium on MicroArchitecture (MICRO), pages 197–209, 2007.
[13] S. Kim. Area-Efficient Error Protection for Caches. In Proc. of Design, Automation
and Test in Europe (DATE), pages 1–6, 2006.
[14] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded
Sparc processor. IEEE Micro, 25(2):21–29, 2005.
[15] L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin. Soft Error
and Energy Consumption Interactions: a Data Cache Perspective. In Proc. of Inter-
national Symposium on Low Power Electronics and Design (ISLPED), pages 132–
137, 2004.
[16] X. D. Li, S. V. Adve, P. Bose, and J. A. Rivers. SoftArch: An Architecture Level
Tool for Modeling and Analyzing Soft Errors. In Proc. of International Conference
on Dependable Systems and Networks (DSN), pages 496–505, 2005.
[17] J. Maiz, S. Hareland, K. Zhang, and P. Armstrong. Characterization of Multi-bit
Soft Error Events in Advanced SRAMs. In Proc. of IEEE International Electron
Devices Meeting, pages 21.4.1–21.4.4, 2003.
[18] K. T. Malladi, F. Nothaft, K. Periyathambi, B. Lee, C. Kozyrakis, and M. Horowitz.
Towards Energy-Proportional Datacenter Memory with Mobile DRAM. In
Proc. of International Symposium on Computer Architecture (ISCA), pages 37–48,
2012.
[19] M. Manoochehri, M. Annavaram, and M. Dubois. CPPC: Correctable Parity Pro-
tected Cache. In Proc. of International Symposium on Computer Architecture
(ISCA), pages 223–234, 2011.
[20] S. S. Mukherjee, J. Emer, T. Fossum, and S. K. Reinhardt. Cache Scrubbing in
Microprocessors: Myth or Necessity? In Proc. of IEEE Pacific Rim Symposium on
Dependable Computing (PRDC), pages 37–42, 2004.
[21] V. Pandey, W. Jiang, Y. Zhou, and R. Bianchini. DMA-aware memory energy
management for data servers. In Proc. of International Symposium on High Per-
formance Computer Architecture (HPCA), pages 133–144, 2006.
[22] N. Quach. High availability and reliability in the Itanium processor. IEEE Micro,
20(5):61–69, 2000.
[23] S. Baeg, S. Wen, and R. Wong. SRAM Interleaving Distance Selection with a Soft
Error Failure Model. IEEE Transactions on Nuclear Science, 56(4):2111–2118, 2009.
[24] N. N. Sadler and D. J. Sorin. Choosing an Error Protection Scheme for a Micropro-
cessor's L1 Data Cache. In Proc. of International Conference on Computer Design
(ICCD), pages 499–505, 2006.
[25] G. M. Saeed. An Introduction to Object-Oriented Programming in C++. Morgan
Kaufmann Publishers Inc., 2001.
[26] A. Saleh, J. Serrano, and J. H. Patel. Reliability of Scrubbing Recovery Techniques
for Memory Systems. IEEE Transactions on Reliability, 39(1):114–122, 1990.
[27] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characteriz-
ing large scale program behavior. In Proc. of International Conference on Archi-
tectural Support for Programming Languages and Operating Systems (ASPLOS),
pages 45–57, 2002.
[28] J. Sim, G. Loh, V. Sridharan, and M. Connor. Resilient die-stacked DRAM caches.
In Proc. of International Symposium on Computer Architecture (ISCA), pages 416–
427, 2013.
[29] V. Sridharan, H. Asadi, M. B. Tahoori, and D. Kaeli. Reducing Data Cache Suscep-
tibility to Soft Errors. IEEE Transactions on Dependable and Secure Computing,
3(4):353–364, 2006.
[30] J. Suh, M. Annavaram, and M. Dubois. MACAU: A Markov model for reliability
evaluations of caches under single-bit and multi-bit upsets. In Proc. of Interna-
tional Symposium on High Performance Computer Architecture (HPCA), pages
3–14, 2012.
[31] J. Suh, M. Manoochehri, M. Annavaram, and M. Dubois. Soft error benchmarking
of L2 caches with PARMA. In Proc. of International Conference on Measurement
and Modeling of Computer Systems (SIGMETRICS), pages 85–96, 2011.
[32] A. D. Tipton, J. A. Pellish, J. M. Hutson, R. Baumann, X. Deng, A. Marshal, M. A.
Xapsos, H. S. Kim, M. R. Friendlich, M. J. Campola, C. M. Seidleck, K. A. LaBel,
M. H. Mendenhall, R. Reed, R. Schrimpf, R. Weller, and J. D. Black. Device-
Orientation Effects on Multiple-Bit Upset in 65 nm SRAMs. IEEE Transactions
on Nuclear Science, 55(6):2880–2885, 2008.
[33] A. N. Udipi, N. Muralimanohar, R. Balasubramonian, A. Davis, and N. Jouppi.
LOT-ECC: LOcalized and Tiered Reliability Mechanisms for Commodity Memory
Systems. In Proc. of International Symposium on Computer Architecture (ISCA),
pages 285–296, 2012.
[34] A. N. Udipi, N. Muralimanohar, N. Chatterjee, R. Balasubramonian, A. Davis, and
N. Jouppi. Rethinking DRAM Design and Organization for Energy-Constrained
Multi-Cores. In Proc. of International Symposium on Computer Architecture
(ISCA), pages 175–186, 2010.
[35] S. Wang, J. Hu, and S. Ziavras. On the characterization and optimization of on-
chip cache reliability against soft errors. IEEE Transactions on Computers (TC),
58(9):1171–1184, 2009.
[36] S. J. E. Wilton and N. P. Jouppi. CACTI: An Enhanced Cache Access and Cycle
Time Model. IEEE Journal of Solid-State Circuits, 31(5):677–688, 1996.
[37] D. H. Yoon and M. Erez. Memory Mapped ECC: Low-Cost Error Protection for
Last Level Caches. In Proc. of International Symposium on Computer Architecture
(ISCA), pages 83–93, 2009.
[38] D. H. Yoon and M. Erez. Virtualized and Flexible ECC for Main Memory. In
Proc. of International Conference on Architectural Support for Programming Lan-
guages and Operating Systems (ASPLOS), pages 397–408, 2010.
[39] W. Zhang. Replication Cache: a Small Fully Associative Cache to Improve Data
Cache Reliability. IEEE Transactions on Computers, 54(12):1547–1555, 2005.
[40] W. Zhang, S. Gurumurthi, M. Kandemir, and A. Sivasubramaniam. ICR: In-
Cache Replication for Enhancing Data Cache Reliability. In Proc. of International
Symposium on MicroArchitecture (MICRO), pages 291–300, 2003.