LOW COST FAULT HANDLING MECHANISMS FOR
MULTICORE AND MANY-CORE SYSTEMS
By
Waleed Dweik
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)
May 2015
Copyright 2015 Waleed Dweik
Dedication
To my beloved parents
To my sweet sisters
To my precious wife, Huda
To my little blessing, baby Taiba
Acknowledgments
First and foremost, I would like to express my special appreciation and gratitude
to my advisor Professor Murali Annavaram. I would like to thank you for your valuable
advice, continuous encouragement and support, and very long research discussions. I
would also like to thank the members of my defense committee, Professor Michel Dubois
and Professor William G.J. Halfond for their constructive comments and suggestions
which helped to put this dissertation in the best shape. Specially, I would like to thank
Professor Michel Dubois, who co-advised me during my first year as a PhD student, for
his insightful guidance. In addition, I would like to acknowledge Professor Timothy
Pinkston and Professor Jeff Draper for serving on my qualifying exam committee.
Special thanks also go to Professor Gandhi Puvvada whom I had the pleasure to work
with as a teaching assistant and whose dedication to excellent teaching has been a source
of inspiration to me. I would also like to thank Professor Sandeep Gupta for his useful
research advice and feedback.
Second, I would like to thank the graduated and current PhD students whom I
worked with in the Super Computing in Pocket (SCIP) research group for their
collaboration and beneficial discussions: Mohammad Abdel-Majeed, Melina Demertzi,
Hyeran Jeon, Gunjae Koo, Suk Hun Kang, Jinho Suh, Abdulaziz Tabbakh, Daniel Wong,
Qiumin Xu, and Bardia Zandian. Specially, I would like to thank my lab mate and close
friend Mohammad Abdel-Majeed for his fruitful collaboration, which led to the GPU
reliability work presented in this dissertation, and his continuous support and
encouragement. Exceptional appreciations go to the following administrative members of
the Ming Hsieh Department of Electrical Engineering for their help in dealing with the
vagaries of graduate school: Tim Boston, Diane Demetras, Christina Fontenot, Estela
Lopez, and Janice Thompson. Funding for my research was made possible by National
Science Foundation grants NSF-CAREER-0954211, NSF-0834798, and NSF-1219186.
My research was also funded by Defense Advanced Research Projects Agency grant
DARPA-PERFECT-HR0011-12-2-0020 and the University of Jordan.
Third, I would like to thank all my friends for their support and well-
wishes/prayers throughout my graduate studies. Special thanks go to my friends at USC:
Mohammad Abdel-Majeed, Abdullah Alfarrarjeh, Dr. Anas Al Majali, Laith Alshalalfeh,
Daoud Burghal, Wael Elhaddad, Naumaan Nayyar, and Mohammed Alfasi. The debates,
dinners and barbecues, soccer games and trips, rides to the airport, and true friendship are
all greatly appreciated. Particularly, I would like to thank my trusted friend Dr. Anas Al
Majali, whom I shared a studio with during the first four years of my graduate studies, for
being always there through thick and thin.
Last but not least, I would like to thank my family for their immense support and
encouragement. Special thanks and appreciations go to my parents-in-law for looking
after my wife Huda and little angel Taiba during the last three months of my graduate
studies. I would like to express my deepest gratefulness and appreciation to my respected
parents and sweet sisters who have been always there for me throughout my life and
whose love and sacrifices are the recipe for all my accomplishments. I am greatly
indebted to my beloved wife Huda for her help and support which made much of the
work on this dissertation not only enjoyable but possible. She instilled me with much
needed confidence which enabled me to complete my PhD degree successfully.
Above all, all praise and thanks are due only to Allah (God), the One by Whose
blessing and favor good works are perfected.
Table of Contents
Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 In-field fault detection
  1.2 In-field fault diagnosis, recovery, and tolerance
  1.3 In-field fault handling in GPUs
  1.4 Contributions
Chapter 2: Usage-Signature Adaptive Periodic Testing
  2.1 Introduction
  2.2 Related Work
    2.2.1 Usage-failure causality
    2.2.2 Periodic fault detection
  2.3 Heterogeneous Microprocessor Usage Signature
    2.3.1 Inter-epoch heterogeneity
    2.3.2 Inter-application heterogeneity
    2.3.3 Inter-block heterogeneity
    2.3.4 Exploiting usage heterogeneity
  2.4 SignTest Design Components
    2.4.1 Usage tracker for C-SignTest
    2.4.2 Usage tracker for R-SignTest
    2.4.3 Testing engine
  2.5 Experimental Methodology
    2.5.1 Test patterns/programs
    2.5.2 Benchmarks
    2.5.3 Energy and performance overheads evaluation
    2.5.4 Area overhead evaluation
  2.6 Evaluation Results
    2.6.1 Energy and performance for C-SignTest
    2.6.2 Energy and performance for R-SignTest
    2.6.3 Fault coverage vs. testing overheads tradeoffs for R-SignTest
    2.6.4 Area overhead
  2.7 Summary and Conclusions
Chapter 3: Distinguishing and Tolerating Intermittent Faults
  3.1 Introduction
  3.2 Why to Distinguish Intermittent Faults
  3.3 Related Work
  3.4 Building Blocks for RAEs Design
    3.4.1 Fault detection
    3.4.2 Disabling the faulty entry
    3.4.3 RAE handler: fault classification
    3.4.4 RAE handler: fault correction
    3.4.5 De-configuration through ISA extensions and sleep transistors
  3.5 Evaluation Methodology
    3.5.1 Fault rate
    3.5.2 Fault type
    3.5.3 Fault model and period
  3.6 Experimental Results
    3.6.1 Reliability evaluation
    3.6.2 Performance evaluation
    3.6.3 Area overhead
  3.7 Summary and Conclusions
Chapter 4: Tolerating Hard Faults in GPUs
  4.1 Introduction
  4.2 Related Work
  4.3 Background and Motivation
    4.3.1 GPU baseline architecture
    4.3.2 Resource utilization imbalance
  4.4 Warped-Shield Fault Tolerance Techniques
    4.4.1 Thread shuffling
    4.4.2 Dynamic warp deformation
    4.4.3 Inter-SP warp shuffling
  4.5 Architectural Support
    4.5.1 Intra-cluster thread shuffling
    4.5.2 Warp deformation and scheduling
    4.5.3 Inter-SP warp shuffling
    4.5.4 Issuing deformed warps
  4.6 Experimental Evaluation
    4.6.1 Methodology
    4.6.2 Common case fault map
    4.6.3 Worst case fault map
    4.6.4 Asymmetric fault maps
    4.6.5 Intra-cluster shuffling vs. dynamic warp deformation
    4.6.6 Area and power overheads
  4.7 Summary and Conclusions
Chapter 5: Fault Detection and Correction in GPUs
  5.1 Introduction
  5.2 Related Work
  5.3 Background
  5.4 Opportunistic Redundant Execution
    5.4.1 Inherent redundancy
    5.4.2 Underutilization of SIMT lanes
    5.4.3 Dynamic warp deformation
  5.5 Fault Detection: Opportunistic DMR Mode
    5.5.1 Opportunistic DMR granularity
    5.5.2 Quantifying opportunistic DMR
    5.5.3 DMR execution using dynamic warp deformation
  5.6 Fault Detection and Correction: Opportunistic TMR Mode
    5.6.1 Opportunistic TMR granularity
    5.6.2 Quantifying opportunistic TMR
    5.6.3 TMR execution using dynamic warp deformation
  5.7 Warp Replay
  5.8 Architectural Support
    5.8.1 Inherent redundancy and deformation analysis stage
    5.8.2 Sub-warps active masks generation and thread replication stage
    5.8.3 Fault detection and correction stage
  5.9 Warped-RE Design Alternatives
  5.10 Experimental Evaluation
    5.10.1 DMR mode evaluation
    5.10.2 TMR mode evaluation
    5.10.3 Area and power overheads
  5.11 Summary and Conclusions
Chapter 6: Conclusions and Future Work
References
List of Tables
Table 1.1: In-field fault detection tradeoffs.
Table 2.1: Functional blocks names and abbreviations.
Table 2.2: Usage-fault coverage mapping.
Table 4.1: Intra-cluster thread shuffling control logic.
Table 4.2: Inter-SP warp shuffling.
Table 5.1: TMR inherent redundancy (TIR) vector description.
Table 5.2: Warp deformation control per DMR cluster.
Table 5.3: Warp deformation control per TMR cluster.
Table 5.4: DMR cluster sub-warps active masks.
Table 5.5: L0 thread replication MUX control.
List of Figures
Figure 2.1: Execution blocks usage-based epoch classification.
Figure 2.2: Load/store blocks usage-based epoch classification.
Figure 2.3: Usage level vs. fault sites.
Figure 2.4: Usage tracker for C-SignTest.
Figure 2.5: Usage tracker for R-SignTest.
Figure 2.6: Test pattern shuffling.
Figure 2.7: Fault coverage vs. test patterns.
Figure 2.8: Relative overheads of C-SignTest on top of BIST methodology.
Figure 2.9: Relative overheads of C-SignTest on top of DFT methodology.
Figure 2.10: Relative overheads of C-SignTest and ADAPT on top of SBST.
Figure 2.11: Relative overheads and absolute fault coverage for R-SignTest and block-level ADAPT.
Figure 2.12: Fault coverage vs. testing overheads (R-SignTest).
Figure 3.1: RAEs implementation.
Figure 3.2: Relative ROB MTTC.
Figure 3.3: Relative LSQ MTTC.
Figure 4.1: Baseline Nvidia GTX480 design.
Figure 4.2: Thread activity breakdown.
Figure 4.3: Average SIMT lane utilization.
Figure 4.4: Thread shuffling and mapping techniques.
Figure 4.5: Thread mapping impact on shuffling opportunities.
Figure 4.6: Intra-cluster thread shuffling analysis.
Figure 4.7: Dynamic warp deformation.
Figure 4.8: Deformation of two SIMT clusters.
Figure 4.9: Intra-cluster thread shuffling implementation.
Figure 4.10: RASp unit design.
Figure 4.11: Performance overhead for common and worst fault maps (symmetric fault maps).
Figure 4.12: Performance overhead of asymmetric fault maps.
Figure 4.13: Contributions of fault tolerance techniques.
Figure 5.1: TMR voter and comparator.
Figure 5.2: Underutilization of SIMT lanes.
Figure 5.3: DMR inherent redundancy.
Figure 5.4: Opportunistic DMR.
Figure 5.5: Warp deformation in DMR mode.
Figure 5.6: TMR inherent redundancy.
Figure 5.7: Opportunistic TMR.
Figure 5.8: Warp deformation for regular TMR cluster.
Figure 5.9: Warp deformation for special TMR cluster.
Figure 5.10: SP unit occupancy.
Figure 5.11: Inherent redundancy and deformation analysis stage.
Figure 5.12: Inherent redundancy detection logic.
Figure 5.13: Sub-warps active masks generation and thread replication stage.
Figure 5.14: DMR cluster RASp unit.
Figure 5.15: TMR cluster RASp unit.
Figure 5.16: Thread replication hardware support.
Figure 5.17: Select logic for L0's 3:1 MUX.
Figure 5.18: Fault detection and correction for DMR and TMR clusters.
Figure 5.19: Fault detection and correction for special TMR cluster.
Figure 5.20: Design alternative functionality.
Figure 5.21: DMR mode performance overhead.
Figure 5.22: Opportunistic DMR breakdown.
Figure 5.23: TMR performance overhead.
Figure 5.24: Opportunistic TMR breakdown.
Abstract
As technology scales further down in the nanometer regime, chip manufacturers
are able to integrate billions of transistors on a single chip which boost the performance
of today's multicore and many-core systems. On the other hand, smaller transistor
devices become more vulnerable to various faults due to higher probability of
manufacturing defects, higher susceptibility to single event upsets, more process
variations and faster wearout rates. Nowadays, computer chips are tested extensively
during the post-fabrication process to weed out any chips that do not meet the functional
specifications. The chips that meet the functional specifications are then used in building
computer systems. Once these systems, particularly in low-end consumer market
segments, enter the in-field operation, they are not actively tested, other than using simple
error correcting codes to deal with soft errors. Since reliability is a critical requirement
for high-end mission-critical systems, they traditionally employ system-level redundancy
for in-field monitoring and handling of faults. However, technology scaling is expected to
make reliability a first-order concern for the in-field operation of even low-end
computing systems. Thus, in-field fault handling mechanisms that can detect, diagnose,
recover and tolerate different kinds of faults have to be deployed.
Unlike mission critical systems that can afford expensive fault handling
mechanisms, low-end computing systems are highly cost-sensitive. As a result, in-field
fault handling mechanisms have to be stringently cost-effective and should achieve the
highest fault coverage for a given area, performance and power overheads. Faults that
happen during in-field operation (i.e. in-field faults) are classified into three categories:
transient, intermittent, and permanent. Specialized fault handling mechanisms can be
deployed orthogonally in order to mitigate the effects of the three fault categories. This
dissertation presents four techniques for ultra-low cost in-field fault handling in chip
multiprocessors (CMPs) and many-core systems such as graphics processing units
(GPUs).
In the first part of this dissertation we present SignTest, a usage-signature
adaptive periodic testing mechanism for microprocessors in multicore systems. SignTest
strives to reduce the testing energy and time while maintaining high coverage against
permanent faults. To achieve this goal, SignTest tracks the usage of the microprocessor
functional blocks during every execution epoch and dynamically steers the testing phase
according to the usage level of every functional block. The evaluation results show that a
conservative implementation of SignTest maintains maximum fault coverage with up to
24% savings in the testing energy and time. Alternatively, a relaxed implementation of
SignTest achieves up to 51% savings in the testing energy and time while covering 93%
of the expected permanent fault sites.
Once approaches such as SignTest make it feasible to do periodic fault detection
at a very low cost, the next concern is how to use the detection outcomes to improve fault
diagnosis and recovery. The second part of this dissertation focuses on improving the
classification granularity of in-field faults. A new class of exceptions called reliability-
aware exceptions (RAEs) is proposed. RAEs use a fault history log to classify faults
detected in the microprocessor array structures into transient, intermittent, and
permanent. Most of the previously proposed approaches classify faults as transient or
non-transient, where all non-transient faults are handled as permanent faults. But treating
all non-transient faults as permanent faults leads to premature and often unnecessary
component replacement demands. Distinguishing intermittent faults and handling them as
such improves the effectiveness of fault handling by reducing the performance
degradation while simultaneously slowing down device wearout.
The RAE handlers have the ability to manipulate faulty entries in the array
structures to recover and tolerate all three categories of faults. For transient faults, the
RAE handler leverages the existing roll-back mechanism to start re-execution from the
instruction which triggers the fault. For intermittent faults, the RAE handler exploits the
inherent redundancy in the array structures to de-configure the faulty entries temporarily.
Entries that experience permanent faults are permanently de-configured. RAEs improve
the reliability of the load/store queue (LSQ) and the reorder buffer (ROB) in an out-of-
order processor by average factors of 1.3 and 1.95, respectively.
The remaining parts of the dissertation present solutions to handle execution
faults that occur in throughput-oriented graphics processing units (GPUs). GPUs are
becoming an attractive option to achieve power efficient throughput computing even for
general purpose applications that go beyond traditional multimedia workloads. This
dissertation presents two GPU-specific fault handling frameworks which take advantage
of massive resource replication present in GPUs to detect, correct, and tolerate faults.
The first framework is Warped-Shield which tolerates permanent (hard) faults in
the execution lanes of the GPUs by rerouting computations around faulty execution lanes.
Warped-Shield achieves this goal through the following three techniques: thread
shuffling, dynamic warp deformation, and warp shuffling. The three techniques work in
complementary ways to allow the Warped-Shield framework to tolerate failure rates as
high as 50% in the execution lanes with only 14% performance degradation.
Motivated by the insights obtained from Warped-Shield, the last contribution of
this dissertation presents Warped-Redundant Execution (Warped-RE) which provides
detection and correction for transient, intermittent, and permanent faults in the GPU
execution lanes. Warped-Shield assumes that fault locations are identified using other
orthogonal approaches, but Warped-RE relies on spatial redundant execution to achieve
fault detection and correction. During fault-free execution, dual modular redundant
(DMR) execution is enforced to detect faults. When a fault is detected, triple modular
redundant (TMR) execution is activated to correct the fault and identify potential faulty
execution lanes. As long as the detected faults are transients, DMR execution is restored
after correction is complete. After the first non-transient fault is detected, TMR execution
is retained to guarantee correctness.
To mitigate the high overheads of naïve DMR/TMR execution, Warped-RE
leverages the inherent redundancy among threads within the same warp and the
underutilization of the GPU execution lanes to achieve low cost opportunistic redundant
execution. On average, the performance overhead for DMR execution to detect faults is
8.4%, and once a permanent fault is detected TMR execution overhead is still just 29%.
Chapter 1
Introduction
Technology scaling trends lead to exponential growth in the number of on-chip
transistors. The small and fast transistors deliver the expected performance
improvements. However, this aggressive scaling comes with many obstacles. The main
obstacles associated with technology scaling are: reduced rate of voltage scaling, design
complexity, and in-field reliability. Due to the use of smaller and faster transistors, higher
clock frequency is possible; however, without voltage scaling the chip power
consumption exceeds the power budget constraints. Moreover, the complexity of the
design required to achieve higher instruction level parallelism increases with technology
scaling. In order to mitigate the power and complexity obstacles caused by the high-density
transistor integration, the computer systems industry is adopting multicore systems like
chip multiprocessors (CMPs) [15] and many-core systems like graphics processing units
(GPUs) [67] [68].
The third obstacle of technology scaling, which is also the focus of this
dissertation, is in-field reliability [39] of computer systems. In the past, high in-field
reliability was only a requirement for a small set of mission critical systems which have
extremely low tolerance for in-field faults. Such systems achieve their design goal using
expensive resource replication solutions [44] [51] [59] [41] [76] [89]. Extensive post-
fabrication testing used to be sufficient to weed out any chips that do not meet the design
specifications, thereby obviating the need for any in-field monitoring. But the ever
shrinking feature sizes, lower voltage levels, and tighter noise margins associated with
the aggressive scaling make the systems used for general-purpose computing highly
vulnerable to in-field faults of all types: transient, intermittent, and permanent. Hence,
cost-effective in-field fault handling solutions are expected to be a major demand in
future computer systems which are built using unreliable fabric.
There are four major sources of unreliable in-field operation in today's computer
systems [40]. The first source is high sensitivity to transient (i.e. soft) faults. Transient
faults are caused by neutrons in cosmic rays or particle strikes in packaging material. As
feature sizes shrink and voltage levels become lower, the amount of charge stored in a
node decreases. As a result, the transient fault rate increases [27] and is expected to reach
1000 FIT (failures in 10^9 hours) [88] in deep nanometer regime. The second source is
design errors/bugs that escape burn-in tests. Due to the tight time-to-market requirements
and the complexity of today's computer system designs, some design errors and bugs can
still escape the pre-fabrication verification tests and post-fabrication burn-in tests. These
errors and bugs manifest as permanent (i.e. hard) faults during the in-field operation.
The third source is process variations. Variations in process parameters in deep
submicron technologies are mainly due to fabrication process limitations, such as sub-
wavelength lithography and etching, and variations in the number of dopants in short
channel devices [13]. These process variations lead to delay and leakage power variations
which in turn lead to early failure of weak devices during the in-field operation.
The fourth source of unreliable in-field computing is the accelerated
aging/wearout rates of transistors and interconnects, especially under high stress
conditions such as temperature and activity factors. The physical phenomena which
accelerate wearout the most [96] [97] are: electro-migration (EM) [79], dielectric
breakdown [80], hot carrier injection (HCI) [82], and negative-bias temperature
instability (NBTI) [57]. These wearout phenomena manifest as timing degradations that
increase gradually over time until they start to cause timing violations under high
temperatures and stress conditions (i.e. intermittent faults). The effects of some wearout
phenomena, such as NBTI, can be partially reversed when stress conditions are removed
while the rest of them continue to cause intermittent timing violations and eventually
cause permanent faults.
In order to handle in-field transient, intermittent, and permanent faults in future
general-purpose computer systems, cost-effective fault handling mechanisms that can
detect, diagnose, recover (i.e. correct) and tolerate different fault categories need to be
deployed. The microarchitecture of today's computer systems mainly consists of
execution elements (i.e. cores) and memory elements (e.g. caches). The memory elements
are usually protected during the in-field operation using well-known protection schemes
such as parity [29] [38] [58] and error correcting codes (ECC) [8] [48] [49]. Hence, the
work presented in this dissertation targets the in-field reliability of the functional blocks
that are within the execution cores.
The design of reliable in-field computer systems requires four steps: fault
detection, fault diagnosis, fault recovery, and fault tolerance. In order for a computer
system to handle faults, an in-field mechanism that can detect faults of the three types is
required. Once a fault is detected, a diagnosis mechanism is needed to classify the
detected fault and identify its location. Next, a recovery mechanism is responsible for the
restoration of the correct program state in order to guarantee functional correctness.
Finally, a fault tolerance mechanism is required to ensure proper execution despite the
existence of intermittent and permanent faults in order to guarantee forward progress.
This dissertation presents efficient and low cost fault handling mechanisms which
help to build in-field reliable CMPs and GPUs.
1.1 In-field fault detection
Recently, several hardware/software in-field fault detection mechanisms have
been proposed. The proposed mechanisms can be classified according to four detection
methodologies: 1) redundant execution [5] [51] [62] [41] [93] [69] [78], 2) periodic
testing [43] [90] [73] [7], 3) dynamic verification [9] [60], and 4) anomaly detection [74]
[54] [101].
Redundant execution mechanisms are further divided into hardware-based and
software-based redundant execution mechanisms. Hardware-based redundant execution
mechanisms (i.e. spatial redundancy) rely on multiprocessor and multithreading
properties of current CMPs to execute two copies (i.e. two threads) of the same program
either with lockstep configuration or with redundant multithreading (RMT) configuration.
In lockstep configuration, the two threads are executed on two identical cores and are
tightly synchronized together such that their results are compared on a per-cycle or per-instruction basis. The two identical cores can be either statically coupled together as proposed in [5] [41] or dynamically coupled as proposed in [51].
In RMT configuration, the two threads are loosely synchronized together in order
to reduce the overhead of the checker. The loose synchronization is implemented by
assigning one of the threads as the leading thread and the other thread as the trailing
thread. The trailing thread receives the results from the leading thread and compares them
with its own results. In RMT configuration, every core is assigned the leading thread of
one program and the trailing thread of another program. Examples of RMT include chip-
level redundant threading (CRT) [62] and Reunion [93]. The major advantage of
hardware-based redundant execution mechanisms is their ability to detect all types of
faults (i.e. 100% fault coverage) assuming a single-fault model. However, they require a
significant area overhead of 100%.
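To make the loose synchronization of RMT concrete, the following single-threaded C++ sketch (not taken from the CRT or Reunion designs) models the idea: the leading thread forwards each result through a queue, and the trailing thread recomputes the same instruction stream and compares against the forwarded values. The function and variable names are illustrative assumptions.

```cpp
#include <cstdint>
#include <cstdio>
#include <queue>

// Rough model of RMT-style checking: leading results are forwarded through a
// queue and compared against the trailing thread's recomputation.
static uint64_t execute_op(uint64_t a, uint64_t b) { return a * 3 + b; }

int main() {
    std::queue<uint64_t> forwarded;          // leading -> trailing result queue
    const int kInstructions = 8;

    // Leading thread: execute and forward results.
    for (int i = 0; i < kInstructions; ++i)
        forwarded.push(execute_op(i, i + 1));

    // Trailing thread: re-execute and compare with the forwarded results.
    for (int i = 0; i < kInstructions; ++i) {
        uint64_t redundant = execute_op(i, i + 1);
        if (redundant != forwarded.front())
            std::printf("fault detected at instruction %d\n", i);
        forwarded.pop();
    }
    return 0;
}
```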
Software-based redundant execution mechanisms (i.e. temporal redundancy)
provide a cost-effective alternative to hardware-based mechanisms by duplicating every
instruction and inserting check instructions to compare the results. Software-based
redundant execution mechanisms require negligible area overhead as the original
instructions and their duplicates execute on the same hardware. However, software-based
mechanisms are only effective against transient faults and they might fail to detect
intermittent and permanent faults. Another drawback of software-based redundancy is the
high performance overhead caused by the duplicates and the check instructions. Error
detection by duplicated instruction (EDDI) [69] and software implemented fault tolerance
(SWIFT) [78] are examples of software-based redundant execution mechanisms.
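The spirit of these software-based schemes can be illustrated with a small, hedged C++ sketch (this is not the published EDDI or SWIFT transformation, which operates at the compiler/ISA level): every computation is performed twice and an inserted check compares the two results before the value becomes architecturally visible, here before a store.

```cpp
#include <cstdint>
#include <cstdlib>

// Illustrative temporal redundancy: duplicate the computation and check the
// two results before committing. Real implementations keep the duplicate in
// separate "shadow" registers so a single transient fault cannot hit both.
void checked_store(volatile int64_t* dst, int64_t a, int64_t b) {
    int64_t primary = a + b;        // original instruction
    int64_t shadow  = a + b;        // duplicated instruction
    if (primary != shadow)          // inserted check instruction
        std::abort();               // divert to the fault handler
    *dst = primary;                 // commit only a checked value
}

int main() {
    volatile int64_t cell = 0;
    checked_store(&cell, 40, 2);
    return static_cast<int>(cell) - 42;   // returns 0 on fault-free execution
}
```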
The second in-field fault detection methodology is periodic testing, which
provides non-concurrent fault detection by running specialized test patterns/programs and
checking their outputs periodically or during opportunistic idle periods. Periodic testing
mechanisms can also be divided into hardware-based and software-based mechanisms.
Hardware-based periodic testing mechanisms rely on the in-situ built-in self-test (BIST)
structures and/or design-for-testability (DFT) scan chains, which are traditionally used
for post-fabrication testing, to perform in-field periodic fault detection [43] [90] [7].
Software-based periodic testing mechanisms, such as software-based self-test
(SBST), are starting to gain attention due to their minimal hardware overhead while still
providing comparable fault coverage to hardware-based periodic testing. In SBST [55],
the test patterns are translated to test programs using the native instruction set
architecture (ISA) of the microprocessor. The test programs are stored in memory and
executed as traditional applications. Due to the non-concurrent nature of periodic testing
mechanisms, they can only detect permanent faults effectively. In addition, periodic
testing mechanisms generally cause high performance overhead in the range of 5% to
30% of the system time due to insufficient opportunistic idle periods [28].
The third in-field fault detection methodology is dynamic verification. Dynamic
verification mechanisms, such as DIVA [9] and Argus [60], insert dedicated hardware
checkers to validate a set of specific invariants in order to detect any faults that happen
during the normal execution of instructions. The main invariants are control flow, data
flow, and computation. The area overhead consumed by the checkers can reach up to
17% of the total microprocessor area while providing high fault coverage of 98%.
Dynamic verification mechanisms can detect all faults that cause a deviation from the
fault-free invariants values.
The detection methodologies discussed so far (i.e. redundant execution, periodic
testing, and dynamic verification) are over-provisioned in the sense that they do not take
the masking effects at the micro-architectural and architectural levels into consideration.
As a result, some faults which could have been safely ignored due to the masking effects
would still be detected and would add to the detection overheads. On the other hand,
anomaly detection mechanisms wait for the fault effect to propagate to the application
layer (i.e. causing an error); hence, they take the masking effects into consideration. This
is done by monitoring the software for any anomalous behavior such as data value
anomalies [74] (e.g. out of range values and bit invariants), micro-architectural behavior
anomalies [101] (e.g. cache misses and page faults), and software behavior anomalies
[54] (e.g. traps and hangs). Anomaly detection mechanisms are attractive because of their
low hardware and performance overheads. However, their detection latencies are 5 to 6
orders of magnitude [40] greater than those of the previously mentioned detection
methodologies. As a result, anomaly detection mechanisms require sophisticated
recovery approaches to restore correct program state.
There are four metrics that can be used to measure the efficiency of fault detection
methodologies: area overhead, fault detection latency, fault coverage, and energy and
performance overheads. Considering permanent faults only and based on the information
provided in [40], Table 1.1 is constructed to show which of the abovementioned detection
methodologies represents the best tradeoff point for general-purpose microprocessors.
The four detection methodologies provide sufficient permanent fault coverage between
98% and 100%; hence, we exclude the fault coverage metric from the table. The table
shows that periodic testing and dynamic verification provide a better tradeoff between the
four metrics than redundant execution and anomaly detection. We also observe that there
is room for improvement by reducing the energy and performance overheads associated
with the periodic testing at the expense of little extra area overhead in order to get a better
tradeoff point with low area, energy, and performance overheads.
Motivated by these observations, Chapter 2 of this dissertation presents a usage-
signature adaptive periodic testing mechanism called SignTest. SignTest periodically
suspends the microprocessor normal execution and applies specialized test
patterns/programs to verify the microprocessor functionality and detect any in-field
permanent faults. The testing phase in SignTest is dynamically tuned according to the
usage signature of the microprocessor functional blocks (e.g. ALUs, control units, and
decoders) during the last epoch of time.
                      | Area Overhead | Detection Latency | Energy and Performance Overheads
Redundant Execution   | Very High     | Very Low          | Very Low
Periodic Testing      | Very Low      | Medium Low        | Medium Low
Dynamic Verification  | Medium Low    | Low               | Low
Anomaly Detection     | Very Low      | Very High         | Very Low
Table 1.1: In-field fault detection tradeoffs.
SignTest tracks the usage signature of the microprocessor functional blocks (i.e.
microprocessor usage signature (MUS)) by monitoring the switching activities at the
blocks’ input and output ports during every epoch. At the end of every epoch, the MUS is
used to determine the test coverage required for each functional block within the
microprocessor. Hence, SignTest provides targeted functional block testing in order to
achieve high fault coverage while incurring less energy and performance overheads than
traditional periodic testing mechanisms which thoroughly test the entire microprocessor
without considering the different usage levels of its functional blocks [90] [55] [28].
A body of past research studies presented models that relate failure rates to the
usage level of a computer system and its components [21] [22] [46]. The statistical results
from these studies show strong correlation between failures and usage levels. SignTest
leverages this relationship by choosing the fault coverage for every functional block
based on its usage level during the last epoch. The lower the required fault coverage is,
the lower the energy and performance overheads will be.
In this dissertation, two SignTest implementations are presented: conservative
SignTest (C-SignTest) and relaxed SignTest (R-SignTest). C-SignTest provides coarse
grain usage monitoring by only distinguishing idle and used functional blocks. Any
functional block that experiences at least one switching activity at its input or output ports
during the epoch is considered used; otherwise, the block is considered idle. By skipping
the idle blocks during the testing phase, C-SignTest can achieve up to 24% savings in the
energy and performance overheads of thorough periodic testing without any reduction in
the permanent fault coverage.
On the other hand, R-SignTest provides fine grain usage monitoring by tracking
how frequently every functional block is used during the epoch (i.e. its usage level).
According to the usage level of each functional block, a predefined percentage of the test
patterns/programs is applied to achieve the target fault coverage for the block. For
example, R-SignTest can achieve up to 51% savings in the energy and performance
overheads of thorough periodic testing while providing 93% permanent fault coverage.
R-SignTest has the flexibility to address various fault coverage requirements of different
applications by changing the mapping between usage level and the number of test
patterns/programs applied.
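The following C++ sketch illustrates the epoch-end decision just described; it is a simplified assumption of how a usage tracker could drive test selection, not the dissertation's implementation. The thresholds, the usage-to-coverage mapping, and the block names are hypothetical; the idle-block rule mirrors C-SignTest and the graded mapping mirrors R-SignTest.

```cpp
#include <array>
#include <cstdio>

// Hypothetical epoch-end test selection: each functional block accumulates a
// usage counter during the epoch; the counter is mapped to the fraction of
// that block's test patterns applied in the next testing phase.
struct Block {
    const char* name;
    unsigned long usage;      // switching-activity count for this epoch
    unsigned total_patterns;  // size of the block's full test set
};

// Map a usage level to the fraction of test patterns to apply.
static double coverage_fraction(unsigned long usage) {
    if (usage == 0)      return 0.00;  // idle block: skip (C-SignTest rule)
    if (usage < 1000)    return 0.25;
    if (usage < 100000)  return 0.50;
    return 1.00;                       // heavily used block: full test set
}

int main() {
    std::array<Block, 3> blocks = {{{"ALU", 250000, 400},
                                    {"Shifter", 800, 120},
                                    {"Divider", 0, 300}}};
    for (Block& b : blocks) {
        unsigned n = static_cast<unsigned>(b.total_patterns *
                                           coverage_fraction(b.usage));
        std::printf("%s: apply %u of %u patterns\n", b.name, n, b.total_patterns);
        b.usage = 0;  // reset the usage tracker for the next epoch
    }
    return 0;
}
```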
1.2 In-field fault diagnosis, recovery, and tolerance
After successful fault detection, fault recovery and tolerance mechanisms are
needed to guarantee functional correctness and forward progress at minimal performance
overhead. In-field fault diagnosis is an intermediate step between detection and recovery
and it aims at making the recovery and tolerance mechanisms simple and efficient; hence,
research works seldom focus solely on fault diagnosis. Instead, most
of the proposed fault detection, recovery, and tolerance mechanisms implicitly provide
limited fault diagnosis capabilities. There are two objectives for fault diagnosis:
classifying faults and identifying their locations.
In-field faults are classified into three categories: transient, intermittent, and
permanent. Transient faults occur once and then disappear. Permanent faults are
persistent. Intermittent faults oscillate between periods of erroneous activity and
dormancy, depending on various operating conditions such as temperature and voltage.
Accurate fault classification allows the recovery and tolerance approaches to select the
most appropriate corrective action based on the fault type. For permanent and intermittent
faults, it is also important to identify the fault location since the corrective actions for
these fault types usually involve the isolation of the faulty block as will be discussed
shortly.
Previously proposed fault diagnosis mechanisms [18] [14] [84] have a major limitation:
they cannot distinguish intermittent faults. Thus, intermittent
faults end up being classified either as transient faults, which increases the performance
overhead due to the continuous flushing, or as permanent faults which cause extra
performance degradation due to unnecessary component isolation. Chapter 3 tackles this
shortcoming by presenting reliability-aware exceptions (RAEs) [34], a special class of
exceptions that enable software directed fault diagnosis, recovery, and tolerance in the
microprocessor array structures.
The novelty of RAEs is the ability to distinguish intermittent faults from transient
and permanent faults and provide appropriate corrective actions to recover and tolerate
the three fault categories. When a fault is detected in one of the array structures’ entries
within the microprocessor, a reliability-aware exception (RAE) is raised and the RAE
handler is invoked when the faulting instruction is ready to commit. The RAE handler
uses the tracked failure history of the entry under concern to perform fault classification.
After the fault type and location are determined by the diagnosis mechanism, the
most appropriate corrective action is selected in order to restore the correct
microarchitecture state (i.e. fault recovery) and guarantee forward progress (i.e. fault
tolerance). In-field fault recovery approaches are classified into two classes: forward fault
recovery (FFR) and backward fault recovery (BFR). FFR approaches do not require a
roll-back and can correct the fault instantly such as using triple modular redundancy
(TMR). BFR approaches require a roll-back or a check-pointing technique to restore the
microprocessor correct state. RAEs use a BFR approach to retrieve the correct
microprocessor state.
The complexity of the roll-back/check-pointing technique for BFR approaches
depends on the fault detection latency. When faults are detected promptly before any
corrupted data modifies the committed state of the microprocessor (i.e. memory and
register file state), a simple roll-back technique similar to the one used for branch
misprediction and exception handling is sufficient. This is true for RAEs as faults are
detected promptly using parity bits (for transient and permanent faults) and double
sampling [30] (for intermittent faults). Hence, RAEs leverage the branch misprediction
roll-back technique when a reliability exception is raised in order to restore the correct
microprocessor state, which requires no extra hardware and imposes low
performance overhead.
Transient faults persist for a single cycle; hence, no corrective action other than
roll-back is needed. On the other hand, permanent faults persist constantly and in order to
guarantee forward progress the faulty component has to be permanently isolated from the
working set. Previous fault tolerance studies proposed reconfiguration [84] [87] [95],
detouring [61], and core cannibalization [81] to isolate permanent faults. The RAE
handler tolerates transient and permanent faults in the same manner as previous
mechanisms. However, since RAEs can distinguish intermittent faults, the RAE handler
can provide the best corrective action which complies with the nature of intermittent
faults. Intermittent faults appear non-deterministically in the same location usually under
high stress and temperature conditions and they disappear when the stress conditions are
removed. Moreover, the effects of some intermittent faults (e.g. NBTI-induced faults)
could be partially reversed under low stress and temperature conditions. Hence, the RAE
handler tolerates intermittent faults in memory structure entries by temporarily de-
configuring the faulty entries from the working set (i.e. reducing stress and temperature
conditions). This allows RAEs to enhance the reliability of the reorder buffer (ROB) and
the load/store queue (LSQ) of an out-of-order microprocessor by factors of 1.95 and 1.3,
respectively.
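The corrective actions described above can be summarized in a short hedged C++ sketch: roll back for every detected fault, then de-configure the faulty entry temporarily for intermittent faults or permanently for permanent faults. The ArrayEntry interface, the cooldown value, and the helper names are invented for illustration and are not the dissertation's code.

```cpp
#include <cstdint>

// Hypothetical corrective-action selection for a classified fault.
enum class FaultType { Transient, Intermittent, Permanent };

struct ArrayEntry {
    bool enabled = true;
    uint64_t reenable_cycle = 0;   // 0 means no pending re-enable

    void deconfigure_temporarily(uint64_t now, uint64_t cooldown) {
        enabled = false;
        reenable_cycle = now + cooldown;   // rest the entry to relieve stress
    }
    void deconfigure_permanently() {
        enabled = false;
        reenable_cycle = 0;                // never returned to the working set
    }
};

void roll_back_to_faulting_instruction() { /* reuse branch-misprediction rollback */ }

void handle_rae(FaultType type, ArrayEntry& entry, uint64_t now) {
    roll_back_to_faulting_instruction();   // BFR recovery for all fault types
    switch (type) {
        case FaultType::Transient:    break;  // re-execution alone suffices
        case FaultType::Intermittent: entry.deconfigure_temporarily(now, 500000); break;
        case FaultType::Permanent:    entry.deconfigure_permanently(); break;
    }
}

int main() {
    ArrayEntry lsq_entry;
    handle_rae(FaultType::Intermittent, lsq_entry, 1000);
    return lsq_entry.enabled ? 1 : 0;   // expect the entry to be de-configured
}
```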
1.3 In-field fault handling in GPUs
Similar to chip multiprocessors (CMPs), future graphics processing units (GPUs)
face imminent reliability threats due to extreme technology scaling. In the past, GPUs
were exclusively used to run multimedia applications which feature high thresholds for
in-field computational faults; hence, in-field reliability was not a first order design
constraint.
Nowadays, the massive parallel compute power of GPUs has attracted many
general-purpose applications which process large blocks of data in parallel. These
general-purpose applications, such as mission critical scientific and financial
applications, have strict reliability demands represented by low thresholds for in-field
computational faults. Hence, future GPUs that provide support for general-purpose
computations should deploy fault handling mechanisms capable of mitigating the effects
of in-field faults.
GPUs consist of hundreds or thousands of execution units organized as single
instruction multiple thread (SIMT) lanes. The SIMT execution model [67] [68] is
supported using the notion of warps. A warp is the smallest scheduled unit of work in
GPUs and it consists of up to 32 parallel threads that execute the same instruction on
different input operand values. The SIMT lanes account for 68% of the chip area in
current GPUs [53], while most of the chip area outside of the SIMT lanes is occupied by
memory elements which are usually protected during the in-field operation using error
correcting codes. For instance, register files in Nvidia GPUs are SECDED protected [68].
Hence, the GPU in-field reliability work presented in this dissertation focuses on
protecting the SIMT lanes within the GPUs.
Recent research work proposed GPU-specific hardware-based and software-based
techniques to detect in-field faults [32] [64] [47]. These techniques did not provide
solutions to recover or tolerate the detected faults. As a complement to these prior works,
this dissertation presents Warped-Shield in Chapter 4 [33]. Warped-Shield is a
lightweight fault tolerance framework that dynamically adapts the thread execution in
GPUs based on the available healthy SIMT lanes. Warped-Shield targets permanent
faults in the SIMT lanes and presumes that a detection mechanism is in place and that the
correct microarchitecture state is recovered by rerunning the faulting kernel.
The Warped-Shield framework comprises three hardware-based schemes that
do not require any software intervention and are transparent to the micro-architectural
blocks surrounding the SIMT lanes, such as the fetch logic, register file, and caches.
1. Thread shuffling: This scheme represents the building block of the Warped-
Shield framework and strives to shuffle the active threads within each warp, such
that every active thread is issued to a healthy SIMT lane.
2. Dynamic warp deformation: Whenever the number of active threads within a
warp exceeds the number of healthy SIMT lanes, thread shuffling alone will not
be sufficient. In these cases, the active threads within the warp get dynamically
deformed (i.e. split) into multiple sub-warps issued in consecutive cycles such
that thread shuffling becomes sufficient for each sub-warp.
3. Warp shuffling: The SIMT lanes in a GPU are grouped together to form a
streaming processor (SP) unit. Multiple SP units exist in the same streaming
multiprocessor (SM). As different SP units within the same SM might suffer from
different numbers of faulty SIMT lanes, warps can be shuffled such that every
warp is issued to the most appropriate SP unit according to the number of active
threads in the warp and the number of healthy SIMT lanes in the SP unit.
The three schemes allow Warped-Shield to tolerate failure rates as high as 50% in
the execution lanes with only 14% performance degradation while causing minimal area
and power overheads.
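For illustration, the per-warp decision these schemes imply can be sketched as follows, assuming a 32-lane SP unit and bitmask representations of active threads and healthy lanes; the function name and data layout are not the dissertation's hardware. If the healthy lanes can host all active threads, a one-to-one remap (thread shuffling) suffices; otherwise the warp is deformed into sub-warps issued over consecutive cycles.

```cpp
#include <bitset>
#include <cstdio>
#include <vector>

// Illustrative per-warp scheduling decision for Warped-Shield-style hardware.
constexpr int kLanes = 32;
using Mask = std::bitset<kLanes>;

// Returns one active mask per issue cycle; each sub-warp carries no more
// active threads than there are healthy lanes.
std::vector<Mask> schedule_warp(Mask active, const Mask& healthy) {
    std::vector<Mask> sub_warps;
    const std::size_t capacity = healthy.count();
    if (capacity == 0) return sub_warps;     // no healthy lanes in this SP unit
    while (active.any()) {
        Mask sub;
        std::size_t placed = 0;
        for (int t = 0; t < kLanes && placed < capacity; ++t) {
            if (active.test(t)) {   // next active thread, to be remapped onto
                sub.set(t);         // some healthy lane by the shuffling logic
                active.reset(t);
                ++placed;
            }
        }
        sub_warps.push_back(sub);   // one sub-warp per issue cycle
    }
    return sub_warps;
}

int main() {
    const Mask active("11111111111111111111111111111111");    // 32 active threads
    const Mask healthy("00000000000000001111111111111111");   // 16 healthy lanes
    std::printf("warp issued over %zu cycles\n", schedule_warp(active, healthy).size());
    return 0;
}
```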
Motivated by the insights obtained from Warped-Shield, this dissertation presents
Warped-Redundant Execution (Warped-RE) in Chapter 5 as the last contribution.
Warped-RE is a unified in-field fault detection and correction framework for GPUs.
Warped-RE can detect and correct transient, intermittent, and permanent in-field faults in
the SIMT lanes of GPUs. Two modes of operation are available in Warped-RE: 1) dual
modular redundant (DMR) execution mode, which is used during the common-case
fault-free execution, and 2) triple modular redundant (TMR) execution mode, which is
activated just-in-time to correct detected faults.
Initially when all SIMT lanes are healthy, Warped-RE runs in DMR mode in
which every warp instruction is redundantly executed and checked to detect all fault
types. When a fault is detected due to mismatch in DMR mode, the faulting warp
instruction is re-executed in TMR mode to correct the fault. In TMR mode, a warp
instruction is executed three times in order to choose the correct (i.e. majority) result and
identify potential faulty SIMT lanes. If the fault does not reoccur during the re-execution,
the initial fault instance is considered transient and Warped-RE returns to run in DMR
mode. Otherwise, the fault is considered non-transient and Warped-RE continues to run
in TMR mode to tolerate the fault and guarantee functional correctness.
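The DMR-to-TMR switching policy sketched above can be summarized in a few lines of behavioral Python; run() stands in for one redundant execution of the warp instruction, and all names are illustrative rather than part of the actual Warped-RE hardware.

import random

def run_warp_instruction(run, mode="DMR"):
    """run() executes the warp instruction once and returns its result.
    Returns (result, next_mode) following the DMR-to-TMR policy described above."""
    if mode == "DMR":
        r1, r2 = run(), run()            # redundant execution of the same instruction
        if r1 == r2:
            return r1, "DMR"             # common case: fault-free
        mode = "TMR"                     # mismatch detected: correct just-in-time
    results = [run() for _ in range(3)]  # TMR: execute three times and vote
    majority = max(set(results), key=results.count)
    # If all three copies agree, the earlier mismatch is treated as transient;
    # otherwise the fault is considered non-transient and execution stays in TMR.
    return majority, ("DMR" if results.count(majority) == 3 else "TMR")

# Toy usage: a lane model that occasionally returns a corrupted result.
faulty_lane = lambda: 42 if random.random() > 0.1 else 43
print(run_warp_instruction(faulty_lane))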
To mitigate the high overheads associated with redundant execution during DMR
and TMR modes, Warped-RE leverages the spatial value locality between the source
operands of threads within the same warp to achieve low cost opportunistic redundant
execution. When the source operands of two (three) adjacent threads within a warp are
matching, the threads are considered inherently DMR-ed (TMR-ed) and there is no need
for redundant execution. Furthermore, Warped-RE utilizes the idle SIMT lanes to
replicate the computations of the active threads and achieve DMR or TMR execution
opportunistically. Our empirical study on 22 benchmarks shows that around half of the
warp instructions can achieve opportunistic redundant execution during DMR and TMR
modes.
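A rough sketch of the inherent-redundancy check is given below; the per-thread operand tuples and the function name are illustrative, and the real design compares register operands inside the SM rather than Python lists.

# Opportunistic (inherent) redundancy check: when adjacent threads in a warp carry
# identical source operands, their results must match, so the pair (group=2, DMR)
# or triple (group=3, TMR) is considered redundant without extra execution.

def inherently_redundant(operands, group=2):
    """operands: list of per-thread source-operand tuples; group=2 for DMR, 3 for TMR."""
    flags = []
    for i in range(0, len(operands), group):
        chunk = operands[i:i + group]
        flags.append(len(chunk) == group and len(set(chunk)) == 1)
    return flags

# Example: threads 0-1 share operands (inherently DMR-ed), threads 2-3 do not.
print(inherently_redundant([(5, 1), (5, 1), (7, 2), (8, 2)]))   # [True, False]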
To cover the remaining warp instructions, dynamic warp deformation is used to
split the active threads of the warp instruction among multiple sub-warps and replicate
them on the idle SIMT lanes to achieve DMR and TMR execution. The performance
overhead of Warped-RE during DMR and TMR modes is 8.4% and 29%, respectively.
1.4 Contributions
The main contributions of this dissertation are summarized in the following:
1. A detailed usage analysis is conducted for the integer execution and load/store
functional blocks within the OpenSPARC T1 core (Chapter 2). The analysis
shows that there is plenty of opportunity to leverage the cause-effect
relationship between block usage and failure rate in order to achieve the best
tradeoff between fault coverage and testing energy and performance
overheads for periodic testing mechanisms.
2. Using the insights derived from the functional block usage analysis, a usage-
signature adaptive periodic testing mechanism called SignTest is presented
in Chapter 2. SignTest tracks the usage levels of the microprocessor functional
blocks during each epoch and uses them to tune the testing phase at the end of
the epoch by relating the fault coverage of each block to its usage level. This
helps to minimize the testing energy and performance overheads without
compromising fault coverage.
3. A new class of exceptions called reliability-aware exceptions (RAEs) is
presented in Chapter 3. RAEs provide fine grain in-field fault diagnosis by
distinguishing intermittent faults from transient and permanent faults.
Furthermore, RAEs use temporal de-configuration to tolerate intermittent
faults in the microprocessor array structures.
4. Warped-Shield, a lightweight framework to tolerate hard faults in the SIMT
lanes of GPUs is presented in Chapter 4.
5. Warped-RE, a unified fault detection and correction framework for the SIMT
lanes of GPUs is presented in Chapter 5.
Chapter 2
Usage-Signature Adaptive Periodic Testing
In this chapter we present SignTest, a usage-signature based periodic testing
mechanism to detect permanent faults in microprocessors during the in-field operation.
SignTest periodically suspends the normal microprocessor operation and runs specialized
test patterns for permanent fault detection. The novelty of SignTest is in exploiting the
direct relation between a microprocessor's usage signature and the expected fault sites to
reduce the testing energy and the testing time while maintaining high coverage against
permanent faults.
2.1 Introduction
As transistors shrink and voltage levels approach the sub-threshold region,
transistors become more vulnerable to in-field faults [16] [96] due to higher probability
of manufacturing defects, more process variation and faster wearout rates. Faults that
happen during the in-field operation are categorized into: transient, intermittent, and
permanent faults. Among these, intermittent and permanent faults are considered critical
because they are persistent for long periods of time, as is the case with intermittent faults,
or irreversible, as is the case with permanent faults. When such faults happen during
operation, the microprocessor is expected to handle them by detecting their occurrence,
recovering the correct program state, and tolerating their existence. To achieve this goal,
specialized mechanisms should be deployed in future microprocessors.
In this chapter, we focus on the detection part and assume that a fault recovery
mechanism (e.g. ReVive [72] and SafetyNet [94]) and a fault tolerance mechanism (e.g.
reconfiguration [87] [84] [95] and detouring [61]) are in-situ. The previously proposed
in-field detection mechanisms [77] [41] [90] [28] [83] [71] [40] [91] can be divided based
on the testing frequency (i.e. how often a test is conducted) into two main categories:
continuous and periodic. Continuous testing [77] [41] [83] relies on redundant execution
in time or space to validate the execution of every instruction such as dual modular
redundancy (DMR). Although continuous testing provides full fault coverage against all
fault types, it causes large area (100% for DMR), performance (in the case of redundant
execution in time), and energy overheads. Such huge overheads render continuous
testing impractical, particularly for high-volume, low cost microprocessors.
On the other hand, periodic testing [90] [28] [71] [91] suspends the normal
execution after a given time interval, called an epoch, to run specialized test
patterns/programs that validate the microprocessor functionality. The relatively lower
area and energy overheads of periodic testing make it more suitable than continuous
testing, especially for high-volume low cost microprocessors.
Multiple prior studies have shown that functional block usage is one of the
primary drivers that lead to aging and eventual failure [46] [96]. Despite that and until
recently, prior periodic detection mechanisms were agnostic to how a microprocessor is
used during the last epoch when issuing the next test sequence. Agnostic periodic testing
mechanisms test all blocks exhaustively without considering their different usage profiles
which causes high energy and performance overheads [90] [28]. Recent periodic testing
mechanisms use predetermined usage thresholds to decide whether a microprocessor
pipeline stage (e.g. execution stage and load/store stage) should be tested or not [71]
[91]. Such approaches either cause high energy and performance overheads when low
usage thresholds are used or severely impact the fault coverage when high usage
thresholds are used.
To address these shortcomings, this chapter presents SignTest, a usage-signature
based periodic testing mechanism. During every epoch, SignTest tracks the usage of the
microprocessor at the functional block level and dynamically steers the testing phase to
achieve high permanent fault coverage while minimizing testing energy and performance
overheads. Two SignTest implementations are presented: conservative SignTest (C-
SignTest) and relaxed SignTest (R-SignTest).
C-SignTest reduces testing energy and performance overheads by focusing the
testing process on the microprocessor functional blocks which were used in the last epoch
and skipping the ones that were idle. Intuitively, any idle block could not have
contributed to the correctness of the running application; hence, it can be skipped during
the testing phase without compromising the correctness. As a result, C-SignTest spends
less time testing, which translates to energy and performance savings without sacrificing
fault coverage against permanent faults.
R-SignTest leverages the causality relationship between block usage and block
failures to determine the fault coverage for each block based on its usage level.
Intuitively, blocks that are used less frequently during the epoch require lower fault
coverage, which requires lower testing energy and time. Relative to C-SignTest, R-
SignTest offers slightly lower fault coverage in order to achieve higher energy and
performance savings.
The causality between block usage and block failures is twofold: first, the
probability of wearout-induced faults increases with higher block usage.
For instance, the failure models of many wearout mechanisms, such as negative-bias
temperature instability (NBTI), electro-migration (EM), and hot carrier injection (HCI)
show strong dependence on the frequency of gate toggles [96] [86]. Second, in the case
where a functional block is already faulty, the more the block is used during the epoch
the higher the possibility that the fault will be activated. For example, if there is a stuck-
at-0 fault at the output of a 2-input AND gate, the fault will be activated only when the
two inputs are logic high, which has higher probability of occurrence when the gate is
used more often.
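As a back-of-the-envelope illustration of this second effect, assume (purely for illustration) that the two AND inputs are independent and uniformly random on each use. Then

P(fault activated in a single use) = P(a = 1) * P(b = 1) = 1/4
P(fault activated at least once in N uses) = 1 - (3/4)^N

so the chance of the latent fault manifesting during an epoch grows quickly with the number of times the gate is exercised.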
SignTest tracks the usage of the microprocessor functional blocks by monitoring
the switching activity at the inputs and outputs of every block (e.g. ALUs, control units,
and decoders). For C-SignTest, the functional blocks that experience at least one
switching activity at any of their input or output ports during an epoch are considered
used while the rest are assumed idle. During the testing phase, C-SignTest only tests the
blocks that are used. On the other hand, the tracking logic of R-SignTest needs to provide
the frequency of usage for each functional block and relies on that to determine the
testing thoroughness of each block.
SignTest is completely independent of the testing methodology (i.e. how the
test patterns/programs are applied and how the test responses are checked). For example,
the experimental evaluation of SignTest shows that, irrespective of the testing
methodology in place, SignTest achieves substantial savings in the testing energy
overhead or the testing performance overhead or both. In other words, SignTest is a
methodology-agnostic scheme that makes periodic in-field testing more efficient.
2.2 Related Work
2.2.1 Usage-failure causality
Some past research works focused on the relationship between failures and usage
in computer systems. Beaudry developed measures which reflect the interaction between
the reliability and performance characteristics of computing systems [12]. Castillo and
Siewiorek developed performance measures for a large DEC-10A time-sharing system
[21]. The measured data from their experiments did not agree with the assumption of
constant system failure rates and they attributed the mismatch to the fluctuations in the
system usage. In a subsequent research work, Castillo and Siewiorek modeled failures
assuming that the instantaneous failure rate of a system resource (i.e. a microprocessor
functional block) is dependent on its usage profile [22].
Iyer et al. studied the relationship between usage levels and failures in computer
systems and components [46]. Their data covered three years of normal operation and
showed strong correlation between the amount of stress exerted on a system or a system
component and its hardware failure rate. The above studies served as the motivation for
our SignTest fault detection mechanism.
2.2.2 Periodic fault detection
Shyam et al. proposed BulletProof, a low cost periodic testing mechanism for
microprocessor pipelines [90]. BulletProof creates coarse-grained epochs during which a
BIST-like infrastructure validates the integrity of the microprocessor. All functional
blocks are conservatively tested during every epoch. BulletProof takes advantage of the
block idle cycles to run part of its test vectors before the end of the epoch. This may help
to reduce the testing time, but the testing energy, which is as critical, remains the same
because all blocks are tested all the time regardless of their usage signatures.
In a subsequent research work, Constantinides et al. proposed a software-based
online fault tolerance mechanism using a new set of instructions called access-control
extension (ACE) [28]. ACE instructions access and control the microprocessor's internal
state by leveraging the existing scan chain architecture. Special firmware periodically
suspends the execution of the microprocessor and uses the ACE instructions to run
thorough tests on the hardware. Both BulletProof and ACE can be enhanced by
incorporating the SignTest scheme to further reduce testing energy and time either by
skipping idle functional blocks/ACE segments (in case C-SignTest is used) or by
selecting the fault coverage for the functional blocks/ACE segments according to their
usage levels (in case R-SignTest is used).
Recently, Gupta et al. proposed an adaptive testing framework (ATF) [42] to
reduce the testing overheads associated with software-based self-test (SBST). ATF is
implemented at the core-level in a chip multiprocessor environment (CMP) with low-
level sensors employed to measure the amount of wearout degradation of the individual
cores. Lightly degraded cores are tested with a low coverage test (i.e. short test period)
and heavily degraded cores are tested thoroughly. This approach lacks visibility into the
health of the individual functional blocks. For example, it could be the case that block X
is the weakest block in the microprocessor, yet the test may not focus on block X
extensively because the test is selected based on the core health status instead of the
health status of the individual blocks. This increases the probability of faults being
undetected during the testing phase. On the other hand, SignTest tracks the
microprocessor usage at the functional block granularity. C-SignTest identifies idle
blocks, which have zero failure probability (i.e. could not have affected the correctness of
the running application), and skips them during the testing phase without sacrificing
coverage against permanent faults. R-SignTest allocates more testing energy and time for
functional blocks that have been used most.
Pellegrini and Bertacco presented application-aware testing (A2Test) [71]. A2Test
is implemented at the pipeline stage level (e.g. fetch stage, execute stage, and load/store
stage). A specialized controller examines the opcode of the instruction being decoded and
tags all the stages that are likely to be used simply based on the instruction's
functionality. At the end of every epoch, if the predicted usage of a stage is higher than a
predetermined threshold then the stage is thoroughly tested. Otherwise, the stage is not
tested at all.
Most recently, Skitsas et al. proposed DaemonGuard [91], an O/S-assisted
selective testing scheme for multicore systems. Similar to A2Test, DaemonGuard observes the
usage at the pipeline stage level using hardware counters that get incremented based on
the current instruction type. A lightweight operating system process, called Testing
Manager, periodically checks for any pending testing requests. When the counter value of
a stage exceeds a predefined threshold, the Testing Manager invokes an idle test daemon
which executes the testing routine of the stage.
Compared to A2Test and DaemonGuard, SignTest gathers the usage information
at the granularity of a functional block within the stage (e.g. ALUs, control units, and
decoders) by monitoring the switching activity at the inputs and outputs of the individual
blocks. Hence, SignTest can capture the complete idleness of a stage (i.e. all functional
blocks within the stage are idle) as well as partial idleness where only a subset of the
stage ’s fun c ti ona l block s a re idl e whic h he lps to reduce testing energy and time. In
addition, A
2
Test and DaemonGuard use a single usage threshold according to which a
stage is either fully tested (i.e. 100% fault coverage) or NOT tested at all (i.e. 0% fault
coverage). This has the negative consequence of either high testing overheads when a low
threshold is predefined, or low fault coverage when a high threshold is used. Instead, R-
SignTest chooses from a spectrum of fault coverage levels based on the usage level
which helps to maintain high fault coverage and reduce the testing energy and
performance overheads.
Another recent work in the same field is Sampling+DMR [66]. Nomura et al.
argued that by replicating a sample of the running instructions during every epoch, all
permanent faults will be eventually detected. Although the performance overhead of
Sampling+DMR is minimal, fault detection latency varies significantly based on the
replicated instructions. High detection latency increases the amount of wasteful execution
until the fault is detected. On the other hand, C-SignTest provides maximum coverage
against permanent faults at the end of every epoch (i.e. fault detection latency is fixed to
one epoch). R-SignTest exploits test pattern rotation in order to limit the fault detection
latency to a handful of epochs.
2.3 Heterogeneous Microprocessor Usage Signature
The current trend in the microprocessor industry is to have multiple cores on a chip
(i.e. a chip multiprocessor (CMP)). Each core comprises a series of pipeline stages. Each
stage is further sub-divided into multiple functional blocks. For example, the
UltraSPARC T1 from Sun Microsystems has eight cores. Each T1 core is a 6-stage in-
order pipeline with the following stages: fetch, thread select, decode, execute, load/store,
and write back. The execution stage of the T1 microprocessor consists of seven
functional blocks: ALU, bypass, control, divider, ECC, register management, and shifter.
In this section, we investigate the heterogeneity in microprocessor usage signature
(MUS) while running different programs. Furthermore, we investigate the heterogeneity
in the MUS within the same program execution as the program phase changes. For
example, phase_1 of program P could be memory intensive and hence lead to higher
usage of the load/store blocks. Phase_2 of the same program could be computationally
intensive and hence lead to higher usage of the execution stage blocks.
In order to quantify MUS heterogeneity at the functional block level, we map the
RTL-level design of OpenSPARC T1 core [70], which is an open source version of
UltraSPARC T1, to an FPGA. The FPGA-emulated T1 boots an unmodified Solaris
operating system. We then run ten SPEC CPU2000 benchmarks on top of Solaris and
track the MUS during every epoch. In particular, we focus on the largest two stages:
execution and load/store stages which occupy 43% of the T1 core area. Similar to the
execution stage, the load/store stage is subdivided into 14 blocks. Table 2.1 lists the
blocks' names and abbreviations, which will be used in the rest of this chapter for ease of
reference.
The epoch length is set to 100 million cycles and the rationale for this choice is
presented in subsection 2.5.3. The MUS is tracked by monitoring the switching activity at
the inputs and outputs of every functional block.

Unit         #   Block Name                                    Abbreviation
Execution    1   Arithmetic and Logic                          exu_alu
             2   Bypass Logic                                  exu_byp
             3   Divider                                       exu_div
             4   Error Correcting Code                         exu_ecc
             5   Execute Control Logic                         exu_ecl
             6   Register Management Logic                     exu_rml
             7   Shifter                                       exu_shft
Load/Store   8   D-Cache Data Path                             lsu_dcdp
             9   D-Cache Control                               lsu_dctl
             10  D-Cache Control Data Path                     lsu_dctldp
             11  Exception Control                             lsu_excpctl
             12  Queue Control (1)                             lsu_qctl1
             13  Queue Control (2)                             lsu_qctl2
             14  Processor-Cache Crossbar Data Path (1)        lsu_qdp1
             15  Processor-Cache Crossbar Data Path (2)        lsu_qdp2
             16  Store Buffer Control                          lsu_stb_ctl
             17  Store Buffer Control Data Path                lsu_stb_ctldp
             18  Store Buffer read/write Control               lsu_stb_rwctl
             19  Store Buffer read/write Data Path             lsu_stb_rwdp
             20  Tag RAM Data Path                             lsu_tagdp
             21  Translation Look-Aside Buffer Data Path       lsu_tlbdp
Table 2.1: Functional blocks names and abbreviations.

In the conservative implementation of
SignTest (C-SignTest), every functional block is assigned a single bit in the MUS which
is set to logic one if the block experiences at least one switching activity at its inputs or
outputs as an indication of the block being used during the epoch. The MUS bits of idle
blocks remain 0. In our MUS analysis there are 21 blocks as shown in Table 2.1; hence,
the MUS vector is 21-bit long. For each functional block, we compute the percentage of
epochs during which the block is used (i.e. Used Epochs) and the percentage of epochs
during which the block is idle (i.e. Idle Epochs) over the entire execution duration of each
benchmark. Intuitively, the two percentages always sum to 100%.
In the relaxed implementation of SignTest (R-SignTest), every block is
augmented with a usage counter, rather than a single bit, to track the number of cycles
during which the block experiences a switching activity at any of its inputs or outputs,
which represents the usage level of the block. The usage levels of the functional blocks
combined represent the MUS. For each block, we classify the epochs into six categories:
[0] means that the block was never used during the epoch. [0-1] means that the block was
used at least once but less than 1% of the epoch time. [1-5], [5-10], [10-20], and [20-100]
mean that the block was used 1-5%, 5-10%, 10-20%, and 20-100% of the epoch time, respectively.
Figure 2.1 plots the percentage of epochs from the six usage categories for four
representative execution functional blocks assuming R-SignTest implementation. From
the figure, we can easily measure the percentage of idle and used epochs assuming C-
SignTest implementation. The percentage of idle epochs equals the percentage of epochs
from the [0] usage category and the percentage of used epochs equals the summation of
the epoch percentages from the [0-1], [1-5], [5-10], [10-20], and [20-100] usage
categories. Three main observations can be derived from Figure 2.1:
2.3.1 Inter-epoch heterogeneity
Assuming C-SignTest implementation and for the same benchmark, a block
oscillates between idle and used states from one epoch to another. For example, the
exu_div block (3rd set of bars in the figure) has the following epoch percentages while
running Twolf benchmark: 9% idle epochs and 91% used epochs.
Similarly and assuming R-SignTest implementation, Figure 2.1 shows that for the
same block and same benchmark the usage level varies from one epoch to another. For
example, the exu_byp block (2nd set of bars in the figure) has the following epoch
percentages while running Art benchmark: [0] = 15.5%, [0-1] = 25.2%, [1-5] = 50.3%,
[5-10] = 7%, [10-20] = 1.7% and [20-100] = 0.3%.
[Figure 2.1: Execution blocks usage-based epoch classification. Stacked bars give the percentage of epochs in each usage category ([0], [0-1], [1-5], [5-10], [10-20], [20-100]) for the exu_alu, exu_byp, exu_div, and exu_ecc blocks across the ten SPEC CPU2000 benchmarks.]
2.3.2 Inter-application heterogeneity
Assuming C-SignTest implementation and for the same block but different
benchmarks, the percentages of idle and used epochs vary from one benchmark to another.
For example, the exu_ecc block has the following idle epoch percentages while running
the ten SPEC CPU2000 benchmarks: Art = 27%, Bzip2 = 0%, Crafty = 1.3%, Equake = 25%, Gzip =
0.1%, Mcf = 0.1%, Parser = 1.1%, Perlbmk = 1.1%, Twolf = 9.2% and Vpr = 4.4%.
Similarly and assuming R-SignTest implementation, Figure 2.1 shows that for the
same block but different benchmarks the percentage of epochs from one category varies
from one benchmark to another. For example, the exu_alu block has the following epoch
percentages of category [20-100]: Art = 0%, Bzip2 = 33%, Crafty = 3.6%, Equake =
0.3%, Gzip = 13.3%, Mcf = 3.5%, Parser = 15.7%, Perlbmk = 7.8%, Twolf = 3.9% and
Vpr = 3%.
2.3.3 Inter-block heterogeneity
Assuming C-SignTest implementation and for the same benchmark but different
functional blocks, the percentages of idle and used epochs vary from one block to
another. For example, while running Vpr benchmark, exu_alu and exu_byp are idle less
than 0.1% of the time. On the other hand, exu_div and exu_ecc are idle 4.4% of the total
execution time of Vpr.
Similarly and assuming R-SignTest implementation, Figure 2.1 shows that for the
same benchmark but different blocks the percentage of epochs from one category varies
from one block to another. For example, the four execution blocks have the following
epoch percentages of category [5-10] while running Twolf benchmark: exu_alu = 43%,
exu_byp = 30.8%, exu_div = 7% and exu_ecc = 0.4%.
Figure 2.2 plots the percentage of epochs from the six usage classes for four
representative load/store blocks assuming R-SignTest implementation. Again here, we
can easily measure the percentages of idle and used epochs assuming C-SignTest
implementation as described for Figure 2.1. Clearly, the aforementioned observations
hold for the load/store blocks as well. The remaining functional blocks have similar usage
behavior with vast heterogeneity across epochs and applications.
2.3.4 Exploiting usage heterogeneity
The above three observations prove that thorough testing, widely adopted by
agnostic periodic testing mechanisms [90] [28], wastes considerable amounts of energy
and time to test idle or lightly utilized functional blocks. At the same time, it is difficult
to find a single threshold setting to decide whether or not it is necessary to test a block, as
has been proposed by recent adaptive periodic testing approaches [71] [91].

[Figure 2.2: Load/store blocks usage-based epoch classification. Stacked bars give the percentage of epochs in each usage category ([0], [0-1], [1-5], [5-10], [10-20], [20-100]) for the lsu_dcdp, lsu_dctl, lsu_dctldp, and lsu_stb_ctl blocks across the ten SPEC CPU2000 benchmarks.]

SignTest
addresses these concerns by leveraging the vast heterogeneity in the microprocessor
usage signature (MUS) across epochs, applications and blocks. SignTest tracks the MUS
during every epoch and accordingly decides which blocks to test and how thoroughly to
test them. C-SignTest targets the functional blocks that were used at least once. When the
entire microprocessor is idle during an epoch, all bits in the MUS will be at logic zero by
the end of the epoch. This implies that none of the blocks contributed to the correctness
of the microprocessor state during the last epoch; hence, there is no need to run any test.
On the other extreme, when all MUS bits are at logic one, it indicates that all blocks may
have contributed to the correctness of the current microprocessor state. As a result, all
blocks must be tested to guarantee maximum permanent fault coverage.
R-SignTest dynamically changes the test coverage for each functional block
according to its latest usage level as given in Table 2.2. Any block that falls in category
[0] will not be tested, while any block that falls in category [0-1] will be tested with 90%
coverage and so on. In order to understand how we chose the six usage levels and their
respective fault coverage values, we present our empirical study on the relationship
between the block usage level and the percentage of the exercised fault sites within the
block.
Usage Level (%)      0    0-1   1-5   5-10   10-20   20-100
Fault Coverage (%)   0    90    92    95     97      100
Table 2.2: Usage-fault coverage mapping.

For each block, we measure the average and maximum usage level as a
percentage of the epoch time. At the same time, we measure the average and maximum
percentage of fault sites exercised during the epoch. We show the results for two blocks
that represent the worst and common cases. Figure 2.3a displays the average and
maximum usage level (labeled UL) and fault site percentages (labeled FS) for the
exu_alu block, which represents the worst case block. The average UL is less than 20%
of epoch time and the maximum UL reaches 99.5%. On the other hand, the percentage of
exercised FS remains around 90%. This indicates that in the worst case and regardless of
how frequently a block is used, 90% of the fault sites are exercised, which requires high
fault coverage.
Figure 2.3b tells a different story about the data control logic of the load/store
stage (lsu_dctl) which represents the common case (16 out of the 21 blocks exhibited
similar trends). On average, the usage level is less than 20% of the epoch time and
reaches a maximum of 100%. However, the percentage of fault sites exercised is 30% on
average and never exceeds 45%. These results empirically confirm that in the common
case, as the block is used more frequently the percentage of exercised fault sites
increases.
[Figure 2.3: Usage level vs. fault sites. (a) exu_alu; (b) lsu_dctl. For each benchmark, bars show the average and maximum usage level (UL) and percentage of exercised fault sites (FS).]
As the average block usage level in all cases is less than 20% of the epoch time,
we chose 20% as the usage threshold which, when exceeded, triggers a thorough test to
provide maximum permanent fault coverage. This is represented in Table 2.2 by mapping
the [20-100] usage category to 100% fault coverage. On the other extreme, if the block is
not used during an epoch then there is no need to test it (i.e. [0]). The region between 0%
and 20% is divided into four usage levels (i.e. [0-1], [1-5], [5-10], and [10-20]).
Conservatively and based on the maximum percentage of exercised fault sites given in
Figure 2.3a, we mapped the four usage levels to 90%, 92%, 95%, and 97% fault
coverage, respectively.
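A minimal sketch of the resulting usage-to-coverage mapping is given below; the thresholds and coverage values come directly from Table 2.2, while the function name and the use of a fractional usage level are illustrative.

def target_fault_coverage(usage_fraction):
    """Map a block's usage level (fraction of epoch cycles with switching activity)
    to the target fault coverage used by R-SignTest, per Table 2.2."""
    if usage_fraction == 0.0:
        return 0     # idle block: skip testing
    if usage_fraction <= 0.01:
        return 90
    if usage_fraction <= 0.05:
        return 92
    if usage_fraction <= 0.10:
        return 95
    if usage_fraction <= 0.20:
        return 97
    return 100       # heavily used block: thorough test

# Example: a block used 8% of the epoch is tested to 95% coverage.
print(target_fault_coverage(0.08))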
2.4 SignTest Design Components
The implementation of SignTest is independent of the way test
patterns/programs are applied and verified (i.e. testing methodology). One could choose
to perform the testing process using a BIST-like structure [90]. In BIST, every functional
block is augmented with an on-chip read-only-memory (ROM) that holds the block's test
patterns. A multiplexer at the block's input selects between normal data during normal
operation and test vectors, read from ROM, during the testing phase. BIST also requires a
checker at the block's output to capture test responses and validate them.
Another testing methodology is design for testability (DFT) scan chains. Most
modern microprocessors include full hold-scan structures to aid post-silicon testing
process [103] [50]. Further, one can leverage these scan chains to periodically apply test
patterns and collect their responses during in-field operation.
Most recently, software-based self-test (SBST) has emerged as a non-intrusive
testing methodology. Unlike BIST and DFT scan chains, SBST requires no additional
hardware for testing purposes. SBST translates the test vectors of every functional block
to a sequence of instructions (i.e. test program) from the instruction set architecture (ISA)
of the microprocessor [73] [99]. In order to test a specific functional block using SBST,
the test program(s) associated with that block are loaded from memory and executed as
regular applications. At the end of each test, the contents of the register file are checked to verify
the test output. Irrespective of the testing methodology used, SignTest's design consists
of two components: the usage tracker and the testing engine.
2.4.1 Usage tracker for C-SignTest
The usage tracker is responsible for monitoring the switching activity at the
blocks’ input a nd output ports a nd ge ne r a ti ng the mi c ropr oc e sso r usa g e si gna ture (MU S )
accordingly. For C-SignTest, all functional blocks that experience at least one switching
activity at any of their input or output ports during an epoch are considered used while
the rest are assumed idle. Hence, it is sufficient for the usage tracker of C-SignTest to
generate a single usage bit per functional block to indicate whether the block is used or
idle during the epoch. The concatenation of all usage bits together forms a bit vector that
represents the MUS. Figure 2.4 shows the usage tracker logic for functional block X
given C-SignTest implementation.
In order to detect the toggling activity at block X inputs and outputs, every input
and output bit of the block is connected to a D-type flip-flop (D-FF) through an XOR
gate. The D-FF stores the old value of the input/output bit and the XOR gate compares
the old and the new values. The outputs of the XOR gates are OR-ed together, and the result of
the OR-ing indicates whether a toggling activity occurred during the current cycle or not.
Although it seems sufficient to only monitor the inputs of the functional blocks, it is
necessary to monitor the outputs as well in order to cover the cases where a permanent
fault causes one or more of the outputs to flip without any switching activity at the inputs.
The output of the XOR OR-ing drives the MUS bit of block X (i.e. MUS[x]). At
the beginning of every epoch, the MUS bit is initialized with logic zero and all D-FFs
that track changes in inputs and outputs are initialized with the current state of their
respective inputs and outputs. During the epoch, the first time an input or output bit
toggles, the corresponding XOR gate detects the transition and sets MUS[x] to logic one.
[Figure 2.4: Usage tracker for C-SignTest. Every input and output bit of block X drives an XOR gate together with a D flip-flop holding its previous value; the XOR outputs are OR-ed to set MUS[x], which then clock-gates itself and power-gates the tracking flip-flops.]
Once MUS[x] is set, it clock-gates itself to prevent any overwriting to logic zero. In
addition, MUS[x] power gates all usage tracking D-FFs in order to save dynamic and
leakage power. To reduce the complexity of Figure 2.4, we show the power gating
connections for only one usage tracking D-FF, namely the one connected to out0. In
simple words, with C-SignTest we are interested in identifying functional blocks which
experience at least one switching activity at their input or output bits during the epoch.
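A cycle-level software model may help to clarify the intent of this tracker; it mirrors the D-FF/XOR/OR structure of Figure 2.4 in Python with illustrative names, and it abstracts the clock- and power-gating details into simply making the MUS bit sticky.

class CSignTestTracker:
    """Behavioral model of the C-SignTest usage tracker for one functional block."""
    def __init__(self, initial_pins):
        self.prev = list(initial_pins)  # D-FFs initialized with the current pin state
        self.mus_bit = 0                # MUS[x], cleared at the start of every epoch

    def cycle(self, pins):
        """pins: current values of all monitored input and output bits of the block."""
        if self.mus_bit:                # once set, the bit is sticky for the epoch
            return self.mus_bit
        # XOR the old and new value of every pin, then OR the results together.
        toggled = any(p != q for p, q in zip(pins, self.prev))
        self.prev = list(pins)
        self.mus_bit = 1 if toggled else 0
        return self.mus_bit

# Example: the block is idle for two cycles, then one output bit toggles.
t = CSignTestTracker([0, 0, 1])
print(t.cycle([0, 0, 1]), t.cycle([0, 0, 1]), t.cycle([0, 1, 1]))   # 0 0 1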
2.4.2 Usage tracker for R-SignTest
The usage tracker of R-SignTest needs to provide the frequency of usage for each
functional block. Hence, every functional block is augmented with a dedicated usage
counter that gets incremented every cycle a switching activity is detected at any of the
block’ s input s or output s. R-SignTest relies on the block’s usa ge c ounter to determine the
thoroughness of its testing. The values of the usage counters of all functional blocks
combined represent the MUS for R-SignTest.
Figure 2.5 shows the usage tracker for functional block X assuming R-SignTest
implementation. The parts responsible for monitoring the switching activity at the block's
inputs and outputs (i.e. D-FFs and XOR gates) are identical to the ones used in the usage
tracker of C-SignTest described in Figure 2.4. However for R-SignTest usage tracker, the
output of the XOR OR-ing drives the enable input of the usage counter of block X. At the
beginning of every epoch, the usage counter is initialized to zero and all D-FFs that track
changes in inputs and outputs are initialized with the current state of their respective
inputs and outputs. During the epoch, every cycle where at least one input or output bit
toggles, the corresponding XOR gate detects the transition and enables the usage counter
to be incremented by one. At the end of the epoch, the usage counter value would reflect
the usage level of the block (i.e. the percentage of time the block was used during the
epoch).
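The counter-based variant differs only in what is accumulated; a matching behavioral sketch, under the same illustrative assumptions as the C-SignTest model above, is:

class RSignTestTracker:
    """Behavioral model of the R-SignTest usage tracker for one functional block."""
    def __init__(self, initial_pins):
        self.prev = list(initial_pins)
        self.usage_counter = 0          # cleared at the start of every epoch

    def cycle(self, pins):
        if any(p != q for p, q in zip(pins, self.prev)):
            self.usage_counter += 1     # count every cycle with switching activity
        self.prev = list(pins)
        return self.usage_counter

def usage_level(counter_value, epoch_cycles):
    """Usage level = fraction of epoch cycles in which the block was active."""
    return counter_value / epoch_cycles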
2.4.3 Testing engine
At the end of every epoch, the testing engine investigates the MUS and selects the
test patterns/programs to be applied accordingly. For C-SignTest, the testing engine only
chooses the test patterns/programs for the functional blocks whose MUS bits are one by
the end of the epoch and skips the test patterns/programs for the blocks whose MUS bits
are zero by the end of the epoch. This helps C-SignTest to save testing energy and time
without sacrificing permanent fault coverage.

[Figure 2.5: Usage tracker for R-SignTest. The monitoring structure (D flip-flops and XOR gates) is identical to Figure 2.4, but the OR-ed XOR output enables a per-block usage counter, Usage Counter[x], instead of setting a single MUS bit.]

For R-SignTest, the testing engine
determines the target coverage of every functional block according to the value of its
usage counter as given in Table 2.2. The target coverage of the functional block is then
translated to the percentage of test patterns/programs which needs to be applied. The
translation from target coverage to percentage of test patterns is explained in
subsection 2.5.1.
After that, the engine starts the testing phase by applying the selected test
patterns/programs and verifying their responses. When all test patterns/programs
complete successfully, the microprocessor is assumed to be healthy and normal operation
resumes. If a test pattern/program fails to match its golden response, it is applied again to
verify that the fault is not transient. In case the test pattern/program fails again, the
corresponding functional block is considered to have a non-transient fault.
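The end-of-epoch flow of the testing engine can be sketched as follows; coverage_of, patterns_of, and apply_pattern are illustrative stand-ins for the Table 2.2 mapping and for the methodology-specific machinery (BIST, scan chains, or SBST), not actual SignTest interfaces.

def testing_phase(blocks, mus, coverage_of, patterns_of, apply_pattern):
    """Return the set of blocks flagged as having a non-transient fault."""
    faulty_blocks = set()
    for block in blocks:
        coverage = coverage_of(mus[block])      # 0% coverage means the block is skipped
        for pattern, golden in patterns_of(block, coverage):
            if apply_pattern(block, pattern) == golden:
                continue
            # Mismatch: re-apply once to verify that the fault is not transient.
            if apply_pattern(block, pattern) != golden:
                faulty_blocks.add(block)
                break
    return faulty_blocks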
The main focus of SignTest is fault detection. Nonetheless, SignTest can be easily
combined with an existing check-pointing mechanism to recover the correct
microprocessor state when a non-transient fault is detected. Some check-pointing
mechanisms that have small hardware overhead can be used in conjunction with SignTest
[72] [94]. Once the correct microprocessor state is recovered, it can be migrated to
another healthy microprocessor in a chip multiprocessor (CMP) environment and the
faulty microprocessor is de-configured. Another viable fault tolerance option would be to
only de-configure the faulty block and allow the microprocessor to continue normal
operation with degraded performance. Note that the latter option is only possible when
there are multiple instances of the faulty block and there is at least one healthy instance
(e.g. multiple ALUs, multipliers, and dividers).
2.5 Experimental Methodology
To evaluate SignTest, we use the OpenSPARC T1 core as a substrate [70].
OpenSPARC T1 is the open source version of UltraSPARC T1 from Sun Microsystems.
The OpenSPARC T1 microarchitecture consists of eight 6-stage in-order pipeline cores.
Each core can support up to 4 threads. The eight cores are connected to a shared 3MB L2
cache through a crossbar. In our evaluation, we focus on a single T1 core. In particular,
we apply SignTest to detect permanent faults in the execution and load/store stages of the
T1 core.
We evaluate C-SignTest by assuming it is implemented on top of the three main
testing methodologies: BIST, DFT scan chains, and SBST. For each testing methodology,
we compare C-SignTest to a base machine which thoroughly tests all functional blocks at
the end of every epoch regardless of their usage. This helps to prove that SignTest can be
used as a generic detection framework in conjunction with any testing methodology to
reduce the testing energy and performance overheads of the naïve thorough periodic
testing.
For BIST, the base machine is similar to the one described in [90]. At the end of
every epoch, all functional blocks are tested in parallel regardless of their usage and the
test vectors of each block are applied one after another and verified. Hence, the testing
performance overhead is determined by the functional block that has the largest number
of test vectors. The testing energy overhead is the summation of the energy consumed by
each functional block during the testing phase which in turn is a function of the number
of test vectors per block.
For DFT scan chains, we assume that the base machine implements the scan chain
structure of the T1 core in which the 21 blocks under study are divided into eight scan
chains. During the testing phase of the base machine, all functional blocks are assumed to
be used and the test vectors of the functional blocks sharing the same scan chain are
concatenated together and loaded into the scan chain [7] [28]. The total number of
vectors loaded into each scan chain is determined by the functional block which has the
largest number of test vectors and is part of that chain. Since each scan chain has a
dedicated scan-in pin and scan-out pin, the eight scan chains are loaded in parallel. The
testing performance overhead is determined by the scan chain which requires the largest
number of clock cycles to load and apply its associated test vectors. The testing energy
overhead is the summation of the energy consumed by each scan chain which in turn is
determined by the functional block with the largest number of test vectors within each
chain.
For SBST, the base machine thoroughly tests every functional block at the end of
every epoch by running all its respective test programs. The testing performance
overhead for SBST is the time needed to run all selected test programs. The testing
energy overhead is equal to the energy consumed by all functional blocks during the entire
testing phase.
Further, we compare C-SignTest to an adaptive scheme (ADAPT) that
implements a testing approach similar to the application-aware testing (A2Test) [71] and
DaemonGuard [91] approaches. As explained in subsection 2.2.2, these approaches use
high level information (e.g. instructions' opcodes) to collect usage statistics at the
pipeline stage level and use a predetermined usage threshold to decide whether a stage
should be thoroughly tested or not tested at all. Since the original description of these
approaches assumed SBST methodology, we compare C-SignTest and ADAPT assuming
SBST methodology as well. In order to make the comparison meaningful, we set the
usage threshold of ADAPT to one for all stages. This means that when any stage is used
at least once it will be thoroughly tested and the stage will not be tested when it is idle.
This way, both C-SignTest and ADAPT achieve the maximum fault coverage against
permanent faults which makes the comparison of their testing energy and time overheads
relevant.
Compared to C-SignTest, ADAPT can only distinguish cases where all functional
blocks within a stage are idle. This means that the 14 load/store functional blocks are
considered either all idle when no memory instruction is executed during the epoch or all
used when at least a single memory instruction is executed during the epoch. Similarly,
all execution blocks except divider block are considered either all idle or all used. Since
the divider block is only used by dividing instructions, ADAPT can distinguish its idle
epochs independently from the rest of the execution blocks. The comparison of C-
SignTest and ADAPT helps to show that monitoring the usage of the microprocessor at
the functional block level is more accurate and efficient than using instructions' opcodes
to monitor the usage at the stage level.
We evaluate relaxed SignTest (R-SignTest) by comparing it to five ADAPT
schemes with different usage thresholds: 1-time, 1%, 5%, 10%, and 20% of the epoch
time. R-SignTest is evaluated only using the SBST approach because it is the only
methodology used in evaluating prior adaptive schemes [71] [91]. Thus, the comparison
of R-SignTest with prior adaptive schemes will be easier if we use SBST methodology.
However, R-SignTest does work with any of the three testing methodologies that were
described earlier for evaluating C-SignTest. As mentioned before, prior adaptive schemes
[71] [91] leverage instructions' opcodes to monitor the usage at the pipeline stage level,
such as execution stage and load/store stage. However, we assume that the five ADAPT
schemes, that we compare with R-SignTest, magically monitor the usage at the functional
block level, such as ALU and divider. This helps to disregard the advantage of
monitoring the usage at the functional block level, as proposed by SignTest, compared to
monitoring the usage at the pipeline stage level, as proposed by the prior adaptive
schemes, and only focus on the advantage of having multiple (usage threshold, fault
coverage) settings, as proposed by R-SignTest in Table 2.2, compared to having a single
(usage threshold, fault coverage) setting, as proposed by the adaptive schemes. As
mentioned before, the advantage of monitoring the usage at the functional block level can
be deduced from the comparison between C-SignTest and ADAPT.
2.5.1 Test patterns/programs
2.5.1.1 Generating test patterns/programs
To generate the test patterns for the 21 blocks under study, the RTL of the
individual blocks is extracted and synthesized using Synopsys Design Compiler and IBM
90nm standard cell library. Consequently, the TetraMax automatic test pattern generation
(ATPG) tool is used to generate the test patterns for each block using the industry-
standard stuck-at fault model. Each test pattern covers a certain percentage of the possible
fault sites in the functional block associated with the pattern.
The test patterns are directly used by BIST and DFT scan chains methodologies.
For SBST methodology, we make a simplified assumption by replacing each test pattern
generated by the ATPG tool with a single ISA instruction executed in a single cycle to
create that same test pattern. Hence, the test program of a functional block in SBST
consists of n-instructions and is executed in n-cycles, where n is the number of test
patterns associated with the block. Notice that in reality, m-instructions are needed per
pattern and m depends on the pattern. To understand how this simplified assumption
affects our SBST results, let us consider an example of two functional blocks A and B
with 15 and 25 test patterns, respectively. Let us also assume that in reality, the test
patterns of block A need 35 instructions executed in 35 cycles and the test patterns of
block B need 30 instructions executed in 30 cycles. Based on our simplified assumption,
when block A is idle and block B is used, we save 15 cycles which represent 37.5% of
the testing time. However in reality, we save 35 cycles which represent 54% of the
testing time. When block B is idle and block A is used, we save 25 cycles based on our
simplified assumption which represent 62.5% of the testing time. While in reality, we
save 30 cycles which represent 46% of the testing time.
2.5.1.2 Rotating test patterns for R-SignTest
During the testing phase, C-SignTest covers all the fault sites exercised during the
last epoch. This is guaranteed because all used blocks are thoroughly tested. On the other
hand, R-SignTest does not guarantee all the fault sites exercised during the last epoch to
be covered by the selected test patterns. For example, if the usage level of a functional
block is 8% then the test will cover 95% of the fault sites within the block as given in
Table 2.2. However, there is no guarantee that the left-out 5% of fault sites were not
exercised during the epoch. To remedy this limitation of R-SignTest, test pattern rotation
is implemented.
Test pattern rotation guarantees all possible fault sites to be tested in a round-
robin fashion. The rotation is implemented by storing the test patterns of every block in a
queue-like structure. At the end of every epoch, the testing engine determines the
percentage of test patterns to be applied and extracts them from the head of the queue.
After each pattern completes, it gets re-inserted at the tail of the queue. This will ensure
that the next time the block is tested, priority is given to the least recently applied
patterns.
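A minimal sketch of this rotation, keeping one queue of patterns per block (the pattern names and the fraction-based selection are illustrative):

from collections import deque
import math

def select_patterns(pattern_queue, fraction):
    """pattern_queue: deque of test patterns for one block.
    fraction: share of the block's patterns to apply this epoch (e.g. 0.4 for 40%)."""
    count = math.ceil(fraction * len(pattern_queue))
    selected = []
    for _ in range(count):
        pattern = pattern_queue.popleft()   # take the least recently applied pattern
        selected.append(pattern)
        pattern_queue.append(pattern)       # re-insert at the tail for round-robin rotation
    return selected

# Example: applying 40% of a 5-pattern queue in two consecutive epochs.
q = deque(["p1", "p2", "p3", "p4", "p5"])
print(select_patterns(q, 0.4))   # ['p1', 'p2']
print(select_patterns(q, 0.4))   # ['p3', 'p4']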
For pattern rotation to work, different test pattern sequences with the same
number of patterns should approximately provide the same fault coverage for a specific
functional block. To validate this, for each block, we use the complete list of the block's
test patterns and produce 100 random shuffles of these patterns. Each shuffle is
incrementally applied until the fault coverage reaches 100%.
Figure 2.6a shows the fault coverage results of the 100 shuffles for the exu_alu
block. Similarly, Figure 2.6b depicts the fault coverage results of the 100 shuffles for the
lsu_excpctl block. Since there is no need to differentiate between the shuffles, we omitted
the fi g ur e s ’ le ge nds fo r c lar it y . The two fig u re s show that th e shuf fle s e x hibi t sim il a r
behavior and provide very close fault coverage values despite applying the test patterns in
different orders. This is because many test patterns sensitize common fault sites. In other
words, whether we apply pattern1 then pattern2 or pattern2 then pattern1, the growth in
fault coverage is very similar. The remaining functional blocks respond in the same
manner to different sequences of test patterns.
R-SignTest might still miss a fault during an epoch if the fault site is not covered
by the selected test patterns. However, test pattern rotation guarantees that the fault will
be detected after a handful of epochs. Based on our analysis, in the worst case,
the fault will be detected after 10 epochs. This will happen if the fault is in the exu_div
block which only requires 10% of its test patterns to achieve 90% fault coverage. This
means that in the worst case (i.e. the exu_div usage level is less than 1% for 10
consecutive epochs) we need 10 epochs to apply all the test patterns and eventually detect
the fault. For all blocks that need 20% or more of its test patterns to achieve 90% fault
coverage, the fault will be detected within the next five epochs in the worst case.
2.5.1.3 Applying test patterns based on usage
When a block is used with C-SignTest or the usage level of a block is 20-100%
with R-SignTest, all the test patterns of the block are applied to achieve maximum
permanent fault coverage.

[Figure 2.6: Test pattern shuffling. (a) exu_alu; (b) lsu_excpctl. Fault coverage (%) versus the number of test patterns applied, for 100 random shuffles of each block's test pattern list.]

On the other extreme, when the block is idle with C-SignTest
or R-SignTest, there is no need to apply any test pattern. R-SignTest determines the
maximum percentage of test patterns that needs to be applied when the usage level of a
block is 0-1%, 1-5%, 5-10%, and 10-20% (i.e. the maximum percentage of test patterns
required to achieve 90%, 92%, 95%, and 97% fault coverage as given in Table 2.2). Due
to test pattern rotation, the order by which the test patterns are applied keeps changing;
hence, using the maximum percentage of test patterns guarantees the target fault coverage
to be achieved regardless of the current pattern order.
For each functional block, the maximum percentages of test patterns for 90%,
92%, 95%, and 97% fault coverage are determined as follows:
1. The test patterns generated by TetraMax are sorted in ascending order of the
number of unique fault sites covered by each test pattern. Thus, the first test
pattern covers the least number of fault sites and the last pattern in this sorted order
covers the most number of fault sites. The sorted list of test patterns is essentially
the bottom most shuffles in Figure 2.6, which are highlighted by red diamond
markers.
2. Using this sorted list and starting from the first test pattern (i.e. the one which
covers the least number of fault sites), we measure the amount of fault coverage
each successive test pattern incrementally provides. We then create the first cutoff
point when 90% fault coverage is achieved. This gives the maximum number of
test patterns needed to guarantee 90% coverage. The same is done for 92%, 95%,
and 97% fault coverage, as sketched below.
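Assuming each ATPG pattern is annotated with the set of fault sites it covers, the cutoff computation described above could look roughly like the sketch below; the names and data layout are illustrative.

def coverage_cutoffs(pattern_fault_sites, total_fault_sites, targets=(90, 92, 95, 97)):
    """pattern_fault_sites: one set of covered fault-site IDs per test pattern.
    Returns {target coverage (%): maximum number of patterns needed to reach it}."""
    ordered = sorted(pattern_fault_sites, key=len)   # ascending by unique fault sites
    covered, cutoffs, remaining = set(), {}, list(targets)
    for count, sites in enumerate(ordered, start=1):
        covered |= sites
        coverage = 100.0 * len(covered) / total_fault_sites
        while remaining and coverage >= remaining[0]:
            cutoffs[remaining.pop(0)] = count
    return cutoffs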
Figure 2.7 displays the maximum percentage of test patterns required to guarantee
fault coverage between 90% and 100% against stuck-at permanent faults for execution
and load/store blocks. For all execution blocks, 90% of the fault sites are covered using
less than 40% of the test patterns. To achieve 97% fault coverage, 40-66% of the test
patterns are needed. Load/store blocks require less than 50% of the test patterns to
achieve 90% fault coverage and 35-75% of the test patterns to achieve 97% fault
coverage.
2.5.2 Benchmarks
Ten SPEC CPU2000 benchmarks are executed. We could not use the most recent
SPEC CPU2006 benchmarks in our experiments because their code sections and data sets
did not fit in the available external memory card attached to the Xilinx ML509 evaluation
platform where the T1 core is mapped.
Since our evaluation focuses on the integer execution and load/store functional
blocks, we choose eight out of ten benchmarks to be integer benchmarks. This helps to
demonstrate that while running integer applications which mainly stress the blocks under
study, there is still plenty of variation in the usage across epochs, benchmarks, and
blocks, which can be leveraged to reduce testing energy and time.

[Figure 2.7: Fault coverage vs. test patterns. (a) Execution blocks; (b) Load/store blocks. For each block, the maximum percentage of test patterns required to guarantee fault coverage between 90% and 100% against stuck-at permanent faults.]

The other two
benchmarks (i.e. Art and Equake) are floating-point. The latter benchmarks help to
demonstrate how special purpose applications (e.g. floating-point applications and
stream-processing applications) render the vast majority of the microprocessor blocks idle,
providing considerable opportunities for reducing testing energy and performance
overheads.
2.5.3 Energy and performance overheads evaluation
To evaluate the energy and performance overheads of SignTest, we need to track
the microprocessor usage signature (MUS). An FPGA-based emulation framework is
used to monitor the switching activity at the inputs and outputs of the 21 blocks under
study. The full OpenSPARC T1 [70] core is mapped on a Xilinx ML509 FPGA
evaluation platform that uses Virtex 5 XC5VLX110T FPGA chip [1]. OpenSolaris 11 [2]
is booted on the OpenSPARC T1 and ten SPEC CPU2000 benchmarks are compiled to
run on the T1 core.
Xilinx ChipScope Pro Integrated Logic Analyzer (ILA) [3] probes the inputs and
outputs of the 21 blocks while running each benchmark. The inputs and outputs are saved
as value change dump (VCD) files and used to generate the MUS at the end of every
epoch. The MUS is used to select the test patterns/programs to be applied during the
testing phase. This essentially determines the testing time based on the testing
methodology in place. Testing energy overhead is computed as the summation of the
energy consumed by the individual functional blocks during the testing phase. To
compute the energy consumption of each block, we use the block's average power
consumption reported by Synopsys PrimeTime.
The energy and performance overheads of any in-field periodic testing
mechanism are functions of the testing time (i.e. how long it takes to apply the selected
test patterns/programs) and the testing frequency (i.e. how often the microprocessor is
tested). The testing frequency depends on the epoch length, which also determines the
memory logging requirements when a check-pointing mechanism is in-situ to recover the
microprocessor state in case of a fault. We perform our energy and performance
overheads evaluation using an epoch length of 100 million cycles. Assuming a coarse-
grain check-pointing mechanism, an epoch length in the order of hundreds of millions of
cycles has reasonable memory logging requirements of a couple of megabytes [28].
2.5.4 Area overhead evaluation
Most of the area consumed by SignTest is occupied by the usage tracker logic
shown in Figure 2.4 and Figure 2.5. A detailed RTL design of the usage tracker logic is
implemented and synthesized using Synopsys Design Compiler with IBM 90nm standard
cell library. This provides an accurate estimation of the area overhead associated with
SignTest.
Evaluation Results 2.6
Energy and performance for C-SignTest 2.6.1
Using the BIST methodology: Figure 2.8 plots the relative energy and
performance overheads for C-SignTest when implemented on top of BIST methodology.
The results are relative to a base machine which agnostically tests all functional blocks at
the end of every epoch [90]. On average, C-SignTest provides 20% savings in the testing
energy overhead and 4% savings in the testing performance overhead, measured in terms
of the total execution cycles needed for completing the test patterns.
When C-SignTest determines that a functional block is idle during an epoch, no
test pattern is applied at the block's inputs, which directly translates to extra energy
savings. However, that is not always the case for performance savings. As mentioned in
Section 2.5, performance overhead of the BIST methodology during an epoch is
determined by the functional block with the largest number of test patterns. Hence, even
when a functional block is idle, C-SignTest may not reduce the total testing time if
another block that has a larger number of test patterns is used and must be tested. Note that the average column in the figure represents the weighted average across all benchmarks. Compared to integer benchmarks, floating-point benchmarks (i.e. Art and Equake) run for a much longer period; hence, they contribute more to the savings.
Figure 2.8: Relative overheads of C-SignTest on top of BIST methodology. (Bar chart per benchmark; y-axis: overhead relative to the base machine, 0 to 1; bars: Relative Energy Overhead and Relative Performance Overhead.)
For the 21 blocks under study, lsu_dctl has the maximum number of test patterns
of 3263 followed by exu_ecl with 1992 patterns. During a specific epoch if the lsu_dctl is
used and exu_ecl is idle, then C-SignTest will achieve energy savings because exu_ecl
does not have to be tested. However, the performance overhead with the BIST
methodology is determined by lsu_dctl; hence, it will remain the same. On the other
hand, if during another epoch the lsu_dctl is idle and exu_ecl is used, then C-SignTest
will achieve energy savings because lsu_dctl does not have to be tested. In addition, C-
SignTest will achieve performance savings (test time reduction) because the maximum
number of test patterns with C-SignTest, for that epoch, is 1992 compared to 3263 in the
base machine.
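The following Python sketch captures this accounting for a single epoch under the BIST methodology: testing energy is summed over the used blocks, while testing time is set by the used block with the largest number of test patterns. The two pattern counts come from the example above; the per-pattern energy values are invented placeholders.

# C-SignTest on top of BIST (illustrative sketch).
def bist_epoch_overheads(blocks, used):
    # blocks: name -> (num_patterns, energy_per_pattern); used: set of block names
    energy = sum(n * e for name, (n, e) in blocks.items() if name in used)
    time_cycles = max((blocks[name][0] for name in used), default=0)
    return energy, time_cycles

if __name__ == "__main__":
    blocks = {"lsu_dctl": (3263, 1.0), "exu_ecl": (1992, 1.0)}   # energies are placeholders
    # lsu_dctl used, exu_ecl idle: energy drops, but test time stays at 3263 patterns
    print(bist_epoch_overheads(blocks, {"lsu_dctl"}))
    # lsu_dctl idle, exu_ecl used: both energy and test time (1992 patterns) drop
    print(bist_epoch_overheads(blocks, {"exu_ecl"}))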
Using the DFT scan chains methodology: Figure 2.9 plots the relative energy
and performance overheads for C-SignTest when implemented on top of DFT scan chain
methodology. The results are relative to a base machine which agnostically tests all
functional blocks at the end of every epoch and uses OpenSPARC T1 core scan chain
structure to apply test patterns and collect their responses. On average, C-SignTest
provides 17.2% savings in the testing energy overhead and 8.3% savings in the testing
performance overhead.
To understand how C-SignTest provides energy and performance savings when
implemented on top of DFT scan chains, we consider the example of two functional
blocks A and B that share the same scan chain. Assume that block A comes first in the
chain and has 10 scan-in registers and 50 test vectors. Block B comes second in the chain
and has 20 scan-in registers and 30 test vectors. Every test cycle in DFT includes:
creating a chain test vector by concatenating one test vector from A with one test vector
from B, shifting the chain test vector bits one at a time (i.e. 30 clock cycles to load every
test vector), and clocking the scan chain in order to apply the chain test vector at the
inputs of A and B.
The base machine, which assumes that both blocks are used all the time, will
apply 50 chain vectors (i.e. maximum test vectors between A and B) at the end of every
epoch. Each vector needs 30 clock cycles to be loaded (i.e. number of scan registers in
the chain) and one cycle to be applied at the blocks' inputs, so a total of 1550 clock
cycles are needed to test the chain. C-SignTest can determine whether a block is used or
idle during an epoch. In case both blocks are used, then C-SignTest will cause the same
overheads as the base machine. However, if during an epoch both blocks are idle, there is
no need to load and apply any test pattern which gives the highest savings. In case block
A is used and block B is idle, C-SignTest would need to apply 50 test vectors and each
vector is loaded in 10 clock cycles, so only 550 clock cycles are needed (i.e. 65% savings). The last scenario is when block A is idle and block B is used; here, SignTest needs to apply 30 test vectors and each vector needs 30 clock cycles to be loaded, so a total of 930 clock cycles are required for testing purposes (i.e. 40% savings).
Figure 2.9: Relative overheads of C-SignTest on top of DFT methodology. (Bar chart per benchmark; y-axis: overhead relative to the base machine, 0 to 1; bars: Relative Energy Overhead and Relative Performance Overhead.)
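The scan-chain accounting in this example can be condensed into the short Python sketch below. It assumes, as in the example, that loading one chain vector takes a shift cycle per scan register up to and including the deepest used block in the chain, plus one cycle to apply the vector.

# DFT scan chain test cycles under C-SignTest (sketch of the two-block example).
def dft_test_cycles(chain, used):
    # chain: list of (name, scan_registers, num_test_vectors) in scan order
    used_idx = [i for i, (name, _, _) in enumerate(chain) if name in used]
    if not used_idx:
        return 0   # all blocks idle: nothing to load or apply
    deepest = max(used_idx)
    load_cycles = sum(regs for _, regs, _ in chain[:deepest + 1])
    num_vectors = max(chain[i][2] for i in used_idx)
    return num_vectors * (load_cycles + 1)

if __name__ == "__main__":
    chain = [("A", 10, 50), ("B", 20, 30)]
    print(dft_test_cycles(chain, {"A", "B"}))   # 1550 cycles (base machine)
    print(dft_test_cycles(chain, {"A"}))        # 550 cycles (65% savings)
    print(dft_test_cycles(chain, {"B"}))        # 930 cycles (40% savings)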
Using the SBST methodology: Figure 2.10 plots the relative energy and
performance overheads for C-SignTest when implemented on top of SBST methodology.
The results are relative to a base machine which runs all test programs at the end of every
epoch. We also report the energy and performance overheads of the ADAPT framework
using a usage threshold of one for all pipeline stages (i.e. a pipeline stage is either
thoroughly tested when it is used at least once or not tested at all when it is idle).
With SBST methodology, the relative energy and performance overheads are
identical; hence, the y-axis in the figure represents both relative energy overhead and
relative performance overhead. With SBST, the test programs use native instructions
from the machine ISA and run as regular applications; hence, all the functional blocks or
the pipeline stages need to be powered during the entire testing phase regardless of the
block/stage targeted by the running test program. In other words, fine grain power gating
during the testing phase is not possible with SBST which causes the testing energy to
become a linear function of the testing time (i.e. relative testing energy and relative
testing time are identical). Let us consider a system with only two blocks/stages A and B
and their test programs need 50 and 200 cycles, respectively. Let us also assume that the average power consumption of A and B is 10 and 20 energy units per cycle, respectively. Since both blocks stay powered during testing, every testing cycle consumes 30 energy units. When A is idle during a specific epoch, the testing time is reduced by 20% (i.e. 50/250) and the testing energy is reduced by the same percentage (i.e. (50*30)/(250*30)).
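The same arithmetic is captured by the Python sketch below. Because every block stays powered throughout the SBST testing phase, the ratio it returns serves as both the relative testing time and the relative testing energy.

# SBST relative overheads for the two-block example (illustrative sketch).
def sbst_relative_overhead(test_cycles, used):
    # test_cycles: block -> test-program cycles; used: set of used blocks
    base_time = sum(test_cycles.values())
    time = sum(c for b, c in test_cycles.items() if b in used)
    return time / base_time   # identical ratio for time and energy

if __name__ == "__main__":
    cycles = {"A": 50, "B": 200}
    print(sbst_relative_overhead(cycles, {"B"}))        # 0.8 -> 20% savings when A is idle
    print(sbst_relative_overhead(cycles, {"A", "B"}))   # 1.0 -> no savings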
On average, C-SignTest provides 28% savings in the testing energy and
performance overheads while ADAPT only achieves 2% savings. C-SignTest tracks the
microprocessor usage at the granularity of a functional block; hence, C-SignTest captures
all idle opportunities at the block level. On the other hand, ADAPT monitors the usage at
the pipeline stage level and only captures opportunities where all functional blocks within
a stage are simultaneously idle.
Compared to BIST and DFT scan chains, C-SignTest achieves better relative
energy and performance savings when implemented in conjunction with SBST. This
result is expected because with SBST, the idleness of any functional block translates to
direct savings in both energy and performance overheads. Another general observation
from the three figures is that higher savings in the energy and performance overheads are
achieved while running floating-point benchmarks, Art and Equake, compared to integer
benchmarks since floating-point benchmarks use integer execution and load/store blocks
sparingly.
Figure 2.10: Relative overheads of C-SignTest and ADAPT on top of SBST. (Bar chart per benchmark; y-axis: relative energy and performance overhead, 0 to 1; bars: C-SignTest and ADAPT.)
The results from the three test methodologies show that SignTest can be thought
of as a generic scheme to enhance the efficiency of thorough periodic fault detection
mechanisms by reducing the energy and time allocated for testing purposes regardless of
the testing methodology in place. SignTest achieves this enhancement by monitoring the
microprocessor usage at the functional block level and steering the testing phase
accordingly.
Energy and performance for R-SignTest 2.6.2
Figure 2.11 shows the relative energy and performance overheads of R-SignTest
and ADAPT framework with five different usage thresholds. The base machine in this
figure is the same as the one used in Figure 2.10 with SBST methodology and all
functional blocks tested thoroughly at the end of every epoch. The leftmost bar in the
figure represents R-SignTest. The rightmost five bars show the overheads of ADAPT
with usage thresholds: 1-time, 1%, 5%, 10%, and 20% of the epoch time, respectively.
The results in this figure assume that the usage monitoring in the ADAPT framework is done at the functional block level. The block is tested with full coverage only if its usage level exceeds the predefined usage threshold; otherwise, the block will not be
tested at all. For example, if the usage threshold is 1%, then the block will be tested with
100% coverage if it is used more than 1% of the epoch time. On the other hand, R-
SignTest chooses the test coverage for each functional block according to its usage level
during the epoch as given in Table 2.2. Since SBST is used, the Y-axis in the figure
represents the relative energy and performance overheads. The number on top of each bar
reports the permanent fault coverage provided by each scheme. Notice that the 1-time ADAPT scenario represents the C-SignTest implementation, as all used blocks are thoroughly tested and all idle blocks are skipped during the testing phase. 1-time ADAPT
provides 100% permanent fault coverage for all benchmarks all the time; hence, its fault
coverage is omitted from the figure.
Compared to R-SignTest, on average, 1% ADAPT causes 12% higher testing
energy and performance overheads and achieves much lower average fault coverage of
56.4%. The 5%, 10%, and 20% ADAPT scenarios reduce the testing overheads
significantly, but their extremely low average fault coverage values (i.e. 32%, 23%, and
14%) render them impractical. When compared with 1-time ADAPT, R-SignTest
achieves 45% average savings in the testing energy and performance overheads while
causing minimal 7% reduction in the average permanent fault coverage. This concludes
that for applications which can tolerate slightly reduced fault coverage (e.g. graphics and
media processing applications), R-SignTest captures the best tradeoff between fault
coverage and testing overheads by requiring significantly lower testing energy and time
and providing high coverage of 93% against permanent faults. For applications which
Figure 2.11: Relative overheads and absolute fault coverage for R-SignTest and block-level
ADAPT.
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
Art Bzip2 Crafty Equake Gzip Mcf Parser Perlbmk Twolf Vpr Average
Relative Overheads
Benchmark
R-SignTest 1-time ADAPT 1% ADAPT 5% ADAPT 10% ADAPT 20% ADAPT
92.7%
57.6%
22.2%
18.1%
17.1%
95.7%
74.6%
91.6%
46%
22.2%
94.1%
87.7%
51.8%
29.9%
7.2%
92.3%
43.9%
21.6%
18.2%
14.7%
95.2%
89.3%
65.3%
49.1%
16.2%
94.6%
88.4%
62%
33.6%
10.2%
94.8%
89.7%
61.5%
34.1%
14.5%
94.2%
86.5%
53.6%
31.1%
8.9%
94.3%
81%
55.9%
31.1%
13%
93.2%
82%
36.1%
17%
4.4%
93%
56.4%
31.6%
22.8%
14.1%
64
require maximum coverage against permanent fault, one can use the conservative
SignTest (C-SignTest) implementation to achieve maximum coverage while reducing the
testing overheads by skipping idle functional blocks.
Figure 2.11 also shows that higher savings are achieved while running the
floating-point benchmarks (i.e. Art and Equake) compared to the integer benchmarks.
Fault coverage vs. testing overheads tradeoffs for R-SignTest 2.6.3
The more patterns we apply during the testing phase, the higher the fault coverage
we get. At the same time, the energy and performance overheads caused by periodic
testing increase as the number of test patterns increases. Figure 2.7 shows that the
relationship between fault coverage and number of test patterns is non-linear for
execution and load/store functional blocks. There is an exponential increase in the
number of test patterns required to achieve the last three fault coverage percentage points.
For all 21 blocks, 97% fault coverage is achieved using an average of 59% of the test
patterns. This indicates that there are plenty of fault coverage-overhead tradeoffs that one can choose from based on the requirements of the currently running application.
Figure 2.12 shows the average testing overheads and average fault coverage for
three exploratory design tradeoff choices: R-SignTest-88, R-SignTest-90, and R-
SignTest-96. R-SignTest-88 uses 0%, 88%, 91%, 94%, 97%, and 100% test coverage for
[0], [0-1], [1-5], [5-10], [10-20], and [20-100] usage levels, respectively. R-SignTest-96
uses 0%, 96%, 97%, 98%, 99%, and 100% test coverage. R-SignTest-90 uses the test
coverage values given in Table 2.2. The plotted average overhead is relative to thorough
periodic testing (i.e. all blocks are thoroughly tested at the end of every epoch).
As expected, the higher the test coverage of the different usage levels is, the lower
the savings are. At the same time, the microprocessor average fault coverage increases
with higher test coverage values. For example, R-SignTest-96 achieves average fault
coverage of 97.3% while still providing 45% and 23% less overheads compared to
thorough and C-SignTest periodic testing mechanisms, respectively.
For applications that have more relaxed fault coverage requirements, one can use
R-SignTest-88 which achieves average fault coverage of 91.7% while providing 63% and
48% less overheads compared to thorough and C-SignTest periodic testing mechanisms,
respectively.
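As an illustration, the Python sketch below maps a block's usage level to a test-coverage target for the R-SignTest-88 and R-SignTest-96 design points described above. R-SignTest-90 follows Table 2.2, which is not reproduced in this section, so it is omitted from the sketch.

# Usage-level bins and coverage targets (from the text) for two R-SignTest points.
USAGE_UPPER_BOUNDS = [1.0, 5.0, 10.0, 20.0, 100.0]   # % of epoch cycles
COVERAGE = {
    "R-SignTest-88": [0, 88, 91, 94, 97, 100],
    "R-SignTest-96": [0, 96, 97, 98, 99, 100],
}

def coverage_target(scheme, usage_pct):
    if usage_pct == 0.0:
        return COVERAGE[scheme][0]   # idle block: skip testing entirely
    for i, upper in enumerate(USAGE_UPPER_BOUNDS, start=1):
        if usage_pct <= upper:
            return COVERAGE[scheme][i]
    return COVERAGE[scheme][-1]

if __name__ == "__main__":
    print(coverage_target("R-SignTest-88", 0.0))    # 0
    print(coverage_target("R-SignTest-88", 3.0))    # 91
    print(coverage_target("R-SignTest-96", 15.0))   # 99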
Area overhead 2.6.4
Tracking all inputs and outputs of the 21 blocks under study (i.e. 7692 bits) has
high area overhead. Using IBM 90nm standard cell library, the area overhead needed to
monitor all these bits is estimated at 9.61% and 10.27% of the total T1 core area for C-SignTest and R-SignTest, respectively. The difference in the area overhead of the two implementations is due to the dedicated usage counters required by the R-SignTest implementation.
Figure 2.12: Fault coverage vs. testing overheads (R-SignTest). (Chart comparing R-SignTest-88, R-SignTest-90, and R-SignTest-96; one axis: average fault coverage, 0% to 100%; other axis: average relative overhead, 0 to 1; series: Fault Coverage and Performance.)
We present three optimizations to reduce the area overhead associated with
the SignTest implementation. The first optimization is to use the decoded instruction
information to decide whether a functional block is used or not. This optimization is
limited to the functional blocks whose usage is directly related to specific instruction
types. For example, the exu_div block usage is directly related to the divide instruction.
As a result, every time a divide instruction is decoded the exu_div block is used and for
all other instruction types the exu_div block is idle. Out of the 21 blocks, there are seven
blocks that benefit from this optimization. Hence, there is no need to track the inputs and
outputs of these seven blocks and we monitor their usage by relying on the instructions'
opcodes. The seven blocks are: exu_alu which is related to arithmetic instructions,
exu_div which is related to divide instructions, exu_shft which is related to shift
instructions, and the four store buffer blocks in the load/store stage (i.e. blocks 16, 17, 18,
and 19 in Table 2.1) which are related to memory write instructions.
The second optimization leverages the correlation between the usages of different
functional blocks. Block X could be functionally tied to a group of blocks {s}, where X ∉ {s},
such that whenever a block in {s} is used during an epoch then X is used in the same
epoch and when all blocks in {s} are idle then X is idle as well. In such cases, we do not
have to track the switching activity at the inputs and outputs of block X to monitor its
usage. Instead, we can accurately derive the usage of block X from the usage of the
blocks that are functionally tied to it. For the 14 blocks which are not covered by the first
optimization, we consider one block at a time and try to find a minimum set of blocks
that are functionally tied to the block under concern. Any block which has a set of
functionally tied blocks and is not part of the minimum set of any other block does not
have to be monitored. Blocks that benefit from this optimization are considered used
whenever any of the blocks within their respective minimum set is used and are
considered idle when all blocks within their minimum set are idle. Empirically, we find that the exu_rml and lsu_tagdp blocks benefit from this optimization. The first two
optimizations combined help to accurately monitor the usage of nine functional blocks
without the need to track the switching activity at the blocks' inputs and outputs. This
reduces the area overhead of C-SignTest and R-SignTest to 8.51% and 9.09%,
respectively, without any loss in the energy and performance savings.
The third optimization is partial tracking and it only targets the remaining 12
blocks which do not benefit from the first and second optimizations. Instead of
monitoring all remaining 12 blocks, we want to monitor the ones which provide us with
the highest savings-to-cost ratio and the non-monitored blocks will be conservatively
tested at the end of every epoch. Clearly, this optimization sacrifices a small percentage
of the energy and performance savings of SignTest. To compute the savings-to-cost ratio
of each functional block among the remaining 12 blocks, we divide the percentage of
savings achieved due to the block's usage level by the number of input and output bits of the block. On top of the first two optimizations, partial tracking efficiently reduces the area
overhead of C-SignTest to 3.23% or 5.82% while achieving 85% or 95% of the reported
savings, respectively. Similarly, partial tracking reduces the area overhead of R-SignTest
to 4.3% or 6.2% while achieving 85% or 95% of the reported savings, respectively.
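One possible form of this selection is sketched below in Python; the block names and numbers are placeholders, and the real selection in this work is driven by the measured savings and the actual input/output bit counts of each block.

# Rank remaining blocks by savings-to-cost ratio and monitor the best ones until
# a target fraction of the full savings is retained (illustrative sketch).
def pick_blocks_to_monitor(blocks, target_fraction):
    # blocks: name -> (savings_pct, io_bits); returns the subset to keep monitoring
    total = sum(s for s, _ in blocks.values())
    ranked = sorted(blocks, key=lambda b: blocks[b][0] / blocks[b][1], reverse=True)
    chosen, kept = [], 0.0
    for name in ranked:
        if kept >= target_fraction * total:
            break
        chosen.append(name)
        kept += blocks[name][0]
    return chosen

if __name__ == "__main__":
    example = {"lsu_qctl1": (8.0, 420), "exu_byp": (3.0, 380), "lsu_dcdp": (1.0, 610)}
    print(pick_blocks_to_monitor(example, 0.85))   # keeps the two highest-ratio blocks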
Summary and Conclusions 2.7
In this chapter we presented SignTest, a usage-signature based adaptive
framework which performs in-field periodic testing at the granularity of microprocessor
functional blocks. The microprocessor usage signature (MUS) is tracked during normal
operation using D-type flip-flops and XOR-gates which monitor the switching activity at
the inputs and outputs of the functional blocks. The MUS is used to dynamically select
the test patterns/programs to be applied for each block according to its usage level in
order to reduce the energy and performance overheads of testing while maintaining high
fault coverage against permanent faults.
Two SignTest implementations were presented. A conservative implementation, called C-SignTest, classifies a block as either used or idle during an epoch. Used
blocks are thoroughly tested while idle blocks are completely skipped during the testing
phase. C-SignTest maintains the maximum permanent fault coverage, same as the one
that is achieved through thorough testing during each epoch, but reduces the testing
energy and performance overheads. The evaluation results showed that C-SignTest can
achieve up to 24% (28% * 0.85) reduction in the energy and performance overheads
relative to previous in-field periodic testing mechanisms while incurring minimal area
overhead of 3.23%.
A relaxed implementation called R-SignTest was also presented. R-SignTest
tracks the frequency of usage (i.e. usage level) for each block by counting the number of
cycles during which the block is used during the epoch. The test coverage of each block
is chosen according to its usage level. The evaluation results showed that R-SignTest can
achieve up to 51% (60% * 0.85) reduction in the energy and performance overheads
relative to previous in-field periodic testing mechanisms while incurring area overhead of
4.3% and fault coverage of 93%.
C-SignTest implementation targets applications that require maximum coverage
against permanent faults (e.g. financial applications). On the other hand, R-SignTest
implementation targets applications that can tolerate losing a couple of fault coverage percentage points in order to achieve higher savings in testing energy and time (e.g. graphics
and media processing applications).
Chapter 3
Distinguishing and Tolerating Intermittent Faults
The previous chapter described approaches to reduce the energy and performance overheads of periodic testing. In this chapter, we present approaches that take advantage of low cost testing methodologies to improve fault classification.
Introduction 3.1
Technology scaling has led to variations in device characteristics, resulting in a wide range of fault susceptibilities. Faults encountered during in-field operation can be classified
into three categories: transient, intermittent and permanent. Transient faults occur once
and then disappear. Permanent faults are persistent. Intermittent faults oscillate between
periods of erroneous activity and dormancy, depending on various operating conditions
such as temperature and voltage. Fault classification allows a chip to take the appropriate
corrective action. Existing detection and recovery approaches [14] [54] classify faults as
transients or non-transients where all non-transient faults are handled as permanent faults.
This crude classification results in unnecessary performance degradation and increases
the probability of permanent failure due to excessive usage.
Intermittent faults, which are mainly induced by wearout phenomena, are becoming a dominant factor that limits chip lifetime in nanoscale technology nodes [25]
[65], yet there is no mechanism that provides finer classification of non-transient faults
into intermittent and permanent faults. Modern microprocessors contain large array
structures in the form of buffers and tables, such as instruction fetch queue (IFQ),
load/store queue (LSQ), reorder buffer (ROB), reservation stations (RS), and branch
history/prediction tables. Micro-architectural array structures are highly stressed because
they get read and written continuously. Even during idle periods, array structures are still
considered under stress because they continue to store the same bit values which have
negative-bias temperature instability (NBTI) impact on the PMOS devices. This makes
the microprocessor array structures highly susceptible to intermittent faults.
In this chapter we present reliability-aware exceptions (RAEs) [34], a special
class of exceptions that enable software-directed handling of faults detected in the
microprocessor array structures. RAEs are used in conjunction with hardware fault
detection mechanisms to improve microprocessor lifetime. The novelty of RAEs is the
ability to distinguish intermittent faults from transient and permanent faults and to
provide appropriate corrective actions for the three fault categories. For transient faults,
no corrective action is required and it is sufficient to roll-back the execution to the
faulting instruction. For intermittent (permanent) faults, before the execution is rolled-
back, the RAE handlers exploit the inherent redundancy in the microprocessor array
structures to temporarily (permanently) de-configure the faulty entries.
Using RAEs, we demonstrate that the reliability of two representative
microarchitecture structures, LSQ and ROB in an out-of-order processor, is improved by
average factors of 1.3 and 1.95, respectively.
Why to Distinguish Intermittent Faults 3.2
The authors in [6] found that soft breakdown (SBD) in the device dielectric
causes intermittent erratic fluctuations in the minimum voltage of 90nm SRAM cells. In
addition, interconnect resistance increases due to electro-migration (EM) which causes
intermittent timing violations under high temperatures [26]. Furthermore, a fault analysis
study of the memory subsystems of 257 servers revealed that 6.2% of the subsystems
experienced intermittent faults [25]. Finally, a recent study analyzed the fault logs
reported by the Windows error reporting (WER) system from one million consumer PCs
[65]. 30% of the CPU faults turned out to be intermittent because they recurrently
appeared and disappeared in the logs. Thus, there is growing evidence that intermittent
faults are becoming a major reliability concern in modern microprocessors.
Intermittent faults in array structures have two important features that distinguish
them from transient and permanent faults. First, an intermittent fault recurrently occurs in
the same location which is different from a random bit flip as in the case of a transient
fault. Second, intermittent faults oscillate between periods of activation and deactivation
according to the current operating conditions, whereas transient faults are one-time events
and permanent faults are continuously activated. As such, handling intermittent faults as
transients (i.e. flush and re-execute) will cause excessive flushing and degrade the
performance significantly. On the other hand, handling intermittent faults as permanent
faults will cause premature and excessive component replacement demands. Hence, the
effectiveness of fault handling solutions can be significantly improved if they can
distinguish and handle intermittent faults as a separate fault category.
Intermittent faults are results of device wearout. The most important wearout
phenomena include electro-migration (EM) [79], negative-bias temperature instability
(NBTI) [57], and time dependent dielectric breakdown (TDDB) [80]. These phenomena
manifest as timing violations activated under abnormal temperature, stress and voltage
conditions. In the case of NBTI, the device deterioration can be reversed and the
intermittent fault can be eliminated by removing the stress conditions [35]. In the case of
EM and TDDB, removing the stress conditions only deactivates the intermittent faults but
does not reverse the degradation. These aging manifestations occur slowly over time;
hence, it is necessary to develop low cost solutions to handle wearout-induced
intermittent faults, which is the main aim of RAEs.
Related Work 3.3
The notion of using exceptions to deal with failures was proposed early on in
recovery blocks to handle software bugs [75]. Every critical software function that
requires fault tolerance from residual bugs is structured as a recovery block. Each
recovery block has a primary implementation and multiple alternative implementations of
the software function associated with it. On entry to a recovery block, the state of the
system is saved to permit backward fault recovery (BFR), i.e., establish a checkpoint.
Then, the primary implementation is executed and an acceptance test is evaluated to
provide an adjudication on the outcome. If the acceptance test is passed then the outcome
is regarded as successful and the recovery block is exited. However, if the test fails or if
any faults are detected by other means during the execution of the primary
implementation, then an exception is raised and BFR is invoked. This restores the state of
the system to what it was on entry. After such recovery, an alternative implementation is
executed and the acceptance test is applied again. This sequence continues until either an
acceptance test is passed or all alternates have failed the acceptance test. If all alternates
fail the test, a failure exception is signaled to the environment of the recovery block.
RAEs deal with an entirely new problem, that of in-field hardware faults. The RAE
handlers analyze fault logs to classify faults and take appropriate corrective actions.
A threshold-based mechanism to discriminate transient from non-transient faults
is described in [14]. Bondavalli et al. introduced a count-and-threshold model called α-
count. At regular time intervals, a single bit is collected from each system component
using hardware detection mechanisms, such as parity. These bits are used to produce an
α-value for each component which is compared with a predefined threshold to classify
the component as healthy or faulty. On the other hand, RAEs use a historical fault log and
inter-arrival time between faults to classify them as transient, intermittent, or permanent.
In [17], Bower et al. proposed self-repairing array structures (SRAS). SRAS map
out array structure entries that experience permanent faults; hence, they eliminate
continuous flushing due to such faults. SRAS use a handful of check rows to detect faults
in each array structure. Every write operation to a structure row is duplicated to a check
row, and then data is read from both rows and compared, off the critical path, to detect
faults. As the number of faults detected in a specific row reaches a predefined threshold,
the row is believed to have a permanent fault and is permanently mapped out. Otherwise,
the fault is assumed to be transient. RAEs are different from SRAS because: 1) by
handling intermittent faults as a new category, RAEs further extend the array structures'
lifetimes. 2) RAEs do not need check rows because neither a duplicated write nor a
comparison operation is required; hence, RAEs consume less area and power.
Bower et al. proposed an online diagnosis mechanism of hard faults in
microprocessors at the field de-configurable unit (FDU) granularity [18]. They rely on a
DIVA checker to provide fault detection and correction [9]. The diagnosis mechanism
uses a saturating counter for each FDU. When the DIVA checker detects a fault in an
instruction, the counters of all FDUs traversed by the faulty instruction are incremented.
The counter values are compared with predefined thresholds to classify faults as
transient or permanent. In addition to the limitation that intermittent faults are treated as
permanents, their fault detection approach assumes independent FDU usage. If two FDUs
happen to be used together most of the time then there is no way to know which FDU is
actually causing the fault.
Wells, Chakraborty, and Sohi proposed pausing core execution to adapt to
intermittent faults in multicore systems [102]. Suspending the use of the core will not
repair any manufacturing variations or in-progress wearout but it will cause both
temperature and voltage to stabilize; hence, reducing the occurrence of intermittent faults.
On the other hand, RAEs help to reverse or slow-down the in-progress wearout through
temporal de-configuration. In addition, RAEs are implemented at the granularity of array
structure entries; hence, they do not require the entire core operation to be suspended.
Considerable amount of research has focused on detecting transient and
permanent faults in hardware using software-based mechanisms. SWAT [54] uses
anomalous software behaviors, such as high operating system activity, hangs, and
abnormal application exits, to suspect the presence of permanent faults. There are three
major differences between SWAT and RAEs: 1) the main focus of SWAT is fault
detection while the focus of RAEs is fault classification and recovery. 2) SWAT relies on
a checkpoint-based recovery while RAEs do not require check-pointing as faulty
instructions are handled before they commit. 3) SWAT distinguishes transient from
permanent faults by re-executing the faulty portion of the program on the expected faulty
hardware and on fault-free hardware and comparing the two results. As such, many
intermittent faults, which persist for long periods, end up interpreted as permanent faults.
RAEs present finer classification, namely transient, intermittent, and permanent faults
and handle each case using an appropriate corrective action.
Commercial fault-tolerant microprocessors such as IBM POWER6 [76] and z10
[89] microprocessors use a variety of mechanisms, such as parity, error correcting code
(ECC), and hardware checkers, to detect transient and permanent faults in memory,
computational, and control structures. The internal memory structures, except caches and
register files, are protected using parity. These microprocessors use exceptions to
interrupt the core execution when a fault is detected. In the first occurrence of a fault, it is
assumed to be transient. When a second fault occurs, a checkstop is generated and the
latest check-pointed core state is migrated to a healthy core (i.e. the fault is categorized as
permanent). In the available literature we could not find out whether IBM
microprocessors provide the ability to isolate a faulty structure entry in case of a
permanent fault. Even if this is the case, the novelty of RAEs is in the ability to
distinguish intermittent faults and handle them using temporal de-configuration at the
granularity of a single entry.
While RAEs as described in this chapter are focused on array structure faults,
RAEs can be broadened to handle faults in redundant computational blocks. There are a
couple of schemes that can be used to detect faults in computational logic [24] [104]. For
instance, WearMon [104] injects well-defined test vectors and uses elevated test
frequency to detect imminent timing failures in computational logic. Once a fault is
detected in a computational block, RAEs classify and handle the fault in the same fashion
as for inherently redundant array structures.
In order to mimic the physical phenomena of the three fault categories (transient,
intermittent and permanent) in our simulation experiments, as will be explained in detail
later, we need accurate models for lifetime failure modes (e.g. EM, NBTI, and TDDB).
Srinivasan et al. developed a methodology called RAMP (Reliability Aware
MicroProcessor) to estimate lifetime reliability from an architectural perspective [96].
RAMP makes two assumptions: 1) uniform distribution of the failure rate across the
modeled failure types. 2) Uniform distribution of devices over chip area and identical
device vulnerability to the modeled failure types. To relax these assumptions, Shin et al.
proposed structure-aware lifetime reliability models called failure in time (FIT) of
reference circuit (FORC) [86]. In the FORC models, devices are only vulnerable to
failure mechanisms that actually affect them. We use the FORC models to compute the
probability of different fault types and measure the effectiveness of RAEs.
Building Blocks for RAEs Design 3.4
Fault detection 3.4.1
Figure 3.1 gives an overview of how RAEs can protect two representative array
structures of an out-of-order processor, namely the load/store queue (LSQ) and the
reorder buffer (ROB). The hardware/software additions necessary for RAEs are
highlighted in solid color and the modified blocks are shaded with stripes.
Figure 3.1: RAEs implementation.
The first step in RAEs is to detect faults as soon as they occur. Transient, intermittent, and permanent faults manifest either as bit flips or as timing violations. Like commercial fault-tolerant microprocessors [76] [89], RAEs leverage the existing parity to detect bit flips assuming a single fault model. RAEs may also rely on RazorII [30] to
detect timing violations.
Whenever an array structure entry is read or written by an instruction, the
detection mechanisms are activated to validate that the data read/written is fault free.
Fault detection is done off the critical path. Whenever a fault is detected, three actions are
required: first, the faulting instruction is turned into a no-operation (NOP). Second, the
detection hardware sets a special bit in the ROB entry of the faulting instruction. The
special bit is called RAE bit and it indicates that there is a reliability-aware exception
associated with this instruction. Third, the identification number of the faulty array
structure entry (ASE_ID) and the ROB entry number (ROB_ID) of the faulting
instruction are stored in a special register called loc, short for location. ASE_ID is simply
the array structure identification number concatenated with the entry number within the
structure. For example, the fifth entry in LSQ has ASE_ID of LSQ_5. ASE_ID is used
later to classify the fault and to de-configure the faulty entry when needed. ROB_ID is
used to check if the faulting instruction is at the top of the ROB and ready to commit.
Since instructions may trigger reliability exceptions out-of-order, it is necessary to
handle the exceptions in process order by allowing only older instructions to overwrite
the loc register. This is similar to traditional precise exceptions where the most senior
exception flushes the pipeline; hence, there is no need to store the information of younger
exceptions. Figure 3.1 shows each ROB entry augmented with an RAE bit and it shows
the loc register. To reduce clutter, fault detection mechanisms are not shown in the figure.
The RAE bits and the loc register are assumed to be built fault-free using device sizing or
error correcting codes.
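The following Python sketch models these detection-time actions and the program-order filter on the loc register; the data structures and helper names are illustrative and do not correspond to the actual hardware interfaces.

# Actions taken when the detection logic flags a fault (illustrative model).
class LocRegister:
    def __init__(self):
        self.ase_id = None   # faulty array structure entry, e.g. "LSQ_5"
        self.rob_id = None   # ROB entry of the faulting instruction

loc = LocRegister()

def on_fault_detected(inst, ase_id, rob, is_older_than_loc):
    inst["is_nop"] = True                    # 1) turn the faulting instruction into a NOP
    rob[inst["rob_id"]]["rae_bit"] = True    # 2) set the RAE bit in its ROB entry
    if loc.ase_id is None or is_older_than_loc:
        loc.ase_id = ase_id                  # 3) record the fault location; only an older
        loc.rob_id = inst["rob_id"]          #    instruction may overwrite the loc register

if __name__ == "__main__":
    rob = {7: {"rae_bit": False}}
    inst = {"rob_id": 7, "is_nop": False}
    on_fault_detected(inst, "LSQ_5", rob, is_older_than_loc=True)
    print(loc.ase_id, loc.rob_id, rob[7]["rae_bit"])   # LSQ_5 7 True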
Disabling the faulty entry 3.4.2
Once a fault is detected, the faulty array structure entry is disabled to ensure that
no following instructions are allocated to the same entry until the exception handler has
dealt with the fault. Disabling a specific entry prevents that entry from further use, but the
entry is NOT turned off.
Microprocessor array structures can be categorized based on their access method
into circular buffer structures (e.g. IFQ, LSQ, and ROB) and tabular array structures (e.g.
RS and branch history and prediction tables). To disable faulty entries of circular
structures, we augment every entry with a busy bit. The faulty entry is disabled by setting
its busy bit to one [17]. In addition, the allocation/release logic needs to be modified to
consult the busy bit array before allocating new entries.
The allocation/release logic of tabular array structures naturally keeps track of
entries that are in-use and entries that are free. To disable faulty entries of tabular arrays,
we mark them in the allocation/release logic as in-use [18].
Figure 3.1 shows LSQ and ROB entries augmented with busy bits which are
assumed to be fault-free. Further, the allocation/release logic of LSQ and ROB are
shaded with stripes to indicate that they need to be modified to consult their busy bit
arrays before allocating new entries. Once a fault is detected in a specific entry, the fault
detection hardware, such as parity checker, is responsible for setting the busy bit of that
entry to one. When the fault is handled, none of the handler's instructions will be
allocated to the faulty entry because it is disabled; hence, it will be skipped by the
allocation/release logic.
RAE handler: fault classification 3.4.3
When the instruction with an RAE reaches the top of the ROB and is ready to
commit, the CPU is flushed and the RAE handler is invoked in order to classify the fault
and choose the most appropriate corrective action.
RAEs classification relies on mining a historical fault log. Every array structure
entry has a single record in the log which contains: 1) a fault counter to keep track of the
number of faults detected in the entry. 2) A timestamp of the latest fault detected in the
same entry. Once a reliability exception is raised, the RAE handler reads the loc register
to obtain the ASE_ID and uses it to access the corresponding record in the fault log. To
classify the current fault, the RAE handler compares the current timestamp with the
record's timestamp. There are two possible scenarios: first, the record's timestamp is zero
(i.e. the current fault is the very first fault detected in the entry) or the difference between
the two timestamps is greater than or equal to the inter-arrival time between transient
faults. In this scenario, the current fault is optimistically classified as transient.
Second, the record's timestamp is greater than zero (i.e. one or more faults were
previously detected in the entry) and the difference between the two timestamps is less
than the inter-arrival time between transient faults. In this scenario, the fault is either
classified as intermittent or permanent based on the record's fault counter value. If the
counter value is less than a predetermined Intermittent-Fault-Threshold, the fault is
classified as intermittent; otherwise, the fault is classified as permanent. To avoid
misclassifying intermittent faults as permanents, the fault counters of all fault log records
are periodically reset to zero.
The inter-arrival time between transient faults, the Intermittent-Fault-Threshold,
and the period to clear the log's fault counters are design parameters that depend on the
current process technology node. For evaluation purposes, the values of the three
parameters are discussed in Section 3.5.
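The classification rule can be summarized by the Python sketch below. The Intermittent-Fault-Threshold of 15 matches the value used in the evaluation (Section 3.5); the transient inter-arrival time shown is only a placeholder.

# RAE fault classification from the per-entry fault log (illustrative sketch).
TRANSIENT_INTERARRIVAL = 10**9    # cycles between independent transient faults (placeholder)
INTERMITTENT_THRESHOLD = 15       # faults per reset period before declaring a fault permanent

fault_log = {}   # ASE_ID -> {"count": int, "timestamp": int}

def classify_fault(ase_id, now):
    rec = fault_log.setdefault(ase_id, {"count": 0, "timestamp": 0})
    if rec["timestamp"] == 0 or now - rec["timestamp"] >= TRANSIENT_INTERARRIVAL:
        kind = "transient"
    elif rec["count"] < INTERMITTENT_THRESHOLD:
        kind = "intermittent"
    else:
        kind = "permanent"
    rec["count"] += 1
    rec["timestamp"] = now
    return kind

def reset_counters():
    # called periodically (e.g. once a day) so that intermittent faults are not
    # eventually misclassified as permanent
    for rec in fault_log.values():
        rec["count"] = 0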
Accessing the software fault log for fault classification is a slow process. But
under the assumption that faults are rare events compared to the normal operational
window of a processor, we can afford to handle fault scenarios using slow software
handlers in order to save area and power consumed by hardware-based mechanisms.
Notice that the size of each log record is 12 bytes and the total size of the log depends on
the sizes of the array structures being protected by RAEs. For the microarchitecture
simulated in this study, which has 64 entries in LSQ and 128 entries in ROB, the total log
size is 2Kbytes. Hence, the fault log can easily fit in the main memory or even in the last
level cache to speed up fault classification.
RAE handler: fault correction 3.4.4
After fault classification is done, one of the following three corrective actions is
undertaken:
1. Transient_Action: Transient faults persist only for a single cycle; hence, no
corrective action other than restarting from the faulting instruction is needed.
However, the hardware has already disabled the faulty entry. Hence, the RAE
handler must enable the entry by directing the hardware to reset the entry's busy
bit in case of circular buffer array structures or mark the entry as free in case of
tabular array structures. In both cases the RAE handler uses the ASE_ID to direct
the hardware to enable the entry.
2. Intermittent_Action: Intermittent faults are mostly due to EM, NBTI, or TDDB.
The recovery action is to temporarily de-configure the faulty entry. De-
configuring an entry means to turn off the entry by power gating. Temporal de-
configuration helps to deactivate the intermittent fault. More importantly, in the
case of NBTI-induced faults, temporal de-configuration helps PMOS transistors
recover to their original state (i.e. eliminates the intermittent fault) [35]. Notice
that when the entries surrounding the faulty entry experience high activity, the
effect of temporal de-configuration may be limited (i.e. the fault may persist).
Even in such cases, de-configuration is still useful because it avoids using the
faulty entry and consequently reduces the performance overhead caused by
excessive flushing.
3. Permanent_Action: Permanent faults persist forever and whenever a faulty entry
is used an error will occur except in cases where the fault is logically masked.
Since masking effects cannot be easily detected at runtime, we choose to
permanently de-configure (turn off) the faulty entry.
After fault correction is complete, the execution is resumed starting from the
faulting instruction. In a processor that supports speculative execution, like the one we
used for RAEs evaluation, every instruction carries its program counter (PC) in its
designated ROB entry. This PC is used for re-execution in the case of a branch
misprediction or a software exception. RAEs' handlers cannot trust the PC of the faulting
instruction obtained from the ROB because the ROB contents might be faulty. As a
result, we augment the commit stage with a fault-free register called RAE_PC as shown
in Figure 3.1. RAE_PC holds the PC value of the instruction following the last correctly
committed instruction. In case the last committed instruction was a control instruction,
the RAE_PC holds the PC of the target instruction. Otherwise, the RAE_PC holds the PC
of the next instruction in program order.
De-configuration through ISA extensions and sleep transistors 3.4.5
The temporal or permanent de-configuration of the faulty entry in the case of
intermittent or permanent faults is achieved through sleep transistors [85]. De-
configuration is done at the granularity of a single entry. Hence, each entry has its own
sleep transistor and all entries are augmented with sleep bits to drive their sleep
transistors. Sleep bits are assumed to be fault-free.
We introduce a new instruction called SLP_SET. For intermittent and permanent
faults, SLP_SET is used by the RAE handler to de-configure the faulty entry by setting
its sleep bit to one. For intermittent faults only, a reconfiguration event must be scheduled
when the fault is expected to be no longer active. RAEs keep a sorted queue of the
reconfiguration events for all entries that are currently de-configured due to intermittent
faults, such that the reconfiguration event that will occur earliest is at the head of the
queue. When a reconfiguration event is due, the RAE handler also uses SLP_SET to
reconfigure the corresponding entry by resetting its busy and sleep bits.
As mentioned in subsection 3.4.2, once a fault is detected, the faulty entry is
disabled from further use by setting its busy bit. In case of transient faults, the faulty
entry needs to be re-enabled by resetting its busy bit. The RAE handler leverages the
SLP_SET instruction to achieve that as well. SLP_SET takes fault location and fault type
as source operands. For de-configuration purposes, the fault type operand of the
SLP_SET is set to intermittent or permanent which essentially sets the sleep bit to one.
For reconfiguration and enabling purposes, the fault type operand of the SLP_SET is set
to transient, which essentially resets both busy and sleep bits to zero.
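A software view of how the handler drives these bits is sketched below in Python. The bit arrays, event queue, and helper names are illustrative; the default reconfiguration delay of one billion cycles is the value used in the evaluation (subsection 3.5.3).

# Sketch of the SLP_SET semantics used by the RAE handlers.
import heapq

busy, sleep = {}, {}     # ASE_ID -> bit value
reconfig_events = []     # min-heap of (due_cycle, ASE_ID)

def slp_set(ase_id, fault_type, now=0, reconfig_delay=10**9):
    if fault_type in ("intermittent", "permanent"):
        sleep[ase_id] = 1          # de-configure (power gate) the faulty entry
        if fault_type == "intermittent":
            heapq.heappush(reconfig_events, (now + reconfig_delay, ase_id))
    else:                          # "transient": re-enable the entry disabled at detection
        busy[ase_id] = 0
        sleep[ase_id] = 0

def service_reconfig_events(now):
    # reconfigure entries whose temporary de-configuration period has expired
    while reconfig_events and reconfig_events[0][0] <= now:
        _, ase_id = heapq.heappop(reconfig_events)
        slp_set(ase_id, "transient")   # resets both the busy and the sleep bits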
Evaluation Methodology 3.5
To evaluate RAEs, we developed a simulation infrastructure that models various
types of faults and relies on accelerated fault injections. The simulation infrastructure is
based on the Simplescalar tool [20] integrated with Wattch [19] and HotSpot [45]. The
Simplescalar simulator is augmented to keep track of the activity factors of different array
structures (i.e. number of read/write operations). The activity factors are plugged into the
power models of Wattch to estimate the power consumptions of the array structures. The
power trace is then fed to HotSpot to compute the temperature of each structure based on
a given floor plan.
The processor model simulated is a 4-way out-of-order processor with 64-entry
LSQ and 128-entry ROB. The processor model stores the speculative results from
instruction execution in the ROB. In our evaluation, we compare three different
processors: base (i.e. No RAEs support), large (i.e. No RAEs support, but LSQ and ROB
are built with bigger transistors which have lower failure probability), and RAEs (i.e. with
RAEs support). Base and large processors cannot distinguish intermittent faults and they
classify a fault either as transient or as permanent. For transient faults, the execution is
simply rolled-back, whereas permanent faults are handled via permanent de-
configuration.
For the simulations of the RAEs processor, the Intermittent-Fault-Threshold and
the period to clear the logs' fault counters are chosen based on the comprehensive
experimental study in [25]. The results of the study indicate that an intermittent fault in
an array structure entry is activated up to 15 times every day. Hence, we set the
Intermittent-Fault-Threshold to 15 and the logs' fault counters are reset to zero every day.
The transistors of the LSQ and ROB structures in the large processor are 7%
bigger than the transistors used in base and RAEs processors. We have chosen 7%
because it is the estimated area overhead for supporting RAEs, as will be shown in
subsection 3.6.3. According to [105], a transistor failure probability is linearly
proportional to the size of its gate. Hence, the failure probability of LSQ and ROB entries
in the large processor is expected to be 7% lower than base and RAEs processors. We
use an accelerated fault injection methodology to do the evaluation. The fault injection
methodology has three main parameters: fault rate, fault type, and fault model.
Fault rate 3.5.1
The total fault rate of an array structure is the summation of the transient,
intermittent, and permanent fault rates of the structure. In 45nm process technology, the
transient fault rate per bit is 10^-3 FIT [92]. In our evaluation infrastructure, the LSQ has 64 72-bit wide entries and the ROB has 128 72-bit wide entries. Hence, LSQ's transient fault rate is 5 FIT and ROB's transient fault rate is 10 FIT. The transient fault rate is used
to compute the average inter-arrival time between transient faults needed by the fault
classification of RAEs.
For intermittent fault rate, we rely on the reliability modeling framework
presented in [86]. The framework uses the concept of failure-in-time of reference circuit
(FORC) to compute the FIT rates of structures due to EM, NBTI, and TDDB. These
FORC models depend on structures' sizes, environmental conditions (e.g. temperature),
and stress conditions measured by the probabilities of having logic zero or logic one in a
cell. Our simulation infrastructure measures the average temperatures and stress
conditions for LSQ and ROB while running SPEC CPU2006 benchmarks. Then the
measured values are used to compute intermittent fault rate of LSQ and ROB as 275 and
550 FIT, respectively.
Current microprocessors are expected to have a mean-time-to-failure (MTTF) of 5
years; this translates to a cumulative intermittent and permanent fault rates of 22831 FIT.
Assuming this cumulative fault rate is uniformly distributed across the microprocessor
footprint, we deduce the permanent fault rates of LSQ and ROB as 38 and 76 FIT,
respectively. So, the total fault rates of LSQ and ROB are 318 and 636 failures in billion
hours, respectively. Assuming a clock frequency of 1GHz, the failure probabilities per
cycle for LSQ and ROB are 8.83*10^-20 and 17.7*10^-20, respectively.
The failure probabilities of LSQ and ROB indicate that faults are rare events
compared to the processor normal operational window. Hence, we need to accelerate the
occurrences of faults in our simulations. An acceleration factor of 10^13 is chosen because
it allows a couple of hundred faults to be injected during the simulation of each
benchmark. Therefore, the simulated failure probabilities per cycle for LSQ and ROB in
base and RAEs processors are 8.83*10^-7 and 17.7*10^-7, respectively. In the large processor, the simulated failure probabilities per cycle for LSQ and ROB are 8.2*10^-7 and 16.5*10^-7, respectively. Similarly, the average inter-arrival time between transient faults and the period to clear the logs' fault counters are scaled according to the acceleration factor.
Fault type 3.5.2
Whenever a fault is to be injected, the fault type needs to be specified. For each
array structure, the probability of transient, intermittent, and permanent faults can be
computed using the failure rate value of each type given in the previous subsection. For
example, the probability of transient faults in the LSQ (P_{LSQ-Transient}) can be computed as follows:
P_{LSQ-Transient} = FIT_{LSQ-Transient} / FIT_{LSQ-Total}    (3.1)
where:
FIT_{LSQ-Total} = FIT_{LSQ-Transient} + FIT_{LSQ-Intermittent} + FIT_{LSQ-Permanent}    (3.2)
A random variable with standard uniform distribution U(0,1) is generated every
time a fault is to be injected. The value of the random variable is used to determine the
fault type according to the probability of each fault category computed as in equation 3.1.
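The sampling step is illustrated by the Python sketch below, using the LSQ FIT rates reported in this subsection; each category's probability is its FIT rate divided by the structure's total FIT rate, as in equation 3.1.

# Sample a fault type from per-category FIT rates (illustrative sketch).
import random

LSQ_FIT = {"transient": 5, "intermittent": 275, "permanent": 38}   # total = 318 FIT

def sample_fault_type(fit_rates, u=None):
    total = sum(fit_rates.values())
    u = random.random() if u is None else u   # draw from U(0,1)
    cumulative = 0.0
    for kind, fit in fit_rates.items():
        cumulative += fit / total
        if u <= cumulative:
            return kind
    return kind   # guard against floating-point round-off

if __name__ == "__main__":
    print(sample_fault_type(LSQ_FIT))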
Fault model and period 3.5.3
The last aspect of the fault injection setup is to decide how to model each fault
type and how long it will persist. Transient faults are modeled as bit flips and they persist
for a single cycle. Permanent faults are modeled as stuck-at and they persist for the entire
simulation. Intermittent faults are modeled as timing violations which manifest as bit
errors when the faulty entry is read or written. Due to the lack of information about the
activation and deactivation periods of intermittent faults, we rely on our empirical results
which indicate an average continuous stress time in the order of thousands of cycles.
Hence, we randomly pick among three activation periods of 1000, 2000, and 4000 cycles.
For intermittent faults we also need to determine the temporary de-configuration
period. Assuming a clock frequency of 1GHz and based on the fact that intermittent
faults deactivate in the first second after stress is removed [57], the faulty entry is de-
configured for 1 billion cycles. Once the de-configuration period expires, the entry is
reconfigured into normal operation as we described in subsection 3.4.5.
Experimental Results 3.6
For our simulation experiments, we chose ten SPEC CPU2006 benchmarks. We
fast-forwarded through the first 300 million instructions then ran detailed simulations for
three billion instructions.
Reliability evaluation 3.6.1
In order to compare the reliability of LSQ and ROB in the three processors, we
define a new reliability metric called mean-time-to-crash (MTTC). MTTC is the average
time for all entries in a structure to become faulty so as to cause the system to crash.
To compute the MTTC of a structure, we denote the number of cycles until all its
entries are faulty as N. Assuming that every clock cycle, one entry may become faulty
with a probability p, then N is essentially a random variable with a negative binomial
distribution NB(r, q), where r is the number of remaining non-faulty entries in the structure and q is the probability of not having a fault in the entry that is being accessed
in the current cycle (i.e. q = 1-p). The expected value of N is the MTTC of the structure
and is computed as follows:
MTTC = E[N] = r / (1 - q)    (3.3)
Figure 3.2 shows the MTTC of the ROB in the RAEs and large processors
relative to the base processor at the end of the simulations. The geometric mean for all
benchmarks, except Mcf, shows that the RAEs processor improves the MTTC of the
ROB with respect to the base and large processors by factors of 1.95 and 1.82. This
improvement is mainly due to the ability of the RAEs processor to identify intermittent
faults and recover from them using temporal de-configuration. Mcf is excluded from the
mean calculation because base and large processors exhausted their ROB structures
before completing all three billion instructions. For Mcf, the RAEs processor achieved an MTTC improvement factor of 71.
Figure 3.2: Relative ROB MTTC. (Bar chart per benchmark; y-axis: ROB MTTC relative to the base processor, 0 to 3; bars: RAEs and Large; the Mcf bar for the RAEs processor exceeds the axis and is annotated with its value of 71.)
Another observation in Figure 3.2 is that for some benchmarks, the large
processor performs the same or even worse than the base processor despite the lower
fault injection rate. This is due to the randomization effect in determining the fault type as
described in subsection 3.5.2. As a result, more intermittent or permanent faults could get
injected in the large processor which makes it look worse than the base processor. The
randomization effect is expected to disappear as the benchmark run time increases.
Figure 3.3 shows the MTTC of the LSQ structure in the RAEs and large
processors relative to the base processor at the end of the simulations. The geometric
mean for all benchmarks, except Mcf, shows that the RAEs processor improves the
MTTC of the LSQ with respect to base and large processors by factors of 1.30 and 1.18.
LSQ utilization is typically less than ROB because it is only accessed by memory-access
instructions. Hence, smaller improvement factors are achieved for LSQ.
Figure 3.3: Relative LSQ MTTC. (Bar chart per benchmark; y-axis: LSQ MTTC relative to the base processor, 0 to 2.5; bars: RAEs and Large; the tallest bar exceeds the axis and is annotated with its value of 5.4.)
Performance evaluation 3.6.2
The performance of the three processors is measured by their execution time. On
average, the RAEs processor achieves a speedup of 5% and 4% with respect to the base and large processors, respectively. This is because the RAEs processor does not permanently de-configure
LSQ and ROB entries with intermittent faults.
Area overhead 3.6.3
In order to protect the LSQ and the ROB with RAEs, every entry needs to be
augmented with sleep, busy, and parity bits. In addition, ROB entries are augmented with
RAE bits as shown in Figure 3.1. This adds up to a total of 704 additional bits, which
require 4224 transistors. As mentioned before, LSQ entries are 72-bit wide which means
that we need 71 2-input XOR gate for the parity checker. Each XOR gate requires 6
transistors, so the LSQ parity checker has a total of 426 transistors. Similarly, 426
transistors are required to build the ROB parity checker.
The loc and RAE_PC registers are 18 and 32 bits wide, respectively. Hence, they
require 300 transistors. Thus, the total number of transistors required to protect LSQ and
ROB using RAEs is 5376 which is 6.5% of the total number of transistors used to build
the base LSQ and ROB.
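The transistor counts above can be reproduced with the simple Python tally below; the per-bit and per-gate figures of six transistors come from the text, while the variable names are mine.

TRANSISTORS_PER_BIT = 6      # per stored status bit
TRANSISTORS_PER_XOR = 6      # per 2-input XOR gate

status_bits = 704                                        # sleep, busy, parity, and RAE bits
status_transistors = status_bits * TRANSISTORS_PER_BIT   # 4224

lsq_parity_checker = 71 * TRANSISTORS_PER_XOR            # 426
rob_parity_checker = 71 * TRANSISTORS_PER_XOR            # 426

loc_and_rae_pc_bits = 18 + 32
register_transistors = loc_and_rae_pc_bits * TRANSISTORS_PER_BIT  # 300

total = (status_transistors + lsq_parity_checker +
         rob_parity_checker + register_transistors)
print(total)  # 5376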
Summary and Conclusions 3.7
In this chapter we introduced reliability-aware exceptions (RAEs), a new class of
exceptions that use software handlers to deal with hardware faults in microprocessor
array structures. RAEs have the ability to classify faults into one of three categories,
transient, intermittent, and permanent, and choose the optimal corrective action based on
fault type. Fault classification for a specific entry is based on the inter-arrival time
between the latest two faults detected in the entry and the total number of faults detected
in the entry so far. Corrective actions range from simple flushing and re-execution in case
of transient faults to permanent de-configuration of the faulty array structure entries in
case of permanent faults. Through temporal de-configuration, RAEs help to reverse or
slow down the impact of intermittent faults.
We developed a new fault-injection experimental approach to mimic the physical phenomena associated with the three fault categories. We measured the environmental and stress conditions continuously during simulation and used these measurements to compute the probability of an intermittent fault.
Our experimental results showed that RAEs significantly improve the reliability
of LSQ and ROB. Although our experiments focused on two major array structures, we
expect to get the same advantages when applying RAEs to other circular buffer structures
and tabular array structures.
Chapter 4
Tolerating Hard Faults in GPUs
Once the detected fault is classified, the next step is to recover the correct system
state and if the fault is not transient the system must either tolerate the fault in the future
in order to guarantee correct execution and forward progress or safely shut down when
functional correctness cannot be guaranteed. In this chapter we present solutions to
tolerate permanent faults (also called hard faults) in computer systems by exploiting resource redundancy. We shift our focus from multicore systems to many-core systems, where such resource redundancy is plentiful, and present the Warped-Shield framework to
tolerate in-field hard faults in graphics processing units (GPUs) [33]. Warped-Shield
tolerates the uncommon case of two hard faults per cluster of four execution lanes with an average performance degradation of 14%. Even in the extreme case of three hard faults per four-lane cluster, Warped-Shield provides functionally correct execution with an average performance degradation of 57%.
Introduction 4.1
Recently, general purpose applications with massively parallel computation
demands are relying on GPUs as the computational substrate. GPUs provision hundreds
or even thousands of execution units organized as single instruction multiple thread
(SIMT) lanes. In fact, 68% of the chip area is dedicated to SIMT lanes in current GPUs
[53]. The computations are parallelized across multiple SIMT lanes where a group of
SIMT lanes execute the same instruction but on different input operands. This simplified
control allows GPUs to achieve high performance at low power. Due to their high
performance per watt, many mission-critical and long-running scientific applications are
being ported to run on GPUs. These applications demand strong computational integrity.
GPUs, like many other digital systems, are becoming more vulnerable to different types of in-field faults as technology nodes shrink. Among the possible in-field
faults, hard faults are very critical because they are persistent and irreversible. In this
chapter we target hard faults in the SIMT lanes. Given the vast number of SIMT lanes,
even a single hard fault in one SIMT lane can lead to wrong computations. Note that
most of the chip area outside of the SIMT lanes is occupied by memory structures, such
as register files, which are usually protected with error correcting codes. For instance,
register files in Nvidia GPUs are SECDED protected [68].
Recent research work proposed hardware [47] and software techniques [32] to
detect operational faults in GPUs but did not provide solutions to tolerate these faults.
Fault tolerance is the ability to continue correct execution, albeit at a reduced
performance, despite the existence of faults. As a complement to these prior works, we
present the Warped-Shield framework, which consists of three lightweight fault tolerance
schemes to dynamically adapt the thread execution in GPUs based on the available
healthy SIMT lanes. The presented schemes do not require any software intervention and
are transparent to the micro-architectural blocks surrounding the SIMT lanes, such as the
fetch logic, register file, and caches. The following is a short overview of the presented fault tolerance schemes:
Improving GPUs resiliency to hard faults using thread shuffling: This
technique tolerates hard faults in the SIMT lanes by rerouting threads scheduled to run on
faulty lanes to idle healthy lanes. Thread shuffling exploits the underutilization of the
SIMT lanes to repurpose the idle lanes for fault tolerance.
Using dynamic warp deformation when thread shuffling is insufficient:
Depending solely on thread shuffling to tolerate hard faults in the SIMT lanes is
insufficient when the number of active threads exceeds the number of healthy lanes. To
tackle this problem, we present dynamic warp deformation which divides the original
warp into multiple sub-warps, such that each sub-warp has no more active threads than
the number of healthy lanes.
Reducing the performance overhead associated with warp deformation using
inter-SP warp shuffling: GPUs comprise multiple streaming processor (SP) units and
these SP units may suffer non-uniform degradation. In cases where the fault maps of the
SP units are asymmetric, we present inter-SP warp shuffling which uses a scheduling
technique to assign a warp to a SP unit that is best suited for the warp's computational
needs. For instance, if a warp has 12 active threads and one SP unit has 12 healthy lanes
while a second SP unit has fewer than 12 healthy lanes, inter-SP warp shuffling schedules
the warp to the SP unit with 12 healthy SIMT lanes, thereby avoiding warp deformation.
Related Work 4.2
Reliable GPUs: The reliability of GPUs has been handled through hardware and
software techniques. Dimtrov et al. proposed three software-based dual modular
redundant (DMR) techniques (R-Naïve, R-Scatter, and R-Thread) to check the execution
correctness [32]. The software DMR techniques essentially run every kernel twice in an
interleaved fashion to detect computational faults. Another software-based fault detection
solution was proposed by Nathan et al. [64], where the GPU execution correctness is
verified by inserting a signature collector code. The output execution signature is
compared against a predefined golden signature to detect in-field computational faults.
On the hardware side, Jeon and Annavaram proposed Warped-DMR, a hardware
technique to check the correctness of the computation and detect faults, but not correct
them [47]. Warped-DMR leverages the underutilization of the SIMT lanes to enable
spatial (i.e., intra-warp) and temporal (i.e., inter-warp) DMR execution to detect faults. Similar to intra-warp DMR, the presented intra-cluster thread shuffling technique within Warped-Shield leverages the idleness of the SIMT lanes; however, the goal of Warped-Shield is to tolerate hard faults rather than detect them. Warped-Shield assumes that fault
detection is already done using orthogonal techniques, including Warped-DMR.
Warp formation: Due to the underutilization of the SIMT lanes, several
techniques have been proposed to dynamically form large warps by grouping the active
threads from different warps in order to improve performance [36] [63]. The large warps
are dynamically formed at the scheduler level. Contrary to large warp formation ideas,
the dynamic warp deformation technique within Warped-Shield divides the warps into
multiple sub-warps to avoid using the faulty SIMT lanes.
Tolerating hard faults: Any micro-architectural block might experience a hard
fault during in-field operation. Bower et al. proposed hardware-based techniques to
tolerate in-field hard faults in a wide range of CPUs' micro-architectural blocks, such as
ALUs and microprocessor array structures [17] [18]. On the other hand, the reliability-
aware exceptions (RAEs) [34] described in the previous chapter enable software-directed
handling of in-field faults detected in the microprocessor array structures. These
hardware and software techniques leverage the available micro-architectural redundancy
to de-configure the faulty blocks and continue execution with reduced resources.
Similarly, the GPU-specific fault tolerance techniques presented in this chapter are
inspired by the massive number of available SIMT lanes in GPUs.
Background and Motivation 4.3
GPU baseline architecture 4.3.1
This section provides an overview of the baseline GPU architecture used to
describe the implementation of the presented fault tolerance schemes. We use a Fermi-
like architecture (e.g. Nvidia GTX480) as the baseline GPU model [67]. A Fermi GPU
chip consists of multiple streaming multiprocessors (SMs). Figure 4.1 shows the main
pipeline stages within a single SM. Each SM has tens of execution resources, which can
be subdivided into special function units (SFUs), load/store units (LD/ST), integer (INT)
and floating-point (FP) units. Each SIMT lane consists of one INT and one FP unit.
Every SM contains 32 double-clocked SIMT lanes divided into two groups called
streaming processor units (i.e., SP0 and SP1 units in the figure). Thus, each SP unit can
execute 32 threads every cycle. Four consecutive SIMT lanes are grouped together to
form one SIMT cluster as shown in Figure 4.1. This cluster implementation is used in
existing commercial GPUs to reduce the complexity of data forwarding from a wide
register file to the SIMT lanes [37].
Each SM uses the SIMT execution model, which allows the lanes within one SP unit
to share a single program counter and execute the same instruction but on different data
elements concurrently. The SIMT execution model is supported using the notion of
warps. A warp is the smallest scheduled unit of work in GPUs and it consists of up to 32
parallel threads that execute the same instruction on different input operand values.
Multiple warps within the same program are grouped together into one cooperative thread
array (CTA) which is assigned to one SM for execution. In GTX480, each SM can accommodate 48 warps (i.e., a total of 1536 threads).
Figure 4.1: Baseline Nvidia GTX480 design.
Warp scheduler: Each SM has its own warp scheduler. The scheduler extracts
warps from the instruction buffer according to the scheduling algorithm. When the input
operands of the warp instruction to be scheduled are not ready, due to read-after-write
(RAW) data hazards, the scheduler skips this warp instruction and checks the next warp
in the instruction buffer. If the input operands of a warp instruction are ready, the
scheduler assigns the warp instruction to an operand collector unit and sends the
operands' read requests to the register file. The operand collector unit is simply a staging
buffer where values read from the register file are temporarily stored.
The baseline architecture in our work uses the two-level scheduler, proposed in
[37], which divides the warps into pending and active warps. The scheduler extracts the
warps from the active warp queue in a round-robin fashion and up to two warps are
scheduled every cycle.
Warp issue: Once the input operands are gathered from the register file into the
operand collector, the warp instruction is then sent to the issue queue where it waits to be
issued to the appropriate execution unit. The issue logic checks for structural hazards on
the corresponding execution unit and result bus before issuing the warp. In the case of a
conflict, the warp is stalled in the issue queue. The issue logic also determines which
streaming processor unit (SP0 or SP1) will execute the warp.
Resource utilization imbalance 4.3.2
The SIMT lanes within the same SP unit and across different SP units have
variable utilization behavior. This variation is due to two reasons: branch/memory
divergence phenomenon and insufficient application parallelism. Branch divergence
occurs when the current warp instruction is a branch and some of the warp's threads
diverge to the taken path while others diverge to the not taken path. As a result, the
threads on the taken path and the not taken path are scheduled over different cycles. Each
warp has a 32-bit active mask to indicate which SIMT lanes are going to be active (i.e.
active mask bit is one) and which SIMT lanes are going to be idle (i.e. active mask bit is
zero). Memory divergence happens when the current warp instruction is a memory-access
instruction and some of the warp's threads hit in the cache while others miss in the cache.
The threads which hit and the ones which miss are scheduled over different cycles.
Figure 4.2 shows the percentage of warps with 1, 2, 3,... 32 active threads for
different benchmarks selected from the GPGPU-Sim [10], Rodinia [23], and Parboil [98]
benchmark suites. The experimental setup is described in detail later in Section 4.6.
Figure 4.2: Thread activity breakdown.
The divergence phenomena and insufficient parallelism result in fewer than 32 threads being active for many applications. Note that the same observation has been made in many
prior studies and was exploited for fault detection and power savings [47] [52]. However,
we exploit this observation for tolerating in-field hard faults. We also monitored the
utilization of the SIMT lanes within a single SP unit across the same benchmarks.
Figure 4.3 plots the average utilization for each SIMT lane measured as the percentage of
time during which the respective SIMT lane is executing some instruction. The x-axis
represents the SIMT lane index. As shown, lanes experience variable utilization
behaviors. Using this motivational data, in the next section we present three fault
tolerance techniques to guarantee correct execution despite the existence of faulty SIMT
lanes.
Warped-Shield Fault Tolerance Techniques 4.4
Thread shuffling 4.4.1
Thread shuffling leverages the underutilization of the SIMT lanes to reroute
active threads, originally mapped to faulty lanes, to idle healthy SIMT lanes. After the
execution on the healthy lanes completes, threads are rerouted to their original lanes in order to write their results to the correct register file banks.
Figure 4.3: Average SIMT lane utilization.
As mentioned before, several
warps have less than 32 active threads due to the divergence phenomena and insufficient
parallelism. Figure 4.2 indicates that there are plenty of opportunities to enable thread
shuffling between faulty and healthy lanes due to the underutilization of the SIMT lanes.
Implementing perfect SP-unit-wide thread shuffling requires data from one faulty active lane to be forwarded to any other healthy idle lane within the SP unit. Thus, this approach requires a 16x16 crossbar to enable thread shuffling. In order to reduce the
area and timing overhead of the crossbar, we take advantage of GPUs' cluster-based
implementation [37] and limit thread shuffling to be within a cluster, which we term as
intra-cluster thread shuffling. When a thread is mapped to a faulty SIMT lane, intra-
cluster shuffling seeks to reroute the thread to the nearest healthy idle lane within the
same cluster, if any is available.
Figure 4.4a gives an example of intra-cluster thread shuffling with a cluster size
of four. In the figure, the rightmost lane (L3) is assumed to be faulty as indicated by the red cross mark. The 0/1 bit vector on top of the SIMT lanes represents the active mask of the threads mapped to the cluster. L2 is idle and healthy while the faulty L3 has an active thread mapped to it. To handle this scenario, intra-cluster thread shuffling reroutes the active thread, originally assigned to L3, to L2 as indicated by the curved arrow. Clearly,
the effectiveness of intra-cluster thread shuffling depends on the cluster size. As cluster
size increases, more shuffling opportunities are expected at the cost of providing a larger
crossbar design. A sensitivity analysis of the cluster size and the expected fault tolerance
improvement is presented shortly.
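The rerouting decision made by intra-cluster thread shuffling can be sketched in Python as follows; the list-based masks and the function name are an illustrative model of the behavior, not the hardware control logic, and the routing simply picks the lowest-indexed idle healthy lane.

def shuffle_cluster(active_mask, fault_map):
    # active_mask[i] == 1: an active thread is mapped to lane i.
    # fault_map[i] == 1: lane i is faulty.
    # Returns {home_lane: execution_lane}, or None when the active threads
    # outnumber the healthy lanes and warp deformation is needed instead.
    idle_healthy = [i for i in range(4) if not fault_map[i] and not active_mask[i]]
    assignment = {}
    for lane in range(4):
        if not active_mask[lane]:
            continue
        if not fault_map[lane]:
            assignment[lane] = lane                 # healthy home lane, no rerouting
        elif idle_healthy:
            assignment[lane] = idle_healthy.pop(0)  # reroute to an idle healthy lane
        else:
            return None                             # shuffling alone is insufficient
    return assignment

# Figure 4.4a scenario: active mask 1101 (L2 idle) and a faulty L3.
print(shuffle_cluster([1, 1, 0, 1], [0, 0, 0, 1]))  # {0: 0, 1: 1, 3: 2}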
Another important parameter that affects the efficiency of intra-cluster thread
shuffling is the way threads are mapped to the SIMT lanes. Figure 4.4b illustrates one
approach that sequentially maps threads to SIMT lanes (SEQ), such that thread_0 (T0) is mapped to L0, T1 is mapped to L1, and so on. During divergence, if the active threads were
uniformly distributed across all lanes, then SEQ mapping would create equal
opportunities to have idle lanes across all clusters. However, divergence does not lead to
uniform distribution of active threads. Rather our empirical observations show that active
threads tend to be coalesced into nearby lanes. One reason for this behavior is that when a
warp executes a load/store instruction, consecutive threads within the warp tend to access
consecutive memory locations, which is called memory coalescing and is a common
phenomenon in GPU applications. Hence, when a thread misses in the cache there is a
high chance that its neighbors are going to miss as well. Thus, memory divergence causes
the threads that hit in the cache and those that miss in the cache to be spatially grouped
together. Similarly, for applications with few active threads due to limited parallelism,
the active threads are packed together and mapped to consecutive SIMT lanes.
(a) Thread shuffling (b) Sequential (SEQ) mapping
(c) Round-Robin (RR) mapping (d) Butterfly (BF) mapping
Figure 4.4: Thread shuffling and mapping techniques.
We empirically validate the claim that active threads are usually clustered into
nearby SIMT lanes with sequential thread mapping. Figure 4.5a shows the fraction of
time a given number of threads (0, 1, 2, 3, or 4 active threads) is mapped to a SIMT
cluster of four lanes for different benchmarks. The majority of the time, SIMT clusters
are either fully occupied with four active threads mapped to them or fully idle with zero
active threads mapped to them. As a result, fewer intra-cluster thread shuffling
opportunities are going to be available because the active threads are not evenly
distributed across the clusters, rather they are bundled together.
(a) SEQ mapping
(b) RR mapping
Figure 4.5: Thread mapping impact on shuffling opportunities.
To overcome this limitation, one can use the round-robin (RR) thread-to-cluster mapping proposed in [47] and illustrated in Figure 4.4c. In this mapping, T0 is mapped to L0, T1 is mapped to L4, and so on. The RR mapping helps to distribute the active
threads evenly across the SIMT clusters to increase intra-cluster thread shuffling
opportunities. Figure 4.5b shows the impact of the RR mapping on intra-cluster thread
shuffling. The Y-axis shows the fraction of time a given number of threads (0, 1, 2, 3, or
4 active threads) is mapped to a SIMT cluster of four lanes for different benchmarks. It is
clear that the RR mapping increases the frequency of occurrences where 1, 2, or 3 active
threads are mapped to a SIMT cluster and decreases the frequency of occurrences where
0 or 4 active threads are mapped, which is precisely what is needed for intra-cluster
thread shuffling to perform better.
In addition to round-robin we also evaluated butterfly (BF) mapping approach.
This approach is illustrated in Figure 4.4d. Every cluster is populated by mapping threads
from the least significant side and most significant side interchangeably which helps to
distribute the active threads. Finally, we also evaluated optimal mapping policy (OPT),
which assumes that active threads of every warp, irrespective of their positions, are
uniformly distributed across clusters. Clearly, the OPT mapping is unrealistic to
implement in practice because it requires run-time information about the positions of
active threads and redistributing them uniformly across all clusters. However, evaluating
the OPT mapping is still important because it provides a reference point to measure the
efficiency of the other three mapping approaches, in terms of the intra-cluster thread
shuffling opportunities available.
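The three realizable mappings can be written down as simple index arithmetic; the Python sketch below reproduces the lane assignments of Figure 4.4(b)-(d) for the eight-lane, two-cluster example in the figure. The function names and the generalization beyond that example are mine, and OPT is omitted because it is not implementable.

LANES = 8
CLUSTER_SIZE = 4
NUM_CLUSTERS = LANES // CLUSTER_SIZE

def seq_lane(t):
    # SEQ: threads fill lanes in order (T0->L0, T1->L1, ...).
    return t

def rr_lane(t):
    # RR: consecutive threads rotate across clusters (T0->L0, T1->L4, T2->L1, ...).
    return (t % NUM_CLUSTERS) * CLUSTER_SIZE + t // NUM_CLUSTERS

def bf_lane(t):
    # BF: lanes are filled from the low and high thread indices alternately
    # (T0->L0, T7->L1, T1->L2, T6->L3, ...).
    return 2 * t if t < LANES // 2 else 2 * (LANES - 1 - t) + 1

for t in range(LANES):
    print(t, seq_lane(t), rr_lane(t), bf_lane(t))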
To better understand the dependency of intra-cluster thread shuffling
opportunities on SIMT cluster size and thread mapping, we perform a sensitivity analysis
using four cluster sizes: 2, 4, 8, and 16. For each cluster size, we measure the weighted
average number of shuffling opportunities for the four thread mapping techniques: SEQ,
RR, BF, and OPT. In every cycle, the number of shuffling opportunities for each cluster
is measured as the number of active threads which can be simultaneously rerouted to idle
lanes within the cluster when the SIMT lanes originally assigned to these threads are
faulty. The opportunities across all clusters are summed to produce the warp shuffling
opportunities. After that, the number of opportunities for each benchmark is computed as
the summation of the shuffling opportunities for all warp instructions within the
benchmark. Finally, we calculate the weighted average across all benchmarks according
to the execution times of the benchmarks.
Figure 4.6 plots the weighted average number of shuffling opportunities for each
(cluster size, thread mapping) pair. The first observation is that for sequential (SEQ) and
butterfly (BF) mappings, the shuffling opportunities are limited at small cluster sizes but
the opportunities approximately double with cluster size. On the other hand, the number
of opportunities with round-robin (RR) and optimal (OPT) mappings is independent from
the cluster size. The second observation is that the RR mapping achieves nearly the same
number of opportunities as the OPT mapping. At any given cluster size, the RR mapping
outperforms the SEQ and BF mappings. Based on these results, we choose RR thread
mapping with cluster size of four SIMT lanes because it is the same cluster size used in
existing commercial GPUs [67] [37].
Intra-cluster thread shuffling is sufficient as long as the number of active threads
mapped to each SIMT cluster is less than or equal to the number of healthy lanes within
the cluster. After the warp completes execution, threads are reassembled back to their
original lanes so that they can write back their results to the correct register file banks.
Thus, intra-cluster thread shuffling is transparent to the software and to other micro-
architectural blocks inside the streaming multiprocessor (SM). The precise architectural
support necessary for this approach will be described in Section 4.5.
Intra-cluster thread shuffling with round-robin mapping is inspired by the intra-
warp DMR proposed in [47]. Both techniques leverage the underutilization of the SIMT
lanes, but for different purposes. Intra-warp DMR redundantly executes active threads on
idle SIMT lanes and compares results to detect operational faults. On the other hand, in
Warped-Shield, intra-cluster thread shuffling tolerates hard faults in SIMT lanes by
rerouting active threads, originally mapped to faulty lanes, to healthy idle lanes.
Figure 4.6: Intra-cluster thread shuffling analysis.
Dynamic warp deformation 4.4.2
Although intra-cluster thread shuffling can be sufficient for many fault scenarios,
there are some cases where intra-cluster thread shuffling alone is not sufficient to tolerate
existing hard faults. A simple example is when there are 32 active threads within the
warp and there is at least one faulty lane within the SP unit. In such scenario, thread
shuffling cannot provide fault tolerance as there are no idle lanes to route the computation
from the faulty lane. To handle such cases, we present dynamic warp deformation.
Dynamic warp deformation splits the warp into multiple sub-warps with fewer
active threads. The sub-warps are scheduled consecutively one after the other. When all
sub-warps complete their execution, the original warp is reassembled and the results for
all active threads are written to the register file banks at once. As the number of active
threads per sub-warp is reduced, more intra-cluster thread shuffling opportunities are
created to handle the faults. Prior research proposed run-time large warp formation to
reduce the performance overhead caused by branch divergence [36] [63]. On the
contrary, warp deformation is used to improve fault tolerance capability of future GPUs.
Figure 4.7a shows how warp deformation is used when the number of active
threads is greater than the number of healthy SIMT lanes in the cluster. The 0/1 bit vector
on top of the four lanes represents the active mask of the threads to be issued. As shown
in the figure, L3 in the cluster is considered faulty as indicated by the red cross mark, yet
an active thread is mapped to it. There are four active threads mapped to the cluster and
intra-cluster shuffling cannot provide fault tolerance in this scenario. Instead we rely on
warp deformation to divide the original warp into two sub-warps. The purpose of warp
deformation is to distribute the four active threads mapped to the cluster evenly across the
two sub-warps. Sub-warp0 is assigned the two rightmost threads and sub-warp1 is assigned the two leftmost threads. Sub-warp0 relies on intra-cluster shuffling to forward the active thread, originally mapped to the faulty L3, to the idle healthy L1 as indicated by the curved arrow.
Figure 4.7b shows another situation where the SIMT cluster suffers from three faulty lanes while having three active threads mapped to it. In this case, the warp needs to be divided into three sub-warps; each one is assigned a single active thread. Sub-warp0 and sub-warp1 rely on intra-cluster thread shuffling to provide fault tolerance as indicated by the curved arrows.
(a) Two sub-warps
(b) Three sub-warps
Figure 4.7: Dynamic warp deformation.
In the examples discussed so far, we have considered a single SIMT cluster.
However, each SP unit in the baseline architecture consists of four SIMT clusters.
Different clusters may suffer from different numbers of faulty lanes. In addition, different
numbers of active threads could be mapped to different clusters at different times based
on the running application. So, the general solution is to compute the number of sub-
warps required according to each cluster within the SP unit. Then warp deformation is
performed according to the maximum number of sub-warps required among all clusters.
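This per-cluster computation and the warp-level maximum can be summarized with the Python sketch below; the ceiling-based count anticipates Equation 4.1 in Section 4.5.2, the masks in the example mirror Figure 4.8, and the exact lane positions of the faults are chosen only for illustration.

import math

def subwarps_for_cluster(active_mask, fault_map):
    # Sub-warps a single 4-lane cluster needs: ceil(active / healthy).
    # Assumes at least one healthy lane per cluster, as in Section 4.6.1.
    active = sum(active_mask)
    healthy = len(fault_map) - sum(fault_map)
    return 1 if active == 0 else math.ceil(active / healthy)

def subwarps_for_warp(cluster_masks, cluster_fault_maps):
    # Warp-level deformation: the maximum requirement across all clusters.
    return max(subwarps_for_cluster(m, f)
               for m, f in zip(cluster_masks, cluster_fault_maps))

# Figure 4.8: cluster0 has two faulty lanes and four active threads,
# cluster1 has one faulty lane and three active threads -> two sub-warps.
print(subwarps_for_warp([[1, 1, 1, 1], [1, 1, 0, 1]],
                        [[0, 0, 1, 1], [0, 0, 0, 1]]))  # 2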
Figure 4.8 shows an example of cluster0 with two faulty lanes and cluster1 with a single faulty lane. Four active threads are mapped to cluster0 and three active threads are mapped to cluster1. When considering each cluster independently, cluster0 requires two sub-warps because it can only issue two threads at a time. In contrast, cluster1 does not
require any deformation and intra-cluster thread shuffling is sufficient for it. Since
deformation is done at the warp-level, not at the cluster-level, we are forced to deform the
original warp into two sub-warps as dictated by cluster0. The two sub-warps are shown in the figure. When a warp is deformed, the sub-warps are issued consecutively one after the other to the same SP unit. When all sub-warps complete execution successfully, they are merged again to form the original warp before writing their results back to the register file banks.
Figure 4.8: Deformation of two SIMT clusters.
Inter-SP warp shuffling 4.4.3
The performance overhead of the dynamic warp deformation technique is directly
proportional to the number of sub-warps needed to tolerate the faulty SIMT lanes. Hence,
it is important to minimize the number of sub-warps or even avoid warp deformation
completely whenever possible. To achieve this goal, we propose the inter-SP warp shuffling mechanism, which strives to issue every warp instruction to the most appropriate SP unit (i.e., the SP unit where no deformation is required or the SP unit where fewer sub-warps are needed). To understand how inter-SP warp shuffling works, let us consider the case where two warp instructions X and Y are to be issued to the SP0 and SP1 units, respectively. Due to variable SIMT lane utilization and process variations, the SP0 and SP1 units may suffer from different numbers of faulty SIMT lanes. Hence, issuing warp instruction X to the SP0 unit might require three sub-warps, while issuing X to the SP1 unit might require no deformation. Similarly, issuing warp instruction Y to the SP0 unit might require two sub-warps, while issuing Y to the SP1 unit might require three sub-warps. In this case, the inter-SP warp shuffling mechanism helps to reduce the deformation overhead by issuing warp instruction X to SP1 and warp instruction Y to SP0.
Architectural Support 4.5
Intra-cluster thread shuffling 4.5.1
The intra-cluster thread shuffling approach is dependent on the cluster's fault map.
The fault map essentially shows which SIMT lanes within the cluster are healthy and
which lanes are suffering from permanent faults. Clusters' fault maps are assumed to be
generated using prior fault detection techniques that are specifically targeting GPUs [47].
Based on the fault map of the cluster, threads mapped to the cluster are shuffled to avoid
issuing any active thread to a faulty lane.
The implementation of intra-cluster thread shuffling consists of one stage of
multiplexers, called shuffling MUXes, added between the issue queues and the SIMT
lanes. After execution completes, we need to reshuffle the threads to their original lanes
in order to write back their results to the proper register file banks. Hence, another stage
of multiplexers (i.e. reshuffling MUXes) is added between the last execution stage and
the write-back stage.
Figure 4.9b shows the shuffling and reshuffling stages added to the SIMT lane
pipeline. For each lane, we add 4:1 shuffling and reshuffling MUXes. The shuffling
MUX forwards one of the cluster's four threads to the corresponding lane and the
reshuffling MUX forwards the output of the thread back to the SIMT lane to which the thread was originally mapped. The select signals of the shuffling and reshuffling
MUXes are generated by the intra-cluster thread shuffling control logic using the warp
active mask and the clusters' fault maps.
The truth table given in Table 4.1 represents the functionality of the intra-cluster
thread shuffling control logic for a single cluster. The outputs are the select signals of the
shuffling and reshuffling multiplexers for the four SIMT lanes (L0, L1, L2, and L3). The
inputs are the cluster's active mask and fault map. When an active mask bit is set to one,
it indicates that an active thread is mapped to the corresponding SIMT lane. Otherwise,
the SIMT lane is idle. Also, when a bit is set to one in the fault map, it indicates that the
corresponding SIMT lane is faulty. Otherwise, the SIMT lane is healthy.
There are 16 possible combinations for the cluster's active mask and 16 possible
combinations for the cluster's fault map, so there should be 256 rows in Table 4.1.
(a) Baseline Pipeline
(b) Pipeline with intra-cluster thread shuffling support
Figure 4.9: Intra-cluster thread shuffling implementation.
However, all scenarios where the number of 1's in the active mask (i.e. number of active
threads) is greater than the number of 0's in the fault map (i.e. number of healthy lanes)
are handled using dynamic warp deformation first and then intra-cluster thread shuffling
might be used afterwards. So, the intra-cluster thread shuffling control logic only deals
with scenarios where the number of active threads is less than or equal to the number of
healthy SIMT lanes.
The 11 combinations shown in the table are representative of all possible
combinations. For example, in the first row, when no active thread is mapped to the
cluster, the select signals of all shuffling and reshuffling MUXes are "don't care". On the
other hand, when the number of active threads is four and the number of healthy lanes is
four, as in the last row, each thread is forwarded to its original lane and no shuffling or
reshuffling is required. An example of a scenario, where shuffling is required, is given in
the 2nd row of the table. A single thread is originally mapped to L3 and the cluster fault map indicates that the only healthy lane in the cluster is L0. As a result, the select signals of the L0 shuffling MUX should be set to "11" in order to forward the thread's computation
from L3 to L0. Similarly, the select signals of the L3 reshuffling MUX should be set to "00" to forward the output of L0 back to L3 before writing the results back to the register file.
Row#  Active Mask  Fault Map  Shuffling MUX (L0 L1 L2 L3)  Reshuffling MUX (L0 L1 L2 L3)
1     0000         xxxx       xx xx xx xx                  xx xx xx xx
2     0001         0111       11 xx xx xx                  xx xx xx 00
3     0001         1001       xx xx 11 xx                  xx xx xx 10
4     0001         0001       xx xx 11 xx                  xx xx xx 10
5     0001         0000       xx xx xx 11                  xx xx xx 11
6     0110         0110       01 xx xx 10                  xx 00 11 xx
7     0110         0100       01 xx 10 xx                  xx 00 10 xx
8     0110         0000       xx 01 10 xx                  xx 01 10 xx
9     0111         0001       11 01 10 xx                  xx 01 10 00
10    0111         0000       xx 01 10 11                  xx 01 10 11
11    1111         0000       00 01 10 11                  00 01 10 11
Table 4.1: Intra-cluster thread shuffling control logic.
Note that for many scenarios, there are multiple shuffling options, which help to
reduce the design complexity of the shuffling control logic. For example, the scenario in
the 3rd row assumes one active thread mapped to L3 and two healthy lanes (i.e., L1 and L2). In this case, we can either choose to shuffle the thread's computation to L2, as shown in the table, or equivalently to L1. Eventually, we should choose the option that simplifies the design of the shuffling control logic.
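The same decisions can be expressed as the 2-bit select values of Table 4.1. The Python sketch below is an illustrative model of the control logic, not its gate-level design, and it may pick a different but equally valid option where the table allows several; it uses the lowest-indexed idle healthy lane, which reproduces, for example, row 2 of the table.

def mux_selects(active_mask, fault_map):
    # Returns (shuffling_sel, reshuffling_sel) for a 4-lane cluster, each a
    # list of 2-bit select strings indexed by lane, with "xx" for don't-care.
    # Assumes the number of active threads does not exceed the healthy lanes.
    shuffling = ["xx"] * 4
    reshuffling = ["xx"] * 4
    idle_healthy = [i for i in range(4) if not fault_map[i] and not active_mask[i]]
    for lane in range(4):
        if not active_mask[lane]:
            continue
        exec_lane = lane if not fault_map[lane] else idle_healthy.pop(0)
        shuffling[exec_lane] = format(lane, "02b")    # which thread feeds exec_lane
        reshuffling[lane] = format(exec_lane, "02b")  # which output returns to lane
    return shuffling, reshuffling

# Row 2 of Table 4.1: active mask 0001 (L3 active), fault map 0111 (only L0 healthy).
print(mux_selects([0, 0, 0, 1], [0, 1, 1, 1]))
# (['11', 'xx', 'xx', 'xx'], ['xx', 'xx', 'xx', '00'])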
Warp deformation and scheduling 4.5.2
The decision on whether dynamic warp deformation is required or not can be
deduced from the warp's active mask and the fault map of the streaming processor (SP)
unit to which the warp will be issued. The logic that decides on dynamic warp
deformation is implemented as part of the warp scheduler. We refer to the modified warp
scheduler as the reliability-aware scheduler (RASc).
In addition to the functionality of the original warp scheduler, the RASc has two
extra tasks: first, it decides whether warp deformation is required. Second, if deformation
is needed, the RASc determines the number of sub-warps necessary to tolerate the
existing hard faults. When performing these two tasks, the RASc must take into account
the fault maps of both the SP0 and SP1 units. The reason is that the SP unit to which the
current warp will be issued is not known at the scheduler level. This decision is made
later by the warp issue logic.
In order to decide whether warp deformation is needed or not, the RASc compares
the active mask of every cluster with its corresponding fault map. When the number of
active threads is less than or equal to the number of healthy SIMT lanes in the cluster, no
deformation is needed. Otherwise, deformation is needed and the number of required sub-
warps for the cluster is computed as follows:
N_sub-warps = ⌈ (number of active threads in the cluster) / (number of healthy SIMT lanes in the cluster) ⌉        (4.1)
When all clusters within the warp do not require deformation, the original warp will not
be split for the corresponding SP unit and intra-cluster thread shuffling is considered
sufficient to avoid any faulty SIMT lanes. On the other hand, if at least one cluster
requires deformation, then the original warp will be split for the corresponding SP unit
and the number of sub-warps is computed as the maximum of N_sub-warps across all clusters.
The N_sub-warps values of all clusters within the SP unit are compared with each other using a tree of 3-bit comparators to find the maximum N_sub-warps. This process is iterated twice: in the first iteration the RASc uses the fault map of the SP0 unit, and the final output represents the number of sub-warps required in case the warp gets issued to SP0. In the second iteration, the RASc uses the fault map of the SP1 unit, and the final output represents the number of sub-warps required in case the warp gets issued to SP1.
In conclusion, for each warp, the RASc attaches the following deformation hint bits: a flag bit to indicate whether warp deformation is required (Split_Flag) and a 3-bit value that indicates the number of sub-warps needed ("000": no warp deformation, "010": two sub-warps, "011": three sub-warps, and "100": four sub-warps). Two sets of the hint bits are attached to each warp; one set is generated based on the fault map of the SP0 unit and one based on the fault map of the SP1 unit. Hence, eight additional bits are attached to each warp instruction at the warp scheduler.
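One way to picture the eight hint bits is the packing sketched below in Python; the field layout and helper names are illustrative rather than the actual RTL encoding used in the design.

def encode_hints(n_subwarps_sp0, n_subwarps_sp1):
    # Pack one (Split_Flag, 3-bit sub-warp count) pair per SP unit into 8 bits.
    # A count of zero or one means no deformation, encoded as flag 0 and "000".
    def pack(n):
        split_flag = 1 if n >= 2 else 0
        count = n if n >= 2 else 0
        return (split_flag << 3) | count
    return (pack(n_subwarps_sp0) << 4) | pack(n_subwarps_sp1)

# A warp that needs three sub-warps on SP0 but none on SP1.
print(format(encode_hints(3, 1), "08b"))  # 10110000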
Inter-SP warp shuffling 4.5.3
The functionality of inter-SP warp shuffling can be achieved by augmenting the
issue logic. Recall that in the baseline architecture, when all input operands of a warp
instruction are ready, the warp instruction is forwarded to the issue queues. All ready
warp instructions targeted to execute on an SP unit are forwarded to the same issue queue
(i.e. SP issue queue). Every cycle, the issue logic selects the most senior two warp
instructions from the SP issue queue and issues them to the SP0 and SP1 units provided that
there are no structural hazards.
Instead, we propose to distribute the ready warp instructions into four issue
queues according to the warps' SP0_Split_Flag and SP1_Split_Flag as shown in Table 4.2. The 1st issue queue holds the ready warp instructions which do not require deformation in SP0, but require deformation in SP1. The 2nd issue queue holds all warps that require deformation in SP0, but not in SP1. The 3rd issue queue holds all warps that need deformation neither in SP0 nor in SP1. The 4th issue queue holds warps that require deformation irrespective of which SP unit they are issued to.
Issue Queue    SP0_Split_Flag    SP1_Split_Flag
1st            0                 1
2nd            1                 0
3rd            0                 0
4th            1                 1
Table 4.2: Inter-SP warp shuffling.
In addition, the issue logic is
augmented with priority issue logic, such that it tries to issue a ready warp instruction to the SP0 unit by checking the four issue queues in the following order: 1st, 3rd, 4th, and finally the 2nd queue. In the same cycle, the issue logic tries to issue another ready warp instruction to the SP1 unit by checking the four issue queues in the following order: 2nd, 3rd, 4th, and finally the 1st queue. The issue logic extracts the oldest warp within each issue queue.
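The queue-selection priority can be modeled as follows in Python; the queues are simple FIFOs of warp identifiers, the priority orders are taken from the text, and the data structures themselves are illustrative.

from collections import deque

# Queue 1: split on SP1 only; 2: split on SP0 only; 3: split on neither; 4: split on both.
queues = {1: deque(), 2: deque(), 3: deque(), 4: deque()}

PRIORITY = {"SP0": [1, 3, 4, 2],   # prefer warps that avoid deformation on SP0
            "SP1": [2, 3, 4, 1]}   # prefer warps that avoid deformation on SP1

def issue_to(sp_unit):
    # Pop the oldest warp from the highest-priority non-empty queue for sp_unit.
    for q in PRIORITY[sp_unit]:
        if queues[q]:
            return queues[q].popleft()
    return None  # nothing ready to issue this cycle

# Example: warp A needs deformation only on SP1, warp B only on SP0.
queues[1].append("A")
queues[2].append("B")
print(issue_to("SP0"), issue_to("SP1"))  # A B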
Issuing deformed warps 4.5.4
Once the issue logic selects the warp instruction to be issued, a reliability-aware
split (RASp) unit uses the warp's deformation hint bits to generate the active masks of the
sub-warps, in case deformation is needed. Each SP unit has a dedicated RASp unit
associated with it. Figure 4.10 depicts the design of the RASp unit. We only show one
part of the RASp unit which is responsible for generating the active masks of a single
SIMT cluster. The full RASp design can be thought of as n-replicas of Figure 4.10, where
n is the number of clusters in the SP unit. The input mask register shown in the figure
gets initialized to the cluster's original active mask (i.e. Issued Active Mask) with every
new warp instruction. At the end of every cycle, the output of the 4:1 MUX is used as the
cluster's active mask for the current sub-warp. Recall that when a warp is divided into X
sub-warps, the sub-warps are issued in X consecutive cycles. Each cycle, the sub-warp's active mask is different; it is generated by the RASp unit.
When generating the new active masks for the required sub-warps, there are three
possible scenarios. First, when no deformation is required (i.e., Split_Flag = 0 and N_sub-warps = 0), the cluster's original active mask is selected as shown by the topmost input of
the 4:1 MUX. The second scenario is when deformation is required (i.e. Split_Flag = 1)
and the number of sub-warps is two. In this case, the input mask is first ANDED with
"0011" to allow the two threads, mapped to L
2
and L
3
of the cluster, to be issued as part
of the first sub-warp. In the next cycle, the input mask is ANDED with "1100" to allow
the two threads, mapped to L
0
and L
1
of the cluster, to be issued as part of the second
sub-warp. The second scenario is represented by the bottommost input of the 4:1 MUX.
Notice that at the end of the first cycle, the input mask register gets updated with the
XORING of its current value and the mask of the current sub-warp (i.e. the output of the
4:1 MUX). In fact, this update process is not necessary for this scenario, but it is required
for the third scenario.
The third scenario is when deformation is required and the number of sub-warps
is three or four. In this case, one thread is issued to the cluster as part of every sub-warp.
In order to choose one of the active threads for each sub-warp, we use a 4:2 priority
encoder. The output of the encoder selects one of five possible cluster masks given at the
inputs of the 8:1 MUX. When all bits in the input mask are 0's, the valid bit at the output of the encoder will be 0 and pattern "0000" is chosen as the cluster mask for the current sub-warp. Otherwise, the priority encoder generates the index of the rightmost active thread and the appropriate pattern (i.e., "0001", "0010", "0100", or "1000") is chosen as the cluster's mask for the current sub-warp. In this scenario, it is important to update the input mask register at the end of every cycle, so that the priority encoder generates the index of the subsequent active thread from the right. The third scenario is represented by the third topmost input of the 4:1 MUX. Rather than leaving the second topmost input of the 4:1 MUX dangling, we choose to drive it from the input mask register.
Figure 4.10: RASp unit design.
The 2:1 MUX placed between the 8:1 and 4:1 MUXes is used to choose the value
of the input mask register as the cluster's mask for the last sub-warp. To explain why this
detour is required, let us consider an example where a warp of size eight is mapped to
two clusters. The first cluster has a single healthy lane and three active threads mapped to
it. The second cluster has four healthy lanes and four active threads mapped to it. In this
case, the warp needs to be deformed into three sub-warps. For the first cluster, the RASp
unit will generate three masks through the priority encoder; each mask has one active
thread. For the second cluster, the first two masks are also generated through the priority
encoder with one active thread in each of them. If the third mask of the second cluster is
left to be generated through the priority encoder, it will also include a single active thread
and the fourth active thread will not be issued. To overcome this limitation, we use the
2:1 MUX to select the input mask register, which includes all active threads which are yet
to issue, as the mask for the last sub-warp.
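The mask sequence a single cluster receives from the RASp unit can be modeled with the Python sketch below; it follows the three scenarios described above, while the list-based representation and the function name are mine and a real implementation is the MUX/encoder datapath of Figure 4.10.

def cluster_subwarp_masks(active_mask, n_subwarps):
    # active_mask is [L0, L1, L2, L3]. Produces one 4-bit cluster mask per
    # sub-warp cycle: the original mask when no deformation is needed, a
    # 0011/1100 split for two sub-warps, and one thread per cycle (rightmost
    # first) for three or four sub-warps, with the last sub-warp taking
    # whatever is still pending in the input mask register.
    if n_subwarps <= 1:
        return [active_mask[:]]
    remaining = active_mask[:]            # models the input mask register
    masks = []
    for cycle in range(n_subwarps):
        if cycle == n_subwarps - 1:
            mask = remaining[:]           # last sub-warp: all pending threads
        elif n_subwarps == 2:
            mask = [a & k for a, k in zip(remaining, [0, 0, 1, 1])]
        else:
            mask = [0, 0, 0, 0]
            for lane in (3, 2, 1, 0):     # rightmost active thread first
                if remaining[lane]:
                    mask[lane] = 1
                    break
        remaining = [r ^ m for r, m in zip(remaining, mask)]  # register update
        masks.append(mask)
    return masks

# The example above: a fully active cluster in a warp deformed into three
# sub-warps issues one thread, one thread, and then the two remaining threads.
print(cluster_subwarp_masks([1, 1, 1, 1], 3))
# [[0, 0, 0, 1], [0, 0, 1, 0], [1, 1, 0, 0]]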
At every cycle, the 4-bit active masks of all clusters, generated by the RASp unit,
are concatenated together to form the active mask of the current sub-warp. Then, the sub-
warp active mask is fed to the intra-cluster thread shuffling control logic, given in
Table 4.1, in order to generate the appropriate select signals for the shuffling and
reshuffling MUXes.
Experimental Evaluation 4.6
Methodology 4.6.1
We evaluated the intra-cluster thread shuffling and dynamic warp deformation
techniques for performance overhead, given various SP unit fault maps, using GPGPU-
Sim v3.02 [10]. For evaluation, 20 benchmarks from GPGPU-Sim, Rodinia [23], and
Parboil [98] benchmark suites are used to cover a wide range of application domains.
We configured the baseline GPU architecture using the Nvidia GTX480
configuration file included in the GPGPU-Sim package. The baseline architecture core
runs at 700MHz clock frequency and consists of 15 streaming multiprocessors (SMs).
Subsection 4.3.1 presents the details of the SM design. On top of the baseline
architecture, we implemented the intra-cluster thread shuffling, dynamic warp
deformation, and inter-SP warp shuffling in GPGPU-Sim. Thus, the reported
performance and power overheads take into account the added circuitry including the
control logic, shuffling and reshuffling MUXes, RASc, and RASp unit.
To evaluate the performance overhead introduced by the intra-cluster thread
shuffling and dynamic warp deformation techniques, we assume a fault map for each SP
unit and run our simulation experiments accordingly. In reality each SP unit would have
two fault maps: one for the integer pipelines and one for the floating-point pipelines. In
our evaluation, and for the sake of simplicity, we assume a single fault map for each SP
unit. Hence, when we refer to a SIMT lane as faulty, it indicates that integer and floating-
point pipelines within the lane are faulty. Even with such simplification there are still a
tremendous number of possible fault maps for all SP units within the 15 SMs. We also
assume that at least one SIMT lane is healthy in each cluster. Hence, fault maps in which
one or more clusters in any SP unit are completely faulty (i.e. the four SIMT lanes in the
cluster are faulty) are not handled by the presented fault tolerance techniques.
To further limit the investigated fault maps, we make two more conservative
assumptions: first, clusters within the same SP unit suffer from the same number of faulty
lanes. Second, the SP0 units in all 15 SMs share the same fault map. Similarly, the SP1 units in all 15 SMs share the same fault map. Based on the aforementioned assumptions,
six representative fault maps are chosen to evaluate our techniques, two of which
represent the common and worst case fault maps when multiple faults occur in the GPU
chip. We refer to every investigated fault map using the following convention: SP0_x_SP1_y, where x is the number of faulty lanes in each SIMT cluster of the SP0 unit and y is the number of faulty lanes in each SIMT cluster of the SP1 unit.
Common case fault map 4.6.2
It is to be emphasized that we expect fault-free operation to be the most common
case for the vast majority of the chip lifetime. But when multiple faults do occur in a
chip, the expected common case fault map for this scenario is derived from our empirical
results on SIMT lanes utilization profiles given in Figure 4.3. The figure shows that
within each cluster, two SIMT lanes are heavily utilized compared to the other two lanes.
So, it would be commonly expected for the highly utilized lanes to experience hard faults
simultaneously while the lightly utilized lanes remain healthy. In other words, in the
common case fault map, each cluster is expected to suffer from two faulty lanes (i.e.
SP0_2_SP1_2) which means that 50% of the GPU compute power is shut off.
With the SP0_2_SP1_2 fault map, intra-cluster thread shuffling is sufficient as long as
the number of active threads mapped to each SP unit cluster is less than or equal to two.
For warps that map more than two active threads to at least one cluster, warp deformation
will be unavoidable. Since every SIMT cluster has two healthy lanes, the number of sub-
warps will always be two whenever a warp is deformed in the common case fault map.
Figure 4.11a shows the performance for the SP0_2_SP1_2 fault map relative to the baseline fault-free run. Benchmarks with high thread activity (e.g., backprop, cutcp,
heartwall, hotspot, and sad) experience 20-40% performance overhead. This is expected
because all warps with at least one fully utilized cluster (i.e. a cluster with four active
threads mapped to it) will be split into two sub-warps issued in two consecutive cycles.
On the other hand, the rest of the benchmarks experience less than 10% performance
overhead. The last column shows a weighted average of 14% performance overhead.
The variations in the performance overhead across the benchmarks are due to two
reasons. First, intra-cluster thread shuffling opportunities vary from one benchmark to
another. For some benchmarks (e.g. CP, Mri-q, and WP), intra-cluster thread shuffling is
sufficient most of the time and as a result dynamic warp deformation is required much
less frequently. Second, some benchmarks are memory bound (e.g. MUM, WP, and bfs)
and computational delays, due to warp deformation and extra pipeline stages, will have a
small impact on the overall performance.
Worst case fault map 4.6.3
In order for intra-cluster thread shuffling and dynamic warp deformation to
guarantee execution progress, there should be at least one healthy SIMT lane per cluster.
Hence, the worst case fault map would render three SIMT lanes useless within each
cluster (i.e., SP0_3_SP1_3). This means that in the worst case, only 25% of the SIMT lanes within the entire GPU are healthy. With the SP0_3_SP1_3 fault map, intra-cluster
thread shuffling is sufficient as long as the number of active threads mapped to each SP
unit cluster is less than or equal to one. For warps that map more than one active thread to at least one cluster, warp deformation will be required.
(a) SP0_2_SP1_2 (b) SP0_3_SP1_3
Figure 4.11: Performance overhead for common and worst case fault maps (symmetric fault maps).
Figure 4.11b shows the performance for the SP0_3_SP1_3 fault map relative to the
baseline fault-free run. Benchmarks with high thread activity experience 50-130%
performance overhead. This is expected because all warps, where at least one cluster has
three/four active threads mapped to it, need to be split into three/four sub-warps issued in
three/four consecutive cycles. On the other hand, benchmarks with lower thread activity
experience performance overhead around 10%, as they benefit more from intra-cluster
thread shuffling. The last column shows a weighted average of 57% performance
overhead which is four times the average performance overhead for the common case
fault map.
Asymmetric fault maps 4.6.4
For the common and worst cases, the SP0 and SP1 units have symmetric fault maps.
However, due to process variations and utilization variations, fault maps may be asymmetric. For asymmetric fault maps, we consider four scenarios divided into two groups. The first group assumes that the SP1 unit has no faulty lanes while all clusters in the SP0 unit suffer from two and three faulty lanes (i.e., SP0_2_SP1_0 and SP0_3_SP1_0). The second group assumes that all clusters in the SP0 unit suffer from three faulty lanes while SP1 clusters suffer from one and two faulty lanes, respectively (i.e., SP0_3_SP1_1 and SP0_3_SP1_2).
Figure 4.12a plots the performance of the SP0_2_SP1_0 fault map relative to the baseline fault-free run. The weighted average performance overhead drops to 5.5%, which represents more than two times improvement over the common case fault map and 10 times improvement over the worst case fault map. The maximum performance overhead across all benchmarks is less than 18%. Similarly, Figure 4.12b plots the performance of the SP0_3_SP1_0 fault map. Compared to the SP0_2_SP1_0 fault map, SP0_3_SP1_0 causes 2.5 times the performance overhead, which is expected as more SIMT lanes become faulty. In addition, the maximum performance overhead across all benchmarks is also higher and it reaches 30%. For these two fault maps, the inter-SP warp shuffling technique helps to issue the warps to the SP1 unit whenever possible to avoid the potential warp deformation on the SP0 unit.
Figure 4.12c shows the performance of the SP0_3_SP1_1 fault map. The weighted average performance overhead is 26.5%, which is almost two times the performance overhead of the common case fault map. The rationale is that in the common case, all clusters have two faulty lanes, which means that the number of sub-warps is fixed to two whenever deformation is required. However, for SP0_3_SP1_1, warps that are deformed on the SP1 unit will always require two sub-warps, but warps that are deformed on the SP0 unit might require two, three, or four sub-warps. Intuitively, the performance overhead of warp deformation increases when more sub-warps are needed.
(a) SP0_2_SP1_0 (b) SP0_3_SP1_0
(c) SP0_3_SP1_1 (d) SP0_3_SP1_2
Figure 4.12: Performance overhead of asymmetric fault maps.
The performance of the SP0_3_SP1_2 fault map is shown in Figure 4.12d. All benchmarks report almost the same performance overhead as the SP0_3_SP1_1 fault map. The difference between SP0_3_SP1_1 and SP0_3_SP1_2 is in the number of faulty SIMT lanes per cluster for the SP1 unit. The only case where SP0_3_SP1_1 performs better is when the warp issued to the SP1 unit has a maximum of three active threads per cluster (i.e., one cluster has exactly three active threads and all remaining clusters have three or fewer active threads). In this case, no deformation is required for SP0_3_SP1_1 while SP0_3_SP1_2 splits the warp into two sub-warps. Our empirical results indicate that the latter case happens rather infrequently; instead, the maximum number of active threads mapped to a cluster within an SP unit is either two or four most of the time. For a maximum of two active threads, both fault maps require no deformation at the SP1 unit, and for a maximum of four active threads, both fault maps require two sub-warps.
4.6.5 Intra-cluster shuffling vs. dynamic warp deformation
In this subsection, we discuss how frequently each of the proposed techniques is activated. Figure 4.13 shows the percentage of time intra-cluster thread shuffling is sufficient versus the percentage of time warp deformation is required for the six fault maps evaluated. In addition, the figure reports the percentage of time inter-SP warp shuffling helps to avoid potential deformation by issuing the warp to the appropriate SP unit. Obviously, for symmetric fault maps, the latter percentage is zero.
For the common and worst case fault maps (i.e. symmetric fault maps), warp deformation is activated more than 80% of the time when fault tolerance is needed. For asymmetric fault maps, the combination of intra-cluster thread shuffling and inter-SP warp shuffling reduces the percentage of time during which deformation is activated. For example, when the SP1 unit is completely healthy (i.e. SP0_2_SP1_0 and SP0_3_SP1_0), shuffling becomes sufficient for more than 50% of the time.
4.6.6 Area and power overheads
To evaluate the area and power overheads of the proposed techniques, we
implemented the reliability-aware scheduler (RASc) circuitry, the reliability-aware split (RASp) unit, and the shuffling and reshuffling MUXes in RTL. We used Synopsys Design
Compiler and NCSU PDK 45nm library [4] to synthesize the RTL implementation.
Figure 4.13: Contributions of fault tolerance techniques (percentage of time covered by intra-cluster thread shuffling, dynamic warp deformation, and inter-SP warp shuffling, for each fault map).
The total dynamic power of all additional components is 0.0334uW. This represents the power consumed every time these components are activated. We measured the dynamic power consumed by the SIMT lanes in the GPU using GPUWattch [52]. The results show that the power overhead of the additional components is less than 0.9%. The area consumed by the additional components is estimated at 0.031mm^2, while the total area of the SIMT lanes is 32mm^2. Thus, the area overhead of the proposed techniques is 0.01%.
4.7 Summary and Conclusions
In this chapter we presented the Warped-Shield framework which consists of
three techniques to tolerate hard faults in the SIMT lanes of GPUs. Intra-cluster thread
shuffling rearranges the threads within a cluster to avoid mapping any active thread to a
faulty SIMT lane. When intra-cluster thread shuffling alone is not sufficient, dynamic
warp deformation is used to split the original warp into multiple sub-warps in order to
distribute the original warp's active threads among the sub-warps. To minimize the
performance overhead of dynamic warp deformation, we introduced inter-SP warp shuffling, such that warps are issued, whenever possible, to the SP unit that incurs less performance overhead.
The Warped-Shield framework was evaluated using various fault maps. In the worst case, 75% of the GPU SIMT lanes are faulty, but the presented fault tolerance techniques guarantee forward progress with an average performance overhead of 57%. When 50% of the GPU SIMT lanes are faulty, the average performance overhead drops to 14%. Warped-Shield incurs minimal area and power overheads of 0.01% and 0.9%, respectively.
Chapter 5
Fault Detection and Correction in GPUs
The Warped-Shield framework described in the previous chapter tolerates hard
faults, but makes the assumption that faults are detected by an orthogonal mechanism. In
this chapter we broaden the applicability and scope of this work through an enhanced
framework called Warped-Redundant Execution (Warped-RE). Warped-RE is a unified
framework capable of detecting and correcting transient and non-transient in-field faults
in the SIMT lanes of GPUs. During fault-free execution, Warped-RE runs in dual
modular redundant (DMR) mode and guarantees every thread computation within every
warp instruction to be verified which provides 100% fault coverage. When a fault is
detected in a warp instruction, the instruction is re-executed in triple modular redundant
(TMR) mode in order to correct the fault and identify SIMT lanes with potential non-
transient faults.
Warped-RE evaluation shows 8.4% and 29% average performance overhead
during DMR and TMR modes, respectively. In addition, Warped-RE reduces the power
overhead of traditional DMR and TMR operations by 42% and 40%, respectively.
5.1 Introduction
As mentioned in Section 4.1, graphics processing units (GPUs) are now the
dominant computing fabric within many supercomputers. Hence, many mission critical
applications run on GPUs thereby increasing the demands on reliability and
computational correctness of GPUs. Previous research work presented hardware and
software approaches to detect computational faults in GPU execution lanes [32] [64]
[47]. The Warped-Shield framework presented in Chapter 4 focused on tolerating non-
transient faults in the execution lanes to guarantee forward progress with minimal
performance overhead [33]. However, that approach assumes faults are already detected and the faulty locations are identified a priori, so that fault tolerance can be activated to avoid them. In this chapter we present Warped-RE, a unified framework to
detect and correct computational faults in GPU execution lanes. In particular, Warped-RE
is capable of correcting a single fault and detecting up to two faults in every cluster of
three execution lanes. As the name indicates, the framework is primarily based on the
very well-known redundant execution paradigm.
The most straightforward approach to achieve fault detection and correction is
triple modular redundancy (TMR) [56] [100]. TMR is a pure hardware solution that
executes the same instruction three times on disjoint execution lanes. The outputs of the
three lanes are fed to a voter circuit capable of choosing the majority among them which
essentially corrects a single fault. In addition, TMR compares every two outputs together
using XOR gates in order to identify potential faulty lanes (i.e. fault isolation). While
TMR is effective during the fault-free execution, considerable area/performance/power
overheads are expended to execute every instruction three times even before the
occurrence of the first fault. To address this drawback in Warped-RE, we use dual
modular redundancy (DMR) during fault-free execution and activate TMR execution
just-in-time (JIT) after a fault has been detected to provide fault correction capability.
Clearly, DMR causes less area/performance/power overheads while continuing to provide
100% detection for transient and non-transient faults.
Even with such an improvement, considerable overheads are incurred if DMR and TMR are implemented naively. For example, one can choose to have two extra
redundant execution lanes for each original lane. During DMR operation one of the
redundant lanes is used to verify the computation on the original lane and during TMR
operation both redundant lanes are used to correct a potential erroneous computation.
Such a solution causes 200% area overhead, 100% power overhead during DMR
operation and 200% power overhead during TMR operation. Alternatively, one can use
temporal redundancy and replicate a computation in time twice during DMR operation or thrice during TMR operation, using a different execution lane for every replica. Executing the replica(s) on disjoint lanes is important to guarantee the detection and correction of
non-transient faults.
Our unified framework aims at providing TMR-like detection and correction capabilities with only a small fraction of the associated overheads. To achieve that, Warped-RE exploits two critical observations about the behavior of GPU applications. The first observation is that neighboring execution lanes in a GPU tend to operate on the same
input operands. As described in Section 4.1 and subsection 4.3.1, the execution lanes in
GPUs are called SIMT (single instruction multiple thread) lanes. Dozens of SIMT lanes
are clustered together into streaming multiprocessors (SMs) within the GPU. The
software execution model of GPUs groups several tens of threads into warps and all
threads in a warp execute the same instruction but on different input operands. Since all
SIMT lanes associated with a warp execute the same instruction, whenever the input
operands of neighboring lanes are the same the outputs will be the same, unless one of
the neighboring lanes encounters a fault. Thus, when neighboring lanes exhibit inherent
redundancy they provide DMR or even TMR for minimal additional cost.
The second observation exploited in Warped-RE is the underutilization of the
SIMT lanes (i.e. some lanes are idle) for some warp instructions. This observation was
presented in [47] [52]. Jeon and Annavaram [47] exploited the idle SIMT lanes to
redundantly execute some of the threads within the warp and achieve intra-warp DMR
execution, which can only detect faults, but not correct them. We exploit the same
property to forcibly create new opportunities for DMR and TMR executions in order to
both detect and correct faults.
In our fault model we assume that one execution lane within a cluster of three
adjacent SIMT lanes can be faulty at any given time. Hence, Warped-RE can correct one
fault and detect up to two faults per each cluster of three SIMT lanes. Based on this
assumption, our results show that leveraging the inherent redundancy across threads
within the same warp guarantees low cost DMR execution for 38% of the warps and low
cost TMR execution for 36% of the warps. Exploiting the idle SIMT lanes to force
redundancy allows an extra 10% of warps to be redundantly executed during DMR and TMR modes. Thus, nearly 48% of all the warps can be opportunistically checked with
DMR to detect a fault. Once a fault is detected, 46% of the warps can be opportunistically
corrected with TMR. But to provide 100% fault detection and correction, we deploy the
dynamic warp deformation approach presented in subsection 4.4.2. Dynamic warp
deformation employs temporal redundancy, where a single warp is split into multiple sub-warps to create more idle-lane opportunities, which are then exploited for redundant execution, albeit with some performance impact.
5.2 Related Work
The Warped-RE framework presented in this chapter is closely related to the
Warped-Shield framework presented in the previous chapter; hence, there is some
overlap in the related work associated with both frameworks. Parts of the overlapped
related work are repeated here for clarity and completeness and to put the Warped-RE
framework into context.
Improving the reliability of computing elements using DMR and TMR techniques has been widely studied [56] [100] [44] [77] [41] [59] [32] [64] [66] [47] [33]. On the CPU and embedded systems front, the authors in [77] and [41] proposed to run two copies of the same thread to detect faults. One of the threads runs ahead of the other and
their outputs are compared before committing the results of the trailing thread to memory.
The leading and trailing threads either run on the same processor (i.e. temporal DMR) or
on different processors (i.e. spatial DMR). Recently, the authors in [66] proposed to run
in DMR mode only during samples of the execution time. This sampling DMR approach
only detects permanent faults, and the fault detection latency can be very high, which requires a sophisticated recovery mechanism. Alternatively, our unified framework
detects and corrects transient and non-transient faults instantaneously. The work
presented in [56] and [100] discussed how TMR improves the reliability of computer
systems. Similarly, a TMR-based, highly reliable fault-tolerant microprocessor for aircraft is described in [44].
On the GPU front, software-based and hardware-based DMR techniques have been proposed to improve GPU reliability [32] [64] [47] [33]. At the software level, the authors in [32] proposed to run the code twice in order to detect faults. Alternatively,
the authors in [64] proposed to compare the output of the running program with a golden
output to verify correctness. While these techniques are able to provide high fault
coverage, they are agnostic to the inherent redundancy opportunities that exist in most of
the GPU workloads. Further, these software techniques neither detect faults as soon as they occur nor identify the locations of the faults. Finally, running multiple versions
of the same code as in [32] causes high power overhead (100% in case of DMR and
200% in case of TMR). On the other hand, Warped-RE leverages concurrent spatial
redundant execution to provide instantaneous fault detection and correction plus the
ability to isolate the faulty component. In addition, Warped-RE reduces the power and
performance overheads of traditional DMR and TMR executions by taking advantage of
the inherent redundancy across threads of the same warp and the underutilization of the
SIMT lanes.
At the hardware level, the authors in [47] proposed to leverage the
underutilization of the SIMT lanes to enable intra-warp DMR. When the underutilization
opportunities are not available, the entire warp is re-executed to detect faults. The
hardware techniques proposed in [47] provide fault detection only and assume fault
correction is in place. On the other hand, the Warped-Shield framework [33] presented
in Chapter 4 tolerates hard faults in the SIMT lanes of GPUs and assumes fault detection
is in place. Compared to all this previous work, Warped-RE exploits the inherent
redundancy in the GPU workloads and the underutilization of the SIMT lanes to enable
opportunistic fault detection (DMR mode) and correction (TMR mode) at low cost.
5.3 Background
The baseline GPU architecture used to describe the implementation of the
Warped-RE framework is an Nvidia Fermi-like architecture [67]. This architecture was
described in detail in subsection 4.3.1 of the previous chapter. As a quick summary of
relevant features, in the Fermi architecture, every streaming multiprocessor (SM)
contains 32 SIMT lanes divided into two streaming processor (SP) units. Each SP unit
has 16 SIMT lanes and can execute 32-thread warps using double clocking: one half of the warp is completed in the first half of the SM cycle and the other half is completed in the second half of the SM cycle.
Warped-RE achieves fault detection and correction by comparing the outputs of
the SIMT lanes. Figure 5.1 shows an example of a TMR voter and comparator
implementation. Three comparators are responsible for comparing the outputs of the three
SIMT lanes by considering them in pairs as follows: (L0,L1), (L0,L2), and (L1,L2). There are three possible scenarios. First, when all pairs match, no fault is detected and the three lanes are considered fault-free (i.e. F0 = 0, F1 = 0, F2 = 0). Second, when no pair matches, at least two lanes are faulty and reliable execution can no longer be guaranteed if the faults are non-transient (i.e. F0 = 1, F1 = 1, F2 = 1). Third, when only one pair matches, the lane which is not part of the matching pair is considered faulty and execution resumes using the voter output. For example, when the outputs of L0 and L1 are the only matching pair, L2 is considered faulty (i.e. F0 = 0, F1 = 0, F2 = 1).
5.4 Opportunistic Redundant Execution
Warped-RE relies on redundant execution to achieve fault detection and
correction in the SIMT lanes. To provide full detection coverage during DMR mode, every thread computation must be replicated on a disjoint SIMT lane. Similarly, to provide full correction coverage during TMR mode, every thread computation must be
replicated on two disjoint SIMT lanes. However, redundant execution is expensive and
can cause high area/power/performance overheads when implemented naively, especially on the GPU platform. Instead, Warped-RE leverages the inherent redundancy between threads within the same warp and the underutilization of the SIMT lanes to achieve low cost opportunistic redundant execution.

Figure 5.1: TMR voter and comparator.
5.4.1 Inherent redundancy
As dictated by the SIMT execution model, all active threads within a warp
execute the same instruction at a given clock. Hence, when the source operands of two or
more active threads are identical, the outputs of the threads are expected to match when
the SIMT lanes that execute the threads are fault-free. In other words, the threads are
considered inherently redundant when their source operands are identical. There are
multiple reasons why GPU applications exhibit value similarity among the source
operands of threads within the same warp. First, some warps operate on constant
variables which have the same values across multiple threads. Second, all threads within
a warp may compute the same vector base address which is then accessed using a thread-
specific offset from the base address. Thus, base address computations of a vector access
exhibit strong similarity. Third, image and video processing applications that use GPUs exhibit a great amount of value locality in neighboring pixel data. Accordingly, Warped-
RE leverages the available inherent redundancy by comparing the results of inherently
redundant threads within the same warp. The quantification for such opportunities and
the details on how these opportunities are exploited in the DMR and TMR modes of
Warped-RE are discussed in Sections 5.5 and 5.6, respectively.
5.4.2 Underutilization of SIMT lanes
Although every warp can support up to 32 active threads executing in a lockstep
fashion, some warps have fewer than 32 concurrent active threads, which causes some SIMT lanes to be idle. This idleness is due to the branch divergence phenomenon and
insufficient parallelism as described in subsection 4.3.2. The intra-warp DMR mechanism
in prior work [47] identified the underutilization of the SIMT lanes and exploited it to dual-redundantly execute the active threads within the warp to achieve fault detection. In order to achieve 100% fault detection coverage by relying only on intra-warp DMR, the number of idle lanes must be equal to or greater than the number of active threads. Recall that GPUs rely on an active mask, a 32-bit vector mask that determines which threads within a warp are actually active during branch divergence or insufficient parallelism. For
instance, if only two threads are active within a warp then the active mask associated with
that warp has two bits set to one, while the remaining 30 bits are set to zero.
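As a small illustration of this requirement, the following sketch (a hypothetical helper, not part of the proposed hardware) checks from a 32-bit active mask whether lane idleness alone would allow every active thread to be duplicated (DMR) or, as discussed later, triplicated (TMR):

    def idleness_allows(active_mask, extra_copies):
        """Check whether idle lanes alone can host the redundant copies.

        active_mask: 32-bit integer; bit i set means thread i is active.
        extra_copies: 1 for DMR (one extra copy per thread) or 2 for TMR.
        """
        active = bin(active_mask & 0xFFFFFFFF).count("1")
        idle = 32 - active
        return idle >= extra_copies * active

    # A warp with only two active threads can be DMR-ed and even TMR-ed on idle lanes.
    assert idleness_allows(0b11, 1) and idleness_allows(0b11, 2)
    # A fully occupied warp cannot rely on idleness at all.
    assert not idleness_allows(0xFFFFFFFF, 1)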
Similar to previous studies [47] [52], we measured the underutilization of the SIMT lanes by reporting the percentage of time during which a warp has 1, 2, 3, ..., 30, 31, and 32 active threads for benchmarks from the GPGPU-Sim, Rodinia, and Parboil suites
and plotted the results in Figure 5.2. The same results were shown in Figure 4.2;
however, we present the results here again for the sake of completeness and to create a
self-sufficient chapter. The figure shows that some benchmarks suffer from insufficient
parallelism and end up having less than 32 active threads for all the warps, as is the case
with the gaussian and NN benchmarks in the figure. Alternatively, other benchmarks
suffer from branch divergence and end up having less than 32 simultaneous active
threads for some warps, as is the case with hotspot, heartwall, and WP. On the other
hand, some benchmarks fully utilize the SIMT lanes by having 32 active threads across the majority of the issued warps, as is the case with CP and cutcp. From the figure we infer that only three benchmarks (i.e. gaussian, NN, and nw) exhibit more than 50% lane
idleness which provides us the ability to achieve 100% DMR execution with minimal
penalty. To correct a fault with TMR, the number of idle lanes has to be at least twice the
number of active threads, which further reduces the number of opportunities for low cost
TMR when relying purely on lane idleness.
Thus, while underutilization can force redundancy, it alone cannot provide
sufficient opportunities for both fault detection and correction. As such, Warped-RE
combines inherent redundancy with lane idleness to vastly increase the opportunities for
low cost fault detection and correction. By exploiting the idle SIMT lanes on top of
inherent redundancy, non-inherently redundant threads can be redundantly executed such
that more warps can be DMR-ed and TMR-ed at low cost.
The quantification of idle lanes opportunities and the details on how these
opportunities are exploited in the DMR and TMR modes of Warped-RE are discussed in
Sections 5.5 and 5.6, respectively.
Figure 5.2: Underutilization of SIMT lanes (percentage of time during which a warp has 1 to 32 active threads, per benchmark).
5.4.3 Dynamic warp deformation
The opportunistic redundant execution analysis, described in Sections 5.5 and 5.6,
shows that about 50% of the warps can be opportunistically DMR-ed and TMR-ed by a
combination of inherent and forced redundancy. In order to provide fault detection and
correction for the remaining warps, Warped-RE deploys the dynamic warp deformation
approach presented in subsection 4.4.2. Dynamic warp deformation splits a warp into
multiple sub-warps with fewer active threads which artificially creates more idle lanes
and allows threads which are not covered by the opportunistic redundant execution to be
DMR-ed or TMR-ed. Dynamic warp deformation incurs noticeable performance degradation because more cycles are needed until all sub-warps complete their execution. Thus, we minimize warp deformation by relying on opportunistic redundancy first and activating deformation only when necessary.
Next, we show how the opportunistic redundant execution, which exploits
inherent value redundancy and lane idleness, and the dynamic warp deformation are used
to achieve 100% fault detection during DMR mode and subsequently 100% fault
correction during TMR mode.
5.5 Fault Detection: Opportunistic DMR Mode
Warped-RE relies on DMR mode initially to just detect faults. To mitigate the
high overhead of DMR execution, the framework strives to achieve opportunistic DMR
execution using inherent redundancy that exploits value similarity and forced redundancy
using idle lanes. When opportunistic DMR is not sufficient to cover all threads within a
warp, dynamic warp deformation is activated.
5.5.1 Opportunistic DMR granularity
In order to detect the inherent redundancy between threads within a warp, the
source operands of the threads must be compared together. One extreme implementation
is to compare the source operands across all threads within the warp. We refer to this
implementation as warp-level inherent redundancy. Warp-level inherent redundancy
requires comparing the input of each SIMT lane with the inputs of all other lanes, which
has a significant hardware complexity. The other extreme implementation which requires
much less complexity is to divide the warp into clusters of size two and only compare the
source operands of the threads within the same cluster together. The latter
implementation is called cluster-level inherent redundancy.
In the warp-level inherent redundancy, a warp is considered inherently DMR-ed if
for every active thread in the warp there is at least one more active thread with matching
source operands regardless of the physical location of the SIMT lanes to which the
matching threads are assigned. In the cluster-level inherent redundancy, a warp is
considered inherently DMR-ed when the active threads assigned to every cluster have
matching source operands. According to the description of the two implementations, any
warp that is inherently DMR-ed in the cluster-level implementation will also be
inherently DMR-ed in the warp-level implementation, but not vice versa.
To better understand the two implementations, we provide four examples in
Figure 5.3. In these examples, for simplicity of illustration, it is assumed that every warp
consists of six threads assigned to six lanes. Note that there are 16 double-clocked SIMT
lanes per SP unit in our actual implementation. When multiple threads are inherently
redundant, their corresponding SIMT lanes are highlighted with the same shade. In Figure 5.3a, the threads assigned to L0, L3, L4, and L5 are inherently redundant, and the threads assigned to L1 and L2 are also inherently redundant. According to warp-level inherent redundancy, this warp is inherently DMR-ed. When the warp is divided into clusters of size two, the warp does not qualify as inherently DMR-ed because the threads assigned to the (L0,L1) and (L2,L3) clusters do not have matching source operands. On the other hand, the warp in Figure 5.3b is considered inherently DMR-ed under both the warp-level and the cluster-level implementations.
There are scenarios where inherent redundancy, even at the warp-level, is insufficient to provide 100% fault detection. For example, the warp shown in Figure 5.3c is not considered inherently DMR-ed because the threads assigned to L2 and L5 have distinctive source operands. Notice that the figure shows the active mask bits on top of the SIMT lanes and that L0 and L1 are idle because their active mask bits are set to zero. When inherent redundancy is not sufficient, these idle lanes can be exploited to replicate the non-inherently redundant threads and create forced redundancy. Similar to inherent redundancy, the idle lanes can be exploited at the warp-level by allowing an active thread within a warp to be replicated on any idle lane within the same warp. As a result, the warp in Figure 5.3c may be DMR-ed using a combination of inherent and forced redundancy at the warp-level. This is true because L3 and L4 are inherently redundant, and the idle SIMT lanes are exploited at the warp-level by replicating the threads assigned to L2 and L5 on L0 and L1, respectively. Alternatively, the idle SIMT lanes can be exploited only at the cluster-level, by allowing an active thread within a cluster to be replicated only on idle lanes within the same cluster. Using inherent and forced redundancy at the cluster-level, the warp in Figure 5.3c cannot be DMR-ed. This is true because the inherent redundancy between L3 and L4 cannot be exploited, since they are in different clusters, and the idle lanes cannot be exploited, since they are not in the same clusters as the primary computation lanes, L2 and L5.

Figure 5.3d shows a case where exploiting inherent redundancy and idle SIMT lanes at the cluster-level is sufficient. L4 and L5 are within the same cluster and are inherently redundant. L1 and L2 have distinctive source operands, but they can be replicated on the idle lanes within their own clusters, namely L0 and L3, respectively.

Figure 5.3: DMR inherent redundancy.
5.5.2 Quantifying opportunistic DMR
To quantify the opportunistic DMR execution, we measure the percentage of warp
instructions that leverage inherent redundancy and/or idle SIMT lanes. The experimental
setup for these measurements is based on GPGPU-Sim v3.02 [10] configured using the
Nvidia GTX480 (i.e. Fermi) configuration file included in the GPGPU-Sim package. We
run 22 benchmarks from GPGPU-Sim [10], Parboil [98], and Rodinia [23] benchmark
suites. For every warp instruction that is issued to a SP unit, we compare the source
operands of the threads to detect inherent redundancy opportunities and we investigate
the active mask of the warp instruction to detect idle SIMT lanes opportunities. A warp
instruction is considered opportunistically DMR-ed when all threads within the warp
instruction can leverage inherent redundancy or idle SIMT lanes to become DMR-ed.
Figure 5.4 shows the percentage of opportunistically DMR-ed warps for warp-
level and cluster-level implementations; note that cluster-level implementation uses two
adjacent SIMT lanes as discussed in the previous subsection. On average, 63% and 48%
of the warps are opportunistically DMR-ed with warp-level and cluster-level
implementations, respectively.
The warp-level implementation captures all the possible warps which can be opportunistically DMR-ed. However, this implementation is expensive and relatively complex because it requires the inputs of every SIMT lane to be compared against the inputs of all other SIMT lanes to detect inherent redundancy across the warp. At the same time, the warp-level implementation requires SP-unit-wide MUX rerouting logic to leverage idle SIMT lanes that may be present anywhere within the warp. On the other hand, the cluster-level implementation captures 72% of the opportunistically DMR-ed warps (i.e. 48/63) at a much lower design cost (micro-architectural implementation details are presented shortly). For instance, only the inputs of every two adjacent lanes need to be compared, and simple rerouting logic across every two adjacent SIMT lanes is sufficient to capture cluster-level opportunistic DMR.

Figure 5.4: Opportunistic DMR (percentage of opportunistically DMR-ed warps per benchmark, for the cluster-level and warp-level implementations).
Hence, we choose to leverage the opportunistic DMR execution in Warped-RE by
exploiting inherent redundancy and idle SIMT lanes at the cluster-level. Accordingly,
during DMR mode the active threads within a warp are logically divided into clusters of size two: the 1st DMR cluster consists of the threads assigned to L0 and L1, the 2nd cluster consists of the threads assigned to L2 and L3, and so on.
For every DMR cluster, there are four possible scenarios: first, when both threads
within a cluster are active and all their source operands are matching, the threads are
inherently redundant and their outputs are compared for fault detection (i.e. opportunistic
DMR using inherent redundancy). Second, if only one thread in the cluster is active due
to underutilization then the thread is replicated on the idle lane (i.e. opportunistic DMR
using idle SIMT lanes). Third, if both threads are active and their source operands are not
matching then dynamic warp deformation is activated to achieve DMR execution as will
be described next. Fourth, if both SIMT lanes within the cluster are idle then nothing
happens.
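To make the per-cluster decision concrete, the following sketch classifies a two-lane DMR cluster into one of the four scenarios above. It is illustrative only; names such as ClusterAction and classify_dmr_cluster are not taken from the design.

    from enum import Enum

    class ClusterAction(Enum):
        INHERENT_DMR = 1        # both threads active with matching operands: compare outputs
        FORCED_DMR = 2          # one thread active: replicate it on the idle lane
        NEEDS_DEFORMATION = 3   # both active with differing operands: split the warp
        IDLE = 4                # no active threads: nothing to do

    def classify_dmr_cluster(active0, active1, operands_match):
        """Classify a DMR cluster of two SIMT lanes.

        active0/active1: active mask bits of the two lanes.
        operands_match: True when both threads have identical source operands.
        """
        if active0 and active1:
            return ClusterAction.INHERENT_DMR if operands_match else ClusterAction.NEEDS_DEFORMATION
        if active0 or active1:
            return ClusterAction.FORCED_DMR
        return ClusterAction.IDLE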
5.5.3 DMR execution using dynamic warp deformation
Cluster-level opportunistic DMR execution covers 48% of the warp instructions
during DMR mode. To cover the remaining warp instructions, we used a modified
version of the dynamic warp deformation approach presented in subsection 4.4.2.
Dynamic warp deformation splits a warp into multiple sub-warps with fewer active
threads which artificially creates more idle SIMT lanes opportunities and allows the
threads which are not covered by the opportunistic DMR execution to be replicated and
verified for fault detection. Unlike the opportunistic DMR execution, dynamic warp
deformation is expected to cause performance degradation as more cycles are needed
until all the created sub-warps complete their execution.
When running in the DMR mode, the only case where warp deformation is
needed is when two active threads within the same cluster are non-inherently redundant.
Figure 5.5 shows a warp of eight threads logically divided into four clusters as indicated
by the dashed-line borders. Cluster1, cluster2, and cluster3 are covered by opportunistic DMR execution. The threads assigned to L2 and L3 are inherently redundant. The thread assigned to L4 has distinctive source operands, but it can exploit the idle L5 to force redundancy. L6 and L7 have no active threads assigned to them.

Cluster0 has two active threads, assigned to L0 and L1, and they are non-inherently redundant. In order to guarantee 100% fault detection, the threads assigned to L0 and L1 are split across two sub-warps as shown in the figure. In sub-warp0, the thread assigned to L1 is replicated on the idle L0 that was created by the warp deformation. In sub-warp1, the thread assigned to L0 is replicated on the idle L1 that was created by the warp deformation. The forced redundant execution on idle lanes is shown by the curved arrows in the figure.

Notice that the need for deformation is determined by the worst case cluster. In the example given in Figure 5.5, the opportunistic DMR execution covers the active threads in cluster1, cluster2, and cluster3. However, cluster0 requires deformation in order to allow each of its active threads to be DMR-ed, which causes the entire warp to be deformed and issued over two cycles. All active threads which are covered by opportunistic DMR execution are issued as part of sub-warp0 and do not need to be split across the sub-warps.

Figure 5.5: Warp deformation in DMR mode.
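The sub-warp construction itself can be sketched as follows. This is a simplified behavioral model that only manipulates active masks; the function name deform_warp_dmr and the choice of which thread stays in sub-warp 0 follow Figure 5.5 rather than the actual issue logic.

    def deform_warp_dmr(active_mask, cluster_needs_split):
        """Split a warp into per-sub-warp active masks for DMR execution.

        active_mask: list of 0/1 bits, one per SIMT lane (lanes 2c and 2c+1 form cluster c).
        cluster_needs_split: one boolean per two-lane cluster, True when the cluster
        holds two active, non-inherently redundant threads.
        Returns a list of sub-warp active masks (a single mask when no split is needed).
        """
        if not any(cluster_needs_split):
            return [active_mask[:]]          # opportunistic DMR covers the whole warp
        sub0, sub1 = active_mask[:], [0] * len(active_mask)
        for c, split in enumerate(cluster_needs_split):
            if split:
                even = 2 * c
                # As in Figure 5.5: the odd lane's thread stays in sub-warp 0 and the
                # even lane's thread moves to sub-warp 1; each is then replicated on
                # the idle partner lane created by the split.
                sub0[even] = 0
                sub1[even] = 1
        return [sub0, sub1]

    # Example of Figure 5.5: cluster 0 must be split, clusters 1-3 are already covered.
    print(deform_warp_dmr([1, 1, 1, 1, 1, 0, 0, 0], [True, False, False, False]))
    # -> [[0, 1, 1, 1, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 0]]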
5.6 Fault Detection and Correction: Opportunistic TMR Mode
In Warped-RE, DMR is the default operational mode as long as no faults are
detected in the SIMT lanes. As we stated in our fault model assumption, only a single
fault can be triggered for every cluster at any point of time. Thus, running in DMR mode
guarantees the detection of any transient or non-transient fault in any cluster. However,
when a fault is detected it cannot be corrected in DMR mode. In order to correct the fault,
the faulty warp is re-executed in TMR mode. In addition to fault correction, TMR has
fault isolation capability which is used to check if the fault still exists during the re-
execution. If the fault is not detected during TMR re-execution then the original fault is
considered transient and Warped-RE switches back to run in DMR mode. On the other
hand, if the fault is detected again during the re-execution then the fault is considered
non-transient and Warped-RE switches to run in TMR mode from then on. Note that a
variety of mode switching options could be easily implemented to tackle intermittent
faults if the faults appear only for a short time interval. But in our implementation, we
assume the second occurrence of a fault, when executed in the TMR mode, is an
indication of permanent fault and thus the system switches to TMR mode from then on.
Similar to DMR mode, Warped-RE leverages inherent and forced redundancy
across threads to achieve low cost opportunistic TMR execution whenever possible and
relies on dynamic warp deformation only when necessary. In order to support DMR and
TMR operational modes, every warp instruction issued to a streaming processor (SP) unit
is augmented with a mode bit (D_T). During fault-free operation, the D_T bit is set to
zero for all warp instructions to indicate that they should run in DMR mode. When a
warp instruction needs to be re-executed because of a fault, the D_T bit is set to one for
that instruction to indicate that it should run in TMR mode. Also, once a non-transient
fault is detected, the D_T bit is set to one for all warp instructions from then on to guarantee functional correctness through TMR.
5.6.1 Opportunistic TMR granularity
As in DMR mode, exploiting the inherent redundancy in TMR can be either
implemented at the warp-level by comparing the source operands across all threads in the
warp or at the cluster-level by limiting the comparison to threads within the same cluster.
Notice that in TMR mode, a cluster is defined as three consecutive SIMT lanes to allow
TMR execution. In the warp-level inherent redundancy implementation, a warp is
considered inherently TMR-ed if for every active thread in the warp there are at least two
more active threads with matching source operands regardless of the physical location of
the SIMT lanes to which the matching threads are assigned. In the cluster-level inherent
redundancy implementation, a warp is considered inherently TMR-ed when the active
threads assigned to every cluster of three adjacent SIMT lanes have matching source
operands.
To better understand the two implementations while running in TMR mode, we
provide four examples in Figure 5.6. The warp in Figure 5.6a can be inherently TMR-ed
assuming the warp-level implementation is used because every thread has two inherently
redundant threads within the same warp. However, when the cluster-level implementation is used, as indicated by the dashed-line borders, the warp cannot be inherently TMR-ed because the threads within each cluster do not have matching operands. On the other hand, the warp in Figure 5.6b can be inherently TMR-ed with both implementations.

The warp in Figure 5.6c cannot be inherently TMR-ed even with the warp-level implementation, because the thread assigned to L5 has unique source operands. Notice that the figure shows the active mask bits on top of the SIMT lanes and that L0 and L1 are idle because their active mask bits are zeros. When inherent redundancy is insufficient, the idle lanes can be exploited to replicate the non-inherently redundant threads to achieve TMR. If idle lanes can be exploited at the warp-level, then the thread assigned to L5 can be replicated twice, on the idle L0 and L1. So, the warp in the figure becomes opportunistically TMR-ed using a combination of inherent and forced redundancy at the warp-level. Note that when the cluster-level restrictions are placed, this warp cannot be opportunistically TMR-ed, since the redundancy crosses cluster boundaries.

Figure 5.6d shows a case where exploiting inherent redundancy and idle SIMT lanes at the cluster-level is sufficient to provide TMR for the entire warp. This is true because L3, L4, and L5 are within the same TMR cluster and are inherently redundant. Also, L2 has distinctive source operands, but it can be replicated twice on the idle lanes within its own cluster, namely L0 and L1.

Figure 5.6: TMR inherent redundancy.
5.6.2 Quantifying opportunistic TMR
We quantified the opportunistic TMR execution by measuring the percentage of
warp instructions that leverage inherent redundancy and/or idle SIMT lanes to become
opportunistically TMR-ed. The experimental setup of these measurements is identical to
the one used to quantify the opportunistic DMR execution and was described in
subsection 5.5.2. However, here a warp instruction is considered opportunistically TMR-ed when all threads within the warp instruction can leverage inherent redundancy or idle
SIMT lanes to become TMR-ed. Figure 5.7 shows that, on average, 56% and 47% of the
warps can be opportunistically TMR-ed with warp-level and cluster-level
implementations, respectively. These warps achieve fault detection and correction with
minimal performance impact. Although the warp-level implementation captures all
possible warps which can be opportunistically TMR-ed, its hardware complexity is
higher. Instead, the cluster-level implementation captures 86% of the opportunities (i.e.
47/56) with much less complexity.
Similar to the DMR mode, we choose to leverage the opportunistic TMR
execution by exploiting inherent redundancy and idle SIMT lanes at the cluster-level in
Warped-RE. Hence, during the TMR mode the active threads within a warp are logically
divided into clusters of size three: the 1st TMR cluster consists of the threads assigned to (L0,L1,L2), the 2nd cluster consists of the threads assigned to (L3,L4,L5), and so on. In the Fermi architecture [67], the GPU architecture used in our implementation and evaluations, the number of SIMT lanes in each SP unit is 16, which is not divisible by three. As a result, the last four lanes (i.e. L12, L13, L14, and L15) are handled as a special TMR cluster with four lanes.

Figure 5.7: Opportunistic TMR (percentage of opportunistically TMR-ed warps per benchmark, for the cluster-level and warp-level implementations).
For the regular TMR clusters (i.e. cluster size = 3), there are four scenarios where the opportunistic TMR execution is sufficient to cover all active threads within the cluster: first, if all three threads within the cluster are active and they are inherently redundant. Second, if there is only one active thread within the cluster, then it can be replicated twice on the two idle lanes. Third, if there are only two active threads within the cluster and they are inherently redundant, then the threads' computation is replicated on the idle lane to achieve TMR execution. Fourth, if the three lanes are idle, then nothing needs to be done. For all other scenarios, warp deformation is activated to achieve full TMR execution.
For the special TMR cluster (i.e. cluster size = 4), there are also four scenarios where the opportunistic redundant execution is sufficient to cover all active threads within the cluster: first, when there is a single active thread within the four lanes, it is replicated on two idle lanes within the cluster. Second, when there are only two active threads and they are inherently redundant, the threads' computation is replicated on one of the remaining idle lanes. Third, when there are three or four active threads and they are all inherently redundant. Fourth, when all four lanes are idle, nothing needs to be done. For all other scenarios, dynamic warp deformation is activated.
5.6.3 TMR execution using dynamic warp deformation
Opportunistic TMR execution covers 47% of the warp instructions during TMR
mode. To cover the remaining warp instructions, we rely on dynamic warp deformation
to create more idle SIMT lanes and allow the threads which are not covered by the
opportunistic TMR execution to be replicated twice for fault detection and correction.
During TMR mode, warp deformation is required when there is more than one active
thread assigned to a specific cluster and at least one of these active threads is non-
inherently redundant.
Figure 5.8 shows the three possible scenarios where warp deformation is required
for the regular TMR cluster. For simplicity, we assume that a warp consists of three
threads assigned to a single regular TMR cluster (i.e. three SIMT lanes). Figure 5.8a
shows the 1
st
scenario with two non-inherently redundant threads assigned to L
0
and L
1
.
In this case, the warp needs to be deformed into two sub-warps. One active thread is
assigned to each sub-warp and this active thread is replicated on the two idle lanes that
are created by the warp split, as indicated by the curved arrows.
Figure 5.8b shows the 2nd scenario, with three active threads assigned to the cluster and two of them inherently redundant. Again, the warp needs to be deformed into two sub-warps; the two inherently redundant threads are assigned to sub-warp0 and their computation is replicated on the third, idle lane so that they become TMR-ed. The non-inherently redundant thread is assigned to sub-warp1 and its computation is replicated on the two idle lanes available in the cluster. Figure 5.8c shows the 3rd scenario, with three non-inherently redundant active threads. In this case, three sub-warps are issued in three consecutive cycles to allow each of the active threads to be TMR-ed.

Figure 5.8: Warp deformation for regular TMR cluster.
Warp deformation for the special TMR cluster is handled the same way as for the regular TMR clusters. Figure 5.9 shows the scenarios where warp deformation with four or three sub-warps is required for the special TMR cluster. Figure 5.9a shows the worst case, with four active threads assigned to the four SIMT lanes, all of them non-inherently redundant. In this case, the warp is split into four sub-warps to allow each active thread to be TMR-ed. Notice that the leftmost three lanes (i.e. L12, L13, and L14) are always used to achieve TMR execution, and their outputs are fed to a TMR logic similar to the one described in Figure 5.1.
Figure 5.9b and Figure 5.9c show the cases where three sub-warps are needed:
first (i.e. Figure 5.9b), when three active threads are assigned to the four SIMT lanes and
they are non-inherently redundant. Second (i.e. Figure 5.9c), when four active threads are
assigned to the cluster and two of them are inherently redundant. For all other cases where warp deformation is required for the special TMR cluster, the number of sub-warps is only two.

Figure 5.9: Warp deformation for special TMR cluster.
As mentioned before, it is the worst case cluster that determines whether warp deformation is required and how many sub-warps are needed. For example, the states of the four regular TMR clusters in the SP unit of the Fermi architecture (i.e. L0-L11) could be similar to those of Figure 5.8a (two clusters), Figure 5.8b (one cluster), and Figure 5.8c (one cluster), while the state of the special TMR cluster (i.e. L12-L15) is similar to that of Figure 5.9a. In this case, warp deformation is required and the number of sub-warps is four, according to the worst case cluster (i.e. the special cluster), in order to guarantee TMR execution for every thread within the warp.
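The warp-level decision can therefore be modeled as taking the maximum sub-warp requirement over all clusters. The sketch below is an illustrative reading of the scenarios in Figures 5.8 and 5.9; the helper names are hypothetical.

    def subwarps_for_tmr_cluster(active_threads, largest_redundant_group):
        """Sub-warps one TMR cluster needs (1 means no deformation is required)."""
        if active_threads <= 1 or largest_redundant_group == active_threads:
            return 1                              # covered opportunistically
        if active_threads == 2:
            return 2                              # two distinct threads, one sub-warp each
        if largest_redundant_group >= 2:
            return active_threads - largest_redundant_group + 1   # e.g. Figures 5.8b and 5.9c
        return active_threads                     # all distinct: one sub-warp per thread

    def subwarps_for_warp(cluster_states):
        """The worst case cluster dictates the sub-warp count for the whole warp."""
        return max(subwarps_for_tmr_cluster(a, r) for a, r in cluster_states)

    # The example above: clusters as in Figures 5.8a (twice), 5.8b, 5.8c, and 5.9a.
    print(subwarps_for_warp([(2, 1), (2, 1), (3, 2), (3, 1), (4, 1)]))   # -> 4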
5.7 Warp Replay
After a fault is detected in DMR mode, the faulty warp is re-executed in TMR
mode in order to correct the fault and identify the fault type. The procedure for re-
executing a faulty warp is described in this section. Current GPUs do not seem to support
precise exceptions and branch prediction; hence, they lack traditional instruction roll-
back mechanisms used for handling precise exceptions and branch mispredictions in
CPUs. These mechanisms would have made the TMR re-execution of the faulty warp
instruction straightforward; once DMR detects a fault then we can simply roll-back and
re-execute the faulty warp in TMR mode.
To support warp re-execution in Warped-RE, a replay buffer is added to store the
source operands and opcodes of the warps currently executing in the SP unit. The buffer
is indexed using the warp-id and the warp program counter. Every time a warp
instruction gets issued during DMR mode (i.e. D_T = 0), the source operands of all
threads within the warp and the instruction opcode are stored in the replay buffer. After
all threads within the warp complete their DMR execution, the outputs of every DMR
cluster are compared to detect faults. If no fault is detected in any DMR cluster,
execution resumes normally using a combination of opportunistic DMR and warp
deformation modes.
On the other hand, when a fault is detected in any DMR cluster, the faulty warp
instruction is converted to a no-operation (NOP) by deactivating its write-back control
signal in order to prevent register file contamination. In addition, the warp issue logic is
directed to re-issue the faulty warp instruction by reading its source operands and opcode
control data from the replay buffer. When the faulty warp instruction is re-issued, its D_T
bit is set to one in order to enforce TMR execution for fault correction. Note that during
re-execution, the input operands are read from the buffer. That way, even if the register file is updated after the last read by a non-faulty instruction in the pipeline, we still re-execute the faulty instruction with the correct operands. Based on the output of the re-
executed instruction there are two possible scenarios: first, when the same fault does not
occur during TMR mode, the initial fault occurrence is considered transient and
execution resumes in DMR mode. Second, when the same fault is detected during TMR
mode, the fault is corrected by the TMR voter and execution resumes in TMR mode to
guarantee correction from then on. When the latter is the case, there is no need to store
instructions in the replay buffer from this point forward because fault detection and
correction are provided during TMR mode.
As multiple warps concurrently exist in a SP unit pipeline, SIMT lanes that are
experiencing non-transient faults might affect the computation of multiple warps
simultaneously. Hence, it is important to buffer the source operands and opcodes for all
warps that are in the SP unit pipeline. Typically the number of concurrent warps depends
on the depth of the pipeline. We measured the maximum SP unit occupancy (i.e. the
maximum number of warps concurrently running in a SP unit) for a Fermi-like
architecture and the results are presented in Figure 5.10. For 22 benchmarks, at most
eight warps concurrently exist in a SP unit pipeline. Hence, in the Warped-RE implementation, we augment every SP unit with an 8-entry replay buffer to allow the re-
execution of faulty warp instructions. In case the replay buffer is full and a new warp
instruction is ready to be issued to the SP unit, the issue logic is suspended until an entry
is freed from the buffer.
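A minimal software model of the replay buffer might look as follows. The entry layout, the eight-entry capacity, and the indexing by (warp-id, program counter) follow the description above, while the class and method names are ours and the release policy is an assumption.

    class ReplayBuffer:
        """Eight-entry buffer of in-flight warp instructions, indexed by (warp_id, pc)."""

        def __init__(self, capacity=8):
            self.capacity = capacity
            self.entries = {}              # (warp_id, pc) -> (opcode, source_operands)

        def can_issue(self):
            # The issue logic is suspended while the buffer is full.
            return len(self.entries) < self.capacity

        def record(self, warp_id, pc, opcode, source_operands):
            # Called for every warp instruction issued in DMR mode (D_T = 0).
            assert self.can_issue()
            self.entries[(warp_id, pc)] = (opcode, list(source_operands))

        def replay(self, warp_id, pc):
            # Returns the saved opcode and operands so the warp can be re-issued in TMR mode.
            return self.entries[(warp_id, pc)]

        def release(self, warp_id, pc):
            # Assumed policy: an entry is freed once its warp completes (fault-free or replayed).
            self.entries.pop((warp_id, pc), None)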
To better understand how warp replay affects the execution of warps, let us consider the case where there are two warps (e.g. Wx and Wy) operating in DMR mode concurrently in a fault-free SP unit. Assuming that Wy finishes DMR execution first and a fault is detected in one of its DMR clusters, Wy is converted to a NOP and replayed in TMR mode. Notice that Wx is still running in DMR mode. In case Wx ends up executing correctly, due to masking effects or because the original fault is transient, Wx commits its results and no re-execution is necessary for it. On the other hand, if a fault is triggered for Wx then it too is replayed in TMR mode.
Before the TMR version of Wy finishes execution, new warp instructions, such as Wz, might be ready to be issued to the SP unit. Since it is not yet clear whether the detected fault is transient or not, one can choose to execute Wz optimistically in DMR mode or conservatively in TMR mode. In Warped-RE, we choose to conservatively execute new warps in TMR mode until the type of the detected fault is identified. When the replayed version of Wy completes TMR execution, the fault is corrected and the fault type is identified. If the fault is non-transient, all new warps from then on are executed in TMR mode. Otherwise, the fault is considered transient and new warps return to executing in DMR mode.
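The resulting mode-selection policy can be summarized by a small state machine. This is an illustrative sketch of the policy described above, not the issue-logic implementation.

    class ModeController:
        """Tracks whether new warps should be issued in DMR or TMR mode."""

        def __init__(self):
            self.permanent_tmr = False     # set once a fault proves to be non-transient
            self.pending_replays = 0       # faulty warps awaiting TMR re-execution

        def mode_for_new_warp(self):
            # Conservative choice: run new warps in TMR while any replay is pending
            # or after a non-transient fault has been identified.
            if self.permanent_tmr or self.pending_replays > 0:
                return "TMR"
            return "DMR"

        def on_dmr_fault(self):
            # A DMR mismatch triggers a TMR replay of the faulty warp.
            self.pending_replays += 1

        def on_replay_done(self, fault_detected_again):
            self.pending_replays -= 1
            if fault_detected_again:
                self.permanent_tmr = True  # the fault is treated as non-transient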
5.8 Architectural Support
In this section we describe the hardware support required to implement the
techniques proposed in Warped-RE assuming a Fermi-like GPU baseline architecture
[67]. Three additional pipeline stages are added to the GPU pipeline as will be described
next.
Figure 5.10: SP unit occupancy (maximum number of warps concurrently running in a SP unit, per benchmark).
5.8.1 Inherent redundancy and deformation analysis stage
This stage is responsible for detecting the inherent redundancy between the issued
threads in the same warp by comparing their source operands. Based on the inherent
redundancy opportunities, the deformation control logic is used to decide if warp
deformation is required and how many sub-warps are needed in DMR and TMR modes.
Figure 5.11 shows how the new stage fits in the GPU pipeline. The design details are
described next.
5.8.1.1 Detecting inherent redundancy
To detect inherent redundancy opportunities, XOR-based comparators are needed to detect value similarity between the threads' source operands. GPU instructions generally have up to three source operands. Hence, three comparators are added between every two threads within the same cluster.

Figure 5.11: Inherent redundancy and deformation analysis stage.

Figure 5.12 shows a simple case with six
SIMT lanes logically divided into three clusters during DMR mode and two clusters during TMR mode. The comparator box between every two lanes contains three comparators (i.e. one for each source operand). For DMR execution, the only comparators that are relevant are the ones highlighted with stripes. The three striped comparator boxes are responsible for comparing the source operands of the active threads assigned to (L0,L1), (L2,L3), and (L4,L5). When all source operands are matching, the threads within a cluster are inherently redundant.
For TMR execution, all comparator boxes are relevant except the one that compares the threads assigned to L2 and L3. Each TMR cluster needs three comparator boxes to compare every pair of active threads together. This approach helps to capture cases where only two out of three threads are inherently redundant, which mitigates the performance overhead caused by the potential dynamic warp deformation. For example,
the (L0,L1,L2) TMR cluster in Figure 5.12 uses three comparator boxes to compare the threads assigned to (L0,L1), (L0,L2), and (L1,L2). As mentioned before, every SP unit in the Fermi architecture consists of 16 SIMT lanes. Hence, during DMR mode the SIMT lanes are evenly divided into eight clusters. Each DMR cluster requires one comparator box, so a total of eight comparator boxes are needed (i.e. 24 comparators). During TMR mode, the 16 lanes are divided into five TMR clusters. The first four TMR clusters are of size three and account for 12 lanes (i.e. L0-L11). Each one of these clusters requires three comparator boxes (i.e. 9 comparators). The fifth TMR cluster is of size four (i.e. L12-L15), and six comparator boxes (i.e. 18 comparators) are needed to compare the threads assigned to every pair of its lanes. As described in Figure 5.12, there is an overlap between the DMR and TMR comparators. Hence, to detect cluster-level inherent redundancy opportunities for DMR and TMR modes in a Fermi-like architecture, 53 comparators are needed for each SP unit.

Figure 5.12: Inherent redundancy detection logic.
The comparators' outputs are used to create two vectors for each warp: the DMR inherent redundancy (DIR) vector and the TMR inherent redundancy (TIR) vector. The
vectors indicate the active threads which are inherently redundant assuming DMR and
TMR modes, respectively. Figure 5.12 shows that each DMR cluster is associated with a
single bit in the DIR vector. The bit is driven by the comparator box responsible for
comparing the source operands of the two threads assigned to the cluster. When the
output of the comparator box is one, it indicates that the two active threads within the
cluster are inherently redundant. On the other hand, when only one active thread is
assigned to the cluster or two active threads are assigned but their source operands are not
matching, the output of the comparator box is set to zero.
Alternatively, every bit in the TIR vector is associated with one SIMT lane. The
TIR bit of every lane is driven by the OR-ing of the outputs of the two comparator boxes
responsible for comparing the corresponding lane with the other two lanes in the same
TMR cluster. This lane level TIR bit is necessary to capture cases where two out of three
threads are inherently redundant. The three inherent redundancy bits of a TMR cluster
can have five possible combinations. Table 5.1 lists the possible TIR vector combinations for the (L0,L1,L2) cluster and their respective descriptions.
In the Fermi-like GPU pipeline, warp instructions are kept in operand collector
units until all their source operands are fetched from the register file. Once all source
operands are ready, the instructions are sent to the issue queue where they wait to be
dispatched to the respective execution unit according to their type. The inherent
redundancy and deformation analysis stage is added between the operand collector units
and the issue queue as shown in Figure 5.11. This way after all source operands become
ready, they are compared at the cluster-level and DIR and TIR vectors are generated. The
warp active mask and the DIR and TIR vectors are then used to decide whether warp deformation is required and how many sub-warps are needed as will be discussed next.

TIR vector (L0 L1 L2)   Description
0 0 0                   Each thread has a unique set of source operands
0 1 1                   Threads assigned to L1 and L2 have matching source operands
1 0 1                   Threads assigned to L0 and L2 have matching source operands
1 1 0                   Threads assigned to L0 and L1 have matching source operands
1 1 1                   All threads have matching source operands
Table 5.1: TMR inherent redundancy (TIR) vector description.
5.8.1.2 Analyzing dynamic warp deformation
The implementation of warp deformation in the Warped-RE framework is
different from its implementation in the Warped-Shield framework, described in
subsection 4.5.2, for two reasons: first, the deformation control logic has to deal with
cluster size of two in DMR mode and cluster size of three and four in TMR mode.
Second, the deformation control logic must take into consideration the inherent
redundancy information provided by the DIR and TIR vectors so as to avoid unnecessary
splits when inherent redundancy can already provide fault detection and correction
capabilities. Deformation analysis is first done for each cluster independently and then a
unified decision is made for the entire warp. The decision states whether deformation is
required or not and determines the number of sub-warps according to the cluster which
requires the maximum number of sub-warps.
We first explain how warp deformation control is implemented for the DMR
mode and then expand the description to show how deformation control is implemented
for the TMR mode. For DMR mode, we consider the 1st DMR cluster in the SP unit (i.e. L0 and L1) and present the truth table of its control logic in Table 5.2. The control logic has three-bit input represented by the active mask bits of L0 and L1 (AM[0:1]) and the DIR vector bit of the cluster (DIR[0]). At the output side, the control logic has one-bit output to indicate if warp deformation is required or not according to this cluster (WD_DMR0) and two-bit output to indicate the number of sub-warps needed (N_DMR0_sub-warps). The only case where deformation is required is when there are two active threads (i.e. AM[0:1] = "11") and they are non-inherently redundant (i.e. DIR[0] = 0) as shown in the 4th row of Table 5.2. To achieve DMR in this case, two sub-warps are needed.
In the Fermi architecture, there are eight DMR clusters per SP unit and each
cluster has its own deformation control logic. For a specific warp instruction, if at least
one cluster requires deformation then the warp is deformed to guarantee 100% DMR.
Hence, the outputs of deformation control logic of all DMR clusters (i.e. WD_DMRi) are OR-ed together to generate a single bit flag (WD_DMR) to indicate whether the current warp needs deformation or not. If warp deformation is required during DMR mode (i.e. WD_DMR = 1), the number of sub-warps is two (i.e. N_DMR_sub-warps = 2) or zero otherwise.
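A minimal sketch of this per-cluster DMR deformation control (Table 5.2) and of the warp-level OR combination might look as follows; the function and signal names are ours and the code only illustrates the truth table, not the synthesized logic.

```python
# Sketch of the per-cluster DMR deformation control and the warp-level decision.

def dmr_cluster_deform(am_pair, dir_bit):
    """Returns (WD_DMRi, N_DMRi_sub-warps) for one two-lane DMR cluster.
    Deformation is needed only when both lanes are active and the threads
    are not inherently redundant (row 4 of Table 5.2)."""
    if am_pair == (1, 1) and dir_bit == 0:
        return 1, 2
    return 0, 0

def dmr_warp_deform(active, dir_vec):
    """OR the per-cluster flags; two sub-warps if any cluster needs a split."""
    wd = 0
    for c in range(8):
        wd_i, _ = dmr_cluster_deform((active[2 * c], active[2 * c + 1]), dir_vec[c])
        wd |= wd_i
    return wd, (2 if wd else 0)
```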
Warp deformation is a bit more complicated in TMR mode. Let us consider the 1st TMR cluster which consists of L0, L1 and L2. Table 5.3 shows the truth table for the deformation control logic with six-bit input represented by the active mask bits of L0, L1 and L2 (AM[0:2]) and the TMR inherent redundancy bits of the three lanes (TIR[0:2]). The outputs of the TMR deformation control logic are the same as in the DMR deformation control (i.e. WD_TMR0 and N_TMR0_sub-warps).
Row#   DIR[0]   AM[0:1]   WD_DMR0   N_DMR0_sub-warps
1      0        00        0         0
2      0        01        0         0
3      0        10        0         0
4      0        11        1         2
5      1        11        0         0
Table 5.2: Warp deformation control per DMR cluster.
Despite the fact that there are 64 possible input combinations, only 15 are
applicable. When there is at most one active thread assigned to the cluster (i.e. rows 1, 2,
3, and 6), no deformation is required as the active thread can be replicated on the two idle
lanes within the cluster. Another scenario where no deformation is required during TMR
is when all the active threads assigned to the cluster are inherently redundant (i.e. rows 5,
8, 10, and 15).
There are two scenarios where deformation is required and the number of sub-
warps is set to two. First, when there are only two active threads assigned to the cluster
and they are non-inherently redundant (i.e. rows 4, 7, and 9). This scenario was described
in Figure 5.8a. Second, when there are three active threads assigned to the cluster and two
of them are inherently redundant (i.e. rows 12, 13, and 14). The latter scenario was
described in Figure 5.8b. There is only one scenario where three sub-warps are needed, as shown in row 11. This happens when there are three active threads assigned to the cluster and they are non-inherently redundant as described in Figure 5.8c.

Row#   TIR[0:2]   AM[0:2]   WD_TMR0   N_TMR0_sub-warps
1      000        000       0         0
2      000        001       0         0
3      000        010       0         0
4      000        011       1         2
5      011        011       0         0
6      000        100       0         0
7      000        101       1         2
8      101        101       0         0
9      000        110       1         2
10     110        110       0         0
11     000        111       1         3
12     011        111       1         2
13     101        111       1         2
14     110        111       1         2
15     111        111       0         0
Table 5.3: Warp deformation control per TMR cluster.
The same deformation control logic described in Table 5.3 is used for (L3,L4,L5), (L6,L7,L8), and (L9,L10,L11) TMR clusters in the Fermi architecture. The deformation control for the four-lane special TMR cluster (i.e. L12-L15) is handled the same as long as
the number of active threads assigned to the cluster is less than four. On the other hand,
when the number of active threads assigned to the special cluster is four then the
deformation control will be as follows. The best case occurs when all four active threads
are inherently redundant as no deformation is required. The worst case occurs when each
of the four threads has distinctive source operands; hence, four sub-warps are needed as
described in Figure 5.9a. When two out of four threads are inherently redundant, three
sub-warps are needed as described in Figure 5.9c. Finally, when three out of four threads
are inherently redundant, two sub-warps are needed. The first sub-warp masks the non-
inherently redundant thread and executes the three redundant threads. The second sub-
warp masks the three redundant threads and allows the non-redundant thread to replicate
its computation on the idle lanes.
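The per-cluster TMR decision of Table 5.3 can be summarized by the following sketch for a regular three-lane cluster; handling of the special four-lane cluster would extend the same conditions as described above, and the structure shown is an assumption for illustration only.

```python
# Sketch of the per-cluster TMR deformation control of Table 5.3 (3-lane cluster).

def tmr_cluster_deform(am, tir):
    """am and tir are 3-bit tuples for one TMR cluster.
    Returns (WD_TMRi, N_TMRi_sub-warps)."""
    n_active = sum(am)
    # At most one active thread: replicate it on the idle lanes, no split.
    if n_active <= 1:
        return 0, 0
    # Every active thread is inherently covered: no split needed.
    if all(t for a, t in zip(am, tir) if a):
        return 0, 0
    # Three active, pairwise-distinct threads: one sub-warp per thread (row 11).
    if n_active == 3 and sum(tir) == 0:
        return 1, 3
    # Remaining cases (rows 4, 7, 9, 12, 13, 14): two sub-warps suffice.
    return 1, 2
```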
Similar to DMR mode, it is the worst case cluster that determines whether warp
deformation is needed or not and the number of sub-warps during TMR mode. This can
be achieved by OR-ing the TMR warp deformation output flags (i.e. WD_TMRi) of the regular TMR clusters and the special four-lane cluster to generate a single bit TMR deformation flag (WD_TMR) for the whole warp. Further, the numbers of sub-warps of all clusters within the SP unit are compared with each other, and the maximum is chosen as the number of sub-warps required for the whole warp (N_TMR_sub-warps).
The warp deformation control logic of the DMR and TMR modes is
implemented as part of the inherent redundancy and deformation analysis stage as shown
in Figure 5.11. At the end of this stage, every warp instruction is augmented with DIR
vector, TIR vector, DMR deformation flag (WD_DMR), TMR deformation flag (WD_TMR), number of sub-warps for DMR mode (N_DMR_sub-warps), and number of sub-warps for TMR mode (N_TMR_sub-warps).
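As a rough illustration of how the per-cluster TMR results could be combined and attached to a warp instruction, consider the sketch below; the WarpRedundancyInfo container and the combine_tmr helper are hypothetical names introduced here for clarity, not part of the actual pipeline design.

```python
# Sketch of the warp-level combination and of the metadata that augments each
# warp instruction at the end of the inherent redundancy and deformation stage.

from dataclasses import dataclass
from typing import List

@dataclass
class WarpRedundancyInfo:
    dir_vec: List[int]          # one bit per DMR cluster
    tir_vec: List[int]          # one bit per SIMT lane
    wd_dmr: int = 0             # WD_DMR flag
    wd_tmr: int = 0             # WD_TMR flag
    n_dmr_subwarps: int = 0     # N_DMR_sub-warps
    n_tmr_subwarps: int = 0     # N_TMR_sub-warps

def combine_tmr(cluster_results):
    """cluster_results: list of (WD_TMRi, N_TMRi_sub-warps) over the regular
    clusters and the special four-lane cluster. The warp deforms if any cluster
    does, and the warp-level sub-warp count is the worst case (maximum)."""
    wd = int(any(w for w, _ in cluster_results))
    n = max((n for _, n in cluster_results), default=0)
    return wd, n
```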
Sub-warps active masks generation and thread replication stage 5.8.2
After a warp instruction completes the inherent redundancy and deformation
analysis stage, it waits in the issue queue until its turn to be issued to the SP unit pipeline.
Before the warp instruction starts executing, two actions need to be performed: first, the
warp might need to be split into multiple sub-warps issued in consecutive cycles. Each
sub-warp needs a unique active mask to determine the active threads within. Second,
thread replication on the idle lanes needs to be managed in order to guarantee DMR or
TMR execution. To perform these tasks, Warped-RE adds a new stage, called sub-warps
active masks generation and thread replication stage, between the issue queue and the SP
unit pipeline. Figure 5.13 shows how this stage fits in the GPU pipeline.
Every warp instruction iterates in the sub-warps active masks generation and
thread replication stage according to the number of sub-warps required for the current
operational mode. In every cycle, a new sub-warp active mask is generated using the
reliability-aware split (RASp) unit which corresponds to the current operational mode
and the active mask is then used to control thread replication in order to force redundancy
by exploiting the idle lanes. In the following paragraphs, we discuss the detailed design
of the sub-warps active masks generation and thread replication stage.
5.8.2.1 Generating sub-warps active masks
A design similar to the RASp unit described in subsection 4.5.4 is used to
generate the active masks for the sub-warps during DMR and TMR modes of Warped-
RE. Figure 5.14 shows the RASp unit for one DMR cluster. For each SP unit in the Fermi
architecture, we need eight DMR RASp units responsible for generating the active masks
for the eight DMR clusters. The active masks of the DMR clusters are then concatenated
together to form the active mask of the whole sub-warp. For every new warp instruction,
the logic in Figure 5.14 is iterated according to the number of needed sub-warps. Recall
that in DMR mode, the number of sub-warps is either two when deformation is required or zero otherwise.

Figure 5.13: Sub-warps active masks generation and thread replication stage.
Table 5.4 lists all possible scenarios that could happen while generating the active
mask for the sub-warps of a DMR cluster. The 1st row represents the case where no deformation is required (i.e. WD_DMR = 0). In this case, the original issued active mask of the cluster (i.e. AM[y: y+1]) is selected to be the active mask of the first sub-warp (i.e. S1_AM[y: y+1]) as shown by the upper input of the rightmost 2:1 MUX in Figure 5.14. The "jk" expression in the 1st row can be any of the four possible active masks: "00", "01", "10", and "11". As no deformation is required, there will be no second sub-warp.

Figure 5.14: DMR cluster RASp unit.

The 2nd, 3rd and 4th rows show the cases when warp deformation is required but the cluster under study has at most one active thread assigned to it. In these cases, inherent redundancy is not available (i.e. DIR = 0) and the priority encoder ensures that the active thread originally assigned to the cluster, if there is one, is issued as part of the 1st sub-warp. Hence, the cluster will be completely idle during the 2nd sub-warp (i.e. S2_AM[y: y+1] = "00"). The latter is achieved by using the XOR-ing to deactivate all threads issued as part of the 1st sub-warp.
The 5th row represents the case when warp deformation is required and there are two threads assigned to the cluster but they are non-inherently redundant. During the 1st iteration, the priority encoder input is "11" and its output is "10". As a result, the active mask of the 1st sub-warp is chosen as "01" through the 4:1 MUX. For the 2nd iteration, the priority encoder input is "10", which is the result of the XOR-ing, and its output is "11". As a result, the active mask of the 2nd sub-warp is chosen as "10" through the 4:1 MUX. This way, the two active threads are distributed over the two sub-warps which allow them to be replicated and become DMR-ed.
The last possible case is shown in the 6th row. Warp deformation is required and two inherently redundant threads are assigned to the cluster (i.e. DIR = 1). In this case, the two threads are assigned to the 1st sub-warp which is achieved through the upper input of the second rightmost 2:1 MUX in Figure 5.14. The XOR-ing deactivates all threads issued with the 1st sub-warp, which causes the active mask of the 2nd sub-warp to be "00" as chosen by the priority encoder and the 4:1 MUX.
Row#   DIR   AM[y: y+1]   WD_DMR   S1_AM[y: y+1]   S2_AM[y: y+1]
1      x     jk           0        jk              N/A
2      0     00           1        00              00
3      0     01           1        01              00
4      0     10           1        10              00
5      0     11           1        01              10
6      1     11           1        11              00
Table 5.4: DMR cluster sub-warps active masks.
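The per-cluster behaviour captured in Table 5.4 can be expressed in software as the following sketch; it models the priority-encoder and XOR iteration functionally and uses our own naming and bit ordering (following the table), not the RTL of the RASp unit.

```python
# Sketch of the per-cluster output of the DMR RASp unit (Table 5.4): given the
# issued active mask of a two-lane cluster, its DIR bit, and the warp-level
# WD_DMR flag, produce the active masks of the sub-warps.

def dmr_rasp_masks(am_pair, dir_bit, wd_dmr):
    """Returns the list of per-sub-warp active masks for one DMR cluster."""
    if not wd_dmr:
        return [am_pair]                      # row 1: issue the warp as-is
    if dir_bit:
        return [am_pair, (0, 0)]              # row 6: both threads in sub-warp 1
    if am_pair == (1, 1):
        return [(0, 1), (1, 0)]               # row 5: one active thread per sub-warp
    return [am_pair, (0, 0)]                  # rows 2-4: lone thread in sub-warp 1
```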
Figure 5.15 shows the RASp unit for a regular TMR cluster. This unit can be
reused for the special four-lane TMR cluster by having a four-bit active mask input (AM[y: y+3]) and a four-bit TMR inherent redundancy vector (TIR[y: y+3]). Furthermore, the inputs of the 8:1 MUX need to be changed to ("0000", "1000", "0100", "0010", and "0001"). The functionality is exactly the same as the DMR RASp unit with the exception that the number of sub-warps during TMR mode can be zero (i.e. no deformation), two, three, or four (i.e. when the special cluster has four non-inherently redundant threads). When no deformation is required (i.e. WD_TMR = 0), the original active mask is chosen as it is for the 1st sub-warp and no subsequent sub-warps are needed.
When deformation is required, there are three general rules. First, any inherently
redundant thread is assigned to the 1st sub-warp as indicated by the upper input of the second rightmost 2:1 MUX in Figure 5.15. Second, threads that are non-inherently redundant are distributed among the remaining sub-warps such that one active thread is assigned to each sub-warp which can be achieved through the priority encoder and the 8:1 MUX. Third, any sub-warp remaining after all active threads are issued will render the cluster completely idle (i.e. the active mask of the sub-warp is all zeros).

Figure 5.15: TMR cluster RASp unit.
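The three rules above can be sketched as follows for a regular three-lane cluster; the order in which leftover threads are issued and the function name are illustrative assumptions, not a description of the actual MUX network.

```python
# Sketch of the TMR RASp rules for one regular three-lane cluster.

def tmr_rasp_masks(am, tir, n_subwarps):
    """am, tir: 3-bit lists for one TMR cluster; n_subwarps: warp-level count."""
    if n_subwarps == 0:
        return [list(am)]                       # no deformation: issue as-is
    masks = []
    remaining = list(am)
    # Rule 1: all inherently redundant threads share the first sub-warp.
    first = [a & t for a, t in zip(am, tir)]
    if any(first):
        masks.append(first)
        remaining = [a & (1 - t) for a, t in zip(am, tir)]
    # Rule 2: each remaining thread gets its own sub-warp (lane order assumed).
    while len(masks) < n_subwarps:
        sub = [0, 0, 0]
        for lane, bit in enumerate(remaining):
            if bit:
                sub[lane] = 1
                remaining[lane] = 0
                break
        masks.append(sub)                       # Rule 3: may be all zeros (idle cluster)
    return masks
```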
Figure 5.13 shows how the DMR and TMR RASp units fit in the sub-warps active
mask generation and thread replication stage. Both units execute in parallel by receiving
the warp active mask, the inherent redundancy vector (i.e. DIR or TIR), and the warp
deformation flag (i.e. WD_DMR or WD_TMR) as inputs and generating the sub-warp active mask. Consequently, the D_T mode bit selects the active mask that corresponds to the current operational mode. The selected active mask controls thread replication which is required to force redundant execution as will be described next. In addition, the figure shows how the D_T mode bit helps to load the iteration counter with the number of sub-warps (i.e. N_DMR_sub-warps or N_TMR_sub-warps) required according to the operational mode of the current warp instruction.
5.8.2.2 Replicating active threads
To leverage idle lanes for forced redundancy, we need the ability to forward the
source operands of each SIMT lane to all other lanes in the same cluster. We achieve this
by adding forwarding multiplexers, adopted and modified from [47] [33], in the sub-
warps active masks generation and thread replication stage as shown in Figure 5.13. In
the previous implementations [47] [33], the cluster size is fixed to four SIMT lanes.
In Warped-RE, on the other hand, the cluster size is two lanes during DMR mode and then it dynamically changes to three lanes when a fault is detected and TMR mode is activated.
We could have separate multiplexers for DMR and TMR modes but that is unnecessary
since these multiplexers can be repurposed without the need for replication. As such, the
same multiplexers are used during DMR and TMR modes and the select logic takes into
consideration the current operational mode.
Figure 5.16 shows the multiplexers required for the first six SIMT lanes within a
SP unit (i.e. L0-L5). In DMR mode, the six lanes form three clusters: (L0,L1), (L2,L3), and (L4,L5). In TMR mode, the six lanes form two clusters: (L0,L1,L2) and (L3,L4,L5). Based
on this and in addition to receiving the source operands of the active thread assigned to it
(T0), L0 should be able to receive source operands from the thread assigned to L1 (T1) during DMR operation. During TMR operation, L0 should be able to receive source operands from T1 and T2. Hence, for both modes combined, L0 should be able to receive source operands from threads T0, T1, and T2. So, a 3:1 MUX is needed for L0. Similarly, L1 should be able to receive source operands from threads T0, T1, and T2. So, a 3:1 MUX is also needed for L1.
On the other hand, L2 should be able to receive source operands from T3 during DMR mode and at the same time should be able to receive source operands from T0 and T1 during TMR mode. Hence, a 4:1 MUX is needed for L2. By the same approach, L3, L4 and L5 need 4:1 MUX, 3:1 MUX, and 3:1 MUX, respectively.

Figure 5.16: Thread replication hardware support.

The second six SIMT
lanes within the SP unit (i.e. L6-L11) have exactly the same multiplexing requirements as the first six lanes. The last four SIMT lanes (i.e. L12-L15) have different multiplexing requirements because they form one special TMR cluster. L12, L13 and L14 need 4:1 MUXes in order to receive operands from all threads assigned to the cluster. L15 only needs a 2:1 MUX to receive source operands from its assigned thread and the thread assigned to L14 during DMR mode.
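The forwarding requirements described above can be summarized by the following sketch, which derives, for every SIMT lane of an SP unit, the set of threads it must be able to receive operands from; the dictionary form is only an illustration of the wiring, not a hardware description.

```python
# Sketch of the forwarding-MUX connectivity: each lane can receive its own
# thread, its DMR partner's thread, and its TMR cluster mates' threads.

def forwarding_sources(num_lanes=16):
    dmr_clusters = [(2 * c, 2 * c + 1) for c in range(num_lanes // 2)]
    tmr_clusters = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (9, 10, 11)]
    special = (12, 13, 14, 15)
    sources = {lane: {lane} for lane in range(num_lanes)}
    for a, b in dmr_clusters:                 # DMR partner (gives L15 its 2:1 MUX)
        sources[a].add(b)
        sources[b].add(a)
    for cluster in tmr_clusters:              # regular TMR cluster mates
        for lane in cluster:
            sources[lane].update(cluster)
    for lane in special[:3]:                  # during TMR, L12-L14 may receive all four threads
        sources[lane].update(special)
    return {lane: sorted(srcs) for lane, srcs in sources.items()}

# e.g. forwarding_sources()[0] -> [0, 1, 2]     (3:1 MUX)
#      forwarding_sources()[2] -> [0, 1, 2, 3]  (4:1 MUX)
```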
For each one of the forwarding multiplexers, the control logic responsible for
generating the select signals is a function of the active mask bits of the corresponding
SIMT lanes and the D_T mode bit. Table 5.5 represents the truth table for the control
logic used to generate the select signals for L0's 3:1 MUX described in Figure 5.16. The inputs of the control logic are the active mask bits of L0, L1, and L2 and the D_T bit.
Whenever we are running in DMR mode (i.e. D_T = 0), the active mask bit of L2 (i.e. AM[2]) is irrelevant as only L0 and L1 are part of the DMR cluster. This means that
during DMR mode, L0 either receives source operands from the thread assigned to it (as shown in rows 3 and 4) or the thread assigned to L1 (as shown in row 2). When L0 and L1 are both idle, the select signals can have any values (as shown in row 1).
During TMR mode (i.e. D_T = 1), if all three lanes are idle then the select signals can have any value (as shown in row 5). Whenever there is an active thread assigned to L0, the MUX control logic chooses the source operands provided by that thread (as shown in rows 9, 10, 11, and 12). When L0 is idle, we check if there is an active thread assigned to L1 and forward its source operands to L0 (as shown in rows 7 and 8). Otherwise, we forward the operands of the thread assigned to L2 (as shown in row 6).

Row#   AM[0:2]   D_T   MUX Select (S[1:0])
1      00x       0     xx
2      01x       0     01
3      10x       0     00
4      11x       0     00
5      000       1     xx
6      001       1     10
7      010       1     01
8      011       1     01
9      100       1     00
10     101       1     00
11     110       1     00
12     111       1     00
Table 5.5: L0 thread replication MUX control.
Based on this truth table, the logic required to generate the select signals for L0's 3:1 MUX is given in Figure 5.17. Although DMR and TMR modes are supported, the select logic turned out to be quite simple with only two AND gates and three inverters as shown in the figure. The same approach is used to design the control logic circuits responsible for generating the select signals of the forwarding MUXes for the remaining 15 SIMT lanes (i.e. L1-L15).

Figure 5.17: Select logic for L0's 3:1 MUX.
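A small sketch of the resulting select logic, using the S[1:0] encoding implied by Table 5.5 (00 selects T0, 01 selects T1, 10 selects T2), is shown below; the checks simply replay the defined rows of the table, and everything beyond the two Boolean expressions is our own framing.

```python
# Sketch of the two-gate select logic of Figure 5.17 for L0's 3:1 MUX.

def l0_mux_select(am0, am1):
    """The don't-care rows collapse the table to this two-gate form:
    S0 = ~AM[0] & AM[1], S1 = ~AM[0] & ~AM[1]; the D_T bit is not needed."""
    s0 = (1 - am0) & am1
    s1 = (1 - am0) & (1 - am1)
    return s1, s0

# Quick check of the defined (non-don't-care) rows of Table 5.5:
assert l0_mux_select(1, 0) == (0, 0)   # rows 3, 4, 9-12: forward T0
assert l0_mux_select(0, 1) == (0, 1)   # rows 2, 7, 8: forward T1
assert l0_mux_select(0, 0) == (1, 0)   # row 6 (TMR): forward T2
```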
Fault detection and correction stage 5.8.3
After each sub-warp completes its redundant execution, the outputs of the SIMT
lanes are compared against each other in order to detect and correct faults. To achieve
that, one more pipeline stage is added between the last execution stage and the write-back
stage. During DMR mode, only fault detection is possible. Hence, it is sufficient to
compare the outputs of every two SIMT lanes which belong to the same DMR cluster
using an XOR-based comparator. During TMR mode, fault detection and correction are
performed. Hence, the outputs of every three SIMT lanes which belong to the same TMR
cluster are fed to TMR voter and comparator logic similar to the one described in
Figure 5.1 and discussed in Section 5.3.
Figure 5.18 shows how fault detection and correction are achieved for the first six
SIMT lanes in a SP unit. Recall that the TMR logic which is used to detect and correct
faults in (L0,L1,L2) cluster during TMR mode contains a comparator that compares the outputs of L0 and L1 as described in Figure 5.1. This comparator can also be used during DMR mode to compare the outputs of the two lanes and detect faults in (L0,L1) DMR cluster. The same is true for (L4,L5) DMR cluster as the outputs of the two lanes are compared against each other as part of the TMR logic used for (L3,L4,L5) TMR cluster.

Figure 5.18: Fault detection and correction for DMR and TMR clusters.

As a result, there is no need to add dedicated comparators for fault detection in (L0,L1) and (L4,L5) DMR clusters. On the other hand, the (L2,L3) pair is not part of any TMR cluster. Hence, a dedicated comparator is needed to compare the outputs of the two lanes
and provide fault detection during DMR mode. Notice that the comparator box in
Figure 5.18 represents a single XOR-based comparator because instructions have at most
one destination operand. This is different from the inherent redundancy detection logic
shown in Figure 5.12, where every comparator box represents three XOR-based comparators to compare up to three source operands. An identical instance of the fault detection and correction logic shown in Figure 5.18 is used for the second six SIMT lanes in the SP unit (i.e. L6-L11).
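Functionally, the per-cluster checking added in this stage amounts to the following sketch: an equality check for a DMR pair and a majority vote with a mismatch flag for a TMR triple, in the spirit of the voter-and-comparator logic of Figure 5.1. It is a software illustration with assumed names, not the hardware implementation.

```python
# Sketch of per-cluster fault detection (DMR) and detection plus correction (TMR).

def dmr_check(out_a, out_b):
    """Returns True if a fault is detected in a two-lane DMR cluster."""
    return out_a != out_b

def tmr_vote(out0, out1, out2):
    """Returns (corrected_output, fault_detected) for a three-lane TMR cluster.
    Any two matching outputs win the vote, so a single faulty lane is masked."""
    if out0 == out1 or out0 == out2:
        return out0, (out0 != out1 or out0 != out2)
    if out1 == out2:
        return out1, True
    # All three disagree: correction is not possible, only detection.
    return out0, True

# Example: lane 1 produced a wrong value; the vote still returns the correct one.
print(tmr_vote(42, 7, 42))   # -> (42, True)
```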
Figure 5.19 shows the fault detection and correction logic for the last four lanes in
the SP unit (i.e. L12-L15). During DMR mode, these four lanes form two DMR clusters which require two comparators for fault detection. During TMR mode, the leftmost three lanes (i.e. L12-L14) are used to achieve TMR execution for all active threads within the special cluster. Hence, the outputs of (L12,L13,L14) are fed to a TMR voter and comparator logic as shown in the figure. The TMR logic is used indirectly to compare the outputs of (L12,L13) cluster during DMR mode. Hence, only one dedicated comparator is required to compare L14 and L15 during DMR mode.
Figure 5.19: Fault detection and correction for special TMR cluster.
Warped-RE Design Alternatives 5.9
According to the implementation of Warped-RE described so far, the operational
mode will permanently change from DMR to TMR execution after the first non-transient
fault is detected. Although TMR execution is required to guarantee correctness, this
approach may seem overprovisioned as TMR execution is forced across all clusters when
only one cluster is suffering from a non-transient fault. In this section, we investigate if
there are other design alternatives that provide the same level of fault detection and
correction as the original Warped-RE with less area, performance, and power overheads.
One possible design alternative is to allow different clusters to operate in different modes,
such that healthy clusters run in DMR mode and faulty clusters run in TMR mode. Such
design is not promising for two reasons: first, very little or no performance advantage is
expected as the execution of most warp instructions will be bounded by the faulty cluster.
This is mainly because warp deformation is determined based on the needs of the worst
case cluster. Second, extra control complexity is required to allow threads within the
same warp to execute in different modes concurrently.
A second design alternative is to disable (i.e. power down) the SIMT lane which
suffers from a non-transient fault and reroute its assigned threads to healthy nearby lanes.
Using this alternative and after a faulty warp instruction gets corrected through TMR
execution, we can return to run in DMR mode even if the fault is non-transient because
the faulty lane is disabled. As we continue to run in DMR mode after non-transient faults
are detected, one might expect that such a design performs better than the original Warped-
RE design. However, our evaluations indicate that this design alternative causes an extra 10% performance overhead compared to permanently running in TMR mode after the first non-transient fault is detected, regardless of which cluster experiences the fault.
To better understand the second design alternative and why it causes higher
performance overhead than the original Warped-RE design, let us consider the example
given in Figure 5.20 which shows four SIMT lanes with four threads (T0-T3) assigned to
them. During fault-free execution shown in Figure 5.20a, the alternative design is
identical to the original Warped-RE design because both run in DMR mode and all SIMT
lanes are healthy. When a fault is detected, the faulty warp is re-executed in TMR mode
to correct the fault. As long as the detected faults are transient, the alternative and the
original designs execute exactly the same and their performance overheads are identical.
Figure 5.20b shows the case where L2 experiences a non-transient fault as indicated by the red cross mark. According to the second design alternative, L2 needs to be disabled so that we can continue to run in DMR mode. Notice that disabling L2 renders the entire DMR cluster (L2,L3) useless because it can no longer provide DMR functionality. As
such, the threads assigned to (L2,L3) are rerouted to the cluster (L0,L1) as indicated by the curved arrows in Figure 5.20b.

Figure 5.20: Design alternative functionality.

For this to work, every warp instruction that has at least one active thread assigned to the (L2,L3) cluster and at least one active thread assigned to the (L0,L1) cluster needs to be deformed into two sub-warps. The 1st sub-warp masks the
threads assigned to the (L0,L1) cluster and allows the threads assigned to the disabled (L2,L3) cluster to be rerouted to the idle (L0,L1) cluster. The 2nd sub-warp masks the threads assigned to the disabled (L2,L3) cluster and schedules the threads assigned to the (L0,L1) cluster normally. Notice that further deformation might be required when inherent redundancy and idle SIMT lanes opportunities are not available or insufficient to provide DMR execution.
The extra warp deformations required to support the rerouting of the threads
assigned to a disabled cluster are the main contributors to the extra performance overhead
associated with the second design alternative. This is especially true when leveraging
inherent redundancy and idle SIMT lanes can provide opportunistic TMR execution for
all threads within a warp instruction (47% of the warps as shown in Figure 5.7). In the
latter case, the original Warped-RE design can execute the warp instruction in TMR
mode (i.e. with the existence of a non-transient fault) without the need to activate warp
deformation. On the other hand, the alternative design needs to deform most of the warp
instructions that can be opportunistically TMR-ed in order to reroute the threads assigned
to any disabled DMR cluster which causes extra performance overhead compared to the
original design.
Note that one can add extra comparators to the second design alternative in order
to detect inherent redundancy opportunities between DMR clusters and make this design
alternative perform the same or slightly better than the original Warped-RE design. However, this requires extra hardware cost and increases the complexity of the control logic. Hence, we decided to stick with the original Warped-RE design described in Section 5.8 because it represents the most straightforward approach to achieve fault
detection and correction with minimum hardware overhead and design complexity while
achieving substantial savings in the performance and power overheads that accompany
traditional redundant execution.
Experimental Evaluation 5.10
To evaluate the Warped-RE framework, we used GPGPU-Sim v3.02 [10]. The
baseline GPU architecture is configured using the Nvidia GTX480 (i.e. Fermi)
configuration file included in the GPGPU-Sim package and the Warped-RE framework is
implemented on top of the baseline architecture. In particular, the following stages are
added to the GPU pipeline: inherent redundancy and deformation analysis, sub-warps
active masks generation and thread replication, and fault detection and correction stages
as described in Section 5.8. In the evaluation, 22 benchmarks from GPGPU-Sim, Rodinia
[23], and Parboil [98] benchmark suites are used to cover a wide range of application
domains. Each benchmark is executed on the baseline architecture without any fault
detection or correction support and on the new architecture which supports the Warped-
RE framework. This enables us to quantify the performance and power overheads of
Warped-RE relative to the baseline architecture.
We perform two sets of experiments. The first set of experiments is assumed to be
fault-free and the GPU continuously runs in DMR mode to guarantee 100% fault detection.
This experiment shows the cost of providing fault detection in the common case. The
second set of experiments assumes every TMR cluster is suffering from a single non-
transient fault in one of its SIMT lanes. Hence, the GPU continuously runs in TMR mode
during the second set of experiments. This experiment shows the cost of providing fault
correction in the rare case.
DMR mode evaluation 5.10.1
Figure 5.21 plots the execution time of the Warped-RE framework during DMR
mode relative to the baseline architecture without fault detection and correction support.
The weighted average performance overhead across all benchmarks is 8.4%. This is
much less than the expected overhead of dual redundant execution which may reach up to
100%. This huge reduction in performance overhead is attributed to three main factors.
First, low cost opportunistic DMR execution is achieved for many warp instructions (i.e.
46% as shown in Figure 5.4) by exploiting inherent redundancy and utilizing idle SIMT
lanes. Second, for some benchmarks it is rarely the case that warp instructions are issued
back-to-back to the same SP unit due to long latency data dependencies or application
instruction mix. This creates empty bubbles in the SP unit pipeline which help to hide the
performance degradation caused by dynamic warp deformation.
Figure 5.21: DMR mode performance overhead.
Third, when warp deformation is activated it prevents consecutive warps from
issuing to the SP unit. Consequently, the consecutive warps give higher priority to ready
memory instructions. This helps to reduce the contentions in the memory sub-system
especially for memory-intensive benchmarks and in some cases it may even lead to an
anomalous increase in performance. For some benchmarks, the benefits achieved by the
three factors surpass the performance overhead of redundant execution. This is true for
srad, lbm, and sgemm benchmarks. For example, sgemm experiences 12% performance
improvements because the memory contention stalls are reduced by a factor of three.
In order to quantify the effectiveness of inherent redundancy and idle SIMT lanes
to achieve opportunistic DMR, we classified the warps that are opportunistically DMR-ed
into three categories: inherent redundancy warps (IR_warps), idle lanes warps
(IL_warps), and IR+IL_warps. IR_warps represent the warps that exclusively leverage
inherent redundancy to achieve opportunistic DMR. IL_warps represent the warps that
exclusively leverage idle SIMT lanes and thread replication to achieve opportunistic
DMR. IR+IL_warps are the warps which exploit both inherent redundancy and idle
SIMT lanes to become opportunistically DMR-ed. The results are presented in
Figure 5.22. For all benchmarks, except MUM and NN, more than 50% of the
opportunistically DMR-ed warps are IR_warps. On average, 78% of the opportunistically
DMR-ed warps are IR_warps as shown in the last bar in Figure 5.22. MUM and NN
benchmarks have limited inherent redundancy to exploit. However, these two
benchmarks have few active threads within each warp instruction which allow them to
utilize idle SIMT lanes to force redundancy. For example, 96% of the warp instructions
of the NN benchmark have only one active thread as shown in Figure 5.2.
TMR mode evaluation 5.10.2
Warped-RE can tolerate one non-transient fault in every cluster of three SIMT
lanes. In other words, if there is one SIMT lane that is continuously producing incorrect
results within each cluster, then Warped-RE guarantees functional correctness by running
in TMR mode. Warped-RE continues to guarantee functional correctness until a second
SIMT lane within a cluster becomes faulty due to a non-transient fault.
Figure 5.23 plots the execution time while running in TMR mode relative to the
baseline architecture without fault detection and correction support. The weighted
average performance overhead for all benchmarks is 29% as shown in the last bar in the
figure. Again, this is much less than the expected overhead of triple redundant execution
which may reach up to 200% and this reduction is attributed to the same three factors
discussed in subsection 5.10.1.
Figure 5.22: Opportunistic DMR breakdown.
Each benchmark is expected to suffer higher performance degradation during
TMR mode compared to DMR mode. This is mainly because of two reasons: first,
opportunistic TMR execution is harder to achieve than opportunistic DMR as more
inherent redundancy and more idle lanes are required. This is shown in Figure 5.4 and
Figure 5.7 which indicate that, on average, 48% and 47% of the warp instructions can be
opportunistically DMR-ed and TMR-ed, respectively. Second and more important, when
warp deformation is activated to achieve TMR execution, more sub-warps are needed
because every thread should be replicated twice instead of just once as in DMR. Hence,
more idle lanes are needed which translates to more sub-warps. The results in Figure 5.23
show that generally most benchmarks experience higher performance degradation during
TMR mode compared to DMR mode. However, for some benchmarks, the TMR
performance overhead varies by less than 1% from the DMR performance overhead (e.g.
kmeans, srad, LIB, and MUM). This is mainly due to a combination of the three factors
Figure 5.23: TMR performance overhead.
discussed in subsection 5.10.1 which helps to hide the extra latency associated with
redundant execution.
We also measured the effectiveness of inherent redundancy and idle SIMT lanes
to achieve opportunistic TMR. The results are given in Figure 5.24 and they are very
similar to the results in DMR mode. Namely, across all benchmarks 77% of the
opportunistically TMR-ed warps are IR_warps. On the other hand, only 7% of the
opportunistically TMR-ed warps are IL_warps and 16% are IR+IL_warps. Hence, we
deduce that inherent redundancy is also the main contributor to opportunistic TMR
execution.
Area and power overheads 5.10.3
To evaluate the area and power overheads of the Warped-RE framework, we
implemented the inherent redundancy detection logic, DMR and TMR deformation
control logic, DMR and TMR reliability-aware split (RASp) units, forwarding
multiplexers for thread replication, fault detection and correction logic in RTL using
Synopsys Design Compiler and synthesized the RTL implementations using the NCSU PDK 45nm library [4].

Figure 5.24: Opportunistic TMR breakdown.

The area consumed by the additional three pipeline stages including the wiring is estimated at 0.5 mm². Notice that the total area of the SIMT lanes is 32 mm²; thus, the area overhead of Warped-RE is only 1.5%.
Traditional DMR and TMR executions require 100% and 200% power overheads,
respectively. By exploiting inherent redundancy between threads within the same warp,
Warped-RE reduces the power overheads of DMR and TMR executions to 58% and
120%, respectively. The total dynamic power consumed by the additional three stages is
632mW. This represents the power consumed every cycle the stages are activated. We
measured the dynamic power consumed by the GPU baseline using GPUWattch [52].
The results show that the power overhead of the additional stages is around 10.6%.
Summary and Conclusions 5.11
In this chapter we presented Warped-RE, a unified framework to provide low cost
fault detection and correction for the SIMT lanes in GPUs. Warped-RE leverages the
inherent redundancy between threads within the same warp and the underutilization of
the SIMT lanes to achieve opportunistic low cost redundant execution. When
opportunistic redundant execution is not sufficient, the framework depends on the warp
deformation technique to force redundant execution and achieve full detection and
correction coverage.
By default, Warped-RE runs in DMR mode to provide 100% fault detection.
When a fault is detected, the faulty computation is re-executed in TMR mode in order to
provide fault correction. After the fault is corrected, the execution either returns to DMR
191
mode if the fault is transient or continues in TMR mode if the fault is non-transient to
guarantee correctness. Warped-RE incurs minimal 8.4% and 29% average performance
overhead during DMR and TMR modes, respectively. In addition, Warped-RE reduces
the power overhead of traditional DMR and TMR executions by 42% and 40%,
respectively.
Chapter 6
Conclusions and Future Work
Reliability is becoming a first-order design constraint in future chips. Future
devices are expected to experience higher rates of in-field transient, intermittent, and
permanent faults. In this dissertation we present efficient low cost fault detection,
diagnosis, correction and tolerance mechanisms for multicore (e.g. chip multiprocessors)
and many-core (e.g. graphics processing units) systems.
In Chapter 2 we presented a usage-signature based mechanism (SignTest) to
detect permanent faults in multicore systems. SignTest periodically suspends the normal
operation of the individual cores and runs specialized test patterns/programs to detect
permanent faults. Rather than testing all cores' components, SignTest tracks the usage of
individual cores at the functional block level and adapts the testing phase accordingly.
Two possible implementations of SignTest were presented: conservative SignTest (C-
SignTest) and relaxed SignTest (R-SignTest).
C-SignTest is used in systems where the running applications require maximum
fault coverage against permanent faults (e.g. financial and scientific applications). While
maintaining maximum coverage, C-SignTest reduces the energy and performance
overheads associated with testing by skipping the idle functional blocks and only
applying the test patterns/programs of the used functional blocks. On the other hand, R-
SignTest is used in systems where the nature of the running applications allows them to
tolerate faults to some extent (e.g. graphics and media processing applications). R-
SignTest exploits the usage-failure relationship to determine the level of fault coverage
for each functional block according to its usage level during the last epoch. R-SignTest
achieves higher energy and performance savings than C-SignTest at the cost of the
minimal reduction in the fault coverage.
In Chapter 3 we tackled fault diagnosis and presented a new class of exceptions
(i.e. reliability-aware exceptions (RAEs)) that analyzes faults detected in the
microprocessor array structures and classifies them as transient, intermittent, or
permanent. The novelty of RAEs is the ability to distinguish intermittent faults and
handle them accordingly. RAEs use a historical fault log to keep track of the number of
faults detected in every array structure entry. In addition, the time stamp of the last fault
detected in every entry is also stored in the historical log.
When a fault is detected in one entry, the fault history of the entry is extracted
from the historical log and used to classify the fault. Transient faults are handled using
the preexisting roll-back mechanism originally used for branch misprediction and precise
exception handling. Permanent faults are handled through permanent de-configuration of
the faulty entries. Intermittent faults are handled using temporal de-configuration which
helps to recover from or deactivate the intermittent fault.
In Chapter 4 we shifted gears from multicore to many-core systems and presented
the Warped-Shield framework which consists of three low cost techniques to tolerate
hard faults in the SIMT lanes of GPUs. First, we described the intra-cluster thread
shuffling technique which exploits the underutilization of the SIMT lanes to reroute
active threads from faulty lanes to healthy idle lanes within the same cluster. Intra-cluster
thread shuffling has negligible hardware, performance, and power costs but it is limited
by the number of healthy idle lanes per cluster. To overcome this limitation, we presented
the dynamic warp deformation technique which splits a warp into multiple sub-warps
with fewer active threads. Hence, more healthy idle lanes are available per cluster for
each sub-warp which can be exploited by intra-cluster thread shuffling. The hardware and
power costs of dynamic warp deformation are minimal; however, some performance
degradation is unavoidable as warps are issued over multiple cycles.
Third, to mitigate the performance degradation of dynamic warp deformation we
presented the inter-SP warp shuffling technique. This technique leverages the asymmetric
fault maps of the streaming processor (SP) units within the same streaming
multiprocessor (SM) to issue every warp to the most appropriate SP unit (i.e. the SP unit
which does not require the warp to be deformed or the SP unit which requires the warp to
be split into the minimum number of sub-warps). Using the three fault tolerance
techniques, our evaluation showed that we can tolerate up to three faulty SIMT lanes
within every cluster of four lanes with 57% performance overhead.
Finally, in the last chapter we presented Warped-Redundant Execution (Warped-
RE), a unified fault detection and correction framework for the SIMT lanes in GPUs. The
framework is based on dual modular redundancy (DMR) during fault-free execution and
just-in-time triple modular redundancy (TMR) to correct faults after being detected.
During DMR operation, every unique computation needs to be replicated and checked
which essentially detects all transient, intermittent, and permanent faults. When a fault is
detected, the faulty warp instruction is re-executed in TMR mode in order to correct the
fault and identify any potential faulty SIMT lanes.
To minimize the high overheads associated with dual and triple redundant
computations, Warped-RE exploits two of the GPUs applications properties. First, in
some cases the active threads within a warp can be divided into groups, such that the
threads within the same group use the exact same input values for their computations (i.e.
spatial value locality of the source operands). In these cases, the outputs of the threads
within the same group are expected to match; hence, the threads within each group are
considered inherently redundant. Warped-RE exploits the inherent redundancy between
threads within the same warp by dividing the threads into clusters of size two and three
during DMR and TMR operation, respectively. The source operands of the active threads
within the same cluster are compared against each other and when all source operands are
matching, the threads are considered inherently DMR-ed or TMR-ed and faults can be
detected and corrected simply by comparing the threads' outputs.
The second property is that many GPU applications underutilize the SIMT lanes.
Warped-RE exploits the underutilization by allowing active threads to replicate their
computations on the idle lanes within the same cluster and become DMR-ed or TMR-ed.
Our evaluation showed that around half of the warps benefit from the inherent
redundancy and the SIMT lanes underutilization properties to fully execute in DMR or
TMR mode. In order to protect the remaining warps, Warped-RE leverages the dynamic
warp deformation technique presented in Chapter 4 to split a warp into multiple sub-
warps with fewer active threads. For all threads that are neither inherently redundant nor covered by the underutilization of the original warp, the deformation helps to create artificial underutilization opportunities that allow the computation of each one of them to be replicated once in DMR mode or twice in TMR mode.
Most of the research work that target the reliability of future chips, including the
work presented in this dissertation, tend to target specific components within the system
or provide protection against specific fault types. For instance, RAEs target the array
structures, Warped-Shield and Warped-RE target the SIMT lanes in GPUs, and SignTest
provides protection only against permanent faults. One possible future research direction
is to extend the domain of the presented techniques to cover other system components
and fault types or investigate the applicability of the multicore system techniques to
many-core systems and vice-versa.
For example, one can broaden the reach of RAEs to diagnose and recover from
faults detected in the tabular array structures (e.g. branch prediction buffers, branch target
buffers, and reservation stations) and redundant execution units (e.g. ALUs, multipliers,
and dividers). Similarly, the coverage of the GPU fault handling techniques can be
extended to include special function units (SFUs) and load/store units in the GPUs.
Further, the SignTest mechanism can be adapted to the GPU platform to implement
usage-based periodic testing for the different execution units.
Another open area of research is to explore how different reliability solutions can
be combined and implemented in one system to provide full protection against all
possible fault types. This exploration should address the expected design challenges and
conflicts between different reliability management policies. The fault handling techniques
presented in this dissertation serve as building blocks to implement such fully protected
systems.
Although future chips are expected to experience higher in-field fault rates, some
applications have high thresholds for faulty computations (e.g. graphics and media
processing applications). Hence, such applications can run on systems where more
resources are dedicated for power and performance efficiency and fewer resources are
dedicated for addressing reliability concerns. On the other hand, applications with low
thresholds for faulty computations (e.g. financial and scientific applications) should run
on reliable systems where all computations are protected against all in-field fault types all
the time. Based on this observation, an interesting area of research is to come up with
new reliability metrics, other than the traditional mean time between failures, to reflect
the fault tolerance thresholds of different applications. The new reliability metrics can be
used as part of the design specifications of future computing systems.
References
[1] [Online]. http://www.xilinx.com/univ/xupv5-lx110t.htm
[2] [Online]. http://www.oracle.com/technetwork/server-
storage/solaris11/downloads/index.html
[3] [Online]. http://www.xilinx.com/products/intellectual-property/chipscope_ila.html
[4] [Online]. http://www.eda.ncsu.edu/wiki/FreePDK
[5] Nidhi Aggarwal, Parthasarathy Ranganathan, Norman P. Jouppi, and James E.
Smith, "Configurable Isolation: Building High Availability Systems with
Commodity Multi-core Processors," in Proceedings of the 34th Annual
International Symposium on Computer Architecture, San Diego, 2007, pp. 470-
481.
[6] M. Agostinelli et al., "Erratic Fluctuations of SRAM Cache Vmin at the 90nm
Process Technology Node," in IEEE International Electron Devices Meeting
Technical Digest, Washington, 2005, pp. 655-658.
[7] H. Al-Asaad and M. Shringi, "On-line Built-in Self-test for Operational Faults," in
IEEE AUTOTESTCON Proceedings, Anaheim, 2000, pp. 168-174.
[8] H. Ando, R. Kan, Y. Tosaka, Keiji Takahisa, and K. Hatanaka, "Validation of
Hardware Error Recovery Mechanisms for the SPARC64 V Microprocessor," in
IEEE International Conference on Dependable Systems and Networks With FTCS
and DCC, Anchorage, 2008, pp. 62-69.
[9] T.M. Austin, "DIVA: A Reliable Substrate for Deep Submicron Microarchitecture
Design," in Proceedings of the 32nd Annual International Symposium on
Microarchitecture, Haifa, 1999, pp. 196-207.
[10] A. Bakhoda, G.L. Yuan, W.W.L. Fung, H. Wong, and T.M. Aamodt, "Analyzing
CUDA Workloads Using a Detailed GPU Simulator," in IEEE International
Symposium on Performance Analysis of Systems and Software, Boston, 2009, pp.
163-174.
[11] K. Batcher and C. Papachristou, "Instruction Randomization Self Test for
Processor Cores," in Proceedings of the 17th IEEE VLSI Test Symposium, Dana
Point, 1999, pp. 34-40.
[12] M.D. Beaudry, "Performance-Related Reliability Measures for Computing
Systems," IEEE Transactions on Computers, vol. C-27, no. 6, pp. 540-547, June
1978.
[13] S. Bhunia, S. Mukhopadhyay, and K. Roy, "Process Variations and Process-
Tolerant Design," in Proceedings of the 20th International Conference on VLSI
Design, Bangalore, 2007, pp. 699-704.
[14] A. Bondavalli, S. Chiaradonna, F. Di Giandomenico, and F. Grandoni,
"Threshold-Based Mechanisms to Discriminate Transient from Intermittent
Faults," IEEE Transactions on Computers , vol. 49, no. 3, pp. 230-245, March
2000.
[15] S. Y. Borkar et al., "Platform 2015: Intel Processor and Platform Evolution for the
Next Decade," Intelligence/sigart Bulletin, 2005.
[16] S. Borkar, T. Karnik, and Vivek De, "Design and Reliability Challenges in
Nanometer Technologies," in Proceedings of the 41st Design Automation
Conference, San Diego, 2004, p. 75.
[17] F.A. Bower, P.G. Shealy, S. Ozev, and D.J. Sorin, "Tolerating Hard Faults in
Microprocessor Array Structures," in International Conference on Dependable
Systems and Networks , Florence, 2004, pp. 51-60.
[18] F.A. Bower, D.J. Sorin, and S. Ozev, "A Mechanism for Online Diagnosis of Hard
Faults in Microprocessors," in Proceedings of the 38th Annual IEEE/ACM
International Symposium on Microarchitecture, Barcelona, 2005, pp. 197-208.
[19] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A Framework for
Architectural-Level Power Analysis and Optimizations," in Proceedings of the
27th International Symposium on Computer Architecture, Vancouver, 2000, pp.
83-94.
[20] Doug Burger and Todd M. Austin, "The SimpleScalar Tool Set, Version 2.0,"
ACM SIGARCH Computer Architecture News, vol. 25, no. 3, pp. 13-25, June
1997.
[21] X. Castillo and D. P. Siewiorek, "A Performance-Reliability Model for Computing
Systems," in Proceedings of the International Symposium on Fault-Tolerant
Computing, 1980, pp. 187 -192.
[22] X. Castillo and D.P. Siewiorek, "WORKLOAD, PERFORMANCE, AND
RELIABILITY OF DIGITAL COMPUTING SYSTEMS," in Proceedings of the
25th International Symposium on Fault-Tolerant Computing, 1995, pp. 367-372.
[23] Shuai Che et al., "Rodinia: A Benchmark Suite for Heterogeneous Computing," in
IEEE International Symposium on Workload Characterization, Austin, 2009, pp.
44-54.
[24] M. Choudhury and K. Mohanram, "Timing-driven optimization using lookahead
logic circuits," in Proceedings of the 46th ACM/IEEE Design Automation
Conference, San Francisco, 2009, pp. 390-395.
[25] C. Constantinescu, "Intermittent Faults and Effects on Reliability of Integrated
Circuits," in Annual Reliability and Maintainability Symposium, Las Vegas, 2008,
pp. 370-374.
[26] C. Constantinescu, "Intermittent Faults in VLSI Circuits," in IEEE Workshop on
System Effects of Logic Soft Errors, Champaign, 2006.
[27] C. Constantinescu, "TRENDS AND CHALLENGES IN VLSI CIRCUIT
RELIABILITY," IEEE Micro, vol. 23, no. 4, pp. 14-19, July 2003.
[28] K. Constantinides, O. Mutlu, T. Austin, and V. Bertacco, "Software-Based Online
Detection of Hardware Defects Mechanisms, Architectural Support, and
Evaluation," in Proceedings of the 40th Annual IEEE/ACM International
Symposium on Microarchitecture, Chicago, 2007, pp. 97-108.
[29] (2009, November) Cortex R series processors. [Online].
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0363e/Chdgfjac.
html
[30] S. Das et al., "RazorII: In Situ Error Detection and Correction for PVT and SER
Tolerance," IEEE Journal of Solid-State Circuits, vol. 44, no. 1, pp. 32-48,
January 2009.
[31] M. Depas, Tanya Nigam, and Marc M. Heyns, "Soft Breakdown of Ultra-Thin
Gate Oxide Layers," IEEE Transactions on Electron Devices, vol. 43, no. 9, pp.
1499-1504, September 1996.
[32] Martin Dimitrov, Mike Mantor, and Huiyang Zhou, "Understanding Software
Approaches for GPGPU Reliability," in Proceedings of 2nd Workshop on General
Purpose Processing on Graphics Processing Units, New York, 2009, pp. 94-104.
[33] W. Dweik, M. Abdel-Majeed, and M. Annavaram, "Warped-Shield: Tolerating
Hard Faults in GPGPUs," in Proceedings of the 44th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks, Atlanta, 2014,
pp. 431-442.
[34] W. Dweik, M. Annavaram, and M. Dubois, "Reliability-Aware Exceptions:
Tolerating Intermittent Faults in Microprocessor Array Structures," in Design,
Automation and Test in Europe Conference and Exhibition, Dresden, 2014.
[35] M. Ershov et al., "Dynamic Recovery of Negative Bias Temperature Instability in
P-Type Metal –Oxide –Semiconductor Field-Effect Transistors," Applied Physics
Letters, vol. 83, no. 8, pp. 1647-1649, August 2003.
[36] W.W.L. Fung and T.M. Aamodt, "Thread Block Compaction for Efficient SIMT
Control Flow," in Proceedings of the 17th IEEE International Symposium on High
Performance Computer Architecture, San Antonio, 2011, pp. 25-36.
[37] M. Gebhart et al., "Energy-efficient Mechanisms for Managing Thread Context in
Throughput Processors," in Proceedings of the 38th Annual International
Symposium on Computer Architecture, San Jose, 2011, pp. 235-246.
[38] Paul Genua. (2007, November) Error Correction and Error Handling on
PowerQUICC™ III Processors. [Online].
http://cache.freescale.com/files/32bit/doc/app_note/AN3532.pdf
[39] D. Gizopoulos and Shubhendu Mukherjee, "Guest Editors' Introduction: Special
Section on Dependable Computer Architecture," IEEE Transactions on
Computers, vol. 60, no. 1, pp. 3-4, January 2011.
[40] D. Gizopoulos et al., "Architectures for Online Error Detection and Recovery in
Multicore Processors," in Design, Automation and Test in Europe Conference and
Exhibition, Grenoble, 2011, pp. 1-6.
[41] M.A. Gomaa, C. Scarbrough, T.N. Vijaykumar, and I. Pomeranz, "Transient-Fault
Recovery for Chip Multiprocessors," IEEE Micro, vol. 23, no. 6, pp. 76-83,
November 2003.
[42] S. Gupta, A. Ansari, Shuguang Feng, and S. Mahlke, "Adaptive Online Testing for
Efficient Hard Fault Detection," in Proceedings of the IEEE International
Conference on Computer Design, Lake Tahoe, 2009, pp. 343-349.
[43] G. Hetherington et al., "Logic BIST for Large Industrial Designs: Real Issues and
Case Studies," in Proceedings of the International Test Conference, Atlantic City,
1999, pp. 358-367.
[44] A.L., Jr. Hopkins, T.B., III Smith, and J.H. Lala, "FTMP-A Highly Reliable Fault-
Tolerant Multiprocess for Aircraft," Proceedings of the IEEE, vol. 66, no. 10, pp.
1221-1239, October 1978.
[45] Wei Huang, K. Sankaranarayanan, K. Skadron, R.J. Ribando, and M.R. Stan,
"Accurate, Pre-RTL Temperature-Aware Design Using a Parameterized,
Geometric Thermal Model," IEEE Transactions on Computers, vol. 57, no. 9, pp.
1277-1288, September 2008.
[46] R.K. Iyer, S.E. Butner, and E.J. McCluskey, "A Statistical Failure/Load
Relationship: Results of a Multicomputer Study," IEEE Transactions on
Computers, vol. C-31, no. 7, pp. 697-706, July 1982.
[47] Hyeran Jeon and M. Annavaram, "Warped-DMR: Light-weight Error Detection
for GPGPU," in Proceedings of the 45th Annual IEEE/ACM International
Symposium on Microarchitecture, Vancouver, 2012, pp. 37-47.
[48] R.E. Kessler, "The ALPHA 21264 MICROPROCESSOR," IEEE Micro, vol. 19,
no. 2, pp. 24-36, March 1999.
[49] Poonacha Kongetira, K. Aingaran, and K. Olukotun, "NIAGARA: A 32-WAY
MULTITHREADED SPARC PROCESSOR," IEEE Micro, vol. 25, no. 2, pp. 21-
29, March 2005.
[50] Ravishankar Kuppuswamy, Peter DesRosier, Derek Feltham, Rehan Sheikh, and
Paul Thadikaran, "Full Hold-Scan Systems in Microprocessors: Cost/Benefit
Analysis," Intel Technology Journal, vol. 8, no. 1, pp. 63-72, February 2004.
204
[51] C. LaFrieda, E. Ipek, J.F. Martinez, and R. Manohar, "Utilizing Dynamically
Coupled Cores to Form a Resilient Chip Multiprocessor," in Proceedings of the
37th Annual IEEE/IFIP International Conference on Dependable Systems and
Networks , Edinburgh, 2007, pp. 317-326.
[52] Jingwen Leng et al., "GPUWattch: Enabling Energy Optimizations in GPGPUs,"
in Proceedings of the 40th Annual International Symposium on Computer
Architecture, Tel-Aviv, 2013, pp. 487-498.
[53] Sheng Li et al., "McPAT: An Integrated Power, Area, and Timing Modeling
Framework for Multicore and Manycore Architectures," in Proceedings of the
42nd Annual IEEE/ACM International Symposium on Microarchitecture, New
York, 2009, pp. 469-480.
[54] Man-Lap Li et al., "Understanding the Propagation of Hard Errors to Software and
Implications for Resilient System Design," in Proceedings of the 13th
International Conference on Architectural Support for Programming Languages
and Operating Systems, Seattle, 2008, pp. 265-276.
[55] Tai-Hua Lu, Chung-Ho Chen, and Kuen-Jong Lee, "A Hybrid Self-Testing
Methodology of Processor Cores," in IEEE International Symposium on Circuits
and Systems, Seattle, 2008, pp. 3378-3381.
[56] R. E. Lyons and W. Vanderkulk, "The Use of Triple-modular Redundancy to
Improve Computer Reliability," IBM Journal of Research and Development, vol.
6, no. 2, pp. 200-209, April 1962.
[57] Souvik Mahapatra. (2011, May) Negative Bias Temperature Instability (NBTI) in
p-MOSFETs: Characterization, Material/Process Dependence and Predictive
Modeling. [Online]. https://nanohub.org/resources/11249
[58] C. McNairy and D. Soltis, "ITANIUM 2 PROCESSOR
MICROARCHITECTURE," IEEE Micro, vol. 23, no. 2, pp. 44-55, March 2003.
205
[59] P.J. Meaney, S.B. Swaney, P.N. Sanda, and L. Spainhower, "IBM z990 Soft Error
Detection and Recovery," IEEE Transactions on Device and Materials Reliability,
vol. 5, no. 3, pp. 419-427, September 2005.
[60] A. Meixner, M.E. Bauer, and D.J. Sorin, "Argus: Low-Cost, Comprehensive Error
Detection in Simple Cores," in Proceedings of the 40th Annual IEEE/ACM
International Symposium on Microarchitecture, Chicago, 2007, pp. 210-222.
[61] A. Meixner and D.J. Sorin, "Detouring: Translating Software to Circumvent Hard
Faults in Simple Cores," in IEEE International Conference on Dependable
Systems and Networks , Anchorage, 2008, pp. 80-89.
[62] S.S. Mukherjee, M. Kontz, and S.K. Reinhardt, "Detailed Design and Evaluation
of Redundant Multithreading Alternatives," in Proceedings of the 29th Annual
International Symposium on Computer Architecture, Anchorage, 2002, pp. 99-
110.
[63] Veynu Narasiman et al., "Improving GPU Performance via Large Warps and Two-
level Warp Scheduling," in Proceedings of the 44th Annual IEEE/ACM
International Symposium on Microarchitecture, Porto Alegre, 2011, pp. 308-317.
[64] R. Nathan and D. Sorin, "Argus-G: A Low-Cost Error Detection Scheme for
GPGPUs," in Workshop on Resilient Architectures (WRA), Atlanta, January 2010.
[65] Edmund B. Nightingale, John R. Douceur, and Vince Orgovan, "Cycles, Cells and
Platters: An Empirical Analysisof Hardware Failures on a Million Consumer
PCs," in Proceedings of the Sixth Conference on Computer Systems, New York,
2011, pp. 343-356.
[66] S. Nomura et al., "Sampling + DMR: Practical and Low-overhead Permanent Fault
Detection," in Proceedings of the 38th Annual International Symposium on
Computer Architecture, San Jose, 2011, pp. 201-212.
[67] (2009) Nvidia Corporation Website. [Online].
http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_ar
206
chitecture_whitepaper.pdf
[68] (2012) Nvidia Corporation Website. [Online].
http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-
Whitepaper.pdf
[69] N. Oh, P.P. Shirvani, and E.J. McCluskey, "Error Detection by Duplicated
Instructions in Super-Scalar Processors," IEEE Transactions on Reliability, vol.
51, no. 1, pp. 63-75, March 2002.
[70] (2008, April) Oracle Corporation Website. [Online].
http://www.oracle.com/technetwork/systems/opensparc/t1-01-opensparct1-micro-
arch-1538959.html
[71] A. Pellegrini and V. Bertacco, "Application-Aware Diagnosis of Runtime
Hardware Faults," in IEEE/ACM International Conference on Computer-Aided
Design, San Jose, 2010, pp. 487-492.
[72] M. Prvulovic, Zheng Zhang, and J. Torrellas, "ReVive: Cost-Effective
Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors,"
in Proceedings of the 29th Annual International Symposium on Computer
Architecture, Anchorage, 2002, pp. 111-122.
[73] M. Psarakis, D. Gizopoulos, E. Sanchez, and M.S. Reorda, "Microprocessor
Software-Based Self-Testing," IEEE Design Test of Computers, vol. 27, no. 3, pp.
4-19, May 2010.
[74] P. Racunas, K. Constantinides, S. Manne, and S.S. Mukherjee, "Perturbation-
based Fault Screening," in IEEE 13th International Symposium on High
Performance Computer Architecture, Scottsdale, 2007, pp. 169-180.
[75] Brian Randell and Jie Xu, "The Evolution of the Recovery Block Concept," in
SOFTWARE FAULT TOLERANCE, 1994, pp. 1-22.
207
[76] K. Reick et al., "FAULT-TOLERANT DESIGN OF THE IBM POWER6
MICROPROCESSOR," IEEE Micro, vol. 28, no. 2, pp. 30-38, March 2008.
[77] S.K. Reinhardt and S.S. Mukherjee, "Transient Fault Detection via Simultaneous
Multithreading," in Proceedings of the 27th International Symposium on
Computer Architecture, Vancouver, 2000, pp. 25-36.
[78] G.A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D.I. August, "SWIFT:
Software Implemented Fault Tolerance," in International Symposium on Code
Generation and Optimization, San Jose, 2005, pp. 243-254.
[79] K. P. Rodbell, A. J. Castellano, and R. I. Kaufman, "AC Electromigration (10
MHz-1 GHz) in Al Metallization," in AIP Conference Proceedings, fourth
international workshop on stress induced phenomena in metallization , 1998, pp.
212-223.
[80] R. Rodriguez et al., "The Impact of Gate-Oxide Breakdown on SRAM Stability,"
IEEE Electron Device Letters, vol. 23, no. 9, pp. 559-561, September 2002.
[81] Bogdan F. Romanescu and Daniel J. Sorin, "Core Cannibalization Architecture:
Improving Lifetime Chip Performance for Multicore Processors in the Presence of
Hard Faults," in Proceedings of the 17th International Conference on Parallel
Architectures and Compilation Techniques, Toronto, 2008, pp. 43-51.
[82] E. Rosenbaum, R. Rofan, and Chenming Hu, "Effect of Hot-Carrier Injection on
n- and pMOSFET Gate Oxide Integrity ," IEEE Electron Device Letters, vol. 12,
no. 11, pp. 599-601, November 1991.
[83] M. Scholzel, "HW/SW Co-Detection of Transient and Permanent Faults with Fast
Recovery in Statically Scheduled Data Paths," in Design, Automation and Test in
Europe Conference and Exhibition, Dresden, 2010, pp. 723-728.
[84] E. Schuchman and T.N. Vijaykumar, "Rescue: A Microarchitecture for Testability
and Defect Tolerance," in Proceedings of the 32nd International Symposium on
208
Computer Architecture, Madison, 2005, pp. 160-171.
[85] K. Shi and D. Howard, "Sleep Transistor Design and Implementation - Simple
Concepts Yet Challenges To Be Optimum," in International Symposium on VLSI
Design, Automation and Test, Hsinchu, 2006, pp. 1-4.
[86] J. Shin, V. Zyuban, Zhigang Hu, J.A. Rivers, and P. Bose, "A Framework for
Architecture-Level Lifetime Reliability Modeling," in Proceedings of the 37th
Annual IEEE/IFIP International Conference on Dependable Systems and
Networks, Edinburgh, 2007, pp. 534-543.
[87] P. Shivakumar, S.W. Keckler, C.R. Moore, and D. Burger, "Exploiting
Microarchitectural Redundancy for Defect Tolerance," in Proceedings of the 21st
International Conference on Computer Design, San Jose, 2003, pp. 481-488.
[88] P. Shivakumar, M. Kistler, S.W. Keckler, D. Burger, and L. Alvisi, "Modeling the
Effect of Technology Trends on the Soft Error Rate of Combinational Logic," in
Proceedings of the International Conference on Dependable Systems and
Networks, Washington, DC, 2002, pp. 389-398.
[89] C.-L.K. Shum et al., "Design and microarchitecture of the IBM System z10
microprocessor," IBM Journal of Research and Development, vol. 53, no. 1, pp. 1-
12, January 2009.
[90] Smitha Shyam, Kypros Constantinides, Sujay Phadke, Valeria Bertacco, and Todd
Austin, "Ultra Low-cost Defect Protection for Microprocessor Pipelines," in
Proceedings of the 12th International Conference on Architectural Support for
Programming Languages and Operating Systems, San Jose, 2006, pp. 73-82.
[91] M.A. Skitsas, C.A. Nicopoulos, and M.K. Michael, "DaemonGuard: O/S-Assisted
Selective Software-Based Self-Testing for Multi-Core Systems," in Proceedings of
the IEEE International Symposium on Defect and Fault Tolerance in VLSI and
Nanotechnology Systems (DFT) , New York City, 2013, pp. 45-51.
209
[92] C. Slayman, "Soft Error Trends and Mitigation Techniques in Memory Devices,"
in Proceedings of the Annual Reliability and Maintainability Symposium , Lake
Buena Vista, 2011, pp. 1-5.
[93] J.C. Smolens, B.T. Gold, B. Falsafi, and J.C. Hoe, "Reunion: Complexity-
Effective Multicore Redundancy," in Proceedings of the 39th Annual IEEE/ACM
International Symposium on Microarchitecture, Orlando, 2006, pp. 223-234.
[94] D.J. Sorin, M.M.K. Martin, M.D. Hill, and D.A. Wood, "SafetyNet: Improving the
Availability of Shared Memory Multiprocessors with Global
Checkpoint/Recovery," in Proceedings of the 29th Annual International
Symposium on Computer Architecture, Anchorage, 2002, pp. 123-134.
[95] J. Srinivasan, S.V. Adve, P. Bose, and J.A. Rivers, "Exploiting Structural
Duplication for Lifetime Reliability Enhancement," in Proceedings of the 32nd
International Symposium on Computer Architecture, Madison, 2005, pp. 520-531.
[96] J. Srinivasan, S.V. Adve, P. Bose, and J.A. Rivers, "The Case for Lifetime
Reliability-Aware Microprocessors," in Proceedings of the 31st Annual
International Symposium on Computer Architecture, München, 2004, pp. 276-287.
[97] J. Srinivasan, S.V. Adve, P. Bose, and J.A Rivers, "The Impact of Technology
Scaling on Lifetime Reliability," in International Conference on Dependable
Systems and Networks, Florence, 2004, pp. 177-186.
[98] John A. Stratton et al., "Parboil: A Revised Benchmark Suite for Scientific and
Commercial Throughput Computing," University of Illinois at Urbana-
Champaign, Champaign, Technical Report IMPACT-12-01, March 2012.
[Online]. http://impact.crhc.illinois.edu/Parboil/parboil.aspx
[99] G. Theodorou, N. Kranitis, A. Paschalis, and D. Gizopoulos, "A Software-Based
Self-Test Methodology for On-Line Testing of Processor Caches," in IEEE
International Test Conference, Anaheim, 2011, pp. 1-10.
210
[100] J.F. Wakerly, "Microcomputer Reliability Improvement Using Triple-Modular
Redundancy," Proceedings of the IEEE, vol. 64, no. 6, pp. 889-895, June 1976.
[101] N.J. Wang and S.J. Patel, "ReStore: Symptom-Based Soft Error Detection in
Microprocessors," IEEE Transactions on Dependable and Secure Computing, vol.
3, no. 3, pp. 188-201, August 2006.
[102] P.M. Wells, K. Chakraborty, and G.S. Sohi, "Adapting to Intermittent Faults in
Future Multicore Systems," in Proceedings of the 16th International Conference
on Parallel Architecture and Compilation Techniques, Brasov, 2007, pp. 431-431.
[103] T.J. Wood, "The Test and Debug Features of the AMD-K7TM Microprocessor,"
in Proceedings of the International Test Conference, Atlantic City, 1999, pp. 130-
136.
[104] B. Zandian, W. Dweik, Suk Hun Kang, T. Punihaole, and M. Annavaram,
"WearMon: Reliability Monitoring Using Adaptive Critical Path Testing," in
IEEE/IFIP International Conference on Dependable Systems and Networks,
Chicago, 2010, pp. 151-160.
[105] Quming Zhou and K. Mohanram, "Gate Sizing to Radiation Harden
Combinational Logic," IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems , vol. 25, no. 1, pp. 155-166, January 2006.
Abstract
As technology scales further down into the nanometer regime, chip manufacturers are able to integrate billions of transistors on a single chip, boosting the performance of today's multicore and many-core systems. On the other hand, smaller transistor devices become more vulnerable to various faults due to a higher probability of manufacturing defects, higher susceptibility to single event upsets, greater process variation, and faster wearout rates. Today, computer chips are tested extensively after fabrication to weed out any chips that do not meet the functional specifications. The chips that meet the functional specifications are then used to build computer systems. Once these systems, particularly in low-end consumer market segments, enter in-field operation they are not actively tested, other than through simple error correcting codes that deal with soft errors. Since reliability is a critical requirement for high-end mission-critical systems, such systems traditionally employ system-level redundancy for in-field monitoring and handling of faults. However, technology scaling is expected to make reliability a first-order concern for the in-field operation of even low-end computing systems. Thus, in-field fault handling mechanisms that can detect, diagnose, recover from, and tolerate different kinds of faults have to be deployed.

Unlike mission-critical systems, which can afford expensive fault handling mechanisms, low-end computing systems are highly cost-sensitive. As a result, in-field fault handling mechanisms have to be stringently cost-effective and should achieve the highest fault coverage for given area, performance, and power overheads. Faults that occur during in-field operation (i.e., in-field faults) are classified into three categories: transient, intermittent, and permanent. Specialized fault handling mechanisms can be deployed orthogonally to mitigate the effects of the three fault categories. This dissertation presents four techniques for ultra-low-cost in-field fault handling in chip multiprocessors (CMPs) and many-core systems such as graphics processing units (GPUs).

In the first part of this dissertation we present SignTest, a usage-signature adaptive periodic testing mechanism for microprocessors in multicore systems. SignTest strives to reduce testing energy and time while maintaining high coverage against permanent faults. To achieve this goal, SignTest tracks the usage of the microprocessor functional blocks during every execution epoch and dynamically steers the testing phase according to the usage level of each functional block. The evaluation results show that a conservative implementation of SignTest maintains maximum fault coverage with up to 24% savings in testing energy and time. Alternatively, a relaxed implementation of SignTest achieves up to 51% savings in testing energy and time while covering 93% of the expected permanent fault sites.

Once approaches such as SignTest make it feasible to perform periodic fault detection at very low cost, the next concern is how to use the detection outcomes to improve fault diagnosis and recovery. The second part of this dissertation therefore focuses on improving the classification granularity of in-field faults. A new class of exceptions, called reliability-aware exceptions (RAEs), is proposed. RAEs use a fault history log to classify faults detected in the microprocessor array structures as transient, intermittent, or permanent.
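To make the history-based classification idea concrete, the sketch below shows one plausible way a fault history log could escalate a faulty array entry from transient to intermittent to permanent. It is a minimal illustration only: the observation window, the thresholds, and the FaultHistoryLog/classify names are hypothetical placeholders, not the parameters or interfaces defined in this dissertation.

```python
# Illustrative sketch of history-based fault classification in the spirit of
# reliability-aware exceptions (RAEs). Window length and thresholds are
# hypothetical placeholders, not the dissertation's actual values.
from collections import defaultdict, deque

TRANSIENT, INTERMITTENT, PERMANENT = "transient", "intermittent", "permanent"

class FaultHistoryLog:
    def __init__(self, window_cycles=100_000, intermittent_threshold=3,
                 permanent_threshold=10):
        self.window = window_cycles
        self.int_thr = intermittent_threshold
        self.perm_thr = permanent_threshold
        # (structure name, entry index) -> timestamps of detected faults
        self.history = defaultdict(deque)

    def classify(self, structure, entry, cycle):
        """Record a detected fault and return its presumed category."""
        log = self.history[(structure, entry)]
        log.append(cycle)
        # Forget detections that fall outside the observation window.
        while log and cycle - log[0] > self.window:
            log.popleft()
        if len(log) >= self.perm_thr:
            return PERMANENT       # de-configure the entry for good
        if len(log) >= self.int_thr:
            return INTERMITTENT    # de-configure the entry temporarily
        return TRANSIENT           # roll back and re-execute

# Repeated faults in the same ROB entry escalate the classification.
log = FaultHistoryLog()
print(log.classify("ROB", 12, cycle=1_000))   # transient
print(log.classify("ROB", 12, cycle=2_000))   # transient
print(log.classify("ROB", 12, cycle=3_000))   # intermittent
```

In this sketch a single detection is treated as transient, repeated detections within the window escalate the entry to intermittent, and persistent recurrence marks it permanent, mirroring the three handling paths described above.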
Most previously proposed approaches classify faults only as transient or non-transient, and handle all non-transient faults as permanent. Treating every non-transient fault as permanent, however, leads to premature and often unnecessary component replacement. Distinguishing intermittent faults and handling them as such improves the effectiveness of fault handling by reducing performance degradation while simultaneously slowing down device wearout.

The RAE handlers have the ability to manipulate faulty entries in the array structures to recover from and tolerate all three categories of faults. For transient faults, the RAE handler leverages the existing roll-back mechanism to restart execution from the instruction that triggers the fault. For intermittent faults, the RAE handler exploits the inherent redundancy in the array structures to de-configure the faulty entries temporarily. Entries that experience permanent faults are de-configured permanently. RAEs improve the reliability of the load/store queue (LSQ) and the reorder buffer (ROB) in an out-of-order processor by average factors of 1.3 and 1.95, respectively.

The remaining parts of the dissertation present solutions for handling execution faults that occur in throughput-oriented graphics processing units (GPUs). GPUs are becoming an attractive option for power-efficient throughput computing even for general-purpose applications that go beyond traditional multimedia workloads. This dissertation presents two GPU-specific fault handling frameworks which take advantage of the massive resource replication present in GPUs to detect, correct, and tolerate faults.

The first framework is Warped-Shield, which tolerates permanent (hard) faults in the GPU execution lanes by rerouting computations around faulty lanes. Warped-Shield achieves this goal through three techniques: thread shuffling, dynamic warp deformation, and warp shuffling. The three techniques work in complementary ways, allowing the Warped-Shield framework to tolerate failure rates as high as 50% in the execution lanes with only 14% performance degradation.

Motivated by the insights obtained from Warped-Shield, the last contribution of this dissertation is Warped-Redundant Execution (Warped-RE), which provides detection and correction for transient, intermittent, and permanent faults in the GPU execution lanes. Warped-Shield assumes that fault locations are identified using other, orthogonal approaches, whereas Warped-RE relies on spatial redundant execution to achieve fault detection and correction. During fault-free execution, dual modular redundant (DMR) execution is enforced to detect faults. When a fault is detected, triple modular redundant (TMR) execution is activated to correct the fault and identify potentially faulty execution lanes. As long as the detected faults are transient, DMR execution is restored after correction completes. After the first non-transient fault is detected, TMR execution is retained to guarantee correctness.

To mitigate the high overheads of naïve DMR/TMR execution, Warped-RE leverages the inherent redundancy among threads within the same warp and the underutilization of the GPU execution lanes to achieve low-cost opportunistic redundant execution. On average, the performance overhead of DMR execution for fault detection is 8.4%, and once a permanent fault is detected the TMR execution overhead is still just 29%.
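The DMR/TMR escalation behind Warped-RE can likewise be illustrated with a small functional sketch. The toy alu model, the lane-fault encoding, and the dmr_execute/tmr_execute names below are illustrative assumptions rather than the actual Warped-RE hardware design; the intent is only to show how disagreement between two redundant copies triggers a majority-voted third execution.

```python
# Illustrative sketch of DMR execution escalating to TMR on a mismatch, in the
# spirit of Warped-RE. Lane counts, the ALU model, and the pairing policy are
# simplified placeholders; real GPUs apply this per warp in hardware.
from collections import Counter

def alu(op, a, b, faulty=False):
    """Toy execution lane: a faulty lane flips the least significant result bit."""
    result = a + b if op == "add" else a * b
    return result ^ 1 if faulty else result

def tmr_execute(op, a, b, lane_faults):
    """Run the operation on three lanes and majority-vote the outputs."""
    outputs = [alu(op, a, b, faulty=f) for f in lane_faults]
    value, votes = Counter(outputs).most_common(1)[0]
    assert votes >= 2, "more than one faulty lane: uncorrectable"
    return value

def dmr_execute(op, a, b, lane_faults):
    """Run on two lanes; escalate to TMR only when the two copies disagree."""
    r0 = alu(op, a, b, faulty=lane_faults[0])
    r1 = alu(op, a, b, faulty=lane_faults[1])
    if r0 == r1:
        return r0                                   # fault-free fast path
    # Disagreement detected: borrow a third (healthy) lane and majority-vote.
    return tmr_execute(op, a, b, lane_faults + [False])

# One lane of the DMR pair is faulty; TMR still recovers the correct value.
print(dmr_execute("add", 3, 4, [True, False]))      # prints 7
```

In the sketch, the fault-free path costs only the duplicated execution, while the voting path both corrects the value and exposes which lane disagreed with the majority, mirroring how Warped-RE identifies candidate faulty lanes.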