DEFECT-TOLERANCE FRAMEWORK FOR GENERAL PURPOSE
PROCESSORS
by
Hsunwei Hsiung
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2014
Copyright 2014 Hsunwei Hsiung
Dedication
To my family
Acknowledgements
This work would not have been possible without the guidance of my advisor, Dr. Sandeep Gupta. I
appreciate immensely his contributions of ideas, time, inspiration, and funding. He has, consciously
or unconsciously, taught me how to navigate the known unknowns and how to anticipate the unknown
unknowns in research. Through his insightful perspective, he has also taught me how to
explain difficult concepts and convoluted ideas with intuitive engineering thinking. I have found my
Ph.D. pursuit rewarding and stimulating under his guidance.
I would like to thank the members of my committee, Dr. Melvin Breuer, Dr. Murali
Annavaram, Dr. Ramesh Govindan, Dr. Shahin Nazarian, and Dr. Aiichiro Nakano. Their
expertise and experience in their respective fields have helped many aspects and the
completeness of this dissertation. I value their time, suggestions, and critiques.
My group members and friends have also helped facilitate several parts of this work.
Byeongju Cha has generously helped with our OS experiments, and his meticulous attitude toward these
experiments is highly appreciated. Da Cheng has helped with the maintenance of the server and
CAD tools. Our collaboration with Bin Liu on memory reliability has also been exciting. Others,
including Yue Gao, Doochul Shin, Mohammad Mirza-Aghatabar, and Prasanjeet Das, have also
provided their comments on my proposal and dissertation, for which I am thankful.
The funding from the National Science Foundation and the Semiconductor Research Corporation
has also made this work possible.
Lastly, I am deeply grateful to my family for their encouragement and unconditional support
for all my pursuits. My brother’s adventurous mind always reminds me that possibilities are
limitless. My parents' understanding and vision are the foundation of all that I have accomplished.
Thank you.
Table of Contents
Dedication ....................................................................................................................................... 1
Acknowledgements ......................................................................................................................... 2
List of Figures ................................................................................................................................. 6
List of Tables .................................................................................................................................. 8
Abstract ......................................................................................................................................... 11
Chapter 1. Introduction ............................................................................................................ 12
1.1 Motivation ............................................................................................................................. 12
1.2 General purpose computing system hierarchy ...................................................................... 14
1.3 Correctness at system layers ................................................................................................. 16
1.4 Related research on advanced defect-tolerance approaches ................................................. 19
1.5 Metrics for defect-tolerance approaches ............................................................................... 22
1.6 Dissertation outline ............................................................................................................... 24
Chapter 2. Last level cache defect tolerance ........................................................................... 26
2.1 The significance of the last level cache ................................................................................ 26
2.2 SRAM defects in LLC .......................................................................................................... 27
2.3 Exploration for defect-tolerance approaches ........................................................................ 27
2.4 Spectrum of potential approaches ......................................................................................... 38
2.5 Analysis of our approaches and defect map consideration ................................................... 41
2.6 Related cache research ........................................................................................................... 44
2.7 PCD cost-benefits analysis ................................................................................................... 45
Chapter 3. Defect-tolerance framework for processors ........................................................ 49
3.1 Chip organization for multicore processors .......................................................................... 49
3.2 Defect-tolerance components and categories ........................................................................ 50
3.3 Cross-layered defect-tolerance framework ........................................................................... 60
3.4 Target modules identification in modern processors ............................................................ 70
Chapter 4. Framework application on the target modules ................................................... 72
4.1 Branch prediction modules ................................................................................................... 72
4.2 Arithmetic datapath modules ................................................................................................ 81
4.3 Caching modules ................................................................................................................... 93
4.4 Queuing modules ................................................................................................................ 101
Chapter 5. Defect-tolerance approaches for datapath modules ......................................... 108
5.1 Floating-point unit .............................................................................................................. 109
5.2 Arithmetic logic unit ........................................................................................................... 114
5.3 Integer multiplier ................................................................................................................ 123
Chapter 6. Projecting defect-tolerance efficiency for multicore processors ...................... 136
6.1 Scope of cross-layered approaches ..................................................................................... 136
6.2 Multicore organization studied ........................................................................................... 137
6.3 Projection framework.......................................................................................................... 138
6.4 Multicore performance estimation ...................................................................................... 142
6.5 Approach evaluation ........................................................................................................... 144
6.6 Performance-per-area projection ........................................................................................ 156
6.7 Defect-constrained evolution: A new evolution path ......................................................... 161
Chapter 7. Conclusions ........................................................................................................... 163
7.1 Contributions....................................................................................................................... 163
7.2 Future research .................................................................................................................... 164
References ........................................................................................................................... 168
List of Figures
Figure 1. Layers in the system hierarchy ...................................................................................... 15
Figure 2. YPA of the 3MB cache with various number of spares ................................................ 30
Figure 3. LLC YPA by varying the number of sub-arrays ........................................................... 31
Figure 4. Two-way cache organization ......................................................................................... 33
Figure 5. LLC and main memory mapping .................................................................................. 35
Figure 6. Address translation ........................................................................................................ 37
Figure 7. (48, 1, 2) Cumulative grades percentage ....................................................................... 46
Figure 8. EC-per-area: optimal configurations of different approaches ....................................... 47
Figure 9. EGOPS-per-area: optimal EC-per-area configurations of different approaches ........... 48
Figure 10. Defect-tolerance approaches exploration flowchart .................................................... 62
Figure 11. Approach evaluation and grade setup flowchart ......................................................... 69
Figure 12. gshare predictor ........................................................................................................... 73
Figure 13. Branch target buffer (direct-mapped) .......................................................................... 74
Figure 14. TLB operation ............................................................................................................. 98
Figure 15. Reservation station entry ........................................................................................... 106
Figure 16. Efficiency for FP-DIS................................................................................................ 111
Figure 17. Efficiency for FPMUL-DIS ....................................................................................... 113
Figure 18. ALU-type instruction distribution ............................................................................. 115
Figure 19. ALU-type and ADD instruction distribution ............................................................. 117
Figure 20. Cumulative distribution of bit-width required for ADD instructions ........................ 117
Figure 21. A generic organization for ALU instruction execution ............................................. 118
Figure 22. Adder error-masking microarchitecture support ....................................................... 119
Figure 23. Efficiency for ALU-disabling and AdderEM ............................................................ 123
Figure 24. 4×4 CSA multiplier and its state machine controller ................................................ 126
Figure 25. Enhanced 4×4 CSA multiplier and modified controller ............................................ 127
Figure 26. Tier yields for different options ................................................................................. 132
Figure 27. Normalized sellable chips per area ............................................................................ 133
Figure 28. Overall efficiency for all options ............................................................................... 134
Figure 29. 4-core processor organization .................................................................................... 137
Figure 30. 45nm per-core power and performance ..................................................................... 139
Figure 31. Scaled per-core power and performance ................................................................... 140
Figure 32. (a) ALU area breakdown (b) Example of grades ...................................................... 147
Figure 33. Datapath-enhancement efficiency projection ............................................................ 152
Figure 34. PCD efficiency projection ......................................................................................... 154
Figure 35. Core-disabling efficiency projection ......................................................................... 154
Figure 36. Cross-layered approaches efficiency for FO-evolution processors ........................... 155
Figure 37. Cross-layered approaches efficiency for PC-evolution processors ........................... 156
Figure 38. Without advanced DT (a) EGOPSPA (b) Yield and area ......................................... 157
Figure 39. Performance-per-area projection for FO-evolution processors ................................. 158
Figure 40. Performance-per-area for PC-evolution processors. ................................................. 160
Figure 41. Performance-per-area for defect-constrained evolution processors. ......................... 162
List of Tables
Table I. Parameters to calculate yield-per-area ............................................................................. 23
Table II. Parameters to calculate EGOPSPA ................................................................................ 24
Table III. SRAM bit-cell height and M1 metal pitch .................................................................... 29
Table IV. Highest YPA of Optimal SRAM Design ...................................................................... 31
Table V. LLC defect-tolerance components ................................................................................. 38
Table VI. LLC approach implementation impact analysis ........................................................... 42
Table VII. Microarchitecture target modules area percentage ...................................................... 71
Table VIII. Branch prediction modules approaches impact analysis ........................................... 78
Table IX. Defective BTB microarchitecture level performance penalty ...................................... 79
Table X. Scenarios when using no-action for defective BTB module ......................................... 81
Table XI. Datapath approaches impact analysis ........................................................................... 90
Table XII. Defective datapath modules microarchitecture level performance penalties .............. 91
Table XIII. Module modified of approaches for datapath modules .............................................. 92
Table XIV. Low level cache approach implementation impact analysis ...................................... 96
Table XV. Defective low level cache microarchitecture level performance penalties ................. 97
Table XVI. Defective TLB microarchitecture level performance penalty ................................. 101
Table XVII. Defective ROB behavior ........................................................................................ 105
Table XVIII. Summary of approaches for datapath modules ..................................................... 108
Table XIX. Floating-point instruction distribution ..................................................................... 110
Table XX. FPMUL instruction distribution ................................................................................. 112
Table XXI. FP-DIS and FPMUL-DIS tiers and yields ............................................................... 112
Table XXII. Performance of processors with different numbers of ALUs ................................. 116
Table XXIII. Performance comparison for ALU-disabling and AdderEM ................................ 121
Table XXIV. Tier yield formulation for ALU enhancement ...................................................... 121
Table XXV. Instruction distribution of MUL ............................................................................. 124
Table XXVI. Multiplexors’ select inputs encoding .................................................................... 128
Table XXVII. Enhancement options: performance and overhead estimation ............................ 130
Table XXVIII. Multiplier enhancement parameters ................................................................... 131
Table XXIX. Performance comparison between MUL-DIS and operand-shifting .................... 135
Table XXX. Scope of the approaches ......................................................................................... 137
Table XXXI. Technology parameters ......................................................................................... 138
Table XXXII. Power-constrained evolution processors ............................................................. 141
Table XXXIII. EGOPSPA formulation parameters.................................................................... 142
Table XXXIV. Utilization formulation parameters .................................................................... 144
Table XXXV. FPU enhancement grades .................................................................................... 145
Table XXXVI. MUL operand-shifting enhancement grades ..................................................... 146
Table XXXVII. ALU enhancement grades ............................................................................... 146
Table XXXVIII. LLC parameters for FO- and PC- evolution processors .................................. 153
Table XXXIX. Performance-per-area projection types .............................................................. 157
Table XL. Relevant parameters comparing ideal fabrication for FO-evolution processors ....... 159
Table XLI. Relevant parameters comparing ideal fabrication for PC-evolution processors ...... 160
Table XLII. Defect-constrained evolution processor parameters ............................................... 161
Table XLIII. Yield of non-SRAM-bitcell parts in LLC ............................................................. 161
Table XLIV. Relevant parameters comparing ideal fabrication for defect-constrained evolution .............
Abstract
As CMOS fabrication technology continues to move deeper into the nano-scale, circuits’
susceptibility to manufacturing imperfections increases, and the improvements in yield, power,
and delay provided by each major scaling generation have started to slow down or even reverse.
This is especially true for microprocessors, which use aggressive design and leading-edge
technology. It is increasingly difficult to guarantee their correctness and conformance to
performance specifications, which leads to reduction in microprocessor yield. Ensuring
continuous benefits from scaling for microprocessors has been a challenge due to this adverse
trend. Classic defect-tolerance approaches have been used to mitigate this trend by adding
explicit redundancy and using it to ensure correctness of the circuit layer specifications. These
approaches impose unnecessarily stringent requirements and incur overheads that increasingly
compromise the economics.
In this dissertation, we introduce a new defect-tolerance concept by taking a global view of
the role of a microarchitecture module in the overall system. This view allows the relaxation of
the overly stringent correctness requirements imposed by classic approaches and opens up
opportunities for new defect-tolerance approaches that rely on implicit redundancy and hence are
uniquely efficient. We also introduce a framework to systematically explore the space of possible
defect-tolerance approaches that target microarchitecture modules under this global view. In
addition to approaches that can be implemented during a microprocessor’s design phase, our
framework also identifies post-fabrication approaches which can salvage microprocessors
beyond hardware repair. We demonstrate that the approaches significantly improve
microprocessors’ economics at the wafer level.
Chapter 1. Introduction
We introduce a new defect-tolerance paradigm and a corresponding defect-tolerance
framework for general purpose processor chips. The framework systematically identifies defect-
tolerance opportunities by analyzing defects in various modules from different perspectives: at
circuit, microarchitecture, instruction set architecture (ISA), and software/OS layers of such
systems’ hierarchy. The framework identifies a rich set of inherent defect-tolerance opportunities
via such systematic enumeration, to achieve significantly more cost-effective defect tolerance for
processor chips.
1.1 Motivation
As CMOS fabrication technology continues to move deeper into nano-scale, the
improvements in yield, power, and delay provided by each major scaling generation have started to slow
down, or even reverse. The reason for this trend is that scaling also increases circuits’
susceptibility to manufacturing imperfections, such as process variations and random defects.
Imperfections affecting circuit operation are referred to as defects. As a result, it is increasingly
difficult to guarantee the correctness of chips and their conformance to the performance
specifications. This leads to a reduction in yield, especially at the top levels of performance.
Among all the semiconductor products, processors are the key drivers since they use
aggressive designs and leading-edge manufacturing technology [1]. It is a great challenge to
ensure that processors and other semiconductor products continue to benefit from technology
scaling. Traditionally, circuit layer redundancy has been adopted to guarantee the functional
correctness of both memory and logic circuits in digital systems in order to increase yield.
Such redundancy is implemented in the form of spare copies of selected modules and
reconfiguration circuits to circumvent the parts of circuits that are defective due to the
manufacturing imperfections. Triple-modular redundancy (TMR) has been utilized to implement
the two-out-of-three voting mechanism for logic circuits. In semiconductor memories, spare
rows and columns are added within each memory array to replace rows and columns with
defective memory cells, so memory modules with full capacities can be supplied. Error
correcting code (ECC) mainly targets soft errors and is a form of information redundancy to
detect and correct errors in memory arrays.
Spare rows and columns have been effective for yield improvement of memories. However,
as we will show in Chapter 2, the usefulness of spares will be limited in fundamental ways for
large memories fabricated in future technologies. When we explore new defect-tolerance
approaches we assign high priority to circuit modules that have larger areas, since the probability
that a module is defective is an exponential function of its area. As TMR will at least triple the
area of the module in the original design that it protects, the 200+% area overhead for protecting
the larger area modules is prohibitively expensive. To the best of our knowledge, TMR has never
been used for logic modules in modern processor chips. Under the high defect density and the
high process variations expected in the future technologies, TMR will be even less effective
since the additional copies of modules are also subject to these imperfections.
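The area dependence invoked above can be made concrete with the standard Poisson random-defect yield model; the following is an illustrative sketch under that assumed model, not a formulation taken from this dissertation:

```latex
% Poisson random-defect model: with defect density D and module
% critical area A, the probability that the module is defect-free is
Y(A) = e^{-A D}
% TMR triples the area and requires at least two of the three copies
% to be defect-free (ignoring the voter):
Y_{\mathrm{TMR}} = 3\,Y(A)^{2}\bigl(1 - Y(A)\bigr) + Y(A)^{3}
% Since Y_TMR < Y(A) whenever Y(A) < 0.5, TMR hurts yield precisely
% for the large, defect-prone modules it would be meant to protect.
```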
It is clear that the classic defect-tolerance approaches focus on the functionality of
individual circuit modules. Classical manufacturing testing approaches also check the correctness
of individual circuit modules in the presence of likely defects and variations within, to ensure that
only functionally correct chips are shipped. However, as we will see ahead, for many modules
this imposes an overly stringent correctness requirement upon the manufactured chips from the
view of system operations. In this research, we develop a defect-tolerance framework to explore
the possible relaxation of the correctness requirement at different layers of the general purpose
computing system hierarchy. By satisfying these relaxed correctness requirements with little or
no modifications of the modules in systems, the economics of processor chips can be
significantly improved for the future technologies.
1.2 General purpose computing system hierarchy
In this section, we introduce the terminology and the system hierarchy.
1.2.1 System layers
As shown in Figure 1, a general purpose computing system consists of the user layer (UL),
software/OS layer (SL), ISA layer (ISAL), microarchitecture layer (μAL), and circuit layer (CL).
The main purpose of the layered system is to facilitate hardware design/manufacturing, and
software/system development. At the user layer, users view the system as the combination of
applications, system software/OS and hardware that responds to their inputs per specifications.
At the software/OS layer, the abstract view of the system includes the processors as processing
units and main memory as storage for program data and instructions. At the ISA layer, the view
of the system includes first, a set of instructions and registers defined in the ISA and
implemented in the processor, and second, variables and instructions of a compiled software
program stored in the main memory, i.e., the components of software programs. At the
microarchitecture layer, hardware modules are defined to carry out the general functionalities of
a stored-program computer, such as fetch, decode, execute and write-back, to carry out the
functionalities of the instructions defined in the ISA. At the circuit layer, the modules are
constructed using blocks of logic and memory circuits to support the abovementioned
functionalities.
1.2.2 View of the system layers
Correctness at each layer is observed from the view of the layer. At the user layer, the
observed output from the system is the users’ reception of data, visuals, or audio. At the software/OS
layer, system output observed by a software process running on the system is the I/O access and
the perceptible content stored in main memory. (The perceptible content of a software process
refers to the content belonging to the virtual memory space of the process. The process is unaware of
the memory content that is not in its virtual space.) Outputs observed at the ISA layer include
the arithmetic and logic instructions’ outputs written to the architecture registers (or the memory),
the outcome of register accesses (store instructions), and the outcome of memory accesses (load
instructions). Output observed at the microarchitecture layer is the outcome produced by each
Figure 1. Layers in the system hierarchy
16
microarchitecture module. Output observed at the circuit layer is the clock-to-clock responses
from circuits.

Figure 1. Layers in the system hierarchy
To streamline the discussion and analysis ahead, we simplify the definition of system output
at software/OS and ISA layers. Software/OS layer outputs include only the perceptible content in
main memory, since I/O functions can be memory-mapped. The ISA layer’s output is simplified
to include the content stored in the architecture registers and the content in memory, since
instructions operate on operands and store the results to architecture registers or memory. Both
the states of architecture registers and the states of visible memory will be collectively denoted as
visible states to represent the output of the hardware layers (μAL and CL).
1.3 Correctness at system layers
Conventionally, the correctness of a processor is defined and measured at circuit layer. This is
due to the fact that the correctness of a fabricated processor chip is guaranteed via manufacturing
testing, which is classically performed at circuit-layer in a block-by-block manner using
corresponding correctness criteria. For example, structural test and functional test commonly
used for the manufacturing testing target the individual circuit modules. The notion of the
processor’s correctness at circuit layer is defined by the matching of clock-to-clock responses
between the actual silicon implementation and the expectation from the circuit designers,
simulators, or an infallible (golden) implementation. An important consequence of this is that
classical defect-tolerance approaches also focus on ensuring the correctness of functionality at
the circuit layer.
Modern microarchitectures support advanced mechanisms such as out-of-order execution,
speculation, caching, etc. to provide high performance. Contemporary computing systems create
the illusion of concurrency to the end users by context-switching between different software
processes running on the processor. Such features make it impractical to define the correctness as
the completion of a task at a specific clock cycle, or even over an interval of clock cycles, with
circuit layer correctness requirement being satisfied at every clock cycle. Hence, the overly
stringent correctness requirements at the circuit layer can limit the opportunities for defect
tolerance.
Correctness observed at each layer can be defined as follows. At the user layer, users’
perception of correctness depends on the type of application. Applications in a general purpose
computing system can be grouped into two categories: critical applications and intrinsically
variation-tolerant applications. Critical applications include financial applications, OS, system
software, etc. The user-level outcomes of the critical applications should be deterministic and
intolerant to variations. The intrinsically variation-tolerant applications include the processing of
audio, video, graphics, etc. For such applications, deviation from nominal specifications is often
allowed without detectable compromises to the users. In other words, in critical applications the
output of the software/OS layer (to be described ahead) is directly presented to the user. Hence,
there is no difference between the user layer’s correctness and the software/OS layer’s
correctness for critical applications. In contrast, for variation-tolerant applications, user layer
may hold some levels of tolerance to errors, i.e., to deviations from the nominal specifications
used to define correctness at the software/OS layer.
At the software/OS layer, correctness is defined as the correct final states stored in the
perceptible memory space to the software processes. Assuming that a program’s executed
perceptible states can be pre-determined by an oracle, a fabricated copy of a processor is said to be
correct if the program’s execution on the copy results in the same states as the pre-determined
states.
At the ISA layer, correctness for a software process can be defined as the correct content
stored in the visible states in the correct order by the processor. The correct order is the order in
which the instructions of the software process are fetched by the hardware non-speculatively, i.e., the
instruction fetching order as if the instructions were fetched, decoded, executed, and written back
one at a time. The execution of a program in a processor equipped with advanced features such
as speculation and out-of-order execution does not necessarily execute the instructions along the
finalized path of a finished program. Nevertheless, the execution results are subject to
verification and are serialized according to the original program order before being written back to the
visible states. The process of writing the execution results back to the visible states is referred
to as commit in the microarchitecture literature.
In this research, the states of the on-chip caches are included in the visible states of the ISA.
In the microarchitecture with caches, the commit procedure is defined as writing results to the
data caches and the architecture registers. The committed results in the cache are viewed as valid
and final, since the results may be used for other instructions and eventually be written back to
the main memory. Therefore, the data written into the caches can be considered as visible to the
ISA layer.
Correctness at the microarchitecture layer is defined by the behavioral description of each
module. A processor copy is said to be correct at microarchitecture layer if each module behaves
according to its definition in the microarchitecture. For example, regardless of an access miss or
hit, a cache module is expected to supply the requested data either from the cache itself, in case
of a hit, or from the next memory module in the hierarchy, in case of a miss.
In this research, we will systematically explore the system layers and enumerate possible
defect-tolerance approaches for microarchitecture modules by relaxing the unnecessarily
stringent correctness requirements. Approaches explored may gracefully degrade the system
performance in the presence of tolerable defects in the fabricated chip copies but maintain the
correctness requirement. We use a module-oriented framework for such exploration. The view of
microarchitecture module granularity allows us to gather relevant higher layer information, which
provides knowledge that helps identify unique defect-tolerance opportunities for the modules.
1.4 Related research on advanced defect-tolerance approaches
To address the insufficiency of current defect-tolerance methodology in processors, previous
studies have explored defect-tolerance approaches from the perspective of different layers in the
computing system. The authors in [2] [3] observed that the branch predictor is a performance
enhancement module. All defects in the predictor module will only degrade the prediction
accuracy hence the performance, but will not affect the correctness of any program execution. In
[4] the authors further investigated the performance degradation caused by the defect-induced
faults in a branch predictor, and suggested the possibility of exploiting such characteristics for
the purpose of yield improvement. These works exploit the fact that the output of the branch
predictor is subject to further verification and correction by other microarchitecture modules.
The multiplicities of the on-chip storage locations for instructions, data, and intermediate
computation values are also explored for defect-tolerance. The works of [5] [6] [7] [8] [9] [10]
[11] [12] exploit this microarchitecture layer characteristic by enhancing the cache’s
configurability to achieve defect-tolerance for on-chip caches. In [13] the authors exploit the
inherent redundancy of various memory-based modules built within the microarchitecture for
defect tolerance. Authors in [14] proposed to use fixed-point instructions to implement floating-
point instructions in processor chips with a defective floating-point unit. The work relaxes the
microarchitecture layer correctness requirement by implementing alternatives at ISA layer. In
our global view terminology, this approach relaxed the circuit layer correctness requirement by
revealing the microarchitecture layer and the ISA layer information to allow the approaches to be
developed.
Many studies have focused on computing with hardware that is error-prone due to transient
faults or hard defects. Dynamic implementation verification architecture (DIVA) was proposed in [15]
to add two hardware-based checking pipelines, computation check and communication check, to
verify the correctness of the execution before the write-back stage of the processor. The
computed results from the processor can only be written to register and memory if both checks
pass. Computation check verifies the result of functional units’ computation by comparing the
original result with a recomputed result produced by dedicated verification functional units.
Communication check verifies the correctness of the load and store instructions by re-executing
them. Both checking mechanisms leverage the fact that there is no dependency between the
instructions when the original results are ready to be verified; the verification units can thus be
designed efficiently and the operation can be performed without any stall. If one of the
verifications fails, the values derived from checking mechanisms will be written back to register
and memory, and will be used for restarting the flushed processor.
The authors in [16] proposed the Relax framework to expose hardware defects to the ISA layer and
software layer explicitly. The framework extends the notion of error to the ISA and compilers.
Relax framework, vulnerable instructions are put into relax code blocks. The execution in the
code block is subject to check and recovery. The checking mechanism of Relax depends on Argus
[17] or redundant multi-threading [18] to detect errors. Argus checks the run-time computed
dataflow and control signatures against the statically generated signatures embedded in the
program at compile time. Redundant multi-threading executes duplicated threads on other processor cores
simultaneously or executes the same thread twice in a processor core for error detection. In the
presence of hard defects, Relax relies on re-execution on the robust processor cores if available.
In the presence of soft errors induced by radiation particles, Relax can also re-execute the block
again on the same core.
The authors in [19] proposed Rescue to transform existing microarchitecture modules to
satisfy intra-cycle logic dependency, which allows fine-grained isolation of combinational
logic blocks through scan test. The transformation includes the use of additional pipeline stages,
redundant combinational logic blocks, additional shifters, etc. so the instructions can be diverted
from the defective combinational logic blocks. Rescue essentially enhances the configurability of
the modules and the corresponding modules’ controllers to achieve run-time defect
avoidance. As we will show in Chapter 3, these related studies are subsumed in our proposed
framework.
1.5 Metrics for defect-tolerance approaches
1.5.1 Module grades and processor tiers
To maintain the correctness in the presence of defects, a defect-tolerance approach may
modify the original designed operation at different system layers, so the defects can be tolerated
by configuring the modified operation. For example, spare rows-and-columns approach modifies
the circuit layer operation, so the memory array circuitry can be configured to use spare rows and
columns if defects present. The operations can be modified include circuit operation,
microarchitecture operation, system software/OS operation, and the semantics of programs.
(Concrete examples of modification on each type of operations will be shown in the later
chapters.) A defect-tolerance approach may also utilize the existing mechanisms in the system to
tolerate the defects. In either case, the system’s performance may differ in the presence of
defects in the fabricated processor copies since the modified operation or the existing mechanism
changes how the system proceeds. Hence, the following parameters can differ between a
processor without the approach and a processor using the approach: chip area, clock frequency,
number of clock cycles to complete an operation. Therefore, for a module using a defect-
tolerance approach, fabricated chip copies result in different grades of the module. For a
processor using defect-tolerance approaches on multiple modules, fabricated copies result in
different processor chip tiers with different performance. Each tier corresponds to a group of
fabricated copies with defects affecting the same set of modules, which degrade the system
performance in the same way. Each tier has a different yield, and each tier is graded based on its
performance.
1.5.2 Yield-per-area and expected-capacity-per-area
The economic profit of chip manufacturing is generally measured using a metric such as yield-
per-area (YPA), i.e., the sellable silicon area out of all the silicon area manufactured. YPA is the
metric to evaluate the classic defect-tolerance approaches. Classic approaches maintain the
correctness and the specification of the original design. For example, the authors in [20]
investigated optimizing the YPA metric for SRAM. When employing an approach, all processor
chips have the same area A. For a processor not using defect-tolerance approaches, i.e.,
unenhanced processors, the yield-per-area is calculated as $YPA_o = Y_o / A_o$. For a processor
using defect-tolerance approaches, i.e., enhanced processors, the yield-per-area is calculated as
$YPA_{enc} = \sum_{i=1}^{N_t} Y_{T_i} / A_{enc}$. Table I summarizes the related parameters.
Note that unenhanced processors can be equipped with classic defect-tolerance approaches, such
as explicitly added row/column redundancies in memories.
In addition, Expected-Capacity-per-Area (ECPA) can be used to measure the usable
percentage of on-chip cache bits per silicon area manufactured for advanced cache defect-
tolerance approaches, since the non-circuit layer performance of cache is directly related to cache
capacity.
Table I. Parameters to calculate yield-per-area
$Y_o$       Yield of the unenhanced processor
$Y_{T_i}$   Yield of tier $T_i$ of the enhanced processor
$A_o$       Chip area of the unenhanced processor
$A_{enc}$   Chip area of the enhanced processor
$N_t$       The number of processor tiers
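As a minimal numeric sketch of the two YPA formulas (all yields and areas below are hypothetical):

```python
# Unenhanced processor: YPA_o = Y_o / A_o.
Y_o, A_o = 0.60, 100.0            # yield and chip area (mm^2), hypothetical
ypa_o = Y_o / A_o

# Enhanced processor: YPA_enc = (sum of tier yields) / A_enc, where A_enc
# exceeds A_o by the area overhead of the defect-tolerance approach.
tier_yields = [0.55, 0.20, 0.10]  # Y_T1, Y_T2, Y_T3, hypothetical
A_enc = 104.0
ypa_enc = sum(tier_yields) / A_enc

print(f"YPA_o   = {ypa_o:.5f} per mm^2")
print(f"YPA_enc = {ypa_enc:.5f} per mm^2")  # higher despite the larger chip
```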
1.5.3 Performance-per-area and efficiency
To fully capture the effect of defect-tolerance approaches, Expected-Giga-Operation-per-
Second-per-Area (EGOPS-per-area, or EGOPSPA) is used to measure the performance-per-area
of processor chips fabricated. The EGOPSPA of an unenhanced processor and of an enhanced
processor are calculated as $EGOPSPA_o = IPC_o \times F_o \times Y_o / A_o$,
and $EGOPSPA_{enc} = \sum_{i=1}^{N_t} IPC_{T_i} \times F_{T_i} \times Y_{T_i} / A_{enc}$.
Table II summarizes the related parameters. In addition, we define an efficiency metric EFF to
measure how effective defect-tolerance approaches are at improving the gross computation
capability per silicon area. EFF is calculated as $EFF = EGOPSPA_{enc} / EGOPSPA_o$. Note
that some ISA layer approaches change the instruction counts of applications, and such an effect
must be taken into account. As we will show in Chapter 5, $IPC_{T_i}$ for a processor tier using
ISA layer approaches can be estimated by normalizing the measured $IPC$ by the ratio of
increase in instruction count.

Table II. Parameters to calculate EGOPSPA
$IPC_o$      Instruction-per-cycle of the unenhanced processor
$IPC_{T_i}$  Instruction-per-cycle of tier $T_i$ of the enhanced processor
$F_o$        Clock frequency of the unenhanced processor
$F_{T_i}$    Clock frequency of tier $T_i$ of the enhanced processor
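A corresponding numeric sketch for EGOPSPA and EFF, again with hypothetical IPC, frequency, yield, and area values:

```python
# Unenhanced processor: EGOPSPA_o = IPC_o * F_o * Y_o / A_o.
IPC_o, F_o, Y_o, A_o = 1.5, 3.0, 0.60, 100.0   # F in GHz, A in mm^2
egopspa_o = IPC_o * F_o * Y_o / A_o

# Enhanced processor: EGOPSPA_enc = sum_i IPC_Ti * F_Ti * Y_Ti / A_enc.
# Each tuple is (IPC_Ti, F_Ti, Y_Ti) for one tier; the first tier is defect-free.
tiers = [(1.5, 3.0, 0.55), (1.4, 3.0, 0.20), (1.2, 2.8, 0.10)]
A_enc = 104.0
egopspa_enc = sum(ipc * f * y for ipc, f, y in tiers) / A_enc

eff = egopspa_enc / egopspa_o   # efficiency of the defect-tolerance approaches
print(f"EGOPSPA_o = {egopspa_o:.4f}, EGOPSPA_enc = {egopspa_enc:.4f}, EFF = {eff:.3f}")

# For an ISA layer approach that inflates a program's instruction count,
# IPC_Ti would first be divided by the instruction-count inflation ratio.
```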
1.6 Dissertation outline
This dissertation is structured in seven chapters. The research motivation, the system layers
terminology and the defect-tolerance metrics are introduced in this chapter. In Chapter 2, we
explore alternative defect-tolerance opportunities through the system hierarchy for the last
level cache. An implementation of this approach is presented and used to demonstrate its benefits.
In Chapter 3, we derive a general microarchitecture module oriented defect-tolerance framework
which can systematically explore the defect-tolerance opportunities in general purpose
processors. The framework is applied to the modern processor to identify possible defect-
tolerance approaches in Chapter 4. In Chapter 5, we implement the promising approaches for
datapath modules in-depth and evaluate their effectiveness. In Chapter 6, we evaluate the
effectiveness of our approaches for multicore processors. In addition, we project the
effectiveness of these approaches for the processors of near-future technology generations. In the
last chapter, we conclude.
Chapter 2. Last level cache defect tolerance
In this chapter, new defect-tolerance approaches are developed for the last level cache (LLC)
modules on modern processor chips, as these are critical to modern processor chips’ yield.
A systematic approach is taken to explore the possible defect-tolerance approaches for the LLC
through the system hierarchy. A promising approach is then chosen for implementation and
evaluation. In the remaining chapters, we will use terms such as defective cache module,
defective SRAM row, etc., to refer to the specific microarchitecture module or a part of the
module which does not function according to its behavioral or electrical specification because of
hard defects. The term error refers to a binary value string which is interpreted differently from
the expected value because of defects.
2.1 The significance of the last level cache
Every state-of-the-art processor incorporates on-chip cache modules which are implemented
with dense SRAM bit-cells. Hence, cache modules have large areas in which, if defects are present,
circuit failures will result, e.g., short or open circuits. Cache modules are said to have
large critical areas [21]. In other words, caches are vulnerable to defects. Modern processors
dedicate a large fraction of chip area to caches, especially LLC. For example in Intel’s Itanium 2
Montecito processor, more than 80% of the die area is devoted to caches [22]. Furthermore, the
number of SRAM cells on processor chips is predicted to grow 2× every 3.5 years after 2013
[23]. This indicates that, out of all the failed processor chips screened by the testing procedures,
most will have a defective LLC. The LLC is thus the vulnerable part of processor chips.
Defect-tolerance approaches that target the LLC are expected to enhance chip yield effectively.
2.2 SRAM defects in LLC
A cache module can be roughly divided into the following parts: SRAM or CAM tag array,
SRAM data array, decoder, and caching logic. Among these parts, the data array occupies the majority
of the area and the transistor count in a typical processor cache module. Defects manifested in
the SRAM circuits can be modeled as logical faults, such as stuck-at-faults, stuck-open-faults,
coupling faults, data retention faults, etc. For a single SRAM bit-cell, stuck-at-faults and data
retention faults result in errors when read and write operations are performed on the defective
bit-cell. For a group of SRAM bit-cells with a coupling fault, the state of bit-cells, or a read or
write operation applied to a subset of these bit-cells can cause errors at other bit-cells. Data
retention faults manifest when read/write accesses to the neighboring bit-cells create
disturbances. Each type of fault leaves erroneous logic values in the SRAM array. Errors can be
produced at the modules’ outputs using read/write operations at the defective SRAM bit-cells,
and can be propagated to affect the correctness of the system layers.
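As an illustration of how these fault models turn hard defects into logical errors, the following behavioral sketch (not an electrical model, and with arbitrarily chosen fault sites) applies a stuck-at-1 fault and a coupling fault to an 8-bit word:

```python
# Behavioral sketch of two SRAM fault models acting on one 8-bit word.
STUCK_AT_1 = {3}       # bit 3 is stuck at logic 1 (hard defect)
COUPLING = {(5, 6)}    # writing 1 into bit 5 flips bit 6 (coupling fault)

def write_and_read(value: int) -> int:
    stored = value
    for aggressor, victim in COUPLING:
        if (value >> aggressor) & 1:    # aggressor cell written to 1...
            stored ^= 1 << victim       # ...disturbs the victim cell
    for bit in STUCK_AT_1:
        stored |= 1 << bit              # stuck-at-1 overrides the stored data
    return stored

observed = write_and_read(0b0010_0000)
print(f"wrote 0b00100000, read back {observed:#010b}")  # errors at bits 3 and 6
```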
2.3 Exploration for defect-tolerance approaches
This section first addresses the insufficiency and limitations of the classic circuit layer
approaches for LLC, and then explores new approaches that become possible when one takes a
more global view.
2.3.1 Insufficiency of the classic circuit layer approaches
Classic circuit layer spare-based approach for the caches replaces defective rows and columns
with defect-free spare rows and columns in the SRAM array, so the defects are prevented from
being excited, and errors are never generated. Error correcting codes (ECC), such as single error
correction/double error detection (SECDED) code, have been used in LLC to increase the
resilience against soft errors. The use of ECC for tolerance in hard defects at the expense of
degrading resilience to soft errors has been investigated [24] [25] [26]. When ECC is applied,
defects are not avoided actively. The errors produced due to defects can be checked or even
corrected by the error-masking ability of ECC. An integrated SRAM defect-tolerance approach
of combining spare rows-and-columns and ECC has been investigated [26]. The author has
shown that, when using double error correcting (DEC) code in SRAM to tolerate single bit-cell
hard defects, the reduction in the overall resilience to soft errors of the entire SRAM is negligible.
The reasons are that the resilience degrades only for the defective words, and a DEC-protected
word with a single bit-cell hard defect is effectively a SEC-protected word, i.e., the word is still
protected against a single soft error. However, DEC code is not commonly used in processor
caches because of the area and latency overheads. SECDED is more commonly implemented in
LLC and parity code is implemented in low level cache modules in processors which require
high reliability. Nevertheless, using SECDED to tolerate hard defects can expose the defective
word largely unprotected against soft errors.
Spare rows-and-columns approach has been effective for defect tolerance in SRAM arrays.
Nevertheless, as we will show later, by themselves spares will be less cost-effective in terms of
YPA under the high defect density and high process variation expected in future technologies.
The spare-based approach causes the following area overheads: the area of the spares and
reconfiguration circuitry, and the area needed to fit the wire track of reconfiguration circuitry
into the pitch of each SRAM row or column. To be able to replace any row containing one or
more defective bit-cells with $N_{sr}$ spare rows, a wordline driver fans out to the row it drives as
well as to the neighboring $N_{sr}$ rows through a de-multiplexor. The height of a row will be the
maximum of the height of a de-multiplexor’s output wire track or the height of a SRAM bit-cell.
It is imperative that the re-configuration circuitry for each row fits within the height of the row,
since otherwise the SRAM density will decrease dramatically as its layout will include much
unused area. Similar situation occurs for spare columns. As the number of spare rows and
columns increases beyond a certain level, the area of the SRAM increases dramatically. Table III
shows typical SRAM bit-cell height [27] [28] [29] and metal 1 layer pitch for recent technologies
[30] [31] [32]. The first two rows of the table are used to derive the third row, which makes it
clear that at most one spare row can be used without increasing the SRAM bit-cell spacing and
causing a dramatic increase in area. Next, a case study has been performed to demonstrate the
limitations of the spare rows-and-columns approach.

Table III. SRAM bit-cell height and M1 metal pitch
Technology               65nm       45nm       32nm
6T SRAM height (nm)      460 [27]   369 [28]   240 [29]
M1 pitch (nm)            210 [31]   150 [30]   112.5 [32]
Max. M1 wires per row*   2          2          2
* Assume: metal width = metal spacing
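The third row of Table III follows directly from the first two; a small sketch of that derivation, assuming (as in the table footnote) metal width equal to metal spacing, so one routed wire consumes one M1 pitch:

```python
# Max M1 wires per row = floor(SRAM bit-cell height / M1 pitch).
cells = {"65nm": (460, 210), "45nm": (369, 150), "32nm": (240, 112.5)}

for tech, (cell_height_nm, m1_pitch_nm) in cells.items():
    wires = int(cell_height_nm // m1_pitch_nm)
    print(f"{tech}: {wires} M1 wires fit within one SRAM row pitch")  # 2 each
```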
Case study:
A 3MB 12-way associative cache with 64-byte blocks is used as a case study throughout this
document. In our analysis and experiments using CACTI [33] (an industry-developed tool to
estimate SRAM cache sizes and other key values) assuming the tightest wire pitch allowed, only
one spare row and one spare column can be added to a sub-array without dramatically increasing
the overall area. The cache is divided into 48 equal-size sub-arrays. Each sub-array is equipped
with $N_{sr}$ spare rows and $N_{sc}$ spare columns. The YPA of the cache is plotted in Figure 2
for various numbers of spare rows and columns. CACTI is enhanced to include the area overheads
of spares. The area of the cache is then obtained for a 32nm technology. The SRAM bit-cells are
assumed to have a combined failure rate of $10^{-6}$ [1] (one failure per million bit-cells) for
having a defect-induced read, write, or retention failure. The interconnect wires are subject to
defects, and the interconnect yield is calculated via critical area analysis [20]. A defective
interconnect is assumed fatal to the whole LLC. It can be seen that the YPA value reaches its
maximum at $N_{sc} = 2$ and $N_{sr} = 1$ and decreases if higher numbers of spare rows and
columns are used.

Figure 2. YPA of the 3MB cache with various numbers of spares
By dividing the SRAM into smaller sub-arrays, we can increase the total number of defective
bit-cells that can be repaired without using too many spares in each sub-array. However,
increasing the number of sub-arrays also increases the interconnect wires in the array, and the
yield of such design is often limited by the yield of interconnect wires. By increasing the number
of spares ($N_{sr}$, $N_{sc}$) for different numbers of sub-arrays ($N_{SA}$), we calculate the YPA of each
configuration until the YPA trend starts falling. Figure 3 shows the numbers of spare rows and
columns that maximize YPA for different numbers of sub-arrays. Note that the highest
achievable YPA drops as the number of sub-arrays increases beyond a certain level. Because
interconnect wires grow with the number of sub-arrays, the overall yield is limited by the
interconnect yield rather than sub-array yield for designs with large numbers of sub-arrays. In
addition, the overall area also grows and further decreases YPA.
Table IV shows the optimal YPA that can be achieved by varying ($N_{SA}$, $N_{sr}$, $N_{sc}$) of the SRAM
cache under failure rates predicted for future technologies [1]. The optimal YPA drops 42% and
the yield is only 57% for this YPA-optimized design under the predicted failure rate. We obtain
optimal designs by enumerating various values of ($N_{sr}$, $N_{sc}$) for different numbers of sub-arrays
($N_{SA}$). The analysis clearly shows that by using only spares, low YPA and low yield will be
expected as the failure rate increases. Clearly, new defect-tolerance approaches must be
developed to achieve high YPA and yield in the future.
Figure 3. LLC YPA by varying the number of sub-arrays (x-axis: 12 to 384 sub-arrays; each point labeled with its YPA-maximizing ($N_{sr}$, $N_{sc}$))
Table IV. Highest YPA of Optimal SRAM Design
Failure rate          ($N_{SA}$, $N_{sr}$, $N_{sc}$)   Yield    Optimal YPA (1/μm²)
$10^{-6}$             (48, 1, 2)                       71.2%    $9.74 \times 10^{-8}$
$2.5 \times 10^{-6}$  (96, 1, 3)                       57.1%    $5.69 \times 10^{-8}$
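The shape of these results can be reproduced with a simple analytic model. The sketch below deliberately simplifies repair — it assumes any $k \le N_{sr} + N_{sc}$ defective bit-cells in a sub-array are repairable, an optimistic upper bound on real row/column replacement — and uses a hypothetical interconnect yield, so it illustrates the computation rather than the exact numbers in Table IV:

```python
from math import comb

def subarray_yield(n_bits: int, p_fail: float, n_spares: int) -> float:
    """P(at most n_spares defective bit-cells), binomial model.
    Optimistic: assumes every set of <= n_spares defects is repairable."""
    return sum(comb(n_bits, k) * p_fail**k * (1 - p_fail)**(n_bits - k)
               for k in range(n_spares + 1))

CACHE_BITS = 3 * 1024 * 1024 * 8   # 3MB data array
P_FAIL = 1e-6                      # per-bit-cell failure rate [1]

for n_sa, n_sr, n_sc in [(48, 1, 2), (96, 1, 3)]:
    y_sa = subarray_yield(CACHE_BITS // n_sa, P_FAIL, n_sr + n_sc)
    y_interconnect = 0.95          # hypothetical; degrades as n_sa grows
    y_cache = (y_sa ** n_sa) * y_interconnect
    print(f"N_SA={n_sa}, spares=({n_sr},{n_sc}): cache yield ~= {y_cache:.3f}")
```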
2.3.2 Exploration in the system hierarchy
If only the information available at circuit layer is used, every memory location in a cache
module must function correctly, and hence defect-tolerance can only be achieved by using
redundancy which is extraneous to a cache module’s specification. Every memory row provides
the same functionality as every other row; likewise for columns. Hence, spare rows and columns
act as redundancy which has identical functionality as other rows and columns to replace the
defective rows and columns in the memory arrays. Multiplexors and/or de-multiplexors must be
added to enhance cache modules so the accesses can be routed around the defective memory
locations and to use spare ones. Selections made by the multiplexors and the de-multiplexors are
controlled by a module which generates control signals based on defect information, i.e.,
information regarding defective rows and defective columns. Hence, the use of rows and
columns is said to be configurable by these multiplexors and de-multiplexors under the module
which controls the select signals. Spare rows-and-columns maintain the circuit layer correctness,
and therefore the correctness is met for all layers.
In the next two sections, we explore similar redundancy components of the LLC module and
corresponding modules which select the redundancy to use at other system layers. Our goal is to
seek uniquely efficient defect-tolerance approaches and maintain the correctness at the non-
circuit layers. These approaches may cause overhead and performance degradation to the
processor chips. Such degradations will be quantified in later sections.
2.3.2.1 Microarchitecture layer
1) Exploration from cache organization
A generic cache module can be viewed with three parameters at microarchitecture layer:
number of sets $N_S$, degree of associativity $N_W$, and size of a block $S_B$. As shown in Figure 4, the
cache is organized into 2 ways ($N_W = 2$), and each block is $S_B$ bytes. Each main memory block
is mapped to a set, and the mapping between the main memory block and the $N_W$ cache blocks
within the set is decided by the replacement policy logic, such as least-recently-used (LRU) or
FIFO logic. In case of a direct-mapped cache, the replacement policy logic does not exist since
there is no choice of blocks within a set.

Figure 4. Two-way cache organization
Since multiple addresses are mapped to a set, when cache blocks are accessed an additional
action is required to distinguish between the memory blocks mapped into a set. The index field
of the accessing address selects a set, and the tag field is then used to distinguish between all
the addresses that can be mapped into the same set. The associative tag matching mechanism
ensures that the data mapped to a set can be distinguished.

Figure 4. Two-way cache organization

Hence, from the microarchitecture layer’s view, the functionality of one cache block can be replaced by another block in the same
set, and the functionality of a set can be replaced by another set. Because there is no difference
between the functionality of cache memory locations (and the mechanism to distinguish the
stored memory block already exists), there are several choices of storage elements of blocks,
ways, and sets within the cache module. Therefore, similar to the fact that spare rows and
columns are used to replace the functionality of defective bit-cells, other rows, columns, storage elements of blocks, ways, and sets can also be used for the same purpose.
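For concreteness, a minimal sketch of the address decomposition these parameters imply (our illustration; the 64-byte block size is an assumption, while the 4096 sets follow from the 3MB, 12-way case-study LLC):

```python
def decompose(addr, n_sets, block_bytes):
    """Split a physical address into (tag, set index, block offset) for a
    cache with n_sets sets and block_bytes-byte blocks (powers of two)."""
    offset = addr % block_bytes
    index = (addr // block_bytes) % n_sets
    tag = addr // (block_bytes * n_sets)
    return tag, index, offset

# 3MB / (12 ways * 64B blocks) = 4096 sets.
tag, index, offset = decompose(0x12345678, 4096, 64)
print(hex(tag), index, offset)
```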
The use of sets is controlled by the decoder as can be seen from Figure 4. However, the
decoder has no inherent ability to select among the sets. The selection of a set is the direct result
of the instruction’s or data’s memory address. The decoder merely controls the accesses to the
sets indirectly by interpreting the incoming addresses. Similar to the implementation of spare
rows-and-columns approaches which adds selection control module to control the additional
multiplexors and de-multiplexors, the decoder can be enhanced to avoid the use of defective sets.
Within a set of a set-associative cache, the replacement policy logic selects among the cache
blocks in the mapped set for write requests. The replacement policy logic takes the utilization
information, e.g., number of accesses, of the individual blocks into account when making the
replacement decisions. The replacement policy logic can also be enhanced to avoid the use of
defective blocks.
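A minimal sketch of such an enhancement (our construction, not any specific processor's logic): an LRU victim selector that consults a per-set defect map, so blocks marked defective are never allocated.

```python
def choose_victim(lru_order, defective_ways):
    """Pick the least-recently-used way that is not marked defective.

    lru_order: way indices ordered from least- to most-recently used.
    defective_ways: ways flagged in this set's defect map.
    Returns a way index, or None if every way in the set is defective."""
    for way in lru_order:
        if way not in defective_ways:
            return way
    return None  # whole set unusable; the access falls through to memory

# 4-way set where way 2 is defective:
print(choose_victim([0, 3, 2, 1], {2}))  # -> 0 (normal LRU choice)
print(choose_victim([2, 0, 3, 1], {2}))  # -> 0 (defective LRU way skipped)
```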
2) Exploration from module’s functionality
A cache module contains a subset of recently accessed instructions or data from main
memory. Each of these instructions and data can still be accessed from main memory with longer
latency. Just as spare rows and columns act as the redundancy for defective rows and columns,
main memory can provide instruction and data accesses to processor core logic and hence can act
as the redundancy of a defective LLC module. Main memory can replace the functionality of a
whole LLC module if a controller module (for now assume that such a module exists) configures
a bypassing path in the microarchitecture to allow access to instruction or data either from the
LLC module or from the main memory.
2.3.2.2 Software/OS and ISA layers
The mapping relationship between the main memory and the physically addressed caches is
only available at the software/OS layer. As we will explain in the following, such a mapping relation creates another granularity of storage elements in the LLC, which is physically addressed
and in the same address space as main memory. Such storage elements can be used as
redundancy in LLC modules.
Figure 5 illustrates the LLC and main memory mapping in a logical view, where a main
memory block has the same size as a cache block.
Each set of N_S contiguous main memory blocks constitutes a cache region. Since the number of blocks in each cache region is N_S, there is a one-to-one, i.e., direct, mapping relation between blocks in one cache region and all sets in the cache.

Figure 5. LLC and main memory mapping

From the view of the OS, main memory is divided into a number of page-frames, each of which consists of two consecutive blocks in the simplified
view in Figure 5. The view from OS introduces another LLC storage element granularity, a
page-cover, which is a consecutive region in the LLC mapped by a page-frame. Figure 5 shows a
simplified page-cover consisting of two sets, indicated by the dashed rectangle. Since there is no
functional difference between the uses of page-frames, there is no difference between the uses of
page-covers.
The usage of page-covers is directly controlled by the memory allocation module in the OS.
During the operation of a modern OS, available page-frames are pooled together. The memory
allocation module dynamically allocates the page-frames from the pool to the software processes
on demand. The memory allocation module is able to select among the page-frames, hence able
to select among the page-covers. Similar to the multiplexors and de-multiplexors in spare rows-
and-columns approaches, the module can be enhanced to perform the same function of avoiding
the use of defective page-covers during normal system operation.
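A minimal sketch of the mapping this relies on (our illustration, assuming the physically indexed case-study LLC and 4KB page-frames): the number of page-covers is the cache-region size divided by the page size, and all page-frames whose PFNs share the same residue modulo that count land in the same page-cover.

```python
def num_page_covers(cache_bytes, n_ways, page_bytes):
    """One cache region spans cache_bytes / n_ways bytes of physical
    address space; it is carved into page-sized page-covers."""
    return (cache_bytes // n_ways) // page_bytes

def page_cover_of(pfn, covers):
    """Page-frames with the same PFN residue modulo `covers` map to the
    same group of consecutive cache sets, i.e., the same page-cover."""
    return pfn % covers

covers = num_page_covers(3 * 1024 * 1024, 12, 4096)
print(covers)  # -> 64, matching the 64 page-covers cited later

# Page-frames of a 1GB memory to avoid if page-cover 5 is defective:
total_frames = (1 << 30) // 4096
bad_pfns = [p for p in range(total_frames) if page_cover_of(p, covers) == 5]
print(len(bad_pfns))  # -> 4096, i.e., 1/64 of all frames
```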
Other modules which may be enhanced to avoid the use of defective page-covers can be identified by tracing the flow from memory address computation to LLC module access. In a computing system with virtual memory, the physical addresses used to access an LLC module are first translated from the virtual addresses used in computation. Figure 6 shows a commonly used
translation procedure typically implemented partly on-chip in the translation look-aside buffer
(TLB) and partly off-chip in the main memory. Page tables in the main memory are indexed by a
virtual page number (VPN) field in the address, and the indexed page table entry stores a pointer to the base address of the physical page frame in main memory. This pointer, called the physical page frame number (PFN), is cached in the on-chip TLB if it has been recently used. When the
page frames are allocated to a software process, the VPNs of the process are bound to the PFNs
assigned, and the occupancy locations of the software process in the LLC are determined.
Similar to the address decoder to the cache sets, the page tables in the main memory and the
TLB can be enhanced to avoid the use of defective page-covers during normal system operation.
The fact that a typical LLC is physically addressed makes it impossible to find other manipulable storage elements at the ISA layer. Although static memory allocation for applications is done at compile time, the memory space allocated is in the virtual address space and is re-mapped at runtime by the OS to the physical memory space, which is what is used to access the LLC.
Table V summarizes the redundancies and the corresponding modules which control the
redundancies identified in this section. In the next section, the potential approaches are identified
and the trade-offs associated with each of these possible implementations are discussed.
Figure 6. Address translation
2.4 Spectrum of potential approaches
In this section the non-circuit layer defect-tolerance approaches based on the redundancies
identified above are summarized and categorized to facilitate further overhead analysis in
Section 2.5. An advanced approach is then described at the end of the chapter.
In contrast to the extraneous redundancy used by the circuit layer spare rows-and-columns
approach, non-circuit layer explorations identified the redundancies inherently designed in the
LLC module. The redundancies, either extraneous or inherent, can be used to carry out the
functionality of defective parts of the LLC module. Hence, these redundancies are referred to as functional redundancies (to be defined in Chapter 3). The modules which control the use of these functional redundancies and avoid the defective ones are referred to as controllers (also defined in
Chapter 3). In all approaches discussed in the following, a defect map is required for each
approach to provide defect location information to the corresponding controller.
2.4.1 Microarchitecture layer
By using the main memory as functional redundancy, the defective LLC will be bypassed and
effectively disabled. The instruction and data accesses will be carried out by the main memory.
Table V. LLC defect-tolerance components

Layer             | Redundancy             | Controller
Circuit layer     | Spare rows and columns | Mux and de-mux selection logic
μArch layer       | Main memory            | LLC/memory access control module
μArch layer       | Block, way             | Replacement policy logic
μArch layer       | Set                    | Cache address decoder
Software/OS layer | Page-cover             | Memory allocation module in OS; TLB; page table
This implementation is referred to as LLC module disabling, and the controller involved is the
LLC/memory access control module. By employing block or way functional redundancies, the
replacement policy logic can select among the blocks or ways in a mapped set in conjunction
with a defect map to capture the health of the blocks or ways. Only the defect-free blocks or
ways will be used during system operation and the defective blocks or ways will be effectively
disabled. These implementations are referred to as block disabling and way disabling. They are not
applicable to a direct-mapped cache. The approach employing the set functional redundancy can
be implemented by enhancing the decoder with a defect map and a remapping mechanism, so
accesses to the defective sets are remapped to designated defect-free sets. The tag matching mechanism must also be enhanced to ensure that the data in the shared set can be distinguished. This approach is referred to as set remapping. Set remapping is not applicable to a
fully-associative cache.
The approaches identified above maintain the ISA layer correctness by not accessing the
defective locations in the LLC module. Hence, there will be no errors in visible states.
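As a concrete illustration of the remapping flavor of these approaches, a behavioral sketch of set remapping (our construction; a real design would fold this into the decoder): accesses to a defective set are redirected to a designated defect-free partner set, and one extra tag bit keeps the two address streams now sharing that set distinguishable.

```python
class RemappingDecoder:
    """Toy set-remapping decoder with a static defect map."""

    def __init__(self, n_sets, defective_sets):
        healthy = [s for s in range(n_sets) if s not in defective_sets]
        # Statically pair each defective set with a defect-free partner.
        self.remap = {bad: healthy[i % len(healthy)]
                      for i, bad in enumerate(sorted(defective_sets))}

    def select(self, index):
        """Return (physical set, extra tag bit); the extra bit records
        whether the access was remapped, so the enhanced tag match can
        still distinguish data in the shared set."""
        if index in self.remap:
            return self.remap[index], 1
        return index, 0

dec = RemappingDecoder(4096, {7, 100})
print(dec.select(7))   # -> (0, 1): remapped access
print(dec.select(8))   # -> (8, 0): normal access
```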
2.4.2 Software/OS layer
By employing page-cover functional redundancy, the following approaches are possible.
The memory allocation module can be enhanced with a defect map to remove the page-frames which map to defective page-covers from the pool during system boot-up. When the page frames
are allocated to a software process’s request, the VPNs of the process are bound to the PFNs
assigned, and the occupancy locations of the process in the LLC are determined. Hence,
defective page-covers are disabled and effectively avoided. The approach is referred to as page-cover disabling.
The TLB or the page table access can be enhanced with a defect map and a remapping mechanism so that accesses to defective page-covers are remapped. These approaches are referred to as TLB page-cover remapping and page table page-cover remapping. However, unlike
the possible decoder remapping mechanism implementation, the outputs of the TLB and page
tables are not directly connected to the LLC arrays (where the targeted defects are most likely to be). The remapping mechanisms can be implemented at the input or the output of the
TLB or the page tables. (Note: TLB output remapping is the same as the decoder input
remapping.)
2.4.3 Advanced speculation-based approaches
The fact that the defects are present in LLC does not necessarily mean that every access to
LLC will cause an error. Instead of avoiding the defective part of the module or the excitation of
faults, the following two approaches attempt to access the LLC speculatively and treat the LLC
as a speculation module. The result of each LLC access will be checked and corresponding
action can be taken based on whether checking detects an error or not.
1) Microarchitecture layer LLC speculative load
Load instructions may speculatively access a defective LLC, accompanied by a main memory access to the same address initiated by the LLC/memory access control module.
Computation in the processor may continue, but the results computed from instructions that
depend on the speculation are not committed until the result of the load is checked against the
result from the main memory, or checked using the circuit layer coding techniques. Hence,
pipeline control logic may require enhancement for stalling the pipeline due to the additional
latency required for checking. From the microarchitecture layer’s view, store instructions to a
defective cache cannot be checked until they commit, and the ISA layer correctness would be corrupted if erroneous values were stored. Hence, store instructions must be implemented in a
write-through fashion to write the main memory and the LLC. The direct path to the main
memory bypassing the LLC is based on using the LLC module disabling approach.
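A behavioral sketch of this speculative-load protocol (our simplified functional model; in hardware the LLC and main-memory accesses proceed in parallel and instruction commit is gated on the check):

```python
def speculative_load(addr, llc, memory):
    """Speculatively use a possibly-defective LLC; main memory supplies the
    trusted value used to check and, on mismatch, correct the result.

    llc and memory are dicts modeling address -> value; a missing LLC
    entry models a miss. Returns (committed value, error_detected)."""
    speculative = llc.get(addr)   # fast result, possibly erroneous
    trusted = memory[addr]        # slower result, always correct
    if speculative == trusted:
        return speculative, False            # dependent results may commit
    return trusted, speculative is not None  # squash and correct

memory = {0x100: 42}
print(speculative_load(0x100, {0x100: 42}, memory))  # (42, False)
print(speculative_load(0x100, {0x100: 99}, memory))  # (42, True)
```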
2) Software/OS layer LLC speculative access
Similar to the idea of speculation at microarchitecture layer, a speculative approach may be
implemented at the higher layers. Fine-grained control of the access to a particular storage
module, including the LLC and the main memory, must be available at the software/OS layer so
the individual load or store instruction to LLC can be checked. This implies that the autonomous
caching which was handled by hardware has to be controlled by the software with some
additional ISA or software/OS layer mechanism.
2.5 Analysis of our approaches and defect map consideration
This section first compares three important aspects of the defect-tolerance approaches to characterize the scale of their impact. The feasibility of the approaches is also discussed. Considerations for defect map construction are then described.
1) Impact and feasibility analysis
All the above possible LLC defect-tolerance approaches are summarized and analyzed in
Table VI. The analysis focuses on the implementations’ impact on the system, rather than the
impact due to the presence of defects in the LLC. Three properties are used to characterize the
impact: the magnitude of the impact (magnitude), the frequency with which this impact is
incurred (frequency), and the pervasiveness of the impact, i.e., the set of chips that are impacted
(pervasiveness).
As can be seen, on-chip based approaches affect the performance of all chips manufactured.
With information from the software/OS layer, defect-tolerance approaches can be implemented off-chip, either in hardware or software, with impact only on defective chips. Various approaches also incur their impact at events occurring with different frequencies. The approaches
that only affect defective chips or incur impacts with lower frequencies are especially attractive,
since there is no impact on defect-free chips or the impact on performance is limited to certain
events.
The feasibility of an approach depends on the accessibility of the hardware module circuit
designs or the software module source code, and the flexibility of modifying them. Under certain conditions, the unavailability of these required properties may render certain approaches infeasible.

Table VI. LLC approach implementation impact analysis

Approach | Modification | Magnitude | Incurred frequency | Pervasiveness
Spare rows and columns | H/W based | Increased LLC latency | Every LLC access | All chips
LLC module disabling | H/W based | Increased LLC and memory latency | Every LLC access | All chips
Block or way disabling | H/W based | Increased LLC latency | Every LLC block replacement event | All chips
Set remapping | H/W based | Increased LLC latency | Every LLC access | All chips
TLB input page-cover remapping | H/W based | Increased TLB update latency | Every TLB miss | All chips
TLB output page-cover remapping | H/W based | Increased TLB translation latency | Every LLC access | All chips
Page table input page-cover remapping | Off-chip H/W based | Increased main memory latency | Every main memory transaction | Only defective chips
Page table output page-cover remapping | Off-chip H/W based | Increased main memory latency | Every main memory transaction | Only defective chips
OS page-cover disabling | S/W based | Increased page-frame pool setup time | System boot-up | Only defective chips
μArch layer speculative load | H/W based | Increased pipeline control latency | Every instruction | All chips
ISA and software/OS layer LLC speculative access | S/W based | Increased instruction/code | Every LLC access | Only defective chips

On-chip hardware-based approaches naturally require the flexibility of
modification of the controllers’ designs. Off-chip hardware-based approaches involve the
modification of the on-board memory controller design. Software-based approaches require
modification of ISA, compiler, or OS source code.
2) Defect map construction
Defect maps are essential to the controllers’ awareness of defects’ locations in all defect-
avoidance approaches. A defect map contains the information to identify the defective LLC at
the granularity of the corresponding approach, namely, rows, columns, sets, ways, blocks, page-
covers, or the LLC module itself.
Contemporary processor chips feature sophisticated cache testing mechanisms to facilitate
the creation of the defect map. UltraSPARC T1 features direct access to L1 and L2 caches to
support testing from ATE [34]. A bitmap of failing cache physical locations is available to the repairing software supporting the circuit layer approach. The AMD quad-core Opteron features
memory built-in self-test (MBIST) to test on-chip memory arrays, including caches [35]. The
MBIST can perform memory testing and also serves as the controlling mechanism of the spare rows-and-columns approach based on the testing results. The MBIST also allows the testing results to be
collected off-chip as a bitmap, thus the information can be further utilized for diagnosis. The
locations of the defective sets, ways, blocks, or page-covers can be derived by a straightforward
translation from the bitmap of failing physical locations obtained from the above testing
procedures.
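The translation is mechanical once the physical-to-logical layout is known. The sketch below assumes a deliberately simple layout (physical row r holds set r mod N_S, with ways laid out along the row at block_bits cells per way) purely for illustration; an actual chip's address scrambling would replace these two index formulas.

```python
def translate(fail_bits, block_bits, n_sets, covers):
    """Translate failing bit-cell coordinates into defective sets, blocks,
    and page-covers under the assumed layout described above.

    fail_bits: iterable of (row, col) failing locations from the bitmap."""
    sets_per_cover = n_sets // covers
    bad_sets, bad_blocks, bad_covers = set(), set(), set()
    for row, col in fail_bits:
        s = row % n_sets
        way = col // block_bits
        bad_sets.add(s)
        bad_blocks.add((s, way))
        bad_covers.add(s // sets_per_cover)
    return bad_sets, bad_blocks, bad_covers

# Two failing cells in a 4096-set, 64-page-cover LLC with 512-bit blocks:
print(translate([(5, 0), (5, 600)], 512, 4096, 64))
```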
A defect map can be stored in various forms according to the requirements of the approach. For
software-based approaches, the defect map can be stored in on-board memories and be utilized
during system operation or at system boot-up. For the hardware-based approaches, additional
storage is needed on-chip or off-chip to store the defect map. Non-volatile memories or eFUSE
[36] can be used for this purpose. The latter has already been used for spare rows-and-columns in on-chip memory arrays.
2.6 Related cache research
Block disabling, way disabling and set remapping approaches have been investigated in
several previous studies. The defect map required for block or way disabling approach can be
integrated into the cache tag array in the form of additional valid bits, which are called
availability bits [5] or fault-tolerance bits (FTB) [8]. Way disabling is also proposed in [9] and is
implemented in a mass market processor chip for power management [37].
Set remapping approach is implemented in [6] [7]. A programmable cache address decoder is
proposed in [7]. The decoder is programmed prior to the processor’s normal operation to
implement different mapping functions based on the information in the defect map carrying the
location information of the defects. A similar concept is proposed in [6] to achieve defect
tolerance for a direct-mapped cache. The accesses to defective blocks are re-routed to defect-free
blocks by enforcing the column decoder to select the defect-free blocks in the row addressed.
Among the approaches above, block disabling has been proven the most effective. Authors in
[10] compared the miss rates of the caches implementing the above approaches. Under the
uniformly distributed defects with the same defect density, block disabling has the least performance degradation among these approaches, because it has the finest functional-redundancy granularity of all, and hence its expected capacity loss is the lowest. Set remapping showed the most performance degradation since cavities in the address space become un-cacheable once a set is disabled.
Based on the impact analysis, page-cover disabling (PCD) involves the least design changes and incurs its penalty least frequently among all the approaches. In the following sections we evaluate the effectiveness of the PCD approach.
2.7 PCD cost-benefits analysis
The circuit layer spare rows-and-columns approach has been shown to be less effective under the high defect densities of future technologies. Nevertheless, the spares approach is commonly used in memory arrays; it still provides significant defect-tolerance capacity and can increase the YPA within the designed SRAM dimension limitations. As can be observed from Figure 2, the cache with no spares has almost zero yield.
The following analysis first compares the PCD approach with the spare rows-and-columns
approach. Then PCD is employed in conjunction with the existing spare rows-and-columns
approach in the cache memory arrays to show the additional benefits. In this case, the spare
approach repairs some chips completely, and other chips are left with un-repaired defective bit-
cells in LLC for PCD to tackle.
2.7.1 LLC grading and ECPA
By implementing the PCD approach, usable chips will have different numbers of usable (defect-free) page-covers. The LLC grade of a chip is assigned according to the number of
usable page-covers. Grade 0 LLC contains zero un-repaired defective bit-cells, and grade X LLC
has X disabled page-covers due to the one or more un-repaired defective bit-cells in each
disabled page-cover. Figure 7 shows the cumulative grade percentage of the 3MB LLC from the
case study under two bit-cell failure rates. Note that there are a total of 64 page-covers in the LLC given
the typical page-frame size of 4KB. The overall percentage is limited by the interconnect yield.
As can be seen, the majority of failing LLCs can still be used by disabling a limited number of
page-covers.
Next, the benefit of employing PCD with spares implemented in cache arrays is evaluated and
compared with PCD only and with spares only approaches. Figure 8 shows the EC-per-area of
the approach combinations under two SRAM bit-cell failure rates. The area of the configurations
are normalized with respect to the area of the configuration (N_SA, N_sr, N_sc) = (48, 1, 2) of the
spares-only approach. LLCs from grade 0 to grade 16 are considered in the expected capacity calculation.

Figure 7. (48, 1, 2) cumulative grade percentages

Under the lower failure rate, the Spares+PCD approach gains 1.5 times better EC-per-area, and the gain is more significant under the higher failure rate. The substantial gain comes from two facts: first, with Spares+PCD the optimal EC-per-area can be achieved by using fewer sub-arrays, spare rows, and columns; second, PCD saves additional defective LLCs
that are beyond the repair capabilities of hardware spares.
2.7.2 Implementation and results
PCD is implemented by modifying the memory allocation module in the Linux 2.6.32 kernel.
By sending customized early kernel parameter commands via the GRand Unified Bootloader
(GRUB) during system boot-up, we reserve physical page-frames mapped to page-covers in LLC
that are identified as being defective. Subsequently, the OS meets all requests for memory from
software processes by allocating only the unreserved page-frames. This implementation is used
for performance evaluation (EGOPS) by running SPEC CPU2000 benchmarks [38] on a system
with a dual-core processor and 1GB of DRAM. Each core of the processor has one 8-way 64KB
L1 data cache and one 8-way 64KB L1 instruction cache. Two cores share a 12-way 3MB
unified L2 cache (LLC).
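The exact early kernel parameters are not reproduced here; one documented Linux mechanism consistent with this description is the memmap=nn$ss boot parameter, which marks nn bytes of physical memory starting at address ss as reserved. The sketch below (our illustration, reusing the PFN-mod-64 page-cover mapping assumed earlier) emits such a fragment; a practical implementation would coalesce frames into ranges rather than emit one argument per frame.

```python
def memmap_args(defective_covers, covers, mem_bytes, page_bytes=4096):
    """Build memmap= kernel arguments that reserve every page-frame
    mapping to a defective page-cover (PFN % covers in defective_covers)."""
    args = []
    for pfn in range(mem_bytes // page_bytes):
        if pfn % covers in defective_covers:
            args.append("memmap=4K$0x%x" % (pfn * page_bytes))
    return args

frags = memmap_args({5}, 64, 1 << 30)  # 1GB system, page-cover 5 defective
print(len(frags), frags[:2])
# Note: in a GRUB config the '$' must be escaped, e.g. memmap=4K\$0x5000.
```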
Figure 9 shows the corresponding EGOPS-per-area of the optimal EC-per-area
configurations. EGOPS is the average benchmark performance of grade 0 to grade 16 LLC.
Figure 8. EC-per-area: optimal (N_SA, N_sr, N_sc) configurations of different approaches
Performance monitoring is accomplished by measuring the execution time, retired instruction
count, etc., using OProfile [39]. The performance gain provided by Spares+PCD is substantial
and more computation capability per silicon area can be achieved.
In summary, we have systematically explored the system hierarchy for the LLC defect-
tolerance opportunities. The global view of the system identifies the controlling mechanisms that
are not acknowledged by classic circuit layer approaches. Impact and feasibility analyses are
carried out to further characterize the practicability of the approaches and the effects on the
system. We also showed that PCD dramatically increased EGOPS-per-area and EC-per-area by
exploiting modern operating systems’ capability of memory virtualization and demand paging. It
is clear that by adopting the systematized exploration proposed here, defect-tolerance approaches
with unique potential become possible at higher layers of the system hierarchy. The fact that the
software/OS layer’s view provides a plethora of possible defect-tolerance approaches with
various impact factors on the overall system operation shows the advantages of the newly
proposed global view methodology.
Figure 9. EGOPS-per-area: optimal EC-per-area configurations of different approaches
Chapter 3. Defect-tolerance framework for processors
The key features of our framework are the following. First, the framework is able to
enumerate possible defect-tolerance approaches for a given microarchitecture module in a
systematic way. Second, the approaches we discover subsume existing efficient approaches for
the specific modules. Third, the framework captures high level properties of the approaches to
efficiently characterize the approach without simulation. Finally, cross-layer approaches can be
derived from the approaches we identify to achieve uniquely efficient compound approaches. We
have already illustrated these features in Chapter 2. We will also illustrate these features in the
subsequent chapters for many other modules.
This chapter first describes the terminology used in general processor chips’ organization. The
following sections introduce the components and the categories of defect-tolerance approaches.
The generalized defect-tolerance framework is then outlined and is followed by detailed
explanations.
3.1 Chip organization for multicore processors
At the chip level, a multicore processor chip contains I/O blocks, processor core blocks,
graphics processor blocks, etc. These chip blocks are functionally distinct and each is typically
architected somewhat differently and independently. Within each chip block, microarchitecture
modules are defined to support various functionalities of the chip block. Examples of
microarchitecture modules are datapath modules in graphics processor blocks and LLC modules
shared by processor core blocks.
Our module-oriented framework starts at the granularity of individual microarchitecture
modules and explores defect-tolerance approaches through all system layers. This focus on individual modules allows us to capture rich operational information at the
microarchitecture layer, e.g., the cache operation for the LLC module, and relevant utilization
information from higher layers, e.g., the mapping between main memory and LLC.
3.2 Defect-tolerance components and categories
This section discusses the components required for defect-tolerance approaches and the characteristics of these approaches.
3.2.1 Defect-tolerance components
Three components are necessary for a defect-tolerance approach: functional redundancy,
controller, and configurability.
Functional redundancy: An on-chip or off-chip resource which is capable of carrying out the
same functionality as a module is called a functional redundancy for the module. There are two
types of functional redundancy, extraneous and inherent. An extraneous functional redundancy
refers to a resource that is beyond the specification of a module and has been explicitly added.
For instance, spare rows and columns are the extraneous functional redundancies for an LLC
module. An inherent functional redundancy refers to the resource that is designed to meet the
original specifications. For instance, cache blocks are the inherent functional redundancy for an
LLC module.
Our framework primarily focuses on identifying inherent functional redundancies for
microarchitecture modules through the system layers. As technology advances, more devices can
be fabricated on a single chip. Processor architects have been utilizing this fact to increase
parallelism, increase instruction/data widths, increase memory capacities, and implement
complex instructions. Therefore, in modern processors for almost every microarchitecture
module, several alternative ways exist to carry out its functionality in case the module is
defective. Inherent functional redundancies of a module at each layer can be identified by
exploiting the following properties that can be easily discovered in modern processors.
1) Alternative modules: external and internal functional redundancy
To achieve higher performance, superscalar microarchitectures exploit instruction level
parallelism dynamically by using hardware. Microarchitectures are designed to meet the need for
instruction level parallel computation by implementing multiple copies of modules to alleviate
structure hazards. These copies of modules implement identical functionalities and hence the
functionality of a defective module can be provided by other copies of the module. Since other
copies are external to the module itself, this type of functional redundancy is referred to as
external functional redundancy.
The alternative module type of functional redundancy goes beyond the external type and
includes alternative ways to carry out the function within the module itself. To combat the
memory wall problem, processor chips have been equipped with on-chip caching modules to
exploit the temporal and spatial locality of instructions, data, and address translations. To
increase the number of instructions that can execute in parallel, other memory-based queuing
modules are equipped to buffer intermediate values of instructions. As we have shown during
our LLC exploration in Chapter 2, there can be multiple levels of internal functional redundancy,
i.e., block, set, and way. This type of functional redundancy is distinct from external functional
redundancy for the LLC, i.e., main memory at different system layers. This type of functional
redundancy is referred to as internal functional redundancy.
2) Alternative implementation of the functionality
Alternatively, the functionality of a module might be carried out using other types of modules.
For example, the functionality of one instruction can be implemented by other instructions. In
particular, floating-point instructions can be implemented using fixed-point arithmetic and logic
instructions, if a dedicated floating-point unit module in a processor is defective. This can be
proven in general for a wide range of instructions using the fact that any logic function can be
implemented by a NAND gate with some storage and control. The way to implement an
instruction is not unique and hence typically there are several alternative implementations for the
functionality of a module.
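As a toy instance of this idea (our example), integer multiplication carried out using only shifts, adds, and tests, the way a multiply instruction could be emulated on simpler functional units if a dedicated multiplier were defective:

```python
def mul_via_shift_add(a, b):
    """Multiply two non-negative integers using only shift/add/test."""
    result = 0
    while b:
        if b & 1:            # lowest multiplier bit set?
            result += a      # accumulate the shifted multiplicand
        a <<= 1
        b >>= 1
    return result

assert mul_via_shift_add(1234, 5678) == 1234 * 5678
```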
Controller and configurability: A module’s controller is the software or hardware module that
determines how that module is used. The same controller or a different controller may control
these functional redundancies for the modules. Multiple controllers may be discovered at the
same or different system layers. These controllers can be identified by traversing the control flow
of the module, which may include circuit diagram, microarchitecture behavioral description, ISA
layer’s specification/system abstraction, software/OS layer’s operational description, and so on.
For instance, by following such control flow, the controllers for the page-cover functional
redundancy for an LLC can be identified as the memory allocation module of the OS, the page
table, and the TLB module.
Such a controller may or may not be designed in a way that inherently allows it to select some
or any of the functional redundancies for a defective module. In cases where one or more controllers have such a capability, the modules and their functional redundancy are said to be inherently
configurable by the controllers. In the other cases, we say that the functional redundancy is not
inherently configurable. For example, page-covers are inherently configurable under the memory
allocation module of the OS but not inherently configurable under the TLB module, since the
TLB merely caches the control information on-chip.
3.2.2 Defect-tolerance categories
This sub-section describes the three categories of defect-tolerance approaches: repairing,
defect-avoidance, and error-masking. The mechanisms used for these three defect-tolerance
categories and to maintain their correctness are also described. Related studies are discussed
under each category.
3.2.2.1 Repairing
This category is the most common form of circuit layer defect-tolerance, which can be
achieved without the knowledge of upper layers. The approaches add extraneous functional
redundancy and extraneous controllers into the module, and use these to repair the defective
module, so the repaired module can provide the full functionality and capacity (e.g., memory
spaces, performance) designed in the microarchitecture. The correctness of each repair approach
is guaranteed at the circuit layer via manufacturing tests used to locate the defects, and by
reconfiguring the controller such that only the defect-free parts of the modules are used.
Microarchitecture layer correctness is guaranteed since each repaired module behaves just like a
defect-free module from the microarchitecture layer’s view. The classic spare rows-and-columns
approach for SRAMs falls into this category.
3.2.2.2 Defect-avoidance
The approaches in this category enhance the controller of a module to avoid the use of the
entire defective module or the defective parts of the module during system operation. These
approaches do not use extraneous functional redundancy. The controller is enhanced to be aware
of the defects in the module, and seeks the inherent functional redundancy to fulfill the
functionality of the defective module. For instance, the block disabling approach for LLC utilizes the
defect-free inherent functional redundancy, namely, the other cache blocks, by slightly
modifying the controller, namely, the replacement policy logic, to avoid the use of defective
blocks, so no erroneous value is produced during normal system operation.
The approaches under this category are further divided based on the configurability of the
module’s functional redundancy. If the functional redundancy is not inherently configurable
under the controller, a remapping approach can be employed. For the functional redundancy with
inherent configurability under the controller, typically a disabling approach can be employed
with lower overheads than a remapping approach. A controller with inherent configurability is
designed to select among the functional redundancies. In contrast, a controller with no inherent
configurability needs to be enhanced at additional overheads in order to perform the desired
selection. For example, the set remapping approach in LLC requires additional multiplexors to re-
direct the accesses originally directed to the defective sets to defect-free ones. In contrast, the
inherently configurable cache block disabling approach does not require such overheads.
The correctness of defect-avoidance approaches is guaranteed in the following manner. The
modified controller will carry out the functionality without using the defective module or the
defective part of the module. Hence, there will be no error at the output of the module.
Therefore, there will be no error in the visible states and the correctness is guaranteed at the ISA
layer.
Defect-avoidance approaches can be found in the following studies. The approaches presented
in [5] [6] [7] [8] [9] [10] [11] [12] [13] exploit the inherent functional redundancy to achieve
defect-tolerance for cache modules. Alternative implementation functional redundancy is
employed in [14] to avoid defects in a floating point unit by using fixed-point functional units.
The approaches in these studies are among those discovered and enumerated by our framework.
In addition, as we will explain ahead, our framework further analyzes the approaches discovered
to capture their high level characteristics in terms of implementation impact and performance
penalty. The fact that our framework can enumerate possible approaches at different layers
enables it to derive uniquely efficient cross-layer approaches. Rescue proposed in [19] explicitly
transforms the modules in a way so that the combinational logic blocks in the module can be
used as the functional redundancy for other logic blocks of the same types. Extraneous
controllers are also added to reconfigure the use of the logic blocks in the presence of defects.
Rescue is a rare exception in the sense that it cannot be discovered by our framework since their
approach fundamentally changes the circuit level designs by rearranging and duplicating the
logic blocks in the pipeline. However, any such approach will require detailed and
comprehensive circuit simulations to capture its performance impact.
3.2.2.3 Error-masking
This section describes required mechanisms for error-masking approaches and how error-
masking approaches should be purposed. Then the possibility of error-masking at the ISA layer
and the software/OS layer is discussed using related studies as examples.
1. Required mechanisms: checking and correcting
By expecting that an error might be produced from a defective module, an error-masking
approach isolates the error within a certain boundary so the error does not pass the boundary to
affect the correctness of upper layers. An error-masking approach requires a checking
mechanism to determine whether the results from a defective module are valid or invalid, and a
correcting mechanism to produce correct results when the results from the module are invalid.
For example, the microarchitecture layer LLC speculative load approach requires the result from
LLC load to be checked against the result from the corresponding main memory access.
Correcting mechanism: If a checking mechanism identifies a result as invalid, a correcting
mechanism is needed to derive a correct result in an alternative way. Hence, correcting mechanisms often need to use the defect-free functional redundancy of
defect-avoidance approaches already employed in the module. For instance, the LLC module
disabling approach, which is a defect-avoidance approach, is the correcting mechanism for the
microarchitecture layer LLC speculative load approach. It is clear that error-masking approaches
should be purposed to increase silicon utilization or performance in addition to defect-
avoidance approaches already implemented (except for coding techniques in memory-based
modules). Error-masking may increase silicon utilization or performance of defective chips by
leveraging the fact that any error is produced with a low probability. Performance of a defective
processor might improve if defective modules that have a low probability of producing errors can be utilized, and the processor continues the computation by speculating that a defective module does
not produce any error. Hence, defective modules can be utilized with a suitably low probability
of incurring penalties due to occurrences of errors, and penalties for invoking correcting
mechanisms.
Errors will be visible and corrected at the layer at which the correcting mechanism is
implemented. Hence, the correctness of that layer is relaxed. However, correctness is guaranteed
at the layers above the layer of this implementation. For example, microarchitecture layer LLC
speculative load relaxes the microarchitecture layer correctness by allowing errors to be
produced. However, errors are checked and corrected before committing to visible states, so ISA
layer correctness is maintained.
Checking mechanisms: Types of checking mechanism for targeting individual
microarchitecture modules include: 1) result checking, 2) fault-excitation condition checking,
and 3) coding techniques.
1) Result checking: Result checking can be done directly or indirectly. Direct result checking
compares results produced by a defective module with trusted results produced by a defect-free
module. Indirect result checking verifies results from a defective module using other types of operations (e.g., checking results from a defective adder by using a subtract operation). Naïve result
checking may directly compare the result of a defective module with a trusted result generated
from a module of the same type as the defective one, in parallel with the defective module’s computation. A
dedicated defect-free alternative module functional redundancy is required for such usage, and
this defeats the purpose of employing an error-masking approach to improve performance.
Hence, the functional redundancy used to generate the trusted results cannot be an alternative
module of the same type as the defective module. For example, using a defect-free ALU module
to produce trusted results for checking a defective ALU’s output actually decreases the chip’s
silicon utilization and performance. Therefore, direct result comparison needs to produce trusted
results by using alternative implementation functional redundancy.
Result checking may also be done indirectly. For instance, multiplication can be checked by
division or repetitive subtractions. However, such checking cannot be done efficiently, and most instructions do not have a simple and efficient method for checking.
2) Fault-excitation condition checking: Fault-excitation condition checking examines the vectors
at the inputs of a defective module with a set of stored conditions. The stored conditions define
the input vectors that will excite the faults (i.e., modeled defects) that exist in the module and may produce errors. Therefore, the use of fault-excitation condition checking is not limited to error-masking approaches but can also be applied in defect-avoidance approaches. This checking
mechanism can be used as the defect map for defect-avoidance approaches, since the mechanism
is not based on output checking. As an example, consider the following setup. A controller sends
add instructions to one of two ALU modules, one defective and one defect-free. Before sending
an instruction, the controller inspects the input vectors of the instruction against the stored
conditions of the defective adder. If there is a match, the controller sends the instruction to the
defect-free module. Otherwise, the instruction can be sent to either module.
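A behavioral sketch of this setup (all names are ours): a dispatcher that screens each instruction's operand pair against the defective ALU's stored excitation conditions and steers matching instructions to the defect-free copy.

```python
def make_dispatcher(excitation_conditions):
    """excitation_conditions: set of (op_a, op_b) input vectors known from
    testing to excite faults in ALU 0, the defective copy."""
    def dispatch(op_a, op_b):
        # Vectors that may excite a fault must use ALU 1 (defect-free);
        # any other vector may safely use either copy.
        return 1 if (op_a, op_b) in excitation_conditions else 0
    return dispatch

dispatch = make_dispatcher({(0xFF, 0x01)})
print(dispatch(0xFF, 0x01))  # -> 1: steered away from the defective ALU
print(dispatch(0x03, 0x04))  # -> 0: the defective copy may be used
```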
When this checking mechanism is used in error-masking approaches, checking is performed
in parallel with the module’s computation.
However, specifying exact fault-excitation conditions of a general module is impractical because the conditions have to be established from manufacturing tests. Unless a module
is tested with all possible input vectors, the exact set of vectors that causes excitation is hard to
derive. In Chapter 5, we propose a subsuming condition checking for datapath modules to
overcome the difficulty in implementing exact condition checking.
3) Coding techniques: Coding techniques as checking mechanisms are commonly used for memory-based modules. For instance, parity checking detects an odd number of errors in the protected memory module. SECDED detects up to two errors and corrects up to one error in the protected unit. Therefore SECDED can also serve as a correcting mechanism. However, the checking and correcting capability of coding techniques is limited, and the uses of coding techniques are
limited to memory-based modules and interconnect modules. Coding techniques are special
cases within the error-masking category. They allow errors to be produced within a module,
hence, there is no speculation involved for performance improvement.
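For reference, a minimal sketch of the simplest such code (even parity over a data word, which detects any odd number of bit flips but corrects none):

```python
def parity(word, width=64):
    """Even-parity bit over a width-bit word."""
    p = 0
    for i in range(width):
        p ^= (word >> i) & 1
    return p

def check(word, stored_parity, width=64):
    """True if the stored word still matches its parity bit."""
    return parity(word, width) == stored_parity

w = 0b10110010
p = parity(w)
assert check(w, p)
assert not check(w ^ (1 << 3), p)   # a single-bit flip is detected
```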
2. Consideration of error-masking at ISA and software/OS layers and related studies
In a module-oriented framework, ISA layer and software/OS layer masking approaches are
impractical, unless the functionality of a module is confined to a few operations/instructions. A
majority of microarchitecture modules are involved in executing most instructions defined in the ISA. ISA or software/OS layer error-masking for such a module implies checking each and every
instruction that uses the module.
The approaches proposed in the previous studies summarized below fall into the error-
masking category. However, as we will explain ahead, these approaches will not be enumerated
in our framework since they are not efficient for combating manufacturing defects.
DIVA proposed in [15] allows errors to be produced at the microarchitecture layer, and adds
pipeline stages explicitly to compute correct results for the purpose of checking and correcting.
Correctness of DIVA is maintained by fundamentally changing its microarchitecture and
allowing errors to be produced in the microarchitecture. This is another approach that is not
discovered by our framework since it re-designs the overall microarchitecture. Such an approach
requires comprehensive simulation to characterize its performance impact. ISA layer and
software/OS layer error-masking mechanisms can also be found in the related studies. Relax in
[16] relies on redundant multi-threading [18] and Argus [17] to detect errors at the ISA layer and
the software/OS layer. Instead of focusing on checking errors in an individual module, the
abovementioned studies focus on checking errors in a system-wide fashion. These approaches
mainly focus on combating soft errors and they rely on temporal or spatial redundant execution
as checking and correcting mechanisms, i.e., direct result comparison using a module of the same
type as the defective module, which is inefficient for combating hard defects.
Nevertheless, as we mentioned earlier, our framework enumerates efficient defect-avoidance
approaches of which some have been implemented to deal with hard defects. As will be
demonstrated in later chapters, our framework also discovers new possible error-masking
approaches which can be efficient in dealing with hard defects.
3.3 Cross-layered defect-tolerance framework
In this section, a systematic defect-tolerance framework is described. Detailed explanation of
each step follows. There are five major steps in our framework:
1. Identify target microarchitecture modules.
2. Explore defect-tolerance approaches across the layers of the system hierarchy.
3. Analyze the impact and feasibility of approaches discovered.
4. Perform utilization analysis for the target modules and their corresponding approaches.
5. Evaluate the approaches and establish processor tiers.
3.3.1 Identify target microarchitecture modules
Three criteria are used to identify target modules for developing defect-tolerance approaches:
1) essentialness, 2) vulnerability, and 3) potential benefits.
1) Essentialness: Integration has been a major trend in the advance of processor chips. Modern
processors have increasingly incorporated functionally distinctive chip blocks such as multiple
processor cores, graphics processing units, memory controllers, etc. onto a single chip. At the
ISA layer, varieties of instruction sets have been designed and supported by processor cores. For
example, single instruction multiple data (SIMD) datapath modules have been implemented in
processor cores to accelerate data-intensive applications, such as image processing and digital
signal processing.
Defects in hardware mainly used for intrinsically variation-tolerant applications may produce
errors, but the errors within application-specified thresholds may be acceptable [40]. In contrast,
for critical applications, defect-tolerance approaches developed for processor yield enhancement
are required to provide absolute correctness. The type of hardware modules mainly used for
variation-tolerant applications (e.g., graphic processor block, SIMD unit, etc.) can also be used
for critical applications. In terms of tolerating hard defects, it is possible to identify user layer
tolerance for these modules. In this work, we focus on microarchitecture modules which are
dedicated to and essential for critical applications.
2) Vulnerability: Chip blocks designed and implemented with smaller sized and densely packed
devices and densely packed modules will have more critical area [21] than others. Under the
same manufacturing imperfection density and the same imperfection size distribution, smaller
sized devices and densely packed modules are more likely to be affected by imperfections.
Hence, the imperfections in chip blocks which have more critical area are more likely to manifest
as defects and create errors at the blocks’ outputs, and these chip blocks are vulnerable to
manufacturing imperfections. Defect-tolerance approaches for processor chip yield enhancement will be more effective if the modules in vulnerable blocks are targeted.
3) Potential benefits: In addition to device size and density, to a first order the probability that
the module will be defective grows exponentially with the area of the module. Modules with
larger area or more transistors are more susceptible to manufacturing imperfections and process
variations. Targeting these modules increases the percentage of defective processor chips that
can be saved by our defect-tolerance approaches.
3.3.2 Explore defect-tolerance approaches throughout the system hierarchy
The flowchart of this step is shown in Figure 10. A target module is first categorized and defect-
tolerance approaches are enumerated in each layer. Detailed explanation follows.
[Flowchart: a target module X is first checked for being a speculative module (if so, no action is needed). Otherwise each layer is explored for defect-tolerance components: if alternative modules exist, defect avoidance by controller enhancement to utilize non-defective modules; if alternative implementations exist, defect avoidance by controller enhancement for alternative implementations of the functionality; if error masking is possible, controller enhancement with the controls of checking and correcting mechanisms. Exploration continues layer by layer until all layers are explored, followed by approach impact, feasibility, and utilization analyses and approach evaluation.]

Figure 10. Defect-tolerance approaches exploration flowchart
Module characterization: Modules can be categorized into two types: speculative and non-
speculative. Speculative modules are identified from microarchitecture layer information. A
module can be identified as speculative if the module’s output is subject to further verification,
and the effect of the module’s output is subject to correction if the output is verified as incorrect.
Modern microarchitectures attempt to accelerate instruction flow or data flow of programs’
execution by predicting the outcomes of computations by using speculative modules. Processor
chips with defective speculative modules do not require additional approaches to guarantee the
ISA layer correctness. In such microarchitectures, outputs of speculative modules are verified
against the non-speculative results which are computed and available in later pipeline stages.
Incorrectly computed results stemming from false speculations are invalidated so they are not
written (not committed) to visible states. A correcting mechanism then recovers programs’
execution by using the non-speculative result. Therefore no action is needed for defective
speculative modules to guarantee correctness at ISA layer and beyond.
Defect-tolerance components identification through system hierarchy traversal: At each layer,
for the target module being examined, the three defect-tolerance components are first identified: functional redundancy, controllers, and configurability. Functional redundancies of
the target module are identified by searching for the alternative implementation of its
functionality and the alternative modules among the information at each layer, i.e., circuit layer
diagram, microarchitecture layer description, ISA layer’s abstraction of the system, and
software/OS layer’s abstraction of the system. The controllers of the target module are the
hardware or software modules that determine how the target module and its functional redundancy are
used. Controllers can be identified by tracing how the control information is generated and
propagated. Take the PCD approach in an LLC module as an example: the control information of
the approach, i.e., the page-covers to be used, is generated from a memory allocation module in
the OS. This control information propagates through page tables and TLB modules. The modules
along the generation and propagation path are identified as the controllers of the page-covers
functional redundancy of the LLC module. The target module and its functional redundancies
may or may not have the inherent configurability under the controllers identified.
Defect-avoidance approaches can be enumerated using the three components identified. First
consider controllers with inherent configurability to the target module’s functional redundancy.
By enhancing these controllers with the awareness of defect locations, the controller can avoid
the use of the defective module (or the defective part of the module), and a disabling approach is
obtained. Next consider controllers without inherent configurability to the target module’s functional redundancy. Extraneous configurability must be added to such controllers, in addition to enhancing their awareness of defect locations, to obtain a remapping approach.
Error-masking approaches can be enumerated by identifying alternative implementations to
produce trusted results, or by identifying appropriate fault-excitation conditions to serve as checking mechanisms and defect-avoidance approaches as correcting mechanisms.
Tolerating defects in speculative modules with no action needed is a special case of error-
masking. This is because checking and correcting mechanisms are built into the
microarchitecture along with the designs of speculative modules. Take a branch predictor
module for example. The module predicts the results of branch instructions, and the actual results
of the branch instructions are computed by other modules in the microarchitecture. These results
are used as trusted results for checking the predictions made by the branch predictor module. The
trusted results are also used to correct the erroneous prediction for continuing computations for
instructions that follow the branch. The other modules implement the branch predictor’s
functionality, namely, resolving the branch results, in an alternative way by actually computing
it. This does not contradict our definition of error-masking, since branch predictor modules are
designed to increase performance or silicon utilization by speculating in cases where errors are generated from speculation with low probability.
3.3.3 Analyze the impact and feasibility of approaches discovered
Implementations of defect-tolerance approaches reduce the processor’s and/or system’s
performance. As analyzed in Section 2.5, three factors determine the overall severity of impact
of an implementation: magnitude, incurred frequency, and pervasiveness. Whether a processor
copy is defective or not, the implementation of an approach may affect the circuit level latency of a module (i.e., the pico- or nanoseconds required to obtain a result from the module), or require
software re-design effort, etc. Note that clock period or cycle count to a module may also be
affected if the latency impact turns out to be critical. In the rest of this document, we use circuit
level latency to refer to such latency impacts, since it is not possible to ensure whether some
circuit layer modification is critical or not without detailed investigation of all modules. The
three properties capture the scale and severity of such impact. Magnitude measures the impact
itself. Incurred frequency indicates the probability that the impact will be exposed during the
operation. Pervasiveness describes the proportion of the manufactured chips that will be affected
by such impact.
A defective module’s microarchitecture level performance penalty can also vary in
magnitude, frequency and pervasiveness for different approaches. Penalty types considered for
microarchitecture level performance include clock cycles required to finish an operation, the
number of functional units, the capacity of a cache, accuracy of a branch predictor, penalty cycle
counts to recover from mis-prediction, etc. For example, block disabling for a defective LLC will
only incur a penalty (i.e., a possible additional miss) when a set with a disabled block is accessed. In
contrast, way disabling for a defective LLC incurs penalties every time the LLC is accessed.
Capturing these properties early can assist fast quantitative evaluation of potential approaches
without detailed simulation.
The feasibility of a defect-tolerance approach depends on the following constraints of the
system: availability of a module’s design, flexibility for module re-design, and availability of
testing features required to construct a defect map. Approach implementations may require the
flexibility of re-designing controllers. For example, approaches with circuit layer
implementations require hardware modifications and approaches with software/OS layer
implementation require software changes. Circuit designs and software source code must be
available and the designer must have the flexibility to modify.
Certain approaches may require manufacturing test results for the target module. Target modules must have suitable testing and diagnosis capability to derive information about a defect's location and severity. Potential defect-tolerance approaches to be implemented in a processor should be selected according to the impact and feasibility analyses to maximize benefit.
3.3.4 Utilization analysis
More detailed analyses can be performed for the promising approaches identified using impact analysis. Utilization analysis is used to study how a module is utilized at different layers. The information obtained from the analysis can be used to compare how often different approaches' penalties are incurred, and to determine how a specific mechanism should be designed for an approach. The different types of utilization analyses for each layer are explained below.
1) System/user layer application types: This analysis determines how a module is utilized at the system/user layer. Most modules are utilized by all types of applications. However, some datapath modules are only used by specific types of applications; for example, an FPU is only used by floating-point applications. Analyzing the application types that will execute on the processors helps determine which processor tiers can be used. Single-core processor tiers with a defective FPU should not be used in a system where the workload has a heavy percentage of floating-point applications.
2) ISA layer instruction distribution: This analysis determines the distribution of specific types of instructions. The information can be used to determine how a module is utilized by an application, and how frequently the penalty is incurred if the module is defective. The information can also be used to determine when to employ an approach for a specific application. For example, the instruction disabling approach for the FPU (more in Section 4.2) is suitable for applications that use a low percentage of floating-point instructions, since it then incurs low performance degradation.
3) Microarchitecture layer: Two types of analyses can be performed, i.e., module access rate analysis and capacity utilization analysis. Module access rate analysis measures how frequently a module is utilized during the entire execution period of an application, or during a specific time interval of that period. For a module which is highly utilized during the entire period of an application's execution, penalties incurred from tolerating defects in the module can degrade the application's performance severely. For a module which is highly utilized only during a few time spans of the execution period, penalties incurred from tolerating defects in the module may not degrade the performance as much. Capacity utilization analysis measures how frequently a defect-free cache module or a defect-free buffering module is filled during an application's execution period. The more frequently a defect-free module is filled, the more a defective module will degrade the performance. The reason is that tolerating defects in these modules results in decreased capacity; if the capacity of the defect-free module is already critical, a module with decreased capacity will affect the performance severely.
4) Circuit layer operand distribution: This analysis determines how 1's and 0's are distributed in operands. The information can be used to determine how a module is utilized at the circuit layer. For example, the operands of an add instruction determine how many bit-slices of an adder have to be defect-free to compute the instruction correctly. In Chapter 5, we present an example of operand distribution analysis used to guide the mechanism design of an error-masking approach.
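As a rough illustration of such an analysis, the sketch below scans a hypothetical, invented trace of add-instruction operands and counts how many additions could be computed entirely within the low 16 bit-slices of a 32-bit adder; the trace contents and the slice boundary are assumptions for illustration only.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical operand-distribution probe: for a trace of add operands,
     * count how many adds would fit entirely in the low 16 bit-slices, i.e.,
     * could complete even if the upper bit-slices of a 32-bit adder were
     * defective. */
    static int fits_low_half(uint32_t a, uint32_t b)
    {
        /* Both operands and the result must stay within 16 bits. */
        return a <= 0xFFFF && b <= 0xFFFF && (a + b) <= 0xFFFF;
    }

    int main(void)
    {
        /* Invented stand-in for an instruction trace. */
        uint32_t trace[][2] = { {3, 5}, {70000, 1}, {1024, 2048}, {65535, 1} };
        int n = sizeof trace / sizeof trace[0], narrow = 0;
        for (int i = 0; i < n; i++)
            narrow += fits_low_half(trace[i][0], trace[i][1]);
        printf("%d of %d adds confined to the low bit-slices\n", narrow, n);
        return 0;
    }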
3.3.5 Evaluate the approaches and establish processor tiers
This section first describes the performance grading of processor chips which employ the defect-tolerance approaches explored by the framework. Then a procedure is described for evaluating and comparing the efficiency of different approaches.
3.3.5.1 Processor performance tier setup
The tier of a processor chip captures the capacities at its microarchitecture, ISA, and OS layers and is evaluated via benchmarking. Figure 11 shows the flow for setting up performance grades for fabricated copies of a processor.

Figure 11. Approach evaluation and grade setup flowchart
Circuit layer characterization, such as area and clock cycle, is not included, for the reason that the circuit layer approaches employed are universal in the sense that they equally alter all fabricated copies. Therefore, all chips with the same circuit layer approaches have the same circuit-level performance.
In addition, for speculative modules which do not require any action taken in the presence of
defects to guarantee correctness at upper layers, the effect of the defects must be captured in the
microarchitecture simulator by modeling the target module's circuit implementation, e.g., an RTL or HDL model with the ability to inject faults that model the defects. If the defects tolerated by the approach employed only change quantifiable parameters at the microarchitecture, ISA, or OS layers, circuit layer simulations are not required.
3.3.5.2 Approach evaluation
When comparing different approaches, the above benchmarking procedure can be followed to
derive performance measures for each tier version of the processor chip. Percentages of every
chip tier, i.e., the yields of every tier, can be derived by correlating specific defect densities with the way in which the approach tolerates defects in the module, as we described for an LLC module in [41]. Expected performance can then be calculated. Area information can be derived from CAD tools or other estimation tools. Combining the yield, area, and performance measures, the metrics discussed in Section 1.5, i.e., EGOPSPA and EFF, can be obtained for each specific approach. Comparisons between approaches can then be made.
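As a hedged illustration of the combination step, the sketch below folds per-tier yields and benchmark performance into an expected performance and an expected performance-per-area figure in the spirit of EGOPSPA (expected performance per area, as the text suggests); the tier data, the area value, and the assumption that chips below the lowest tier are scrapped are invented.

    #include <stdio.h>

    /* Hedged sketch: combine per-tier yield and benchmark performance into
     * an expected performance-per-area figure. Tier data are invented. */
    int main(void)
    {
        double yield[] = {0.70, 0.20, 0.05};  /* fraction of chips per tier  */
        double gops[]  = {10.0,  8.5,  6.0};  /* benchmark GOPS of each tier */
        double area_mm2 = 100.0;              /* from CAD/estimation tools   */

        double expected_gops = 0.0;
        for (int i = 0; i < 3; i++)
            expected_gops += yield[i] * gops[i];  /* remaining chips scrapped */

        printf("expected GOPS = %.3f\n", expected_gops);
        printf("expected GOPS per mm^2 = %.5f\n", expected_gops / area_mm2);
        return 0;
    }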
3.4 Target modules identification in modern processors
The typical chip blocks in processor chips are the processor core block with the LLC, the graphics processor block, and the I/O block.

Vulnerability: I/O blocks are typically implemented with devices of larger sizes. Hence, I/O blocks are relatively robust compared to other modules implemented with small or densely packed devices. Therefore, I/O blocks are excluded from further characterization.
Essentialness: The graphics processor block is dedicated to intrinsically variation-tolerant applications. In contrast, the modules in processor cores are essential for critical applications. As we mentioned earlier, this work focuses on the microarchitecture modules which are essential for critical applications. Table VII lists the microarchitecture modules with high area percentages in the processor cores of two processor chips.
It can be observed that caching modules, including the LLC and low level caches, occupy the largest percentage of the area, and this area percentage has increased for more recent processors. Datapath modules, the decoder module, and queuing modules consume a majority of the rest of the area of the logic cores. By protecting these modules in the processor core, a majority of the chip area can be protected, and yield (and EGOPSPA, performance-per-area) can be enhanced significantly. Defect-tolerance approaches are explored in the following chapters.
Table VII. Microarchitecture target modules area percentage

Pentium III, 1999 [70]:
- No L2
- fetch + L1i cache: 10.18%
- L1d cache: 6.83%
- decode: 7.88%
- FPU: 4.72%
- Fixed-point FU: 4.37%
- rename: 3.67%
- ROB: 3.42%
- Inst. sched & Q: 3.07%
- L/SU: 2.75%
- BTB: 2.17%
- Branch addr. calc: 0.96%

AMD Bulldozer, 2011 [73]:
- 2MB L2 (LLC): 41.20%
- L1i cache: 2.30%
- L1d cache: 2.15%
- dTLB: 1.80%
- FPU: 10.47%
- L/SU: 9.09%
- Fixed-point FU: 5.55%
- Bpred: 5.36%
- decode: 5.31%
- Inst. sched & Q: 5.30%
- fetch: 2.15%
- Physical register: 1.91%
- rename: 1.58%
Chapter 4. Framework application on the target modules
From the view of the system layer, a defect-free processor core is the inherent functional redundancy for a defective core. In modern multicore processors, a core can also serve as the inherent functional redundancy for any defective module within a defective core. Hence, core-disabling has been a common system-layer approach for tolerating defects in any module of a core [42] [35]. In addition, other functional redundancies are available for the modules from the ISA layer down to the circuit layer. In this chapter, defect-tolerance approaches are explored for the target modules at these layers.
4.1 Branch prediction modules
Branch prediction is commonly used in modern processors to sustain continuous instruction fetch without stalling to wait for branch conditions to be resolved or for branch target addresses to be computed. This section first describes the functionality and the operation of the branch prediction modules. The framework is then applied to explore possible defect-tolerance approaches.

Branch prediction modules include the branch predictor module and the branch target buffer (BTB) module. A branch predictor module predicts the outcomes of branch instructions to speculate whether the branches are taken or not-taken. A BTB module predicts the branch target addresses (the addresses to be fetched next) if the branch instructions are predicted taken.
1) Branch predictor module
Figure 12 depicts a gshare branch predictor module. By hashing branch instruction addresses
with the actual branch history in the branch history table (BHT), the pattern history table (PHT)
is accessed to make the prediction. The PHT contains a 2-bit finite state machine (FSM) for each
indexing address. A branch prediction is made based on the state of the FSM indexed.

Figure 12. gshare predictor (reproduced from [4])
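A minimal behavioral sketch of such a predictor is shown below (in C); the table size, the use of a single global history register, and the immediate predict-then-update sequence are simplifying assumptions made for this sketch, not the design of any particular processor.

    #include <stdint.h>

    /* Behavioral sketch of a gshare predictor. The PHT holds 2-bit
     * saturating counters; the index is the XOR of the branch address with
     * the global branch history (the BHT collapsed into one register here). */
    #define PHT_BITS 12
    #define PHT_SIZE (1u << PHT_BITS)

    static uint8_t  pht[PHT_SIZE];  /* 2-bit FSMs, 0..3; >=2 predicts taken */
    static uint32_t ghr;            /* global branch history register       */

    static unsigned pht_index(uint32_t pc)
    {
        return (pc ^ ghr) & (PHT_SIZE - 1);
    }

    int predict(uint32_t pc)
    {
        return pht[pht_index(pc)] >= 2;        /* taken if counter in 2..3 */
    }

    /* Called with the actual outcome, before the history shifts in this
     * sequential sketch, so the index matches the one used to predict. */
    void update(uint32_t pc, int taken)
    {
        uint8_t *c = &pht[pht_index(pc)];
        if (taken  && *c < 3) (*c)++;          /* saturate at strongly taken     */
        if (!taken && *c > 0) (*c)--;          /* saturate at strongly not-taken */
        ghr = (ghr << 1) | (taken ? 1u : 0u);  /* shift in the actual outcome    */
    }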
2) BTB module
A BTB module stores the addresses of recent branches that were taken, together with their target addresses. The BTB is accessed during the instruction fetch stage using the program counter (PC) and makes predictions based on the stored branch history. A direct-mapped BTB organization is depicted in Figure 13. Similar to a cache module, a BTB module can also be designed with multiple blocks in one set, and replacement policy logic circuitry is necessary in such an associative organization. There are two fields in each entry of a BTB module: one field stores the branch instruction address (BIA), and the other field stores the corresponding branch target address (BTA) of the branch instruction in the entry.

Figure 13. Branch target buffer (direct-mapped)
When the BTB is read, the index field of the incoming address is used to index into the sets of the BTA and BIA arrays. The tag field of the incoming address is compared with the tag stored in the BIA. When the tag matches, the corresponding address stored in the BTA is used as the target address to
fetch the next instruction, if the branch is predicted as taken by a branch predictor module. If the
tag does not match and the branch is predicted as taken, the instruction fetch stalls and waits for
target address computation to complete. The above procedure is also used to locate the set
when the value in the BTB is updated. In an associative organization, the replacement policy
logic determines which block is to be overwritten. In contrast, in a direct-mapped organization,
the indexed block is overwritten directly.
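The read and update operations described above can be summarized by the following behavioral sketch of a direct-mapped BTB; the entry count, the field widths, and the word-aligned indexing are invented for illustration.

    #include <stdint.h>

    /* Behavioral sketch of a direct-mapped BTB. Each entry stores a branch
     * instruction address tag (BIA) and a branch target address (BTA). */
    #define BTB_BITS 10
    #define BTB_SIZE (1u << BTB_BITS)

    struct btb_entry { uint32_t bia_tag; uint32_t bta; int valid; };
    static struct btb_entry btb[BTB_SIZE];

    /* Returns 1 on a hit and writes the predicted target; on a miss the
     * front end must stall for the actual target computation. */
    int btb_lookup(uint32_t pc, uint32_t *target)
    {
        unsigned set = (pc >> 2) & (BTB_SIZE - 1);   /* index field */
        uint32_t tag = pc >> (2 + BTB_BITS);         /* tag field   */
        if (btb[set].valid && btb[set].bia_tag == tag) {
            *target = btb[set].bta;                  /* predicted BTA */
            return 1;
        }
        return 0;
    }

    /* Update uses the same indexing; in a direct-mapped organization the
     * indexed entry is simply overwritten. */
    void btb_update(uint32_t pc, uint32_t actual_target)
    {
        unsigned set = (pc >> 2) & (BTB_SIZE - 1);
        btb[set] = (struct btb_entry){ pc >> (2 + BTB_BITS), actual_target, 1 };
    }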
4.1.1 Module characterization and recovery mechanism
Both branch prediction modules are used to prevent pipeline stall by speculating which
instruction to fetch next. Defects in a branch predictor module can manifest as erratic branch prediction behavior (taken instead of not-taken, or vice versa), and defects in a BTB module can manifest as incorrect branch target address prediction. In this manner, the errors produced in
these defective versions of these modules may cause the fetching and processing of instructions that differ from those that would be fetched if the modules were not defective.
However, the effect of these errors will never propagate to the visible states since the branch
prediction modules are speculative by design at the microarchitecture layer. A specific recovery mechanism is designed to invalidate the computation produced by mis-speculation and to restore the processor to a previously recorded state. Outcomes of speculations are verified against resolved branch conditions and computed branch target addresses before speculatively computed results are written to visible states through the commit process. Hence, ISA layer correctness is maintained.
According to the framework, no action is needed for branch prediction modules to guarantee the
correctness of ISA layer in the presence of defects.
4.1.2 System layers exploration
Although a no-action approach has been identified at the microarchitecture layer, additional
approaches may be identified by continuing exploration of the system layers using the above
framework. However, in this case, no approach exists at the ISA and the software/OS layers,
since both modules are absent from the system abstractions of these two layers (Section 1.2.2).
Hence, in this case, the microarchitecture layer is the only one to explore for functional
redundancies and corresponding controllers. As described next, we can identify an alternative
module functional redundancy by observing the no-action approach and the module’s operation.
1) Observation from the no-action approach
The reason that no action is needed to guarantee the correctness in case of defects in
speculative modules is that error-masking's requirements, namely checking and correcting, are inherently satisfied by the microarchitecture design. Errors caused by
defects in the module will be automatically corrected at the microarchitecture layer by design.
The inherent checking mechanism employs direct result comparison by producing trusted results
using other modules in the original execution path. The external functional redundancy of a
branch predictor module is the combination of the modules that resolve the branch instructions
non-speculatively. The corresponding controller is identified as the pipeline control logic which
can bypass the branch predictor and stall the pipeline until the branch prediction is resolved non-
speculatively. The external functional redundancy of a BTB module is the combination of the
modules that compute the branch target addresses non-speculatively. The corresponding
controller is the combination of the pipeline control logic which can stall the instruction
progression to wait for the actual branch target address computation, and the logic of the select
signal labeled BTB_hit/BTB_miss controlling the multiplexor depicted in Figure 13. The signal
determines whether the address to be used comes from the BTB or from the non-speculative address computation. Both modules are inherently configurable under the controllers identified.
A module-disabling defect-avoidance approach can be derived based on the external
functional redundancies of each module. By incorporating a defect map and slightly modifying
the controller, i.e., pipeline control logic which controls the advancing and stalling of
instructions in the pipeline, the defective module can be disabled as an entity, and its
functionality replaced by the actual computation using its functional redundancy. Pipeline
control logic is modified so the pipeline stalls when resolving branch instructions and calculating
target addresses.
2) Observation from module’s operation
Functional redundancy can also be discovered within each module. In the gshare branch
predictor, there is no fixed state in the PHT to which a branch instruction must be mapped. Hence, the state machines in the PHT are the functional redundancies of each other. Just as we saw for the LLC module, where the decoder is the controller of the sets, the PHT decoder is identified as the controller, though without inherent configurability. Hence, a remapping approach
similar to that explored in the LLC module can be applied by enhancing the PHT decoder.
The microarchitecture layer functional redundancy of sets, blocks, and ways identified for the LLC module can also be applied to the BTB module, since a BTB is managed in the same fashion as the cache modules. Defect-avoidance approaches for the BTB module, such as block disabling, way disabling, and set remapping, can be achieved using implementations similar to those for the LLC module. The same restrictions apply: block and way disabling are not applicable to direct-mapped organizations, and set remapping is not applicable to fully-associative organizations.
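A minimal sketch of the enhanced-decoder remapping idea is given below: indexes that a defect map marks as defective are redirected to a partner entry. The pairing rule (flipping the top index bit) and the defect map format are assumptions; a real implementation would also have to handle the case where the partner entry is itself defective.

    #include <stdint.h>

    /* Hypothetical remapping layer in front of the PHT decoder: indexes
     * that fall on entries marked defective in a defect map are redirected
     * to a partner entry, which the two branches then share. */
    #define PHT_SIZE 4096u

    static uint8_t defect_map[PHT_SIZE];  /* 1 = entry failed manufacturing test */

    unsigned remap_index(unsigned idx)
    {
        if (defect_map[idx])
            idx ^= PHT_SIZE >> 1;  /* use the partner FSM entry instead */
        return idx;
    }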
4.1.3 Impact and feasibility analyses
Table VIII compares the impact of the implementations of the above approaches for branch prediction modules. All defect-avoidance approaches require some hardware modification and impose an impact on all chips. The no-action approach itself has no implementation impact on chips when applied to both modules and is free of hardware re-design. This property of the no-action
approach for defect-tolerance in the branch predictor module has been acknowledged in the studies of [2] [3] [4]. The authors of [4] further characterized the yield benefits and the performance
degradation caused by specific defect-induced faults.
Table IX summarizes the microarchitecture level performance penalties due to a defective BTB module. It can be seen that the block disabling and no-action approaches have the lowest penalty-incurring frequencies. Between the two, the performance penalty difference is that the no-action approach requires additional cycles to recover to a previous valid state of the processor; however, the no-action approach does not require any hardware re-design. The overall microarchitecture performance difference between the two approaches can be studied further to identify which approach is suitable to employ under different situations.
Table VIII. Branch prediction modules approaches impact analysis
(columns: Approach | Magnitude | Incurred frequency | Pervasiveness)
- Module disabling | Additional pipeline control logic latency | Every cycle | All chips
- Block or way disabling for BTB | Additional latency in BTB replacement policy logic | Every BTB access | All chips
- Set remapping for BTB, PHT remapping for branch predictor | Additional decoder latency | Every module access | All chips
- No-action | None | None | None

Table IX. Defective BTB microarchitecture level performance penalty
(columns: Approach | Magnitude | Incurred frequency (for every predicted-taken branch) | Pervasiveness)
- Module disabling | Pipeline stall | Every branch instruction | Only defective module
- Block disabling | Additional BTB misses | Accesses to the BTB sets with disabled blocks | Only defective module
- Way disabling | Additional BTB misses | Every BTB access | Only defective module
- Set remapping | Additional BTB misses | Remapped branches and branches mapped to the shared sets | Only defective module
- No-action | Additional recovery cycles* | Every hit on defective BTB blocks | Only defective module
*Assumes a defective BTB supplies an erroneous target address on a hit to a defective block.
4.1.4 No-action approach BTB module fast grading
According to the grade setup flow in Section 3.3.5.1, grading processors which employ the no-action approach requires microarchitecture benchmarking in conjunction with module circuit modeling for defect-induced fault injection. By modeling a gshare branch predictor within a microarchitecture simulation, the authors of [4] investigated the performance degradation of the
defective versions of a gshare branch predictor module. They showed that for most stuck-at-
faults considered, there is very little degradation in prediction accuracy. Hence the resulting
performance degradation is small for most stuck-at-faults.
A similar methodology can be used to set up grades when employing the no-action approach in the BTB module. However, unlike the study in [4], microarchitecture benchmarking for the BTB module does not have to be joined with circuit layer modeling or defect-induced fault injections, for the following reasons. The majority of the area and transistor count in a BTB is taken by the BIA and BTA memory arrays; hence, we focus on the defects in these arrays. The effect of the defects in the BTB arrays can be captured in the form of specific types of errors in the values stored, i.e., erroneous target addresses or erroneous tags. Therefore, predictions from a defective BTB module can be captured in the following categories, which the sketch after the list makes concrete.
1) False hit: Defects in the BIA can generate a false hit with an incoming PC, so an incorrect target is provided by the BTB. (However, given the space of all possible instruction addresses, such accidental aliasing is very unlikely to occur.)
2) False miss: Defects in the BIA cause a mis-match where a defect-free BTB module would have had a match.
3) Wrong target: Defects in the BTA cause the BTB module to provide an erroneous target address when a match occurs.
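For the fast grading flow described in this section, the three categories can be recognized directly from the stored (possibly corrupted) BIA/BTA values, as in the following hedged sketch; all names are invented.

    #include <stdint.h>

    /* Hedged sketch: classify the effect of a stored-bit error on one BTB
     * prediction, given the defect-free and the (possibly corrupted) stored
     * values of the entry's tag and target. */
    enum btb_outcome { CORRECT, FALSE_HIT, FALSE_MISS, WRONG_TARGET };

    enum btb_outcome classify(uint32_t good_tag, uint32_t stored_tag,
                              uint32_t good_bta, uint32_t stored_bta,
                              uint32_t lookup_tag)
    {
        int good_hit   = (lookup_tag == good_tag);    /* defect-free behavior */
        int actual_hit = (lookup_tag == stored_tag);  /* defective behavior   */

        if (actual_hit && !good_hit) return FALSE_HIT;   /* corrupted BIA aliases     */
        if (!actual_hit && good_hit) return FALSE_MISS;  /* corrupted BIA mis-matches */
        if (actual_hit && stored_bta != good_bta)
            return WRONG_TARGET;                         /* corrupted BTA on a hit    */
        return CORRECT;
    }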
The above erroneous results cause different penalties, depending on the prediction made by the branch predictor module as compared against the actual branch resolution. Table X shows the incurred penalties.
As can be seen, the penalty caused by a defective BTB module differs based on the prediction from the branch predictor module and the actual branch resolution. It also depends on the difference between the cycle count required for target address calculation and the cycle count required for resolving a branch condition.
4.2 Arithmetic datapath modules
This section describes the application of the above framework to arithmetic datapath modules.
First, system layers explorations are carried out for these modules. Possible defect-tolerance
approaches are then summarized.
Arithmetic datapath modules carry out the main logic and arithmetic computations. In a
typical processor, these modules include fixed-point arithmetic logic unit (ALU) modules,
complex fixed-point unit modules (for example, fixed-point multiplication unit), and floating
point unit (FPU) modules. Typically datapath modules are not speculative. In the following, we
will use the term datapath modules for short to refer to arithmetic datapath modules.
Table X. Scenarios when using no-action for a defective BTB module (with a defect-free branch predictor)
(columns: Prediction (by a defect-free predictor)/actual resolved branch | False hit | False miss | Wrong target)
- Taken/Taken | Penalty to recover from execution due to use of a wrong target address | Penalty due to pipeline stall to wait for target address calculation | Penalty to recover from execution due to use of a wrong target address
- Taken/Not-taken | No penalty | No penalty | No penalty
- Not-taken/Taken | No penalty if the actual target address is calculated before/when the branch is resolved (same for all three error types)
- Not-taken/Not-taken | No penalty | No penalty | No penalty
4.2.1 System layers exploration
4.2.1.1 Circuit layer exploration for all datapath modules
As many datapath modules have iterative structures, the bit-slices of a functional unit in a datapath module are the inherent functional redundancy of each other. This observation is based on alternative implementation of functionality, as many datapath functions can be implemented by combining the same type of function applied to shorter operands. For example, an AND instruction with 32-bit operands can be carried out by concatenating the outcomes of two AND instructions on 16-bit operands. The granularity of the inherent bit-slice functional redundancy can be varied by grouping various numbers of bit-slices together. Typically, there is no controller (an FSM) or inherent configurability for such bit-slice functional redundancy if a datapath module is designed to execute instructions in one pass. Extra bit-slices can also be added as extraneous functional
redundancy for a functional unit. However, an extraneous bit-slice cannot be used to replace any
defective bit-slice with only multiplexing or de-multiplexing at the input/output of a complex
functional unit (e.g., adder or multiplier). Consider using an extraneous bit-slice to replace any
defective bit-slice in an adder functional unit. The multiplexing and routing of carry bits and
shifting of input bits will significantly increase the design complexity. Note that our framework
does not consider duplication of circuit structures as this incurs at least 100% area overhead.
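The alternative-implementation observation is trivially concrete for bitwise operations, as in the following sketch of a 32-bit AND composed from two 16-bit ANDs.

    #include <stdint.h>

    /* A 32-bit AND computed by concatenating two 16-bit ANDs, as described
     * in the text above. */
    uint32_t and32_from_16(uint32_t a, uint32_t b)
    {
        uint16_t lo = (uint16_t)a & (uint16_t)b;                  /* low slices  */
        uint16_t hi = (uint16_t)(a >> 16) & (uint16_t)(b >> 16);  /* high slices */
        return ((uint32_t)hi << 16) | lo;
    }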
4.2.1.2 ALU module: non-circuit layers
1) Microarchitecture layer
Alternative module functional redundancy of an ALU module can be identified. Since fixed-
point instructions are one of the most common instruction types, multiple ALU modules are
commonly available in one processor core to support parallel execution of multiple fixed-point
instructions per cycle. For example, Intel Nehalem features five ALU modules in its out-of-order
processor core [43]. Such features are designed and visible at the microarchitecture layer.
In a generic flow of fixed-point instructions from the dispatch stage to the execution stage,
decoded instructions are dispatched into a unified reservation station module waiting for operand
values to be available. The en-queued instructions will have their ready flags in their status fields
set once all their input operands are ready. An instruction scheduler module is responsible for
delivering the operands of a fixed-point instruction to one of the available ALU modules for
computation. Instructions are allowed to execute out-of-order and the instruction scheduler
module controls the instructions’ routing between the reservation station and the ALU modules
[44]. Hence, each ALU module is an alternative module functional redundancy of every other ALU module at the granularity of a module entity, and the instruction scheduler module is the corresponding controller under which the inherent configurability exists.
By using defect information for the ALU modules and slightly modifying the instruction scheduler, the use of defective ALU modules can be avoided. Hence, a fabricated processor with defective ALU modules can still be used if at least one defect-free ALU is present in the core. This can be achieved by deactivating the cycle counter [45] of the defective ALU modules in the instruction scheduler module. The approach, ALU-disabling, is investigated in Chapter 5.
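A minimal sketch of such a modified scheduler decision is shown below, assuming a five-ALU core as in the Nehalem example; the defect map, the busy flags, and the selection loop are invented stand-ins for the real scheduler structures.

    #include <stdint.h>

    /* Hypothetical slice of an instruction scheduler: a defect map of the
     * ALUs is consulted so that ready fixed-point instructions are issued
     * only to defect-free, idle ALU modules. */
    #define NUM_ALUS 5

    static int alu_defective[NUM_ALUS];  /* from manufacturing test results */
    static int alu_busy[NUM_ALUS];       /* simplified cycle-counter state  */

    /* Returns the ALU chosen for issue, or -1 to keep the instruction
     * waiting in the reservation station. */
    int pick_alu(void)
    {
        for (int i = 0; i < NUM_ALUS; i++)
            if (!alu_defective[i] && !alu_busy[i])
                return i;
        return -1;  /* all defect-free ALUs busy: instruction stays queued */
    }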
2) ISA layer: ALU modules are not controllable individually at the granularity of a module entity at this layer. Instead, ALU modules are viewed through the individual functions, i.e., ADD, AND, OR, etc., that they implement and that are defined in the ISA. A fixed-point instruction itself can be implemented by a combination of other instructions, and compilers (should) have the ability to synthesize the functionality required by a program by using different combinations of the available instructions in the ISA. Hence, alternative implementation functional redundancy at the ISA layer exists for the individual functional units in ALU modules. The compiler and the ISA are the corresponding controllers with inherent configurability, since the control information for the ALU modules, i.e., the instructions, is generated by them. Another controller is the decoder module, which is identified by tracing the instruction flow in the pipeline. A decoder module merely interprets instructions to generate control signals; hence, there is no inherent configurability of ALU modules under a decoder module. By modifying the controllers with the corresponding defect information, the defective functional units in the ALU modules can be effectively avoided.
4.2.1.3 Complex fixed-point unit module and floating-point unit module: non-circuit layers

ISA layer: Functional redundancy can be discovered in the same way in which the ALU modules' alternative implementation functional redundancy was found at this layer. Alternative implementations of the singular datapath modules' functions can be derived at this layer and serve as functional redundancy. The alternative implementations of the modules' functionality and their corresponding controllers, i.e., the compiler, the ISA, and the decoder module, can be modified so that the use of the defective modules is avoided.
Microarchitecture layer: In some modern processor chips, several processor cores can share a
single FPU module, and each core may have at most one complex fixed-point (e.g.,
multiplication) unit module [43] [46]. Unlike ALU modules, there is no external functional
redundancy at the microarchitecture layer and the modules are singular in processor chips.
Hence, in the following these modules are referred to as singular datapath modules.
4.2.2 Defect-avoidance approaches
This section lists the possible defect-avoidance approaches identified at each layer based on the functional redundancies explored in the previous section. Defect-avoidance approaches for datapath modules maintain ISA layer correctness by avoiding the use of defective modules or defective functional units, and use alternative modules or alternative implementations to carry out the functionality provided by the defective module or functional unit. Therefore, no errors are produced or written to visible states.
1) Circuit layer
Most datapath modules have circuit layer bit-slice functional redundancy within the modules themselves. To achieve defect-avoidance by using bit-slice functional redundancy, the module must be re-designed so that it is capable of repetitively utilizing defect-free bit-slices to operate on full-length input operands in a defective functional unit. Full-length operations are effectively divided into multiple shorter-length operations executed in multiple passes and then combined. Temporary storage, e.g., registers, is required to buffer the partial results from each pass. By using only the defect-free part of a datapath module to compute, defects are avoided and the defective module is partially utilized. An additional FSM, or an enhancement to an existing FSM in the target datapath module, is required to control the use of bit-slice groups accordingly (more in Chapter 5). This approach is referred to as circuit layer partial utilization.
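The multi-pass idea can be illustrated in software for an adder whose upper bit-slices are assumed defective: a full 32-bit addition is carried out as two passes through the low 16 bit-slices, with the carry buffered between passes. The sequential C code below merely illustrates the decomposition; in hardware, both passes would reuse the same defect-free slices under control of the added FSM.

    #include <stdint.h>

    /* Sketch of partial utilization for an adder: a 32-bit add performed in
     * two passes over 16-bit slices, with the carry buffered between passes. */
    uint32_t add32_two_pass(uint32_t a, uint32_t b)
    {
        uint32_t lo    = (a & 0xFFFFu) + (b & 0xFFFFu);  /* pass 1          */
        uint32_t carry = lo >> 16;                       /* buffered carry  */
        uint32_t hi    = (a >> 16) + (b >> 16) + carry;  /* pass 2          */
        return ((hi & 0xFFFFu) << 16) | (lo & 0xFFFFu);
    }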
2) Microarchitecture layer
ALU module disabling can be performed by enhancing the instruction scheduler module to
store and use a defect map of the ALU modules. By using only defect-free inherent functional redundancy, every defective ALU module can be avoided. We also develop an error-masking
approach to enhance the performance of chips with defective ALU modules.
3) ISA layer
Disabling-type and remapping-type approaches can be conceived at this layer. The general implementation of disabling approaches is first described. Several possible remapping approach implementations are then discussed.

By removing the potentially erroneous instructions, i.e., the instructions which may execute on a defective functional unit, from the available instruction set, binaries generated by compilers will not contain the potentially erroneous instructions (assuming compilers are capable of implementing alternatives). Both the defective functional units in the defective modules and the defect-free functional units of the same type in other modules will never be used. Hence, the potentially erroneous instructions are disabled, and an instruction disabling approach is derived.
Instead of disabling the use of potentially erroneous instructions by modifying the ISA
controller or the compiler controller, potentially erroneous instructions can be remapped by a
decoder module to a sequence of other normal instructions that carry out the same functionality.
When the decoder module detects an instruction that has the potential of being executed
erroneously, it triggers a mechanism to introduce an alternative normal instruction sequence into the pipeline, which implements the potentially erroneous instruction's functionality. The mechanism
to introduce the alternative normal instruction sequence can be achieved using a hardware
implementation or a software implementation.
In hardware, microcode can be used for this purpose. Microcode is used in modern processors to fix bugs found after product shipment. When buggy instructions are encountered during operation, they are replaced by a sequence of alternative instructions. When the system boots up, the alternative instructions are loaded into an on-chip microcode ROM from system memory. When a buggy instruction is decoded, the alternative instructions in the ROM are injected into the pipeline to carry out computation that is equivalent to that of the buggy instruction. This mechanism can also be used to implement the abovementioned instruction remapping approach. This approach is referred to as microcode-based instruction remapping.
In software, the exception mechanism, which is sometimes used to handle anomalies (e.g., divide by zero) by invoking software procedures, can be used for defect-avoidance purposes. This implementation is referred to as exception-based instruction remapping. Note that the use of a software-based mechanism does not mean that a defect-tolerance approach is discovered at the software/OS layer, since the functional redundancy is identified by exploring the ISA layer.
The concept of instruction remapping can also be implemented by introducing new controllers which may not be in the original system design. Instruction translation and instruction emulation fit the concept of remapping. Translation can be performed statically when the original binaries are compiled. Emulation can be performed dynamically during runtime. A similar mechanism has been used to allow non-native applications to execute on a processor [47]. Implementing instruction remapping in this way may require the introduction of extraneous controllers into the system, i.e., the use of additional software which works either statically to translate or dynamically to emulate, in case such a mechanism does not exist in the system. These approaches are referred to as translation-based and emulation-based instruction remapping, respectively.
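Whatever the triggering mechanism (microcode, exception, translation, or emulation), the substituted sequence itself is ordinary code. As a hypothetical example, a fixed-point multiply whose multiplier unit is defective could be remapped to the following ALU-only shift-and-add sequence, shown in C for readability.

    #include <stdint.h>

    /* Example of an alternative "normal instruction sequence": a fixed-point
     * multiply carried out with only shifts, adds, and tests, so that a
     * defective multiplier unit is never used. */
    uint32_t mul32_shift_add(uint32_t a, uint32_t b)
    {
        uint32_t product = 0;
        while (b != 0) {
            if (b & 1u)
                product += a;  /* add the shifted multiplicand when bit set */
            a <<= 1;
            b >>= 1;
        }
        return product;        /* result modulo 2^32, like the hardware op */
    }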
4.2.3 Error-masking approaches
This section discusses possible error-masking opportunities to increase silicon utilization on
fabricated processor copies with defective datapath modules. By employing error-masking
approaches, correctness is maintained at all layers above the layer at which a correcting
mechanism is implemented. For instance, ISA layer correctness is maintained if errors are
corrected by a microarchitecture layer approach. We explore error-masking approaches by first discussing error-masking with direct result checking, and then error-masking with fault-excitation condition checking. For all error-masking approaches, if checking fails, the checking mechanism invokes a correcting mechanism which is based on the defect-avoidance approaches mentioned previously.
As mentioned in Section 3.2.2, effective error-masking with direct result checking requires alternative implementations of the target module's functionality to generate trusted results. The only way to implement this approach is at the ISA layer, since there are no alternative implementations for datapath modules at the microarchitecture layer. An ISA layer checking mechanism may work as follows. In the program code, a potentially erroneous datapath instruction is accompanied by a sequence of substitute instructions which generates a trusted result independently of the functional units executing the potentially erroneous instruction. The possibly erroneous result generated by the potentially erroneous instruction is compared against the trusted result. If the results match, the computation goes on. If not, a correcting mechanism is invoked to invalidate the erroneous result and continue the computation using the trusted result. This approach is referred to as ISA layer error-masking with direct result comparison.
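A software rendering of this checking mechanism might look as follows, reusing the shift-and-add substitute from the earlier sketch as the trusted alternative implementation; the instrumentation shown is an assumption about how a compiler could emit it, not a specific design.

    #include <stdint.h>

    /* Hedged sketch of ISA layer error-masking with direct result
     * comparison: the possibly erroneous multiply is checked against a
     * trusted result from an ALU-only substitute sequence (the earlier
     * mul32_shift_add sketch); on mismatch the trusted result masks the
     * error and the computation continues. */
    uint32_t mul32_shift_add(uint32_t a, uint32_t b);  /* trusted substitute */

    uint32_t checked_mul(uint32_t a, uint32_t b)
    {
        uint32_t suspect = a * b;                  /* may use a defective unit */
        uint32_t trusted = mul32_shift_add(a, b);  /* independent of that unit */
        return (suspect == trusted) ? suspect : trusted;
    }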
Fault-excitation condition checking can also be implemented in software or hardware. Hardware-based checking stores fault-excitation vectors on-chip and compares them against the input operands of incoming instructions. Software-based checking uses additional instructions to compare the operands of potentially erroneous instructions in the software with the stored fault-excitation vectors. These approaches are referred to as ISA layer error-masking with hardware-based fault-excitation condition checking and ISA layer error-masking with software-based fault-excitation condition checking, respectively.
However, as we mentioned in Section 3.2.2, exact fault-excitation conditions of datapath
modules are difficult to derive and often expensive to store. In Chapter 5, we develop a
subsuming fault-excitation condition checking based microarchitecture layer error-masking approach for ALU modules, using alternative modules as correcting mechanisms.
4.2.4 Approach analysis and defect map consideration
This section first compares the impact and the feasibility of the approaches identified above.
The construction of defect maps is then addressed.
1) Impact and feasibility analysis
Impact analysis of the approaches is summarized in Table XI. The following analysis discusses ALU modules and singular datapath modules, i.e., complex fixed-point modules (such as an integer multiplier module) and FPU modules, separately. When employing defect-avoidance approaches for ALU modules, ISA layer approaches seem attractive since, when implemented in software, they affect the performance of defective chips only. However, in such cases, functional units of the same type as the defective functional unit in defective ALU modules will not be utilized during operation. The same reason discourages the use of the ISA layer error-masking approaches listed for ALU modules: some instructions will be checked unnecessarily, since they might execute on one of the defect-free ALU modules.
When defect-avoidance approaches are applied to singular datapath modules, the instruction disabling approach does not require other mechanisms such as microcode, decoder triggering, or an additional software layer. When error-masking approaches are employed, ISA layer error-masking with direct result comparison seems the most attractive, since it affects only a subset of chips and its checking mechanism is viable (no exhaustive testing is required, as in fault-excitation condition checking).
Table XII shows the microarchitecture level performance penalties of defective datapath
modules for different approaches. It can be seen that almost all ISA layer approaches will
effectively avoid using all datapath modules that are of the same type as the defective module.
Table XI. Datapath approaches impact analysis
(columns: Approach | Magnitude | Incurred frequency | Pervasiveness)
- Circuit layer partial utilization | Module latency ↑ | Every module access | All chips
- ALU module disabling (μArch layer) | Module latency ↑ | Every fixed-point instruction | All chips
- Instruction disabling (ISA layer) | Compilation time ↑ | Only during compilation | Only defective chips
- Microcode-based instruction remapping* (ISA layer) | Compilation time and decoder latency ↑ | During compilation/every instruction | All chips
- Exception-based instruction remapping (ISA layer) | Compilation time and decoder latency ↑ | During compilation/every instruction | All chips
- Emulation-based instruction remapping (ISA layer) | Additional software layer | Every instruction | Only defective chips
- ISA layer error masking with direct result comparison | Increased compilation time and instruction counts | During compilation and runtime | Only defective chips
- ISA layer error masking with hardware-based fault-excitation condition checking | Instruction latency ↑ | Every instruction | All chips
- ISA layer error masking with software-based fault-excitation condition checking | Compilation time and instruction counts ↑ | During compilation and runtime | Only defective chips
*Assume that the microcode mechanism is already designed in the processor.
Hence, in this case, severe performance degradation can be expected for the ISA layer approaches.

The controller module modifications required for the approaches are summarized in Table XIII. In general, ISA layer approaches require the flexibility of ISA and/or compiler modification. Circuit layer and microarchitecture layer approaches require modifications of the original designs. The feasibility of the approaches depends on the target processor and system.
Table XII. Defective datapath modules microarchitecture level performance penalties
(columns: Approach | Magnitude | Incurred frequency | Pervasiveness)
- Circuit layer partial utilization | Increased operation cycles | Every module access | Only defective modules
- ALU module disabling (μArch layer) | Decreased number of available modules | Every fixed-point instruction | Only defective modules
- Instruction disabling (ISA layer) | Less efficient instructions used | Same as the frequency of the disabled instructions in the original program | All modules of the same type as the defective one will not be used
- Microcode-based instruction remapping (ISA layer) | Less efficient instructions used | Same as the frequency of the disabled instructions in the original program | All modules of the same type as the defective one will not be used
- Exception-based instruction remapping (ISA layer) | Less efficient instructions used | Same as the frequency of the disabled instructions in the original program | All modules of the same type as the defective one will not be used
- Emulation-based instruction remapping (ISA layer) | Additional software layer | Every instruction | All modules of the same type as the defective one will not be used
- ISA layer error masking with direct result comparison | Checking overhead, and recovery overhead if an error occurs | Every instruction that may execute on a defective module | All modules of the same type as the defective one will not be used
- ISA layer error masking with software-based fault-excitation condition checking | Checking overhead, and recovery overhead if an error occurs | Every instruction that may execute on a defective module | All modules of the same type as the defective one will not be used
- ISA layer error masking with hardware-based fault-excitation condition checking | Recovery overhead if an error occurs | When a fault-exciting instruction executes on a defective module | Only defective modules
In summary, for a datapath module with multiple copies in a processor core, such as an ALU module, ISA layer approaches are expected to decrease the performance of defective chips severely. ALU module disabling is expected to penalize microarchitecture level performance the least, at the lowest overhead. In Chapter 5, we investigate the efficiency of the microarchitecture layer approach, ALU-disabling, and combine it with an error-masking approach. For a singular datapath module, an ISA layer approach may be suitable, since singular datapath modules, such as the FPU, are only utilized by some applications. Microcode-based instruction remapping, re-discovered by our framework, has been implemented for an FPU in [14]. The authors investigated the use of such a mechanism for a floating point unit to tolerate hard defects. They showed an approximately 7.3× slowdown on average and an approximately 20× maximum slowdown for eight floating point SPEC2000 benchmarks when defective floating point instructions are replaced by alternative fixed-point instructions.
Table XIII. Modules modified by the approaches for datapath modules
(columns: Approach | Module(s) modified)
- Circuit layer partial utilization | Datapath module FSM control logic
- ALU module disabling (μArch layer) | Instruction scheduler module
- Instruction disabling (ISA layer) | ISA and compiler
- Microcode-based instruction remapping* (ISA layer) | None
- Exception-based instruction remapping (ISA layer) | Decoder module
- Emulation-based instruction remapping (ISA layer) | Additional runtime emulation software
- ISA layer error masking with direct result comparison | ISA and compiler
- ISA layer error masking with hardware-based fault-excitation condition checking | Additional hardware module for operand inspection
- ISA layer error masking with software-based fault-excitation condition checking | Compiler
*Assume that the microcode mechanism is already designed in the microprocessor.

2) Defect map construction
The derivation of defect information for defect maps depends on the granularity of the functional redundancy utilized by the corresponding approach. For the non-circuit layer approaches, the functional redundancies utilized are at the granularity of individual functional units or a module (ALU). The commonly used scan-based manufacturing test procedure can be used to test each functional unit or module and identify the defective ones. The test results, i.e., the identification of the defective functional units or defective modules, are then used for the construction of defect maps. For the circuit layer partial utilization approach, ATPG can be modified to test the individual bit-slice groups within a functional unit of a datapath module. Such a procedure will be discussed at the end of Section 0.
4.3 Caching modules
Caching modules include low level cache modules and TLB modules, which are mainly memory-based. Hence, the classic circuit layer spare rows-and-columns approach is available. This section first describes the high level functionality of each caching module. Non-circuit layers are then explored to identify new defect-tolerance approaches for each individual module.

Caching modules are incorporated on-chip to overcome the memory wall problem, which arises from the speed disparity between processor cores and main memory. Caching modules take advantage of the temporal and spatial locality of programs to achieve higher performance. TLB modules store recently used address translations from the page tables residing in the main memory. Cache modules bring frequently accessed instructions and data closer to the logic in processor cores.
4.3.1 Low level cache modules
Low level cache modules are managed in the same way as the LLC module: they are organized into sets, blocks, and ways. However, the major operational difference between low level caches and the LLC is that low level cache modules are required to have low access latency in order to keep up with the processor core's logic. Typical low level caches are much smaller than the LLC module and are virtually addressed (i.e., addressed using virtual addresses). We assume such low level cache modules in the following.
4.3.1.1 System layer exploration
At the microarchitecture layer, the functional redundancies identified for the LLC module, namely blocks, ways, sets, and the main memory, and their corresponding controllers, also apply to low level cache modules. At the ISA layer, no functional redundancy can be identified, since the system abstraction at this layer lacks cache modules.

At the software/OS layer, the page-cover functional redundancy of the LLC module is also applicable to low level caches in a similar manner, as described in the following. Similar to the page-covers observed from the mapping relation between the LLC and the main memory, low level caches have an identical mapping relation with the virtual memory space. The virtual memory space is also divided into virtual page frames by its manager module. A page-cover in a virtually addressed cache, a virtual page-cover, is a contiguous space in the cache module mapped by a virtual page frame. The controller of virtual page-covers is the manager module of the virtual memory space, i.e., a combination of the compiler and the OS. The allocation of virtual space to a software process occurs in two phases. In the first phase, static-allocation address space, such as the address space containing executable binaries, is allocated during compilation by the compiler. In the second phase, the dynamic-allocation address space, such as the heap containing runtime-generated data, is allocated dynamically by the OS during program execution.
4.3.1.2 Possible approach implementations and analysis
This section first lists the possible approaches identified during the above exploration. Then the impact and feasibility of the approaches are analyzed.

Defect-avoidance: At the microarchitecture layer, the implementations of the disabling and remapping approaches for the LLC module can also be applied to low level caches, because the two are managed in the same way at this layer. The approaches include module disabling, block disabling, way disabling, and set remapping. At the software/OS layer, virtual page-cover disabling can be achieved by enhancing the two-phase controllers, namely the compiler and the OS. Virtual page-cover functional redundancy is inherently configurable under these controllers, just as the page-cover functional redundancy in the LLC module is under the OS memory allocation module. However, while other page-cover controllers are available for the LLC module, no other controller is available for virtual page-covers: the cache modules are accessed by the virtual addresses computed by instructions, and there is no other module between the generation of this control information and the cache modules that could be enhanced to perform virtual page-cover based remapping. Hence, virtual page-cover remapping cannot be applied. (Note that page-cover remapping may still be performed at the input of the cache's decoder; however, that is equivalent to microarchitecture layer set remapping.)
Error-masking: Similar to the LLC module, microarchitecture layer load speculation is possible for low level cache modules. However, instead of using the main memory as the only alternative module for instruction and data access functionality, low level cache modules also have the LLC module as functional redundancy.
Impact and feasibility analyses: Table XIV shows the impact analysis for the above approaches. As mentioned in Section 2.6, the microarchitecture layer disabling and remapping approaches have been proven feasible and have been successfully implemented. Virtual page-cover disabling, however, is not feasible for the following reason. The size of a low level cache is typically small in order to achieve low latency, so a page-cover of typical size (4KB) is likely to occupy a large percentage of the cache module, which makes virtual page-cover disabling impractical.
Table XIV. Low level cache approach implementation impact analysis
(columns: Approach | Magnitude | Incurred frequency | Pervasiveness)
- Block or way disabling | Increased module latency | Every module block replacement event | All chips
- Module disabling | Increased module latency and next-level cache latency | Every module access | All chips
- Set remapping | Increased module latency | Every module access | All chips
- μArch layer speculative load | Increased pipeline control latency | Every instruction | All chips
- ISA-and-OS virtual page-cover disabling | Increased compilation time and runtime | During compilation and dynamic virtual memory allocation | Defective chips

Table XV summarizes the microarchitecture level performance penalties of a defective low level cache module under different approaches. Since low level cache modules are typically private to each processor core in multi-core chips, applying the virtual page-cover disabling
approach implies that all processor cores will be penalized even in the presence of defects in only one core's low level cache. Speculatively loading from a defective low level cache module can prevent instructions from committing before the load is verified. The cycle count required for verifying such a load instruction determines the number of dependent instructions that must stall.
Table XV. Defective low level cache microarchitecture level performance penalties
(columns: Approach | Magnitude | Incurred frequency | Pervasiveness)
- Block disabling | Increased probability of misses | Every access to sets with disabled blocks | Only defective module
- Way disabling | Increased probability of misses | Every module access | Only defective module
- Module disabling: data cache | Increased cycle count for memory access instructions | Every memory access instruction | Only defective module
- Module disabling: instruction cache | Increased cycle count for instruction fetch | Every instruction fetch | Only defective module
- Set remapping | Increased probability of misses | Every module access | Only defective module
- μArch layer speculative load | Increased cycle count for write instructions | Every write instruction | Only defective module
- μArch layer speculative load | Increased stalling cycle count to commit instructions | Some instructions depending on load instructions (depending on the actual cycles needed) | Only defective module
- ISA-and-OS virtual page-cover disabling | Increased misses | During compilation and dynamic virtual memory allocation | All modules in all processor cores

4.3.2 TLB module

Figure 14 depicts the generic organization of a fully-associative TLB. The calculated virtual address is divided into a tag field, i.e., the virtual page number (VPN) field, and an offset field. The tag field is used to CAM-match (i.e., match using a CAM) against the VPNs stored in the tag array. The corresponding page frame number (PFN) stored in the matched entry of the SRAM-based PFN array is then combined with the offset field to form the physical address. TLB entries are managed by the replacement policy logic shown in the figure.

Figure 14. TLB operation
4.3.2.1 System layers exploration
At the microarchitecture layer, the TLB module is managed in a similar fashion to cache modules. Hence, the microarchitecture layer approaches for cache modules are also available for TLB modules in general. The block disabling, way disabling, and set remapping approaches are available to the TLB by using the same types of internal functional redundancies. Module disabling is also available; the external functional redundancy of a TLB module is the main memory, since the TLB caches a portion of the page table from the main memory for fast access. The module disabling approach can be implemented by accessing the main memory directly to obtain the translated address.
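A behavioral sketch of this module disabling approach is shown below: when the defect map marks the TLB as disabled, every translation takes the page-table-walk path. The walker and lookup functions are trivial invented stand-ins for the real structures.

    #include <stdint.h>

    /* Sketch of TLB module disabling: with the TLB marked defective, every
     * translation bypasses it and walks the page table in main memory. */
    static int tlb_enabled = 0;  /* 0: module disabled via the defect map */

    static int tlb_lookup(uint64_t vpn, uint64_t *pfn)
    {
        (void)vpn; (void)pfn;
        return 0;                /* defective module is never consulted */
    }

    static uint64_t page_table_walk(uint64_t vpn)
    {
        return vpn;              /* stand-in walker: identity mapping */
    }

    uint64_t translate(uint64_t vaddr)
    {
        uint64_t vpn = vaddr >> 12, pfn;
        if (tlb_enabled && tlb_lookup(vpn, &pfn))
            return (pfn << 12) | (vaddr & 0xFFFu);  /* fast path (disabled) */
        pfn = page_table_walk(vpn);                 /* defect-avoidance path */
        return (pfn << 12) | (vaddr & 0xFFFu);
    }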
At the ISA layer, individual entries or all entries in the TLB can be invalidated using specific instructions provided in the ISA. However, the use of TLB entries, i.e., writes to individual entries, is not controlled at the ISA layer. Therefore, there is no approach available for the TLB module at the ISA layer.
At the software/OS layer, the mapping relation between VPNs and TLB sets can be exploited in a non-fully-associative TLB module, in a manner similar to the page-cover based approaches used for cache modules. As in a non-fully-associative cache, in a non-fully-associative TLB the VPN field is further divided into an index field, which is used to index into a set, and a tag field, which is used for the tag match. In the TLB module, every set is functionally equivalent to every other set, just as the sets in cache modules are. Hence, TLB sets are the internal functional redundancy of each other. The VPN field of an accessing address is determined when a virtual page frame is allocated to a software process. The controller of the set functional redundancy is therefore the two-phase virtual page frame allocation module, namely the compiler and the OS. A VPN-based approach can be achieved by modifying the virtual page frame allocation module.
4.3.2.2 Possible approach implementations and analyses
Under the defect-avoidance category, the implementations of the microarchitecture layer approaches for the TLB module are similar to those for cache modules, with similar impact and feasibility characteristics. At the software/OS layer, VPN-based disabling approaches can be implemented by modifying the two-phase virtual page frame allocation module, as sketched below.
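A minimal sketch of the VPN-based disabling policy: the allocator simply skips virtual page frames whose index bits select a TLB set marked defective. The set count, the index extraction, and the assumption that at least one set is usable are all invented for illustration.

    #include <stdint.h>

    /* Hypothetical VPN-based set disabling: the two-phase virtual page
     * frame allocator (compiler and OS) skips VPNs whose index bits select
     * a TLB set marked defective in the defect map. */
    #define TLB_SETS 64

    static int set_defective[TLB_SETS];  /* from manufacturing test results */

    static int vpn_usable(uint64_t vpn)
    {
        return !set_defective[vpn % TLB_SETS];  /* index field of the VPN */
    }

    /* Return the first usable VPN at or after the requested one; assumes
     * at least one TLB set is defect-free. */
    uint64_t next_usable_vpn(uint64_t vpn)
    {
        while (!vpn_usable(vpn))
            vpn++;
        return vpn;
    }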
Under the error-masking category, microarchitecture layer speculative TLB access is possible, considering the following. Each TLB access is accompanied by a page table access for checking the TLB result. Similar to speculative loads from cache modules, no results computed using speculation can be committed before the TLB access is verified as correct. Software/ISA layer speculative TLB access can be implemented by fine-grained management of the load and store instructions which require physical addresses from TLB accesses. The capability to derive addresses from both the page tables and the TLB for each load and store instruction must be available at the software/OS layer, so that each individual load or store instruction can be checked.
TLB error-masking can also cause other potential problems and may require further microarchitecture modification. For example, fetching invalid instructions from the LLC which cannot be decoded may cause an exception. An additional mechanism may be required to prevent such exceptions.
Table XVI summarizes the microarchitecture level performance penalties caused by a defective TLB module. For all the approaches, penalties are only incurred when there is a miss in a low level cache module, since a TLB access is needed only when the physical address of a memory access is required.

Table XVI. Defective TLB microarchitecture level performance penalty
(columns: Approach | Magnitude | Incurred frequency | Pervasiveness)
- Block disabling | Increased probability of TLB misses | Every access to sets with disabled blocks caused by a low level cache miss | Only defective module
- Way disabling | Increased probability of TLB misses | Every module access caused by a low level cache miss | Only defective module
- Module disabling: data TLB | Increased translation cycle counts for memory access instructions | Every module access caused by a low level data cache miss | Only defective module
- Module disabling: instruction TLB | Increased translation cycle counts for instruction fetch | Every module access caused by a low level instruction cache miss | Only defective module
- Set remapping | Increased probability of TLB misses | Every module access caused by a low level cache miss | Only defective module
- μArch layer speculative TLB accesses | Increased cycle counts for memory access instructions | Every module access caused by a low level cache miss | Only defective module
- ISA-and-OS VPN disabling | Increased probability of TLB misses | Every module access caused by a low level cache miss | All modules in all processor cores
4.4 Queuing modules
Queuing modules are responsible for buffering instructions, register values, dependency
tracking information, and so on, to support out-of-order execution. Queuing modules are
memory-based. Hence, the circuit layer spare rows-and-columns approach is available to these
modules. In addition, since these modules are non-speculative at the microarchitecture layer, and
are transparent to the ISA layer and the software/OS layer, effective defect-tolerance approaches
can only be devised at the microarchitecture layer. ISA layer and software/OS layer approaches
are possible but will not be effective since the functionality of these modules are not confined in
a few instructions, as discussed in Section 3.2.2. As a consequence, all approaches for queuing
modules will maintain ISA layer correctness. Because that queuing modules share some unique
Table XVI. Defective TLB microarchitecture level performance penalty

| Approach | Magnitude | Incurred frequency | Pervasiveness |
| Block disabling | Increased probability of TLB misses | Every access to sets with disabled blocks caused by a low-level cache miss | Only defective module |
| Way disabling | Increased probability of TLB misses | Every module access caused by a low-level cache miss | Only defective module |
| Module disabling: data TLB | Increased translation cycle counts for memory access instructions | Every module access caused by a low-level data cache miss | Only defective module |
| Module disabling: instruction TLB | Increased translation cycle counts for instruction fetch | Every module access caused by a low-level instruction cache miss | Only defective module |
| Set remapping | Increased probability of TLB misses | Every module access caused by a low-level cache miss | Only defective module |
| μArch-layer speculative TLB accesses | Increased cycle counts for memory access instructions | Every module access caused by a low-level cache miss | Only defective module |
| ISA-and-OS VPN disabling | Increased probability of TLB misses | Every module access caused by a low-level cache miss | All modules in all processor cores |
Because queuing modules share some unique properties, this section first describes the
microarchitecture-layer functional redundancy common to all queuing modules. Then the register
renaming mechanism in the microarchitecture is discussed to streamline the discussion of the
related modules.
All queuing modules share the following properties. At the microarchitecture layer, these
modules are managed at the granularity of individual entries. In general, entries with no valid
data stored are selected by the controller of a queuing module to be allocated to an incoming
allocation request; for example, such a request can be a register allocation request to the physical
register file module when a destination register is requested by an instruction. There is no
difference in functionality between the available entries in a queuing module when the controller
allocates. Therefore, the entries are the internal functional redundancy (entry functional
redundancy) of each other. However, unlike the internal functional redundancy in caching
modules, the use of an individual entry in a queuing module is not mapped by any software/OS-layer
information, such as particular memory addresses of instructions or data, and there is no
additional tag matching when an entry is accessed. Hence, the entry functional redundancy is
only controllable when an entry is allocated, and a remapping approach is not possible.
In addition, only defect-avoidance approaches are possible for queuing modules, not error-masking
approaches targeting individual modules, for the reason that queuing modules have no
alternative on-chip or off-chip modules, and no alternative implementations, that can produce a
trusted result for a checking mechanism.
4.4.1 Register renaming mechanism
This section briefly describes the register renaming mechanism, which is related to the physical
register file module and the reorder buffer module. Write-after-read and write-after-write hazards
occur when instructions are executed out of the original program order in an out-of-order
microarchitecture which exploits instruction-level parallelism. A register renaming mechanism
provides a register rename buffer module, which uses more entries (i.e., rename buffers) than
there are architecture registers, to serve as temporary storage for the destination registers of
executing instructions.
Two general implementation types exist. The first type of implementation includes a mapping
table and a physical register file. The mapping table keeps updating the pointers to the entries in
the physical register file which contain the currently committed architectural values (i.e., the
registers' values in the ISA-visible states), and the pointers to the entries in the physical register
file which contain the latest values of architecture registers (i.e., not-yet-committed intermediate
register values). The destination architecture register is renamed to one of the rename buffers in
the physical register file module. The second type of implementation incorporates the rename
buffers into the reorder buffer (ROB) module, in which the destination architecture register of
each instruction is renamed to the buffer associated with the ROB entry allocated to that
instruction.
4.4.2 Physical register file module
A physical register entry in the physical register file module is allocated by its controller to
rename an architecture register, and is de-allocated (freed) during the commit phase of the next
instruction that defines the same architecture register as the one to which this physical register
entry is mapped, since only then can it be assured that all instructions that consumed this register
have committed. Therefore, the use of physical register entries can only be determined at
allocation time by the allocation controller. One common physical register allocation controller
mechanism is the free register FIFO queue (FRQ) [48]. The addresses of available entries in the
physical register file module are kept in the FRQ; the FRQ is initialized with the addresses of the
entries in the physical register file module. An entry is allocated by reading an address from the
head of the FRQ, and a de-allocated entry's address is written to the tail of the FRQ. A
defect-avoidance approach can be achieved by initializing the FRQ with only the addresses of
defect-free physical registers when the chip is powered up. The use of defective entries will then
be avoided during normal operation, as the following sketch illustrates.
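This is a minimal software sketch of the FRQ idea (my illustration; the register file size and defect map are hypothetical):

```python
# A minimal sketch (hypothetical) of defect-avoiding FRQ initialization: at
# power-up, only the addresses of defect-free physical registers are pushed
# into the free register queue, so defective entries are never allocated.

from collections import deque

NUM_PHYS_REGS = 128                 # assumed physical register file size
defect_map = {17, 63}               # hypothetical defective entries from test

frq = deque(r for r in range(NUM_PHYS_REGS) if r not in defect_map)

def allocate_reg() -> int:
    """Pop a free, defect-free physical register from the head of the FRQ."""
    return frq.popleft()

def free_reg(r: int) -> None:
    """Return a de-allocated register address to the tail of the FRQ."""
    frq.append(r)
```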
4.4.3 Reorder buffer
The reorder buffer (ROB) module is associated with the function of committing instructions
according to program order. The ROB consists of entries that store the information of in-flight
instructions (i.e., instructions between entering the execution stage and being committed) and
keeps instructions in the original program order. Two ROB types can be implemented, depending
on the register renaming mechanism designs mentioned above. In microarchitecture designs
where the ROB also serves the role of register rename buffers, one of the fields in an ROB entry
is used as the rename buffer for the destination architecture register; this field stores the
computed result of the instruction stored in the entry. In contrast, for microarchitecture designs
where rename buffers are implemented as a separate physical register file, there is no such field
in the ROB entries. Instead, a field stores a pointer into the physical register file module to
indicate the destination register of the instruction stored in the ROB entry.

In general, an ROB entry contains the fields shown in Table XVII. The table also describes the
microarchitecture-layer function and the possible failure mode for each field when defects are
present.

As can be seen, defects in ROB entries can either cause erroneous data to be written to visible
states or cause the processor to hang. Either situation is catastrophic to ISA-layer correctness.
Entry functional redundancy in the ROB is controlled at allocation, and entry allocation can be
managed as a FIFO [49]. A microarchitecture-layer defect-avoidance approach can be
implemented by enhancing the FIFO controller: with a defect map indicating defective ROB
entries, the enhanced FIFO controller allocates available defect-free entries and skips defective
entries, as sketched below.
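The following sketch (my illustration, with a hypothetical ROB depth and defect map) shows the skip-on-allocate behavior of such an enhanced FIFO controller:

```python
# A minimal sketch (illustrative) of an enhanced ROB FIFO allocator that
# consults a defect map and skips defective entries when advancing the tail.
# A full design would also skip these entries when the head pointer advances
# at commit, so head and tail traverse the same defect-free sequence.

ROB_SIZE = 64                        # assumed ROB depth
defective_entries = {9, 30}          # hypothetical defect map

def next_free_entry(tail: int) -> int:
    """Advance the circular tail pointer, skipping defective ROB entries."""
    nxt = (tail + 1) % ROB_SIZE
    while nxt in defective_entries:  # skip over entries marked defective
        nxt = (nxt + 1) % ROB_SIZE
    return nxt
```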
4.4.4 Load/store unit
The load/store unit module includes a load unit and a store unit. The load unit buffers load
instructions waiting to access the data cache. The store unit buffers the addresses and the values
to be stored, for in-order commitment of store instructions to the data cache. The store unit also
forwards values to be stored to load instructions with the same accessing address (load
forwarding).
Table XVII. Defective ROB behavior

| Field | Function | Possible outcome if defects are present in the field |
| Busy | Indicates that the entry is occupied by an instruction | The instruction may be overwritten; a valid instruction will not produce and commit its result to visible states as expected |
| Finish | Indicates that the instruction has completed its execution and is waiting to commit | The instruction may be indicated as never completed, causing the microprocessor to hang since the instruction is not able to commit; or the instruction may appear to be completed prematurely, so an erroneous result can be read by other instructions or committed to visible states |
| Speculative | Indicates whether the instruction is on a speculative path | The result of a non-speculative instruction may not commit if it is invalidated by a mis-prediction; an instruction on a mis-predicted path may commit |
| Valid | Indicates whether the instruction is on a mis-predicted path | The result of a non-speculative instruction may not commit; an instruction on the mis-predicted path may commit |
| Inst. addr. | Stores the address of the instruction in the entry | Prevents recovery from an interrupt or exception to the correct program address |
| Inst. type | Indicates the instruction type: reg., mem., etc. | May prevent the results of register-type instructions from being written back to registers, etc. |
| Dest. addr. | Indicates the destination register address to commit the result | Causes the result to be committed to a wrong or an invalid register |
| Result value | Buffers the value of registers to be committed | Writes an erroneous value to architectural registers once the instruction commits |
The store unit can be implemented as a CAM array, for associative search of addresses to
implement load forwarding, together with an SRAM array buffering the values to be stored. The
load unit can be implemented as an SRAM array buffering the load addresses.

Defects in the module can cause erroneous data to be stored, erroneous addresses to be accessed,
or erroneous data to be forwarded to load instructions. Each of these error types can propagate to
visible states and affect ISA-layer correctness. The allocation of entries in the module can be
managed by a free-list [50]. Mechanisms similar to those used for the physical register file
module can be applied to avoid the allocation of defective entries.
4.4.5 Reservation station
Decoded instructions are dispatched by a dispatch module to a reservation station module to
stand by. An instruction must wait in the module until all its dependencies are satisfied (i.e., its
input operands are available) before it becomes ready. Ready instructions are then picked to be
issued to an available datapath module. A reservation station entry contains the fields shown in
Figure 15.
Figure 15. Reservation station entry
A valid field for each operand indicates whether the corresponding operand value is ready. The
operand value and valid bit are updated if the stored operand tag matches the tag broadcast on
the tag bus. The ready signal indicates that both operands are valid and the instruction is ready to
execute. A busy bit indicates that the reservation station entry is occupied by an instruction not
yet issued for execution. An instruction dispatch module examines the busy bits of all entries in
the reservation station module and selects one of the non-busy entries to allocate to a decoded
instruction.

Defects in the entries may cause erroneous data to be used for computation, a hanging
instruction, the overwriting of a not-yet-finished instruction, and other catastrophic events
affecting ISA-layer correctness. A defect-avoidance approach can be achieved by implementing
an entry-disabling mechanism that incorporates a defect-tolerance bit (active high; 1 indicates
defective) in each entry. The instruction dispatch module may select only an entry for which the
NOR of the busy bit and the defect-tolerance bit equals one, i.e., an entry that is neither busy nor
defective, as the following sketch shows.
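A minimal sketch of this selection logic (my illustration; entry count and defect map are assumptions):

```python
# A minimal sketch (illustrative) of entry-disabling in a reservation station:
# each entry carries a defect-tolerance bit, and dispatch treats an entry as
# allocatable only when NOR(busy, defective) == 1.

class RSEntry:
    def __init__(self, defective: bool = False):
        self.busy = False
        self.defective = defective   # loaded from the test-time defect map

def pick_free_entry(entries):
    """Return the index of an entry that is neither busy nor defective."""
    for i, e in enumerate(entries):
        if not (e.busy or e.defective):   # NOR of the two bits
            return i
    return None                            # stall dispatch: no entry available
```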
An entry-based disabling approach will be investigated for all queuing modules to characterize
the approach's efficiency. In contrast to other existing defect-tolerance approaches that are also
possible for queuing modules, such as DIVA [15] and Relax [16], entry-based disabling targets
only the defective modules and requires neither explicit checking pipeline stages nor the
computation of trusted results for checking on other defect-free processor cores. Hence,
entry-based disabling is expected to have higher efficiency.
In this chapter, we have demonstrated the following qualities of our framework. First, the
framework enumerates approaches systematically and subsumes (almost all) other existing
approaches. Second, the framework quantitatively analyzes the discovered approaches to capture
their key characteristics efficiently, without simulation.
Chapter 5. Defect-tolerance approaches for datapath modules
Datapath modules, including arithmetic logic unit (ALU), integer multiplier (MUL), and
floating-point unit (FPU), are commonly used in modern processors. We first summarize the
promising approaches for tolerating defects discussed in the previous chapter. We then perform
detailed utilization analysis to qualitatively evaluate these approaches and make final selections.
Subsequently, we implement our final approaches and evaluate their effectiveness. Throughout
this chapter, we use a single-core out-of-order processor with 2 ALUs, 1 MUL, and 1 FPU as the
unenhanced baseline processor for the evaluation of our approaches. In this chapter, this
unenhanced processor is assumed to have 30% yields. In the next chapter, we will evaluate these
approaches using projected defect densities for future technologies and show that significant
improvements can be provided by our approaches.
Table XVIII summarizes the inherent redundancy present in a design and the corresponding
approaches for each module at different layers discussed in Chapter 4. According to the
pervasiveness analysis, ISA-layer approaches are preferred since their implementations do not
affect all chips. Circuit-layer and microarchitecture-layer approaches are less favorable unless
ISA-layer approaches incur large performance penalties.
Table XVIII. Summary of approaches for datapath modules

| Redundancy and DT approach | ALU | MUL | FPU |
| ISA layer | Using other error-free arithmetic/logic instructions | Using other error-free arithmetic/logic instructions (MUL-DIS) | Using other fixed-point instructions (FP-DIS) |
| μArch layer | Multiple instances (ALU-disabling) | Singular (no approach) | Singular (no approach) |
| Circuit layer | Iterative bit-slices (partial utilization), applicable to all three modules | | |
5.1 Floating-point unit
At the ISA layer, FP-DIS is available for the FPU. However, instead of disabling all types of
floating-point instructions when the FPU is defective, a fine-grained disabling approach can be
developed: it is possible that only the functions provided by the individual defective
floating-point functional units need to be disabled. For example, if an FPU has defects in its
multiplier only, disabling floating-point multiplications (FPMUL-DIS) is sufficient to maintain
correctness, and other floating-point functions can still execute at high performance. Hence, the
fine-grained disabling will have a significantly lower penalty than disabling all floating-point
instructions. In the following sections, we evaluate FP-DIS and FPMUL-DIS, respectively, and
then show the efficiency that can be achieved when the two approaches are combined.
5.1.1 Utilization analysis for FP-DIS
Table XIX shows the distribution of floating-point instructions and the instruction counts with
and without emulation for the floating-point benchmarks. First, we observe the instruction
distribution within a 100M-instruction window selected using SimPoint [51]. It can be inferred
that, in general, an application that has a high floating-point instruction percentage will have low
performance if a defective FPU is used under FP-DIS mode to execute the application. To
account for the performance degradation more accurately, we count the total number of
instructions when FP-DIS is not used (original) and the total number of instructions when
FP-DIS is used (FP-DIS). It can be observed that, in general, applications with a higher FP
instruction percentage also have a higher increase in the total number of instructions when
FP-DIS is used. However, the increases in the number of instructions when FP-DIS is used are
significant for all applications.
As shown in the last column of Table XIX, the relative performance of FP-DIS for every
application is estimated as

$$\text{relative performance} = \frac{1}{\text{instruction count increase}},$$

which ranges from 3.3% to 14.4% of the original applications. This implies that processors with
a defective FPU will have approximately 3.3%–14.4% of the performance of defect-free
processors. Nevertheless, for most integer benchmarks (listed in Table XIX), there will be no
degradation at all for processors with a defective FPU. The only exception is twolf, which has
0.64% floating-point instructions; processors with a defective FPU will have approximately
61.3% of the performance of defect-free processors when running twolf. This number is
projected using polynomial regression from the floating-point benchmarks' data.
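As a worked instance of this estimate, the ammp row of Table XIX shows a 16.8× increase in instruction count, giving

$$\frac{1}{16.8} \approx 5.95\%.$$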
Table XIX. Floating-point instruction distribution

| Benchmark | FP instruction %* | Total instructions (original) | Total instructions (FP-DIS**) | Increase | Relative performance estimate |
| ammp | 32.9% | 1.89 × 10^12 | 3.18 × 10^13 | 16.8× | 5.95% |
| equake | 25.7% | 1.75 × 10^11 | 5.20 × 10^12 | 29.7× | 3.37% |
| apsi | 17.7% | 8.16 × 10^11 | 1.71 × 10^13 | 20.9× | 4.78% |
| swim | 42.5% | 4.40 × 10^11 | 1.34 × 10^13 | 30.3× | 3.30% |
| art | 15.2% | 5.44 × 10^10 | 1.01 × 10^12 | 18.4× | 5.43% |
| applu | 30% | 8.23 × 10^11 | 1.92 × 10^13 | 23.3× | 4.29% |
| mgrid | 6% | 4.43 × 10^12 | 3.08 × 10^13 | 6.9× | 14.49% |
| Integer benchmarks (gzip, mcf, gcc, parser, vortex, bzip2) | – | – | – | 1× | 100% |
| twolf | 0.64% | – | – | 1.63× | 61.3% |

*Observed in the 100M instruction windows selected using SimPoint
**Using a soft-floating-point library during compilation
5.1.2 Evaluation of FP-DIS
Two processor tiers will be fabricated if FP-DIS is applied. Tier-1 (T1) consists of the defect-free
processors and tier-2 (T2) consists of the processors with defects only in the FPU. For a
processor chip with 30% yield, T1's yield is 30% and T2's yield is 7.4%. Figure 16 shows the
efficiency of FP-DIS. Note that we calculate the performance $GOPS_{T1}$ as $IPC_{T1} \times F_o$, and the
equivalent $GOPS_{T2}$ is estimated as $IPC_{T2} \times F_o\,/\,\text{instruction increase}$. Note that, in
order to measure the performance more accurately, $IPC_{T1}$ and $IPC_{T2}$ are derived from
separate microarchitecture simulations using the original benchmark code and the FP-DIS
benchmark code, respectively. The performance-per-area gain for integer benchmarks from T2 is
the highest, since there is no degradation for these benchmarks and every T2 processor performs
as well as a T1 processor. However, for most floating-point benchmarks, significant performance
degradation decreases the overall efficiency.
Figure 16. Efficiency for FP-DIS
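To make the tier bookkeeping concrete, the following is a minimal sketch (not the dissertation's exact efficiency formulation, which is defined earlier in the document) of a yield-weighted performance-per-area estimate for the two tiers; the clock frequency and IPC values are placeholders.

```python
# Illustrative sketch (assumed efficiency formulation): yield-weighted GOPS,
# normalized to the unenhanced baseline where only defect-free chips sell.
# Y_T1, Y_T2, and the instruction increase come from the text and Table XIX.

def gops(ipc: float, f_clk_ghz: float, inst_increase: float = 1.0) -> float:
    """GOPS_T1 = IPC * F_o; GOPS_T2 is scaled down by the instruction increase."""
    return ipc * f_clk_ghz / inst_increase

Y_T1, Y_T2 = 0.30, 0.074          # tier yields from Section 5.1.2
f_o = 3.2                         # assumed clock frequency (GHz)
ipc_t1, ipc_t2 = 1.0, 1.0         # placeholder per-benchmark IPCs
inst_increase = 16.8              # e.g., ammp under FP-DIS (Table XIX)

total = Y_T1 * gops(ipc_t1, f_o) + Y_T2 * gops(ipc_t2, f_o, inst_increase)
baseline = Y_T1 * gops(ipc_t1, f_o)
print("efficiency gain ≈", total / baseline)   # ≈ 1.015 for these inputs
```

For an integer benchmark, where T2 suffers no degradation, the same arithmetic gives roughly 1 + 0.074/0.30 ≈ 1.25, consistent with the highest bars in Figure 16.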
5.1.3 Utilization analysis for FPMUL-DIS
Table XX shows the floating-point multiplication distribution and the instruction counts with
and without emulation for the floating-point benchmarks. It can be observed that the increase in
the number of instructions is significantly lower than that of FP-DIS for every benchmark. It can
be inferred that a processor using FPMUL-DIS will gain more performance-per-area than a
processor using FP-DIS only.
5.1.4 Evaluation of FP-DIS with FPMUL-DIS and comparison
Three tiers of processors will be fabricated if FP-DIS and FPMUL-DIS are applied. In our
synthesis study of the double-precision FPU, the multiplier occupies 51% of the FPU area. Table
XXI summarizes the tiers and their yields.
Table XX. FPMUL instruction distribution

| Benchmark | FPMUL instruction %** | Total instructions (unmodified) | Total instructions (FPMUL-DIS) | Increase | Relative performance estimate |
| ammp | 7.38% | 1.89 × 10^12 | 1.89 × 10^12 | 1.38× | 72.46% |
| equake | 11.23% | 1.75 × 10^11 | 1.15 × 10^12 | 1.85× | 54.05% |
| apsi | 3.49% | 8.16 × 10^11 | 3.28 × 10^12 | 1.75× | 57.14% |
| swim | 0.00% | 4.40 × 10^11 | 4.04 × 10^12 | 1.89× | 52.91% |
| art | 5.83% | 5.44 × 10^10 | 9.12 × 10^10 | 1.40× | 71.43% |
| applu | 4.66% | 8.23 × 10^11 | 1.28 × 10^12 | 1.56×* | 64.10% |
| mgrid | 0.70% | 4.43 × 10^12 | 6.45 × 10^12 | 1.46×* | 68.49% |

*Derived using polynomial regression with available data
**Observed in the 100M instruction windows selected using SimPoint

Table XXI. FP-DIS and FPMUL-DIS tiers and yields

| Tier | Description | Yield |
| T1 | Defect-free | 30% |
| T2 | Any of the FP functional units other than the FP-multiplier is defective | 3.84% |
| T3 | Defective FP-multiplier (other FP functional units defect-free) | 3.59% |
The yield for each tier is calculated as follows. $Y_{T1} = PS(d, A_{\mu p}, 0)$, where $PS(d, A, x)$
represents the probability that there are $x$ defects in an area $A$ given that the defect density
is $d$. The yields of T2 and T3 are calculated as:

$$Y_{T2} = PS(d, A_{\mu p} - A_{FPU}, 0) \times \bigl(1 - PS(d, A_{FPU} - A_{FPMUL}, 0)\bigr)$$

$$Y_{T3} = PS(d, A_{\mu p} - A_{FPMUL}, 0) \times \bigl(1 - PS(d, A_{FPMUL}, 0)\bigr)$$
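The following is a small sketch of this calculation under a Poisson defect model, where $PS(d, A, 0) = e^{-dA}$. The area values are hypothetical (chosen only so the numbers come out near the tier yields quoted above); they are not the synthesized areas.

```python
# A small sketch (illustrative; areas are hypothetical) of the tier-yield
# calculation with a Poisson defect model: PS(d, A, 0) = exp(-d * A).

import math

def ps0(d: float, area: float) -> float:
    """Probability of zero defects in `area` at defect density `d` (Poisson)."""
    return math.exp(-d * area)

A_up = 1.0                      # whole-processor area (normalized, assumed)
A_fpu = 0.183                   # assumed FPU fraction of chip area
A_fpmul = 0.51 * A_fpu          # FP multiplier is 51% of the FPU (synthesis)

d = -math.log(0.30) / A_up      # pick d so the defect-free (T1) yield is 30%

y_t1 = ps0(d, A_up)
y_t2 = ps0(d, A_up - A_fpu) * (1 - ps0(d, A_fpu - A_fpmul))
y_t3 = ps0(d, A_up - A_fpmul) * (1 - ps0(d, A_fpmul))
print(y_t1, y_t2, y_t3)         # ≈ 0.30, 0.038, 0.036 for these assumed areas
```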
The defect density $d$ is chosen so that T1 processors have 30% yield. Figure 17 shows the
overall efficiency provided by the three tiers. Note that the performance of each tier is simulated
and calculated in the same way as the FP-DIS performance measurement described in the
previous section.

The overall efficiency of FP-DIS only is also shown in hollow bars for comparison. The same
efficiency for integer benchmarks is achieved by FP-DIS alone and by FP-DIS with FPMUL-DIS,
because no additional defective chip is made functional when FPMUL-DIS is applied in addition
to FP-DIS. However, the combined approach shows significant improvement over FP-DIS alone
for several floating-point benchmarks.
Figure 17. Efficiency for FPMUL-DIS
5.2 Arithmetic logic unit
In this section, an instruction distribution analysis is first performed for ALU-type instructions,
and the performance degradation due to disabling a defective ALU is measured. A more
fine-grained instruction distribution analysis is then performed to explore the possibility of
limiting the performance degradation due to disabling defective ALUs. Lastly, we evaluate the
efficiency of our final approach.
5.2.1 Utilization analysis: ALU-type instructions
Figure 18 shows the percentage of ALU-type instructions in various benchmarks. The following
observations can be made. An ISA-layer approach for the ALU (using other instructions to
implement the functionality of ALU-type instructions) is not practical. This is because ALU-type
instructions provide the most basic functions, such as add and logic operations, which constitute
the majority of all instructions. From Figure 18, it is clear that most instructions require an ALU,
so disabling a defective ALU may impact performance significantly. Implementing ALU-type
instructions in an alternative manner can result in a significant increase in the number of
instructions; hence the performance will be low.
Table XXII shows the IPC performance of processors with 2 defect-free ALUs and processors
with only 1 defect-free ALU. Note that we use simulations to capture the ALU utilization more
precisely than instruction distribution alone can, since instruction distribution does not reflect
utilization in the temporal domain. Significant performance degradation can be observed in most
benchmarks if only 1 ALU is available. This implies that processors in which a defective ALU is
disabled will suffer significant performance degradation. The results correspond well to the
results of the ISA-layer instruction distribution analysis: disabling a highly utilized module
results in high performance degradation. In the next section, we perform a more detailed analysis
of the ALU in order to improve the performance of those processors.
Figure 18. ALU-type instruction distribution
5.2.2 Utilization analysis: ADD-type instructions
Figure 19 shows both the distribution of ALU-type instructions and of ADD-type instructions.
ADD-type instructions include all the instructions that require the adder in an ALU. It is clear
that the majority of the ALU-type instructions require adder computation. From our synthesis
study of an integer execution unit [52], the adder occupies almost 50 percent of one ALU. Hence,
a large percentage of defective ALUs have defects in their adders, and a circuit-layer approach
may help recover the performance loss due to disabling defective ALUs.
Table XXII. Performance of processors with different numbers of ALUs

| Benchmark | 2-ALU IPC | 1-ALU IPC | 1-ALU relative performance |
| ammp | 1.3374 | 0.8940 | 66.8% |
| equake | 1.7334 | 0.9700 | 55.9% |
| apsi | 0.4698 | 0.4243 | 90.3% |
| applu | 0.6486 | 0.5458 | 84.1% |
| swim | 0.3056 | 0.2904 | 95.0% |
| art | 0.4688 | 0.3952 | 84.3% |
| bzip2 | 1.4812 | 0.8896 | 60.0% |
| gcc | 1.4376 | 0.8767 | 60.9% |
| gzip | 1.5407 | 0.9210 | 59.7% |
| mcf | 0.8455 | 0.6477 | 76.6% |
| vortex | 1.3282 | 0.8811 | 66.3% |
| mgrid | 0.5251 | 0.5024 | 95.68% |
| parser | 1.4648 | 0.9061 | 61.86% |
| twolf | 1.7593 | 1.7273 | 98.18% |
| average | 1.0961 | 0.7765 | 70.84% |
To exploit circuit-layer inherent redundancy, we performed an operand distribution analysis for
ADD-type instructions. Figure 20 shows the cumulative distribution of the instructions' required
bit-width in the adder. The required adder bit-width of an ADD-type instruction is defined as the
number of LSBs required to compute the instruction correctly.
It can be observed that 92% of the ADD-type instructions require no more than 33-LSB
computation, and the 31 MSBs of their outputs are zero, where an adder in an ALU produces
64-bit outputs. In the next section, we develop an error-masking mechanism that exploits this
property to mitigate the performance loss from ALU-disabling.

Figure 19. ALU-type and ADD instruction distribution

Figure 20. Cumulative distribution of bit-width required for ADD instructions
5.2.3 ALU-disabling and adder error-masking mechanism
Figure 21 depicts a generic organization for the dispatch, issue, and execution of ALU-type
instructions. Decoded instructions are dispatched into a unified reservation station, where they
wait for their input operands to become available. The en-queued instructions have their ready
flags in the reservation station set when all their input operands are ready. An instruction
scheduler is responsible for delivering the operands of a fixed-point instruction to one of the
ALUs for computation in the integer execution unit, which contains 2 ALUs in this particular
example. Instructions are allowed to execute out of order, and the instruction scheduler controls
the instructions' routing between the reservation station and the ALUs [44].
ALU-disabling can be implemented as follows. Using test information regarding which ALU is
defective, and slightly modifying the instruction scheduler, the use of the defective ALU can be
avoided. Hence, a fabricated processor with defective ALUs can be used if at least one defect-free
ALU is available in the processor. This can be achieved at negligible overhead by implementing
a mechanism to deactivate the cycle counter [45] of each defective ALU in the instruction
scheduler. The defective ALU will never appear as being available to the instruction scheduler
during operation. Therefore, the defective ALU is disabled and no data error will be produced by
it.

Figure 21. A generic organization for ALU instruction execution
By checking the number of LSBs required for an add-type instruction and comparing this
number with the number of guaranteed-functional less-significant bits of a defective adder, it can
be determined whether the defective adder is guaranteed to produce an error-free result for the
instruction. Figure 22 shows the microarchitecture-level mechanism we propose. The instruction
scheduler issues instructions to ALUs as their operands become ready. In a processor with an
adder with defects in the 31-MSB bit-slices, its Erroneous_31MSB bit is set. An operand width
inspector dynamically checks whether the number of LSB bit-slices required for an add-type
instruction is larger than 33. The operands are checked in parallel with their computation in the
adders. If the add-type instruction executed on the defective adder does not require more than
33-LSB computation, the 31 MSBs of the output are masked with zeroes. On the other hand, if
the instruction requires more than 33-LSB computation, the output from the ALU with the
defective adder is invalidated, and the instruction is subsequently re-issued and re-executed on
the defect-free adder in the defect-free ALU.
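The check itself is simple; here is a minimal software analogue (my illustration; the `required_width` test is one conservative sufficient condition, not necessarily the exact hardware inspector):

```python
# A minimal sketch (illustrative) of the operand-width check behind adder
# error-masking. Conservative sufficient condition: if both unsigned operands
# fit in w-1 bits, the sum fits in w bits, so a defective adder whose low w
# bits are functional produces the exact result with the MSBs masked to zero.

FUNCTIONAL_LSBS = 33   # guaranteed-functional low bits of the defective adder

def required_width(a: int, b: int) -> int:
    """LSBs needed to add two unsigned operands correctly (incl. final carry)."""
    return max(a.bit_length(), b.bit_length()) + 1

def add_on_defective_adder(a: int, b: int):
    if required_width(a, b) <= FUNCTIONAL_LSBS:
        result = (a + b) & ((1 << FUNCTIONAL_LSBS) - 1)  # low bits from hardware
        return result, "commit"     # 31 MSBs masked to zero; result is exact
    return None, "replay"           # invalidate; re-execute on the defect-free ALU
```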
To minimize the area overhead of this approach, we exploit an existing microarchitecture
mechanism used in most modern processors to serve the purpose of re-executing add-type
instructions with possibly erroneous results. Modern processor designs speculate on the cycles
required for load instructions to complete [53] [54]. They allow the instructions dependent on a
load instruction to be issued speculatively, assuming that the load instruction will hit in the
level-1 cache. In case of a level-1 cache miss, a correcting mechanism invalidates the
speculatively issued instructions and re-executes by replaying the dependent instructions. Our
approach uses this mechanism to re-execute add-type instructions whose outputs are possibly
erroneous. Hence, the area overhead of the re-execution mechanism in our adder error-masking
approach is negligible.

Figure 22. Adder error-masking microarchitecture support
Table XXIII shows the IPC performance measured by microarchitecture simulations; the
simulator is modified to implement the error-masking mechanism in the selected ALU. It can be
observed that AdderEM successfully mitigates the significant performance degradation caused by
disabling the defective ALU: almost all the benchmarks achieve 99% of the performance
obtained with 2 defect-free ALUs. Note that when ALU-disabling and AdderEM are combined
(discussed in the next section), the three processor tiers will have the performance corresponding
to the three categories shown in Table XXIII.
5.2.4 Evaluation
Table XXIV summarizes the key parameters of the unenhanced processors and the processors
enhanced with ALU-disabling and AdderEM.
Table XXIII. Performance comparison for ALU-disabling and AdderEM

| Benchmark | 2 ALUs: IPC | 1 ALU (1 disabled): IPC | Relative performance | 1 ALU + 1 AdderEM ALU: IPC | Relative performance |
| ammp | 1.3374 | 0.8940 | 66.8% | 1.3217 | 98.8% |
| equake | 1.7334 | 0.9700 | 55.9% | 1.7310 | 99.8% |
| apsi | 0.4698 | 0.4243 | 90.3% | 0.4663 | 99.2% |
| applu | 0.6486 | 0.5458 | 84.1% | 0.6465 | 99.6% |
| swim | 0.3056 | 0.2904 | 95.0% | 0.3038 | 99.4% |
| art | 0.4688 | 0.3952 | 84.3% | 0.4666 | 99.5% |
| bzip2 | 1.4812 | 0.8896 | 60.0% | 1.4796 | 99.8% |
| gcc | 1.4376 | 0.8767 | 60.9% | 1.4359 | 99.8% |
| gzip | 1.5407 | 0.9210 | 59.7% | 1.5374 | 99.7% |
| mcf | 0.8455 | 0.6477 | 76.6% | 0.8414 | 99.5% |
| vortex | 1.3282 | 0.8811 | 66.3% | 1.3254 | 99.7% |
| mgrid | 0.5251 | 0.5024 | 95.6% | 0.5251 | 100% |
| parser | 1.4648 | 0.9061 | 61.8% | 1.4608 | 99.7% |
| twolf | 1.7593 | 1.7273 | 98.1% | 1.7593 | 100% |
| average | 1.0961 | 0.7765 | 70.8% | 1.0929 | 99.7% |

Table XXIV. Tier yield formulation for ALU enhancement

| Design | Tier | Yield |
| Unenhanced (area $A_{unenc}$) | Defect-free | $Y_{unenc}$ |
| | Defective | $1 - Y_{unenc}$ |
| Enhanced (area $A_{enc}$) | T1 (defect-free) | $Y_{tier1} = Y_{enc}$ |
| | T2 | $Y_{tier2} = PF_{tier2} \times (1 - Y_{enc})$ |
| | T3 | $Y_{tier3} = PF_{tier3} \times (1 - Y_{enc})$ |
| | Discard | $(1 - PF_{tier2} - PF_{tier3}) \times (1 - Y_{enc})$ |
Without our approach, the unenhanced processors' yield is $Y_{unenc}$; hence, a fraction
$1 - Y_{unenc}$ of all processors fabricated are discarded. With our approaches, enhanced
processors have area $A_{enc}$ (where $A_{enc} > A_{unenc}$) and yield $Y_{enc}$ (where
$Y_{enc} < Y_{unenc}$ for the same defect density). The area overhead of the approach
($A_{enc} - A_{unenc}$) is derived by synthesizing a modified fixed-point execution unit [12]
with 2 ALUs. The $Y_{enc}$ portion of all processors fabricated are defect-free, which we call T1
processors. T2 processors have one defective adder which can still be used via error-masking,
and T3 processors have one disabled defective ALU. T2 and T3 processors account for fractions
$PF_{T2}$ and $PF_{T3}$, respectively, of the $1 - Y_{enc}$ portion of the processors fabricated
(described ahead). Note that T2 and T3 processors are defective processors that would have been
thrown away if our approaches were not deployed. Let $n$ represent the limit on the number of
defects to be considered, and let $P_{alu}$ ($P_{adder}$) be the area percentage of an ALU
(adder) in the processor (which can be derived from synthesis). Let $P(k)$ be the probability that
there are $k$ defects on the processor chip; assuming a Poisson distribution of defects, it can be
calculated as $e^{-dA_{chip}}(dA_{chip})^k/k!$, where $d$ is the defect density and $A_{chip}$
is the area of the entire processor. Hence, $PF_{T2}$ is calculated as
$\sum_{k=1}^{n} 2(P_{adder})^k P(k)$ and $PF_{T3}$ is calculated as
$\sum_{k=1}^{n} 2(P_{alu})^k P(k) - PF_{T2}$.
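As a small illustration of these sums (the area percentages below are hypothetical placeholders, not the synthesized values):

```python
# A small sketch (illustrative) of the tier-fraction computation, assuming
# Poisson-distributed defects and hypothetical area fractions from synthesis.

import math

def p_k(d: float, a_chip: float, k: int) -> float:
    """Poisson probability of exactly k defects on a chip of area a_chip."""
    lam = d * a_chip
    return math.exp(-lam) * lam**k / math.factorial(k)

d, A_chip = 1.204, 1.0      # chosen so the defect-free yield is about 30%
P_alu, P_adder = 0.02, 0.01 # hypothetical area fractions of one ALU / one adder
n = 5                       # limit on the number of defects considered

PF_T2 = sum(2 * P_adder**k * p_k(d, A_chip, k) for k in range(1, n + 1))
PF_T3 = sum(2 * P_alu**k * p_k(d, A_chip, k) for k in range(1, n + 1)) - PF_T2
print(PF_T2, PF_T3)
```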
Figure 23 shows the efficiency of our approaches across the benchmarks. The unenhanced
processor has 30% yield, and the yields of T1, T2, and T3 processors are 29.97%, 0.87%, and
0.37%, respectively. Since the enhanced processors have slightly larger area, the efficiency of T1
processors, i.e., defect-free processors, is always slightly less than 1. However, the approach
allows additional T2 and T3 processors to be used, where each of these processors has slightly
lower performance than a defect-free processor.

Figure 23. Efficiency for ALU-disabling and AdderEM
5.3 Integer multiplier
5.3.1 Utilization analysis: integer multiplication (MUL)
At the ISA layer, we undertake an instruction distribution analysis for MUL to estimate the
performance of the MUL-DIS approach. As shown in the second column of Table XXV, only a
small fraction of the 100M instructions use the MUL. This shows that if MUL-DIS is
implemented, and each instruction that originally used the MUL is replaced by a procedure that
uses add instructions, only a very small fraction of instructions will be affected. However, this
observation is made only for the 100M instructions inspected. Hence, we have modified a
compiler to implement multiplications using a procedure composed of add instructions, and
recompiled the benchmarks. Table XXV shows the increase in the total instruction count for the
benchmarks compiled with the MUL-DIS option, compared to the instruction count of the
original benchmarks compiled with the original compiler. As can be observed, although the
fraction of MUL instructions is low in the original benchmarks, the total number of instructions
can grow significantly for some benchmarks (e.g., applu and apsi), as shown in the table. The
performance estimate is calculated as
$N_{inst}^{o} / N_{inst}^{MUL\text{-}DIS}$, where $N_{inst}^{o}$ and $N_{inst}^{MUL\text{-}DIS}$
are the numbers of instructions for the original version of the applications and those compiled
using the MUL-DIS option, respectively. The upper bound on the average performance estimate
is calculated by assuming that the benchmarks without data have 100% performance.
It can be observed that MUL-DIS can degrade performance greatly for some applications. This
implies that processors with a defective multiplier using MUL-DIS will exhibit greatly degraded
performance for these applications. However, the approach is still useful for most applications,
which show only a limited increase in the total number of instructions. This also suggests
Table XXV. Instruction distribution of MUL

| Benchmark | Fraction of MUL instructions** | Increase in total number of instructions* | Performance estimate |
| art | 0 | (1 + 1.47 × 10^−2)× | 98.56% |
| gzip | 0 | – | – |
| mcf | 1.7 × 10^−5 | (1 + 7.29 × 10^−5)× | 99.99% |
| equake | 0 | (1 + 8.05 × 10^−7)× | 100% |
| bzip2 | 0 | (1 + 5.04 × 10^−3)× | 99.5% |
| applu | 1.22 × 10^−2 | (2.45 × 10^1)× | 4.08% |
| ammp | 2.13 × 10^−4 | – | – |
| swim | 1.23 × 10^−5 | (1 + 3.36 × 10^−4)× | 99.97% |
| apsi | 7.91 × 10^−2 | (3.13 × 10^1)× | 3.19% |
| mgrid | 1.02 × 10^−2 | – | – |
| vortex | 4.63 × 10^−5 | – | – |
| parser | 1.69 × 10^−4 | – | – |
| twolf | 0 | – | – |
| gcc | 4.99 × 10^−5 | – | – |
| average | – | – | 72.81% (upper bound: 85%) |

*We modify a GCC cross-compiler to replace every multiplication operation by a procedure that uses add instructions
**Observed in the 100M instruction windows selected using SimPoint
that a circuit-layer approach may perform better on average, because a circuit-layer approach
may add only a few clock cycles to each MUL instruction, which is unlikely to reduce
performance to the 3%–4% estimated for applu and apsi. Hence, we investigate possible
circuit-layer approaches for the MUL in the next section.
5.3.2 Circuit-layer partial-utilization approach: operand-shifting
Figure 24 depicts a 4×4 CSA (carry-save adder) multiplier and its state machine controller. The
multiplier array is composed of sixteen CSA cells and four carry-propagate adder (CPA) cells.
When signaled with the assertion of rdy (input ready), in one clock cycle the multiplier produces
an 8-bit product ($p_7 \ldots p_0$) by multiplying a 4-bit $X$ operand ($x_3 \ldots x_0$) with a
4-bit $Y$ operand ($y_3 \ldots y_0$). The done signal is then asserted at the end of the cycle. In
the cell array circuit of the multiplier, each diagonal series of four CSA cells and one CPA cell
along a $y_i$ input constitutes a bit-slice.
Every bit-slice performs the same functionality. Bit-slice $i$ generates a partial product by
multiplying $X$ with $y_i$, adding the partial product generated by bit-slice $i+1$, and adding
the carry bit propagated from bit-slice $i-1$. If only one bit-slice is present in a multiplier
implementation, then, with a modified controller design and additional multiplexors, a 4×4
multiplication can be performed by serially rotating through the $y_i$ inputs each cycle and
storing the partial product generated so that it can be added to the product computed in the next
cycle. Hence, a 4-bit multiplication can be completed by one bit-slice in 4 cycles.

Figure 24. 4×4 CSA multiplier and its state machine controller
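A software analogue of this serial mode (my illustration, not RTL) is:

```python
# A minimal sketch (software analogue) of computing a 4x4 multiplication
# serially with a single bit-slice: each cycle handles one y_i, and the
# running partial product is stored into the next cycle's addition.

def serial_multiply(x: int, y: int, width: int = 4) -> int:
    partial = 0
    for i in range(width):               # one "cycle" per bit of Y
        y_i = (y >> i) & 1               # rotate through the y_i inputs
        partial += (x * y_i) << i        # bit-slice adds X AND y_i at weight 2^i
    return partial                        # 8-bit product after `width` cycles

assert serial_multiply(13, 11) == 143
```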
Figure 25 depicts the enhanced defect-tolerant CSA multiplier design. Two neighboring
bit-slices are grouped together to form a bit-slice group (bg). Specifically, the bit-slices along
$y_0$ and $y_1$ form one group ($bg_0$) and the bit-slices along $y_2$ and $y_3$ form the
other ($bg_1$). Hence, the entire multiplier is divided into two halves, and each half contains one
bit-slice group.
Depending on how a multiplier is affected by defects on a particular fabricated processor chip,
the enhanced multiplier can be defect-free, defective in the 1st half only, defective in the 2nd half
only, or defective in both halves. The enhanced controller in Figure 25 shows that, depending on
the dfct signal, which is set based on the result of testing, the multiplier operates in mode 0 when
both halves are defect-free, and in mode 1 or mode 2 when only one half is defect-free.
Each bit-slice group is enhanced with multiplexors to select the proper connections based on the
operation mode. When operating in one of the “defectively-functional” modes (mode 1 and mode
2), the multiplication is segmented into two shorter multiplications which are executed on a
defect-free enhanced bit-slice group in two separate states. Three main types of multiplexors are
required:
1. Input operand multiplexors select which $Y$ operand bits are to be computed, i.e., the bits that
go to the A inputs of the CSA cells.
2. Output product multiplexors select which $S_{out}$ outputs from CSA or CPA cells should be
stored in the output registers as the intermediate partial product or the final product.
3. Partial product multiplexors select whether partial products, 0s, or the original paths are
connected to the $S_{in}$ inputs of the CSA cells for computation, or to the A inputs of the CPA
cells at the boundary of two groups.
Figure 25. Enhanced 4×4 CSA multiplier and modified controller
Table XXVI shows the encoding of the select inputs of the multiplexors at each state.
It is clear that there are many additional options for implementing such enhancements for the
CSA multiplier. At one extreme, we may choose to use multiplexors to select every
non-defective slice; however, for such an extreme approach the overheads would be large and the
benefits small. Hence, we focus on a much smaller set of configurations. In our approach, an
enhanced CSA multiplier design is defined by two parameters: the number of bit-slices per
group, $NS_{bg}$, and the number of enhanced groups, $N_{bgEn}$. The example in Figure 25
corresponds to the option ($NS_{bg} = 2$, $N_{bgEn} = 2$). Design options with different
parameter values result in different overheads and effectiveness. The smaller the number of
bit-slices per group, the larger the chance that an enhanced group will be defect-free when
fabricated. However, a smaller number of bit-slices per group also means (1) higher multiplexor
overheads, and (2) that a higher number of cycles is required to complete a multiplication. The
larger the number of enhanced groups, the larger the chance that one of them is defect-free;
however, it also means that more multiplexors are required, so the area and performance
overheads are higher.
In the next section, we estimate the overheads of the different enhancement options for a 32×32
CSA multiplier used in a processor, considering the three main types of multiplexors. We have
also implemented one of the most promising enhancement options. We will show that this
enhancement incurs low area overheads yet significantly improves the overall
performance-per-area of the processor.
Table XXVI. Multiplexors' select input encoding

| State | im1 | im2 | im3 | om1 | om2 | om3 | def |
| S0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| S1 | 1 | x | 0 | 2 | 2 | x | 1 |
| S2 | 0 | x | 1 | 1 | 3 | 0 | 1 |
| S3 | 0 | 1 | x | 0 | 0 | x | 1 |
| S4 | 1 | 2 | x | 1 | 1 | 1 | 1 |
| S5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5.3.3 Operand-shifting options for a 32×32 multiplier
An unenhanced 32×32 CSA multiplier is first implemented using a standard cell library, and its
area and delay are measured. The three main types of multiplexors required for the different
design options are implemented in RTL and synthesized to obtain area and delay information.
The area overhead of each enhancement option is estimated by accumulating the areas of the
different types of multiplexors required. The delay overhead is estimated by accumulating the
multiplexor delays along the original critical paths of the unenhanced design. Table XXVII
shows the area and delay ratios of the enhanced multiplier for each option with respect to those
of the unenhanced multiplier. Note that we have implemented the option (16, 2); the difference
between the overheads of our actual implementation and our estimates is negligible (below 1%).
Microarchitecture-layer performance for each option is captured by the number of cycles
required to execute a fixed-point multiplication. The number of processor cycles required to
execute a complete 32×32 multiplication in the unenhanced multiplier can be derived by
rounding up the ratio of the measured circuit-layer delay to the given processor clock period; the
number of cycles required for the other options can be derived in the same way. However, when
an enhanced multiplier operates in one of the defectively-functional modes, the number of cycles
increases according to the number of segmented multiplications that must be executed.
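As a hedged illustration of this derivation (the delay and clock period below are assumed values, not the measured data):

$$\text{cycles} = \left\lceil \frac{t_{mul}}{T_{clk}} \right\rceil, \qquad \text{e.g.,} \quad \left\lceil \frac{3.1\,\text{ns}}{0.3125\,\text{ns}} \right\rceil = 10,$$

and in a defectively-functional mode the count grows with the number of segmented multiplications (e.g., the 11 versus 22 cycles for the $NS_{bg} = 16$ options in Table XXVII).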
Note that our approach can, in principle, be implemented with any positive integer $NS_{bg}$
value smaller than 32. However, for example, the option ($NS_{bg} = 17$, $N_{bgEn} = 1$) does
not provide more performance or yield benefit than the option ($NS_{bg} = 16$,
$N_{bgEn} = 1$), since it still requires 22 cycles to complete a full multiplication in
defectively-functional mode, and it requires a larger area to operate error-free in a
defectively-functional mode. It is also possible to implement the approach in such a way that the
boundary of the first enhanced group is not aligned with the original boundary of the multiplier.
However, such implementations will result in higher multiplexor area overheads since, when
operating under defectively-functional modes, the $Y$ operand cannot use the default path and
additional multiplexing is required.
5.3.4 Processor tiers and their yields
Fabricated processors with multiplier enhancement can be graded into three tiers based on how
the defects affect the multiplier: tier 1 (T1), defect-free processors; tier 2 (T2),
defectively-functional processors; and tier 3 (T3), non-functional defective processors, i.e.,
processors to be discarded. The parameters associated with each enhancement option are shown
in Table XXVIII.
Table XXVII. Enhancement options: performance and overhead estimation

| Option ($NS_{bg}$, $N_{bgEn}$) | Area ratio | Delay ratio | Cycles: defect-free | Cycles: defectively-functional |
| Unenhanced* | 1 | 1 | 10 | N/A |
| (16, 1) | 1.028 | 1.011 | 11 | 22 |
| (16, 2) | 1.051 | 1.016 | 11 | 22 |
| (16, 2)* | 1.045 | 1.022 | 11 | 22 |
| (8, 1) | 1.042 | 1.022 | 11 | 44 |
| (8, 2) | 1.071 | 1.024 | 11 | 44 |
| (8, 4) | 1.146 | 1.032 | 11 | 44 |
| (4, 1) | 1.032 | 1.028 | 11 | 88 |
| (4, 2) | 1.116 | 1.031 | 11 | 88 |
| (4, 4) | 1.237 | 1.038 | 11 | 88 |
| (4, 8) | 1.470 | 1.045 | 11 | 88 |

*These design options are implemented
The yield of each tier $i$ ($Y_{Ti}$) can be evaluated in the following way for each design
option: $Y_{T1} = PS(d, A_{\mu pMulEn}, 0)$, where $PS(d, A, x)$ represents the probability
that there are $x$ defects in an area $A$ given that the defect density is $d$. Note that we
assume defects are uniformly distributed.

$$Y_{T2} = \bigl(1 - PS(d, A_{mulEn}, 0) - P_{mulFail}\bigr) \times PS(d, A_{\mu pMulEn} - A_{mulEn}, 0),$$

where $P_{mulFail}$ is the probability that all enhanced bit-slice groups are defective, i.e., the
probability that a defective multiplier does not work even under any of the defectively-functional
modes. Finally, $Y_{T3} = 1 - Y_{T1} - Y_{T2}$.
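A rough sketch of this computation follows (my illustration; the areas are hypothetical, and $P_{mulFail}$ is approximated by assuming that defects in disjoint sub-areas are independent and that defects in the multiplier's non-group logic are fatal):

```python
# A rough sketch (illustrative, not the dissertation's exact derivation) of the
# multiplier tier yields for one enhancement option under a Poisson model.

import math

def ps0(d, area):
    return math.exp(-d * area)          # Poisson: zero defects in `area`

d = 1.204                               # chosen so a unit-area chip yields 30%
A_chip, A_mul = 1.0, 0.01               # hypothetical chip-normalized areas
NS_bg, N_bgEn = 16, 2                   # enhancement option (16, 2)
A_bg = A_mul * NS_bg / 32               # area of one enhanced bit-slice group

p_all_groups_bad = (1 - ps0(d, A_bg)) ** N_bgEn
p_support_bad = 1 - ps0(d, A_mul - N_bgEn * A_bg)   # 0 here: both halves enhanced
p_mul_fail = 1 - (1 - p_all_groups_bad) * (1 - p_support_bad)

y_t1 = ps0(d, A_chip)                   # enhanced chip, fully defect-free
y_t2 = (1 - ps0(d, A_mul) - p_mul_fail) * ps0(d, A_chip - A_mul)
y_t3 = 1 - y_t1 - y_t2
print(y_t1, y_t2, y_t3)
```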
Figure 26 shows the yields of T1 and T2 for the enhancement options when an unenhanced
processor has 30% yield. $Y_{T1}$ of each option is always slightly less than 30% because the
small area of the additional multiplexors increases the total multiplier area, which is also
susceptible to defects. For options with the same number of bit-slices per group (e.g., both
options with $NS_{bg} = 16$, or all options with $NS_{bg} = 8$), the area overhead increases as
the number of enhanced groups ($N_{bgEn}$) increases, due to the increase in the number of
multiplexors of the same complexity. In addition, $Y_{T2}$ increases dramatically when a smaller
$NS_{bg}$ is chosen, since the probability that an enhanced group is defect-free increases.
$Y_{T2}$ also increases sharply when $N_{bgEn}$ increases from 1 to 2 because the probability
that a group can be utilized under
Table XXVIII. Multiplier enhancement parameters

| Parameter | With enhancement | No enhancement |
| Area of a processor chip | $A_{\mu pMulEn}$ | $A_{\mu p}$ |
| Area of multiplier | $A_{mulEn}$ | $A_{mul}$ |
| Area of a bit-slice group | $A_{bg}$ | N/A |
| Yield | $Y_{Ti}$ (for tier $i$) | $Y_o$ |
| Instructions per cycle | $IPC_{Ti}$ (for tier $i$) | $IPC_o$ |
defectively-functional modes doubles. However, $Y_{T2}$ saturates or even decreases when
$N_{bgEn}$ is further increased, for the following reasons. First, the probability of occurrence
decreases as the number of defects increases; hence the benefits that can be provided by more
enhanced groups are limited. Second, the multiplexor area overheads increase dramatically and
outweigh the further benefits provided by additional enhanced groups.
To measure the unit-area yield benefits provided by each option, the metric
$(Y_{T1} + Y_{T2})/A_{\mu pMulEn}$ is used to capture the sellable chips per die area. Figure 27
shows this metric normalized to that of an unenhanced processor when its yield ($Y_o$) is 30%.
It can be observed from Figure 26 that although $Y_{T2}$ for option (16, 2) is slightly lower than
that for option (8, 2) and comparable to those for options (4, 2) to (4, 8), option (16, 2) yields the
most sellable processors per area in Figure 27 because of its lower area overhead and its slightly
higher $Y_{T1}$. In the next section, the performance aspect of each option will be considered by
performing microarchitecture simulations.
Figure 26. Tier yields for different options ($NS_{bg}$, $N_{bgEn}$)
Note that the enhancement mechanism used in our approach requires every enhanced group to be
able to compute for the other groups in case they are defective. It is possible that an alternative
mechanism can be used. For example, each enhanced group in the (8, 2) option can substitute for
one unenhanced group only, rather than for all three other groups. Implementing such an
alternative mechanism reduces the multiplexor area overhead relative to the original (8, 2)
option. This alternative (8, 2) option works only if both enhanced groups are defect-free; hence,
the area that can be defective while the multiplier remains functional is equal to that of the
original (16, 1) option. However, the multiplexor area overhead of the alternative (8, 2) option
will be larger than that of the original (16, 1) option, since the alternative (8, 2) option requires
more partial product multiplexors. The same reasoning applies to other possible alternative
options with more than one enhanced group; hence, the original mechanism results in superior
sellable chips per area.
5.3.5 Evaluation
We perform microarchitecture simulations to derive the IPC for each processor tier. SPEC2000
[38] benchmarks are simulated using a microarchitecture simulator [55]. The multiplication
latency of each tier is set according to the processor cycle estimates derived from the
circuit-layer nanosecond delays, as shown in Table XXVII. Figure 28 shows the overall EFF for
all options
Figure 27. Normalized sellable chips per area
when $Y_o$ is 30%. Compared to the (16, 1) option, the (16, 2) option can be functional if either
one of the bit-slice groups is defect-free. The (16, 2) option appears even more attractive after the
microarchitecture performance is considered: not only is its sellable-chips-per-area larger, the
performance provided by its tier 2 is also higher than that of other options with comparable
sellable-chips-per-area measures, i.e., options (8, 2) and (4, 1). Hence, the (16, 2) option achieves
the highest improvement.
5.3.6 Comparison with MUL-DIS
Table XXIX shows the performance of T1 and T2 processors using MUL operand-shifting and
the performance estimated for MUL-DIS. The relative performance of T1 and T2 processors is
calculated as $IPC_{T1}/IPC_o$ and $IPC_{T2}/IPC_o$, respectively; IPCs are measured using
microarchitecture simulations.

Table XXIX also shows the efficiency of operand-shifting and MUL-DIS for every application.
The efficiency of MUL-DIS is calculated as follows. Like operand-shifting, MUL-DIS results in
two processor tiers, i.e., a defect-free tier (T1′) and a tier with a defective MUL only (T2′).
However, MUL-DIS incurs neither area overhead on any chip nor performance overhead on
defect-free chips. As can be observed, operand-shifting performs uniformly for every application
in both tiers. However, when MUL-DIS is used, T2′ processors can degrade the
Figure 28. Overall efficiency for all options
performance of some applications dramatically (apsi and applu). The overall efficiency of
operand-shifting is 1.034, and that of MUL-DIS ranges from 1.030 to 1.035. However,
operand-shifting does not dramatically degrade the performance of any application.
Table XXIX. Performance comparison between MUL-DIS and operand-shifting
(relative performance; defect-free, unenhanced performance = 100%)

| Benchmark | Operand-shifting T1 | Operand-shifting T2 | MUL-DIS T2′ | Operand-shifting efficiency | MUL-DIS efficiency |
| ammp | 100.02% | 93.41% | – | 1.0370 | – |
| equake | 100.00% | 85.74% | 100.00% | 1.0335 | 1.0418 |
| apsi | 99.58% | 94.74% | 3.19% | 1.0332 | 1.0013 |
| applu | 99.80% | 95.42% | 4.08% | 1.0357 | 1.0017 |
| swim | 100.00% | 100.62% | 99.97% | 1.0399 | 1.0417 |
| art | 100.00% | 93.27% | 98.56% | 1.0368 | 1.0411 |
| bzip2 | 100.00% | 84.13% | 99.50% | 1.0328 | 1.0415 |
| gcc | 99.91% | 88.60% | – | 1.0338 | – |
| gzip | 100.00% | 84.90% | – | 1.0331 | – |
| mcf | 100.00% | 92.33% | 99.99% | 1.0364 | 1.0417 |
| vortex | 100.00% | 98.49% | – | 1.0390 | – |
| mgrid | 95.34% | 65.23% | – | 0.9782 | – |
| parser | 99.99% | 86.86% | – | 1.0339 | – |
| twolf | 100.00% | 100.00% | – | 1.0397 | – |
Chapter 6. Projecting defect-tolerance efficiency for multicore
processors
In this chapter, we project the efficiency of the defect-tolerance approaches for multicore
processors in near-future technology nodes. The processor organization and microarchitecture
studied are first presented. The study considers both the related technology parameters and the
evolution of processors. We first evaluate the approaches individually, and then we combine the
approaches in a cross-layered fashion to achieve high improvement in EGOPSPA. We show that
our approaches can provide dramatic improvement as technology advances.
6.1 Scope of cross-layered approaches
Table XXX summarizes the approaches to be evaluated in this chapter. We will first evaluate the
efficiency of the following approaches individually: core-disabling, PCD, and
datapath-enhancement, which comprises the approaches discussed in Chapter 5. Then, we
combine the approaches in a cross-layered fashion to demonstrate their maximum efficiency. In
addition to our approaches, we also evaluate the efficiency of existing advanced approaches. As
discussed in Chapter 4, most existing approaches exploit inherent redundancy at the
microarchitecture layer; we will evaluate these microarchitecture-layer approaches to show the
upper-bound efficiency that they can achieve.
6.2 Multicore organization studied
We study a multicore processor which has an organization and core microarchitecture similar to
those of a 45nm 4-core processor [43]. Figure 29 shows the organization. The processor has
homogeneous cores, and an 8MB 16-way LLC with 64B blocks is shared by the cores through
crossbar interconnects. Every core has the following datapath modules: 1 FPU, 1 integer
multiplier, and 3 ALUs. Every core also has its own private caches: a 32KB 2-way level-1
instruction cache with 32B blocks, a 32KB 2-way level-1 data cache with 32B blocks, and a
256KB 4-way level-2 unified cache with 64B blocks.
Table XXX. Scope of the approaches

| Layer | Approaches |
| System | Core-disabling [42] [35]; PCD for LLC |
| ISA | Instruction disabling for FPU (FPU-DIS) |
| Microarchitecture | L1/L2 cache block disabling [5] [6] [7] [8] [9] [10] [11] [12]; buffering module entry-disabling [13] [71] [72]; do-nothing for speculation modules [2] [3] [4]; ALU-disabling [13] [19] [72] (+ adder error-masking) |
| Circuit | Operand-shifting for integer multiplier |

*Datapath-enhancement includes FPU-DIS, ALU-disabling + adder error-masking, and operand-shifting for the integer multiplier.
Figure 29. 4-core processor organization (cores and LLC banks connected by a crossbar)
6.3 Projection framework
In this section, we first describe the approaches we use to project technology parameters and
processor evolution paths. Then we describe the area estimate and yield calculation for the
baseline processors and the processors enhanced with advanced approaches (enhanced
processors). Baseline processors at each technology node only use classic approaches, i.e., spare
rows/columns in SRAM arrays.
6.3.1 Technology scaling
We consider the following technology parameters: process-variation-induced SRAM bit-cell
failure rate, processor clock frequency, and defect density. SRAM bit-cell failure rates and clock
frequencies for the technology nodes studied are taken from the ITRS projections [1] [56]. We
assume a defect density of 1700/m² at 45nm, and the defect density for every node is scaled
according to the increase in critical area [57]. Table XXXI summarizes these parameters for
every node.
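This scaling reduces to a one-line computation; the sketch below simply multiplies the 45nm baseline by the critical-area scaling factors of Table XXXI (an illustration of the stated rule, not additional data):

```python
# Per-node defect density = 45nm baseline * critical-area scaling factor
# (factors from Table XXXI, assuming p = 3.0 in [57]).

D_45NM = 1700.0                                  # defects per m^2 at 45nm
scale = {45: 1.0, 32: 1.13, 22: 1.23, 16: 1.29, 11: 1.33, 8: 1.36}
defect_density = {node: D_45NM * f for node, f in scale.items()}
```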
6.3.2 Processor evolution
When a new generation of a multicore processor is designed, the following components evolve:
the number of cores, the capacity of the LLC, the interconnects, the core microarchitecture, and
special-purpose processing engines such as graphics processing units. In this work, we
investigate the evolution of the number of cores and the capacity of the LLC; the evolution of the
other components will be projected in our future studies.
Table XXXI. Technology parameters
Technology (nm)                 45        32      22      16      11      8
Frequency (GHz)                 3.2       3.5     7.6     10.3    13.3    12.3
SRAM bit-cell failure rate      4×10⁻¹¹   1×10⁻⁷  5×10⁻⁶  1×10⁻⁴  5×10⁻⁴  1×10⁻³
Defect density scaling factor*  1         1.13    1.23    1.29    1.33    1.36
*Assumes p = 3.0 in [57]
We investigate two possible evolution paths, namely fixed-organization (FO) evolution and power-constrained (PC) evolution. Fixed-organization evolution does not change the processor parameters, i.e., all processor generations have 4 cores and an 8MB LLC of the same organization. The processors on the fixed-organization evolution path at every technology node are referred to as FO-evolution processors. Power-constrained evolution limits the total thermal design power (TDP) to a fixed budget for all technology nodes. The derivation of the PC evolution is described in the following.
In [56], the authors investigated existing processor products to predict the power-performance relation of processors in future technology generations. We reproduce their results in Figure 30, which plots per-core power versus performance for a set of 45nm processors. The authors then empirically derive a Pareto frontier curve which represents the optimal performance that can be achieved for a given per-core power consumption. As the figure suggests, Nehalem cores achieve high performance with a more advanced microarchitecture at the expense of greater power consumption. At the other end, Atom cores consume the least power but also have a simpler microarchitecture.
Figure 30. 45nm per-core power (W) versus performance (SPECmark) (reproduced from [56])
By assuming that the processor core design does not evolve, the power-performance frontier curves for future technology generations can be plotted by scaling the performance according to the predicted FO4 delay (frequency) and scaling the power according to the predicted Vdd, capacitance, and frequency. In Figure 31 we reproduce the scaled power-performance Pareto frontiers for future technologies. The curves represent the optimal power-performance core microarchitecture design points for every technology.
The authors in [56] studied the optimal number and type of cores that achieve the maximum speed-up for a set of parallel benchmarks under a predefined TDP budget (125W). Table XXXII shows the average number of homogeneous cores across these benchmarks and the per-core power for every technology node. For each technology node, the per-core power consumption falls on the high-performance/high-power end of the corresponding curve in Figure 31. Hence, this result suggests that to achieve the most speed-up under the TDP budget, the cores will have a microarchitecture similar to Nehalem. (Note that this does not suggest that future processors will not have microarchitecture innovations.) We will use the numbers of cores shown in Table XXXII in our power-constrained evolution.
Figure 31. Scaled per-core power (W) versus performance (SPECmark) Pareto frontiers for 45nm through 8nm (reproduced from [56])
Table XXXII also shows the corresponding LLC capacity, which is scaled proportionally to the number of cores [58] (rounded to the closest power-of-two). The capacity of the LLC is increased by increasing the number of sets in the LLC. We will refer to the processors on this evolution path at every technology node as PC-evolution processors.
6.3.3 Yield calculation and area estimation
Yield calculation: We consider different failure sources in different components, i.e., process variation in SRAM bit-cells and defects in non-SRAM logic. Failed SRAM bit-cells and defects in logic are assumed to follow a Poisson distribution. The yield of an SRAM array is calculated as $e^{-fn}$, where $f$ is the process-variation-induced bit-cell failure rate and $n$ is the number of bit-cells in the array. The yield of a non-SRAM logic module is calculated as $e^{-dA}$, where $d$ is the defect density and $A$ is the area of the module.
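For concreteness, the following is a minimal Python sketch (not part of the dissertation) of this Poisson yield model; the failure rate, defect density, and area values used in the example are illustrative assumptions only.

```python
import math

def sram_array_yield(f, n):
    """Yield of an SRAM array: e^{-f*n}, where f is the process-variation-
    induced bit-cell failure rate and n is the number of bit-cells."""
    return math.exp(-f * n)

def logic_module_yield(d, A):
    """Yield of a non-SRAM logic module: e^{-d*A}, where d is the defect
    density and A is the module area (in consistent units, e.g., m^2)."""
    return math.exp(-d * A)

# Example (assumed values): a 32KB data array at the 22nm bit-cell failure
# rate of 5e-6, and a 1 mm^2 logic module at a defect density of 2000/m^2.
print(sram_array_yield(5e-6, 32 * 1024 * 8))  # ~0.27 before spares
print(logic_module_yield(2000.0, 1e-6))       # ~0.998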
Core area and module area: The area of a 45nm core is measured from the die photo of the 45nm Nehalem processor. The areas of the private caches in a core are derived by assuming each cache module is designed to achieve optimal yield-per-area using spares only. The areas of the datapath modules are derived by synthesizing our RTL designs and the modified designs in [52] [59]. The area of the other modules in a core is then estimated by subtracting the areas of the caches and datapath modules from the core area measured from the die photo. For the subsequent technology nodes, the areas of the caches are derived by designing the cache modules to achieve optimal yield-per-area at each technology node. The areas of the other modules are derived by scaling their 45nm areas according to the technology. The area of a core is the sum of the areas of all modules in one core.
Table XXXII. Power-constrained evolution processors
Tech (nm)              45     32     22     16    11    8
Avg. number of cores   4      8      12     14    20    33
Per-core power (W)     31.25  15.63  10.42  8.93  6.25  3.79
LLC capacity (MB)      8      16     32     32    32    64
LLC area: For an unenhanced LLC, i.e., an LLC which does not use PCD, the area is derived
by assuming the module is designed to achieve optimal yield-per-area at each technology node.
For an enhanced LLC, its area is derived by assuming the module is designed to achieve optimal
ECPA at each technology node.
Interconnect area: For every technology node, the cores are assumed to be square. The cores and LLC banks are arranged so that the entire chip, including interconnects, has an aspect ratio closest to 1. The interconnect area is then derived from this arrangement.
6.4 Multicore performance estimation
In Chapter 1, the performance-per-area metric (i.e., EGOPSPA) for defect-tolerance is presented. In this section, we explain how we compute the metric for multicore processors. For an unenhanced processor, the performance-per-area metric is calculated as $EGOPSPA_o = Nc_o \times F_o \times IPC_o \times \eta_o \times Y_o / A_o$. Table XXXIII summarizes the parameters.
Table XXXIII. EGOPSPA formulation parameters
$Nc_o$: number of cores; depends on the processor design.
$IPC_o$: instructions-per-cycle of each core; derived from simulations using SPEC2000.
$F_o$: operating frequency (GHz); obtained from the ITRS prediction.
$\eta_o$: core utilization; depends on the LLC hit rate (described ahead).
$Y_o$: processor chip yield; described in the previous sections.
$A_o$: processor chip area; described in the previous sections.
Core utilization $\eta_o$ measures the fraction of time a core is actively computing, i.e., the fraction of time the core is not waiting for a response from the LLC. Core utilization is calculated as follows: $\eta_o = \min\left(1, \frac{1}{1 + r_m \times t_{avg} \times IPC_o}\right)$, where $r_m$ is the percentage of all instructions that access the LLC, and $t_{avg}$ is the average number of cycles required to acquire data from the LLC or beyond. $t_{avg}$ is calculated as $P_{hit} \times t_{LLC} + (1 - P_{hit}) \times t_m$, where $P_{hit}$ is the hit rate of the LLC, and $t_{LLC}$ and $t_m$ are the access latencies (in processor cycles) to the LLC and to main memory, respectively. To derive $P_{hit}$, the available LLC is assumed to be shared evenly by every core [56] [60]. $P_{hit}$ can be modeled empirically as a function of the available LLC per core ($S_{LLC}/N_c$, where $S_{LLC}$ is the capacity of the available LLC): $P_{hit}\left(\frac{S_{LLC}}{N_c}\right) = 1 - \left(\frac{S_{LLC}}{N_c \beta}\right)^{-(\alpha - 1)}$, where $\alpha$ and $\beta$ are locality parameters [61] [62] that can be derived from simulation traces. The derivation of $t_{LLC}$ and $t_m$ is described next.
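The following is a minimal Python sketch of this utilization model. All numeric inputs (IPC, $r_m$, latencies, $\alpha$, $\beta$) are placeholders; the dissertation derives them from simulation traces and the ITRS data.

```python
def hit_rate(s_llc_per_core_mb, alpha, beta):
    """Empirical power-law LLC hit-rate model, P_hit = 1 - (S/beta)^-(alpha-1),
    with S the available LLC per core (assumes alpha > 1 and S/beta >= 1)."""
    return 1.0 - (s_llc_per_core_mb / beta) ** (-(alpha - 1.0))

def core_utilization(ipc, r_m, s_llc_mb, n_cores, t_llc, t_m, alpha, beta):
    p_hit = hit_rate(s_llc_mb / n_cores, alpha, beta)
    t_avg = p_hit * t_llc + (1.0 - p_hit) * t_m  # avg. cycles per LLC access
    return min(1.0, 1.0 / (1.0 + r_m * t_avg * ipc))

# Example (assumed values): 4 cores sharing an 8MB LLC; 2% of instructions
# access the LLC; 14-cycle LLC and 160-cycle memory latencies.
print(core_utilization(ipc=1.5, r_m=0.02, s_llc_mb=8.0, n_cores=4,
                       t_llc=14, t_m=160, alpha=1.5, beta=0.5))
```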
The round-trip main memory access time is assumed to be 50ns for all technology nodes [63] [64]. The main memory access latency ($t_m$) is then derived by dividing 50ns by the processor clock cycle time at each technology node. The LLC access latency ($t_{LLC}$) is derived by dividing the LLC access time in ns ($D_{LLC}$) by the processor clock cycle time at each technology node. $D_{LLC} = 2 \times D_{Xbar} + 2 \times D_{bank} + D_{array}$, where $D_{Xbar}$, $D_{bank}$, and $D_{array}$ are the delays (in ns) through the crossbar interconnects, through an LLC bank, and of the SRAM array access, respectively. These delays are calculated as $D_{Xbar} \approx 1.67\sqrt{D_{FO4} R_{Xbar} C_{Xbar}} \times L_{Xbar}$ [65], $D_{bank} \approx 1.67\sqrt{D_{FO4} R_{bank} C_{bank}} \times L_{bank}$ [65], and $D_{array} = N_{stage} d_{FO4} + \sum_{i=1}^{N_{stage}} p_i + D_{WL} + D_{BL}$ [66]. Note that $D_{Xbar}$ and $D_{bank}$ are calculated for optimally repeated wires. $D_{array}$ is calculated by adding the optimized decoder delay and the wordline and bitline delays. Table XXXIV summarizes the above parameters.
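As a minimal sketch of this latency derivation, the following Python fragment (not from the dissertation) converts the ns-delays into LLC access cycles; the RC, wire-length, and FO4 numbers are placeholders standing in for the ITRS/CACTI-derived values used in the actual projection.

```python
import math

def repeated_wire_delay_ns(d_fo4_ns, r_per_mm, c_per_mm, length_mm):
    """Delay of an optimally repeated wire, ~1.67*sqrt(D_FO4*R*C) per mm."""
    return 1.67 * math.sqrt(d_fo4_ns * r_per_mm * c_per_mm) * length_mm

def llc_cycles(d_xbar_ns, d_bank_ns, d_array_ns, clock_ghz):
    d_llc_ns = 2 * d_xbar_ns + 2 * d_bank_ns + d_array_ns
    return math.ceil(d_llc_ns * clock_ghz)  # cycle time = 1/clock_ghz ns

# Assumed values: 4mm crossbar wires, 1mm bank H-tree, 1.5ns array access.
d_xbar = repeated_wire_delay_ns(0.015, 1.2, 0.2, 4.0)
d_bank = repeated_wire_delay_ns(0.015, 1.2, 0.2, 1.0)
print(llc_cycles(d_xbar, d_bank, d_array_ns=1.5, clock_ghz=3.2))
# Main memory latency follows the same conversion: ceil(50ns * clock_ghz).
```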
For an enhanced processor, the performance-per-area is calculated as $EGOPSPA_{enc} = \sum_{i=1}^{N_T} \left( \sum_{j=1}^{Nc_{T_i}} IPC_{C_j} \times \eta_{C_j} \right) \times F_{T_i} \times Y_{T_i} / A_{enc}$, where $A_{enc}$ is the area of the enhanced processor and the other parameters are the number of usable cores, the clock frequency, the core instructions-per-cycle, the core utilization, and the yields of the various processor tiers. Note that our approaches do not change the clock frequency; hence $F_{T_i}$ is equal to $F_o$.
6.5 Approach evaluation
In this section, we first detail the formulation for evaluating processor tiers. The evaluation of the approaches for near-future processors follows.
6.5.1 Formulation for individual approaches
The formulation for PCD is presented in Chapter 2. Here we describe our formulation for datapath-enhancement and core-disabling.
Table XXXIV. Utilization formulation parameters
$D_{FO4}$: FO4 inverter delay; obtained from ITRS.
$R_{Xbar}$ ($R_{bank}$): unit-length resistance of the crossbar (H-tree in banks); depends on the wire length (calculated from the bank dimensions obtained from CACTI) and on the unit-length resistance (obtained from ITRS).
$C_{Xbar}$ ($C_{bank}$): unit-length capacitance of the crossbar (H-tree in banks); depends on the wire length (calculated from the bank dimensions obtained from CACTI) and on the unit-length capacitance (obtained from ITRS).
$N_{stage}$: number of stages in the decoder; depends on the LLC organization (described in the previous section).
$d_{FO4}$: extrinsic FO4 delay (without parasitics); obtained from ITRS.
$p_i$: parasitic delay; obtained from ITRS.
$D_{WL}$ ($D_{BL}$): wordline (bitline) delay; depends on the length of and load on the wires (obtained from CACTI).
6.5.1.1 Datapath-enhancement
In Chapter 5 we use the term tier to describe fabricated processors with different performance due to defects when a single approach targeting a single datapath module is used. In this chapter, we use the term grade to describe the condition of every datapath module of each type when several different approaches are implemented for the corresponding modules in the same processor. The notation $[d, d']$ represents the grade of a datapath module, where $d$ denotes the number of defective modules in a processor core, and $d'$ denotes the number of defective modules that can be used in a "special salvaging mode" (explained ahead). Table XXXV shows the grades of the FPU ($g_{FPU}$) when FP-DIS and FPMUL-DIS are used. A defective FPU can operate in the special salvaging mode if the defects only affect its FP-multiplier (grade [1,1]).
The yield of grade $[d, d']$ is represented as $Y_{EFPU}([d, d'])$. The yield formulation of every grade is also shown in Table XXXV. Note that $A_{FPU}$ and $A_{FPMUL}$ are the areas of the FPU and the FP-multiplier, respectively. $Y(A)$ denotes the yield of a module with area $A$.
Table XXXVI shows the grades and formulation for the integer multiplier when operand-shifting is used. There is no special mode for this approach. Note that grade [1,0] only requires half of the MUL area to be defect-free. If both halves are defective, the MUL cannot be utilized (the dead grade).
Table XXXV. FPU enhancement grades ($g_{FPU}$, description, $Y_{EFPU}([d, d'])$)
[0,0] Defect-free: $Y(A_{FPU})$
[1,0] Defective FPU (FP-DIS): $1 - Y(A_{FPU} - A_{FPMUL})$
[1,1] Defective FPMUL only (FPMUL-DIS): $Y(A_{FPU} - A_{FPMUL}) \times \big(1 - Y(A_{FPMUL})\big)$
Table XXXVII shows the grades for the ALU modules when ALU-disabling and AdderEM are used. Note that, as mentioned earlier, there are three ALUs in the multicore organization we studied. Figure 32 illustrates selected grades for clarity.
Table XXXVI. MUL operand-shifting enhancement grades ($g_{MUL}$, description, $Y_{EMUL}([d, d'])$)
[0,0] Defect-free: $Y(A_{MUL})$
[1,0] Defective, but can be utilized: $2 \times Y(0.5 A_{MUL}) \times \big(1 - Y(0.5 A_{MUL})\big)$
dead  Defective and cannot be utilized: $\big(1 - Y(0.5 A_{MUL})\big)^2$
Table XXXVII. ALU enhancement grades ($g_{ALU}$, description, $Y_{EALU}([d, d'])$)
[0,0] Defect-free: $Y(3A_{ALU})$
[1,0] 1 defective ALU (that is not used): $3 \, Y(2A_{ALU}) \big(1 - Y(A_{ALU}(1 - EM_p))\big)$
[1,1] 1 ALU using EM: $3 \, Y(2A_{ALU}) \, Y(A_{ALU}(1 - EM_p)) \big(1 - Y(A_{ALU} EM_p)\big)$
[2,0] 2 defective ALUs (that are not used): $3 \, Y(A_{ALU}) \big(1 - Y(A_{ALU}(1 - EM_p))\big)^2$
[2,1] 1 defective ALU, 1 ALU using EM: $6 \, Y(A_{ALU}) \big(1 - Y(A_{ALU}(1 - EM_p))\big) \, Y(A_{ALU}(1 - EM_p)) \big(1 - Y(A_{ALU} EM_p)\big)$
[2,2] 2 ALUs using EM: $3 \, Y(A_{ALU}) \Big(Y(A_{ALU}(1 - EM_p)) \big(1 - Y(A_{ALU} EM_p)\big)\Big)^2$
dead: $Y_{EALU}(dead) = \big(1 - Y(A_{ALU}(1 - EM_p))\big)^3 + 3 \big(1 - Y(A_{ALU}(1 - EM_p))\big)^2 \, Y(A_{ALU}(1 - EM_p)) \big(1 - Y(A_{ALU} EM_p)\big) + 3 \big(1 - Y(A_{ALU}(1 - EM_p))\big) \Big(Y(A_{ALU}(1 - EM_p)) \big(1 - Y(A_{ALU} EM_p)\big)\Big)^2 + \Big(Y(A_{ALU}(1 - EM_p)) \big(1 - Y(A_{ALU} EM_p)\big)\Big)^3$
When combining the above three approaches, processor tiers can be denoted as
$T_{DpathEnc}\big([(g_{ALU}, g_{FPU}, g_{MUL})_{c_1}, \ldots, (g_{ALU}, g_{FPU}, g_{MUL})_{c_{N_c}}]\big)$,
where $(g_{ALU}, g_{FPU}, g_{MUL})_{c_i}$ is the grade of core $i$, and there are $N_c$ cores in the processor. The yield of this processor tier can be calculated as
$Y_{T_{DpathEnc}}\big([(g_{ALU}, g_{FPU}, g_{MUL})_{c_1}, \ldots, (g_{ALU}, g_{FPU}, g_{MUL})_{c_{N_c}}]\big) = Y_{LLC} \times Y_{intc} \times \prod_{i=1}^{N_c} Y_{EncCore}\big((g_{ALU}, g_{FPU}, g_{MUL})_{c_i}\big)$,
where $Y_{LLC}$ and $Y_{intc}$ are the yields of the LLC and the interconnects, respectively. $Y_{EncCore}\big((g_{ALU}, g_{FPU}, g_{MUL})_{c_i}\big)$ represents the yield of the grade of core $i$:
$Y_{EncCore}\big((g_{ALU}, g_{FPU}, g_{MUL})_{c_i}\big) = Y_{nonDp} \times Y_{EALU}(g_{ALU}) \times Y_{EFPU}(g_{FPU}) \times Y_{EMUL}(g_{MUL})$,
where $Y_{nonDp}$ is the yield of the non-datapath modules in a core.
When using datapath-enhancement, the set of processor tiers, $S_{DpathEnc}$, can be enumerated as the cross-product of the grade sets of every core, i.e., $S_{gCore_{c_1}} \times S_{gCore_{c_2}} \times \ldots \times S_{gCore_{c_{N_c}}}$. $S_{gCore_{c_i}}$ represents the set of grades of core $i$, and $S_{gCore_{c_i}} = S_{g_{ALU}} \times S_{g_{FPU}} \times S_{g_{MUL}}$, where $S_{g_{ALU}} = \{[0,0], [1,0], [1,1], [2,0], [2,1], [2,2]\}$, $S_{g_{FPU}} = \{[0,0], [1,0], [1,1]\}$, and $S_{g_{MUL}} = \{[0,0], [1,0]\}$.
Figure 32. (a) ALU area breakdown, where $EM_p = (A_{add} - A_{addEM})/A_{ALU}$ (b) Examples of grades [1,0], [1,1], and [2,1] (gray denotes defective area; defective parts are disabled or operated with AdderEM)
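To make the enumeration concrete, the following is a minimal Python sketch (not from the dissertation) that builds the per-core grade space as the cross-product of the grade sets above and counts the resulting processor tiers for an assumed 4-core processor.

```python
from itertools import product

# Per-module grade sets from Section 6.5.1.1.
S_ALU = ["[0,0]", "[1,0]", "[1,1]", "[2,0]", "[2,1]", "[2,2]"]
S_FPU = ["[0,0]", "[1,0]", "[1,1]"]
S_MUL = ["[0,0]", "[1,0]"]

# Grade space of one core: cross-product of the per-module grade sets.
S_CORE = list(product(S_ALU, S_FPU, S_MUL))  # 6 * 3 * 2 = 36 core grades

N_C = 4                                # assumed number of cores
tiers = product(S_CORE, repeat=N_C)    # cross-product over all cores
print(len(S_CORE), sum(1 for _ in tiers))  # 36 core grades, 36^4 tiers
```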
6.5.1.2 Core-disabling
When using core-disabling, the usable grades of all cores in a processor can be represented as $g_{Pcores}(0), \ldots, g_{Pcores}(i), \ldots, g_{Pcores}(N_c - 1)$, where $i$ represents the number of defective cores that are disabled. The yield of $g_{Pcores}(i)$ can be calculated as
$Y_{g_{Pcores}}(i) = \binom{N_c}{i} (1 - Y_{core})^i \, Y_{core}^{N_c - i}$,
where $Y_{core}$ is the yield of a core. $Y_{core} = Y_{L1\$} \times Y_{L2\$} \times Y_{non\$}$, where $Y_{L1\$}$, $Y_{L2\$}$, and $Y_{non\$}$ are the yields of the level-1 caches, the level-2 cache, and the non-cache modules, respectively.
Processor tiers can be represented as $T_{CoreDis}(i)$, where $i$ is the number of defective cores. The yield of $T_{CoreDis}(i)$ can be calculated as $Y_{T_{CoreDis}}(i) = Y_{LLC} \times Y_{intc} \times Y_{g_{Pcores}}(i)$.
6.5.2 Formulation for cross-layered approaches
6.5.2.1 PCD + core-disabling
When PCD and core-disabling are used, processor tiers can be represented as a 2-tuple $T\big(g_{LLC}(i_{LLC}), g_{Pcores}(i_{core})\big)$, where $g_{LLC}(i_{LLC})$ and $g_{Pcores}(i_{core})$ represent the grade of the LLC with $i_{LLC}$ page-covers disabled and the grade of the cores with $i_{core}$ cores disabled, respectively. The yield of tier $T\big(g_{LLC}(i_{LLC}), g_{Pcores}(i_{core})\big)$ is calculated as
$Y_T\big(g_{LLC}(i_{LLC}), g_{Pcores}(i_{core})\big) = Y_{g_{LLC}}(i_{LLC}) \times Y_{g_{Pcores}}(i_{core}) \times Y_{intc}$.
Processor tiers can be enumerated by finding all combinations of the possible grades of the LLC and the possible grades of the cores, i.e.,
$\{g_{LLC}(0), \ldots, g_{LLC}(i_{LLC}), \ldots, g_{LLC}(n_{LLC})\} \times \{g_{Pcores}(0), \ldots, g_{Pcores}(i_{core}), \ldots, g_{Pcores}(N_c - 1)\}$.
6.5.2.2 PCD + datapath-enhancement
When PCD and datapath-enhancement are used, processor tiers can be represented as an $(N_c + 1)$-tuple
$T\big(g_{LLC}(i_{LLC}), (g_{ALU}, g_{FPU}, g_{MUL})_{c_1}, \ldots, (g_{ALU}, g_{FPU}, g_{MUL})_{c_{N_c}}\big)$,
where the tuples represent the grade of the LLC and the grade of every core. The yield of such a tier is calculated as
$Y_T\big(g_{LLC}(i_{LLC}), (g_{ALU}, g_{FPU}, g_{MUL})_{c_1}, \ldots, (g_{ALU}, g_{FPU}, g_{MUL})_{c_{N_c}}\big) = Y_{g_{LLC}}(i_{LLC}) \times Y_{intc} \times \prod_{i=1}^{N_c} Y_{EncCore}\big((g_{ALU}, g_{FPU}, g_{MUL})_{c_i}\big)$.
Processor tiers can be enumerated by finding all combinations of the possible grades of the LLC and the possible grades of every core, i.e.,
$\{g_{LLC}(0), g_{LLC}(1), \ldots, g_{LLC}(i_{LLC}), \ldots, g_{LLC}(n_{LLC})\} \times S_{gCore_{c_1}} \times S_{gCore_{c_2}} \times \ldots \times S_{gCore_{c_{N_c}}}$.
Note that $S_{gCore_{c_i}}$ is defined in Section 6.5.1.1.
6.5.2.3 Core-disabling + datapath-enhancement
When core-disabling and datapath-enhancement are used, processor tiers can be represented as $T\big(g'_{Pcores}(i), [(g_{ALU}, g_{FPU}, g_{MUL})_{c_1}, \ldots, (g_{ALU}, g_{FPU}, g_{MUL})_{c_{N_c - i}}]\big)$, where $i$ represents the number of defective cores that are disabled, and the second tuple represents the grade of every functional core. The yield of such a tier is calculated as
$Y_T\big(g'_{Pcores}(i), [(g_{ALU}, g_{FPU}, g_{MUL})_{c_1}, \ldots, (g_{ALU}, g_{FPU}, g_{MUL})_{c_{N_c - i}}]\big) = Y_{LLC} \times Y_{intc} \times \binom{N_c}{i} (P_{deadCore})^i \times (Y_{nonDp})^{N_c - i} \times \prod_{j=1}^{N_c - i} Y_{Ecore}\big((g_{ALU}, g_{FPU}, g_{MUL})_{c_j}\big)$,
where $Y_{nonDp}$ is the yield of all non-datapath modules in a core, and
$P_{deadCore} = (1 - Y_{nonDp}) + P_{deadDp} - (1 - Y_{nonDp}) \times P_{deadDp}$.
$P_{deadDp}$ is the probability that the defective datapath modules in a core cannot be saved by the datapath-enhancement approaches (note that a defective FPU can always be saved by FP-DIS and FPMUL-DIS). It can be calculated as
$P_{deadDp} = Y_{EALU}(dead) + Y_{EMUL}(dead) - Y_{EALU}(dead) \times Y_{EMUL}(dead)$.
Processor tiers can be enumerated by finding all combinations of the possible numbers of disabled defective cores and the possible grades of every functional core, i.e.,
$\{g'_{Pcores}(i)\} \times S_{gCore_{c_1}} \times S_{gCore_{c_2}} \times \ldots \times S_{gCore_{c_{N_c - i}}}$, for $i \in \mathbb{N}$ and $i < N_c$.
Note that $S_{gCore_{c_i}}$ is defined in Section 6.5.1.1. (A small sketch of the $P_{deadCore}$ calculation follows below.)
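The following is a minimal Python sketch of the dead-core probability above. Independence of the failure sources is assumed, so the unions are computed by inclusion-exclusion; the input probabilities are placeholders.

```python
def p_dead_dp(y_ealu_dead, y_emul_dead):
    """A core's datapath is unsalvageable if its ALUs or its multiplier
    land in the dead grade (a defective FPU can always be saved)."""
    return y_ealu_dead + y_emul_dead - y_ealu_dead * y_emul_dead

def p_dead_core(y_non_dp, p_dp):
    """A core is dead if its non-datapath part fails or its datapath is
    unsalvageable (inclusion-exclusion over the two events)."""
    p_non = 1.0 - y_non_dp
    return p_non + p_dp - p_non * p_dp

# Example with assumed dead-grade probabilities and non-datapath yield.
print(p_dead_core(y_non_dp=0.85, p_dp=p_dead_dp(0.01, 0.005)))
```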
6.5.2.4 Core-disabling + PCD + datapath-enhancement
When all three approaches are used, processor tiers can be represented as an $(N_c - i_{core} + 2)$-tuple $T\big(g_{LLC}(i_{LLC}), g_{Pcores}(i_{core}), [g_{c_1}, \ldots, g_{c_{N_c - i_{core}}}]\big)$, where $i_{core}$ is the number of defective cores that are disabled in the tier, and the tuples represent the grade of the LLC and the grade of every functional core. The yield of such a tier is calculated as
$Y_T\big(g_{LLC}(i_{LLC}), g_{Pcores}(i_{core}), [g_{c_1}, \ldots, g_{c_{N_c - i_{core}}}]\big) = Y_{g_{LLC}}(i_{LLC}) \times Y_{intc} \times \binom{N_c}{i_{core}} (P_{deadCore})^{i_{core}} \times (Y_{nonDp})^{N_c - i_{core}} \times \prod_{K=1}^{N_c - i_{core}} Y_{Ecore}\big((g_{ALU}, g_{FPU}, g_{MUL})_{c_K}\big)$.
Processor tiers can be enumerated by finding all combinations of the possible grades of the LLC, the possible numbers of disabled cores, and the possible grades of every functional core.
6.5.3 Efficiency projection of approaches
In this section, we first evaluate the efficiency of the individual approaches at every technology node. Then we combine the approaches to maximize the performance-per-area improvement. The efficiency provided by an approach at technology node $i$ ($EFF_i$) is calculated as $EGOPSPA_{advDT,i} / EGOPSPA_{base,i}$, where $EGOPSPA_{advDT,i}$ and $EGOPSPA_{base,i}$ are the EGOPSPA achieved by the enhanced processors and the baseline processors at technology node $i$, respectively. The YPA improvement at technology node $i$ is calculated as $YPA_{advDT,i} / YPA_{base,i}$, where $YPA_{advDT,i}$ and $YPA_{base,i}$ are the YPA achieved by the enhanced processors and the baseline processors at technology node $i$, respectively.
6.5.3.1 Efficiency projection of individual approaches
Figure 33 shows the improvement in performance-per-area (EFF) and the improvement in YPA for the datapath-enhancement approach. Note that YPA is calculated counting the processors saved by the approach as functional processors. For the fixed-organization evolution path, it can be observed that the YPA improvement increases from 3.7% at 45nm to 5.2% at 8nm, and that EFF increases from 3.3% at 45nm to 4.8% at 8nm. Two main points must be noticed. First, EFF is always lower than the YPA improvement, since a fraction of the defective processors saved by the datapath-enhancement approach have lower performance than defect-free processors. Second, and more importantly, the approach improves more as the technology scales. The reason is that as defect density increases, more cores are defective, and hence datapath-enhancement has more defective modules to salvage. Similar trends can be observed in Figure 33(b) for power-constrained evolution. However, the improvement increases more dramatically with each future technology generation. In addition to the increase in defect density, the fact that under this evolution each processor has more cores/datapath modules also contributes to the significant improvement the approach achieves: per-core improvement is now compounded over the increasing number of cores.
Figure 33. Datapath-enhancement efficiency projection: (a) fixed-organization, (b) power-constrained (YPA and EFF improvement versus technology node)

Table XXXVIII summarizes the areas and delays of two different LLC designs for every technology node. One is the LLC designed to achieve optimal YPA using hardware spare rows/columns only, i.e., the LLC used in the baseline processors at every technology node (spares-only LLC). The other is the LLC designed to achieve optimal ECPA using PCD plus hardware spare rows/columns (PCD-LLC). Area and delay values are normalized with respect to those of the 45nm spares-only LLC. The following observations can be made. First, for the FO-evolution processors, the spares-only LLC still shrinks in area as the feature size shrinks. However, the PCD-LLC shrinks even more, since PCD-LLC designs do not require as many spares to achieve optimal ECPA. A similar observation can be made for the PC-evolution processors. Even when the area of the spares-only LLC increases from one generation to the next (e.g., from 32nm to 22nm) due to the increase in cache capacity, the PCD-LLC can still shrink the LLC area. Second, the PCD-LLC requires fewer LLC access cycles for the FO-evolution processors at all technology nodes. A similar trend can be observed at most technology nodes for the PC-evolution processors, except at the 22nm and 11nm nodes. The reason the PC-evolution processors have slightly higher numbers of LLC access cycles at 22nm and 11nm is that the placement of modules in the PC-evolution processors at these two nodes results in slightly higher interconnect latencies. (Recall that we arrange the cores, LLC banks, and interconnects so that the overall chip aspect ratio is closest to 1.)
Figure 34 shows the YPA and EGOPSPA improvement when PCD is used. For FO-evolution processors, the YPA improvement is up to 52% and the EGOPSPA improvement is up to 57%. The reason EFF is higher than the YPA improvement is that processors using PCD use fewer clock cycles to access the LLC. For PC-evolution processors, PCD improves YPA by up to 14.8× and EGOPSPA by up to 18.2× at 8nm. It can be noticed that for all PC-evolution processors, EFF is higher than the YPA improvement except at the 22nm and 11nm nodes, because the PC-evolution processors at these two nodes have slightly higher numbers of LLC access cycles when PCD is used.
Table XXXVIII. LLC parameters for FO- and PC-evolution processors

Normalized LLC area (w.r.t. the 45nm baseline processor)
  Tech (nm)             45     32     22     16     11     8
  FO, HW spares only    1.00   0.51   0.28   0.21   0.15   0.10
  FO, PCD + spares      1.00   0.51   0.25   0.18   0.13   0.08
  PC, HW spares only    1.00   1.09   1.15   0.96   0.95   1.09
  PC, PCD + spares      1.00   1.08   1.02   0.81   0.85   0.80

LLC latency (cycles)
  FO, HW spares only    14     20     34     53     97     103
  FO, PCD + spares      14     20     31     47     86     91
  PC, HW spares only    14     27     52     92     151    228
  PC, PCD + spares      14     25     54     78     157    169
Figure 35 shows the efficiency when core-disabling is used. Dramatic YPA improvements can be observed for the processors on both evolution paths at every technology node. As the number of defective cores increases because of increasingly higher defect densities (for both PC- and FO-evolution processors) and increasingly higher numbers of cores (for PC-evolution processors), core-disabling becomes more effective. The YPA improvements can be approximated as $1 / Y_{core}^{N_c}$, where $Y_{core}$ is the core yield and $N_c$ is the number of cores, because almost every fabricated chip has at least one defect-free core.
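The following is a minimal Python sketch checking this approximation: it holds whenever the chance that all cores are defective is negligible. The per-core yield used here is an illustrative assumption.

```python
def ypa_improvement_approx(y_core, n_cores):
    """Approximate YPA improvement from core-disabling: 1 / Y_core^Nc."""
    return 1.0 / y_core ** n_cores

def p_all_cores_defective(y_core, n_cores):
    """Chips that even core-disabling cannot save (no working core)."""
    return (1.0 - y_core) ** n_cores

print(ypa_improvement_approx(0.7, 8))   # ~17.3x with an assumed Y_core = 0.7
print(p_all_cores_defective(0.7, 8))    # ~6.6e-5, hence the approximation
```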
Figure 34. PCD efficiency projection: (a) fixed-organization, (b) power-constrained (YPA and EFF improvement versus technology node; the PC values reach 14.8× and 18.2× at 8nm)
Figure 35. Core-disabling efficiency projection: (a) fixed-organization, (b) power-constrained (YPA and EFF improvement versus technology node)
6.5.3.2 Efficiency projection for cross-layered approaches
We combine the approaches in a cross-layered fashion to maximize the EGOPSPA improvements. Figure 36 shows the results for FO-evolution processors. At every technology node, our approaches dramatically improve EGOPSPA beyond the improvements provided by core-disabling. In addition, increasingly higher improvements are made with each generation.
Figure 37 shows the efficiency results for PC-evolution processors. The improvements at each technology node are greater than those of the FO-evolution processors because of the effect of compounding the individual approaches and because PC-evolution processors have increasingly higher numbers of defective modules and cores than FO-evolution processors.
Figure 36. Cross-layered approach efficiency (EFF) for FO-evolution processors: CoreDis, PCD+DpathEnc, and PCD+CoreDis+DpathEnc versus technology node
6.6 Performance-per-area projection
6.6.1 Without advanced defect-tolerance approaches
Figure 38(a) shows the EGOPSPA projection for the processors without advanced approaches on the two evolution paths. Note that EGOPSPA at each technology node is normalized to that of the 45nm processors. It can be observed that for FO-evolution processors, EGOPSPA keeps increasing through 8nm. However, for PC-evolution processors, EGOPSPA is flat until 22nm and declines starting at 16nm. This phenomenon is explained by the yield and area trends illustrated in Figure 38(b). The yields of the FO-evolution processors decline as technology advances, but their areas decline at a higher rate. Hence, the overall effect of scaling on per-wafer performance is still positive. However, this is not the case for the PC-evolution processors: they remain at approximately the same area, but their yield declines sharply.
It is clear that without advanced defect-tolerance approaches, the economics of fabrication do not benefit from scaling in terms of per-wafer performance output.
Figure 37. Cross-layered approach efficiency (EFF) for PC-evolution processors: CoreDis, PCD+DpathEnc, and PCD+CoreDis+DpathEnc versus technology node (values reach 40.7 and 750 at 8nm)
6.6.2 With advanced defect-tolerance approaches
Table XXXIX describes the approaches evaluated for performance-per-area projection in this
section. Note that MUPB includes all the existing microarchitecture layer approaches discussed
in Section 6.1. We calculate the upper-bound of MUPB’s EGOPSPA by assuming the yield of
every core is 1. Figure 39 plots the performance-per-area projected for FO-evolution processors.
The performance-per-area of each point is normalized w.r.t. the EGOPSPA of 45nm processors
without advanced DT.
Figure 38. Without advanced DT: (a) normalized EGOPSPA, (b) yield and area (mm²), for PC- and FO-evolution processors versus technology node
Table XXXIX. Performance-per-area projection types
Ideal: Fabrication is assumed ideal; the defect density and bit-cell failure rates are zero. Caches are designed with no spares.
PCD + MUPB: PCD is applied and the LLC is designed to achieve optimal ECPA. Core yield is optimistically assumed to be 1. Real defects and bit-cell failures are assumed for the caches.
PCD + CoreDis + DpathEnc: PCD is applied and the LLC is designed to achieve optimal ECPA. Core-disabling and datapath-enhancement are applied. Real defects and bit-cell failures are assumed for all modules.
No advanced DT: No advanced DT is applied. The LLC is designed to achieve optimal YPA using spares only. Real defects and bit-cell failures are assumed for all modules.
The following observations can be made. First, our cross-layered approach (PCD+CoreDis+DpathEnc) effectively improves the performance-per-area, by 2.6× at the 8nm node. This combined approach achieves performance-per-area almost as high as the optimistic MUPB. Second, although performance-per-area can still benefit from scaling without advanced DT, our approach significantly increases the benefit from scaling, and the improvement grows with each technology generation.
In addition to the observations made above, our approach tracks the ideal curve closely until the 16nm node. The ideal-fabrication processors have a yield of 1 and do not require spares. Processors using PCD+MUPB differ from ideal processors in the following aspects. 1) The yield of the non-SRAM part of the LLC: although PCD can salvage defective bit-cells in the LLC, defects in the non-SRAM part of the module (decoder logic, peripheral logic, wiring, etc.) can render the LLC unusable even when PCD is used. 2) Interconnect yield: PCD+MUPB does not include approaches targeting defects in interconnects. 3) The design of the LLC: when PCD is used, some spares are still required to achieve optimal ECPA. This results in larger LLC area, larger chip area, and longer LLC access latency, which in turn lowers core utilization.
Figure 39. Performance-per-area projection for FO-evolution processors (normalized EGOPSPA versus technology node for Ideal, PCD+MUPB, PCD+CoreDis+DpathEnc, and No advanced DT)
Table XL summarizes these factors, and the EGOPSPA ratio can be approximated by considering them.
Figure 40(a) shows the performance-per-area for PC-evolution processors. As can be observed, for the processors without advanced approaches, performance-per-area declines when technology scales beyond 22nm. Our approaches enable the processors to continue to benefit from scaling through the 16nm node, dramatically improving the performance-per-area as technology scales. However, the performance-per-area declines at the 11nm and 8nm nodes even when our approaches are applied. Although the approaches can still improve performance-per-area considerably, overall processor fabrication does not benefit from scaling. The reason is that the low yield of the non-SRAM part of the LLC limits the effectiveness of both hardware spares and PCD.
Table XL. Relevant parameters comparing ideal fabrication for FO-evolution processors
Technology node (nm)                       45    32    22    16    11    8
PCD+MUPB: Interconnect yield               0.98  0.98  0.98  0.98  0.97  0.97
PCD+MUPB: LLC non-SRAM yield               0.97  0.96  0.94  0.88  0.78  0.68
Ideal/(PCD+MUPB) ratio: Core utilization   1.00  1.00  1.02  1.09  1.10  1.16
Ideal/(PCD+MUPB) ratio: Chip area          1.00  1.00  1.00  0.95  0.76  0.75
Ideal/(PCD+MUPB) ratio: EGOPSPA            1.05  1.06  1.10  1.34  1.93  2.34
In Figure 40(b), the performance-per-area margin between ideal fabrication and PCD+MUPB can be seen to be larger than that for the FO-evolution processors. The reason is that PC-evolution processors have more interconnect area and more non-SRAM area in the LLC. Table XLI shows the relevant parameters comparing ideal fabrication and PCD+MUPB.
It is clear that the performance-per-area is limited by defects rather than power. In the next section, we provide another possible processor evolution path that remedies the above problem at 11nm and 8nm.
Figure 40. Performance-per-area for PC-evolution processors: (a) without the ideal-fabrication curve, (b) with the ideal-fabrication curve (normalized EGOPSPA versus technology node for Ideal, PCD+MUPB, PCD+CoreDis+DpathEnc, and No advanced DT)
Table XLI. Relevant parameters comparing ideal fabrication for PC-evolution processors
Technology node (nm)                       45    32    22    16    11     8
PCD+MUPB: Interconnect yield               0.98  0.93  0.84  0.77  0.55   0.35
PCD+MUPB: LLC non-SRAM yield               0.97  0.92  0.66  0.49  0.10   0.02
Ideal/(PCD+MUPB) ratio: Core utilization   1.00  1.00  1.02  1.04  1.33   1.09
Ideal/(PCD+MUPB) ratio: Chip area          1.00  1.00  0.96  0.90  0.80   0.73
Ideal/(PCD+MUPB) ratio: EGOPSPA            1.05  1.17  1.92  3.09  29.22  232.47
6.7 Defect-constrained evolution: A new evolution path
As we have shown in the previous section, PC-evolution processors are limited by the low yield of the non-SRAM part of the LLC at more advanced technology nodes. We evaluate a new evolution path, namely defect-constrained evolution, in order to continue to benefit from scaling. Table XLII shows the new parameters used for defect-constrained evolution. Note that we use a 16MB LLC at both the 11nm and 8nm nodes.
Figure 41(a) shows that with the decreased LLC capacity, performance-per-area continues to improve beyond 11nm, because the yield of the non-SRAM part of the LLC is now higher, so the improvement from PCD can be realized. Table XLIII shows the yield of the non-SRAM-bitcell parts of the LLC.
This defect-constrained evolution path demonstrates that it is still possible to continue to benefit from technology scaling. Nevertheless, the performance-per-area that can be achieved by the advanced approaches is still far from ideal, as shown in Figure 41(b). Because PCD still requires some spares to achieve optimal ECPA, processors using PCD+MUPB have larger chip area, longer LLC latency, and lower core utilization. Table XLIV shows the relevant parameters for comparison.

Table XLII. Defect-constrained evolution processor parameters
Technology node (nm)   45   32   22   16   11   8
Number of cores        4    8    12   14   20   33
LLC capacity (MB)      8    16   32   32   16   16

Table XLIII. Yield of non-SRAM-bitcell parts of the LLC
              Defect-constrained evolution            Power-constrained evolution
Technology    Capacity (MB)  HW spares only  PCD      Capacity (MB)  HW spares only  PCD
11nm          16             0.556           0.630    32             0.0749          0.102
8nm           16             0.455           0.544    64             0.0038          0.018
It is clear that there are other possible ways to improve performance-per-area and to incorporate more components into one processor. In the next chapter, we list possible approaches to further improve the performance-per-area and to enable continued benefits from scaling.
Figure 41. Performance-per-area for defect-constrained evolution processors: (a) without the ideal-fabrication curve, (b) with the ideal-fabrication curve (normalized EGOPSPA versus technology node for Ideal, PCD+MUPB, PCD+CoreDis+DpathEnc, and No advanced DT)
Table XLIV. Relevant parameters comparing ideal fabrication for defect-constrained evolution
Technology node (nm)                       45    32    22    16    11    8
PCD+MUPB: Interconnect yield               0.98  0.93  0.84  0.77  0.54  0.44
PCD+MUPB: LLC non-SRAM yield               0.97  0.92  0.66  0.49  0.63  0.54
Ideal/(PCD+MUPB) ratio: Core utilization   1.00  1.00  1.02  1.04  1.17  1.14
Ideal/(PCD+MUPB) ratio: Chip area          1.00  1.00  0.96  0.90  0.92  0.86
Ideal/(PCD+MUPB) ratio: EGOPSPA            1.05  1.17  1.92  3.09  3.74  5.63
Chapter 7. Conclusions
7.1 Contributions
In this dissertation, we have proposed the use of the performance-per-area metric, EGOPSPA, to measure the effectiveness of defect-tolerance approaches. The metric measures the computational capacity delivered by the silicon real-estate on each wafer, and hence accounts for the true economics of fabrication.
We have developed an innovative multi-layered system approach to discover possible advanced defect-tolerance opportunities for general purpose processors. Based on this multi-layered approach, we have developed a framework to discover the rich inherent redundancy in the layers and to develop advanced defect-tolerance approaches based on harnessing the inherent redundancy we discover. We have developed a taxonomy of redundancies, controllers, and defect-tolerance approaches to enable systematic development of defect-tolerance approaches. The many defect-tolerance approaches discovered using our systematic approach can first be studied qualitatively using pervasiveness analysis and utilization analysis to evaluate their effectiveness without expensive simulations, design, and quantification. Only the promising approaches need to be quantitatively evaluated to determine their effectiveness.
The large number of approaches that we have discovered can be analyzed collectively to identify combinations of approaches from different layers that form composite cross-layered approaches achieving high performance-per-area. We have demonstrated that these innovative cross-layered approaches are uniquely efficient. Because of our approaches' ability to circumvent defective modules, or defective parts of modules, from higher layers in the system, modules can be designed with minimal extraneous redundancy to achieve optimality. This property further reduces area overhead and results in smaller chip area.
We have projected the performance-per-area that can be achieved by our new advanced approaches, considering both technology scaling trends and multicore processor evolution. The performance-per-area projected along the processor evolution paths demonstrates that our advanced approaches will significantly enhance the benefits from technology scaling, i.e., continue to provide higher performance output per silicon area as technology scales.
Furthermore, we have identified a constraint that has not been considered for processor evolution before. Although our advanced approaches can improve performance-per-area significantly, the increasingly higher defect density at future technology nodes still limits the LLC capacity that can be incorporated on chips, for reasons for which there currently exists no effective defect-tolerance approach.
7.2 Future research
1) Develop approaches for special-purpose acceleration modules. Modules implementing specialized functionalities have been incorporated into processor chips. For example, recent high-performance processors incorporate an on-die GPU to accelerate graphics applications. Also, single-instruction-multiple-data (SIMD) co-processors have been used to implement specialized instruction sets that accelerate certain types of applications, such as multimedia and encryption. As more specialized modules continue to be incorporated on processor chips, the area occupied by these modules increases. Hence, targeting these modules can potentially improve the performance-per-area of these processors.
Although these modules implement functions that are not general-purpose, our multi-layered system view can still enable the discovery of approaches to circumvent failures in these modules. Hence, efficient approaches can be developed.
2) Multiple cross-layered approaches for maximum EGOPSPA in the LLC. In Chapter 6, we identified defects as one of the main constraints on processor evolution: defects will limit the yield of the non-SRAM-bitcell circuitry in the LLC. Developing circuit-layer approaches targeting this peripheral circuitry could further enable incorporating a larger LLC on chips to improve the performance of future processors.
By projecting the performance-per-area for ideal-fabrication processors, we have identified the insufficiency of current approaches at future technology nodes. Although PCD can effectively reduce the number of spares required in the LLC to achieve optimal capacity-per-area, the required spares still impose significant area overheads. By combining approaches from other layers, such as microarchitecture-layer block-disabling, the number of spares required to achieve optimal capacity-per-area can be reduced. Hence, the LLC will occupy less area, and the achievable performance-per-area will be closer to that of an ideal-fabrication processor with zero defect density. However, the additional controller modification for block-disabling will also impose additional area overhead and circuit-level performance overhead. The combination of approaches must be evaluated to demonstrate its effectiveness.
3) Projecting approach effectiveness for parallel applications. To extend the projection to parallel applications, the performance-per-area metric for multicore processors must be updated to take the following effects into account. First, the performance of a processor is now limited by the worst-performing core on the chip if all working cores are utilized. Second, the traffic between the cores and the LLC must be modeled considering possible conflicts and arbitration, since more than one core can now access shared data in the LLC at the same time.
In addition, a system-layer approach must be explored for parallel applications. For applications that do not utilize all cores, the system can be designed to dynamically assign threads to the higher-performance cores with higher priority. Such an approach will minimize performance degradation.
4) Developing manufacturing tests. Our approaches require knowledge of which module, or which part of a module, is defective. For PCD, this information can be derived by collecting the pass/fail results of every array after the hardware spares are allocated to repair defective bit-cells. Using the failing arrays' addresses, the addresses of the faulty page frames can be calculated.
For the datapath-enhancement approaches, the location of defect-induced functional failures is required. Diagnosis has been performed during post-silicon validation to identify both design and manufacturing defect issues [67] [68]. The diagnosis process involves analysis of the scan dump, i.e., the test output from the scan chain, to progressively trace back to the failing sites. Such diagnosis can provide accurate isolation of failures, but it is not scalable for volume manufacturing. However, the application of our approaches does not require such fine-grained information. Our approaches only require coarse-grained pass/fail results for individual modules or specific sub-modules (e.g., the 33 least significant bits of a 64-bit adder). This information can be derived using microarchitecture-layer or ISA-layer tests. Using specific sequences of instructions in a specially designed test program, functional tests targeting a specific module can be developed. However, such tests must be developed so that only the targeted module is exercised and high coverage is achieved, so that modules can be isolated.
References
[1] "International Technology Roadmap for Semiconductors".
[2] S. Almukhaizim, T. Verdel and Y. Makris, "Cost-effective graceful degradation in speculative processor subsystems: the branch prediction case," in Proc. Int'l. Conf. on Computer Design, 2003.
[3] S. Almukhaizim, P. Petrov and A. Orailoglu, "Faults in processor control subsystems: testing
correctness and performance faults in the data prefetching unit," in Proc. Asian Test Symp, 2001.
[4] T.-Y. Hsieh, M. A. Breuer, M. Annavaram, S. K. Gupta, and K.-J. Lee, "Tolerance of Performance
Degrading Faults for Effective Yield Improvement," in Proceeding of International Test Conference,
2009.
[5] G. S. Sohi, "Cache memory organization to enhance the yield of high performance VLSI
processors," IEEE Trans. Computers, vol. 38, pp. 484-492, April 1989.
[6] A. Agarwal, B. C. Paul, H. Mahmoodi, A. Datta, and K. Roy, "A Process-Tolerant Cache
Architecture for Improved Yield in Nanoscale Technologies," IEEE TRANSACTIONS ON VERY
LARGE SCALE INTEGRATION (VLSI) SYSTEMS, vol. 13, no. 1, 2005.
[7] P. P. Shirvani and E. J. McCluskey, "PADded cache: a new fault-tolerance technique for cache
memories," 17th IEEE VLSI Test Symposium, pp. 440-445, April 1999.
[8] D. A. Patterson, P. Garrison, M. Hill, D. Lioupis, C. Nyberg, T. Sippel, and K. Van Dyke, "Architecture of a VLSI instruction cache for a RISC," in the 10th Annual International Symposium on Computer Architecture, vol. 11, pp. 108-116, June 1983.
[9] Y. Ooi et al, "Fault-Tolerant Architecture in a Cache Memory Control LSI," IEEE Journal of Solid-
State Circuits, vol. 27, no. 4, pp. 507-514, Apr. 1992.
[10] Lee, H., Cho, S., and Childers, B. R, "Performance of Graceful Degradation for Cache Faults," in
Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 2007.
[11] D. C Bossen et al, "Power4 System Design for High Reliability," IEEE Micro, vol. 22, no. 1, pp. 16-
24, 2002.
[12] H. Lee, S. Cho, and B. Childers, "Exploring the interplay of yield, area, and performance in processor caches," in International Conference on Computer Design, 2007.
[13] Premkishore Shivakumar , Stephen W. Keckler , Charles R. Moore , Doug Burger, "Exploiting
Microarchitectural Redundancy For Defect Tolerance," in Proceedings of the 21st International
Conference on Computer Design, 2003.
[14] Weaver, N., Kelm, J.H., and Frank, M.I. , "Emμcode: Masking hard faults in complex functional
units," in International Conference on Dependable Systems & Networks, 2009.
[15] T. M. Austin, "DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design," in Proceedings of the 32nd International Symposium on Microarchitecture, 1999.
[16] Marc de Kruijf, Shuou Nomura, and Karthikeyan Sankaralingam, "Relax: An Architectural Framework for Software Recovery of Hardware Faults," in International Symposium on Computer Architecture, 2010.
[17] Albert Meixner, Michael E. Bauer, Daniel J. Sorin, "Argus: Low-Cost, Comprehensive Error
Detection in Simple Cores," in International Symposium on Microarchitecture, 2007.
[18] Shubhendu S. Mukherjee, Michael Kontz, Steven K. Reinhardt, "Detailed Design and Evaluation of
Redundant Multithreading Alternatives," in International Symposium on Computer Architecture,
2002.
[19] E. Schuchman, T. N. Vijaykumar, "Rescue: A Microarchitecture for Testability and Defect
Tolerance," in Proc. of the 32nd Annual Int’l Symposium on Computer Architecture, 2005.
[20] J. C. Cha and S. K. Gupta, "Characterization of granularity and redundancy for SRAMs for optimal yield-per-area," in IEEE International Conference on Computer Design, 2008.
[21] I. Koren and Z. Koren, "Defect tolerant VLSI circuits: Techniques and yield analysis," Proceedings
of the IEEE, vol. 86, pp. 1817-1836, 1998.
[22] B. Amelifard, "Power efficient design of SRAM arrays and optimal design of signal and power
distribution networks in VLSI circuits," Ph.D. Dissertation, University of Southern California, 2007.
[23] K. Jeong and A. B. Kahng, "A power-constrained MPU roadmap for the International Technology
Roadmap for Semiconductors (ITRS)," in SoC Design Conference (ISOCC), November 2009.
[24] L. Hung, H. Irie, M. Goshima, and S. Sakai, " Utilization of SECDED for soft error and variation-
induced defect tolerance in caches," in Proc. Design, Automation & Test in Europe Conference &
Exhibition, 2007.
[25] H. Sun, N. Zheng, and T. Zheng, "Realization of L2 Cache Defect Tolerance Using Multi-bit ECC,"
in IEEE Intl' Symposium on Defect and Fault Tolerance of VLSI Systems, 2008.
[26] J. C. Cha, "Optimal Defect-Tolerant SRAM Designs in Terms of Yield-Per-Area under Constraints
on Soft-Error Resilience and Performance," PhD dissertation, University of Southern California,
2010.
[27] K. Zhang, "SRAM design on 65-nm CMOS technology with dynamic sleep transistor for leakage
reduction," IEEE J. Solid-State Circuits, vol. 40, p. 895, 2005.
[28] F. Hamzaoglu, "A 153 Mb-SRAM design with dynamic stability enhancement and leakage reduction
in 45 nm high-K metal-gate CMOS technology," in IEEE ISSCC Dig. Tech. Papers, 2008.
[29] Y. Wang, et al., "A 4.0GHz 291Mb Voltage-Scalable SRAM in 32nm high-κ Metal-Gate CMOS
with Integrated Power Management," in ISSCC Dig. Tech. Papers, 2009.
[30] K. Mistry et al., "A 45nm logic technology with high-k + metal gate transistors, strained silicon, 9 Cu interconnect layers, 193nm dry patterning, and 100% Pb-free packaging," in Proc. Intl. Electron Devices Meeting, 2007.
[31] P. Bai, et al., "A 65nm logic technology featuring 35nm gate lengths, enhanced channel strain, 8 Cu interconnect layers, low-k ILD and 0.57 μm² SRAM cell," in Proc. Intl. Electron Devices Meeting, 2004.
[32] S. Natarajan et al., "A 32nm logic technology featuring 2nd-generation high-k + metal-gate transistors, enhanced channel strain and 0.171 μm² SRAM cell size in a 291Mb array," in Proc. Intl. Electron Devices Meeting, 2008.
[33] Shyamkumar Thoziyoor, Naveen Muralimanohar, Jung Ho Ahn, and Norman P. Jouppi, "CACTI 5.1," HP Laboratories, April 2008.
[34] P. Tan, T. Le, K.-H. Ng, P. Mantri, and J. Westfall, "Testing of UltraSPARC T1 Microprocessor and
Its Challenges," in Proceedings of the International Test Conference, 2006.
[35] T. Wood, G. Giles, C. Kiszely, M. Schuessler, D. Toneva, J. Irby, M. Mateja, "The Test Features of
the Quad-Core AMD Opteron Microprocessor," in Proc. International Test Conf., 2008.
[36] Norman Robson, John Safran, Chandrasekharan Kothandaraman, Alberto Cestero, Xiang Chen, Raj
Rajeevakumar, Alan Leslie, Dan Moy, Toshiaki Kirihata, and Subramanian Iyer, "Electrically
Programmable Fuse (eFUSE): From Memory Redundancy to Autonomic Chips," in Custom
Integrated Circuits Conferance, 2007.
[37] V. George, S. Jahagirdar, C. Tong, K. Smits, S. Damaraju, S. Siers, V. Naydenov, T. Khondker, S. Sarkar, and P. Singh, "Penryn: 45-nm Next Generation Intel® Core™ 2 Processor," in ASSCC Dig. Tech. Papers, 2007.
[38] J. L. Henning, "SPEC CPU2000: measuring cpu performance in the new millennium," IEEE
Computer, vol. 33, no. 7, pp. 28-35, July 2000.
[39] J. Levon, "Oprofile - a system profiler for linux," http://oprofile.sourceforge.net/.
[40] D. Shin, "TECHNIQUES FOR DESIGN AND SYNTHESIS OF APPROXIMATE DIGITAL
CIRCUITS FOR ERROR-TOLERANT APPLICATIONS," Ph.D. Dissertation, University of
Southern California, 2011.
[41] Hsunwei Hsiung, Byeongju Cha, and Sandeep K. Gupta, "Salvaging chips with caches beyond repair," in DATE, 2012.
[42] N. Aggarwal, P. Ranganathan, N. P. Jouppi, and J. E. Smith, "Configurable isolation: building high
availability systems with commodity multi-core processors," in Proc. of the 34th Annual
International Symposium on Computer Architecture, 2007.
[43] D. Kanter, "Inside Nehalem: Intel's Future Processor and System," www.realworldtech.com, 2008.
[44] Steve Haga , Natasha Reeves , Rajeev Barua , Diana Marculescu, "Dynamic Functional Unit
Assignment for Low Power," in Proceedings of the conference on Design, Automation and Test in
Europe, 2003.
[45] E. Morancho, J. M. Llaberia and A.Olive, "Recovery Mechanism for Latency Misprediction," in
Proc. of International Conference on Parallel Architectures and Compilation Techniques, 2001.
[46] D. Kanter, "AMD's Bulldozer Microarchitecture," www.realworldtech.com, 2010.
[47] Erik R. Altman , David Kaeli , Yaron Sheffer, "Welcome to the Opportunities of Binary
Translation," Computer, no. 3, pp. 40-45, 2000.
[48] J. Shen, M. Lipasti, Modern processor design: fundamentals of superscalar processors, McGraw-Hill
Higher Education, 2005.
[49] Kucuk, G., Ponomarev, D., Ghose, K, "Low–Complexity Reorder Buffer Architecture," in
Proceedings of the International Conference on Supercomputing, 2002.
[50] Simha Sethumadhavan,et.al., "Late-binding: Enabling Unordered Load-Store Queues," in
International Symposium on Computer Architecture, 2007.
[51] Erez Perelman, Greg Hamerly, Michael Van Biesbrouck, Timothy Sherwood, and Brad Calder,
"Using SimPoint for Accurate and Efficient Simulation," in ACM SIGMETRICS the International
Conference on Measurement and Modeling of Computer Systems, 2003.
[52] N. Choudhary, S. Wadhavkar, T. Shah, H. Mayukh, J. Gandhi, B. Dwiel, S. Navada, H. Najaf-abadi,
and E. Rotenberg, "FabScalar: composing synthesizable RTL designs of arbitrary cores within a
canonical superscalar template," in ISCA, 2011.
[53] Ilhyun Kim, Mikko H. Lipasti, "Understanding Scheduling Replay Schemes," in Proceedings of the
10th International Symposium on High Performance Computer Architecture,, 2004.
[54] "Alpha 21264 Microprocessor Hardware Reference Manual," Compaq Computer Corporation , 1999.
[55] Eric Larson , Saugata Chatterjee, and Todd Austin, "MASE: A Novel Infrastructure for Detailed
Microarchitectural Modeling," in ISPASS, 2001.
[56] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA), 2011.
[57] I. Koren, "The effect of scaling on the yield of VLSI circuits," in Yield Modeling and Defect Tolerance in VLSI Circuits, W.R. Moore, W. Maly, and A. Strojwas, Eds. Bristol, 1988.
[58] W. Huang, K. Rajamani, M. Stan, and K. Skadron, "Scaling with design constraints: Predicting the
future of big chips," IEEE Micro, July-Aug 2011.
[59] D. Lundgren, "Double precision FPU," OpenCores, 2009.
[60] Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. C. Weiser, "Many-Core vs. Many-
Thread Machines: Stay Away From the Valley," IEEE COMPUTER ARCHITECTURE LETTERS,
vol. 8, no. 1, pp. 25-28, 2009.
[61] C. K. Chow, "Determination of Cache's Capacity and its Matching Storage Hierarchy," IEEE Trans.
Computers, Vols. c-25, 1976.
[62] B. L. Jacob, P. M. Chen, S. R. Silverman, and T. N. Mudge, "An Analytical Model for Designing
Memory Hierarchies," IEEE Trans. Computers, vol. 45, no. 10, 1996.
[63] L. Shimpi, "The Bulldozer review: AMD FX-8150 Tested cache and memory performance,"
http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/6, 2012.
[64] M. Greenberg, "DDR4: Double the speed, double the latency? Make sure your system can handle
next-generation DRAM," Cadence tech talk, 2011.
[65] Neil H. E. Weste, David F. Harris, CMOS VLSI Design: A Circuits and Systems Perspective,
Pearson/Addison-Wesley, 2005.
[66] B.S. Amrutur and M.A. Horowitz, "Speed and power scaling of SRAMs," IEEE Journal of Solid
State Circuits, vol. 35, no. 2, pp. 175-185, 2000.
[67] Amyeen, M. E., S. Venkataraman, and M. W. Mak, "Microprocessor System Failures Debug and
Fault Isolation Methodology," in Proc. IEEE Intl. Test Conf, 2009.
[68] M. Abramovici, "A Reconfigurable Design-for-Debug Infrastructure for SoCs," in Proc. IEEE/ACM
Design Automation Conf., 2006.
[69] Wood, T., et. al., "The Test Features of the Quad-Core AMD Opteron Microprocessor," in
Proceeding of Interncation Test Conference, 2008.
[70] M. Golden et al., "40-entry unified out-of-order scheduler and integer execution unit for the AMD
Bulldozer x86-64 core," in ISSCC, 2011.
[71] Fred A. Bower, Paul G. Shealy, Sule Ozev, and Daniel J. Sorin, "Tolerating Hard Faults in
Microprocessor Array Structures," in International Conference on Dependable Systems and
Networks, 2004.
[72] J. Srinivasan, S. Adve, P. Bose, and J. Rivers, "Exploiting structural duplication for lifetime
reliability enhancement," in ISCA, 2005.
[73] H. McIntyre , S. Arekapudi , E. Busta , T. Fischer , M. Golden , A. Horiuchi , T. Meneghini , S.
Naffziger and J. Vinh, "Design of the Two-core x86-64 AMD ‘Bulldozer’ Module in 32 nm SOI
CMOS," in JSSC, 2012.