OPTIMAL REDUNDANCY DESIGN FOR CMOS AND
POST-CMOS TECHNOLOGIES
by
Da Cheng
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfilment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2015
Copyright 2015 Da Cheng
To the memory of my grandfather
Tianxiang Cheng
(1924-2011)
Acknowledgements
Foremost, I would like to thank my academic advisor, Professor Sandeep K. Gupta, for his guidance throughout my Ph.D. program. He has helped me learn all the important skills required to conduct research, including thinking about and solving problems systematically and presenting ideas and results clearly. I appreciate immensely his patience and contribution of time during this process. Besides academic guidance, he has also been generous in sharing his own life experience with me and willing to give advice. My Ph.D. study and research have been a stimulating, rewarding, and pleasant journey under his guidance.
Besides my advisor, I would like to thank the rest of my dissertation committee: Professor
Antonio Ortega and Professor Ramesh Govindan, for their insightful and detailed comments,
which have helped in many aspects to make this dissertation complete. Special thanks to
Professor Ortega for letting me access his DSP lab, where I collected experimental results for my
first paper, and to Professor Govindan for the guidance during our collaboration on a networks
project.
I would like to thank all my colleagues and friends at USC for exciting collaborations and intriguing discussions, which provided numerous interesting ideas for my dissertation work. Special thanks to all the directed research students who have worked with me, for their contributions to several research projects. My sincere thanks also go to my friends in Los Angeles, who have made my days joyful.
Lastly, I am deeply grateful to my parents Caimin Wang and Zhiwen Cheng for their
unconditional love. Thanks for always being there for me. Many thanks to my younger sister
Qun Cheng for keeping my parents company while I was far away in the past six years. You
always make me proud of being a brother.
Table of Contents
Acknowledgements ............................................................................................................... iii
List of Figures ........................................................................................................................ v
List of Tables ....................................................................................................................... vii
Abstract ............................................................................................................................ viii
Introduction: Yield-Aware Design ......................................................................................... 1
1.1. Previous Research .......................................................................................................... 2
1.2. Motivation ...................................................................................................................... 5
A Systematic Methodology to Improve Yield/Area CMPs .................................................... 9
1.3. Key Ideas ...................................................................................................................... 10
1.4. Problem Definition ....................................................................................................... 13
1.5. Yield-Per-Area Optimization ....................................................................................... 14
1.6. Newly Proposed Spare Configurations ........................................................................ 27
1.7. Comparison of Proposed and Traditional Redundancy Techniques ............................ 36
1.8. Conclusion .................................................................................................................... 44
Optimizing Redundancy Design for Flexible Utility Functions .......................................... 45
1.9. Optimal Spare Design for CMPs .................................................................................. 46
1.10. Utility Functions ......................................................................................................... 48
1.11. Case Studies on NVIDIA Fermi GPU Architecture ................................................... 57
1.12. Conclusions ................................................................................................................ 67
Partially-working Processors Binning to Maximize Wafer Utilization ............................... 68
1.13. Related Research ........................................................................................................ 69
1.14. GPU Background ....................................................................................................... 72
1.15. Problem Statement ..................................................................................................... 73
1.16. Proposed Approach .................................................................................................... 74
1.17. Experimental Results ................................................................................................. 83
1.18. Conclusions ................................................................................................................ 85
Optimal Redundancy Designs for CNFET-Based Circuits .................................................. 86
1.19. Related Research ........................................................................................................ 87
1.20. Background of CNFETs ............................................................................................. 89
1.21. A CNFET Design Methodology ................................................................................ 93
1.22. Problem Statement ................................................................................................... 101
1.23. Redundancy Design for Logic Circuits .................................................................... 103
1.24. Redundancy Design for Memory ............................................................................. 111
1.25. Conclusions .............................................................................................................. 117
Conclusions ....................................................................................................................... 118
1.26. Contributions ............................................................................................................ 118
1.27. Future Research ........................................................................................................ 119
Reference .......................................................................................................................... 121
List of Figures
Figure 1. Defect-tolerant designs for chip-multiprocessor. .......................................................... 12
Figure 2. Sharing of spare processors (cores). .............................................................................. 17
Figure 3. Floorplans for different spare configurations for two 4-core processors. ..................... 18
Figure 4. CMP floorplan and area calculation for spare cores approach. ..................................... 21
Figure 5. Delay overheads of Demux and mux. ........................................................................... 24
Figure 6. Domains of spare core sets $S_{16}^{1}$, $S_{8}^{2}$, $S_{4}^{3}$ and $S_{2}^{6}$, which all cover core 12. ................... 32
Figure 7. Yield per area for SCS-gp with $D_0 \cdot \kappa = 0.2$. ...................................................................... 39
Figure 8. Three representative spare configurations with 192 spare cores. .................................. 40
Figure 9. Yield per area vs. Defect density. .................................................................................. 42
Figure 10. Yield per area vs. technologies. ................................................................................... 44
Figure 11. IPCs for GPU programs with fixed L2-cache and memory configuration. ................. 51
Figure 12. Normalized value function for linear, IPC and catalog price. ..................................... 51
Figure 13. Yield profile, PS profile, and PSPA profile for NPB C with ...................................... 54
Figure 14. Optimal designs, with n = 16 and k = 11 for NPB A and C. ....................................... 56
Figure 15. Relations between optimal spare configurations of different NPB functions. ............ 60
Figure 16. Theoretical maximal PSPA for different values of m. ................................................ 61
Figure 17. Residual PSPA for each NPB function, for n =16, k = 11. ......................................... 62
Figure 18. Optimal and -optimal designs with n = 16, k = 11 for NPB A and C. ........ 64
Figure 19. Normalized figure-of-merits vs. number of spare processors for NPB C. .................. 66
Figure 20. IPCs for benchmarks from ISPASS and Nvidia CUDA SDK. ................................... 76
Figure 21. Estimation of performance for defective CMPs. ......................................................... 77
Figure 22. IPC vs. number of cores per SM for an arbitrary benchmark. .................................... 80
Figure 23. Optimal numbers of spares at various scopes of sharing. ........................................... 82
Figure 24. Structure of CNFET with 4 CNTs (adapted from [86]). ............................................. 89
Figure 25. Screening effect on capacitance ($C_{gc}$) and current ($I_d$). ............................................... 91
Figure 26. Ratio of capacitance and current for an inverter with $N_{tub}$. ...................................... 96
Figure 27. The proposed heuristic. (a) Without branches. (b) With branches. ............................. 97
Figure 28. Schematic of 1-bit full adder built with compound gates............................................ 99
Figure 29. Elmore delay model of 1-bit full adder. ...................................................................... 99
Figure 30. 6T CNFET SRAM cell [34]. ..................................................................................... 101
Figure 31. A 12-bit address decoder with 4-bit pre-decoding [21]. ............................................ 106
Figure 32. Yield/area and delay for different redundant RCA designs. ..................................... 110
Figure 33. Layout of 1-to-2 demux. ............................................................................................ 116
Figure 34. Optimal redundant CNTs design vs. SRAM array size. ............................................ 116
Figure 35. Impact of $N_{access}$ on decoder delay for primary inputs with different inputs. ............ 116
List of Tables
Table 1. Area ratio for different architectures. ............................................................................. 13
Table 2. Symbols and values. ....................................................................................................... 23
Table 3. Spare core sets for spare configuration (4, 4; $n_{16}$ = 4, $n_{8}$ = 4, $n_{4}$ = 8, $n_{2}$ = 8). ................ 32
Table 4. Available spare core sets after 1st step. ........................................................................... 34
Table 5. Available spare core sets after γ + 1 steps. .................................................................... 36
Table 6. Spare cores sharing configurations used in Figure 8. ..................................................... 40
Table 7. Optimal spare configurations. ......................................................................................... 42
Table 8. Delay penalties due to spare cores. ................................................................................. 43
Table 9. Floorplan characteristics ................................................................................................. 47
Table 10. u(j) for various NPB functions for linear value function. ............................................. 52
Table 11. Area scaling factor for a particular design with n = 16, for various values of m.......... 54
Table 12. Optimal spare configurations. ....................................................................................... 55
Table 13. -optimal designs with =8%. ...................................................................................... 63
Table 14. Optimal IPC for PPB and FPPCB................................................................................. 83
Table 15. Optimal spare configurations [$n_{sc,32}$, $n_{sc,16}$, $n_{sc,8}$, $n_{sc,4}$, $n_{sc,2}$, $n_{sc,1}$]. ................................. 83
Table 16. Elmore delays for RCA with different $Tr_{lmt}$ using three approaches. ......................... 100
Table 17. Elmore delays for AD with different $Tr_{lmt}$ using three approaches. ........................... 101
Table 18. Nominal designs (ND) with different $Tr_{lmt}$ for RCA. ................................................. 105
Table 19. Nominal decoder designs with different $Tr_{lmt}$. ............................................................ 107
Table 20. Redundancy designs using PS (RDPS) for different designs. .................................... 108
Table 21. ($N_{dc}$, $N_{ds1}$, $N_{ds2}$) for six redundant CNTs designs. ....................................................... 110
Table 22. Optimal redesigns using Redundant CNTs (RDC) for ND1-ND5. ............................ 110
Abstract
As CMOS fabrication technology continues to move deeper into the nano-scale, the susceptibility of circuits to manufacturing imperfections increases, and the improvements in yield, power,
and delay provided by each major scaling generation have started to slow down or even reverse.
This trend poses great challenges for designing and manufacturing advanced electronics in the
future generations of CMOS, especially chip-multiprocessors, which usually have many
processors and hundreds of cores. At the same time, new fabrication technologies, such as
carbon nanotube field-effect transistors (CNFET), are emerging as promising building blocks for
the next-generation semiconductor electronics. However, substantial imperfections inherent to
CNT technology are the main obstacles to the demonstration of robust and complex CNFET
circuits. Both these scenarios show that it is increasingly difficult to guarantee the correctness
and conformance of circuits to performance specifications, which leads to reduction in yield.
We develop a systematic methodology to use spare processors (cores) to replace/repair
defective ones to optimize yield per area of CMPs. We improve effectiveness of spare processors
(cores) by sharing them among original ones. We develop the first detailed area model to obtain
more realistic estimates of yield per area and hence to derive more efficient designs. The
experimental results show that our spare cores sharing approach improves yield per area from
67.2% to 76.6% (where 100% denotes the yield per area of an unattainable ideal, namely the
original design with a zero-defect process). Then we further improve wafer utilization by
proposing a new utility function that assigns appropriate value to partially defective but useable
chips. Specifically, we present two utility functions, i.e., Fully-working Processor and Partially-working Chip Binning (FPPCB) at the processor level, and Partially-working Processors Binning (PPB) at the core level. Results show that our utility functions provide yield per area of up to 90% of the ideal.
The above improvements are in part due to a number of other new approaches we have
developed. In particular, we have developed a greedy repair algorithm to use spare cores to
replace defective ones, and have proven its optimality for the scenario without utility functions.
We have also developed new repair heuristics for different categories of benchmarks for both
utility functions. We have also proposed an approach to estimate performance of defective CMPs
by averaging performance obtained for CMPs with different warp sizes. This fast estimation
method enables evaluation of various redundancy approaches and hence enables rapid
identification of the optimal level of redundancy.
For future technologies with smaller feature sizes and extremely high defect densities, we
explored finer-level redundancy techniques by studying CNFET-based circuits. We propose an
approach to add CNTs for each CNFET based on its characteristics to increase yield per area for
logic circuits. We also propose a hybrid redundancy approach for SRAM arrays by identifying
the optimal combination of redundant CNTs and spare columns (rows). Compared to the approach that uses only spare CNTs, our approach increases yield/area from 60.9% to 76.3% for a 2MB memory module, while reducing decoder delay overhead from 19.2% to 15.7%.
Introduction: Yield-Aware Design
Increasing complexity of applications and their large dataset sizes make it imperative to
consider novel architectures that are efficient in terms of performance and power. Chip multi-
processors (CMP) are one such example where multiple cores are integrated on a die. As
technology scales, the number of cores in a CMP will continue to increase to satisfy performance
requirements of future applications [82].
One key challenge faced by “end-of-Moore” nano-scale CMOS technologies is that they are
expected to have failure rates 1-2 orders of magnitude higher than today’s technologies [11]. In
2008, a large percentage of fabricated chips for an NVIDIA mobile GPU design failed [65]. Also, the NVIDIA GTX470/480 released in 2010 had low fabrication yields [66]. Generally speaking, failures that occur within functional units during fabrication command increasing attention from
researchers and vendors as (i) more cores are integrated into CMPs (for example, processing
units of Fermi GPUs take up to 30% of the die area [52]); and (ii) well-developed hardware
redundancy techniques combined with software-level defect-tolerance techniques provide fairly
high yields for memory modules [16]. Several previous works assume perfect on-chip cache
modules [82]. We assume yield of 1 for L1 cache since its area is small. Also, the yield of L2
cache appears as a multiplicative term in yield for the cases we study. Hence, we can simply
factor out the yield of L2 cache from total yield value and total yield per area value. Hence, in
this work, we focus on functional units.
For clarity, we use the term yield to capture the probability of a chip being functional,
equivalently, the percentage of fabricated chips that are functional. We use the term yield per
area to capture the probability of a fabricated chip being functional while taking into account
area overhead by dividing by the area of the chip.
1.1. Previous Research
A significant body of published research regarding detecting defects focuses on multi-core
systems during post-fabrication testing [39][81]. Based on defect information, researchers have
proposed defect-tolerance techniques to ensure functional correctness. To address yield problems
of modern CMPs, vendors and academic researchers have proposed defect-tolerance approaches
by considering: (i) granularity at which to add spare copies, such as spare processors, spare
cores, spare functional modules, and so on, and (ii) areas of such spare copies.
In terms of granularity, at processor level, defect-tolerance techniques are employed, either by
disabling every non-working processor or by replacing non-working processors by spare
processors added to the design. For example, it was revealed that GTX480 and GTX470 GPUs
were sold after disabling one and two non-working processors, respectively, out of sixteen (each
processor is referred to as a streaming multi-processor, or SM, and each processor has 32 cores)
[66]. IBM CELL chip consists of eight processors (called synergistic processing elements, or
SPEs), out of which one is a spare processor that can be used to replace any non-working
processor [45]. As defects in a processor may only affect some parts of a GPU but not others,
Durytskyy et al. [6] propose to use partially non-working processors to generate run-time
information, such as instruction pre-fetching, memory coalescing, etc., by running a copy of the
kernel. Then working processors can use this information to speed up the execution of the actual
kernel.
At sub-processor level, some defect-tolerance techniques propose to disable the non-working
components (such as defective banks of cache) using hardware or software methods [70][27], or
to share pipeline stages between processor pipelines using reconfiguration interconnection
networks so as to obtain a maximum number of working processors [56]. Some others have
proposed to add spare components or pipeline stages. Spare copies of certain components, such
as a register file, a store queue, etc., are added to each processor, to be used to replace
corresponding non-working components (if any) [70][56]. Alternatively, Shanmugam et al. [76]
propose to add an FPGA module as a reconfigurable hardware unit (RHU) to each processor,
such that this RHU can be configured to replace any non-working component within the
processor. DIVA proposes adding a checker (which is a robust copy of the execution units), after
the last stage of the original processor pipeline to ensure the correctness of the execution of the
program [84].
Previous defect-tolerance approaches that explore the optimal level of granularity at which to
add redundancy tackle granularity in a theoretical manner. For example, in [88] a CMP design is
partitioned into modules of arbitrary sizes, each with an equal amount of functional logic. Spare
modules are then incorporated to replace any defective module, i.e., with a global scope. This
approach assumes that a spare module with a global scope incurs a fixed area overhead. This is
unrealistic as it ignores the overheads of interconnects required to use spare modules. Finally the
optimal size of a module (at which spare modules are used) is derived to maximize yield per area.
This work claims that for each fixed area overhead, yield per area is maximized by partitioning
the CMP design into partitions of a certain size. A CMP model is proposed and yield trends
across technology generations are studied for processor-level as well as sub-processor-level
redundancies [70]. This paper shows that redundancy at coarser level of granularity is preferred
after 100 nm technology node. However, this conclusion is based on an unequal comparison,
where the metrics for evaluation are defined differently for processor-level redundancy and sub-
processor-level redundancy. Specifically, at processor level no redundancy is actually added and
a CMP is counted as working as long as there are one or more working processors. In contrast, at
sub-processor level, spare sub-processor modules are added and a CMP with even one non-
working processor is considered to be non-working.
In [54], the authors start with a design at the gate-level and propose an algorithm to cluster
and partition to form modules to be replicated. First, a logic circuit is clustered into
combinational logic blocks (CLBs) to address design and test constraints such as timing closure
and testing complexity. Second, the larger of the generated CLBs are partitioned and the smaller
CLBs are further clustered to arrive at the optimal level of granularity for replication to
maximize yield/area. Using a real design (OpenSPARC T2) and defect densities projected in the
near future, the experimental results show that the proposed algorithm achieves 1.1 to 13.3 times
better yield/area as a function of defect density.
The authors of [49] propose a self-repair technique, i.e., Sparing through Hierarchical
Exploration (SHE), where identical sub-modules are identified among different functional
modules. The authors claim that SHE reduces area overhead by 80% compared to traditional
spare techniques where both logic and memory are replicated, as SHE replicates only logic units
while assuming acceptable yield for memory due to existing spare columns and/or rows.
With the continuing decrease of feature sizes, defect densities for emerging manufacturing technologies, such as CNFET and FinFET, can be 1-2 orders of magnitude higher. This trend makes it imperative and useful to explore redundancy at finer levels of granularity.
Recently, some approaches of defect-tolerance for nano-electronics have focused on adding
redundancy at functional module level or gate level [44]. It has been shown that adding
redundancy at transistor level can provide more efficient defect tolerance than adding
redundancy at module and gate levels by reducing area overhead [7]. Adding redundant CNTs
within each transistor has been used as a practical approach to improve yield by previous works.
For example, in Stanford Nanotube Computer [55], each CNFET comprises approximately
10~200 CNTs, depending on the relative sizing of the CNFETs.
1.2. Motivation
This dissertation develops a set of approaches for optimal use of redundancy. We started this
research by focusing on CMPs, as these are (1) large chips that are facing increasingly severe
yield problems, (2) sold at high prices and in high volumes hence will benefit significantly by
increase in yield per area, and (3) have fairly regular structures and hence are ideal for
development of efficient approaches for improving yield per area. The following sub-sections
show the motivation for various parts of our research, including (1) identification of important
considerations for deriving optimal redundancy designs, (2) optimal use of redundant modules
via sharing and associated concept, namely scope of sharing, (3) flexibility in utility functions,
and (4) optimal granularity to add redundancy.
1.2.1. The optimal redundancy design
While the yield benefits of spare processors (cores) and their area overheads can be easily
taken into account, previous works on yield enhancement have not developed an accurate area
model for CMPs, especially in terms of following overheads associated with adding spare
processors (cores): (a) wasted area (i.e., unused area) in the floorplan, e.g., considering the
floorplan for an $n_{rp} \times n_{cp}$ processor array, a single spare processor may result in wasted area as high as the area of $(n_{rp} - 1)$ or $(n_{cp} - 1)$ processors, (b) the overheads of the additional
reconfiguration modules and new interconnects required to use spare processors, and (c) the area
overhead that is incurred when the widths of data and control buses increase due to use of spare
processors. Even in designs where these wires run over processors (as in our case study), an
increase in bus width beyond the width of processors requires moving processors farther apart
and hence leads to area overheads.
Previous works also have failed to identify the importance of the scope of sharing of spare
processors (cores). We define the scope of sharing of a spare processor (core) as the set of all
original processors (cores) that the spare processor (core) can replace. (Details are presented in
Figure 2 and Section 1.5.) This is particularly important to us, since there is a first-order tradeoff
between the total area overhead and the number of spare processors (cores). For example, adding
redundancy at a finer granularity requires fewer spare processors (cores) to achieve a particular
yield; however, this also increases the overheads associated with restructuring of interconnects.
Similarly, sharing spare processors (cores) among multiple identical components reduces the
number of spare processors (cores) required to achieve a particular yield; however, this also
increases the overhead associated with restructuring of interconnects and may even require us to
move components farther apart to fit this increasing bus width. None of the previous approaches
explicitly captures such tradeoffs in the search for the optimal spare configuration, including the
numbers of spare processors (cores), the scope of sharing of spare processors (cores), and so on.
We will show a detailed exploration using Nvidia GPUs and present a systematic methodology to improve yield/area of highly-parallel CMPs in the next chapter.
1.2.2. Flexibilities in specifications
Consider a chip-multiprocessor (CMP) with 𝑛 processors. In the past, when defect densities
were low, such a chip would have been designed with 𝑛 processors. During post-fabrication
testing, a significant percentage of chips would have been identified as having 𝑛 working
processors. These chips would have been sold and the other chips would have been discarded.
In recent years, as defect densities have increased, the percentage of chips with 𝑛 working
processors has decreased. One classical approach to improve yield of CMPs is to add a spare
processor to the design so as to increase the percentage of fabricated chips with 𝑛 working
processors. During post-fabrication testing, every chip identified as having fewer than 𝑛 working
processors is discarded. Every chip with 𝑛 or more working processors is counted as a good chip
and sold as an 𝑛 -processor chip. This approach is discussed in [48], which calls it AMAD (as
many as demanded). This approach has been used in commercial processors, such as the IBM
CELL processor [45], which is sold only in a seven-processor configuration, i.e., 𝑛 = 7, but
designed with eight processors, i.e., with one spare processor.
As the defect densities have continued to increase, it has become necessary to add multiple
spare processors to the design. Also as the hierarchy of a CMP design grows, each processor may
consist of multiple cores. The next chapter presents a systematic approach for adding spare processors and/or
cores to highly-parallel CMPs. Using the specifications and a model of the CMP design,
including the chip’s floor-plan, its interconnection networks, the designs of its level-2 and level-
3 caches, and so on, we can identify the optimal number of spare processors and/or spare cores to
be added to the design to maximize the yield per wafer, using negative-binomial yield model, for
a given defect density [32].
However, even such a CMP design which is optimal under AMAD is not maximally efficient,
since we waste a large number of working processors. First, we discard every chip with less than
𝑛 working processors. Every working processor on every such discarded chip is wasted. Second,
in every chip with 𝑛 + 1 working processors we waste one working processor by disabling, in
every chip with 𝑛 + 2 working processors we waste two working processors, and so on.
In recent years, some CMP vendors have been forced to adopt, in an ad-hoc manner, an
orthogonal direction, namely, flexibility provided in terms of types of chips that can be sold. For
example, to deal with low yield problem at 40nm technology node, NVIDIA put three GPU
products on market, namely, GTX 480, GTX 470 and GTX 465, which are actually the same
design (Fermi) but sold with 1, 2, and 5 processors disabled, respectively [77].
In the chapter on optimizing redundancy design for flexible utility functions, we present newly-developed utility functions at the processor level to improve wafer usage. In the chapter on partially-working processors binning, we further improve wafer usage by including utility functions at the core level.
1.2.3. Redundancy at finer-level granularities
Though the idea of adding redundancy at the gate and transistor levels has been proposed, previous research does not provide the optimal design, e.g., the number of CNTs in each CNFET. Instead of adding CNTs in an ad-hoc manner, in this work we carry out a study on common logic and memory modules to identify the optimal redundancy technique, or combination of techniques, by considering existing redundancy techniques. The chapter on optimal redundancy designs for CNFET-based circuits will introduce the characteristics of CNFETs and related redundancy techniques in detail. Then a systematic methodology for adding redundancy for CNFETs will be presented.
A Systematic Methodology to Improve Yield/Area CMPs
In this chapter, we present a systematic methodology to use spare processors (cores) to replace/repair defective ones to optimize yield per area of highly parallel CMPs. Specifically, the contributions of this work are as follows.
(1) We take advantage of design regularity to accurately estimate area overheads of adding
spare processors (cores). We develop the first detailed area model to obtain more realistic
estimates of yield per area and hence to derive more efficient designs for CMPs.
(2) We introduce a new notion we call scope of sharing, which improves effectiveness of spare
cores and enriches spare configuration choices.
(3) We systematically enumerate all possible ways of adding spare processors (cores),
covering the entire spare design space to identify the optimal spare configuration(s).
(4) A straightforward implementation of our approach has high computational complexity, as each spare configuration requires a significant amount of Monte Carlo simulation [13]. We substantially reduce the time required as follows.
a. Using importance sampling to reduce the required number of Monte Carlo simulations.
b. Trading off run-time for memory usage.
(5) We have developed a greedy repair algorithm to use spare cores to replace defective cores, and have proven its optimality for the scenario without utility functions.
(6) We present the impact of adding spare processors (cores) on CMP performance in terms of
the minimum clock cycle.
(7) We define a new evaluation metric for measuring effectiveness of approaches to improve
yield per area in terms of ideal yield per area, which is obtained for the original design with
no redundancy and zero defect. Our approaches improve yield per area from 67.2% to 76.6%
of the ideal yield per area.
1.3. Key Ideas
If the area associated with additional interconnects is not considered, it can be proved that
adding spare copies at a finer level of granularity and with a global scope of sharing is always
preferable. However, in reality, any spare configuration at very fine grain is not useful as such
spare copies usually require very high interconnect area overheads according to a study of wiring
area of crossbars and other cases [75]. Hence, to identify practically useful spare configurations
in terms of yield per area, it is imperative to consider area overhead of interconnects and wasted
area, in addition to the area of spare copies.
While analytically quantifying interconnect overheads is impossible for an arbitrary chip,
highly-parallel CMPs use regular or semi-regular floorplans for components and regular or semi-
regular interconnect topologies, such as a shared bus, a crossbar, an H-tree, and so on [18]. Our
approach is based on the observation that in such chips, area overheads of spare processors
(cores) and wasted area can be computed easily.
Figure 1 shows high-level views of architectures of different logical and physical topologies
with different levels of defect tolerance. Considering a CMP with 𝑛 𝑝 processors, Table 1 shows
the corresponding area ratios (due to area of spare processors and wasted area) for defect-tolerant
designs with 𝑚 𝑝 processors, i.e., with 𝑚 𝑝 − 𝑛 𝑝 spare processors, which are capable of tolerating
𝑚 𝑝 − 𝑛 𝑝 defective processors. Furthermore, based on desired interconnects and spare
configurations for each architecture, we can accurately capture parameters of the final
interconnects, e.g., the number of de-multiplexers, the size of each de-multiplexer, etc., and
hence we can obtain interconnect area overheads as well as important interconnect parameters,
including delay and power, to a first-order accuracy in an analytical manner.
For each CMP architecture illustrated in Figure 1, we preserve the full functionality of the
original architecture when reconfiguring defective CMPs with spare processors. Reconfiguration
algorithms have been proposed to preserve the original logical topology (hence to keep the full
functionality) for mesh after defective processors are replaced [48][90]. This approach is also
applicable to torus and bus-based topologies. This is done by supporting a configurable ID at
each processor and assigning ID values to reconstruct the original logical topology. For H-tree,
we add redundancy using a straightforward approach where (i) we start by adding a duplicated
copy of H-tree to the original H-tree, which makes a double-sized H-tree, and (ii) then removing
a maximum number of processors so that the obtained H-tree has a certain level of defect
tolerance, e.g., one-defective-processor tolerance and two-defective-processors tolerance.
Crossbar interconnect has dedicated links for each original processor as well as each spare
processor. Hence it is easy to maintain original logic topology while adding spare processors to a
crossbar-based architecture.
Furthermore, when considered in combination with design rules, design guidelines, and
design-for-manufacture rules, all possible options for adding spare processors are discrete and
enumerable such that we are able to exhaust the design space to obtain the optimal spare
configuration. The above characteristics allow us to obtain the maximal yield per area at first-order
accuracy using analytical equations.
[Figure: rows show seven architectures — bus (in-line), bus (array), crossbar (in-line), crossbar (array), mesh (array), torus (array), and H-tree (array); shading distinguishes original processors from spare processors. Columns: (a) Original $n_p$-processor designs. (b) One-defective-processor tolerant designs. (c) Two-defective-processors tolerant designs.]
Figure 1. Defect-tolerant designs for chip-multiprocessor.
Table 1. Area ratio for different architectures.
Logical topology | Physical topology | Area ratio
Bus              | In-line           | m_p / n_p
Bus              | Array             | ⌈√m_p⌉ · ⌊√m_p⌋ / n_p
Crossbar         | In-line           | m_p / n_p
Crossbar         | Array             | ⌈√m_p⌉ · ⌊√m_p⌋ / n_p
Mesh             | Array             | ⌈√m_p⌉ · ⌊√m_p⌋ / n_p
Torus            | Array             | ⌈√m_p⌉ · ⌊√m_p⌋ / n_p
Binary tree      | Array             | ⌈√m_p⌉ · ⌊√m_p⌋ / n_p
Defect-tolerance level = m_p − n_p.
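As a quick illustration of the entries in Table 1, the following sketch computes the area ratio for the in-line and array floorplans for a given number of spare processors. It is a minimal example written for this text (the function names are ours, not from the dissertation's tool) and assumes a Python environment.

```python
import math

def inline_area_ratio(n_p: int, n_spare: int) -> float:
    """In-line floorplan (Table 1): m_p / n_p, with m_p = n_p + n_spare."""
    return (n_p + n_spare) / n_p

def array_area_ratio(n_p: int, n_spare: int) -> float:
    """Array floorplan (Table 1): ceil(sqrt(m_p)) * floor(sqrt(m_p)) / n_p."""
    m_p = n_p + n_spare
    return math.ceil(math.sqrt(m_p)) * math.floor(math.sqrt(m_p)) / n_p

if __name__ == "__main__":
    # e.g., 16 original processors with 0, 1, or 2 spares (cf. Figure 1 (a)-(c))
    for n_spare in (0, 1, 2):
        print(n_spare,
              round(inline_area_ratio(16, n_spare), 3),
              round(array_area_ratio(16, n_spare), 3))
```

For the array floorplan, one spare added to a 4x4 array already forces a fifth column (or row), so the ratio jumps from 1.0 to 20/16 = 1.25, which is exactly the wasted-area effect discussed in Section 1.2.1.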
1.4. Problem Definition
In this section, we first explain the domain in which our problem is defined. Then we present
the problem statement.
1.4.1. Problem domain
We have studied a broad range of modern CMPs, including Cisco QuantumFlow Processor,
NVIDIA Fermi GPUs, AMD Bulldozer, and IBM CELL [18][20][1]. Even though the surveyed
CMPs are used in different application domains, ranging from network processors specializing in
packet switching to graphics processors specializing in parallel computations, we are still able to
derive a general structure by capturing their similarities: each CMP (i) uses a small number of
distinct types of components and uses multiple copies of many of these components, (ii) uses
interconnect topologies belonging to a small family, such as shared bus, crossbar, H-tree, etc.,
and (iii) is usually composed using two or more levels of hierarchies.
Inside traditional single-core processors, there used to be a single copy of each constituent
component, such as an instruction decoder or an ALU. However, as can be seen for most modern
CMPs, there are multiple copies of components. This characteristic changes the nature of the
problem and causes us to expand the set of possible configurations of spare copies, especially by
taking into account the scope of sharing of spare copies.
Generally speaking, we view this new defect-tolerance problem as the search for the optimal
level of granularity at which to add spare copies and the optimal scope of sharing of each spare
copy. For each alternative defect-tolerance configuration, we go beyond the first-order estimates
of area overheads to accurately compute yield per area. In this work, we consider two levels of
hierarchies: (i) the CMP consisting of multiple processors and a shared L2 cache module, and (ii)
each processor consisting of multiple cores and a shared L1 cache module.
1.4.2. Problem statement
Our problem is to identify a configuration of spare copies and the corresponding floorplan to
maximize yield per area for a given CMP. In particular, we will consider the set of spare
configurations obtained by exploring the following.
(1) Choose the appropriate level of granularity to add spare copies. In this work, we consider a
two-level approach, i.e., adding spare processors and adding spare cores.
(2) Use each spare processor and each spare core with an appropriate scope of sharing, i.e., the
manner in which the spare processor (core) is shared, including: (a) a spare processor (core)
that can replace any defective processor (core), in other words, shared by all processors
(cores), and (b) a spare processor (core) that can replace any defective processor (core)
within a specific scope.
(3) Use an appropriate number of spare processors (cores), for each scope of sharing.
1.5. Yield-Per-Area Optimization
In this section, we investigate five main aspects of our problem. We start with the key steps in
the proposed methodology. Next we present available spare configurations for a given CMP
architecture. For each spare configuration, the floorplan can be obtained by taking into account
various area overheads. Performance degradation is also studied to evaluate the obtained spare
configurations. In the end, we show how yield per area is computed for each spare configuration.
1.5.1. Key steps
We tackle our problem in three steps. First, we enumerate all possible configurations of spare
processors (cores) as described in Section 1.4.2. Second, we design the floorplan for any spare
configuration to estimate area as well as delay. Third, we compute yield for any spare
configuration.
1.5.2. Enumerating configurations
The problem statement identifies three dimensions of exploration, namely levels of
granularity, scopes of sharing, and numbers of spare processors (cores). Since the granularity is
already limited to two choices, namely processor level and core level, we start by focusing on
how to enumerate scopes of sharing.
Figure 2 shows an example CMP with four processors, where each processor has four cores.
First, we consider adding spare processors in following ways: (a) 4-way processor-sharing: any
spare processor is used to replace any defective processor among the original four processors, (b)
2-way processor-sharing: the four original processors are partitioned into two clusters, where
each cluster has two processors, and spare processors are added to each cluster in a manner
where a spare processor can replace any defective processor within its cluster, and (c) 1-way
processor-sharing: spare processors are added to each processor so that any spare processor can
be used if the corresponding original one is defective. In addition, we consider adding spare
cores in following ways: (d) 16-way core-sharing: any spare core is used to replace any defective
core among the total sixteen original cores in four processors, (e) 8-way core-sharing: the four
original processors are partitioned into two groups, and spare cores are added to each group in a
manner where a spare core can replace any defective original core within its group, (f) 4-way
core-sharing: spare cores are added to each processor in a manner where a spare core can replace
any defective core within its processor, (g) 2-way core-sharing: the four original cores within one
processor are partitioned into two groups, where each group has two cores, and spare cores are
added to each group in a manner where a spare core can replace any defective core within its
group, and (h) 1-way core-sharing: spare cores are added in a manner that they can only be used
to replace a specific core (this option is not shown in Figure 2).
For clarity, in the rest of this article, we use the term domain to represent the scope of sharing
for spare processors and cores at different granularities, i.e., for clusters of processors as well as
groups of cores, depending on whether we are adding spare processors or spare cores.
Finally, as to the appropriate number of spare processors (cores) in each category, we use a
branch and bound algorithm to carry out this enumeration in a manner that identifies the optimal
configuration at low computational complexity.
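To make the enumeration concrete, the sketch below lists spare configurations for the 4-processor, 4-cores-per-processor example of Figure 2 as combinations of (level, scope, count) choices. It is a simplified, exhaustive illustration under an assumed bound on the spare counts; the dissertation's tool prunes this space with branch and bound rather than enumerating it blindly.

```python
from itertools import product

# Scopes of sharing for the example of Figure 2: processor-level spares can be
# 1-, 2-, or 4-way shared; core-level spares can be 1-, 2-, 4-, 8-, or 16-way shared.
PROCESSOR_SCOPES = (1, 2, 4)
CORE_SCOPES = (1, 2, 4, 8, 16)

def enumerate_spare_configurations(max_spares_per_scope: int = 2):
    """Yield each spare configuration as a dict mapping
    ('processor' | 'core', scope) -> number of spares added at that scope."""
    scopes = [("processor", s) for s in PROCESSOR_SCOPES] + \
             [("core", s) for s in CORE_SCOPES]
    for counts in product(range(max_spares_per_scope + 1), repeat=len(scopes)):
        yield dict(zip(scopes, counts))

# Each enumerated configuration is then passed to the floorplan/area model and
# the Monte Carlo yield estimator to evaluate its yield per area.
```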
[Figure panels: (a) 4-way processor, (b) 2-way processor, (c) 1-way processor, (d) 16-way core, (e) 8-way core, (f) 4-way core, (g) 2-way core. Legend: processor, core, spare cores, spare processors.]
Figure 2. Sharing of spare processors (cores).
[Figure panels: (a) Zero spare cores. (b) One 4-way spare core for each processor. (c) One 4-way spare core for each processor and two 8-way spare cores for two processors. Labels: 1 – area of spare modules; 2 – wasted area; 3.1 – impact on demux width; 3.2 – impact on demux height; 3.3 – impact on crossbar. Other annotations mark the 4-way spares demux, the 8-way spares crossbar switch, existing and additional repeaters, and the floorplan dimensions ($w_p$, $l_p$, $w_{ca}$, $l_{ca}$, $w_{L1C}$, $l_{L1C}$, $w_{L1CR}$, $l_{L1CR}$), with $w_p = w_{ca} = w_{L1C} = w_{L1CR}$ in the original design.]
Figure 3. Floorplans for different spare configurations for two 4-core processors.
1.5.3. Floorplan for each configuration
Figure 3(a) shows the original floorplan for two processors, where cores are placed in arrays
and the cache module is lined up on one side of the core array inside each processor [25]. Using
this structure, we are able to capture the physical characteristics at different levels of granularity
for a broad range of architectures [45][52][65]. In particular, we are able to calculate the overall
area overheads of adding spare processors (cores) for these architectures.
In this work we use crossbar as an example of interconnect between cores and L1 cache [26].
We assume that address and data buses are routed over cores and cache modules [62]. In the
original CMP design (i.e., the design with no spare cores), wiring area of buses is typically less
than core and cache area and hence usually buses do not determine chip area. The only concern
for buses routed over cores and cache modules is when such cores and cache modules need to be
expanded to accommodate additional repeaters and latches required on wires that must be
elongated due to spare cores, or accommodate the increase in bus widths.
As spare cores are added, the total CMP area is increased by the area of spare cores and also
the area due to additional repeaters and latches. Furthermore, we must consider area overheads
from the following three aspects, which are illustrated using an example of two 4-core processors,
as shown in Figure 3(b) and (c).
(1) For a regular floorplan, adding spare cores may result in wasted area. In the scenario
shown in Figure 3(b), there is a wasted area equivalent to one core per processor when one
spare core is added to each processor (which is labeled as ➁ in the figure).
(2) Spare cores have an impact on interconnects. We use a crossbar as an example of the interconnect between cores and L1 cache. We also use de-multiplexers between the register file and cores. Consider a 2×2 core array and a dedicated register file which has four banks. If $i$ 4-way shared spare cores are incorporated into each 2×2 core array, each register bank is capable of communicating with $(i+1)$ cores. After reconfiguration, each register bank is connected to one out of these $(1+i)$ cores. Hence, we need four 1-to-$(1+i)$ de-multiplexers for the four register banks. We also need to increase the crossbar width by $i$. Synthesis results from Synopsys Design Compiler show that area overheads increase with $i$, e.g., the 1-to-4 de-multiplexers incur much larger area than the 1-to-2 de-multiplexers, which is shown in Figure 3(b) and (c). Moreover, additional repeaters may be required for wires whose lengths need to be increased. Our crossbar model considers the area overhead of these additional repeaters due to spare cores as part of $w_{L1CR}$ and $l_{L1CR}$, which is illustrated in Figure 3(b) and (c).
(3) The total width of wires routed over a column of cores is calculated in Equation (1),

$W = n_{rc} \cdot n_{bus} \cdot P_{m1} \cdot (1 + N_{sc})$,  (1)

where $n_{rc}$ is the number of rows in the core array, $n_{bus}$ is the width of the bus in bits for each core in the original design, $P_{m1}$ is the pitch value for Metal 1, and $N_{sc}$ is the number of spare cores that can be used to replace a particular original core, e.g., $N_{sc} = i$ if we only add $i$ 4-way spare cores to a 4-core array; the term "1" captures the original core. For $N_{sc}$ values for which $W$ is large and exceeds the width of a core, the bus width dominates the processor area and the cores placed underneath will have to be moved further apart in columns to fit the increasing bus width. Such an increase in the area overhead is equivalent to an increase in the width of the de-multiplexer for each column of cores. We represent such overhead by increasing the width of the de-multiplexers, which is shown as label 3.3 in Figure 3(c). (A numeric check of when $W$ exceeds the core width, using the values in Table 2, is sketched below.)
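The following sketch evaluates Equation (1) with the parameter values of Table 2 to see at which $N_{sc}$ the bus width starts to exceed the core width; it is a rough sanity check written for this text, not part of the dissertation's tool.

```python
# Parameter values taken from Table 2 (lengths in mm).
n_rc  = 4        # number of rows of cores
n_bus = 96       # bus width per core (bits)
p_m1  = 90e-6    # Metal-1 pitch: 90 nm expressed in mm
w_c   = 3.0 / 8  # core width = w_ca / n_cc

for n_sc in range(0, 13):
    w_bus = n_rc * n_bus * p_m1 * (1 + n_sc)   # Equation (1)
    if w_bus > w_c:
        print(f"N_sc = {n_sc}: bus width {w_bus:.3f} mm exceeds core width {w_c:.3f} mm")
```

With these values, the computed bus width first exceeds the core width at $N_{sc} = 10$.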
Figure 4 shows the floorplan of a CMP, where processors are placed in the same manner as
they are for Nvidia Fermi architecture [1]. Based on Figure 3 and Figure 4, area overheads of
designs with and without spare cores can be computed using Equations (2)-(5), which are also
key intermediate results of our tool. Symbols with primes (e.g., $l_{CMP}'$ and $w_{CMP}'$) indicate the corresponding values after spare cores are added. Specific values of these parameters used in our experiments are listed in Table 2.
[Figure: floorplan of a CMP with a 4×4 array of processors, an L2 cache, and two L2 crossbars; the annotated dimensions ($w_{CMP}$, $l_{CMP}$, $w_p$, $l_p$, $w_{L2C}$, $l_{L2C}$, $w_{L2CR}$) are the ones used in Equations (2)-(5).]
Figure 4. CMP floorplan and area calculation for spare cores approach.
$\dfrac{l_{CMP}'}{l_{CMP}} = \dfrac{l_p'}{l_p} = \dfrac{h_{reg} + h_{demux} + l_{ca} + l_{L1CR} \cdot \left(1 + \dfrac{N_{sc}}{n_c}\right) + l_{L1C}}{h_{reg} + l_{ca} + l_{L1CR} + l_{L1C}}$  (2)

$\dfrac{w_p'}{w_p} = \dfrac{\max\!\left(\left(\left\lceil \dfrac{n_{sc}}{n_{rc}} \right\rceil + n_{cc}\right) \cdot \max\!\big(w_c,\; (1 + N_{sc}) \cdot n_{bus} \cdot P_{m1} \cdot n_{rc}\big),\; w_{L1CR}'\right)}{w_{ca}}$  (3)

$\dfrac{w_{CMP}'}{w_{CMP}} = \dfrac{w_{L2C} + 2 \cdot w_{L2CR} + w_p' \cdot n_{cp}}{w_{L2C} + 2 \cdot w_{L2CR} + w_p \cdot n_{cp}}$  (4)

$\dfrac{A_{CMP}'}{A_{CMP}} = \dfrac{w_{CMP}'}{w_{CMP}} \cdot \dfrac{l_{CMP}'}{l_{CMP}}$  (5)
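A compact sketch of the area model of Equations (2)-(5) is given below; it is a first-order calculation using the parameter values from Table 2 and treating the demux height and the widened crossbar width as inputs (they come from the synthesized demux/crossbar models and depend on the spare configuration). The function is illustrative, not the dissertation's implementation.

```python
import math

def cmp_area_scaling(n_sc_per_proc, N_sc, h_demux, w_l1cr_new,
                     # parameter values from Table 2 (all lengths in mm)
                     h_reg=0.7, l_ca=2.9, l_l1cr=0.7, l_l1c=0.7,
                     n_c=32, n_rc=4, n_cc=8, w_ca=3.0, w_p=3.0,
                     n_bus=96, p_m1=90e-6, w_l2c=23.0, w_l2cr=23.0, n_cp=4):
    """First-order CMP area scaling A'_CMP / A_CMP from Equations (2)-(5).
    n_sc_per_proc: spare cores added per processor (n_sc in Equation (3));
    N_sc: spare cores usable by any one core; h_demux, w_l1cr_new: demux height
    and widened L1 crossbar width for this configuration (assumed inputs)."""
    w_c = w_ca / n_cc
    # Equation (2): processor (and CMP) length scaling
    l_ratio = (h_reg + h_demux + l_ca + l_l1cr * (1 + N_sc / n_c) + l_l1c) / \
              (h_reg + l_ca + l_l1cr + l_l1c)
    # Equation (3): processor width scaling
    w_p_ratio = max((math.ceil(n_sc_per_proc / n_rc) + n_cc) *
                    max(w_c, (1 + N_sc) * n_bus * p_m1 * n_rc),
                    w_l1cr_new) / w_ca
    # Equation (4): CMP width scaling
    w_cmp_ratio = (w_l2c + 2 * w_l2cr + w_p_ratio * w_p * n_cp) / \
                  (w_l2c + 2 * w_l2cr + w_p * n_cp)
    # Equation (5): CMP area scaling
    return l_ratio * w_cmp_ratio
```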
1.5.4. Performance degradation
In this section, we compute delay overheads to estimate the performance degradation within
processors due to spare cores. (We do not tackle the timing closure issue since re-design is a subject of our future research.) A complete data flow includes three paths within each processor (see Figure 3(a)): the path from L1 cache to register file ($P_1$), the path from register file to cores ($P_2$), and the path from cores to L1 cache ($P_3$).
Adding spare cores requires inserting de-multiplexers on $P_2$ and $P_3$, which incurs delay overhead. Such delay overhead is a function of $N_{sc}$, since it directly decides the size of the de-multiplexers. $N_{sc}$ is determined by a given spare configuration. Also, the critical path in a crossbar network is a multiplexer. Hence, in our work we use the delay overhead of $(1 + N_{sc})$-input multiplexers to estimate the delay of the entire crossbar network. We synthesize de-multiplexers and multiplexers using Synopsys Design Compiler [83] and the NCSU 45nm library [63]. Figure 5 shows delay overheads of de-multiplexers and multiplexers for different $N_{sc}$. Delay overheads are shown as percentages of the clock period (1.25 ns) of the GeForce 400 Series (GPUs from Nvidia) [28]. For example, for $N_{sc} = 8$, i.e., when 8 spare cores can be used to replace each defective core, the delay overhead of the de-multiplexer is close to 9% of the nominal clock period (1.25 ns).
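As a small illustration of how Figure 5 is produced, the snippet below normalizes synthesized mux/demux delays to the 1.25 ns nominal clock period. The delay values here are hypothetical placeholders (the actual numbers come from the Design Compiler synthesis runs); only the normalization step is the point.

```python
CLOCK_PERIOD_NS = 1.25  # nominal clock period of the GeForce 400 Series

def normalized_overhead(delay_ns: float) -> float:
    """Delay overhead expressed as a fraction of the nominal clock period."""
    return delay_ns / CLOCK_PERIOD_NS

# Hypothetical synthesized demux delays (ns) for different N_sc values.
demux_delay_ns = {1: 0.02, 2: 0.04, 4: 0.07, 8: 0.11}
for n_sc, d in demux_delay_ns.items():
    print(f"N_sc = {n_sc}: {100 * normalized_overhead(d):.1f}% of the clock period")
```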
Table 2. Symbols and values.
Category          | Symbol   | Definition                                        | Value
Fabrication level | T        | Technology (nm)                                   | 45/32/22
Fabrication level | D_0·κ    | Failure rate (/mm^2)                              | 0.05~0.3
Fabrication level | α        | Clustering factor                                 | 0.3
Design rules      | P_m1     | Pitch of Metal 1-3 (nm)                           | 90/112/160
Circuit level     | s_rep    | Spacing of repeaters (mm)                         | 1.6
Circuit level     | h_rep    | Repeater height (mm)                              | 0.015
Circuit level     | w_rep    | Repeater width (mm)                               | 0.016
Circuit level     | n_a_bus  | Address bus width (bits)                          | 32
Circuit level     | n_w_bus  | Write bus width (bits)                            | 32
Circuit level     | n_r_bus  | Read bus width (bits)                             | 32
Circuit level     | n_bus    | n_a_bus + n_w_bus + n_r_bus                       | 96
Core level        | n_c      | No. of cores per processor                        | 32
Core level        | n_rc     | No. of rows of cores                              | 4
Core level        | n_cc     | No. of columns of cores                           | 8
Core level        | w_p      | Processor width (mm)                              | 3
Core level        | l_p      | Processor length (mm)                             | 5
Core level        | w_ca     | Width of core array (mm)                          | 3
Core level        | l_ca     | Length of core array (mm)                         | 2.9
Core level        | w_c      | Width of each core = w_ca / n_cc                  | -
Core level        | l_c      | Length of each core = l_ca / n_rc                 | -
Core level        | w_L1C    | Width of L1 cache (mm)                            | 3
Core level        | l_L1C    | Length of L1 cache (mm)                           | 0.7
Core level        | w_L1CR   | Width of L1 crossbar (mm)                         | 3
Core level        | l_L1CR   | Length of L1 crossbar (mm)                        | 0.7
Core level        | h_reg    | Height of register file (mm)                      | 0.7
Core level        | h_demux  | Height of demux (mm)                              | -
Core level        | A_c      | Area of a core                                    | -
Core level        | Y_c      | Yield of a core                                   | -
Core level        | Y_int-c  | Yield of core-level interconnects                 | -
Core level        | Y_L1     | Yield of L1 cache                                 | 1
Processor level   | n_p      | No. of processors per CMP                         | 16
Processor level   | n_rp     | No. of rows of processors                         | 4
Processor level   | n_cp     | No. of columns of processors                      | 4
Processor level   | w_CMP    | Width of CMP (mm)                                 | 23
Processor level   | l_CMP    | Length of CMP (mm)                                | 24
Processor level   | w_L2C    | Width of L2 cache (mm)                            | 23
Processor level   | l_L2C    | Length of L2 cache (mm)                           | 6
Processor level   | w_L2CR   | Width of L2 crossbar (mm)                         | 23
Processor level   | l_L2CR   | Length of L2 crossbar (mm)                        | 3
Processor level   | A_p      | Area of a processor                               | -
Processor level   | Y_p      | Yield of a processor                              | -
Processor level   | Y_int-p  | Yield of processor-level interconnects            | -
Processor level   | Y_L2     | Yield of L2 cache                                 | 1
Other notations   | A_CMP    | CMP area with spare processors (cores)            | -
Other notations   | n_sc,k   | No. of spare cores shared by k cores              | -
Other notations   | N_sc     | No. of spare cores available for one core         | -
Other notations   | ENSC_x   | Definition of N_sc using scope of sharing         | -
Other notations   | n_sp     | No. of spare processors                           | Σ_k n_sp,k
Other notations   | n_sc     | No. of spare cores                                | Σ_k n_sc,k
Other notations   | m_p      | No. of fabricated processors                      | n_p + n_sp
Other notations   | m_c      | No. of fabricated cores                           | n_p·n_c + n_sc
Other notations   | d        | A defective core                                  | -
Other notations   | s        | A spare core                                      | -
Other notations   | j        | No. of defective cores                            | -
Other notations   | P_j      | Probability of j defective cores                  | -
Other notations   | Q_j      | Probability that a CMP with j defective cores is functional | -
Other notations   | Y        | Yield of a CMP                                    | -
Other notations   | S_k^t    | Spare core set                                    | -
Other notations   | O_k^t    | Domain of spare cores in S_k^t                    | -
Other notations   | S_k^*    | Union of S_k^t for each k                         | -
Figure 5. Delay overheads of Demux and mux. (The y-axis of the figure shows the additional delay as a percentage of the clock period.)
1.5.5. Yield Computation for CMPs
We first show the yield model that is used in this work. Then we use a Monte Carlo approach to
estimate yield for each spare configuration. In the end, we show a heuristic to speed up the
overall simulation time by trading off memory usage for simulation time.
1.5.5.1. Yield estimation
There are two types of defects: systematic defects, which are usually caused during
lithography process, and random defects, which are randomly distributed due to particle
contamination [68]. In this work, we consider random defects for simplicity, and use the small-
area negative-binomial yield model to capture such defects [12]. In large chips, defect clusters
appear to be of uniform size but are much smaller than the chip area. In this case, instead of
viewing the chip as one entity for statistical purposes, we can view it as comprising independent
regions. In this work, a region corresponds to a core. Yield of a core is computed in Equation (6),
where $D_0$ is the defect density measured in number of defects per $cm^2$, $A_c$ is the area of the core
in $cm^2$, $\kappa$ is the kill ratio, and $\alpha$ is the defect clustering factor [70]. Specifically, $\kappa$ models the interaction between the defect size and the layout feature size and increases as the ratio of defect size to the layout feature size increases. Also, $\kappa$ is dependent on particular circuit types since different circuit types are built using different metal layers and layout features.

$Y_c = \left(1 + \dfrac{D_0 \cdot A_c \cdot \kappa}{\alpha}\right)^{-\alpha}$  (6)
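A direct transcription of Equation (6) is shown below, with an example evaluation; the core area in the example is derived from the Table 2 dimensions ($w_c = w_{ca}/n_{cc}$, $l_c = l_{ca}/n_{rc}$), and the failure rate $D_0 \cdot \kappa = 0.2/mm^2$ matches the setting of Figure 7. This is a minimal sketch, not the dissertation's yield tool.

```python
def core_yield(d0: float, a_c: float, kappa: float, alpha: float = 0.3) -> float:
    """Small-area negative-binomial yield model, Equation (6):
    Y_c = (1 + D0 * A_c * kappa / alpha) ** (-alpha).
    d0: defect density (defects per unit area), a_c: core area (same area unit),
    kappa: kill ratio, alpha: clustering factor (0.3 in Table 2)."""
    return (1.0 + d0 * a_c * kappa / alpha) ** (-alpha)

# Example: core of 0.375 mm x 0.725 mm (about 0.272 mm^2), with D0 * kappa = 0.2 /mm^2.
print(core_yield(d0=0.2, a_c=0.272, kappa=1.0))   # roughly 0.95
```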
As defects in cores and interconnects are assumed to be independent, the yield of a processor is computed by multiplying the yield of its cores, the yield of the L1 cache, and the yield of the interconnects between cores and L1 cache. Similarly, CMP yield is computed by multiplying the yield of the processors, the yield of the L2 cache, and the yield of the interconnects between processors and L2 cache. The yield of the CMP is obtained using Equation (7), where $Y_{L2}$ is the yield of the L2 cache and $Y_{int-p}$ is the yield of the interconnects between processors and L2 cache, both of which can be computed using the negative-binomial yield model. $Q_j$ is the conditional probability that a CMP is functional (either with or without repair) given $j$ defective cores for a given spare configuration. $Q_j$ also accounts for defects in the interconnects between cores and L1 cache within each processor by counting a processor as working only if it has at least $n_c$ working cores, a working L1 cache, and the corresponding working interconnects. For simplicity, the yields of the L1 and L2 caches are assumed to be 1, i.e., $Y_{L1} = Y_{L2} = 1$, and we focus on the functional logic blocks (i.e., cores) in this work [82]. The number of defective cores ($j$) is a binomial random variable, and the probability of $j$ cores being defective ($P_j$) is computed using Equation (8), where $m_c$ is the total number of cores fabricated on a CMP, including the originally designed cores ($n_p n_c$) and the spare cores ($n_{sc}$), and $Y_c$ is the probability of each core being functional.
$Y = Y_{L2} \cdot Y_{int-p} \cdot \sum_j Q_j \cdot P_j$   (7)

$P_j = \binom{m_c}{j} (1 - Y_c)^j \cdot Y_c^{\,m_c - j}$   (8)
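A minimal Python sketch of Equations (6)-(8) may make the computation concrete. The numeric values below are illustrative only, and the array q of Q_j values is assumed to come from the Monte Carlo procedure described in the next subsection.

from math import comb

def core_yield(d0_kappa, area_core, alpha):
    """Small-area negative-binomial yield of one core, Equation (6)."""
    return (1.0 + d0_kappa * area_core / alpha) ** (-alpha)

def prob_j_defective(j, m_c, y_c):
    """Binomial probability of exactly j defective cores, Equation (8)."""
    return comb(m_c, j) * (1.0 - y_c) ** j * y_c ** (m_c - j)

def cmp_yield(m_c, y_c, q, y_l2=1.0, y_int_p=1.0):
    """CMP yield, Equation (7); q is a list of length m_c + 1 holding Q_j."""
    return y_l2 * y_int_p * sum(q[j] * prob_j_defective(j, m_c, y_c)
                                for j in range(m_c + 1))

# Illustrative (not from the experiments): D0*kappa = 0.2/mm^2, A_c = 0.7 mm^2, alpha = 0.3.
y_c = core_yield(0.2, 0.7, 0.3)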
1.5.5.2. Monte Carlo approach to compute $Q_j$
Unlike traditional redundancy approaches, it is difficult to capture $Q_j$ in Equation (7) in a closed-form expression for our approach. We therefore use a Monte Carlo (MC) approach [13] to generate 10,000 copies of the given CMP design, where each copy has $j$ defective cores. A processor on a CMP copy is functional if it has the required number of working cores, a working L1 cache, and the corresponding working interconnects. A CMP is functional if it has the required number of working processors, a working L2 cache, and the corresponding working interconnects. We use Equation (9) to compute $Q_j$, where $I_{i,j}$ is an indicator variable obtained from the repair process of copy $i$ of the CMP in the presence of $j$ defective cores. $I_{i,j} = 1$ indicates that chip $i$ can be sold as a good chip, either because it is defect-free or because it is repairable, and $I_{i,j} = 0$ indicates that the chip has to be discarded because it is not repairable.
$Q_j = \frac{1}{10{,}000}\sum_{i=1}^{10{,}000} I_{i,j}$   (9)
Since yield is high for CMPs with a small number of defective cores and low for CMPs with a large number of defective cores, importance sampling is used in the Monte Carlo simulations, which greatly reduces the simulation time and enables us to explore the complete design space for spare cores.
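The following sketch, under the assumption that a user-supplied repair routine (try_repair, a hypothetical wrapper around the greedy repair algorithm of Section 1.6.2) is available, shows how Equation (9) can be estimated; the importance-sampling refinement is omitted for brevity.

import random

def estimate_q_j(j, core_ids, try_repair, copies=10_000, rng=random):
    """Monte Carlo estimate of Q_j, Equation (9).

    core_ids   -- list of all fabricated core IDs (original and spare)
    try_repair -- callable returning True if a CMP with the given set of
                  defective cores can be sold (defect-free or repairable)
    """
    good = 0
    for _ in range(copies):
        defective = set(rng.sample(core_ids, j))   # place j defects uniformly
        if try_repair(defective):
            good += 1
    return good / copies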
Our framework explores the optimal combination of spare configurations to maximize yield per area within a broad range of ways of adding spare cores to a given CMP design, while satisfying constraints on (a) packaging, e.g., die size and aspect ratio, and (b) performance, e.g., critical-path latency and bandwidth. Test cost, another important aspect that depends on the spare-core configuration, will be integrated into the evaluation metric in our future work. The impact of the added area on power dissipation, and so on, will also be part of our future work. However, we do assume that unused processors and cores are powered down to reduce such impacts.
1.5.5.3. Reuse of $Q_j$ from Monte Carlo Simulations
Different defect densities result in different yield profiles. However, the percentage of good chips for a given number of defects ($Q_j$) is constant for a given spare configuration. Hence, when we obtain this percentage from Monte Carlo simulations for the first time, we store the value. When $Q_j$ is later needed for a different defect density but the same spare configuration, we reuse the stored value instead of running the Monte Carlo simulations again. In other words, we trade a small amount of memory for overall Monte Carlo simulation time. This is critical, as we have observed in our experiments that Monte Carlo simulations consume a substantial amount of time, especially in scenarios where each CMP consists of hundreds of processors.
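A minimal caching sketch of this reuse is shown below; the spare configuration is assumed to be represented by a hashable key such as the tuple notation introduced in Section 1.6.1, and estimator stands in for the Monte Carlo routine.

_q_cache = {}

def q_j_cached(spare_config, j, estimator):
    """Reuse Q_j across defect densities: Q_j depends only on the spare
    configuration and j, not on D0*kappa, so it is computed once and stored."""
    key = (spare_config, j)
    if key not in _q_cache:
        _q_cache[key] = estimator(spare_config, j)
    return _q_cache[key]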
Newly Proposed Spare Configurations 1.6.
We propose two spare-sharing techniques to explore the optimal spare configuration for a given CMP design while considering a detailed area model. Specifically, we propose spare processors sharing (SPS), where spare processors are added in a combination of the ways exemplified by Figure 2(a), (b), and (c), and spare cores sharing (SCS), where spare cores are added in a combination of the ways exemplified by Figure 2(d), (e), (f), and (g). We then present our greedy repair strategy for spare configurations and prove its optimality.
1.6.1. Proposed spare configurations
Consider the example of an original CMP with $n_p$ multi-core processors, where each processor has $n_c$ cores. Both $n_p$ and $n_c$ are powers of 2. At the processor level, i.e., SPS, we incorporate into the CMP $n_{sp,1}$ 1-way spare processors, $n_{sp,2}$ 2-way spare processors, $n_{sp,4}$ 4-way spare processors, and so on. Examples of such cases are shown in Figure 2(a), (b), and (c) for $n_p = 4$.
At the core level, i.e., SCS, we consider two scenarios. In the first scenario, we keep the integrity of a processor, which has $n_c$ cores, and incorporate into the CMP $n_{sc,n_c}$ $n_c$-way spare cores, $n_{sc,2n_c}$ $2n_c$-way spare cores, and so on. In general, we consider $n_{sc,k}$ $k$-way spare cores, where $k$ is chosen to be a power of two and $k \in \{n_p n_c, n_p n_c/2, \ldots, 2n_c, n_c\}$. Due to symmetry, the $n_{sc,k}$ $k$-way spare cores are distributed equally among $n_p n_c / k$ domains, that is, $n_{sc,k} \cdot \frac{k}{n_p n_c}$ spare cores are added to each domain to replace defective cores (if any) within the domain. Examples of such cases are shown in Figure 2(d), (e), and (f). In the second scenario, we extend this to cases where $k$ is less than $n_c$, the number of cores per processor. In this scenario, the cores in each processor are divided into equal-size domains, with each domain comprising $k$ original cores, where $k = n_c/2, n_c/4, \ldots, 1$. An example of this is shown in Figure 2(g). Summing up the two scenarios, $n_{sc,k} \cdot \frac{k}{n_p n_c}$ spare cores are added to each domain to replace defective cores within the domain, where $k \in \{n_p n_c, n_p n_c/2, \ldots, 2n_c, n_c, n_c/2, \ldots, 2, 1\}$.
As spare cores are shared by original cores at different scopes, we use a formal definition, $ENSC_x$, to represent $N_{sc}$ for original core $x$ in terms of the scope of sharing, as follows.
Definition 1. The Effective Number of Spare Cores that one original core $x$ can use is denoted by $ENSC_x$, which is computed as:
$ENSC_x = n_{sc,n_p n_c} + n_{sc,n_p n_c/2} \cdot \frac{1}{2} + \cdots + n_{sc,2} \cdot \frac{2}{n_p n_c} + n_{sc,1} \cdot \frac{1}{n_p n_c}$.   (10)
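A minimal sketch of Equation (10): each $k$-way spare contributes $k / (n_p n_c)$ to the effective count of spares available to any single original core. The example values are the configuration used later in Section 1.6.1.

def ensc(spare_counts, n_p, n_c):
    """Effective number of spare cores per original core, Equation (10).

    spare_counts -- dict mapping scope k (a power of two) to n_sc,k
    """
    total_cores = n_p * n_c
    return sum(n_k * k / total_cores for k, n_k in spare_counts.items())

# Example configuration (16, 8; n_sc,8 = 32, n_sc,4 = 32):
print(ensc({8: 32, 4: 32}, n_p=16, n_c=8))  # 2.0 + 1.0 = 3.0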
We describe spare configurations using the following notation: $(n_p, n_c;\ n_{sc,n_p n_c}, n_{sc,n_p n_c/2}, \ldots, n_{sc,n_c}, n_{sc,n_c/2}, n_{sc,n_c/4}, \ldots, n_{sc,1})$. The traditional approach, sharing spare cores within processors, is a special case of SCS and can be denoted as $(n_p, n_c;\ 0, \ldots, 0, n_{sc,n_c}, 0, \ldots, 0)$. For example, for a chip where $n_p = 16$, $n_c = 8$, $n_{sc,8} = 32$, and $n_{sc,4} = 32$, we have a CMP (i) with 16 processors, where each processor has 8 cores, (ii) where two spare cores are added to and shared by the 8 cores within each processor (8-way spare cores sharing), and (iii) where the 8 cores inside each processor are partitioned into two domains of 4, with one spare core added to each of the resulting 32 domains (4-way spare cores sharing).
For convenience of theoretical modeling and to limit the computational complexity of the Monte Carlo simulations, our previous work heuristically considered only two types of spare cores: some spare cores are $n_c$-way spare cores, and the remaining spare cores are all $k$-way spare cores, where $k \neq n_c$ [18]. In this work, we extend the experiments to exhaust the complete design space, i.e., all combinations of spare cores with all possible scopes of sharing. A branch-and-bound heuristic has been implemented to greatly reduce the computational complexity of this new complete search.
1.6.2. Repair strategy for CMPs with defective cores
We first propose a greedy repair algorithm. Then we prove the optimality of this algorithm.
1.6.2.1. Greedy repair algorithm
Given a specific spare configuration of the CMP and the locations of defects obtained from post-fabrication testing, we repair defective CMPs by replacing defective cores with non-defective spare cores using the greedy repair algorithm shown as Algorithm 1. Specifically, we choose spare cores with the smallest possible scope of sharing that is available (see Lemma 1 and Lemma 2 below). Defective cores that have been replaced are powered off for the entire life cycle of the CMP and have no impact on other functional cores or caches.
First we assign a unique ID to each original core and define the original core set $C = \{1, 2, 3, 4, \ldots, n_p n_c\}$. Then we assign unique IDs to the spare cores and define the spare core set $S = \{n_p n_c + 1, n_p n_c + 2, \ldots, m_c\}$, where $m_c = n_p n_c + n_{sc} = n_p n_c + \sum_k n_{sc,k}$. We define the spare core set $S_k^t$ as the set of spare cores that can replace the same defective cores and have the same scope of sharing (i.e., $k$). We use $t$ as the index for each set, where $t \in [1, n_p n_c / k]$. We also use the domain of spare cores $O_k^t$ to represent the set of original cores each of which can be replaced by any spare core in $S_k^t$. For a given spare configuration, we have a unique collection of spare core sets $S_k^t$ and corresponding domains $O_k^t$. For example, Table 3 shows $S_k^t$ and $O_k^t$ for different $k$ and $t$ for the spare configuration $(4, 4;\ n_{sc,16} = 4, n_{sc,8} = 4, n_{sc,4} = 8, n_{sc,2} = 8)$. We use a set with an asterisk superscript to represent the union of all sets that have the same scope of sharing, e.g., $S_{16}^* = \cup_t S_{16}^t$. We also have $S = \cup_k S_k^*$. The cardinality of each union set with a specific scope of sharing equals the total number of spare cores with that scope of sharing, i.e., $|S_{16}^*| = n_{sc,16}$, $|S_8^*| = n_{sc,8}$, and so on. If an original core is defective, it can be replaced by a spare core from spare core sets with different scopes of sharing. Figure 6 shows such an example, where defective core 12 can be replaced by any spare core from $S_{16}^1$, $S_8^2$, $S_4^3$, or $S_2^6$.
Algorithm 1. Greedy Repair Algorithm
1. Initialize
2.   Original core set: $C = \{1, 2, 3, 4, \ldots, n_p n_c\}$;
3.   Defective original core set: $D_c \subset C$;
4.   Defective spare core set: $D_s \subset$ {IDs between $n_p n_c + 1$ and $m_c$};
5.   Spare core set $S = \{n_p n_c + 1, \ldots, m_c\} \setminus D_s$, where $m_c = n_p n_c + n_{sc} = n_p n_c + \sum_k n_{sc,k}$;
6.   Spare core sets $S_k^t$, built by assigning the spare IDs $n_p n_c + \sum_{i<k} n_{sc,i} + \zeta$, with $\zeta \in [1, n_{sc,k}]$, to the domains indexed by $t \in [1, n_p n_c / k]$;
7.   Spare core sets $S_k^* = \cup_t S_k^t$, where $k \in \{n_p n_c, n_p n_c/2, \ldots, 2, 1\}$;
8.   Original core set that can be repaired by cores in $S_k^t$: $O_k^t$, containing the core IDs $\zeta \in [kt - k + 1, kt]$, for $t \in [1, n_p n_c / k]$;
9. End initialize
10. If $D_c = \emptyset$
11.   Return 1;
12. Foreach $d \in D_c$                                  /* Work on one defective original core */
13.   $D_c = D_c \setminus d$;
14.   success = 0;
15.   For ($k = 1$; $k \le n_p n_c$; $k = k \times 2$)   /* Examine all scopes, smallest first */
16.     For ($t = 1$; $t \le n_p n_c / k$; $t$++)        /* Examine all domains of this scope */
17.       If ($d \in O_k^t$) and ($S_k^t \ne \emptyset$)
18.         $S_k^t = S_k^t \setminus s$, where $s$ is the spare core with the smallest ID;
19.         Print "Core $d$ is replaced by spare core $s$";
20.         success = 1;
21.         Break;
22.     If (success = 1)
23.       Break;
24.   If (success = 0)
25.     Return 0;
26. Return 1;
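A compact Python sketch of Algorithm 1 is given below, under the assumption that the sets $S_k^t$ and $O_k^t$ have already been constructed as in the initialization steps above.

def greedy_repair(defective_originals, spare_sets, domains):
    """Replace each defective original core with a spare from the smallest
    available scope of sharing.

    defective_originals -- iterable of defective original core IDs
    spare_sets          -- dict (k, t) -> set of usable spare core IDs, i.e. S_k^t
    domains             -- dict (k, t) -> set of original core IDs covered, i.e. O_k^t
                           (same keys as spare_sets)
    Returns True iff every defective original core could be replaced.
    """
    scopes = sorted({k for (k, _t) in spare_sets})           # smallest scope first
    for d in sorted(defective_originals):
        for k in scopes:
            usable = [key for key in spare_sets
                      if key[0] == k and d in domains[key] and spare_sets[key]]
            if usable:
                key = usable[0]
                spare_sets[key].remove(min(spare_sets[key]))  # spare with smallest ID
                break
        else:                                                 # no scope had a spare for d
            return False
    return True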
Table 3. Spare core sets for spare configuration $(4, 4;\ n_{sc,16} = 4, n_{sc,8} = 4, n_{sc,4} = 8, n_{sc,2} = 8)$.
k    t    Original cores O_k^t          Spare cores S_k^t
2    1    {1, 2}                        {17}
2    2    {3, 4}                        {18}
2    3    {5, 6}                        {19}
2    4    {7, 8}                        {20}
2    5    {9, 10}                       {21}
2    6    {11, 12}                      {22}
2    7    {13, 14}                      {23}
2    8    {15, 16}                      {24}
4    1    {1, 2, 3, 4}                  {25, 26}
4    2    {5, 6, 7, 8}                  {27, 28}
4    3    {9, 10, 11, 12}               {29, 30}
4    4    {13, 14, 15, 16}              {31, 32}
8    1    {1, 2, 3, ..., 8}             {33, 34}
8    2    {9, 10, 11, ..., 16}          {35, 36}
16   1    {1, 2, 3, 4, 5, ..., 16}      {37, 38, 39, 40}
Figure 6. Domains of the spare core sets $S_{16}^1$, $S_8^2$, $S_4^3$, and $S_2^6$, which all cover core 12.
1.6.2.2. Proof of optimality of the Greedy Repair Algorithm
We prove the optimality of choosing spare cores with the smallest possible scope of sharing in our greedy repair algorithm (Algorithm 1) using induction over $\gamma$, the number of repair steps that have been performed. Each associated spare core is taken as one repair option for a defective core. Consider the following two repair options for the first defective core $d$:
Option a: replace the defective core $d$ with a spare core $u$ from spare core set $S_k^*$, and
Option b: replace the defective core $d$ with a spare core $v$ from $S_{k'}^*$, where $k' > k$.
To prove the optimality of Algorithm 1, we derive Lemma 1 and Lemma 2.
Lemma 1: If both $u$ and $v$ can repair the same defective core, then the domain of $u$ is covered by that of $v$, or vice versa.
Lemma 2: For a given spare configuration, the choice of a spare core to repair a defective core is optimal if the residual $ENSC_x$ is maximal, where $x$ represents any remaining original core.
According to Lemma 1, if there is a $u \in S_k^i$ corresponding to defective core $d$, and a $v \in S_{k'}^{i'}$ corresponding to core $d$ for all $k' > k$, then $O_k^i \subset O_{k'}^{i'}$. For example, for defective core $d = 12$, if $u$ is spare core 22 (for $k = 2$), then $v$ must be either spare core 29 or spare core 30 (for $k' = 4$).
We use the notation $S_k^t(\gamma)$ to represent the state of $S_k^t$ after $\gamma$ steps of repair; $S_k^t(0)$ represents the initial state of $S_k^t$. Table 4 shows the remaining spare core sets after each of the two repair options is applied. According to Lemma 2, in order to prove that the choice of the spare core to repair the first defective core $d$ is optimal, we only need to prove that $ENSC_x$ for every remaining original core $x$ of the resulting spare configuration in Table 4(a) is greater than or equal to that in Table 4(b).
Table 4. Available spare core sets after the 1st step.
(a) Spare core $u$ is used.
$S_{16}^1(0)$
$S_8^1(0)$   $S_8^2(0)$
$S_4^1(0)$   $S_4^2(0)$   $S_4^3(0)$   $S_4^4(0)$
$S_2^1(0)$   $S_2^2(0)$   $S_2^3(0)$   $S_2^4(0)$   $S_2^5(0)$   $S_2^6(0) \setminus u$   $S_2^7(0)$   $S_2^8(0)$
(b) Spare core $v$ is used.
$S_{16}^1(0)$
$S_8^1(0)$   $S_8^2(0)$
$S_4^1(0)$   $S_4^2(0)$   $S_4^3(0) \setminus v$   $S_4^4(0)$
$S_2^1(0)$   $S_2^2(0)$   $S_2^3(0)$   $S_2^4(0)$   $S_2^5(0)$   $S_2^6(0)$   $S_2^7(0)$   $S_2^8(0)$
In general, assuming that spare core $v$ belongs to a set whose domain is larger than that of spare core $u$, i.e., $k' > k$, we can compare the residual $ENSC_x$ under the two options for each original core $x$ in the following three scenarios.
Scenario 1: $x$ is not in the domain of spare core $v$, and hence $x$ is also not in the domain of spare core $u$.
Then the residual $ENSC_x$ ($RENSC_x$) of the two options are equal. We use $\beta$ to represent this quantity, which is shown in Equation (11).
$\beta = n_{sc,n_p n_c} + n_{sc,n_p n_c/2} \cdot \frac{1}{2} + \cdots + n_{sc,2} \cdot \frac{2}{n_p n_c} + n_{sc,1} \cdot \frac{1}{n_p n_c}$   (11)
In this scenario, we have Equations (12) and (13). Hence $RENSC_x[\text{Option a}] = RENSC_x[\text{Option b}]$.
$RENSC_x[\text{Option a}] = \beta$   (12)
$RENSC_x[\text{Option b}] = \beta$   (13)
Scenario 2: $x$ is in the domain of spare core $v$, but $x$ is not in the domain of spare core $u$.
In this scenario, we have Equations (14) and (15). Hence $RENSC_x[\text{Option a}] > RENSC_x[\text{Option b}]$.
$RENSC_x[\text{Option a}] = \beta$   (14)
$RENSC_x[\text{Option b}] = \beta - 1$   (15)
Scenario 3: $x$ is in the domain of spare core $v$, and $x$ is also in the domain of spare core $u$.
In this scenario, we have Equations (16) and (17). Hence $RENSC_x[\text{Option a}] = RENSC_x[\text{Option b}]$.
$RENSC_x[\text{Option a}] = \beta - 1$   (16)
$RENSC_x[\text{Option b}] = \beta - 1$   (17)
As can be seen, $RENSC_x$ under Option a is always greater than or equal to that under Option b. So it is always at least as good to first use spare core $u$ rather than spare core $v$ to replace the first defective core $d$, given that $k' > k$. In other words, the first step of our greedy repair algorithm is optimal.
Next we prove the optimality of the greedy repair algorithm for the $(\gamma + 1)$-th step. (The first step is the special case $\gamma = 0$, whose optimality has already been proved.) Table 5 shows the available spare core sets after the $(\gamma + 1)$-th repair under the two repair options: (a) spare core $u$ is used, and (b) spare core $v$ is used. The same conclusions can be derived for the three scenarios described above. Hence we can prove by induction that our algorithm is optimal.
Table 5. Available spare core sets after $\gamma + 1$ steps.
(a) Spare core $u$ is used.
$S_{16}^1(\gamma)$
$S_8^1(\gamma)$   $S_8^2(\gamma)$
$S_4^1(\gamma)$   $S_4^2(\gamma)$   $S_4^3(\gamma)$   $S_4^4(\gamma)$
$S_2^1(\gamma)$   $S_2^2(\gamma)$   $S_2^3(\gamma)$   $S_2^4(\gamma)$   $S_2^5(\gamma)$   $S_2^6(\gamma) \setminus u$   $S_2^7(\gamma)$   $S_2^8(\gamma)$
(b) Spare core $v$ is used.
$S_{16}^1(\gamma)$
$S_8^1(\gamma)$   $S_8^2(\gamma)$
$S_4^1(\gamma)$   $S_4^2(\gamma)$   $S_4^3(\gamma) \setminus v$   $S_4^4(\gamma)$
$S_2^1(\gamma)$   $S_2^2(\gamma)$   $S_2^3(\gamma)$   $S_2^4(\gamma)$   $S_2^5(\gamma)$   $S_2^6(\gamma)$   $S_2^7(\gamma)$   $S_2^8(\gamma)$
Comparison of Proposed and Traditional Redundancy Techniques 1.7.
In this section, we compare the proposed and traditional redundancy techniques across different technologies in terms of the optimal yield per area. The obtained yield/area values for all approaches are normalized with respect to the value obtained in the ideal scenario, where the defect density is zero and no redundancy is added to the original design.
1.7.1. Experiment setup
According to the ITRS, defect density $D_0$ remains constant at 2,503/m$^2$ and 1,395/m$^2$ for DRAM and MPU, respectively [37]. We assume $D_0\kappa$ to be within the range 0.05-0.3/mm$^2$. The defect clustering factor $\alpha$ is set to 0.3. Other parameters such as wire pitch, repeater size, and maximum spacing between repeaters, which are listed in Table 2, are obtained from Intel technology specifications [62], previous studies [69], and the ITRS [37]. The NVIDIA Fermi GPU architecture is used in our experiments. Characteristics of the processors and the L2 cache are obtained from a study of NVIDIA GTX480 die photos [1]. Characteristics of components inside the processors, such as the L1 cache and the cores, are obtained from NVIDIA GTX480 die photos and verified using Cacti [14].
1.7.2. Different approaches for $D_0\kappa = 0.2$
We conduct three case studies: spare processors sharing, where each spare processor is shared globally (SPS-g); a version of spare cores sharing where every spare core is shared globally within a particular processor (SCS-gp); and the general spare cores sharing approach, including the types illustrated in Figure 2(d), (e), (f), and (g).
1.7.2.1. Spare processors sharing-global (SPS-g)
We consider a traditional approach (e.g., [70][45]), which is a special case of our solution space: adding spare processors with a global scope of sharing, i.e., spare processors are added to the CMP in a manner where any spare can replace any defective processor in the CMP.
Yield model. The yield of one processor is calculated by Equation (18), where $Y_c$ represents the yield of a core, computed in Equation (6), and $Y_{int-c}$ is the yield of the interconnects between cores and L1 cache. The yield of the CMP with $n_p$ original processors and $n_{sp}$ spare processors is calculated in Equation (19), where $Y_{int-p}$ is the yield of the interconnects between processors and L2 cache, and $m_p = n_p + n_{sp}$.
$Y_p = Y_{L1} \cdot Y_{int-c} \cdot (Y_c)^{n_c}$   (18)
$Y_{SPS\text{-}g} = Y_{L2} \cdot Y_{int-p} \cdot \sum_{i=n_p}^{m_p} \binom{m_p}{i} \cdot Y_p^i \cdot (1 - Y_p)^{m_p - i}$   (19)
Finally, the optimal $n_{sp}$ (i.e., $m_p - n_p$) is determined to maximize yield per area, i.e., $Y_{SPS\text{-}g} / A'_{CMP}$, where $A'_{CMP}$ is the area of the CMP design with spare processors.
Observations. As each processor has a large layout area, spare processors themselves have very low yield. Hence a large number of spare processors are required to achieve a given yield. Our previous experiments show that the maximum yield per area using SPS-g is obtained by adding 49 spare processors with a global domain. However, as the number of spare processors becomes large, sharing spare processors with a global domain causes substantial wiring overheads for reconfiguring the interconnects so that the spares can replace any defective processor. As a result, this method incurs high area overheads and hence an unacceptably low yield per area, especially compared to our spare cores approaches. We therefore exclude the spare processors approach from the comparisons in the following sections.
1.7.2.2. Spare cores sharing-global within processor (SCS-gp)
We consider another traditional approach (e.g., [88][70]), which is also a special case of our solution space: adding spare cores within processors with a global scope of sharing, i.e., spare cores are added in a manner such that any spare core can replace any defective core within the corresponding processor.
Yield model. The yield of each core is calculated using Equation (6). Each processor has $m_c / n_p$ cores, including $n_c$ original cores and $m_c / n_p - n_c$ spare cores, where $m_c = n_p n_c + n_{sc}$. The yield of the CMP is calculated using Equation (20), where $Y_c$ is the yield of a core, and the optimal $n_{sc}$ is determined to maximize yield per area, i.e., $Y_{SCS\text{-}gp} / A'_{CMP}$.
$Y_{SCS\text{-}gp} = Y_{int-p} \cdot \left[ Y_{int-c} \cdot \sum_{i=n_c}^{m_c/n_p} \binom{m_c/n_p}{i} Y_c^i (1 - Y_c)^{m_c/n_p - i} \right]^{n_p}$   (20)
Observations. Figure 7 shows the optimal spare configuration using SCS-gp for a CMP with a 4×8 core array per processor. In the optimal spare configuration, we add 5 spare cores to each processor and obtain the maximum yield per area value of 0.57. When the number of spare cores is small, the area overheads from interconnects are insignificant. However, when the number of spare cores grows beyond a certain point, the width of the interconnects running over each core exceeds the width of the core. As a result, the interconnect area begins to dominate the overall CMP area, and the overall CMP area increases linearly. In our case study, this linear area increase occurs when the number of spare cores per processor exceeds 4. At the same time the yield approaches 1 and does not improve much with additional spare cores. Hence yield per area starts decreasing beyond 5 spare cores per processor. If the wasted area and the impact of spare cores on interconnects were not taken into account, the optimal number of spare cores would be 8 according to our experiments; we would then end up with a non-optimal spare configuration and a lower yield per area.
Figure 7. Yield per area for SCS-gp with $D_0\kappa = 0.2$ (x-axis: number of spare cores per processor; y-axis: normalized yield, area, and yield/area).
1.7.2.3. Spare cores sharing (SCS)
Now we turn to the general case of spare cores sharing in search of higher yield per area. Yield calculation for the proposed approach has been illustrated in Section 1.5.5. In this section we present our observations from the experimental results.
Observations. Figure 8 shows the yield per area for three representative spare configurations for $D_0\kappa = 0.2$, where the x-axis shows the spare configuration IDs, i.e., #1, #2, and #3. These specific spare configurations are used here to illustrate an important trend. Table 6 shows
the spare configurations corresponding to these IDs. The three spare configurations are ordered such that the average scope of sharing (ASS) of the spare cores is ascending.
First, configuration #2 is superior to #1, as it has higher yield with approximately the same area overhead. Second, although yield may increase with the ASS of the spare cores, we observe that area may increase at a much higher rate. For example, area rises from 1.33 to 1.45 from spare configuration #2 to #3, whereas the yield increase is negligible. As a result, spare configuration #2 leads to a higher yield per area than configuration #3. Configuration #2 in fact provides the overall maximum yield per area of 0.725, obtained by enumerating all possible spare cores sharing configurations.
Figure 8. Three representative spare configurations with 192 spare cores (x-axis: configuration IDs; y-axis: normalized yield, area, and yield/area).
Table 6. Spare cores sharing configurations used in Figure 8 (number of spare cores with different scopes of sharing).
IDs   n_512  n_256  n_128  n_64  n_32  n_16  n_8  n_4  n_2  n_1
#1    0      0      0      0     0     0     64   128  0    0
#2    0      0      0      0     0     64    0    128  0    0
#3    0      0      0      0     0     64    128  0    0    0
1.7.3. Comparison of SCS-gp and SCS for different $D_0\kappa$
Figure 9 shows the yield per area obtained from the traditional spare cores sharing approach, where each spare core is shared globally within the corresponding processor (SCS-gp), and from the proposed general SCS approach, for different values of $D_0\kappa$. Table 7 shows the corresponding spare configurations. We find that our SCS approach leads to designs with spare configurations that provide higher yield per area than those obtained by the traditional SPS-g and SCS-gp approaches.
Detailed analysis of our experimental results provides the following clear reason for this trend. Consider a column with 4 cores. When a column is not fully occupied (i.e., it holds 1, 2, or 3 spare cores), one more spare core by itself causes negligible area overhead. In contrast, when every column of spare cores is fully occupied (i.e., holds 4 spare cores), an additional spare core causes a significant area overhead, approximately equivalent to adding a new column of cores. Furthermore, in both cases, we must also consider the increasing interconnect overheads due to spare cores. This overhead grows much faster for the traditional SCS-gp approach, since the sharing of spare cores is restricted to a fixed scope, namely globally within the corresponding processor. In contrast, in the proposed SCS, this component of the area overheads can be lessened to some extent due to the flexibility in the scope of sharing. This can be seen from the fact that the $N_{sc}$ of the optimal spare configurations for our new general SCS is smaller than that obtained from SCS-gp for any $D_0\kappa$, as shown in Table 7.
According to Figure 9, our new SCS approach provides yield per area improvements over the traditional SCS-gp that are higher for processes with higher defect densities ($D_0\kappa$). In practical terms, this shows that our new SCS approach will provide greater yield per area when a new process is first used for high-volume fabrication. This is particularly useful as CMPs command high prices at that time.
The performance penalty of our approach is also smaller. We calculate $ENSC_x$ for the optimal spare configuration for each $D_0\kappa$. We then obtain the delay overheads for each $ENSC_x$ using Synopsys Design Compiler and the NCSU 45nm library. The obtained overheads are shown in Table 8. For a typical defect density of 0.2/mm$^2$, SCS and SCS-gp incur total delay overheads of 4.8% and 5.6%, respectively, of the nominal clock period of the original design.
Figure 9. Yield per area vs. Defect density.
Table 7. Optimal spare configurations (4×8; number of spares with different domains of sharing).
Approach  D_0κ   n_512  n_256  n_128  n_64  n_32  n_16  n_8  n_4  n_2  n_1   N_sc
SCS-gp    0.05   0      0      0      0     48    0     0    0    0    0     3
SCS-gp    0.1    0      0      0      0     64    0     0    0    0    0     4
SCS-gp    0.15   0      0      0      0     64    0     0    0    0    0     4
SCS-gp    0.2    0      0      0      0     80    0     0    0    0    0     5
SCS-gp    0.25   0      0      0      0     96    0     0    0    0    0     6
SCS-gp    0.3    0      0      0      0     112   0     0    0    0    0     7
SCS       0.05   0      0      0      0     32    32    0    0    0    0     3
SCS       0.1    0      0      0      16    16    0     0    0    0    0     3
SCS       0.15   0      0      0      0     16    32    64   0    0    0     3
SCS       0.2    0      0      0      0     0     64    0    128  0    0     3
SCS       0.25   0      0      0      0     32    32    64   0    0    0     3
SCS       0.3    0      0      0      16    0     0     128  0    0    0     4
Table 8. Delay penalties due to spare cores.
                    D_0κ:   0.1     0.15    0.2     0.25
SCS-gp   N_sc               4       4       5       6
         Demux              0.056   0.056   0.056   0.064
         Crossbar           0.032   0.032   0.04    0.04
SCS      N_sc               3       3       3       3
         Demux              0.048   0.048   0.048   0.048
         Crossbar           0.032   0.032   0.032   0.032
1.7.4. Comparison of SCS-gp and SCS across technologies
The number of cores on a silicon chip is predicted to double with each technology generation [70]. Defect density remains constant across technologies [37]. As feature size decreases, critical area increases. Taking 45nm technology as the baseline, the scaling ratios of critical area for 32nm and 22nm are 1.13 and 1.23, respectively [70][32]. To explore the trend of yield per area with first-order accuracy, we scale the kill ratio accordingly, i.e., $D_0\kappa$ is 0.2, 0.226, and 0.246 for the 45nm, 32nm, and 22nm technologies, respectively.
Figure 10 shows the yield per area of the CMPs in the different technologies. We observe that the newly proposed SCS leads to a much greater improvement in yield per area over the traditional SCS-gp approach for newer technology generations, i.e., 26.97%, 100.56%, and 102.30% improvements over SCS-gp for the 45nm, 32nm, and 22nm technologies, respectively. The results clearly show that the advantage of SCS over SCS-gp is expected to grow as the technology feature size becomes smaller.
Figure 10. Yield per area vs. technology (45nm, 32nm, 22nm) for the no-spare design, SCS-gp, and SCS; the annotated improvements of SCS over SCS-gp are 26.97%, 100.56%, and 102.3%, respectively.
Conclusion 1.8.
This chapter has presented a systematic methodology for using spare processors (cores) to optimize the yield per area of CMPs. We developed a general floorplan for CMPs and a floorplan-based approach to calculate the area of CMPs, taking into account the various aspects of the area overheads of adding spare cores, as well as to obtain the yield of CMPs. We then proposed our new spare cores sharing approach, which enumerates spare configurations and identifies the one that maximizes yield per area for a given CMP. The proposed approach provides a more realistic optimal spare configuration than previous approaches. The experimental results show that our spare cores sharing approach provides significant yield per area improvements, i.e., 67.2% and 76.6%, over the traditional approaches. Furthermore, the results indicate that the benefits of the proposed spare cores sharing approach will continue to grow as technology continues to scale. The delay overheads of the optimal spare configurations from the proposed approach have also been estimated and are smaller than those of the traditional approaches.
Optimizing Redundancy Design for Flexible Utility Functions
In this chapter, we further improve wafer utilization by considering the flexibility of assigning appropriate value to defective but usable chips. Specifically, we define a utility function for CMPs in terms of two functions: (i) the number-of-processors-binning (NPB) function, which captures the range of the number of (enabled) working processors over which a chip can be sold, and (ii) the value function, which captures how the value of a chip to the user depends upon the number of processors enabled in the chip. We systematically explore the impact of processor-level utility functions on the optimal spare design, based on NVIDIA's Fermi GPU architecture, where each processor (also called a streaming multiprocessor, or SM, by NVIDIA) consists of multiple (32) cores. By taking such factors into account, our approach is able to proactively provide optimal spare designs for different utility functions. The contributions of this work are as follows.
(1) We capture performance degradation for defective CMPs by adopting evaluation metrics such as performance per wafer, processors sold per area, and revenue per wafer.
(2) We introduce various types of utility functions at the processor level to enhance the flexibility of assigning value to defective but usable CMP designs, to further increase the figure of merit for different evaluation metrics. The proposed utility functions increase the figure of merit from 67.7% to 84% in terms of revenue per area. More importantly, the obtained revenue per area is within 17% of the ideal scenario, namely a state-of-the-art process idealized by assuming zero defect density.
(3) We derive the design changes to make to obtain the optimal spare design under a new utility function.
(4) We define the ε-optimal spare design, i.e., a design that is close to the optimal design (in terms of working processors) but with a significantly decreased number of spares ($m$).
(5) We develop a repair heuristic to effectively use spare cores for the proposed utility functions.
Contents of this chapter are organized as follows. Section 1.9 shows optimal spare designs for CMPs. Section 1.10 introduces utility functions. Section 1.11 presents case studies on the NVIDIA Fermi architecture and experimental results. Section 1.12 presents the conclusions.
Optimal Spare Design for CMPs 1.9.
The previous chapter presented an approach for adding spare processors and/or cores to highly-parallel CMPs [18]. This approach provides the optimal spare design for scenarios where only full-configuration CMPs can be sold. Specifically, this approach explores the optimal level of hierarchy at which to add spares, as well as the optimal scope of sharing, e.g., 1-way sharing, 2-way sharing, etc. Figure 2 (in the previous chapter) shows some ways of sharing spare processors and spare cores. The optimal spare configuration is obtained in the form of a combination of shared spare processors and cores with various scopes of sharing.
While yield can easily be calculated using the negative binomial yield model for multi-core chips [32], it is more challenging to take into account all area overheads of adding spares, since overheads due to changes to, and additions of, interconnections are extremely important in multi-core chips [72]. In [18], this problem is solved by studying a range of modern multi-core chips [45][77][52], (a) to develop a general floorplan for multi-core chips, and (b) by using the corresponding floorplan when adding spare cores, to accurately estimate not only the area of spare cores but also the wasted area resulting from adding spare cores (see the area labeled 2 in Figure 3), the increase in the area of interconnects, including that for additional multiplexers and repeaters (see the areas labeled 3.1, 3.2, and 3.3 in Figure 3), and so on.
For our experiments, parameters such as defect density, wire pitch, and repeater size are obtained from the ITRS [35], previous literature [75], and Intel technology specifications [62]. Characteristics of the processor and L2 cache are obtained from a study of NVIDIA GTX 480 die photos [77]. Characteristics of components inside the processors, such as the L1 cache and the cores, are obtained from NVIDIA GTX 480 die photos and verified using Cacti [14]. Table 9 summarizes the floorplan parameters [18] used for yield and area calculations.
Table 9. Floorplan characteristics.
# of cores per processor       16 & 32      # of processors per CMP        16
Core array floorplan           4×4 & 4×8    Processor array floorplan      4×4
Core-level interconnect        crossbar     Processor-level interconnect   crossbar
Width per SM (mm)              3            Width of CMP (mm)              23
Length per SM (mm)             5            Length of CMP (mm)             24
Width of core array (mm)       3            Width of L2 cache (mm)         23
Length of core array (mm)      2.9          Height of L2 cache (mm)        6
Width of L1 cache (mm)         3            Height of crossbar (mm)        2
Height of L1 cache (mm)        1.4          Height of L1 crossbar (mm)     0.7
Technology (nm)                45           Repeater width (mm)            0.016
D_0κ (/mm²)                    0.2          Address bus width (bits)       32
Spacing of repeaters (mm)      1.6          Write bus width (bits)         32
Repeater height (mm)           0.015        Read bus width (bits)          32
We have already demonstrated in the previous chapter that our spare core sharing approach can greatly improve the yield per area of highly-parallel CMPs compared to previous spare approaches [18]. In this work, we extend our spare core sharing approach to take into account different utility functions and to provide the optimal spare design using a newly developed algorithm with significantly reduced time complexity.
Utility Functions 1.10.
We first present the two components of utility and then derive utility functions that capture emerging ad-hoc explorations and significantly generalize them.
1.10.1. Number-of-Processors Binning Models
We start by considering two types of flexibility emerging in various communities of CMP customers. First, chips with fewer than $n$ working processors can be sold, albeit at lower prices. This approach has been used for commercial products; e.g., NVIDIA's GTX 480, GTX 470, and GTX 465 are actually the same design sold with 1, 2, and 5 processors disabled, respectively [77]. This approach is also discussed in [48], which calls it AMAA (as many as available), and is explored analytically in [72] under a simple model. Second, the extra working processors, e.g., the 17th and 18th working processors on a chip with $n = 16$, can be enabled and put into use. This approach has not been studied in prior work.
Consider a CMP where $n$ denotes the nominal number of processors given in the specifications. To improve the yield profile, each chip is designed with a total of $m$ processors, i.e., with $m - n$ spare processors. We define the NPB function $\phi(j)$ as
$\phi(j) = \begin{cases} 0, & \text{if } j < k, \\ j, & \text{if } k \le j \le l, \\ l, & \text{if } j > l, \end{cases}$
where $j$ is the total number of working processors, $k$ ($k \le n$) captures the fact that there may exist a minimum level of performance below which no customer finds a chip useful, and $l$ ($n \le l \le m$), in an analogous manner, denotes the point beyond which extra working processors are not advantageous to customers. We define four NPB functions based on different values of $k$, $l$, and $n$.
NPB Origin ($k = l = n$): Chips have to be sold with the full nominal configuration ($n$ working processors). This is equivalent to the classical approach, i.e., AMAD, which provides no flexibility in terms of number-of-processors binning.
NPB A ($k < n$ and $l = n$): For chips with fewer than $n$ working processors but with at least $k$ working processors, we disable the non-working processors and sell such chips at lower prices, i.e., we sell some chips with $n - 1$ working processors, some with $n - 2$ working processors, ..., and some with $k$ working processors. For chips with more than $n$ working processors, we disable every working processor beyond $n$ and sell such chips as $n$-processor chips. This model is useful in a context where some customers are willing to buy chips with fewer than $n$ working processors at suitably discounted prices.
NPB B ($k = n$ and $l > n$): Every chip with fewer than $n$ working processors is discarded. However, for chips with more than $n$ but fewer than $l$ working processors, we keep all working processors enabled and sell such chips at higher prices as chips with $n + 1$ processors, $n + 2$ processors, and so on. This model is useful in a context where $n$ working processors are deemed essential to meet the performance needs of even the least demanding customer, and where high-end customers are willing to pay extra for higher performance. For example, consider a GPU where the basic graphics functions require exactly $n$ working processors while some high-end customers can use chips with additional working processors for scientific computing applications.
NPB C ($k < n$ and $l > n$): It is easy to imagine a combination of A and B. For chips with fewer than $n$ working processors but with at least $k$ working processors, we disable the non-working processors and sell such chips at lower prices. For chips with more than $n$ but fewer than $l$ working processors, we keep all working processors enabled and sell such chips at higher prices.
To sum up, NPB A benefits from enabling downside flexibility, whereas NPB B benefits from enabling upside flexibility, and NPB C combines the benefits of NPB A and B. Note that downside flexibility can be increased by decreasing $k$, and upside flexibility can be increased by increasing $l$.
1.10.2. Value Functions
We define three value functions based on how the value of a chip, $v(i)$, depends on $i$, the number of (enabled) working processors on the chip sold to the customer.
Linear: We first define the simplest value function, where the value of a chip $v(i)$ is linear in $i$, the number of (enabled) working processors, as shown in Figure 12.
IPC: Next we evaluate the performance degradation due to non-working or disabled processors, and establish a more practical value function based on the instructions per cycle (IPC) of a CMP with a certain number of working processors.
GF100 is NVIDIA's first Fermi GPU architecture; it is designed with 16 processors, where each processor has 32 cores. Using GPGPU-sim [26], we have studied four typical GPU programs (dct, matrix, transpose, and clock) from the NVIDIA CUDA PDK [64]. By modifying the GPGPU-sim parameters accordingly (i.e., the number of processors in gpgpusim.config and the number of nodes of the interconnection network in icnt_config_fermi_islip), we obtain IPCs for different numbers of processors, as shown in Figure 11. We use the same L2-cache and memory configuration (from NVIDIA Fermi) throughout our experiments, since typically these are not changed as spare processors are incorporated. (Note that the L1-cache is inside each processor.) As a result, the increase of IPC slows down as memory bandwidth becomes the bottleneck beyond a certain point.
Figure 11. IPCs for GPU programs (clock, matrix, dct, transpose) with fixed L2-cache and memory configuration (x-axis: number of working processors; y-axis: normalized IPC).
Figure 11 shows that IPC increases monotonically with the number of working processors. (Note: values are normalized to the IPC of the configuration with 16 working processors; NVIDIA Fermi's memory hierarchy is used.) However, the IPC trend is highly dependent on program characteristics. To derive a generally useful spare configuration, an IPC averaged over different programs should be used. In this work, we use the four programs above to derive the averaged IPC as an example, which is shown in Figure 12.
Figure 12. Normalized value functions for linear, IPC, and catalog price (x-axis: number of working processors; y-axis: normalized value).
Catalog price: Finally, we derive another practical value function using catalog prices for a chip with different numbers of enabled processors.
The first hot-lot of Fermi that came from TSMC had 7 good chips out of a total of 416 candidates, i.e., a yield of less than 2 percent [11]. This corresponds to the "No spares design" under NPB Origin in Figure 14(c), which for our defect models shows around 0.0% revenue-per-area (RPA). Due to the low yield, manufactured copies of this design were sold in three versions, namely GTX 480, GTX 470, and GTX 465, which have 15, 14, and 11 working processors, respectively [77]. Note that in this case chips with 12 and 13 working processors were sold as 11 working-processor chips. This strategy corresponds to NPB A, where $v(i)$ is the catalog price of a chip with $i$ working processors. The value function is extrapolated using the available catalog prices, as shown in Figure 12.
1.10.3. Utility Functions and Metrics for Evaluation
We define utility functions using one NPB function and one value function as follows:
$u(j) = v(\phi(j))$.
Hence a utility function represents the value of a chip to users whose preferences are captured by selecting the appropriate NPB and value functions. For example, the utility functions for the different NPB functions under the linear value function are captured in Table 10.
Table 10. u(j) for various NPB functions under the linear value function.
# of working processors (j) per chip     Origin   A   B   C
j < k                                    0        0   0   0
k ≤ j < n                                0        j   0   j
n ≤ j < l                                n        n   j   j
j ≥ l                                    n        n   l   l
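A minimal sketch of the NPB function φ(j) and of the composition u(j) = v(φ(j)) is shown below; the choice l = 20 and the normalization of the linear value function are illustrative assumptions, not values from the experiments.

def npb(j, n, k, l):
    """Number-of-processors-binning function phi(j); Origin/A/B/C differ only
    in the choice of k and l (k = l = n gives NPB Origin)."""
    if j < k:
        return 0
    return min(j, l)

def utility(j, n, k, l, value):
    """u(j) = v(phi(j)); `value` is any value function v(i), e.g., linear."""
    return value(npb(j, n, k, l))

linear = lambda i: i / 16.0   # linear value function, normalized to v(16) = 1

# NPB C with n = 16, k = 11, and a hypothetical l = 20:
print(utility(14, 16, 11, 20, linear))  # sold as a 14-processor chip
print(utility(18, 16, 11, 20, linear))  # extra processors stay enabled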
Each utility function requires a particular metric for evaluating wafer-level utility, according to the value function used: with the linear value function we use processors-sold-per-area (PSPA), with the IPC value function we use instructions-per-cycle-per-area (IPCPA), and with the catalog price value function we use revenue-per-area (RPA).
1.10.4. Computation of Metrics for All Utility Functions
We use a general function $F$ as the evaluation metric, as shown in Equation (21):
$F = \frac{1}{A} \sum_{j=k}^{m} P(j) \cdot u(j)$,   (21)
where $A$ is the area scaling ratio of a design with a total of $m$ processors, which is calculated using the area of the nominal design ($n$ processors), the area of the spares, the overheads in the area of the interconnection network, and so on. $P(j)$ is the probability that a fabricated copy of the chip has $j$ working processors. Note that $P(j)$ depends on the yield model; throughout this chapter we use the negative binomial yield model. The plot of $P(j)$ vs. $j$ is what we call the yield profile of a design. Finally, $u(j)$ is the utility function of the number of working processors ($j$), as defined in Section 1.10.3.
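A short sketch of Equation (21): the yield profile P(j) is assumed to come from the negative binomial yield model, the area scaling ratio A from Table 11-style factors, and u from the utility functions defined above.

def wafer_metric(m, k, area_scale, yield_profile, u):
    """Evaluation metric F of Equation (21).

    area_scale    -- A, the area scaling ratio for the design with m processors
    yield_profile -- callable P(j): probability of exactly j working processors
    u             -- utility function u(j) built from an NPB and a value function
    """
    return sum(yield_profile(j) * u(j) for j in range(k, m + 1)) / area_scale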
Here we use the linear value function as an example to illustrate how to calculate the overall PSPA using the spare processor approach. Figure 13 shows the profiles of yield, PS (processors sold), and PSPA for NPB C with $m = 28$. (Note that in this figure, each part of every bar with a different pattern represents the contribution of chips with the corresponding number of working processors.) The PS profile is obtained by summing the products of each term of the yield profile, $P(j)$, and the number of enabled working processors, $u(j)$ from the NPB function. The PSPA profile is obtained by dividing the PS for each value of $m$ by the corresponding area scaling factor, according to Equation (21). Table 11 shows the area scaling factors, which are calculated using the area model presented in Section 1.9.
Figure 13. Yield profile, PS profile, and PSPA profile for NPB C with n = 16 and m = 28.
Table 11. Area scaling factor for a particular design with n = 16, for various values of m.
m                     16     20     24     28     32     36     40
Area scaling factor   1      1.31   1.62   1.92   2.25   2.58   2.92
1.10.5. Ideal Case as Normalization Factor/Baseline
Consider an ideal case for a CMP design, namely zero redundancy, fabricated in a process with zero defect density. Clearly, in the real world, no design can surpass this ideal case. In Section 1.10.2, the value functions were normalized to the figure-of-merit of the original design. Similarly, for the entire CMP production, we define the baseline as the overall wafer-level utility in the ideal case. In our baseline scenario, $F$ in Equation (21) becomes equal to $u(n)$, since in this case $m = n$, $A = 1$, and $P(n) = 1$.
In Section 1.11, for convenience of comparison, the overall wafer figure-of-merit for a particular utility function is normalized to the respective baseline, which produces a dimensionless figure-of-merit value between 0 and 1, where 1 denotes the upper bound that can only be achieved under ideal conditions (namely, zero defect density and zero redundancy). Hence a solution that has a figure-of-merit of 0.9 is close to optimal, since it is within 10% of the ideal case.
Table 12. Optimal spare configurations.
(a) Linear (PSPA)
NPB function             Origin   A    B    C
Spare processors: m      36       24   40   28
Spare cores: 4-way       128      0    128  0
Spare cores: 8-way       0        0    0    0
Spare cores: 16-way      64       32   64   32
Spare cores: 32-way      0        32   0    32
(b) IPC (IPCPA)
NPB function             Origin   A    B    C
Spare processors: m      36       24   36   32
Spare cores: 4-way       128      0    128  0
Spare cores: 8-way       0        0    0    0
Spare cores: 16-way      64       32   64   32
Spare cores: 32-way      0        32   0    32
(c) Catalog price (RPA)
NPB function             Origin   A    B    C
Spare processors: m      36       28   40   32
Spare cores: 4-way       128      0    128  0
Spare cores: 8-way       0        0    0    0
Spare cores: 16-way      64       32   64   32
Spare cores: 32-way      0        32   0    32
Figure 14. Optimal designs, with n = 16 and k = 11 for NPB A and C. Panels: (a) PSPA, (b) IPCPA, (c) RPA. Each panel reports the normalized figure-of-merit for NPB Origin, A, B, and C for five designs: No spares design, Optimal processors design for NPB Origin, Optimal spare processors design, Optimal cores design for NPB Origin, and Optimal spare cores design. Note that, for each value function, (i) the Optimal processors design for NPB Origin represents the chip design with 36 processors (i.e., 20 spare processors), and (ii) the Optimal cores design for NPB Origin represents the chip design with 128 4-way spare cores and 64 16-way spare cores.
Case Studies on NVIDIA Fermi GPU Architecture 1.11.
In this section, we conduct experiments to demonstrate the impact of utility functions on the optimal spare configurations for the NVIDIA Fermi GPU architecture.
1.11.1. Experiment Setup
We use our spare processor approach as well as our spare core approach (i.e., the shared spare core approach in [18]). We have studied different utility functions based on our CMP model with $n = 16$, where $k = 11$ for NPB A and C, and where $l = m$ for NPB B and C.
For the different value functions, the optimal spare configurations and the corresponding wafer-level utilities for the various NPB functions are shown in Table 12 and Figure 14. Note that for NPB A, B, and C, we also show the corresponding overall wafer figure-of-merit for the spare configuration that is optimal for NPB Origin.
1.11.2. A Heuristic to Repair CMPs and Compute Yield
Note that the greedy repair algorithm was proved to be optimal in the previous chapter for fully functional CMPs, i.e., for the NPB Origin model. However, it does not guarantee the optimal spare configuration for our other utility functions; for example, the optimal spare configuration that maximizes the evaluation metric under one NPB model does not necessarily maximize it under another NPB model. In this work, we use Heuristic 1, which invokes the greedy algorithm described in [18], to repair defective CMPs and compute the yield. An algorithm to derive the optimal spare configuration for various utility functions will be developed in future work.
Heuristic 1: Repair a defective CMP.
1. Sort the processors in ascending order with respect to the number of defective cores within each processor.
2. Repair the processor that has the minimum number of defective cores, using the greedy repair algorithm in [18].
3. Repeat until either of the following conditions is met:
   a. No more defective processors can be repaired due to a lack of spare cores.
   b. All processors have the required number of working cores to be fully functional.
4. Output the number of working processors available on the CMP.
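A minimal sketch of this heuristic is shown below; greedy_repair_processor is a hypothetical wrapper around the greedy repair algorithm of [18] and is an assumption of this sketch.

def repair_cmp(processors, spare_pool, greedy_repair_processor):
    """Count the working processors obtainable on a defective CMP.

    processors              -- list of sets of defective core IDs, one per processor
    spare_pool              -- mutable state of the spare core sets shared on the CMP
    greedy_repair_processor -- callable returning True if the processor could be
                               made fully functional using the shared spares
    """
    working = sum(1 for defects in processors if not defects)   # already defect-free
    # Attempt repairs on the least-damaged processors first.
    for defects in sorted((d for d in processors if d), key=len):
        if greedy_repair_processor(defects, spare_pool):
            working += 1
    return working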
1.11.3. Main Conclusions
Based on our experiments, we are able to combine approaches in a way that provides a figure of merit that is up to 83.8% of that of an ideal process with zero defect density, even in an era of high defect density that leads to extremely low yield. We draw the following conclusions based on the study of the spare processors approach and the spare cores approach under various utility functions. We illustrate these conclusions using the linear value function as an example in this section; note that all our conclusions also hold for the other two value functions, i.e., the IPC and catalog price functions.
(1) The use of the new NPB functions improves the wafer-level figure-of-merit, even for designs that target NPB Origin. For example, Figure 14(a) shows that the obtained PSPA for the No spares design (the original design with no redundancy, i.e., m = 16) increases from 0% to 7.9% via adoption of NPB A. Also, the obtained PSPA for the Optimal processors design for NPB Origin, i.e., m = 36, increases from 36.6% to 38.5%, 47.2%, and 49.1% by adopting NPB A, B, and C, respectively. A similar trend exists for the spare cores approach. All these conclusions also hold for the IPC and catalog price value functions. Such increases come from the flexibility provided by each new NPB function.
(2) The optimal configuration of spares for the various NPB functions follows a clear mathematical trend. Consider the linear value function in Table 12(a) as an example. The optimal spare configuration for NPB Origin is to add 20 spare processors (i.e., m = 36). Compared to NPB Origin, NPB A has the same upside flexibility (namely, zero) but additional downside flexibility; hence the maximal PSPA is obtained for NPB A at a smaller value of m, namely 24. NPB B has the same downside flexibility (i.e., zero) as Origin but additional upside flexibility; hence the maximal PSPA is obtained for NPB B at a higher value of m, namely 40. Moreover, NPB A and C both have the same downside flexibility (i.e., chips with 11-15 working processors are valuable) while C has additional upside flexibility; hence the maximal PSPA is obtained for NPB C at a higher value of m, namely 28. NPB B and C both have the same upside flexibility (i.e., chips with more than 16 processors are sold at higher prices) while C has additional downside flexibility; hence the maximal PSPA is obtained for NPB C at a smaller value of m, namely 28.
To put it simply, with equal upside flexibility, the NPB function with additional downside flexibility leads to a smaller m for the optimal spare processors approach; and with equal downside flexibility, the NPB function with additional upside flexibility leads to a larger m for the optimal spare processors approach. We claim that these observations are independent of the exact form of $u(j)$, provided that $u(j)$ is a non-decreasing function of $j$.
Note that for the spare cores approach, this trend holds in terms of ENSC.³ Figure 15 explicitly captures these relations among the NPB functions: the direction of each arrow indicates (i) a decrease in the number of spare processors for the optimal spare processors approach, and (ii) a decrease in ENSC for the optimal spare cores approach.
³ The Effective Number of Spare Cores that one defective core $l$ can use is denoted by $ENSC_l$, which equals $n_1 \cdot \frac{1}{pq} + n_2 \cdot \frac{2}{pq} + n_4 \cdot \frac{4}{pq} + \cdots + n_{pq}$, where $p$ is the number of processors, $q$ is the number of cores per processor, and $n_i$ is the number of spares that can be shared by $i$ cores, where $i$ is a power of 2.
Figure 15. Relations between the optimal spare configurations of the different NPB functions.
(3) Optimizing the design for each new NPB utility function provides a dramatic increase in the wafer-level figure-of-merit. Downside flexibility helps reduce wastage and increases overall utility in all cases. For example, Figure 14(a) shows that the normalized PSPA of the optimal spare processors design increases from 36.6% to 47.6%, 47.8%, and 50.6% for NPB A, B, and C, respectively, for the spare processors approach. A similar trend exists for the spare cores approach.
Upside flexibility only helps when spare processors are used. This can be seen from the fact that, for the no-spares approach and the spare cores approach, NPB Origin and NPB B always have the same overall wafer-level figure-of-merit, i.e., 0.0% and 72.5%, respectively, and these values do not change across the different value functions. Similarly, NPB A and C are always identical for the spare cores approach. By combining the benefits of upside and downside flexibility, NPB C is always better than or equal to the other NPB functions.
Figure 16. Theoretical maximal PSPA (tm-PSPA) for different values of m: 9.00, 8.59, 8.36, 8.20, 7.99, 7.85, and 7.71 for m = 16, 20, 24, 28, 32, 36, and 40, respectively.
(4) We show that maximizing the overall wafer-level figure-of-merit is not equivalent to minimizing wastage, using the linear value function as an example. First we define the theoretical maximal value of PSPA (tm-PSPA), i.e., the maximum-possible value of PSPA for a given $m$, obtained by counting every single working processor on the wafer. In other words, tm-PSPA is defined using NPB C with $k = 1$ and $l = m$. Figure 16 shows the tm-PSPA for each value of $m$. The maximal value of tm-PSPA occurs when no spare processors are added ($k = 1$ and $m = 16$), i.e., where the overheads of spares (the area of the spares, the wasted layout area, and the overhead area for additional interconnects) are zero and hence the effective processor area per wafer is maximized.
The overall PSPA loss (denoted Residual PSPA) consists of working processors on discarded chips and working processors disabled on chips sold. By subtracting the PSPA from the corresponding tm-PSPA, we obtain the Residual PSPA for each $m$, as shown in Figure 17. We have found that, for NPB C, the optimal spare configuration (i.e., $m = 32$) does not guarantee the minimum Residual PSPA (i.e., ~0.00), which is achieved when $m = 40$. This demonstrates that minimizing Residual PSPA is not the optimal strategy.
This is because, as the area overhead increases with an increasing number of spare processors, the effective area for processors per wafer decreases.
Figure 17. Residual PSPA for each NPB function (Origin, A, B, C), for n = 16, k = 11 (x-axis: m, from 16 to 40; y-axis: Residual PSPA; data labels in the figure: 7.73, 3.97, 1.28, 0.29, 0.05, 0.01, and 0.00).
Table 13. ε-optimal designs with ε = 8%.
(a) Linear (PSPA)
NPB function             Origin   A    B    C
Spare processors: m      31       23   35   24
Spare cores: 4-way       128      0    128  0
Spare cores: 8-way       32       0    32   0
Spare cores: 16-way      0        64   0    64
Spare cores: 32-way      0        0    0    0
(b) IPC (IPCPA)
NPB function             Origin   A    B    C
Spare processors: m      31       23   32   24
Spare cores: 4-way       128      0    128  0
Spare cores: 8-way       32       0    32   0
Spare cores: 16-way      0        64   0    64
Spare cores: 32-way      0        0    0    0
(c) Catalog price (RPA)
NPB function             Origin   A    B    C
Spare processors: m      31       24   35   27
Spare cores: 4-way       128      0    128  0
Spare cores: 8-way       32       64   64   64
Spare cores: 16-way      0        0    0    0
Spare cores: 32-way      0        16   0    16
64
Figure 18. Optimal and Ɛ-optimal designs (Ɛ = 8%) with n = 16, k = 11: (a) PSPA, (b) IPCPA, (c) RPA.
(Figure 18 data, normalized figure-of-merit for NPB Origin / A / B / C:
(a) PSPA — optimal spare processors 0.37 / 0.48 / 0.48 / 0.51; Ɛ-optimal spare processors 0.34 / 0.44 / 0.45 / 0.48; optimal spare cores 0.73 / 0.82 / 0.73 / 0.82; Ɛ-optimal spare cores 0.69 / 0.77 / 0.69 / 0.77.
(b) IPCPA — optimal spare processors 0.37 / 0.50 / 0.53 / 0.56; Ɛ-optimal spare processors 0.34 / 0.48 / 0.49 / 0.51; optimal spare cores 0.73 / 0.83 / 0.73 / 0.83; Ɛ-optimal spare cores 0.69 / 0.80 / 0.69 / 0.80.
(c) RPA — optimal spare processors 0.37 / 0.47 / 0.49 / 0.50; Ɛ-optimal spare processors 0.34 / 0.45 / 0.46 / 0.47; optimal spare cores 0.73 / 0.84 / 0.73 / 0.84; Ɛ-optimal spare cores 0.69 / 0.77 / 0.69 / 0.77.)
(5) Our spare cores approach provides a much more dramatic improvement than the spare processors approach. For NPB Origin, compared to the optimal spare processors approach,
our spare cores approach doubles PSPA, IPCPA, and RPA from 36.6% to 72.5%. The
improvement is equally dramatic for NPB A. However, for now our spare cores approach
does not consider spare processors. Hence spare cores in our approach are distributed
across the chip and cannot be consolidated to obtain additional working processors. As a
result, in its current form, our spare cores approach cannot benefit from upside flexibility
provided by NPB B and NPB C.
(6) Exploring Ɛ-optimal designs can significantly reduce the number of spares. By exhausting the entire design space, we are able to find the optimal processor (core) design that makes full use of the flexibility provided by each NPB function and each value function. However, whether such design changes are worthwhile depends on the specific situation. For example, Figure 19 shows how the normalized figure-of-merit changes with the number of spare processors for NPB C. (Note that the decrease of figure-of-merit with increasing numbers of spare processors over certain ranges of m is due to wasted wafer area; a detailed explanation has been provided in [18].) The optimal design is to add 12, 16, and 16 spare processors for the linear, IPC, and catalog price value functions, respectively. If we are given a design margin Ɛ = 2%, i.e., a figure-of-merit up to 2% below the optimum is acceptable, the Ɛ-optimal design requires 12, 12, and 12 spare processors for the three value functions, respectively. Hence, the efficiency of Ɛ-optimal spare design highly depends on the characteristics of the value functions. We have identified Ɛ-optimal designs for all utility functions using the spare processors approach as well as the spare cores approach with Ɛ equal to 8%, with the corresponding spare configurations and figure-of-merits shown in Table 13 and Figure 18, respectively.
This phenomenon occurs in most of our cases. Due to diminishing-return effects beyond a certain value of m, we always observe scenarios where the overall figure-of-merit increases only slowly as more spares are added. To address this tradeoff between optimality and the amount of spares, our approach first finds the optimal design and then traces back to the Ɛ-optimal design with the smallest overhead that meets the overall figure-of-merit margin (Ɛ); this design typically has a much smaller m or ENSC for the spare processors approach and the spare cores approach, respectively.
Figure 19. Normalized figure-of-merits vs. number of spare processors for NPB C.
1.11.4. New Algorithm Design
In addition to the newly proposed Heuristic 1, which is implemented in our experiments, we made the following extensions to the algorithms presented earlier.
The optimal spare configuration using spare processors approach is easy to derive with
relatively short run-time. However, our spare cores approach requires substantial computation
time, as the percentage of working chips for a given number of defects is obtained via a large
number of Monte Carlo simulations. A branch and bound algorithm was used in our previous
spare approach to reduce the complexity of exhausting the entire design space for the optimal
spare configuration. Nevertheless, this approach has a time complexity that is too high for future scenarios, where CMPs will have hundreds of processors. Based on the previous conclusions, we are able to reduce the complexity significantly by improving our algorithm in the following two ways.
Extended branch & bound algorithm: We extend our branch and bound algorithm to take
into account relations among NPB functions shown in Figure 15. For example, the arrow
pointing to NPB A from Origin suggests that the optimal spare configuration for NPB A requires
fewer spare processors than that for NPB Origin. Hence, to obtain the optimal spare
configuration for NPB A we only explore designs with fewer spare processors than that of NPB
Origin. Similarly, for the spare cores approach, Figure 15 suggests that the optimal spare configuration for NPB A has a smaller ENSC than that for NPB Origin. Specifically, since our approach enumerates spare cores configurations in ascending order of ENSC, this extension is implemented by using the previously found optimal spare configuration to bound the search space of the new search.
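A minimal sketch of this pruning idea is given below (this is not the dissertation's actual implementation; the evaluation function and names are placeholders):

def optimal_spares(evaluate, max_spares, prev_optimum=None):
    """Enumerate candidate spare counts in ascending order, bounding the search at the
    optimum found for the previously evaluated NPB function (per the Figure 15 ordering)."""
    upper = max_spares if prev_optimum is None else prev_optimum
    best, best_value = None, float("-inf")
    for m in range(0, upper + 1):           # ascending order of spares (or of ENSC)
        value = evaluate(m)                  # wafer-level figure-of-merit for m spares
        if value > best_value:
            best, best_value = m, value
    return best, best_value

# Usage: the optimum for NPB Origin bounds the search for NPB A.
# m_origin, _ = optimal_spares(evaluate_origin, max_spares=40)
# m_a, _ = optimal_spares(evaluate_a, max_spares=40, prev_optimum=m_origin)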
Conclusions 1.12.
In this chapter, we explore different aspects of utility functions at processor level and their
impacts on optimal spare designs. First we present four models to bin chips, i.e., number of
processors binning (NPB), and three value functions. Then we conduct experiments to analyze
the impact of various combinations of NPB and value functions on the optimal spare
configurations for different spare approaches. In particular, we show that the flexibility provided
by our NPB functions can dramatically increase the figure-of-merit which can, for very realistic
value functions and a process with very high defect density, approach within 16-17% of an ideal
case which assumes zero defect density and zero redundancy.
Partially-working Processors Binning to Maximize Wafer Utilization
In this chapter, we continue to explore flexibilities in utility functions at core level by
enabling partially-working processors on CMPs. We simulate GPU benchmarks on GPGPU-sim
to evaluate the proposed utility function. Contributions of this work are as follows.
(1) We complete the analysis of flexibilities of assigning value to defective but useable chips
by including a new utility function at core level.
(2) We provide a detailed analysis on the relationship between CMP performance and number
of cores per processor for a CMP specification for various benchmarks, by simulating
ISPASS benchmarks as well as those from Nvidia CUDA SDK on GPGPU-sim.
(3) We present an effective method to estimate performance for defective CMPs, where
processors have different numbers of working cores, by averaging performance obtained
for simulations with different warp sizes. This fast estimation method enables evaluation
of various redundancy approaches and hence enables rapid identification of the optimal
level of redundancy.
(4) We develop new repair heuristics for the proposed utility function, and identify the
optimal one for each spare configuration while enumerating all configurations. Impact of
different repair algorithms and heuristics are shown for particular spare configurations.
(5) Improvements provided by the proposed approach depend on the benchmark. Results
show that our utility functions provide yield per area of up to 90% of the ideal, for
benchmarks with enough parallelism given current defect density. Results also show that
our design and repair approaches provide above 50% IPC per wafer area even with 10x
the current defect density.
Contents of this chapter are organized as follows. Section 1.13 shows the related research.
Section 1.14 shows background of GPU architectures and simulators. Section 1.15 presents the
problem to be solved in this chapter. Section 1.16 presents the proposed approach. Section 1.17
presents the experiments using GPU simulators and results. Section 1.18 shows conclusions.
Related Research 1.13.
The optimal level of granularity at which to add redundancy has been explored in theoretical
and practical studies. Theoretical: In [88], a CMP design is partitioned into modules of arbitrary
sizes, each with an equal amount of functional logic. Spare modules are then incorporated with a
global scope, i.e., in such a manner that each spare module can replace any module that is
defective. This approach assumes that a spare module with a global scope incurs a fixed area
overhead and ignores the overheads of interconnects required to use the spare modules in a
global manner. Hence this approach provides an upper bound on the yield benefits of adding
spares. Finally, the approach identifies the size of a module, which is also the granularity of each
spare module, to maximize yield per area. Practical: In contrast, the practical approaches
proposed to date start with models of the chip’s floorplan and typically view a chip in terms of
processors and sub-processor level modules. In case of CMPs, sub-processor modules are cores,
caches shared by cores, and interconnects, or parts thereof. The approaches study yield trends
across technology generations for various processor- and sub-processor-level redundancies
[18][20][19]. These approaches take into account the area overhead of interconnects and the
optimal designs they derive are more realistic.
In particular, a systematic spare cores sharing (SCS) approach was developed in our earlier work to add spare processors and cores to CMPs. In this approach, all spare configurations are enumerated in terms of: (a) the scope of sharing of spare cores, i.e., the number of original cores which share the same spare cores, and (b) the number of spare cores per scope. The notation […, n_sc,k, …, n_sc,4, n_sc,2, n_sc,1] is used to represent a spare configuration, where n_sc,k is the number of spare cores that are each shared by k original cores. We use n_p to denote the number of nominal (original) processors per chip, and n_c to denote the number of nominal (original) cores per processor. Hence the maximum value of k is equal to n_p · n_c. Experiments show that
this approach provides performance per wafer that is 65% of the performance per wafer for the
ideal scenario, i.e., the scenario where defect density is zero and no redundancy is required.
Substantial functional resources are still wasted in this approach since it requires that (i) a processor is considered working only if the number of working cores within it is equal to or greater than the nominal number of cores (n_c, which is 32 for their example CMP), and (ii) a chip is considered working only if the number of working processors it has is equal to or greater than the nominal number of processors (n_p, which is 16 for their example CMP).
Number of processors binning (NPB) has been proposed in the previous chapter to reduce such wastage of working cores and processors by allowing defective chips with fewer than the nominal number of working processors. In this approach, non-working streaming multiprocessors (SMs, the Nvidia terminology for processors) are disabled, where an SM is defined as working if it has the nominal number of working cores or more. (In the remainder of this chapter, our target CMP is modeled on a state-of-the-art Nvidia GPU, where the nominal number of processors (SMs) in a chip is n_p = 16 and the nominal number of cores in each processor is n_c = 32.) This approach was motivated by three GPUs from Nvidia, namely the GTX 480, GTX 470, and GTX 465, all of which use the same Fermi design and are sold with different numbers of SMs disabled. Once NPB is considered, the aforementioned SCS approach provides performance per wafer that is 82.5% of that of the ideal scenario.
However, NPB still wastes (disables) a large number of working resources on defective CMPs.
Specifically, this approach results in wastage in two ways: (1) SMs with fewer than 32 cores are
disabled, where a disabled SM may have up to 31 working cores. Such wastage is significant
especially when random defects are the main concern and hence the number of failing cores in
each SM is typically small. (2) SMs with more than 32 working cores (including the available
non-defective spares) will disable the additional cores, i.e., the 33rd core, the 34th core, and so on.
To clearly contrast with the approach we present in this chapter, we adopt a more descriptive
term for NPB, namely fully-working processors and partially-working chip binning (FPPCB).
In this chapter, we present a new utility function to minimize aforementioned wastage by
relaxing the requirements on a working processor. Specifically, a processor with any number of
working cores is deemed acceptable and the value assigned to a chip with specific numbers of
cores in each enabled processor is proportional to the performance the chip can provide for a
desired set of benchmark programs. In comparison to FPPCB, we use the term partially-working
processors binning (PPB) to describe the proposed approach.
We develop PPB as a procedure with three main steps. First, via simulations we estimate
performance of target benchmarks, for the nominal chip configuration as well as the
configurations of chips that are likely to be obtained after fabrication and repair. Second, we
develop repair algorithms that ensure that spare copies are used in an efficient manner for each
defective chip. Third, we develop an approach to identify the optimal spare configuration by
systematically evaluating different spare configurations.
GPU Background 1.14.
We use state-of-the-art GPU models as case studies for CMPs. This section presents the
background related to this research. Here we describe some background information about how
application programs are executed on GPUs.
A GPU benchmark usually consists of multiple kernels, where each kernel is a grid of blocks
of threads. SMs receive thread blocks in a round robin fashion. The maximum number of threads
that can be assigned to each SM is determined by the size of register file per SM.
During execution within each SM, threads are further separated into warps, where each warp
has certain number of threads, e.g., warp size 𝑤 is equal to 32 for Nvidia GTX 480. Threads
inside a warp are executed in a lock-step manner. Consider number of cores per SM to be 𝑛 .
Then the time spent in the execution stage of one instruction in a warp is proportional to ⌈w/n⌉. For example, execution of an integer instruction for each warp takes 2 cycles in the GTX 480, since in this case w = 32 and n_c = 16. Both the warp size and the number of cores are identical for all SMs, as is common practice [9][4]. We use w = n due to GPGPU-sim constraints, where the simulator only supports w = n [26].
A warp is issued to the cores once all hazards and dependencies are resolved for all of its threads, e.g., bank conflicts, memory dependencies, branch divergence, and register dependencies. Enabling many warps for concurrent execution in a single SM can hide the impact of these hazard- and dependency-induced latencies on overall GPU performance.
Problem Statement 1.15.
We are looking for an optimal order in which defective SMs should be repaired, so that the
evaluated metric, e.g., performance per wafer, is maximized after a defective CMP is repaired for
a given spare configuration.
We first explore the optimal strategy to use spare cores to repair any defective CMP. Once we
have such a repair strategy, we develop approaches to identify an optimal spare configuration.
PPB does not specify the number of working cores required for an SM to be considered working, i.e., it allows SMs to have various numbers of cores instead of a fixed value of 32 each. As a result, SMs can have different speeds. If the performance of an SM were proportional to the number of working cores in the SM, all SMs should be granted equal priority for repair under PPB. Through experiments, however, we found that the performance of a CMP does not always increase proportionally with the number of cores per SM for certain benchmarks, e.g., NQU, Dct8x8, VectorAdd, and FastWalshTransform, due to diminishing-return effects and compiler limitations. In other words, it might be better to repair SMs with x working cores rather than SMs with y working cores, since the former yields a larger performance gain when assigned a single spare core. Specific values of x and y, and their relationship, are presented in Section 1.16.
For each spare configuration, we use a Monte Carlo approach to create a large number of CMPs that have different numbers of defective cores in various SMs. The effectiveness of a spare configuration can then be evaluated based on the average performance of all the CMPs after we apply our repair process. Approaches that explore all spare configurations to identify the optimal one have been proposed in previous research [19]. In this work, we integrate the newly proposed repair strategy into that exploration process.
We break our problem into the following two sub-problems.
(1) What is the most effective repair strategy given a specific spare configuration?
(2) What is the optimal spare configuration?
Proposed Approach 1.16.
In this section, we separate benchmarks into different categories. Then we show an approach
to estimate performance of a defective CMP. We also present new repair heuristics to efficiently
apply spare cores to repair defective SMs for different categories of benchmarks. In the end, we
present our approach to identify the optimal spare configuration.
1.16.1. Benchmark categories
We have studied both ISPASS benchmarks, e.g., NQU, RAY, and STO, and those from
Nvidia’s CUDA SDK [64], e.g., VectorAdd, DCT8x8, and FastWalshTransform, using GPGPU-
sim [26]. Instructions per cycle (IPC) for different numbers of cores per SM is shown for selected benchmarks in Figure 20. Based on performance (IPC), these benchmarks can be separated into the following three categories.
Category I: Benchmarks without parallelism, e.g., N-Queens Solver (NQU), which solves the classic puzzle of placing N queens on an N×N chess board. Computation in such benchmarks is usually performed by a single thread. As a result, such benchmarks have low and constant IPCs that do not increase with the number of cores per SM. In other words, IPCs for such benchmarks are bottlenecked by the benchmarks themselves instead of the hardware. Hence, for applications in this category, there is no benefit in obtaining an additional working core from the repair process for any SM that already has at least one working core.
Category II: Some benchmarks have high parallelism, e.g., Store-GPU (STO) and Ray-tracing
(RAY). For such benchmarks, more cores per SM result in higher IPCs, since more threads can be executed in parallel. At the same time, however, the number of bank conflicts in the
shared memory within each SM, which occur when more than one thread accesses the same
memory bank, increases with the number of memory access threads that are executed
concurrently. Also the fraction of the L1 cache that is assigned to each thread decreases as total
number of threads increases, which results in higher L1-cache miss rate and consequently more
global memory writes/reads if one assumes that the number of warps is constant. Lastly, the
probability of branch divergence increases accordingly [9]. As a result, we can observe
diminishing returns on IPC curves as the number of working cores increases. If we primarily
focus on applications in this category, SMs with more defective cores should be granted higher
priority of being repaired since the obtained benefit of one additional working core decreases
with number of cores per SM.
Category III: Some benchmarks have high parallelism but only benefit from certain values of the number of cores per SM, especially powers of 2. Such benchmarks include DCT8x8, VectorAdd,
FastWalshTransform, and so on. For such applications, the performance increases in step
functions at specific numbers of cores. For applications in this category, it is not as
straightforward to decide the right defective SM to repair due to distinctive characteristics of
such step functions.
Due to the discrete nature of warp sizes, we treat the IPC of a benchmark as an arbitrary step function. Our proposed repair algorithm should be capable of dealing with such arbitrariness, i.e., with improvements due to additional cores being arbitrary functions of the number of existing working cores. Note that such an arbitrary step function subsumes, as a special case, the diminishing marginal improvements obtained for Category II.
Figure 20 (a), (b). IPCs for benchmarks from ISPASS and Nvidia CUDA SDK.
1.16.2. Estimating performance of defective CMPs
The performance of a CMP in which every SM has n working cores can be obtained by setting w = n (w = 32 for the nominal design) in the simulations on GPGPU-sim. Similarly, we can simulate benchmarks on GPGPU-sim with decreased warp sizes to obtain the performance (IPC) of each defective CMP. Note that GPGPU-sim uses a single warp size (w) for all SMs, which is a common feature of GPU compilers. However, SMs can have different numbers of working cores (n_i) in defective CMPs, i.e., n_1, n_2, …, n_16 may not be identical for a CMP with 16 SMs. Hence the performance of a defective CMP cannot be obtained directly from simulations, since such heterogeneity is not yet supported by GPGPU-sim. For simplicity, we estimate the performance of a defective CMP by computing the average performance of 16 CMPs with w_i = n_i, where i ∈ [1, 16]. Figure 21 further illustrates how the performance of a defective CMP is estimated, for a CMP that has 4 SMs with 2, 4, 4, and 8 working cores, respectively.
Then we can obtain IPCs for all warp sizes through simulations and create a lookup table. For
each defective CMP, we decide its IPC by indexing the lookup table. By doing so, we don’t need
to simulate each defective CMP, which greatly reduces the simulation time.
Figure 21. Estimation of performance for defective CMPs.
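The following minimal sketch illustrates this estimation step under the stated averaging assumption; the lookup-table IPC values and function name are placeholders, not results from GPGPU-sim:

# Hypothetical IPC lookup built from GPGPU-sim runs at each warp size w = n.
ipc_by_warp_size = {2: 55.0, 4: 110.0, 8: 190.0, 16: 310.0, 32: 480.0}  # assumed values

def estimate_defective_cmp_ipc(working_cores_per_sm, table=ipc_by_warp_size):
    """Estimate the IPC of a defective CMP as the average of the IPCs obtained by
    simulating a homogeneous CMP with w_i = n_i for each SM i (Section 1.16.2)."""
    return sum(table[n_i] for n_i in working_cores_per_sm) / len(working_cores_per_sm)

# Example from Figure 21: a CMP with 4 SMs having 2, 4, 4, and 8 working cores.
print(estimate_defective_cmp_ipc([2, 4, 4, 8]))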
1.16.3. Repair Heuristic
For an arbitrary IPC function and a specific spare configuration, we propose a general Greedy
Repair Heuristic (GRH-x), which is shown in Heuristic 1. We use f(n_i) for the IPC of a CMP with all SMs having n_i working cores. We define the first-order benefit α_i^1 as the benefit obtained by repairing one defective core of SM i, i.e., α_i^1 = f(n_i + 1) − f(n_i). Similarly, we define the second-order benefit α_i^2 as the benefit obtained by repairing two defective cores of SM i, i.e., α_i^2 = f(n_i + 2) − f(n_i), the third-order benefit α_i^3 = f(n_i + 3) − f(n_i), and so on. This heuristic always picks the SM that generates, or will generate, the maximum benefit in terms of IPC, by comparing the different α_i^j over all j and i.
For benchmarks in Category I, α_i^j is zero for all i and j. In this situation, all defective SMs should be granted equal priority for repair; in other words, any repair order works. For benchmarks in Category II, α_i^j decreases as n_i increases for the same j. As a result, our heuristic should greedily repair the SMs with the most defective cores, in line with our best-effort philosophy. Hence we propose GRH-r, shown in Heuristic 2, as a simplified version of GRH-x. Note that the "r" in GRH-r denotes the fact that defective SMs are picked in the reverse order compared to the previously developed GRH, which greedily repairs the SM with the fewest defective cores. As can be seen, the key difference among the three repair heuristics lies in the order in which defective SMs are repaired.
Heuristic 1: GRH-x
Step 1:
    k = 1;
    f(n_i): the IPC of a CMP with all SMs having n_i working cores;
    while (1)
        foreach SM_i
            compute α_i^k = f(n_i + k) − f(n_i);
        end foreach
        if equal α_i^k's are found
            k = k + 1;
            continue;
        else
            break;
        end if
    end while
Step 2: Sort SMs in descending order using (α_i^1, α_i^2, …, α_i^j, …), where the significance of α_i^j decreases with j.
Step 3: Repair the first SM.

Heuristic 2: GRH-r
Step 1: Sort SMs in descending order of the number of defective cores in each.
Step 2: Repair the first SM.
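For concreteness, here is a minimal Python sketch of the selection step of GRH-x and GRH-r as described above (function and variable names are placeholders; the selected SM would then be repaired and the step repeated while spare cores remain):

def grh_x_pick(working_cores, f, nominal=32):
    """GRH-x: pick the SM to repair next by comparing the benefits
    alpha_i^k = f(n_i + k) - f(n_i); higher orders matter only to break ties."""
    candidates = [i for i, n in enumerate(working_cores) if n < nominal]
    def key(i):
        n = working_cores[i]
        # Lexicographic tuple of first-, second-, ... order benefits.
        return tuple(f(n + k) - f(n) for k in range(1, nominal - n + 1))
    return max(candidates, key=key) if candidates else None

def grh_r_pick(working_cores, nominal=32):
    """GRH-r: pick the SM with the most defective cores."""
    candidates = [i for i, n in enumerate(working_cores) if n < nominal]
    return min(candidates, key=lambda i: working_cores[i]) if candidates else None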
Different impacts of these three GRHs can be observed only if both of the following two conditions are satisfied.
(1) Spare cores are required, i.e., designs with spare cores added have better performance per wafer than the nominal design. Note that the optimal spare configuration can be to add no spares, due to the substantial area overhead of spare cores in certain scenarios, especially when performance per wafer is already high or close to 1.
(2) Spare cores are shared across different SMs, in which case the order in which SMs are repaired becomes critical. On the contrary, if spare cores are only added within SMs, each spare core is dedicated to only one SM, or to certain cores within one SM, and there is no ordering of SMs to decide. As a result, GRH, GRH-r, and GRH-x provide identical results.
To show the differences among the three repair heuristics, we use an arbitrary step function of
IPC for different numbers of cores per SM, which is shown in Figure 22. Since a nominal SM
consists of 32 cores, we are only interested in spare cores that can be shared by more than 32
cores. We apply three repair heuristics to identify the optimal 𝑛 𝑠𝑐 ,𝑥 for the scope of sharing 𝑥 =32,
64, 128, 256, and 512, respectively, when no cores with other scopes of sharing exist. Figure
23(a) shows that the proposed GRH-x provides obvious improvements compared to GRH and
GRH-r in the shaded area, where the two above-mentioned conditions are satisfied. Note that to the left of the shaded area, the area overhead of adding spare cores is so large that no spares should be added, and to the right of the shaded area, spare cores are added within each SM and hence all three GRHs are equivalent. In this experiment, we use a high defect density of 15,000/mm² to demonstrate the effectiveness of GRH-x; intuitively, this is equivalent to a mean of 5.3 defective cores for an SM designed with 32 cores. For comparison, Figure 23(b) and (c) show the results obtained using defect densities of 10,000/mm² and 20,000/mm², which are equivalent to mean values of 3.8 and 6.4 defective cores, respectively.
Figure 22. IPC vs. number of cores per SM for an arbitrary benchmark.
1.16.4. Proposed redundancy approach
In our approach, we enumerate all possible values of n_sc,x for different scopes x, i.e., all spare configurations. We then design the floorplan for each spare configuration to estimate its area overhead as well as its delay penalty. Finally, we compute performance per wafer for each spare configuration using the approach described above: (a) generate a large number of CMP copies with different numbers of defective cores, where the number of defective cores in each SM is a Poisson random variable; (b) apply the repair heuristic to replace defective cores with spare cores; and (c) compute the performance of each defective CMP from the previously obtained CMP performances at various warp sizes.
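A minimal sketch of steps (a)-(c) follows; the repair heuristic, the spare configuration, and the IPC estimate are folded into caller-supplied placeholder functions (repair, ipc_of), and all numeric parameters are assumptions for illustration only:

import math, random

def sample_poisson(lam):
    """Knuth's method for sampling a Poisson-distributed defect count."""
    if lam <= 0:
        return 0
    threshold, k, p = math.exp(-lam), 0, 1.0
    while p > threshold:
        k += 1
        p *= random.random()
    return k - 1

def mean_wafer_ipc(repair, ipc_of, trials=10000, n_sm=16, cores_per_sm=32, mean_defects=0.6):
    """Sample defective CMPs, repair them, and average the estimated IPC."""
    total = 0.0
    for _ in range(trials):
        working = [max(cores_per_sm - sample_poisson(mean_defects), 0) for _ in range(n_sm)]
        total += ipc_of(repair(working))
    return total / trials

# Example with trivial placeholders: no repair, IPC proportional to working cores.
print(mean_wafer_ipc(repair=lambda w: w, ipc_of=lambda w: sum(w)))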
Figure 23 (a), (b), and (c) show that GRH, GRH-r, and GRH-x provide the same result if the aforementioned two conditions are not satisfied. Hence GRH-x should be replaced with GRH or GRH-r whenever applicable, since it has higher complexity than the other two. In our redundancy approach, we check the applicability for each spare configuration, especially in two scenarios: (a) no spares are required because yield is already high or close to 1 and/or the area overhead is substantial, and (b) only spare cores inside SMs are available in the configuration. When applicable, we simply apply GRH or GRH-r for those spare configurations to reduce program run time.
Figure 23. Optimal numbers of spares at various scopes of sharing: (a) defect density of 10,000/mm², (b) defect density of 15,000/mm², (c) defect density of 20,000/mm².
Table 14. Optimal IPC for PPB and FPPCB.
Defect density | Core yield | No redundancy (STO / FWT) | PPB (STO / FWT) | FPPCB (STO / FWT)
2,000          | 95.1%      | 0.650 / 0.695             | 0.892 / 0.893   | 0.825 / 0.868
10,000         | 82.4%      | 0.539 / 0.576             | 0.772 / 0.720   | 0.704 / 0.676
15,000         | 77.2%      | 0.468 / 0.500             | 0.724 / 0.659   | 0.617 / 0.588
20,000         | 73.3%      | 0.395 / 0.422             | 0.687 / 0.594   | 0.548 / 0.551
Table 15. Optimal spare configurations [n_sc,32, n_sc,16, n_sc,8, n_sc,4, n_sc,2, n_sc,1].
Defect density | Core yield | PPB STO   | PPB FWT              | FPPCB STO             | FPPCB FWT
2,000          | 95.1%      | No spares | No spares            | [32, 32, 0, 0, 0, 0]  | [32, 32, 0, 0, 0, 0]
10,000         | 82.4%      | No spares | [64, 64, 0, 0, 0, 0] | [32, 64, 0, 0, 0, 0]  | [64, 64, 0, 0, 0, 0]
15,000         | 77.2%      | No spares | [0, 64, 64, 0, 0, 0] | [0, 64, 64, 0, 0, 0]  | [0, 64, 128, 0, 0, 0]
20,000         | 73.3%      | No spares | [0, 64, 64, 0, 0, 0] | [0, 32, 128, 0, 0, 0] | [0, 64, 128, 0, 0, 0]
(The no-redundancy (NA) design has no spares by definition.)
Experimental Results 1.17.
We have obtained the optimal performance per wafer and corresponding spare configuration
designs for three benchmarks, namely, NQU, STO, and FastWalshTransform (FWT), which
respectively represent Categories I, II, and III. Since NQU is mainly executed in a single thread,
the presence of defective cores won’t affect the performance because there are always available
working cores for the single thread. The IPC for NQU is 0.029 for defect density values
considered in this work. Table 14 shows optimal IPCs obtained for different benchmarks for
defect densities of 2,000/mm², 10,000/mm², 15,000/mm², and 20,000/mm², where 2,000/mm² is the current random defect density reported by the ITRS [38]. IPC values are normalized to the IPC
value obtained for each benchmark on the nominal design. Table 15 shows the corresponding spare configurations, where n_sc,k are not shown for k > 32 since they are all zero. We also show the yield of a core for every defect density. We make the following observations.
(1) For benchmarks with a limited number of threads, such as NQU, the optimal spare configuration for PPB is to add no spares. The performance per wafer for such benchmarks is limited by the fact that they must be executed in a single thread. Improving the CMP configuration does not add any benefit even at 10x the current defect density, when each SM still has an average of 19 working cores.
(2) For benchmarks with sufficient parallelism, such as STO, the optimal spare configuration for PPB is also to add no spares. The reason is that adding spare cores does not increase performance per wafer by much, especially when it is already high (if not close to 1), while adding spare cores incurs area overhead.
(3) For benchmarks with distinct step functions, such as FWT, adding no spares is the optimal spare configuration at the low defect density of 2,000/mm². As defect density increases, the yield of a core becomes low; at that point, the benefits of adding spare cores outweigh the area overheads they incur, and spare cores should be added. It is also observed that the number of spare cores required for the optimal design in PPB is less than that in FPPCB for both STO and FWT.
(4) We also show the results for a defect density of 2,000/mm², which is the current defect density reported by the ITRS. For STO, compared to the optimal designs obtained from an approach without any utility functions, PPB can utilize an additional 24.2% of the cores of a nominal design and provides a 37.2% improvement in performance per wafer, i.e., 89.2% versus 65%. Comparing the optimal designs obtained for PPB and our previously developed FPPCB [20], PPB utilizes an additional 6.7% of working cores and provides an 8% improvement in performance per wafer, i.e., 89.2% versus 82.5%. In principle, the proposed PPB can utilize all working cores with best effort.
(5) For FWT, at a defect density of 2,000/mm², the improvement of PPB over FPPCB is about 3%. This is lower than the improvement for STO because processors with fewer than 24 cores do not provide much lower performance (see Figure 20).
(6) As defect density increases, the improvement of PPB over FPPCB also increases (in
almost all cases). For STO, the improvements are 8.1%, 9.6%, 17.3%, and 25.3%
respectively for the four defect densities shown in above tables. For FWT, the
corresponding improvements are 2.9%, 6.5%, 12.2%, and 7.7%. The reason is that the
number of fully-working processors decreases as defect density increases. As a result,
wastage in FPPCB increases. (The increase is lower for FWT at the highest defect density because few fabricated processors have 24 or more cores, even after repair using spares.)
Conclusions 1.18.
This chapter presents a new utility function called Partially-working Processors Binning (PPB)
to remove the requirement on the number of working cores per processor on a CMP. With this
utility function included, we have completed the analysis of impact and flexibilities on wafer
usage. A repair heuristic has also been developed for different benchmarks for PPB. Also we
have presented an approach to estimate performance for defective CMPs. Results show that the
proposed approach provides the capability of utilizing all working cores on each CMP. For
the current defect density and benchmarks with enough parallelism, PPB shows the capability of utilizing 89.2% of the logic resources. This improvement will grow as defect density increases.
Optimal Redundancy Designs for CNFET-Based Circuits
In this chapter, we explore finer-level redundancy techniques by studying CNFET-based
circuits for future technologies with smaller feature sizes and extremely high defect densities.
Contributions of this work are as follows.
(1) We demonstrate that the logical effort approach is not applicable to gate sizing in CNFET-
based circuits. Furthermore, we develop a logical-effort-based heuristic to size gates to
minimize critical path delays of CNFET-based circuits. The proposed heuristic provides
more realistic delay, which is up to 9% lower than that from only using logical effort
approach.
(2) We propose an approach to add CNTs for each CNFET based on its characteristics to
increase yield per area for logic circuits with negligible delay penalty.
(3) We show that yield per area for the optimal logic design is restricted by the previous-stage
logic, e.g., maximum allowed load capacitance.
(4) We also propose a hybrid redundancy approach for SRAM arrays by identifying the
optimal combination of redundant CNTs and spare columns (rows).
(5) Our approach increases yield per area from 60.9% to 76.3% for a 2MB memory module
compared to a spare-CNTs-only approach, while reducing the decoder delay overhead from
19.2% to 15.7%.
Contents of this chapter are organized as follows. Section 1.19 shows related research.
Section 1.20 shows background of CNFETs. Section 1.21 presents a CNFET design
methodology to estimate critical path delay and size gates in a logic circuit. Section 1.22 presents
the problem to be solved in this chapter. Section 1.23 shows a redundant-CNTs-only approach
for logic circuits based on the interplay of performance and yield. Section 1.24 shows a hybrid
redundancy approach for memory array by combining redundant CNTs approach with traditional
spare columns (rows) approach. Section 1.25 summarizes our contribution.
Related Research 1.19.
High rates of imperfections in CNFETs are one key obstacle to the demonstration of large
scale CNFET circuits [57][71]. These imperfections are due to higher defect rates and cause
higher performance variations and low yield for circuits, compared to circuits fabricated in
traditional CMOS technology [60][58][72]. In particular, CNFET circuits have high rates of
metallic CNTs (or m-CNTs), which are always conducting regardless of gate voltage, in contrast
with the useful semiconducting CNTs (or s-CNTs). Hence defect-tolerance must be an integral
aspect of circuit design. Authors in [73] propose a novel-stacking layout of CNFETs, which
helps reduce the statistical probability of Ohmic short between the source and drain of a CNFET
and increase the functional yield of logic gates up to 10X in the presence of m-CNTs. Authors in
[58] and [60] propose an algorithm to determine vulnerability of a given CNFET layout to
misaligned CNTs and present a technique to generate designs which are immune to misaligned
CNTs by etching specific regions. Authors in [58] present a VLSI-compatible m-CNT removal
technique called VMR which can mitigate the yield challenges due to m-CNTs. Authors in [72]
and [73] explore yield improvement solutions for logic gates for two scenarios: (1) with m-CNTs
present, and (2) with m-CNTs removed. They claim that circuit-level techniques (e.g., transistor
stacking and CNT stacking) are sufficient to build robust CNFET-based circuits in the presence
of small percentage of m-CNTs; however, designers should consider m-CNT removal techniques
and CNT-level redundancy in the presence of large percentage of m-CNTs.
Recently, some approaches of defect-tolerance for nano-electronics have focused on adding
redundancy at functional module or gate level [44]. It has been shown that adding redundancy at
transistor level can provide higher tolerance than adding redundancy at module and gate levels
by reducing area overhead [7].
Transistor-level redundancy. Transistor-level redundancy is based on the well-established
method of combating device shorts and opens through redundant series and parallel connections
[24][61]. In general, combining series and parallel structures can lead to both open- and short-
defect immune structures. In such designs, each transistor in the original design is replaced by an N²-transistor series-parallel/parallel-series structure, where N is the number of replicated transistors in each dimension [7]. Direct implementations of such structures provide poor performance when considering the correlation between transistors which share the same aligned CNTs. Authors in [7] propose a novel layout technique to address this problem by placing transistors in a staggered manner in the direction of CNT growth. Even though the efficiency of N²-transistor structures is improved in this way, these approaches generally have large area overheads.
CNT-level redundancy. Adding redundant CNTs within each transistor has been used as a
practical approach to improve yield by previous works. For example, in Stanford Nanotube
Computer [55], each CNFET comprises approximately 10~200 CNTs, depending on the relative
sizing of the CNFETs.
Previous research provides a rich set of approaches for tackling imperfections in CNFET
circuits. What is lacking are approaches for adding optimal amount of redundancy that maximize
yield while minimizing performance and area overheads. In this work, we propose an approach
which derives the optimal design by addressing these problems and explicitly providing gate
sizes for both logic circuits and memory arrays.
Background of CNFETs 1.20.
Over recent years, scaling in CMOS technology has been very aggressive. With ultra-thin dimensions, the technology faces many critical challenges and reliability issues. Aggressive scaling has resulted in increased short-channel effects, an exponential rise in leakage currents, process variations, degraded gate control of transistors, and excessive power densities. Carbon nanotubes (CNTs) have been identified as having the highest potential, in terms of risk-benefit ratio, for emerging logic applications such as nano field-effect transistors.
CNFETs are fabricated using CNTs, which are hollow, cylindrical nanostructures composed of a single sheet of carbon atoms, and have exceptional electrical, physical, and thermal properties [8][40][57]. A top view of a CNFET is reproduced in Figure 24 [86]. In order to obtain both p-type and n-type FETs using CNTs, the polarities of the FETs are controlled using metals with different work functions. As electrons and holes have almost the same mobility, the number of CNTs in the pull-up and pull-down networks is the same. We use N_tub for the number of CNTs of a CNFET or gate. In the rest of this section, we review CNFETs in terms of the defect model, yield model, area model, and SRAM cell design.
Figure 24. Structure of a CNFET with 4 CNTs (adapted from [86]); labeled in the figure are the source, drain, gate, channel length, gate width, CNT pitch, and CNT diameter.
Carbon nanotube (CNT) field-effect transistors (CNFETs) are one of the promising emerging
technologies for the next generation of highly energy-efficient electronics [5][43][47]. Since the
first CNFET was reported in 1998, great progress has been made in all the areas of CNFET
science and technology, including materials, devices, and circuits [86]. The first CNFET-based
computer, Stanford Nanotube Computer, was demonstrated in 2013 [57]. This 1-bit computer
has 178 p-type CNFETs and is fabricated using a 1 µm lithography process. This initial
demonstration has set the stage for the development of useful CNFET chips.
1.20.1. Capacitance and current model
A CNFET capacitance model is developed in [42], where the gate capacitance of a CNFET (C_g) mainly consists of three components: the gate-to-channel capacitance (C_gc), the outer fringe gate capacitance (C_of), and the gate-to-source/drain coupling capacitance (C_gtg). Both C_gc and C_of are strongly affected by the screening effect (or shielding) of neighboring channels, particularly for closely spaced channels. Specifically, in order to calculate the coupling capacitance between a CNT and the gate metal electrode, the total effect of the other CNTs can be lumped and approximated by the two nearest CNTs. This effect has a similar impact on current as well. The relationship between capacitance (current) and CNT pitch is reproduced in Figure 25 [42]. Parasitic capacitance is assumed to be equal to gate capacitance [50][87].
Figure 25. Screening effect on capacitance (C_gc) and current (I_d).
1.20.2. Area model
The minimum gate width (W_gate) of a typical transistor is 3λ, due to lithography limitations [39]. For 32nm technology, the minimum W_gate equals 48nm. If we consider 4nm as the minimum pitch value for CNTs [42], there can be up to 12 CNTs in a CNFET gate of minimum gate width. We use N_a,th for the maximum allowable N_tub in this scenario. If a CNFET gate has more than N_a,th CNTs, W_gate must be increased to cover all CNTs. Note that W_gate can only be increased in multiples of λ due to lithography limitations, i.e., W_gate can only take values such as 3λ, 4λ, 5λ, and so on. The width of a CNFET gate is computed using Equation (22), where P_tub is the CNT pitch and N_tub is the number of CNTs of the CNFET.

W_gate = MAX(3λ, ⌈(N_tub · P_tub) / λ⌉ · λ)    (22)
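A minimal sketch of Equation (22) follows; the numeric values of λ and the CNT pitch are the 32nm-technology numbers quoted above, used only for illustration:

import math

def gate_width(n_tub, lam=16.0, p_tub=4.0):
    """Equation (22): gate width in nm, rounded up to a multiple of lambda and
    never below the lithography minimum of 3*lambda (48nm at 32nm technology)."""
    return max(3 * lam, math.ceil(n_tub * p_tub / lam) * lam)

# Up to 12 CNTs fit in the minimum-width gate; a 13th CNT forces a wider gate.
print(gate_width(12), gate_width(13))   # 48.0, 64.0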
1.20.3. Defect model
CNFET is expected to have significantly higher variability compared to conventional silicon
CMOS, including 1) the presence of m-CNTs, 2) CNT diameter variations, 3) mis-positioned and
mis-aligned CNTs, and 4) CNT density variations. m-CNTs are the major source of variation, with the probability of a CNT being metallic typically between 8% and 32% [59][72][73]. This has a significant impact on yield and delay. For example, for the initial design of the 8-bit adder studied later in this chapter, yield at an 8% probability of m-CNTs is below 10⁻⁵.
CNFETs which are fabricated using the same aligned CNTs are fully correlated. As such
correlation exists between transistors in the direction of CNT growth, it is called single-direction
correlation [7]. CNFETs which are fabricated without sharing any CNTs, on the contrary, are
fully independent. Such correlation is especially important in SRAM designs due to their highly-
regular structure. Ideally, SRAM cells in a column (row), which share the same aligned CNTs,
will be fully correlated. However, due to CNT density variations and misalignment
[46][39][58][60], SRAM cells far from each other do not share the same CNTs even if they are
in the same column (row), and hence they are not correlated. In our defect model, for logic
circuits we assume that all transistors are independent. For SRAM, we assume that transistors in
each SRAM cell are correlated if they share the same CNTs, e.g., the two pull-up transistors
share two p-type CNTs, and that two different SRAM cells are independent of each other. We leave the inclusion of defects in the CMOS process used to fabricate CNFETs for future research.
1.20.4. Yield model
The presence of m-CNTs impacts both delay and static power consumption of logic gates. In
[73], gates are considered functional if their delay and static power consumption in the presence
of m-CNTs are less than the maximum allowable delay and static-power constraint. In [46] and
[51], a gate is counted as functional if each of its CNFETs has one or more s-CNTs after the
removal of m-CNTs. In this chapter we use the latter definition of yield at the gate level; however, while calculating yield, we do impose a desired maximum delay constraint at the circuit level.
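To make the yield definition concrete, here is a minimal sketch under the simplifying assumptions that CNTs are metallic independently with probability p_m and that m-CNTs are perfectly removed (so a CNFET fails only if all of its CNTs are metallic); the example values are placeholders:

def cnfet_yield(n_tub, p_metallic):
    """Probability that a CNFET still has at least one s-CNT after m-CNT removal."""
    return 1.0 - p_metallic ** n_tub

def gate_yield(n_tubs, p_metallic):
    """A gate is functional only if every one of its CNFETs is functional
    (independent CNFETs assumed, as for logic circuits in Section 1.20.3)."""
    y = 1.0
    for n in n_tubs:
        y *= cnfet_yield(n, p_metallic)
    return y

# Example: a 4-transistor gate with N_tub = 1 per transistor at p_m = 8%,
# versus the same gate with N_tub = 3 per transistor.
print(gate_yield([1, 1, 1, 1], 0.08), gate_yield([3, 3, 3, 3], 0.08))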
1.20.5. SRAM cell design
The classical 6-T SRAM cell design is shown in Figure 30, where M2/M5 are access transistors, M3/M6 are pull-up transistors, and M1/M4 are pull-down transistors. As CNFET-based circuits are fabricated using traditional silicon technology, design rules of CMOS technology also apply to CNFET-based circuit designs [57][58][62][80]. Cell height (H_SRAM) and width (W_SRAM) are computed using Equations (23) and (24), according to a layout design proposed in [86], where N_access, N_pu, and N_pd represent N_tub for the access, pull-up, and pull-down transistors, respectively, and P_poly is the pitch value for poly-silicon.

H_SRAM = MAX(P_cnt · N_access, W_gate) + 2 · P_poly + MAX(P_cnt · N_pd, W_gate) + MAX(P_cnt · N_pu, W_gate)    (23)
W_SRAM = 20λ    (24)

In SRAM cell design, a maximum pull-up ratio (R_pu,th) is required for write stability, and a minimum pull-down ratio (R_pd,th) is required for read stability [85]. The actual ratios, R_pu and R_pd, are computed using Equations (25) and (26). Hence an SRAM cell is working only if R_pu ≤ R_pu,th and R_pd ≥ R_pd,th. For example, the reported R_pd,th and R_pu,th are 0.46 and 0.61, respectively, for a CNT diameter (D_cnt) of 1.5nm [85].

R_pu = N_pu / N_access    (25)
R_pd = N_pd / N_access    (26)
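A minimal sketch of Equations (23)-(26) and the working-cell condition follows; the geometric parameter values are placeholders, except the stability thresholds quoted above for D_cnt = 1.5nm:

def sram_cell_dims(n_access, n_pd, n_pu, p_cnt=4.0, p_poly=32.0, lam=16.0):
    """Equations (23)-(24): cell height and width in nm for given CNT counts."""
    w_gate = 3 * lam                      # minimum gate width floor, as in Equation (22)
    height = (max(p_cnt * n_access, w_gate) + 2 * p_poly
              + max(p_cnt * n_pd, w_gate) + max(p_cnt * n_pu, w_gate))
    width = 20 * lam
    return height, width

def sram_cell_works(n_access, n_pd, n_pu, r_pu_th=0.61, r_pd_th=0.46):
    """Equations (25)-(26): write/read stability check for one cell."""
    r_pu = n_pu / n_access
    r_pd = n_pd / n_access
    return r_pu <= r_pu_th and r_pd >= r_pd_th

# Example: 4 CNTs per access transistor, 2 per pull-down, 2 per pull-up.
print(sram_cell_dims(4, 2, 2), sram_cell_works(4, 2, 2))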
A CNFET Design Methodology 1.21.
In this section, we propose a heuristic to size logic circuits so that the critical path delay is minimized. First we demonstrate that traditional logical effort cannot be used directly for CNFETs. Then we present our heuristic based on logical effort. The effectiveness of the heuristic is demonstrated through case studies on two common circuits, a ripple-carry adder (RCA) and an address decoder (AD). More accurate delay values are obtained from simulations using the Stanford CNFET library.
1.21.1. Problem Statement
Significant progress has been achieved on demonstration of functionality for CNFET-based
circuits, e.g., a ring oscillator [89] and a 1-bit carbon nanotube computer [55]. Now it is helpful
to look into performance-oriented issues in order to build fast circuits.
Our problem is to develop an approach that determines N_tub for each transistor so that the critical path delay of a circuit is minimized, under given design specifications.
Transistors in a logic circuit can be divided into three categories: (i) limited transistors, or Tr_lmt, whose sizes are constrained by the design specifications, e.g., the maximum input capacitance allowed by the preceding logic, or the sizes of transistors on the critical path determined by logical effort; (ii) don't-care transistors, or Tr_dc, which are not directly driven by any transistors on the critical path; and (iii) down-scaled transistors, or Tr_ds, which are directly driven by transistors on the critical path (as they increase the internal or load capacitance, they have an impact on critical path delay).
To minimize critical path delay, (i) Tr_lmt, which usually compose the pull-up and pull-down networks of the critical path, should be assigned relatively large sizes as they provide the driving current, and (ii) Tr_ds, which do not compose pull-up or pull-down networks but add to internal or load capacitance, should be assigned the minimum size. In this work, we assign the minimum size, i.e., N_tub = 1, to both Tr_ds and Tr_dc. Our approach then determines N_tub for each transistor in Tr_lmt.
1.21.2. Key Ideas
Library-based simulations provide accurate and reliable estimates of circuit performance based on a given library. However, this is not an efficient way to identify the optimal design, as (i) the number of transistors in a circuit is usually large, which requires substantial simulation time, and (ii) the design space is usually large, especially for new technologies, where more variations have to be considered. The Elmore delay model enables fast estimation of critical path delays and fast gate sizing without the complexity of constructing libraries and simulating circuits. The optimal circuit design can then be obtained analytically, i.e., by enumerating different circuit configurations.
In CMOS technology, the input and output capacitance (C) of a functional gate are both linearly dependent on the channel width [62][80]. At the same time, the drain current (I) is linearly dependent on the channel width as well. As a result, the RC product (which can be computed as C/I) is a constant regardless of the gate width. Since logical effort (g) is then a constant, the logical effort approach has been used as an efficient way to estimate path delay and size gates without enumerating design configurations [34].
For CNFET technology, however, the gate capacitance and drain current of a transistor are no longer linearly dependent on N_tub. (Note that N_tub in the context of CNFETs corresponds to the channel width in the context of CMOS.) Based on Figure 25, we are able to compute the gate capacitance and drain current of a transistor with a given N_tub. The resulting RC product, as well as the gate capacitance (approximated using C_gc) and drain current, are shown in Figure 26. It can be seen that current and capacitance change with N_tub in different manners. Hence the RC product is not a constant, and the idea of logical effort does not apply to CNFETs in a strict sense. However, as N_tub increases beyond a certain point, marked as the "break point" at N_tub = 10 in Figure 26, the RC product does not change much. In other words, the logical effort approach may still be applicable to CNFETs which have 10 or more CNTs, i.e., N_tub ≥ 10. In this work, we propose a hybrid approach for delay estimation and gate sizing: we enumerate N_tub for transistors whose sizes are below a certain threshold, and apply the logical effort approach to quickly identify the optimal N_tub for transistors above this threshold.
Figure 26. Ratio of capacitance and current for an inverter versus N_tub.
1.21.3. The Proposed Heuristic
We use N_le,th to represent the threshold below which a transistor cannot be sized using the logical effort approach. In our approach, we first identify small transistors, i.e., transistors with N_tub < N_le,th. For logic circuits without branches, gates at the beginning of each path always have smaller sizes than the gates at its end; hence transistors with the smallest N_tub always appear in the first few stages. For logic circuits with branches, transistors in the few stages following each branch might have N_tub < N_le,th, especially when the fan-out is high.
We then apply the following heuristic to a circuit, given the input transistor sizes (which are subject to the design specifications) and the branch information. Figure 27 illustrates the proposed heuristic for two scenarios, i.e., without and with branches.
The proposed heuristic. We start by enumerating N_tub for transistors which might be logical-effort-infeasible, including those in the first few stages of the circuit and in each branch. During enumeration, transistor sizes are constrained to be greater than or equal to those of their predecessors. If the transistors in stage i have N_tub ≥ N_le,th, we apply the logical effort approach to the following transistors, from stage i to the next branch point, which are logical-effort-feasible.
Figure 27. The proposed heuristic. (a) Without branches: the stage sizes shown (1, 4, 16, 64, 256, 1K) are obtained with the logical effort approach. (b) With branches: the stage sizes shown are 4, 12, 48, 192, 64, 256, and 1K; the early post-branch stages are enumerated, while the remaining stages are sized by logical effort.
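A minimal sketch of this hybrid enumerate-then-logical-effort flow follows (this is not the dissertation's implementation; the delay model and the logical-effort sizing routine are caller-supplied placeholders):

import itertools

def size_path_hybrid(n_first, n_rest, elmore_delay, le_size_rest, max_ntub=12):
    """Exhaustively enumerate N_tub for the first n_first stages (non-decreasing
    sizes), size the remaining n_rest stages with a logical-effort routine, and
    keep the configuration with the smallest Elmore delay."""
    best_sizes, best_delay = None, float("inf")
    for head in itertools.product(range(1, max_ntub + 1), repeat=n_first):
        if any(a > b for a, b in zip(head, head[1:])):
            continue                               # enforce non-decreasing sizes
        sizes = list(head) + list(le_size_rest(head, n_rest))
        delay = elmore_delay(sizes)
        if delay < best_delay:
            best_sizes, best_delay = sizes, delay
    return best_sizes, best_delay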
1.21.4. Experimental Studies
In this section, we evaluate the proposed heuristic by conducting two case studies on ripple-
carry-adder and address decoder.
1.21.4.1. Ripple-Carry-Adder (RCA)
Figure 28 shows the transistor-level schematic of a 1-bit full adder built with compound gates,
where the complement of Carry_i is used to compute Sum_i. The Elmore delay model is shown in Figure 29, where the resistance variables (i.e., R_a, R_b, and R_c) and capacitance variables (i.e., C_a, C_b, and C_c) are obtained using Equations (27)-(30). Note that C_Di, C_Gi, and R_i represent the drain capacitance, gate capacitance, and resistance of CNFET i.
In a typical RCA design built by cascading such 1-bit full adders, carry_i and carry_{i+1} are on the critical path. Usually the sizes of the input transistors of the critical path are subject to particular constraints imposed by the design specifications, and the subsequent transistors on the critical path are then sized accordingly. Hence, in an RCA design, Tr_lmt includes the transistors which compose the pull-up and pull-down networks of the complement of Carry_i (and hence of carry_i), e.g., CNFETs 1-6, 13, and 14, which are highlighted with ellipses in Figure 28. Tr_ds includes the transistors which add internal or load capacitance to these nodes, e.g., CNFETs 8, 9, 11, 12, 17, and 18. The remaining transistors, which are highlighted with rectangles, belong to Tr_dc.
For simplicity, we assign N_tub = 1 to both Tr_ds and Tr_dc. Based on the Elmore delay model, we enumerate all possible combinations of N_tub for the transistors in Tr_lmt and obtain the optimal N_tub configurations for different N_tub values of the input transistors (i.e., CNFETs 3 and 4) of an 8-bit RCA, as shown in Table 16, where CNFET 4 is used as a measure of the input driving capability. The design obtained from this exhaustive approach is taken as the ground truth for the minimum-delay design. We also obtain designs using the proposed heuristic and the logical effort approach.
Then we compute delay values using Elmore delay model for the designs which are obtained
using the three approaches, i.e., exhaustive approach, the proposed heuristic, and logical effort
approach. Results show that the proposed heuristic provides the same optimal design as the
exhaustive approach, which is different from that obtained by the logical effort approach. These
different designs are shown in Table 16. The delay difference between the two designs decreases as N_tub increases for Tr_lmt. Also, there is not much difference between the obtained optimal delays once CNFET 4 has more than 16 CNTs. It is also important to note that in this regime all other Tr_lmt have more than 10 CNTs. This supports our prediction in the previous section, i.e., the logical effort approach may still be applicable beyond the "break point" at N_tub = 10 in Figure 26.
Figure 28. Schematic of a 1-bit full adder built with compound gates (CNFETs are numbered 1-26; the legend distinguishes Tr_lmt, Tr_ds, and Tr_dc, and the Carry_i, complemented Carry_i, and Sum_i nodes are labeled).
Figure 29. Elmore delay model of the 1-bit full adder (an RC ladder with resistances R_a, R_b, R_c and node capacitances C_a, C_b, C_c).

C_a = C_D4 + C_D5 + C_D6    (27)
C_b = C_D3 + C_D4 + C_D8 + C_D9 + C_G11 + C_G12 + C_G13 + C_G14    (28)
C_c = C_G3 + C_G4 + C_G15 + C_G16 + C_G17 + C_G18    (29)
R_a = R_5,  R_b = R_4,  R_c = R_13    (30)
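For concreteness, a minimal sketch of the Elmore delay of the RC ladder in Figure 29 follows, using the standard formulation (each node capacitance weighted by the total upstream resistance); the numeric R and C values are arbitrary placeholders:

def elmore_delay(resistances, capacitances):
    """Elmore delay of an RC ladder: sum over nodes of C_k times the total
    resistance on the path from the source to node k."""
    delay, r_upstream = 0.0, 0.0
    for r, c in zip(resistances, capacitances):
        r_upstream += r
        delay += r_upstream * c
    return delay

# Ladder of Figure 29 with placeholder values for (R_a, R_b, R_c) and (C_a, C_b, C_c):
print(elmore_delay([1.0, 2.0, 3.0], [0.5, 0.4, 0.3]))
# = R_a*C_a + (R_a+R_b)*C_b + (R_a+R_b+R_c)*C_c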
Table 16. Elmore delays for the RCA with different Tr_lmt sizes, using the three approaches.
(N_4, N_5, N_13, N_dc, N_ds) | Exhaustive / proposed heuristic | Logical effort approach
( 1,  1,  1, 1, 1)           | 13.60                           | 15.30
( 2,  2,  5, 1, 1)           |  8.76                           |  8.51
( 4,  4,  7, 1, 1)           |  5.97                           |  5.69
( 8,  8, 11, 1, 1)           |  4.06                           |  3.94
(16, 16, 12, 1, 1)           |  3.16                           |  3.09
(32, 32, 24, 1, 1)           |  2.67                           |  2.67
1.21.4.2. Address decoder
The address decoder (AD) is one of the most commonly used logic circuits and an important component of memory design. A typical address decoder in an SRAM design has identical paths, where each path drives a word line; as a result, all paths are critical paths. Hence, the address decoder is a special case of logic circuits in terms of adding redundant CNTs, as it comprises only Tr_lmt. In this work, we use a 4-to-16 decoder as a case study and assume each word line drives 16 SRAM cells. The gate capacitance of an access transistor is C_g; since each SRAM cell has two access transistors, i.e., M2 and M5 in Figure 30, the load capacitance of each word line is 32·C_g.
Table 19 shows delays for address decoders with different input capacitance constraints.
Results show that the proposed heuristic provides more accurate delay estimations than logical
effort approach. Especially the propose heuristic provides the same designs as the exhaustive
approach when 𝑁 𝑡𝑢𝑏 ≥ 4 for the primary input CNFETs.
Figure 30. 6T CNFET SRAM cell [34] (transistors M1-M6; access transistors M2 and M5 connect the cell to the bit and bit-bar lines and are gated by the word line).
Table 17. Elmore delays for AD with different Tr_lmt using the three approaches.

Gate sizes                  Exhaustive    Proposed heuristic    Logical effort approach
(1, 5, 12, 19, 29, 44)      4.71          4.96                  5.12
(2, 8, 17, 33)              3.19          3.45                  3.49
(4, 12, 22, 38)             2.65          2.65                  2.96
(8, 14, 24, 40)             2.29          2.29                  2.58
(16, 24, 34, 47)            2.03          2.03                  2.28
(32, 46)                    1.01          1.01                  1.15
1.22. Problem Statement
Significant progress has been made on demonstrating the functionality of CNFET-based circuits, e.g., a ring oscillator [86] and a 1-bit carbon nanotube computer [55]. Hence the time is right to address the low yield and performance variations caused by fabrication imperfections in order to design complex CNFET-based circuits. Through a study of commonly used functional blocks, e.g., inverter chains, ripple-carry adders, and SRAM arrays, we have identified the following characteristics of both logic blocks and memory arrays from the perspective of adding redundancy.
1) Finer-level redundancy, e.g., transistor-level and CNT-level redundancy, must be considered, since the defect density in CNFETs is significantly higher than that in CMOS. In particular, the probability of m-CNTs is the dominant concern.
2) Yield problems usually result from transistors with small N_tub values. Additional CNTs should be used in these transistors. Adding CNTs to small transistors results in zero area overhead, according to the area model presented in Section 1.20.2.
3) For logic circuits, the use of additional CNTs creates zero delay penalty for transistors that are not directly driven by transistors on the critical path. At the same time, the use of additional CNTs creates a non-zero delay penalty for transistors that are directly driven by transistors on the critical path, as the added CNTs increase internal or load capacitance. Furthermore, the weaker the preceding transistors, the higher the delay penalty.
4) For memory arrays, the addition of CNTs to transistors in SRAM cells results in substantial area overhead when N_access, N_pd, and N_pu exceed certain values, which leads to diminishing yield per area. At the same time, the traditional spare columns (rows) approach becomes inefficient, as the probability of each spare column (row) being defective is too high for CNFET-based SRAMs.
Our problem is to develop an approach that determines N_tub for each transistor so that the yield/area (yield per area) of a circuit is maximized by using the available redundancy under the given delay constraints. We consider the following two sub-problems for logic circuits and SRAM arrays, respectively.
Logic circuits. Based on the above observations, we separate transistors into three categories: (i) limited transistors, or Tr_lmt, whose sizes are limited by particular constraints required by circuit specifications, e.g., the maximum input capacitance specified by preceding logic blocks; (ii) don't-care transistors, or Tr_dc, which are not directly driven by any transistors on the critical path; and (iii) down-scaled transistors, or Tr_ds, which are directly driven by transistors on the critical path (as they increase the internal or load capacitance, they have an impact on the critical path delay).
Furthermore, the same change in the number of CNTs, i.e., ΔN_tub, in a CNFET in Tr_ds that is driven by a small Tr_lmt has a greater impact on the critical path delay than in one driven by a large Tr_lmt. Hence we separate Tr_ds into different categories based on the N_tub of the Tr_lmt in their fanin.
Our approach determines (i) the categories of transistors whose N_tub should be increased, (ii) the order in which these categories are processed, and (iii) the number of additional CNTs to be added.
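As an illustration only, the sketch below restates the three categories as a classification routine over a hypothetical netlist abstraction in which each transistor records which critical-path transistor (if any) drives its gate; the data structures and the example driven_by entries are assumptions, not the author's tool.

```python
# Illustrative classification into Tr_lmt, Tr_ds, and Tr_dc (see definitions above).
# `driven_by` maps a transistor id to the critical-path transistor (or None)
# whose gate output drives this transistor's gate; this abstraction is assumed.

def classify(transistors, critical_path, driven_by):
    tr_lmt, tr_ds, tr_dc = set(), set(), set()
    for t in transistors:
        if t in critical_path:
            tr_lmt.add(t)                      # sized under specification constraints
        elif driven_by.get(t) in critical_path:
            tr_ds.add(t)                       # loads the critical path, keep small
        else:
            tr_dc.add(t)                       # no delay impact, free to add CNTs
    return tr_lmt, tr_ds, tr_dc

# Example loosely based on the RCA of Figure 28 (CNFETs 23-26 omitted;
# the exact driven_by pairs are illustrative assumptions).
critical = {1, 2, 3, 4, 5, 6, 13, 14}
driven_by = {8: 4, 9: 5, 11: 4, 12: 5, 15: 13, 16: 13, 17: 14, 18: 14}
print(classify(range(1, 23), critical, driven_by))
```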
SRAM arrays. Spare columns (rows) techniques are well established for CMOS SRAM designs [41][74]; based on them, we can identify the optimal size of sub-arrays, the number of spare columns (rows) to be added, and so on. We explore a way to integrate the redundant CNTs approach with a spare columns (rows) approach.
Our approach determines (i) N_access, N_pu, and N_pd, (ii) the size of the sub-arrays to which spare columns (rows) are added, (iii) the number of spare columns (rows) to be added, and (iv) N_tub for the transistors in the reconfiguring logic incurred due to the spare columns (rows).
1.23. Redundancy Design for Logic Circuits
In an ideal scenario where defect density is zero, we use our previously-developed heuristics
for delay estimation and gate sizing [17]. In realistic scenarios, we propose a redundant CNTs
approach to improve yield/area. In this section, we use an 8-bit ripple-carry-adder (RCA) and a
12-bit address decoder as two case studies, where the probability of a CNT being metallic is 8%
[59][72][73]. We use yield/area as the metric for evaluation, where the area of each redundant
design is normalized to the area of the circuit with the minimum 𝑁 𝑡𝑢𝑏 required for functionality.
All final designs are evaluated based on realistic delay values obtained from simulations using
the Stanford CNFET library [42]. Delay values are obtained for circuits with all CNTs being
semiconducting.
1.23.1. Nominal design
We first introduce how we estimate the critical path delay. Then we carry out case studies on two functional logic designs: the ripple-carry adder and the address decoder.
1) Estimation of critical path delay
During intermediate stages of circuit design, designers use the Elmore delay model for fast estimation of path delays. The final designs are evaluated using accurate simulations. In CMOS technology, both capacitance (C) and drain current (I) are linearly dependent on channel width for a given functional gate [62][80]. As a result, the RC product (which can be computed as C/I) is constant regardless of gate width, which makes the logical effort (g) a constant. Hence, the logical effort approach has been used as an efficient approach for estimating path delay while sizing gates in CMOS circuits [34]. However, Section 1.21 shows that the logical effort approach is not applicable when CNFET size is below a certain value. Section 1.21 also proposes a logical-effort-based heuristic to estimate delays and size gates for CNFET-based logic circuits. This heuristic is used to construct the nominal design with the desired functionality and delay. (We always obtain the delay of final designs using accurate simulations.)
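For reference, the standard CMOS logical-effort path delay estimate [34] that the heuristic builds on can be written in a few lines; the sketch below uses the usual per-stage delay d = g·h + p, and the g/p values in the example are illustrative textbook numbers rather than CNFET-calibrated parameters.

```python
# Minimal sketch of the standard logical-effort path delay estimate [34]:
# per-stage delay d_i = g_i * h_i + p_i, with stage effort h_i = C_out_i / C_in_i.
# The g/p values used in the example are illustrative textbook numbers.

def path_delay(stages, c_load):
    """stages: list of (g, p, c_in) per stage; c_load: load on the last stage."""
    total = 0.0
    for i, (g, p, c_in) in enumerate(stages):
        c_out = stages[i + 1][2] if i + 1 < len(stages) else c_load
        h = c_out / c_in                  # electrical effort of this stage
        total += g * h + p                # stage delay in units of tau
    return total

# Example: 3-stage path (inverter, 2-input NAND, inverter) driving a load of 64.
stages = [(1.0, 1.0, 1.0), (4/3, 2.0, 3.0), (1.0, 1.0, 6.0)]
print(path_delay(stages, c_load=64.0))
```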
2) Ripple-carry adder design with minimal delay
Figure 28 shows the transistor-level schematic of a 1-bit full adder built with compound gates, where Carry_i-bar is used to compute Sum_i. In a typical ripple-carry adder (RCA) design built by cascading such 1-bit full adders, Carry_i and Carry_{i+1} are on the critical path. Usually the sizes of the input transistors of the critical path are subject to particular constraints due to design specifications, and the subsequent transistors on the critical path are sized accordingly. Hence, in an RCA design, Tr_lmt includes the transistors that compose the pull-up and pull-down networks for Carry_i-bar or Carry_i, i.e., CNFETs 1-6, 13, and 14, which are highlighted in ellipses in Figure 28. Tr_ds includes the transistors that add internal or load capacitance to Carry_i-bar or Carry_i, e.g., CNFETs 8, 9, 11, 12, 17, and 18. The remaining transistors, which are highlighted in rectangles, belong to Tr_dc.
We size Tr_lmt so that the critical path delay is minimized using our logical-effort-based heuristic [17]. In the original designs, Tr_ds are uniformly assigned the minimum size used in practice, to reduce the capacitance on the critical path, and Tr_dc are assigned the minimum size for power reasons. We use N_i to denote the N_tub of CNFET i. Table 18 shows yield/area and performance for nominal designs with different values of N_4, i.e., ND1-ND5, where N_4 is a measure of the input driving capability required by design specifications. As the Tr_lmt, e.g., CNFETs 4, 5, and 13, usually each have a distinct N_tub in an optimal-delay design, we specify N_tub values for each Tr_lmt instead of assigning them a uniform value. (Due to symmetry, N_1 = N_2 = N_5 = N_6, N_3 = N_4, and N_13 = N_14.) We observe that the delay improvement diminishes as N_4 increases. We will apply our approach to the five designs in Table 18.
Table 18. Nominal designs (ND) with different Tr_lmt for RCA (N_dc denotes N_7, N_10, N_19, N_20, N_21, N_22; N_ds denotes N_8, N_9, N_11, N_12, N_15, N_16, N_17, N_18).

                                                  Delay from simulations
     (N_4, N_5, N_13, N_dc, N_ds)   Yield/area    Rise (ns)    Fall (ns)
ND1: ( 1,  1,  1, 1, 1)             2.94E-08      0.039        0.040
ND2: ( 2,  2,  5, 1, 1)             4.48E-06      0.037        0.039
ND3: ( 4,  4,  7, 1, 1)             6.10E-06      0.028        0.028
ND4: ( 8,  8, 11, 1, 1)             6.10E-06      0.025        0.025
ND5: (16, 16, 12, 1, 1)             4.96E-06      0.022        0.022
3) Address decoder
As one of the most commonly used logic circuits and an important component of memory design, a typical address decoder in SRAM designs consists of identical paths, each driving a word line. As a result, all paths are critical paths. Hence, the address decoder is a special case among logic circuits in terms of adding redundant CNTs, as it comprises only Tr_lmt. A 12-bit address decoder with 4-bit pre-decoding is illustrated in Figure 31, where each word line drives 4,096 cells for a 2MB cell array [21]. C_g denotes the gate capacitance of an access transistor. As each SRAM cell has two access transistors, i.e., M2 and M5 in Figure 30, the load capacitance for each word line is 8,192 C_g.
Figure 31. A 12-bit address decoder with 4-bit pre-decoding [21] (pre-decoding of A[3:0], A[7:4], and A[11:8] drives 4,096 word lines, each loaded by the SRAM array).
Table 19 shows the Tr_lmt and yield/area for address decoders under different constraints. Due to the large load capacitance from the SRAM arrays, most Tr_lmt have large N_tub. Hence yield is not a prime concern for these transistors. The yield of the address decoder mainly depends on the first few stages, where the gates have relatively few CNTs, e.g., the first stage in the five designs shown in Table 19.
Table 19. Nominal decoder designs with different Tr_lmt.

Gate sizes for a 12-stage address decoder                                        Yield/area
( 1,  9,  48, 242, 1215, 4365, 15671, 56260, 201974, 725087, 325383, 4563)       1.7E-01
( 2, 12,  57, 264, 1215, 4365, 15671, 56260, 201974, 725087, 325383, 4563)       3.4E-01
( 4, 18,  74, 301, 1215, 4365, 15671, 56260, 201974, 725087, 325383, 4563)       5.8E-01
( 8, 29, 102, 353, 1215, 4365, 15671, 56260, 201974, 725087, 325383, 4563)       8.3E-01
(16, 48, 142, 416, 1215, 4365, 15671, 56260, 201974, 725087, 325383, 4563)       9.7E-01
As the address decoder has no Tr_dc or Tr_ds, no special effort is needed when adding redundant CNTs. Hence, in the rest of this section, we skip the address decoder and focus on the RCA. The load capacitance of the address decoder depends strongly on the sizing of the memory arrays after CNTs are added to each SRAM cell. As decoder delay, which is affected by this load capacitance, is an important metric, we will revisit the address decoder when we study memory arrays in Section 1.24.
1.23.2. N²-transistor redundancy approach
The N²-transistor structure can be implemented in two forms: series-of-parallel (SP) and parallel-of-series (PS) structures [7]. The yield models are given in Equations (31) and (32) for PS and SP, respectively, where P_s,CNFET and P_o,CNFET represent the probabilities of a CNFET being short and open. We assume P_s,CNFET = P_o,CNFET = 8% in our case study; hence Y_ps = Y_sp. We apply PS (or SP, as the two are equivalent under this assumption) to the transistors that cause yield problems due to their low N_tub. Table 20 shows redundancy designs based on N²-transistor structures, where N is the number of replications for a transistor. As can be seen by comparing Table 18 and Table 20, although the N²-transistor structure approach improves yield/area significantly, the final yield/area obtained is still very low. The delay overhead is also significant, especially for small N_tub values.
Y_ps = (1 − (P_s,CNFET)^N)^N − [1 − (1 − P_o,CNFET)^N]^N    (31)

Y_sp = (1 − (P_o,CNFET)^N)^N − [1 − (1 − P_s,CNFET)^N]^N    (32)
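As a quick check of these expressions, the short sketch below evaluates Y_ps and Y_sp for the assumed P_s,CNFET = P_o,CNFET = 0.08; these are per-structure yields and do not include the area normalization used in Table 20.

```python
# Yield of an N^2-transistor structure per Equations (31) and (32),
# assuming P_short = P_open = 0.08 as in the case study.

def y_ps(n, p_s=0.08, p_o=0.08):
    # parallel-of-series: fails if a branch is all-short or every branch has an open
    return (1 - p_s**n) ** n - (1 - (1 - p_o) ** n) ** n

def y_sp(n, p_s=0.08, p_o=0.08):
    # series-of-parallel: dual expression with the short/open roles exchanged
    return (1 - p_o**n) ** n - (1 - (1 - p_s) ** n) ** n

for n in (1, 2, 4):
    print(n, round(y_ps(n), 4), round(y_sp(n), 4))
# With p_s == p_o the two structures give identical yields, as noted in the text.
```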
Table 20. Redundancy designs using PS (RDPS) for different designs.

                                                           Delay from simulations
        (N_4, N_5, N_13, N_dc, N_ds)   Yield/area    N     Rise (ns)    Fall (ns)
RDPS1   ( 1,  1,  1, 1, 1)             3.9E-04       4     0.269        0.279
RDPS2   ( 2,  2,  5, 1, 1)             5.9E-02       4     0.134        0.139
RDPS3   ( 4,  4,  7, 1, 1)             8.0E-02       4     0.083        0.086
RDPS4   ( 8,  8, 11, 1, 1)             8.0E-02       4     0.052        0.054
RDPS5   (16, 16, 12, 1, 1)             7.9E-02       4     0.046        0.048
1.23.3. Our redundant CNTs approach
Tr_dc and Tr_ds have the same impact on yield and area, i.e., the same increase of N_tub for a transistor in either category results in the same improvement in yield and the same area overhead. We use Θ(ΔN_tub) to represent the impact on the critical path delay due to an increase of ΔN_tub for a transistor. By definition, Tr_dc have zero Θ(ΔN_tub) whereas Tr_ds have non-zero Θ(ΔN_tub). Furthermore, Tr_ds that are driven by small Tr_lmt have greater Θ(ΔN_tub) than those driven by large Tr_lmt. For clarity, we separate Tr_ds into different categories based on the N_tub of the Tr_lmt in their fanin. We denote and sort these categories as Tr_ds^1, Tr_ds^2, Tr_ds^3, and so on, where the superscript 1 indicates that the N_tub of the Tr_lmt in their fanin is maximal. Hence Tr_ds^1 has the minimum impact on delay among all categories of transistors for the same value of ΔN_tub. Consequently, we should increment ΔN_tub for Tr_ds^1 first. We use a truncated-exhaustive approach, sketched below, that exploits this observation to improve yield/area.
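A minimal sketch of how such a truncated-exhaustive sweep could be organized is given below; N_dc is swept widely because it carries no delay cost, while the Tr_ds categories are swept only over small values because they load the critical path. The yield/area and delay callables are placeholders for the models described in this chapter, not the author's implementation.

```python
# Illustrative truncated-exhaustive sweep over (N_dc, N_ds1, N_ds2).
# yield_per_area() and delay() stand in for the yield and Elmore delay models
# used in this chapter; they are assumptions, not the author's implementation.
from itertools import product

def best_config(yield_per_area, delay, delay_budget, max_dc=16, max_ds=4):
    best, best_ypa = None, -1.0
    for cfg in product(range(1, max_dc + 1),   # Tr_dc: no delay cost, swept widely
                       range(1, max_ds + 1),   # Tr_ds^1 and Tr_ds^2: kept small
                       range(1, max_ds + 1)):  # because they load the critical path
        if delay(*cfg) > delay_budget:         # truncate: discard over-budget points
            continue
        ypa = yield_per_area(*cfg)
        if ypa > best_ypa:
            best, best_ypa = cfg, ypa
    return best, best_ypa

# Usage (hypothetical models): best_config(my_ypa, my_delay, 1.2 * nominal_delay)
```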
We apply this truncated-exhaustive approach to the designs with the CNT configurations shown in Table 18, where each nominal design has a particular N_tub for its Tr_lmt. We divide Tr_ds into two categories, i.e., Tr_ds^1 and Tr_ds^2. Tr_ds^1 contains CNFETs 8, 9, 11, and 12, which are driven by the same Tr_lmt, i.e., CNFETs 1-6. Tr_ds^2 contains CNFETs 15-18, which are driven by the same Tr_lmt, i.e., CNFETs 13 and 14. Nominal designs are identified by the particular N_tub of their Tr_lmt, i.e., CNFETs 4, 5, and 13. We consider different redundant CNT designs, i.e., different (N_dc, N_ds^1, N_ds^2), for each of these nominal designs.
Figure 32 shows the yield/area and Elmore delays obtained for the nominal designs and five redundant designs, where each curve represents a specific N_tub configuration for Tr_lmt, i.e., (N_4, N_5, N_13). (N_tub for the other Tr_lmt can be derived from symmetry.) The nominal designs are highlighted using a dashed rectangle. Note that the delay values we report are obtained for designs without any m-CNTs. While some of the fabricated circuits will have these delay values, the delays of other fabricated circuits, namely those with one or more m-CNTs, will be different, and likely somewhat greater, than the reported values. However, the adopted delay estimation approach serves the purpose of comparing different redundancy configurations, since the relation between different redundancy designs remains the same whether m-CNTs are considered or not. Table 21 shows (N_dc, N_ds^1, N_ds^2) for the six designs, which correspond to the six markers on each curve. We built an 8-bit RCA in Cadence using the Stanford CNFET library [42]. Critical path delays for Redundant Design 5, which is taken as the optimal design for ND1-ND5, are obtained from simulations and shown in Table 22. Note that for these designs we have N_ds = N_ds^1 = N_ds^2 = 3.
Figure 32. Yield/area and delay for different redundant RCA designs, where each curve represents a particular N_tub configuration for Tr_lmt, i.e., (N_4, N_5, N_13).
Table 21. (N_dc, N_ds^1, N_ds^2) for the six redundant CNT designs (N_dc denotes N_7, N_10, N_19, N_20, N_21, N_22; N_ds^1 denotes N_8, N_9, N_11, N_12; N_ds^2 denotes N_15, N_16, N_17, N_18).

           Nominal design    Design 1    Design 2    Design 3    Design 4    Design 5
N_dc       1                 12          12          12          12          12
N_ds^1     1                 1           1           2           2           3
N_ds^2     1                 1           2           2           3           3
Table 22. Optimal redesigns using redundant CNTs (RDC) for ND1-ND5 (N_dc denotes N_7, N_10, N_19, N_20, N_21, N_22; N_ds denotes N_8, N_9, N_11, N_12, N_15, N_16, N_17, N_18).

                                                     Delay from simulations
       (N_4, N_5, N_13, N_dc, N_ds)   Yield/area     Rise (ns)    Fall (ns)
RDC1   ( 1,  1,  1, 12, 3)            4.7E-03        0.081        0.086
RDC2   ( 2,  2,  5, 12, 3)            7.1E-01        0.050        0.054
RDC3   ( 4,  4,  7, 12, 3)            9.7E-01        0.036        0.038
RDC4   ( 8,  8, 11, 12, 3)            9.7E-01        0.029        0.030
RDC5   (16, 16, 12, 12, 3)            9.7E-01        0.025        0.025
We make the following observations from Table 18, Table 20, and Table 22:
1) Increasing N_tub for Tr_lmt can efficiently reduce the critical path delay, especially when Tr_lmt have small N_tub. This is clear when we compare the different nominal designs.
2) Increasing N_tub for Tr_lmt can increase yield/area. This is also clear when we compare the different nominal designs, though the increase is limited by the small N_tub values of N_dc, N_ds^1, and N_ds^2.
3) Yield/area increases by orders of magnitude as N_tub is increased for Tr_dc, Tr_ds^1, and Tr_ds^2, as can be seen along each curve. The maximal yield/area is limited by the N_tub of Tr_lmt. The delay overhead is lower if Tr_lmt have higher N_tub values, due to their stronger driving capability, in other words, a smaller C/I.
1.24. Redundancy Design for Memory
An SRAM module consists of a cell array, an address decoder, and control logic [62][80]. In this section, we first design a nominal SRAM array and a corresponding address decoder that minimize delay for an ideal scenario, i.e., when the defect density is zero. Then we propose two redundancy approaches for SRAM arrays: (1) a redundant CNTs approach, and (2) a hybrid approach that integrates the redundant CNTs approach with the traditional spare columns (rows) approach.
We use a 2MB SRAM (as in each core of the Intel i7-3770K) as a case study to evaluate our methodology quantitatively in terms of yield/area. We assume that this SRAM array has 4K columns and 4K rows. We use yield/area as the metric for evaluation, where the area of each redundant SRAM array design is normalized to the area of the SRAM array design with the minimum N_tub required for functionality and no spare columns (rows). At the end, we discuss the impact of the redundancy designs on decoder delay.
1.24.1. Nominal design
In this section, we design a nominal SRAM array for an ideal scenario, where the defect density equals zero. For such a design, we minimize N_tub in each SRAM cell while maintaining the required pull-up and pull-down ratios, i.e., R_pu and R_pd. By minimizing N_access, the load capacitance of each word line in the address decoder is minimized. The corresponding N_pu and N_pd are computed using Equations (25) and (26). The obtained CNT configuration (N_access, N_pd, N_pu) is (2, 1, 1). We use this cell design as the nominal SRAM cell design. Such an SRAM cell is functional only if all 6 CNFETs (4 CNTs in total, with single-direction correlation considered) contain no m-CNTs, according to Equations (25) and (26). As each CNT has a probability of 8% of being metallic, the yield of a cell is (92%)^4 = 0.71. Hence the yield of the SRAM array, which has 16M cells, is close to zero.
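The arithmetic behind this observation is simple to verify directly; the snippet below just recomputes the cell yield 0.92^4 and the resulting yield of a 16M-cell array.

```python
# Cell and array yield for the nominal (2, 1, 1) SRAM cell:
# 4 effective CNTs per cell (single-direction correlation), p_metallic = 0.08.
p_semiconducting = 1 - 0.08
y_cell = p_semiconducting ** 4            # ~0.716
n_cells = 16 * 2**20                      # 16M cells in the 2MB array
y_array = y_cell ** n_cells               # underflows to 0.0 in floating point
print(round(y_cell, 3), y_array)
```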
1.24.2. Spare columns approach
Previous studies show that adding spare columns (rows) can greatly improve yield/area for memory modules [41][74]. Consider an original memory array design that comprises m_row,o rows and m_col,o columns. As spare rows require adding a demux between the address decoder and the SRAM array, memory access delay increases. Hence spare columns are used in our case study.
In our approach, we traverse different spare-column designs in two dimensions: (i) the sizes of the sub-arrays, and (ii) the number of spare columns added to each sub-array, while keeping (N_access, N_pd, N_pu) as in the above nominal design, i.e., (2, 1, 1). The yield of a memory array with spare columns can be calculated using Equation (33), where m_col,s is the number of spare columns, m_col,tot is the total number of columns (i.e., m_col,tot = m_col,o + m_col,s), and
Y_col = (Y_cell)^(m_row,o) is the yield of a column of cells. Y_cell is computed using the (N_access, N_pd, N_pu) of the SRAM cells with single-direction correlation considered. The yield of the reconfiguring interconnects (e.g., the demux) for each redundant CNT configuration is taken into account in our yield model, as shown in Equation (34); the N_tub values of the CNFETs in these interconnects vary with m_col,s and with the (N_access, N_pd, N_pu) of the SRAM cells.

Y_array = Σ_{i=0}^{m_col,s} ( m_col,tot choose m_col,o+i ) · Y_col^(m_col,o+i) · (1 − Y_col)^(m_col,s−i)    (33)

Y = Y_array · Y_interconnect    (34)
The area overheads are due to the additional CNTs as well as the spare SRAM columns (rows). The former can be computed using Equations (22)-(24). The latter is illustrated as follows. We design a layout for the demux, shown in Figure 33. Both bit and bit-bar lines are required for each spare column. Taking spare columns as an example, as m_col,s increases beyond a certain point, the overall width of the bit and bit-bar lines exceeds the width of a column, i.e., the cell width W_cell [27]. For a W_cell of 20λ, the maximal number of spare columns that can be added without moving adjacent columns further apart is 5. The area overhead is calculated using Equation (35), which takes the above aspects into account, where W_col is the original width of a column (which equals W_cell) and W'_col is the width of a column after spare columns are added.

R_area = (W'_col / W_col) · (m_col,tot / m_col,o) · ((h_mux + h_array) / h_array)    (35)
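A compact way to evaluate a candidate spare-column configuration is to code Equations (33)-(35) directly; the sketch below does so for a single array, with the cell yield, interconnect yield, and geometry terms passed in as assumed parameters.

```python
# Yield and area-overhead of an array with spare columns,
# following Equations (33)-(35). All inputs are assumed parameters.
from math import comb

def y_array(y_cell, m_row_o, m_col_o, m_col_s):
    y_col = y_cell ** m_row_o                  # yield of one column of cells
    m_tot = m_col_o + m_col_s
    return sum(comb(m_tot, m_col_o + i)
               * y_col ** (m_col_o + i)
               * (1 - y_col) ** (m_col_s - i)
               for i in range(m_col_s + 1))    # at least m_col_o good columns

def r_area(w_col_prime, w_col, m_col_o, m_col_s, h_mux, h_array):
    m_tot = m_col_o + m_col_s
    return (w_col_prime / w_col) * (m_tot / m_col_o) * ((h_mux + h_array) / h_array)

def yield_per_area(y_cell, y_interconnect, m_row_o, m_col_o, m_col_s,
                   w_col_prime, w_col, h_mux, h_array):
    y = y_array(y_cell, m_row_o, m_col_o, m_col_s) * y_interconnect   # Eq. (34)
    return y / r_area(w_col_prime, w_col, m_col_o, m_col_s, h_mux, h_array)
```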
The optimal redundancy configuration suggests that each column be accompanied by 10 spares, which results in an unacceptable design with 19.1 times the area of the original design. This is because the yield of a cell is 0.71, and the probability of a column of cells being functional is 0.71^4,096, which is close to zero. As a result, the number of spare columns required is high. In
this scenario, adding spare columns is not an efficient approach. Corresponding yield/area for the
optimal redundancy configuration remains very low.
1.24.3. Redundant CNTs approach
In our approach, we improve the yield of the cell array by enumerating combinations of N_tub values for the CNFETs in SRAM cells. An SRAM cell is identified as working only if two conditions are satisfied: (a) every transistor has at least one CNT, and (b) read and write stability are ensured according to Section 1.20.5. Specifically, for each N_tub value of the access transistor, we derive the minimum N_tub for the pull-down transistors and the maximum N_tub for the pull-up transistors using Equations (25) and (26). For each combination of N_tub for the transistors in an SRAM cell, we compute the yield/area for an SRAM array of a given size, as sketched below. Figure 34 shows the optimal yield/area and the corresponding N_tub for SRAM arrays of different sizes, where the x-axis is the base-2 logarithm of the memory array size. For a 2MB SRAM, i.e., x = 24, the obtained redundant CNTs configuration (N_access, N_pd, N_pu) is (28, 26, 8) and the corresponding yield/area is 0.609. As memory size increases, the number of required redundant CNTs increases, and the obtained optimal yield/area decreases.
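The enumeration can be sketched as follows; the stability bounds from Equations (25) and (26), the correlated-CNT cell yield, and the cell area model are passed in as assumed callables, since their exact forms are defined elsewhere in this chapter.

```python
# Illustrative enumeration of (N_access, N_pd, N_pu) for an SRAM array.
# min_pd_for, max_pu_for, cell_yield, and cell_area are placeholders for the
# stability constraints (Eqs. (25)-(26)), the correlated-CNT yield model,
# and the area model of Section 1.20.2.

def best_cell_config(n_cells, min_pd_for, max_pu_for, cell_yield, cell_area,
                     max_access=64):
    best, best_ypa = None, -1.0
    for n_access in range(1, max_access + 1):
        n_pd_min = min_pd_for(n_access)            # read stability bound
        n_pu_max = max_pu_for(n_access)            # write stability bound
        for n_pd in range(n_pd_min, n_pd_min + 8): # small window above the bound
            for n_pu in range(1, n_pu_max + 1):
                y_array = cell_yield(n_access, n_pd, n_pu) ** n_cells
                area = n_cells * cell_area(n_access, n_pd, n_pu)
                ypa = y_array / area
                if ypa > best_ypa:
                    best, best_ypa = (n_access, n_pd, n_pu), ypa
    return best, best_ypa
```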
1.24.4. A hybrid redundancy approach
We propose a hybrid approach that integrates redundant CNTs with the traditional spare columns (rows) approach. Each redundancy configuration includes (i) (N_access, N_pd, N_pu) of the SRAM cells, (ii) the size of the sub-arrays, (iii) the number of spare columns (rows) for each sub-array, and (iv) N_tub for the transistors in the reconfiguring circuits. We identify the optimal configuration by enumerating configurations along these four dimensions, as sketched below. The optimal configuration is: (i) (N_access, N_pu, N_pd) = (19, 18, 6) for each SRAM cell, (ii) the memory array is partitioned into sub-arrays of 32 rows and 4K columns, where each sub-array is incorporated with 4 spare columns, and (iii) N_tub = 4 for the reconfiguring circuits. The obtained yield/area is 0.763, i.e., a 25.3% improvement over that of the redundant CNTs approach.
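A sketch of the four-dimensional sweep is shown below; the helper functions for sub-array yield (including spares and the reconfiguring demux) and for the area ratio are assumptions standing in for the models developed above.

```python
# Illustrative four-dimensional sweep for the hybrid approach.
# subarray_yield() and area_ratio() are assumed callables that wrap the
# spare-column yield model (Eqs. (33)-(34)) and the area model (Eq. (35)).

def best_hybrid_config(cell_candidates, row_options, max_spares,
                       subarray_yield, area_ratio, n_rows_total=4096):
    """Sweep (cell config, sub-array height, #spare columns, N_tub of demux)."""
    best, best_ypa = None, -1.0
    for cell_cfg in cell_candidates:              # (N_access, N_pd, N_pu)
        for rows in row_options:                  # sub-array height in rows
            n_sub = n_rows_total // rows
            for spares in range(max_spares + 1):
                for n_tub_demux in range(1, 9):   # CNTs in reconfiguring logic
                    y = subarray_yield(cell_cfg, rows, spares, n_tub_demux) ** n_sub
                    ypa = y / area_ratio(cell_cfg, rows, spares)
                    if ypa > best_ypa:
                        best, best_ypa = (cell_cfg, rows, spares, n_tub_demux), ypa
    return best, best_ypa
```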
1.24.5. Impact of redundancy on decoder delay
The proposed hybrid approach reduces the delay overhead imposed on the address decoder by the redundant CNTs in the SRAM cells. We size the address decoder using the logical-effort-based heuristic, where C_load is computed accordingly for each word line. The obtained N_access value is 2, 28, and 19, respectively, for the nominal design, the optimal redundant-CNTs-only design, and the optimal hybrid redundancy design. Figure 35 compares the decoder delays obtained for these three designs for different N_PI (the N_tub of the primary input transistors), where N_PI is a measure of the input driving capability required by design specifications. As the hybrid approach provides an SRAM cell design with a smaller N_access (19, compared to 28 in the redundant-CNTs-only design), the address decoder has a smaller delay. The average delay overheads of the optimal designs from the redundant CNTs approach and the hybrid redundancy approach are 19.2% and 15.7%, respectively.
Figure 33. Layout of 1-to-2 demux (legend: metal 1-3, vias, gate, P+ and N+ doped CNTs, design rules; signals: in, out1, out2, Select, VDD, GND).
Figure 34. Optimal redundant CNTs design vs. SRAM array size (x-axis: number of SRAM cells as a power of 2; left y-axis: number of CNTs for N_access, N_pd, and N_pu; right y-axis: normalized maximal yield/area).
Figure 35. Impact of N_access on decoder delay for primary input transistors with different N_PI.
1.25. Conclusions
In this chapter, we identify transistors' characteristics in terms of their impact on design yield and performance. Based on these characteristics, we propose an approach to increase yield per area for logic circuits. Results show that the obtainable yield/area is restricted by the load capacitance constraints of the preceding logic. Our approach achieves a yield/area close to 100% with negligible delay penalty for an 8-bit ripple-carry adder. We also propose a hybrid redundancy approach for SRAM arrays by incorporating redundant CNTs, in an optimal manner, with the traditional spare columns (rows) approach. Results show that this approach increases yield/area from 60.9% to 76.3% for a 2MB memory module. Also, the obtained SRAM cell design has smaller access transistors than those obtained with the redundant-CNTs-only approach, which reduces the address decoding delay overhead from 19.2% to 15.7%.
Conclusions
This chapter summarizes the contributions of this thesis and presents topics for our future research.
1.26. Contributions
In this thesis, we develop a systematic methodology to use spare processors (cores) to
replace/repair defective ones to optimize yield per area of CMPs. For a given defect density, our
approach provides the optimal redundancy configuration for a given nominal design. Our
approach mainly includes the following two aspects.
(1) We improve effectiveness of spare processors (cores) by sharing them among original
ones. We develop the first detailed area model to obtain more realistic estimates of yield per area
and hence to derive more efficient designs. The experimental results show that our spare cores
sharing approach improves yield per area from 67.2% to 76.6% (where 100% denotes the yield
per area of an unattainable ideal, namely the original design with a zero-defect process).
(2) Then we further improve wafer utilization by proposing a new utility function that assigns
appropriate value to defective but useable chips. Specifically, we present two utility functions, i.e.,
Fully-working Processor and Partially-working Chip Binning (FPPCB) at processor-level, and
Partially-working Processors Binning (PPB) at core level. Results show that our utility functions
provide yield per area of up to 90% of the ideal.
The above improvements are in part due to a number of other new approaches we have
developed. In particular, we have developed a greedy repair algorithm to use spare cores to
replace defective cores, and have proven its optimality for the scenario without utility functions.
We have also developed new repair heuristics for different categories of benchmarks for both
utility functions. Impact of different repair algorithms and heuristics are shown for particular
spare configurations. We introduce the definition of -optimal spare design, i.e., a design which
is close to the optimal design (in terms of working processors) but with significantly decreased
number of spares (𝑚 ). These designs allow customers to have much smaller chip area with still
acceptable yield per area. We also propose an approach to estimate performance of defective
CMPs by averaging performance obtained for CMPs with different warp sizes. This fast
estimation method enables evaluation of various redundancy approaches and hence enables rapid
identification of the optimal level of redundancy.
For future technologies with smaller feature sizes and extremely high defect densities, we
explored finer-level redundancy techniques by studying CNFET-based circuits. We propose an
approach to add CNTs for each CNFET based on its characteristics to increase yield per area for
logic circuits. We also propose a hybrid redundancy approach for SRAM arrays by identifying
the optimal combination of redundant CNTs and spare columns (rows). Our approach increases
yield/area from 60.9% to 76.3% for a 2MB memory module compared to only spare CNTs
approach, while reducing decoder delay overhead from 19.2% to 15.7%.
1.27. Future Research
To apply the developed redundancy approach during design of a CMP chip, we will need to
make changes to existing CAD tools to include some of our approaches, especially our
approaches for accurately estimating area and delay overheads when adding redundancy and our
repair algorithms. We will incorporate our estimation approaches into existing design tools in our
future research. We will also adopt and add our repair algorithms to existing testing tools.
We will develop algorithms with lower complexity to enumerate spare configurations to
replace the current branch & bound algorithm. This is important especially when more
processors/cores are integrated onto a single CMP and more spare cores are added.
We will consider asymmetry when deciding the number of spare cores assigned to original cores, i.e., the repair domains of spare cores might overlap, and each original core might not be covered by an equal number of spare cores. Such flexibility in asymmetry may play an important role when a clustering factor is considered for defects.
memory structures, e.g., 8T SRAM, for different technologies.
References
[1] “GeForce GTX 480 and 470: From Fermi To Actual Cards,” 2010,
[2] “NVIDIA denies rumors of faulty chips,” 2008,
[3] “NVIDIA GTX480/470 to lose cores,” 2010,
[4] A. Bakhoda, et al. “Analyzing CUDA workloads using a detailed GPU simulator,” IEEE
International Symposium on Performance Analysis of Systems and Software, 2009.
[5] A. D. Franklin, et al., “Sub-10nm Carbon Nanotube Transistor,” Nano Letter, 2012.
[6] A. Durytskyy, M. Zahran, and R. Karri, “Improving GPU Robustness by making use of
faulty parts,” in Proc. Int’l Conf. Computer Design, Amherst, MA, USA, 2011, pp. 346-351.
[7] A. H. El-maleh, B. M. AI-Hashimi, A. Melouki, and F. Khan, “Defect-Tolerant N2-
Transistor Structure for Reliable Nanoelectronic Designs,” in IET Computer and Digital
Techniques, vol. 3, no. 6, pp. 570-580, 2009.
[8] A. Javey, J. Guo, M. Lundstrom, and H. Dai, “Ballistic Carbon Nanotube Field-Effect
Transistors,” Nature, 424.6949: 654-657, 2003.
[9] A. Lashgar, B. Amirali, and K. Ahmad, “Investigating Warp Size Impact in GPUs,” arXiv
preprint arXiv:1205.4967, 2012.
[10] B. Liu, H. Hsiung, D. Cheng, R. Govindan, and S. K. Gupta, “Towards Systematic
Roadmaps for Networked Systems,” HotNets, 2012.
[11] C. Demerjian, “Nvidia’s Fermi GTX480 is broken and unfixable,” http://semiaccurate.com/
[12] C. H. Stapper, “Small-area fault clusters and fault tolerance in VLSI circuits,” IBM J. Res.
Develop, vol. 33, no. 2, pp. 174-177, Mar., 1989.
[13] C. P. Robert, and G. Casella. Monte Carlo statistical methods. Vol. 319. New York:
Springer, 2004.
[14] Cacti tools, Available: www.hpl.hp.com/research/cacti
[15] Cadence Virtuoso Schematic Editor. http://www.cadence.com
[16] D. Cheng and S. Gupta, "A novel software-based defect-tolerance approach for application-
specific embedded systems," in Proc. Int’l Conf. Computer Design, Amherst, MA, USA,
2011, pp. 443-444.
[17] D. Cheng and S. K. Gupta, “A Heuristic Logical Effort Approach for Gate Sizing for
CNFET-Based Circuits,” Tech. Report, University of Southern California, Feb., 2014.
[18] D. Cheng and S. K. Gupta, “A systematic methodology to improve yield per area of highly-
parallel CMPs,” Proc. Symp. Defect and Fault Tolerance in VLSI and Nanotechnology
Systems, 2012, pp. 126-133.
[19] D. Cheng and S. K. Gupta, “Maximizing yield per area of highly parallel CMPs using
hardware redundancy,” Computer Aided Design of Integrated Circuits and Systems, IEEE
Transactions on 33, no. 10(2014): 1545-1558.
[20] D. Cheng and S. K. Gupta, “Optimizing Redundancy Design for Chip Multiprocessors for
Flexible Utility Functions,” Proc. Int’l Test Conf., Anaheim, CA, USA, 2013.
[21] D. Cheng, H. Hsiung, B. Liu, J. Chen, J. Zeng, R. Govindan, and S. K. Gupta, “A New
March Test for Process-Variation Induced Delay Faults in SRAMs,” in proceedings of IEEE
Asian Test Symposium, pp. 115-122, 2013.
[22] D. Cheng and S. K. Gupta, “PPB: Partially-working Processors Binning for Maximizing
Wafer Utilization,” 33rd IEEE VLSI Test Symposium (VTS’15), 2015.
[23] Design compiler, http://acms.ucsd.edu/_files/dcug.pdf.
[24] E. F., Moore and C. E. Shannon, “Reliable Circuits Using Less Reliable Relays,” Journal of
the Franklin Institute 262.3: 191-208, 1956.
[25] F. Sibai, “Thermal, Power, and Performance Shaping of Multi-core Floorplans,” in
Proceedings of ICMON, 2010.
[26] GPGPU-Sim 3.x, http://gpgpu-sim.ece.ubc.ca/
[27] H. Hsiung, B. Cha, and S. K. Gupta, “Salvaging Chips with Caches beyond Repair,” in
Proceedings of DATE, 2011.
[28] H. Jeon, and M. Annavaram, “Warped-DMR: Light-weight Error Detection for GPGPU,” in
Proc. Int’l Symp. Microarchitecture, Vancouver, BC, Canada, 2012, pp. 37-47.
[29] http://arstechnica.com/ardware/news/2008/07/nvidia-denies-rumors-of-mass-gpu-
failures.ars.
[30] http://www.slashgear.com/nvidia-geforce-gtx-480470-to-lose-coresover-poor-gpu-yield-
2278420/.
[31] http://www.tomshardware.com/reviews/geforce-gtx-480,2585-2.html
[32] I. Koren and Z. Koren, “Defect Tolerance in VLSI Circuits: Techniques and Yield
Analysis,” in Proceedings of IEEE, 1998.
[33] I. Koren, “The effect of scaling on the yield of VLSI circuits,” in Yield Modeling and Defect
Tolerance in VLSI, UK: Adam Hilger, 1988.
[34] I. Sutherland, R. F. Sproull and D. Harris, Logical Effort: Designing Fast CMOS Circuits,
Morgan Kaufmann, 1999.
[35] International Technology Roadmap for Semiconductors 2009, ORTC-5.
[36] International Technology Roadmap for Semiconductors 2011, “Yield enhancement.”
[37] International Technology Roadmap for Semiconductors Report, 2011, http://www.itrs.net/.
[38] ITRS 2011, http://www.itrs.net/
[39] J. Aidemark, P. Folkesson, and J. Karlsson, “On the Probability of Detecting Data Errors
Generated by Permanent Faults Using Time Redundancy,” Proc. IOLTS, 2003.
[40] J. Appenzeller, “Carbon nanotubes for high-performance electronics-progress and prospect,”
in Proceedings of IEEE, 2008.
[41] J. Cha and S. K. Gupta. "Characterization of granularity and redundancy for SRAMs for
optimal yield-per-area." In proceedings of IEEE International Conference of Computer
Design, pp. 219-226, 2008.
[42] J. Deng, Device Modeling and Circuit Performance Evaluation for Nanoscale Devices:
Silicon Technology Beyond 45nm Node and Carbon Nanotube Field Effect Transistors,
Ph.D. Thesis, 2007.
[43] J. Deng, et al., “Carbon Nanotube Transistor Circuits: Circuit-Level Performance Benchmarking and Design Options for Living with Imperfections,” in Proceedings of IEEE
International Solid-State Circuits Conference, pp. 70-588, 2007.
[44] J. Han, J. Gao, P. Jonker, Y. Qi, “Toward Hardware-Redundant, Fault-Tolerant Logic for
Nanoelectronics,” in Transactions on Design & Test Computer, 4: 328-339, 2005.
[45] J. Kahl, M. Day, H. Hofstee, C. Johns, T. Maeurer, and D. Shippy, “Introduction to the Cell
Multiprocessor,” IBM Journal R&D, 2005.
[46] J. Zhang, N. Patil, A. Hazeghi, and S. Mitra, “Carbon Nanotube Circuits in the presence of
Carbon Nanotube Density Variations,” in Proceedings of ACM Design Automation
Conference, pp. 71-76, 2009.
[47] L. Wei, D. J. Frank, L. Chang, H. S. Wong, “A Non-Iterative Compact Model for Carbon
Nanotube FETs Incorporating Source Exhaustion Effects,” in IEEE Electron Devices
Meeting, pp. 1-4, 2009.
[48] L. Zhang, et al., “Defect Tolerance in Homogeneous Manycore Processors Using Core-
Level Redundancy with Unified Topology,” in Proceedings of DATE, 2008.
[49] Y. Li, et al. "Self-repair of uncore components in robust system-on-chips: An OpenSPARC
T2 case study." Test Conference (ITC), 2013 IEEE International. IEEE, 2013.
[50] M. Ali, R. Ashraf, and M. Chrzanowska-Jeske, “Logical Effort of CNFET-based Circuits in
the Presence of Metallic Tubes,” in Proceedings of IEEE-NANO, pp. 1-6, 2012.
[51] M. C. Jeske, R. Ashraf, and R. K. Nain, “Performance analysis of CNFET based circuits in
the presence of fabrication imperfections,” in Proceedings of International Symposium on
Circuits and Systems, pp. 1363-1366, 2012.
[52] M. Chiappetta, “AMD FX-8150 8-core CPU Review,” http://hothardware.com/Reviews/
AMD-FX8150-8CoreprocessorReview-Bulldozer-Has-Landed/?page=2
[53] M. Kaneko, “Reconfiguration of Folded Torus PE Networks for Fault Tolerant WSI
Implementations,” in Proc. Asia-Pacific Conf. Circuits and Systems, Chiangmai, Thailand,
1998, pp. 791-794.
[54] M. M. Aghatabar, M. A. Breuer, S. K. Gupta, “A Design Flow to Maximize Yield/Area of
Physical Devices via Redundancy,” in Proceedings of International Test Conference (ITC),
pp. 1-10, 2012.
[55] M. M. Shulaker, et al., “Carbon Nanotube Computer,” Nature, 501(7468), 526-530, 2013.
[56] M. Mirza-Aghatabar, M. A. Breuer, and S. K. Gupta, “Algorithms to maximize yield and
enhance yield/area of pipeline circuitry by insertion of switches and redundant modules,”
Proc. DATE, 2010.
[57] M. Shulaker, et al., “Carbon nanotube computer,” Nature, 2013.
[58] N. Patil, A. Lin, J. Zhang, H. Wei, and K. Anderson, “VMR: VLSI-Compatible Metallic
Carbon Nanotube Removal for Imperfection-Immune Cascaded Multi-Stage Digital Logic
Circuits using Carbon Nanotube FETs,” in Proceedings of IEEE IEDM, 2009.
[59] N. Patil, et al., “Circuit-level Performance Benchmarking and Scalability Analysis of
Carbon Nanotube Transistor Circuits,” in IEEE Transactions on Nanotechnology, 2009.
[60] N. Patil, J. Deng, A. Lin, H. S. Wong, and S. Mitra, “Design Methods for Misaligned and
Mis-positioned Carbon-Nanotube-Immune Circuits,” in IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, 27.10, pp. 1725-1736, 2008.
[61] N. Sirisantana, B. C. Paul, and K. Roy, "Enhancing Yield at the End of the Technology
Roadmap," in proceedings of IEEE Design & Test, 21.6:563-571, 2004.
[62] N. Weste and D. Harris, “CMOS VLSI Design: A Circuits and Systems Perspective,”
Addison-Wesley, Fourth edition, 2010.
[63] NCSU 45nm gate library. http://www.eda.ncsu.edu.
[64] NVIDIA CUDA SDK, https://developer.nvidia.com/.
[65] NVIDIA denies rumors of faulty chips, http://arstechnica.com/ardware/news/2008/07/
nvidia-denies-rumors-of-mass-gpu-failures.ars
[66] NVIDIA GTX480/470to lose cores, http://www.slashgear.com/nvidia- geforce- gtx-480470-
to-lose-coresover-poor-gpu-yield-2278420/
[67] Overall Technology Roadmap Characters, 2009,
http://www.itrs.net/Links/2009ITRS/2009Chapters_2009Tables/2009Tables_FOCUS_A_IT
RS.xls.
[68] P. Gupta and E. Papadopoulou, “Yield analysis and optimization,” in The Handbook of
Algorithms for VLSI Physical Design Automation., 2010.
[69] P. Packan, et al., “High Performance 32nm Logic Technology Featuring 2nd Generation
High-k + Metal Gate Transistors,” in Proceedings of IEDM, 2009.
[70] P. Sivakumar, S. W. Keckler, C. Moore, and D. Burger, “Exploiting microarchitectural
redundancy for defect tolerance,” in Proceedings of ICCD, 2003, pp. 481-488, San Jose,
CA, USA.
[71] Q. Cao, S. J. Han, G. S. Tulevski, Y. Zhu, and D. D. Lu, “Arrays of Single-Walled Carbon
Nanotubes with Full Surface Coverage for High-Performance Electronics,” Nature
Nanotechnology, 8(3), 180-186, 2013.
[72] R. Ashraf, et al, “Analysis of Yield Improvement Technique for CNFET-based Logic
Gates,” in Proceedings of IEEE-NANO, 2011.
[73] R. Ashraf, et al, “Functional Yield Estimation of Carbon Nanotube-Based Logic Gates in the
Presence of Defects,” IEEE Transactions on Nanotechnology, 2010.
[74] R. Datta and N. A. Touba, “Exploiting Unused Spare Columns to Improve Memory
ECC,” in proceedings of IEEE VLSI Test Symposium, pp. 47-52, 2009.
[75] R. Kumar, V. Zyuban, and D. Tullsen, “Interconnections in multi-core architectures:
Understanding mechanisms, overheads and scaling,” in Proc. Int’l Symp. Computer
Architecture, WI, USA, 2005, pp. 408-419.
[76] R. Shanmugam, et al., “Fault Tolerance in multicore processors using reconfigurable
hardware unit,” in Proceedings of HiPC, 2008.
[77] R. Smith, “NVIDIA’s GeForce GTX 480 and GTX 470,” 2010. http://www.anandtech.com/.
[78] S. Borkar, “Thousand core chips-A technology perspective,” in Proceedings of DAC, 2007.
[79] S. Gupta, S. Feng, A. Ansari, J. Blome, and S. Mahlke, “The StageNet Fabric for
Constructing Resilient Multicore Systems,” in Proceedings of MICRO, 2008.
[80] S. M. Kang and Y. Leblebici, “CMOS digital integrated circuits,” McGraw-Hill, New York.
[81] S. Nomura, M. Sinclair, C. Ho, V. Govindaraju, M. de Kruijf, and K. Sankaralingam, “Sampling+DMR: Practical and Low-overhead Permanent Fault Detection,” in Proceedings of ISCA, 2011.
[82] S. Shamshiri and K. Cheng “Modeling yield, cost, and quality of an NoC with uniformly
and non-uniformly distributed redundancy,” in Proceedings of VTS, 2010.
[83] Synopsys, Design Compiler.
[84] T. Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design,” in
Proceedings of MICRO, 1999.
[85] W. Wang, “High SNM 6T CNFET SRAM Cell Design Considering Nanotube Diameter and
Transistor Ratio,” in Proceedings of EIT, 2011.
[86] W. Wang, Z. Yu, P. He, and K. Choi, “Design Method for 6T CNFET Misalignment
Immune SRAM Circuit,” in Proceedings of Midwest Symposium of Circuits And Systems,
pp. 1-4, 2011.
[87] Y. B. Kim and F. Lombardi, “A novel design methodology to optimize the speed and power
of the CNFET circuits,” in Proceedings of MWSCAS, 2009.
[88] Y. Markovsky and J. Wawrzynek, “On the opportunity to improve system yield with multi-
core architectures,” Int’l Workshop Design for Manufacturability and Yield, Santa Clara,
CA, USA, 2007.
[89] Z. Chen et al., “An Integrated Circuit Assembled on A Single Carbon Nanotube,” Science,
311(5768), 1735-1735, 2006.
[90] Z. Wu, et al., “Exploration of A Reconfigurable 2D Mesh Network-on-Chip Architecture
and A Topology Reconfiguration Algorithm,” in Proc. Int’l Conf. Solid-State Integrated
Circuit Technology, Xi’an, China, 2012.
[91] A. Pushkarna, S. Raghavan, and H. Mahmoodi, “Comparison of Performance Parameters of
SRAM Designs in 16nm CMOS and CNTFET Technologies,” IEEE International SOC
Conference, 2010.
Abstract
As CMOS fabrication technology continues to move deeper into nano‐scale, circuit’s susceptibility to manufacturing imperfections increases, and the improvements in yield, power, and delay provided by each major scaling generation have started to slow down or even reverse. This trend poses great challenges for designing and manufacturing advanced electronics in the future generations of CMOS, especially chip‐multiprocessors, which usually have many processors and hundreds of cores. At the same time, new fabrication technologies, such as carbon nanotube field‐effect transistors (CNFET), are emerging as promising building blocks for the next‐generation semiconductor electronics. However, substantial imperfections inherent to CNT technology are the main obstacles to the demonstration of robust and complex CNFET circuits. Both these scenarios show that it is increasingly difficult to guarantee the correctness and conformance of circuits to performance specifications, which leads to reduction in yield.

We develop a systematic methodology to use spare processors (cores) to replace/repair defective ones to optimize yield per area of CMPs. We improve effectiveness of spare processors (cores) by sharing them among original ones. We develop the first detailed area model to obtain more realistic estimates of yield per area and hence to derive more efficient designs. The experimental results show that our spare cores sharing approach improves yield per area from 67.2% to 76.6% (where 100% denotes the yield per area of an unattainable ideal, namely the original design with a zero‐defect process). Then we further improve wafer utilization by proposing a new utility function that assigns appropriate value to partially defective but useable chips. Specifically we present two utility functions i.e., Fully‐working Processor and Partially‐working Chip Binning (FPPCB) at processor‐level, and Partially‐working Processors Binning (PPB) at core level. Results show that our utility functions provide yield per area of up to 90% of the ideal.

The above improvements are in part due to a number of other new approaches we have developed. In particular, we have developed a greedy repair algorithm to use spare cores to replace defective ones, and have proven its optimality for the scenario without utility functions. We have also developed new repair heuristics for different categories of benchmarks for both utility functions. We have also proposed an approach to estimate performance of defective CMPs by averaging performance obtained for CMPs with different warp sizes. This fast estimation method enables evaluation of various redundancy approaches and hence enables rapid identification of the optimal level of redundancy.

For future technologies with smaller feature sizes and extremely high defect densities, we explored finer‐level redundancy techniques by studying CNFET‐based circuits. We propose an approach to add CNTs for each CNFET based on its characteristics to increase yield per area for logic circuits. We also propose a hybrid redundancy approach for SRAM arrays by identifying the optimal combination of redundant CNTs and spare columns (rows). Our approach increases yield/area from 60.9% to 76.3% for a 2MB memory module compared to only spare CNTs approach, while reducing decoder delay overhead from 19.2% to 15.7%.