REDUNDANCY DRIVEN DESIGN OF LOGIC CIRCUITS FOR YIELD/AREA
MAXIMIZATION IN EMERGING TECHNOLOGIES
by
Mohammad Mirzaaghatabar Ahangar
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)
May 2012
Copyright 2012 Mohammad Mirzaaghatabar Ahangar
Dedication
To
My parents, Nahid & Nasrollah
&
Mohammad Moghaddam
Acknowledgements
It is a pleasure to thank those who made this dissertation possible through their
support, guidance, encouragement, and inspiration.
My deepest gratitude goes to my advisor, Professor Melvin A. Breuer. I always had
the freedom to explore different topics on my own, and at the same time his advice guided
me in the right direction. It would not have been possible to overcome many crisis situations
and finish this dissertation without his patience and continuous support. Thanks Mel.
My co-advisor, Professor Sandeep K. Gupta, has always been there to listen and
help me. His technical advice, from the preliminary to the concluding level, helped me
develop an understanding of the subject material. Thanks Sandeep.
I would like to extend my great appreciation to the other members of my dissertation
committee, Professor Nenad Medvidovic, Professor Massoud Pedram, and Professor
Jeffrey Draper, for their time and positive feedback.
Also, I wish to express my sincere gratitude to Dr. Saro Nikraz, Dr. Shahin
Nazarian, Dr. Ehsan Pakbaznia, Dr. Doochul Shin and Dr. Hamed Abrishami for their
technical support and care, which helped me overcome challenges and obstacles and
focus on my graduate study.
Last but not least, I offer my regards and blessings to all of those who supported
me in any respect during the completion of this dissertation, especially my siblings
Azadeh, Forough, and Mehrdad, and my great friends Kamran Saleh, Mona Saghatchi,
Damon Moazen, Yasaman Hashemian, and Fatemeh Kashfi.
Table of Contents
Dedication........................................................................................................................... ii
Acknowledgements............................................................................................................ iii
List of Tables ..................................................................................................................... vi
List of Figures.................................................................................................................. viii
Abstract............................................................................................................................ xiii
Chapter 1: Introduction........................................................................................................1
1.1 Metric of efficiency: Yield-per-Area (Y/A)..................................................5
1.2 Motivation of using redundancy at fine level of granularity .......................7
1.2.1 Scenario 1: Redundancy for originally partitioned circuit................... 9
1.2.2 Scenario 2: Partitioning of original circuit for redundancy ............... 13
1.3 Related work ..............................................................................................14
1.4 Dissertation outline ....................................................................................18
Chapter 2: Steering Logic ..................................................................................................19
2.1 Introduction................................................................................................19
2.1.1 q-way-fork.......................................................................................... 19
2.1.2 q-way-join .......................................................................................... 20
2.1.3 q_in×q_out-way-switch ............................................................ 21
2.2 Configurable and Testable steering logic ..................................................23
2.3 Defects .......................................................................................................27
2.4 Configurable layout ...................................................................................30
2.4.1 Critical area for extra metal (X_m) defects ........................................... 31
2.4.2 Critical area for missing metal (M_m) defects...................................... 32
2.5 Simulation..................................................................................................33
2.6 Delay Model used by TYSEL .....................................................................38
2.7 Conclusions................................................................................................42
2.8 Appendix: TYSEL.......................................................................................42
2.8.1 Input format........................................................................................ 43
2.8.2 Layout simulation............................................................................... 48
2.8.3 Fault simulation.................................................................................. 50
2.8.4 Yield computation by TYSEL............................................................. 52
2.8.5 Experimental Results by TYSEL ........................................................ 53
Chapter 3: Algorithms and Heuristics for Yield and Yield/Area Maximization in SoC...58
3.1 Introduction................................................................................................58
3.2 SIRUP: Switch Insertion in RedUndant Pipelines.....................................62
3.2.1 Partitioning a pipeline ........................................................................ 62
3.2.2 Implementation of SIRUP .................................................................. 65
3.3 OSIP: Optimal Switch Insertion in Pipelines ............................................68
3.4 M-OSIP: Modified OSIP............................................................................72
3.5 HYPER: a Heuristic for Yield/area imProvEment using Redundancy ......75
3.5.1 Phase 1: Implementation of HYPER using Divide & Conquer.......... 76
3.5.2 Phase 2: Implementation of BREAK using Dynamic Programming.. 79
3.6 MYRA: Maximizing Yield/Area via Replication .......................................84
3.7 Experimental Results .................................................................................92
3.8 Conclusions................................................................................................96
Chapter 4: Theory of Partitioning for Redundancy ...........................................................97
4.1 Introduction................................................................................................97
4.2 Theory of partitioning for Y/A maximization using redundancy .............100
4.2.1 Validation of Theorems using MYRA .............................................. 106
4.3 Proposed Design Flow .............................................................................113
4.3.1 Phase 1: CLB partitioning................................................................ 113
4.3.2 Phase 2: Overall optimization .......................................................... 119
4.4 Experimental Results ...............................................................................127
4.5 Conclusions..............................................................................................132
Chapter 5: Main Contributions, Conclusions and Future work .......................................133
5.1 Main Contributions ..................................................................................133
5.2 Conclusions..............................................................................................133
5.3 Future work..............................................................................................136
Bibliography ....................................................................................................................138
List of Tables
Table 1.1: Notations used in this dissertation ..............................................................5
Table 1.2: The yield of forks and joins for different number of copies.....................12
Table 2.1: Size and probability of occurrence for each defect...................................36
Table 2.2: Functionality of some variables in Comp_Struc.......................................49
Table 2.3: Yield of a fork under different bus widths and defect densities ...............57
Table 3.1: Comparison of our developed techniques.................................................61
Table 3.2: Yield of forks & joins as a function of degree of replication ...................74
Table 3.3: Information of the modules in Figure 3.19 ...............................................88
Table 3.4: Output of MYRA for Figure 3.19 ..............................................................89
Table 3.5: Yield of OpenSPARC T2 modules...........................................................93
Table 3.6: Number of each module in HYPER-5RM structure in 7 cases..................94
Table 3.7: G_Y/A of HYPER structure compared to other structures ............................95
Table 4.1: Y_new as a function of n and q ...................................................................105
Table 4.2: OpenSPARC T2 core’s modules ............................................................107
Table 4.3: Y/A comparison of the configuration generated by MYRA & qRM ........107
Table 4.4: Area distribution of each module of TLU...............................................109
Table 4.5: Y/A comparison of the Equal and Unequal partitions (E vs. UE) ...........109
Table 4.6: Area of the core after redundancy (mm²) ...............................................112
Table 4.7: Size and number of CLBs in OpenSPARC T2 .......................................119
Table 4.8: Optimization results after CLB clustering ..............................................123
Table 4.9: Y/A gain of different redundant designs over original design.................129
List of Figures
Figure 1.1: Yield learning trend of Intel 65nm technology [94]..................................2
Figure 1.2: Three equal wafers with different number of acceptable chips.................6
Figure 1.3: Partitioning of original circuit ...................................................................8
Figure 1.4: qRM = q Redundant Modules..................................................................10
Figure 1.5: The structure of n pipeline modules with j groups of modules ...............10
Figure 1.6: Yield and Y/A for a pipeline with 15 modules.........................................12
Figure 1.7: Different partitions of one circuit ............................................................14
Figure 2.1: Fork, input data sent to outputs ...............................................................20
Figure 2.2: Join, one of the input buses goes to the output bus.....................................20
Figure 2.3: Switch: A multi-purpose transmission unit .............................................21
Figure 2.4: A simple example of using steering logic ...............................................22
Figure 2.5: Yield of a switch with different number of input/output buses...............23
Figure 2.6: (a) redundant circuit (b) after testing and configuration..........................23
Figure 2.7: A configurable 2-input/2-output switch ..................................................24
Figure 2.8: A testable switch with FF sharing ...........................................................26
Figure 2.9: Different defect types and sizes...............................................................28
Figure 2.10: A 3-way-fork with different types of defects ........................................29
Figure 2.11: X_m defect versus X_m fault .......................................................................31
Figure 2.12: Critical area for an X_m defect...................................................................32
Figure 2.13: M_m defect versus M_m fault .....................................................................32
Figure 2.14: Critical area for an M_m defect.....................................................................33
Figure 2.15: Layout of a 2-way-fork with bus width 2..............................................35
Figure 2.16: Accuracy of TYSEL compared to Poisson yield model .........................37
Figure 2.17: RC equivalent circuit for any path in a fork ..........................................39
Figure 2.18: Different capacitance types ...................................................................41
Figure 2.19: Pseudo code of TYSEL...........................................................................43
Figure 2.20: Example of input data to TYSEL ...........................................................44
Figure 2.21: A simple fork with different input bus orders .......................................45
Figure 2.22: A 3×5-way-switch with Order = 2.........................................................46
Figure 2.23: A faulty 3-way-fork with different input bus order...............................47
Figure 2.24: Data structure for a defect .....................................................................48
Figure 2.25: Data structure for a component .............................................................48
Figure 2.26: A component of Metal 1........................................................................49
Figure 2.27: A fork and its connectivity graph ..........................................................49
Figure 2.28: An example for Defect generation ........................................................51
Figure 2.29: The input file for a fork with 128 bus width and 6 output buses...........53
Figure 2.30: The generated output by TYSEL for the given input in Figure 2.29......54
Figure 2.31: The fault of RUN#4376 given in Figure 2.30 .......................................55
Figure 2.32: Yield of a 9-way-fork with different bus widths and d = .075/mm² ......56
Figure 2.33: Yield of q-way-fork under different defect densities ............................57
Figure 3.1: (a) same number of spares, (b) different number of spares.....................59
Figure 3.2: Different yields with and without concatenation.....................................60
Figure 3.3: Different structures..................................................................................61
Figure 3.4: First phase of SIRUP ...............................................................................65
Figure 3.5: Second phase of SIRUP...........................................................................66
Figure 3.6: An example of Left and Right pointers ...................................................67
Figure 3.7: Different choices for the last switch in a pipeline with 3 modules .........69
Figure 3.8: Pseudo Code of OSIP ..............................................................................72
Figure 3.9: A sample of multi switch structure using OSIP ......................................72
Figure 3.10: A sample of M-OSIP structure (F = Fork, J = Join)..............................73
Figure 3.11: (a) Struc_1, (b) Struc_2, (c) Struc_3 .....................................................76
Figure 3.12: Pseudo code of HYPER .........................................................................77
Figure 3.13: An example of HYPER..........................................................................78
Figure 3.14: Pseudo code of BREAK .........................................................................80
Figure 3.15: Fewest number of spares for each module ............................................81
Figure 3.16: Min_Area determination for HYPER.....................................................82
Figure 3.17: Pseudo code of Part II of Build_Struc ...................................................84
Figure 3.18: (a) The original irredundant logic circuit, (b) its redundant version .......85
Figure 3.19: (a) Original circuit and (b) MYRA’s output ...........................................87
Figure 3.20: Behaviour of yield & Y/A for Steps 17 to 27.........................................90
Figure 3.21: Pseudo code of the algorithm ................................................................92
Figure 4.1: Circuit C17 with redundancy at different levels of granularity..................98
Figure 4.2: Given circuit with two partitions m_1 and m_2 ..........................................100
Figure 4.3: Y/A as a function of q and n for Y = 0.1.................................................105
Figure 4.4: Y/A of TLU as a function of n and defect density (DD) .......................110
Figure 4.5: Y/A of a core with different number of partitions and defect densities .111
Figure 4.6: (a) A CLB, (b) clustered FFs to registers, (c) redundant CLB ..............115
Figure 4.7: New scan chain with spare FF...............................................................118
Figure 4.8: Pseudo code for step 3 of CLB partitioning algorithm..........................118
Figure 4.9: An example of CLB clustering..............................................................121
Figure 4.10: A visual example for CLB-netlist........................................................122
Figure 4.11: General redundant configuration of all CLBs of the original circuit ..124
Figure 4.12: Design flow for Y/A maximization of logic circuits............................124
Figure 4.13: Partitioning of a big CLB ....................................................................126
Figure 4.14: Y/A comparison of different designs for different defect densities .....129
Abstract
Continued scaling of feature sizes and process variations in CMOS nano-
technologies introduce manufacturing anomalies that reduce yield, and this trend is
predicted to worsen for emerging technologies. In addition, these issues take more
time to resolve than in previous technologies. Therefore, it will become increasingly
crucial to develop design techniques that enhance yield in emerging technologies.
While logic circuits, namely gates and flip-flops, occupy a small amount of chip area,
they are more critical than memories because their irregular structure makes it
difficult to improve their yield. In addition, logic circuitry contains many single
points of failure, so any killer defect in this circuitry can turn a die into scrap. This
fact motivates a highly efficient architectural design methodology based on using
redundancy for logic circuits.
In this dissertation we use redundancy in logic circuits to improve silicon
yield/area (i.e., revenue per wafer). While most traditional techniques use redundancy
at the core level, we show that for emerging technologies with low yield, redundancy
needs to be applied at a finer level of granularity than the core level to enhance yield
and reduce time to market. Our theoretical and experimental results show a significant
increase in yield and yield/area compared to the original circuit without redundancy.
To employ redundancy at a fine level of granularity, we need to address the
following issues: (i) design of steering logic (the generic term for a fork, join and
switch) for logically selecting a redundant copy of a module to use, as well as
directing data to and from such modules; (ii) designing a support architecture for
testing both the steering logic and the modules; (iii) estimating the overheads of
steering logic, such as its yield, area and delay; (iv) finding the appropriate number
of spares for heterogeneous modules with different sizes (areas) and yields while
taking into consideration the overheads of inserting testable and configurable steering
logic; and (v) partitioning the original circuit to find the optimal level of granularity
for yield/area maximization using redundancy.
The focus of this dissertation is to develop a CAD tool, algorithms, heuristics
and theorems that address all these issues. We develop a layout-driven CAD tool
(TYSEL) to precisely estimate the overheads of steering logic. We then develop
different algorithms and heuristics for yield and yield/area maximization of logic
circuits with linear and non-linear structures. Our techniques take into account the
overheads of steering logic (estimated by TYSEL) in their computations.
Finally, we introduce a theory of partitioning of the original logic circuit to
capture the impact of the granularity and uniformity of partitions on yield/area after
using redundancy. Based on our theoretical results, we present a design flow to find
the optimal level of granularity at which redundancy should be applied to a given
logic circuit. Our design flow addresses the practical issues of using redundancy at
finer granularity, such as performance loss and DFT (design for testability).
Chapter 1
Introduction
The market's ongoing demand for powerful electronic devices on a single
chip pushes technologies to scale below 22nm. To handle this huge number of
transistors and the processing needs of such users, much of the functionality of
these devices will be implemented on SoCs (systems-on-chip). On the other hand, scaling
in CMOS nano-technology increases the number of transistors per chip beyond billions
[3]. As an example, Intel has shown a working 22nm SRAM shuttle test chip (364
Mb) with 2.9 billion transistors [94]. Additionally, the density of killer defects is
anticipated to increase in new technologies, since many defects that are currently ignored
because of their small sizes will eventually generate faults. In fact, defect densities in
emerging technologies have been predicted to be 1-2 orders of magnitude higher than
current levels [26]. For example, the president of TSMC Europe says about the defect
density of the 28nm process: “Everything is getting more difficult and will be more difficult
at 20nm again” [16]. Another problem is the detrimental effect of process variations. In
an ITRS design chapter [98], high variability-induced failure rates for inverters, latches
and SRAMs in 16nm and 12nm technologies have been quantified that lead to significant
yield loss. All these fabrication non-idealities cause yield loss, which forces the
tape-out process to be repeated multiple times before mass production and in turn
lengthens time to market and increases investment. Trends shown in [56] also illustrate
yield reductions with shrinking technologies. In an article discussing NVIDIA’s experience
with the 40nm Fermi GTX480 [20], the following statement occurs: “The fab wafer yields
are still in single digit percentages, for example the yield on the first hot-lot of Fermis
that came back from TSMC was 7 good chips out of a total of 416 candidates, or a yield
of less than 2 percent.”
In the early stages of each technology node, yield is very low and mass
production of dies is not profitable (NVIDIA’s example). Intel's yield-learning history
for its 65nm technology is shown in Figure 1.1. As we can see, at the beginning of this
technology node the yield is very low, and time is lost raising the yield to a point
where mass production (Y_mass-production) can occur. It took Intel almost four years
of “yield learning” to start mass production in their 65nm technology. This
achievement was accomplished by Intel in part through their emphasis on design-for-
manufacturing (DFM) requirements [94]. In this dissertation we accelerate the yield
learning process to the same Y_mass-production using redundancy. The cartoon graph
in Figure 1.1 shows our yield learning for 65nm with a better time to market (about
two years).
Figure 1.1: Yield learning trend of Intel 65nm technology [94]
Such observations support the need for yield enhancement design techniques to
ensure viable future manufacturing yield. Some performance and/or die area may need to
be sacrificed to benefit yield and reliability. Redundancy is one way to potentially
increase yield, where some modules are replicated and input and output signals
distributed to and from selected modules using steering logic (more details about steering
logic will be provided in Chapter 2). The design chapter in the ITRS [98] states that
architectural redundancy will be required due to the difficulty in making circuits yield.
Most SoCs have memory and logic circuits, and memory is becoming the
dominant portion of these chips. According to the ITRS-2001, embedded memories will
occupy 54% to 94% of silicon real estate by the year 2014 [70] [103]. The remaining area
contains functional units such as controllers, register stacks, ALUs, DSP units, routing
networks, and performance-enhancing modules such as branch predictors. Much of this
circuitry is made robust (tolerant to defects, process variations, and noise) using
techniques such as: ECC (Error-Correcting Code) [73], where errors are corrected; TMR
(Triple Modular Redundancy) [45], where errors are masked; DFM (Design for
Manufacturing) adjustments to masks, such as double vias; reconfiguration; and
electrically and logically removing modules at the expense of capability, functionality,
capacity, performance or latency [43] [96]. Many of these techniques are extensively used
in memories. These modules often consume a significant portion of chip area and are
designed using very dense circuitry and small signal-to-noise ratios. Thus, once
fabricated, they often contain many defects and several faults. But by using robustness-
enhancement techniques such as those listed above, these memories are usually produced
with very high yield. Hence the yield of the logic circuits can become the
weak link in the yield chain, and this is the focus of our work.
Although logic circuitry occupies a small fraction (F) of the area of these chips,
most faults that create errors in this logic cause a chip to fail the manufacturing test. Due
to the small amount of area taken by this type of circuitry, redundancy can significantly
improve yield at a modest cost in total chip area. Yield enhancement of logic circuits,
such as a processor, is harder than that for memory due to their non-regular structure. In
addition, for emerging technologies with very low yield we need redundancy at a finer
level of granularity than the core level, i.e., than replicating the entire circuit [72] [75].
Also, logic circuits usually consist of many different modules at different levels of
granularity that have different probabilities of being error-free due to their variations in
complexity, area, and design style. Therefore, they may need different numbers of spares
when creating redundancy at finer levels.
In this dissertation, we first assume the original logic circuit has been partitioned
and information about the yield and size (area) of each module is available. We then
develop algorithms and heuristics that use redundancy to identify the best number of
spares to instantiate for each module, while considering the cost of adding spares
(yield reduction, area overhead, and performance loss), with the goal of maximizing
the total number of good dies per wafer. Next, we assume that partitioning the original
logic circuit is allowed, and develop a theory of partitioning that studies the impact
of granularity and uniformity of logic circuit partitions on yield/area using redundancy.
In this dissertation we make use of the notations listed in Table 1.1.
Table 1.1: Notations used in this dissertation

Y, A            Yield and area of the original targeted logic circuitry without redundancy
n               The number of modules or partitions of the original circuit
Y/A             Yield-per-Area, i.e., yield divided by area
a_i             The area of module i
y_i             The yield of module i, i.e., the probability that it is fault-free
q_i             The number of copies of module i after redundancy
d               Defect density per unit of area
Y_OM, A_OM      Yield and area of Other Modules (such as memory) on the chip
B_R             Budget for Redundancy (% of total area of the original chip)
y_{J_i}^{q_i}, a_{J_i}^{q_i}   The yield and area, respectively, of a join of q_i copies of module i, i.e., it has q_i input buses and one output bus (details in Chapter 2)
y_{F_i}^{q_i}, a_{F_i}^{q_i}   The yield and area of a fork of q_i copies of module i, i.e., it has one input bus and q_i output buses (details in Chapter 2)
We consider a uniform defect distribution inside a chip; therefore, we use Equation
(1.1) to compute the yield of each module. In fact, defect clustering divides the wafer into
regions with different defect densities, but each region has a uniform defect distribution
[88] [91]. We may either use a different redundant configuration for each region to
maximize the number of healthy dies in each region, or simply one redundant
configuration for all regions to lower the mask cost.

y_i = e^(-a_i × d)    (1.1)
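As an aside, Equation (1.1) is straightforward to sketch in code. The module area and defect density below are arbitrary illustrative values, not figures from this dissertation:

```python
import math

def module_yield(area: float, defect_density: float) -> float:
    """Poisson yield model of Eq. (1.1): y_i = exp(-a_i * d)."""
    return math.exp(-area * defect_density)

# Example: a 2 mm^2 module at d = 0.075 defects/mm^2.
y = module_yield(2.0, 0.075)
print(round(y, 4))  # exp(-0.15) ≈ 0.8607
```

Note how yield decays exponentially in area: doubling a_i squares the survival probability, which is why partitioning a large block into smaller redundant pieces can pay off.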
1.1 Metric of efficiency: Yield-per-Area (Y/A)
Consider a die having a given performance and functionality that sells for some
fixed price. It seems beneficial to maximize the number of such dies that can be obtained
from each wafer. Thus, our focus is on maximizing Y/A, or revenue per wafer. Consider a
wafer of area A_W, and a design D of area A_d and yield Y_d that implements some
specification S. Then, on average, the number of good dies G_d produced per wafer is

G_d = Y_d × (A_W / A_d).

Thus, to maximize G_d we seek a design that maximizes Y/A.
Figure 1.2: Three equal wafers with different number of acceptable chips
Figure 1.2 shows three different designs for some specification S. The wafer size
for all three designs is the same, A_W = 1; in this case G_d equals Y/A. Assume we use
different techniques, such as redundancy or upsizing, to enhance the yield. These
techniques have area overhead that reduces the number of fabricated dies. In Figure 1.2,
design (a) has the lowest yield but 44 dies, while design (c), with the highest yield, has
only 22. Yield alone is therefore not always the right criterion for maximizing the number
of good dies, i.e., a design with higher yield does not necessarily produce more good dies.
For example, although design (c) has a yield of 1, it has only 22 good dies, while design
(b), with lower yield, has more (23). As we can see, design (b) is the best of the three
designs, with the highest Y/A (23).
We can mathematically prove that the design with the highest Y/A yields the most
good dies when the wafer size is fixed. Assume we have two designs with different yields
and areas but the same wafer size A_W:

Design 1: G_1 = Y_1 × (A_W / A_1)        Design 2: G_2 = Y_2 × (A_W / A_2)

G_1 > G_2  ⟺  Y_1 × (A_W / A_1) > Y_2 × (A_W / A_2)  ⟺  Y_1 / A_1 > Y_2 / A_2

That is, design 1 is better than design 2 exactly when it has the greater Y/A. In this
dissertation we will maximize this metric for the given design using redundancy and
partitioning.
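The argument above can be checked numerically. The yields assumed below for designs (a) and (b) are hypothetical values chosen to reproduce the good-die counts quoted for Figure 1.2 (only design (c)'s yield of 1 is stated in the text), with each design's area taken as the reciprocal of its die count on a unit wafer:

```python
def good_dies(yield_d: float, area_d: float, wafer_area: float = 1.0) -> float:
    """Expected good dies per wafer: G_d = Y_d * (A_W / A_d)."""
    return yield_d * wafer_area / area_d

# Hypothetical (yield, area) pairs echoing Figure 1.2:
# (a) low yield, 44 sites; (b) moderate yield, 25 sites; (c) yield 1, 22 sites.
designs = {"a": (0.50, 1 / 44), "b": (0.92, 1 / 25), "c": (1.00, 1 / 22)}
for name, (y, a) in designs.items():
    print(name, round(good_dies(y, a), 1))
# Design (b) wins: it maximizes Y/A, hence G_d, even though (c) has higher yield.
```

The design with the highest Y/A always produces the most good dies for a fixed wafer, matching the proof above.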
1.2 Motivation of using redundancy at fine level of granularity
In this section we explain our motivation for using redundancy at a finer level of
granularity than the core level (the entire circuit) through two scenarios: (i) the original
circuit has already been partitioned, and (ii) we partition the original circuit for
redundancy purposes.
Most SoCs have logic circuits and other modules such as memories, as shown in Figure 1.3 (a). Let the yield of the logic circuits be Y_LC, and the yield of the other modules on the chip be Y_OM. Thus the yield of the SoC, denoted by Y_SoC, is Y_SoC = Y_LC × Y_OM. Memory modules contribute to Y_OM, and extensive work has gone into developing design techniques, such as ECC and reconfiguration, to increase Y_OM. Different fault tolerance techniques based on redundancy (spare rows and columns) are generally used to improve memory yield [9] [10] [58] [80] [90]. In the best case Y_OM = 1, and thus Y_SoC = Y_LC; in general, Y_SoC < Y_LC. In either case, it is desirable to make Y_LC as large as possible. Using replication, it is possible to increase Y_LC; its new value is denoted by Y_Rep-LC. Thus, Y_Rep-SoC = Y_Rep-LC × Y_OM.
Figure 1.3: Partitioning of original circuit
Figure 1.3 (b) indicates the original un-partitioned logic circuit, though in general it may already be described in terms of partitions. In Figure 1.3 (c) we illustrate the partitioning of the original circuit into n modules. For Figure 1.3 (c) we have Y = y_1 × y_2 × … × y_n and A = a_1 + a_2 + … + a_n, and each module (partition) is assumed to be unique and to have its own pool of spares. The general redundant configuration for Figure 1.3 (c) is shown in Figure 1.3 (d). The white rectangles are spares, and for an instantiation (fabricated copy) of the redundant circuit in Figure 1.3 (d) to be good (operational), at least one copy of each module must be good. The Y/A of a chip with other modules and a logic circuit with n partitions and spare modules (Figure 1.3 (d)) is

Y/A = [ Y_OM × Π_{i=1..n} ( y_Fi^{qi} × y_Ji^{qi} × (1 − (1 − y_i)^{q_i}) ) ] / [ A_OM + Σ_{i=1..n} ( q_i × a_i + a_Fi + a_Ji ) ]    (1.2)

where y_Fi^{qi} and y_Ji^{qi} (a_Fi and a_Ji) denote the yields (areas) of the q_i-way fork and join serving module i.
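Eq. (1.2), as reconstructed above, transcribes directly into code. The sketch below is illustrative only; the per-partition tuple layout is my own convention, and the fork/join yields are passed in already evaluated for their q_i-way sizes:

```python
from math import prod

# Each partition i is (y_i, a_i, q_i, yF_i, yJ_i, aF_i, aJ_i), where yF_i/yJ_i
# and aF_i/aJ_i are the yield and area of its q_i-way fork and join.
def redundant_YA(Y_OM, A_OM, parts):
    Y = Y_OM * prod(yF * yJ * (1 - (1 - y) ** q)
                    for (y, a, q, yF, yJ, aF, aJ) in parts)
    A = A_OM + sum(q * a + aF + aJ for (y, a, q, yF, yJ, aF, aJ) in parts)
    return Y / A

# Sanity check: one module of yield 0.5 and area 1, duplicated with perfect,
# zero-area steering logic: Y/A = (1 - 0.5**2) / 2 = 0.375.
print(redundant_YA(1.0, 0.0, [(0.5, 1.0, 2, 1.0, 1.0, 0.0, 0.0)]))
```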
The following equation gives the Y/A gain, G_Y/A, obtained using redundant logic circuits (Figure 1.3 (d)) compared to the original design without redundancy:

G_Y/A = (Y_Rep-SoC / A_Rep-SoC) / (Y_SoC / A_SoC) = (Y_Rep-LC / Y_LC) × (A_LC + A_OM) / (A_Rep-LC + A_OM)    (1.3)

Here A_LC, A_Rep-LC and A_OM represent the areas of the original logic circuits, the replicated logic circuits and the other modules, respectively. In the next two sub-sections we describe the two scenarios.
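Eq. (1.3) is simple enough to encode directly; the sketch below restates the gain formula (the variable names are mine) and illustrates why a large A_OM makes replication pay off:

```python
def gain_YA(Y_rep_lc, Y_lc, A_lc, A_rep_lc, A_om):
    # G_Y/A = (Y_Rep-LC / Y_LC) * (A_LC + A_OM) / (A_Rep-LC + A_OM):
    # yield gain of the replicated logic, discounted by the area growth.
    return (Y_rep_lc / Y_lc) * (A_lc + A_om) / (A_rep_lc + A_om)

# When other modules dominate the chip area, doubling the logic area barely
# changes the denominator, so even a modest yield gain gives G_Y/A > 1.
print(gain_YA(0.75, 0.5, 1.0, 2.0, 7.0))  # > 1: replication helps
print(gain_YA(0.75, 0.5, 1.0, 2.0, 0.0))  # < 1: with A_OM = 0 it does not
```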
1.2.1 Scenario 1: Redundancy for an originally partitioned circuit
Here we assume the circuit has been designed and partitioned, and that information about each module, such as its yield, area and number of inputs and outputs, is given to us. We illustrate the importance of redundancy in Y/A maximization for the given circuit with several examples. As mentioned, redundancy is one method to potentially increase yield, where some modules are replicated and input and output signals are distributed to and from selected modules using q-way-forks and q-way-joins, respectively (forks and joins are discussed in detail in Chapter 2).
As mentioned earlier, replication is desirable when G_Y/A > 1. Usually the area of the other modules is significantly larger than that of the logic circuits; hence the yield gain of using redundancy outweighs the area overhead, and thus G_Y/A > 1. In this dissertation we present different algorithms, heuristics and theorems to improve G_Y/A as much as possible. But this is not feasible in all cases; for example, when A_OM is 0 and the yield of the logic core is greater than 0.5, replicating the entire circuit at any level of granularity cannot lead to G_Y/A > 1.
Another example where redundancy may not improve Y/A is replication of the entire circuit (when A_OM = 0); without loss of generality, assume the given circuit is a pipeline. It is known that replicating an entire circuit q times, referred to as qRM (q Replicated Modules, shown in Figure 1.4), usually results in yield improvement but not Y/A improvement, because replicating a circuit q times requires at least q times the original area but, as can easily be shown, gives a yield gain of at most q. However, we will show that inserting more steering logic into a design at appropriate places, i.e., using redundancy at a finer level of granularity, can have a modest effect on area overhead but a significant effect on yield enhancement (Figure 1.5).
Figure 1.4: qRM = q Redundant Modules
Figure 1.5: The structure of n pipeline modules with j groups of modules
The yield of a qRM is:

Y_qRM = y_F^q × y_J^q × (1 − (1 − Π_{i=1..n} y_i)^q)

where y_F^q and y_J^q denote the yields of the q-way fork and join.
The yield of a qRM with j groups of modules is:

Y_qRMj = Π_{s=1..j} [ y_F^q × y_J^q × (1 − (1 − Π_{i=k*_{s−1}+1..k*_s} y_i)^q) ],

where k*_s = k_1 + k_2 + … + k_s (with k*_0 = 0) and k_1 + k_2 + … + k_j = n.
If y_F^q and y_J^q do not decrease too much as q increases, then Y_qRM tends to monotonically increase with q. Figure 1.6 provides some results for a 15-module pipeline, where all modules have the same yield (and similar forks and joins). The solid curves represent the yield and Y/A for the qRM, and the dashed curves represent the yield and Y/A of the structure with multiple switches shown in Figure 1.5. The values for the yield of the steering logic (forks, joins and switches) are given in Table 1.2. Figure 1.6 (a) illustrates that, for the range of values shown, the yield is not monotonically increasing with q. This occurs because, as q increases, the yield improvement due to having another copy of the pipeline becomes monotonically smaller, while the yield loss due to more complex steering logic continues to grow faster than linearly. Hence, in the general case a finite value of q exists that maximizes yield. (In the special case where the steering logic has a yield of 1, the optimal value of q for yield maximization is infinity.) The optimal value of q decreases as the yield of the modules increases. For example, in Figure 1.6 (a), the optimal value of q for modules with y = 0.95 is 5, while for y = 0.85 it is 10. This is consistent with today's technology, where yields for a module are very high, so replication is seldom used. But as modules become more complex, such as a processor core, replication (multiple cores) becomes prevalent.
Figure 1.6: Yield and Y/A for a pipeline with 15 modules. (a) Yield vs. q (number of copies); (b) Y/A vs. q. Curves are shown for module yields y = 0.95, 0.90, 0.85, 0.75, for both the qRM structure (solid) and the multi-switch structure of Figure 1.5 (dashed).
Table 1.2: The yield of forks and joins for different numbers of copies

q     2    3    4    5    6    7    8    9    10   11   12   13   14   15
Fork  .99  .98  .97  .96  .95  .94  .93  .92  .90  .87  .84  .83  .82  .80
Join  .99  .98  .97  .96  .95  .94  .93  .92  .90  .87  .84  .83  .82  .80
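With the reconstructed qRM yield formula and the Table 1.2 fork/join yields, the non-monotonic behavior of Figure 1.6 (a) can be explored with a short script. This is a sketch only: the exact curve values in the dissertation depend on modeling details not fully reproduced here.

```python
# q-way fork/join yields from Table 1.2 (Fork and Join rows are identical).
STEER = {2: .99, 3: .98, 4: .97, 5: .96, 6: .95, 7: .94, 8: .93,
         9: .92, 10: .90, 11: .87, 12: .84, 13: .83, 14: .82, 15: .80}

def y_qRM(q, module_yields):
    # Y_qRM = yF^q * yJ^q * (1 - (1 - prod_i y_i)^q); the superscript q in
    # yF^q names the q-way unit (yield read from Table 1.2), not a power.
    pipe = 1.0
    for y in module_yields:
        pipe *= y
    return STEER[q] * STEER[q] * (1 - (1 - pipe) ** q)

mods = [0.85] * 15
print([round(y_qRM(q, mods), 3) for q in range(2, 16)])  # not monotonic in q
```

The list first rises, because extra pipeline copies help, then dips at large q as the falling steering-logic yields of Table 1.2 start to dominate.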
Figure 1.6 (b) shows that for qRM (solid lines), Y/A monotonically decreases as q increases. Assume A_OM = 0 and perfect interconnect; then, using (1.3), we define the Y/A gain of qRM (G_Y/A-qRM) with respect to the original circuit (a simple pipeline structure):

G_Y/A-qRM = [ 1 − (1 − Π_{i=1..n} y_i)^q ] / [ q × Π_{i=1..n} y_i ]
Since G_Y/A-qRM monotonically approaches 1 from below, using qRM never leads to Y/A improvement. We next consider the design shown in Figure 1.5, where we can have up to n groups. Using steering logic we partition the original qRM design into j groups of modules, where group i has k_i modules. The tight upper bound for the yield gain (G_Y) of this design with respect to the original design is given by (1.4):

G_Y ≤ (y_F^q × y_J^q × q)^j, where 1 ≤ j ≤ n.    (1.4)
13
From ( 1.4) we see that the maximum value of G
y
is q
n
and occurs when the yield
of steering logic is 1. Ignoring the area of the steering logic, the maximum Y/A gain is q
n-
1
. As shown in Figure 1.6 (dashed lines), using multiple steering logic among modules
not only improves the yield compared to the qRM structure, but in most cases results in
Y/A improvement. Placing steering logic in different locations results in different gains in
yield as well as Y/A. The reader can verify that G
Y/A
> 1, especially for small values of q
and y
i
.
So far we have replicated all modules q times; however, different modules with different yields need different numbers of spares. For example, modules with high yield need fewer copies and simpler, more reliable steering logic, which can further improve Y/A. In Chapter 3 we present different algorithms and heuristics that use redundancy for Y/A maximization of originally partitioned logic circuits.
1.2.2 Scenario 2: Partitioning of the original circuit for redundancy
In this section we address the fact that the way we partition the original circuit affects the result of our redundancy algorithms for Y/A improvement. Consider a circuit with Y = 0.25 and A = 1. Assume that we partition it into two modules in two different ways, as shown in Figure 1.7 (a, b). Next consider the same circuit, this time with more partitions, as shown in Figure 1.7 (c). We use equation (1.1) to compute the yield of each partition. Now we duplicate all modules (Figure 1.7 (d, e, f)) and ignore the interconnect overhead. The Y/A of Figure 1.7 (d, e, f) is 0.259, 0.281, and 0.321, respectively, while the value for the original design is 0.25. That is, the design with more, balanced partitions led to better Y/A after redundancy.
Figure 1.7: Different partitions of one circuit
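The Y/A figures quoted above can be reproduced in a few lines, assuming Eq. (1.1) is the usual area-proportional yield scaling y_i = Y^{a_i} (with A = 1), and assuming, for illustration, that the Figure 1.7 partitions are 0.8/0.2 for (a), 0.5/0.5 for (b), and three equal thirds for (c); these splits are my guesses, chosen because they reproduce the quoted numbers:

```python
def duplicated_YA(Y, areas):
    # Duplicate each partition once (q = 2); ignore interconnect overhead.
    # Partition yield via area scaling: y_i = Y ** a_i (total area = 1).
    yield_red = 1.0
    for a in areas:
        y_i = Y ** a
        yield_red *= 1 - (1 - y_i) ** 2
    return yield_red / (2 * sum(areas))

print(round(duplicated_YA(0.25, [0.8, 0.2]), 3))        # 0.259, design (d)
print(round(duplicated_YA(0.25, [0.5, 0.5]), 3))        # 0.281, design (e)
print(round(duplicated_YA(0.25, [1/3, 1/3, 1/3]), 3))   # 0.321, design (f)
```

All three beat the un-replicated 0.25 only marginally or not at all until the partitions are balanced and numerous, which is the point of the example.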
The above observations motivate us to derive theoretical results relating the number of partitions and the balancing factors (the relations between the areas of partitions) to Y/A maximization with redundancy. In Chapter 4 we present our theorems and prove them mathematically. Later in that chapter we verify the correctness of our theorems and derive a design flow to find the optimal level of granularity for partitioning the original circuit. Our design flow considers the practical constraints of using redundancy at a fine level of granularity, such as steering-logic overheads.
1.3 Related work
Many yield enhancement techniques exist, and most can be classified by whether they are applicable to (i) memory or logic circuits, (ii) regular or irregular structures, (iii) circuits with a common pool of spares or circuits with an individual pool of spares for each module, (iv) general-purpose techniques or techniques for special structures, and (v) techniques that ignore the overhead of steering logic vs. ones that consider it.
Redundancy has long been incorporated into several styles of regular designs, most notably memory ICs. The high regularity of memory arrays simplifies the task of incorporating redundancy into their design for yield or reliability improvement [4] [7] [12] [24] [57] [61] [74] [76] [100]. Defect tolerance techniques for memories include using spare rows and columns and error-correcting codes [50]. In contrast, only a few logic ICs have been designed with some built-in defect tolerance (e.g., [51]).
In this dissertation we focus on one yield-enhancement design technique, namely spare modules, that explicitly targets logic circuits rather than memories and regular arrays. Redundancy has been used by researchers to improve the functional and/or performance yield of logic circuits [4] [8] [21] [51] [54] [57] [68] [70] [72] [79] [84] [95] [101]. There exists a large body of work that uses redundancy for circuits with regular architectures or topologies, mainly by I. Koren [11] [13] [44] [46] [47] [48] [49]. There are also many papers on lifetime reliability enhancement in processor design using redundancy, such as those described in [33] [34] [78] [83] [85] [86] [87]. Due to lack of space we do not summarize each paper; instead we use this section to point out the main differences between those redundancy-based techniques and our work.
Most of the related work assumes a common pool of spares and is applicable only to arrays of common circuitry [14] [22] [29] [53] [57] [70] [72] [77] [81] [93] [104]. For example, in [57] the authors use a classical bit-slice organization to design the data path of HYETI and add spare slices with the goal of Y/A maximization. In [72] a design methodology is proposed for maximizing the expected profit, which is a function of Y/A and packaging cost. It is assumed that the circuit has N identical primary modules and a pool of R spare modules. Using reconfiguration circuitry, each primary module can be replaced by any spare module; the goal is to achieve a connected configuration of N fault-free primary modules. A reconfiguration algorithm is introduced in [71] for VLSI arrays, where it is assumed that a row and a column of spares exist for all of the processing elements. Another example is [104], where the authors implement a mesh topology with 12 cores of which 9 must be functional; since the cores are identical, any faulty core can be replaced by a spare (similar work in [77]). These papers do not address redundancy at a lower level of granularity, where there may be many heterogeneous modules; our algorithms and heuristics in this dissertation address this efficiently.
Many other yield enhancement techniques replicate the entire circuit rather than just portions of it [71] [101]. Some traditional computers, such as the Tandem S2 [36], Stratus [97], and Teramac [18], simply replicate entire processors to tolerate fabrication defects. The IBM mainframe [83] not only has redundant processors, but also uses intra-processor redundancy for hard-fault tolerance; for example, redundant fetch/decode and instruction-execution units are used in the IBM G5 microprocessor. As mentioned earlier, replicating an entire circuit does not necessarily improve Y/A. For emerging technologies, where high defect densities are predicted [26], replicating entire cores to obtain an acceptable yield requires a large number of spares, which in turn reduces Y/A due to the high area overhead. We quantify this claim in our section on experimental results. Thus, by either replicating interchangeable modules or employing intra-module redundancy at finer levels of granularity (instead of the whole circuit), we can increase the yield and Y/A of a chip. In addition, this prior work does not provide an efficient mechanism for identifying the best number of spares for each module. These issues are dealt with in this dissertation.
Most of the prior work does not address the various side effects of using redundancy, i.e., the extra cost of interconnect: performance loss, area overhead and yield reduction. Previous fine-grain approaches, such as those discussed in [57] [65] [66] [79], have ignored the testability requirements of modern processors with intra-processor redundancy. On the other hand, in [75] the authors discuss a mechanism to diagnose hard defects in processors with fine-grain redundancy, but do not address the problem of finding the best redundant design.
Some research focuses on chip multi-processors (CMPs) [5] [72], where cores can share their fault-free components with each other. For example, a core with a faulty element of type Z "borrows" (time-multiplexes) a component of type Z from a functional neighboring core. Although these techniques increase the number of functional cores at the cost of performance degradation, for new technologies with high defect densities there is a need to exploit finer-grain redundancy at the micro-architectural block level of each core, as mentioned in [72].
The main difference between this dissertation and prior work is that we study all aspects of using redundancy at a finer level of granularity than the core level. For a circuit that has already been partitioned, we develop different algorithms and heuristics [65] [66] [67] that maximize the yield and Y/A of circuits with linear and non-linear structures in an SoC. Our algorithms and heuristics determine the best number of spares to use for each block. The blocks can be identical or different, and each block can have its own pool of spares. Interconnect overheads such as yield reduction, area overhead, and performance loss are considered. Since the circuit has already been partitioned, the only module-level reorganization that can occur is the concatenation (combining) of modules. In contrast, in another part of this dissertation we study the attributes of partitioning that increase or maximize Y/A under redundancy, such as the relative sizes of partitions and the number of partitions [68]. For this part we develop theorems and a cloud-based partitioning design flow that lead to Y/A maximization after redundancy; that is, we partition the circuit for redundancy purposes (based on our theoretical results). These properties are discussed in detail in the following chapters.
1.4 Dissertation outline
In Chapter 2 we design configurable and testable steering logic for our redundancy purposes, and then present our layout-driven tool (TYSEL) for yield, area and delay estimation of steering logic. In Chapter 3 we focus on originally partitioned circuits and define different problems for maximizing their yield and Y/A, presenting different algorithms and heuristics to solve these problems. In Chapter 4 we study the partitioning attributes of the original circuit with the goal of Y/A maximization using redundancy; for that, we develop two theorems and a design flow to find the optimal level of granularity to be used for redundancy. In Chapter 5 we conclude this dissertation and propose future work.
Chapter 2
Steering Logic
2.1 Introduction
As mentioned earlier, when multiple copies of a module are added to a circuit for redundancy, the input/output wires to and from these modules also have to be replicated. We call these sets of wires steering logic, since they steer the input and output data to and from the appropriate modules. In this chapter we study structures of steering logic, their susceptibility to different types of defects, and the extra circuitry needed to make them configurable and testable. We present our layout-driven tool, which uses Monte Carlo simulation to insert defects into the layout and compute the yield, area and delay of steering logic; its functionality is similar to [19].
2.1.1 q-way-fork
Figure 2.1 shows a simple structure for a 2-way-fork, where the buses are W bits wide. For the fork, data at the input bus (port) is sent to both output buses in the broadcasting mode, or to just one bus in the singleton mode. The generalization of a 2-way-fork is a q-way-fork, denoted by M_F^q, where the input is usually sent to all output ports (broadcast mode) or to just one (singleton mode). If such a unit is faulty, it may be able to send the input data to some output ports, but not to others, in an error-free manner. y_F^q(n) is the probability that in a q-way-fork at least n paths work correctly, where n = 1, 2, …, q. Thus y_F^q(q) refers to the probability that a q-way-fork operates correctly, and for simplicity we denote it by y_F^q. For a fork, a path refers to a connection from the input bus to any output bus.
Figure 2.1: Fork, input data sent to outputs
Figure 2.2: Join, one of the input buses goes to the output bus
2.1.2 q-way-join
A 2-way-join is shown in Figure 2.2, where data from just one of the input buses is sent to the output bus. A path in a join refers to a connection from any input bus to the output bus. The generalization of a 2-way-join is a q-way-join, denoted by M_J^q, and similarly y_J^q(n) is the probability that in a q-way-join at least n paths work correctly, where n = 1, 2, …, q. y_J^q(q) refers to the probability that a q-way-join operates correctly, i.e., is defect free, and for simplicity we denote it by y_J^q. Such a module may be faulty yet still operate correctly for some input-output combinations (paths).
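The "at least n paths" probabilities for forks and joins are obtained from layout-level simulation in TYSEL. As a purely illustrative first-order model, if each of the q paths were assumed to fail independently with per-path yield p (an assumption, not the dissertation's model, since defects on shared wires correlate path failures), y^q(n) would reduce to a binomial tail:

```python
from math import comb

def at_least_n_paths(q, n, p):
    # P(at least n of q paths are defect-free), assuming, for illustration
    # only, independent path failures with per-path yield p.
    return sum(comb(q, k) * p**k * (1 - p)**(q - k) for k in range(n, q + 1))

# A 2-way fork with per-path yield 0.9: at least one path works with
# probability 0.99, while the whole fork (both paths) works with 0.81.
print(at_least_n_paths(2, 1, 0.9), at_least_n_paths(2, 2, 0.9))
```

The gap between y^q(1) and y^q(q) is exactly what the configuration fuses of Section 2.2 exploit: a partially faulty unit can still contribute one good path.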
2.1.3 q_in×q_out-way-switch
A q_in×q_out-way-switch is a switch that has q_in input buses and q_out output buses. Figure 2.3 (a, b) shows, respectively, a 2×2-way-switch and its simplified representation. In Figure 2.3 (c) we indicate two of the normal functional modes of operation, namely (i) what comes in on the i'th input bus comes out on the i'th output bus, and (ii) what comes in on bus 1 comes out on bus 2, and vice versa. In Figure 2.3 (d) we show two non-standard modes of operation.
Figure 2.3: Switch: a multi-purpose transmission unit; (a) a 2×2-way-switch, (b) its simplified representation, (c) normal modes, (d) non-standard modes
In general, a q_in×q_out-way-switch is denoted by M_S^{q_in×q_out}, and the probability of having at least n paths operating correctly is y_S^{q_in×q_out}(n), where n = 1, 2, …, q_in×q_out. Therefore, y_S^{q_in×q_out}(q_in×q_out) refers to the probability that a q_in×q_out-way-switch operates correctly. For a switch, a path refers to a connection from any input bus to any output bus. Again for simplicity we use y_S^{q_in×q_out} to refer to the yield of a defect-free switch. Figure 2.4 shows a simple example using different steering logic, namely a 2-way-fork, a 2×3-way-switch and a 3-way-join, where w_i refers to the width of each bus.
Figure 2.4: A simple example of using steering logic
Different parameters, such as the number of input buses, the number of output buses, the bus width, the wire length and the wire width, affect the area (and yield) of steering logic. For example, Figure 2.5 shows hypothetical values of y_S^{q_in×q_out} for a switch with different values of q_in and q_out. As the number of input/output buses increases, the yield of the switch decreases. Later we will show that yield also decreases as bus width increases. In addition to area, yield is also dictated by defect density; in general we have defects with different sizes and densities on the various layers of a wafer. This matter is discussed more fully later in this chapter.
In the next section we design configurable and testable steering logic and formulate a problem to compute its yield and area. We introduce our tool to solve this problem, called TYSEL, which stands for Tool for Yield Estimation of Steering Logic. In Section 2.3 we discuss different types of defects and their probabilities of occurrence. We assume a configurable, parameterized layout for switches, which is the topic of Section 2.4. In Section 2.5 we describe our simulation technique and provide a few results. Section 2.6 presents a simple delay model that TYSEL uses to compute the extra delay added to a path by steering logic. The user's manual for our tool can be found in Section 2.8, along with experimental results.
Figure 2.5: Yield of a switch with different numbers of input/output buses
2.2 Configurable and testable steering logic
Figure 2.6 (a) is an example of a redundant configuration. We need to test each module and each steering-logic unit separately; once we find error-free paths and modules, we configure the structure as shown in Figure 2.6 (b).
Figure 2.6: (a) redundant circuit (b) after testing and configuration
In configurable steering logic we are able to activate a path from any input bus to any output bus and deactivate all other paths, for various reasons such as those listed next.
(i) Test: we may activate one bus multiple times for testing different paths or input/output modules. For example, Path_1 through Path_4 refer to the paths (Inp#1, Out#1), (Inp#1, Out#2), (Inp#2, Out#1), (Inp#2, Out#2) in Figure 2.7 for a 2×2-way-switch. Assume these paths are tested in the order Path_1, Path_2, Path_3, Path_4; then the Out#2 bus needs to be off, on, off, on, respectively. We use pass transistors or transmission gates for this purpose. Once we choose a path for the final configuration, we blow the laser fuses on the other paths to alleviate the effect of the steering logic on performance; by blowing the fuses on non-active output buses we reduce the output load capacitance. (These laser fuses usually require extra processing steps, but since they are already used for memory we can use them for steering logic too.)
Figure 2.7: A configurable 2-input/2-output switch (legend: Metal 1, Metal 2, fuse, via M1-M2, extra metal)
(ii) Yield improvement: a defect may or may not disable all paths through a steering-logic module, but by programmatically deactivating some paths in this module, it may be possible to rehabilitate other paths. For example, in Figure 2.7, the extra metal (short defect) on output bus #1 kills the paths from input buses #1 and #2 to both output buses. By blowing the fuses (see arrows) we deactivate this path so that the other paths become operational (more details in the next sections).
We can also use laser fuses to power off all the modules that are not in the final configuration, for power saving (the powered-down modules in Figure 2.6 (b)). Modules are testable, i.e., there are usually scan chains and/or BIST circuitry inside each module, and all we need in order to test a module is access to its inputs and outputs (a similar idea is used for testing different IP cores via wrappers [2] [28] [62] [92] [99] [105]). Thus, inputs should be controllable and outputs observable [1]. One way to achieve this functionality is to add transparent FFs to each input and output bit of a steering-logic unit to form scan chains. By transparent we mean that in normal operation we bypass them. This modification, however, has a high area overhead: for a q_1-input/q_2-output steering-logic unit with bus width w, we need (q_1 + q_2) × w extra FFs. A better design shares w FFs among the output buses, and another w among the input buses. The structure of a testable switch with FF sharing, which can also be used for testing its associated modules, is given in Figure 2.8. Here the switch has two input modules (with one-bit width) and two output modules. The FFs are transparent (they are not used in the normal mode of the system), and in general this type of switch only needs 2w FFs, which is 2/(q_1 + q_2) of the previous design. This overhead can be further reduced by noting that many output signals of Module 1 (in Figure 2.8) are already driven by FFs, say k_1 of them, and many input signals of Module 2 already drive FFs, say k_2; we can use these FFs instead of adding transparent FFs to those lines, reducing the number of additional FFs to 2 × (w − k_1 − k_2).
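The three FF-overhead cases above amount to simple arithmetic; a throwaway helper (the function and parameter names are mine) makes the comparison concrete:

```python
def extra_ffs(q1, q2, w, shared=False, k1=0, k2=0):
    # Scan-access FF overhead for a q1-input/q2-output steering-logic unit
    # with bus width w:
    #   per-bus FFs:                  (q1 + q2) * w
    #   shared FFs:                   2 * w          (k1 = k2 = 0)
    #   reusing k1/k2 existing FFs:   2 * (w - k1 - k2)
    if not shared:
        return (q1 + q2) * w
    return 2 * (w - k1 - k2)

# A 2x2 switch with 32-bit buses:
print(extra_ffs(2, 2, 32))                            # 128 FFs, per-bus
print(extra_ffs(2, 2, 32, shared=True))               # 64 FFs, shared
print(extra_ffs(2, 2, 32, shared=True, k1=10, k2=6))  # 32 FFs, reusing FFs
```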
Figure 2.8: A testable switch with FF sharing
Using this design (Figure 2.8), the inputs and outputs of all modules (including the spares) are accessible. Finally, we want to test the steering logic itself for stuck-at faults, opens and shorts. Many standardized methods have been developed to detect and diagnose faults in interconnection networks [15] [25] [30] [35] [37] [40] [59] [102] [105]. These techniques require the inputs and outputs of the wires to be controllable and observable, which our design supports. All we need to do is first configure the steering logic for each path, then apply the test vector to the input bus and observe the response at the corresponding output bus. There are also other ways of testing the steering logic, such as combining each path with its input or output module and testing them simultaneously; all of these methods are supported by our steering-logic design.
We assume that any defect in the extra logic circuitry of testable and configurable steering logic (FFs, pass transistors and their control lines) kills the steering logic, so any current yield model can be used to estimate the probability that this circuitry is defect free. This area is negligible compared to the whole steering-logic area, and its yield is almost 1. The rest of the steering logic, however, can be faulty and still be usable in the final design (discussed in detail in the next sections). Current defect-based yield models, for example the Poisson or negative binomial [54] models, are pessimistic for those
parts because they do not account for tolerable defects. With these parameters in mind we define the following problem:
Problem_TYSEL: Given
(i) the specs of a steering-logic unit, including the number of input and output buses, bus width, width of interconnect on each layer, distance between adjacent wires on each layer, order of input/output buses, etc., and
(ii) a model of the defects present in the process, characterized by type (open, short, …), sizes where applicable, and probabilities of occurrence (densities),
determine the area, delay and yield (y_S^{q_in×q_out}(k), where k = 1, 2, …, q_in×q_out) of the given steering logic.
2.3 Defects
We consider three types of defects: extra metal (X_m), which can create a short between two adjacent wires; missing metal (M_m), which can create an open in a wire; and a bad (open) via. We also assume defects can have different sizes with different probabilities of occurrence. In this dissertation we may use extra metal (missing metal) and short defect (open defect) interchangeably. For example, Figure 2.9 shows the defect universe obtained by looking at 100,000 chips that have defects classified as either extra metal (short defect) or missing metal (open defect) for layer i. (Note: these numbers are invented to emphasize important concepts regarding our tool; companies usually do not release this information.) These are defects, not necessarily faults. Some chips had more than one defect. In addition, not all defects of the same category were modeled the same way. For example, 60% of all extra-metal defects in layer i are modeled as circular extra-metal deposits of radius 2λ, and these may create a short (fault) depending on where they fall. The others were modeled using a larger radius, as shown in Figure 2.9 (a). Defects smaller than 2λ are ignored, since they do not lead to erroneous circuit operation and thus are not considered faults. Also, the probability that a via between layers i and i+1 is open (faulty or non-functional) is given as 0.0002. Multiple vias may be used in the design as part of the design-for-manufacturability process. There are different ways to compute the open-via probability P_{V,i,i+1} between layers i and i+1; we use the following equation:

P_{V,i,i+1} = (r_{i,i+1} × d) / d_{V,i,i+1}

where d is the defect density per unit area for the chip, r_{i,i+1} is the percentage of these defects that are open vias between layers i and i+1, and d_{V,i,i+1} is the via density between layers i and i+1. For example, for a q_in×q_out-way steering-logic unit with two layers, area A and bus width w, the via density can be defined as follows:

d_{V,1,2} = max(q_in, q_out) × w / A    (2.1)
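The two formulas can be checked numerically. The defect density and open-via share below are invented values that happen to reproduce the 0.0002 figure quoted in the text:

```python
def open_via_prob(d, r, d_via):
    # P_V = (r * d) / d_via: open-via defects per unit area, divided by
    # vias per unit area, gives the probability a given via is open.
    return r * d / d_via

def via_density_two_layer(q_in, q_out, w, A):
    # Eq. (2.1): a two-layer q_in x q_out steering-logic unit needs about
    # max(q_in, q_out) * w vias, spread over area A.
    return max(q_in, q_out) * w / A

# Invented numbers: d = 0.01 defects/area, 2% of them open vias, one via
# per unit area, giving P_V = 0.0002 as quoted above.
print(open_via_prob(0.01, 0.02, 1.0))
print(via_density_two_layer(2, 3, 8, 12.0))  # 3 * 8 / 12 = 2.0 vias/area
```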
Figure 2.9: Different defect types and sizes: (a) extra metal, (b) missing metal (both modeled with radii from 2λ up to 5λ), (c) open via
The effect of defects on steering logic depends on their locations, types and sizes. Thus not all defects are killer defects, i.e., defects that corrupt the functionality of a circuit, and a circuit with a killer defect can sometimes still be used with reduced functionality. In other words, the number of functioning paths through a switch determines its usefulness. Figure 2.10 shows a fork that has one input bus (of width w) and three output buses. The black squares represent fuses which, when blown, create an open in their associated wire. In the actual circuit, each of the buses would be connected to a different module.
Figure 2.10: A 3-way-fork with different types of defects
We can classify these defects into four classes. Class (i) defects are those that do not cause error-producing faults and can be ignored, such as defects 8 and 9 in Figure 2.10; another example is a short between two segments of the same net. Class (ii) defects disable the entire circuit, such as defects 1 and 2 -- these defects create opens on Bit_1 and Bit_2 of the input bus, respectively, and thus make the fork non-functional. Class (iii) defects do not disable the entire circuit, such as defects 3 through 7 -- defect 7 causes a short between two bits of output bus 1, but by blowing the fuses on output bus 1 (f_1, f_2 and f_3) we can still use the other buses. The same situation holds for defect 5. If defect 4 occurs (an open), we can still use buses 1 and 2 without blowing any fuses. Defect 6 is an open via and just makes output bus 3 non-functional. Class (iv) consists of tolerable defects. These defects are tolerable for our redundancy purpose, where we only need one error-free path in the final configuration. For example, defect 3 is a tolerable defect because, by blowing appropriate fuses, we can use any output bus. If the user needs output bus 1, then blowing the fuses on bus 2 (f_7, f_8 and f_9) and/or bus 3 (f_4, f_5 and f_6) makes that bus usable. By blowing f_4, f_5 and f_6 we can use bus 2, and by blowing f_7, f_8 and f_9 we can use bus 3.
2.4 Configurable layout
We have developed a configurable, parameterized layout for steering logic. The following parameters are used: (i) bus width (w), (ii) number of input/output buses (q_in, q_out), (iii) width of the wire on each layer (W1, W2 in Figure 2.10), (iv) distance between adjacent wires on each layer (S1, S2 in Figure 2.10), (v) length of wires (L1, L2, L3 in Figure 2.10) and (vi) order of input/output buses. These parameters affect the area (and yield) of the steering logic. In addition to area, yield is also dictated by defect density. We assume that defects of different sizes and densities can exist on the various layers of a wafer. Yield is also affected by the exact location of the input/output buses. We will determine the critical area of each defect to compute the yield, as explained in the next section. For more details regarding layout generation see the Appendix at the end of this chapter.
2.4.1 Critical area for extra metal (Xm) defects
Not every Xm defect causes a fault; whether it does depends on the size and location of the defect. In Figure 2.11 (a) an Xm defect does not connect two wires, while the same-size defect in Figure 2.11 (b) does. Figure 2.11 (c) shows the same situation as Figure 2.11 (a), but now the defect is larger and does create a short. So the size and center location of a defect are the critical parameters.
(a) (b) (c)
Figure 2.11: Xm defect versus Xm fault
A microscopic unit of area, m_t(a), is said to be critical with respect to (i) a layout L, (ii) a defect of type t and (iii) size (radius) a if (a) the center of the defect is at m and (b) it creates a fault. The sum of all of the microscopic areas associated with a layout L and a defect of type t and size a is referred to as the critical area w.r.t. (L, t, a). Assume the radius of an Xm defect is r and the distance between two adjacent wires is d. If d is greater than 2r, then the critical area for that defect is 0, i.e., the Xm defect can never excite a fault. Otherwise, the critical area for two adjacent wires is given by the expression below and shown as the black region in Figure 2.12:
Critical area (short defect) = y × (L + 2x); where L = wire length, y = 2r - d, and x = √(r² - (d/2)²).
Figure 2.12: Critical area for an Xm defect
We compute the critical area of each Xm defect located in the steering logic to determine whether it is a fault or just a defect. If it is a fault, then we store all information about that fault (the size, type, location, and layer) in the related data structure to compute the yield and other parameters that will be explained later in this chapter.
2.4.2 Critical area for missing metal (Mm) defects
Again, not every Mm defect is a fault; as with Xm defects, it depends on the size and location of the defect. The Mm defect shown in Figure 2.13 (a) does not bisect the wire, while the one in Figure 2.13 (b) does. Again, the two main parameters associated with an Mm defect are its center and size.
(a) (b) (c)
Figure 2.13: Mm defect versus Mm fault
The critical area for an Mm defect of size (radius) r with respect to a layout L is the area such that a fault is created if the center of the defect lies in this area. Assume the radius of the defect is r and the width of a wire is w. If w is greater than 2r, then the critical area for that defect is 0; i.e., the Mm defect cannot cause a fault. Otherwise, the critical area for a wire is computed as follows and shown as the black region in Figure 2.14:
Critical area (open defect) = y × (L + 2x); where L = wire length, y = 2r - w, and x = √(r² - (w/2)²).
Figure 2.14: Critical area for an Mm defect
We compute the critical area for each Mm defect located in the steering logic to see whether it causes a fault. If it does, then we store all information about that fault (the size, type, location, and layer) in the related data structure. We use this information to compute the yield and other parameters that will be explained later in this chapter.
2.5 Simulation
To solve Problem_TYSEL we develop a tool based on the Monte Carlo simulation technique. Monte Carlo methods (or Monte Carlo experiments) are a class of computational algorithms that rely on repeated random sampling to compute their results. Monte Carlo methods are often used in simulating physical and mathematical systems. Because of their reliance on repeated computation of random or pseudo-random numbers, these methods are well suited to calculation by a computer and tend to be used when it is infeasible or impossible to compute an exact result with a deterministic algorithm [63]. Our tool runs the simulation procedure a fixed number of times, specified by the user, and each time checks the functionality of each input/output bus. At the end it reports y_S^{q_in×q_out}(k) where k = 1, 2, …, q_in×q_out.
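The per-bus yield bookkeeping this implies, counting the runs in which a bus stays functional and dividing by the total number of runs, can be sketched as follows; the per-run outcomes (1 = functional, 0 = faulty) are assumed to come from TYSEL's defect generation and fault simulation, and the function name is illustrative:

```c
/* Monte Carlo yield estimate for one bus: the yield is the number of
 * runs in which the bus was functional divided by the total number of
 * runs. The outcome array is assumed to be filled in by the defect
 * generation and fault simulation steps described in the text. */
double estimate_bus_yield(const int *functional, int runs) {
    int good = 0;
    for (int i = 0; i < runs; i++)
        if (functional[i])
            good++;                       /* bus survived this run */
    return (double)good / (double)runs;
}
```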
Next we give a simple example to show how TYSEL draws the layout and generates defects, and then provide a simulation result using real defect density data. Assume we are given a 2-way-fork with a two-bit bus width, and one defect size (radius λ) for both layers and both defect types, open and short. TYSEL draws the layout of the steering logic just once and then computes the critical area of each type of defect (Xm and Mm) for all the different sizes (Figure 2.15 (a, b)). Then, during each run, TYSEL generates defects and verifies whether each defect located in the steering logic is a fault or just a defect (a defect is a fault when its center lies in a critical area). For example, the simulation for RUN #1 of Figure 2.15 (c, d) is as follows:
· Open defect (metal 1): TYSEL generated one open defect (Figure 2.15 (c)) whose center is located in a critical area and hence disables the circuit,
· Open defect (metal 2): TYSEL generated one open defect (Figure 2.15 (c)) whose center is not located in a critical area and hence can be tolerated,
· Short defect (metal 1): TYSEL generated two short defects (Figure 2.15 (d)) whose centers are not located in a critical area and hence both defects can be tolerated,
· Short defect (metal 2): TYSEL generated one short defect (Figure 2.15 (d)) whose center is not located in a critical area and hence can be tolerated.
(a) Critical areas for open defects, (b) Critical areas for short defects
(c) Run #1 for open defects, (d) Run #1 for short defects
(e) Run #2 for open defects, (f) Run #2 for short defects
Figure 2.15: Layout of a 2-way-fork with bus width 2
Then TYSEL performs the via simulation and, finally, updates the yield of each path using the generated faults (open, short and open via). It repeats this process (defect generation and fault simulation) for the number of runs given by the user, e.g., RUN #2 (Figure 2.15 (e, f)), and at the end reports y_S^{q_in×q_out}(k) where k = 1, 2, …, q_in×q_out.
Next, we give a simple example of TYSEL's output based on real defect densities and show that the commonly used Poisson yield model is pessimistic compared to TYSEL. The defect sizes and their distributions at different layers are given in Table 2.1, based on papers [23] [31] [41] [42] [69] [89]. Size refers to the radius of the defect; e.g., if there is a defect in a chip, the probability of it being a short defect with radius λ located in layer 1 is around 10.7%. We set r(1,2) to 5% for the open via probability. The rest of the defects are located on other layers, such as metal 3, poly, or open vias between metals 2 & 3, which are not related to our design of steering logic. These numbers may differ slightly for new or future technologies, but that does not affect the functionality of our tool.
Table 2.1: Size and probability of occurrence for each defect

                    Layer 1          Layer 2
           Size     Xm      Mm       Xm      Mm
Defect1    λ        10.7    1.43     9.45    1.28
Defect2    2λ       5.37    0.72     4.72    0.64
Defect3    3λ       3.47    0.46     3.06    0.41
Defect4    4λ       2.84    0.38     2.5     0.34
Defect5    5λ       2.21    0.3      1.94    0.26
Defect6    6λ       1.89    0.25     1.67    0.23
Defect7    7λ       1.58    0.21     1.39    0.19
Defect8    8λ       1.26    0.17     1.11    0.15
Defect9    9λ       1.26    0.17     1.11    0.15
Defect10   10λ      0.95    0.13     0.83    0.11
Total:              31.6%   4.22%    27.8%   3.75%
An important motivation for developing TYSEL was that yield models cannot address some defects which can be tolerated due to our application-specific design of steering logic (class (iv) defects as discussed in Section 2.2). As an example we consider a 4-way-fork with different bus widths (varying from 8 to 1024 bits). We use the following equation to compare the yield of a 4-way-fork generated by TYSEL (y_S^{1×4}(4)) and the Poisson yield model (y_model) given in (1.1):

Error = (y_S^{1×4}(4) - y_model) / y_model × 100.
TYSEL runs the simulation procedure 1,000,000 times to compute the results, and Figure 2.16 shows the Error for the 4-way-fork under four different defect densities. As we can see, this popular model is pessimistic, and the error increases as the defect density (and bus width) increases. This fact shows the necessity of a precise yield estimator for steering logic when the defect density is high (emerging technologies), especially when we are looking for the optimal level of granularity or number of spares for redundancy to maximize Y/A.
Figure 2.16: Accuracy of TYSEL compared to the Poisson yield model (Error versus bus width for defect densities DD = 0.075, 0.25, 0.35 and 0.5)
More experimental results and details about the simulation are given in the appendix. TYSEL also reports the delay that the given steering logic adds to the circuit. In the next section we discuss the delay model that TYSEL uses.
2.6 Delay Model used by TYSEL
TYSEL reports the delay that the given steering logic adds to the circuit. This delay depends on different parameters, such as design information (number of input/output buses, bus width, etc.), load capacitance, mutual interconnect capacitance, crosstalk, and fuse capacitance. Since these interconnects are local, the flight time over them is much less than the signal rise/fall times [38]. Thus the logic can be modeled as a pure RC model [38]. The delay of any circuit is proportional to the RC time constant of the circuit. Therefore, we model the delay of the steering logic as an RC circuit with R and C values of the individual metal layers, which in turn depend on the design information of the steering logic. In this section we develop a model to compute this delay. For simplicity of presentation, we present the model for the case where we need just one active path from one input bus to one output bus. We use fuses to cut the other paths, which reduces the load capacitance and leads to delay reduction. Nevertheless, TYSEL reports the delay for each path under all conditions, whether or not fuses are present and when there is more than one active path.
Let us develop the delay model for the fork given in Figure 2.10, a w-bit 3-way-fork. The delay is the time it takes for a signal to propagate from one of the inputs A, P or W to the corresponding output bit on output bus #1, 2 or 3. The worst-case delay is of particular interest to us. First, all the parameters that the delay depends upon are determined. Assuming all the output lines in the fork drive equal load capacitances, the delay is proportional to the interconnect length, fuse capacitance and lateral capacitance (crosstalk-induced delay [6]) between interconnects. The analysis is done for the fork module considering interconnect length and lateral capacitance. Fuses come in different types; here we assume they are transmission gates (TGs). The capacitance of the transmission gates also contributes to the path delay. The equivalent RC model for any bit of an active path of the fork in Figure 2.10 is shown in Figure 2.17.
Figure 2.17: RC equivalent circuit for any path in a fork
The following is a description of the RC model and the method to calculate its parameters.
(a) R1/C1: The distributed RC model of the resistance and capacitance of metal 1 of length Lm1.
(b) OFF-TG CAP: This is the diffusion capacitance seen by the path that is contributed by the TGs that are in the off-state (disabled). It depends on the size of the pmos/nmos transistors in the TG; we consider them to be minimum size, but the user can choose any other size. We have: CTG = Cd-pmos + Cd-nmos. The metal 1 line with length L1 (Figure 2.10) has a diffusion capacitance of 2CTG when the TG is off.
(c) ON-TG MODEL: This is the RC component of the TG itself, which is part of the path from input to output. It is represented as a π-section. The resistance of the TG is contributed by the pmos and nmos resistances and is represented by RTG: RTG = Rpmos || Rnmos.
(d) L2-SUB-SECTION: This part represents the number of L2 sections that are considered in the RC delay model. It depends on the relative position of the input bus with respect to the output bus. For a fork or a switch, the parameter Order specifies to which output bus the input bus is connected. For a join, it specifies to which input bus the output is connected. We may have m L2 sub-sections, where m depends on the active path (order and input/output bus number): m = |Order(input bus) - Order(output bus)|.
(e) R3/C3: The distributed RC model of the resistance and capacitance of metal 1 of length L3.
The parameters R1, C1, R2, C2, R3 and C3 are distributed along the lengths of their respective metals. The parameters CTG, RTG and Cload are lumped values. To compute the values of R1/C1, R2/C2 and R3/C3, the lengths of the corresponding sections have to be computed. The lengths Lm1, Lm2 and Lm3 are calculated as follows.
Referring to Figure 2.10 we have:

Lm1(i) = L1 + (i - 1) × (W2 + S2) + W2, for 1 ≤ i ≤ w;
Lm2 = m × (S1 + W1) - S1 if m > 0, and Lm2 = 0 if m = 0;
Lm3(i) = L3 + (w - i + 1) × W2 + (w - i) × S2.

We use these computed lengths to extract the resistance and capacitance of each segment of wire, and then we use the distributed RC model to compute the delay of each segment. In the following paragraphs we calculate the resistance and capacitance of each segment.
Resistance calculation: The resistance of a line of length L (for both metal 1 and metal 2) is calculated as follows:

R = Rs × (L / Weff) × [1 + TCR × (T - 25)]

where (i) Rs: sheet resistance, (ii) L: length of the interconnect line, (iii) Weff: effective width of the interconnect (W1 or W2), (iv) TCR: temperature coefficient of resistance, and (v) T: temperature in °C.
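This formula transcribes directly into code (the function name is illustrative, not TYSEL's):

```c
/* Interconnect resistance per the formula above:
 * R = Rs * (L / Weff) * (1 + TCR * (T - 25)),
 * with Rs the sheet resistance, L the line length, Weff the effective
 * width (W1 or W2), TCR the temperature coefficient of resistance and
 * T the temperature in degrees C. */
double wire_resistance(double Rs, double L, double Weff,
                       double TCR, double T) {
    return Rs * (L / Weff) * (1.0 + TCR * (T - 25.0));
}
```

At T = 25 °C the temperature term drops out and R reduces to Rs × L / Weff.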
Capacitance calculation: The capacitance computation is more complicated than the resistance computation. There are different types of capacitances, as shown in Figure 2.18: side capacitances (Cleft, Cright, Cup, Cdown), and the capacitances between metals and NDF (N-diffusion), PDF (P-diffusion), PC (poly contact) and ISO (isolation region). Looking at Figure 2.10, we see that Cright and Cleft for both metal 1 and metal 2 should be considered, but all the other capacitances depend on the final layout design of the circuit. Nevertheless, TYSEL considers all the capacitances, and the user can set these values to obtain a more realistic delay.
Figure 2.18: Different capacitance types
The worst case is when a wire of length L has as many neighbors as possible: wires on top and bottom of it, and all of the NDF, PDF, PC and ISO types beneath it, which gives the following formula for the capacitance:

C = (C_NDF + C_PDF + C_ISO + C_PC) × (W × L)/4 + (C_left + C_right) × L + (C_up + C_down) × (W × L)/(W + S)
The user can set any of these values to 0 based on the layout of the final design. Parameters such as TCR, Rs, Cup, Cdown, etc. can be found in the Process Design Kit (PDK) of each technology. Having the values of R and C for each section of Figure 2.17, TYSEL uses the distributed RC model and computes the delay of the given steering logic.
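The text does not spell out TYSEL's distributed RC evaluation; one common way to evaluate such an RC ladder is the Elmore delay, sketched here under the assumption that each section of Figure 2.17 has been reduced to a lumped R and C. This is a generic sketch, not TYSEL's exact implementation:

```c
/* Elmore delay of an n-section RC ladder: each resistance R[i] charges
 * all the capacitance downstream of it, so the delay is
 * sum_i R[i] * (C[i] + C[i+1] + ... + C[n-1]). */
double elmore_delay(const double *R, const double *C, int n) {
    double delay = 0.0;
    for (int i = 0; i < n; i++) {
        double downstream = 0.0;
        for (int j = i; j < n; j++)
            downstream += C[j];          /* capacitance past R[i] */
        delay += R[i] * downstream;
    }
    return delay;
}
```

For two unit sections (R = C = 1), the first resistor charges both capacitors and the second charges one, giving a delay of 3 RC units.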
2.7 Conclusions
When we use redundancy at a fine level of granularity, the overheads of steering logic, such as area, yield reduction and extra delay, can no longer be ignored. These overheads need to be addressed explicitly during yield computation, because accurate yield estimation is important to companies from several points of view. Current yield models are pessimistic, especially for interconnects, where many defects and some faults can be tolerated. In this chapter we developed a CAD tool (TYSEL) to estimate the yield of interconnect (we used steering logic to explain TYSEL's functionality) and showed that the classical yield model is pessimistic and that the yield estimation error increases for more complicated designs.
2.8 Appendix: TYSEL
TYSEL is the name of the tool we developed for yield computation of steering logic; its pseudo code is given in Figure 2.19. First, TYSEL reads the information about the given steering logic, such as the number of input/output buses, bus width, etc., from an input file (Section 2.8.1 explains the input file format in detail). Then TYSEL simulates the layout of the steering logic for fault simulation; Section 2.8.2 explains the layout simulation. In line 3, TYSEL asks the user to enter the number of runs (x), and in line 4 it calls the simulation function x times. Section 2.8.3 gives the details of the simulation function. Finally, TYSEL computes the yield of the given steering logic after the simulations and reports the results (Section 2.8.4).
TYSEL() {
(Line 1)    Read the input file();
(Line 2)    Simulate the layout of given steering logic();
(Line 3)    Get the number of runs from user & store it in x;
(Line 4)    for (i = 1; i <= x; i++)
(Line 5)        Simulation();
(Line 6)    Yield_Computation();
}

Simulation() {
(Line 7)    Defect_generation(first layer & open defect);
(Line 8)    Defect_generation(first layer & short defect);
(Line 9)    Defect_generation(second layer & open defect);
(Line 10)   Defect_generation(second layer & short defect);
(Line 11)   Open_Fault_Simulation();
(Line 12)   Short_Fault_Simulation();
(Line 13)   VIA_Fault_Simulation();
}

Figure 2.19: Pseudo code of TYSEL
2.8.1 Input format
TYSEL gets its parameters from an input text file. Figure 2.20 shows the parameters the user will be asked to enter. The first line of Figure 2.20 asks the user to enter "w: Bus width", which has been set to 2 in this example; w is the number of bits of a bus. In the second and third lines, TYSEL asks the user to enter the number of input and output buses, respectively. Here they are 1 and 3, which means this steering logic has one input bus and three output buses (a 3-way-fork). Lines 4 and 5 ask for the technology; TYSEL needs these parameters to extract the area of the given steering logic. In this example we use 65nm technology.
(Line 1) w: Bus Width: 2
(Line 2) I: Number of input buses: 1
(Line 3) O: Number of output buses: 3
(Line 4) Technology(nm): 65
(Line 5) λ (nm): 35
(Line 6) W1(λ) = 2
(Line 7) L1(λ) = 2
(Line 8) D1(λ) = 2
(Line 9) W2(λ) = 2
(Line 10) L2(λ) = 2
(Line 11) D2(λ) = 2
(Line 12) Order = 1
(Line 13) ####INFO OF FIRST LAYER####
(Line 14) Open Defect Density[Defects/(λ*λ)]= 25e-5
(Line 15) Short Defect Density[Defects/(λ*λ)]= 45e-5
(Line 16) Number of Open Defects= 4
(Line 17) 1st Open Defect info (λ & %) = 2.0 75
(Line 18) 2nd Open Defect info (λ & %) = 4.0 20
(Line 19) 3rd Open Defect info (λ & %) = 6.0 4
(Line 20) 4th Open Defect info (λ & %) = 8.0 1
(Line 21) Number of Short Defects= 4
(Line 22) 1st Short Defect info (λ & %) = 2.0 85
(Line 23) 2nd Short Defect info (λ & %) = 4.0 10
(Line 24) 3rd Short Defect info (λ & %) = 6.0 4
(Line 25) 4th Short Defect info (λ & %) = 8.0 1
(Line 26) ####INFO OF SECOND LAYER####
(Line 27) Open Defect Density[Defects/(λ*λ)]= 25e-5
(Line 28) Short Defect Density[Defects/(λ*λ)]= 45e-5
(Line 29) Number of Open Defects= 4
(Line 30) 1st Open Defect info (λ & %) = 2.5 65
(Line 31) 2nd Open Defect info (λ & %) = 3.5 25
(Line 32) 3rd Open Defect info (λ & %) = 5.0 7
(Line 33) 4th Open Defect info (λ & %) = 7.0 3
(Line 34) Number of Short Defects= 4
(Line 35) 1st Short Defect info (λ & %) = 2.5 75
(Line 36) 2nd Short Defect info (λ & %) = 3.5 15
(Line 37) 3rd Short Defect info (λ & %) = 5.0 7
(Line 38) 4th Short Defect info (λ & %) = 7.0 3
(Line 39) V12(Via_Failure_Rate)(%)= 4e-6
Figure 2.20: Example of input data to TYSEL
Figure 2.21 shows the layout generated from these parameters. As we can see from Figure 2.21 (a), each bus has 2 bits. The next parameter is W1, which is the width of metal 1 and is shown in Figure 2.21 (a) on Bit_1 of the input bus. In this version of TYSEL we have two metal layers, but the tool can be extended to support more metal layers. All parameters in lines 6 to 11 are given in units of λ; for example, W1 = 2 means the width of metal 1 is 2λ. We also assume that all metal 1 & 2 wires have the same widths. The same definition holds for W2 (line 9), which is the width of metal 2. L1 (line 7) is the minimum length of the input bus bits; in Figure 2.21 (a) it is the length of Bit_0. Similarly, L2 (line 10) is the minimum length of the output bus bits, and in Figure 2.21 (a) it is the length of Bit_1.
(a) (b) (c)
Figure 2.21: A simple fork with different input bus orders
D1 (line 8) and D2 (line 11) give the distance between two metal 1 wires and two metal 2 wires, respectively (as shown in Figure 2.21 (a)). We assume the distances between all adjacent metal 1 and metal 2 wires are always D1 and D2. Order (line 12) gives the order, or location, of the input or output bus. When the steering logic is a fork, Order gives the location of the input bus; for example, there can be three different orders in the example of Figure 2.20. These three orders, 1, 2 and 3, are shown in Figure 2.21 (a, b, c), respectively. When the steering logic is a join, Order gives the location of the output bus. When the steering logic is a switch, Order gives the location of the first input/output bus: if I is greater than O, then Order gives the location of the first output bus, and if O is greater than I, then Order gives the location of the first input bus. For example, when I = 3, O = 5 and Order = 2, the location of the first input bus is 2, as shown in Figure 2.22.
Figure 2.22: A 3×5-way-switch with Order = 2
The order of an input/output bus may change due to layout problems. In addition, this order can change the functionality of the steering logic, so it needs to be addressed explicitly. For example, in a fork, where we want as many functional output buses as possible, it is better to place the input bus in the middle rather than on the top or bottom. Figure 2.23 shows a 3-way-fork with orders 1 and 2. A short defect on metal 2 kills one output bus with Order 1 (Figure 2.23 (a)), and we can still use the two other buses by blowing the fuses marked with arrows. But the same defect kills two output buses with Order 2 (Figure 2.23 (b)).
As mentioned earlier, TYSEL assumes three types of defects: (i) short, (ii) open and (iii) open via. TYSEL supports different numbers of defects (different sizes and densities) for each layer. For example, in Figure 2.20, line 14, the defect density of open defects in the first layer is 25×10^-5 per λ², while the defect density of short defects in the first layer is 45×10^-5 (line 15). The user should enter the number of defect sizes for each layer; for example, the user entered 4 (lines 16 and 21) for open and short defects in each layer. Then the user should enter the radius (size) of each defect and its probability of occurrence. For example, the radius of the first open defect in layer one (line 17) is 2λ and its percentage of occurrence is 75%, which means that 75% of the open defects in the first layer have a size of 2λ. Similar definitions hold for the other defects at other layers. Finally, the user is asked for the open via rate; in this example this rate is 4×10^-6 per via.
(a) Order 1 (b) Order 2
Figure 2.23: A faulty 3-way-fork with different input bus orders
TYSEL uses the data structure given in Figure 2.24 to simulate a defect. This data structure holds the type of the defect, which can be open, short or open via (Kind); the layer of the defect, which is the first or second layer (Layer); the size of the defect, which is its radius (Size); and the coordinates of the defect's center (X & Y).
typedef struct Defect_Struc {
    char Kind;            // Open, Short or open VIA
    enum Comp_Type Layer; // METAL1 or METAL2
    float Size;           // Size (radius) of the defect
    float X;              // X of the defect's location
    float Y;              // Y of the defect's location
} Defect_Struc;
Figure 2.24: Data structure for a defect
2.8.2 Layout simulation
Based on the information in the input file, TYSEL simulates the layout of the steering logic. The different components are listed in Figure 2.25: metal 1, metal 2, via (between metal 1 and metal 2), fuse2 (on metal 2), fuse1 (on metal 1), and dummy_fuse2 & dummy_metal2 (explained later).
enum Comp_Type {METAL1, METAL2, VIA, FUSE2,
                FUSE1, DUMMY_FUSE2, DUMMY_METAL2};
typedef struct Comp_Struc {
(Line 1)    enum Comp_Type Type;
(Line 2)    float X;
(Line 3)    float Y;
(Line 4)    float Height;
(Line 5)    float Width;
(Line 6)    struct Comp_Struc **OComps;
(Line 7)    char Num_Comp;
(Line 8)    char FUSE;
(Line 9)    char VIA;
(Line 10)   char Open;
(Line 11)   char Short;
(Line 12)   int Defect_Number;
} Comp_Struc;
Figure 2.25: Data structure for a component
For each component we store the following information: the type of the component, its coordinates X & Y, and its height and width (all shown in Figure 2.26).

Figure 2.26: A component of Metal 1

(a) (b)
Figure 2.27: A fork and its connectivity graph

Num_Comp (line 7 in Figure 2.25) is the number of components that the current component is connected to; for example, this variable for I0 in Figure 2.27 is 2. Definitions of the other variables are given in Table 2.2. The last variable is Defect_Number, which gives the number of defects located on this component that made it faulty.
Table 2.2: Functionality of some variables in Comp_Struc

Variable   When it is 1                                 When it is 0
Fuse       the fuse has not been blown                  the fuse has been blown
VIA        the via is open                              the via is connected
Open       the metal (1 or 2) is open                   the metal is OK
Short      the metal (1 or 2) is shorted by a defect    the metal is OK
50
2.8.3 Fault simulation
For each layer and each short and open defect density we call the Defect_Generation function. We illustrate the operation of this function with an example. Assume the defect density of the first layer for short defects is 0.04 per λ²; this means that in a circuit with an area of 100λ² we have 4 defects. We simulate this area as a square whose side is 10λ. We know the area of the steering logic produced by the embedded layout generator; assume in this example it is 24λ² (height = 6λ and length = 4λ). Assume we have a random generator function which, given a square, returns the (X, Y) coordinates of a random point in this square using a uniform distribution. In this example we call this function 4 times for a square with an area of 100λ². We embed our steering logic in this square and consider the defects whose coordinates fall inside it. If the size of the steering logic is greater than 100λ², then TYSEL uses a larger rectangular model. Assume in the previous example the height and length of the generated layout were 25λ and 20λ, respectively. Then TYSEL would use a 25λ × 20λ rectangle, which has 0.04 × (25×20) = 20 defects.

Figure 2.28 (a) shows the square (100λ²), the embedded steering logic (24λ²), and the 4 random coordinates generated for the 4 short defects. As we can see, one of these defects is located in the steering logic. Now assume the open defect density for this layer is 0.06/λ², which means we have 6 open defects in a square of 100λ². Figure 2.28 (b) shows the coordinates generated for the open defects by our random generator function, where two open defects are located in the steering logic.
(a) (b)
Figure 2.28: An example for defect generation
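Counting how many generated defect centers fall inside the embedded layout, as in the example above, can be sketched as follows; this assumes the layout's lower-left corner sits at the origin of the defect-generation square, and the function name is illustrative:

```c
/* Count how many generated defect centers (x[i], y[i]) fall inside the
 * embedded steering-logic rectangle of the given width and height,
 * assuming the rectangle's lower-left corner is at the origin of the
 * defect-generation square. */
int defects_in_layout(const double *x, const double *y, int n,
                      double width, double height) {
    int inside = 0;
    for (int i = 0; i < n; i++)
        if (x[i] >= 0.0 && x[i] <= width && y[i] >= 0.0 && y[i] <= height)
            inside++;                    /* this defect lands on the layout */
    return inside;
}
```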
Next we determine the size of each defect. Assume that for both short and open defects we have the following distribution:

Size (radius)                λ      1.5λ    2λ     2.5λ
Probability of occurrence    80%    15%     4%     1%

This means that 80% of open/short defects have a radius of λ and 1% of them have a radius of 2.5λ (just an example). So every time our random generator function returns the coordinates of a defect, it also returns its size. This function generates a random number between 0 and 1. In our example, if that number is between 0 and 0.8 then the size of the defect is λ; if it is between 0.8 and 0.95 then the size is 1.5λ; if it is between 0.95 and 0.99 then the size is 2λ; and finally, if it is between 0.99 and 1 then the size is 2.5λ. In Figure 2.28, the size of each defect is also given. Similar operations are done for the short/open defects of the other layer. In this example we have one short and two open defects in the first layer; that is, TYSEL supports multiple simultaneous defects on each layer.
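The size-selection step just described is an inverse-CDF lookup on the cumulative thresholds 0.8, 0.95 and 0.99. A sketch for this example distribution (the function name is illustrative):

```c
/* Map a uniform random number u in [0, 1) to a defect radius using the
 * example distribution above (λ: 80%, 1.5λ: 15%, 2λ: 4%, 2.5λ: 1%),
 * i.e., an inverse-CDF lookup on the cumulative thresholds. */
double defect_radius(double u, double lambda) {
    if (u < 0.80) return 1.0 * lambda;
    if (u < 0.95) return 1.5 * lambda;
    if (u < 0.99) return 2.0 * lambda;
    return 2.5 * lambda;
}
```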
In the next step, TYSEL determines whether these defects are faults. TYSEL decides whether a defect is a fault by determining the critical area for that defect and checking its location. As mentioned, if it is a fault, then TYSEL stores all information about that fault in the related data structure; otherwise, it ignores the defect. We use this information to compute the yield and other reliability parameters.
For the open via simulation, TYSEL calls a random number generator function for each via. For example, assume the open via rate is 0.0025 per via, which means that among 10000 vias, 25 are open (faulty). Our random generator function returns a number between 1 and 10000; if the number is smaller than or equal to 25, then we say that via is open (faulty); otherwise, it works correctly. If the via is faulty, TYSEL stores the related information for that via and uses it later in the yield computation, which is explained in the next section.
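The per-via draw described above can be sketched as follows, with the C library rand() standing in for TYSEL's random generator and the function name illustrative:

```c
#include <stdlib.h>

/* One draw per via: with an open rate of open_per_10000 in 10000
 * (e.g., 25 for the 0.0025-per-via example above), a via whose draw in
 * 1..10000 is at most open_per_10000 is declared open (faulty).
 * rand()/srand() stand in for TYSEL's random generator. */
int count_open_vias(int num_vias, int open_per_10000, unsigned seed) {
    srand(seed);
    int open = 0;
    for (int v = 0; v < num_vias; v++) {
        int draw = rand() % 10000 + 1;   /* uniform draw in 1..10000 */
        if (draw <= open_per_10000)
            open++;
    }
    return open;
}
```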
2.8.4 Yield computation by TYSEL
As mentioned, TYSEL runs the simulation for limited number of times which will
be given by user. During each run TYSEL calls the Defect_Generation function for each
layer. Then it checks if the generated defects are faults or not. Based on the order of
input/output bus and fault’s location it finds the functional buses. TYSEL reports the yield
of each bus which is the probability of a bus to be functional. Any defect which kills a
bus directly or indirectly reduces the yield of that bus. For example in Figure 2.23 (b) the
defect 1 kills the middle output bus directly; consequently, it kills the top output bus.
TYSEL counts the number of times that each bus is functional during total runs and then
divide this number to the total number of runs and report it as the yield of that bus.
TYSEL also reports y^{S_{q_in×q_out}}(n) for n = 1, …, q_in × q_out. The next section
provides extensive experimental results.
2.8.5 Experimental Results by TYSEL
In this section, we explain TYSEL's output format and provide results of this tool for
yield computation. We explain TYSEL's output file using an example: consider a 6-way-fork
with the following information:
Bus Width= 128
I: Number of input buses: 1
O: Number of output buses: 6
Technology(nm)= 65, Lambda(nm)= 35
W1(Lambda)= 2, L1(Lambda)= 2, D1(Lambda)= 2
W2(Lambda)= 2, L2(Lambda)= 2, D2(Lambda)= 2
Order= 3
####INFO OF FIRST LAYER####
Open Defect Density[Defects/(Lambda*Lambda)]= 75 e -12
Short Defect Density[Defects/(Lambda*Lambda)]= 75 e -12
Number of Open Defects= 4
1st Open Defect info(Size(radius)&Distribution(%))= 2 75
2nd Open Defect info(Size(radius)&Distribution(%))= 4 20
3rd Open Defect info(Size(radius)&Distribution(%))= 6 4
4th Open Defect info(Size(radius)&Distribution(%))= 8 1
Number of Short Defects= 4
1st Short Defect info(Size(radius)&Distribution(%))= 2 85
2nd Short Defect info(Size(radius)&Distribution(%))= 4 10
3rd Short Defect info(Size(radius)&Distribution(%))= 6 4
4th Short Defect info(Size(radius)&Distribution(%))= 8 1
####INFO OF Second LAYER####
Open Defect Density[Defects/(Lambda*Lambda)]= 75 e -12
Short Defect Density[Defects/(Lambda*Lambda)]= 75 e -12
Number of Open Defects= 4
1st Open Defect info(Size(radius)&Distribution(%))= 2 75
2nd Open Defect info(Size(radius)&Distribution(%))= 4 20
3rd Open Defect info(Size(radius)&Distribution(%))= 6 4
4th Open Defect info(Size(radius)&Distribution(%))= 8 1
Number of Short Defects= 4
1st Short Defect info(Size(radius)&Distribution(%))= 2 85
2nd Short Defect info(Size(radius)&Distribution(%))= 4 10
3rd Short Defect info(Size(radius)&Distribution(%))= 6 4
4th Short Defect info(Size(radius)&Distribution(%))= 8 1
V12(Via_Failure_Rate)(%)= 75 e -10
Figure 2.29: The input file for a fork with 128 bus width and 6 output buses
In this example TYSEL runs the simulation function 10000 times and generates the output
given in Figure 2.30. Path(i,j) refers to the path from input bus i to output bus j. Lines
1 to 12 show the defects generated during the 10000 runs. For example, line 4 says there is
an open fault on path (1,2) during RUN#4376, and consequently path (1,1) also does not work
(line 3), which means this fault is on metal 2. For better understanding, Figure 2.31 shows
this case (without loss of generality and for simplicity of presentation we assume a bus
width of 2, while in our example the bus width is 128). Another example is line 11, which
says path (1,2) does not work because of a short fault. Here we can see that this fault is
on metal 1, and by blowing the fuses on metal 1 of this bus we can save the other output
buses.
(Line 1) RUN#82: BUS 5 DOES NOT WORK BECAUSE OF AN OPEN FAULT
(Line 2) RUN#3643: THE PATH(1,1) DOES NOT WORK BECAUSE OF A SHORT FAULT
(Line 3) RUN#4376: THE PATH(1,1) DOES NOT WORK BECAUSE OF OTHER BUSES'S FAULTS
(Line 4) THE PATH(1,2) DOES NOT WORK BECAUSE OF AN OPEN FAULT
(Line 5) RUN#5747: THE PATH(1,1) DOES NOT WORK BECAUSE OF OTHER BUSES'S FAULTS
(Line 6) THE PATH(1,2) DOES NOT WORK BECAUSE OF AN OPEN FAULT
(Line 7) RUN#6187: THE PATH(1,1) DOES NOT WORK BECAUSE OF A SHORT FAULT
(Line 8) RUN#7041: THE PATH(1,1) DOES NOT WORK BECAUSE OF OTHER BUSES'S FAULTS
(Line 9) THE PATH(1,2) DOES NOT WORK BECAUSE OF AN OPEN FAULT
(Line 10) RUN#9397: THE PATH(1,1) DOES NOT WORK BECAUSE OF OTHER BUSES'S FAULTS
(Line 11) THE PATH(1,2) DOES NOT WORK BECAUSE OF A SHORT FAULT
(Line 12) RUN#9973: THE PATH(1,6) DOES NOT WORK BECAUSE OF AN OPEN VIA
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The Results After 10000 Runs are As follow:
(Line 15)==>(1) The Probability of Having At Least 6 Error Free Paths is 0.9993
(Line 16)==>(2) The Probability of Having At Least 5 Error Free Paths is 0.9996
(Line 17)==>(3) The Probability of Having At Least 4 Error Free Paths is 1.000000
(Line 18)==>(4) The Probability of Having At Least 3 Error Free Paths is 1.000000
(Line 19)==>(5) The Probability of Having At Least 2 Error Free Paths is 1.000000
(Line 20)==>(6) The Probability of Having At Least 1 Error Free Paths is 1.000000
(Line 21) The Yield of Perfect Steering Logic is 0.999300
***********Yield of Different Paths from Input Buses to Output Buses
(Line 23) Path(1,1)=.9994, Path(1,2)=.9996, Path(1,3)=1, Path(1,4)=1, Path(1,5)=.9999, Path(1,6)=1
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Figure 2.30: The generated output by TYSEL for the given input in Figure 2.29
Lines 15 to 20 show y_F^6(n), which was defined earlier. Line 21 reports the yield of a
perfect fork, meaning that all input/output buses work correctly; in this example it equals
y_F^6(6). Finally, line 23 gives the yield of each path. One important parameter for the
accuracy of the results is the number of runs: a small number leads to inaccurate results,
while a large number is time consuming. We investigated this trade-off and found 1,000,000
runs to give accurate results without excessive run time. Note that this number depends on
the size of the steering logic, i.e., bigger steering logic needs a greater number of runs.
Figure 2.32 shows the perfect yield (y_F^q(q)) of 9-way-forks with different bus widths
under a defect density of 0.075/mm^2. The input bus order is set to 5. The number of runs
varies from 100 to 2,000,000, and for clarity of presentation we normalized the horizontal
axis in Figure 2.32 by the natural logarithm. As we can see, 1,000,000 is the point where
the yield of the steering logic converges to its real value. The 9-way-fork with 1024-bit
buses is a big steering logic and 0.075/mm^2 is a high defect density, which together
require a significant number of runs to obtain accurate results. Therefore, we select this
number of runs for our simulations, where the defect densities are at most 0.075/mm^2 and
the steering logics are smaller than this fork.
Figure 2.31: The fault of RUN#4376 given in Figure 2.30
[Three plots of yield vs. LN(Number of Runs); each curve converges near 1,000,000 runs]
(a) Bus width = 256 bits
(b) Bus width = 512 bits
(c) Bus width = 1024 bits
Figure 2.32: Yield of a 9-way-fork with different bus widths and d = 0.075/mm^2
Next we show results for steering logic with different bus widths (w) and different defect
densities (d) per mm^2. We consider a 6-way-fork with input order 3. As we can see from
Table 2.3, as the bus width increases the yield decreases; the reason is the greater area,
which causes lower yield. Likewise, as the defect density increases, the yield of the
steering logic decreases.
Another parameter which increases the area is the number of input/output buses. For
example, in a fork, more output buses means greater area and lower yield. Figure 2.33 shows
the yield of a q-way-fork as q varies from 1 to 9. The bus width is 1024 bits and the
results are given under different defect densities. Under all defect densities, the yield
decreases as q increases, and the reduction is larger at higher defect densities.
Table 2.3: Yield of a fork under different bus widths and defect densities

  d \ w    16       32       64       128      256      512      1024
  0.002    1        .999999  .999999  .999991  .999962  .999909  .999417
  0.0035   1        .999999  .999998  .999989  .999941  .999783  .999138
  0.0045   1        .999999  .999997  .999985  .999934  .999759  .998967
  0.015    1        .999997  .999990  .999938  .999775  .999112  .996523
  0.05     .999996  .999990  .999947  .999817  .999264  .997234  .98888
  0.075    .999991  .999990  .999935  .999742  .998935  .995776  .98290
Figure 2.33: Yield of q-way-fork under different defect densities
Chapter 3
Algorithms and Heuristics for Yield and Yield/Area
Maximization in SoC
3.1 Introduction
In this chapter we assume the original circuit has been designed and partitioned
and we have the information about the yield and area of each module and its steering
logic. We assume that modules chosen for replication are testable and related tests exist
for each module. Most IP’s are good examples of such modules. Once a chip is
manufactured it is tested and a table is developed indicating which modules and steering
logic pass. If a path from the primary inputs to the primary outputs exists that only passes
through “good” modules and steering circuitry, the chip is configured accordingly, and
modules that are not used are powered down to minimize dynamic and static power.
Although we are primarily addressing hard fabrication faults, our techniques are equally
applicable to the problem of increasing system reliability. We power down all good and
faulty spares that are NOT initially used, so there is little or no aging of the
powered-down good spares, which can be used later once the presently working modules
begin to fail.
We introduce different algorithms and heuristics for yield and Y/A maximization
and classify them based on the following seven concepts:
P1: Is it an algorithm or a heuristic, i.e., does the technique guarantee finding an
optimal answer?
P2: Do the modules have different numbers of spares, or do all have the same number? As
mentioned before, Y/A is an important measure of design quality. The value of Y/A for
designs produced with the same number of spares for all modules can be further increased
based on the following observations: (i) when the number of copies of a module increases,
the yield of the associated steering logic decreases, and (ii) for two modules with
different yields, it is usually better to use more copies of the module with the lower
yield. Figure 3.1 (a) shows a structure with maximum yield where all modules have the same
number of spares, and Figure 3.1 (b) shows a new structure where the number of spares
differs per module. The yields of the modules and steering logic are shown inside each box.
Although the new structure has fewer copies of m_1, it has a higher yield than structure
(a). More importantly, although the yield of the new structure is only slightly higher than
that of structure (a), it has significantly less area and hence a significantly higher Y/A.
(a) Y = 0.80 (b) Y = 0.823
Figure 3.1: (a) same number of spares, (b) different number of spares
P3: Does it maximize yield or Y/A? Techniques for yield maximization have lower time
complexity than techniques for Y/A maximization; nevertheless, in many cases they also
improve Y/A.
P4: Does it concatenate different modules to get a better Y/A? As mentioned, in this
chapter we assume the circuit has already been partitioned, and we only try to maximize
yield or Y/A using an appropriate number of spares. We introduce this concept because a
design with arbitrary partitioning may contain modules with very high and very low yields,
and it is sometimes better to concatenate high-yield modules to remove the steering-logic
overhead between them. For example, Figure 3.2 (a) shows a structure where the yields of
the modules are high and a switch is used between them; its yield is lower than that of
structure (b), where we removed that switch and concatenated the modules directly. Design
(b) also has a smaller area, since removing the switch saves area; hence it has a better
Y/A than design (a).
(a) Y = 0.995 (b) Y = 0.997
Figure 3.2: Different yields with and without concatenation
P5: What kind of technique does it use: Greedy (G), Divide and Conquer (DC), or Dynamic
Programming (DP)? We use these techniques either by themselves or in combination to develop
our procedures. The implementation details of each algorithm and heuristic will be given
below.
P6: What is the time complexity of the procedure as a function of the number of modules
(n)? We compare the time complexities of our procedures using big-O notation [17].
P7: Is it also applicable to non-linear structures? All our techniques are applicable to
linear structures. By linear we mean that the outputs of each module feed only the next
module or the primary outputs, and its inputs come only from the previous module or the
primary inputs. Many pipelines are implemented as linear structures. By non-linear
structures we mean general structures in which any module can feed any other module.
Figure 3.3 (a, b) shows two simple examples of linear and non-linear structures.
(a) Linear (b) Non-linear
Figure 3.3: Different structures
First we compare all our techniques using these concepts (P1 to P7) and explain each one
separately; finally, we show experimental results and compare the techniques with each
other under different defect densities. Table 3.1 compares these techniques using the
concepts defined above. In the following sections we define the related problem for each
technique and then explain its implementation.
Table 3.1: Comparison of our developed techniques

  Concept   SIRUP       OSIP        M-OSIP      MYRA        HYPER
  P1        Heuristic   Algorithm   Algorithm   Algorithm   Heuristic
  P2        Same        Same        Different   Different   Different
  P3        Yield       Yield       Yield       Yield/Area  Yield/Area
  P4        YES         YES         YES         NO          YES
  P5        G           DP          DP          DP          G, DP, DC
  P6        O(n)        O(n^2)      O(n^2)      O(n^2)      O(n^3)
  P7        NO          NO          NO          YES         NO
3.2 SIRUP: Switch Insertion in RedUndant Pipelines
We begin by defining the problem that SIRUP solves and then explain the procedure. For this
problem we assume that the yields of all switches are identical and less than or equal
to 1. We name the structure generated by SIRUP SIRUP_qRM.

Problem_SIRUP: Given m_i and y_i for i = 1, 2, …, n, a constant value of replication q, and
scalable switches along with their associated yields, where should switches be inserted
within qRM (Figure 1.4) to maximize the resulting yield?
3.2.1 Partitioning a pipeline
We introduce and prove a theorem used to develop SIRUP.

Theorem_SIRUP: Given n modules where (i) every module has the same yield y, (ii) the value
of replication q is constant, and (iii) the number of switches is fixed at j, the maximum
yield is attained by placing a switch after every group of n/(j+1) modules. (Note that when
n/(j+1) is not an integer, in some cases one needs to take the floor of n/(j+1) and in
other cases the ceiling, so that the sum of the sizes of the j+1 groups is n.)
Proof: We use mathematical induction to prove Theorem_SIRUP for q = 2; for other values of
q the proof is similar.

The Basis Step: We show that the theorem is correct when j = 1, and thus the number of
modules in each group is n/2, assuming n > 1. When j = 1 we have only one switch, hence two
groups of modules. Let the number of modules in the first group be z, so the number of
modules in the second group is n − z. Referring to Figure 1.5, the yield of such a system is

  Y = y_F^2 · y_J^2 · (1 − (1 − y^z)^2) · (1 − (1 − y^(n−z))^2)
    = y_F^2 · y_J^2 · y^z (2 − y^z) · y^(n−z) (2 − y^(n−z)).

To find the value of z that maximizes Y, we take the derivative of Y with respect to z and
set the result equal to 0. Since y^z · y^(n−z) = y^n,

  0 = dY/dz = y_F^2 · y_J^2 · y^n · ln(y) · [ y^(n−z) (2 − y^z) − y^z (2 − y^(n−z)) ]
            = 2 · y_F^2 · y_J^2 · y^n · ln(y) · ( y^(n−z) − y^z )
  ⇒ y^(n−z) = y^z ⇒ n − z = z ⇒ z = n/2.

Thus the basis has been proven.
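The basis step can be checked numerically. The sketch below evaluates the two-group yield expression for every split z and confirms that z = n/2 wins; the module and fork/join yields are arbitrary illustrative values:

```python
def two_group_yield(n, z, y, y_f2=0.98, y_j2=0.98):
    """Yield of a duplicated (q = 2) pipeline split into groups of z and
    n - z identical modules, per the basis-step expression."""
    group = lambda k: 1 - (1 - y ** k) ** 2   # yield of one duplicated group
    return y_f2 * y_j2 * group(z) * group(n - z)

# For n = 10 modules of yield 0.9, the derivative argument predicts that
# the even split z = 5 maximizes the total yield.
n, y = 10, 0.9
best_z = max(range(1, n), key=lambda z: two_group_yield(n, z, y))
```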
The Inductive Step: Assume the statement holds for j − 1 switches, which implies that the
number of modules in each group needed to maximize Y is n/j. Consider the optimal solution
with j switches, and let the number of modules in the first group be k_1, so that the
number of modules in the remaining groups together is n − k_1. If this is an optimal
allocation of switches, then the arrangement of the n − k_1 modules with j − 1 switches
must also be optimal. Using our induction hypothesis, the number of modules in each group
of the n − k_1 modules should be

  (n − k_1) / j.                                        (3.1)

Referring to Figure 1.5, the yield of the configuration just described is
  Y = y_F^2 · y_J^2 · (1 − (1 − y^(k_1))^2) · (1 − (1 − y^((n−k_1)/j))^2)^j
    = y_F^2 · y_J^2 · y^(k_1) (2 − y^(k_1)) · [ y^((n−k_1)/j) (2 − y^((n−k_1)/j)) ]^j.
We seek the value of k_1 that maximizes Y, so we take the derivative of Y with respect to
k_1. For simplicity, let g = y_F^2 · y_J^2 and z = (n − k_1)/j, so that
Y = g · y^(k_1) (2 − y^(k_1)) · [ y^z (2 − y^z) ]^j and dz/dk_1 = −1/j. Setting the
derivative to zero,

  0 = dY/dk_1
    = 2 g · ln(y) · [ y^z (2 − y^z) ]^(j−1) · y^(k_1) · y^z
        · [ (1 − y^(k_1))(2 − y^z) − (2 − y^(k_1))(1 − y^z) ],

so the bracketed term must vanish:

  (1 − y^(k_1))(2 − y^z) = (2 − y^(k_1))(1 − y^z)
  ⇒ 2 − y^z − 2 y^(k_1) + y^(k_1 + z) = 2 − 2 y^z − y^(k_1) + y^(k_1 + z)
  ⇒ y^z = y^(k_1) ⇒ z = k_1,

and hence

  k_1 = (n − k_1)/j ⇒ k_1 = n/(j + 1).                  (3.2)
We assumed that the number of modules in the other groups satisfies (3.1), and by using k_1
from (3.2) in (3.1) we have

  k_1 = k_2 = … = k_(j+1) = n/(j + 1).   Q.E.D.         (3.3)
Using (3.3) one can verify that the yield of each group is

  Y_(k_i) = y^(n/(j+1))   (Y_(k_i) is the yield of group k_i).   (3.4)
Using Theorem_SIRUP and (3.4) we get the following.

Basis of SIRUP: Given n modules in a linear structure with different yields y_i, a constant
value of replication q, and j switches to embed, the maximum attainable yield is achieved
when the yield of each group of modules is as close as possible to the value P_j, where

  P_j = ( ∏_{i=1}^{n} y_i )^(1/(j+1)).                  (3.5)
In the next section, we explain the SIRUP procedure for insertion of switches.
3.2.2 Implementation of SIRUP
We use the above result to develop SIRUP. The qRM structure with optimally placed switches
is denoted SIRUP-qRM. SIRUP has two phases; the first phase finds the number of switches
that maximizes the yield of SIRUP-qRM. The pseudo code of this phase is shown in
Figure 3.4.
(1) Let j* be that value of j, 0 ≤ j ≤ n − 1, which maximizes
      y_F^q · y_J^q · (y_S)^j · (1 − (1 − P_j)^q)^(j+1),
    where y_S is the yield of a q-input/q-output switch;
(2) num_switch = j*
Figure 3.4: First phase of SIRUP
The complexity of this process is O(n). The second phase identifies where to
insert the j* switches so as to maximize Y. The details of this procedure are shown in
Figure 3.5. SIRUP uses two pointers, Left and Right, that initially point to the first and
last modules (Step 0). In step 1, SIRUP checks if a switch is to be inserted. All j*
switches might have already been inserted, or the first phase could return j* = 0. This can
occur, e.g., when the yield of a switch is relatively low compared to that of the modules.
Step 2 compares the number of remaining switches that have not yet been inserted with
the number of available switch locations. If they are equal, then SIRUP assigns one
switch to each available location and the procedure terminates. Otherwise, the procedure
enters Step 3 where the location for inserting the next switch is determined.
Figure 3.5: Second phase of SIRUP
In step 3 we first re-compute the value of “P”, using an expression different from
that shown in ( 3.5). We will return to this point later in the discussion.
As mentioned before, SIRUP uses two pointers, Left and Right, to find two groups of
modules, where m_Left is the first module of the left group and m_Right is the last module
of the right group, as shown in Figure 3.6. ΔL and ΔR are the differences in yield between
P and the left and right groups, respectively. We would like ΔL and ΔR to be 0. This is
often not possible because modules have different yields and n is not divisible by j*.
Hence, we use two pointers to increase the chance of finding a group of modules whose yield
is close to P. As long as adding a module to the right end of the left group reduces ΔL,
SIRUP inserts a module into this group and increases the value of the second index, Left2.
The same process is repeated for the right group. Then, if ΔL is smaller than ΔR, SIRUP
inserts a switch between m_Left2 and m_(Left2+1), updates the Left pointer, decreases the
number of switches by one, and goes to Step 1. A similar process is used when ΔR is smaller
than ΔL, as shown in Figure 3.6.
Figure 3.6: An example of Left and Right pointers
Each time SIRUP inserts a switch, we have a new sub-problem consisting of the modules from
m_Left to m_Right. However, (3.5) considers all modules, not a subset of them. Thus, to
increase the accuracy of our procedure, we modify the P function, as shown in the first
line of Step 3; now we only consider the modules from m_Left to m_Right. The complexity of
the second phase is also O(n), as we just traverse the n modules, and hence the complexity
of SIRUP is O(n).
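A compact sketch of SIRUP's grouping idea, under simplifying assumptions: it uses a single left-to-right pass that cuts as soon as a group's running yield reaches the target P_j of (3.5), instead of the Left/Right two-pointer scan, and the module yields are illustrative:

```python
import math

def sirup_groups(yields, j):
    """Cut n modules into j + 1 contiguous groups whose yields are close
    to the target P_j of (3.5). Simplified one-pass greedy version."""
    n = len(yields)
    target = math.prod(yields) ** (1.0 / (j + 1))     # P_j from (3.5)
    groups, current, prod = [], [], 1.0
    for i, y in enumerate(yields):
        current.append(y)
        prod *= y
        cuts_left = j - len(groups)
        # Cut here once the group's yield has reached the target, provided
        # enough modules remain to populate the outstanding groups.
        if prod <= target and cuts_left > 0 and n - i - 1 >= cuts_left:
            groups.append(current)
            current, prod = [], 1.0
    groups.append(current)
    return groups, target

groups, P_j = sirup_groups([0.97, 0.89, 0.65, 0.67, 0.55, 0.99], j=2)
```

For j = 2 switches this produces three groups whose yield products straddle P_j ≈ 0.589.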
3.3 OSIP: Optimal Switch Insertion in Pipelines
In this problem we remove SIRUP's constraint of a fixed number of identical switches.
Different modules can have different interconnect complexity, so their steering logic can
have different yields and areas; moreover, the number of steering-logic blocks that
maximizes yield can differ among designs with different yields. The following problem
addresses these issues.

Problem_OSIP: Given m_i and y_i for i = 1, 2, …, n, a constant value of replication q, and
scalable steering modules (forks, joins and switches) along with their associated yields,
where should steering logic be inserted within qRM to maximize the resulting yield?
In this section we describe an algorithm, OSIP, of time complexity O(n^2), that solves this
problem. The algorithm produces a structure denoted OSIP-qRM. It is based on one of the
principles of dynamic programming: an optimal schedule for going from A to C that passes
through B consists of two optimal schedules, one going from A to B and the other going from
B to C. We make use of the following function.

Function Yield_Compute(ind_1, ind_2, q) computes the yield of a sequence of q copies of
modules that lie between two blocks of steering logic, with no steering logic in between.
Here, ind_1 and ind_2 are the indices of the first and last modules in the sequence. Thus,

  Yield_Compute(ind_1, ind_2, q) = y_S^(ind_1 − 1) · (1 − (1 − ∏_{i=ind_1}^{ind_2} y_i)^q),
where y_S^(ind_1 − 1) is the yield of the switch, if it exists, between the q copies of
m_(ind_1 − 1) and the q copies of m_(ind_1). Let OPT(k) be the maximum possible yield that
can be obtained by inserting zero or more switches within the first k modules (m_1 to m_k)
of an original qRM configuration. Once we find an optimal design for the first k modules,
we then consider the (k+1)st module and find the optimal design for the first k + 1 modules
using the values of OPT(0), OPT(1), …, OPT(k).

If k = 0 then the sequence of modules is empty and we have the structure shown in
Figure 3.7 (a); thus OPT(0) = y_F^q · y_J^q. When k = 1, we have q copies of m_1, and
OPT(1) = y_F^q · y_J^q · Yield_Compute(1, 1, q). When k ≥ 1, we have a choice of how to
partition the first k modules. The optimal design for the first k modules has an index
0 ≤ i* < k at which the last switch occurs (after module i*); i* = 0 implies that there are
no switches within the first k modules.
(a) OPT(0) (b) OPT(1) (c) First choice of OPT(2) (d) Second choice of OPT(2)
(e) First choice of OPT(3) (f) Second choice of OPT(3) (g) Third choice of OPT(3)
Figure 3.7: Different choices for the last switch in a pipeline with 3 modules
For k modules, there are k places where the last switch can be placed. Figure 3.7 (e, f, g)
shows a structure with n = 3 and q = 2, where there are three places where the last switch
can be placed. We want to know which of these k choices is best. To help answer this
question, we use the following lemma.

Lemma 1: The optimum yield OPT(k) satisfies the equations

  OPT(0) = y_F^q · y_J^q,
  OPT(k) = max_{0 ≤ i < k} [ OPT(i) · Yield_Compute(i + 1, k, q) ].
Proof: There exists a configuration as shown in Figure 3.7 (a) that has maximum yield. Let
i be the location of the right-most switch, not including the final join. The yield of this
design is then the product of the yields of its left and right components. Since
Yield_Compute(i + 1, n, q) is fixed, to maximize OPT(k) one must maximize OPT(i). We can
use Lemma 1 to design an algorithm for solving Problem_OSIP. Note that different switches
can have different yields, because the various modules can have different numbers of I/O
ports.
Without loss of generality, in this illustration of our algorithm we assume q = 2. For
k = 0 there are no modules, and OPT(0) = y_F^q · y_J^q (Figure 3.7 (a)). For k = 1 there is
no place to insert a switch, and from Lemma 1 we have
OPT(1) = OPT(0) × Yield_Compute(1, 1, 2) (Figure 3.7 (b)).

For k = 2, two choices exist (C_i^k refers to the yield of the ith choice for k modules),
namely use no switches, or insert "the last switch" between modules 1 and 2. Thus,

  C_1^2 = OPT(0) × Yield_Compute(1, 2, 2),   (Figure 3.7 (c))
  C_2^2 = OPT(1) × Yield_Compute(2, 2, 2),   (Figure 3.7 (d))

and OPT(2) = max(C_1^2, C_2^2), where OPT(0) and OPT(1) have been previously computed.

For k = 3, three choices exist, as shown in Figure 3.7 (e-g). Thus,

  C_1^3 = OPT(0) × Yield_Compute(1, 3, 2),   (Figure 3.7 (e))
  C_2^3 = OPT(1) × Yield_Compute(2, 3, 2),   (Figure 3.7 (f))
  C_3^3 = OPT(2) × Yield_Compute(3, 3, 2),   (Figure 3.7 (g))

and OPT(3) = max(C_1^3, C_2^3, C_3^3), where OPT(0), OPT(1) and OPT(2) have been previously
computed. In general,

  OPT(k) = max(C_1^k, C_2^k, …, C_k^k),  where  C_i^k = OPT(i − 1) × Yield_Compute(i, k, 2).
The pseudo code of this algorithm is given in Figure 3.8. Its complexity is directly
proportional to the number of choices we have for each value of k, namely k. Since OPT(k)
is computed for k = 1, 2, …, n, the time complexity of OSIP is 1 + 2 + … + n = O(n^2).

Figure 3.9 shows the result of applying OSIP to a pipeline structure where the yields of
the modules vary from 0.55 to 0.98 and the yield of every switch is 0.98. This example
clearly shows that it is not always best to place switches between all modules. The values
of G_Y and G_{Y/A} for this structure, compared to 2RM, are 11.2 and 5.33, respectively.
The locations of the switches are stored in the switch_location[] array. In the next
section, we modify this algorithm by removing the constraint of having the same number of
copies of all modules. More results are given in the experimental results section.
(Line 1) OPT(0) = y_F^q · y_J^q
(Line 2) for k = 1 to n do  // n is the number of modules
(Line 3)   Let i*, 0 ≤ i* < k, be the index that maximizes Yield_Compute(i* + 1, k, q) × OPT(i*)
(Line 4)   Let OPT(k) = OPT(i*) × Yield_Compute(i* + 1, k, q)
(Line 5)   Let switch_location[k] = i*;
(Line 6) end for
(Line 7) Output OPT(n) as the optimum maximal yield
(Line 8) Output n, switch_location[n], switch_location[switch_location[n]], …
         as the locations of the switches
Figure 3.8: Pseudo Code of OSIP
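The recurrence of Lemma 1 and Figure 3.8 translates directly into a short dynamic program. In this sketch every candidate switch position shares one illustrative yield (in general each position may have its own), and opt[0] folds in the q-way fork and join yields:

```python
def osip(module_yields, q, y_fork_q, y_join_q, y_switch):
    """O(n^2) switch-placement DP (Lemma 1 / Figure 3.8).

    opt[k] is the best yield over the first k modules; choice[k] records
    the index i* after which the last switch sits (0 means no switch).
    """
    n = len(module_yields)

    def yield_compute(first, k):
        # q copies of modules m_first..m_k with no steering logic between
        # them; a switch precedes the segment unless it starts the pipeline.
        prod = 1.0
        for y in module_yields[first - 1:k]:
            prod *= y
        segment = 1 - (1 - prod) ** q
        return (y_switch if first > 1 else 1.0) * segment

    opt = [y_fork_q * y_join_q] + [0.0] * n
    choice = [0] * (n + 1)
    for k in range(1, n + 1):
        for i in range(k):
            cand = opt[i] * yield_compute(i + 1, k)
            if cand > opt[k]:
                opt[k], choice[k] = cand, i
    return opt[n], choice

best, choice = osip([0.97, 0.89, 0.65, 0.67, 0.55, 0.99], q=2,
                    y_fork_q=0.99, y_join_q=0.99, y_switch=0.98)
```

Walking choice[] back from index n recovers the switch locations, as in lines 7-8 of Figure 3.8.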
[Pipeline diagram: module yields range from .55 to .98; switches (S, yield .98) are
inserted by OSIP only between selected modules, preceded by a fork (F) and followed by a
join (J)]
Figure 3.9: A sample of multi switch structure using OSIP
3.4 M-OSIP: Modified OSIP
As noted above, when the number of copies of a module increases, the yield of the
associated steering logic decreases. Also, for two modules with quite different yields, it
is usually better to use more copies of the module with the lower yield. We define the
following problem based on these observations.

Problem_M-OSIP: Given m_i and y_i for i = 1, 2, …, n and scalable steering modules (forks,
joins and switches) along with their associated yields, what is the best number of copies
of each module and the best locations for the inserted switches, so as to maximize the
resulting yield?
In this section we present a modified version of OSIP, denoted by M-OSIP to
solve this problem. This algorithm produces a structure denoted by M-OSIP-qRM.
Consider the structure shown in Figure 3.10, where the yields of the various modules are
shown. Compared to OSIP-5RM, this structure has less area and higher yield (G_Y = 1.2), and
thus a better Y/A value, namely 2.3.
Figure 3.10: A sample of M-OSIP structure (F = Fork, J = Join)
M-OSIP uses forks and joins, rather than switches, in its computations to connect one
partition of modules to another. That is, we model a q_1-input/q_2-output switch as a
q_1-way-join followed by a q_2-way-fork. Once we have identified the best number of copies
of each module, the concatenated forks and joins are replaced by a switch. For example, in
Figure 3.10, M-OSIP produced a result having a 2-way-join followed by a 5-way-fork between
m_3 and m_4; they can be replaced by a 2-input/5-output switch in the final design.

Table 3.2 shows the yields of different forks and joins for the modules depicted in
Figure 3.10. Yield decreases as the number of replications of a module increases. For
simplicity, we have set the yields of a fork and a join to be the same. As mentioned, in
practice a join driving a fork is replaced by a switch; we do not show the corresponding
yield matrix for this case, since there would be 4×4 entries for each module. y_F^q (y_J^q)
can differ for different modules because they can have different bus widths and/or physical
attributes.
Table 3.2: Yield of forks & joins as a function of degree of replication

        y_F^2  y_J^2  y_F^3  y_J^3  y_F^4  y_J^4  y_F^5  y_J^5
  m_1   0.92   0.92   0.90   0.90   0.88   0.88   0.86   0.86
  m_2   0.99   0.99   0.98   0.98   0.97   0.97   0.96   0.96
  m_3   1      1      0.90   0.90   0.88   0.88   0.86   0.86
  m_4   0.99   0.99   0.98   0.98   0.97   0.97   0.96   0.96
  m_5   0.94   0.94   0.90   0.90   0.88   0.88   0.86   0.86
  m_6   0.94   0.94   0.90   0.90   0.88   0.88   0.86   0.86
Assume we are processing the modules from left to right and have determined the number of
copies to use for m_j. At this time we do not know the number of copies to use for
m_(j+1), thus we do not know the dimensionality of the switch to use, hence its yield.
M-OSIP, like OSIP, need not insert a switch between adjacent modules. For example, in
Figure 3.10, m_5 and m_6 are not separated by a switch.
Several modifications need to be made to OSIP to create M-OSIP. Assume we have just
determined that the best number of copies for module m_2 in Figure 3.10 is three, but we do
not yet know that the best number of copies for m_3 is two. Therefore, we do not know that
a 3-input/2-output switch is required. Hence, we use a join to produce an output and a fork
to connect the output of the join to its neighbour. The yields of the modules and their
related forks and joins are used to determine the best number of copies for each
sub-problem. M-OSIP assumes that the actual numbers of copies of the modules in the final
structure can differ, and that the maximum number of copies is specified. Let the maximum
number of allowable copies that can be used for module m_i be M_i. Since the yield-compute
function uses the same number of copies of modules m_(ind_1) up to m_(ind_2), it cannot use
more than M_max copies of each module, where

  M_max = min_{ind_1 ≤ i ≤ ind_2} [ M_i ].              (3.6)
The best number of copies to use for these modules is the value of q identified by the
function:

  Yield_Compute_New(ind_1, ind_2, M_max)
    = max_{1 ≤ q ≤ M_max} [ y_F^q · y_J^q · (1 − (1 − ∏_{i=ind_1}^{ind_2} y_i)^q) ].
M-OSIP is the same as OSIP except for the modifications just mentioned. The complexity of
Yield_Compute_New() is O(n + m*), where n is the number of modules in the pipeline and m*
is the value computed by (3.6) when ind_1 = 1 and ind_2 = n. Therefore, the complexity of
M-OSIP is O(n^2 (n + m*)). Usually n is greater than m*, so the complexity of M-OSIP is
O(n^3). Note that the value of n is usually small (less than a thousand), so an O(n^3)
algorithm will not take long to execute.
3.5 HYPER: a Heuristic for Yield/area imProvEment using
Redundancy
The previous techniques focused on yield, though in many cases they also improved Y/A. In
this section we develop a heuristic for Y/A maximization, and we motivate the problem with
a simple example. Three different structures, Struc_1 (Y = .952, A = 12.35), Struc_2
(Y = .958, A = 10.2) and Struc_3 (Y = .94, A = 7.1), are shown in Figure 3.11. In Struc_1
both modules have the same number of copies, namely 3. Struc_2 shows a structure where q is
no longer the same for each module; the yields of the modules and steering logic are shown
inside each box. Although this structure has fewer copies of m_1, it has a higher yield and
less area than Struc_1 and results in a larger Y/A. More importantly, we can increase Y/A
by using fewer spares, hence reducing area and simplifying switches. For example, in
Struc_3 we removed the third copy of m_2, which led to a negligible reduction in yield but
a significant Y/A improvement over Struc_1 and Struc_2 of 73% and 41.5%, respectively.
Figure 3.11: (a) Struc_1 (Y/A = .077), (b) Struc_2 (Y/A = .094), (c) Struc_3 (Y/A = .133)
Problem_HYPER: Given m_i and y_i for i = 1, …, n, scalable steering modules (forks,
joins and switches) along with their associated yields and areas, and a threshold
value of yield for the system (Y_th), how many spares for each module should be used
so that the yield of the redundant system exceeds Y_th, and Y/A is maximized?
In this section we present the heuristic HYPER, which produces the structure
HYPER-qRM to solve this problem. HYPER consists of two main phases and employs
three combinatorial techniques, namely greedy search, divide & conquer, and dynamic
programming. HYPER, just like M-OSIP, uses forks and joins, rather than switches, in its
computations to connect one group of modules to another.
3.5.1 Phase 1: Implementation of HYPER using Divide & Conquer
We start the procedure by calling HYPER(1, n, Y_th), where 1 and n are the indices
of the first and last module. The main concept behind HYPER is that every time it inserts
a switch (join-fork pair) into a structure, it breaks that structure into two new structures;
it then inserts a switch into each new structure, thus generating four new structures. This
process continues until no new structures can be generated. However, HYPER sometimes
may not insert a switch into a sub-structure, or even the first structure, due to the low
yield of the steering logic. The pseudo code of this phase is given in Figure 3.12.
HYPER(ind_1, ind_2, Y_th) {
Line_1:  Struc = BREAK(ind_1, ind_2, Y_th);
Line_2:  if (Struc.Ind_m == 0) return();  // No switch has been inserted
Line_3:  else { if (ind_1 != Struc.Ind_m) HYPER(ind_1, Struc.Ind_m, Struc.Y_th1);
Line_4:         if (Struc.Ind_m + 1 != ind_2) HYPER(Struc.Ind_m + 1, ind_2, Struc.Y_th2); }}
Figure 3.12: Pseudo code of HYPER
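The divide & conquer recursion of Figure 3.12 can be sketched in Python as follows. BREAK is passed in as a parameter (a stub here, since its details come only in the next sub-section); it is assumed to return the switch index ind_m (0 meaning no switch) and the updated thresholds for the two sub-structures. Replaying the split decisions of the example in Figure 3.13 reproduces the call sequence HYPER(1, 5, .95) → HYPER(1, 4, .9746) → HYPER(1, 3, .9874).

```python
def hyper(ind1, ind2, y_th, break_fn, calls):
    """Recursive skeleton of HYPER (Figure 3.12).

    break_fn(ind1, ind2, y_th) -> (ind_m, y_th1, y_th2); ind_m == 0: no switch.
    calls collects the (ind1, ind2, y_th) arguments of every invocation.
    """
    calls.append((ind1, ind2, y_th))
    ind_m, y_th1, y_th2 = break_fn(ind1, ind2, y_th)
    if ind_m == 0:
        return                    # Line 2: no switch inserted, stop
    if ind1 != ind_m:             # Line 3: left part has more than one module
        hyper(ind1, ind_m, y_th1, break_fn, calls)
    if ind_m + 1 != ind2:         # Line 4: right part has more than one module
        hyper(ind_m + 1, ind2, y_th2, break_fn, calls)

# Stub BREAK replaying the decisions of the running example (Figure 3.13);
# the second-group threshold values here are placeholders.
_decisions = {(1, 5): (4, 0.9746, 0.9755), (1, 4): (3, 0.9874, 0.0)}

def break_stub(ind1, ind2, y_th):
    return _decisions.get((ind1, ind2), (0, 0.0, 0.0))
```

Note that single-module sub-structures (here m_5 and m_4) are never recursed into, exactly as Lines 3 and 4 of the pseudo code prescribe.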
The variable Struc in Figure 3.12 (Line_1) has the information that defines new
structures. We illustrate the functionality of HYPER with respect to Figure 3.13, which
shows a 5-stage pipeline and the yield of each module. Assume Y_th is 0.95 and the
values of y_F2, y_J2, y_F3 and y_J3 for all modules are .999, .999, .997 and .997.
First we call HYPER(1, 5, .95), which results in two new structures, m_1 to m_4 and m_5,
with a different number of copies for each structure. BREAK is an optimal function which
minimizes the area of each structure for a given Y_th. We will provide the details of
BREAK.
Struc.Ind_m indicates the location of the switch determined by BREAK,
0 ≤ Struc.Ind_m ≤ n − 1, where 0 implies that no switch is inserted. For example, after the
first instantiation of BREAK, Struc.Ind_m = 4, thus the first switch is located between m_4
and m_5. The second structure (m_5) has just one module; therefore, BREAK need not
process this structure (Line_4 in Figure 3.12). Hence for this sub-structure HYPER stops,
and the final number of spares for m_5 is 2. For the first structure (m_1 to m_4) HYPER calls
HYPER(1, 4, .9746), where .9746 is the new yield threshold. It is important that Y_th be
updated for any new structure. Clearly, if the target yield is Y_th, then the yield of each
substructure must be larger than this value. For our running example, after the first break,
the yields of the first and second structures are .9746 and .9755, respectively, and the
yield of the final structure is .9507. The total yield of the new structures during the second
break should not be less than .9746, to guarantee that the yield of the final structure does
not drop below .95.
HYPER(1, 5, 0.95) → HYPER(1, 4, 0.9746) → HYPER(1, 3, 0.9874)
Figure 3.13: An example of HYPER
After the second break Struc.Ind_m is 3, thus m_1 to m_3 is the first new structure and
m_4 is the second. Again the second structure has just one module (m_4), so HYPER stops
and the number of spares for m_4 reduces from 3 to 2. Finally, after the third break
Struc.Ind_m is 0. There is no more structure to break, so HYPER terminates and returns the
last structure (Figure 3.13) as the final structure. In the next sub-section we describe the
functionality of BREAK.
3.5.2 Phase 2: Implementation of BREAK using Dynamic Programming
BREAK(ind_1, ind_2, Y_th) is an algorithm that minimizes the area of a given structure
(m_ind1 to m_ind2) with respect to Y_th by inserting at most one switch. Result_0 refers to the
case where no switch is used, and Result_i refers to the case where one switch is inserted
between m_i and m_i+1, where ind_1 ≤ i ≤ ind_2 − 1. In producing a result that maximizes Y/A
subject to the yield exceeding Y_th, BREAK also determines the number of spares to
instantiate for each of the two new structures. The pseudo code of BREAK is given in
Figure 3.14.
As mentioned, BREAK inserts at most one switch between m_ind1 and m_ind2, thus
forming two groups of modules (new structures). As there is no steering logic among the
modules in each group, without loss of generality we can concatenate the modules inside
each group to generate one module, where the yield and area of these two new pseudo-
modules are given in line (3) of Figure 3.14. For example, referring to Figure 3.13, during
the first break the yields of the first and second group are .95×.95×.9×.9 = .731 and .85,
respectively.
A greedy selection rule is used in the BREAK function: every time it is called it
returns the switch location and number of spares needed to maximize Y/A for a given
sub-problem. Although each sub-problem results in an optimal result, the final structure
need not be optimal.
BREAK(ind_1, ind_2, Y_th)
{ (1) Area_1 = Σ_{i=ind_1}^{ind_2} a_i;  Yield_1 = ∏_{i=ind_1}^{ind_2} y_i;  // Result_0 Generation
  (2) Best_Struc = Build_Struc(Area_1, Yield_1, ind_1, ind_2, 0, 0, 0, 0, Y_th);
  for (i = ind_1; i ≤ ind_2 − 1; i++) {  // Result_i Generation
  (3)   Area_1 = Σ_{j=ind_1}^{i} a_j;  Yield_1 = ∏_{j=ind_1}^{i} y_j;  Area_2 = Σ_{j=i+1}^{ind_2} a_j;  Yield_2 = ∏_{j=i+1}^{ind_2} y_j;
  (4)   Tmp_Struc = Build_Struc(Area_1, Yield_1, ind_1, i, Area_2, Yield_2, i+1, ind_2, Y_th);
  (5)   if (Tmp_Struc.Yield/Area > Best_Struc.Yield/Area)
  (6)     Best_Struc = Tmp_Struc; }
  return(Best_Struc); }
Figure 3.14: Pseudo code of BREAK
Build_Struc is a function that generates the optimal Y/A for the two groups of
modules (two new structures), given the index of the switch and Y_th. During Result_0
generation (line (1)) the information describing the second group is null, since it does not
exist. The functionality of Build_Struc is described next, where group_1 and group_2 refer
to the groups (new structures) produced by the BREAK function.
3.5.2.1 Part I of Build_Struc: Determination of fewest number of spares for each group
Build_Struc guarantees that the yield of the new structure is not less than Y_th while
its area is minimal. There can be many structures that have a yield greater than or equal to
Y_th with different areas. Build_Struc generates a structure with minimum area based on
the inherent constraints of our problem. That is, we achieve enhanced yield via spares,
not by redesigning modules or by employing coding techniques. Clearly, the yield of the
whole system (with all modules) is less than y_i for i = 1, ..., n. Generalizing on this
concept, for the yield of the system to be greater than Y_th, the yield of each group,
including its spares, fork and join, must be greater than Y_th. The pseudo code in Figure
3.15 determines the fewest number of spares for each group required to satisfy this
condition, assuming these groups are not to be split apart by BREAK. Next we add a few
spares, with minimum area, to the current spares to improve the yield of the system to Y_th.
for (i = 1; i ≤ 2; i++)  // we have two groups of modules
{ q = 0;
  while (y_F^{q+1} × (1 − (1 − Yield_i)^{q+1}) × y_J^{q+1} < Y_th) q++;
  num_spare[i] = q; }
Figure 3.15: Fewest number of spares for each module
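The loop of Figure 3.15 translates directly into code. The sketch below is illustrative; the per-copy-count fork/join yields are assumed to come from a table (as they later do in Table 3.3), and the function returns the smallest number of spares q such that the group, together with its steering logic, meets Y_th (q = 0 meaning the bare group already suffices).

```python
def fewest_spares(group_yield, y_th, steering, q_limit=16):
    """Smallest q with yF(q+1) * (1 - (1 - group_yield)**(q+1)) * yJ(q+1) >= y_th.

    steering -- dict: copies -> (yF, yJ); a single copy needs no fork/join.
    Returns None if the threshold is unreachable within q_limit spares.
    """
    for q in range(q_limit + 1):
        copies = q + 1
        # default (1.0, 1.0) if a copy count is absent from the table (assumption)
        yF, yJ = steering.get(copies, (1.0, 1.0)) if copies > 1 else (1.0, 1.0)
        if yF * (1.0 - (1.0 - group_yield) ** copies) * yJ >= y_th:
            return q
    return None
```

With the running example's numbers (group yield .731, Y_th = .95, fork/join yields .999 at 2 copies and .997 at 3 copies), the answer is 2 spares, i.e. 3 copies, consistent with the initial structure in Figure 3.13.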
3.5.2.2 Part II of Build_Struc: Optimal number of spares for each group
From Part I we need at least num_spare[i] spares for each module in group_i.
Assume the yield of the structure after executing Part I of Build_Struc is Y_PartI. In Part II,
additional spares with minimum area are added to the current group(s) as long as
Y_current < Y_th. To accomplish this, we first determine the minimum unit of area (Min_Area)
that we can add at each step of Part II. This process is based on dynamic programming,
and we build the solution bottom up. When the area of the steering logic is negligible
compared to the area of the groups, the minimum unit of area is equal to the area of the
group with the minimum area. Often this area is not negligible, so when we add a spare
to a group we increase the total area by the area of the spare and the area of its new
steering logic. The function in Figure 3.16 determines Min_Area.
Func_Area(i, q)  // i is the index of a group, q is the number of copies for group_i
{ if (i == 0) return(0);
  else return(a_i + (a_F^q − a_F^{q−1}) + (a_J^q − a_J^{q−1})); }

if (Func_Area(1, num_spare[1] + 1) < Func_Area(2, num_spare[2] + 1))
{ Min_ind = 1;  // index of group with min area overhead
  Min_Area = Func_Area(1, num_spare[1] + 1); }
else
{ Min_ind = 2;
  Min_Area = Func_Area(2, num_spare[2] + 1); }
Figure 3.16: Min_Area determination for HYPER
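The area bookkeeping of Figure 3.16 can be made concrete with a short Python sketch. The per-q fork/join areas here are hypothetical table values, not figures from the dissertation; the point being illustrated is that the cost of one more spare is the group area plus the growth of its steering logic.

```python
def func_area(group_area, copies, fork_area, join_area):
    """Extra area needed to go from copies-1 to copies instances of a group.

    fork_area/join_area -- dict: q -> area of a q-way fork / join (q = 1 -> 0).
    """
    if copies <= 1:
        return 0.0
    steer_delta = (fork_area.get(copies, 0.0) - fork_area.get(copies - 1, 0.0)
                   + join_area.get(copies, 0.0) - join_area.get(copies - 1, 0.0))
    return group_area + steer_delta

def min_area_unit(groups):
    """groups: list of (area, next_copy_count, fork_area, join_area) per group.
    Returns (index, Min_Area) of the group whose next spare is cheapest."""
    costs = [func_area(*g) for g in groups]
    best = min(range(len(groups)), key=lambda i: costs[i])
    return best, costs[best]
```

With hypothetical steering areas (a 2-way fork 0.2, 3-way 0.3; a 2-way join 0.1, 3-way 0.15), a group of area 0.5 moving from 2 to 3 copies costs 0.65 area units and is cheaper than a group of area 1.0 moving from 1 to 2 copies.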
Let the maximum gain (G_max) be defined as G_max = Y_th / Y_PartI. We denote the spares
added to the original structure as spare_1, spare_2, …, where spare_i is the i'th spare added.
When spare_i is added to the current structure we obtain a gain in yield G_i. Thus, for
example, after adding spare_1 and spare_2 to the original structure the total yield gain is
G_1 × G_2, and the yield of this new structure is G_1 × G_2 × Y_PartI. At each step we increase the
available area overhead by Min_Area and select the best spare that results in the largest
increase in yield. The new yield gain is denoted by G_i*. If G_i* < G_max the procedure
continues to increase the area overhead and select new spares; otherwise it terminates.
Let Opt[][] be a 2-dimensional array used to store information regarding sub-problems,
such as the number of spares, yield gain and area overhead. The first row of this array
stores the yield gain of each step; thus Opt[0][i] returns the yield gain of the i'th step. The
last row stores the area overhead of each step. The other rows store the number of copies of
each module at each step; thus Opt[1][i] returns the number of copies for group_1 at the
i'th step. The array therefore has 4 rows, and there is one column for each step.
Let the function Gain_Func(q, i) return the yield gain obtained by using q spares
for group_i compared to q − 1 spares for group_i:
    Gain_Func(q, i) =
      1,                                                                              if i = 0;
      [ y_F^q × (1 − (1 − Yield_i)^q) × y_J^q ] / [ y_F^{q−1} × (1 − (1 − Yield_i)^{q−1}) × y_J^{q−1} ],  otherwise.    (3.7)
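A direct transcription of (3.7), under the same table-of-steering-yields assumption used earlier. Following the Build_Struc pseudo code, where Gain_Func is called with a copy count, the argument q is interpreted here as the new total number of copies of the group; the i = 0 branch covers the case where the second group does not exist.

```python
def gain_func(q, i, group_yield, steering):
    """Yield gain of moving group_i from q - 1 to q copies, per Eq. (3.7).

    i == 0 means the group does not exist (the Result_0 case), so the gain is 1.
    steering -- dict: copies -> (yF, yJ); a single copy needs no steering logic.
    """
    if i == 0:
        return 1.0

    def group_y(copies):
        yF, yJ = steering.get(copies, (1.0, 1.0)) if copies > 1 else (1.0, 1.0)
        return yF * (1.0 - (1.0 - group_yield) ** copies) * yJ

    return group_y(q) / group_y(q - 1)
```

For a group of yield .8 and 2-way fork/join yields of .99, the gain of a first spare is .99 × .99 × (1 − .2^2) / .8 ≈ 1.176, i.e. a 17.6% yield improvement.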
At step k the maximum area overhead is k × Min_Area. When we select group_i,
the remaining area overhead is k × Min_Area − Area_i, and k′_i refers to the sub-problem
where the area overhead was k × Min_Area − Area_i and for which we already know the
optimal answer.
Let M_i be a user-specified upper bound on the number of spares one can use in
group_i; from Part I it is known that at least num_spare[i] spares are needed for
group_i. Since at each step we increase the area overhead by Min_Area, the maximum
number of steps (Step_MAX) is given in (3.8). Build_Struc terminates when a solution is
identified, or when the number of steps exceeds Step_MAX.
    Step_MAX = Σ_{i=1}^{2} (M_i − num_spare[i]) × Area_i / Min_Area.    (3.8)
The pseudo code of Build_Struc is given in Figure 3.17.
In the worst case HYPER calls BREAK O(n) times, and BREAK calls
Build_Struc O(n^2) times. The complexity of Build_Struc is O(M), where M is the
maximum number of spares for a group of modules. In practice, M is not a large number
and can be assumed to be constant. Therefore the complexity of HYPER is O(n^3). Setting
the yield threshold is a separate problem that we believe the user will take care of; doing
so reduces the run time considerably, especially when there are many modules. That is,
HYPER maximizes Y/A for the given threshold, whereas in the next section we present our
optimal algorithm, which maximizes Y/A in general.
Build_Struc(Area_1, Yield_1, ind_11, ind_12, Area_2, Yield_2, ind_21, ind_22, Y_th)
{ // SECOND PART
  OPT[0][0] = 1; Area_0 = 0; k = 0;  // k counts the number of steps
  while ((OPT[0][k] < G_max) AND (k ≤ Step_MAX))
  { k++;
    for (i = 0; i ≤ 2; i++)  // we have at most two groups
      if (Area_i ≤ k × Min_Area)
      { k′_i = ⌊(k × Min_Area − Area_i) / Min_Area⌋;
        let i* be such that it maximizes the value of OPT[0][k] = Gain_Func(OPT[i*][k′_i*] + 1, i*) }
    OPT[0][k] = OPT[0][k′_i*] × Gain_Func(OPT[i*][k′_i*] + 1, i*); }
  if (k > Step_MAX) return(there is no solution for this structure);
  for (i = 1; i ≤ n; i++) OPT[i][k] = OPT[i][k′_i*];
  if (i* != 0) OPT[i*][k]++;
  return(new structure with new number of spares); }
Figure 3.17: Pseudo code of Part II of Build_Struc
3.6 MYRA: Maximizing Yield/Area via Replication
In this section we describe an algorithm (MYRA) for general structures, which
identifies the minimum amount of area (spares) to be added to a circuit to maximize
Y/A. The main differences between MYRA and our prior work [65] [66] [67] are that: (i)
MYRA is a general algorithm (not a heuristic) that maximizes Y/A (not just yield) under a
budget for redundancy, and is applicable to logic circuits with any structure (pipeline,
linear, non-linear, etc.); (ii) each module can have a different number of spares; (iii) our
technique is applicable to multi-core SoCs; and finally, (iv) it does not concatenate the
modules, and finds the best number of spares for each module separately. Consider the
non-linear structure given in Figure 3.18 (a), which is part of the execution module of
the OpenSPARC T2 core [32]. As we can see, we cannot concatenate modules BYP and
ALU without considering the output wire of BYP (dashed line) which feeds module
SHFT. MYRA uses one fork and one join for each replicated module, so it can be applied
to any structure independent of the connectivity among modules. Figure 3.18 (b) shows
an example of a configuration generated by MYRA for Figure 3.18 (a). We define the
following problem and present the solution to it.
Figure 3.18: (a) The original irredundant logic circuit, (b) its redundant counterpart
Problem_MYRA: Given (a) the yield and area of each module and their associated
forks and joins, (b) an area budget (B_R) for redundant modules and steering logic,
and (c) the total area of all of the other modules, such as memory (A_OM), that are
not part of the circuitry being considered for spares, determine the number of
spares to instantiate for each module so as to maximize the Y/A gain function
(1.3) while satisfying the constraint Σ_{i=1}^{n} (q_i − 1) × a_i ≤ B_R.
In practice, there is often empty space (E_S) on each chip that can be used for
yield improvement, i.e. for spares, without any area cost [57]. Thus, the budget for
redundancy could be increased to B_R + E_S, and A_OM reduced to A_OM − E_S. In this
dissertation we consider the worst case, where E_S = 0.
MYRA uses forks and joins, rather than switches, to connect q_i copies of module
m_i to q_j copies of m_j. That is, we model a q_i-input/q_j-output switch as a q_i-way-join
followed by a q_j-way-fork. Once the final values of the q_i's have been identified,
concatenated forks and joins are replaced by a switch. For example, assume the design
shown in Figure 3.18 is an output of MYRA, where we have a q_1-way-join followed by
a q_2-way-fork between the RML and IRF modules. This fork and join can be replaced by
a q_1-input/q_2-output switch in the final design.
MYRA produces the structure MYRA-qRM to solve this problem. We illustrate the
steps in MYRA using the example shown in Figure 3.19. Table 3.3 shows some
quantitative data regarding this circuit. The maximum number of allowable copies for m_1
and m_4 is 4, and for m_2 and m_3 it is 3. One reason for this limitation could be unacceptable
performance degradation when using 5 or more copies of m_1. Another reason could be
significant interconnect overhead when using 5 or more copies of m_1, which may lead to
yield reduction, as explained in previous sections. Let Q_i be the maximum allowable
number of copies for module i; then we define the maximum achievable yield (y_MA) as
follows:

    y_MA = ∏_{i=1}^{n} ( y_F^{Q_i} × [1 − (1 − y_i)^{Q_i}] × y_J^{Q_i} ).
The y_MA for Figure 3.19 (a) is 0.9, while its initial yield (without redundancy) is
0.44. Since adding a spare module to the circuit changes both the yield and the area,
maximizing Y/A is not trivial. For Y/A maximization MYRA increases the yield in an
iterative manner: starting from the initial value of yield and moving towards y_MA, a
minimum amount of area (spares) is added to the circuit at each iteration. MYRA
maximizes the yield for a given amount of budgeted area reserved for spare modules.
First, MYRA determines the minimum unit of area (Min_Area) that can be added to the
design at each step. When the area of the steering logic is assumed to be negligible
compared to the areas of the modules, the minimum unit of area is the area of the
smallest module, i.e., Min_Area = Min_{i=1}^{n}(a_i). For example, in Figure 3.19 (b),
Min_Area = 0.5 mm^2. When this assumption does not hold, MYRA counts the area of the
spares and of the steering logic as cost at each step, as follows: when adding a spare to
the structure, MYRA calculates the additional area overhead as the area of the newly
added spare module plus the difference between the areas of its new steering logic and
the old one. The area increase at each step is then accounted for appropriately. For
example, when we increase the number of copies of a module from 3 to 4, we need a
4-way-fork and 4-way-join at its inputs and outputs, respectively, instead of a 3-way-fork
and 3-way-join. MYRA considers the area of steering logic in its computations, but in
this example, for simplicity of presentation, we assume the steering logic's areas are
negligible compared to the modules.
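Both quoted numbers can be checked against the values of Table 3.3. The sketch below computes the initial yield and y_MA for Figure 3.19 (a); each table entry for y_Fq, y_Jq is read as the yield of the q-way fork and join of that module (a single copy needs no steering logic).

```python
# Module data from Table 3.3: (y_i, Q_i, {copies: (yF, yJ)})
modules = [
    (0.64,  4, {2: (0.992, 0.992), 3: (0.99, 0.99), 4: (0.988, 0.988)}),
    (0.952, 3, {2: (0.991, 0.991), 3: (0.99, 0.99)}),
    (0.91,  3, {2: (0.992, 0.992), 3: (0.99, 0.99)}),
    (0.79,  4, {2: (0.995, 0.995), 3: (0.993, 0.993), 4: (0.991, 0.991)}),
]

initial_yield = 1.0   # yield with one copy of each module, no steering logic
y_MA = 1.0            # yield with Q_i copies of each module
for y_i, Q_i, steering in modules:
    initial_yield *= y_i
    yF, yJ = steering[Q_i]
    y_MA *= yF * (1.0 - (1.0 - y_i) ** Q_i) * yJ
```

This gives an initial yield of about 0.438 and a y_MA of about 0.90, matching the figures quoted in the text and the Conf#0 row of Table 3.4.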
Figure 3.19: (a) Original circuit and (b) MYRA's output
Table 3.3: Information of the modules in Figure 3.19

               m_1      m_2      m_3      m_4
y_i            .64      .952     .91      .79
a_i (mm^2)     5        .5       1        2.5
y_F2, y_J2     .992     .991     .992     .995
y_F3, y_J3     .99      .99      .99      .993
y_F4, y_J4     .988     --       --       .991

Other assumptions: d = 0.1 per mm^2; A_OM = 36 mm^2 (80% of the area of the chip); B_R = 30%.
At step i in the iterative process to maximize yield (with minimum area), MYRA
has an area (maximum budget) of value i × Min_Area to use for redundancy. It is well
known that dynamic programming will always find the optimal solution; therefore,
MYRA uses this principle in an efficient manner (order of n^2, as will be proven). At step
1, MYRA has 0.5 mm^2 (1 × Min_Area) for yield maximization, which means MYRA can
only add one spare for m_2 (Conf#1 shown in Table 3.4). Conf#0 refers to the original
design without any redundancy. MYRA then stores this configuration as the one with
maximum yield for the given amount of budget. At step 2, MYRA has 2 × 0.5 = 1 mm^2
budget for redundancy, and thus can use two spares for m_2 or one spare for m_3. MYRA
chooses the latter due to its higher yield (Conf#2). In general, Conf[i × Min_Area] refers
to the configuration with maximum yield at step i, where we use the equations

    Y_Config = ∏_{i=1}^{n} [ y_F^{q_i} × (1 − (1 − y_i)^{q_i}) × y_J^{q_i} ],    A_Config = Σ_{i=1}^{n} ( q_i × a_i + a_F^{q_i} + a_J^{q_i} )

to compute the yield and area of this configuration. The function yield_gain(q, i) returns
the yield gain of using q + 1 copies of m_i instead of q, as follows:

    yield_gain(q, i) = [ y_F^{q+1} × (1 − (1 − y_i)^{q+1}) × y_J^{q+1} ] / [ y_F^{q} × (1 − (1 − y_i)^{q}) × y_J^{q} ].
Table 3.4: Output of MYRA for Figure 3.19

          Number of copies
Step#    m_1  m_2  m_3  m_4    B_R*    Yield    Yield/Area
Conf#0 1 1 1 1 0 0.438 4921
Conf#1 1 2 1 1 0.5 0.4508 5037
Conf#2 1 1 2 1 1 0.4698 5220
Conf#3 1 2 2 1 1.5 0.4836 5343
Conf#4 1 3 2 1 2 0.4836 5315
Conf#5 1 1 1 2 2.5 0.5247 5735
Conf#6 1 2 1 2 3 0.54 5870
Conf#7 1 1 2 2 3.5 0.5628 6085
Conf#8 1 2 2 2 4 0.5793 6229
Conf#9 1 3 2 2 4.5 0.5794 6196
Conf#10 2 1 1 1 5 0.5862 6236
Conf#11 2 2 1 1 5.5 0.6033 6384
Conf#12 2 1 2 1 6 0.6288 6619
Conf#13 2 2 2 1 6.5 0.6472 6776
Conf#14 2 3 2 1 7 0.6473 6742
Conf#15 2 1 1 2 7.5 0.7022 7277
Conf#16 2 2 1 2 8 0.7227 7451
Conf#17 2 1 2 2 8.5 0.7532 7725
Conf#18 2 2 2 2 9 0.7752 7911
Conf#19 2 3 2 2 9.5 0.7754 7872
Conf#20 2 2 3 2 10 0.7779 7857
Conf#21 2 3 3 2 10.5 0.778 7819
Conf#22 2 3 3 2 11 0.778 7819
Conf#23 2 2 2 3 11.5 0.8003 7963
Conf#24 2 3 2 3 12 0.8004 7925
Conf#25 2 2 3 3 12.5 0.803 7911
Conf#26 2 3 3 3 13 0.8031 7874
Conf#27 3 1 2 2 13.5 0.8217 8016
* Budget for Redundancy for step i, in mm^2
When MYRA is at step i it chooses the configuration with the highest yield based on
the following calculation:

    Conf[i × Min_Area] = MAX_{1 ≤ k ≤ n} { Y(Conf[i × Min_Area − a_k]) × yield_gain(q_k, k) },

where q_k is the number of copies of m_k in Conf[i × Min_Area − a_k].
Table 3.4 summarizes all steps of MYRA's output for the circuit given in Figure
3.19 (a), where MYRA generates 27 different configurations. Figure 3.20 shows that
increasing yield does not necessarily lead to a monotonic improvement in Y/A. For
example, the yield of step 19 is better than that of step 18 while its Y/A is worse. This is
very important, as it shows that previous redundancy-based techniques driven purely by
yield improvement will not be able to generate optimal results in emerging technologies.
MYRA stores the
optimal solution (the configuration with maximum Y/A) and returns it at the end. It stops
(i) when the entire given budget for redundancy is used, or (ii) when the yield reaches y_MA,
which implies that adding more spares will not improve yield any further, and hence Y/A
cannot be improved further either. To find the global optimum we can set B_R to a very
large number, as is done in our experimental results section. For this example Conf#27 has
the best Y/A and is illustrated in Figure 3.19 (b). The maximum number of steps required
by this algorithm is

    Step_MAX = ⌈ (B_R × Area_chip) / Min_Area ⌉,  where  Area_chip = A_OM + Σ_{i=1}^{n} a_i.

The complexity of MYRA is O(n × Step_MAX). Step_MAX can be approximated by a linear
function of n, so the computational complexity is O(n^2).
Figure 3.20: Behaviour of yield & Y/A for Steps 17 to 27
MYRA includes the cost (yield and area) of steering logic in its computations.
It has access to a pre-computed table of these costs as a function of q_1, q_2. To compute the
yield and area of steering logic we use our tool TYSEL (described in Chapter 2).
The pseudo code of MYRA is shown in Figure 3.21, where we define the maximum gain
G_max with respect to the yield of the current structure as G_max = y_MA / y_current. In MYRA we
use Opt[][], the 2-dimensional array which was also used in HYPER. For example,
Opt[4][i] returns the number of copies for m_4 at the i'th step. This array has n + 2 rows,
and each column represents one step of the algorithm. We denote the spares added to the
original structure as spare_1, spare_2, …, where spare_i is the i'th spare added. When spare_i
is added to the current structure we obtain a yield gain G_i. Thus, for example, after adding
spare_1 and spare_2 to the original structure the total yield gain is G_1 × G_2, and the yield of
this new structure is G_1 × G_2 × Y_current. At each step we increase the available area
overhead by Min_Area and select the best spare that results in the largest increase in
yield. The new yield gain is denoted by G_i*. If G_i* < G_max the algorithm continues to
increase the area overhead and select new spares; otherwise, it terminates.
We again use the function Gain_Func(q, i) (3.7) to return the gain in yield obtained
by using q spares for m_i compared to q − 1 spares. At step k of this procedure the maximum
area overhead is k × Min_Area, and when we select m_i, the remaining area overhead is
k × Min_Area − a_i; k′_i refers to the sub-problem where the area overhead was
k × Min_Area − a_i and for which we already know the optimal answer.
OPT[0][0] = 1; a_0 = 0; k = 0;  // k counts the number of steps
while (OPT[0][k] < G_max)
{ k++;
  for (i = 0; i ≤ n; i++)
    if (a_i ≤ k × Min_Area)
    { k′_i = ⌊(k × Min_Area − a_i) / Min_Area⌋;
      let i* be such that it maximizes the value of OPT[0][k] = Gain_Func(OPT[i*][k′_i*] + 1, i*)
    }
  OPT[0][k] = OPT[0][k′_i*] × Gain_Func(OPT[i*][k′_i*] + 1, i*);
}
for (i = 1; i ≤ n; i++) OPT[i][k] = OPT[i][k′_i*];
if (i* != 0) OPT[i*][k]++;
Figure 3.21: Pseudo code of the algorithm
3.7 Experimental Results
In this section we illustrate the application of HYPER and compare the results
with our baseline design, SIRUP, OSIP, M-OSIP and MYRA, under different defect
densities. We focus on the core of the OpenSPARC T2 as our baseline design to illustrate
the attributes of our algorithms. We used a 90nm technology to synthesize the core and
determine the area of the different modules. The OpenSPARC T2 has 8 cores that take up
28% of the area of the chip (each core is around 3.5%). The area of each module is given
in Table 3.5. The yield of each module was estimated using (1.1), though any other way of
estimating yield is equally acceptable.
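The module yields in Table 3.5 are consistent with the classic exponential (Poisson) yield model y = exp(−d × A); we assume that this is what (1.1) denotes. A quick spot-check against two table entries:

```python
import math

def module_yield(area_mm2, defect_density):
    """Exponential (Poisson) defect model: y = exp(-d * A)."""
    return math.exp(-defect_density * area_mm2)

# Spot checks against Table 3.5
y_fgu = module_yield(2.598879, 0.002)   # FGU at d = .002 -> ~0.9948
y_lsu = module_yield(1.651542, 0.075)   # LSU at d = .075 -> ~0.8835
```

Both values agree with the corresponding table entries to four decimal places.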
Table 3.5: Yield of OpenSPARC T2 modules

Yields of modules with different defect densities (Case1 to Case7):
Name of Module   Area (mm^2)   d+ = .002   .0035   .0045   .015   .035   .05   .075
DEC 0.132484 0.9997 0.9995 0.9994 0.998 0.9954 0.9934 0.9901
EXU1 0.523982 0.999 0.9982 0.9976 0.9922 0.9818 0.9741 0.9615
EXU2 0.523982 0.999 0.9982 0.9976 0.9922 0.9818 0.9741 0.9615
FGU 2.598879 0.9948 0.9909 0.9884 0.9618 0.9131 0.8781 0.8229
GKT 0.381301 0.9992 0.9987 0.9983 0.9943 0.9867 0.9811 0.9718
IFU_CMU 0.26336 0.9995 0.9991 0.9988 0.9961 0.9908 0.9869 0.9804
IFU_FTU 0.799001 0.9984 0.9972 0.9964 0.9881 0.9724 0.9608 0.9418
IFU_IBU 0.813506 0.9984 0.9972 0.9963 0.9879 0.9719 0.9601 0.9408
LSU 1.651542 0.9967 0.9942 0.9926 0.9755 0.9438 0.9207 0.8835
MMU 1.272749 0.9975 0.9956 0.9943 0.9811 0.9564 0.9383 0.909
PKU 0.56534 0.9989 0.998 0.9975 0.9916 0.9804 0.9721 0.9585
PMU 0.333517 0.9993 0.9988 0.9985 0.995 0.9884 0.9835 0.9753
TLU 2.140357 0.9957 0.9925 0.9904 0.9684 0.9278 0.8985 0.8517
CORE 12 .976 .958 .947 .835 .657 .548 .406
+ d is the defect density per square millimeter
According to the ITRS roadmap [98], the defect density of state-of-the-art IC
products ranges from 0.002-0.005/mm^2. Cases 1, 2 and 3 in Table 3.5 correspond to
defect densities in this range, while Cases 4 to 7 refer to immature technologies and/or
emerging bio and molecular technologies. We estimated the area of the steering logic
using (i) a scalable template of the layout of this logic, (ii) the number of spare modules,
and (iii) the number of input and output ports of each module. The yield is determined
using (1.1). For example, the number of inputs/outputs of IFU_FTU is 1029/365, and when
d = .015 the yields of a 2-way fork and 2-way join are .998 and .9997, respectively.
As mentioned, we denote the structures produced by SIRUP, OSIP, M-OSIP,
MYRA and HYPER as SIRUP-qRM, OSIP-qRM, M-OSIP-qRM, MYRA-qRM and
HYPER-qRM. Let the maximum number of copies for each module be q. (In practice each
module can be associated with a unique maximum replication bound.) M-OSIP-qRM,
MYRA-qRM and HYPER-qRM structures will have at least 1 and at most q copies of each
module. Structures produced by OSIP and SIRUP have q′ copies (1 ≤ q′ ≤ q) for all
modules. Table 3.6 shows the number of copies of each module in a HYPER-5RM
structure. A '*' means there is a switch between this module and the next one. In general
Y_th is a user-defined variable. One can use M-OSIP to find the maximum attainable yield
for the given structure and then select Y_th accordingly. We see that different modules have
different numbers of spares, which leads to higher yield, less area and larger values of Y/A.
When the defect density increases the number of spares also increases, which seems
reasonable. Since the yield of the steering logic is less than 1, HYPER does not insert
switches between every pair of modules (see Table 3.6).
Table 3.6: Number of copies of each module in the HYPER-5RM structure in 7 cases

Number of copies of each module in different cases (Case1 to Case7):
Name of Module   d+ = .002   .0035   .0045   .015   .035   .05   .075
DEC 2 2 1 3 2* 2 2*
EXU 2 2 1* 3 3 2* 5
FGU 2 2 2 3 3* 3 5
GKT 2 2 2 3 2* 3* 5
IFU_CMU 2 2 2 3 2 4 5
IFU_FTU 2 2 2 3 2* 4 5*
IFU_IBU 2 2 2 3* 4 4 4
LSU 2 2 2 2 4 4* 4
MMU 2 2 2 2 4 3 4
PKU 2 2 2 2 4* 3* 4*
PMU 2* 2* 2 2 3 4 3
TLU 1 1 2 2 3 4 3
Area Overhead of a Core (mm^2) 10 10 11 18 27 30 35
Yield of a Core .99 .99 .99 .99 .99 .99 .974
Table 3.7 compares the Y/A improvement (G_Y/A) of HYPER-5RM structures to our
baseline structure, and to those generated by SIRUP, OSIP, M-OSIP and MYRA. Again
we target the OpenSPARC T2 and the same seven defect densities. Equation (1.3) is used
to compute Y/A. Here the value of A_OM is 332 mm^2 and we use redundancy (the new
structure) for all 8 cores.
Table 3.7: G_Y/A of the HYPER structure compared to other structures

G_Y/A of HYPER compared to different structures (%):
         Baseline   SIRUP   OSIP   M-OSIP   MYRA
Case1 1.6 55 54 17 1.25
Case2 9 49 48 30 1.23
Case3 21 36 35 20 1.2
Case4 194 33 33 29 1.11
Case5 1722 22 21 21 1.05
Case6 7183 17 16 15 1.07
Case7 64924 11 4 .05 1.19
The Y/A improvement of HYPER compared to the baseline design is significant,
especially for larger defect densities; G_Y/A for Cases 4 to 7 is extremely large. The G_Y/A
of HYPER compared to SIRUP and OSIP is of special interest. In all cases HYPER
results in higher Y/A, though when the defect density increases the improvement,
especially compared to SIRUP and OSIP, decreases. This seems to occur because at high
defect densities HYPER uses more spares to increase yield, which in turn reduces Y/A.
SIRUP and OSIP always use the same number of copies for all modules (five in this
example), which leads to a Y/A reduction at low defect density compared to HYPER.
M-OSIP has the flexibility of using a different number of spares for different modules,
hence the G_Y/A of HYPER compared to it is non-monotonic.
The Y/A improvement of HYPER compared to M-OSIP is less than that achieved
against the other techniques. This seems to imply that the capability of M-OSIP to select
different numbers of spares for different modules leads to larger values of Y/A compared
to SIRUP and OSIP. MYRA and HYPER perform almost identically, with HYPER
slightly better. The reason is that HYPER can concatenate modules to remove the
steering-logic overhead, while this is not possible for MYRA. However, MYRA can be
used for any structure, not just linear ones. Later in this dissertation we show that when
modules have almost equal yields, MYRA finds the design with the best Y/A.
More experimental results for these algorithms have been published in our papers
[65] [66] [67]. In the next chapter we present two theorems and a design flow based on our
theoretical results to show the importance of the original partitioning on Y/A maximization
using redundancy.
3.8 Conclusions
In this chapter we assumed the logic circuit has already been partitioned. We
showed that for a circuit with heterogeneous modules we need to instantiate different
numbers of spare copies to maximize Y/A or, equivalently, revenue per wafer. We
developed different heuristics and algorithms to maximize yield and Y/A for logic circuits
with linear and non-linear structures while taking into account the yield and area of all
modules and their related steering logic. Our experimental results show the efficiency of
using intra-logic redundancy for Y/A improvement compared to the original circuit.
Chapter 4
Theory of Partitioning for Redundancy
4.1 Introduction
In Chapter 3 we assumed the original circuit had already been partitioned; in this
chapter we partition the original circuit for redundancy purposes. That is, we find the
optimal level of granularity, considering all design constraints, to be used for Y/A
maximization by redundancy.
Adding redundancy to logic circuits to maximize Y/A is not trivial, and that is the
primary reason that traditional works usually replicate the entire circuit for simplicity.
For example, in multi-core processors a spare core may be added to improve yield. In
Chapter 3 we showed that for emerging technologies with low levels of yield, redundancy
needs to be added at finer levels of granularity compared to core level (entire circuit). As
mentioned, our goal in this dissertation is to maximize Y/A, not just yield. While adding
redundancy at lower levels of granularity can improve the yield, it results in more area
overhead due to interconnect complexity and hence puts downward pressure on Y/A.
Therefore, there exists an optimal level of granularity at which redundancy must be
employed to maximize Y/A. We illustrate this point in our next example, shown in Figure
4.1. Redundancy can be used at different levels of granularity, from the coarsest grain,
which is the entire circuit (Figure 4.1 (a)), to the finest grain, which is a gate (Figure 4.1
(b)). Without loss of generality we duplicate each module in Figure 4.1. Intuitively, we
can say that redundancy at finer granularity leads to better yield; e.g., we can see that the
replicated circuit at the coarsest grain (Figure 4.1 (a)) can tolerate at most one faulty gate
or interconnect, while the circuit shown in Figure 4.1 (b) can tolerate up to six different
faulty gates or interconnects.
Figure 4.1: Circuit C17 with redundancy at different levels of granularity
However, redundancy at finer levels of granularity leads to more area overhead
because of the switches (multiplexers and de-multiplexers) and extra circuitry, such as
FFs and control signals, for testing modules, as shown in Figure 4.1 (b). As mentioned,
these overheads increase when redundancy is added at finer levels, which could in turn
lead to a reduction in Y/A. Thus, there is an optimal level of granularity for employing
redundancy. In this chapter we first develop theorems, with the goal of Y/A
maximization, concerning the attributes of partitioning for redundancy, namely the
number of partitions and the size of each partition. Then we propose a design flow to find
the optimal level of granularity considering the constraints mentioned above.
Figure 4.1 (c) illustrates the optimal solution, where gates 3, 4, 5 and 6 have been
clustered into one module, and gates 1 and 2 are clustered into another. Other constraints,
such as timing closure, will be discussed in the following sections.
The inputs to our design flow are the netlist of the original circuit (gates and FFs),
and models for computing the yield and area of a block of logic; our design flow
produces a redundant configuration with maximum Y/A. The configuration includes the
multiplexers and de-multiplexers, the netlist of each module, and the yield and area
values (including these multiplexers). The design flow has two phases: (i) phase 1, also
known as combinational logic block (CLB) partitioning, forms maximum size blocks of
logic gates and registers that satisfy several topological, design, and test constraints, and
(ii) phase 2 (optimization) where the results from phase 1 are processed to maximize Y/A.
This is accomplished by clustering small CLBs together into larger CLBs, and
partitioning large CLBs.
In the next section, we develop our theorems and validate them. In Section 4.3,
we propose our design flow, which has two phases. Phase 1 explains the practical
constraints on partitioning the original logic circuit and proposes a solution to satisfy
these constraints. Phase 2 is an optimization procedure that attempts to maximize Y/A
using different heuristics. To show the effectiveness of our design flow, the results for a
real circuit (Open SPARC T2 core) for several defect densities and various redundant
configurations are presented in Section 4.4.
4.2 Theory of partitioning for Y/A maximization using
redundancy
In this section we first develop our theorems in an ideal setting and subsequently
incorporate real-life limitations. Ideally: (i) the original circuit can be partitioned at
any arbitrary level of granularity, the partitions can be of different sizes (areas), and no
restrictions exist on the partition in which a specific element of the circuit can reside;
(ii) the partitioning process does not allow any element of the circuit to be replicated;
and (iii) the interconnect overhead for redundancy is negligible. Under these
assumptions, the theory of redundancy for logic circuits is as follows.
Model 1: Assume the original circuit has been partitioned into n modules
(partitions), where each module is replicated q times, i.e., there are q-1 spares
(Figure 1.3 with q_i = q for i = 1..n).
Theorem 1: For Model 1 and fixed n, the maximum Y/A is achieved when the
areas (and hence, yields) of modules are identical.
Proof: We use mathematical induction to prove Theorem 1.
The Basic Step: We show that the theorem is correct when n = 2, and thus we
have two modules as shown in Figure 4.2 with yields Y^α and Y^(1-α), where
0 < α < 1. We prove that Y/A is maximum when α = 0.5.

Figure 4.2: Given circuit with two partitions m_1 (area αA) and m_2 (area (1-α)A)
Since changing the value of α does not change the area of the partitioned circuit,
Y/A is maximum when the yield is maximum. The yield of the circuit in Figure 4.2,
when each module is replicated q times, is

Y_new = [1 - (1 - Y^α)^q] × [1 - (1 - Y^(1-α))^q].

To find the value of α that maximizes Y_new, we take the derivative of Y_new with
respect to α and set the result equal to 0. Thus

0 = dY_new/dα = [q Y^α ln(Y) (1 - Y^α)^(q-1)] × [1 - (1 - Y^(1-α))^q]
    - [q Y^(1-α) ln(Y) (1 - Y^(1-α))^(q-1)] × [1 - (1 - Y^α)^q].

Dividing by q ln(Y) ≠ 0 gives

Y^α (1 - Y^α)^(q-1) [1 - (1 - Y^(1-α))^q] = Y^(1-α) (1 - Y^(1-α))^(q-1) [1 - (1 - Y^α)^q].

Let a = 1 - Y^α and b = 1 - Y^(1-α), so that Y^α = 1 - a and Y^(1-α) = 1 - b. Then

Y^α a^(q-1) (1 - b^q) = Y^(1-α) b^(q-1) (1 - a^q).    (4.1)

We use the following expansion to solve equation (4.1):

a^n - b^n = (a - b)(a^(n-1) + a^(n-2) b + ... + a b^(n-2) + b^(n-1)).

Substituting Y^α = 1 - a and Y^(1-α) = 1 - b into (4.1), we have

(1 - a) a^(q-1) (1 - b^q) = (1 - b) b^(q-1) (1 - a^q) ⇒
(a^(q-1) - a^q)(1 - b^q) = (b^(q-1) - b^q)(1 - a^q) ⇒
a^(q-1) - a^q - a^(q-1) b^q + a^q b^q = b^(q-1) - b^q - a^q b^(q-1) + a^q b^q ⇒
(a^(q-1) - b^(q-1)) - (a^q - b^q) + (a^q b^(q-1) - a^(q-1) b^q) = 0.

Since a^q b^(q-1) - a^(q-1) b^q = a^(q-1) b^(q-1) (a - b), applying the expansion
above to the first two differences yields

(a - b)[Σ_{i=0}^{q-2} a^(q-2-i) b^i - Σ_{i=0}^{q-1} a^(q-1-i) b^i + a^(q-1) b^(q-1)] = 0 ⇒
(a - b)[(1 - a) Σ_{i=0}^{q-2} a^(q-2-i) b^i - b^(q-1) (1 - a^(q-1))] = 0 ⇒
(a - b)(1 - a)[Σ_{i=0}^{q-2} a^(q-2-i) (b^i - b^(q-1))] = 0.    (4.2)

(In the second step we used Σ_{i=0}^{q-1} a^(q-1-i) b^i = a Σ_{i=0}^{q-2} a^(q-2-i) b^i + b^(q-1),
and in the third step 1 - a^(q-1) = (1 - a) Σ_{i=0}^{q-2} a^i together with
Σ_{i=0}^{q-2} a^i = Σ_{i=0}^{q-2} a^(q-2-i).)

Equation (4.2) can be satisfied in two ways, namely (i) setting (a - b) = 0, and (ii)
(1 - a)[Σ_{i=0}^{q-2} a^(q-2-i) (b^i - b^(q-1))] = 0. Since 0 < a, b < 1, every term of the
sum in the second condition is positive (b^i > b^(q-1) for i < q-1), so the left hand
side of the second condition is always positive.

Thus we require that (a - b) = 0 and hence

1 - Y^α = 1 - Y^(1-α) ⇒ Y^α = Y^(1-α) ⇒ α = 1 - α ⇒ α = 0.5.

Thus, the basis has been proven.
The Inductive Step: Assume the statement holds for n-1 modules, which implies
that the area (and yield) of each module needed to maximize Y_new is A/(n-1) (and
Y^(1/(n-1))).

With respect to the optimal solution with n modules, let the yield of the first
module be Y^β; hence the combined yield of all other modules is Y^(1-β). If this is an
optimal partition of a circuit, then the sub-circuit composed of those n-1 modules should
also be partitioned optimally. Using our induction hypothesis, the yield of each such
module should be Y^((1-β)/(n-1)), and the yield of this configuration is

Y_new = [1 - (1 - Y^β)^q] × [1 - (1 - Y^((1-β)/(n-1)))^q]^(n-1).

We seek the value of β that maximizes Y_new, so we take the derivative of Y_new with
respect to β and set it to 0.
0 = dY_new/dβ = q ln(Y) { Y^β (1 - Y^β)^(q-1) [1 - (1 - Y^((1-β)/(n-1)))^q]^(n-1)
    - Y^((1-β)/(n-1)) (1 - Y^((1-β)/(n-1)))^(q-1) [1 - (1 - Y^((1-β)/(n-1)))^q]^(n-2) [1 - (1 - Y^β)^q] }.

Dividing by q ln(Y) [1 - (1 - Y^((1-β)/(n-1)))^q]^(n-2) ≠ 0 gives

Y^β (1 - Y^β)^(q-1) [1 - (1 - Y^((1-β)/(n-1)))^q]
    = Y^((1-β)/(n-1)) (1 - Y^((1-β)/(n-1)))^(q-1) [1 - (1 - Y^β)^q].

Now let a = 1 - Y^β and b = 1 - Y^((1-β)/(n-1)), so Y^β = 1 - a and Y^((1-β)/(n-1)) = 1 - b.

The above equation is equivalent to (4.1), so we can proceed as in the basic step and
show that the only root occurs when (a - b) = 0, which gives

1 - Y^β = 1 - Y^((1-β)/(n-1)) ⇒ Y^β = Y^((1-β)/(n-1)) ⇒ β = (1-β)/(n-1) ⇒ β = 1/((n-1)+1) = 1/n.

Using this β, one can verify that the yield of each partition is Y_i = Y^(1/n), 1 ≤ i ≤ n. Q.E.D.
Thus, independent of the value of q and n, n-way partitioning that results in equal
area modules will maximize Y/A.
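Theorem 1 can also be checked numerically for the two-partition case. The sketch below (the function name and sweep grid are our own illustrative choices) sweeps α for the yield model of the basic step and confirms that Y_new is maximized at α = 0.5:

```python
import numpy as np

def y_new_two(alpha: float, Y: float, q: int) -> float:
    """Yield of a 2-partition design with yields Y^alpha and Y^(1-alpha),
    each partition replicated q times (Theorem 1, basic step)."""
    return (1 - (1 - Y**alpha)**q) * (1 - (1 - Y**(1 - alpha))**q)

# Sweep alpha on a fine grid for several (Y, q) pairs; the maximizer
# should always be alpha = 0.5 (equal-area partitions).
alphas = np.linspace(0.01, 0.99, 981)          # step 0.001, includes 0.5
for Y in (0.1, 0.25, 0.5, 0.9):
    for q in (2, 3, 4):
        best = alphas[np.argmax([y_new_two(a, Y, q) for a in alphas])]
        print(f"Y={Y}, q={q}: best alpha = {best:.3f}")
```

For every (Y, q) pair the printed maximizer is 0.500, matching the analytical result.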
Theorem 2: For Model 1, the Y/A is a monotonically increasing function of n, the
number of partitions. Consequently, for a large number of balanced partitions, no more
than one spare is needed for Y/A maximization.
Proof: (The proof provided here is only for the case q = 2. Later we will show
that for large n, q = 2 is sufficient to maximize Y/A.) From Theorem 1 we know that to
maximize Y/A it is best to partition a circuit into modules of equal area. Assume we
partition the circuit into n equal area modules, each of which is replicated q times. The
yield of the new circuit is

Y_new = [1 - (1 - Y^(1/n))^q]^n.

For q = 2,

Y_new = [1 - (1 - Y^(1/n))^2]^n = [Y^(1/n) (2 - Y^(1/n))]^n = Y (2 - Y^(1/n))^n,

and the total area of this new circuit is 2A. Thus, to maximize Y/A we need to
maximize Y_new. To show that Y_new is a monotonically increasing function of n we will
show that the derivative of Y_new with respect to n is positive for all values of n. After
much algebraic manipulation, we have

dY_new/dn = Y (2 - Y^(1/n))^(n-1) × [Y^(1/n) ln(Y^(1/n)) + (2 - Y^(1/n)) ln(2 - Y^(1/n))].    (4.3)

We know that 0 < Y < 1 → 0 < Y^(1/n) < 1 → 1 < (2 - Y^(1/n)) < 2 → 0 < Y (2 - Y^(1/n))^(n-1).
We next show that the bracketed term in (4.3) is also positive for 0 < Y < 1 and n ≥ 1.
Let a = Y^(1/n). We want to show that a ln(a) + (2-a) ln(2-a) is positive. We rewrite
this expression as

[(1-a) ln(2-a)] + [a ln(a) + ln(2-a)].

Now 0 < a < 1 and 2-a > 1 → ln(2-a) > 0 → (1-a) ln(2-a) > 0. To show that
a ln(a) + ln(2-a) is positive, we note that a ln(a) + ln(2-a) = ln(a^a (2-a)), and
when 0 < a < 1 we can show that a^a (2-a) > 1, and therefore ln(a^a (2-a)) > 0.
Therefore Y_new is a monotonically increasing function of n. Q.E.D.
The results for Y_new for different values of Y, n and q are given in Table 4.1.
Figure 4.3 shows the value of Y/A as a function of n and q for Y = 0.1. Note that while the
value of Y/A for q = 2 and small values of n might be smaller than the value of Y/A for
larger values of q and the same value of n, eventually these curves cross. Based on these
results (and others) we make the following observations.
Observation 1: For q = 2 and q* > 2 and fixed Y, there exists an n = n* such that
for all n > n*, Y/A|_{q=2} > Y/A|_{q=q*}. That is, as the number of partitions increases, using
only duplication is sufficient to maximize Y/A.

Observation 2: For q > 1, the asymptotic value of Y/A as n increases is 1/q. Thus
the maximum Y/A value occurs for q = 2 and is 0.50.
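Both observations follow from the closed form Y_new = [1 - (1 - Y^(1/n))^q]^n. A small sketch (the helper below is our own; values are cross-checked against Table 4.1) illustrates them:

```python
def y_new(Y: float, n: int, q: int) -> float:
    """Yield of n equal-area partitions, each replicated q times (Model 1)."""
    return (1 - (1 - Y**(1.0 / n))**q)**n

# A couple of Table 4.1 entries (the n = 3 row):
print(round(y_new(0.25, 3, 2), 3))   # 0.643
print(round(y_new(0.25, 3, 3), 3))   # 0.856

# Normalizing the original area to 1, the redundant area is q, so Y/A = Y_new/q.
# As n grows, Y_new -> 1 and Y/A -> 1/q; the largest asymptote, 0.5, is at q = 2.
for q in (2, 3, 4):
    print(q, y_new(0.1, 1000, q) / q)
```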
Table 4.1: Y_new as a function of n and q

                 q = 2                      q = 3                       q = 4
  n      Y=.25   Y=.5    Y=.9      Y=.25   Y=.5    Y=.9       Y=.25   Y=.5    Y=.9
  2      0.563   0.8358  0.995     0.766   0.95    0.9997     0.879   0.985   0.999986
  3      0.643   0.8777  0.996     0.856   0.974   0.9999     0.945   0.995   0.999996
  4      0.699   0.9025  0.997     0.903   0.984   0.9999     0.971   0.997   0.999998
  5      0.739   0.919   0.998     0.931   0.989   1          0.983   0.999   0.999999
  6      0.77    0.9307  0.998     0.948   0.992   1          0.989   0.999   0.999999
  7      0.795   0.9394  0.998     0.96    0.994   1          0.993   0.999   1
  …      …       …       …         …       …       …          …       …       …
 1000    0.998   0.9995  1         1       1       1          1       1       1
[Figure: plot of Yield/Area (about 0.1 to 0.4) vs. number of partitions (0 to 14), with curves for q = 2, 3, and 4]
Figure 4.3: Y/A as a function of q and n for Y = 0.1
The above results are based on the following assumptions: (i) perfect
interconnects; and (ii) the original circuit can be partitioned in an arbitrary way, including
all modules having the same size.

We use MYRA to maximize Y/A when the overheads (area and yield) of
interconnects are considered. In the next section, we compare the results obtained
by MYRA with our theoretical results assuming a fixed number (q) of copies for each
module, and show that MYRA results in a higher value of Y/A.
4.2.1 Validation of Theorems using MYRA

We refer to configurations where each module is replicated q times as
q-Replicated Modules (qRM). In this section, we first show that for a given set of
modules, the output of MYRA is as good as or better than qRM configurations. Our
experimental results are based on the OpenSPARC T2 chip, which has 8 identical cores,
each of which consists of 13 modules; thus it has already been partitioned. Some
information about each module is given in Table 4.2. This data was obtained by taking
the Verilog (RTL) description of each module and synthesizing it using 90nm technology.

In our experimental results we use defect densities between 0.05 and 0.5 defects
per mm². We first compare the output of MYRA with qRM. Recall that the main reason
for developing MYRA was to address the situation where real constraints force us to use
modules with unequal areas, and to include the interconnect overheads. The OpenSPARC
T2 core has unequal size modules and is thus a good candidate for this study. For a qRM
structure we report the value of q that maximizes Y/A. For MYRA we report the value of
q_i for each module m_i. The results are given in Table 4.3.
Table 4.2: OpenSPARC T2 core’s modules
Name #input #output #gates and FFs % in core
DEC 631 358 2260 1.1
EXU1 435 471 8251 4.37
EXU2 435 471 8251 4.37
FGU 582 348 41065 21.7
GKT 521 461 5246 3.18
IFU_CMU 248 348 4350 2.19
IFU_FTU 1029 365 12281 6.65
IFU_IBU 671 337 13492 6.77
LSU 742 993 25013 13.8
MMU 510 366 19474 10.6
PKU 394 248 9759 4.7
PMU 169 92 5476 2.77
TLU 1036 125 33969 17.8
Total 188887 100
Table 4.3: Y/A comparison of the configurations generated by MYRA & qRM

Name of the modules   # of copies (q_i) identified by MYRA for defect density:
                      .05   .075   .085   .095   .12   .15
DEC 2 2 2 2 2 2
EXU1 2 2 2 3 3 3
EXU2 2 2 2 3 3 3
FGU 3 3 4 3 4 4
GKT 2 2 2 2 3 3
IFU_CMU 2 2 2 2 2 3
IFU_FTU 2 2 3 3 3 3
IFU_IBU 2 2 3 3 3 3
LSU 2 3 3 3 3 4
MMU 2 3 3 3 3 4
PKU 2 2 2 3 3 3
PMU 2 2 2 2 3 3
TLU 2 3 3 3 4 4
Level of replication for qRM (q) 2 3 3 3 3 4
Y/A Improvement 0.11% 4.63% 2.65% 1.45% 5.6% 11.6%
The Y/A improvement using MYRA compared to qRM is defined by the expression:

Y/A Improvement = (Y/A of MYRA - Y/A of qRM) / (Y/A of qRM) × 100.
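This metric is simply a relative percentage difference; a minimal helper (the numeric values below are illustrative, not taken from Table 4.3):

```python
def ya_improvement(ya_myra: float, ya_qrm: float) -> float:
    """Percentage Y/A improvement of MYRA over qRM, per the expression above."""
    return (ya_myra - ya_qrm) / ya_qrm * 100.0

# e.g., if MYRA reaches Y/A = 1.5 (arbitrary units) and qRM reaches 1.2:
print(ya_improvement(1.5, 1.2))   # ~25
```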
The last row of Table 4.3 shows the value of this improvement. We see that
MYRA always produces better results than qRM, and in the worst case the two
procedures result in the same value. The main reason for the better performance of MYRA
is that since different modules can have different areas, they are allowed to have different
numbers of copies. For example, for a defect density of .085, the qRM approach uses 3
copies of each module, while MYRA uses 2 copies for some modules, 3 for others, and 4
for one module. Modules with smaller yields (larger areas) require more spares. For the
rest of this dissertation we use MYRA to generate redundant designs with maximum Y/A. We
will show that when all partitions have approximately equal areas, for all defect densities,
the procedures MYRA and qRM produce the same configurations.
4.2.1.1 Theorem 1 and MYRA
We focus our attention on the module TLU, since it is quite large. We partition it
into equal and unequal size modules and show that modules with equal area lead to
redundant configurations with the largest values of Y/A.
We use the min-cut based partitioner hMetis [39]. From the netlist of this module
we extract its hypergraph representation. The inputs to hMetis are (i) the hypergraph
representation of a circuit, (ii) the desired number of partitions, and (iii) a balancing
factor (BF). BF is used to specify the allowed imbalance between the relative
complexities of the partitions. To partition the circuit into almost equal partitions we set
BF to 1 (BF varies from 1 to 49 [39]).
We partition the TLU into 2, 4, 6, 8 and 10 modules in several ways, some with
equal areas and others with unequal areas. Table 4.4 shows the percentage of the
original module (TLU) occupied by each partition.
Table 4.4: Area distribution of each module of TLU

#Partitions         Area distribution in each partition (%)
  2    E*    50     50
       UE    36     64
  4    E     25.4   25.4   24.5   24.7
       UE    40.43  24.86  19.07  15.64
  6    E     16.38  16.38  16.94  16.38  16.96  16.96
       UE    8.08   22.49  12.24  10.46  17.84  28.89
  8    E     13.08  13.19  12.67  12.67  11.98  11.85  11.86  12.71
       UE    15.2   15.4   20.4   9.94   6.96   12.7   7.5    11.9
 10    E     10.3   10.1   10.2   9.9    9.9    9.85   9.83   9.9    9.92   10.1
       UE    9.4    5.67   13.1   14.8   14.6   4.95   6.3    5.98   14.8   10.4
* E means Equal Area Modules & UE means UnEqual Area Modules
Each of these partitions is then processed by MYRA and the results are compared
using the expression:

Y/A (E vs. UE) = (Y/A of E-Partitions - Y/A of UE-Partitions) / (Y/A of UE-Partitions) × 100.
The results of this comparison are given in Table 4.5. It can be seen that the Y/A
of designs with equal area modules is always better than those having unequal area
modules. This result is consistent with Theorem 1.
Table 4.5: Y/A comparison of the Equal and Unequal partitions (E vs. UE)

#Partitions   d = .12    .15    .25    .35    .5
     2          1.52     2.2    3.41   1.93   5.09
     4          1.57     2.34   0.23   5.46   4.72
     6          1.5      2.2    5.4    2.3    8.69
     8          0.67     1.02   2.65   0.92   3.53
    10          0.75     4.72   8.69   3.53   4.04
To get unequal partitions we increased BF to 15 which not only generates unequal
modules, but also much fewer interconnects between partitions compared to the case
where BF = 1. This results in much simpler forks and joins. It is important to note that
designs having equal area modules and complex interconnects still result in better values
of Y/A than those designs having unequal size modules and simpler interconnect. From
these results, we can conclude that when the interconnect structures associated with a
module are relatively small compared to the module itself, it is reasonable to ignore the
interconnect overheads in determining the best value of q.
4.2.1.2 Theorem 2 and MYRA
From the previous sub-section we see that balanced partitions lead to better
designs. In this section we show that increasing the number of partitions also leads to
better Y/A.
[Figure: plot of Yield/Area (×1000), about 5 to 30, vs. number of partitions (1 to 9), with curves for DD = 0.12, 0.15, 0.25, 0.35, 0.5]
Figure 4.4: Y/A of TLU as a function of n and defect density (DD)
Figure 4.4 shows the result of first partitioning TLU into n equal area modules,
and then obtaining their optimal redundant design using MYRA. We see that the
maximum achievable value of Y/A via the use of redundant modules is a concave
function. The best value of n for the defect densities shown is 8. In general, we denote
this best value by n*. For larger values of n, Y/A decreases, since while each module is
getting smaller, the number of forks and joins is increasing.
The value of n* increases as the size of the original circuit increases. To illustrate
this, we partitioned an entire OpenSPARC T2 core into equal area modules for various
values of n. Figure 4.5 shows the maximum achievable values of Y/A.
[Figure: plot of Yield/Area (roughly 2000 to 4000) vs. defect density (0.075 to 0.15), with curves for P1 = 13, P2 = 98, P3 = 182, P4 = 279 partitions]
Figure 4.5: Y/A of a core with different numbers of partitions and defect densities
Design P1 refers to the original partitioned circuit (Table 4.2), which has unequal
size modules. For design P2, we selected the smallest module (DEC) and partitioned the
other modules into sub-modules, each of which has an area close to that of DEC. This
resulted in 98 modules. For designs P3 and P4 we first partitioned DEC into two (for P3)
and three (for P4) almost equal size modules, and then partitioned all of the other modules
into sub-modules whose area is close to that of the partitioned module DEC. This process
resulted in 182 and 279 modules for P3 and P4, respectively. Again, Y/A under all defect
densities initially increases as n increases, and then decreases as n continues to increase.
In Table 4.6 we show the final area for each of these designs as a function of the defect
density. The area of the original core is 12 mm² [32].

Table 4.6: Area of the core after redundancy (mm²)

   d      P1 = 13   P2 = 98   P3 = 182   P4 = 279
 0.075     31.82     24.08     24.12      24.41
 0.085     33.46     24.08     24.12      24.41
 0.095     35.09     24.08     24.12      24.41
 0.12      40.58     24.08     24.12      24.41
 0.15      43.81     24.31     24.32      24.59
We see that when we partition the circuit into equal area modules (P2, P3, P4), the
total area is slightly more than twice that of the original core, indicating that the average
value of q ≈ 2 and the area of the interconnect is very small compared to that of the
original circuit. For P1 the ratio of areas is between 2.65, for a low defect density, and
3.65, for a high defect density. This leads us to the following conclusions.
Conclusion 1: When a large number (n) of equal size modules is used, it is
sufficient to use q = 2 when maximizing Y/A. Also, as n increases, by Rent's rule [55]
the total number of interconnects for each module decreases, hence each steering logic
block becomes simpler. Simpler steering logic leads to lower performance loss; however,
since there are many of them, they could reduce the yield (discussed in detail in the next
section).

Conclusion 2: As defect density increases, the reduction in Y/A for unbalanced
partitions is large. On the other hand, for balanced partitions, this reduction is more
gradual. This occurs because as the yields of the modules and interconnect decrease, the
area used for redundancy does not change, i.e., it remains at a little over twice the area of
the original core.
As mentioned, in reality there are, however, some limitations and overheads when
redundancy is added at finer levels of granularity, such as the overhead of steering logic
and testing complexity, which includes the extra circuitry that must be added to test each
partition, its spare(s), and the steering logic. Next, we present a new design flow that
considers these overheads.
4.3 Proposed Design Flow
In this section we introduce our design flow that finds the optimal level of
granularity for the given original logic circuit to be used for Y/A maximization using
redundancy. The design flow has two phases and in the following sub-sections we
describe them.
4.3.1 Phase 1: CLB partitioning
In this section we explain our motivations and algorithm for CLB partitioning of
the original circuit. We are given the netlist of the original circuit and we are interested
in using redundancy at a finer level of granularity than the core level (whole circuit).
Therefore, we need to partition the circuit into smaller modules, but there are several
practical constraints that must be satisfied, including those shown below:

Design constraint (timing closure): Timing closure is the process whereby a
design is modified to meet its timing requirements. If we partition a circuit along a
combinational path and add switches there, then designers would need to redo the timing
closure process, which is expensive, and we prefer to avoid it.
DFT (design for testability) constraint: We should be able to test each module
and its spares, and configure them after testing. To support this requirement we need extra
circuitry (such as scan FFs) and control signals for each module and its spare(s). We
prefer to avoid these extra overheads as much as possible and to use the circuitry
available for testing the original circuit to test our new partitions or modules.

Search space constraint: The size of the search space for many CAD problems is
usually very large, and this is true for our partitioning problem. Here, our input is a logic
circuit that may have millions of gates and FFs, and we want to avoid searching such a
large space to find the optimal level of granularity.

The first two of the above constraints are satisfied if we limit ourselves to
modules whose inputs and outputs terminate at FFs. Next, we define the term CLB for
this dissertation and then provide the reasons that CLB partitioning will satisfy these
constraints.
CLB (Combinational Logic Block): For an arbitrary logic gate we cluster
(associate) all other gates which are in the transitive fan-out or transitive fan-in of this
gate via paths that do not pass through any FF, PI (primary input) or PO (primary output).
We refer to this cluster as a combinational logic block (CLB). Once a gate is associated
with a CLB, it is flagged so that it is not revisited again. Therefore, a CLB is a
combinational logic block, all of whose inputs are either PIs or driven by FFs and all its
outputs are POs or drive FFs.
There are three categories of FFs, namely drivers (FF_D), receivers (FF_R), and
driver/receivers (FF_D/R). The general structure of a CLB is illustrated in Figure 4.6 (a),
and in this dissertation we call a cluster of driver, receiver, or driver/receiver FFs of a
CLB a register (R), as shown in Figure 4.6 (b).

Figure 4.6: (a) A CLB, (b) clustered FFs to registers, (c) redundant CLB
CLB partitioning satisfies the above constraints as follows:

(i) No switch will cut any combinational path, so we do not need to repeat timing
closure (satisfies the timing closure constraint).

(ii) There is no extra cost for making the inputs/outputs of a CLB controllable or
observable for testing, since we can simply use the existing scan FFs in the original circuit
(satisfies the DFT constraint). Note that we need to modify the scan chains by adding
MUXes or De-MUXes to them (Figure 4.6 (c)), which could increase the critical path
delay, but we do not need to redo timing closure. All we should do is increase our delay
budget to account for test circuitry in the first place and let our designers know about it.

(iii) It reduces the search space from millions of gates to hundreds or thousands of
CLBs. In addition, we can apply a combinational ATPG to each CLB to generate scan
test vectors for them [27], and FFs can be clustered into registers to be used to process
random test data (RTD) in a BIST-based test architecture (PRPG, MISR) [1].
Efficient spares for FFs: One important observation about CLBs (Figure 4.6 (a))
is that FFs can share spare FFs and thus significantly reduce the area overhead for FF
redundancy, as will be discussed next. Figure 4.6 (c) shows the redundant version of a
CLB with switches (MUXs and De-MUXs) and spare FFs for registers.
Observation 3: The FFs of a CLB are usually identical, and consequently we can
add a small pool of s spares, s < r, to increase Y/A with a simple switch circuitry, instead
of replicating all r of them.
Proof. Assume we have r FFs and cluster them into a register R, and the yield of
each FF is y. Then the yield of register R is y^r. We consider the following two cases to
create redundancy for these FFs. Case 1: We duplicate the whole register, and its yield
(the probability of having at least one functional register) would be

Y_1 = 1 - (1 - y^r)².

Case 2: If the FFs of register R have one common pool of s spares, then the yield of R
would be

Y_2 = Σ_{i=r}^{r+s} C(r+s, i) y^i (1-y)^(r+s-i),

where we need at least r functional FFs to have a functional register. We can
mathematically prove that Y_2 is always larger than or equal to Y_1. The worst case
for Y_2 is when s = 1, but we can prove that even for this case Y_2 outperforms Y_1, i.e.,

Y_2 - Y_1 = y^r × (1-y) × [r - (1 + y + y² + … + y^(r-1))].

The term y^r × (1-y) is always positive because 0 < y < 1. Also, (1 + y + y² + … + y^(r-1))
has r terms, each of which is positive and smaller than or equal to 1. Thus the right hand
side of the preceding equation is also positive, so Y_2 > Y_1. In addition, it is obvious that
case 2 has lower area than case 1. Also, case 2 needs one simple multiplexer at the
input/output of each FF, while case 1 needs a large multiplexer and de-multiplexer, i.e.,
case 2 has much lower interconnect complexity. Therefore, with higher yield and smaller
area, case 2 has a larger value of Y/A than case 1.
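The two cases of Observation 3 are easy to cross-check numerically. The following sketch (function names and the (y, r) values are our own illustrative choices) compares the binomial form of Y_2 with the duplication yield Y_1 and with the closed-form gap derived above:

```python
from math import comb

def y_duplicate(y: float, r: int) -> float:
    """Case 1: duplicate the whole r-FF register; yield = 1 - (1 - y^r)^2."""
    return 1 - (1 - y**r)**2

def y_shared_pool(y: float, r: int, s: int) -> float:
    """Case 2: r FFs share a pool of s spares; the register works when at
    least r of the r+s FFs are functional."""
    return sum(comb(r + s, i) * y**i * (1 - y)**(r + s - i)
               for i in range(r, r + s + 1))

# Even a single shared spare (s = 1) beats duplicating the register for r >= 2,
# and the gap matches Y2 - Y1 = y^r (1-y) [r - (1 + y + ... + y^(r-1))].
for y in (0.9, 0.99):
    for r in (2, 8, 32):
        gap = y_shared_pool(y, r, 1) - y_duplicate(y, r)
        closed = y**r * (1 - y) * (r - sum(y**k for k in range(r)))
        print(y, r, gap > 0, abs(gap - closed) < 1e-9)
```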
Utilizing Observation 3, we replicate only each CLB and add spare FFs to the
driver, receiver and driver/receiver FFs instead of replicating every FF (Figure 4.6 (b)).
This observation is very important because FFs are large elements and many designs
have hundreds of thousands of FFs. For example, in OpenSPARC T2, FFs occupy 45%
of the total area of the core [32], and this observation helps us significantly improve Y/A
with almost negligible area overhead.
In order to support a spare FF for scan chains we need to modify a scan FF as
shown in Figure 4.7 (a). A scan FF originally has a MUX (shown with an arrow), and we
add another MUX, a De-MUX, and a control generation circuit to it (Figure 4.7 (b)). The
control generation circuit is similar to the one proposed in [57]. It initially generates a 1
signal, but if we blow the fuses it generates 0. For clarification, we provide an example in
Figure 4.7 (c) where the scan chain has four FFs and one spare FF, shown in gray.
Initially all CG_i signals are 1, thus the spare FF is not logically part of the circuit. Now
assume there is a defect in FF_3, as shown in Figure 4.7 (c). There are different diagnostic
techniques for scan chains, such as [52], to detect and locate this fault. Once we identify
the faulty FF, we configure the MUXs and De-MUXs by blowing the appropriate fuses to
bypass the faulty FF and replace it by a fault-free spare FF, as shown in Figure 4.7 (d). In
this example we designed a scan chain that can tolerate one faulty FF in this register, but
this idea can be extended to handle more faulty FFs.
Next, we provide some details of our CLB partitioning algorithm. CLBs are
generated from the netlist of the logic circuit through three steps. Step 1 (Parser): This
step parses the given netlist and stores it in a linked list data structure that includes
information regarding each gate, such as its type, the number of fan-ins, the nodes that
drive this gate, the number of fan-outs, and the nodes that this gate drives. Step 2 (Seed
list generator): Seeds are the primary inputs and FFs from which we start to generate
CLBs. In this step we generate all the seeds and store them in a seed list. Step 3 (CLB
generator): The final step generates the CLBs; its pseudo code is given in Figure 4.8.

Figure 4.7: New scan chain with spare FF
(a) A scan FF, (b) the control generator circuit [57], (c) a scan chain with 4 FFs and
one spare FF, (d) the configured scan chain in presence of a defect
//Initialization: The CLB number of all gates is undefined
//PIs and POs are dealt with as gates
C = 1;                    //C is a unique CLB number
S_list = the generated seed list from Step 2;
Search_list = {};
while (S_list is not empty)
{  S = pop a seed from S_list;
   Assign C to S and all gates in the fan-out of S;
   Add all gates in the fan-out of S to Search_list;
   while (Search_list is not empty)
   {  g = pop a gate from Search_list;
      for (all gates in the fan-in and fan-out of g)
      {  if (the gate has not been assigned to any CLB yet)
         {  Assign C to it;
            if (the gate is not a FF) Add it to Search_list;
         }
      }
   }
   C++;
}
Figure 4.8: Pseudo code for step 3 of the CLB partitioning algorithm
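The pseudo code of Figure 4.8 can be sketched in Python roughly as follows. The data-structure names (`gates`, `fanin`, `fanout`, `is_ff`) are our own illustrative choices, and one detail the pseudo code leaves implicit (skipping a seed whose fan-out is already fully assigned, so that no CLB number ends up empty) is made explicit:

```python
from collections import deque

def partition_clbs(gates, seeds):
    """`gates` maps a node name to {'fanin': [...], 'fanout': [...],
    'is_ff': bool}; `seeds` are the PIs and FFs from Step 2.
    Returns a dict mapping every reached node to its CLB number."""
    clb = {}                                    # CLB number, initially undefined
    c = 1                                       # C, a unique CLB number
    for s in seeds:
        fanout = gates[s]['fanout']
        # Skip a seed whose fan-out is already fully assigned.
        if s in clb and all(g in clb for g in fanout):
            continue
        clb.setdefault(s, c)                    # assign C to S (first wins)
        search = deque(g for g in fanout if g not in clb)
        for g in search:
            clb[g] = c
        while search:
            g = search.popleft()
            for h in gates[g]['fanin'] + gates[g]['fanout']:
                if h not in clb:
                    clb[h] = c
                    if not gates[h]['is_ff']:   # FFs bound the CLB
                        search.append(h)
        c += 1
    return clb

# A tiny example: PI a -> g1 -> FF f -> g2 -> PO po, seeds = {a, f}.
example = {
    'a':  {'fanin': [],     'fanout': ['g1'], 'is_ff': False},
    'g1': {'fanin': ['a'],  'fanout': ['f'],  'is_ff': False},
    'f':  {'fanin': ['g1'], 'fanout': ['g2'], 'is_ff': True},
    'g2': {'fanin': ['f'],  'fanout': ['po'], 'is_ff': False},
    'po': {'fanin': ['g2'], 'fanout': [],     'is_ff': False},
}
clb = partition_clbs(example, ['a', 'f'])       # two CLBs, split at the FF
```

Because expansion stops at FFs, every FF acts as a CLB boundary, which is what makes the FFs of each CLB separately testable via the scan chain.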
We applied our CLB partitioning algorithm to the OpenSPARC T2 core, and Table 4.7 lists the number of CLBs and their corresponding sizes (number of gates in each CLB). As mentioned earlier, CLB partitioning reduces the search space from the gate (or FF) level to the CLB level; for this specific design (OpenSPARC T2) it reduced the search space from 188887 gates and FFs to 1808 CLBs, i.e., 104 times fewer entries to consider. Due to our definition of a CLB, the CLB results for each circuit are deterministic and there is no room for optimization in this phase; hence, in the next section we propose two heuristics to optimize Y/A that treat CLBs as individual modules.
Table 4.7: Size and number of CLBs in OpenSPARC T2
Size (# of gates)          # of CLBs     Percentage of area in the core
10000 < Size(C) 6 59.4%
1000 < Size(C) < 10000 11 30.5%
100 < Size(C) < 1000 25 4.78%
Size(C) < 100 1766 5.32%
Total: 1808 100%
4.3.2 Phase 2: Overall optimization
The goal of phase 2 is to maximize Y/A for the CLBs generated in phase 1 by first changing the level of granularity, clustering some small CLBs together to form larger ones and partitioning some large CLBs into smaller ones, and then identifying the best number of spares for each CLB. Phase 2 has three steps: (i) it clusters small CLBs to minimize the area overheads of configuration circuitry and spare FFs, (ii) it determines the best number of spares for each CLB, and, if necessary, (iii) it partitions large CLBs to increase yield in a spiral model. In the following sub-sections we address these steps.
4.3.2.1 Clustering of small CLBs
We see from Table 4.7 that CLB partitioning may end up with many CLBs, and the variance in their sizes is quite large. The total number of CLBs for the core is 1808, where the 1766 small CLBs collectively occupy only about 5% of the area and the rest (i.e., 42 CLBs) collectively occupy 95% of the core. The contribution of those 1766 small CLBs to the Y/A improvement of the core with redundancy is low; however, the cost of the extra circuitry to test and configure these small CLBs is very high. In this step we cluster small CLBs together to form larger CLBs. This reduces the number of control signals and the number of spare FFs, i.e., it leads to lower area overhead. We explain the motivation for CLB clustering using an example.
Figure 4.9 (a) shows 4 CLBs that have been generated in the preprocessing phase of our design flow. In Figure 4.9 (b) each CLB has been duplicated, and the related MUXs, De-MUXs and control signals have been added to the circuit. We need 4 control signals (S_0 to S_3) to test and configure this circuit after testing. Also, since the registers are separated, we need at least one spare FF for each register to make sure it is not a single point of failure, i.e., a total of 3 spare FFs. Now, assume CLBs C_1 to C_3 are relatively small and we cluster them to form a new CLB (C_{1,2,3}). Figure 4.9 (c) shows the replicated circuit for this new design, where we need just 2 control signals, and since the registers have also been clustered, we need only one spare FF. This design needs fewer control signals (simpler configuration circuitry) and lower area overhead for spare FFs. In the OpenSPARC T2 core, clustering the 1766 small CLBs into fewer but larger new CLBs could significantly reduce overheads. We will show our experimental results pertaining to this issue. Next, we explain our procedure for CLB clustering.
Figure 4.9: An example of CLB clustering
Our technique clusters small CLBs at the same level, i.e., CLBs in parallel (shown in Figure 4.9), but another option is to concatenate small CLBs in series. Although this eliminates the need for a switch between these CLBs after replication, we do not recommend it because we would have to replicate every FF between these CLBs, and the area of a FF is much larger than that of a MUX or a De-MUX. So we only cluster small CLBs in parallel, i.e., at the same level, which has much lower area overhead after replication.
First, for the given CLBs from phase 1 we generate a netlist (we call it the CLB-netlist) as follows, where each CLB is an element that has a size (i.e., its area or number of gates) and the nets are PIs, POs or wires between CLBs:

CLB-netlist generator (from the output of phase 1):
1. Remove all the FFs and connect the input of each FF to its output.
2. Consider each CLB as one element and assign its size to it.
3. Cluster all PIs and POs that drive or are driven by each CLB into one PI or PO net.
4. Cluster all wires between any two CLBs into one CLB net.
There is no branch in the CLB-netlist, because our CLB partitioning algorithm does not generate any loop or branch among CLBs (for more details see the algorithm in Figure 4.8).
Next, a levelization algorithm [37] is applied to the CLB-netlist. Finally, all small CLBs having the same level value are clustered into one large CLB as follows:

1. Initialization: For each net c_i in the CLB-netlist set levinp(c_i) = undefined. Ask the user to enter the maximum size of a small CLB and store it in S.
2. For each primary input net x_i, set levinp(x_i) = 0.
3. While there exist one or more elements (CLBs) such that (i) levinp is defined for each of the element's inputs, and (ii) levinp is undefined for any of its outputs, select one such element: if the selected element has inputs c_i1, c_i2, ..., c_iα and outputs c_j1, c_j2, ..., c_jβ, then for each output c_jl, where l = 1, 2, ..., β, assign levinp(c_jl) = max[levinp(c_i1), levinp(c_i2), ..., levinp(c_iα)] + 1.
4. Cluster all elements with the same level and size lower than S into the same cluster.
The above procedure has a time complexity of O(n log n), where n is the number of CLBs (elements) in the CLB-netlist. A visual example of a CLB-netlist is given in Figure 4.10, where gray and white vertices represent large and small CLBs, respectively. The CLB level number is shown inside each corresponding vertex. Using our heuristic, three clusters (new larger CLBs) are generated for this example.
Figure 4.10: A visual example for CLB-netlist
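The levelization and clustering steps above can be sketched as follows, assuming the CLB-netlist is represented as a dictionary (the representation and names are illustrative, not from the dissertation):

```python
def levelize_and_cluster(elements, primary_inputs, max_small_size):
    """`elements` maps a CLB name to {'inputs': [...], 'outputs': [...],
    'size': gate_count}, with inputs/outputs given as net names;
    `primary_inputs` are the PI nets.  Returns (levels, clusters), where
    clusters groups the small CLBs that share a level."""
    levinp = {net: 0 for net in primary_inputs}     # step 2: PI nets at level 0
    levels = {}
    pending = set(elements)
    while pending:                                  # step 3
        ready = [n for n in pending
                 if all(i in levinp for i in elements[n]['inputs'])]
        if not ready:
            raise ValueError("CLB-netlist contains a loop")
        for n in ready:
            lev = max(levinp[i] for i in elements[n]['inputs']) + 1
            for o in elements[n]['outputs']:
                levinp[o] = lev
            levels[n] = lev
            pending.remove(n)
    clusters = {}                                   # step 4
    for n, e in elements.items():
        if e['size'] < max_small_size:              # S, the small-CLB bound
            clusters.setdefault(levels[n], []).append(n)
    return levels, clusters

# Two small CLBs feeding one large CLB; S = 100 as for OpenSPARC T2.
example = {
    'C1': {'inputs': ['x'], 'outputs': ['n1'], 'size': 40},
    'C2': {'inputs': ['x'], 'outputs': ['n2'], 'size': 60},
    'C3': {'inputs': ['n1', 'n2'], 'outputs': ['y'], 'size': 5000},
}
levels, clusters = levelize_and_cluster(example, ['x'], 100)
# levels == {'C1': 1, 'C2': 1, 'C3': 2}; C1 and C2 end up in one cluster
```

The loop over `pending` repeatedly picks elements whose input levels are all known, which is exactly the acyclicity the CLB-netlist guarantees.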
We defined any CLB with size less than 100 gates as a small CLB for OpenSPARC T2, because the CLBs smaller than 100 gates collectively occupy only about 5% of the original core. The definition of a small CLB depends on the original circuit design. Using this heuristic we reduced the total number of CLBs in the OpenSPARC T2 core from 1808 to 58. Table 4.8 provides the optimization results for OpenSPARC T2. The significant results are due to the reduction in the number of control signals (almost 31 times fewer). Also, the number of spare FFs was reduced by a factor of 14, which saves about 2% area overhead on each core.
Table 4.8: Optimization results after CLB clustering
                                                          Before Optimization   After Optimization
Number of CLBs                                            1808                  58
Number of control signals                                 1808                  58
Number of Spare FFs                                       1830                  131
Area overhead: (Area of Spare FFs / Area of a core) × 100   2.09%               0.15%
Once we have clustered the CLBs, we go to the next step and determine the best number of spares for each CLB to maximize Y/A. If the resulting Y/A is acceptable, we stop here; otherwise, we go to the third step for extra partitioning. To determine whether the generated Y/A is acceptable, we define a golden design and compute the gap between the Y/A of the golden design and that of our current design (more details in the experimental results section).
4.3.2.2 Adding redundancy to CLBs for Y/A maximization
Our main goal in this step is to find the best number of spares for the given CLBs of the original circuit to maximize Y/A. We use MYRA (details in Chapter 3) to generate the redundant design with maximum Y/A, where each CLB is treated as a module.
Figure 4.11 shows the general redundant configuration for an original circuit with n CLBs, where the gray rectangles are spares. Note that we do not replicate registers and just add one spare FF to each; hence, for simplicity of presentation, we do not show them in Figure 4.11. Each CLB can be unique (gates and FFs) and has its own pool of spares.
Figure 4.11: General redundant configuration of all CLBs of the original circuit
Figure 4.12: Design flow for Y/A maximization of logic circuits
In this step of phase two, for the given yield and area of the CLBs, and other information about the chip such as A_OM and Y_OM, we find the best number of spares for each CLB to maximize Y/A. Note that there is no optimization for FFs: we simply add one spare to each register of a CLB, but we apply their yields and areas when reporting the Y/A of the final redundant design. The design flow is given in Figure 4.12; in its step 4 we use MYRA to find the best number of spares for each CLB to maximize Y/A while taking into account all overheads (yield reduction and area overhead) of switches and the extra circuitry for testing and configuration. We use our tool TYSEL to estimate all of these overheads.
4.3.2.3 Partitioning of large CLBs
Table 4.7 shows that a few CLBs can be very large. Therefore, we add one more step to our overall optimization phase to partition large CLBs into smaller ones. Every time we reach this step, we choose the largest CLB and partition it into two smaller CLBs of almost equal size. The main idea behind partitioning large CLBs is to maximize yield (consistent with our theoretical results). Theoretically, this extra partitioning increases yield, but it also increases various overheads. We explain these overheads by an example in Figure 4.13, where we partition a large CLB C into two smaller CLBs C_1 and C_2 (Figure 4.13 (b)). As shown in Figure 4.13, partitioning the big CLB C reduces the number of spares from two to one in Figure 4.13 (b), because the small CLBs have higher yield and hence need fewer spares. But the following overheads have increased: (i) extra circuitry for making the IO wires between partitions (C_1 and C_2) testable, and (ii) the extra cost of having a MUX and a De-MUX between partitions (this cost can be yield loss, performance degradation and area overhead). These overheads need to be addressed in our computations for overall optimization to find the optimal level of granularity. To address this, the design flow (Figure 4.12) has a spiral model that stops extra partitioning when the resulting overheads cancel the benefits of having more balanced partitions.
Figure 4.13: Partitioning of a big CLB
We again use the min-cut based partitioner hMetis [39] to partition the large CLBs. We extract the hypergraph representation of each CLB (from its netlist), in which gates and nets are modeled by vertices and hyperedges, respectively. As mentioned in Chapter 3, the inputs to hMetis are (i) the hypergraph representation of the circuit, (ii) the desired number of partitions, and (iii) a balancing factor (BF). To partition the circuit into almost equal partitions we set BF to 1. Step 5 in the design flow determines whether more partitions are required. The number of extra partitions depends on the type of computational design and on the circuit's application. For example, in carbon nanotube technology, where the defect density is very high [60] [64], it might be beneficial to use more partitions for yield improvement, but for a MOS based processor more partitions could improve the functional yield while reducing the performance yield.
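As a sketch, the hypergraph of a CLB can be written in the hMetis input format and then bisected from the command line; the file name, vertex ids, and the example net list below are illustrative assumptions:

```python
import os
import tempfile

def write_hmetis_hgr(path, num_vertices, hyperedges):
    """Write a hypergraph in the hMetis .hgr format: the first line
    gives the hyperedge and vertex counts; each following line lists
    the 1-based vertex ids of one hyperedge (gates map to vertices,
    nets to hyperedges)."""
    with open(path, "w") as f:
        f.write(f"{len(hyperedges)} {num_vertices}\n")
        for edge in hyperedges:
            f.write(" ".join(str(v) for v in edge) + "\n")

# A toy 4-gate CLB with three nets:
hgr = os.path.join(tempfile.mkdtemp(), "clb.hgr")
write_hmetis_hgr(hgr, 4, [[1, 2], [2, 3, 4], [1, 4]])
# The CLB is then bisected with a balancing factor of 1, e.g.
#   shmetis clb.hgr 2 1
# which writes each vertex's partition number to clb.hgr.part.2
```

The partition file is then mapped back from vertex ids to gate names to obtain the two new CLBs.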
To decide on extra partitioning we propose a golden design (in section 4.4) that has the maximum Y/A, and in step 5 we calculate the gap between the current redundant design (from step 4) and the golden design. If this gap is negligible we stop extra partitioning; otherwise, we go through steps 6 and 7 and increase the number of partitions for large CLBs by 1. Note that the minimum size of a large CLB is determined by the user; for example, in OpenSPARC T2 it could be 10000 gates.
4.4 Experimental Results
We use, as our benchmark, the OpenSPARC T2, which has 8 identical cores. The areas of each core and of the chip are about 12 mm^2 and 342 mm^2, respectively (using 65nm technology). This means A_OM = 246 mm^2, and without loss of generality we set Y_OM = 1. In this section we compare the Y/A of the following designs:
· D_orig is the original design without redundancy.
· D_CLB^Red is a design in which we first generate all the CLBs for the given logic circuit (steps 1 & 2 in the design flow), cluster the small CLBs (step 3) and add the appropriate amount of redundancy (step 4) to maximize Y/A; however, there is no extra partitioning of large CLBs, i.e., the design flow exits as soon as it reaches step 5.
· D_CLB+Reg^Red is similar to D_CLB^Red, but it does not use any spare FFs for registers and simply replicates the registers in the same way as CLBs (note that here a register means a cluster of driver, receiver or driver/receiver FFs).
· D_Core^Red is a redundant design which uses redundancy at the core level (the traditional technique).
· D_func-modules^Red is a redundant design that uses the original functional modules of the design for redundancy. For example, OpenSPARC T2 originally has 13 functional modules (partitions) such as DEC, FGU, IFU, etc.
· D_Golden^Red is the golden redundant design generated by our design flow. To find this design we assume the overheads of switches and extra circuitry for testing are negligible. Although this assumption is not practical, it gives us a design with an upper-bound Y/A which can be used as our golden reference model. Since these overheads are not addressed, after CLB generation and in the optimization phase, the design flow keeps partitioning each CLB down to the finest level of granularity, which is a gate. In the redundancy step (step 4) we simply duplicate each gate of a CLB to maximize Y/A. Therefore, D_Golden^Red ideally has the maximum Y/A and can be used as an upper bound on the Y/A of designs generated by our design flow with extra partitioning of each CLB.
As mentioned, we address the interconnect complexities (yield and area) in all redundant designs (except the golden design) using our program tool (TYSEL), which draws the complete layout, including all overheads for configurability and testability.
For D_Core^Red, the design with spare cores, given m cores and s spare cores, we need at least m fault-free cores. Y_Core^Red and A_Core^Red are defined as follows:

Y_Core^Red = \sum_{i=m}^{m+s} \binom{m+s}{i} y_c^i (1 - y_c)^{m+s-i}  and  A_Core^Red = (m + s) a_c      (4.4)

where y_c and a_c refer to the yield and area of each core, respectively.
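Equation (4.4), and the search for the s that maximizes Y/A, can be sketched directly. The search bound `max_s` below is an illustrative assumption (the dissertation instead bounds s by the area overhead of D_CLB^Red):

```python
from math import comb

def spare_core_yield_area(m, s, y_c, a_c):
    """Equation (4.4): the chip is good iff at least m of the
    m+s cores are fault-free."""
    y = sum(comb(m + s, i) * y_c**i * (1 - y_c)**(m + s - i)
            for i in range(m, m + s + 1))
    return y, (m + s) * a_c

def best_spare_count(m, y_c, a_c, max_s=8):
    """Pick the s in [0, max_s] that maximizes Y/A."""
    def gain(s):
        y, a = spare_core_yield_area(m, s, y_c, a_c)
        return y / a
    return max(range(max_s + 1), key=gain)

# E.g., 8 cores of 12 mm^2 each at a given per-core yield y_c:
s_star = best_spare_count(8, 0.5, 12.0)
```

As the findings below reflect, the best s grows as y_c drops, but core-level sparing still loses to finer-grained redundancy at high defect densities.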
Next, we compare the Y/A gain (equation (1.3)) of these redundant designs over the original design under different defect densities in Figure 4.14. Table 4.9 also presents the exact Y/A gain of all these redundant designs over the original design. The dotted line in this figure is the gain of the golden design, which, as explained, is an upper bound on Y/A gain.
Figure 4.14: Y/A comparison of different designs for different defect densities
Table 4.9: Y/A gain of different redundant designs over the original design
d†     D_Core^Red   D_func-modules^Red   D_CLB+Reg^Red   D_CLB^Red   D_Golden^Red
.02    4.46         4.05                 4.2             4.87        4.9
.025   6.99         6.91                 7.2             8.42        8.6
.03    10.3         11.3                 12              14.1        15
.035   14.3         18                   19              23.1        24
.04    19.2         28                   31              37.5        40
.045   24.8         43.3                 49              60.4        65
.05    31.3         66.1                 77              96.8        105
.055   38.5         99.9                 120             154         170
.06    46.3         150                  186             246         276
.065   54.8         223                  286             390         446
.07    63.9         331                  439             618         722
.075   73.4         489                  672             977         1167
† d: defect density per mm^2
The following describes our findings:
(i) For emerging technologies with low yield, D_orig, the original design without redundancy, has a very low Y/A that leads to a long time to market because of the long (12-18 months) yield learning process.
(ii) D_Core^Red, the design with spare cores, is highly flexible in sharing spares, since a spare can replace any faulty core. Although the data in Figure 4.14 and Table 4.9 suggest that the Y/A of D_Core^Red is similar to or better than that of the other designs under lower defect densities (i.e., in recent technologies), it has the worst Y/A in future technologies with lower yield. For example, D_CLB^Red has 1.1 to 13.3 times better Y/A than D_Core^Red as a function of defect density. This necessitates using (a) redundancy at finer granularities rather than the traditional core-level redundancy configuration for multi-core chips, or (b) at least a combination of both intra-core and inter-core redundancy techniques for multiple cores in high defect density technologies. Note that for a fair comparison we calculated the area overhead of D_CLB^Red and divided it by the area of a core (12 mm^2) to find the maximum number of spare cores in D_Core^Red. We then found the value of s (in equation (4.4)) that maximizes its gain.
(iii) Another good example of the necessity of using redundancy at a finer level of granularity for emerging technologies is D_func-modules^Red, where we replicate the 13 original functional modules of OpenSPARC T2. As we can see, this design underperforms the other designs with more partitions under all defect densities. The only design with worse Y/A is D_Core^Red, which uses redundancy at the coarsest grain. Looking at Table 4.9 we can see that the gap between the Y/A gain of this design and, for example, D_CLB^Red, which has 58 partitions, increases with defect density. This observation emphasizes adding redundancy at finer levels of granularity for higher defect densities.
(iv) The main difference between D_CLB^Red and D_CLB+Reg^Red is highlighted by Observation 1. As mentioned earlier in that observation, adding spare FFs for registers (clusters of FFs) rather than duplicating the whole register leads to better Y/A. Figure 4.14 and Table 4.9 confirm that D_CLB^Red always outperforms D_CLB+Reg^Red. Importantly, the gap between the two increases as defect density increases (for future technologies).
(v) D_CLB^Red outperforms all other designs (except the golden design) under all defect densities. Also, its gain is close to the provable upper-bound gain (i.e., that of the golden design). This means that our design flow, which properly uses CLB partitioning and clustering as well as adding spare FFs to registers, generates redundant configurations with Y/A close to the ideal value. Extra partitioning of large CLBs could further reduce the gap between D_CLB^Red and the golden design. However, as mentioned earlier, this extra partitioning depends on the design's application and the type of computational design. For example, if performance is not that important, then it is better to partition the large CLBs; otherwise, we may want to keep the current partitioning.
4.5 Conclusions
We presented a design flow that targets the logic circuits of chips to maximize Y/A using redundancy for emerging technologies with lower yields. The flow is based on two theorems which conclude that, for perfect interconnect, partitioning the circuit into many balanced modules leads to the highest Y/A with duplication. We proposed a design flow that finds the optimal level of granularity for logic circuits (i.e., gates and flip-flops) to be replicated for Y/A maximization. The flow has two phases: the first phase deals with the design and test constraints of partitioning the original logic circuit and uses CLB partitioning to satisfy these constraints. The second phase uses the output of the first phase and heuristically maximizes Y/A in three steps: (i) clustering small CLBs to minimize the overheads of testing, configuration and spare FFs, (ii) determining the best number of spares for each CLB, and (iii) partitioning large CLBs to increase Y/A and reduce performance degradation in a spiral model. The flow also clusters the driver, receiver, or driver/receiver FFs of a CLB into registers and adds a single spare FF to each register instead of replicating every register, to increase Y/A. All of the above were incorporated into our design flow using a spiral model. Our experiments on OpenSPARC T2 under various defect densities show that the design configuration generated by our design flow outperforms other designs, such as the traditional design with spare cores, with 1.1 to 13.3 times higher Y/A.
Chapter 5
Main Contributions, Conclusions and Future work
5.1 Main Contributions
In this dissertation we proved that for logic circuits with irregular structure realized in technologies with low yields, redundancy should be added at finer levels of granularity (e.g., compared to the core level) to enhance yield/area. Therefore, we addressed all design aspects, practical constraints, and overheads of adding redundancy at finer levels of granularity. We also showed that it is beneficial to maximize the number of healthy dies that can be obtained from each wafer; thus, we focused on maximizing Y/A, or revenue per wafer. We developed various algorithms, heuristics, tools and theorems to address these issues and maximize the number of healthy chips. Our experimental results show that our redundant designs significantly outperform both the original design without redundancy and traditional redundant designs, sometimes resulting in 1-2 orders of magnitude higher Y/A.
5.2 Conclusions
In many SoCs, the fraction F of real estate devoted to logic circuitry, as opposed to memory and other arrays, is getting smaller, whereas defect densities increase with technology-node scaling. Although small, F contains the critical circuitry in the sense that it qualifies as a single point of failure, i.e., if it were faulty one would usually need to discard the chip. This situation usually does not apply to the rest of the circuitry, such as the memory, which is often protected by techniques such as redundant rows and columns, and ECC. Since F is small, it is feasible to instantiate spare copies of the logic circuitry, which has a small effect on the total area of a chip and a large effect on the yield. Thus it is feasible to maximize Y/A or, equivalently, revenue per wafer.
In this dissertation we studied the necessity of using redundancy at sub-chip levels of granularity to maximize Y/A for emerging technologies with low yield. The dissertation focuses on the challenging task of enhancing the yield of typical logic circuits with irregular structures. To incorporate redundancy at a finer level we need to take into account the yield and area of all modules and their related steering logic. Therefore, to be realistic and practical, those overheads are incorporated into our study and all of our computations. We developed a CAD tool (TYSEL), described in Chapter 2, to precisely estimate the steering-logic overheads. TYSEL accounts for tolerable defects and therefore, compared to the commonly used Poisson yield model, results in much higher accuracy.
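For reference, the Poisson model and its clustering-aware generalization can be compared numerically. The formulas below are the standard Poisson and negative-binomial yield models, not TYSEL's internal model, and the clustering factor alpha = 2 is an illustrative value:

```python
from math import exp

def poisson_yield(area, d):
    """Classical Poisson model: Y = e^(-A*d)."""
    return exp(-area * d)

def neg_binomial_yield(area, d, alpha):
    """Negative-binomial model with clustering factor alpha:
    Y = (1 + A*d/alpha)^(-alpha).  It approaches the Poisson value
    as alpha grows, and exceeds it when defects cluster."""
    return (1.0 + area * d / alpha) ** (-alpha)

# A 12 mm^2 core at d = 0.05 defects/mm^2 (the core area and a defect
# density that appear in the experiments):
y_poisson = poisson_yield(12, 0.05)            # about 0.549
y_clustered = neg_binomial_yield(12, 0.05, 2)  # about 0.592
```

Defect clustering raises the yield estimate relative to Poisson, which is one reason a model that ignores tolerable and clustered defects underestimates the benefit of redundancy.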
In this dissertation, first, we assumed the logic circuit had been originally partitioned into different modules. We showed that modules with different sizes (areas) need different numbers of spares to maximize Y/A. In Chapter 3 we developed various algorithms and heuristics to find the best number of spares for each module to maximize yield and Y/A. Our techniques are applicable to both linear and non-linear structures. Experimental results show the efficiency of using intra-logic redundancy compared to the original circuit or replicating the entire circuit q times (qRM), i.e., they confirm that redundancy needs to be used at finer levels of granularity than the core level.
Next, in Chapter 4 we studied the attributes of partitioning that increase or maximize Y/A using redundancy. We developed two theorems that relate the size and number of partitions in the original circuit to Y/A maximization using redundancy. Our theorems show that, with negligible interconnect overhead, by partitioning the circuit into many balanced (equal size) modules one can maximize Y/A by only duplicating each module. However, the interconnect yield, area and test overheads are aggravated by adding redundancy at such a fine level of granularity, and that in turn restricts the level of granularity for logic replication. We then proposed a design flow that finds the optimal level of granularity for logic circuits to be replicated for Y/A maximization. The flow has two phases: the first phase deals with the design and test constraints of partitioning the original logic circuit and uses CLB partitioning to satisfy these constraints. The second phase uses the output of the first phase and heuristically maximizes Y/A in three steps: (i) clustering small CLBs to minimize the overheads of testing, configuration and spare FFs, (ii) determining the best number of spares for each CLB, and (iii) partitioning large CLBs to increase Y/A and reduce performance degradation. These three processes are used in a spiral model. The flow also adds spare FFs to registers instead of replicating entire registers to enhance Y/A. Our experiments on OpenSPARC T2 under various defect densities show that the design configuration generated by our design flow significantly outperforms all other designs considered. For example, our best result outperforms the traditional design with spare cores by 1-2 orders of magnitude. Next we discuss future work.
5.3 Future work
Electronic devices such as smart phones and tablets have morphed into personal communication and entertainment devices. To serve the processing needs of such users, much of the functionality of these devices will be implemented on SoCs (systems-on-chip) that integrate multiple processing cores, DSP cores, graphics cores, and other third-party IP blocks. Furthermore, to meet the stringent price points for such products, these SoCs will be fabricated using the latest nano-scale fabrication technologies, which, in the next few years, will scale below 20nm. We mentioned that a key challenge facing such 'end-of-Moore' nano-scale CMOS technologies is that they are expected to have very low yield due to many non-idealities and the high transistor counts of future SoCs.
One important problem in our list of future work, motivated by this dissertation, is to develop a new approach to the design of SoCs, not restricted to just logic circuits, that will combat the otherwise fatal combination of increasing defect densities, increasing process variations, and slow yield learning. This approach will capture the benefits as well as the costs of redundancy by using metrics like yield-per-area and GOPS-per-wafer (Giga Operations Per Second per wafer). A theoretical framework and practical tools need to be developed by capturing and exploiting the structure of highly-parallel heterogeneous SoCs.
Problem definition for future work: For a highly-parallel heterogeneous SoC we are given: (1) the types of components, (2) the number of copies of each component, (3) the size of each component, (4) metrics for each component, including (a) yield (the probability of being manufactured without any fatal defect), calculated using the component's size in combination with the defect density and clustering factor of the fabrication process, (b) performance, which is a function of the number of working components and their levels of degradation, and (c) power consumption for each component at different performance levels, and (5) a hierarchical description of the SoC, including, for each cluster, its level, the types of components and the number of copies of each component, the type of interconnects and their topology, and the floorplan of the modules. (Note: we use the term topology to denote the logical connectivity of the interconnects, e.g., a bus, a tree, or a mesh; when combined with a floorplan of the modules, interconnect topologies define the interconnect layout, e.g., a 4×4 array floorplan of 16 modules connected using a tree implies an H-tree interconnect layout.)
The objective of this problem is to maximize the revenue per wafer, or more specifically a metric like Y/A or GOPS/wafer, by exploring a broad range of ways of inserting redundancy in the given SoC design, while satisfying constraints on (a) packaging (die size, aspect ratio, etc.) and (b) performance (critical path latency, bandwidth, etc.). This approach for inserting redundancy should follow three principles: (1) spares must be added at appropriate levels of granularity (partially done in Chapter 4 of this dissertation), (2) spares must be shared in appropriate ways, and (3) an appropriate number of spares must be used for each component (done in Chapter 3 of this dissertation).
Bibliography
[1] M. Abramovici, M. A. Breuer and A. Friedman, Digital Systems Testing and
Testable Design. Computer Science Press, 1990.
[2] S. Adham, D. Burek, C.J. Clark, M. Collins, G. Giles, A. Sales, E.J. Marinissen, T. McLaurin, J. Monzel, F. Muradali, J. Rajoki, R. Rajsuman, M. Ricchatti, D. Stannard, J. Udell, P. Varma, L. Whetsel, A. Zamfirescu and Y. Zorian, "Preliminary Outline of the IEEE P1500 Scalable Architecture for Testing Embedded Cores," Proceedings of the 17th IEEE VLSI Test Symposium, pp. 483-488, 1999.
[3] A. Allen, D. Edenfeld, W.H. Joyner, A.B. Kahng, M. Rodgers and Y. Zorian,
“2001 technology roadmap for semiconductors,” IEEE Computer, pp. 42-53, Jan.
2002.
[4] A. Ansari, S. Gupta, S. Feng and S. Mahlke, “Maximizing spare utilization by
virtually reorganizing faulty cache lines”, IEEE Transaction on Computers, vol.
60, pp. 35-49, Jan. 2011.
[5] A. Ansari, S. Gupta, S. Feng and S. Mahlke, “StageNet: A reconfigurable fabric
for constructing dependable CMPs,” IEEE Transaction on Computers, vol. 60, pp.
5-19, Jan. 2011.
[6] N. D. Arora, K. V. Raol, R. Schumann and L. M. Richardson, "Modeling and extraction of interconnect capacitances for multilayer VLSI circuits," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 1, pp. 58-67, Jan 1996.
[7] L. M. Arzubi, “Memory System with Temporary or Permanent Substitution of
Cells for Defective Cells,” U.S. Patent 3,755,791, U.S. C1. 340/173R, Aug. 1973.
[8] F. Bower, D. Sorin and S. Ozev, "A mechanism for online diagnosis of hard faults
in microprocessors," Proceedings of 38th Annual IEEE/ACM International
Symposium on Microarchitecture, pp. 1-12, 12-16 Nov. 2005.
[9] J. C. Cha and S.K. Gupta, "Characterization of granularity and redundancy for
SRAMs for optimal yield-per-area," IEEE International Conference on Computer
Design (ICCD), pp.219-226, 12-15 Oct. 2008.
[10] J. C. Cha and S.K. Gupta, "Yield-per-Area Optimization for 6T-SRAMs Using an
Integrated Approach to Exploit Spares and ECC to Efficiently Combat High
Defect and Soft-Error Rates," 20th Asian Test Symposium (ATS), pp.126-135,
20-23 Nov. 2011.
[11] W. Che and I. Koren, "Fault Spectrum Analysis for Fast Spare Allocation in
Reconfigurable Arrays," Proceedings of the IEEE International Workshop on
Defect and Fault Tolerance in VLSI Systems, pp. 60-69, November 1992.
[12] A. Chen, “Redundancy in LSI Memory Array,” IEEE J. Solid-state Circuits SC-4,
291-293, 1967.
[13] Z. Chen and I. Koren, "Techniques for Yield Enhancement of VLSI Adders,"
Proceedings of ASAP, the International Conference on Application-Specific
Array Processors, pp. 222-229, July 1995.
[14] Y. Y. Chen and S.J. Upadhyaya, "Yield analysis of reconfigurable array
processors based on multiple-level redundancy," IEEE Transactions on
Computers , vol.42, no.9, pp.1136-1141, Sep 1993.
[15] W. T. Cheng, J.L. Lewandowski and E. Wu, “Diagnosis for wiring
interconnects,” Proceedings of International Test Conf., pp. 565-571, 1990.
[16] P. Clarke, “TSMC returns fire over 28-nm process issues”, in Electronics Design,
Strategy, News (EDN), Jan. 26, 2012.
[17] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, "Introduction to Algorithms," MIT Press, 1990 (first edition), ISBN 978-0-262-03384-8.
[18] W. B. Culbertson, et al., “The Teramac Custom Computer: Extending the Limits
with Defect Tolerance”, Proceedings of the IEEE Int’l Symposium on Defect and
Fault Tolerance in VLSI Systems, Nov. 1996.
[19] A. Dalal, P. Franzon and M. Lorenzetti, “A layout-driven yield predictor and fault
generator for VLSI” IEEE Trans. on Manufacturing Semiconductor, vol. 6, pp.
77-82, Feb. 1993.
[20] C. Demerjian, “Nvidia’s Fermi GTX480 is broken and unfixable”, in
Semiaccurate, Feb. 17, 2010.
[21] J. DeSousa and V. Agrawal, “Reducing the complexity of defect level modeling
using the clustering effect,” Proceedings of IEEE Design, Automation and Test in
Europe, pp. 640-644, March, 2000.
[22] S. Dutt and J. Hayes, “On designing and reconfiguring K-fault-tolerant tree
architectures,” IEEE Transactions on Computers, vol. 39, pp. 490403, Apr. 1990.
[23] S. Gandemer, B.C. Tremintin and J.-J. Charlot, "Critical area and critical levels
calculation in IC yield modeling," IEEE Transactions on Electron Devices,
vol.35, no.2, pp.158-166, Feb 1988.
140
[24] J. Glass, “An efficient method for improving reliability of a pipeline FFT,” IEEE
Trans. on Computers, vol. 29, no. 11, pp. 1017-1020, Nov. 1980.
[25] P. Goel and M.T. McMahon, "Electronic Chip-In-Place Test," Proceedings of
19th Design Automation Conference, pp. 482- 488, 14-16 June 1982.
[26] T. Gupta and A.H. Jayatissa, “Recent advances in nanotechnology: Key issues &
potential problem areas,” Proceedings of IEEE Conf. on Nanotechnology, vol. 2,
pp. 469 – 472, 2003.
[27] R. Gupta, R. Srinivasan and M. A. Breuer, "Partitioning and reorganization of
hierarchical circuits for DFT," Proceedings of Fourth CSI/IEEE International
Symposium on VLSI Design, pp.106-111, 4-8 Jan 1991.
[28] P. Harrod, "Testing reusable IP-a case study," Proceedings of International Test
Conference, pp.493-498, 1999.
[29] A. Hassan and V. Agarwal, “A fault-tolerant modular architecture for binary
trees,” IEEE Transactions on Computers, vol. C-35, pp. 356-361, Apr. 1986.
[30] A. Hassan, J. Rajski and V.K. Agarwal, "Testing and diagnosis of interconnects
using boundary scan architecture," Proceedings of International Test Conference,
New Frontiers in Testing, pp.126-137, 12-14 Sep 1988.
[31] C. Hess and L. Weiland, "Comparison of defect size distributions based on
electrical and optical measurement procedures," IEEE/SEMI Advanced
Semiconductor Manufacturing Conference and Workshop, pp.277-282, 10-12 Sep
1997.
[32] R. Hetherington, “OpenSPARC T1 & T2 overview”,
http://www.opensparc.net/pubs/preszo/09/brussels/04_RH_OpenSPARC_T1T2.p
df. April, 2012.
[33] L. Huang and Q. Xu, "Characterizing the lifetime reliability of many core
processors with core-level redundancy," IEEE/ACM International Conference on
Computer-Aided Design (ICCAD), pp.680-685, 7-11 Nov. 2010.
[34] L. Huang and Q. Xu, “Lifetime Reliability for Load-Sharing Redundant Systems
with Arbitrary Failure Distributions”, IEEE Transactions on Reliability, pp. 319-
330, VOL. 59, NO. 2, JUNE 2010.
[35] N. Jarwala and C.W. Yau, "A new framework for analyzing test generation and
diagnosis algorithms for wiring interconnects," Proceedings of International Test
Conference, Meeting the Tests of Time., pp.63-70, 29-31 Aug 1989.
141
[36] D. Jewett, “Integrity S2: A Fault-Tolerant UNIX Platform”, Proceedings of the
21st Int’l Symposium on Fault-Tolerant Computing Systems, pages 512–519,
June 1991.
[37] N. Jha and S. Gupta, Testing of Digital Systems, Cambridge, U.K., Cambridge
Univ. Press, 2003.
[38] S. M. Kang and Y. Leblebici, “CMOS Digital Integrated Circuits Analysis &
Design”, McGraw-Hill, Inc. New York, USA 1996.
[39] G. Karypis and V. Kumar, “hMetis, A Hypergraph Partitioning Package, Version
1.5.3”, University of Minnesota, Department of Computer Science &
Engineering, Nov. 22, 1998.
[40] W.H. Kautz, “Testing for Faults in Wiring Networks”, IEEE Trans. on
Computers, Vol. C-23, No. 4, pp. 358-363, April 1974.
[41] J. Khare, W. Maly, M. Thomas, "Extraction of defect size distributions in an IC
layer using test structure data," IEEE Transactions on Semiconductor
Manufacturing, vol.7, no.3, pp.354-368, Aug 1994.
[42] V. Kim, and T. Chen, "SRAM yield estimation in the early stage of the design
cycle," Proceedings of International Workshop on Memory Technology,
Proceedings of Design and Testing, pp.21-26, 11-12 Aug 1997.
[43] I. Kim, Y. Zorian, G. Komoriya, H. Pham, F. P. Higgins, and J. L. Lweandowski,
“Built in self repair for embedded high density SRAM”, Proceedings of Int. Test
Conf. (ITC), pp. 1112–1119, 1998.
[44] Z. Koren and I. Koren, “A model for enhanced manufacturability of defect
tolerant integrated circuits,” International Workshop on Defect and Fault
Tolerance in VLSI Systems, pp. 81-92, Nov. 1991.
[45] I. Koren and C. Krishna, Fault Tolerant Systems, Morgan Kaufmann Publisher,
2007.
[46] I. Koren and M. Breuer, "On Area and Yield Considerations for Fault-Tolerant
VLSI Processor Arrays," IEEE Trans. on Comp., Vol. C-33, pp. 21-27, Jan. 1984.
[47] I. Koren and D. Pradhan, "Introducing Redundancy into VLSI Designs for Yield
and Performance Enhancement," Proceedings of the 15th International.
Symposium on Fault-Tolerant Computing, pp. 330-335, June 1985.
142
[48] I. Koren and D. Pradhan, "Modeling the Effect of Redundancy on Yield and
Performance of VLSI Systems," IEEE Trans. on Comp., Vol. C-36, pp. 344-355,
Mar. 1987.
[49] I. Koren and D. Pradhan, "Yield and Performance Enhancement Through
Redundancy in VLSI and WSI Multi-processor Systems," Proceedings of IEEE,
Special Issue on Fault-Tolerance in VLSI, Vol. 74, No. 5, pp. 699-711, May 1986
[50] I. Koren and A. Singh, “Fault tolerance in VLSI Circuits,” IEEE Computer, vol.
23, `pp. 73-83, July 1990.
[51] M. Kuboschek, H. J. Iden, U. Jagau and J. Otte, “Implementation of a defect
tolerant large area monolithic multiprocessor system,” International Conf. Wafer
Scale Integration, pp. 28-34, 1992.
[52] S. Kundu, "On diagnosis of faults in a scan-chain," in 11th VLSI Test
Symposium, pp.303-308, 6-8 Apr 1993.
[53] S. Kung, S. Jean, and C. Chang, “Fault-tolerant array processors using single-
track switches,” IEEE Transactions on Computers, vol. 38, pp. 501-514, Apr.
1989. vol. 38, pp. 547-554, Apr. 1989.
[54] W. Kuo and T. Kim, “An overview of manufacturing yield and reliability
modeling for semiconductor products,” Proceedings of the IEEE, vol. 87, no. 8,
pp. 1329-1344, Aug. 1999.
[55] B. Landman and R. L. Russo, On a Pin Versus Block Relationship For Partitions
of Logic Graphs, IEEE Trans. on Computer, col. C-20, pp. 1469-1479, 1971.
[56] R. Leachman and C. Berglund, “Systematic mechanisms limited yield (SMLY)
study,” International SEMATECH, DOC #03034383A-ENG, March 2003.
[57] R. Leveugle, Z. Koren, I. Koren, G. Saucier and N. Wehn, “The hyeti defect
tolerant microprocessor: a practical experiment and its cost-effectiveness
analysis,” IEEE Trans. on Computers, vol. 43, no.12, pp.1398-1406, Dec. 1994.
[58] X. Li, "Rethinking memory redundancy: Optimal bit cell repair for maximum-
information storage," 48th Design Automation Conference (DAC), pp.316-321, 5-
9 June 2011.
[59] J. Lien and M. Breuer, "MAXIMAL DIAGNOSIS FOR WIRING NETWORKS,"
Proceedings of International Test Conference, pp.96-105, 26-30 Oct 1991.
143
[60] T. Makarova and F. Palacio, “Carbon-Based Magnetism: An Overview of the
Magnetism of Metal Free Carbon-based Compounds and Materials”, Elsevier
Science, 2006.
[61] T. Mano and M. Wada and N. Ieda, M. Tanimoto, ”A Redundancy Circuit for a
Fault-Tolerant 256K MOS RAM”, IEEE Journal of Solid State Circuits, vol.SC-
17, no.4, pp.726-731,1982
[62] E. Marinissen, R. Arendsen, G. Bos, H. Dingemanse, M. Lousberg, C. Wouters,
"A structured and scalable mechanism for test access to embedded reusable
cores," Proceedings of International Test Conference, pp.284-293, 18-23 Oct
1998.
[63] N. Metropolis and S. Ulam, "The Monte Carlo Method." Journal of the American
Statistical Association, 44, 335-341, 1949.
[64] N. Mingo, D. A. Stewart, D. A. Broido, and D. Srivastava, "Phonon transmission
through defects in nanotubes from first principles", Physical Review B 77,
033418, 2008.
[65] M. Mirza-Aghatabar, M.A. Breuer and S.K. Gupta, "SIRUP: Switch Insertion in
RedUndant Pipeline Structures for Yield and Yield/Area Improvement,"
Proceedings of Asian Test Symposium, pp. 193-199, Nov. 2009.
[66] M. Mirza-Aghatabar, M.A. Breuer and S.K. Gupta, “Algorithms to maximize
yield and enhance yield/area of pipeline circuitry by insertion of switches and
redundant modules”, Proceedings of Design, Automation, and Test in Europe, pp.
1249-1254, Mar. 2010.
[67] M. Mirza-Aghatabar, M.A. Breuer and S.K. Gupta, “HYPER: a Heuristic for
Yield/area imProvEment using Redundancy in SoC”, Proceedings of Asian Test
Symposium, pp. 249-254, Dec. 2010.
[68] M. Mirza-Aghatabar, M. Breuer, S. Gupta and S. Nazarian, “Theory of
Redundancy for Logic Circuits to Maximize Yield/Area“,to appear in
International Symposium on Quality Electronic Design (ISQED), March 2012.
[69] H. Parks and E. Burke, "The nature of defect size distributions in semiconductor
processes," IEEE/SEMI International Semiconductor Manufacturing Science
Symposium (ISMSS), pp.66, 22-24 May 1989.
[70] C. Poh, S. Bhattacharya, J. Ferguson, J. D. Cressler, J. Papapolymerou,
"Extraction of a lumped element, equivalent circuit model for via interconnections
in 3-D packages using a single via structure with embedded capacitors,"
144
Proceedings of 60
th
Electronic Components and Technology Conference (ECTC),
pp.1783-1788, 1-4 June 2010.
[71] S. Popli and M. Bayoumi, “A reconfigurable VLSI array for reliability and yield
enhancement,” International Conf on Systolic Arrays, pp. 631-642, May 1988.
[72] B. Romanescu and D. Sorin, “Core cannibalization architecture: improving
lifetime chip performance for multicore processors in the presence of hard faults,”
International Conf. on Parallel Architectures and Compilation Techniques, pp. 43-
51, Oct. 2008.
[73] D. Rossi, N. Timoncini, M. Spica and C. Metra, "Error correcting code analysis
for cache memory high reliability and performance," Design, Automation & Test
in Europe Conference & Exhibition (DATE), 2011 , pp.1-6, 14-18 March 2011
[74] V. Schober, S. Paul and O. Picot, ”Memory Built-In Self-Repair using redundant
word”, Proceedings of IEEE International Test Conference (ITC), pp.995- 1001,
2001
[75] E. Schuchman and T.N. Vijaykumar, "Rescue: a microarchitecture for testability
and defect tolerance," Proceeding of 32nd Int’l Symposium on Computer
Architecture, pp. 160- 171, 4-8 June 2005.
[76] S. E. Schuster, “Multiple Word/Bit Line Redundancy for Semiconductor
Memories,” IEEE J. Solid-state Circuits SC-13,698-703, 1978.
[77] S. Shamshiri, P. Lisherness, S.-J. Pan, and K.-T. Tim Cheng, “A cost analysis
framework for multi-core systems with spares,” Proceedings of IEEE
International Test Conference (ITC), pp.1-8, 28-30 Oct. 2008.
[78] J. Sheaffer, D. Luebke, and K. Skadron, “A Hardware Redundancy and Recovery
Mechanism for Reliable Scientific Computation on Graphics Processors”,
Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on
Graphics hardware, pp. 55–64, Aug 4-5, 2007.
[79] P. Shivakumar, S.W. Keckler, C.R. Moore and D. Burger, "Exploiting
microarchitectural redundancy for defect tolerance," Proceedings of International
Conf. on Computer Design, pp. 481- 488, Oct. 13-15, 2003.
[80] S. Shoukourian et al., “SoC yield optimization via an embedded-memory test and
repair infrastructure,” IEEE Design & Test of Computers, pp. 200-207,
May 2004.
145
[81] A. Singh, “Interstitial redundancy: An area efficient fault tolerance scheme for
large area VLSI processor arrays,” IEEE Transactions on Computers, vol. 37, pp.
1398-1410, Nov. 1988.
[82] L. Spainhower et al., IBM S/390 Parallel Enterprise Server G5 Fault Tolerance: A
Historical Perspective. In IBM Journal of R&D, September/November 1999.
[83] L. Spainhower and T. A. Gregg. IBM S/390 Parallel Enterprise Server G5 Fault
Tolerance: A Historical Perspective. IBM Journal of Research and Development,
43(5/6), September/November 1999.
[84] J. Srinivasan, S. Adve, B. Pradip and J. Rivers, "Exploiting structural duplication
for lifetime reliability enhancement," Proceedings of International Symposium on
Computer Architecture, pp. 520- 531, June 4-8, 2005.
[85] J. Srinivasan, S. Adve, P. Bose and J. Rivers, "Lifetime reliability: toward an
architectural solution," IEEE Micro, vol.25, no.3, pp. 70- 80, May-June 2005.
[86] J. Srinivasan, S. Adve, P. Bose and J. Rivers, "The case for lifetime reliability-
aware microprocessors," Proceedings of 31st Annual International Symposium on
Computer Architecture, pp. 276- 287, 19-23 June 2004.
[87] J. Srinivasan, S. Adve, P. Bose and J. Rivers, "The Impact of Scaling on
Processor Lifetime Reliability”, Proceedings of the Intl. Conf. on Dependable
Systems and Networks, 2004.
[88] C. Stapper, F. Armstrong, K. Saji, "Integrated circuit yield statistics," Proceedings
of the IEEE , vol.71, no.4, pp. 453- 470, April 1983.
[89] C. Stapper, "Modeling of Integrated Circuit Defect Sensitivities," IBM Journal of
Research and Development , vol.27, no.6, pp.549-557, Nov. 1983
[90] C.L. Su, Y.T. Yeh and C.W. Wu, "An integrated ECC and redundancy repair
scheme for memory reliability enhancement," 20th IEEE International
Symposium on Defect and Fault Tolerance in VLSI Systems, pp. 81- 89, 3-5 Oct.
2005
[91] W.J. Tee, M.P.-L Ooi, Y.C. Kuang and C. Chan, "Defect cluster segmentation for
CMOS fabricated wafers," Innovative Technologies in Intelligent Systems and
Industrial Applications, (CITISIA), pp.134-138, July 2009.
[92] P. Varma, B. Bhatia, "A structured test re-use methodology for core-based system
chips," Proceedings of International Test Conference, pp.294-302, 18-23 Oct
1998.
146
[93] M. Wang, M. Cutler, and S. Su, “Reconfiguration of VLSI/WSI mesh array
processors with two-level redundancy,” IEEE Transactions on Computers, pp.
547-554, vol. 38, no. 4, April 1989.
[94] C. Webb, “45nm Design for Manufacturing”, Intel Technology Journal, Vol. 12,
Issue 02, June 17, 2008.
[95] N. Wehn, M. Glesner, K. Caesar, P. Mann and A. Roth, “A defect tolerant and
fully testable PLA,” Proceedings of Design Automation Conf., pp. 22-27, 1988.
[96] C. Wickman, D. G. Elliott, and B. F. Cockburn, “Cost model for large file
memory DRAMs with ECC and bad block marking”, Proceedings of IEEE Int.
Symposium Defect and Fault Tolerance in VLSI Systems (DFT), p.319–327,
Nov. 1999.
[97] D. Wilson. The Stratus Computer System. In Resilient Computer Systems, pp.
208–231, 1985.
[98] L. Wilson, project manager of ITRS Roadmap, http://www.itrs.net/, Feb. 2012.
[99] R. Xiaojun, Z. Jinyi, C. Xing, L. Jiao, "An Bidirectional IP Wrapper Design for
SoC DFT," High Density Microsystem Design and Packaging and Component
Failure Analysis, 2005 Conference on , vol., no., pp.1-5, 27-29 June 2005.
[100] T. Yamagata, H. Sato, K. Fujita, Y. Nishmura and K. Anami, ``A Distributed
Globally Replaceable Redundancy Scheme for Sub-Half-micron ULSI Memories
and Beyond," IEEE J. of Solid-State Circuits, vol. 31, pp. 195-201, 1996.
[101] J. Yao, H. Shimada and K. Kobayashi, “A stage-level recovery scheme in scalable
pipeline modules for high dependability,” International Workshop on Innovative
Architecture for Future Generation High-Performance Processors and Systems,
March 2009.
[102] C.W. Yau and N. Jarwala, “A unified theory for designing optimal test generation
and diagnosis algorithms for board interconnects,” Proceedings of International
Test Conf., pp. 71-77, 1989.
[103] K. Yi, S.Y. Cheng, Y.H. Park, F. Kurdahi and A. Eltawil, “An alternative
organization of defect map for defect-resilient embedded on-chip memories”,
Asia-Pacific Computer Systems Architecture Conf. (ACSAC), LNCS 4697, pp.
102–113, 2007.
[104] L. Zhang, Y. Han, Q. Xu, X. Li and H. Li, "On Topology Reconfiguration for
Defect-Tolerant NoC-Based Homogeneous Manycore Systems," IEEE
147
Transactions on Very Large Scale Integration (VLSI) Systems, , vol.17, no.9,
pp.1173-1186, Sept. 2009
[105] Y. Zorian, E. J. Marinissen and S. Dey, "Testing embedded-core-based system
chips," Computer , vol.32, no.6, pp.52-60, Jun 1999.
Abstract
Reduced scaling of feature sizes and process variations in CMOS nano-technologies introduce manufacturing anomalies that reduce yield, and this trend is predicted to worsen in emerging technologies. In addition, these issues take longer to resolve than they did in previous technologies. It is therefore increasingly crucial to develop design techniques that enhance yield in emerging technologies. While logic circuits, namely gates and flip-flops, occupy a small fraction of chip area, they are more critical than memories because their irregular structure makes it difficult to improve their yield. Moreover, logic circuitry contains many single points of failure, so any killer defect in this circuitry can turn a die into scrap. These facts motivate a highly efficient architectural design methodology for logic circuits based on redundancy.

In this dissertation we use redundancy in logic circuits to improve silicon yield/area (a.k.a. revenue per wafer). While most of the traditional techniques use redundancy at the core level
Asset Metadata
Creator: Mirza-Aghatabar Ahangar, Mohammad (author)
Core Title: Redundancy driven design of logic circuits for yield/area maximization in emerging technologies
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Engineering
Publication Date: 05/01/2012
Defense Date: 03/27/2012
Publisher: University of Southern California (original); University of Southern California. Libraries (digital)
Tags: algorithms; emerging technologies; logic circuits; redundancy; theorems; yield/area
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Breuer, Melvin A. (committee chair); Draper, Jeffrey (committee member); Gupta, Sandeep K. (committee member); Medvidović, Nenad (committee member); Pedram, Massoud (committee member)
Creator Email: m.aghatabar@gmail.com; mirzaagh@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c3-20387
Unique Identifier: UC11290043
Identifier: usctheses-c3-20387 (legacy record id)
Legacy Identifier: etd-MirzaAghat-702.pdf
Document Type: Dissertation
Rights: Mirza-Aghatabar Ahangar, Mohammad
Type: texts
Source: University of Southern California (contributing entity); University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA